Ha Khanh Nguyen (hknguyen)
len() is a built-in function that returns the length of a string, a list, etc.
read_csv() is NOT built-in and can only be called after importing the pandas library.
# import a library/module
import pandas
# call read_csv() function
pandas.read_csv('https://stat107.hknguyen.org/files/datasets/rivers.csv')
 | length |
---|---|
0 | 735 |
1 | 320 |
2 | 325 |
3 | 392 |
4 | 524 |
... | ... |
136 | 720 |
137 | 270 |
138 | 430 |
139 | 671 |
140 | 1770 |
141 rows × 1 columns
# import pandas but now call it pd
import pandas as pd
# call read_csv() function
pd.read_csv('https://stat107.hknguyen.org/files/datasets/rivers.csv')
 | length |
---|---|
0 | 735 |
1 | 320 |
2 | 325 |
3 | 392 |
4 | 524 |
... | ... |
136 | 720 |
137 | 270 |
138 | 430 |
139 | 671 |
140 | 1770 |
141 rows × 1 columns
# not recommended for pandas!
from pandas import read_csv
read_csv('https://stat107.hknguyen.org/files/datasets/rivers.csv')
 | length |
---|---|
0 | 735 |
1 | 320 |
2 | 325 |
3 | 392 |
4 | 524 |
... | ... |
136 | 720 |
137 | 270 |
138 | 430 |
139 | 671 |
140 | 1770 |
141 rows × 1 columns
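The three import forms above are not specific to pandas. As a minimal sketch, the same pattern is shown here with the standard-library math module (chosen only so the example runs without downloading anything); the variable name same_function is made up for illustration.

```python
# Three ways to import, shown with the standard-library math module.
import math                 # 1) import the module under its own name
import math as m            # 2) import the module under a shorter alias
from math import sqrt       # 3) import a single function directly

# Each form gives a different way to call the same function.
print(math.sqrt(16))        # call through the module name
print(m.sqrt(16))           # call through the alias
print(sqrt(16))             # call the bare name

# All three names refer to the same function object.
same_function = (math.sqrt is m.sqrt) and (m.sqrt is sqrt)
print(same_function)
```

With the third form, the call no longer shows which library the function comes from, which is one commonly cited reason to prefer pd.read_csv() over a bare read_csv().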
"pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language." -pandas.pydata.org
csv stands for comma-separated values. This means that the values in each row are separated by commas.

Function | Description |
---|---|
read_csv | Load delimited data from a file, URL, or file-like object; use comma as default delimiter |
read_table | Load delimited data from a file, URL, or file-like object; use tab ('\t') as default delimiter |
read_excel | Read tabular data from an Excel XLS or XLSX file |
read_json | Read data from a JSON (JavaScript Object Notation) string representation |
read_sql | Read the results of a SQL query (using SQLAlchemy) as a pandas DataFrame |
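To see the difference between the default delimiters of read_csv and read_table, here is a small sketch that reads the same made-up data twice, once written with commas and once with tabs. The text is passed through io.StringIO (a file-like object) so no file or URL is needed; the column names and values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical two-row dataset, written once with commas and once with tabs.
comma_text = "name,length\nA,735\nB,320\n"
tab_text = "name\tlength\nA\t735\nB\t320\n"

# read_csv splits on commas by default.
from_commas = pd.read_csv(io.StringIO(comma_text))

# read_table splits on tabs by default.
from_tabs = pd.read_table(io.StringIO(tab_text))

# read_csv can also read tab-separated text when the delimiter is given.
from_tabs_via_csv = pd.read_csv(io.StringIO(tab_text), sep="\t")

print(from_commas)
print(from_commas.equals(from_tabs))
print(from_commas.equals(from_tabs_via_csv))
```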
We will mostly use read_csv(), but we never know, we might practice with read_json() and read_excel() at the end of the course.
The rivers.csv file we used in part 1 was loaded into Python using a URL.
rivers = pd.read_csv('https://stat107.hknguyen.org/files/datasets/rivers.csv')
The rivers variable now references an object of type DataFrame (people also say pandas DataFrame). What is a DataFrame? Well, it's a table with named columns!
rivers has only 1 column, which is named length.
coffee_sleep = pd.read_csv('https://stat107.hknguyen.org/files/datasets/coffee-sleep.csv')
coffee_sleep
 | coffee | sleep | level |
---|---|---|---|
0 | 9.2 | 6.2 | g |
1 | 6.4 | 8.1 | u |
2 | 9.0 | 5.8 | u |
3 | 6.7 | 10.8 | g |
4 | 4.7 | 8.8 | u |
... | ... | ... | ... |
95 | 0.0 | 9.0 | u |
96 | 0.0 | 9.4 | u |
97 | 0.0 | 10.7 | g |
98 | 0.0 | 7.6 | u |
99 | 0.0 | 10.7 | u |
100 rows × 3 columns
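The "rows × columns" summary printed under a DataFrame can also be obtained in code. The sketch below builds a tiny DataFrame by hand, copying the first three rows of the output above, so it runs without downloading anything; the variable name small is made up for illustration.

```python
import pandas as pd

# Stand-in built from the first three rows of the coffee_sleep output.
small = pd.DataFrame({
    "coffee": [9.2, 6.4, 9.0],
    "sleep": [6.2, 8.1, 5.8],
    "level": ["g", "u", "u"],
})

print(type(small))          # the DataFrame class
print(small.shape)          # (number of rows, number of columns)
print(list(small.columns))  # the column names
print(len(small))           # the built-in len() counts the rows
```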
pd.read_csv("path to the file")
pd.read_csv("filename")
coffee-sleep.csv
jupyter notebook and press Enter.
coffee_sleep
 | coffee | sleep | level |
---|---|---|---|
0 | 9.2 | 6.2 | g |
1 | 6.4 | 8.1 | u |
2 | 9.0 | 5.8 | u |
3 | 6.7 | 10.8 | g |
4 | 4.7 | 8.8 | u |
... | ... | ... | ... |
95 | 0.0 | 9.0 | u |
96 | 0.0 | 9.4 | u |
97 | 0.0 | 10.7 | g |
98 | 0.0 | 7.6 | u |
99 | 0.0 | 10.7 | u |
100 rows × 3 columns
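Reading a file stored on your computer works the same way as reading from a URL. As a rough sketch, the code below writes a tiny made-up file called example.csv (a hypothetical name) into a temporary folder, then reads it once by its full path and once by filename only. The filename-only form works here because the code first moves into the folder that contains the file.

```python
import os
import tempfile
import pandas as pd

# Create a small CSV file in a temporary folder (hypothetical contents).
folder = tempfile.mkdtemp()
path = os.path.join(folder, "example.csv")
with open(path, "w") as f:
    f.write("coffee,sleep,level\n9.2,6.2,g\n6.4,8.1,u\n")

# Read it using the full path to the file.
by_path = pd.read_csv(path)

# Read it using just the filename, which works when the file sits in
# the current working directory.
previous_dir = os.getcwd()
os.chdir(folder)
try:
    by_name = pd.read_csv("example.csv")
finally:
    os.chdir(previous_dir)

print(by_path.equals(by_name))
```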
'coffee': daily average coffee consumption of a student in STAT 430 (in oz)
'sleep': daily average amount of sleep (in hours)
'level': whether the student is a graduate ('g') or undergraduate ('u') student
coffee_sleep['coffee']
0     9.2
1     6.4
2     9.0
3     6.7
4     4.7
     ...
95    0.0
96    0.0
97    0.0
98    0.0
99    0.0
Name: coffee, Length: 100, dtype: float64
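Selecting one column with square brackets gives back a pandas Series rather than a DataFrame, which is why the output above ends with a Name and a dtype. A minimal sketch, using a hand-made stand-in with values copied from the first three rows of coffee_sleep (the names small, coffee and both are made up):

```python
import pandas as pd

# Stand-in with two of the columns from coffee_sleep.
small = pd.DataFrame({
    "coffee": [9.2, 6.4, 9.0],
    "sleep": [6.2, 8.1, 5.8],
})

coffee = small["coffee"]            # one column name gives a Series
both = small[["coffee", "sleep"]]   # a list of names gives a DataFrame

print(type(coffee))
print(coffee.name, coffee.dtype)
print(type(both))
```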
# compute mean
coffee_sleep['sleep'].mean()
7.942
# compute median
coffee_sleep['sleep'].median()
8.0
# get all summary statistics
coffee_sleep['sleep'].describe()
count    100.000000
mean       7.942000
std        1.766648
min        3.300000
25%        6.675000
50%        8.000000
75%        9.325000
max       11.500000
Name: sleep, dtype: float64
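To check what these summary numbers mean, the sketch below compares them with Python's standard-library statistics module on five made-up sleep values (not the course dataset). It also shows that the '50%' row of describe() is the median, and that 'std' is the sample standard deviation.

```python
import statistics
import pandas as pd

# Hypothetical sleep values, not the course dataset.
values = [6.0, 7.5, 8.0, 9.0, 10.5]
sleep = pd.Series(values, name="sleep")

summary = sleep.describe()

# mean() and median() agree with the statistics module.
print(sleep.mean(), statistics.mean(values))
print(sleep.median(), statistics.median(values))

# describe() reports the same numbers under the labels 'mean' and '50%'.
print(summary["mean"], summary["50%"])

# 'std' is the sample standard deviation (divides by n - 1).
print(summary["std"], statistics.stdev(values))
```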
# count how many students are at each level
coffee_sleep['level'].value_counts()
u    59
g    41
Name: level, dtype: int64
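value_counts() is meant for categorical columns like 'level', where a mean or median would not make sense. As a small sketch on made-up labels (not the course dataset), the code below counts each value, shows the optional normalize=True argument that returns proportions instead of counts, and checks the result against a plain Python tally.

```python
from collections import Counter
import pandas as pd

# Hypothetical labels, not the course dataset.
level = pd.Series(["u", "g", "u", "u", "g"], name="level")

counts = level.value_counts()                 # counts, most frequent first
shares = level.value_counts(normalize=True)   # proportions instead of counts

print(counts)
print(shares)

# The counts agree with a plain Python tally of the same values.
print(counts.to_dict() == dict(Counter(level)))
```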