1. What is a DataFrame?

import pandas as pd

ramen = pd.read_csv('https://stat107.hknguyen.org/files/datasets/clean-ramen.csv')
ramen
##                Brand  ... Stars
## 0          New Touch  ...  3.75
## 1           Just Way  ...  1.00
## 2             Nissin  ...  2.25
## 3            Wei Lih  ...  2.75
## 4     Ching's Secret  ...  3.75
## ...              ...  ...   ...
## 2570           Vifon  ...  3.50
## 2571         Wai Wai  ...  1.00
## 2572         Wai Wai  ...  2.00
## 2573         Wai Wai  ...  2.00
## 2574        Westbrae  ...  0.50
## 
## [2575 rows x 5 columns]
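
As a rough sketch of what read_csv() produces, the snippet below parses a tiny in-memory CSV instead of the course URL. The three Brand/Stars pairs are copied from the output above; the names csv_text and mini are made up for illustration.

```python
import io
import pandas as pd

# A tiny CSV held in a string (a stand-in for the file at the course URL).
csv_text = "Brand,Stars\nNew Touch,3.75\nJust Way,1.00\nNissin,2.25\n"

# read_csv accepts a file-like object as well as a path or URL.
mini = pd.read_csv(io.StringIO(csv_text))

print(mini.shape)          # (number of rows, number of columns)
print(list(mini.columns))  # column names taken from the header line
```

The shape tuple carries the same information as the "[rows x columns]" footer that pandas prints under a large DataFrame.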

1.1 Exploratory Data Analysis (EDA): Summary Statistics

  • Use the describe() method of a DataFrame to get summary statistics for its numeric variables.
  • Syntax: dataframe_name.describe().
ramen.describe()
##              Stars
## count  2575.000000
## mean      3.654893
## std       1.015641
## min       0.000000
## 25%       3.250000
## 50%       3.750000
## 75%       4.250000
## max       5.000000
  • We also want to know the data type of each variable!
ramen.dtypes
## Brand       object
## Variety     object
## Style       object
## Country     object
## Stars      float64
## dtype: object
  • You might wonder: How many different countries produce ramen? How many styles of ramen are there? And how many different brands?
    • Use the unique() function to find out!
ramen['Brand'].unique()
  • Well… That is a lot to count! We can get the number of distinct values with len() (the result of unique() is an array rather than a list, but len() works on it the same way):
len(ramen['Brand'].unique())
## 355
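
Counting distinct values is common enough that pandas also offers nunique(), which gives the same number as len() of unique() when the column has no missing values. A minimal sketch on a made-up toy frame (the values are illustrative, not taken from the ramen dataset):

```python
import pandas as pd

# Hypothetical toy data, just to illustrate the calls.
toy = pd.DataFrame({
    "Brand": ["Nissin", "Vifon", "Nissin", "Wai Wai"],
    "Stars": [2.25, 3.50, 4.00, 1.00],
})

distinct = toy["Brand"].unique()       # array of the distinct values
n_distinct = toy["Brand"].nunique()    # number of distinct values (missing values excluded)
counts = toy["Brand"].value_counts()   # how often each value occurs

print(len(distinct), n_distinct)
print(counts)
```

value_counts() is a natural next step once you know how many distinct values there are, since it also shows which ones dominate.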

1.2 Exploratory Data Analysis (EDA): Plotting

Histogram

  • Let’s investigate the histogram of the Stars variable:
  • The method of plotting we use here is slightly different from what we learned last week.
  • This new method is convenient if what you want to plot is a column of a DataFrame!
ramen['Stars'].hist()

  • To add axis labels and a title, we still need to import the matplotlib module:
import matplotlib.pyplot as plt

ramen['Stars'].hist(color = 'darkorange')

plt.xlabel('Rating (0-5)')
plt.ylabel('Count')
plt.title('Histogram of Ramen Ratings')
plt.show()

  • Another way to plot a histogram is as follows:
ramen.hist(column='Stars')
plt.show()

  • This method is convenient when we want to plot histograms for multiple columns at the same time!
iris = pd.read_csv('https://stat107.hknguyen.org/files/datasets/iris.csv')
iris.hist(column=['Sepal.Length', 'Sepal.Width'])
plt.show()
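
To see the numbers behind the picture, the bin counts can be computed directly. This sketch assumes NumPy's np.histogram with 10 equal-width bins, which should correspond to the default bins=10 of hist(); the ratings are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings on the 0-5 scale.
stars = pd.Series([0.0, 1.0, 2.5, 3.5, 3.75, 3.75, 4.0, 4.25, 5.0, 5.0])

# 10 equal-width bins between the minimum and the maximum value.
counts, edges = np.histogram(stars, bins=10)

# Each bin includes its left edge; only the last bin also includes its right edge.
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {n}")
```

Each printed count is the height of one bar in the corresponding histogram.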

Boxplot

  • Just like hist(), a pandas DataFrame also has a boxplot() method.
ramen.boxplot(column='Stars')

  • But note that the following does NOT work, because a Series (a single column) has no boxplot() method:
ramen['Stars'].boxplot()
## Error in py_call_impl(callable, dots$args, dots$keywords): AttributeError: 'Series' object has no attribute 'boxplot'
## 
## Detailed traceback: 
##   File "<string>", line 1, in <module>
##   File "/Users/hknguyen/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/pandas/core/generic.py", line 5136, in __getattr__
##     return object.__getattribute__(self, name)
  • Again, this way of drawing boxplots is very helpful when plotting multiple columns at the same time!
iris.boxplot(column=['Sepal.Length', 'Sepal.Width'])
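
The box in a boxplot is drawn from the quartiles, so the same information can be read off numerically. A minimal sketch, assuming the common convention (matplotlib's default) that whiskers reach at most 1.5 × IQR beyond the box; the data are made up.

```python
import pandas as pd

# Hypothetical ratings, including one unusually low value.
stars = pd.Series([0.5, 3.0, 3.25, 3.5, 3.75, 4.0, 4.25, 4.5, 5.0])

# Lower quartile, median, and upper quartile: the three lines of the box.
q1, median, q3 = stars.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond these fences are drawn as individual outlier markers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = stars[(stars < lower_fence) | (stars > upper_fence)]

print(q1, median, q3)
print(outliers.tolist())
```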


2. Selecting Rows/Columns of a DataFrame

2.1 Using []

  • We already learned one way of selecting a column of a DataFrame in the previous lecture.
ramen['Brand']
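  • Another way to do this is attribute (dot) access, which works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute: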
ramen.Brand
  • To select multiple columns at the same time:
ramen[['Brand', 'Stars']]
##                Brand  Stars
## 0          New Touch   3.75
## 1           Just Way   1.00
## 2             Nissin   2.25
## 3            Wei Lih   2.75
## 4     Ching's Secret   3.75
## ...              ...    ...
## 2570           Vifon   3.50
## 2571         Wai Wai   1.00
## 2572         Wai Wai   2.00
## 2573         Wai Wai   2.00
## 2574        Westbrae   0.50
## 
## [2575 rows x 2 columns]
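
The kind of object returned by [] depends on what goes inside the brackets: a single label gives a Series, while a list of labels gives a DataFrame. A small sketch on a made-up frame ("Unit Price" is an invented column name, used only to show where dot access falls short):

```python
import pandas as pd

# Hypothetical toy frame.
toy = pd.DataFrame({
    "Brand": ["Nissin", "Vifon"],
    "Unit Price": [1.2, 0.8],
})

one_col = toy["Brand"]                 # single label   -> Series
as_frame = toy[["Brand"]]              # list of labels -> DataFrame with one column
same = toy.Brand.equals(toy["Brand"])  # dot access returns the same Series

# Dot access cannot express a name containing a space; [] can.
price = toy["Unit Price"]

print(type(one_col).__name__, type(as_frame).__name__, same)
```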

2.2 Using iloc

  • iloc is short for integer-location based indexing.
  • iloc syntax is:
dataframe_name.iloc[<row index>, <column index>]
  • To select the Stars rating of the first observation (index 0):
ramen.iloc[0, 4]
## 3.75
  • To select multiple columns of the same row/observation:
ramen.iloc[0, 1:5]
## Variety    T's Restaurant Tantanmen 
## Style                            Cup
## Country                        Japan
## Stars                           3.75
## Name: 0, dtype: object
  • To select ALL columns of one row:
ramen.iloc[0, :]
## Brand                      New Touch
## Variety    T's Restaurant Tantanmen 
## Style                            Cup
## Country                        Japan
## Stars                           3.75
## Name: 0, dtype: object
  • Similarly, we can select multiple rows of the same column:
    • The following code segment returns the Stars ratings of the first 5 observations in this dataset.
ramen.iloc[0:5, 4]
## 0    3.75
## 1    1.00
## 2    2.25
## 3    2.75
## 4    3.75
## Name: Stars, dtype: float64
  • We can also select multiple rows and multiple columns at the same time:
ramen.iloc[[0, 5, 10, 15], :]
##             Brand                           Variety Style      Country  Stars
## 0       New Touch         T's Restaurant Tantanmen    Cup        Japan   3.75
## 5   Samyang Foods            Kimchi song Song Ramen  Pack  South Korea   4.75
## 10    Tao Kae Noi       Creamy tom Yum Kung Flavour  Pack     Thailand   5.00
## 15           KOKA  Mushroom Flavour Instant Noodles   Cup    Singapore   3.50
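
The iloc patterns above follow ordinary Python positional rules: slices stop before the end position, and negative positions count from the end. A minimal sketch on a made-up four-row frame:

```python
import pandas as pd

# Hypothetical toy frame with placeholder brand names.
toy = pd.DataFrame({
    "Brand": ["A", "B", "C", "D"],
    "Style": ["Cup", "Pack", "Pack", "Bowl"],
    "Stars": [3.75, 1.00, 2.25, 2.75],
})

first_two = toy.iloc[0:2, :]       # rows 0 and 1 (position 2 is excluded)
corner = toy.iloc[[0, 3], [0, 2]]  # rows 0 and 3, columns 0 and 2
last_row = toy.iloc[-1, :]         # negative positions count from the end

print(first_two)
print(corner)
print(last_row["Brand"])
```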

Exercise

What would ramen.iloc[:, :] return?

2.3 Using loc

  • The i in iloc stands for integer. That is why with iloc, we always use numbers for indexing.
  • With loc, we use labels (names) or a boolean array/list for indexing instead.
  • For our dataset, the rows are labeled by integers, so we still use integers as our row labels.
ramen.loc[0,'Stars']
## 3.75
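
The difference between labels and positions is easier to see when the row labels are not integers. The sketch below uses a made-up frame indexed by strings; it also shows that a loc slice includes both endpoints, whereas an iloc slice excludes the stop position.

```python
import pandas as pd

# Hypothetical frame whose rows are labeled by strings instead of integers.
toy = pd.DataFrame(
    {"Style": ["Cup", "Pack", "Bowl"], "Stars": [3.75, 1.00, 2.25]},
    index=["a", "b", "c"],
)

by_label = toy.loc["b", "Stars"]    # row label "b", column label "Stars"
label_slice = toy.loc["a":"b", :]   # label slices include BOTH endpoints
position_slice = toy.iloc[0:1, :]   # position slices exclude the stop

print(by_label)
print(len(label_slice), len(position_slice))
```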

Exercise

Try to use loc to select the following rows, columns:

  • All columns except Brand for the 1st observation.
  • All columns for the last observation.
  • Stars for the first 5 observations.
  • All columns for the first 5 observations.

3. Filtering DataFrame

  • Comparing a column to a value returns a boolean Series with one True/False entry per row:
ramen['Country'] == 'Vietnam'
## 0       False
## 1       False
## 2       False
## 3       False
## 4       False
##         ...  
## 2570     True
## 2571    False
## 2572    False
## 2573    False
## 2574    False
## Name: Country, Length: 2575, dtype: bool
# method 1: pass the boolean Series to [] to keep only the rows marked True
ramen[ramen['Country'] == 'Vietnam']
##               Brand  ... Stars
## 18         Binh Tay  ...  4.00
## 52    Uni-President  ...  0.00
## 143        Mum Ngon  ...  3.50
## 224           Vifon  ...  5.00
## 365         Acecook  ...  4.00
## ...             ...  ...   ...
## 2484       Binh Tay  ...  2.75
## 2533        Ve Wong  ...  2.75
## 2568        Ve Wong  ...  1.00
## 2569          Vifon  ...  2.50
## 2570          Vifon  ...  3.50
## 
## [108 rows x 5 columns]
# method 2: pass the same boolean Series to loc
ramen.loc[ramen['Country'] == 'Vietnam']
##               Brand  ... Stars
## 18         Binh Tay  ...  4.00
## 52    Uni-President  ...  0.00
## 143        Mum Ngon  ...  3.50
## 224           Vifon  ...  5.00
## 365         Acecook  ...  4.00
## ...             ...  ...   ...
## 2484       Binh Tay  ...  2.75
## 2533        Ve Wong  ...  2.75
## 2568        Ve Wong  ...  1.00
## 2569          Vifon  ...  2.50
## 2570          Vifon  ...  3.50
## 
## [108 rows x 5 columns]
  • To combine conditions, wrap each one in parentheses and join them with & (and) or | (or):
ramen.loc[(ramen['Country'] == 'Vietnam') & (ramen['Brand'] == 'Gau Do')]
##        Brand          Variety Style  Country  Stars
## 1776  Gau Do  Hot Sour Shrimp  Pack  Vietnam   3.75
## 1840  Gau Do   Chicken Shrimp  Pack  Vietnam   2.50

Exercise

Let’s look up the rating for the 2 ramen packs I bought recently! (see video)