## 1. What is a DataFrame?

• In this lecture, we will be working with the cleaned version of the Ramen Ratings dataset available on kaggle.com.
• Cleaned version = a dataset that has been simplified and cleaned by me!
• First, let’s load the data into Python!
import pandas as pd

ramen
##                Brand  ... Stars
## 0          New Touch  ...  3.75
## 1           Just Way  ...  1.00
## 2             Nissin  ...  2.25
## 3            Wei Lih  ...  2.75
## 4     Ching's Secret  ...  3.75
## ...              ...  ...   ...
## 2570           Vifon  ...  3.50
## 2571         Wai Wai  ...  1.00
## 2572         Wai Wai  ...  2.00
## 2573         Wai Wai  ...  2.00
## 2574        Westbrae  ...  0.50
##
## [2575 rows x 5 columns]
• So this dataset has 2575 observations of 5 variables. Let’s investigate this further!
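
The line that creates `ramen` is not shown above. As a rough sketch of how a CSV file is typically loaded with `pd.read_csv`, the block below reads from an in-memory string that stands in for the real file (its path is not given in these notes). The six rows are copied from outputs that appear later in the lecture; the names `csv_text` and `sample` are hypothetical.

```python
import io
import pandas as pd

# Stand-in for the real CSV file: six rows copied from outputs in these notes.
csv_text = """Brand,Variety,Style,Country,Stars
New Touch,T's Restaurant Tantanmen,Cup,Japan,3.75
Samyang Foods,Kimchi song Song Ramen,Pack,South Korea,4.75
Tao Kae Noi,Creamy tom Yum Kung Flavour,Pack,Thailand,5.00
KOKA,Mushroom Flavour Instant Noodles,Cup,Singapore,3.50
Gau Do,Hot Sour Shrimp,Pack,Vietnam,3.75
Gau Do,Chicken Shrimp,Pack,Vietnam,2.50
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (number of rows, number of columns)
print(sample.head(3))  # first three rows
```

With the real file, the `io.StringIO(...)` argument would be replaced by the file's path or URL.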

### 1.1 Exploratory Data Analysis (EDA): Summary Statistics

• Use the describe() method of a DataFrame to get summary statistics. By default it summarizes only the numeric variables, which is why only Stars appears below.
• Syntax: dataframe_name.describe()
ramen.describe()
##              Stars
## count  2575.000000
## mean      3.654893
## std       1.015641
## min       0.000000
## 25%       3.250000
## 50%       3.750000
## 75%       4.250000
## max       5.000000
• We also want to know the data type of each variable!
ramen.dtypes
## Brand       object
## Variety     object
## Style       object
## Country     object
## Stars      float64
## dtype: object
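
Since Stars is the only numeric column, `describe()` skips the other four. A small sketch, assuming a six-row sample copied from outputs in these notes (the name `sample` is hypothetical), shows how `include="all"` asks pandas to summarize the text columns as well:

```python
import pandas as pd

# Six rows copied from outputs in these notes; not the full dataset.
sample = pd.DataFrame({
    "Brand": ["New Touch", "Samyang Foods", "Tao Kae Noi", "KOKA", "Gau Do", "Gau Do"],
    "Style": ["Cup", "Pack", "Pack", "Cup", "Pack", "Pack"],
    "Country": ["Japan", "South Korea", "Thailand", "Singapore", "Vietnam", "Vietnam"],
    "Stars": [3.75, 4.75, 5.00, 3.50, 3.75, 2.50],
})

# Default: numeric columns only.
numeric_summary = sample.describe()

# include="all": adds count, unique, top (most frequent value) and freq
# for text columns; statistics that do not apply are shown as NaN.
full_summary = sample.describe(include="all")

print(numeric_summary)
print(full_summary)
```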
• You might wonder: How many different countries produce ramen? How many styles of ramen are there? Or how many different brands?
• Use the unique() method to find out!
ramen['Brand'].unique()
• Well… That is a lot to count! We can get the length of the result with len() (the result is an array rather than a list, but len() works on it the same way):
len(ramen['Brand'].unique())
## 355
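
As a possible shortcut, pandas also offers `nunique()`, which counts distinct values directly, and `value_counts()`, which shows how often each value occurs. The sketch below uses a six-row sample copied from outputs in these notes, so the counts differ from the full dataset; `sample` is a hypothetical name.

```python
import pandas as pd

# Six rows copied from outputs in these notes; not the full dataset.
sample = pd.DataFrame({
    "Brand": ["New Touch", "Samyang Foods", "Tao Kae Noi", "KOKA", "Gau Do", "Gau Do"],
    "Country": ["Japan", "South Korea", "Thailand", "Singapore", "Vietnam", "Vietnam"],
})

# Same number as len(sample['Brand'].unique()), in one call.
n_brands = sample["Brand"].nunique()

# Number of rows per country, most frequent first.
country_counts = sample["Country"].value_counts()

print(n_brands)
print(country_counts)
```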

### 1.2 Exploratory Data Analysis (EDA): Plotting

#### Histogram

• Let’s investigate the histogram of the Stars variable:
• The method of plotting we use here is slightly different from what we learned last week.
• This new method is convenient if what you want to plot is a column of a DataFrame!
ramen['Stars'].hist()

• To add axis labels, we still need to import the matplotlib module:
import matplotlib.pyplot as plt

ramen['Stars'].hist(color = 'darkorange')

plt.xlabel('Rating (0-5)')
plt.ylabel('Count')
plt.title('Histogram of Ramen Ratings')
plt.show()

• Another way to plot a histogram is as follows:
ramen.hist(column='Stars')
plt.show()

• This method is efficient when we want to plot histograms for multiple columns at the same time!
iris = pd.read_csv('https://stat107.hknguyen.org/files/datasets/iris.csv')
iris.hist(column=['Sepal.Length', 'Sepal.Width'])
plt.show()
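
The histogram calls above use the default binning. A minimal sketch of controlling the bins and drawing on an explicit Axes object is given below; it assumes six ratings copied from outputs in these notes, and the `Agg` backend line is only there so the code runs without a display (it is not needed in a notebook).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Six ratings copied from outputs in these notes; not the full dataset.
stars = pd.Series([3.75, 4.75, 5.00, 3.50, 3.75, 2.50], name="Stars")

fig, ax = plt.subplots()

# bins=5 with range=(0, 5) gives one bar per whole-star interval.
# Extra keyword arguments such as range and color are passed on to matplotlib.
stars.hist(ax=ax, bins=5, range=(0, 5), color="darkorange")

ax.set_xlabel("Rating (0-5)")
ax.set_ylabel("Count")
ax.set_title("Histogram of Ramen Ratings")
```

In an interactive session, `plt.show()` would display the figure as in the examples above.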

#### Boxplot

• Just like hist(), a pandas DataFrame also has a boxplot() method.
ramen.boxplot(column='Stars')

• But note that the following does NOT work!
ramen['Stars'].boxplot()
## AttributeError: 'Series' object has no attribute 'boxplot'
• Again, this way of drawing boxplots is very helpful when plotting multiple columns at the same time!
iris.boxplot(column=['Sepal.Length', 'Sepal.Width'])
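
A Series has no `boxplot()` method, but there are ways around that. The sketch below shows two options that should work: the Series plotting accessor `plot.box()`, or converting the Series to a one-column DataFrame with `to_frame()`. The six ratings are copied from outputs in these notes, and the `Agg` backend line is only there so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Six ratings copied from outputs in these notes; not the full dataset.
stars = pd.Series([3.75, 4.75, 5.00, 3.50, 3.75, 2.50], name="Stars")

# Option 1: the plotting accessor available on every Series.
fig1, ax1 = plt.subplots()
stars.plot.box(ax=ax1)
ax1.set_ylabel("Rating (0-5)")

# Option 2: turn the Series into a one-column DataFrame, then use boxplot().
fig2, ax2 = plt.subplots()
stars.to_frame().boxplot(column="Stars", ax=ax2)
```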

## 2. Selecting Rows/Columns of a DataFrame

### 2.1 Using []

• We already learned one way of selecting a column of a DataFrame in the previous lecture.
ramen['Brand']
• Another way to do this is:
ramen.Brand
• To select multiple columns at the same time:
ramen[['Brand', 'Stars']]
##                Brand  Stars
## 0          New Touch   3.75
## 1           Just Way   1.00
## 2             Nissin   2.25
## 3            Wei Lih   2.75
## 4     Ching's Secret   3.75
## ...              ...    ...
## 2570           Vifon   3.50
## 2571         Wai Wai   1.00
## 2572         Wai Wai   2.00
## 2573         Wai Wai   2.00
## 2574        Westbrae   0.50
##
## [2575 rows x 2 columns]
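
The bracket forms above differ in what they return: a single name gives a Series, while a list of names gives a DataFrame, even when the list holds only one name. The sketch below checks this on a six-row sample copied from outputs in these notes (`sample` is a hypothetical name). Note also that the dot form only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute, so the bracket form is the safer habit.

```python
import pandas as pd

# Six rows copied from outputs in these notes; not the full dataset.
sample = pd.DataFrame({
    "Brand": ["New Touch", "Samyang Foods", "Tao Kae Noi", "KOKA", "Gau Do", "Gau Do"],
    "Stars": [3.75, 4.75, 5.00, 3.50, 3.75, 2.50],
})

by_bracket = sample["Brand"]     # Series
by_attribute = sample.Brand      # same Series, dot notation
one_col_frame = sample[["Brand"]]  # DataFrame with a single column

print(type(by_bracket).__name__)     # Series
print(type(one_col_frame).__name__)  # DataFrame
```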

### 2.2 Using iloc

• iloc is short for integer-location based indexing.
• iloc syntax is:
dataframe_name.iloc[<row index>, <column index>]
• To select the Stars rating of the first observation (index 0):
ramen.iloc[0, 4]
## 3.75
• To select multiple columns of the same row/observation:
ramen.iloc[0, 1:5]
## Variety    T's Restaurant Tantanmen
## Style                            Cup
## Country                        Japan
## Stars                           3.75
## Name: 0, dtype: object
• To select ALL columns of one row:
ramen.iloc[0, :]
## Brand                      New Touch
## Variety    T's Restaurant Tantanmen
## Style                            Cup
## Country                        Japan
## Stars                           3.75
## Name: 0, dtype: object
• Similarly, we can select multiple rows of the same column:
• The following code segment returns the rating of the first 5 observations in this dataset.
ramen.iloc[0:5, 4]
## 0    3.75
## 1    1.00
## 2    2.25
## 3    2.75
## 4    3.75
## Name: Stars, dtype: float64
• We can also select multiple rows and multiple columns at the same time (here, four rows and all columns):
ramen.iloc[[0, 5, 10, 15], :]
##             Brand                           Variety Style      Country  Stars
## 0       New Touch         T's Restaurant Tantanmen    Cup        Japan   3.75
## 5   Samyang Foods            Kimchi song Song Ramen  Pack  South Korea   4.75
## 10    Tao Kae Noi       Creamy tom Yum Kung Flavour  Pack     Thailand   5.00
## 15           KOKA  Mushroom Flavour Instant Noodles   Cup    Singapore   3.50
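
A few more iloc patterns follow from the examples above: lists can be used for rows and columns at once, slices stop before the end position, and negative positions count from the end. This is a small sketch on a six-row sample copied from outputs in these notes, so the row positions here run from 0 to 5 rather than matching the full dataset; `sample` is a hypothetical name.

```python
import pandas as pd

# Six rows copied from outputs in these notes; not the full dataset.
sample = pd.DataFrame({
    "Brand": ["New Touch", "Samyang Foods", "Tao Kae Noi", "KOKA", "Gau Do", "Gau Do"],
    "Variety": ["T's Restaurant Tantanmen", "Kimchi song Song Ramen",
                "Creamy tom Yum Kung Flavour", "Mushroom Flavour Instant Noodles",
                "Hot Sour Shrimp", "Chicken Shrimp"],
    "Style": ["Cup", "Pack", "Pack", "Cup", "Pack", "Pack"],
    "Country": ["Japan", "South Korea", "Thailand", "Singapore", "Vietnam", "Vietnam"],
    "Stars": [3.75, 4.75, 5.00, 3.50, 3.75, 2.50],
})

# Rows at positions 0 and 2, columns at positions 0 (Brand) and 4 (Stars).
subset = sample.iloc[[0, 2], [0, 4]]

# Slices exclude the stop position: 0:2 means positions 0 and 1 only.
first_two_stars = sample.iloc[0:2, 4]

# Negative positions count from the end: -1 is the last row.
last_row = sample.iloc[-1, :]

print(subset)
print(first_two_stars)
print(last_row)
```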

#### Exercise

What would ramen.iloc[:, :] return?

### 2.3 Using loc

• The i in iloc stands for integer. That is why with iloc, we always use integer positions for indexing.
• With loc, we use labels (names) or a Boolean array/list for indexing instead.
• For our dataset, the rows are labeled by integers, so we still use integers as our row labels.
ramen.loc[0,'Stars']
## 3.75
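
One difference that is easy to miss: a slice in loc includes its end label, while a slice in iloc stops before its end position. The sketch below compares the two on a six-row sample copied from outputs in these notes (`sample` is a hypothetical name; its rows are labeled 0 to 5).

```python
import pandas as pd

# Six rows copied from outputs in these notes; not the full dataset.
sample = pd.DataFrame({
    "Brand": ["New Touch", "Samyang Foods", "Tao Kae Noi", "KOKA", "Gau Do", "Gau Do"],
    "Style": ["Cup", "Pack", "Pack", "Cup", "Pack", "Pack"],
    "Country": ["Japan", "South Korea", "Thailand", "Singapore", "Vietnam", "Vietnam"],
    "Stars": [3.75, 4.75, 5.00, 3.50, 3.75, 2.50],
})

# loc: labels 1, 2 AND 3 are returned (end label included).
by_label = sample.loc[1:3, "Stars"]

# iloc: positions 1 and 2 only (end position excluded); Stars is column 3 here.
by_position = sample.iloc[1:3, 3]

# Label slices work on column names too, and are also inclusive.
style_to_stars = sample.loc[0, "Style":"Stars"]

print(len(by_label), len(by_position))
print(style_to_stars)
```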

#### Exercise

Try using loc to select the following rows and columns:

• All columns except Brand for the 1st observation.
• All columns for the last observation.
• Stars for the first 5 observations.
• All columns for the first 5 observations.

## 3. Filtering DataFrame

• Let’s say I only want observations of ramen coming from Vietnam!
• Then we need to filter the DataFrame and only print out the rows where Country is Vietnam.
ramen['Country'] == 'Vietnam'
## 0       False
## 1       False
## 2       False
## 3       False
## 4       False
##         ...
## 2570     True
## 2571    False
## 2572    False
## 2573    False
## 2574    False
## Name: Country, Length: 2575, dtype: bool
• This returns a Series (similar to a list) of Boolean (True/False) values, one per row. We can use this to filter the DataFrame!
• There are 2 ways to do this:
# method 1
ramen[ramen['Country'] == 'Vietnam']
##               Brand  ... Stars
## 18         Binh Tay  ...  4.00
## 52    Uni-President  ...  0.00
## 143        Mum Ngon  ...  3.50
## 224           Vifon  ...  5.00
## 365         Acecook  ...  4.00
## ...             ...  ...   ...
## 2484       Binh Tay  ...  2.75
## 2533        Ve Wong  ...  2.75
## 2568        Ve Wong  ...  1.00
## 2569          Vifon  ...  2.50
## 2570          Vifon  ...  3.50
##
## [108 rows x 5 columns]
# method 2
ramen.loc[ramen['Country'] == 'Vietnam']
##               Brand  ... Stars
## 18         Binh Tay  ...  4.00
## 52    Uni-President  ...  0.00
## 143        Mum Ngon  ...  3.50
## 224           Vifon  ...  5.00
## 365         Acecook  ...  4.00
## ...             ...  ...   ...
## 2484       Binh Tay  ...  2.75
## 2533        Ve Wong  ...  2.75
## 2568        Ve Wong  ...  1.00
## 2569          Vifon  ...  2.50
## 2570          Vifon  ...  3.50
##
## [108 rows x 5 columns]
• The 2nd method is PREFERRED and HIGHLY RECOMMENDED! The reason? We will learn in future lectures!
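
One practical convenience of loc (not necessarily the reason the lecture has in mind) is that it can filter rows and pick columns in a single step. The sketch below also shows that summing a Boolean Series counts its True values. It uses a six-row sample copied from outputs in these notes; `sample` and `is_vietnam` are hypothetical names.

```python
import pandas as pd

# Six rows copied from outputs in these notes; not the full dataset.
sample = pd.DataFrame({
    "Brand": ["New Touch", "Samyang Foods", "Tao Kae Noi", "KOKA", "Gau Do", "Gau Do"],
    "Country": ["Japan", "South Korea", "Thailand", "Singapore", "Vietnam", "Vietnam"],
    "Stars": [3.75, 4.75, 5.00, 3.50, 3.75, 2.50],
})

# Boolean Series: True where Country is Vietnam.
is_vietnam = sample["Country"] == "Vietnam"

# True counts as 1 and False as 0, so the sum is the number of matching rows.
n_vietnam = is_vietnam.sum()

# Filter rows and select columns at the same time.
vietnam_stars = sample.loc[is_vietnam, ["Brand", "Stars"]]

print(n_vietnam)
print(vietnam_stars)
```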

• Now, what if I want only the ones where Brand is Gau Do (which means Red Bear)?

ramen.loc[(ramen['Country'] == 'Vietnam') & (ramen['Brand'] == 'Gau Do')]
##        Brand          Variety Style  Country  Stars
## 1776  Gau Do  Hot Sour Shrimp  Pack  Vietnam   3.75
## 1840  Gau Do   Chicken Shrimp  Pack  Vietnam   2.50
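
The filter above combines two conditions with `&` (and). Each condition needs its own parentheses because `&` binds more tightly than `==` in Python. As a sketch on a six-row sample copied from outputs in these notes (`sample` is a hypothetical name), the same pattern extends to `|` (or) and `~` (not):

```python
import pandas as pd

# Six rows copied from outputs in these notes; not the full dataset.
sample = pd.DataFrame({
    "Brand": ["New Touch", "Samyang Foods", "Tao Kae Noi", "KOKA", "Gau Do", "Gau Do"],
    "Style": ["Cup", "Pack", "Pack", "Cup", "Pack", "Pack"],
    "Country": ["Japan", "South Korea", "Thailand", "Singapore", "Vietnam", "Vietnam"],
    "Stars": [3.75, 4.75, 5.00, 3.50, 3.75, 2.50],
})

# AND: both conditions must hold.
vn_gau_do = sample.loc[(sample["Country"] == "Vietnam") & (sample["Brand"] == "Gau Do")]

# OR: at least one condition must hold.
cup_or_high = sample.loc[(sample["Style"] == "Cup") | (sample["Stars"] >= 4.5)]

# NOT: flip every True/False value.
not_vietnam = sample.loc[~(sample["Country"] == "Vietnam")]

print(len(vn_gau_do), len(cup_or_high), len(not_vietnam))
```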

### Exercise

Let’s look up the rating for the 2 ramen packs I bought recently! (see video)