# DataFrame

Ha Khanh Nguyen (hknguyen)

## 1. What is a DataFrame?

• In this lecture, we will be working with the cleaned version of the Ramen Ratings dataset available on kaggle.com.
• Cleaned version = a dataset that has been simplified and cleaned by me!
• First, let's load the data into Python!
• So this dataset has 2575 observations of 5 variables. Let's investigate this further!
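
As a rough sketch of the loading step, the snippet below reads CSV text with pandas and checks the shape. The rows are made-up stand-ins, not the real data, and the column names (in particular `Variety`) are assumptions; with the real file you would pass its path to `pd.read_csv` instead of the in-memory text.

```python
import io

import pandas as pd

# Made-up stand-in rows (NOT the real data); column names are assumed.
csv_text = """Brand,Variety,Style,Country,Stars
Brand A,Chicken,Cup,USA,2.25
Gau Do,Beef,Pack,Vietnam,3.5
Brand B,Spicy,Pack,South Korea,4.0
Brand C,Pho,Bowl,Vietnam,3.75
"""

# read_csv accepts a file path or a file-like object.
ramen = pd.read_csv(io.StringIO(csv_text))

print(ramen.head())   # first few rows
print(ramen.shape)    # (number of observations, number of variables)
```

For the real dataset, `ramen.shape` is where the 2575 observations and 5 variables would show up.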

### 1.1 Exploratory Data Analysis (EDA): Summary Statistics

• Use the function describe() of the DataFrame object to get the summary statistics of variables in the DataFrame.
• Syntax: dataframe_name.describe().
• We also want to know the data type of each variable!
• You might wonder: How many different countries produce ramen? How many styles of ramen are there? How many different brands?
• Use the unique() function to find out!
• Well... That is a lot to count! Instead, we can get the length of the result above (it is not really a list, but the idea is similar).
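
The steps above might look like the following sketch. It uses a small made-up DataFrame in place of the real dataset (the values and the `Variety` column are assumptions), so only the method calls carry over.

```python
import pandas as pd

# Made-up stand-in for the ramen dataset (NOT the real data).
ramen = pd.DataFrame({
    "Brand": ["Brand A", "Gau Do", "Brand B", "Brand C", "Gau Do", "Brand A"],
    "Variety": ["Chicken", "Beef", "Spicy", "Pho", "Shrimp", "Miso"],
    "Style": ["Cup", "Pack", "Pack", "Bowl", "Pack", "Cup"],
    "Country": ["USA", "Vietnam", "South Korea", "Vietnam", "Vietnam", "Japan"],
    "Stars": [2.25, 3.5, 4.0, 3.75, 3.0, 5.0],
})

print(ramen.describe())   # summary statistics of the numeric columns
print(ramen.dtypes)       # data type of each variable

countries = ramen["Country"].unique()   # array of distinct values
print(countries)
print(len(countries))                   # how many distinct countries

# nunique() gives the same count in a single call.
print(ramen["Country"].nunique())
```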

### 1.2 Exploratory Data Analysis (EDA): Plotting

#### Histogram

• Let's investigate the histogram of the Stars variable:
• The method of plotting we use here is slightly different from what we learned last week.
• This new method is convenient if what you want to plot is a column of a DataFrame!
• To add axis labels, we still need to import the matplotlib module:
• Another way to plot a histogram is as follows:
• This method is efficient when we want to plot histograms for multiple columns at the same time!
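
Here is one possible sketch of the two histogram approaches, using a made-up stand-in DataFrame (values and the `Variety` column are assumptions). The bin count and label text are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the ramen dataset (NOT the real data).
ramen = pd.DataFrame({
    "Brand": ["Brand A", "Gau Do", "Brand B", "Brand C", "Gau Do", "Brand A"],
    "Variety": ["Chicken", "Beef", "Spicy", "Pho", "Shrimp", "Miso"],
    "Style": ["Cup", "Pack", "Pack", "Bowl", "Pack", "Cup"],
    "Country": ["USA", "Vietnam", "South Korea", "Vietnam", "Vietnam", "Japan"],
    "Stars": [2.25, 3.5, 4.0, 3.75, 3.0, 5.0],
})

# Method 1: plot straight from the column, then label the axes with matplotlib.
ax = ramen["Stars"].plot.hist(bins=5)
plt.xlabel("Stars")
plt.ylabel("Number of ramen")

# Method 2: call hist() on the DataFrame. Without `column`, every numeric
# column gets its own histogram, which is handy for many columns at once.
axes = ramen.hist(column="Stars", bins=5)

# In a notebook the figures render automatically; in a script, call plt.show().
```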

#### Boxplot

• Just like for histograms, a pandas DataFrame also has a boxplot() function.
• But note that the following does NOT work!
• Again, this way of plotting boxplots is very helpful when plotting multiple columns at the same time!
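
A minimal boxplot sketch on a made-up stand-in DataFrame (values and the `Variety` column are assumptions). One thing worth knowing here: `boxplot()` is a DataFrame method, and a single column (a Series) does not have it, although `Series.plot.box()` draws the same kind of plot.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the ramen dataset (NOT the real data).
ramen = pd.DataFrame({
    "Brand": ["Brand A", "Gau Do", "Brand B", "Brand C", "Gau Do", "Brand A"],
    "Variety": ["Chicken", "Beef", "Spicy", "Pho", "Shrimp", "Miso"],
    "Style": ["Cup", "Pack", "Pack", "Bowl", "Pack", "Cup"],
    "Country": ["USA", "Vietnam", "South Korea", "Vietnam", "Vietnam", "Japan"],
    "Stars": [2.25, 3.5, 4.0, 3.75, 3.0, 5.0],
})

# DataFrame method: choose the column(s) by name.
ax = ramen.boxplot(column="Stars")
plt.ylabel("Stars")

# A Series has no boxplot() method ...
print(hasattr(ramen["Stars"], "boxplot"))   # False

# ... but the .plot accessor offers an equivalent.
plt.figure()
ax2 = ramen["Stars"].plot.box()
```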

## 2. Selecting Rows/Columns of a DataFrame

### 2.1 Using []

• We already learned one way of selecting a column of a DataFrame in the previous lecture.
• Another way to do this is:
• To select multiple columns at the same time:
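
As an illustration of `[]` selection, the sketch below uses a made-up stand-in DataFrame (values and the `Variety` column are assumptions). A single name returns a Series, while a list of names returns a DataFrame.

```python
import pandas as pd

# Made-up stand-in for the ramen dataset (NOT the real data).
ramen = pd.DataFrame({
    "Brand": ["Brand A", "Gau Do", "Brand B", "Brand C", "Gau Do", "Brand A"],
    "Variety": ["Chicken", "Beef", "Spicy", "Pho", "Shrimp", "Miso"],
    "Style": ["Cup", "Pack", "Pack", "Bowl", "Pack", "Cup"],
    "Country": ["USA", "Vietnam", "South Korea", "Vietnam", "Vietnam", "Japan"],
    "Stars": [2.25, 3.5, 4.0, 3.75, 3.0, 5.0],
})

stars = ramen["Stars"]                      # one column -> Series
brand_and_stars = ramen[["Brand", "Stars"]]  # list of names -> DataFrame

print(type(stars).__name__)
print(type(brand_and_stars).__name__)
print(brand_and_stars.head())
```

Note the double brackets in the second call: the outer pair is the selection operator and the inner pair is an ordinary Python list.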

### 2.2 Using iloc

• iloc is short for integer-location based indexing.
• iloc syntax is:
• To select the Stars rating of the first observation (index 0):
• To select multiple columns of the same row/observation:
• To select ALL columns of one row:
• Similarly, we can select multiple rows of the same column:
• The following code segment returns the rating of the first 5 observations in this dataset.
• We can also select multiple rows and multiple columns at the same time:
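
The selections listed above could be sketched as follows, with the general form `dataframe.iloc[row_positions, column_positions]`. The DataFrame is a made-up stand-in (values and the `Variety` column are assumptions), and the column positions are looked up by name so the sketch does not depend on the real column order.

```python
import pandas as pd

# Made-up stand-in for the ramen dataset (NOT the real data).
ramen = pd.DataFrame({
    "Brand": ["Brand A", "Gau Do", "Brand B", "Brand C", "Gau Do", "Brand A"],
    "Variety": ["Chicken", "Beef", "Spicy", "Pho", "Shrimp", "Miso"],
    "Style": ["Cup", "Pack", "Pack", "Bowl", "Pack", "Cup"],
    "Country": ["USA", "Vietnam", "South Korea", "Vietnam", "Vietnam", "Japan"],
    "Stars": [2.25, 3.5, 4.0, 3.75, 3.0, 5.0],
})

# Integer positions of the columns we want.
stars_col = ramen.columns.get_loc("Stars")
brand_col = ramen.columns.get_loc("Brand")

first_rating = ramen.iloc[0, stars_col]                    # one cell
first_brand_stars = ramen.iloc[0, [brand_col, stars_col]]  # several columns, one row
first_row = ramen.iloc[0, :]                               # all columns, one row
first_five = ramen.iloc[0:5, stars_col]                    # rows 0 to 4 (end excluded)
block = ramen.iloc[0:3, [brand_col, stars_col]]            # several rows and columns

print(first_rating)
print(first_five)
print(block)
```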

#### Exercise

What would ramen.iloc[:, :] return?

### 2.3 Using loc

• The i in iloc stands for integer. That is why with iloc, we always use numbers for indexing.
• With loc, we use labels (names) or a boolean array/list for indexing instead.
• For our dataset, the rows are labeled by integers, so we still use integers as our row labels.
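
A short sketch of label-based selection on a made-up stand-in DataFrame (values and the `Variety` column are assumptions). One difference from `iloc` worth noticing: a label slice such as `1:3` includes its end point.

```python
import pandas as pd

# Made-up stand-in for the ramen dataset (NOT the real data).
ramen = pd.DataFrame({
    "Brand": ["Brand A", "Gau Do", "Brand B", "Brand C", "Gau Do", "Brand A"],
    "Variety": ["Chicken", "Beef", "Spicy", "Pho", "Shrimp", "Miso"],
    "Style": ["Cup", "Pack", "Pack", "Bowl", "Pack", "Cup"],
    "Country": ["USA", "Vietnam", "South Korea", "Vietnam", "Vietnam", "Japan"],
    "Stars": [2.25, 3.5, 4.0, 3.75, 3.0, 5.0],
})

first_country = ramen.loc[0, "Country"]        # row label 0, column label "Country"
middle = ramen.loc[1:3, ["Brand", "Country"]]  # row labels 1, 2 AND 3

# A boolean list with one True/False per row also works as the row selector.
keep = [True, False, False, False, False, True]
picked = ramen.loc[keep, "Brand"]

print(first_country)
print(middle)
print(picked)
```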

#### Exercise

Try to use loc to select the following rows, columns:

• All columns except Brand for the 1st observation.
• All columns for the last observation.
• Stars for the first 5 observations.
• All columns for the first 5 observations.

## 3. Filtering DataFrame

• Let's say I only want observations of ramen coming from Vietnam!
• Then we need to filter the DataFrame and only print out the rows where Country is Vietnam.
• This returns an array (similar to a list) of Boolean (True/False) values. We can use this to filter the DataFrame!
• There are 2 ways to do this:
• The 2nd method is PREFERRED and HIGHLY RECOMMENDED! The reason? We will learn in future lectures!

• Now, what if I want only the ones where Brand is Gau Do (which means Red Bear)?
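
Filtering could be sketched as below on a made-up stand-in DataFrame (values and the `Variety` column are assumptions). The two approaches shown, plain `[]` and `loc`, are an assumption about which two methods are meant; both accept the Boolean mask.

```python
import pandas as pd

# Made-up stand-in for the ramen dataset (NOT the real data).
ramen = pd.DataFrame({
    "Brand": ["Brand A", "Gau Do", "Brand B", "Brand C", "Gau Do", "Brand A"],
    "Variety": ["Chicken", "Beef", "Spicy", "Pho", "Shrimp", "Miso"],
    "Style": ["Cup", "Pack", "Pack", "Bowl", "Pack", "Cup"],
    "Country": ["USA", "Vietnam", "South Korea", "Vietnam", "Vietnam", "Japan"],
    "Stars": [2.25, 3.5, 4.0, 3.75, 3.0, 5.0],
})

# One True/False value per row.
from_vietnam = ramen["Country"] == "Vietnam"
print(from_vietnam)

# Assumed method 1: pass the mask to [].
vietnam_brackets = ramen[from_vietnam]

# Assumed method 2: pass the mask to loc.
vietnam_loc = ramen.loc[from_vietnam]

# Filtering on another column works the same way.
gau_do = ramen.loc[ramen["Brand"] == "Gau Do"]

# Conditions can be combined with & (and) or | (or); wrap each in parentheses.
vietnam_gau_do = ramen.loc[(ramen["Country"] == "Vietnam") & (ramen["Brand"] == "Gau Do")]

print(vietnam_loc)
print(gau_do)
```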

### Exercise

Let's look up the rating for the 2 ramen packs I bought recently! (see video)