There are several goals for this project:

- Gain experience reading a “large” data set into Python and manipulating it.
- Learn to think critically about how to state and test hypotheses.
- Develop skill in describing and presenting analyses.

- Project release: Friday, 11/20/2020.
- Project due date: **11:59 PM Wednesday, 12/16/2020**.
  - Late submissions are accepted until 11:59 PM Friday, 12/18/2020 with a 20% penalty.

For this project, we will look at the 2015 Flight Delays and Cancellations Data provided by the U.S. Department of Transportation. This is a huge dataset available on Kaggle.

There are 3 CSV files in this dataset. The one that we will focus on is the `flights.csv` file.

There are 31 columns in this CSV file. They are described as follows:

- `'YEAR'`: year of the flight trip
- `'MONTH'`: month of the flight trip
- `'DAY'`: day of the flight trip
- `'DAY_OF_WEEK'`: day of the week of the flight trip
- `'AIRLINE'`: airline identifier
- `'FLIGHT_NUMBER'`: flight identifier
- `'TAIL_NUMBER'`: aircraft identifier
- `'ORIGIN_AIRPORT'`: starting airport
- `'DESTINATION_AIRPORT'`: destination airport
- `'SCHEDULED_DEPARTURE'`: planned departure time
- `'DEPARTURE_TIME'`: actual departure time, the time the aircraft leaves the gate and starts taxiing (= `WHEELS_OFF` - `TAXI_OUT`)
- `'DEPARTURE_DELAY'`: total delay on departure
- `'TAXI_OUT'`: the time elapsed between departure from the origin airport gate and `WHEELS_OFF`
- `'WHEELS_OFF'`: the time point at which the aircraft’s wheels leave the ground
- `'SCHEDULED_TIME'`: planned amount of time needed for the flight trip (in min)
- `'ELAPSED_TIME'`: time from the gate of the departure airport to the gate of the arrival airport (in min) (= `AIR_TIME` + `TAXI_IN` + `TAXI_OUT`)
- `'AIR_TIME'`: the time elapsed between `WHEELS_OFF` and `WHEELS_ON`
- `'DISTANCE'`: distance between the two airports
- `'WHEELS_ON'`: the time point at which the aircraft’s wheels touch the ground
- `'TAXI_IN'`: the time elapsed between `WHEELS_ON` and gate arrival at the destination airport
- `'SCHEDULED_ARRIVAL'`: planned arrival time
- `'ARRIVAL_TIME'`: actual arrival time, the time the aircraft arrives at the arrival gate
- `'ARRIVAL_DELAY'`: the delay between scheduled arrival time and actual arrival time (= `ARRIVAL_TIME` - `SCHEDULED_ARRIVAL`)
- `'DIVERTED'`: whether the aircraft landed at an airport other than the scheduled destination
- `'CANCELLED'`: whether the flight is cancelled (1 = cancelled, 0 = not cancelled)
- `'CANCELLATION_REASON'`: reason for cancellation of the flight: A - Airline/Carrier; B - Weather; C - National Air System; D - Security
- `'AIR_SYSTEM_DELAY'`: delay caused by the air system
- `'SECURITY_DELAY'`: delay caused by security
- `'AIRLINE_DELAY'`: delay caused by the airline
- `'LATE_AIRCRAFT_DELAY'`: delay caused by a late-arriving aircraft
- `'WEATHER_DELAY'`: delay caused by weather
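
The relationship stated for `ELAPSED_TIME` can be checked directly once the data is in a DataFrame. The sketch below uses two made-up rows purely for illustration (they are not taken from `flights.csv`), and the helper name `elapsed_time_mismatch` is hypothetical.

```python
import pandas as pd

# Made-up rows for illustration only; real values come from flights.csv.
sample = pd.DataFrame({
    "TAXI_OUT": [12, 20],
    "AIR_TIME": [169, 263],
    "TAXI_IN": [4, 7],
    "ELAPSED_TIME": [185, 290],
})

def elapsed_time_mismatch(df):
    """Return ELAPSED_TIME minus (AIR_TIME + TAXI_IN + TAXI_OUT) for each row."""
    return df["ELAPSED_TIME"] - (df["AIR_TIME"] + df["TAXI_IN"] + df["TAXI_OUT"])

# A value of 0 means the row agrees with the stated relationship.
mismatch = elapsed_time_mismatch(sample)
print(mismatch.tolist())
```

Applied to the full dataset, rows with missing values (for example, cancelled flights) would produce missing results rather than 0.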

- Use the following URL to download the dataset to your local computer: flights.csv
- This link does NOT work for directly loading the data to Jupyter Notebook.
- After downloading the file, move the file to the location of your Jupyter Notebook to get started.

- It will take a bit of time for the dataset to load. Be patient, and don’t run any other cell or rerun the cell while `[*]` is still shown to the left of the cell.
- You might get a `DtypeWarning` shown in a red box. No worries! This is just a warning; the dataset will still be read correctly.
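
As a minimal sketch of the loading step, assuming `flights.csv` sits in the same folder as the notebook: the helper name `load_flights` is hypothetical, and passing `low_memory=False` is an optional choice that usually avoids the `DtypeWarning` at the cost of more memory, not a requirement of the project.

```python
import pandas as pd

def load_flights(path="flights.csv"):
    """Read the flights file into a DataFrame.

    low_memory=False asks pandas to infer each column's type from the
    whole file at once instead of chunk by chunk, which typically avoids
    the mixed-type DtypeWarning but uses more memory.
    """
    return pd.read_csv(path, low_memory=False)
```

A typical call in the notebook would be `flights = load_flights()`, followed by `flights.shape` to confirm that the 31 columns were read.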

- The analysis will include 4 parts:
  - Hypothesis test for 2 means
  - Hypothesis test for 2 proportions
  - Your choice of one of the above
  - A multiple-comparison correction applied to the 3 tests above

- Let’s say we’re interested in the **average arrival delay on flights in December from O’Hare International Airport (Chicago)** `ORD` to Los Angeles International Airport `LAX`.
- In particular, we want to compare the average arrival delay between American Airlines `AA` and United Air Lines `UA`.

- Check if the test assumptions are satisfied (assuming we’re using a z-test to compare the 2 population means).
- Plot a boxplot of the two samples (arrival delays on flights in December from ORD to LAX on American Airlines flights and on United Air Lines flights).
- State the hypotheses.
- Conduct the test at significance level \(\alpha =0.05\).
- State the conclusion.
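
One possible way to draw the boxplot is sketched here. It assumes the two samples are available as sequences of arrival delays, that delays are recorded in minutes, and that missing values should be left out of the plot. The function name `plot_delay_boxplot` and the label text are illustrative choices, not part of the assignment.

```python
import matplotlib.pyplot as plt

def plot_delay_boxplot(american, united):
    """Draw side-by-side boxplots of two samples of arrival delays."""
    # Leave out missing values (NaN is the only value not equal to itself).
    aa = [x for x in american if x == x]
    ua = [x for x in united if x == x]

    fig, ax = plt.subplots(figsize=(6, 4))
    ax.boxplot([aa, ua])
    ax.set_xticklabels(["American (AA)", "United (UA)"])
    ax.set_ylabel("Arrival delay (minutes)")  # unit is an assumption
    ax.set_title("December arrival delays, ORD to LAX")
    return ax
```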

- To help you get started, here is the code to filter for only the values we’re interested in:

```
american = flights.loc[(flights['MONTH'] == 12) & (flights['ORIGIN_AIRPORT'] == 'ORD') & (flights['DESTINATION_AIRPORT'] == 'LAX') &
(flights['AIRLINE'] == 'AA'), 'ARRIVAL_DELAY']
```

```
united = flights.loc[(flights['MONTH'] == 12) & (flights['ORIGIN_AIRPORT'] == 'ORD') & (flights['DESTINATION_AIRPORT'] == 'LAX') &
(flights['AIRLINE'] == 'UA'), 'ARRIVAL_DELAY']
```
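
With the two samples filtered, the z-test for two means can be sketched as follows. This is only one way to set it up: it assumes a two-sided alternative, samples large enough for the normal approximation, and that missing delays are dropped. The function name `two_mean_z_test` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def two_mean_z_test(sample1, sample2):
    """Two-sided z-test of H0: mu1 = mu2, using the sample variances.

    Returns the test statistic z and the two-sided p-value.
    """
    # Drop missing values (NaN is the only value not equal to itself).
    x1 = [x for x in sample1 if x == x]
    x2 = [x for x in sample2 if x == x]

    # Standard error of the difference between the two sample means.
    se = sqrt(variance(x1) / len(x1) + variance(x2) / len(x2))
    z = (mean(x1) - mean(x2)) / se

    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value
```

In the notebook this could be called as `z, p_value = two_mean_z_test(american, united)`, and `p_value` then compared with the chosen significance level.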

- Let’s say that instead of the average delay, we want to know the probability of arriving over an hour late.
- That is, we’re interested in the **probability of flights being delayed over an hour in December from O’Hare International Airport (Chicago)** `ORD` to Los Angeles International Airport `LAX`.
- In particular, we want to compare this probability between American Airlines `AA` and United Air Lines `UA`.

- Check if the test assumptions are satisfied (assuming we’re using a z-test to compare the 2 population proportions).
- State the hypotheses.
- Conduct the test at significance level \(\alpha =0.05\).
- State the conclusion.
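
The two-proportion test can be sketched in the same spirit. The version below assumes a two-sided alternative, a pooled proportion for the standard error, delays measured in minutes (so "over an hour" means greater than 60), and that flights with a missing arrival delay are excluded; that last choice is a modeling decision worth stating in the report. Both function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def count_late(delays, threshold=60):
    """Return (number of delays above threshold, number of non-missing delays)."""
    observed = [d for d in delays if d == d]  # drop NaN
    return sum(d > threshold for d in observed), len(observed)

def two_proportion_z_test(successes1, n1, successes2, n2):
    """Two-sided z-test of H0: p1 = p2 using the pooled proportion.

    Returns the test statistic z and the two-sided p-value.
    """
    p1 = successes1 / n1
    p2 = successes2 / n2
    pooled = (successes1 + successes2) / (n1 + n2)

    # Standard error of the difference under H0.
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se

    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value
```

For the two airlines, the counts from `count_late(american)` and `count_late(united)` would be passed to `two_proportion_z_test`.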

- Using the above hypothesis tests as examples, choose 2 populations that you’re interested in comparing.
- You must use either the hypothesis test for 2 means or the hypothesis test for 2 proportions.
- State the question that you’re interested in answering by conducting this test.
- Check if the test assumptions are satisfied.
- State the hypotheses.
- Conduct the test at significance level \(\alpha =0.05\).
- State the conclusion.

- Using the Bonferroni correction, conduct the 3 above tests with a family-wise error rate of \(\alpha = 0.05\).
- State the conclusion for all 3 tests.
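
The Bonferroni correction compares each p-value with the family-wise error rate divided by the number of tests, here \(0.05 / 3\). A minimal sketch follows; the p-values are made-up placeholders rather than results from the data, and the function name `bonferroni_decisions` is hypothetical.

```python
def bonferroni_decisions(p_values, family_alpha=0.05):
    """Apply the Bonferroni correction to a list of p-values.

    Returns the per-test significance level and, for each test,
    whether the null hypothesis is rejected at that level.
    """
    per_test_alpha = family_alpha / len(p_values)
    return per_test_alpha, [p < per_test_alpha for p in p_values]

# Placeholder p-values for illustration only.
example_p_values = [0.010, 0.020, 0.300]
per_test_alpha, reject = bonferroni_decisions(example_p_values)
print(per_test_alpha, reject)
```

Note that a test with a p-value below 0.05 but above the per-test level is no longer rejected once the correction is applied.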

- The report must be written using Jupyter Notebook and rendered into a `.html` file.
  - You can do this by clicking `File` -> `Download as` -> `HTML (.html)`.
- Reports should be “approximately” 2 pages (double-spaced) of text at most.
- The report must have a title and the full name of the author.
- Pay attention to grammar, spelling, formatting, etc. This is designed to provide practice for the real world, where you would provide reports to clients or to your boss. Use professional language, provide references, write paragraphs of complete sentences, etc.

There should be 4 sections in the report:

- Section 1: Hypothesis Test 1
- Section 2: Hypothesis Test 2
- Section 3: Hypothesis Test 3
- Section 4: Multiple Comparisons

Note that you must explain in text what you are doing at each step. For example, if you’re calculating the test statistic, you must state that in a Markdown cell, followed by the Python code used to compute the test statistic.

The project is worth 100 points in total, broken down by task:

- Correctly conduct hypothesis test 1: **25 points**
  - **[3]** Are the test assumptions correctly checked?
  - **[5]** Are the test hypotheses correct and clearly stated (all notations must be defined)?
  - **[3]** Is the test statistic correct?
  - **[5]** Is the \(p\)-value computed correctly?
  - **[4]** Is the correct conclusion drawn from the result?
  - **[5]** Is the boxplot correct and clearly defined with title, labels, etc.?

- Correctly conduct hypothesis test 2: **20 points**
  - **[3]** Are the test assumptions correctly checked?
  - **[5]** Are the test hypotheses correct and clearly stated (all notations must be defined)?
  - **[3]** Is the test statistic correct?
  - **[5]** Is the \(p\)-value computed correctly?
  - **[4]** Is the correct conclusion drawn from the result?

- Correctly state the question AND conduct hypothesis test 3: **30 points**
  - **[10]** Is the question of interest clearly stated?
  - **[3]** Are the test assumptions correctly checked?
  - **[5]** Are the test hypotheses correct and clearly stated (all notations must be defined)?
  - **[3]** Is the test statistic correct?
  - **[5]** Is the \(p\)-value computed correctly?
  - **[4]** Is the correct conclusion drawn from the result?

- Correctly perform multiple comparisons: **10 points**
  - **[5]** Is the Bonferroni correction correctly applied?
  - **[5]** Is the correct conclusion drawn from the results?

- Project report format: **15 points**
  - **[5]** Are both the html file and the Jupyter Notebook submitted?
  - **[5]** Does the report have a title and clear section headings?
  - **[5]** Is the text free of spelling errors?

- Submit both the html file and Jupyter Notebook to the assignment on **Compass** by the deadline stated above.
- After submitting, make sure to double-check that your files were correctly submitted.
- There have been cases where the files were corrupted when submitting, hence preventing the graders from viewing the files.