Goals

There are several goals for this project:


Timeline


Project

1. Dataset: 2015 Flight Delays and Cancellations Data

For this project, we will look at the 2015 Flight Delays and Cancellations Data provided by the U.S. Department of Transportation. This is a huge dataset available on Kaggle.

There are 3 csv files in this dataset. The one that we will focus on is the flights.csv file.

1.1 Dataset Description

There are 31 columns in this CSV file. They are described as follows:

  • 'YEAR': year of the flight trip
  • 'MONTH': month of the flight trip
  • 'DAY': day of the flight trip
  • 'DAY_OF_WEEK': day of the week of the flight trip
  • 'AIRLINE': airline identifier
  • 'FLIGHT_NUMBER': flight identifier
  • 'TAIL_NUMBER': aircraft identifier
  • 'ORIGIN_AIRPORT': starting airport
  • 'DESTINATION_AIRPORT': destination airport
  • 'SCHEDULED_DEPARTURE': planned departure time
  • 'DEPARTURE_TIME': actual departure time, the time the aircraft leaves the gate and starts taxiing (= WHEELS_OFF - TAXI_OUT)
  • 'DEPARTURE_DELAY': total delay on departure
  • 'TAXI_OUT': the time duration elapsed between departure from the origin airport gate and WHEELS_OFF.
  • 'WHEELS_OFF': the time point that the aircraft’s wheels leave the ground
  • 'SCHEDULED_TIME': planned time amount needed for the flight trip (in min)
  • 'ELAPSED_TIME': time from the gate of the departure airport until the gate of the arrival airport (in min) (= AIR_TIME + TAXI_IN + TAXI_OUT)
  • 'AIR_TIME': the time duration between WHEELS_OFF and WHEELS_ON time.
  • 'DISTANCE': distance between two airports
  • 'WHEELS_ON': the time point that the aircraft’s wheels touch the ground
  • 'TAXI_IN': the time duration elapsed between WHEELS_ON and gate arrival at the destination airport
  • 'SCHEDULED_ARRIVAL': planned arrival time
  • 'ARRIVAL_TIME': actual arrival time, the time the aircraft arrives at the arrival gate
  • 'ARRIVAL_DELAY': the delay between scheduled arrival time and actual arrival time (= ARRIVAL_TIME - SCHEDULED_ARRIVAL)
  • 'DIVERTED': whether the aircraft landed at an airport other than its scheduled destination (1 = diverted, 0 = not diverted)
  • 'CANCELLED': whether the flight is cancelled (1 = cancelled, 0 = not cancelled)
  • 'CANCELLATION_REASON': reason the flight was cancelled: A - Airline/Carrier; B - Weather; C - National Air System; D - Security
  • 'AIR_SYSTEM_DELAY': delay caused by air system
  • 'SECURITY_DELAY': delay caused by security
  • 'AIRLINE_DELAY': delay caused by the airline
  • 'LATE_AIRCRAFT_DELAY': delay caused by the late arrival of the aircraft
  • 'WEATHER_DELAY': delay caused by weather
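
The relationships given in parentheses above can be checked against the data. Below is a minimal sketch for ELAPSED_TIME = AIR_TIME + TAXI_IN + TAXI_OUT. The helper name `elapsed_time_mismatches` and the toy values are made up for illustration; they are not part of the dataset or of the provided code.

```python
import pandas as pd


def elapsed_time_mismatches(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows where ELAPSED_TIME != AIR_TIME + TAXI_IN + TAXI_OUT.

    Rows with a missing value in any of the four columns are skipped.
    """
    cols = ["ELAPSED_TIME", "AIR_TIME", "TAXI_IN", "TAXI_OUT"]
    complete = df.dropna(subset=cols)
    expected = complete["AIR_TIME"] + complete["TAXI_IN"] + complete["TAXI_OUT"]
    return complete[complete["ELAPSED_TIME"] != expected]


# Toy data (invented values, in minutes) standing in for the real flights table.
toy = pd.DataFrame({
    "ELAPSED_TIME": [200.0, 150.0, None],
    "AIR_TIME": [170.0, 120.0, None],
    "TAXI_IN": [10.0, 10.0, None],
    "TAXI_OUT": [20.0, 15.0, None],
})
print(elapsed_time_mismatches(toy))
```

On the real data, the same helper could be called as `elapsed_time_mismatches(flights)` once the file is loaded.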

1.2 Loading Data

  • Use the following URL to download the dataset to your local computer: flights.csv
    • This link does NOT work for directly loading the data to Jupyter Notebook.
    • After downloading the file, move the file to the location of your Jupyter Notebook to get started.
  • It will take a bit of time for the dataset to load. Be patient, and don’t run any other cell or rerun the cell while [*] is still present on the left of the cell.
  • You might get a DtypeWarning shown in a red box. No worries! This is just a warning; the dataset will still be read correctly.
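
As a rough sketch, loading the file with pandas might look like the following. The helper name `load_flights` is hypothetical, and passing `low_memory=False` is our assumption about one way to avoid the DtypeWarning: it makes pandas read the whole file before inferring column types, at the cost of more memory.

```python
import pandas as pd


def load_flights(path="flights.csv"):
    """Read the flights CSV into a DataFrame.

    low_memory=False tells pandas to infer each column's type from the
    whole file instead of chunk by chunk, which avoids DtypeWarning.
    """
    return pd.read_csv(path, low_memory=False)


# Usage in the notebook (assumes flights.csv sits next to the notebook):
# flights = load_flights()
# flights.shape
```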

2. Analysis

  • The analysis has 4 parts: 3 hypothesis tests followed by a multiple-comparison correction.
    • Hypothesis test for 2 means
    • Hypothesis test for 2 proportions
    • A third test of your choice, using one of the two methods above
    • Multiple comparisons across the 3 tests above

2.1 Hypothesis Test 1

  • Let’s say we’re interested in the average arrival delay on flights in December from O’Hare International Airport (Chicago) ORD to Los Angeles International Airport LAX.
  • In particular, we want to compare the average arrival delay between American Airlines AA and United Air Lines UA.

Your tasks include:

  • Check if the test assumptions are satisfied (assuming we’re using a z-test to compare the 2 population means).
  • Plot a boxplot of the two samples (arrival delays on flights in December from ORD to LAX on American Airlines flights and on United Air Lines flights).
  • State the hypotheses.
  • Conduct the test at significance level \(\alpha =0.05\).
  • State the conclusion.

Provided code

  • To help you get started, here is the code to select only the values we’re interested in:
# Flights in December from ORD to LAX
route = ((flights['MONTH'] == 12) & (flights['ORIGIN_AIRPORT'] == 'ORD') &
         (flights['DESTINATION_AIRPORT'] == 'LAX'))
# Arrival delays on that route for each airline
american = flights.loc[route & (flights['AIRLINE'] == 'AA'), 'ARRIVAL_DELAY']
united = flights.loc[route & (flights['AIRLINE'] == 'UA'), 'ARRIVAL_DELAY']
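
Once the two samples are selected, the z-test for 2 means could be carried out along the following lines. This is only a sketch using the Python standard library so that each step of the arithmetic is visible. The helper name `two_mean_z_test` is hypothetical, the test is assumed to be two-sided, and the sample values are made up (they are not taken from flights.csv).

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def two_mean_z_test(x, y):
    """Two-sided z-test for H0: mu_x == mu_y.

    Uses the sample variances as stand-ins for the population variances,
    which is only reasonable when both samples are large.
    Returns (z statistic, p-value).
    """
    n1, n2 = len(x), len(y)
    standard_error = sqrt(variance(x) / n1 + variance(y) / n2)
    z = (mean(x) - mean(y)) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Made-up delays in minutes, for illustration only.
sample_a = [1, 2, 3, 4, 5]
sample_b = [3, 4, 5, 6, 7]
z, p = two_mean_z_test(sample_a, sample_b)
print(f"z = {z:.3f}, p-value = {p:.4f}")
```

With the real data, ARRIVAL_DELAY may contain missing values (for example, for cancelled flights), so a call would likely look like `two_mean_z_test(american.dropna().tolist(), united.dropna().tolist())`.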

2.2 Hypothesis Test 2

  • Let’s say that, instead of the average delay, we want to know the probability of arriving over an hour late.
  • That is, we’re interested in the probability that a flight in December from O’Hare International Airport (Chicago) ORD to Los Angeles International Airport LAX is delayed by over an hour.
  • In particular, we want to compare this probability between American Airlines AA and United Air Lines UA.

Your tasks include:

  • Check if the test assumptions are satisfied (assuming we’re using a z-test to compare the 2 population proportions).
  • State the hypotheses.
  • Conduct the test at significance level \(\alpha =0.05\).
  • State the conclusion.
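
The z-test for 2 proportions could be sketched as follows, again with the standard library only. The helper name `two_proportion_z_test` is hypothetical, the test is assumed to be two-sided with a pooled proportion under the null hypothesis, and the counts are made up.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for H0: p1 == p2.

    x1, x2: number of "successes" (e.g. flights over an hour late)
    n1, n2: sample sizes
    Returns (z statistic, p-value).
    """
    p1_hat, p2_hat = x1 / n1, x2 / n2
    # Under H0 both samples share one proportion, estimated by pooling.
    pooled = (x1 + x2) / (n1 + n2)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Made-up counts: 20 of 100 flights late vs. 10 of 100 flights late.
z, p = two_proportion_z_test(20, 100, 10, 100)
print(f"z = {z:.3f}, p-value = {p:.4f}")
```

With the real data, the counts might be obtained with something like `(american.dropna() > 60).sum()` and `american.dropna().size`, assuming ARRIVAL_DELAY is measured in minutes and that rows with a missing delay are left out.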

2.3 Hypothesis Test 3

  • Using the above hypothesis tests as examples, choose 2 populations that you’re interested in comparing.
  • You must use either the hypothesis test for 2 means or the hypothesis test for 2 proportions.
  • State the question that you’re interested in answering by conducting this test.
  • Check if the test assumptions are satisfied.
  • State the hypotheses.
  • Conduct the test at significance level \(\alpha =0.05\).
  • State the conclusion.

2.4 Multiple comparisons

  • Using the Bonferroni correction, conduct the 3 tests above with a family-wise error rate of \(\alpha = 0.05\).
  • State the conclusion for all 3 tests.
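
With the Bonferroni correction, each of the 3 p-values is compared with \(\alpha / 3 = 0.05 / 3 \approx 0.0167\) instead of 0.05. A minimal sketch is shown below; the helper name `bonferroni_reject` and the p-values are made up for illustration.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Return one boolean per test: True if H0 is rejected.

    Each p-value is compared with alpha divided by the number of tests,
    which keeps the family-wise error rate at or below alpha.
    """
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# Made-up p-values for the 3 tests.
example_p_values = [0.01, 0.02, 0.04]
print(bonferroni_reject(example_p_values))
```

Note that a test with a p-value below 0.05 on its own (such as 0.02 or 0.04 here) may no longer be significant after the correction.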

Project Report

There should be 4 sections in the report:

Note that you must explain in text what you are doing at each step. For example, if you’re calculating the test statistic, you must state that in a Markdown cell, followed by the Python code used to compute the test statistic.


Grading Rubric

The project is worth a total of 100 points, broken down by task:


Submissions