Goals

There are several goals for this project:


Timeline


Project

1. Dataset:

In this project, you will be working with a simulated dataset (data created using simulation) of STAT 107 students’ average daily coffee consumption (in oz). Assume that this data is the population you’re interested in.

1.1 Dataset Description

The coffee.csv file has one column:

  • coffee: average daily coffee consumption (in oz)

1.2 Loading Data
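
A minimal sketch of one way to load the data, using only the Python standard library. It assumes that coffee.csv sits in the notebook's working directory and has a header row naming the coffee column; the helper name load_coffee is hypothetical, and the lectures may use a different tool for this step.

```python
import csv

def load_coffee(path="coffee.csv"):
    """Read the `coffee` column of the CSV file into a list of floats."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)  # uses the header row for column names
        return [float(row["coffee"]) for row in reader]
```

Calling `population = load_coffee()` would then give the population values as a list, which is the form the later steps can work with.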


2. Project Setup

  • Inside your NetID folder, create a new folder named project.
  • Inside the project folder, create a new Jupyter Notebook file.
    • Click the New button in the top-right corner (near Upload, below Quit and Logout), then select the Python 3 option.
  • Name the new notebook stat107-project.
    • To do that, click the ‘Untitled’ text at the top, enter the new name, and click the blue Rename button.

3. Simulation Studies

Simulation studies are an important part of statistics. In fact, there is a whole area named simulation-based statistics. In this project, we will use both traditional mathematical statistics (the material we have been learning in our [Stat] lecture) and the “new” simulation-based statistics method (more Python-focused) to compute confidence intervals for a statistic of interest. (I know there are a lot of ‘statistics’ and ‘statistic’ in this paragraph, but hey, that’s why it’s called STAT 107, right?)

The main goal is for you to understand the relationship between a population and a sample, and what it means to compute a confidence interval using a sample.

3.1 Sampling

  • Note that the data provided in coffee.csv is what we assume to be the population data.
  • From this population, use the appropriate sampling function to get a sample of size 40.
    • Think of selecting 40 students from our class.
    • It’s important to determine whether you should be sampling with or without replacement. We (the TAs, CAs and I) cannot answer this question for you.

Your tasks include:

  • Determine the appropriate sampling method and the corresponding sampling function in Python.
  • Set the seed to be your UIN.
    • You can find this on your iCard, as well as in other places such as the Student Self-Service website.
  • Use the sampling function to get a sample of size 40, then store the sample in a variable named my_sample.
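
The mechanics of seeded sampling can be sketched as follows, using the standard library's random module (the course notebooks may use a different library). The helper draw_sample, the toy population, and the seed value are all hypothetical. The with_replacement flag is deliberately left as a required argument: deciding its value is part of the task above.

```python
import random

def draw_sample(population, n, seed, with_replacement):
    """Draw n values from population using a seeded random generator.

    with_replacement=True  -> random.choices (a value can be picked again)
    with_replacement=False -> random.sample  (each unit is picked at most once)
    """
    rng = random.Random(seed)  # same seed -> same sample every run
    if with_replacement:
        return rng.choices(population, k=n)
    return rng.sample(population, n)

# Toy stand-in for the population (NOT the course data).
toy_population = [0, 4, 8, 8, 12, 16, 20, 24, 6, 10] * 10

# Reproducibility check for both sampling modes; 123456789 is a placeholder seed.
for flag in (True, False):
    a = draw_sample(toy_population, 40, seed=123456789, with_replacement=flag)
    b = draw_sample(toy_population, 40, seed=123456789, with_replacement=flag)
    print(flag, len(a), a == b)
```

Setting the seed is what makes the sample reproducible: rerunning the notebook yields the same my_sample, so the intervals computed from it do not change between runs.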

3.2 Compute a CI Using Mathematical Statistics

  • Now, use the sample YOU got from the previous step to compute a 95% confidence interval for the true proportion of STAT 107 students who drink over 8 oz of coffee daily on average.
  • You must use the mathematical formula to compute this interval and clearly state the interval in context of the problem (in words).
  • Then determine if the CI you computed includes the true population proportion of STAT 107 students who drink over 8 oz of coffee daily on average.

Your tasks include:

  • Use Python to compute the lower and upper bounds of a 95% confidence interval for the proportion, using the sample you got from the previous step.
  • State the CI in the context of the problem.
  • Compute the true population proportion of interest.
  • Determine whether the CI you computed covers the true population proportion.
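
As a rough sketch of these tasks, the code below assumes that "the mathematical formula" refers to the usual normal-approximation interval for a proportion, p_hat ± z·sqrt(p_hat(1 − p_hat)/n) with z ≈ 1.96 for 95%; check this against the formula given in lecture. The function names and the toy data are hypothetical and are not the course data.

```python
import math

def proportion_over(values, threshold=8):
    """Fraction of values strictly greater than threshold (in oz)."""
    return sum(x > threshold for x in values) / len(values)

def wald_ci_95(sample, threshold=8, z=1.96):
    """Normal-approximation 95% CI: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    n = len(sample)
    p_hat = proportion_over(sample, threshold)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Toy sample (NOT the course data): 10 of 40 values exceed 8 oz.
toy_sample = [12.0] * 10 + [4.0] * 30
lower, upper = wald_ci_95(toy_sample)
print(round(lower, 3), round(upper, 3))  # 0.116 0.384

# With the whole population available, the true proportion is a direct count.
toy_population = [12.0] * 300 + [4.0] * 700  # hypothetical, true p = 0.3
true_p = proportion_over(toy_population)
print(lower <= true_p <= upper)  # True
```

In words, the toy interval would read: "we are 95% confident that between about 11.6% and 38.4% of students drink over 8 oz of coffee daily on average."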

3.3 Compute a CI Using Bootstrapping

  • Again, use the sample YOU got from the previous step to compute a 95% confidence interval for the true proportion of STAT 107 students who drink over 8 oz of coffee daily on average.
  • But this time, you must use the Bootstrapping method (covered in lecture 10.2).
  • Then determine if the CI you computed includes the true population proportion of STAT 107 students who drink over 8 oz of coffee daily on average.

Your tasks include:

  • Use Python to compute a 95% Bootstrap confidence interval for the proportion, using the sample you got from the previous step.
    • To calculate this CI, please use 5000 iterations! You can use more if you prefer.
  • State the CI in the context of the problem.
  • Determine whether the CI you computed covers the true population proportion.
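
A minimal sketch of one common variant, the percentile bootstrap, is shown here on the assumption that it matches the method from lecture 10.2; if the lecture uses a different bootstrap interval, follow the lecture. The function name, the default seed, and the toy sample are hypothetical.

```python
import random
import statistics

def bootstrap_ci_95(sample, threshold=8, iterations=5000, seed=0):
    """Percentile bootstrap 95% CI for the proportion of values above threshold."""
    rng = random.Random(seed)
    n = len(sample)
    over = [x > threshold for x in sample]  # True/False for each sampled student
    boot_props = []
    for _ in range(iterations):
        # Bootstrap resample: n draws WITH replacement from the sample itself.
        resample = rng.choices(over, k=n)
        boot_props.append(sum(resample) / n)
    # n=40 asks for the cut points between 40 equal-sized groups; the first
    # and last cut points are the 2.5th and 97.5th percentiles.
    cuts = statistics.quantiles(boot_props, n=40)
    return cuts[0], cuts[-1]

# Toy sample (NOT the course data): 10 of 40 values exceed 8 oz.
toy_sample = [12.0] * 10 + [4.0] * 30
print(bootstrap_ci_95(toy_sample))
```

The interval endpoints are simply the middle 95% of the 5000 resampled proportions, so no formula for the standard error is needed.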

3.4 Compare the CI Coverages

  • To assess the effectiveness of a CI method, a single CI is not enough. We need to compute the coverage (the proportion of repeated samples whose CI contains the true population value) for both the CI computed using mathematical statistics AND the CI computed using Bootstrapping.
  • CI coverage was covered in lecture 10.1.

Your tasks include:

  • Compute the CI coverage for the CI computed using mathematical statistics.
    • To calculate the coverage, please use 5,000 iterations! You can use more if you prefer. If your computer has trouble running this many iterations, you can gradually reduce the number to 2500, 2000, or 1000 (but it should not be lower than that).
  • Compute the CI coverage for the CI computed using Bootstrapping.
    • To calculate the coverage, please use 5,000 iterations! You can use more if you prefer. If your computer has trouble running this many iterations, you can gradually reduce the number to 2500, 2000, or 1000 (but it should not be lower than that).
    • Pay attention to the left-hand side panel: if it shows [*] next to the cell, Python has not finished running the code yet. Once it finishes running, you will see a number inside the square brackets.
  • Determine which one has better coverage or if both are equally effective/ineffective for this data.
  • Give your reasoning for why one CI might be better than the other OR both are equally effective OR both are equally ineffective. You should compare the coverage to the confidence level you used to compute the CI.
    • Note that the mathematical statistics CI assumes the population is normal. Is it? A plot might be helpful here!
    • The Bootstrapping method generally performs better when the distribution of the sample is “close” to the population distribution. So, maybe plot a histogram of my_sample and compare it to a histogram of the population? Do they look similar?
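
The coverage computation can be sketched as a loop: draw a fresh sample, build a CI, record whether it contains the true proportion, and report the fraction of hits. The sketch below uses only the standard library; the function names, the toy population, and the seed are hypothetical, and the normal-approximation interval is an assumed stand-in for the lecture's formula. As in 3.1, with_replacement is left as a required argument rather than chosen for you.

```python
import math
import random

THRESHOLD = 8  # oz; "over 8 oz" means strictly greater than 8

def wald_ci_95(sample, z=1.96):
    """Normal-approximation 95% CI for the proportion above THRESHOLD."""
    n = len(sample)
    p_hat = sum(x > THRESHOLD for x in sample) / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def estimate_coverage(population, ci_function, n, iterations,
                      with_replacement, seed=0):
    """Fraction of repeated samples whose CI contains the true proportion."""
    rng = random.Random(seed)
    true_p = sum(x > THRESHOLD for x in population) / len(population)
    hits = 0
    for _ in range(iterations):
        if with_replacement:
            sample = rng.choices(population, k=n)
        else:
            sample = rng.sample(population, n)
        lower, upper = ci_function(sample)
        hits += lower <= true_p <= upper  # True counts as 1
    return hits / iterations

# Toy right-skewed population (NOT the course data), mean about 8 oz.
gen = random.Random(1)
toy_population = [gen.expovariate(1 / 8) for _ in range(1000)]

for flag in (True, False):
    cov = estimate_coverage(toy_population, wald_ci_95, n=40,
                            iterations=5000, with_replacement=flag)
    print(flag, cov)
```

Any function that maps a sample to a (lower, upper) pair can be passed as ci_function, including a bootstrap interval. Note that a bootstrap CI inside this loop means a loop within a loop (for example 5,000 × 5,000 resamples), which is why that part can take a long time to run. A coverage close to 0.95 indicates the method behaves as advertised for this population.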

Updated 1: The number of iterations was updated on 04/06/2021 from 10,000 to 5,000.

Updated 2: More instructions are provided for comparison of CI coverages on 04/07/2021.


Project Report

There should be 4 sections in the report:

Note that you MUST explain in text what you are doing at each step. See the example below (note that this is a screenshot from lecture 9.1; you may or may not need the functions used in this example).


Grading Rubric

[Detailed Rubric]


Submission

    git add -A
    git commit -m "project submission"
    git push origin master