# Goals

There are several goals for this project:

• Practice sampling from a population
• Compute confidence intervals (CIs) using 2 different techniques:
• Mathematical Statistics
• Statistical Bootstrapping
• Compute and understand the confidence interval coverage
• Understand the relationship between population, sample, and statistics

# Timeline

• Project release: Wednesday 03/31/2021.
• Project due date: 11:59 PM Sunday 04/11/2021.
• Late submissions are accepted until 11:59 PM Monday 04/12/2021 with a 20% penalty.

# Project

## 1. Dataset

In this project, you will be working with a simulated dataset (data created using simulation) of STAT 107 students’ average daily coffee consumption (in oz). Assume that this data is the population you’re interested in.

### 1.1 Dataset Description

The coffee.csv file has one column:

• coffee: average daily coffee consumption (in oz)
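As a minimal sketch, the column described above could be read with pandas as shown here. The helper name `load_population` and the tiny inline CSV are assumptions made so the example runs on its own; the values are made up and are not the real data. In your notebook you would pass the path to coffee.csv instead.

```python
import io

import pandas as pd


def load_population(source):
    """Read the coffee data and return the `coffee` column as a Series."""
    df = pd.read_csv(source)
    return df["coffee"]


# Stand-in for coffee.csv so the sketch is self-contained.
# These five values are invented for illustration only.
demo_csv = io.StringIO("coffee\n0.0\n8.5\n12.0\n4.0\n16.0\n")
population = load_population(demo_csv)

print(len(population))           # number of students in the population
print((population > 8).mean())   # proportion drinking more than 8 oz
```

With the real file, the call would look like `load_population("coffee.csv")`, assuming the file sits in the same folder as the notebook.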

## 2. Project Setup

• Inside your NetID folder, create a new folder named project.
• Inside the project folder, create a new Jupyter Notebook file.
• Click the New button in the top-right corner (near Upload, below Quit and Logout), then select the Python 3 option.
• Name the new notebook stat107-project.
• You can do that by clicking on the ‘Untitled’ text at the top, entering the new name, and clicking the blue Rename button.

## 3. Simulation Studies

Simulation studies are an important part of statistics. In fact, there is a whole area named simulation-based statistics. In this project, we will use both traditional mathematical statistics (the material we have been learning in our [Stat] lecture) and the “new” simulation-based statistics method (more Python-focused) to compute confidence intervals for a statistic of interest. (I know there are a lot of ‘statistics’ and ‘statistic’ in this paragraph, but hey, that’s why it’s called STAT 107, right?)

The main goal is for you to understand the relationship between a population and a sample, and what it means to compute a confidence interval using a sample.

### 3.1 Sampling

• Note that we treat the data provided in coffee.csv as the population.
• From this population, use the appropriate sampling function to get a sample of size 40.
• Think of selecting 40 students from our class.
• It’s important to determine whether you should be sampling with or without replacement. We (the TAs, CAs and I) cannot answer this question for you.

• Determine the appropriate sampling method and the corresponding sampling function in Python.
• Set the seed to be your UIN.
• You can find this on your iCard as well as other places like Student’s Self-Service website.
• Use the sampling function to get a sample of size 40, then store the sample in a variable named my_sample.
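The sketch below shows one way the sampling call could look, assuming the population is a pandas Series and that `Series.sample` is the sampling function you choose. The helper `draw_sample`, the simulated stand-in population, and the UIN value are all hypothetical. Both settings of `replace` are shown on purpose: deciding which one is appropriate is part of the project.

```python
import numpy as np
import pandas as pd


def draw_sample(population, n, seed, replace):
    """Draw n values from population; `replace` selects the sampling method."""
    return population.sample(n=n, replace=replace, random_state=seed)


# Stand-in population (simulated here); in the project this is the coffee column.
rng = np.random.default_rng(0)
population = pd.Series(rng.gamma(shape=2.0, scale=4.0, size=500), name="coffee")

MY_UIN = 123456789  # hypothetical placeholder; use your own UIN as the seed

# The two candidate methods. Only one of them matches the scenario of
# selecting 40 students from the class, and that choice is yours to justify.
sample_without = draw_sample(population, n=40, seed=MY_UIN, replace=False)
sample_with = draw_sample(population, n=40, seed=MY_UIN, replace=True)

print(len(sample_without), len(sample_with))
```

Because the seed is fixed, rerunning the cell returns the same 40 rows, which is what makes the report reproducible.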

### 3.2 Compute a CI Using Mathematical Statistics

• Now, use the sample YOU got from the previous step to compute a 95% confidence interval for the true proportion of STAT 107 students who drink over 8 oz of coffee daily on average.
• You must use the mathematical formula to compute this interval and clearly state the interval in context of the problem (in words).
• Then determine if the CI you computed includes the true population proportion of STAT 107 students who drink over 8 oz of coffee daily on average.

• Use Python to compute the lower-bound and upper-bound of a 95% confidence interval for proportion using the sample you got from the previous step.
• State the CI in the context of the problem.
• Compute the true population proportion of interest.
• Determine whether the CI you computed covers the true population proportion.
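A minimal sketch of these steps, assuming the formula from lecture is the usual normal-approximation interval, p_hat ± z * sqrt(p_hat * (1 - p_hat) / n). The function name `proportion_ci`, the made-up sample, and the made-up population proportion are illustrative only; check the formula against your lecture notes before relying on it.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(values, threshold=8, confidence=0.95):
    """Normal-approximation CI for the proportion of values above threshold."""
    n = len(values)
    p_hat = sum(v > threshold for v in values) / n
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, about 1.96 for 95%
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# Made-up sample of 40 values: 16 above 8 oz, 24 at or below it.
demo_sample = [10.0] * 16 + [4.0] * 24
lower, upper = proportion_ci(demo_sample)
print(round(lower, 4), round(upper, 4))

# Coverage check against a (made-up) population proportion.
true_p = 0.41
covers = lower <= true_p <= upper
print(covers)
```

In the project, `demo_sample` would be replaced by `my_sample`, and `true_p` by the proportion computed from the full coffee column.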

### 3.3 Compute a CI Using Bootstrapping

• Again, use the sample YOU got from the previous step to compute a 95% confidence interval for the true proportion of STAT 107 students who drink over 8 oz of coffee daily on average.
• But this time, you must use the Bootstrapping method (covered in lecture 10.2).
• Then determine if the CI you computed includes the true population proportion of STAT 107 students who drink over 8 oz of coffee daily on average.

• Use Python to compute a 95% Bootstrap confidence interval for proportion using the sample you got from the previous step.
• To calculate this CI, please use 5000 iterations! You can use more if you prefer.
• State the CI in the context of the problem.
• Determine whether the CI you computed covers the true population proportion.
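The sketch below assumes the percentile bootstrap: resample the sample many times, compute the proportion in each resample, and take the 2.5th and 97.5th percentiles. The method in lecture 10.2 may differ in its details, and the function name `bootstrap_proportion_ci` and the demo sample are hypothetical.

```python
import numpy as np


def bootstrap_proportion_ci(sample, threshold=8, confidence=0.95,
                            iterations=5000, seed=None):
    """Percentile bootstrap CI for the proportion of values above threshold."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)
    n = len(sample)
    proportions = np.empty(iterations)
    for i in range(iterations):
        # Each resample has the same size as the original sample and is
        # drawn from the sample itself, with replacement.
        resample = rng.choice(sample, size=n, replace=True)
        proportions[i] = np.mean(resample > threshold)
    alpha = 1 - confidence
    lower, upper = np.percentile(
        proportions, [100 * alpha / 2, 100 * (1 - alpha / 2)]
    )
    return lower, upper


# Made-up sample of 40 values: 16 above 8 oz, 24 at or below it.
demo_sample = np.array([10.0] * 16 + [4.0] * 24)
lower, upper = bootstrap_proportion_ci(demo_sample, seed=1)
print(lower, upper)
```

Note that the resampling draws only from the sample, never from the population; the population is used only afterwards, to check whether the interval covers the true proportion.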

### 3.4 Compare the CI Coverages

• In order to assess the effectiveness of a CI computing method, one single CI is not enough. We need to compute the coverage for both the CI computed using mathematical statistics AND the CI computed using Bootstrapping.
• CI coverage was covered in lecture 10.1.

• Compute the CI coverage for the CI computed using mathematical statistics.
• To calculate the coverage, please use 5,000 iterations! You can use more if you prefer. If your computer has trouble running this many iterations, you can gradually reduce the number to 2,500, 2,000, or 1,000 (but it should not be lower than that).
• Compute the CI coverage for the CI computed using Bootstrapping.
• To calculate the coverage, please use 5,000 iterations! You can use more if you prefer. If your computer has trouble running this many iterations, you can gradually reduce the number to 2,500, 2,000, or 1,000 (but it should not be lower than that).
• Pay attention to the left-hand side panel: if it shows [*] next to the cell, Python has not finished running the code yet. Once it finishes running, you will see a number inside the square brackets.
• Determine which one has better coverage or if both are equally effective/ineffective for this data.
• Give your reasoning for why one CI might be better than the other OR both are equally effective OR both are equally ineffective. You should compare the coverage to the confidence level you used to compute the CI.
• Note that the mathematical statistics CI assumes the population is normal. Is it? A plot might be helpful here!
• For the Bootstrapping method, the method generally performs better when the distribution of the sample we got is “close” to the population distribution. So, maybe plot a histogram of my_sample and compare it to a histogram of the population? Do they look similar?
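The coverage computation can be sketched as a loop: draw a new sample from the population, build a CI from it, and record whether the CI contains the true proportion. The sketch below assumes a normal-approximation CI for the mathematical statistics case; the names `wald_ci` and `estimate_coverage` and the simulated stand-in population are hypothetical. The `replace` argument is left explicit and both settings are run, since choosing the sampling method is part of the project.

```python
from statistics import NormalDist

import numpy as np

THRESHOLD = 8  # oz of coffee per day


def wald_ci(sample, confidence=0.95):
    """Normal-approximation CI for the proportion of values above THRESHOLD."""
    sample = np.asarray(sample)
    n = len(sample)
    p_hat = np.mean(sample > THRESHOLD)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


def estimate_coverage(population, ci_function, replace, n=40,
                      iterations=5000, seed=None):
    """Fraction of simulated samples whose CI contains the true proportion."""
    rng = np.random.default_rng(seed)
    population = np.asarray(population)
    true_p = np.mean(population > THRESHOLD)
    hits = 0
    for _ in range(iterations):
        # A fresh sample from the population on every iteration.
        sample = rng.choice(population, size=n, replace=replace)
        lower, upper = ci_function(sample)
        if lower <= true_p <= upper:
            hits += 1
    return hits / iterations


# Stand-in population (simulated); in the project this is the coffee column.
population = np.random.default_rng(0).gamma(shape=2.0, scale=4.0, size=500)

# Both sampling methods are run here only to keep the sketch neutral.
coverages = {
    replace: estimate_coverage(population, wald_ci, replace=replace, seed=1)
    for replace in (False, True)
}
print(coverages)
```

Any function that takes a sample and returns a `(lower, upper)` pair can be passed as `ci_function`, so a bootstrap CI function could be plugged in the same way. That version is much slower, because every one of the coverage iterations runs its own full set of bootstrap resamples. Either result is then compared with the 95% confidence level used to build the intervals.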

Update 1: The number of iterations was changed on 04/06/2021 from 10,000 to 5,000.

Update 2: More instructions for comparing the CI coverages were added on 04/07/2021.

# Project Report

• The report must be written using Jupyter Notebook.
• The report must have a title and the full name of the author.
• Pay attention to grammar, spelling, formatting, etc. This is designed to provide practice for the real world, where you would provide reports to clients or to your boss. Use professional language, write paragraphs of complete sentences, etc.

There should be 4 sections in the report:

• Section 1: Sampling
• Section 2: Confidence Interval Using Mathematical Statistics
• Section 3: Confidence Interval Using Bootstrapping
• Section 4: Compare the CI Coverages

Note that you MUST explain in text what you are doing at each step. See the example below (note that this is a screenshot from lecture 9.1; you may or may not need the functions used in this example).

The project is worth a total of 100 points, broken down by task:

• Correctly select a random sample of 40 students from the dataset: 20 points
• [5] Is the method of sampling correct? (Student should clearly state that we’re sampling with or without replacement)
• [10] Is the function used for sampling correct with the chosen sampling method (with the correct parameters)?
• [5] Is the sample a random sample of size 40 from the population?
• Correctly compute a CI Using Mathematical Statistics: 20 points
• [5] Is the sample proportion correctly calculated?
• [5] Is the alpha level used in the calculation correct?
• [5] Is the lower bound correct?
• [5] Is the upper bound correct?
• Correctly compute a CI Using Bootstrapping: 20 points
• [3] Is the correct number of iterations used?
• [2] Does the re-sample (new sample) have the correct size?
• [3] Is the re-sample correctly selected with the correct sampling method?
• [2] Is the new sample proportion correctly computed?
• [5] Is the lower bound correct?
• [5] Is the upper bound correct?
• Correctly compute and compare CI coverages: 25 points
• Mathematical statistics CI:
• [1] Is the correct number of iterations used for math-stat CI coverage computation?
• [2] Is a new sample correctly selected from the population?
• [5] Is the checking of coverage correct (lower bound, upper bound calculation, and check)?
• [3] Is the coverage correct?
• Bootstrapping CI:
• [1] Is the correct number of iterations used for bootstrapping CI coverage computation?
• [2] Is a new sample correctly selected from the population?
• [5] Is the checking of coverage correct (lower bound, upper bound calculation, and check)?
• [3] Is the coverage correct?
• Compare the coverages:
• [2] Is the correct conclusion drawn from the results?
• [1] Is there a comparison of the coverage to the confidence level used in CI computation?
• [2] Are there logical explanations why one CI is better than the other OR why both CIs are effective/ineffective for this problem?
• Project report format: 15 points
• [5] Does the report have a title, author name, and section headings?
• [5] Is there explanation for each step and each code segment?
• [5] Is the text free of spelling errors?

# Submission

• Submit the project to GitHub using the following commands:
    git add -A
    git commit -m "project submission"
    git push origin master
• After submitting, make sure to double-check that your files were correctly submitted. You can verify by following these steps (similar to checking lab submission):
• You can submit as many times as needed until the deadline!