There are several goals for this project:

- Practice sampling from a population
- Compute confidence intervals (CIs) using 2 different techniques:
  - Mathematical Statistics
  - Statistical Bootstrapping

- Compute and understand the confidence interval coverage
- Understand the relationship between population, sample, and statistics

- Project release: Wednesday 03/31/2021.
- Project due date: **11:59 PM Sunday 04/11/2021**.
  - Late submissions are accepted until 11:59 PM Monday 04/12/2021 with a 20% penalty.

In this project, you will be working with a simulated dataset (data created using simulation) of **STAT 107 students’ average daily coffee consumption (in oz)**. Assume that this data is the **population** you’re interested in.

There is 1 column in the `coffee.csv` file:

`coffee`: average daily coffee consumption (in oz)

- You can access the data using this link: https://stat107.hknguyen.org/files/datasets/coffee.csv
- It’s highly recommended that you load the data to Jupyter Notebook directly using the link.
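
As a minimal sketch of loading the data directly from the link, the snippet below assumes `pandas` is available in your notebook. The helper name `load_coffee` is made up for illustration; it is not something from lecture.

```python
import pandas as pd

# URL given in the project description
COFFEE_URL = "https://stat107.hknguyen.org/files/datasets/coffee.csv"


def load_coffee(source=COFFEE_URL):
    """Read the coffee data (one column named `coffee`) into a DataFrame.

    `source` can be the URL above, a local file path, or a file-like object.
    """
    return pd.read_csv(source)
```

In a notebook cell, `population = load_coffee()` would then read the file from the link, and `population["coffee"]` would be the column of interest.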

- Inside your **NetID** folder, create a new folder named `project`.
- Inside the `project` folder, create a new Jupyter Notebook file.
  - Click the `New` button in the top-right corner (near Upload, below Quit and Logout), then select the `Python 3` option.
- Name the new notebook `stat107-project`.
  - You can do that by clicking on the ‘Untitled’ text at the top, entering a new name, and clicking the blue `Rename` button.

Simulation study is an important part of statistics. In fact, there is a whole area named simulation-based statistics. In this project, we will use both traditional mathematical statistics (the materials we have been learning in our [Stat] lecture) and the “new” simulation-based statistics method (more Python-focused) to compute confidence intervals for a statistic of interest. (I know there are a lot of ‘statistics’ and ‘statistic’ in this paragraph, but hey, that’s why it’s called STAT 107, right?)

The main goal is for you to understand the relationship between a population and a sample, and what it means to compute a confidence interval using a sample.

- Note that the data provided in `coffee.csv` is what we assume to be the **population** data.
- From this population, use the appropriate sampling function to get a sample of **size 40**.
  - Think of selecting 40 students from our class.
  - It’s important to determine whether you should be sampling with or without replacement. We (the TAs, CAs and I) cannot answer this question for you.

- Determine the appropriate sampling method and the corresponding sampling function in Python. **Set the seed to be your UIN**.
  - You can find this on your iCard as well as in other places like the Student Self-Service website.
- Use the sampling function to get a sample of size 40, then store the sample in a variable named `my_sample`.
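
The sketch below shows one possible way to make a sample reproducible with a seed, assuming the population is held in a pandas Series. The helper `draw_sample`, the stand-in population, and the UIN value are all hypothetical, and the `replace` argument is deliberately left for you to decide. The loop only demonstrates that either choice is reproducible with a fixed seed.

```python
import numpy as np
import pandas as pd


def draw_sample(population, n, replace, seed):
    """Draw n values from a pandas Series; the same seed returns the same sample."""
    return population.sample(n=n, replace=replace, random_state=seed)


# Stand-in population so the sketch runs offline (NOT the real coffee data).
rng = np.random.default_rng(0)
demo_population = pd.Series(rng.gamma(2.0, 4.0, size=500), name="coffee")

MY_UIN = 123456789  # hypothetical placeholder; use your own UIN

# Both settings are shown on purpose: choosing between them is part of the project.
for replace in (True, False):
    first = draw_sample(demo_population, 40, replace, MY_UIN)
    second = draw_sample(demo_population, 40, replace, MY_UIN)
    print(f"replace={replace}: same sample both times? {first.equals(second)}")
```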

- Now, use the sample YOU got from the previous step to compute a **95%** confidence interval for the true **proportion of STAT 107 students who drink over 8 oz of coffee daily on average**.
- You must use the mathematical formula to compute this interval and clearly state the interval in the context of the problem (in words).
- Then determine if the CI you computed includes the true **population** proportion of STAT 107 students who drink over 8 oz of coffee daily on average.

- Use Python to compute the lower-bound and upper-bound of a 95% confidence interval for proportion using the sample you got from the previous step.
- State the CI in the context of the problem.
- Compute the true population proportion of interest.
- Determine whether the CI you computed covers the true population proportion.
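
As a rough sketch of the steps above, the code below assumes the usual normal-approximation interval for a proportion, p̂ ± z·sqrt(p̂(1 − p̂)/n). The formula from lecture may be written differently, so check it against your notes. The function names `proportion_ci` and `covers` are made up for illustration.

```python
import math
from statistics import NormalDist

import numpy as np


def proportion_ci(sample, threshold=8.0, level=0.95):
    """Normal-approximation CI for the proportion of values above `threshold`."""
    values = np.asarray(sample, dtype=float)
    n = values.size
    p_hat = float(np.mean(values > threshold))  # sample proportion
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for 95%
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


def covers(interval, true_value):
    """Return True if the interval contains the true population value."""
    lower, upper = interval
    return lower <= true_value <= upper
```

With your own data, the true population proportion would be something like `np.mean(population["coffee"] > 8)`, assuming the population DataFrame is named `population`.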

- Again, use the sample YOU got from the previous step to compute a **95%** confidence interval for the true **proportion of STAT 107 students who drink over 8 oz of coffee daily on average**.
- But this time, you must use the **Bootstrapping method** (covered in lecture 10.2).
- Then determine if the CI you computed includes the true **population** proportion of STAT 107 students who drink over 8 oz of coffee daily on average.

- Use Python to compute a 95% **Bootstrap** confidence interval for proportion using the sample you got from the previous step.
  - To calculate this CI, please use **5000** iterations! You can use more if you prefer.
- State the CI in the context of the problem.
- Determine whether the CI you computed covers the true population proportion.
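
One common way to build a Bootstrap interval is the percentile method, sketched below with NumPy. This is an assumption about the variant; lecture 10.2 may present the procedure differently, so treat it as an illustration only. The function name `bootstrap_proportion_ci` and its `seed` parameter are hypothetical.

```python
import numpy as np


def bootstrap_proportion_ci(sample, threshold=8.0, level=0.95,
                            iterations=5000, seed=0):
    """Percentile bootstrap CI for the proportion of values above `threshold`."""
    values = np.asarray(sample, dtype=float)
    n = values.size
    rng = np.random.default_rng(seed)
    proportions = np.empty(iterations)
    for i in range(iterations):
        # Each bootstrap resample has the same size as the original sample
        # and is drawn from the sample itself with replacement.
        resample = rng.choice(values, size=n, replace=True)
        proportions[i] = np.mean(resample > threshold)
    alpha = 1 - level
    lower, upper = np.percentile(
        proportions, [100 * alpha / 2, 100 * (1 - alpha / 2)]
    )
    return float(lower), float(upper)
```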

- In order to assess the effectiveness of a CI computing method, one single CI is not enough. We need to compute the coverage for both the CI computed using mathematical statistics AND the CI computed using Bootstrapping.
  - CI coverage was covered in lecture 10.1.

- Compute the CI coverage for the CI computed using mathematical statistics.
  - To calculate the coverage, please use **5,000** iterations! You can use more if you prefer. If your computer has trouble running this many iterations, you can gradually reduce the number to 2500, 2000, or 1000 (but it should not be lower than that).
- Compute the CI coverage for the CI computed using Bootstrapping.
  - To calculate the coverage, please use **5,000** iterations! You can use more if you prefer. If your computer has trouble running this many iterations, you can gradually reduce the number to 2500, 2000, or 1000 (but it should not be lower than that).
  - Pay attention to the left-hand side panel: if it shows `[*]` next to the cell, that means Python has not finished running the code yet. Once it finishes running, you will see a number inside the square brackets.
- Determine which one has better coverage or if both are equally effective/ineffective for this data.
- Give your reasoning for why one CI might be better than the other OR both are equally effective OR both are equally *ineffective*. You should compare the coverage to the confidence level you used to compute the CI.
  - Note that the mathematical statistics CI assumes the population is normal. Is it? A plot might be helpful here!
  - For the Bootstrapping method, the method generally performs better when the distribution of the sample we got is “close” to the population distribution. So, maybe plot a histogram of `my_sample` and compare it to a histogram of the population? Do they look similar?
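
The coverage computation described above can be sketched as follows: draw many samples from the population, build a CI from each one, and record the fraction of intervals that contain the true proportion. This is a rough illustration, not the procedure from lecture 10.1. The names `wald_ci` and `estimate_coverage` are made up, and `replace` is a required argument because choosing the sampling method is part of the project.

```python
import math
from statistics import NormalDist

import numpy as np

THRESHOLD = 8.0  # oz of coffee, from the project description


def wald_ci(sample, level=0.95):
    """Normal-approximation CI for the proportion of values above THRESHOLD."""
    values = np.asarray(sample, dtype=float)
    p_hat = float(np.mean(values > THRESHOLD))
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / values.size)
    return p_hat - margin, p_hat + margin


def estimate_coverage(population, ci_fn, replace, n=40, iterations=5000, seed=0):
    """Fraction of CIs (one per simulated sample) that contain the true proportion."""
    values = np.asarray(population, dtype=float)
    true_p = float(np.mean(values > THRESHOLD))
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(iterations):
        sample = rng.choice(values, size=n, replace=replace)
        lower, upper = ci_fn(sample)
        hits += int(lower <= true_p <= upper)
    return hits / iterations
```

Any function that takes a sample and returns `(lower, upper)` can be passed as `ci_fn`, so the same loop could be reused with a Bootstrap interval. Keep in mind that this nests one loop inside another and will take much longer to run.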

**Updated 1**: The number of iterations was updated on 04/06/2021 from 10,000 to 5,000.

**Updated 2**: More instructions are provided for comparison of CI coverages on 04/07/2021.

- The report must be written using Jupyter Notebook.
- The report must have a title and the full name of the author.
- Pay attention to grammar, spelling, formatting, etc. This is designed to provide practice for the real world, where you would provide reports to clients or to your boss. Use professional language, write paragraphs of complete sentences, etc.

There should be 4 sections in the report:

- Section 1: Sampling
- Section 2: Confidence Interval Using Mathematical Statistics
- Section 3: Confidence Interval Using Bootstrapping
- Section 4: Compare the CI Coverages

Note that you MUST explain in text at each step what you are doing. See the example below (note that this is a screenshot from lecture 9.1; you might or might not need the functions used in this example).

- Submit the project to GitHub using the following commands:

```
git add -A
git commit -m "project submission"
git push origin master
```

- After submitting, make sure to double-check that your files were correctly submitted. You can verify by following these steps (similar to checking a lab submission):
  - Visit https://github-dev.cs.illinois.edu/stat107-sp21/
  - Click on your NetID.
  - Click on the `project` folder, then open the `project.ipynb` file to view your project report.

- You can submit as many times as needed until the deadline!