# Bootstrapping

Ha Khanh Nguyen (hknguyen)

## 1. Sampling With Replacement

• random.choices() allows us to sample values from a "population" with replacement.
• "population" here might not be the actual population that we're interested in.
• In this context, "population" just means possible values.
• Let's say we're interested in the amount of daily coffee consumption of U of I undergraduate students.
• Assume we have a sample of size 10:
• Get a sample of size 10 from this sample by sampling with replacement.
• This is called re-sampling.
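
A minimal sketch of the re-sampling step described above. The values in `coffee_sample` are hypothetical stand-ins (the original sample is not reproduced here), and the seed is fixed only so the sketch is repeatable.

```python
import random

random.seed(0)  # fixed seed so repeated runs give the same re-sample

# Hypothetical daily coffee consumption for 10 students (NOT the lecture's data)
coffee_sample = [0, 8, 12, 0, 16, 4, 8, 24, 0, 12]

# Draw 10 values from the sample, with replacement:
# the same observation may be picked more than once.
resample = random.choices(coffee_sample, k=len(coffee_sample))
print(resample)
```

Because the draws are made with replacement, a re-sample usually repeats some observations and omits others, which is what makes each "new" sample different from the original.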

## 2. Confidence Interval for Population Mean

• There are many cases where the usual normal-based confidence interval for the population mean cannot be trusted:
• The population is not normal and the sample size is small (< 30).
• What do we do in those cases?
• One solution is Nonparametric Bootstrapping.

### Average daily coffee consumption

• Come back to the example above where we're interested in the average amount of daily coffee consumption of U of I undergraduate students.
• The population distribution is definitely not normal.
• The sample size is 10, too small.

### The Idea of Bootstrapping

• In an ideal world, we would go get more samples from the population.
• That is not always possible. So instead, we resample from our sample to get "new" samples.
• With these "new" samples, we compute the sample means.
• Together, these sample means give us an idea of what the true sampling distribution of the sample mean looks like!
• We want a 95% confidence interval for the population mean $\mu$.
• A 95% CI for the population mean is (4.8, 14.2).
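
The procedure above can be sketched as a percentile bootstrap. This is an illustration under stated assumptions, not the lecture's original code: `coffee_sample` holds hypothetical values, `n_resamples` is an arbitrary choice, and the interval it prints will not match the (4.8, 14.2) reported above.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical sample of size 10 (NOT the lecture's data)
coffee_sample = [0, 8, 12, 0, 16, 4, 8, 24, 0, 12]

n_resamples = 10_000  # assumed number of bootstrap re-samples
boot_means = []
for _ in range(n_resamples):
    # Re-sample with replacement, keeping the original sample size
    resample = random.choices(coffee_sample, k=len(coffee_sample))
    boot_means.append(sum(resample) / len(resample))

# Percentile method: the middle 95% of the bootstrap means
boot_means.sort()
lower = boot_means[int(0.025 * n_resamples)]      # about the 2.5th percentile
upper = boot_means[int(0.975 * n_resamples) - 1]  # about the 97.5th percentile
print(f"95% bootstrap CI for the mean: ({lower:.1f}, {upper:.1f})")
```

The sorted list of bootstrap means plays the role of the sampling distribution, so cutting 2.5% off each tail leaves an approximate 95% interval.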

## 3. Confidence Interval for Population Proportion

• Just as with the population mean, we can use nonparametric bootstrapping to estimate a CI for the population proportion $p$.
• Remember the faithful dataset that we worked with at the beginning of the semester?
• I'm going to see the Old Faithful Geyser during winter break.
• I want to compute a 99% confidence interval for the probability that I have to wait for more than an hour to see an eruption!
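
A sketch of how the same idea applies to a proportion. The `waiting` list below is a short illustrative set of waiting times in minutes, assumed for this example rather than loaded from the faithful dataset; the statistic is the fraction of waits longer than 60 minutes.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

# Illustrative waiting times in minutes (an assumption, NOT the full faithful data)
waiting = [79, 54, 74, 62, 85, 55, 88, 85, 51, 85,
           54, 84, 78, 47, 83, 52, 62, 84, 52, 79]

def prop_over_hour(values):
    """Fraction of waiting times longer than 60 minutes."""
    return sum(v > 60 for v in values) / len(values)

n_resamples = 10_000  # assumed number of bootstrap re-samples
boot_props = sorted(
    prop_over_hour(random.choices(waiting, k=len(waiting)))
    for _ in range(n_resamples)
)

# Percentile method: the middle 99% of the bootstrap proportions
lower = boot_props[int(0.005 * n_resamples)]      # about the 0.5th percentile
upper = boot_props[int(0.995 * n_resamples) - 1]  # about the 99.5th percentile
print(f"99% bootstrap CI for p: ({lower:.2f}, {upper:.2f})")
```

Only the statistic and the tail cut-offs change relative to the mean: a 99% interval leaves 0.5% in each tail.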

## 4. Confidence Interval for Any Statistic!

• With nonparametric bootstrapping, we can compute a confidence interval for any statistic; we are not restricted to the mean or a proportion!

### How long is a Trump tweet?

• President Trump's use of Twitter sparked an interesting analysis by David Robinson, who is currently the Principal Data Scientist at Heap and a former Data Scientist at Stack Overflow. The dataset we will be looking at today comes from one of his analyses.
• Since this distribution seems to be very skewed, the median might be a better statistic than the mean.
• Let's find a CI for the population median length of Trump's tweets!
• Compute an 87% confidence interval for the population median.
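
Since the recipe is the same for every statistic, it can be wrapped in one function. The sketch below is a hypothetical helper, not code from the lecture: `bootstrap_ci`, its parameters, and the `tweet_lengths` values (character counts) are all assumptions made for illustration.

```python
import random
import statistics

def bootstrap_ci(data, statistic, level, n_resamples=10_000, seed=0):
    """Percentile bootstrap CI for an arbitrary statistic (hypothetical helper)."""
    rng = random.Random(seed)  # private generator so results are repeatable
    boot_stats = sorted(
        statistic(rng.choices(data, k=len(data)))  # re-sample with replacement
        for _ in range(n_resamples)
    )
    alpha = 1 - level  # total probability left in the two tails
    lower_idx = int(alpha / 2 * n_resamples)
    upper_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return boot_stats[lower_idx], boot_stats[upper_idx]

# Hypothetical tweet lengths in characters (NOT the real dataset)
tweet_lengths = [140, 23, 138, 95, 140, 67, 112, 139, 45, 140,
                 130, 88, 136, 140, 101]

# An 87% interval leaves 6.5% in each tail
lower, upper = bootstrap_ci(tweet_lengths, statistics.median, level=0.87)
print(f"87% bootstrap CI for the median: ({lower}, {upper})")
```

Passing `statistics.median` as an argument is the only thing specific to this example; any function that maps a list of values to a number could be used in its place.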