People are often confused by confidence intervals, and even dyed-in-the-wool statisticians get the definition mixed up from time to time. However, I’ve found that if you derive them with bootstrapping, their meaning becomes crystal clear.

Before we start, I want to point out that there are a lot of ways to create confidence intervals. Bootstrapping is just one of them; more often, people create them using formulas. The formulas are fine, but I don’t think they are as easy to understand.

If you can’t remember bootstrapping, I review it briefly in the video for confidence intervals, or you can read my post on the standard error.

The basic idea is that you take a sample of something, apply bootstrapping to it, and then create an interval that covers 95% of the bootstrapped means. That’s all! Here are some figures to illustrate it.

First, take a sample. Here’s a sample of weights from 12 female mice.

Bootstrap the sample.

Create an interval that covers 95% of the bootstrapped means (this will be a 95% confidence interval).
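
The three steps above can be sketched in R. This is only a minimal sketch, not the code behind the figures: the 12 mouse weights are simulated with rnorm() using made-up values (mean 25, sd 3), and I’m assuming the interval is taken as the middle 95% of the bootstrapped means using quantile().

[sourcecode language="R"]
set.seed(42) ## only here so the sketch is reproducible

## Step 1: take a sample
## (a simulated stand-in for the weights of 12 female mice)
weights <- rnorm(12, mean=25, sd=3)

## Step 2: bootstrap the sample
## (resample with replacement and keep the mean each time)
num.loops <- 10000
boot.means <- replicate(num.loops, mean(sample(weights, replace=TRUE)))

## Step 3: create an interval that covers the middle 95% of the
## bootstrapped means
ci <- quantile(boot.means, probs=c(0.025, 0.975))
ci
[/sourcecode]

The more bootstrap loops you run, the less the ends of the interval wobble from one run to the next.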

Now that we know what a confidence interval is, why should we care? Well, I like confidence intervals because they let us do statistics using pictures, rather than equations. It’s nice when you can just look at something and say, “The p-value for this hypothesis is less than 0.05, thus, we will reject the hypothesis” without having to rely on a single equation.

Here’s an example. In that figure, all values outside of the 95% confidence interval occurred less than 5% of the time. Thus, the hypothesis that the true population mean takes on any of the values outside of the confidence interval has a p-value < 0.05.
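
If you want to draw this kind of picture yourself, here is a rough sketch. The data are simulated, the hypothesized mean of 30 is a made-up value for illustration, and is.rejected is just a name I picked; the check simply asks whether the hypothesized value lands outside the interval.

[sourcecode language="R"]
set.seed(42) ## only here so the sketch is reproducible

## simulated sample and its bootstrapped means
weights <- rnorm(12, mean=25, sd=3)
boot.means <- replicate(10000, mean(sample(weights, replace=TRUE)))
ci <- quantile(boot.means, probs=c(0.025, 0.975))

## a made-up hypothesis about the true mean
hypothesized.mean <- 30

## draw the bootstrapped means, the 95% interval (red) and the
## hypothesized mean (dashed blue)
hist(boot.means, breaks=50, main="Bootstrapped means",
     xlab="mean weight", xlim=range(c(boot.means, hypothesized.mean)))
abline(v=ci, col="red", lwd=2)
abline(v=hypothesized.mean, col="blue", lty=2)

## TRUE if the hypothesized mean is outside the 95% interval
is.rejected <- hypothesized.mean < ci[[1]] || hypothesized.mean > ci[[2]]
is.rejected
[/sourcecode]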

The standard error isn’t just some error that we all make (as in “Dude, did you see Jimmy at the bar last night? He made the standard error…”). It’s a measure of how much we can expect the mean to change if we were to re-do an experiment.

Let’s start by thinking about error bars:

We’ve all seen error bars in plots before. The most common types of error bars are standard deviations, standard errors and confidence intervals. Standard deviations describe how the data that you collected varies from one measurement to the next. In contrast, standard errors describe how much the mean value might change if we did the whole experiment over again. Whoa!!! I know it sounds crazy, so let’s talk about it.

First, let’s consider a simple (but expensive and time-consuming) way to calculate the standard error. There’s also a simple formula for calculating the standard error, but it won’t help us understand what’s really going on, so we’ll ignore it for now.

Assume we took a sample of 5 measurements. From these 5 measurements, we can easily calculate (and plot) the mean.

Now, we can easily imagine that if we took another sample of 5 measurements, the mean might be slightly different from the one we got from the first sample. So let’s take a bunch of samples.

After taking a bunch of samples, we can calculate (and plot) the mean for each one. In the figure we can see that the means are more tightly clustered together than the original measurements. This is because it’s much more common to get a single extreme value in a single sample than to get an entire sample’s worth of extreme values, which is what we would need to get an extreme value for a mean. Now that we have calculated the means, we can calculate their standard deviation.

The standard error is just the standard deviation of all those means; it describes how much the values for the mean differ across a bunch of samples.
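
Here is a small sketch of that “expensive” approach in R. Since we can’t actually re-do an experiment on the computer, I’m assuming a made-up population (normal, mean 10, sd 2) and drawing 1000 samples of 5 measurements from it.

[sourcecode language="R"]
set.seed(1) ## only here so the sketch is reproducible

num.samples <- 1000
sample.size <- 5

## each column is one sample of 5 measurements from the made-up
## population
samples <- replicate(num.samples, rnorm(sample.size, mean=10, sd=2))

## calculate the mean of each sample
sample.means <- colMeans(samples)

## the individual measurements are spread out...
sd(as.vector(samples))

## ...but the means are more tightly clustered. The standard
## deviation of the means is the standard error.
std.err <- sd(sample.means)
std.err
[/sourcecode]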

Now, like I said before, there’s a simple formula for calculating the standard error, but there’s also a simple, time and cost effective way to do it without having to use that formula. This method is called “bootstrapping”. What’s cool about bootstrapping is that you can use it to calculate the standard error of anything, not just the mean, even if there isn’t a nice formula for it. For example, if you had calculated the median instead of the mean, you’d be in deep trouble if you needed to calculate its standard error and you didn’t have bootstrapping.
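
To give a flavor of the median case, here is a minimal sketch with simulated data. It resamples with sample() and replace=TRUE, and calculates median() each time; the variable names are just ones I made up for this example.

[sourcecode language="R"]
set.seed(7) ## only here so the sketch is reproducible

## some simulated data
data <- rnorm(20)

## bootstrap the data and calculate the median each time
num.loops <- 1000
boot.medians <- replicate(num.loops, median(sample(data, replace=TRUE)))

## the standard deviation of the bootstrapped medians is the
## bootstrapped standard error of the median
median.stderr <- sd(boot.medians)
median.stderr
[/sourcecode]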

Bootstrapping is introduced in the video, and in the next StatQuest, we’ll use it to calculate confidence intervals, which are super cool since they let you compare samples visually.

If you want to try out bootstrapping yourself, here’s some R code:

[sourcecode language="R"]
## bootstrap demo…

# Step 1: get some data
# (in this case, we will just generate 20 random numbers)
n <- 20
data <- rnorm(n)

## Step 2: plot the data
stripchart(data, pch=3)

## Step 3: calculate the mean and put it on the plot
data.mean <- mean(data)
abline(v=data.mean, col="blue", lwd=2)

## Step 4: calculate the standard error of the mean using the
## standard formula (there is a standard formula for the standard
## error of the mean, but not for a lot of other things). We use
## the standard formula in this example to show you that the
## bootstrap method gets about the same values.
data.stderr <- sd(data)/sqrt(n)
data.stderr ## this just prints out the calculated standard error

## Step 5: Plot 2*data.stderr lines +/- the mean
## 2 times the standard error +/- the mean is a quick and dirty
## approximation of a 95% confidence interval
abline(v=(data.mean)+(2*data.stderr), col="red")
abline(v=(data.mean)-(2*data.stderr), col="red")

## Step 6: Use "sample()" to bootstrap the data and calculate a lot of
## bootstrapped means.
num.loops <- 1000
boot.means <- vector(mode="numeric", length=num.loops)
for(i in 1:num.loops) {
  ## resample n values from the data, with replacement
  boot.data <- sample(data, size=n, replace=TRUE)
  boot.means[i] <- mean(boot.data)
}

## Step 7: Calculate the standard deviation of the bootstrapped means.
## This is the bootstrapped standard error of the mean
boot.stderr <- sd(boot.means)
boot.stderr

## Step 8: Plot 2*boot.stderr +/- the mean.
## Notice that the 2*data.stderr lines and the 2*boot.stderr lines
## nearly overlap.
abline(v=(data.mean)+(2*boot.stderr), col="green")
abline(v=(data.mean)-(2*boot.stderr), col="green")
[/sourcecode]