We’ve all seen heatmaps a million times before, but a lot of arbitrary decisions go into drawing them. In this video, I cover the various decisions you make when creating a heatmap using hierarchical clustering.
By request, here are the PowerPoint slides. These are for your personal use. Instead of sharing them with your friends, please ask your friends to watch the video, because view and subscriber metrics are very important to me when it comes to reporting how many people use StatQuest to understand stats.
Science, like most things, isn’t perfect. Because individual mice (or other biological samples, like humans) are different and because the way we measure things isn’t perfect, every time we do an experiment, we get slightly different results.
We cope with this variation by doing the same experiment a bunch of times. Once we’ve collected a lot of samples, we typically calculate the average and the standard deviation. If we’re worried about the biological variation (the variation from each individual mouse or human) we do lots of biological replicates, and if we’re worried about technical variation, we do technical replicates (remeasure the same individual multiple times). Often we’ll do both.
For simple and inexpensive experiments, it’s easy to do lots of replicates, both biological and technical. However, RNA-seq is time-consuming and expensive, and we want to squeeze as much as we can out of the samples that we have (which can be very limited!). The good news is that because of the way technical variation affects RNA-seq results, we only need to do biological replicates with this type of experiment.
Here’s an example of biological variation alone (imagine there is no technical variation). In this example, we’re measuring read counts for an imaginary gene “X”. Each mouse has either more or less than the average for all mice. If we do enough samples, our estimate of the average of all mice will be pretty good because the measurements that are above the mean will be canceled out by the measurements that are below the mean.
This figure shows how biological variation gives us measurements that are always a little larger or smaller than the mean.
When we add technical variation to the mix, RNA-seq will result in random values that either add or subtract read counts from each mouse’s measurement. Here’s an example where I’ve colored the biological variation orange and the technical variation green.
This figure shows how technical variation can increase or decrease the read counts we get for each sample. It also shows the equation for the average read count.
When we calculate the average measurement from our samples, we end up with a term for biological variation and a term for technical variation. Just like before, the term for biological variation will go to zero as we add samples, because the positive values will be canceled out by the negative values. Likewise, because the technical variation is similarly distributed, it will cancel itself out for the same reason. Thus, without doing a single technical replicate (and only doing biological replicates), we can reduce the effects of both biological and technical variation and get a good estimate of the mean.
This last figure shows how the biological and technical variation can be separated into separate terms. Each of these terms goes to zero as we add more biological replicates.
One problem that is often overlooked is that technical replicates alone will only eliminate the term for technical variation. The term for biological variation will only be reinforced, and this can lead to misleading results. See the video for more details.
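If you want to convince yourself of this with a quick simulation, here is a minimal R sketch. The true mean, the number of mice, and the amounts of biological and technical variation are all made up purely for illustration.
[sourcecode language="R"]
## A made-up simulation of read counts for an imaginary gene "X":
## each biological replicate is the true mean plus biological noise
## plus technical noise.
set.seed(42)
true.mean <- 1000 ## the "real" average read count (made up)
n.mice <- 20      ## number of biological replicates (made up)

bio.noise <- rnorm(n.mice, mean=0, sd=100)  ## mouse-to-mouse variation
tech.noise <- rnorm(n.mice, mean=0, sd=50)  ## measurement variation

measurements <- true.mean + bio.noise + tech.noise

## the average of the measurements is the true mean plus the average
## biological noise plus the average technical noise...
mean(measurements)
## ...and both noise terms shrink toward 0 as we add more mice
mean(bio.noise)
mean(tech.noise)
[/sourcecode]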
Update: A lot of people ask to see example code that shows how to actually do PCA. Here’s how to do it in R:
[sourcecode language="R"]
## In this example, the data is in a matrix called
## data.matrix
## columns are individual samples (i.e. cells)
## rows are measurements taken for all the samples (i.e. genes)
## Just for the sake of the example, here’s some made up data…
data.matrix <- matrix(nrow=100, ncol=10)
colnames(data.matrix) <- c(
  paste("wt", 1:5, sep=""),
  paste("ko", 1:5, sep=""))
rownames(data.matrix) <- paste("gene", 1:100, sep="")

## fill in the matrix with made up (Poisson distributed) read counts
for (i in 1:100) {
  wt.values <- rpois(5, lambda=sample(x=10:1000, size=1))
  ko.values <- rpois(5, lambda=sample(x=10:1000, size=1))
  data.matrix[i,] <- c(wt.values, ko.values)
}
head(data.matrix)

## do PCA; prcomp() expects the samples to be rows, so transpose the matrix
pca <- prcomp(t(data.matrix), scale=TRUE)

## plot PC1 vs PC2
plot(pca$x[,1], pca$x[,2], xlab="PC1", ylab="PC2")

## get the name of the sample (cell) with the highest PC1 value
rownames(pca$x)[order(pca$x[ ,1], decreasing=TRUE)[1]]
## get the name of the top 10 measurements (genes) that contribute
## most to pc1.
loading_scores <- pca$rotation[,1]
gene_scores <- abs(loading_scores) ## get the magnitudes
gene_score_ranked <- sort(gene_scores, decreasing=TRUE)
top_10_genes <- names(gene_score_ranked[1:10])
top_10_genes ## show the names of the top 10 genes
## get the scree information
pca.var <- pca$sdev^2
scree <- pca.var/sum(pca.var)
plot((scree[1:10]*100), main="Scree Plot", xlab="Principal Component", ylab="Percent Variation")
[/sourcecode]
RNA-seq results often contain a PCA (Principal Component Analysis) or MDS plot. Usually we use these graphs to verify that the control samples cluster together. However, there’s a lot more going on, and if you are willing to dive in, you can extract a lot more information from these plots. The good news is that PCA only sounds complicated. Conceptually, it’s actually quite simple.
In this video I clearly explain how PCA graphs are generated, how to interpret them, and how to determine if the plot is informative or not. I then show how you can extract extra information out of the data used to draw the graph.
Warning: This StatQuest is specifically for people who do high-throughput RNA sequencing (RNA-seq). If that’s what you’re interested in, quest on!
It used to be that when you did RNA-seq, you reported your results in RPKM (Reads Per Kilobase Million) or FPKM (Fragments Per Kilobase Million). However, TPM (Transcripts Per Million) is now becoming quite popular. Since there seems to be a lot of confusion about these terms, I thought I’d use a StatQuest to clear everything up.
These three metrics attempt to normalize for sequencing depth and gene length. Here’s how you do it for RPKM:
Count up the total reads in a sample and divide that number by 1,000,000 – this is our “per million” scaling factor.
Divide the read counts by the “per million” scaling factor. This normalizes for sequencing depth, giving you reads per million (RPM).
Divide the RPM values by the length of the gene, in kilobases. This gives you RPKM.
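For example, here is a minimal R sketch of those steps. The counts matrix and the gene lengths are made up purely for illustration (genes in rows, samples in columns, lengths in base pairs).
[sourcecode language="R"]
## made-up read counts for 3 genes in 2 samples
counts <- matrix(c(10, 20, 30, 12, 25, 60), nrow=3,
  dimnames=list(paste("gene", 1:3, sep=""), c("sample1", "sample2")))
gene.lengths <- c(2000, 4000, 1000) ## made-up gene lengths, in base pairs

## 1) the "per million" scaling factor for each sample
per.million <- colSums(counts) / 1e6

## 2) divide the read counts by the scaling factor to get RPM
rpm <- sweep(counts, 2, per.million, FUN="/")

## 3) divide the RPM values by gene length in kilobases to get RPKM
rpkm <- rpm / (gene.lengths / 1000)
rpkm
[/sourcecode]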
FPKM is very similar to RPKM. RPKM was made for single-end RNA-seq, where every read corresponded to a single fragment that was sequenced. FPKM was made for paired-end RNA-seq. With paired-end RNA-seq, two reads can correspond to a single fragment, or, if one read in the pair did not map, one read can correspond to a single fragment. The only difference between RPKM and FPKM is that FPKM takes into account that two reads can map to one fragment (and so it doesn’t count this fragment twice).
TPM is very similar to RPKM and FPKM. The only difference is the order of operations. Here’s how you calculate TPM:
Divide the read counts by the length of each gene in kilobases. This gives you reads per kilobase (RPK).
Count up all the RPK values in a sample and divide this number by 1,000,000. This is your “per million” scaling factor.
Divide the RPK values by the “per million” scaling factor. This gives you TPM.
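Here is the same kind of sketch for TPM, reusing the made-up counts and gene.lengths from the RPKM example above.
[sourcecode language="R"]
## 1) divide the read counts by gene length in kilobases to get RPK
rpk <- counts / (gene.lengths / 1000)

## 2) the "per million" scaling factor, this time based on the RPK values
per.million <- colSums(rpk) / 1e6

## 3) divide the RPK values by the scaling factor to get TPM
tpm <- sweep(rpk, 2, per.million, FUN="/")
tpm
[/sourcecode]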
So you see, when calculating TPM, the only difference is that you normalize for gene length first, and then normalize for sequencing depth second. However, the effects of this difference are quite profound.
When you use TPM, the sum of all TPMs in each sample is the same. This makes it easier to compare the proportion of reads that mapped to a gene in each sample. In contrast, with RPKM and FPKM, the sum of the normalized reads in each sample may be different, and this makes it harder to compare samples directly.
Here’s an example. If the TPM for gene A in Sample 1 is 3.33 and the TPM in Sample 2 is 3.33, then I know that the exact same proportion of total reads mapped to gene A in both samples. This is because the sum of the TPMs in both samples always adds up to the same number (so the denominator required to calculate the proportions is the same, regardless of which sample you are looking at).
With RPKM or FPKM, the sum of normalized reads in each sample can be different. Thus, if the RPKM for gene A in Sample 1 is 3.33 and the RPKM in Sample 2 is 3.33, I would not know if the same proportion of reads in Sample 1 mapped to gene A as in Sample 2. This is because the denominator required to calculate the proportion could be different for the two samples.
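You can see this directly with the two little made-up matrices from the sketches above.
[sourcecode language="R"]
colSums(tpm)  ## every sample adds up to the same total (1,000,000)
colSums(rpkm) ## the totals can differ from sample to sample
[/sourcecode]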
People are often confused by confidence intervals, and even dyed-in-the-wool statisticians get the definition mixed up from time to time. However, I’ve found that if you derive them with bootstrapping, their meaning becomes crystal clear.
However, before we start, I want to say that there are a lot of ways to create confidence intervals. Bootstrapping is just one way, but usually people create them using formulas. The formulas are fine, but I don’t think they are as easy to understand.
If you can’t remember bootstrapping, I review it briefly in the video for confidence intervals, or you can read my post on the standard error.
The basic idea is that you take a sample of something, apply bootstrapping to it, and then create an interval that covers 95% of the means. That’s all! Here are some figures to illustrate it.
First, take a sample. Here’s a sample of weights from 12 female mice.
Bootstrap the sample.
Create an interval that covers 95% of the bootstrapped means (this will be a 95% confidence interval).
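If you want to try this yourself, here is a minimal R sketch; the weights are made-up numbers standing in for the 12 mouse weights in the figure.
[sourcecode language="R"]
## made-up weights standing in for the 12 female mice
weights <- c(21.4, 23.1, 19.8, 22.5, 24.0, 20.7,
  23.8, 22.1, 21.0, 25.3, 22.9, 20.2)

## bootstrap: resample with replacement a bunch of times and
## calculate the mean of each bootstrapped sample
num.loops <- 10000
boot.means <- vector(mode="numeric", length=num.loops)
for (i in 1:num.loops) {
  boot.means[i] <- mean(sample(weights, size=length(weights), replace=TRUE))
}

## the interval that covers the middle 95% of the bootstrapped means
## is a 95% confidence interval
quantile(boot.means, probs=c(0.025, 0.975))
[/sourcecode]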
Now that we know what a confidence interval is, why should we care? Well, I like confidence intervals because they let us do statistics using pictures, rather than equations. It’s nice when you can just look at something and say, “The p-value for this hypothesis is less than 0.05, thus, we will reject the hypothesis” without having to rely on a single equation.
Here’s an example. In that figure, all values outside of the 95% confidence interval occurred less than 5% of the time. Thus, any hypothesis that the true population mean takes a value outside of the confidence interval has a p-value < 0.05, and we can reject it just by looking at the picture.
The standard error isn’t just some error that we all make (as in “Dude, did you see Jimmy at the bar last night? He made the standard error…”), it’s a measure of how much we can expect the mean to change if we were to re-do an experiment.
Let’s start by thinking about error bars:
We’ve all seen error bars in plots before. The most common types of error bars are standard deviations, standard errors and confidence intervals. Standard deviations describe how the data that you collected varies from one measurement to the next. In contrast, standard errors describe how much the mean value might change if we did the whole experiment over again. Whoa!!! I know it sounds crazy, so let’s talk about it.
First, let’s consider a simple (but expensive and time consuming) way to calculate the standard error. There’s a simple formula for calculating the standard error, but it won’t help us understand what’s really going on, so we’ll ignore it for now.
Assume we took a sample of 5 measurements. From these 5 measurements, we can easily calculate (and plot) the mean.
Now, we can easily imagine that if we took another sample of 5 measurements, the mean might be slightly different from the one we got from the first sample. So let’s take a bunch of samples.
After taking a bunch of samples, we can calculate (and plot) the mean for each one. In the figure we can see that the means are more tightly clustered together than the original measurements. This is because it’s much more common to get a single extreme value in a single sample than to get an entire sample’s worth of extreme values, which is what we would need to get an extreme value for a mean. Now that we have calculated the means, we can calculate their standard deviation.
The standard error is just the standard deviation of all those means – it describes how much the mean differs from one sample to the next.
Now, like I said before, there’s a simple formula for calculating the standard error, but there’s also a simple, time- and cost-effective way to do it without having to use that formula. This method is called “bootstrapping”. What’s cool about bootstrapping is that you can use it to calculate the standard error of anything, not just the mean, even if there isn’t a nice formula for it. For example, if you had calculated the median instead of the mean, you’d be in deep trouble if you needed to calculate its standard error and you didn’t have bootstrapping.
Bootstrapping is introduced in the video, and in the next StatQuest, we’ll use it to calculate confidence intervals, which are super cool since they let you compare samples visually.
If you want to try out bootstrapping yourself, here’s some R code:
[sourcecode language="R"]
## bootstrap demo…
# Step 1: get some data
# (in this case, we will just generate 20 random numbers)
n <- 20
data <- rnorm(n)
## Step 2: plot the data
stripchart(data, pch=3)
## Step 3: calculate the mean and put it on the plot
data.mean <- mean(data)
abline(v=data.mean, col="blue", lwd=2)
## Step 4: calculate the standard error of the mean using the
## standard formula (there is a standard formula for the standard
## error of the mean, but not for a lot of other things; we use
## the standard formula in this example to show that the
## bootstrap method gets about the same value)
data.stderr <- sqrt(var(data))/sqrt(n)
data.stderr ## this just prints out the calculated standard error
## Step 5: Plot 2*data.stderr lines +/- the mean
## the mean +/- 2 times the standard error is a quick and dirty
## approximation of a 95% confidence interval
abline(v=(data.mean)+(2*data.stderr), col="red")
abline(v=(data.mean)-(2*data.stderr), col="red")
## Step 6: Use "sample()" to bootstrap the data and calculate a lot of
## bootstrapped means.
num.loops <- 1000
boot.means <- vector(mode="numeric", length=num.loops)
for(i in 1:num.loops) {
boot.data <- sample(data, size=n, replace=TRUE)
boot.means[i] <- mean(boot.data)
}
## Step 7: Calculate the standard deviation of the bootstrapped means.
## This is the bootstrapped standard error of the mean
boot.stderr <- sqrt(var(boot.means))
boot.stderr
## Step 8: Plot 2*boot.stderr +/- the mean.
## Notice that the 2*data.stderr lines and the 2*boot.stderr lines
## nearly overlap.
abline(v=(data.mean)+(2*boot.stderr), col="green")
abline(v=(data.mean)-(2*boot.stderr), col="green")
[/sourcecode]
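And, to illustrate the point about the median: the exact same loop works if you swap mean() for median(), even though there is no handy formula to check it against. This little addition reuses data and num.loops from the block above.
[sourcecode language="R"]
## bonus: bootstrap the standard error of the median
## (reuses "data" and "num.loops" from the code above)
boot.medians <- vector(mode="numeric", length=num.loops)
for(i in 1:num.loops) {
  boot.medians[i] <- median(sample(data, size=length(data), replace=TRUE))
}
sqrt(var(boot.medians)) ## the bootstrapped standard error of the median
[/sourcecode]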
R squared is an awesome metric of correlation – which is to say, it’s an awesome way to assess how two quantitative variables might be related. For example, mouse weight and size might be correlated. (This example is pretty obvious – a bigger mouse will probably weigh more than a smaller one, unless the bigger one is just fluff).
Here’s the raw data for 7 mice (n=7) with mouse weight on the Y-axis and mouse size on the X-axis.
Looking at the raw data for 7 mice (n=7) we see that there is a trend: larger mice tend to weigh more than smaller mice.
The blue line illustrates the correlation between mouse weight and mouse size. Bigger mice tend to weigh more.
Can we use this trend to predict a mouse’s weight if we know its size? How good is this prediction? R-squared can answer these questions. It is an easy-to-calculate and easy-to-interpret metric of correlation (i.e. the relationship between mouse size and mouse weight). R-squared is the percentage of variation in the data that the relationship accounts for.
What’s the variation in the data? That’s easy, it’s the difference between the individual data points and the mean.
Calculating the variation in the data. It’s the squared differences between the data points and the mean.
How do we calculate the variation around the size/weight relationship? We just calculate the variation around the blue line that we drew before.
The variation around the line is just the squared differences between the data points and the blue line.
Now, to calculate the percentage of variation that the size/weight relationship accounts for, we just subtract the variation around the line from the variation around the mean and then divide by the variation around the mean. Here it is in pictures:
In the bottom of the graph we have the formula for R-squared. All we need to know to calculate it is the variation around the mean and the variation around the blue line.
First, we calculate the variation around the mean.
Second, we subtract the variation around the blue line.
Lastly, we divide by the variation around the mean. The result is the fraction (i.e. percentage) of the variation around the mean that the relationship accounts for, since the variation around the line will never be less than 0 and will never be greater than the variation around the mean.
Now let’s use some numbers to calculate a real R-squared!
Here is our calculation, all fleshed out with numbers.
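If you want to check the arithmetic yourself, here is a minimal R sketch with made-up sizes and weights for 7 mice. It calculates R-squared by hand and then compares the result to the value reported by lm().
[sourcecode language="R"]
## made-up data for 7 mice
size <- c(1.2, 1.9, 2.3, 2.8, 3.4, 3.9, 4.5)
weight <- c(2.1, 3.4, 3.1, 4.6, 5.0, 6.3, 6.5)

## the blue line: a least-squares fit of weight on size
fit <- lm(weight ~ size)

## variation around the mean and variation around the line
## (both are sums of squared differences)
var.mean <- sum((weight - mean(weight))^2)
var.line <- sum((weight - predict(fit))^2)

## R-squared: the fraction of the variation around the mean that
## the size/weight relationship accounts for
r.squared <- (var.mean - var.line) / var.mean
r.squared
summary(fit)$r.squared ## the same number, straight from lm()
[/sourcecode]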
The first statistics class I took was a disaster. Held in a large auditorium packed with students, its atmosphere was more “refugee camp” than classroom. The general din made it impossible to hear the professor. Attendance was mandatory and, as far as I could tell, the sole factor that determined our final grades. One day a group of students started punching each other, hard. A fight had broken out. That’s right, a fight had broken out in my graduate-level statistics class.
I quickly gave up on learning anything from the lectures and turned to the $150 textbook instead. However, it was a collection of random SAS code and ANOVA tables, and the examples didn’t look like anything I had seen before. I re-read the chapter on t-tests 50 times before giving it up as a lost cause. Even basic concepts, like “average”, became confusing. Everything was backwards.
Despite the rough start, I love statistics. Statistics create knowledge. You start with a pile of numbers, you run statistics on them, and out comes information for making the best decisions. That’s cool, right? Making an informed decision is so much more awesome than guessing, especially when it pertains to how to use precious resources, like our time. It’s awesome. Stats are also fun. That’s right, stats are fun. You might not believe me yet, but you will. Trust me.
With this blog, I am going to explain statistics so that you can use them confidently. If you’re in the middle of a statistics jungle, and fights are breaking out around you, I want this blog to be a refuge so that you can learn what you need to learn. The methods can be used in a lot of contexts, but, because I work in a mouse molecular genetics lab, I’m going to explain them primarily within that context. I’ll try to keep things general, so that if this isn’t your specialty you can still follow along, but I wanted something that my co-workers could turn to and understand without having to translate the examples into their own language. Thus, examples will be given in terms of “gene expression” rather than “migrating birds”, or “seismic activity”.