# statistics

# StatQuest: Linear Discriminant Analysis (LDA), clearly explained

Here it is, folks! By popular demand, a StatQuest on linear discriminant analysis (LDA)!

Also, because you asked for it, here’s some sample R code that shows you how to get LDA working in R.

If all went well, you should get a graph that looks like this:

# RNA-seq: The Pitfalls of Technical Replicates

Science, like most things, isn’t perfect. Because individual mice (or other biological samples, like humans) are different and because the way we measure things isn’t perfect, every time we do an experiment, we get slightly different results.

We cope with this variation by doing the same experiment a bunch of times. Once we’ve collected a lot of samples, we typically calculate the average and the standard deviation. If we’re worried about the biological variation (the variation from each individual mouse or human) we do lots of biological replicates, and if we’re worried about technical variation, we do technical replicates (remeasure the same individual multiple times). Often we’ll do both.

For simple and inexpensive experiments, it’s easy to do lots of replicates, both biological and technical. However, RNA-seq is time consuming and expensive, and we want to squeeze as much as we can out of the samples that we have (which can be very limited!) The good news is that because of the way technical variation affects RNA-seq results, we only need to do biological replicates with this type of experiment.

Here’s an example of biological variation alone (imagine there is no technical variation). In this example, we’re measuring read counts for an imaginary gene “X”. Each mouse has either more or less than the average for all mice. If we do enough samples, our estimate of the average of all mice will be pretty good because the measurements that are above the mean will be canceled out by the measurements that are below the mean.

When we add technical variation to the mix, RNA-seq will result in random values that either add or subtract read counts from each mouse’s measurement. Here’s an example where I’ve colored the biological variation orange and the technical variation green.

When we calculate the average measurement from our samples, we end up with a term for biological variation and a term for technical variation. Just like before, the term for biological variation will go to zero as we add samples, because the positive values will be canceled out by the negative values. However, because the technical variation is similarly distributed, it will cancel itself out for the same reason. Thus, without doing a single technical replicate (and only doing biological replicates) we can reduce the affects of biological *and* technical variation and get a good estimate of the mean.

One problem that is often overlooked is that technical replicates alone will only eliminate the term for technical variation. The term for biological variation will only be reinforced, and this can lead to misleading results. See the video for more details.

# PCA clearly explained

Update: A lot of people ask to see example code that shows how to actually do PCA. Here’s how to do it in R

[sourcecode language=”R”]

## In this example, the data is in a matrix called

## data.matrix

## columns are individual samples (i.e. cells)

## rows are measurements taken for all the samples (i.e. genes)

## Just for the sake of the example, here’s some made up data…

data.matrix <- matrix(nrow=100, ncol=10)

colnames(data.matrix) <- c(

paste("wt", 1:5, sep=""),

paste("ko", 1:5, sep=""))

head(data.matrix)

rownames(data.matrix) <- paste("gene", 1:100, sep="")

for (i in 1:100) {

wt.values <- rpois(5, lambda=sample(x=10:1000, size=1))

ko.values <- rpois(5, lambda=sample(x=10:1000, size=1))

data.matrix[i,] <- c(wt.values, ko.values)

}

head(data.matrix)

pca <- prcomp(t(data.matrix), scale=TRUE)

## plot pc1 and pc2

plot(pca$x[,1], pca$x[,2])

## get the name of the sample (cell) with the highest pc1 value

rownames(pca$x)[order(pca$x[ ,1], decreasing=TRUE)[1]]

## get the name of the top 10 measurements (genes) that contribute

## most to pc1.

loading_scores <- pca$rotation[,1]

gene_scores <- abs(loading_scores) ## get the magnitudes

gene_score_ranked <- sort(gene_scores, decreasing=TRUE)

top_10_genes <- names(gene_score_ranked[1:10])

top_10_genes ## show the names of the top 10 genes

## get the scree information

pca.var <- pca$sdev^2

scree <- pca.var/sum(pca.var)

plot((scree[1:10]*100), main="Scree Plot", xlab="Principal Component", ylab="Percent Variation")

[/sourcecode]

RNA-seq results often contain a PCA (Principal Component Analysis) or MDS plot. Usually we use these graphs to verify that the control samples cluster together. However, there’s a lot more going on, and if you are willing to dive in, you can extract a lot more information from these plots. The good news is that PCA only sounds complicated. Conceptually, it’s actually quite simple.

In this video I clearly explain how PCA graphs are generated, how to interpret them, and how to determine if the plot is informative or not. I then show how you can extract extra information out of the data used to draw the graph.

# The Standard Error (and a bootstrapping bonus!!!)

The standard error isn’t just some error that we all make (as in “Dude, did you see Jimmy at the bar last night? He made the standard error…”), it’s a measure of how much we can expect the mean to change if we were to re-do an experiment.

Let’s start by thinking about error bars:

We’ve all seen error bars in plots before. The most common types of error bars are **standard deviations**, **standard errors** and **confidence intervals**. Standard deviations describe how the data that you collected varies from one measurement to the next. In contrast, standard errors describe how much the mean value might change if we did the whole experiment over again. Whoa!!! I know it sounds crazy, so let’s talk about.

First, let’s consider a simple (but expensive and time consuming) way to calculate the standard error. There’s a simple formula for calculating the standard error, but it won’t help us understand what’s really going on, so we’ll ignore it for now.

Assume we took a sample of 5 measurements. From these 5 measurements, we can easily calculate (and plot) the mean.

Now, we can easily imagine that if we took another sample of 5 measurements, the mean might be slightly different from the one we got from the first sample. So let’s take a bunch of samples.

After taking a bunch of samples, we can calculate (and plot) the mean for each one. In the figure we can see that the means are more tightly clustered together than the original measurements. This is because it’s much more common to get a single extreme value in a single sample, than get an entire sample worth of extreme values, which is what we would need to get an extreme value for a mean. Now that we have calculated the means, we can calculate their standard deviation.

The standard error is just the standard deviation of all those means – it describes how much the values for the mean differs in a bunch of samples.

Now, like I said before, there’s a simple formula for calculating the standard error, but there’s also a simple, time and cost effective way to do it without having to use that formula. This method is called “bootstrapping”. What’s cool about bootstrapping is that you can use it to calculate the standard error of anything, not just the mean, even if there isn’t a nice formula for it. For example, if you had calculated the median instead of the mean, you’d be in deep trouble if you needed to calculate its standard error and you didn’t have bootstrapping.

Bootstrapping is introduced in the video, and in the next StatQuest, we’ll use it to calculate confidence intervals, which are super cool since they let you compare samples visually.

If you wan to try out bootstrapping yourself, here’s some R code:

[sourcecode language=”R”]

## bootstrap demo…

# Step 1: get some data

# (in this case, we will just generate 20 random numbers)

n=20

data <- rnorm(n)

## Step 2: plot the data

stripchart(data, pch=3)

## Step 3: calculate the mean and put it on the plot

data.mean <- mean(data)

abline(v=data.mean, col="blue", lwd=2)

## Step 4: calculate the standard error of the mean using the

## standard formula (there is a standard formula for the standard

## error of the mean, but not for a lot of other things. We use

## the standard formula in this example to show you that the

## bootstrap method gets the about the same values

data.stderr <- sqrt(var(data))/sqrt(n)

data.stderr ## this just prints out the calculated standard err

## Step 5: Plot 2*data.stderr lines +/- the mean

## 2 times the standard error +/- the mean is a quick and dirty

## approximatino of a 95% confidence interval

abline(v=(data.mean)+(2*data.stderr), col="red")

abline(v=(data.mean)-(2*data.stderr), col="red")

## Step 6: Use "sample()" to bootstrap the data and calculate a lot of

## bootstrapped means.

num.loops <- 1000

boot.means <- vector(mode="numeric", length=num.loops)

for(i in 1:num.loops) {

boot.data <- sample(data, size=20, replace=TRUE)

boot.means[i] <- mean(boot.data)

}

## Step 7: Calculate the standard deviation of the bootstrapped means.

## This is the bootstrapped standard error of the mean

boot.stderr <- sqrt(var(boot.means))

boot.stderr

## Step 8: Plot 2*boot.stderr +/- the mean.

## Notice that the 2*data.stderr lines and the 2*boot.stderr lines

## nearly overlap.

abline(v=(data.mean)+(2*boot.stderr), col="green")

abline(v=(data.mean)-(2*boot.stderr), col="green")

[/sourcecode]