Update: A lot of people ask to see example code that shows how to actually do PCA. Here’s how to do it in R

[sourcecode language=”R”]

## In this example, the data is in a matrix called

## data.matrix

## columns are individual samples (i.e. cells)

## rows are measurements taken for all the samples (i.e. genes)

## Just for the sake of the example, here’s some made up data…

data.matrix <- matrix(nrow=100, ncol=10)

colnames(data.matrix) <- c(

paste("wt", 1:5, sep=""),

paste("ko", 1:5, sep=""))

head(data.matrix)

rownames(data.matrix) <- paste("gene", 1:100, sep="")

for (i in 1:100) {

wt.values <- rpois(5, lambda=sample(x=10:1000, size=1))

ko.values <- rpois(5, lambda=sample(x=10:1000, size=1))

data.matrix[i,] <- c(wt.values, ko.values)

}

head(data.matrix)

pca <- prcomp(t(data.matrix), scale=TRUE)

## plot pc1 and pc2

plot(pca$x[,1], pca$x[,2])

## get the name of the sample (cell) with the highest pc1 value

rownames(pca$x)[order(pca$x[ ,1], decreasing=TRUE)[1]]

## get the name of the top 10 measurements (genes) that contribute

## most to pc1.

loading_scores <- pca$rotation[,1]

gene_scores <- abs(loading_scores) ## get the magnitudes

gene_score_ranked <- sort(gene_scores, decreasing=TRUE)

top_10_genes <- names(gene_score_ranked[1:10])

top_10_genes ## show the names of the top 10 genes

## get the scree information

pca.var <- pca$sdev^2

scree <- pca.var/sum(pca.var)

plot((scree[1:10]*100), main="Scree Plot", xlab="Principal Component", ylab="Percent Variation")

[/sourcecode]

RNA-seq results often contain a PCA (Principal Component Analysis) or MDS plot. Usually we use these graphs to verify that the control samples cluster together. However, there’s a lot more going on, and if you are willing to dive in, you can extract a lot more information from these plots. The good news is that PCA only sounds complicated. Conceptually, it’s actually quite simple.

In this video I clearly explain how PCA graphs are generated, how to interpret them, and how to determine if the plot is informative or not. I then show how you can extract extra information out of the data used to draw the graph.