PCA clearly explained

Update: People have asked for PowerPoint slides for this presentation. Here they are! If you use them, please cite me. Thank you so much for your support.

Update: A lot of people ask to see example code that shows how to actually do PCA. Here’s how to do it in R

## In this example, the data is in a matrix called
## data.matrix
## columns are individual samples (i.e. cells)
## rows are measurements taken for all the samples (i.e. genes)
## Just for the sake of the example, here's some made up data...
data.matrix <- matrix(nrow=100, ncol=10)
colnames(data.matrix) <- c(
  paste("wt", 1:5, sep=""),
  paste("ko", 1:5, sep=""))
head(data.matrix)
rownames(data.matrix) <- paste("gene", 1:100, sep="")
for (i in 1:100) {
  wt.values <- rpois(5, lambda=sample(x=10:1000, size=1))
  ko.values <- rpois(5, lambda=sample(x=10:1000, size=1))

  data.matrix[i,] <- c(wt.values, ko.values)
}
head(data.matrix)

pca <- prcomp(t(data.matrix), scale=TRUE)

## plot pc1 and pc2
plot(pca$x[,1], pca$x[,2])

## get the name of the sample (cell) with the highest pc1 value
rownames(pca$x)[order(pca$x[ ,1], decreasing=TRUE)[1]]

## get the name of the top 10 measurements (genes) that contribute
## most to pc1.
loading_scores <- pca$rotation[,1]
gene_scores <- abs(loading_scores) ## get the magnitudes
gene_score_ranked <- sort(gene_scores, decreasing=TRUE)
top_10_genes <- names(gene_score_ranked[1:10])

top_10_genes ## show the names of the top 10 genes

## get the scree information
pca.var <- pca$sdev^2
scree <- pca.var/sum(pca.var)
plot((scree[1:10]*100), main="Scree Plot", xlab="Principal Component", ylab="Percent Variation")<span 				data-mce-type="bookmark" 				id="mce_SELREST_start" 				data-mce-style="overflow:hidden;line-height:0" 				style="overflow:hidden;line-height:0" 			></span>

RNA-seq results often contain a PCA (Principal Component Analysis) or MDS plot. Usually we use these graphs to verify that the control samples cluster together. However, there’s a lot more going on, and if you are willing to dive in, you can extract a lot more information from these plots. The good news is that PCA only sounds complicated. Conceptually, it’s actually quite simple.

In this video I clearly explain how PCA graphs are generated, how to interpret them, and how to determine if the plot is informative or not. I then show how you can extract extra information out of the data used to draw the graph.

Advertisements

16 thoughts on “PCA clearly explained

  1. Thanks Joshua for such a wonderful explanation of PCA! I was wondering if you would be willing to share your slide deck. Your slides are very clear and well put together. I was hoping to use them for educational purposes. I will of course acknowledge you appropriately for the content. Thanks!

    Like

  2. Bug report. Try: rownames(pca$x)[order(pca$x[ ,1], decreasing=TRUE)[1]]
    for the sample ID. Your code tries to use the max PCA value as an index. ???

    Like

  3. for distinguish between two different kind of samples (here it is cells), two PCs are okay but how to distinguish more than two samples with two PCs??

    Like

    • To plot a cell on the graph, you multiply the gene measurements associated with that cell with the loading scores in PC1. Then add up the values and that gives you a coordinate along the PC1 axis. Then do the same thing with PC2 – multiply the gene measurements with the loading scores in PC2 and then add them up. That gives you a coordinate on PC2. With those two coordinates, we can plot the cell on the 2D graph. (If we wanted to draw a 3D graph, we would do the same thing to get a coordinate along the PC3 axis). That process can be repeated for as many cells as we have. This is explained here: https://youtu.be/_UVHneBUBW0?t=17m25s

      Like

      • Thank you. . :)

        Actually i had one more doubt, In your example rows corresponds to variables and columns corresponds to observations but when I try to do it in matlab (which uses SVD algorithm) its completely different, There rows correspond to observations and columns correspond to variables and it returns the loadings in the size of columns x columns. How it is possible?

        Like

  4. For historical reasons, most PCA functions (including SVD) expect the rows to be samples, and the columns to be variables. However, genomic data is often the transpose of that type of matrix – thus, samples are columns and variables (usually genes) are rows. Thus, to do PCA with my data (where the samples are columns and the variables are rows) you have to transpose the matrix before you make the call to the PCA function. In my example code (for R), I have the line that looks like this:

    pca <- prcomp(t(data.matrix), scale=TRUE)

    The t() function that the data goes through before going to prcomp() transposes the data. So try transposing your matrix to see if you get the right values.

    Like

    • I’m not sure what you mean by both methods. PCA is just one method. Typically it is used to plot samples. If you data has the samples as columns and variables as rows, than you need to transpose the matrix so that the samples are rows and the variables are columns before you do PCA on it so that you can plot the samples. If your data is already formatted so that the samples are rows, than you’re set, you don’t need to transpose the matrix. Does this make sense?

      Like

      • Actually I mean to ask SVD algorithm which am using in MATLAB (where rows correspond to observations and columns correspond to variables) also plotting between the samples?

        Liked by 1 person

  5. I’m not sure how to answer your question since I don’t use matlab. However, R is free and you can download it. Here’s how you use SVD in R:

    svd.stuff <- svd(scale(MY_MATRIX, center=TRUE, scale=TRUE))

    Where "MY_MATRIX" is a matrix where the rows correspond to samples and the columns correspond to variables.

    To draw a PCA plot from this:

    plot((svd.stuff$u[,1] * svd.stuff$d[1]), (svd.stuff$u[,2] * svd.stuff$d[2]))

    Where "u" is a matrix of PCs that have all been scaled so that they have a magnitude of 1 and and "d" is a vector of values that spread the columns in "u" out according to how much variance each PC accounts for in the original data.

    Like

      • I watched the video and the “samples” are the rows and the “variables” are columns. Ultimately, in later videos, the dudes shows how you can use SVD to plot the rows on a 2-D graph to show how the samples are related to each other. Specifically, check out https://youtu.be/K38wVcdNuFc?t=4m17s In that, the samples are people (and rows in this case) and the variables are movies (and columns) that they have seen and scored based on how much they liked them. On the left side of the screen you see the raw data for two people. On the right side, you see the results of SVD – X and Y coordinates that will put those two people close to each other in the final graph because they both gave Sci-Fi movies high scores and they both gave Romance movies low scores. Thus, the final plot will show us which people like Sci-Fi movies and which people like Romance movies. In my example, however, the matrix is transposed – the goal is to plot cells by looking at the scores for each gene. Samples with high scores for the same genes and low scores for the same genes will be plotted near each other. So in my example, the matrix has to be transposed before you put it into SVD.

        Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s