PCA clearly explained

Update: People have asked for PowerPoint slides for this presentation. Here they are! If you use them, please cite me. Thank you so much for your support.

Update: A lot of people ask to see example code that shows how to actually do PCA. Here’s how to do it in R

## In this example, the data is in a matrix called
## data.matrix
## columns are individual samples (i.e. cells)
## rows are measurements taken for all the samples (i.e. genes)
## Just for the sake of the example, here's some made up data...
data.matrix <- matrix(nrow=100, ncol=10)
colnames(data.matrix) <- c(
  paste("wt", 1:5, sep=""),
  paste("ko", 1:5, sep=""))
rownames(data.matrix) <- paste("gene", 1:100, sep="")
for (i in 1:100) {
  wt.values <- rpois(5, lambda=sample(x=10:1000, size=1))
  ko.values <- rpois(5, lambda=sample(x=10:1000, size=1))

  data.matrix[i,] <- c(wt.values, ko.values)

pca <- prcomp(t(data.matrix), scale=TRUE)

## plot pc1 and pc2
plot(pca$x[,1], pca$x[,2])

## get the name of the sample (cell) with the highest pc1 value
rownames(pca$x)[order(pca$x[ ,1], decreasing=TRUE)[1]]

## get the name of the top 10 measurements (genes) that contribute
## most to pc1.
loading_scores <- pca$rotation[,1]
gene_scores <- abs(loading_scores) ## get the magnitudes
gene_score_ranked <- sort(gene_scores, decreasing=TRUE)
top_10_genes <- names(gene_score_ranked[1:10])

top_10_genes ## show the names of the top 10 genes

## get the scree information
pca.var <- pca$sdev^2
scree <- pca.var/sum(pca.var)
plot((scree[1:10]*100), main="Scree Plot", xlab="Principal Component", ylab="Percent Variation")<span 				data-mce-type="bookmark" 				id="mce_SELREST_start" 				data-mce-style="overflow:hidden;line-height:0" 				style="overflow:hidden;line-height:0" 			></span>

RNA-seq results often contain a PCA (Principal Component Analysis) or MDS plot. Usually we use these graphs to verify that the control samples cluster together. However, there’s a lot more going on, and if you are willing to dive in, you can extract a lot more information from these plots. The good news is that PCA only sounds complicated. Conceptually, it’s actually quite simple.

In this video I clearly explain how PCA graphs are generated, how to interpret them, and how to determine if the plot is informative or not. I then show how you can extract extra information out of the data used to draw the graph.


27 thoughts on “PCA clearly explained

  1. Thanks Joshua for such a wonderful explanation of PCA! I was wondering if you would be willing to share your slide deck. Your slides are very clear and well put together. I was hoping to use them for educational purposes. I will of course acknowledge you appropriately for the content. Thanks!


  2. Bug report. Try: rownames(pca$x)[order(pca$x[ ,1], decreasing=TRUE)[1]]
    for the sample ID. Your code tries to use the max PCA value as an index. ???


  3. for distinguish between two different kind of samples (here it is cells), two PCs are okay but how to distinguish more than two samples with two PCs??


    • To plot a cell on the graph, you multiply the gene measurements associated with that cell with the loading scores in PC1. Then add up the values and that gives you a coordinate along the PC1 axis. Then do the same thing with PC2 – multiply the gene measurements with the loading scores in PC2 and then add them up. That gives you a coordinate on PC2. With those two coordinates, we can plot the cell on the 2D graph. (If we wanted to draw a 3D graph, we would do the same thing to get a coordinate along the PC3 axis). That process can be repeated for as many cells as we have. This is explained here: https://youtu.be/_UVHneBUBW0?t=17m25s


      • Thank you. . :)

        Actually i had one more doubt, In your example rows corresponds to variables and columns corresponds to observations but when I try to do it in matlab (which uses SVD algorithm) its completely different, There rows correspond to observations and columns correspond to variables and it returns the loadings in the size of columns x columns. How it is possible?


  4. For historical reasons, most PCA functions (including SVD) expect the rows to be samples, and the columns to be variables. However, genomic data is often the transpose of that type of matrix – thus, samples are columns and variables (usually genes) are rows. Thus, to do PCA with my data (where the samples are columns and the variables are rows) you have to transpose the matrix before you make the call to the PCA function. In my example code (for R), I have the line that looks like this:

    pca <- prcomp(t(data.matrix), scale=TRUE)

    The t() function that the data goes through before going to prcomp() transposes the data. So try transposing your matrix to see if you get the right values.


    • I’m not sure what you mean by both methods. PCA is just one method. Typically it is used to plot samples. If you data has the samples as columns and variables as rows, than you need to transpose the matrix so that the samples are rows and the variables are columns before you do PCA on it so that you can plot the samples. If your data is already formatted so that the samples are rows, than you’re set, you don’t need to transpose the matrix. Does this make sense?


      • Actually I mean to ask SVD algorithm which am using in MATLAB (where rows correspond to observations and columns correspond to variables) also plotting between the samples?

        Liked by 1 person

  5. I’m not sure how to answer your question since I don’t use matlab. However, R is free and you can download it. Here’s how you use SVD in R:

    svd.stuff <- svd(scale(MY_MATRIX, center=TRUE, scale=TRUE))

    Where "MY_MATRIX" is a matrix where the rows correspond to samples and the columns correspond to variables.

    To draw a PCA plot from this:

    plot((svd.stuff$u[,1] * svd.stuff$d[1]), (svd.stuff$u[,2] * svd.stuff$d[2]))

    Where "u" is a matrix of PCs that have all been scaled so that they have a magnitude of 1 and and "d" is a vector of values that spread the columns in "u" out according to how much variance each PC accounts for in the original data.


      • I watched the video and the “samples” are the rows and the “variables” are columns. Ultimately, in later videos, the dudes shows how you can use SVD to plot the rows on a 2-D graph to show how the samples are related to each other. Specifically, check out https://youtu.be/K38wVcdNuFc?t=4m17s In that, the samples are people (and rows in this case) and the variables are movies (and columns) that they have seen and scored based on how much they liked them. On the left side of the screen you see the raw data for two people. On the right side, you see the results of SVD – X and Y coordinates that will put those two people close to each other in the final graph because they both gave Sci-Fi movies high scores and they both gave Romance movies low scores. Thus, the final plot will show us which people like Sci-Fi movies and which people like Romance movies. In my example, however, the matrix is transposed – the goal is to plot cells by looking at the scores for each gene. Samples with high scores for the same genes and low scores for the same genes will be plotted near each other. So in my example, the matrix has to be transposed before you put it into SVD.


  6. Thank you for the great explanation, as i understand the number of PC we get should equal the number of columns (variables) in our data, so we after we take the transposed matrix and run prcomp(), PC should equal the number of genes(100 genes) not the number of samples


    • That’s a reasonable idea, however, the number of PCs that are any use whatsoever is the minimum of either the number of samples or the number of variables. In a few weeks I’m going to make a video that clarifies this. In the mean time… Imagine you had 2 variables and only 1 sample. If you did PCA on this, how many dimensions would you need to plot a single sample? Just 1. If you had 10 variables and 2 samples, you could still plot those 2 samples on a line if you just drew a line between them. Does that make sense. The video will clarify this more by showing images.


      • josh , i have another question and i would be really glad if you answer me, now in your example the first PC explains 90% of data variation , but of course when we deal with the real biological data we will get more number of PC that can explains the variation , now my question is how can we get the most important genes from number of PCs that explains more than 80% of data variation , in your example u used just the 1st PC to extract the important genes , but what if i want to extract genes from several PCs, hope i could explain what i want!!
        Best regards

        Liked by 1 person

  7. This is a good question – often I have to look at more than 1 principal component to find the genes that are interesting. All you do is look at the scree plot and decide how many PCs you want to look at, then rank the loading scores fo those PCs by their magnitude (aka absolute value, since large negative values are just as interesting as large positive values) and then look at the top genes. If you want to get fancy, you can then rank the genes across the PCs. For example, after ranking the gene in the first 3 PCs, you could then see which PC has the largest value for the top rank, and use that gene. Then you could see which PC has the largest value for the 2nd rank, and use that gene, etc.


      • hey josh, thank you for answering me, but i have noticed something ,and i hope that you will correct me if am wrong.
        now , i have noticed that genes have smaller rotation values in PC1 than in PC2 and ,PC2 rotation values for each gene are smaller than the ones in PC3 and so on , so basically if i write the code u send me and then sort by decreasing to obtain the highest ten genes , i will end up getting genes that are in the last PC because they have the biggest rotation values, which are the less important genes.


  8. Here’s an example of getting the top genes from the first 3 eigenvectors…

    eigen_vectors_top_3 <- eigen_vectors[, 1:3]

    gene_score_top <- abs(eigen_vectors_top)
    score <- apply(gene_score_top, 1, max)

    gene.names <- names(score)


  9. Well, you might have an odd dataset. When I do it on the two PCs in “how to do PCA in R” code https://statquest.org/2017/11/27/statquest-pca-in-r-clearly-explained/#code I get a mix of genes with higher values in PC1 and PC2 (although I have to look at the top 20 genes to see the mix). Assuming you’ve loaded in the code that I mentioned and run it, you and see genes from the two PCs with the following lines:

    eigen_vectors <- pca$rotation[,1:2]
    gene_scores <- abs(eigen_vectors)
    score <- apply(gene_scores, 1, max)
    score <- sort(score, decreasing=TRUE)
    gene_scores[names(head(score, n=20)),]


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s