PCA clearly explained

Update: A lot of people ask to see example code that shows how to actually do PCA. Here’s how to do it in R

[sourcecode language=”R”]
## In this example, the data is in a matrix called
## data.matrix
## columns are individual samples (i.e. cells)
## rows are measurements taken for all the samples (i.e. genes)
## Just for the sake of the example, here’s some made up data…
data.matrix <- matrix(nrow=100, ncol=10)
colnames(data.matrix) <- c(
paste("wt", 1:5, sep=""),
paste("ko", 1:5, sep=""))
head(data.matrix)
rownames(data.matrix) <- paste("gene", 1:100, sep="")
for (i in 1:100) {
wt.values <- rpois(5, lambda=sample(x=10:1000, size=1))
ko.values <- rpois(5, lambda=sample(x=10:1000, size=1))

data.matrix[i,] <- c(wt.values, ko.values)
}
head(data.matrix)

pca <- prcomp(t(data.matrix), scale=TRUE)

## plot pc1 and pc2
plot(pca$x[,1], pca$x[,2])

## get the name of the sample (cell) with the highest pc1 value
rownames(pca$x)[order(pca$x[ ,1], decreasing=TRUE)[1]]

## get the name of the top 10 measurements (genes) that contribute
## most to pc1.
loading_scores <- pca$rotation[,1]
gene_scores <- abs(loading_scores) ## get the magnitudes
gene_score_ranked <- sort(gene_scores, decreasing=TRUE)
top_10_genes <- names(gene_score_ranked[1:10])

top_10_genes ## show the names of the top 10 genes

## get the scree information
pca.var <- pca$sdev^2
scree <- pca.var/sum(pca.var)
plot((scree[1:10]*100), main="Scree Plot", xlab="Principal Component", ylab="Percent Variation")
[/sourcecode]

RNA-seq results often contain a PCA (Principal Component Analysis) or MDS plot. Usually we use these graphs to verify that the control samples cluster together. However, there’s a lot more going on, and if you are willing to dive in, you can extract a lot more information from these plots. The good news is that PCA only sounds complicated. Conceptually, it’s actually quite simple.

In this video I clearly explain how PCA graphs are generated, how to interpret them, and how to determine if the plot is informative or not. I then show how you can extract extra information out of the data used to draw the graph.

45 thoughts on “PCA clearly explained

  1. Thanks Joshua for such a wonderful explanation of PCA! I was wondering if you would be willing to share your slide deck. Your slides are very clear and well put together. I was hoping to use them for educational purposes. I will of course acknowledge you appropriately for the content. Thanks!

  2. Bug report. Try: rownames(pca$x)[order(pca$x[ ,1], decreasing=TRUE)[1]]
    for the sample ID. Your code tries to use the max PCA value as an index. ???

  3. for distinguish between two different kind of samples (here it is cells), two PCs are okay but how to distinguish more than two samples with two PCs??

    • To plot a cell on the graph, you multiply the gene measurements associated with that cell with the loading scores in PC1. Then add up the values and that gives you a coordinate along the PC1 axis. Then do the same thing with PC2 – multiply the gene measurements with the loading scores in PC2 and then add them up. That gives you a coordinate on PC2. With those two coordinates, we can plot the cell on the 2D graph. (If we wanted to draw a 3D graph, we would do the same thing to get a coordinate along the PC3 axis). That process can be repeated for as many cells as we have. This is explained here: https://youtu.be/_UVHneBUBW0?t=17m25s

      • Thank you. . 🙂

        Actually i had one more doubt, In your example rows corresponds to variables and columns corresponds to observations but when I try to do it in matlab (which uses SVD algorithm) its completely different, There rows correspond to observations and columns correspond to variables and it returns the loadings in the size of columns x columns. How it is possible?

  4. For historical reasons, most PCA functions (including SVD) expect the rows to be samples, and the columns to be variables. However, genomic data is often the transpose of that type of matrix – thus, samples are columns and variables (usually genes) are rows. Thus, to do PCA with my data (where the samples are columns and the variables are rows) you have to transpose the matrix before you make the call to the PCA function. In my example code (for R), I have the line that looks like this:

    pca <- prcomp(t(data.matrix), scale=TRUE)

    The t() function that the data goes through before going to prcomp() transposes the data. So try transposing your matrix to see if you get the right values.

    • I’m not sure what you mean by both methods. PCA is just one method. Typically it is used to plot samples. If you data has the samples as columns and variables as rows, than you need to transpose the matrix so that the samples are rows and the variables are columns before you do PCA on it so that you can plot the samples. If your data is already formatted so that the samples are rows, than you’re set, you don’t need to transpose the matrix. Does this make sense?

      • Actually I mean to ask SVD algorithm which am using in MATLAB (where rows correspond to observations and columns correspond to variables) also plotting between the samples?

  5. I’m not sure how to answer your question since I don’t use matlab. However, R is free and you can download it. Here’s how you use SVD in R:

    svd.stuff <- svd(scale(MY_MATRIX, center=TRUE, scale=TRUE))

    Where "MY_MATRIX" is a matrix where the rows correspond to samples and the columns correspond to variables.

    To draw a PCA plot from this:

    plot((svd.stuff$u[,1] * svd.stuff$d[1]), (svd.stuff$u[,2] * svd.stuff$d[2]))

    Where "u" is a matrix of PCs that have all been scaled so that they have a magnitude of 1 and and "d" is a vector of values that spread the columns in "u" out according to how much variance each PC accounts for in the original data.

      • I watched the video and the “samples” are the rows and the “variables” are columns. Ultimately, in later videos, the dudes shows how you can use SVD to plot the rows on a 2-D graph to show how the samples are related to each other. Specifically, check out https://youtu.be/K38wVcdNuFc?t=4m17s In that, the samples are people (and rows in this case) and the variables are movies (and columns) that they have seen and scored based on how much they liked them. On the left side of the screen you see the raw data for two people. On the right side, you see the results of SVD – X and Y coordinates that will put those two people close to each other in the final graph because they both gave Sci-Fi movies high scores and they both gave Romance movies low scores. Thus, the final plot will show us which people like Sci-Fi movies and which people like Romance movies. In my example, however, the matrix is transposed – the goal is to plot cells by looking at the scores for each gene. Samples with high scores for the same genes and low scores for the same genes will be plotted near each other. So in my example, the matrix has to be transposed before you put it into SVD.

  6. Thank you for the great explanation, as i understand the number of PC we get should equal the number of columns (variables) in our data, so we after we take the transposed matrix and run prcomp(), PC should equal the number of genes(100 genes) not the number of samples

    • That’s a reasonable idea, however, the number of PCs that are any use whatsoever is the minimum of either the number of samples or the number of variables. In a few weeks I’m going to make a video that clarifies this. In the mean time… Imagine you had 2 variables and only 1 sample. If you did PCA on this, how many dimensions would you need to plot a single sample? Just 1. If you had 10 variables and 2 samples, you could still plot those 2 samples on a line if you just drew a line between them. Does that make sense. The video will clarify this more by showing images.

          • josh , i have another question and i would be really glad if you answer me, now in your example the first PC explains 90% of data variation , but of course when we deal with the real biological data we will get more number of PC that can explains the variation , now my question is how can we get the most important genes from number of PCs that explains more than 80% of data variation , in your example u used just the 1st PC to extract the important genes , but what if i want to extract genes from several PCs, hope i could explain what i want!!
            Best regards

  7. This is a good question – often I have to look at more than 1 principal component to find the genes that are interesting. All you do is look at the scree plot and decide how many PCs you want to look at, then rank the loading scores fo those PCs by their magnitude (aka absolute value, since large negative values are just as interesting as large positive values) and then look at the top genes. If you want to get fancy, you can then rank the genes across the PCs. For example, after ranking the gene in the first 3 PCs, you could then see which PC has the largest value for the top rank, and use that gene. Then you could see which PC has the largest value for the 2nd rank, and use that gene, etc.

      • hey josh, thank you for answering me, but i have noticed something ,and i hope that you will correct me if am wrong.
        now , i have noticed that genes have smaller rotation values in PC1 than in PC2 and ,PC2 rotation values for each gene are smaller than the ones in PC3 and so on , so basically if i write the code u send me and then sort by decreasing to obtain the highest ten genes , i will end up getting genes that are in the last PC because they have the biggest rotation values, which are the less important genes.

  8. Here’s an example of getting the top genes from the first 3 eigenvectors…

    eigen_vectors_top_3 <- eigen_vectors[, 1:3]

    gene_score_top <- abs(eigen_vectors_top)
    score <- apply(gene_score_top, 1, max)

    gene.names <- names(score)

  9. Well, you might have an odd dataset. When I do it on the two PCs in “how to do PCA in R” code https://statquest.org/2017/11/27/statquest-pca-in-r-clearly-explained/#code I get a mix of genes with higher values in PC1 and PC2 (although I have to look at the top 20 genes to see the mix). Assuming you’ve loaded in the code that I mentioned and run it, you and see genes from the two PCs with the following lines:

    eigen_vectors <- pca$rotation[,1:2]
    gene_scores <- abs(eigen_vectors)
    score <- apply(gene_scores, 1, max)
    score <- sort(score, decreasing=TRUE)
    gene_scores[names(head(score, n=20)),]

  10. Hi, Josh
    All of your videos I’ve seen to date (half a dozen) feature excellent explanations. Thank you for making this material and your very helpful take on it available to everyone on the Internet.
    I have an insignificant question, just to help identify if I’m misinterpreting something: Using the small data set in the video, I calculated cell 1’s PC1 score (practicing it by transferring it to terminology that works better for me, but with your numbers). Treating each cell-1 entry (value) and multiplying it with the quantitative influence, I get:
    Cell 1’s PC 1 score = (10*10)+(0*0.5)+(14*0.2)+(33*[-0.2])+(50*13)+(80*[-14]) = -373.8 (not 12)
    I noticed you said “… we _might_ end up with a number like 12 …” (emphasis added), so I’m wondering whether I’m indeed doing something wrong or whether 12 was just an arbitrary number? (Or did you divide by the total number of cells?)
    Thanks

  11. Dear Joshua, great video, thank you!

    I’m not being able to follow your R code, though. It seems polluted with random html tagging…
    Some examples at line lines 7, 8 and 9:
    colnames(data.matrix) <- c(
    paste("wt", 1:5, sep=""),
    paste("ko", 1:5, sep=""))

    If possible, when you have the time and the energy together, could you please remove them so the code could appear clearer?

    Sorry for bothering you, and thanks in advance!

    Best Regards,

    JK

    • Ummm…. I just copied and pasted the example you gave me (lines 7, 8 and 9) into R and they ran just fine. I pasted from the code on the website and the code in your comment and both ran fine. Can you tell me what the error is that you are getting?

      • Maybe is my browser (Firefox) perhaps?
        When I copy your line 6 and paste into R, I get this:
        > data.matrix <- matrix(nrow=100, ncol=10)
        Error: object ‘lt’ not found

        And when copy and paste lines 7, 8, and 9, I get this:
        > colnames(data.matrix) <- c(
        Error: object ‘lt’ not found
        > paste("wt", 1:5, sep=""),
        Error: unexpected ‘&’ in ” paste(&”
        > paste("ko", 1:5, sep=""))
        Error: unexpected ‘&’ in ” paste(&”
        >

        JK

  12. The odd thing is: when pasting into your blog’s message box, I have the odd characters visible; but when I click on “Post Comment”, they disappeared…

  13. This time I’m using Chrome, the odd characters are still there. Just to make them visible to you, will add an extra underline inside the odd characters:

    data.matrix _&_l_t_;_- matrix(nrow=100, ncol=10)
    colnames(data.matrix) _&_l_t_;_- c(
    paste(_&_q_u_o_t_;_w_t_&_q_u_o_t_;_, 1:5, sep=_&_q_u_o_t_;_w_t_&_q_u_o_t_;_),
    paste(_&_q_u_o_t_;_w_t_&_q_u_o_t_;_, 1:5, sep=_&_q_u_o_t_;_w_t_&_q_u_o_t_;_))

    JK

    • If you could copy and paste the above into a notepad, and replace the underline for nothing, you’ll be able to see what I’m seeing when I look to your original code.

      Sorry for the trouble!

      JK

    • Weird! Maybe it’s some wordpress strangeness. I’m sorry it’s not working for you. That’s a bummer. I’ll try to put a link to download the code (as a raw text file) so that people that run into the same problem will have an easy work around. Thanks for letting me know about this problem.

      • WordPress… I think you’re right!
        Thank you very much and, once again, sorry for the trouble.

        Really appreciated your concern!

        Best Regards,

        JK

  14. Acessing this page with my Android cell phone using Chrome, the odd characters are very cliear.

    I thing I’ve partially got it:
    _&_l_t_;_ equals <
    _&_q_u_o_t_;_ equals "

    but the last line of code (line 40) has some additional html code, not sure how to deal with it. It starts with
    < span

  15. For top genes, how many of them should we use? Is there a reasonable cutoff for loading scores? I read some discussions and someone suggested not to use the genes with loading score less than 0.3. Any advice?

    • There’s no real cutoff. It just depends on your data and what you hope to get out of it. Often, instead of using a strict cutoff, people will take the top 100 genes and see if that works for them.

  16. im a newbie. i got to this website cuz i bought a subscription (double-bam) to your stat-quest channel in youtube. is there anyway i could get my hands on the data set you used for this example? although the video provides a great explanation of the method i have some questions regarding details, namely why gene gene 1 is high, while gene 2 is low when their values are so close. i thought that if i processed the original data set in r or pandas i would get whatever im missing. thank you.

  17. Hello,

    Thank you for this splendid tutorial.

    I have a question about the code. For the below code line, what is the ”t” ?

    pca <- prcomp(t(data.matrix), scale=TRUE)

Leave a Reply

Your email address will not be published. Required fields are marked *