PCA clearly explained

August 13, 2015 MDS, PCA, Principal Component Analysis, RNA-seq, Single Cell RNA-seq, statistics

Update: A lot of people ask to see example code that shows how to actually do PCA. Here’s how to do it in R

[sourcecode language=”R”]
## In this example, the data is in a matrix called
## data.matrix
## columns are individual samples (i.e. cells)
## rows are measurements taken for all the samples (i.e. genes)
## Just for the sake of the example, here’s some made up data…
data.matrix <- matrix(nrow=100, ncol=10)
colnames(data.matrix) <- c(
paste("wt", 1:5, sep=""),
paste("ko", 1:5, sep=""))
head(data.matrix)
rownames(data.matrix) <- paste("gene", 1:100, sep="")
for (i in 1:100) {
wt.values <- rpois(5, lambda=sample(x=10:1000, size=1))
ko.values <- rpois(5, lambda=sample(x=10:1000, size=1))

data.matrix[i,] <- c(wt.values, ko.values)
}
head(data.matrix)

pca <- prcomp(t(data.matrix), scale=TRUE)

## plot pc1 and pc2
plot(pca$x[,1], pca$x[,2])

## get the name of the sample (cell) with the highest pc1 value
rownames(pca$x)[order(pca$x[ ,1], decreasing=TRUE)[1]]

## get the name of the top 10 measurements (genes) that contribute
## most to pc1.
loading_scores <- pca$rotation[,1]
gene_scores <- abs(loading_scores) ## get the magnitudes
gene_score_ranked <- sort(gene_scores, decreasing=TRUE)
top_10_genes <- names(gene_score_ranked[1:10])

top_10_genes ## show the names of the top 10 genes

## get the scree information
pca.var <- pca$sdev^2
scree <- pca.var/sum(pca.var)
plot((scree[1:10]*100), main="Scree Plot", xlab="Principal Component", ylab="Percent Variation")
[/sourcecode]

RNA-seq results often contain a PCA (Principal Component Analysis) or MDS plot. Usually we use these graphs to verify that the control samples cluster together. However, there’s a lot more going on, and if you are willing to dive in, you can extract a lot more information from these plots. The good news is that PCA only sounds complicated. Conceptually, it’s actually quite simple.

In this video I clearly explain how PCA graphs are generated, how to interpret them, and how to determine if the plot is informative or not. I then show how you can extract extra information out of the data used to draw the graph.

45 thoughts on “PCA clearly explained”

drpalv

October 19, 2016 at 9:56 pm

Thanks Joshua for such a wonderful explanation of PCA! I was wondering if you would be willing to share your slide deck. Your slides are very clear and well put together. I was hoping to use them for educational purposes. I will of course acknowledge you appropriately for the content. Thanks!

Reply
- Josh
  
  October 20, 2016 at 11:49 am
  
  Sure! I’d be happy to.
  
  Reply
Statistics Student

November 9, 2017 at 2:15 am

Bug report. Try: rownames(pca$x)[order(pca$x[ ,1], decreasing=TRUE)[1]]
for the sample ID. Your code tries to use the max PCA value as an index. ???

Reply
- Josh
  
  November 9, 2017 at 1:50 pm
  
  Thanks for catching that! I’ve corrected it and added some toy data to make it easier to play around with.
  
  Reply
Karalan

November 10, 2017 at 8:45 am

for distinguish between two different kind of samples (here it is cells), two PCs are okay but how to distinguish more than two samples with two PCs??

Reply
- Josh
  
  November 10, 2017 at 2:00 pm
  
  To plot a cell on the graph, you multiply the gene measurements associated with that cell with the loading scores in PC1. Then add up the values and that gives you a coordinate along the PC1 axis. Then do the same thing with PC2 – multiply the gene measurements with the loading scores in PC2 and then add them up. That gives you a coordinate on PC2. With those two coordinates, we can plot the cell on the 2D graph. (If we wanted to draw a 3D graph, we would do the same thing to get a coordinate along the PC3 axis). That process can be repeated for as many cells as we have. This is explained here: https://youtu.be/_UVHneBUBW0?t=17m25s
  
  Reply
  - Karalan
    
    November 13, 2017 at 5:57 am
    
    Thank you. . 🙂
    
    Actually i had one more doubt, In your example rows corresponds to variables and columns corresponds to observations but when I try to do it in matlab (which uses SVD algorithm) its completely different, There rows correspond to observations and columns correspond to variables and it returns the loadings in the size of columns x columns. How it is possible?
    
    Reply
Josh

November 13, 2017 at 1:39 pm

For historical reasons, most PCA functions (including SVD) expect the rows to be samples, and the columns to be variables. However, genomic data is often the transpose of that type of matrix – thus, samples are columns and variables (usually genes) are rows. Thus, to do PCA with my data (where the samples are columns and the variables are rows) you have to transpose the matrix before you make the call to the PCA function. In my example code (for R), I have the line that looks like this:

pca <- prcomp(t(data.matrix), scale=TRUE)

The t() function that the data goes through before going to prcomp() transposes the data. So try transposing your matrix to see if you get the right values.

Reply
- Karalan
  
  November 14, 2017 at 4:31 am
  
  So the plotting is done between variables or samples??
  
  Reply
  - Josh
    
    November 14, 2017 at 8:25 pm
    
    The plotting is usually done for the samples, not the variables.
    
    Reply
Karalan

November 15, 2017 at 4:17 am

In both methods??

Reply
- Josh
  
  November 15, 2017 at 2:34 pm
  
  I’m not sure what you mean by both methods. PCA is just one method. Typically it is used to plot samples. If you data has the samples as columns and variables as rows, than you need to transpose the matrix so that the samples are rows and the variables are columns before you do PCA on it so that you can plot the samples. If your data is already formatted so that the samples are rows, than you’re set, you don’t need to transpose the matrix. Does this make sense?
  
  Reply
  - Karalan
    
    November 16, 2017 at 4:12 am
    
    Actually I mean to ask SVD algorithm which am using in MATLAB (where rows correspond to observations and columns correspond to variables) also plotting between the samples?
    
    Reply
Josh

November 16, 2017 at 11:51 am

I’m not sure how to answer your question since I don’t use matlab. However, R is free and you can download it. Here’s how you use SVD in R:

svd.stuff <- svd(scale(MY_MATRIX, center=TRUE, scale=TRUE))

Where "MY_MATRIX" is a matrix where the rows correspond to samples and the columns correspond to variables.

To draw a PCA plot from this:

plot((svd.stuff$u[,1] * svd.stuff$d[1]), (svd.stuff$u[,2] * svd.stuff$d[2]))

Where "u" is a matrix of PCs that have all been scaled so that they have a magnitude of 1 and and "d" is a vector of values that spread the columns in "u" out according to how much variance each PC accounts for in the original data.

Reply
- Karalan
  
  November 20, 2017 at 6:19 am
  
  Sorry to say this. I referred lot of articles especially this one serious https://www.youtube.com/watch?v=yLdOS6xyM_Q&index=46&list=PLLssT5z_DsK9JDLcT8T62VtzwyW9LNepV . Everywhere they plot between variables not observations. By plotting between variables only we can achieve the dimensionality reduction.
  
  Reply
  - Josh
    
    November 20, 2017 at 10:28 am
    
    I watched the video and the “samples” are the rows and the “variables” are columns. Ultimately, in later videos, the dudes shows how you can use SVD to plot the rows on a 2-D graph to show how the samples are related to each other. Specifically, check out https://youtu.be/K38wVcdNuFc?t=4m17s In that, the samples are people (and rows in this case) and the variables are movies (and columns) that they have seen and scored based on how much they liked them. On the left side of the screen you see the raw data for two people. On the right side, you see the results of SVD – X and Y coordinates that will put those two people close to each other in the final graph because they both gave Sci-Fi movies high scores and they both gave Romance movies low scores. Thus, the final plot will show us which people like Sci-Fi movies and which people like Romance movies. In my example, however, the matrix is transposed – the goal is to plot cells by looking at the scores for each gene. Samples with high scores for the same genes and low scores for the same genes will be plotted near each other. So in my example, the matrix has to be transposed before you put it into SVD.
    
    Reply
Hussam

March 13, 2018 at 10:37 am

Thank you for the great explanation, as i understand the number of PC we get should equal the number of columns (variables) in our data, so we after we take the transposed matrix and run prcomp(), PC should equal the number of genes(100 genes) not the number of samples

Reply
- Josh
  
  March 13, 2018 at 11:35 am
  
  That’s a reasonable idea, however, the number of PCs that are any use whatsoever is the minimum of either the number of samples or the number of variables. In a few weeks I’m going to make a video that clarifies this. In the mean time… Imagine you had 2 variables and only 1 sample. If you did PCA on this, how many dimensions would you need to plot a single sample? Just 1. If you had 10 variables and 2 samples, you could still plot those 2 samples on a line if you just drew a line between them. Does that make sense. The video will clarify this more by showing images.
  
  Reply
  - Hussam
    
    March 16, 2018 at 5:15 am
    
    Thank you so much, waiting for new videos , and great songs too
    
    Reply
    - Josh
      
      March 16, 2018 at 11:22 am
      
      Hooray! I’m glad you like them both!
      
      Reply
      - Hussam
        
        March 16, 2018 at 11:58 am
        
        josh , i have another question and i would be really glad if you answer me, now in your example the first PC explains 90% of data variation , but of course when we deal with the real biological data we will get more number of PC that can explains the variation , now my question is how can we get the most important genes from number of PCs that explains more than 80% of data variation , in your example u used just the 1st PC to extract the important genes , but what if i want to extract genes from several PCs, hope i could explain what i want!!
        Best regards
    - Hussam
      
      March 16, 2018 at 12:32 pm
      
      Can you write the code in R for ranking genes in several PCs?
      
      Reply
Josh

March 16, 2018 at 12:21 pm

This is a good question – often I have to look at more than 1 principal component to find the genes that are interesting. All you do is look at the scree plot and decide how many PCs you want to look at, then rank the loading scores fo those PCs by their magnitude (aka absolute value, since large negative values are just as interesting as large positive values) and then look at the top genes. If you want to get fancy, you can then rank the genes across the PCs. For example, after ranking the gene in the first 3 PCs, you could then see which PC has the largest value for the top rank, and use that gene. Then you could see which PC has the largest value for the 2nd rank, and use that gene, etc.

Reply
- Hussam
  
  March 16, 2018 at 12:36 pm
  
  because i look at more than 500 PC
  
  Reply
  - Hussam
    
    March 18, 2018 at 7:28 pm
    
    hey josh, thank you for answering me, but i have noticed something ,and i hope that you will correct me if am wrong.
    now , i have noticed that genes have smaller rotation values in PC1 than in PC2 and ,PC2 rotation values for each gene are smaller than the ones in PC3 and so on , so basically if i write the code u send me and then sort by decreasing to obtain the highest ten genes , i will end up getting genes that are in the last PC because they have the biggest rotation values, which are the less important genes.
    
    Reply
Josh

March 16, 2018 at 12:48 pm

Here’s an example of getting the top genes from the first 3 eigenvectors…

eigen_vectors_top_3 <- eigen_vectors[, 1:3]

gene_score_top <- abs(eigen_vectors_top)
score <- apply(gene_score_top, 1, max)

gene.names <- names(score)

Reply
Josh

March 18, 2018 at 8:44 pm

Well, you might have an odd dataset. When I do it on the two PCs in “how to do PCA in R” code https://statquest.org/2017/11/27/statquest-pca-in-r-clearly-explained/#code I get a mix of genes with higher values in PC1 and PC2 (although I have to look at the top 20 genes to see the mix). Assuming you’ve loaded in the code that I mentioned and run it, you and see genes from the two PCs with the following lines:

eigen_vectors <- pca$rotation[,1:2]
gene_scores <- abs(eigen_vectors)
score <- apply(gene_scores, 1, max)
score <- sort(score, decreasing=TRUE)
gene_scores[names(head(score, n=20)),]

Reply
Mehmet Vurkaç

February 10, 2019 at 8:24 pm

Hi, Josh
All of your videos I’ve seen to date (half a dozen) feature excellent explanations. Thank you for making this material and your very helpful take on it available to everyone on the Internet.
I have an insignificant question, just to help identify if I’m misinterpreting something: Using the small data set in the video, I calculated cell 1’s PC1 score (practicing it by transferring it to terminology that works better for me, but with your numbers). Treating each cell-1 entry (value) and multiplying it with the quantitative influence, I get:
Cell 1’s PC 1 score = (10*10)+(0*0.5)+(14*0.2)+(33*[-0.2])+(50*13)+(80*[-14]) = -373.8 (not 12)
I noticed you said “… we _might_ end up with a number like 12 …” (emphasis added), so I’m wondering whether I’m indeed doing something wrong or whether 12 was just an arbitrary number? (Or did you divide by the total number of cells?)
Thanks

Reply
- Josh
  
  February 10, 2019 at 8:48 pm
  
  I have an updated version of this video that might help you more. https://youtu.be/FgakZw6K1QQ
  
  Reply
John Kennedy

April 15, 2019 at 9:08 pm

Dear Joshua, great video, thank you!

I’m not being able to follow your R code, though. It seems polluted with random html tagging…
Some examples at line lines 7, 8 and 9:
colnames(data.matrix) <- c(
paste("wt", 1:5, sep=""),
paste("ko", 1:5, sep=""))

If possible, when you have the time and the energy together, could you please remove them so the code could appear clearer?

Sorry for bothering you, and thanks in advance!

Best Regards,

JK

Reply
- Josh
  
  April 15, 2019 at 9:50 pm
  
  Ummm…. I just copied and pasted the example you gave me (lines 7, 8 and 9) into R and they ran just fine. I pasted from the code on the website and the code in your comment and both ran fine. Can you tell me what the error is that you are getting?
  
  Reply
  - John Kennedy
    
    April 17, 2019 at 1:10 am
    
    Maybe is my browser (Firefox) perhaps?
    When I copy your line 6 and paste into R, I get this:
    > data.matrix <- matrix(nrow=100, ncol=10)
    Error: object ‘lt’ not found
    
    And when copy and paste lines 7, 8, and 9, I get this:
    > colnames(data.matrix) <- c(
    Error: object ‘lt’ not found
    > paste("wt", 1:5, sep=""),
    Error: unexpected ‘&’ in ” paste(&”
    > paste("ko", 1:5, sep=""))
    Error: unexpected ‘&’ in ” paste(&”
    >
    
    JK
    
    Reply
John Kennedy

April 17, 2019 at 1:14 am

The odd thing is: when pasting into your blog’s message box, I have the odd characters visible; but when I click on “Post Comment”, they disappeared…

Reply
John Kennedy

April 17, 2019 at 1:20 am

This time I’m using Chrome, the odd characters are still there. Just to make them visible to you, will add an extra underline inside the odd characters:

data.matrix _&_l_t_;_- matrix(nrow=100, ncol=10)
colnames(data.matrix) _&_l_t_;_- c(
paste(_&_q_u_o_t_;_w_t_&_q_u_o_t_;_, 1:5, sep=_&_q_u_o_t_;_w_t_&_q_u_o_t_;_),
paste(_&_q_u_o_t_;_w_t_&_q_u_o_t_;_, 1:5, sep=_&_q_u_o_t_;_w_t_&_q_u_o_t_;_))

JK

Reply
- John Kennedy
  
  April 17, 2019 at 1:22 am
  
  If you could copy and paste the above into a notepad, and replace the underline for nothing, you’ll be able to see what I’m seeing when I look to your original code.
  
  Sorry for the trouble!
  
  JK
  
  Reply
- Josh
  
  April 17, 2019 at 1:23 am
  
  Weird! Maybe it’s some wordpress strangeness. I’m sorry it’s not working for you. That’s a bummer. I’ll try to put a link to download the code (as a raw text file) so that people that run into the same problem will have an easy work around. Thanks for letting me know about this problem.
  
  Reply
  - John Kennedy
    
    April 17, 2019 at 1:56 am
    
    WordPress… I think you’re right!
    Thank you very much and, once again, sorry for the trouble.
    
    Really appreciated your concern!
    
    Best Regards,
    
    JK
    
    Reply
JK

April 17, 2019 at 1:52 am

Acessing this page with my Android cell phone using Chrome, the odd characters are very cliear.

I thing I’ve partially got it:
_&_l_t_;_ equals <
_&_q_u_o_t_;_ equals "

but the last line of code (line 40) has some additional html code, not sure how to deal with it. It starts with
< span

Reply
- Josh
  
  April 17, 2019 at 12:39 pm
  
  I’m working on providing a link so that people with your problem can download the code directly as a text file.
  
  Reply
lihua jiang

December 7, 2019 at 8:55 pm

For top genes, how many of them should we use? Is there a reasonable cutoff for loading scores? I read some discussions and someone suggested not to use the genes with loading score less than 0.3. Any advice?

Reply
- Josh
  
  December 7, 2019 at 9:27 pm
  
  There’s no real cutoff. It just depends on your data and what you hope to get out of it. Often, instead of using a strict cutoff, people will take the top 100 genes and see if that works for them.
  
  Reply
Astianax Cabrera

March 7, 2021 at 5:03 pm

im a newbie. i got to this website cuz i bought a subscription (double-bam) to your stat-quest channel in youtube. is there anyway i could get my hands on the data set you used for this example? although the video provides a great explanation of the method i have some questions regarding details, namely why gene gene 1 is high, while gene 2 is low when their values are so close. i thought that if i processed the original data set in r or pandas i would get whatever im missing. thank you.

Reply
- starmer
  
  March 7, 2021 at 5:11 pm
  
  If you want to play around with PCA, I would recommend following my “PCA in R” https://youtu.be/0Jp4gsfOLMs (and code: https://github.com/StatQuest/pca_demo/blob/master/pca_demo.R ) and “PCA in Python” https://youtu.be/Lsue2gEM9D0 (and code: https://github.com/StatQuest/pca_demo/blob/master/pca_demo.py ).
  
  Reply
AGloutney

January 8, 2024 at 2:45 am

Hello,

Thank you for this splendid tutorial.

I have a question about the code. For the below code line, what is the ”t” ?

pca <- prcomp(t(data.matrix), scale=TRUE)

Reply
- starmer
  
  January 8, 2024 at 1:00 pm
  
  See: https://youtu.be/0Jp4gsfOLMs?si=lNEdswk_hVkwDYmS&t=162
  
  Reply

StatQuest!!!

An epic journey through statistics and machine learning

PCA clearly explained

45 thoughts on “PCA clearly explained”

Leave a Reply Cancel reply