An epic journey through statistics and machine learning

StatQuest: PCA in R, Clearly Explained!!!

You might also want to know a few practical tips when doing PCA:

Here’s a link to the source code on the StatQuest GitHub.

20 thoughts on “StatQuest: PCA in R, Clearly Explained!!!”

Hi Josh, really helpful code! I have a question about the loadings – the loading scores are extracted from pca$rotation and values are between -1 and 1. What are the differences between these loadings, and the values given by prcomp((data.matrix), scale=TRUE)$x ?

I’m pretty sure I covered this in detail in the video, but just in case you missed it: The “x” values are the locations for the samples in the PCA graph (should you choose to draw it). The loadings reflect how much influence each variable has on each axis in the PCA graph.

Hi, Josh. Excellent and detailed presentation. But I am a little confused that whether read counts should be used to do PCA, or RPKM/KPKM/TPM should be used? After searching for the answers, it seemed that some used read counts, and some used RPKM/FPKM. Thanks a lot!

You should use “normalized” counts (i.e. you should use RPKM or TPM etc.) for PCA. This was just an example and I didn’t want to get into the details of RPKM/TPM – especially for people that are not biologists. But yes, when you’re working with real sequencing data, use the normalized counts, not the raw counts.

Hi Josh, really clear explanation. I am wondering what the argument “scale” is doing in the “prcomp()”. I found that the PC values between “scale=TRUE” and “scale=FALSE” are very different. Although “wt” cluster is still on the left of the x axis and “ko” cluster is on the right， but the relative position of sample individuals within each cluster differs.

Because in my own data, there are many 0 counts, so it cannot be scaled. In this case, I don’t know whether it is okay if I use “scale = False ” result.

Genetics data is almost always in the 100 x 10 format (rows = genes, columns = sample), and I’m a geneticist – so that’s what I see most of the time. However, other fields do it differently – so it’s important to know that when using the prcomp() function in R, the samples are supposed to be rows, regardless of how the original data is formatted.

What should we do if my data set contain “NA”? Is there a way to still include those samples? ( I have a huge dataset with 2000 samples and more then a hundred “genes” (I am working with traits but whatever really) and most samples are missing at least 20 traits.) The 5 most important traits are present for all samples

It’s possible that there are some PCA packages out there that can handle NAs. I don’t know of any, but that doesn’t mean they don’t exist, so look around. Another thing you can do is impute the values. One way to impute values is with a Random Forest (I mention this method specifically because I have a video that describes how to do it – however, there are other methods). Here’s the link on how to impute values with a RandomForest: https://youtu.be/6EXPYzbfLCE

Since there are only 10 column names (since there are 10 different samples), but there are 100 row names (since there are 100 genes, or 100 variables that we measure per sample), it doesn’t make sense to plot 10 samples on a PCA plot and then try to label them with 100 names. However, you can re-do the PCA plot to use the samples as the variables and the variables as the samples. This would plot 100 genes on the PCA plot, clustered by sample. Then you could use the gene names as labels.

t-SNE has a fundamentally different approach to clustering samples and has no concept of a loading score or a way to rank variables based on importance to the clustering. Thus, I would be very surprised if t-SNE had something like pca$rotation, and I would be very skeptical of how to interpret it. Here’s a StatQuest that describes how t-SNE works: https://youtu.be/NEaUSP4YerM

Hi Josh, I am having trouble understanding what is going on in lines 34 to 37 of the code. I am trying to use ggplot on my data set. I have a gene expression dataset with transript id’s as the first column and every other column is a sample containing TPM values. In my sample preprocessing, I got rid of the first column, removed all rows that sum to zero, and I added 1 to all values in the dataset (because I got the log later on). What do I need to do to have it ready for ggplot?
Thanks for all your great content.

Hello, thank you for this useful tutorial.
I’m trying to do the PCA analysis on my data, but I’m having some troubles in constructing my matrix. I have 3 tests (I put them in columns) and 47 mice from 3 groups (13 Ctrl,20 Pres and 14Psus ,in rows). I also had to insert my data. Here is the matrix part of my code :

hi, eamsa. I’m trying to make sense of the matrix you want to build. You say you’ve got 47 mice and 3 tests. I assume you want a matrix 47 x 3. But with colnames() you assign 6 columns, i.e. ‘Avtes1’ ‘Avtes2’ ‘EP1’ ‘EP2’ ‘Sen1’ ‘Sen2’. Then you name the rows with 47 labels distibuted among the three groups you say the mice are split into, i.e. ctrl, pres an psus. That would result in a 47 x 6 matrix. Then you have three chr vectors (shouldn’t they be numeric vectors?) I guess you want to make the martix out of, i.e ctrl.values, pres.values and psus.value with lengths 39, 63 and 40. Shouldn’t all of them be 47 elements long?

Hi Josh, really helpful code! I have a question about the loadings – the loading scores are extracted from pca$rotation and values are between -1 and 1. What are the differences between these loadings, and the values given by prcomp((data.matrix), scale=TRUE)$x ?

I’m pretty sure I covered this in detail in the video, but just in case you missed it: The “x” values are the locations for the samples in the PCA graph (should you choose to draw it). The loadings reflect how much influence each variable has on each axis in the PCA graph.

Hi, Josh. Excellent and detailed presentation. But I am a little confused that whether read counts should be used to do PCA, or RPKM/KPKM/TPM should be used? After searching for the answers, it seemed that some used read counts, and some used RPKM/FPKM. Thanks a lot!

You should use “normalized” counts (i.e. you should use RPKM or TPM etc.) for PCA. This was just an example and I didn’t want to get into the details of RPKM/TPM – especially for people that are not biologists. But yes, when you’re working with real sequencing data, use the normalized counts, not the raw counts.

OK. Thanks again for your kind reply.

Hi Josh, really clear explanation. I am wondering what the argument “scale” is doing in the “prcomp()”. I found that the PC values between “scale=TRUE” and “scale=FALSE” are very different. Although “wt” cluster is still on the left of the x axis and “ko” cluster is on the right， but the relative position of sample individuals within each cluster differs.

Because in my own data, there are many 0 counts, so it cannot be scaled. In this case, I don’t know whether it is okay if I use “scale = False ” result.

I’m glad you like the explanation. I explain the “scale” option in another video. Here’s the link: https://youtu.be/oRvgq966yZg

In this https://statquest.org/2017/11/27/statquest-pca-in-r-clearly-explained/ video

Why don’t we draw a 10 by 100 matrix instead of 100 x 10 matrix so that we do not need to us t() function in the prcomp() function ?

Is there a particular reasons?

Thank you very much.

Best Regards

Genetics data is almost always in the 100 x 10 format (rows = genes, columns = sample), and I’m a geneticist – so that’s what I see most of the time. However, other fields do it differently – so it’s important to know that when using the prcomp() function in R, the samples are supposed to be rows, regardless of how the original data is formatted.

What should we do if my data set contain “NA”? Is there a way to still include those samples? ( I have a huge dataset with 2000 samples and more then a hundred “genes” (I am working with traits but whatever really) and most samples are missing at least 20 traits.) The 5 most important traits are present for all samples

It’s possible that there are some PCA packages out there that can handle NAs. I don’t know of any, but that doesn’t mean they don’t exist, so look around. Another thing you can do is impute the values. One way to impute values is with a Random Forest (I mention this method specifically because I have a video that describes how to do it – however, there are other methods). Here’s the link on how to impute values with a RandomForest: https://youtu.be/6EXPYzbfLCE

Hi Josh!

Thanks a lot for a very informative and good tutorial.

Is it possible to plot the column names instead of rownames in the ggplot? Tried to change that in the code:

pca.data <- data.frame(Sample=rownames(pca$x),

X=pca$x[,1],

Y=pca$x[,2])

columns instead of rownames but got an error message.

Thanks in advance!

Since there are only 10 column names (since there are 10 different samples), but there are 100 row names (since there are 100 genes, or 100 variables that we measure per sample), it doesn’t make sense to plot 10 samples on a PCA plot and then try to label them with 100 names. However, you can re-do the PCA plot to use the samples as the variables and the variables as the samples. This would plot 100 genes on the PCA plot, clustered by sample. Then you could use the gene names as labels.

By chance, would it be possible to extract gene names after performing t-SNE, similarly to prcomp? Does t-SNE has an option similar to pca$rotation?

t-SNE has a fundamentally different approach to clustering samples and has no concept of a loading score or a way to rank variables based on importance to the clustering. Thus, I would be very surprised if t-SNE had something like pca$rotation, and I would be very skeptical of how to interpret it. Here’s a StatQuest that describes how t-SNE works: https://youtu.be/NEaUSP4YerM

There is another question that I would like to ask. How can I run LASSO-modified PCA? Is any special function for it?

Hi Josh, I am having trouble understanding what is going on in lines 34 to 37 of the code. I am trying to use ggplot on my data set. I have a gene expression dataset with transript id’s as the first column and every other column is a sample containing TPM values. In my sample preprocessing, I got rid of the first column, removed all rows that sum to zero, and I added 1 to all values in the dataset (because I got the log later on). What do I need to do to have it ready for ggplot?

Thanks for all your great content.

Hello, thank you for this useful tutorial.

I’m trying to do the PCA analysis on my data, but I’m having some troubles in constructing my matrix. I have 3 tests (I put them in columns) and 47 mice from 3 groups (13 Ctrl,20 Pres and 14Psus ,in rows). I also had to insert my data. Here is the matrix part of my code :

data.matrix <- matrix(nrow=47, ncol=3)

colnames(data.matrix) <- c(

paste("Avtes", 1:2, sep=""),

paste("EP", 1:2, sep=""),

paste("Sen",1:2, sep="")

rownames(data.matrix) <- c(

paste("Ctrl", 1:13, sep=""),

paste("Pres", 14:34, sep="")),

paste("Psus", 35:47, sep="")) {

Ctrl.values <-c('79','44','33','90','159','173','122','240','184','18','162','230','16','0.54','0.46','0.72','0.60','0.68','0.54','0.66','0.27','0.47','0.44','0.37','0.05','0.62''25','0','0','0','11','66','0','0','0','0','0','0','0')

Pres.values<-c('200','40','400','55','45','9','171','480','215','480','48','26','366','148','75','42.8','97.5','25.6','91.9','194','120','0.70','0.60','0.72','0.92','0.63','0.49','0.48','0.51','0.51','0.46','0.72','0.55','0.46','0.59','0.47','0.47','0.65','0.34','0.64','0.64','0.35','0','5','0','0','0','18','0','0','0','3,33','0','0','0','0','0','9','7,66','9','46,5','0','3,33')

Psus.values<-c('480','480','480','480','480','480','366','245','480','266.2','480','480','480','0.49','0.94','0.47','0.71','0.81','0.56','0.79','0.74','0.55','0.60','0.86','0.79','0.94','0','43,33','0','11,66','0','0','16','0','26,66','18','45','2,16','57,33','9,5')

data.matrix[i,] <- c(Avtes.values, EP.values,Sen.values)

}

head(data.matrix)

dim(data.matrix)

Can anyone review it please?

Thank you !

PS: I'm not used to programming in R (and in general), so, excuse my potentially dumb mistakes.

hi, eamsa. I’m trying to make sense of the matrix you want to build. You say you’ve got 47 mice and 3 tests. I assume you want a matrix 47 x 3. But with colnames() you assign 6 columns, i.e. ‘Avtes1’ ‘Avtes2’ ‘EP1’ ‘EP2’ ‘Sen1’ ‘Sen2’. Then you name the rows with 47 labels distibuted among the three groups you say the mice are split into, i.e. ctrl, pres an psus. That would result in a 47 x 6 matrix. Then you have three chr vectors (shouldn’t they be numeric vectors?) I guess you want to make the martix out of, i.e ctrl.values, pres.values and psus.value with lengths 39, 63 and 40. Shouldn’t all of them be 47 elements long?

Very useful many thanks