You might also want to know a few practical tips when doing PCA:

```r
## In this example, the data is in a matrix called data.matrix
## columns are individual samples (i.e. cells)
## rows are measurements taken for all the samples (i.e. genes)

## Just for the sake of the example, here's some made up data...
data.matrix <- matrix(nrow=100, ncol=10)
colnames(data.matrix) <- c(
  paste("wt", 1:5, sep=""),
  paste("ko", 1:5, sep=""))
rownames(data.matrix) <- paste("gene", 1:100, sep="")
for (i in 1:100) {
  wt.values <- rpois(5, lambda=sample(x=10:1000, size=1))
  ko.values <- rpois(5, lambda=sample(x=10:1000, size=1))
  data.matrix[i,] <- c(wt.values, ko.values)
}
head(data.matrix)
dim(data.matrix)

pca <- prcomp(t(data.matrix), scale=TRUE)

## plot pc1 and pc2
plot(pca$x[,1], pca$x[,2])

## make a scree plot
pca.var <- pca$sdev^2
pca.var.per <- round(pca.var/sum(pca.var)*100, 1)
barplot(pca.var.per, main="Scree Plot",
        xlab="Principal Component", ylab="Percent Variation")

## now make a fancy looking plot that shows the PCs and the variation:
library(ggplot2)
pca.data <- data.frame(Sample=rownames(pca$x),
                       X=pca$x[,1],
                       Y=pca$x[,2])
pca.data

ggplot(data=pca.data, aes(x=X, y=Y, label=Sample)) +
  geom_text() +
  xlab(paste("PC1 - ", pca.var.per[1], "%", sep="")) +
  ylab(paste("PC2 - ", pca.var.per[2], "%", sep="")) +
  theme_bw() +
  ggtitle("My PCA Graph")

## get the name of the top 10 measurements (genes) that contribute
## most to pc1.
loading_scores <- pca$rotation[,1]
gene_scores <- abs(loading_scores) ## get the magnitudes
gene_score_ranked <- sort(gene_scores, decreasing=TRUE)
top_10_genes <- names(gene_score_ranked[1:10])

top_10_genes ## show the names of the top 10 genes

pca$rotation[top_10_genes,1] ## show the scores (and +/- sign)

#######
##
## NOTE: Everything that follows is just bonus stuff.
## It simply demonstrates how to get the same
## results using "svd()" (Singular Value Decomposition) or using "eigen()"
## (Eigen Decomposition).
##
#######

############################################
##
## Now let's do the same thing with svd()
##
## svd() returns three things
## v = the "rotation" that prcomp() returns, this is a matrix of eigenvectors
##     in other words, a matrix of loading scores
## u = this is similar to the "x" that prcomp() returns. In other words,
##     sum(the rotation * the original data), but compressed to the unit vector
##     You can spread it out by multiplying by "d"
## d = this is similar to the "sdev" value that prcomp() returns (and thus
##     related to the eigen values), but not
##     scaled by sample size in an unbiased way (i.e. 1/(n-1)).
##     For prcomp(), sdev = sqrt(var) = sqrt(ss(fit)/(n-1))
##     For svd(), d = sqrt(ss(fit))
##
############################################

svd.stuff <- svd(scale(t(data.matrix), center=TRUE)) ## calculate the PCs

svd.data <- data.frame(Sample=colnames(data.matrix),
                       X=(svd.stuff$u[,1] * svd.stuff$d[1]),
                       Y=(svd.stuff$u[,2] * svd.stuff$d[2]))
svd.data

## alternatively, we could compute the PCs with the eigen vectors and the
## original data
svd.pcs <- t(t(svd.stuff$v) %*% t(scale(t(data.matrix), center=TRUE)))
svd.pcs[,1:2] ## the first two principal components

svd.df <- ncol(data.matrix) - 1
svd.var <- svd.stuff$d^2 / svd.df
svd.var.per <- round(svd.var/sum(svd.var)*100, 1)

ggplot(data=svd.data, aes(x=X, y=Y, label=Sample)) +
  geom_text() +
  xlab(paste("PC1 - ", svd.var.per[1], "%", sep="")) +
  ylab(paste("PC2 - ", svd.var.per[2], "%", sep="")) +
  theme_bw() +
  ggtitle("svd(scale(t(data.matrix), center=TRUE))")

############################################
##
## Now let's do the same thing with eigen()
##
## eigen() returns two things...
## vectors = eigen vectors (vectors of loading scores)
##           NOTE: pcs = sum(loading scores * values for sample)
## values = eigen values
##
############################################

cov.mat <- cov(scale(t(data.matrix), center=TRUE))
dim(cov.mat)

## since the covariance matrix is symmetric, we can tell eigen() to just
## work on the lower triangle with "symmetric=TRUE"
eigen.stuff <- eigen(cov.mat, symmetric=TRUE)
dim(eigen.stuff$vectors)
head(eigen.stuff$vectors[,1:2])

eigen.pcs <- t(t(eigen.stuff$vectors) %*% t(scale(t(data.matrix), center=TRUE)))
eigen.pcs[,1:2]

eigen.data <- data.frame(Sample=rownames(eigen.pcs),
                         X=(-1 * eigen.pcs[,1]), ## eigen() flips the X-axis in this case, so we flip it back
                         Y=eigen.pcs[,2]) ## X axis will be PC1, Y axis will be PC2
eigen.data

eigen.var.per <- round(eigen.stuff$values/sum(eigen.stuff$values)*100, 1)

ggplot(data=eigen.data, aes(x=X, y=Y, label=Sample)) +
  geom_text() +
  xlab(paste("PC1 - ", eigen.var.per[1], "%", sep="")) +
  ylab(paste("PC2 - ", eigen.var.per[2], "%", sep="")) +
  theme_bw() +
  ggtitle("eigen on cov(t(data.matrix))")
```
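The prcomp(), svd(), and eigen() approaches above should all recover the same principal components, up to an arbitrary sign flip per component. Here is a small self-contained sanity check of that claim (the matrix here is made up just for this check, with a smaller size than the main example):

```r
## A made-up matrix just for this check: 8 genes x 5 samples
set.seed(9)
m <- matrix(rpois(40, lambda=100), nrow=8, ncol=5,
            dimnames=list(paste0("gene", 1:8), paste0("s", 1:5)))
X <- scale(t(m))  ## samples as rows, centered and scaled

pca <- prcomp(t(m), scale=TRUE)
sv  <- svd(X)
eg  <- eigen(cov(X), symmetric=TRUE)

## PC1 from each method; svd()/eigen() can flip signs, so compare magnitudes
pc1.prcomp <- pca$x[,1]
pc1.svd    <- sv$u[,1] * sv$d[1]
pc1.eigen  <- (X %*% eg$vectors)[,1]

all.equal(abs(pc1.prcomp), abs(pc1.svd),   check.attributes=FALSE)  ## TRUE
all.equal(abs(pc1.prcomp), abs(pc1.eigen), check.attributes=FALSE)  ## TRUE
```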


Hi Josh, really helpful code! I have a question about the loadings – the loading scores are extracted from pca$rotation, and their values are between -1 and 1. What is the difference between these loadings and the values given by prcomp(t(data.matrix), scale=TRUE)$x?


I’m pretty sure I covered this in detail in the video, but just in case you missed it: The “x” values are the locations for the samples in the PCA graph (should you choose to draw it). The loadings reflect how much influence each variable has on each axis in the PCA graph.
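To make the distinction concrete, here is a small self-contained sketch (the matrix and its dimensions are made up for illustration):

```r
## A small made-up dataset: 6 genes (rows) x 4 samples (columns)
set.seed(1)
m <- matrix(rpois(24, lambda=100), nrow=6, ncol=4,
            dimnames=list(paste0("gene", 1:6), paste0("s", 1:4)))
pca <- prcomp(t(m), scale=TRUE)

dim(pca$x)        ## 4 x 4: one row of graph coordinates per sample
dim(pca$rotation) ## 6 x 4: one row of loading scores per gene

## The "x" values are just the scaled data multiplied by the loadings:
manual.pc1 <- scale(t(m)) %*% pca$rotation[,1]
all.equal(as.vector(manual.pc1), as.vector(pca$x[,1]))  ## TRUE
```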


Hi, Josh. Excellent and detailed presentation. But I am a little confused about whether raw read counts should be used for PCA, or whether RPKM/FPKM/TPM values should be used. After searching for answers, it seems that some people use read counts and some use RPKM/FPKM. Thanks a lot!


You should use “normalized” counts (i.e. you should use RPKM or TPM etc.) for PCA. This was just an example and I didn’t want to get into the details of RPKM/TPM – especially for people that are not biologists. But yes, when you’re working with real sequencing data, use the normalized counts, not the raw counts.
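As a rough sketch of what that can look like in practice – using a simple counts-per-million (CPM) plus log transform as a stand-in for whatever normalization your pipeline actually produces (the raw.counts matrix here is made up):

```r
## Made-up raw counts: 50 genes x 6 samples
set.seed(42)
raw.counts <- matrix(rpois(300, lambda=200), nrow=50, ncol=6,
                     dimnames=list(paste0("gene", 1:50), paste0("s", 1:6)))

## counts-per-million: divide each column by its library size, times 1e6
lib.sizes <- colSums(raw.counts)
cpm <- t(t(raw.counts) / lib.sizes) * 1e6

## log-transform; the +1 avoids log2(0) for genes with zero counts
log.cpm <- log2(cpm + 1)

## drop any gene with zero variance across samples (scale=TRUE would fail)
keep <- apply(log.cpm, 1, sd) > 0

## PCA on the normalized values, samples as rows
pca <- prcomp(t(log.cpm[keep, ]), scale=TRUE)
```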


OK. Thanks again for your kind reply.


Hi Josh, really clear explanation. I am wondering what the argument “scale” is doing in “prcomp()”. I found that the PC values with “scale=TRUE” and “scale=FALSE” are very different. Although the “wt” cluster is still on the left of the x-axis and the “ko” cluster is on the right, the relative positions of individual samples within each cluster differ.

In my own data there are many 0 counts, so it cannot be scaled. In this case, I don’t know whether it is okay to use the “scale = FALSE” result.


I’m glad you like the explanation. I explain the “scale” option in another video. Here’s the link: https://youtu.be/oRvgq966yZg
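For readers who can’t watch the video, here is a minimal self-contained sketch of the issue raised above: scale=TRUE divides each gene by its standard deviation, which fails for genes that are constant (e.g. all-zero counts), and one common workaround is to drop those genes first (the matrix here is made up for illustration):

```r
## Made-up counts: 10 genes x 4 samples, with one all-zero gene
set.seed(3)
m <- matrix(rpois(40, lambda=50), nrow=10, ncol=4,
            dimnames=list(paste0("gene", 1:10), paste0("s", 1:4)))
m[1, ] <- 0

## scale=TRUE divides each gene by its standard deviation, so a
## constant (all-zero) gene breaks it; drop constant genes first
keep <- apply(m, 1, sd) > 0
pca.scaled <- prcomp(t(m[keep, ]), scale=TRUE)  ## correlation-based PCA

## scale=FALSE works even with constant genes, but high-variance
## (often high-count) genes then dominate the PCs
pca.unscaled <- prcomp(t(m), scale=FALSE)       ## covariance-based PCA
```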


Regarding this video: https://statquest.org/2017/11/27/statquest-pca-in-r-clearly-explained/

Why don’t we create a 10 by 100 matrix instead of a 100 by 10 matrix, so that we do not need to use the t() function inside prcomp()?

Is there a particular reason?

Thank you very much.

Best Regards


Genetics data is almost always in the 100 x 10 format (rows = genes, columns = samples), and I’m a geneticist – so that’s what I see most of the time. However, other fields do it differently – so it’s important to know that when using the prcomp() function in R, the samples are supposed to be in rows, regardless of how the original data is formatted.
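A quick self-contained illustration of this point (the counts matrix is made up): if the data already has samples as rows, the t() call can simply be dropped, and the results are identical:

```r
## Made-up counts in the genetics layout: 10 genes x 5 samples
set.seed(7)
counts <- matrix(rpois(50, lambda=100), nrow=10, ncol=5,
                 dimnames=list(paste0("gene", 1:10), paste0("s", 1:5)))

## rows = genes, columns = samples, so transpose before prcomp()
pca <- prcomp(t(counts), scale=TRUE)
nrow(pca$x)  ## 5: one row of PC coordinates per sample

## if your data already has samples as rows (5 x 10), skip the t()
counts.by.sample <- t(counts)
pca2 <- prcomp(counts.by.sample, scale=TRUE)
all.equal(pca$x, pca2$x)  ## TRUE: only the input layout differed
```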
