Update: People have asked for PowerPoint slides for this presentation. Here they are! If you use them, please cite me. Thank you so much for your support.
Update: A lot of people ask to see example code that shows how to actually do PCA. Here’s how to do it in R
## In this example, the data is in a matrix called ## data.matrix ## columns are individual samples (i.e. cells) ## rows are measurements taken for all the samples (i.e. genes) pca <- prcomp(t(data.matrix), scale=TRUE) ## plot pc1 and pc2 plot(pca$x[,1], pca$x[,2]) ## get the name of the sample (cell) with the highest pc1 value rownames(pca$x)[pca$x[(order(pca$x[,1], decreasing=TRUE))]] ## get the name of the top 10 measurements (genes) that contribute ## most to pc1. loading_scores <- pca$rotation[,1] gene_scores <- abs(loading_scores) ## get the magnitudes gene_score_ranked <- sort(gene_scores, decreasing=TRUE) top_10_genes <- names(gene_score_ranked[1:10]) top_10_genes ## show the names of the top 10 genes ## get the scree information pca.var <- pca$sdev^2 scree <- pca.var/sum(pca.var) plot((scree[1:10]*100), main="Skree Plot", xlab="Principal Component", ylab="Percent Variation")
RNA-seq results often contain a PCA (Principal Component Analysis) or MDS plot. Usually we use these graphs to verify that the control samples cluster together. However, there’s a lot more going on, and if you are willing to dive in, you can extract a lot more information from these plots. The good news is that PCA only sounds complicated. Conceptually, it’s actually quite simple.
In this video I clearly explain how PCA graphs are generated, how to interpret them, and how to determine if the plot is informative or not. I then show how you can extract extra information out of the data used to draw the graph.