StatQuest: PCA in Python

January 8, 2018

Here’s the link to the source code on the StatQuest GitHub.

10 thoughts on “StatQuest: PCA in Python”

Computer Science

June 7, 2018 at 2:58 pm

Thank you for this great article, I have shared it on Facebook.

- Josh
  
  June 7, 2018 at 2:58 pm
  
  Thanks!
  
Michelle

July 11, 2018 at 8:54 am

This is so very useful, especially for people who are not computer scientists. Thank you very much.

- Josh
  
  July 11, 2018 at 3:16 pm
  
  You’re welcome! 🙂
  
Hemon

August 30, 2018 at 5:52 am

Hi Josh, thanks for the informative videos; they are very helpful. I’m trying to run the code, but I’ve come across a head-scratcher. I’m getting issues around this line:
pca_df = pd.DataFrame(pca_data, index=[*wt, *ko], columns=labels)

pca_data is a 100×10 matrix, i.e. the transformed version of scaled_data, so it doesn’t make sense for the rows/index to be the 10 samples when there are 100 rows.
Did you mean that this should be pca.components_ instead?

I would appreciate it if you could clear up my confusion. Many thanks!

- Josh
  
  August 30, 2018 at 2:09 pm
  
  In the example code, we start by making a matrix with 100 rows and 10 columns (100×10). The rows are “genes” and the columns are samples (thus, there are 100 genes and 10 samples). This is the standard format for genomic data (genes as rows, samples as columns). However, PCA functions almost always expect the samples to be rows and the “variables” (which are the “genes” in this case) to be columns. So when we scale the data, we transpose it: scaled_data = preprocessing.scale(data.T)
  
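  Because we pass “preprocessing.scale()” the transposed data, “scaled_data” is also transposed (10 rows, one per sample), and we can pass it directly to pca.fit() without any more transposing. The transformed “pca_data” therefore also has one row per sample, which is why the 10 sample names work as its index. Does this make sense?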
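
A minimal sketch of the shapes involved may help here. It assumes random stand-in counts rather than the dataset from the post, and the names `genes`, `wt`, `ko`, and `labels` are illustrative choices, so treat it as an approximation of the original script rather than a copy of it.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.decomposition import PCA

# Stand-in data (an assumption): 100 genes as rows, 10 samples as columns.
rng = np.random.default_rng(0)
genes = [f"gene{i}" for i in range(1, 101)]
wt = [f"wt{i}" for i in range(1, 6)]
ko = [f"ko{i}" for i in range(1, 6)]
counts = rng.poisson(lam=100, size=(100, 10)).astype(float)
data = pd.DataFrame(counts, index=genes, columns=[*wt, *ko])

# Transpose so that samples become rows, then center and scale each gene.
scaled_data = preprocessing.scale(data.T)

# Fit the PCA and project the samples onto the principal components.
pca = PCA()
pca.fit(scaled_data)
pca_data = pca.transform(scaled_data)

# One row per sample, one column per principal component.
labels = [f"PC{i}" for i in range(1, pca_data.shape[1] + 1)]
pca_df = pd.DataFrame(pca_data, index=[*wt, *ko], columns=labels)

print("data:           ", data.shape)
print("scaled_data:    ", scaled_data.shape)
print("pca_data:       ", pca_data.shape)
print("pca.components_:", pca.components_.shape)
```

The printed shapes show where each object's rows come from: `pca_data` has one row per sample, while `pca.components_` holds the gene loadings for each component, so it is not the right input for a data frame indexed by sample names.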
  
  - Hemon
    
    August 30, 2018 at 9:34 pm
    
    Ah yes, this makes sense now – thanks! 🙂
    
sktrinh

October 4, 2018 at 3:52 am

Hi, thanks for sharing, it’s brilliant. I have a question about visualising more than 2 PCs. How can I plot 4 PCs? My scree plot suggests that 4 PCs explain most of the variance in my dataset. Can you please show me how to visualise 4 PCs? Thanks.

- Josh
  
  October 4, 2018 at 10:49 am
  
  You can make a bunch of graphs, like PC1 against PC2, then PC1 against PC3, then PC1 against PC4, then PC2 against PC3, etc.
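
One way to draw all of those pairings at once is a grid of scatter plots, sketched below. It assumes random stand-in data in place of a real dataset, and the 2×3 grid layout and the names `n_pcs` and `pairs` are illustrative choices, not part of the original code.

```python
from itertools import combinations

import matplotlib.pyplot as plt
import numpy as np
from sklearn import preprocessing
from sklearn.decomposition import PCA

# Stand-in data (an assumption): 10 samples as rows, 100 variables as columns.
rng = np.random.default_rng(0)
scaled_data = preprocessing.scale(rng.normal(size=(10, 100)))

# Keep the first four principal components.
n_pcs = 4
pca = PCA(n_components=n_pcs)
pca_data = pca.fit_transform(scaled_data)
per_var = np.round(pca.explained_variance_ratio_ * 100, decimals=1)

# Every unordered pair of components: 4 PCs give 6 pairs.
pairs = list(combinations(range(n_pcs), 2))

# One scatter plot per pair, arranged in a 2x3 grid.
fig, axes = plt.subplots(2, 3, figsize=(12, 7))
for ax, (i, j) in zip(axes.ravel(), pairs):
    ax.scatter(pca_data[:, i], pca_data[:, j])
    ax.set_xlabel(f"PC{i + 1} - {per_var[i]}%")
    ax.set_ylabel(f"PC{j + 1} - {per_var[j]}%")
    ax.set_title(f"PC{i + 1} vs PC{j + 1}")
fig.tight_layout()
```

Calling `plt.show()` afterwards displays the grid, and `fig.savefig(...)` writes it to a file instead.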
  
Pingback: StatQuest: PCA in Python | Quantum Code

StatQuest!!!

An epic journey through statistics and machine learning
