StatQuest: PCA in Python

## NOTE: This is Python 3 code.
import pandas as pd
import numpy as np
import random as rd
from sklearn.decomposition import PCA
from sklearn import preprocessing
import matplotlib.pyplot as plt # NOTE: This was tested with matplotlib v. 2.1.0

#########################
#
# Data Generation Code
#
#########################
## In this example, the data is in a data frame called data.
## Columns are individual samples (i.e. cells)
## Rows are measurements taken for all the samples (i.e. genes)
## Just for the sake of the example, we'll use made up data...
genes = ['gene' + str(i) for i in range(1,101)]

wt = ['wt' + str(i) for i in range(1,6)]
ko = ['ko' + str(i) for i in range(1,6)]

data = pd.DataFrame(columns=[*wt, *ko], index=genes)

for gene in data.index:
    data.loc[gene,'wt1':'wt5'] = np.random.poisson(lam=rd.randrange(10,1000), size=5)
    data.loc[gene,'ko1':'ko5'] = np.random.poisson(lam=rd.randrange(10,1000), size=5)

print(data.head())
print(data.shape)

#########################
#
# Perform PCA on the data
#
#########################
# First center and scale the data
scaled_data = preprocessing.scale(data.T)

pca = PCA() # create a PCA object
pca.fit(scaled_data) # do the math
pca_data = pca.transform(scaled_data) # get PCA coordinates for scaled_data

#########################
#
# Draw a scree plot and a PCA plot
#
#########################

#The following code constructs the Scree plot
per_var = np.round(pca.explained_variance_ratio_* 100, decimals=1)
labels = ['PC' + str(x) for x in range(1, len(per_var)+1)]

plt.bar(x=range(1,len(per_var)+1), height=per_var, tick_label=labels)
plt.ylabel('Percentage of Explained Variance')
plt.xlabel('Principal Component')
plt.title('Scree Plot')
plt.show()

#the following code makes a fancy looking plot using PC1 and PC2
pca_df = pd.DataFrame(pca_data, index=[*wt, *ko], columns=labels)

plt.scatter(pca_df.PC1, pca_df.PC2)
plt.title('My PCA Graph')
plt.xlabel('PC1 - {0}%'.format(per_var[0]))
plt.ylabel('PC2 - {0}%'.format(per_var[1]))

for sample in pca_df.index:
    plt.annotate(sample, (pca_df.PC1.loc[sample], pca_df.PC2.loc[sample]))

plt.show()

#########################
#
# Determine which genes had the biggest influence on PC1
#
#########################

## get the name of the top 10 measurements (genes) that contribute
## most to pc1.
## first, get the loading scores
loading_scores = pd.Series(pca.components_[0], index=genes)
## now sort the loading scores based on their magnitude
sorted_loading_scores = loading_scores.abs().sort_values(ascending=False)

# get the names of the top 10 genes
top_10_genes = sorted_loading_scores[0:10].index.values

## print the gene names and their scores (and +/- sign)
print(loading_scores[top_10_genes])<span id="mce_SELREST_start" style="overflow:hidden;line-height:0;"></span>
Advertisements

9 thoughts on “StatQuest: PCA in Python

  1. Hi Josh, thanks for the informative videos – they are very helpful. I

    ‘m trying to run the code but I’ve come across a head scratcher, I’m getting issues around this line:
    pca_df = pd.DataFrame(pca_data, index=[*wt, *ko], columns=labels)

    pca_data is a 100×10 vector, ie. the transformed version of scaled_data.

    and so it doesn’t make sense for the rows/index to be the 10 samples as there are 100 rows???
    Did you mean that this should be pca.components_ instead?

    Would appreciate if you can clear up my confusion, much thanks!

    Like

    • In the example code, we start by making a matrix with 100 rows and 10 columns (100×10). The rows are “genes” and the columns are samples (thus, there are 100 genes, and 10 sample). This format is the standard format for genomic data (genes are rows, samples as columns). However, PCA functions almost always expect the samples to be rows and the “variables” (which are the “genes” in this case) to be columns. So when we scale the data we transpose it: scaled_data = preprocessing.scale(data.T)

      As a result of passing “preprocessing.scale()” the transposed data, “scaled_data” is also transposed and we can pass this directly to pca.fit() without any more transposing. Does this make sense?

      Like

  2. hi, thanks for sharing, it’s brilliant. i have a question about visualising more than 2 PCs. How can I plot 4 PCs? My spree plot suggest 4 PCs explains most of the variance in the my dataset. Can you please direct me how to visualise 4 PCs? Thanks.

    Like

    • You can make a bunch of graphs, like PC1 against PC2, then PC1 agains PC2, then PC1 against PC3, then PC1 against PC4, then PC2 against PC3 etc.

      Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s