Here’s a link to the source code on the StatQuest GitHub.
Uncategorized
StatQuest: K-Nearest Neighbors
You’re going to freak out when you find out how easy it is to classify data!
StatQuest: Hierarchical Clustering
[sourcecode language=”R”]
## NOTE: You can only have a few hundred rows and columns in a heatmap
## Any more and the program will need to much memory to run. Also, it’s
## hard to distinguish things when there are so many rows/columns.
## Usually I pick the top 100 genes of interest and focus the heatmap on those
## There are, of course, a million ways to pick 100 "genes of interest".
## You might know what they are by doing differential gene expression
## you you might just pick the top 100 with the most variation accross the
## samples. Anything goes as long as it helps you gain insight into the data.
## first lets "make up some data"..
sample1 <- c(rnorm(n=5, mean=10), rnorm(n=5, mean=0), rnorm(n=5, mean=20))
sample2 <- c(rnorm(n=5, mean=0), rnorm(n=5, mean=10), rnorm(n=5, mean=10))
sample3 <- c(rnorm(n=5, mean=10), rnorm(n=5, mean=0), rnorm(n=5, mean=20))
sample4 <- c(rnorm(n=5, mean=0), rnorm(n=5, mean=10), rnorm(n=5, mean=10))
sample5 <- c(rnorm(n=5, mean=10), rnorm(n=5, mean=0), rnorm(n=5, mean=20))
sample6 <- c(rnorm(n=5, mean=0), rnorm(n=5, mean=10), rnorm(n=5, mean=10))
data <- data.frame(sample1, sample2, sample3, sample4, sample5, sample6)
head(data)
## draw heatmap without clustering
heatmap(as.matrix(data), Rowv=NA, Colv=NA)
## draw heatmap with clustering and use default settings for everything
##
## The heatmap function defaults to using…
## The Eucldian distance to compare rows or columns.
## "Complete-linkage" to compare clusters.
##
## Also, heatmap defaults to scaling the rows so that each row has mean=0
## and standard deviation of 1. This is done to keep the range of colors
## needed to show differences to a reasonable amount. There are only so
## many shades of a color we can easily differenitate…
heatmap(as.matrix(data))
## draw heatmap with clustering and custom settings…
## We’ll use the ‘manhattan’ distance to compare rows/columns
## And "centroid" to compare clusters.
## The differences are… subtle…
heatmap(as.matrix(data),
distfun=function(x) dist(x, method="manhattan"),
hclustfun=function(x) hclust(x, method="centroid"))
## Now do the same thing, but this time, scale columns instead of rows…
heatmap(as.matrix(data),
distfun=function(x) dist(x, method="manhattan"),
hclustfun=function(x) hclust(x, method="centroid"),
scale="column")
## if you want to save a heatmap to a PDF file….
pdf(file="my_first_heatmap.pdf", width=5, height=6)
heatmap(as.matrix(data))
dev.off()
[/sourcecode]
StatQuest: Fitting a curve to data, aka Lowess, aka Loess
As an added bonus, here’s some code that shows you how to use “lowess()” and “loess()” in R.
StatQuest: Filtering genes with low read counts.
Here’s a template for adding DESeq2’s filtering method to an edgeR analysis pipeline.
[sourcecode language=”R”]
## NOTE: If you use this template, be sure to cite both edgeR and
## DESeq2 since this template uses DESeq2’s method for identifying
## an optimal minimum sequencing depth for a gene to be
## considered transcribed.
## CITE:
## Robinson MD, McCarthy DJ and Smyth GK (2010). “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics, 26, pp. -1.
## Love MI, Huber W and Anders S (2014). “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology, 15, pp. 550. doi: 10.1186/s13059-014-0550-8.
require(edgeR) ## this command just loads the edgeR module/library
## this next step loads "genefilter". If you don’t have genefilter
## it tries to install it for you.
if (!require(genefilter)) {
print("Trying to install genefilter")
source("https://bioconductor.org/biocLite.R")
biocLite("genefilter")
if (require(genefilter)) {
print("genefilter successfully installed")
} else {
stop("could not install genefilter")
}
}
###
### CHANGE THE VALUE FOR "root.name".
### root.name is the the start of all output file names. For example, if you
### set root.name like this:
###
### root.name <- "wt_v_ko_edgeR_20141210_"
###
### then the output files will be named:
###
### wt_v_ko_edgeR_20141210_results.txt <- that contains all statistical tests
### wt_v_ko_edgeR_20141210_sig_results.txt <- just the genes with FDR < 0.05
### wt_v_ko_edgeR_20141210_mds.pdf <- the MDS (like PCA) plot
### wt_v_ko_edgeR_20141210_smear.pdf <- the smear plot
###
root.name <- "results_"
###
### That said, you can run the analysis without actually saving the results to
### files. If you want to do that, change the following variable
###
#save.files <- FALSE # if you don’t want to actually save the output, use FALSE
save.files <- TRUE # if you want to save the output to files, use TRUE
###
### This is where you read in the reads/count data files for WT and KO.
### I’m assuming you’ve got 3 samples of each. If you have more, or less, you’ll
### have to change this code a little bit (add or subtract lines for reading
### files in, and then change a few other variables below…)
###
### CHANGE THE VALUES FOR THE FILENAMES (i.e. change "wt.count.txt" to something
### meaningful. You can also specify paths in the filename if you need to. i.e.
### "my_data_directory/my_wt1_data.txt"
###
wt1 <- read.delim(file="wt1.count.txt", sep="\t", stringsAsFactors=FALSE, header=FALSE)
wt2 <- read.delim(file="wt2.count.txt", sep="\t", stringsAsFactors=FALSE, header=FALSE)
wt3 <- read.delim(file="wt3.count.txt", sep="\t", stringsAsFactors=FALSE, header=FALSE)
ko1 <- read.delim(file="ko1.count.txt", sep="\t", stringsAsFactors=FALSE, header=FALSE)
ko2 <- read.delim(file="ko2.count.txt", sep="\t", stringsAsFactors=FALSE, header=FALSE)
ko3 <- read.delim(file="ko3.count.txt", sep="\t", stringsAsFactors=FALSE, header=FALSE)
###
### If you used "htseq_count" to count the reads, then each file will have the
### following format:
### geneName1 count1
### geneName2 count2
### geneName3 count3
### …
### geneNameX countX
###
### For edgeR, we have to consolidate the counts into a single table that looks
### like this:
### geneName1 wt1.count1 wt2.count1 wt3.count1 ko1.count1 ko2.count1 …
### geneName2 wt1.count2 wt2.count2 wt3.count2 ko1.count2 ko2.count2 …
### geneName3 wt1.count3 wt2.count3 wt3.count3 ko1.count3 ko2.count3 …
### ….
### We do this by making a new "data frame". V1 refers to the data in the column
### 1 (the gene names), and V2 refers to the data in column2 (the read counts
### per gene). In the command below, we are creating a column called "id" that
### is set to the gene names (we could use any file for this, since they are the
### same in all files, so we just pick the first file we read in). We then
### create columns for the read counts for all wt and ko datasets.
###
### CHANGE THE NAMES AND NUMBER OF READ COUNT COLUMNS IF YOU CHANGED THE
### FILES YOU READ IN. NOTE: it’s a good idea to do all off the WT columns
### followed by all of the KO columns (or the other way around).
###
raw.data <- data.frame(id=wt1$V1,
wt1=wt1$V2, wt2=wt2$V2, wt3=wt3$V2,
ko1=ko1$V2, ko2=ko2$V2, ko3=ko3$V2)
### If you used htseq-count to count the reads per gene, then the last five
### rows of data are not for specific genes, but are summaries of how the
### counting went. We must remove them from our new data frame, or else
### edgeR will treat those last 5 rows as individual genes and throw off
### the statistics.
tail(raw.data)
raw.data <- raw.data[1:(nrow(raw.data)-5),]
head(raw.data) ### Look at the first 6 rows of data to make sure it’s reasonable
tail(raw.data) ### Look at the last 6 rows of data to make sure it’s reasonable
nrow(raw.data) ### This prints out the number of genes in the data table
###
### raw.data is still just a general "table" (described above). Here we are
### converting that table into something that is specific to edgeR.
y <- DGEList(counts=raw.data[,2:ncol(raw.data)], genes=raw.data[,1])
###
### Here we are just doing more things that edgeR requires.
###
rownames(y) <- 1:nrow(y$counts)
y <- calcNormFactors(y)
###
### Here is where we tell edgeR how to group the samples. 1 represnts one group
### and 2 represents another group.
### CHANGE THIS IF YOU CHANGED THE FILES YOU READ IN.
### EXAMPLES:
### If you read in 3 WT and 3 KO, you want: 1,1,1,2,2,2
###
### If you read in 4 WT and 2 KO, you want: 1,1,1,1,2,2
###
### If you read in 2 WT and 6 KO, you want: 1,1,2,2,2,2,2,2
###
replicate <- factor(c(1,1,1,2,2,2))
###
### The following commands set up the tests and then execute them
###
design <- model.matrix(~replicate)
rownames(design) <- colnames(y)
design
## Notes about the next three fucntion calls…
## These can be consolodated to one call by using "estimateDisp(y, design)"
## I’m not doing that because I only want the estimated dispersion
## parameters (Disp and BCV) printed to the screen, and estimateDisp() doesn’t
## give you that option. It prints a ton of stuff, or nothing at all.
y <- estimateGLMCommonDisp(y, design, verbose=TRUE)
y <- estimateGLMTrendedDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)
y.fit <- glmFit(y, design)
y.lrt <- glmLRT(y.fit)
##
## Here we figure out the optimal way to filter out lowly expressed genes.
## We do this so that the FDR correction has more power – it doesn’t filter
## out as many True Positives. The few genes that FDR adjusts, the more true
## positives we’ll get (however, there’s a point where you are filtering out
## true positives, so you can’t just toss out all genes)
##
reads.cpm <- cpm(y)
unfiltered.results <- data.frame(id=y$genes, reads.cpm, y.lrt$table)
## NOTE: If you use the following code to se
## We’re making "filter" equal the second largest CPM value for
## each gene, rather than the average CPM value for each gene.
## Setting it to the second largest CPM makes filter act like
## edgeR’s selection where at least 2 samples have to have
## CPMs > MIN.CPM
filter <- apply(X=reads.cpm, MARGIN=1,
FUN=function(data) {data[order(rank(data), decreasing=TRUE)[2]]})
lowerQuantile <- mean(filter == 0)
if (lowerQuantile < .95) upperQuantile <- .95 else upperQuantile <- 1
theta <- seq(lowerQuantile, upperQuantile, length=50)
filtPadj <- filtered_p(filter=filter, test=unfiltered.results$PValue,
theta=theta, method="BH")
min.fdr <- 0.05
numRej <- colSums(filtPadj < min.fdr, na.rm = TRUE)
filter.quantiles <- quantile(filter, probs=theta)
lo.fit.theta <- lowess(numRej ~ theta, f=1/5)
if (max(numRej) <= 10) {
j <- 1
} else {
residual <- if (all(numRej==0)) {
0
} else {
numRej[numRej > 0] – lo.fit.theta$y[numRej > 0]
}
thresh <- max(lo.fit.theta$y) – sqrt(mean(residual^2))
j <- if (any(numRej > thresh)) {
which(numRej > thresh)[1]
} else {
1
}
}
plot(theta, numRej, type="b", xlab="", ylab="", lwd=3,
frame.plot=FALSE, col="black")
if (save.files) {
file.name = paste(root.name, "cpm_thresholds.pdf", sep="")
print(file.name)
pdf(file=file.name)
plot(theta, numRej, main="Threshold for optimal FDR correction", type="b", lwd=3,
frame.plot=FALSE, col="black", ylab="# Significant Genes", xlab="Quantile")
lines(lo.fit.theta$x[1:21], lo.fit.theta$y[1:21], col="red", lwd=3)
abline(h=thresh, col="blue", lwd=3)
dev.off()
}
filtered.results <- unfiltered.results
filtered.results$FDR <- filtPadj[, j, drop=TRUE]
filtered.results$de <- sign(filtered.results$logFC)*(filtered.results$FDR < 0.05)
filtered.results$sig <- abs(filtered.results$de)
filtered.results[is.na(filtered.results$sig),]$sig <- 0
###
### This command does a "head" on just the significantly DE genes. This let’s
### us, at a glance, make sure things worked as expected.
###
head(filtered.results[filtered.results$sig==1,])
if (save.files) {
###
### save results
###
file.name = paste(root.name, "results.txt", sep="")
print(file.name)
write.table(filtered.results, file=file.name, sep="\t", row.names=FALSE, quote=FALSE)
file.name = paste(root.name, "sig_results.txt", sep="")
print(file.name)
write.table(filtered.results[filtered.results$sig==1,], file=file.name, sep="\t", row.names=FALSE, quote=FALSE)
###
### save mds plot
###
file.name = paste(root.name, "mds.pdf", sep="")
print(file.name)
pdf(file=file.name)
plotMDS(y)
dev.off()
###
### save smear plot
###
file.name = paste(root.name, "smear.pdf", sep="")
print(file.name)
pdf(file=file.name, width=8, height=4)
y.min <- min(filtered.results$logFC)
y.max <- max(filtered.results$logFC)
if (abs(y.min) > y.max) {
y.max = abs(y.min)
} else {
y.min = -1 * y.max
}
x.min <- min(filtered.results$logCPM)
x.max <- max(filtered.results$logCPM)
insig <- subset(filtered.results, FDR > 0.05)
sig <- subset(filtered.results, FDR <= 0.05)
plot(x=insig$logCPM, y=insig$logFC, pch=16, col="#00000011", ylab="logFC", xlab="logCPM", ylim=c(y.min, y.max), xlim=c(x.min, x.max))
points(x=sig$logCPM, y=sig$logFC, pch=20, col="#FF000088")
dev.off()
plot(x=insig$logCPM, y=insig$logFC, pch=16, col="#00000011", ylab="logFC", xlab="logCPM", ylim=c(y.min, y.max), xlim=c(x.min, x.max))
points(x=sig$logCPM, y=sig$logFC, pch=20, col="#FF000088")
}
[/sourcecode]
StatQuest: Sampling from a distribution.
This is something we do all the time in StatQuest. It’s useful for exploring statistics (which is fun!). So learn about it if you don’t already know…
StatQuest: One vs Two-Tailed P-values
Here’s the deal!
StatQuest: edgeR part 1, Library Normalization
Here’s the dope on how edgeR normalized libraries…