StatQuest: Filtering genes with low read counts.

Here’s a template for adding DESeq2’s filtering method to an edgeR analysis pipeline.

## NOTE: If you use this template, be sure to cite both edgeR and
## DESeq2 since this template uses DESeq2's method for identifying
## an optimal minimum sequencing depth for a gene to be
## considered transcribed.

## CITE:
## Robinson MD, McCarthy DJ and Smyth GK (2010). “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics, 26, pp. -1.
## Love MI, Huber W and Anders S (2014). “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology, 15, pp. 550. doi: 10.1186/s13059-014-0550-8.

require(edgeR) ## this command just loads the edgeR module/library

## this next step loads "genefilter". If you don't have genefilter
## it tries to install it for you.
if (!require(genefilter)) {
print("Trying to install genefilter")
source("https://bioconductor.org/biocLite.R")
biocLite("genefilter")
if (require(genefilter)) {
print("genefilter successfully installed")
} else {
stop("could not install genefilter")
}
}

###
### CHANGE THE VALUE FOR "root.name".
### root.name is the the start of all output file names. For example, if you
### set root.name like this:
###
### root.name <- "wt_v_ko_edgeR_20141210_"
###
### then the output files will be named:
###
### wt_v_ko_edgeR_20141210_results.txt <- that contains all statistical tests
### wt_v_ko_edgeR_20141210_sig_results.txt <- just the genes with FDR < 0.05
### wt_v_ko_edgeR_20141210_mds.pdf <- the MDS (like PCA) plot
### wt_v_ko_edgeR_20141210_smear.pdf <- the smear plot
###
root.name <- "results_"

###
### That said, you can run the analysis without actually saving the results to
### files. If you want to do that, change the following variable
###
#save.files <- FALSE # if you don't want to actually save the output, use FALSE
save.files <- TRUE # if you want to save the output to files, use TRUE


###
### This is where you read in the reads/count data files for WT and KO.
### I'm assuming you've got 3 samples of each. If you have more, or less, you'll
### have to change this code a little bit (add or subtract lines for reading
### files in, and then change a few other variables below...)
###
### CHANGE THE VALUES FOR THE FILENAMES (i.e. change "wt.count.txt" to something
### meaningful. You can also specify paths in the filename if you need to. i.e.
### "my_data_directory/my_wt1_data.txt"
###
wt1 <- read.delim(file="wt1.count.txt", sep="\t", stringsAsFactors=FALSE, header=FALSE)
wt2 <- read.delim(file="wt2.count.txt", sep="\t", stringsAsFactors=FALSE, header=FALSE)
wt3 <- read.delim(file="wt3.count.txt", sep="\t", stringsAsFactors=FALSE, header=FALSE)

ko1 <- read.delim(file="ko1.count.txt", sep="\t", stringsAsFactors=FALSE, header=FALSE)
ko2 <- read.delim(file="ko2.count.txt", sep="\t", stringsAsFactors=FALSE, header=FALSE)
ko3 <- read.delim(file="ko3.count.txt", sep="\t", stringsAsFactors=FALSE, header=FALSE)

###
### If you used "htseq_count" to count the reads, then each file will have the
### following format:
### geneName1 count1
### geneName2 count2
### geneName3 count3
### ...
### geneNameX countX
###
### For edgeR, we have to consolidate the counts into a single table that looks
### like this:
### geneName1 wt1.count1 wt2.count1 wt3.count1 ko1.count1 ko2.count1 ...
### geneName2 wt1.count2 wt2.count2 wt3.count2 ko1.count2 ko2.count2 ...
### geneName3 wt1.count3 wt2.count3 wt3.count3 ko1.count3 ko2.count3 ...
### ....
### We do this by making a new "data frame". V1 refers to the data in the column
### 1 (the gene names), and V2 refers to the data in column2 (the read counts
### per gene). In the command below, we are creating a column called "id" that
### is set to the gene names (we could use any file for this, since they are the
### same in all files, so we just pick the first file we read in). We then
### create columns for the read counts for all wt and ko datasets.
###
### CHANGE THE NAMES AND NUMBER OF READ COUNT COLUMNS IF YOU CHANGED THE
### FILES YOU READ IN. NOTE: it's a good idea to do all off the WT columns
### followed by all of the KO columns (or the other way around).
###
raw.data <- data.frame(id=wt1$V1,
wt1=wt1$V2, wt2=wt2$V2, wt3=wt3$V2,
ko1=ko1$V2, ko2=ko2$V2, ko3=ko3$V2)

### If you used htseq-count to count the reads per gene, then the last five
### rows of data are not for specific genes, but are summaries of how the
### counting went. We must remove them from our new data frame, or else
### edgeR will treat those last 5 rows as individual genes and throw off
### the statistics.
tail(raw.data)
raw.data <- raw.data[1:(nrow(raw.data)-5),]

head(raw.data) ### Look at the first 6 rows of data to make sure it's reasonable
tail(raw.data) ### Look at the last 6 rows of data to make sure it's reasonable
nrow(raw.data) ### This prints out the number of genes in the data table

###
### raw.data is still just a general "table" (described above). Here we are
### converting that table into something that is specific to edgeR.
y <- DGEList(counts=raw.data[,2:ncol(raw.data)], genes=raw.data[,1])

###
### Here we are just doing more things that edgeR requires.
###
rownames(y) <- 1:nrow(y$counts)
y <- calcNormFactors(y)


###
### Here is where we tell edgeR how to group the samples. 1 represnts one group
### and 2 represents another group.
### CHANGE THIS IF YOU CHANGED THE FILES YOU READ IN.
### EXAMPLES:
### If you read in 3 WT and 3 KO, you want: 1,1,1,2,2,2
###
### If you read in 4 WT and 2 KO, you want: 1,1,1,1,2,2
###
### If you read in 2 WT and 6 KO, you want: 1,1,2,2,2,2,2,2
###
replicate <- factor(c(1,1,1,2,2,2))

###
### The following commands set up the tests and then execute them
###
design <- model.matrix(~replicate)
rownames(design) <- colnames(y)
design

## Notes about the next three fucntion calls...
## These can be consolodated to one call by using "estimateDisp(y, design)"
## I'm not doing that because I only want the estimated dispersion
## parameters (Disp and BCV) printed to the screen, and estimateDisp() doesn't
## give you that option. It prints a ton of stuff, or nothing at all.
y <- estimateGLMCommonDisp(y, design, verbose=TRUE)
y <- estimateGLMTrendedDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)

y.fit <- glmFit(y, design)
y.lrt <- glmLRT(y.fit)


##
## Here we figure out the optimal way to filter out lowly expressed genes.
## We do this so that the FDR correction has more power - it doesn't filter
## out as many True Positives. The few genes that FDR adjusts, the more true
## positives we'll get (however, there's a point where you are filtering out
## true positives, so you can't just toss out all genes)
##
reads.cpm <- cpm(y)
unfiltered.results <- data.frame(id=y$genes, reads.cpm, y.lrt$table)

## NOTE: If you use the following code to se
## We're making "filter" equal the second largest CPM value for
## each gene, rather than the average CPM value for each gene.
## Setting it to the second largest CPM makes filter act like
## edgeR's selection where at least 2 samples have to have
## CPMs > MIN.CPM
filter <- apply(X=reads.cpm, MARGIN=1,
FUN=function(data) {data[order(rank(data), decreasing=TRUE)[2]]})

lowerQuantile <- mean(filter == 0)
if (lowerQuantile < .95) upperQuantile <- .95 else upperQuantile <- 1
theta <- seq(lowerQuantile, upperQuantile, length=50)

filtPadj <- filtered_p(filter=filter, test=unfiltered.results$PValue,
theta=theta, method="BH")

min.fdr <- 0.05
numRej <- colSums(filtPadj < min.fdr, na.rm = TRUE)

filter.quantiles <- quantile(filter, probs=theta)

lo.fit.theta <- lowess(numRej ~ theta, f=1/5)

if (max(numRej) <= 10) {
j <- 1
} else {
residual <- if (all(numRej==0)) {
0
} else {
numRej[numRej > 0] - lo.fit.theta$y[numRej > 0]
}
thresh <- max(lo.fit.theta$y) - sqrt(mean(residual^2))
j <- if (any(numRej > thresh)) {
which(numRej > thresh)[1]
} else {
1
}
}

plot(theta, numRej, type="b", xlab="", ylab="", lwd=3,
frame.plot=FALSE, col="black")

if (save.files) {
file.name = paste(root.name, "cpm_thresholds.pdf", sep="")
print(file.name)
pdf(file=file.name)

plot(theta, numRej, main="Threshold for optimal FDR correction", type="b", lwd=3,
frame.plot=FALSE, col="black", ylab="# Significant Genes", xlab="Quantile")

lines(lo.fit.theta$x[1:21], lo.fit.theta$y[1:21], col="red", lwd=3)
abline(h=thresh, col="blue", lwd=3)

dev.off()
}

filtered.results <- unfiltered.results
filtered.results$FDR <- filtPadj[, j, drop=TRUE]

filtered.results$de <- sign(filtered.results$logFC)*(filtered.results$FDR < 0.05)
filtered.results$sig <- abs(filtered.results$de)

filtered.results[is.na(filtered.results$sig),]$sig <- 0

###
### This command does a "head" on just the significantly DE genes. This let's
### us, at a glance, make sure things worked as expected.
###
head(filtered.results[filtered.results$sig==1,])


if (save.files) {
###
### save results
###
file.name = paste(root.name, "results.txt", sep="")
print(file.name)
write.table(filtered.results, file=file.name, sep="\t", row.names=FALSE, quote=FALSE)

file.name = paste(root.name, "sig_results.txt", sep="")
print(file.name)
write.table(filtered.results[filtered.results$sig==1,], file=file.name, sep="\t", row.names=FALSE, quote=FALSE)

###
### save mds plot
###
file.name = paste(root.name, "mds.pdf", sep="")
print(file.name)
pdf(file=file.name)
plotMDS(y)
dev.off()

###
### save smear plot
###
file.name = paste(root.name, "smear.pdf", sep="")
print(file.name)
pdf(file=file.name, width=8, height=4)
y.min <- min(filtered.results$logFC)
y.max <- max(filtered.results$logFC)
if (abs(y.min) > y.max) {
y.max = abs(y.min)
} else {
y.min = -1 * y.max
}
x.min <- min(filtered.results$logCPM)
x.max <- max(filtered.results$logCPM)
insig <- subset(filtered.results, FDR > 0.05)
sig <- subset(filtered.results, FDR <= 0.05)
plot(x=insig$logCPM, y=insig$logFC, pch=16, col="#00000011", ylab="logFC", xlab="logCPM", ylim=c(y.min, y.max), xlim=c(x.min, x.max))
points(x=sig$logCPM, y=sig$logFC, pch=20, col="#FF000088")
dev.off()

plot(x=insig$logCPM, y=insig$logFC, pch=16, col="#00000011", ylab="logFC", xlab="logCPM", ylim=c(y.min, y.max), xlim=c(x.min, x.max))
points(x=sig$logCPM, y=sig$logFC, pch=20, col="#FF000088")
}
Advertisements

5 thoughts on “StatQuest: Filtering genes with low read counts.

  1. Thank you for nice explanation. I got FDR close to 1 for all the genes. What is wrong ? I have 6 groups each with 3 samples. I removed low counts either by excluding raw counts =0 or CPM of 1.

    Like

    • A million things could be wrong. If you have 6 different categories, it could be that you are doing tons and tons of tests and even FDR can’t reasonably filter them. Or it could be that the changes are relatively small, and thus, FDR can filter them. I would recommend just comparing two of the groups and see what you see. If the results seem reasonable (low FDR and high fold differences), then you can start adding additional comparisons.

      Like

      • Thanks for your rapid reply. I did calculate cpm putting together all samples. in Design I then compared only selected groups? Do you mean to calculate different CPM for selected groups, run analysis separately, and do the same for the rest ? Thanks

        Like

  2. Thanks for your rapid reply. I did calculate cpm putting together all samples. in Design I then compared only selected groups? Do you mean to calculate different CPM for selected groups, run analysis separately, and do the same for the rest ? Thanks

    Like

    • What program are you using to test fo differential expression? I’m only familiar with EdgeR and DESeq2. Those two programs require raw read counts (not normalized to CPMs) as input – so make sure you are using raw read counts if you are using either of those (you probably are – but I’m just checking). What you’re saying you did – put all the samples together and then only compare selected groups – makes sense – so you are probably doing things correctly. Unfortunately, I can’t debug your experiment since so many little things could be wrong (both on the experimental side and on the analytical side). I recommend finding someone you can ask in person to help you.

      Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s