StatQuest: Filtering genes with low read counts.

Here’s a template for adding DESeq2’s filtering method to an edgeR analysis pipeline.

[sourcecode language=”R”]
## NOTE: If you use this template, be sure to cite both edgeR and
## DESeq2 since this template uses DESeq2’s method for identifying
## an optimal minimum sequencing depth for a gene to be
## considered transcribed.

## CITE:
## Robinson MD, McCarthy DJ and Smyth GK (2010). “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics, 26, pp. -1.
## Love MI, Huber W and Anders S (2014). “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology, 15, pp. 550. doi: 10.1186/s13059-014-0550-8.

require(edgeR) ## this command just loads the edgeR module/library

## this next step loads "genefilter". If you don’t have genefilter
## it tries to install it for you.
if (!require(genefilter)) {
print("Trying to install genefilter")
source("https://bioconductor.org/biocLite.R")
biocLite("genefilter")
if (require(genefilter)) {
print("genefilter successfully installed")
} else {
stop("could not install genefilter")
}
}

###
### CHANGE THE VALUE FOR "root.name".
### root.name is the the start of all output file names. For example, if you
### set root.name like this:
###
### root.name <- "wt_v_ko_edgeR_20141210_"
###
### then the output files will be named:
###
### wt_v_ko_edgeR_20141210_results.txt <- that contains all statistical tests
### wt_v_ko_edgeR_20141210_sig_results.txt <- just the genes with FDR < 0.05
### wt_v_ko_edgeR_20141210_mds.pdf <- the MDS (like PCA) plot
### wt_v_ko_edgeR_20141210_smear.pdf <- the smear plot
###
root.name <- "results_"

###
### That said, you can run the analysis without actually saving the results to
### files. If you want to do that, change the following variable
###
#save.files <- FALSE # if you don’t want to actually save the output, use FALSE
save.files <- TRUE # if you want to save the output to files, use TRUE

###
### This is where you read in the reads/count data files for WT and KO.
### I’m assuming you’ve got 3 samples of each. If you have more, or less, you’ll
### have to change this code a little bit (add or subtract lines for reading
### files in, and then change a few other variables below…)
###
### CHANGE THE VALUES FOR THE FILENAMES (i.e. change "wt.count.txt" to something
### meaningful. You can also specify paths in the filename if you need to. i.e.
### "my_data_directory/my_wt1_data.txt"
###
wt1 <- read.delim(file="wt1.count.txt", sep="\t", stringsAsFactors=FALSE, header=FALSE)
wt2 <- read.delim(file="wt2.count.txt", sep="\t", stringsAsFactors=FALSE, header=FALSE)
wt3 <- read.delim(file="wt3.count.txt", sep="\t", stringsAsFactors=FALSE, header=FALSE)

ko1 <- read.delim(file="ko1.count.txt", sep="\t", stringsAsFactors=FALSE, header=FALSE)
ko2 <- read.delim(file="ko2.count.txt", sep="\t", stringsAsFactors=FALSE, header=FALSE)
ko3 <- read.delim(file="ko3.count.txt", sep="\t", stringsAsFactors=FALSE, header=FALSE)

###
### If you used "htseq_count" to count the reads, then each file will have the
### following format:
### geneName1 count1
### geneName2 count2
### geneName3 count3
### …
### geneNameX countX
###
### For edgeR, we have to consolidate the counts into a single table that looks
### like this:
### geneName1 wt1.count1 wt2.count1 wt3.count1 ko1.count1 ko2.count1 …
### geneName2 wt1.count2 wt2.count2 wt3.count2 ko1.count2 ko2.count2 …
### geneName3 wt1.count3 wt2.count3 wt3.count3 ko1.count3 ko2.count3 …
### ….
### We do this by making a new "data frame". V1 refers to the data in the column
### 1 (the gene names), and V2 refers to the data in column2 (the read counts
### per gene). In the command below, we are creating a column called "id" that
### is set to the gene names (we could use any file for this, since they are the
### same in all files, so we just pick the first file we read in). We then
### create columns for the read counts for all wt and ko datasets.
###
### CHANGE THE NAMES AND NUMBER OF READ COUNT COLUMNS IF YOU CHANGED THE
### FILES YOU READ IN. NOTE: it’s a good idea to do all off the WT columns
### followed by all of the KO columns (or the other way around).
###
raw.data <- data.frame(id=wt1$V1,
wt1=wt1$V2, wt2=wt2$V2, wt3=wt3$V2,
ko1=ko1$V2, ko2=ko2$V2, ko3=ko3$V2)

### If you used htseq-count to count the reads per gene, then the last five
### rows of data are not for specific genes, but are summaries of how the
### counting went. We must remove them from our new data frame, or else
### edgeR will treat those last 5 rows as individual genes and throw off
### the statistics.
tail(raw.data)
raw.data <- raw.data[1:(nrow(raw.data)-5),]

head(raw.data) ### Look at the first 6 rows of data to make sure it’s reasonable
tail(raw.data) ### Look at the last 6 rows of data to make sure it’s reasonable
nrow(raw.data) ### This prints out the number of genes in the data table

###
### raw.data is still just a general "table" (described above). Here we are
### converting that table into something that is specific to edgeR.
y <- DGEList(counts=raw.data[,2:ncol(raw.data)], genes=raw.data[,1])

###
### Here we are just doing more things that edgeR requires.
###
rownames(y) <- 1:nrow(y$counts)
y <- calcNormFactors(y)

###
### Here is where we tell edgeR how to group the samples. 1 represnts one group
### and 2 represents another group.
### CHANGE THIS IF YOU CHANGED THE FILES YOU READ IN.
### EXAMPLES:
### If you read in 3 WT and 3 KO, you want: 1,1,1,2,2,2
###
### If you read in 4 WT and 2 KO, you want: 1,1,1,1,2,2
###
### If you read in 2 WT and 6 KO, you want: 1,1,2,2,2,2,2,2
###
replicate <- factor(c(1,1,1,2,2,2))

###
### The following commands set up the tests and then execute them
###
design <- model.matrix(~replicate)
rownames(design) <- colnames(y)
design

## Notes about the next three fucntion calls…
## These can be consolodated to one call by using "estimateDisp(y, design)"
## I’m not doing that because I only want the estimated dispersion
## parameters (Disp and BCV) printed to the screen, and estimateDisp() doesn’t
## give you that option. It prints a ton of stuff, or nothing at all.
y <- estimateGLMCommonDisp(y, design, verbose=TRUE)
y <- estimateGLMTrendedDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)

y.fit <- glmFit(y, design)
y.lrt <- glmLRT(y.fit)

##
## Here we figure out the optimal way to filter out lowly expressed genes.
## We do this so that the FDR correction has more power – it doesn’t filter
## out as many True Positives. The few genes that FDR adjusts, the more true
## positives we’ll get (however, there’s a point where you are filtering out
## true positives, so you can’t just toss out all genes)
##
reads.cpm <- cpm(y)
unfiltered.results <- data.frame(id=y$genes, reads.cpm, y.lrt$table)

## NOTE: If you use the following code to se
## We’re making "filter" equal the second largest CPM value for
## each gene, rather than the average CPM value for each gene.
## Setting it to the second largest CPM makes filter act like
## edgeR’s selection where at least 2 samples have to have
## CPMs > MIN.CPM
filter <- apply(X=reads.cpm, MARGIN=1,
FUN=function(data) {data[order(rank(data), decreasing=TRUE)[2]]})

lowerQuantile <- mean(filter == 0)
if (lowerQuantile < .95) upperQuantile <- .95 else upperQuantile <- 1
theta <- seq(lowerQuantile, upperQuantile, length=50)

filtPadj <- filtered_p(filter=filter, test=unfiltered.results$PValue,
theta=theta, method="BH")

min.fdr <- 0.05
numRej <- colSums(filtPadj < min.fdr, na.rm = TRUE)

filter.quantiles <- quantile(filter, probs=theta)

lo.fit.theta <- lowess(numRej ~ theta, f=1/5)

if (max(numRej) <= 10) {
j <- 1
} else {
residual <- if (all(numRej==0)) {
0
} else {
numRej[numRej > 0] – lo.fit.theta$y[numRej > 0]
}
thresh <- max(lo.fit.theta$y) – sqrt(mean(residual^2))
j <- if (any(numRej > thresh)) {
which(numRej > thresh)[1]
} else {
1
}
}

plot(theta, numRej, type="b", xlab="", ylab="", lwd=3,
frame.plot=FALSE, col="black")

if (save.files) {
file.name = paste(root.name, "cpm_thresholds.pdf", sep="")
print(file.name)
pdf(file=file.name)

plot(theta, numRej, main="Threshold for optimal FDR correction", type="b", lwd=3,
frame.plot=FALSE, col="black", ylab="# Significant Genes", xlab="Quantile")

lines(lo.fit.theta$x[1:21], lo.fit.theta$y[1:21], col="red", lwd=3)
abline(h=thresh, col="blue", lwd=3)

dev.off()
}

filtered.results <- unfiltered.results
filtered.results$FDR <- filtPadj[, j, drop=TRUE]

filtered.results$de <- sign(filtered.results$logFC)*(filtered.results$FDR < 0.05)
filtered.results$sig <- abs(filtered.results$de)

filtered.results[is.na(filtered.results$sig),]$sig <- 0

###
### This command does a "head" on just the significantly DE genes. This let’s
### us, at a glance, make sure things worked as expected.
###
head(filtered.results[filtered.results$sig==1,])

if (save.files) {
###
### save results
###
file.name = paste(root.name, "results.txt", sep="")
print(file.name)
write.table(filtered.results, file=file.name, sep="\t", row.names=FALSE, quote=FALSE)

file.name = paste(root.name, "sig_results.txt", sep="")
print(file.name)
write.table(filtered.results[filtered.results$sig==1,], file=file.name, sep="\t", row.names=FALSE, quote=FALSE)

###
### save mds plot
###
file.name = paste(root.name, "mds.pdf", sep="")
print(file.name)
pdf(file=file.name)
plotMDS(y)
dev.off()

###
### save smear plot
###
file.name = paste(root.name, "smear.pdf", sep="")
print(file.name)
pdf(file=file.name, width=8, height=4)
y.min <- min(filtered.results$logFC)
y.max <- max(filtered.results$logFC)
if (abs(y.min) > y.max) {
y.max = abs(y.min)
} else {
y.min = -1 * y.max
}
x.min <- min(filtered.results$logCPM)
x.max <- max(filtered.results$logCPM)
insig <- subset(filtered.results, FDR > 0.05)
sig <- subset(filtered.results, FDR <= 0.05)
plot(x=insig$logCPM, y=insig$logFC, pch=16, col="#00000011", ylab="logFC", xlab="logCPM", ylim=c(y.min, y.max), xlim=c(x.min, x.max))
points(x=sig$logCPM, y=sig$logFC, pch=20, col="#FF000088")
dev.off()

plot(x=insig$logCPM, y=insig$logFC, pch=16, col="#00000011", ylab="logFC", xlab="logCPM", ylim=c(y.min, y.max), xlim=c(x.min, x.max))
points(x=sig$logCPM, y=sig$logFC, pch=20, col="#FF000088")
}
[/sourcecode]

15 thoughts on “StatQuest: Filtering genes with low read counts.

  1. Thank you for nice explanation. I got FDR close to 1 for all the genes. What is wrong ? I have 6 groups each with 3 samples. I removed low counts either by excluding raw counts =0 or CPM of 1.

    • A million things could be wrong. If you have 6 different categories, it could be that you are doing tons and tons of tests and even FDR can’t reasonably filter them. Or it could be that the changes are relatively small, and thus, FDR can filter them. I would recommend just comparing two of the groups and see what you see. If the results seem reasonable (low FDR and high fold differences), then you can start adding additional comparisons.

      • Thanks for your rapid reply. I did calculate cpm putting together all samples. in Design I then compared only selected groups? Do you mean to calculate different CPM for selected groups, run analysis separately, and do the same for the rest ? Thanks

  2. Thanks for your rapid reply. I did calculate cpm putting together all samples. in Design I then compared only selected groups? Do you mean to calculate different CPM for selected groups, run analysis separately, and do the same for the rest ? Thanks

    • What program are you using to test fo differential expression? I’m only familiar with EdgeR and DESeq2. Those two programs require raw read counts (not normalized to CPMs) as input – so make sure you are using raw read counts if you are using either of those (you probably are – but I’m just checking). What you’re saying you did – put all the samples together and then only compare selected groups – makes sense – so you are probably doing things correctly. Unfortunately, I can’t debug your experiment since so many little things could be wrong (both on the experimental side and on the analytical side). I recommend finding someone you can ask in person to help you.

  3. Great video! I had wanted to do something like this in edgeR in the past but couldn’t quite get my head around implementing it. I was just wondering, how would you go about extracting the CPM threshold from the quantiles? I would like to be able to say a threshold of X CPM was applied, though the outputted graph just shows the percentage of genes that had to be removed in order to achieve the highest number of significantly called genes.

    • Since the math is all done on quantiles, the best you can do if you want to report a threshold CPM is just look at the minimum CPM in the results that were called significant.

  4. Nice video! I download the RNAseq data from Xena(https://xenabrowser.net/datapages/?cohort=TCGA%20TARGET%20GTEx&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443) and choosed the expected_count data. Is it the right dataset? I mean this data is same to raw data so I can analyze following your code? I selected the subset which contains about 400 samples, according to your video,can I filter the gene which CPM>cutoff above 2 samples with edgR?
    The code is about edgR,what about DEseq and limma voom(the cutoff selection)?
    Thanks

  5. Hi! I am having difficulty parsing this loop in R, and it is failing in my Rscript. I am using R.3.3.1. Do you know if there has been a syntactical change that would explain this?

    Here is the problematic section:

    if (max(numRej) <= 10) {
    j <- 1
    } else {
    if (all(numRej==0)) {
    residual 0] – lo.fit.theta$y[numRej > 0]
    }
    thresh <- max(lo.fit.theta$y) – sqrt(mean(residual^2))
    j thresh)) {
    which(numRej > thresh)[1]
    } else {
    residual 0] ?”
    Execution halted

    • It looks like your code got mangled when you copied and pasted. And it looks like wordpress is to blame. When I enter the code, I get the same mangling… hmmm..

    • if (max(numRej) <= 10) {
      j <- 1
      } else {
      residual 0] – lo.fit.theta$y[numRej > 0]
      }
      thresh <- max(lo.fit.theta$y) – sqrt(mean(residual^2))
      j thresh)) {
      which(numRej > thresh)[1]
      } else {
      1
      }

  6. Hi, I am impressed by yours videos on Deseq2 and edgeR. Congratulations, great high-level teaching tools.
    It seems edgeR has “slightly” changed the filtering fuction “filterByExpression”, which no longer filters by CPM>1, but CPM being, cite ” above k in n samples, where k is determined by min.count and by the sample library sizes and n is determined by the design matrix”, version 3.28.1.
    Do you have any info about how k is determined?
    I guess the new edgeR function might address some of the “criticism” (being foreigner don’t know whether this is the appropriate word) in the video. Does it solve it to any degree?
    Thank you very much for any hint

    • The bad news is that I no longer do differential gene expression analysis – so I haven’t kept up with the new versions of edgeR – so I can’t answer your question. However, I’ll add a pinned note to the video alerting viewers that the video only applies to old versions of edgeR (I still think it is interesting to learn how things were done because a lot of publications are based on the older versions, even if it no longer applies).

Leave a Reply

Your email address will not be published. Required fields are marked *