RPKM, FPKM and TPM, clearly explained

Warning: This StatQuest is specifically for people who do high-throughput RNA sequencing (RNA-seq). If that’s what you’re interested in, quest on!

It used to be when you did RNA-seq, you reported your results in RPKM (Reads Per Kilobase Million) or FPKM (Fragments Per Kilobase Million). However, TPM (Transcripts Per Million) is now becoming quite popular. Since there seems to be a lot of confusion about these terms, I thought I’d use a StatQuest to clear everything up.

These three metrics attempt to normalize for sequencing depth and gene length. Here’s how you do it for RPKM:

  1. Count up the total reads in a sample and divide that number by 1,000,000 – this is our “per million” scaling factor.
  2. Divide the read counts by the “per million” scaling factor. This normalizes for sequencing depth, giving you reads per million (RPM)
  3. Divide the RPM values by the length of the gene, in kilobases. This gives you RPKM.

FPKM is very similar to RPKM. RPKM was made for single-end RNA-seq, where every read corresponded to a single fragment that was sequenced. FPKM was made for paired-end RNA-seq. With paired-end RNA-seq, two reads can correspond to a single fragment, or, if one read in the pair did not map, one read can correspond to a single fragment. The only difference between RPKM and FPKM is that FPKM takes into account that two reads can map to one fragment (and so it doesn’t count this fragment twice).

TPM is very similar to RPKM and FPKM. The only difference is the order of operations. Here’s how you calculate TPM:

  1. Divide the read counts by the length of each gene in kilobases. This gives you reads per kilobase (RPK).
  2. Count up all the RPK values in a sample and divide this number by 1,000,000. This is your “per million” scaling factor.
  3. Divide the RPK values by the “per million” scaling factor. This gives you TPM.

So you see, when calculating TPM, the only difference is that you normalize for gene length first, and then normalize for sequencing depth second. However, the effects of this difference are quite profound.

When you use TPM, the sum of all TPMs in each sample are the same. This makes it easier to compare the proportion of reads that mapped to a gene in each sample. In contrast, with RPKM and FPKM, the sum of the normalized reads in each sample may be different, and this makes it harder to compare samples directly.

Here’s an example. If the TPM for gene A in Sample 1 is 3.33 and the TPM in sample B is 3.33, then I know that the exact same proportion of total reads mapped to gene A in both samples. This is because the sum of the TPMs in both samples always add up to the same number (so the denominator required to calculate the proportions is the same, regardless of what sample you are looking at.)

With RPKM or FPKM, the sum of normalized reads in each sample can be different. Thus, if the RPKM for gene A in Sample 1 is 3.33 and the RPKM in Sample 2 is 3.33, I would not know if the same proportion of reads in Sample 1 mapped to gene A as in Sample 2. This is because the denominator required to calculate the proportion could be different for the two samples.

Advertisements

19 thoughts on “RPKM, FPKM and TPM, clearly explained

  1. When would you use TPM? For modelling? For relative differential expression analysis of course you use raw read counts that are scaled during analysis (e.g. R/Bioc edger or deseq2 that have been shown to be superior). So you don’t need to calculate any of these values if you just interested in differential expression.

    Like

    • That’s true, edgeR and DESeq2 both have their own internal normalization schemes for detecting differential expression and do not rely on TPM (or RPKM etc.) However, I believe a full analysis should include summary graphs, with Log(FoldChange) on one axis and TPM on the other. Furthermore, TPM values are what should be reported in written descriptions of the data (i.e. academic manuscripts).

      Like

  2. Any expression values are normalized and you use them specially for the heatmap, clustering, correlation maps or volcano plot to represent your DEGs with log(FC) and pvalue. In any case the tools for DE analysis uses raw counts so that their internal normalizations can be performed by these tools considering that the data has a negative binomial distribution.

    Like

    • True! I hope my explanation will help people understand the heatmap, clustering, correlation maps and other presentations normalized data.

      Like

  3. Thank you for the clear explanation!

    This is going to be a very stupid question, please for give me. I am grad student in genetics and can do basic programming, so I got required to help with RNASeq analysis.

    Our sample was sequenced by an outside group who gave us the processed BAM files. I downloaded all the annotations files for the human genome. How do I go about getting the read counts?

    Thanks,
    Mark

    Like

  4. Hi Josh, thank you for this post !

    To understand better the Pekka’s post, TPM can be used as a between-sample normalization (BSN) as DESeq2 for instance? Or it remains a unit (as RPKM / FPKM) and it is only used in order to report the associated results with graphics?

    As you replied, I am thinking to complete a DESeq2 analysis with TPM plots. Let WT, C1, C2 and C3 be the samples. WT is the wild type and control. C1, C2 and C3 are the conditions.

    DESeq2 normalization, which gives us three analysis WT vs C1 (A1), WT vs C2 (A2) and WT vs C3 (A3). As you know, the normalized counts provided for the WT will differ between A1, A2 and A3. So, from now, if I want to plot the genes expression level, I can see two choices:

    1) From DESeq2 table, I can plot WT vs C*. This results to 3 graphics with 2 histogram’s bars (WT and C1 for instance)

    2) I want now to plot the gene expression level for each sample. It means, 4 graphics with only one histogram’s bar. And for these graphics, I can not use DESeq2 values because WT values will differ. That is the reason why I was thinking to use TPM units in order to plot the genes expression level in WT, C1, C2 and C3 keeping the comparison possible between the plots.

    Please feel free to critic and comment this suggestion, and thank you for this cool post !

    Pierre

    Liked by 1 person

  5. Using edgeR for example, my DGE calculations are based on normalized counts typically reported as Log2CPM (TMM normalized). So for plotting intensity boxplots to compare a gene between samples, I would use the Log2CPM to use the same normalized units that went into the differential expression calculation. The TPM values do not reflect the TMM normalization incorporated in the Log2CPM values and therefore you fold-change calculations. So it feels like the computational equivalent of changing horse midstream to use CPM in my DGE calculations and then illustrate the difference with a different unit of measure. Given the effective gene length, you can convert TMM normalized CPM to TPM, however, you then lose the TPM property of all samples summing to the same value. The only time I really feel the need to resort to a length normalized unit like TPM is when I need to compare expression between Gene A to Gene B. TPM is certainly better than CPM for this purpose but I still see cautions about doing this (albeit without seeing a better alternative offered). I think the issue here relates to compositional bias. That is, large change in one or more highly expressed genes can introduce a bias in other genes. This is because the increase in a highly expressed gene, effectively “steals” reads from other genes. Thus a large increase in highly expressed genes, can make other genes that have not changed appear to go down. Abundant and highly variable globin mRNA in blood samples is the prototypical situation where compositional bias looms large.

    So generally, I think it’s best to plot the same units you use in your DGE calculations to accurately illustrate changes between samples. You only need to resort to a length normalized metric to compare gene A to gene B, but the comparison is not without issues. But I’ll take the issues with RNA-Seq any day over the enormous issues of trying to compare Affy intensities between genes (basically you can’t). I’ve focused on compositional bias for the sake of simplicity but there can be other bias’ as well such as GC content bias and 3′ bias (typical with oligodT-based library preps) to name a few.

    Like

    • That is all true. When TPM first came out, it really seemed like a significant improvement over other metrics, but, like you said, edgeR does not use it (nor does DESeq2). So far, the tools that use TPM have failed to impress me so I’ve resorted to using CPM for my graphs.

      Like

  6. Sorry for my naive question but how does RPMs and CPMs differ? Both are calculated similarly. Raw counts are divided by total number of reads (counts) and by multiplying with 1,000,000.

    Like

    • My guess is that there is no difference. However, it’s always worth looking at the fine print. Was the sequencing paired-end? If so, make sure they counted fragments, not just reads.

      Like

  7. Thank you very much!
    Clearly explained indeed.
    If I may ask – How can I convert RPKM values into TPM?
    I will need the length of the genes, won’t I?
    A clarification about it will be highly appreciated!
    Thanks again.

    Like

    • I’m not an expert on converting from RPKM to TPM, but I would bet that you’ll need to convert RPKM back to the raw read counts. So you’ll need to know the gene lengths as well as the overall number of reads per replicate. There might be an easier way to do this, but, short of starting over with the raw data, this is the only thing I can think of.

      Like

      • Yes, this is what I’ve thought.. Have to figure out now how to get the gene lengths.
        Thank you for your quick response :)

        Like

  8. Both FPKM (RPKM) and TPM are length normalized already. So you don’t need the length to interconvert. See Harold Pimentel’s url at the top of the thread. He provides several conversion function to illustrate and I’ve copied the fpkmToTpm one here:

    fpkmToTpm <- function(fpkm)
    {
    exp(log(fpkm) – log(sum(fpkm)) + log(1e6))
    }

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s