Warning: This StatQuest is specifically for people who do high-throughput RNA sequencing (RNA-seq). If that’s what you’re interested in, quest on!
It used to be when you did RNA-seq, you reported your results in RPKM (Reads Per Kilobase Million) or FPKM (Fragments Per Kilobase Million). However, TPM (Transcripts Per Million) is now becoming quite popular. Since there seems to be a lot of confusion about these terms, I thought I’d use a StatQuest to clear everything up.
These three metrics attempt to normalize for sequencing depth and gene length. Here’s how you do it for RPKM:
- Count up the total reads in a sample and divide that number by 1,000,000 – this is our “per million” scaling factor.
- Divide the read counts by the “per million” scaling factor. This normalizes for sequencing depth, giving you reads per million (RPM)
- Divide the RPM values by the length of the gene, in kilobases. This gives you RPKM.
FPKM is very similar to RPKM. RPKM was made for single-end RNA-seq, where every read corresponded to a single fragment that was sequenced. FPKM was made for paired-end RNA-seq. With paired-end RNA-seq, two reads can correspond to a single fragment, or, if one read in the pair did not map, one read can correspond to a single fragment. The only difference between RPKM and FPKM is that FPKM takes into account that two reads can map to one fragment (and so it doesn’t count this fragment twice).
TPM is very similar to RPKM and FPKM. The only difference is the order of operations. Here’s how you calculate TPM:
- Divide the read counts by the length of each gene in kilobases. This gives you reads per kilobase (RPK).
- Count up all the RPK values in a sample and divide this number by 1,000,000. This is your “per million” scaling factor.
- Divide the RPK values by the “per million” scaling factor. This gives you TPM.
So you see, when calculating TPM, the only difference is that you normalize for gene length first, and then normalize for sequencing depth second. However, the effects of this difference are quite profound.
When you use TPM, the sum of all TPMs in each sample are the same. This makes it easier to compare the proportion of reads that mapped to a gene in each sample. In contrast, with RPKM and FPKM, the sum of the normalized reads in each sample may be different, and this makes it harder to compare samples directly.
Here’s an example. If the TPM for gene A in Sample 1 is 3.33 and the TPM in sample B is 3.33, then I know that the exact same proportion of total reads mapped to gene A in both samples. This is because the sum of the TPMs in both samples always add up to the same number (so the denominator required to calculate the proportions is the same, regardless of what sample you are looking at.)
With RPKM or FPKM, the sum of normalized reads in each sample can be different. Thus, if the RPKM for gene A in Sample 1 is 3.33 and the RPKM in Sample 2 is 3.33, I would not know if the same proportion of reads in Sample 1 mapped to gene A as in Sample 2. This is because the denominator required to calculate the proportion could be different for the two samples.
41 thoughts on “RPKM, FPKM and TPM, clearly explained”
Just a minor typo: “However, TPM (Transcripts Per Kilobase Million)”. I think you meant Transcripts Per Million.
In case anyone is wondering in a slightly more technical explanation along with the connection between TPM and FPKM, they might want to check out my post: https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/
Thanks for catching the typo!
When would you use TPM? For modelling? For relative differential expression analysis of course you use raw read counts that are scaled during analysis (e.g. R/Bioc edger or deseq2 that have been shown to be superior). So you don’t need to calculate any of these values if you just interested in differential expression.
That’s true, edgeR and DESeq2 both have their own internal normalization schemes for detecting differential expression and do not rely on TPM (or RPKM etc.) However, I believe a full analysis should include summary graphs, with Log(FoldChange) on one axis and TPM on the other. Furthermore, TPM values are what should be reported in written descriptions of the data (i.e. academic manuscripts).
I had same questions.
I think for Time course analysis, and also I think for SmartSeq2 data.
I am not sure how much e can generalise results on scRNAseq made for 3′ end gene.
You make an excellent point! These methods (RPKM, FPKM and TPM) only apply to full length RNA-sequencing and are not intended for 3′ end sequencing. I made this video before people were doing much single-cell RNA sequencing so I didn’t think to mention it at the time.
Any expression values are normalized and you use them specially for the heatmap, clustering, correlation maps or volcano plot to represent your DEGs with log(FC) and pvalue. In any case the tools for DE analysis uses raw counts so that their internal normalizations can be performed by these tools considering that the data has a negative binomial distribution.
True! I hope my explanation will help people understand the heatmap, clustering, correlation maps and other presentations normalized data.
Thank you for the clear explanation!
This is going to be a very stupid question, please for give me. I am grad student in genetics and can do basic programming, so I got required to help with RNASeq analysis.
Our sample was sequenced by an outside group who gave us the processed BAM files. I downloaded all the annotations files for the human genome. How do I go about getting the read counts?
One of the best way’s to get read counts from BAM files is by using HTSeq-count. Good luck!
Hi Josh, thank you for this post !
To understand better the Pekka’s post, TPM can be used as a between-sample normalization (BSN) as DESeq2 for instance? Or it remains a unit (as RPKM / FPKM) and it is only used in order to report the associated results with graphics?
As you replied, I am thinking to complete a DESeq2 analysis with TPM plots. Let WT, C1, C2 and C3 be the samples. WT is the wild type and control. C1, C2 and C3 are the conditions.
DESeq2 normalization, which gives us three analysis WT vs C1 (A1), WT vs C2 (A2) and WT vs C3 (A3). As you know, the normalized counts provided for the WT will differ between A1, A2 and A3. So, from now, if I want to plot the genes expression level, I can see two choices:
1) From DESeq2 table, I can plot WT vs C*. This results to 3 graphics with 2 histogram’s bars (WT and C1 for instance)
2) I want now to plot the gene expression level for each sample. It means, 4 graphics with only one histogram’s bar. And for these graphics, I can not use DESeq2 values because WT values will differ. That is the reason why I was thinking to use TPM units in order to plot the genes expression level in WT, C1, C2 and C3 keeping the comparison possible between the plots.
Please feel free to critic and comment this suggestion, and thank you for this cool post !
Your graphing ideas are correct. Use TPM to show the relative levels between samples. However, do not use TPM as input to DEseq or edgeR. They only work with raw count data.
Of course, inputs of these two methods are raw data.
Thank you Josh! 😉
Using edgeR for example, my DGE calculations are based on normalized counts typically reported as Log2CPM (TMM normalized). So for plotting intensity boxplots to compare a gene between samples, I would use the Log2CPM to use the same normalized units that went into the differential expression calculation. The TPM values do not reflect the TMM normalization incorporated in the Log2CPM values and therefore you fold-change calculations. So it feels like the computational equivalent of changing horse midstream to use CPM in my DGE calculations and then illustrate the difference with a different unit of measure. Given the effective gene length, you can convert TMM normalized CPM to TPM, however, you then lose the TPM property of all samples summing to the same value. The only time I really feel the need to resort to a length normalized unit like TPM is when I need to compare expression between Gene A to Gene B. TPM is certainly better than CPM for this purpose but I still see cautions about doing this (albeit without seeing a better alternative offered). I think the issue here relates to compositional bias. That is, large change in one or more highly expressed genes can introduce a bias in other genes. This is because the increase in a highly expressed gene, effectively “steals” reads from other genes. Thus a large increase in highly expressed genes, can make other genes that have not changed appear to go down. Abundant and highly variable globin mRNA in blood samples is the prototypical situation where compositional bias looms large.
So generally, I think it’s best to plot the same units you use in your DGE calculations to accurately illustrate changes between samples. You only need to resort to a length normalized metric to compare gene A to gene B, but the comparison is not without issues. But I’ll take the issues with RNA-Seq any day over the enormous issues of trying to compare Affy intensities between genes (basically you can’t). I’ve focused on compositional bias for the sake of simplicity but there can be other bias’ as well such as GC content bias and 3′ bias (typical with oligodT-based library preps) to name a few.
That is all true. When TPM first came out, it really seemed like a significant improvement over other metrics, but, like you said, edgeR does not use it (nor does DESeq2). So far, the tools that use TPM have failed to impress me so I’ve resorted to using CPM for my graphs.
Sorry for my naive question but how does RPMs and CPMs differ? Both are calculated similarly. Raw counts are divided by total number of reads (counts) and by multiplying with 1,000,000.
My guess is that there is no difference. However, it’s always worth looking at the fine print. Was the sequencing paired-end? If so, make sure they counted fragments, not just reads.
Thank you very much!
Clearly explained indeed.
If I may ask – How can I convert RPKM values into TPM?
I will need the length of the genes, won’t I?
A clarification about it will be highly appreciated!
I’m not an expert on converting from RPKM to TPM, but I would bet that you’ll need to convert RPKM back to the raw read counts. So you’ll need to know the gene lengths as well as the overall number of reads per replicate. There might be an easier way to do this, but, short of starting over with the raw data, this is the only thing I can think of.
Yes, this is what I’ve thought.. Have to figure out now how to get the gene lengths.
Thank you for your quick response 🙂
Both FPKM (RPKM) and TPM are length normalized already. So you don’t need the length to interconvert. See Harold Pimentel’s url at the top of the thread. He provides several conversion function to illustrate and I’ve copied the fpkmToTpm one here:
fpkmToTpm <- function(fpkm)
exp(log(fpkm) – log(sum(fpkm)) + log(1e6))
Could somebody please explain the nature of the gene size that is used in calculating FPKM or TPM? Is this value experimentally calculated for each individual sample based upon distribution of read sizes from the gene specifically within and unique to a sample? Or, is there a standard look up table based upon the reference genome such that everyone who uses the same reference genome would automatically be using the same universal gene sizes? There are many mathematical explanations regarding estimating gene size on the web, but none give the big picture application context on how you use gene size in practice. In other words, does the gene size value change from sample to sample, or from project to project based upon measured data?
This is a good question, and there are probably a million answers. Unfortunately there is nothing set in stone for each reference genome. When I calculate TPM and RPKM manually, I use the gene lengths from the UCSC genome browser. However, there would be a good argument for just using the exons for the transcript that the reads came from. In practice that might be tricky.
That said, in practice I never calculate TPM or RPKM manually (at least not any more). Instead I rely on programs to do it for me if they even need it to begin with. For example, edgeR and DESeq do their own read count normalizations that do not depend on gene length and lately that’s what I’ve been using to report read counts to clients (since they are mostly interested in differential gene expression).
The concept of effective gene length has been introduced. RSEM, for example, provides an effective gene length in it’s output. Working from memory I think efflen = gene length – average fragment size +1. Here gene length would be the non-redundant sum of exonic bases (including UTRs) from all isoforms.
Average fragment size refers to fragment size in paired end data. For single end, this would simply be the read length.
So in practice, the effective length could vary from sample to sample based on differences in the library preps and/or RNA quality. In my experience, given good quality RNA and a single batch run for the library preps, the sample to sample variation in fragment size is minimal. Most post alignment sequencing QC workflows will include the fragment length estimates for each sample.
Thanks Josh for the rapid response.
I have a few follow-up questions then. I want to use a public RNA-seq database that has FPKM values reported for some 60,000+ genes listed in the files, but I prefer to have the FPKM calculated using only the read totals for the 19,000+ protein coding genes present. Therefore, I would like to re-calculate the FPKM values myself. The database also has data for “RSEM expected counts”, and not raw counts. 1) If all I have are the RSEM expected counts and none of the original data, is it possible for me to calculate my own FPKM using just the subset of protein coding genes for the read totals? 2) If so, should I just use the gene sizes available from UCSC gene browser and not worry about the original read sizes or data ?
3) Also, can you confirm that using the values from RESM expected counts to calculate FPKM is valid verses using raw reads? Thanks again.
As an aside, I tried to re-calculate the gene size values used in the database by comparing the FPKM and expected counts for each gene and column totals, and to my dismay the gene size varied with each individual sample. This is what prompted my line of questioning in the first place.
I’ll be honest, I’ve never worked with RSEM before. At least not directly. But there seems to be a comment from John Thompson that has some suggestions about it – so hopefully that will be helpful. I’d worry that in your context, not knowing the total reads will be the bigger problem. Hmm…. I wish I could say something more helpful!!!! Oh well. 🙂
I believe, for all practical purposes, RSEM expected counts are what you should consider raw counts. RSEM is an expectation maximization algorithm which proportionally splits reads that map to multiple transcripts/genes and thus results in non-integer counts. Early on, I think edgeR and DESeq only accepted integer data but were later adapted to accommodate RSEM non-integer counts. So RSEM counts can now be used with these popular DGE tools.
So you should ideally go straight to the RSEM count data as your starting point if you want control over the subset of genes used when you calculate FPKM.
To the extent that your library fragment sizes differ between samples, then the effective length for each gene will indeed be different between samples. So it sounds like the original FPKM values you downloaded may have been calculated with sample-specific effective lengths.
Fragment sizes are typically 150-250bp in our hands. So the difference between actual length and effective length is usually ~200bp. Using actual gene length instead of effective gene length will inflate the denominator in the FPKM calculation and thus shrink your FPKM values. The amount of bias introduced will be minimal for long genes and more substantial for short genes.
Awesome! Thanks John!
Thanks so much for that clarification John. This is quite helpful.
This only takes into account all the fragments mapped to annotated genes in the genome. What about regions of the genome which are not annotated but still has fragments mapped to them. For example total mapped paired end fragments is 4563345(making up a number) in a sample, but total mapped fragments to annotated genes(as in your example) is only let’s say 4234323. Still some fragments are not taken into account. Is that ok to assume we can neglect that???
Yep, that’s the standard practice – to ignore reads that map to unannotated regions. It’s not ideal, and perhaps it is an opportunity for someone to come up with an improved method.
In the video it is implied that one of the advantages of TPM over RPKM is that these TPM values can be compared across samples. However, in this post https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/ they specifically say that these TPM values should not be compared across samples but more within samples. Can you comment on this? I am a bit confused
That is confusing!!! I thought that the whole point of normalizing the data was so that you could compare between samples. I’m pretty confident that I am correct because the whole premise of using replicate samples to identify differential gene expression is based on normalizing the data. If, after normalizing the data, you can’t compare samples, than DESeq2 and edgeR all have some serious flaws, and I doubt that is the case.
Almost 2020 and this video keeps helping people. Thanks a lot.
It’s one of my very first videos and it’s the one that motivated me to make StatQuest.
Hi Josh, I think your videos are great. They’ve really helped me grasp some of these slightly confusing concepts!
I had a couple of questions regarding which of these values would be appropriate for generating a PCA plot. I have RPKM, FMPK, TPM, and CPM (counts per million) values for my RNA-Seq data. Though intuitively I would think that log-transformed TPM values would be appropriate for a PCA plot, I have seen papers use CPM values, instead. I guess my next question is: what is the difference between CPM and TPM? I can’t seem to find a clear explanation on this… Thanks!!
CPM is just counts per million and is common with a lot of sinlge-cell methods that only sequence the 3-prime end of the mRNA transcript – rather than fragments along the entire length mRNA. The advantage of just using the 3-prime end is that we don’t have to normalize for the length of the gene. The disadvantage is that you can’t see alternative splicing or any other interesting features that happen outside of the 3-prime end. So, for a lot of single-cell papers, you’ll see PCA done with CPMs and that makes sense, since they don’t have to normalize for the length of the mRNA. However, if you are doing full length sequencing, then it would probably be a good idea to normalize for the transcripts length.
I just now saw your response! Thank you so much! This makes a lot of sense. We did paired-end sequencing for bacterial gene expression, even though bacteria don’t undergo alternative splicing. It would still be better to use TPM, no?
If you are sequencing more than just the 3′ end of the gene, you want to use TPM in order to compensate for the fact that different genes are different lengths.
Thank you! One last question: is there an advantage in using FPKM over TPM? Since FPKM normalizes for reads that map twice to the genome in paired-end sequencing…
Both FPKM and TPM correctly deal with the paired end reads.