Perplexity: Evaluating Transcript Abundance Estimation in the Absence of Ground Truth

Authors: Jason Fan, Skylar Chan, Rob Patro

Author Details

Jason Fan
  • University of Maryland, College Park, MD, USA
Skylar Chan
  • University of Maryland, College Park, MD, USA
Rob Patro
  • University of Maryland, College Park, MD, USA

Cite As

Jason Fan, Skylar Chan, and Rob Patro. Perplexity: Evaluating Transcript Abundance Estimation in the Absence of Ground Truth. In 21st International Workshop on Algorithms in Bioinformatics (WABI 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 201, pp. 4:1-4:22, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)


There has been rapid development of probabilistic models and inference methods for transcript abundance estimation from RNA-seq data. These models aim to accurately estimate transcript-level abundances, to account for different biases in the measurement process, and even to assess uncertainty in the resulting estimates so that it can be propagated to subsequent analyses. The assumed accuracy of the estimates inferred by such methods underpins the gene-expression-based analyses routinely carried out in the lab. Although hyperparameter selection is known to affect the distributions of inferred abundances (e.g., producing smooth versus sparse estimates), strategies for performing model selection on experimental data have been addressed informally at best. Thus, we derive perplexity for evaluating abundance estimates on fragment sets directly. We adapt perplexity from the analogous metric used to evaluate language and topic models and extend the metric to carefully account for corner cases unique to RNA-seq. In experimental data, estimates with the best perplexity also best correlate with qPCR measurements. In simulated data, perplexity is well behaved and concordant with genome-wide measurements against ground truth and differential expression analysis. To our knowledge, our study is the first to make model selection for transcript abundance estimation possible on experimental data in the absence of ground truth.
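As a rough illustration of the idea, the sketch below computes perplexity in its generic language-model form: two raised to the negative mean log2-likelihood of held-out fragments under a set of abundance estimates. This is an assumption-laden toy, not the paper's definition. The function name `perplexity`, the input layout (each fragment as a map from compatible transcript to a conditional probability), and the choice to return infinity for a fragment with zero probability are all hypothetical; the paper's extended metric treats such corner cases more carefully.

```python
import math


def perplexity(fragments, abundances):
    """Generic held-out perplexity (illustrative sketch only).

    fragments:  list of dicts, each mapping a transcript id to an assumed
                conditional probability P(fragment | transcript).
    abundances: dict mapping transcript id to its estimated relative
                abundance (assumed to sum to 1).
    """
    if not fragments:
        raise ValueError("need at least one held-out fragment")

    total_log2 = 0.0
    for compat in fragments:
        # Likelihood of one fragment: mixture over its compatible transcripts.
        p = sum(abundances.get(t, 0.0) * w for t, w in compat.items())
        if p <= 0.0:
            # Corner case: the estimate assigns zero probability to an
            # observed fragment. This sketch simply returns infinity.
            return math.inf
        total_log2 += math.log2(p)

    # Lower perplexity means the estimate explains held-out fragments better.
    return 2.0 ** (-total_log2 / len(fragments))


# Toy data (hypothetical): two held-out fragments, one per transcript.
held_out = [{"t1": 1.0}, {"t2": 1.0}]
smooth = {"t1": 0.5, "t2": 0.5}
sparse = {"t1": 1.0, "t2": 0.0}

print(perplexity(held_out, smooth))  # 2.0
print(perplexity(held_out, sparse))  # inf
```

In this toy case the sparse estimate gives zero abundance to a transcript that a held-out fragment maps to, so its perplexity is unbounded. This is the kind of RNA-seq-specific corner case that a naive adaptation of the metric has to address.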

Subject Classification

ACM Subject Classification
  • Applied computing → Computational biology

Keywords
  • RNA-seq
  • transcript abundance estimation
  • model selection



