PRINCE: Accurate Approximation of the Copy Number of Tandem Repeats

Authors Mehrdad Mansouri, Julian Booth, Margaryta Vityaz, Cedric Chauve, Leonid Chindelevitch


Author Details

Mehrdad Mansouri
  • School of Computing Science, Simon Fraser University, Burnaby, BC, Canada
Julian Booth
  • School of Computing Science, Simon Fraser University, Burnaby, BC, Canada
Margaryta Vityaz
  • School of Computing Science, Simon Fraser University, Burnaby, BC, Canada
Cedric Chauve
  • Department of Mathematics, Simon Fraser University, Burnaby, BC, Canada
Leonid Chindelevitch
  • School of Computing Science, Simon Fraser University, Burnaby, BC, Canada

Cite As

Mehrdad Mansouri, Julian Booth, Margaryta Vityaz, Cedric Chauve, and Leonid Chindelevitch. PRINCE: Accurate Approximation of the Copy Number of Tandem Repeats. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 20:1-20:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


Abstract

Variable-Number Tandem Repeats (VNTRs) are genomic regions where a short sequence of DNA is repeated with no space between repeats. While a fixed set of VNTRs is typically identified for a given species, the copy number at each VNTR varies between individuals within a species. Although VNTRs are found in both prokaryotic and eukaryotic genomes, the methodology called multi-locus VNTR analysis (MLVA) is widely used to distinguish different strains of bacteria, to cluster strains that might be epidemiologically related, and to investigate evolutionary rates. We propose PRINCE (Processing Reads to Infer the Number of Copies via Estimation), an algorithm that accurately estimates the copy number of a VNTR given the sequence of a single repeat unit and a set of short reads from a whole-genome sequencing (WGS) experiment. This is a challenging problem, especially when the repeat region is longer than the expected read length. Our proposed method computes a statistical approximation of the local coverage inside the repeat region. This approximation is then mapped to the copy number using a linear function whose parameters are fitted to simulated data. We test PRINCE on the genomes from three datasets of Mycobacterium tuberculosis strains and show that it is more than twice as accurate as a previous method. An implementation of PRINCE in the Python language is freely available at
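The two-step idea in the abstract (a coverage-like signal for the repeat, then a linear map to copy number) can be illustrated with a minimal Python sketch. This is not the PRINCE implementation. The function names, the k-mer matching rule, and the toy calibration values are all hypothetical, chosen only to show the shape of the approach. The sketch also ignores reverse-complement reads, sequencing errors, and k-mers that span the junction between adjacent repeat copies.

```python
from typing import Iterable, List, Sequence, Tuple


def kmers(seq: str, k: int) -> List[str]:
    """Return all overlapping k-mers of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def repeat_signal(template: str, reads: Iterable[str], k: int, coverage: float) -> float:
    """Crude stand-in for the local coverage inside the repeat region.

    Hypothetical statistic (an assumption, not the one used by PRINCE):
    count the read k-mers that also occur in the repeat unit, then divide
    by the genome-wide coverage so that samples sequenced at different
    depths are comparable.
    """
    template_kmers = set(kmers(template, k))
    hits = sum(1 for read in reads for km in kmers(read, k) if km in template_kmers)
    return hits / coverage


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_copy_number(signal: float, slope: float, intercept: float) -> float:
    """Map a repeat signal to an estimated copy number with the fitted line."""
    return slope * signal + intercept


if __name__ == "__main__":
    # Toy calibration set standing in for simulated data: the signal
    # observed for genomes with a known copy number (made-up values).
    simulated_signals = [10.0, 20.0, 30.0]
    known_copy_numbers = [1, 2, 3]
    slope, intercept = fit_line(simulated_signals, known_copy_numbers)

    # Estimate the copy number for a new sample from its reads.
    signal = repeat_signal("ACGT", ["ACGTACGT", "TTACGTAA"], k=3, coverage=2.0)
    print(round(predict_copy_number(signal, slope, intercept), 2))
```

In this sketch the line is fitted once on calibration data and then reused for every new sample, which mirrors the abstract's description of parameters fitted to simulated data.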

Subject Classification

ACM Subject Classification
  • Applied computing → Molecular sequence analysis
Keywords
  • Variable-Number Tandem Repeats
  • Copy number
  • Bacterial genomics



