Minimum Segmentation for Pan-genomic Founder Reconstruction in Linear Time

Authors: Tuukka Norri, Bastien Cazaux, Dmitry Kosolobov, Veli Mäkinen


Author Details

Tuukka Norri
  • Department of Computer Science, University of Helsinki, Helsinki, Finland
Bastien Cazaux
  • Department of Computer Science, University of Helsinki, Helsinki, Finland
Dmitry Kosolobov
  • Department of Computer Science, University of Helsinki, Helsinki, Finland
Veli Mäkinen
  • Department of Computer Science, University of Helsinki, Helsinki, Finland

Cite As

Tuukka Norri, Bastien Cazaux, Dmitry Kosolobov, and Veli Mäkinen. Minimum Segmentation for Pan-genomic Founder Reconstruction in Linear Time. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 15:1-15:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


Abstract

Given a threshold L and a set R = {R_1, ..., R_m} of m strings (haplotype sequences), each of length n, the minimum segmentation problem for founder reconstruction is to partition [1,n] into a set P of disjoint segments such that each segment [a,b] in P has length at least L and the maximum, over [a,b] in P, of the number d(a,b) = |{R_i[a,b] : 1 <= i <= m}| of distinct substrings at segment [a,b] is minimized. The distinct substrings in the segments represent founder blocks that can be concatenated to form max{d(a,b) : [a,b] in P} founder sequences representing the original R such that crossovers happen only at segment boundaries. We give an optimal O(mn) time algorithm to solve the problem, improving over the earlier O(mn^2) bound. This improvement makes it possible to apply the algorithm in a pan-genomic setting where the input strings are aligned haplotype sequences of complete human chromosomes, with the goal of finding a representative set of references that can be indexed for read alignment and variant calling. We implemented the new algorithm and give some experimental evidence of the practicality of the approach in this pan-genomic setting.

Subject Classification

ACM Subject Classification
  • Theory of computation → Design and analysis of algorithms
  • Applied computing → Bioinformatics
Keywords
  • Pan-genome indexing
  • founder reconstruction
  • dynamic programming
  • positional Burrows-Wheeler transform
  • range minimum query



