Minimum Segmentation for Pan-genomic Founder Reconstruction in Linear Time

Norri, Tuukka; Cazaux, Bastien; Kosolobov, Dmitry; Mäkinen, Veli

doi:10.4230/LIPIcs.WABI.2018.15

Author Details

Tuukka Norri

Department of Computer Science, University of Helsinki, Helsinki, Finland

Bastien Cazaux

Department of Computer Science, University of Helsinki, Helsinki, Finland

Dmitry Kosolobov

Department of Computer Science, University of Helsinki, Helsinki, Finland

Veli Mäkinen

Department of Computer Science, University of Helsinki, Helsinki, Finland

Cite As

Tuukka Norri, Bastien Cazaux, Dmitry Kosolobov, and Veli Mäkinen. Minimum Segmentation for Pan-genomic Founder Reconstruction in Linear Time. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 15:1-15:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)
https://doi.org/10.4230/LIPIcs.WABI.2018.15

Abstract

Given a threshold L and a set R = {R_1, ..., R_m} of m strings (haplotype sequences), each of length n, the minimum segmentation problem for founder reconstruction is to partition [1,n] into a set P of disjoint segments such that each segment [a,b] in P has length at least L and the maximum, over all [a,b] in P, of the number d(a,b)=|{R_i[a,b] : 1 <= i <= m}| of distinct substrings at segment [a,b] is minimized. The distinct substrings in the segments represent founder blocks that can be concatenated to form max{d(a,b) : [a,b] in P} founder sequences representing the original R such that crossovers happen only at segment boundaries. We give an optimal O(mn) time algorithm to solve the problem, improving on the earlier O(mn^2) bound. This improvement makes it possible to apply the algorithm in a pan-genomic setting, where the input strings are aligned haplotype sequences of complete human chromosomes, with the goal of finding a representative set of references that can be indexed for read alignment and variant calling. We implemented the new algorithm and give some experimental evidence of the practicality of the approach in this pan-genomic setting.
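
To make the problem statement concrete, the following is a minimal sketch of a naive dynamic program over segment end positions. It is not the O(mn) algorithm of the paper: it recomputes d(a,b) from scratch for every candidate segment and is meant only to illustrate the objective on small inputs. The function name `min_segmentation` and its return format are assumptions chosen for illustration.

```python
from typing import List, Tuple


def min_segmentation(reads: List[str], L: int) -> Tuple[int, List[Tuple[int, int]]]:
    """Naive DP (illustrative only, not the paper's linear-time method).

    Returns the minimised value of max d(a,b) and one optimal partition,
    with segments given as 1-based inclusive pairs (a, b).
    """
    n = len(reads[0])
    if any(len(r) != n for r in reads):
        raise ValueError("all strings must have equal length")
    if L < 1 or L > n:
        raise ValueError("need 1 <= L <= n")

    INF = float("inf")
    # best[j]: optimal score for the prefix of length j (best[0] = 0).
    best = [INF] * (n + 1)
    back = [-1] * (n + 1)
    best[0] = 0

    for j in range(L, n + 1):
        # The last segment covers 0-based positions i .. j-1, length >= L.
        for i in range(0, j - L + 1):
            if best[i] == INF:
                continue  # prefix of length i has no valid segmentation
            d = len({r[i:j] for r in reads})  # distinct substrings d(i+1, j)
            score = max(best[i], d)
            if score < best[j]:
                best[j] = score
                back[j] = i

    # Walk the back-pointers to recover the segments.
    segments = []
    j = n
    while j > 0:
        i = back[j]
        segments.append((i + 1, j))
        j = i
    segments.reverse()
    return best[n], segments


if __name__ == "__main__":
    haplotypes = ["ABAB", "ABBA", "BAAB", "BABA"]
    print(min_segmentation(haplotypes, 2))
```

In this toy input, the four strings are all distinct as a whole, but each half contains only the two blocks "AB" and "BA", so splitting at the midpoint reduces the number of founder sequences needed from four to two.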

Subject Classification

ACM Subject Classification

Theory of computation → Design and analysis of algorithms
Applied computing → Bioinformatics

Keywords

Pan-genome indexing
founder reconstruction
dynamic programming
positional Burrows-Wheeler transform
range minimum query
