Haplotype-aware graph indexes

Authors: Jouni Sirén, Erik Garrison, Adam M. Novak, Benedict J. Paten, Richard Durbin






Author Details

Jouni Sirén
  • University of California, Santa Cruz, USA; Wellcome Sanger Institute, Hinxton, UK
Erik Garrison
  • Wellcome Sanger Institute, Hinxton, UK
Adam M. Novak
  • University of California, Santa Cruz, USA
Benedict J. Paten
  • University of California, Santa Cruz, USA
Richard Durbin
  • Department of Genetics, University of Cambridge, UK; Wellcome Sanger Institute, Hinxton, UK

Cite As

Jouni Sirén, Erik Garrison, Adam M. Novak, Benedict J. Paten, and Richard Durbin. Haplotype-aware graph indexes. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 4:1-4:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)
https://doi.org/10.4230/LIPIcs.WABI.2018.4

Abstract

The variation graph toolkit (VG) represents genetic variation as a graph. Each path in the graph is a potential haplotype, though most paths are unlikely recombinations of true haplotypes. We augment the VG model with haplotype information to identify which paths are more likely to be correct. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows-Wheeler transform. We demonstrate the scalability of the new implementation by indexing the 1000 Genomes Project haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes.
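The distinction the abstract draws between arbitrary graph paths and haplotype-supported paths can be illustrated with a toy example. The sketch below is not the paper's method: it replaces the compressed positional BWT index with a plain Python dictionary, and the graph, the haplotype paths, and the function names `graph_walks` and `haplotype_walks` are all hypothetical, chosen only for illustration.

```python
# Toy variation graph: node id -> list of successor node ids.
# Two variant sites (bubbles): 1 -> {2, 3} -> 4 -> {5, 6} -> 7.
graph = {1: [2, 3], 2: [4], 3: [4], 4: [5, 6], 5: [7], 6: [7], 7: []}

# Hypothetical haplotypes, each stored as a path (sequence of node ids).
haplotypes = [
    [1, 2, 4, 5, 7],
    [1, 3, 4, 6, 7],
    [1, 2, 4, 5, 7],
]


def graph_walks(graph, k):
    """Return the set of all walks of exactly k nodes in the graph."""
    walks = {(v,) for v in graph}
    for _ in range(k - 1):
        # Extend every walk by each successor of its last node;
        # walks that end at a sink are dropped.
        walks = {w + (u,) for w in walks for u in graph[w[-1]]}
    return walks


def haplotype_walks(haplotypes, k):
    """Map each k-node window seen in a haplotype to its occurrence count."""
    counts = {}
    for path in haplotypes:
        for i in range(len(path) - k + 1):
            window = tuple(path[i:i + k])
            counts[window] = counts.get(window, 0) + 1
    return counts


walks = graph_walks(graph, 3)
counts = haplotype_walks(haplotypes, 3)

# Walks that exist in the graph but in no haplotype: recombinations
# of the observed haplotypes.
unsupported = sorted(walks - set(counts))
print(len(walks), len(counts), unsupported)
```

In this toy graph, 8 walks of three nodes exist, 6 of them occur in at least one haplotype, and the remaining two, (2, 4, 6) and (3, 4, 5), combine alleles that no haplotype carries together. A haplotype-aware index answers the same kind of query at scale; a graph simplification that keeps only supported walks would, in this example, retain every three-node window present in the haplotypes.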

Subject Classification

ACM Subject Classification
  • Theory of computation → Pattern matching
  • Theory of computation → Data compression
  • Applied computing → Computational genomics
Keywords
  • FM-indexes
  • variation graphs
  • haplotypes

