Read Mapping on Genome Variation Graphs

Authors: Kavya Vaddadi, Rajgopal Srinivasan, Naveen Sivadasan




Author Details

Kavya Vaddadi
  • TCS Research, Hyderabad, India
Rajgopal Srinivasan
  • TCS Research, Hyderabad, India
Naveen Sivadasan
  • TCS Research, Hyderabad, India

Acknowledgements

The authors thank the anonymous reviewers for their valuable comments and acknowledge Kshitij Tayal for the initial implementation of the algorithm.

Cite As

Kavya Vaddadi, Rajgopal Srinivasan, and Naveen Sivadasan. Read Mapping on Genome Variation Graphs. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 143, pp. 7:1-7:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)
https://doi.org/10.4230/LIPIcs.WABI.2019.7

Abstract

Genome variation graphs are natural candidates for representing a pangenome collection. In such graphs, common subsequences are encoded as vertices, and genomic variations are captured by introducing additional labeled vertices and directed edges. Unlike a linear reference, a reference graph allows a rich representation of genomic diversity and avoids reference bias. We address the fundamental problem of mapping reads to genome variation graphs. We give a novel mapping algorithm, V-MAP, for efficiently identifying a small subgraph of the genome graph for optimal gapped alignment of the read. For fast and accurate mapping, V-MAP creates a space-efficient index using locality-sensitive minimizer signatures, computed using a novel graph winnowing and a graph embedding onto a metric space. Experiments involving a graph constructed from the 1000 Genomes data, using both real and simulated reads, show that V-MAP is fast and memory efficient, and that it can map short reads as well as PacBio/Nanopore long reads with high accuracy. V-MAP performed significantly better than state-of-the-art methods, especially for long reads.
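
To make the notion of a minimizer signature concrete, the following is a minimal sketch of classic winnowing on a linear sequence. It is not V-MAP's graph winnowing or its index, which the abstract does not specify. The function names `minimizers` and `shared_minimizers`, the parameters `k` (k-mer length) and `w` (window size), the lexicographic ordering, and the leftmost tie-breaking rule are all assumptions made for illustration; practical mappers typically order k-mers by a hash value instead.

```python
def minimizers(seq, k, w):
    """Return the sorted list of (position, k-mer) minimizers of seq.

    Illustrative sketch only: for every window of w consecutive k-mers,
    pick the lexicographically smallest k-mer (leftmost on ties).
    Sequences shorter than k + w - 1 yield an empty list.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        # Order by (k-mer, position) so that ties go to the leftmost k-mer.
        best = min(window, key=lambda i: (kmers[i], i))
        selected.add((best, kmers[best]))
    return sorted(selected)


def shared_minimizers(read, reference, k, w):
    """Return reference minimizers whose k-mer is also a read minimizer.

    Hypothetical helper: such shared minimizers can act as candidate
    anchors that narrow down where a read might map.
    """
    read_kmers = {kmer for _, kmer in minimizers(read, k, w)}
    return [(pos, kmer) for pos, kmer in minimizers(reference, k, w)
            if kmer in read_kmers]
```

For example, `minimizers("ACGTACGT", 3, 2)` selects the k-mers at positions 0, 1, 2 and 4. Because neighbouring windows often agree on their smallest k-mer, the signature is usually much smaller than the full k-mer set, which is what makes minimizer-based indexes space efficient.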

Subject Classification

ACM Subject Classification
  • Computing methodologies → Combinatorial algorithms
  • Applied computing → Computational genomics
Keywords
  • read mapping
  • pangenome
  • genome variation graphs
  • locality sensitive hashing

