Read Mapping on Genome Variation Graphs

Vaddadi, Kavya; Srinivasan, Rajgopal; Sivadasan, Naveen

doi:10.4230/LIPIcs.WABI.2019.7

Abstract

Genome variation graphs are natural candidates to represent a pangenome collection. In such graphs, common subsequences are encoded as vertices, and genomic variations are captured by introducing additional labeled vertices and directed edges. Unlike a linear reference, a reference graph allows a rich representation of genomic diversity and avoids reference bias. We address the fundamental problem of mapping reads to genome variation graphs. We give a novel mapping algorithm, V-MAP, that efficiently identifies a small subgraph of the genome graph for optimal gapped alignment of the read. For fast and accurate mapping, V-MAP creates a space-efficient index using locality-sensitive minimizer signatures, which are computed using a novel graph winnowing and a graph embedding onto a metric space. Experiments on a graph constructed from the 1000 Genomes data, using both real and simulated reads, show that V-MAP is fast, memory efficient, and can map short reads as well as PacBio/Nanopore long reads with high accuracy. V-MAP performed significantly better than the state of the art, especially for long reads.
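To make the idea of minimizer signatures concrete, the following is a minimal sketch of classic winnowing on a plain linear sequence. It is not V-MAP's graph winnowing, which the abstract does not specify. The function names `kmer_rank` and `minimizers`, the 2-bit k-mer ordering, and the parameters `k` and `w` are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical k-mer ordering (assumption): 2-bit encoding, A < C < G < T.
_BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_rank(kmer: str) -> int:
    """Map a k-mer to an integer so that k-mers can be compared."""
    value = 0
    for base in kmer:
        value = value * 4 + _BASE_CODE[base]
    return value


def minimizers(seq: str, k: int, w: int) -> List[Tuple[int, str]]:
    """Classic winnowing on a linear sequence.

    For every window of w consecutive k-mers, keep the k-mer with the
    smallest rank (leftmost on ties). Returns distinct (position, k-mer)
    pairs in order of position.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked: List[Tuple[int, str]] = []
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        best = min(window, key=lambda i: (kmer_rank(kmers[i]), i))
        # Consecutive windows often select the same position; record it once.
        if not picked or picked[-1][0] != best:
            picked.append((best, kmers[best]))
    return picked


if __name__ == "__main__":
    for position, kmer in minimizers("ACGTACGT", k=3, w=2):
        print(position, kmer)
```

Because neighbouring windows tend to select the same k-mer, the sampled set is much smaller than the full k-mer set, which is the general reason minimizer-based indexes can be space efficient. Under the assumptions above, a read and the region it came from tend to share sampled k-mers, so the sampled k-mers can serve as a compact signature for locating candidate regions.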

