Validating Paired-End Read Alignments in Sequence Graphs

Jain, Chirag; Zhang, Haowen; Dilthey, Alexander; Aluru, Srinivas

doi:10.4230/LIPIcs.WABI.2019.17

Abstract

Graph-based non-linear reference structures such as variation graphs and colored de Bruijn graphs enable incorporation of full genomic diversity within a population. However, transitioning from a simple string-based reference to graphs requires addressing many computational challenges, one of which concerns accurately mapping sequencing read sets to graphs. Paired-end Illumina sequencing is widely used in genomics, and its paired-end distance constraints allow disambiguation of repeats. Many recent works have explored provably good index-based and alignment-based strategies for mapping individual reads to graphs. However, validating distance constraints efficiently over graphs is not trivial, and existing sequence-to-graph mappers rely on heuristics. We introduce a mathematical formulation of the problem and provide a new algorithm to solve it exactly. We take advantage of the high sparsity of reference graphs and use sparse matrix-matrix multiplications (SpGEMM) to build an index that a mapping algorithm can query efficiently to validate the distance constraints. The effectiveness of the algorithm is demonstrated using real reference graphs, including a human MHC variation graph and a pan-genome de Bruijn graph built from the genomes of 20 B. anthracis strains. While the one-time indexing time can vary from a few minutes to a few hours using our algorithm, answering a million distance queries takes less than a second.
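
The indexing idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a simplified setting in which distance is counted in edges between vertices (the paper works with sequence-labeled graphs), and it builds a boolean index as A^d1 · (A + I)^(d2 − d1), where A is the adjacency matrix and [d1, d2] is the allowed distance range. The names `bool_spgemm`, `build_index`, and `is_valid_pair` are hypothetical, and the toy graph is invented for illustration.

```python
from typing import Dict, Set

# Sparse boolean matrix: row index -> set of column indices that hold True.
SparseBool = Dict[int, Set[int]]


def bool_spgemm(a: SparseBool, b: SparseBool) -> SparseBool:
    """Row-wise boolean sparse product: row i of C is the union of B[k] for k in A[i]."""
    c: SparseBool = {}
    for i, cols in a.items():
        row: Set[int] = set()
        for k in cols:
            row |= b.get(k, set())
        if row:  # store only non-empty rows to keep the result sparse
            c[i] = row
    return c


def build_index(adj: SparseBool, n: int, d1: int, d2: int) -> SparseBool:
    """Return index where index[u] holds every v reachable from u by a walk
    of some length in [d1, d2] (lengths counted in edges; an assumption)."""
    if not 0 <= d1 <= d2:
        raise ValueError("require 0 <= d1 <= d2")
    identity: SparseBool = {v: {v} for v in range(n)}
    adj_plus_i: SparseBool = {v: set(adj.get(v, set())) | {v} for v in range(n)}
    index = identity
    for _ in range(d1):        # A^d1: walks of exactly d1 edges
        index = bool_spgemm(index, adj)
    for _ in range(d2 - d1):   # each (A + I) factor allows zero or one extra edge
        index = bool_spgemm(index, adj_plus_i)
    return index


def is_valid_pair(index: SparseBool, u: int, v: int) -> bool:
    """Query: does some walk from u to v satisfy the distance constraint?"""
    return v in index.get(u, set())


# Toy graph with one bubble: 0 -> {1, 2} -> 3 -> 4 (invented for illustration).
adjacency: SparseBool = {0: {1, 2}, 1: {3}, 2: {3}, 3: {4}}
index = build_index(adjacency, n=5, d1=2, d2=3)

print(is_valid_pair(index, 0, 4))  # vertex 4 is 3 edges from vertex 0
print(is_valid_pair(index, 0, 1))  # vertex 1 is only 1 edge from vertex 0
```

In this sketch the index is built once, and each query reduces to a set-membership test, which mirrors the abstract's split between a one-time indexing cost and fast distance queries.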

