Validating Paired-End Read Alignments in Sequence Graphs

Authors Chirag Jain, Haowen Zhang, Alexander Dilthey, Srinivas Aluru




File

LIPIcs.WABI.2019.17.pdf
  • Filesize: 485 kB
  • 13 pages

Author Details

Chirag Jain
  • School of Computational Science and Engineering, Georgia Institute of Technology, USA
Haowen Zhang
  • School of Computational Science and Engineering, Georgia Institute of Technology, USA
Alexander Dilthey
  • Institute of Medical Microbiology, University Hospital of Düsseldorf, Germany
Srinivas Aluru
  • School of Computational Science and Engineering, Georgia Institute of Technology, USA

Acknowledgements

The authors thank Abdurrahman Yasar, Siva Rajamanickam and Srinivas Eswar for sharing their insights on sparse matrix manipulations.

Cite As

Chirag Jain, Haowen Zhang, Alexander Dilthey, and Srinivas Aluru. Validating Paired-End Read Alignments in Sequence Graphs. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 143, pp. 17:1-17:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)
https://doi.org/10.4230/LIPIcs.WABI.2019.17

Abstract

Graph-based non-linear reference structures such as variation graphs and colored de Bruijn graphs enable incorporation of full genomic diversity within a population. However, transitioning from a simple string-based reference to graphs requires addressing many computational challenges, one of which concerns accurately mapping sequencing read sets to graphs. Paired-end Illumina sequencing is commonly used in genomics, where the paired-end distance constraints allow disambiguation of repeats. Many recent works have explored provably good index-based and alignment-based strategies for mapping individual reads to graphs. However, validating distance constraints efficiently over graphs is not trivial, and existing sequence-to-graph mappers rely on heuristics. We introduce a mathematical formulation of the problem, and provide a new algorithm to solve it exactly. We take advantage of the high sparsity of reference graphs, and use sparse matrix-matrix multiplications (SpGEMM) to build an index which can be queried efficiently by a mapping algorithm for validating the distance constraints. The effectiveness of the algorithm is demonstrated using real reference graphs, including a human MHC variation graph and a pan-genome de Bruijn graph built using genomes of 20 B. anthracis strains. While the one-time indexing time can vary from a few minutes to a few hours using our algorithm, answering a million distance queries takes less than a second.
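
As a rough illustration of the indexing idea described in the abstract, the Python sketch below builds a boolean index with sparse matrix products and then answers distance queries by lookup. It is not the authors' implementation. It assumes a directed graph with unit-length edges, measures distance as the number of edges on a walk, and stores sparse boolean matrices as dictionaries of sets. The function names (spgemm_bool, bool_power, build_index, query) and the use of A^d1 * (A + I)^(d2 - d1) to cover the distance range [d1, d2] are choices made for this sketch.

```python
# Illustrative sketch only: a toy boolean SpGEMM index for distance-range queries.
# Assumptions (not from the source): unit edge lengths, vertices numbered 0..n-1,
# sparse boolean matrices stored as {row: set(columns)}.

def spgemm_bool(a, b):
    """Boolean sparse matrix product; empty rows are not stored."""
    out = {}
    for i, cols in a.items():
        row = set()
        for k in cols:
            row |= b.get(k, set())
        if row:
            out[i] = row
    return out


def identity(n):
    """n x n boolean identity matrix."""
    return {i: {i} for i in range(n)}


def bool_power(m, exp, n):
    """Boolean matrix power m**exp by repeated squaring."""
    result = identity(n)
    base = m
    while exp > 0:
        if exp & 1:
            result = spgemm_bool(result, base)
        exp >>= 1
        if exp:
            base = spgemm_bool(base, base)
    return result


def build_index(edges, n, d1, d2):
    """Return T where v in T[u] iff some walk u -> v has length in [d1, d2]."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
    # A**d1 marks pairs joined by a walk of exactly d1 edges.
    exact = bool_power(adj, d1, n)
    # (A + I)**(d2 - d1) marks pairs joined by a walk of 0..(d2 - d1) edges.
    adj_plus_identity = {i: set(adj.get(i, ())) | {i} for i in range(n)}
    slack = bool_power(adj_plus_identity, d2 - d1, n)
    return spgemm_bool(exact, slack)


def query(index, u, v):
    """Check whether the pair (u, v) satisfies the distance constraint."""
    return v in index.get(u, ())


if __name__ == "__main__":
    # Small hypothetical graph: 0 -> 1 -> 2 -> 3 plus a shortcut 0 -> 2.
    toy_edges = [(0, 1), (1, 2), (2, 3), (0, 2)]
    toy_index = build_index(toy_edges, n=4, d1=2, d2=3)
    print(query(toy_index, 0, 3), query(toy_index, 2, 3))
```

In this sketch all matrix products happen once, at index construction, so each query is a single set-membership test. That mirrors the trade-off stated in the abstract between a one-time indexing cost and fast queries.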

Subject Classification

ACM Subject Classification
  • Mathematics of computing → Paths and connectivity problems
  • Applied computing → Computational genomics
Keywords
  • Sequence graphs
  • read mapping
  • index
  • sparse matrix-matrix multiplication
