eng
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Leibniz International Proceedings in Informatics
1868-8969
2019-09-03
17:1
17:13
10.4230/LIPIcs.WABI.2019.17
article
Validating Paired-End Read Alignments in Sequence Graphs
Jain, Chirag
1
Zhang, Haowen
1
Dilthey, Alexander
2
Aluru, Srinivas
1
School of Computational Science and Engineering, Georgia Institute of Technology, USA
Institute of Medical Microbiology, University Hospital of Düsseldorf, Germany
Graph-based non-linear reference structures such as variation graphs and colored de Bruijn graphs enable incorporation of full genomic diversity within a population. However, transitioning from a simple string-based reference to graphs requires addressing many computational challenges, one of which concerns accurately mapping sequencing read sets to graphs. Paired-end Illumina sequencing is commonly used in genomics, and its paired-end distance constraints allow disambiguation of repeats. Many recent works have explored provably good index-based and alignment-based strategies for mapping individual reads to graphs. However, validating distance constraints efficiently over graphs is not trivial, and existing sequence-to-graph mappers rely on heuristics. We introduce a mathematical formulation of the problem and provide a new algorithm to solve it exactly. We take advantage of the high sparsity of reference graphs and use sparse matrix-matrix multiplications (SpGEMM) to build an index that a mapping algorithm can query efficiently to validate the distance constraints. The effectiveness of the algorithm is demonstrated using real reference graphs, including a human MHC variation graph and a pan-genome de Bruijn graph built using genomes of 20 B. anthracis strains. While the one-time indexing time can vary from a few minutes to a few hours using our algorithm, answering a million distance queries takes less than a second.
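The indexing idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes a directed graph whose edges all have unit length, and it assumes the distance constraint is a closed range [d1, d2] on the length of a walk from vertex u to vertex v. Under those assumptions, the boolean matrix A^d1 (A + I)^(d2 - d1), where A is the adjacency matrix, has a nonzero entry (u, v) exactly when such a walk exists. The names bool_spgemm, build_index and is_valid are hypothetical, and the sparse matrices are plain Python dictionaries rather than a tuned SpGEMM library.

```python
from typing import Dict, Iterable, Set, Tuple

# Boolean sparse matrix: row index -> set of column indices holding a nonzero.
Sparse = Dict[int, Set[int]]


def bool_spgemm(a: Sparse, b: Sparse) -> Sparse:
    """Boolean sparse product: out[i] holds j iff a[i][k] and b[k][j] for some k."""
    out: Sparse = {}
    for i, cols in a.items():
        reach: Set[int] = set()
        for k in cols:
            reach |= b.get(k, set())
        if reach:  # keep the result sparse: drop empty rows
            out[i] = reach
    return out


def build_index(edges: Iterable[Tuple[int, int]], n: int, d1: int, d2: int) -> Sparse:
    """Build A^d1 (A + I)^(d2 - d1) for a unit-edge-length graph (assumption)."""
    adj: Sparse = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
    # A + I: at each step, either follow an edge or stay in place.
    adj_or_stay: Sparse = {u: adj.get(u, set()) | {u} for u in range(n)}
    index: Sparse = {u: {u} for u in range(n)}  # identity matrix
    for _ in range(d1):          # walks of exactly d1 edges
        index = bool_spgemm(index, adj)
    for _ in range(d2 - d1):     # optionally extend by up to d2 - d1 more edges
        index = bool_spgemm(index, adj_or_stay)
    return index


def is_valid(index: Sparse, u: int, v: int) -> bool:
    """Query: does some walk from u to v have a length within [d1, d2]?"""
    return v in index.get(u, set())


# Toy graph with one bubble: 0->1->2 and 0->2, followed by 2->3->4.
toy_edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]
toy_index = build_index(toy_edges, n=5, d1=2, d2=3)
print(is_valid(toy_index, 0, 4))  # walk 0->2->3->4 has length 3
print(is_valid(toy_index, 0, 1))  # only a length-1 walk exists
```

The index is built once by repeated multiplication, and each query afterwards is a single set lookup, which mirrors the split between one-time indexing cost and fast queries described above. A practical version would use a compressed sparse format and an optimized SpGEMM routine.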
https://drops.dagstuhl.de/storage/00lipics/lipics-vol143-wabi2019/LIPIcs.WABI.2019.17/LIPIcs.WABI.2019.17.pdf
Sequence graphs
read mapping
index
sparse matrix-matrix multiplication