Abstract
Graph based nonlinear reference structures such as variation graphs and colored de Bruijn graphs enable incorporation of full genomic diversity within a population. However, transitioning from a simple stringbased reference to graphs requires addressing many computational challenges, one of which concerns accurately mapping sequencing read sets to graphs. Pairedend Illumina sequencing is a commonly used sequencing platform in genomics, where the pairedend distance constraints allow disambiguation of repeats. Many recent works have explored provably good indexbased and alignmentbased strategies for mapping individual reads to graphs. However, validating distance constraints efficiently over graphs is not trivial, and existing sequence to graph mappers rely on heuristics. We introduce a mathematical formulation of the problem, and provide a new algorithm to solve it exactly. We take advantage of the high sparsity of reference graphs, and use sparse matrixmatrix multiplications (SpGEMM) to build an index which can be queried efficiently by a mapping algorithm for validating the distance constraints. Effectiveness of the algorithm is demonstrated using real reference graphs, including a human MHC variation graph, and a pangenome deBruijn graph built using genomes of 20 B. anthracis strains. While the onetime indexing time can vary from a few minutes to a few hours using our algorithm, answering a million distance queries takes less than a second.
BibTeX  Entry
@InProceedings{jain_et_al:LIPIcs:2019:11047,
author = {Chirag Jain and Haowen Zhang and Alexander Dilthey and Srinivas Aluru},
title = {{Validating PairedEnd Read Alignments in Sequence Graphs}},
booktitle = {19th International Workshop on Algorithms in Bioinformatics (WABI 2019)},
pages = {17:117:13},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {9783959771238},
ISSN = {18688969},
year = {2019},
volume = {143},
editor = {Katharina T. Huber and Dan Gusfield},
publisher = {Schloss DagstuhlLeibnizZentrum fuer Informatik},
address = {Dagstuhl, Germany},
URL = {http://drops.dagstuhl.de/opus/volltexte/2019/11047},
URN = {urn:nbn:de:0030drops110470},
doi = {10.4230/LIPIcs.WABI.2019.17},
annote = {Keywords: Sequence graphs, read mapping, index, sparse matrixmatrix multiplication}
}
Keywords: 

Sequence graphs, read mapping, index, sparse matrixmatrix multiplication 
Seminar: 

19th International Workshop on Algorithms in Bioinformatics (WABI 2019) 
Issue Date: 

2019 
Date of publication: 

06.09.2019 