, Brian Riccardi, Luca Denti, Simone Ciccolella, Kunihiko Sadakane, Paola Bonizzoni
Creative Commons Attribution 4.0 International license
Indexing pangenome graphs is a major algorithmic challenge in computational pangenomics, a recent and active research field that seeks to use graphs as representations of multiple genomes. Since these graphs are constructed from whole genome sequences of a species population, they can become very large, which makes indexing them particularly difficult. In this paper, we propose gindex, a novel indexing approach based on the multidollar-BWT that solves the Graph Pattern Matching Problem. Specifically, gindex aims to find all occurrences of a pattern in a sequence-labeled graph while overcoming two main limitations of GCSA2, one of the most widely used graph indexes: it handles queries of arbitrary length and scales to large graphs without pruning any complex regions. Moreover, we show how a smart preprocessing step can optimize the use of the multidollar-BWT to skip small redundant sub-patterns and enhance gindex’s querying capabilities. We demonstrate the effectiveness of our approach by comparing it to GCSA2 in terms of index construction and query time, using different preprocessing modes on three pangenome graphs: one built from Drosophila genomes and two produced by the Human Pangenome Reference Consortium. The results show that gindex scales to human pangenome graphs with acceptable memory and time requirements, whereas GCSA2 cannot index them even with large amounts of RAM. Moreover, gindex achieves fast query times, although not as fast as those of GCSA2, which may, however, produce false positives.
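As a rough illustration of the underlying idea only, the sketch below builds a naive multi-string BWT in which every string gets its own distinct terminator (the "multidollar" convention) and then runs a standard backward search over it. It works on a plain set of strings, not on a graph, and it is not the gindex implementation: the function names (`multidollar_bwt`, `backward_search`, `find_occurrences`), the tuple encoding of symbols, and the quadratic construction are all assumptions made for readability.

```python
from bisect import bisect_left


def multidollar_bwt(strings):
    """Naive multi-string BWT with one distinct terminator per string.

    Symbols are encoded as tuples so that every terminator sorts before
    every ordinary character: terminator of string i -> (0, i),
    ordinary character c -> (1, c). Illustrative encoding, not gindex's.
    """
    seqs = [[(1, c) for c in s] + [(0, i)] for i, s in enumerate(strings)]
    # Collect every suffix of every string, remembering where it starts.
    suffixes = [(seq[j:], i, j) for i, seq in enumerate(seqs) for j in range(len(seq))]
    suffixes.sort()  # distinct terminators make all suffixes distinct
    # BWT symbol = symbol preceding the suffix (index -1 wraps to the terminator).
    bwt = [seqs[i][j - 1] for _, i, j in suffixes]
    sa = [(i, j) for _, i, j in suffixes]
    return bwt, sa


def backward_search(bwt, pattern):
    """Return the half-open suffix-array interval [lo, hi) matching pattern."""
    first = sorted(bwt)  # first column of the sorted suffix matrix
    lo, hi = 0, len(bwt)
    for c in reversed(pattern):
        sym = (1, c)
        start = bisect_left(first, sym)  # number of symbols smaller than sym
        # Rank queries done by linear counting; a real index uses rank structures.
        lo = start + bwt[:lo].count(sym)
        hi = start + bwt[:hi].count(sym)
        if lo >= hi:
            return 0, 0
    return lo, hi


def find_occurrences(strings, pattern):
    """All (string index, offset) pairs where a non-empty pattern occurs."""
    bwt, sa = multidollar_bwt(strings)
    lo, hi = backward_search(bwt, pattern)
    return sorted(sa[lo:hi])


if __name__ == "__main__":
    genomes = ["ACGT", "CGTA", "GATTACA"]
    print(find_occurrences(genomes, "CGT"))
    print(find_occurrences(genomes, "TA"))
```

Because each string ends in a unique terminator, a match can never span two strings, which is the property that makes this kind of BWT suitable for indexing collections. How gindex maps graph paths onto such a collection and how its preprocessing step skips redundant sub-patterns is described in the paper itself and is not modeled here.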
@InProceedings{cozzi_et_al:LIPIcs.SEA.2025.13,
author = {Cozzi, Davide and Riccardi, Brian and Denti, Luca and Ciccolella, Simone and Sadakane, Kunihiko and Bonizzoni, Paola},
title = {{Pangenome Graph Indexing via the Multidollar-BWT}},
booktitle = {23rd International Symposium on Experimental Algorithms (SEA 2025)},
pages = {13:1--13:17},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {978-3-95977-375-1},
ISSN = {1868-8969},
year = {2025},
volume = {338},
editor = {Mutzel, Petra and Prezza, Nicola},
publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2025.13},
URN = {urn:nbn:de:0030-drops-232515},
doi = {10.4230/LIPIcs.SEA.2025.13},
annote = {Keywords: Multidollar-BWT, Graph Index, Graph Pattern Matching, Pangenome Graph}
}