,
Luca Renders
,
Simon Van de Vyver
,
Lennart Veys,
Travis Gagie
,
Jan Fostier
Creative Commons Attribution 4.0 International license
Due to the increasing availability of high-quality genome sequences, pan-genomes are gradually replacing single consensus reference genomes in many bioinformatics pipelines to better capture genetic diversity. Traditional bioinformatics tools using the FM-index face memory limitations with such large genome collections. Recent advancements in run-length compressed indices like Gagie et al.’s r-index and Nishimoto and Tabei’s move structure, alleviate memory constraints but focus primarily on backward search for MEM-finding. Arakawa et al.’s br-index initiates complete approximate pattern matching using bidirectional search in run-length compressed space, but with significant computational overhead due to complex memory access patterns. We introduce b-move, a novel bidirectional extension of the move structure, enabling fast, cache-efficient bidirectional character extensions in run-length compressed space. It achieves bidirectional character extensions up to 8 times faster than the br-index, closing the performance gap with FM-index-based alternatives, while maintaining the br-index’s favorable memory characteristics. For example, all available complete E. coli genomes on NCBI’s RefSeq collection can be compiled into a b-move index that fits into the RAM of a typical laptop. Thus, b-move proves practical and scalable for pan-genome indexing and querying. We provide a C++ implementation of b-move, supporting efficient lossless approximate pattern matching including locate functionality, available at https://github.com/biointec/b-move under the AGPL-3.0 license.
@InProceedings{depuydt_et_al:LIPIcs.WABI.2024.10,
author = {Depuydt, Lore and Renders, Luca and Van de Vyver, Simon and Veys, Lennart and Gagie, Travis and Fostier, Jan},
title = {{b-move: Faster Bidirectional Character Extensions in a Run-Length Compressed Index}},
booktitle = {24th International Workshop on Algorithms in Bioinformatics (WABI 2024)},
pages = {10:1--10:18},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {978-3-95977-340-9},
ISSN = {1868-8969},
year = {2024},
volume = {312},
editor = {Pissis, Solon P. and Sung, Wing-Kin},
publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2024.10},
URN = {urn:nbn:de:0030-drops-206546},
doi = {10.4230/LIPIcs.WABI.2024.10},
annote = {Keywords: Pan-genomics, FM-index, r-index, Move Structure, Bidirectional Search, Approximate Pattern Matching, Lossless Alignment, Cache Efficiency}
}
archived version