, Nathaniel K. Brown, Omar Y. Ahmed, Katharine M. Jenike, Sam Kovaka, Michael C. Schatz, Ben Langmead
Creative Commons Attribution 4.0 International license
Pangenomes are growing in number and size, thanks to the prevalence of high-quality long-read assemblies. However, current methods for studying sequence composition and conservation within pangenomes have limitations. Methods based on graph pangenomes require a computationally expensive multiple-alignment step, which can leave out some variation. Indexes based on k-mers and de Bruijn graphs are limited to answering questions at a specific substring length k. We present Maximal Exact Match Ordered (MEMO), a pangenome indexing method based on maximal exact matches (MEMs) between sequences. A single MEMO index can handle arbitrary-length queries over pangenomic windows. MEMO enables both queries that test k-mer presence/absence (membership queries) and queries that count the number of genomes containing k-mers in a window (conservation queries). MEMO’s index for a pangenome of 89 human autosomal haplotypes fits in 2.04 GB, 8.8× smaller than a comparable KMC3 index and 11.4× smaller than a PanKmer index. MEMO indexes can be made smaller by sacrificing some counting resolution, with our decile-resolution HPRC index reaching 0.67 GB. MEMO can conduct a conservation query for 31-mers over the human leukocyte antigen locus in 13.89 seconds, 2.5× faster than other approaches. MEMO’s small index size, lack of k-mer length dependence, and efficient queries make it a flexible tool for studying and visualizing substring conservation in pangenomes.
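To make the two query types in the abstract concrete, here is a minimal Python sketch of what a membership query and a conservation query compute. It is a naive k-mer-set baseline and not the MEM-based MEMO index. The function names, the choice of a reference sequence to define the window, and the convention that a window covers k-mers starting in `[start, end)` are all assumptions made for illustration.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def conservation_query(reference, genomes, start, end, k):
    """For each k-mer starting in reference[start:end], count how many
    genomes contain it.

    Naive set-based baseline (hypothetical helper), not the MEMO index.
    """
    genome_kmers = [kmers(g, k) for g in genomes]
    counts = []
    # Stop early so every k-mer fits entirely inside the reference.
    for i in range(start, min(end, len(reference) - k + 1)):
        kmer = reference[i:i + k]
        counts.append(sum(kmer in ks for ks in genome_kmers))
    return counts


def membership_query(reference, genome, start, end, k):
    """Presence/absence of each window k-mer in a single genome."""
    return [c == 1 for c in conservation_query(reference, [genome], start, end, k)]


# Toy data, invented for this example.
reference = "ACGTAC"
genomes = ["ACGTAC", "ACGGAC", "TTGTAC"]
print(conservation_query(reference, genomes, 0, 4, 3))
print(membership_query(reference, genomes[1], 0, 4, 3))
```

This baseline must rebuild its k-mer sets for every value of `k`, which is the length dependence the abstract says a single MEMO index avoids.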
@InProceedings{hwang_et_al:LIPIcs.WABI.2024.4,
author = {Hwang, Stephen and Brown, Nathaniel K. and Ahmed, Omar Y. and Jenike, Katharine M. and Kovaka, Sam and Schatz, Michael C. and Langmead, Ben},
title = {{MEM-Based Pangenome Indexing for k-mer Queries}},
booktitle = {24th International Workshop on Algorithms in Bioinformatics (WABI 2024)},
pages = {4:1--4:17},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {978-3-95977-340-9},
ISSN = {1868-8969},
year = {2024},
volume = {312},
editor = {Pissis, Solon P. and Sung, Wing-Kin},
publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2024.4},
URN = {urn:nbn:de:0030-drops-206482},
doi = {10.4230/LIPIcs.WABI.2024.4},
annote = {Keywords: Pangenomics, Comparative genomics, Compressed indexing}
}