DROPS

Document

Research

Faster Run-Length Compressed Suffix Arrays

Authors: Nathaniel K. Brown, Travis Gagie, Giovanni Manzini, Gonzalo Navarro, and Marinella Sciortino

Published in: OASIcs, Volume 132, From Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi's 60th Birthday (2025)

Abstract

We first review how we can store a run-length compressed suffix array (RLCSA) for a text T of length n over an alphabet of size σ whose Burrows-Wheeler Transform (BWT) consists of r runs in O(r log(n/r) + r log σ + σ) bits such that later, given character a and the suffix-array (SA) interval for P, we can find the SA interval for aP in O(log r_a + log log n) time, where r_a is the number of runs of copies of a in the BWT. We then show how to modify the RLCSA such that we find the SA interval for aP in only O(log r_a) time, without increasing its asymptotic space bound. Our key idea is applying a result by Nishimoto and Tabei (ICALP 2021) and then replacing rank queries on sparse bitvectors by a constant number of select queries. We also review two-level indexing and discuss how our faster RLCSA may be useful in improving it. Finally, we briefly discuss how two-level indexing may speed up a recent heuristic for finding maximal exact matches of a pattern with respect to an indexed text.
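
To make the query concrete, here is a minimal Python sketch of one backward step (from the SA interval for P to the SA interval for aP), assuming a plain run-length encoded BWT with binary search over the runs of a. The class `RunLengthBWT` and its methods are hypothetical names, and this is not the authors' data structure: it only illustrates where a log r_a term can come from, and it omits the interpolative coding and the Nishimoto-Tabei technique that the abstract refers to.

```python
from bisect import bisect_right
from itertools import groupby


class RunLengthBWT:
    """Hypothetical toy index: per-character run lists over a BWT string."""

    def __init__(self, bwt: str):
        self.starts = {}   # char -> start positions of its runs in the BWT
        self.before = {}   # char -> copies of char preceding each run
        self.lengths = {}  # char -> length of each run
        counts = {}
        pos = 0
        for ch, group in groupby(bwt):
            length = sum(1 for _ in group)
            self.starts.setdefault(ch, []).append(pos)
            self.before.setdefault(ch, []).append(counts.get(ch, 0))
            self.lengths.setdefault(ch, []).append(length)
            counts[ch] = counts.get(ch, 0) + length
            pos += length
        # C[ch] = number of BWT characters strictly smaller than ch.
        self.C = {}
        total = 0
        for ch in sorted(counts):
            self.C[ch] = total
            total += counts[ch]

    def rank(self, ch: str, i: int) -> int:
        """Copies of ch in bwt[0:i], via binary search over the runs of ch."""
        starts = self.starts.get(ch, [])
        k = bisect_right(starts, i) - 1
        if k < 0:
            return 0
        return self.before[ch][k] + min(i - starts[k], self.lengths[ch][k])

    def backward_step(self, ch: str, interval):
        """Map the half-open SA interval of P to that of ch + P."""
        if ch not in self.C:
            return (0, 0)  # empty interval: ch does not occur in the text
        s, e = interval
        return (self.C[ch] + self.rank(ch, s), self.C[ch] + self.rank(ch, e))


# The BWT of "banana$" is "annb$aa"; search for "ana" from right to left.
index = RunLengthBWT("annb$aa")
interval = (0, 7)
for ch in reversed("ana"):
    interval = index.backward_step(ch, interval)
print(interval)  # (2, 4): the suffixes "ana$" and "anana$"
```

Each call to `rank` performs one binary search over the runs of a single character, so in this toy version a backward step costs O(log r_a) comparisons plus dictionary look-ups; the space-efficient encodings discussed in the abstract are what this sketch leaves out.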

Cite as

Nathaniel K. Brown, Travis Gagie, Giovanni Manzini, Gonzalo Navarro, and Marinella Sciortino. Faster Run-Length Compressed Suffix Arrays. In From Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi's 60th Birthday. Open Access Series in Informatics (OASIcs), Volume 132, pp. 10:1-10:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{brown_et_al:OASIcs.Grossi.10,
  author =	{Brown, Nathaniel K. and Gagie, Travis and Manzini, Giovanni and Navarro, Gonzalo and Sciortino, Marinella},
  title =	{{Faster Run-Length Compressed Suffix Arrays}},
  booktitle =	{From Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi's 60th Birthday},
  pages =	{10:1--10:15},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-391-1},
  ISSN =	{2190-6807},
  year =	{2025},
  volume =	{132},
  editor =	{Conte, Alessio and Marino, Andrea and Rosone, Giovanna and Vitter, Jeffrey Scott},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.Grossi.10},
  URN =		{urn:nbn:de:0030-drops-238095},
  doi =		{10.4230/OASIcs.Grossi.10},
  annote =	{Keywords: Run-length compressed suffix arrays, interpolative coding, two-level indexing}
}

Artifact

Software

DOI: 10.4230/artifacts.22508

StephenHwang/MEMO

Authors: Stephen Hwang, Nathaniel K. Brown, Omar Y. Ahmed, Katharine M. Jenike, Sam Kovaka, Michael C. Schatz, and Ben Langmead

Cite as

Stephen Hwang, Nathaniel K. Brown, Omar Y. Ahmed, Katharine M. Jenike, Sam Kovaka, Michael C. Schatz, and Ben Langmead. StephenHwang/MEMO (Software, Source Code). Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

@misc{dagstuhl-artifact-22508,
   title = {{StephenHwang/MEMO}}, 
   author = {Hwang, Stephen and Brown, Nathaniel K. and Ahmed, Omar Y. and Jenike, Katharine M. and Kovaka, Sam and Schatz, Michael C. and Langmead, Ben},
   note = {Software, version 1.0.0., swhId: \href{https://archive.softwareheritage.org/swh:1:dir:793f47e3260ebae1887b07175fe3087c8e93d1f8;origin=https://github.com/StephenHwang/MEMO;visit=swh:1:snp:b23bfa6e000a68e85c5b91961d022de194b4b86b;anchor=swh:1:rev:d61a1a995b8027ae3d3dbe449502e952321f7217}{\texttt{swh:1:dir:793f47e3260ebae1887b07175fe3087c8e93d1f8}} (visited on 2024-11-28)},
   url = {https://github.com/StephenHwang/MEMO},
   doi = {10.4230/artifacts.22508},
}

Artifact

Software

DOI: 10.4230/artifacts.22509

StephenHwang/MEMO_experiments

Authors: Stephen Hwang, Nathaniel K. Brown, Omar Y. Ahmed, Katharine M. Jenike, Sam Kovaka, Michael C. Schatz, and Ben Langmead

Cite as

Stephen Hwang, Nathaniel K. Brown, Omar Y. Ahmed, Katharine M. Jenike, Sam Kovaka, Michael C. Schatz, and Ben Langmead. StephenHwang/MEMO_experiments (Software, Experiments performed for paper). Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

@misc{dagstuhl-artifact-22509,
   title = {{StephenHwang/MEMO\_experiments}},
   author = {Hwang, Stephen and Brown, Nathaniel K. and Ahmed, Omar Y. and Jenike, Katharine M. and Kovaka, Sam and Schatz, Michael C. and Langmead, Ben},
   note = {Software, swhId: \href{https://archive.softwareheritage.org/swh:1:dir:d69ad61b0d1d563b3945a978b1396fd81be04732;origin=https://github.com/StephenHwang/MEMO_experiments;visit=swh:1:snp:c6a9c4193f1f39f83e8987cf1f9dda2ad2fc3e2d;anchor=swh:1:rev:b47d8f5f8a1d7ff511dad707c79f168feef8469f}{\texttt{swh:1:dir:d69ad61b0d1d563b3945a978b1396fd81be04732}} (visited on 2024-11-28)},
   url = {https://github.com/StephenHwang/MEMO_experiments},
   doi = {10.4230/artifacts.22509},
}

Document

DOI: 10.4230/LIPIcs.WABI.2024.4

MEM-Based Pangenome Indexing for k-mer Queries

Authors: Stephen Hwang, Nathaniel K. Brown, Omar Y. Ahmed, Katharine M. Jenike, Sam Kovaka, Michael C. Schatz, and Ben Langmead

Published in: LIPIcs, Volume 312, 24th International Workshop on Algorithms in Bioinformatics (WABI 2024)

Abstract

Pangenomes are growing in number and size, thanks to the prevalence of high-quality long-read assemblies. However, current methods for studying sequence composition and conservation within pangenomes have limitations. Methods based on graph pangenomes require a computationally expensive multiple-alignment step, which can leave out some variation. Indexes based on k-mers and de Bruijn graphs are limited to answering questions at a specific substring length k. We present Maximal Exact Match Ordered (MEMO), a pangenome indexing method based on maximal exact matches (MEMs) between sequences. A single MEMO index can handle arbitrary-length queries over pangenomic windows. MEMO enables both queries that test k-mer presence/absence (membership queries) and that count the number of genomes containing k-mers in a window (conservation queries). MEMO’s index for a pangenome of 89 human autosomal haplotypes fits in 2.04 GB, 8.8× smaller than a comparable KMC3 index and 11.4× smaller than a PanKmer index. MEMO indexes can be made smaller by sacrificing some counting resolution, with our decile-resolution HPRC index reaching 0.67 GB. MEMO can conduct a conservation query for 31-mers over the human leukocyte antigen locus in 13.89 seconds, 2.5× faster than other approaches. MEMO’s small index size, lack of k-mer length dependence, and efficient queries make it a flexible tool for studying and visualizing substring conservation in pangenomes.
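
The link between MEMs and k-mer queries can be illustrated with a short Python sketch. It relies on one fact: a k-mer starting at reference position p also occurs in another genome exactly when some maximal exact match between the reference and that genome covers the whole interval [p, p + k). The function names and the input layout (one list of half-open MEM intervals per genome, in reference coordinates) are assumptions made for illustration and do not describe MEMO's actual index format.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[int, int]  # half-open [start, end) in reference coordinates


def kmer_in_genome(mems: Sequence[Interval], p: int, k: int) -> bool:
    """Membership: is the reference k-mer at position p covered by one MEM?"""
    return any(s <= p and p + k <= e for s, e in mems)


def conservation(mems_per_genome: Sequence[Sequence[Interval]],
                 window: Interval, k: int) -> List[int]:
    """For each k-mer start in the window, count the genomes containing it."""
    start, end = window
    return [
        sum(kmer_in_genome(mems, p, k) for mems in mems_per_genome)
        for p in range(start, end - k + 1)
    ]


# Three hypothetical genomes; the third shares nothing with the reference.
mems_per_genome = [[(0, 10)], [(0, 4), (3, 10)], []]
print(conservation(mems_per_genome, (0, 8), k=3))  # [2, 2, 1, 2, 2, 2]
print(conservation(mems_per_genome, (0, 8), k=2))  # [2, 2, 2, 2, 2, 2, 2]
```

The same MEM lists answer the query for both values of k, which is the sense in which such an index is not tied to a single substring length. This linear scan is only a sketch; a practical index would organize the intervals so that each query avoids scanning every MEM.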

Cite as

Stephen Hwang, Nathaniel K. Brown, Omar Y. Ahmed, Katharine M. Jenike, Sam Kovaka, Michael C. Schatz, and Ben Langmead. MEM-Based Pangenome Indexing for k-mer Queries. In 24th International Workshop on Algorithms in Bioinformatics (WABI 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 312, pp. 4:1-4:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

@InProceedings{hwang_et_al:LIPIcs.WABI.2024.4,
  author =	{Hwang, Stephen and Brown, Nathaniel K. and Ahmed, Omar Y. and Jenike, Katharine M. and Kovaka, Sam and Schatz, Michael C. and Langmead, Ben},
  title =	{{MEM-Based Pangenome Indexing for k-mer Queries}},
  booktitle =	{24th International Workshop on Algorithms in Bioinformatics (WABI 2024)},
  pages =	{4:1--4:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-340-9},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{312},
  editor =	{Pissis, Solon P. and Sung, Wing-Kin},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2024.4},
  URN =		{urn:nbn:de:0030-drops-206482},
  doi =		{10.4230/LIPIcs.WABI.2024.4},
  annote =	{Keywords: Pangenomics, Comparative genomics, Compressed indexing}
}

Document

DOI: 10.4230/LIPIcs.SEA.2022.16

RLBWT Tricks

Authors: Nathaniel K. Brown, Travis Gagie, and Massimiliano Rossi

Published in: LIPIcs, Volume 233, 20th International Symposium on Experimental Algorithms (SEA 2022)

Abstract

Until recently, most experts would probably have agreed we cannot backwards-step in constant time with a run-length compressed Burrows-Wheeler Transform (RLBWT), since doing so relies on rank queries on sparse bitvectors and those inherit lower bounds from predecessor queries. At ICALP '21, however, Nishimoto and Tabei described a new, simple and constant-time implementation. For a permutation π, it stores an O(r)-space table (where r is the number of positions i where either i = 0 or π(i + 1) ≠ π(i) + 1) that enables the computation of successive values of π(i) by table look-ups and linear scans. Nishimoto and Tabei showed how to increase the number of rows in the table to bound the length of the linear scans such that the query time for computing π(i) is constant while maintaining O(r)-space. In this paper we refine Nishimoto and Tabei’s approach, including a time-space tradeoff, and experimentally evaluate different implementations demonstrating the practicality of part of their result. We show that even without adding rows to the table, in practice we almost always scan only a few entries during queries. We propose a decomposition scheme of the permutation π corresponding to the LF-mapping that allows an improved compression of the data structure, while limiting the query time. We tested our implementation on real-world genomic datasets and found that without compression of the table, backward-stepping is drastically faster than with sparse bitvector implementations but, unfortunately, also uses drastically more space. After compression, backward-stepping is competitive both in time and space with the best existing implementations.
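
The table look-up plus linear scan described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation evaluated in the paper: the names `build_move_table` and `move` are hypothetical, a run is taken to start at i = 0 or wherever π(i) ≠ π(i − 1) + 1, and no rows are added to bound the scan length.

```python
from bisect import bisect_right
from typing import List, Tuple

Row = Tuple[int, int, int]  # (run start p, pi(p), index of the row containing pi(p))


def build_move_table(pi: List[int]) -> List[Row]:
    """One row per maximal run of positions on which pi increases by exactly 1."""
    starts = [i for i in range(len(pi)) if i == 0 or pi[i] != pi[i - 1] + 1]
    rows = []
    for p in starts:
        # Binary search is used only at construction time.
        dest = bisect_right(starts, pi[p]) - 1
        rows.append((p, pi[p], dest))
    return rows


def move(rows: List[Row], i: int, j: int) -> Tuple[int, int]:
    """Given position i and the index j of the row containing it,
    return pi(i) together with the index of the row containing pi(i)."""
    p, q, dest = rows[j]
    i_next = q + (i - p)  # pi maps the whole run by the same offset
    j_next = dest
    # Linear scan: the image of a run may spill over several rows.
    while j_next + 1 < len(rows) and rows[j_next + 1][0] <= i_next:
        j_next += 1
    return i_next, j_next


pi = [1, 2, 3, 4, 5, 0]
rows = build_move_table(pi)  # two rows: runs starting at 0 and at 5
print(move(rows, 4, 0))      # (5, 1): pi(4) = 5, found after scanning one row
```

Because `move` returns the row index along with the new position, successive values of π can be computed without any further predecessor or rank query; only the length of the `while` loop varies, which is the quantity that adding rows to the table is meant to bound.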

Cite as

Nathaniel K. Brown, Travis Gagie, and Massimiliano Rossi. RLBWT Tricks. In 20th International Symposium on Experimental Algorithms (SEA 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 233, pp. 16:1-16:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)

@InProceedings{brown_et_al:LIPIcs.SEA.2022.16,
  author =	{Brown, Nathaniel K. and Gagie, Travis and Rossi, Massimiliano},
  title =	{{RLBWT Tricks}},
  booktitle =	{20th International Symposium on Experimental Algorithms (SEA 2022)},
  pages =	{16:1--16:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-251-8},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{233},
  editor =	{Schulz, Christian and U\c{c}ar, Bora},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2022.16},
  URN =		{urn:nbn:de:0030-drops-165500},
  doi =		{10.4230/LIPIcs.SEA.2022.16},
  annote =	{Keywords: Compressed String Indexes, Repetitive Text Collections, Burrows-Wheeler Transform}
}

Search Results

Documents authored by Brown, Nathaniel K.
