DROPS

Document

DOI: 10.4230/LIPIcs.WABI.2025.17

An Efficient Data Structure and Algorithm for Long-Match Query in Run-Length Compressed BWT

Authors: Ahsan Sanaullah, Degui Zhi, and Shaojie Zhang

Published in: LIPIcs, Volume 344, 25th International Conference on Algorithms for Bioinformatics (WABI 2025)

Abstract

String matching problems in bioinformatics are typically for finding exact substring matches between a query and a reference text. Previous formulations often focus on maximum exact matches (MEMs). However, multiple occurrences of substrings of the query in the text that are long enough but not maximal may not be captured by MEMs. Such long matches can be informative, especially when the text is a collection of similar sequences such as genomes. In this paper, we describe a new type of match between a pattern and a text that aren't necessarily maximal in the query, but still contain useful matching information: locally maximal exact matches (LEMs). There are usually a large amount of LEMs, so we only consider those above some length threshold ℒ. These are referred to as long LEMs. The purpose of long LEMs is to capture substring matches between a query and a text that are not necessarily maximal in the pattern but still long enough to be important. Therefore efficient long LEMs finding algorithms are desired for these datasets. However, these datasets are too large to query on traditional string indexes. Fortunately, these datasets are very repetitive. Recently, compressed string indexes that take advantage of the redundancy in the data but retain efficient querying capability have been proposed as a solution. We therefore give an efficient algorithm for computing all the long LEMs of a query and a text in a BWT runs compressed string index. We describe an O(m+occ) expected time algorithm that relies on an O(r) words space string index for outputting all long LEMs of a pattern with respect to a text given the matching statistics of the pattern with respect to the text. Here m is the length of the query, occ is the number of long LEMs outputted, and r is the number of runs in the BWT of the text. The O(r) space string index we describe relies on an adaptation of the move data structure by Nishimoto and Tabei. We are able to support LCP[i] queries in constant time given SA[i]. In other words, we answer PLCP[i] queries in constant time. These PLCP queries enable the efficient long LEM query. Long LEMs may provide useful similarity information between a pattern and a text that MEMs may ignore. This information is particularly useful in pangenome and biobank scale haplotype panel contexts.

Cite as

Ahsan Sanaullah, Degui Zhi, and Shaojie Zhang. An Efficient Data Structure and Algorithm for Long-Match Query in Run-Length Compressed BWT. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 17:1-17:25, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{sanaullah_et_al:LIPIcs.WABI.2025.17,
  author =	{Sanaullah, Ahsan and Zhi, Degui and Zhang, Shaojie},
  title =	{{An Efficient Data Structure and Algorithm for Long-Match Query in Run-Length Compressed BWT}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{17:1--17:25},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.17},
  URN =		{urn:nbn:de:0030-drops-239433},
  doi =		{10.4230/LIPIcs.WABI.2025.17},
  annote =	{Keywords: BWT, LEM, Long LEM, MEM, Run Length Compressed BWT, Move Data Structure, Pangenome}
}

Document

Research

DOI: 10.4230/OASIcs.Grossi.12

Encoding Data Structures for Range Queries on Arrays

Authors: Seungbum Jo and Srinivasa Rao Satti

Published in: OASIcs, Volume 132, From Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi's 60th Birthday (2025)

Abstract

Efficiently processing range queries on arrays is a fundamental problem in computer science, with applications spanning diverse domains such as database management, computational biology, and geographic information systems. A range query retrieves information about a specific segment of an array, such as the sum, minimum, maximum, or median of elements within a given range. The challenge lies in designing data structures that allow such queries to be answered quickly, often in constant or logarithmic time, while keeping space overhead (and preprocessing time) small. Encoding data structures for range queries has emerged as a pivotal area of research due to the increasing demand for high-performance systems handling massive datasets. These structures consider the data together with the queries and aim to store only as much information about the data as is needed to answer the queries. The data structure does not need to access the original data to answer the queries. Encoding-based solutions often leverage techniques from succinct data structures, bit manipulation, and combinatorial optimization to achieve both space and time efficiency. By encoding the array in a manner that preserves critical information, these methods strike a balance between query time and space usage. In this survey article, we explore the landscape of encoding data structures for range queries on arrays, providing a comprehensive overview of some important results on space-efficient encodings for various types of range query.

Cite as

Seungbum Jo and Srinivasa Rao Satti. Encoding Data Structures for Range Queries on Arrays. In From Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi's 60th Birthday. Open Access Series in Informatics (OASIcs), Volume 132, pp. 12:1-12:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{jo_et_al:OASIcs.Grossi.12,
  author =	{Jo, Seungbum and Satti, Srinivasa Rao},
  title =	{{Encoding Data Structures for Range Queries on Arrays}},
  booktitle =	{From Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi's 60th Birthday},
  pages =	{12:1--12:12},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-391-1},
  ISSN =	{2190-6807},
  year =	{2025},
  volume =	{132},
  editor =	{Conte, Alessio and Marino, Andrea and Rosone, Giovanna and Vitter, Jeffrey Scott},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.Grossi.12},
  URN =		{urn:nbn:de:0030-drops-238116},
  doi =		{10.4230/OASIcs.Grossi.12},
  annote =	{Keywords: range queries, RMQ, Cartesian tree, top-k queries, range median, range mode}
}

Document

Track A: Algorithms, Complexity and Games

DOI: 10.4230/LIPIcs.ICALP.2025.55

Optimal Oblivious Subspace Embeddings with Near-Optimal Sparsity

Authors: Shabarish Chenakkod, Michał Dereziński, and Xiaoyu Dong

Published in: LIPIcs, Volume 334, 52nd International Colloquium on Automata, Languages, and Programming (ICALP 2025)

Abstract

An oblivious subspace embedding is a random m× n matrix Π such that, for any d-dimensional subspace, with high probability Π preserves the norms of all vectors in that subspace within a 1±ε factor. In this work, we give an oblivious subspace embedding with the optimal dimension m = Θ(d/ε²) that has a near-optimal sparsity of Õ(1/ε) non-zero entries per column of Π. This is the first result to nearly match the conjecture of Nelson and Nguyen [FOCS 2013] in terms of the best sparsity attainable by an optimal oblivious subspace embedding, improving on a prior bound of Õ(1/ε⁶) non-zeros per column [Chenakkod et al., STOC 2024]. We further extend our approach to the non-oblivious setting, proposing a new family of Leverage Score Sparsified embeddings with Independent Columns, which yield faster runtimes for matrix approximation and regression tasks. In our analysis, we develop a new method which uses a decoupling argument together with the cumulant method for bounding the edge universality error of isotropic random matrices. To achieve near-optimal sparsity, we combine this general-purpose approach with new trace inequalities that leverage the specific structure of our subspace embedding construction.

Cite as

Shabarish Chenakkod, Michał Dereziński, and Xiaoyu Dong. Optimal Oblivious Subspace Embeddings with Near-Optimal Sparsity. In 52nd International Colloquium on Automata, Languages, and Programming (ICALP 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 334, pp. 55:1-55:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{chenakkod_et_al:LIPIcs.ICALP.2025.55,
  author =	{Chenakkod, Shabarish and Derezi\'{n}ski, Micha{\l} and Dong, Xiaoyu},
  title =	{{Optimal Oblivious Subspace Embeddings with Near-Optimal Sparsity}},
  booktitle =	{52nd International Colloquium on Automata, Languages, and Programming (ICALP 2025)},
  pages =	{55:1--55:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-372-0},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{334},
  editor =	{Censor-Hillel, Keren and Grandoni, Fabrizio and Ouaknine, Jo\"{e}l and Puppis, Gabriele},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2025.55},
  URN =		{urn:nbn:de:0030-drops-234324},
  doi =		{10.4230/LIPIcs.ICALP.2025.55},
  annote =	{Keywords: Randomized linear algebra, matrix sketching, subspace embeddings}
}

Document

Track A: Algorithms, Complexity and Games

DOI: 10.4230/LIPIcs.ICALP.2025.114

Optimal Static Fully Indexable Dictionaries

Authors: Jingxun Liang and Renfei Zhou

Published in: LIPIcs, Volume 334, 52nd International Colloquium on Automata, Languages, and Programming (ICALP 2025)

Abstract

Fully indexable dictionaries (FID) store sets of integer keys while supporting rank/select queries. They serve as basic building blocks in many succinct data structures. Despite the great importance of FIDs, no known FID is succinct with efficient query time when the universe size U is a large polynomial in the number of keys n, which is the conventional parameter regime for dictionary problems. In this paper, we design an FID that uses log binom(U,n) + n/((log U/t)^{Ω(t)}) bits of space, and answers rank/select queries in O(t + log log n) time in the worst case, for any parameter 1 ≤ t ≤ log n / log log n, provided U = n^{1 + Θ(1)}. This time-space trade-off matches known lower bounds for FIDs [Pǎtraşcu and Thorup, 2006; Pǎtraşcu and Viola, 2010; Viola, 2023] when t ≤ log^{0.99} n. Our techniques also lead to efficient succinct data structures for the fundamental problem of maintaining n integers each of 𝓁 = Θ(log n) bits and supporting partial-sum queries, with a trade-off between O(t) query time and n𝓁 + n / (log n / t)^{Ω(t)} bits of space. Prior to this work, no known data structure for the partial-sum problem achieves constant query time with n 𝓁 + o(n) bits of space usage.

Cite as

Jingxun Liang and Renfei Zhou. Optimal Static Fully Indexable Dictionaries. In 52nd International Colloquium on Automata, Languages, and Programming (ICALP 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 334, pp. 114:1-114:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{liang_et_al:LIPIcs.ICALP.2025.114,
  author =	{Liang, Jingxun and Zhou, Renfei},
  title =	{{Optimal Static Fully Indexable Dictionaries}},
  booktitle =	{52nd International Colloquium on Automata, Languages, and Programming (ICALP 2025)},
  pages =	{114:1--114:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-372-0},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{334},
  editor =	{Censor-Hillel, Keren and Grandoni, Fabrizio and Ouaknine, Jo\"{e}l and Puppis, Gabriele},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2025.114},
  URN =		{urn:nbn:de:0030-drops-234918},
  doi =		{10.4230/LIPIcs.ICALP.2025.114},
  annote =	{Keywords: data structures, dictionaries, space efficiency}
}

Document

Track A: Algorithms, Complexity and Games

DOI: 10.4230/LIPIcs.ICALP.2020.79

Succinct Filters for Sets of Unknown Sizes

Authors: Mingmou Liu, Yitong Yin, and Huacheng Yu

Published in: LIPIcs, Volume 168, 47th International Colloquium on Automata, Languages, and Programming (ICALP 2020)

Abstract

The membership problem asks to maintain a set S ⊆ [u], supporting insertions and membership queries, i.e., testing if a given element is in the set. A data structure that computes exact answers is called a dictionary. When a (small) false positive rate ε is allowed, the data structure is called a filter. The space usages of the standard dictionaries or filters usually depend on the upper bound on the size of S, while the actual set can be much smaller. Pagh, Segev and Wieder [Pagh et al., 2013] were the first to study filters with varying space usage based on the current |S|. They showed in order to match the space with the current set size n = |S|, any filter data structure must use (1-o(1))n(log(1/ε)+(1-O(ε))log log n) bits, in contrast to the well-known lower bound of N log(1/ε) bits, where N is an upper bound on |S|. They also presented a data structure with almost optimal space of (1+o(1))n(log(1/ε)+O(log log n)) bits provided that n > u^0.001, with expected amortized constant insertion time and worst-case constant lookup time. In this work, we present a filter data structure with improvements in two aspects: - it has constant worst-case time for all insertions and lookups with high probability; - it uses space (1+o(1))n(log (1/ε)+log log n) bits when n > u^0.001, achieving optimal leading constant for all ε = o(1). We also present a dictionary that uses (1+o(1))nlog(u/n) bits of space, matching the optimal space in terms of the current size, and performs all operations in constant time with high probability.

Cite as

Mingmou Liu, Yitong Yin, and Huacheng Yu. Succinct Filters for Sets of Unknown Sizes. In 47th International Colloquium on Automata, Languages, and Programming (ICALP 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 168, pp. 79:1-79:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

Copy BibTex To Clipboard

@InProceedings{liu_et_al:LIPIcs.ICALP.2020.79,
  author =	{Liu, Mingmou and Yin, Yitong and Yu, Huacheng},
  title =	{{Succinct Filters for Sets of Unknown Sizes}},
  booktitle =	{47th International Colloquium on Automata, Languages, and Programming (ICALP 2020)},
  pages =	{79:1--79:19},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-138-2},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{168},
  editor =	{Czumaj, Artur and Dawar, Anuj and Merelli, Emanuela},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2020.79},
  URN =		{urn:nbn:de:0030-drops-124867},
  doi =		{10.4230/LIPIcs.ICALP.2020.79},
  annote =	{Keywords: Bloom filters, Data structures, Approximate set membership, Dictionaries}
}

5 Search Results for "Liu, Mingmou"

An Efficient Data Structure and Algorithm for Long-Match Query in Run-Length Compressed BWT

Abstract

Cite as

Encoding Data Structures for Range Queries on Arrays

Abstract

Cite as

Optimal Oblivious Subspace Embeddings with Near-Optimal Sparsity

Abstract

Cite as

Optimal Static Fully Indexable Dictionaries

Abstract

Cite as

Succinct Filters for Sets of Unknown Sizes

Abstract

Cite as

Thanks for your feedback!

Could not send message