DROPS

Document

DOI: 10.4230/LIPIcs.WABI.2025.17

An Efficient Data Structure and Algorithm for Long-Match Query in Run-Length Compressed BWT

Authors: Ahsan Sanaullah, Degui Zhi, and Shaojie Zhang

Published in: LIPIcs, Volume 344, 25th International Conference on Algorithms for Bioinformatics (WABI 2025)

Abstract

String matching problems in bioinformatics typically involve finding exact substring matches between a query and a reference text. Previous formulations often focus on maximal exact matches (MEMs). However, multiple occurrences of substrings of the query in the text that are long enough but not maximal may not be captured by MEMs. Such long matches can be informative, especially when the text is a collection of similar sequences such as genomes. In this paper, we describe a new type of match between a pattern and a text that is not necessarily maximal in the query but still contains useful matching information: locally maximal exact matches (LEMs). There is usually a large number of LEMs, so we only consider those above some length threshold ℒ. These are referred to as long LEMs. The purpose of long LEMs is to capture substring matches between a query and a text that are not necessarily maximal in the pattern but still long enough to be important. Therefore, efficient algorithms for finding long LEMs are desired for these datasets. However, these datasets are too large to query on traditional string indexes. Fortunately, these datasets are very repetitive. Recently, compressed string indexes that take advantage of the redundancy in the data but retain efficient querying capability have been proposed as a solution. We therefore give an efficient algorithm for computing all the long LEMs of a query and a text in a BWT-runs compressed string index. We describe an O(m+occ) expected time algorithm that relies on an O(r) words space string index for outputting all long LEMs of a pattern with respect to a text given the matching statistics of the pattern with respect to the text. Here m is the length of the query, occ is the number of long LEMs outputted, and r is the number of runs in the BWT of the text. The O(r) space string index we describe relies on an adaptation of the move data structure by Nishimoto and Tabei.
We are able to support LCP[i] queries in constant time given SA[i]. In other words, we answer PLCP[i] queries in constant time. These PLCP queries enable the efficient long LEM query. Long LEMs may provide useful similarity information between a pattern and a text that MEMs may ignore. This information is particularly useful in pangenome and biobank scale haplotype panel contexts.
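
The relationship between LCP and PLCP mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the paper's O(r)-space structure: it builds the suffix array and LCP array naively and simply reindexes LCP by text position, so that PLCP[SA[i]] = LCP[i]. The function names, the 0-based indexing, the convention LCP[0] = 0, and the example text are all assumptions made for illustration.

```python
def suffix_array(text):
    """Naive suffix array: starting positions sorted by suffix."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix_length(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def lcp_array(text, sa):
    """LCP[i] = longest common prefix of the suffixes at SA[i-1] and SA[i].

    LCP[0] is set to 0 by convention (assumption of this sketch).
    """
    return [0] + [
        common_prefix_length(text[sa[i - 1]:], text[sa[i]:])
        for i in range(1, len(sa))
    ]


def plcp_array(sa, lcp):
    """PLCP is LCP reindexed by text position: PLCP[SA[i]] = LCP[i]."""
    plcp = [0] * len(sa)
    for rank, pos in enumerate(sa):
        plcp[pos] = lcp[rank]
    return plcp
```

For the text `banana$`, this gives SA = [6, 5, 3, 1, 0, 4, 2], LCP = [0, 0, 1, 3, 0, 0, 2], and PLCP = [0, 3, 2, 1, 0, 0, 0], so knowing SA[i] is enough to read LCP[i] out of PLCP.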

Cite as

Ahsan Sanaullah, Degui Zhi, and Shaojie Zhang. An Efficient Data Structure and Algorithm for Long-Match Query in Run-Length Compressed BWT. In 25th International Conference on Algorithms for Bioinformatics (WABI 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 344, pp. 17:1-17:25, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{sanaullah_et_al:LIPIcs.WABI.2025.17,
  author =	{Sanaullah, Ahsan and Zhi, Degui and Zhang, Shaojie},
  title =	{{An Efficient Data Structure and Algorithm for Long-Match Query in Run-Length Compressed BWT}},
  booktitle =	{25th International Conference on Algorithms for Bioinformatics (WABI 2025)},
  pages =	{17:1--17:25},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-386-7},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{344},
  editor =	{Brejov\'{a}, Bro\v{n}a and Patro, Rob},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2025.17},
  URN =		{urn:nbn:de:0030-drops-239433},
  doi =		{10.4230/LIPIcs.WABI.2025.17},
  annote =	{Keywords: BWT, LEM, Long LEM, MEM, Run Length Compressed BWT, Move Data Structure, Pangenome}
}

Document

Research

DOI: 10.4230/OASIcs.Grossi.10

Faster Run-Length Compressed Suffix Arrays

Authors: Nathaniel K. Brown, Travis Gagie, Giovanni Manzini, Gonzalo Navarro, and Marinella Sciortino

Published in: OASIcs, Volume 132, From Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi's 60th Birthday (2025)

Abstract

We first review how we can store a run-length compressed suffix array (RLCSA) for a text T of length n over an alphabet of size σ whose Burrows-Wheeler Transform (BWT) consists of r runs in O(r log(n/r) + r log σ + σ) bits such that later, given character a and the suffix-array (SA) interval for P, we can find the SA interval for aP in O(log r_a + log log n) time, where r_a is the number of runs of copies of a in the BWT. We then show how to modify the RLCSA such that we find the SA interval for aP in only O(log r_a) time, without increasing its asymptotic space bound. Our key idea is applying a result by Nishimoto and Tabei (ICALP 2021) and then replacing rank queries on sparse bitvectors by a constant number of select queries. We also review two-level indexing and discuss how our faster RLCSA may be useful in improving it. Finally, we briefly discuss how two-level indexing may speed up a recent heuristic for finding maximal exact matches of a pattern with respect to an indexed text.

Cite as

Nathaniel K. Brown, Travis Gagie, Giovanni Manzini, Gonzalo Navarro, and Marinella Sciortino. Faster Run-Length Compressed Suffix Arrays. In From Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi's 60th Birthday. Open Access Series in Informatics (OASIcs), Volume 132, pp. 10:1-10:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{brown_et_al:OASIcs.Grossi.10,
  author =	{Brown, Nathaniel K. and Gagie, Travis and Manzini, Giovanni and Navarro, Gonzalo and Sciortino, Marinella},
  title =	{{Faster Run-Length Compressed Suffix Arrays}},
  booktitle =	{From Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi's 60th Birthday},
  pages =	{10:1--10:15},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-391-1},
  ISSN =	{2190-6807},
  year =	{2025},
  volume =	{132},
  editor =	{Conte, Alessio and Marino, Andrea and Rosone, Giovanna and Vitter, Jeffrey Scott},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.Grossi.10},
  URN =		{urn:nbn:de:0030-drops-238095},
  doi =		{10.4230/OASIcs.Grossi.10},
  annote =	{Keywords: Run-length compressed suffix arrays, interpolative coding, two-level indexing}
}

Document

DOI: 10.4230/OASIcs.Manzini.6

Optimizing the Performance of the FM-Index for Large-Scale Data

Authors: Eddie Ferro and Christina Boucher

Published in: OASIcs, Volume 131, The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday (2025)

Abstract

The FM-index is a fundamental data structure used in bioinformatics to efficiently search for strings and index genomes. However, the FM-index can pose computational challenges, particularly in the context of large-scale genomic datasets, due to the complexity of its underlying components and data encodings. In this paper, we present a comprehensive review of efficient variants of the FM-index and the encoding strategies used to improve performance. We examine hardware-accelerated techniques, such as memory-efficient data layouts and cache-aware structures, as well as software-level innovations, including algorithmic refinements and compact representations. The reviewed work demonstrates substantial gains in both speed and scalability, making methods that use the FM-index more practical for high-throughput genomic applications. By analyzing the trade-offs and design choices of these variants, we highlight how combining hardware-aware and software-centric strategies enables more efficient FM-index construction and usage across a range of bioinformatics tasks.

Cite as

Eddie Ferro and Christina Boucher. Optimizing the Performance of the FM-Index for Large-Scale Data. In The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday. Open Access Series in Informatics (OASIcs), Volume 131, pp. 6:1-6:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{ferro_et_al:OASIcs.Manzini.6,
  author =	{Ferro, Eddie and Boucher, Christina},
  title =	{{Optimizing the Performance of the FM-Index for Large-Scale Data}},
  booktitle =	{The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday},
  pages =	{6:1--6:21},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-390-4},
  ISSN =	{2190-6807},
  year =	{2025},
  volume =	{131},
  editor =	{Ferragina, Paolo and Gagie, Travis and Navarro, Gonzalo},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.Manzini.6},
  URN =		{urn:nbn:de:0030-drops-239140},
  doi =		{10.4230/OASIcs.Manzini.6},
  annote =	{Keywords: FM-Index Acceleration, Run-Length Encoding, Suffix Array Optimization, Burrows-Wheeler Transform, Efficient Backward Search}
}

Document

DOI: 10.4230/OASIcs.Manzini.7

Algorithms for Computing Very Large BWTs: a Short Survey

Authors: Diego Díaz-Domínguez, Lavinia Egidi, Veronica Guerrini, Felipe A. Louza, and Giovanna Rosone

Published in: OASIcs, Volume 131, The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday (2025)

Abstract

The Burrows-Wheeler Transform (BWT) is a fundamental string transformation that, although initially introduced for data compression, has been extensively utilized across various domains, including text indexing and pattern matching within large datasets. Although the BWT construction is linear, the constants make the task impractical for large datasets, and as highlighted by Ferragina et al. [Paolo Ferragina et al., 2012], "to use it, one must first build it!". Thus, the construction of the BWT remains a significant challenge. For these reasons, during the past three decades there has been a succession of new algorithms for its construction using techniques that work in external memory or that use text compression. In this survey, we revise some of the most important advancements and tools presented in the past years for computing large BWTs exploiting external memory or text compression approaches without using additional information about the data.

Cite as

Diego Díaz-Domínguez, Lavinia Egidi, Veronica Guerrini, Felipe A. Louza, and Giovanna Rosone. Algorithms for Computing Very Large BWTs: a Short Survey. In The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday. Open Access Series in Informatics (OASIcs), Volume 131, pp. 7:1-7:28, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{diazdominguez_et_al:OASIcs.Manzini.7,
  author =	{D{\'\i}az-Dom{\'\i}nguez, Diego and Egidi, Lavinia and Guerrini, Veronica and Louza, Felipe A. and Rosone, Giovanna},
  title =	{{Algorithms for Computing Very Large BWTs: a Short Survey}},
  booktitle =	{The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday},
  pages =	{7:1--7:28},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-390-4},
  ISSN =	{2190-6807},
  year =	{2025},
  volume =	{131},
  editor =	{Ferragina, Paolo and Gagie, Travis and Navarro, Gonzalo},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.Manzini.7},
  URN =		{urn:nbn:de:0030-drops-239151},
  doi =		{10.4230/OASIcs.Manzini.7},
  annote =	{Keywords: Burrows-Wheeler transform, Extended Burrows-Wheeler transform, external memory, text compression, longest common prefix}
}

Document

DOI: 10.4230/LIPIcs.SEA.2025.12

Bit Packed Encodings for Grammar-Compressed Strings Supporting Fast Random Access

Authors: Alan M. Cleary, Joseph Winjum, Jordan Dood, Hiroki Shibata, and Shunsuke Inenaga

Published in: LIPIcs, Volume 338, 23rd International Symposium on Experimental Algorithms (SEA 2025)

Abstract

Grammar-based compression is a powerful compression technique that allows for computation over the compressed data. While there has been extensive theoretical work on grammar and encoding size, there has been little work on practical grammar encodings. In this work, we consider the canonical array-of-arrays grammar representation and present a general bit packing approach for reducing its space requirements in practice. We then present three bit packing strategies based on this approach - one online and two offline - with different space-time trade-offs. This technique can be used to encode any grammar-compressed string while preserving the virtues of the array-of-arrays representation. We show that our encodings are N log₂ N away from the information-theoretic bound, where N is the number of symbols in the grammar, and that they are much smaller than methods that meet the information-theoretic bound in practice. Moreover, our experiments show that by using bit packed encodings we can achieve state-of-the-art performance both in grammar encoding size and run-time performance of random-access queries.

Cite as

Alan M. Cleary, Joseph Winjum, Jordan Dood, Hiroki Shibata, and Shunsuke Inenaga. Bit Packed Encodings for Grammar-Compressed Strings Supporting Fast Random Access. In 23rd International Symposium on Experimental Algorithms (SEA 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 338, pp. 12:1-12:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{cleary_et_al:LIPIcs.SEA.2025.12,
  author =	{Cleary, Alan M. and Winjum, Joseph and Dood, Jordan and Shibata, Hiroki and Inenaga, Shunsuke},
  title =	{{Bit Packed Encodings for Grammar-Compressed Strings Supporting Fast Random Access}},
  booktitle =	{23rd International Symposium on Experimental Algorithms (SEA 2025)},
  pages =	{12:1--12:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-375-1},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{338},
  editor =	{Mutzel, Petra and Prezza, Nicola},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2025.12},
  URN =		{urn:nbn:de:0030-drops-232506},
  doi =		{10.4230/LIPIcs.SEA.2025.12},
  annote =	{Keywords: String algorithms, data compression, random access, grammar-based compression}
}

Document

Track A: Algorithms, Complexity and Games

DOI: 10.4230/LIPIcs.ICALP.2025.88

Repetition Aware Text Indexing for Matching Patterns with Wildcards

Authors: Daniel Gibney, Jackson Huffstutler, Mano Prakash Parthasarathi, and Sharma V. Thankachan

Published in: LIPIcs, Volume 334, 52nd International Colloquium on Automata, Languages, and Programming (ICALP 2025)

Abstract

We study the problem of indexing a text T[1..n] to support pattern matching with wildcards. The input of a query is a pattern P[1..m] containing h ∈ [0, k] wildcard (a.k.a. don't care) characters and the output is the set of occurrences of P in T (i.e., starting positions of substrings of T that match P), where k = o(log n) is fixed at index construction. A classic solution by Cole et al. [STOC 2004] provides an index with space complexity O(n ⋅ (c log n)^k/k!) and query time O(m + 2^h log log n + occ), where c > 1 is a constant, and occ denotes the number of occurrences of P in T. We introduce a new data structure that significantly reduces space usage for highly repetitive texts while maintaining efficient query processing. Its space (in words) and query time are O(δ log(n/δ) ⋅ c^k (1 + (log^k(δ log n))/k!)) and O((m + 2^h + occ) log n), respectively. The parameter δ, known as substring complexity, is a recently introduced measure of repetitiveness that serves as a unifying and lower-bounding metric for several popular measures, including the number of phrases in the LZ77 factorization (denoted by z) and the number of runs in the Burrows-Wheeler Transform (denoted by r). Moreover, O(δ log(n/δ)) represents the optimal space required to encode the data in terms of n and δ, helping us see how close our space is to the minimum required. In another trade-off, we match the query time of Cole et al.’s index using O(n + δ log(n/δ) ⋅ (c log δ)^{k+ε}/k!) space, where ε > 0 is an arbitrarily small constant. We also demonstrate how these techniques can be applied to a more general indexing problem, where the query pattern includes k-gaps (a gap can be interpreted as a contiguous sequence of wildcard characters).

Cite as

Daniel Gibney, Jackson Huffstutler, Mano Prakash Parthasarathi, and Sharma V. Thankachan. Repetition Aware Text Indexing for Matching Patterns with Wildcards. In 52nd International Colloquium on Automata, Languages, and Programming (ICALP 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 334, pp. 88:1-88:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{gibney_et_al:LIPIcs.ICALP.2025.88,
  author =	{Gibney, Daniel and Huffstutler, Jackson and Parthasarathi, Mano Prakash and Thankachan, Sharma V.},
  title =	{{Repetition Aware Text Indexing for Matching Patterns with Wildcards}},
  booktitle =	{52nd International Colloquium on Automata, Languages, and Programming (ICALP 2025)},
  pages =	{88:1--88:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-372-0},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{334},
  editor =	{Censor-Hillel, Keren and Grandoni, Fabrizio and Ouaknine, Jo\"{e}l and Puppis, Gabriele},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2025.88},
  URN =		{urn:nbn:de:0030-drops-234656},
  doi =		{10.4230/LIPIcs.ICALP.2025.88},
  annote =	{Keywords: Pattern Matching, Text Indexing, Wildcard Matching}
}

Document

DOI: 10.4230/LIPIcs.CPM.2025.23

Space-Efficient Online Computation of String Net Occurrences

Authors: Takuya Mieno and Shunsuke Inenaga

Published in: LIPIcs, Volume 331, 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)

Abstract

A substring u of a string T is said to be a repeat if u occurs at least twice in T. An occurrence [i..j] of a repeat u in T is said to be a net occurrence if each of the substrings aub = T[i-1..j+1], au = T[i-1..j], and ub = T[i..j+1] occurs exactly once in T. The occurrence [i-1..j+1] of aub is said to be an extended net occurrence of u. Let T be an input string of length n over an alphabet of size σ, and let ENO(T) denote the set of extended net occurrences of repeats in T. Guo et al. [SPIRE 2024] presented an online algorithm which can report ENO(T[1..i]) in T[1..i] in O(nσ²) time, for each prefix T[1..i] of T. Very recently, Inenaga [arXiv 2024] gave a faster online algorithm that can report ENO(T[1..i]) in optimal O(#ENO(T[1..i])) time for each prefix T[1..i] of T, where #S denotes the cardinality of a set S. Both of the aforementioned data structures can be maintained in O(n log σ) time and occupy O(n) space, where the O(n)-space requirement comes from the suffix tree data structure. In particular, Inenaga’s recent algorithm is based on Weiner’s right-to-left online suffix tree construction. In this paper, we show that one can modify Ukkonen’s left-to-right online suffix tree construction algorithm in O(n) space, so that ENO(T[1..i]) can be reported in optimal O(#ENO(T[1..i])) time for each prefix T[1..i] of T. This is an improvement over Guo et al.’s method that is also based on Ukkonen’s algorithm. Further, this leads us to the two following space-efficient alternatives:
- A sliding-window algorithm of O(d) working space that can report ENO(T[i-d+1..i]) in optimal O(#ENO(T[i-d+1..i])) time for each sliding window T[i-d+1..i] of size d in T.
- A CDAWG-based online algorithm of O(𝖾) working space that can report ENO(T[1..i]) in optimal O(#ENO(T[1..i])) time for each prefix T[1..i] of T, where 𝖾 < 2n is the number of edges in the CDAWG for T.
All of our proposed data structures can be maintained in O(n log σ) time for the input online string T.
We also discuss that the extended net occurrences of repeats in T can be fully characterized in terms of the minimal unique substrings (MUSs) in T.
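
The definition of a net occurrence given in the abstract can be checked directly with a brute-force Python sketch. This is only an illustration of the definition, not the suffix-tree or CDAWG algorithms of the paper, and it runs in polynomial rather than optimal time. It uses 0-based inclusive indices and, as a simplifying assumption, only considers occurrences that have both a left and a right neighbouring character inside the text. The function names are hypothetical.

```python
def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    return sum(
        1
        for start in range(len(text) - len(pattern) + 1)
        if text.startswith(pattern, start)
    )


def net_occurrences(text):
    """Return (i, j, u) for every interior net occurrence u = text[i..j].

    u must be a repeat (at least two occurrences), while its left extension
    au and right extension ub must each occur exactly once. If au is unique,
    aub is unique as well, so aub is not tested separately.
    """
    n = len(text)
    found = []
    for i in range(1, n - 1):          # text[i-1] must exist
        for j in range(i, n - 1):      # text[j+1] must exist
            u = text[i:j + 1]
            if count_occurrences(text, u) < 2:
                break                  # longer substrings from i are not repeats either
            au = text[i - 1:j + 1]
            ub = text[i:j + 2]
            if count_occurrences(text, au) == 1 and count_occurrences(text, ub) == 1:
                found.append((i, j, u))
    return found
```

For the text `xabcabd`, the sketch reports both occurrences of `ab`, at [1..2] and [4..5]: `ab` occurs twice, whereas `xab`, `abc`, `cab`, and `abd` each occur once.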

Cite as

Takuya Mieno and Shunsuke Inenaga. Space-Efficient Online Computation of String Net Occurrences. In 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 331, pp. 23:1-23:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

@InProceedings{mieno_et_al:LIPIcs.CPM.2025.23,
  author =	{Mieno, Takuya and Inenaga, Shunsuke},
  title =	{{Space-Efficient Online Computation of String Net Occurrences}},
  booktitle =	{36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)},
  pages =	{23:1--23:13},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-369-0},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{331},
  editor =	{Bonizzoni, Paola and M\"{a}kinen, Veli},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2025.23},
  URN =		{urn:nbn:de:0030-drops-231175},
  doi =		{10.4230/LIPIcs.CPM.2025.23},
  annote =	{Keywords: string net occurrences, suffix trees, CDAWGs, maximal repeats, minimal unique substrings (MUSs)}
}

Document

DOI: 10.4230/LIPIcs.SEA.2024.1

Move-r: Optimizing the r-index

Authors: Nico Bertram, Johannes Fischer, and Lukas Nalbach

Published in: LIPIcs, Volume 301, 22nd International Symposium on Experimental Algorithms (SEA 2024)

Abstract

We present a static text index called Move-r, which is a highly optimized version of the r-index (Gagie et al., 2020) that incorporates recent theoretical developments of the move data structure (Nishimoto and Tabei, 2021). The r-index is the method of choice for indexing highly repetitive texts, such as different versions of a text document or DNA from the same species, as it exploits the compressibility of the underlying data. With Move-r, we can answer count and locate queries 2-35 (typically 15) times as fast as with any other r-index supporting locate queries while being 0.8-2.5 (typically 2) times as large. A Move-r index can be constructed 0.9-2 (typically 2) times as fast while using 1/3-1 (typically 1/2) times as much space.

Cite as

Nico Bertram, Johannes Fischer, and Lukas Nalbach. Move-r: Optimizing the r-index. In 22nd International Symposium on Experimental Algorithms (SEA 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 301, pp. 1:1-1:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

@InProceedings{bertram_et_al:LIPIcs.SEA.2024.1,
  author =	{Bertram, Nico and Fischer, Johannes and Nalbach, Lukas},
  title =	{{Move-r: Optimizing the r-index}},
  booktitle =	{22nd International Symposium on Experimental Algorithms (SEA 2024)},
  pages =	{1:1--1:19},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-325-6},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{301},
  editor =	{Liberti, Leo},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2024.1},
  URN =		{urn:nbn:de:0030-drops-203662},
  doi =		{10.4230/LIPIcs.SEA.2024.1},
  annote =	{Keywords: Compressed Text Index, Burrows-Wheeler Transform}
}

Document

Track A: Algorithms, Complexity and Games

DOI: 10.4230/LIPIcs.ICALP.2022.99

An Optimal-Time RLBWT Construction in BWT-Runs Bounded Space

Authors: Takaaki Nishimoto, Shunsuke Kanda, and Yasuo Tabei

Published in: LIPIcs, Volume 229, 49th International Colloquium on Automata, Languages, and Programming (ICALP 2022)

Abstract

The compression of highly repetitive strings (i.e., strings with many repetitions) has been a central research topic in string processing, and quite a few compression methods for these strings have been proposed thus far. Among them, an efficient compression format gathering increasing attention is the run-length Burrows-Wheeler transform (RLBWT), which is a run-length encoded BWT as a reversible permutation of an input string on the lexicographical order of suffixes. State-of-the-art construction algorithms of RLBWT have a serious issue with respect to (i) non-optimal computation time or (ii) a working space that is linearly proportional to the length of an input string. In this paper, we present r-comp, the first optimal-time construction algorithm of RLBWT in BWT-runs bounded space. That is, the computational complexity of r-comp is O(n + r log r) time and O(r log n) bits of working space for the length n of an input string and the number r of equal-letter runs in BWT. The computation time is optimal (i.e., O(n)) for strings with the property r = O(n/log n), which holds for most highly repetitive strings. Experiments using a real-world dataset of highly repetitive strings show the effectiveness of r-comp with respect to computation time and space.
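
To make the quantities n and r in the abstract concrete, here is a naive Python sketch of the RLBWT. It is not r-comp: it sorts all rotations, so it needs far more than O(r log n) bits of working space. It assumes the input ends with a unique sentinel `$` that is smaller than every other character, and the function names are hypothetical.

```python
from itertools import groupby


def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (naive construction).

    Assumes text ends with a unique sentinel '$' that sorts before all
    other characters.
    """
    assert text.endswith("$") and text.count("$") == 1
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    return "".join(rotation[-1] for rotation in rotations)


def run_length_encode(s):
    """Collapse s into (character, run length) pairs; r is the number of pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]
```

For `banana$` (n = 7), the BWT is `annb$aa` and its run-length encoding has r = 5 runs: (a,1), (n,2), (b,1), ($,1), (a,2). On highly repetitive inputs r is typically much smaller than n.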

Cite as

Takaaki Nishimoto, Shunsuke Kanda, and Yasuo Tabei. An Optimal-Time RLBWT Construction in BWT-Runs Bounded Space. In 49th International Colloquium on Automata, Languages, and Programming (ICALP 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 229, pp. 99:1-99:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)

@InProceedings{nishimoto_et_al:LIPIcs.ICALP.2022.99,
  author =	{Nishimoto, Takaaki and Kanda, Shunsuke and Tabei, Yasuo},
  title =	{{An Optimal-Time RLBWT Construction in BWT-Runs Bounded Space}},
  booktitle =	{49th International Colloquium on Automata, Languages, and Programming (ICALP 2022)},
  pages =	{99:1--99:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-235-8},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{229},
  editor =	{Boja\'{n}czyk, Miko{\l}aj and Merelli, Emanuela and Woodruff, David P.},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2022.99},
  URN =		{urn:nbn:de:0030-drops-164403},
  doi =		{10.4230/LIPIcs.ICALP.2022.99},
  annote =	{Keywords: lossless data compression, Burrows-Wheeler transform, highly repetitive text collections}
}

Document

Invited Talk

DOI: 10.4230/LIPIcs.CPM.2022.3

Compact Text Indexing for Advanced Pattern Matching Problems: Parameterized, Order-Isomorphic, 2D, etc. (Invited Talk)

Authors: Sharma V. Thankachan

Published in: LIPIcs, Volume 223, 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022)

Abstract

In the past two decades, we have witnessed the design of various compact data structures for pattern matching over an indexed text [Navarro, 2016]. Popular indexes like the FM-index [Paolo Ferragina and Giovanni Manzini, 2005], compressed suffix arrays/trees [Roberto Grossi and Jeffrey Scott Vitter, 2005; Kunihiko Sadakane, 2007], the recent r-index [Travis Gagie et al., 2020; Takaaki Nishimoto and Yasuo Tabei, 2021], etc., capture the key functionalities of classic suffix arrays/trees [Udi Manber and Eugene W. Myers, 1993; Peter Weiner, 1973] in compact space. Mostly, they rely on the Burrows-Wheeler Transform (BWT) and its associated operations [Burrows and Wheeler, 1994]. However, compactly encoding some advanced suffix tree (ST) variants, like parameterized ST [Brenda S. Baker, 1993; S. Rao Kosaraju, 1995; Juan Mendivelso et al., 2020], order-isomorphic/preserving ST [Maxime Crochemore et al., 2016], two-dimensional ST [Raffaele Giancarlo, 1995; Dong Kyue Kim et al., 1998], etc. [Sung Gwan Park et al., 2019; Tetsuo Shibuya, 2000], collectively known as suffix trees with missing suffix links [Richard Cole and Ramesh Hariharan, 2003], has been challenging. The previous techniques are not easily extendable because these variants do not hold some structural properties of the standard ST that enable compression. However, some limited progress has been made in these directions recently [Arnab Ganguly et al., 2017; Travis Gagie et al., 2017; Gianni Decaroli et al., 2017; Dhrumil Patel and Rahul Shah, 2021; Arnab Ganguly et al., 2021; Sung-Hwan Kim and Hwan-Gue Cho, 2021; Sung-Hwan Kim and Hwan-Gue Cho, 2021; Arnab Ganguly et al., 2017; Arnab Ganguly et al., 2022; Arnab Ganguly et al., 2021]. This talk will briefly survey them and highlight some interesting open problems.

Cite as

Sharma V. Thankachan. Compact Text Indexing for Advanced Pattern Matching Problems: Parameterized, Order-Isomorphic, 2D, etc. (Invited Talk). In 33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 223, pp. 3:1-3:3, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)

@InProceedings{thankachan:LIPIcs.CPM.2022.3,
  author =	{Thankachan, Sharma V.},
  title =	{{Compact Text Indexing for Advanced Pattern Matching Problems: Parameterized, Order-Isomorphic, 2D, etc.}},
  booktitle =	{33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022)},
  pages =	{3:1--3:3},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-234-1},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{223},
  editor =	{Bannai, Hideo and Holub, Jan},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2022.3},
  URN =		{urn:nbn:de:0030-drops-161300},
  doi =		{10.4230/LIPIcs.CPM.2022.3},
  annote =	{Keywords: Text Indexing, Suffix Trees, String Matching}
}

Document

Track A: Algorithms, Complexity and Games

DOI: 10.4230/LIPIcs.ICALP.2021.101

Optimal-Time Queries on BWT-Runs Compressed Indexes

Authors: Takaaki Nishimoto and Yasuo Tabei

Published in: LIPIcs, Volume 198, 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021)

Abstract

Indexing highly repetitive strings (i.e., strings with many repetitions) for fast queries has become a central research topic in string processing, because it has a wide variety of applications in bioinformatics and natural language processing. Although a substantial number of indexes for highly repetitive strings have been proposed thus far, developing compressed indexes that support various queries remains a challenge. The run-length Burrows-Wheeler transform (RLBWT) is a lossless data compression by a reversible permutation of an input string and run-length encoding, and it has received interest for indexing highly repetitive strings. LF and ϕ^{-1} are two key functions for building indexes on RLBWT, and the best previous result computes LF and ϕ^{-1} in O(log log n) time with O(r) words of space for the string length n and the number r of runs in RLBWT. In this paper, we improve LF and ϕ^{-1} so that they can be computed in a constant time with O(r) words of space. Subsequently, we present OptBWTR (optimal-time queries on BWT-runs compressed indexes), the first string index that supports various queries including locate, count, extract queries in optimal time and O(r) words of space.
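
The role of the LF function named in the abstract can be illustrated with a naive Python sketch. It is not the constant-time, O(r)-word structure of the paper: it stores one integer per BWT position. It computes LF by stably sorting BWT positions by character and then uses LF to walk backwards through the text. The function names are hypothetical, and the input is assumed to be the BWT of a text ending in a unique smallest sentinel `$`.

```python
def lf_mapping(bwt_str):
    """LF[i] = row of the first column holding the same character as bwt_str[i].

    A stable sort of positions by character keeps equal characters in
    their BWT order, which is exactly the LF correspondence.
    """
    order = sorted(range(len(bwt_str)), key=lambda i: bwt_str[i])
    lf = [0] * len(bwt_str)
    for first_col_row, last_col_row in enumerate(order):
        lf[last_col_row] = first_col_row
    return lf


def invert_bwt(bwt_str):
    """Recover the original text by following LF from row 0.

    Row 0 is the rotation starting with '$', so bwt_str[0] is the last
    character before the sentinel; each LF step moves one position left.
    """
    lf = lf_mapping(bwt_str)
    row = 0
    reversed_chars = []
    for _ in range(len(bwt_str) - 1):
        reversed_chars.append(bwt_str[row])
        row = lf[row]
    return "".join(reversed(reversed_chars)) + "$"
```

For the BWT `annb$aa`, `lf_mapping` returns [1, 5, 6, 4, 0, 2, 3] and `invert_bwt` recovers `banana$`. Each step of this walk is one LF evaluation, which is why the cost of LF drives the query times discussed above.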

Cite as

Takaaki Nishimoto and Yasuo Tabei. Optimal-Time Queries on BWT-Runs Compressed Indexes. In 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 198, pp. 101:1-101:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

@InProceedings{nishimoto_et_al:LIPIcs.ICALP.2021.101,
  author =	{Nishimoto, Takaaki and Tabei, Yasuo},
  title =	{{Optimal-Time Queries on BWT-Runs Compressed Indexes}},
  booktitle =	{48th International Colloquium on Automata, Languages, and Programming (ICALP 2021)},
  pages =	{101:1--101:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-195-5},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{198},
  editor =	{Bansal, Nikhil and Merelli, Emanuela and Worrell, James},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2021.101},
  URN =		{urn:nbn:de:0030-drops-141702},
  doi =		{10.4230/LIPIcs.ICALP.2021.101},
  annote =	{Keywords: Compressed text indexes, Burrows-Wheeler transform, highly repetitive text collections}
}

Document

DOI: 10.4230/LIPIcs.CPM.2021.21

R-enum: Enumeration of Characteristic Substrings in BWT-runs Bounded Space

Authors: Takaaki Nishimoto and Yasuo Tabei

Published in: LIPIcs, Volume 191, 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)

Abstract

Enumerating characteristic substrings (e.g., maximal repeats, minimal unique substrings, and minimal absent words) in a given string has been an important research topic because there are a wide variety of applications in various areas such as string processing and computational biology. Although several enumeration algorithms for characteristic substrings have been proposed, they are not space-efficient in that their space-usage is proportional to the length of an input string. Recently, the run-length encoded Burrows-Wheeler transform (RLBWT) has attracted increased attention in string processing, and various algorithms for the RLBWT have been developed. Developing enumeration algorithms for characteristic substrings with the RLBWT, however, remains a challenge. In this paper, we present r-enum (RLBWT-based enumeration), the first enumeration algorithm for characteristic substrings based on RLBWT. R-enum runs in O(n log log (n/r)) time and with O(r log n) bits of working space for string length n and number r of runs in RLBWT. Here, r is expected to be significantly smaller than n for highly repetitive strings (i.e., strings with many repetitions). Experiments using a benchmark dataset of highly repetitive strings show that the results of r-enum are more space-efficient than the previous results. In addition, we demonstrate the applicability of r-enum to a huge string by performing experiments on a 300-gigabyte string of 100 human genomes.

Cite as

Takaaki Nishimoto and Yasuo Tabei. R-enum: Enumeration of Characteristic Substrings in BWT-runs Bounded Space. In 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 191, pp. 21:1-21:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

@InProceedings{nishimoto_et_al:LIPIcs.CPM.2021.21,
  author =	{Nishimoto, Takaaki and Tabei, Yasuo},
  title =	{{R-enum: Enumeration of Characteristic Substrings in BWT-runs Bounded Space}},
  booktitle =	{32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)},
  pages =	{21:1--21:21},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-186-3},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{191},
  editor =	{Gawrychowski, Pawe{\l} and Starikovskaya, Tatiana},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2021.21},
  URN =		{urn:nbn:de:0030-drops-139723},
  doi =		{10.4230/LIPIcs.CPM.2021.21},
  annote =	{Keywords: Enumeration algorithm, Burrows-Wheeler transform, Maximal repeats, Minimal unique substrings, Minimal absent words}
}

Document

DOI: 10.4230/LIPIcs.CPM.2019.9

Conversion from RLBWT to LZ77

Authors: Takaaki Nishimoto and Yasuo Tabei

Published in: LIPIcs, Volume 128, 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)

Abstract

Converting a compressed format of a string into another compressed format without an explicit decompression is one of the central research topics in string processing. We discuss the problem of converting the run-length Burrows-Wheeler Transform (RLBWT) of a string into Lempel-Ziv 77 (LZ77) phrases of the reversed string. The first results, obtained with Policriti and Prezza’s conversion algorithm [Algorithmica 2018], were O(n log r) time and O(r) working space for string length n, number of runs r in the RLBWT, and number of LZ77 phrases z. More recent results with Kempa’s conversion algorithm [SODA 2019] are O(n / log n + r log^{9} n + z log^{9} n) time and O(n / log_{σ} n + r log^{8} n) working space for alphabet size σ of the RLBWT. In this paper, we present a new conversion algorithm that improves Policriti and Prezza’s conversion algorithm, which uses general-purpose dynamic data structures. We argue that these dynamic data structures can be replaced and present new data structures for faster conversion. The time and working space of our conversion algorithm with the new data structures are O(n min{log log n, sqrt{(log r)/(log log r)}}) and O(r), respectively.

Cite as

Takaaki Nishimoto and Yasuo Tabei. Conversion from RLBWT to LZ77. In 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 128, pp. 9:1-9:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

@InProceedings{nishimoto_et_al:LIPIcs.CPM.2019.9,
  author =	{Nishimoto, Takaaki and Tabei, Yasuo},
  title =	{{Conversion from RLBWT to LZ77}},
  booktitle =	{30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)},
  pages =	{9:1--9:12},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-103-0},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{128},
  editor =	{Pisanti, Nadia and Pissis, Solon P.},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2019.9},
  URN =		{urn:nbn:de:0030-drops-104807},
  doi =		{10.4230/LIPIcs.CPM.2019.9},
  annote =	{Keywords: Burrows-Wheeler Transform, Lempel-Ziv Parsing, Lossless Data Compression}
}
