DROPS

Document

A Survey of the Bijective Burrows-Wheeler Transform

Authors: Hideo Bannai, Dominik Köppl, and Zsuzsanna Lipták

Published in: OASIcs, Volume 131, The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday (2025)

Abstract

The Bijective BWT (BBWT), conceived by Scott in 2007, later summarized in a preprint by Gil and Scott in 2009 (arXiv 2012), is a variant of the Burrows-Wheeler Transform which is bijective: every string is the BBWT of some string. Indeed, the BBWT of a string is the extended BWT [Mantaci et al., 2007] of the factors of its Lyndon factorization. The BBWT has been receiving increasing interest in recent years. In this paper, we survey existing research on the BBWT, starting with its history and motivation. We then present algorithmic topics including construction algorithms with various complexities and an index on top of the BBWT for pattern matching. We subsequently address some properties of the BBWT as a compressor, discussing robustness to operations such as reversal, edits, rotation, as well as compression power. We close with listing other bijective variants of the BWT and open problems concerning the BBWT.

Cite as

Hideo Bannai, Dominik Köppl, and Zsuzsanna Lipták. A Survey of the Bijective Burrows-Wheeler Transform. In The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday. Open Access Series in Informatics (OASIcs), Volume 131, pp. 2:1-2:26, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{bannai_et_al:OASIcs.Manzini.2,
  author =	{Bannai, Hideo and K\"{o}ppl, Dominik and Lipt\'{a}k, Zsuzsanna},
  title =	{{A Survey of the Bijective Burrows-Wheeler Transform}},
  booktitle =	{The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday},
  pages =	{2:1--2:26},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-390-4},
  ISSN =	{2190-6807},
  year =	{2025},
  volume =	{131},
  editor =	{Ferragina, Paolo and Gagie, Travis and Navarro, Gonzalo},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.Manzini.2},
  URN =		{urn:nbn:de:0030-drops-239100},
  doi =		{10.4230/OASIcs.Manzini.2},
  annote =	{Keywords: Burrows-Wheeler Transform, compression, text indexing, repetitiveness measure, Lyndon words, index construction algorithms, bijective string transformation}
}

Document

DOI: 10.4230/LIPIcs.SEA.2025.14

Efficient Terabyte-Scale Text Compression via Stable Local Consistency and Parallel Grammar Processing

Authors: Diego Díaz-Domínguez

Published in: LIPIcs, Volume 338, 23rd International Symposium on Experimental Algorithms (SEA 2025)

Abstract

We present compression algorithms designed to process terabyte-sized datasets in parallel. Our approach builds on locally consistent grammars, a lightweight form of compression, combined with simple post-processing techniques to achieve further space reductions. Locally consistent grammar algorithms are suitable for scaling as they need minimal satellite information to compact the text, but they are not inherently parallel. To enable parallelisation, we introduce a novel concept that we call stable local consistency. A grammar algorithm ALG is stable if for any pattern P occurring in a collection 𝒯 = {T_1, T_2, …, T_k}, instances ALG(T_1), ALG(T_2), …, ALG(T_k) independently produce cores for P with the same topology. In a locally consistent grammar, the core of P is a subset of nodes and edges in the parse tree of 𝒯 that remains the same in all the occurrences of P. This feature enables compression, but it only holds if ALG defines a common set of nonterminal symbols for the strings. Stability removes this restriction, allowing us to run ALG(T_1), ALG(T_2), …, ALG(T_k) in parallel and subsequently merge their grammars into a single output equivalent to that of ALG(𝒯). We implemented our ideas and tested them on massive datasets. Our experiments showed that our method could process 7.9 TB of bacterial genomes in around nine hours, using 16 threads and 0.43 bits/symbol of working memory, achieving a compression ratio of 85x.

Cite as

Diego Díaz-Domínguez. Efficient Terabyte-Scale Text Compression via Stable Local Consistency and Parallel Grammar Processing. In 23rd International Symposium on Experimental Algorithms (SEA 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 338, pp. 14:1-14:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{diazdominguez:LIPIcs.SEA.2025.14,
  author =	{D{\'\i}az-Dom{\'\i}nguez, Diego},
  title =	{{Efficient Terabyte-Scale Text Compression via Stable Local Consistency and Parallel Grammar Processing}},
  booktitle =	{23rd International Symposium on Experimental Algorithms (SEA 2025)},
  pages =	{14:1--14:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-375-1},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{338},
  editor =	{Mutzel, Petra and Prezza, Nicola},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2025.14},
  URN =		{urn:nbn:de:0030-drops-232525},
  doi =		{10.4230/LIPIcs.SEA.2025.14},
  annote =	{Keywords: Grammar compression, locally consistent parsing, hashing}
}

Document

DOI: 10.4230/LIPIcs.SEA.2025.12

Bit Packed Encodings for Grammar-Compressed Strings Supporting Fast Random Access

Authors: Alan M. Cleary, Joseph Winjum, Jordan Dood, Hiroki Shibata, and Shunsuke Inenaga

Published in: LIPIcs, Volume 338, 23rd International Symposium on Experimental Algorithms (SEA 2025)

Abstract

Grammar-based compression is a powerful compression technique that allows for computation over the compressed data. While there has been extensive theoretical work on grammar and encoding size, there has been little work on practical grammar encodings. In this work, we consider the canonical array-of-arrays grammar representation and present a general bit packing approach for reducing its space requirements in practice. We then present three bit packing strategies based on this approach - one online and two offline - with different space-time trade-offs. This technique can be used to encode any grammar-compressed string while preserving the virtues of the array-of-arrays representation. We show that our encodings are Nlog₂ N away from the information-theoretic bound, where N is the number of symbols in the grammar, and that they are much smaller than methods that meet the information-theoretic bound in practice. Moreover, our experiments show that by using bit packed encodings we can achieve state-of-the-art performance both in grammar encoding size and run-time performance of random-access queries.

Cite as

Alan M. Cleary, Joseph Winjum, Jordan Dood, Hiroki Shibata, and Shunsuke Inenaga. Bit Packed Encodings for Grammar-Compressed Strings Supporting Fast Random Access. In 23rd International Symposium on Experimental Algorithms (SEA 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 338, pp. 12:1-12:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{cleary_et_al:LIPIcs.SEA.2025.12,
  author =	{Cleary, Alan M. and Winjum, Joseph and Dood, Jordan and Shibata, Hiroki and Inenaga, Shunsuke},
  title =	{{Bit Packed Encodings for Grammar-Compressed Strings Supporting Fast Random Access}},
  booktitle =	{23rd International Symposium on Experimental Algorithms (SEA 2025)},
  pages =	{12:1--12:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-375-1},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{338},
  editor =	{Mutzel, Petra and Prezza, Nicola},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2025.12},
  URN =		{urn:nbn:de:0030-drops-232506},
  doi =		{10.4230/LIPIcs.SEA.2025.12},
  annote =	{Keywords: String algorithms, data compression, random access, grammar-based compression}
}

Document

DOI: 10.4230/LIPIcs.CPM.2025.16

Net Occurrences in Fibonacci and Thue-Morse Words

Authors: Peaker Guo and Kaisei Kishi

Published in: LIPIcs, Volume 331, 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)

Abstract

A net occurrence of a repeated string in a text is an occurrence with unique left and right extensions, and the net frequency of the string is the number of its net occurrences in the text. Originally introduced for applications in Natural Language Processing, net frequency has recently gained attention for its algorithmic aspects. Guo et al. [CPM 2024] and Ohlebusch et al. [SPIRE 2024] focus on its computation in the offline setting, while Guo et al. [SPIRE 2024], Inenaga [arXiv 2024], and Mieno and Inenaga [CPM 2025] tackle the online counterpart. Mieno and Inenaga also characterize net occurrences in terms of the minimal unique substrings of the text. Additionally, Guo et al. [CPM 2024] initiate the study of net occurrences in Fibonacci words to establish a lower bound on the asymptotic running time of algorithms. Although there has been notable progress in algorithmic developments and some initial combinatorial insights, the combinatorial aspects of net occurrences have yet to be thoroughly examined. In this work, we make two key contributions. First, we confirm the conjecture that each Fibonacci word contains exactly three net occurrences. Second, we show that each Thue-Morse word contains exactly nine net occurrences. To achieve these results, we introduce the notion of overlapping net occurrence cover, which narrows down the candidate net occurrences in any text. Furthermore, we provide a precise characterization of occurrences of Fibonacci and Thue-Morse words of smaller order, offering structural insights that may have independent interest and potential applications in algorithm analysis and combinatorial properties of these words.

Cite as

Peaker Guo and Kaisei Kishi. Net Occurrences in Fibonacci and Thue-Morse Words. In 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 331, pp. 16:1-16:22, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{guo_et_al:LIPIcs.CPM.2025.16,
  author =	{Guo, Peaker and Kishi, Kaisei},
  title =	{{Net Occurrences in Fibonacci and Thue-Morse Words}},
  booktitle =	{36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)},
  pages =	{16:1--16:22},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-369-0},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{331},
  editor =	{Bonizzoni, Paola and M\"{a}kinen, Veli},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2025.16},
  URN =		{urn:nbn:de:0030-drops-231107},
  doi =		{10.4230/LIPIcs.CPM.2025.16},
  annote =	{Keywords: Fibonacci words, Thue-Morse words, net occurrence, net frequency, factorization}
}

Document

DOI: 10.4230/LIPIcs.CPM.2025.17

On the Compressiveness of the Burrows-Wheeler Transform

Authors: Hideo Bannai, Tomohiro I, and Yuto Nakashima

Published in: LIPIcs, Volume 331, 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)

Abstract

The Burrows-Wheeler transform (BWT) is a reversible transform that converts a string w into another string BWT(w). The size of the run-length encoded BWT (RLBWT) can be interpreted as a measure of repetitiveness in the class of representations called dictionary compression which are essentially representations based on copy and paste operations. In this paper, we shed new light on the compressiveness of BWT and the bijective BWT (BBWT). We first extend previous results on the relations of their run-length compressed sizes r and r_B. We also show that the so-called "clustering effect" of BWT and BBWT can be captured by measures other than empirical entropy or run-length encoding. In particular, we show that BWT and BBWT do not increase the repetitiveness of the string with respect to various measures based on dictionary compression by more than a polylogarithmic factor. Furthermore, we show that there exists an infinite family of strings that are maximally incompressible by any dictionary compression measure, but become very compressible after applying BBWT. An interesting implication of this result is that it is possible to transcend dictionary compression in some cases by simply applying BBWT before applying dictionary compression.

Cite as

Hideo Bannai, Tomohiro I, and Yuto Nakashima. On the Compressiveness of the Burrows-Wheeler Transform. In 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 331, pp. 17:1-17:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{bannai_et_al:LIPIcs.CPM.2025.17,
  author =	{Bannai, Hideo and I, Tomohiro and Nakashima, Yuto},
  title =	{{On the Compressiveness of the Burrows-Wheeler Transform}},
  booktitle =	{36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)},
  pages =	{17:1--17:15},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-369-0},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{331},
  editor =	{Bonizzoni, Paola and M\"{a}kinen, Veli},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2025.17},
  URN =		{urn:nbn:de:0030-drops-231116},
  doi =		{10.4230/LIPIcs.CPM.2025.17},
  annote =	{Keywords: Data Compression, Bijective Burrows-Wheeler Transform, Fibonacci words}
}

Document

DOI: 10.4230/LIPIcs.CPM.2017.31

Synergistic Solutions on MultiSets

Authors: Jérémy Barbay, Carlos Ochoa, and Srinivasa Rao Satti

Published in: LIPIcs, Volume 78, 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017)

Abstract

Karp et al. (1988) described Deferred Data Structures for Multisets as "lazy" data structures which partially sort data to support online rank and select queries, with the minimum amount of work in the worst case over instances of size n and number of queries q fixed. Barbay et al. (2016) refined this approach to take advantage of the gaps between the positions hit by the queries (i.e., the structure in the queries). We develop new techniques in order to further refine this approach and take advantage all at once of the structure (i.e., the multiplicities of the elements), some notions of local order (i.e., the number and sizes of runs) and global order (i.e., the number and positions of existing pivots) in the input; and of the structure and order in the sequence of queries. Our main result is a synergistic deferred data structure which outperforms all solutions in the comparison model that take advantage of only a subset of these features. As intermediate results, we describe two new synergistic sorting algorithms, which take advantage of some notions of structure and order (local and global) in the input, improving upon previous results which take advantage only of the structure (Munro and Spira 1979) or of the local order (Takaoka 1997) in the input; and one new multiselection algorithm which takes advantage of not only the order and structure in the input, but also of the structure in the queries.

Cite as

Jérémy Barbay, Carlos Ochoa, and Srinivasa Rao Satti. Synergistic Solutions on MultiSets. In 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 78, pp. 31:1-31:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

Copy BibTex To Clipboard

@InProceedings{barbay_et_al:LIPIcs.CPM.2017.31,
  author =	{Barbay, J\'{e}r\'{e}my and Ochoa, Carlos and Satti, Srinivasa Rao},
  title =	{{Synergistic Solutions on MultiSets}},
  booktitle =	{28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017)},
  pages =	{31:1--31:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-039-2},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{78},
  editor =	{K\"{a}rkk\"{a}inen, Juha and Radoszewski, Jakub and Rytter, Wojciech},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2017.31},
  URN =		{urn:nbn:de:0030-drops-73441},
  doi =		{10.4230/LIPIcs.CPM.2017.31},
  annote =	{Keywords: deferred data structure, multivariate analysis, quick sort, select}
}

6 Search Results for "Ochoa, Carlos"

A Survey of the Bijective Burrows-Wheeler Transform

Abstract

Cite as

Efficient Terabyte-Scale Text Compression via Stable Local Consistency and Parallel Grammar Processing

Abstract

Cite as

Bit Packed Encodings for Grammar-Compressed Strings Supporting Fast Random Access

Abstract

Cite as

Net Occurrences in Fibonacci and Thue-Morse Words

Abstract

Cite as

On the Compressiveness of the Burrows-Wheeler Transform

Abstract

Cite as

Synergistic Solutions on MultiSets

Abstract

Cite as

Thanks for your feedback!

Could not send message