DROPS

Document

DOI: 10.4230/LIPIcs.ESA.2025.60

Fast and Memory-Efficient BWT Construction of Repetitive Texts Using Lyndon Grammars

Authors: Jannik Olbrich

Published in: LIPIcs, Volume 351, 33rd Annual European Symposium on Algorithms (ESA 2025)

Abstract

The Burrows-Wheeler Transform (BWT) serves as the basis for many important sequence indexes. On very large datasets (e.g. genomic databases), classical BWT construction algorithms are often infeasible because they usually need to have the entire dataset in main memory. Fortunately, such large datasets are often highly repetitive. It can thus be beneficial to compute the BWT from a compressed representation. We propose an algorithm for computing the BWT via the Lyndon straight-line program, a grammar based on the standard factorization of Lyndon words. Our algorithm can also be used to compute the extended BWT (eBWT) of a multiset of sequences. We empirically evaluate our implementation and find that we can compute the BWT and eBWT of very large datasets faster and/or with less memory than competing methods.

Cite as

Jannik Olbrich. Fast and Memory-Efficient BWT Construction of Repetitive Texts Using Lyndon Grammars. In 33rd Annual European Symposium on Algorithms (ESA 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 351, pp. 60:1-60:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{olbrich:LIPIcs.ESA.2025.60,
  author =	{Olbrich, Jannik},
  title =	{{Fast and Memory-Efficient BWT Construction of Repetitive Texts Using Lyndon Grammars}},
  booktitle =	{33rd Annual European Symposium on Algorithms (ESA 2025)},
  pages =	{60:1--60:19},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-395-9},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{351},
  editor =	{Benoit, Anne and Kaplan, Haim and Wild, Sebastian and Herman, Grzegorz},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ESA.2025.60},
  URN =		{urn:nbn:de:0030-drops-245286},
  doi =		{10.4230/LIPIcs.ESA.2025.60},
  annote =	{Keywords: Burrows-Wheeler Transform, Grammar compression}
}

Document

DOI: 10.4230/OASIcs.Manzini.3

BWT for String Collections

Authors: Davide Cenzato, Zsuzsanna Lipták, Nadia Pisanti, Giovanna Rosone, and Marinella Sciortino

Published in: OASIcs, Volume 131, The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday (2025)

Abstract

We survey the different methods used for extending the BWT to collections of strings, following largely [Cenzato and Lipták, CPM 2022, Bioinformatics 2024]. We analyze the specific aspects and combinatorial properties of the resulting BWT variants and give a categorization of publicly available tools for computing the BWT of string collections. We show how the specific method used impacts on the resulting transform, including the number of runs, and on the dynamicity of the transform with respect to adding or removing strings from the collection. We then focus on the number of runs of these BWT variants and present the optimal BWT introduced in [Cenzato et al., DCC 2023], which implements an algorithm originally proposed by [Bentley et al., ESA 2020] to minimize the number of BWT-runs. We also discuss several recent heuristics and study their impact on the compression of biological sequences. We conclude with an overview of the applications and the impact of the BWT of string collections in bioinformatics.

Cite as

Davide Cenzato, Zsuzsanna Lipták, Nadia Pisanti, Giovanna Rosone, and Marinella Sciortino. BWT for String Collections. In The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday. Open Access Series in Informatics (OASIcs), Volume 131, pp. 3:1-3:29, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{cenzato_et_al:OASIcs.Manzini.3,
  author =	{Cenzato, Davide and Lipt\'{a}k, Zsuzsanna and Pisanti, Nadia and Rosone, Giovanna and Sciortino, Marinella},
  title =	{{BWT for String Collections}},
  booktitle =	{The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday},
  pages =	{3:1--3:29},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-390-4},
  ISSN =	{2190-6807},
  year =	{2025},
  volume =	{131},
  editor =	{Ferragina, Paolo and Gagie, Travis and Navarro, Gonzalo},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.Manzini.3},
  URN =		{urn:nbn:de:0030-drops-239113},
  doi =		{10.4230/OASIcs.Manzini.3},
  annote =	{Keywords: Burrows-Wheeler transform, Extended Burrows-Wheeler transform, compressed text indexes, text compression, string collections, bioinformatics}
}

Document

DOI: 10.4230/OASIcs.Manzini.6

Optimizing the Performance of the FM-Index for Large-Scale Data

Authors: Eddie Ferro and Christina Boucher

Published in: OASIcs, Volume 131, The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday (2025)

Abstract

The FM-index is a fundamental data structure used in bioinformatics to efficiently search for strings and index genomes. However, the FM-index can pose computational challenges, particularly in the context of large-scale genomic datasets, due to the complexity of its underlying components and data encodings. In this paper, we present a comprehensive review of efficient variants of the FM-index and the encoding strategies used to improve performance. We examine hardware-accelerated techniques, such as memory-efficient data layouts and cache-aware structures, as well as software-level innovations, including algorithmic refinements and compact representations. The reviewed work demonstrates substantial gains in both speed and scalability, making methods that use the FM-index more practical for high-throughput genomic applications. By analyzing the trade-offs and design choices of these variants, we highlight how combining hardware-aware and software-centric strategies enables more efficient FM-index construction and usage across a range of bioinformatics tasks.

Cite as

Eddie Ferro and Christina Boucher. Optimizing the Performance of the FM-Index for Large-Scale Data. In The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday. Open Access Series in Informatics (OASIcs), Volume 131, pp. 6:1-6:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{ferro_et_al:OASIcs.Manzini.6,
  author =	{Ferro, Eddie and Boucher, Christina},
  title =	{{Optimizing the Performance of the FM-Index for Large-Scale Data}},
  booktitle =	{The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday},
  pages =	{6:1--6:21},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-390-4},
  ISSN =	{2190-6807},
  year =	{2025},
  volume =	{131},
  editor =	{Ferragina, Paolo and Gagie, Travis and Navarro, Gonzalo},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.Manzini.6},
  URN =		{urn:nbn:de:0030-drops-239140},
  doi =		{10.4230/OASIcs.Manzini.6},
  annote =	{Keywords: FM-Index Acceleration, Run-Length Encoding, Suffix Array Optimization, Burrows-Wheeler Transform, Efficient Backward Search}
}

Document

DOI: 10.4230/OASIcs.Manzini.7

Algorithms for Computing Very Large BWTs: a Short Survey

Authors: Diego Díaz-Domínguez, Lavinia Egidi, Veronica Guerrini, Felipe A. Louza, and Giovanna Rosone

Published in: OASIcs, Volume 131, The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday (2025)

Abstract

The Burrows-Wheeler Transform (BWT) is a fundamental string transformation that, although initially introduced for data compression, has been extensively utilized across various domains, including text indexing and pattern matching within large datasets. Although the BWT construction is linear, the constants make the task impractical for large datasets, and as highlighted by Ferragina et al. [Paolo Ferragina et al., 2012], "to use it, one must first build it!". Thus, the construction of the BWT remains a significant challenge. For these reasons, during the past three decades there has been a succession of new algorithms for its construction using techniques that work in external memory or that use text compression. In this survey, we revise some of the most important advancements and tools presented in the past years for computing large BWTs exploiting external memory or text compression approaches without using additional information about the data.

Cite as

Diego Díaz-Domínguez, Lavinia Egidi, Veronica Guerrini, Felipe A. Louza, and Giovanna Rosone. Algorithms for Computing Very Large BWTs: a Short Survey. In The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday. Open Access Series in Informatics (OASIcs), Volume 131, pp. 7:1-7:28, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{diazdominguez_et_al:OASIcs.Manzini.7,
  author =	{D{\'\i}az-Dom{\'\i}nguez, Diego and Egidi, Lavinia and Guerrini, Veronica and Louza, Felipe A. and Rosone, Giovanna},
  title =	{{Algorithms for Computing Very Large BWTs: a Short Survey}},
  booktitle =	{The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday},
  pages =	{7:1--7:28},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-390-4},
  ISSN =	{2190-6807},
  year =	{2025},
  volume =	{131},
  editor =	{Ferragina, Paolo and Gagie, Travis and Navarro, Gonzalo},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.Manzini.7},
  URN =		{urn:nbn:de:0030-drops-239151},
  doi =		{10.4230/OASIcs.Manzini.7},
  annote =	{Keywords: Burrows-Wheeler transform, Extended Burrows-Wheeler transform, external memory, text compression, longest common prefix}
}

@InProceedings{diazdominguez_et_al:OASIcs.Manzini.7,
  author =	{D{\'\i}az-Dom{\'\i}nguez, Diego and Egidi, Lavinia and Guerrini, Veronica and Louza, Felipe A. and Rosone, Giovanna},
  title =	{{Algorithms for Computing Very Large BWTs: a Short Survey}},
  booktitle =	{The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday},
  pages =	{7:1--7:28},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-390-4},
  ISSN =	{2190-6807},
  year =	{2025},
  volume =	{131},
  editor =	{Ferragina, Paolo and Gagie, Travis and Navarro, Gonzalo},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.Manzini.7},
  URN =		{urn:nbn:de:0030-drops-239151},
  doi =		{10.4230/OASIcs.Manzini.7},
  annote =	{Keywords: Burrows-Wheeler transform, Extended Burrows-Wheeler transform, external memory, text compression, longest common prefix}
}

Document

DOI: 10.4230/OASIcs.Manzini.12

Wheeler Graphs and Wheeler Languages

Authors: Nicola Cotumaccio, Giovanna D'Agostino, Daniel Gibney, Alberto Policriti, Nicola Prezza, and Sharma V. Thankachan

Published in: OASIcs, Volume 131, The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday (2025)

Abstract

Suffix sorting stands at the core of the most efficient solutions for indexed pattern matching: the suffix tree, the suffix array, compressed indexes based on the Burrows-Wheeler transform, and so on. In [Gagie, Manzini, Sirén, TCS 2017] this concept was extended to labeled graphs, obtaining the rich class of Wheeler graphs. This work opened a very fruitful line of research, ultimately generating results able to bridge the fields of compressed data structures, graph theory, and regular language theory. In a Wheeler graph, nodes are sorted according to the alphabetic order of their incoming labels, propagating this order through pairs of equally-labeled edges. This apparently-simple definition makes it possible to solve on Wheeler graphs problems (including, but not limited to: compression, subpath queries, NFA equivalence, determinization, minimization) that on general labeled graphs are extremely hard to solve, and induces a rich structure in the class of regular languages (Wheeler languages) recognized by automata whose state transition is a Wheeler graph. The goal of this survey is to provide a summary of (and intuitions behind) the results on Wheeler graphs that appeared in the literature since their introduction, in addition to a discussion of interesting problems that are still open in the field.

Cite as

Nicola Cotumaccio, Giovanna D'Agostino, Daniel Gibney, Alberto Policriti, Nicola Prezza, and Sharma V. Thankachan. Wheeler Graphs and Wheeler Languages. In The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday. Open Access Series in Informatics (OASIcs), Volume 131, pp. 12:1-12:28, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{cotumaccio_et_al:OASIcs.Manzini.12,
  author =	{Cotumaccio, Nicola and D'Agostino, Giovanna and Gibney, Daniel and Policriti, Alberto and Prezza, Nicola and Thankachan, Sharma V.},
  title =	{{Wheeler Graphs and Wheeler Languages}},
  booktitle =	{The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday},
  pages =	{12:1--12:28},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-390-4},
  ISSN =	{2190-6807},
  year =	{2025},
  volume =	{131},
  editor =	{Ferragina, Paolo and Gagie, Travis and Navarro, Gonzalo},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.Manzini.12},
  URN =		{urn:nbn:de:0030-drops-239205},
  doi =		{10.4230/OASIcs.Manzini.12},
  annote =	{Keywords: Wheeler languages, Wheeler graphs, pattern matching, indexing, compressed data structures}
}

Document

DOI: 10.4230/OASIcs.Manzini.8

A Taxonomy of LCP-Array Construction Algorithms

Authors: Johannes Fischer and Enno Ohlebusch

Published in: OASIcs, Volume 131, The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday (2025)

Abstract

The combination of the suffix array and the LCP-array can be used to solve many string processing problems efficiently. We review some of the most important sequential LCP-array construction algorithms in random access memory.

Cite as

Johannes Fischer and Enno Ohlebusch. A Taxonomy of LCP-Array Construction Algorithms. In The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday. Open Access Series in Informatics (OASIcs), Volume 131, pp. 8:1-8:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{fischer_et_al:OASIcs.Manzini.8,
  author =	{Fischer, Johannes and Ohlebusch, Enno},
  title =	{{A Taxonomy of LCP-Array Construction Algorithms}},
  booktitle =	{The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini's 60th Birthday},
  pages =	{8:1--8:17},
  series =	{Open Access Series in Informatics (OASIcs)},
  ISBN =	{978-3-95977-390-4},
  ISSN =	{2190-6807},
  year =	{2025},
  volume =	{131},
  editor =	{Ferragina, Paolo and Gagie, Travis and Navarro, Gonzalo},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/OASIcs.Manzini.8},
  URN =		{urn:nbn:de:0030-drops-239166},
  doi =		{10.4230/OASIcs.Manzini.8},
  annote =	{Keywords: longest common prefix array, suffix array, Burrows-Wheeler transform}
}

Document

DOI: 10.4230/LIPIcs.SEA.2025.14

Efficient Terabyte-Scale Text Compression via Stable Local Consistency and Parallel Grammar Processing

Authors: Diego Díaz-Domínguez

Published in: LIPIcs, Volume 338, 23rd International Symposium on Experimental Algorithms (SEA 2025)

Abstract

We present compression algorithms designed to process terabyte-sized datasets in parallel. Our approach builds on locally consistent grammars, a lightweight form of compression, combined with simple post-processing techniques to achieve further space reductions. Locally consistent grammar algorithms are suitable for scaling as they need minimal satellite information to compact the text, but they are not inherently parallel. To enable parallelisation, we introduce a novel concept that we call stable local consistency. A grammar algorithm ALG is stable if for any pattern P occurring in a collection 𝒯 = {T_1, T_2, …, T_k}, instances ALG(T_1), ALG(T_2), …, ALG(T_k) independently produce cores for P with the same topology. In a locally consistent grammar, the core of P is a subset of nodes and edges in the parse tree of 𝒯 that remains the same in all the occurrences of P. This feature enables compression, but it only holds if ALG defines a common set of nonterminal symbols for the strings. Stability removes this restriction, allowing us to run ALG(T_1), ALG(T_2), …, ALG(T_k) in parallel and subsequently merge their grammars into a single output equivalent to that of ALG(𝒯). We implemented our ideas and tested them on massive datasets. Our experiments showed that our method could process 7.9 TB of bacterial genomes in around nine hours, using 16 threads and 0.43 bits/symbol of working memory, achieving a compression ratio of 85x.

Cite as

Diego Díaz-Domínguez. Efficient Terabyte-Scale Text Compression via Stable Local Consistency and Parallel Grammar Processing. In 23rd International Symposium on Experimental Algorithms (SEA 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 338, pp. 14:1-14:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{diazdominguez:LIPIcs.SEA.2025.14,
  author =	{D{\'\i}az-Dom{\'\i}nguez, Diego},
  title =	{{Efficient Terabyte-Scale Text Compression via Stable Local Consistency and Parallel Grammar Processing}},
  booktitle =	{23rd International Symposium on Experimental Algorithms (SEA 2025)},
  pages =	{14:1--14:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-375-1},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{338},
  editor =	{Mutzel, Petra and Prezza, Nicola},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2025.14},
  URN =		{urn:nbn:de:0030-drops-232525},
  doi =		{10.4230/LIPIcs.SEA.2025.14},
  annote =	{Keywords: Grammar compression, locally consistent parsing, hashing}
}

Document

DOI: 10.4230/LIPIcs.SEA.2025.12

Bit Packed Encodings for Grammar-Compressed Strings Supporting Fast Random Access

Authors: Alan M. Cleary, Joseph Winjum, Jordan Dood, Hiroki Shibata, and Shunsuke Inenaga

Published in: LIPIcs, Volume 338, 23rd International Symposium on Experimental Algorithms (SEA 2025)

Abstract

Grammar-based compression is a powerful compression technique that allows for computation over the compressed data. While there has been extensive theoretical work on grammar and encoding size, there has been little work on practical grammar encodings. In this work, we consider the canonical array-of-arrays grammar representation and present a general bit packing approach for reducing its space requirements in practice. We then present three bit packing strategies based on this approach - one online and two offline - with different space-time trade-offs. This technique can be used to encode any grammar-compressed string while preserving the virtues of the array-of-arrays representation. We show that our encodings are Nlog₂ N away from the information-theoretic bound, where N is the number of symbols in the grammar, and that they are much smaller than methods that meet the information-theoretic bound in practice. Moreover, our experiments show that by using bit packed encodings we can achieve state-of-the-art performance both in grammar encoding size and run-time performance of random-access queries.

Cite as

Alan M. Cleary, Joseph Winjum, Jordan Dood, Hiroki Shibata, and Shunsuke Inenaga. Bit Packed Encodings for Grammar-Compressed Strings Supporting Fast Random Access. In 23rd International Symposium on Experimental Algorithms (SEA 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 338, pp. 12:1-12:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{cleary_et_al:LIPIcs.SEA.2025.12,
  author =	{Cleary, Alan M. and Winjum, Joseph and Dood, Jordan and Shibata, Hiroki and Inenaga, Shunsuke},
  title =	{{Bit Packed Encodings for Grammar-Compressed Strings Supporting Fast Random Access}},
  booktitle =	{23rd International Symposium on Experimental Algorithms (SEA 2025)},
  pages =	{12:1--12:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-375-1},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{338},
  editor =	{Mutzel, Petra and Prezza, Nicola},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2025.12},
  URN =		{urn:nbn:de:0030-drops-232506},
  doi =		{10.4230/LIPIcs.SEA.2025.12},
  annote =	{Keywords: String algorithms, data compression, random access, grammar-based compression}
}

Document

DOI: 10.4230/LIPIcs.SEA.2025.2

IBB: Fast Burrows-Wheeler Transform Construction for Length-Diverse DNA Data

Authors: Enno Adler, Stefan Böttcher, Rita Hartel, and Cederic Alexander Steininger

Published in: LIPIcs, Volume 338, 23rd International Symposium on Experimental Algorithms (SEA 2025)

Abstract

The Burrows-Wheeler transform (BWT) is integral to the FM-index, which is used extensively in text compression, indexing, pattern search, and bioinformatic problems as de novo assembly and read alignment. Thus, efficient construction of the BWT in terms of time and memory usage is key to these applications. We present a novel external-memory algorithm called Improved-Bucket Burrows-Wheeler transform (IBB) for constructing the BWT of DNA datasets with highly diverse sequence lengths. IBB uses a right-aligned approach to efficiently handle sequences of varying lengths, a tree-based data structure to manage relative insert positions and ranks, and fine buckets to reduce the necessary amount of input and output to external memory. Our experiments demonstrate that IBB is 10% to 40% faster than the best existing state-of-the-art BWT construction algorithms on most datasets while maintaining competitive memory consumption.

Cite as

Enno Adler, Stefan Böttcher, Rita Hartel, and Cederic Alexander Steininger. IBB: Fast Burrows-Wheeler Transform Construction for Length-Diverse DNA Data. In 23rd International Symposium on Experimental Algorithms (SEA 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 338, pp. 2:1-2:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{adler_et_al:LIPIcs.SEA.2025.2,
  author =	{Adler, Enno and B\"{o}ttcher, Stefan and Hartel, Rita and Steininger, Cederic Alexander},
  title =	{{IBB: Fast Burrows-Wheeler Transform Construction for Length-Diverse DNA Data}},
  booktitle =	{23rd International Symposium on Experimental Algorithms (SEA 2025)},
  pages =	{2:1--2:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-375-1},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{338},
  editor =	{Mutzel, Petra and Prezza, Nicola},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2025.2},
  URN =		{urn:nbn:de:0030-drops-232402},
  doi =		{10.4230/LIPIcs.SEA.2025.2},
  annote =	{Keywords: burrows-wheeler transform, self-indexes, external-memory}
}

Document

DOI: 10.4230/LIPIcs.WABI.2018.10

External memory BWT and LCP computation for sequence collections with applications

Authors: Lavinia Egidi, Felipe A. Louza, Giovanni Manzini, and Guilherme P. Telles

Published in: LIPIcs, Volume 113, 18th International Workshop on Algorithms in Bioinformatics (WABI 2018)

Abstract

We propose an external memory algorithm for the computation of the BWT and LCP array for a collection of sequences. Our algorithm takes the amount of available memory as an input parameter, and tries to make the best use of it by splitting the input collection into subcollections sufficiently small that it can compute their BWT in RAM using an optimal linear time algorithm. Next, it merges the partial BWTs in external memory and in the process it also computes the LCP values. We show that our algorithm performs O(n maxlcp) sequential I/Os, where n is the total length of the collection and maxlcp is the maximum LCP value. The experimental results show that our algorithm outperforms the current best algorithm for collections of sequences with different lengths and when the average LCP of the collection is relatively small compared to the length of the sequences. In the second part of the paper, we show that our algorithm can be modified to output two additional arrays that, combined with the BWT and LCP arrays, provide simple, scan based, external memory algorithms for three well known problems in bioinformatics: the computation of the all pairs suffix-prefix overlaps, the computation of maximal repeats, and the construction of succinct de Bruijn graphs.

Cite as

Lavinia Egidi, Felipe A. Louza, Giovanni Manzini, and Guilherme P. Telles. External memory BWT and LCP computation for sequence collections with applications. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 10:1-10:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

Copy BibTex To Clipboard

@InProceedings{egidi_et_al:LIPIcs.WABI.2018.10,
  author =	{Egidi, Lavinia and Louza, Felipe A. and Manzini, Giovanni and Telles, Guilherme P.},
  title =	{{External memory BWT and LCP computation for sequence collections with applications}},
  booktitle =	{18th International Workshop on Algorithms in Bioinformatics (WABI 2018)},
  pages =	{10:1--10:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-082-8},
  ISSN =	{1868-8969},
  year =	{2018},
  volume =	{113},
  editor =	{Parida, Laxmi and Ukkonen, Esko},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.10},
  URN =		{urn:nbn:de:0030-drops-93122},
  doi =		{10.4230/LIPIcs.WABI.2018.10},
  annote =	{Keywords: Burrows-Wheeler Transform, Longest Common Prefix Array, All pairs suffix-prefix overlaps, Succinct de Bruijn graph, Maximal repeats}
}

10 Search Results for "Louza, Felipe A."

Fast and Memory-Efficient BWT Construction of Repetitive Texts Using Lyndon Grammars

Abstract

Cite as

BWT for String Collections

Abstract

Cite as

Optimizing the Performance of the FM-Index for Large-Scale Data

Abstract

Cite as

Algorithms for Computing Very Large BWTs: a Short Survey

Abstract

Cite as

Wheeler Graphs and Wheeler Languages

Abstract

Cite as

A Taxonomy of LCP-Array Construction Algorithms

Abstract

Cite as

Efficient Terabyte-Scale Text Compression via Stable Local Consistency and Parallel Grammar Processing

Abstract

Cite as

Bit Packed Encodings for Grammar-Compressed Strings Supporting Fast Random Access

Abstract

Cite as

IBB: Fast Burrows-Wheeler Transform Construction for Length-Diverse DNA Data

Abstract

Cite as

External memory BWT and LCP computation for sequence collections with applications

Abstract

Cite as

Thanks for your feedback!

Could not send message