A Class of Heuristics for Reducing the Number of BWT-Runs in the String Ordering Problem

Authors: Gianmarco Bertola, Anthony J. Cox, Veronica Guerrini, Giovanna Rosone




File

LIPIcs.CPM.2024.7.pdf
  • Filesize: 0.78 MB
  • 15 pages

Author Details

Gianmarco Bertola
  • Department of Computer Science, University of Pisa, Italy
Anthony J. Cox
  • Independent Researcher, Cambridge, UK
Veronica Guerrini
  • Department of Computer Science, University of Pisa, Italy
Giovanna Rosone
  • Department of Computer Science, University of Pisa, Italy

Cite As

Gianmarco Bertola, Anthony J. Cox, Veronica Guerrini, and Giovanna Rosone. A Class of Heuristics for Reducing the Number of BWT-Runs in the String Ordering Problem. In 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 296, pp. 7:1-7:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.CPM.2024.7

Abstract

The Burrows-Wheeler transform (BWT) is a well-known text transformation that rearranges the symbols of the input string so that occurrences of the same symbol tend to occur in runs. The number of runs is an important parameter of the BWT output string, historically associated with its high compressibility and more recently used as a measure of the space complexity of efficient data structures. It is known that reordering the strings in the input collection 𝒮 affects the number of runs in the output string bwt(𝒮) produced by applying the BWT to the string collection. In this paper, we define a class of transformed strings in which the symbols in particular blocks of bwt(𝒮) can be reordered according to a different adaptive alphabet order. We then introduce new heuristics to reduce the number of runs in the BWT output of a string collection; these improve on the two existing heuristics introduced in Cox et al. [Anthony J. Cox et al., 2012]. The new heuristics are computed while applying the BWT to a string collection, assuming no a priori order on the input strings and without requiring any pre- or post-processing of the collection 𝒮 or of the BWT string. We also address the problem of reconstructing the input collection 𝒮 from the string bwt(𝒮) together with the string permutation realized when applying an alphabetical reordering of symbols during the construction of bwt(𝒮).
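
To make the dependence of the run count on the input order concrete, the following is a minimal Python sketch; it is not the heuristics proposed in the paper. It assumes the common multi-string convention in which string i is terminated by its own end-marker $_i, with $_i < $_j for i < j and every end-marker smaller than any letter (all markers are printed as "$"). The function names `bwt_collection`, `count_runs`, and `min_runs_bruteforce` are hypothetical, and the brute-force search over all orderings is exponential and included only for illustration.

```python
from itertools import groupby, permutations


def bwt_collection(strings):
    """Naive BWT of an ordered string collection (illustration only).

    Assumption: string i gets its own end-marker $_i, with $_i < $_j for
    i < j and every end-marker smaller than any letter. All markers are
    printed as "$".
    """
    rows = []
    for idx, s in enumerate(strings):
        for j in range(len(s) + 1):
            # Sort key for the suffix s[j:] followed by $_idx: letters are
            # tagged 1, the end-marker is tagged 0 so it sorts below letters.
            key = [(1, c) for c in s[j:]] + [(0, idx)]
            # Symbol that cyclically precedes this suffix in s + $_idx.
            prev = s[j - 1] if j > 0 else "$"
            rows.append((key, prev))
    rows.sort(key=lambda row: row[0])
    return "".join(prev for _, prev in rows)


def count_runs(text):
    """Number of maximal runs of equal symbols in text."""
    return sum(1 for _ in groupby(text))


def min_runs_bruteforce(strings):
    """Smallest run count over all input orderings (exponential time)."""
    return min(count_runs(bwt_collection(p)) for p in permutations(strings))


first = bwt_collection(["AC", "GC", "AC"])
second = bwt_collection(["AC", "AC", "GC"])
print(first, count_runs(first))    # CCC$$AGA$ 6
print(second, count_runs(second))  # CCC$$AAG$ 5
print(min_runs_bruteforce(["AC", "GC", "AC"]))  # 5
```

In this toy collection, placing the two copies of "AC" next to each other lets the symbols preceding the suffixes that begin with "C" form the run "AA" instead of the alternation "AGA", which saves one run.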

Subject Classification

ACM Subject Classification
  • Theory of computation → Data compression
  • Applied computing → Bioinformatics
Keywords
  • Burrows-Wheeler Transform
  • SAP-interval
  • repetitive text
  • string compression

References

  1. Markus J. Bauer, Anthony J. Cox, and Giovanna Rosone. Lightweight algorithms for constructing and inverting the BWT of string collections. Theor. Comput. Sci., 483(0):134-148, 2013. Source code: https://github.com/BEETL/BEETL.
  2. Jason W. Bentley, Daniel Gibney, and Sharma V. Thankachan. On the complexity of BWT-runs minimization via alphabet reordering. In ESA 2020, volume 173 of LIPIcs, pages 15:1-15:13, 2020. URL: https://doi.org/10.4230/LIPIcs.ESA.2020.15.
  3. Michael Burrows and David J. Wheeler. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994.
  4. Davide Cenzato, Veronica Guerrini, Zsuzsanna Lipták, and Giovanna Rosone. Computing the optimal BWT of very large string collections. In Data Compression Conference, DCC 2023, pages 71-80. IEEE, 2023. Source code: https://github.com/davidecenzato/optimalBWT. URL: https://doi.org/10.1109/DCC55655.2023.00015.
  5. Davide Cenzato and Zsuzsanna Lipták. A theoretical and experimental analysis of BWT variants for string collections. In CPM 2022, volume 223 of LIPIcs, pages 25:1-25:18, 2022. URL: https://doi.org/10.4230/LIPIcs.CPM.2022.25.
  6. Brenton Chapin and Stephen R. Tate. Higher compression from the Burrows-Wheeler transform by modified sorting. In DCC, page 532, Washington, DC, USA, 1998. IEEE Computer Society.
  7. Anthony J. Cox, Markus J. Bauer, Tobias Jakobi, and Giovanna Rosone. Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. Bioinform., 28(11):1415-1419, 2012. Availability: Code is part of the BEETL library, available as a github repository at https://github.com/BEETL/BEETL. URL: https://doi.org/10.1093/bioinformatics/bts173.
  8. Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In FOCS, pages 390-398. IEEE Computer Society, 2000. URL: https://doi.org/10.1109/SFCS.2000.892127.
  9. Paolo Ferragina and Giovanni Manzini. An experimental study of a compressed index. Information Sciences, 135(1):13-28, 2001. URL: https://doi.org/10.1016/S0020-0255(01)00098-6.
  10. Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM, 67(1), 2020. URL: https://doi.org/10.1145/3375890.
  11. Raffaele Giancarlo, Giovanni Manzini, Antonio Restivo, Giovanna Rosone, and Marinella Sciortino. A new class of string transformations for compressed text indexing. Information and Computation, 294:105068, 2023. URL: https://doi.org/10.1016/j.ic.2023.105068.
  12. Veronica Guerrini, Alessio Conte, Roberto Grossi, Gianni Liti, Giovanna Rosone, and Lorenzo Tattini. phyBWT2: phylogeny reconstruction via eBWT positional clustering. Algorithms Mol. Biol., 18(1):11, 2023. URL: https://doi.org/10.1186/S13015-023-00232-4.
  13. Veronica Guerrini, Felipe A. Louza, and Giovanna Rosone. Parallel lossy compression for large FASTQ files. In Biomedical Engineering Systems and Technologies, pages 97-120, Cham, 2023. Springer Nature Switzerland. URL: https://doi.org/10.1007/978-3-031-38854-5_6.
  14. Ira M. Gessel, Antonio Restivo, and Christophe Reutenauer. A bijection between words and multisets of necklaces. Eur. J. Combin., 33(7):1537-1546, 2012.
  15. Heng Li. Fast construction of FM-index for long sequence reads. Bioinformatics, 30(22):3274-3275, 2014. Source code: https://github.com/lh3/ropebwt2. URL: https://doi.org/10.1093/bioinformatics/btu541.
  16. Veli Mäkinen and Gonzalo Navarro. Succinct suffix arrays based on run-length encoding. Nordic J. of Computing, 12(1):40-66, 2005.
  17. Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki. Storage and retrieval of highly repetitive sequence collections. J. Comput. Biol., 17(3):281-308, 2010.
  18. Sabrina Mantaci, Antonio Restivo, Giovanna Rosone, and Marinella Sciortino. An extension of the Burrows-Wheeler Transform. Theor. Comput. Sci., 387(3):298-312, 2007. URL: https://doi.org/10.1016/j.tcs.2007.07.014.
  19. Sabrina Mantaci, Antonio Restivo, Giovanna Rosone, Marinella Sciortino, and Luca Versari. Measuring the clustering effect of BWT via RLE. Theor. Comput. Sci., 698:79-87, 2017.
  20. Joong Chae Na, Hyunjoon Kim, Seunghwan Min, Heejin Park, Thierry Lecroq, Martine Léonard, Laurent Mouchard, and Kunsoo Park. FM-index of alignment with gaps. Theor. Comput. Sci., 710:148-157, 2018.
  21. Gonzalo Navarro. Indexing highly repetitive string collections, part I: repetitiveness measures. ACM Comput. Surv., 54(2):29:1-29:31, 2021.
  22. Gonzalo Navarro. Indexing highly repetitive string collections, part II: compressed indexes. ACM Comput. Surv., 54(2):26:1-26:32, 2021.
  23. Giovanna Rosone and Marinella Sciortino. The Burrows-Wheeler Transform between Data Compression and Combinatorics on Words. In CiE, volume 7921 of LNCS, pages 353-364. Springer, 2013. URL: https://doi.org/10.1007/978-3-642-39053-1_42.
  24. Jared T. Simpson and Richard Durbin. Efficient construction of an assembly string graph using the FM-index. Bioinform., 26(12):367-373, 2010.