An Optimal-Time RLBWT Construction in BWT-Runs Bounded Space

Authors Takaaki Nishimoto, Shunsuke Kanda, Yasuo Tabei



PDF
Thumbnail PDF

File

LIPIcs.ICALP.2022.99.pdf
  • Filesize: 1.51 MB
  • 20 pages

Document Identifiers

Author Details

Takaaki Nishimoto
  • RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
Shunsuke Kanda
  • RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
Yasuo Tabei
  • RIKEN Center for Advanced Intelligence Project, Tokyo, Japan

Cite AsGet BibTex

Takaaki Nishimoto, Shunsuke Kanda, and Yasuo Tabei. An Optimal-Time RLBWT Construction in BWT-Runs Bounded Space. In 49th International Colloquium on Automata, Languages, and Programming (ICALP 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 229, pp. 99:1-99:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)
https://doi.org/10.4230/LIPIcs.ICALP.2022.99

Abstract

The compression of highly repetitive strings (i.e., strings with many repetitions) has been a central research topic in string processing, and quite a few compression methods for these strings have been proposed thus far. Among them, an efficient compression format gathering increasing attention is the run-length Burrows-Wheeler transform (RLBWT), which is a run-length encoded BWT as a reversible permutation of an input string on the lexicographical order of suffixes. State-of-the-art construction algorithms of RLBWT have a serious issue with respect to (i) non-optimal computation time or (ii) a working space that is linearly proportional to the length of an input string. In this paper, we present r-comp, the first optimal-time construction algorithm of RLBWT in BWT-runs bounded space. That is, the computational complexity of r-comp is O(n + r log r) time and O(r log n) bits of working space for the length n of an input string and the number r of equal-letter runs in BWT. The computation time is optimal (i.e., O(n)) for strings with the property r = O(n/log n), which holds for most highly repetitive strings. Experiments using a real-world dataset of highly repetitive strings show the effectiveness of r-comp with respect to computation time and space.

Subject Classification

ACM Subject Classification
  • Theory of computation → Data compression
Keywords
  • lossless data compression
  • Burrows-Wheeler transform
  • highly repetitive text collections

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Hideo Bannai, Travis Gagie, and Tomohiro I. Refining the r-index. Theor. Comput. Sci., 812:96-108, 2020. Google Scholar
  2. Djamal Belazzougui, Manuel Cáceres, Travis Gagie, Pawel Gawrychowski, Juha Kärkkäinen, Gonzalo Navarro, Alberto Ordóñez Pereira, Simon J. Puglisi, and Yasuo Tabei. Block trees. J. Comput. Syst. Sci., 117:1-22, 2021. Google Scholar
  3. Djamal Belazzougui, Fabio Cunial, Travis Gagie, Nicola Prezza, and Mathieu Raffinot. Composite repetition-aware data structures. In Proceedings of CPM, pages 26-39, 2015. Google Scholar
  4. Djamal Belazzougui, Fabio Cunial, Juha Kärkkäinen, and Veli Mäkinen. Linear-time string indexing and analysis in small space. ACM Trans. Algor., 16:17:1-17:54, 2020. Google Scholar
  5. Christina Boucher, Travis Gagie, Alan Kuhnle, Ben Langmead, Giovanni Manzini, and Taher Mun. Prefix-free parsing for building big BWTs. Algorithms Mol. Biol., 14:13:1-13:15, 2019. Google Scholar
  6. Michael Burrows and David J Wheeler. A block-sorting lossless data compression algorithm. Technical report, 1994. Google Scholar
  7. Dustin Cobas, Veli Mäkinen, and Massimiliano Rossi. Tailoring r-index for document listing towards metagenomics applications. In Proceedings of SPIRE, pages 291-306, 2020. Google Scholar
  8. Maxime Crochemore, Roberto Grossi, Juha Kärkkäinen, and Gad M. Landau. Computing the Burrows-Wheeler transform in place and in small space. J. Disc. Algor., pages 44-52, 2015. Google Scholar
  9. Paul F. Dietz and Daniel Dominic Sleator. Two algorithms for maintaining order in a list. In Proceedings of STOC, pages 365-372, 1987. Google Scholar
  10. Patrick Dinklage, Jonas Ellert, Johannes Fischer, Dominik Köppl, and Manuel Penschuck. Bidirectional text compression in external memory. In Proceedings of ESA, pages 41:1-41:16, 2019. Google Scholar
  11. Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In Proceedings of FOCS, pages 390-398, 2000. Google Scholar
  12. Isamu Furuya, Takuya Takagi, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Takuya Kida. Practical grammar compression based on maximal repeats. Algorithms, 13:103, 2020. Google Scholar
  13. Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM, 67, 2020. Google Scholar
  14. Artur Jez. A really simple approximation of smallest grammar. Theor. Comput. Sci., 616:141-150, 2016. Google Scholar
  15. Dominik Kempa. Optimal construction of compressed indexes for highly repetitive texts. In Proceedings of SODA, pages 1344-1357, 2019. Google Scholar
  16. Dominik Kempa and Tomasz Kociumaka. String synchronizing sets: sublinear-time BWT construction and optimal LCE data structure. In Proceedings of STOC, pages 756-767, 2019. Google Scholar
  17. Dominik Kempa and Tomasz Kociumaka. Resolution of the Burrows-Wheeler transform conjecture. In Proceedings of FOCS, pages 1002-1013, 2020. Google Scholar
  18. Dominik Kempa and Ben Langmead. Fast and space-efficient construction of AVL grammars from the LZ77 parsing. In Proceedings of ESA, 2021. Google Scholar
  19. N. Jesper Larsson and Alistair Moffat. Offline dictionary-based compression. In Proceedings of DCC, pages 296-305, 1999. Google Scholar
  20. J. Ian Munro, Gonzalo Navarro, and Yakov Nekrich. Space-efficient construction of compressed indexes in deterministic linear time. In Proceedings of SODA, pages 408-424, 2017. Google Scholar
  21. Gonzalo Navarro and Yakov Nekrich. Optimal dynamic sequence representations. SIAM J. Comput., 43:1781-1806, 2014. Google Scholar
  22. Gonzalo Navarro, Carlos Ochoa, and Nicola Prezza. On the approximation ratio of ordered parsings. IEEE Trans. Inform. Theory, 67:1008-1026, 2021. Google Scholar
  23. Takaaki Nishimoto, Tomohiro I, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Dynamic index and LZ factorization in compressed space. Discret. Appl. Math., 274:116-129, 2020. Google Scholar
  24. Takaaki Nishimoto, Shunsuke Kanda, and Yasuo Tabei. An optimal-time RLBWT construction in BWT-runs bounded space. CoRR, abs/2202.07885, 2022. Google Scholar
  25. Takaaki Nishimoto and Yasuo Tabei. LZRR: LZ77 parsing with right reference. Inf. Comput., page 104859, 2021. Google Scholar
  26. Takaaki Nishimoto and Yasuo Tabei. Optimal-time queries on BWT-runs bounded space. In Proceedings of ICALP, pages 101:1-101:15, 2021. Google Scholar
  27. Takaaki Nishimoto and Yasuo Tabei. R-enum: Enumeration of characteristic substrings in BWT-runs bounded space. In Proceedings of CPM, pages 21:1-21:21, 2021. Google Scholar
  28. Tatsuya Ohno, Kensuke Sakai, Yoshimasa Takabatake, Tomohiro I, and Hiroshi Sakamoto. A faster implementation of online RLBWT and its application to LZ77 parsing. J. Disc. Algor., 52-53:18-28, 2018. Google Scholar
  29. Pizza&Chili repetitive corpus. URL: http://pizzachili.dcc.uchile.cl/repcorpus.html.
  30. Alberto Policriti and Nicola Prezza. LZ77 computation based on the run-length encoded BWT. Algorithmica, 80:1986-2011, 2018. Google Scholar
  31. Wojciech Rytter. Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theor. Comput. Sci., 302:211-222, 2003. Google Scholar
  32. The 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature, 491:56-65, 2012. Google Scholar
  33. Enwiki dump progress on 20210401: All pages with complete edit history (xml-p1p873). URL: https://dumps.wikimedia.org/enwiki/.
  34. Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Trans. Inform. Theory, 23:337-343, 1977. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail