On the Impact of Morphisms on BWT-Runs

Authors Gabriele Fici , Giuseppe Romana , Marinella Sciortino , Cristian Urbina



PDF
Thumbnail PDF

File

LIPIcs.CPM.2023.10.pdf
  • Filesize: 0.78 MB
  • 18 pages

Document Identifiers

Author Details

Gabriele Fici
  • Department of Mathematics and Informatics, University of Palermo, Italy
Giuseppe Romana
  • Department of Mathematics and Informatics, University of Palermo, Italy
Marinella Sciortino
  • Department of Mathematics and Informatics, University of Palermo, Italy
Cristian Urbina
  • Department of Computer Science, University of Chile, Santiago, Chile
  • Centre for Biotechnology and Bioengineering (CeBiB), Santiago, Chile

Acknowledgements

This research was carried out during the visit of Cristian Urbina to the University of Palermo, Italy.

Cite As Get BibTex

Gabriele Fici, Giuseppe Romana, Marinella Sciortino, and Cristian Urbina. On the Impact of Morphisms on BWT-Runs. In 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 259, pp. 10:1-10:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023) https://doi.org/10.4230/LIPIcs.CPM.2023.10

Abstract

Morphisms are widely studied combinatorial objects that can be used for generating infinite families of words. In the context of Information theory, injective morphisms are called (variable length) codes. In Data compression, the morphisms, combined with parsing techniques, have been recently used to define new mechanisms to generate repetitive words. Here, we show that the repetitiveness induced by applying a morphism to a word can be captured by a compression scheme based on the Burrows-Wheeler Transform (BWT). In fact, we prove that, differently from other compression-based repetitiveness measures, the measure r_bwt (which counts the number of equal-letter runs produced by applying BWT to a word) strongly depends on the applied morphism. More in detail, we characterize the binary morphisms that preserve the value of r_bwt(w), when applied to any binary word w containing both letters. They are precisely the Sturmian morphisms, which are well-known objects in Combinatorics on words. Moreover, we prove that it is always possible to find a binary morphism that, when applied to any binary word containing both letters, increases the number of BWT-equal letter runs by a given (even) number. In addition, we derive a method for constructing arbitrarily large families of binary words on which BWT produces a given (even) number of new equal-letter runs. Such results are obtained by using a new class of morphisms that we call Thue-Morse-like. Finally, we show that there exist binary morphisms μ for which it is possible to find words w such that the difference r_bwt(μ(w))-r_bwt(w) is arbitrarily large.

Subject Classification

ACM Subject Classification
  • Mathematics of computing → Combinatorics on words
  • Theory of computation → Data compression
Keywords
  • Morphism
  • Burrows-Wheeler transform
  • Sturmian word
  • Sturmian morphism
  • Thue-Morse morphism
  • Repetitiveness measure

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Tooru Akagi, Mitsuru Funakoshi, and Shunsuke Inenaga. Sensitivity of string compressors and repetitiveness measures. Inf. Comput., 291:104999, 2023. Google Scholar
  2. Jean Berstel, Dominique Perrin, and Christophe Reutenauer. Codes and Automata, volume 129 of Encyclopedia of mathematics and its applications. Cambridge University Press, 2010. Google Scholar
  3. Jean Berstel and Patrice Séébold. A characterization of sturmian morphisms. In MFCS, volume 711 of Lecture Notes in Computer Science, pages 281-290. Springer, 1993. Google Scholar
  4. Srecko Brlek, Andrea Frosini, Ilaria Mancini, Elisa Pergola, and Simone Rinaldi. Burrows-wheeler transform of words defined by morphisms. In IWOCA, volume 11638 of Lecture Notes in Computer Science, pages 393-404. Springer, 2019. Google Scholar
  5. Michael Burrows and David Wheeler. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994. Google Scholar
  6. Julien Cassaigne. Sequences with grouped factors. In DLT, pages 211-222. Aristotle University of Thessaloniki, 1997. Google Scholar
  7. Giusi Castiglione, Antonio Restivo, and Marinella Sciortino. Circular Sturmian words and Hopcroft’s algorithm. Theor. Comput. Sci., 410(43):4372-4381, 2009. Google Scholar
  8. Wai-Fong Chuan. Sturmian morphisms and alpha-words. Theor. Comput. Sci., 225(1-2):129-148, 1999. Google Scholar
  9. Sorin Constantinescu and Lucian Ilie. The lempel-ziv complexity of fixed points of morphisms. SIAM J. Discret. Math., 21(2):466-481, 2007. Google Scholar
  10. Maxime Crochemore, Thierry Lecroq, and Wojciech Rytter. 125 Problems in Text Algorithms: with Solutions. Cambridge University Press, 2021. Google Scholar
  11. Paolo Ferragina and Giovanni Manzini. Opportunistic Data Structures with Applications. In FOCS, pages 390-398. IEEE Computer Society, 2000. Google Scholar
  12. Andrea Frosini, Ilaria Mancini, Simone Rinaldi, Giuseppe Romana, and Marinella Sciortino. Logarithmic equal-letter runs for BWT of purely morphic words. In DLT, volume 13257 of Lect. Notes Comput. Sc., pages 139-151. Springer, 2022. Google Scholar
  13. Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space. J. ACM, 67(1):2:1-2:54, 2020. Google Scholar
  14. Sara Giuliani, Shunsuke Inenaga, Zsuzsanna Lipták, Nicola Prezza, Marinella Sciortino, and Anna Toffanello. Novel results on the number of runs of the burrows-wheeler-transform. In SOFSEM, volume 12607 of Lecture Notes in Computer Science, pages 249-262. Springer, 2021. Google Scholar
  15. Sara Giuliani, Shunsuke Inenaga, Zsuzsanna Lipták, Giuseppe Romana, Marinella Sciortino, and Cristian Urbina. Bit catastrophes for the Burrows-Wheeler Transform. In DLT, volume 13911 of Lect. Notes Comput. Sc. Springer, 2023. Google Scholar
  16. Dominik Kempa and Tomasz Kociumaka. Resolution of the Burrows-Wheeler Transform Conjecture. Commun. ACM, 65(6):91-98, 2022. Google Scholar
  17. Takuya Kida, Tetsuya Matsumoto, Yusuke Shibata, Masayuki Takeda, Ayumi Shinohara, and Setsuo Arikawa. Collage system: a unifying framework for compressed pattern matching. Theor. Comput. Sci., 298(1):253-272, 2003. Google Scholar
  18. John C. Kieffer and En-Hui Yang. Grammar-based codes: A new class of universal lossless source codes. IEEE Trans. Inf. Theory, 46(3):737-754, 2000. Google Scholar
  19. Donald E. Knuth, James H. Morris Jr., and Vaughan R. Pratt. Fast pattern matching in strings. SIAM J. Comput., 6(2):323-350, 1977. Google Scholar
  20. Sebastian Kreft and Gonzalo Navarro. LZ77-Like Compression with Fast Random Access. In DCC, pages 239-248. IEEE, 2010. Google Scholar
  21. Ben Langmead and Steven Salzberg. Fast gapped-read alignment with Bowtie 2. Nature methods, 9:357-359, March 2012. Google Scholar
  22. Abraham Lempel and Jacob Ziv. On the complexity of finite sequences. IEEE Trans. Inf. Theory, 22(1):75-81, 1976. Google Scholar
  23. Heng Li and Richard Durbin. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinform., 26(5):589-595, 2010. Google Scholar
  24. M. Lothaire. Algebraic Combinatorics on Words. Encyclopedia of Mathematics and its Applications. Cambridge Univ. Press, New York, NY, USA, 2002. Google Scholar
  25. Sabrina Mantaci, Antonio Restivo, Giovanna Rosone, Marinella Sciortino, and Luca Versari. Measuring the clustering effect of BWT via RLE. Theoret. Comput. Sci., 698:79-87, 2017. Google Scholar
  26. Sabrina Mantaci, Antonio Restivo, and Marinella Sciortino. Burrows-Wheeler transform and Sturmian words. Inf. Process. Lett., 86(5):241-246, 2003. URL: https://doi.org/10.1016/S0020-0190(02)00512-4.
  27. Giovanni Manzini. An analysis of the Burrows-Wheeler transform. J. ACM, 48(3):407-430, 2001. Google Scholar
  28. Filippo Mignosi and Patrice Séébold. Morphismes sturmiens et règles de Rauzy. Journal de théorie des nombres de Bordeaux, 5(2):221-233, 1993. Google Scholar
  29. Gonzalo Navarro. Indexing highly repetitive string collections, part I: Repetitiveness measures. ACM Computing Surveys, 54(2):article 29, 2021. Google Scholar
  30. Gonzalo Navarro. The compression power of the BWT: technical perspective. Commun. ACM, 65(6):90, 2022. Google Scholar
  31. Gonzalo Navarro, Carlos Ochoa, and Nicola Prezza. On the approximation ratio of ordered parsings. IEEE Trans. Inf. Theory, 67(2):1008-1026, 2021. Google Scholar
  32. Gonzalo Navarro and Cristian Urbina. On stricter reachable repetitiveness measures. In SPIRE, volume 12944 of Lect. Notes Comput. Sci., pages 193-206. Springer, 2021. Google Scholar
  33. Takaaki Nishimoto, Tomohiro I, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Fully Dynamic Data Structure for LCE Queries in Compressed Space. In MFCS, volume 58 of Leibniz International Proceedings in Informatics (LIPIcs), pages 72:1-72:15, 2016. Google Scholar
  34. Geneviève Paquin. On a generalization of Christoffel words: epichristoffel words. Theor. Comput. Sci., 410(38-40):3782-3791, 2009. URL: https://doi.org/10.1016/j.tcs.2009.05.014.
  35. Antonio Restivo and Giovanna Rosone. Balancing and clustering of words in the Burrows-Wheeler transform. Theor. Comput. Sci., 412(27):3019-3032, 2011. URL: https://doi.org/10.1016/j.tcs.2010.11.040.
  36. Grzegorz Rozenberg and Arto Salomaa. The mathematical theory of L systems. Academic press, 1980. Google Scholar
  37. Marinella Sciortino and Luca Q. Zamboni. Suffix automata and standard sturmian words. In DLT, volume 4588 of Lect. Notes Comput. Sc., pages 382-398. Springer, 2007. Google Scholar
  38. James A. Storer and Thomas G. Szymanski. Data compression via textual substitution. J. ACM, !month = oct, 29(4):928-951, 1982. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail