L-Systems for Measuring Repetitiveness

Authors: Gonzalo Navarro, Cristian Urbina




File

LIPIcs.CPM.2023.25.pdf
  • Filesize: 0.76 MB
  • 17 pages

Author Details

Gonzalo Navarro
  • Department of Computer Science, University of Chile, Santiago, Chile
  • Centre for Biotechnology and Bioengineering (CeBiB), Santiago, Chile
Cristian Urbina
  • Department of Computer Science, University of Chile, Santiago, Chile
  • Centre for Biotechnology and Bioengineering (CeBiB), Santiago, Chile

Cite As

Gonzalo Navarro and Cristian Urbina. L-Systems for Measuring Repetitiveness. In 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 259, pp. 25:1-25:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/LIPIcs.CPM.2023.25

Abstract

To use L-systems for compression, we extend them (without ε-rules) with two parameters d and n and a coding τ, which together unambiguously determine a string w = τ(φ^d(s))[1:n], where φ is the morphism of the system and s is its axiom. The length of the shortest description of an L-system generating w is known as 𝓁, and it is arguably a relevant measure of repetitiveness that builds on the self-similarities that arise in the sequence. In this paper, we deepen the study of the measure 𝓁 and its relation with a better-established measure called δ, which builds on substring complexity. Our results show that 𝓁 and δ are largely orthogonal, in the sense that one can be much larger than the other, depending on the case. This suggests that both mechanisms capture different kinds of regularities related to repetitiveness. We then show that the recently introduced NU-systems, which combine the capabilities of L-systems with bidirectional macro schemes, can be asymptotically strictly smaller than both mechanisms for the same fixed string family, which makes the size ν of the smallest NU-system the unique smallest reachable repetitiveness measure to date. We conclude that in order to achieve better compression, we should combine morphism substitution with copy-paste mechanisms.
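
As an illustration only, the two objects named in the abstract can be sketched in a few lines of Python. The function names `generate` and `delta`, the dictionary representation of φ and τ, and the Thue-Morse example are assumptions made for this sketch and are not taken from the paper. The sketch expands w = τ(φ^d(s))[1:n] directly and computes δ by brute force as the maximum over k of (number of distinct length-k substrings of w) / k; it does not compute 𝓁, the size of the smallest such system.

```python
def generate(axiom, morphism, coding, d, n):
    """Return w = tau(phi^d(axiom))[1:n], i.e. the length-n prefix after coding.

    `morphism` (phi) and `coding` (tau) are dicts from symbols to strings;
    these names and this representation are assumptions of the sketch.
    """
    # The systems in the abstract have no epsilon-rules.
    if any(len(image) == 0 for image in morphism.values()):
        raise ValueError("epsilon-rules are not allowed")
    word = axiom
    for _ in range(d):
        # Every symbol expands to at least one symbol, so the first n symbols
        # of the next level depend only on the first n symbols of this level.
        word = "".join(morphism[c] for c in word[:n])[:n]
    if len(word) < n:
        raise ValueError("phi^d(axiom) is shorter than n")
    # The coding maps each symbol to a single output symbol.
    return "".join(coding[c] for c in word)


def delta(w):
    """Brute-force delta: max over k of (distinct substrings of length k) / k."""
    return max(
        len({w[i:i + k] for i in range(len(w) - k + 1)}) / k
        for k in range(1, len(w) + 1)
    )


# Hypothetical example: the Thue-Morse morphism with a coding onto {0, 1}.
thue_morse = {"a": "ab", "b": "ba"}
to_bits = {"a": "0", "b": "1"}
w = generate("a", thue_morse, to_bits, d=3, n=5)
print(w, delta(w))
```

Truncating to n symbols at every level keeps the intermediate words short, which is valid only because ε-rules are excluded. The brute-force `delta` is meant for small inputs; it is not an efficient algorithm.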

Subject Classification

ACM Subject Classification
  • Mathematics of computing → Combinatorics on words
  • Theory of computation → Data compression
Keywords
  • L-systems
  • String morphisms
  • Repetitiveness measures
  • Text compression
