L-Systems for Measuring Repetitiveness

Navarro, Gonzalo; Urbina, Cristian

doi:10.4230/LIPIcs.CPM.2023.25

Author Details

Gonzalo Navarro

Department of Computer Science, University of Chile, Santiago, Chile
Centre for Biotechnology and Bioengineering (CeBiB), Santiago, Chile

Cristian Urbina

Department of Computer Science, University of Chile, Santiago, Chile
Centre for Biotechnology and Bioengineering (CeBiB), Santiago, Chile

Cite As

Gonzalo Navarro and Cristian Urbina. L-Systems for Measuring Repetitiveness. In 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 259, pp. 25:1-25:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/LIPIcs.CPM.2023.25

Abstract

To use L-systems for compression, we extend them (without ε-rules) with two parameters d and n and a coding τ, which together determine unambiguously a string w = τ(φ^d(s))[1:n], where φ is the morphism of the system and s is its axiom. The length of the shortest description of an L-system generating w is known as 𝓁, and it is arguably a relevant measure of repetitiveness that builds on the self-similarities that arise in the sequence. In this paper, we deepen the study of the measure 𝓁 and its relation with a better-established measure called δ, which builds on substring complexity. Our results show that 𝓁 and δ are largely orthogonal, in the sense that either one can be much larger than the other, depending on the case. This suggests that the two mechanisms capture different kinds of regularities related to repetitiveness. We then show that the recently introduced NU-systems, which combine the capabilities of L-systems with bidirectional macro schemes, can be asymptotically strictly smaller than both mechanisms for the same fixed string family, which makes the size ν of the smallest NU-system the unique smallest reachable repetitiveness measure to date. We conclude that in order to achieve better compression, we should combine morphism substitution with copy-paste mechanisms.
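
As a rough illustration of the two objects the abstract refers to, the following Python sketch naively evaluates w = τ(φ^d(s))[1:n] and computes δ in its usual form, the maximum over k of the number of distinct length-k substrings of w divided by k. The function names `expand` and `delta`, the dictionary representation of φ and τ, and the Thue-Morse example are assumptions made for illustration; they are not taken from the paper. The expansion materializes φ^d(s) in full, so it is only suitable for small d.

```python
from typing import Dict


def expand(axiom: str, phi: Dict[str, str], tau: Dict[str, str], d: int, n: int) -> str:
    """Return tau(phi^d(axiom))[1:n], i.e. the first n symbols (1-based, inclusive).

    phi maps each symbol to a non-empty string (no epsilon-rules);
    tau is a coding, i.e. it maps each symbol to exactly one symbol.
    """
    if any(image == "" for image in phi.values()):
        raise ValueError("epsilon-rules are not allowed")
    if any(len(image) != 1 for image in tau.values()):
        raise ValueError("a coding must map each symbol to a single symbol")

    word = axiom
    for _ in range(d):
        # Apply the morphism phi to every symbol of the current word.
        word = "".join(phi[c] for c in word)

    coded = "".join(tau[c] for c in word)
    if n > len(coded):
        raise ValueError("n exceeds the length of tau(phi^d(axiom))")
    return coded[:n]


def delta(w: str) -> float:
    """Naive delta: max over k of (distinct length-k substrings of w) / k."""
    if not w:
        raise ValueError("w must be non-empty")
    best = 0.0
    for k in range(1, len(w) + 1):
        distinct = {w[i:i + k] for i in range(len(w) - k + 1)}
        best = max(best, len(distinct) / k)
    return best


# Hypothetical example: the Thue-Morse morphism with an identity coding.
thue_morse = {"a": "ab", "b": "ba"}
identity = {"a": "a", "b": "b"}
w = expand("a", thue_morse, identity, d=3, n=6)
print(w, delta(w))
```

Here the description of w consists of the axiom, the two rules of φ, the coding, and the integers d and n, whereas δ is computed from w alone; this is the sense in which the two quantities look at different kinds of regularity.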

Subject Classification

ACM Subject Classification

Mathematics of computing → Combinatorics on words
Theory of computation → Data compression

Keywords

L-systems
String morphisms
Repetitiveness measures
Text compression
