L-Systems for Measuring Repetitiveness
To use L-systems for compression, we extend them (excluding ε-rules) with two parameters d and n and a coding τ. Together these unambiguously determine a string w = τ(φ^d(s))[1:n], where φ is the morphism of the system and s is its axiom. The length of the shortest description of an L-system generating w is denoted 𝓁; it is arguably a relevant measure of repetitiveness, one that builds on the self-similarities arising in the sequence.
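As a rough illustration of the formula w = τ(φ^d(s))[1:n], the following Python sketch expands an axiom d times and then applies the coding. The function name `expand`, the dictionary representation of φ and τ, and the Thue-Morse example are assumptions made for illustration only; they are not taken from the paper.

```python
def expand(rules, coding, axiom, d, n):
    """Return tau(phi^d(axiom))[1:n] (1-based, inclusive prefix of length n).

    rules  -- dict mapping each symbol to a non-empty string (the morphism phi)
    coding -- dict mapping each symbol to a single output character (tau)
    """
    if any(len(image) == 0 for image in rules.values()):
        raise ValueError("epsilon-rules are not allowed")
    word = axiom
    for _ in range(d):
        # Keeping only the first n symbols is safe: with no epsilon-rules,
        # every symbol expands to at least one symbol, so the first n symbols
        # of phi(word) depend only on the first n symbols of word.
        word = "".join(rules[c] for c in word)[:n]
    return "".join(coding[c] for c in word[:n])


# Hypothetical example: the Thue-Morse morphism a -> ab, b -> ba,
# with a coding that renames a to 0 and b to 1.
thue_morse_rules = {"a": "ab", "b": "ba"}
to_bits = {"a": "0", "b": "1"}
print(expand(thue_morse_rules, to_bits, "a", 4, 10))  # prints 0110100110
```

The whole system here is described by two small dictionaries, the axiom, and the integers d and n, which is the kind of description whose size 𝓁 measures.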
In this paper, we deepen the study of the measure 𝓁 and its relation with a better-established measure called δ, which builds on substring complexity. Our results show that 𝓁 and δ are largely orthogonal, in the sense that either one can be much larger than the other, depending on the string family. This suggests that the two mechanisms capture different kinds of regularities related to repetitiveness.
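To make the phrase "builds on substring complexity" concrete, here is a minimal Python sketch. It assumes the usual definition δ = max over k of d_k(w)/k, where d_k(w) is the number of distinct substrings of length k in w; the abstract itself does not spell this definition out. The function name `delta` is hypothetical, and the brute-force enumeration is only meant for short strings.

```python
def delta(w):
    """Brute-force delta(w) = max over k of (distinct length-k substrings) / k."""
    best = 0.0
    for k in range(1, len(w) + 1):
        # Collect every distinct substring of length exactly k.
        distinct = {w[i:i + k] for i in range(len(w) - k + 1)}
        best = max(best, len(distinct) / k)
    return best


print(delta("aaaa"))  # prints 1.0
print(delta("abab"))  # prints 2.0
```

This enumeration takes time polynomial in the length of w but is far from optimal; it is only a direct transcription of the assumed definition.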
We then show that the recently introduced NU-systems, which combine the capabilities of L-systems with bidirectional macro schemes, can be asymptotically strictly smaller than both mechanisms for the same fixed string family, which makes the size ν of the smallest NU-system the unique smallest reachable repetitiveness measure to date. We conclude that in order to achieve better compression, we should combine morphism substitution with copy-paste mechanisms.
L-systems
String morphisms
Repetitiveness measures
Text compression
Mathematics of computing~Combinatorics on words
Theory of computation~Data compression
25:1-25:17
Regular Paper
Gonzalo Navarro
Department of Computer Science, University of Chile, Santiago, Chile
Centre for Biotechnology and Bioengineering (CeBiB), Santiago, Chile
https://users.dcc.uchile.cl/~gnavarro/
https://orcid.org/0000-0002-2286-741X
Funded by Basal Funds FB0001 and Fondecyt Grant 1-200038, Chile.
Cristian Urbina
Department of Computer Science, University of Chile, Santiago, Chile
Centre for Biotechnology and Bioengineering (CeBiB), Santiago, Chile
https://users.dcc.uchile.cl/~crurbina/
https://orcid.org/0000-0001-8979-9055
Funded by Basal Funds FB0001 and ANID National Doctoral Scholarship - 21210580, Chile.
10.4230/LIPIcs.CPM.2023.25
Gonzalo Navarro and Cristian Urbina
Creative Commons Attribution 4.0 International license
https://creativecommons.org/licenses/by/4.0/legalcode