Constant-Delay Enumeration for SLP-Compressed Documents

Muñoz, Martín; Riveros, Cristian

doi:10.4230/LIPIcs.ICDT.2023.7

File

LIPIcs.ICDT.2023.7.pdf

Filesize: 0.85 MB
17 pages

Document Identifiers

DOI: 10.4230/LIPIcs.ICDT.2023.7
URN: urn:nbn:de:0030-drops-177495

Author Details

Martín Muñoz

Pontificia Universidad Católica de Chile, Santiago, Chile
Millennium Institute for Foundational Research on Data, Santiago, Chile

Cristian Riveros

Pontificia Universidad Católica de Chile, Santiago, Chile
Millennium Institute for Foundational Research on Data, Santiago, Chile

Cite As

Martín Muñoz and Cristian Riveros. Constant-Delay Enumeration for SLP-Compressed Documents. In 26th International Conference on Database Theory (ICDT 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 255, pp. 7:1-7:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/LIPIcs.ICDT.2023.7

Abstract

We study the problem of enumerating results from a query over a compressed document. We model compression with straight-line programs (SLPs), which are defined by a context-free grammar that produces a single string. For our queries we use a model called Annotated Automata, an extension of regular automata that allows annotations on letters. This model extends the notion of Regular Spanners, as it allows arbitrarily long outputs. Our main result is an algorithm that evaluates such a query by enumerating all results with output-linear delay after a preprocessing phase that takes linear time in the size of the SLP and cubic time in the size of the automaton. This improves on Schmid and Schweikardt’s result [Markus L. Schmid and Nicole Schweikardt, 2021], which, with the same preprocessing time, enumerates with a delay that is logarithmic in the size of the uncompressed document. We achieve this through a persistent data structure named Enumerable Compact Sets with Shifts, which guarantees output-linear delay under certain restrictions. These results imply constant-delay enumeration algorithms in the context of regular spanners. Further, we use an extension of annotated automata that relies on succinctly encoded annotations to save an exponential factor over previous results on constant-delay enumeration over vset automata. Lastly, we extend our results in the same fashion as Schmid and Schweikardt [Markus L. Schmid and Nicole Schweikardt, 2022] to allow complex document editing while maintaining the constant-delay guarantee.
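
To make the notion of an SLP concrete, the following minimal Python sketch represents a grammar in which every nonterminal has exactly one rule, so the grammar derives a single string. It is an illustration only, not the enumeration algorithm of the paper: the rule encoding, the names RULES, expand, length, and doubling_slp, and the example grammars are all assumptions made here. It shows that a property of the document (its length) can be computed in time linear in the number of rules, without decompressing.

```python
# Hypothetical SLP encoding (an assumption for illustration): each nonterminal
# maps to either a single terminal character or a pair of nonterminals.
RULES = {
    "A": "a",
    "B": "b",
    "C": ("A", "B"),  # C -> A B, derives "ab"
    "D": ("C", "C"),  # D -> C C, derives "abab"
    "S": ("D", "C"),  # S -> D C, derives "ababab"
}


def expand(symbol, rules=RULES):
    """Fully decompress the string derived from `symbol` (may be exponential)."""
    rule = rules[symbol]
    if isinstance(rule, str):
        return rule
    left, right = rule
    return expand(left, rules) + expand(right, rules)


def length(symbol, rules=RULES):
    """Length of the derived string, visiting each rule at most once."""
    memo = {}

    def go(sym):
        if sym not in memo:
            rule = rules[sym]
            if isinstance(rule, str):
                memo[sym] = 1
            else:
                memo[sym] = go(rule[0]) + go(rule[1])
        return memo[sym]

    return go(symbol)


def doubling_slp(n):
    """SLP with n + 1 rules whose start symbol X{n} derives 'a' repeated 2**n times."""
    rules = {"X0": "a"}
    for i in range(1, n + 1):
        rules[f"X{i}"] = (f"X{i - 1}", f"X{i - 1}")
    return rules


if __name__ == "__main__":
    print(expand("S"))                        # the decompressed document
    print(length("S"))                        # its length, from the grammar alone
    print(length("X40", doubling_slp(40)))    # 41 rules, document of length 2**40
```

The doubling grammar illustrates why preprocessing time measured in the size of the SLP, rather than in the size of the uncompressed document, can matter: the document may be exponentially longer than its grammar.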

Subject Classification

ACM Subject Classification

Theory of computation → Database theory

Keywords

SLP compression
query evaluation
enumeration algorithms

References

Alfred V. Aho and John E. Hopcroft. The design and analysis of computer algorithms. Pearson Education India, 1974.
Antoine Amarilli, Pierre Bourhis, Louis Jachiet, and Stefan Mengel. A circuit-based approach to efficient enumeration. In ICALP, volume 80, pages 111:1-111:15, 2017.
Antoine Amarilli, Pierre Bourhis, Stefan Mengel, and Matthias Niewerth. Constant-delay enumeration for nondeterministic document spanners. ACM Trans. Database Syst., 46(1):2:1-2:30, 2021.
Antoine Amarilli, Louis Jachiet, Martín Muñoz, and Cristian Riveros. Efficient enumeration for annotated grammars. In PODS, pages 291-300, 2022.
Guillaume Bagan. MSO queries on tree decomposable structures are computable with linear delay. In CSL, pages 167-181, 2006.
Guillaume Bagan, Arnaud Durand, and Etienne Grandjean. On acyclic conjunctive queries and constant delay enumeration. In CSL, pages 208-222, 2007.
Jean Berstel. Transductions and context-free languages. Springer-Verlag, 2013.
Pierre Bourhis, Alejandro Grez, Louis Jachiet, and Cristian Riveros. Ranked enumeration of MSO logic on words. In ICDT, volume 186, pages 20:1-20:19, 2021.
Marco Bucchi, Alejandro Grez, Andrés Quintana, Cristian Riveros, and Stijn Vansummeren. CORE: a complex event recognition engine. VLDB, 15(9):1951-1964, 2022.
Nofar Carmeli, Shai Zeevi, Christoph Berkholz, Benny Kimelfeld, and Nicole Schweikardt. Answering (unions of) conjunctive queries using random access and random-order enumeration. In PODS, pages 393-409, 2020.
Francisco Claude and Gonzalo Navarro. Self-indexed grammar-based compression. Fundam. Informaticae, 111(3):313-337, 2011.
Johannes Doleschal, Benny Kimelfeld, Wim Martens, and Liat Peterfreund. Weight annotation in information extraction. Log. Methods Comput. Sci., 18(1), 2022.
James R. Driscoll, Neil Sarnak, Daniel Dominic Sleator, and Robert Endre Tarjan. Making data structures persistent. In STOC, pages 109-121, 1986.
Arnaud Durand and Etienne Grandjean. First-order queries on structures of bounded degree are computable with constant delay. ACM Trans. Comput. Log., 8(4):21, 2007.
Ronald Fagin, Benny Kimelfeld, Frederick Reiss, and Stijn Vansummeren. Document spanners: A formal approach to information extraction. J. ACM, 62(2):12:1-12:51, 2015.
Fernando Florenzano, Cristian Riveros, Martín Ugarte, Stijn Vansummeren, and Domagoj Vrgoc. Efficient enumeration algorithms for regular document spanners. ACM Trans. Database Syst., 45(1):3:1-3:42, 2020.
Alejandro Grez and Cristian Riveros. Towards streaming evaluation of queries with correlation in complex event processing. In ICDT, volume 155, pages 14:1-14:17, 2020.
Alejandro Grez, Cristian Riveros, Martín Ugarte, and Stijn Vansummeren. A formal framework for complex event recognition. ACM Trans. Database Syst., 46(4):1-49, 2021.
Wojciech Kazana and Luc Segoufin. First-order query evaluation on structures of bounded degree. Log. Methods Comput. Sci., 7(2), 2011.
John C. Kieffer and En-Hui Yang. Grammar-based codes: A new class of universal lossless source codes. IEEE Trans. Inf. Theory, 46(3):737-754, 2000.
Markus Lohrey. Algorithmics on SLP-compressed strings: A survey. Groups Complex. Cryptol., 4(2):241-299, 2012.
Martín Muñoz and Cristian Riveros. Streaming enumeration on nested documents. In ICDT, volume 220, pages 19:1-19:18, 2022.
Liat Peterfreund. Grammars for document spanners. In ICDT, volume 186, pages 7:1-7:18, 2021.
Wojciech Rytter. Application of Lempel-Ziv factorization to the approximation of grammar-based compression. In CPM, volume 2373, pages 20-31, 2002.
Markus L. Schmid and Nicole Schweikardt. Spanner evaluation over SLP-compressed documents. In PODS, pages 153-165, 2021.
Markus L. Schmid and Nicole Schweikardt. Query evaluation over SLP-represented document databases with complex document editing. In PODS, pages 79-89, 2022.
Nicole Schweikardt, Luc Segoufin, and Alexandre Vigny. Enumeration for FO queries over nowhere dense graphs. In PODS, pages 151-163, 2018.
James A. Storer and Thomas G. Szymanski. Data compression via textual substitution. J. ACM, 29(4):928-951, 1982.
