Constant-Delay Enumeration for SLP-Compressed Documents

Authors Martín Muñoz, Cristian Riveros



Author Details

Martín Muñoz
  • Pontificia Universidad Católica de Chile, Santiago, Chile
  • Millennium Institute for Foundational Research on Data, Santiago, Chile
Cristian Riveros
  • Pontificia Universidad Católica de Chile, Santiago, Chile
  • Millennium Institute for Foundational Research on Data, Santiago, Chile

Cite As

Martín Muñoz and Cristian Riveros. Constant-Delay Enumeration for SLP-Compressed Documents. In 26th International Conference on Database Theory (ICDT 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 255, pp. 7:1-7:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/LIPIcs.ICDT.2023.7

Abstract

We study the problem of enumerating the results of a query over a compressed document. Our compression model is straight-line programs (SLPs), which are defined by a context-free grammar that produces a single string. For our queries we use a model called Annotated Automata, an extension of regular automata that allows annotations on letters. This model extends the notion of Regular Spanners, as it allows arbitrarily long outputs. Our main result is an algorithm that evaluates such a query by enumerating all results with output-linear delay after a preprocessing phase that takes time linear in the size of the SLP and cubic in the size of the automaton. This improves on Schmid and Schweikardt’s result [Markus L. Schmid and Nicole Schweikardt, 2021], which, with the same preprocessing time, enumerates with a delay that is logarithmic in the size of the uncompressed document. We achieve this through a persistent data structure named Enumerable Compact Sets with Shifts, which guarantees output-linear delay under certain restrictions. These results imply constant-delay enumeration algorithms in the context of regular spanners. Further, we use an extension of annotated automata with succinctly encoded annotations to save an exponential factor over previous results on constant-delay enumeration over vset automata. Lastly, we extend our results in the same fashion as Schmid and Schweikardt [Markus L. Schmid and Nicole Schweikardt, 2022] to allow complex document editing while maintaining the constant-delay guarantee.
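
To make the compression model concrete, the following is a minimal Python sketch of an SLP as described in the abstract: a grammar in which every nonterminal has exactly one rule, so the start symbol derives a single string. The rule set `RULES`, the nonterminal names, and the helper functions `lengths`, `char_at`, and `expand` are illustrative assumptions and do not come from the paper. The sketch shows only the SLP itself and a lookup that avoids decompression; it does not implement the paper's enumeration algorithm or its data structure.

```python
# Hypothetical SLP (not from the paper): each nonterminal maps either to a
# single terminal character or to a pair of nonterminals.
RULES = {
    "A": "a",
    "B": "b",
    "C": ("A", "B"),  # C derives "ab"
    "D": ("C", "C"),  # D derives "abab"
    "S": ("D", "C"),  # S derives "ababab"
}


def lengths(rules):
    """Compute the expansion length of every nonterminal.

    Each rule is evaluated once thanks to memoization, so the work is
    linear in the number of rules rather than in the document length.
    """
    memo = {}

    def length(symbol):
        if symbol not in memo:
            rhs = rules[symbol]
            if isinstance(rhs, str):
                memo[symbol] = 1
            else:
                memo[symbol] = length(rhs[0]) + length(rhs[1])
        return memo[symbol]

    for symbol in rules:
        length(symbol)
    return memo


def char_at(rules, lens, symbol, i):
    """Return the i-th character (0-based) of the string derived from symbol.

    Walks down the grammar using the precomputed lengths, without
    materializing the uncompressed document.
    """
    while True:
        rhs = rules[symbol]
        if isinstance(rhs, str):
            return rhs
        left, right = rhs
        if i < lens[left]:
            symbol = left
        else:
            i -= lens[left]
            symbol = right


def expand(rules, symbol):
    """Fully decompress a nonterminal (only sensible for small examples)."""
    rhs = rules[symbol]
    if isinstance(rhs, str):
        return rhs
    return expand(rules, rhs[0]) + expand(rules, rhs[1])


lens = lengths(RULES)
document = expand(RULES, "S")
```

Here five rules describe a six-character document; for highly repetitive documents the gap between grammar size and document length can be much larger, which is why preprocessing time measured in the size of the SLP matters.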

Subject Classification

ACM Subject Classification
  • Theory of computation → Database theory
Keywords
  • SLP compression
  • query evaluation
  • enumeration algorithms


References

  1. Alfred V. Aho and John E. Hopcroft. The design and analysis of computer algorithms. Pearson Education India, 1974.
  2. Antoine Amarilli, Pierre Bourhis, Louis Jachiet, and Stefan Mengel. A circuit-based approach to efficient enumeration. In ICALP, volume 80, pages 111:1-111:15, 2017.
  3. Antoine Amarilli, Pierre Bourhis, Stefan Mengel, and Matthias Niewerth. Constant-delay enumeration for nondeterministic document spanners. ACM Trans. Database Syst., 46(1):2:1-2:30, 2021.
  4. Antoine Amarilli, Louis Jachiet, Martín Muñoz, and Cristian Riveros. Efficient enumeration for annotated grammars. In PODS, pages 291-300, 2022.
  5. Guillaume Bagan. MSO queries on tree decomposable structures are computable with linear delay. In CSL, pages 167-181, 2006.
  6. Guillaume Bagan, Arnaud Durand, and Etienne Grandjean. On acyclic conjunctive queries and constant delay enumeration. In CSL, pages 208-222, 2007.
  7. Jean Berstel. Transductions and context-free languages. Springer-Verlag, 2013.
  8. Pierre Bourhis, Alejandro Grez, Louis Jachiet, and Cristian Riveros. Ranked enumeration of MSO logic on words. In ICDT, volume 186, pages 20:1-20:19, 2021.
  9. Marco Bucchi, Alejandro Grez, Andrés Quintana, Cristian Riveros, and Stijn Vansummeren. CORE: a complex event recognition engine. VLDB, 15(9):1951-1964, 2022.
  10. Nofar Carmeli, Shai Zeevi, Christoph Berkholz, Benny Kimelfeld, and Nicole Schweikardt. Answering (unions of) conjunctive queries using random access and random-order enumeration. In PODS, pages 393-409, 2020.
  11. Francisco Claude and Gonzalo Navarro. Self-indexed grammar-based compression. Fundam. Informaticae, 111(3):313-337, 2011.
  12. Johannes Doleschal, Benny Kimelfeld, Wim Martens, and Liat Peterfreund. Weight annotation in information extraction. Log. Methods Comput. Sci., 18(1), 2022.
  13. James R. Driscoll, Neil Sarnak, Daniel Dominic Sleator, and Robert Endre Tarjan. Making data structures persistent. In STOC, pages 109-121, 1986.
  14. Arnaud Durand and Etienne Grandjean. First-order queries on structures of bounded degree are computable with constant delay. ACM Trans. Comput. Log., 8(4):21, 2007.
  15. Ronald Fagin, Benny Kimelfeld, Frederick Reiss, and Stijn Vansummeren. Document spanners: A formal approach to information extraction. J. ACM, 62(2):12:1-12:51, 2015.
  16. Fernando Florenzano, Cristian Riveros, Martín Ugarte, Stijn Vansummeren, and Domagoj Vrgoc. Efficient enumeration algorithms for regular document spanners. ACM Trans. Database Syst., 45(1):3:1-3:42, 2020.
  17. Alejandro Grez and Cristian Riveros. Towards streaming evaluation of queries with correlation in complex event processing. In ICDT, volume 155, pages 14:1-14:17, 2020.
  18. Alejandro Grez, Cristian Riveros, Martín Ugarte, and Stijn Vansummeren. A formal framework for complex event recognition. ACM Trans. Database Syst., 46(4):1-49, 2021.
  19. Wojciech Kazana and Luc Segoufin. First-order query evaluation on structures of bounded degree. Log. Methods Comput. Sci., 7(2), 2011.
  20. John C. Kieffer and En-Hui Yang. Grammar-based codes: A new class of universal lossless source codes. IEEE Trans. Inf. Theory, 46(3):737-754, 2000.
  21. Markus Lohrey. Algorithmics on SLP-compressed strings: A survey. Groups Complex. Cryptol., 4(2):241-299, 2012.
  22. Martín Muñoz and Cristian Riveros. Streaming enumeration on nested documents. In ICDT, volume 220, pages 19:1-19:18, 2022.
  23. Liat Peterfreund. Grammars for document spanners. In ICDT, volume 186, pages 7:1-7:18, 2021.
  24. Wojciech Rytter. Application of Lempel-Ziv factorization to the approximation of grammar-based compression. In CPM, volume 2373, pages 20-31, 2002.
  25. Markus L. Schmid and Nicole Schweikardt. Spanner evaluation over SLP-compressed documents. In PODS, pages 153-165, 2021.
  26. Markus L. Schmid and Nicole Schweikardt. Query evaluation over SLP-represented document databases with complex document editing. In PODS, pages 79-89, 2022.
  27. Nicole Schweikardt, Luc Segoufin, and Alexandre Vigny. Enumeration for FO queries over nowhere dense graphs. In PODS, pages 151-163, 2018.
  28. James A. Storer and Thomas G. Szymanski. Data compression via textual substitution. J. ACM, 29(4):928-951, 1982.