Constant-Delay Enumeration for Nondeterministic Document Spanners

Authors Antoine Amarilli , Pierre Bourhis , Stefan Mengel , Matthias Niewerth



PDF
Thumbnail PDF

File

LIPIcs.ICDT.2019.22.pdf
  • Filesize: 0.56 MB
  • 19 pages

Document Identifiers

Author Details

Antoine Amarilli
  • LTCI, France
  • Télécom ParisTech, France
  • Université Paris-Saclay, France
Pierre Bourhis
  • CNRS, CRIStAL UMR 9189, France
  • Inria Lille, France
Stefan Mengel
  • CNRS, France
  • CRIL UMR 8188, Lens, France
Matthias Niewerth
  • University of Bayreuth, Germany

Cite AsGet BibTex

Antoine Amarilli, Pierre Bourhis, Stefan Mengel, and Matthias Niewerth. Constant-Delay Enumeration for Nondeterministic Document Spanners. In 22nd International Conference on Database Theory (ICDT 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 127, pp. 22:1-22:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)
https://doi.org/10.4230/LIPIcs.ICDT.2019.22

Abstract

We consider the information extraction framework known as document spanners, and study the problem of efficiently computing the results of the extraction from an input document, where the extraction task is described as a sequential variable-set automaton (VA). We pose this problem in the setting of enumeration algorithms, where we can first run a preprocessing phase and must then produce the results with a small delay between any two consecutive results. Our goal is to have an algorithm which is tractable in combined complexity, i.e., in the sizes of the input document and the VA; while ensuring the best possible data complexity bounds in the input document size, i.e., constant delay in the document size. Several recent works at PODS'18 proposed such algorithms but with linear delay in the document size or with an exponential dependency in size of the (generally nondeterministic) input VA. In particular, Florenzano et al. suggest that our desired runtime guarantees cannot be met for general sequential VAs. We refute this and show that, given a nondeterministic sequential VA and an input document, we can enumerate the mappings of the VA on the document with the following bounds: the preprocessing is linear in the document size and polynomial in the size of the VA, and the delay is independent of the document and polynomial in the size of the VA. The resulting algorithm thus achieves tractability in combined complexity and the best possible data complexity bounds. Moreover, it is rather easy to describe, in particular for the restricted case of so-called extended VAs.

Subject Classification

ACM Subject Classification
  • Information systems → Information extraction
  • Theory of computation → Formal languages and automata theory
  • Theory of computation → Database query processing and optimization (theory)
Keywords
  • enumeration
  • spanners
  • automata

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Alfred V. Aho, John E. Hopcroft, and Jeffrey D. Ullman. The design and analysis of computer algorithms. Addison-Wesley, 1974. Google Scholar
  2. Antoine Amarilli, Pierre Bourhis, Louis Jachiet, and Stefan Mengel. A circuit-based approach to efficient enumeration. In http://icalp17.mimuw.edu.pl/, 2017. URL: http://arxiv.org/abs/1702.05589.
  3. Antoine Amarilli, Pierre Bourhis, and Stefan Mengel. Enumeration on trees under relabelings. In ICDT, 2018. URL: http://arxiv.org/abs/1709.06185.
  4. Antoine Amarilli, Pierre Bourhis, Stefan Mengel, and Matthias Niewerth. Enumeration on Trees with Tractable Combined Complexity and Efficient Updates. Under review, 2019. URL: http://arxiv.org/abs/1812.09519.
  5. Marcelo Arenas, Francisco Maturana, Cristian Riveros, and Domagoj Vrgoc. A framework for annotating CSV-like data. PVLDB, 9(11), 2016. URL: http://www.vldb.org/pvldb/vol9/p876-arenas.pdf, URL: http://dx.doi.org/10.14778/2983200.2983204.
  6. Guillaume Bagan. MSO queries on tree decomposable structures are computable with linear delay. In CSL, 2006. Google Scholar
  7. Christoph Berkholz, Jens Keppeler, and Nicole Schweikardt. Answering Conjunctive Queries under Updates. In PODS, 2017. URL: http://arxiv.org/abs/1702.06370.
  8. Christoph Berkholz, Jens Keppeler, and Nicole Schweikardt. Answering FO+MOD Queries Under Updates on Bounded Degree Databases. In ICDT, 2017. URL: http://arxiv.org/abs/1702.08764.
  9. Christoph Berkholz, Jens Keppeler, and Nicole Schweikardt. Answering UCQs under Updates and in the Presence of Integrity Constraints. In ICDT, 2018. URL: http://arxiv.org/abs/1709.10039.
  10. Ronald Fagin, Benny Kimelfeld, Frederick Reiss, and Stijn Vansummeren. Document Spanners: A Formal Approach to Information Extraction. J. ACM, 62(2), 2015. URL: https://pdfs.semanticscholar.org/8df0/ad1c6aa0df93e58071b8afe3371a16a3182f.pdf, URL: http://dx.doi.org/10.1145/2699442.
  11. Fernando Florenzano, Cristian Riveros, Martín Ugarte, Stijn Vansummeren, and Domagoj Vrgoc. Constant Delay Algorithms for Regular Document Spanners. In PODS, 2018. URL: http://arxiv.org/abs/1803.05277.
  12. Dominik D. Freydenberger. A Logic for Document Spanners. Unpublished extended version. URL: http://ddfy.de/sci/splog.pdf.
  13. Dominik D. Freydenberger. A Logic for Document Spanners. In ICDT, 2017. URL: http://drops.dagstuhl.de/opus/volltexte/2017/7049/, URL: http://dx.doi.org/10.4230/LIPIcs.ICDT.2017.13.
  14. Dominik D. Freydenberger and Mario Holldack. Document Spanners: From Expressive Power to Decision Problems. Theory Comput. Syst., 62(4), 2018. URL: http://dx.doi.org/10.1007/s00224-017-9770-0.
  15. Dominik D. Freydenberger, Benny Kimelfeld, and Liat Peterfreund. Joining Extractions of Regular Expressions. In PODS, 2018. URL: http://arxiv.org/abs/1703.10350.
  16. François Le Gall. Improved output-sensitive quantum algorithms for Boolean matrix multiplication. In SODA, 2012. URL: https://pdfs.semanticscholar.org/91a5/dd90ed43a6e8f55f8ec18ceead7dd0a6e988.pdf.
  17. François Le Gall. Powers of tensors and fast matrix multiplication. In ISSAC, 2014. URL: http://arxiv.org/abs/1401.7714.
  18. Étienne Grandjean. Sorting, linear time and the satisfiability problem. Annals of Mathematics and Artificial Intelligence, 16(1), 1996. Google Scholar
  19. Katja Losemann and Wim Martens. MSO queries on trees: Enumerating answers under updates. In CSL-LICS, 2014. URL: http://www.theoinf.uni-bayreuth.de/download/lics14-preprint.pdf.
  20. Arnaud Mary and Yann Strozecki. Efficient Enumeration of Solutions Produced by Closure Operations. In STACS, 2016. URL: http://drops.dagstuhl.de/opus/volltexte/2016/5753/.
  21. Francisco Maturana, Cristian Riveros, and Domagoj Vrgoc. Document Spanners for Extracting Incomplete Information: Expressiveness and Complexity. In PODS, 2018. URL: http://arxiv.org/abs/1707.00827.
  22. Andrea Morciano. Engineering a runtime system for AQL. Master’s thesis, Politecnico di Milano, 2017. URL: https://www.politesi.polimi.it/bitstream/10589/135034/1/2017_07_Morciano.pdf.
  23. Matthias Niewerth. MSO queries on trees: Enumerating answers under updates using forest algebras. In LICS, 2018. URL: http://dx.doi.org/10.1145/3209108.3209144.
  24. Matthias Niewerth and Luc Segoufin. Enumeration of MSO Queries on Strings with Constant Delay and Logarithmic Updates. In PODS, 2018. URL: http://www.di.ens.fr/~segoufin/Papers/Mypapers/enum-update-words.pdf.
  25. Ronald C. Read and Robert E. Tarjan. Bounds on backtrack algorithms for listing cycles, paths, and spanning trees. Networks, 5(3), 1975. Google Scholar
  26. IBM Research. SystemT, 2018. URL: https://researcher.watson.ibm.com/researcher/view_group.php?id=1264.
  27. Luc Segoufin. A glimpse on constant delay enumeration (Invited talk). In STACS, 2014. URL: https://hal.inria.fr/hal-01070893/document.
  28. Shuji Tsukiyama, Mikio Ide, Hiromu Ariyoshi, and I Shirakawa. A New Algorithm for Generating All the Maximal Independent Sets. SIAM J. Comput., 6, September 1977. URL: http://dx.doi.org/10.1137/0206036.
  29. L.G. Valiant. The complexity of computing the permanent. Theoretical Computer Science, 8(2), 1979. URL: https://www.sciencedirect.com/science/article/pii/0304397579900446.
  30. Kunihiro Wasa. Enumeration of enumeration algorithms. CoRR, 2016. URL: http://arxiv.org/abs/1605.05102.
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail