Constant-Delay Enumeration for Nondeterministic Document Spanners

Amarilli, Antoine; Bourhis, Pierre; Mengel, Stefan; Niewerth, Matthias

doi:10.4230/LIPIcs.ICDT.2019.22

File

LIPIcs.ICDT.2019.22.pdf

Filesize: 0.56 MB
19 pages

Document Identifiers

DOI: 10.4230/LIPIcs.ICDT.2019.22
URN: urn:nbn:de:0030-drops-103246

Author Details

Antoine Amarilli

LTCI, France
Télécom ParisTech, France
Université Paris-Saclay, France

Pierre Bourhis

CNRS, CRIStAL UMR 9189, France
Inria Lille, France

Stefan Mengel

CNRS, France
CRIL UMR 8188, Lens, France

Matthias Niewerth

University of Bayreuth, Germany

Cite AsGet BibTex

Antoine Amarilli, Pierre Bourhis, Stefan Mengel, and Matthias Niewerth. Constant-Delay Enumeration for Nondeterministic Document Spanners. In 22nd International Conference on Database Theory (ICDT 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 127, pp. 22:1-22:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)
https://doi.org/10.4230/LIPIcs.ICDT.2019.22

Abstract

We consider the information extraction framework known as document spanners, and study the problem of efficiently computing the results of the extraction from an input document, where the extraction task is described as a sequential variable-set automaton (VA). We pose this problem in the setting of enumeration algorithms, where we can first run a preprocessing phase and must then produce the results with a small delay between any two consecutive results. Our goal is to have an algorithm which is tractable in combined complexity, i.e., in the sizes of the input document and the VA; while ensuring the best possible data complexity bounds in the input document size, i.e., constant delay in the document size. Several recent works at PODS'18 proposed such algorithms but with linear delay in the document size or with an exponential dependency in size of the (generally nondeterministic) input VA. In particular, Florenzano et al. suggest that our desired runtime guarantees cannot be met for general sequential VAs. We refute this and show that, given a nondeterministic sequential VA and an input document, we can enumerate the mappings of the VA on the document with the following bounds: the preprocessing is linear in the document size and polynomial in the size of the VA, and the delay is independent of the document and polynomial in the size of the VA. The resulting algorithm thus achieves tractability in combined complexity and the best possible data complexity bounds. Moreover, it is rather easy to describe, in particular for the restricted case of so-called extended VAs.

Subject Classification

ACM Subject Classification

Information systems → Information extraction
Theory of computation → Formal languages and automata theory
Theory of computation → Database query processing and optimization (theory)

Keywords

enumeration
spanners
automata

Metrics

Access Statistics
Total Accesses (updated on a weekly basis)

0

PDF Downloads

0

Metadata Views

References

Alfred V. Aho, John E. Hopcroft, and Jeffrey D. Ullman. The design and analysis of computer algorithms. Addison-Wesley, 1974.
Antoine Amarilli, Pierre Bourhis, Louis Jachiet, and Stefan Mengel. A circuit-based approach to efficient enumeration. In http://icalp17.mimuw.edu.pl/, 2017. URL: http://arxiv.org/abs/1702.05589.
Antoine Amarilli, Pierre Bourhis, and Stefan Mengel. Enumeration on trees under relabelings. In ICDT, 2018. URL: http://arxiv.org/abs/1709.06185.
Antoine Amarilli, Pierre Bourhis, Stefan Mengel, and Matthias Niewerth. Enumeration on Trees with Tractable Combined Complexity and Efficient Updates. Under review, 2019. URL: http://arxiv.org/abs/1812.09519.
Marcelo Arenas, Francisco Maturana, Cristian Riveros, and Domagoj Vrgoc. A framework for annotating CSV-like data. PVLDB, 9(11), 2016. URL: http://www.vldb.org/pvldb/vol9/p876-arenas.pdf, URL: http://dx.doi.org/10.14778/2983200.2983204.
Guillaume Bagan. MSO queries on tree decomposable structures are computable with linear delay. In CSL, 2006.
Christoph Berkholz, Jens Keppeler, and Nicole Schweikardt. Answering Conjunctive Queries under Updates. In PODS, 2017. URL: http://arxiv.org/abs/1702.06370.
Christoph Berkholz, Jens Keppeler, and Nicole Schweikardt. Answering FO+MOD Queries Under Updates on Bounded Degree Databases. In ICDT, 2017. URL: http://arxiv.org/abs/1702.08764.
Christoph Berkholz, Jens Keppeler, and Nicole Schweikardt. Answering UCQs under Updates and in the Presence of Integrity Constraints. In ICDT, 2018. URL: http://arxiv.org/abs/1709.10039.
Ronald Fagin, Benny Kimelfeld, Frederick Reiss, and Stijn Vansummeren. Document Spanners: A Formal Approach to Information Extraction. J. ACM, 62(2), 2015. URL: https://pdfs.semanticscholar.org/8df0/ad1c6aa0df93e58071b8afe3371a16a3182f.pdf, URL: http://dx.doi.org/10.1145/2699442.
Fernando Florenzano, Cristian Riveros, Martín Ugarte, Stijn Vansummeren, and Domagoj Vrgoc. Constant Delay Algorithms for Regular Document Spanners. In PODS, 2018. URL: http://arxiv.org/abs/1803.05277.
Dominik D. Freydenberger. A Logic for Document Spanners. Unpublished extended version. URL: http://ddfy.de/sci/splog.pdf.
Dominik D. Freydenberger. A Logic for Document Spanners. In ICDT, 2017. URL: http://drops.dagstuhl.de/opus/volltexte/2017/7049/, URL: http://dx.doi.org/10.4230/LIPIcs.ICDT.2017.13.
Dominik D. Freydenberger and Mario Holldack. Document Spanners: From Expressive Power to Decision Problems. Theory Comput. Syst., 62(4), 2018. URL: http://dx.doi.org/10.1007/s00224-017-9770-0.
Dominik D. Freydenberger, Benny Kimelfeld, and Liat Peterfreund. Joining Extractions of Regular Expressions. In PODS, 2018. URL: http://arxiv.org/abs/1703.10350.
François Le Gall. Improved output-sensitive quantum algorithms for Boolean matrix multiplication. In SODA, 2012. URL: https://pdfs.semanticscholar.org/91a5/dd90ed43a6e8f55f8ec18ceead7dd0a6e988.pdf.
François Le Gall. Powers of tensors and fast matrix multiplication. In ISSAC, 2014. URL: http://arxiv.org/abs/1401.7714.
Étienne Grandjean. Sorting, linear time and the satisfiability problem. Annals of Mathematics and Artificial Intelligence, 16(1), 1996.
Katja Losemann and Wim Martens. MSO queries on trees: Enumerating answers under updates. In CSL-LICS, 2014. URL: http://www.theoinf.uni-bayreuth.de/download/lics14-preprint.pdf.
Arnaud Mary and Yann Strozecki. Efficient Enumeration of Solutions Produced by Closure Operations. In STACS, 2016. URL: http://drops.dagstuhl.de/opus/volltexte/2016/5753/.
Francisco Maturana, Cristian Riveros, and Domagoj Vrgoc. Document Spanners for Extracting Incomplete Information: Expressiveness and Complexity. In PODS, 2018. URL: http://arxiv.org/abs/1707.00827.
Andrea Morciano. Engineering a runtime system for AQL. Master’s thesis, Politecnico di Milano, 2017. URL: https://www.politesi.polimi.it/bitstream/10589/135034/1/2017_07_Morciano.pdf.
Matthias Niewerth. MSO queries on trees: Enumerating answers under updates using forest algebras. In LICS, 2018. URL: http://dx.doi.org/10.1145/3209108.3209144.
Matthias Niewerth and Luc Segoufin. Enumeration of MSO Queries on Strings with Constant Delay and Logarithmic Updates. In PODS, 2018. URL: http://www.di.ens.fr/~segoufin/Papers/Mypapers/enum-update-words.pdf.
Ronald C. Read and Robert E. Tarjan. Bounds on backtrack algorithms for listing cycles, paths, and spanning trees. Networks, 5(3), 1975.
IBM Research. SystemT, 2018. URL: https://researcher.watson.ibm.com/researcher/view_group.php?id=1264.
Luc Segoufin. A glimpse on constant delay enumeration (Invited talk). In STACS, 2014. URL: https://hal.inria.fr/hal-01070893/document.
Shuji Tsukiyama, Mikio Ide, Hiromu Ariyoshi, and I Shirakawa. A New Algorithm for Generating All the Maximal Independent Sets. SIAM J. Comput., 6, September 1977. URL: http://dx.doi.org/10.1137/0206036.
L.G. Valiant. The complexity of computing the permanent. Theoretical Computer Science, 8(2), 1979. URL: https://www.sciencedirect.com/science/article/pii/0304397579900446.
Kunihiro Wasa. Enumeration of enumeration algorithms. CoRR, 2016. URL: http://arxiv.org/abs/1605.05102.

Constant-Delay Enumeration for Nondeterministic Document Spanners

Authors Antoine Amarilli , Pierre Bourhis , Stefan Mengel , Matthias Niewerth

File

Document Identifiers

Author Details

Cite AsGet BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

References

Thanks for your feedback!

Could not send message

Constant-Delay Enumeration for Nondeterministic Document Spanners

Authors Antoine Amarilli , Pierre Bourhis , Stefan Mengel , Matthias Niewerth

File

Document Identifiers

Author Details

Funding

Cite AsGet BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

Related Versions

References

Thanks for your feedback!

Could not send message