Constant-Delay Enumeration for Nondeterministic Document Spanners

ISSN: 1868-8969. Published: 2019-03-19. Pages: 22:1-22:19. Author affiliation IDs: Amarilli 1, 2, 3; Bourhis 4, 5; Mengel 6, 7; Niewerth 8.

<records>

<record>

<publisher>Schloss Dagstuhl – Leibniz-Zentrum für Informatik</publisher>

<journalTitle>Leibniz International Proceedings in Informatics</journalTitle>

<doi>10.4230/LIPIcs.ICDT.2019.22</doi>

<documentType>article</documentType>

<title language="eng">Constant-Delay Enumeration for Nondeterministic Document Spanners</title>

<authors>

<author>

<name>Amarilli, Antoine</name>

<orcid_id>https://orcid.org/0000-0002-7977-4441</orcid_id>

</author>

<author>

<name>Bourhis, Pierre</name>

<orcid_id>https://orcid.org/0000-0001-5699-0320</orcid_id>

</author>

<author>

<name>Mengel, Stefan</name>

<orcid_id>https://orcid.org/0000-0003-1386-8784</orcid_id>

</author>

<author>

<name>Niewerth, Matthias</name>

<orcid_id>https://orcid.org/0000-0003-2032-5374</orcid_id>

</author>

</authors>

<affiliationsList>

<affiliationName affiliationId="1">LTCI, France</affiliationName>

<affiliationName affiliationId="2">Télécom ParisTech, France</affiliationName>

<affiliationName affiliationId="3">Université Paris-Saclay, France</affiliationName>

<affiliationName affiliationId="4">CNRS, CRIStAL UMR 9189, France</affiliationName>

<affiliationName affiliationId="5">Inria Lille, France</affiliationName>

<affiliationName affiliationId="6">CNRS, France</affiliationName>

<affiliationName affiliationId="7">CRIL UMR 8188, Lens, France</affiliationName>

<affiliationName affiliationId="8">University of Bayreuth, Germany</affiliationName>

</affiliationsList>

<abstract language="eng">We consider the information extraction framework known as document spanners, and study the problem of efficiently computing the results of the extraction from an input document, where the extraction task is described as a sequential variable-set automaton (VA). We pose this problem in the setting of enumeration algorithms, where we can first run a preprocessing phase and must then produce the results with a small delay between any two consecutive results. Our goal is to have an algorithm which is tractable in combined complexity, i.e., in the sizes of the input document and the VA, while ensuring the best possible data complexity bounds in the input document size, i.e., constant delay in the document size. Several recent works at PODS'18 proposed such algorithms but with linear delay in the document size or with an exponential dependency on the size of the (generally nondeterministic) input VA. In particular, Florenzano et al. suggest that our desired runtime guarantees cannot be met for general sequential VAs. We refute this and show that, given a nondeterministic sequential VA and an input document, we can enumerate the mappings of the VA on the document with the following bounds: the preprocessing is linear in the document size and polynomial in the size of the VA, and the delay is independent of the document and polynomial in the size of the VA. The resulting algorithm thus achieves tractability in combined complexity and the best possible data complexity bounds. Moreover, it is rather easy to describe, in particular for the restricted case of so-called extended VAs.</abstract>

<fullTextUrl format="pdf">https://drops.dagstuhl.de/storage/00lipics/lipics-vol127-icdt2019/LIPIcs.ICDT.2019.22/LIPIcs.ICDT.2019.22.pdf</fullTextUrl>

<keywords>

<keyword>enumeration</keyword>

<keyword>spanners</keyword>

<keyword>automata</keyword>

</keywords>

</record>

</records>