DROPS

Document

DOI: 10.4230/LIPIcs.STACS.2026.77

An Improved Version of Hmelevskii’s Theorem on Three-Variable Word Equations

Authors: Aleksi Saarela

Published in: LIPIcs, Volume 364, 43rd International Symposium on Theoretical Aspects of Computer Science (STACS 2026)

Abstract

Hmelevskii proved in 1971 that every constant-free three-variable word equation has a parametric solution. We prove an improved version of this result by showing that every such equation has a parametric solution using only three numerical parameters and with only two levels of nesting. This means that the structure of the solution sets of these equations is considerably simpler than has been known before.

Cite as

Aleksi Saarela. An Improved Version of Hmelevskii’s Theorem on Three-Variable Word Equations. In 43rd International Symposium on Theoretical Aspects of Computer Science (STACS 2026). Leibniz International Proceedings in Informatics (LIPIcs), Volume 364, pp. 77:1-77:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2026)

Copy BibTex To Clipboard

@InProceedings{saarela:LIPIcs.STACS.2026.77,
  author =	{Saarela, Aleksi},
  title =	{{An Improved Version of Hmelevskii’s Theorem on Three-Variable Word Equations}},
  booktitle =	{43rd International Symposium on Theoretical Aspects of Computer Science (STACS 2026)},
  pages =	{77:1--77:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-412-3},
  ISSN =	{1868-8969},
  year =	{2026},
  volume =	{364},
  editor =	{Mahajan, Meena and Manea, Florin and McIver, Annabelle and Thắng, Nguy\~{ê}n Kim},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.STACS.2026.77},
  URN =		{urn:nbn:de:0030-drops-255664},
  doi =		{10.4230/LIPIcs.STACS.2026.77},
  annote =	{Keywords: Combinatorics on words, word equation, parametric word}
}

Document

DOI: 10.4230/LIPIcs.MFCS.2025.9

Linear Time Subsequence and Supersequence Regex Matching

Authors: Antoine Amarilli, Florin Manea, Tina Ringleb, and Markus L. Schmid

Published in: LIPIcs, Volume 345, 50th International Symposium on Mathematical Foundations of Computer Science (MFCS 2025)

Abstract

It is well-known that checking whether a given string w matches a given regular expression r can be done in quadratic time O(|w|⋅ |r|) and that this cannot be improved to a truly subquadratic running time of O((|w|⋅ |r|)^{1-ε}) assuming the strong exponential time hypothesis (SETH). We study a different matching paradigm where we ask instead whether w has a subsequence that matches r, and show that regex matching in this sense can be solved in linear time O(|w| + |r|). Further, the same holds if we ask for a supersequence. We show that the quantitative variants where we want to compute a longest or shortest subsequence or supersequence of w that matches r can be solved in O(|w|⋅ |r|), i. e., asymptotically no worse than classical regex matching; and we show that O(|w| + |r|) is conditionally not possible for these problems. We also investigate these questions with respect to other natural string relations like the infix, prefix, left-extension or extension relation instead of the subsequence and supersequence relation. We further study the complexity of the universal problem where we ask if all subsequences (or supersequences, infixes, prefixes, left-extensions or extensions) of an input string satisfy a given regular expression.

Cite as

Antoine Amarilli, Florin Manea, Tina Ringleb, and Markus L. Schmid. Linear Time Subsequence and Supersequence Regex Matching. In 50th International Symposium on Mathematical Foundations of Computer Science (MFCS 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 345, pp. 9:1-9:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{amarilli_et_al:LIPIcs.MFCS.2025.9,
  author =	{Amarilli, Antoine and Manea, Florin and Ringleb, Tina and Schmid, Markus L.},
  title =	{{Linear Time Subsequence and Supersequence Regex Matching}},
  booktitle =	{50th International Symposium on Mathematical Foundations of Computer Science (MFCS 2025)},
  pages =	{9:1--9:19},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-388-1},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{345},
  editor =	{Gawrychowski, Pawe{\l} and Mazowiecki, Filip and Skrzypczak, Micha{\l}},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.MFCS.2025.9},
  URN =		{urn:nbn:de:0030-drops-241162},
  doi =		{10.4230/LIPIcs.MFCS.2025.9},
  annote =	{Keywords: subsequence, supersequence, regular language, regular expression, automata}
}

Document

DOI: 10.4230/LIPIcs.MFCS.2025.81

Efficient Matching of Some Fundamental Regular Expressions with Backreferences

Authors: Taisei Nogami and Tachio Terauchi

Published in: LIPIcs, Volume 345, 50th International Symposium on Mathematical Foundations of Computer Science (MFCS 2025)

Abstract

Regular expression matching is of practical importance due to its widespread use in real-world applications. In practical use, regular expressions are often used with real-world extensions. Accordingly, the matching problem of regular expressions with real-world extensions has been actively studied in recent years, yielding steady progress. However, backreference, a popular extension supported by most modern programming languages such as Java, Python, JavaScript and others in their standard libraries for string processing, is an exception to this positive trend. In fact, it is known that the matching problem of regular expressions with backreferences (rewbs) is theoretically hard and the existence of an asymptotically fast matching algorithm for arbitrary rewbs seems unlikely. Even among currently known partial solutions, the balance between efficiency and generality remains unsatisfactory. To bridge this gap, we present an efficient matching algorithm for rewbs of the form e_0 (e)_1 e_1 \1 e_2 where e_0, e, e_1, e_2 are pure regular expressions, which are fundamental and frequently used in practical applications. It runs in quadratic time with respect to the input string length, substantially improving the best-known cubic time complexity for these rewbs. Our algorithm combines ideas from both stringology and automata theory in a novel way. We leverage two techniques from automata theory, injection and summarization, to simultaneously examine matches whose backreferenced substrings are either a fixed right-maximal repeat or its extendable prefixes, which are concepts from stringology. By further utilizing a subtle property of extendable prefixes, our algorithm correctly decides the matching problem while achieving the quadratic-time complexity.

Cite as

Taisei Nogami and Tachio Terauchi. Efficient Matching of Some Fundamental Regular Expressions with Backreferences. In 50th International Symposium on Mathematical Foundations of Computer Science (MFCS 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 345, pp. 81:1-81:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{nogami_et_al:LIPIcs.MFCS.2025.81,
  author =	{Nogami, Taisei and Terauchi, Tachio},
  title =	{{Efficient Matching of Some Fundamental Regular Expressions with Backreferences}},
  booktitle =	{50th International Symposium on Mathematical Foundations of Computer Science (MFCS 2025)},
  pages =	{81:1--81:19},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-388-1},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{345},
  editor =	{Gawrychowski, Pawe{\l} and Mazowiecki, Filip and Skrzypczak, Micha{\l}},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.MFCS.2025.81},
  URN =		{urn:nbn:de:0030-drops-241886},
  doi =		{10.4230/LIPIcs.MFCS.2025.81},
  annote =	{Keywords: Regular expressions, Backreferences, Regex matching, NFA simulation, Suffix arrays, Right-maximal repeats}
}

Document

DOI: 10.4230/LIPIcs.CPM.2025.4

The Equivalence Problem of E-Pattern Languages with Length Constraints Is Undecidable

Authors: Dirk Nowotka and Max Wiedenhöft

Published in: LIPIcs, Volume 331, 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)

Abstract

Patterns are words with terminals and variables. The language of a pattern is the set of words obtained by uniformly substituting all variables with words that contain only terminals. Length constraints restrict valid substitutions of variables by associating the variables of a pattern with a system (or disjunction of systems) of linear diophantine inequalities. Pattern languages with length constraints contain only words in which all variables are substituted to words with lengths that fulfill such a given set of length constraints. We consider membership, inclusion, and equivalence problems for erasing and non-erasing pattern languages with length constraints. Our main result shows that the erasing equivalence problem - one of the most prominent open problems in the realm of patterns - becomes undecidable if length constraints are allowed in addition to variable equality. Additionally, it is shown that the terminal-free inclusion problem, a prominent problem which has been shown to be undecidable in the binary case for patterns without any constraints, is also generally undecidable for all larger alphabets in this setting. Finally, we also show that considering regular constraints, i.e., associating variables also with regular languages as additional restrictions together with length constraints for valid substitutions, results in undecidability of the non-erasing equivalence problem. This sets a first upper bound on constraints to obtain undecidability in this case, as this problem is trivially decidable in the case of no constraints and as it has unknown decidability if only regular or only length constraints are considered.

Cite as

Dirk Nowotka and Max Wiedenhöft. The Equivalence Problem of E-Pattern Languages with Length Constraints Is Undecidable. In 36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 331, pp. 4:1-4:23, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{nowotka_et_al:LIPIcs.CPM.2025.4,
  author =	{Nowotka, Dirk and Wiedenh\"{o}ft, Max},
  title =	{{The Equivalence Problem of E-Pattern Languages with Length Constraints Is Undecidable}},
  booktitle =	{36th Annual Symposium on Combinatorial Pattern Matching (CPM 2025)},
  pages =	{4:1--4:23},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-369-0},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{331},
  editor =	{Bonizzoni, Paola and M\"{a}kinen, Veli},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2025.4},
  URN =		{urn:nbn:de:0030-drops-230988},
  doi =		{10.4230/LIPIcs.CPM.2025.4},
  annote =	{Keywords: Patterns, Pattern Languages, Length Constraints, Regular Constraints, Decidability, Undecidability, Membership, Inclusion, Equivalence}
}

Document

DOI: 10.4230/LIPIcs.ICDT.2025.18

A Framework for Extraction and Transformation of Documents

Authors: Cristian Riveros, Markus L. Schmid, and Nicole Schweikardt

Published in: LIPIcs, Volume 328, 28th International Conference on Database Theory (ICDT 2025)

Abstract

We present a theoretical framework for the extraction and transformation of text documents as a two-phase process: The first phase uses document spanners to extract information from the input document. The second phase transforms the extracted information into a suitable output. To support several reasonable extract-transform scenarios, we propose for the first phase an extension of document spanners from span-tuples to so-called multispan-tuples, where variables are mapped to sets of spans instead of only single spans. We focus on multispanners described by regex formulas, and we prove that these have the same desirable properties as standard regular spanners. To formalize the second phase, we consider transformations that map every pair document-tuple, where each tuple comes from the (multi)span-relation extracted in the first phase, into a new output document. The specification of the two phases is what we call an extract-transform (ET) program, which covers practically relevant extract-transform tasks. In this paper, our main technical goal is to identify a broad class of ET programs that can be evaluated efficiently. We specifically focus on the scenario of regular ET programs: the extraction phase is given by a regex multispanner and the transformation phase is given by a regular string-to-string function. We show that for any regular ET program, given an input document, we can enumerate all final output documents with output-linear delay after linear preprocessing. As a side effect, we characterize the expressive power of regular ET programs and also show that they have desirable properties, like being closed under composition.

Cite as

Cristian Riveros, Markus L. Schmid, and Nicole Schweikardt. A Framework for Extraction and Transformation of Documents. In 28th International Conference on Database Theory (ICDT 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 328, pp. 18:1-18:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{riveros_et_al:LIPIcs.ICDT.2025.18,
  author =	{Riveros, Cristian and Schmid, Markus L. and Schweikardt, Nicole},
  title =	{{A Framework for Extraction and Transformation of Documents}},
  booktitle =	{28th International Conference on Database Theory (ICDT 2025)},
  pages =	{18:1--18:19},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-364-5},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{328},
  editor =	{Roy, Sudeepa and Kara, Ahmet},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2025.18},
  URN =		{urn:nbn:de:0030-drops-229593},
  doi =		{10.4230/LIPIcs.ICDT.2025.18},
  annote =	{Keywords: Information extraction, Document spanners, Transducers, Query evaluation}
}

Document

DOI: 10.4230/LIPIcs.ICDT.2025.29

FC-Datalog as a Framework for Efficient String Querying

Authors: Owen M. Bell, Joel D. Day, and Dominik D. Freydenberger

Published in: LIPIcs, Volume 328, 28th International Conference on Database Theory (ICDT 2025)

Abstract

Core spanners are a class of document spanners that capture the core functionality of IBM’s AQL. FC is a logic on strings built around word equations that when extended with constraints for regular languages can be seen as a logic for core spanners. The recently introduced FC-Datalog extends FC with recursion, which allows us to define recursive relations for core spanners. Additionally, as FC-Datalog captures 𝖯, it is also a tractable version of Datalog on strings. This presents an opportunity for optimization. We propose a series of FC-Datalog fragments with desirable properties in terms of complexity of model checking, expressive power, and efficiency of checking membership in the fragment. This leads to a range of fragments that all capture LOGSPACE, which we further restrict to obtain linear combined complexity. This gives us a framework to tailor fragments for particular applications. To showcase this, we simulate deterministic regex in a tailored fragment of FC-Datalog.

Cite as

Owen M. Bell, Joel D. Day, and Dominik D. Freydenberger. FC-Datalog as a Framework for Efficient String Querying. In 28th International Conference on Database Theory (ICDT 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 328, pp. 29:1-29:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025)

Copy BibTex To Clipboard

@InProceedings{bell_et_al:LIPIcs.ICDT.2025.29,
  author =	{Bell, Owen M. and Day, Joel D. and Freydenberger, Dominik D.},
  title =	{{FC-Datalog as a Framework for Efficient String Querying}},
  booktitle =	{28th International Conference on Database Theory (ICDT 2025)},
  pages =	{29:1--29:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-364-5},
  ISSN =	{1868-8969},
  year =	{2025},
  volume =	{328},
  editor =	{Roy, Sudeepa and Kara, Ahmet},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2025.29},
  URN =		{urn:nbn:de:0030-drops-229708},
  doi =		{10.4230/LIPIcs.ICDT.2025.29},
  annote =	{Keywords: Information extraction, word equations, datalog, document spanners, regex}
}

Document

DOI: 10.4230/LIPIcs.ICDT.2022.10

Splitting Spanner Atoms: A Tool for Acyclic Core Spanners

Authors: Dominik D. Freydenberger and Sam M. Thompson

Published in: LIPIcs, Volume 220, 25th International Conference on Database Theory (ICDT 2022)

Abstract

This paper investigates regex CQs with string equalities (SERCQs), a subclass of core spanners. As shown by Freydenberger, Kimelfeld, and Peterfreund (PODS 2018), these queries are intractable, even if restricted to acyclic queries. This previous result defines acyclicity by treating regex formulas as atoms. In contrast to this, we propose an alternative definition by converting SERCQs into FC-CQs - conjunctive queries in FC, a logic that is based on word equations. We introduce a way to decompose word equations of unbounded arity into a conjunction of binary word equations. If the result of the decomposition is acyclic, then evaluation and enumeration of results become tractable. The main result of this work is an algorithm that decides in polynomial time whether an FC-CQ can be decomposed into an acyclic FC-CQ. We also give an efficient conversion from synchronized SERCQs to FC-CQs with regular constraints. As a consequence, tractability results for acyclic relational CQs directly translate to a large class of SERCQs.

Cite as

Dominik D. Freydenberger and Sam M. Thompson. Splitting Spanner Atoms: A Tool for Acyclic Core Spanners. In 25th International Conference on Database Theory (ICDT 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 220, pp. 10:1-10:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)

Copy BibTex To Clipboard

@InProceedings{freydenberger_et_al:LIPIcs.ICDT.2022.10,
  author =	{Freydenberger, Dominik D. and Thompson, Sam M.},
  title =	{{Splitting Spanner Atoms: A Tool for Acyclic Core Spanners}},
  booktitle =	{25th International Conference on Database Theory (ICDT 2022)},
  pages =	{10:1--10:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-223-5},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{220},
  editor =	{Olteanu, Dan and Vortmeier, Nils},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2022.10},
  URN =		{urn:nbn:de:0030-drops-158843},
  doi =		{10.4230/LIPIcs.ICDT.2022.10},
  annote =	{Keywords: Document spanners, information extraction, conjunctive queries}
}

Document

Track B: Automata, Logic, Semantics, and Theory of Programming

DOI: 10.4230/LIPIcs.ICALP.2021.130

The Theory of Concatenation over Finite Models

Authors: Dominik D. Freydenberger and Liat Peterfreund

Published in: LIPIcs, Volume 198, 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021)

Abstract

We propose FC, a new logic on words that combines finite model theory with the theory of concatenation - a first-order logic that is based on word equations. Like the theory of concatenation, FC is built around word equations; in contrast to it, its semantics are defined to only allow finite models, by limiting the universe to a word and all its factors. As a consequence of this, FC has many of the desirable properties of FO on finite models, while being far more expressive than FO[<]. Most noteworthy among these desirable properties are sufficient criteria for efficient model checking, and capturing various complexity classes by adding operators for transitive closures or fixed points. Not only does FC allow us to obtain new insights and techniques for expressive power and efficient evaluation of document spanners, but it also provides a general framework for logic on words that also has potential applications in other areas.

Cite as

Dominik D. Freydenberger and Liat Peterfreund. The Theory of Concatenation over Finite Models. In 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 198, pp. 130:1-130:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{freydenberger_et_al:LIPIcs.ICALP.2021.130,
  author =	{Freydenberger, Dominik D. and Peterfreund, Liat},
  title =	{{The Theory of Concatenation over Finite Models}},
  booktitle =	{48th International Colloquium on Automata, Languages, and Programming (ICALP 2021)},
  pages =	{130:1--130:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-195-5},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{198},
  editor =	{Bansal, Nikhil and Merelli, Emanuela and Worrell, James},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2021.130},
  URN =		{urn:nbn:de:0030-drops-141997},
  doi =		{10.4230/LIPIcs.ICALP.2021.130},
  annote =	{Keywords: finite model theory, word equations, descriptive complexity, model checking, document spanners}
}

Document

DOI: 10.4230/LIPIcs.ICDT.2021.7

Grammars for Document Spanners

Authors: Liat Peterfreund

Published in: LIPIcs, Volume 186, 24th International Conference on Database Theory (ICDT 2021)

Abstract

We propose a new grammar-based language for defining information-extractors from documents (text) that is built upon the well-studied framework of document spanners for extracting structured data from text. While previously studied formalisms for document spanners are mainly based on regular expressions, we use an extension of context-free grammars, called {extraction grammars}, to define the new class of context-free spanners. Extraction grammars are simply context-free grammars extended with variables that capture interval positions of the document, namely spans. While regular expressions are efficient for tokenizing and tagging, context-free grammars are also efficient for capturing structural properties. Indeed, we show that context-free spanners are strictly more expressive than their regular counterparts. We reason about the expressive power of our new class and present a pushdown-automata model that captures it. We show that extraction grammars can be evaluated with polynomial data complexity. Nevertheless, as the degree of the polynomial depends on the query, we present an enumeration algorithm for unambiguous extraction grammars that, after quintic preprocessing, outputs the results sequentially, without repetitions, with a constant delay between every two consecutive ones.

Cite as

Liat Peterfreund. Grammars for Document Spanners. In 24th International Conference on Database Theory (ICDT 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 186, pp. 7:1-7:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{peterfreund:LIPIcs.ICDT.2021.7,
  author =	{Peterfreund, Liat},
  title =	{{Grammars for Document Spanners}},
  booktitle =	{24th International Conference on Database Theory (ICDT 2021)},
  pages =	{7:1--7:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-179-5},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{186},
  editor =	{Yi, Ke and Wei, Zhewei},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2021.7},
  URN =		{urn:nbn:de:0030-drops-137154},
  doi =		{10.4230/LIPIcs.ICDT.2021.7},
  annote =	{Keywords: Information Extraction, Document Spanners, Context-Free Grammars, Constant-Delay Enumeration, Regular Expressions, Pushdown Automata}
}

Document

DOI: 10.4230/LIPIcs.ICDT.2020.11

Dynamic Complexity of Document Spanners

Authors: Dominik D. Freydenberger and Sam M. Thompson

Published in: LIPIcs, Volume 155, 23rd International Conference on Database Theory (ICDT 2020)

Abstract

The present paper investigates the dynamic complexity of document spanners, a formal framework for information extraction introduced by Fagin, Kimelfeld, Reiss, and Vansummeren (JACM 2015). We first look at the class of regular spanners and prove that any regular spanner can be maintained in the dynamic complexity class DynPROP. This result follows from work done previously on the dynamic complexity of formal languages by Gelade, Marquardt, and Schwentick (TOCL 2012). To investigate core spanners we use SpLog, a concatenation logic that exactly captures core spanners. We show that the dynamic complexity class DynCQ is more expressive than SpLog and therefore can maintain any core spanner. This result is then extended to show that DynFO can maintain any generalized core spanner and that DynFO is more powerful than SpLog with negation.

Cite as

Dominik D. Freydenberger and Sam M. Thompson. Dynamic Complexity of Document Spanners. In 23rd International Conference on Database Theory (ICDT 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 155, pp. 11:1-11:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

Copy BibTex To Clipboard

@InProceedings{freydenberger_et_al:LIPIcs.ICDT.2020.11,
  author =	{Freydenberger, Dominik D. and Thompson, Sam M.},
  title =	{{Dynamic Complexity of Document Spanners}},
  booktitle =	{23rd International Conference on Database Theory (ICDT 2020)},
  pages =	{11:1--11:21},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-139-9},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{155},
  editor =	{Lutz, Carsten and Jung, Jean Christoph},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2020.11},
  URN =		{urn:nbn:de:0030-drops-119355},
  doi =		{10.4230/LIPIcs.ICDT.2020.11},
  annote =	{Keywords: Document spanners, information extraction, dynamic complexity, descriptive complexity, word equations}
}

Document

DOI: 10.4230/LIPIcs.ICDT.2017.13

A Logic for Document Spanners

Authors: Dominik D. Freydenberger

Published in: LIPIcs, Volume 68, 20th International Conference on Database Theory (ICDT 2017)

Abstract

Document spanners are a formal framework for information extraction that was introduced by [Fagin, Kimelfeld, Reiss, and Vansummeren, J.ACM, 2015]. One of the central models in this framework are core spanners, which are based on regular expressions with variables that are then extended with an algebra. As shown by [Freydenberger and Holldack, ICDT, 2016], there is a connection between core spanners and EC^{reg}, the existential theory of concatenation with regular constraints. The present paper further develops this connection by defining SpLog, a fragment of EC^{reg} that has the same expressive power as core spanners. This equivalence extends beyond equivalence of expressive power, as we show the existence of polynomial time conversions between this fragment and core spanners. This even holds for variants of core spanners that are based on automata instead of regular expressions. Applications of this approach include an alternative way of defining relations for spanners, insights into the relative succinctness of various classes of spanner representations, and a pumping lemma for core spanners.

Cite as

Dominik D. Freydenberger. A Logic for Document Spanners. In 20th International Conference on Database Theory (ICDT 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 68, pp. 13:1-13:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

Copy BibTex To Clipboard

@InProceedings{freydenberger:LIPIcs.ICDT.2017.13,
  author =	{Freydenberger, Dominik D.},
  title =	{{A Logic for Document Spanners}},
  booktitle =	{20th International Conference on Database Theory (ICDT 2017)},
  pages =	{13:1--13:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-024-8},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{68},
  editor =	{Benedikt, Michael and Orsi, Giorgio},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2017.13},
  URN =		{urn:nbn:de:0030-drops-70493},
  doi =		{10.4230/LIPIcs.ICDT.2017.13},
  annote =	{Keywords: information extraction, document spanners, word equations, regex, descriptional complexity}
}

Document

DOI: 10.4230/LIPIcs.STACS.2017.33

Deterministic Regular Expressions with Back-References

Authors: Dominik D. Freydenberger and Markus L. Schmid

Published in: LIPIcs, Volume 66, 34th Symposium on Theoretical Aspects of Computer Science (STACS 2017)

Abstract

Most modern libraries for regular expression matching allow back-references (i.e. repetition operators) that substantially increase expressive power, but also lead to intractability. In order to find a better balance between expressiveness and tractability, we combine these with the notion of determinism for regular expressions used in XML DTDs and XML Schema. This includes the definition of a suitable automaton model, and a generalization of the Glushkov construction.

Cite as

Dominik D. Freydenberger and Markus L. Schmid. Deterministic Regular Expressions with Back-References. In 34th Symposium on Theoretical Aspects of Computer Science (STACS 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 66, pp. 33:1-33:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

Copy BibTex To Clipboard

@InProceedings{freydenberger_et_al:LIPIcs.STACS.2017.33,
  author =	{Freydenberger, Dominik D. and Schmid, Markus L.},
  title =	{{Deterministic Regular Expressions with Back-References}},
  booktitle =	{34th Symposium on Theoretical Aspects of Computer Science (STACS 2017)},
  pages =	{33:1--33:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-028-6},
  ISSN =	{1868-8969},
  year =	{2017},
  volume =	{66},
  editor =	{Vollmer, Heribert and Vall\'{e}e, Brigitte},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.STACS.2017.33},
  URN =		{urn:nbn:de:0030-drops-70004},
  doi =		{10.4230/LIPIcs.STACS.2017.33},
  annote =	{Keywords: Deterministic Regular Expression, Regex, Glushkov Automaton}
}

Document

DOI: 10.4230/LIPIcs.ICDT.2016.17

Document Spanners: From Expressive Power to Decision Problems

Authors: Dominik D. Freydenberger and Mario Holldack

Published in: LIPIcs, Volume 48, 19th International Conference on Database Theory (ICDT 2016)

Abstract

We examine document spanners, a formal framework for information extraction that was introduced by Fagin et al. (PODS 2013). A document spanner is a function that maps an input string to a relation over spans (intervals of positions of the string). We focus on document spanners that are defined by regex formulas, which are basically regular expressions that map matched subexpressions to corresponding spans, and on core spanners, which extend the former by standard algebraic operators and string equality selection. First, we compare the expressive power of core spanners to three models - namely, patterns, word equations, and a rich and natural subclass of extended regular expressions (regular expressions with a repetition operator). These results are then used to analyze the complexity of query evaluation and various aspects of static analysis of core spanners. Finally, we examine the relative succinctness of different kinds of representations of core spanners and relate this to the simplification of core spanners that are extended with difference operators.

Cite as

Dominik D. Freydenberger and Mario Holldack. Document Spanners: From Expressive Power to Decision Problems. In 19th International Conference on Database Theory (ICDT 2016). Leibniz International Proceedings in Informatics (LIPIcs), Volume 48, pp. 17:1-17:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2016)

Copy BibTex To Clipboard

@InProceedings{freydenberger_et_al:LIPIcs.ICDT.2016.17,
  author =	{Freydenberger, Dominik D. and Holldack, Mario},
  title =	{{Document Spanners: From Expressive Power to Decision Problems}},
  booktitle =	{19th International Conference on Database Theory (ICDT 2016)},
  pages =	{17:1--17:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-002-6},
  ISSN =	{1868-8969},
  year =	{2016},
  volume =	{48},
  editor =	{Martens, Wim and Zeume, Thomas},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICDT.2016.17},
  URN =		{urn:nbn:de:0030-drops-57867},
  doi =		{10.4230/LIPIcs.ICDT.2016.17},
  annote =	{Keywords: Information extraction, document spanners, regular expressions, regex, patterns, word equations, decision problems, descriptional complexity}
}

Document

DOI: 10.4230/LIPIcs.STACS.2011.213

Weakly Unambiguous Morphisms

Authors: Dominik D. Freydenberger, Hossein Nevisi, and Daniel Reidenbach

Published in: LIPIcs, Volume 9, 28th International Symposium on Theoretical Aspects of Computer Science (STACS 2011)

Abstract

A nonerasing morphism sigma is said to be weakly unambiguous with respect to a word w if sigma is the only nonerasing morphism that can map w to sigma(w), i.e., there does not exist any other nonerasing morphism tau satisfying tau(w) = sigma(w). In the present paper, we wish to characterise those words with respect to which there exists such a morphism. This question is nontrivial if we consider so-called length-increasing morphisms, which map a word to an image that is strictly longer than the word. Our main result is a compact characterisation that holds for all morphisms with ternary or larger target alphabets. We also comprehensively describe those words that have a weakly unambiguous length-increasing morphism with a unary target alphabet, but we have to leave the problem open for binary alphabets, where we can merely give some non-characteristic conditions.

Cite as

Dominik D. Freydenberger, Hossein Nevisi, and Daniel Reidenbach. Weakly Unambiguous Morphisms. In 28th International Symposium on Theoretical Aspects of Computer Science (STACS 2011). Leibniz International Proceedings in Informatics (LIPIcs), Volume 9, pp. 213-224, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2011)

Copy BibTex To Clipboard

@InProceedings{freydenberger_et_al:LIPIcs.STACS.2011.213,
  author =	{Freydenberger, Dominik D. and Nevisi, Hossein and Reidenbach, Daniel},
  title =	{{Weakly Unambiguous Morphisms}},
  booktitle =	{28th International Symposium on Theoretical Aspects of Computer Science (STACS 2011)},
  pages =	{213--224},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-939897-25-5},
  ISSN =	{1868-8969},
  year =	{2011},
  volume =	{9},
  editor =	{Schwentick, Thomas and D\"{u}rr, Christoph},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.STACS.2011.213},
  URN =		{urn:nbn:de:0030-drops-30123},
  doi =		{10.4230/LIPIcs.STACS.2011.213},
  annote =	{Keywords: Combinatorics on words, Morphisms, Ambiguity}
}

Document

DOI: 10.4230/LIPIcs.STACS.2011.507

Extended Regular Expressions: Succinctness and Decidability

Authors: Dominik D. Freydenberger

Published in: LIPIcs, Volume 9, 28th International Symposium on Theoretical Aspects of Computer Science (STACS 2011)

Abstract

Most modern implementations of regular expression engines allow the use of variables (also called back references). The resulting extended regular expressions (which, in the literature, are also called practical regular expressions, rewbr, or regex) are able to express non-regular languages. The present paper demonstrates that extended regular-expressions cannot be minimized effectively (neither with respect to length, nor number of variables), and that the tradeoff in size between extended and ``classical'' regular expressions is not bounded by any recursive function. In addition to this, we prove the undecidability of several decision problems (universality, equivalence, inclusion, regularity, and cofiniteness) for extended regular expressions. Furthermore, we show that all these results hold even if the extended regular expressions contain only a single variable.

Cite as

Dominik D. Freydenberger. Extended Regular Expressions: Succinctness and Decidability. In 28th International Symposium on Theoretical Aspects of Computer Science (STACS 2011). Leibniz International Proceedings in Informatics (LIPIcs), Volume 9, pp. 507-518, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2011)

Copy BibTex To Clipboard

@InProceedings{freydenberger:LIPIcs.STACS.2011.507,
  author =	{Freydenberger, Dominik D.},
  title =	{{Extended Regular Expressions: Succinctness and Decidability}},
  booktitle =	{28th International Symposium on Theoretical Aspects of Computer Science (STACS 2011)},
  pages =	{507--518},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-939897-25-5},
  ISSN =	{1868-8969},
  year =	{2011},
  volume =	{9},
  editor =	{Schwentick, Thomas and D\"{u}rr, Christoph},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.STACS.2011.507},
  URN =		{urn:nbn:de:0030-drops-30396},
  doi =		{10.4230/LIPIcs.STACS.2011.507},
  annote =	{Keywords: extended regular expressions, regex, decidability, non-recursive tradeoffs}
}

15 Search Results for "Freydenberger, Dominik D."

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Abstract

Cite as

Thanks for your feedback!

Could not send message