Abstract 1 Introduction 2 Preliminaries 3 Efficient FC-Datalog 4 Applying the Framework: Simulating DRX 5 Conclusions and Future Work References

FC-Datalog as a Framework for Efficient String Querying

Owen M. Bell ORCID Loughborough University, UK Joel D. Day ORCID Loughborough University, UK Dominik D. Freydenberger ORCID Loughborough University, UK
Abstract

Core spanners are a class of document spanners that capture the core functionality of IBMโ€™s AQL. ๐–ฅ๐–ขย is a logic on strings built around word equations that when extended with constraints for regular languages can be seen as a logic for core spanners. The recently introduced ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ extends ๐–ฅ๐–ข with recursion, which allows us to define recursive relations for core spanners. Additionally, as ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ capturesย ๐–ฏ, it is also a tractable version of ๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ on strings. This presents an opportunity for optimization.

We propose a series of ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ fragments with desirable properties in terms of complexity of model checking, expressive power, and efficiency of checking membership in the fragment. This leads to a range of fragments that all capture ๐–ซ๐–ฎ๐–ฆ๐–ฒ๐–ฏ๐– ๐–ข๐–ค, which we further restrict to obtain linear combined complexity. This gives us a framework to tailor fragments for particular applications. To showcase this, we simulate deterministic regex in a tailored fragment of ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€.

Keywords and phrases:
Information extraction, word equations, datalog, document spanners, regex
Copyright and License:
[Uncaptioned image]โ€‚ยฉย Owen M. Bell, Joel D. Day, and Dominik D. Freydenberger; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation โ†’ Logic and databases
Related Version:
Full Version: https://arxiv.org/abs/2501.10344
Acknowledgements:
The authors would like to thank the anonymous reviewers of the current and previous versions for their detailed and helpful feedback.
Funding:
This work was supported by EPSRC grant EP/T033762/1.
Supplementary Material:
Data Accessibility: No data was generated or analyzed during this study.
Editors:
Sudeepa Roy and Ahmet Kara

1 Introduction

As a vast amount of valuable information is stored in unstructured textual data, the operation of extracting structured information from such data is crucial. This operation is the classical task known as Information Extraction (IE), and has applications from healthcareย (seeย e.g.ย [39]) to social media analytics (see e.g. [5]). A popular approach to this task is the rule-based technique, which can be understood as querying a text as one queries a relational database. Document spanners are a framework for rule-based information extraction. We consider a recursive model connected to document spanners called ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€, which is based on the logic on strings ๐–ฅ๐–ข and the query language ๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€. Thus, we first discuss the latter two, and the connection to document spanners, before moving to the former.

FC.

The theory of concatenation (short: ๐–ข) is a logic on strings with the infinite universe ฮฃโˆ—. Introduced by Freydenberger and Peterfreundย [16], ๐–ฅ๐–ข is a finite model version of ๐–ข which has a finite universe, a single word and all of its factors (contiguous subwords). As a result of this restriction, ๐–ฅ๐–ข has decidable model checking and evaluation (see [16]). ๐–ฅ๐–ข is built on word equations, equations of the form xโขxโ‰yโขyโขy, where variables x and y represent words over a finite alphabet ฮฃ. As a result, in ๐–ฅ๐–ข we can reason directly over factors rather than intervals of positions as is the case for other logics on strings such as monadic second order logic (๐–ฌ๐–ฒ๐–ฎ) over a linear order. Furthermore, in ๐–ฅ๐–ข we can compare factors of unbounded length, something which is not possible in ๐–ฌ๐–ฒ๐–ฎ (see [16] for details). Word equations themselves are a natural way of expressing many typical properties of a string such as containing a square or being imprimitive (see e.g. [24]), and have previously been used for data management in other areas such as graph databases (see [6]).

Document Spanners and FC.

Introduced by Fagin, Kimelfeld, Reiss and Vansummerenย [12], document spanners (or spanners) are a rule-based framework for Information Extraction. Spanners were introduced to capture the core functionality of AQL, a query language used for IBMโ€™s SystemT. Informally, spanners are functions that take a text document as input and output a relation over intervals (called spans) from the document. Intuitively, primitive extractors (which are commonly regular expressions with capture variables called regex formulas) extract relations of spans, and these are then combined using relational algebra.

Many works on spanners, particularly in the area of enumeration (see e.g. [3, 13, 35]), have been concerned with the subclass of regular spanners, which are regex formulas extended with projection (ฯ€), union (โˆช) and natural join (โ‹ˆ). However, [12] showed that this subclass cannot express more than recognizable relations. The full class of spanners introduced by Fagin et al. [12] are called the core spanners as these achieve the original motivation of capturing the core functionality of AQL. Core spanners extend the regular spanners with string equality (denoted ฮถ=), an operation necessary to perform fundamental tasks such as reading multiple occurrences of a string. This added expressibility comes at the cost of reduced efficiency (see e.g. [14, 15]). Further extending the core spanners with set difference gives the class of generalized core spanners.

The logic ๐–ฅ๐–ข has a tight connection to core spanners. In particular, the extension ๐–ฅ๐–ขโข[๐–ฑ๐–ค๐–ฆ], which extends ๐–ฅ๐–ข with constraints that decide membership of regular languages, captures the expressive power of generalized core spanners, and the existential-positive fragment ๐–ค๐–ฏ-๐–ฅ๐–ขโข[๐–ฑ๐–ค๐–ฆ] captures the expressive power of core spanners. Furthermore, there are polynomial time conversions between ๐–ฅ๐–ขโข[๐–ฑ๐–ค๐–ฆ] and generalized core spanners, and ๐–ค๐–ฏ-๐–ฅ๐–ขโข[๐–ฑ๐–ค๐–ฆ] and coreย spanners (see Section 5.2 of [16]). While spanners reason over intervals of positions, ๐–ฅ๐–ข reasons directly over factors. When dealing with factors, unlike with intervals of positions, the default is not to distinguish between duplicates. However, simulating different intervals containing the same factor is easily done: we can simply store in addition to the factor, the prefix preceding it. On the other hand, eliminating duplicates when working with intervals, such as in spanners, is not so easily achieved. Due to the connection between the two models, we can use ๐–ฅ๐–ข to gain insights into spanners; previous examples of work on ๐–ฅ๐–ข that has produced results for spanners are [18] for tractability and [38] for inexpressibility.

Datalog.

A query language for relational databases, ๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ was introduced to perform operations that were not possible in earlier database languages such as graph transitive closure (see e.g. [1]). A ๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program has a database of prepopulated extensional relations and a set of recursive rules that define new intensional relations. Semi-positive ๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€, which allows negation for atoms with extensional relation symbols, captures ๐–ฏ on ordered structures (see e.g. [28]). Linear ๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ permits at most one atom with an intensional relation symbol in the body of every rule, and semi-positive linear ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ captures ๐–ญ๐–ซ๐–ฎ๐–ฆ๐–ฒ๐–ฏ๐– ๐–ข๐–ค on ordered structures (see e.g. [20]). A more general definition of linear ๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ permits, in the body of every rule, at most one atom with an intensional relation symbol mutually recursive with the head relation symbol (see e.g [1]). We can evaluate body atoms that have other intensional relation symbols as subroutines, and so this extended definition does not affect the complexity. Linear ๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ also captures how recursion is done in SQLย (see [30]).

The combined complexity of model checking ๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ is ๐–ค๐–ท๐–ฏ-complete, even if the input database is empty, the universe is made up of only two elements, and the program has only a single rule (see [10, 20]). For linear ๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ it is ๐–ฏ๐–ฒ๐–ฏ๐– ๐–ข๐–ค-complete, again even with an empty input database, a two-element universe, and a single rule (see [20]).

An earlier approach to adapting ๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ for querying strings is Sequence ๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ (seeย [7]), but this has an undecidable model checking problem. Furthermore, in the spanner setting Peterfreund, ten Cate, Fagin and Kimelfeld [33] introduced ๐–ฑ๐–ฆ๐–ท๐—…๐—ˆ๐—€, ๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ over regex formulas. ๐–ฑ๐–ฆ๐–ท๐—…๐—ˆ๐—€ was motivated by the SystemT developersโ€™ interest in recursion (for example, to implement context free grammars for natural language processing), and captures the complexity class ๐–ฏ. As introduced in [32], ๐–ฒ๐—‰๐–บ๐—‡๐—‡๐–พ๐—‹๐—…๐—ˆ๐—€โขโŸจ๐–ฑ๐–ฆ๐–ทโŸฉ generalizes the spanner and relational model and has recently been implemented in [29]. ๐–ฒ๐—‰๐–บ๐—‡๐—‡๐–พ๐—‹๐—…๐—ˆ๐—€โขโŸจ๐–ฑ๐–ฆ๐–ทโŸฉ with stratified negation and restricted to string extensional relations also captures ๐–ฏ (see Section 6 of [33]).

FC-Datalog.

Together with ๐–ฅ๐–ข, Freydenberger and Peterfreundย [16] also introduced ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€, which extends existential-positive ๐–ฅ๐–ข (๐–ค๐–ฏ-๐–ฅ๐–ข) with recursion analogously to how ๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ extends existential-positive ๐–ฅ๐–ฎ. It is worth pointing out that ๐–ค๐–ฏ-๐–ฅ๐–ข is able to express the inequality of two strings (see e.g. Example 5.3 in [14]). As in ๐–ฅ๐–ข, we have the finite universe of a single word and all of its factors. ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ has word equations instead of extensional relations. That is, ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ atoms are word equations or relations. In ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ we adopt the fixedย point ๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ semantics.

Example 1.1.

The rules of an ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program P:

Ansโข() โ†๐”ฒโ‰yโขz,Eโข(y,z);
Eโข(x,y) โ†xโ‰ฮต,yโ‰ฮต;
Eโข(x,y) โ†xโ‰๐–บโขu,yโ‰๐–ปโขv,Eโข(u,v).

P defines the language โ„’โข(P)โ‰”{๐–บnโข๐–ปnโˆฃnโˆˆโ„•}. See Definitionย 2.1 for the semantics.

From Theorem 4.11 of [16], ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ captures the complexity class ๐–ฏ. When considering efficiency, we are primarily interested in model checking, which relates practically to deciding if a tuple is in a relation. We can see ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ as a language for expressing relations that can be used in spanner selections. Because spanners reason over intervals of positions, expressing relations can become cumbersome as a relation holds such intervals, including all those that represent the same factor. In contrast, we can neatly express relations in ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ as we reason directly over factors and so a relation holds the factors themselves.

Example 1.2.

In ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€, we would express a relation that contains all factors that are squares with Rโข(x)โ†xโ‰yโขy. In core spanners, we would express a relation that contains the positions of all factors that are squares with ฯ€xโข(ฮถy1,y2=โข(ฮฃโˆ—โขxโข{y1โข{ฮฃโˆ—}โขy2โข{ฮฃโˆ—}}โขฮฃโˆ—)). Seeย [12] for the full definition of core spanners.

Parallel to this, where model checking Sequence ๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ is undecidable, the same problem for ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ is in ๐–ฏ, and so we can also see ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ as a tractable recursive query language for strings, independent of the connection to spanners. We aim to identify techniques that make recursion less expensive. We thus look to define restrictions that lead to more efficient fragments of ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ which also retain other desirable properties.

We focus on the model checking problem primarily through two different lenses: data complexity and combined complexity. In many cases for query languages it is reasonable to use data complexity as the queries are often significantly smaller than the data. In text-based settings on the other hand, features such as regular expressions can make queries large. Consequently, combined complexity remains an important consideration.

Deterministic Regex.

As the regular languages are often not enough to express what is required in practice, almost all modern programming languages (such as e.g. PERL, Python, and Java) do not implement only classical regular expressions, as introduced by Kleeneย [26], but regex, regular expressions extended with back-references. These are operators that match a repetition of a previously matched string, and whilst they do increase expressibility, they also lead to intractability of membership (see [2]). Freydenberger and Schmid [17] combined regex with the notion of determinism to define ๐–ฃ๐–ฑ๐–ท, the class of deterministic regex, to obtain a tractable class of regex with more expressive power than deterministic regular expressions.

Other Related Models.

As well as ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€, there also exist other related models that capture ๐–ฏ. These include positive Rangeย Concatenation Grammarsย (๐–ฏ๐–ฑ๐–ข๐–ฆ)ย (see e.g.ย [8]), and Hereditaryย Elementaryย Formalย Systemsย (๐–ง๐–ค๐–ฅ๐–ฒ)ย (see e.g. [31]), which do not have finite model semantics. The key difference of these models to ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ is in ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€โ€™s use of word equations, which are not present in other formalisms and are crucial for our restrictions that lead to efficient fragments.

Contributions of this Paper.

The only previously known complexity result for ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ is that it captures ๐–ฏ. We first show that combined complexity of ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ is ๐–ค๐–ท๐–ฏ-complete, and then perform an evaluation of different restrictions on ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ to identify fragments: with more efficient complexity of model checking, for both data and combined complexity; that do not overly sacrifice expressive power; and where membership in the fragment can be checked efficiently. As the semi-positive linear fragment of classical ๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ captures ๐–ญ๐–ซ๐–ฎ๐–ฆ๐–ฒ๐–ฏ๐– ๐–ข๐–ค (on ordered structures), our first restriction is to adapt linearity to ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€. We show that linear ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ also captures ๐–ญ๐–ซ๐–ฎ๐–ฆ๐–ฒ๐–ฏ๐– ๐–ข๐–ค and has ๐–ฏ๐–ฒ๐–ฏ๐– ๐–ข๐–ค-complete combined complexity. Our second restriction is to remove nondeterminism. Here, we define deterministic linear ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ which captures ๐–ซ๐–ฎ๐–ฆ๐–ฒ๐–ฏ๐– ๐–ข๐–ค. But, checking whether a linear program is deterministic is as hard as satisfiability for word equations, a problem which is known to be ๐–ญ๐–ฏ-hardย (see [4, 27]) and in nondeterministic linear spaceย (see [23]). Therefore, we employ another restriction that we call oneย letterย lookahead (๐–ฎ๐–ซ๐–ซ๐– ) on the permitted word equations. Deterministic ๐–ฎ๐–ซ๐–ซ๐–  (๐–ฃ๐–ฎ๐–ซ๐–ซ๐– ) ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ captures ๐–ซ๐–ฎ๐–ฆ๐–ฒ๐–ฏ๐– ๐–ข๐–ค and checking if an ๐–ฎ๐–ซ๐–ซ๐–  program is deterministic can be done in polynomial time, but its combined complexity is still ๐–ฏ๐–ฒ๐–ฏ๐– ๐–ข๐–ค-complete. We thus make a final restriction that we call strictly decreasingย (๐–ฒ๐–ฃ) and define ๐–ฒ๐–ฃ-๐–ฃ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€, which has linear combined complexity.

We therefore establish the endpoints of a range of fragments that all capture ๐–ซ๐–ฎ๐–ฆ๐–ฒ๐–ฏ๐– ๐–ข๐–ค; at one end is deterministic linear ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ at at the other is ๐–ฃ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€. We establish a trade-off in this range between how rich the fragmentโ€™s syntax is and how easy it is to check membership in the fragment, although fully mapping this range is left for future work. Furthermore, we show that we can obtain an ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ fragment with linear combined complexity, namely ๐–ฒ๐–ฃ-๐–ฃ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€. Consequently, we have paved the way to design tailored fragments for particular applications.

We then explain how we can view ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ programs from our range as generalized nondeterministic and deterministic multi-headed two-way finite automata, which are equivalent to nondeterministic and deterministic ๐–ซ๐–ฎ๐–ฆ๐–ฒ๐–ฏ๐– ๐–ข๐–ค Turing machines, respectively (seeย [25]). ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ fragments from our range allow for more flexibility than these automata models, for example ๐–ฃ๐–ฎ๐–ซ๐–ซ๐– + ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€, tailored to simulate ๐–ฃ๐–ฑ๐–ท, can be viewed as a generalization that permits performing nonregular string computations in the transitions.

Where in [17], to check if a regex matches a word we have to construct a technically involved automata model, we show that we can model this simply in ๐–ฒ๐–ฃ-๐–ฃ๐–ฎ๐–ซ๐–ซ๐– + ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€. We also show how this tailored fragment allows us to conveniently and naturally write programs that are more concise. ๐–ฃ๐–ฎ๐–ซ๐–ซ๐– + ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ permits additional deterministic components, but despite this maintains all of the desirable properties of ๐–ฃ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€: it captures ๐–ซ๐–ฎ๐–ฆ๐–ฒ๐–ฏ๐– ๐–ข๐–ค, determinism can be checked in polynomial time, and its strictly decreasing variant ๐–ฒ๐–ฃ-๐–ฃ๐–ฎ๐–ซ๐–ซ๐– + ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ has linear combined complexity. Finally, we show that as we can simulate ๐–ฃ๐–ฑ๐–ท, another example of deterministic components that can be added to these tailored fragments are constraints that match ๐–ฃ๐–ฑ๐–ท.

2 Preliminaries

Let โ„•โ‰”{0,1,2,โ€ฆ} and โ„•+โ‰”{1,2,โ€ฆ}. We denote the empty set by โˆ… and the cardinality of a set S by |S|. Let Sโˆ–Sโ€ฒ be the set difference of S and Sโ€ฒ, and let ๐’ซโข(S) be the power set of S. Let A be some alphabet. For any two u,vโˆˆAโˆ—, we use uโ‹…v, or simply uโขv, for the concatenation of u and v. Let |u| denote the length of uโˆˆAโˆ—. Let ฮฃ be a finite alphabet that we call terminal symbols (or just terminals). We call any wโˆˆฮฃโˆ— a word (or string). We use ฮต to denote the empty word and let ฮฃ+=ฮฃโˆ—โˆ–{ฮต}. A factor of a word w is a word t such that there exist u,vโˆˆฮฃโˆ— with w=uโ‹…tโ‹…v. In literature, a factor is sometimes called a contiguous subword.

For a tuple tโ†’, let its size |tโ†’| be the number of components in tโ†’ and xโˆˆtโ†’ denote that x is a component of tโ†’. We refer to tuples where |tโ†’|=1 as singletons and omit the brackets in this case. A relation is a set of tuples of the same size, and is represented by a relation symbol.

Let ฮž be a countably infinite set of variables where ฮฃโˆฉฮž=โˆ…. A pattern is a word ฮฑโˆˆ(ฮฃโˆชฮž)โˆ—. We define Var(ฮฑ) as the set of variables in ฮฑ. For two alphabets A and B, aย homomorphism is a function g:Aโˆ—โ†’Bโˆ— where gโข(u) โ‹… gโข(v) = gโข(uโ‹…v) holds for all u,vโˆˆAโˆ—. A patternย substitution ฮธ is a homomorphism ฮธ:(ฮฃโˆชฮž)โˆ—โ†’(ฮฃโˆชฮž)โˆ— where ฮธโข(๐–บ)=๐–บ for all ๐–บโˆˆฮฃ. We usually refer to a pattern substitution as simply a substitution. We denote the image of a pattern ฮฑ under a substitution ฮธ by ฮธโข(ฮฑ). If ฮธโข(x)=ฮต for some variable x, we say ฮธ is erasing. For a tuple tโ†’=(ฮฑ1,โ€ฆ,ฮฑn) of patterns ฮฑ1,โ€ฆ,ฮฑn and a substitution ฮธ, let ฮธโข(tโ†’)=(ฮธโข(ฮฑ1),โ€ฆ,ฮธโข(ฮฑn)).

For two patterns ฮฑ,ฮฒโˆˆ(ฮฃโˆชฮž)โˆ—, an equation of the form ฮฑโ‰ฮฒ is called a word equation. Under a substitution ฮธ, the word equation ฯ†โ‰”ฮฑโ‰ฮฒ holds if ฮธโข(ฮฑ)=ฮธโข(ฮฒ), and we then say ฮธ is a solution of ฯ†. If there exists a solution for ฯ†, then we say ฯ† is satisfiable. We say a conjunction of two word equations ฯ†1โˆงฯ†2 is satisfiable if there exists a substitution ฮธ that is a solution to both ฯ†1 and ฯ†2. We say two word equations ฯ†1 and ฯ†2 contradict each other if their conjunction is not satisfiable. We say a pattern equation is a word equation of the form xโ‰ฮฑ, where xโˆˆฮž and ฮฑโˆˆ(ฮฃโˆชฮž)โˆ—. For every pattern equation ฯ†=xโ‰ฮฑ, we define Varโข(ฯ†)โ‰”{x}โˆชVarโข(ฮฑ). The expressive power of conjunctions of pattern equations is the same as that for conjunctions of word equations as we can simulate a word equation ฮฑ1โ‰ฮฑ2, for ฮฑ1,ฮฑ2โˆˆ(ฮฃโˆชฮž)โˆ—, using two pattern equations xโ‰ฮฑ1 and xโ‰ฮฑ2 for a new variable xโˆˆฮž.

FCโ€“Datalog.

An ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ atom is either a pattern equation, or a so-called relation atom of the form Rโข(y1,โ€ฆ,yarโข(R)) where R is a relation symbol that has an arity arโข(R)โ‰ฅ0 and y1,โ€ฆ,yarโข(R)โˆˆฮž. Without loss of generality, we can assume that for every pattern equation atom xโ‰ฮฑ, the variable x does not occur in ฮฑ because such equations reduce to trivial scenarios, as in Lemma 3.1 of [37]. For a relation atom ฯ†=Rโข(y1,โ€ฆ,yarโข(R)), we define Var(ฯ†)ย =ย {y1,โ€ฆ,yarโข(R)}. For now, we assume each relation symbol R has a corresponding relation R. How relations are updated is specified in the following semantics.

Let wโˆˆฮฃโˆ— be a word and let ฮธ be a substitution. For a pattern equation atom ฯ†=xโ‰ฮฑ, we say (w,ฮธ)โŠงฯ† if ฮธโข(x)=ฮธโข(ฮฑ) and both ฮธโข(x) and ฮธโข(ฮฑ) are factors of w. For a relation atom ฯ†=Rโข(y1,โ€ฆ,yarโข(R)), we say (w,ฮธ)โŠงฯ† if (ฮธโข(y1),โ€ฆ,ฮธโข(yarโข(R)))โˆˆR and ฮธโข(y1),โ€ฆ,ฮธโข(yarโข(R)) are factors of w. We represent the word w that defines the universe with the distinguished universe variable ๐”ฒ, and use this as an input to an ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program.

Let ฯƒ be a relational vocabulary (see e.g. [11, 28] for a definition of ๐–ฅ๐–ฎ and vocabularies). A conjunctive query is an ๐–ฅ๐–ฎโข[ฯƒ] formula of the form ฯโข(xโ†’)โ‰”โˆƒyโ†’:โ‹€i=1nฮทi, where ฮทiโ‰”Riโข(tโ†’i), each Ri is a relation symbol in ฯƒ, each tโ†’i is a ฯƒ-term, and nโ‰ฅ1. We usually write this as ฯโ‰”Ansโข(xโ†’)โ†โ‹€i=1nฮทi where xโ†’ is the tuple of free variables. We call Ansโข(xโ†’) the head of ฯ and โ‹€i=1nฮทi the body of ฯ. If there are no free variables then ฯ is Boolean. We call the set of conjunctive queries ๐–ข๐–ฐ. We also use the output relation symbol Ans in ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€.

Definition 2.1.

An ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program is a 3-tuple Pโ‰”(๐”ฒ, โ„›, ฮฆ), where:

  • โ– 

    ๐”ฒโˆˆฮž is the universe variable,

  • โ– 

    โ„›โЇ{Ans} is a finite set of relation symbols, where each Rโˆˆโ„› has an arity arโข(R)โ‰ฅ0,

  • โ– 

    ฮฆ is a finite set of rules of the form Rโข(x1,โ€ฆ,xarโข(R))โ†ฯ†1,โ€ฆ,ฯ†m for some mโ‰ฅ1, some Rโˆˆโ„›, where for 1โ‰คiโ‰คm every ฯ†i is an ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ atom, and for 0โ‰คjโ‰คarโข(R) every xj occurs in some Varโข(ฯ†i).

Each element of ฮฆ can be intuitively seen as a conjunctive query over ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ atoms and as for ๐–ข๐–ฐ, when arโข(Ans)=0, the program is Boolean. For brevity, just ฮฆ is used to represent the whole tuple if ๐”ฒ and โ„› are clear from context. For a relation symbol Rโˆˆโ„›, let ฮฆRโІฮฆ be the subset of rules with head relation symbol R. For an ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ rule ฯ we define ๐—‰๐–พโข(ฯ) as the conjunction of all the pattern equations in ฯ.

Example 2.2.

For the ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program P=(๐”ฒ,{Ans,E},ฮฆ) where ฮฆ are the rules defined in Example 1.1, we have the two subsets: ฮฆAnsโ‰”{Ans()โ†๐”ฒโ‰yz,E(y,z)} and ฮฆEโ‰”ฮฆโˆ–ฮฆAns. Let ฯ=Eโข(x,y)โ†xโ‰๐–บโขu,yโ‰๐–ปโขv,Eโข(u,v). Then ๐—‰๐–พโข(ฯ)=xโ‰๐–บโขuโˆงyโ‰๐–ปโขv.

Let ฯ=R(x1,โ€ฆ,xarโข(R)) โ†ฯ†1,โ€ฆ,ฯ†m be an ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ rule and let Vโ‰”โ‹ƒi=1mVarโข(ฯ†i). For a word wโˆˆฮฃโˆ—, a w-substitution is a substitution ฮธ where ฮธโข(๐”ฒ)=w and ฮธโข(x) is a factor of w for each xโˆˆV. Using only w-substitutions ensures that the universe is restricted to w and its factors. For the rule ฯ and w-substitution ฮธ, we say (w,ฮธ)โŠงฯ if for all 1โ‰คiโ‰คm we have (w,ฮธ)โŠงฯ†i. As we only consider the finite universe setting here, we are thus only considering w-substitutions, and so refer to these as just substitutions for brevity.

We treat an ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program P=(๐”ฒ,โ„›,ฮฆ) as implicitly defining a vocabulary and define corresponding structures. An ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ structure ๐”„P consists of a fixed universe wโˆˆฮฃโˆ— and all its factors, and an interpretation function f๐”„P that maps every relation symbol Rโˆˆโ„› to an interpretationย R๐”„P. For convenience, we also refer to R๐”„P as R.

A program P and a word w define a structure โŸฆPโŸง(w) incrementally. First, all Rโˆˆโ„› are initialized to โˆ…. Then, for each rule R(x1,โ€ฆ,xarโข(R)) โ†ฯ†1,โ€ฆ,ฯ†m and Vโ‰”โ‹ƒi=1mVarโข(ฯ†i), for every w-substitutionย ฮธ where (w,ฮธ)โŠงฯ†i, for all 1โ‰คiโ‰คm, add (ฮธโข(x1),โ€ฆ,ฮธโข(xarโข(R))) toย R. We repeat this until all Rโˆˆโ„› have stabilized. Then โŸฆPโŸง(w) is the content of the Ans relation. This โ€œfilling upโ€ of relations mirrors the fixed point semantics of classical ๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€.

The defined functionย P that maps w to arโข(Ans)-ary tuples over factors of w is well-defined (thus the fixed point is unique) and the โ€œfilling upโ€ process terminates for every givenย w. If P is Boolean, โŸฆPโŸง(w) is either โˆ… or {()}. We interpret these as ๐–ฑ๐–พ๐—ƒ๐–พ๐–ผ๐— and ๐– ๐–ผ๐–ผ๐–พ๐—‰๐— (resp.) and use this to define a language โ„’โข(P).

Example 2.3.

The ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program P=(๐”ฒ,{Ans,R},ฮฆ) where ฮฆ is:

Ansโข() โ†๐”ฒโ‰yโขy,Rโข(y); Rโข(x) โ†xโ‰yโข๐–บ,Rโข(y);
Rโข(x) โ†xโ‰ฮต; Rโข(x) โ†xโ‰yโข๐–ป,Rโข(y).

defines the language โ„’โข(P)โ‰”{vโ‹…vโˆฃvโˆˆ{๐–บ,๐–ป}โˆ—}. Note that without the three rules that have R in the head, we would define the language {vโ‹…vโˆฃvโˆˆฮฃโˆ—}.

We look at the model checking problem for ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€: given a Boolean ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program P and a word w, is wโˆˆโ„’โข(P)? We consider three perspectives: data complexity, where the program P is fixed and only the word w is considered input, expression complexity, where the word w is fixed and only the program P is considered input, and combined complexity, where both the word w and program P are considered input (for details, see [28]).

We call a subset of ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ programs a fragment. We say that a fragment โ„ฑ captures a complexity class โ„‚ if โ„‚={โ„’โข(P)โˆฃPโˆˆโ„ฑ}. Note that, following directly from the definitions, if โ„ฑ captures โ„‚, then the data complexity of โ„ฑ is โ„‚. The only complexity result that is known for ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ is that ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ capturesย ๐–ฏ (see Theorem 4.11 of [16]), however, there are results for other versions of ๐–ฅ๐–ข, such as those discussed in Section 1.

3 Efficient FC-Datalog

As ๐–ฏ is often not considered efficient for data complexity, this presents an opportunity for optimization. We show that this is also the case for combined complexity.

Theorem 3.1.

Combined complexity of ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ is ๐–ค๐–ท๐–ฏ-complete.

It is straightforward to see that as our universe is finite, the number of tuples we can add to the relations is exponential. We therefore have a naive evaluation algorithm of a loop with an exponential bound on the number of iterations, and an exponential number of steps in each iteration. We can also straightforwardly reduce classical ๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ evaluation to show ๐–ค๐–ท๐–ฏ-hardness. Our main goal in this section is to identify fragments with lower data and combined complexity, and where membership in the fragment can be easily checked.

3.1 Linearity

There exists a fragment of classical ๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ that captures the complexity class ๐–ญ๐–ซ๐–ฎ๐–ฆ๐–ฒ๐–ฏ๐– ๐–ข๐–ค on ordered structures (see e.g. [20]), namely the fragment of all semi-positive linear programs. As discussed in Section 1, there is a more general definition for linearity (see e.g. [1]) that does not affect complexity. Here we translate the more general linearity restriction from the relational setting to the text setting.

We define a relation Rโ†Rโ€ฒ over relation symbols R and Rโ€ฒ if there exists a rule ฯ where R is the head relation symbol and Rโ€ฒ appears in the body. We denote the transitive closure of โ† by โ†+, and two relation symbols R and Rโ€ฒ are mutually recursive if R โ†+ Rโ€ฒ and Rโ€ฒ โ†+ R.

Definition 3.2.

Let P be an ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program with rule set ฮฆ. A rule ฯโˆˆฮฆ is linear with respect to P if at most one atom in the body of ฯ has a relation symbol with which the head relation symbol of ฯ is mutually recursive. If every ฯโˆˆฮฆ is linear, then P is linear.

Checking if a given ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program is linear can be done in polynomial time with respect to the programโ€™s number of rules, as it amounts to determining which pairs of relation symbols are mutually recursive, which is a syntactic criterion.

Example 3.3.

The ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ programs given in Example 1.1 and Example 2.3 are both linear ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ programs. An example ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program that is not linear is the following program that retrieves all even length factors of the input word:

Ansโข(z) โ†zโ‰xโขy,Ansโข(x),Ansโข(y);
Ansโข(z) โ†zโ‰xโขy,Lโข(x),Lโข(y);

and a rule Lโข(x)โ†xโ‰๐–บ for each ๐–บโˆˆฮฃ. This is not linear as Ans is mutually recursive with Ans, and the top rule has Ans in the head and two occurrences of Ans in the body.

While unrestricted ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ captures ๐–ฏ, the restriction to linear ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ has substantially improved data complexity (Theoremย 3.4) as well as combined complexity (Theoremย 3.5).

Theorem 3.4.

Linear ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ captures ๐–ญ๐–ซ๐–ฎ๐–ฆ๐–ฒ๐–ฏ๐– ๐–ข๐–ค.

We capture ๐–ญ๐–ซ๐–ฎ๐–ฆ๐–ฒ๐–ฏ๐– ๐–ข๐–ค as we can simulate multi-headed two-way nondeterministic finite automata, which are equivalent to nondeterministic Turing machines with logarithmic space (see [25]). We will further discuss the connection to automata in Sectionย 3.5.

Theorem 3.5.

Combined and expression complexity of linear ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ are ๐–ฏ๐–ฒ๐–ฏ๐– ๐–ข๐–ค-complete.

On the other hand, we would like to lower both data complexity further than ๐–ญ๐–ซ๐–ฎ๐–ฆ๐–ฒ๐–ฏ๐– ๐–ข๐–ค and combined complexity further than ๐–ฏ๐–ฒ๐–ฏ๐– ๐–ข๐–ค. Furthermore, ๐–ฏ๐–ฒ๐–ฏ๐– ๐–ข๐–ค-completeness occurs even on a single-character input. As such, we look for more efficient fragments.

3.2 Determinism

As linear ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ captures ๐–ญ๐–ซ๐–ฎ๐–ฆ๐–ฒ๐–ฏ๐– ๐–ข๐–ค, it is natural to look at causes of nondeterminism and see if there is a corresponding fragment that captures ๐–ซ๐–ฎ๐–ฆ๐–ฒ๐–ฏ๐– ๐–ข๐–ค. For model checking, we evaluate ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ programs top-down.

Example 3.6.

We show a top-down evaluation of the program P in Example 2.3 on the input word w=๐–บ๐–ป๐–บ๐–ป. The only rule we can apply first is Ansโข()โ†๐”ฒโ‰yโขy,Rโข(y). As ฮธโข(๐”ฒ)=w, we see that ฮธโข(y)=๐–บ๐–ป, and we pass this into the relation R. We can then only apply the rule Rโข(x)โ†xโ‰yโข๐–ป,Rโข(y). As ฮธโข(x)=๐–บ๐–ป, we have ฮธโข(y)=๐–บ and we recurse on this value of y. We then apply Rโข(x)โ†xโ‰yโข๐–บ,Rโข(y) and recurse on ฮธโข(y)=ฮต. We can then apply Rโข(x)โ†xโ‰ฮต, and as this holds, we accept.

Based on this top-down evaluation model, we define two categories of variables.

Definition 3.7.

Let ฯ=R1โข(x1,โ€ฆโขxm)โ†R2โข(y1,โ€ฆโขyn),ฯ†1,โ€ฆ,ฯ†โ„“ be a linear ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ rule where R1 is mutually recursive with R2, and ฯ†i is either a pattern equation or a relation atom with a relation symbol that is not mutually recursive with R1, for 1โ‰คiโ‰คโ„“. We say ฯโ€™s top variables ๐—๐—ˆ๐—‰โข(ฯ) are (x1,โ€ฆ,xm,๐”ฒ) and ฯโ€™s bottom variables ๐–ป๐—ˆ๐—๐—๐—ˆ๐—†โข(ฯ) are (y1,โ€ฆ,yn).

In top-down evaluation, a ruleโ€™s top variables contain values that were passed down from a preceding ruleโ€™s evaluation, and its bottom variables contain the values that are passed down. A variable can be both a top and bottom variable.

Example 3.8.

Let ฯ be the linear ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ rule Rโข(x)โ†xโ‰yโข๐–บ,Rโข(y) from Exampleย 2.3. Then ๐—๐—ˆ๐—‰โข(ฯ)=(x,๐”ฒ) and ๐–ป๐—ˆ๐—๐—๐—ˆ๐—†โข(ฯ)=(y).

When evaluating a linear program, nondeterminism can occur in two ways: in choosing the values of the variables when processing a rule, which we call local nondeterminism, and in choosing which rule to process, which we call global nondeterminism. In order to remove all nondeterminism, we must ensure every program has both local and global determinism.

Let w be the input word and let ฯ be a linear ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ rule. We define the relationย Wฯ to be all pairs (ฮธโข(๐—๐—ˆ๐—‰โข(ฯ)),ฮธโข(๐–ป๐—ˆ๐—๐—๐—ˆ๐—†โข(ฯ))), for all substitutions ฮธ such that (w,ฮธ)โŠง๐—‰๐–พโข(ฯ). For some (tโ†’,tโ€ฒโ†’)โˆˆWฯ, we call tโ†’ a top tuple and tโ€ฒโ†’ a bottom tuple. Recall that as that we are working under finite model semantics, each value is a factor of the word w. We define two criteria for determinism.

Definition 3.9.

A linear ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program P with rule set ฮฆ is locally deterministic if for each ฯโˆˆฮฆ, the relation Wฯ is a partial function. P is globally deterministic if for every relation symbol R, for all pairs of distinct rules ฯ,ฯ‡โˆˆฮฆR, there is no top tuple tโ†’ such that (tโ†’,tโ€ฒโ†’) in Wฯ and (tโ†’,tโ€ฒโ€ฒโ†’) in Wฯ‡, for some bottom tuples tโ€ฒ and tโ€ฒโ€ฒ.

If Wฯ is a partial function for some rule ฯ, then top-down evaluation of ฯ is locally deterministic as for every top tuple there is at most one bottom tuple. If a top tuple is a valid input for only one rule ฯ in ฮฆR, for ฯโ€™s head relation symbol R, then we have global determinism as ฯ is the only rule we can process.

Definition 3.10.

Deterministicย linearย ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ is the set of linear ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ programs that are locally and globally deterministic.

Our condition is sufficiently strong as we precisely capture ๐–ซ๐–ฎ๐–ฆ๐–ฒ๐–ฏ๐– ๐–ข๐–ค.

Theorem 3.11.

Deterministic linear ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ captures ๐–ซ๐–ฎ๐–ฆ๐–ฒ๐–ฏ๐– ๐–ข๐–ค.

We capture ๐–ซ๐–ฎ๐–ฆ๐–ฒ๐–ฏ๐– ๐–ข๐–ค as we can simulate multi-headed two-way deterministic finite automata, which are equivalent to deterministic Turing machines with logarithmic space (see [25]). Unfortunately, where linearity for ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ can be checked in polynomial time, checking determinism for linear ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ is considerably more expensive, as the criteria we have to check is semantic rather than syntactic. A problemย p is word equations-hard if there exists a polynomial time reduction from word equation satisfiability toย p.

Proposition 3.12.

Checking local or global determinism of linear ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ is word equations-hard.

Word equation satisfiability is ๐–ญ๐–ฏ-hard [4, 27] and in nondeterministic linear spaceย [23]. Closing this gap is a longstanding open problem. Thus, although we have reduced the complexity, we have lost efficient checking of membership in the fragment.

We now show that, in contrast to regular spanners, under standard complexity theoretic assumptions we cannot expect to have constant delay enumeration algorithms in our setting. A constant delay enumeration algorithm has a preprocessing phase, which often runs in linear time, and an enumeration phase that outputs solutions one by one, with constant time between any consecutive outputted solutions (see e.g. [36]).

Proposition 3.13.

If there exists a constant delay enumeration algorithm with polynomial preprocessing for deterministic linear ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€, then ๐–ฏ=๐–ญ๐–ฏ.

As such, we do not look further into constant delay enumeration, and instead focus on identifying a fragment that has both reduced complexity and efficient membership checking, without compromising on expressive power.

3.3 One Letter Lookahead

In this subsection, we introduce a second fragment that captures ๐–ซ๐–ฎ๐–ฆ๐–ฒ๐–ฏ๐– ๐–ข๐–ค, but where we can check membership in the fragment in polynomial time. This fragment has a severely restricted syntax but can still simulate multi-headed two-way deterministic finite automata, demonstrating how little of ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ is required to retain this level of expressibility.

Definition 3.14.

Let ฯ be a linear ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ rule. Let x,โˆˆ๐—๐—ˆ๐—‰(ฯ), yโˆˆ๐–ป๐—ˆ๐—๐—๐—ˆ๐—†โข(ฯ), and ๐–บโˆˆฮฃโˆช{ฮต}. The rule ฯ is a one letter lookahead (๐–ฎ๐–ซ๐–ซ๐– ) ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ rule if each pattern equation has either of the below forms:

  • โ– 

    xโ‰ฮต or xโ‰xโ€ฒ where xโ€ฒโˆˆ๐—๐—ˆ๐—‰โข(ฯ).

  • โ– 

    xโ‰yโข๐–บ or xโ‰๐–บโขy ย ย  (deleting a letter from a top variable).

  • โ– 

    yโ‰xโข๐–บ or yโ‰๐–บโขx ย ย  (adding a letter to a top variable).

  • โ– 

    ๐”ฒโ‰xโข๐–บโขz for zโˆˆฮž where zโˆ‰๐—๐—ˆ๐—‰โข(ฯ)โˆช๐–ป๐—ˆ๐—๐—๐—ˆ๐—†โข(ฯ) and z does not occur elsewhere in ฯ (matching the next letter after a top variable in the input word).

  • โ– 

    ๐”ฒโ‰zโข๐–บโขx for zโˆˆฮž where zโˆ‰๐—๐—ˆ๐—‰โข(ฯ)โˆช๐–ป๐—ˆ๐—๐—๐—ˆ๐—†โข(ฯ) and z does not occur elsewhere in ฯ (matching the first letter before a top variable in the input word).

We call an equation xโ‰ฮต or xโ‰xโ€ฒ or where ๐–บ=ฮต an ฮต-๐–ฎ๐–ซ๐–ซ๐–  pattern equation. Otherwise, we call an equation where ๐–บ is on the right side of x or y a right ๐–ฎ๐–ซ๐–ซ๐–  pattern equation, and an equation where ๐–บ is on the left side of x or y a left ๐–ฎ๐–ซ๐–ซ๐–  pattern equation.

Definition 3.15.

Let P be a linear ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program with rule set ฮฆ. We say P is a oneย letterย lookahead (๐–ฎ๐–ซ๐–ซ๐– ) ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program if all ฯโˆˆฮฆ are ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ rules, and for every relation symbol R, for every rule ฯโˆˆฮฆR, every variable xโˆˆ๐—๐—ˆ๐—‰โข(ฯ)โˆช๐–ป๐—ˆ๐—๐—๐—ˆ๐—†โข(ฯ) is not used in both left ๐–ฎ๐–ซ๐–ซ๐–  and right ๐–ฎ๐–ซ๐–ซ๐–  pattern equations. An ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program is deterministic (a ๐–ฃ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program) if it is both locally and globally deterministic.

As such, in ๐–ฃ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€, for every rule ฯโˆˆฮฆR we can only use equations of the form xโ‰xโ€ฒ for x,xโ€ฒโˆˆ๐—๐—ˆ๐—‰โข(ฯ) if |ฮฆR|=1 or there exists a equation of the form x=ฮต or x=๐–บโขy or x=yโข๐–บ in ๐—‰๐–พโข(ฯ), for yโˆˆ๐–ป๐—ˆ๐—๐—๐—ˆ๐—†โข(ฯ). We call such a program guarded.

Example 3.16.

The linear ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program P given in Example 2.3 is not a ๐–ฃ๐–ฎ๐–ซ๐–ซ๐– ย ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program as the rule Ansโข()โ†๐”ฒโ‰yโขy,Rโข(y) contains ๐”ฒโ‰yโขy, which does not fit any of the permitted forms of pattern equations for an ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program.

The simplest ๐–ฃ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program the authors could find that models the language of squares given in Example 2.3 is the one that models the multi-headed two-way DFA; this program first identifies the middle of the string and processes the two halves letter by letter. We now see that despite restricting the programs substantially, our new fragment ๐–ฃ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ retains the expressive power of deterministic linear ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€.

Theorem 3.17.

๐–ฃ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ captures ๐–ซ๐–ฎ๐–ฆ๐–ฒ๐–ฏ๐– ๐–ข๐–ค.

To check if a program is an ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program is straightforward. Checking if an ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program is deterministic is easier than in the general case, as the pattern equations are restricted to only match one letter at a time. When checking both local and global determinism, we only need to consider the case where rules do not contain contradicting pattern equations; if we have a rule with two pattern equations ฯ†1 and ฯ†2 that contradict each other, then as there are no solutions to ฯ†1โˆงฯ†2, this rule can never be applied. To check if a program is locally deterministic, we check that for every rule ฯ without pattern equations that contradict each other, every xโˆˆ๐—๐—ˆ๐—‰โข(ฯ) appears in some pattern equation in ๐—‰๐–พโข(ฯ) or xโˆˆ๐–ป๐—ˆ๐—๐—๐—ˆ๐—†โข(ฯ), and every yโˆˆ๐–ป๐—ˆ๐—๐—๐—ˆ๐—†โข(ฯ) appears in some pattern equation in ๐—‰๐–พโข(ฯ) or yโˆˆ๐—๐—ˆ๐—‰โข(ฯ). If so, then Wฯ is a partial function. To check if a program is globally deterministic, we use profiles for its rules.

Definition 3.18.

Let ฯ be an ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ rule without pattern equations that contradict each other. We define a profile for ฯ as a function pโขrโขoฯ:๐—๐—ˆ๐—‰โข(ฯ)โ†’ฮฃโˆช{ฮต,โŠฅ} where:

pโขrโขoฯโข(x)={๐–บโขย if there exists a pattern equationย xโ‰yโข๐–บย orย xโ‰๐–บโขy, forย ๐–บโˆˆฮฃ,ฮตโขย if there exists a pattern equationย xโ‰ฮต,โŠฅย otherwise.

For p,pโ€ฒโˆˆฮฃโˆช{ฮต,โŠฅ}, we say p and pโ€ฒ are in conflict if pโ‰ โŠฅ and pโ€ฒโ‰ โŠฅ, and pโ‰ pโ€ฒ. Let ฯ and ฯ‡ be linear ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ rules with the same head relation symbol, with respective profiles pโขrโขoฯ and pโขrโขoฯ‡. Let ๐—๐—ˆ๐—‰โข(ฯ)=(x1,โ€ฆ,xk). Note that ๐—๐—ˆ๐—‰โข(ฯ)=๐—๐—ˆ๐—‰โข(ฯ‡). We say pโขrโขoฯ and pโขrโขoฯ‡ are in conflict if there exists some xโˆˆ๐—๐—ˆ๐—‰โข(ฯ) where pโขrโขoฯโข(x) is in conflict with pโขrโขoฯ‡โข(x).

The function pโขrโขoฯ represents how each xโˆˆ๐—๐—ˆ๐—‰โข(ฯ) is processed in ฯ. In ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€, all we need to consider is the leftmost or rightmost letter of each x, depending on whether x is used in left or right ๐–ฎ๐–ซ๐–ซ๐–  pattern equations for rules with this head relation symbol.

Lemma 3.19.

A guarded ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program P is globally deterministic if and only if for every relation symbol R, for every pair of distinct rules ฯ,ฯ‡โˆˆฮฆR, we have that pโขrโขoฯ is in conflict with pโขrโขoฯ‡.

This allows us to check if a given program is deterministic efficiently.

Proposition 3.20.

Local and global determinism for ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ can be decided in polynomial time.

We thus have a fragment where we have reduced data complexity to ๐–ซ๐–ฎ๐–ฆ๐–ฒ๐–ฏ๐– ๐–ข๐–ค, and without losing efficient checking of membership in the fragment. However, the combined complexity for ๐–ฃ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ is the same as for linear ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€.

Theorem 3.21.

Combined complexity of ๐–ฃ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ is ๐–ฏ๐–ฒ๐–ฏ๐– ๐–ข๐–ค-complete.

3.4 Strictly Decreasing

Our next goal is more efficient combined complexity. We introduce a further restriction called strictly decreasing, and show that this leads to linear combined complexity.

Definition 3.22.

Let P be a ๐–ฃ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program with rule set ฮฆ, and let ฮฆโ€ฒโІฮฆ be the rules containing both an atom with a relation symbol mutually recursive with the head relation symbol, and at least one pattern equation. Let w be the input word. We say P is strictly decreasing (๐–ฒ๐–ฃ) if for every ruleย ฯโˆˆฮฆโ€ฒ that has head relation symbol R:

  1. 1.

    there exist xโˆˆ๐—๐—ˆ๐—‰โข(ฯ) and yโˆˆ๐–ป๐—ˆ๐—๐—๐—ˆ๐—†โข(ฯ) such that y occurs in a pattern equation with x and for all substitutions ฮธ such that (w,ฮธ)โŠงฯ we have |ฮธโข(y)|<|ฮธโข(x)|, and

  2. 2.

    if there exists a relation symbol Rโ€ฒ in the body of ฯ that is mutually recursive with R, for every ruleย ฯ‡โˆˆฮฆโ€ฒ with head relation symbol Rโ€ฒ, condition 1 holds for some xโ€ฒโˆˆ๐—๐—ˆ๐—‰โข(ฯ‡) and some yโ€ฒโˆˆ๐–ป๐—ˆ๐—๐—๐—ˆ๐—†โข(ฯ‡), and ฮธโข(y)=ฮธโ€ฒโข(xโ€ฒ) for all substitutions ฮธ such that (w,ฮธ)โŠงฯ and all substitutions ฮธโ€ฒ such that (w,ฮธโ€ฒ)โŠงฯ‡.

We now see that this restriction significantly improves the combined complexity. In fact, if the maximum relation symbol arity k is fixed (or assumed to be much smaller than |w|), then this is linear. Furthermore, the preprocessing is not dependent on the word w.

Theorem 3.23.

Given a word w and an ๐–ฒ๐–ฃ-๐–ฃ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program P with n relation symbols and maximum relation symbol arity k, we can decide wโˆˆโ„’โข(P) in ๐’ชโข(|w|โขk) time after ๐’ชโข(nโข|ฮฃ|) preprocessing.

In this section, we have defined the endpoints of an infinite range of fragments that all capture ๐–ซ๐–ฎ๐–ฆ๐–ฒ๐–ฏ๐– ๐–ข๐–ค. At one end, ๐–ฃ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ has just enough syntax to simulate multi-headed two-way deterministic finite automata, but has easy checking of determinism. At the other end, deterministic linear ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ has a richer syntax but has more difficult checking of determinism. From this range, there is therefore the opportunity to extend ๐–ฃ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ to make writing programs more natural and convenient. We will demonstrate such a situation in Sectionย 4 when designing an appropriate fragment to model deterministic regex. Furthermore, although the fragments in this range do not reduce combined complexity any more than is the case for linear ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€, when we add the extra dimension of strictly decreasing, we can reduce the combined complexity to linear time.

3.5 FC-Datalog as Generalized Automata

As ๐–ฃ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€, by design, permits just enough to simulate multi-headed two-way deterministic finite automata, we can naturally see a ๐–ฃ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program as a generalization of such an automaton: every relation acts as a state, every rule acts as a transition, and every string variable acts a head which can be moved, added and deleted (as the number of heads can change per state). In each transition we read one letter and move, add, or delete the heads accordingly.

The fragments of ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ discussed in this paper can be seen as further generalizations. As the syntax grows, the connection to automata becomes less natural and checking determinism becomes harder. For example, Section 4 introduces ๐–ฃ๐–ฎ๐–ซ๐–ซ๐– + ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ which allows word equations such as xโ‰yโขy. Instead of just moving a head by one letter, this is an operation that either splits a memory in half or doubles it. We can then see a ๐–ฃ๐–ฎ๐–ซ๐–ซ๐– + ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program as an generalized multi-headed two-way DFA where heads can read words rather than letters and that can perform nonregular string computations in the transitions.

We can also see an ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program as a generalized multi-headed two-way NFA, and a linear ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program as a further generalization. There, reading letters at head positions is replaced with a declarative programming language that checks if the rule applies and computes the operations. The increasing gap to traditional automata is reflected in the fact that determinism is now expensive to verify. Finally, for ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€, we can understand every relation symbol in the body of a rule as a call of a subroutine, which splits the automaton into parallel copies, each of which must terminate. At this general level, however, the gap to traditional automata is so large the connection is no longer natural.

4 Applying the Framework: Simulating DRX

In this section, we tailor a fragment from our range to a specific application. We apply our model to simulate deterministic regex, introduced by Freydenberger and Schmidย [17].

We define regular expressions as usual (see e.g. [1]) and extend these with back-references to define regex. For xโˆˆฮž and a regular expression ฮด, we thus add the expressions x and โŸจx:ฮดโŸฉ to our syntax. The expression โŸจx:ฮดโŸฉ matches ฮด and saves the string that is matched by ฮด in the memory x. All further occurrences of x are recalls of memoryย x, which we match using the content of the variable that was saved earlier. For readability, we use โ‹… for concatenation.

Example 4.1.

Let ฮณโ‰”โŸจx:(๐–บโˆจ๐–ป)+โŸฉโ‹…๐–ฝโ‹…x. Then ฮณ matches all words uโข๐–ฝโขu where uโˆˆ{๐–บ,๐–ป}+.

Every regular expression can be converted into a finite automaton using the classical construction from Glushkov [19]. If the result of this construction is deterministic then the regular expression is deterministic. Deterministic regular expressions define a strict subclass of the regular languages, and have a more efficient membership problem than general regular expressions. For a deterministic regular expression ฮณ and a word w, we can decide membership in ๐’ช(|ฮฃ||ฮณ|+|w|) (see [9, 34]) or ๐’ชโข(|ฮณ|+|w|โขlogโกlogโก|ฮณ|) (see [21]).

Freydenberger and Schmid [17] combined regex and deterministic regular expressions to define deterministic regex, which can define nonregular languages, and have an efficient membership problem; this can be decided in ๐’ชโข(|ฮฃ|โข|ฮณ|โขn+kโข|w|) for a word w and a deterministic regex ฮณ with k distinct variables and n total occurrences of terminal symbols or variable references (see Theorem 5 of [17]). Similarly to the Glushkov [19] construction, every regex ฮณ has a corresponding memory finite automaton with trap state (๐–ฃ๐–ณ๐–ฌ๐–ฅ๐– ) Mฮณ (where the trap state handles memory recall failures), and ฮณ is deterministic if Mฮณ is deterministic (seeย [17] for details). We call the class of all deterministic regex ๐–ฃ๐–ฑ๐–ท.

Example 4.2.

Let ฮณ be the regex defined in Example 4.1, and let ฮณโ€ฒโ‰”โŸจx:(๐–บโˆจ๐–ป)โˆ—โŸฉโ‹…x. Then โ„’โข(ฮณ)โ‰”{uโข๐–ฝโขuโˆฃuโˆˆ{๐–บ,๐–ป}โˆ—} and โ„’โข(ฮณโ€ฒ)โ‰”{uโขuโˆฃuโˆˆ{๐–บ,๐–ป}โˆ—}. Then ฮณโˆˆ๐–ฃ๐–ฑ๐–ท and ฮณโ€ฒโˆ‰๐–ฃ๐–ฑ๐–ท, as Mฮณ is deterministic and Mฮณโ€ฒ is not deterministic. Intuitively, this is because ฮณ has only one choice of when to stop matching the first u, whereas ฮณโ€ฒ does not.

As in [17], we can assume w.โ€‰l.โ€‰o.โ€‰g. that a regex does not recall empty memories, does not start to save into a memory that is already being saved to, and does not reset a memory to its initial value. We first show that for any ฮณโˆˆ๐–ฃ๐–ฑ๐–ท, we can express โ„’โข(ฮณ) in ๐–ฒ๐–ฃ-๐–ฃ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€, but with a large number of rules. We can parameterize such a program using the number of memories, terminal symbols and memory recalls in ฮณ. Here we use ฮต-semantics111In ฮต-semantics we treat everything not initialized as ฮต (see e.g. Section 8.2.1 in [17])..

Theorem 4.3.

Let ฮณโˆˆ๐–ฃ๐–ฑ๐–ท have k memories. Let the total number of terminal symbols and memory recalls in ฮณ be n. We can express โ„’โข(ฮณ) with an ๐–ฒ๐–ฃ-๐–ฃ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ programย P that has at most k+n+2 relation symbols and at most kโข(|ฮฃ|+1)+nโข(n+3)+1 rules.

The program P that is the result of our construction requires a rule for each pair (๐–บ,โ„“), for ๐–บโˆˆฮฃ and 0โ‰คโ„“โ‰คkโˆ’1. This is because the syntax permits matching only one letter at a time, and so the program requires many rules to read memories letter by letter. We can achieve this with a much more concise program if we relax the syntax slightly, to what we call ๐–ฃ๐–ฎ๐–ซ๐–ซ๐– + ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€. This sits between ๐–ฃ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ and deterministic linear ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€, and it retains the polynomial time checking of determinism. To define this, we introduce the concept of a symbol being uniquely defined for top-down evaluation.

Definition 4.4.

Let ฯ be an ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ rule. We inductively define uniquely defined symbols: As base rules, the universe variable ๐”ฒ, every ๐–บโˆˆฮฃ and every xโˆˆ๐—๐—ˆ๐—‰โข(ฯ) are uniquely defined. Then, if we have some pattern equationย ฯ† where Varโข(ฯ†)={x1,โ€ฆโขxk} and exactly one variable xi is not uniquely defined, then xi becomes uniquely defined.

Example 3.6 illustrates how top variables are uniquely defined for top-down evaluation.

Example 4.5.

Let ฯ=R1โข(x1,x2)โ†ฯ†1,ฯ†2,R2โข(y1,y2) be a linear ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ rule where ฯ†1=x1โ‰x2โขy1 and ฯ†2=y1โ‰y2โขy2. From the base rules, x1 and x2 are uniquely defined. In the first iteration, as all zโˆˆVarโข(ฯ†1)โˆ–{y1} are uniquely defined, y1 is uniquely defined. In the second iteration, as all zโˆˆVarโข(ฯ†2)โˆ–{y2} are uniquely defined, y2 is uniquely defined.

In deterministic linear and ๐–ฃ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€, by Definition 3.9, we ensure local determinism by requiring the relation Wฯ to be a partial function for every rule ฯ, implicitly requiring every variable to be uniquely defined. We ensure local determinism in ๐–ฃ๐–ฎ๐–ซ๐–ซ๐– + ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ by reasoning directly over uniquely defined variables.

Definition 4.6.

Let P be a linear ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program with rule set ฮฆ. We say P is a ๐–ฃ๐–ฎ๐–ซ๐–ซ๐– + ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program if it is globally deterministic and for every rule ฯโˆˆฮฆ, every variable that occurs in ฯ is uniquely defined.

Example 4.7.

The ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program P given in Example 2.3 that is not a ๐–ฃ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program (see Example 3.16) is in fact a ๐–ฃ๐–ฎ๐–ซ๐–ซ๐– + ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program.

We define ๐–ฒ๐–ฃ for ๐–ฃ๐–ฎ๐–ซ๐–ซ๐– + the same way as for ๐–ฃ๐–ฎ๐–ซ๐–ซ๐–  (Definitionย 3.22). ๐–ฒ๐–ฃ-๐–ฃ๐–ฎ๐–ซ๐–ซ๐– + ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ generalizes ๐–ฒ๐–ฃ-๐–ฃ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€. We now show that despite being extensions, ๐–ฃ๐–ฎ๐–ซ๐–ซ๐– + ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ retains the low data complexity (Theoremย 4.8) and the efficient checking of determinism (Proposition 4.9) of ๐–ฃ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€, and ๐–ฒ๐–ฃ-๐–ฃ๐–ฎ๐–ซ๐–ซ๐– + ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ retains the low combined complexity of ๐–ฒ๐–ฃ-๐–ฃ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ (Theoremย 4.8).

Theorem 4.8.

๐–ฃ๐–ฎ๐–ซ๐–ซ๐– + ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ captures ๐–ซ๐–ฎ๐–ฆ๐–ฒ๐–ฏ๐– ๐–ข๐–ค, and given a word w and an ๐–ฒ๐–ฃ-๐–ฃ๐–ฎ๐–ซ๐–ซ๐– + ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program P with n relation symbols and maximum relation symbol arity k, we can decide wโˆˆโ„’โข(P) in ๐’ชโข(|w|โขk) time after ๐’ชโข(nโข|ฮฃ|) preprocessing.

Proposition 4.9.

Membership of ๐–ฃ๐–ฎ๐–ซ๐–ซ๐– + ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ can be decided in polynomial time.

Where a ๐–ฃ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program is required to match letter-by-letter, this is no longer the case for ๐–ฃ๐–ฎ๐–ซ๐–ซ๐– + ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€, and a consequence of this is that we can simulate a deterministic regex with a much more concise ๐–ฒ๐–ฃ-๐–ฃ๐–ฎ๐–ซ๐–ซ๐– + ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program.

Theorem 4.10.

Let ฮณโˆˆ๐–ฃ๐–ฑ๐–ท have k memories. Let the total number of terminal symbols and memory recalls in ฮณ be n. We can express โ„’โข(ฮณ) with an ๐–ฒ๐–ฃ-๐–ฃ๐–ฎ๐–ซ๐–ซ๐– + ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ programย P that has at most n+2 relation symbols and at most nโข(n+3)+1 rules.

As the following example shows, the extra flexibility permitted in the syntax of ๐–ฃ๐–ฎ๐–ซ๐–ซ๐– + ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ allows us to write programs much more conveniently than for ๐–ฃ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€.

Example 4.11.

In ๐–ฃ๐–ฎ๐–ซ๐–ซ๐– + ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€, we can process a memory xnโ€ฒ, with a rule:

Qโ€ฒโข(u,x1โ€ฒ,โ€ฆ,xkโ€ฒ)โ†Qโข(v,x1,โ€ฆ,xk),uโ‰xnโ€ฒโขv,x1โ‰x1โ€ฒโขxnโ€ฒ,โ€ฆ,xkโ‰xkโ€ฒโขxnโ€ฒ.

In ๐–ฃ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€, to process a memory xnโ€ฒ, we use a new relation symbol Rโ„“ and rules:

Qโ€ฒโข(u,x1โ€ฒ,โ€ฆ,xkโ€ฒ) โ†Qโข(v,x1,โ€ฆ,xk),Rโ„“โข(u,v,xn,x1โ€ฒ,x1,โ€ฆ,xโ€ฒโขโ„“,xโ„“);
Rโ„“โข(u,v,x1โ€ฒ,x1,โ€ฆ,xโ„“โ€ฒ,xโ„“) โ†xnโ‰ฮต,uโ‰v,x1โ€ฒโ‰x1,โ€ฆ,xโ„“โ€ฒโ‰xโ„“;

and for every ๐–บโˆˆฮฃ a rule:

Rโ„“โข(u,v,xn,x1โ€ฒ,x1,โ€ฆ,xโ„“โ€ฒ,xโ„“)โ†xnโ‰๐–บโขxnโ€ฒ,uโ‰๐–บโขuโ€ฒ,x1โ€ฒโ€ฒโ‰x1โ€ฒโข๐–บ,โ€ฆ,xโ„“โ€ฒโ€ฒโ‰xโ„“โ€ฒโข๐–บ,Rโ„“โข(uโ€ฒ,v,xnโ€ฒ,x1โ€ฒโ€ฒ,x1,โ€ฆ,xโ„“โ€ฒโ€ฒ,xโ„“).

We have thus demonstrated the design of a tailored fragment in our range spanned by ๐–ฃ๐–ฎ๐–ซ๐–ซ๐–  and deterministic linear ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€, in this case by adding word equations with uniquely defined variables. The ๐–ฃ๐–ฎ๐–ซ๐–ซ๐– + and ๐–ฒ๐–ฃ-๐–ฃ๐–ฎ๐–ซ๐–ซ๐– + fragments retain the desirable properties of the restrictive ๐–ฃ๐–ฎ๐–ซ๐–ซ๐–  and ๐–ฒ๐–ฃ-๐–ฃ๐–ฎ๐–ซ๐–ซ๐–  fragments. As discussed in Sectionย 3.5, we can see ๐–ฃ๐–ฎ๐–ซ๐–ซ๐– + ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ as a generalization of multi-headed DFAs which is then not restricted to left-to-right parsing. As the next example shows, it can express context-free languages which cannot be recognized by a DPDA (see e.g [22]).

Example 4.12.

The palindrome language can be expressed deterministically by the ๐–ฃ๐–ฎ๐–ซ๐–ซ๐– + ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program with the rules Ansโข()โ†Rโข(๐”ฒ); Rโข(x)โ†xโ‰ฮต and the rules Rโข(x)โ†xโ‰๐–บโขyโข๐–บ,Rโข(y); Rโข(x)โ†xโ‰๐–บ for each ๐–บโˆˆฮฃ. This evaluates in โŒˆ|w|/2โŒ‰+1 steps, which is linear.

Furthermore, as we can simulate ๐–ฃ๐–ฑ๐–ท, another feature that can be added to these tailored fragments are atoms that match ๐–ฃ๐–ฑ๐–ท, which we can solve as a subroutine. We say a ๐–ฃ๐–ฑ๐–ท-constraint is an expression (xโขโˆˆห™โขฮณ) for xโˆˆฮž and a deterministic regex ฮณ, that denotes x is mapped to an element uโˆˆโ„’โข(ฮณ) and u is a factor of the input word w.

To stay in ๐–ซ๐–ฎ๐–ฆ๐–ฒ๐–ฏ๐– ๐–ข๐–ค, we must ensure programs remain locally and globally deterministic. As ๐–ฃ๐–ฑ๐–ท-constrains only check if a variable matches a deterministic regex, we can decide local determinism as we would without the ๐–ฃ๐–ฑ๐–ท-constraints. To retain global determinism (Definition 3.9), if using deterministic regexes ฮณ1,โ€ฆ,ฮณn in multiple rules with the same head relation symbol, we must ensure that always at most one rule can accept. We thus need to decide if โ‹‚i=1nโ„’โข(ฮณi)=โˆ…. Unfortunately, intersection-emptiness for ๐–ฃ๐–ฑ๐–ท is undecidable (see Theorem 9 of [17]). We say ฮณโˆˆ๐–ฃ๐–ฑ๐–ท is variable-star-free if each of its sub-regexes under a Kleene-star or Kleene-plus do not contain any variable operations. Intersection-emptiness for variable-star-free ๐–ฃ๐–ฑ๐–ท is at least word equations-hard (see Proposition 8 of [17]). Thus, adding ๐–ฃ๐–ฑ๐–ท-constraints means we lose efficient checking of determinism, another example of the trade-off between richer syntax and efficient determinism checking.

We can keep efficient determinism checking if we limit where we include ๐–ฃ๐–ฑ๐–ท-constraints. If |ฮฆR|=1 for some relation symbol R, global determinism for ฮฆR is inherent. If we limit inclusions of ๐–ฃ๐–ฑ๐–ท-constraints to rules ฯโˆˆฮฆR where |ฮฆR|=1, we can verify determinism by verifying every ๐–ฃ๐–ฑ๐–ท-constraint is in such rules. We can decide this in polynomial time. Combining this with Propositionย 4.9, we can check the whole programโ€™s determinism in polynomial time.

Example 4.13.

The ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program P:

Ansโข() โ†R1โข(๐”ฒ); R1โข(x) โ†xโ‰๐–บโขyโข๐–ผ,R1โข(y);
R2โข(x) โ†(xโˆˆห™โŸจy:(๐–ผโˆจ๐–ฝ)+โŸฉโ‹…๐–บโ‹…y); R1โข(x) โ†xโ‰๐–ปโขyโข๐–ป,R2โข(y).

defines the language {๐–บzโข๐–ปโขwโข๐–บโขwโข๐–ป๐–ผzโˆฃzโˆˆโ„•โˆงwโˆˆ{๐–ผ,๐–ฝ}โˆ—}, and |ฮฆR2|=1, where R2 is the only head relation symbol that has rules containing ๐–ฃ๐–ฑ๐–ท-constraints.

โ–ถย Remark 4.14.

Another extension of ๐–ฅ๐–ข is ๐–ฅ๐–ขโข[๐–ฑ๐–ค๐–ฆ], which captures the expressive power of generalized core spanners and also has ๐–ซ๐–ฎ๐–ฆ๐–ฒ๐–ฏ๐– ๐–ข๐–ค date complexity. However, ๐–ฅ๐–ขโข[๐–ฑ๐–ค๐–ฆ] does not capture ๐–ซ๐–ฎ๐–ฆ๐–ฒ๐–ฏ๐– ๐–ข๐–ค, as it cannot express the language {anโขbnโˆฃnโˆˆโ„•} (see [16]). We can naturally express this language in ๐–ฃ๐–ฎ๐–ซ๐–ซ๐– + ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€:

Example 4.15.

We can express {anโขbnโˆฃnโˆˆโ„•} with the ๐–ฃ๐–ฎ๐–ซ๐–ซ๐– + ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ program that has the rules: Ansโข()โ†Rโข(๐”ฒ); Rโข(x)โ†xโ‰ฮต; and Rโข(x)โ†xโ‰๐–บโขyโข๐–ป,Rโข(y).

5 Conclusions and Future Work

Freydenberger and Peterfreund [16] recently proposed ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ as an extension of the logic ๐–ฅ๐–ข with recursion. ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ captures the complexity class ๐–ฏ and can be seen as a language for expressing relations that can be used in spanner selections. Furthermore, independent of the spanner connection, ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ can also be seen as a version of ๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ on strings with a decidable model checking problem, where the same problem for a previous approach towards ๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ on strings is undecidable. We first showed that combined complexity of ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ is ๐–ค๐–ท๐–ฏ-complete, presenting an opportunity for optimization of model checking. In this paper, we identified fragments of ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ with an efficient model checking problem for both data and combined complexity by performing an analysis of four restrictions: linearity, determinism, one letter lookahead and strictly decreasing.

In Section 3.1 we proposed linear ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€, a fragment that captures ๐–ญ๐–ซ๐–ฎ๐–ฆ๐–ฒ๐–ฏ๐– ๐–ข๐–ค. In Sectionย 3.2 we identified and removed nondeterminism. Thus, deterministic linear ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ captures ๐–ซ๐–ฎ๐–ฆ๐–ฒ๐–ฏ๐– ๐–ข๐–ค. However, this restriction cannot be checked efficiently.

In Section 3.3 we imposed the oneย letter lookahead restriction on a programโ€™s word equations to obtain deterministic ๐–ฎ๐–ซ๐–ซ๐–  (๐–ฃ๐–ฎ๐–ซ๐–ซ๐– ) ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ that has both desirable properties: it captures ๐–ซ๐–ฎ๐–ฆ๐–ฒ๐–ฏ๐– ๐–ข๐–ค and determinism can be checked in polynomial time. We also showed that the combined complexity of model checking for both linear ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ and ๐–ฃ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ is ๐–ฏ๐–ฒ๐–ฏ๐– ๐–ข๐–ค-complete. Therefore, we added the further restriction of strictly decreasing (๐–ฒ๐–ฃ), and showed that ๐–ฒ๐–ฃ-๐–ฃ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ has linear combined complexity. We thus established the endpoints of a range of ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ fragments that all capture ๐–ซ๐–ฎ๐–ฆ๐–ฒ๐–ฏ๐– ๐–ข๐–ค, and showed how we can restrict this further to reduce combined complexity to linear time. Hence, we have paved the way to construct further fragments which can be tailored for particular applications.

We illustrated tailoring a fragment in Section 4, where we constructed ๐–ฃ๐–ฎ๐–ซ๐–ซ๐– + ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€. This allows us to straightforwardly model deterministic regex without the need for a technically involved automata model. Yet, as for ๐–ฃ๐–ฎ๐–ซ๐–ซ๐–  ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€: we capture ๐–ซ๐–ฎ๐–ฆ๐–ฒ๐–ฏ๐– ๐–ข๐–ค, determinism can be checked in polynomial time, and strictly decreasing programs have linear combined complexity. We then showed how deterministic regex can be used as atoms, and how we can restrict where they are used so as to not compromise on these desirable properties.

As ๐–ซ๐–ฎ๐–ฆ๐–ฒ๐–ฏ๐– ๐–ข๐–ค is closed under complement, we can add stratified negation without affecting the combined complexity. Other convenient additions could include predicates that express properties of a string such as โ€œis a letterโ€, and functions that build on these to express properties such as โ€œis the first/last letterโ€. We could also add regular constraints: atoms that work in the same way as ๐–ฃ๐–ฑ๐–ท-constraints, but match a classical regular expression rather than a deterministic regex, as in ๐–ฅ๐–ขโข[๐–ฑ๐–ค๐–ฆ]. Using these or other syntactic additions, we could further map the range between ๐–ฃ๐–ฎ๐–ซ๐–ซ๐–  and deterministic linear ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€, and investigate how these additions affect checking membership in the fragment. Furthermore, we could also look to find sufficient conditions to ensure membership can be easily checked.

There are structure criteria for conjunctive queries in ๐–ฅ๐–ข (๐–ฅ๐–ข-๐–ข๐–ฐ) that improve efficiency such as acyclicity, as demonstrated by Freydenberger and Thompson [18]. Every ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ rule can be seen as a conjunctive query in ๐–ฅ๐–ข. We could therefore add such structure criteria to ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ rules and examine how this improves the complexity of ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ fragments further. Another direction for research is to look at inexpressibility for these fragments. However, even for ๐–ฅ๐–ข (as opposed to ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€), finding inexpressibility results is challenging (see [38]). Another logical next step is to use our restrictions for ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ to identify natural fragments in the spanner setting with desirable properties.

Finally, as we can also see ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€ as a generalization of range concatenation grammars (๐–ฑ๐–ข๐–ฆ)ย (seeย [16]), parsing techniques for these grammars such as the algorithm in Boullier [8] could be adapted to ๐–ฅ๐–ข-๐–ฃ๐–บ๐—๐–บ๐—…๐—ˆ๐—€.

References

  • [1] Serge Abiteboul, Richard Hull, and Victor Vianu. Foundations of Databases. Addison-Wesley, 1995. URL: http://webdam.inria.fr/Alice/.
  • [2] Alfredย V. Aho. Algorithms for Finding Patterns in Strings, pages 255โ€“300. MIT Press, 1991.
  • [3] Antoine Amarilli, Pierre Bourhis, Stefan Mengel, and Matthias Niewerth. Constant-Delay Enumeration for Nondeterministic Document Spanners. SIGMOD Rec., 49(1):25โ€“32, 2020. doi:10.1145/3422648.3422655.
  • [4] Dana Angluin. Finding Patterns Common to a Set of Strings. J. Comput. Syst. Sci., 21(1):46โ€“62, 1980. doi:10.1016/0022-0000(80)90041-0.
  • [5] Mena Badieh Habibย Morgan and Maurice van Keulen. Information Extraction for Social Media. In Proc. SWAIE, pages 9โ€“16, 2014. doi:10.3115/v1/W14-6202.
  • [6] Pablo Barcelรณ and Pablo Muรฑoz. Graph Logics with Rational Relations: The Role of Word Combinatorics. ACM Trans. Comput. Logic, 18(2), 2017. doi:10.1145/3070822.
  • [7] Anthonyย J. Bonner and Giansalvatore Mecca. Sequences, Datalog, and Transducers. J. Comput. Syst. Sci., 57(3):234โ€“259, 1998. doi:10.1006/jcss.1998.1562.
  • [8] Pierre Boullier. Range Concatenation Grammars, pages 269โ€“289. New Developments in Parsing Technology. Springer, 2004. doi:10.1007/1-4020-2295-6_13.
  • [9] Anne Brรผggemann-Klein. Regular expressions into finite automata. Theoretical Computer Science, 120(2):197โ€“213, 1993. doi:10.1016/0304-3975(93)90287-4.
  • [10] Evgeny Dantsin, Thomas Eiter, Georg Gottlob, and Andrei Voronkov. Complexity and Expressive Power of Logic Programming. ACM Computing Surveys, 33(3):374โ€“425, 2001. doi:10.1145/502807.502810.
  • [11] Heinz-Dieter Ebbinghaus and Jรถrg Flum. Finite Model Theory. Springer Monographs in Mathematics, 2nd edition, 1999.
  • [12] Ronald Fagin, Benny Kimelfeld, Frederick Reiss, and Stijn Vansummeren. Document Spanners: A Formal Approach to Information Extraction. J. ACM, 62(2), 2015. doi:10.1145/2699442.
  • [13] Fernando Florenzano, Cristian Riveros, Martรญn Ugarte, Stijn Vansummeren, and Domagoj Vrgoฤ. Efficient Enumeration Algorithms for Regular Document Spanners. ACM Trans. Database Syst., 45(1), 2020. doi:10.1145/3351451.
  • [14] Dominikย D. Freydenberger. A Logic for Document Spanners. Theory of Computing Systems, 63(7):1679โ€“1754, 2019. doi:10.1007/s00224-018-9874-1.
  • [15] Dominikย D. Freydenberger and Mario Holldack. Document Spanners: From Expressive Power to Decision Problems. Theory Comput. Syst., 62(4):854โ€“898, 2018. doi:10.1007/S00224-017-9770-0.
  • [16] Dominikย D. Freydenberger and Liat Peterfreund. The Theory of Concatenation over Finite Models. In Proc. ICALP 2021, pages 130:1โ€“130:17, 2021. doi:10.4230/LIPIcs.ICALP.2021.130.
  • [17] Dominikย D. Freydenberger and Markusย L. Schmid. Deterministic regular expressions with back-references. J. Comput. Syst. Sci., 105:1โ€“39, 2019. doi:10.1016/J.JCSS.2019.04.001.
  • [18] Dominikย D. Freydenberger and Samย M. Thompson. Splitting Spanner Atoms: A Tool for Acyclic Core Spanners. In Proc. ICDT 2022, pages 10:1โ€“10:18, 2022. doi:10.4230/LIPIcs.ICDT.2022.10.
  • [19] Vย M Glushkov. The Absract Theory of Automata. Russian Mathematical Surveys, 16(5), 1961. doi:10.1070/RM1961v016n05ABEH004112.
  • [20] Georg Gottlob and Christos Papadimitriou. On the complexity of single-rule datalog queries. Information and Computation, 183(1):104โ€“122, 2003. doi:10.1016/S0890-5401(03)00012-9.
  • [21] B.ย Groz and S.ย Maneth. Efficient testing and matching of deterministic regular expressions. J. Comp. Syst. Sci., 89:372โ€“399, 2017. doi:10.1016/j.jcss.2017.05.013.
  • [22] Johnย E. Hopcroft, Rajeev Motwani, and Jeffreyย D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, 3rd edition, 2007.
  • [23] Artur Jez. Word Equations in Nondeterministic Linear Space. In Proc. ICALP 2017, pages 95:1โ€“95:13, 2017. doi:10.4230/LIPIcs.ICALP.2017.95.
  • [24] Juhani Karhumรคki, Filippo Mignosi, and Wojciech Plandowski. The expressibility of languages and relations by word equations. J. ACM, 47(3):483โ€“505, 2000. doi:10.1145/337244.337255.
  • [25] K.ย N. King. Alternating Multihead Finite Automata. Theoretical Computer Science, 61(2):149โ€“174, 1988. doi:10.1016/0304-3975(88)90122-3.
  • [26] S.ย C. Kleene. Representation of Events in Nerve Nets and Finite Automata, pages 3โ€“42. Princeton University Press, 1956. doi:10.1515/9781400882618-002.
  • [27] Antoni Koscielski and Leszek Pacholski. Complexity of Makaninโ€™s algorithm. J. ACM, 43(4):670โ€“684, 1996. doi:10.1145/234533.234543.
  • [28] Leonid Libkin. Complexity of First-Order Logic, pages 87โ€“111. Elements of Finite Model Theory. Springer, 2004. doi:10.1007/978-3-662-07003-1_6.
  • [29] Dean Light, Ahmad Aiashi, Mahmoud Diab, Daniel Nachmias, Stijn Vansummeren, and Benny Kimelfeld. SpannerLib: Embedding Declarative Information Extraction in an Imperative Workflow. Proc. VLDB Endow., 17(12):4281โ€“4284, 2024. doi:10.14778/3685800.3685855.
  • [30] Carsten Lutz and Leif Sabellek. Ontology-Mediated Querying with the Description Logic EL: Trichotomy and Linear Datalog Rewritability. In IJCAI 2017, pages 1181โ€“1187, 2017. doi:10.24963/ijcai.2017/164.
  • [31] S.ย Miyano, A.ย Shinohara, and T.ย Shinohara. Which classes of Elementary Formal Systems are polynomial-time learnable? In Proc. ALT 1992, pages 139โ€“150, 1992.
  • [32] Yoav Nahshon, Liat Peterfreund, and Stijn Vansummeren. Incorporating information extraction in the relational database model. In Proc. WebDB 2016, 2016. doi:10.1145/2932194.2932200.
  • [33] Liat Peterfreund, Balder ten Cate, Ronald Fagin, and Benny Kimelfeld. Recursive Programs for Document Spanners. In Proc. ICDT 2019, pages 13:1โ€“13:18, 2019. doi:10.4230/LIPIcs.ICDT.2019.13.
  • [34] J.ย L. Ponty, D.ย Ziadi, and J.ย M. Champarnaud. A new quadratic algorithm to convert a regular expression into an automaton. In Automata Implementation, pages 109โ€“119. Springer, 1997.
  • [35] Markusย L. Schmid and Nicole Schweikardt. Spanner Evaluation over SLP-Compressed Documents. In Proc. PODS 2021, pages 153โ€“165, 2021. doi:10.1145/3452021.3458325.
  • [36] Luc Segoufin. Enumerating with constant delay the answers to a query. In Proc. ICDT 2013, pages 10โ€“20, 2013. doi:10.1145/2448496.2448498.
  • [37] Samย M. Thompson and Dominikย D. Freydenberger. Languages Generated by Conjunctive Query Fragments of FC[REG]. In Proc. DLT 2023, pages 233โ€“245, 2023. doi:10.1007/978-3-031-33264-7_19.
  • [38] Samย M. Thompson and Dominikย D. Freydenberger. Generalized Core Spanner Inexpressibility via Ehrenfeucht-Fraรฏssรฉ Games for FC. Proc. ACM Manag. Data, 2(2), 2024. doi:10.1145/3651143.
  • [39] Yanshan Wang, Liwei Wang, Majid Rastegar-Mojarad, Sungrim Moon, Feichen Shen, Naveed Afzal, Sijia Liu, Yuqun Zeng, Saeed Mehrabi, Sunghwan Sohn, and Hongfang Liu. Clinical information extraction applications: A literature review. Journal of Biomedical Informatics, 77:34โ€“49, 2018. doi:10.1016/j.jbi.2017.11.011.