
Dynamic Direct Access of MSO Query Evaluation over Strings

Pierre Bourhis, Univ. Lille, CNRS, Inria, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France
Florent Capelli, Univ. Artois, CNRS, UMR 8188, Centre de Recherche en Informatique de Lens (CRIL), F-62300 Lens, France
Stefan Mengel, Univ. Artois, CNRS, UMR 8188, Centre de Recherche en Informatique de Lens (CRIL), F-62300 Lens, France
Cristian Riveros, Pontificia Universidad Católica de Chile, Santiago, Chile; Millennium Institute for Foundational Research on Data, Santiago, Chile
Abstract

We study the problem of evaluating a Monadic Second Order (MSO) query over strings under updates in the setting of direct access. We present an algorithm that, given an MSO query with first-order free variables represented by an unambiguous variable-set automaton 𝒜 with state set Q and variables X, and a string s, computes a data structure in time O(|Q|^ω |X|^2 |s|) and then, given an index i, retrieves, using the data structure, the i-th output of the evaluation of 𝒜 over s in time O(|Q|^ω |X|^3 log^2(|s|)), where ω is the exponent of matrix multiplication. Ours is the first efficient direct access algorithm for MSO query evaluation over strings; such algorithms had so far only been studied for first-order queries and conjunctive queries over relational data.

Our algorithm gives the answers in lexicographic order where, in contrast to the setting of conjunctive queries, the order between variables can be freely chosen by the user without degrading the runtime. Moreover, our data structure can be updated efficiently after changes to the input string, allowing more powerful updates than in the enumeration literature, e.g. efficient deletion of substrings, concatenation and splitting of strings, and cut-and-paste operations. Our approach combines a matrix representation of MSO queries and a novel data structure for dynamic word problems over semi-groups which yields an overall algorithm that is elegant and easy to formulate.

Keywords and phrases:
Query evaluation, direct access, MSO queries
Copyright and License:
© Pierre Bourhis, Florent Capelli, Stefan Mengel, and Cristian Riveros; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Database theory
Related Version:
Full version with missing proofs: https://arxiv.org/abs/2409.17329
Funding:
This work was partially funded by ANR projects CQFD 18-CE23-0003, EQUUS ANR-19-CE48-0019 and KCODA ANR-20-CE48-0004, by ANID - Millennium Science Initiative Program - Code ICN17_002, by ANID Fondecyt Regular project 1230935, and by the Alexander von Humboldt Foundation.
Acknowledgements:
This work benefited from Dagstuhl Seminar 24032, Representation, Provenance, and Explanations in Database Theory and Logic.
Editors:
Sudeepa Roy and Ahmet Kara

1 Introduction

The aim of direct access algorithms in query answering is to represent query answers in a compact and efficiently computable way without materializing them explicitly, while still allowing efficient access as if the answers were stored in an array. This is modeled in a two-stage framework where, first, in a preprocessing phase, given an input, one computes a data structure which then, in the so-called access phase, allows efficiently accessing individual answers to the query by the position they would have in the materialized query result.

For this approach, there is a trade-off between the runtime of the preprocessing and the access phase: on the one hand, one could simply materialize the answers and then answer all queries in constant time; on the other hand, one could just not preprocess at all and then for every access request evaluate from scratch. When designing direct access algorithms, one thus tries to find interesting intermediate points between these extremes, mostly giving more importance to the access time than the preprocessing, since the latter has to be performed only once for every input while the former can be performed an arbitrary number of times.

Direct access algorithms were first introduced by Bagan, Durand, Grandjean, and Olive in [8] in the context of sampling and enumeration for first-order queries (both can easily be reduced to direct access). However, it was arguably the very influential thesis of Brault-Baron [14] that established direct access as an independent regime for query answering. Since then, there has been a large amount of work extending and generalizing Brault-Baron’s work in several directions [19, 18, 15, 23, 17].

Curiously absent from this line of work is the query evaluation of Monadic Second Order logic (MSO), which in other contexts, like enumeration algorithms, has often been studied in parallel with conjunctive queries. In this paper, we make a first step in that direction, giving a direct access algorithm for MSO queries with free first-order variables over strings.

Our approach to direct access decomposes into several steps: first, we reduce direct access to a counting problem by using binary search to find the answers to be accessed, as is done, e.g., in [17, 18, 15]. We then express this counting problem in terms of matrix multiplication, where we have to compute the result of a product of a linear number of small matrices. To enable the binary search, we then require that this product can be efficiently maintained under substitutions of some of the matrices. This type of problem is known in the literature as a dynamic word problem, where the input is a word whose letters are elements from some semi-group whose product has to be maintained under element substitutions. Since we want to allow more powerful updates to the input later on, we cannot use the known data structures for dynamic word problems directly. We instead opt for a simpler approach that uses an extension of binary search trees to store our matrix product. The price of this is modest, as it leads to a data structure that is only a doubly logarithmic factor slower than the provably optimal data structures for dynamic word problems. Plugging these ingredients together leads to a direct access algorithm with preprocessing time linear in the size of the input word and polylogarithmic access time. Moreover, if the query to evaluate is given in the form of an unambiguous automaton – which is well known to be equivalent in expressivity to MSO queries [16] – then all runtime dependencies on the query are polynomial.

One advantage of our search-tree-based data structure is that, by relying heavily on known tree balancing techniques, it allows updating the input string efficiently without running the preprocessing from scratch. Such update operations have been studied before in the enumeration literature for first-order logic [12], conjunctive queries [28, 11, 12] and, most intensively, for MSO on words and trees [10, 32, 36, 3, 35, 5, 29]. Compared to this latter line of work, we support more powerful updates: instead of additions and deletions of single letters (resp. nodes) as in most of these works, our data structure efficiently allows typical text operations like deletion of whole parts of the text, cut and paste, and concatenation of documents. The most similar update operations in the literature are those from [39], which besides our operations allow persistence of the data structure and thus, in contrast to us, also copy-and-paste operations. Still, our updates are vastly more powerful than in all the rest of the literature. Moreover, we are the first to propose updates to direct access data structures; all works with updates mentioned above only deal with enumeration.

Another advantage of our algorithm is that it easily allows direct access to the query result for any lexicographic order. It has been observed that ordering the answers by a preference stated by the user is desirable. Thus, there are several works that tackle this question for direct access, enumeration, and related problems, see e.g. [15, 18, 9, 21, 20, 40, 41]. While the lexicographic orders we consider are more restricted, we still consider them very natural. Also, it is interesting to see that, while there is an unavoidable runtime price to pay for ranked access with respect to lexicographic orders for conjunctive queries [18, 15], we here get arbitrary lexicographic access for free by simply changing the order in which variables are treated during the binary search. As a consequence, we can even choose the order at access time, as the preprocessing is completely agnostic to the order, similarly to what was shown for the ranked enumeration of conjunctive query answers [20], which however does not consider a dynamic setting.

2 Preliminaries

Sets and strings.

For a set A, we denote by 2^A the power set of A. For n ∈ ℕ with n ≥ 1, we denote by [n] the set {1, …, n}. We use Σ to denote a finite alphabet and Σ* the set of all strings (also called words) with symbols in Σ. We will usually denote strings by s, s′, s_i and similar. For every s1, s2 ∈ Σ*, we write s1 · s2 (or s1 s2 for short) for the concatenation of s1 and s2. If s = a1 ⋯ an ∈ Σ*, we write |s| = n for the length of s. We denote by ϵ ∈ Σ* the string of length 0 (i.e., |ϵ| = 0), also called the empty string. For s = a1 ⋯ an ∈ Σ*, we denote by s[i, j] the substring ai ⋯ aj, by s[..i] the prefix a1 ⋯ ai, and by s[i..] the suffix ai ⋯ an.
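The 1-indexed, inclusive conventions above can be translated to Python's 0-indexed slicing as follows (a small illustration of the notation, not part of the paper):

```python
# Illustration of the paper's 1-indexed string notation via Python slicing.

def substring(s: str, i: int, j: int) -> str:
    """s[i, j] = a_i ... a_j (1-indexed, both endpoints included)."""
    return s[i - 1:j]

def prefix(s: str, i: int) -> str:
    """s[..i] = a_1 ... a_i."""
    return s[:i]

def suffix(s: str, i: int) -> str:
    """s[i..] = a_i ... a_n."""
    return s[i - 1:]

s = "abbab"
assert len(s) == 5
assert substring(s, 2, 4) == "bba"
assert prefix(s, 2) == "ab"
assert suffix(s, 4) == "ab"
assert prefix(s, 2) + suffix(s, 3) == s   # s[..i] concatenated with s[i+1..] gives back s
```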

Mappings.

Given a finite set X of variables and n ∈ ℕ, in this work we will work heavily with mappings of the form μ: X → [n] as our outputs. We denote by dom(μ) = X the domain of μ and by range(μ) = {i ∈ [n] ∣ ∃x ∈ X: μ(x) = i} the range of μ. Further, we denote by μ∅ the empty mapping, which is the unique mapping such that dom(μ∅) = ∅. In our examples, we will usually write (x1 → i1, …, x_ℓ → i_ℓ) to denote the mapping μ: {x1, …, x_ℓ} → [n] such that μ(x1) = i1, …, μ(x_ℓ) = i_ℓ. We also write μ(x → i) with x ∉ dom(μ) for the mapping μ′ such that μ′(x) = i and μ′(y) = μ(y) for every y ∈ dom(μ).

Given a mapping μ: X → [n], we define μ⁻¹: [n] → 2^X as the set-inverse mapping of μ, defined by μ⁻¹(i) := {x ∈ X ∣ μ(x) = i}. Note that i ∈ range(μ) if, and only if, μ⁻¹(i) ≠ ∅. Moreover, given a subset Y ⊆ X, we define the projection π_Y(μ) as the mapping μ′: Y → [n] such that μ′(y) = μ(y) for every y ∈ Y. Remark that if Y = ∅, then π_Y(μ) = μ∅.
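As a quick sketch (our own encoding, not from the paper), mappings μ: X → [n] can be represented as dictionaries, with the set-inverse μ⁻¹ and the projection π_Y implemented directly from the definitions:

```python
# Mappings mu: X -> [n] as Python dicts; set-inverse and projection as defined above.

def set_inverse(mu: dict, n: int) -> dict:
    """mu^{-1}(i) = {x in dom(mu) | mu(x) = i} for every i in [n]."""
    inv = {i: set() for i in range(1, n + 1)}
    for x, i in mu.items():
        inv[i].add(x)
    return inv

def project(mu: dict, ys: set) -> dict:
    """pi_Y(mu): the restriction of mu to the variables in Y."""
    return {x: i for x, i in mu.items() if x in ys}

mu = {"x1": 4, "x2": 3}            # the mapping (x1 -> 4, x2 -> 3)
inv = set_inverse(mu, 5)
assert inv[4] == {"x1"} and inv[3] == {"x2"} and inv[1] == set()
assert project(mu, {"x1"}) == {"x1": 4}
assert project(mu, set()) == {}    # pi over the empty set gives the empty mapping
```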

Variable-set automata.

A variable-set automaton [24, 25, 4] (or vset automaton for short) is a tuple 𝒜 = (Q, Σ, X, Δ, q0, F) where Q is a finite set of states, X is a finite set of variables, q0 ∈ Q is an initial state, F ⊆ Q is a set of final states, and Δ ⊆ Q × Σ × 2^X × Q is the transition relation. Given a string s = a1 ⋯ an, a run ρ of 𝒜 over s is a sequence:

ρ := q0 –a1/X1→ q1 –a2/X2→ ⋯ –an/Xn→ qn    (1)

such that (q_{i−1}, a_i, X_i, q_i) ∈ Δ for every i ≤ n. We say that a run ρ like (1) is valid if, and only if, ⋃_{i=1}^n X_i = X and X_i ∩ X_j = ∅ for every i < j ≤ n; in other words, each variable in X appears exactly once in ρ. If ρ is valid, one can define the mapping μρ: X → [n] such that, for every x ∈ X, μρ(x) = i where i is the unique position that satisfies x ∈ X_i. As usual, we also say that a run ρ like (1) is accepting if, and only if, qn ∈ F. Then we define the output of 𝒜 over a string s as the set of mappings:

𝒜(s) = {μρ ∣ ρ is a valid and accepting run of 𝒜 over s}.

We define a partial run ρ of 𝒜 as a sequence of transitions ρ := p0 –b1/Y1→ p1 –b2/Y2→ ⋯ –bn/Yn→ pn such that (p_{i−1}, b_i, Y_i, p_i) ∈ Δ. We say that ρ is a partial run from p0 to pn over the string b1 ⋯ bn. Note that a run is also a partial run where we additionally assume that p0 = q0. We say that a partial run is valid if Y_i ∩ Y_j = ∅ for all i ≠ j ≤ n. We define the length of ρ as |ρ| = n, and we make the convention that a single state p0 is a partial run of length 0. We define the set of variables vars(ρ) of a partial run ρ as vars(ρ) = ⋃_{i=1}^n Y_i. Given two partial runs ρ = p0 –b1/Y1→ ⋯ –bn/Yn→ pn and σ = r0 –c1/Z1→ ⋯ –cm/Zm→ rm such that pn = r0, we define the run ρσ as the concatenation of ρ and σ, i.e., the partial run ρσ := p0 –b1/Y1→ ⋯ –bn/Yn→ pn –c1/Z1→ ⋯ –cm/Zm→ rm. Note that |ρσ| = |ρ| + |σ| and vars(ρσ) = vars(ρ) ∪ vars(σ).

We define the size |𝒜| of a vset automaton 𝒜 = (Q, Σ, X, Δ, q0, F) as the number of states and transitions, so |𝒜| := |Q| + |Δ|. In the following, we assume that all vset automata are trimmed, i.e., for every q ∈ Q there exists a run ρ that reaches q, and there exists a partial run σ from q to some state in F. We can make this assumption without loss of generality, since removing unreachable states can be done in time O(|𝒜|) without changing 𝒜(s) for any string s.

MSO and vset automata.

In this paper, we usually refer to Monadic Second Order Logic (MSO) as our language for query evaluation; however, we will not define it formally here since we will use vset automata as an equivalent model for MSO. Precisely, when we refer to MSO, we mean MSO formulas of the form φ(x1, …, xn) over strings where x1, …, xn are first-order free variables (see, e.g., [31] for a formal definition). Then, given a string s as a logical structure, the MSO query evaluation problem refers to computing the set φ(x1, …, xn)(s) := {μ: {x1, …, xn} → [|s|] ∣ (s, μ) ⊨ φ(x1, …, xn)}, which is the set of all first-order assignments (i.e., mappings) that satisfy φ over s. One can show that MSO over strings with first-order free variables is equally expressive as vset automata, basically by following the same construction as for the Büchi-Elgot-Trakhtenbrot theorem [16] (see also [34]). Furthermore, vset automata (as defined here) are equally expressive as regular document spanners [24] for information extraction (see, e.g., [6, 34]). For this reason, we can use vset automata to define MSO queries, which also applies to the setting of document spanners, or any other query language equally expressive as MSO.

Functional and unambiguous vset automata.

It is useful to consider vset automata that have no accepting runs that are not valid, so we make the following definition: we say that a vset automaton 𝒜 is functional if, and only if, for every s ∈ Σ*, every accepting run ρ of 𝒜 over s is also valid. In [24], it was shown that there is an algorithm that, given a vset automaton 𝒜, constructs in exponential time a functional vset automaton 𝒜′ of exponential size with respect to 𝒜 such that 𝒜′(s) = 𝒜(s) for every string s. We will thus restrict our analysis and algorithms to functional vset automata, since we can extend them to non-functional vset automata by incurring an exponential blow-up. One useful property of functional vset automata is that each state determines the variables that are assigned before reaching it, in the following sense.

Lemma 1.

Let 𝒜 = (Q, Σ, X, Δ, q0, F) be a functional vset automaton. For every q ∈ Q there exists a set X_q ⊆ X such that for every partial run ρ from q0 to q it holds that vars(ρ) = X_q.

Proof (sketch).

The lemma is standard in the literature for similar models, see e.g. [33]. We provide a short proof. By way of contradiction, suppose that there exist a state q ∈ Q and two partial runs ρ and ρ′ from q0 to q such that vars(ρ) ≠ vars(ρ′). Given that 𝒜 is trimmed, there exists a partial run σ starting from q that reaches some final state in F. Then the runs ρσ and ρ′σ are accepting. Since 𝒜 is functional, both runs are also valid and should satisfy vars(ρσ) = X = vars(ρ′σ). However, vars(ρσ) ≠ vars(ρ′σ), which is a contradiction.

By Lemma 1, for every functional vset automaton 𝒜 = (Q, Σ, X, Δ, q0, F) and every q ∈ Q we define its set of variables X_q to be the set as in the lemma. One consequence of Lemma 1 is that for every transition (q, a, Y, q′) ∈ Δ, we have Y = X_{q′} \ X_q. In particular, there can be at most |Σ| transitions for every pair q, q′ of states, so that overall |Δ| = O(|Q|^2 |Σ|).

We will also restrict to unambiguous vset automata. We say that a vset automaton 𝒜 = (Q, Σ, X, Δ, q0, F) is unambiguous if for every string s ∈ Σ* and every μ ∈ 𝒜(s) there exists exactly one accepting run ρ of 𝒜 over s such that μ = μρ. It is well known that for every vset automaton 𝒜 one can construct an equivalent unambiguous functional vset automaton 𝒜′ of exponential size with respect to 𝒜. Therefore, we can restrict our analysis to the class of unambiguous functional vset automata without losing expressive power. Both are standard assumptions in the literature on MSO query evaluation and have been used to obtain efficient enumeration algorithms for computing 𝒜(s), see e.g. [25, 38, 34].

Example 2.

Figure 1: A running example of a vset automaton 𝒜0 that will be used throughout this work. (a) The vset automaton 𝒜0 (the drawing is not reproduced here). (b) The mappings computed by 𝒜0 on input string s = abbab:

    x1  x2
     1   2
     4   2
     4   5
     4   3

(c) The transition matrices of 𝒜0, with rows and columns indexed by q0, q1, q2, q3:

    Ma = [ 1 1 0 0 ]      Mb = [ 1 0 1 0 ]
         [ 0 1 0 0 ]           [ 0 0 0 1 ]
         [ 0 0 0 1 ]           [ 0 0 1 0 ]
         [ 0 0 0 1 ]           [ 0 0 0 1 ]

Figure 1(a) depicts an unambiguous functional vset automaton 𝒜0 on variables {x1, x2} and alphabet {a, b}. It can be verified that X_{q0} = ∅, X_{q1} = {x1}, X_{q2} = {x2} and X_{q3} = {x1, x2}. Let s = abbab. The following are all valid and accepting runs of 𝒜0 over s:

  • ρ0 := q0 –a/{x1}→ q1 –b/{x2}→ q3 –b/∅→ q3 –a/∅→ q3 –b/∅→ q3 and μρ0 = {x1 → 1, x2 → 2}

  • ρ1 := q0 –a/∅→ q0 –b/{x2}→ q2 –b/∅→ q2 –a/{x1}→ q3 –b/∅→ q3 and μρ1 = {x1 → 4, x2 → 2}

  • ρ2 := q0 –a/∅→ q0 –b/∅→ q0 –b/∅→ q0 –a/{x1}→ q1 –b/{x2}→ q3 and μρ2 = {x1 → 4, x2 → 5}

  • ρ3 := q0 –a/∅→ q0 –b/∅→ q0 –b/{x2}→ q2 –a/{x1}→ q3 –b/∅→ q3 and μρ3 = {x1 → 4, x2 → 3}

In Figure 1(b), we show a summary of 𝒜0(s). It can be verified that 𝒜0(s) is the set of tuples μ such that μ(x1) is a position carrying an a in s and μ(x2) is a position carrying a b in s. Moreover, if μ(x1) < μ(x2), then μ(x2) is the first position after μ(x1) containing a b. Otherwise, if μ(x2) < μ(x1), then μ(x1) is the first position after μ(x2) containing an a.
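The four mappings above can be checked by brute force: enumerate all runs of 𝒜0 over s by depth-first search and keep the valid accepting ones. The transition set below is our reconstruction from the runs and matrices of Example 2 (the drawing of Figure 1(a) is not reproduced here), so treat it as an assumption:

```python
# Brute-force evaluation of the running example A0 on s = abbab.
# DELTA is reconstructed from the listed runs and the matrices of Figure 1(c).

DELTA = [
    ("q0", "a", frozenset(), "q0"), ("q0", "b", frozenset(), "q0"),
    ("q0", "a", frozenset({"x1"}), "q1"), ("q0", "b", frozenset({"x2"}), "q2"),
    ("q1", "a", frozenset(), "q1"), ("q1", "b", frozenset({"x2"}), "q3"),
    ("q2", "b", frozenset(), "q2"), ("q2", "a", frozenset({"x1"}), "q3"),
    ("q3", "a", frozenset(), "q3"), ("q3", "b", frozenset(), "q3"),
]
Q0, FINAL, VARS = "q0", {"q3"}, {"x1", "x2"}

def outputs(s: str) -> set:
    """All mappings mu_rho for valid and accepting runs rho over s."""
    results = set()

    def dfs(state, pos, mu):
        if pos == len(s):
            if state in FINAL and set(mu) == VARS:   # accepting and valid
                results.add(tuple(sorted(mu.items())))
            return
        for p, a, ys, q in DELTA:
            # a transition applies if it reads the right letter and does not
            # reassign a variable that was already set (validity)
            if p == state and a == s[pos] and ys.isdisjoint(mu):
                dfs(q, pos + 1, {**mu, **{y: pos + 1 for y in ys}})

    dfs(Q0, 0, {})
    return results

# Matches Figure 1(b): the four mappings listed in Example 2.
expected = {(("x1", 1), ("x2", 2)), (("x1", 4), ("x2", 2)),
            (("x1", 4), ("x2", 5)), (("x1", 4), ("x2", 3))}
assert outputs("abbab") == expected
```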

Computational model.

We work with the usual RAM model with unit costs and logarithmic size registers, see e.g. [27]. All data structures that we will encounter in our algorithms will be of polynomial size, so all addresses and pointers will fit into a constant number of registers. Also, all atomic memory operations like random memory access and following pointers can be performed in constant time. The numbers we consider will be of value at most |s|^{|X|} where X is the variable set of the automaton we consider. Thus, all numbers can be stored in at most |X| memory cells, and arithmetic operations on them can be performed in time O(|X|^2) naively (there are faster algorithms for multiplication and addition, which we will not pursue here; using them, the factor |X|^2 in all runtime bounds below can be reduced accordingly). In the sequel, we use ω to denote the exponent of matrix multiplication.

3 Direct access for MSO queries

Let X be a set of variables. Given a total order ≺ on X, we extend ≺ to a lexicographic order on mappings as usual: for mappings μ, μ′: X → [n] we define μ ≺ μ′ if, and only if, there exists x ∈ X such that μ(x) < μ′(x) and, for every y ∈ X, if y ≺ x, then μ(y) = μ′(y).

Fix a total order ≺ on X. Given a vset automaton 𝒜 = (Q, Σ, X, Δ, q0, F) and an input string s, consider the set of outputs 𝒜(s) = {μ1, μ2, …} such that μ1 ≺ μ2 ≺ ⋯. We define the i-th output of 𝒜 over s, denoted by 𝒜(s)[i], as the mapping μi. Intuitively, we see 𝒜(s) as an array where the outputs are ordered by ≺ and we retrieve the i-th element of this array. The direct access problem for vset automata is the following: given a vset automaton 𝒜, a string s, and an index i, compute the i-th output 𝒜(s)[i]; if the index i is larger than the number of solutions, then return an out-of-bound error. Without loss of generality, in the following we always assume that i ≤ |𝒜(s)|, since |𝒜(s)| is easy to compute for unambiguous functional vset automata, see Section 4. As usual, we split the computation into two phases, called the preprocessing phase and the access phase:

Problem: MSODirectAccess[≺]
Preprocessing:
  input: a vset automaton 𝒜 and s ∈ Σ*
  result: a data structure D_{𝒜,s}
Access:
  input: an index i and D_{𝒜,s}
  output: the i-th output 𝒜(s)[i]

During the preprocessing phase, the algorithm receives as input a vset automaton 𝒜 and a string s ∈ Σ* and computes a data structure D_{𝒜,s}. After preprocessing, there is the access phase where the algorithm receives any index i as input and computes the i-th output 𝒜(s)[i] by using the precomputed data structure D_{𝒜,s}.

To measure the efficiency of a direct access algorithm, we say that an algorithm for the problem MSODirectAccess[≺] has f-preprocessing and g-access time for some functions f and g if, and only if, the running time of the preprocessing phase and the access phase is in O(f(𝒜, s)) and O(g(𝒜, s)), respectively. Note that the running time of the access phase does not depend on i, given that its bitsize is bounded by |X| log(|s|), so that it can be stored in |X| registers in the RAM model and arithmetic on it can be done in time O(|X|^2) naively.

Given a class 𝒞 of vset automata, we define the problem MSODirectAccess[≺] for 𝒞 as the problem above when we restrict 𝒜 to 𝒞. The main result of this work is the following.

Theorem 3.

There is an algorithm that solves MSODirectAccess[≺] for the class of unambiguous functional vset automata for any variable order ≺ with preprocessing time O(|Q|^ω |X|^2 |s|) and access time O(|Q|^ω |X|^3 log^2(|s|)). These bounds remain true even when the order ≺ is only given as an additional input in the access phase.

In Theorem 3, we restrict to the class of unambiguous vset automata. A natural question is what happens if we consider the class of all functional vset automata. We first remark that the data complexity of the problem will not change as it is possible to translate any functional vset automaton to an unambiguous one with an exponential blow-up. Unfortunately, in combined complexity, there is no polynomial time algorithm under standard assumptions.

Proposition 4.

If MSODirectAccess[≺] for the class of functional vset automata has an algorithm with polynomial time preprocessing and access, then SpanL ⊆ FP.

SpanL is the class of functions computable as |R|, where R is the set of output values returned by the accepting paths of an NL machine, see [2] for a formal definition. In [2], it is shown that if SpanL ⊆ FP, i.e., if every function in SpanL is computable in polynomial time, then P = NP. Therefore, by Proposition 4 and [2], it is unlikely that there is an efficient algorithm for MSODirectAccess[≺] for all functional vset automata in combined complexity.

In the remainder of this paper, we will prove Theorem 3. To make the proof more digestible, in Section 4 we show how to reduce the direct access problem to a counting problem and introduce a matrix approach for it. In Section 5, we will present the data structure constructed during the preprocessing and used in the access phase, completing the proof of Theorem 3. In Section 6, we then show how to integrate updates to the input string into our approach.

4 From direct access to counting: a matrix approach

In this section, we present the main algorithmic ideas for Theorem 3. In the next section, we use these ideas to develop a data structure for the preprocessing and access algorithm.

From direct access to counting.

Let 𝒜 = (Q, Σ, X, Δ, q0, F) be a vset automaton and let s = a1 ⋯ an be a string of length n. Assume that X = {x1, …, x_ℓ} has ℓ variables ordered by a total order ≺; w.l.o.g., x1 ≺ x2 ≺ ⋯ ≺ x_ℓ. For an index k ∈ [ℓ] and a mapping τ: {x1, …, xk} → [n], we define the set:

𝒜(s, τ) = {μ ∈ 𝒜(s) ∣ π_{x1,…,x_{k−1}}(μ) = π_{x1,…,x_{k−1}}(τ) and μ(xk) ≤ τ(xk)}.

Intuitively, the set 𝒜(s, τ) restricts the output set 𝒜(s) to all outputs which coincide with τ on the variables before xk, and which assign xk to a position at most τ(xk). If τ = μ∅ is the empty mapping, we define 𝒜(s, τ) = 𝒜(s).

Example 5.

Going back to the example from Figure 1(a) with s = abbab and τ = (x1 → 2), the set 𝒜0(s, τ) contains all tuples from 𝒜0(s) which map x1 to a position at most 2. As seen in Figure 1(b), we have 𝒜0(s, τ) = {(x1 → 1, x2 → 2)}. Now if τ = (x1 → 2, x2 → 3), we have 𝒜0(s, τ) = ∅ since we only keep tuples where x1 is set to 2 and no such tuple can be found. If τ = (x1 → 4, x2 → 3), we have 𝒜0(s, τ) = {(x1 → 4, x2 → 2), (x1 → 4, x2 → 3)}.

For the sake of presentation, we denote by #𝒜(s) (resp. #𝒜(s,τ)) the number |𝒜(s)| (resp. |𝒜(s,τ)|) of outputs in 𝒜(s) (resp. 𝒜(s,τ)). For direct access, we are interested in finding efficient algorithms for computing #𝒜(s,τ) because of the following connection.

Lemma 6 (Lemma 7 in [17]).

If there is an algorithm that computes #𝒜(s, τ) in time T for every k ∈ [ℓ] and every τ: {x1, …, xk} → [n], then there is an algorithm that retrieves 𝒜(s)[i] in time O(T ℓ log(n)) for every index i.

Proof (sketch).

A similar proof can be found in [17]. For the convenience of the reader, and because we will use a slightly more complicated variant later, we give a quick sketch here. Let μi = 𝒜(s)[i]. The idea is to first compute μi(x1) by observing the following: μi(x1) is the smallest value j1 ∈ [n] such that #𝒜(s, (x1 → j1)) ≥ i. This value can be found in time O(T log n) by performing a binary search on {#𝒜(s, (x1 → j1)) ∣ j1 ≤ n}. Once we have found μi(x1), we compute μi(x2) similarly by observing that it is the smallest value j2 such that #𝒜(s, (x1 → j1, x2 → j2)) ≥ i − #𝒜(s, (x1 → j1 − 1)). The claim then follows by a simple induction.

Given Lemma 6, in the following we concentrate our efforts on developing an index structure for efficiently computing #𝒜(s, τ) for every k ∈ [ℓ] and every τ: {x1, …, xk} → [n].
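The binary search of Lemma 6 can be sketched as follows. For illustration only, the counting oracle #𝒜(s, τ) is simulated by brute force over a materialized answer set (here the four outputs of 𝒜0 on abbab); the whole point of the paper is of course to avoid this materialization:

```python
# Direct access from a counting oracle, following the proof sketch of Lemma 6.

ANSWERS = [(1, 2), (4, 2), (4, 3), (4, 5)]   # A0(abbab) in lex order x1 < x2
N, ELL = 5, 2                                 # string length, number of variables

def count(tau):
    """#A(s, tau): answers agreeing with tau on x_1..x_{k-1} and <= tau on x_k."""
    k = len(tau)
    return sum(1 for mu in ANSWERS
               if mu[:k - 1] == tau[:k - 1] and mu[k - 1] <= tau[k - 1])

def direct_access(i):
    """Retrieve the i-th answer (1-indexed) with l * log(n) oracle calls."""
    prefix = ()
    for _ in range(ELL):
        lo, hi = 1, N            # find the smallest j with count(prefix + (j,)) >= i
        while lo < hi:
            mid = (lo + hi) // 2
            if count(prefix + (mid,)) >= i:
                hi = mid
            else:
                lo = mid + 1
        # discount the answers strictly before position lo for this variable
        i -= count(prefix + (lo - 1,)) if lo > 1 else 0
        prefix += (lo,)
    return prefix

assert [direct_access(i) for i in range(1, 5)] == ANSWERS
```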

A matrix representation for the counting problem.

Let 𝒜 = (Q, Σ, X, Δ, q0, F) be an unambiguous functional vset automaton. For any letter a ∈ Σ, we define the matrix Ma ∈ ℕ^{Q×Q} such that for every p, q ∈ Q:

Ma[p, q] := 1 if (p, a, S, q) ∈ Δ for some S ⊆ X, and Ma[p, q] := 0 otherwise.

Strictly speaking, the matrix Ma depends on 𝒜 and we should write Ma^𝒜; however, 𝒜 will always be clear from the context and, thus, we omit 𝒜 as a superscript from Ma. Since 𝒜 is functional, for every pair of states p, q ∈ Q and a ∈ Σ there exists at most one transition (p, a, S, q) ∈ Δ. Then Ma contains a 1 for exactly the pairs of states that have a transition with the letter a. For this reason, one can construct Ma in time O(|Q|^2) for every a ∈ Σ. Figure 1(c) gives an example of the matrices Ma and Mb for the automaton 𝒜0 from Example 2.

We can map strings to matrices in ℕ^{Q×Q} by homomorphically extending the mapping a ↦ Ma from letters to strings, mapping every string s = a1 ⋯ an to the matrix Ms defined by the product Ms := Ma1 · Ma2 ⋯ Man. For ϵ, we define Mϵ = I where I is the identity matrix in ℕ^{Q×Q}. Note that this forms a homomorphism from strings to matrices where M_{s1 s2} = M_{s1} · M_{s2} for every pair of strings s1, s2 ∈ Σ*. It is easy to verify that for all states p, q ∈ Q, the entry Ms[p, q] is the number of partial runs from p to q of 𝒜 over s. Furthermore, if we define q⃗0 as the (row) vector such that q⃗0[p] = 1 if p = q0 and 0 otherwise, and F⃗ as the (column) vector such that F⃗[p] = 1 if p ∈ F and 0 otherwise, then we have the following equality between the number of outputs and matrix products: #𝒜(s) = q⃗0 · Ms · F⃗.
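This equality can be checked on the running example with the matrices of Figure 1(c) (states ordered q0, q1, q2, q3; since q0 is initial and F = {q3}, the product q⃗0 · Ms · F⃗ is just the entry [q0, q3] of Ms):

```python
# Counting #A0(s) via the matrix homomorphism, using the matrices of Figure 1(c).

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

M = {
    "a": [[1, 1, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 0, 1]],
    "b": [[1, 0, 1, 0], [0, 0, 0, 1], [0, 0, 1, 0], [0, 0, 0, 1]],
}

def count_outputs(s: str) -> int:
    """#A0(s): entry [q0, q3] of M_{a1} ... M_{an}."""
    Ms = [[int(i == j) for j in range(4)] for i in range(4)]   # identity = M_eps
    for a in s:
        Ms = mat_mul(Ms, M[a])
    return Ms[0][3]

assert count_outputs("abbab") == 4   # the four mappings of Example 2
```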

Our goal is to have a similar result for calculating #𝒜(s, τ) given a mapping τ: {x1, …, xk} → [n]. For this purpose, for every string s = a1 ⋯ an, we define a matrix Ms^τ as follows. Recall that τ⁻¹ is the set-inverse of τ. Then, for every a ∈ Σ and i ∈ [n], define the matrix M^τ_{a,i} ∈ ℕ^{Q×Q} such that for every p, q ∈ Q:

M^τ_{a,i}[p, q] = 1 if there is a transition (p, a, S, q) ∈ Δ such that (1) τ⁻¹(i) \ {xk} ⊆ S and (2) if τ(xk) = i then xk ∈ X_q; and M^τ_{a,i}[p, q] = 0 otherwise.

Finally, we define Ms^τ = M^τ_{a1,1} · M^τ_{a2,2} ⋯ M^τ_{an,n}. Intuitively, Condition (1) for M^τ_{a,i}[p, q] = 1 makes sure that all runs of 𝒜 over s counted by Ms^τ take, at position i, transitions that contain all variables xj with τ(xj) = i and j < k. Note that we remove xk from τ⁻¹(i) since xk has a special meaning in τ. Indeed, Condition (2) for M^τ_{a,i}[p, q] = 1 restricts xk such that its assignment must be before or equal to position i. For this second condition, we exploit the set X_q of a functional vset automaton, which gives us the information of all variables that have been used up to state q.

It is important to notice that if i ∉ range(τ) then M^τ_{ai,i} = M_{ai}. Since range(τ) has at most k elements, the sequences Ma1, …, Man and M^τ_{a1,1}, …, M^τ_{an,n} differ in at most k matrices. In particular, if τ is the empty mapping, then Ms^τ = Ms as expected. Finally, similarly to Ma, one can compute M^τ_{a,i} in time O(|Q|^2) for any a ∈ Σ, assuming that we have access to τ and τ⁻¹.

Example 7.

Consider again the example from Figure 1(a) with s = abbab. Observe that for any τ, whenever Ma[p, q] = 0 then necessarily M^τ_{a,i}[p, q] = 0. Hence, in this example, we focus on entries Ma[p, q] that go from 1 to 0 after applying a partial mapping. Let τ = (x1 → 4, x2 → 4). We have M^τ_{a,4}[q0, q1] = 0 because, even though there is a transition (q0, a, S, q1) with x1 ∈ S, we do not have x2 ∈ X_{q1}. In other words, a run compatible with τ cannot take the transition (q0, a, {x1}, q1) at position 4, since doing so would not allow mapping x2 to a position before 4. On the other hand, M^τ_{a,4}[q2, q3] = 1 because it is possible to set x1 to position 4 by taking the transition (q2, a, {x1}, q3). Moreover, since x2 ∈ X_{q3}, the variable x2 has necessarily been set by an earlier transition, hence at a position preceding 4. Finally, M^τ_{a,4}[q3, q3] = 0 because there is no transition from q3 to q3 setting the variable x1.

Similarly to the relation between #𝒜(s) and Ms, we can compute #𝒜(s,τ) by using Msτ as the following result shows.

Lemma 8.

For every unambiguous functional vset automaton 𝒜 = (Q, Σ, X, Δ, q0, F), every string s ∈ Σ*, and every mapping τ: {x1, …, xk} → [n] it holds that: #𝒜(s, τ) = q⃗0 · Ms^τ · F⃗.
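Lemma 8 can be checked on the running example by building the matrices M^τ_{a_i,i} from the definition above and comparing q⃗0 · Ms^τ · F⃗ with the counts of Example 5. The transition set and the sets X_q below are reconstructed from Example 2 (the figure is not reproduced here), so treat them as assumptions:

```python
# Verifying #A0(s, tau) = q0 . M_s^tau . F on the running example.

DELTA = [
    ("q0", "a", frozenset(), "q0"), ("q0", "b", frozenset(), "q0"),
    ("q0", "a", frozenset({"x1"}), "q1"), ("q0", "b", frozenset({"x2"}), "q2"),
    ("q1", "a", frozenset(), "q1"), ("q1", "b", frozenset({"x2"}), "q3"),
    ("q2", "b", frozenset(), "q2"), ("q2", "a", frozenset({"x1"}), "q3"),
    ("q3", "a", frozenset(), "q3"), ("q3", "b", frozenset(), "q3"),
]
STATES = ["q0", "q1", "q2", "q3"]
XQ = {"q0": set(), "q1": {"x1"}, "q2": {"x2"}, "q3": {"x1", "x2"}}

def mat_tau(a, i, tau, order):
    """M^tau_{a,i}, where tau is a dict over x_1..x_k and x_k = order[-1]."""
    xk = order[-1]
    need = {x for x, j in tau.items() if j == i and x != xk}   # tau^{-1}(i) \ {x_k}
    M = [[0] * 4 for _ in range(4)]
    for p, b, ys, q in DELTA:
        # condition (1): need <= ys; condition (2): if tau(x_k) = i then x_k in X_q
        if b == a and need <= ys and (tau[xk] != i or xk in XQ[q]):
            M[STATES.index(p)][STATES.index(q)] = 1
    return M

def count(s, tau, order):
    """#A0(s, tau) = q0-row of M^tau_{a_1,1} ... M^tau_{a_n,n}, at column q3."""
    row = [1, 0, 0, 0]                                         # vector for q0
    for i, a in enumerate(s, start=1):
        M = mat_tau(a, i, tau, order)
        row = [sum(row[p] * M[p][q] for p in range(4)) for q in range(4)]
    return row[3]                                              # F = {q3}

# The counts of Example 5 for s = abbab:
assert count("abbab", {"x1": 2}, ["x1"]) == 1
assert count("abbab", {"x1": 2, "x2": 3}, ["x1", "x2"]) == 0
assert count("abbab", {"x1": 4, "x2": 3}, ["x1", "x2"]) == 2
```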

The algorithm.

Now that we have defined the main technical tools, we can present the algorithms for the preprocessing and direct access of an unambiguous functional vset automaton 𝒜 over a string s = a1 ⋯ an. For this purpose, we will assume here the existence of a data structure for maintaining the product of the matrices {M^τ_{ai,i}}_{i∈[n]}, and then present and analyze the algorithms. In Section 5, we will explain this data structure in full detail.

Suppose there exists a data structure D that, given values n, m, maintains n square matrices M1, …, Mn of size m × m. This data structure supports three methods called init, set, and out, which are defined as follows:

  • D ← init(M1, …, Mn): receives as input the initial n matrices M1, …, Mn of size m × m and initializes a data structure D storing the matrices M1, …, Mn;

  • D′ ← set(D, j, M): receives as input a data structure D, a position j ≤ n, and an (m × m)-matrix M, and outputs a data structure D′ that is equivalent to D but whose j-th matrix Mj is replaced by M; and

  • M ← out(D): receives as input a data structure D storing matrices M1, …, Mn and outputs a pointer to the matrix M := M1 ⋯ Mn, i.e., the product of the n matrices.

Further, we assume that the method set is persistent [22], meaning that each call to set produces a new data structure D′ without modifying the previous data structure D.

Algorithm 1 The preprocessing and direct access algorithms for an unambiguous functional vset automaton 𝒜 = (Q, Σ, X, Δ, q0, F) with variables x1, …, x_ℓ and string s = a1 ⋯ an.
 1: procedure Preprocessing(𝒜, s)
 2:   D ← init(Ma1, …, Man)
 3: procedure BinarySearch(x, i, τ)
 4:   L ← 0
 5:   R ← n
 6:   while L + 1 < R do
 7:     j ← ⌊(L + R)/2⌋
 8:     τ′ ← τ(x → j)
 9:     D′ ← set(D, j, M^{τ′}_{aj,j})
10:     if q⃗0 · out(D′) · F⃗ ≥ i then
11:       R ← j
12:     else
13:       L ← j
14:   return R
15: procedure DirectAccess(i)
16:   τ ← μ∅
17:   for k = 1, …, ℓ do
18:     j ← BinarySearch(xk, i, τ)
19:     i ← i − CalculateDiff(xk, j, τ)
20:     τ ← τ(xk → j)
21:     UpdateStruct(x_{k+1}, j, τ)
22:   return τ
23: procedure CalculateDiff(x, j, τ)
24:   τprev ← τ(x → j − 1)
25:   Dprev ← set(D, j − 1, M^{τprev}_{a_{j−1}, j−1})
26:   return q⃗0 · out(Dprev) · F⃗
27: procedure UpdateStruct(x, j, τ)
28:   τnext ← τ(x → j + 1)
29:   D ← set(D, j, M^{τnext}_{aj,j})

Assuming the existence of the data structure D, Algorithm 1 presents all steps of the preprocessing and direct access for an unambiguous functional vset automaton 𝒜 over a string s = a1 ⋯ an. For the sake of presentation, we assume that 𝒜, s, and the data structure D are globally available to all procedures.

Algorithm 1 uses all the tools developed in this section for solving the MSO direct access problem. The preprocessing (Algorithm 1, left) receives as input the vset automaton and the string and constructs the data structure D by calling the method init with the matrices Ma1, …, Man. The direct access (Algorithm 1, right) receives any index i and outputs the i-th mapping τ in 𝒜(s) by following the strategy of Lemma 6. Specifically, starting from an empty mapping τ (line 16), it finds the position for each variable xk by binary search (lines 17-18). After finding the value j for xk, it decreases the index i by the number of outputs just before j (line 19), updates τ with (xk → j) (line 20), and updates D with the new value of xk (line 21). We use the auxiliary procedures CalculateDiff and UpdateStruct to simplify the presentation of these steps. The workhorse of the direct access is the procedure BinarySearch. It performs a standard binary search for finding the value of variable x, using the reduction of Lemma 6 to the counting problem and the matrix characterization of #𝒜(s, τ) from Lemma 8.

The correctness of Algorithm 1 follows from Lemmas 6 and 8. Regarding the running time, suppose that the methods 𝗂𝗇𝗂𝗍, 𝗌𝖾𝗍, and 𝗈𝗎𝗍 of the data structure D take time t𝗂𝗇𝗂𝗍, t𝗌𝖾𝗍, and t𝗈𝗎𝗍, respectively. One can check that the preprocessing takes time O(|Q|²|s| + t𝗂𝗇𝗂𝗍), of which O(|Q|²|s|) is spent creating the matrices Ma1, …, Man, and that the direct access takes time O(|X| log(|s|)(t𝗌𝖾𝗍 + t𝗈𝗎𝗍)) for each index i.

The next section shows how to implement the data structure D and its methods 𝗂𝗇𝗂𝗍, 𝗌𝖾𝗍, and 𝗈𝗎𝗍. In particular, we show that the running time of these methods will be t𝗂𝗇𝗂𝗍=O(|Q|ω|X|2|s|), t𝗌𝖾𝗍=O(|Q|ω|X|2log(|s|)), and t𝗈𝗎𝗍=O(1). Overall, the total running time of Algorithm 1 will be O(|Q|ω|X|2|s|) for the preprocessing and O(|Q|ω|X|3log(|s|)2) for each direct access as stated in Theorem 3.

5 Maintaining semi-groups products

In this section, we show how to implement the data structure required in Section 4; in particular, its 𝗌𝖾𝗍 operation allows changing some of the elements in a sequence of matrices while maintaining their product efficiently. Similar problems have been studied in the literature under the name of dynamic word problems, see e.g. [26, 7]. In that setting, one is given a sequence of elements from a semi-group and wants to maintain the product of this sequence under substitution of elements. Depending on the algebraic structure of the semi-group, there are algorithms of different efficiencies, and there is by now quite a good understanding of the problem, see again [26, 7].

We could likely use the results of [26] directly to define our data structure (though [26] assumes a finite semiring, which does not correspond to our setting). However, later we will want to support more powerful changes to the strings than just substitutions, and it is not clear how the approach from [26] could be adapted for this. We therefore choose a less technically involved approach that allows for more powerful update operations while only losing a factor of loglog(n) compared to [26]. So fix a semi-group 𝔾, i.e., a set 𝔾 with an associative operation ⊙. The semi-group of most interest to us is that of (m×m)-matrices over ℕ with the usual matrix multiplication, but our approach works for any semi-group. As is common, we often leave out the multiplication symbol and write g1g2 for g1 ⊙ g2. We next introduce our data structure.

We will store sequences g = g1, …, gn over 𝔾 in binary trees. To this end, let T = (V, E) be a rooted binary tree in which the vertices are labeled with elements from 𝔾. We assume that each child of an internal node of T is designated as either a left or a right child; this distinction is important since the semi-group need not be commutative. The label of a vertex v is denoted by 𝗅𝖺𝖻𝖾𝗅(v). Recall that the in-order traversal of a rooted binary tree, when encountering a vertex v with left child v1 and right child v2, first recursively visits the subtree of v1, then visits v, and finally recursively visits the subtree of v2. For every vertex v of T, we define the sequence of v, in symbols 𝗌𝖾𝗊(v), to be the sequence of elements in the subtree rooted in v read in in-order traversal. We also define the product of v, in symbols 𝗉𝗋𝗈𝖽(v), as the product of the elements of 𝗌𝖾𝗊(v) in the order of the sequence. Note that 𝗉𝗋𝗈𝖽(v) is a single element of 𝔾 while 𝗌𝖾𝗊(v) is a sequence of elements of 𝔾; we never store 𝗌𝖾𝗊(v) explicitly in the vertex v, but we do store 𝗉𝗋𝗈𝖽(v). We also write 𝗌𝖾𝗊(T) for the sequence of the root of T and analogously define 𝗉𝗋𝗈𝖽(T).

In the remainder, we want to maintain 𝗉𝗋𝗈𝖽(T) under changes to 𝗌𝖾𝗊(T). To this end, it will be useful to access positions in 𝗌𝖾𝗊(T) efficiently. If the only changes were substitutions of elements, as needed for Algorithm 1, we could easily do this by letting T be a binary search tree in which the keys are the positions in 𝗌𝖾𝗊(T). However, we later also want to be able to delete, add, and move strings, and for these operations we would have to change a linear number of keys, which is too expensive for our setting. Instead, we use the linear list representation with search trees, see e.g. [30, Chapter 6.2.3]: in every vertex v we only store the size of the subtree of T rooted in v, which we denote by 𝗌𝗂𝗓𝖾(v). Note that 𝗌𝗂𝗓𝖾(v) is also the length of 𝗌𝖾𝗊(v). The 𝗌𝗂𝗓𝖾 information lets us locate the elements at desired positions in 𝗌𝖾𝗊(v).

Observation 9.

For every vertex v in T we can, given an index i ∈ [|𝗌𝖾𝗊(v)|] and correct 𝗌𝗂𝗓𝖾-labels, decide in constant time whether the entry at position i in 𝗌𝖾𝗊(v) is in the left subtree, in the right subtree, or in v itself. In the first two cases, we can also compute its position within the contiguous subsequence of that subtree.
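In code, Observation 9 amounts to comparing i with the size of the left subtree. A minimal sketch (the node layout and names are ours):

```python
class Node:
    """Vertex of a size-augmented binary tree; seq(v) is the in-order sequence."""
    def __init__(self, label, left=None, right=None):
        self.label, self.left, self.right = label, left, right
        self.size = 1 + (left.size if left else 0) + (right.size if right else 0)

def locate(v, i):
    """Descend to the node holding position i (1-indexed) of seq(v), deciding
    at each vertex in constant time whether i falls left, at v, or right."""
    left_size = v.left.size if v.left else 0
    if i <= left_size:                          # entry lies in the left subtree
        return locate(v.left, i)
    if i == left_size + 1:                      # entry is v itself
        return v
    return locate(v.right, i - left_size - 1)   # re-index into the right subtree

# seq(root) = a, b, c, d, e
root = Node("d", Node("b", Node("a"), Node("c")), Node("e"))
assert [locate(root, i).label for i in range(1, 6)] == list("abcde")
```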

Observation 9 directly gives an easy algorithm that descends from the root of T to any desired position in 𝗌𝖾𝗊(T) in time linear in the height of T.

As a final piece of information, we also add, for each vertex v of T, the height of v, denoted by 𝗁𝖾𝗂𝗀𝗁𝗍(v). This information is not needed in this section, but it will be useful in Section 6 for maintaining balanced trees, so we already introduce it here. Since the height of T will determine the runtime of our algorithm, we aim to bound it as for binary search trees. We say that a tree T satisfies the AVL condition if for every vertex v with children v1, v2 we have |𝗁𝖾𝗂𝗀𝗁𝗍(v1) − 𝗁𝖾𝗂𝗀𝗁𝗍(v2)| ≤ 1, and if v has only one child v1, then v1 is a leaf. The AVL condition allows us to bound the height of trees by the following classical result.

Theorem 10 ([1]).

Every tree of size n that satisfies the AVL condition has height O(log(n)).

We now have everything we need to define our data structure: we call a tree T in which all vertices are labeled with elements from 𝔾, which has correct attributes 𝗌𝗂𝗓𝖾, 𝗁𝖾𝗂𝗀𝗁𝗍, and 𝗉𝗋𝗈𝖽 as above, and which satisfies the AVL condition, an AVL-product representation of 𝗌𝖾𝗊(T).

Figure 2: AVL-product representation for the string abbbab and automaton 𝒜0 from Figure 1(a).
Example 11.

We consider again the automaton 𝒜0 from Figure 1(a) and the string abbbab (this is a different string than the one considered in previous examples, as a longer string better illustrates the data structure). Figure 2 shows an AVL-product representation of MaMbMbMbMaMb = M^μ_{a,1} M^μ_{b,2} M^μ_{b,3} M^μ_{b,4} M^μ_{a,5} M^μ_{b,6}. For convenience, although this is not mandatory, we assume that each vertex v of the tree is also identified by a unique number 𝗂𝖽(v). In the example, we can verify that the 𝗉𝗋𝗈𝖽 values are computed from the labels. For example, if we denote by vi the vertex with 𝗂𝖽(vi) = i, we have 𝗉𝗋𝗈𝖽(v2) = 𝗉𝗋𝗈𝖽(v3) · M^μ_{b,2} · 𝗉𝗋𝗈𝖽(v4), which recursively equals MaMbMb. Similarly, 𝗉𝗋𝗈𝖽(v1) = 𝗉𝗋𝗈𝖽(v2) · M^μ_{b,4} · 𝗉𝗋𝗈𝖽(v6), that is, MaMbMbMbMaMb, which is the product we want to maintain.

We now show that AVL-product representations are useful in our setting. We start by observing that 𝗈𝗎𝗍-queries are trivial since we store the necessary information 𝗉𝗋𝗈𝖽(T) explicitly in the root.

Observation 12.

Given an AVL-product representation T, one can return 𝗉𝗋𝗈𝖽(T) in constant time. So in particular, if 𝗌𝖾𝗊(T) is a sequence of matrices, AVL-product representations implement the method 𝗈𝗎𝗍 in constant time.

We next show that there is an efficient initialization algorithm.

Lemma 13.

There is an algorithm that, given a sequence g over 𝔾 of length n, computes in time O(n) and with O(n) semi-group products an AVL-product representation of g.
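One possible way to realize Lemma 13 (the construction here is our sketch, not necessarily the paper's): recursing on the middle element yields a tree satisfying the AVL condition, and each node is created with at most two semi-group products, hence O(n) products overall. We use string concatenation as a stand-in non-commutative semi-group.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Node:
    label: Any
    left: "Optional[Node]"
    right: "Optional[Node]"
    size: int
    height: int
    prod: Any

def build(seq, op):
    """Balanced product tree over seq: one node per element and at most two
    semi-group products per node, hence O(len(seq)) products in total."""
    def rec(lo, hi):                      # subtree for seq[lo:hi]
        if lo >= hi:
            return None
        mid = (lo + hi) // 2
        left, right = rec(lo, mid), rec(mid + 1, hi)
        p = seq[mid]
        if left:                          # keep the left-to-right order: the
            p = op(left.prod, p)          # semi-group may be non-commutative
        if right:
            p = op(p, right.prod)
        h = 1 + max(left.height if left else 0, right.height if right else 0)
        return Node(seq[mid], left, right, hi - lo, h, p)
    return rec(0, len(seq))

t = build(list("abbbab"), lambda x, y: x + y)   # concatenation as semi-group
assert t.prod == "abbbab" and t.size == 6
```

Splitting at the middle keeps the subtree sizes within one of each other, so the heights of siblings differ by at most one and the AVL condition holds.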

Specializing to matrices, we get the following for the 𝗂𝗇𝗂𝗍-operation of Algorithm 1.

Corollary 14.

We can perform 𝗂𝗇𝗂𝗍 for AVL-product representations in time O(mωn).

Finally, we show that AVL-product representations support the substitution updates that we need for direct access, using the 𝗌𝖾𝗍 operation in a persistent manner.

Lemma 15.

There is an algorithm that, given an AVL-product representation T, a position i, and a semi-group element g, outputs a new AVL-product representation of the sequence obtained from 𝗌𝖾𝗊(T) by substituting the element at position i by g. The runtime is O(log(|𝗌𝖾𝗊(T)|)) plus the time to compute O(log(|𝗌𝖾𝗊(T)|)) products in the semi-group.
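Lemma 15 can be sketched as follows (representation and names ours): descend by 𝗌𝗂𝗓𝖾 to position i, replace the label, and recompute 𝗉𝗋𝗈𝖽 along the root-to-leaf path with at most two products per node. Copying the nodes on the path makes the operation persistent, so older versions of the structure remain valid.

```python
def reprod(label, l, r, op):
    """Recompute prod from a label and two subtrees, preserving in-order."""
    p = label
    if l:
        p = op(l["prod"], p)
    if r:
        p = op(p, r["prod"])
    return p

def build(seq, op):
    """Balanced tree with size and prod attributes (as in Lemma 13)."""
    if not seq:
        return None
    m = len(seq) // 2
    l, r = build(seq[:m], op), build(seq[m + 1:], op)
    return {"label": seq[m], "left": l, "right": r,
            "size": len(seq), "prod": reprod(seq[m], l, r, op)}

def substitute(v, i, g, op):
    """Persistently replace position i (1-indexed) of seq(v) by g: only the
    O(height) nodes on the path to i are copied and get prod recomputed."""
    ls = v["left"]["size"] if v["left"] else 0
    w = dict(v)                       # copy the node so old versions survive
    if i <= ls:
        w["left"] = substitute(v["left"], i, g, op)
    elif i == ls + 1:
        w["label"] = g
    else:
        w["right"] = substitute(v["right"], i - ls - 1, g, op)
    w["prod"] = reprod(w["label"], w["left"], w["right"], op)
    return w

cat = lambda x, y: x + y
t = build(list("abbbab"), cat)
t2 = substitute(t, 5, "c", cat)
assert t2["prod"] == "abbbcb" and t["prod"] == "abbbab"   # old version intact
```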

Corollary 16.

The 𝗌𝖾𝗍-method from Algorithm 1 can be implemented in time O(mωlog(n)) with AVL-product representations.

Example 17.

Going back to the example from Figure 2, assume now that one wants to update the data structure to keep only the runs where x1 is set to a position less than or equal to 2. To this end, one needs to compute M^τ_{a,1} M^τ_{b,2} M^τ_{b,3} M^τ_{b,4} M^τ_{a,5} M^τ_{b,6} = M^μ_{a,1} M^τ_{b,2} M^μ_{b,3} M^μ_{b,4} M^μ_{a,5} M^μ_{b,6} where τ = (x1 ↦ 2). Hence, we only have to change one matrix in the original product.

Going down the tree using 𝗌𝗂𝗓𝖾(·), we can identify v2 as the vertex where the second matrix is introduced into the product. Hence, we only have to update 𝗉𝗋𝗈𝖽(v2) and 𝗉𝗋𝗈𝖽(v1). Observe that the new value M′2 of 𝗉𝗋𝗈𝖽(v2) is 𝗉𝗋𝗈𝖽(v3) · M^τ_{b,2} · 𝗉𝗋𝗈𝖽(v4), which can be obtained with two matrix products, 𝗉𝗋𝗈𝖽(v3) and 𝗉𝗋𝗈𝖽(v4) being already computed. Similarly, the new value of 𝗉𝗋𝗈𝖽(v1) is M′2 · M^μ_{b,4} · 𝗉𝗋𝗈𝖽(v6), which can also be computed with two extra matrix products, provided that we compute M′2 first. In general, this update scheme can be performed with at most 2·𝗁𝖾𝗂𝗀𝗁𝗍(T) matrix products.

6 Dynamic direct access under complex editing operations

In the previous sections, we established that, given a string s ∈ Σ* of length n and an unambiguous functional automaton 𝒜, we can answer direct access queries on the relation 𝒜(s) in time O(log(n)²) after O(n) preprocessing. Now assume that the string s is modified. To perform direct access on the relation induced by 𝒜 on the updated string, so far one would have to run the initialization of Lemma 13 again, at a cost of O(n) for each editing step. In this section, we show that the AVL-product representations from Section 5 can be updated efficiently, so that we can perform direct access on the edited string without redoing the preprocessing from scratch.

In the following, we formalize the scenario of complex edits over strings. Then we define the problem of dynamic MSO direct access. We finish by showing how to easily extend our data structure and approach to give a solution for this scenario.

Editing programs over strings.

Fix an alphabet Σ. Let 𝖲𝖣𝖡 = {s1, …, sN} be a set of strings over Σ, called a strings database. For the sake of simplicity, we use si ∈ 𝖲𝖣𝖡 both as a string and as a string name (i.e., as a label referring to si that is considered to be of constant size). Let 𝒮 be a set of string variables S1, S2, …, disjoint from 𝖲𝖣𝖡 and Σ. A literal l is either a string name si ∈ 𝖲𝖣𝖡, a string variable S ∈ 𝒮, or a symbol a ∈ Σ. An editing rule is a command of one of the following four types:

S := 𝖼𝗈𝗇𝖼𝖺𝗍(l, l′)    (S, S′) := 𝗌𝗉𝗅𝗂𝗍(l, i)    (S, S′) := 𝖼𝗎𝗍(l, i, j)    S := 𝗉𝖺𝗌𝗍𝖾(l, l′, i)

where S and S′ are string variables, l and l′ are literals, and i, j ∈ ℕ. Intuitively, string variables will be assigned strings; then 𝖼𝗈𝗇𝖼𝖺𝗍 concatenates two strings, 𝗌𝗉𝗅𝗂𝗍 splits a string at position i, 𝖼𝗎𝗍 extracts the substring between positions i and j, and 𝗉𝖺𝗌𝗍𝖾 inserts a string into another at position i. An editing program Π is a pair (S, P) where S ∈ 𝒮 is the output string variable and P is a sequence of editing rules R1; R2; …; Rn such that each string name appears at most once in the right-hand side of some rule, and each string variable appears at most once in the right-hand side of a rule Ri, and only after it has appeared in the left-hand side of a rule Rj with j < i. In other words, each string in the database can be used only once, and each variable can be used at most once after it is defined. We define the size of an editing program Π as its number of rules, |Π| = n.

Example 18.

Suppose that we have a strings database 𝖲𝖣𝖡0={s1,s2}. We define the editing program Π0=(S6,P0) with P0 as the sequence:

(S1,S2):=𝗌𝗉𝗅𝗂𝗍(s1,4);(S3,S4):=𝗌𝗉𝗅𝗂𝗍(S2,1);S5:=𝖼𝗈𝗇𝖼𝖺𝗍(S1,a);S6:=𝖼𝗈𝗇𝖼𝖺𝗍(S5,S4).

Next, we define the semantics of editing programs. A string assignment is a partial function σ : (𝖲𝖣𝖡 ∪ Σ ∪ 𝒮) → Σ* such that σ(a) = a for every a ∈ Σ and σ(s) = s for every s ∈ 𝖲𝖣𝖡. We define the trivial assignment σ𝖲𝖣𝖡 by 𝖽𝗈𝗆(σ𝖲𝖣𝖡) = 𝖲𝖣𝖡 ∪ Σ (i.e., no variable is mapped). Given an assignment σ, a string variable S, and a string s ∈ Σ*, we write σ[S ↦ s] to denote the assignment σ′ that replaces S with s in σ (i.e., σ′(S) = s and σ′(S′) = σ(S′) for every S′ ∈ 𝖽𝗈𝗆(σ) \ {S}). Then, we define the semantics of rules as a function ⟦·⟧ that maps assignments to assignments such that for every assignment σ:

⟦S := 𝖼𝗈𝗇𝖼𝖺𝗍(l, l′)⟧(σ) = σ[S ↦ σ(l)·σ(l′)]
⟦(S, S′) := 𝗌𝗉𝗅𝗂𝗍(l, i)⟧(σ) = σ[S ↦ σ(l)[..i], S′ ↦ σ(l)[i+1..]]
⟦(S, S′) := 𝖼𝗎𝗍(l, i, j)⟧(σ) = σ[S ↦ σ(l)[i, j], S′ ↦ σ(l)[..i−1]·σ(l)[j+1..]]
⟦S := 𝗉𝖺𝗌𝗍𝖾(l, l′, i)⟧(σ) = σ[S ↦ σ(l)[..i]·σ(l′)·σ(l)[i+1..]]

where we use the syntax σ[S ↦ s, S′ ↦ s′] as a shorthand for (σ[S ↦ s])[S′ ↦ s′]. We extend the semantics to sequences of editing rules R1; …; Rn recursively as follows:

⟦R1; …; Rn−1; Rn⟧(σ) = ⟦Rn⟧(⟦R1; …; Rn−1⟧(σ)).

Finally, we define the semantics of an editing program Π = (S, P) as the string Π(𝖲𝖣𝖡) = [⟦P⟧(σ𝖲𝖣𝖡)](S), namely, the string assigned to S after evaluating P starting from the trivial assignment. The semantics is undefined if any of the preceding definitions is undefined, for example, if the indices i, j are out of range or the variable S is not defined. Note that one can easily check in time O(|Π|) whether the semantics is undefined. For this reason, we assume in the following that the semantics of an editing program is always defined.
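The semantics can be prototyped directly over plain Python strings. This toy interpreter is ours (including the tuple encoding of rules); it evaluates rules left to right starting from the trivial assignment, at linear rather than logarithmic cost per rule since it manipulates strings explicitly instead of AVL-product representations.

```python
def run(rules, sdb):
    """Evaluate a sequence of editing rules starting from the trivial
    assignment over the strings database sdb (positions are 1-indexed)."""
    sigma = dict(sdb)                          # trivial assignment on SDB

    def val(l):                                # resolve a literal
        return sigma[l] if l in sigma else l   # a single symbol denotes itself

    for rule in rules:
        op = rule[0]
        if op == "concat":                     # S := concat(l, l')
            _, S, l1, l2 = rule
            sigma[S] = val(l1) + val(l2)
        elif op == "split":                    # (S, S') := split(l, i)
            _, S1, S2, l, i = rule
            s = val(l)
            sigma[S1], sigma[S2] = s[:i], s[i:]
        elif op == "cut":                      # (S, S') := cut(l, i, j)
            _, S1, S2, l, i, j = rule
            s = val(l)
            sigma[S1], sigma[S2] = s[i - 1:j], s[:i - 1] + s[j:]
        elif op == "paste":                    # S := paste(l, l', i)
            _, S, l1, l2, i = rule
            s = val(l1)
            sigma[S] = s[:i] + val(l2) + s[i:]
    return sigma

# The program P0 of Example 18, encoded as tuples.
P0 = [("split", "S1", "S2", "s1", 4),
      ("split", "S3", "S4", "S2", 1),
      ("concat", "S5", "S1", "a"),
      ("concat", "S6", "S5", "S4")]
assert run(P0, {"s1": "bbbbcb"})["S6"] == "bbbbab"   # matches Example 19
```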

Example 19 (continued).

If we take s1 = bbbbcb, we get Π0(𝖲𝖣𝖡) = bbbbab. In particular, one can easily check that the program Π0 replaces the fifth letter of s1 by the letter a.

Our editing programs can easily simulate the insertion, deletion, or update of a symbol in a string. Further, they can simulate the complex document editing operations presented in [39], like concat, extract, delete, and insert (see [39] for a definition of these operations). So our set of operations can be seen as nearly equivalent to the one presented in [39]; the only operation that we disallow is copying: by definition, editing programs cannot make copies of the input strings or the string variables. An editing program that makes copies could produce strings of exponential size, whereas our assumptions regarding the RAM model require the input index i for direct access to be of polynomial size with respect to the input documents. For this reason, we disallow copying in our programs and leave this extension for future work.

Dynamic direct access of MSO.

Now that the notion of editing programs is clear, we can define the problem of dynamic direct access as an extension of the problem MSODirectAccess[] introduced in Section 3, adding a phase of edits over the strings database.

Problem: DynamicMSODirectAccess[]
Preprocessing:
    input: a vset automaton 𝒜 and a strings database 𝖲𝖣𝖡 = {s1, …, sN}
    result: a data structure D𝒜,𝖲𝖣𝖡
Editing:
    input: an editing program Π and D𝒜,𝖲𝖣𝖡
    result: a data structure DΠ,𝒜,𝖲𝖣𝖡
Access:
    input: an index i and DΠ,𝒜,𝖲𝖣𝖡
    output: the i-th output 𝒜(Π(𝖲𝖣𝖡))[i]

In contrast to standard direct access, we add an intermediate phase, called editing, which receives an editing program Π over 𝖲𝖣𝖡 as input. The direct access is then performed over the resulting string Π(𝖲𝖣𝖡). As before, we measure each phase separately and say that an algorithm for DynamicMSODirectAccess[] has f-preprocessing, h-editing, and g-access time for some functions f, h, and g if, and only if, the running times of the preprocessing, editing, and access phases are in O(f(𝒜, 𝖲𝖣𝖡)), O(h(𝒜, 𝖲𝖣𝖡, Π)), and O(g(𝒜, 𝖲𝖣𝖡, Π)), respectively.

By taking advantage of our techniques for MSO direct access, we can show that DynamicMSODirectAccess[] can be solved with logarithmic running time during the editing phase and without increasing the running time of the preprocessing and access phases.

Theorem 20.

There is an algorithm that solves DynamicMSODirectAccess[] for the class of unambiguous functional vset automata and any variable order, with preprocessing time O(|Q|^ω |X|² ∑_{i=1}^{N} |si|), editing time O(|Q|^ω |X|² |Π| max_{i=1}^{N} log(|si|)), and access time O(|Q|^ω |X|³ log²(|Π| · max_{i=1}^{N} |si|)). These bounds even remain true when we assume that the order is only given as an additional input in the access phase.

The proof goes by showing how to implement each editing rule on AVL-product representations. Precisely, the preprocessing constructs an AVL-product representation for each string si in 𝖲𝖣𝖡 (as in Section 5). Then, during the editing phase, we produce new AVL-product representations from the initial ones for each editing operation. Finally, the resulting representation of the string Π(𝖲𝖣𝖡) is used for the direct access phase. This gives the desired running times for the preprocessing and direct access phases. It therefore only remains to show how to implement each editing rule in time O(|Q|^ω |X|² max_{i=1}^{N} log(|si|)) in order to establish the running time of the editing phase.

Implementing the editing rules.

It is easy to see that (S, S′) := 𝖼𝗎𝗍(l, i, j) can be implemented by two splits (at positions i−1 and j, matching the inclusive semantics of 𝖼𝗎𝗍) and one concatenation, more formally as the sequence of editing rules (S1, S2) := 𝗌𝗉𝗅𝗂𝗍(l, j); (S3, S) := 𝗌𝗉𝗅𝗂𝗍(S1, i−1); S′ := 𝖼𝗈𝗇𝖼𝖺𝗍(S3, S2). Similarly, S := 𝗉𝖺𝗌𝗍𝖾(l, l′, i) can be obtained by one split and two concatenations, more formally as (S1, S2) := 𝗌𝗉𝗅𝗂𝗍(l, i); S3 := 𝖼𝗈𝗇𝖼𝖺𝗍(S1, l′); S := 𝖼𝗈𝗇𝖼𝖺𝗍(S3, S2). Finally, a literal consisting of a single symbol can be handled by a newly initialized data structure of size one. For this reason, we dedicate the rest of this subsection to showing how to implement 𝖼𝗈𝗇𝖼𝖺𝗍 and 𝗌𝗉𝗅𝗂𝗍 efficiently.
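The two reductions can be sanity-checked on plain strings (helper names are ours; positions are 1-indexed, with the split at i−1 matching the inclusive semantics of 𝖼𝗎𝗍):

```python
def cut_via_split(s, i, j):
    """(S, S') := cut(l, i, j) expressed as two splits and one concat."""
    s1, s2 = s[:j], s[j:]             # (S1, S2) := split(l, j)
    s3, S = s1[:i - 1], s1[i - 1:]    # (S3, S)  := split(S1, i - 1)
    return S, s3 + s2                 # S' := concat(S3, S2)

def paste_via_split(s, t, i):
    """S := paste(l, l', i) expressed as one split and two concats."""
    s1, s2 = s[:i], s[i:]             # (S1, S2) := split(l, i)
    return s1 + t + s2                # S3 := concat(S1, l'); S := concat(S3, S2)

assert cut_via_split("abcdef", 2, 4) == ("bcd", "aef")
assert paste_via_split("abef", "cd", 2) == "abcdef"
```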

Assume that the inputs of 𝖼𝗈𝗇𝖼𝖺𝗍 and 𝗌𝗉𝗅𝗂𝗍 are given as AVL-product representations. We will heavily rely on the corresponding operations on AVL trees as presented in [13]: it is shown there that many operations on ordered sets can be implemented with AVL trees using only an operation called 𝗃𝗈𝗂𝗇 that does the following: given two AVL trees T1, T2 and a key k such that the maximal key in T1 is smaller than k and the minimal key in T2 is bigger than k, 𝗃𝗈𝗂𝗇 returns an AVL tree T containing all keys in T1, T2, and k. Besides the usual tree navigation, the only basic operation that 𝗃𝗈𝗂𝗇 uses is 𝗇𝗈𝖽𝖾, which takes a key k and two trees T1, T2 and returns a tree that has k in its root and T1 and T2 as left and right subtrees, respectively. (We remark that the usual rotations in search trees can be simulated by a constant number of 𝗇𝗈𝖽𝖾-operations.) We directly get the following result.

Lemma 21.

There is an algorithm 𝗃𝗈𝗂𝗇 that, given two AVL-product representations T1, T2 and a semi-group element g, computes, in time O(log(𝗌𝗂𝗓𝖾(T1) + 𝗌𝗂𝗓𝖾(T2))) and with the same number of semi-group operations, an AVL-product representation T of the sequence obtained by concatenating 𝗌𝖾𝗊(T1), g, and 𝗌𝖾𝗊(T2).
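The key point behind Lemma 21 is that the 𝗇𝗈𝖽𝖾 operation can recompute 𝗌𝗂𝗓𝖾, 𝗁𝖾𝗂𝗀𝗁𝗍, and 𝗉𝗋𝗈𝖽 from its two subtrees with O(1) semi-group products, so 𝗃𝗈𝗂𝗇 inherits the logarithmic bound of [13]. Below is a sketch of 𝗇𝗈𝖽𝖾 alone, with the representation ours and rebalancing omitted for brevity; a full 𝗃𝗈𝗂𝗇 would call it O(log) many times.

```python
cat = lambda x, y: x + y          # stand-in semi-group: string concatenation

def node(g, t1, t2, op):
    """The basic node operation: root g with subtrees t1, t2, recomputing
    size, height, and prod in O(1) semi-group products (no rebalancing)."""
    def attr(t, k):
        return t[k] if t else 0
    p = g
    if t1:
        p = op(t1["prod"], p)     # seq(t1) comes before g ...
    if t2:
        p = op(p, t2["prod"])     # ... and seq(t2) comes after it
    return {"label": g, "left": t1, "right": t2,
            "size": attr(t1, "size") + attr(t2, "size") + 1,
            "height": 1 + max(attr(t1, "height"), attr(t2, "height")),
            "prod": p}

t1 = node("b", node("a", None, None, cat), None, cat)   # seq(t1) = a, b
t2 = node("d", None, None, cat)                          # seq(t2) = d
t = node("c", t1, t2, cat)                               # join(t1, "c", t2)
assert t["prod"] == "abcd" and t["size"] == 4
```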

Using 𝗃𝗈𝗂𝗇, one can implement an operation called 𝗌𝗉𝗅𝗂𝗍 that takes as input an AVL tree T and a key k and returns two AVL trees T1 and T2 such that all keys in T smaller than k are in T1 while all keys bigger than k are in T2. Besides 𝗃𝗈𝗂𝗇, the only operation that 𝗌𝗉𝗅𝗂𝗍 performs is a descent in the tree to find the node with key k. This directly gives the following algorithm for operations on 𝗌𝖾𝗊(T).

Lemma 22.

There is an algorithm 𝗌𝗉𝗅𝗂𝗍 that, given an AVL-product representation T and a position i, computes in time O(log(𝗌𝗂𝗓𝖾(T))) and with the same number of semi-group operations two AVL-product representations T1, T2 such that 𝗌𝖾𝗊(T1) is the prefix of 𝗌𝖾𝗊(T) up to but excluding the i-th entry and 𝗌𝖾𝗊(T2) is the suffix of 𝗌𝖾𝗊(T) that starts after the i-th position.

Applying both lemmas for matrices, we directly get bounds for 𝖼𝗈𝗇𝖼𝖺𝗍 and 𝗌𝗉𝗅𝗂𝗍.

Corollary 23.

There are algorithms for both S:=𝖼𝗈𝗇𝖼𝖺𝗍(l,l) and (S,S):=𝗌𝗉𝗅𝗂𝗍(l,i) on AVL-product representations that, given inputs T1,T2, resp., T and i, run in time O(mωlog(|T1|+|T2|)), resp., O(mωlog(|T|)).

7 Future work

We have given a direct access algorithm for MSO queries on strings that allows for powerful updates and supports all lexicographic orders without having to prepare them specifically in the preprocessing. Our result is certainly only a first step in understanding direct access for MSO queries, and it motivates new research questions. First, it would be interesting to see to which extent our results can be adapted to MSO on trees. Second, one could study how to adjust our approach to support the copying of strings as in [39]. Third, another question is whether one could reduce the access time of our algorithm. Finally, it would be interesting to understand whether direct access on strings (or trees) is also possible for more expressive queries, say the grammar-based extraction formalisms of [37, 6].

References

  • [1] Georgii M Adel’son-Vel’skii and Evgenii Landis. An algorithm for the organization of information. Soviet Math., 3:1259–1263, 1962.
  • [2] Carme Àlvarez and Birgit Jenner. A very hard log-space counting class. Theor. Comput. Sci., 107(1):3–30, 1993. doi:10.1016/0304-3975(93)90252-O.
  • [3] Antoine Amarilli, Pierre Bourhis, and Stefan Mengel. Enumeration on trees under relabelings. In ICDT, volume 98, pages 5:1–5:18, 2018. doi:10.4230/LIPIcs.ICDT.2018.5.
  • [4] Antoine Amarilli, Pierre Bourhis, Stefan Mengel, and Matthias Niewerth. Constant-delay enumeration for nondeterministic document spanners. In ICDT, pages 22:1–22:19, 2019. doi:10.4230/LIPIcs.ICDT.2019.22.
  • [5] Antoine Amarilli, Pierre Bourhis, Stefan Mengel, and Matthias Niewerth. Enumeration on trees with tractable combined complexity and efficient updates. In PODS, pages 89–103, 2019. doi:10.1145/3294052.3319702.
  • [6] Antoine Amarilli, Louis Jachiet, Martin Muñoz, and Cristian Riveros. Efficient enumeration for annotated grammars. In PODS, pages 291–300, 2022. doi:10.1145/3517804.3526232.
  • [7] Antoine Amarilli, Louis Jachiet, and Charles Paperman. Dynamic membership for regular languages. In ICALP, volume 198 of LIPIcs, pages 116:1–116:17, 2021. doi:10.4230/LIPIcs.ICALP.2021.116.
  • [8] Guillaume Bagan, Arnaud Durand, Etienne Grandjean, and Frédéric Olive. Computing the jth solution of a first-order query. RAIRO Theor. Informatics Appl., 42(1):147–164, 2008. doi:10.1051/ita:2007046.
  • [9] Nurzhan Bakibayev, Tomás Kociský, Dan Olteanu, and Jakub Zavodny. Aggregation and ordering in factorised databases. Proc. VLDB Endow., 6(14):1990–2001, 2013. doi:10.14778/2556549.2556579.
  • [10] Andrey Balmin, Yannis Papakonstantinou, and Victor Vianu. Incremental validation of XML documents. ACM Trans. Database Syst., 29(4):710–751, 2004. doi:10.1145/1042046.1042050.
  • [11] Christoph Berkholz, Jens Keppeler, and Nicole Schweikardt. Answering conjunctive queries under updates. In PODS, pages 303–318, 2017. doi:10.1145/3034786.3034789.
  • [12] Christoph Berkholz, Jens Keppeler, and Nicole Schweikardt. Answering FO+MOD queries under updates on bounded degree databases. ACM Trans. Database Syst., 43(2):7:1–7:32, 2018. doi:10.1145/3232056.
  • [13] Guy E. Blelloch, Daniel Ferizovic, and Yihan Sun. Just join for parallel ordered sets. In SPAA, pages 253–264, 2016. doi:10.1145/2935764.2935768.
  • [14] Johann Brault-Baron. De la pertinence de l’énumération : complexité en logiques propositionnelle et du premier ordre. (The relevance of the list: propositional logic and complexity of the first order). PhD thesis, University of Caen Normandy, France, 2013. URL: https://tel.archives-ouvertes.fr/tel-01081392.
  • [15] Karl Bringmann, Nofar Carmeli, and Stefan Mengel. Tight fine-grained bounds for direct access on join queries. In PODS, pages 427–436, 2022. doi:10.1145/3517804.3526234.
  • [16] J Richard Büchi. Weak second-order arithmetic and finite automata. Mathematical Logic Quarterly, 6(1-6), 1960. doi:10.1002/malq.19600060105.
  • [17] Florent Capelli and Oliver Irwin. Direct access for conjunctive queries with negations. In ICDT, volume 290, pages 13:1–13:20, 2024. doi:10.4230/LIPIcs.ICDT.2024.13.
  • [18] Nofar Carmeli, Nikolaos Tziavelis, Wolfgang Gatterbauer, Benny Kimelfeld, and Mirek Riedewald. Tractable orders for direct access to ranked answers of conjunctive queries. ACM Trans. Database Syst., 48(1):1:1–1:45, 2023. doi:10.1145/3578517.
  • [19] Nofar Carmeli, Shai Zeevi, Christoph Berkholz, Alessio Conte, Benny Kimelfeld, and Nicole Schweikardt. Answering (unions of) conjunctive queries using random access and random-order enumeration. ACM Trans. Database Syst., 47(3):9:1–9:49, 2022. doi:10.1145/3531055.
  • [20] Shaleen Deep, Xiao Hu, and Paraschos Koutris. Ranked enumeration of join queries with projections. Proc. VLDB Endow., 15(5):1024–1037, 2022. doi:10.14778/3510397.3510401.
  • [21] Johannes Doleschal, Benny Kimelfeld, Wim Martens, and Liat Peterfreund. Weight annotation in information extraction. Log. Methods Comput. Sci., 18(1), 2022. doi:10.46298/lmcs-18(1:21)2022.
  • [22] James R Driscoll, Neil Sarnak, Daniel Dominic Sleator, and Robert Endre Tarjan. Making data structures persistent. In STOC, pages 109–121, 1986. doi:10.1145/12130.12142.
  • [23] Idan Eldar, Nofar Carmeli, and Benny Kimelfeld. Direct access for answers to conjunctive queries with aggregation. In ICDT, volume 290, pages 4:1–4:20, 2024. doi:10.4230/LIPIcs.ICDT.2024.4.
  • [24] Ronald Fagin, Benny Kimelfeld, Frederick Reiss, and Stijn Vansummeren. Document spanners: A formal approach to information extraction. Journal of the ACM (JACM), 62(2):1–51, 2015. doi:10.1145/2699442.
  • [25] Fernando Florenzano, Cristian Riveros, Martín Ugarte, Stijn Vansummeren, and Domagoj Vrgoč. Efficient enumeration algorithms for regular document spanners. ACM Transactions on Database Systems (TODS), 45(1):1–42, 2020. doi:10.1145/3351451.
  • [26] Gudmund Skovbjerg Frandsen, Peter Bro Miltersen, and Sven Skyum. Dynamic word problems. Journal of the ACM (JACM), 44(2):257–271, 1997. doi:10.1145/256303.256309.
  • [27] Etienne Grandjean and Louis Jachiet. Which arithmetic operations can be performed in constant time in the RAM model with addition? CoRR, abs/2206.13851, 2022. doi:10.48550/arXiv.2206.13851.
  • [28] Muhammad Idris, Martín Ugarte, and Stijn Vansummeren. The dynamic yannakakis algorithm: Compact and efficient query processing under updates. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 1259–1274, 2017. doi:10.1145/3035918.3064027.
  • [29] Sarah Kleest-Meißner, Jonas Marasus, and Matthias Niewerth. MSO queries on trees: Enumerating answers under updates using forest algebras. CoRR, abs/2208.04180, 2022. doi:10.48550/arXiv.2208.04180.
  • [30] Donald Ervin Knuth. The art of computer programming, Volume III, 2nd Edition. Addison-Wesley, 1998. URL: https://www.worldcat.org/oclc/312994415.
  • [31] Leonid Libkin. Elements of finite model theory, volume 41. Springer, 2004. doi:10.1007/978-3-662-07003-1.
  • [32] Katja Losemann and Wim Martens. MSO queries on trees: enumerating answers under updates. In CSL-LICS, pages 67:1–67:10, 2014. doi:10.1145/2603088.2603137.
  • [33] Francisco Maturana, Cristian Riveros, and Domagoj Vrgoc. Document spanners for extracting incomplete information: Expressiveness and complexity. In PODS, pages 125–136, 2018. doi:10.1145/3196959.3196968.
  • [34] Martin Muñoz and Cristian Riveros. Streaming enumeration on nested documents. In ICDT, volume 220 of LIPIcs, pages 19:1–19:18, 2022. doi:10.4230/LIPIcs.ICDT.2022.19.
  • [35] Matthias Niewerth. MSO queries on trees: Enumerating answers under updates using forest algebras. In LICS, pages 769–778, 2018. doi:10.1145/3209108.3209144.
  • [36] Matthias Niewerth and Luc Segoufin. Enumeration of MSO queries on strings with constant delay and logarithmic updates. In Jan Van den Bussche and Marcelo Arenas, editors, PODS, pages 179–191, 2018. doi:10.1145/3196959.3196961.
  • [37] Liat Peterfreund. Grammars for document spanners. In Ke Yi and Zhewei Wei, editors, ICDT, volume 186, pages 7:1–7:18, 2021. doi:10.4230/LIPIcs.ICDT.2021.7.
  • [38] Markus L. Schmid and Nicole Schweikardt. Spanner evaluation over slp-compressed documents. In PODS, pages 153–165, 2021. doi:10.1145/3452021.3458325.
  • [39] Markus L. Schmid and Nicole Schweikardt. Query evaluation over slp-represented document databases with complex document editing. In PODS, pages 79–89, 2022. doi:10.1145/3517804.3524158.
  • [40] Nikolaos Tziavelis, Wolfgang Gatterbauer, and Mirek Riedewald. Optimal join algorithms meet top-k. In SIGMOD, pages 2659–2665, 2020. doi:10.1145/3318464.3383132.
  • [41] Nikolaos Tziavelis, Wolfgang Gatterbauer, and Mirek Riedewald. Any-k algorithms for enumerating ranked answers to conjunctive queries. CoRR, abs/2205.05649, 2022. doi:10.48550/arXiv.2205.05649.