
Dynamic Direct Access of MSO Query Evaluation over Strings

Pierre Bourhis, Univ. Lille, CNRS, Inria, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France
Florent Capelli, Univ. Artois, CNRS, UMR 8188, Centre de Recherche en Informatique de Lens (CRIL), F-62300 Lens, France
Stefan Mengel, Univ. Artois, CNRS, UMR 8188, Centre de Recherche en Informatique de Lens (CRIL), F-62300 Lens, France
Cristian Riveros, Pontificia Universidad Católica de Chile, Santiago, Chile; Millennium Institute for Foundational Research on Data, Santiago, Chile
Abstract

We study the problem of evaluating a Monadic Second Order (MSO) query over strings under updates in the setting of direct access. We present an algorithm that, given an MSO query with first-order free variables represented by an unambiguous variable-set automaton 𝒜 with state set Q and variables X, and a string s, computes a data structure in time O(|Q|^ω |X|^2 |s|) and then, given an index i, retrieves, using the data structure, the i-th output of the evaluation of 𝒜 over s in time O(|Q|^ω |X|^3 log^2(|s|)), where ω is the exponent of matrix multiplication. Ours is the first efficient direct access algorithm for MSO query evaluation over strings; such algorithms had so far only been studied for first-order queries and conjunctive queries over relational data.

Our algorithm gives the answers in lexicographic order where, in contrast to the setting of conjunctive queries, the order between variables can be freely chosen by the user without degrading the runtime. Moreover, our data structure can be updated efficiently after changes to the input string, allowing more powerful updates than in the enumeration literature, e.g. efficient deletion of substrings, concatenation and splitting of strings, and cut-and-paste operations. Our approach combines a matrix representation of MSO queries and a novel data structure for dynamic word problems over semi-groups which yields an overall algorithm that is elegant and easy to formulate.

Keywords and phrases:
Query evaluation, direct access, MSO queries
Copyright and License:
© Pierre Bourhis, Florent Capelli, Stefan Mengel, and Cristian Riveros; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Database theory
Related Version:
Full version with missing proofs: https://arxiv.org/abs/2409.17329
Funding:
This work was partially funded by ANR projects CQFD 18-CE23-0003, EQUUS ANR-19-CE48-0019 and KCODA ANR-20-CE48-0004, by ANID - Millennium Science Initiative Program - Code ICN17_002, by ANID Fondecyt Regular project 1230935, and by the Alexander von Humboldt Foundation.
Acknowledgements:
This work benefited from Dagstuhl Seminar 24032, Representation, Provenance, and Explanations in Database Theory and Logic.
Editors:
Sudeepa Roy and Ahmet Kara

1 Introduction

The aim of direct access algorithms in query answering is to represent query answers in a compact and efficiently computable way without materializing them explicitly, while still allowing efficient access as if the answers were stored in an array. This is modeled in a two-stage framework where, first, in a preprocessing phase, given an input, one computes a data structure which then, in the so-called access phase, allows efficiently accessing individual answers to the query by the position they would have in the materialized query result.

For this approach, there is a trade-off between the runtime of the preprocessing and the access phase: on the one hand, one could simply materialize the answers and then answer all queries in constant time; on the other hand, one could just not preprocess at all and then for every access request evaluate from scratch. When designing direct access algorithms, one thus tries to find interesting intermediate points between these extremes, mostly giving more importance to the access time than the preprocessing, since the latter has to be performed only once for every input while the former can be performed an arbitrary number of times.

Direct access algorithms were first introduced by Bagan, Durand, Grandjean, and Olive in [8] in the context of sampling and enumeration for first-order queries (both can easily be reduced to direct access). However, it was arguably the very influential thesis of Brault-Baron [14] that established direct access as an independent regime for query answering. Since then, there has been a large amount of work extending and generalizing Brault-Baron’s work in several directions [19, 18, 15, 23, 17].

Curiously absent from this line of work is the query evaluation of Monadic Second Order logic (MSO), which in other contexts, like enumeration algorithms, has often been studied in parallel with conjunctive queries. In this paper, we make a first step in that direction, giving a direct access algorithm for MSO queries with free first-order variables over strings.

Our approach to direct access decomposes into several steps: first, we reduce direct access to a counting problem by using binary search to find the answers to be accessed, as is done, e.g., in [17, 18, 15]. We then express this counting problem in terms of matrix multiplication, where we have to compute the result of a product of a linear number of small matrices. To enable the binary search, we then require that this product can be efficiently maintained under substitutions of some of the matrices. This type of problem is known in the literature as a dynamic word problem, where the input is a word whose letters are elements from some semi-group whose product has to be maintained under element substitutions. Since we want to allow more powerful updates to the input later on, we cannot use the known data structures for dynamic word problems directly. We instead opt for a simpler approach that uses an extension of binary search trees to store our matrix product. The price of this is modest, as it leads to a data structure that is only a doubly logarithmic factor slower than the provably optimal data structures for dynamic word problems. Plugging these ingredients together leads to a direct access algorithm with preprocessing time linear in the size of the input word and polylogarithmic access time. Moreover, if the query to evaluate is given in the form of an unambiguous automaton – which is well known to be equivalent in expressivity to MSO queries [16] – then all runtime dependencies on the query are polynomial.

One advantage of our search-tree-based data structure is that, by relying heavily on known tree balancing techniques, it allows updating the input string efficiently without running the preprocessing from scratch. Such update operations have been studied before in the enumeration literature for first-order logic [12], conjunctive queries [28, 11, 12] and, most intensively, for MSO on words and trees [10, 32, 36, 3, 35, 5, 29]. Compared to this latter line of work, we support more powerful updates: instead of additions and deletions of single letters (resp. nodes) as in most of these works, our data structure efficiently allows typical text operations like deletion of whole parts of the text, cut and paste, and concatenation of documents. The most similar update operations in the literature are those from [39], which besides our operations allow persistence of the data structure and thus, in contrast to us, also copy-and-paste operations. Still, our updates are vastly more powerful than in all the rest of the literature. Moreover, we are the first to propose updates to direct access data structures; all works with updates mentioned above only deal with enumeration.

Another advantage of our algorithm is that it easily allows direct access to the query result for any lexicographic order. It has been observed that ordering the answers by a preference stated by the user is desirable. Thus, there are several works that tackle this question for direct access, enumeration, and related problems, see e.g. [15, 18, 9, 21, 20, 40, 41]. While the lexicographic orders we consider are more restricted, we still consider them very natural. Also, it is interesting to see that, while there is an unavoidable runtime price to pay for ranked access with respect to lexicographic orders for conjunctive queries [18, 15], we here get arbitrary lexicographic access for free by simply changing the order in which variables are treated during the binary search. As a consequence, we can even choose the order at access time, as the preprocessing is completely agnostic to the order, similarly to what was shown for the ranked enumeration of conjunctive query answers [20], which however does not consider a dynamic setting.

2 Preliminaries

Sets and strings.

For a set A, we denote by 2^A the power set of A. For n ∈ ℕ with n ≥ 1, we denote by [n] the set {1, …, n}. We use Σ to denote a finite alphabet and Σ* the set of all strings (also called words) with symbols in Σ. We will usually denote strings by s, s′, s_i and similar. For every s1, s2 ∈ Σ*, we write s1 · s2 (or s1 s2 for short) for the concatenation of s1 and s2. If s = a1 ⋯ an ∈ Σ*, we write |s| = n for the length of s. We denote by ϵ ∈ Σ* the string of length 0 (i.e., |ϵ| = 0), also called the empty string. For s = a1 ⋯ an ∈ Σ*, we denote by s[i, j] the substring ai ⋯ aj, by s[..i] the prefix a1 ⋯ ai, and by s[i..] the suffix ai ⋯ an.
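The 1-indexed, inclusive conventions above can be translated to Python's 0-indexed slicing as follows (a small illustration of the notation, not part of the paper):

```python
# Illustration of the paper's 1-indexed string notation via Python slicing.

def substring(s: str, i: int, j: int) -> str:
    """s[i, j] = a_i ... a_j (1-indexed, both endpoints included)."""
    return s[i - 1:j]

def prefix(s: str, i: int) -> str:
    """s[..i] = a_1 ... a_i."""
    return s[:i]

def suffix(s: str, i: int) -> str:
    """s[i..] = a_i ... a_n."""
    return s[i - 1:]

s = "abbab"
assert len(s) == 5
assert substring(s, 2, 4) == "bba"
assert prefix(s, 2) == "ab"
assert suffix(s, 4) == "ab"
assert prefix(s, 2) + suffix(s, 3) == s   # s[..i] concatenated with s[i+1..] gives back s
```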

Mappings.

Given a finite set X of variables and n ∈ ℕ, in this work we will work heavily with mappings of the form μ: X → [n] as our outputs. We denote by dom(μ) = X the domain of μ and by range(μ) = {i ∈ [n] ∣ ∃x ∈ X: μ(x) = i} the range of μ. Further, we denote by μ∅ the empty mapping, which is the unique mapping such that dom(μ∅) = ∅. In our examples, we will usually write (x1 → i1, …, x_ℓ → i_ℓ) to denote the mapping μ: {x1, …, x_ℓ} → [n] such that μ(x1) = i1, …, μ(x_ℓ) = i_ℓ. We also write μ(x → i) with x ∉ dom(μ) for the mapping μ′ such that μ′(x) = i and μ′(y) = μ(y) for every y ∈ dom(μ).

Given a mapping μ: X → [n], we define μ⁻¹: [n] → 2^X as the set-inverse mapping of μ, defined by μ⁻¹(i) := {x ∈ X ∣ μ(x) = i}. Note that i ∈ range(μ) if, and only if, μ⁻¹(i) ≠ ∅. Moreover, given a subset Y ⊆ X, we define the projection π_Y(μ) as the mapping μ′: Y → [n] such that μ′(y) = μ(y) for every y ∈ Y. Remark that if Y = ∅, then π_Y(μ) = μ∅.
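As a quick sketch (our own encoding, not from the paper), mappings μ: X → [n] can be represented as dictionaries, with the set-inverse μ⁻¹ and the projection π_Y implemented directly from the definitions:

```python
# Mappings mu: X -> [n] as Python dicts; set-inverse and projection as defined above.

def set_inverse(mu: dict, n: int) -> dict:
    """mu^{-1}(i) = {x in dom(mu) | mu(x) = i} for every i in [n]."""
    inv = {i: set() for i in range(1, n + 1)}
    for x, i in mu.items():
        inv[i].add(x)
    return inv

def project(mu: dict, ys: set) -> dict:
    """pi_Y(mu): the restriction of mu to the variables in Y."""
    return {x: i for x, i in mu.items() if x in ys}

mu = {"x1": 4, "x2": 3}            # the mapping (x1 -> 4, x2 -> 3)
inv = set_inverse(mu, 5)
assert inv[4] == {"x1"} and inv[3] == {"x2"} and inv[1] == set()
assert project(mu, {"x1"}) == {"x1": 4}
assert project(mu, set()) == {}    # pi over the empty set gives the empty mapping
```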

Variable-set automata.

A variable-set automaton [24, 25, 4] (or vset automaton for short) is a tuple 𝒜 = (Q, Σ, X, Δ, q0, F) where Q is a finite set of states, X is a finite set of variables, q0 ∈ Q is an initial state, F ⊆ Q is a set of final states, and Δ ⊆ Q × Σ × 2^X × Q is the transition relation. Given a string s = a1 ⋯ an, a run ρ of 𝒜 over s is a sequence:

ρ := q0 –a1/X1→ q1 –a2/X2→ ⋯ –an/Xn→ qn    (1)

such that (q_{i−1}, a_i, X_i, q_i) ∈ Δ for every i ≤ n. We say that a run ρ like (1) is valid if, and only if, ⋃_{i=1}^n X_i = X and X_i ∩ X_j = ∅ for every i < j ≤ n; in other words, each variable in X appears exactly once in ρ. If ρ is valid, one can define the mapping μρ: X → [n] such that, for every x ∈ X, μρ(x) = i where i is the unique position that satisfies x ∈ X_i. As usual, we also say that a run ρ like (1) is accepting if, and only if, qn ∈ F. Then we define the output of 𝒜 over a string s as the set of mappings:

𝒜(s) = {μρ ∣ ρ is a valid and accepting run of 𝒜 over s}.

We define a partial run ρ of 𝒜 as a sequence of transitions ρ := p0 –b1/Y1→ p1 –b2/Y2→ ⋯ –bn/Yn→ pn such that (p_{i−1}, b_i, Y_i, p_i) ∈ Δ. We say that ρ is a partial run from p0 to pn over the string b1 ⋯ bn. Note that a run is also a partial run where we additionally assume that p0 = q0. We say that a partial run is valid if Y_i ∩ Y_j = ∅ for all i ≠ j ≤ n. We define the length of ρ as |ρ| = n, and we make the convention that a single state p0 is a partial run of length 0. We define the set of variables vars(ρ) of a partial run ρ as vars(ρ) = ⋃_{i=1}^n Y_i. Given two partial runs ρ = p0 –b1/Y1→ ⋯ –bn/Yn→ pn and σ = r0 –c1/Z1→ ⋯ –cm/Zm→ rm such that pn = r0, we define the run ρσ as the concatenation of ρ and σ, i.e., the partial run ρσ := p0 –b1/Y1→ ⋯ –bn/Yn→ pn –c1/Z1→ ⋯ –cm/Zm→ rm. Note that |ρσ| = |ρ| + |σ| and vars(ρσ) = vars(ρ) ∪ vars(σ).

We define the size |𝒜| of a vset automaton 𝒜 = (Q, Σ, X, Δ, q0, F) as the number of states and transitions, so |𝒜| := |Q| + |Δ|. In the following, we assume that all vset automata are trimmed, i.e., for every q ∈ Q there exists a run ρ that reaches q, and there exists a partial run σ from q to some state in F. We can make this assumption without loss of generality, since removing unreachable states can be done in time O(|𝒜|) without changing 𝒜(s) for any string s.

MSO and vset automata.

In this paper, we usually refer to Monadic Second Order Logic (MSO) as our language for query evaluation; however, we will not define it formally here since we will use vset automata as an equivalent model for MSO. Precisely, when we refer to MSO, we mean MSO formulas of the form φ(x1, …, xn) over strings where x1, …, xn are first-order free variables (see, e.g., [31] for a formal definition). Then, given a string s as a logical structure, the MSO query evaluation problem refers to computing the set φ(x1, …, xn)(s) := {μ: {x1, …, xn} → [|s|] ∣ (s, μ) ⊨ φ(x1, …, xn)}, which is the set of all first-order assignments (i.e., mappings) that satisfy φ over s. One can show that MSO over strings with first-order free variables is equally expressive as vset automata, basically by following the same construction as for the Büchi-Elgot-Trakhtenbrot theorem [16] (see also [34]). Furthermore, vset automata (as defined here) are equally expressive as regular document spanners [24] for information extraction (see, e.g., [6, 34]). For this reason, we can use vset automata to define MSO queries, which also applies to the setting of document spanners, or any other query language equally expressive as MSO.

Functional and unambiguous vset automata.

It is useful to consider vset automata that have no accepting runs that are not valid, so we make the following definition: we say that a vset automaton 𝒜 is functional if, and only if, for every s ∈ Σ*, every accepting run ρ of 𝒜 over s is also valid. In [24], it was shown that there is an algorithm that, given a vset automaton 𝒜, constructs in exponential time a functional vset automaton 𝒜′ of exponential size with respect to 𝒜 such that 𝒜′(s) = 𝒜(s) for every string s. We will thus restrict our analysis and algorithms to functional vset automata, since we can extend them to non-functional vset automata by incurring an exponential blow-up. One useful property of functional vset automata is that each state determines the variables that are assigned before reaching it, in the following sense.

Lemma 1.

Let 𝒜 = (Q, Σ, X, Δ, q0, F) be a functional vset automaton. For every q ∈ Q there exists a set X_q ⊆ X such that for every partial run ρ from q0 to q it holds that vars(ρ) = X_q.

Proof (sketch).

The lemma is standard in the literature for similar models, see e.g. [33]. We provide a short proof. By way of contradiction, suppose that there exist a state q ∈ Q and two partial runs ρ and ρ′ from q0 to q such that vars(ρ) ≠ vars(ρ′). Given that 𝒜 is trimmed, there exists a partial run σ starting from q that reaches some final state in F. Then the runs ρσ and ρ′σ are accepting. Since 𝒜 is functional, both runs are also valid and should satisfy vars(ρσ) = X = vars(ρ′σ). However, vars(ρσ) ≠ vars(ρ′σ), which is a contradiction.

By Lemma 1, for every functional vset automaton 𝒜 = (Q, Σ, X, Δ, q0, F) and every q ∈ Q we define its set of variables X_q to be the set as in the lemma. One consequence of Lemma 1 is that for every transition (q, a, Y, q′) ∈ Δ, we have Y = X_{q′} \ X_q. In particular, there can be at most |Σ| transitions for every pair q, q′ of states, so that overall |Δ| = O(|Q|^2 |Σ|).

We will also restrict to unambiguous vset automata. We say that a vset automaton 𝒜 = (Q, Σ, X, Δ, q0, F) is unambiguous if for every string s ∈ Σ* and every μ ∈ 𝒜(s) there exists exactly one accepting run ρ of 𝒜 over s such that μ = μρ. It is well known that for every vset automaton 𝒜 one can construct an equivalent unambiguous functional vset automaton 𝒜′ of exponential size with respect to 𝒜. Therefore, we can restrict our analysis to the class of unambiguous functional vset automata without losing expressive power. Both are standard assumptions in the literature on MSO query evaluation and have been used to obtain efficient enumeration algorithms for computing 𝒜(s), see e.g. [25, 38, 34].

Example 2.

Figure 1: A running example of a vset automaton 𝒜0 that will be used throughout this work. (a) The vset automaton 𝒜0 (the drawing is not reproduced here). (b) The mappings computed by 𝒜0 on input string s = abbab:

    x1  x2
     1   2
     4   2
     4   5
     4   3

(c) The transition matrices of 𝒜0, with rows and columns indexed by q0, q1, q2, q3:

    Ma = [ 1 1 0 0 ]      Mb = [ 1 0 1 0 ]
         [ 0 1 0 0 ]           [ 0 0 0 1 ]
         [ 0 0 0 1 ]           [ 0 0 1 0 ]
         [ 0 0 0 1 ]           [ 0 0 0 1 ]

Figure 1(a) depicts an unambiguous functional vset automaton 𝒜0 on variables {x1, x2} and alphabet {a, b}. It can be verified that X_{q0} = ∅, X_{q1} = {x1}, X_{q2} = {x2} and X_{q3} = {x1, x2}. Let s = abbab. The following are all valid and accepting runs of 𝒜0 over s:

  • ρ0 := q0 –a/{x1}→ q1 –b/{x2}→ q3 –b/∅→ q3 –a/∅→ q3 –b/∅→ q3 and μρ0 = {x1 → 1, x2 → 2}

  • ρ1 := q0 –a/∅→ q0 –b/{x2}→ q2 –b/∅→ q2 –a/{x1}→ q3 –b/∅→ q3 and μρ1 = {x1 → 4, x2 → 2}

  • ρ2 := q0 –a/∅→ q0 –b/∅→ q0 –b/∅→ q0 –a/{x1}→ q1 –b/{x2}→ q3 and μρ2 = {x1 → 4, x2 → 5}

  • ρ3 := q0 –a/∅→ q0 –b/∅→ q0 –b/{x2}→ q2 –a/{x1}→ q3 –b/∅→ q3 and μρ3 = {x1 → 4, x2 → 3}

In Figure 1(b), we show a summary of 𝒜0(s). It can be verified that 𝒜0(s) is the set of tuples μ such that μ(x1) is a position carrying an a in s and μ(x2) is a position carrying a b in s. Moreover, if μ(x1) < μ(x2), then μ(x2) is the first position after μ(x1) containing a b. Otherwise, if μ(x2) < μ(x1), then μ(x1) is the first position after μ(x2) containing an a.
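The four mappings above can be checked by brute force: enumerate all runs of 𝒜0 over s by depth-first search and keep the valid accepting ones. The transition set below is our reconstruction from the runs and matrices of Example 2 (the drawing of Figure 1(a) is not reproduced here), so treat it as an assumption:

```python
# Brute-force evaluation of the running example A0 on s = abbab.
# DELTA is reconstructed from the listed runs and the matrices of Figure 1(c).

DELTA = [
    ("q0", "a", frozenset(), "q0"), ("q0", "b", frozenset(), "q0"),
    ("q0", "a", frozenset({"x1"}), "q1"), ("q0", "b", frozenset({"x2"}), "q2"),
    ("q1", "a", frozenset(), "q1"), ("q1", "b", frozenset({"x2"}), "q3"),
    ("q2", "b", frozenset(), "q2"), ("q2", "a", frozenset({"x1"}), "q3"),
    ("q3", "a", frozenset(), "q3"), ("q3", "b", frozenset(), "q3"),
]
Q0, FINAL, VARS = "q0", {"q3"}, {"x1", "x2"}

def outputs(s: str) -> set:
    """All mappings mu_rho for valid and accepting runs rho over s."""
    results = set()

    def dfs(state, pos, mu):
        if pos == len(s):
            if state in FINAL and set(mu) == VARS:   # accepting and valid
                results.add(tuple(sorted(mu.items())))
            return
        for p, a, ys, q in DELTA:
            # a transition applies if it reads the right letter and does not
            # reassign a variable that was already set (validity)
            if p == state and a == s[pos] and ys.isdisjoint(mu):
                dfs(q, pos + 1, {**mu, **{y: pos + 1 for y in ys}})

    dfs(Q0, 0, {})
    return results

# Matches Figure 1(b): the four mappings listed in Example 2.
expected = {(("x1", 1), ("x2", 2)), (("x1", 4), ("x2", 2)),
            (("x1", 4), ("x2", 5)), (("x1", 4), ("x2", 3))}
assert outputs("abbab") == expected
```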

Computational model.

We work with the usual RAM model with unit costs and logarithmic size registers, see e.g. [27]. All data structures that we will encounter in our algorithms will be of polynomial size, so all addresses and pointers will fit into a constant number of registers. Also, all atomic memory operations like random memory access and following pointers can be performed in constant time. The numbers we consider will be of value at most |s|^{|X|} where X is the variable set of the automaton we consider. Thus, all numbers can be stored in at most |X| memory cells, and arithmetic operations on them can be performed in time O(|X|^2) naively (there are faster algorithms for multiplication and addition, which we will not pursue here; using them, the factor |X|^2 in all runtime bounds below can be reduced accordingly). In the sequel, we use ω to denote the exponent of matrix multiplication.

3 Direct access for MSO queries

Let X be a set of variables. Given a total order ≺ on X, we extend ≺ to a lexicographic order on mappings as usual: for mappings μ, μ′: X → [n] we define μ ≺ μ′ if, and only if, there exists x ∈ X such that μ(x) < μ′(x) and, for every y ∈ X, if y ≺ x, then μ(y) = μ′(y).

Fix a total order ≺ on X. Given a vset automaton 𝒜 = (Q, Σ, X, Δ, q0, F) and an input string s, consider the set of outputs 𝒜(s) = {μ1, μ2, …} such that μ1 ≺ μ2 ≺ ⋯. We define the i-th output of 𝒜 over s, denoted by 𝒜(s)[i], as the mapping μi. Intuitively, we see 𝒜(s) as an array where the outputs are ordered by ≺ and we retrieve the i-th element of this array. The direct access problem for vset automata is the following: given a vset automaton 𝒜, a string s, and an index i, compute the i-th output 𝒜(s)[i]; if the index i is larger than the number of solutions, then return an out-of-bound error. Without loss of generality, in the following we always assume that i ≤ |𝒜(s)|, since |𝒜(s)| is easy to compute for unambiguous functional vset automata, see Section 4. As usual, we split the computation into two phases, called the preprocessing phase and the access phase:

Problem: MSODirectAccess[≺]
Preprocessing:
  input: a vset automaton 𝒜 and s ∈ Σ*
  result: a data structure D_{𝒜,s}
Access:
  input: an index i and D_{𝒜,s}
  output: the i-th output 𝒜(s)[i]

During the preprocessing phase, the algorithm receives as input a vset automaton 𝒜 and a string s ∈ Σ* and computes a data structure D_{𝒜,s}. After preprocessing, there is the access phase where the algorithm receives any index i as input and computes the i-th output 𝒜(s)[i] by using the precomputed data structure D_{𝒜,s}.

To measure the efficiency of a direct access algorithm, we say that an algorithm for the problem MSODirectAccess[≺] has f-preprocessing and g-access time for some functions f and g if, and only if, the running time of the preprocessing phase and the access phase is in O(f(𝒜, s)) and O(g(𝒜, s)), respectively. Note that the running time of the access phase does not depend on i, given that its bitsize is bounded by |X| log(|s|), so that it can be stored in |X| registers in the RAM model and arithmetic on it can be done in time O(|X|^2) naively.

Given a class 𝒞 of vset automata, we define the problem MSODirectAccess[≺] for 𝒞 as the problem above when we restrict 𝒜 to 𝒞. The main result of this work is the following.

Theorem 3.

There is an algorithm that solves MSODirectAccess[≺] for the class of unambiguous functional vset automata for any variable order ≺ with preprocessing time O(|Q|^ω |X|^2 |s|) and access time O(|Q|^ω |X|^3 log^2(|s|)). These bounds remain true even when the order ≺ is only given as an additional input in the access phase.

In Theorem 3, we restrict to the class of unambiguous vset automata. A natural question is what happens if we consider the class of all functional vset automata. We first remark that the data complexity of the problem will not change as it is possible to translate any functional vset automaton to an unambiguous one with an exponential blow-up. Unfortunately, in combined complexity, there is no polynomial time algorithm under standard assumptions.

Proposition 4.

If MSODirectAccess[≺] for the class of functional vset automata has an algorithm with polynomial time preprocessing and access, then SpanL ⊆ FP.

SpanL is the class of functions computable as |R|, where R is the set of output values returned by the accepting paths of an NL machine, see [2] for a formal definition. In [2], it is shown that if SpanL ⊆ FP, i.e., if every function in SpanL is computable in polynomial time, then P = NP. Therefore, by Proposition 4 and [2], it is unlikely that there is an efficient algorithm for MSODirectAccess[≺] for all functional vset automata in combined complexity.

In the remainder of this paper, we will prove Theorem 3. To make the proof more digestible, in Section 4 we show how to reduce the direct access problem to a counting problem and introduce a matrix approach for it. In Section 5, we will present the data structure constructed during the preprocessing and used in the access phase, completing the proof of Theorem 3. In Section 6, we then show how to integrate updates to the input string into our approach.

4 From direct access to counting: a matrix approach

In this section, we present the main algorithmic ideas for Theorem 3. In the next section, we use these ideas to develop a data structure for the preprocessing and access algorithm.

From direct access to counting.

Let 𝒜 = (Q, Σ, X, Δ, q0, F) be a vset automaton and let s = a1 ⋯ an be a string of length n. Assume that X = {x1, …, x_ℓ} has ℓ variables ordered by a total order ≺; w.l.o.g., x1 ≺ x2 ≺ ⋯ ≺ x_ℓ. For an index k ∈ [ℓ] and a mapping τ: {x1, …, xk} → [n], we define the set:

𝒜(s, τ) = {μ ∈ 𝒜(s) ∣ π_{x1,…,x_{k−1}}(μ) = π_{x1,…,x_{k−1}}(τ) and μ(xk) ≤ τ(xk)}.

Intuitively, the set 𝒜(s, τ) restricts the output set 𝒜(s) to all outputs which coincide with τ on the variables before xk, and which assign xk to a position at most τ(xk). If τ = μ∅ is the empty mapping, we define 𝒜(s, τ) = 𝒜(s).

Example 5.

Going back to the example from Figure 1(a) with s = abbab and τ = (x1 → 2), the set 𝒜0(s, τ) contains all tuples from 𝒜0(s) which map x1 to a position at most 2. As seen in Figure 1(b), we have 𝒜0(s, τ) = {(x1 → 1, x2 → 2)}. Now if τ = (x1 → 2, x2 → 3), we have 𝒜0(s, τ) = ∅ since we only keep tuples where x1 is set to 2 and no such tuple can be found. If τ = (x1 → 4, x2 → 3), we have 𝒜0(s, τ) = {(x1 → 4, x2 → 2), (x1 → 4, x2 → 3)}.

For the sake of presentation, we denote by #𝒜(s) (resp. #𝒜(s,τ)) the number |𝒜(s)| (resp. |𝒜(s,τ)|) of outputs in 𝒜(s) (resp. 𝒜(s,τ)). For direct access, we are interested in finding efficient algorithms for computing #𝒜(s,τ) because of the following connection.

Lemma 6 (Lemma 7 in [17]).

If there is an algorithm that computes #𝒜(s, τ) in time T for every k ∈ [ℓ] and every τ: {x1, …, xk} → [n], then there is an algorithm that retrieves 𝒜(s)[i] in time O(T ℓ log(n)) for every index i.

Proof (sketch).

A similar proof can be found in [17]. For the convenience of the reader, and because we will use a slightly more complicated variant later, we give a quick sketch here. Let μi = 𝒜(s)[i]. The idea is to first compute μi(x1) by observing the following: μi(x1) is the smallest value j1 ∈ [n] such that #𝒜(s, (x1 → j1)) ≥ i. This value can be found in time O(T log n) by performing a binary search on {#𝒜(s, (x1 → j1)) ∣ j1 ≤ n}. Once we have found μi(x1), we compute μi(x2) similarly by observing that it is the smallest value j2 such that #𝒜(s, (x1 → j1, x2 → j2)) ≥ i − #𝒜(s, (x1 → j1 − 1)). The claim then follows by a simple induction.

Given Lemma 6, in the following we concentrate our efforts on developing an index structure for efficiently computing #𝒜(s, τ) for every k ∈ [ℓ] and every τ: {x1, …, xk} → [n].
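The binary search of Lemma 6 can be sketched as follows. For illustration only, the counting oracle #𝒜(s, τ) is simulated by brute force over a materialized answer set (here the four outputs of 𝒜0 on abbab); the whole point of the paper is of course to avoid this materialization:

```python
# Direct access from a counting oracle, following the proof sketch of Lemma 6.

ANSWERS = [(1, 2), (4, 2), (4, 3), (4, 5)]   # A0(abbab) in lex order x1 < x2
N, ELL = 5, 2                                 # string length, number of variables

def count(tau):
    """#A(s, tau): answers agreeing with tau on x_1..x_{k-1} and <= tau on x_k."""
    k = len(tau)
    return sum(1 for mu in ANSWERS
               if mu[:k - 1] == tau[:k - 1] and mu[k - 1] <= tau[k - 1])

def direct_access(i):
    """Retrieve the i-th answer (1-indexed) with l * log(n) oracle calls."""
    prefix = ()
    for _ in range(ELL):
        lo, hi = 1, N            # find the smallest j with count(prefix + (j,)) >= i
        while lo < hi:
            mid = (lo + hi) // 2
            if count(prefix + (mid,)) >= i:
                hi = mid
            else:
                lo = mid + 1
        # discount the answers strictly before position lo for this variable
        i -= count(prefix + (lo - 1,)) if lo > 1 else 0
        prefix += (lo,)
    return prefix

assert [direct_access(i) for i in range(1, 5)] == ANSWERS
```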

A matrix representation for the counting problem.

Let 𝒜 = (Q, Σ, X, Δ, q0, F) be an unambiguous functional vset automaton. For any letter a ∈ Σ, we define the matrix Ma ∈ ℕ^{Q×Q} such that for every p, q ∈ Q:

Ma[p, q] := 1 if (p, a, S, q) ∈ Δ for some S ⊆ X, and Ma[p, q] := 0 otherwise.

Strictly speaking, the matrix Ma depends on 𝒜 and we should write Ma^𝒜; however, 𝒜 will always be clear from the context and, thus, we omit 𝒜 as a superscript from Ma. Since 𝒜 is functional, for every pair of states p, q ∈ Q and a ∈ Σ there exists at most one transition (p, a, S, q) ∈ Δ. Then Ma contains a 1 for exactly the pairs of states that have a transition with the letter a. For this reason, one can construct Ma in time O(|Q|^2) for every a ∈ Σ. Figure 1(c) gives an example of the matrices Ma and Mb for the automaton 𝒜0 from Example 2.

We can map strings to matrices in ℕ^{Q×Q} by homomorphically extending the mapping a ↦ Ma from letters to strings, mapping every string s = a1 ⋯ an to the matrix Ms defined by the product Ms := Ma1 · Ma2 ⋯ Man. For ϵ, we define Mϵ = I where I is the identity matrix in ℕ^{Q×Q}. Note that this forms a homomorphism from strings to matrices where M_{s1 s2} = M_{s1} · M_{s2} for every pair of strings s1, s2 ∈ Σ*. It is easy to verify that for all states p, q ∈ Q, the entry Ms[p, q] is the number of partial runs from p to q of 𝒜 over s. Furthermore, if we define q⃗0 as the (row) vector such that q⃗0[p] = 1 if p = q0 and 0 otherwise, and F⃗ as the (column) vector such that F⃗[p] = 1 if p ∈ F and 0 otherwise, then we have the following equality between the number of outputs and matrix products: #𝒜(s) = q⃗0 · Ms · F⃗.
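This equality can be checked on the running example with the matrices of Figure 1(c) (states ordered q0, q1, q2, q3; since q0 is initial and F = {q3}, the product q⃗0 · Ms · F⃗ is just the entry [q0, q3] of Ms):

```python
# Counting #A0(s) via the matrix homomorphism, using the matrices of Figure 1(c).

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

M = {
    "a": [[1, 1, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 0, 1]],
    "b": [[1, 0, 1, 0], [0, 0, 0, 1], [0, 0, 1, 0], [0, 0, 0, 1]],
}

def count_outputs(s: str) -> int:
    """#A0(s): entry [q0, q3] of M_{a1} ... M_{an}."""
    Ms = [[int(i == j) for j in range(4)] for i in range(4)]   # identity = M_eps
    for a in s:
        Ms = mat_mul(Ms, M[a])
    return Ms[0][3]

assert count_outputs("abbab") == 4   # the four mappings of Example 2
```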

Our goal is to have a similar result for calculating #𝒜(s, τ) given a mapping τ: {x1, …, xk} → [n]. For this purpose, for every string s = a1 ⋯ an, we define a matrix Ms^τ as follows. Recall that τ⁻¹ is the set-inverse of τ. Then, for every a ∈ Σ and i ∈ [n], define the matrix M^τ_{a,i} ∈ ℕ^{Q×Q} such that for every p, q ∈ Q:

M^τ_{a,i}[p, q] = 1 if there is a transition (p, a, S, q) ∈ Δ such that (1) τ⁻¹(i) \ {xk} ⊆ S and (2) if τ(xk) = i then xk ∈ X_q; and M^τ_{a,i}[p, q] = 0 otherwise.

Finally, we define Ms^τ = M^τ_{a1,1} · M^τ_{a2,2} ⋯ M^τ_{an,n}. Intuitively, Condition (1) for M^τ_{a,i}[p, q] = 1 makes sure that all runs of 𝒜 over s counted by Ms^τ take, at position i, transitions that contain all variables xj with τ(xj) = i and j < k. Note that we remove xk from τ⁻¹(i) since xk has a special meaning in τ. Indeed, Condition (2) for M^τ_{a,i}[p, q] = 1 restricts xk such that its assignment must be before or equal to position i. For this second condition, we exploit the set X_q of a functional vset automaton, which gives us the information of all variables that have been used up to state q.

It is important to notice that if i ∉ range(τ) then M^τ_{ai,i} = M_{ai}. Since range(τ) has at most k elements, the sequences Ma1, …, Man and M^τ_{a1,1}, …, M^τ_{an,n} differ in at most k matrices. In particular, if τ is the empty mapping, then Ms^τ = Ms as expected. Finally, similarly to Ma, one can compute M^τ_{a,i} in time O(|Q|^2) for any a ∈ Σ, assuming that we have access to τ and τ⁻¹.

Example 7.

Consider again the example from Figure 1(a) with s = abbab. Observe that for any τ, whenever Ma[p, q] = 0 then necessarily M^τ_{a,i}[p, q] = 0. Hence, in this example, we focus on entries Ma[p, q] that go from 1 to 0 after applying a partial mapping. Let τ = (x1 → 4, x2 → 4). We have M^τ_{a,4}[q0, q1] = 0 because, even though there is a transition (q0, a, S, q1) with x1 ∈ S, we do not have x2 ∈ X_{q1}. In other words, a run compatible with τ cannot take the transition (q0, a, {x1}, q1) at position 4, since doing so would not allow mapping x2 to a position before 4. On the other hand, M^τ_{a,4}[q2, q3] = 1 because it is possible to set x1 to position 4 by taking the transition (q2, a, {x1}, q3). Moreover, since x2 ∈ X_{q3}, the variable x2 has necessarily been set by an earlier transition, hence at a position preceding 4. Finally, M^τ_{a,4}[q3, q3] = 0 because there is no transition from q3 to q3 setting the variable x1.

Similarly to the relation between #𝒜(s) and Ms, we can compute #𝒜(s,τ) by using Msτ as the following result shows.

Lemma 8.

For every unambiguous functional vset automaton 𝒜 = (Q, Σ, X, Δ, q0, F), every string s ∈ Σ*, and every mapping τ: {x1, …, xk} → [n] it holds that: #𝒜(s, τ) = q⃗0 · Ms^τ · F⃗.
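Lemma 8 can be checked on the running example by building the matrices M^τ_{a_i,i} from the definition above and comparing q⃗0 · Ms^τ · F⃗ with the counts of Example 5. The transition set and the sets X_q below are reconstructed from Example 2 (the figure is not reproduced here), so treat them as assumptions:

```python
# Verifying #A0(s, tau) = q0 . M_s^tau . F on the running example.

DELTA = [
    ("q0", "a", frozenset(), "q0"), ("q0", "b", frozenset(), "q0"),
    ("q0", "a", frozenset({"x1"}), "q1"), ("q0", "b", frozenset({"x2"}), "q2"),
    ("q1", "a", frozenset(), "q1"), ("q1", "b", frozenset({"x2"}), "q3"),
    ("q2", "b", frozenset(), "q2"), ("q2", "a", frozenset({"x1"}), "q3"),
    ("q3", "a", frozenset(), "q3"), ("q3", "b", frozenset(), "q3"),
]
STATES = ["q0", "q1", "q2", "q3"]
XQ = {"q0": set(), "q1": {"x1"}, "q2": {"x2"}, "q3": {"x1", "x2"}}

def mat_tau(a, i, tau, order):
    """M^tau_{a,i}, where tau is a dict over x_1..x_k and x_k = order[-1]."""
    xk = order[-1]
    need = {x for x, j in tau.items() if j == i and x != xk}   # tau^{-1}(i) \ {x_k}
    M = [[0] * 4 for _ in range(4)]
    for p, b, ys, q in DELTA:
        # condition (1): need <= ys; condition (2): if tau(x_k) = i then x_k in X_q
        if b == a and need <= ys and (tau[xk] != i or xk in XQ[q]):
            M[STATES.index(p)][STATES.index(q)] = 1
    return M

def count(s, tau, order):
    """#A0(s, tau) = q0-row of M^tau_{a_1,1} ... M^tau_{a_n,n}, at column q3."""
    row = [1, 0, 0, 0]                                         # vector for q0
    for i, a in enumerate(s, start=1):
        M = mat_tau(a, i, tau, order)
        row = [sum(row[p] * M[p][q] for p in range(4)) for q in range(4)]
    return row[3]                                              # F = {q3}

# The counts of Example 5 for s = abbab:
assert count("abbab", {"x1": 2}, ["x1"]) == 1
assert count("abbab", {"x1": 2, "x2": 3}, ["x1", "x2"]) == 0
assert count("abbab", {"x1": 4, "x2": 3}, ["x1", "x2"]) == 2
```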

The algorithm.

Now that we have defined the main technical tools, we can present the algorithms for the preprocessing and direct access of an unambiguous functional vset automaton 𝒜 over a string s = a1 ⋯ an. For this purpose, we will assume here the existence of a data structure for maintaining the product of the matrices {M^τ_{ai,i}}_{i∈[n]}, and then present and analyze the algorithms. In Section 5, we will explain this data structure in full detail.

Suppose there exists a data structure D that, given values n, m, maintains n square matrices M1, …, Mn of size m × m. This data structure supports three methods called init, set, and out, which are defined as follows:

  • D ← init(M1, …, Mn): receives as input the initial n matrices M1, …, Mn of size m × m and initializes a data structure D storing the matrices M1, …, Mn;

  • D′ ← set(D, j, M): receives as input a data structure D, a position j ≤ n, and an (m × m)-matrix M, and outputs a data structure D′ that is equivalent to D but whose j-th matrix Mj is replaced by M; and

  • M ← out(D): receives as input a data structure D storing matrices M1, …, Mn and outputs a pointer to the matrix M := M1 ⋯ Mn, i.e., the product of the n matrices.

Further, we assume that the method set is persistent [22], meaning that each call to set produces a new data structure D′ without modifying the previous data structure D.

Algorithm 1 The preprocessing and direct access algorithms for an unambiguous functional vset automaton 𝒜 = (Q, Σ, X, Δ, q0, F) with variables x1, …, x_ℓ and string s = a1 ⋯ an.
 1: procedure Preprocessing(𝒜, s)
 2:   D ← init(Ma1, …, Man)
 3: procedure BinarySearch(x, i, τ)
 4:   L ← 0
 5:   R ← n
 6:   while L + 1 < R do
 7:     j ← ⌊(L + R)/2⌋
 8:     τ′ ← τ(x → j)
 9:     D′ ← set(D, j, M^{τ′}_{aj,j})
10:     if q⃗0 · out(D′) · F⃗ ≥ i then
11:       R ← j
12:     else
13:       L ← j
14:   return R
15: procedure DirectAccess(i)
16:   τ ← μ∅
17:   for k = 1, …, ℓ do
18:     j ← BinarySearch(xk, i, τ)
19:     i ← i − CalculateDiff(xk, j, τ)
20:     τ ← τ(xk → j)
21:     UpdateStruct(x_{k+1}, j, τ)
22:   return τ
23: procedure CalculateDiff(x, j, τ)
24:   τprev ← τ(x → j − 1)
25:   Dprev ← set(D, j − 1, M^{τprev}_{a_{j−1}, j−1})
26:   return q⃗0 · out(Dprev) · F⃗
27: procedure UpdateStruct(x, j, τ)
28:   τnext ← τ(x → j + 1)
29:   D ← set(D, j, M^{τnext}_{aj,j})

Assuming the existence of the data structure D, Algorithm 1 presents all steps of the preprocessing and direct access for an unambiguous functional vset automaton 𝒜 over a string s = a1 ⋯ an. For the sake of presentation, we assume that 𝒜, s, and the data structure D are globally available to all procedures.

Algorithm 1 uses all the tools developed in this section for solving the MSO direct access problem. The preprocessing (Algorithm 1, left) receives as input the vset automaton and the string and constructs the data structure D by calling the method init with the matrices Ma1, …, Man. The direct access (Algorithm 1, right) receives any index i and outputs the i-th mapping τ in 𝒜(s) by following the strategy of Lemma 6. Specifically, starting from an empty mapping τ (line 16), it finds the position for each variable xk by binary search (lines 17-18). After finding the value j for xk, it decreases the index i by the number of outputs just before j (line 19), updates τ with (xk → j) (line 20), and updates D with the new value of xk (line 21). We use the auxiliary procedures CalculateDiff and UpdateStruct to simplify the presentation of these steps. The workhorse of the direct access is the procedure BinarySearch. It performs a standard binary search for finding the value of variable x, using the reduction of Lemma 6 to the counting problem and the matrix characterization of #𝒜(s, τ) from Lemma 8.

The correctness of Algorithm 1 follows from Lemmas 6 and 8. Regarding the running time, suppose that the methods 𝗂𝗇𝗂𝗍, 𝗌𝖾𝗍, and 𝗈𝗎𝗍 of the data structure D take time t𝗂𝗇𝗂𝗍, t𝗌𝖾𝗍, and t𝗈𝗎𝗍, respectively. One can check that the preprocessing takes time O(|Q|²|s| + t𝗂𝗇𝗂𝗍), of which O(|Q|²|s|) is spent creating the matrices Ma1, …, Man, and that the direct access takes time O(|X| log(|s|)(t𝗌𝖾𝗍 + t𝗈𝗎𝗍)) for each index i.

The next section shows how to implement the data structure D and its methods 𝗂𝗇𝗂𝗍, 𝗌𝖾𝗍, and 𝗈𝗎𝗍. In particular, we show that the running time of these methods will be t𝗂𝗇𝗂𝗍=O(|Q|ω|X|2|s|), t𝗌𝖾𝗍=O(|Q|ω|X|2log(|s|)), and t𝗈𝗎𝗍=O(1). Overall, the total running time of Algorithm 1 will be O(|Q|ω|X|2|s|) for the preprocessing and O(|Q|ω|X|3log(|s|)2) for each direct access as stated in Theorem 3.

5 Maintaining semi-groups products

In this section, we show how to implement the data structure required in Section 4; in particular, its 𝗌𝖾𝗍 operation allows changing some of the elements in a sequence of matrices while maintaining their product efficiently. Similar problems have been studied in the literature under the name of dynamic word problems, see e.g. [26, 7]. In that setting, one is given a sequence of elements from a semi-group and wants to maintain the product of this sequence under substitution of elements. Depending on the algebraic structure of the semi-group, there are algorithms of different efficiencies, and there is by now quite a good understanding of the problem, see again [26, 7].

We could likely use the results of [26] directly to define our data structure (though [26] assumes a finite semiring, which does not correspond to our setting). However, later we will want to support more powerful changes to the strings than just substitutions, and it is not clear how the approach from [26] could be adapted for this. We therefore choose a less technically involved approach that allows for more powerful update operations while only losing a factor of loglog(n) compared to [26]. So fix a semi-group 𝔾, i.e., a set 𝔾 with an associative operation ⊙. The semi-group of most interest to us is that of (m×m)-matrices over ℕ with the usual matrix multiplication, but our approach works for any semi-group. As is common, we often leave out the multiplication symbol and write g1g2 for g1 ⊙ g2. We next introduce our data structure.

We will store sequences g = g1, …, gn over 𝔾 in binary trees. To this end, let T = (V, E) be a rooted binary tree in which the vertices are labeled with elements from 𝔾. We assume that each child of an internal node of T is designated as either a left or a right child; this distinction is important since the semi-group need not be commutative. The label of a vertex v is denoted by 𝗅𝖺𝖻𝖾𝗅(v). Recall that the in-order traversal of a rooted binary tree, when encountering a vertex v with left child v1 and right child v2, first recursively visits the subtree of v1, then visits v, and finally recursively visits the subtree of v2. For every vertex v of T, we define the sequence of v, in symbols 𝗌𝖾𝗊(v), to be the sequence of elements in the subtree rooted in v read in in-order traversal. We also define the product of v, in symbols 𝗉𝗋𝗈𝖽(v), as the product of the elements of 𝗌𝖾𝗊(v) in the order of the sequence. Note that 𝗉𝗋𝗈𝖽(v) is a single element of 𝔾 while 𝗌𝖾𝗊(v) is a sequence of elements of 𝔾; we never store 𝗌𝖾𝗊(v) explicitly in the vertex v, but we do store 𝗉𝗋𝗈𝖽(v). We also write 𝗌𝖾𝗊(T) for the sequence of the root of T and analogously define 𝗉𝗋𝗈𝖽(T).

In the remainder, we want to maintain 𝗉𝗋𝗈𝖽(T) under changes to 𝗌𝖾𝗊(T). To this end, it will be useful to access positions in 𝗌𝖾𝗊(T) efficiently. If the only changes were substitutions of elements, as needed for Algorithm 1, we could easily do this by letting T be a binary search tree in which the keys are the positions in 𝗌𝖾𝗊(T). However, we later also want to be able to delete, add, and move strings, and for these operations we would have to change a linear number of keys, which is too expensive for our setting. Instead, we use the linear list representation with search trees, see e.g. [30, Chapter 6.2.3]: in every vertex v we only store the size of the subtree of T rooted in v, which we denote by 𝗌𝗂𝗓𝖾(v). Note that 𝗌𝗂𝗓𝖾(v) is also the length of 𝗌𝖾𝗊(v). The 𝗌𝗂𝗓𝖾 information lets us locate the elements at desired positions in 𝗌𝖾𝗊(v).

Observation 9.

For every vertex v in T we can, given an index i ∈ [|𝗌𝖾𝗊(v)|] and correct 𝗌𝗂𝗓𝖾-labels, decide in constant time whether the entry at position i in 𝗌𝖾𝗊(v) is in the left subtree, in the right subtree, or in v itself. In the first two cases, we can also compute its position within the contiguous subsequence of that subtree.
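In code, Observation 9 amounts to comparing i with the size of the left subtree. A minimal sketch (the node layout and names are ours):

```python
class Node:
    """Vertex of a size-augmented binary tree; seq(v) is the in-order sequence."""
    def __init__(self, label, left=None, right=None):
        self.label, self.left, self.right = label, left, right
        self.size = 1 + (left.size if left else 0) + (right.size if right else 0)

def locate(v, i):
    """Descend to the node holding position i (1-indexed) of seq(v), deciding
    at each vertex in constant time whether i falls left, at v, or right."""
    left_size = v.left.size if v.left else 0
    if i <= left_size:                          # entry lies in the left subtree
        return locate(v.left, i)
    if i == left_size + 1:                      # entry is v itself
        return v
    return locate(v.right, i - left_size - 1)   # re-index into the right subtree

# seq(root) = a, b, c, d, e
root = Node("d", Node("b", Node("a"), Node("c")), Node("e"))
assert [locate(root, i).label for i in range(1, 6)] == list("abcde")
```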

Observation 9 directly gives an easy algorithm that descends from the root of T to any desired position in 𝗌𝖾𝗊(T) in time linear in the height of T.

As a final piece of information, we also add, for each vertex v of T, the height of v, denoted by 𝗁𝖾𝗂𝗀𝗁𝗍(v). This information is not needed in this section, but it will be useful in Section 6 for maintaining balanced trees, so we already introduce it here. Since the height of T will determine the runtime of our algorithm, we aim to bound it as for binary search trees. We say that a tree T satisfies the AVL condition if for every vertex v with children v1, v2 we have |𝗁𝖾𝗂𝗀𝗁𝗍(v1) − 𝗁𝖾𝗂𝗀𝗁𝗍(v2)| ≤ 1, and if v has only one child v1, then v1 is a leaf. The AVL condition allows us to bound the height of trees by the following classical result.

Theorem 10 ([1]).

Every tree of size n that satisfies the AVL condition has height O(log(n)).

We now have everything we need to define our data structure: we call a tree T in which all vertices are labeled with elements from 𝔾, which has correct attributes 𝗌𝗂𝗓𝖾, 𝗁𝖾𝗂𝗀𝗁𝗍, and 𝗉𝗋𝗈𝖽 as above, and which satisfies the AVL condition, an AVL-product representation of 𝗌𝖾𝗊(T).

Figure 2: AVL-product representation for the string abbbab and automaton 𝒜0 from Figure 1(a).
Example 11.

We consider again the automaton 𝒜0 from Figure 1(a) and the string abbbab (this is a different string than the one considered in previous examples, as a longer string better illustrates the data structure). Figure 2 shows an AVL-product representation of MaMbMbMbMaMb = M^μ_{a,1} M^μ_{b,2} M^μ_{b,3} M^μ_{b,4} M^μ_{a,5} M^μ_{b,6}. For convenience, although this is not mandatory, we assume that each vertex v of the tree is also identified by a unique number 𝗂𝖽(v). In the example, we can verify that the 𝗉𝗋𝗈𝖽 values are computed from the labels. For example, if we denote by vi the vertex with 𝗂𝖽(vi) = i, we have 𝗉𝗋𝗈𝖽(v2) = 𝗉𝗋𝗈𝖽(v3) · M^μ_{b,2} · 𝗉𝗋𝗈𝖽(v4), which recursively equals MaMbMb. Similarly, 𝗉𝗋𝗈𝖽(v1) = 𝗉𝗋𝗈𝖽(v2) · M^μ_{b,4} · 𝗉𝗋𝗈𝖽(v6), that is, MaMbMbMbMaMb, which is the product we want to maintain.

We now show that AVL-product representations are useful in our setting. We start by observing that 𝗈𝗎𝗍-queries are trivial since we store the necessary information 𝗉𝗋𝗈𝖽(T) explicitly in the root.

Observation 12.

Given an AVL-product representation T, one can return 𝗉𝗋𝗈𝖽(T) in constant time. So in particular, if 𝗌𝖾𝗊(T) is a sequence of matrices, AVL-product representations implement the method 𝗈𝗎𝗍 in constant time.

We next show that there is an efficient initialization algorithm.

Lemma 13.

There is an algorithm that, given a sequence g over 𝔾 of length n, computes in time O(n) and with O(n) semi-group products an AVL-product representation of g.
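One possible way to realize Lemma 13 (the construction here is our sketch, not necessarily the paper's): recursing on the middle element yields a tree satisfying the AVL condition, and each node is created with at most two semi-group products, hence O(n) products overall. We use string concatenation as a stand-in non-commutative semi-group.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Node:
    label: Any
    left: "Optional[Node]"
    right: "Optional[Node]"
    size: int
    height: int
    prod: Any

def build(seq, op):
    """Balanced product tree over seq: one node per element and at most two
    semi-group products per node, hence O(len(seq)) products in total."""
    def rec(lo, hi):                      # subtree for seq[lo:hi]
        if lo >= hi:
            return None
        mid = (lo + hi) // 2
        left, right = rec(lo, mid), rec(mid + 1, hi)
        p = seq[mid]
        if left:                          # keep the left-to-right order: the
            p = op(left.prod, p)          # semi-group may be non-commutative
        if right:
            p = op(p, right.prod)
        h = 1 + max(left.height if left else 0, right.height if right else 0)
        return Node(seq[mid], left, right, hi - lo, h, p)
    return rec(0, len(seq))

t = build(list("abbbab"), lambda x, y: x + y)   # concatenation as semi-group
assert t.prod == "abbbab" and t.size == 6
```

Splitting at the middle keeps the subtree sizes within one of each other, so the heights of siblings differ by at most one and the AVL condition holds.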

Specializing to matrices, we get the following for the 𝗂𝗇𝗂𝗍-operation of Algorithm 1.

Corollary 14.

We can perform 𝗂𝗇𝗂𝗍 for AVL-product representations in time O(mωn).

Finally, we show that AVL-product representations support the substitution updates that we need for direct access, using the 𝗌𝖾𝗍 operation in a persistent manner.

Lemma 15.

There is an algorithm that, given an AVL-product representation T, a position i, and a semi-group element g, outputs a new AVL-product representation of the sequence obtained from 𝗌𝖾𝗊(T) by substituting the element at position i by g. The runtime is O(log(|𝗌𝖾𝗊(T)|)) plus the time to compute O(log(|𝗌𝖾𝗊(T)|)) products in the semi-group.
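Lemma 15 can be sketched as follows (representation and names ours): descend by 𝗌𝗂𝗓𝖾 to position i, replace the label, and recompute 𝗉𝗋𝗈𝖽 along the root-to-leaf path with at most two products per node. Copying the nodes on the path makes the operation persistent, so older versions of the structure remain valid.

```python
def reprod(label, l, r, op):
    """Recompute prod from a label and two subtrees, preserving in-order."""
    p = label
    if l:
        p = op(l["prod"], p)
    if r:
        p = op(p, r["prod"])
    return p

def build(seq, op):
    """Balanced tree with size and prod attributes (as in Lemma 13)."""
    if not seq:
        return None
    m = len(seq) // 2
    l, r = build(seq[:m], op), build(seq[m + 1:], op)
    return {"label": seq[m], "left": l, "right": r,
            "size": len(seq), "prod": reprod(seq[m], l, r, op)}

def substitute(v, i, g, op):
    """Persistently replace position i (1-indexed) of seq(v) by g: only the
    O(height) nodes on the path to i are copied and get prod recomputed."""
    ls = v["left"]["size"] if v["left"] else 0
    w = dict(v)                       # copy the node so old versions survive
    if i <= ls:
        w["left"] = substitute(v["left"], i, g, op)
    elif i == ls + 1:
        w["label"] = g
    else:
        w["right"] = substitute(v["right"], i - ls - 1, g, op)
    w["prod"] = reprod(w["label"], w["left"], w["right"], op)
    return w

cat = lambda x, y: x + y
t = build(list("abbbab"), cat)
t2 = substitute(t, 5, "c", cat)
assert t2["prod"] == "abbbcb" and t["prod"] == "abbbab"   # old version intact
```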

Corollary 16.

The 𝗌𝖾𝗍-method from Algorithm 1 can be implemented in time O(mωlog(n)) with AVL-product representations.

Example 17.

Going back to the example from Figure 2, assume now that one wants to update the data structure to keep only the runs where x1 is set to a position less than or equal to 2. To this end, one needs to compute M^τ_{a,1} M^τ_{b,2} M^τ_{b,3} M^τ_{b,4} M^τ_{a,5} M^τ_{b,6} = M^μ_{a,1} M^τ_{b,2} M^μ_{b,3} M^μ_{b,4} M^μ_{a,5} M^μ_{b,6} where τ = (x1 ↦ 2). Hence, we only have to change one matrix in the original product.

Going down the tree using 𝗌𝗂𝗓𝖾(·), we can identify v2 as the vertex where the second matrix is introduced into the product. Hence, we only have to update 𝗉𝗋𝗈𝖽(v2) and 𝗉𝗋𝗈𝖽(v1). Observe that the new value M′2 of 𝗉𝗋𝗈𝖽(v2) is 𝗉𝗋𝗈𝖽(v3) · M^τ_{b,2} · 𝗉𝗋𝗈𝖽(v4), which can be obtained with two matrix products, 𝗉𝗋𝗈𝖽(v3) and 𝗉𝗋𝗈𝖽(v4) being already computed. Similarly, the new value of 𝗉𝗋𝗈𝖽(v1) is M′2 · M^μ_{b,4} · 𝗉𝗋𝗈𝖽(v6), which can also be computed with two extra matrix products, provided that we compute M′2 first. In general, this update scheme can be performed with at most 2·𝗁𝖾𝗂𝗀𝗁𝗍(T) matrix products.

6 Dynamic direct access under complex editing operations

In the previous sections, we established that, given a string s ∈ Σ* of length n and an unambiguous functional automaton 𝒜, we can answer direct access queries on the relation 𝒜(s) in time O(log(n)²) after O(n) preprocessing. Now assume that the string s is modified. To perform direct access on the relation induced by 𝒜 on the updated string, so far one would have to run the initialization of Lemma 13 again, at a cost of O(n) for each editing step. In this section, we show that the AVL-product representations from Section 5 can be updated efficiently, so that we can perform direct access on the edited string without redoing the preprocessing from scratch.

In the following, we formalize the scenario of complex edits over strings. Then we define the problem of dynamic MSO direct access. We finish by showing how to easily extend our data structure and approach to give a solution for this scenario.

Editing programs over strings.

Fix an alphabet Σ. Let 𝖲𝖣𝖡 = {s1, …, sN} be a set of strings over Σ, called a strings database. For the sake of simplicity, we use si ∈ 𝖲𝖣𝖡 both as a string and as a string name (i.e., as a label referring to si that is considered to be of constant size). Let 𝒮 be a set of string variables S1, S2, …, disjoint from 𝖲𝖣𝖡 and Σ. A literal l is either a string name si ∈ 𝖲𝖣𝖡, a string variable S ∈ 𝒮, or a symbol a ∈ Σ. An editing rule is a command of one of the following four types:

S := 𝖼𝗈𝗇𝖼𝖺𝗍(l, l′)    (S, S′) := 𝗌𝗉𝗅𝗂𝗍(l, i)    (S, S′) := 𝖼𝗎𝗍(l, i, j)    S := 𝗉𝖺𝗌𝗍𝖾(l, l′, i)

where S and S′ are string variables, l and l′ are literals, and i, j ∈ ℕ. Intuitively, string variables will be assigned strings; then 𝖼𝗈𝗇𝖼𝖺𝗍 concatenates two strings, 𝗌𝗉𝗅𝗂𝗍 splits a string at position i, 𝖼𝗎𝗍 extracts the substring between positions i and j, and 𝗉𝖺𝗌𝗍𝖾 inserts a string into another at position i. An editing program Π is a pair (S, P) where S ∈ 𝒮 is the output string variable and P is a sequence of editing rules R1; R2; …; Rn such that each string name appears at most once in the right-hand side of some rule, and each string variable appears at most once in the right-hand side of a rule Ri, and only after it has appeared in the left-hand side of a rule Rj with j < i. In other words, each string in the database can be used only once, and each variable can be used at most once after it is defined. We define the size of an editing program Π as its number of rules, |Π| = n.

Example 18.

Suppose that we have a strings database 𝖲𝖣𝖡0={s1,s2}. We define the editing program Π0=(S6,P0) with P0 as the sequence:

(S1,S2):=𝗌𝗉𝗅𝗂𝗍(s1,4);(S3,S4):=𝗌𝗉𝗅𝗂𝗍(S2,1);S5:=𝖼𝗈𝗇𝖼𝖺𝗍(S1,a);S6:=𝖼𝗈𝗇𝖼𝖺𝗍(S5,S4).

Next, we define the semantics of editing programs. A string assignment is a partial function σ : (𝖲𝖣𝖡 ∪ Σ ∪ 𝒮) → Σ* such that σ(a) = a for every a ∈ Σ and σ(s) = s for every s ∈ 𝖲𝖣𝖡. We define the trivial assignment σ𝖲𝖣𝖡 by 𝖽𝗈𝗆(σ𝖲𝖣𝖡) = 𝖲𝖣𝖡 ∪ Σ (i.e., no variable is mapped). Given an assignment σ, a string variable S, and a string s ∈ Σ*, we write σ[S ↦ s] to denote the assignment σ′ that replaces S with s in σ (i.e., σ′(S) = s and σ′(S′) = σ(S′) for every S′ ∈ 𝖽𝗈𝗆(σ) \ {S}). Then, we define the semantics of rules as a function ⟦·⟧ that maps assignments to assignments such that for every assignment σ:

⟦S := 𝖼𝗈𝗇𝖼𝖺𝗍(l, l′)⟧(σ) = σ[S ↦ σ(l)·σ(l′)]
⟦(S, S′) := 𝗌𝗉𝗅𝗂𝗍(l, i)⟧(σ) = σ[S ↦ σ(l)[..i], S′ ↦ σ(l)[i+1..]]
⟦(S, S′) := 𝖼𝗎𝗍(l, i, j)⟧(σ) = σ[S ↦ σ(l)[i, j], S′ ↦ σ(l)[..i−1]·σ(l)[j+1..]]
⟦S := 𝗉𝖺𝗌𝗍𝖾(l, l′, i)⟧(σ) = σ[S ↦ σ(l)[..i]·σ(l′)·σ(l)[i+1..]]

where we use the syntax σ[S ↦ s, S′ ↦ s′] as a shorthand for (σ[S ↦ s])[S′ ↦ s′]. We extend the semantics to sequences of editing rules R1; …; Rn recursively as follows:

⟦R1; …; Rn−1; Rn⟧(σ) = ⟦Rn⟧(⟦R1; …; Rn−1⟧(σ)).

Finally, we define the semantics of an editing program Π = (S, P) as the string Π(𝖲𝖣𝖡) = [⟦P⟧(σ𝖲𝖣𝖡)](S), namely, the string assigned to S after evaluating P starting from the trivial assignment. The semantics is undefined if any of the preceding definitions is undefined, for example, if the indices i, j are out of range or the variable S is not defined. Note that one can easily check in time O(|Π|) whether the semantics is undefined. For this reason, we assume in the following that the semantics of an editing program is always defined.
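The semantics can be prototyped directly over plain Python strings. This toy interpreter is ours (including the tuple encoding of rules); it evaluates rules left to right starting from the trivial assignment, at linear rather than logarithmic cost per rule since it manipulates strings explicitly instead of AVL-product representations.

```python
def run(rules, sdb):
    """Evaluate a sequence of editing rules starting from the trivial
    assignment over the strings database sdb (positions are 1-indexed)."""
    sigma = dict(sdb)                          # trivial assignment on SDB

    def val(l):                                # resolve a literal
        return sigma[l] if l in sigma else l   # a single symbol denotes itself

    for rule in rules:
        op = rule[0]
        if op == "concat":                     # S := concat(l, l')
            _, S, l1, l2 = rule
            sigma[S] = val(l1) + val(l2)
        elif op == "split":                    # (S, S') := split(l, i)
            _, S1, S2, l, i = rule
            s = val(l)
            sigma[S1], sigma[S2] = s[:i], s[i:]
        elif op == "cut":                      # (S, S') := cut(l, i, j)
            _, S1, S2, l, i, j = rule
            s = val(l)
            sigma[S1], sigma[S2] = s[i - 1:j], s[:i - 1] + s[j:]
        elif op == "paste":                    # S := paste(l, l', i)
            _, S, l1, l2, i = rule
            s = val(l1)
            sigma[S] = s[:i] + val(l2) + s[i:]
    return sigma

# The program P0 of Example 18, encoded as tuples.
P0 = [("split", "S1", "S2", "s1", 4),
      ("split", "S3", "S4", "S2", 1),
      ("concat", "S5", "S1", "a"),
      ("concat", "S6", "S5", "S4")]
assert run(P0, {"s1": "bbbbcb"})["S6"] == "bbbbab"   # matches Example 19
```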

Example 19 (continued).

If we take s1 = bbbbcb, we get Π0(𝖲𝖣𝖡) = bbbbab. In particular, one can easily check that the program Π0 replaces the fifth letter of s1 by the letter a.

Our editing programs can easily simulate the insertion, deletion, or update of a symbol in a string. Further, they can simulate the complex document editing operations presented in [39], like concat, extract, delete, and insert (see [39] for a definition of these operations). So our set of operations can be seen as nearly equivalent to the one presented in [39]; the only operation that we disallow is copying: by definition, editing programs cannot make copies of the input strings or the string variables. An editing program that makes copies could produce strings of exponential size, whereas our assumptions regarding the RAM model require the input index i for direct access to be of polynomial size with respect to the input documents. For this reason, we disallow copying in our programs and leave this extension for future work.

Dynamic direct access of MSO.

Now that the notion of editing programs is clear, we can define the problem of dynamic direct access as an extension of the problem MSODirectAccess[] introduced in Section 3, adding a phase of edits over the strings database.

Problem: DynamicMSODirectAccess[]
Preprocessing:
    input: a vset automaton 𝒜 and a strings database 𝖲𝖣𝖡 = {s1, …, sN}
    result: a data structure D𝒜,𝖲𝖣𝖡
Editing:
    input: an editing program Π and D𝒜,𝖲𝖣𝖡
    result: a data structure DΠ,𝒜,𝖲𝖣𝖡
Access:
    input: an index i and DΠ,𝒜,𝖲𝖣𝖡
    output: the i-th output 𝒜(Π(𝖲𝖣𝖡))[i]

In contrast to standard direct access, we add an intermediate phase, called editing, which receives an editing program Π over 𝖲𝖣𝖡 as input. The direct access is then performed over the resulting string Π(𝖲𝖣𝖡). As before, we measure each phase separately and say that an algorithm for DynamicMSODirectAccess[] has f-preprocessing, h-editing, and g-access time for some functions f, h, and g if, and only if, the running times of the preprocessing, editing, and access phases are in O(f(𝒜, 𝖲𝖣𝖡)), O(h(𝒜, 𝖲𝖣𝖡, Π)), and O(g(𝒜, 𝖲𝖣𝖡, Π)), respectively.

By taking advantage of our techniques for MSO direct access, we can show that DynamicMSODirectAccess[] can be solved with logarithmic running time during the editing phase and without increasing the running time of the preprocessing and access phases.

Theorem 20.

There is an algorithm that solves DynamicMSODirectAccess[] for the class of unambiguous functional vset automata and any variable order, with preprocessing time O(|Q|^ω |X|² ∑_{i=1}^{N} |si|), editing time O(|Q|^ω |X|² |Π| max_{i=1}^{N} log(|si|)), and access time O(|Q|^ω |X|³ log²(|Π| · max_{i=1}^{N} |si|)). These bounds even remain true when we assume that the order is only given as an additional input in the access phase.

The proof goes by showing how to implement each editing rule on AVL-product representations. Precisely, the preprocessing constructs an AVL-product representation for each string si in 𝖲𝖣𝖡 (as in Section 5). Then, during the editing phase, we produce new AVL-product representations from the initial ones for each editing operation. Finally, the resulting representation of the string Π(𝖲𝖣𝖡) is used for the direct access phase. This gives the desired running times for the preprocessing and direct access phases. It therefore only remains to show how to implement each editing rule in time O(|Q|^ω |X|² max_{i=1}^{N} log(|si|)) in order to establish the running time of the editing phase.

Implementing the editing rules.

It is easy to see that (S, S′) := 𝖼𝗎𝗍(l, i, j) can be implemented by two splits (at positions i−1 and j, matching the inclusive semantics of 𝖼𝗎𝗍) and one concatenation, more formally as the sequence of editing rules (S1, S2) := 𝗌𝗉𝗅𝗂𝗍(l, j); (S3, S) := 𝗌𝗉𝗅𝗂𝗍(S1, i−1); S′ := 𝖼𝗈𝗇𝖼𝖺𝗍(S3, S2). Similarly, S := 𝗉𝖺𝗌𝗍𝖾(l, l′, i) can be obtained by one split and two concatenations, more formally as (S1, S2) := 𝗌𝗉𝗅𝗂𝗍(l, i); S3 := 𝖼𝗈𝗇𝖼𝖺𝗍(S1, l′); S := 𝖼𝗈𝗇𝖼𝖺𝗍(S3, S2). Finally, a literal consisting of a single symbol can be handled by a newly initialized data structure of size one. For this reason, we dedicate the rest of this subsection to showing how to implement 𝖼𝗈𝗇𝖼𝖺𝗍 and 𝗌𝗉𝗅𝗂𝗍 efficiently.
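The two reductions can be sanity-checked on plain strings (helper names are ours; positions are 1-indexed, with the split at i−1 matching the inclusive semantics of 𝖼𝗎𝗍):

```python
def cut_via_split(s, i, j):
    """(S, S') := cut(l, i, j) expressed as two splits and one concat."""
    s1, s2 = s[:j], s[j:]             # (S1, S2) := split(l, j)
    s3, S = s1[:i - 1], s1[i - 1:]    # (S3, S)  := split(S1, i - 1)
    return S, s3 + s2                 # S' := concat(S3, S2)

def paste_via_split(s, t, i):
    """S := paste(l, l', i) expressed as one split and two concats."""
    s1, s2 = s[:i], s[i:]             # (S1, S2) := split(l, i)
    return s1 + t + s2                # S3 := concat(S1, l'); S := concat(S3, S2)

assert cut_via_split("abcdef", 2, 4) == ("bcd", "aef")
assert paste_via_split("abef", "cd", 2) == "abcdef"
```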

Assume that the inputs of 𝖼𝗈𝗇𝖼𝖺𝗍 and 𝗌𝗉𝗅𝗂𝗍 are given as AVL-product representations. We will heavily rely on the corresponding operations on AVL trees as presented in [13]: it is shown there that many operations on ordered sets can be implemented with AVL trees using only an operation called 𝗃𝗈𝗂𝗇 that does the following: given two AVL trees T1, T2 and a key k such that the maximal key in T1 is smaller than k and the minimal key in T2 is bigger than k, 𝗃𝗈𝗂𝗇 returns an AVL tree T containing all keys in T1, T2, and k. Besides the usual tree navigation, the only basic operation that 𝗃𝗈𝗂𝗇 uses is 𝗇𝗈𝖽𝖾, which takes a key k and two trees T1, T2 and returns a tree that has k in its root and T1 and T2 as left and right subtrees, respectively. (We remark that the usual rotations in search trees can be simulated by a constant number of 𝗇𝗈𝖽𝖾-operations.) We directly get the following result.

Lemma 21.

There is an algorithm 𝗃𝗈𝗂𝗇 that, given two AVL-product representations T1, T2 and a semi-group element g, computes, in time O(log(𝗌𝗂𝗓𝖾(T1) + 𝗌𝗂𝗓𝖾(T2))) and with the same number of semi-group operations, an AVL-product representation T of the sequence obtained by concatenating 𝗌𝖾𝗊(T1), g, and 𝗌𝖾𝗊(T2).
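The key point behind Lemma 21 is that the 𝗇𝗈𝖽𝖾 operation can recompute 𝗌𝗂𝗓𝖾, 𝗁𝖾𝗂𝗀𝗁𝗍, and 𝗉𝗋𝗈𝖽 from its two subtrees with O(1) semi-group products, so 𝗃𝗈𝗂𝗇 inherits the logarithmic bound of [13]. Below is a sketch of 𝗇𝗈𝖽𝖾 alone, with the representation ours and rebalancing omitted for brevity; a full 𝗃𝗈𝗂𝗇 would call it O(log) many times.

```python
cat = lambda x, y: x + y          # stand-in semi-group: string concatenation

def node(g, t1, t2, op):
    """The basic node operation: root g with subtrees t1, t2, recomputing
    size, height, and prod in O(1) semi-group products (no rebalancing)."""
    def attr(t, k):
        return t[k] if t else 0
    p = g
    if t1:
        p = op(t1["prod"], p)     # seq(t1) comes before g ...
    if t2:
        p = op(p, t2["prod"])     # ... and seq(t2) comes after it
    return {"label": g, "left": t1, "right": t2,
            "size": attr(t1, "size") + attr(t2, "size") + 1,
            "height": 1 + max(attr(t1, "height"), attr(t2, "height")),
            "prod": p}

t1 = node("b", node("a", None, None, cat), None, cat)   # seq(t1) = a, b
t2 = node("d", None, None, cat)                          # seq(t2) = d
t = node("c", t1, t2, cat)                               # join(t1, "c", t2)
assert t["prod"] == "abcd" and t["size"] == 4
```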

Using 𝗃𝗈𝗂𝗇, one can implement an operation called 𝗌𝗉𝗅𝗂𝗍 that takes as input an AVL tree T and a key k and returns two AVL trees T1 and T2 such that all keys in T smaller than k are in T1 while all keys bigger than k are in T2. Besides 𝗃𝗈𝗂𝗇, the only operation that 𝗌𝗉𝗅𝗂𝗍 performs is a descent in the tree to find the node with key k. This directly gives the following algorithm for operations on 𝗌𝖾𝗊(T).

Lemma 22.

There is an algorithm 𝗌𝗉𝗅𝗂𝗍 that, given an AVL-product representation T and a position i, computes in time O(log(𝗌𝗂𝗓𝖾(T))) and with the same number of semi-group operations two AVL-product representations T1, T2 such that 𝗌𝖾𝗊(T1) is the prefix of 𝗌𝖾𝗊(T) up to but excluding the i-th entry and 𝗌𝖾𝗊(T2) is the suffix of 𝗌𝖾𝗊(T) that starts after the i-th position.

Applying both lemmas for matrices, we directly get bounds for 𝖼𝗈𝗇𝖼𝖺𝗍 and 𝗌𝗉𝗅𝗂𝗍.

Corollary 23.

There are algorithms for both S:=𝖼𝗈𝗇𝖼𝖺𝗍(l,l) and (S,S):=𝗌𝗉𝗅𝗂𝗍(l,i) on AVL-product representations that, given inputs T1,T2, resp., T and i, run in time O(mωlog(|T1|+|T2|)), resp., O(mωlog(|T|)).

7 Future work

We have given a direct access algorithm for MSO queries on strings that allows for powerful updates and supports all lexicographic orders without having to prepare them specifically in the preprocessing. Our result is certainly only a first step in understanding direct access for MSO queries, and it motivates new research questions. First, it would be interesting to see to which extent our results can be adapted to MSO on trees. Second, one could study how to adjust our approach to support the copying of strings as in [39]. Third, another question is whether one could reduce the access time of our algorithm. Finally, it would be interesting to understand whether direct access on strings (or trees) is also possible for more expressive queries, say the grammar-based extraction formalisms of [37, 6].

References

  • [1] Georgii M Adel’son-Vel’skii and Evgenii Landis. An algorithm for the organization of information. Soviet Math., 3:1259–1263, 1962.
  • [2] Carme Àlvarez and Birgit Jenner. A very hard log-space counting class. Theor. Comput. Sci., 107(1):3–30, 1993. doi:10.1016/0304-3975(93)90252-O.
  • [3] Antoine Amarilli, Pierre Bourhis, and Stefan Mengel. Enumeration on trees under relabelings. In ICDT, volume 98, pages 5:1–5:18, 2018. doi:10.4230/LIPIcs.ICDT.2018.5.
  • [4] Antoine Amarilli, Pierre Bourhis, Stefan Mengel, and Matthias Niewerth. Constant-delay enumeration for nondeterministic document spanners. In ICDT, pages 22:1–22:19, 2019. doi:10.4230/LIPIcs.ICDT.2019.22.
  • [5] Antoine Amarilli, Pierre Bourhis, Stefan Mengel, and Matthias Niewerth. Enumeration on trees with tractable combined complexity and efficient updates. In PODS, pages 89–103, 2019. doi:10.1145/3294052.3319702.
  • [6] Antoine Amarilli, Louis Jachiet, Martin Muñoz, and Cristian Riveros. Efficient enumeration for annotated grammars. In PODS, pages 291–300, 2022. doi:10.1145/3517804.3526232.
  • [7] Antoine Amarilli, Louis Jachiet, and Charles Paperman. Dynamic membership for regular languages. In ICALP, volume 198 of LIPIcs, pages 116:1–116:17, 2021. doi:10.4230/LIPIcs.ICALP.2021.116.
  • [8] Guillaume Bagan, Arnaud Durand, Etienne Grandjean, and Frédéric Olive. Computing the jth solution of a first-order query. RAIRO Theor. Informatics Appl., 42(1):147–164, 2008. doi:10.1051/ita:2007046.
  • [9] Nurzhan Bakibayev, Tomás Kociský, Dan Olteanu, and Jakub Zavodny. Aggregation and ordering in factorised databases. Proc. VLDB Endow., 6(14):1990–2001, 2013. doi:10.14778/2556549.2556579.
  • [10] Andrey Balmin, Yannis Papakonstantinou, and Victor Vianu. Incremental validation of XML documents. ACM Trans. Database Syst., 29(4):710–751, 2004. doi:10.1145/1042046.1042050.
  • [11] Christoph Berkholz, Jens Keppeler, and Nicole Schweikardt. Answering conjunctive queries under updates. In PODS, pages 303–318, 2017. doi:10.1145/3034786.3034789.
  • [12] Christoph Berkholz, Jens Keppeler, and Nicole Schweikardt. Answering FO+MOD queries under updates on bounded degree databases. ACM Trans. Database Syst., 43(2):7:1–7:32, 2018. doi:10.1145/3232056.
  • [13] Guy E. Blelloch, Daniel Ferizovic, and Yihan Sun. Just join for parallel ordered sets. In SPAA, pages 253–264, 2016. doi:10.1145/2935764.2935768.
  • [14] Johann Brault-Baron. De la pertinence de l’énumération : complexité en logiques propositionnelle et du premier ordre. (The relevance of the list: propositional logic and complexity of the first order). PhD thesis, University of Caen Normandy, France, 2013. URL: https://tel.archives-ouvertes.fr/tel-01081392.
  • [15] Karl Bringmann, Nofar Carmeli, and Stefan Mengel. Tight fine-grained bounds for direct access on join queries. In PODS, pages 427–436, 2022. doi:10.1145/3517804.3526234.
  • [16] J Richard Büchi. Weak second-order arithmetic and finite automata. Mathematical Logic Quarterly, 6(1-6), 1960. doi:10.1002/malq.19600060105.
  • [17] Florent Capelli and Oliver Irwin. Direct access for conjunctive queries with negations. In ICDT, volume 290, pages 13:1–13:20, 2024. doi:10.4230/LIPIcs.ICDT.2024.13.
  • [18] Nofar Carmeli, Nikolaos Tziavelis, Wolfgang Gatterbauer, Benny Kimelfeld, and Mirek Riedewald. Tractable orders for direct access to ranked answers of conjunctive queries. ACM Trans. Database Syst., 48(1):1:1–1:45, 2023. doi:10.1145/3578517.
  • [19] Nofar Carmeli, Shai Zeevi, Christoph Berkholz, Alessio Conte, Benny Kimelfeld, and Nicole Schweikardt. Answering (unions of) conjunctive queries using random access and random-order enumeration. ACM Trans. Database Syst., 47(3):9:1–9:49, 2022. doi:10.1145/3531055.
  • [20] Shaleen Deep, Xiao Hu, and Paraschos Koutris. Ranked enumeration of join queries with projections. Proc. VLDB Endow., 15(5):1024–1037, 2022. doi:10.14778/3510397.3510401.
  • [21] Johannes Doleschal, Benny Kimelfeld, Wim Martens, and Liat Peterfreund. Weight annotation in information extraction. Log. Methods Comput. Sci., 18(1), 2022. doi:10.46298/lmcs-18(1:21)2022.
  • [22] James R Driscoll, Neil Sarnak, Daniel Dominic Sleator, and Robert Endre Tarjan. Making data structures persistent. In STOC, pages 109–121, 1986. doi:10.1145/12130.12142.
  • [23] Idan Eldar, Nofar Carmeli, and Benny Kimelfeld. Direct access for answers to conjunctive queries with aggregation. In ICDT, volume 290, pages 4:1–4:20, 2024. doi:10.4230/LIPIcs.ICDT.2024.4.
  • [24] Ronald Fagin, Benny Kimelfeld, Frederick Reiss, and Stijn Vansummeren. Document spanners: A formal approach to information extraction. Journal of the ACM (JACM), 62(2):1–51, 2015. doi:10.1145/2699442.
  • [25] Fernando Florenzano, Cristian Riveros, Martín Ugarte, Stijn Vansummeren, and Domagoj Vrgoč. Efficient enumeration algorithms for regular document spanners. ACM Transactions on Database Systems (TODS), 45(1):1–42, 2020. doi:10.1145/3351451.
  • [26] Gudmund Skovbjerg Frandsen, Peter Bro Miltersen, and Sven Skyum. Dynamic word problems. Journal of the ACM (JACM), 44(2):257–271, 1997. doi:10.1145/256303.256309.
  • [27] Etienne Grandjean and Louis Jachiet. Which arithmetic operations can be performed in constant time in the RAM model with addition? CoRR, abs/2206.13851, 2022. doi:10.48550/arXiv.2206.13851.
  • [28] Muhammad Idris, Martín Ugarte, and Stijn Vansummeren. The dynamic yannakakis algorithm: Compact and efficient query processing under updates. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 1259–1274, 2017. doi:10.1145/3035918.3064027.
  • [29] Sarah Kleest-Meißner, Jonas Marasus, and Matthias Niewerth. MSO queries on trees: Enumerating answers under updates using forest algebras. CoRR, abs/2208.04180, 2022. doi:10.48550/arXiv.2208.04180.
  • [30] Donald Ervin Knuth. The art of computer programming, Volume III, 2nd Edition. Addison-Wesley, 1998. URL: https://www.worldcat.org/oclc/312994415.
  • [31] Leonid Libkin. Elements of finite model theory, volume 41. Springer, 2004. doi:10.1007/978-3-662-07003-1.
  • [32] Katja Losemann and Wim Martens. MSO queries on trees: enumerating answers under updates. In CSL-LICS, pages 67:1–67:10, 2014. doi:10.1145/2603088.2603137.
  • [33] Francisco Maturana, Cristian Riveros, and Domagoj Vrgoc. Document spanners for extracting incomplete information: Expressiveness and complexity. In PODS, pages 125–136, 2018. doi:10.1145/3196959.3196968.
  • [34] Martin Muñoz and Cristian Riveros. Streaming enumeration on nested documents. In ICDT, volume 220 of LIPIcs, pages 19:1–19:18, 2022. doi:10.4230/LIPIcs.ICDT.2022.19.
  • [35] Matthias Niewerth. MSO queries on trees: Enumerating answers under updates using forest algebras. In LICS, pages 769–778, 2018. doi:10.1145/3209108.3209144.
  • [36] Matthias Niewerth and Luc Segoufin. Enumeration of MSO queries on strings with constant delay and logarithmic updates. In Jan Van den Bussche and Marcelo Arenas, editors, PODS, pages 179–191, 2018. doi:10.1145/3196959.3196961.
  • [37] Liat Peterfreund. Grammars for document spanners. In Ke Yi and Zhewei Wei, editors, ICDT, volume 186, pages 7:1–7:18, 2021. doi:10.4230/LIPIcs.ICDT.2021.7.
  • [38] Markus L. Schmid and Nicole Schweikardt. Spanner evaluation over slp-compressed documents. In PODS, pages 153–165, 2021. doi:10.1145/3452021.3458325.
  • [39] Markus L. Schmid and Nicole Schweikardt. Query evaluation over slp-represented document databases with complex document editing. In PODS, pages 79–89, 2022. doi:10.1145/3517804.3524158.
  • [40] Nikolaos Tziavelis, Wolfgang Gatterbauer, and Mirek Riedewald. Optimal join algorithms meet top-k. In SIGMOD, pages 2659–2665, 2020. doi:10.1145/3318464.3383132.
  • [41] Nikolaos Tziavelis, Wolfgang Gatterbauer, and Mirek Riedewald. Any-k algorithms for enumerating ranked answers to conjunctive queries. CoRR, abs/2205.05649, 2022. doi:10.48550/arXiv.2205.05649.