Dynamic Membership for Regular Tree Languages
Abstract
We study the dynamic membership problem for regular tree languages under relabeling updates: we fix an alphabet $\Sigma$ and a regular tree language $L$ over $\Sigma$ (expressed, e.g., as a tree automaton), we are given a tree $T$ with labels in $\Sigma$, and we must maintain the information of whether $T$ belongs to $L$ while handling relabeling updates that change the labels of individual nodes in $T$.
Our first contribution is to show that this problem admits an algorithm in $O(\log n / \log \log n)$ per update for any fixed regular tree language, improving over known $O(\log n)$ algorithms. This generalizes the known $O(\log n / \log \log n)$ upper bound over words, and it matches the lower bound of $\Omega(\log n / \log \log n)$ from dynamic membership to some word languages and from the existential marked ancestor problem.
Our second contribution is to introduce a class of regular languages, dubbed almost-commutative tree languages, and show that dynamic membership to such languages under relabeling updates can be decided in constant time per update. Almost-commutative languages generalize both commutative languages and finite languages: they are the analogue for trees of the ZG languages enjoying constant-time dynamic membership over words. Our main technical contribution is to show that this class is conditionally optimal when we assume that the alphabet features a neutral letter, i.e., a letter that has no effect on membership to the language. More precisely, we show that any regular tree language with a neutral letter which is not almost-commutative cannot be maintained in constant time, under the assumption that the prefix-$U_1$ problem from [4] also does not admit a constant-time algorithm.
Keywords and phrases:
automaton, dynamic membership, incremental maintenance, forest algebra
Funding:
Antoine Amarilli: Partially supported by the ANR project EQUUS ANR-19-CE48-0019, by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – 431183758, and by the ANR project ANR-18-CE23-0003-02 (“CQFD”).
Copyright and License:
2012 ACM Subject Classification:
Theory of computation → Regular languages
Editors:
Paweł Gawrychowski, Filip Mazowiecki, and Michał Skrzypczak
Series and Publisher:
Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
Regular tree languages, and the corresponding tree automata [11], are a way to enforce structural constraints on trees while generalizing regular languages over words. For any fixed tree language $L$ represented as a tree automaton $A$, given an input tree $T$ with $n$ nodes, we can verify in $O(n)$ whether $T$ belongs to $L$, simply by running $A$ over $T$. However, in some settings, the tree $T$ may be dynamically updated, and we want to efficiently maintain the information of whether $T$ belongs to the language. This problem has been studied in the setting of XML documents and schemas under the name of incremental schema validation [6]. In this paper we study the theory of this problem in the RAM model with unit cost and logarithmic word size. We call the problem dynamic membership in line with earlier work on regular word languages [4]. We focus on the computational complexity of maintaining membership, as opposed, e.g., to dynamic complexity [15, 25, 22], which studies the expressive power required (in the spirit of descriptive complexity). We further focus on the measure of data complexity [26], i.e., we assume that the regular language is fixed, and measure complexity only as a function of the input tree. We study efficient algorithms for the dynamic membership problem, and (when possible) matching lower bounds.
The naive algorithm for dynamic membership is to re-run $A$ over $T$ after each change, which takes $O(n)$. However, more efficient algorithms are already known. Balmin et al. [6] show that we can maintain membership to any fixed regular tree language in time $O(\log^2 n)$ under leaf insertions, node deletions, and substitutions (also known as node relabelings). We focus in this work on the specific case of relabeling updates, and it is then known that the complexity can be lowered to $O(\log n)$: see, e.g., [3], which relies on tree decomposition balancing techniques [9]. This is slightly less favorable than the $O(\log n / \log \log n)$ complexity known for dynamic membership for regular word languages [12, 4].
In terms of lower bounds, the dynamic membership problem for trees under relabeling updates is known to require time $\Omega(\log n / \log \log n)$, already for some very simple fixed languages. These lower bounds can come from two different routes. One first route is the existential marked ancestor problem, where we must maintain a tree under relabeling updates that mark and unmark nodes, and efficiently answer queries asking whether a given node has a marked ancestor. This problem is known to admit an unconditional $\Omega(\log n / \log \log n)$ lower bound in the cell probe model [1], and it turns out that both the updates and the queries can be phrased as relabeling updates for a fixed regular tree language. One second route is via the dynamic membership problem in the more restricted setting of word languages, where there are known lower bounds for some languages [23, 4].
These known results leave a small complexity gap between $O(\log n)$ and $\Omega(\log n / \log \log n)$. Our first contribution in this work is to close this gap, by presenting an algorithm for the dynamic membership problem that achieves complexity $O(\log n / \log \log n)$ per update, for any fixed regular tree language, and after linear-time preprocessing of the input tree. This matches the known lower bounds, and generalizes the known algorithms for dynamic membership to word languages while achieving the same complexity. Our algorithm iteratively processes the tree to merge sets of nodes into clusters and recursively process the tree of clusters; it is closely related to the tree contraction technique of Gazit et al. [14] and is reminiscent of other algorithms such as the tree contraction technique of Miller and Reif [20], see also [9]. More precisely, our algorithm regroups tree nodes into clusters of logarithmic size, ensuring that clusters fit into a memory word. This ensures that the effects of updates on clusters can be handled in constant time by the standard technique of tabulating during the preprocessing. Note that clusters may contain holes, i.e., they amount to contexts, which we handle using the notion of forest algebras. Then, the algorithm considers the induced tree structure over clusters, and processes it recursively, ensuring that the tree size is divided by $\Theta(\log n)$ at each step. The main challenge in the algorithm is to ensure that a suitable clustering of the trees at each level can be computed in linear time.
Having classified the overall complexity of dynamic membership, the next question is to refine the study and show language-dependent bounds. Indeed, there are restricted families of regular tree languages over which we can beat $O(\log n / \log \log n)$. For instance, it is easy to achieve $O(1)$ time per update when the tree language is finite, or when it is commutative (i.e., membership only depends on the number of label occurrences and not the tree structure). What is more, in the setting of word languages, the complexity of dynamic membership is in $\Theta(\log n / \log \log n)$ only for a restricted subset of the word languages, namely, those outside the class QSG defined in [4] (see also [12]). Some word languages admit constant-time dynamic membership algorithms: they were classified in [4] conditional to the hypothesis that there is no data structure achieving constant time per operation for a certain problem, dubbed prefix-$U_1$ and essentially amounting to maintaining a weak form of priority queue.
Our second contribution is to introduce a class of regular tree languages for which dynamic membership under relabeling updates can be achieved in constant time per update, generalizing commutative tree languages and finite tree languages. Specifically, we define the class of almost-commutative tree languages: these are obtained as the Boolean closure of regular tree languages which impose a commutative condition on the so-called frequent letters (those with more occurrences than a constant threshold) and impose that the projection to the other letters (called rare) forms a specific constant-sized tree. We show that it is decidable whether a regular tree language is almost-commutative (when it is given, e.g., by a tree automaton). The motivation for almost-commutative languages is that we can show that dynamic membership to such languages can be maintained in $O(1)$ per update under relabeling updates, generalizing the algorithm of [4] for ZG monoids.
Our third contribution is to show that, conditionally, the almost-commutative tree languages in fact characterize those regular tree languages that enjoy constant-time dynamic membership under relabeling updates, when we assume that the language features a so-called neutral letter. Intuitively, a letter $e$ is neutral for a language $L$ if it is invisible in the sense that nodes labeled with $e$ can be replaced by the forest of their children without affecting membership to $L$. Neutral letters are often assumed in the setting of word languages [18, 8], and they make dynamic membership for word languages collapse to the same problem for their syntactic monoid (as was originally studied in [23], and in [4] as a first step).
When focusing on regular tree languages with a neutral letter, we show that for any such language which is not almost-commutative, dynamic membership cannot be maintained in constant time under relabeling updates, subject to the hypothesis from [4] that the prefix-$U_1$ problem does not admit a constant-time data structure. Thus, the $O(1)$-maintainable regular tree languages with a neutral letter can be characterized conditionally to the same hypothesis as in the case of words. We show this conditional lower bound via an (unconditional) algebraic characterization of the class of almost-commutative languages with neutral letters: they are precisely those languages with syntactic forest algebras whose vertical monoid satisfies the equation ZG ($x^{\omega+1} y = y x^{\omega+1}$, i.e., group elements are central), which was also the (conditional) characterization for $O(1)$-maintainable word languages with a neutral letter [4]. The main technical challenge to show this result is to establish that every tree language with a ZG vertical monoid must fall in our class of almost-commutative languages: this uses a characterization of ZG tree languages via an equivalence relation analogous to the case of ZG on words [5], but with more challenging proofs because of the tree structure.
Paper structure.
We give preliminaries in Section 2 and introduce the dynamic membership problem. We then give in Section 3 some further preliminaries about forest algebras, which are instrumental to the proof of our results. We then present our first contribution in Section 4, namely, an $O(\log n / \log \log n)$ algorithm for dynamic membership to any fixed regular tree language. We then introduce almost-commutative languages in Section 5 and show that dynamic membership to these languages is in $O(1)$. In Section 6 we show our matching lower bound: in the presence of a neutral letter, and assuming that prefix-$U_1$ cannot be solved in constant time, the dynamic membership problem cannot be solved in constant time for any non-almost-commutative regular language. We conclude in Section 7.
Because of space limitations, detailed proofs are deferred to the full version [2]. Note that a version of the results of Sections 5 and 6 was presented in [7] but never formally published.
2 Preliminaries
Forests, trees, contexts.
We consider ordered forests of rooted, unranked, ordered trees on a finite alphabet $\Sigma$, and simply call them forests. Formally, we define $\Sigma$-forests and $\Sigma$-trees by mutual induction: a $\Sigma$-forest is a (possibly empty) ordered sequence of $\Sigma$-trees, and a $\Sigma$-tree consists of a root node with a label in $\Sigma$ and with a $\Sigma$-forest of child nodes.
We assume familiarity with standard terminology on trees including parent node, child nodes, ancestors, descendants, root nodes, siblings, sibling sets, previous sibling, next sibling, prefix order, etc.; see [11] for details. As we work with ordered forests, we will always consider root nodes to be siblings, and define the previous sibling and next sibling relationships to also apply to roots. We will often abuse notation and identify forests with their sets of nodes, e.g., for a forest $F$, we define functions on $F$ to mean functions defined on the nodes of $F$. We say that two forests have the same shape if they are identical up to changing the labels of their nodes (which are elements of $\Sigma$). The size $|F|$ of a $\Sigma$-forest $F$ is its number of nodes, and we write $|F|_a$ for $a \in \Sigma$ to denote the number of occurrences of $a$ in $F$.
Forest languages.
A forest language over an alphabet $\Sigma$ is a subset of the $\Sigma$-forests, and a tree language is a subset of the $\Sigma$-trees: throughout the paper we study forest languages, but of course our results extend to tree languages as well. We write $\overline{L}$ for the complement of a language $L$. We will be specifically interested in regular forest languages: among many other equivalent characterizations [11], these can be formalized via finite automata, which we formally introduce below (first for words and then for forests).
A deterministic word automaton over a set $\Gamma$ is a tuple $A = (Q, F, q_0, \delta)$ where $Q$ is a finite set of states, $F \subseteq Q$ is a subset of final states, $q_0 \in Q$ is the initial state, and $\delta \colon Q \times \Gamma \to Q$ is a transition function. For a word $w$ over $\Gamma$, the state reached by $A$ when reading $w$ is defined inductively as usual: if $w$ is the empty word then the state reached is $q_0$; otherwise, writing $w = w' a$, the state reached is $\delta(q, a)$ with $q$ the state reached when reading $w'$. The word $w$ is accepted if the state reached by $A$ when reading $w$ is in $F$.
We define forest automata over forests, which recognize forest languages by running a word automaton over the sibling sets. These automata are analogous to the hedge automata of [11] (where horizontal transitions are specified by regular languages), and to the forest automata of [10] (where horizontal transitions are specified by a monoid). Formally, a forest automaton over $\Sigma$ is a tuple $(Q, A, \delta)$ where $Q$ is a finite set of states, $A$ is a word automaton over the set $Q$, and $\delta$ is a vertical transition function which associates to every state $q$ of $A$ and letter $a \in \Sigma$ a state $\delta(q, a) \in Q$. The run of the forest automaton on a $\Sigma$-tree $T$ is a function $\rho$ from the nodes of $T$ to $Q$ defined inductively as follows:
-
For a leaf $n$ of $T$ with label $a$, we have $\rho(n) = \delta(q_0, a)$ for $q_0$ the initial state of $A$;
-
For an internal node $n$ of $T$ with label $a$, letting $n_1, \ldots, n_k$ be its successive children, and letting $\rho(n_1), \ldots, \rho(n_k)$ be the inductively computed images of the run, for $q$ the state reached in $A$ when reading the word $\rho(n_1) \cdots \rho(n_k)$, we let $\rho(n) = \delta(q, a)$.
We say that a $\Sigma$-forest $F$ is accepted by the forest automaton if, letting $T_1, \ldots, T_m$ be the trees of $F$ in order, and letting $q_1, \ldots, q_m$ be the states to which the respective roots of the $T_i$ are mapped by the respective runs, then the word $q_1 \cdots q_m$ is accepted by the word automaton $A$. (In particular, the empty forest is accepted iff the initial state of $A$ is final.) The language of a forest automaton is the set of $\Sigma$-forests that it accepts.
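To make the definitions concrete, here is a small executable sketch of a forest automaton (this code and its names are ours, not the paper's). Note that the leaf case is subsumed by the internal-node case, since reading the empty word of children states yields the initial state of the word automaton.

```python
# Illustrative sketch: a forest automaton (Q, A, delta), where A is a
# deterministic word automaton over the state set Q (read over sibling
# sequences) and delta maps (word-automaton state, letter) to a state in Q.
from dataclasses import dataclass, field

@dataclass
class Tree:
    label: str
    children: list = field(default_factory=list)

class WordAutomaton:
    def __init__(self, init, final, trans):
        self.init, self.final, self.trans = init, final, trans
    def run(self, word):
        q = self.init
        for a in word:
            q = self.trans[(q, a)]
        return q

def evaluate(tree, word_aut, delta):
    """Return the state in Q assigned to the root of `tree` by the run."""
    # Read the children's states as a word with A, then apply the
    # vertical transition on the node's label (leaves read the empty word).
    q = word_aut.run([evaluate(c, word_aut, delta) for c in tree.children])
    return delta[(q, tree.label)]

def accepts_forest(forest, word_aut, delta):
    """Accept iff A accepts the word of root states of the forest's trees."""
    q = word_aut.run([evaluate(t, word_aut, delta) for t in forest])
    return q in word_aut.final
```

As a usage example, taking $Q = \{0, 1\}$ with XOR as the horizontal transition and flipping the parity on letter $a$ recognizes the forests with an even number of $a$-labeled nodes.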
Relabelings, incremental maintenance.
We consider relabeling updates on forests $F$, that change node labels without changing the tree shape. A relabeling update gives a node $n$ of $F$ (e.g., as a pointer to that node) and a label $a \in \Sigma$, and changes the label of $n$ to $a$.
We fix a regular language $L$ of $\Sigma$-forests given by a fixed forest automaton $A$, and we are given as input a $\Sigma$-forest $F$. We can compute in linear time whether $A$ accepts $F$. The dynamic membership problem for $L$ is the task of maintaining whether the current forest belongs to $L$, under relabeling updates. We study the complexity of this problem, defined as the worst-case running time of an algorithm after each update, expressed as a function of the size $n$ of $F$. Note that the language $L$ is assumed to be fixed, and is not accounted for in the complexity. We work in the RAM model with unit cost and logarithmic word size, and consider dynamic membership algorithms that run a linear-time preprocessing on the input forest to build data structures used when processing updates. (By contrast, our lower bound results will hold without the assumption that the preprocessing is in linear time.)
As we reviewed in the introduction, for any fixed regular forest language, it is folklore that the dynamic membership problem on a tree with $n$ nodes can be solved under relabeling updates in time $O(\log n)$ per update: see for instance [3]. Further, for some forest languages, unconditional lower bounds in $\Omega(\log n / \log \log n)$ are known. These lower bounds can come from two routes. This includes forest languages defined from intractable word languages, for instance the forest language $L_1$ on alphabet $\{a, b\}$ where the word formed by the leaves in prefix order is required to fall in the word language $L_1'$ of words featuring a unique occurrence of $b$ with an even number of $a$'s before it. Indeed, dynamic membership to $L_1'$ (under letter substitutions) admits an $\Omega(\log n / \log \log n)$ lower bound from the prefix-$U_1$ problem (see [13, Theorem 3], and [12, 4]), hence so does $L_1$. Intractable forest languages also include languages corresponding to the existential marked ancestor problem [1], e.g., the language $L_2$ on alphabet $\{m, u, q\}$ where we have a unique node labeled $q$ and we require that it has an ancestor labeled $m$. Indeed, the existential marked ancestor problem of [1] allows us to mark and unmark nodes over a fixed tree (corresponding to letters $u$ for unmarked nodes and $m$ for marked nodes), and allows us to query whether a given node (labeled $q$) has a marked ancestor. Thus, the existential marked ancestor problem immediately reduces to dynamic membership for $L_2$, which inherits the lower bound from [1].
3 Forest Algebras
All our results about the dynamic membership problem are in fact shown by rephrasing it in the terminology of forest algebras. Intuitively, forest algebras give, for trees, an analogue of the monoids and syntactic morphisms used in the setting of words, which formed the first step of the classification of the complexity of dynamic membership on words [12, 4]. More details on forest algebras are given in the appendix of the full version [2].
Monoids.
A monoid $M$ is a set equipped with an associative composition law featuring a neutral element, that is, an element $e$ such that $ex = xe = x$ for all $x$ in $M$. We write the composition law of monoids multiplicatively, i.e., we write $xy$ for the composition of $x$ and $y$. The idempotent power of an element $x$ in a finite monoid $M$, written $x^\omega$, is $x$ raised to the least integer $k \geq 1$ such that $x^k x^k = x^k$: see [21, Chp. II, Prop. 6.31] for a proof that every element has an idempotent power. The idempotent power of $M$ is a value $\omega$ ensuring that, for each $x \in M$, we have $x^\omega x^\omega = x^\omega$: this can be achieved by taking any common multiple of the exponents of the idempotent powers of all elements of $M$.
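For intuition, the idempotent power can be computed by iterating powers of the element until an idempotent appears; this terminates in any finite monoid because the powers of an element are eventually periodic and contain an idempotent. The helper below is a hypothetical sketch (not from the paper), with the monoid given as a multiplication function.

```python
# Illustrative sketch: compute x^omega, the unique idempotent among the
# powers of x, in a finite monoid given by its composition law `mul`.
def idempotent_power(x, mul):
    p = x
    while mul(p, p) != p:   # stop as soon as p is idempotent
        p = mul(p, x)       # otherwise move to the next power of x
    return p
```

For example, in the multiplicative monoid of integers modulo 6, the idempotent power of 2 is $2^2 = 4$ (since $4 \cdot 4 \equiv 4 \pmod 6$).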
Forest algebra.
A forest algebra [10] is a pair $(H, V)$ of two monoids. The monoid $V$ is the vertical monoid, with vertical composition written $\odot$ and neutral element written $\Box$. The monoid $H$ is the horizontal monoid, with horizontal concatenation denoted $\oplus$, and with neutral element written $\varepsilon$. Further, we have three composition laws: $\odot \colon V \times H \to H$, $\oplus \colon H \times V \to V$, and $\oplus \colon V \times H \to V$. We require that the following relations hold:
-
Action (composition): for every $u, v \in V$ and $h \in H$, we have $(u \odot v) \odot h = u \odot (v \odot h)$.
-
Action (neutral): for every $h \in H$, we have $\Box \odot h = h$.
-
Mixing: for every $v \in V$ and $g, h \in H$, we have $(g \oplus v) \odot h = g \oplus (v \odot h)$ and $(v \oplus g) \odot h = (v \odot h) \oplus g$.
-
Faithfulness: for every distinct $u, v \in V$, there exists $h \in H$ such that $u \odot h \neq v \odot h$.
We explain in the appendix of the full version [2] how this formalism slightly differs from that of [10] but is in fact equivalent. We often abuse notation and write $\oplus$ to mean any of the concatenation laws (on $H \times H$, $H \times V$, or $V \times H$) and write $\odot$ to mean any of the composition laws (on $V \times V$ or $V \times H$).
We almost always assume that forest algebras are finite, i.e., the horizontal and vertical monoids $H$ and $V$ are both finite. The only exception is the free forest algebra $(H_\Sigma, V_\Sigma)$. Here, $H_\Sigma$ is the set of all $\Sigma$-forests, and $V_\Sigma$ is the set of all $\Sigma$-contexts, i.e., $\Sigma$-forests having precisely one node that carries the special label $\Box \notin \Sigma$: further, this node must be a leaf. The $\oplus$-laws denote horizontal concatenation: of two forests for $H_\Sigma \times H_\Sigma$, and of a forest and a context for $H_\Sigma \times V_\Sigma$ and $V_\Sigma \times H_\Sigma$. (Note that two contexts cannot be horizontally concatenated because the result would have two nodes labeled $\Box$.) We write the $\oplus$-laws as $\oplus$, and remark that they are not commutative. The $\odot$-laws are the context application operations of plugging the forest (for $V_\Sigma \times H_\Sigma$) or context (for $V_\Sigma \times V_\Sigma$) on the right in place of the $\Box$-node of the context on the left. We write the $\odot$-laws functionally, i.e., for $C \in V_\Sigma$ and for $F \in H_\Sigma$ or $C' \in V_\Sigma$ we write $C(F)$ to mean $C \odot F$ and write $C(C')$ to mean $C \odot C'$.
We write $\Box$ for the trivial $\Sigma$-context consisting only of a single node labeled $\Box$, we write $\varepsilon$ for the empty forest, and for $a \in \Sigma$ we write $a(\Box)$ for the $\Sigma$-context which consists of a single root node labeled $a$ with a single child labeled $\Box$. Note that $H_\Sigma$ and $V_\Sigma$ are spanned by $\Box$ and $\varepsilon$ and by the $a(\Box)$ via context application and concatenation.
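For intuition, the operations of the free forest algebra can be sketched in a few lines, with forests encoded as tuples of (label, children) pairs and the hole as a special leaf (this encoding is ours, not the paper's):

```python
# Illustrative sketch of the free forest algebra: forests are tuples of
# (label, children) trees; a context is a forest containing exactly one
# leaf labeled with the hole.
HOLE = ('HOLE', ())

def concat(f, g):
    """Horizontal law (the ⊕-laws): concatenate forests/contexts."""
    return f + g

def apply(ctx, arg):
    """Vertical law (the ⊙-laws): plug the forest or context `arg`
    in place of the hole leaf of the context `ctx`."""
    out = []
    for label, children in ctx:
        if (label, children) == HOLE:
            out.extend(arg)                      # replace the hole by arg
        else:
            out.append((label, apply(children, arg)))
    return tuple(out)
```

For instance, applying the context $a(\Box)$ to a two-tree forest places that forest below a single $a$-labeled root.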
Morphisms.
A morphism of forest algebras [10] consists of two functions that map forests to forests and contexts to contexts and are compatible with the internal operations. Formally, a morphism from $(H_1, V_1)$ to $(H_2, V_2)$ is a pair of functions $(\alpha, \beta)$ where:
-
$\alpha \colon H_1 \to H_2$ is a monoid morphism: for all $g, h \in H_1$, we have $\alpha(g \oplus h) = \alpha(g) \oplus \alpha(h)$ and $\alpha(\varepsilon) = \varepsilon$ where $\oplus$ and $\varepsilon$ are interpreted in the corresponding monoid.
-
$\beta \colon V_1 \to V_2$ is a monoid morphism: for all $u, v \in V_1$, we have $\beta(u \odot v) = \beta(u) \odot \beta(v)$ and $\beta(\Box) = \Box$ where $\odot$ and $\Box$ are interpreted in the corresponding monoid.
-
for every $v \in V_1$, $h \in H_1$, we have $\alpha(v \odot h) = \beta(v) \odot \alpha(h)$.
-
for every $h \in H_1$, $v \in V_1$, we have $\beta(h \oplus v) = \alpha(h) \oplus \beta(v)$ and $\beta(v \oplus h) = \beta(v) \oplus \alpha(h)$.
We often abuse notation and write a morphism $\mu = (\alpha, \beta)$ to mean the function that applies $\beta$ to $\Sigma$-contexts and $\alpha$ to $\Sigma$-forests. For any alphabet $\Sigma$ and forest algebra $(H, V)$, given a function $\mu_0$ from the alphabet $\Sigma$ to $V$, we extend it to a morphism $\mu$ from the free forest algebra $(H_\Sigma, V_\Sigma)$ to $(H, V)$ in the only way compatible with the requirements above, i.e.:
-
For a sequence $T_1, \ldots, T_k$ of $\Sigma$-trees where at most one is a $\Sigma$-context, letting $\mu(T_i)$ for each $1 \leq i \leq k$ be inductively defined, then $\mu(T_1 \oplus \cdots \oplus T_k) = \mu(T_1) \oplus \cdots \oplus \mu(T_k)$.
-
For a $\Sigma$-tree or $\Sigma$-context $T$ with root node labeled $a$, letting $F$ be the (possibly empty) sequence of children of the root (of which at most one is a context, and for which $\mu(F)$ was inductively defined), then we set $\mu(T) = \mu_0(a) \odot \mu(F)$.
A forest language $L$ over $\Sigma$ is recognized by a forest algebra $(H, V)$ if there is a subset $G \subseteq H$ and a function $\mu_0$ from $\Sigma$ to $V$ defining a morphism $\mu$ having the following property: a $\Sigma$-forest $F$ is in $L$ iff $\mu(F) \in G$.
Syntactic forest algebra.
Regular forest languages can be related to forest algebras via the notion of syntactic forest algebra. Indeed, a forest language is regular iff it is recognized by some finite forest algebra (see [10, Proposition 3.19]). Specifically, we will consider the syntactic forest algebra $(H, V)$ of a regular forest language $L$: this forest algebra recognizes $L$, it is minimal in a certain sense, and it is unique up to isomorphism. We omit the formal definition of the syntactic forest algebra of $L$ (see [10, Definition 3.13] for details). We will just use the fact that it recognizes $L$ for a certain function $\mu_0$ from $\Sigma$ to $V$ and associated morphism $\mu$ from $\Sigma$-forests to $(H, V)$, called the syntactic morphism, and satisfying the following:
-
Surjectivity: for any element $v$ of $V$, there is a $\Sigma$-context $C$ such that $\mu(C) = v$;
-
Minimality: for any two $\Sigma$-contexts $C$ and $C'$, if $\mu(C) \neq \mu(C')$, then there is a $\Sigma$-context $C''$ and a $\Sigma$-forest $F$ such that exactly one of $C''(C(F))$ and $C''(C'(F))$ belongs to $L$.
Dynamic evaluation problem for forest algebras.
We will study the analogue of the dynamic membership problem for forest algebras, which is that of computing the evaluation of an expression. More precisely, for $(H, V)$ a forest algebra, an $(H, V)$-forest is a forest $E$ where each internal node is labeled by an element of $V$ and where each leaf is labeled with an element of $H$ – but there may be one leaf, called the distinguished leaf, which is labeled with an element of $V$. The evaluation of an $(H, V)$-forest $E$ is the image of $E$ by the morphism obtained by extending the function which maps elements of $V$ to themselves and maps each element $h$ of $H$ to the context $h \oplus \Box$ (so that $(h \oplus \Box) \odot \varepsilon = h$). Remark that the forest evaluates to an element of $V$ if it has a distinguished leaf, and to an element of $H$ otherwise.
The (non-restricted) dynamic evaluation problem for $(H, V)$ then asks us to maintain the evaluation of an $(H, V)$-forest $E$ under relabeling updates which can change the label of internal nodes of $E$ (in $V$) and of leaf nodes of $E$ (in $H$ or in $V$, but ensuring that there is always at most one distinguished leaf in $E$). Again, we assume that the forest algebra $(H, V)$ is constant, and we measure the complexity as a function of the size of the input forest $E$ (note that updates never change $(H, V)$ or the shape of $E$). The restricted dynamic evaluation problem for $(H, V)$ adds the restriction that the label of leaves always stays in $H$ (initially and after all updates), so that $E$ always evaluates to an element of $H$.
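As a point of reference, here is a naive linear-time evaluator for the restricted case (no distinguished leaf), with the algebra passed as callables; the dynamic evaluation problem asks to maintain this value faster than recomputing it from scratch after each relabeling. This sketch and its parameter names are ours, not the paper's.

```python
# Naive O(n) evaluation of an (H, V)-forest with no distinguished leaf.
# `act` is the action V x H -> H, `hplus` the horizontal concatenation on H,
# `hzero` its neutral element. Nodes are (label, children) pairs: internal
# nodes carry V-labels, leaves carry H-labels.
def evaluate_hv(forest, act, hplus, hzero):
    acc = hzero
    for label, children in forest:
        if children:   # internal node: evaluate children, then act with V-label
            acc = hplus(acc, act(label, evaluate_hv(children, act, hplus, hzero)))
        else:          # leaf: an element of H
            acc = hplus(acc, label)
    return acc
```

The test below instantiates a toy "algebra" (integers, with multiplication acting on sums) purely for illustration.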
We will use the dynamic evaluation problem in the next section for our upper bound. Indeed, it generalizes the dynamic membership problem in the following sense:
Lemma 3.1.
Let $L$ be a fixed regular forest language, and let $(H, V)$ be its syntactic forest algebra. Given an algorithm for the restricted dynamic evaluation problem for $(H, V)$ under relabeling updates, we can obtain an algorithm for the dynamic membership problem for $L$ with the same complexity per update.
4 Dynamic Membership to Regular Languages in $O(\log n / \log \log n)$
Having defined the technical prerequisites, we now start the presentation of our technical results. In this section, we show our general upper bound on the dynamic membership problem to arbitrary regular forest languages:
Theorem 4.1.
For any fixed regular forest language $L$, the dynamic membership problem for $L$ is in $O(\log n / \log \log n)$ per update, where $n$ is the number of nodes of the input forest.
Note that this upper bound matches the lower bounds reviewed at the end of Section 2. We present the proof in the rest of this section. By Lemma 3.1, we will instead show our upper bound on the restricted dynamic evaluation problem for arbitrary fixed forest algebras.
The algorithm that shows Theorem 4.1 intuitively follows a recursive scheme. For the first step of the scheme, given the input forest $F$, we compute a so-called clustering of $F$. This is a partition of the nodes of $F$ into subsets, called clusters, which are connected in a certain sense and will be chosen to have size $O(\log n)$. Intuitively, clusters are small enough so that we can maintain their evaluation under updates in $O(1)$ by tabulation; note that clusters may correspond to contexts (i.e., they may have holes), so we will perform non-restricted dynamic evaluation for them. Further, the clustering induces a forest structure on the clusters, called the forest of clusters and denoted $F'$, for which we will ensure that it has size $O(n / \log n)$. We note that the clustering algorithm that we propose is very similar to the one constructed in the work of Gazit et al. [14], which has similar properties and could be used as an alternative to the one we propose. However, our clustering is not constructed in the same way and the result of building it on an input tree is not identical to the clustering of [14], so we present our own construction to ensure that our work is self-contained.
Having computed the clustering of $F$ and the forest of clusters $F'$, we re-apply the clustering scheme recursively on $F'$, decomposing it again into clusters of size $O(\log n)$ and a forest of clusters of size $O(n / \log^2 n)$. We recurse until we obtain a forest with only one node, which is the base case: we will ensure that the number of recursion levels is in $O(\log n / \log \log n)$.
To handle updates on a node of $F$, we will apply the update on the cluster of $F$ containing that node (in $O(1)$ by tabulation), and apply the resulting update on the corresponding node in the forest of clusters $F'$. We then continue this process recursively, eventually reaching the singleton forest where the update is trivial. The main technical challenge is to bound the complexity of the preprocessing: we must show how to efficiently compute a suitable clustering on an input forest $F$ in time $O(|F|)$. It will then be possible to apply the algorithm recursively to the successive forests of clusters, with a total preprocessing complexity amounting to $O(n)$.
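To make the complexity budget explicit, here is the arithmetic behind the recursion depth, a sketch under the assumption stated above that each level shrinks the forest by a factor of $\Theta(\log n)$:

```latex
% Number of recursion levels \ell when each level divides the size by \log n:
%   n / (\log n)^{\ell} \leq 1
%   \iff \ell \log \log n \geq \log n
%   \iff \ell \geq \log n / \log \log n,
% so
\[
  \ell \;=\; \Theta\!\left(\frac{\log n}{\log \log n}\right).
\]
% An update costs O(1) at each level, hence O(log n / log log n) overall,
% and the geometric decrease of the level sizes keeps the preprocessing in O(n).
```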
The section is structured as follows. First, we formally define the notion of clusters and clusterings of a forest $F$, and we define the forest of clusters induced by a clustering: note that these notions only depend on the shape of $F$ and not on the node labels. Second, we explain how the evaluation of an $(H, V)$-forest reduces to computing the evaluation of clusters along with the evaluation of the forest of clusters. Third, we explain how we can compute in linear time a clustering of the input forest which ensures that the forest of clusters is small enough: we show that it is sufficient to compute any saturated clustering (i.e., one where no two clusters can be merged), and sketch an algorithm that achieves this. Fourth, we conclude the proof of Theorem 4.1 by summarizing how the recursive scheme works, including how the linear-time preprocessing can tabulate the effect of updates on small forests.
Clusters and clusterings.
A clustering of a forest $F$ will be defined by partitioning its vertices into connected sets, where connectedness is defined using the sibling and first-child edges.
Definition 4.2.
Let $F$ be a forest. We say that two nodes of $F$ are LCRS-adjacent (for left-child-right-sibling) if one node is the first child of the other or if they are two consecutive siblings (in particular if they are two consecutive roots). We say that a set $S$ of nodes of $F$ is LCRS-connected if, for any two nodes $x, y$ in $S$, there is a sequence $x = z_1, \ldots, z_m = y$ of nodes in $S$ such that $z_i$ and $z_{i+1}$ are LCRS-adjacent for each $1 \leq i < m$.
Note that the edges used in LCRS-adjacency are not the edges of $F$, but those of a left-child-right-sibling representation of $F$ (hence the name). For instance, the set formed of a node $x$ and its parent $y$ is not LCRS-connected unless $x$ is the first child of $y$. To define clusters and clusterings, we will use LCRS-adjacency, together with a notion of border nodes that correspond to the “holes” of clusters:
Definition 4.3.
Given a forest $F$ with $n$ nodes, we say that a node $x$ in a subset $S$ of the nodes of $F$ is a border node of $S$ if $x$ has a child which is not in $S$. For $k \geq 1$, we then say that an equivalence relation $\sim$ over the nodes of $F$ is a $k$-clustering when the following properties hold on the equivalence classes of $\sim$, called clusters:
-
each cluster contains at most $k$ nodes;
-
each cluster is LCRS-connected;
-
each cluster contains at most one border node.
The roots of a set $S$ of nodes are the nodes of $S$ whose parent is not in $S$ (or which are roots of $F$): if $S$ is a cluster, then by LCRS-connectedness its roots must be consecutive siblings in $F$.
When we have defined a $k$-clustering, it induces a forest of clusters in the following way:
Definition 4.4.
Given a $k$-clustering $\sim$ of a forest $F$, the forest of clusters $F_\sim$ is a forest whose nodes are the clusters of $\sim$, and where a cluster $S'$ is the child of a cluster $S$ when the roots of $S'$ are children of the border node of $S$.
We order the children of a cluster $S$ in $F_\sim$ in the following way. For each child $S'$ of $S$, its root nodes are a set of consecutive siblings, and these roots are in fact consecutive children of the border node of $S$. Thus, given two children $S'$ and $S''$ of $S$ in $F_\sim$, we order $S'$ before $S''$ if the roots of $S'$ come before the roots of $S''$ in the order in $F$ on the children of the border node of $S$. Likewise, we can order the roots of $F_\sim$, called the root clusters, according to the order on the roots of $F$: recall that, by our definition of siblings, the root clusters are also siblings in $F_\sim$.
Remark that the trivial equivalence relation (putting each node in its own cluster) is vacuously a $k$-clustering for any $k \geq 1$, in which the border nodes are precisely the internal nodes: we call this the trivial clustering, and its forest of clusters is isomorphic to $F$.
Evaluation of clusters.
To solve the dynamic evaluation problem on an $(H, V)$-forest $F$ using a clustering $\sim$, we will now explain how the evaluation of $F$ can reduce to the evaluation of the forest of clusters $F_\sim$ with a suitable labeling. To define this labeling, let us first define the evaluation of a cluster $S$ in $F$:
Definition 4.5.
Given an $(H, V)$-forest $F$ with no distinguished leaf, a $k$-clustering $\sim$ of $F$, and a cluster $S$ of $\sim$, we define the evaluation of $S$ as a value in $H$ or in $V$ in the following way. Let $F|_S$ be the sub-forest of $F$ induced by $S$, i.e., the sub-forest containing only the nodes in $S$ and the edges connecting two nodes that are both in $S$: note that it is an $(H, V)$-forest where each node has the same label as in $F$. When $S$ contains a border node $x$, we also add a leaf as the last child of $x$ in $F|_S$ and label it with $\Box$, making it the distinguished leaf.
The evaluation of the cluster $S$ in $F$ is then the evaluation of $F|_S$ as an $(H, V)$-forest. Note that, as $F$ has no distinguished leaf, the evaluation is in $V$ if $S$ has a border node and in $H$ otherwise; in other words it is in $V$ exactly when $S$ has a child in $F_\sim$.
We now see $F_\sim$ as an $(H, V)$-forest, with each cluster labeled by its evaluation in $F$. Then:
Claim 4.6.
For any $k$-clustering $\sim$ of an $(H, V)$-forest $F$, the evaluation of $F$ is the same as the evaluation of its forest of clusters $F_\sim$.
This property is what we use in the recursive scheme: to solve the dynamic evaluation problem on the input $(H, V)$-forest $F$, during the preprocessing we will compute a clustering $\sim$ of $F$ and compute the evaluation of clusters and of the forest of clusters $F_\sim$. Given a relabeling update on a node of $F$, we will apply it to its cluster $S$ and recompute the evaluation of $S$; this gives us a relabeling update to apply to the node $S$ of the forest of clusters $F_\sim$, and we can recursively apply the scheme to $F_\sim$ to maintain its evaluation. What remains is to explain how we can efficiently compute a clustering of a forest that ensures that the forest of clusters is “small enough”.
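The propagation idea can be illustrated on a deliberately simplified model where the "evaluation" is a plain integer sum over blocks of size $k$, so each level's value can be recomputed locally. (In the actual algorithm the local recomputation is $O(1)$ by tabulation and the labels are forest-algebra elements; the toy code below re-sums a block instead. The class and its names are ours.)

```python
# Toy model of the level-by-level update propagation: level 0 holds the
# leaf values, and each higher level stores one value per block of k
# values of the level below (here: their sum).
class LevelledSum:
    def __init__(self, values, k):
        self.k = k
        self.levels = [list(values)]
        while len(self.levels[-1]) > 1:
            prev = self.levels[-1]
            self.levels.append([sum(prev[i:i + k]) for i in range(0, len(prev), k)])

    def update(self, pos, value):
        """Relabel one leaf and propagate: one local recomputation per level."""
        self.levels[0][pos] = value
        for i in range(1, len(self.levels)):
            pos //= self.k
            block = self.levels[i - 1][pos * self.k:(pos + 1) * self.k]
            self.levels[i][pos] = sum(block)

    def total(self):
        return self.levels[-1][0]
```

With blocks of size $\Theta(\log n)$ and tabulated recomputation, this propagation is exactly what yields the $O(\log n / \log \log n)$ update time.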
Efficient computation of a clustering.
Here is what we want to establish:
Proposition 4.7.
There is a fixed constant such that the following is true: given a forest , we can compute a -clustering of in linear time such that .
Our algorithm starts with the trivial -clustering and iteratively merges clusters. Formally, merging two clusters and means setting them to be equivalent under , and we call and mergeable when doing so leads to a -clustering. Of course, we will only merge mergeable clusters in the algorithm.
We say that the -clustering is saturated if it does not contain two mergeable clusters. Our definition of clusters is designed to ensure that, as soon as the clustering is saturated, there are enough large clusters so that the number of clusters (i.e., the size of the forest of clusters) satisfies the bound of Proposition 4.7, no matter how clusters were merged. Namely:
Claim 4.8.
There is a fixed constant such that the following is true: given a -forest , any saturated -clustering on has at most clusters.
Proof sketch.
We focus on clusters with zero or one child in the forest of clusters. We consider potential merges of these clusters with their only child, with their preceding sibling, or with their parent (if they are the first child). This allows us to show that, provided that there are multiple clusters, a constant fraction of them have to contain at least nodes, so the number of clusters is .
Thus, to show Proposition 4.7, it suffices to compute a saturated -clustering. Namely:
Claim 4.9.
Given a forest, we can compute a saturated -clustering along with its forest of clusters in linear time.
Proof sketch.
We start with the trivial -clustering, and then we saturate the clustering by exploring nodes following a standard depth-first traversal: from a node , we proceed recursively on the first child of (if it exists) and then on the next sibling of (if it exists). Then, we first try to merge the cluster of with the cluster of the first child , and then try to merge the cluster of with the cluster of the next sibling . By “try to merge”, we mean that we merge the two clusters if they are mergeable.
This algorithm clearly runs in linear time as long as the mergeability tests and the actual merges are performed in constant time, which we can ensure with the right data structures. The algorithm clearly returns a clustering, and the complicated part is to show that it is saturated. For this, assume by contradiction that the result contains two mergeable clusters with LCRS-adjacent nodes and : without loss of generality is the first child or next sibling of . When processing , the algorithm tried to merge the cluster that was in with that of , and we know that this attempt failed because we see that and are in different clusters at the end of the algorithm. This yields a contradiction with the fact that and end up in mergeable clusters at the end of the algorithm, given the order in which the algorithm considers possible merges.
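To illustrate the traversal, here is a simplified sketch in Python of the saturation procedure. It is not the actual construction: we replace the mergeability test by a plain size bound (two clusters are "mergeable" when their union has at most k nodes, whereas the real definition also constrains border nodes), represent the forest in left-child-right-sibling form, and use a union-find structure for fast merges. All names are ours.

```python
class DSU:
    """Union-find tracking cluster sizes, for constant-amortized-time merges."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def mergeable(self, a, b, k):
        ra, rb = self.find(a), self.find(b)
        return ra != rb and self.size[ra] + self.size[rb] <= k

    def merge(self, a, b):
        ra, rb = self.find(a), self.find(b)
        self.parent[ra] = rb
        self.size[rb] += self.size[ra]

def saturate(first_child, next_sibling, roots, k):
    """DFS over the LCRS structure; after returning from the first child and
    the next sibling of a node, try to merge with each of them (in that order)."""
    dsu = DSU(len(first_child))

    def visit(v):
        c, s = first_child[v], next_sibling[v]
        if c is not None:
            visit(c)
        if s is not None:
            visit(s)
        if c is not None and dsu.mergeable(v, c, k):
            dsu.merge(v, c)
        if s is not None and dsu.mergeable(v, s, k):
            dsu.merge(v, s)

    for r in roots:
        visit(r)
    return dsu
```

On a forest with root 0 whose children are 1 and 2, where 2 has child 3, saturating with k = 2 merges node 3 into node 2's cluster and node 1 into node 0's cluster, leaving two size-2 clusters that no further merge can combine.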
Description of the main algorithm.
We are now ready to describe our algorithm for the dynamic evaluation problem for a fixed forest algebra . Given as input a -forest with nodes (without a distinguished node), we want to maintain the evaluation of (which is an element in ). The first step of the preprocessing is to compute in linear time an auxiliary data structure which is used to maintain the evaluation of small forests under updates in , simply by tabulation. More precisely, can be used to solve the non-restricted dynamic evaluation problem for forests of size at most , as follows:
Proposition 4.10.
Given a forest algebra there is a constant such that the following is true. Given , we can compute in a data structure that stores a sequence (initially empty) of -forests and supports the following:
-
add(): given an -forest with at most nodes, insert it into the sequence, taking time and space
-
relabel(): given an integer , a node , and a label or , relabel the node of to , taking time – as usual we require that internal nodes have labels in and at most one leaf has label in
-
eval(): given , return the evaluation of , taking time
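The interface of Proposition 4.10 can be sketched as follows. This toy version (all names are ours) does not tabulate: it simply stores each small forest and re-evaluates it on demand using a caller-supplied evaluation function. When the size bound is a constant, re-evaluation is still constant time per operation, though this does not achieve the precise bounds of the proposition.

```python
def make_structure(evaluate, k):
    """Maintain a sequence of forests of size <= k under relabelings.
    Each forest is a list of (label, children_indices) pairs; `evaluate`
    maps such a list to its value in the (fixed) forest algebra."""
    forests = []  # the maintained sequence of small forests

    def add(forest_nodes):
        assert len(forest_nodes) <= k
        forests.append(list(forest_nodes))
        return len(forests) - 1  # index of the inserted forest

    def relabel(i, v, label):
        _, children = forests[i][v]
        forests[i][v] = (label, children)

    def eval_(i):
        # re-evaluation is O(k) = O(1) for constant k (the real structure tabulates)
        return evaluate(forests[i])

    return add, relabel, eval_
```

As a toy evaluation, take the parity of the number of 'a'-labeled nodes: adding the forest a(a, b), relabeling the b-node to a flips the parity from 0 to 1.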
Letting , the first step of the preprocessing builds the data structure in time . We then move on to the second step, which is to compute a sequence of forests by recursively clustering in the following way. We start by letting be the input -forest. Then, we recursively do the following. If the sequence stops at . Otherwise, we compute a saturated -clustering of using Claim 4.9, and we let be the forest of clusters . We will later show that this takes time overall.
We continue with a third preprocessing step that computes the evaluation of each cluster at each level: again we will later argue that this takes . More precisely, we consider all the clusters of and add the sub-forest of induced by each to , obtaining their evaluation in time . We use the result of the evaluation as labels for the corresponding nodes in . Then we add the sub-forests induced by all the clusters of to to obtain their evaluation and to label . We continue until we have the evaluation of . Note that none of the -forests has a distinguished leaf: indeed, the only place where we perform non-restricted dynamic evaluation is in Proposition 4.10, i.e., on the sub-forests induced by the clusters and added to . Further note that all these induced sub-forests have at most nodes by definition of a -clustering (the comes from the -labeled leaf which may be added for the border node in Definition 4.5). Now, by Claim 4.6 at each step, we have that the evaluation of is equal to the evaluation of the input -forest . This is the answer we need to maintain, and this concludes the preprocessing.
Let us now explain how we recursively handle relabeling updates. To apply an update to the node of the input forest , we retrieve its cluster and use to apply this update to the induced sub-forest and retrieve its new evaluation, in . This gives us an update to apply to the node of the forest of clusters . We continue like this over the successive levels, until we have an update applied to the single node of , which again by Claim 4.6 is the desired answer. The update is handled overall in time , so let us bound the number of recursion steps. At every level we have by Claim 4.8 and , so we have and therefore given that .
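As a word-language analogue of this multi-level scheme, the following sketch maintains the product of a sequence under an associative operation by grouping positions into blocks of size k and recursing on the sequence of block products: a relabeling touches one block per level, hence logarithmically many blocks overall. The names are ours, and the tree case additionally handles clusters, contexts, and border nodes.

```python
def build(values, op, k):
    """Compute all levels: level i+1 holds the products of k-blocks of level i."""
    levels = [list(values)]
    while len(levels[-1]) > 1:
        cur, nxt = levels[-1], []
        for i in range(0, len(cur), k):
            block = cur[i]
            for x in cur[i + 1:i + k]:
                block = op(block, x)
            nxt.append(block)
        levels.append(nxt)
    return levels

def update(levels, op, k, pos, value):
    """Relabel position `pos` to `value`; recompute one block per level
    and return the new top-level product."""
    levels[0][pos] = value
    for lvl in range(len(levels) - 1):
        b = pos // k
        start = b * k
        cur = levels[lvl]
        block = cur[start]
        for x in cur[start + 1:min(start + k, len(cur))]:
            block = op(block, x)
        levels[lvl + 1][b] = block
        pos = b
    return levels[-1][0]
```

For instance, with addition as the operation, k = 2, and eight positions holding 1, the top level stores 8; relabeling position 3 to 5 propagates through three levels and yields 12.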
The only remaining point is to bound the total time needed in the preprocessing. The data structure is computed in linear time in the first step. Then, we spend linear time in each to compute the -clusterings at each level in the second step, and we again spend linear time in each to feed the induced sub-forests of all the clusters of to in the third step. Now, it suffices to observe that the size of each decreases exponentially with ; for sufficiently large and we have so and the total size of the is . This ensures that the total time taken by the preprocessing across all levels in the second and third steps is in , which concludes.
5 Dynamic Membership to Almost-Commutative Languages in
We have shown in Section 4 our general upper bound (Theorem 4.1) on the dynamic membership problem to arbitrary forest languages. In this section, we study how to show more favorable bounds on the complexity of dynamic membership when focusing on restricted families of languages. More precisely, in this section, we define a class of regular forest languages, called almost-commutative languages, and show that the dynamic membership problem to such languages can be solved in . We will continue studying these languages in the next section to show that non-almost-commutative languages conditionally cannot be maintained in constant time when assuming the presence of a neutral letter.
Defining almost-commutative languages.
To define almost-commutative languages, we need to define the subclasses of virtually-singleton languages and regular-commutative languages. Let us first define virtually-singleton languages via the operation of projection:
Definition 5.1.
The removal of a node in a -forest means replacing by the (possibly empty) sequence of its children. Removing a subset of nodes of is then defined in the expected way; note that the result does not depend on the order in which nodes are removed.
For a subalphabet, given a -forest , the projection of over is the forest obtained from when removing all nodes that are labeled by a letter of .
A forest language over is virtually-singleton if there exists a subalphabet and a -forest such that is the set of forests whose projection over is .
Note that virtually-singleton languages are always regular: a forest automaton can read an input forest, ignoring nodes with labels in , and check that the resulting forest is exactly the fixed target forest .
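The projection operation of Definition 5.1 can be sketched as follows, representing a forest as a list of (label, children) nodes (an encoding of our choosing): a removed node is replaced by the sequence of its surviving descendants.

```python
def project(nodes, roots, keep):
    """Remove every node whose label is not in `keep`, replacing it by its
    children; returns (new_nodes, new_roots) for the projected forest."""
    new_nodes = []

    def visit(v):
        lab, children = nodes[v]
        # surviving roots of the subtrees below v, in document order
        kept_children = [w for c in children for w in visit(c)]
        if lab in keep:
            new_nodes.append((lab, kept_children))
            return [len(new_nodes) - 1]
        return kept_children  # v is removed: splice its children in its place

    new_roots = [w for r in roots for w in visit(r)]
    return new_nodes, new_roots
```

For example, projecting the forest a(x(a), b) over {a, b} removes the x-node and yields a(a, b).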
Let us now define regular-commutative languages as the commutative regular forest languages, i.e., the regular languages for which membership can be determined from the Parikh image alone:
Definition 5.2.
The Parikh image of a -forest is the vector such that for every letter , the component is the number of nodes labeled by in .
A forest language is regular-commutative if it is regular and there is a set such that is the set of forests whose Parikh image is in .
We can now define almost-commutative forest languages from these two classes:
Definition 5.3.
A forest language is almost-commutative if it is a finite Boolean combination of regular-commutative and virtually-singleton languages.
Note that almost-commutative languages are always regular, because regular forest languages are closed under Boolean operations. Further, in a certain sense, almost-commutative languages generalize all word languages with a neutral letter that enjoy constant-time dynamic membership. Indeed, such languages are known by [4] and [5, Corollary 3.5] to be described by regular-commutative conditions and the presence of specific subwords on some subalphabets. Thus, letting be such a word language, we can define a forest language consisting of the forests where the nodes of form a word of (e.g., when taken in prefix order), and is then almost-commutative. As a kind of informal converse, given a -forest , we can represent it as a word with opening and closing parentheses (sometimes called the XML encoding), and the set of such representations for an almost-commutative forest language will intuitively form a word language that enjoys constant-time dynamic membership except for the (non-regular) requirement that parentheses are balanced.
We also note that we can effectively decide whether a given forest language is almost-commutative, as will follow from the algebraic characterization in the next section:
Proposition 5.4.
The following problem is decidable: given a forest automaton , determine whether the language accepted by is almost-commutative.
Tractability for almost-commutative languages.
We show the main result of this section:
Theorem 5.5.
For any fixed almost-commutative forest language , the dynamic membership problem to is in .
Proof sketch.
This result is shown by proving the claim for regular-commutative languages and virtually-singleton languages, and then noticing that tractability is preserved under Boolean operations, simply by combining data structures for the constituent languages.
For regular-commutative languages, we maintain the Parikh image as a vector in constant-time per update, and we then easily maintain the forest algebra element to which it corresponds, similarly to the case of monoids (see [23] or [4, Theorem 4.1]).
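A minimal sketch of this Parikh-image maintenance, with the acceptance predicate on the Parikh image supplied by the caller (class and method names are ours; the real structure also maintains the corresponding forest algebra element):

```python
class ParikhMaintainer:
    """Maintain the Parikh image of the labels under relabelings, in O(1)
    per update; `accept` maps a letter-count dict to a membership verdict."""
    def __init__(self, labels, accept):
        self.labels = list(labels)
        self.count = {}
        for a in self.labels:
            self.count[a] = self.count.get(a, 0) + 1
        self.accept = accept

    def member(self):
        return self.accept(self.count)

    def relabel(self, node, new_label):
        old = self.labels[node]
        self.count[old] -= 1  # one decrement, one increment: O(1)
        self.count[new_label] = self.count.get(new_label, 0) + 1
        self.labels[node] = new_label
```

For the toy commutative condition "the number of a's is even" on labels "aab", membership holds initially, fails after relabeling the b-node to a, and holds again after relabeling an a-node to b.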
For virtually-singleton languages, we use doubly-linked lists like in [4, Prop. 4.3] to maintain, for each letter of the subalphabet , the unordered set of nodes with label . This allows us to determine in constant time whether the Parikh image of the input forest restricted to is correct: when this holds, the doubly-linked lists have constant size and we can use them to recover all nodes with labels in . With constant-time reachability queries, we can then test whether these nodes form the requisite forest over .
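The word analogue of this scheme can be sketched as follows; we use Python sets in place of doubly-linked lists, and since there are constantly many relevant positions whenever the counts match the target, sorting and comparing them directly stays constant time. All names are ours.

```python
class VirtuallySingleton:
    """Maintain membership in the language of words whose projection to the
    subalphabet S equals a fixed target word t, under relabelings."""
    def __init__(self, word, S, t):
        self.word = list(word)
        self.S = set(S)
        self.t = t
        self.positions = {a: set() for a in S}  # positions carrying each letter of S
        for i, a in enumerate(self.word):
            if a in self.S:
                self.positions[a].add(i)

    def member(self):
        total = sum(len(p) for p in self.positions.values())
        if total != len(self.t):  # wrong count: reject without enumerating
            return False
        occupied = sorted(i for p in self.positions.values() for i in p)
        return [self.word[i] for i in occupied] == list(self.t)

    def relabel(self, i, a):
        old = self.word[i]
        if old in self.S:
            self.positions[old].discard(i)
        if a in self.S:
            self.positions[a].add(i)
        self.word[i] = a
```

With subalphabet {a, b} and target "ab", the word "xaxbx" is a member; relabeling position 1 to x breaks membership, and relabeling it back to a restores it.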
6 Lower Bound on Non-Almost-Commutative Languages with Neutral Letter
We have introduced in the previous section the class of almost-commutative languages, and showed that such languages admit a constant-time dynamic membership algorithm (Theorem 5.5). In this section, we show that this class is tight: non-almost-commutative regular forest languages cannot enjoy constant-time dynamic membership when assuming the presence of a neutral letter and when assuming the hardness of the prefix- problem from [4]. We first present these hypotheses in more detail, and state the lower bound that we show in this section (Theorem 6.2). Second, we present the algebraic characterization of almost-commutative regular languages on which the proof of Theorem 6.2 hinges: they are precisely the regular languages whose syntactic forest algebra is in a class called ZG. Third, we sketch the lower bound showing that dynamic membership is conditionally not in for languages with a neutral letter whose syntactic forest algebra is not in ZG, and conclude.
Hypotheses and result statement.
The lower bound shown in this section is conditional on a computational hypothesis about the prefix- problem. In this problem, we are given as input a word on the alphabet , and we must handle substitution updates to and queries where we are given and must return whether the prefix of of length contains some occurrence of . In other words, we must maintain a subset of integers of under insertion and deletion, and handle queries asking whether an input is greater than the current minimum: note that this is like a priority queue, except that we can only compare the minimum to an input value but not retrieve it directly. We will use the following conjecture from [4] as a hypothesis for our lower bound:
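The interface of this problem can be restated in code. The naive baseline below (names are ours) answers queries in linear time; the conjecture states that no structure achieves constant time per operation.

```python
class PrefixU1:
    """Maintain a set of integers; query(q) asks whether the prefix of
    length q contains an occurrence, i.e. whether some element is <= q."""
    def __init__(self):
        self.present = set()

    def insert(self, i):
        self.present.add(i)

    def delete(self, i):
        self.present.discard(i)

    def query(self, q):
        # equivalently: is q at least the current minimum?  Linear-time scan;
        # note we never need to *retrieve* the minimum, only compare to it.
        return any(i <= q for i in self.present)
```

After inserting 5, queries at 3 and 7 answer no and yes respectively; deleting 5 makes both answer no.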
Conjecture 6.1 ([4, Conj. 2.3]).
There is no data structure solving the prefix- problem in time per operation in the RAM model with unit cost and logarithmic word size.
Further, our lower bound in this section will be shown for regular forest languages over an alphabet which are assumed to feature a so-called neutral letter for . Formally, a letter is neutral for if, for every forest , we have iff , where denotes the projection operation of Definition 5.1. In other words, nodes labeled by can be removed without affecting the membership of to .
We can now state the lower bound shown in this section, which is our main contribution:
Theorem 6.2.
Let be a regular forest language featuring a neutral letter. Assuming Conjecture 6.1, we have that has dynamic membership in iff is almost-commutative.
Algebraic characterization of almost-commutative languages.
The proof of Theorem 6.2 hinges on an algebraic characterization of the almost-commutative languages. Namely, we will define a class of forest algebras called ZG, by imposing the ZG equation from [4, 5]:
Definition 6.3.
A monoid is in the class ZG if it satisfies the ZG equation: for all we have . A forest algebra is in the class ZG if its vertical monoid is in ZG: we call it a ZG forest algebra.
The intuition for the ZG equation is the following. Elements of the form in a monoid are called group elements: they are precisely the elements which span a subsemigroup (formed of the elements ) that has the structure of a group (with as the neutral element). Note that this group is always cyclic: it may be the trivial group; for instance, in an aperiodic monoid all such groups are trivial, and in any monoid the neutral element spans the trivial group. The equation implies that all group elements of the monoid are central, i.e., they commute with all other elements.
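For a finite monoid given by its multiplication table, the ZG equation can be checked directly by brute force (a sketch; names are ours): for every element x, walk the powers of x to the unique idempotent power, take the group element one power further, and check that it is central.

```python
def is_zg(mul):
    """mul[a][b] is the product a*b in a finite monoid on elements 0..n-1.
    ZG holds iff every group element x^(w+1) commutes with every y."""
    n = len(mul)
    for x in range(n):
        # walk the powers x, x^2, ... until reaching the idempotent power x^w
        # (the sequence of powers of an element of a finite monoid always
        # contains exactly one idempotent, so this loop terminates)
        p = x
        while mul[p][p] != p:
            p = mul[p][x]
        g = mul[p][x]  # the group element x^(w+1)
        if any(mul[g][y] != mul[y][g] for y in range(n)):
            return False
    return True
```

The commutative monoid Z/2Z is trivially in ZG; the three-element monoid {1, a, b} with xa = a and xb = b (identity adjoined to a right-zero semigroup) is not, since the idempotent a is a group element that fails to commute with b.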
The point of ZG forest algebras is that they correspond to almost-commutative languages:
Theorem 6.4.
A regular language is almost-commutative if and only if its syntactic forest algebra is in ZG.
Proving this algebraic characterization is the main technical challenge of the paper. Let us sketch the proof of Theorem 6.4, with the details given in the full version [2].
Proof sketch.
The easy direction is to show that almost-commutative languages have a syntactic forest algebra in ZG. We first show this for commutative languages (whose vertical monoids are unsurprisingly commutative) and for virtually-singleton languages (whose vertical monoids are nilpotent, i.e., are in the class MNil of [4, 24]). We conclude because satisfying the ZG equation is preserved under Boolean operations.
The hard direction is to show that any regular language whose syntactic forest algebra is in ZG is almost-commutative, i.e., can be expressed as a finite Boolean combination of virtually-singleton and regular-commutative languages. For this, we show that morphisms to a ZG forest algebra must be determined by the following information on the input forest: which letters are rare (i.e., occur a constant number of times) and which are frequent; how many times the frequent letters appear modulo the idempotent power of the monoid; and what is the projection of the forest on the rare letters. These conditions amount to an almost-commutative language, and are the analogue for trees of results on ZG languages [4]. The proof is technical, because we must show how the ZG equation on the vertical monoid implies that every sufficiently frequent letter commutes both vertically and horizontally.
Hardness for syntactic forest algebras outside ZG.
To show our conditional hardness result (Theorem 6.2), what remains is to show that the dynamic membership problem is hard for regular languages with a neutral letter and whose syntactic forest algebra is not in ZG:
Proposition 6.5.
Let be a regular forest language, and assume that it has a neutral letter and that its syntactic forest algebra is not in ZG. Subject to Conjecture 6.1, the dynamic membership problem for cannot be solved in constant time per update.
Proof sketch.
We reduce from the case of words: from two contexts and witnessing that the ZG equation does not hold, we consider forests formed by the vertical composition of a sequence of contexts which can be either or , with a suitable context at the beginning and end. Let be the word language of those sequences of contexts giving a forest in : we show that the syntactic monoid of is not in ZG, and as features a neutral letter, we conclude by [4] that does not enjoy dynamic membership assuming Conjecture 6.1. Further, dynamic membership to can be achieved using a data structure for the same problem for on the forests that we constructed, so hardness also holds for .
Note that Proposition 6.5 is where we use the assumption that there is a neutral letter. The result is not true otherwise, as we explain below (and will discuss further in Section 7):
Example 6.6.
Consider the language of forests over where there is a node labeled whose next sibling is labeled . Membership to can be maintained in , like the language of words . (Note that is not a neutral letter, because is accepted but is not.) However, one can show that the syntactic forest algebra of is not in ZG. By contrast, adding a neutral letter to yields a language (with the same syntactic forest algebra) with no dynamic membership algorithm under Conjecture 6.1.
With Proposition 6.5 and Theorem 6.4, we can conclude the proof of Theorem 6.2:
Proof of Theorem 6.2.
We already know that almost-commutative languages can be maintained efficiently (Theorem 5.5). Now, given a regular forest language which is not almost-commutative and features a neutral letter, we know by Theorem 6.4 that its syntactic forest algebra is not in ZG, so we conclude by Proposition 6.5.
7 Conclusions and Future Work
We have studied the problem of dynamic membership to fixed tree languages under relabeling updates. We have shown an algorithm for arbitrary regular languages, and introduced the class of almost-commutative languages for which dynamic membership can be decided in . We have shown that, under the prefix-U1 conjecture, and if the language is regular and features a neutral letter, then it must be almost-commutative to be maintainable in constant time. Our work leaves many questions open which we present below: characterizing the -maintainable languages without neutral letter, identifying languages with complexity between and , and other open directions.
Constant-time dynamic membership without neutral letters.
Our conditional characterization of constant-time-maintainable regular languages only holds in the presence of a neutral letter. In fact, without neutral letters, it is not difficult to find non-almost-commutative forest languages with constant-time dynamic membership, e.g., the language from Example 6.6. There are more complex examples, e.g., dynamic membership in is possible for the word language “there is exactly one and exactly one , the is before the , and the distance between them is even” (it is in QLZG [4]), and the same holds for the analogous forest language. We can even precompute “structural information” on forests of a more complex nature than the parity of the depths, e.g., to maintain in “there is exactly one and exactly one and the path from to is a downward path that takes only first-child edges”. We also expect other constant-time tractable cases, e.g., “there is exactly one node labeled and exactly one node labeled and their least common ancestor is labeled ” using a linear-preprocessing constant-time data structure to answer least common ancestor queries on the tree structure [16]; or “there is a node labeled where the leaf reached via first-child edges is labeled ”. These tractable languages are not almost-commutative (and do not feature neutral letters), and a natural direction for future work would be to characterize the regular forest languages maintainable in without the neutral letter assumption.
Intermediate complexity for dynamic membership.
We have explained in Section 2 that dynamic membership is in for some regular forest languages, and in Section 5 that it is in for almost-commutative languages. One natural question is to study intermediate complexity regimes. In the setting of word languages, it was shown in [23] (and extended in [4]) that any aperiodic language could be maintained in per update. This implies that some forest languages can be maintained with the same complexity, e.g., the forests whose nodes in the prefix ordering form a word in an aperiodic language .
The natural question is then to characterize which forest languages enjoy dynamic membership between and the general bound. We leave this question open, but note a difference with the setting for words: there are some aperiodic forest languages (i.e., both monoids of the syntactic forest algebra are aperiodic) to which an lower bound applies, e.g., the language for the existential marked ancestor problem reviewed at the end of Section 2. An intriguing question is whether there is a dichotomy on regular forest languages, already in the aperiodic case, between those with dynamic membership, and those with a lower bound.
Other questions.
Beyond relabeling updates it would be natural to study the complexity of dynamic membership under operations that can change the shape of the forest, e.g., leaf insertion and leaf deletion. Another question concerns the support for more general queries than dynamic membership, e.g., enumerating the answers to non-Boolean queries like in [19, 3] (but with language-dependent guarantees on the update time to improve over ). Last, another generalization of dynamic membership to forest languages is to consider the same problem for context-free languages, e.g., for Dyck languages ([17, Proposition 1]), or for visibly pushdown languages. Note that this is a different setting from forest languages because substitution updates may change the shape of the parse tree.
References
- [1] Stephen Alstrup, Thore Husfeldt, and Theis Rauhe. Marked ancestor problems. In FOCS, 1998. doi:10.1109/SFCS.1998.743504.
- [2] Antoine Amarilli, Corentin Barloy, Louis Jachiet, and Charles Paperman. Dynamic membership for regular tree languages, 2025. doi:10.48550/arXiv.2504.17536.
- [3] Antoine Amarilli, Pierre Bourhis, and Stefan Mengel. Enumeration on trees under relabelings. In ICDT, 2018. doi:10.4230/LIPIcs.ICDT.2018.5.
- [4] Antoine Amarilli, Louis Jachiet, and Charles Paperman. Dynamic membership for regular languages. In ICALP, 2021. doi:10.4230/LIPIcs.ICALP.2021.116.
- [5] Antoine Amarilli and Charles Paperman. Locality and centrality: The variety ZG. LMCS, 2023. doi:10.46298/LMCS-19(4:4)2023.
- [6] Andrey Balmin, Yannis Papakonstantinou, and Victor Vianu. Incremental validation of XML documents. TODS, 29(4), 2004.
- [7] Corentin Barloy. On the complexity of regular languages. PhD thesis, Université de Lille, 2024. URL: https://theses.hal.science/tel-04820899v1/.
- [8] David A Mix Barrington, Neil Immerman, Clemens Lautemann, Nicole Schweikardt, and Denis Thérien. First-order expressibility of languages with neutral letters or: The Crane Beach conjecture. JCSS, 70(2), 2005. doi:10.1016/j.jcss.2004.07.004.
- [9] Hans L Bodlaender and Torben Hagerup. Parallel algorithms with optimal speedup for bounded treewidth. SIAM Journal on Computing, 27(6), 1998. doi:10.1137/S0097539795289859.
- [10] Mikolaj Bojanczyk and Igor Walukiewicz. Forest algebras. Automata and logic: history and perspectives, 2008.
- [11] Hubert Comon, Max Dauchet, Rémi Gilleron, Florent Jacquemard, Denis Lugiez, Christof Löding, Sophie Tison, and Marc Tommasi. Tree automata: Techniques and applications, 2008. URL: https://hal.science/hal-03367725.
- [12] Gudmund Skovbjerg Frandsen, Thore Husfeldt, Peter Bro Miltersen, Theis Rauhe, and Søren Skyum. Dynamic algorithms for the Dyck languages. In WADS, pages 98–108, 1995.
- [13] Michael Fredman and Michael Saks. The cell probe complexity of dynamic data structures. In STOC, 1989.
- [14] Hillel Gazit, Gary L Miller, and Shang-Hua Teng. Optimal tree contraction in the EREW model. In Concurrent Computations: Algorithms, Architecture, and Technology. Springer, 1988.
- [15] Wouter Gelade, Marcel Marquardt, and Thomas Schwentick. The dynamic complexity of formal languages. TOCL, 13(3), 2012.
- [16] Dov Harel and Robert Endre Tarjan. Fast algorithms for finding nearest common ancestors. SIAM Journal on Computing, 13(2), 1984. doi:10.1137/0213024.
- [17] Thore Husfeldt and Theis Rauhe. Hardness results for dynamic problems by extensions of Fredman and Saks’ chronogram method. In ICALP, 1998.
- [18] Michal Kouckỳ, Pavel Pudlák, and Denis Thérien. Bounded-depth circuits: Separating wires from gates. In STOC, 2005. doi:10.1145/1060590.1060629.
- [19] Katja Losemann and Wim Martens. MSO queries on trees: Enumerating answers under updates. In CSL–LICS, 2014. doi:10.1145/2603088.2603137.
- [20] Gary L Miller and John H Reif. Parallel tree contraction and its application. In FOCS, 1985. doi:10.1109/SFCS.1985.43.
- [21] Jean-Éric Pin. Mathematical foundations of automata theory, 2019. URL: https://www.irif.fr/~jep/PDF/MPRI/MPRI.pdf.
- [22] Jonas Schmidt, Thomas Schwentick, and Jennifer Todtenhoefer. On the work of dynamic constant-time parallel algorithms for regular tree languages and context-free languages. In MFCS, volume 272, 2023. doi:10.4230/LIPIcs.MFCS.2023.81.
- [23] Gudmund Skovbjerg Frandsen, Peter Bro Miltersen, and Sven Skyum. Dynamic word problems. JACM, 44(2), 1997.
- [24] Howard Straubing. The variety generated by finite nilpotent monoids. Semigroup Forum, 24(1), 1982. doi:10.1007/bf02572753.
- [25] Felix Tschirbs, Nils Vortmeier, and Thomas Zeume. Dynamic complexity of regular languages: Big changes, small work. In CSL, 2023.
- [26] Moshe Y Vardi. The complexity of relational query languages. In STOC, 1982.
