FO-Query Enumeration over SLP-Compressed Structures of Bounded Degree

Lohrey, Markus; Maneth, Sebastian; Schmid, Markus L.

doi:10.4230/LIPIcs.MFCS.2025.69

Markus Lohrey, University of Siegen, Germany

Sebastian Maneth, University of Bremen, Germany

Markus L. Schmid, Humboldt-Universität zu Berlin, Germany

Abstract

Enumerating the result set of a first-order query over a relational structure of bounded degree can be done with linear preprocessing and constant delay. In this work, we extend this result towards the compressed perspective where the structure is given in a potentially highly compressed form by a straight-line program (SLP). Our main result is an algorithm that enumerates the result set of a first-order query over a structure of bounded degree that is represented by an SLP satisfying the so-called apex condition. For a fixed formula, the enumeration algorithm has constant delay and needs a preprocessing time that is linear in the size of the SLP.

Keywords and phrases:

Enumeration algorithms, FO-logic, query evaluation over compressed data

Funding:

Markus L. Schmid: Supported by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG), project number 522576760.

2012 ACM Subject Classification:

Theory of computation $\rightarrow$ Database query processing and optimization (theory); Theory of computation $\rightarrow$ Logic and databases

Editors:

Paweł Gawrychowski, Filip Mazowiecki, and Michał Skrzypczak

Series and Publisher:

Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik

1 Introduction

First order model checking (i.e., deciding whether an FO-sentence $\phi$ holds in a relational structure ${\mathcal{U}}$ , ${\mathcal{U}}\models\phi$ for short) is a classical problem in computer science and its complexity has been thoroughly investigated; see, e.g., [21, 32, 37]. In database theory, it is of importance due to its practical relevance for evaluating SQL-like query languages in relational databases. FO model checking is PSPACE-complete when $\phi$ and ${\mathcal{U}}$ are both part of the input, but it becomes fixed-parameter tractable (even linear fixed-parameter tractable) with respect to the parameter $|\phi|$ when ${\mathcal{U}}$ is restricted to a suitable class of relational structures (see the paragraph on related work below for details), while for the class of all structures it is not fixed-parameter tractable modulo certain complexity assumptions. This is relevant, since in practical scenarios queries are often small, especially in comparison to the data (represented by the relational structure) that is often huge.

FO model checking (i.e., checking a Boolean query that returns either true or false) reduces to practical query evaluation tasks and is therefore suitable to transfer lower bounds. However, from a practical point of view, FO-query enumeration is more relevant. It is the problem of enumerating without repetitions for an FO-formula $\phi(x_{1},\ldots,x_{k})$ with free variables $x_{1},\ldots,x_{k}$ the result set $\phi({\mathcal{U}})$ of all tuples $(a_{1},\ldots,a_{k})\in{\mathcal{U}}^{k}$ such that ${\mathcal{U}}\models\phi(a_{1},\ldots,a_{k})$ . Since $\phi({\mathcal{U}})$ can be rather large (exponential in $k$ in general), the total time for enumeration is not a good measure for the performance of an enumeration algorithm. More realistic measures are the preprocessing time (used for performing some preprocessing on the input) and the delay, which is the maximum time needed between the production of two consecutive output tuples from $\phi({\mathcal{U}})$ . In data complexity (where we consider $|\phi|$ to be constant), the best we can hope for is linear preprocessing time (i.e., $f(|\phi|)\cdot|{\mathcal{U}}|$ for a computable function $f$ ) and constant delay (i.e., the delay is $f(|\phi|)$ for some computable function $f$ and therefore does not depend on $|{\mathcal{U}}|$ ). Over the last two decades, many of the linear time (with respect to data complexity) FO model checking algorithms for various subclasses of structures have been extended to FO-query enumeration algorithms with linear (or quasi-linear) time preprocessing and constant delay (see the paragraph on related work below for the relevant literature).

In this work, we extend FO-query enumeration towards the compressed perspective, i.e., we wish to enumerate the result set $\phi({\mathcal{U}})$ in the scenario where ${\mathcal{U}}$ is given in a potentially highly compressed form, and we want to work directly on this compressed form without decompressing it. In this regard, we contribute to a recent research effort in database theory that is concerned with query evaluation over compressed data [45, 51, 59, 60]. Let us now explain this framework in more detail.

Query evaluation over compressed data.

Query evaluation over compressed data combines the classical task of query evaluation with the paradigm of algorithmics on compressed data (ACD), i.e., solving computational tasks directly on compressed data objects without prior decompression. ACD is an established algorithmic paradigm and it works very well in the context of grammar-based compression with so-called straight-line programs (SLPs). Such SLPs use grammar-like formalisms in order to specify how a data object is constructed from small building blocks. For example, if the data object is a finite string $w$ , then an SLP is just a context-free grammar for the language $\{w\}$ . For instance, the SLP $S\to AA$ , $A\to BBC$ , $B\to ba$ , $C\to cb$ (where $S, A, B, C$ are nonterminals and $a, b, c$ are terminals) produces the string $b a b a c b b a b a c b$ . While SLPs achieve exponential compression in the best case, there are also fast heuristic compressors that yield decent compression in practical scenarios. Moreover, SLPs are very well suited for ACD; see, e.g., [38].
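The string example above can be checked mechanically. The following Python sketch is only an illustration (the names `SLP`, `expand` and `length` are ours and do not appear in the paper): it decompresses the example SLP and, as a tiny instance of the ACD paradigm, also computes the length of the produced string directly on the grammar without decompressing it.

```python
from functools import lru_cache

# The example SLP from the text: every nonterminal has exactly one production.
SLP = {"S": "AA", "A": "BBC", "B": "ba", "C": "cb"}


def expand(symbol: str) -> str:
    """Fully decompress: recursively replace nonterminals by their right-hand sides."""
    if symbol not in SLP:  # terminal symbol
        return symbol
    return "".join(expand(s) for s in SLP[symbol])


@lru_cache(maxsize=None)
def length(symbol: str) -> int:
    """Length of the produced string, computed on the SLP without decompression."""
    if symbol not in SLP:
        return 1
    return sum(length(s) for s in SLP[symbol])


print(expand("S"))  # babacbbabacb
print(length("S"))  # 12
```

Because `length` is memoized per nonterminal, its running time is linear in the size of the SLP, even when the produced string is exponentially longer.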

An important point is that the ACD perspective can lead to dramatic running time improvements: if the same problem can be solved in linear time both in the uncompressed and in the compressed setting (i.e., linear in the compressed size), then in the case that the input can be compressed from size $n$ to size $\operatorname{\mathcal{O}}(\log n)$ (which is possible with SLPs in the best case), the algorithm for the compressed data has a running time of $\operatorname{\mathcal{O}}(\log n)$ (compared to $\operatorname{\mathcal{O}}(n)$ for the algorithm working on uncompressed data). An important problem that shows this behavior is for instance pattern matching in compressed texts [22].

SLPs are most famous for strings (see [5, 10, 22, 23] for some recent publications and [38] for a survey). What makes them particularly appealing for query evaluation is that their general approach of compressing data objects by grammars extends from strings to more complex structures like trees [24, 40, 42, 44] and hypergraphs (i.e., general relational structures) [36, 39, 46, 47], while, at the same time, their good ACD-properties are maintained to some extent. This is due to the fact that context-free string grammars extend to context-free tree grammars [58] (see also [25]) and to hyperedge replacement grammars [6, 27] (see also [16]).

In this work, we are concerned with FO-query enumeration for relational structures that are compressed by SLPs based on hyperedge replacement grammars (also known as hierarchical graph definitions or SL HR grammars; see the paragraph on related work for references). An example of such an SLP is shown in Figure 1. It consists of productions (shown in Figure 1 on the left) that replace nonterminals ( $S$ , $A$ , and $B$ in Figure 1) by their unique right-hand sides. Each right-hand side is a relational structure (a directed graph in Figure 1) together with occurrences of earlier defined nonterminals and certain distinguished contact nodes (labelled by $1$ and $2$ in Figure 1). In this way, every nonterminal $X\in\{S,A,B\}$ produces a relational structure $\mathsf{val}(X)$ (the value of $X$ ) with distinguished contact nodes. These structures are shown in Figure 1 on the right. When replacing, for instance, the occurrence of $B$ in the right-hand side of $S$ by $\mathsf{val}(B)$ , one identifies for every $i\in\{1,2\}$ the $i$ -labelled contact node in $\mathsf{val}(B)$ with the node that is connected by the $i$ -labelled dotted edge with the $B$ -occurrence in the right-hand side of $S$ (these are the nodes labelled with $u$ and $v$ in Figure 1).

Main result.

It is known that FO-query enumeration for degree-bounded structures can be done with linear preprocessing and constant delay [13, 28]. Moreover, FO model checking for SLP-compressed degree-bounded structures can be done efficiently [39]. We combine these two results and therefore extend FO-query enumeration for bounded-degree structures towards the SLP-compressed setting, or, in other words, we extend FO model checking of SLP-compressed structures to the query-enumeration perspective. A preliminary version of our main result is stated below. It restricts to so-called apex SLPs. Roughly speaking, the apex property demands that each graph replacing a nonterminal must not contain other nonterminals at the “contact nodes” (the nodes the nonterminal was incident with). The apex property is well known from graph language theory [16, 17, 18] and has been used for SLPs in [39, 49].

Theorem 1.

Let $d\in\mathbb{N}$ be a constant. For an FO-formula $\phi(x_{1},\ldots,x_{k})$ and a relational structure ${\mathcal{U}}$ , whose Gaifman graph has degree at most $d$ , and that is given in compressed form by an apex SLP $D$ , one can enumerate the result set $\phi({\mathcal{U}})$ after preprocessing time $f(d,|\phi|)\cdot|D|$ and delay $f(d,|\phi|)$ for some computable function $f$ .

Note that the preprocessing is linear in the compressed size $|D|$ instead of the data size $|{\mathcal{U}}|$ .

We prove this result by extending the FO-query enumeration algorithm for uncompressed structures from [28] to the SLP-compressed setting. For this we have to overcome considerable technical barriers. The algorithm of [28] exploits the Gaifman locality of FO-queries. In the preprocessing phase the algorithm computes for each element $a\in{\mathcal{U}}$ the $r$ -sphere around $a$ for a radius $r$ that only depends on the formula $\phi$ . This leads to a preprocessing time of $|{\mathcal{U}}|\cdot f(d,|\phi|)$ . For an SLP-compressed structure we cannot afford to iterate over all elements of the structure. Inspired by a technique from [39], we will expand every nonterminal of the SLP $D$ locally up to a size that depends only on $\phi$ and $d$ . This leads to at most $|D|$ substructures of size $f(d,|\phi|)$ . Our enumeration algorithm then enumerates certain paths in the derivation tree defined by $D$ and for each such path ending in a nonterminal $A$ it searches in the precomputed local expansion of $A$ for nodes with a certain sphere type.

Related work.

In the uncompressed setting, there are several classes of relational structures for which FO-query enumeration can be solved with linear (or quasi-linear) preprocessing and constant delay, e.g., relational structures with bounded degree [8, 13, 28], low degree [14], (locally) bounded expansion [29, 64], and structures that are tree-like [3, 30] or nowhere dense [61]; see [7, 63] for surveys. Moreover, for conjunctive queries with certain acyclicity conditions, linear preprocessing and constant delay is also possible for the class of all relational structures [4, 7]. The algorithm from [28] is the most relevant one for our work.

Concerning other work on query enumeration on SLP-compressed data, we mention [51, 59, 60], which deals with constant delay enumeration for (a fragment of) MSO-queries on SLP-compressed strings, and [45], which presents a linear preprocessing and constant delay algorithm for MSO-queries on SLP-compressed unranked forests.

SLPs for (hyper)graphs were introduced as hierarchical graph descriptions by Lengauer and Wanke [36] and have been further studied, e.g., in [9, 19, 20, 34, 35, 48, 49, 50]. Model checking problems for SLP-compressed graphs have been studied in [39] for FO and MSO, [26] for fixpoint logics, and [1, 2] for the temporal logics LTL and CTL in the context of hierarchical state machines (which are a particular type of graph SLPs). Particularly relevant for this paper is a result from [39] stating that for every level $\Sigma^{\mathsf{P}}_{i}$ of the polynomial time hierarchy there is a fixed FO-formula for which the model checking problem for SLP-compressed input graphs is $\Sigma^{\mathsf{P}}_{i}$ -complete. In contrast, for apex SLPs the model checking problem for every fixed FO-formula belongs to NL (nondeterministic logspace) [39]. This (and the fact that FO model checking reduces to FO-query enumeration, so that lower bounds transfer) partly explains the restriction to apex SLPs in Theorem 1.

Compression of graphs via graph SLPs has been considered in [57] following a “Sequitur scheme” [52] and in [47] following a “RePair scheme” [33] (see also [41]); note that both compressors produce graph SLPs that may not be apex.

Another recent concept in database theory that is concerned with compressed representations of relational data and query evaluation are factorized databases (see [31, 53, 54, 55, 56]). Intuitively speaking, in a factorized representation of a relational structure each relation $R$ is represented as an expression over the relational operators union and product that evaluates to $R$ . However, SLPs for relational structures and factorized representations cover completely different aspects of redundancy: A factorized representation is always at least as large as its active domain (i.e., all elements that occur in some tuple), while an SLP for a relational structure can be of logarithmic size in the size of the universe of the structure. On the other hand, small factorized representations do not seem to necessarily translate into small SLPs.

2 General Notations

Let $\mathbb{N}=\{0,1,2,\ldots\}$ . For every $k\in\mathbb{N}$ , we set $[k]=\{1,2,\ldots,k\}$ . For a finite alphabet $A$ , we denote by $A^{*}$ the set of all finite strings over $A$ including the empty string $\varepsilon$ . For a partial function $f:A\to B$ let $\mathop{\mathrm{dom}}(f)=\{a\in A:f(a)\neq\bot\}\subseteq A$ (where $\bot\notin B$ stands for undefined) and $\mathop{\mathrm{ran}}(f)=\{f(a):a\in\mathop{\mathrm{dom}}(f)\}\subseteq B$ . For functions $f:A\to B$ and $g:B\to C$ we define the composition $g\circ f:A\to C$ by $(g\circ f)(a)=g(f(a))$ for all $a\in A$ .

A partial $k$ -tuple over a set $A$ is a partial function $t:[k]\to A$ . If $\mathop{\mathrm{dom}}(t)=[k]$ , then we also say that $t$ is a complete $k$ -tuple or just a $k$ -tuple; in this case we also write $t$ in the conventional form $(t(1),t(2),\ldots,t(k))$ . Two partial $k$ -tuples $t_{1}$ and $t_{2}$ are disjoint if $\mathop{\mathrm{dom}}(t_{1})\cap\mathop{\mathrm{dom}}(t_{2})=\emptyset$ . In this case, their union $t_{1}\sqcup t_{2}$ is the partial $k$ -tuple defined by $(t_{1}\sqcup t_{2})(j)=t_{i}(j)$ if $j\in\mathop{\mathrm{dom}}(t_{i})$ for $i\in\{1,2\}$ and $(t_{1}\sqcup t_{2})(j)=\bot$ if $j\notin\mathop{\mathrm{dom}}(t_{1})\cup\mathop{\mathrm{dom}}(t_{2})$ .
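To make the notation for partial tuples concrete, here is a minimal Python sketch. The dictionary representation and the names `disjoint`, `union` and `as_complete` are our own illustrative choices: a partial $k$-tuple is modelled as a dictionary from positions in $[k]$ to elements, and positions missing from the dictionary are undefined.

```python
from typing import Dict, Hashable

# A partial k-tuple t: [k] -> A, stored as {position: element};
# positions that are absent from the dict play the role of "undefined".
PartialTuple = Dict[int, Hashable]


def disjoint(t1: PartialTuple, t2: PartialTuple) -> bool:
    """Two partial tuples are disjoint if their domains do not intersect."""
    return not (t1.keys() & t2.keys())


def union(t1: PartialTuple, t2: PartialTuple) -> PartialTuple:
    """The union of two disjoint partial tuples."""
    if not disjoint(t1, t2):
        raise ValueError("union is only defined for disjoint partial tuples")
    return {**t1, **t2}


def as_complete(t: PartialTuple, k: int) -> tuple:
    """Write a complete k-tuple in the conventional form (t(1), ..., t(k))."""
    if set(t) != set(range(1, k + 1)):
        raise ValueError("the partial tuple is not complete")
    return tuple(t[j] for j in range(1, k + 1))


t1 = {1: "a", 3: "c"}  # defined on positions 1 and 3
t2 = {2: "b"}          # defined on position 2
print(as_complete(union(t1, t2), 3))  # ('a', 'b', 'c')
```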

2.1 Directed acyclic graphs

An ordered dag (directed acyclic graph) is a triple $G=(V,\gamma,\iota)$ , where $V$ is a finite set of nodes, $\gamma:V\to V^{*}$ is the child-function, the relation $E:=\{(u,v):u,v\in V,v\text{ occurs in }\gamma(u)\}$ is acyclic, and $\iota\in V$ is the initial node. The size of $G$ is $|G|=\sum_{v\in V}(1+|\gamma(v)|)$ . A node $v\in V$ with $|\gamma(v)|=0$ is called a leaf.

A path in $G$ (from $v_{0}$ to $v_{n}$ ) is a sequence $p=v_{0}i_{1}v_{1}i_{2}\cdots v_{n-1}i_{n}v_{n}\in V(\mathbb{N}V)^{*}$ such that $1\leq i_{k}\leq|\gamma(v_{k-1})|$ and $v_{k}$ is the $i_{k}$ -th node in $\gamma(v_{k-1})$ for all $k\in[n]$ . The length of this path $p$ is $n$ (we may have $n=0$ in which case $p=v_{0}$ ) and we also call $p$ a $v_{0}$ -to- $v_{n}$ path or $v_{0}$ -path if the end point $v_{n}$ is not important. An $\iota$ -path is also called an initial path. We extend this notation to subsets of $V$ in the obvious way, e.g., for $A,B\subseteq V$ and $v\in V$ we talk about $A$ -to- $v$ paths, $A$ -to- $B$ paths, $A$ -to-leaf paths (where “leaf” refers to the set of all leaves of the dag), initial-to-leaf paths, etc. For a $v_{0}$ -to- $v_{1}$ path $p=p^{\prime}v_{1}$ and a $v_{1}$ -to- $v_{2}$ path $q=v_{1}q^{\prime}$ we define the $v_{0}$ -to- $v_{2}$ path $pq=p^{\prime}v_{1}q^{\prime}$ (note that if we just concatenate $p$ and $q$ as words, then we have to replace the double occurrence $v_{1}v_{1}$ by $v_{1}$ to obtain $pq$ ). We say that the path $p$ is a prefix of the path $q$ if there is a path $r$ such that $q=pr$ .

Since we consider ordered dags, there is a natural lexicographical ordering on all $v$ -paths (i.e., all paths that start in the same node $v$ ). More precisely, for two different $v$ -paths $p$ and $q$ we write $p<q$ if either $p$ is a proper prefix of $q$ or we can write $p$ and $q$ as $p=rip^{\prime}$ , $q=rjq^{\prime}$ for paths $r,p^{\prime},q^{\prime}$ and $i,j\in\mathbb{N}$ with $i<j$ .
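The following Python sketch illustrates paths and their lexicographic order on a small, hypothetical ordered dag. It assumes, as is standard, that in a path the node $v_{k}$ is the $i_{k}$-th child of $v_{k-1}$; the dag `gamma` and the names `dag_size`, `paths_from` and `less` are ours. A path $v_{0}i_{1}v_{1}\cdots i_{n}v_{n}$ is stored as the Python tuple `(v0, i1, v1, ..., in, vn)`.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical ordered dag: gamma maps each node to its ordered list of children.
gamma: Dict[str, List[str]] = {"s": ["a", "b", "a"], "a": ["b"], "b": []}


def dag_size(g: Dict[str, List[str]]) -> int:
    """|G| = sum over all nodes v of (1 + |gamma(v)|)."""
    return sum(1 + len(children) for children in g.values())


def paths_from(v: str) -> Iterator[Tuple]:
    """Yield all v-paths in lexicographic order (a depth-first preorder traversal)."""
    stack = [(v,)]
    while stack:
        p = stack.pop()
        yield p
        last = p[-1]
        # Push children in reverse order so that child number 1 is expanded first.
        for i in range(len(gamma[last]), 0, -1):
            stack.append(p + (i, gamma[last][i - 1]))


def less(p: Tuple, q: Tuple) -> bool:
    """p < q for two paths starting in the same node.

    Since both paths start in the same node, their first difference is
    always at a child index, so comparing that entry compares i with j.
    """
    for x, y in zip(p, q):
        if x != y:
            return x < y
    return len(p) < len(q)  # p is a proper prefix of q


for p in paths_from("s"):
    print(p)
```

Note that the number of initial paths can be exponential in the size of the dag, which is why an enumeration algorithm walks over such paths one at a time instead of materializing them.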

2.2 Relational structures and first order logic

A signature $\mathcal{R}$ is a finite set consisting of relation symbols $r_{i}$ ( $i\in I$ ) and constant symbols $c_{j}$ ( $j\in J$ ). Each relation symbol $r_{i}$ has an associated arity $\alpha_{i}$ . A structure over the signature $\mathcal{R}$ is a tuple ${\mathcal{U}}=(U,(R_{i})_{i\in I},(u_{j})_{j\in J})$ , where $U$ is a finite non-empty set (the universe of ${\mathcal{U}}$ ), $R_{i}\subseteq U^{\alpha_{i}}$ is the relation associated with the relation symbol $r_{i}$ , and $u_{j}\in U$ is the constant associated with the constant symbol $c_{j}$ . Note that we restrict to finite structures. If the structure ${\mathcal{U}}$ is clear from the context, we will identify $R_{i}$ ( $u_{j}$ , respectively) with the relation symbol $r_{i}$ (the constant symbol $c_{j}$ , respectively). Sometimes, when we want to refer to the universe $U$ , we will refer to ${\mathcal{U}}$ itself. For instance, we write $a\in\mathcal{U}$ for $a\in U$ , or $f:[n]\to\mathcal{U}$ for a function $f:[n]\to U$ . The size $|{\mathcal{U}}|$ of ${\mathcal{U}}$ is $|U|+\sum_{i\in I}\alpha_{i}\cdot|R_{i}|$ . As usual, a constant $a\in\mathcal{U}$ may be replaced by the unary relation $\{a\}$ . Thus, in the following, we will only consider signatures without constant symbols, except when we explicitly introduce constants. Let $\mathcal{R}=\{r_{i}:i\in I\}$ be such a signature (we call it a relational signature) and let ${\mathcal{U}}=(U,(R_{i})_{i\in I})$ be a structure over $\mathcal{R}$ (we call it a relational structure). For relational structures ${\mathcal{U}}_{1}$ and ${\mathcal{U}}_{2}$ over the signature $\mathcal{R}$ , we write ${\mathcal{U}}_{1}\simeq{\mathcal{U}}_{2}$ to denote that ${\mathcal{U}}_{1}$ and ${\mathcal{U}}_{2}$ are isomorphic. A substructure of ${\mathcal{U}}=(U,(R_{i})_{i\in I})$ is a structure $(V,(S_{i})_{i\in I})$ such that $V\subseteq U$ and $S_{i}\subseteq R_{i}$ for all $i\in I$ .
The substructure of ${\mathcal{U}}$ induced by $V\subseteq U$ is $(V,(R_{i}\cap V^{\alpha_{i}})_{i\in I})$ . We define the undirected graph $\mathcal{G}({\mathcal{U}})=(U,E)$ (the so-called Gaifman graph of ${\mathcal{U}}$ ), where $E$ contains an edge $(a,b)$ if and only if there is a relation $R_{i}$ ( $i\in I$ ) and a tuple $(a_{1},\ldots,a_{\alpha_{i}})\in R_{i}$ with $\{a,b\}\subseteq\{a_{1},\ldots,a_{\alpha_{i}}\}$ . The degree of ${\mathcal{U}}$ is the maximal degree of a node in $\mathcal{G}({\mathcal{U}})$ . If ${\mathcal{U}}$ has degree at most $d$ , we also say that ${\mathcal{U}}$ is a degree- $d$ bounded structure.
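As an illustration of these definitions, the following Python sketch computes the size, the Gaifman graph and the degree of a small, hypothetical structure. The function names and the example structure are ours; the sketch additionally assumes that only pairs of distinct elements contribute edges, i.e., self-loops are ignored.

```python
from itertools import combinations
from typing import Dict, Hashable, Set, Tuple

Relations = Dict[str, Set[Tuple[Hashable, ...]]]


def structure_size(universe: Set[Hashable], relations: Relations) -> int:
    """|U| plus, for every relation, its arity times its number of tuples."""
    return len(universe) + sum(len(t) for tuples in relations.values() for t in tuples)


def gaifman_graph(universe: Set[Hashable], relations: Relations) -> Dict[Hashable, Set[Hashable]]:
    """Adjacency sets of the Gaifman graph: a and b are adjacent if they
    occur together in some tuple of some relation (assumption: a != b)."""
    adj: Dict[Hashable, Set[Hashable]] = {a: set() for a in universe}
    for tuples in relations.values():
        for t in tuples:
            for a, b in combinations(set(t), 2):
                adj[a].add(b)
                adj[b].add(a)
    return adj


def degree(adj: Dict[Hashable, Set[Hashable]]) -> int:
    """The maximal degree of a node in the Gaifman graph."""
    return max(len(neighbors) for neighbors in adj.values())


# Hypothetical structure with a binary relation E and a unary relation P.
U = {1, 2, 3, 4}
rels: Relations = {"E": {(1, 2), (2, 3), (3, 1), (3, 4)}, "P": {(4,)}}
G = gaifman_graph(U, rels)
print(structure_size(U, rels))  # 13
print(degree(G))                # 3
```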

We use first-order logic (FO) over finite relational structures; see [15] for a detailed introduction and the full version [43] for some standard notations. For an FO-formula $\psi(x_{1},\ldots,x_{k})$ over the signature $\mathcal{R}$ with free variables $x_{1},\ldots,x_{k}$ and a relational structure ${\mathcal{U}}=(U,(R_{i})_{i\in I})$ over $\mathcal{R}$ and $a_{1},\ldots,a_{k}\in U$ , we write ${\mathcal{U}}\models\psi(a_{1},\ldots,a_{k})$ if $\psi$ is true in ${\mathcal{U}}$ when the variable $x_{i}$ is set to $a_{i}$ for all $i\in[k]$ . Hence, an FO-formula $\psi(x_{1},\ldots,x_{k})$ can be interpreted as an FO-query that, for a given structure ${\mathcal{U}}$ , defines a result set

$$\psi({\mathcal{U}})=\{(a_{1},\ldots,a_{k})\in{\mathcal{U}}^{k}:{\mathcal{U}}\models\psi(a_{1},\ldots,a_{k})\}.$$

The quantifier rank $\mathsf{qr}(\psi)$ of an FO-formula $\psi$ is inductively defined as follows: $\mathsf{qr}(\psi)=0$ if $\psi$ contains no quantifiers, $\mathsf{qr}(\neg\psi)=\mathsf{qr}(\psi)$ , $\mathsf{qr}(\psi_{1}\wedge\psi_{2})=\mathsf{qr}(\psi_{1}\vee\psi_{2})=\max\{\mathsf{qr}(\psi_{1}),\mathsf{qr}(\psi_{2})\}$ and $\mathsf{qr}(\forall x\psi)=\mathsf{qr}(\exists x\psi)=1+\mathsf{qr}(\psi)$ .
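The inductive definition of the quantifier rank translates directly into a recursive function. The sketch below uses an ad hoc encoding of formulas as nested Python tuples; this encoding and the operator names are our own assumptions, chosen only for illustration.

```python
def qr(phi) -> int:
    """Quantifier rank of a formula encoded as a nested tuple.

    Assumed encoding (for illustration only):
      ("atom", ...), ("not", f), ("and", f, g), ("or", f, g),
      ("exists", x, f), ("forall", x, f)
    """
    op = phi[0]
    if op == "atom":
        return 0
    if op == "not":
        return qr(phi[1])
    if op in ("and", "or"):
        return max(qr(phi[1]), qr(phi[2]))
    if op in ("exists", "forall"):
        return 1 + qr(phi[2])
    raise ValueError(f"unknown operator: {op!r}")


# phi(x) = exists y ( E(x,y) and forall z not E(y,z) )
phi = ("exists", "y",
       ("and",
        ("atom", "E", "x", "y"),
        ("forall", "z", ("not", ("atom", "E", "y", "z")))))
print(qr(phi))  # 2
```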

In the rest of the paper, we assume that the signature only contains relation symbols of arity at most two. It is folklore that FO model checking and FO-query enumeration over arbitrary signatures can be reduced to this case; see the full version [43] for a possible construction. This construction can be carried out in linear time (in the size of the formula and the structure) and it increases the degree of the structure as well as the quantifier rank of the formula by at most one.

2.3 Distances, spheres and neighborhoods

Let us fix a relational signature $\mathcal{R}$ (containing only relation symbols of arity at most two) and let ${\mathcal{U}}=(U,(R_{i})_{i\in I})$ be a structure over this signature. We say that ${\mathcal{U}}$ is connected, if its Gaifman graph $\mathcal{G}({\mathcal{U}})$ is connected. The distance between elements $a,b\in U$ in the graph $\mathcal{G}({\mathcal{U}})$ is denoted by $\operatorname{\mathsf{dist}}_{{\mathcal{U}}}(a,b)$ (it can be $\infty$ ). For subsets $A,B\subseteq U$ we define $\operatorname{\mathsf{dist}}_{{\mathcal{U}}}(A,B)=\min\{\operatorname{\mathsf{dist}}_{{\mathcal{U}}}(a,b):a\in A,b\in B\}$ . For two partial tuples (of any arity) $t,t^{\prime}$ over $U$ let $\operatorname{\mathsf{dist}}_{{\mathcal{U}}}(t,t^{\prime})=\operatorname{\mathsf{dist}}_{{\mathcal{U}}}(\mathop{\mathrm{ran}}(t),\mathop{\mathrm{ran}}(t^{\prime}))$ .

Fix a constant $d\geq 2$ . We will only consider degree- $d$ bounded structures in the following. Let us fix such a structure ${\mathcal{U}}$ (over the relational signature $\mathcal{R}$ ). Take additional constant symbols $c_{1},c_{2},\ldots$ called sphere center constants. For an $r\geq 1$ and a partial $k$ -tuple $t:[k]\to{\mathcal{U}}$ we define the $r$ -sphere $\mathcal{S}_{{\mathcal{U}},r}(t)=\{b\in{\mathcal{U}}:\operatorname{\mathsf{dist}}_{{\mathcal{U}}}(t,b)\leq r\}$ . The $r$ -neighborhood $\mathcal{N}_{{\mathcal{U}},r}(t)$ of $t$ is obtained by taking the substructure of ${\mathcal{U}}$ induced by $\mathcal{S}_{{\mathcal{U}},r}(t)$ and then adding every node $t(i)$ ( $i\in\mathop{\mathrm{dom}}(t)$ ) as the interpretation of the sphere center constant $c_{i}$ . Hence, it is a structure over the signature $\mathcal{R}\cup\{c_{i}:i\in\mathop{\mathrm{dom}}(t)\}$ . The $r$ -neighborhood of a $k$ -tuple has at most $k\cdot\sum^{r}_{i=0}d^{i}\leq k\cdot d^{r+1}$ elements (here, the inequality holds since we assume $d\geq 2$ ).

We use the above definitions also for a single element $a\in{\mathcal{U}}$ in place of a tuple $t$ ; formally $a$ is identified with the $1$ -tuple $t$ such that $t(1)=a$ . We are mainly interested in $r$ -spheres and $r$ -neighborhoods of complete $k$ -tuples, but the corresponding notions for partial $k$ -tuples will be convenient later. We also drop the subscript ${\mathcal{U}}$ from the above notations if this does not cause any confusion.
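Computing an $r$-sphere amounts to a breadth-first search of depth $r$ in the Gaifman graph, started simultaneously from all sphere centers. The following Python sketch shows this on a hypothetical path graph with $d=2$; the names `sphere` and `size_bound` and the example graph are ours.

```python
from collections import deque
from typing import Dict, Hashable, Iterable, Set


def sphere(adj: Dict[Hashable, Set[Hashable]], centers: Iterable[Hashable], r: int) -> Set[Hashable]:
    """All elements at distance at most r from some center (multi-source BFS)."""
    dist = {c: 0 for c in centers}
    queue = deque(dist)
    while queue:
        a = queue.popleft()
        if dist[a] == r:  # do not explore beyond radius r
            continue
        for b in adj[a]:
            if b not in dist:
                dist[b] = dist[a] + 1
                queue.append(b)
    return set(dist)


def size_bound(k: int, d: int, r: int) -> int:
    """The upper bound k * d^(r+1) on the size of an r-neighborhood of a k-tuple."""
    return k * d ** (r + 1)


# Hypothetical Gaifman graph: the path 1 - 2 - 3 - 4 - 5 - 6 (degree d = 2).
adj = {i: {j for j in (i - 1, i + 1) if 1 <= j <= 6} for i in range(1, 7)}

print(sorted(sphere(adj, [3], 1)))     # [2, 3, 4]
print(sorted(sphere(adj, [1, 6], 1)))  # [1, 2, 5, 6]
```

In a degree-$d$ bounded structure the search visits at most `size_bound(k, d, r)` elements, so its cost depends only on $k$, $d$ and $r$ and not on the size of the structure.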

A $(k,r)$ -neighborhood type is an isomorphism type for the $r$ -neighborhood of a complete $k$ -tuple in a degree- $d$ bounded structure. More precisely, we can define a $(k,r)$ -neighborhood type as a degree- $d$ bounded structure $\mathcal{B}$ over the signature $\mathcal{R}\cup\{c_{1},\ldots,c_{k}\}$ such that

$\blacksquare$ the universe of $\mathcal{B}$ is of the form $[\ell]$ for some $\ell\leq k\cdot d^{r+1}$ and

$\blacksquare$ for every $j\in[\ell]$ there is $i\in[k]$ such that $\operatorname{\mathsf{dist}}_{\mathcal{B}}(a_{i},j)\leq r$ , where, for every $i\in[k]$ , $a_{i}$ is the interpretation of the sphere center constant $c_{i}$ .

From each isomorphism class of $(k,r)$ -neighborhood types we select a unique representative and write $\mathcal{T}_{k,r}$ for the set of all selected representatives. Then, for every $k$ -tuple $\bar{a}\in{\mathcal{U}}^{k}$ there is a unique $\mathcal{B}\in\mathcal{T}_{k,r}$ such that $\mathcal{N}_{{\mathcal{U}},r}(\bar{a})\simeq\mathcal{B}$ ; we call it the $(k,r)$ -neighborhood type of $\bar{a}$ and say that $\bar{a}$ is a $\mathcal{B}$ -tuple. In case $k=1$ we speak of $\mathcal{B}$ -nodes instead of $\mathcal{B}$ -tuples, write $\mathcal{T}_{r}$ for $\mathcal{T}_{1,r}$ and call its elements $r$ -neighborhood types instead of $(1,r)$ -neighborhood types.

For every $(k,r)$ -neighborhood type $\mathcal{B}\in\mathcal{T}_{k,r}$ there is an FO-formula $\psi^{\mathcal{B}}(x_{1},\ldots,x_{k})$ such that for every degree- $d$ bounded structure ${\mathcal{U}}$ and every $k$ -tuple $\bar{a}\in{\mathcal{U}}^{k}$ we have ${\mathcal{U}}\models\psi^{\mathcal{B}}(\bar{a})$ if and only if $\bar{a}$ is a $\mathcal{B}$ -tuple.

2.4 Enumeration algorithms and the machine model

FO-query enumeration is the following problem: Given an FO-formula $\phi(x_{1},\ldots,x_{k})$ over some signature $\mathcal{R}$ and a relational structure ${\mathcal{U}}$ over $\mathcal{R}$ , we want to enumerate all tuples from $\phi({\mathcal{U}})$ in some order and without repetitions, i.e., we want to produce a sequence $(t_{1},\ldots,t_{n},t_{n+1})$ with $\{t_{1},\ldots,t_{n}\}=\phi({\mathcal{U}})$ , $|\phi({\mathcal{U}})|=n$ and $t_{n+1}=\mathsf{EOE}$ is the end-of-enumeration marker. An algorithm for FO-query enumeration starts with a preprocessing phase in which no output is produced, followed by an enumeration phase, where the elements $t_{1},t_{2},\ldots,t_{n},t_{n+1}$ are produced one after the other. The running time of the preprocessing phase is called the preprocessing time, and the delay measures the maximal time between the computation of two consecutive outputs $t_{i}$ and $t_{i+1}$ for every $i\in[n]$ .
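The two-phase interface can be illustrated with a deliberately simple toy query. The sketch below is not the algorithm of this paper: it handles only the fixed formula $\phi(x)=\exists y\,E(x,y)$ on an explicitly given edge relation, and the names `EOE`, `preprocess` and `enumerate_answers` are ours.

```python
from typing import Hashable, Iterable, Iterator, List, Tuple

EOE = object()  # end-of-enumeration marker


def preprocess(edges: Iterable[Tuple[Hashable, Hashable]]) -> List[tuple]:
    """Preprocessing phase: one linear scan, no output is produced.

    Collects each x that has an outgoing E-edge exactly once, so that
    the enumeration phase never has to skip duplicates.
    """
    seen = set()
    answers = []
    for a, _b in edges:
        if a not in seen:
            seen.add(a)
            answers.append((a,))
    return answers


def enumerate_answers(answers: List[tuple]) -> Iterator[object]:
    """Enumeration phase: constant work between two consecutive outputs."""
    for t in answers:
        yield t
    yield EOE


E = [(1, 2), (1, 3), (2, 3)]
for t in enumerate_answers(preprocess(E)):
    print("EOE" if t is EOE else t)
```

The design point is that all deduplication happens during preprocessing; the delay of the enumeration phase is then independent of the size of the input.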

Usually, one restricts the input structure ${\mathcal{U}}$ to some subclass $\mathsf{C}_{d}$ of relational structures that is defined by some parameter $d$ (in this paper, $\mathsf{C}_{d}$ is the class of degree- $d$ bounded structures). We say that an algorithm for FO-query enumeration for $\mathsf{C}_{d}$ has linear preprocessing and constant delay, if its preprocessing time is $\operatorname{\mathcal{O}}(|{\mathcal{U}}|\cdot f(d,|\phi|))$ and its delay is $\operatorname{\mathcal{O}}(f(d,|\phi|))$ for some computable function $f$ . This complexity measure where the query $\phi$ is considered to be constant and the running time is only measured in terms of the data, i.e., the size of the relational structure, is also called data complexity. In data complexity, linear preprocessing and constant delay is considered to be optimal (since we assume that the relational structure has to be read at least once). As mentioned in the introduction, FO-query enumeration can be solved with linear preprocessing and constant delay for several classes $\mathsf{C}_{d}$ .

For proving upper bounds in data complexity, we often have to argue that certain computational tasks can be performed in time $f(\cdot)$ (or $|{\mathcal{U}}|\cdot f(\cdot)$ ) for some function $f$ . In these cases, without explicitly mentioning this in the remainder, $f$ will always be a computable function (actually, $f$ will be elementary, i.e., bounded by an exponent tower of fixed height). The arguments for $f$ will only depend on the parameter $d$ and the formula size $|\phi|$ .

The special feature of this work is that we consider FO-query enumeration in the setting where the relational structure ${\mathcal{U}}$ is not given explicitly, but in a potentially highly compressed form, and our enumeration algorithm must handle this compressed representation rather than decompressing it. Then the structure size $|{\mathcal{U}}|$ will be replaced by the size of the compressed representation of ${\mathcal{U}}$ in all time bounds. This aspect shall be explained in detail in Section 4.

We use the standard RAM model with uniform cost measure as our model of computation. We will make some further restrictions for the register length tailored to the compressed setting in Section 4.2.

3 FO-Enumeration over Uncompressed Degree-Bounded Structures

In this section, we fix a relational signature $\mathcal{R}=\{R_{i}:i\in I\}$ , constants $d\geq 2$ and $\nu$ , a degree- $d$ bounded structure ${\mathcal{U}}=(U,(R_{i})_{i\in I})$ over the signature $\mathcal{R}$ , and an FO-formula $\phi(x_{1},\ldots,x_{k})$ over the signature $\mathcal{R}$ with $\mathsf{qr}(\phi)=\nu$ . Our goal is to enumerate the set $\phi({\mathcal{U}})$ with constant delay after linear-time preprocessing. Before we consider the case where the structure ${\mathcal{U}}$ is given in a compressed form, we will first outline the enumeration algorithm from [28] for the case where ${\mathcal{U}}$ is given explicitly (with some modifications). In Section 5 we will extend this algorithm to the compressed setting.

By a standard application of the Gaifman locality of FO (see [43]), we first reduce the enumeration of $\phi({\mathcal{U}})$ to the enumeration of all $\mathcal{B}$ -tuples from ${\mathcal{U}}^{k}$ for a fixed $\mathcal{B}\in\mathcal{T}_{k,r}$ (for some $r\leq 7^{\nu}$ ). Recall that $\mathcal{B}$ contains at most $k\cdot d^{r+1}$ elements, and this upper bound only depends on $d$ and the formula $\phi$ . To simplify notation, we assume that in $\mathcal{B}$ the sphere center constant $c_{i}$ is interpreted by $i\in[k]$ . In particular, the sphere center constants are interpreted by different elements. This is not a real restriction; see [43].

In order to enumerate all $\mathcal{B}$ -tuples, we will factorize $\mathcal{B}$ into its connected components. In order to accomplish this, we need the following definitions. We first define the larger radius

$$\rho=2rk-r+k-1. \tag{1}$$

Every $\rho$ -neighborhood of an element $a\in{\mathcal{U}}$ has at most $d^{\rho+1}$ elements. Recall that a $\rho$ -neighborhood type is a degree- $d$ bounded structure over the signature $\mathcal{R}_{1}:=\mathcal{R}\cup\{c_{1}\}$ with a universe $[\ell]$ for some $\ell\leq d^{\rho+1}$ . W.l.o.g. we assume that the sphere center constant $c_{1}$ is interpreted by the element $1$ in a $\rho$ -neighborhood type. Hence, every $j\in[\ell]$ has distance at most $\rho$ from $1$ . Moreover, the $\rho$ -neighborhood types in $\mathcal{T}_{\rho}$ are pairwise non-isomorphic.

Assume that our fixed $(k,r)$ -neighborhood type $\mathcal{B}$ splits into $m\geq 1$ connected components $\mathcal{C}^{\mathcal{B}}_{1},\ldots,\mathcal{C}^{\mathcal{B}}_{m}$ . Thus, every $\mathcal{C}^{\mathcal{B}}_{i}$ is a connected induced substructure of $\mathcal{B}$ , every node of $\mathcal{B}$ belongs to exactly one $\mathcal{C}^{\mathcal{B}}_{i}$ , and there is no edge in the undirected graph $\mathcal{G}(\mathcal{B})$ between two different components $\mathcal{C}^{\mathcal{B}}_{i}$ . Let $D_{i}=\mathcal{C}^{\mathcal{B}}_{i}\cap[k]$ be the set of sphere centers that belong to the connected component $\mathcal{C}^{\mathcal{B}}_{i}$ . We must have $D_{i}\neq\emptyset$ . Let $n_{i}=\min(D_{i})$ (we could also fix any other element from $D_{i}$ ). Every node in $\mathcal{C}^{\mathcal{B}}_{i}$ has distance at most $r$ from some $j\in D_{i}$ . Since $\mathcal{C}^{\mathcal{B}}_{i}$ is connected, it follows that every node in $\mathcal{C}^{\mathcal{B}}_{i}$ has distance at most $r+(k-1)(2r+1)=2rk-r+k-1=\rho$ from $n_{i}$ (this is in fact true for every $j\in D_{i}$ instead of $n_{i}$ ). A consistent factorization of our $(k,r)$ -neighborhood type $\mathcal{B}$ is a tuple

\Lambda=(\mathcal{B}_{1},\sigma_{1},\mathcal{B}_{2},\sigma_{2},\ldots,\mathcal{B}_{m},\sigma_{m})

with the following properties for all $i\in[m]$ :

$\blacksquare$

$\mathcal{B}_{i}\in\mathcal{T}_{\rho}$ and $\sigma_{i}:[k]\to\mathcal{B}_{i}$ is a partial $k$ -tuple with $\mathop{\mathrm{dom}}(\sigma_{i})=D_{i}$ and $\sigma_{i}(n_{i})=1$ (so, $n_{i}$ is mapped by $\sigma_{i}$ to the center of $\mathcal{B}_{i}$ ) and
$\blacksquare$

$\mathcal{N}_{\mathcal{B}_{i},r}(\sigma_{i})\simeq\mathcal{C}^{\mathcal{B}}_{i}$ .

Clearly, the number of possible consistent factorizations of $\mathcal{B}$ is bounded by $f(d,|\phi|)$ .

For a $\rho$ -neighborhood type $\mathcal{B}^{\prime}$ , a $\mathcal{B}^{\prime}$ -node $a\in{\mathcal{U}}$ and a partial $k$ -tuple $\sigma:[k]\to\mathcal{B}^{\prime}$ we moreover fix an isomorphism $\pi_{a}:\mathcal{B}^{\prime}\to\mathcal{N}_{{\mathcal{U}},\rho}(a)$ (this isomorphism is not necessarily unique) and define the partial $k$ -tuple $t_{a,\sigma}:[k]\to{\mathcal{U}}$ by $t_{a,\sigma}(j)=\pi_{a}(\sigma(j))$ for all $j\in\mathop{\mathrm{dom}}(\sigma)$ . Note that, by definition, $\pi_{a}(1)=a$ .

Take a consistent factorization $\Lambda=(\mathcal{B}_{1},\sigma_{1},\ldots,\mathcal{B}_{m},\sigma_{m})$ of $\mathcal{B}$ . We say that an $m$ -tuple $(b_{1},\ldots,b_{m})\in{\mathcal{U}}^{m}$ is admissible for $\Lambda$ if the following conditions hold:

$\blacksquare$

for all $i\in[m]$ , $b_{i}$ is a $\mathcal{B}_{i}$ -node, and
$\blacksquare$

for all $i,j\in[m]$ with $i\neq j$ we have

$\operatorname{\mathsf{dist}}_{{\mathcal{U}}}(t_{b_{i},\sigma_{i}},t_{b_{j},\sigma_{j}})>2r+1.$ (2)

Finally, with an $m$ -tuple $\bar{b}=(b_{1},\ldots,b_{m})$ we associate the $k$ -tuple

\Lambda(\bar{b})=t_{b_{1},\sigma_{1}}\sqcup t_{b_{2},\sigma_{2}}\sqcup\cdots\sqcup t_{b_{m},\sigma_{m}}.

Note that $t_{b_{i},\sigma_{i}}(n_{i})=\pi_{b_{i}}(\sigma_{i}(n_{i}))=\pi_{b_{i}}(1)=b_{i}$ .
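The assembly of $\Lambda(\bar{b})$ from the partial $k$ -tuples can be illustrated by a minimal sketch. It assumes that partial tuples and the isomorphisms $\pi_{b_{i}}$ are stored as dictionaries; the function names `partial_tuple` and `assemble` and the toy data are hypothetical and not taken from the paper.

```python
from typing import Dict, Hashable, List, Tuple

Node = Hashable


def partial_tuple(pi: Dict[int, Node], sigma: Dict[int, int]) -> Dict[int, Node]:
    """t_{a,sigma}(j) = pi_a(sigma(j)) for every j in dom(sigma)."""
    return {j: pi[sigma[j]] for j in sigma}


def assemble(pis: List[Dict[int, Node]],
             sigmas: List[Dict[int, int]],
             k: int) -> Tuple[Node, ...]:
    """Disjoint union of the partial k-tuples t_{b_i,sigma_i}."""
    result: Dict[int, Node] = {}
    for pi, sigma in zip(pis, sigmas):
        t = partial_tuple(pi, sigma)
        if result.keys() & t.keys():
            raise ValueError("domains D_i must be pairwise disjoint")
        result.update(t)
    if set(result) != set(range(1, k + 1)):
        raise ValueError("domains D_i must cover [k]")
    return tuple(result[j] for j in range(1, k + 1))


# Toy instance (assumed): k = 3, m = 2, D_1 = {1, 3}, D_2 = {2}.
sigmas = [{1: 1, 3: 2}, {2: 1}]           # sigma_i(n_i) = 1
pis = [{1: "u", 2: "x"}, {1: "w"}]        # pi_{b_i}(1) = b_i
output = assemble(pis, sigmas, 3)
```

In this toy instance the centers are $b_{1}=$ `"u"` and $b_{2}=$ `"w"`, and they appear at positions $n_{1}=1$ and $n_{2}=2$ of the assembled tuple.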

We claim that in order to enumerate all $\mathcal{B}$ -tuples $\bar{a}\in{\mathcal{U}}^{k}$ , it suffices to enumerate for every consistent factorization $\Lambda=(\mathcal{B}_{1},\sigma_{1},\ldots,\mathcal{B}_{m},\sigma_{m})$ of $\mathcal{B}$ the set of all $m$ -tuples $\bar{b}\in{\mathcal{U}}^{m}$ that are admissible for $\Lambda$ . If we can do this, then we replace every output tuple $\bar{b}\in{\mathcal{U}}^{m}$ by $\Lambda(\bar{b})\in{\mathcal{U}}^{k}$ . Note that $\Lambda(\bar{b})$ can be easily computed in time $\operatorname{\mathcal{O}}(k)$ from the tuple $\bar{b}$ , the isomorphisms $\pi_{b_{i}}$ , and the partial $k$ -tuples $\sigma_{i}:[k]\to\mathcal{B}_{i}$ . The correctness of this algorithm follows from the following two lemmas (with full proofs in [43]):

Lemma 2.

If $\Lambda$ is a consistent factorization of $\mathcal{B}$ and $\bar{b}\in{\mathcal{U}}^{m}$ is admissible for $\Lambda$ then $\Lambda(\bar{b})\in{\mathcal{U}}^{k}$ is a $\mathcal{B}$ -tuple.

Lemma 3.

If $\bar{a}\in{\mathcal{U}}^{k}$ is a $\mathcal{B}$ -tuple then there are a unique consistent factorization $\Lambda$ of $\mathcal{B}$ and a unique $m$ -tuple $\bar{b}\in{\mathcal{U}}^{m}$ that is admissible for $\Lambda$ and such that $\bar{a}=\Lambda(\bar{b})$ .

3.1 Enumeration algorithm for uncompressed structures

Let us fix a $(k,r)$ -neighborhood type $\mathcal{B}$ and a consistent factorization $\Lambda=(\mathcal{B}_{1},\sigma_{1},\ldots,\mathcal{B}_{m},\sigma_{m})$ of $\mathcal{B}$ . By Lemmas 2 and 3, it suffices to enumerate (with linear preprocessing and constant delay) the set of all $\bar{a}\in{\mathcal{U}}^{m}$ that are admissible for $\Lambda$ . In the preprocessing phase we compute

$\blacksquare$

for every $i\in[m]$ a list $L_{i}$ containing all $\mathcal{B}_{i}$ -nodes from ${\mathcal{U}}$ and
$\blacksquare$

for every $a\in L_{i}$ an isomorphism $\pi_{a}:\mathcal{B}_{i}\to\mathcal{N}_{\rho}(a)$ .

It is straightforward to compute these data in time $|{\mathcal{U}}|\cdot f(d,|\phi|)$ (in Section 5, where we deal with the more general SLP-compressed case, this is more subtle). We classify each list $L_{i}$ as being short if $|L_{i}|\leq k\cdot d^{2\rho+2r+2}$ and as being long otherwise. Without loss of generality, we assume that, for some $0\leq q\leq m$ the lists $L_{1},\ldots,L_{q}$ are short and the lists $L_{q+1},\ldots,L_{m}$ are long (note that this includes the cases that all lists are short or all lists are long).

Our enumeration procedure maintains a stack of the form $a_{1}a_{2}\cdots a_{\ell}$ with $0\leq\ell\leq m$ and $a_{i}\in L_{i}$ for all $i\in[\ell]$ . Note that if $\ell=0$ , then we have the empty stack $\varepsilon$ . Such a stack is called admissible for $\Lambda$ (or just admissible), if for all $i,i^{\prime}\in[\ell]$ with $i\neq i^{\prime}$ and all $j\in\mathop{\mathrm{dom}}(\sigma_{i})$ and $j^{\prime}\in\mathop{\mathrm{dom}}(\sigma_{i^{\prime}})$ we have $\operatorname{\mathsf{dist}}_{{\mathcal{U}}}(\pi_{a_{i}}(\sigma_{i}(j)),\pi_{a_{i^{\prime}}}(\sigma_{i^{\prime}}(j^{\prime})))>2r+1$ . Note that the empty stack as well as every stack $a_{1}$ with $a_{1}\in L_{1}$ are admissible.

The general structure of our enumeration algorithm is a depth-first-left-to-right (DFLR) traversal over all admissible stacks $s$ . For this, it calls the recursive procedure extend (shown as Algorithm 1) with the initial admissible stack $s=\varepsilon$ . Note that whenever extend $(s)$ is called, $|s|<m$ holds. It is clear that the call extend $(\varepsilon)$ triggers the enumeration of all admissible stacks $a_{1}a_{2}\cdots a_{m}$ . In an implementation one would store $s$ as a global variable.

Algorithm 1: $\mathsf{extend}(s)$ .

Let us assume that we can check whether a stack $s$ is admissible in time $f(d,|\phi|)$ (it is not hard to see that this is possible, and this aspect will anyway be discussed in detail for the compressed setting in Section 5). After the initial call extend $(\varepsilon)$ , the algorithm constructs an admissible stack $s$ with $|s|=q$ (or terminates) after time bounded in $d, k, r$ and $\rho$ (since the lists $L_{1},\ldots,L_{q}$ are short). If some $a\in L_{q+1}$ is non-admissible, i.e., the stack $s a$ is not admissible, then $\operatorname{\mathsf{dist}}_{{\mathcal{U}}}(t_{a_{i},\sigma_{i}},t_{a,\sigma_{q+1}})\leq 2r+1$ and therefore $\operatorname{\mathsf{dist}}(a_{i},a)\leq 2\rho+2r+1$ for some $i\in[q]$ . Thus, the total number of non-admissible elements from $L_{q+1}$ can be bounded by a function of $d, k, r$ and $\rho$ . Consequently, since $L_{q+1}$ is long, the algorithm necessarily finds some admissible $a\in L_{q+1}$ (or terminates) after time bounded in $d, k, r$ and $\rho$ . From this observation, the following lemma can be obtained with moderate effort; see [43].

Lemma 4.

The delay of the above enumeration procedure is bounded by $f(d,|\phi|)$ .
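The DFLR traversal over admissible stacks can be illustrated by the following minimal sketch for an explicitly given graph. It is not a transcription of Algorithm 1: the names `ball` and `enumerate_admissible`, the adjacency-dictionary encoding, and the representation of each $t_{a,\sigma_{i}}$ as a set of nodes are assumptions made for illustration. The admissibility test is done by a fresh bounded BFS, and the sketch ignores the distinction between short and long lists, so it makes no claim about the delay.

```python
from collections import deque


def ball(adj, sources, radius):
    """All nodes at distance <= radius from some source (multi-source BFS)."""
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        if dist[u] == radius:
            continue
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist)


def enumerate_admissible(adj, lists, images, r):
    """Yield all admissible stacks a_1 ... a_m in DFLR order.

    lists[i]     -- the candidates L_{i+1}
    images[i][a] -- the set of nodes occurring in t_{a, sigma_{i+1}}
    """
    m = len(lists)
    stack = []

    def extend():
        level = len(stack)                      # invariant: level < m
        for a in lists[level]:
            near = ball(adj, images[level][a], 2 * r + 1)
            # Reject a if it is too close to an element already on the stack.
            if any(near & images[j][stack[j]] for j in range(level)):
                continue
            stack.append(a)
            if len(stack) == m:
                yield tuple(stack)
            else:
                yield from extend()
            stack.pop()

    if m > 0:
        yield from extend()


# Toy instance (assumed): a path 0 - 1 - ... - 9, r = 0, so distances must exceed 1.
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 9] for i in range(10)}
lists = [[0, 5], [1, 7]]
images = [{a: {a} for a in lst} for lst in lists]
stacks = list(enumerate_admissible(adj, lists, images, r=0))
```

Here the stack $0\,1$ is rejected because the two nodes are adjacent, while the remaining three combinations are output.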

4 Straight-Line Programs for Relational Structures

In this section, we introduce the compression scheme that shall be used to compress relational structures. We first need the definition of pointed structures.

For $n\geq 0$ , an $n$ -pointed structure is a pair $({\mathcal{U}},\tau)$ , where ${\mathcal{U}}$ is a structure and $\tau:[n]\to{\mathcal{U}}$ is injective. The elements in $\mathop{\mathrm{ran}}(\tau)$ ( ${\mathcal{U}}\setminus\mathop{\mathrm{ran}}(\tau)$ , respectively) are called contact nodes (internal nodes, respectively). The node $\tau(i)$ is called the $i$ -th contact node.

A relational straight-line program (r-SLP or just SLP) is a tuple $D=(\mathcal{R},N,S,P)$ , where

$\blacksquare$

$\mathcal{R}$ is a relational signature,
$\blacksquare$

$N$ is a finite set of nonterminals, where every $A\in N$ has a rank $\mathsf{rank}(A)\in\mathbb{N}$ ,
$\blacksquare$

$S\in N$ is the initial nonterminal, where $\mathsf{rank}(S)=0$ , and
$\blacksquare$

$P$ is a set of productions that contains for every $A\in N$ a unique production $A\to({\mathcal{U}}_{A},\tau_{A},E_{A})$ with $({\mathcal{U}}_{A},\tau_{A})$ a $\mathsf{rank}(A)$ -pointed structure over the signature $\mathcal{R}$ and $E_{A}$ a multiset of references of the form $(B,\sigma)$ , where $B\in N$ and $\sigma:[\mathsf{rank}(B)]\to{\mathcal{U}}_{A}$ is injective.
$\blacksquare$

Define the binary relation $\to_{D}$ on $N$ as follows: $A\to_{D}B$ if and only if $E_{A}$ contains a reference of the form $(B,\sigma)$ . Then we require that $\to_{D}$ is acyclic. Its transitive closure $\succ_{D}$ is a partial order that we call the hierarchical order of $D$ .

Let $|D|=\sum_{A\in N}(|{\mathcal{U}}_{A}|+\sum_{(B,\sigma)\in E_{A}}(1+\mathsf{rank}(B)))$ be the size of $D$ . We define the ordered dag $\mathsf{dag}(D)=(N,\gamma,S)$ , where the child-function $\gamma$ is defined as follows: Let $B\in N$ and let $(B_{1},\sigma_{1}),\ldots,(B_{n},\sigma_{n})$ be an enumeration of the references in $E_{B}$ , where every reference appears in the enumeration as many times as in the multiset $E_{B}$ . The specific order of the references is not important and is assumed to be given by the input encoding of $D$ . We then define $\gamma(B)=B_{1}\cdots B_{n}$ .

We now define for every nonterminal $A\in N$ a $\mathsf{rank}(A)$ -pointed relational structure $\mathsf{val}(A)$ (the value of $A$ ). We do this on an intuitive level; a formal definition can be found in the full version [43]. If $E_{A}=\emptyset$ , then we define $\mathsf{val}(A)=({\mathcal{U}}_{A},\tau_{A})$ . If, on the other hand, $E_{A}\neq\emptyset$ , then $\mathsf{val}(A)$ is obtained from $({\mathcal{U}}_{A},\tau_{A})$ by expanding all references in $E_{A}$ . A reference $(B,\sigma)\in E_{A}$ is expanded by the following steps: (i) create the disjoint union of ${\mathcal{U}}_{A}$ and ${\mathcal{U}}_{B}$ , (ii) merge node $\tau_{B}(i)\in{\mathcal{U}}_{B}$ with node $\sigma(i)\in{\mathcal{U}}_{A}$ for every $i\in[\mathsf{rank}(B)]$ , (iii) remove $(B,\sigma)$ from $E_{A}$ , and (iv) add all references from $E_{B}$ to $E_{A}$ . Due to the fact that $\to_{D}$ is acyclic, we can keep on expanding references (the original ones from $E_{A}$ and the new ones added by the expansion operation) in any order until there are no references left. The resulting relational structure is $\mathsf{val}(A)$ ; see Example 5 and Figure 1 for an illustration.

We define $\mathsf{val}(D)=\mathsf{val}(S)$ . Since $\mathsf{rank}(S)=0$ it can be viewed as an ordinary ( $0$ -pointed) structure. It is not hard to see that $|\mathsf{val}(D)|\leq 2^{\operatorname{\mathcal{O}}(|D|)}$ and that this upper bound can also be reached. Thus, $D$ can be seen as a compressed representation of the structure $\mathsf{val}(D)$ .
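The gap between $|D|$ and $|\mathsf{val}(D)|$ can be made concrete with a small sketch that counts the elements of $\mathsf{val}(A)$ without expanding any reference. It uses the observation that expanding a reference $(B,\sigma)$ adds all elements of $\mathsf{val}(B)$ except its $\mathsf{rank}(B)$ contact nodes, which are merged. The encoding of an SLP as a dictionary, the helper names, and the family `tree_slp` are assumptions for illustration; sizes here count universe elements only.

```python
from functools import lru_cache


def val_sizes(slp):
    """slp maps a nonterminal to (|U_A|, rank(A), [referenced nonterminals])."""

    @lru_cache(maxsize=None)
    def size(name):
        universe, _rank, refs = slp[name]
        # Each reference contributes val(B) minus the merged contact nodes.
        return universe + sum(size(b) - slp[b][1] for b in refs)

    return {name: size(name) for name in slp}


def slp_size(slp):
    """|D| = sum over A of |U_A| plus (1 + rank(B)) for every reference."""
    return sum(u + sum(1 + slp[b][1] for b in refs)
               for u, _rank, refs in slp.values())


def tree_slp(n):
    """Hypothetical family: A_0 is a root (contact node) with two children;
    A_i is a root (contact node) with two children carrying A_{i-1} each."""
    slp = {"A0": (3, 1, [])}
    for i in range(1, n + 1):
        slp[f"A{i}"] = (3, 1, [f"A{i - 1}", f"A{i - 1}"])
    slp["S"] = (1, 0, [f"A{n}"])
    return slp


example = tree_slp(10)
nodes_in_val = val_sizes(example)["S"]
size_of_slp = slp_size(example)
```

For this family the SLP size grows linearly in $n$ (here $7n+6$ ), whereas the number of elements of $\mathsf{val}(S)$ is $2^{n+2}-1$ .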

In Section 2.2 we claimed that FO-query enumeration can be reduced to the case where $\mathcal{R}$ only contains relation symbols of arity at most two (with the details given in the full version [43]). It is easy to see that this reduction can also be done in the SLP-compressed setting simply by applying the reduction to all structures ${\mathcal{U}}_{A}$ ; details can again be found in [43].

We say that the SLP $D=(\mathcal{R},N,S,P)$ is apex, if for every $A\in N$ and every reference $(B,\sigma)\in E_{A}$ we have $\mathop{\mathrm{ran}}(\sigma)\cap\mathop{\mathrm{ran}}(\tau_{A})=\emptyset$ . Thus, contact nodes of a right-hand side cannot be accessed by references. Apex SLPs are called $1$ -level restricted in [49]. It is easy to compute the maximal degree of nodes in $\mathcal{G}(\mathsf{val}(D))$ for an apex SLP $D$ : for every node $v$ in a structure ${\mathcal{U}}_{A}$ compute $d_{v}$ as the sum of (i) the degree of $v$ in $\mathcal{G}({\mathcal{U}}_{A})$ and (ii) for every reference $(B,\sigma)\in E_{A}$ and every $i\in[\mathsf{rank}(B)]$ with $v=\sigma(i)$ , the degree of $\tau_{B}(i)$ in $\mathcal{G}({\mathcal{U}}_{B})$ . Then the maximum of all these numbers $d_{v}$ is the maximal degree of nodes in $\mathcal{G}(\mathsf{val}(D))$ . The apex property implies a certain locality property for $\mathsf{val}(D)$ that will be explained in Section 4.1. In the rest of the paper we will mainly consider apex SLPs.
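The degree computation described above can be sketched as follows. The dictionary-based encoding of productions (keys `nodes`, `edges`, `contacts`, `refs`), the function name `max_degree`, and the toy SLP are assumptions for illustration; edges are taken to be the undirected, loop-free edges of $\mathcal{G}({\mathcal{U}}_{A})$ .

```python
def max_degree(slp):
    """Maximal value d_v over all nodes v of all right-hand sides."""

    def local_degree(name):
        deg = {v: 0 for v in slp[name]["nodes"]}
        for u, v in slp[name]["edges"]:
            deg[u] += 1
            deg[v] += 1
        return deg

    local = {name: local_degree(name) for name in slp}
    best = 0
    for name, prod in slp.items():
        d = dict(local[name])                      # part (i)
        for b, sigma in prod["refs"]:              # part (ii)
            tau_b = slp[b]["contacts"]
            for i, target in enumerate(sigma):
                d[target] += local[b][tau_b[i]]
        best = max(best, max(d.values(), default=0))
    return best


# Toy apex SLP (assumed): val(S) is a perfect binary tree of height 2.
fork = {"nodes": ["r", "l", "x"], "edges": [("r", "l"), ("r", "x")]}
example = {
    "A0": {**fork, "contacts": ["r"], "refs": []},
    "A1": {**fork, "contacts": ["r"], "refs": [("A0", ["l"]), ("A0", ["x"])]},
    "S": {"nodes": ["x"], "edges": [], "contacts": [], "refs": [("A1", ["x"])]},
}
degree = max_degree(example)
```

In the toy SLP the maximum is attained at the two children in the right-hand side of `A1`, each of which has one edge to its parent and receives two edges from the attached copy of `A0`.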

A simple example of a class of graphs that are exponentially compressible with apex SLPs is the class of perfect binary trees. The perfect binary tree of height $n$ (with $2^{n}$ leaves) can be produced by an apex SLP of size $\mathcal{O}(n)$ . Here is an explicit example for an apex SLP:

Figure 1: The SLP $D$ of Example 5 together with $\mathsf{dag}(D)$ and $\mathsf{val}(X)$ for $X\in\{S,A,B\}$ .

Example 5.

Consider the SLP $D=(\mathcal{R},N,S,P)$ where $\mathcal{R}$ only contains a binary relation symbol $r_{1}$ and $N=\{S,A,B\}$ with $\mathsf{rank}(S)=0$ , $\mathsf{rank}(A)=1$ and $\mathsf{rank}(B)=2$ . The productions of these nonterminals are depicted on the left of Figure 1. For instance, the production $S\to({\mathcal{U}}_{S},\tau_{S},E_{S})$ consists of the $0$ -pointed structure $({\mathcal{U}}_{S},\tau_{S})$ , where the universe of ${\mathcal{U}}_{S}$ consists of the two red nodes $u$ and $v$ , and the reference set $E_{S}=\{(A,\sigma_{1}),(A,\sigma_{2}),(B,\sigma_{3})\}$ with $\sigma_{1}(1)=u$ , $\sigma_{2}(1)=v$ , $\sigma_{3}(1)=u$ and $\sigma_{3}(2)=v$ (in Figure 1 each $\sigma_{i}(j)$ is connected by a $j$ -labeled dotted line with the nonterminal). The production for nonterminal $B$ consists of a $2$ -pointed structure (and no references), the contact nodes of which are labeled by $1$ and $2$ . The structure $\mathsf{val}(D)=\mathsf{val}(S)$ is shown on the right of Figure 1. It can be obtained by first constructing $\mathsf{val}(A)$ by replacing the single $B$ -reference in ${\mathcal{U}}_{A}$ by ${\mathcal{U}}_{B}=\mathsf{val}(B)$ . Note that $1$ - and $2$ -labeled dotted lines identify the two nodes to be merged with the two contact nodes of ${\mathcal{U}}_{B}$ , and that $\mathsf{val}(A)$ has exactly one contact node. Then we replace the $B$ -reference in ${\mathcal{U}}_{S}$ by $\mathsf{val}(B)$ and both $A$ -references in ${\mathcal{U}}_{S}$ by $\mathsf{val}(A)$ . This merges $u$ (and $v$ ) with the contact node of the first (and the second) occurrence of $\mathsf{val}(A)$ . Red (resp., blue, green) edges and nodes are produced from $S$ (resp., $A$ , $B$ ).

Since no contact node is adjacent to any reference, this SLP is apex. The size of $\mathsf{val}(D)$ is $31$ . The size of $D$ is $26$ : $9$ (for the $S$ -production) $+\ 10$ (for the $A$ -production) $+\ 7$ (for the $B$ -production).

4.1 Representation of nodes of an SLP-compressed structure

Let $A\in N$ . A node $a\in\mathsf{val}(A)$ can be uniquely represented by a pair $(p,v)$ such that $p$ is an $A$ -path in $\mathsf{dag}(D)$ and one of the following two cases holds:

$\blacksquare$

$p$ ends in $B\in N\setminus\{A\}$ and $v\in{\mathcal{U}}_{B}\setminus\mathop{\mathrm{ran}}(\tau_{B})$ is an internal node. (The nodes in $\mathop{\mathrm{ran}}(\tau_{B})$ , i.e., the contact nodes of ${\mathcal{U}}_{B}$ , are excluded here, because they were already generated by some larger (with respect to the hierarchical order $\succ_{D}$ ) nonterminal.)
$\blacksquare$

$p=A$ and $v\in{\mathcal{U}}_{A}$ .

We call this the $A$ -representation of $a$ . The $S$ -representations of the nodes of $\mathsf{val}(S)=\mathsf{val}(D)$ are also called $D$ -representations. Note that if $(p,v)$ is the $D$ -representation of a node then $v\in{\mathcal{U}}_{A}\setminus\mathop{\mathrm{ran}}(\tau_{A})$ for some $A\in N$ (since $\mathsf{rank}(S)=0$ ). We will often identify a node of $\mathsf{val}(A)$ with its $A$ -representation; in particular when $A=S$ . One may view a $D$ -representation $(p,v)$ as a stack $p v$ . In order to construct outgoing (or incoming) edges of $(p,v)$ in the structure $\mathsf{val}(D)$ , one only has to modify this stack at its end; see [43] for more details.

The apex condition implies a kind of locality in $\mathsf{val}(D)$ that can be nicely formulated in terms of $D$ -representations: If two nodes $a=(p,u)$ and $b=(q,v)$ have distance $\zeta$ in the graph $\mathcal{G}(\mathsf{val}(D))$ then the prefix distance between $p$ and $q$ (which is the number of edges in $p$ and $q$ that do not belong to the longest common prefix of $p$ and $q$ ) is also at most $\zeta$ . This property is exploited several times in the paper.

Based on $A$ -representations, we can define a natural embedding of $\mathsf{val}(B)$ into $\mathsf{val}(A)$ in case $A\succ_{D}B$ . Assume that $p$ is a non-empty $A$ -to- $B$ path in $\mathsf{dag}(D)$ with $A\neq B$ . Let us write $p=p^{\prime}CiB$ for some nonterminal $C$ (we may have $C=A$ ). Let $(B,\sigma)\in E_{C}$ be the unique reference that corresponds to the edge $(C,i,B)$ in $\mathsf{dag}(D)$ . We then define the embedding $\eta_{p}:\mathsf{val}(B)\to\mathsf{val}(A)$ as follows, where $(q,v)$ is a node in $\mathsf{val}(B)$ given by its $B$ -representation so that $q$ is a $B$ -path (recall that the path $p q$ is obtained by concatenating the paths $p$ and $q$ ; see Section 2.1):

\eta_{p}(q,v)=\begin{cases}(p^{\prime}C,\sigma(j))&\text{ if $q=B$ and $v=\tau_{B}(j)$ for some $j\in[\mathsf{rank}(B)]$,}\\ (pq,v)&\text{ otherwise.}\end{cases}

We can extend this definition to the case $A=B$ (where $p=A$ ) by defining $\eta_{p}$ as the identity map on $\mathsf{val}(A)=\mathsf{val}(B)$ . If ${\mathcal{U}}$ is the substructure of $\mathsf{val}(B)$ induced by the set $U\subseteq\mathsf{val}(B)$ then we write $\eta_{p}({\mathcal{U}})$ for the substructure of $\mathsf{val}(A)$ induced by the set $\eta_{p}(U)$ . Note that in general we do not have $\eta_{p}({\mathcal{U}})\simeq{\mathcal{U}}$ . For instance, if ${\mathcal{U}}=\mathsf{val}(B)$ then in $\mathsf{val}(A)$ there can be edges between contact nodes of $\mathsf{val}(B)$ that are generated by a nonterminal $C$ with $C\to_{D}B$ .

Recall the definition of the lexicographic order on the set of all $A$ -paths of $\mathsf{dag}(D)$ for $A\in N$ (see Section 2.1). We define $\mathsf{lex}_{A}(p)$ as the position of $p$ in the lexicographically sorted list of all $A$ -paths of $\mathsf{dag}(D)$ , where we start with $0$ (i.e., $\mathsf{lex}_{A}(A)=0$ ; note that $A$ is the empty path starting in $A$ and hence the lexicographically smallest path among all $A$ -paths). For $\mathsf{lex}_{S}(p)$ we just write $\mathsf{lex}(p)$ . Later it will be convenient to represent the initial path component $p$ of a $D$ -representation $(p,v)$ by the number $\mathsf{lex}(p)$ and call $(\mathsf{lex}(p),v)$ the lex-representation of the node $a=(p,v)\in\mathsf{val}(D)$ . The number of initial paths in $\mathsf{dag}(D)$ can be bounded by $2^{\operatorname{\mathcal{O}}(|D|)}$ : the number of initial-to-leaf paths in $\mathsf{dag}(D)$ is bounded by $3^{|\mathsf{dag}(D)|/3}\leq 3^{|D|/3}$ (this is implicitly shown in the proof of [11, Lemma 1]) and the number of all initial paths in $D$ is bounded by twice the number of initial-to-leaf paths in $D$ . Hence, the numbers $\mathsf{lex}(p)$ have bit length $\operatorname{\mathcal{O}}(|D|)$ .

Example 6.

Recall the SLP $D$ from Example 5 and $\mathsf{dag}(D)$ shown to the right of $D$ ’s productions in Figure 1. Then the pairs $(S,u)$ and $(S,v)$ (recall that $u$ and $v$ are the two nodes of ${\mathcal{U}}_{S}$ ) represent the two red nodes of $\mathsf{val}(D)=\mathsf{val}(S)$ , and $(S3B,w)$ , where $w$ is the green node in ${\mathcal{U}}_{B}$ , represents the rightmost green node of $\mathsf{val}(D)$ . Its lex-representation is $(5,w)$ (there are six initial paths in $\mathsf{dag}(D)$ ). As another example, the two leftmost (green) nodes of $\mathsf{val}(D)$ are represented by the pairs $(S1A1B,w)$ and $(S2A1B,w)$ with the lex-representations $(2,w)$ and $(4,w)$ , respectively. For the $S$ -to- $B$ path $p=S2A1B$ in $\mathsf{dag}(D)$ we have $\eta_{p}(B,w)=(S2A1B,w)$ and $\eta_{p}(B,\tau_{B}(1))=(S2A,\sigma(1))$ , where $(B,\sigma)$ is the only reference in $E_{A}$ .
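The lex-numbers in this example can be reproduced by listing all initial paths of the dag from Figure 1 in lexicographic order. The sketch below does this by a plain preorder traversal; the function name `initial_paths` and the encoding of $\gamma$ as a dictionary of child lists are assumptions for illustration, and the traversal materializes every path, so it only serves to check small examples.

```python
def initial_paths(dag, start):
    """Yield all paths from `start` in lexicographic order.

    A path is a tuple (start, i_1, X_1, i_2, X_2, ...), where i_j is the
    1-based index of the outgoing edge taken at step j.
    """

    def walk(path, node):
        yield path                                   # a prefix comes first
        for i, child in enumerate(dag[node], start=1):
            yield from walk(path + (i, child), child)

    yield from walk((start,), start)


# Child function of the dag from Figure 1: gamma(S) = A A B, gamma(A) = B.
dag = {"S": ["A", "A", "B"], "A": ["B"], "B": []}
lex = {"".join(map(str, p)): n
       for n, p in enumerate(initial_paths(dag, "S"))}
```

The dictionary `lex` assigns the numbers $0,\ldots,5$ to the six initial paths, in agreement with the lex-representations $(2,w)$ , $(4,w)$ and $(5,w)$ stated above.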

4.2 Register length in the compressed setting

In the following sections we will develop an enumeration algorithm for the set of all tuples in $\phi(\mathsf{val}(D))$ , where the SLP $D$ is part of the input. Recall that $\mathsf{val}(D)$ may contain $2^{\Theta(|D|)}$ many elements. In order to achieve constant delay, we therefore should set the register length in our algorithm to $\Theta(|D|)$ so that we can store elements of $\mathsf{val}(D)$ . This is in fact a standard assumption for algorithms on SLP-compressed objects. For instance, when dealing with SLP-compressed strings, one usually assumes that registers can store positions in the decompressed string. We only allow additions, subtractions and comparisons on these $\Theta(|D|)$ -bit registers and these operations take constant time (since we assume the uniform cost measure). For registers of length $\operatorname{\mathcal{O}}(\log|D|)$ we will also allow pointer operations.

Note that a $D$ -representation $(p,v)$ needs $\operatorname{\mathcal{O}}(|D|)$ many $\operatorname{\mathcal{O}}(\log|D|)$ -bit registers, whereas its lex-representation $(\mathsf{lex}(p),v)$ fits into two registers (one of length $\operatorname{\mathcal{O}}(\log|D|)$ ).

5 FO-Enumeration over SLP-Compressed Degree-Bounded Structures

We now have all definitions available in order to state a more precise version of Theorem 1:

Theorem 7.

Given an apex SLP $D$ such that $\mathsf{val}(D)$ is degree- $d$ bounded and an FO-formula $\phi(x_{1},\ldots,x_{k})$ , we can enumerate the result set $\phi(\mathsf{val}(D))$ with preprocessing time $f(d,|\phi|)\cdot|D|$ and delay $f(d,|\phi|)$ for some computable function $f$ . All nodes of $\phi(\mathsf{val}(D))$ are output in their lex-representation.

Throughout Section 5 we fix $D=(\mathcal{R},N,S,P)$ and $\phi(x_{1},\ldots,x_{k})$ as in Theorem 7. Let $\mathsf{qr}(\phi)=\nu$ . W.l.o.g. we can assume that $d\geq 2$ .

The general structure of our enumeration algorithm is the same as for the uncompressed setting. In particular, we also use Gaifman-locality to reduce to the problem of enumerating for a fixed $\mathcal{B}\in\mathcal{T}_{k,r}$ the set of all $\mathcal{B}$ -tuples $\bar{a}\in\mathsf{val}(D)^{k}$ , which then reduces to the problem of enumerating for all consistent factorizations $\Lambda=(\mathcal{B}_{1},\sigma_{1},\ldots,\mathcal{B}_{m},\sigma_{m})$ of $\mathcal{B}$ the set of all $m$ -tuples $\bar{b}\in\mathsf{val}(D)^{m}$ that are admissible for $\Lambda$ (see the beginning of Section 3).

Here, a first complication occurs: one important component of the above reduction for the uncompressed setting is that FO model checking on degree- $d$ bounded structures can be done in time $|{\mathcal{U}}|\cdot f(d,|\phi|)$ [62]. For the SLP-compressed setting, a linear time (i.e., in time $|D|\cdot f(d,|\phi|)$ ) model checking algorithm is not directly available: only an NL-algorithm for apex SLPs is known [39]. However, it is not hard to obtain a linear time algorithm from the NL-algorithm in [39]. Alternatively, one can also bypass model checking; see the full version [43].

Consequently, as in the uncompressed setting, it suffices to consider a fixed consistent factorization $\Lambda=(\mathcal{B}_{1},\sigma_{1},\ldots,\mathcal{B}_{m},\sigma_{m})$ of $\mathcal{B}$ and to enumerate the set of all $m$ -tuples in $\mathsf{val}(D)$ that are admissible for $\Lambda$ . As before we define the larger radius $\rho=2rk-r+k-1$ ; see (1).

5.1 Expansions of nonterminals

In this section we introduce the concept of $\zeta$ -expansions for a constant $\zeta\geq 1$ (later, $\zeta$ will be a constant of the form $f(d,|\phi|)$ ), which will be needed to transfer the enumeration algorithm for the uncompressed setting (Section 3.1) to the SLP-compressed setting. The idea is to apply the productions from $D$ , starting with a nonterminal $A\in N$ , until all nodes of $\mathsf{val}(A)$ that have distance at most $\zeta$ from the nodes in the right-hand side of $A$ (except for the contact nodes of $A$ ) are produced. For a nonterminal $A\in N$ we define

\mathsf{In}_{A}=\{(A,v):v\in{\mathcal{U}}_{A}\setminus\mathop{\mathrm{ran}}(\tau_{A})\}\subseteq\mathsf{val}(A).

These are the internal nodes of $\mathsf{val}(A)$ (written in $A$ -representation) that are directly produced with the production $A\to({\mathcal{U}}_{A},\tau_{A},E_{A})$ . Let $a_{1},\ldots,a_{m}$ be a list of all nodes from $\mathsf{In}_{A}$ . We then define the $\zeta$ -expansion as the following induced substructure of $\mathsf{val}(A)$ :

\mathcal{E}_{\zeta}(A)=\mathcal{N}_{\mathsf{val}(A),\zeta}(a_{1},\ldots,a_{m}).

We always assume that the nodes of $\mathcal{E}_{\zeta}(A)$ are represented by their $A$ -representations. Let

\mathsf{Bd}_{A,\zeta}=\{(A,v):v\in\mathop{\mathrm{ran}}(\tau_{A})\}\cup\{a\in\mathsf{val}(A):\operatorname{\mathsf{dist}}_{\mathsf{val}(A)}(\mathsf{In}_{A},a)=\zeta\}\subseteq\mathsf{val}(A)

be the boundary of $\mathcal{E}_{\zeta}(A)$ . A valid substructure of $\mathcal{E}_{\zeta}(A)$ is an induced substructure $\mathcal{A}$ of $\mathcal{E}_{\zeta}(A)$ with $\mathcal{A}\cap\mathsf{Bd}_{A,\zeta}=\emptyset\neq\mathcal{A}\cap\mathsf{In}_{A}$ . If $\mathcal{A}$ is a valid substructure of $\mathcal{E}_{\zeta}(A)$ and $p$ is an $S$ -to- $A$ path in $\mathsf{dag}(D)$ , then any neighbor of $\eta_{p}(\mathcal{A})$ in the graph $\mathcal{G}(\mathsf{val}(D))$ belongs to $\eta_{p}(\mathcal{E}_{\zeta}(A))$ . Moreover, $\eta_{p}(\mathcal{A})\simeq\mathcal{A}$ , since all contact nodes $(A,\tau_{A}(i))$ are excluded from a valid substructure of $\mathcal{E}_{\zeta}(A)$ . In the following, we consider the radius $\zeta=2\rho+1$ . For a nonterminal $A\in N$ we write $\mathcal{E}(A)$ for the expansion $\mathcal{E}_{2\rho+1}(A)$ in the rest of the paper.

Fix a $\rho$ -neighborhood type $\mathcal{B}$ . A node $a\in\mathcal{E}(A)\subseteq\mathsf{val}(A)$ is called a valid $\mathcal{B}$ -node in $\mathcal{E}(A)$ if (i) $\mathcal{N}_{\mathcal{E}(A),\rho}(a)\simeq\mathcal{B}$ and (ii) $\mathcal{N}_{\mathcal{E}(A),\rho}(a)$ is a valid substructure of $\mathcal{E}(A)$ . We say that $A$ is $\mathcal{B}$ -useful if there is a valid $\mathcal{B}$ -node in $\mathcal{E}(A)$ . We consider now the following two sets:

$\blacksquare$

$\mathsf{S}^{\mathcal{B}}_{1}=\{(p,a):\exists A\in N:\text{ $p$ is an $S$-to-$A$ path in $\mathsf{dag}(D)$, $a$ is a valid $\mathcal{B}$-node in $\mathcal{E}(A)$}\}$
$\blacksquare$

$\mathsf{S}^{\mathcal{B}}_{2}=\{b\in\mathsf{val}(D):b\text{ is a $\mathcal{B}$-node}\}$

We define a mapping $h:\mathsf{S}^{\mathcal{B}}_{1}\to\mathsf{val}(D)$ as follows. Let $(p,a)\in\mathsf{S}^{\mathcal{B}}_{1}$ , where $p$ is an $S$ -to- $A$ path in $\mathsf{dag}(D)$ and let $(q,v)$ be the $A$ -representation of $a\in\mathcal{E}(A)$ . We then define $h(p,a)=\eta_{p}(a)=(pq,v)$ (where the latter is a $D$ -representation that we identify as usual with a node from $\mathsf{val}(D)$ ). The following lemma is proved in the full version [43].

Lemma 8.

The mapping $h$ is a bijection from $\mathsf{S}^{\mathcal{B}}_{1}$ to $\mathsf{S}^{\mathcal{B}}_{2}$ .

5.2 Overview of the enumeration algorithm

Our goal is to carry out the algorithm described in Section 3.1, but in the compressed setting, i.e., by only using the apex SLP $D=(\mathcal{R},N,S,P)$ instead of the explicit structure $\mathsf{val}(D)$ . As in the uncompressed setting, it suffices to consider a fixed $(k,r)$ -neighborhood type $\mathcal{B}\in\mathcal{T}_{k,r}$ together with a fixed consistent factorization

\Lambda=(\mathcal{B}_{1},\sigma_{1},\ldots,\mathcal{B}_{m},\sigma_{m})

(3)

of $\mathcal{B}$ and to enumerate the set of all $m$ -tuples in $\mathsf{val}(D)$ that are admissible for $\Lambda$ . In the following we sketch the algorithm; details can be found in the full version [43].

Enumeration of all $\mathcal{B}_{i}$ -nodes.

The algorithm for the uncompressed setting (Section 3) precomputes for every $\mathcal{B}_{i}$ a list $L_{i}$ of all $\mathcal{B}_{i}$ -nodes of the structure ${\mathcal{U}}$ . This is no longer possible in the compressed setting since the structure $\mathsf{val}(D)$ is too big. However, as shown in Section 5.1, there is a bijection between the set of $\mathcal{B}_{i}$ -nodes in $\mathsf{val}(D)$ and the set of all pairs $(p,a)$ , where $p$ is an $S$ -to- $A$ path in $\mathsf{dag}(D)$ for a $\mathcal{B}_{i}$ -useful nonterminal $A$ and $a$ is a valid $\mathcal{B}_{i}$ -node in $\mathcal{E}(A)$ that is written in its $A$ -representation $(q,v)$ . Hence, on a high level, instead of explicitly precomputing the lists $L_{i}$ of all $\mathcal{B}_{i}$ -nodes, we enumerate them with Algorithm 2.

Algorithm 2: Enumeration of all $\mathcal{B}_{i}$ -nodes.

To execute this algorithm we first have to compute in the preprocessing phase the expansions $\mathcal{E}(A)$ for all nonterminals $A$ . This is easy: using a breadth-first search (BFS), we locally generate $\mathsf{val}(A)$ starting with the nodes in $\mathsf{In}_{A}$ until all nodes $a\in\mathsf{val}(A)$ with $\operatorname{\mathsf{dist}}_{\mathsf{val}(A)}(\mathsf{In}_{A},a)\leq 2\rho+1$ are generated. The size of $\mathcal{E}(A)$ is bounded by $|{\mathcal{U}}_{A}|\cdot f(d,|\phi|)$ (the size of a $(2\rho+1)$ -sphere around a tuple of length at most $|{\mathcal{U}}_{A}|$ in a degree- $d$ bounded structure) and $\mathcal{E}(A)$ can be constructed in time $|{\mathcal{U}}_{A}|\cdot f(d,|\phi|)$ . Summing over all $A\in N$ shows that all $(2\rho+1)$ -expansions can be precomputed in time $|D|\cdot f(d,|\phi|)$ .
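The local generation by BFS can be sketched as follows, assuming that the neighbors of a node of $\mathsf{val}(A)$ are produced on demand by a callback. The function name `expansion` and the callback `neighbors` are hypothetical; in the actual setting the callback would derive neighbors from $A$ -representations, which is not modeled here. The returned boundary only contains the nodes at distance exactly $\zeta$ ; the contact nodes of $A$ , which also belong to $\mathsf{Bd}_{A,\zeta}$ , would have to be added separately.

```python
from collections import deque


def expansion(sources, neighbors, zeta):
    """Nodes at distance <= zeta from the sources, generated lazily.

    Returns the distance map and the set of nodes at distance exactly zeta.
    Nodes at distance zeta are never expanded, so nothing beyond the
    radius is generated, however large the underlying structure is.
    """
    dist = {s: 0 for s in sources}
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        if dist[u] == zeta:
            continue
        for w in neighbors(u):
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    boundary = {a for a, d in dist.items() if d == zeta}
    return dist, boundary


# Toy instance (assumed): the infinite path on the integers, sources {0, 1}.
dist, boundary = expansion([0, 1], lambda n: [n - 1, n + 1], zeta=2)
```

Although the underlying path is infinite, only six nodes are generated, which mirrors the fact that the cost of computing $\mathcal{E}(A)$ depends on $|{\mathcal{U}}_{A}|$ , $d$ and the radius, but not on $|\mathsf{val}(A)|$ .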

With the $\mathcal{E}(A)$ available, we can easily precompute by brute force the set of all valid $\mathcal{B}_{i}$ -nodes in $\mathcal{E}(A)$ (needed in Line 2 of Algorithm 2) and then the set of all $\mathcal{B}_{i}$ -useful nonterminals (needed in Line 1 of Algorithm 2). Recall that $A$ is $\mathcal{B}_{i}$ -useful iff there is a valid $\mathcal{B}_{i}$ -node in $\mathcal{E}(A)$ . Moreover, for every valid $\mathcal{B}_{i}$ -node $c=(q,v)\in\mathcal{E}(A)$ we also compute an isomorphism $\pi_{c}:\mathcal{B}_{i}\to\mathcal{N}_{\mathcal{E}(A),\rho}(c)$ . The time for this is bounded by $|{\mathcal{U}}_{A}|\cdot f(d,|\phi|)$ for one nonterminal $A$ and hence by $|D|\cdot f(d,|\phi|)$ in total.

The most challenging part of Algorithm 2 is the enumeration of all initial paths $p$ in $\mathsf{dag}(D)$ that end in a $\mathcal{B}_{i}$ -useful nonterminal (Line 1). Let $\mathcal{P}_{i}$ be the set of these paths. In constant delay, we cannot afford to output a path $p\in\mathcal{P}_{i}$ as a list of edges (it does not fit into a constant number of registers in our machine model, see Section 4.2). That is why we return the number $\mathsf{lex}(p)$ (which fits into a single register in our machine model) in Line 3. The idea for constant-delay path enumeration is to run over all paths $p\in\mathcal{P}_{i}$ in lexicographical order and thereby maintain the number $\mathsf{lex}(p)$ . The path $p$ is internally stored in a contracted form. If $\mathsf{dag}(D)$ were a binary dag, then we could use an enumeration algorithm from [42], where maximal subpaths of left (right, respectively) outgoing edges are contracted to single edges. In our setting, $\mathsf{dag}(D)$ is not a binary dag, therefore we have to adapt the technique from [42] slightly; see [43].
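The overall shape of this enumeration can be illustrated by a naive sketch, under the assumption that the valid $\mathcal{B}_{i}$ -nodes of every expansion have already been precomputed. The function name `enumerate_nodes` and the dictionary `valid_nodes` are hypothetical. The sketch visits every initial path, including those that end in nonterminals that are not $\mathcal{B}_{i}$ -useful, so it does not achieve constant delay; it replaces the contracted path representation of [42, 43] by a plain preorder traversal.

```python
def enumerate_nodes(dag, start, valid_nodes):
    """Yield pairs (lex(p), a) for every initial path p ending in a
    nonterminal A and every precomputed valid node a of E(A).

    valid_nodes maps a nonterminal A to a list of A-representations (q, v);
    nonterminals that are not useful are simply absent.
    """
    counter = -1

    def walk(node):
        nonlocal counter
        counter += 1              # preorder position equals lex(p)
        lex_p = counter
        for a in valid_nodes.get(node, []):
            yield (lex_p, a)
        for child in dag[node]:
            yield from walk(child)

    yield from walk(start)


# Toy instance (assumed): the dag from Figure 1, where only B is useful
# and the node w of the right-hand side of B is its single valid node.
dag = {"S": ["A", "A", "B"], "A": ["B"], "B": []}
valid_nodes = {"B": [("B", "w")]}
triples = list(enumerate_nodes(dag, "S", valid_nodes))
```

Under these assumptions the three outputs carry the lex-numbers $2$ , $4$ and $5$ of the three $S$ -to- $B$ paths.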

In order to see how Algorithm 2 can be used to replace the precomputed lists $L_{i}$ in Algorithm 1 for the uncompressed setting, a few additional points have to be clarified.

Producing the final output tuples.

Note that for each enumerated $\mathcal{B}_{i}$-node $b_{i}\in\mathsf{val}(D)$ we have to produce the partial $k$-tuple $t_{b_{i},\sigma_{i}}$ (the final output tuple is then $t_{b_{1},\sigma_{1}}\sqcup t_{b_{2},\sigma_{2}}\sqcup\cdots\sqcup t_{b_{m},\sigma_{m}}$). Let us first recall that in the uncompressed setting each partial $k$-tuple $t_{b_{i},\sigma_{i}}$ is defined by $t_{b_{i},\sigma_{i}}(j)=\pi_{b_{i}}(\sigma_{i}(j))$ for all $j\in\mathop{\mathrm{dom}}(\sigma_{i})$, where $\pi_{b_{i}}:\mathcal{B}_{i}\to\mathcal{N}_{{\mathcal{U}},\rho}(b_{i})$ is a precomputed isomorphism. In the compressed setting, Algorithm 2 outputs every $\mathcal{B}_{i}$-node $b_{i}\in\mathsf{val}(D)$ as a triple $(\mathsf{lex}(p_{i}),q_{i},v_{i})$, where the initial path $p_{i}\in\mathcal{P}_{i}$ ends in some $\mathcal{B}_{i}$-useful nonterminal $A_{i}\in N$ and $c_{i}:=(q_{i},v_{i})$ is a valid $\mathcal{B}_{i}$-node in $\mathcal{E}(A_{i})$. Moreover, we have a precomputed isomorphism $\pi_{c_{i}}:\mathcal{B}_{i}\to\mathcal{N}_{\mathcal{E}(A_{i}),\rho}(c_{i})$, which yields the isomorphism $\pi_{b_{i}}=\eta_{p_{i}}\circ\pi_{c_{i}}:\mathcal{B}_{i}\to\mathcal{N}_{\mathsf{val}(D),\rho}(b_{i})$. Then, for every $j\in\mathop{\mathrm{dom}}(\sigma_{i})$ we can easily compute the lex-representation of $\pi_{b_{i}}(\sigma_{i}(j))$. We first compute $\pi_{c_{i}}(\sigma_{i}(j))$ in its $A_{i}$-representation $(q_{i,j},v_{i,j})$ using the precomputed mapping $\pi_{c_{i}}$. Then the lex-representation of $t_{b_{i},\sigma_{i}}(j)=\pi_{b_{i}}(\sigma_{i}(j))$ is $(\mathsf{lex}(p_{i}q_{i,j}),v_{i,j})$, where $\mathsf{lex}(p_{i}q_{i,j})=\mathsf{lex}(p_{i})+\mathsf{lex}_{A_{i}}(q_{i,j})$. Here, $\mathsf{lex}(p_{i})$ is produced by Algorithm 2. The path $q_{i,j}$ has length at most $2\rho+1$ (this is a consequence of the apex condition for $D$). Its $\mathsf{lex}$-number $\mathsf{lex}_{A_{i}}(q_{i,j})$ can be computed by summing at most $2\rho+1$ many edge weights that were computed in the preprocessing phase.
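The computation of the lex-representations can be illustrated by the following minimal Python sketch. It is not part of the formal development: the encoding of edges as pairs, the dictionary `weight` of precomputed edge weights, and the names `lex_offset` and `partial_tuple` are assumptions made only for illustration. The sketch takes $\mathsf{lex}(p_{i})$ as produced by Algorithm 2 and adds the sum of the edge weights along each short path $q_{i,j}$.

```python
from typing import Dict, Hashable, Tuple

# Hypothetical encoding: an edge of dag(D) is (nonterminal, index of outgoing edge).
Edge = Tuple[str, int]
# A-representation (q, v): a short path q starting in A together with a local node v.
ARep = Tuple[Tuple[Edge, ...], Hashable]


def lex_offset(q: Tuple[Edge, ...], weight: Dict[Edge, int]) -> int:
    """lex_A(q): sum of the precomputed weights of the edges on q."""
    return sum(weight[e] for e in q)


def partial_tuple(
    lex_p: int,
    sigma: Dict[int, Hashable],
    pi_c: Dict[Hashable, ARep],
    weight: Dict[Edge, int],
) -> Dict[int, Tuple[int, Hashable]]:
    """Map each position j in dom(sigma) to the lex-representation of
    pi_b(sigma(j)), using lex(p q) = lex(p) + lex_A(q)."""
    result = {}
    for j, element in sigma.items():
        q, v = pi_c[element]  # A-representation of pi_c(sigma(j))
        result[j] = (lex_p + lex_offset(q, weight), v)
    return result
```

Since every path $q_{i,j}$ has bounded length and $|\mathop{\mathrm{dom}}(\sigma_{i})|\leq k$, the loop performs a number of additions that depends only on $\rho$ and $k$.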

Counting the total number of $\rho$-neighborhoods.

In Section 3.1 we distinguish between short and long lists $L_{i}$. Since Algorithm 2 replaces the precomputed list $L_{i}$ in our compressed setting, we have to count, in the preprocessing phase (that is, before running the algorithm), the number of triples that Algorithm 2 will produce. This is easy: the number of output triples can be computed by summing over all $\mathcal{B}_{i}$-useful nonterminals $A$ the product of (i) the number of $S$-to-$A$ paths in $\mathsf{dag}(D)$ and (ii) the number of valid $\mathcal{B}_{i}$-nodes in $\mathcal{E}(A)$. The latter can be computed in the preprocessing phase. Computing the number of $S$-to-$A$ paths (for all $A\in N$) involves a top-down pass (starting in $S$) over $\mathsf{dag}(D)$ with $|\mathsf{dag}(D)|\leq|D|$ many additions on $\operatorname{\mathcal{O}}(|D|)$-bit numbers in total.
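The counting step can be sketched in Python as follows. The representation of $\mathsf{dag}(D)$ as a dictionary of successor lists (with repetitions for parallel edges), the dictionary `valid_nodes` holding the precomputed number of valid $\mathcal{B}_{i}$-nodes for each $\mathcal{B}_{i}$-useful nonterminal, and all function names are assumptions made for this illustration. The top-down pass is realized here by a standard topological traversal, and Python's unbounded integers play the role of the $\operatorname{\mathcal{O}}(|D|)$-bit numbers.

```python
from collections import deque
from typing import Dict, List


def count_paths_from_start(dag: Dict[str, List[str]], start: str) -> Dict[str, int]:
    """Number of start-to-A paths for every node A of a DAG.

    dag maps every node to the list of its successors; a successor
    listed twice stands for two parallel edges.
    """
    indegree = {a: 0 for a in dag}
    for a in dag:
        for b in dag[a]:
            indegree[b] += 1

    paths = {a: 0 for a in dag}
    paths[start] = 1

    # Top-down pass: a node is processed once all its predecessors are done.
    queue = deque(a for a in dag if indegree[a] == 0)
    while queue:
        a = queue.popleft()
        for b in dag[a]:
            paths[b] += paths[a]  # one addition per edge of the DAG
            indegree[b] -= 1
            if indegree[b] == 0:
                queue.append(b)
    return paths


def count_output_triples(
    dag: Dict[str, List[str]], start: str, valid_nodes: Dict[str, int]
) -> int:
    """Sum over useful nonterminals A of (#start-to-A paths) * (#valid nodes in E(A))."""
    paths = count_paths_from_start(dag, start)
    return sum(paths[a] * n for a, n in valid_nodes.items())
```

The result can then be compared with the threshold that separates short from long lists before the enumeration phase starts.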

Checking distance constraints.

Recall that we fixed the consistent factorization $\Lambda$ from (3) of the fixed $(k,r)$-neighborhood type $\mathcal{B}$ and want to enumerate all tuples $(b_{1},\ldots,b_{m})\in\mathsf{val}(D)^{m}$ that are admissible for $\Lambda$. The definition of an admissible tuple also requires checking whether $\operatorname{\mathsf{dist}}_{\mathsf{val}(D)}(t_{b_{i},\sigma_{i}},t_{b_{j},\sigma_{j}})>2r+1$ for all $i\neq j$ (see (2)). The nodes $b_{i}$ are enumerated with Algorithm 2, hence the following assumptions hold for all $i\in[m]$:

$\blacksquare$ $b_{i}$ is given by a triple $(\mathsf{lex}(p_{i}),q_{i},v_{i})$,

$\blacksquare$ $p_{i}$ is an initial-to-$A_{i}$ path in $\mathsf{dag}(D)$ (for some $\mathcal{B}_{i}$-useful nonterminal $A_{i}$), and

$\blacksquare$ $c_{i}:=(q_{i},v_{i})$ is a node (written in $A_{i}$-representation) from $\mathcal{E}(A_{i})$ such that $c_{i}$ has $\rho$-neighborhood type $\mathcal{B}_{i}$ in $\mathcal{E}(A_{i})$ and $\mathcal{N}_{\mathcal{E}(A_{i}),\rho}(c_{i})$ is a valid substructure of $\mathcal{E}(A_{i})$.

In a first step, we show that if $\operatorname{\mathsf{dist}}_{\mathsf{val}(D)}(t_{b_{i},\sigma_{i}},t_{b_{j},\sigma_{j}})\leq 2r+1$ then there is a path $q$ of length at most $3\rho-r$ such that $p_{i}=p_{j}q$ or $p_{j}=p_{i}q$. For this, the apex property for $D$ is important, since it bounds from below the distance between two nodes $a=(p,v)$ and $a^{\prime}=(p^{\prime},v^{\prime})$ of $\mathsf{val}(D)$ by the prefix distance between the paths $p$ and $p^{\prime}$ (i.e., the total number of edges that do not belong to the longest common prefix of $p$ and $p^{\prime}$).

We then proceed in two steps. We first check in time $f(d,|\phi|)$ whether $p_{j}=p_{i}q$ or $p_{i}=p_{j}q$ for some path $q$ of length at most $3\rho-r$. For checking $p_{j}=p_{i}q$ (the case $p_{i}=p_{j}q$ is analogous) we check whether $p_{j}=p_{i}$ (by checking $\mathsf{lex}(p_{j})=\mathsf{lex}(p_{i})$) and, if this is not the case, we repeatedly remove the last edge of $p_{j}$ (at most $3\rho-r$ times) and check whether the resulting path equals $p_{i}$. However, the whole procedure is complicated by the fact that $p_{i}$ and $p_{j}$ are given in a contracted form, where some subpaths are contracted to single edges (see the above paragraph on the path enumeration algorithm for $\mathsf{dag}(D)$).
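The prefix test can be illustrated by the following simplified Python sketch. It deliberately ignores the contracted form of the paths: every path is stored as a node with a pointer to the path obtained by removing its last edge, which is an assumption of this sketch and not the data structure used by the algorithm. The class `Path`, its fields, and the function name are hypothetical, and path equality is tested by comparing $\mathsf{lex}$-numbers as described above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Path:
    """An initial path, stored as its lex-number plus a link to its longest proper prefix."""
    lex: int
    parent: Optional["Path"] = None   # path without its last edge (None for the empty path)
    last_edge: Optional[str] = None   # label of the last edge (None for the empty path)


def suffix_if_extension(p_i: Path, p_j: Path, bound: int) -> Optional[List[str]]:
    """Return q (as a list of edge labels) if p_j = p_i q with |q| <= bound, else None."""
    removed: List[str] = []
    current = p_j
    for _ in range(bound + 1):
        if current.lex == p_i.lex:    # assumed to identify equal initial paths
            return removed[::-1]
        if current.parent is None:    # reached the empty path without a match
            return None
        removed.append(current.last_edge)
        current = current.parent
    return None
```

With `bound` set to $3\rho-r$, the loop performs at most $3\rho-r+1$ comparisons, so its running time does not depend on $|D|$.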

In the second step we have to check in time $f(d,|\phi|)$ whether $\operatorname{\mathsf{dist}}_{\mathsf{val}(D)}(t_{b_{i},\sigma_{i}},t_{b_{j},\sigma_{j}})\leq 2r+1$, assuming that $p_{j}=p_{i}q$ for some path $q$ of length at most $3\rho-r$. This boils down to checking, for every $b\in\mathop{\mathrm{ran}}(t_{b_{i},\sigma_{i}})$ and $b^{\prime}\in\mathop{\mathrm{ran}}(t_{b_{j},\sigma_{j}})$, whether $\operatorname{\mathsf{dist}}_{\mathsf{val}(D)}(b,b^{\prime})\leq 2r+1$, which is the case iff $\operatorname{\mathsf{dist}}_{\mathsf{val}(A_{i})}(c,\eta_{q}(c^{\prime}))\leq 2r+1$, where $c,c^{\prime}\in\mathsf{val}(A_{i})$ correspond to $b,b^{\prime}$ in the sense that $\eta_{p_{i}}(c)=b$ and $\eta_{p_{j}}(c^{\prime})=b^{\prime}$. For this we locally construct $\mathcal{N}_{\mathsf{val}(A_{i}),2r+1}(c)$ by starting a BFS in $c$ and then computing all elements of $\mathsf{val}(A_{i})$ with distance at most $2r+1$ from $c$, just like we constructed all expansions $\mathcal{E}(A)$ (as explained above). This concludes our proof sketch for Theorem 7.
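The local distance test from the second step can be sketched as a depth-bounded BFS. In the following Python fragment, the callable `neighbors`, which returns the adjacent nodes in the Gaifman graph of $\mathsf{val}(A_{i})$, is a placeholder for the local exploration of the compressed structure and is an assumption of this sketch, as is the function name.

```python
from collections import deque
from typing import Callable, Dict, Hashable, Iterable


def within_distance(
    neighbors: Callable[[Hashable], Iterable[Hashable]],
    source: Hashable,
    target: Hashable,
    bound: int,
) -> bool:
    """Return True iff target has distance at most bound from source."""
    if source == target:
        return True
    dist: Dict[Hashable, int] = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == bound:      # do not explore beyond the radius
            continue
        for w in neighbors(u):
            if w not in dist:
                if w == target:
                    return True
                dist[w] = dist[u] + 1
                queue.append(w)
    return False
```

For a structure of degree at most $d$ and `bound` equal to $2r+1$, the search visits at most $\sum_{\ell=0}^{2r+1}d^{\ell}$ nodes, a number that depends only on $d$ and $r$.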

6 Conclusions and Outlook

We presented an enumeration algorithm for FO-queries on structures that are represented succinctly by apex SLPs. Assuming that the formula is fixed and the degree of the structure is bounded by a constant, the preprocessing time of our algorithm is linear in the size of the SLP and the delay is constant.

There are several possible directions in which our result can be extended. One option is to use more general formalisms for graph compression. Our SLPs are based on Courcelle’s HR (hyperedge replacement) algebra, which is tightly related to tree width [12, Section 2.3]. Our SLPs can be viewed as dag-compressed expressions in the HR algebra, where the leaves can be arbitrary pointed structures; see [39] for more details. Another (and in some sense more general) graph algebra is the VR algebra, which is tightly related to clique width [12, Section 2.5]. It is straightforward to define a notion of SLPs based on the VR algebra, and this leads to the question whether our result also holds for the resulting VR-algebra-SLPs.

Another interesting question is to what extent the results on enumeration for conjunctive queries [4, 7] can be extended to the compressed setting. In this context, it is interesting to note that model checking for a fixed existential FO-formula on SLP-compressed structures (without the apex restriction) belongs to NL. It would be interesting to see whether the constant delay enumeration algorithm from [4] for free-connex acyclic conjunctive queries can be extended to SLP-compressed structures.

Finally, one may ask whether in our main result (Theorem 7) the apex restriction is really needed. More precisely, consider an SLP $D$ such that $\mathsf{val}(D)$ has degree $d$. Is it possible to construct from $D$ in time $|D|\cdot f(d)$ an equivalent apex SLP $D^{\prime}$ of size $|D|\cdot f(d)$ for a computable function $f$? If this is true, then one could enforce the apex property in the preprocessing. In [17] it is shown that a set of graphs of bounded degree $d$ that can be produced by a hyperedge replacement grammar (HRG) $H$ can also be produced by an apex HRG, but the size blow-up is not analyzed with respect to the parameter $d$ and the size of $H$.

References

[1] Rajeev Alur, Michael Benedikt, Kousha Etessami, Patrice Godefroid, Thomas W. Reps, and Mihalis Yannakakis. Analysis of recursive state machines. ACM Transactions on Programming Languages and Systems (TOPLAS), 27(4):786–818, 2005. doi:10.1145/1075382.1075387.
[2] Rajeev Alur and Mihalis Yannakakis. Model checking of hierarchical state machines. ACM Transactions on Programming Languages and Systems (TOPLAS), 23(3):273–303, 2001. doi:10.1145/503502.503503.
[3] Guillaume Bagan. MSO queries on tree decomposable structures are computable with linear delay. In Proceedings of the 20th International Workshop on Computer Science Logic, CSL 2006, volume 4207 of Lecture Notes in Computer Science, pages 167–181. Springer, 2006. doi:10.1007/11874683_11.
[4] Guillaume Bagan, Arnaud Durand, and Etienne Grandjean. On acyclic conjunctive queries and constant delay enumeration. In Proceedings of the 21st International Workshop on Computer Science Logic, CSL 2007, volume 4646 of Lecture Notes in Computer Science, pages 208–222. Springer, 2007. doi:10.1007/978-3-540-74915-8_18.
[5] Hideo Bannai, Momoko Hirayama, Danny Hucke, Shunsuke Inenaga, Artur Jeż, Markus Lohrey, and Carl Philipp Reh. The smallest grammar problem revisited. IEEE Transactions on Information Theory, 67(1):317–328, 2021. doi:10.1109/TIT.2020.3038147.
[6] Michel Bauderon and Bruno Courcelle. Graph expressions and graph rewritings. Mathematical System Theory, 20(2-3):83–127, 1987. doi:10.1007/BF01692060.
[7] Christoph Berkholz, Fabian Gerhardt, and Nicole Schweikardt. Constant delay enumeration for conjunctive queries: a tutorial. ACM SIGLOG News, 7(1):4–33, 2020. doi:10.1145/3385634.3385636.
[8] Christoph Berkholz, Jens Keppeler, and Nicole Schweikardt. Answering FO+MOD queries under updates on bounded degree databases. ACM Transactions on Database Systems, 43(2):7:1–7:32, 2018. doi:10.1145/3232056.
[9] Romain Brenguier, Stefan Göller, and Ocan Sankur. A comparison of succinctly represented finite-state systems. In Proceedings of the 23rd International Conference on Concurrency Theory, CONCUR 2012, volume 7454 of Lecture Notes in Computer Science, pages 147–161. Springer, 2012. doi:10.1007/978-3-642-32940-1_12.
[10] Katrin Casel, Henning Fernau, Serge Gaspers, Benjamin Gras, and Markus L. Schmid. On the complexity of the smallest grammar problem over fixed alphabets. Theory of Computing Systems, 65(2):344–409, 2021. doi:10.1007/s00224-020-10013-w.
[11] Moses Charikar, Eric Lehman, Ding Liu, Rina Panigrahy, Manoj Prabhakaran, Amit Sahai, and Abhi Shelat. The smallest grammar problem. IEEE Transactions on Information Theory, 51(7):2554–2576, 2005. doi:10.1109/TIT.2005.850116.
[12] Bruno Courcelle and Joost Engelfriet. Graph Structure and Monadic Second-Order Logic - A Language-Theoretic Approach, volume 138 of Encyclopedia of Mathematics and its Applications. Cambridge University Press, 2012. doi:10.1017/CBO9780511977619.
[13] Arnaud Durand and Etienne Grandjean. First-order queries on structures of bounded degree are computable with constant delay. ACM Transactions on Computational Logic, 8(4):21, 2007. doi:10.1145/1276920.1276923.
[14] Arnaud Durand, Nicole Schweikardt, and Luc Segoufin. Enumerating answers to first-order queries over databases of low degree. Logical Methods in Computer Science, 18(2), 2022. doi:10.46298/LMCS-18(2:7)2022.
[15] Heinz-Dieter Ebbinghaus and Jörg Flum. Finite model theory. Perspectives in Mathematical Logic. Springer, 1995. doi:10.1007/3-540-28788-4.
[16] Joost Engelfriet. Context-free graph grammars. In Grzegorz Rozenberg and Arto Salomaa, editors, Handbook of Formal Languages, Volume 3: Beyond Words, pages 125–213. Springer, 1997. doi:10.1007/978-3-642-59126-6_3.
[17] Joost Engelfriet, Linda Heyker, and George Leih. Context-free graph languages of bounded degree are generated by apex graph grammars. Acta Informatica, 31(4):341–378, 1994. doi:10.1007/BF01178511.
[18] Joost Engelfriet and Grzegorz Rozenberg. A comparison of boundary graph grammars and context-free hypergraph grammars. Information and Computation, 84(2):163–206, 1990. doi:10.1016/0890-5401(90)90038-J.
[19] Rachel Faran and Orna Kupferman. LTL with arithmetic and its applications in reasoning about hierarchical systems. In LPAR-22. 22nd International Conference on Logic for Programming, Artificial Intelligence and Reasoning, volume 57 of EPiC Series in Computing, pages 343–362. EasyChair, 2018. doi:10.29007/WPG3.
[20] Rachel Faran and Orna Kupferman. A parametrized analysis of algorithms on hierarchical graphs. International Journal on Foundations of Computer Science, 30(6-7):979–1003, 2019. doi:10.1142/S0129054119400252.
[21] Jörg Flum and Martin Grohe. Parameterized Complexity Theory. Texts in Theoretical Computer Science. An EATCS Series. Springer, 2006. doi:10.1007/3-540-29953-X.
[22] Moses Ganardi and Pawel Gawrychowski. Pattern matching on grammar-compressed strings in linear time. In Proceedings of the 2022 ACM-SIAM Symposium on Discrete Algorithms, SODA 2022, pages 2833–2846. SIAM, 2022. doi:10.1137/1.9781611977073.110.
[23] Moses Ganardi, Artur Jeż, and Markus Lohrey. Balancing straight-line programs. Journal of the ACM, 68(4):27:1–27:40, 2021. doi:10.1145/3457389.
[24] Adrià Gascón, Markus Lohrey, Sebastian Maneth, Carl Philipp Reh, and Kurt Sieber. Grammar-based compression of unranked trees. Theory of Computing Systems, 64(1):141–176, 2020. doi:10.1007/s00224-019-09942-y.
[25] Ferenc Gécseg and Magnus Steinby. Tree languages. In Grzegorz Rozenberg and Arto Salomaa, editors, Handbook of Formal Languages, Volume 3: Beyond Words, pages 1–68. Springer, 1997. doi:10.1007/978-3-642-59126-6_1.
[26] Stefan Göller and Markus Lohrey. Fixpoint logics on hierarchical structures. Theory of Computing Systems, 48(1):93–131, 2009. doi:10.1007/s00224-009-9227-1.
[27] Annegret Habel and Hans-Jörg Kreowski. Some structural aspects of hypergraph languages generated by hyperedge replacement. In Proceedings of the 4th Annual Symposium on Theoretical Aspects of Computer Science, STACS 1987, volume 247 of Lecture Notes in Computer Science, pages 207–219. Springer, 1987. doi:10.1007/BFB0039608.
[28] Wojciech Kazana and Luc Segoufin. First-order query evaluation on structures of bounded degree. Logical Methods in Computer Science, 7(2), 2011. doi:10.2168/LMCS-7(2:20)2011.
[29] Wojciech Kazana and Luc Segoufin. Enumeration of first-order queries on classes of structures with bounded expansion. In Proceedings of the 32nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2013, pages 297–308. ACM, 2013. doi:10.1145/2463664.2463667.
[30] Wojciech Kazana and Luc Segoufin. Enumeration of monadic second-order queries on trees. ACM Transactions on Computational Logic, 14(4):25:1–25:12, 2013. doi:10.1145/2528928.
[31] Benny Kimelfeld, Wim Martens, and Matthias Niewerth. A formal language perspective on factorized representations. In Proceedings of the 28th International Conference on Database Theory, ICDT 2025, volume 328 of LIPIcs, pages 20:1–20:20. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2025. doi:10.4230/LIPICS.ICDT.2025.20.
[32] Stephan Kreutzer and Anuj Dawar. Parameterized complexity of first-order logic. Electronic Colloquium on Computational Complexity, TR09-131, 2009. URL: https://eccc.weizmann.ac.il/report/2009/131.
[33] N. Jesper Larsson and Alistair Moffat. Off-line dictionary-based compression. Proceedings of the IEEE, 88(11):1722–1732, 2000. doi:10.1109/5.892708.
[34] Thomas Lengauer. Hierarchical planarity testing algorithms. Journal of the ACM, 36(3):474–509, 1989. doi:10.1145/65950.65952.
[35] Thomas Lengauer and Klaus W. Wagner. The correlation between the complexities of the nonhierarchical and hierarchical versions of graph problems. Journal of Computer and System Sciences, 44:63–93, 1992. doi:10.1016/0022-0000(92)90004-3.
[36] Thomas Lengauer and Egon Wanke. Efficient solution of connectivity problems on hierarchically defined graphs. SIAM Journal on Computing, 17(6):1063–1080, 1988. doi:10.1137/0217068.
[37] Leonid Libkin. Elements of Finite Model Theory. Texts in Theoretical Computer Science. An EATCS Series. Springer, 2004. doi:10.1007/978-3-662-07003-1.
[38] Markus Lohrey. Algorithmics on SLP-compressed strings: A survey. Groups Complexity Cryptology, 4(2):241–299, 2012. doi:10.1515/gcc-2012-0016.
[39] Markus Lohrey. Model-checking hierarchical structures. Journal of Computer and System Sciences, 78(2):461–490, 2012. doi:10.1016/J.JCSS.2011.05.006.
[40] Markus Lohrey. Grammar-based tree compression. In Proceedings of the 19th International Conference on Developments in Language Theory, DLT 2015, volume 9168 of Lecture Notes in Computer Science, pages 46–57. Springer, 2015. doi:10.1007/978-3-319-21500-6_3.
[41] Markus Lohrey, Sebastian Maneth, and Roy Mennicke. XML tree structure compression using RePair. Information Systems, 38(8):1150–1167, 2013. doi:10.1016/J.IS.2013.06.006.
[42] Markus Lohrey, Sebastian Maneth, and Carl Philipp Reh. Constant-time tree traversal and subtree equality check for grammar-compressed trees. Algorithmica, 80(7):2082–2105, 2018. doi:10.1007/s00453-017-0331-3.
[43] Markus Lohrey, Sebastian Maneth, and Markus L. Schmid. FO-query enumeration over SLP-compressed structures of bounded degree, 2025. arXiv:2506.19421.
[44] Markus Lohrey, Sebastian Maneth, and Manfred Schmidt-Schauß. Parameter reduction and automata evaluation for grammar-compressed trees. Journal of Computer and System Sciences, 78(5):1651–1669, 2012. doi:10.1016/j.jcss.2012.03.003.
[45] Markus Lohrey and Markus L. Schmid. Enumeration for MSO-queries on compressed trees. Proceedings of the ACM on Management of Data, 2(2):78, 2024. doi:10.1145/3651141.
[46] Sebastian Maneth and Fabian Peternek. Grammar-based graph compression. Information Systems, 76:19–45, 2018. doi:10.1016/J.IS.2018.03.002.
[47] Sebastian Maneth and Fabian Peternek. Constant delay traversal of grammar-compressed graphs with bounded rank. Information and Computation, 273:104520, 2020. doi:10.1016/J.IC.2020.104520.
[48] Madhav V. Marathe, Harry B. Hunt III, Richard Edwin Stearns, and Venkatesh Radhakrishnan. Approximation algorithms for PSPACE-hard hierarchically and periodically specified problems. SIAM Journal on Computing, 27(5):1237–1261, 1998. doi:10.1137/S0097539795285254.
[49] Madhav V. Marathe, Harry B. Hunt III, and S. S. Ravi. The complexity of approximation PSPACE-complete problems for hierarchical specifications. Nordic Journal of Computing, 1(3):275–316, 1994. URL: https://www.cs.helsinki.fi/njc/njc1_papers/number3/paper1.pdf.
[50] Madhav V. Marathe, Venkatesh Radhakrishnan, Harry B. Hunt III, and S. S. Ravi. Hierarchically specified unit disk graphs. Theoretical Computer Science, 174(1–2):23–65, 1997. doi:10.1016/S0304-3975(96)00008-4.
[51] Martin Muñoz and Cristian Riveros. Constant-delay enumeration for SLP-compressed documents. Logical Methods in Computer Science, 21(1), 2025. doi:10.46298/LMCS-21(1:17)2025.
[52] Craig G. Nevill-Manning and Ian H. Witten. Identifying hierarchical structure in sequences: A linear-time algorithm. Journal of Artificial Intelligence Research, 7:67–82, 1997. doi:10.1613/JAIR.374.
[53] Dan Olteanu. Factorized databases: A knowledge compilation perspective. In Beyond NP, Papers from the 2016 AAAI Workshop, Phoenix, Arizona, USA, February 12, 2016, 2016. URL: http://www.aaai.org/ocs/index.php/WS/AAAIW16/paper/view/12638.
[54] Dan Olteanu and Maximilian Schleich. F: regression models over factorized views. Proceedings of the VLDB Endowment, 9(13):1573–1576, 2016. doi:10.14778/3007263.3007312.
[55] Dan Olteanu and Maximilian Schleich. Factorized databases. ACM SIGMOD Record, 45(2):5–16, 2016. doi:10.1145/3003665.3003667.
[56] Dan Olteanu and Jakub Závodný. Size bounds for factorised representations of query results. ACM Transactions on Database Systems, 40(1):2:1–2:44, 2015. doi:10.1145/2656335.
[57] Leonid Peshkin. Structure induction by lossless graph compression. In Proceedings of the 2007 Data Compression Conference, DCC 2007, pages 53–62. IEEE Computer Society, 2007. doi:10.1109/DCC.2007.73.
[58] William C. Rounds. Mappings and grammars on trees. Mathematical System Theory, 4(3):257–287, 1970. doi:10.1007/BF01695769.
[59] Markus L. Schmid and Nicole Schweikardt. Spanner evaluation over SLP-compressed documents. In Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2021, pages 153–165. ACM, 2021. doi:10.1145/3452021.3458325.
[60] Markus L. Schmid and Nicole Schweikardt. Query evaluation over SLP-represented document databases with complex document editing. In Proceedings of the International Conference on Management of Data, PODS 2022, pages 79–89. ACM, 2022. doi:10.1145/3517804.3524158.
[61] Nicole Schweikardt, Luc Segoufin, and Alexandre Vigny. Enumeration for FO queries over nowhere dense graphs. Journal of the ACM, 69(3):22:1–22:37, 2022. doi:10.1145/3517035.
[62] Detlef Seese. Linear time computable problems and first-order descriptions. Mathematical Structures in Computer Science, 6(6):505–526, 1996. doi:10.1017/S0960129500070079.
[63] Luc Segoufin. Constant delay enumeration for conjunctive queries. ACM SIGMOD Record, 44(1):10–17, 2015. doi:10.1145/2783888.2783894.
[64] Luc Segoufin and Alexandre Vigny. Constant delay enumeration for FO queries over databases with local bounded expansion. In Proceedings of the 20th International Conference on Database Theory, ICDT 2017, volume 68 of LIPIcs, pages 20:1–20:16. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2017. doi:10.4230/LIPICS.ICDT.2017.20.

[bib.bib1] [1] Rajeev Alur, Michael Benedikt, Kousha Etessami, Patrice Godefroid, Thomas W. Reps, and Mihalis Yannakakis. Analysis of recursive state machines. ACM Transactions on Programming Languages and Systems (TOPLAS), 27(4):86–818, 2005. doi:10.1145/1075382.1075387.

[bib.bib2] [2] Rajeev Alur and Mihalis Yannakakis. Model checking of hierarchical state machines. ACM Transactions on Programming Languages and Systems (TOPLAS), 23(3):273–303, 2001. doi:10.1145/503502.503503.

[bib.bib3] [3] Guillaume Bagan. MSO queries on tree decomposable structures are computable with linear delay. In Proceedings of the 20th International Workshop on Computer Science Logic, CSL 2006, volume 4207 of Lecture Notes in Computer Science, pages 167–181. Springer, 2006. doi:10.1007/11874683_11.

[bib.bib4] [4] Guillaume Bagan, Arnaud Durand, and Etienne Grandjean. On acyclic conjunctive queries and constant delay enumeration. In Proceedings of the 21st International Workshop on Computer Science Logic, CSL 2007, volume 4646 of Lecture Notes in Computer Science, pages 208–222. Springer, 2007. doi:10.1007/978-3-540-74915-8_18.

[bib.bib5] [5] Hideo Bannai, Momoko Hirayama, Danny Hucke, Shunsuke Inenaga, Artur Jeż, Markus Lohrey, and Carl Philipp Reh. The smallest grammar problem revisited. IEEE Transaction on Information Theory, 67(1):317–328, 2021. doi:10.1109/TIT.2020.3038147.

[bib.bib6] [6] Michel Bauderon and Bruno Courcelle. Graph expressions and graph rewritings. Mathematical System Theory, 20(2-3):83–127, 1987. doi:10.1007/BF01692060.

[bib.bib7] [7] Christoph Berkholz, Fabian Gerhardt, and Nicole Schweikardt. Constant delay enumeration for conjunctive queries: a tutorial. ACM SIGLOG News, 7(1):4–33, 2020. doi:10.1145/3385634.3385636.

[bib.bib8] [8] Christoph Berkholz, Jens Keppeler, and Nicole Schweikardt. Answering FO+MOD queries under updates on bounded degree databases. ACM Transactions on Database Systems, 43(2):7:1–7:32, 2018. doi:10.1145/3232056.

[bib.bib9] [9] Romain Brenguier, Stefan Göller, and Ocan Sankur. A comparison of succinctly represented finite-state systems. In Proceedings of the 23rd International Conference on Concurrency Theory, CONCUR 2012, volume 7454 of Lecture Notes in Computer Science, pages 147–161. Springer, 2012. doi:10.1007/978-3-642-32940-1_12.

[bib.bib10] [10] Katrin Casel, Henning Fernau, Serge Gaspers, Benjamin Gras, and Markus L. Schmid. On the complexity of the smallest grammar problem over fixed alphabets. Theory of Computing Systems, 65(2):344–409, 2021. doi:10.1007/s00224-020-10013-w.

[bib.bib11] [11] Moses Charikar, Eric Lehman, Ding Liu, Rina Panigrahy, Manoj Prabhakaran, Amit Sahai, and Abhi Shelat. The smallest grammar problem. IEEE Transactions on Information Theory, 51(7):2554–2576, 2005. doi:10.1109/TIT.2005.850116.

[bib.bib12] [12] Bruno Courcelle and Joost Engelfriet. Graph Structure and Monadic Second-Order Logic - A Language-Theoretic Approach, volume 138 of Encyclopedia of Mathematics and its Applications. Cambridge University Press, 2012. doi:10.1017/CBO9780511977619.

[bib.bib13] [13] Arnaud Durand and Etienne Grandjean. First-order queries on structures of bounded degree are computable with constant delay. ACM Transactions on Computational Logic, 8(4):21, 2007. doi:10.1145/1276920.1276923.

[bib.bib14] [14] Arnaud Durand, Nicole Schweikardt, and Luc Segoufin. Enumerating answers to first-order queries over databases of low degree. Logical Methods in Computer Science, 18(2), 2022. doi:10.46298/LMCS-18(2:7)2022.

[bib.bib15] [15] Heinz-Dieter Ebbinghaus and Jörg Flum. Finite model theory. Perspectives in Mathematical Logic. Springer, 1995. doi:10.1007/3-540-28788-4.

[bib.bib16] [16] Joost Engelfriet. Context-free graph grammars. In Grzegorz Rozenberg and Arto Salomaa, editors, Handbook of Formal Languages, Volume 3: Beyond Words, pages 125–213. Springer, 1997. doi:10.1007/978-3-642-59126-6_3.

[bib.bib17] [17] Joost Engelfriet, Linda Heyker, and George Leih. Context-free graph languages of bounded degree are generated by apex graph grammars. Acta Informatica, 31(4):341–378, 1994. doi:10.1007/BF01178511.

[bib.bib18] [18] Joost Engelfriet and Grzegorz Rozenberg. A comparison of boundary graph grammars and context-free hypergraph grammars. Information and Computation, 84(2):163–206, 1990. doi:10.1016/0890-5401(90)90038-J.

[bib.bib19] [19] Rachel Faran and Orna Kupferman. LTL with arithmetic and its applications in reasoning about hierarchical systems. In LPAR-22. 22nd International Conference on Logic for Programming, Artificial Intelligence and Reasoning, volume 57 of EPiC Series in Computing, pages 343–362. EasyChair, 2018. doi:10.29007/WPG3.

[bib.bib20] [20] Rachel Faran and Orna Kupferman. A parametrized analysis of algorithms on hierarchical graphs. International Journal on Foundations of Computer Science, 30(6-7):979–1003, 2019. doi:10.1142/S0129054119400252.

[bib.bib21] [21] Jörg Flum and Martin Grohe. Parameterized Complexity Theory. Texts in Theoretical Computer Science. An EATCS Series. Springer, 2006. doi:10.1007/3-540-29953-X.

[bib.bib22] [22] Moses Ganardi and Pawel Gawrychowski. Pattern matching on grammar-compressed strings in linear time. In Proceedings of the 2022 ACM-SIAM Symposium on Discrete Algorithms, SODA 2022, pages 2833–2846. SIAM, 2022. doi:10.1137/1.9781611977073.110.

[bib.bib23] [23] Moses Ganardi, Artur Jeż, and Markus Lohrey. Balancing straight-line programs. Journal of the ACM, 68(4):27:1–27:40, 2021. doi:10.1145/3457389.

[bib.bib24] [24] Adrià Gascón, Markus Lohrey, Sebastian Maneth, Carl Philipp Reh, and Kurt Sieber. Grammar-based compression of unranked trees. Theory of Computing Systems, 64(1):141–176, 2020. doi:10.1007/s00224-019-09942-y.

[bib.bib25] [25] Ferenc Gécseg and Magnus Steinby. Tree languages. In Grzegorz Rozenberg and Arto Salomaa, editors, Handbook of Formal Languages, Volume 3: Beyond Words, pages 1–68. Springer, 1997. doi:10.1007/978-3-642-59126-6_1.

[bib.bib26] [26] Stefan Göller and Markus Lohrey. Fixpoint logics on hierarchical structures. Theory of Computing Systems, 48(1):93–131, 2009. doi:10.1007/s00224-009-9227-1.

[bib.bib27] [27] Annegret Habel and Hans-Jörg Kreowski. Some structural aspects of hypergraph languages generated by hyperedge replacement. In Proceedings of the 4th Annual Symposium on Theoretical Aspects of Computer Science, STACS 1987, volume 247 of Lecture Notes in Computer Science, pages 207–219. Springer, 1987. doi:10.1007/BFB0039608.

[bib.bib28] [28] Wojciech Kazana and Luc Segoufin. First-order query evaluation on structures of bounded degree. Logical Methods in Computer Science, 7(2), 2011. doi:10.2168/LMCS-7(2:20)2011.

[bib.bib29] [29] Wojciech Kazana and Luc Segoufin. Enumeration of first-order queries on classes of structures with bounded expansion. In Proceedings of the 32nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2013, pages 297–308. ACM, 2013. doi:10.1145/2463664.2463667.

[bib.bib30] [30] Wojciech Kazana and Luc Segoufin. Enumeration of monadic second-order queries on trees. ACM Transactions on Computational Logic, 14(4):25:1–25:12, 2013. doi:10.1145/2528928.

[bib.bib31] [31] Benny Kimelfeld, Wim Martens, and Matthias Niewerth. A formal language perspective on factorized representations. In Proceedings of the 28th International Conference on Database Theory, ICDT 2025, volume 328 of LIPIcs, pages 20:1–20:20. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2025. doi:10.4230/LIPICS.ICDT.2025.20.

[bib.bib32] [32] Stephan Kreutzer and Anuj Dawar. Parameterized complexity of first-order logic. Electronic Colloquium on Computational Complexity, TR09-131, 2009. URL: https://eccc.weizmann.ac.il/report/2009/131.

[bib.bib33] [33] N. Jesper Larsson and Alistair Moffat. Off-line dictionary-based compression. Proceedings of the IEEE, 88(11):1722–1732, 2000. doi:10.1109/5.892708.

[bib.bib34] [34] Thomas Lengauer. Hierarchical planarity testing algorithms. Journal of the ACM, 36(3):474–509, 1989. doi:10.1145/65950.65952.

[bib.bib35] [35] Thomas Lengauer and Klaus W. Wagner. The correlation between the complexities of the nonhierarchical and hierarchical versions of graph problems. Journal of Computer and System Sciences, 44:63–93, 1992. doi:10.1016/0022-0000(92)90004-3.

[bib.bib36] [36] Thomas Lengauer and Egon Wanke. Efficient solution of connectivity problems on hierarchically defined graphs. SIAM Journal on Computing, 17(6):1063–1080, 1988. doi:10.1137/0217068.

[bib.bib37] [37] Leonid Libkin. Elements of Finite Model Theory. Texts in Theoretical Computer Science. An EATCS Series. Springer, 2004. doi:10.1007/978-3-662-07003-1.

[bib.bib38] [38] Markus Lohrey. Algorithmics on SLP-compressed strings: A survey. Groups Complexity Cryptology, 4(2):241–299, 2012. doi:10.1515/gcc-2012-0016.

[bib.bib39] [39] Markus Lohrey. Model-checking hierarchical structures. Journal of Computer and System Sciences, 78(2):461–490, 2012. doi:10.1016/J.JCSS.2011.05.006.

[bib.bib40] [40] Markus Lohrey. Grammar-based tree compression. In Proceedings of the 19th International Conference on Developments in Language Theory, DLT 2015, volume 9168 of Lecture Notes in Computer Science, pages 46–57. Springer, 2015. doi:10.1007/978-3-319-21500-6_3.

[bib.bib41] [41] Markus Lohrey, Sebastian Maneth, and Roy Mennicke. XML tree structure compression using RePair. Information Systems, 38(8):1150–1167, 2013. doi:10.1016/J.IS.2013.06.006.

[bib.bib42] [42] Markus Lohrey, Sebastian Maneth, and Carl Philipp Reh. Constant-time tree traversal and subtree equality check for grammar-compressed trees. Algorithmica, 80(7):2082–2105, 2018. doi:10.1007/s00453-017-0331-3.

[bib.bib43] [43] Markus Lohrey, Sebastian Maneth, and Markus L. Schmid. FO-query enumeration over SLP-compressed structures of bounded degree, 2025. arXiv:2506.19421.

[bib.bib44] [44] Markus Lohrey, Sebastian Maneth, and Manfred Schmidt-Schauß. Parameter reduction and automata evaluation for grammar-compressed trees. Journal of Computer and System Sciences, 78(5):1651–1669, 2012. doi:10.1016/j.jcss.2012.03.003.

[bib.bib45] [45] Markus Lohrey and Markus L. Schmid. Enumeration for MSO-queries on compressed trees. Proceedings of the ACM on Management of Data, 2(2):78, 2024. doi:10.1145/3651141.

[bib.bib46] [46] Sebastian Maneth and Fabian Peternek. Grammar-based graph compression. Information Systems, 76:19–45, 2018. doi:10.1016/J.IS.2018.03.002.

[bib.bib47] [47] Sebastian Maneth and Fabian Peternek. Constant delay traversal of grammar-compressed graphs with bounded rank. Information and Computation, 273:104520, 2020. doi:10.1016/J.IC.2020.104520.

[bib.bib48] [48] Madhav V. Marathe, Harry B. Hunt III, Richard Edwin Stearns, and Venkatesh Radhakrishnan. Approximation algorithms for PSPACE-hard hierarchically and periodically specified problems. SIAM Journal on Computing, 27(5):1237–1261, 1998. doi:10.1137/S0097539795285254.

[bib.bib49] [49] Madhav V. Marathe, Harry B. Hunt III, and S. S. Ravi. The complexity of approximation PSPACE-complete problems for hierarchical specifications. Nordic Journal of Computing, 1(3):275–316, 1994. URL: https://www.cs.helsinki.fi/njc/njc1_papers/number3/paper1.pdf.

[bib.bib50] [50] Madhav V. Marathe, Venkatesh Radhakrishnan, Harry B. Hunt III, and S. S. Ravi. Hierarchically specified unit disk graphs. Theoretical Computer Science, 174(1–2):23–65, 1997. doi:10.1016/S0304-3975(96)00008-4.

[bib.bib51] [51] Martin Muñoz and Cristian Riveros. Constant-delay enumeration for SLP-compressed documents. Logical Methods in Computer Science, 21(1), 2025. doi:10.46298/LMCS-21(1:17)2025.

[bib.bib52] [52] Craig G. Nevill-Manning and Ian H. Witten. Identifying hierarchical structure in sequences: A linear-time algorithm. Journal of Artificial Intelligence Research, 7:67–82, 1997. doi:10.1613/JAIR.374.

[53] Dan Olteanu. Factorized databases: A knowledge compilation perspective. In Beyond NP, Papers from the 2016 AAAI Workshop, Phoenix, Arizona, USA, February 12, 2016, 2016. URL: http://www.aaai.org/ocs/index.php/WS/AAAIW16/paper/view/12638.

[54] Dan Olteanu and Maximilian Schleich. F: regression models over factorized views. Proceedings of the VLDB Endowment, 9(13):1573–1576, 2016. doi:10.14778/3007263.3007312.

[55] Dan Olteanu and Maximilian Schleich. Factorized databases. ACM SIGMOD Record, 45(2):5–16, 2016. doi:10.1145/3003665.3003667.

[56] Dan Olteanu and Jakub Závodný. Size bounds for factorised representations of query results. ACM Transactions on Database Systems, 40(1):2:1–2:44, 2015. doi:10.1145/2656335.

[57] Leonid Peshkin. Structure induction by lossless graph compression. In Proceedings of the 2007 Data Compression Conference, DCC 2007, pages 53–62. IEEE Computer Society, 2007. doi:10.1109/DCC.2007.73.

[58] William C. Rounds. Mappings and grammars on trees. Mathematical Systems Theory, 4(3):257–287, 1970. doi:10.1007/BF01695769.

[59] Markus L. Schmid and Nicole Schweikardt. Spanner evaluation over SLP-compressed documents. In Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2021, pages 153–165. ACM, 2021. doi:10.1145/3452021.3458325.

[60] Markus L. Schmid and Nicole Schweikardt. Query evaluation over SLP-represented document databases with complex document editing. In Proceedings of the International Conference on Management of Data, PODS 2022, pages 79–89. ACM, 2022. doi:10.1145/3517804.3524158.

[61] Nicole Schweikardt, Luc Segoufin, and Alexandre Vigny. Enumeration for FO queries over nowhere dense graphs. Journal of the ACM, 69(3):22:1–22:37, 2022. doi:10.1145/3517035.

[62] Detlef Seese. Linear time computable problems and first-order descriptions. Mathematical Structures in Computer Science, 6(6):505–526, 1996. doi:10.1017/S0960129500070079.

[63] Luc Segoufin. Constant delay enumeration for conjunctive queries. ACM SIGMOD Record, 44(1):10–17, 2015. doi:10.1145/2783888.2783894.

[64] Luc Segoufin and Alexandre Vigny. Constant delay enumeration for FO queries over databases with local bounded expansion. In Proceedings of the 20th International Conference on Database Theory, ICDT 2017, volume 68 of LIPIcs, pages 20:1–20:16. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2017. doi:10.4230/LIPICS.ICDT.2017.20.
