A Formal Language Perspective on Factorized Representations

Kimelfeld, Benny; Martens, Wim; Niewerth, Matthias

doi:10.4230/LIPIcs.ICDT.2025.20

Benny Kimelfeld (Technion, Haifa, Israel)

Wim Martens (University of Bayreuth, Germany)

Matthias Niewerth (University of Bayreuth, Germany)

Abstract

Factorized representations (FRs) are a well-known tool to succinctly represent results of join queries and have been originally defined using the named database perspective. We define FRs in the unnamed database perspective and use them to establish several new connections. First, unnamed FRs can be exponentially more succinct than named FRs, but this difference can be alleviated by imposing a disjointness condition on columns. Conversely, named FRs can also be exponentially more succinct than unnamed FRs. Second, unnamed FRs are the same as (i.e., isomorphic to) context-free grammars for languages in which each word has the same length. This tight connection allows us to transfer a wide range of results on context-free grammars to database factorization; of which we offer a selection in the paper. Third, when we generalize unnamed FRs to arbitrary sets of tuples, they become a generalization of path multiset representations, a formalism that was recently introduced to succinctly represent sets of paths in the context of graph database query evaluation.

Keywords and phrases:

Databases, relational databases, graph databases, factorized databases, regular path queries, compact representations

2012 ACM Subject Classification:

Information systems $\rightarrow$ Data management systems

Related Version:

Full Version: https://arxiv.org/abs/2309.11663 [38]

Funding:

This work was supported by the German Israeli Foundation for Scientific Research and Development (GIF), grant I-1502-407.6/2019. Martens and Niewerth were supported by ANR project EQUUS ANR-19-CE48-0019; funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), project number 431183758.

Event:

28th International Conference on Database Theory (ICDT 2025)

Editors:

Sudeepa Roy and Ahmet Kara

Series and Publisher:

Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik

1 Introduction

Factorized databases (FDBs) aim at succinctly representing the result of join queries by systematically avoiding redundancy. Since their introduction by Olteanu and Zavodny [54, 55], they were the inspiration and key technical approach toward the development of algorithms for efficient query evaluation [10, 52], including the construction of direct-access structures for join queries [18, 19], evaluation of aggregate queries [9, 62, 51], and the application of machine-learning algorithms over databases [59].

At the core of factorized databases are factorized relations. In essence, a factorized relation (FR) is a relational algebra query that builds the represented set of tuples. It involves data values and only two operators: union and Cartesian product. The restriction to these two operators provides succinctness and, at the same time, ensures the efficiency of downstream operations. We refer to [53] for a gentle introduction to factorized databases.

Factorized relations have been introduced in the named database perspective, where tuples are defined as functions from a set of attribute names to a set of attribute values [6] and are therefore unordered. In this paper, we explore FRs from the unnamed perspective, where tuples are ordered. Our motivation to explore this perspective is twofold. First, we believe that tuples in many database systems are ordered objects and second, we want to understand the relationship between factorized relations and path multiset representations (PMRs).

Path Multiset Representations (PMRs) were recently introduced to succinctly represent (multi)sets of paths in graph databases [41]. In particular, they aim at representing paths that match regular path queries (RPQs), which are the fundamental building block of modern graph database pattern matching [24, 30] and have been studied for decades [11, 12, 13, 16, 17, 22, 23, 27, 60]. Compared to traditional research, modern graph query languages such as Cypher [50, 31], SQL/PGQ [36, 24], and GQL [34, 24] use RPQs in a fundamentally new way. In a nutshell, in most of the research literature, an RPQ $q$ returns pairs of endpoints of paths that are matched by $q$ . In Cypher, SQL/PGQ, and GQL, it is possible to return the actual paths that match $q$ [24, 30], which is much less explored [42, 43, 44]. The challenge for PMRs is to succinctly represent the (possibly exponentially many or even infinitely many) paths that match an RPQ, and to allow query operations to be performed directly on the PMR. In fact, several experimental studies show that using PMRs can drastically speed up query evaluation [41, 26, 15].

While FDBs and PMRs aim at the same purpose of succinctly representing large (intermediate) results of queries, they are quite different. For instance, factorized databases represent database relations, which are sets of tuples of the same length and which are always finite. Path multiset representations represent (multi)sets of paths in graphs, with varying lengths, and these sets can be infinite. Finally, even though going from a fixed-length setting to a varying-length setting seems like a generalization, it is not clear how PMRs generalize factorized databases. Viewing FRs through the unnamed database perspective, however, will make the relationship between FRs and PMRs much clearer.

Our contributions are the following. Let us use the term named factorized representations (nFR) to refer to the $d$ -representations, also called factorized representations with definitions, introduced by Olteanu and Zavodny [54, 55]. We define unnamed factorized relations (uFRs) which are, analogously to the named case, relational algebra expressions built from data values, union, and Cartesian product. Although uFRs are conceptually very similar to nFRs, they are incomparable in size, since worst-case exponential blow-ups exist in both directions. The blow-up from uFRs to nFRs disappears, however, when we impose a disjointness condition on the columns of uFRs. We then observe that there exists a bijection $\beta$ between uFRs and a class of context-free grammars (CFGs) for languages in which all words have the same length. Furthermore, each uFR $F$ is isomorphic to the grammar $\beta(F)$ and its represented relation is the straightforward encoding of the language of $\beta(F)$ as tuples. Loosely speaking, this means that uFRs and this class of context-free grammars are the same thing.

This tight connection between uFRs and CFGs allows us to immediately infer a number of complexity results on uFRs, e.g., on their membership problem, on their equivalence problem (“Do two uFRs represent the same set of tuples?”), on the counting problem (“How many different tuples are represented by a uFR?”), their enumeration problem, and on size lower bounds. It also allows us to generalize uFRs to a setting in which database relations can contain tuples of different arities, which is a model that is currently being implemented by RelationalAI [58].

Finally, it allows us to clarify the connection between uFRs and PMRs. Loosely speaking, whereas uFRs are context-free grammars for uniform-length languages, PMRs are non-deterministic finite automata. In this sense, PMRs are more expressive than uFRs, because they can represent infinite objects. On uniform-length languages, uFRs and PMRs can represent the same languages, but PMRs are a special case of uFRs. Indeed, it is well-known that CFGs with rules only of the form $A\to bC$ and $A\to b$ are isomorphic to non-deterministic finite automata. In consequence, uFRs can be more succinct than PMRs.

Once we understand how PMRs and uFRs compare, we ask what the tradeoffs between them are if one were to use them as compact representations in the same system. In principle, a CFG for a finite language can be converted into an NFA or, conversely, an NFA representing a finite number of paths could be made even more succinct as a CFG. Different representations may have different benefits. For example, if we want to compute the complement of a relation $R$ represented by a uFR (i.e., a CFG), it may be a viable plan to convert the CFG into a DFA and use the trivial complement operation on DFAs (slightly modified so that we only take the complement on words of the relevant length). As such, even a naive algorithm for complementation is able to avoid fully materializing $R$ in some cases (for example, the cases where the CFG is right-linear). In this paper, we embark on this (quite extensive) tradeoff question and investigate relative blow-ups between the different relevant classes of context-free grammars and automata.

Further related work.

The present paper mainly aims at connecting areas that were previously thought to be different (to the best of our knowledge). In the named database perspective, factorized relations are known to be closely connected to decision diagrams. (See, e.g., [4].) A recent overview on connections between binary decision diagrams (BDDs) and various kinds of automata was made by Amarilli et al. [3]. The authors focus on translations that preserve properties of interest, like the number of objects represented (variables and truth assignments), even if the objects themselves change in the translation. Here, on the other hand, we focus on much stronger correspondences, namely isomorphisms. Put differently, we are interested in preserving not only certain properties of the represented objects, but even the objects themselves. Another difference between our work and the research on circuits is that the latter usually represent unordered objects; we focus on ordered tuples and paths (and connect with factorizations for unordered tuples in Section 2.3). In Section 5.3, we consider size bounds for (classes of) factorized relations in the unnamed (ordered) perspective, which has been investigated in the named perspective by Berkholz and Vinall-Smeeth [14].

This paper is certainly not the first that considers context-free grammars for compression purposes. In fact, the area of straight-line programs (see, e.g., [20, 32, 39]) is completely centered around this idea. The focus there, however, is on compactly representing a single word. Factorized relations, on the other hand, focus on representing a database relation, or a finite set of words.

Due to space limitations, some proofs are omitted and can be found in the full version of the paper [38].

2 Factorized Relations

Factorized relations were introduced by Olteanu and Závodný [55], who defined them in the named database perspective, where tuples are unordered [6]. We investigate them in the unnamed perspective (i.e., for ordered tuples), so that we can connect them to sets of paths in graphs (Section 5), which are inherently ordered.

2.1 The Named and Unnamed Perspectives

The principles of databases can be studied in the named perspective and the unnamed perspective, which are slightly different mathematical definitions of the relational model [6, Chapter 2]. Let us provide a bit of background on both, since the difference between them is important in this paper.

Let $\mathsf{Val}$ and $\mathsf{Att}$ be two disjoint countably infinite sets of values and attribute names. In the named perspective, a database tuple is defined as a function $t\colon X\to\mathsf{Val}$ , where $X$ is a finite subset of $\mathsf{Att}$ . Such functions $t$ are usually denoted as $\langle A_{1}\!:\!a_{1},A_{2}\!:\!a_{2},\ldots,A_{k}\!:\!a_{k}\rangle$ , to say that $t(A_{i})=a_{i}$ for every $i\in[k]$ . Notice that tuples are unordered in the sense that $\langle A\!:\!a,B\!:\!b\rangle$ and $\langle B\!:\!b,A\!:\!a\rangle$ denote the same function. The arity of $t$ is $|X|$ . A database relation is a set of tuples that are defined over the same set $X$ . A schema in the named perspective assigns a finite set of attribute names to each relation $R$ in the database. The semantics is that each tuple in $R$ should then be a tuple over the set of attribute names assigned by the schema.

A path in a graph is usually an ordered sequence. (Sometimes containing only nodes, sometimes only edges, and sometimes nodes and edges depending on the concrete definition of the graph or multigraph.) Such paths can therefore be seen as ordered tuples of the form $(u_{1},\ldots,u_{n})$ , where $u_{1},\ldots,u_{n}$ are objects in the graph.

In the unnamed perspective, a ( $k$ -ary) database tuple is simply an element of $\mathsf{Val}^{k}$ . A database relation is a finite set of tuples of the same arity. We define schemas in the unnamed perspective a bit differently from the standard definition [6, Chapter 2], in order to be closer to the named perspective. (In the standard definition, a schema simply assigns an arity to each relation name $R$ . The semantics is that each tuple in $R$ should have the arity that is assigned by the schema.)

2.2 Unnamed Factorized Relations

For defining unnamed factorized relations (uFR), we follow the intuition that factorized relations are relational algebra expressions that can use names to re-use subexpressions [55]. We also follow [55] in disallowing unions of $\emptyset$ with non-empty relations. Let $X\subseteq\mathsf{Names}$ be a finite set of expression names and let $\mathord{\mathit{ar}}\colon X\to\mathbb{N}$ be a function that associates an arity to each name in $X$ . A relation expression that references $X$ , or $X$ -expression for short, is a relational algebra expression built from singletons, products, unions, and names from $X$ . We inductively define $X$ -expressions $E$ and their associated arity $\mathord{\mathit{ar}}(E)$ as follows:

(empty): $E=\emptyset$ is an $X$ -expression with $\mathord{\mathit{ar}}(E)=0$ ;
(nullary tuple): $E=\langle\rangle$ is an $X$ -expression with $\mathord{\mathit{ar}}(E)=0$ ;
(singleton): $E=\langle a\rangle$ is an $X$ -expression for each $a\in\mathsf{Val}$ , with $\mathord{\mathit{ar}}(E)=1$ ;
(name reference): $E=N$ is an $X$ -expression for each $N\in X$ , with $\mathord{\mathit{ar}}(E)=\mathord{\mathit{ar}}(N)$ ;
(union): for $X$ -expressions $E_{1},\ldots,E_{n}$ with $\mathord{\mathit{ar}}(E_{1})=\cdots=\mathord{\mathit{ar}}(E_{n})$ , we have that $E=(E_{1}\cup\cdots\cup E_{n})$ is an $X$ -expression with $\mathord{\mathit{ar}}(E)=\mathord{\mathit{ar}}(E_{1})=\cdots=\mathord{\mathit{ar}}(E_{n})$ ;
(product): for $X$ -expressions $E_{1},\ldots,E_{n}$ , we have that $E=(E_{1}\times\cdots\times E_{n})$ is an $X$ -expression with $\mathord{\mathit{ar}}(E)=\sum_{i\in[n]}\mathord{\mathit{ar}}(E_{i})$ .

Definition 2.1.

A $k$ -ary unnamed factorized relation (uFR) is a pair $F=(S,D)$ , where $S\in\mathsf{Names}$ is the start symbol and $D=\{N_{1}\mathrel{\mathop{:}}=E_{1},\ldots,N_{n}\mathrel{\mathop{:}}=E_{n}\}$ is a set of expressions where:

1. $N_{1}=S$ and $\mathord{\mathit{ar}}(N_{1})=k$ ;
2. each $N_{i}$ is an expression name;
3. each $E_{i}$ is an $X_{i}$ -expression for $X_{i}=\{N_{i+1},\dots,N_{n}\}$ ; and
4. $\mathord{\mathit{ar}}(N_{i})=\mathord{\mathit{ar}}(E_{i})$ for all $i=1,\dots,n$ .

Note that the expression $E_{n}$ does not use name references. Hence, $E_{n}$ can be evaluated without resolving references, and the result is a relation $R_{n}$ . Once we have $R_{n}$ , we can construct $R_{n-1}$ from $E_{n-1}$ by replacing $N_{n}$ with $R_{n}$ . We continue this way until we obtain $R_{1}$ , which is a $k$ -ary relation, and is the relation that $F$ represents. We denote this relation, $R_{1}$ , by $\llbracket F\rrbracket$ . Furthermore, we will denote $R_{i}$ by $\llbracket N_{i}\rrbracket$ for every $i\in[n]$ .
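The bottom-up evaluation just described can be sketched in Python. This is only an illustration under our own assumptions: expressions are encoded as nested tuples with the hypothetical tags `"empty"`, `"unit"` (for $\langle\rangle$), `"val"`, `"ref"`, `"union"`, and `"prod"`, and a uFR is given as a list of (name, expression) pairs in the order $N_{1},\ldots,N_{n}$.

```python
from itertools import product

def evaluate(defs):
    """Evaluate a uFR given as an ordered list [(name, expr), ...].

    The first entry plays the role of the start symbol. Each expression may
    only reference names that occur later in the list, so processing the list
    in reverse resolves every reference before it is needed.
    """
    rel = {}  # name -> set of tuples, i.e. the relation [[N_i]]

    def ev(e):
        tag = e[0]
        if tag == "empty":   # the empty relation
            return set()
        if tag == "unit":    # the nullary tuple <>
            return {()}
        if tag == "val":     # a singleton <a>
            return {(e[1],)}
        if tag == "ref":     # a name whose relation is already computed
            return rel[e[1]]
        if tag == "union":
            return set().union(*(ev(x) for x in e[1]))
        if tag == "prod":    # Cartesian product = tuple concatenation
            parts = [ev(x) for x in e[1]]
            return {sum(combo, ()) for combo in product(*parts)}
        raise ValueError(f"unknown tag: {tag}")

    for name, expr in reversed(defs):
        rel[name] = ev(expr)
    return rel

# Hypothetical two-definition uFR: N1 := N2 x N2, N2 := <0> u <1>.
example = [
    ("N1", ("prod", [("ref", "N2"), ("ref", "N2")])),
    ("N2", ("union", [("val", 0), ("val", 1)])),
]
relations = evaluate(example)
```

Here `relations["N1"]` plays the role of $\llbracket F\rrbracket$. Note that this sketch materializes every intermediate relation, which is exactly what factorized representations are meant to avoid in practice.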

Note that, since $D$ is a set, we should indicate which element is $S$ . Indeed, choosing a different start symbol changes $\llbracket F\rrbracket$ (and may make some parts of $F$ useless since they cannot be reached from $S$ ). One may use a notational convention that we always take $S$ to be the first name that we write down in $D$ , as is done in [55, Section 4.2].

(a) A database example.

(b) Visualization of a factorized relation representing Customer $\bowtie$ PurchaseHistory $\bowtie$ Supplies.

\begin{array}{ll@{\hspace{1cm}}ll}
S &\to A_{1}\cup A_{2}\cup A_{3} & & \\
A_{1} &\to B_{1}\cdot\text{n1} & A_{2} &\to B_{2}\cdot\text{n2}\\
A_{3} &\to B_{3}\cdot\text{n3} & & \\
B_{1} &\to \text{c1}\cdot C_{1} & B_{2} &\to \text{c2}\cdot C_{2}\\
B_{3} &\to \text{c3}\cdot C_{3} & & \\
C_{1} &\to P_{1}\cup P_{2} & C_{2} &\to P_{2}\cup P_{3}\\
C_{3} &\to P_{1}\cup P_{4} & & \\
P_{1} &\to \text{flute}\cdot D & P_{2} &\to \text{wire}\cdot\text{s3}\\
P_{3} &\to \text{harp}\cdot D & P_{4} &\to \text{phone}\cdot\text{s3}\\
D &\to \text{s1}\cup\text{s2} & &
\end{array}

(c) Context-free grammar isomorphic to the factorized relation.

Figure 1: Factorized relation and context-free grammar representing the join of the relations in Figure 1(a).

Example 2.2.

Consider the database in Figure 1(a). We assume that the attributes of each relation are ordered from left to right, i.e., cid is the first attribute of Customer, and so on. The following is a factorized relation that represents Customer $\bowtie$ PurchaseHistory $\bowtie$ Supply. (The ordering of rules is top-to-bottom, left-to-right; and the start symbol is $N_{1}$ .)

\begin{array}{ll@{\hspace{1cm}}ll@{\hspace{1cm}}ll}
N_{1} &:= A_{1}\cup A_{2}\cup A_{3} & B_{2} &:= \text{c2}\times C_{2} & P_{1} &:= \text{flute}\times D\\
A_{1} &:= B_{1}\times\text{n1} & B_{3} &:= \text{c3}\times C_{3} & P_{2} &:= \text{wire}\times\text{s3}\\
A_{2} &:= B_{2}\times\text{n2} & C_{1} &:= P_{1}\cup P_{2} & P_{3} &:= \text{harp}\times D\\
A_{3} &:= B_{3}\times\text{n3} & C_{2} &:= P_{2}\cup P_{3} & P_{4} &:= \text{phone}\times\text{s3}\\
B_{1} &:= \text{c1}\times C_{1} & C_{3} &:= P_{1}\cup P_{4} & D &:= \text{s1}\cup\text{s2}
\end{array}

Figure 1(b) contains a visualization of the factorized relation. (We annotated some of the elements of $\mathsf{Names}$ in orange and omitted $\langle\cdot\rangle$ around data values.)
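As a sanity check of Example 2.2, the following Python sketch enumerates the tuples that the definitions represent. The encoding is our own assumption: every definition is written as a list of alternatives (the union), each alternative being a list of names or values (the product), and any string without a definition is treated as a data value.

```python
from functools import lru_cache
from itertools import product

# The definitions of Example 2.2: name -> list of alternatives (union),
# each alternative a list of names or values (product).
RULES = {
    "N1": [["A1"], ["A2"], ["A3"]],
    "A1": [["B1", "n1"]], "A2": [["B2", "n2"]], "A3": [["B3", "n3"]],
    "B1": [["c1", "C1"]], "B2": [["c2", "C2"]], "B3": [["c3", "C3"]],
    "C1": [["P1"], ["P2"]], "C2": [["P2"], ["P3"]], "C3": [["P1"], ["P4"]],
    "P1": [["flute", "D"]], "P2": [["wire", "s3"]],
    "P3": [["harp", "D"]], "P4": [["phone", "s3"]],
    "D": [["s1"], ["s2"]],
}

@lru_cache(maxsize=None)
def tuples_of(symbol):
    """Return the set of tuples represented by a name or a data value."""
    if symbol not in RULES:            # a data value: the singleton <a>
        return frozenset({(symbol,)})
    out = set()
    for alternative in RULES[symbol]:  # union over the alternatives
        parts = [tuples_of(s) for s in alternative]
        for combo in product(*parts):  # product = tuple concatenation
            out.add(sum(combo, ()))
    return frozenset(out)

result = tuples_of("N1")
```

With these definitions, `result` contains nine tuples of arity four, for instance `("c1", "flute", "s1", "n1")`. Shared names such as `P1` and `D` are evaluated only once thanks to memoization, which mirrors the reuse of subexpressions in the factorized relation.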

2.3 Relationship to Named Factorized Relations

We refer to the $d$ -representations of Olteanu and Zavodny [55] as named factorized relations (nFR). For an nFR $F$ , we use $\llbracket F\rrbracket$ to denote the named relation $R$ represented by $F$ . (Footnote 1: $\llbracket F\rrbracket$ is formally defined in [55]. The definition is completely analogous to Definition 2.1 but starts from tuples in the named perspective. We therefore do not repeat the definition here.) Although the definitions of named and unnamed factorized relations are similar, we show that unnamed factorized relations can be exponentially more succinct than named factorized relations and vice versa. In the direction from uFRs to nFRs, the exponential size difference is due to the ability of uFRs to factorize “horizontally” and can be alleviated by requiring that every column uses different values. The exponential size difference in the other direction is due to the unordered nature of nFRs and tuples in the named perspective.

In order to compare uFRs with nFRs, we need to explain when we consider an nFR and a uFR to be equivalent. We consider the standard conversion between named and unnamed database relations in [6, Chapter 2] and write $\textsf{unnamed}(R)$ for the unnamed relation obtained from converting the named relation $R$ to the unnamed perspective. (Intuitively, the conversion assumes a fixed ordering on relation-attribute pairs and converts tuples $\langle A_{1}\colon a_{1},\ldots,A_{k}\colon a_{k}\rangle$ in a named relation $R$ into tuples $(a_{1},\ldots,a_{k})$ .) Finally, we say that an nFR $F_{1}$ and uFR $F_{2}$ are equivalent if $\textsf{unnamed}(\llbracket F_{1}\rrbracket)=\llbracket F_{2}\rrbracket$ .
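The conversion can be illustrated by a minimal Python sketch. It assumes, purely for illustration, that named tuples are dictionaries from attribute names to values and that the fixed attribute ordering is passed explicitly; the function name `unnamed` mirrors the notation above.

```python
def unnamed(relation, attribute_order):
    """Convert a named relation into an unnamed one.

    relation: iterable of dicts mapping attribute names to values.
    attribute_order: the fixed ordering of the attribute names.
    """
    return {tuple(t[a] for a in attribute_order) for t in relation}

# <A:1, B:2> and <B:2, A:1> denote the same named tuple,
# so they collapse to the single ordered tuple (1, 2).
named_relation = [{"A": 1, "B": 2}, {"B": 2, "A": 1}, {"A": 3, "B": 4}]
converted = unnamed(named_relation, ["A", "B"])
```

Choosing a different attribute ordering, e.g. `["B", "A"]`, yields a different unnamed relation, which is why the ordering has to be fixed before an nFR and a uFR can be compared.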

Proposition 2.3.

(1) For each $n\in\mathbb{N}$ , there exists a uFR of size $O(n)$ such that the smallest equivalent nFR is exponentially larger.
(2) For each $n\in\mathbb{N}$ , there exists an nFR of size $O(n)$ such that the smallest equivalent uFR is exponentially larger.

Proof sketch.

For (1), consider the unnamed factorized relation $F$ with the expressions

N_{1}:=N_{2}\times N_{2}\quad N_{2}:=N_{3}\times N_{3}\quad\cdots\quad N_{n-1}:=N_{n}\times N_{n}\quad N_{n}:=M\times M\quad M:=\langle 0\rangle\cup\langle 1\rangle\;.

This uFR defines the set of all $2^{n}$ -ary tuples with values in $\{0,1\}$ . Notice that the uFR is exponentially smaller than the arities of the tuples that it defines, and doubly exponentially smaller than the number of tuples in $\llbracket F\rrbracket$ , which is $2^{2^{n}}$ . The following can be shown using a standard inductive argument.

$\vartriangleright$ Claim 2.4.

For every nFR $F$ , the size of $F$ is at most exponentially smaller than the size of $\llbracket F\rrbracket$ .
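To make the numbers in part (1) concrete, the following Python sketch builds the uFR $N_{1}:=N_{2}\times N_{2},\ldots,N_{n}:=M\times M$, $M:=\langle 0\rangle\cup\langle 1\rangle$ and computes the arity and the number of represented tuples without materializing them. The nested-tuple encoding and the function names are our own assumptions. Counting by adding over unions is only correct when the operands of each union are disjoint, which holds for this particular uFR.

```python
def doubling_ufr(n):
    """N_1 := N_2 x N_2, ..., N_{n-1} := N_n x N_n, N_n := M x M, M := <0> u <1>."""
    defs = [(f"N{i}", ("prod", [("ref", f"N{i + 1}")] * 2)) for i in range(1, n)]
    defs.append((f"N{n}", ("prod", [("ref", "M"), ("ref", "M")])))
    defs.append(("M", ("union", [("val", 0), ("val", 1)])))
    return defs

def arity_and_count(defs):
    """Return {name: (arity, number of tuples)}, computed bottom-up.

    Assumes that every union has pairwise disjoint operands; otherwise the
    count is only an upper bound.
    """
    info = {}

    def go(e):
        tag = e[0]
        if tag == "val":
            return (1, 1)
        if tag == "ref":
            return info[e[1]]
        parts = [go(x) for x in e[1]]
        if tag == "union":   # same arity everywhere, counts add up
            return (parts[0][0], sum(c for _, c in parts))
        arity, count = 0, 1  # product: arities add up, counts multiply
        for a, c in parts:
            arity += a
            count *= c
        return (arity, count)

    for name, expr in reversed(defs):
        info[name] = go(expr)
    return info

arity, count = arity_and_count(doubling_ufr(3))["N1"]
```

For $n=3$ this gives arity $2^{3}=8$ and $2^{2^{3}}=256$ tuples, while the uFR itself has only four definitions.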

For (2), consider a relation $R[A_{1},\ldots,A_{n},B_{1},\ldots,B_{n}]$ with the fixed attribute ordering being $R.A_{1}<\cdots<R.A_{n}<R.B_{1}<\cdots<R.B_{n}$ for converting from the named to unnamed perspective. Consider the set of tuples $R=\{t\mid\pi_{A_{1},\ldots,A_{n}}t=\pi_{B_{1},\ldots,B_{n}}t\}$ in which all data values are $0$ or $1$ . It is easy to construct an nFR for $R$ of size $O(n)$ : take $S_{1}:=((\langle A_{1}:0\rangle\times\langle B_{1}:0\rangle)\cup(\langle A_{1}:1\rangle\times\langle B_{1}:1\rangle))\times S_{2}$ , $S_{2}:=((\langle A_{2}:0\rangle\times\langle B_{2}:0\rangle)\cup(\langle A_{2}:1\rangle\times\langle B_{2}:1\rangle))\times S_{3}$ , etc. The lower bound for uFRs is proved in Corollary 4.4. $\hfill\blacktriangleleft$

The significant size increase when going from the unnamed to the named perspective exists because values in the named perspective are “never the same in different columns”. Indeed, in the named perspective, tuples $\langle A_{1}\colon a_{1},\ldots,A_{k}\colon a_{k}\rangle$ are such that all attribute names $A_{i}$ are pairwise distinct. Therefore, the named values $A_{i}\colon a_{i}$ are also pairwise distinct.

In the unnamed perspective, we can enforce the same restriction by requiring that coordinates in tuples come from disjoint domains. (This notion is similar to the notion of decomposability in circuits [3].) We say that a relation $R$ has disjoint positions if no value occurs in two different columns. More precisely, if $\vec{a}[i]\neq\vec{b}[j]$ for all $\vec{a},\vec{b}\in R$ and $1\leq i<j\leq k$ . We say that an unnamed factorized relation $F$ has disjoint positions if $\llbracket F\rrbracket$ has disjoint positions.
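The disjoint-positions condition on a relation is easy to test directly. The Python sketch below is a minimal illustration with function names of our own choosing. It also shows one way to enforce the condition, namely pairing every value with its column index, which mimics the named values $A_{i}\colon a_{i}$.

```python
def has_disjoint_positions(relation):
    """Check that no value occurs in two different columns of the relation."""
    column_of = {}  # value -> the column in which it was first seen
    for t in relation:
        for i, value in enumerate(t):
            if column_of.setdefault(value, i) != i:
                return False
    return True

def tag_columns(relation):
    """Pair each value with its column index, making the columns disjoint."""
    return {tuple((i, v) for i, v in enumerate(t)) for t in relation}

binary_pairs = {(0, 0), (0, 1), (1, 0), (1, 1)}
tagged = tag_columns(binary_pairs)
```

Here `binary_pairs` does not have disjoint positions, because both values occur in both columns, whereas `tagged` does.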

Proposition 2.5.

For each $n\in\mathbb{N}$ and uFR of size $n$ with disjoint positions, there is an equivalent nFR of size $n$ .

3 Context-Free Grammars and Their Connection to FRs

Let $\Sigma$ be a finite set, whose elements we call symbols. By $\varepsilon$ we denote the empty word, that is, the word of length $0$ . By $\Sigma^{*}$ we denote the set of all words over $\Sigma$ , i.e., the set of words $w=a_{1}\cdots a_{n}$ with $n\geq 0$ and $a_{i}\in\Sigma$ for every $i\in[n]$ . We assume that $\varepsilon$ and $\emptyset$ are not elements of $\Sigma$ . A regular expression (RE) over $\Sigma$ is inductively defined as follows. Every $a\in\Sigma$ is a regular expression, and so are the symbols $\varepsilon$ and $\emptyset$ . Furthermore, if $e_{1}$ and $e_{2}$ are regular expressions over $\Sigma$ , then so are $e_{1}\cdot e_{2}$ (concatenation), $e_{1}\cup e_{2}$ (union), and $e_{1}^{*}$ (Kleene star). As usual, we often omit the concatenation operator in our notation. When taking $e_{1}\cup e_{2}$ , we assume that neither $e_{1}$ nor $e_{2}$ is $\emptyset$ . (Footnote 2: Indeed, a union with $\emptyset$ is never useful and the restriction is easy to enforce. We make this restriction because unions in uFRs are defined with the same restriction. Theorem 3.8, which states that uFRs and uniform-length ECFGs are the same, also holds if unions in uFRs and REs both allow the empty set.) By $L(e)$ we denote the language of $e$ , which is defined as usual.

Definition 3.1 (see, e.g., [40]).

An extended context-free grammar (abbreviated ECFG) consists of rules where nonterminals are defined using arbitrary regular expressions over terminals and nonterminals. Formally, an ECFG is a tuple $G=(T,N,S,R)$ where:

$\blacksquare$ $T$ is a finite set of terminals;
$\blacksquare$ $N$ is a finite set of nonterminals such that $T\cap N=\emptyset$ ;
$\blacksquare$ $S\in N$ is the start symbol; and
$\blacksquare$ $R$ is a finite set of rules of the form $A\rightarrow e$ , where $e$ is a regular expression over $T\cup N$ .

For the purpose of this paper, we will always choose $T\subseteq\mathsf{Val}$ and $N\subseteq\mathsf{Names}$ . Furthermore, we assume that all terminals in $T$ are actually used in the grammar, that is, for each terminal $a$ , there exists a rule $A\rightarrow e$ such that $a$ appears in the expression $e$ .

A derivation step of $G$ is a pair $(u,v)$ of words in $(T\cup N)^{*}$ such that $u=\alpha X\beta$ and $v=\alpha\gamma\beta$ where $X\in N$ and $\alpha,\beta,\gamma\in(T\cup N)^{*}$ , and where $R$ contains a rule $X\to e$ with $\gamma\in L(e)$ . We denote such a derivation step as $u\Rightarrow_{G}v$ . A derivation is a sequence $u_{0},\ldots,u_{n}$ such that $u_{i-1}\Rightarrow_{G}u_{i}$ for every $i\in[n]$ . We denote by $u\Rightarrow^{*}_{G}v$ that there exists a derivation that starts in $u$ and ends in $v$ and by $u\Rightarrow^{+}_{G}v$ the case where this derivation has at least one step. By $L(u)$ we denote the language $\{w\in T^{*}\mid u\Rightarrow^{*}_{G}w\}$ . Finally, the language of $G$ , denoted $L(G)$ , is the language $L(S)$ of words derived from the start symbol $S$ .

A nonterminal $A\in N$ is useful if there exists a derivation $S\Rightarrow^{*}_{G}\alpha A\beta\Rightarrow^{*}_{G}w$ for some word $w\in T^{*}$ . The grammar $G=(T,N,S,R)$ is trimmed if every nonterminal in $N$ is useful. It is well-known that an ECFG can be converted into a trimmed ECFG in linear time. A grammar $G$ is recursive if there exists a derivation $A\Rightarrow^{+}_{G}\alpha A\beta$ for some $A\in N$ and $\alpha,\beta\in(T\cup N)^{*}$ .
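The notion of useful nonterminals can be illustrated with a short Python sketch. It works on plain CFGs rather than ECFGs and assumes an encoding of our own: a dictionary from nonterminals to lists of alternatives, each alternative being a list of symbols, where every symbol that is not a key of the dictionary is a terminal. The sketch uses a simple fixpoint computation and is therefore not the linear-time procedure mentioned above.

```python
def useful_nonterminals(rules, start):
    """Return the nonterminals A with S =>* alpha A beta =>* w for a terminal word w."""
    # Step 1: productive nonterminals, i.e. those deriving some terminal word.
    productive = set()
    changed = True
    while changed:
        changed = False
        for a, alternatives in rules.items():
            if a in productive:
                continue
            if any(all(s not in rules or s in productive for s in alt)
                   for alt in alternatives):
                productive.add(a)
                changed = True
    if start not in productive:
        return set()

    # Step 2: reachability from the start symbol, using only alternatives
    # in which every nonterminal is productive.
    reachable, stack = {start}, [start]
    while stack:
        a = stack.pop()
        for alt in rules[a]:
            if all(s not in rules or s in productive for s in alt):
                for s in alt:
                    if s in rules and s not in reachable:
                        reachable.add(s)
                        stack.append(s)
    return reachable

# Hypothetical grammar: U derives no terminal word, X is unreachable from S.
grammar = {
    "S": [["A", "b"], ["U"]],
    "A": [["a"]],
    "U": [["U", "a"]],
    "X": [["a"]],
}
useful = useful_nonterminals(grammar, "S")
```

Removing all nonterminals outside `useful`, together with the alternatives that mention them, yields a trimmed grammar for the same language.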

$\blacktriangleright$ Remark 3.2.

Context-free grammars (CFGs) are defined analogously to ECFGs, except that rules are required to be of the form $A\rightarrow\alpha_{1}\cup\cdots\cup\alpha_{n}$ , where each $\alpha_{i}$ is a concatenation over $T\cup N$ . It is well known that CFGs and ECFGs have the same expressiveness and that they can be translated back and forth in linear time [35, p. 202].

3.1 Isomorphisms

In this section we want to define when we consider a uFR and an ECFG to be isomorphic. To warm up, we first explain when we consider two ECFGs $G_{1}$ and $G_{2}$ to be isomorphic. Intuitively, this is the case when they are the same up to renaming of non-terminals. Formally, we define isomorphisms using a function $h\colon\mathsf{Names}\to\mathsf{Names}$ that we extend to regular expressions as follows:

\begin{array}{rcll@{\hspace{1cm}}rcl}
h(\emptyset) &=& \emptyset & & h(e_{1}\cdot e_{2}) &=& h(e_{1})\cdot h(e_{2})\\
h(\varepsilon) &=& \varepsilon & & h(e_{1}\cup e_{2}) &=& h(e_{1})\cup h(e_{2})\\
h(a) &=& a & \text{ for every $a$ in $\mathsf{Val}$} & h(e^{*}) &=& h(e)^{*}
\end{array}

Then, $G_{1}=(T_{1},N_{1},S_{1},R_{1})$ is isomorphic to $G_{2}=(T_{2},N_{2},S_{2},R_{2})$ if there is a bijective function $h\colon N_{1}\to N_{2}$ such that $h(S_{1})=S_{2}$ , for each rule $A\to e$ in $R_{1}$ , the rule $h(A)\to h(e)$ is in $R_{2}$ , and each rule in $R_{2}$ is of the form $h(A)\to h(e)$ for some rule $A\to e$ in $R_{1}$ .

Example 3.3.

(Extended) context-free grammars are isomorphic if and only if they are the same up to renaming of non-terminals. For example, the grammars

$\begin{array}{rl@{\hspace{.5cm}}rl}S&\to ASB\cup C^{*}&A&\to a\\ B&\to b&C&\to c\end{array}$ and $\begin{array}{rl@{\hspace{.5cm}}rl}S&\to XSY\cup Z^{*}&X&\to a\\ Y&\to b&Z&\to c\end{array}$

(both with start symbol $S$ ) are isomorphic. They generate the language $\{a^{n}c^{m}b^{n}\mid n,m\in\mathbb{N}\}$ .

Observation 3.4.

If $G_{1}$ and $G_{2}$ are isomorphic, then $L(G_{1})=L(G_{2})$ .

We want to extend this notion of isomorphism to factorized relations and want to maintain a property such as Observation 3.4 which says that, if the objects are syntactically isomorphic, also their semantics is the same. To this end, for a tuple $t=\langle a_{1},\ldots,a_{k}\rangle$ , we denote the word $a_{1}\cdots a_{k}$ as $\mathrm{word}(t)$ . Hence, $\mathrm{word}(\langle\rangle)=\varepsilon$ . For a set $T$ of tuples, we define $\mathrm{word}(T)\mathrel{\mathop{:}}=\{\mathrm{word}(t)\mid t\in T\}$ .

We now define isomorphisms between uFRs and (star-free) ECFGs. The idea is analogous to the one before, but with the difference that the Cartesian product operator ( $\times$ ) in uFRs is replaced by the concatenation operator ( $\cdot$ ) in ECFGs. That is, we define isomorphisms using a function $h\colon\mathsf{Names}\to\mathsf{Names}$ that we extend to $X$ -expressions as follows:

\begin{array}{rcll@{\hspace{1cm}}rcl}
h(\emptyset) &=& \emptyset & & h(E_{1}\times E_{2}) &=& h(E_{1})\cdot h(E_{2})\\
h(\langle\rangle) &=& \varepsilon & & h(E_{1}\cup E_{2}) &=& h(E_{1})\cup h(E_{2})\\
h(\langle a\rangle) &=& a & \text{ for every $a$ in $\mathsf{Val}$} & & &
\end{array}

A uFR $(N_{1},D)$ with $D=\{N_{1}\mathrel{\mathop{:}}=E_{1},\dots,N_{n}\mathrel{\mathop{:}}=E_{n}\}$ is isomorphic to an ECFG $G=(T,N,S,R)$ if there is a bijective function $h\colon\{N_{1},\dots,N_{n}\}\to N$ such that

$\blacksquare$ the start symbol $S$ is $h(N_{1})$ ;
$\blacksquare$ for every expression $N_{i}\mathrel{\mathop{:}}=E_{i}$ in $D$ , the rule $h(N_{i})\rightarrow h(E_{i})$ is in $R$ ; and
$\blacksquare$ for every rule $N\rightarrow e$ in $R$ , there is an expression $N_{i}\mathrel{\mathop{:}}=E_{i}$ in $D$ with $h(N_{i})=N$ and $h(E_{i})=e$ .
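The translation induced by $h$ can be sketched in Python as follows. The encoding is an assumption made for illustration: $X$-expressions are nested tuples with the tags `"empty"`, `"unit"`, `"val"`, `"ref"`, `"union"`, and `"prod"`, regular expressions use the tags `"empty"`, `"eps"`, `"sym"`, `"nt"`, `"union"`, and `"concat"`, and $h$ is a dictionary that renames expression names to nonterminals.

```python
def h_expr(e, h):
    """Extend the renaming h from names to whole X-expressions."""
    tag = e[0]
    if tag == "empty":
        return ("empty",)
    if tag == "unit":    # the nullary tuple <> becomes the empty word
        return ("eps",)
    if tag == "val":     # a singleton <a> becomes the terminal a
        return ("sym", e[1])
    if tag == "ref":     # a name becomes the nonterminal h(name)
        return ("nt", h[e[1]])
    if tag == "union":
        return ("union", [h_expr(x, h) for x in e[1]])
    if tag == "prod":    # Cartesian product becomes concatenation
        return ("concat", [h_expr(x, h) for x in e[1]])
    raise ValueError(f"unknown tag: {tag}")

def terminals(e):
    """Collect the terminals that occur in a translated expression."""
    if e[0] == "sym":
        return {e[1]}
    if e[0] in ("union", "concat"):
        return set().union(*(terminals(x) for x in e[1]))
    return set()

def to_ecfg(defs, h):
    """Build the ECFG (T, N, S, R) that is isomorphic to the uFR via h."""
    rules = {h[name]: h_expr(expr, h) for name, expr in defs}
    used_terminals = set().union(*(terminals(e) for e in rules.values()))
    return used_terminals, set(rules), h[defs[0][0]], rules

# Hypothetical uFR: N1 := N2 x <n1>, N2 := <a> u <b>.
ufr = [
    ("N1", ("prod", [("ref", "N2"), ("val", "n1")])),
    ("N2", ("union", [("val", "a"), ("val", "b")])),
]
T, N, S, R = to_ecfg(ufr, {"N1": "S", "N2": "A"})
```

Since the translation only relabels nodes of the expression tree, the resulting grammar has exactly one rule per definition of the uFR.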

Example 3.5.

Consider the uFR in Example 2.2, which is visualized in Figure 1(b). It is routine to check that it is isomorphic to the extended context-free grammar in Figure 1(c). In fact, the isomorphism in this case is the identity function.

We now observe that isomorphisms preserve the size and, in a strong sense, also the semantics of uFRs and ECFGs. To this end, the size $|E|$ of an $X$ -expression or regular expression $E$ is defined to be the number of occurrences of symbols plus the number of occurrences of operators in $E$ . For example, $|\langle a\rangle|=1$ , $|(\langle a\rangle\cup(\langle\rangle\times\langle b\rangle))|=5$ , and $|(a\cdot b)^{*}|=4$ . The size of a uFR (resp., ECFG) is the sum of the sizes of its $X$ -expressions (resp., regular expressions).
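The size measure can be computed by a short recursive function. The Python sketch below assumes a nested-tuple encoding of our own in which unions, products, and concatenations carry a list of operands and the Kleene star carries a single operand. It counts an operator with $n$ operands as $n-1$ occurrences of the binary operator, which is our reading of the definition.

```python
def size(e):
    """Number of symbol occurrences plus operator occurrences in an expression."""
    tag = e[0]
    if tag in ("union", "prod", "concat"):
        operands = e[1]
        # n operands are joined by n - 1 occurrences of the binary operator
        return sum(size(x) for x in operands) + max(len(operands) - 1, 0)
    if tag == "star":
        return size(e[1]) + 1
    return 1  # a single symbol: a value, a name, <>, epsilon, or the empty set

single = ("val", "a")                                   # <a>
nested = ("union", [("val", "a"),
                    ("prod", [("unit",), ("val", "b")])])  # (<a> u (<> x <b>))
starred = ("star", ("concat", [("sym", "a"), ("sym", "b")]))  # (a . b)*
sizes = (size(single), size(nested), size(starred))
```

The three expressions are the ones from the examples in the text, and the function returns the sizes 1, 5, and 4 for them.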

Proposition 3.6.

If $h$ is an isomorphism from a uFR $(N_{1},D)$ to an ECFG $(T,N,S,R)$ then, for each expression $N_{i}:=E_{i}$ in $D$ such that $A_{i}=h(N_{i})$ and $e_{i}=h(E_{i})$ , we have that

(a) $|e_{i}|=|E_{i}|$ and
(b) $L(A_{i})=\mathrm{word}(\llbracket N_{i}\rrbracket)$ .

Corollary 3.7.

If a uFR $F$ and ECFG $G$ are isomorphic then they have the same size and $L(G)=\mathrm{word}(\llbracket F\rrbracket)$ .

3.2 FRs and ECFGs are Isomorphic on Database Relations

A factorized relation $F$ defines a database relation, where all tuples have the same arity. An ECFG defining the same relation (i.e., $\mathrm{word}(\llbracket F\rrbracket)$ ) defines a language in which each word has the same length (which is, in particular, finite).

Let $G=(T,N,S,R)$ be an ECFG. Notice that, if $G$ is trimmed and $L(G)$ is finite, then $G$ cannot use the Kleene star operator in a meaningful way. Indeed, it can only use subexpressions $e^{*}$ if $L(e)=\emptyset$ or $L(e)=\{\varepsilon\}$ , in which case $L(e^{*})=\{\varepsilon\}$ . The same holds for recursion. We therefore assume from now on that ECFGs that define a finite language are non-recursive and do not use the Kleene star operator. We call a nonterminal $A\in N$ uniform length if every word $w\in L(A)$ has the same length. We say that $G$ is uniform length if $S$ is uniform length. We now prove that factorized relations are the same as uniform-length ECFGs.
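For a non-recursive, star-free grammar, the uniform-length property can be checked by computing, for every nonterminal, the set of lengths of the words it derives. The following is a minimal sketch under an assumed tuple encoding of right-hand sides (`("val", a)`, `("eps",)`, `("empty",)`, `("ref", A)`, `("union", ...)`, `("prod", ...)`); the encoding and the function names are illustrative and not taken from the paper.

```python
def length_sets(defs):
    """Map every name to the set of lengths of the words it derives.

    Assumes that the definitions are non-recursive and star-free.
    """
    cache = {}

    def lengths(expr):
        kind = expr[0]
        if kind == "empty":
            return frozenset()
        if kind == "eps":
            return frozenset({0})
        if kind == "val":
            return frozenset({1})
        if kind == "ref":
            name = expr[1]
            if name not in cache:
                cache[name] = lengths(defs[name])
            return cache[name]
        if kind == "union":
            return frozenset().union(*(lengths(e) for e in expr[1:]))
        if kind == "prod":
            acc = frozenset({0})
            for e in expr[1:]:
                part = lengths(e)
                acc = frozenset(x + y for x in acc for y in part)
            return acc
        raise ValueError(f"unknown expression kind: {kind!r}")

    return {name: lengths(("ref", name)) for name in defs}


def is_uniform_length(defs, name):
    """A nonterminal is uniform length if it derives at most one word length."""
    return len(length_sets(defs)[name]) <= 1


# Hypothetical grammar: A is uniform length, B and S are not.
defs = {
    "S": ("prod", ("ref", "A"), ("ref", "B")),
    "A": ("union", ("val", "a"), ("val", "b")),
    "B": ("union", ("val", "c"), ("eps",)),
}
print(is_uniform_length(defs, "A"))  # True
print(is_uniform_length(defs, "S"))  # False
```

Under this reading, a nonterminal with an empty language is vacuously uniform length.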

Theorem 3.8.

There is a bijection $\beta$ between the set of uFRs and the set of uniform-length ECFGs such that each factorized relation $F$ is isomorphic to $\beta(F)$ .

Similarly, we say that a uniform-length ECFG $G$ has disjoint positions if, for every pair of words $a_{1}\cdots a_{k}\in L(G)$ and $b_{1}\cdots b_{k}\in L(G)$ we have that $a_{i}\neq b_{j}$ if $i\neq j$ .
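On an explicitly given finite language, the disjoint-positions condition amounts to requiring that every symbol occurs at a single position only. The sketch below checks this directly on a set of words; the function name is a hypothetical helper, and the check works on the explicit language rather than on a grammar.

```python
def has_disjoint_positions(words):
    """True iff no symbol occurs at two different positions across the words."""
    position_of = {}
    for word in words:
        for i, symbol in enumerate(word):
            # remember the first position seen for each symbol
            if position_of.setdefault(symbol, i) != i:
                return False
    return True


print(has_disjoint_positions({("a", "b"), ("c", "b")}))  # True
print(has_disjoint_positions({("a", "b"), ("b", "c")}))  # False
```

Since the definition quantifies over all pairs of words, including a word paired with itself, a single word such as `("a", "a")` already violates the condition.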

Theorem 3.9.

There is a bijection $\beta$ between the set of uFRs with disjoint positions and the set of uniform-length ECFGs with disjoint positions such that each factorized relation $F$ is isomorphic to $\beta(F)$ .

In the remainder of the paper, for a uFR $F$ , we will refer to $\beta(F)$ as the CFG corresponding to $F$ .

4 Some Consequences of the Isomorphism

We now discuss some immediate consequences of Theorems 3.8 and 3.9. We note that our list is far from exhaustive. In principle, every result on context-free grammars that holds for uniform-length languages can be lifted to uFRs. Conversely, every result that is shown for uFRs can be transferred to CFGs for uniform-length languages. Since uFRs are a class of CFGs due to Theorem 3.8, we apply some standard terminology for CFGs to uFRs, e.g., the notion of derivation trees. We call a uFR $F$ deterministic if the CFG $G$ corresponding to $F$ is unambiguous, i.e., every word in $L(G)$ has a unique derivation tree.³

³ The definition of an nFR $F$ being deterministic in [55] says that each monomial that can be obtained from using distributivity of product over union is distinct. This is equivalent to saying that each tuple in $\llbracket F\rrbracket$ has a unique derivation tree.

4.1 Membership

We define the membership problem for uFRs to be the problem that, given a tuple $t$ and uFR $F$ , tests if $t\in\llbracket F\rrbracket$ . The CYK algorithm [35] decides membership for context-free languages in polynomial time.

Corollary 4.1.

The membership problem for uFRs is in polynomial time.
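For intuition, membership can also be tested directly on a non-recursive uFR with a memoized recursion over pairs of a subexpression and a position in the tuple. This is a sketch under an assumed tuple encoding of expressions (`("val", a)`, `("eps",)`, `("empty",)`, `("ref", N)`, `("union", ...)`, `("prod", ...)`); it is not the CYK algorithm itself, and the names are illustrative.

```python
def member(defs, start, t):
    """Decide whether tuple t is in the relation defined by name `start`."""
    t = tuple(t)
    memo = {}

    def ends(expr, i):
        """Set of positions j such that expr derives t[i:j]."""
        key = (expr, i)
        if key in memo:
            return memo[key]
        kind = expr[0]
        if kind == "empty":
            out = frozenset()
        elif kind == "eps":
            out = frozenset({i})
        elif kind == "val":
            ok = i < len(t) and t[i] == expr[1]
            out = frozenset({i + 1}) if ok else frozenset()
        elif kind == "ref":
            out = ends(defs[expr[1]], i)
        elif kind == "union":
            out = frozenset().union(*(ends(e, i) for e in expr[1:]))
        elif kind == "prod":
            cur = frozenset({i})
            for e in expr[1:]:
                cur = frozenset(k for j in cur for k in ends(e, j))
            out = cur
        else:
            raise ValueError(f"unknown expression kind: {kind!r}")
        memo[key] = out
        return out

    return len(t) in ends(("ref", start), 0)


# Hypothetical uFR for the relation {a1, a2} x {b1, b2}.
defs = {
    "S": ("prod", ("ref", "A"), ("ref", "B")),
    "A": ("union", ("val", "a1"), ("val", "a2")),
    "B": ("union", ("val", "b1"), ("val", "b2")),
}
print(member(defs, "S", ("a1", "b2")))  # True
print(member(defs, "S", ("b1", "a1")))  # False
```

The memo table holds at most one entry per pair of a subexpression and a start position, which keeps the number of recursive evaluations polynomial in the size of the uFR and the arity of the tuple.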

4.2 FRs versus Non-Deterministic Finite Automata

We call a uFR $(S,D)$ right-linear (resp., left-linear) if every expression in $D$ is of the form $A:=\langle\rangle$ or $A:=\{b\}\times C$ for $A,C\in N$ and $b\in T$ (resp., $A:=\langle\rangle$ or $A:=C\times\{b\}$ ). (The corresponding definition for CFGs is analogous.) Due to [35], we know that right-linear CFGs are isomorphic to non-deterministic finite automata. The isomorphism, formulated in terms of uFRs, is that a rule $A:=\{b\}\times C$ corresponds to a transition from state $A$ to state $C$ with label $b$ , and a rule $A:=\langle\rangle$ corresponds to $A$ being an accepting state. A state $A$ is a start state if $A$ does not occur on the right-hand side of a rule in $D$ . (The argument for left-linear uFRs is analogous.)

Corollary 4.2.

Right-linear (and left-linear) uFRs are isomorphic to non-deterministic finite automata for uniform-length languages.
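The correspondence between right-linear rules and automaton transitions can be sketched as follows. The rule encoding is an assumption made for illustration: a pair `(A, None)` stands for $A:=\langle\rangle$ and a pair `(A, (b, C))` stands for $A:=\{b\}\times C$ ; rules are given CFG-style as a list, so that a name may have several rules.

```python
def right_linear_to_nfa(rules):
    """Build (states, transitions, initial, accepting) from right-linear rules."""
    states, transitions, accepting, targets = set(), set(), set(), set()
    for head, body in rules:
        states.add(head)
        if body is None:
            accepting.add(head)              # A := <> makes A accepting
        else:
            label, target = body
            states.add(target)
            targets.add(target)
            transitions.add((head, label, target))  # A := {b} x C
    # start states: names that never occur on a right-hand side
    initial = states - targets
    return states, transitions, initial, accepting


def accepts(nfa, word):
    """Standard subset simulation of the NFA on the given word."""
    _, transitions, initial, accepting = nfa
    current = set(initial)
    for symbol in word:
        current = {q2 for (q1, a, q2) in transitions
                   if q1 in current and a == symbol}
    return bool(current & accepting)


# Hypothetical right-linear rules for the language {ac, bc}.
rules = [("S", ("a", "A")), ("S", ("b", "A")), ("A", ("c", "F")), ("F", None)]
nfa = right_linear_to_nfa(rules)
print(accepts(nfa, "ac"), accepts(nfa, "ca"))  # True False
```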

Let us define the equivalence problem of uFRs as follows. Given two uFRs $F_{1}$ and $F_{2}$ , is $\llbracket F_{1}\rrbracket=\llbracket F_{2}\rrbracket$ ? Likewise, the containment problem asks, given two uFRs $F_{1}$ and $F_{2}$ , whether $\llbracket F_{1}\rrbracket\subseteq\llbracket F_{2}\rrbracket$ . Due to [61, Corollary 5.9], we now know the following.

Corollary 4.3.

Equivalence and containment of deterministic right-linear (resp., left-linear) uFRs are in polynomial time.

We recall that determinism in uFRs corresponds to unambiguity in context-free grammars and finite automata. In particular, Corollary 5.9 in [61], which shows that equivalence and containment of unambiguous finite automata are solvable in polynomial time, is non-trivial. In fact, the result is even more general: it holds for $k$ -ambiguous automata for every constant $k$ . Here, $k$ -ambiguity intuitively means that each tuple in $\llbracket F\rrbracket$ is allowed to have up to $k$ derivation trees in $F$ .

4.3 Size Lower Bounds

Filmus [28] proves a lower bound on the size of CFGs that, in terms of uFRs, is stated as follows.

Corollary 4.4 ([28], Theorem 7).

Consider the set of tuples $S=\{(a_{1},\ldots,a_{n},b_{1},\ldots,b_{n})\in\{0,1\}^{2n}\mid a_{1}\cdots a_{n}=b_{1}\cdots b_{n}\}$ . Then the smallest uFR for $S$ has size $\Omega(2^{n/4}/\sqrt{2n})$ .

In fact, the proofs of Corollaries 4.4 and 4.1 use the fact that CFGs (uFRs) can be brought into Chomsky Normal Form [21], which may be yet another classical result that is useful for proving results on uFRs.

4.4 Counting

Corollary 4.2 allows us to connect recent results on counting problems for automata and grammars [7, 8, 45] to uFRs. For a class $\mathcal{C}$ of uFRs, counting for ${\mathcal{C}}$ is the problem that, given a uFR $F\in{\mathcal{C}}$ , asks what is the cardinality of $\llbracket F\rrbracket$ . First, recall that the number of words of a given length $n$ in an unambiguous CFG (or ECFG) can be counted in polynomial time [29, Section I.5.4] by a simple dynamic programming approach.

Corollary 4.5.

Counting for deterministic uFRs is in polynomial time.
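The dynamic programming idea can be sketched for non-recursive uFRs as follows, under an assumed tuple encoding of expressions (`("val", a)`, `("eps",)`, `("empty",)`, `("ref", N)`, `("union", ...)`, `("prod", ...)`). The recursion counts derivation trees: unions add and products multiply. For a deterministic uFR this number equals $|\llbracket F\rrbracket|$ ; for a non-deterministic uFR it is only an upper bound.

```python
def count_derivations(defs, start):
    """Count derivation trees of `start`; memoized per expression name."""
    cache = {}

    def count(expr):
        kind = expr[0]
        if kind == "empty":
            return 0
        if kind in ("eps", "val"):
            return 1
        if kind == "ref":
            name = expr[1]
            if name not in cache:
                cache[name] = count(defs[name])
            return cache[name]
        if kind == "union":
            return sum(count(e) for e in expr[1:])
        if kind == "prod":
            total = 1
            for e in expr[1:]:
                total *= count(e)
            return total
        raise ValueError(f"unknown expression kind: {kind!r}")

    return count(("ref", start))


# Hypothetical deterministic uFR for {a1, a2} x {b1, b2, b3}.
deterministic = {
    "S": ("prod", ("ref", "A"), ("ref", "B")),
    "A": ("union", ("val", "a1"), ("val", "a2")),
    "B": ("union", ("val", "b1"), ("val", "b2"), ("val", "b3")),
}
# Hypothetical non-deterministic uFR: the tuple (a) has two derivation trees.
ambiguous = {"S": ("union", ("val", "a"), ("val", "a"))}

print(count_derivations(deterministic, "S"))  # 6
print(count_derivations(ambiguous, "S"))      # 2, although the relation has 1 tuple
```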

For general right-linear uFRs, the counting problem is $\#$ P-complete, but the following is immediate from the existence of an FPRAS for $\#$ NFA [7] and Theorem 3.8.

Corollary 4.6.

Counting for right-linear uFRs admits an FPRAS.

Furthermore, the recent more efficient FPRAS for $\#$ NFA by Meel et al. [45] can be applied verbatim to counting for right-linear uFRs. In fact, Meel and de Colnet [46] recently generalized the result to $\#$ CFG. The result is still unpublished, but it would imply:

Corollary 4.7.

Counting for uFRs admits an FPRAS.

Notice that counting for uFRs (and subclasses thereof) is a practically relevant question: it is the result of the COUNT DISTINCT query for a factorized representation.

4.5 Enumeration

In terms of enumeration, Dömösi shows that, given a context-free grammar $G$ and length $n$ , the set of words in $L(G)$ of length $n$ can be enumerated with delay polynomial in $n$ [25]:

Corollary 4.8.

Given a uFR $F$ , the set of tuples in $\llbracket F\rrbracket$ can be enumerated with delay polynomial in the arity of $F$ .

We note that, in the case of right-linear uFRs, more efficient algorithms are possible [1, 2]. The same holds if the uFR is deterministic [56]. For example, Muñoz and Riveros [49] considered Enumerable Compact Sets (ECS) as a data structure for output-linear delay algorithms. ECS can be viewed as context-free grammars in Chomsky Normal Form for finite languages (using a similar isomorphism as in Section 3.1). In terms of uFRs, [49] shows that, if the uFR is deterministic and $k$ -bounded (which means that unions on right-hand sides have constant length), then all its tuples can be enumerated in output-linear delay.

4.6 FRs for Variable-Length Relations

We now consider a slightly more liberal relational data model in the sense that a database relation no longer needs to contain tuples of the same arity. The reason why we consider this case is twofold. First, this data model leads to representations for the finite languages, which is a fundamental class in formal language theory. Second, this data model is the underlying data model [57] for the query language Rel [58], implemented by RelationalAI. So it is used in practice. We say that a variable-length database relation is simply a finite set $R$ of tuples.

The correspondence between uFRs and ECFGs in Theorems 3.8 and 3.9 allows us to define a natural generalization of FRs for variable-length relations, which corresponds to ECFGs for finite languages. We inductively define variable-length $X$ -expressions (vl- $X$ -expression for short) $E$ as follows:

$\blacksquare$

$E=\emptyset$ is a vl- $X$ -expression;
$\blacksquare$

$E=\langle\rangle$ is a vl- $X$ -expression;
$\blacksquare$

for each $a\in\mathsf{Val}$ , we have that $E=\langle a\rangle$ is a vl- $X$ -expression with $\mathord{\mathit{ar}}(E)=1$ (singleton);
$\blacksquare$

for each $N\in X$ , we have that $E=N$ is a vl- $X$ -expression (name reference);
$\blacksquare$

for vl- $X$ -expressions $E_{1},\ldots,E_{n}$ we have that

– $E=(E_{1}\cup\cdots\cup E_{n})$ is a vl- $X$ -expression (union); and

– $E=(E_{1}\times\cdots\times E_{n})$ is a vl- $X$ -expression (Cartesian product).

Definition 4.9.

A variable-length factorized relation (vlFR) is a pair $(S,D)$ , where $S\in\mathsf{Names}$ is the start symbol and $D=\{N_{1}\mathrel{\mathop{:}}=E_{1},\ldots,N_{n}\mathrel{\mathop{:}}=E_{n}\}$ is a set of expressions where:

1.

$N_{1}=S$ ;
2.

Each $N_{i}$ is an expression name; and
3.

Each $E_{i}$ is a vl- $X_{i}$ -expression for $X_{i}=\{N_{i+1},\dots,N_{n}\}$ .

The semantics $\llbracket F\rrbracket$ of a variable-length factorized relation $F$ is defined analogously as for FRs. The difference is that the result is now a variable-length database relation.
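To illustrate the semantics, the following sketch materializes the variable-length relation of a vlFR bottom-up. The tuple encoding of vl- $X$ -expressions (`("val", a)`, `("eps",)`, `("empty",)`, `("ref", N)`, `("union", ...)`, `("prod", ...)`) and the function name are assumptions made for illustration. Since the explicit relation can be exponentially larger than the vlFR, this is only meant for small examples.

```python
def evaluate(defs, start):
    """Return the set of tuples (of possibly different arities) denoted by `start`."""
    cache = {}

    def ev(expr):
        kind = expr[0]
        if kind == "empty":
            return frozenset()
        if kind == "eps":
            return frozenset({()})          # the relation containing the empty tuple
        if kind == "val":
            return frozenset({(expr[1],)})  # singleton
        if kind == "ref":
            name = expr[1]
            if name not in cache:
                cache[name] = ev(defs[name])
            return cache[name]
        if kind == "union":
            return frozenset().union(*(ev(e) for e in expr[1:]))
        if kind == "prod":
            acc = frozenset({()})
            for e in expr[1:]:
                part = ev(e)
                acc = frozenset(s + t for s in acc for t in part)
            return acc
        raise ValueError(f"unknown expression kind: {kind!r}")

    return ev(("ref", start))


# Hypothetical vlFR: S := <a> ∪ (<a> × B) ∪ <>,  B := <b> ∪ (<b> × <c>).
defs = {
    "S": ("union", ("val", "a"), ("prod", ("val", "a"), ("ref", "B")), ("eps",)),
    "B": ("union", ("val", "b"), ("prod", ("val", "b"), ("val", "c"))),
}
print(sorted(evaluate(defs, "S")))
# [(), ('a',), ('a', 'b'), ('a', 'b', 'c')]
```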

Theorem 4.10.

There is a bijection $\beta$ between the set of vlFRs and the set of ECFGs for finite languages such that each vlFR $F$ is isomorphic to $\beta(F)$ .

5 Path Representations in Graph Databases

We now start exploring the relationship between uFRs and Path Multiset Representations (PMRs), which were recently introduced as a succinct data structure for (multi)sets of paths in graph databases [41]. Several studies demonstrate that they can drastically speed up evaluation of queries that involve regular path queries with path variables [41, 26, 15].

We briefly introduce PMRs and explain their connection to finite automata. This allows us to relate them to uFRs and study some size tradeoffs later in this section.

5.1 Path Multiset Representations

We use edge-labeled multigraphs as our abstraction of a graph database.⁴ A graph database is a tuple $G=(N,E,\text{lab})$ , where $N$ is a finite set of nodes, $E\subseteq N\times N$ is a finite set of edges, and $\text{lab}\colon E\to\mathsf{Val}$ is a function that associates a label to each edge. A path in $G$ is a sequence of nodes $u_{0},\ldots,u_{n}$ , where $(u_{i-1},u_{i})\in E$ for every $i\in[n]$ .

⁴ PMRs were originally defined on property graphs, which are more complex than edge-labeled graphs. The definition of PMRs presented here is therefore slightly simplified.

Definition 5.1.

A path multiset representation (PMR) over a graph database $G=(N_{G},E_{G},\text{lab}_{G})$ is a tuple $R=(N,E,\gamma,S,T)$ where

$\blacksquare$

$N$ is a finite set of nodes;
$\blacksquare$

$E\subseteq N\times N$ is a finite set of edges;
$\blacksquare$

$\gamma:N\to N_{G}$ is a homomorphism (that is, if $(u,v)\in E$ , then $(\gamma(u),\gamma(v))\in E_{G}$ );
$\blacksquare$

$S\subseteq N$ is a finite set of start nodes;
$\blacksquare$

$T\subseteq N$ is a finite set of target nodes.

The semantics of PMRs is defined as follows. They can be used as a representation of a set or a multiset of paths. More precisely, we define $\mathsf{SPaths}(R)=\{\gamma(p)\mid p$ is a path from some node in $S$ to some node in $T$ in $R\}$ . Notice that each $\gamma(p)$ is indeed a path in $G$ , since $\gamma$ is a homomorphism. $\mathsf{MPaths}(R)$ is defined similarly, but it is a multiset, where the multiplicity of each path $p$ in $G$ is the number of paths $p^{\prime}$ in $R$ such that $\gamma(p^{\prime})=p$ .

Figure 2: An edge-labeled graph $G$ and a PMR for a multiset of its paths. (a) An edge-labeled graph $G$ . (b) A PMR $R$ of the paths of even length from $A$ to $D$ (with multiplicity two for the shortest path).

Example 5.2.

Consider the graph in Figure 2(a). Figure 2(b) depicts a PMR $R$ representing the set of paths from $A$ to $D$ in $G$ that have even length. The homomorphism $\gamma$ simply matches both nodes $A_{1}$ and $A_{2}$ to $A$ ; and similarly for the other indexed nodes. We define $S=\{A_{1}\}$ and $T=\{D\}$ . Following our definition, we now have that $\mathsf{SPaths}(R)$ is indeed the set of paths from $A$ to $D$ in $G$ that have even length. In $\mathsf{MPaths}(R)$ , the shortest such path (having length two) has multiplicity two, whereas all other paths have multiplicity one.
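The set and multiset semantics of a PMR can be sketched by enumerating paths from start nodes to target nodes and mapping them through $\gamma$ . The PMR below is a small hypothetical one (it is not the PMR of Figure 2), and the parameter `max_nodes`, which bounds the number of nodes on a path so that the enumeration also terminates on cyclic PMRs, is an assumption of this sketch.

```python
from collections import Counter


def mpaths(edges, gamma, starts, targets, max_nodes):
    """Multiset of gamma-images of paths from a start node to a target node."""
    successors = {}
    for u, v in edges:
        successors.setdefault(u, []).append(v)
    result = Counter()
    stack = [(s,) for s in starts]
    while stack:
        path = stack.pop()
        if path[-1] in targets:
            result[tuple(gamma[u] for u in path)] += 1
        if len(path) < max_nodes:
            for v in successors.get(path[-1], []):
                stack.append(path + (v,))
    return result


# Hypothetical PMR: two copies of node B, both mapped to B by gamma.
edges = {("A1", "B1"), ("A1", "B2"), ("B1", "D1"), ("B2", "D1")}
gamma = {"A1": "A", "B1": "B", "B2": "B", "D1": "D"}
multiset = mpaths(edges, gamma, {"A1"}, {"D1"}, max_nodes=3)
spaths = set(multiset)  # set semantics: forget the multiplicities

print(multiset)  # Counter({('A', 'B', 'D'): 2})
print(spaths)    # {('A', 'B', 'D')}
```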

5.2 PMRs versus Finite Automata

PMRs are closely connected to finite automata by design. One reason for this design choice is that graph pattern matching in languages such as Cypher, SQL/PGQ, and GQL starts with the evaluation of regular path queries, which match “regular” sets of paths.

We explain this connection next and assume familiarity with finite automata. In the following, we will denote nondeterministic finite automata (NFAs) as tuples $A=(\Sigma,Q,\delta,I,F)$ where $\Sigma$ is its set of symbols (or alphabet), $Q$ its finite set of states, $\delta$ its set of transitions of the form $q_{1}\xrightarrow{a}q_{2}$ (meaning that, in state $q_{1}$ , the automaton can go to state $q_{2}$ by reading the symbol $a$ ), $I\subseteq Q$ its set of initial states, and $F\subseteq Q$ its set of accepting states. As usual, we denote by $L(A)$ the language of $A$ , which is the set of words accepted by $A$ .

The connection between PMRs and NFAs is very close. Indeed, we can turn a PMR $R=(N,E,\gamma,S,T)$ over graph $G=(N_{G},E_{G},\text{lab})$ into an NFA $N_{R}=(\Sigma,Q,\delta,I,F)$ where

1.

the alphabet $\Sigma$ is the set $N_{G}$ of nodes of $G$ ;
2.

the set $Q$ of states is $N\cup\{s\}$ , where we assume $s\notin N$ ;
3.

for every edge $e=(u,v)\in E$ , there is a transition $u\xrightarrow{\gamma(v)}v$ ;
4.

for every node $u\in S$ , we have a transition $s\xrightarrow{\gamma(u)}u$ ;
5.

$I=\{s\}$ ; and $F=T$ .

As usual, we denote by $L(N_{R})$ the set of words accepted by $N_{R}$ . The automaton $N_{R}$ accepts precisely the set of paths represented by $R$ .

Proposition 5.3 (Implicit in [41]).

$L(N_{R})=\mathsf{SPaths}(R)$ .

In fact, there also exists a multiset language of NFAs, denoted $\mathit{ML}(A)$ , in which the multiplicity of each word $w\in\mathit{ML}(A)$ is the number of accepting runs that $A$ has on $w$ . Analogously to Proposition 5.3, one can show the following.

Proposition 5.4 (Implicit in [41]).

$\mathit{ML}(N_{R})=\mathsf{MPaths}(R)$ .
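The construction of $N_{R}$ from a PMR, together with the two propositions above, can be illustrated with a short sketch. The function names, the default name `"s"` for the fresh initial state, and the example PMR are assumptions made for illustration. Words are sequences of nodes of $G$ , and the number of accepting runs on a word gives its multiplicity.

```python
def pmr_to_nfa(nodes, edges, gamma, starts, targets, fresh="s"):
    """Build the NFA (Q, delta, I, F) of a PMR; the alphabet is the image of gamma."""
    assert fresh not in nodes, "the initial state must be a fresh state"
    # every edge (u, v) becomes a transition u --gamma(v)--> v
    delta = {(u, gamma[v], v) for (u, v) in edges}
    # every start node u is entered from the fresh state by reading gamma(u)
    delta |= {(fresh, gamma[u], u) for u in starts}
    return set(nodes) | {fresh}, delta, {fresh}, set(targets)


def accepting_runs(nfa, word):
    """Number of accepting runs of the NFA on word (0 means not accepted)."""
    _, delta, initial, accepting = nfa
    runs = {q: 1 for q in initial}
    for symbol in word:
        nxt = {}
        for (q1, a, q2) in delta:
            if a == symbol and q1 in runs:
                nxt[q2] = nxt.get(q2, 0) + runs[q1]
        runs = nxt
    return sum(c for q, c in runs.items() if q in accepting)


# Hypothetical PMR with two copies of node B.
nodes = {"A1", "B1", "B2", "D1"}
edges = {("A1", "B1"), ("A1", "B2"), ("B1", "D1"), ("B2", "D1")}
gamma = {"A1": "A", "B1": "B", "B2": "B", "D1": "D"}
nfa = pmr_to_nfa(nodes, edges, gamma, starts={"A1"}, targets={"D1"})

print(accepting_runs(nfa, ("A", "B", "D")))  # 2
print(accepting_runs(nfa, ("A", "D")))       # 0
```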

The correspondence in Proposition 5.4 is interesting for the purposes of representing path multisets, because deciding for given NFAs $A_{1}$ and $A_{2}$ if $\mathit{ML}(A_{1})=\mathit{ML}(A_{2})$ is in polynomial time [63] if the NFAs do not have $\varepsilon$ -transitions, which is the case here. Deciding if $L(A_{1})=L(A_{2})$ , on the other hand, is PSPACE-complete [48]. (The same complexities hold for the respective containment problems.) This can be interesting if we want to consider questions like finding optimal-size representations of a (multi)set of paths.

5.3 Comparing uFRs and PMRs

In this section, we identify a path $u_{1},\ldots,u_{n}$ with the database tuple $(u_{1},\ldots,u_{n})$ . Furthermore, for a uFR $F$ and PMR $R$ , we compare $\llbracket F\rrbracket$ with $\mathsf{SPaths}(R)$ , i.e., we compare them under set semantics. (A similar comparison can be made when considering them both under bag semantics.) Under these assumptions, we have the following observation.

Proposition 5.5.

PMRs are strictly more expressive than uFRs.

Proof.

Every uFR represents a finite relation, which can be represented by a PMR (in formal language terms: every finite language is regular). Furthermore, PMRs can represent some infinite relations, namely those whose corresponding word language is regular [41]. $\hfill\blacktriangleleft$

It is interesting to compare the relative size of PMRs and uFRs. Indeed, most practical query languages (e.g., GQL, Cypher) use keywords to ensure that the sets of paths to be considered in graph pattern matching are finite (SHORTEST, TRAIL, SIMPLE, ACYCLIC).⁵ This means that, when using these languages, one can in principle use PMRs as well as uFRs to represent sets of paths.

⁵ It is an interesting question if such keywords that force a finite number of paths are indeed always needed, and PMRs show one way to finitely represent an infinite number of paths.

Using uFRs to represent sets of paths in graph database systems opens up a wide array of questions. More precisely, context-free grammars (CFG), unambiguous context-free grammars (UCFG), non-deterministic finite automata (NFA), unambiguous finite automata (UFA), and deterministic finite automata (DFA) for finite-length languages are all special cases of uFRs and are all able to represent some set of $n$ tuples using a representation of size $O(\log n)$ . All these formalisms have different properties. (E.g., counting for UFA is easy by [61, Corollary 5.9] but $\#$ P-complete for NFA, an NFA can be converted into a DFA in exponential time, etc.) So, which representation can be used in which case? This question actually calls for an investigation that is too extensive for one paper. Here we investigate the size tradeoffs between the different models.

Figure 3: Worst-case unavoidable blow-ups for succinct representations of uniform length relations. Every path that consists of only blue edges represents an unavoidable exponential blow-up and every path that contains at least one red (solid) edge represents an unavoidable double exponential blow-up. If there is no path, then there exists a linear translation. For the dashed edges, we only prove an upper bound. The corresponding lower bounds are conditional on Conjecture 5.7.

Uniform-Length Relations

We say that there is an $f(n)$ -size translation from one model $X$ to a model $Y$ , if there exists a translation from $X$ to $Y$ such that each object of size $n$ in $X$ can be translated to an equivalent object of size $O(f(n))$ in $Y$ . We say that these translations are tight if there is an infinite family of objects in $X$ in which, for each object of size $n$ , the smallest equivalent object in $Y$ has size $\Omega(f(n))$ . When $F$ is a set of functions (such as the exponential or double-exponential functions), we say that there is an $F$ -translation if there exists a function $f\in F$ such that there is an $f(n)$ -size translation. Again, we say that the translations are tight if there is a function $f\in F$ for which the translation is tight.

Theorem 5.6.

Over uniform-length languages, there are exponential translations

(E1)

from CFG to NFA;
(E2)

from UCFG to UFA;
(E3)

from NFA to UCFG;
(E4)

from NFA to UFA;
(E5)

from UFA to DFA;
(E6)

from DFA to Set; and
(E7)

from NFA to Set.

There are double-exponential translations

(DE1)

from CFG to UFA;
(DE2)

from CFG to UCFG;
(DE3)

from UCFG to DFA; and
(DE4)

from CFG to Set.

Furthermore, these translations are tight for (E1–E2,E4–E7) and (DE1,DE3–DE4).

We conjecture that the translations for (DE2) and (E3) are also tight. In fact, they are tight under Conjecture 5.7. To the best of our knowledge, the literature does not yet have well-developed methods for proving size lower bounds for UCFGs. A proof of Conjecture 5.7 using communication complexity has recently been claimed by Mengel and Vinall-Smeeth [47].

Conjecture 5.7.

For each $n\in\mathbb{N}$ , the smallest unambiguous context-free grammar for the language

$$L_{n}=\{(a+b)^{k}a(a+b)^{n}a(a+b)^{n-k}\mid k\leq n\}$$

of words of length $2n+2$ has size $2^{\Omega(n)}$ .
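For small $n$ , the language $L_{n}$ can be enumerated by brute force, which makes it easy to see why unambiguity is the difficult part: a word may match the pattern for several values of $k$ , so the obvious grammar that guesses $k$ is ambiguous. The following sketch (function names are illustrative) lists the words of $L_{n}$ and, for each word, the values of $k$ that witness membership.

```python
from itertools import product


def witnesses(word, n):
    """All k <= n such that word has an 'a' at positions k and k+n+1 (0-based)."""
    return [k for k in range(n + 1) if word[k] == "a" and word[k + n + 1] == "a"]


def language(n):
    """All words of length 2n+2 over {a, b} that have at least one witness k."""
    words = set()
    for letters in product("ab", repeat=2 * n + 2):
        word = "".join(letters)
        if witnesses(word, n):
            words.add(word)
    return words


L1 = language(1)
print(len(L1))               # 7
print(witnesses("abab", 1))  # [0]
print(witnesses("aaaa", 1))  # [0, 1]  (two witnesses for the same word)
```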

One reason why we believe in Conjecture 5.7 is that there does not exist a UCFG for the generalized version of the language to the unbounded length setting.

Proposition 5.8.

There does not exist a UCFG for the (infinite) language

$$L=\{(a+b)^{k}a(a+b)^{n}a(a+b)^{n-k}\mid k,n\in\mathbb{N},k\leq n\}\;.$$

Theorem 5.9.

Over uniform-length languages with disjoint positions, there are exponential translations

(E1)

from CFG to NFA;
(E2)

from UCFG to UFA;
(E3)

from CFG to UCFG;
(E4)

from NFA to UCFG;
(E5)

from NFA to UFA;
(E6)

from UFA to DFA;
(E7)

from DFA to Set; and
(E8)

from CFG to Set.

Furthermore, the translations (E1–E2,E5–E8) are tight.

Again, if Conjecture 5.7 holds true, the translations (E3–E4) are also tight.

5.4 Variable-Length Relations

Theorems 5.6 and 5.9 also hold for variable-length (but finite) relations, as we considered in Section 4.6. The lower bounds are immediate from those results and the upper bound constructions are analogous.

6 Future Work

The connection between database factorization and formal languages gives rise to a plethora of questions for future investigation. What is the complexity of basic operations (e.g., enumeration, counting, and direct access) over compact representations of different formalisms? What is the impact of non-determinism on this complexity? Some questions naturally emerge in the context of answering queries efficiently with compact representations. Given a query and a database, what is the best formal language for representing the result? What is the impact of the choice of formalism on our ability to efficiently maintain the query result (as a database view) to accommodate updates in the database? Specifically, which of the formalisms allow us to apply past results on updates of compact representations (e.g., [5, 37])? These questions, as well as the connection between factorized relations and path multiset representations, are especially relevant in light of the ongoing efforts on the graph query languages SQL/PGQ and GQL, as these languages combine graph pattern matching (through regular path query evaluation) and relational querying [24, 30].

In addition to the above, some questions arise independently of the manner (e.g., query) in which the factorized representation is constructed. What is the complexity of minimizing a representation of each formalism? What is the tradeoff that the variety of formalisms offers between the size and the complexity of operations? What is the best lossy representation if we have a size (space) restriction on the representation? Here, the definition of loss may depend on the application, and one interesting application is summarization of a large table via lower-resolution representations, as done by El Gebaly et al. [33].

References

[1] Margareta Ackerman and Erkki Mäkinen. Three new algorithms for regular language enumeration. In Hung Q. Ngo, editor, Computing and Combinatorics, 15th Annual International Conference, COCOON 2009, Niagara Falls, NY, USA, July 13-15, 2009, Proceedings, volume 5609 of Lecture Notes in Computer Science, pages 178–191. Springer, 2009. doi:10.1007/978-3-642-02882-3_19.
[2] Margareta Ackerman and Jeffrey O. Shallit. Efficient enumeration of words in regular languages. Theor. Comput. Sci., 410(37):3461–3470, 2009. doi:10.1016/J.TCS.2009.03.018.
[3] Antoine Amarilli, Marcelo Arenas, YooJung Choi, Mikaël Monet, Guy Van den Broeck, and Benjie Wang. A circus of circuits: Connections between decision diagrams, circuits, and automata. CoRR, abs/2404.09674, 2024. doi:10.48550/arXiv.2404.09674.
[4] Antoine Amarilli, Pierre Bourhis, Louis Jachiet, and Stefan Mengel. A circuit-based approach to efficient enumeration. In Ioannis Chatzigiannakis, Piotr Indyk, Fabian Kuhn, and Anca Muscholl, editors, 44th International Colloquium on Automata, Languages, and Programming, ICALP 2017, July 10-14, 2017, Warsaw, Poland, volume 80 of LIPIcs, pages 111:1–111:15. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2017. doi:10.4230/LIPICS.ICALP.2017.111.
[5] Antoine Amarilli, Pierre Bourhis, Stefan Mengel, and Matthias Niewerth. Enumeration on trees with tractable combined complexity and efficient updates. In Symposium on Principles of Database Systems (PODS), pages 89–103. ACM, 2019. doi:10.1145/3294052.3319702.
[6] Marcelo Arenas, Pablo Barceló, Leonid Libkin, Wim Martens, and Andreas Pieris. Database Theory. Open source at https://github.com/pdm-book/community, 2022.
[7] Marcelo Arenas, Luis Alberto Croquevielle, Rajesh Jayaram, and Cristian Riveros. #NFA admits an FPRAS: efficient enumeration, counting, and uniform generation for logspace classes. J. ACM, 68(6):48:1–48:40, 2021. doi:10.1145/3477045.
[8] Marcelo Arenas, Luis Alberto Croquevielle, Rajesh Jayaram, and Cristian Riveros. When is approximate counting for conjunctive queries tractable? In Symposium on Theory of Computing (STOC), pages 1015–1027. ACM, 2021. doi:10.1145/3406325.3451014.
[9] Nurzhan Bakibayev, Tomás Kociský, Dan Olteanu, and Jakub Zavodny. Aggregation and ordering in factorised databases. Proc. VLDB Endow., 6(14):1990–2001, 2013. doi:10.14778/2556549.2556579.
[10] Nurzhan Bakibayev, Dan Olteanu, and Jakub Zavodny. FDB: A query engine for factorised relational databases. Proc. VLDB Endow., 5(11):1232–1243, 2012. doi:10.14778/2350229.2350242.
[11] Pablo Barceló, Diego Figueira, and Miguel Romero. Boundedness of conjunctive regular path queries. In International Colloquium on Automata, Languages, and Programming (ICALP), pages 104:1–104:15, 2019. doi:10.4230/LIPIcs.ICALP.2019.104.
[12] Pablo Barceló, Carlos A. Hurtado, Leonid Libkin, and Peter T. Wood. Expressive languages for path queries over graph-structured data. In Symposium on Principles of Database Systems (PODS), pages 3–14. ACM, 2010. doi:10.1145/1807085.1807089.
[13] Pablo Barceló, Leonid Libkin, and Juan L. Reutter. Querying graph patterns. In Symposium on Principles of Database Systems (PODS), pages 199–210. ACM, 2011. doi:10.1145/1989284.1989307.
[14] Christoph Berkholz and Harry Vinall-Smeeth. A dichotomy for succinct representations of homomorphisms. In International Colloquium on Automata, Languages, and Programming (ICALP), volume 261 of LIPIcs, pages 113:1–113:19. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPICS.ICALP.2023.113.
[15] Vicente Calisto, Benjamín Farias, Wim Martens, Carlos Rojas, and Domagoj Vrgoc. Pathfinder demo: Returning paths in graph queries. In ISWC 2024 Posters, Demos and Industry Tracks, volume 3828 of CEUR Workshop Proceedings. CEUR-WS.org, 2024. URL: https://ceur-ws.org/Vol-3828/paper34.pdf.
[16] Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Moshe Y. Vardi. Rewriting of regular expressions and regular path queries. In Symposium on Principles of Database Systems (PODS), pages 194–204. ACM Press, 1999. doi:10.1145/303976.303996.
[17] Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Moshe Y. Vardi. Containment of conjunctive regular path queries with inverse. In International Conference on Principles of Knowledge Representation and Reasoning (KR), pages 176–185. Morgan Kaufmann, 2000.
[18] Nofar Carmeli, Nikolaos Tziavelis, Wolfgang Gatterbauer, Benny Kimelfeld, and Mirek Riedewald. Tractable orders for direct access to ranked answers of conjunctive queries. ACM Trans. Database Syst., 48(1):1:1–1:45, 2023. doi:10.1145/3578517.
[19] Nofar Carmeli, Shai Zeevi, Christoph Berkholz, Alessio Conte, Benny Kimelfeld, and Nicole Schweikardt. Answering (unions of) conjunctive queries using random access and random-order enumeration. ACM Trans. Database Syst., 47(3):9:1–9:49, 2022. doi:10.1145/3531055.
[20] Moses Charikar, Eric Lehman, Ding Liu, Rina Panigrahy, Manoj Prabhakaran, Amit Sahai, and Abhi Shelat. The smallest grammar problem. IEEE Trans. Inf. Theory, 51(7):2554–2576, 2005. doi:10.1109/TIT.2005.850116.
[21] Noam Chomsky. On certain formal properties of grammars. Inf. Control., 2(2):137–167, 1959. doi:10.1016/S0019-9958(59)90362-6.
[22] Mariano P. Consens and Alberto O. Mendelzon. GraphLog: a visual formalism for real life recursion. In Symposium on Principles of Database Systems (PODS), pages 404–416, 1990. doi:10.1145/298514.298591.
[23] Isabel F. Cruz, Alberto O. Mendelzon, and Peter T. Wood. A graphical query language supporting recursion. In International Conference on Management of Data (SIGMOD), pages 323–330, 1987. doi:10.1145/38713.38749.
[24] Alin Deutsch, Nadime Francis, Alastair Green, Keith Hare, Bei Li, Leonid Libkin, Tobias Lindaaker, Victor Marsault, Wim Martens, Jan Michels, Filip Murlak, Stefan Plantikow, Petra Selmer, Oskar van Rest, Hannes Voigt, Domagoj Vrgoc, Mingxi Wu, and Fred Zemke. Graph pattern matching in GQL and SQL/PGQ. In International Conference on Management of Data (SIGMOD), pages 2246–2258. ACM, 2022. doi:10.1145/3514221.3526057.
[25] Pál Dömösi. Unusual algorithms for lexicographical enumeration. Acta Cybern., 14(3):461–468, 2000. URL: https://cyber.bibl.u-szeged.hu/index.php/actcybern/article/view/3539.
[26] Benjamín Farias, Wim Martens, Carlos Rojas, and Domagoj Vrgoc. Pathfinder: Returning paths in graph queries. In International Semantic Web Conference (ISWC), pages 135–154. Springer, 2024. doi:10.1007/978-3-031-77850-6_8.
[27] Diego Figueira, Adwait Godbole, Shankara Narayanan Krishna, Wim Martens, Matthias Niewerth, and Tina Trautner. Containment of simple conjunctive regular path queries. In International Conference on Principles of Knowledge Representation and Reasoning (KR), pages 371–380, 2020.
[28] Yuval Filmus. Lower bounds for context-free grammars. Inf. Process. Lett., 111(18):895–898, 2011. doi:10.1016/J.IPL.2011.06.006.
[29] Philippe Flajolet and Robert Sedgewick. Analytic Combinatorics. Cambridge University Press, 1st edition, 2009.
[30] Nadime Francis, Amélie Gheerbrant, Paolo Guagliardo, Leonid Libkin, Victor Marsault, Wim Martens, Filip Murlak, Liat Peterfreund, Alexandra Rogova, and Domagoj Vrgoč. A researcher’s digest of GQL (invited talk). In International Conference on Database Theory (ICDT), pages 1:1–1:22, 2023. doi:10.4230/LIPICS.ICDT.2023.1.
[31] Nadime Francis, Alastair Green, Paolo Guagliardo, Leonid Libkin, Tobias Lindaaker, Victor Marsault, Stefan Plantikow, Mats Rydberg, Petra Selmer, and Andrés Taylor. Cypher: An evolving query language for property graphs. In International Conference on Management of Data (SIGMOD), pages 1433–1445. ACM, 2018. doi:10.1145/3183713.3190657.
[32] Moses Ganardi, Artur Jez, and Markus Lohrey. Balancing straight-line programs. J. ACM, 68(4):27:1–27:40, 2021. doi:10.1145/3457389.
[33] Kareem El Gebaly, Parag Agrawal, Lukasz Golab, Flip Korn, and Divesh Srivastava. Interpretable and informative explanations of outcomes. Proc. VLDB Endow., 8(1):61–72, 2014. doi:10.14778/2735461.2735467.
[34] GQL. https://www.gqlstandards.org/, 2023.
[35] John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, 2nd edition, 2001.
[36] ISO. Information technology - database languages SQL - Part 16: Property graph queries (SQL/PGQ), 2023.
[37] Ahmet Kara, Milos Nikolic, Dan Olteanu, and Haozhe Zhang. Conjunctive queries with free access patterns under updates. In International Conference on Database Theory (ICDT), pages 17:1–17:20, 2023. doi:10.4230/LIPICS.ICDT.2023.17.
[38] Benny Kimelfeld, Wim Martens, and Matthias Niewerth. A unifying perspective on succinct data representations. CoRR, abs/2309.11663, 2023. doi:10.48550/arXiv.2309.11663.
[39] Markus Lohrey. Algorithmics on slp-compressed strings: A survey. Groups Complex. Cryptol., 4(2):241–299, 2012. doi:10.1515/GCC-2012-0016.
[40] Ole Lehrmann Madsen and Bent Bruun Kristensen. LR-parsing of extended context free grammars. Acta Informatica, 7:61–73, 1976. doi:10.1007/BF00265221.
[41] Wim Martens, Matthias Niewerth, Tina Popp, Carlos Rojas, Stijn Vansummeren, and Domagoj Vrgoc. Representing paths in graph database pattern matching. Proc. VLDB Endow., 16(7):1790–1803, 2023. doi:10.14778/3587136.3587151.
[42] Wim Martens, Matthias Niewerth, and Tina Trautner. A trichotomy for regular trail queries. In International Symposium on Theoretical Aspects of Computer Science (STACS), pages 7:1–7:16, 2020. doi:10.4230/LIPICS.STACS.2020.7.
[43] Wim Martens and Tina Popp. The complexity of regular trail and simple path queries on undirected graphs. In Symposium on Principles of Database Systems (PODS), pages 165–174. ACM, 2022. doi:10.1145/3517804.3524149.
[44] Wim Martens and Tina Trautner. Evaluation and enumeration problems for regular path queries. In International Conference on Database Theory (ICDT), pages 19:1–19:21, 2018. doi:10.4230/LIPIcs.ICDT.2018.19.
[45] Kuldeep S. Meel, Sourav Chakraborty, and Umang Mathur. A faster FPRAS for #NFA. Proc. ACM Manag. Data, 2(2):112, 2024. doi:10.1145/3651613.
[46] Kuldeep S. Meel and Alexis de Colnet. #CFG and #DNNF admit FPRAS. CoRR, abs/2406.18224, 2024. doi:10.48550/arXiv.2406.18224.
[47] Stefan Mengel and Harry Vinall-Smeeth. A lower bound on unambiguous context free grammars via communication complexity. CoRR, abs/2412.03199, 2024. doi:10.48550/arXiv.2412.03199.
[48] Albert R. Meyer and Larry J. Stockmeyer. The equivalence problem for regular expressions with squaring requires exponential space. In SWAT (FOCS), pages 125–129. IEEE Computer Society, 1972. doi:10.1109/SWAT.1972.29.
[49] Martin Muñoz and Cristian Riveros. Streaming enumeration on nested documents. In Dan Olteanu and Nils Vortmeier, editors, 25th International Conference on Database Theory, ICDT 2022, March 29 to April 1, 2022, Edinburgh, UK (Virtual Conference), volume 220 of LIPIcs, pages 19:1–19:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2022. doi:10.4230/LIPICS.ICDT.2022.19.
[50] Neo4j. Intro to Cypher. https://neo4j.com/developer/cypher-query-language/, 2017.
[51] Milos Nikolic and Dan Olteanu. Incremental view maintenance with triple lock factorization benefits. In International Conference on Management of Data (SIGMOD), pages 365–380, 2018. doi:10.1145/3183713.3183758.
[52] Dan Olteanu and Maximilian Schleich. F: regression models over factorized views. Proc. VLDB Endow., 9(13):1573–1576, 2016. doi:10.14778/3007263.3007312.
[53] Dan Olteanu and Maximilian Schleich. Factorized databases. SIGMOD Rec., 45(2):5–16, 2016. doi:10.1145/3003665.3003667.
[54] Dan Olteanu and Jakub Závodný. Factorised representations of query results: size bounds and readability. In International Conference on Database Theory (ICDT), pages 285–298. ACM, 2012. doi:10.1145/2274576.2274607.
[55] Dan Olteanu and Jakub Závodný. Size bounds for factorised representations of query results. ACM Trans. Database Syst., 40(1):2:1–2:44, 2015. doi:10.1145/2656335.
[56] Steven T. Piantadosi. How to enumerate trees from a context-free grammar. CoRR, abs/2305.00522, 2023. doi:10.48550/arXiv.2305.00522.
[57] The Rel language (relations). https://docs.relational.ai/rel/primer/basic-syntax#relations, 2023.
[58] RelationalAI. The Rel language, 2024. https://learn.relational.ai/.
[59] Maximilian Schleich, Dan Olteanu, and Radu Ciucanu. Learning linear regression models over factorized joins. In International Conference on Management of Data (SIGMOD), pages 3–18, 2016. doi:10.1145/2882903.2882939.
[60] Markus L. Schmid. Conjunctive regular path queries with string variables. In Symposium on Principles of Database Systems (PODS), pages 361–374. ACM, 2020. doi:10.1145/3375395.3387663.
[61] Richard Edwin Stearns and Harry B. Hunt III. On the equivalence and containment problems for unambiguous regular expressions, regular grammars and finite automata. SIAM J. Comput., 14(3):598–611, 1985. doi:10.1137/0214044.
[62] Szymon Torunczyk. Aggregate queries on sparse databases. In Symposium on Principles of Database Systems (PODS), pages 427–443. ACM, 2020. doi:10.1145/3375395.3387660.
[63] Wen-Guey Tzeng. On path equivalence of nondeterministic finite automata. Inf. Process. Lett., 58(1):43–46, 1996. doi:10.1016/0020-0190(96)00039-7.

