Abstract 1 Introduction 2 Basic Concepts of Graph Databases 3 The Ring Index 4 Rings in Higher Dimensions 5 Order Graphs 6 Regular Path Queries 7 Perspective References

BWT Indexes for Optimal Joins in Graph Databases

Diego Arroyuelo ORCID Department of Computer Science, Escuela de Ingeniería, Pontificia Universidad Católica de Chile, Santiago, Chile
Millennium Institute for Foundational Research on Data (IMFD), Santiago, Chile
Gonzalo Navarro ORCID Department of Computer Science, University of Chile, Santiago, Chile
Millennium Institute for Foundational Research on Data (IMFD), Santiago, Chile
Abstract

Graph databases represent data as a labeled directed graph, where the labels refer to properties that connect the entities represented by their source and target vertices. Queries feature, most prominently, sets of edges where source, target, and/or label can be variables; each instantiation of the variables where all the edges occur in the graph is a solution to the query. Worst-case-optimal algorithms to solve those queries have been devised, but they pose significant space requirements. This overhead has hindered the adoption of worst-case-optimal algorithms in real systems. We show that a representation of the graph based on the extended BWT (eBWT), where each edge is seen as an independent string of length 3 (source, label, target) supports worst-case-optimal algorithms while using almost no extra space on top of the raw data. We then show how the idea is generalized to the relational model, where the strings can be longer than 3 and several eBWTs are needed to obtain worst-case optimality. The aim to minimize the amount of space in that case leads to consider novel eBWT variants, where columns other than the last can be chosen. Finally, we show how the same graph representation can be used to solve other typical queries, like finding graph paths that match regular expressions.

Keywords and phrases:
Graph databases, Ring index, extended BWT, compact data structures
Funding:
Gonzalo Navarro: Fondecyt Grant 1-230755.
Copyright and License:
[Uncaptioned image] © Diego Arroyuelo and Gonzalo Navarro; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation Data structures and algorithms for data management
; Theory of computation Database query processing and optimization (theory) ; Theory of computation Data compression
Funding:
ANID, Chile, via Millennium Science Initiative Program – Code ICN17_002.
Editors:
Paolo Ferragina, Travis Gagie, and Gonzalo Navarro

1 Introduction

Graph databases (GDBs) have become increasingly popular [29, 24, 26], as they enable the storage of unstructured information repositories that emphasize the relations between entities in applications such as knowledge graphs, the semantic web, social networks, transaction networks, and communication networks, among others. A number of GDB management systems and prototypes have been introduced [15, 33, 34, 1, 2, 25, 30], along with models and query languages [3, 19, 23]. The efficient implementation of GDBs, however, still faces various challenges [8], such as the efficient implementation of the key algorithms to solve the most relevant queries on GDBs. To achieve the needed efficiency, GDBs require indexes that tend to demand a lot of memory. For example, a common type of query on GDBs are basic graph patterns, which search for subgraphs that are homomorphic to a given query graph. These types of queries can be translated into the relational join of many tables, which is known to be computationally demanding. Research on efficient algorithms to compute those joins has led to the concept of worst-case optimal (wco) join algorithms [10], which offer certain guarantees of running time in a worst-case scenario. These wco algorithms, however, generally demand strong indexing schemes, which entails a significant use of memory. Leapfrog Triejoin (LTJ) [35] is one of the most popular wco join algorithms, and has proven to be efficient in practice [35, 25]. However, its high memory consumption makes it less viable in situations of limited memory or when the graphs are excessively large. This high memory usage comes from the need to store indexes for 6 different “permutations” of the graph database (a concept that will be clear soon) in order to work properly.

With other coauthors, we introduced the Ring [7, 5] as an alternative to reduce the excessive memory usage of LTJ. The Ring is able simulate the 6 permutations needed by LTJ (and the corresponding indexes) by storing only one of them. This allows implementing LTJ using little additional space on top of that used by the database itself, thereby opening the use of wco algorithms to many more applications. We originally conceived the Ring in terms of the BWT [13]. We presented it to a database audience aiming to simplify the stringology aspects as much as possible [7]. Our presentation was closer to that of the Permuterm index [18], which regards the strings as circular, and simplified it to the case of strings of length 3. In the journal version [5] we managed to get rid completely of the stringology concepts, which are difficult to grasp in other areas, and presented the Ring in terms of stable sortings.

In this paper we present the Ring to a stringology audience, and thus properly formalize its concepts in terms of the more elegant extended BWT (eBWT) [28]. The generalization of the ring to relational tables leads to the definition of eBWT variants that are new, to the best of our knowledge.

2 Basic Concepts of Graph Databases

2.1 Graph databases

A graph database is an edge-labeled graph modeled as a finite set of triples G𝒰3 where, for simplicity, 𝒰 will represent a finite and totally ordered set of constants. We will stick to the Resource Description Framework (RDF) model [27], as it is simpler than others and sufficient to present the important concepts. In RDF, each triple (s,p,o)G represents the directed edge s𝑝o, connecting vertex s to vertex o, with p being the edge label. The terms subject, predicate, and object are used to refer to s, p, and o, respectively. The number of edges (triples) in G is denoted by N=|G|. The alphabet for the vertices of graph G is defined as ΣV={s,o|(s,p,o)G}, while ΣL={p|(s,p,o)G} represents the alphabet for the edge labels. We assume ΣVΣL=, implying that no edge label p is used as a vertex in G. The domain of graph G is given by dom(G)={s,p,o|(s,p,o)G}𝒰, so we can safely assume |𝒰|3N. Figure 1 illustrates a sample graph, which will be used as our running example. The figure also presents a particular enumeration of the constants in dom(G), as well as the corresponding set of graph triples.

2.1.1 Basic Graph Patterns

Let 𝒱 denote an infinite set of variables such that 𝒰𝒱=. The simplest way of querying an edge-labeled graph is through triple patterns, which are tuples (s,p,o)(𝒰𝒱)3. The solutions to a triple-pattern query involve all potential assignments of the variables within the pattern to constants in dom(G) such that the resulting triple exists in G. This allows for querying all graph edges (using variables s, p, and o), all edges with a vertex sdom(G) as the subject (where s is a constant and p and o remain variables), all edges with label p (p is constant while s and o are variables), and all objects related to subject s by label p (s and p are constant while o is a variable), among other queries. When viewing G as a ternary relation in the relational model, triple patterns are analogous to equality-based selections.

Subsequently, triple patterns can be used to formulate more complex queries, such as basic graph patterns (BGPs). A BGP represents a finite set Q(𝒰𝒱)3 of triple patterns. Let 𝐕(Q) be the set of variables in a BGP Q. The evaluation of Q upon a graph G is defined as set of mappings Q(G)={μ:𝐕(Q)dom(G)|μ(Q)G}, referred to as solutions, where μ(G) denotes the result of substituting each variable x𝐕(Q) in Q with μ(x). Some examples of BGPs on the graph of Figure 1 are as follows:

  • Q={(𝖭𝗈𝖻𝖾𝗅,𝗐𝗂𝗇,x)}, for x𝐕(Q), which looks for all Nobel Prize winners. In our example graph, Q(G) contains the instantiations μ1(x)=𝖳𝗁𝗈𝗋𝗇𝖾, μ2(x)=𝖡𝗈𝗁𝗋, and μ3(x)=𝖳𝗁𝗈𝗆𝗌𝗈𝗇.

  • Q={(𝖭𝗈𝖻𝖾𝗅,𝗐𝗂𝗇,x),(𝖭𝗈𝖻𝖾𝗅,𝗐𝗂𝗇,y),(x,𝖺𝖽𝗏,y)}, for x,y𝐕(Q), which looks for all pairs of values x and y such that x advised y and both x and y won the Nobel Prize. In our example graph, Q(G) contains a single instantiation, namely μ(x)=𝖡𝗈𝗁𝗋 and μ(y)=𝖳𝗁𝗈𝗆𝗌𝗈𝗇.

Mapping:

1: Bohr 5: Nobel
2: Thomson 6: adv
3: Thorne 7: nom
4: Wheeler 8: win
(4, 6, 1) (5, 8, 1)
(3, 6, 4) (5, 8, 2)
(5, 8, 3) (1, 6, 2)
(5, 7, 4)
Figure 1: An example graph database (above), a possible enumeration of its constants (below, left), and the corresponding triple representation (below, right).

2.1.2 Regular Path Queries

Apart from BGPs, the second major component of most graph query languages are the so-called Regular Path Queries (RPQs). An RPQ essentially specifies a regular expression on the alphabet ΣL, and looks for the paths in the graph where the concatenation of the edge labels belongs to the language denoted by the regular expression. An RPQ may also fix the initial and/or the final nodes of the desired paths. The answer to the query are the extremes of all matching paths, in the form of pairs of nodes (s,o). Additionally, the regular expression may use arrows in reverse order, by specifying a “reversed” version p^ of the symbols pΣL. This can be handled by adding an edge op^s per graph edge s𝑝o, so we omit this feature in the sequel.

For instance, the RPQ 𝖳𝗁𝗈𝗋𝗇𝖾𝖺𝖽𝗏+x, where x is a variable, looks for all academic descendants of Thorne in the graph of Figure 1 (the regular expression 𝖺𝖽𝗏+ denotes all the nonempty repetitions of the symbol 𝖺𝖽𝗏). Its solutions are μ(x)=𝖶𝗁𝖾𝖾𝗅𝖾𝗋 (because of the path 𝖳𝗁𝗈𝗋𝗇𝖾𝖺𝖽𝗏𝖶𝗁𝖾𝖾𝗅𝖾𝗋), μ(x)=𝖡𝗈𝗁𝗋 (because of 𝖳𝗁𝗈𝗋𝗇𝖾𝖺𝖽𝗏𝖶𝗁𝖾𝖾𝗅𝖾𝗋𝖺𝖽𝗏𝖡𝗈𝗁𝗋), and μ(x)=𝖳𝗁𝗈𝗆𝗉𝗌𝗈𝗇 (because of 𝖳𝗁𝗈𝗋𝗇𝖾𝖺𝖽𝗏𝖶𝗁𝖾𝖾𝗅𝖾𝗋𝖺𝖽𝗏𝖡𝗈𝗁𝗋𝖺𝖽𝗏𝖳𝗁𝗈𝗆𝗉𝗌𝗈𝗇). Similarly, x𝖺𝖽𝗏+y, for variables x and y, looks for all instantiations of these variables such that y is an academic descendant of x.

2.2 Worst-Case-Optimal joins

Joins are some of the most resource-intensive operations in relational algebra, making their efficient computation essential for the performance of database systems. A sloppy implementation might end up computing the whole cross product of the involved relations, wasting resources. For a join query Q and a database instance D, a join algorithm enumerates Q(D), the solutions for Q over D. In order to formalize the study of efficient join algorithms, Atserias et al. [10] introduced the concept of the AGM bound (named after its authors). Consider a natural join query Q on a database instance D, which comprises a collection of relations. A natural join can be seen as the subset of the Cartesian product where the values in the common attributes coincide (and only one column per common attribute is preserved). The AGM bound of Q over D, denoted Q, is the highest possible number of tuples that can result from evaluating Q on any database instance D that includes, for each relation R in D, a relation R with the same attributes as R and such that |R||R|.

A join algorithm is termed worst-case optimal (wco) if its running time is bounded by O(Q), possibly multiplied by polylogs on N (the number of tuples in D) and factors independent of N (such as |Q| or the number of attributes in D). Although originally conceptualized for relational databases, this idea has also been extended to include graph databases. Even though BGPs are more general than natural joins, it is still possible to apply the AGM bound by considering each triple pattern in a BGP as a relation comprising the triples that match its constants [25].

2.3 Leapfrog TrieJoin

Next, we outline the Leapfrog Triejoin algorithm [35] (LTJ, for short), which is arguably the most commonly used wco join algorithm in practice. In order to work properly, LTJ requires the graph database to be structured using tries. The idea is to represent each triple (graph edge) storing its components in a trie following a particular order (e.g., in spo order). Indeed, to achieve worst-case optimality, all 3!=6 possible permutations of the tuples need to be maintained, resulting in the storage of 6 tries (the rationale for this requirement will be explained subsequently). We will call T𝑠𝑝𝑜, T𝑠𝑜𝑝, T𝑝𝑠𝑜, T𝑝𝑜𝑠, T𝑜𝑠𝑝, and T𝑜𝑝𝑠 these tries. Figure 2 shows the 6 tries corresponding to our running-example graph of Figure 1.

Figure 2: The 6 tries corresponding to our running-example graph.

Let Q={t1,,tq} be a BGP, where ti is a triple pattern, for 1iq. Let 𝐕(Q)={x1,,xv} be the set of variables of Q. The distinctive strategy of LTJ is known as variable elimination. This method involves v=|𝐕(Q)| iterations, each focusing on a single variable from 𝐕(Q) to be “eliminated” from the join process. The particular order in which variables are eliminated determines a total order xi1,,xiv of 𝐕(Q), which is called a Variable Elimination Order (VEO, for short). Throughout this process, each triple pattern ti is treated as a ternary relation that participates in the join. Once a VEO is established for Q, each triple pattern ti will have a particular trie Ti associated with it. The trie Ti should first traverse the constants in ti, and then traverse the variables in ti in an order that is consistent with that of the VEO. This is why LTJ needs to store 6 tries. To illustrate this, consider the query Q={(x,𝖺𝖽𝗏,y),(z,𝗇𝗈𝗆,x),(z,w,y)}, for x,y,z,w𝐕(Q). If the VEO is x,y,z,w, then the first triple is associated with trie T𝑝𝑠𝑜, the second with T𝑝𝑜𝑠, and the third one with T𝑜𝑠𝑝. If, instead, the VEO is z,y,x,w, then the tries are T𝑝𝑜𝑠, T𝑝𝑠𝑜, T𝑠𝑜𝑝. If some of these 6 tries were not stored, then some VEOs would become unmanageable.

The navigational operations that must be supported at any trie node v of a trie T are:

  • T.𝗋𝗈𝗈𝗍(): moves to the root node of trie T.

  • T.𝖼𝗁𝗂𝗅𝖽(v,c): descends to the child of node v labeled c in trie T.

  • T.𝗅𝖾𝖺𝗉(v,c0): yields the smallest symbol cc0 that labels a child of node v in trie T.

LTJ starts at the root of each Ti and descends by the children that correspond to the constants in triple ti. Next, we proceed to the variable elimination phase, assuming a VEO xi1,,xiv. For j=1,,v, let QjQ be the set of triple patterns that contain variable xij. Starting with the initial variable xi1 in the VEO, algorithm LTJ first identifies all values cdom(G) such that for every tQ1, substituting xi1 with c in t results in a non-empty evaluation of the modified triple pattern t over G (indicating that there are potential solutions to Q where xi1 equals c). If the trie T related to t is consistent with the VEO, then its current node’s children precisely represent the suitable values c for xi1.

Throughout the execution, we maintain a mapping μ with the solutions of Q. When a suitable value c is found for xi1, we bind xi1 to c, resulting in μ={(xi1:=c)}, and branch on this particular value c. In this branch, we go down by c in all the virtual tries T such that tQ1. Next, the same process is applied to Q2, determining suitable values d for xi2, and extending the mapping to μ={(xi1:=c),(xi2:=d)}. This process is repeated for the remaining variables in the VEO. Once all variables have been bound in this way, μ effectively contains a solution for Q; this occurs multiple times due to the branching for each binding of xi1, xi2, and so forth. When all bindings c for a variable xij have been considered, LTJ backtracks and moves on to the next binding for Qj1. Upon completion of this process, the algorithm has reported all possible solutions for Q.

Determining the values c, d, and so forth, involves finding the intersection among the child nodes of the current nodes across all tries Ti, for each tiQj. The algorithm LTJ carries out this intersection using the primitive Ti.𝗅𝖾𝖺𝗉(v,c0) to implement Barbay and Kenyon’s intersection algorithm [11]. Veldhuizen [35] proved that if the 𝗅𝖾𝖺𝗉 operation runs within polylogarithmic time, then LTJ is wco regardless of the VEO chosen, provided the tries have an appropriate attribute order.

We next illustrate how LTJ handles the query Q={(x,𝖺𝖽𝗏,y),(z,𝗇𝗈𝗆,x),(z,w,y)} using the VEO z,y,x,w. The triple patterns in Q are associated with tries T𝑝𝑜𝑠, Tpso, and T𝑠𝑜𝑝, respectively. We set μ= and begin by processing the constants 𝖺𝖽𝗏 and 𝗇𝗈𝗆, identified as 6 and 7 in our example enumeration, descending to the respective nodes in T𝑝𝑜𝑠 and T𝑝𝑠𝑜. We then handle the variables following the VEO. Starting with z in the second and third triple patterns of Q, we intersect the children at the current nodes of both tries (the node corresponding to string 7 in T𝑝𝑠𝑜 and the root of trie T𝑠𝑜𝑝). The only common child is 5, so we descend to those nodes in both tries, setting μ={(z:=5)}. We now proceed to y, in the first and third triple patterns. We intersect the children of nodes for string 6 in T𝑝𝑜𝑠 and 5 in T𝑠𝑜𝑝. The first intersection element is 1, leading us to descend in both tries and to update μ={(z:=5),(y:=1)}, and to proceed to x in the first and second triple patterns. We intersect the children of nodes corresponding to string 61 in T𝑝𝑜𝑠 and 75 in T𝑝𝑠𝑜. Since 4 is a common value, we descend to the appropriate nodes and update μ={(z:=5),(y:=1),(x:=4)}. Lastly, for w, present only in the third triple pattern, no intersection is required: all children of the node for string 51 in T𝑠𝑜𝑝 are valid for w. The only value is 8, so we set μ={(z:=5),(y:=1),(x:=4),(w:=8)}. Having bound all variables, μ is a solution to Q. According to our enumeration, these values translate to z:=𝖭𝗈𝖻𝖾𝗅, y:=𝖡𝗈𝗁𝗋, x:=𝖶𝗁𝖾𝖾𝗅𝖾𝗋, w:=𝗐𝗂𝗇, resulting in the set of triples {(𝖶𝗁𝖾𝖾𝗅𝖾𝗋,𝖺𝖽𝗏,𝖡𝗈𝗁𝗋),(𝖭𝗈𝖻𝖾𝗅,𝗇𝗈𝗆,𝖶𝗁𝖾𝖾𝗅𝖾𝗋),(𝖭𝗈𝖻𝖾𝗅,𝗐𝗂𝗇,𝖡𝗈𝗁𝗋)} in the graph. Finally we backtrack, removing w from μ and returning to variable x, and repeat the process until all bindings for the variables of Q are found. In this example, this was the only answer for Q.

3 The Ring Index

In the sequel we show how the Ring implements the functionality needed to implement the LTJ algorithm (see Section 2.3) over a different data representation: the Ring index [7, 5].

3.1 The circular strings model

While LTJ is classically presented as running on tries, we take a more abstract view and consider the database triples (s,p,o) as circular strings [12], where we regard all the strings spo, osp, and pos as the same. We formalize circular strings using the concept of rotation.

Definition 1.

A rotation of a string S[1..n] is any S[i+1..n]S[1..i], for 0i<n.

Definition 2.

The circular string (or c-string) S̊ is defined as the set of the rotations of S.

In particular, spo̊={spo,osp,pos}; note that spo̊=osp̊=pos̊. Note that, because the alphabet of the predicates is disjoint from that of subjects and objects, the three rotations of Def. 2 must be different in our case (because p is at a different position). Therefore, there is a one-to-one correspondence between triples and c-strings.

The graph database is then seen as a set of c-strings, one per triple, and we identify each trie node v with the set of c-strings corresponding to the triples that descend from v. Figure 3 shows the c-strings corresponding to three sample nodes of tries T𝑠𝑝𝑜 and Tpos.

Figure 3: The c-strings corresponding to three nodes of tries T𝑠𝑝𝑜 and T𝑝𝑜𝑠 (see solid arrows). The corresponding triple patterns t(v) (according to Definition 3) are also shown for these nodes (see dashed arrows).

For what follows, we will create two disjoint copies, Σs and Σo, of the alphabet ΣV, in order to distinguish subjects from objects in c-strings. We will call Σp=ΣL the alphabet of predicates for consistency. Thus, triples will belong to Σs×Σp×Σo. Further, the three sub-alphabets will form disjoint lexicographic ranges. In our running example, we separate the alphabets by adding |𝒰| to the p components of each triple in the graph, while the o components are increased by 2|𝒰|. Let us also redefine triple patterns.

Definition 3.

A triple pattern is an element of (Σs{})×(Σp{})×(Σo{}), where is a special symbol. We say that the elements are unbound and the others are bound. A triple (s,p,o) matches a triple pattern (s,p,o) iff x=x or x= for every x{s,p,o}.

Every trie node v then corresponds univocally to a triple pattern t(v), where the edges that lead from the root to v define the bound components of t(v). The node v then represents all the database triples that match t(v). Figure 3 also shows the triple patterns corresponding to three nodes of tries T𝑠𝑝𝑜 and T𝑝𝑜𝑠.

For our purposes, a key property of the characterization as circular strings is the following.

Theorem 4.

The bound components of any triple pattern t=(s,p,o) can be concatenated into a string P(t) such that a triple (s,p,o) matches t iff P(t) prefixes exactly one string of its c-string spo̊. The prefixed rotation of spo̊ is unique if P(t)ε.

Proof.

Let us consider all the possibilities for |P(t)|.

If (s,p,o) matches t and the latter has one bound component, then P(t) is univocally s, p, or o, and it prefixes spo, pos, or osp, respectively. It prefixes only one of those because the triple components have disjoint alphabets. Conversely, if |P(t)|=1 and P(t) prefixes spo, pos, or osp, then (s,p,o) matches t.

If there are two bound components, then we can form P(t)=sp, P(t)=po, or P(t)=os, which prefixes spo, pos, or osp, respectively. Again, P(t) prefixes exactly one of those. Conversely again, if |P(t)|=2 and it prefixes spo, pos, or osp, then (s,p,o) matches t. Note that the other choices, P(t)=ps, P(t)=so, and P(t)=op, do not prefix any suffix.

If all the components are bound, then three choices of P(t) prefix (and equal) a string of spo̊ for the only triple (s,p,o) that matches t. Finally, if there are no bound components, then P(t)=ε prefixes every string, and conversely every triple matches t.

Building on this result, it will suffice to represent tries Tspo, Tosp, and Tpos. Our plan is to identify every trie node v with a set of rotations, which except for the root will contain exactly one rotation of each c-string spo̊ whose triple (s,p,o) matches t(v). By Theorem 4, the rotations are exactly those that start with a given string P(t), so v descends from the root by P(t). The way to efficiently represent that set of suffixes is considered next.

For example, consider the trie Tpos on the right of Figure 3. Its rightmost node v descending by 83 has t(v)=(,8,3) and P(t)=83. The node represents the single c-string 583̊, and contains exactly one rotation of that c-string, 835, which starts with P(t).

3.2 The extended Burrows-Wheeler Transform

The extended BWT (eBWT) [28, 14] is an extension of the Burrows-Wheeler Transform (BWT) [13] to a set of strings. It builds on the concept of omega-order [28] on strings, which is defined as follows.

Definition 5.

An infinite string S over alphabet Σ is a mapping S:+Σ. The prefix S[..i] of the infinite string S is the finite string S[1]S[2]S[i]. The suffix X=S[i..] of the infinite string S is the infinite string such that X[j]=S[i+j1].

Definition 6.

The ω-extension Sω of a string S is the infinite string formed by concatenations of strings S, that is, Sω[i]=S[1+((i1)mod|S|)].

Definition 7.

Given a string S, let m1 be the maximum value such that S=Um for some string U; we then say that exp(S)=m and root(S)=U. The omega-order, denoted <ω, between strings S and T is then defined as follows: S<ωT iff

  • root(S)=root(T) and exp(S)<exp(T), or

  • Sω<Tω, where < is the lexicographic order.

Note that, if S and T are not a proper prefix of the other, then <ω coincides with the lexicographic order, but it can differ otherwise.

Definition 8.

Given a set of strings {S1,,Sn}, the eBWT is a permutation of the symbols in the strings Si given by listing the rotations of all strings Si in omega-order, and then concatenating the last symbols of each rotation.

We are now in position to define the Ring index.

Definition 9.

The Ring index of a set 𝒟 of triples is the eBWT of the set 𝒮={spo,(s,p,o)𝒟}, after properly separating the alphabets into Σs, Σp, and Σo.

Because of the alphabet separations, it follows that the rotations X of all the c-strings spo̊ have root(X)=X and exp(X)=1; therefore the omega-order boils down to the lexicographic order between their ω-extensions. Further, because of our alphabet separation, such order boils down in turn to the plain lexicographic order of the strings. The Ring’s eBWT then lists all the strings of all c-strings spo̊ in the graph, sorts them lexicographically, and collects their last symbols.

Figure 4 (left) shows all the spo̊ strings corresponding to the edges of the graph from Figure 1. Since |𝒰|=8 in this case, we sum 8 to the p component of each triple, and 16 to the o components, and thus Σs=[1..8], Σp=[9..16], and Σo=[17..24]. On the right, the triples are lexicographically sorted, obtaining the corresponding Ring index as the last column of the table.

Figure 4: On the left, the 3 rotations of each c-string spo̊ of our running example graph. On the right, the strings are lexicographically sorted. The last column is the corresponding eBWT. The outer red rectangle corresponds to the node v of Tspo with t(v)=(5,,), and the inner one to t(v)=(5,16,).

The following theorem shows that we can identify each node v of the three chosen tries with a range of the eBWT of the Ring index.

Theorem 10.

There is a distinct interval in the Ring representing the triples that descend from each node v of Tspo, Tosp, and Tpos.

Proof.

If v is the root, we can define its range as the whole eBWT. Otherwise, by Theorem 4, there is a string P(t) for the triple pattern t=t(v) that prefixes exactly one string of the c-string spo̊, or equivalently a rotation of spo, for each triple (s,p,o) matching t. By Def. 9, the three rotations of spo are listed in the Ring in lexicographic order, and thus those prefixed by P(t) form a single range. Those ranges are unique once we define how P(t) is formed. As seen in Theorem 4, there are multiple choices to form P(t) only when |P(t)|=3, that is, when t is a triple. In this case it corresponds to just one c-string spo̊ and the singleton range can be arbitrarily fixed to be that of any of its three rotations.

Consider, for instance, the node corresponding to t=(5,,) in T𝑠𝑝𝑜 of Figure 2, whose subtree includes the triples (5,7,4), (5,8,1), (5,8,2), and (5,8,3). This node is represented by the interval [4..7] in the Ring shown in Figure 4 (right side), highlighted with the larger red box, which contains exactly the same triples mentioned before (with components p and o shifted accordingly, as we have already explained). On the other hand, interval [5..7] (highlighted with the smaller red box) corresponds to the node for triple pattern t=(5,8,) in the trie T𝑠𝑝𝑜 of Figure 2. Actually, notice that interval [1..7] in this sample Ring corresponds to the root of trie T𝑠𝑝𝑜, interval [8..14] is associated with the root of trie T𝑝𝑜𝑠, and interval [15..21] represents the root of trie T𝑜𝑠𝑝. Each subinterval within these regions corresponds to nodes in the respective tries.

Roughly speaking, the Ring index then maps the trie operations to operations on the eBWT of the triples. In the sequel we show precisely how this is done.

3.3 The eBWT as a representation of the graph

The following definition will help us discuss how the eBWT works on the graph.

Definition 11.

Let S=a1a2a3ak be a string where aiΣi for all i. We then say that the order of S is the string 123k.

In our case, the order of spo is spo, the order of pos is pos, and the order of osp is osp. Because the three alphabets form lexicographic ranges, eBWT contains a range where all the listed rotations are of order spo, and thus the eBWT contains only symbols of Σo, a second range where all the rotations are of order pos and the eBWT contains only symbols of Σs, and a third range where all the rotations are of order osp and the eBWT contains only symbols of Σp. We store those three segments of the eBWT, respectively, as strings, called columns, Co[1..N], Cs[1..N], and Cp[1..N] (see the different ranges and colors in the last column in Figure 4), and say that their orders are spo, pos, and osp, respectively.

Let us first show that our representation allows accessing any desired triple (s,p,o) starting from its position in any of the columns Cx. For example, to retrieve the triple represented at Co[i], we know immediately that o=Co[i]. Now, by the classic BWT invariant [13, 16] (which holds true in the eBWT [28, Prop. 10]), we find the entry Cp[j] corresponding to Co[i] with

j=𝖫𝖥o(i)=Ao[o]+𝗋𝖺𝗇𝗄o(Co,i), (1)

where Ax[c] counts the number of times symbols smaller than c occur in Cx, and 𝗋𝖺𝗇𝗄c(X,i) counts the number of times c occurs in X[1..i]. With j we obtain p=Cp[j], compute the position k of the triple in Cs with

k=𝖫𝖥p(j)=Ap[p]+𝗋𝖺𝗇𝗄p(Cp,j),

and complete the process with s=Cs[k]. The eBWT invariants imply that

i=𝖫𝖥s(k)=As[s]+𝗋𝖺𝗇𝗄s(Cs,k),

that is, we cycle on the string: the eBWT effectively regards the strings as circular and this permits retrieving the triples starting from any of their components.

See Figure 5 for an example of this process. In this case, we have i=6, and hence o=Co[6]=18. So, j=Ao[18]+𝗋𝖺𝗇𝗄18(Co,6)=2+2=4. Then, p=Cp[4]=16, so k=Ap[16]+𝗋𝖺𝗇𝗄16(Cp,4)=4+2=6. Then, s=Cs[6]=5. So, we get back to i since i=As[5]+𝗋𝖺𝗇𝗄5(Cs,6)=3+3=6. The corresponding 𝑠𝑝𝑜 triple is (5,16,18).

Figure 5: The Ring index, formed by the three columns Cx, and how each column is obtained by moving some other column to the front and reordering (the dashed arrow shows the case spo osp). We also show, with solid arrows, the process for retrieving a triple in 5,16,18̊ from the index.

The rationale of (say) Eq. (1) is the following. Co corresponds to the order spo and Cp to the order osp. The second order is obtained from the first by stably reordering the tuples by their o component, in accordance with the lexicographic order. In a stable reordering of spo, position Co[i]=o will be moved to position j such that (i) all the symbols less than o come before j (and those are counted in Ao[o]), and (ii) all the symbols o preceding the one at Co[i] also come before j (thus we add 𝗋𝖺𝗇𝗄o(Co,i)). Figure 5 also shows one of those reorderings, from spo to osp.

Because the shifted symbols are stored in different strings Cx, the Ring remaps their alphabets back to Σx=[1..|Σx|]. The nodes that are both subjects and objects use the same identifier in both Σs and Σo, in the range [1..|ΣsΣo|]. Also, the Ring stores the arrays Ax using N+o(N) bits, whereas function 𝗋𝖺𝗇𝗄 is computed in time O(log|𝒰|) using a wavelet tree [22] representation of the columns Cx[1..N], just like a typical FM-index implementation [17]. Overall, it represents a graph database of N triples over universe 𝒰 using 3Nlog|𝒰|+o(Nlog|𝒰|) bits (logarithms are to the base 2), and can retrieve any triple in time O(log|𝒰|). This is almost the same space needed to store the 3N triple identifiers in plain form. What is striking is that, within this space, the Ring can implement all the operations required by the LTJ algorithm, as we see next.

3.4 Implementing the operations on the eBWT

Having separate columns, it is more convenient to identify the root of a trie starting with symbol of alphabet Σx as the whole range Cx[1..N]. This implements operation 𝗋𝗈𝗈𝗍 depending on the next triple component we will descend by.

Now assume we are in a range of the eBWT (that is, a range of Cs, Cp, or Co) that corresponds to a trie node v. Recall that this means that the range is that of all the rotations that start with P(t(v)). Let us show how to simulate the trie operations 𝖼𝗁𝗂𝗅𝖽 and 𝗅𝖾𝖺𝗉 on this representation.

The “easy” case

In the “easy” case, v is represented as Cx[i..j] and the children of v belong to the alphabet Σx. We can then use an extension of the 𝖫𝖥x functions to compute the eBWT range corresponding to a desired child of v. For example, if we are at v=𝗋𝗈𝗈𝗍() and want to descend to 𝖼𝗁𝗂𝗅𝖽(v,s), then we move from the range Cs[1..N], which represents v, to the range Co[i..j]=Co[As[s]+1..As[s+1]], which represents u=𝖼𝗁𝗂𝗅𝖽(v,s). This is the range of all the rotations starting with s. We can now descend to 𝖼𝗁𝗂𝗅𝖽(u,o) with the famous backward search formula [16], which essentially computes the range of all values 𝖫𝖥o(k) for all ikj where Co[k]=o:

i = Ao[o]+𝗋𝖺𝗇𝗄o(Co,i1)+1, (2)
j = Ao[o]+𝗋𝖺𝗇𝗄o(Co,j),

so that now Cp[i..j] is the range of all the rotations starting with os, and represents w=𝖼𝗁𝗂𝗅𝖽(u,o). If we further descend to 𝖼𝗁𝗂𝗅𝖽(w,p), the analogous formulas will boil down to a single entry of Cs corresponding to the trie leaf representing the triple (s,p,o) (in this case, using the order pos).

For instance, assume that we want to go down by s=5 and then by o=19. We start from the range Co[1..7], corresponding to T𝑠𝑝𝑜.𝗋𝗈𝗈𝗍(). Going down to the child labeled s=5 corresponds to restricting the range to Co[As[5]+1..As[6]]=Co[4..7]. Then, descending by o=19 from this node corresponds to moving to the range Cp[i..j], with i=Ao[19]+𝗋𝖺𝗇𝗄19(Co,3)+1=4+1=5 and j=Ao[19]+𝗋𝖺𝗇𝗄19(Co,7)=4+1=5. The range Cp[5..5] then corresponds to the range of a node v in the trie Tosp, which represents the triple pattern t(v)=(5,,19). Note that, although we do not represent the trie Tsop, we can find a node in some trie that corresponds to the same set of c-strings.

Operation 𝗅𝖾𝖺𝗉(v,c0) is slightly more complex, and resorts to the geometric capabilities of the wavelet tree. In the “easy case” again, node v is represented by a range Cx[i..j] and c0 belongs to Σx; for example we are in Co[i..j] and c0 is an object. We then use the wavelet tree ability to compute, in O(log|𝒰|) time, the least value not smaller than c0 in a range [i..j] of the string Cx it represents [21].

The “hard” case

If v is the root and we descend by triple component x, we are always in the “easy” case, because we can use Cx[1..N] as 𝗋𝗈𝗈𝗍(). It is also always the easy case if t=t(v) has two bound components (or, equivalently, v has depth 2): given any length-2 prefix P(t) of the eBWT rotations, their range is in Cx, where x is the unbound component of t. For example, if P(t)=sp, then its eBWT range is within Co. The only “hard” cases occur when exactly one component is bound and we are forced to move “forward”, instead of “backward” as Eq. (2) does: if only s is bound in t (the cases of p and o are analogous), then the range of P(t) is within order spo, and then the representation of v is of the form Co[i..j]. We can then descend by o using Eq. (2), but we cannot descend by p in this way. Rather, we must restrict the range Co[i..j] so that it goes from representing all the rotations starting with s to representing all the rotations starting with sp; the order is still spo.

To descend in this hard case, we reorder the string P(t): we descend from the root Cp[1..N] by p, and from the resulting interval Cs[i..j] of all the rotations starting with p, we descend by s to the interval Co[i′′..j′′] of the rotations starting with sp.

For instance, assume that we have instantiated s=5, which corresponds to Co[4..7] in our example Ring. This range is depicted by the larger box in the Ring of Figure 4. Next, suppose that we instantiate p=16 and want to go down in the corresponding trie. The resulting range is represented by the smaller box in Figure 4. Notice that in this case we remain at Co (i.e., in trie Tspo) after descending with p=16 (unlike the “easy” case, where we jump to a different trie after going down). However, using 𝖫𝖥p to go down by p=16 is not feasible since we are currently in Co. Instead, we descend from Cp[1..7] to Cs[Ap[16]+1..Ap[17]]=Cs[5..7], and then proceed with s=5 to obtain the desired range Co[i..j], with i=As[5]+𝗋𝖺𝗇𝗄5(Cs,4)+1=5 and j=As[5]+𝗋𝖺𝗇𝗄5(Cs,7)=7.

Finally, we implement 𝗅𝖾𝖺𝗉(v,c0) in the hard case as follows. Consider the same case above, where only s is bound, v is represented by Co[i..j], and p0Σp. We start from Cs[Ap[p0]+1..N], which is the range of all the rotations starting with predicates pp0. We then map that range using an analogous of Eq. (2):

i = As[s]+𝗋𝖺𝗇𝗄s(Cs,Ap[p0])+1,
j = As[s+1],

and obtain Co[i..j], the range of all the rotations starting with s and following with some pc0. The first element of that range, Co[i], is a triple with the desired minimum value pp0. We then obtain p by traversing the circular string as described in Section 3.3.

4 Rings in Higher Dimensions

The theory of worst-case-optimal join algorithms was initially developed for the relational model, and only then adapted to graph databases [25], where BGPs are modeled mainly as multijoins on a single table of three columns. The space problem is even more serious for tables with more than three attributes: a table with d attributes (which we will say to be of dimension d) needs d! tries, each storing the equivalent of a copy of the database, in order to implement worst-case-optimal joins using LTJ. This, of course, makes the technique unaffordable even for modest amounts of attributes! A natural question is: can we extend the Ring to dimensions d>3 so as to use space proportional to the raw data?

The answer is, unfortunately, no: the key Theorem 4 does not hold for higher dimensions. Already for d=4, let {s,p,o,g} be the attributes. No matter how we choose the c-strings (say, spog̊), there are subsets of attributes for which P(t) do not prefix any rotation of spog̊ (say, {s,o} or {p,g}).

It is not hard to see, however, that two c-strings, like spog̊ an sopg̊, suffice for Theorem 4 to hold on d=4, in the sense that any string P(t), in some convenient order, prefixes some rotation of spog̊ or of sopg̊. Compared to storing 4!=24 tries, storing just two eBWTs yields an enormous space reduction!

The Ring simulation of Section 3 can be extended to having two or more eBWTs, by jumping from one to another as needed. The following definition captures what is needed from the chosen set of c-strings for such a simulation to work.

Definition 12.

A string Y is a substring of a c-string S̊ iff it is a substring of Sω. If |Y||S|, this is equivalent to Y being a substring of some rotation of S.

Definition 13.

A set 𝒮={S1,S2,} of strings over alphabet Σd={1,2,,d} is complete for dimension d iff, for every 𝒴Σd and xΣd𝒴, 𝒴 can be ordered to form a string Y such that Yx or xY are a substring of some S̊i𝒮.

The elements of 𝒮 are strings of attributes, for example 𝒮={123} is complete for d=3 and 𝒮={1234,1324} is complete for d=4. A Ring for dimension d is built on a complete set 𝒮 of strings of length d. For each data tuple (a1,,ad) and each Si𝒮, we extract the string aSi[1]aSi[d] (a permutation of the tuple). For example, with d=3 and 𝒮={123}, we extract from the tuple (s,p,o) the string spo. With d=4 and 𝒮={1234,1324}, we extract from the tuple (s,p,o,g) the strings spog and sopg. The Ring consists of one eBWT for each string Si, which indexes the rotations of all the c-strings aSi[1]̊aSi[d].

Consider a trie node v representing a tuple pattern t=t(v), now of d elements. If 𝒮 is complete, then by definition it is possible to concatenate the bound elements of t in some order into a string P(t) that prefixes all rotations in the eBWT of some Si. We note that there could be various ways to concatenate the bound attributes of t to form strings P(t) so that they prefix rotations in different eBWTs.

We can then descend from v to 𝖼𝗁𝗂𝗅𝖽(v,x), with xΣx, in two ways. The “easy” one, where v is represented by a range in Cx, uses the analogous of Eq. (2). The “hard” case discards the current range and recomputes the result from scratch: it forms P(t(v)) such that its order of attributes Y, preceded or followed by x, occurs in some S̊i (this works when 𝒮 is complete). It then retraverses the tries from the root following the path of P(t(v))x or xP(t(v)), using the analogous of Eq. (2) at each step.

Suppose, with Rings for spog and sopg, that we want to go down using s, then o, then p, and finally g. Initially, without further information, we may descend by s from the root of the trie Tspog, obtaining a range of the spog Ring. Then, when it comes to go down by o, we note that this is not possible on this Ring, so we move to the sopg Ring: we start with o and continue by s using the “easy” case, reaching a node in the trie Tsopg (i.e., a range in the sopg Ring). To proceed with p, we can remain in the same Ring, yet it corresponds to the “hard” case, because p is to the right of so in the order sopg. This can be solved with “easy” cases, either by re-running p, o, and s on the same sopg Ring, or using o, p, and s if we move to spog. In both cases, the final descent by g corresponds to an easy case.

To implement 𝗅𝖾𝖺𝗉(v,x0), there are also two possibilities. In the easy case, we proceed on the wavelet tree exactly as explained (for the easy case) in Section 3. In the hard case, there are two subcases:

  • We can form P(t(v)) whose order Y is such that xY occurs in some S̊i. In this case, we retraverse the tries from the root to find Y, and finish as in the easy case. (Note that we were already in some range for the unbound attributes of t(v), but not in one from where we could descend by x.)

  • We can form P(t(v)) whose order Y is such that Yx occurs in some S̊i. In this case, we start from the range Cx[Ax[x0]+1,N] in the corresponding eBWT, and descend by P(t(v)) from it.

We then track the first cell in the resulting range, using 𝖫𝖥, back to Cx to find the answer x Because of the hard cases, we simulate each 𝖼𝗁𝗂𝗅𝖽 or 𝗅𝖾𝖺𝗉 operation in time O(dlog|𝒰|), which still yields a worst-case-optimal query algorithm, as d is independent of N.

Now that we know how to use multi-eBWT Rings to handle higher dimensions, the question is: how many eBWTs (i.e., strings in 𝒮) are needed to be complete in dimension d? Arroyuelo et al. [5] studied this problem, showing that 5 eBWTs (instead of 5!=120 tries) are necessary and sufficient for d=5, and 7 eBWTs (instead of 6!=720 tries) are necessary and sufficient for d=6. The optimal number for d=7 is shown to be between 10 and 12, and for d=8 is between 21 and 25 (they obtain various lower bounds). In asymptotic terms, they prove that the number of eBWTs needed is O(2d) and Ω(2d/d). These results make worst-case-optimal algorithms much more feasible for relational databases.

5 Order Graphs

A way to visualize complete sets 𝒮={S1,S2,} is what are called order graphs [5].

Definition 14.

An order graph of dimension d has, as nodes, all the permutations of [1..d], and has directed labeled edges of the form ZxxY.

In the order graphs that model complete sets of attribute strings, Z is always of the form Yx. For example, the order graph for 𝒮={123} consists of the cycle

123331222311123;

recall the example in Figure 5, where we can see how one can move from the spo order to osp using 𝖫𝖥o, then to pos using 𝖫𝖥p, and then back to spo using 𝖫𝖥s. The order graph for 𝒮={1234,1324} consists of the two cycles:

123444123334122234111234,
132444132224133324111324.

Each edge e=YxxxY of the order graph implies that the Ring stores a column Ce over the alphabet Σx, so that the function 𝖫𝖥e maps from order Yx to order xY. That is, 𝖫𝖥e applied on a position of the source order leads to the corresponding position in the target order. Backward search, as in Eq. (2), also maps from the first to the second order. The cycles arise because the eBWT, by definition, always stores the last column of its rotations.

An alternative way to understand Def. 13 is that, for every Y (in some order) and x, there must exist a directed path labeled Yx or xY in the order graph: we can go from the source to the target node using 𝖫𝖥 or backward search and obtain the desired cell or range in the order of target node. This leads to a definition of complete order graphs.

Definition 15.

An order graph of dimension d is complete iff, for every 𝒴Σd=[1..d] and xΣd𝒴, 𝒴 can be ordered to form a string Y such that concatenating the labels over some directed path in the graph one obtains Yx or xY.

The definition of order graph, however, opens up new alternatives that are not possible under the eBWT framework, namely when Z is not of the form Yx in Def. 14. Consider the list of the rotations listed in lexicographic order by the eBWT, and regard the possibility of storing another column of the rotations, not necessarily the last. For example, if we take the order 1234 and store the third column, the corresponding 𝖫𝖥 function on that column will stably reorder the rotations by the attribute 3, so that the ordering obtained is 3124 (recall the rationale of 𝖫𝖥 given after Eq. (1)). Backward search also works correctly using the analogous of Eq. (2). Such a column corresponds to a graph edge 123433124, and the order graph is not anymore a set of cycles of length d. Figure 6 shows how this reordering by another column operates on our example graph of dimension d=3.

Figure 6: Reordering by the second column to go from order pos to order ops.

While complete sets of c-strings yield complete order graphs, there are complete order graphs that do not correspond to complete sets. For d=4, for example, we have a minimal complete order graph that consists of a single cycle of length 8, instead of two cycles of length 4 (for both, the index stores 8 columns Ce):

34211134222134332144432111432221434421333421.

More curious optimal complete order graphs for d=4 exist; Figure 7 shows two of them.

Figure 7: Two particular complete order graphs of minimum size 8 for d=4.

A natural question is whether complete order graphs can yield smaller Rings than complete sets. The answer is indeed affirmative! For d=5, there exists a complete order graph formed by a single cycle of 20 edges [5], whereas the smallest complete set requires 5 eBWTs: seen as an order graph, that amounts to 5 cycles of length 5, that is, 25 edges. The length-20 cycle is known to be a minimal complete order graph for d=5. In general, however, it is unknown whether some minimal complete order graph is always a single cycle, or a set of cycles. It is only known that some minimal complete order graph is always formed by a set of cycles with trees possibly sprouting from the nodes [5]. That is, every node must have indegree 1. As it can be inferred from this discussion, no effective ways to build minimal complete sets of c-strings or order graphs for a dimension d exist; we have resorted to exhaustive search.

6 Regular Path Queries

Assume an RPQ such that the initial node s is fixed. A classic solution to this problem is to build the nondeterministic automaton of the regular expression and traverse the graph from s, in DFS or BFS order. As we traverse an edge s𝑝o, we feed the automaton with symbol p, and abort the branch if the automaton runs out of active states. We must also record the active automaton states with which we had already visited each graph node, to avoid repeating the same nodes with the same states (i.e., we deactivate the repeated states in this second visit). We report (s,o) for every node o where we arrive with a final automaton state activated.

The other cases are solved analogously: if only the final node o is fixed, we can reverse the regular expression the symbols (i.e. convert every p with p^ and vice versa). If both are fixed, we can start from either of those and stop as soon as the other node is found with a final state. If none is fixed, we start a search from every possible value of s.

6.1 Simulation with the Ring

This process can be simulated with the Ring without using additional data structures. In this case, it is easier to start assuming that only o is fixed, and derive the other cases analogously, so we use the reversed automaton. Our first step is to descend from the trie root by o, v=𝖼𝗁𝗂𝗅𝖽(𝗋𝗈𝗈𝗍(),o), which is represented by the range Cp[i..j]=Cp[Ao[o]+1..Ao[o+1]]. We now analyze the automaton states to determine the symbols that would lead to active states. For each such symbol p, we descend to u=𝖼𝗁𝗂𝗅𝖽(v,p) using an analogous of Eq. (2), for predicate p instead of for object o. The resulting range, Cs[i..j], contains all the distinct sources s that lead to o by symbol p. Since there cannot be repetitions in this list, we extract each value of s and iterate from it as the new source node, that is, setting os.

Note that we do not need the Ring component Co for this simulation. Further, one can avoid duplicating the edges to support reversed symbols if one uses the “non-typical” case technique to handle them. Such a technique [32] was implemented over a bit-parallel simulation of the automaton, and shown to be very competitive in space and time with classic solutions. Further, it was shown that the used Ring components are equivalent to a known succinct representation of labeled graphs [31, Sec. 9.1.4]

6.2 Larger and smarter

An earlier Ring-based implementation [6], instead, does include the reversed edges, and thus it can always use the “typical” case formulas. This enables other optimizations that make the process faster. One possibility is that, instead of trying out the labels p that are relevant for the automaton, one can enumerate the distinct labels p that occur in Cp[i..j], which could be less. The wavelet tree of Cp can do this enumeration in logarithmic time per resulting symbol [21]. Further, the wavelet tree can be enriched with information on the automaton states that are sources of symbols in ranges of ΣL. Those can be intersected with the currently active states as we traverse the wavelet tree, so as to prune subtrees that cannot activate any state. This technique computes the intersection between the predicates that lead to active automaton states and those that leave from the current node, in optimal time, that is, proportional to the alternation complexity of both sets [11]. An analogous technique on the wavelet treee of Cs avoids visiting sources of Cs[i..j] that have already been visited with all the currently active states.

Using even more space (but still several times less than classic solutions), one can implement even smarter traversals, in considerably less time [6]. For example, the regular expression can be cut at an edge with a label p that is infrequent in the graph. From each edge s𝑝o, we launch a backward traversal from s to nodes that activate the initial automaton state, and a forward traversal from o to nodes that activate the final automaton states, and report the Cartesian product of both sets.

7 Perspective

We believe that the most important take-away message for the stringology audience is that there are algorithmic and combinatorial problems in the database community, particularly (but not only) in graph databases, that can benefit from approaches rooted in stringology: the LTJ algorithm works on tries, the Ring structure builds on the eBWT, and RPQs are an extended form of regular expression matching. We have also seen how combinatorial problems arise, like finding lower bounds to the number of Rings needed in dimension d, finding optimal ones efficiently, or characterizing the shapes of optimal order graphs. There is also space for the use of compact data structures: the Ring is a non-redundant version of the six tries used by LTJ, and we have used wavelet trees to solve the more complex LTJ primitives, as well as to find better query plans [4]. Other remarkable uses of compact data structure are compressed quadtrees to solve BGPs on any dimension d [9] and ordinal tree representations to extend BGPs with topological queries [20].

On the other hand, the application of stringology techniques to databases bounces back with new challenges and ideas. The most intriguing one is probably the idea of permuting the eBWT matrix using an arbitrary column, not necessarily the last one. Could this technique bring meaningful functionalities in other scenarios?

References

  • [1] Christopher R. Aberger, Andrew Lamb, Susan Tu, Andres Nötzli, Kunle Olukotun, and Christopher Ré. Emptyheaded: A relational engine for graph processing. ACM Transactions on Database Systems, 42(4):20:1–20:44, 2017. doi:10.1145/3129246.
  • [2] Waqas Ali, Muhammad Saleem, Bin Yao, Aidan Hogan, and Axel-Cyrille Ngonga Ngomo. A survey of RDF stores & SPARQL engines for querying knowledge graphs. The VLDB Journal, 31(3):1–26, 2022. doi:10.1007/S00778-021-00711-3.
  • [3] Renzo Angles, Marcelo Arenas, Pablo Barceló, Aidan Hogan, Juan L. Reutter, and Domagoj Vrgoc. Foundations of modern query languages for graph databases. ACM Computing Surveys, 50(5):68:1–68:40, 2017. doi:10.1145/3104031.
  • [4] Diego Arroyuelo, Fabrizio Barisione, Antonio Fariña, Adrián Gómez-Brandón, and Gonzalo Navarro. New compressed indices for multijoins on graph databases. CoRR, 2408.00558, 2024. doi:10.48550/arXiv.2408.00558.
  • [5] Diego Arroyuelo, Adrián Gómez-Brandón, Aidan Hogan, Gonzalo Navarro, Juan L. Reutter, Javiel Rojas-Ledesma, and Adrián Soto. The Ring: Worst-case optimal joins in graph databases using (almost) no extra space. ACM Transactions on Database Systems, 29(2):article 5, 2024.
  • [6] Diego Arroyuelo, Adrián Gómez-Brandón, Aidan Hogan, Gonzalo Navarro, and Javiel Rojas-Ledesma. Optimizing RPQs over a compact graph representation. The VLDB Journal, 33:349–374, 2024. doi:10.1007/S00778-023-00811-2.
  • [7] Diego Arroyuelo, Aidan Hogan, Gonzalo Navarro, Juan L. Reutter, Javiel Rojas-Ledesma, and Adrián Soto. Worst-case optimal graph joins in almost no space. In Proc. 47th ACM International Conference on Management of Data (SIGMOD), pages 102–114, 2021. doi:10.1145/3448016.3457256.
  • [8] Diego Arroyuelo, Aidan Hogan, Gonzalo Navarro, Juan L. Reutter, and Domagoj Vrgoc. Tackling challenges in implementing large-scale graph databases. Communications of the ACM, 67(8):40–44, 2024. doi:10.1145/3653314.
  • [9] Diego Arroyuelo, Gonzalo Navarro, Juan L. Reutter, and Javiel Rojas-Ledesma. Optimal joins using compressed quadtrees. ACM Transactions on Database Systems, 47(2):article 8, 2022.
  • [10] Albert Atserias, Martin Grohe, and Dániel Marx. Size bounds and query plans for relational joins. SIAM Journal on Computing, 42(4):1737–1767, 2013. doi:10.1137/110859440.
  • [11] Jérémy Barbay and Claire Kenyon. Alternation and redundancy analysis of the intersection problem. ACM Transactions on Algorithms, 4(1):4:1–4:18, 2008. doi:10.1145/1328911.1328915.
  • [12] Nieves Brisaboa, Ana Cerdeira-Pena, Guillermo de Bernardo, Antonio Fariña, and Gonzalo Navarro. Space/time-efficient rdf stores based on circular suffix sorting. The Journal of Supercomputing, 79:5643–5683, 2023. doi:10.1007/S11227-022-04890-W.
  • [13] Michael Burrows and David Wheeler. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994.
  • [14] Davide Cenzato and Zsuzsanna Lipták. A theoretical and experimental analysis of BWT variants for string collections. In Hideo Bannai and Jan Holub, editors, Proc. 33rd Annual Symposium on Combinatorial Pattern Matching (CPM), pages 25:1–25:18, 2022. doi:10.4230/LIPICS.CPM.2022.25.
  • [15] Orri Erling. Virtuoso, a hybrid rdbms/graph column store. IEEE Data Engineering Bulletin, 35(1):3–8, 2012. URL: http://sites.computer.org/debull/A12mar/vicol.pdf.
  • [16] Paolo Ferragina and Giovanni Manzini. Indexing compressed texts. Journal of the ACM, 52(4):552–581, 2005. doi:10.1145/1082036.1082039.
  • [17] Paolo Ferragina, Giovanni Manzini, Veli Mäkinen, and Gonzalo Navarro. Compressed representations of sequences and full-text indexes. ACM Transactions on Algorithms, 3(2):article 20, 2007.
  • [18] Paolo Ferragina and Rossano Venturini. The compressed permuterm index. ACM Transactions on Algorithms, 7(1):article 10, 2010.
  • [19] Nadime Francis, Alastair Green, Paolo Guagliardo, Leonid Libkin, Tobias Lindaaker, Victor Marsault, Stefan Plantikow, Mats Rydberg, Petra Selmer, and Andrés Taylor. Cypher: An evolving query language for property graphs. In Proc. 44th ACM International Conference on Management of Data (SIGMOD), pages 1433–1445. ACM, 2018. doi:10.1145/3183713.3190657.
  • [20] José Fuentes-Sepúlveda, Adrián Gómez-Brandón, Aidan Hogan, Ayleen Iribarra-Cortés, Gonzalo Navarro, and Juan Reutter. Worst-case-optimal joins on graphs with topological relations. In Proc. 34th International World-Wide Web Conference (WWW), 2025. To appear.
  • [21] Travis Gagie, Gonzalo Navarro, and Simon J. Puglisi. New algorithms on wavelet trees and applications to information retrieval. Theoretical Computer Science, 426-427:25–41, 2012. doi:10.1016/J.TCS.2011.12.002.
  • [22] Roberto Grossi, Ankur Gupta, and Jeff S. Vitter. High-order entropy-compressed text indexes. In Proc. 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 841–850, 2003.
  • [23] Steve Harris, Andy Seaborne, and Eric Prud’hommeaux. SPARQL 1.1 query language. W3C recommendation, 2013. URL: https://www.w3.org/TR/sparql11-query/.
  • [24] Aidan Hogan. The Web of Data. Springer, 2020. doi:10.1007/978-3-030-51580-5.
  • [25] Aidan Hogan, Cristian Riveros, Carlos Rojas, and Adrián Soto. A worst-case optimal join algorithm for SPARQL. In Proc. 18th International Semantic Web Conference (ISWC), pages 258–275, 2019. doi:10.1007/978-3-030-30793-6_15.
  • [26] Emil Eifrem Ian Robinson, Jim Webber. Graph Databases (2nd Edition). O’Reilly Media, Inc., 2015.
  • [27] Frank Manola and Eric Miller. RDF primer, 2004. W3C Recommendation.
  • [28] Sabrina Mantaci, Antonio Restivo, Giovanna Rosone, and Marinella Sciortino. An extension of the Burrows-Wheeler Transform. Theoretical Computer Science, 387(3):298–312, 2007. doi:10.1016/J.TCS.2007.07.014.
  • [29] Amine Mhedhbi, Amol Deshpande, and Semih Salihoglu. Modern techniques for querying graph-structured databases. Foundations and Trends in Databases, 14(2):72–185, 2024. doi:10.1561/1900000090.
  • [30] Amine Mhedhbi and Semih Salihoglu. Optimizing subgraph queries by combining binary and worst-case optimal joins. Proc. VLDB Endowment, 12(11):1692–1704, 2019. doi:10.14778/3342263.3342643.
  • [31] Gonzalo Navarro. Compact Data Structures – A practical approach. Cambridge University Press, 2016.
  • [32] Gonzalo Navarro and Josefa Robert. Compressed graph representations for evaluating regular path queries. In Proc. 31st International Symposium on String Processing and Information Retrieval (SPIRE), pages 218–232, 2024. doi:10.1007/978-3-031-72200-4_17.
  • [33] Thomas Neumann and Gerhard Weikum. The RDF-3X engine for scalable management of RDF data. The VLDB Journal, 19(1):91–113, 2010. doi:10.1007/S00778-009-0165-Y.
  • [34] Bryan B. Thompson, Mike Personick, and Martyn Cutcher. The bigdata® RDF graph database. In Linked Data Management, pages 193–237. Chapman and Hall/CRC, 2014.
  • [35] Todd L. Veldhuizen. Triejoin: A simple, worst-case optimal join algorithm. In Proc. 17th International Conference on Database Theory (ICDT), pages 96–106, 2014. doi:10.5441/002/ICDT.2014.13.