BWT Indexes for Optimal Joins in Graph Databases
Abstract
Graph databases represent data as a labeled directed graph, where the labels refer to properties that connect the entities represented by their source and target vertices. Queries feature, most prominently, sets of edges where source, target, and/or label can be variables; each instantiation of the variables where all the edges occur in the graph is a solution to the query. Worst-case-optimal algorithms to solve those queries have been devised, but they pose significant space requirements. This overhead has hindered the adoption of worst-case-optimal algorithms in real systems. We show that a representation of the graph based on the extended BWT (eBWT), where each edge is seen as an independent string of length 3 (source, label, target) supports worst-case-optimal algorithms while using almost no extra space on top of the raw data. We then show how the idea is generalized to the relational model, where the strings can be longer than 3 and several eBWTs are needed to obtain worst-case optimality. The aim to minimize the amount of space in that case leads to consider novel eBWT variants, where columns other than the last can be chosen. Finally, we show how the same graph representation can be used to solve other typical queries, like finding graph paths that match regular expressions.
Keywords and phrases:
Graph databases, Ring index, extended BWT, compact data structuresFunding:
Gonzalo Navarro: Fondecyt Grant 1-230755.Copyright and License:
2012 ACM Subject Classification:
Theory of computation Data structures and algorithms for data management ; Theory of computation Database query processing and optimization (theory) ; Theory of computation Data compressionFunding:
ANID, Chile, via Millennium Science Initiative Program – Code ICN17_002.Editors:
Paolo Ferragina, Travis Gagie, and Gonzalo NavarroSeries and Publisher:
Open Access Series in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
Graph databases (GDBs) have become increasingly popular [29, 24, 26], as they enable the storage of unstructured information repositories that emphasize the relations between entities in applications such as knowledge graphs, the semantic web, social networks, transaction networks, and communication networks, among others. A number of GDB management systems and prototypes have been introduced [15, 33, 34, 1, 2, 25, 30], along with models and query languages [3, 19, 23]. The efficient implementation of GDBs, however, still faces various challenges [8], such as the efficient implementation of the key algorithms to solve the most relevant queries on GDBs. To achieve the needed efficiency, GDBs require indexes that tend to demand a lot of memory. For example, a common type of query on GDBs are basic graph patterns, which search for subgraphs that are homomorphic to a given query graph. These types of queries can be translated into the relational join of many tables, which is known to be computationally demanding. Research on efficient algorithms to compute those joins has led to the concept of worst-case optimal (wco) join algorithms [10], which offer certain guarantees of running time in a worst-case scenario. These wco algorithms, however, generally demand strong indexing schemes, which entails a significant use of memory. Leapfrog Triejoin (LTJ) [35] is one of the most popular wco join algorithms, and has proven to be efficient in practice [35, 25]. However, its high memory consumption makes it less viable in situations of limited memory or when the graphs are excessively large. This high memory usage comes from the need to store indexes for 6 different “permutations” of the graph database (a concept that will be clear soon) in order to work properly.
With other coauthors, we introduced the Ring [7, 5] as an alternative to reduce the excessive memory usage of LTJ. The Ring is able simulate the 6 permutations needed by LTJ (and the corresponding indexes) by storing only one of them. This allows implementing LTJ using little additional space on top of that used by the database itself, thereby opening the use of wco algorithms to many more applications. We originally conceived the Ring in terms of the BWT [13]. We presented it to a database audience aiming to simplify the stringology aspects as much as possible [7]. Our presentation was closer to that of the Permuterm index [18], which regards the strings as circular, and simplified it to the case of strings of length 3. In the journal version [5] we managed to get rid completely of the stringology concepts, which are difficult to grasp in other areas, and presented the Ring in terms of stable sortings.
In this paper we present the Ring to a stringology audience, and thus properly formalize its concepts in terms of the more elegant extended BWT (eBWT) [28]. The generalization of the ring to relational tables leads to the definition of eBWT variants that are new, to the best of our knowledge.
2 Basic Concepts of Graph Databases
2.1 Graph databases
A graph database is an edge-labeled graph modeled as a finite set of triples where, for simplicity, will represent a finite and totally ordered set of constants. We will stick to the Resource Description Framework (RDF) model [27], as it is simpler than others and sufficient to present the important concepts. In RDF, each triple represents the directed edge , connecting vertex to vertex , with being the edge label. The terms subject, predicate, and object are used to refer to , , and , respectively. The number of edges (triples) in is denoted by . The alphabet for the vertices of graph is defined as , while represents the alphabet for the edge labels. We assume , implying that no edge label is used as a vertex in . The domain of graph is given by , so we can safely assume . Figure 1 illustrates a sample graph, which will be used as our running example. The figure also presents a particular enumeration of the constants in , as well as the corresponding set of graph triples.
2.1.1 Basic Graph Patterns
Let denote an infinite set of variables such that . The simplest way of querying an edge-labeled graph is through triple patterns, which are tuples . The solutions to a triple-pattern query involve all potential assignments of the variables within the pattern to constants in such that the resulting triple exists in . This allows for querying all graph edges (using variables , , and ), all edges with a vertex as the subject (where is a constant and and remain variables), all edges with label ( is constant while and are variables), and all objects related to subject by label ( and are constant while is a variable), among other queries. When viewing as a ternary relation in the relational model, triple patterns are analogous to equality-based selections.
Subsequently, triple patterns can be used to formulate more complex queries, such as basic graph patterns (BGPs). A BGP represents a finite set of triple patterns. Let be the set of variables in a BGP . The evaluation of upon a graph is defined as set of mappings , referred to as solutions, where denotes the result of substituting each variable in with . Some examples of BGPs on the graph of Figure 1 are as follows:
-
, for , which looks for all Nobel Prize winners. In our example graph, contains the instantiations , , and .
-
, for , which looks for all pairs of values and such that advised and both and won the Nobel Prize. In our example graph, contains a single instantiation, namely and .
Mapping:
| 1: Bohr | 5: Nobel | |
| 2: Thomson | 6: adv | |
| 3: Thorne | 7: nom | |
| 4: Wheeler | 8: win |
| (4, 6, 1) | (5, 8, 1) | |
| (3, 6, 4) | (5, 8, 2) | |
| (5, 8, 3) | (1, 6, 2) | |
| (5, 7, 4) |
2.1.2 Regular Path Queries
Apart from BGPs, the second major component of most graph query languages are the so-called Regular Path Queries (RPQs). An RPQ essentially specifies a regular expression on the alphabet , and looks for the paths in the graph where the concatenation of the edge labels belongs to the language denoted by the regular expression. An RPQ may also fix the initial and/or the final nodes of the desired paths. The answer to the query are the extremes of all matching paths, in the form of pairs of nodes . Additionally, the regular expression may use arrows in reverse order, by specifying a “reversed” version of the symbols . This can be handled by adding an edge per graph edge , so we omit this feature in the sequel.
For instance, the RPQ , where is a variable, looks for all academic descendants of Thorne in the graph of Figure 1 (the regular expression denotes all the nonempty repetitions of the symbol ). Its solutions are (because of the path ), (because of ), and (because of ). Similarly, , for variables and , looks for all instantiations of these variables such that is an academic descendant of .
2.2 Worst-Case-Optimal joins
Joins are some of the most resource-intensive operations in relational algebra, making their efficient computation essential for the performance of database systems. A sloppy implementation might end up computing the whole cross product of the involved relations, wasting resources. For a join query and a database instance , a join algorithm enumerates , the solutions for over . In order to formalize the study of efficient join algorithms, Atserias et al. [10] introduced the concept of the AGM bound (named after its authors). Consider a natural join query on a database instance , which comprises a collection of relations. A natural join can be seen as the subset of the Cartesian product where the values in the common attributes coincide (and only one column per common attribute is preserved). The AGM bound of over , denoted , is the highest possible number of tuples that can result from evaluating on any database instance that includes, for each relation in , a relation with the same attributes as and such that .
A join algorithm is termed worst-case optimal (wco) if its running time is bounded by , possibly multiplied by polylogs on (the number of tuples in ) and factors independent of (such as or the number of attributes in ). Although originally conceptualized for relational databases, this idea has also been extended to include graph databases. Even though BGPs are more general than natural joins, it is still possible to apply the AGM bound by considering each triple pattern in a BGP as a relation comprising the triples that match its constants [25].
2.3 Leapfrog TrieJoin
Next, we outline the Leapfrog Triejoin algorithm [35] (LTJ, for short), which is arguably the most commonly used wco join algorithm in practice. In order to work properly, LTJ requires the graph database to be structured using tries. The idea is to represent each triple (graph edge) storing its components in a trie following a particular order (e.g., in spo order). Indeed, to achieve worst-case optimality, all possible permutations of the tuples need to be maintained, resulting in the storage of 6 tries (the rationale for this requirement will be explained subsequently). We will call , , , , , and these tries. Figure 2 shows the 6 tries corresponding to our running-example graph of Figure 1.
Let be a BGP, where is a triple pattern, for . Let be the set of variables of . The distinctive strategy of LTJ is known as variable elimination. This method involves iterations, each focusing on a single variable from to be “eliminated” from the join process. The particular order in which variables are eliminated determines a total order of , which is called a Variable Elimination Order (VEO, for short). Throughout this process, each triple pattern is treated as a ternary relation that participates in the join. Once a VEO is established for , each triple pattern will have a particular trie associated with it. The trie should first traverse the constants in , and then traverse the variables in in an order that is consistent with that of the VEO. This is why LTJ needs to store 6 tries. To illustrate this, consider the query , for . If the VEO is , then the first triple is associated with trie , the second with , and the third one with . If, instead, the VEO is , then the tries are , , . If some of these 6 tries were not stored, then some VEOs would become unmanageable.
The navigational operations that must be supported at any trie node of a trie are:
-
: moves to the root node of trie .
-
: descends to the child of node labeled in trie .
-
: yields the smallest symbol that labels a child of node in trie .
LTJ starts at the root of each and descends by the children that correspond to the constants in triple . Next, we proceed to the variable elimination phase, assuming a VEO . For , let be the set of triple patterns that contain variable . Starting with the initial variable in the VEO, algorithm LTJ first identifies all values such that for every , substituting with in results in a non-empty evaluation of the modified triple pattern over (indicating that there are potential solutions to where equals ). If the trie related to is consistent with the VEO, then its current node’s children precisely represent the suitable values for .
Throughout the execution, we maintain a mapping with the solutions of . When a suitable value is found for , we bind to , resulting in , and branch on this particular value . In this branch, we go down by in all the virtual tries such that . Next, the same process is applied to , determining suitable values for , and extending the mapping to . This process is repeated for the remaining variables in the VEO. Once all variables have been bound in this way, effectively contains a solution for ; this occurs multiple times due to the branching for each binding of , , and so forth. When all bindings for a variable have been considered, LTJ backtracks and moves on to the next binding for . Upon completion of this process, the algorithm has reported all possible solutions for .
Determining the values , , and so forth, involves finding the intersection among the child nodes of the current nodes across all tries , for each . The algorithm LTJ carries out this intersection using the primitive to implement Barbay and Kenyon’s intersection algorithm [11]. Veldhuizen [35] proved that if the operation runs within polylogarithmic time, then LTJ is wco regardless of the VEO chosen, provided the tries have an appropriate attribute order.
We next illustrate how LTJ handles the query using the VEO . The triple patterns in are associated with tries , , and , respectively. We set and begin by processing the constants and , identified as 6 and 7 in our example enumeration, descending to the respective nodes in and . We then handle the variables following the VEO. Starting with in the second and third triple patterns of , we intersect the children at the current nodes of both tries (the node corresponding to string 7 in and the root of trie ). The only common child is 5, so we descend to those nodes in both tries, setting . We now proceed to , in the first and third triple patterns. We intersect the children of nodes for string 6 in and 5 in . The first intersection element is 1, leading us to descend in both tries and to update , and to proceed to in the first and second triple patterns. We intersect the children of nodes corresponding to string 61 in and in . Since 4 is a common value, we descend to the appropriate nodes and update . Lastly, for , present only in the third triple pattern, no intersection is required: all children of the node for string 51 in are valid for . The only value is 8, so we set . Having bound all variables, is a solution to . According to our enumeration, these values translate to , , , , resulting in the set of triples in the graph. Finally we backtrack, removing from and returning to variable , and repeat the process until all bindings for the variables of are found. In this example, this was the only answer for .
3 The Ring Index
In the sequel we show how the Ring implements the functionality needed to implement the LTJ algorithm (see Section 2.3) over a different data representation: the Ring index [7, 5].
3.1 The circular strings model
While LTJ is classically presented as running on tries, we take a more abstract view and consider the database triples as circular strings [12], where we regard all the strings , , and as the same. We formalize circular strings using the concept of rotation.
Definition 1.
A rotation of a string is any , for .
Definition 2.
The circular string (or c-string) is defined as the set of the rotations of .
In particular, ; note that . Note that, because the alphabet of the predicates is disjoint from that of subjects and objects, the three rotations of Def. 2 must be different in our case (because is at a different position). Therefore, there is a one-to-one correspondence between triples and c-strings.
The graph database is then seen as a set of c-strings, one per triple, and we identify each trie node with the set of c-strings corresponding to the triples that descend from . Figure 3 shows the c-strings corresponding to three sample nodes of tries and .
For what follows, we will create two disjoint copies, and , of the alphabet , in order to distinguish subjects from objects in c-strings. We will call the alphabet of predicates for consistency. Thus, triples will belong to . Further, the three sub-alphabets will form disjoint lexicographic ranges. In our running example, we separate the alphabets by adding to the components of each triple in the graph, while the components are increased by . Let us also redefine triple patterns.
Definition 3.
A triple pattern is an element of , where is a special symbol. We say that the elements are unbound and the others are bound. A triple matches a triple pattern iff or for every .
Every trie node then corresponds univocally to a triple pattern , where the edges that lead from the root to define the bound components of . The node then represents all the database triples that match . Figure 3 also shows the triple patterns corresponding to three nodes of tries and .
For our purposes, a key property of the characterization as circular strings is the following.
Theorem 4.
The bound components of any triple pattern can be concatenated into a string such that a triple matches iff prefixes exactly one string of its c-string . The prefixed rotation of is unique if .
Proof.
Let us consider all the possibilities for .
If matches and the latter has one bound component, then is univocally , , or , and it prefixes , , or , respectively. It prefixes only one of those because the triple components have disjoint alphabets. Conversely, if and prefixes , , or , then matches .
If there are two bound components, then we can form , , or , which prefixes , , or , respectively. Again, prefixes exactly one of those. Conversely again, if and it prefixes , , or , then matches . Note that the other choices, , , and , do not prefix any suffix.
If all the components are bound, then three choices of prefix (and equal) a string of for the only triple that matches . Finally, if there are no bound components, then prefixes every string, and conversely every triple matches .
Building on this result, it will suffice to represent tries , , and . Our plan is to identify every trie node with a set of rotations, which except for the root will contain exactly one rotation of each c-string whose triple matches . By Theorem 4, the rotations are exactly those that start with a given string , so descends from the root by . The way to efficiently represent that set of suffixes is considered next.
For example, consider the trie on the right of Figure 3. Its rightmost node descending by has and . The node represents the single c-string , and contains exactly one rotation of that c-string, 835, which starts with .
3.2 The extended Burrows-Wheeler Transform
The extended BWT (eBWT) [28, 14] is an extension of the Burrows-Wheeler Transform (BWT) [13] to a set of strings. It builds on the concept of omega-order [28] on strings, which is defined as follows.
Definition 5.
An infinite string over alphabet is a mapping . The prefix of the infinite string is the finite string . The suffix of the infinite string is the infinite string such that .
Definition 6.
The -extension of a string is the infinite string formed by concatenations of strings , that is, .
Definition 7.
Given a string , let be the maximum value such that for some string ; we then say that and . The omega-order, denoted , between strings and is then defined as follows: iff
-
and , or
-
, where is the lexicographic order.
Note that, if and are not a proper prefix of the other, then coincides with the lexicographic order, but it can differ otherwise.
Definition 8.
Given a set of strings , the eBWT is a permutation of the symbols in the strings given by listing the rotations of all strings in omega-order, and then concatenating the last symbols of each rotation.
We are now in position to define the Ring index.
Definition 9.
The Ring index of a set of triples is the eBWT of the set , after properly separating the alphabets into , , and .
Because of the alphabet separations, it follows that the rotations of all the c-strings have and ; therefore the omega-order boils down to the lexicographic order between their -extensions. Further, because of our alphabet separation, such order boils down in turn to the plain lexicographic order of the strings. The Ring’s eBWT then lists all the strings of all c-strings in the graph, sorts them lexicographically, and collects their last symbols.
Figure 4 (left) shows all the strings corresponding to the edges of the graph from Figure 1. Since in this case, we sum to the component of each triple, and 16 to the components, and thus , , and . On the right, the triples are lexicographically sorted, obtaining the corresponding Ring index as the last column of the table.
The following theorem shows that we can identify each node of the three chosen tries with a range of the eBWT of the Ring index.
Theorem 10.
There is a distinct interval in the Ring representing the triples that descend from each node of , , and .
Proof.
If is the root, we can define its range as the whole eBWT. Otherwise, by Theorem 4, there is a string for the triple pattern that prefixes exactly one string of the c-string , or equivalently a rotation of , for each triple matching . By Def. 9, the three rotations of are listed in the Ring in lexicographic order, and thus those prefixed by form a single range. Those ranges are unique once we define how is formed. As seen in Theorem 4, there are multiple choices to form only when , that is, when is a triple. In this case it corresponds to just one c-string and the singleton range can be arbitrarily fixed to be that of any of its three rotations.
Consider, for instance, the node corresponding to in of Figure 2, whose subtree includes the triples , , , and . This node is represented by the interval in the Ring shown in Figure 4 (right side), highlighted with the larger red box, which contains exactly the same triples mentioned before (with components and shifted accordingly, as we have already explained). On the other hand, interval (highlighted with the smaller red box) corresponds to the node for triple pattern in the trie of Figure 2. Actually, notice that interval in this sample Ring corresponds to the root of trie , interval is associated with the root of trie , and interval represents the root of trie . Each subinterval within these regions corresponds to nodes in the respective tries.
Roughly speaking, the Ring index then maps the trie operations to operations on the eBWT of the triples. In the sequel we show precisely how this is done.
3.3 The eBWT as a representation of the graph
The following definition will help us discuss how the eBWT works on the graph.
Definition 11.
Let be a string where for all . We then say that the order of is the string .
In our case, the order of is spo, the order of is pos, and the order of is osp. Because the three alphabets form lexicographic ranges, eBWT contains a range where all the listed rotations are of order spo, and thus the eBWT contains only symbols of , a second range where all the rotations are of order pos and the eBWT contains only symbols of , and a third range where all the rotations are of order osp and the eBWT contains only symbols of . We store those three segments of the eBWT, respectively, as strings, called columns, , , and (see the different ranges and colors in the last column in Figure 4), and say that their orders are spo, pos, and osp, respectively.
Let us first show that our representation allows accessing any desired triple starting from its position in any of the columns . For example, to retrieve the triple represented at , we know immediately that . Now, by the classic BWT invariant [13, 16] (which holds true in the eBWT [28, Prop. 10]), we find the entry corresponding to with
| (1) |
where counts the number of times symbols smaller than occur in , and counts the number of times occurs in . With we obtain , compute the position of the triple in with
and complete the process with . The eBWT invariants imply that
that is, we cycle on the string: the eBWT effectively regards the strings as circular and this permits retrieving the triples starting from any of their components.
See Figure 5 for an example of this process. In this case, we have , and hence . So, . Then, , so . Then, . So, we get back to since . The corresponding triple is .
The rationale of (say) Eq. (1) is the following. corresponds to the order spo and to the order osp. The second order is obtained from the first by stably reordering the tuples by their o component, in accordance with the lexicographic order. In a stable reordering of spo, position will be moved to position such that (i) all the symbols less than come before (and those are counted in ), and (ii) all the symbols preceding the one at also come before (thus we add ). Figure 5 also shows one of those reorderings, from spo to osp.
Because the shifted symbols are stored in different strings , the Ring remaps their alphabets back to . The nodes that are both subjects and objects use the same identifier in both and , in the range . Also, the Ring stores the arrays using bits, whereas function is computed in time using a wavelet tree [22] representation of the columns , just like a typical FM-index implementation [17]. Overall, it represents a graph database of triples over universe using bits (logarithms are to the base 2), and can retrieve any triple in time . This is almost the same space needed to store the triple identifiers in plain form. What is striking is that, within this space, the Ring can implement all the operations required by the LTJ algorithm, as we see next.
3.4 Implementing the operations on the eBWT
Having separate columns, it is more convenient to identify the root of a trie starting with symbol of alphabet as the whole range . This implements operation depending on the next triple component we will descend by.
Now assume we are in a range of the eBWT (that is, a range of , , or ) that corresponds to a trie node . Recall that this means that the range is that of all the rotations that start with . Let us show how to simulate the trie operations and on this representation.
The “easy” case
In the “easy” case, is represented as and the children of belong to the alphabet . We can then use an extension of the functions to compute the eBWT range corresponding to a desired child of . For example, if we are at and want to descend to , then we move from the range , which represents , to the range , which represents . This is the range of all the rotations starting with . We can now descend to with the famous backward search formula [16], which essentially computes the range of all values for all where :
| (2) | |||||
so that now is the range of all the rotations starting with , and represents . If we further descend to , the analogous formulas will boil down to a single entry of corresponding to the trie leaf representing the triple (in this case, using the order pos).
For instance, assume that we want to go down by and then by . We start from the range , corresponding to . Going down to the child labeled corresponds to restricting the range to . Then, descending by from this node corresponds to moving to the range , with and . The range then corresponds to the range of a node in the trie , which represents the triple pattern . Note that, although we do not represent the trie , we can find a node in some trie that corresponds to the same set of c-strings.
Operation is slightly more complex, and resorts to the geometric capabilities of the wavelet tree. In the “easy case” again, node is represented by a range and belongs to ; for example we are in and is an object. We then use the wavelet tree ability to compute, in time, the least value not smaller than in a range of the string it represents [21].
The “hard” case
If is the root and we descend by triple component x, we are always in the “easy” case, because we can use as . It is also always the easy case if has two bound components (or, equivalently, has depth 2): given any length-2 prefix of the eBWT rotations, their range is in , where x is the unbound component of . For example, if , then its eBWT range is within . The only “hard” cases occur when exactly one component is bound and we are forced to move “forward”, instead of “backward” as Eq. (2) does: if only s is bound in (the cases of p and o are analogous), then the range of is within order spo, and then the representation of is of the form . We can then descend by using Eq. (2), but we cannot descend by in this way. Rather, we must restrict the range so that it goes from representing all the rotations starting with to representing all the rotations starting with ; the order is still spo.
To descend in this hard case, we reorder the string : we descend from the root by , and from the resulting interval of all the rotations starting with , we descend by to the interval of the rotations starting with .
For instance, assume that we have instantiated , which corresponds to in our example Ring. This range is depicted by the larger box in the Ring of Figure 4. Next, suppose that we instantiate and want to go down in the corresponding trie. The resulting range is represented by the smaller box in Figure 4. Notice that in this case we remain at (i.e., in trie ) after descending with (unlike the “easy” case, where we jump to a different trie after going down). However, using to go down by is not feasible since we are currently in . Instead, we descend from to , and then proceed with to obtain the desired range , with and .
Finally, we implement in the hard case as follows. Consider the same case above, where only is bound, is represented by , and . We start from , which is the range of all the rotations starting with predicates . We then map that range using an analogous of Eq. (2):
and obtain , the range of all the rotations starting with and following with some . The first element of that range, , is a triple with the desired minimum value . We then obtain by traversing the circular string as described in Section 3.3.
4 Rings in Higher Dimensions
The theory of worst-case-optimal join algorithms was initially developed for the relational model, and only then adapted to graph databases [25], where BGPs are modeled mainly as multijoins on a single table of three columns. The space problem is even more serious for tables with more than three attributes: a table with attributes (which we will say to be of dimension ) needs tries, each storing the equivalent of a copy of the database, in order to implement worst-case-optimal joins using LTJ. This, of course, makes the technique unaffordable even for modest amounts of attributes! A natural question is: can we extend the Ring to dimensions so as to use space proportional to the raw data?
The answer is, unfortunately, no: the key Theorem 4 does not hold for higher dimensions. Already for , let be the attributes. No matter how we choose the c-strings (say, ), there are subsets of attributes for which do not prefix any rotation of (say, or ).
It is not hard to see, however, that two c-strings, like an , suffice for Theorem 4 to hold on , in the sense that any string , in some convenient order, prefixes some rotation of or of . Compared to storing tries, storing just two eBWTs yields an enormous space reduction!
The Ring simulation of Section 3 can be extended to having two or more eBWTs, by jumping from one to another as needed. The following definition captures what is needed from the chosen set of c-strings for such a simulation to work.
Definition 12.
A string is a substring of a c-string iff it is a substring of . If , this is equivalent to being a substring of some rotation of .
Definition 13.
A set of strings over alphabet is complete for dimension iff, for every and , can be ordered to form a string such that or are a substring of some .
The elements of are strings of attributes, for example is complete for and is complete for . A Ring for dimension is built on a complete set of strings of length . For each data tuple and each , we extract the string (a permutation of the tuple). For example, with and , we extract from the tuple the string . With and , we extract from the tuple the strings and . The Ring consists of one eBWT for each string , which indexes the rotations of all the c-strings .
Consider a trie node representing a tuple pattern , now of elements. If is complete, then by definition it is possible to concatenate the bound elements of in some order into a string that prefixes all rotations in the eBWT of some . We note that there could be various ways to concatenate the bound attributes of to form strings so that they prefix rotations in different eBWTs.
We can then descend from to , with , in two ways. The “easy” one, where is represented by a range in , uses the analogous of Eq. (2). The “hard” case discards the current range and recomputes the result from scratch: it forms such that its order of attributes , preceded or followed by x, occurs in some (this works when is complete). It then retraverses the tries from the root following the path of or , using the analogous of Eq. (2) at each step.
Suppose, with Rings for spog and sopg, that we want to go down using , then , then , and finally . Initially, without further information, we may descend by from the root of the trie , obtaining a range of the spog Ring. Then, when it comes to go down by , we note that this is not possible on this Ring, so we move to the sopg Ring: we start with and continue by using the “easy” case, reaching a node in the trie (i.e., a range in the sopg Ring). To proceed with , we can remain in the same Ring, yet it corresponds to the “hard” case, because is to the right of in the order sopg. This can be solved with “easy” cases, either by re-running , , and on the same sopg Ring, or using , , and if we move to spog. In both cases, the final descent by corresponds to an easy case.
To implement , there are also two possibilities. In the easy case, we proceed on the wavelet tree exactly as explained (for the easy case) in Section 3. In the hard case, there are two subcases:
-
We can form whose order is such that occurs in some . In this case, we retraverse the tries from the root to find , and finish as in the easy case. (Note that we were already in some range for the unbound attributes of , but not in one from where we could descend by .)
-
We can form whose order is such that occurs in some . In this case, we start from the range in the corresponding eBWT, and descend by from it.
We then track the first cell in the resulting range, using , back to to find the answer Because of the hard cases, we simulate each or operation in time , which still yields a worst-case-optimal query algorithm, as is independent of .
Now that we know how to use multi-eBWT Rings to handle higher dimensions, the question is: how many eBWTs (i.e., strings in ) are needed to be complete in dimension ? Arroyuelo et al. [5] studied this problem, showing that eBWTs (instead of tries) are necessary and sufficient for , and eBWTs (instead of tries) are necessary and sufficient for . The optimal number for is shown to be between 10 and 12, and for is between 21 and 25 (they obtain various lower bounds). In asymptotic terms, they prove that the number of eBWTs needed is and . These results make worst-case-optimal algorithms much more feasible for relational databases.
5 Order Graphs
A way to visualize complete sets is what are called order graphs [5].
Definition 14.
An order graph of dimension has, as nodes, all the permutations of , and has directed labeled edges of the form .
In the order graphs that model complete sets of attribute strings, is always of the form . For example, the order graph for consists of the cycle
recall the example in Figure 5, where we can see how one can move from the spo order to osp using , then to pos using , and then back to spo using . The order graph for consists of the two cycles:
Each edge of the order graph implies that the Ring stores a column over the alphabet , so that the function maps from order to order . That is, applied on a position of the source order leads to the corresponding position in the target order. Backward search, as in Eq. (2), also maps from the first to the second order. The cycles arise because the eBWT, by definition, always stores the last column of its rotations.
An alternative way to understand Def. 13 is that, for every (in some order) and x, there must exist a directed path labeled or in the order graph: we can go from the source to the target node using or backward search and obtain the desired cell or range in the order of target node. This leads to a definition of complete order graphs.
Definition 15.
An order graph of dimension is complete iff, for every and , can be ordered to form a string such that concatenating the labels over some directed path in the graph one obtains or .
The definition of order graph, however, opens up new alternatives that are not possible under the eBWT framework, namely when is not of the form in Def. 14. Consider the list of the rotations listed in lexicographic order by the eBWT, and regard the possibility of storing another column of the rotations, not necessarily the last. For example, if we take the order and store the third column, the corresponding function on that column will stably reorder the rotations by the attribute , so that the ordering obtained is (recall the rationale of given after Eq. (1)). Backward search also works correctly using the analogous of Eq. (2). Such a column corresponds to a graph edge , and the order graph is not anymore a set of cycles of length . Figure 6 shows how this reordering by another column operates on our example graph of dimension .
While complete sets of c-strings yield complete order graphs, there are complete order graphs that do not correspond to complete sets. For , for example, we have a minimal complete order graph that consists of a single cycle of length 8, instead of two cycles of length 4 (for both, the index stores 8 columns ):
More curious optimal complete order graphs for exist; Figure 7 shows two of them.
A natural question is whether complete order graphs can yield smaller Rings than complete sets. The answer is indeed affirmative! For , there exists a complete order graph formed by a single cycle of 20 edges [5], whereas the smallest complete set requires 5 eBWTs: seen as an order graph, that amounts to 5 cycles of length 5, that is, 25 edges. The length-20 cycle is known to be a minimal complete order graph for . In general, however, it is unknown whether some minimal complete order graph is always a single cycle, or a set of cycles. It is only known that some minimal complete order graph is always formed by a set of cycles with trees possibly sprouting from the nodes [5]. That is, every node must have indegree 1. As it can be inferred from this discussion, no effective ways to build minimal complete sets of c-strings or order graphs for a dimension exist; we have resorted to exhaustive search.
6 Regular Path Queries
Assume an RPQ such that the initial node is fixed. A classic solution to this problem is to build the nondeterministic automaton of the regular expression and traverse the graph from , in DFS or BFS order. As we traverse an edge , we feed the automaton with symbol , and abort the branch if the automaton runs out of active states. We must also record the active automaton states with which we had already visited each graph node, to avoid repeating the same nodes with the same states (i.e., we deactivate the repeated states in this second visit). We report for every node where we arrive with a final automaton state activated.
The other cases are solved analogously: if only the final node is fixed, we can reverse the regular expression the symbols (i.e. convert every with and vice versa). If both are fixed, we can start from either of those and stop as soon as the other node is found with a final state. If none is fixed, we start a search from every possible value of .
6.1 Simulation with the Ring
This process can be simulated with the Ring without using additional data structures. In this case, it is easier to start assuming that only is fixed, and derive the other cases analogously, so we use the reversed automaton. Our first step is to descend from the trie root by , , which is represented by the range . We now analyze the automaton states to determine the symbols that would lead to active states. For each such symbol , we descend to using an analogous of Eq. (2), for predicate instead of for object . The resulting range, , contains all the distinct sources that lead to by symbol . Since there cannot be repetitions in this list, we extract each value of and iterate from it as the new source node, that is, setting .
Note that we do not need the Ring component for this simulation. Further, one can avoid duplicating the edges to support reversed symbols if one uses the “non-typical” case technique to handle them. Such a technique [32] was implemented over a bit-parallel simulation of the automaton, and shown to be very competitive in space and time with classic solutions. Further, it was shown that the used Ring components are equivalent to a known succinct representation of labeled graphs [31, Sec. 9.1.4]
6.2 Larger and smarter
An earlier Ring-based implementation [6], instead, does include the reversed edges, and thus it can always use the “typical” case formulas. This enables other optimizations that make the process faster. One possibility is that, instead of trying out the labels that are relevant for the automaton, one can enumerate the distinct labels that occur in , which could be less. The wavelet tree of can do this enumeration in logarithmic time per resulting symbol [21]. Further, the wavelet tree can be enriched with information on the automaton states that are sources of symbols in ranges of . Those can be intersected with the currently active states as we traverse the wavelet tree, so as to prune subtrees that cannot activate any state. This technique computes the intersection between the predicates that lead to active automaton states and those that leave from the current node, in optimal time, that is, proportional to the alternation complexity of both sets [11]. An analogous technique on the wavelet treee of avoids visiting sources of that have already been visited with all the currently active states.
Using even more space (but still several times less than classic solutions), one can implement even smarter traversals, in considerably less time [6]. For example, the regular expression can be cut at an edge with a label that is infrequent in the graph. From each edge , we launch a backward traversal from to nodes that activate the initial automaton state, and a forward traversal from to nodes that activate the final automaton states, and report the Cartesian product of both sets.
7 Perspective
We believe that the most important take-away message for the stringology audience is that there are algorithmic and combinatorial problems in the database community, particularly (but not only) in graph databases, that can benefit from approaches rooted in stringology: the LTJ algorithm works on tries, the Ring structure builds on the eBWT, and RPQs are an extended form of regular expression matching. We have also seen how combinatorial problems arise, like finding lower bounds to the number of Rings needed in dimension , finding optimal ones efficiently, or characterizing the shapes of optimal order graphs. There is also space for the use of compact data structures: the Ring is a non-redundant version of the six tries used by LTJ, and we have used wavelet trees to solve the more complex LTJ primitives, as well as to find better query plans [4]. Other remarkable uses of compact data structure are compressed quadtrees to solve BGPs on any dimension [9] and ordinal tree representations to extend BGPs with topological queries [20].
On the other hand, the application of stringology techniques to databases bounces back with new challenges and ideas. The most intriguing one is probably the idea of permuting the eBWT matrix using an arbitrary column, not necessarily the last one. Could this technique bring meaningful functionalities in other scenarios?
References
- [1] Christopher R. Aberger, Andrew Lamb, Susan Tu, Andres Nötzli, Kunle Olukotun, and Christopher Ré. Emptyheaded: A relational engine for graph processing. ACM Transactions on Database Systems, 42(4):20:1–20:44, 2017. doi:10.1145/3129246.
- [2] Waqas Ali, Muhammad Saleem, Bin Yao, Aidan Hogan, and Axel-Cyrille Ngonga Ngomo. A survey of RDF stores & SPARQL engines for querying knowledge graphs. The VLDB Journal, 31(3):1–26, 2022. doi:10.1007/S00778-021-00711-3.
- [3] Renzo Angles, Marcelo Arenas, Pablo Barceló, Aidan Hogan, Juan L. Reutter, and Domagoj Vrgoc. Foundations of modern query languages for graph databases. ACM Computing Surveys, 50(5):68:1–68:40, 2017. doi:10.1145/3104031.
- [4] Diego Arroyuelo, Fabrizio Barisione, Antonio Fariña, Adrián Gómez-Brandón, and Gonzalo Navarro. New compressed indices for multijoins on graph databases. CoRR, 2408.00558, 2024. doi:10.48550/arXiv.2408.00558.
- [5] Diego Arroyuelo, Adrián Gómez-Brandón, Aidan Hogan, Gonzalo Navarro, Juan L. Reutter, Javiel Rojas-Ledesma, and Adrián Soto. The Ring: Worst-case optimal joins in graph databases using (almost) no extra space. ACM Transactions on Database Systems, 29(2):article 5, 2024.
- [6] Diego Arroyuelo, Adrián Gómez-Brandón, Aidan Hogan, Gonzalo Navarro, and Javiel Rojas-Ledesma. Optimizing RPQs over a compact graph representation. The VLDB Journal, 33:349–374, 2024. doi:10.1007/S00778-023-00811-2.
- [7] Diego Arroyuelo, Aidan Hogan, Gonzalo Navarro, Juan L. Reutter, Javiel Rojas-Ledesma, and Adrián Soto. Worst-case optimal graph joins in almost no space. In Proc. 47th ACM International Conference on Management of Data (SIGMOD), pages 102–114, 2021. doi:10.1145/3448016.3457256.
- [8] Diego Arroyuelo, Aidan Hogan, Gonzalo Navarro, Juan L. Reutter, and Domagoj Vrgoc. Tackling challenges in implementing large-scale graph databases. Communications of the ACM, 67(8):40–44, 2024. doi:10.1145/3653314.
- [9] Diego Arroyuelo, Gonzalo Navarro, Juan L. Reutter, and Javiel Rojas-Ledesma. Optimal joins using compressed quadtrees. ACM Transactions on Database Systems, 47(2):article 8, 2022.
- [10] Albert Atserias, Martin Grohe, and Dániel Marx. Size bounds and query plans for relational joins. SIAM Journal on Computing, 42(4):1737–1767, 2013. doi:10.1137/110859440.
- [11] Jérémy Barbay and Claire Kenyon. Alternation and redundancy analysis of the intersection problem. ACM Transactions on Algorithms, 4(1):4:1–4:18, 2008. doi:10.1145/1328911.1328915.
- [12] Nieves Brisaboa, Ana Cerdeira-Pena, Guillermo de Bernardo, Antonio Fariña, and Gonzalo Navarro. Space/time-efficient rdf stores based on circular suffix sorting. The Journal of Supercomputing, 79:5643–5683, 2023. doi:10.1007/S11227-022-04890-W.
- [13] Michael Burrows and David Wheeler. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994.
- [14] Davide Cenzato and Zsuzsanna Lipták. A theoretical and experimental analysis of BWT variants for string collections. In Hideo Bannai and Jan Holub, editors, Proc. 33rd Annual Symposium on Combinatorial Pattern Matching (CPM), pages 25:1–25:18, 2022. doi:10.4230/LIPICS.CPM.2022.25.
- [15] Orri Erling. Virtuoso, a hybrid rdbms/graph column store. IEEE Data Engineering Bulletin, 35(1):3–8, 2012. URL: http://sites.computer.org/debull/A12mar/vicol.pdf.
- [16] Paolo Ferragina and Giovanni Manzini. Indexing compressed texts. Journal of the ACM, 52(4):552–581, 2005. doi:10.1145/1082036.1082039.
- [17] Paolo Ferragina, Giovanni Manzini, Veli Mäkinen, and Gonzalo Navarro. Compressed representations of sequences and full-text indexes. ACM Transactions on Algorithms, 3(2):article 20, 2007.
- [18] Paolo Ferragina and Rossano Venturini. The compressed permuterm index. ACM Transactions on Algorithms, 7(1):article 10, 2010.
- [19] Nadime Francis, Alastair Green, Paolo Guagliardo, Leonid Libkin, Tobias Lindaaker, Victor Marsault, Stefan Plantikow, Mats Rydberg, Petra Selmer, and Andrés Taylor. Cypher: An evolving query language for property graphs. In Proc. 44th ACM International Conference on Management of Data (SIGMOD), pages 1433–1445. ACM, 2018. doi:10.1145/3183713.3190657.
- [20] José Fuentes-Sepúlveda, Adrián Gómez-Brandón, Aidan Hogan, Ayleen Iribarra-Cortés, Gonzalo Navarro, and Juan Reutter. Worst-case-optimal joins on graphs with topological relations. In Proc. 34th International World-Wide Web Conference (WWW), 2025. To appear.
- [21] Travis Gagie, Gonzalo Navarro, and Simon J. Puglisi. New algorithms on wavelet trees and applications to information retrieval. Theoretical Computer Science, 426-427:25–41, 2012. doi:10.1016/J.TCS.2011.12.002.
- [22] Roberto Grossi, Ankur Gupta, and Jeff S. Vitter. High-order entropy-compressed text indexes. In Proc. 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 841–850, 2003.
- [23] Steve Harris, Andy Seaborne, and Eric Prud’hommeaux. SPARQL 1.1 query language. W3C recommendation, 2013. URL: https://www.w3.org/TR/sparql11-query/.
- [24] Aidan Hogan. The Web of Data. Springer, 2020. doi:10.1007/978-3-030-51580-5.
- [25] Aidan Hogan, Cristian Riveros, Carlos Rojas, and Adrián Soto. A worst-case optimal join algorithm for SPARQL. In Proc. 18th International Semantic Web Conference (ISWC), pages 258–275, 2019. doi:10.1007/978-3-030-30793-6_15.
- [26] Emil Eifrem Ian Robinson, Jim Webber. Graph Databases (2nd Edition). O’Reilly Media, Inc., 2015.
- [27] Frank Manola and Eric Miller. RDF primer, 2004. W3C Recommendation.
- [28] Sabrina Mantaci, Antonio Restivo, Giovanna Rosone, and Marinella Sciortino. An extension of the Burrows-Wheeler Transform. Theoretical Computer Science, 387(3):298–312, 2007. doi:10.1016/J.TCS.2007.07.014.
- [29] Amine Mhedhbi, Amol Deshpande, and Semih Salihoglu. Modern techniques for querying graph-structured databases. Foundations and Trends in Databases, 14(2):72–185, 2024. doi:10.1561/1900000090.
- [30] Amine Mhedhbi and Semih Salihoglu. Optimizing subgraph queries by combining binary and worst-case optimal joins. Proc. VLDB Endowment, 12(11):1692–1704, 2019. doi:10.14778/3342263.3342643.
- [31] Gonzalo Navarro. Compact Data Structures – A practical approach. Cambridge University Press, 2016.
- [32] Gonzalo Navarro and Josefa Robert. Compressed graph representations for evaluating regular path queries. In Proc. 31st International Symposium on String Processing and Information Retrieval (SPIRE), pages 218–232, 2024. doi:10.1007/978-3-031-72200-4_17.
- [33] Thomas Neumann and Gerhard Weikum. The RDF-3X engine for scalable management of RDF data. The VLDB Journal, 19(1):91–113, 2010. doi:10.1007/S00778-009-0165-Y.
- [34] Bryan B. Thompson, Mike Personick, and Martyn Cutcher. The bigdata® RDF graph database. In Linked Data Management, pages 193–237. Chapman and Hall/CRC, 2014.
- [35] Todd L. Veldhuizen. Triejoin: A simple, worst-case optimal join algorithm. In Proc. 17th International Conference on Database Theory (ICDT), pages 96–106, 2014. doi:10.5441/002/ICDT.2014.13.
