Incremental Reachability Index
Abstract
We study the reachability problem in append-only DAGs: given two nodes u and v, is there a path from u to v? While the problem is linear in general, it can be answered faster by using a precomputed index, which gives a compressed representation of the transitive closure of the graph.
Index algorithms are evaluated on three dimensions: the query time that the algorithm needs to answer whether there is a path from one node to another, the memory that the index uses per node, and the indexing time that is required to update the index when a node is added to the graph.
In this paper, we combine Jagadish’s static index [9] with Felsner’s online chain-decomposition algorithm [7] to create an incremental index: data associated with a node is immutable, guaranteeing that queries are answered properly even if new nodes are inserted while the query is processed.
Its query time is constant, but its index size is heavily dependent on the graph width, and as such is not competitive with recent indexing algorithms (2-hop, tree-chain, …). We also propose a version of that incremental algorithm with a much lighter index. In the most compressed version, the query time becomes O(log n). However, constant-time queries can be retained depending on the desired time/memory trade-off.
Keywords and phrases:
Directed acyclic graphs, reachability, append-only, index
2012 ACM Subject Classification:
Theory of computation → Dynamic graph algorithms
Editors:
Petra Mutzel and Nicola Prezza
Series and Publisher:
Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
The reachability problem in Directed Acyclic Graphs consists in deciding, given two nodes u and v, whether there exists a path from u to v. It can be solved in linear time with a depth-first search, visiting either all the ancestors of v or all the descendants of u before one can answer negatively. A common way to lower that complexity is to use a precomputed index that allows future queries to be answered much faster. Several solutions have been proposed for this problem relying on tree cover with interval labeling [1, 15], approximate transitive closure [14, 12], and 2-hop [5, 16, 11]. The relative quality of such algorithms is usually evaluated on three dimensions: the query time is the amount of time it takes to process a single reachability query; the memory is the amount of information stored in the index; the indexing time is the amount of time it takes to compute the index of a new node.
In this paper, we consider append-only DAGs, which can only grow by adding children, i.e. in topological order. This constraint on graph updates is implied for Merkle graphs, in which each node holds immutable content and is identified by a hash of its contents, including its parent identifiers. We propagate this constraint to the index by considering incremental algorithms, where the local index of a node does not change when new nodes are added. Formally, the index should be an append-only array, i.e. such that new entry insertions do not affect older entries. A small mutable manifest is acceptable, containing additional data (e.g., graph size). An index satisfying such constraints allows for a number of useful features. Incremental computation: the index can be accessed by multiple readers, concurrently with a writer inserting new nodes. In an incremental index the writer can freely compute index entries for new nodes and append them to the end of the index array; she then only needs to lock the manifest while it is updated to its new value. In particular, if the manifest is not necessary to answer queries (only for index updates, as is the case in our algorithm), then no lock at all is necessary. This allows the index to be partially computed and asynchronously updated. Cache management: the index is stored on disk and one or several clients may request index entries and keep responses in cache; with an incremental index these caches remain valid even after inserting a node. Cache fragments can be shared through a content-addressable storage. Transactional updates: in the case of a rollback, i.e. if the last inserted nodes need to be removed from the graph, only the manifest needs to be reverted to the previous point in time. The index, and cached entries or checksums if any, remain automatically valid for all remaining nodes.
An existing algorithm by Jagadish [9] offers a fast index, with constant-time queries, based on a chain decomposition of the graph. Its chain decomposition is only computed when the graph is complete, but it can be adapted to use the online decomposition by Felsner [7] to become incremental as long as nodes are inserted in topological order. The index size for this algorithm is, however, heavily dependent on the graph width, i.e., the largest set of pairwise unreachable nodes. In this form, it is not competitive with the many recent algorithms. We are not aware of any other incremental reachability index (recent improvements [4, 10] on [9] optimize the chain decomposition once the graph is complete, so they are not incremental).
Formalism for DAGs and Chains.
Definition 1 (Directed acyclic graphs).
A directed acyclic graph (DAG) is a directed graph G without cycles. If there is an arc (u, v) in G, we say that u is a parent of v and v is a child of u. If there is a path from u to v (possibly with u = v), we say that u is an ancestor of v, v is a descendant of u, and that u and v are comparable. A node without parents is a source and a node without children is a sink.
In the scope of this paper, we assume that the nodes of G are inserted one by one, in a topological order, that is, all nodes are sinks when they are inserted. Thus, each node is identified with its insertion rank (i.e. V = {1, …, n}). The topological order further implies that u < v for any arc (u, v) and that u < v for any pair of ancestor/descendant (u, v).
Definition 2 (Chains and anti-chains).
A chain of G is a set of pairwise-comparable nodes. An anti-chain is a set of pairwise-non-comparable nodes. A chain decomposition of G is a partition of its nodes into chains. The width of G, denoted w, is the size of its largest anti-chain.
Theorem 3 (Dilworth [6]).
The minimum number of chains in a chain decomposition of G is w.
The optimal chain decomposition of a graph can be computed in polynomial time [8]. However, in the incremental setting we consider here, a chain decomposition must be extended with each new node without editing previous chain assignments. This corresponds to the on-line chain decomposition problem described by Felsner et al. [7, 3]: in this setting, any algorithm yields a chain decomposition of size w(w+1)/2 in the worst case. Felsner also gives an algorithm producing a chain decomposition of size at most w(w+1)/2; see [3, Theorem 3.5] for a very concise description of this algorithm. The insertion time is dominated by the cost of O(w²) reachability queries performed in G (all using the inserted node). Besides the chain decomposition itself, this algorithm maintains a manifest of size O(w²) containing the last inserted vertex of each chain (it is only required for node insertions). Note that in our experiments (for graphs generated with our algorithm), we observe a linear dependency between the number of chains and the width w; see Figure 6 in appendix.
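To illustrate the on-line constraint, the sketch below maintains a chain decomposition greedily: it extends a chain whose current head is a parent of the new node, and opens a new chain otherwise. This is a deliberately naive stand-in for Felsner's algorithm (it carries none of the w(w+1)/2 guarantee, and it only checks parents rather than arbitrary ancestors); `greedy_online_chains` and `parents` are our illustrative names, not from the paper.

```python
def greedy_online_chains(parents):
    """Assign each node (inserted in topological order) to a chain.

    `parents` maps each node id to the list of its parent ids; nodes are
    processed in increasing id order.  Returns a dict node -> chain id.
    The `heads` dict plays the role of the manifest: the last inserted
    vertex of each chain."""
    chain_of = {}
    heads = {}
    for v in sorted(parents):
        for c, head in heads.items():
            if head in parents[v]:  # head is a parent, hence comparable with v
                chain_of[v] = c
                heads[c] = v
                break
        else:
            c = len(heads)          # no extendable chain: open a fresh one
            chain_of[v] = c
            heads[c] = v
    return chain_of
```

On a path the whole graph falls into one chain; two incomparable sources necessarily start two chains.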
Our contribution.
Used with the incremental chain decomposition by Felsner [7], the chain-based reachability index by Jagadish [9] offers an incremental index with O(1) query time and O(w²) memory per node. We propose an improved version of this algorithm based on two anchor strategies, yielding a much lighter index while retaining incrementality. The query time becomes O(log n) in the most-compressed version and can be reduced to O(1) depending on the desired time/memory trade-off. The theoretical memory upper bound remains O(w²) per node in the worst case.
We run experimental evaluations of our algorithm both on generated and real-life graphs. Compared to Jagadish [9], we obtain similar query times and reduce the index size by a factor close to 10. With these encouraging results, we compare our algorithm against a selection of non-incremental reachability indexes. In particular, we obtain query times similar to our implementation of 2-hop, with a competitive index size on real-life graphs and on generated graphs for all tested widths.
Outline of the paper.
Section 2 presents the chain index obtained by combining Jagadish’s index [9] with Felsner’s online chain decomposition [7]; we then build upon this algorithm in Section 3 to present our anchor-based index. We then describe the set-up for the experimental evaluation in Section 4, and the conclusions in Section 5. Additional detailed results are given in appendix.
2 Simple Chain Algorithm [9, 7]
Throughout this algorithm, we maintain a chain decomposition with a list of chain identifiers, and we use the term chain both for the chain identifier and the actual set of vertices. We store, for each node v, the identifier of the chain containing v. We write Chains for the set of chain identifiers.
Definition 4.
Given a vertex v and a chain C, the top vertex of C under v is the maximal ancestor of v in C, denoted top_v(C). We write top_v(C) = ⊥ if no such ancestor exists. Given a chain decomposition of G, the top set of vertex v is the set of pairs Top(v) = {(C, top_v(C)) : C ∈ Chains, top_v(C) ≠ ⊥}.
The core observation for the algorithm is the following: for any u ∈ C, u is an ancestor of v if and only if u ≤ top_v(C) (recall that node identifiers form a topological order on G). Indeed, the forward direction is clear since we took the maximal ancestor of v within chain C. The converse direction is obtained by transitivity: top_v(C) is an ancestor of v and, since u and top_v(C) are in the same chain C, u and top_v(C) are comparable, and u is an ancestor of top_v(C) since u ≤ top_v(C). The top vertex is thus sufficient to answer reachability queries in constant time for any vertex in C.
Description of the index.
Since we rely on the chain decomposition by Felsner [7], we need to store the Chains list in the manifest. Furthermore, for each node v, we store:
-
the identifier of the chain containing v;
-
the top set Top(v).
The manifest size is thus O(w²), and each index entry has size O(w²). The sets Top(v) can be stored either as sorted lists (for optimal storage) or as hashmaps (for optimal look-up). We pick the latter for the theoretical analysis, since the asymptotic memory requirement is identical. In practice, one can store the sorted lists on disk, use them directly for isolated reachability queries, or build a hashmap when queries are expected to be repeated for the same vertex.
Answering queries.
In order to test whether u is an ancestor of v, we look up in the index of v the entry top_v(C) for the chain C containing u, and return true if u ≤ top_v(C). Using a hashmap for Top(v), we get constant-time reachability queries.
Node insertion.
Upon insertion of node v, we first compute the combined top vertices of the parents of v, defined as T(C) = max {top_p(C) : p parent of v} for each chain C. With this array we can answer reachability queries for v in constant time. Indeed, for a node u in chain C, u is an ancestor of v if u = v or if u ≤ T(C), since any ancestor of v (beside v itself) is an ancestor of some parent of v. We can then assign a chain C_v to v with Felsner's algorithm [7], and compute Top(v) as follows: top_v(C_v) = v, and top_v(C) = T(C) for every chain C ≠ C_v.
The complexity for inserting v is thus dominated by the computation of T in O(deg(v) · w²). Note that for high-degree graphs (deg(v) > w²), this complexity can be reduced to O(w⁴) by considering only the top-most parent in each chain.
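Putting Section 2 together, a minimal Python sketch of the simple chain index might look as follows. The chain-assignment rule here is the naive "extend a chain whose head is an ancestor" heuristic rather than Felsner's algorithm, and all names (`ChainIndex`, `tops`, `heads`, the −1 sentinel for ⊥) are our own assumptions.

```python
class ChainIndex:
    """Sketch of the simple chain index: each node stores its chain and,
    per chain, the maximal ancestor in that chain (its top set).
    `chain` and `tops` are append-only arrays; `heads` is the manifest."""

    def __init__(self):
        self.chain = []   # chain[v] = chain id of node v
        self.tops = []    # tops[v] = {chain id: top vertex under v}
        self.heads = {}   # manifest: chain id -> last node of the chain

    def insert(self, parents):
        v = len(self.chain)
        combined = {}     # T(C): combined top vertices of the parents
        for p in parents:
            for c, u in self.tops[p].items():
                combined[c] = max(combined.get(c, -1), u)
        # extend a chain whose head is an ancestor of v, else open one
        for c, head in self.heads.items():
            if combined.get(c, -1) == head:  # head is the max node of c
                cv = c
                break
        else:
            cv = len(self.heads)
        self.heads[cv] = v
        combined[cv] = v  # a node is the top of its own chain
        self.chain.append(cv)
        self.tops.append(combined)
        return v

    def reachable(self, u, v):
        """Is u an ancestor of v?  A single dictionary lookup."""
        return u <= self.tops[v].get(self.chain[u], -1)
```

On the diamond 0→1, 0→2, 1→3, 2→3, the index answers all six ordered queries correctly in constant time.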
3 Our Anchor-based Algorithms
We can now present our anchor-based approach, aiming at reducing the index size to a minimum while keeping logarithmic queries. We present first the index itself, followed by two strategies for picking anchors, and finally some possible extensions for specific use-cases.
3.1 Index construction
We introduce the notion of anchor of a node v, denoted anchor(v). It can either be an ancestor of v other than v itself, or ⊥ when there is no suitable ancestor. As required by the incremental setting, the anchor is picked upon node insertion and cannot be changed afterwards. We discuss in the next section several anchor-picking strategies. We define the Anchor-List of v as the sequence AL(v) = (v) · AL(anchor(v)), with AL(⊥) = (). The anchor-depth of v, denoted depth(v), is the length of AL(v) (so depth(⊥) = 0). Then, the anchor is used to shrink the index, as we define the following restricted top set (see also Figure 2).
Definition 5.
Given a vertex v, a chain C and an anchor a, the restricted ancestor set of v with anchor a is RA(v) = {u ancestor of v : u is not an ancestor of a} if a ≠ ⊥, and the set of all ancestors of v otherwise. The restricted top vertex of chain C under v with anchor a is rtop_v(C) = max(RA(v) ∩ C) (rtop_v(C) = ⊥ if no such vertex exists).
Given a chain decomposition of G, the restricted top set of vertex v with respect to anchor a is the set of pairs RTop(v) = {(C, rtop_v(C)) : rtop_v(C) ≠ ⊥}.
Description of the index.
For each node v, we store the identifier of the chain containing v, the anchor anchor(v), and the restricted top set RTop(v).
Note that the worst-case index size per node remains O(w²). However, if the restricted ancestor set is small, then we also have a useful bound: |RTop(v)| ≤ |RA(v)|.
Answering queries.
In order to test whether u is an ancestor of v, we look up the chain C containing u and compute top_v(C) using the following recurrence (as before, u is an ancestor of v if and only if u ≤ top_v(C)): top_v(C) = rtop_v(C) if rtop_v(C) ≠ ⊥, and top_v(C) = top_{anchor(v)}(C) otherwise (with top_⊥(C) = ⊥).
The recurrence takes up to depth(v) iterations since in the worst case the whole anchor list of v needs to be visited, so the running time becomes O(depth(v)).
In case of successive queries with the same vertex v, values of top_v can be cached using temporary memory. Over q tests, only O(w²) different values of top_v need to be computed (at most one per chain), yielding an amortized time of O(1 + w² · depth(v)/q) per test, thus converging to O(1). Such a cache can also speed up consecutive tests with different vertices, since it is common for different vertices to share the same anchor.
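The query recurrence, with memoization standing in for the cache discussed above, can be sketched as follows. The index is passed as plain dictionaries, `None` plays the role of ⊥, and −1 is a sentinel smaller than every node id; all names are illustrative.

```python
from functools import lru_cache

def make_reachability(chain_of, anchor, rtop):
    """chain_of[v]: chain of v; anchor[v]: anchor of v, or None for ⊥;
    rtop[v]: dict mapping a chain to the restricted top vertex of v."""

    @lru_cache(maxsize=None)
    def top(v, c):
        # Walk the anchor list of v until chain c appears; restricted
        # tops take precedence (they dominate the anchor's ancestors).
        if v is None:
            return -1            # no ancestor of v lies in chain c
        if c in rtop[v]:
            return rtop[v][c]
        return top(anchor[v], c)

    def reachable(u, v):
        # u is an ancestor of v iff u <= top_v(C) for C the chain of u
        return u == v or u <= top(v, chain_of[u])

    return reachable
```

The `lru_cache` makes repeated queries on the same target vertex amortize, as each (anchor, chain) pair is resolved at most once.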
Node insertion.
As in the simple chain algorithm, upon inserting v, we assign a chain to v, compute the top sets of the parents of v and combine them into the top set Top(v). The bottleneck here is the computation of the sets Top(p), since each has size O(w²) and each top vertex takes time O(depth(p)) to recover, so we need O(w² · Σ_p depth(p)) in total (summing over the parents p of v).
We then compute the anchor anchor(v) of v (using Top(v) for constant-time ancestor tests with respect to v if necessary). Finally, we have rtop_v(C) = top_v(C) if top_v(C) is not an ancestor of anchor(v), and rtop_v(C) = ⊥ otherwise. This step requires O(w²) ancestor tests with anchor(v), which are thus performed in time O(w² · depth(anchor(v))).
3.2 Picking Anchors
An ideal anchor-picking strategy would satisfy the following criteria:
-
memory optimization: minimize the number of different chains in the restricted ancestor set of v (so this set should be as small as possible)
-
query time optimization: minimize the length of the anchor chain (so the restricted ancestor sets should be large enough)
We introduce two strategies, both of them aiming at logarithmic-size anchor chains with near-constant average canonical set size. We further introduce parameterizations of both strategies that help improve the query time for the first and index size for the second.
3.2.1 Record-based anchors
In this strategy, we assign a fixed random score to each node (chosen independently and uniformly at random in, say, [0, 1]). The overall invariant is that the score of the anchor should be higher than the score of the node. In simplified settings (see Lemma 7 below), this guarantees an O(log n) bound on the anchor chain length with high probability. Within this constraint, we aim at minimizing the number of chains in RA(v).
Anchor Strategy 1 (With maximum depth ).
For any v, let S(v) be the set of nodes u in the anchor lists of the parents of v such that score(u) > score(v). If S(v) is not empty, let a be the element of S(v) minimizing the number of chains in the resulting restricted ancestor set. Choose anchor(v) = a if depth(a) < D. In any other case, choose anchor(v) = ⊥.
This strategy is intended as a sub-linear approximation of the ideal strategy picking an anchor with a higher score than v and a minimal restricted ancestor set. Intuitively, for most nodes, the anchor is selected among the parents and the restricted ancestor set is thus minimized (and the restricted top set often has size 1 or 2). For nodes with high scores, the restricted ancestor set is larger, but the anchor is much closer to the sources of the DAG and has itself a higher score, so the anchor list should have a bounded length.
Proposition 6.
For any constant D, the query time with Strategy 1 is O(D) = O(1).
Proof.
By construction, depth(v) ≤ D for all v, so the query time under this strategy is at most O(D).
Lemma 7.
Consider Strategy 1 with D = +∞. Let A be the set of ancestors of a vertex v. If all vertices in A have degree at most one, then depth(v) = O(log |A|) with high probability.
Proof.
Write A = {u_1, …, u_k} with u_k = v and such that there exists a path u_1 → u_2 → ⋯ → u_k. Then, by the anchor strategy, we have u_i ∈ AL(v) if and only if score(u_i) > score(u_j) for all j > i. Since the scores are taken independently at random, each u_i has probability 1/(k − i + 1) to have the maximum score over {u_i, …, u_k}. Thus, u_i is in AL(v) with probability 1/(k − i + 1), hence the number of vertices in AL(v) is asymptotically Σ_{i=1}^{k} 1/i = Θ(log k).
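This record argument can be checked empirically: on a path, the anchor list of the last node consists exactly of the right-to-left score records, whose expected number is the harmonic number H_k ≈ ln k. The seeded simulation below is our own sketch (with hypothetical names), not the paper's code.

```python
import random

def anchor_list_length_on_path(k, seed):
    """Simulate Strategy 1 on a path of k nodes: the anchor of a node is
    its nearest ancestor with a higher score, so the anchor list of the
    last node is the sequence of score records read from the end."""
    rng = random.Random(seed)
    scores = [rng.random() for _ in range(k)]
    length, best = 0, -1.0
    for s in reversed(scores):  # walk from the last node to the source
        if s > best:            # a new record becomes the next anchor
            length += 1
            best = s
    return length
```

Averaging over many seeds for k = 1000 gives a value close to H_1000 ≈ 7.5, far below k.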
3.2.2 Power-based anchors
In this strategy, we do not use random scores, but instead keep track of the number of ancestors (hereafter called rank) of nodes as they are inserted. Whenever the rank passes a larger power of some integer B, the node is more likely to have a large restricted ancestor set and to be an anchor for future nodes; in most cases, however, the nodes have a restricted ancestor set of small size.
Definition 8.
Let v be a node. Write rank(v) for the number of ancestors of v, and mp(v) for the parent of v maximizing rank (ties are broken arbitrarily; we write mp(v) = ⊥ for source nodes, and we further write rank(⊥) = 0). Given an integer B ≥ 2, the base-B power of v, denoted pow(v), is the largest integer p such that there exists an integer m with rank(mp(v)) < m·B^p ≤ rank(v).
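Assuming ranks are known, the base-B power of a node only depends on the pair (rank(mp(v)), rank(v)); a small helper (with hypothetical names, assuming rank(mp(v)) < rank(v)) could be:

```python
def base_power(rank_parent, rank_v, base=10):
    """Base-B power (Definition 8): the largest p such that some
    multiple of base**p lies in the interval (rank_parent, rank_v]."""
    p = 0
    # p = 0 always works; increase p while the largest multiple of
    # base**(p+1) below rank_v still exceeds rank_parent.
    while (rank_v // base ** (p + 1)) * base ** (p + 1) > rank_parent:
        p += 1
    return p
```

For instance, with base 10, a node of rank 2300 whose best parent has rank 2299 has power 2, while rank 2302 over 2301 has power 0.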
Anchor Strategy 2 (With base ).
For any v, define anchor(v) as the first node u of AL(mp(v)) such that pow(u) ≥ pow(v) (and anchor(v) = ⊥ if no such node exists).
Consider for example the path graph sketched in Figure 3, with B = 10. The rank of each node is depicted above each drawn node, and there are 2302 nodes (we use the rank as a vertex identifier as well, corresponding to an insertion order starting at 1). The anchor list of node 2302 is depicted: AL(2302) = (2302, 2301, 2300, 2200, 2100, 2000, 1000). In this graph, the maximum anchor-depth is reached for node 1999 (depth 28, the sum of its base-10 digits). Note that most nodes (those that are not multiples of 10) have a trivial restricted ancestor set (containing the single node v), so the restricted set of chain tops is a singleton for them, even if the graph is somehow decomposed into multiple chains.
Under this strategy, we can prove a logarithmic (deterministic) upper bound on the anchor depth of each node, as well as an upper bound on the restricted ancestor set of all nodes with small power.
Lemma 9.
For any v, we have depth(v) ≤ (B − 1)(log_B(rank(v)) + 1) and |RA(v)| < B^{pow(v)+1}.
Proof.
Let m_0 = v, m_1, m_2, … be the list of nodes defined with m_{k+1} = mp(m_k) until a source is reached.
Let i be the smallest index such that pow(m_i) ≥ pow(v) (i = ∞ if no such index exists). Note that anchor(v) = m_i: indeed, a node m_j (j ≥ 1) belongs to AL(m_1) if and only if pow(m_j) ≥ pow(m_k) for all 1 ≤ k < j; since pow(m_k) < pow(v) ≤ pow(m_i) for all 1 ≤ k < i, the node m_i is the first element of AL(m_1) with power at least pow(v), so it is picked as the anchor of v (and anchor(v) = ⊥ when i = ∞).
Write p = pow(v). We have |RA(v)| = rank(v) − rank(m_i). For 1 ≤ k < i, the interval (rank(m_{k+1}), rank(m_k)] contains no multiple of B^p (otherwise pow(m_k) ≥ p would hold), and the interval (rank(m_1), rank(v)] contains no multiple of B^{p+1} (otherwise pow(v) > p would hold). The interval (rank(m_i), rank(v)] thus contains at most B − 1 multiples of B^p, hence its length satisfies |RA(v)| < B^{p+1}.
We can now bound depth(v). Along the anchor list AL(v), powers are non-decreasing, since pow(anchor(u)) ≥ pow(u) by definition of the strategy. Moreover, at most B − 1 consecutive nodes of AL(v) share the same power p: each anchor step from a node of power p crosses at least one new multiple of B^p, and among any B consecutive multiples of B^p one is a multiple of B^{p+1}, which raises the power. Since pow(u) ≤ log_B(rank(u)) for all u, there are at most log_B(rank(v)) + 1 distinct power values along AL(v), so depth(v) ≤ (B − 1)(log_B(rank(v)) + 1).
3.3 Possible extensions
We present two possible extensions of this algorithm for specific use cases, although they are not covered by the experimental evaluation.
Distributed node creations.
If multiple agents may insert nodes in parallel, the main difficulty is to maintain a correct chain decomposition. Once this difficulty is overcome, each agent may compute the anchor and restricted top vertices for each node they insert, then publish the index entries along with the rest of the node data. This can in particular be useful in VCS settings where commits are decentralized.
A simple solution to maintain a chain decomposition is to ensure that any chain is extended by at most one agent. For instance, if Alice needs to insert a node v and has no available chain for v, then a new chain is spawned for v, and only Alice is “allowed” to insert new nodes in this chain. Felsner’s algorithm [7] can be run by each agent independently, each with their own disjoint set of chain heads.
2-way insertions.
In some settings, it can be desirable to insert nodes either as sources or as sinks: for instance, a server may hold a long list of events while a client needs to retrieve the most recent events first. The client then progressively downloads events starting from the most recent, but new events may also occur during this process, and these need to be inserted on the other side of the graph.
A possible approach for this scenario would be to maintain 2 disjoint graphs, each with its own chain decomposition to perform reachability queries within each side. For queries spanning both sides of the graph, we further store, for each node, the lowest vertex in each chain of the other side at the time this vertex is created (or, at least, those lowest vertices that are not already stored in the anchor list). Given a reachability query between and , we look for the chain of within the anchor list of , and for the chain of within the anchor list of . Overall, the index size remains linear in the graph width, and query time should be multiplied by 2 for such wide queries.
4 Experimental Evaluation
We implemented the chain algorithm with both anchor strategies, as well as a selection of literature algorithms. We use a corpus of 1 000 randomly generated graphs, as well as 97 real-world DAGs. For each graph, we draw 5 000 random queries. Then, we build the index on each graph and perform all queries successively.
All algorithms were implemented in Python using standard data structures, in order to obtain homogeneous running times. Furthermore, no disk access is performed during the timed parts of the benchmark, as all operations are performed in RAM. We measure 3 quantities for each algorithm in each graph:
- Memory.
-
Index size was evaluated as the number of integers (mostly node identifiers) stored for each algorithm, counting both index and manifest.
- Query time.
-
The positive (resp. negative) query time is the average time to answer a positive (resp. negative) query. The overall query time is the mean of these two quantities. Since some algorithms perform better on one kind than the other, this yields more homogeneous results, independent of whether we tested more or fewer positive queries.
- Indexing time.
-
The average time (in ms) per node needed to build the index for the whole graph (index construction / graph size).
All benchmark scripts are available at https://doi.org/10.5281/zenodo.14792894 for replication. All plots are drawn in log scale. We now describe more precisely the graphs used in the benchmark, as well as the implemented algorithms.
4.1 Random Graph Benchmark
We generated 1 000 random graphs with a tight control on their size, width, and average degree. Concretely, we have 3 parameters:
-
n: size of the graph (number of nodes)
-
w: width of the graph (w ≤ n)
-
q: probability to add a new arc to a node
We use the following generation algorithm:
-
Start with w degree-0 nodes, remember each as a chain head.
-
For each new node v (repeat n − w times), add an arc from a random chain head h, and replace h by v in the set of chain heads.
-
With probability q, add an arc from a random node u < v to v, and repeat this step.
-
Randomly shuffle the list of in-neighbors (parents) of the node.
Overall, we produce n nodes in topological order, with width exactly w (the first w nodes form an anti-chain, and the nodes admit a partition into w paths), and expected average degree (n − w)/(n(1 − q)). Note that some arcs may be drawn multiple times.
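The generation procedure above can be sketched as follows. This is a best-effort reading of the four steps; `random_dag` and the exact arc-sampling details are our assumptions.

```python
import random

def random_dag(n, w, q, seed=0):
    """Generate a random DAG in topological order: w sources forming an
    anti-chain, then n - w nodes each extending a random chain, with
    extra arcs added with probability q each (step 3 repeated)."""
    rng = random.Random(seed)
    parents = {v: [] for v in range(w)}   # the w sources
    heads = list(range(w))                # current chain heads
    for v in range(w, n):
        i = rng.randrange(w)
        parents[v] = [heads[i]]           # extend chain i: width stays w
        heads[i] = v
        while rng.random() < q:           # geometric number of extra arcs
            parents[v].append(rng.randrange(v))
        rng.shuffle(parents[v])           # step 4: shuffle in-neighbors
    return parents
```

By construction every arc goes from a smaller to a larger identifier, so the output is already in topological order.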
The overall benchmark is obtained by generating graphs for all combinations of the following parameters:
4.2 Tests on the CARDS dataset
We also ran algorithms on real-life graphs selected in the CARDS dataset [13]. The whole dataset contains dependency graphs from various sources (package dependencies, git/mercurial repositories, etc.). For repositories, we only used the 20 largest graphs from each category, giving 97 graphs overall. We further truncated each graph to 100 000 nodes for more homogeneous running times.
4.3 Competing algorithms
Since we have no knowledge of existing incremental reachability indexes, we selected two algorithms with sub-linear insertion time, allowing at least for fast updates of the graph.
Hub Labeling (2-hop).
In the 2-hop algorithm(s) [5], the index contains two lists of hubs for each node: one among its descendants and one among its ancestors. The central guarantee is that any pair of related nodes has a common hub. The query time and index size are dominated by the size of the hub lists.
Append-only insertions are the worst case for this index: each ancestor of a node becomes a hub, and the index size becomes quadratic. We thus chose to compare with a dynamic algorithm by Zhu et al. [17] that allows graph insertions and deletions without recomputing the index from scratch (but does edit hub lists of earlier nodes). More precisely, we implemented the index-building routine as described by Akiba et al. [2], using the node ordering presented in Zhu et al. [17]; we did not count the additional memory needed to prepare for random node insertion/deletion. We also implemented a random ordering strategy, to evaluate the impact of the ordering strategy on the index performance.
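For reference, the 2-hop query itself reduces to a set intersection. The sketch below assumes precomputed hub lists (`anc_hubs`, `desc_hubs` are illustrative names) and conventionally treats each node as a hub of itself:

```python
def hop_reachable(anc_hubs, desc_hubs, u, v):
    """2-hop query sketch: u reaches v iff some hub lies both among the
    descendant hubs of u and the ancestor hubs of v (each node counts
    as a hub of itself).  Hub lists are given as sets of node ids."""
    return bool((desc_hubs[u] | {u}) & (anc_hubs[v] | {v}))
```

With sorted hub lists instead of sets, the intersection is a linear merge, which is why query time is dominated by the hub-list sizes.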
BFL.
In this algorithm [12], we pick a number A of semaphores. We store for each node (using length-A bit-vectors) the sets of semaphores present among its ancestors and among its descendants. Comparing two such bit-vectors allows, for some pairs of nodes, answering reachability queries in constant time (positively if a semaphore appears as an ancestor of one node and a descendant of the other, negatively if the ancestor semaphores of one node are not included in those of the other, or similarly for descendants). When this comparison is inconclusive, a DFS is run from an endpoint, testing each node along the way, until either a path is found after a positive test or all branches have been pruned by negative tests.
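The constant-time filter described above can be sketched with integers used as bitsets, assuming one dedicated bit per semaphore (the actual BFL labels hash larger sets into fixed-length vectors, which weakens the positive case); the function name and the `None`-for-inconclusive convention are ours.

```python
def bfl_filter(anc_u, desc_u, anc_v, desc_v):
    """Constant-time filter for 'can u reach v?'.  Bit i of each int
    marks semaphore i among the node's ancestors / descendants.
    Returns True/False when conclusive, None otherwise."""
    if desc_u & anc_v:      # u -> semaphore -> v: surely reachable
        return True
    if anc_u & ~anc_v:      # an ancestor semaphore of u is missing in v
        return False
    if desc_v & ~desc_u:    # a descendant semaphore of v missing in u
        return False
    return None             # inconclusive: fall back to the pruned DFS
```

The negative checks use the fact that if u reaches v, then the ancestors of u are ancestors of v and the descendants of v are descendants of u.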
5 Results
We first compare all presented algorithms in a unified setting. Out of these, we select four finalists for which we analyse running times and memory in more detail. Finally, we run two selected algorithms on CARDS graphs [13] to ensure that the conclusions of the benchmark carry over to real-life graphs.
Comparing all algorithms.
| Algorithm | Query (ms) | Memory (ints/node) | Indexing (ms per node) |
|---|---|---|---|
| [Jagadish] Chains only | 0.002 | 2770.697 | 1.614 |
| Anchor 1 (D=+∞) | 0.004 | 498.900 | 8.085 |
| Anchor 1 (D=64) | 0.003 | 498.602 | 7.778 |
| Anchor 2 (B=2) | 0.004 | 363.158 | 2.737 |
| Anchor 2 (B=32) | 0.004 | 236.599 | 4.035 |
| Anchor 2 (B=256) | 0.005 | 193.859 | 5.568 |
| [Cohen et al.] 2-hop (degree) | 0.070 | 283.423 | 36.269 |
| [Cohen et al.] 2-hop (rand) | 0.099 | 428.571 | 82.426 |
| [Su et al.] BFL (A=1024) | 0.261 | 32.000 | 0.008 |
| [Su et al.] BFL (A=256) | 0.744 | 8.000 | 0.007 |
The plots in Figure 4 show the performances of the algorithms presented here. Each graph is assigned a dimension combining its size and width. All graphs with the same dimension are grouped into a single point for each algorithm, with the query time as x coordinate and memory as y coordinate. The choice of the dimension simplifies the scatter plots, while keeping a good view of upper and lower bounds for each algorithm. Indeed, our anchor-based algorithms are mostly dependent on the width w, while other algorithms are mostly dependent on the graph size n. The right plot shows all algorithms, and the left plot gives a zoom-in on our anchor-based algorithms. Finally, Table 1 gives numeric results over the largest graphs of the data-set, corresponding to the top-right-most points of the plots.
Overall, each algorithm family behaves rather uniformly. As expected, chain-based approaches give the best (near-constant) query times, with a high memory cost for the original chains-only approach [9]. Algorithm BFL has prohibitive running times, even with up to 1024 semaphores. Considering 2-hop, the degree heuristic indeed significantly reduces the worst-case time and memory.
Detailed comparisons.
We choose the following four algorithms for a more extensive performance analysis. Our Anchor algorithms with Strategy 1 (D = 64) and Strategy 2 (B = 256) are the two “extreme” algorithms in the memory/time trade-off. We also include the original chain algorithm [9], as well as the 2-hop algorithm with degree heuristic, which gives the best performances among literature algorithms on our benchmark.
Figure 5 helps compare these algorithms for each criterion as a function of each graph parameter. As mentioned already, the memory requirements of the original chain algorithm are prohibitive, even for the smallest graph widths; however, its query and indexing times remain optimal for all dimensions. The memory requirements of our anchor-based heuristics are comparable with those of 2-hop in this favorable setting with bounded graph width, and queries remain an order of magnitude faster. Considering the degree of the graph, it can be seen that it mainly affects the indexing time (since all algorithms need to visit each arc at least once); it does not have a significant impact on the other index performances.
On CARDS dataset.
Figure 7 shows the performances of our Anchor algorithm (with the lowest memory needs, i.e. Strategy 2 with B = 256) compared with 2-hop. Unsurprisingly, the worst-case query times appear for 2-hop. Considering memory, there appear to be many very simple graphs (e.g. with isolated nodes or small paths): on such graphs, the anchor algorithm’s memory is dominated by constant-size overheads (chain assignments, anchors, powers, etc.) rather than by the chain tops themselves, while 2-hop has no such overhead and also maintains very short hub lists. Detailed results are given in Table 2 in appendix.
Conclusions.
We can see that Strategy 2 (with a large enough value of B) has comparable or lower memory needs than 2-hop for all tested values of the width. Compared to the original chain algorithm, we maintain bounded query and indexing times (within a factor 2 of the original), but improve the memory requirements by a factor of 10.
Performance with a real-life dataset indicates that our proof-of-concept algorithm performs similarly to our implementation of 2-hop, within the same order of magnitude.
Clearly, our benchmark does not cover all existing algorithms for the reachability problem, nor does it include the most efficient ones. The generated graphs are also very constrained, in order to precisely see the limits of each algorithm. The overarching goal is to ensure that performances are within the same order of magnitude as standard algorithms, while introducing the incremental feature required in many applications. A more precise benchmark would need to simulate real-life operations such as rollbacks, cache management, concurrent access, etc. where the incremental feature alone can save orders of magnitude in index maintenance. We hope our work can help introduce efficient reachability indexes in real-world applications, instead of a slow DFS, to help them scale up to larger graphs.
References
- [1] Rakesh Agrawal, Alexander Borgida, and Hosagrahar Visvesvaraya Jagadish. Efficient management of transitive relationships in large data and knowledge bases. ACM SIGMOD Record, 18(2):253–262, 1989.
- [2] Takuya Akiba, Yoichi Iwata, and Yuichi Yoshida. Fast exact shortest-path distance queries on large networks by pruned landmark labeling. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD ’13, pages 349–360, New York, NY, USA, 2013. Association for Computing Machinery. doi:10.1145/2463676.2465315.
- [3] Bartlomiej Bosek, Stefan Felsner, Kamil Kloch, Tomasz Krawczyk, Grzegorz Matecki, and Piotr Micek. On-line chain partitions of orders: A survey. Order, 29(1):49–73, 2012. doi:10.1007/S11083-011-9197-1.
- [4] Yangjun Chen and Yibin Chen. An efficient algorithm for answering graph reachability queries. In 2008 IEEE 24th International Conference on Data Engineering, pages 893–902. IEEE, 2008. doi:10.1109/ICDE.2008.4497498.
- [5] Edith Cohen, Eran Halperin, Haim Kaplan, and Uri Zwick. Reachability and distance queries via 2-hop labels. SIAM Journal on Computing, 32(5):1338–1355, 2003. doi:10.1137/S0097539702403098.
- [6] R. P. Dilworth. A decomposition theorem for partially ordered sets. Annals of Mathematics, 51(1):161–166, 1950. URL: http://www.jstor.org/stable/1969503.
- [7] Stefan Felsner. On-line chain partitions of orders. Theor. Comput. Sci., 175(2):283–292, 1997. doi:10.1016/S0304-3975(96)00204-6.
- [8] Delbert Ray Fulkerson. Note on Dilworth's decomposition theorem for partially ordered sets. Proceedings of the American Mathematical Society, 7(4):701–702, 1956. URL: https://api.semanticscholar.org/CorpusID:119499686.
- [9] H. V. Jagadish. A compression technique to materialize transitive closure. ACM Trans. Database Syst., 15(4):558–598, December 1990. doi:10.1145/99935.99944.
- [10] Giorgos Kritikakis and Ioannis G. Tollis. Fast reachability using DAG decomposition. In Loukas Georgiadis, editor, 21st International Symposium on Experimental Algorithms, SEA 2023, July 24-26, 2023, Barcelona, Spain, volume 265 of LIPIcs, pages 2:1–2:17. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPICS.SEA.2023.2.
- [11] Qiuyi Lyu, Yuchen Li, Bingsheng He, and Bin Gong. DBL: Efficient reachability queries on dynamic graphs. In Database Systems for Advanced Applications: 26th International Conference, DASFAA 2021, Taipei, Taiwan, April 11–14, 2021, Proceedings, Part II 26, pages 761–777. Springer, 2021. doi:10.1007/978-3-030-73197-7_52.
- [12] Jiao Su, Qing Zhu, Hao Wei, and Jeffrey Xu Yu. Reachability querying: Can it be even faster? IEEE Transactions on Knowledge and Data Engineering, 29(3):683–697, 2016. doi:10.1109/TKDE.2016.2631160.
- [13] Euxane Tran-Girard, Laurent Bulteau, and Pierre-Yves David. CARDS: A collection of package, revision, and miscellaneous dependency graphs. In 22nd International Conference on Mining Software Repositories (MSR), Ottawa, Canada, 2025. To appear. Download at https://doi.org/10.5281/zenodo.14245890.
- [14] Hao Wei, Jeffrey Xu Yu, Can Lu, and Ruoming Jin. Reachability querying: An independent permutation labeling approach. Proceedings of the VLDB Endowment, 7(12):1191–1202, 2014. doi:10.14778/2732977.2732992.
- [15] Hilmi Yildirim, Vineet Chaoji, and Mohammed J Zaki. Dagger: A scalable index for reachability queries in large dynamic graphs. arXiv preprint arXiv:1301.0977, 2013. arXiv:1301.0977.
- [16] Andy Diwen Zhu, Wenqing Lin, Sibo Wang, and Xiaokui Xiao. Reachability queries on large dynamic graphs: a total order approach. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data, pages 1323–1334, 2014. doi:10.1145/2588555.2612181.
| Family | Graph count | Size | Average degree | Reachability ratio | Width W | Memory 2-hop (ints/node) | Memory Anchor (ints/node) | Query 2-hop (ms) | Query Anchor (ms) | Indexing 2-hop (ms/node) | Indexing Anchor (ms/node) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mercurial | 20 | 100000 | 1.05 | 93.04% | 151 | 26 | 8 | 0.005 | 0.017 | 0.284 | 1.289 |
| Git | 20 | 100000 | 1.15 | 60.25% | 16005 | 26 | 10 | 0.005 | 0.022 | 0.323 | 39.712 |
| Debian | 1 | 63701 | 4.35 | 0.25% | 52726 | 32 | 23 | 0.059 | 0.003 | 7.663 | 1.16 |
| Homebrew | 4 | 7157 | 1.36 | 0.19% | 6812 | 7 | 10 | 0.011 | 0.002 | 0.309 | 0.109 |
| Rust | 4 | 100000 | 3.0 | 0.97% | 93507 | 23 | 15 | 0.025 | 0.004 | 7.28 | 3.075 |
| Snap | 3 | 54123 | 8.51 | 22.83% | 40276 | 90 | 53 | 0.016 | 0.003 | 11.046 | 7.021 |
| Archlinux | 5 | 100000 | 7.22 | 3.44% | 91826 | 31 | 16 | 0.041 | 0.006 | 18.905 | 17.359 |
| Freebsd | 3 | 36068 | 31.72 | 0.29% | 31862 | 52 | 17 | 0.036 | 0.002 | 7.158 | 0.809 |
| Nix | 2 | 100000 | 2.77 | 30.43% | 47750 | 22 | 13 | 0.027 | 0.015 | 3.454 | 15.293 |
| NpmJS | 3 | 100000 | 6.61 | 12.05% | 78818 | 36 | 20 | 0.028 | 0.006 | 6.637 | 41.936 |
| Perl | 1 | 35361 | 9.49 | 8.3% | 29810 | 41 | 11 | 0.024 | 0.003 | 10.289 | 3.711 |
| Maven | 7 | 100000 | 1.09 | 0.02% | 95898 | 7 | 11 | 0.045 | 0.002 | 0.964 | 0.507 |
| R | 1 | 21056 | 0.37 | 0.0% | 20516 | 2 | 8 | 0.002 | 0.001 | 0.068 | 0.112 |
| Matrix | 20 | 49582 | 0.87 | 6.63% | 33213 | 9 | 9 | 0.004 | 0.009 | 0.158 | 1.854 |
| DBLP | 1 | 100000 | 4.23 | 3.74% | 56743 | 155 | 238 | 0.064 | 0.004 | 23.186 | 2.934 |
| Openalex | 1 | 100000 | 2.05 | 0.86% | 82966 | 49 | 60 | 0.024 | 0.003 | 3.654 | 0.938 |
| LegiFrance | 1 | 100000 | 1.53 | 0.08% | 89734 | 11 | 14 | 0.006 | 0.005 | 0.264 | 0.911 |
| Caselaw | 1 | 100000 | 3.18 | 1.33% | 70531 | 110 | 181 | 0.062 | 0.003 | 16.305 | 1.479 |
| All | 97 | 80525 | 3.02 | 34.44% | 35712 | 25 | 17 | 0.015 | 0.011 | 2.985 | 11.57 |
