Universally Optimal Streaming Algorithm for Random Walks in Dense Graphs
Abstract
Sampling a random walk is a fundamental primitive in many graph applications. In the streaming model, it is known that sampling an -step random walk on an -vertex directed graph requires space, implying that no sublinear-space streaming algorithm exists for general graphs.
We show that sublinear algorithms are possible for the case of dense graphs, where every vertex has out-degree at least . In particular, we give a one-pass turnstile streaming algorithm that uses only memory for such graphs. More broadly, for graphs with minimum out-degree at least , our streaming algorithm samples a random walk using memory.
We show that our algorithm is optimal in a strong “beyond worst-case” sense. To formalize this, we introduce the notion of universal optimality for graph streaming algorithms. Informally, a streaming algorithm is universally optimal if it performs (almost) as well as possible on every graph, assuming a worst-case choice of the streaming order. This notion of universal optimality is a key conceptual contribution of our work.
Keywords and phrases:
Random Walk, streaming Algorithm, universal OptimalityFunding:
Klim Efremenko: Supported by the Israel Science Foundation (ISF) through grant No. 1456/18 and European Research Council Grant number: 949707.Copyright and License:
2012 ACM Subject Classification:
Theory of computation Streaming, sublinear and near linear time algorithmsEditor:
Shubhangi SarafSeries and Publisher:
Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
1.1 Background and Motivation
Sampling a random-walk in a graph is a fundamental problem with broad algorithmic applications. For instance, it underlies techniques for connectivity testing [17], clustering [2, 3, 6, 21], sampling [13], constructing random spanning trees [20], and approximate counting [12]. Another well-known application of random walk sampling is for PageRank estimation [19], where the rank of a webpage corresponds to the likelihood that a user, randomly following hyperlinks, arrives at that page111Formally, in a web graph , PageRank values satisfy the system for every vertex , where denotes the out-degree of [5].. Since these applications often involve extremely large, real-world networks, an important challenge is to develop space-efficient graph-streaming algorithms that process the graph edge-by-edge rather than storing it explicitly.
The folklore algorithm.
A well-known folklore streaming algorithm for simulating an -step random walk on an -vertex directed graph from a given start vertex uses memory222Throughout, the notation hides polylogarithmic factors in . Algorithms are assumed to know in advance, but not the length of the stream (the number of edges).. The algorithm works as follows: As the graph’s edges arrive in a stream, maintain for each vertex a list of outgoing edges selected independently and uniformly at randomly. To generate the walk, begin at and, at step , move to a new vertex by taking the first sampled edge that haven’t been taken before from the precomputed list of outgoing edges at .333For example, if the first two sampled edges from have already been used, take the third. A single edge may appear and be traversed multiple times if it occurs repeatedly in ’s stored list of outgoing edges.
The folklore algorithm was later shown to be optimal [14]; however, the hard instances used by the lower-bound proof crucially involve vertices of very low out-degree. This naturally raises the question of whether better algorithms might exist for dense graphs, which we define as graphs in which every vertex has out-degree at least for some large .444We assume , since if a vertex (e.g., ) has out-degree , there may not be any paths of length starting from . We use the minimum out-degree as our measure of density rather than, say, the average degree (or equivalently, the total number of edges), as, for instance, the graph formed by the union of a clique and a sparse subgraph may still have a high average out-degree. Yet, if is in the sparse subgraph, then sampling a random walk starting from requires sampling a random walk in the sparse subgraph.
1.2 Our Results
1.2.1 Random Walk Sampling for Dense Graphs
The algorithm .
We begin by noting that if the input graph has minimum out-degree at least and is large, the folklore algorithm can be improved to store only edges. Specifically, instead of storing outgoing edges per vertex, stores outgoing edges for each vertex. We show that this is enough because, with high probability, no vertex is visited more than that many times during an -step random walk on a graph with minimum out-degree at least . We call this algorithm .
The distortion of a random walk sampling algorithm is the maximum, over all graphs , of the statistical distance between the true distribution of random walk on and the distribution over outputs produced by the algorithm on input . The distortion of is small but nonzero, since, for example, it never outputs the rare walks in which a vertex is visited more than times.
The algorithm .
Since must store at least one sampled edge per vertex (and we assume is large), it never achieves sublinear space. We therefore propose a second, fundamentally different, streaming algorithm, , which, when is sufficiently large (relative to ), only uses sublinear memory.
The algorithm proceeds as follows: Set . Sample sets of vertices (layers of a graph) , each of size (if a vertex is sampled in multiple layers, we treat each occurrence as a distinct copy of the original vertex). Intuitively, represents the possible choices for vertex . Then, for each and , the algorithm samples a random outgoing edge from into , thereby constructing an -layer graph. The output is the unique path of length starting from in this layered graph.
It is not hard to verify the correctness of : fails in the event that some vertex in a layer has no outgoing edge to the next layer. However, the sizes of the layers are selected such that this event happens with low probability. If the algorithm does not fail and the path is sampled, we claim that each is a random (outgoing) neighbor of , and therefore that the output is a random walk. Indeed, since is a random subset of vertices (of a specific size), choosing a random neighbor of within is equivalent to selecting a random neighbor of in . Additionally, the algorithm only stores edges, and, in particular, on graphs with minimum out-degree , its space usage reduces to , which can be sublinear in .
Note that enjoys the following property: It fails with a small probability, but, conditioned on non-failure, it samples from the true distribution of random walks. does not satisfy this property.
1.2.2 Streaming Universal Optimality
The notion of streaming universal optimality.
is essentially optimal for very dense graphs with as it uses only memory and the output consists of edges. But is optimal for all values of ?
A worst-case optimality result for a graph streaming algorithm would identify a family of hard graph instances (on which the algorithm uses the most memory) and show that no other correct algorithm can use less memory on these instances. However, for easier instances, the algorithm may be highly suboptimal. This notion of optimality can be viewed as existential optimality: There exist inputs within the considered class for which the algorithm is optimal. A much more ambitious goal is to seek universal optimality, where the algorithm achieves optimal performance on every instance.
Universal optimality [10], along with other “beyond worst-case” notions, has been studied for decades in various computational models, see the great book [18]. In this work, we introduce the notion of universal optimality for streaming algorithms. Informally, a streaming algorithm is universally optimal if it performs (nearly) as well as possible on every graph under worst-case choice of streaming order. Specifically, for any graph , if the algorithm uses space on some streaming order of ’s edges, then every other correct algorithm must also use at least (approximately) space on some (possibly different) streaming order of . Further discussion of the notion of universal optimality for streaming algorithms appears in Section 1.2.3.
Universally optimal random walk samplers?
A priori, it is not clear that universally optimal algorithms should even exist for the task of sampling an -step random walk with distortion, say, , over the set of graphs consisting of directed graphs with minimum out-degree at least . Indeed, it is not hard to see that neither nor satisfies such universal optimality for all values of and .555For and , uses memory, while uses , so is not universally optimal. If and vertices have out-degree while the rest have constant out-degrees, is constant. In this case, uses memory, whereas uses only (as each of constant out-degree vertices only stores a constant number of edges), showing that is not universally optimal.
Nevertheless, we show that with a careful implementation, and can be combined into a universally optimal algorithm as follows: If is greater than (roughly) , run ; otherwise, run .
Theorem 1.
For , solves the task of sampling an -step random walk, with distortion , from a given vertex in any directed graph with minimum out-degree . It uses at most space.
Furthermore, it is universally optimal over the set of directed graphs with minimum out degree at least , for any .
We note that does not work for dynamic streams, since it relies on reservoir sampling [23] to select the outgoing edges to be stored. Nevertheless, by replacing reservoir sampling with -samplers [15], we can obtain a new algorithm that works for dynamic streams and uses space. However, this algorithm is not (insertion-only) universally optimal with a straightforward implementation666Intuitively, it is possible that a vertex has outgoing edges during the stream while its actual out-degree is very small at the end. For , this means samplers are needed. In contrast, if only insertion-only streams are considered, can simply store all outgoing edges if .. In Section 1.2.3, we will show that this turns out to be the best we can do.
As pointed out in [10], an interesting “side-effect” of universal optimality is that it precisely identifies the parameters of the problem that are inherently responsible for its complexity. In our case, the minimum out-degree of the graph dictates the optimal space complexity (at least when is known to the algorithm and within certain parameter regimes). In particular, once is specified, the optimal space requirement no longer depends on other parameters (such as the average out-degree, the minimum total degree, or the length of the longest directed path in the graph).
Do we need to know ?
The algorithm requires advance knowledge of . A natural question is whether one can design a one-pass streaming algorithm with comparable space usage that does not rely on knowing beforehand777This can be done in just two passes: the first pass computes the minimum out-degree , and the second pass runs using the degree computed in the first pass..
Specifically, let be sufficiently large for Theorem 1 to hold, and let be an algorithm for sampling an -step random walk in any graph with minimum out-degree at least (but unknown to ). Let be a graph with minimum out-degree . Running on requires space. By the universal optimality of guaranteed by Theorem 1, it follows that any algorithm that works for all graphs with minimum out-degree at least , including , must also use space on some streaming order of . In contrast, requires only space on every streaming order of . Hence, algorithms that do not know in advance cannot achieve the same space as .
1.2.3 Further Discussion of the Notion of Universal Optimality
Universal optimality for dynamic streaming algorithms.
Our definition of universal optimality considers an algorithm that works for insertion-only streams, and, for every graph, compares its space usage to that of any other algorithm that works for insertion-only streams. One can consider the same definition where and work for dynamic streams. However, we observe that notion of universal optimality for dynamic streaming algorithms coincides with standard worst-case optimality.
To see this, suppose is a worst-case optimal algorithm for some task on dynamic streams. Fix . Let denote the maximum memory used by on graphs with vertices. By (worst-case) optimality of , for every other algorithm that solves , there exists a stream of edge updates on vertices such that requires at least memory on . Now take any graph on vertices and consider an algorithm for . Construct the following stream for : First insert/delete edges according to , then reverse all of them, and finally insert the edges of . Since already uses at least memory on the prefix , it uses at least memory on this streaming order of . In contrast, never exceeds memory on any stream. Hence, is also universally optimal.
Universal optimality vs. instance optimality.
Our definition of universal optimality requires the algorithm to performs (nearly) as well as any competitor on every graph , when comparing their worst-case memory usage across all possible streaming orders of . However, this leaves open the possibility that on specific streaming orders of , a competing algorithm could do substantially better. A stronger requirement would be to guarantee optimality on every stream, not just every graph. Concretely, for any edge stream , the memory used by any competing algorithm on must be at least as large as that used by our algorithm on the same . This stronger guarantee corresponds to adapting the notion of instance optimality, where an algorithm must be optimal on each individual instance, to the context of graph streaming algorithms [9] (see also [1, 22]).
However, it turns out that instance optimality makes little sense. Suppose and . Consider an -vertex graph which is the disjoint union of cliques of size each, along with the input stream where edges are given in lexicographical order. We can construct an algorithm which checks if the input stream is exactly the edges of given in lexicographical order. This can be done in only additional space because the edges of can be easily generated in lexicographical order in such space. As soon as the check fails, simulates from the beginning of the stream. This is possible because the prefix of the stream can be regenerated. If the check passes at the end, then an -step random walk can be easily generated in space as we know all the edges (implicitly). Clearly, always performs as well as . Furthermore, is strictly better than on the specific instance of and a lexicographical order of the edges of .
More generally, the above “hard-coding” method can apply to any graph problem, if we can design a more space-efficient algorithm tailored to specific instances that have succinct representations and can be checked in small space. However, note that such things cannot happen with universal optimality as there is no way to check, using small space, if the input graph falls into certain categories.
1.3 Additional Related Work
In an influential work, [19] give an -pass, -space algorithm for simulating -step random walks on directed graphs, and used it to derive space-efficient algorithms for estimating PageRank on graph streams. For one-pass streaming, [14] proved an space lower bound, thereby proving the optimality of the folklore algorithm. For undirected graphs, [14] also gave an -space algorithm and a matching lower bound for sufficiently large . Recently, [7] showed that the space complexity of random walk sampling on directed graphs is when two passes over the stream are allowed. Additionally, [16, 8] studied the random walk sampling problem in the random-order streaming model.
[4] studies sublinear algorithms for MAXCUT and correlation clustering. For dense graphs, constant-space algorithms are known, whereas for sparse graphs, space lower bounds hold in the streaming model. Their work bridges these two extremes by providing algorithms whose performance is parameterized by the average degree.
2 Proof Overview
We overview the proof of Theorem 1 in this section. As the algorithm mentioned was already described in Section 1, we mostly focus on the proof of universal optimality. However, we do give some extra intuition for the algorithm .
2.1 The Sampling Algorithm
Just like , our algorithm is based on the folklore algorithm but develops it from a different perspective than . Recall that the standard view of the folklore algorithm interprets its memory as a set of lists, where each stores sampled outgoing edges from . The algorithm is obtained by simply shortening these lists by a factor of and a proof that such a shortening will not affect the correctness of the algorithm.
For the algorithm , we reinterpret the memory of the folklore algorithm as maintaining an -layered graph rather than a collection of lists. Each layer contains a copy of every vertex , and the (only) outgoing edge of in layer points to some vertex in layer . Specifically, if the -th edge in is , then the copy of in layer has a single outgoing edge to the copy of in layer . This perspective is equivalent to viewing the folklore algorithm’s memory as an matrix of edges, where the row corresponding to vertex encodes . In this representation, the -th column specifies the edges connecting layer to layer in the layered graph.
In this matrix (or layered-graph) view, corresponds to restricting the number of columns (equivalently, the number of layers) from to . By contrast, our algorithm reduces the number of rows (equivalently, the size of each layer) from to , using the guarantee on the minimum out-degree to show that such a reduction does not affect the correctness of the algorithm. It is easy to see that this memory restriction of the folklore algorithm yields exactly the algorithm described in Section 1.2.1.
The property testing approach.
Prior work has sometimes been motivated by a query model perspective to design algorithms for dense graphs. As such a perspective also yields an algorithm for our problem, we outline it here. However, the algorithm so obtained turns out to be suboptimal.
In the query model, a random walk can be efficiently simulated if adaptivity is allowed. Starting from , we select a random neighbor by querying random entries in the adjacency row of until a is found, then proceed similarly to select a neighbor of , and so on. Consider, for instance, the case where minimum degree at least . In this case, the expected number of queries needed to find the next vertex is constant. Consequently, the entire process requires roughly queries in expectation.
To obtain a one-pass streaming algorithm, we aim to transform the above adaptive query algorithm into a non-adaptive one, which are easily simulated by a streaming algortihm. The beautiful technique of [11] shows how to convert an adaptive query algorithm with queries into a non-adaptive one by sampling an induced subgraph on vertices, which requires queries.888While [11] state their result in the context of converting adaptive graph property testers into non-adaptive ones, their technique can be applied to convert the adaptive random walk algorithm above into a non-adaptive version. However, applying this idea directly to our adaptive random walk sampling algorithm yields a memory complexity of when , which is not optimal as it is worse than the bound we get.
2.2 Universal Optimality
We now overview the proof that our algorithm is universally optimal. For this, we first describe a lower bound approach designed for directed bipartite graphs that has been helpful in prior works as well. Consider a directed bipartite graph and an integer such that every vertex has out-edges to an independent uniformly random subset of size of vertices in . The vertices in have only one out-edge and all these edges lead to the same vertex that is chosen uniformly at random.
From a “sampling” problem to a “search” problem.
Because all the vertices in lead to the same vertex , a random walk starting from any vertex in would first go to , then take a uniformly random edge going out of , then go to again, and so on for steps. This means that any such walk contains within it, a sequence of edges going out of chosen independently and uniformly at random. As the number of distinct out-edges of is , we get that with high probability at least of these edges are distinct. We conclude that any algorithm that samples a random walk in this graph can also be used to solve a search problem, where the goal is to identify a subset of edges of size coming out of , with high probability.
Lower bound for the search problem.
However, it can be shown that the search problem is hard for streaming algorithms. Indeed, consider what happens when the edges in the graph are presented in a stream with all the edges going out of the vertices in appearing first in the stream followed by the edges going out of the vertices in . In this case, when the algorithm is processing the edges going out of the vertices in , it has no information about the vertex for which it has to remember the out-edges. Thus, unless the algorithm records enough out-edges from a lot of vertices in , it is highly likely that when is chosen uniformly at random from the vertices in , the algorithm would not have stored enough edges coming out of .
In other words, in order to output , the algorithm actually needs to remember this number of edges from all the vertices in , thereby needing at least memory. Observe that this proof is an adaptation of the lower bound proof for the well-known Index problem.
Further discussion and extensions of this approach.
Note that it is important in the above proof that every vertex in has the same number of edges going out of it. In case the number of out-edges going out of different vertices in is vastly different, the algorithm can focus on remembering only the edges from vertices that have few out-edges, and hope that when is sampled it happens to be one of these vertices. This could mean a worse space lower bound.
Also, observe that the instances considered in this lower bound approach are such that all vertices in have out-degree . However, for all , it can be extended to the case where the out-degree of every vertex is at least by letting all the vertices in have out-edges to the same subset of size of vertices in instead of having edges to exactly one vertex .
With this change, a random walk would not always return to the same vertex in and instead returns to a uniformly random vertex in . Thus, for any given vertex in , a random walk will only return to it (roughly) many times. In turn, we have that the algorithm only needs to store around many edges coming out of any vertex. As this number is now smaller, this would lead to a reduced space lower bound of . This is exactly where the expression in Theorem 1 comes from, but its not the end of the proof, as the input graph is not guaranteed to be bipartite or promised to have the same number of edges from all vertices in .
Extending to general graphs .
We now turn to show how the above approach can be extended to work for all graphs and give a lower bound of by embedding a hard instance of the form above in any input graph . For this, we “clean” any input graph in a sequence of steps, ultimately leading to a hard instance like those above.
Our first step is to identify a “dominant” degree in the graph. For this, we partition the vertices into levels based on which two consecutive powers of its out-degree lies in. This also partitions the sum we want in our lower bound and we focus on the part that is the largest. This gives a subset of vertices of vertices that have similar degrees and suffices for the lower bound up to logarithmic factors.
Having decided on a set to focus on, we now identify a cut in the graph such that a constant fraction of the vertices in lie on the left side of the cut, and a constant fraction of the neighbors of all these vertices lie on the right side of the cut. To show that such a cut exists, note that a random cut in the graph will have the property such that in expectation, a constant fraction of the neighbors of any vertex lie on a different side of the cut as that vertex. Using probabilistic arguments, we show that a cut exists for which this is true for a constant fraction of the vertices in .
We are almost there, except that the vertices in don’t have exactly the same degree which was assumed in our proof approach. This can be easily fixed by removing some of the edges from all vertices in till they have the same degree. As they had degrees within a factor of to being with, this loses only a constant fraction of edges and only costs us another constant factor in our lower bound. Another difference is that we started with a graph where every vertex is promised to have out-degree at least . However, as we removed the edges not in the cut, we may have shrunk the out-degrees of vertices. As this may make some out-degrees smaller than , it may take the resulting graph outside the promise, effectively making it useless. To fix this, we identify another set of randomly chosen vertices in the graph and connect them with each other in a clique and also connect all the other vertices to them. This increases the degrees of all vertices back to at least , satisfying the promise. Moreover, as999When and are within a constant factor of each other, our bound becomes which is tight as it is the size of the output. , this only loses us another constant in the lower bound.
A hard distribution.
Finally, note that the above construction only gave us a single graph while our lower bound approach required a distribution of graphs. This is a major difference, as there are algorithms that will solve the search problem on this particular graph (one can simply “hardwire” the answer for this graph in the algorithm) making a lower bound impossible. We get around this by converting our hard graph into a distribution over graphs as follows: For every vertex in , we only include a randomly chosen subset of half its edges in the graph and the other edges are discarded. This again only costs us a constant factor but creates for us a distribution just like we need for the lower bound approach.
We finish this section by arguing why a lower bound obtained from such a hard distribution after making all these changes would also apply to the original graph , as required for universal optimality. For this, we observe that our lower bound approach shows a lower bound on the memory needed at a specific point in the stream, when it has processed all the edges coming out from but before processing any edges that come out from . In our constructed distribution, the set corresponds to the set and thus, the inputs streams we generate will start with the edges coming out of the vertices in and the memory lower bound will hold for the memory state of the algorithm immediately after processing the edges. Crucially, (unlike the edges going into or the edges added to raise the degree) all these edges were also part of the original graph . Thus, for all the algorithm knows, the graph might as well have been with these edges presented first in the stream, and the lower bound we show for our constructed distribution will also apply to the original graph , finishing the proof.
3 Preliminaries
Notation.
For integer , we write as a shorthand for . denotes the set of positive integers. We often omit ceiling symbols, and by assigning a real number to an integer , we mean that gets .
Graphs.
For a directed graph and any vertex , we define to be the set of all out-edges of and define to be the out-degree of . We omit the superscript when it is clear from context. We use throughout.
Random walks.
An -step random walk in with starting vertex is a (directed) path of length in such that is an independent and uniformly random out-neighbor of for all . We say an algorithm samples an -step random walk starting from with distortion , if the sampled distribution of is -close in statistical distance to the true distribution of an -step random walk starting from .
Graph streaming.
An (insertion-only) graph streaming algorithm takes as input a sequence of edges of an underlying graph, and is required to process the edges one at a time using limited space. At the end of the stream, the algorithm needs to return an output for certain graph problems. In this paper, we consider the task of sampling random walks.
Universal optimality.
A graph streaming algorithm is universally optimal for a task over a set of graphs if:
-
1.
solves .
-
2.
For any streaming algorithm that also solves , the following holds: Let be any graph in and let be the maximum memory size used by over all streaming orders of . Then, the maximum memory size used by over any streaming order of is at most .
Chernoff bound.
We use the following version of the Chernoff bound.
Lemma 2 (Multiplicative Chernoff bound).
Suppose are independent random variables taking values in . Let denote their sum. Then,
Information theory.
We use sans-serif letters to denote random variables. We reserve to denote an arbitrary event. All random variables will be assumed to be discrete and we shall adopt the convention . When it is clear from context, we may abbreviate an event as just . All logarithms are taken with base .
Definition 3 (Entropy).
The (binary) entropy of is defined as:
The entropy of conditioned on is defined as:
Definition 4 (Conditional entropy).
We define the conditional entropy of given and as:
Henceforth, we shall omit writing the when it is clear from context.
Lemma 5 (Chain rule for entropy).
It holds for all , , and that:
Lemma 6 (Conditioning reduces entropy).
It holds for all , , and that:
Equality holds if and only if and are independent conditioned on , .
Corollary 7 (Subadditivity of entropy).
It holds for all , , and that:
Lemma 8.
It holds for all and that:
The second inequality is tight if and only if conditioned on is the uniform distribution over .
Corollary 9.
It holds for all , , and that:
Lemma 10 (Shearer’s inequality).
Let and be random variables. Let and be a set-valued random variable that is independent of . For any event determined by , it holds that:
Proof.
We have:
| (Definition 4) | ||||
| (as is independent of ) | ||||
| (Lemma 5) | ||||
| (Lemma 6) | ||||
| (Lemma 5) |
Corollary 11.
Let and be random variables. Let and be a set-valued random variable that is independent of . For any random variable determined by , it holds that:
4 Streaming Algorithm
The goal of this section is to prove the following Theorem 12. Algorithms , , and , are given in Algorithms 1, 2, and 3, respectively. By “output Fail,” we mean that the algorithm outputs . We will elaborate on the details of their streaming implementation later on.
Theorem 12.
Let . solves the task of sampling an -step random walk with distortion . It uses space.
As Algorithm 1 is simply a combination of Algorithms 2 and 3, Theorem 12 is a direct corollary of the following Lemmas 13 and 14. Indeed, when , we know that is larger than by a multiplicative factor of at most as . Note that can be sublinear in when .
We also emphasize that the proof of Lemma 14 does not require . In other words, Algorithm 3 alone is sufficient to achieve space in the worst case. However, as noted in Section 1, Algorithms 2 and 13 are necessary for achieving universal optimality.
Lemma 13.
Let such that . solves the task of sampling an -step random walk with distortion . It uses space.
Lemma 14.
Let . outputs Fail with probability at most and otherwise samples an (unbiased) -step random walk. It uses space.
4.1 Proof of Lemma 13
The correctness of Algorithm 2 follows from the following claim.
Claim 15.
Let such that . For any directed graph with minimum out-degree at least and vertex , with probability at least , an -step random walk from in visits any vertex at most times.
Proof.
Fix . For , let be the random variable for the -th vertex on an -step random walk from in . For any sub-path , because has at least outgoing edges, with possibly being one of them.
Note that the events may not be independent. Nevertheless, as each step of a random walk is sampled independently, the probability of being visited more than times is always upper bounded by the probability that independent biased coins, each with probability of getting a head, generate more than heads in total. By Lemma 2 and the assumption , this happens with probability at most . The claim then follows by union bounding over all vertices.
As to the space usage, Algorithm 2 can be implemented using reservoir sampling [23]. Concretely, for each vertex , it takes bits to keep track of . Meanwhile, it stores all its outgoing edges while . Once during the stream, it can use independent reservoir samplers, each taking bits. So each vertex takes space. Since the reservoir samplers are unbiased, this concludes the proof of Lemma 13 by applying Claim 15.
4.2 Proof of Lemma 14
To implement Algorithm 3, for each and vertex , one reservoir sampler is enough since all ’s can be sampled in advance. The space bound follows as is at most .
We now show that conditioned on no failure, Algorithm 3 samples an (unbiased) -step random walk. In other words, for every , has to be a uniformly random out-neighbor of in . To this end, consider any two out-neighbors of in , and subset of size . Suppose . If , each of them has the same probability of being sampled as . If , none of them would be sampled as . If exactly one of is in , assume without loss of generality. Since is a uniformly random subset of size , has the same probability as of being sampled as . Furthermore, the number of out-neighbors of in is the same in the two cases. As a result, the probability of being sampled as in the case is the same as the probability of being sampled as in the case . Altogether, this means that the overall probability of being sampled as is the same for .
It then remains to show that Algorithm 3 outputs Fail with probability at most . Note that the only possibility of failure is when there exists and vertex such that has no out-neighbor in . This happens only if . For any and , the probability of not having out-neighbors in is at most
This concludes the proof of Lemma 14 by union bounding over all options for .
5 Universal Optimality
The goal of this section is to show that the algorithm described in Algorithm 1 is universally optimal upto logarithmic factors.
We define a triple of positive integers to be significant if is large enough for asymptotic inequalities to holds and and .101010If , (or simply ) uses at most space. This must be universally optimal as outputting a length- path requires space. For , we define the set to be the set of all directed simple graphs where there are vertices and every vertex has out-degree at least . We say that a streaming algorithm is a random-walk algorithm if given significant integers , a graph and a start vertex , the algorithm samples an -step random walk from the start vertex with distortion at most . The theorem we show in this section is the following Theorem 16. Theorem 1 then follows directly from Theorem 16 and Theorem 12.
Theorem 16.
Let be significant, be a random-walk algorithm, and with a designated start vertex. There exists a streaming order for the edges of on which needs memory at least .
The proof of Theorem 16 spans this entire section. Fix , and be as in the theorem statement. As , we have that for all , it holds that . This means that there exists such that:
| (1) |
We fix this for the rest of this proof and define to be the set of vertices in the sum on the right. Namely,
| (2) |
Our first lemma shows that the set is large.
Lemma 17.
It holds that .
Proof.
We first prove this under the assumption that . In this case, we have for all . Plugging this in Equation 1, we get which gives the lemma as . Now, assume that . This means that implying that the minimum in the left hand side in Equation 1 can be lower bounded by while the minimum on the right hand side can be upper bounded by . Plugging in, we get . Rearranging gives the result.
We use and to denote the probability and expectation when is a subset of vertices chosen uniformly at random. For any subset , we define to be the set of edges in the cut formed by . Also, for all , define:
| (3) |
We show that there exists such that many of the vertices in lie in .
Lemma 18.
There exists satisfying:
Proof.
For any such that , we have that (for some ) if and only if all its out-neighbors are on the same side of the cut as . As every vertex has at least one out-neighbor, we get that this event happens with probability at most for such for a uniformly random . For such that , we have by Lemma 2 that:
Combining, we get that for all . It follows that:
In particular, there exists a cut for which . From this, we get that at least one of the quantities and is at least . To finish the proof, observe from Equation 3 that and thus, at least one of and satisfies the lemma.
Next, we fix to be the set promised by Lemma 18 and define and . It follows that by Lemma 17. Moreover, for all , we have that
| (4) |
Next, we isolate a suitably chosen set of vertices .
Lemma 19.
There exists a subset such that all the following hold:
Proof.
Proof by the probabilistic method. Consider the set-valued random variable obtained by including all with probability . By Lemma 2 and using the fact that , we get that the first inequality holds except with probability . For the other inequalities, consider any set such that . Note that because of our bound and the bound in Equation 4, we have that and111111Technically, as it needs to be a subset of , the projection of on its second coordinate is such a set, but the same argument applies. for all are such sets. As implies that , we have from Lemma 2 that for any such set . A union bound over all sets considered in the lemma statement finishes the proof.
Henceforth, we let be as promised by Lemma 19. We also define and conclude from Lemma 19 that . Finally, we define . For all , we have:
| (5) |
For all , let be an (arbitrary) subset of its out-edges going to vertices in of size . Observe from Equation 5 that at least one such subset exists and therefore, is well defined.
Next, for all , define the set-valued random variable to be an independent and uniformly random subseteq of of size . For notational convenience, for any subset , we will abbreviate to and omit the subscript when . Also, define the set-valued random variable to be a uniformly random subset of of size chosen independently of . As , we have that the random variable is well defined. The following lemma is straightforward.
Lemma 20.
Let and . It holds that:
Proof.
The size of the set we have to bound equals .
5.1 A “Search” Lower Bound
Before showing the desired lower bound on the memory used by the algorithm , we first define a related distributional search problem and show a lower bound for algorithms solving that. For this, let and be as defined above. We use these random variables to define a graph with the vertex set and the following edges:
-
(1)
For all and all , there is an edge from to . These edges are called the fixed edges.
-
(2)
For all , all the edges in are present in . These edges are called the -edges.
-
(3)
For all and all , there is an edge from to . These edges are called the -edges.
We let be the random variable corresponding to . Note that the random variables and determine each other, and, more specifically, the -edges and determine each other and the -edges and determine each other. Also, note from the fact that (see Lemma 19) that the out-degree of any vertex in any graph in the support of is at least . It follows that the support of is a subset of .
Having defined the random variable , the search problem we consider receives the edges of a graph , which we can equivalently view as a pair , in the stream, with the -edges arriving first (in lexicographic order), then the other edges (in lexicographic order). The goal of the algorithm is to make one-pass over the stream, and with probability at least , output at least distinct edges in .
Consider any streaming algorithm that solves the above problem. As we study on a specific distribution of inputs (specifically, the distribution ), we can assume without loss of generality that is deterministic. Let the random variable denoting the memory state of the algorithm after it has processed the -edges but before processing the other edges and let be the random variable denoting the output of . As is deterministic, observe that is a function of and that is a function of . We abbreviate events of the form to just , to just , and so on.
Lemma 21.
It holds that:
Proof.
Let be an indicator random variable that is if and only if the algorithm is correct on input (which determines and is determined by ). We have:
| (Corollary 9) | ||||
| (Lemma 8) | ||||
| (Definition 4) | ||||
We now use the upper bound for the entropy in Lemma 8. For the terms corresponding to , we use Lemma 20 with to get for all . For the terms corresponding to , we use the fact that and together determine the output of the algorithm, which when , determines of the edges in . By Lemma 20, this means that conditioned on , , , the random variable is supported on a set of size at most . We get:
| (As ) | ||||
| (As and ) |
Combining Lemma 21 with Corollary 11, we get:
Using the fact that and simplifying, we get:
From Corollary 9 and the fact that , this means that:
Now, from Equation 1 and as , we get:
As is upper bounded by the logarithm of the size of the support of , we have that the algorithm must use at least bits of memory.
5.2 Finishing the Proof
To finish the argument, we prove that the streaming (that is a random-walk algorithm) implies a streaming algorithm that solves our search problem from Section 5.1 with probability at least . For this, recall that any graph in the support of lies in . This means that we can run the algorithm (with the start vertex being an arbitrary vertex and the order of edges being the same as in the search problem) and it will sample an -step random walk from with distortion at most . Let be the random variable denoting the path, written as a sequence of vertices, taken by an -step random walk from .
Lemma 22.
For all graphs , which we can equivalently view as a pair , it holds that:
Proof.
Note that when the walk is at any vertex , the edge from will either go to another vertex in (when it is a fixed edge added in Item 1 in Section 5.1) or it will go to a vertex in (when it is a -edge added in Item 3). In the latter case, either the second edge will go to a vertex in (a fixed edge) or it will go to a vertex in (when it is an -edge added in Item 2). If it does go to a vertex in , the third edge has to be fixed and take it to a vertex in . Thus, at least one of any three consecutive vertices in must lie in .
In particular, if we break the random walk into chunks of length , then each such chunk will contain at least two vertices in within it. For all , we define to be random variables denoting the positions (these are values in the interval ) of the first two vertices in in chunk . As by Lemma 19, and for all , we have by the above discussion that, for all :
Because of the fact that is a random walk, the same bound holds even when conditioned on . This means that we can find mutually independent indicator random variables , jointly distributed with such that for all , we have and with probability . From Lemma 2, it follows that:
as and is large enough.
For the rest of this proof, we condition on values of such that the above event does not occur, and also condition on all the vertices visited in the walk except the vertices . Observe that, under this conditioning, for all such that , the edge is an independent and uniformly random edge in . As the total number of edges in is and these edges constitute at least independent uniformly random samples from it, the number of distinct samples will be at least except with probability at most . A union bound finishes the proof.
As for any graph , the algorithm samples from with distortion at most , we have from Lemma 22 that the algorithm solves the search problem with at least probability . From the result proved in Section 5.1, we get that the memory needed by the after processing the -edges is at least . As the -edges are always a subset of the edges in the graph , there is a streaming order of the edges of such that these are the edges that the algorithm sees first when it is run on . It follows that needs memory at least when run on and the proof of Theorem 16 is complete.
References
- [1] Peyman Afshani, Jérémy Barbay, and Timothy M. Chan. Instance-optimal geometric algorithms. J. ACM, 64(1):3:1–3:38, 2017. doi:10.1145/3046673.
- [2] Reid Andersen, Fan Chung, and Kevin Lang. Using pagerank to locally partition a graph. Internet Mathematics, 4(1):35–64, 2007. doi:10.1080/15427951.2007.10129139.
- [3] Reid Andersen and Yuval Peres. Finding sparse cuts locally using evolving sets. In Symposium on Theory of computing (STOC), pages 235–244, 2009. doi:10.1145/1536414.1536449.
- [4] Aditya Bhaskara, Samira Daruki, and Suresh Venkatasubramanian. Sublinear algorithms for MAXCUT and correlation clustering. In International Colloquium on Automata, Languages, and Programming (ICALP), volume 107 of LIPIcs, pages 16:1–16:14, 2018. doi:10.4230/LIPIcs.ICALP.2018.16.
- [5] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. Computer networks and ISDN systems, 30(1-7):107–117, 1998. doi:10.1016/S0169-7552(98)00110-X.
- [6] Moses Charikar, Liadan O’Callaghan, and Rina Panigrahy. Better streaming algorithms for clustering problems. In Symposium on Theory of computing (STOC), pages 30–39, 2003. doi:10.1145/780542.780548.
- [7] Lijie Chen, Gillat Kol, Dmitry Paramonov, Raghuvansh R. Saxena, Zhao Song, and Huacheng Yu. Near-optimal two-pass streaming algorithm for sampling random walks over directed graphs. In International Colloquium on Automata, Languages, and Programming (ICALP), volume 198, pages 52:1–52:19, 2021. doi:10.4230/LIPIcs.ICALP.2021.52.
- [8] Ashish Chiplunkar, John Kallaugher, Michael Kapralov, and Eric Price. Factorial lower bounds for (almost) random order streams. In 63rd IEEE Annual Symposium on Foundations of Computer Science, FOCS 2022, Denver, CO, USA, October 31 - November 3, 2022, pages 486–497. IEEE, 2022. doi:10.1109/FOCS54457.2022.00053.
- [9] Ronald Fagin, Amnon Lotem, and Moni Naor. Optimal aggregation algorithms for middleware. J. Comput. Syst. Sci., 66(4):614–656, 2003. doi:10.1016/S0022-0000(03)00026-6.
- [10] Juan A. Garay, Shay Kutten, and David Peleg. A sublinear time distributed algorithm for minimum-weight spanning trees. SIAM journal on computing, 27(1):302–316, 1998. doi:10.1137/S0097539794261118.
- [11] Oded Goldreich and Luca Trevisan. Three theorems regarding testing graph properties. Random Struct. Algorithms, 23(1):23–57, 2003. doi:10.1002/rsa.10078.
- [12] Mark Jerrum and Alistair Sinclair. Approximating the permanent. SIAM journal on computing, 18(6):1149–1178, 1989. doi:10.1137/0218077.
- [13] Mark R Jerrum, Leslie G Valiant, and Vijay V Vazirani. Random generation of combinatorial structures from a uniform distribution. Theoretical computer science, 43:169–188, 1986. doi:10.1016/0304-3975(86)90174-X.
- [14] Ce Jin. Simulating random walks on graphs in the streaming model. In Avrim Blum, editor, Innovations in Theoretical Computer Science Conference (ITCS), volume 124 of LIPIcs, pages 46:1–46:15, 2019. doi:10.4230/LIPIcs.ITCS.2019.46.
- [15] Hossein Jowhari, Mert Saglam, and Gábor Tardos. Tight bounds for lp samplers, finding duplicates in streams, and related problems. In Symposium on Principles of Database Systems (PODS), pages 49–58, 2011. doi:10.1145/1989284.1989289.
- [16] John Kallaugher, Michael Kapralov, and Eric Price. Simulating random walks in random streams. In Symposium on Discrete Algorithms (SODA), pages 3091–3126, 2022. doi:10.1137/1.9781611977073.120.
- [17] Omer Reingold. Undirected connectivity in log-space. Journal of the ACM (JACM), 55(4):1–24, 2008. doi:10.1145/1391289.1391291.
- [18] Tim Roughgarden, editor. Beyond the Worst-Case Analysis of Algorithms. Cambridge University Press, 2020. doi:10.1017/9781108637435.
- [19] Atish Das Sarma, Sreenivas Gollapudi, and Rina Panigrahy. Estimating pagerank on graph streams. J. ACM, 58(3):13:1–13:19, 2011. doi:10.1145/1970392.1970397.
- [20] Aaron Schild. An almost-linear time algorithm for uniform random spanning tree generation. In Symposium on Theory of Computing (STOC), pages 214–227, 2018. doi:10.1145/3188745.3188852.
- [21] Daniel A Spielman and Shang-Hua Teng. A local clustering algorithm for massive graphs and its application to nearly linear time graph partitioning. SIAM Journal on computing, 42(1):1–26, 2013. doi:10.1137/080744888.
- [22] Gregory Valiant and Paul Valiant. An automatic inequality prover and instance optimal identity testing. SIAM J. Comput., 46(1):429–455, 2017. doi:10.1137/151002526.
- [23] Jeffrey Scott Vitter. Random sampling with a reservoir. ACM Trans. Math. Softw., 11(1):37–57, 1985. doi:10.1145/3147.3165.
