On Approximating the Stationary Distribution of Time-Reversible Markov Chains

Approximating the stationary probability of a state in a Markov chain through Markov chain Monte Carlo techniques is, in general, inefficient. Standard random walk approaches require Õ(τ/π(v))\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\tilde {O}(\tau /\pi (v))$\end{document} operations to approximate the probability π(v) of a state v in a chain with mixing time τ, and even the best available techniques still have complexity Õ(τ1.5/π(v)0.5)\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$\tilde {O}(\tau ^{1.5}/\pi (v)^{0.5})$\end{document}; and since these complexities depend inversely on π(v), they can grow beyond any bound in the size of the chain or in its mixing time. In this paper we show that, for time-reversible Markov chains, there exists a simple randomized approximation algorithm that breaks this “small-π(v) barrier”.


Introduction
We investigate the problem of approximating efficiently a single entry of the stationary distribution of an ergodic Markov chain. This problem has two main motivations. First, with the advent of massive-scale data, even complexities linear in the size of the input are often excessive [21]; therefore computing explicitly the entire stationary distribution, e.g. via the power method [11], can be simply infeasible. As an alternative one can then resort to approximating only individual entries of the vector, in exchange for a much lower computational complexity [14,22]. In fact, if such a complexity is low enough one could efficiently "sketch" the whole vector by quickly getting a fair estimate of its entries. Second, in many practical cases one is really interested in just a few entries at a time. A classic example is that of network centralities, many of which are stationary distributions of an ergodic Markov chain [4]. Indeed, the problem of approximating the Personalized PageRank score of a few nodes in a graph has been repeatedly addressed in the past [5,6,16,17].
In this paper we seek efficient algorithms for approximating the stationary probability π(v) of a target state v in the state space of a discrete-time ergodic Markov chain. Along with the motivations above, the problem arises in estimating heat kernels and graph diffusions, testing the conductance of graphs and chains, developing local algorithms, and has applications in machine learning; see [3,13] for a thorough discussion. We adopt a simple model where with a single operation one can either (i) simulate one step of the chain or (ii) retrieve the transition probability between a pair of states. Although recent research has provided encouraging results, existing algorithms suffer from a crucial bottleneck: to guarantee a small relative error in the approximation of π(v), they incur a cost that grows with 1/π(v) itself (basically because estimating π(v) via repeated sampling requires 1/π(v) samples). This is a crucial issue since in general there is no lower bound on π(v); even worse, if the state space has n states, then most states have mass π(v) = O( 1 n ), and one can easily design chains where they have mass exponentially small in n. In general, then, the cost of existing algorithms can blow up far beyond O(n) for almost all input states v. It is thus natural to ask if the dependence of the complexity on π(v) is unavoidable. Unfortunately, one can easily show that (τ/π(v)) operations can be necessary to estimate π(v) within any constant multiplicative factor if one makes no assumption on the chain (see Section 6). To drop below this complexity barrier one must then necessarily look at special classes of Markov chains.
We present an algorithm that breaks this "small-π(v) barrier" for time-reversible Markov chains. Time-reversible chains are a well-known subclass of Markov chains which lie at the heart of the celebrated Metropolis-Hastings algorithm [12] and are equivalent to random walks on weighted undirected graphs [15]. Formally, given any , δ > 0 and any state v in a time-reversible chain, our algorithm with probability 1 − δ returns a multiplicative (1 ± )-approximation of π(v) by usingÕ(τ π −1 ) operations, where τ is the mixing time of the chain, · is the Euclidean norm and O(·) hides polynomials in −1 , ln(δ −1 ), ln( π −1 ). The complexity is independent of π(v), and for all but a vanishing fraction of states in the chain improves by factors at least √ n or √ τ over previous algorithms. The heart of our algorithm is a randomized scheme for approximating the sum of a nonnegative vector by sampling its entries with probability proportional to their values. This scheme requires O( π −1 ) samples if π is the distribution over the vector entries, which generalizes theÕ( √ n) algorithm of [18] and is provably optimal. We prove that our algorithm for estimating π(v) is essentially optimal as a function of τ , n and π ; in fact one cannot do better even under a stronger computational model where all transition probabilities to/from all visited states are known. Along with running time bounds, that are our primary goals, we also provide trade-offs between the time and space complexity of our algorithms, as well as bounds on the number of distinct chain states they explore. All our algorithms are simple to implement, require no tuning, and experimentally they appear faster than existing alternatives already for medium-sized chains.
The rest of the paper is organized as follows. Section 1.1 pins down definitions and notation; Section 1.2 formalizes the problem; Section 1.3 discusses related work; Section 1.4 summarizes our results. Section 2 presents our vector sum approximation algorithm. Section 3 presents our approximation algorithm for π(v). Section 4 discusses time-space trade-offs for our algorithm. Section 5 analyzes the number of distinct states it visits. Section 6 provides the proofs of our lower bounds. Finally, Section 7 describes our experimental results.

Preliminaries
A discrete-time, finite-state Markov chain is a sequence of random variables X 0 , X 1 , . . . taking value over a set of states V = {1, . . . , n}, such that for all i ≥ 1 and all u 0 , . . . , u i ∈ V with Pr(X 0 = u 0 , . . . , We assume the chain is ergodic, and thus has a limit distribution that is independent from the distribution of X 0 ; the limit distribution then coincides with the stationary distribution π . Thus π is the unique distribution vector such that for any distribution vector π 0 : We denote by π(u) the stationary probability, or mass, of u, and we always denote by v the target state whose mass is to be estimated. For any V ⊆ V we let π(V ) denote u∈V π(u). We also assume the chain is time-reversible, i.e. that for any pair of states u and u we have: We denote by τ the standard 1 4 -mixing time of the chain. In words, τ is the smallest integer such that after τ steps the total variation distance between π and the distribution of X τ is bounded by 1 4 , irrespective of the initial distribution. Formally, After τ steps, the distribution of X t converges to π exponentially fast; that is, if t = ητ with η ≥ 1, then π 0 P t − π TV ≤ 2 −η . In the rest of the paper, · always denotes the 2 norm. Note that 1 √ n ≤ π ≤ 1 since 1 √ n x 1 ≤ x ≤ x 1 for any x ∈ R n . One may refer to [15] for a detailed explanation of the notions recalled here.
Unless necessary, we drop multiplicative factors depending only on , δ (see below) from the asymptotic complexity notation. Furthermore, we use the tilde notation to hide polylogarithmic factors, i.e. we denote O(f ·poly(log(f ))) byÕ(f ).

Problem Formulation
We consider a discrete-time, finite-state, time-reversible, ergodic Markov chain on n states. The chain is initially unknown and can only be accessed via two operations (also called queries): step(): accepts in input a state u, and returns state u with probability p uu probe(): accepts in input an ordered pair of states (u, u ), and returns p uu These queries are the de facto model of previous work. step() is used in [3,5,6,13,16,17] to simulate the walk, assuming each step costs O (1). probe() is used in [3,16,17] to access the elements of the transition matrix, assuming again one access costs O (1). Here, too, we assume step() and probe() as well as all standard operations (arithmetic instructions, memory accesses, . . . ) cost O(1). This includes set insertion and set membership testing; in case their complexity is ω(1), our bounds can be adapted correspondingly. The problem can now be formalized as follows. The algorithm is given in input a triple (v, , δ) where v is a state in the state space of the chain and , δ are two reals in (0, 1). It must output a valueπ(v) such that, with probability 1 − δ, it holds (1 − )π(v) ≤π(v) ≤ (1 + )π(v). The complexity of the algorithm is counted by the total number of operations it performs. Obviously we seek an algorithm of minimal complexity.
A final remark. We say state u has been visited if u = v or if u has been returned by a step() call. In line with previous work, we adopt the following "locality" constraint: the algorithm can invoke probe() and step() only on visited states.

Related Work
Two recent works address precisely the problem of estimating π(v) in Markov chains. The key differences with our paper are that they consider general (i.e. not necessarily time-reversible) chains, and that we aim at a small relative error for any π(v) and not only for large π(v).
The first work, [13], gives a local approximation algorithm for π(v) based on estimating return times via truncated random walks. Given any > 0, if π(v) ≥ the algorithm with probability 1 − δ outputs a multiplicative Z(v)-approximation of π(v), where Z(v) is a "local mixing time" that depends on the structure of the chain. The cost isÕ(ln(1/δ)/ 3 ) step() calls. If one wants a multiplicative (1 ± )approximation of π(v) for a generic v, the cost becomesÕ(τ/π(v)) step() calls since one must wait for the walks to hit v after having mixed.
The second work, [3], gives an algorithm to approximate the -step transition probability p uv from a state u to a state v, based on coupling a local exploration of the transition matrix P with simulated random walks. Let d be the average number of neighbors of the states in the chain. If v is drawn uniformly at random from the chain state space, and if p uv ≥ for some > 0, then with probability 1 − δ the algorithm gives a multiplicative (1 ± )-approximation of p uv perform-ingÕ( 1.5 √ d ln(1/δ) / 0.5 ) calls to step() and probe() in expectation. To estimate π(v) for a given v, one must then set = τ and = π(v); and since if the chain is irreducible then d = (1), the bound stays atÕ(τ 1.5 /π(v) 0.5 ). Note that this does not contradict our (τ/π(v)) lower bound for general Markov chains (see Section 6.2), which is worst-case rather than average-case.
The algorithms cited above require eitherÕ(τ/π(v)) orÕ(τ 1.5 /π(v) 0.5 ) step() and probe() calls to ensure a (1 ± )-approximation of π(v) for a generic state v. Note also that their complexity and approximation guarantees depend on knowledge of τ ; our algorithms are no exception, and we prove our bounds as a function of τ .
For the sake of completeness we mention a few interesting results that have been obtained for specific Markov chains, and in particular for PageRank. We remark that these results adopt graph access models based on adjacency lists or adjacency matrix representations, which are different from the step() and probe() model of this paper; and they also exploit the specific structure of PageRank itself, including the fact that for PageRank π(v) = (1/n) and τ = O(1) in an n-node graph. The algorithm of [5,6] approximates the PageRank π(v) of the nodes v having π(v) ≥ , at the cost ofÕ(1/ ) step() calls; if one desires a multiplicative (1 ± )-approximation of π(v), the cost becomesÕ(1/π(v)). The results of [7,8] show how to obtain a (1 ± )approximation of π(v) for any given v by accessing onlyÕ(n 4/5 d 1/5 ) nodes and arcs of the graph, where d is the average degree of the graph; for nodes with a small π(v) this bound is o(1/π(v)) unless the graph is dense. For Personalized PageRank, [17] gives an algorithm for estimating π(v) using techniques similar to [3]; if one aims at a multiplicative (1 ± )-approximation of π(v), if v is drawn uniformly at random from the graph, then the algorithm makesÕ(d 0.5 /π(v) 0.5 ) step() and probe() calls in expectation. Similar bounds have been ported to the worst case for Personalized PageRank on undirected graphs -see [16].
Finally, for the problem of estimating the sum of a nonnegative n-entry vector x by sampling its entries x i with probability π i = x i / i x i , the only algorithm existing to date is that of [18]. That algorithm takesÕ( √ n) samples independently of π , while ours needs O( √ n) samples only in the worst case, i.e. if π is (essentially) the uniform distribution.

Our Results
Our first contribution is SUMAPPROX, a randomized algorithm for estimating the sum γ of a nonnegative vector x, assuming one can sample its entries according to the probability distribution π = x/γ . Formally, we prove: Theorem 1 Given any , δ ∈ (0, 1), SUMAPPROX( , δ) with probability at least 1 − δ returns a multiplicative (1 ± )-approximation of γ by taking SUMAPPROX is extremely simple, yet it improves on the state-of-the-artÕ( √ n) algorithm of [18]. We prove π −1 samples are necessary, too, to get a fair estimate of γ .
We then employ SUMAPPROX to build MASSAPPROX, a randomized algorithm for approximating π(v). Random-walk-based sampling and time reversibility are the ingredients that allow one to make the connection. We prove:

elementary operations and calls to step() and probe().
Algorithms previously presented in the literature work also for non-reversible chains; but on the n − o(n) states with mass π(v) = O(1/n), their complexity becomes at leastÕ(τ n) [13] orÕ(τ 1.5 √ n) [3]. In fact, π(v) can be arbitrarily small (even exponentially small in n and τ ) for almost all states in the chain, so for almost all states the complexity of previous algorithms explodes while that of MASSAPPROX remains unchanged: since π −1 ≤ √ n for any π , the complexity of MASSAPPROX is at mostÕ(τ √ n). Next, we show that MASSAPPROX is optimal as a function of τ , n and π , up to small factors. In fact no algorithm can perform better even if, instead of using step() and probe(), it is equipped with a more powerful query neigh() that, invoked on a state u, reveals all incoming and outgoing transition probabilities of u. Formally, we prove: there is a family of timereversible chains on n states where (a) π = (ν(n)), and (b) there is a target state v such that, to estimate its mass π(v) within any constant multiplicative factor with constant probability, any algorithm needs (τ π −1 / ln n) neigh() calls where τ is the mixing time of the chain.
We then turn to bound the space requirements of our algorithms. The naive space bounds follow from the running time bounds of Theorem 1, which depend on π −1 and can grow as fast as √ n. This may be overly expensive if n is too large. To this end we show how one can reduce the space used by SUMAPPROX and MASSAPPROX by any given factor, if one is willing to increase the running time correspondingly. For example, one can ensure MASSAPPROX uses just O(1) space, for a running time ofÕ(τ π −2 ). In addition, we prove no algorithm can achieve a substantially better trade-off.
Next, we analyze the footprint of our algorithm MASSAPPROX, i.e. the number of distinct states it visits. This is useful if visiting a new state is expensive (e.g. if states are users in a social network). We develop an alternative algorithm, FULLMAS-SAPPROX, that has the same worst-case running time guarantees as MASSAPPROX, but whose footprint can be smaller than that of MASSAPPROX depending on the structure of the Markov chain (and at the expense of space). For example, on some chains FULLMASSAPPROX provably visits only roughly √ τ n distinct states, whereas MASSAPPROX visits τ √ n ones. In addition, FULLMASSAPPROX can have a smaller running time than MASSAPPROX, and in fact in our experiments this is always the case.
Finally, we provide an experimental evaluation of MASSAPPROX and FULLMAS-SAPPROX by running them on two synthetic Markov chains against the competing state-of-the-art algorithms of [13] and [3]. For the same target accuracy, our algorithms are always faster in terms of number of queries used. In fact, for all ≤ 1 FULLMASSAPPROX is at least two orders of magnitude faster than [13] and [3] in all our experiments.

Estimating Sum by Weighted Sampling
As anticipated in Section 1, and as we show formally in Section 3, the problem of estimating the stationary probability π(v) of v can be reduced to the following problem. We are given a vector of nonnegative reals γ u indexed by the elements u of a set V . The vector is unknown, including its dimension, but we can access its entries via weighted sampling. More precisely, we can draw pairs (u, γ u ) according to the distribution π where (u, γ u ) has probability γ u / u∈V γ u . The goal is to approximate the vector sum γ = u∈V γ u . Note that a sample (u, γ u ) reveals the index of an entry together with its value. We describe here below a simple randomized algorithm, SUMAPPROX, which estimates the sum γ by drawing samples until witnessing a certain number of repeats (a sample is a repeat if it returns a pair already sampled before).
The idea behind SUMAPPROX is the following. Each time a new element is to be drawn, the probability of a repeat is given by the sum of the probabilities of all the distinct elements drawn so far; this sum is w S /γ , where γ is unknown -indeed, it is the quantity to be determined. Now, consider a sequence of binary random variables χ i , i = 0, 1, . . . the (i + 1)-th of which assumes value 1 if and only if the (i + 1)-th draw is a repeat. The expectation of χ i coincides with the probability that χ i assumes value 1, and is thus given by the value assumed by w S /γ immediately before the (i + 1)-th execution of line 8. SUMAPPROX keeps track of two key quantities. The first is the numerator of the sum of the expectations of the χ i , that is, the sum of the values of w S over all iterations so far; this quantity is stored in the variable w. The second is the sum of the realizations of the χ i , which is simply the number of repeats witnessed so far; this is stored in the variable r. If the sum of the realizations of the χ i concentrates around its own expectation, as we prove, then r is close to w/γ , and thus w/r is an accurate estimate of γ .
We prove the following guarantees: We remark that the previous existing algorithm for the sum estimation problem [18] needs knowledge of n = |V | and uses O( √ n −7/2 log n (log 1 δ + log 1 + log log n)) = O( √ n log n log log n) samples. SUMAPPROX is simpler, oblivious to n, and its running time bounds are tighter -they save a factor that ranges from log n log log n when π is (essentially) the uniform distribution to √ n log n log log n when π is heavily skewed. In fact, we show that SUMAPPROX is optimal up to constant factors, by proving that ( π −1 ) samples are in general necessary to estimate γ even if n is known. This extends to arbitrary distributions the ( √ n) lower bound given by [18] for the uniform distribution.

Theorem 6 For any function
samples are necessary to estimate γ within constant multiplicative factors with constant probability, even if n is known.
The rest of this section is devoted to the proofs.

Proof of Theorem 4
We make use of a martingale tail inequality originally from [10] and stated (and proved) in the following form as Theorem 2.2 of [1], p. 1476: Let us plug into Theorem 7 the appropriate quantities from SUMAPPROX: -Let (X i , γ X i ) be the random pair drawn at the (i + 1) th execution of line 8.
-Let χ i be the indicator variable of the event that (X i , γ X i ) is a repeat, that is, Finally, note that Var(Z j |F j −1 ) = Var(χ j |F j −1 ) (as Z j = Z j −1 + χ j − P j and, again, Z j −1 and P j are completely determined by X 0 , . . . , X j −1 ). Since Theorem 7 then yields the following: Recall now SUMAPPROX. Note that i j =1 P j and Z i are respectively the value of w γ and of r − w γ just after the while loop has been executed for the (i + 1)-th time. Note also that, when SUMAPPROX returns, r = k ,δ . Therefore the event that, when SUMAPPROX returns, w which is smaller than δ/3 since clearly k ,δ > 2 2 ln 3 δ . Consider instead the event that, when SUMAPPROX returns, w r ≥ γ (1 + ) i.e. w γ ≥ r(1 + ) = k ,δ (1 + ). This is the event that Z i ≤ − k ,δ , or equivalently −Z i ≥ k ,δ . Note that Lemma 1 still holds if we replace Z i with −Z i , as (−Z i ) i≥0 too is obviously a martingale with respect to the filter (F i ) i≥0 , with −Z 0 = 0. Let then i 0 ≤ i be the smallest Invoking again Lemma 1 with z = k ,δ and v = (1 + )k ,δ + 1, we obtain: Note that 1 k ,δ < 2 2+4.4 < 0.2 since ≤ 1; so 2((1 + 2 ) + 1 k ,δ ) < 2 + 4.4 , and since k ,δ ≥ 2+4.4 2 ln 3 δ the right-hand term is at most δ 3 . Finally, by a simple union bound the probability that |γ − γ | ≥ γ is at most 2 δ 3 , and the proof of Theorem 4 is complete.
The proof of Theorem 5 is complete.

Proof of Theorem 6
Let k ∈ (ν(n) −2 ) with 1 ≤ k ≤ n 2 . Consider the two vectors x = (γ 1 , . . . , γ n ) and x = (γ 1 , . . . , γ n ) defined as follows: Now let γ = n i=1 γ i and γ = n i=1 γ i . One can check that γ ≤ 2k and |γ − γ | ≥ k 2 . Hence, to obtain an estimateγ of γ withγ ≤ 5 4 γ , one must distinguish x from x . Note that the norm of π = x/γ is in (1/ √ k) = (ν(n)), as requested. Now, for each one of x and x in turn, pick a permutation of {1, . . . , n} uniformly at random and apply it to the entries of the vector. Suppose then we sample o( π −1 ) = o( √ k) entries from x. We shall see that, with probability 1 − o(1), we cannot distinguish x from x . First, note that the total mass of the entries with value √ k/n is at most 1/ √ k. Hence the probability of drawing any of those entries with o( √ k) samples is o(1), and we can assume all draws yield entries having value 1. Since there are O(k) such entries in total, the probability of witnessing any repeat is also o(1), and we can assume no repeat is witnessed. Furthermore, because of the random permutation, the indices of samples are distributed uniformly over {1, . . . , n} (recall that we actually sample from the index set {1, . . . , n}, so we could use the distribution of the indices to distinguish x from x ). The same argument applies to x , so drawing o( √ k) samples from x yields exactly the same distribution and the two vectors are indistinguishable. To adapt the construction to larger approximation factors, set γ j = 1 : j ≤ ηk for η large enough.
The proof of Theorem 6 is complete.

Approximating the Stationary Distribution
In this section we address the problem of approximating π(v). We start in Section 3.1, where we reduce this problem to the sum estimation problem of Section 2 by sampling states via random walks. In Section 3.2 we will then be able to prove Theorem 2, presenting our algorithm MASSAPPROX for estimating π(v).

From Estimating the Sum to Estimating the Mass
Consider a random walk of length t + 1 that starts at v. Obviously we can simulate such a walk by setting u 0 = v and then invoking step(u i ) to obtain the state u i+1 , for i = 0, . . . , t − 1. Crucially, using the time-reversibility of the chain, for any visited state u we can obtain the ratio γ u between π(u) and π(v) using O(1) operations. Formally, let γ v = π(v)/π(v) = 1, and in general let γ u = π(u)/π(v). Note that: The time-reversibility of the chain (see (2)) implies π(u i+1 ) probe(u i+1 ,u i ) , hence we can compute γ u i+1 with O(1) operations if we know γ u i . But then we can keep track of γ u for any u visited so far, starting with γ u 0 = 1 and computing γ u i+1 by (10) the first time u i+1 is visited.
Suppose now to pick t large enough so that the chain reaches its stationary distribution, i.e. u t ∼ π irrespective of v. One is then drawing state u, as well as its associate weight γ u , with probability π(u). Now if we let γ = u∈V γ u , then π(u) = γ u γ −1 and in particular π(v) = γ −1 . Therefore approximating π(v) amounts to approximating γ ; more formally, for any ∈ (0, 1), ifγ is a (1 ± 2 )-approximation of γ thenγ −1 is a (1 ± )-approximation of π(v). We can therefore reduce to the sum approximation problem of Section 3: compute with probability (1 − δ) a (1 ± )approximation of γ , assuming we can draw pairs (u, γ u ) according to π . The only problem is that by simulating the chain we can only come close to (but not exactly on) the stationary distribution π . We must then tie the approximation guarantees of SUMAPPROX to the length t of the random walks, or better to the distance ||π −π || TV between π and the distribution π from which u t is drawn. Formally, we show:
Proof Let us start with the bound on the number of samples. Recall the proof of Theorem 5, and note that the argument depends on π but not on the values γ u . Indeed, π alone determines the probability of repeats and thus controls the distribution of the number of samples drawn by SUMAPPROX. Hence, by Theorem 5 SUMAP-PROX( 2 , δ) takes more than 45 π −1 8 −3 (ln 3 δ ) 3/2 = 360 π −1 −3 (ln 3 δ ) 3/2 samples with probability less than δ 3 . Now π −π ≤ 2||π −π || TV ≤ 2 π ln(3/δ) c ≤ π 2(ln 3) −c , which for c ≥ 15 is bounded by 1 2 π . Then, since π ≥ π − π − π , we have π −1 ≤ 2 π −1 and the bound above is in turn bounded by 720 π −1 −3 (ln 3 δ ) 3/2 . Let us now see the approximation guarantees. Recall the proof of Theorem 4. We want to show again that Pr[| w(s) r − γ | ≥ 2 γ ] ≤ 2δ 3 . However, now the samples are drawn according to π instead of π . Let then P j = u∈∪ j −1 h=0 {X h } π (u) and Z i = i j =0 (χ j − P j ); in a nutshell, P j and Z i are the analogous of P j and Z i under π . It is immediate to check that Lemma 1 holds with Z i and P j in place of Z i and P j .
Let now w (i) = γ i j =1 P j . Note that i j =1 P j and Z i are respectively the value of w (i) γ and of r − w (i) γ just after line 9 has been executed for the (i + 1)-th time. Therefore, the argument following Lemma 1 holds if we put w (i) in place of the value taken by w after the (i + 1)-th execution of line 9. Hence the same bounds hold, and SUMAPPROX( 2 , δ) ensures Pr[| w (s) r − γ | ≥ 2 γ ] ≤ δ 3 where s is the total number of draws. Now note that SUMAPPROX does not return w (s) r , but w(s) r where w(i) = γ i j =1 P j is the value of w in SUMAPPROX after line 9 has been executed for the (i + 1)-th time. We shall now make | w(s) r − w (s) r | ≤ 2 γ ; by the triangle inequality we will then be done. First of all, by the definition of w(s) and w (s) we have Now note that |P j − P j | ≤ π − π TV , since P j and P j are the probability of the same event under respectively π and π . Therefore the right-hand side of (11) is bounded by γ r −1 s ||π − π || TV . Now, when SUMAPPROX( 2 , δ) terminates r = k 2 ,δ ≥ 4 2+2.2 2 ln 3 δ , and by hypothesis ||π − π || TV ≤ π ln(3/δ) c . Therefore: Finally, recall that with probability 1 − 3 δ we have s ≤ 720 π −1 −3 (ln 3 δ ) 3/2 . In this case the equation above yields | w(s) r − w (s) r | ≤ γ · 721 c ln( 3 δ ) 0.5−c , which is smaller than 2 γ for c ≥ 1 2 + ln 1442 ln ln 3 ≈ 78. A simple union bound completes the proof.

Proof of Theorem 2
We are now ready to introduce our algorithm MASSAPPROX and prove Theorem 2. MASSAPPROX, whose pseudocode is listed below, is obtained from SUMAPPROX by simply drawing the pair (u, γ u ) by performing t steps of the random walk starting from v. We shall now show that choosing an appropriate t yields Theorem 2.
Let t = τ c ln( π −1 −1 ln 3 δ )/ ln 2, where c < 80 is the constant of Lemma 3 and τ is the mixing time of the chain. Simulate the walk for t steps starting from v, and let π be the distribution of the final state. By the properties of the mixing time (see Section 1.1): (13) and therefore by Lemma 3 we obtain a (1 ± ) approximation of γ . Note that for every ∈ (0, 1) a (1 ± ) approximation of π(v) is given by a (1 ± 2 ) approximation of γ (see above). The total number of operations performed is clearly bounded by t = τ c ln( π −1 −1 ln 3 δ )/ ln 2 times the number of samples taken by SUMAPPROX, and by substituting this value in the bound of Theorem 1 we obtain Theorem 2.

Space Complexity Bounds
In the previous sections we focused on the running time of our algorithms, ignoring their space requirements. It is straightforward to see that SUMAPPROX and MAS-SAPPROX need (|S|) space, where S is the set of distinct elements/states drawn, assuming every element/state can be encoded with (1) bits. Unfortunately, in general |S| grows with π −1 (see Theorem 1 and Theorem 6), and thus in the worst case with √ n. Since we are interested especially in cases where n is very large, this dependence might be problematic. It is thus natural to ask if we can reduce the space complexity. We show that one can reduce the space complexity of our algorithms by a factor α > 1 while correspondingly increasing the running time, and that no algorithm can perform significantly better. Formally, we prove: Theorem 5). Choose any integer κ ≥ 1, and let α = min(1, κ/s). Then one can adapt SUMAPPROX( , δ) so that at any point |S| ≤ κ and that, with probability at least 1 − δ 3 , it draws at most s/α samples. Furthermore, if one can store at most κ = o( π −1 ) distinct elements, then in the worst case ( π −2 /κ) samples are needed to estimate γ within constant multiplicative factors with constant probability.
Proof For the upper bound we adapt SUMAPPROX( , δ) so that, at any point, rather than all the |S| distinct entries drawn so far, it stores only the min(|S|, κ) entries with largest weights. Let S still denote the set of all distinct entries drawn by the algorithm, and let S ⊆ S be the subset of elements that are stored. In SUMAPPROX( , δ) we replace S by S , and replace lines 12-13 with the following lines (u 1 , . . . , u i denote the elements in S , ordered so that π(u 1 ) ≥ . . . ≥ π(u i )): Now recall the proof of Theorem 5. If case 1 holds, then we consider in particular u = arg max u∈V π(u). Obviously u ∈ S since the first time u is drawn; the whole argument then holds unchanged, and the claim follows since s ≤ s/α. If case 2 holds, then note that by construction afters draws we have |S | ≥ min(|S|, κ). Moreover, sinces ≤ s and α ≤ κ/s, then κ ≥ αs ≥ αs, and since clearlys ≥ |S|, then κ ≥ α|S|. Thus afters draws |S | ≥ α|S|, and therefore by construction its mass w (s) = Let us move to the lower bound. Recall the proof of Theorem 6. Any entry of x or x has probability O( 1 k ) of being drawn. Thus, if one stores at most κ elements, at any point in time the probability of the next draw yielding a repeat is bounded from above by O( κ k ). Therefore the expected number of draws to the first repeat is ( k κ ) = ( π −2 /κ). The rest of the argument follows.

Footprint Bounds
We have so far analysed the running time and space requirements of MASSAPPROX. There is however one last performance measure we are interested in: the footprint, i.e. the total number of distinct states visited by the algorithm along the walk. This is a sensible measure when walking through a state is expensive only the first time that state is visited, e.g. in random walks over social graphs, where upon visiting a new user one needs to retrieve all its neighbors, but can then cache them for later reuse. The footprint of MASSAPPROX is trivially bounded by its running time, which in turn is bounded by τ π −1 (see Theorem 2), which can be as large as τ √ n. In this section we describe FULLMASSAPPROX, a variant of MASSAPPROX that under certain conditions has a provably smaller footprint at the expense of memory requirements. Moreover, FULLMASSAPPROX is potentially faster (in fact, in our experiments it is always much faster than MASSAPPROX).
We obtain FULLMASSAPPROX by making two modifications to MASSAPPROX. First, instead of performing a new walk of length t from v for each sample, we perform a single long random walk of length T and take one sample every t steps. The correctness guarantees do not change, since although the samples do not come all from the same distribution, they are still drawn from a distribution sufficiently close to π . Second, after checking if the current draw yields a repeat, we include in the set S not only the draw but also all other states visited so far. This does not affect the guarantees, since (crucially) we do not need the set S to be built on independent samples; but it makes the mass of S grow potentially faster, which can lead to more repeats and thus a lower number of samples. The pseudocode of FULLMASSAPPROX is given below.
Let now N v,T be the number of distinct states visited by a random walk of T steps starting from v. This is in other words the number of distinct states visited by FULLMASSAPPROX over T steps. We show that, if N v,T gets sufficiently concentrated around its expectation, then FULLMASSAPPROX terminates when N v,T itself is smaller than τ √ n. The concentration of N v,T is expressed by some functionτ of the , then the footprint of FULLMASS-APPROX is essentially √τ n. Ideally we would likeτ = τ , in which case the footprint becomes √ τ n, which we prove to be essentially optimal. Formally, we show: there is a target state v such that, to estimate its mass π(v) within any constant multiplicative factor with constant probability, any algorithm requires ( √ τ n/ ln n) neigh() calls.
The proof of Theorem 10 is deferred to Section 6. We shall thus prove Theorem 9, but before starting let us provide some intuition behind the concentration hypothesis. Suppose the walk runs for T = kτ steps for someτ = τ poly(log( π −1 )). Such a process can be seen as a coupon collector over k rounds (see e.g. [19]), where a subset of at mostτ states is collected (i.e. visited) at each round. Now, if we pickτ ≤τ withτ = τ poly(log( π −1 )), then in each round theτ −τ middle steps are essentially independent of other rounds (more formally, the correlation is O(poly( π ))). Each round is then in large part independent of the others; the issue is that the states visited within a single round are correlated. This correlation is responsible for the factorτ in the concentration hypothesis and amounts for the (intuitive) fact that conditioning on the outcome of one step of the walk does not affect the distribution of those steps that are more thanτ steps away. We note that the concentration bounds of [9] give Pr is a weight function of state X i ; however we could not use them to prove the concentration hypothesis of Theorem 9.
Let us now delve into the proof.
Proof Observe the random walk performed by FULLMASSAPPROX. Clearly ifτ = (n) then the walk visits O(τ ln n + √τ n) distinct states, and the theorem holds unconditionally. Let us then assumeτ = o(n). We disregard the first T 0 = (τ ln n) steps of the walk, which of course yield at most T 0 distinct states, and focus on the last T steps, which we denote by X 1 , . . . , X T (one may thus plug T + T 0 in place of T in the concentration hypothesis). Let π i denote the distribution of state X i , i = 1, . . . , T . Since T 0 = (τ ln n), then we can make ||π i − π || TV ≤ 1 poly(n) . One can adapt the proof of Lemma 3 to FULLMASSAPPROX, using the hypothesis ||π − π i || TV ≤ π ln(3/δ) c for all i ≥ 1. This changes the bounds of the lemma only by constant multiplicative factors. We can thus focus on proving the bound on the number of states visited by the walk. In the analysis we assume X i ∼ π , but again the same asymptotic bounds hold if ||π i − π || TV ≤ 1 poly(n) . Let By the bounds of the previous paragraph, for any δ > 0 with probability Conditioned on the event that M t = (τ t ), any sample drawn after t steps is a repeat with probability (τ t ). If we then draw ( tτ ) samples, which require (t) steps, we witness an expected (1) repeats, which can be made larger than k ,δ by appropriately increasing t. Again by the concentration bounds on N t , the total number of states visited can be with probability arbitrarily close to 1 by appropriately increasing t.

Lower Bounds
In this section we prove the bounds of Theorem 3 and Theorem 10 (see Section 1.4), as well as the lower bound for non-time-reversible Markov chains mentioned in Section 1.

Proofs of Theorem 3 and Theorem 10
Both proofs follow the same line. As anticipated, the bounds are proven under a strengthened model providing a primitive neigh(u) that at cost O(1) returns all the incoming and outgoing transition probabilities of u. We assume neigh(u) is invoked automatically when u is first visited, and we leave for free all subsequent step() and probe() calls on u and all elementary operations. It is clear that the cost incurred under this model is no larger than that incurred in the step() and probe() model. We see the chains as random walks on weighted undirected graphs. Recall that any undirected weighted graph G can be univocally associated to a time-reversible Markov chain: for any u, u ∈ G, p uu > 0 if and only if (u, u ) is an edge of G with weight w uu = zπ(u)p uu , for some constant z > 0 equal for all edges.
Consider a random d-regular expander graph G 0 on n 0 nodes (see e.g. [19]). Standard random walk theory states that the simple random walk on G 0 has mixing time τ 0 = (log d n 0 ). Also, by a simple birthday argument, starting from any given node v any algorithm must visit ( √ n 0 ) nodes and thus perform ( √ n 0 ) queries to estimate n 0 within constant factors with constant probability. We now build our chain out of G 0 . In particular, we create a (random) graph G = G( , n 0 ) on n = (n 0 ) nodes as follows. Take each arc {u, v} of G 0 and replace it with a star on +1 nodes. More precisely: delete {u, v}, add a new node s uv and two arcs {u, s uv } and {v, s uv }, and add − 2 nodes each having a self-loop and an arc to s uv . is a function of n 0 to be decided later. Assign weight d − 1 to each self-loop, and weight 1 to any other that (i) π(u i ) = (p(n)), and (ii) the mixing time is τ = (τ (n)) (essentially, the expected time to leave the u i ). Pick any u i as target state v. Suppose now to alter the chain as follows: pick some u j = v and set p u j v = 1. The new stationary probability of v would then be roughly 2π(v). However one cannot distinguish between the two chains with constant probability with less than ( τ π(v) ) step() calls. Indeed, to distinguish between them one must at least visit u j (and then perform e.g. probe(u j , v)). Since u is the only state leading to u j with positive probability, one must invoke step(u) until it returns u j . But p uu j = p(n) τ (n) , hence one needs ( τ (n) p(n) ) = ( τ π(v) ) calls in expectation. The construction can be adapted to any constant approximation factor by adding more transitions towards v.

Experiments
We experimentally evaluate MASSAPPROX and FULLMASSAPPROX against the algorithms of Lee et al. [13] and Banerjee et al. [3] (see Section 1.3). All algorithms were ran on synthetic time-reversible Markov chains on 1M states, created as follows. We start from an undirected torus graph (i.e. a grid with periodic boundary) of n = 1000 × 1000 nodes. We then add 0.01n edges between random pairs of nodes, to reduce the mixing time and thus the cost (and running time) of the algorithms. We add self-loops to all nodes to ensure ergodicity. Finally, we weight the arcs according to two distributions: π U (uniform): each arc has weight 1. The resulting norm is π U = 0.001, that is, 1/ √ n. -π S (skewed): for each arc independently we draw X ∼ U (0, 1] and we set the arc's weight to 1/X. The resulting norm is π S 0.07. For each weighted graph, we consider the time-reversible chain of the associated random walk. We take v = 0 as the target node (obviously, all nodes are equivalent). For all algorithms we set δ = 0.1. For the algorithm of Banerjee et al. we set the minimum detection threshold at π(v), and for all other algorithms we set = 0.25. One must then fix the random walk length: t in our algorithms, in Banerjee et al., and 1/ in Lee et al. Setting the lengths to τ ln(n) would make all algorithms satisfy the desired guarantees. Since we do not know τ , for each algorithm we proceed as follows. We initially set the length of random walks to l = 10. We then perform three independent executions of the algorithm. If all three executions return an estimatê π(v) within a multiplicative factor (1 ± ) of π(v), then we stop. Otherwise, we increase l by a factor √ 2 and repeat. For each value of l we record the average relative errorˆ = |π(v)−π(v)| π (v) and the average total number of step() and probe() calls. Figure 1 shows howˆ decreases as the number of calls increases.
MASSAPPROX and FULLMASSAPPROX are the fastest candidates in all cases. In the uniform chain, MASSAPPROX is approached by the algorithm of Banerjee et al. at high accuracies. This seems a confirmation of theory: MASSAPPROX has complex-ityÕ(τ n −0.5 ) on a chain with uniform distribution, and the algorithm of Banerjee et Fig. 1 Cost incurred by the algorithms as their estimates converge towards π(v). Left: chain with uniform distribution π U . Right: chain with skewed distribution π S . Note that, once the chain has mixed, increasing the number of steps l brings no additional benefit. This explains the "saturation" behavior of MASSAPPROX in the right plot al. has complexityÕ(τ 1.5 n −0.5 ) on the "typical" target state with mass π(v) ≈ 1/n. If τ is not exceedingly large, the two complexities can translate into close performance in practice. On the other hand, FULLMASSAPPROX is neatly more efficient than previous algorithms. To obtain a fairly accurate estimate of π(v), say ±50%, it improves on their performance by two orders of magnitude -and possibly by more on the skewed chain. A final observation is that FULLMASSAPPROX outperforms MASSAPPROX, too, on the uniform chain. The complexity bounds we have are the same for both algorithms, but perhaps FULLMASSAPPROX takes advantage of some specific structural properties of the chain we have used, which further improves its performance.

Conclusions
We have given improved, optimal algorithms for approximating the stationary probability of a given state in a time-reversible Markov chain, and for approximating the sum of nonnegative real vectors by weighted sampling. Although time-reversible chains are of clear relevance, extending our results to other classes of Markov chains is an interesting open question. We have also shown that, under certain hypotheses on the chain, a simple modification of our main algorithm can reduce the total number of distinct states one needs to visit. Since this alternative algorithm appears to be the fastest one in practice, investigating it in greater detail can be a line for future research.