On approximating the stationary distribution of time-reversible Markov chains

Approximating the stationary probability of a state in a Markov chain through Markov chain Monte Carlo techniques is, in general, inefficient. Standard random walk approaches require $\tilde{O}(\tau/\pi(v))$ operations to approximate the probability $\pi(v)$ of a state $v$ in a chain with mixing time $\tau$, and even the best available techniques still have complexity $\tilde{O}(\tau^{1.5}/\pi(v)^{0.5})$, and since these complexities depend inversely on $\pi(v)$, they can grow beyond any bound in the size of the chain or in its mixing time. In this paper we show that, for time-reversible Markov chains, there exists a simple randomized approximation algorithm that breaks this"small-$\pi(v)$ barrier".


Introduction
We investigate the problem of approximating efficiently a single entry of the stationary distribution of an ergodic Markov chain. This problem has two main motivations. First, with the advent of massive-scale data, even complexities linear in the size of the input are often excessive [19]; therefore computing explicitly the entire stationary distribution, e.g. via the power method [10], can be simply infeasible. As an alternative one can then resort to approximating only individual entries of the vector, in exchange for a much lower computational complexity [13,20]. In fact, if such a complexity is low enough one could efficiently "sketch" the whole vector by quickly getting a fair estimate of its entries. Second, in many practical cases one is really interested in just a few entries at a time. A classic example is that of network centralities, many of which are stationary distributions of an ergodic Markov chain [4]. Indeed, the problem of approximating the Personalized PageRank score of a few nodes in a graph has been repeatedly addressed in the past [5,6,16,15].
In this paper we seek for efficient algorithms for approximating the stationary probability π(v) of some target state v in the state space of a discrete-time ergodic Markov chain. Besides the motivations above, the problem arises in estimating heat kernels and graph diffusions, testing the conductance of graphs and chains, developing local algorithms, and has applications in machine learning; see [3,12] for a thorough discussion. We adopt a simple model where with a single operation one can either (i) simulate one step of the chain or (ii) retrieve the transition probability between a pair of states. Although recent research has provided encouraging results, existing algorithms suffer from a crucial bottleneck: to guarantee a small relative error in the approximation of π(v), they incur a cost that grows with 1/π(v) itself (basically because estimating π(v) via repeated sampling requires 1/π(v) samples). This is a crucial issue since in general there is no lower bound on π(v); even worse, if the state space has n states, then most states have mass π(v) = O( 1 n ), and one can easily design chains where they have mass exponentially small in n. In general, then, the cost of existing algorithms can blow up far beyond O(n) for almost all input states v. It is thus natural to ask if the dependence of the complexity on π(v) is unavoidable. Unfortunately, one can easily show that Ω(τ /π(v)) operations can be necessary to estimate π(v) within any constant multiplicative factor if one makes no assumption on the chain (see Appendix 6.2). To drop below this complexity barrier one must then necessarily look at special classes of Markov chains.
We present an algorithm that breaks this "small-π(v) barrier" for time-reversible Markov chains. Time-reversible chains are a well-known subclass of Markov chains which lie at the heart of the celebrated Metropolis-Hastings algorithm [11] and are equivalent to random walks on weighted undirected graphs [14]. Formally, given any , δ > 0 and any state v in a time-reversible chain, our algorithm with probability 1 − δ returns a multiplicative (1 ± )-approximation of π(v) by usingÕ(τ π −1 ) operations, where τ is the mixing time of the chain, · is the Euclidean norm andÕ(·) hides polynomials in −1 , ln(δ −1 ), ln( π −1 ). The complexity is independent of π(v), and for all but a vanishing fraction of states in the chain improves by factors at least √ n or √ τ over previous algorithms. The heart of our algorithm is a randomized scheme for approximating the sum of a nonnegative vector by sampling its entries with probability proportional to their values. This scheme requiresÕ( π −1 ) samples if π is the distribution over the vector entries, which generalizes the O( √ n) algorithm of [17] and is provably optimal. We prove that our algorithm for estimating π(v) is essentially optimal as a function of τ , n and π ; in fact one cannot do better even under a stronger computational model where all transition probabilities to/from all visited states are known. Finally, we show the number of distinct states visited by our algorithm may be further reduced, provided such a number satisfies some concentration hypotheses. This is useful if visiting a new state is expensive (e.g. if states are users in a social network). All our algorithms are simple to implement, require no tuning, and an experimental evaluation shows that in practice they are faster than existing alternatives already for medium-sized chains (see Appendix 6.5).
The rest of the paper is organized as follows. Subsection 1.1 pins down definitions and notation; Subsection 1.2 formalizes the problem; Subsection 1.3 discusses related work; Subsection 1.4 summarizes our results. Section 2 presents our vector sum approximation algorithm. Section 3 presents our approximation algorithm for π(v). Section 4 provides the proofs of optimality. All missing details are found in the Appendix.

Preliminaries
A discrete-time, finite-state Markov chain is a sequence of random variables X 0 , X 1 , . . . taking value over a set of states V = {1, . . . , n}, such that for all i ≥ 1 and all u 0 , . . . , u i ∈ V with Pr(X 0 = u 0 , . . . , . Denote by P = [p uu ] the transition matrix of the chain, so that p uu = Pr(X i = u |X i−1 = u). We assume the chain is ergodic, and thus has a limit distribution that is independent from the distribution of X 0 ; the limit distribution then coincides with the stationary distribution π. Thus π is the unique distribution vector such that for any distribution vector π 0 : We denote by π(u) the stationary probability, or mass, of u, and we always denote by v the target state whose mass is to be estimated. For any V ⊆ V we let π(V ) denote u∈V π(u).
We also assume the chain is time-reversible, i.e. that for any pair of states u and u we have: We denote by τ the standard 1 4 -mixing time of the chain. In words, τ is the smallest integer such that after τ steps the total variation distance between π and the distribution of X τ is bounded by 1 4 , irrespective of the initial distribution. Formally, τ := min{t : After τ steps, the distribution of X t converges to π exponentially fast; that is, if t = ητ with η ≥ 1, then π 0 P t − π TV ≤ 2 −η . In the rest of the paper, · always denotes the 2 norm.
One may refer to [14] for a detailed explanation of the notions recalled here. Unless necessary, we drop multiplicative factors depending only on , δ (see below) from the asymptotic complexity notation. Furthermore, we use the tilde notation to hide polylogarithmic factors, i.e. we denote O(f · poly(log(f ))) byÕ(f ).

Problem formulation
Consider now a discrete-time, finite-state, time-reversible, ergodic Markov chain on n states. The chain is initially unknown and can be accessed via two operations (also called queries): step(): accepts in input a state u, and returns state u with probability p uu probe(): accepts in input a pair of states u, u , and returns p uu These queries are the de facto model of previous work. step() is used in [5,12,6,16,15,3] to simulate the walk, assuming each step costs O(1). probe() is used in [16,15,3] to access the elements of the transition matrix, assuming again one access costs O(1). Here, too, we assume step() and probe() as well as all standard operations (arithmetics, memory access, . . . ) cost O(1). This includes set insertion and set membership testing; in case their complexity is ω(1), our bounds can be adapted correspondingly. The problem can now be formalized as follows. The algorithm is given in input a triple (v, , δ) where v is a state in the state space of the chain and , δ are two reals in (0, 1). It must output a valueπ(v) such that, with probability 1 − δ, it holds (1 − )π(v) ≤π(v) ≤ (1 + )π(v). The complexity of the algorithm is counted by the total number of operations it performs. Obviously we seek for an algorithm of minimal complexity.
A final remark. We say state u has been visited if u = v or if u has been returned by a step() call. In line with previous work, we adopt the following "locality" constraint: the algorithm can invoke probe() and step() only on visited states.

Related work
Two recent works address precisely the problem of estimating π(v) in Markov chains. The key differences with our paper are that they consider general (i.e. not necessarily time-reversible) chains, and that we aim at a small relative error for any π(v) and not only for large π(v).
• [12] gives a local approximation algorithm based on estimating return times via truncated random walks. Given any ∆ > 0, if π(v) ≥ ∆ the algorithm with probability 1 − δ outputs a multiplicative Z(v)-approximation of π(v), where Z(v) is a "local mixing time" that depends on the structure of the chain. The cost isÕ(ln(1/δ)/ 3 ∆) step() calls. If one wants a multiplicative (1 ± )-approximation of π(v) for a generic v, the cost becomes O(τ /π(v)) step() calls since one must wait for the walks to hit v after having mixed.
• [3] gives an algorithm to approximate -step transition probabilities based on coupling a local exploration of the transition matrix P with simulated random walks. Given any ∆ > 0, if the probability to be estimated is ≥ ∆ then with probability 1 − δ the algorithm gives a multiplicative (1 ± )-approximation of it at an expected cost ofÕ( 1.5 d ln(1/δ) / ∆ 0.5 ) calls to both step() and probe(), for a uniform random choice of v in the chain, where d is the density of P. To estimate π(v) for a generic v one must set = τ and ∆ = π(v), and since if the chain is irreducible then d = Ω(1), the bound stays atÕ(τ 1.5 /π(v) 0.5 ). This does not contradict our lower bound of Appendix 6.2, since their model allows for probing transition probabilities even between unvisited states.
Similar results are known for specific Markov chains, and in particular for PageRank (note that in PageRank τ = O(1)). [5,6] give an algorithm for approximating the PageRank π(v) of the nodes v having π(v) ≥ ∆, at the cost ofÕ(1/∆) step() calls; again, if one desires a multiplicative (1± )-approximation of π(v), the cost becomesÕ(1/π(v)). [16] gives an algorithm, with techniques similar to [3], for estimating the Personalized PageRank π(v) of a node v; if one aims at a multiplicative (1 ± )-approximation of π(v), the algorithm makesÕ(d 0.5 /π(v) 0.5 ) step() and probe() calls where d is the average degree of the graph. Similar bounds can be found in [15] for Personalized PageRank on undirected graphs.
Summarizing, existing algorithms require eitherÕ(τ /π(v)) orÕ(τ 1.5 /π(v) 0.5 ) step() and probe() calls to ensure a (1 ± )-approximation of π(v) for a generic state v. Note that the complexity and approximation guarantees of these algorithms depend on knowledge of τ ; our algorithms are no exception, and we prove our bounds as a function of τ .
Finally, for the problem of estimating the sum of a nonnegative n-entry vector x by sampling its entries x i with probability π i = x i / i x i , the only algorithm existing to date is that of [17]. That algorithm takes O( √ n) samples independently of π, while ours needs O( √ n) samples only in the worst case, i.e. if π is (essentially) the uniform distribution.

Our results
Our first contribution is SumApprox, a randomized algorithm for estimating the sum γ of a nonnegative vector x, assuming one can sample its entries according to the probability distribution π = x/γ. Formally, we prove: Theorem 1. Given any δ, ∈ (0, 1), SumApprox( , δ) with probability at least 1 − δ returns a multiplicative (1 ± )-approximation of γ by taking O π −1 −3 (ln 1 δ ) 3/2 samples. SumApprox is extremely simple, yet it improves on the state-of-the-art O( √ n) algorithm of [17]. We prove Ω π −1 samples are necessary, too, to get a fair estimate of γ.
Previous algorithms work also for general (i.e. non-reversible) chains; but on the n − o(n) states with mass π(v) = O(1/n), their complexity becomes at leastÕ(τ n) [12] orÕ(τ 1.5 √ n) [3]. In fact, π(v) can be arbitrarily small (even exponentially small in n and τ ) for almost all states in the chain, so for almost all states the complexity of previous algorithms blow up while that of MassApprox remains unchanged: since π −1 ≤ √ n for any π, the complexity of MassApprox is at mostÕ(τ √ n). Next, we show that MassApprox is optimal as a function of τ , n and π , up to small factors. In fact, no algorithm can perform better even if equipped with an operation neigh(u) that returns all incoming and outgoing transition probabilities of u. Formally, we prove: there is a family of time-reversible chains on n states where (a) π = Θ(ν(n)), and (b) there is a target state v such that, to estimate its mass π(v) within any constant multiplicative factor with constant probability, any algorithm requires Ω(τ π −1 / ln n) neigh() calls where τ is the mixing time of the chain.
Although bounding time complexity is our primary goal, in some scenarios one wants to bound the footprint, i.e. the number of distinct states visited. Obviously, the footprint of MassApprox is bounded by its complexity (Theorem 2). We give a second algorithm, FullMassApprox, whose footprint can be smaller than that of MassApprox depending on τ, n, and π . More precisely, we prove a footprint bound that is conditional on the concentration of the footprint itself (see Subsection 3.1 for the intuition behind it). Theorem 4. Let N v,T be the number of distinct states visited by a random walk of T steps starting from v. Assume for a functionτ of the chain we have Pr . Then, given any δ, ∈ (0, 1), with probability (1−δ) one can obtain a multiplicative (1± )-approximation of π(v) by visiting O(f ( , δ)(τ ln n + √τ n)) distinct states.
If in Theorem 4 we haveτ = τ , then FullMassApprox is essentially optimal too. Formally: there is a family of time-reversible chains on n states where (a) the mixing time is τ = Θ(τ (n)), and (b) there is a target state v such that, to estimate its mass π(v) within any constant multiplicative factor with constant probability, any algorithm requires Ω( τ n/ ln n) neigh() calls.

Estimating sum by weighted sampling
In this section we analyse the following problem. We are given a vector of nonnegative reals γ u indexed by the elements u of a set V . The vector is unknown, including its length, but we can draw samples from V according to the distribution π where u has probability γ u / u∈V γ u . The goal is to approximate the vector sum γ = u∈V γ u . We describe a simple randomized algorithm, SumApprox, which proceeds by repeatedly drawing samples and checking for repeats (i.e. a draw that yields an element already drawn before). The key intuition is the following: at any instant, if S ⊆ V is the subset of elements drawn so far, then the next draw is a repeat with probability u∈S γ u /γ. By drawing a sequence of samples we can thus flip a sequence of binary random variables, each one telling if a draw is a repeat, whose expectation is known save for the factor 1/γ. If the sum of these random variables is sufficiently close to its expectation, one can then get a good approximation of γ by simply computing a ratio. The code of SumApprox is listed below.
Proof. We make use of a martingale tail inequality originally from [9] and stated (and proved) in the following form as Theorem 2.2 of [1], p. 1476: . Then for any z, v > 0 we have Let us plug into the formula of Theorem 7 the appropriate quantities from SumApprox: • Let X i be the (i + 1) th sample (i.e. (X i , γ X i ) is the pair (u, γ u ) drawn at the (i + 1) th invocation of line 8).
• Let F i be the event space generated by X 0 , . . . , X i , so that for any random variable be the indicator variable of a repeat on the (i + 1) th sample.
γu γ be the probability of a repeat on the (i + 1) th sample as a function of all the (distinct) samples up to the i th , i.e.
, and since Z i−1 and P i are completely determined by X 0 , . . . , X i−1 , the right-hand term is simply Theorem 7 then yields the following: Recall now SumApprox. Note that i j=1 P j and Z i are respectively the value of w γ and of r − w γ just after the while loop has been executed for the (i + 1)-th time. Note also that, when SumApprox returns, r = k ,δ . Therefore the event that, when SumApprox returns, which is smaller than δ/3 since clearly k ,δ > 2 2 ln 3 δ . Consider instead the event that, when SumApprox returns, w r ≥ γ(1 + ) i.e. w γ ≥ r(1 + ) = k ,δ (1 + ). This is the event that Invoking again Lemma 1 with z = k ,δ and v = (1 + )k ,δ + 1, we obtain: δ the right-hand term is at most δ 3 . Finally, by a simple union bound the probability that |γ−γ| ≥ γ is at most 2 δ 3 , and the proof of Theorem 6 is complete.
Proof. We show that the probability that s = 45 π −1 −3 (ln 3 δ ) 3/2 draws yield less than k ,δ repeats is less than δ 3 . Letp = 5 18 π (ln 3 δ ) −1/2 . We consider two cases. Case 1: ∃u ∈ V with π(u) >p. Let then C s u be the random variable counting the number of times u appears in s draws. Since if C s u > k ,δ then u causes at least k ,δ repeats, the probability that SumApprox needs more than s draws is upper bounded by Pr Since C s u is a sum of independent binary random variables, the bounds of Appendix 6.1 give Pr Case 2: π(u) ≤p for all u ∈ V . Let thens = p −1 , letS be the set of distinct elements in the firsts draws, and let w(s) = u∈S π(u). Again by a union bound, the probability that SumApprox draws more than s samples is less than δ 243 + δ 23 < δ 3 .
We remark that the previous existing algorithm for the sum estimation problem [17] needs knowledge of n = |V | and uses O( √ n −7/2 log(n)(log 1 δ + log 1 + log log n)) samples. SumApprox is simpler, oblivious to n, and gives more general bounds. It is also asymptotically faster unless π is (essentially) the uniform distribution.
Finally, we show that SumApprox is essentially optimal, by proving Ω( π −1 ) samples are in general necessary to estimate γ even if n is known in advance. This extends to arbitrary distributions the Ω( √ n) lower bound given by [17] for the uniform distribution.
Consider the two vectors x = (γ 1 , . . . , γ n ) and x = (γ 1 , . . . , γ n ) defined as follows: Now let γ = n i=1 γ i and γ = n i=1 γ i . One can check that γ ≤ 2k and |γ − γ | ≥ k 2 . Hence, to obtain an estimateγ of γ withγ ≤ 5 4 γ, one must distinguish x from x . Note that the norm of π = x/γ is in Θ(1/ √ k) = Θ(ν(n)), as requested. Now, for each one of x and x in turn, pick a permutation of {1, . . . , n} uniformly at random and apply it to the entries of the vector. Suppose then we sample o( π −1 ) = o( √ k) entries from x. We shall see that, with probability 1 − o(1), we cannot distinguish x from x . First, note that the total mass of the entries with value √ k/n is at most 1/ √ k. Hence the probability of drawing any of those entries with o( √ k) samples is o(1), and we can assume all draws yield entries having value 1. Since there are O(k) such entries in total, the probability of witnessing any repeat is also o(1), and we can assume no repeat is witnessed. Furthermore, because of the random permutation, the indices of samples are distributed uniformly over {1, . . . , n} (recall that we actually sample from the index set {1, . . . , n}, so we could use the distribution of the indices to distinguish x from x ). The same argument applies to x , so drawing o( √ k) samples from x yields exactly the same distribution and the two vectors are indistinguishable. To adapt the construction to larger approximation factors, set γ j = 1 : j ≤ ηk for η large enough.

Approximating the stationary distribution
In this section we address the problem of approximating π(v). Such a problem can in fact be reduced to the sum estimation problem of Section 2 by drawing states via random walks. The crux is determining how long the walks must be in order for the samples to come from a distribution close enough to π, so that the approximation guarantees of SumApprox transfer directly to our estimate of π(v).
Consider a random walk of length t + 1 that starts at v. Obviously we can simulate such a walk by setting u 0 = v and then invoking step(u i ) to obtain the state u i+1 , for i = 0, . . . , t − 1. Crucially, using the time-reversibility of the chain, for any visited state u we can obtain the ratio γ u between π(u) and π(v) using O(1) operations. Formally, let γ v = π(v)/π(v) = 1, and in general let γ u = π(u)/π(v). Note that: The time-reversibility of the chain (see Equation 2) implies π(u i+1 ) π(u i ) = pu i ,u i+1 pu i+1 ,u i = probe(u i ,u i+1 ) probe(u i+1 ,u i ) , hence we can compute γ u i+1 with O(1) operations if we know γ u i . But then we can keep track of γ u for any u visited so far, starting with γ u 0 = 1 and computing γ u i+1 by Equation 7 the first time u i+1 is visited.
Suppose now to pick t large enough so that the chain reaches its stationary distribution, i.e. u t ∼ π irrespective of v. One is then drawing state u, as well as its associate weight γ u , with probability π(u). Now if we let γ = u∈V γ u , then π(u) = γ u γ −1 and in particular π(v) = γ −1 . Therefore approximating π(v) amounts to approximating γ; more formally, for any ∈ (0, 1), if γ is a (1 ± 2 )-approximation of γ thenγ −1 is a (1 ± )-approximation of π(v). We can therefore reduce to the sum approximation problem of Section 3: compute with probability (1 − δ) a (1 ± )-approximation of γ, assuming we can draw pairs (u, γ u ) according to π. The only problem is that by simulating the chain we can only come close to (but not exactly on) the stationary distribution π. We must then tie the approximation guarantees of SumApprox to the length t of the random walks, or better to the distance π − π TV between π and the distribution π from which u t is drawn. Formally, we show: Lemma 10. There exists some constant c > 0 such that the following holds. Choose any δ, ∈ (0, 1), and suppose we draw the pairs (u, γ u ) from a distribution π such that π − π TV ≤ π ln(3/δ) c . Then SumApprox( 2 , δ) with probability at least 1 − δ returns a multiplicative (1 ± )approximation of γ by taking at most 720 π −1 −3 (ln 3 δ ) 3/2 samples. Proof. Let us start with the bound on the number of samples. Recall the proof of Theorem 8, and note that the whole argument depends on π but not on the values γ u . Indeed, π alone determines the probability of repeats and thus controls the distribution of the number of samples drawn by SumApprox. Hence, by Theorem 8 SumApprox( 2 , δ) takes more than 45 π −1 8 −3 (ln 3 δ ) 3/2 = 360 π −1 −3 (ln 3 δ ) 3/2 samples with probability less than δ 3 . Now π − π ≤ 2 π − π TV ≤ 2 π ln(3/δ) c ≤ π 2(ln 3) −c , which for c ≥ 15 is bounded by 1 2 π . Then, since π ≥ π − π − π , we have π −1 ≤ 2 π −1 and the bound above is in turn bounded by 720 π −1 −3 (ln 3 δ ) 3/2 . Let us now see the approximation guarantees. Recall the proof of Theorem 6. We want to show again that Pr[| w(s) r − γ| ≥ 2 γ] ≤ 2δ 3 . However, now the samples are drawn according to π instead of π. Let then P j = u∈∪ j−1 h=0 {X h } π (u) and Z i = i j=0 (χ j − P j ); in a nutshell, P j and Z i are the analogous of P j and Z i under π . It is immediate to check that Lemma 1 holds with Z i and P j in place of Z i and P j . Let now w (i) = γ i j=1 P j . Note that i j=1 P j and Z i are respectively the value of w (i) γ and of r − w (i) γ just after line 9 has been executed for the (i + 1)-th time. Therefore, the argument following Lemma 1 holds if we put w (i) in place of the value taken by w after the (i + 1)-th execution of line 9. Hence the same bounds hold, and SumApprox( 2 , δ) ensures Pr[| w (s) r − γ| ≥ 2 γ] ≤ δ 3 where s is the total number of draws. Now note that SumApprox does not return w (s) r , but w(s) r where w(i) = γ i j=1 P j is the value of w in SumApprox after line 9 has been executed for the (i + 1)-th time. We shall now make | w(s) r − w (s) r | ≤ 2 γ; by the triangle inequality we will then be done. First of all, by the definition of w(s) and w (s) we have Now note that |P j − P j | ≤ π − π TV , since P j and P j are the probability of the same event under respectively π and π . Therefore the right-hand side of Equation 8 is bounded by γr −1 s π − π TV . Now, when SumApprox( 2 , δ) terminates r = k 2 ,δ ≥ 4 2+2.2 2 ln 3 δ , and by hypothesis π − π TV ≤ π ln(3/δ) c . Therefore: Finally, recall from above that with probability 1 − 3 δ we have s ≤ 720 π −1 −3 (ln 3 δ ) 3/2 . In this case the equation above yields | w(s) r − w (s) r | ≤ γ · 721 c ln( 3 δ ) 0.5−c , which is smaller than 2 γ for c ≥ 1 2 + ln 1442 ln ln 3 ≈ 78.
A simple union bound completes the proof.
We are now ready to prove Theorem 2. Pick t = τ c ln( π −1 −1 ln 3 δ )/ ln 2, where c is the constant of Lemma 10 and τ is the mixing time of the chain. Simulate the walk for t steps starting from v, and let π be the distribution of the final state. By the properties of the mixing time (see Section 1.1): and therefore by Lemma 10 we obtain a (1 ± ) approximation of γ. By choosing small enough we can obtain a (1 ± ) approximation of π(v) for any desired . The total number of operations performed is clearly bounded by t = τ c ln( π −1 −1 ln 3 δ )/ ln 2 times the number of samples taken by SumApprox, and by substituting this value in the bound of Theorem 1 we obtain Theorem 2. The pseudocode of the resulting algorithm, MassApprox, is given for reference in Appendix 6.3.

Reducing the footprint
In this section we describe FullMassApprox, the algorithm behind the bounds of Theorem 5. FullMassApprox is derived from MassApprox as follows. First, instead of performing a new walk of length t from v for each sample, the algorithm performs one long random walk of length T and takes one sample every t steps. The correctness guarantees do not change, since although the samples do not come all from the same distribution, they are still drawn from a distribution sufficiently close to π. Second, after checking if the current draw yields a repeat, the algorithm includes in the set S not only the draw but also all other states visited so far. Again, this does not affect the guarantees, since we do not need the set S to be built on independent samples. However, this makes the mass of S grow potentially faster, so we can hope to get more repeats and decrease the total number of samples. The pseudocode of FullMassApprox is in Appendix 6.4.
The concentration hypothesis. Before continuing to the proof of Theorem 5, let us provide some intuition behind the concentration hypothesis. Suppose the walk runs for T = kτ steps for someτ = τ poly(log( π −1 )). Such a process can be seen as a coupon collector over k rounds, where a subset of at mostτ states is collected (i.e. visited) at each round. Now, if we pick τ ≤τ withτ = τ poly(log( π −1 )), then in each round theτ −τ central steps are essentially independent of other rounds (more formally, the correlation is O(poly(n) −1 )). Each round is then in large part independent of the others; the issue is that the states visited within a single round are correlated. Such a correlation is responsible for the factorτ in the concentration hypothesis and amounts for the (intuitive) fact that conditioning on the outcome of one step of the walk does not affect the distribution of those steps that are more thanτ steps away. We note that the concentration bounds of [8] give Pr is a function of state X i ; however we could not use them to prove the concentration hypothesis of Theorem 4.
Let us now delve into the proof.
Proof. Observe the random walk performed by FullMassApprox. Clearly ifτ = Ω(n) then the walk visits O(τ ln n + √τ n) distinct states, and the theorem holds unconditionally. Let us then assumeτ = o(n). We disregard the first T 0 = Θ(τ ln n) steps of the walk, which of course yield at most T 0 distinct states, and focus on the last T steps, which we denote by X 1 , . . . , X T (one may thus plug T + T 0 in place of T in the concentration hypothesis). Let π i denote the distribution of state X i , i = 1, . . . , T . Since T 0 = Θ(τ ln n), then we can make π i − π TV ≤ 1 poly(n) . One can adapt the proof of Lemma 10 to FullMassApprox, using the hypothesis π − π i TV ≤ π ln(3/δ) c for all i ≥ 1. This changes the bounds of the lemma only by constant multiplicative factors.
We can thus focus on proving the bound on the number of states visited by the walk. In the analysis we assume X i ∼ π, but again the same asymptotic bounds hold if π i − π TV ≤ 1 poly(n) .
and let M v,t = u∈Sv,t π(u). For brevity we simply write S t , N t , M t .
The crux is to show that M t , the aggregate mass of S t , grows basically as N 2 t /t. Formally we prove that, for any , δ, First, for any λ > 0 let V λ = {u ∈ V : π(u) < λ n }. Clearly Pr[X i ∈ V λ ] = u∈V λ π(u) < λ. Therefore the number of steps J t (λ) the chain was on a state of V λ satisfies E[J t (λ)] < tλ. Now, by Markov's inequality Pr[J t (λ) > q 2 ] < 2tλ q , and setting λ = q 2 t we obtain Pr[J t (λ) > q 2 ] < . Since by hypothesis Pr[N t < q] < δ, by a union bound we get Pr Conditioned on the event that M t = Ω(τ t ), any sample drawn after t steps is a repeat with probability Ω(τ t ). If we then draw Θ( t τ ) samples, which require Θ(t) steps, we witness an expected Ω(1) samples, which can be made larger than k ,δ by appropriately increasing t. Again by the concentration bounds on N t , the total number of states visited can be made O(2E[Q t ]) = O( √ nτ ) =Õ( √ nτ ) with probability arbitrarily close to 1 by appropriately increasing t.

Lower bounds
In this section we prove the bounds of Theorem 3 and Theorem 5 (see Section 1.4). Both proofs follow the same line. As anticipated, the bounds are proven under a strengthened model providing a primitive neigh(u) that at cost O(1) returns all the incoming and outgoing transition probabilities of u. We assume neigh(u) is invoked automatically when u is first visited, and we leave for free all subsequent step() and probe() calls on u and all elementary operations. It is clear that the cost incurred under this model is no larger than that incurred in the step() and probe() model. We see the chains as random walks on weighted undirected graphs. Recall that any undirected weighted graph G can be univocally associated to a time-reversible Markov chain: for any u, u ∈ G, p uu > 0 if and only if (u, u ) is an edge of G with weight w uu = zπ(u)p uu , for some constant z > 0 equal for all edges.
Consider a random d-regular expander graph G 0 on n 0 nodes. By standard results, the simple random walk on G 0 has mixing time τ 0 = Θ(log d n 0 ). Also, by standard birthday arguments, starting from any given node v any algorithm must visit Ω( √ n 0 ) nodes and thus perform Ω( √ n 0 ) queries to estimate n 0 within constant factors with constant probability. We now build our chain out of G 0 . In particular, we create a (random) graph G = G(∆, n 0 ) on n = Θ(n 0 ∆) nodes as follows. Take each arc {u, v} of G 0 and replace it with a star on ∆ + 1 nodes. More precisely: delete {u, v}, add a new node s uv and two arcs {u, s uv } and {v, s uv }, and add ∆ − 2 nodes each having a self-loop and an arc to s uv . ∆ is a function of n 0 to be decided later. Assign weight d − 1 to each self-loop, and weight 1 to any other arc. Let n be the number of nodes in G; clearly n = Θ(n 0 ∆). G is now a version of G 0 where the random walk slows down by a factor roughly d∆ = Θ(∆), since moving between two nodes u and u that were neighbors in G 0 now takes Θ(∆) steps in expectation. The mixing time of G is therefore τ = Θ(τ 0 ∆). This holds also for an arbitrary algorithm: any two neighbors of a star center are indistinguishable until they are visited, since their transition probabilities are the same (1/∆ from the center, and 1/d to the center). The same holds for the neighbors of the original nodes of G 0 , whose transition probabilities from/to such a node are 1/d and 1/∆ respectively. Therefore, starting from any node in G any algorithm needs Ω( √ ∆n) queries to estimate n. Finally, if π is the stationary distribution of the random walk, then π = Θ( ∆/n), since there are ∆/n nodes (the centers of the stars) all having the same mass which is also asymptotically larger than the mass of any other node, and which in aggregate is Ω(1). Now consider the graph G = (∆, n 0 2 ), which has half the nodes of G. Choose any node u ∈ G . We now add k nodes to G , with k = Θ(n), so that it has exactly the same number of nodes n as G. Finally, we add k arcs between each of those nodes and u, each one of weight k for some > 0. Both the overall mass of these k nodes, and the probability of walking to any of them from u, is then less than . Therefore, the mixing time of G is essentially unaltered, as well as its stationary distribution. However, any node in G has roughly half the mass of its "homologue" in G . Therefore, to estimate the mass of any given node in G, one must distinguish between G and G , i.e. determine whether the graph at hand comes from G(∆, n 0 ) or G(∆, n 0 2 ); and as we have seen this requires Ω( √ ∆n) queries. Now to the bounds. Note that τ π −1 = Θ τ 0 ∆ n/∆ = O( √ ∆n ln n), and √ τ n = Θ √ ∆τ 0 n = O √ ∆n √ ln n ; thus √ ∆n is in both Ω(τ π −1 / ln n) and Ω( τ n/ ln n). Since τ = Θ(∆τ 0 ) = Θ( n n 0 ln n 0 ) and ∆ = Θ( n n 0 ), by appropriately choosing n 0 ∈ Θ(1) ∩ Θ(n) we can make τ range from Θ(ln n) to Θ(n). In the same way, since π = Θ( ∆/n) = Θ( 1/n 0 ) we can make π range from Θ( 1 √ n ) to Θ(1) (although not independently of τ ).

Conclusions
We have given improved, optimal algorithms for approximating the stationary probability of a given state in a time-reversible Markov chain, and for approximating the sum of nonnegative real vectors by weighted sampling. Although time-reversible chains are of clear relevance, extending our results to other classes of Markov chains is an intriguing open question. We have also shown that the footprint of our algorithms in terms of number of distinct states visited is tied to the concentration of the number of distinct states visited by the chain; investigating such a concentration is thus an obvious line of future research.

Probability bounds
This appendix provides Chernoff-type probability bounds that are repeatedly used in our analysis; these bounds can be found in e.g. [2], and can be derived from [18].

A lower bound for non-time-reversible Markov chains
Lemma 12. For any functions τ (n) = ω(1) and p(n) = o( 1 n ) there exists a family of ergodic non-time-reversible Markov chains on n states having mixing time τ = Θ(τ (n)), and containing a state v with π(v) = Θ(p(n)) such that any algorithm needs Ω( τ π(v) ) calls to step() to estimate π(v) within constant multiplicative factors with constant probability.
Proof. Consider a chain with state space {u} ∪ {u 1 , . . . , u n−1 } and the following transition probabilities (we assume n large enough to set in [0, 1] any quantity where needed). For u, set p uu = 1 − (n−1)p(n) τ (n) , and p uu i = p(n) τ (n) for all i = 1, . . . , n − 1. For all i = 1, . . . , n − 1, set p u i u i = 1 − 1 τ (n) and p u i u = 1 τ (n) . The chain is clearly ergodic. Note that (n−1)p(n) τ (n) = o( 1 τ (n) ) and therefore the expected time to leave u is asymptotically larger than the expected time to leave any of the u i . One can then check that (i) π(u i ) = Θ(p(n)), and (ii) the mixing time is τ = Θ(τ (n)) (essentially, the expected time to leave the u i ). Pick any u i as target state v. Suppose now to alter the chain as follows: pick some u j = v and set p u j v = 1. The new stationary probability of v would then be roughly 2π(v). However one cannot distinguish between the two chains with constant probability with less than Ω( τ π(v) ) step() calls. Indeed, to distinguish between them one must at least visit u j (and then perform e.g. probe(u j , v)). Since u is the only state leading to u j with positive probability, one must invoke step(u) until it returns u j . But p uu j = p(n) τ (n) , hence one needs Ω( τ (n) p(n) ) = Ω( τ π(v) ) calls in expectation. The construction can be adapted to any constant approximation factor by adding more transitions towards v.

Experiments
We experimentally evaluate MassApprox and FullMassApprox against the algorithms of Lee et al. [12] and Banerjee et al. [3] (see Section 1.3). All algorithms were ran on synthetic timereversible Markov chains on 1M states, created as follows. We start from an undirected torus graph (i.e. a grid with periodic boundary) of n = 1000 × 1000 nodes. We then add 0.01n edges between random pairs of nodes, to reduce the mixing time and thus the cost (and running time) of the algorithms. We add self-loops to all nodes to ensure ergodicity. Finally, we weight the arcs according to two distributions: • π U (uniform): each arc has weight 1. The norm is π U = 0.001, or essentially 1/ √ n.
For each weighted graph, we consider the time-reversible chain of the associated random walk.
We picked v = 0 as the target node, which is equivalent to any other one (and indeed repeating the experiments on other nodes yielded the same results). For all algorithms we set δ = 0.1. For the algorithm of Banerjee et al. we set the minimum detection threshold at π(v), and for all other algorithms we set = 0.25. One must then fix the random walk length: t in our algorithms, in Banerjee et al., and 1/∆ in Lee et al. Setting the lengths to τ ln(n) would make all algorithms satisfy the desired guarantees. Since we do not know τ , for each algorithm we proceed as follows. We initially set the length of random walks to l = 10. We then perform three independent executions of the algorithm. If all three executions return an estimateπ(v) within a multiplicative factor (1 ± ) of π(v), then we stop. Otherwise, we increase l by a factor √ 2 and repeat. For each value of l we record the average relative errorˆ = |π(v)−π(v)| π(v) and the average total number of step() and probe() calls. Figure 1 shows howˆ decreases as the number of calls increases.  MassApprox and FullMassApprox are the fastest candidates in all cases. In the uniform chain, MassApprox is approached by the algorithm of Banerjee et al. at high accuracies. This seems a confirmation of theory: MassApprox has complexityÕ(τ n −0.5 ) on a chain with uniform distribution, and the algorithm of Banerjee et al. has complexityÕ(τ 1.5 n −0.5 ) on the "typical" target state with mass π(v) ≈ 1/n. If τ is not exceedingly large, the two complexities can translate into close performance in practice. On the other hand, FullMassApprox is neatly more efficient than previous algorithms. To obtain a fairly accurate estimate of π(v), say ±50%, it improves on their performance by two orders of magnitude -and possibly by more on the skewed chain. These results suggest that our algorithms are not only of theoretical interest, but also of practical value. A final observation is that FullMassApprox outperforms also MassApprox on the uniform chain. The complexity bounds we have are the same for both algorithms, but perhaps FullMassApprox takes advantage of some specific structural properties of the chain we have used, which makes its complexity drop further.