Asymptotic Divergences and Strong Dichotomy

The Schnorr-Stimm dichotomy theorem concerns finite-state gamblers that bet on infinite sequences of symbols taken from a finite alphabet $\Sigma$. In this paper we use the Kullback-Leibler divergence to formulate the $\textit{lower asymptotic divergence}$ $\text{div}(S||\alpha)$ of a probability measure $\alpha$ on $\Sigma$ from a sequence $S$ over $\Sigma$ and the $\textit{upper asymptotic divergence}$ $\text{Div}(S||\alpha)$ of $\alpha$ from $S$ in such a way that a sequence $S$ is $\alpha$-normal (meaning that every string $w$ has asymptotic frequency $\alpha(w)$ in $S$) if and only if $\text{Div}(S||\alpha)=0$. We also use the Kullback-Leibler divergence to quantify the $\textit{total risk }$ $\text{Risk}_G(w)$ that a finite-state gambler $G$ takes when betting along a prefix $w$ of $S$. Our main theorem is a $\textit{strong dichotomy theorem}$ that uses the above notions to $\textit{quantify}$ the exponential rates of winning and losing on the two sides of the Schnorr-Stimm dichotomy theorem (with the latter routinely extended from normality to $\alpha$-normality). Modulo asymptotic caveats in the paper, our strong dichotomy theorem says that the following two things hold for prefixes $w$ of $S$. (1) The infinitely-often exponential rate of winning is $2^{\text{Div}(S||\alpha)|w|}$. (2) The exponential rate of loss is $2^{-\text{Risk}_G(w)}$. We also use (1) to show that $1-\text{Div}(S||\alpha)/c$, where $c= \log(1/ \min_{a\in\Sigma}\alpha(a))$, is an upper bound on the finite-state $\alpha$-dimension of $S$ and prove the dual fact that $1-\text{div}(S||\alpha)/c$ is an upper bound on the finite-state strong $\alpha$-dimension of $S$.


Abstract
The Schnorr-Stimm dichotomy theorem [31] concerns finite-state gamblers that bet on infinite sequences of symbols taken from a finite alphabet Σ. The theorem asserts that, for any such sequence S, the following two things are true.
(1) If S is not normal in the sense of Borel (where normal means that every two strings of equal length appear with equal asymptotic frequency in S), then there is a finite-state gambler that wins money at an infinitely-often exponential rate betting on S.
(2) If S is normal, then any finite-state gambler betting on S loses money at an exponential rate.
In this paper we use the Kullback-Leibler divergence to formulate the lower asymptotic divergence div(S||α) of a probability measure α on Σ from a sequence S over Σ and the upper asymptotic divergence Div(S||α) of α from S in such a way that a sequence S is α-normal (meaning that every string w has asymptotic frequency α(w) in S) if and only if Div(S||α) = 0. We also use the Kullback-Leibler divergence to quantify the total risk $\text{Risk}_G(w)$ that a finite-state gambler G takes when betting along a prefix w of S.

Introduction
An infinite sequence S over a finite alphabet is normal in the 1909 sense of Borel [7] if every two strings of equal length appear with equal asymptotic frequency in S. Borel normality played a central role in the origins of measure-theoretic probability theory [6] and is intuitively regarded as a weak notion of randomness. For a masterful discussion of this intuition, see section 3.5 of [22], where Knuth calls normal sequences "∞-distributed sequences." The theory of computing was used to make this intuition precise. This took place in three steps in the 1960s and 1970s. First, Martin-Löf [28] used constructive measure theory to give the first successful formulation of the randomness of individual infinite binary sequences. Second, Schnorr [30] gave an equivalent, and more flexible, formulation of Martin-Löf's notion in terms of gambling strategies called martingales. In this formulation, an infinite binary sequence S is random if no lower semicomputable martingale can make unbounded money betting on the successive bits of S. Third, Schnorr and Stimm [31] proved that an infinite binary sequence S is normal if and only if no martingale that is computed by a finite-state automaton can make unbounded money betting on the successive bits of S. That is, normality is finite-state randomness.
This equivalence was a breakthrough that has already had many consequences (discussed later in this introduction), but the Schnorr-Stimm result said more. It is a dichotomy theorem asserting that, for any infinite binary sequence S, the following two things are true.
1. If S is not normal, then there is a finite-state gambler that wins money at an infinitely-often exponential rate when betting on S.
2. If S is normal, then every finite-state gambler that bets infinitely many times on S loses money at an exponential rate.
The main contribution of this paper is to quantify the exponential rates of winning and losing on the two sides (1 and 2 above) of the Schnorr-Stimm dichotomy.
To describe our main theorem in some detail, let Σ be a finite alphabet. It is routine to extend the above notion of normality to an arbitrary probability measure α on Σ. Specifically, an infinite sequence S over Σ is α-normal if every finite string w over Σ appears with asymptotic frequency $\alpha^{|w|}(w)$ in S, where $\alpha^{\ell}$ is the natural (product) extension of α to strings of length ℓ. Schnorr and Stimm [31] correctly noted that their dichotomy theorem extends to α-normal sequences in a straightforward manner, and it is this extension whose exponential rates we quantify here.
The quantitative tool that drives our approach is the Kullback-Leibler divergence [23], also known as the relative entropy [12]. If α and β are probability measures on Σ, then the Kullback-Leibler divergence of β from α is
$$D(\alpha||\beta) = \sum_{a\in\Sigma} \alpha(a)\log\frac{\alpha(a)}{\beta(a)},$$
i.e., the expectation with respect to α of the random variable $\log\frac{\alpha}{\beta} : \Sigma \to \mathbb{R}\cup\{\infty\}$, where the logarithm is base-2. Although the Kullback-Leibler divergence is not a metric on the space of probability measures on Σ, it does quantify "how different" β is from α, and it has the crucial property that D(α||β) ≥ 0, with equality if and only if α = β.
Here we use the empirical frequencies of symbols in S to define the asymptotic lower divergence div(S||α) of α from S and the asymptotic upper divergence Div(S||α) of α from S in a natural way, so that S is α-normal if and only if Div(S||α) = 0.
The first part of our strong dichotomy theorem says that the infinitely-often exponential rate that can be achieved in 1 above is essentially at least $2^{\text{Div}(S||\alpha)|w|}$, where w is the prefix of S on which the finite-state gambler has bet so far. More precisely, it says the following.
1′. If S is not α-normal, then, for every γ < 1, there is a finite-state gambler G such that, when G bets on S with payoffs according to α, there are infinitely many prefixes w of S after which G's capital exceeds $2^{\gamma\,\text{Div}(S||\alpha)|w|}$.
The second part of our strong dichotomy theorem, like the second part of the Schnorr-Stimm dichotomy theorem, is complicated by the fact that a finite-state gambler may, in some states, decline to bet. In this case, its capital after a bet is the same as it was before the bet, regardless of what symbol actually appears in S. Once again, however, it is the Kullback-Leibler divergence that clarifies the situation. As explained in section 3 below, in any particular state q, a finite-state gambler's betting strategy is a probability measure B(q) on Σ. If B(q) = α, then the gambler does not bet in state q. We thus define the risk that the gambler G takes in state q to be $\text{risk}_G(q) = D(\alpha||B(q))$, i.e., the divergence of B(q) from not betting. We then define the total risk that the gambler takes along a prefix w of the sequence S on which it is betting to be the sum $\text{Risk}_G(w)$ of the risks $\text{risk}_G(q)$ in the states that G traverses along w. The second part of our strong dichotomy theorem says that, if S is α-normal and G is a finite-state gambler betting on S, then after each prefix w of S, the capital of G is essentially bounded above by $2^{-\text{Risk}_G(w)}$. In some sense, then, G loses all that it risks. More precisely, the second part of our strong dichotomy theorem says the following.
2′. If S is α-normal, then, for every finite-state gambler G and every γ < 1, after all but finitely many prefixes w of S, the gambler G's capital is less than $2^{-\gamma\,\text{Risk}_G(w)}$.
A routine ergodic argument, already present in [31], shows that, if a finite-state gambler G bets on an α-normal sequence S, then every state of G that occurs infinitely often along S occurs with positive frequency along S. Hence 2 above follows from 2′ above.
Our strong dichotomy theorem has implications for finite-state dimensions. For each probability measure α on Σ and each sequence S over Σ, the finite-state α-dimension $\dim^{\alpha}_{\mathrm{FS}}(S)$ and the finite-state strong α-dimension $\mathrm{Dim}^{\alpha}_{\mathrm{FS}}(S)$ (defined in section 4 below) are finite-state versions of Billingsley dimension [5,10] introduced in [26]. When α is the uniform probability measure on Σ, these are the finite-state dimension $\dim_{\mathrm{FS}}(S)$, introduced in [14] as a finite-state version of Hausdorff dimension [20,17], and the finite-state strong dimension $\mathrm{Dim}_{\mathrm{FS}}(S)$, introduced in [2] as a finite-state version of packing dimension [35,34,17]. Intuitively, $\dim^{\alpha}_{\mathrm{FS}}(S)$ and $\mathrm{Dim}^{\alpha}_{\mathrm{FS}}(S)$ measure the lower and upper asymptotic α-densities of the finite-state information in S.
Here we use part 1 of our strong dichotomy theorem to prove that, for every positive probability measure α on Σ and every sequence S over Σ,
$$\dim^{\alpha}_{\mathrm{FS}}(S) \le 1 - \text{Div}(S||\alpha)/c,$$
where $c = \log(1/\min_{a\in\Sigma}\alpha(a))$. We also establish the dual result that, for all such α and S, $\mathrm{Dim}^{\alpha}_{\mathrm{FS}}(S) \le 1 - \text{div}(S||\alpha)/c$.
Research on normal sequences and normal numbers (real numbers whose base-b expansions are normal sequences for various choices of b) has grown rapidly in recent years. Part of this is due to the fact that Agafonov [1] and Schnorr and Stimm [31] connected the theory of normal numbers so directly to the theory of computing. Further work along these lines has been continued in [21,29,3,33]. After the discovery of algorithmic dimensions in the present century [24,25,14,2], the Schnorr-Stimm dichotomy led to the realization [8] that the finite-state world, unlike any other known to date, is one in which maximum dimension is not only necessary, but also sufficient, for randomness. This in turn led to the discovery of nontrivial extensions of classical theorems on normal numbers [11,36] to new quantitative theorems on finite-state dimensions [19,16], a line of inquiry that will certainly continue. It has also led to a polynomial-time algorithm [4] that computes real numbers that are provably absolutely normal (normal in every base) and, via Lempel-Ziv methods, to a nearly linear time algorithm for this [27]. In parallel with these developments, connections among normality, Weyl equidistribution theorems, and Diophantine approximation have led to a great deal of progress surveyed in the books [15,9]. This paragraph does not begin to do justice to the breadth and depth of recent and ongoing research on normal numbers and their growing involvement with the theory of computing. It is to be hoped that our strong dichotomy theorem and the quantitative methods implicit in it will further accelerate these discoveries.

Divergence and normality
This section reviews the discrete Kullback-Leibler divergence, introduces asymptotic extensions of this divergence, and uses these to give useful characterizations of Borel normal sequences.

The Kullback-Leibler divergence
We work in a finite alphabet Σ with 2 ≤ |Σ| < ∞. We write $\Sigma^{\ell}$ for the set of strings of length ℓ over Σ, $\Sigma^{*} = \bigcup_{\ell=0}^{\infty}\Sigma^{\ell}$ for the set of (finite) strings over Σ, $\Sigma^{\omega}$ for the set of (infinite) sequences over Σ, and $\Sigma^{\le\omega} = \Sigma^{*}\cup\Sigma^{\omega}$. We write λ for the empty string, |w| for the length of a string $w\in\Sigma^{*}$, and |S| = ω for the length of a sequence $S\in\Sigma^{\omega}$. For $x\in\Sigma^{\le\omega}$ and 0 ≤ i ≤ j < |x|, we write x[i] for the i-th symbol in x and x[i..j] for the string consisting of the i-th through j-th symbols in x.
For discrete probability measures α, β ∈ ∆(Ω), the Kullback-Leibler divergence of β from α is
$$D(\alpha||\beta) = \sum_{\omega\in\Omega}\alpha(\omega)\log\frac{\alpha(\omega)}{\beta(\omega)}, \tag{2.2}$$
where the logarithm is base-2.
Note that the right-hand side of (2.2) is the α-expectation of the random variable $\log\frac{\alpha}{\beta} : \Omega\to\mathbb{R}\cup\{\infty\}$. Note also that D(α||β) is infinite if and only if α(ω) > 0 = β(ω) for some ω ∈ Ω. The Kullback-Leibler divergence D(α||β) is a useful measure of how different β is from α. It is not a metric (because it is not symmetric and does not satisfy the triangle inequality), but it has the crucial property that D(α||β) ≥ 0, with equality if and only if α = β. The two most central quantities in Shannon information theory, entropy and mutual information, can both be defined in terms of divergence as follows.
1. Entropy is divergence from certainty. The entropy of a probability measure α ∈ ∆(Ω), conceived by Shannon [32] as a measure of the uncertainty of α, is
$$H(\alpha) = \sum_{\omega\in\Omega}\alpha(\omega)\log\frac{1}{\alpha(\omega)} = \sum_{\omega\in\Omega}\alpha(\omega)\,D(\pi_{\omega}||\alpha),$$
i.e., the α-average of the divergences of α from the "certainties" $\pi_{\omega}$, where $\pi_{\omega}$ is the probability measure with $\pi_{\omega}(\omega) = 1$.
2. Mutual information is divergence from independence. If α, β ∈ ∆(Ω) have a joint probability measure γ ∈ ∆(Ω × Ω) (i.e., are the marginal probability measures of γ), then the mutual information between α and β, conceived by Shannon [32] as a measure of the information shared by α and β, is
$$I(\alpha;\beta) = D(\gamma||\alpha\times\beta),$$
i.e., the divergence of γ from the probability measure in which α and β are independent.
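The divergence and its two derived quantities can be checked numerically. The following is a minimal Python sketch, assuming probability measures are represented as dictionaries from outcomes to floats; the function names `kl_divergence`, `entropy`, and `mutual_information` are illustrative choices, not notation from the paper.

```python
import math

def kl_divergence(alpha, beta):
    """D(alpha || beta) in bits, for dicts over the same finite set."""
    total = 0.0
    for x, p in alpha.items():
        if p == 0.0:
            continue  # the term 0 * log(0 / q) is taken to be 0
        q = beta.get(x, 0.0)
        if q == 0.0:
            return math.inf  # alpha puts mass where beta has none
        total += p * math.log2(p / q)
    return total

def entropy(alpha):
    """Entropy as the alpha-average of D(point mass at x || alpha)."""
    h = 0.0
    for x, p in alpha.items():
        if p > 0.0:
            point_mass = {y: (1.0 if y == x else 0.0) for y in alpha}
            h += p * kl_divergence(point_mass, alpha)
    return h

def mutual_information(gamma):
    """D(gamma || product of the marginals); gamma is keyed by pairs."""
    left, right = {}, {}
    for (x, y), p in gamma.items():
        left[x] = left.get(x, 0.0) + p
        right[y] = right.get(y, 0.0) + p
    product = {(x, y): left[x] * right[y] for x in left for y in right}
    return kl_divergence(gamma, product)

uniform = {"0": 0.5, "1": 0.5}
biased = {"0": 0.25, "1": 0.75}
correlated = {("0", "0"): 0.5, ("1", "1"): 0.5}
```

With these inputs, `entropy(uniform)` is one bit, `kl_divergence(biased, uniform)` equals one minus the entropy of `biased`, and `mutual_information(correlated)` is one bit, since the two coordinates always agree.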
For each such S and n and each $\ell\in\mathbb{Z}^{+}$, let $\pi^{(\ell)}_{S,n} = \pi_{S,n}\upharpoonright\Sigma^{\ell}$ be the restriction of the function $\pi_{S,n}$ to the set $\Sigma^{\ell}$ of strings of length ℓ.
We call $\pi^{(\ell)}_{S,n}$ the n-th empirical probability measure on $\Sigma^{\ell}$ given by S. A probability measure α ∈ ∆(Σ) naturally induces, for each $\ell\in\mathbb{Z}^{+}$, a probability measure $\alpha^{(\ell)}\in\Delta(\Sigma^{\ell})$ defined by
$$\alpha^{(\ell)}(w) = \prod_{i=0}^{\ell-1}\alpha(w[i]). \tag{2.6}$$
The empirical probability measures $\pi^{(\ell)}_{S,n}$ provide a natural way to define useful empirical divergences of probability measures from sequences.

The upper divergence of α from S is $\text{Div}(S||\alpha) = \sup_{\ell\in\mathbb{Z}^{+}}\text{Div}_{\ell}(S||\alpha)/\ell$.
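For a finite prefix, the quantity inside these asymptotic definitions can be computed directly. The sketch below is a hedged illustration in Python: it assumes that the empirical measure counts aligned, non-overlapping blocks of length ℓ in the prefix (an assumption about the definition, which is not restated here), and the names `block_frequencies`, `product_measure`, and `empirical_divergence` are hypothetical. It evaluates one term of the sequence whose limits define the asymptotic divergences, not the limits themselves.

```python
import math
from collections import Counter

def block_frequencies(prefix, ell):
    """Empirical measure on length-ell blocks (assumed: aligned, non-overlapping)."""
    n = len(prefix) // ell
    if n == 0:
        raise ValueError("prefix is shorter than one block")
    counts = Counter(prefix[i * ell:(i + 1) * ell] for i in range(n))
    return {u: c / n for u, c in counts.items()}

def product_measure(alpha, u):
    """Product extension of alpha to the string u, as in (2.6)."""
    p = 1.0
    for a in u:
        p *= alpha[a]
    return p

def empirical_divergence(prefix, ell, alpha):
    """Divergence in bits of the product measure from the empirical block measure."""
    total = 0.0
    for u, p in block_frequencies(prefix, ell).items():
        q = product_measure(alpha, u)
        if q == 0.0:
            return math.inf
        total += p * math.log2(p / q)
    return total

uniform = {"0": 0.5, "1": 0.5}
prefix = "01" * 8
print(empirical_divergence(prefix, 1, uniform))      # 0.0
print(empirical_divergence(prefix, 2, uniform) / 2)  # 1.0
```

The alternating prefix has balanced single-symbol frequencies, so its divergence at block length 1 is zero, yet every aligned block of length 2 is `01`, which gives a per-symbol divergence of one bit. This is the kind of non-normality that only longer blocks detect.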
A similar approach gives useful empirical divergences of one sequence from another.

Normality
The following notions are essentially due to Borel [7].

S is
4. S is normal if, for all $\ell\in\mathbb{Z}^{+}$, S is ℓ-normal.
Proof. Let α, S, and ℓ be as given.

Strong Dichotomy
This section presents our main theorem, the strong dichotomy theorem for finite-state gambling. We first review finite-state gamblers.
Definition ([31,18,14]). A finite-state gambler (FSG) is a 4-tuple G = (Q, δ, s, B), where Q is a finite set of states, δ : Q × Σ → Q is the transition function, s ∈ Q is the start state, and $B : Q\to\Delta_{\mathbb{Q}}(\Sigma)$ is the betting function.
The transition structure (Q, δ, s) here works as in any deterministic finite-state automaton. For $w\in\Sigma^{*}$, we write δ(w) for the state reached from s by processing w.
Intuitively, a gambler G = (Q, δ, s, B) bets on the successive symbols of a sequence $S\in\Sigma^{\omega}$. The payoffs in the betting are determined by a payoff probability measure α ∈ ∆(Σ). (We regard α and S as external to the gambler G.) We write $d_{G,\alpha}(w)$ for the gambler G's capital (amount of money) after betting on the successive symbols of a prefix w ⊑ S, and we assume that the initial capital is $d_{G,\alpha}(\lambda) = 1$.
The meaning of the betting function B is as follows. After betting on a prefix w ⊑ S, the gambler is in state δ(w) ∈ Q. The betting function B says that, for each a ∈ Σ, the gambler bets the fraction B(δ(w))(a) of its current capital $d_{G,\alpha}(w)$ that wa ⊑ S, i.e., that the next symbol of S is an a. If it then turns out to be the case that wa ⊑ S, the gambler's capital will be
$$d_{G,\alpha}(wa) = d_{G,\alpha}(w)\,\frac{B(\delta(w))(a)}{\alpha(a)}. \tag{3.1}$$
(Note: If α(a) = 0 here, we may define $d_{G,\alpha}(wa)$ however we wish.) The payoffs in (3.1) are fair with respect to α, which means that the conditional α-expectation $\sum_{a\in\Sigma}\alpha(a)\,d_{G,\alpha}(wa)$ of $d_{G,\alpha}(wa)$, given that w ⊑ S, is exactly $d_{G,\alpha}(w)$. This says that the function $d_{G,\alpha}$ is an α-martingale.
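The capital update and the fairness condition can be simulated exactly with rational arithmetic. The following Python sketch is illustrative only: the class name `FiniteStateGambler`, the helper `is_fair_at`, and the two-state example gambler are assumptions made for this example, and the code assumes that α is positive on every symbol so that the division is defined.

```python
from fractions import Fraction

class FiniteStateGambler:
    """A finite-state gambler given by a transition table, a start state, and bets."""

    def __init__(self, delta, start, bets):
        self.delta = delta  # dict: (state, symbol) -> state
        self.start = start  # start state
        self.bets = bets    # dict: state -> {symbol: Fraction}, each summing to 1

    def capital(self, w, alpha):
        """Capital after betting along w, starting from capital 1."""
        state, d = self.start, Fraction(1)
        for a in w:
            # One application of the payoff rule: multiply by B(state)(a) / alpha(a).
            d *= self.bets[state][a] / alpha[a]
            state = self.delta[(state, a)]
        return d

def is_fair_at(gambler, w, alpha):
    """Check that the alpha-expected capital after one more symbol equals d(w)."""
    expected = sum(alpha[a] * gambler.capital(w + a, alpha) for a in alpha)
    return expected == gambler.capital(w, alpha)

# Hypothetical gambler: the state is the last symbol seen, and the gambler
# bets 3/4 of its capital that the next symbol differs from it.
half = Fraction(1, 2)
alpha = {"0": half, "1": half}
delta = {(q, a): a for q in "01" for a in "01"}
bets = {
    "0": {"0": Fraction(1, 4), "1": Fraction(3, 4)},
    "1": {"0": Fraction(3, 4), "1": Fraction(1, 4)},
}
gambler = FiniteStateGambler(delta, "1", bets)
print(gambler.capital("0101", alpha))  # 81/16
```

On the alternating prefix `0101` every bet succeeds, so the capital grows by a factor of 3/2 per symbol. Using `Fraction` keeps the fairness check an exact equality rather than a floating-point comparison.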
If δ(w) = q is a state in which B(q) = α, then (3.1) says that, for each a ∈ Σ, $d_{G,\alpha}(wa) = d_{G,\alpha}(w)$. That is, the condition B(q) = α means that G does not bet in state q. Accordingly, we define the risk that G takes in a state q ∈ Q to be
$$\text{risk}_G(q) = D(\alpha||B(q)),$$
i.e., the divergence of B(q) from not betting. We also define the total risk that G takes along a string $w\in\Sigma^{*}$ to be
$$\text{Risk}_G(w) = \sum_{i=0}^{|w|-1}\text{risk}_G(\delta(w\upharpoonright i)),$$
where $w\upharpoonright i$ is the prefix of w of length i. We now state our main theorem.
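Theorem (strong dichotomy). Let α ∈ ∆(Σ), $S\in\Sigma^{\omega}$, and γ < 1.
1. If S is not α-normal, then there is a finite-state gambler G such that, for infinitely many prefixes w ⊑ S, $d_{G,\alpha}(w) > 2^{\gamma\,\text{Div}(S||\alpha)|w|}$.
2. If S is α-normal, then, for every finite-state gambler G, for all but finitely many prefixes w ⊑ S, $d_{G,\alpha}(w) < 2^{-\gamma\,\text{Risk}_G(w)}$.
Proof. To prove the first part, let S be a sequence that is not α-normal. Then by Theorem 2.3 we know that Div(S||α) > 0. Let r < 1 and let ε > 0. By the definition of Div(S||α) there must exist ℓ such that $\limsup_{n\to\infty} D(\pi^{(\ell)}_{S,n}||\alpha^{(\ell)}) > \ell r\,\text{Div}(S||\alpha)$.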
Note that G can be viewed as a gambler gambling on every ℓ symbols, in the way that he always "waits" until he sees the first ℓ − 1 symbols of a string u = wa of length ℓ, and then bets a fraction of π 0 (wa) of his capital on the next symbol being an a.
Let $u = a_0\cdots a_{\ell-1}$ be in $\Sigma^{\ell}$. The following observation captures the above intuition: Now let $w = S\upharpoonright n_k$ for some k. We can view w as $w = u_0u_1\cdots u_{n-1}u_n$, where $|u_i| = \ell$ for 0 ≤ i ≤ n − 1 and $u_n = a_0\cdots a_m$ with m < ℓ.
Then we have , where $u_n = a_0\cdots a_m$ ranges over $\Sigma^{<\ell}$. Taking log on both sides of (3.10) we get Then by (3.7) and (3.9), we have Therefore, by (3.5) we have Since r and 1 − 2ε can be picked arbitrarily close to 1, take r(1 − 2ε) > γ; then $d_{G,\alpha}(w)\ge 2^{\gamma\,\text{Div}(S||\alpha)|w|}$ for $w = S\upharpoonright n_k$ long enough.
We now prove the second part of the main theorem. Let S be an α-normal sequence and G an arbitrary finite-state gambler. By Proposition 2.5 of [31], G = (Q, δ, s, B) eventually reaches a bottom strongly connected component (a component with no outgoing path) when processing S. A similar argument can also be found in [33]. Without loss of generality, we will therefore assume that every state in G is recurrent in processing S.
Let $w = a_0\cdots a_{n-1}\sqsubseteq S$. Then
$$d_{G,\alpha}(w) = \frac{B(\delta(\lambda))(a_0)\cdots B(\delta(a_0\cdots a_{n-2}))(a_{n-1})}{\alpha(a_0)\cdots\alpha(a_{n-1})} = \prod_{q\in Q}\prod_{a\in\Sigma}\left(\frac{B(q)(a)}{\alpha(a)}\right)^{\#_{G,w}(q,a)}, \tag{3.12}$$
where the notation $\#_{G,w}(q,a)$ denotes the number of times G is in state q and the next symbol is a while processing w. Similarly, we use the notation $\#_{G,w}(q)$ to denote the number of times G is in state q in the same process.
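Taking the logarithm of both sides of (3.12), we have
$$\log d_{G,\alpha}(w) = \sum_{q\in Q}\sum_{a\in\Sigma}\#_{G,w}(q,a)\log\frac{B(q)(a)}{\alpha(a)}. \tag{3.13}$$
By a result of Agafonov [1], which extends easily to the arbitrary probability measures considered here, we have that, for every state q, the limit of $\#_{G,w}(q,a)/\#_{G,w}(q)$ along S exists and equals α(a). That is,
$$\#_{G,w}(q,a) = (\alpha(a)+o(1))\,\#_{G,w}(q) \tag{3.14}$$
for every state q. Therefore, by equations (3.13) and (3.14), and the fact that there are finitely many states, we have
$$\log d_{G,\alpha}(w) = -\sum_{q\in Q}\#_{G,w}(q)\,D(\alpha||B(q))\,(1+o(1)) = -\text{Risk}_G(w)\,(1+o(1)).$$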

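It follows that, for every γ < 1, $d_{G,\alpha}(w) < 2^{-\gamma\,\text{Risk}_G(w)}$ for all but finitely many prefixes w ⊑ S,
so part 2 of the theorem holds.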
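The regrouping step from (3.12) to the logarithmic sum, and its relation to the total risk, can be checked on a small example. The Python sketch below is an illustration under stated assumptions: the helper names `run_counts`, `log_capital`, and `total_risk` are hypothetical, the gambler is a made-up two-state machine, and all bets are assumed positive so that every logarithm is finite. The example string is chosen so that each state sees each symbol exactly half the time, which is the idealized situation that α-normality provides only in the limit.

```python
import math
from collections import Counter

def run_counts(delta, start, w):
    """Count how often the gambler is in state q when the next symbol is a."""
    counts = Counter()
    state = start
    for a in w:
        counts[(state, a)] += 1
        state = delta[(state, a)]
    return counts

def log_capital(counts, bets, alpha):
    """log2 of the capital, regrouped as a sum over (state, symbol) pairs."""
    return sum(n * math.log2(bets[q][a] / alpha[a])
               for (q, a), n in counts.items())

def total_risk(counts, bets, alpha):
    """Sum over visited states q of (visits to q) * D(alpha || B(q))."""
    visits = Counter()
    for (q, _), n in counts.items():
        visits[q] += n

    def risk(q):
        return sum(p * math.log2(p / bets[q][a])
                   for a, p in alpha.items() if p > 0)

    return sum(n * risk(q) for q, n in visits.items())

# Hypothetical gambler: the state is the last symbol seen; it bets 3/4 on a change.
alpha = {"0": 0.5, "1": 0.5}
delta = {(q, a): a for q in "01" for a in "01"}
bets = {"0": {"0": 0.25, "1": 0.75}, "1": {"0": 0.75, "1": 0.25}}

w = "0011" * 4
counts = run_counts(delta, "1", w)
```

For this `w`, each of the four (state, symbol) pairs occurs four times, so `log_capital(counts, bets, alpha)` equals `-total_risk(counts, bets, alpha)` up to rounding: the gambler loses exactly what it risks. On a genuinely α-normal sequence the conditional frequencies match α only asymptotically, which is where the γ < 1 slack in the theorem comes from.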

Dimension
Finite-state dimensions give a particularly sharp formulation of part 1 of the strong dichotomy theorem, along with a dual of this result.
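Finite-state dimensions were introduced for the uniform probability measure on Σ in [14,2] and extended to arbitrary probability measures on Σ in [26]. For each α ∈ ∆(Σ) and each $S\in\Sigma^{\omega}$, define the sets $\mathcal{G}^{\alpha}(S)$ and $\mathcal{G}^{\alpha}_{\mathrm{str}}(S)$ of all s for which there is a finite-state gambler G satisfying
$$\limsup_{w\to S}\alpha^{|w|}(w)^{1-s}\,d_{G,\alpha}(w) = \infty \quad\text{and}\quad \liminf_{w\to S}\alpha^{|w|}(w)^{1-s}\,d_{G,\alpha}(w) = \infty,$$
respectively. The limits superior and inferior here are taken for successively longer prefixes w ⊑ S. The "strong" subscript of $\mathcal{G}^{\alpha}_{\mathrm{str}}(S)$ refers to the fact that $\alpha^{|w|}(w)^{1-s}\,d_{G,\alpha}(w)$ is required to converge to infinity in a stronger sense than in $\mathcal{G}^{\alpha}(S)$.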
Then for $z\in\Sigma^{*}$ with z ⊑ S and |z| = ℓn, we have S,n (u) Therefore, Since the number of states is fixed, this implies $\dim^{\alpha}_{\mathrm{FS}}(S)\le 1-\text{Div}(S||\alpha)/c$. The proof of the other case is similar, where we use the fact that, for infinitely many n, $D(\pi^{(\ell)}_{S,n}||\alpha^{(\ell)}) > \ell t c$.