Abstract 1 Introduction 2 Algorithm overview 3 Lower bounds of f-divergences 4 Algorithm for 𝒇-divergence with small parameters distance 5 Algorithm for 𝝌𝜶-divergence 6 Algorithms for other 𝒇-divergences 7 Proof of the hardness result References

On Approximating the f-Divergence Between Two Ising Models

Weiming Feng ORCID School of Computing and Data Science, The University of Hong Kong, China Yucheng Fu ORCID School of Computing and Data Science, The University of Hong Kong, China
Abstract

The f-divergence is a fundamental notion that measures the difference between two distributions. In this paper, we study the problem of approximating the f-divergence between two Ising models, which is a generalization of recent work on approximating the TV-distance. Given two Ising models ν and μ, which are specified by their interaction matrices and external fields, the problem is to approximate the f-divergence Df(νμ) within an arbitrary relative error e±ε. For χα-divergence with a constant integer α, we establish both algorithmic and hardness results. The algorithm works in a parameter regime that matches the hardness result. Our algorithm can be extended to other f-divergences such as α-divergence, Kullback-Leibler divergence, Rényi divergence, Jensen-Shannon divergence, and squared Hellinger distance.

Keywords and phrases:
Ising model, f-divergence, approximation algorithms, randomized algorithms
Funding:
Weiming Feng: Early Career Scheme of the Hong Kong Research Grants Council under grant number 27202725.
Yucheng Fu: 2025 Summer Research Internship Programme in the School of Computing and Data Science at The University of Hong Kong.
Copyright and License:
[Uncaptioned image] © Weiming Feng and Yucheng Fu; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation Approximation algorithms analysis
; Mathematics of computing Markov networks
Related Version:
Full Version: https://arxiv.org/abs/2509.05016
Editor:
Shubhangi Saraf

1 Introduction

Let ν and μ be two distributions with the same support Ω. Let f:>00 be a convex function such that f(1)=0 and f is strictly convex around 1. The f-divergence of ν from μ is defined by

Df(νμ)=𝐄Xμ[f(ν(X)μ(X))]=σΩμ(σ)f(ν(σ)μ(σ)). (1)

The f-divergence is a very general notion that measures the difference between two distributions. For instance, f(x)=12|x1| gives the total variation distance (TV-distance), f(x)=x2x gives the χ2-divergence, and f(x)=xlnxx+1 gives the Kullback-Leibler (KL) divergence.

The Ising model is a fundamental graphical model in statistical physics, probability theory, and machine learning. Let G=(V,E) be a graph. Let JV×V be a symmetric interaction matrix such that Juv0 only if {u,v}E. Let hV be the external fields vector. An Ising model specified by (G,J,h) defines a Gibbs distribution μ with support Ω={1,+1}V such that

σ{1,+1}V,μ(σ)=wμ(σ)Zμ=exp(12σTJσ+hTσ)Zμ,where Zμ=σ{1,+1}Vwμ(σ).

The function wμ:Ω0 is called the weight function of the Ising model and the normalization factor Zμ is called the partition function of the Ising model.

Recently, the problem of computing the TV-distance between two high-dimensional distributions has received increasing attention. One interesting result [4] has proved that even for a pair of product distributions (Ising models on an empty graph), the exact computation of TV-distance is #P-hard. Later on, polynomial time approximation algorithms were proposed for product distributions [16, 18]. For general Ising models, both algorithmic and hardness results were established [6, 17] for approximating the TV-distance.

In this paper, we consider a more general problem of approximating the f-divergence between two Ising models. The problem is defined as follows.

Problem 1.

Approximating the f-divergence for two Ising models.

  • Input: Two Ising models (G,Jν,hν) and (G,Jμ,hμ) specifying two Gibbs distributions ν and μ respectively1111 assumes that ν and μ have the same underlying graph G. This assumption does not lose generality because the definition allows Juvν and Juvμ to be 0 even if {u,v}E., a function f defining the f-divergence, and an error bound ε;

  • Output: A number D^ such that eεDf(νμ)D^eεDf(νμ).

A randomized algorithm is said to be an FPRAS (fully polynomial randomized approximation scheme) for Problem 1 if it runs in time polynomial in n=|V| and 1/ε and with probability at least 23, the output approximates the value of the f-divergence within relative error e±ε.

Our algorithmic result is a reduction from f-divergence approximation to sampling and approximate counting, which are two fundamental computational tasks for the Ising model. There is long-line of research on developing efficient algorithms [27] for sampling and approximate counting. We assume the following abstract oracles for sampling and approximate counting.

Definition 2 (sampling and approximate counting oracles).

Let (G,Jμ,hμ) be an Ising model with Gibbs distribution μ and partition function Zμ. Let TGsp,TGct:>0[|V|+|E|,) be two non-increasing functions. Given any error bound ε>0,

  • The sampling oracle for (G,Jμ,hμ) with cost function TGsp returns a random sample X{1,+1}V in time TGsp(ε) with dTV(X,μ)ε, where dTV(X,μ) is the total variation distance between X and μ.

  • The approximate counting oracle for (G,Jμ,hμ) with cost function TGct returns a random number Z^μ in time TGct(ε) with 𝐏𝐫[eεZμZ^μeεZμ]0.99.

For functions TGsp(ε) and TGct(ε), we add the index G to emphasize that the running time also depends on parameters of graph G such as the number of vertices/edges and the maximum degree. We assume the oracles need to read the whole graph G so that the cost is at least |V|+|E|.

We also require the following mild assumption on the Ising models.

Definition 3 (marginal lower bound).

Let b0 be a constant. A Gibbs distribution μ is said to satisfy the b-marginal lower bound if for any ΛV, any pinning σ{1,+1}Λ, any vVΛ, and any c{1,+1}, it holds that μvσ(c)b, where μvσ() denotes the marginal distribution on v projected from μ conditional on that the configuration on Λ is fixed as σ.

The marginal lower bound condition is a natural and common assumption for the Ising model. The assumption is widely used in the literature of sampling and approximate counting [14], learning theory [10], and TV-distance approximation [17].

1.1 Algorithm and hardness results for χα-divergence approximation

Let α1 be a constant integer. The Df(νμ)=Dχα(νμ) is called the χα-divergence if f(x)=12|x1|α. We give the following approximation algorithm for the χα-divergence between two Ising models in Problem 1. Given two input Ising models (G,Jν,hν) and (G,Jμ,hμ), define the following family of Ising models on the graph G:

(ν,μ,α)={(G,J(k),h(k))J(k)kJν(k1)Jμh(k)khν(k1)hμ for integer 0kα}. (2)

We remark that (G,J(1),h(1)) and (G,J(0),h(0)) are the same as the input Ising models (G,Jν,hν) and (G,Jμ,hμ) respectively. The following theorem says that if all Ising models in admit sampling and approximate counting oracles, then the χα-divergence between two input Ising models can be approximated in polynomial time.

Theorem 4.

Let α1 be a constant integer and b(0,1) be a constant. There exists an FPRAS that solves Problem 1 for χα-divergence in time

Oα,b(TGct(ηα,bε(n+m)α)+(n+m)2αε2TGsp(ηα,bε2(n+m)2α)),

where n=|V|, m=|E|, and ηα,b>0 is a small constant depending only on α and b, if two input Ising Gibbs distributions μ and ν are both b-marginally bounded and all Ising models in (ν,μ,α) admit sampling and approximate counting oracles with cost functions TGsp() and TGct() respectively.

Let us consider a simplified case of the Ising model. Suppose G has the max degree Δ. For an Ising model (G,J,h), if for any e={u,v}E, if Juv=Jvu=lnβ2 take a unified value. In this case, an Ising model admits sampling and approximate counting oracles with cost function TGsp(ε)=poly(n,log1ε) and TGct(ε)=poly(n,1ε) if one of the following two conditions holds:

  • Uniqueness condition: Δ2ΔβΔΔ2 and an arbitrary external field hV [13].

  • Uniqueness condition or ferromagnetic condition with zero external field: βΔ2Δ and the zero external field h=𝟎 [13, 21].

Consider Problem 1 where two input Ising models both have unified values lnβν2 and lnβμ2 in the interaction matrices Jν and Jμ. Assume βν, βμ, hν, hμ, and the max degree Δ are all constants. Then the marginal lower bound condition is satisfied. We have the following corollary.

Corollary 5.

Let α1 be a constant integer. For Ising models with unified values in interaction matrices, there exists an FPRAS that solves Problem 1 for χα-divergence in time poly(n,1ε) if βν, βμ, hν, hμ, and Δ are all constants and one of the following three conditions holds:

  • Both βμ and (βνβμ)αβμ are in [Δ2Δ,ΔΔ2].

  • βνβμΔ2Δ and hν=hμ=𝟎.

  • βν<βμ, (βνβμ)αβμΔ2Δ, and hν=hμ=𝟎.

The corollary requires different conditions for the case βνβμ and the case βν<βμ. It is reasonable because for α>1, the χα-divergence is not symmetric Dχα(νμ)Dχα(μν), and thus the roles of two input Ising models are different.

In Theorem 4, we show that χα-divergence can be approximated if the sampling and approximate counting oracles for all Ising models in (ν,μ,α) exist. In a special case when α=1, (ν,μ,1) only contains two input Ising models (G,Jν,hν) and (G,Jμ,hμ), where we recover the result in [17] for approximating the TV-distance. However, for general α>1, the family (ν,μ,α) contains other Ising models. It is natural to ask the following questions.

  • Is it possible to approximate χα-divergence if we only assume the sampling and approximate counting oracles for the two input Ising models (G,Jν,hν) and (G,Jμ,hμ) exist?

  • A more direct question: are the oracle assumptions in Theorem 4 necessary?

The above two questions are answered by the following hardness result. Let us restrict our attention to Ising models (G,J,h) with zero external field and interaction matrix with unified values. Formally, h=𝟎 and Juv=Jvu=lnβ2 for all {u,v}E. Denote the model by (G,β). Let (G,βν) and (G,βμ) denote two input Ising models in Problem 1.

Theorem 6.

Fix an integer α2, an integer Δ3, and two parameters βμ>βνΔ2Δ such that (βνβμ)αβμ<Δ2Δ. Unless 𝖭𝖯=𝖱𝖯, there is no FPRAS for χα-divergence for two Ising models (G,βν) and (G,βμ) on Δ-regular graph G.

The above theorem considers two Ising models with zero external field and βν,βμΔ2Δ. Two input Ising models both admit polynomial time sampling and approximate counting oracles because they are either in the uniqueness regime β[Δ2Δ,ΔΔ2] [13] or the ferromagnetic regime β1 [21]. However, if (βνβμ)αβμ<Δ2Δ, approximating χα-divergence is still hard. For a concrete example, the problem is hard when α2, βν=Δ2Δ, and βμ=2βν=2(Δ2)Δ.

Theorem 6 also shows that the condition in Theorem 4 is necessary. Note that since Δ,βν and βμ are all constants and the external fields are zero, the hardness instances satisfy the marginal lower bound condition with b=b(Δ,βν,βμ)=Ω(1). Consider two Ising models (G,βν) and (G,βμ) on Δ-regular graph G such that βμ>βνΔ2Δ. For this family of instances, our algorithmic result in Corollary 5 together with the hardness result in Theorem 6 discovers the following computational phase transition for approximating Dχα(νμ) when α2:

  • FPRAS exists when (βνβμ)αβμΔ2Δ;

  • Unless 𝖭𝖯=𝖱𝖯, there is no FPRAS when (βνβμ)αβμ<Δ2Δ.

Theorem 6 is proved via the hardness results of approximating partition functions of anti-ferromagnetic Ising models beyond the uniqueness regime [33, 19]. See Section 7 for details.

1.2 Algorithms for other 𝒇-divergences

We say an Ising model (G,J,h) admits polynomial time sampling and approximate counting oracles if it admits sampling and approximate counting oracles with cost function TGsp(ε)=poly(nε) and TGct(ε)=poly(nε) respectively. Let (G,Jν,hν) and (G,Jμ,hμ) denote two input Ising models. We give algorithms for approximating the f-divergence Df(νμ) for various divergence functions f. For many functions f, the divergence Df(νμ) is not symmetric for ν and μ. We will refer to (G,Jν,hν) as the first input Ising model and (G,Jμ,hμ) as the second input Ising model.

By setting different functions f in Df(νμ), we can obtain the following divergences:

f(x)={xlnxx+1,Kullback-Leibler divergence;lnx+x1,Rényi divergence;12(xlnx(x+1)lnx+12)Jensen-Shannon divergence.
Theorem 7.

There exists an FPRAS for Problem 1 with KL-divergence, Rényi-divergence, and Jensen-Shannon divergence if two input Ising models are both marginally bounded and both admit polynomial time sampling and approximate counting oracles.

Compared with the result in Theorem 4, the above theorem only requires the sampling and approximate counting oracles for the two input Ising models exist.

Next, consider the α-divergence. When α=0, the α-divergence refers to the KL-divergence. When α=1, the α-divergence refers to the Rényi-divergence. The α-divergences for other values of α are defined as follows. Let α0,1. The α-divergence is defined as f(x)=xααx(1α)α(α1). Recall that in (2), J(α)=αJν(α1)Jμ and h(α)=αhν(α1)hμ. Here, α can be a real number.

Theorem 8.

Let α0,1 be a constant real. There exists an FPRAS for Problem 1 with α-divergence if two input Ising models are both marginally bounded, the second input Ising model (G,Jμ,hμ) admits a polynomial time sampling oracle, and both two input Ising models together with (G,J(α),h(α)) admit polynomial time approximate counting oracles.

The above theorem works for constant real α (assuming the computational model supports real arithmetic). Unlike χα-divergence, in addition to μ and ν we only require that the approximate counting oracle for (G,J(α),h(α)) exists.

The squared Hellinger distance is defined by setting f(x)=12(x1)2. Given two input Ising models (G,Jν,hν) and (G,Jμ,hμ), define an averaged Ising model as (G,Javg,havg), where Javg=12(Jν+Jμ) and havg=12(hν+hμ). We have the following theorem.

Theorem 9.

There exists an FPRAS for Problem 1 with squared Hellinger distance if two input Ising models are both marginally bounded, one of the input Ising models admits a polynomial time sampling oracle, and two input Ising models together with the averaged Ising model all admit polynomial time approximate counting oracles.

The squared Hellinger distance is symmetric Df(νμ)=Df(μν). Hence, we only require one of the input Ising models admits a polynomial time sampling oracle. As a corollary, if two input Ising models are both in the uniqueness regime, then their averaged Ising model is also in the uniqueness regime and FPRAS exists for the squared Hellinger distance. For two Ising models (G,βν) and (G,βμ) with zero external field, FPRAS exists when both βν,βμΔ2Δ.

1.3 Related work and open problems

Related work

The problem of computing the f-divergence between two (either discrete or continuous) distributions is well-motivated by the applications in machine learning and statistics. There are many works, both theoretical and practical, studying the algorithms in various settings. Many works study the error between the true divergence and the divergence computed by certain empirical distributions, e.g., [30, 22, 31, 34]. Embedding-base technique was also studied in [1], where algorithm embeds a large sample space into a smaller one while preserving the f-divergence. We cannot directly use these techniques to Ising model as the size of the sample space is 2n, which gives an exponential time algorithm. For graphical models, [26] shows that αβ-divergences can be computed exactly when the graphical model has a bounded treewidth.

Approximating the f-divergence is related to the testing problem, in which one or both two distributions are given by the access to different types of sampling oracles, and the algorithm is required to test certain divergence is large or small. There is a long line of work studying the testing problem. See [11, 12] for a comprehensive survey.

The TV-distance dTV(,) is a special case of f-divergences. There are many theoretical works studying the approximation algorithms (either additive error or relative error) and hardness of approximation for the TV-distance between various types of high-dimensional distributions. For example, distributions specified by circuits [32], hidden Markov models [23], product distributions [4, 16, 18, 24], graphical models [8, 5, 17], and high-dimensional Gaussian distributions [8, 3]. Recently, there are some works focusing on approximating 1dTV(,) for two product distributions [25, 7].

Open problems

A natural direction is to consider more general distributions and more general divergences. Here are a few examples. For the Ising model, how to remove the marginal lower bound assumption? There are many other types of graphical models, such as Markov random fields with high-order interactions and Bayesian networks. It is interesting to consider other types of distributions, e.g., distributions generated by probabilistic circuits [2]. For the divergence, we focus on χα-divergence when α is an integer. One can also consider the case when α is a real number and give an algorithm that works in the tight parameter regime. Another question is to give a more general algorithm that works for a larger family of f-divergences.

The algorithm given in this paper is randomized. An open question is to find deterministic algorithms by exploring the deterministic approximate counting algorithms for the Ising model [28, 29]. Currently, deterministic algorithms are known for approximating the TV-distance between two product distributions [4, 18].

We can also study the same problem in different settings. In our setting, we assume that the input gives the description of two Ising models. It is interesting to consider another (perhaps more practical) setting that the algorithm can only access one or two models through random samples. This setting is closely related to the Ising testing problem in [15]. One can even consider abstract distributions with certain properties (e.g. approximate tensorization of f-divergence) and the distribution is given by the access to some oracles (e.g., oracle for querying conditional marginal distributions or querying probability masses). This abstract setting was considered by recent work in testing [9, 20].

2 Algorithm overview

We give an overview of our algorithm for approximating the χα-divergence between two Ising models. We start with the following definition of the parameter distance between two Ising models.

Definition 10 (parameter distance [17]).

For two Ising models (G,Jν,hν) and (G,Jμ,hμ), the parameter distance dpar(ν,μ) is defined by

dpar(ν,μ)max{JνJμmax,maxv|hν(v)hμ(v)|deg(v)+1},

where deg(v) is the degree of v in G.

In [17], the parameter distance is used to give a lower bound of TV-distance dTV(ν,μ), with which the previous work gave an FPRAS for TV-distance. The following lemma says such a lower bound can be generalized to χα-divergence. The lemma is proved in Section 3.1, where the proof works for a general family of f-divergence.

Lemma 11 (χα-divergence lower bound).

For f=12|x1|α, where α1, it holds that

Dχα(νμ)b2α2dpar(ν,μ)α.

Let θ=1poly(n) be a threshold. We estimate Dχα(νμ) in the following two cases.

  • Case dpar(ν,μ)>θ. This case is trivial for TV-distance because there is a very simple additive-error approximation algorithm for the TV-distance [8]. Since TV-distance itself is lower bounded by 1/poly(n), the additive error approximation can be easily transferred to relative error approximation. However, this simple algorithm only works for the TV-distance. Instead, we propose a new algorithm for approximating the general χα-divergence. The algorithm is outlined in Section 2.1.

  • Case dpar(ν,μ)θ. In this case, two Ising models are similar to each other. Previous work [17] has explored the similarity of two models and designed an algorithm for approximating their TV-distance. We substantially generalize their algorithm to approximate a family f-divergence (including χα-divergence). The algorithm is outlined in Section 2.2.

2.1 The algorithm for instances with large parameter distance

We first state the challenge of approximating the χα-divergence for general α compared to the TV-distance. It is well-known that the TV-distance can be written as follows:

dTV(ν,μ)=12𝐄Xμ[|ν(X)μ(X)1|]=()σ:ν(σ)<μ(σ)μ(σ)(1ν(σ)μ(σ)). (3)

It was observed in [8] that the TV-distance can be approximated by draw independent samples Xμ and then taking the average of Y=max{0,1ν(X)μ(X)}222The value of Y can be computed approximately, which is sufficient for the purpose of approximation.. The above equation shows 𝐄[Y]=dTV(ν,μ). It is easy to see |Y|1 and thus 𝐕𝐚𝐫[Y]1. The simple algorithm achieves the additive error approximation, which can be transferred to relative-error approximation because dTV(ν,μ)1/poly(n) is lower bounded in this case.

However, for the χα-divergence with general α, by the definition,

Dχα(νμ)=12σΩμ(σ)|ν(σ)μ(σ)1|α=12σΩ|μ(σ)ν(σ)||ν(σ)μ(σ)1|α1.

Due to the extra term |ν(σ)μ(σ)1|α1, we cannot apply a similar transformation as equation () in (3). Hence, it is not clear how to design an unbiased estimator Y such that variance of Y is small.

To overcome this challenge, we propose a new algorithm for approximating the χα-divergence. The following lemma gives a lower bound for the χα-divergence in this case.

Lemma 12.

Let 0<θ,b1. If dpar(ν,μ)θ and both ν and μ are b-marginally bounded, then

Dχα(νμ)Bα,b(θ)σ{±}Vμ(σ)(ν(σ)μ(σ)+1)α,

where Bα,b(θ)=b2αθα2((b2αθα2)1/(α+1)+2)α(2(b2αθα2)1/(α+1)+1)1=Θα,b(θα).

The proof of Lemma 12 separates all σΩ into two parts, and separately lower bound their contributions to the χα-divergence. The detailed proof is deferred to Section 3.2.

Lemma 12 gives us a tool to control the error of approximation. Recall that our goal is to design an algorithm that approximates the χα-divergence with relative error e±ε. Equivalently, the algorithm should achieve the O(ε)Dχα(νμ) additive error. Using the above lemma, it is sufficient to show that the algorithm can achieve the O(ε)Bα,b(θ)σΩμ(σ)(ν(σ)μ(σ)+1)α additive error approximation, which turns out to be much easier to prove. In addition to algorithms, Lemma 12 also plays an important role in the proof of the hardness result. See Section 7 for details.

2.1.1 The algorithm for even 𝜶

Our actual algorithm deals with all α1 in a unified way. However, in the overview, we exhibit a simple algorithm for the case where α is even. The simple case will illustrate the intuition why we need to consider a family of Ising models (ν,μ,α) in (2) and why Lemma 12 can help us to control the error of approximation. When α is even, we can replace |ν(σ)μ(σ)1|α with (ν(σ)μ(σ)1)α and then use the binomial expansion to get the following equation:

Dχα(νμ)=12σΩμ(σ)(ν(σ)μ(σ)1)α=12k=0α(αk)(1)αkσΩνk(σ)μk1(σ). (4)

To deal with the term νk(σ)μk1(σ), recall that (2) defines a family of Ising models (G,J(k),h(k)) such that J(k)kJν(k1)Jμ and h(k)khν(k1)hμ.

By our assumption in Theorem 4, for all 0kα, the Ising model (G,J(k),h(k)) admits sampling and approximate counting oracles. Let Zk denote the partition function of the Ising model (G,J(k),h(k)). We remark that Zν=Z1 and Zμ=Z0 are the partition functions of input Ising model (G,Jν,hν) and (G,Jμ,hμ), respectively. Each term in the summation of (4) can be written as

σΩνk(σ)μk1(σ)=Zμk1ZνkσΩwνk(σ)wμk1(σ)=Zμk1ZνkσΩexp(σTJ(k)σ2+σTh(k))=Zμk1ZkZνk. (5)

The above ratio can be approximated by using approximate counting oracles for Zν, Zμ, and Zk. Note that 0kα is a constant. By choosing the relative approximation error O(Bα,b(θ)ε)=poly(ε/n) in the oracle small enough, where Bα,b(θ)=poly(1/n) is the parameter in Lemma 12, we can obtain a value Wke±O(Bα,b(θ)ε)Zμk1ZkZνk that achieves the following approximation guarantee

|WkZμk1ZkZνk|Bα,b(θ)εZμk1ZkZνk, (6)

where in the above inequality, we write the relative error in terms of the additive error. Finally, our algorithm outputs the number D^ such that

D^=12k=0α(αk)(1)αkWk.

Note that d^ is an interlacing of plus and minus. Using (4) and (6), in the worst case (the additive error of every term takes the worst sign), the error of d^ can be upper bounded as follows

|D^Dχα(νμ)| εBα,b(θ)k=0α(αk)Zμk1ZkZνk=εBα,b(θ)k=0α(αk)σΩνk(σ)μk1(σ)
=εBα,b(θ)σΩμ(σ)(ν(σ)μ(σ)+1)αLemma 12εDχα(νμ).

2.1.2 The algorithm for general 𝜶

For general α, we need to consider the effect of the absolute value inside the divergence. The χα-divergence Dχα(νμ)=12σΩμ(σ)|ν(σ)μ(σ)1|α can be written as

Dχα(νμ)=12σ:ν(σ)>μ(σ)μ(σ)(ν(σ)μ(σ)1)α12σ:ν(σ)<μ(σ)μ(σ)(ν(σ)μ(σ)1)α.

Using the binomial expansion, the divergence can be written as

Dχα(νμ)=12k=0α(1)αk(αk)(σ:ν(σ)>μ(σ)νk(σ)μk1(σ)+(1)ασ:ν(σ)<μ(σ)νk(σ)μk1(σ)).

Fix an integer 0kα. Let X be a random sample from the Ising model (G,J(k),h(k)). It holds that 𝐏𝐫[X=σ]νk(σ)μk1(σ). Define Wk+ and Wk to be random variables such that

Wk+=𝟏[ν(X)>μ(X)]Z0k1ZkZ1kandWk=𝟏[ν(X)<μ(X)]Z0k1ZkZ1k. (7)

With some calculation, we can verify that

σ:ν(σ)>μ(σ)νk(σ)μk1(σ)+(1)ασ:ν(σ)<μ(σ)νk(σ)μk1(σ)=𝐄[Wk+]+(1)α𝐄[Wk].

Ideally, our algorithm wants to draw T=poly(lognε) independent random samples of Wk+ and then uses their average value W^k+ to approximate 𝐄[Wk+]. Similarly, the algorithm computes W^k to approximate 𝐄[Wk]. Finally, it outputs D^=12k=0α(1)αk(αk)(W^k++(1)αW^k). However, to show the correctness of the algorithm, we need to deal with the following two types of errors.

  • The algorithm cannot draw perfect samples of Wk+ or Wk. One major issue333There are some other issues, e.g., the samples from the Ising model (G,J(k),h(k)) are approximate and partition functions Z0,Z1,Zk in (7) can only be approximated. is that given a sample XIsing(G,J(k),h(k)), the exact computation of the probability masses ν(X) and μ(X) is #P-hard [21]. Hence, the algorithm cannot perfectly distinguish the cases where ν(X)>μ(X) or ν(X)<μ(X). This issue can be solved by using the approximate counting oracle to approximate the probability masses ν(X) and μ(X). By choosing a small enough error parameter poly(n/ε) for approximate counting, the algorithm misclassifies ν(X)>μ(X) or ν(X)<μ(X) only if |ν(X)μ(X)1|poly(ε/n). We need to show that such X does not contribute much to the error of approximation.

  • We use the average of W^k+ and W^k to approximate 𝐄[Wk+] and 𝐄[Wk] respectively. We need to control the concentration error. This error can be bounded via Lemma 12, using a similar technique as described in the even α case.

The detailed algorithm and analysis is in Section 5.1.

2.2 The algorithm for instances with small parameter distance

In this case, we give a general algorithm for abstract f-divergence. We briefly sketch the algorithm implemented for χα-divergence in the overview. The abstract one is in Section 4. By the definition,

Dχα(νμ)=12σ{±}Vμ(σ)|ν(σ)μ(σ)1|α=12σ{±}Vμ(σ)|wν(σ)wμ(σ)ZμZν1|α. (8)

Define a random variable Wwν(X)wμ(X), where Xμ. It can be verified that 𝐄[W]=ZνZμ and

Dχα(νμ)=𝐄[|W𝐄[W]1|α].

The algorithm draws T=poly(n/ε) samples W1,,WT from W independently, and then computes W¯=1Ti=1TWi and D^=1Ti=1T12|WiW¯1|α. Since the parameter distance is small, two weight functions wν() and wμ() are very similar. Hence, the ratio W=wν(X)wμ(X)1. Formally,

σΩ,|wν(σ)wμ(σ)𝐄[W]|αpoly(n)dpar(ν,μ)αLemma 11poly(n)Dχα(νμ).

The above property guarantees that W is well-concentrated around its mean. By choosing the number of samples T=poly(n/ε) large enough, with some calculation, we can show that D^ approximates Dχα(νμ) with a relative error of e±O(ε) with high probability. We remark that D^ is not necessarily an unbiased estimator for Dχα(νμ) but it is still a good approximation.

2.3 Organization of the paper

In Section 3, we prove some lower bounds for a broad family of f-divergences. In Section 4, we give a general algorithm for general f-divergences satisfying an abstract condition. In Section 5, we give the detailed algorithm for χα-divergence, where we focus on the large parameter distance case because the small parameter distance case is solved in Section 4. In Section 6, we show how to extend our algorithm to other f-divergences. Finally, in Section 7, we prove the hardness result for approximating χα-divergence in the parameter regime stated in Theorem 6.

3 Lower bounds of f-divergences

In this paper, we assume the function f in f-divergence satisfies the following assumption.

Assumption 1.

The convex continuous function f:(0,+)[0,) satisfies that f(1)=0 and for any x1, f is twice differentiable at x such that f′′(x)0. Furthermore, the derivative satisfies f(x)<0 if 0<x<1 and f(x)>0 if x>1.

For any function f satisfying the above assumption, x=1 is the unique point where f(x)=0. It is straightforward to verify that f is strictly convex at around 1. The f-divergences we are interested in all satisfy the above assumption.

3.1 Lower bound in terms of parameter distance

Our proof uses the following lower bound of total variation distance.

Lemma 13 ([17, Lemma 15]).

For two Ising models, if both ν and μ are b-marginally bounded, then their total variation distance is lower bounded by

dTV(ν,μ)b22dpar(ν,μ).
Lemma 14.

Suppose f satisfies Assumption 1. For two Gibbs distributions ν,μ of Ising models,

Df(νμ) max{f(1dTV(ν,μ)),f(1+dTV(ν,μ))}
max{f(1b22dpar(ν,μ)),f(1+b22dpar(ν,μ))}.
Proof of Lemma 14.

Let A{xΩ:ν(x)>μ(x)}. Let p=𝐏𝐫Xν[XA] and q=𝐏𝐫Xμ[XA]. It is well-known that the total variation distance between ν and μ is dTV(ν,μ)=pq. Since we consider the Ising model, p,q>0. Consider a Markov kernel K:Ω{0,1} such that given any σΩ, K deterministically transforms σ to 1 if and only if σA. Note that νK and μK are Bernoulli distributions with parameters p and q respectively. We have

dTV(ν,μ)=dTV(νK,μK)=pq.

By using the data processing inequality for f-divergence, we have

Df(νμ) Df(νKμK)
=qf(pq)+(1q)f(1p1q)
(f(1p1q)f(1)=0) qf(pq)+(1q)f(1)
(Jensen’s inequality on f) f(1+pq)=f(1+dTV(ν,μ)).

Similarly, we use f(p/q)f(1) to get Df(νμ)f(1dTV(ν,μ)).

Note that f(x)<0 if 0<x<1 and f(x)>0 if x>1. Combining with Lemma 13, we have

f(1dTV(ν,μ))f(1b22dpar(ν,μ));f(1+dTV(ν,μ))f(1+b22dpar(ν,μ)).

Lemma 11 for χα-divergence can be proved as follows.

Proof of Lemma 11.

For χα-divergence, we have f(x)=12|x1|α. We can get a slightly better lower bound than the general result in Lemma 14. Note that f(1x)=f(1+x). Then f(1p1q)=f(1+p2q1q). Using the Jensen’s inequality on qf(pq)+(1q)f(1+p2q1q) implies the a lower bound Dχα(νμ)f(1+2dTV(ν,μ))=2α1dTV(ν,μ)α. Then lemma follows from Lemma 13.

3.2 A lower bound for χα-divergence

Proof of Lemma 12.

Let 0t<1 be a parameter to be fixed later. We partition the whole space Ω={±}V into M and M¯=ΩM such that

M={σ|ν(σ)μ(σ)1|t}.

Then, we bound the contribution of σM and σM¯ separately. By triangle inequality, we have ν(σ)μ(σ)+12+|ν(σ)μ(σ)1|. For the first case, we have

σMμ(σ)(ν(σ)μ(σ)+1)α σMμ(σ)(2+t)α(2+t)α.

By our assumption, dpar(ν,μ)θ. Using Lemma 11, we have Dχα(νμ)b2α2θα. Hence,

σMμ(σ)(ν(σ)μ(σ)+1)α2(2+t)αb2αθαDχα(νμ). (9)

Consider the other set M¯. For all σM¯, it holds that

ν(σ)μ(σ)+1|ν(σ)μ(σ)1|1+2|ν(σ)μ(σ)1|1+2t.

Summing over all σM¯, we have

σM¯μ(σ)(ν(σ)μ(σ)+1)α (1+2t)ασM¯μ(σ)|ν(σ)μ(σ)1|α
2(1+2t)αDχα(νμ). (10)

Finally, adding the two inequalities (9) and (3.2), we have

σΩμ(σ)(ν(σ)μ(σ)+1) (2(2+t)αb2αθα+2(1+2t)α)Dχα(νμ)
=g(t)Dχα(νμ). (11)

The above inequality holds for any 0t1. We next find a value of t to minimize the value of g(t).

Take the derivative of g, we get

g(t)=2α(1+2t)α1(2t2)+2αb2αθα(t+2)α1=2α(t+2)α1(1b2αθα2tα+1).

Thus when t=(b2αθα2)1/(α+1), where 0<t<1,the function g obtains its minimum value:

g(t) =2(2+t)αb2αθα+2(1+2t)α=(2+t)αtα+1+2(1+2t)α
=(1+2t)α(2+1t)=(t+2)α(2t+1)tα+1
=2b2αθα((b2αθα2)1/(α+1)+2)α(2(b2αθα2)1/(α+1)+1)=1Bα,b(θ).

Combining the above equation with (3.2) implies that σΩμ(σ)(ν(σ)μ(σ)+1)Dχα(νμ)Bα,b(θ). This proves the lower bound of the χα-divergence.

4 Algorithm for 𝒇-divergence with small parameters distance

In this section, we generalize the algorithm in [17] to the case of small parameter distance. The new algorithm works for general f-divergence, where f satisfies the following abstract condition.

Condition 15.

Let f be a function satisfying Assumption 1. There exists an function F:++ with F(ζ)poly(ζ) such that for any ζ1, any x(12ζ,0)(0,12ζ),

xf(1+ζx)f(1+x)F(ζ).
Theorem 16.

Let f be a function satisfying Condition 15 with function F. There exists an algorithm such that given two Ising models (G,Jν,hν) and (G,Jμ,hμ), any 0<ε<1, and f,b,F, if ν and μ are b-marginally bounded, dpar(ν,μ)<θ=110(n+3m), and μ admits sampling oracle with cost functions TGsp(), then it returns a random number D^ in time O(TTGsp(1100T)), where T=O(F(8(n+3m)/b2)2(n+m)2ε2), n=|V|, m=|E|, such that

𝐏𝐫[eεDf(νμ)D^eεDf(νμ)]23.
 Remark 17.

In Theorem 16, the algorithm for small parameter case only requires sampling oracles for the input Ising model (G,Jμ,hμ) instead of sampling and approximate counting oracles for the whole family of Ising models .

By the definitions of f-divergence and Ising model, we have

Df(νμ)=𝐄μ[f(ν(σ)μ(σ))]=𝐄μ[f(wν(σ)wμ(σ)ZμZν)].

Define the parameter

T212103F(8(n+3m)/b2)2(n+3m)2b4ε2.
  • Call the sampling oracle of μ with TV-distance error 1100T to obtain independent random samples σ^1,σ^2,,σ^T. For each i[T], compute W^i=wν(σ^i)wμ(σ^i).

  • Return D^=1Ti=1Tf(W^i/W¯), where W¯=1Ti=1TWi^.

The analysis of the algorithm is given in the full version of the paper.

5 Algorithm for 𝝌𝜶-divergence

Our algorithm first computes the marginal lower bound b in time O(n+m). Due to the conditional independence, the algorithm only need to enumerate all vV, any c{1,+1}, and find a worst pinning τ=τ(v,c) on all neighbor of v. Specifically, for any neighbor u of v, if Juv>0, then fix τu=c; otherwise, fix τu=c. The algorithm compute μvτ(c). Let b be the minimum value over all vV and c{1,+1}. We now assume that the value b is known by the algorithm.

Let θ=110(n+3m) be the threshold parameter. We give two algorithms in Section 5.1 and Section 5.2 depending on whether dpar(ν,μ)>θ or not. We prove Theorem 4 in Section 5.3.

5.1 Large parameters distance case

Lemma 18.

Let α1 be an integer and b(0,1) be a constant. There exists an algorithm such that given two Ising models (G,Jν,hν) and (G,Jμ,hμ), any 0<ε<1, and α,b, if ν and μ are b-marginally bounded, dpar(ν,μ)>θ, and every Ising model in admits sampling and approximate counting oracles with cost function TGsp() and TGct() respectively, then it returns a random D^ in time Oα,b(TGct(δ)+T(TGsp(1200T(α+1)))), where δ=Θα,b(θαε) and T=Θα,b(1ε2θα) such that

𝐏𝐫[eεDχα(νμ)D^eεDχα(νμ)]23.

By the definition of χα-divergence, we can rewrite

Dχα(νμ)=12σ:ν(σ)>μ(σ)μ(σ)(ν(σ)μ(σ)1)α+12σ:ν(σ)<μ(σ)μ(σ)(1ν(σ)μ(σ))α
= 12k=0α(1)αk(αk)(σ:ν(σ)>μ(σ)νk(σ)μk1(σ)+(1)ασ:ν(σ)<μ(σ)νk(σ)μk1(σ)).

Recall that Zk is the partition function of the Ising model (G,J(k),h(k)), where J(k)kJν(k1)Jμ and h(k)khν(k1)hμ. Note that Zμ=Z0 and Zν=Z1. Define two random variables Wk+ and Wk as follows. Let π(k) be the Gibbs distributions of the Ising model (G,J(k),h(k)).

Wk+=𝟏[ν(X)>μ(X)]Z0k1ZkZ1k,whereXπ(k).
Wk=𝟏[ν(Y)<μ(Y)]Z0k1ZkZ1k,whereYπ(k).

The expectation of Wk+ can be calculated as follows

𝐄[Wk+] =σ:ν(σ)>μ(σ)π(k)(σ)Z0k1ZkZ1k=σ:ν(σ)>μ(σ)wνk(σ)/wμk1(σ)ZkZ0k1ZkZ1k
=σ:ν(σ)>μ(σ)νk(σ)μk1(σ).

Similarly, we have 𝐄[Wk]=σ:ν(σ)<μ(σ)νk(σ)μk1(σ). In a high level, our algorithm draws approximate samples of Wk+ and Wk and estimate 𝐄[Wk+]+(1)α𝐄[Wk]. Define the parameter

T 8104(α+1)ε2Bα,b(θ)2andδBα,b(θ)ε20(α+1),

where the function Bα,b(θ) is defined in Lemma 12.

  • For 0kα, call the approximate counting oracle on Ising model (G,J(k),h(k)) for O(logα) times independently with relative error δ, and take the median as Z^k.

  • For each k from 0 to α do:

    1. 1.

      Draw 2T samples σ^1,,σ^2Tπ(k) by the sampling oracle with error 1200T(α+1).

    2. 2.

      For each i[T], compute two values

      W^k,i+=𝟏[ν^(σ^i)>μ^(σ^i)]Z^0k1Z^kZ^1k, and
      W^k,i=𝟏[ν^(σ^T+i)<μ^(σ^T+i)]Z^0k1Z^kZ^1k,

      where for any xΩ, ν^(x)=wν(x)Z^1 and μ^(x)=wμ(x)Z^0.

  • Return D^=12k=0α(1)αk(αk)(1Ti=1TW^k,i++(1)α1Ti=1TW^k,i).

The analysis of the algorithm is given in the full version of the paper.

5.2 Small parameters distance case

With Theorem 16, we only need to verify Condition 15 for f=12|x1|α. By definition,

xf(1+ζx)f(1+x)=α2|x|α|ζ|α112|x|α=αζα1.

Set F(ζ)=αζα1, F(ζ) satisfies Condition 15. Thus

F(8(n+3m)/b2)=23α3α(n+3m)α1b2α2.

Hence, using Theorem 16, there is an algorithm with running time O(TTGsp(1100T)), where T=O(26α6α2(n+3m)2αb4α4ε2)=Oα,b((n+3m)2αε2), for small parameter distance case dpar(ν,μ)<θ=110(n+3m).

5.3 Putting everything together (Proof of Theorem 4)

Theorem 4 follows from Theorem 16 and Lemma 18. Define parameter T=Θα,b((n+m)2αε2) and δ=Θα,b(ε(n+m)α). By choosing constants, we can make T large enough and δ small enough. The preprocessing time for computing b and dpar(ν,μ) is O(n+m), which is dominated by the time for sampling and approximate counting. The running time of the whole algorithm is at most

max{Oα,b(TGct(δ)+TTGsp(1200T(α+1))),Oα,b(TTGsp(1100T))}
Oα,b(TGct(ηα,bε(n+m)α)+(n+m)2αε2TGsp(ηα,bε2(n+m)2α)),

where the last inequality comes from the fact that both TGct and TGsp are non-increasing functions and ηα,b>0 is a small enough constant depending only on α and b.

6 Algorithms for other 𝒇-divergences

In this section, we briefly show the algorithms for other f-divergences. The algorithms prove the results in Theorem 7, Theorem 8, and Theorem 9.

Again, the algorithm first computes the parameter distance dpar(ν,μ). Then, it compares dpar(ν,μ) to the threshold θ=110(n+3m). If dpar(ν,μ)θ, algorithms are given in Section 6.1. Otherwise dpar(ν,μ)>θ, algorithms are given in Section 6.2.

6.1 Algorithms for small parameter distance

In this case, we use the abstract algorithm in Theorem 16. We only need to verify Condition 15 for f-divergences. The following lemma gives a sufficient condition for Condition 15.

Lemma 19.

Suppose f:>00 is twice differentiable at every x>0 such that f′′(x)>0 and f(1)=f(1)=0. If there exists two positive constants L,U>0 such that Lf′′(x)U for any x[12,32]. Then, the function f satisfies Condition 15 with F(ζ)=2ULζ.

Proof.

Note that f satisfying the condition in the lemma satisfies Assumption 1. By condition that f(1)=f(1)=0, we have the following two equalities (0x=x0 if x<0):

x(12,12),f(1+x)=0xf′′(1+u)𝑑u,f(1+x)=0x(xu)f′′(1+u)𝑑u. (12)

Since 0<Lf′′(ζx)U for ζ>1 and x[12ζ,0)(0,12ζ], we have

|f(1+ζx)| =|0ζxf′′(1+u)𝑑u|ζ|x|U,
f(1+x) =0x(xu)f′′(1+u)𝑑uL0x(xu)𝑑u=Lx22.

Thus, Condition 15 can be verify as follows

xf(1+ζx)f(1+x)=|x||f(1+ζx)|f(1+x)|x|ζ|x|ULx2/2=2ULζ.

Then, we using Lemma 19 verify Condition 15 for Kullback-Leibler, Rényi, Jensen-Shannon, α-divergence, and Squared Hellinger divergence. The result is summarized as the following table.

Divergence f(x) f′′(x) L U F(ζ)
Kullback-Leibler xlnxx+1 1x 23 2 6ζ
Rényi lnx+x1 1x2 49 4 18ζ
Jensen-Shannon 12(xlnx(x+1)lnx+12) 12x(x+1) 215 23 10ζ
α-divergence (α0,1) xααx(1α)α(α1) xα2 2|α2| 2|α2| 24|α2|ζ
Squared Hellinger 12(x1)2 12x 16 12 23ζ

We remark that Lemma 19 does not work for χα-divergence but it works for divergences above.

The algorithm then follows from Theorem 16 such that to approximate Df(νμ), it only requires polynomial time sampling oracle for the input distribution μ. For squared Hellinger distance, as it is symmetric Df(νμ)=Df(μν), we can swap the roles of ν and μ in the algorithm to make sure μ admits a polynomial time sampling oracle.

6.2 Algorithms for large parameter distance

Now, assume that the parameter distance dpar(ν,μ)>θ=110(n+3m). By Lemma 14, the f-divergence is at least

max{f(1b22dpar(ν,μ)),f(1+b22dpar(ν,μ))}max{f(1b22θ),f(1+b22θ)},

where the inequality holds because of the derivative assumptions in Assumption 1. Let δ=b22θ=1poly(n). We show that all divergences in the above table are at least 1poly(n). Kullback-Leibler and Rényi divergences can be verified by using Taylor series ln(1+δ)=δδ22+O(δ3). For Jensen-Shannon, one can rewrite f(1+δ)=(1+δ)ln(1+δ2+δ)ln(1+δ2) and then Taylor series shows f(1+δ)=Ω(δ2). For α-divergence, expanding (1+δ)α for real α implies f(1+δ)=δ22±O(δ3)=Ω(δ2). Finally, it is easy to verify that f(1+δ)=Ω(δ2) for Squared Hellinger divergence. We have

Df(νμ)1poly(n). (13)
Kullback-Leibler, Rényi, and Jensen-Shannon divergences

We give the algorithm for Jensen-Shannon divergence. Similar algorithms can be given for KL and Rényi divergences. The Jensen-Shannon divergence can be written as

12σΩν(σ)lnν(σ)μ(σ)12σΩν(σ)lnν(x)+μ(x)2μ(x)12σΩμ(x)lnν(x)+μ(x)2μ(x).

By (13), it suffices to give an algorithm with additive error εpoly(n). We can estimate each term with an additive error. We give the algorithm for the first term. The other two terms can be estimated similarly. We use the sampling oracle to draw T samples σ from ν, then using counting oracles to approximate h(σ)=lnν(σ)μ(σ)=ln(wν(σ)wμ(σ)ZμZν), finally, return the average of h(σ). The counting oracles estimate the partition functions with a relative error, and thus we can approximate h(σ) with an additive error. Due to the marginal lower bound assumption, |h(σ)|nlog1b=O(n). We can set T=poly(n,1ε) to achieve the additive error εpoly(n) and our goal is achieved.

𝜶-divergence

The α-divergence can be written as

1α(α1)(σν(σ)αμ(σ)α1)1α(α1)=1α(α1)(Zμα1ZαZνα1),

where Zα is the partition function of the Ising model (G,J(α),h(α)). By calling approximate counting oracles with relative error e±Θα(δ), we estimate Zμα1ZαZνα with e±δ relative error, which is equivalent to ±O(δ)Zμα1ZαZνα additive error approximation. When the constant α>1 or α<0, by (13), Zμα1ZαZνα is lower bounded by 1poly(n)+1. We set δ=εpoly(n), then Zμα1ZαZναεεδ, which implies

δZμα1ZαZναε(Zμα1ZαZνα1).

When the constant 0<α<1, α(1α)<0. By (13), the α-divergence at least 1poly(n), Zμα1ZαZνα is upper bounded by 11poly(n). We still set δ=εpoly(n), and in this case,

δZμα1ZαZναεpoly(n)ε(1Zμα1ZαZνα).

In both cases, our algorithm solves the problem with an additive error O(ε)Dα(νμ) approximation, which implies a relative error e±ε approximation.

Squared Hellinger distance

The Squared Hellinger distance can be written as

1σΩμ(σ)(ν(σ)μ(σ))1/2=1Z¯ZνZμ,

where Z¯ is the partition function of the averaged Ising model in Theorem 9. Again, by (13), it suffices to given an algorithm with additive error εpoly(n). We can use approximate counting oracles to estimate Z¯ZνZμ with a relative error. Note that for any δ>0, e±δ relative error approximation is the same as ±O(δ)Z¯ZνZμ additive error approximation. Since Z¯ZνZμ1 (as the divergence is non-negative), we can achieve ±O(δ) additive error. Then the whole term 1Z¯ZνZμ can be approximated within an additive error ±O(δ). The problem is solved by setting δ=εpoly(n).

7 Proof of the hardness result

In this section, we only consider Ising model (G,J,h) with zero external field and interaction matrix with unified values. Formally, h=𝟎 and Juv=Jvu=lnβ2 for all {u,v}E. It is easy to see the probability of a configuration σ is proportional to βm(σ), where m(σ) is the number of monochromatic edges m(σ)=|{u,v}E,σu=σv|. We denote it by (G,β).

Let α2, Δ3, and βμ>βνΔ2Δ such that (βνβμ)αβμ<Δ2Δ be constant parameters in Theorem 6. We define a constant parameter

β=(βνβμ)αβμ<Δ2Δ. (14)

The following hardness result for approximating partition function Z=σ{±}Vβm(σ) of the anti-ferromagnetic Ising model beyond the uniqueness threshold is well-known.

Lemma 20 ([33, 19]).

Fix Δ3 and 0<β<Δ2Δ. There exists c=c(Δ,β)>0 such that unless 𝖭𝖯=𝖱𝖯, there is no polynomial time randomized algorithm to approximate the partition function within a factor of ecn with probability at least 2/3 for the zero external field Ising model (G,β) on Δ-regular n-vertex graphs G.

For any Δ-regular graph G, let Z(G,β) denote the partition function of the zero external field Ising model (G,β), where β is defined in (14). We prove the following lemma.

Lemma 21.

Let Δ,α,βν,βμ be constants. There exists a constant C=C(Δ,α,βν,βμ)>0 such that for any graph G, let μ and ν be the Gibbs distributions of the Ising models (G,βν) and (G,βμ),

1CZ(G,β)Dχα(νμ)Z(G,βν)αZ(G,βμ)α1CZ(G,β).

Assume Lemma 21 holds. We prove Theorem 6.

Proof of Theorem 6.

Fix parameters Δ,α,βν,βμ. Let β be defined in (14).

Suppose there exists an FPRAS for Dχα(νμ). By run algorithms independently and take median, we can approximate Dχα(νμ) within the factor of 2 in polynomial time with probability at least 0.99. By our assumption, βν,βμΔ2Δ. We can compute Z(G,βν) within the factor of 2 by applying the algorithms in [13] (if βν[Δ2Δ,1]) or the algorithms in [21] (if βν>1) in polynomial time, where the algorithm succeeds with probability at least 0.99. Similarly, we can compute an approximation to Z(G,βμ).

By Lemma 21, we get a constant approximation to Z(G,β) in polynomial time with probability at least 0.99. The hardness result follows from Lemma 20.

The rest of this section is devoted to prove Lemma 21. To prove the lemma, we consider a middle term σΩμ(σ)(ν(σ)μ(σ)+1)α. The following result is a corollary of Lemma 12.

Corollary 22.

There exists C=C(Δ,α,βν,βμ) such that

1Cσ{±}Vμ(σ)(ν(σ)μ(σ)+1)αDχα(νμ)Cσ{±}Vμ(σ)(ν(σ)μ(σ)+1)α.
Proof.

The second inequality is trivial. For the first inequality, by Lemma 12, we can set θ in the lemma to be 12(lnβνlnβμ), which is a constant depending on βν,βμ. Furthermore, the underlying graph G has constant degree Δ, both βμ and βν are constants. Hence, by the conditional independence property of the Ising model, both ν and μ have a constant marginal lower bound b=b(Δ,βν,βμ). Hence Bα,b(θ)=Θα,βν,βμ,Δ(1) is a constant. With Corollary 22, Lemma 21 is a straightforward corollary of the following lemma.

Lemma 23.

It holds that

Z(G,β)Z(G,βν)αZ(G,βμ)α1σ{±}Vμ(σ)(ν(σ)μ(σ)+1)α3αZ(G,β).
Proof.

To simplify the notation, we fix the graph G in the proof. We denote the partition function Z(G,βν)=Zν and Z(G,βμ)=Zμ. The family of Ising models in (ν,μ,α) in (2) can be written as (G,βk), where βk=βνkβμk1 for 0kα. We use Zk to denote the partition function of (G,βk). We remark that Z0=Zμ, Z1=Zν, β in (14) is the same as βα, and Z(G,β)=Zα.

We use the binomial expansion that

σ{±}Vμ(σ)(ν(σ)μ(σ)+1)α=k=0α(αk)σ{±}Vνk(σ)μk1(σ)=k=0α(αk)Zμk1ZνkZk. (15)

We claim the following inequality holds. For any 0kα1,

ZμkZνk+1Zk+112Zμk1ZνkZk. (16)

We prove (16) later. With this conclusion, combining with (15), we know that

Zμα1ZναZαk=0α(αk)Zμk1ZνkZkZμα1ZναZαk=0α(αk)2αk=3αZμα1ZναZα, (17)

where the first inequality is trivial and the second inequality is due to (16). Rearranging the terms in (17), we get the desired result.

Finally, we prove (16). Recall that m(σ) denotes the number of monochromatic edges under the configuration σ. Since βμ>βν, the sequence βk is monotonically decreasing as βk=βνkβμk1. Fix an unordered pair of configurations {σ,τ}{±}V×{±}V (it may be possible that σ=τ). Without loss of generality, we assume m(σ)m(τ), otherwise we can swap the two configurations.

We have the following inequality for all 1kα1,

βμm(σ)βk+1m(τ)+βμm(τ)βk+1m(σ)(I)βμm(σ)βk+1m(τ)=(II)(βμβν)m(σ)βνm(σ)(βνβμ)m(τ)βkm(τ)
= (βμβν)m(σ)m(τ)βνm(σ)βkm(τ)(III)βνm(σ)βkm(τ)(IV)12(βνm(σ)βkm(τ)+βνm(τ)βkm(σ)), (18)

where the inequality (I) throws the second term, and the equality (II) is due to the definition of βk, the inequality (III) is due to the assumption m(σ)m(τ) and βμ>βν, and the last inequality (IV) is due m(σ)m(τ) and βν=β1βk for k1.

If k=0, note that β0=βμ and β1=βν, we have

βμm(σ)βk+1m(τ)+βμm(τ)βk+1m(σ) =βνm(σ)βkm(τ)+βνm(τ)βkm(σ)
12(βνm(σ)βkm(τ)+βνm(τ)βkm(σ)). (19)

To prove (16), we need to prove ZμZk+112ZνZk for all 0kα1. Recall that the partition function Zk=σ{±}Vβkm(σ). We have the following inequality

ZμZk+1 =σ,τ{±}Vβμm(σ)βk+1m(τ)
=unordered pair{σ,τ}στ(βμm(σ)βk+1m(τ)+βμm(τ)βk+1m(σ))+σ{±}Vβμm(σ)βk+1m(σ)
() 12unordered pair{σ,τ}στ(βνm(σ)βkm(τ)+βνm(τ)βkm(σ))+12σ{±}Vβνm(σ)βkm(σ)
=12ZνZk,

where the inequality () is due to (7) and (7) and two inequalities (7) and (7) hold even if σ=τ. This proves the inequality in (16).

References

  • [1] Amirali Abdullah, Ravi Kumar, Andrew McGregor, Sergei Vassilvitskii, and Suresh Venkatasubramanian. Sketching, embedding and dimensionality reduction in information theoretic spaces. In AISTATS, volume 51, pages 948–956. JMLR.org, 2016. URL: http://proceedings.mlr.press/v51/abdullah16.html.
  • [2] Antoine Amarilli, Marcelo Arenas, YooJung Choi, Mikaël Monet, Guy Van den Broeck, and Benjie Wang. A circus of circuits: Connections between decision diagrams, circuits, and automata. arXiv preprint, 2024. doi:10.48550/arXiv.2404.09674.
  • [3] Arnab Bhattacharyya, Weiming Feng, and Piyush Srivastava. Approximating the total variation distance between Gaussians. In AISTATS, volume 258 of Proceedings of Machine Learning Research, pages 1846–1854. PMLR, 2025. URL: https://proceedings.mlr.press/v258/bhattacharyya25a.html.
  • [4] Arnab Bhattacharyya, Sutanu Gayen, Kuldeep S. Meel, Dimitrios Myrisiotis, A. Pavan, and N. V. Vinodchandran. On approximating total variation distance. In IJCAI, pages 3479–3487. ijcai.org, 2023. doi:10.24963/IJCAI.2023/387.
  • [5] Arnab Bhattacharyya, Sutanu Gayen, Kuldeep S. Meel, Dimitrios Myrisiotis, A. Pavan, and N. V. Vinodchandran. Total variation distance meets probabilistic inference. In ICML. OpenReview.net, 2024. URL: https://openreview.net/forum?id=6OSLjErBhh.
  • [6] Arnab Bhattacharyya, Sutanu Gayen, Kuldeep S. Meel, Dimitrios Myrisiotis, A. Pavan, and N. V. Vinodchandran. Computational explorations of total variation distance. In ICLR. OpenReview.net, 2025. URL: https://openreview.net/forum?id=xak8c9l1nu.
  • [7] Arnab Bhattacharyya, Sutanu Gayen, Kuldeep S Meel, Dimitrios Myrisiotis, A Pavan, and NV Vinodchandran. Algorithms and hardness for estimating statistical similarity. arXiv preprint, 2025. arXiv:2502.10527.
  • [8] Arnab Bhattacharyya, Sutanu Gayen, Kuldeep S. Meel, and N. V. Vinodchandran. Efficient distance approximation for structured high-dimensional distributions via learning. In NeurIPS, 2020.
  • [9] Antonio Blanca, Zongchen Chen, Daniel Stefankovic, and Eric Vigoda. Complexity of high-dimensional identity testing with coordinate conditional sampling. ACM Trans. Algorithms, 21(1):7:1–7:58, 2025. doi:10.1145/3686799.
  • [10] Guy Bresler. Efficiently learning Ising models on arbitrary graphs. In STOC, pages 771–782. ACM, 2015. doi:10.1145/2746539.2746631.
  • [11] Clément L. Canonne. A Survey on Distribution Testing: Your Data is Big. But is it Blue? Number 9 in Graduate Surveys. Theory of Computing Library, 2020. doi:10.4086/toc.gs.2020.009.
  • [12] Clément L. Canonne. Topics and techniques in distribution testing: A biased but representative sample. Found. Trends Commun. Inf. Theory, 19(6):1032–1198, 2022. doi:10.1561/0100000114.
  • [13] Xiaoyu Chen, Zongchen Chen, Yitong Yin, and Xinyuan Zhang. Rapid mixing at the uniqueness threshold. In STOC, pages 879–890. ACM, 2025. doi:10.1145/3717823.3718260.
  • [14] Zongchen Chen, Kuikui Liu, and Eric Vigoda. Optimal mixing of Glauber dynamics: Entropy factorization via high-dimensional expansion. In STOC, pages 1537–1550. ACM, 2021. doi:10.1145/3406325.3451035.
  • [15] Constantinos Daskalakis, Nishanth Dikkala, and Gautam Kamath. Testing Ising models. In SODA, pages 1989–2007. SIAM, 2018. doi:10.1137/1.9781611975031.130.
  • [16] Weiming Feng, Heng Guo, Mark Jerrum, and Jiaheng Wang. A simple polynomial-time approximation algorithm for the total variation distance between two product distributions. TheoretiCS, 2, 2023. doi:10.46298/THEORETICS.23.7.
  • [17] Weiming Feng, Hongyang Liu, and Minji Yang. Approximating the total variation distance between spin systems. In COLT, volume 291 of Proceedings of Machine Learning Research, pages 1974–2025. PMLR, 2025. URL: https://proceedings.mlr.press/v291/feng25a.html.
  • [18] Weiming Feng, Liqiang Liu, and Tianren Liu. On deterministically approximating total variation distance. In SODA, pages 1766–1791. SIAM, 2024. doi:10.1137/1.9781611977912.70.
  • [19] Andreas Galanis, Daniel Štefankovič, and Eric Vigoda. Inapproximability of the partition function for the antiferromagnetic Ising and hard-core models. Combin. Probab. Comput., 25(4):500–559, 2016. doi:10.1017/S0963548315000401.
  • [20] William Gay, William He, Nicholas Kocurek, and Ryan O’Donnell. Sampling and identity-testing without approximate tensorization of entropy. arXiv preprint arXiv:2506.23456, 2025. doi:10.48550/arXiv.2506.23456.
  • [21] Mark Jerrum and Alistair Sinclair. Polynomial-time approximation algorithms for the Ising model. SIAM J. Comput., 22(5):1087–1116, 1993. doi:10.1137/0222066.
  • [22] Takafumi Kanamori, Taiji Suzuki, and Masashi Sugiyama. f-divergence estimation and two-sample homogeneity test under semiparametric density-ratio models. IEEE Trans. Inf. Theory, 58(2):708–720, 2012. doi:10.1109/TIT.2011.2163380.
  • [23] Stefan Kiefer. On computing the total variation distance of hidden Markov models. In ICALP, volume 107 of LIPIcs, pages 130:1–130:13. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2018. doi:10.4230/LIPIcs.ICALP.2018.130.
  • [24] Aryeh Kontorovich. On the tensorization of the variational distance. Electron. Commun. Probab., 30:Paper No. 32, 10, 2025.
  • [25] Aryeh Kontorovich and Ariel Avital. Sharp bounds on aggregate expert error. In ALT, volume 272 of Proceedings of Machine Learning Research, pages 653–663. PMLR, 2025. URL: https://proceedings.mlr.press/v272/kontorovich25a.html.
  • [26] Loong Kuan Lee, Nico Piatkowski, François Petitjean, and Geoffrey I. Webb. Computing divergences between discrete decomposable models. In AAAI, pages 12243–12251. AAAI Press, 2023. doi:10.1609/AAAI.V37I10.26443.
  • [27] David A Levin and Yuval Peres. Markov chains and mixing times, volume 107. American Mathematical Soc., 2017.
  • [28] Liang Li, Pinyan Lu, and Yitong Yin. Correlation decay up to uniqueness in spin systems. In SODA, pages 67–84. SIAM, 2013. doi:10.1137/1.9781611973105.5.
  • [29] Jingcheng Liu, Alistair Sinclair, and Piyush Srivastava. The Ising partition function: Zeros and deterministic approximation. Journal of Statistical Physics, 174(2):287–315, 2019.
  • [30] Barnabás Póczos and Jeff G. Schneider. On the estimation of α-divergences. In AISTATS, volume 15 of JMLR Proceedings, pages 609–617. JMLR.org, 2011.
  • [31] Paul K. Rubenstein, Olivier Bousquet, Josip Djolonga, Carlos Riquelme, and Ilya O. Tolstikhin. Practical and consistent estimation of f-divergences. In NeurIPS, pages 4072–4082, 2019. URL: https://proceedings.neurips.cc/paper/2019/hash/3147da8ab4a0437c15ef51a5cc7f2dc4-Abstract.html.
  • [32] Amit Sahai and Salil Vadhan. A complete problem for statistical zero knowledge. J. ACM, 50(2):196–249, 2003. doi:10.1145/636865.636868.
  • [33] Allan Sly and Nike Sun. Counting in two-spin models on d-regular graphs. Ann. Probab., 42(6):2383–2416, 2014.
  • [34] Sreejith Sreekumar and Ziv Goldfeld. Neural estimation of statistical divergences. J. Mach. Learn. Res., 23:126:1–126:75, 2022. URL: https://jmlr.org/papers/v23/21-1212.html.