On Approximating the -Divergence Between Two Ising Models
Abstract
The -divergence is a fundamental notion that measures the difference between two distributions. In this paper, we study the problem of approximating the -divergence between two Ising models, which is a generalization of recent work on approximating the TV-distance. Given two Ising models and , which are specified by their interaction matrices and external fields, the problem is to approximate the -divergence within an arbitrary relative error . For -divergence with a constant integer , we establish both algorithmic and hardness results. The algorithm works in a parameter regime that matches the hardness result. Our algorithm can be extended to other -divergences such as -divergence, Kullback-Leibler divergence, Rényi divergence, Jensen-Shannon divergence, and squared Hellinger distance.
Keywords and phrases:
Ising model, -divergence, approximation algorithms, randomized algorithmsFunding:
Weiming Feng: Early Career Scheme of the Hong Kong Research Grants Council under grant number 27202725.Copyright and License:
2012 ACM Subject Classification:
Theory of computation Approximation algorithms analysis ; Mathematics of computing Markov networksEditor:
Shubhangi SarafSeries and Publisher:
Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
Let and be two distributions with the same support . Let be a convex function such that and is strictly convex around 1. The -divergence of from is defined by
| (1) |
The -divergence is a very general notion that measures the difference between two distributions. For instance, gives the total variation distance (TV-distance), gives the -divergence, and gives the Kullback-Leibler (KL) divergence.
The Ising model is a fundamental graphical model in statistical physics, probability theory, and machine learning. Let be a graph. Let be a symmetric interaction matrix such that only if . Let be the external fields vector. An Ising model specified by defines a Gibbs distribution with support such that
The function is called the weight function of the Ising model and the normalization factor is called the partition function of the Ising model.
Recently, the problem of computing the TV-distance between two high-dimensional distributions has received increasing attention. One interesting result [4] has proved that even for a pair of product distributions (Ising models on an empty graph), the exact computation of TV-distance is #P-hard. Later on, polynomial time approximation algorithms were proposed for product distributions [16, 18]. For general Ising models, both algorithmic and hardness results were established [6, 17] for approximating the TV-distance.
In this paper, we consider a more general problem of approximating the -divergence between two Ising models. The problem is defined as follows.
Problem 1.
Approximating the -divergence for two Ising models.
-
Input: Two Ising models and specifying two Gibbs distributions and respectively1111 assumes that and have the same underlying graph . This assumption does not lose generality because the definition allows and to be 0 even if ., a function defining the -divergence, and an error bound ;
-
Output: A number such that .
A randomized algorithm is said to be an FPRAS (fully polynomial randomized approximation scheme) for Problem 1 if it runs in time polynomial in and and with probability at least , the output approximates the value of the -divergence within relative error .
Our algorithmic result is a reduction from -divergence approximation to sampling and approximate counting, which are two fundamental computational tasks for the Ising model. There is long-line of research on developing efficient algorithms [27] for sampling and approximate counting. We assume the following abstract oracles for sampling and approximate counting.
Definition 2 (sampling and approximate counting oracles).
Let be an Ising model with Gibbs distribution and partition function . Let be two non-increasing functions. Given any error bound ,
-
The sampling oracle for with cost function returns a random sample in time with , where is the total variation distance between and .
-
The approximate counting oracle for with cost function returns a random number in time with .
For functions and , we add the index to emphasize that the running time also depends on parameters of graph such as the number of vertices/edges and the maximum degree. We assume the oracles need to read the whole graph so that the cost is at least .
We also require the following mild assumption on the Ising models.
Definition 3 (marginal lower bound).
Let be a constant. A Gibbs distribution is said to satisfy the -marginal lower bound if for any , any pinning , any , and any , it holds that , where denotes the marginal distribution on projected from conditional on that the configuration on is fixed as .
The marginal lower bound condition is a natural and common assumption for the Ising model. The assumption is widely used in the literature of sampling and approximate counting [14], learning theory [10], and TV-distance approximation [17].
1.1 Algorithm and hardness results for -divergence approximation
Let be a constant integer. The is called the -divergence if . We give the following approximation algorithm for the -divergence between two Ising models in Problem 1. Given two input Ising models and , define the following family of Ising models on the graph :
| (2) |
We remark that and are the same as the input Ising models and respectively. The following theorem says that if all Ising models in admit sampling and approximate counting oracles, then the -divergence between two input Ising models can be approximated in polynomial time.
Theorem 4.
Let be a constant integer and be a constant. There exists an FPRAS that solves Problem 1 for -divergence in time
where , , and is a small constant depending only on and , if two input Ising Gibbs distributions and are both -marginally bounded and all Ising models in admit sampling and approximate counting oracles with cost functions and respectively.
Let us consider a simplified case of the Ising model. Suppose has the max degree . For an Ising model , if for any , if take a unified value. In this case, an Ising model admits sampling and approximate counting oracles with cost function and if one of the following two conditions holds:
-
Uniqueness condition: and an arbitrary external field [13].
Consider Problem 1 where two input Ising models both have unified values and in the interaction matrices and . Assume , , , , and the max degree are all constants. Then the marginal lower bound condition is satisfied. We have the following corollary.
Corollary 5.
Let be a constant integer. For Ising models with unified values in interaction matrices, there exists an FPRAS that solves Problem 1 for -divergence in time if , , , , and are all constants and one of the following three conditions holds:
-
Both and are in .
-
and .
-
, , and .
The corollary requires different conditions for the case and the case . It is reasonable because for , the -divergence is not symmetric , and thus the roles of two input Ising models are different.
In Theorem 4, we show that -divergence can be approximated if the sampling and approximate counting oracles for all Ising models in exist. In a special case when , only contains two input Ising models and , where we recover the result in [17] for approximating the TV-distance. However, for general , the family contains other Ising models. It is natural to ask the following questions.
-
Is it possible to approximate -divergence if we only assume the sampling and approximate counting oracles for the two input Ising models and exist?
-
A more direct question: are the oracle assumptions in Theorem 4 necessary?
The above two questions are answered by the following hardness result. Let us restrict our attention to Ising models with zero external field and interaction matrix with unified values. Formally, and for all . Denote the model by . Let and denote two input Ising models in Problem 1.
Theorem 6.
Fix an integer , an integer , and two parameters such that . Unless , there is no FPRAS for -divergence for two Ising models and on -regular graph .
The above theorem considers two Ising models with zero external field and . Two input Ising models both admit polynomial time sampling and approximate counting oracles because they are either in the uniqueness regime [13] or the ferromagnetic regime [21]. However, if , approximating -divergence is still hard. For a concrete example, the problem is hard when , , and .
Theorem 6 also shows that the condition in Theorem 4 is necessary. Note that since and are all constants and the external fields are zero, the hardness instances satisfy the marginal lower bound condition with . Consider two Ising models and on -regular graph such that . For this family of instances, our algorithmic result in Corollary 5 together with the hardness result in Theorem 6 discovers the following computational phase transition for approximating when :
-
FPRAS exists when ;
-
Unless , there is no FPRAS when .
1.2 Algorithms for other -divergences
We say an Ising model admits polynomial time sampling and approximate counting oracles if it admits sampling and approximate counting oracles with cost function and respectively. Let and denote two input Ising models. We give algorithms for approximating the -divergence for various divergence functions . For many functions , the divergence is not symmetric for and . We will refer to as the first input Ising model and as the second input Ising model.
By setting different functions in , we can obtain the following divergences:
Theorem 7.
There exists an FPRAS for Problem 1 with KL-divergence, Rényi-divergence, and Jensen-Shannon divergence if two input Ising models are both marginally bounded and both admit polynomial time sampling and approximate counting oracles.
Compared with the result in Theorem 4, the above theorem only requires the sampling and approximate counting oracles for the two input Ising models exist.
Next, consider the -divergence. When , the -divergence refers to the KL-divergence. When , the -divergence refers to the Rényi-divergence. The -divergences for other values of are defined as follows. Let . The -divergence is defined as . Recall that in (2), and . Here, can be a real number.
Theorem 8.
Let be a constant real. There exists an FPRAS for Problem 1 with -divergence if two input Ising models are both marginally bounded, the second input Ising model admits a polynomial time sampling oracle, and both two input Ising models together with admit polynomial time approximate counting oracles.
The above theorem works for constant real (assuming the computational model supports real arithmetic). Unlike -divergence, in addition to and we only require that the approximate counting oracle for exists.
The squared Hellinger distance is defined by setting . Given two input Ising models and , define an averaged Ising model as , where and . We have the following theorem.
Theorem 9.
There exists an FPRAS for Problem 1 with squared Hellinger distance if two input Ising models are both marginally bounded, one of the input Ising models admits a polynomial time sampling oracle, and two input Ising models together with the averaged Ising model all admit polynomial time approximate counting oracles.
The squared Hellinger distance is symmetric . Hence, we only require one of the input Ising models admits a polynomial time sampling oracle. As a corollary, if two input Ising models are both in the uniqueness regime, then their averaged Ising model is also in the uniqueness regime and FPRAS exists for the squared Hellinger distance. For two Ising models and with zero external field, FPRAS exists when both .
1.3 Related work and open problems
Related work
The problem of computing the -divergence between two (either discrete or continuous) distributions is well-motivated by the applications in machine learning and statistics. There are many works, both theoretical and practical, studying the algorithms in various settings. Many works study the error between the true divergence and the divergence computed by certain empirical distributions, e.g., [30, 22, 31, 34]. Embedding-base technique was also studied in [1], where algorithm embeds a large sample space into a smaller one while preserving the -divergence. We cannot directly use these techniques to Ising model as the size of the sample space is , which gives an exponential time algorithm. For graphical models, [26] shows that -divergences can be computed exactly when the graphical model has a bounded treewidth.
Approximating the -divergence is related to the testing problem, in which one or both two distributions are given by the access to different types of sampling oracles, and the algorithm is required to test certain divergence is large or small. There is a long line of work studying the testing problem. See [11, 12] for a comprehensive survey.
The TV-distance is a special case of -divergences. There are many theoretical works studying the approximation algorithms (either additive error or relative error) and hardness of approximation for the TV-distance between various types of high-dimensional distributions. For example, distributions specified by circuits [32], hidden Markov models [23], product distributions [4, 16, 18, 24], graphical models [8, 5, 17], and high-dimensional Gaussian distributions [8, 3]. Recently, there are some works focusing on approximating for two product distributions [25, 7].
Open problems
A natural direction is to consider more general distributions and more general divergences. Here are a few examples. For the Ising model, how to remove the marginal lower bound assumption? There are many other types of graphical models, such as Markov random fields with high-order interactions and Bayesian networks. It is interesting to consider other types of distributions, e.g., distributions generated by probabilistic circuits [2]. For the divergence, we focus on -divergence when is an integer. One can also consider the case when is a real number and give an algorithm that works in the tight parameter regime. Another question is to give a more general algorithm that works for a larger family of -divergences.
The algorithm given in this paper is randomized. An open question is to find deterministic algorithms by exploring the deterministic approximate counting algorithms for the Ising model [28, 29]. Currently, deterministic algorithms are known for approximating the TV-distance between two product distributions [4, 18].
We can also study the same problem in different settings. In our setting, we assume that the input gives the description of two Ising models. It is interesting to consider another (perhaps more practical) setting that the algorithm can only access one or two models through random samples. This setting is closely related to the Ising testing problem in [15]. One can even consider abstract distributions with certain properties (e.g. approximate tensorization of -divergence) and the distribution is given by the access to some oracles (e.g., oracle for querying conditional marginal distributions or querying probability masses). This abstract setting was considered by recent work in testing [9, 20].
2 Algorithm overview
We give an overview of our algorithm for approximating the -divergence between two Ising models. We start with the following definition of the parameter distance between two Ising models.
Definition 10 (parameter distance [17]).
For two Ising models and , the parameter distance is defined by
where is the degree of in .
In [17], the parameter distance is used to give a lower bound of TV-distance , with which the previous work gave an FPRAS for TV-distance. The following lemma says such a lower bound can be generalized to -divergence. The lemma is proved in Section 3.1, where the proof works for a general family of -divergence.
Lemma 11 (-divergence lower bound).
For , where , it holds that
Let be a threshold. We estimate in the following two cases.
-
Case . This case is trivial for TV-distance because there is a very simple additive-error approximation algorithm for the TV-distance [8]. Since TV-distance itself is lower bounded by , the additive error approximation can be easily transferred to relative error approximation. However, this simple algorithm only works for the TV-distance. Instead, we propose a new algorithm for approximating the general -divergence. The algorithm is outlined in Section 2.1.
-
Case . In this case, two Ising models are similar to each other. Previous work [17] has explored the similarity of two models and designed an algorithm for approximating their TV-distance. We substantially generalize their algorithm to approximate a family -divergence (including -divergence). The algorithm is outlined in Section 2.2.
2.1 The algorithm for instances with large parameter distance
We first state the challenge of approximating the -divergence for general compared to the TV-distance. It is well-known that the TV-distance can be written as follows:
| (3) |
It was observed in [8] that the TV-distance can be approximated by draw independent samples and then taking the average of 222The value of can be computed approximately, which is sufficient for the purpose of approximation.. The above equation shows . It is easy to see and thus . The simple algorithm achieves the additive error approximation, which can be transferred to relative-error approximation because is lower bounded in this case.
However, for the -divergence with general , by the definition,
Due to the extra term , we cannot apply a similar transformation as equation in (3). Hence, it is not clear how to design an unbiased estimator such that variance of is small.
To overcome this challenge, we propose a new algorithm for approximating the -divergence. The following lemma gives a lower bound for the -divergence in this case.
Lemma 12.
Let . If and both and are -marginally bounded, then
where .
The proof of Lemma 12 separates all into two parts, and separately lower bound their contributions to the -divergence. The detailed proof is deferred to Section 3.2.
Lemma 12 gives us a tool to control the error of approximation. Recall that our goal is to design an algorithm that approximates the -divergence with relative error . Equivalently, the algorithm should achieve the additive error. Using the above lemma, it is sufficient to show that the algorithm can achieve the additive error approximation, which turns out to be much easier to prove. In addition to algorithms, Lemma 12 also plays an important role in the proof of the hardness result. See Section 7 for details.
2.1.1 The algorithm for even
Our actual algorithm deals with all in a unified way. However, in the overview, we exhibit a simple algorithm for the case where is even. The simple case will illustrate the intuition why we need to consider a family of Ising models in (2) and why Lemma 12 can help us to control the error of approximation. When is even, we can replace with and then use the binomial expansion to get the following equation:
| (4) |
To deal with the term , recall that (2) defines a family of Ising models such that and .
By our assumption in Theorem 4, for all , the Ising model admits sampling and approximate counting oracles. Let denote the partition function of the Ising model . We remark that and are the partition functions of input Ising model and , respectively. Each term in the summation of (4) can be written as
| (5) |
The above ratio can be approximated by using approximate counting oracles for , , and . Note that is a constant. By choosing the relative approximation error in the oracle small enough, where is the parameter in Lemma 12, we can obtain a value that achieves the following approximation guarantee
| (6) |
where in the above inequality, we write the relative error in terms of the additive error. Finally, our algorithm outputs the number such that
Note that is an interlacing of plus and minus. Using (4) and (6), in the worst case (the additive error of every term takes the worst sign), the error of can be upper bounded as follows
2.1.2 The algorithm for general
For general , we need to consider the effect of the absolute value inside the divergence. The -divergence can be written as
Using the binomial expansion, the divergence can be written as
Fix an integer . Let be a random sample from the Ising model . It holds that . Define and to be random variables such that
| (7) |
With some calculation, we can verify that
Ideally, our algorithm wants to draw independent random samples of and then uses their average value to approximate . Similarly, the algorithm computes to approximate . Finally, it outputs . However, to show the correctness of the algorithm, we need to deal with the following two types of errors.
-
The algorithm cannot draw perfect samples of or . One major issue333There are some other issues, e.g., the samples from the Ising model are approximate and partition functions in (7) can only be approximated. is that given a sample , the exact computation of the probability masses and is P-hard [21]. Hence, the algorithm cannot perfectly distinguish the cases where or . This issue can be solved by using the approximate counting oracle to approximate the probability masses and . By choosing a small enough error parameter for approximate counting, the algorithm misclassifies or only if . We need to show that such does not contribute much to the error of approximation.
-
We use the average of and to approximate and respectively. We need to control the concentration error. This error can be bounded via Lemma 12, using a similar technique as described in the even case.
The detailed algorithm and analysis is in Section 5.1.
2.2 The algorithm for instances with small parameter distance
In this case, we give a general algorithm for abstract -divergence. We briefly sketch the algorithm implemented for -divergence in the overview. The abstract one is in Section 4. By the definition,
| (8) |
Define a random variable , where . It can be verified that and
The algorithm draws samples from independently, and then computes and . Since the parameter distance is small, two weight functions and are very similar. Hence, the ratio . Formally,
The above property guarantees that is well-concentrated around its mean. By choosing the number of samples large enough, with some calculation, we can show that approximates with a relative error of with high probability. We remark that is not necessarily an unbiased estimator for but it is still a good approximation.
2.3 Organization of the paper
In Section 3, we prove some lower bounds for a broad family of -divergences. In Section 4, we give a general algorithm for general -divergences satisfying an abstract condition. In Section 5, we give the detailed algorithm for -divergence, where we focus on the large parameter distance case because the small parameter distance case is solved in Section 4. In Section 6, we show how to extend our algorithm to other -divergences. Finally, in Section 7, we prove the hardness result for approximating -divergence in the parameter regime stated in Theorem 6.
3 Lower bounds of -divergences
In this paper, we assume the function in -divergence satisfies the following assumption.
Assumption 1.
The convex continuous function satisfies that and for any , is twice differentiable at such that . Furthermore, the derivative satisfies if and if .
For any function satisfying the above assumption, is the unique point where . It is straightforward to verify that is strictly convex at around 1. The -divergences we are interested in all satisfy the above assumption.
3.1 Lower bound in terms of parameter distance
Our proof uses the following lower bound of total variation distance.
Lemma 13 ([17, Lemma 15]).
For two Ising models, if both and are -marginally bounded, then their total variation distance is lower bounded by
Lemma 14.
Suppose satisfies Assumption 1. For two Gibbs distributions of Ising models,
Proof of Lemma 14.
Let . Let and . It is well-known that the total variation distance between and is . Since we consider the Ising model, . Consider a Markov kernel such that given any , deterministically transforms to if and only if . Note that and are Bernoulli distributions with parameters and respectively. We have
By using the data processing inequality for -divergence, we have
| (Jensen’s inequality on ) |
Similarly, we use to get .
Note that if and if . Combining with Lemma 13, we have
Lemma 11 for -divergence can be proved as follows.
Proof of Lemma 11.
3.2 A lower bound for -divergence
Proof of Lemma 12.
Let be a parameter to be fixed later. We partition the whole space into and such that
Then, we bound the contribution of and separately. By triangle inequality, we have . For the first case, we have
By our assumption, . Using Lemma 11, we have . Hence,
| (9) |
Consider the other set . For all , it holds that
Summing over all , we have
| (10) |
Finally, adding the two inequalities (9) and (3.2), we have
| (11) |
The above inequality holds for any . We next find a value of to minimize the value of .
Take the derivative of , we get
Thus when , where ,the function obtains its minimum value:
Combining the above equation with (3.2) implies that . This proves the lower bound of the -divergence.
4 Algorithm for -divergence with small parameters distance
In this section, we generalize the algorithm in [17] to the case of small parameter distance. The new algorithm works for general -divergence, where satisfies the following abstract condition.
Condition 15.
Let be a function satisfying Assumption 1. There exists an function with such that for any , any ,
Theorem 16.
Let be a function satisfying Condition 15 with function . There exists an algorithm such that given two Ising models and , any , and , if and are -marginally bounded, , and admits sampling oracle with cost functions , then it returns a random number in time , where , , , such that
Remark 17.
In Theorem 16, the algorithm for small parameter case only requires sampling oracles for the input Ising model instead of sampling and approximate counting oracles for the whole family of Ising models .
By the definitions of -divergence and Ising model, we have
Define the parameter
-
Call the sampling oracle of with TV-distance error to obtain independent random samples . For each , compute .
-
Return , where .
The analysis of the algorithm is given in the full version of the paper.
5 Algorithm for -divergence
Our algorithm first computes the marginal lower bound in time . Due to the conditional independence, the algorithm only need to enumerate all , any , and find a worst pinning on all neighbor of . Specifically, for any neighbor of , if , then fix ; otherwise, fix . The algorithm compute . Let be the minimum value over all and . We now assume that the value is known by the algorithm.
Let be the threshold parameter. We give two algorithms in Section 5.1 and Section 5.2 depending on whether or not. We prove Theorem 4 in Section 5.3.
5.1 Large parameters distance case
Lemma 18.
Let be an integer and be a constant. There exists an algorithm such that given two Ising models and , any , and , if and are -marginally bounded, , and every Ising model in admits sampling and approximate counting oracles with cost function and respectively, then it returns a random in time , where and such that
By the definition of -divergence, we can rewrite
Recall that is the partition function of the Ising model , where and . Note that and . Define two random variables and as follows. Let be the Gibbs distributions of the Ising model .
The expectation of can be calculated as follows
Similarly, we have . In a high level, our algorithm draws approximate samples of and and estimate . Define the parameter
where the function is defined in Lemma 12.
-
For , call the approximate counting oracle on Ising model for times independently with relative error , and take the median as .
-
For each from to do:
-
1.
Draw samples by the sampling oracle with error .
-
2.
For each , compute two values
where for any , and .
-
1.
-
Return .
The analysis of the algorithm is given in the full version of the paper.
5.2 Small parameters distance case
With Theorem 16, we only need to verify Condition 15 for . By definition,
Set , satisfies Condition 15. Thus
Hence, using Theorem 16, there is an algorithm with running time , where , for small parameter distance case .
5.3 Putting everything together (Proof of Theorem 4)
Theorem 4 follows from Theorem 16 and Lemma 18. Define parameter and . By choosing constants, we can make large enough and small enough. The preprocessing time for computing and is , which is dominated by the time for sampling and approximate counting. The running time of the whole algorithm is at most
where the last inequality comes from the fact that both and are non-increasing functions and is a small enough constant depending only on and .
6 Algorithms for other -divergences
In this section, we briefly show the algorithms for other -divergences. The algorithms prove the results in Theorem 7, Theorem 8, and Theorem 9.
Again, the algorithm first computes the parameter distance . Then, it compares to the threshold . If , algorithms are given in Section 6.1. Otherwise , algorithms are given in Section 6.2.
6.1 Algorithms for small parameter distance
In this case, we use the abstract algorithm in Theorem 16. We only need to verify Condition 15 for -divergences. The following lemma gives a sufficient condition for Condition 15.
Lemma 19.
Suppose is twice differentiable at every such that and . If there exists two positive constants such that for any . Then, the function satisfies Condition 15 with .
Proof.
Note that satisfying the condition in the lemma satisfies Assumption 1. By condition that , we have the following two equalities ():
| (12) |
Since for and , we have
Thus, Condition 15 can be verify as follows
Then, we using Lemma 19 verify Condition 15 for Kullback-Leibler, Rényi, Jensen-Shannon, -divergence, and Squared Hellinger divergence. The result is summarized as the following table.
| Divergence | |||||
|---|---|---|---|---|---|
| Kullback-Leibler | |||||
| Rényi | |||||
| Jensen-Shannon | |||||
| -divergence | |||||
| Squared Hellinger |
We remark that Lemma 19 does not work for -divergence but it works for divergences above.
The algorithm then follows from Theorem 16 such that to approximate , it only requires polynomial time sampling oracle for the input distribution . For squared Hellinger distance, as it is symmetric , we can swap the roles of and in the algorithm to make sure admits a polynomial time sampling oracle.
6.2 Algorithms for large parameter distance
Now, assume that the parameter distance . By Lemma 14, the -divergence is at least
where the inequality holds because of the derivative assumptions in Assumption 1. Let . We show that all divergences in the above table are at least . Kullback-Leibler and Rényi divergences can be verified by using Taylor series . For Jensen-Shannon, one can rewrite and then Taylor series shows . For -divergence, expanding for real implies . Finally, it is easy to verify that for Squared Hellinger divergence. We have
| (13) |
Kullback-Leibler, Rényi, and Jensen-Shannon divergences
We give the algorithm for Jensen-Shannon divergence. Similar algorithms can be given for KL and Rényi divergences. The Jensen-Shannon divergence can be written as
By (13), it suffices to give an algorithm with additive error . We can estimate each term with an additive error. We give the algorithm for the first term. The other two terms can be estimated similarly. We use the sampling oracle to draw samples from , then using counting oracles to approximate , finally, return the average of . The counting oracles estimate the partition functions with a relative error, and thus we can approximate with an additive error. Due to the marginal lower bound assumption, . We can set to achieve the additive error and our goal is achieved.
-divergence
The -divergence can be written as
where is the partition function of the Ising model . By calling approximate counting oracles with relative error , we estimate with relative error, which is equivalent to additive error approximation. When the constant or , by (13), is lower bounded by . We set , then , which implies
When the constant , . By (13), the -divergence at least , is upper bounded by . We still set , and in this case,
In both cases, our algorithm solves the problem with an additive error approximation, which implies a relative error approximation.
Squared Hellinger distance
The Squared Hellinger distance can be written as
where is the partition function of the averaged Ising model in Theorem 9. Again, by (13), it suffices to given an algorithm with additive error . We can use approximate counting oracles to estimate with a relative error. Note that for any , relative error approximation is the same as additive error approximation. Since (as the divergence is non-negative), we can achieve additive error. Then the whole term can be approximated within an additive error . The problem is solved by setting .
7 Proof of the hardness result
In this section, we only consider Ising model with zero external field and interaction matrix with unified values. Formally, and for all . It is easy to see the probability of a configuration is proportional to , where is the number of monochromatic edges . We denote it by .
Let , , and such that be constant parameters in Theorem 6. We define a constant parameter
| (14) |
The following hardness result for approximating partition function of the anti-ferromagnetic Ising model beyond the uniqueness threshold is well-known.
Lemma 20 ([33, 19]).
Fix and . There exists such that unless , there is no polynomial time randomized algorithm to approximate the partition function within a factor of with probability at least for the zero external field Ising model on -regular -vertex graphs .
For any -regular graph , let denote the partition function of the zero external field Ising model , where is defined in (14). We prove the following lemma.
Lemma 21.
Let be constants. There exists a constant such that for any graph , let and be the Gibbs distributions of the Ising models and ,
Proof of Theorem 6.
Fix parameters . Let be defined in (14).
Suppose there exists an FPRAS for . By run algorithms independently and take median, we can approximate within the factor of in polynomial time with probability at least . By our assumption, . We can compute within the factor of by applying the algorithms in [13] (if ) or the algorithms in [21] (if ) in polynomial time, where the algorithm succeeds with probability at least . Similarly, we can compute an approximation to .
By Lemma 21, we get a constant approximation to in polynomial time with probability at least . The hardness result follows from Lemma 20.
The rest of this section is devoted to prove Lemma 21. To prove the lemma, we consider a middle term . The following result is a corollary of Lemma 12.
Corollary 22.
There exists such that
Proof.
The second inequality is trivial. For the first inequality, by Lemma 12, we can set in the lemma to be , which is a constant depending on . Furthermore, the underlying graph has constant degree , both and are constants. Hence, by the conditional independence property of the Ising model, both and have a constant marginal lower bound . Hence is a constant. With Corollary 22, Lemma 21 is a straightforward corollary of the following lemma.
Lemma 23.
It holds that
Proof.
To simplify the notation, we fix the graph in the proof. We denote the partition function and . The family of Ising models in in (2) can be written as , where for . We use to denote the partition function of . We remark that , , in (14) is the same as , and .
We use the binomial expansion that
| (15) |
We claim the following inequality holds. For any ,
| (16) |
We prove (16) later. With this conclusion, combining with (15), we know that
| (17) |
where the first inequality is trivial and the second inequality is due to (16). Rearranging the terms in (17), we get the desired result.
Finally, we prove (16). Recall that denotes the number of monochromatic edges under the configuration . Since , the sequence is monotonically decreasing as . Fix an unordered pair of configurations (it may be possible that ). Without loss of generality, we assume , otherwise we can swap the two configurations.
We have the following inequality for all ,
| (18) |
where the inequality throws the second term, and the equality is due to the definition of , the inequality is due to the assumption and , and the last inequality is due and for .
If , note that and , we have
| (19) |
References
- [1] Amirali Abdullah, Ravi Kumar, Andrew McGregor, Sergei Vassilvitskii, and Suresh Venkatasubramanian. Sketching, embedding and dimensionality reduction in information theoretic spaces. In AISTATS, volume 51, pages 948–956. JMLR.org, 2016. URL: http://proceedings.mlr.press/v51/abdullah16.html.
- [2] Antoine Amarilli, Marcelo Arenas, YooJung Choi, Mikaël Monet, Guy Van den Broeck, and Benjie Wang. A circus of circuits: Connections between decision diagrams, circuits, and automata. arXiv preprint, 2024. doi:10.48550/arXiv.2404.09674.
- [3] Arnab Bhattacharyya, Weiming Feng, and Piyush Srivastava. Approximating the total variation distance between Gaussians. In AISTATS, volume 258 of Proceedings of Machine Learning Research, pages 1846–1854. PMLR, 2025. URL: https://proceedings.mlr.press/v258/bhattacharyya25a.html.
- [4] Arnab Bhattacharyya, Sutanu Gayen, Kuldeep S. Meel, Dimitrios Myrisiotis, A. Pavan, and N. V. Vinodchandran. On approximating total variation distance. In IJCAI, pages 3479–3487. ijcai.org, 2023. doi:10.24963/IJCAI.2023/387.
- [5] Arnab Bhattacharyya, Sutanu Gayen, Kuldeep S. Meel, Dimitrios Myrisiotis, A. Pavan, and N. V. Vinodchandran. Total variation distance meets probabilistic inference. In ICML. OpenReview.net, 2024. URL: https://openreview.net/forum?id=6OSLjErBhh.
- [6] Arnab Bhattacharyya, Sutanu Gayen, Kuldeep S. Meel, Dimitrios Myrisiotis, A. Pavan, and N. V. Vinodchandran. Computational explorations of total variation distance. In ICLR. OpenReview.net, 2025. URL: https://openreview.net/forum?id=xak8c9l1nu.
- [7] Arnab Bhattacharyya, Sutanu Gayen, Kuldeep S Meel, Dimitrios Myrisiotis, A Pavan, and NV Vinodchandran. Algorithms and hardness for estimating statistical similarity. arXiv preprint, 2025. arXiv:2502.10527.
- [8] Arnab Bhattacharyya, Sutanu Gayen, Kuldeep S. Meel, and N. V. Vinodchandran. Efficient distance approximation for structured high-dimensional distributions via learning. In NeurIPS, 2020.
- [9] Antonio Blanca, Zongchen Chen, Daniel Stefankovic, and Eric Vigoda. Complexity of high-dimensional identity testing with coordinate conditional sampling. ACM Trans. Algorithms, 21(1):7:1–7:58, 2025. doi:10.1145/3686799.
- [10] Guy Bresler. Efficiently learning Ising models on arbitrary graphs. In STOC, pages 771–782. ACM, 2015. doi:10.1145/2746539.2746631.
- [11] Clément L. Canonne. A Survey on Distribution Testing: Your Data is Big. But is it Blue? Number 9 in Graduate Surveys. Theory of Computing Library, 2020. doi:10.4086/toc.gs.2020.009.
- [12] Clément L. Canonne. Topics and techniques in distribution testing: A biased but representative sample. Found. Trends Commun. Inf. Theory, 19(6):1032–1198, 2022. doi:10.1561/0100000114.
- [13] Xiaoyu Chen, Zongchen Chen, Yitong Yin, and Xinyuan Zhang. Rapid mixing at the uniqueness threshold. In STOC, pages 879–890. ACM, 2025. doi:10.1145/3717823.3718260.
- [14] Zongchen Chen, Kuikui Liu, and Eric Vigoda. Optimal mixing of Glauber dynamics: Entropy factorization via high-dimensional expansion. In STOC, pages 1537–1550. ACM, 2021. doi:10.1145/3406325.3451035.
- [15] Constantinos Daskalakis, Nishanth Dikkala, and Gautam Kamath. Testing Ising models. In SODA, pages 1989–2007. SIAM, 2018. doi:10.1137/1.9781611975031.130.
- [16] Weiming Feng, Heng Guo, Mark Jerrum, and Jiaheng Wang. A simple polynomial-time approximation algorithm for the total variation distance between two product distributions. TheoretiCS, 2, 2023. doi:10.46298/THEORETICS.23.7.
- [17] Weiming Feng, Hongyang Liu, and Minji Yang. Approximating the total variation distance between spin systems. In COLT, volume 291 of Proceedings of Machine Learning Research, pages 1974–2025. PMLR, 2025. URL: https://proceedings.mlr.press/v291/feng25a.html.
- [18] Weiming Feng, Liqiang Liu, and Tianren Liu. On deterministically approximating total variation distance. In SODA, pages 1766–1791. SIAM, 2024. doi:10.1137/1.9781611977912.70.
- [19] Andreas Galanis, Daniel Štefankovič, and Eric Vigoda. Inapproximability of the partition function for the antiferromagnetic Ising and hard-core models. Combin. Probab. Comput., 25(4):500–559, 2016. doi:10.1017/S0963548315000401.
- [20] William Gay, William He, Nicholas Kocurek, and Ryan O’Donnell. Sampling and identity-testing without approximate tensorization of entropy. arXiv preprint arXiv:2506.23456, 2025. doi:10.48550/arXiv.2506.23456.
- [21] Mark Jerrum and Alistair Sinclair. Polynomial-time approximation algorithms for the Ising model. SIAM J. Comput., 22(5):1087–1116, 1993. doi:10.1137/0222066.
- [22] Takafumi Kanamori, Taiji Suzuki, and Masashi Sugiyama. -divergence estimation and two-sample homogeneity test under semiparametric density-ratio models. IEEE Trans. Inf. Theory, 58(2):708–720, 2012. doi:10.1109/TIT.2011.2163380.
- [23] Stefan Kiefer. On computing the total variation distance of hidden Markov models. In ICALP, volume 107 of LIPIcs, pages 130:1–130:13. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2018. doi:10.4230/LIPIcs.ICALP.2018.130.
- [24] Aryeh Kontorovich. On the tensorization of the variational distance. Electron. Commun. Probab., 30:Paper No. 32, 10, 2025.
- [25] Aryeh Kontorovich and Ariel Avital. Sharp bounds on aggregate expert error. In ALT, volume 272 of Proceedings of Machine Learning Research, pages 653–663. PMLR, 2025. URL: https://proceedings.mlr.press/v272/kontorovich25a.html.
- [26] Loong Kuan Lee, Nico Piatkowski, François Petitjean, and Geoffrey I. Webb. Computing divergences between discrete decomposable models. In AAAI, pages 12243–12251. AAAI Press, 2023. doi:10.1609/AAAI.V37I10.26443.
- [27] David A Levin and Yuval Peres. Markov chains and mixing times, volume 107. American Mathematical Soc., 2017.
- [28] Liang Li, Pinyan Lu, and Yitong Yin. Correlation decay up to uniqueness in spin systems. In SODA, pages 67–84. SIAM, 2013. doi:10.1137/1.9781611973105.5.
- [29] Jingcheng Liu, Alistair Sinclair, and Piyush Srivastava. The Ising partition function: Zeros and deterministic approximation. Journal of Statistical Physics, 174(2):287–315, 2019.
- [30] Barnabás Póczos and Jeff G. Schneider. On the estimation of -divergences. In AISTATS, volume 15 of JMLR Proceedings, pages 609–617. JMLR.org, 2011.
- [31] Paul K. Rubenstein, Olivier Bousquet, Josip Djolonga, Carlos Riquelme, and Ilya O. Tolstikhin. Practical and consistent estimation of -divergences. In NeurIPS, pages 4072–4082, 2019. URL: https://proceedings.neurips.cc/paper/2019/hash/3147da8ab4a0437c15ef51a5cc7f2dc4-Abstract.html.
- [32] Amit Sahai and Salil Vadhan. A complete problem for statistical zero knowledge. J. ACM, 50(2):196–249, 2003. doi:10.1145/636865.636868.
- [33] Allan Sly and Nike Sun. Counting in two-spin models on -regular graphs. Ann. Probab., 42(6):2383–2416, 2014.
- [34] Sreejith Sreekumar and Ziv Goldfeld. Neural estimation of statistical divergences. J. Mach. Learn. Res., 23:126:1–126:75, 2022. URL: https://jmlr.org/papers/v23/21-1212.html.
