New Algorithmic Directions in Optimal Transport and Applications for Product Spaces
Abstract
We consider the problem of optimal transport between two high-dimensional distributions in from a new algorithmic perspective, in which we are given a sample and we have to find a close while running in time, where is the size/dimension of . In other words, we are interested in making the running time bounded in dimension of the spaces rather than bounded in the total size of the representations of the two distributions. Our main result is a general algorithmic transport result between any product distribution and an arbitrary distribution of total cost under cost; here is the cost of the so-called Knothe–Rosenblatt transport from to , while is a computational error that goes to zero for larger running time in the transport algorithm. For this result, we need to be “sequentially samplable” with a “bounded average sampling cost” which is a novel but natural notion of independent interest. In addition, we prove the following.
-
We prove an algorithmic version of the celebrated Talagrand’s inequality for transporting the standard Gaussian distribution to an arbitrary under the Euclidean-squared cost. When is conditioned on a set of measure , we show how to implement the needed sequential sampler for in expected time , using membership oracle access to . Hence, we obtain an algorithmic transport that maps to in time and expected Euclidean-squared distance , which is optimal for a general set of measure .
-
As corollary, we find the first computational concentration (Etesami et al. SODA 2020) result for the Gaussian measure under the Euclidean distance with a dimension-independent transportation cost, resolving a question of Etesami et al. More precisely, for any set of Gaussian measure , we map most of samples to with Euclidean distance in time .
Keywords and phrases:
Optimal transport, Randomized algorithms, Concentration boundsFunding:
Omid Etesami: Thanks to Sabanci University and TEIAS for their support during part of this work.Copyright and License:
2012 ACM Subject Classification:
Theory of computation Algorithm design techniques ; Theory of computation Online algorithms ; Mathematics of computing Probabilistic algorithms ; Theory of computation Probabilistic computationEditors:
Ho-Lin Chen, Wing-Kai Hon, and Meng-Tsung TsaiSeries and Publisher:
Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
Optimal transport (OT) is a fundamental problem that arises in mathematics, science, and engineering, including differential geometry [17], transportation planning [40], economics [21], machine learning [34, 38], image registration [23], and seismic tomography [35]. In machine learning, it has been used in unsupervised learning [46], as a measure of the cost of misclassification [20], to define the fairness of algorithms [11], in Wasserstein GANs [2], for transfer learning [14], and in diffusion generative models [47, 26].
In the optimal transport problem, we would like to transport samples from a source distribution to points in the target distribution with a minimum expected “transportation cost” of transporting to . The study of this problem dates back to the work of Monge [33], who wanted the transportation mapping to be deterministic. Kantorovich [25] reformulated the problem by allowing to be a randomized (stochastic) mapping. In other words, we now look for a coupling over the distributions with minimum expected transportation cost , using which we define the optimal cost of transporting to ,
where is the set of all couplings between . OT is closely related to the notion of “Wasserstein metric” that generalizes OT using a parameter and is the same for .
As a prominent example of the use of OT in mathematics, Talagrand [43] gave a bound on the optimal transport, under the cost, of the -dimensional Gaussian measure to an arbitrary distribution based on the KL-divergence of from . Using this, he derived an essentially optimal concentration of measure result, showing that for any set of measure in , almost all of the measure is within (minimum) distance from .
Computational OT.
Computational aspects of OT have recently become extremely important on their own [38]. In the most common formulation of “computational OT”, we would like to compute or estimate efficiently. Computing is a key tool, e.g., for applications that use OT to quantify a loss that allows one to know “how far” we are from a target goal [4, 6, 7]. A common approach to computing is to work with empirical sample sets , and find the best OT between the empirical distributions that are uniform over (e.g., see [24, 32] and the references therein). This approximation converges to the quantity in the limit, and the OT between can be computed using the Hungarian algorithm for minimum weighted matching [29]. The popular iterative Sinkhorn algorithm solves a regularized version of the OT problem [42] but it also works with empirical sample sets, that is, i.i.d. samples from the distributions. Using empirical samples, one does not rapidly converge to the optimal OT in some elementary cases. For example, to transport the uniform distribution on the -dimensional unit cube to itself, the OT between two -size empirical versions of the original distribution is in distance even though the actual OT cost is zero.
Statistical OT.
The above approach of using empirical samples can in fact be used to approximate the transport map itself from to , as in [24, 32]. For example, Brenier’s theorem [10, 28] asserts that under the cost and suitable conditions, a unique Monge mapping achieves optimal transport, and one can aim at approximating this deterministic mapping. This approach is sometimes known as statistical optimal transport [13]. However, this approach needs exponential in samples for -dimensional distributions to achieve good approximate results. Some previous works like [24, 32] make improvements on this analysis by assuming further smoothness and structural conditions on the distributions but the curse of dimensionality basically remains intact. More importantly, to the best of our knowledge, no previous work models the algorithmic aspect of searching for the transport map by limiting its algorithm to run in polynomial time over the size of the input .
1.1 Our Contributions
In a nutshell, our contributions are (1) formalizing a new theory of algorithmic transport, (2) obtaining initial results on algorithmic transport for the high-dimensional setting, and (3) obtaining applications for algorithmic transport, e.g., to algorithmic concentration of measure. Each of the items above has multiple aspects that are elaborated in the following.
Algorithmic Transport in Polynomial Time.
The common computational OT formulation aims to compute or approximate the optimal transportation cost , yet it does not answer the key question of how to algorithmically compute the transport map efficiently over the size of the given input sample. I.e., suppose that we are given a particular sample as input, and we would like to map it to as follows: (1) The mapping shall be computed in polynomial time over the size of the input . (2) We would like to control the expected cost of the transportation. To point out the subtle distinction between our new algorithmic formulation and the traditional computational OT, in this work we use the term algorithmic transport to refer to the task of computing a (randomized) mapping efficiently based on its input size (e.g., the dimension of ), such that , whenever .
Algorithmic transport, when done optimally, can be used to approximate OT cost efficiently as well. In particular, when the transportation cost is bounded by a constant, using ) independent samples , the average gives an -approximation of the OT, with high probability. However, it is not clear how to do the reverse and obtain algorithmic transport from computational OT.
When are one dimensional, for natural (convex) costs such as one can find simple algorithms that simply use a “monotone” transportation plan [45]. Furthermore, when the distributions have small domains of size , one can use algorithms based on min-cost flows to find a full description of the OT from to in time [37]. However, our focus is on the high-dimensional setting and finding -time computable mappings between distributions of dimension with super-polynomial support. We ask:
If are -dimensional distributions, how can we find a -time computable transport map from to of a small/optimal cost?
Formalizing and answering the question above in various contexts is the focus of our work. Our studies also bear similarities to the line of work on approximating the total variation distance [5, 16] as it coincides with OT under the Hamming distance.
Transport in High-Dimensional Setting.
In this work, we approach the main question above through a study of so-called causal transports [30, 3] in high dimension, in which the transporting algorithm produces from in an online manner: The algorithm shall output based on and before receiving . Hence we also refer to those transports simply as online. The so-called Knothe-Rosenblatt transport (KR transport for short) [27, 39] is an important online transport with two properties: (1) its reverse is also online, and (2) it follows a “greedy” approach in each round by using a monotone mapping of dimension one. Our motivations for studying online transports is twofold: (1) Despite being a restriction on how the transport is done, the “online lens” guides us towards algorithm development; (2) In our eyes, information-theoretic study of online algorithms is interesting. In particular, in Section 2.1, we prove that the KR transport is optimal among all online transports when the source distribution is a product.
Main Result: Algorithmic Transport from Product Distributions.
Our main result (Theorem 28) is to design a -time online algorithm that transports a generic product distribution to any -dimensional distribution , assuming that (1) the transportation cost satisfies , where , and (2) the transporting algorithm has oracle access to proper samplers for both .
The algorithm is actually very simple: Given , having determined ,…, , to determine , it samples samples besides according to . Similarly it samples samples according to the conditional distribution of the th coordinate of conditioned on the values of . Then it optimally matches the two sets of two samples. The value of is the match of in this matching.
The transportation cost of turns out to be , where is the optimal cost of online transports from to (which, as we will prove, will coincide with the KR transport [27, 39] in our settings of interest), and is a term that could be made smaller by choosing larger. We show that the reverse transport from back to the product can be done algorithmically as well. This will be useful for deriving further algorithmic transports through composition.
Sequential Samplers.
When it comes to the samplability conditions needed in our main results above, we merely require that we can sample from efficiently. However, for the non-product distribution , the samplability condition is stronger and we require that one can sample from conditioned on any previously sampled prefix . We refer to such samplers as sequential samplers. A key quantity of interest is the complexity of iterative sampling of the coordinates sequentially (conditioned on previous ones) till we obtain a full sample . We would like to have samplers where the average complexity of this sequential generation is bounded. As it turns out, we can indeed bound such costs in our special cases of interest.
From a real-world application point of view, this notion of efficient sequential sampler is very natural in some generative models. This is indeed the case for transformer-based language models that autoregressively generate their tokens one by one, each conditioned on the previously sampled sequence of tokens [44, 18]. That is, the joint distributions produced by these generative models have sequential samplers of low expected cost, as they indeed generate their sequence of symbols in a reasonable time and in an online fashion.
Algorithmic Transport for the Standard Gaussian Distribution.
One of the fundamental results in OT is Talagrand’s transportation inequality for the -dimensional Gaussian distribution [43]. It is proved that for every distribution , , in which the cost is measured in , i.e., , and denotes the Kullback–Leibler divergence. In this work, we lift this classical result to the algorithmic setting. Note that, as mentioned in [43], this bound is optimal in general, e.g., when is a shifted , in which case our results converge to this optimal bound as well. In particular, we derive this result from our main result by proving the following two complementary claims:
-
Information theoretic: We observe that Talagrand’s bound of upper bounds not only the best “offline” transport from the standard Gaussian , but also the best optimal online transportation of to . Namely, we show that , where is the optimal online transportation cost as defined above.
-
Computational: We use results from [19] to show that the Gaussian distribution in one dimension has a small transportation cost to its empirical samples on average.
Transporting Standard Gaussian to Conditional Gaussian.
We show that in a natural setting, where is the Gaussian distribution conditioned on an event of Gaussian measure , such sequential samplers can be efficiently simulated using oracle access to membership tests in . In other words, we find an algorithmic oracle-aided transportation algorithm that simultaneously work for all such distributions . Note that such distributions have . We obtain algorithmic transports running in expected time that achieve transport cost that converges to the upper bound of Talagrand.
Dimension-Independent Computational Concentration for Gaussian Spaces.
One of the applications of OT is to obtain concentration of measure (CoM) inequalities [22]: One shows that any set of “sufficiently large” measure in a concentrated metric probability space , where is a distribution and is a distance metric, expands to cover most of the measure in , when we consider neighbors of within a certain distance. Recently, a computational (algorithmic) variant of the CoM phenomenon has been introduced [31, 15], in which one aims to show that the reverse mapping can be computed efficiently from almost all of the points in back to by moving the points within a bounded distance. Namely, given a typical sampled point , we aim to algorithmically find a “close neighbor” of bounded distance . The work of [15] obtained such results for various settings, but their work left open obtaining computational CoM with dimension-independent (optimal) distance for the basic and natural space of Gaussian distributions under the distance. Using our oracle set-transportation result for Gaussian spaces mentioned above, we resolve this open question and obtain such an optimal and dimension-free bound (see Corollary 36).
Reductions for (Deriving New) Algorithmic Transport.
Finally, considering the role of reductions in resolving algorithmic tasks, we also develop the (right) notion of algorithmic reductions for the goal of relating algorithms for (optimal) transport across different spaces. In particular, suppose are distributions and are two different transportation costs. In the full version we state conditions under which, we can automatically transform an algorithmic transport result from to (under the cost to a similar result that transports to (under the cost ) for specific distributions and arbitrary distribution . We then show how to realize such reductions when we transport uniform distributions over the unit cube and the unit sphere (to an arbitrary distribution) by reducing them to the case of transporting Gaussian distributions. Consequently, we obtain algorithmic transports from these distributions as well.
2 Basic Concepts
In this section, we define the key notions studied in this paper and prove their basic properties.
Notation.
We let . We denote the source (initial) distribution as . When is distributed over , we say that has dimension and by we denote the distribution of its th coordinate. We usually denote , where and . means that is a product distribution. We use a similar notation for the target distribution . By we denote the process of running a probabilistic algorithm on input to obtain output . When is a distribution, denotes an oracle algorithm that has access to fresh samples from , and when is a set, denotes a similar situation where the oracle responds membership in . For vector , by we denote the prefix vector . When a distribution of dimension with marginals is clear from the context, by , we denote the distribution of conditioned on having sampled from for all . For further clarity on the underlying joint distribution, we might sometimes use instead. By or we denote the probability of event under the distribution . Whenever it is clear from the context, for an outcome , we use to either denote the probability of the outcome or the density of at depending on whether is discrete or continuous. By we denote the support set of , which for the discrete and continuous cases can be defined as . When , their Kullback–Leibler (KL) divergence is denoted as with the natural logarithm basis. In the preceding definition and generally throughout this paper, we use the summation notation corresponding to discrete distributions; the corresponding results for continuous distributions replace sums with proper integrals. For , the -norm and -distance over are defined as , and .
Transportation Costs.
In the following, all transportation costs, usually denoted as , are functions with non-negative outputs that model the cost of transporting to . We always assume to be lower semi-continuous but do not assume to be symmetric or satisfy the triangle inequality; we state these conditions, whenever needed.
Definition 1 (Coupling and Optimal Transportation Cost).
We say that a distribution over pairs with marginals is a coupling of if . If for every , there is a unique such that , then we call this a deterministic (Monge) transport from to . For a cost , the transport cost of a coupling of is defined as
We refer to as the (Wasserstein) -cost of under . If denotes the set of all couplings between , the (Kantorovich) optimal transportation cost for is defined as
The infimum in Definition 1 for defining the optimal transportation costs turns out to be a minimum as is lower-semi continuous [1].
Definition 2 (Algorithmic Transport).
For distributions , algorithm is a transport from distribution to distribution if is a (probabilistic) algorithm such that whenever . By we denote the coupling created by . For a transportation cost the transportation cost of is defined as .
Computational Model.
In Definition 2, we need to either work with discrete distributions with samples of finite length, or when the distributions are continuous we need to work with the generalization of algorithms working with real numbers as formalized in [8, 9].
We now define an algorithmic variant of so-called causal transport [30] with a discrete time [3], We call it “online” to emphasize on the algorithmic aspect a la online learning [41].
Definition 3 (Online (Algorithmic) Transport).
For distributions of dimension , we call a (probabilistic and perhaps computationally unbounded) algorithm an online transport algorithm from to if it forms a transport from to , while it makes its decisions in an online way. Namely, has an internal iterating process (for simplicity also denoted by ) that reads coordinate by coordinate while holding an internal state, initially . In the th iteration, we have , and at the end we output . We also let to be the set of all couplings that can be obtained by online algorithms and for a transport cost obtain the optimal online transportation cost as
To contrast and emphasize on a transport not being necessarily online, we refer to (potentially) non-online transports as offline transports.
We now define a class of couplings that is closely related to online transport.
Definition 4 (Online Coupling).
Suppose is a coupling between -dimensional distributions , and is the corresponding marginal coupling between . We call an online coupling if for all , is a coupling of (according to ) and (according to ). If denotes the set of all online couplings between , for a transport cost we obtain the optimal online coupling cost between as
We now show how to characterize online couplings using online transports.
Proposition 5.
A coupling between is online if and only if it can be obtained through both an online transport from to and an online transport from to .
Definition 6.
We call the cost function over linear over , if for all .
Greedy Coupling.
One might wonder how we can compute/approximate . One approach is to use greedy methods, by trying to use an optimal coupling in each round. This is formalized in the following definition in settings with dedicated costs for each round. We will then discuss when this method succeeds in Theorem 10. More generally, we define locally-optimal couplings, even when they are not online.
Definition 7 (Locally Optimal and Greedy Couplings).
Suppose the cost function over is linear over . A coupling between is locally optimal, if for every , it holds that is an OT between ; i.e.,
When is an online coupling as well, the above condition simplifies to in which case we call greedy. For denoting the set of all greedy couplings from to , we define
Remark 8 (Greedy vs. Knothe-Rosenblatt Transports).
Greedy couplings are closely related to Knothe-Rosenblatt (KR for short) transports [27, 39]. Specifically, for a greedy coupling , when the cost functions are convex, for any , the locally optimal coupling could be obtained by simply using the unique monotone mapping [12]. Hence, KR coupling is a special case of greedy couplings and cover many interesting cases in this class. For example, when the cost function is for , then equals the cost of the KR coupling between and . However, due to the generality of greedy couplings (e.g., for non-monotone costs) we define and use greedy transports.
Lambda and Delta Cost Functions.
We now define two functions that play key roles in our analysis of the cost of online transports. The first (Lambda) function depends on a coupling, while the second one (Delta) depends on the two distributions that are coupled. As we prove later in Proposition 12, Lambda is a parameter that lower bounds the cost of any coupling. Delta is the optimal online transport from a product distribution to another one.
Definition 9 (The Lambda and Delta Functions).
For a coupling of dimension between distributions of dimension , and a cost function that is linear over , we define the Lambda functions as
We also define the Delta function between distributions of dimension as
Note that the coupling in Definition 9 does not have to be online. Furthermore, the definition of does depend on the order of the coordinates of the -dimension distributions.
2.1 Online Coupling and Transport from Products
We end this section by stating a theorem showing that, whenever is product, any online coupling that is “locally optimal” in the sense that given history it finds (an arbitrary) optimal transport between , finds an optimum online coupling between as well as an optimal online transport from to . This theorem does not assume convexity of the costs. As stated in Remark 8, for convex transportation costs, greedy algorithms can be instantiated using the KR transform.
Theorem 10 (Optimal Online Coupling and Transport from Products).
If is product and the cost function is linear over , then
Before proving Theorem 10 we prove some basic tools that are used in the proof. The first lemma that we state can be obtained from a simple application of the linearity of expectation.
Lemma 11 (Cost Splitting).
Let be a coupling between distributions of dimensions , and let be the corresponding coupling between the marginals . Suppose is linear over , and is an -dimensional distribution that is arbitrarily correlated with . Then,
In particular, we can choose , , or as special cases.
We now prove some basic properties of the two functions, showing how to use it and how to characterize it in some special settings. In summary, Lambda function lower bounds the transportation of every coupling, while Delta will play a key role in characterizing the transportation cost for product distributions.
Proposition 12 (Properties of Lambda and Delta Functions).
Suppose couples and is linear. The Lambda function satisfies the following properties.
-
1.
Lower Bound: For all , , and the equality holds iff is locally optimal.
-
2.
Online Transports from Products: If is an online transport and , then
-
3.
Online Coupling for Products: If is an online coupling, and is product then
Proof of Proposition 12.
We prove the claims in order.
-
1.
By letting in Lemma 11, we get
where the inequality follows from the fact that minimizes the transportation cost.
-
2.
We first claim that, in this case, for every , we have . This is true, because (1) and the fact that is an online transport, and (2) by the fact that is a product. Therefore,
We now use the right hand side. In analyzing the right hand side, we first use Lemma 11 (using ) and then sample in reverse order,
where for each , we sample by first sampling and then sampling conditioned on . Now, for every , we claim that
This claim follows from Part 2 of Proposition 19 and the fact that the average of over the choice of is .
-
3.
When the coupling is further an online coupling, then the equality holds, because , and the last inequality above becomes an equality.
Proof of Theorem 10.
It is enough to prove the following two claims.
-
1.
.
-
2.
.
The reason is that we already know (as being greedy is a limitation), and so proving the two claims above would imply all the equalities of the theorem statement.
To prove the first claim, we observe that cost can be achieved using (any) greedy algorithm that (by definition) optimally couples with in the th step. In fact, all greedy coupling algorithms have the same cost when one of the distributions is product.
3 Basic Tools
3.1 Composition and Triangle Inequalities
Multi-distribution Coupling and Composition.
We now generalize the notion of coupling to more than two distributions and use it to define composition of (online) couplings.
Definition 13 (Multi-distribution Coupling).
A coupling of is a distribution over -vectors such that the marginal of the th coordinate is distributed as .
Definition 14 (Composition of Couplings).
For coupling over and coupling over , we define the composition of and as the marginal of the first and third coordinates of the (unique) coupling of such that.
-
1.
For , the marginal distribution of in is distributed as .
-
2.
In the coupling , are independent, conditioned on .
We now use Wasserstein -cost, to state the following well-known triangle inequality.
Lemma 15 (Triangle Inequality for Wasserstein -Costs).
Suppose a cost function satisfies the triangle inequality (but not necessarily symmetry) and . Then, for every coupling over with marginal coupling over , we have the following,
The following proposition can be obtained from the triangle inequality of Lemma 15.
Proposition 16 (Triangle Inequality for Wasserstein -Costs in Multi-Round Settings).
Let be a distribution over , and for every let be a distribution over triples of distributions over . Suppose satisfies the triangle inequality and is linear over for . Then, the following holds.
The following can be obtained from the definition of online transport and Lemma 15.
Lemma 17 (Properties of the Composition of Online Transports).
Consider an online transport from to with coupling and an online transport from to with coupling . Let be the composed coupling. Then,
-
1.
The coupling is an online coupling.
-
2.
There is an algorithm that transports to as the coupling , whose complexity is bounded by running followed by running .
-
3.
If the cost function satisfies the triangle inequality, then for all the following holds
3.2 Transport Through Intermediate Distributions
In this section, we describe a method of transporting to (perhaps in an online and iterative way) through optimal transports between intermediate distributions in one dimension. We start with some definitions. We start by defining the notion of average for distributions and stating a general way of transporting through averages.
Definition 18 (Average Distribution).
Suppose is a distribution over distributions. We define the average of , denoted as , to be the distribution of the random variable that is sampled by first sampling and then . Namely, is the distribution that for all the events defined over .
Proposition 19 (Transport to Averages).
Suppose is a distribution over distributions with average .
-
1.
Suppose is the following joint distribution. We first sample , then couple with as , and then output a sample . Then, is a coupling between .
-
2.
.
Proof.
Part 1 holds because the marginals of and have the marginals of . Part 2 follows from Part 1 and picking to be the optimal transport between .
The following definition states a way of finding a transport from to by working with alternative (intermediate) distributions that approximate .
Definition 20 (Transport Through Intermediate Distributions).
Let be distributions, be a cost function, and be a distribution over pairs of distributions. We say that algorithm couples through (the intermediate distribution) , if the following conditions hold.
-
1.
produces marginals with averages . I.e., and .
-
2.
Algorithm first samples , then finds some optimal transport between according to , and finally outputs .
Definition 21 (Conditioning and Composing Transports with Distributions).
Suppose are distributions and is a transport from to . If , then consider the following sampling process.
-
1.
Sample .
-
2.
Sample from the -coordinate of , conditioned on its -coordinate being .
Then, the notation denotes the joint distribution of and denotes the distribution of . Additionally, if is a distribution over distributions, then denotes the distribution over distributions sampled by outputting for .
Notation.
Let be the distribution over distributions obtained by first sampling , and then outputting . A simple observation is that for all .
Proposition 22.
If is a distribution over distributions with average distribution , and if is any transport from to , then the following holds.
-
1.
is a distribution over distributions with average .
-
2.
For cost , in which is defined in Definition 21.
-
3.
, and if is samplable in time and coupling is computable in time , then one can sample the set that describes in time .
Proof.
For Part 1, observe that if we sample for , by definition we get , which means will be sampled as . For Part 2, also computes the cost of the same coupling by breaking it into marginal costs based on how is sampled. For Part 3, let . We first sample and then let for . It holds that s are independently sampled according to , and because transports to , ’s are also independently sampled according to .
Lemma 23 (Multi-Round Algorithmic Coupling Through Intermediate Distributions).
Suppose cost function satisfies the triangle inequality, and is linear over for . Let , with marginals be a transport from with marginals to with marginals . For round and previously sampled , suppose is a distribution over pairs of distributions defined based on , and is an optimal transport from to under . Suppose can also be obtained using the following algorithm in rounds. In round and for previously sampled , couples and through the intermediate distribution as defined in Definition 20 using the cost . Then,
where refers to the inverse coupling that changes the order of its marginals.
Proof of Lemma 23.
The proof uses the triangle inequality for Wasserstein -costs for the multi-round setting (Proposition 16).
For each and , consider the following sampling process that extends by outputting one more coordinate as well.
-
1.
Sample .
-
2.
Let .
-
3.
Obtain .
It holds that , which is the left side of the inequality of Proposition 16, and the right side is:
The first term is exactly the first term on the right hand side of the inequality of the lemma. Therefore, all we have to do is to prove that
In fact, we prove this statement for every choice of and , so ignoring we prove the claim:
where the middle term is added for the proof.
We now prove both the inequality and the equality above through the steps below.
-
Inequality: Again, using , we have
where the inequality is due to the fact that is the optimal cost.
3.3 Borrowed Tools
The following can be obtained from the proofs in [43, 22] (see the full version). For , it gives the celebrated Talagrand’s transportation inequality for Gaussian under .
Theorem 24 (Talagrand’s Inequality for the Gaussian Measure).
If , , is the standard Gaussian and is an arbitrary distribution both in , then
Definition 25 (Transports to Empirical).
For distributions and symmetric cost , we let denote the cost of transporting to an empirical set of size , where is the uniform distribution over the multi-set .
The following lemma follows from [19] and known moments of the Gaussian distribution.
Lemma 26 (Original-to-Empirical Transport for the Normal Distribution).
Let , be , and is the normal distribution. Then, for a constant depending on ,
4 Algorithmic Transport for Products
In this section, we put together the tools from previous sections to derive algorithmic results about online transport for the setting that one of the source or target distributions is product. We then derive a corollary for the Gaussian measure. We first define sequential samplers.
Definition 27 (Sequential Sampler).
For a distribution in dimension with marginals , we call its sequential sampler for , if for all calling returns an independent answer . For queries , calling returns . We also assign a (sequential sampling) cost to query , and call the average (sequential sampling) cost of . For an oracle-algorithm calling (a potentially randomized) set of queries to , we define its average total cost of calling as 111Since naturally measures the (e.g., computational) cost of sampling a coordinate conditioned on previously sampled coordinates, for natural settings and independent , the value of will be independent of .
One natural way of using is to model sampling time, but it can model other costs as well. The average cost of is indeed the average total cost of the following simple algorithm that uses sequentially to obtain a full sample: Let be the empty string, and for , let . Also, when is a product distribution, then is nothing other than a direct way of sampling from independent distributions for all .
Before stating our main result, recall the notation for transport cost to empirical sets from Definition 25.
Theorem 28 (Main Result).
Suppose and are distributions over , with sequential samplers and corresponding oracle cost functions . Suppose the transportation cost function is a metric (i.e., symmetric and satisfies the triangle inequality) and is linear over symmetric costs .222An example is . Then, there is an algorithm , parameterized by , that uses oracle access to samplers and achieves the following:
- 1.
-
2.
The average total cost of calling is as follows. and .555Note that because is a product distribution, if models the computational cost of sampling from , then we would have , where models the computational cost of sampling from .
-
3.
There is an algorithm that achieves the same as does, but it transports back to .
Proof.
At a high level, we use an empirical variant of the greedy algorithm (which is related to the KR transport) to design the algorithm. The algorithm itself is quite simple; the bulk of the work goes into its analysis, which is quite delicate and uses many tools from Section 3.
The Transportation Algorithm .
The algorithm works in rounds. In round , given find as described below.
-
1.
For , let be independent samples forming the multi-set of size .
-
2.
Pick at random. For all , let be independent samples. Additionally, let , and be the multi-set of size .
-
3.
Find the optimal transport between the two distributions under the cost (e.g., using the Hungarian method666This method can be implemented faster when the cost function is convex, in which case simply sorting gives us the optimal matching, as a monotone mapping.) that is in the form of a matching between and .777This can be proved, e.g., using the Birkhoff–von Neumann decomposition of doubly stochastic matrices.
-
4.
Output that is matched with .
We now analyze the algorithm above.
Transportation.
’s running time is clearly . We now prove that ’s algorithm produces an online coupling between , by showing that in round , it couples and . It is simple to check that all the elements of are distributed as and all the elements of are distributed as in . At first, it might not be clear why is distributed as , because the matching algorithm might change its distribution by picking it adversarially. However, since the algorithm hides the index of and statistically hides it among , the final “matched pair” is a random edge of the optimal matching/transport. Therefore, is also distributed accurately, and hence is producing an online coupling.
More formally, we can choose at random after the matching between is chosen. Moreover, the marginal distribution of is . Therefore, for every (even fixed) matching between , picking at random will lead to picking where is the index of the sample in that is matched with the index in . Therefore, .
The Cost.
To analyze the transportation cost we apply Lemma 23 from Section 3, which is stated in a more general form to better demonstrating the key ideas.
To apply Lemma 23, let return pair of distributions that are constructed using independent sample multi-sets of size , in order, from . Finally, because the algorithm finds an optimal transport between , we will have the premises of Lemma 23 and conclude that
in which is an (optimal) transport from to . (See Definition 21 for the notation.) We now further simplify the summation above.
-
Now, in the first term, both coordinates of are irrelevant to the summation.
Oracle Costs.
In each round, asks samples from and samples from . Furthermore, the previous samples are sampled according to itself, so the average total cost will be as claimed.
Inverse Transport.
The reverse mapping uses the same algorithm for one dimension transport, but it maps to , and inspection shows its transportation and (expected) total oracle costs will be the same as that of .
4.1 Extending Transport to Conditional Distributions
In this subsection, we study how to use the main result of Theorem 28 and obtain transports from the same to a more rich set of distributions that can be obtained from by conditioning on an event of not-so-small probability. Doing so would be extremely useful, when later on, we focus on transporting Gaussian distributions to the same distributions conditioned on an event . To prove this extension, we prove a general result about using sequential samplers for to obtain sequential samplers for .
Theorem 29 (Sequential Samplers for Event-conditioned Distributions).
Suppose is an -dimensional distribution that has a sequential sampler with average cost . Suppose is an event of measure , and is conditioned on . Then, there is an algorithm that uses oracle and a membership oracle and achieves the following.
-
1.
For all , .
-
2.
If we define be the average total cost of querying , and if we define , then
-
3.
When iteratively sampling , the expected number of calls to in round is at most , making the total expected number of calls to to be at most .
-
4.
The running time of the iterative sampling of , relative to the provided oracles is at most .
In other words, one can use to emulate a sequential sampler for in such a way that the average cost of obtaining a full sequence using nested calls to the provided sequential sampler only goes up (at most) by a multiplicative factor .
The main idea in the proof is to use rejection sampling with a subtle analysis. Namely, simply keeps using to obtain full sequences multiple times until the sample sequence falls within the event . The full proof follows.
Proof of Theorem 29.
For , let and .
Our algorithm samples from as follows.
-
1.
Sample from as follows: for sample fresh values .
-
2.
If , then output ; otherwise, go back to the previous step.
We refer to each execution of the two steps above (that has exactly one call to ) a trial.
Part 1 follows from the fact that the above sampling process is a simple rejection sampling. To prove Part 2, let be a random variable that counts the number of trials, and let its expectation be
Also let It can be observed that . Using these notations, the oracle sampling cost of at will be
Therefore, the average cost of will be
A subtle point is that, in the above sums the first half is sampled conditioned on , while the second half is done without such conditioning. We claim that for each , we have
| (1) |
Note that if Eq. (1) holds, then we conclude Part 2, because we get:
The following lemma proves Eq. (1) using , , and as before.
Lemma 30 (Expected Cost of Two-Step Sequential Sampling).
Suppose is distributed over with margins , and has probability . Also, suppose is a random variable defined over with average . Consider the following process: (1) Sample , which is the marginal distribution of in and let , in which is the marginal distribution over in conditioned on . (2) Sample . Then,
Proof.
We write the proof for the discrete setting. A similar proof holds in general. For each , define and . We have , and is the probability we sample in the sampling process of the lemma statement. Then, if we let , we have
To prove Part 3, using Lemma 30 and , we conclude that the expected number of times we call the oracle at each node is at most .
Deriving corollaries.
Using Theorem 29, we can derive more transportation results from Theorem 28 by conditioning on an arbitrary event for which we have a membership oracle at hand. Note that the parameter will change to a new value, but the key point is that we can control the cost of sequential samples from the new oracle, so long as we could do so for the initial oracle. Another interesting application of Theorem 29 is to transport a product distribution to for an arbitrary event , obtaining the following corollary.888In the next section we apply this idea to the special case of Gaussian distributions.
Corollary 31.
Suppose the assumptions of Theorem 28 hold. Then, we have the following:
-
1.
There is an algorithm such that, for all events defined over , transports to in expected time and -cost , in which is as in Theorem 28 and .
- 2.
In both cases above, the expected number of calls to is at most , and the transportation can be reversed with the same upper bounds on the running time and oracle costs.
5 Algorithmic Transport for Gaussian
In this section we focus on cases where at least one of the two distributions involved in the transport is Gaussian. We first use the main result of Theorem 28 and derive an algorithmic variant of Talagrand’s result [43] about transporting Gaussian measure to arbitrary distributions with bounded KL divergence from Gaussian. We then derive, as a corollary, a computational concentration result for the Gaussian source measure under the distance. Finally, we focus on finding (optimal) online transports in cases where both the source and destination are Gaussians, but they could be arbitrary (non-product) Gaussians.
5.1 Algorithmic Variant of Talagrand’s Transport for Gaussian
Theorem 32 (Algorithmic Version of Talagrand’s Gaussian Transport Theorem).
Let be the standard Gaussian in dimension and be an arbitrary distribution in . There is an algorithm , with integer parameter , such that whenever is provided with a sequential sampler for , the following properties hold.
-
1.
For all and , transports to in time with -cost at most
For , by the Talagrand inequality of Theorem 24, we have .
-
2.
The average total oracle cost of is at most
-
3.
There is an algorithm that achieves the same as , but it transports back to .
Remark 33 (Working with instead of ).
One might wonder what happens if we want to measure (and upper bound) transfer costs using rather than . However, this can be obtained using Jensen’s inequality (or rather the monotonicity of Wasserstein -costs for a fixed cost ). In particular, for every coupling , we have for all . Hence Theorem 32 is stated in the stronger form already.
Proof of Theorem 32.
The proof follows directly from Theorem 28 and Lemma 26. Namely, we use Corollary 26 to bound the term in Theorem 28 that upper bounds the transportation cost of empirical Gaussian from the Gaussian itself. One small point here is that, we will not need oracle samplers from the Gaussian itself, as we can use well-known sampling methods such as the Box-Muller method that generate such samples efficiently [36].999In particular, given two independent and uniform , the sampling works as follows: are independent samples .
We now focus on a special case of interest, in which the target distribution is for an event of probability , and show that in this case, one can have a single online transportation algorithm that uniformly works for all by merely accessing through a membership oracle. We first define such uniform transportation algorithms.
Definition 34 (Oracle Set-Transport).
For distribution and transportation cost , we say that has a set-transport of cost at most for a non-increasing function , if for every event , it holds that We further say that has an oracle set-transport of cost at most if there is a single algorithm such that with oracle membership queries for an arbitrary set and sampling queries for , produces a transport of cost at most from to .
Theorem 35 (Oracle-Set Transport for Gaussian Measure).
Let be the standard Gaussian in dimension . There is an (online) oracle-set transport algorithm for such that:
-
1.
For all and of measure ,
which is at most , for sufficiently large .
-
2.
In expectation, asks at most queries to and runs in .
-
3.
There is an algorithm that achieves the same, but transports back to .
Proof of Theorem 35.
To prove Theorem 35 we first use the first item of Corollary 31 where . This way, we already know that the running time of the transportation algorithm and its number of calls to are bounded as stated.
Then, we need to bound both terms . To bound , we again use Corollary 26 as we did in the proof of Theorem 32. To bound , we again use Corollary 24 and the well-known fact that for such that (applied to ).
Due to our transports being “reversible”, one can obtain a variant of the result above that transports conditional distributions to conditional distributions through composition.
5.2 Dimension-Independent Computational Concentration for Gaussian
It is well-known that transportation inequalities can be used to derive concentration of measure results [22]. Recently, a computational variant of this phenomenon has been explored [31, 15], which bears similarities to how we make transportation algorithmic. In a computational concentration result, we need an algorithm that maps “most” of the sampled points from the space to any “sufficiently large” event , algorithmically. The “cost” of the concentration is (a worst-case) allowed distance that the algorithm is allowed to move the points, and its error is the fraction of the sampled points that it fails to map to withing the allowed distance . The work of [15] obtained such results optimally for some settings (e.g., Gaussian under distance), however they left open obtaining an optimal (dimension-free) computational concentration result for the Gaussian space under the distance.
Using Theorem 35, we can resolve the question left open in [15] and derive such optimal computational concentration for the Gaussian space under as a simple corollary to our algorithmic transport result. Theorem 36 below follows from Theorem 35 and the Markov inequality. Using below implies the desired dimension-independent result.
Corollary 36 (Computational Concentration for Gaussian).
For all , given oracle access to , of Theorem 35 runs in -time and with probability over , it finds a point of distance
References
- [1] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008.
- [2] Martín Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, 2017. URL: http://proceedings.mlr.press/v70/arjovsky17a/arjovsky17a.pdf.
- [3] Julio Backhoff, Mathias Beiglbock, Yiqing Lin, and Anastasiia Zalashko. Causal transport in discrete time and applications. SIAM Journal on Optimization, 27(4):2528–2562, 2017.
- [4] Arjun Nitin Bhagoji, Daniel Cullina, and Prateek Mittal. Lower bounds on adversarial robustness from optimal transport. Advances in Neural Information Processing Systems, 32, 2019.
- [5] Arnab Bhattacharyya, Sutanu Gayen, Kuldeep S Meel, Dimitrios Myrisiotis, A Pavan, and NV Vinodchandran. On approximating total variation distance. In IJCAI, 2023.
- [6] Jeremiah Birrell and Mohammadreza Ebrahimi. Adversarially robust deep learning with optimal-transport-regularized divergences. arXiv preprint, 2023. doi:10.48550/arXiv.2309.03791.
- [7] Jose Blanchet, Karthyek Murthy, and Fan Zhang. Optimal transport-based distributionally robust optimization: Structural properties and iterative schemes. Mathematics of Operations Research, 47(2):1500–1529, 2022. doi:10.1287/moor.2021.1178.
- [8] Lenore Blum. Complexity and real computation. Springer Science & Business Media, 1998.
- [9] Mark Braverman. On the complexity of real functions. In 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS’05), pages 155–164. IEEE, 2005. doi:10.1109/SFCS.2005.58.
- [10] Yann Brenier. Polar factorization and monotone rearrangement of vector-valued functions. Communications on pure and applied mathematics, 44(4):375–417, 1991.
- [11] Maarten Buyl and Tijl De Bie. Optimal transport of classifiers to fairness. In Advances in Neural Information Processing Systems, 2022.
- [12] Guillaume Carlier, Alfred Galichon, and Filippo Santambrogio. From knothe’s transport to brenier’s map and a continuation method for optimal transport. SIAM Journal on Mathematical Analysis, 41(6):2554–2576, 2010. doi:10.1137/080740647.
- [13] Sinho Chewi, Jonathan Niles-Weed, and Philippe Rigollet. Statistical optimal transport. arXiv preprint, 2024. arXiv:2407.18163.
- [14] Nicolas Courty, Rémi Flamary, Amaury Habrard, and Alain Rakotomamonjy. Joint distribution optimal transportation for domain adaptation. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 3733–3742, 2017.
- [15] Omid Etesami, Saeed Mahloujifar, and Mohammad Mahmoody. Computational concentration of measure: Optimal bounds, reductions, and more. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 345–363. SIAM, 2020. doi:10.1137/1.9781611975994.21.
- [16] Weiming Feng, Heng Guo, Mark Jerrum, and Jiaheng Wang. A simple polynomial-time approximation algorithm for the total variation distance between two product distributions. In Symposium on Simplicity in Algorithms (SOSA), pages 343–347. SIAM, 2023. doi:10.1137/1.9781611977585.CH30.
- [17] Alessio Figalli and Cédric Villani. Optimal Transport and Curvature, pages 171–217. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011. doi:10.1007/978-3-642-21861-3_4.
- [18] Nicolas Ford, Daniel Duckworth, Mohammad Norouzi, and George E Dahl. The importance of generation order in language modeling. arXiv preprint, 2018. arXiv:1808.07910.
- [19] Nicolas Fournier and Arnaud Guillin. On the rate of convergence in wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(3):707–738, 2015.
- [20] Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya, and Tomaso A. Poggio. Learning with a wasserstein loss. In Advances in Neural Information Processing Systems, volume 28, pages 2053–2061, 2015.
- [21] Alfred Galichon. The unreasonable effectiveness of optimal transport in economics. Proceeding of the 2020 World Congress of the Econometric Society, 2020.
- [22] Nathael Gozlan and Christian Léonard. Transport inequalities. a survey. Markov Processes And Related Fields, 16:635–736, 2010.
- [23] Steven Haker, Lei Zhu, Allen Tannenbaum, and Sigurd Angenent. Optimal mass transport for registration and warping. International Journal of computer vision, 60(3):225–240, 2004. doi:10.1023/B:VISI.0000036836.66311.97.
- [24] Jan-Christian Hütter and Philippe Rigollet. Minimax estimation of smooth optimal transport maps. The Annals of Statistics, 49(2), 2021.
- [25] Leonid V Kantorovich. On the translocation of masses. In Dokl. Akad. Nauk. USSR (NS), volume 37, pages 199–201, 1942.
- [26] Daegyu Kim et al. Improving diffusion-based generative models via approximated optimal transport. arXiv preprint, 2024. arXiv:2403.05069.
- [27] Herbert Knothe. Contributions to the theory of convex bodies. Michigan Mathematical Journal, 4(1):39–52, 1957.
- [28] Martin Knott and Cyril S Smith. On the optimal mapping of distributions. Journal of Optimization Theory and Applications, 43:39–49, 1984.
- [29] Harold W Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly, 2(1-2):83–97, 1955.
- [30] Rémi Lassalle. Causal transport plans and their monge–kantorovich problems. Stochastic Analysis and Applications, 36(3):452–484, 2018.
- [31] Saeed Mahloujifar and Mohammad Mahmoody. Can adversarially robust learning leverage computational hardness? In Algorithmic Learning Theory, pages 581–609. PMLR, 2019. URL: http://proceedings.mlr.press/v98/mahloujifar19a.html.
- [32] Tudor Manole, Sivaraman Balakrishnan, Jonathan Niles-Weed, and Larry Wasserman. Plugin estimation of smooth optimal transport maps. The Annals of Statistics, 52(3):966–998, 2024.
- [33] Gaspard Monge. Mémoire sur la théorie des déblais et des remblais. Histoire de l’Académie Royale des Sciences de Paris, 1781.
- [34] Paul Montesuma, Loic Ngolè Mboula, and Antoine Souloumiac. Recent advances in optimal transport for machine learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
- [35] Ludovic Métivier, Romain Brossier, Jean Virieux, and Jesus de la Puente. Measuring the misfit between seismograms using an optimal transport distance. Geophysical Journal International, 205(1):345–377, 2016.
- [36] Raymond Edward Alan Christopher Paley and Norbert Wiener. Fourier transforms in the complex domain, volume 19. American Mathematical Soc., 1934.
- [37] Gabriel Peyré. Course notes on computational optimal transport. Mathematical Tours, 2024. URL: https://mathematical-tours.github.io/.
- [38] Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport: With applications to data science. Foundations and Trends® in Machine Learning, 11(5-6):355–607, 2019.
- [39] Murray Rosenblatt. Remarks on a multivariate transformation. The annals of mathematical statistics, 23(3):470–472, 1952.
- [40] Filippo Santambrogio. Models and applications of optimal transport theory, 2009. Lecture Notes, Grenoble.
- [41] Shai Shalev-Shwartz et al. Online learning and online convex optimization. Foundations and Trends® in Machine Learning, 4(2):107–194, 2012.
- [42] Richard Sinkhorn and Paul Knopp. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21(2):343–348, 1967.
- [43] Michel Talagrand. Transportation cost for Gaussian and other product measures. Geometric & Functional Analysis GAFA, 6(3):587–600, 1996.
- [44] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
- [45] Cédric Villani. Topics in optimal transportation, volume 58. American Mathematical Soc., 2021.
- [46] Hongkang Yang and Esteban G. Tabak. Clustering, factor discovery and optimal transport. IMA Journal of Applied Mathematics, 10(4):1353–1387, 2021. doi:10.1093/imaiai/iaaa040.
- [47] Yi-Zhuang You et al. Renormalization group flow, optimal transport and diffusion-based generative model. Physical Review E, 2024.