
New Algorithmic Directions in Optimal Transport and Applications for Product Spaces

Salman Beigi Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
TEIAS, Khatam University, Tehran, Iran
Omid Etesami EPFL, Campus Biotech, Geneva, Switzerland
Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
Mohammad Mahmoody Charlottesville, VA, USA
Amir Najafi Sharif University of Technology, Tehran, Iran
Abstract

We consider the problem of optimal transport between two high-dimensional distributions μ,ν over ℝ^n from a new algorithmic perspective, in which we are given a sample x∼μ and have to find a close y∼ν while running in poly(n) time, where n is the size/dimension of x,y. In other words, we are interested in making the running time bounded in the dimension of the spaces rather than in the total size of the representations of the two distributions. Our main result is a general algorithmic transport result between any product distribution μ and an arbitrary distribution ν of total cost Δ+δ under the ℓ_p^p cost; here Δ is the cost of the so-called Knothe–Rosenblatt transport from μ to ν, while δ is a computational error that goes to zero as the running time of the transport algorithm grows. For this result, we need ν to be “sequentially samplable” with a “bounded average sampling cost”, which is a novel but natural notion of independent interest. In addition, we prove the following.

  • We prove an algorithmic version of the celebrated Talagrand’s inequality for transporting the standard Gaussian distribution Φ_n to an arbitrary ν under the Euclidean-squared cost. When ν is Φ_n conditioned on a set 𝒮 of measure ε, we show how to implement the needed sequential sampler for ν in expected time poly(n/ε), using membership oracle access to 𝒮. Hence, we obtain an algorithmic transport that maps Φ_n to Φ_n|𝒮 in time poly(n/ε) and expected Euclidean-squared distance O(log(1/ε)), which is optimal for a general set 𝒮 of measure ε.

  • As a corollary, we obtain the first computational concentration (Etesami et al., SODA 2020) result for the Gaussian measure under the Euclidean distance with a dimension-independent transportation cost, resolving a question of Etesami et al. More precisely, for any set 𝒮 of Gaussian measure ε, we map most samples of Φ_n to 𝒮 within Euclidean distance O(√(log(1/ε))) in time poly(n/ε).

Keywords and phrases:
Optimal transport, Randomized algorithms, Concentration bounds
Funding:
Omid Etesami: Thanks to Sabanci University and TEIAS for their support during part of this work.
Copyright and License:
[Uncaptioned image] © Salman Beigi, Omid Etesami, Mohammad Mahmoody, and Amir Najafi; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Algorithm design techniques; Theory of computation → Online algorithms; Mathematics of computing → Probabilistic algorithms; Theory of computation → Probabilistic computation
Related Version:
Full Version: https://arxiv.org/abs/2509.21502
Editors:
Ho-Lin Chen, Wing-Kai Hon, and Meng-Tsung Tsai

1 Introduction

Optimal transport (OT) is a fundamental problem that arises in mathematics, science, and engineering, including differential geometry [17], transportation planning [40], economics [21], machine learning [34, 38], image registration [23], and seismic tomography [35]. In machine learning, it has been used in unsupervised learning [46], as a measure of the cost of misclassification [20], to define the fairness of algorithms [11], in Wasserstein GANs [2], for transfer learning [14], and in diffusion generative models [47, 26].

In the optimal transport problem, we would like to transport samples from a source distribution μ to points in the target distribution ν with a minimum expected “transportation cost” 𝖼(x,y) of transporting x∼μ to y∼ν. The study of this problem dates back to the work of Monge [33], who wanted the transportation mapping A(x)=y to be deterministic. Kantorovich [25] reformulated the problem by allowing A(x) to be a randomized (stochastic) mapping. In other words, we now look for a coupling π over the distributions μ,ν with minimum expected transportation cost 𝔼[𝖼(x,y)], using which we define the optimal cost of transporting μ to ν,

𝖳(μ,ν) = min_{π∈𝒞} 𝔼_{(x,y)∼π} 𝖼(x,y),

where 𝒞 is the set of all couplings between μ,ν. OT is closely related to the notion of the “Wasserstein metric,” which generalizes OT using a parameter p ≥ 1 and coincides with OT for p=1.

As a prominent example of the use of OT in mathematics, Talagrand [43] gave a bound on the optimal transport, under the ℓ_2^2 cost, of the n-dimensional Gaussian measure Φ_n to an arbitrary distribution ν based on the KL-divergence of ν from Φ_n. Using this, he derived an essentially optimal concentration of measure result, showing that for any set 𝒮 of measure ε in Φ_n, almost all of the measure Φ_n is within ℓ_2^2 (minimum) distance O(ln(1/ε)) from 𝒮.

Computational OT.

Computational aspects of OT have recently become extremely important on their own [38]. In the most common formulation of “computational OT”, we would like to compute or estimate 𝖳(μ,ν) efficiently. Computing 𝖳(μ,ν) is a key tool, e.g., for applications that use OT to quantify a loss that allows one to know “how far” we are from a target goal [4, 6, 7]. A common approach to computing 𝖳(μ,ν) is to work with empirical sample sets 𝒮∼μ^m, 𝒯∼ν^m, and find the best OT between the empirical distributions U_𝒮,U_𝒯 that are uniform over 𝒮,𝒯 (e.g., see [24, 32] and the references therein). This approximation converges to the quantity 𝖳(μ,ν) in the limit, and the OT between U_𝒮,U_𝒯 can be computed using the Hungarian algorithm for minimum-weight matching [29]. The popular iterative Sinkhorn algorithm solves a regularized version of the OT problem [42], but it also works with empirical sample sets, that is, i.i.d. samples from the distributions. Using empirical samples, one does not rapidly converge to the optimal OT even in some elementary cases. For example, to transport the uniform distribution on the n-dimensional unit cube to itself, the OT between two poly(n)-size empirical versions of the original distribution is Θ(n) in ℓ_2^2 distance even though the actual OT cost is zero.
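To make the empirical approach above concrete, here is a minimal Python sketch (our illustration, not code from the paper; the function name and parameters are ours) that estimates the OT cost between two i.i.d. sample sets under the ℓ_2^2 cost by solving the assignment problem with SciPy's Hungarian-method solver:

```python
# Minimal sketch: empirical OT cost between two sample sets under squared-Euclidean cost.
import numpy as np
from scipy.optimize import linear_sum_assignment

def empirical_ot_cost(xs: np.ndarray, ys: np.ndarray) -> float:
    """xs, ys: arrays of shape (m, n) holding m samples from mu and from nu."""
    # Pairwise squared-Euclidean costs between the two empirical sample sets.
    costs = ((xs[:, None, :] - ys[None, :, :]) ** 2).sum(axis=-1)
    row, col = linear_sum_assignment(costs)   # minimum-weight perfect matching
    return float(costs[row, col].mean())      # average matched transport cost

# The cube-to-itself example from the text: the true OT cost is 0, yet the
# empirical estimate stays on the order of n for poly(n)-size samples.
rng = np.random.default_rng(0)
m, n = 256, 64
print(empirical_ot_cost(rng.random((m, n)), rng.random((m, n))))
```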

Statistical OT.

The above approach of using empirical samples 𝒮∼μ^m, 𝒯∼ν^m can in fact be used to approximate the transport map itself from μ to ν, as in [24, 32]. For example, Brenier’s theorem [10, 28] asserts that under the ℓ_2^2 cost and suitable conditions, a unique Monge mapping achieves optimal transport, and one can aim at approximating this deterministic mapping. This approach is sometimes known as statistical optimal transport [13]. However, it needs exponentially many (in n) samples from n-dimensional distributions to achieve good approximations. Some previous works like [24, 32] improve on this analysis by assuming further smoothness and structural conditions on the distributions, but the curse of dimensionality basically remains intact. More importantly, to the best of our knowledge, no previous work models the algorithmic aspect of searching for the transport map by requiring its algorithm to run in polynomial time over the size of the input x∼μ.

1.1 Our Contributions

In a nutshell, our contributions are (1) formalizing a new theory of algorithmic transport, (2) obtaining initial results on algorithmic transport for the high-dimensional setting, and (3) obtaining applications for algorithmic transport, e.g., to algorithmic concentration of measure. Each of the items above has multiple aspects that are elaborated in the following.

Algorithmic Transport in Polynomial Time.

The common computational OT formulation aims to compute or approximate the optimal transportation cost 𝖳(μ,ν), yet it does not answer the key question of how to compute the transport map algorithmically and efficiently in the size of the given input sample. That is, suppose that we are given a particular sample x∼μ as input, and we would like to map it to y∼ν as follows: (1) the mapping shall be computed in polynomial time over the size of the input |x|=n; (2) we would like to control the expected cost of the transportation. To point out the subtle distinction between our new algorithmic formulation and the traditional computational OT, in this work we use the term algorithmic transport to refer to the task of computing a (randomized) mapping A efficiently based on its input size |x| (e.g., the dimension of x), such that A(x)∼ν whenever x∼μ.

Algorithmic transport, when done optimally, can be used to approximate the OT cost efficiently as well. In particular, when the transportation cost is bounded by a constant, using k=Θ(ε^{−2}) independent samples (x_1,y_1),…,(x_k,y_k)∼(x,A(x))^k, the empirical average (1/k)·Σ_{i∈[k]} 𝖼(x_i,y_i) gives an ε-approximation of the OT cost, with high probability. However, it is not clear how to do the reverse and obtain algorithmic transport from computational OT.
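As a small illustration of this estimation step, the following hedged Python sketch (names and interfaces are ours) averages the cost over k=Θ(ε^{−2}) independent pairs (x,A(x)):

```python
# Minimal sketch: Monte Carlo estimation of the expected transport cost of an
# algorithmic transport A, assuming the cost is bounded by a constant.
def estimate_transport_cost(sample_mu, transport_A, cost, eps: float) -> float:
    k = int(1.0 / eps ** 2) + 1      # k = Theta(1/eps^2) samples
    total = 0.0
    for _ in range(k):
        x = sample_mu()              # fresh sample x ~ mu
        y = transport_A(x)           # y ~ nu, coupled with x by the algorithm A
        total += cost(x, y)
    return total / k
```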

When μ,ν are one-dimensional, for natural (convex) costs such as 𝖼(x,y)=|x−y|^p, p ≥ 1, one can find simple algorithms that simply use a “monotone” transportation plan [45]. Furthermore, when the distributions μ,ν have small domains of size k, one can use algorithms based on min-cost flows to find a full description of the OT from μ to ν in poly(k) time [37]. However, our focus is on the high-dimensional setting and finding poly(n)-time computable mappings between distributions of dimension n with super-polynomial support. We ask:

If μ,ν are n-dimensional distributions, how can we find a poly(n)-time computable transport map from x∼μ to y∼ν of a small/optimal cost?

Formalizing and answering the question above in various contexts is the focus of our work. Our studies also bear similarities to the line of work on approximating the total variation distance [5, 16] as it coincides with OT under the Hamming distance.

Transport in High-Dimensional Setting.

In this work, we approach the main question above through a study of so-called causal transports [30, 3] in high dimension, in which the transporting algorithm A produces y=(y_1,…,y_n) from x=(x_1,…,x_n) in an online manner: the algorithm A shall output y_i based on x_{[i]}=(x_1,…,x_i) and before receiving x_{i+1}. Hence we also refer to those transports simply as online. The so-called Knothe–Rosenblatt transport (KR transport for short) [27, 39] is an important online transport with two properties: (1) its reverse is also online, and (2) it follows a “greedy” approach in each round by using a monotone mapping in dimension one. Our motivations for studying online transports are twofold: (1) despite being a restriction on how the transport is done, the “online lens” guides us towards algorithm development; (2) in our eyes, the information-theoretic study of online transports is interesting in its own right. In particular, in Section 2.1, we prove that the KR transport is optimal among all online transports when the source distribution is a product.

Main Result: Algorithmic Transport from Product Distributions.

Our main result (Theorem 28) is to design a poly(n)-time online algorithm that transports a generic product distribution μ=μ_1×⋯×μ_n to any n-dimensional distribution ν, assuming that (1) the transportation cost satisfies 𝖼(x,y)=Σ_i 𝖼_i(x_i,y_i), where x=(x_1,…,x_n), y=(y_1,…,y_n), and (2) the transporting algorithm A has oracle access to proper samplers for both μ,ν.

The algorithm is actually very simple: given x, having determined y_1,…,y_{i−1}, to determine y_i it samples k−1 further samples besides x_i according to μ_i. Similarly, it samples k samples according to the conditional distribution of the ith coordinate of ν conditioned on the values y_1,…,y_{i−1}. Then it optimally matches the two sets of k samples each. The value of y_i is the match of x_i in this matching.

The transportation cost of A turns out to be Δ+δ, where Δ is the optimal cost of online transports from μ to ν (which, as we will prove, will coincide with the KR transport [27, 39] in our settings of interest), and δ is a term that could be made smaller by choosing k larger. We show that the reverse transport from ν back to the product μ can be done algorithmically as well. This will be useful for deriving further algorithmic transports through composition.

Sequential Samplers.

When it comes to the samplability conditions needed in our main results above, we merely require that we can sample from μi efficiently. However, for the non-product distribution ν, the samplability condition is stronger and we require that one can sample from νi conditioned on any previously sampled prefix y[i1]. We refer to such samplers as sequential samplers. A key quantity of interest is the complexity of iterative sampling of the coordinates y1,,yn sequentially (conditioned on previous ones) till we obtain a full sample y. We would like to have samplers where the average complexity of this sequential generation is bounded. As it turns out, we can indeed bound such costs in our special cases of interest.

From a real-world application point of view, this notion of efficient sequential sampler is very natural in some generative models. This is indeed the case for transformer-based language models that autoregressively generate their tokens one by one, each conditioned on the previously sampled sequence of tokens [44, 18]. That is, the joint distributions produced by these generative models have sequential samplers of low expected cost, as they indeed generate their sequence of symbols in a reasonable time and in an online fashion.

Algorithmic Transport for the Standard Gaussian Distribution.

One of the fundamental results in OT is Talagrand’s transportation inequality for the n-dimensional Gaussian distribution Φ_n [43]. It is proved that for every distribution ν, 𝖳(Φ_n,ν) ≤ 2·𝖪𝖫(ν,Φ_n), in which the cost is measured in ℓ_2^2, i.e., 𝖼(x,y)=Σ_{i∈[n]}|x_i−y_i|^2, and 𝖪𝖫(·,·) denotes the Kullback–Leibler divergence. In this work, we lift this classical result to the algorithmic setting. Note that, as mentioned in [43], this bound is optimal in general, e.g., when ν is a shifted Φ_n, in which case our results converge to this optimal bound as well. In particular, we derive this result from our main result by proving the following two complementary claims:

  • Information theoretic: We observe that Talagrand’s bound of 2·𝖪𝖫(ν,Φ_n) upper bounds not only the cost of the best “offline” transport from the standard Gaussian Φ_n, but also the cost of the best online transport of Φ_n to ν. Namely, we show that Δ ≤ 2·𝖪𝖫(ν,Φ_n), where Δ is the optimal online transportation cost as defined above.

  • Computational: We use results from [19] to show that the Gaussian distribution in one dimension has a small transportation cost to its empirical samples on average.

Transporting Standard Gaussian to Conditional Gaussian.

We show that in a natural setting, where ν is the Gaussian distribution conditioned on an event 𝒮 of Gaussian measure ε, such sequential samplers can be efficiently simulated using oracle access to membership tests in 𝒮. In other words, we find an oracle-aided transportation algorithm that simultaneously works for all such distributions ν=Φ_n|𝒮. Note that such distributions have 2·𝖪𝖫(ν,Φ_n) ≤ 2 ln(1/ε). We obtain algorithmic transports running in expected time poly(n/ε) that achieve a transport cost converging to Talagrand’s upper bound.

Dimension-Independent Computational Concentration for Gaussian Spaces.

One of the applications of OT is to obtain concentration of measure (CoM) inequalities [22]: One shows that any set 𝒮 of “sufficiently large” measure in a concentrated metric probability space (μ,d), where μ is a distribution and d is a distance metric, expands to cover most of the measure in μ, when we consider neighbors of 𝒮 within a certain distance. Recently, a computational (algorithmic) variant of the CoM phenomenon has been introduced [31, 15], in which one aims to show that the reverse mapping can be computed efficiently from almost all of the points in μ back to 𝒮 by moving the points within a bounded distance. Namely, given a typical sampled point xμ, we aim to algorithmically find a “close neighbor” y𝒮 of bounded distance d(x,y). The work of [15] obtained such results for various settings, but their work left open obtaining computational CoM with dimension-independent (optimal) distance for the basic and natural space of Gaussian distributions under the 2 distance. Using our oracle set-transportation result for Gaussian spaces mentioned above, we resolve this open question and obtain such an optimal and dimension-free bound (see Corollary 36).

Reductions for (Deriving New) Algorithmic Transport.

Finally, considering the role of reductions in resolving algorithmic tasks, we also develop the (right) notion of algorithmic reductions for the goal of relating algorithms for (optimal) transport across different spaces. In particular, suppose μ1,μ2 are distributions and 𝖼1,𝖼2 are two different transportation costs. In the full version we state conditions under which, we can automatically transform an algorithmic transport result from μ1 to ν (under the cost 𝖼1) to a similar result that transports μ2 to ν (under the cost 𝖼2) for specific distributions μ1,μ2 and arbitrary distribution ν. We then show how to realize such reductions when we transport uniform distributions over the unit cube and the unit sphere (to an arbitrary distribution) by reducing them to the case of transporting Gaussian distributions. Consequently, we obtain algorithmic transports from these distributions as well.

2 Basic Concepts

In this section, we define the key notions studied in this paper and prove their basic properties.

Notation.

We let [n]={1,…,n}. We denote the source (initial) distribution as μ. When μ is distributed over ℝ^n, we say that μ has dimension n, and by μ_i we denote the distribution of its ith coordinate. We usually denote x∼μ, where x=(x_1,…,x_n) and x_i∼μ_i. By μ=μ_1×⋯×μ_n we mean that μ is a product distribution. We use similar notation for the target distribution ν. By y←A(x) we denote the process of running a probabilistic algorithm A on input x to obtain output y. When μ is a distribution, A^μ denotes an oracle algorithm A that has access to fresh samples from μ, and when 𝒮 is a set, A^𝒮 denotes a similar situation where the oracle answers membership queries in 𝒮. For a vector v=(v_1,…,v_n), by v_{[i]} we denote the prefix vector (v_1,…,v_i). When a distribution μ of dimension n with marginals μ_1,…,μ_n is clear from the context, by μ_i|x_{[i−1]} we denote the distribution of μ_i conditioned on having sampled x_j from μ_j for all j<i. For further clarity on the underlying joint distribution, we might sometimes write μ_i|^μ x_{[i−1]} instead. By μ(𝒮) or Pr_μ[𝒮] we denote the probability of the event 𝒮 under the distribution μ. Whenever it is clear from the context, for an outcome x, we use μ(x) to denote either the probability of the outcome x or the density of μ at x, depending on whether μ is discrete or continuous. By Supp(μ) we denote the support set of μ, which for both the discrete and continuous cases can be defined as {x ∣ μ(x)>0}. When Supp(μ), Supp(ν) ⊆ 𝒮, their Kullback–Leibler (KL) divergence is denoted as 𝖪𝖫(ν,μ) = Σ_{x∈𝒮} ν(x)·log(ν(x)/μ(x)) with the natural logarithm. In the preceding definition and generally throughout this paper, we use the summation notation corresponding to discrete distributions; the corresponding results for continuous distributions replace sums with proper integrals. For p ≥ 1, the ℓ_p-norm and ℓ_p-distance over ℝ^n are defined as ℓ_p(x)=‖x‖_p=(Σ_{i∈[n]}|x_i|^p)^{1/p} and ℓ_p(x,y)=ℓ_p(x−y).

Transportation Costs.

In the following, all transportation costs, usually denoted as 𝖼, are functions 𝖼:ℝ^{2n}→ℝ with non-negative outputs that model the cost of transporting x∼μ to y∼ν. We always assume 𝖼 to be lower semi-continuous, but we do not assume 𝖼 to be symmetric or to satisfy the triangle inequality; we state these conditions whenever they are needed.

Definition 1 (Coupling and Optimal Transportation Cost).

We say that a distribution π over pairs with marginals π_1,π_2 is a coupling of μ,ν if π_1=μ, π_2=ν. If for every x∈Supp(μ) there is a unique y such that (x,y)∈Supp(π), then we call this a deterministic (Monge) transport from μ to ν. For a cost 𝖼, the transport cost of a coupling π of μ,ν is defined as

𝖳_𝖼(π) = 𝔼_{(x,y)∼π} 𝖼(x,y).

We refer to 𝖳_{𝖼^p}^{1/p}(π) as the (Wasserstein) p-cost of π under 𝖼. If 𝒞(μ,ν) denotes the set of all couplings between μ,ν, the (Kantorovich) optimal transportation cost for (μ,ν) is defined as

𝖳_𝖼(μ,ν) = inf_{π∈𝒞(μ,ν)} 𝖳_𝖼(π).

The infimum in Definition 1 defining the optimal transportation cost turns out to be a minimum, as 𝖼 is lower semi-continuous [1].

Definition 2 (Algorithmic Transport).

For distributions μ,ν, an algorithm A is a transport from distribution μ to distribution ν if A is a (probabilistic) algorithm such that A(x)∼ν whenever x∼μ. By π_A we denote the coupling created by A. For a transportation cost 𝖼, the transportation cost of A is defined as 𝖳_𝖼(A)=𝖳_𝖼(π_A).

Computational Model.

In Definition 2, we need to either work with discrete distributions with samples of finite length, or when the distributions are continuous we need to work with the generalization of algorithms working with real numbers as formalized in [8, 9].

We now define an algorithmic variant of so-called causal transport [30] in discrete time [3]. We call it “online” to emphasize the algorithmic aspect, à la online learning [41].

Definition 3 (Online (Algorithmic) Transport).

For distributions μ,ν of dimension n, we call a (probabilistic and perhaps computationally unbounded) algorithm A an online transport algorithm from μ to ν if it forms a transport from μ to ν while making its decisions in an online way. Namely, A has an internal iterating process (for simplicity also denoted by A) that reads (x_1,…,x_n)∼μ coordinate by coordinate while holding an internal state, which is initially the empty state s_0. In the ith iteration, we have (s_i,y_i)←A(s_{i−1},x_i), and at the end we output (y_1,…,y_n)∼ν. We also let 𝒞^{OnT}(μ,ν) be the set of all couplings that can be obtained by online algorithms, and for a transport cost 𝖼 we obtain the optimal online transportation cost as

𝖳_𝖼^{OnT}(μ,ν) = inf_{π∈𝒞^{OnT}(μ,ν)} 𝖳_𝖼(π).

To contrast with and emphasize that a transport is not necessarily online, we refer to (potentially) non-online transports as offline transports.

We now define a class of couplings that is closely related to online transport.

Definition 4 (Online Coupling).

Suppose π is a coupling between n-dimensional distributions μ,ν, and π_i is the corresponding marginal coupling between μ_i,ν_i. We call π an online coupling if for all z=(x_{[i−1]},y_{[i−1]})∈Supp(π_{[i−1]}), π_i|z is a coupling of μ_i|x_{[i−1]} (according to μ) and ν_i|y_{[i−1]} (according to ν). If 𝒞^{OnC}(μ,ν) denotes the set of all online couplings between μ,ν, for a transport cost 𝖼 we obtain the optimal online coupling cost between μ,ν as

𝖳_𝖼^{OnC}(μ,ν) = inf_{π∈𝒞^{OnC}(μ,ν)} 𝖳_𝖼(π).

We now show how to characterize online couplings using online transports.

Proposition 5.

A coupling π between μ,ν is online if and only if it can be obtained through both an online transport from μ to ν and an online transport from ν to μ.

Definition 6.

We call the cost function 𝖼 over ℝ^n×ℝ^n linear over 𝖼_1,…,𝖼_n if 𝖼(x,y)=𝖼_1(x_1,y_1)+⋯+𝖼_n(x_n,y_n) for all x=(x_1,…,x_n), y=(y_1,…,y_n).

Greedy Coupling.

One might wonder how we can compute/approximate 𝖳𝖼OnC(μ,ν). One approach is to use greedy methods, by trying to use an optimal coupling in each round. This is formalized in the following definition in settings with dedicated costs for each round. We will then discuss when this method succeeds in Theorem 10. More generally, we define locally-optimal couplings, even when they are not online.

Definition 7 (Locally Optimal and Greedy Couplings).

Suppose the cost function 𝖼 over ℝ^{2n} is linear over 𝖼_1,…,𝖼_n. A coupling π between μ,ν is locally optimal if for every z_{[i−1]}∈Supp(π_{[i−1]}), it holds that π_i|z_{[i−1]} is an OT between μ_i|z_{[i−1]} and ν_i|z_{[i−1]}; i.e.,

𝖳_{𝖼_i}(π_i|z_{[i−1]}) = 𝖳_{𝖼_i}(μ_i|z_{[i−1]}, ν_i|z_{[i−1]}).

When π is an online coupling as well, the above condition simplifies to 𝖳_{𝖼_i}(π_i|z_{[i−1]}) = 𝖳_{𝖼_i}(μ_i|x_{[i−1]}, ν_i|y_{[i−1]}), in which case we call π greedy. For 𝒞^G(μ,ν) denoting the set of all greedy couplings from μ to ν, we define

𝖳_𝖼^G(μ,ν) = sup_{π∈𝒞^G(μ,ν)} 𝖳_𝖼(π).
 Remark 8 (Greedy vs. Knothe-Rosenblatt Transports).

Greedy couplings are closely related to Knothe–Rosenblatt (KR, for short) transports [27, 39]. Specifically, for a greedy coupling π, when the cost functions 𝖼_i are convex, for any z_{[i−1]}∼π_{[i−1]} the locally optimal coupling π_i|z_{[i−1]} can be obtained by simply using the unique monotone mapping [12]. Hence, the KR coupling is a special case of greedy couplings and covers many interesting cases in this class. For example, when the cost function 𝖼 is ℓ_p^p for p ≥ 1, 𝖳_𝖼^G(μ,ν) equals the cost of the KR coupling between μ and ν. However, due to the generality of greedy couplings (e.g., for non-monotone costs), we define and use greedy transports.
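As an illustration of the monotone mapping mentioned above, the following Python sketch (ours, not from the paper) shows the one-dimensional quantile map for a Gaussian source, together with the equivalent sorted matching of two equal-size empirical samples, which is optimal for any convex cost |x−y|^p:

```python
# Minimal sketch: the 1-D monotone (quantile) coupling behind the KR transport.
import numpy as np
from scipy import stats

def monotone_map_from_gaussian(target_ppf, x: float) -> float:
    """Monotone transport of x ~ N(0,1) to a 1-D target given its quantile function."""
    return target_ppf(stats.norm.cdf(x))     # y = F_nu^{-1}(F_mu(x))

def monotone_match(xs: np.ndarray, ys: np.ndarray) -> np.ndarray:
    """Optimal matching of two equal-size 1-D samples under any convex cost |x-y|^p."""
    order_x, order_y = np.argsort(xs), np.argsort(ys)
    match = np.empty_like(order_x)
    match[order_x] = order_y     # i-th smallest x is matched to i-th smallest y
    return match                 # match[i] = index of the y matched with xs[i]
```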

Lambda and Delta Cost Functions.

We now define two functions that play key roles in our analysis of the cost of online transports. The first (Lambda) function depends on a coupling, while the second one (Delta) depends on the two distributions that are coupled. As we prove later in Proposition 12, Lambda lower bounds the cost of any coupling, while Delta captures the optimal online transportation cost from a product distribution to an arbitrary one.

Definition 9 (The Lambda and Delta Functions).

For a coupling π between distributions μ,ν of dimension n, and a cost function 𝖼 that is linear over 𝖼_1,…,𝖼_n, we define the Lambda function as

Λ_𝖼(π) = 𝔼_{z∼π} Σ_{i∈[n]} 𝖳_{𝖼_i}(μ_i|z_{[i−1]}, ν_i|z_{[i−1]}).

We also define the Delta function between distributions μ,ν of dimension n as

Δ_𝖼(μ,ν) = 𝔼_{y∼ν} Σ_{i∈[n]} 𝖳_{𝖼_i}(μ_i, ν_i|y_{[i−1]}).

Note that the coupling π in Definition 9 does not have to be online. Furthermore, the definition of Λ(·) does depend on the order of the coordinates of the n-dimensional distributions.

2.1 Online Coupling and Transport from Products

We end this section by stating a theorem showing that, whenever μ is a product, any online coupling that is “locally optimal”, in the sense that given history z=(x_{[i−1]},y_{[i−1]}) it uses (an arbitrary) optimal transport between μ_i and ν_i|y_{[i−1]}, yields an optimal online coupling between μ,ν as well as an optimal online transport from μ to ν. This theorem does not assume convexity of the costs. As stated in Remark 8, for convex transportation costs, greedy algorithms can be instantiated using the KR transport.

Theorem 10 (Optimal Online Coupling and Transport from Products).

If μ=μ_1×⋯×μ_n is a product and the cost function 𝖼 is linear over 𝖼_1,…,𝖼_n, then

𝖳_𝖼^{OnT}(μ,ν) = 𝖳_𝖼^{OnC}(μ,ν) = 𝖳_𝖼^G(μ,ν) = Δ_𝖼(μ,ν).

Before proving Theorem 10 we prove some basic tools that are used in the proof. The first lemma that we state can be obtained from a simple application of the linearity of expectation.

Lemma 11 (Cost Splitting).

Let π be a coupling between distributions μ,ν of dimension n, and let π_i be the corresponding coupling between the marginals μ_i,ν_i. Suppose 𝖼 is linear over 𝖼_1,…,𝖼_n, and ω is an n-dimensional distribution that is arbitrarily correlated with π. Then,

𝖳_𝖼(π) = Σ_{i∈[n]} 𝖳_{𝖼_i}(π_i) = 𝔼_{z∼ω} Σ_{i∈[n]} 𝖳_{𝖼_i}(π_i | ω_{[i−1]}=z_{[i−1]}).

In particular, we can choose ω=ν, ω=μ, or ω=π as special cases.

We now prove some basic properties of these two functions, showing how to use them and how to characterize them in some special settings. In summary, the Lambda function lower bounds the transportation cost of every coupling, while Delta plays a key role in characterizing the transportation cost for product distributions.

Proposition 12 (Properties of Lambda and Delta Functions).

Suppose π couples μ,ν and 𝖼 is linear. The Lambda function satisfies the following properties.

  1. 1.

    Lower Bound: For all π, Λ_𝖼(π) ≤ 𝖳_𝖼(π), and equality holds iff π is locally optimal.

  2. 2.

    Online Transports from Products: If π is an online transport and μ=μ_1×⋯×μ_n, then

    Λ_𝖼(π) ≥ Δ_𝖼(μ,ν).
  3. 3.

    Online Coupling for Products: If π is an online coupling and μ is a product, then

    Λ_𝖼(π) = Δ_𝖼(μ,ν).

Proof of Proposition 12.

We prove the claims in order.

  1. 1.

    By letting ω=π in Lemma 11, we get

    𝖳_𝖼(π) = 𝔼_{z∼π} Σ_{i∈[n]} 𝖳_{𝖼_i}(π_i|z_{[i−1]}) ≥ 𝔼_{z∼π} Σ_{i∈[n]} 𝖳_{𝖼_i}(μ_i|z_{[i−1]}, ν_i|z_{[i−1]}) = Λ_𝖼(π),

    where the inequality follows from the fact that 𝖳_{𝖼_i}(·,·) minimizes the transportation cost.

  2. 2.

    We first claim that, in this case, for every z_{[i−1]}=(x_{[i−1]},y_{[i−1]})∼π_{[i−1]}, we have μ_i|z_{[i−1]} = μ_i. This is true because (1) (μ_i|x_{[i−1]},y_{[i−1]}) = (μ_i|x_{[i−1]}), by the fact that π is an online transport, and (2) (μ_i|x_{[i−1]}) = μ_i, by the fact that μ is a product. Therefore,

    Λ_𝖼(π) = 𝔼_{z∼π} Σ_{i∈[n]} 𝖳_{𝖼_i}(μ_i|z_{[i−1]}, ν_i|z_{[i−1]}) = 𝔼_{z=(x,y)∼π} Σ_{i∈[n]} 𝖳_{𝖼_i}(μ_i, ν_i|z_{[i−1]}).

    We now analyze the right-hand side: we first use Lemma 11 (with ω=π) and then sample x,y in reverse order,

    𝔼_{(x,y)∼π} Σ_{i∈[n]} 𝖳_{𝖼_i}(μ_i, ν_i|z_{[i−1]}) = Σ_{i∈[n]} 𝔼_{y_{[i−1]}∼ν_{[i−1]}} 𝔼_{x_{[i−1]}∼π_{[i−1]}|y_{[i−1]}} 𝖳_{𝖼_i}(μ_i, ν_i|y_{[i−1]},x_{[i−1]}),

    where for each i∈[n], we sample (x_{[i−1]},y_{[i−1]})∼π_{[i−1]} by first sampling y_{[i−1]} and then sampling x_{[i−1]} conditioned on y_{[i−1]}. Now, for every y_{[i−1]}∼ν_{[i−1]}, we claim that

    𝔼_{x_{[i−1]}∼π_{[i−1]}|y_{[i−1]}} 𝖳_{𝖼_i}(μ_i, ν_i|y_{[i−1]},x_{[i−1]}) ≥ 𝖳_{𝖼_i}(μ_i, ν_i|y_{[i−1]}).

    This claim follows from Part 2 of Proposition 19 and the fact that the average of ν_i|y_{[i−1]},x_{[i−1]} over the choice of x_{[i−1]}∼π_{[i−1]}|y_{[i−1]} is ν_i|y_{[i−1]}.

  3. 3.

    When the coupling π is further an online coupling, the equality holds, because (ν_i|y_{[i−1]},x_{[i−1]}) = (ν_i|y_{[i−1]}), and the last inequality above becomes an equality.

Proof of Theorem 10.

It is enough to prove the following two claims.

  1. 1.

    𝖳_𝖼^G(μ,ν) ≤ Δ_𝖼(μ,ν).

  2. 2.

    𝖳_𝖼^{OnT}(μ,ν) ≥ Δ_𝖼(μ,ν).

The reason is that we already know 𝖳_𝖼^{OnT}(μ,ν) ≤ 𝖳_𝖼^G(μ,ν) (as being greedy is a limitation), and so proving the two claims above would imply all the equalities of the theorem statement.

To prove the first claim, we observe that the cost Δ_𝖼(μ,ν) can be achieved using (any) greedy algorithm that (by definition) optimally couples μ_i = μ_i|x_{[i−1]} with ν_i|y_{[i−1]} in the ith step. In fact, all greedy coupling algorithms have the same cost Δ_𝖼(μ,ν) when one of the distributions is a product.

To prove the second claim, let π be an online transport with cost 𝖳_𝖼^{OnT}(μ,ν). Our claim follows from Parts 1 and 2 of Proposition 12, due to π being online and μ being a product:

𝖳_𝖼^{OnT}(μ,ν) = 𝖳_𝖼(π) ≥ Λ_𝖼(π) ≥ Δ_𝖼(μ,ν).

3 Basic Tools

3.1 Composition and Triangle Inequalities

Multi-distribution Coupling and Composition.

We now generalize the notion of coupling to more than two distributions and use it to define composition of (online) couplings.

Definition 13 (Multi-distribution Coupling).

A coupling π of μ1,,μn is a distribution over n-vectors such that the marginal of the ith coordinate is distributed as μi.

Definition 14 (Composition of Couplings).

For a coupling π_{1,2} of μ_1,μ_2 and a coupling π_{2,3} of μ_2,μ_3, we define the composition π_{1,3}=π_{2,3}∘π_{1,2} of π_{1,2} and π_{2,3} as the marginal of the first and third coordinates of the (unique) coupling π_{1,2,3} of μ_1,μ_2,μ_3 such that:

  1. 1.

    For 1 ≤ i < j ≤ 3, the marginal distribution of the (i,j) coordinates in π_{1,2,3} is distributed as π_{i,j}.

  2. 2.

    In the coupling π_{1,2,3}, the first and third coordinates (distributed as μ_1,μ_3) are independent conditioned on the second coordinate x_2∼μ_2.

We now use Wasserstein p-cost, to state the following well-known triangle inequality.

Lemma 15 (Triangle Inequality for Wasserstein p-Costs).

Suppose a cost function 𝖼 satisfies the triangle inequality (but not necessarily symmetry) and p ≥ 1. Then, for every coupling π over μ_1,μ_2,μ_3 with marginal couplings π_{i,j}, i<j, over μ_i,μ_j, we have the following:

𝖳_{𝖼^p}^{1/p}(π_{1,3}) ≤ 𝖳_{𝖼^p}^{1/p}(π_{1,2}) + 𝖳_{𝖼^p}^{1/p}(π_{2,3}).

The following proposition can be obtained from the triangle inequality of Lemma 15.

Proposition 16 (Triangle Inequality for Wasserstein p-Costs in Multi-Round Settings).

Let μ be a distribution over ℝ^n, and for every i∈[n] and x_{[i−1]}∈Supp(μ_{[i−1]}) let J(x_{[i−1]}) be a distribution over triples of distributions over ℝ. Suppose 𝖼 satisfies the triangle inequality and 𝖼^p is linear over 𝖼_1,…,𝖼_n for p ≥ 1. Then, the following holds:

(𝔼_{x∼μ} Σ_{i∈[n]} 𝔼_{(ν_1,ν_2,ν_3)∼J(x_{[i−1]})} 𝖳_{𝖼_i}(ν_1,ν_3))^{1/p} ≤ Σ_{k∈[2]} (𝔼_{x∼μ} Σ_{i∈[n]} 𝔼_{(ν_1,ν_2,ν_3)∼J(x_{[i−1]})} 𝖳_{𝖼_i}(ν_k,ν_{k+1}))^{1/p}.

The following can be obtained from the definition of online transport and Lemma 15.

Lemma 17 (Properties of the Composition of Online Transports).

Consider an online transport A1,2 from μ1 to μ2 with coupling π1,2 and an online transport A2,3 from μ2 to μ3 with coupling π2,3. Let π1,3=π2,3π1,2 be the composed coupling. Then,

  1. 1.

    The coupling π1,3 is an online coupling.

  2. 2.

    There is an algorithm A1,3 that transports μ1 to μ3 as the coupling π1,3, whose complexity is bounded by running A1,2 followed by running A2,3.

  3. 3.

    If the cost function 𝖼 satisfies the triangle inequality, then for all p ≥ 1 the following holds:

    𝖳_{𝖼^p}^{1/p}(A_{1,3}) ≤ 𝖳_{𝖼^p}^{1/p}(A_{1,2}) + 𝖳_{𝖼^p}^{1/p}(A_{2,3}).

The first item in Lemma 17 and Proposition 5 together show that the composition of two online couplings is also an online coupling.

3.2 Transport Through Intermediate Distributions

In this section, we describe a method of transporting μ to ν (perhaps in an online and iterative way) through optimal transports between intermediate distributions in one dimension. We start by defining the notion of an average of distributions and stating a general way of transporting through averages.

Definition 18 (Average Distribution).

Suppose M is a distribution over distributions. We define the average of M, denoted as 𝔼[M]=𝔼_{μ′∼M}[μ′]=μ, to be the distribution μ of the random variable x that is sampled by first sampling μ′∼M and then x∼μ′. Namely, μ is the distribution such that μ(𝒮)=𝔼_{μ′∼M} μ′(𝒮) for all events 𝒮 defined over ⋃_{μ′∈Supp(M)} Supp(μ′).

Proposition 19 (Transport to Averages).

Suppose M is a distribution over distributions with average μ.

  1. 1.

    Suppose π is the following joint distribution: we first sample μ′∼M, then couple μ′ with ν as π_{μ′}, and then output a sample (x,y)∼π_{μ′}. Then, π is a coupling between μ,ν.

  2. 2.

    𝔼_{μ′∼M} 𝖳_𝖼(μ′,ν) ≥ 𝖳_𝖼(μ,ν).

Proof.

Part 1 holds because the marginals of x and y are μ and ν, respectively. Part 2 follows from Part 1 by picking each π_{μ′} to be an optimal transport between μ′,ν.

The following definition states a way of finding a transport from μ to ν by working with alternative (intermediate) distributions that approximate μ,ν.

Definition 20 (Transport Through Intermediate Distributions).

Let μ,ν be distributions, 𝖼 be a cost function, and J be a distribution over pairs of distributions. We say that algorithm A couples μ,ν through (the intermediate distribution) J, if the following conditions hold.

  1. 1.

    J produces marginals with averages μ,ν. I.e., μ=𝔼_{(μ′,ν′)∼J} μ′ and ν=𝔼_{(μ′,ν′)∼J} ν′.

  2. 2.

    Algorithm A first samples (μ′,ν′)∼J, then finds some optimal transport π′ between μ′,ν′ according to 𝖼, and finally outputs (x,y)∼π′.

Definition 21 (Conditioning and Composing Transports with Distributions).

Suppose μ,μ′,ν are distributions and π is a transport from μ to ν. If Supp(μ′)⊆Supp(μ), then consider the following sampling process.

  1. 1.

    Sample x∼μ′.

  2. 2.

    Sample y from the ν-coordinate of π, conditioned on its μ-coordinate being x.

Then, the notation π|μ′ denotes the joint distribution of (x,y), and π∘μ′ denotes the distribution of y. Additionally, if M is a distribution over distributions, then N=π∘M denotes the distribution over distributions sampled by outputting ν′=π∘μ′ for μ′∼M.

Notation.

Let U_{k,μ} be the distribution over distributions obtained by first sampling 𝒳∼μ^k, and then outputting μ′=U_𝒳. A simple observation is that 𝔼[U_{k,μ}]=μ for all k.

Proposition 22.

If M is a distribution over distributions with average distribution μ, and if π is any transport from μ to ν, then the following holds.

  1. 1.

    N=π∘M is a distribution over distributions with average ν.

  2. 2.

    For a cost 𝖼, 𝖳_𝖼(π)=𝔼_{μ′∼M} 𝖳_𝖼(π|μ′), in which π|μ′ is defined in Definition 21.

  3. 3.

    U_{k,ν}=π∘U_{k,μ}, and if μ is samplable in time t_μ and the coupling π is computable in time t_π, then one can sample the set 𝒴, |𝒴|=k, that describes U_𝒴∼U_{k,ν} in time k·(t_μ+t_π).

Proof.

For Part 1, observe that if we sample x∼μ′ for μ′∼M, by definition we get x∼μ, which means y∼π∘M will be sampled as y∼ν. For Part 2, 𝔼_{μ′∼M} 𝖳_𝖼(π|μ′) also computes the cost of the same coupling π, by breaking it into marginal costs based on how x∼μ is sampled. For Part 3, let (x,y)∼π. We first sample (x_1,…,x_k)=𝒳∼μ^k and then let 𝒴=(y_1,…,y_k) for y_i∼y|x=x_i. The x_i’s are independently sampled according to μ, and because π transports μ to ν, the y_i’s are also independently sampled according to ν.

Lemma 23 (Multi-Round Algorithmic Coupling Through Intermediate Distributions).

Suppose the cost function 𝖼 satisfies the triangle inequality, and 𝖼^p is linear over 𝖼_1,…,𝖼_n for p ≥ 1. Let π, with marginals π_1,…,π_n, be a transport from μ with marginals μ_1,…,μ_n to ν with marginals ν_1,…,ν_n. For round i∈[n] and previously sampled z_{[i−1]}=(x_{[i−1]},y_{[i−1]})∈Supp(π_{[i−1]}), suppose J(z_{[i−1]}) is a distribution over pairs of distributions defined based on z_{[i−1]}, and σ_{z_{[i−1]}} is an optimal transport from μ_i|z_{[i−1]} to ν_i|z_{[i−1]} under 𝖼_i. Suppose π can also be obtained using the following algorithm A in n rounds: in round i∈[n] and for previously sampled z_{[i−1]}=(x_{[i−1]},y_{[i−1]})∈Supp(π_{[i−1]}), A couples μ_i|z_{[i−1]} and ν_i|z_{[i−1]} through the intermediate distribution J(z_{[i−1]}) as defined in Definition 20, using the cost 𝖼_i. Then,

𝖳_{𝖼^p}^{1/p}(π) ≤ (𝔼_{z∼π} Σ_{i∈[n]} 𝔼_{(μ_i′,ν_i′)∼J(z_{[i−1]})} 𝖳_{𝖼_i}(μ_i′, σ_{z_{[i−1]}}^{−1}∘ν_i′))^{1/p} + Λ_{𝖼^p}^{1/p}(π),

where σ^{−1} refers to the inverse coupling that swaps the order of the marginals.

Proof of Lemma 23.

The proof uses the triangle inequality for Wasserstein p-costs for the multi-round setting (Proposition 16).

For each i∈[n] and z_{[i−1]}∈Supp(π_{[i−1]}), consider the following sampling process I(z_{[i−1]}) that extends J(z_{[i−1]}) by outputting one more coordinate as well.

  1. 1.

    Sample (μ′,ν′)∼J(z_{[i−1]}).

  2. 2.

    Let μ″=σ_{z_{[i−1]}}^{−1}∘ν′.

  3. 3.

    Output (μ′,μ″,ν′) as a sample from I(z_{[i−1]}).

It holds that 𝖳_{𝖼^p}^{1/p}(A) = (𝔼_{z∼π} Σ_{i∈[n]} 𝔼_{(μ′,μ″,ν′)∼I(z_{[i−1]})} 𝖳_{𝖼_i}(μ′,ν′))^{1/p}, which is the left side of the inequality of Proposition 16, and the right side is:

(𝔼_{z∼π} Σ_{i∈[n]} 𝔼_{(μ′,μ″,ν′)∼I(z_{[i−1]})} 𝖳_{𝖼_i}(μ′,μ″))^{1/p} + (𝔼_{z∼π} Σ_{i∈[n]} 𝔼_{(μ′,μ″,ν′)∼I(z_{[i−1]})} 𝖳_{𝖼_i}(μ″,ν′))^{1/p}.

The first term is exactly the first term on the right hand side of the inequality of the lemma. Therefore, all we have to do is to prove that

𝔼_{z∼π} Σ_{i∈[n]} 𝔼_{(μ′,μ″,ν′)∼I(z_{[i−1]})} 𝖳_{𝖼_i}(μ″,ν′) ≤ 𝔼_{z∼π} Σ_{i∈[n]} 𝖳_{𝖼_i}(σ_{z_{[i−1]}}).

In fact, we prove this statement for every choice of z and i, so fixing z,i and dropping them from the notation we prove the claim:

𝔼_{(μ″,ν′)∼I} 𝖳_{𝖼_i}(μ″,ν′) ≤ 𝔼_{(μ″,ν′)∼I} 𝖳_{𝖼_i}((σ^{−1}|ν′)^{−1}) = 𝖳_{𝖼_i}(σ),

where the middle term is added for the proof.

We now prove both the inequality and the equality above through the steps below.

  • Equality: Since the average of ν′∼J is ν_i and σ^{−1} is a transport from ν_i to μ_i, if we define the reversed cost 𝖼̃_i(y_i,x_i)=𝖼_i(x_i,y_i), then by Part 2 of Proposition 22 we have

    𝔼_{(μ″,ν′)∼I} 𝖳_{𝖼_i}((σ^{−1}|ν′)^{−1}) = 𝔼_{(μ″,ν′)∼I} 𝖳_{𝖼̃_i}(σ^{−1}|ν′) = 𝖳_{𝖼̃_i}(σ^{−1}) = 𝖳_{𝖼_i}(σ).
  • Inequality: Again, using 𝖼̃_i(y_i,x_i)=𝖼_i(x_i,y_i), we have

    𝔼_{(μ″,ν′)∼I} 𝖳_{𝖼_i}(μ″,ν′) = 𝔼_{(μ″,ν′)∼I} 𝖳_{𝖼̃_i}(ν′,μ″)
    ≤ 𝔼_{(μ″,ν′)∼I} 𝖳_{𝖼̃_i}(σ^{−1}|ν′) = 𝔼_{(μ″,ν′)∼I} 𝖳_{𝖼_i}((σ^{−1}|ν′)^{−1}),

    where the inequality is due to the fact that 𝖳_{𝖼̃_i}(ν′,μ″) is the optimal (minimum) cost of coupling ν′ and μ″, while σ^{−1}|ν′ is one particular such coupling.

3.3 Borrowed Tools

The following can be obtained from the proofs in [43, 22] (see the full version). For p=2, it gives the celebrated Talagrand’s transportation inequality for Gaussian under 2.

Theorem 24 (Talagrand’s Inequality for the Gaussian Measure).

If 𝖼(x,y)=ℓ_p^p(x,y) for p∈[1,2], Φ_n is the standard Gaussian, and ν is an arbitrary distribution, both over ℝ^n, then

𝖳_𝖼^{OnT}(Φ_n,ν) = Δ_𝖼(Φ_n,ν) ≤ n^{1−p/2}·(2·𝖪𝖫(ν,Φ_n))^{p/2}.
Definition 25 (Transports to Empirical).

For a distribution μ and a symmetric cost 𝖼, we let 𝖳_{𝖼,k}^{Em}(μ) = 𝔼_{𝒳∼μ^k} 𝖳_𝖼(U_𝒳,μ) denote the cost of transporting μ to an empirical set of size k, where U_𝒳 is the uniform distribution over the multi-set 𝒳.

The following lemma follows from [19] and known moments of the Gaussian distribution.

Lemma 26 (Original-to-Empirical Transport for the Normal Distribution).

Let p ≥ 1, let 𝖼 be ℓ_p^p, and let μ=𝒩(0,1) be the standard normal distribution. Then, for a constant C_p depending on p,

𝖳_{𝖼,k}^{Em}(μ) ≤ C_p·(2^{1+3p/2}·Γ(p+1))^{p/(2p+1)}·k^{−1/2}.

4 Algorithmic Transport for Products

In this section, we put together the tools from previous sections to derive algorithmic results about online transport for the setting that one of the source or target distributions is product. We then derive a corollary for the Gaussian measure. We first define sequential samplers.

Definition 27 (Sequential Sampler).

For a distribution ν in dimension n with marginals ν_1,…,ν_n, we call ν̂ a sequential sampler for ν if, for all y_{[i−1]}∼ν_{[i−1]}, calling ν̂(y_{[i−1]}) returns an independent answer ν̂(y_{[i−1]})∼ν_i|y_{[i−1]}. For queries y_{[i−1]}∉Supp(ν_{[i−1]}), calling ν̂(y_{[i−1]}) returns ⊥. We also assign a (sequential sampling) cost 𝗌𝖼_ν(y_{[i−1]}) to the query y_{[i−1]}, and call 𝗌𝖼_ν = 𝔼_{y∼ν} Σ_{i∈[n]} 𝗌𝖼_ν(y_{[i−1]}) the average (sequential sampling) cost of ν̂. For an oracle algorithm A making a (potentially randomized) set 𝒬 of queries to ν̂, we define its average total cost of calling ν̂ as 𝗌𝖼_ν^A = 𝔼_𝒬 Σ_{a∈𝒬} 𝗌𝖼_ν(a). (Since 𝗌𝖼_ν(y_{[i−1]}) naturally measures the (e.g., computational) cost of sampling a coordinate conditioned on previously sampled coordinates, for natural settings and independent ν_1,ν_2, the value of 𝗌𝖼_ν(y_{[1]}) will be independent of y_{[1]}.)

One natural way of using 𝗌𝖼 is to model sampling time, but it can model other costs as well. The average cost 𝗌𝖼_ν of ν̂ is indeed the average total cost of the following simple algorithm A that uses ν̂ sequentially to obtain a full sample: let y_{[0]} be the empty string, and for i∈[n], A lets y_i←ν̂(y_{[i−1]}). Also, when μ is a product distribution, μ̂ is nothing other than a direct way of sampling from the independent distributions μ_i for all i∈[n].
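As an illustration, the following Python sketch (our own naming, not from the paper) captures the interface of a sequential sampler from Definition 27, together with the simple algorithm above that obtains a full sample via n nested conditional calls:

```python
# Minimal sketch: a sequential sampler takes the prefix y_{[i-1]} and returns one
# fresh sample of the conditional coordinate nu_i | y_{[i-1]}.
from typing import Callable, List

SequentialSampler = Callable[[List[float]], float]

def full_sample(nu_hat: SequentialSampler, n: int) -> List[float]:
    y: List[float] = []
    for _ in range(n):
        y.append(nu_hat(y))   # y_i ~ nu_i | y_{[i-1]}
    return y
```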

Before stating our main result, recall the notation for transport cost to empirical sets from Definition 25.

Theorem 28 (Main Result).

Suppose μ=μ_1×⋯×μ_n and ν are distributions over ℝ^n, with sequential samplers μ̂,ν̂ and corresponding oracle cost functions 𝗌𝖼_μ,𝗌𝖼_ν. Suppose the transportation cost function 𝖼 is a metric (i.e., symmetric and satisfying the triangle inequality) and 𝖼^p is linear over symmetric costs 𝖼_1,…,𝖼_n (an example is 𝖼=ℓ_p). Then, there is an algorithm A_k, parameterized by k, that uses oracle access to the samplers μ̂,ν̂ and achieves the following:

  1. 1.

    A_k^{μ̂,ν̂} transports μ to ν through an online coupling in time poly(nk) with p-cost (see Definition 1)

    𝖳_{𝖼^p}^{1/p}(A_k^{μ̂,ν̂}) ≤ δ + Δ,

    in which δ = 2·(Σ_{i∈[n]} 𝖳_{𝖼_i,k}^{Em}(μ_i))^{1/p} and Δ = Δ_{𝖼^p}^{1/p}(μ,ν) as in Definition 9. (By Theorem 10, Δ_{𝖼^p} is also equal to 𝖳_{𝖼^p}^{OnT}(μ,ν) = 𝖳_{𝖼^p}^{OnC}(μ,ν) = 𝖳_{𝖼^p}^G(μ,ν).)

  2. 2.

    The average total costs of A_k calling μ̂,ν̂ are as follows: 𝗌𝖼_ν^{A_k} ≤ k·𝗌𝖼_ν and 𝗌𝖼_μ^{A_k} ≤ k·𝗌𝖼_μ. (Note that because μ is a product distribution, if 𝗌𝖼_μ models the computational cost of sampling from μ, then we would have 𝗌𝖼_μ = Σ_{i∈[n]} 𝗌𝖼_{μ_i}, where 𝗌𝖼_{μ_i} models the computational cost of sampling from μ_i.)

  3. 3.

    There is an algorithm B that achieves the same as A does, but it transports ν back to μ.

Proof.

At a high level, we use an empirical variant of the greedy algorithm (which is related to the KR transport) to design the algorithm. The algorithm itself is quite simple; the bulk of the work goes into its analysis, which is quite delicate and uses many tools from Section 3.

The Transportation Algorithm 𝑨.

The algorithm A works in n rounds. In round i∈[n], given x_i∼μ_i and having already produced y_{[i−1]}, it finds y_i∼ν_i|y_{[i−1]} as described below (a code sketch of a single round follows the steps).

  1. 1.

    For j∈[k], let y_i^{(j)}←ν̂(y_{[i−1]}) be independent samples forming the multi-set 𝒴 of size k.

  2. 2.

    Pick t∈[k] at random. For all j∈[k], j≠t, let x_i^{(j)}∼μ_i be k−1 independent samples. Additionally, let x_i^{(t)}=x_i, and let 𝒳 be the multi-set {x_i^{(j)} ∣ j∈[k]} of size k.

  3. 3.

    Find the optimal transport between the two distributions U_𝒳,U_𝒴 under the cost 𝖼_i, e.g., using the Hungarian method (this step can be implemented faster when the cost function is convex, in which case simply sorting 𝒳,𝒴 gives the optimal matching as a monotone mapping). This optimal transport is in the form of a matching between 𝒳 and 𝒴 (this can be proved, e.g., using the Birkhoff–von Neumann decomposition of doubly stochastic matrices).

  4. 4.

    Output the y_i∈𝒴 that is matched with x_i^{(t)}=x_i∈𝒳.
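The following Python sketch (our illustration; names are ours) implements one round of the algorithm above, under the assumption of a convex one-dimensional cost so that the optimal matching in the third step is the monotone one obtained by sorting:

```python
# Minimal sketch: one round of the empirical greedy transport, with the sorted
# (monotone) matching that is optimal for convex one-dimensional costs.
import random

def round_i(x_i: float, sample_mu_i, nu_hat, y_prefix: list, k: int) -> float:
    ys = [nu_hat(y_prefix) for _ in range(k)]        # k samples of nu_i | y_{[i-1]}
    t = random.randrange(k)                          # hide x_i at a random position
    xs = [sample_mu_i() for _ in range(k - 1)]       # k-1 fresh samples of mu_i
    xs.insert(t, x_i)
    rank = sorted(range(k), key=lambda j: xs[j]).index(t)   # rank of x_i within xs
    return sorted(ys)[rank]                          # the y matched with x_i
```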

We now analyze the algorithm A above.

Transportation.

A’s running time is clearly poly(kn). We now prove that A produces an online coupling between μ,ν, by showing that in round i it couples μ_i and ν_i|y_{[i−1]}. It is simple to check that all the elements of 𝒳 are distributed as μ_i and all the elements of 𝒴 are distributed as ν_i|y_{[i−1]}. At first, it might not be clear why y_i is distributed as ν_i|y_{[i−1]}, because the matching algorithm might change its distribution by picking it adversarially. However, since the algorithm statistically hides the index of x_i among 𝒳, the final “matched pair” (x_i,y_i) is a random edge of the optimal matching/transport. Therefore, y_i is also distributed accurately, and hence A produces an online coupling.

More formally, we can choose t∈[k] at random after the matching between 𝒳,𝒴 is chosen. Moreover, the marginal distribution of each y_i^{(j)} is ν_i|y_{[i−1]}. Therefore, for every (even fixed) matching between 𝒳,𝒴, picking t at random leads to picking y_i=y_i^{(j)}, where j is the index of the sample in 𝒴 that is matched with index t in 𝒳. Therefore, y_i∼ν_i|y_{[i−1]}.

The Cost.

To analyze the transportation cost we apply Lemma 23 from Section 3, which is stated in a more general form to better demonstrate the key ideas.

To apply Lemma 23, let J(y_{[i−1]}) return the pair of distributions (μ_i′=U_𝒳, ν_i′=U_𝒴) that are constructed using independent sample multi-sets 𝒳,𝒴 of size k from μ_i and ν_i|y_{[i−1]}, respectively. Finally, because the algorithm A finds an optimal transport between μ_i′,ν_i′, the premises of Lemma 23 hold and we conclude that

𝖳_{𝖼^p}^{1/p}(A_k^{μ̂,ν̂}) ≤ (𝔼_{(x,y)∼π} Σ_{i∈[n]} 𝔼_{(U_𝒳,U_𝒴)∼J(y_{[i−1]})} 𝖳_{𝖼_i}(U_𝒳, σ_{z_{[i−1]}}^{−1}∘U_𝒴))^{1/p} + Λ_{𝖼^p}^{1/p}(π_A),

in which σz[i1] is an (optimal) transport from μi to νi|y[i1]. (See Definition 21 for the notation.) We now further simplify the summation above.

  • Because U_𝒳,U_𝒴 are empirical distributions from μ_i and ν_i|y_{[i−1]}, applying Part 3 of Proposition 22 to the transport σ_{z_{[i−1]}}^{−1} (from ν_i|y_{[i−1]} to μ_i), we get that σ_{z_{[i−1]}}^{−1}∘U_𝒴 is distributed as U_{𝒳′} (see Definition 21 for the notation), in which 𝒳′∼μ_i^k is an empirical sample independent of 𝒳. So, the first term of the right-hand side in the inequality above simplifies to:

    (𝔼_{(x,y)∼π} Σ_{i∈[n]} 𝔼_{𝒳,𝒳′∼μ_i^k} 𝖳_{𝖼_i}(U_𝒳,U_{𝒳′}))^{1/p}.
  • Now, in this first term, the outer expectation over (x,y)∼π is irrelevant to the summation, as the inner expression no longer depends on (x,y).

  • Since A produces an online coupling, the second term simplifies to Δ_{𝖼^p}^{1/p}(μ,ν) = Λ_{𝖼^p}^{1/p}(π_A), due to Part 3 of Proposition 12 and the fact that μ is a product.

  • Finally, by the triangle inequality of Proposition 16, the first term is at most

    2·(Σ_{i∈[n]} 𝖳_{𝖼_i,k}^{Em}(μ_i))^{1/p} = δ.

    To apply Proposition 16, we let J_i be the distribution over distributions that outputs the following triple of distributions (ν_1,ν_2,ν_3), where

    ν_1=U_𝒳 for 𝒳∼μ_i^k,  ν_2=μ_i,  ν_3=U_{𝒳′} for 𝒳′∼μ_i^k.

Oracle Costs.

In each round, A asks for k−1 samples from μ_i and k samples from ν_i|y_{[i−1]}. Furthermore, the previous samples y_{[i−1]} are distributed according to ν_{[i−1]} itself, so the average total cost is as claimed.

Inverse Transport.

The reverse mapping uses the same algorithm for the one-dimensional transport in each round, but it maps ν_i|y_{[i−1]} to μ_i; inspection shows that its transportation cost and (expected) total oracle costs are the same as those of A.

4.1 Extending Transport to Conditional Distributions

In this subsection, we study how to use the main result of Theorem 28 to obtain transports from the same μ to a richer set of distributions that can be obtained from ν by conditioning ν on an event 𝒮 of not-too-small probability. Doing so is extremely useful when, later on, we focus on transporting Gaussian distributions to the same distributions conditioned on an event 𝒮. To prove this extension, we prove a general result about using sequential samplers for ν to obtain sequential samplers for ν|𝒮.

Theorem 29 (Sequential Samplers for Event-conditioned Distributions).

Suppose ν is an n-dimensional distribution that has a sequential sampler ν̂ with average cost 𝗌𝖼_ν. Suppose 𝒮 is an event of measure ν(𝒮) ≥ ε, and ω=ν|𝒮 is ν conditioned on 𝒮. Then, there is an algorithm O that uses the oracle ν̂ and a membership oracle for 𝒮 and achieves the following.

  1. 1.

    For all y_{[i−1]}∼ω_{[i−1]}, the output O^{𝒮,ν̂}(y_{[i−1]}) is distributed as ω̂(y_{[i−1]}), i.e., as ω_i|y_{[i−1]}.

  2. 2.

    If we define 𝗌𝖼_ω(y_{[i−1]}) to be the average total cost of O^{𝒮,ν̂}(y_{[i−1]}) querying ν̂, and if we define 𝗌𝖼_ν(i) = 𝔼_{y∼ν} 𝗌𝖼_ν(y_{[i−1]}), then

    𝗌𝖼_ω ≤ (1/ε)·Σ_{i∈[n]} i·𝗌𝖼_ν(i) ≤ n·𝗌𝖼_ν/ε.
  3. 3.

    When iteratively sampling (y_1,…,y_n)∼ω, the expected number of calls to 𝒮 in round i is at most 1/ε, making the total expected number of calls to 𝒮 at most n/ε.

  4. 4.

    The running time of the iterative sampling of (y1,,yn)ω, relative to the provided oracles ν^,𝒮 is at most O(n2/ε).

In other words, one can use O𝒮,ν^ to emulate a sequential sampler for ω=ν|𝒮 in such a way that the average cost of obtaining a full sequence yν|𝒮 using n nested calls to the provided sequential sampler only goes up (at most) by a multiplicative factor n/Pr[𝒮].

The main idea in the proof is to use rejection sampling with a subtle analysis. Namely, O𝒮,ν^ simply keeps using ν^ to obtain full sequences multiple times until the sample sequence falls within the event 𝒮. The full proof follows.

Proof of Theorem 29.

For v=(v_1,…,v_n), let v_{≥i}=(v_i,…,v_n), so that v=(v_{[i−1]},v_{≥i}).

Our algorithm O^{𝒮,ν̂}(y_{[i−1]}) samples from ω̂(y_{[i−1]}) as follows.

  1. 1.

    Sample y_{≥i} from ν_{≥i}|y_{[i−1]} as follows: for j=i,…,n, sample fresh values y_j←ν̂(y_{[j−1]}).

  2. 2.

    If y=(y_{[i−1]},y_{≥i})∈𝒮, then output y_i; otherwise, go back to the previous step.

We refer to each execution of the two steps above (which makes exactly one call to 𝒮) as a trial.
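The following Python sketch (our illustration; names are ours) captures this rejection-sampling oracle: it repeatedly extends the prefix to a full sample using the sequential sampler ν̂ and retries until the sample lands in 𝒮:

```python
# Minimal sketch: sample the i-th coordinate of nu conditioned on S and on the
# prefix y_{[i-1]} by rejection sampling; each loop iteration is one "trial".
def conditional_next(nu_hat, in_S, y_prefix: list, n: int) -> float:
    while True:
        y = list(y_prefix)
        for j in range(len(y_prefix), n):
            y.append(nu_hat(y))          # y_j ~ nu_j | y_{[j-1]}
        if in_S(y):                      # exactly one membership query per trial
            return y[len(y_prefix)]      # output only the i-th coordinate
```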

Part 1 follows from the fact that the above sampling process is a simple rejection sampling. To prove Part 2, let H(y_{[i−1]}) be a random variable that counts the number of trials, and let its expectation be

h(y_{[i−1]}) = 𝔼[H(y_{[i−1]})] = 1 / Pr_{y∼ν|y_{[i−1]}}[y∈𝒮].

Also let 𝗌𝖼̄_ν(y_{[i−1]}) = 𝔼_{y_{≥i}∼ν_{≥i}|y_{[i−1]}} Σ_{j≥i} 𝗌𝖼_ν(y_{[j−1]}). It can be observed that 𝔼_{y∼ν} 𝗌𝖼̄_ν(y_{[i−1]}) = Σ_{j≥i} 𝗌𝖼_ν(j). Using these notations, the oracle sampling cost of ω̂(·) at y_{[i−1]} will be

𝗌𝖼_ω(y_{[i−1]}) = h(y_{[i−1]})·𝗌𝖼̄_ν(y_{[i−1]}).

Therefore, the average cost of ω̂ will be

𝗌𝖼_ω = 𝔼_{y∼ω} Σ_{i∈[n]} 𝗌𝖼_ω(y_{[i−1]}) = Σ_{i∈[n]} 𝔼_{y∼ω} h(y_{[i−1]})·𝗌𝖼̄_ν(y_{[i−1]}).

A subtle point is that, in the above sums, the first half y_{[i−1]} is sampled conditioned on 𝒮, while the second half is sampled without such conditioning. We claim that for each i, we have

𝔼_{y∼ω} h(y_{[i−1]})·𝗌𝖼̄_ν(y_{[i−1]}) ≤ (1/ε)·𝔼_{y∼ν} 𝗌𝖼̄_ν(y_{[i−1]}). (1)

Note that if Eq. (1) holds, then we conclude Part 2, because we get:

𝗌𝖼_ω ≤ Σ_{i∈[n]} (1/ε)·𝔼_{y∼ν} 𝗌𝖼̄_ν(y_{[i−1]}) = (1/ε)·Σ_{i∈[n]} Σ_{j≥i} 𝗌𝖼_ν(j) = (1/ε)·Σ_{i∈[n]} i·𝗌𝖼_ν(i).

The following lemma proves Eq. (1) using 𝒰=Supp(ν_{[i−1]}), 𝒱=Supp(ν_{≥i}), σ=ν, f(y)=𝗌𝖼̄_ν(y_{[i−1]}), and 𝒮 as before.

Lemma 30 (Expected Cost of Two-Step Sequential Sampling).

Suppose σ is distributed over 𝒰×𝒱 with marginals σ_𝒰,σ_𝒱, and 𝒮⊆𝒰×𝒱 has probability σ(𝒮)=ε. Also, suppose f is a (non-negative) random variable defined over σ with average f̄. Consider the following process: (1) sample u∼σ_𝒰|𝒮, which is the marginal distribution of 𝒰 in σ|𝒮, and let ε_u = Pr_{v∼σ_𝒱^u}[(u,v)∈𝒮], in which σ_𝒱^u is the marginal distribution over 𝒱 in σ conditioned on σ_𝒰=u; (2) sample v∼σ_𝒱^u|𝒮. Then,

𝔼_{(u,v)} f(u,v)/ε_u ≤ f̄/ε.

Proof.

We write the proof for the discrete setting; a similar proof holds in general. For each u∼σ_𝒰, define p_u = Pr[σ_𝒰=u] and f_u = 𝔼_{v|u} f(u,v). We have ε = Σ_{u∈𝒰} p_u·ε_u, and q_u = p_u·ε_u/ε is the probability that we sample u in the sampling process of the lemma statement. Then, if we let 𝒰_𝒮 = {u ∣ ε_u>0} = Supp(σ_𝒰|𝒮), we have

𝔼_u (1/ε_u)·𝔼_{v|u} f(u,v) = Σ_{u∈𝒰_𝒮} (q_u/ε_u)·f_u = Σ_{u∈𝒰_𝒮} (p_u/ε)·f_u ≤ Σ_{u∈𝒰} (p_u/ε)·f_u = f̄/ε.

To prove Part 3, using Lemma 30 and f(u,v)=1, we conclude that the expected number of times we call the 𝒮 oracle at each node y[i1] is at most 1/ε.

To prove Part 4 we can simply use a fake oracle sampling cost of 𝗌𝖼^ν()=1. Then the claim about the running time follows from Part 2.

Deriving corollaries.

Using Theorem 29, we can derive more transportation results from Theorem 28 by conditioning ν on an arbitrary event 𝒮 for which we have a membership oracle at hand. Note that the parameter Δ will change to a new value, but the key point is that we can control the cost of sequential samples from the new oracle, so long as we could do so for the initial oracle. Another interesting application of Theorem 29 is to transport a product distribution μ to μ|𝒮 for an arbitrary event 𝒮, obtaining the following corollary. (In the next section we apply this idea to the special case of Gaussian distributions.)

Corollary 31.

Suppose the assumptions of Theorem 28 hold. Then, we have the following:

  1. 1.

    There is an algorithm M_k such that, for all events 𝒮 defined over μ with ε=μ(𝒮), M_k^{𝒮,μ̂} transports μ to μ|𝒮 in expected time poly(nk/ε) and p-cost 𝖳_{𝖼^p}^{1/p}(M_k^{𝒮,μ̂}) ≤ δ+Δ, in which δ is as in Theorem 28 and Δ=Δ_{𝖼^p}^{1/p}(μ,μ|𝒮).

  2. 2.

    There is an algorithm N_k such that, for all events 𝒮 defined over ν with ε=ν(𝒮), N_k^{𝒮,μ̂,ν̂} transports μ to ν|𝒮 in expected time poly(nk/ε) and p-cost 𝖳_{𝖼^p}^{1/p}(N_k^{𝒮,μ̂,ν̂}) ≤ δ+Δ, in which δ is as in Theorem 28 and Δ=Δ_{𝖼^p}^{1/p}(μ,ν|𝒮). Moreover, 𝗌𝖼_ν^{N_k} ≤ n·𝗌𝖼_ν^A/ε, for A of Theorem 28.

In both cases above, the expected number of calls to 𝒮 is at most kn/ε, and the transportation can be reversed with the same upper bounds on the running time and oracle costs.

5 Algorithmic Transport for Gaussian

In this section we focus on cases where at least one of the two distributions involved in the transport is Gaussian. We first use the main result of Theorem 28 and derive an algorithmic variant of Talagrand’s result [43] about transporting Gaussian measure to arbitrary distributions with bounded KL divergence from Gaussian. We then derive, as a corollary, a computational concentration result for the Gaussian source measure under the 2 distance. Finally, we focus on finding (optimal) online transports in cases where both the source and destination are Gaussians, but they could be arbitrary (non-product) Gaussians.

5.1 Algorithmic Variant of Talagrand’s Transport for Gaussian

Theorem 32 (Algorithmic Version of Talagrand’s Gaussian Transport Theorem).

Let Φ_n be the standard Gaussian in dimension n and ν be an arbitrary distribution over ℝ^n. There is an algorithm A_k, with integer parameter k, such that whenever A_k^{ν̂} is provided with a sequential sampler ν̂ for ν, the following properties hold.

  1. 1.

    For all p ≥ 1 and ν, A_k^{ν̂} transports Φ_n to ν in time O(nk log k) with p-cost at most

    𝖳_{ℓ_p^p}^{1/p}(A_k^{ν̂}) ≤ Δ_{ℓ_p^p}^{1/p}(Φ_n,ν) + (O_p(n·k^{−1/2}))^{1/p}.

    For p=2, by the Talagrand inequality of Theorem 24, we have Δ_{ℓ_2^2}(Φ_n,ν) ≤ 2·𝖪𝖫(ν,Φ_n).

  2. 2.

    The average total oracle cost of A_k^{ν̂} is at most k·𝔼_{y∼ν} Σ_{i∈[n]} 𝗌𝖼(ν_i|y_{[i−1]}).

  3. 3.

    There is an algorithm Bkν^ that achieves the same as Akν^, but it transports ν back to Φn.

 Remark 33 (Working with ℓ_p instead of ℓ_p^p).

One might wonder what happens if we want to measure (and upper bound) transport costs using ℓ_p rather than ℓ_p^p. This can be obtained using Jensen’s inequality (or rather the monotonicity of Wasserstein p-costs for a fixed cost 𝖼). In particular, for every coupling π, we have 𝖳_{ℓ_p}(π) ≤ 𝖳_{ℓ_p^p}^{1/p}(π) for all p ≥ 1. Hence Theorem 32 is already stated in the stronger form.
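For completeness, the comparison behind Remark 33 is a one-line application of Jensen’s inequality (convexity of t↦t^p for p ≥ 1) to any fixed coupling π:

𝖳_{ℓ_p}(π) = 𝔼_{(x,y)∼π}[ℓ_p(x,y)] ≤ (𝔼_{(x,y)∼π}[ℓ_p^p(x,y)])^{1/p} = 𝖳_{ℓ_p^p}^{1/p}(π).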

Proof of Theorem 32.

The proof follows directly from Theorem 28 and Lemma 26. Namely, we use Lemma 26 to bound the term δ in Theorem 28, which upper bounds the cost of transporting the one-dimensional Gaussian to its empirical version, on average. One small point here is that we do not need oracle samplers for the Gaussian itself, as we can use well-known sampling methods such as the Box–Muller method that generate such samples efficiently [36]. (In particular, given two independent uniform u_1,u_2∈[0,1], the sampling works as follows: v_1=√(−2 ln u_1)·cos(2π u_2) and v_2=√(−2 ln u_1)·sin(2π u_2) are independent samples v_1,v_2∼𝒩(0,1).)
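For illustration, the Box–Muller sampler just mentioned can be sketched in a few lines of Python (our code, following the stated formula):

```python
# Minimal sketch of the Box-Muller transform: two independent uniforms in (0,1]
# yield two independent standard normal samples.
import math, random

def box_muller():
    u1 = 1.0 - random.random()          # in (0, 1], avoids log(0)
    u2 = random.random()
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2 * math.pi * u2), r * math.sin(2 * math.pi * u2)
```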

We now focus on a special case of interest, in which the target distribution ν is Φn|𝒮 for an event 𝒮 of probability Φn(𝒮)=ε, and show that in this case, one can have a single online transportation algorithm that uniformly works for all 𝒮 by merely accessing 𝒮 through a membership oracle. We first define such uniform transportation algorithms.

Definition 34 (Oracle Set-Transport).

For a distribution μ and a transportation cost 𝖼, we say that (μ,𝖼) has a set-transport of cost at most κ(·) for a non-increasing function κ:[0,1]→[0,∞), if for every event 𝒮⊆Supp(μ), it holds that 𝖳_𝖼(μ,μ|𝒮) ≤ κ(μ(𝒮)). We further say that (μ,𝖼) has an oracle set-transport of cost at most κ(·) if there is a single algorithm A such that, with oracle membership queries for an arbitrary set 𝒮 and sampling queries for μ, A^{𝒮,μ} produces a transport of cost at most κ(μ(𝒮)) from μ to μ|𝒮.

Theorem 35 (Oracle-Set Transport for Gaussian Measure).

Let Φn be the standard Gaussian in dimension n. There is an (online) oracle-set transport algorithm Ak for Φn such that:

  1. 1.

    For all p∈[1,2] and 𝒮 of measure Φ_n(𝒮)=ε,

    𝖳_{ℓ_p^p}^{1/p}(A_k^𝒮) ≤ κ^{1/p}(ε) = n^{1/p−1/2}·√(2 ln(1/ε)) + (O_p(n·k^{−1/2}))^{1/p},

    which is at most (1+γ)·n^{1/p−1/2}·√(2 ln(1/ε)), for sufficiently large k=poly(n,1/ε,1/γ).

  2. 2.

    In expectation, Ak𝒮 asks at most kn/ε queries to 𝒮 and runs in poly(nk/ε).

  3. 3.

    There is an algorithm Bk that achieves the same, but Bk𝒮 transports Φn|𝒮 back to Φn.

Proof of Theorem 35.

To prove Theorem 35 we first use the first item of Corollary 31 where μ=Φn. This way, we already know that the running time of the transportation algorithm and its number of calls to 𝒮 are bounded as stated.

It remains to bound the two terms Δ and δ. To bound δ, we again use Corollary 26 as in the proof of Theorem 32. To bound Δ, we again use Corollary 24 together with the well-known fact that 𝖪𝖫(μ|𝒮, μ) ≤ ln(1/ε) whenever μ(𝒮) ≥ ε (applied to μ = Φn); indeed, the density of μ|𝒮 with respect to μ equals 1/μ(𝒮) on 𝒮, so 𝖪𝖫(μ|𝒮, μ) = ln(1/μ(𝒮)).
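To make the role of the membership oracle concrete, here is a minimal Python sketch of one natural way to implement a sequential sampler for Φn|𝒮: to draw coordinate i given the prefix sampled so far, redraw the remaining coordinates from the standard Gaussian until the completed point lies in 𝒮, and keep the i-th coordinate of the accepted completion. This is our illustration of the access model, not necessarily the construction used in the paper, and all names are ours; each completion costs one membership query, and for a set of measure ε the expected number of queries per output sample is on the order of n/ε, matching item 2 of Theorem 35 up to the factor k.

```python
import numpy as np

def sequential_sampler_conditional_gaussian(in_S, n, rng, max_tries=1_000_000):
    """Sample y distributed as Phi^n conditioned on S, coordinate by coordinate,
    using only a membership oracle in_S(x) -> bool for the event S.

    For coordinate i, keep the prefix y[:i] fixed, redraw the suffix from the
    standard Gaussian, and accept once the completed point lies in S; the first
    entry of the accepted suffix becomes y[i]."""
    y = np.empty(n)
    queries = 0
    for i in range(n):
        for _ in range(max_tries):
            suffix = rng.standard_normal(n - i)          # fresh Gaussian suffix
            candidate = np.concatenate([y[:i], suffix])  # prefix fixed so far
            queries += 1
            if in_S(candidate):
                y[i] = suffix[0]
                break
        else:
            raise RuntimeError("membership oracle never accepted a completion")
    return y, queries

# Toy example: S is the centered Euclidean ball of radius sqrt(n) (constant measure).
rng = np.random.default_rng(1)
n = 20
y, num_queries = sequential_sampler_conditional_gaussian(
    lambda x: float(np.dot(x, x)) <= n, n, rng
)
```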

Since our transports are "reversible", one can obtain, by composition, a variant of the result above that transports one conditional distribution to another.

5.2 Dimension-Independent Computational Concentration for Gaussian

It is well known that transportation inequalities can be used to derive concentration-of-measure results [22]. Recently, a computational variant of this phenomenon has been explored [31, 15], which bears similarities to how we make transportation algorithmic. In a computational concentration result, we need an algorithm that algorithmically maps "most" of the sampled points of the space into any "sufficiently large" event 𝒮. The "cost" of the concentration is a worst-case distance d that the algorithm is allowed to move the points, and its error is the fraction of sampled points that it fails to map into 𝒮 within the allowed distance d. The work of [15] obtained such results optimally for some settings (e.g., the Gaussian measure under the ℓ1 distance); however, it left open obtaining an optimal (dimension-free) computational concentration result for the Gaussian space under the ℓ2 distance.

Using Theorem 35, we can resolve this open question of [15] and derive such an optimal computational concentration result for the Gaussian space under ℓ2 as a simple corollary of our algorithmic transport result. Corollary 36 below follows from Theorem 35 and Markov's inequality; setting p = 2 yields the desired dimension-independent result.

Corollary 36 (Computational Concentration for Gaussian).

For all ε, δ, λ, and p ∈ [1,2], given oracle access to a set 𝒮 ⊆ ℝn of measure Φn(𝒮) = ε, the algorithm Ak𝒮(x) of Theorem 35 runs in time poly(n/(ελ)) and, with probability 1 − δ over x ∼ Φn, finds a point y ∈ 𝒮 at distance

$\ell_p(x,y) \;\le\; (1+\lambda)\, n^{1/p-1/2}\sqrt{2\ln\!\big(1/(\varepsilon\delta)\big)}.$

References

  • [1] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008.
  • [2] Martín Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, 2017. URL: http://proceedings.mlr.press/v70/arjovsky17a/arjovsky17a.pdf.
  • [3] Julio Backhoff, Mathias Beiglbock, Yiqing Lin, and Anastasiia Zalashko. Causal transport in discrete time and applications. SIAM Journal on Optimization, 27(4):2528–2562, 2017.
  • [4] Arjun Nitin Bhagoji, Daniel Cullina, and Prateek Mittal. Lower bounds on adversarial robustness from optimal transport. Advances in Neural Information Processing Systems, 32, 2019.
  • [5] Arnab Bhattacharyya, Sutanu Gayen, Kuldeep S Meel, Dimitrios Myrisiotis, A Pavan, and NV Vinodchandran. On approximating total variation distance. In IJCAI, 2023.
  • [6] Jeremiah Birrell and Mohammadreza Ebrahimi. Adversarially robust deep learning with optimal-transport-regularized divergences. arXiv preprint, 2023. doi:10.48550/arXiv.2309.03791.
  • [7] Jose Blanchet, Karthyek Murthy, and Fan Zhang. Optimal transport-based distributionally robust optimization: Structural properties and iterative schemes. Mathematics of Operations Research, 47(2):1500–1529, 2022. doi:10.1287/moor.2021.1178.
  • [8] Lenore Blum. Complexity and real computation. Springer Science & Business Media, 1998.
  • [9] Mark Braverman. On the complexity of real functions. In 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS’05), pages 155–164. IEEE, 2005. doi:10.1109/SFCS.2005.58.
  • [10] Yann Brenier. Polar factorization and monotone rearrangement of vector-valued functions. Communications on pure and applied mathematics, 44(4):375–417, 1991.
  • [11] Maarten Buyl and Tijl De Bie. Optimal transport of classifiers to fairness. In Advances in Neural Information Processing Systems, 2022.
  • [12] Guillaume Carlier, Alfred Galichon, and Filippo Santambrogio. From Knothe's transport to Brenier's map and a continuation method for optimal transport. SIAM Journal on Mathematical Analysis, 41(6):2554–2576, 2010. doi:10.1137/080740647.
  • [13] Sinho Chewi, Jonathan Niles-Weed, and Philippe Rigollet. Statistical optimal transport. arXiv preprint, 2024. arXiv:2407.18163.
  • [14] Nicolas Courty, Rémi Flamary, Amaury Habrard, and Alain Rakotomamonjy. Joint distribution optimal transportation for domain adaptation. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 3733–3742, 2017.
  • [15] Omid Etesami, Saeed Mahloujifar, and Mohammad Mahmoody. Computational concentration of measure: Optimal bounds, reductions, and more. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 345–363. SIAM, 2020. doi:10.1137/1.9781611975994.21.
  • [16] Weiming Feng, Heng Guo, Mark Jerrum, and Jiaheng Wang. A simple polynomial-time approximation algorithm for the total variation distance between two product distributions. In Symposium on Simplicity in Algorithms (SOSA), pages 343–347. SIAM, 2023. doi:10.1137/1.9781611977585.CH30.
  • [17] Alessio Figalli and Cédric Villani. Optimal Transport and Curvature, pages 171–217. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011. doi:10.1007/978-3-642-21861-3_4.
  • [18] Nicolas Ford, Daniel Duckworth, Mohammad Norouzi, and George E Dahl. The importance of generation order in language modeling. arXiv preprint, 2018. arXiv:1808.07910.
  • [19] Nicolas Fournier and Arnaud Guillin. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(3):707–738, 2015.
  • [20] Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya, and Tomaso A. Poggio. Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems, volume 28, pages 2053–2061, 2015.
  • [21] Alfred Galichon. The unreasonable effectiveness of optimal transport in economics. Proceeding of the 2020 World Congress of the Econometric Society, 2020.
  • [22] Nathael Gozlan and Christian Léonard. Transport inequalities: a survey. Markov Processes and Related Fields, 16:635–736, 2010.
  • [23] Steven Haker, Lei Zhu, Allen Tannenbaum, and Sigurd Angenent. Optimal mass transport for registration and warping. International Journal of Computer Vision, 60(3):225–240, 2004. doi:10.1023/B:VISI.0000036836.66311.97.
  • [24] Jan-Christian Hütter and Philippe Rigollet. Minimax estimation of smooth optimal transport maps. The Annals of Statistics, 49(2), 2021.
  • [25] Leonid V Kantorovich. On the translocation of masses. In Dokl. Akad. Nauk. USSR (NS), volume 37, pages 199–201, 1942.
  • [26] Daegyu Kim et al. Improving diffusion-based generative models via approximated optimal transport. arXiv preprint, 2024. arXiv:2403.05069.
  • [27] Herbert Knothe. Contributions to the theory of convex bodies. Michigan Mathematical Journal, 4(1):39–52, 1957.
  • [28] Martin Knott and Cyril S Smith. On the optimal mapping of distributions. Journal of Optimization Theory and Applications, 43:39–49, 1984.
  • [29] Harold W Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.
  • [30] Rémi Lassalle. Causal transport plans and their Monge–Kantorovich problems. Stochastic Analysis and Applications, 36(3):452–484, 2018.
  • [31] Saeed Mahloujifar and Mohammad Mahmoody. Can adversarially robust learning leverage computational hardness? In Algorithmic Learning Theory, pages 581–609. PMLR, 2019. URL: http://proceedings.mlr.press/v98/mahloujifar19a.html.
  • [32] Tudor Manole, Sivaraman Balakrishnan, Jonathan Niles-Weed, and Larry Wasserman. Plugin estimation of smooth optimal transport maps. The Annals of Statistics, 52(3):966–998, 2024.
  • [33] Gaspard Monge. Mémoire sur la théorie des déblais et des remblais. Histoire de l’Académie Royale des Sciences de Paris, 1781.
  • [34] Paul Montesuma, Loic Ngolè Mboula, and Antoine Souloumiac. Recent advances in optimal transport for machine learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • [35] Ludovic Métivier, Romain Brossier, Jean Virieux, and Jesus de la Puente. Measuring the misfit between seismograms using an optimal transport distance. Geophysical Journal International, 205(1):345–377, 2016.
  • [36] Raymond Edward Alan Christopher Paley and Norbert Wiener. Fourier transforms in the complex domain, volume 19. American Mathematical Soc., 1934.
  • [37] Gabriel Peyré. Course notes on computational optimal transport. Mathematical Tours, 2024. URL: https://mathematical-tours.github.io/.
  • [38] Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport: With applications to data science. Foundations and Trends® in Machine Learning, 11(5-6):355–607, 2019.
  • [39] Murray Rosenblatt. Remarks on a multivariate transformation. The Annals of Mathematical Statistics, 23(3):470–472, 1952.
  • [40] Filippo Santambrogio. Models and applications of optimal transport theory, 2009. Lecture Notes, Grenoble.
  • [41] Shai Shalev-Shwartz et al. Online learning and online convex optimization. Foundations and Trends® in Machine Learning, 4(2):107–194, 2012.
  • [42] Richard Sinkhorn and Paul Knopp. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21(2):343–348, 1967.
  • [43] Michel Talagrand. Transportation cost for Gaussian and other product measures. Geometric & Functional Analysis GAFA, 6(3):587–600, 1996.
  • [44] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • [45] Cédric Villani. Topics in optimal transportation, volume 58. American Mathematical Soc., 2021.
  • [46] Hongkang Yang and Esteban G. Tabak. Clustering, factor discovery and optimal transport. IMA Journal of Applied Mathematics, 10(4):1353–1387, 2021. doi:10.1093/imaiai/iaaa040.
  • [47] Yi-Zhuang You et al. Renormalization group flow, optimal transport and diffusion-based generative model. Physical Review E, 2024.