
Coresets for 1-Center in $\ell_1$ Metrics

Amir Carmel, Weizmann Institute of Science, Rehovot, Israel; Chengzhi Guo, Peking University, China; Shaofeng H.-C. Jiang, Peking University, China; Robert Krauthgamer, Weizmann Institute of Science, Rehovot, Israel
Abstract

We explore the applicability of coresets – a small subset of the input dataset that approximates a predefined set of queries – to the 1-center problem in $\ell_1$ spaces. This approach could potentially extend to solving the 1-center problem in related metric spaces, and has implications for streaming and dynamic algorithms.

We show that in $\ell_1$, unlike in Euclidean space, even weak coresets exhibit exponential dependency on the underlying dimension. Moreover, while inputs with a unique optimal center admit better bounds, they are not dimension independent. We then relax the guarantee of the coreset further, to merely approximate the value (optimal cost of 1-center), and obtain a dimension-independent coreset for every desired accuracy ϵ>0. Finally, we discuss the broader implications of our findings for related metric spaces, and show explicit implications for the Jaccard and Kendall's tau distances.

Keywords and phrases:
clustering, k-center, minimum enclosing balls, coresets, $\ell_1$ norm, Kendall's tau, Jaccard metric
Funding:
Shaofeng H.-C. Jiang: Research partially supported by a national key R&D program of China No. 2021YFA1000900 and a startup fund from Peking University.
Robert Krauthgamer: Work partially supported by the Israel Science Foundation grant #1336/23, by the Israeli Council for Higher Education (CHE) via the Weizmann Data Science Research Center, and by a research grant from the Estate of Harry Schutzman. Part of this work was done while visiting the Simons Institute for the Theory of Computing.
Copyright and License:
© Amir Carmel, Chengzhi Guo, Shaofeng H.-C. Jiang, and Robert Krauthgamer; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Sketching and sampling; Theory of computation → Streaming models
Editors:
Raghu Meka

1 Introduction

Clustering is a fundamental task in unsupervised learning and data analysis in general, with wide-ranging applications. Typically, the input is a dataset of $n$ points in $\mathbb{R}^d$ or some other feature space of dimension $d$, and the objective is to partition the dataset into clusters, each characterized by a high degree of similarity. In the current era of big data, both the number of points and their dimension are often excessive, making the computational demands significant. In center-based clustering, the goal is to further assign to each cluster a representative “center” point. The number of centers (and ergo clusters) is often specified in advance, and denoted by $k$, giving rise to the fundamental $k$-center, $k$-median and $k$-means problems, which are famous for their simplicity and broad applicability. The case $k=1$ is particularly important as the most basic setting, and focuses not on partitioning the points but rather on aggregating them, by identifying a representative that best captures the entire dataset.

We study the 1-center problem in $\ell_1$ metrics (i.e., Manhattan distance). The motivation for $\ell_1$ metrics is twofold. First, $\ell_1$ metrics arise in many practical settings, especially when the dimension is large. Second, common discrete metric spaces, such as the Hamming, Kendall's tau, and Jaccard metrics, are closely related to $\ell_1$, as they can be embedded into $\mathbb{R}^d$ endowed with the $\ell_1$ metric with all distances preserved isometrically (i.e., with no distortion).

The 1-center problem has been investigated extensively in many metric spaces. In the context of strings, 1-center under the Hamming distance and edit distance is a classical problem known as the Closest String Problem [26, 27, 29]. In the context of permutations (a common model for rankings), 1-center under the Kendall's tau distance is known as the maximum rank aggregation problem [5, 32]. The 1-center problem has also been studied in other discrete metrics, such as the Ulam distance between permutations and the Jaccard distance between sets [9, 13]. Additionally, recent work studied the 1-center problem under various metrics, including $\ell_p$, edit distance, and Ulam, specifically to examine how the time complexity depends on the dimension $d$ [1]. In fact, the 1-center problem is known to be NP-hard in many metric spaces, and it seems that each one requires distinct algorithmic techniques. For some metrics, a PTAS is known, while for others, the existence of a PTAS or even a non-trivial approximation algorithm remains an open question and an active area of research.

A coreset is a data-summarization concept, where the input dataset is replaced by a small subset that approximates its clustering properties. It can lead to significant reductions in computational resources, particularly storage and communication, and plays a pivotal role in modern algorithmic settings, such as streaming, distributed, and dynamic. It is thus crucial to understand the tradeoff between the size of a coreset and its accuracy (approximation factor). Coresets have been studied intensely since their introduction by [2], see the surveys [3, 21, 28, 31], and they have been successfully applied in a broad range of settings, for example different clustering objectives and metric spaces. Surprisingly, little is known about coresets in $\ell_1$, particularly in high dimension, leaving significant gaps in our understanding. Small coresets in $\ell_1$, if they exist and can be computed efficiently, could potentially pave the way for a unified approach for 1-center also in related metrics, like Kendall's tau and Ulam. Moreover, studying coresets for these metrics may provide insights into the geometric structure of these metric spaces.

1.1 Problem setup

In the 1-center problem, also known as Minimum Enclosing Ball (MEB), the input is a set $P$ of points in a metric space $(X,\mathrm{dist})$, and the goal is to find a center point $c\in X$ that minimizes the objective function $\mathrm{cost}(c,P):=\max_{p\in P}\mathrm{dist}(p,c)$. Denote the optimal value by $\mathrm{opt}(P):=\min_{c\in X}\mathrm{cost}(c,P)$, and observe that it is monotone, i.e., $Q\subseteq P$ implies $\mathrm{opt}(Q)\le\mathrm{opt}(P)$. While we mostly consider the space $(\mathbb{R}^d,\ell_1)$, these definitions capture also other metrics, like the Kendall's tau distance, where $X$ is the set of permutations over $[d]$.
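To fix the notation in code as well, here is a minimal Python sketch (ours, not part of the paper) of the objective under the $\ell_1$ metric; the helper name cost_l1 is our own.

```python
import numpy as np

def cost_l1(c, P):
    """1-center objective: max over p in P of the l1 distance ||p - c||_1."""
    return max(np.abs(np.asarray(p) - np.asarray(c)).sum() for p in P)

P = [(1.0, 1.0), (-1.0, 1.0), (0.0, -1.0)]
print(cost_l1((0.0, 0.0), P))  # cost of the origin as a center: 2.0
```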

There are different (and sometimes inconsistent) definitions for coresets, which were usually developed when the desirable guarantees were deemed impossible or difficult to prove, leading to more relaxed guarantees. We face this same issue (especially for high dimension), and thus consider a sequence of definitions. For brevity, we present only the definitions most relevant to our work, and do not discuss possible generalizations (e.g., to k-center) or variants that appear in the literature. A common and very useful type of coreset is a strong coreset, which approximates the cost of every center, as follows.

Definition 1.1 (Strong Coreset).

A subset $Q\subseteq P$ is a strong ϵ-coreset for a 1-center instance $P\subseteq X$ if

$\forall c\in X,\qquad \mathrm{cost}(c,P)\le(1+\epsilon)\,\mathrm{cost}(c,Q).$ (1)

The guarantee in (1) is extremely effective in reducing (the task of solving) $P$ to $Q$, with a small loss in the objective value. In Euclidean spaces, strong coresets for 1-center are known, however their size is exponential in the dimension, namely $|Q|\le(1/\epsilon)^{O(d)}$ [3], and unfortunately this upper bound is existentially tight. (There is a folklore lower bound, which follows by considering an instance $P$ formed by an ϵ-net of the unit sphere $S^{d-1}$; every strong coreset must contain all of $P$ and thus have size $(1/\epsilon)^{\Omega(d)}$.) The upper bound actually extends to every metric space with doubling dimension $d$ [4, 10, 20], and therefore applies to $\mathbb{R}^d$ endowed with the $\ell_1$ norm (or every other norm).

It is thus natural to ask: Can the coreset size be improved specifically for the $\ell_1$ norm? If not, can weaker notions lead to polynomial (in $d$) or even dimension-independent size?

1.2 Our results

Weak coresets.

We turn our attention to a more relaxed notion, called weak coreset, which only approximates the cost of certain centers, namely, centers that may be found by solving the coreset instance. Trivially, every strong ϵ-coreset is also a weak ϵ-coreset.

Definition 1.2 (Weak Coreset).

A subset $Q\subseteq P$ is a weak ϵ-coreset for a 1-center instance $P\subseteq X$ if

$\forall c_Q\in\operatorname{argmin}_{c\in X}\mathrm{cost}(c,Q),\qquad \mathrm{cost}(c_Q,P)\le(1+\epsilon)\,\mathrm{cost}(c_Q,Q).$ (2)

Weak coresets were introduced in [8] specifically for 1-center in Euclidean space. It is well-known that in Euclidean space there is a unique optimal solution (the uniqueness follows from the fact that $\ell_2$ is strictly convex, whereas $\ell_1$ is convex but not strictly convex), which is why previous work often refers to $c_Q$ as the optimal center. The challenge then becomes how to extend this definition to other metric spaces, like the $\ell_1$ norm, that might have multiple optimal solutions. The typical use of a coreset $Q$ is to apply an algorithm on $Q$ and take the resulting center $c_Q$ as an approximate solution for the original instance $P$. Since a generic algorithm might pick any of the multiple optimal centers for $Q$, it is essential to have a guarantee that applies to every possible optimal solution $c_Q$ for $Q$, hence the universal quantifier (“for all”) in (2).

Remarkably, in Euclidean space, every instance $P$ admits a weak ϵ-coreset of size $1/\epsilon$, which is independent of the input size $n=|P|$ and of the Euclidean dimension $d$, and is in fact tight [7]. This result improved an earlier $O(1/\epsilon^2)$ bound from [8], which was also dimension independent. Can similar bounds be proved for $\ell_1$ metrics?

We demonstrate that, in sharp contrast to the $\ell_2$ setting, weak coresets in $\ell_1$ are dimension dependent and moreover grow exponentially with $d$, even for a fixed ϵ>0. At the same time, we show how to construct a strong 0-coreset of size $2^d$, which again contrasts with the $\ell_2$ setting (and doubling metric spaces), where coresets grow with $1/\epsilon$ and are thus not applicable for ϵ=0. It is well-known that a strong coreset is composable, i.e., the union of coresets is itself a coreset for the union of the original datasets, which is very useful for designing algorithms in the streaming and dynamic settings. Thus, our coreset construction can be used in these settings, particularly in low dimension.

Theorem 1.3.

Consider the 1-center problem in $\mathbb{R}^d$ under the $\ell_1$ norm.

  (a) Every instance $P\subset\mathbb{R}^d$ admits a strong 0-coreset of size $2^d$.

  (b) There exists $P\subset\mathbb{R}^d$, such that every weak ϵ-coreset for $\epsilon<\frac13$ must have size $2^{\Omega(d)}$.

We prove this theorem in Section 2. The upper bound leverages the unique geometry of $\ell_1$, where in effect only $2^d$ directions are important. The lower bound builds a hard instance and employs classical techniques from coding theory, particularly the Hamming bound. We remark that an unpublished result from [30, Theorem 5], for a related problem about a convex shape in the $\ell_2$ metric, seems to imply a weak ϵ-coreset of size $\mathrm{poly}(d/\epsilon)$ for our problem of 1-center in $\ell_1$ metrics. This would contradict our lower bound in Theorem 1.3, indicating that its statement is inaccurate or its proof is flawed; see also a similar discussion in [6].

Due to the exponential dependency on $d$ in the general case, it is important to identify cases with a smaller coreset size, say $\mathrm{poly}(d)$ or even dimension independent. Additionally, one may suspect that the reason for the huge disparity between $\ell_1$ and Euclidean space, which does admit a dimension-independent weak coreset [7, 8], is the fact that in $\ell_1$ the optimal center need not be unique. We thus study the special case of $\ell_1$ where the optimal center is unique, which may occur naturally in certain contexts, or be “enforced” by adding at most two points to the input dataset. We provide bounds that are exponentially smaller than in the general case, but still depend on $d$, which establishes that $\ell_1$ is inherently more complex than $\ell_2$.

Theorem 1.4.

Consider the 1-center problem in $\mathbb{R}^d$ under the $\ell_1$ norm, on instances that have a unique optimal center.

  (a) Every instance $P\subset\mathbb{R}^d$ with unique optimum admits a weak 0-coreset of size $2d$.

  (b) There exists $P\subset\mathbb{R}^d$ with unique optimum, such that every weak ϵ-coreset for fixed ϵ<1 must have size $\Omega(\log d)$.

We prove this theorem in Section 3. We further conjecture that the lower bound can be improved to $\Omega(d)$, and prove it in the special case ϵ=0 using the Hadamard code. The upper bound in our theorem employs tools from convex geometry, specifically the Steinitz theorem, which is relatively obscure, to prove existence non-constructively (i.e., without an efficient algorithm). For the lower bound, we consider the instance $P=\{\pm1\}^d$, which has a unique optimal center at the origin, and show that a coreset $Q$ that is too small must have an optimal center $c_Q$ in which most coordinates are $\pm1$; such a center is clearly a poor solution for $P$.

Value-preserving coreset.

To further reduce the coreset size, we consider an even less restrictive variant that preserves solely the objective value $\mathrm{opt}(P)$, without imposing any conditions on the optimal centers of the coreset. This concept, which we shall refer to as a value-preserving coreset, has been studied previously under the general term “coreset” [6, 14, 33], and it is most useful in applications that focus on measuring the similarity among points in the cluster, rather than identifying a center. It is easy to verify that every weak coreset is also a value-preserving coreset.

Definition 1.5 (Value-Preserving Coreset).

A subset $Q\subseteq P$ is a value-preserving ϵ-coreset for a 1-center instance $P\subseteq X$ if

$\mathrm{opt}(P)\le(1+\epsilon)\,\mathrm{opt}(Q).$ (3)

We show that there exist value-preserving ϵ-coresets of size $\tilde O(1/\epsilon^2)$, which is dimension independent, in contrast with Theorems 1.3 and 1.4. We thus establish a sharp separation between preserving an optimal solution (center point) and only preserving its value. This result is stated in the next theorem and proved in Section 4. It also answers a question posed in [6], which studies a generalization of 1-center, called the Minimum Enclosing Polytope problem, and asks for polytopes that admit a dimension-independent value-preserving coreset. (Prior to our result, which shows that the $\ell_1$ ball, also known as the cross-polytope, admits a dimension-independent value-preserving coreset, the only known example was the parallelotope, which trivially has a value-preserving 0-coreset with only 2 points.)

Theorem 1.6.

Consider the 1-center problem in $\mathbb{R}^d$ under the $\ell_1$ norm. Every instance $P\subset\mathbb{R}^d$ admits a value-preserving ϵ-coreset of size $O(\frac{1}{\epsilon^2}\log\frac{1}{\epsilon})$.

We prove this theorem in Section 4. Our algorithm is based on random sampling, a technique that was rarely used (if at all) for 1-center coresets, but is widely used for $k$-median and $k$-means coresets [15, 18, 22]. This is very natural, because sampling is an excellent fit for objectives like $k$-median that are formed by summation, but not for objectives like $k$-center that are formed by maximization, which are sensitive to missing even one term. It is thus not surprising that our sampling is conducted not on the 1-center problem but rather on its dual. More precisely, we formulate 1-center as a linear programming (LP) problem and write its dual problem, then solve this dual problem to obtain sampling probabilities for the input points. Crucially, the dual objective is formed by summation, and is thus conducive to sampling. Compared with the primal-dual framework, our algorithm is dual-only and it solves the dual LP explicitly, which is then related in the analysis to the primal via strong duality. Our sampling is similar to the randomized rounding commonly used in approximation algorithms, however its purpose here is to generate a coreset instead of a solution, and the analysis is also somewhat different.

Related metrics and applications.

We demonstrate that our results for weak coresets in $\ell_1$ (in Sections 2 and 3) apply also to other discrete metric spaces. Specifically, we provide in Section 5 examples that use the Kendall's tau and Jaccard metrics, drawing inspiration from the results obtained in the $\ell_1$ setting.

1.3 Related work

Another closely related aggregation problem is 1-median. This problem, unlike 1-center, can be solved easily in $\ell_1$ metrics, by simply taking in each coordinate the median value of the input points. Utilizing this structure, 1-median in $\ell_1$ metrics has a weak ϵ-coreset of size $O(\epsilon^{-2}\log(1/\epsilon))$, which is independent of the dimension [19], in sharp contrast to 1-center. The 1-median problem also admits a PTAS (approximation arbitrarily close to 1 in polynomial time) in many other metrics, including the Kendall's tau distance [25], the Jaccard distance [16], and a near-linear time PTAS in $\ell_2$ [17]. However, it is unclear if a PTAS exists for edit distance, or even for a special case called the Ulam distance, and the best known upper bound is a $(2-\rho)$-approximation for some fixed $\rho>0$ [11, 12].

2 Weak ϵ-coresets in $\ell_1$

2.1 Construction of a strong 0-coreset of size $2^d$

In this section we will prove the following theorem:

Theorem 2.1.

Every instance $P\subset\mathbb{R}^d$ admits a strong 0-coreset of size $2^d$.

Proof.

Let $\Sigma=\{-1,1\}^d$ denote the set of all sign vectors in $\mathbb{R}^d$. For each sign vector $\sigma\in\Sigma$, let $p_\sigma$ be a point in $P$ such that the inner product $p_\sigma\cdot\sigma$ is maximized. Define $Q_\Sigma=\{p_\sigma:\sigma\in\Sigma\}$; clearly $|Q_\Sigma|\le|\Sigma|=2^d$.

Let $c\in\mathbb{R}^d$. Since $Q_\Sigma$ is a subset of $P$, we have that $\mathrm{cost}(c,Q_\Sigma)\le\mathrm{cost}(c,P)$. Let $p$ be some point in $P$; then there exists a sign vector $\sigma\in\Sigma$ for which we can write the distance between $p$ and $c$ as:

$\|p-c\|_1=\sigma\cdot(p-c)=\sigma\cdot p-\sigma\cdot c.$

By definition of $Q_\Sigma$, there exists a point $p_\sigma\in Q_\Sigma$ satisfying $\sigma\cdot p_\sigma\ge\sigma\cdot p$. Thus,

$\|p-c\|_1=\sigma\cdot p-\sigma\cdot c\le\sigma\cdot p_\sigma-\sigma\cdot c=\sigma\cdot(p_\sigma-c)\le|\sigma\cdot(p_\sigma-c)|\le\|p_\sigma-c\|_1\le\mathrm{cost}(c,Q_\Sigma).$

That is,

$\mathrm{cost}(c,P)=\max_{p\in P}\|p-c\|_1\le\mathrm{cost}(c,Q_\Sigma).$
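The construction in this proof translates directly into code; the following Python sketch (ours, not from the paper) keeps, for each sign vector $\sigma$, one point maximizing $\sigma\cdot p$. It runs in time $O(2^d\cdot nd)$, so it is practical only in low dimension.

```python
import itertools
import numpy as np

def strong_zero_coreset(P):
    """For each sign vector sigma in {-1,1}^d, keep one point of P that
    maximizes the inner product sigma . p; at most 2^d points survive."""
    P = np.asarray(P, dtype=float)
    d = P.shape[1]
    keep = set()
    for sigma in itertools.product((-1.0, 1.0), repeat=d):
        keep.add(int(np.argmax(P @ np.asarray(sigma))))
    return P[sorted(keep)]
```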

2.2 Lower bound of $2^{\Omega(d)}$ for weak ϵ-coresets

In this section, we present a set $P\subset\mathbb{R}^d$ of size $2^{d(1-o(1))}$ that has only a trivial weak 0-coreset, namely, itself. Furthermore, weak ϵ-coresets of $P$, for $\epsilon\in(0,\frac13)$, must have size $2^{\Omega(d)}$.

Theorem 2.2.

There exists a set $P\subset\mathbb{R}^d$ of size $2^{d(1-o(1))}$, such that

  • if $Q\subseteq P$ is a weak 0-coreset of $P$, then $Q=P$; and

  • for all $\epsilon\in[0,\frac13)$, if $Q\subseteq P$ is a weak ϵ-coreset of $P$, then $|Q|\ge2^{\Omega(d)}$.

Proof.

We may assume without loss of generality that $d$ is even, as otherwise we can take the construction for $d-1$ and append a $0$ coordinate to all the points (vectors).

Denote by $\mathbf 1$ the vector $(1,1,\ldots,1)$, and let $B\subset\{\pm1\}^d$ be the set of points that are balanced, in the sense that they have an equal number of $1$ and $-1$ coordinates, i.e., $B=\{b\in\{\pm1\}^d:\sum_{i=1}^d b_i=0\}$. Fix $0<\delta\le\frac13$, and let $P=\{\mathbf 1,-\mathbf 1\}\cup(1-\delta)B$, where $(1-\delta)B$ denotes the set of points in $B$ scaled by factor $1-\delta$. Thus, $|P|=2+\binom{d}{d/2}=\Theta(2^d/\sqrt d)$. One can verify that the origin $\mathbf 0\in\mathbb{R}^d$ is an optimal center, with value $d$. However it is not unique; to see this, observe that the antipodal pair $\{\mathbf 1,-\mathbf 1\}$ establishes the optimal value $\mathrm{opt}(P)=d$, thus an optimal center must have the same distance $d$ to each of them, which holds for points $x\in[-1,1]^d$ that lie on the hyperplane $\sum_{i=1}^d x_i=0$. The set $(1-\delta)B$ restricts the optimal solutions (inside that hyperplane) in a delicate manner, as we shall see.

We first show that $P$ is the only weak 0-coreset of $P$. Let $Q\subseteq P$ be a weak 0-coreset, and notice that $\mathrm{opt}(Q)=\mathrm{opt}(P)=d$, implying that $Q$ must contain the antipodal pair $\mathbf 1$ and $-\mathbf 1$. Assume towards contradiction that $Q$ is a proper subset of $P$. Then there exists $(1-\delta)\tilde b\in P\setminus Q$ for some $\tilde b\in B$. Let $\eta=\frac1d$ and consider $c=-(1+\eta)\delta\tilde b$; it cannot be an optimal center of $P$, because

$\big\|{-(1+\eta)\delta\tilde b}-(1-\delta)\tilde b\big\|_1=\big\|{-(1+\eta\delta)\tilde b}\big\|_1=(1+\eta\delta)d>d.$

We next show that this point $c=-(1+\eta)\delta\tilde b$ is an optimal center of $Q$. Indeed, observe that $\delta\le\frac13$ implies $(1+\eta)\delta\le1-\delta$, and thus the sign of $\pm(1-\delta)$ will not change if we add/subtract $(1+\eta)\delta$ to it. Let $\mathrm{sgn}(x)\in\{-1,0,1\}$ denote the sign function. Now consider $a\in Q$; its distance to $-(1+\eta)\delta\tilde b$ is:

$\big\|a+(1+\eta)\delta\tilde b\big\|_1=\sum_{i=1}^d\big|a_i+(1+\eta)\delta\tilde b_i\big|=\sum_{i=1}^d\mathrm{sgn}(a_i)\big(a_i+(1+\eta)\delta\tilde b_i\big)=\|a\|_1+(1+\eta)\delta\sum_{i=1}^d\mathrm{sgn}(a_i)\tilde b_i.$ (4)

In the case $a\in\{\mathbf 1,-\mathbf 1\}$, the above equals $d$. In the remaining case $a\in Q\setminus\{\mathbf 1,-\mathbf 1\}$, the choice of $\tilde b$ implies that $\sum_{i=1}^d\mathrm{sgn}(a_i)\tilde b_i\le d-2$, and thus

$\big\|a+(1+\eta)\delta\tilde b\big\|_1\le\|a\|_1+(1+\eta)\delta(d-2)\le(1-\delta)d+(1+\eta)\delta(d-2)<d.$

We have thus confirmed that $-(1+\eta)\delta\tilde b$ is an optimal center of $Q$, but not for $P$. This contradicts our assumption that $Q$ is a weak 0-coreset, and completes the proof of the first item.

To prove the second item, let $P\subset\mathbb{R}^d$ be as before but now for $\delta=\frac13$, and consider a weak ϵ-coreset $Q$ of $P$, for some $0\le\epsilon<\delta=\frac13$. Notice that $Q$ must contain both $\mathbf 1$ and $-\mathbf 1$: otherwise, if $\mathbf 1,-\mathbf 1\notin Q$, consider the origin $\mathbf 0\in\mathbb{R}^d$ as a center, and if $Q$ contains just one of $\{\mathbf 1,-\mathbf 1\}$, say $\mathbf 1$, consider $\delta\mathbf 1$ as a center. In both cases, we obtain that $\mathrm{opt}(Q)\le(1-\delta)d<\frac{1}{1+\epsilon}\,\mathrm{opt}(P)$, in contradiction to the weak ϵ-coreset guarantee.

Denote by $\psi(Q)$ the minimum, over points $b\in B$, of the maximum Hamming distance from $b$ to a point of $Q$, that is, $\psi(Q)=\min_{b\in B}\max_{a\in Q}\mathrm{Hamm}(a,b)$. The next proposition follows by a standard counting bound.

Proposition 2.3.

If $|Q|\le\frac{1}{2\sqrt d}\,2^{0.18d}$, then $\psi(Q)<\frac{3d}{4}$.

Proof.

Denote by $C_{\ge 3d/4}(a)$ the set of all points in $\{\pm1\}^d$ whose Hamming distance from $a\in\{\pm1\}^d$ is at least $\frac{3d}{4}$. Then

$|C_{\ge 3d/4}(a)|\le\sum_{0\le j\le d/4}\binom{d}{j}\le2^{dH(\frac14)},$

where the last inequality is the known entropy bound for binomial coefficients and $H(p)=-p\log p-(1-p)\log(1-p)$, hence $H(\frac14)\approx0.811$. Taking a union over all $a\in Q$,

$\Big|\bigcup_{a\in Q}C_{\ge 3d/4}(a)\Big|\le\sum_{a\in Q}|C_{\ge 3d/4}(a)|\le2^{dH(\frac14)}|Q|<\binom{d}{d/2}=|B|,$

where in the last inequality we used the known bound $\frac{2^d}{2\sqrt d}\le\binom{d}{d/2}$. Hence, there exists some point in $B$ for which the maximum Hamming distance to points in $Q$ is smaller than $\frac{3d}{4}$.

Assume towards contradiction that $|Q|\le\frac{1}{2\sqrt d}\,2^{0.18d}$. Then by Proposition 2.3, there is a point $\tilde b\in B$ for which the maximum Hamming distance to points in $Q$ is smaller than $\frac{3d}{4}$. We next show that the point $2\delta\tilde b$ is an optimal center for $Q$. Similarly to Equation (4), since $2\delta\le1-\delta$, for every $a\in Q$ we can write:

$\|a-2\delta\tilde b\|_1=\sum_{i=1}^d|a_i-2\delta\tilde b_i|=\|a\|_1-2\delta\sum_{i=1}^d\mathrm{sgn}(a_i)\tilde b_i,$

where:

$\sum_{i=1}^d\mathrm{sgn}(a_i)\tilde b_i=\sum_{i:\,\mathrm{sgn}(a_i)=\mathrm{sgn}(\tilde b_i)}1-\sum_{i:\,\mathrm{sgn}(a_i)\ne\mathrm{sgn}(\tilde b_i)}1=d-2\,\mathrm{Hamm}(a,\tilde b),$

and thus:

$\|a-2\delta\tilde b\|_1=\|a\|_1-2\delta d+4\delta\,\mathrm{Hamm}(a,\tilde b).$

In the case $a\in\{\mathbf 1,-\mathbf 1\}$, since $\tilde b\in B$ we have $\|a-2\delta\tilde b\|_1=\|a\|_1=d$. Otherwise $a\in(1-\delta)B$, and $\|a-2\delta\tilde b\|_1<\|a\|_1-2\delta d+4\delta\cdot\frac{3d}{4}=(1-\delta)d+\delta d=d$. In both cases, the distance from $a\in Q$ to $2\delta\tilde b$ is at most $d=\mathrm{opt}(Q)$, with equality for some $a\in Q$. Thus, $2\delta\tilde b$ is an optimal center for $Q$. However, $\mathrm{cost}(2\delta\tilde b,P)\ge\|2\delta\tilde b+(1-\delta)\tilde b\|_1=(1+\delta)d>(1+\epsilon)\,\mathrm{opt}(P)$. This contradicts that $Q$ is a weak ϵ-coreset of $P$, and completes the proof of Theorem 2.2.

3 Weak ϵ-coresets in $\ell_1$ for inputs with a unique solution

In this section we consider the special case where the input $P$ has a unique optimal center. We show that even in this special case, the size of the coreset depends on the dimension, and provide a lower bound of $\Omega(\log d)$ on the size of every ϵ-coreset for any fixed ϵ<1. This restriction on ϵ is necessary, as for larger values of ϵ, a naive coreset construction (based on the Gonzalez algorithm for $k$-center [23]) with only 2 points achieves a 3-approximation, as sketched below: pick an arbitrary $p\in P$, then a point $p'\in P$ that is furthest from $p$, and take $Q=\{p,p'\}$ as the coreset. Note that $\mathrm{opt}(Q)=\frac12\|p-p'\|_1$, and $\frac12(p+p')$ is a possible optimal center. For every optimal center $c$ for $Q$, we can write $\mathrm{cost}(c,P)\le\|c-p\|_1+\mathrm{cost}(p,P)\le1.5\,\|p-p'\|_1=3\,\mathrm{opt}(Q)\le3\,\mathrm{opt}(P)$. We also prove that there always exists a weak 0-coreset of size at most $2d$. This is in stark contrast to Section 2, where we show that in the general case of multiple solutions, 0-coresets might require size exponential in $d$. Lastly, we present a set such that every weak 0-coreset of this set is of size $2d$, hence the upper bound of $2d$ is tight for the case ϵ=0.
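The naive 2-point construction just described is a few lines of Python (a sketch of ours; the helper name is our own):

```python
import numpy as np

def two_point_coreset(P):
    """Gonzalez-style 2-point coreset: an arbitrary point p and a point p'
    of P farthest from p (in l1); any optimal center of {p, p'} is a
    3-approximate center for P."""
    P = np.asarray(P, dtype=float)
    p = P[0]
    p_far = P[np.argmax(np.abs(P - p).sum(axis=1))]
    return p, p_far  # e.g. (p + p_far) / 2 is one optimal center of the pair
```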

3.1 Lower bound of $\Omega(\log d)$ for weak ϵ-coresets

Theorem 3.1.

Fix ϵ<1 and consider the set of points $P=\{\pm1\}^d\subset\mathbb{R}^d$. Then $P$ has a unique optimal center, and every weak ϵ-coreset of $P$ must have $\Omega(\log d)$ points.

Proof.

It is easy to see that the origin $\mathbf 0\in\mathbb{R}^d$ is the unique optimal center, with value $d$, as the input contains a point from every face of the $\ell_1$ ball centered at $\mathbf 0$. Consider a weak ϵ-coreset $Q\subseteq P$ of size $|Q|\le\frac12\log d$. Every point $p\in Q$ induces a partition $\Pi_p$ of $[d]$ into at most two parts, by splitting the coordinates of $p$ into the positive and negative ones. Let $\Pi$ be the common refinement of all these partitions (i.e., placing $i,j\in[d]$ in the same part of $\Pi$ if and only if for every $p\in Q$ they are in the same part of $\Pi_p$). Since each $\Pi_p$ has at most two parts, $\Pi$ has at most $2^{|Q|}\le\sqrt d$ parts. Observe that for every point $p\in Q$, coordinates in the same part of $\Pi$ have the same sign.

Consider an optimal center $c$ for $Q$. The idea is to modify it iteratively, without increasing its cost for $Q$, so as to have more coordinates take values in $\{-1,1\}$. Each iteration takes two indices $i,j$ in the same part of $\Pi$, whose coordinates in $c$ are both not in $\{-1,1\}$, say $-1<c_i\le c_j<1$; the iterations stop when no such indices exist. Now add $\delta$ to $c_j$ and $-\delta$ to $c_i$, where $\delta>0$ is as large as possible while keeping both new values in the range $[-1,1]$, that is, $\delta=\min\{1-c_j,\ c_i+1\}$. It is easy to see that after this modification, at least one of $c_i$ and $c_j$ is in $\{\pm1\}$, and at the same time the distance to every $p\in Q$ does not change (as $p_i=p_j$ for every $p\in Q$, and $c_i+c_j$ is preserved). The iterations stop at a center $\bar c$, where each part of $\Pi$ contains at most one coordinate not in $\{\pm1\}$, so in total at most $|\Pi|\le2^{|Q|}\le\sqrt d$ coordinates are not in $\{\pm1\}$.

Finally, we show that $\bar c$, which is an optimal center for $Q$, has too large a cost for $P$. Consider the point $p\in P$ that is farthest from $\bar c$, i.e., where each coordinate of $\bar c$ is rounded to the farther value among $\{\pm1\}$; in particular $p_i=-\bar c_i$ whenever $\bar c_i\in\{\pm1\}$. Each of the at least $d-|\Pi|$ coordinates with $\bar c_i\in\{\pm1\}$ contributes $2$ to the distance, hence $\mathrm{cost}(\bar c,P)\ge2d-|\Pi|\ge2d-\sqrt d>(1+\epsilon)\,\mathrm{opt}(P)$ for sufficiently large $d$.

We remark that the proof works also for the discrete space $\{-1,0,+1\}^d$ (endowed with the $\ell_1$ norm), which has more limited choices for the center point, e.g., for $c$ above. Moreover, note that the distance of the modified optimal center $\bar c$ from $\mathbf 0$ is at least $d-2^{|Q|}$.
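The common refinement $\Pi$ used in this proof is straightforward to compute: group coordinates by their sign pattern across the coreset. A small Python sketch (ours):

```python
import numpy as np

def common_refinement(Q):
    """Partition the coordinates [d] so that i, j share a part iff every
    q in Q assigns them the same sign; at most 2^|Q| parts result."""
    signs = np.sign(np.asarray(Q, dtype=float))
    parts = {}
    for i in range(signs.shape[1]):
        parts.setdefault(tuple(signs[:, i]), []).append(i)
    return list(parts.values())
```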

3.2 Upper bound of $2d$ for weak 0-coresets

We will show that if P has a unique solution, then P has a weak 0-coreset of size at most 2d.

Two standard notions from the domain of convex geometry are the convex hull and the interior. The convex hull of $P$, denoted $\mathrm{conv}(P)$, consists of all convex combinations of points in $P$; that is, every point in $\mathrm{conv}(P)$ can be expressed as a weighted sum of points from $P$ with non-negative weights summing to 1. The interior, denoted $\mathrm{int}(\cdot)$, refers to the set of points that lie strictly inside the convex hull, i.e., points that can be surrounded by a small ball completely contained within $\mathrm{conv}(P)$.

An important tool we use is Steinitz’s theorem [24].

Theorem 3.2 (Steinitz’s theorem).

Let $\mathbf 0\in\mathrm{int}(\mathrm{conv}(S))$ for some $S\subseteq\mathbb{R}^d$. Then there exists $R\subseteq S$ of size at most $2d$, such that $\mathbf 0\in\mathrm{int}(\mathrm{conv}(R))$.

Next we introduce the notion of a complete set; we will then provide an alternative formulation of Steinitz's theorem that will be useful in our application.

Definition 3.3.

A set $S\subseteq\mathbb{R}^d$ is complete if for every $v\in\mathbb{R}^d\setminus\{\mathbf 0\}$ there exists $s\in S$ with $v\cdot s>0$.

Lemma 3.4.

$\mathbf 0\in\mathrm{int}(\mathrm{conv}(S))$ if and only if $S$ is complete.

Proof.

($\Rightarrow$): Write $\mathbf 0$ as a convex combination, $\mathbf 0=\sum_i\lambda_i s_i$ where $\lambda_i\ge0$ and $\sum_i\lambda_i=1$. Then, for any $v\ne\mathbf 0$,

$0=v\cdot\mathbf 0=v\cdot\Big(\sum_i\lambda_i s_i\Big)=\sum_i\lambda_i(v\cdot s_i).$

If $v\cdot s\le0$ for all $s\in S$, then $\mathrm{conv}(S)$ lies in the halfspace $\{x:v\cdot x\le0\}$, whose boundary hyperplane contains $\mathbf 0$, and then $\mathbf 0$ is not in the interior of $\mathrm{conv}(S)$.

($\Leftarrow$): If $\mathbf 0\notin\mathrm{int}(\mathrm{conv}(S))$, then there exists a hyperplane containing $\mathbf 0$ such that $S$ lies in one of its closed halfspaces. Since $\mathbf 0$ is a point on the hyperplane, we can write the hyperplane equation as $v\cdot x=0$ for some $v\in\mathbb{R}^d\setminus\{\mathbf 0\}$, oriented so that $v\cdot s\le0$ for every $s\in S$; hence $S$ is not complete.

This leads to the following alternative statement of Steinitz’s theorem.

Corollary 3.5.

If $S\subseteq\mathbb{R}^d$ is complete, then there exists $R\subseteq S$ of size at most $2d$, such that $R$ is complete.

W.l.o.g., assume that $\mathbf 0$ is the unique solution of $P$ and let $\mathrm{opt}(P)=\mathrm{cost}(\mathbf 0,P)=r$. Since $\mathbf 0$ is unique, we can further assume that $\|\mathbf 0-p\|_1=r$ for every $p\in P$; otherwise, any points that do not satisfy this can be removed without affecting the optimal solution or its cost. Given a pair $a=(\sigma,p)\in\{-1,1\}^d\times P$, we denote by $h_a$ the hyperplane $\sigma\cdot(x-p)=r$ and by $a^+$ the halfspace $\sigma\cdot(x-p)\le r$. In general, $x\in\mathbb{R}^d$ is a feasible 1-center of $P$ (with radius $r$) if for every $p\in P$, $\|x-p\|_1\le r$. Assuming $r$ is known, we can write the LP formulation of the 1-center problem with $d$ variables $x=(x_1,\ldots,x_d)$ and the $2^d|P|$ constraints $\sigma\cdot(x-p)\le r$ for every $\sigma\in\{-1,1\}^d$ and $p\in P$. That is, every constraint defines a hyperplane $h_{(\sigma,p)}$. We further remove from the set of constraints $\{-1,1\}^d\times P$ the pairs $(\sigma,p)$ for which $\mathbf 0\notin h_{(\sigma,p)}$, and denote the updated set of constraints by $A=S\times P$. Note that no points from $P$ are removed in this process, as each has distance $r$ from $\mathbf 0$. Since $\mathbf 0$ is the unique optimal solution of $P$, this also does not change the set of feasible solutions, and $\mathbf 0$ is the unique point in the intersection of the halfspaces $a^+$ over all $a\in A$.

Proposition 3.6.

Let $A=S\times P$ be a non-empty set of constraints such that $\mathbf 0\in h_a$ for every $a\in A$. Then $\bigcap_{a\in A}a^+=\{\mathbf 0\}$ if and only if $S$ is complete.

Proof.

($\Rightarrow$): Let $v\in\mathbb{R}^d\setminus\{\mathbf 0\}$; then there exists $a=(\sigma,p)\in A$ such that $v\notin a^+$. That is, $\sigma\cdot(v-p)>r$, and since $\mathbf 0\in h_a$, $\sigma\cdot(\mathbf 0-p)=r$; it follows that $\sigma\cdot v>0$.

($\Leftarrow$): $A$ is non-empty, so clearly $\mathbf 0\in\bigcap_{a\in A}a^+$. Let $v\ne\mathbf 0$; then there exists $a=(\sigma,p)\in A$ such that $\sigma\cdot v>0$. Since $\mathbf 0\in h_a$ we have $\sigma\cdot(\mathbf 0-p)=r$, and it follows that $\sigma\cdot(v-p)>r$, that is, $v\notin a^+$.

We are now ready to prove our main theorem:

Theorem 3.7.

Every instance $P\subset\mathbb{R}^d$ with unique optimum admits a weak 0-coreset of size $2d$.

Proof.

By Proposition 3.6 the set $S$ is complete, and by Corollary 3.5 there exists $R\subseteq S$ such that $R$ is complete and of size at most $2d$. For each $\sigma\in R$ we select a single point $p\in P$ such that the pair $(\sigma,p)$ is a constraint in $A$. We thus obtain a set of at most $2d$ distinct points from $P$, which we denote by $Q$, and we denote by $A_R$ this refined set of constraints. Since $R$ is complete, the second direction of Proposition 3.6 gives $\bigcap_{a\in A_R}a^+=\{\mathbf 0\}$. Every optimal center of $Q$ satisfies all the constraints in $A_R$, hence it must equal $\mathbf 0$, which is also optimal for $P$; this means that $Q$ is a weak 0-coreset of size at most $2d$.

We conclude this section by providing an example of a set P of size 2d, where P is the sole weak 0-coreset of itself, demonstrating that the upper bound of 2d cannot be further improved. The set P is composed of the row vectors of the Hadamard matrix and their negations. The proof is provided in Appendix A.1.

4 Value-preserving coresets

In this section we provide an algorithm for constructing a dimension-independent value-preserving ϵ-coreset for 1-center. In addition, we provide a complete characterization for the special case ϵ=0, by showing a 0-coreset of size $d+1$, and moreover that this bound is tight; see Section 4.2.

Theorem 4.1.

For every instance $P\subset\mathbb{R}^d$ and $0<\epsilon<1$, there exists a subset $S\subseteq P$ of $O(\frac{1}{\epsilon^2}\log\frac{1}{\epsilon})$ points such that

$\mathrm{opt}(S)\ge(1-\epsilon)\,\mathrm{opt}(P).$

Our proof of Theorem 4.1 is based on the following LP formulation of 1-center and its dual. Unlike many primal-dual algorithms, our algorithm needs to explicitly solve the dual LP and use its optimal solution to guide the coreset construction.

LP formulation of 1-center.

Observe that 1-center on input $P\subset\mathbb{R}^d$ is equivalent to the following mathematical program with variables $r$ and $x=(x_1,\ldots,x_d)$.

minimize  $r$
subject to  $\sum_{i\in[d]}|p_i-x_i|\le r \qquad \forall p\in P.$

This is formally not an LP due to the absolute values in the constraints, but each such constraint can obviously be expanded into $2^d$ linear constraints, which leads to the following equivalent LP formulation.

minimize  $r$
subject to  $\sum_{i\in[d]}\sigma_i(p_i-x_i)\le r \qquad \forall\sigma\in\{\pm1\}^d,\ p\in P$
      $r\in\mathbb{R},\ x\in\mathbb{R}^d.$
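For small $d$, this exponential-size LP can be handed to an off-the-shelf solver as is; the following Python sketch (ours, for illustration only; the helper name solve_1center_l1 is our own) enumerates all $2^d$ sign vectors and solves the LP with scipy's linprog.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def solve_1center_l1(P):
    """Solve the 1-center LP with variables (x, r) and the 2^d * |P|
    constraints sigma . (p - x) <= r; brute force, small d only."""
    P = np.asarray(P, dtype=float)
    n, d = P.shape
    c = np.zeros(d + 1)
    c[-1] = 1.0                                # objective: minimize r
    A_ub, b_ub = [], []
    for sigma in itertools.product((-1.0, 1.0), repeat=d):
        s = np.asarray(sigma)
        for p in P:                            # sigma.p - sigma.x <= r
            A_ub.append(np.concatenate([-s, [-1.0]]))
            b_ub.append(-s @ p)
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * (d + 1), method="highs")
    return res.x[:d], res.x[-1]                # optimal center and radius
```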

We derive the dual of this LP, with $2^dn$ variables $\{u_{\sigma,p}\}_{\sigma,p}$, as follows. One can think of each pair $(\sigma,p)$ as generating a “new” point $(\sigma_1p_1,\ldots,\sigma_dp_d)\in\mathbb{R}^d$, that is obtained from $p\in P$ by flipping signs.

maximize  $\sum_{\sigma\in\{\pm1\}^d}\sum_{p\in P}\sum_{i\in[d]}\sigma_ip_i\,u_{\sigma,p}$  (DLP)
subject to  $\sum_{\sigma\in\{\pm1\}^d}\sum_{p\in P}\sigma_i\,u_{\sigma,p}=0 \qquad \forall i\in[d]$
      $\sum_{\sigma\in\{\pm1\}^d}\sum_{p\in P}u_{\sigma,p}=1$
      $u_{\sigma,p}\ge0 \qquad \forall\sigma\in\{\pm1\}^d,\ p\in P.$

Coreset Algorithm.

Our coreset construction, presented in Algorithm 1, builds the coreset by sampling, where the probabilities come from an optimal solution to (DLP). Technically, our sampling step is similar to the randomized rounding that is often used in approximation algorithms, however we use it here to find a coreset rather than an approximate solution. The analysis of this algorithm appears in Section 4.1. We slightly abuse terminology and refer to $\hat W$ as a multiset, but formally it is always a sequence of $m$ pairs, namely $\hat W:=((\hat\sigma^1,p^1),\ldots,(\hat\sigma^m,p^m))$; hence, summing over all $(\hat\sigma,p)\in\hat W$ actually means summation over $j\in[m]$.

Algorithm 1 Value-preserving ϵ-coreset.
1: shift $P$ such that $\sum_{p\in P}p=\mathbf 0$, and solve (DLP) on $P$ to obtain an optimal solution $u^*$
2: let $\hat W$ be a multiset of $m:=O(\frac{1}{\epsilon^2}\log\frac{1}{\epsilon})$ i.i.d. samples from the distribution $u^*$   // $m$ is an even number; $u^*$ is viewed as a distribution over $\{\pm1\}^d\times P$
3: return $S:=\{p:(\sigma,p)\in\hat W\}$
 Remark 4.2.

Algorithm 1 can be implemented in $\mathrm{poly}(nd)$ time. The most expensive step is clearly to solve the dual LP. Since it has only $d+1$ equality constraints, one can solve it while using only $\mathrm{poly}(d)$ non-zero variables. Alternatively, one can first solve the primal LP using an ellipsoid algorithm, to find a subset of $\mathrm{poly}(n)$ constraints that has the same optimal value, and then solve the dual of this smaller primal LP. Moreover, inspecting the analysis, particularly (6), one can verify that it suffices to $(1+\epsilon)$-approximate the dual LP.
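For intuition, the whole pipeline can be mimicked on tiny instances; the following brute-force Python sketch (ours, not the paper's implementation; the function name value_preserving_coreset is our own) builds (DLP) explicitly, solves it with scipy's linprog, and samples $m$ pairs from the optimal dual distribution. It is exponential in $d$ and meant only for illustration.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def value_preserving_coreset(P, eps, rng=np.random.default_rng(0)):
    """Brute-force sketch of Algorithm 1: solve (DLP) explicitly, then
    sample m pairs (sigma, p) from the optimal dual distribution."""
    P = np.asarray(P, dtype=float)
    P = P - P.mean(axis=0)                       # shift so that sum_p p = 0
    n, d = P.shape
    sigmas = np.array(list(itertools.product((-1.0, 1.0), repeat=d)))
    # one dual variable u_{sigma,p} per (sigma, p) pair; maximize => negate
    obj = np.array([-s @ p for s in sigmas for p in P])
    A_eq = np.zeros((d + 1, len(obj)))
    for k, (s, _) in enumerate((s, p) for s in sigmas for p in P):
        A_eq[:d, k] = s                          # sum_i sigma_i u = 0
    A_eq[d, :] = 1.0                             # sum u = 1
    b_eq = np.zeros(d + 1)
    b_eq[d] = 1.0
    u = linprog(obj, A_eq=A_eq, b_eq=b_eq, method="highs").x
    u = np.maximum(u, 0)                         # clean numerical noise
    m = 2 * int(np.ceil(np.log(1 / eps) / eps**2))   # even sample size
    idx = rng.choice(len(u), size=m, p=u / u.sum())
    return P[np.unique(idx % n)]                 # the sampled (shifted) points
```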

4.1 Proof of Theorem 4.1: Analysis of Algorithm 1

We shall prove that the algorithm's output $S$ satisfies $\Pr[\mathrm{opt}(S)\ge(1-\epsilon)\,\mathrm{opt}(P)]\ge0.8$. The plan is to build a dual solution for the instance $S$, and show that with high probability, this dual solution is feasible for (DLP) and its objective is at least $(1-\epsilon)$ times the optimal value (of the same dual LP) for $P$. This would imply that $\mathrm{DLP}(S)\ge(1-\epsilon)\,\mathrm{DLP}(P)$, where $\mathrm{DLP}(P)$ denotes the optimal value of (DLP) on input $P$, and by strong LP duality this is equivalent to $\mathrm{opt}(S)\ge(1-\epsilon)\,\mathrm{opt}(P)$, proving the theorem.

A lower bound on $\mathrm{opt}(P)$.

Recall that Algorithm 1 shifts $P$ so that $\sum_{p\in P}p=\mathbf 0$. This does not change the optimal value, hence our analysis simply assumes that the input $P$ already satisfies $\sum_{p\in P}p=\mathbf 0$, avoiding cumbersome notation for before and after the shift. This property of $P$ yields the following immediate lower bound on $\mathrm{opt}(P)$.

Lemma 4.3.

If $P\subset\mathbb{R}^d$ satisfies $\sum_{p\in P}p=\mathbf 0$, then $\mathrm{opt}(P)\ge\max\{\|p\|_1/2:p\in P\}$.

Proof.

For every point $p\in P$,

$\|p\|_1=\Big\|\frac{1}{|P|}\sum_{q\in P}(p-q)\Big\|_1\le\frac{1}{|P|}\sum_{q\in P}\|p-q\|_1\le2\,\mathrm{opt}(P).$

Dual solution induced by $\hat W$.

We now wish to use $\hat W$ from Algorithm 1 to construct a solution for the dual LP. Informally, the idea is to construct a solution $u^{\hat W}$ (one can think of it as a vector with $2^dn$ coordinates) that is the average of $m$ sparse solutions: each has a single non-zero variable, whose value is $1$ and which is drawn at random from the probability distribution defined by $u^*$ (think of $u^*$ as a vector with $2^dn$ coordinates that sum to $1$), using precisely the random draws made in Algorithm 1. Next, we formalize this construction in a slightly more general form.

Given a multiset $Z\subseteq\{\pm1\}^d\times P$ of size $m$, such as $\hat W$, we define the dual solution induced by $Z$, denoted $u^Z$, as follows. Take each pair $(\sigma,p)\in Z$ and build a dual solution where the variable corresponding to this pair is set to $u_{\sigma,p}=1$ and the other $2^dn-1$ variables are set to zero, and then average all these $m$ solutions. Formally, $u^Z$ is given by

$u^Z_{\sigma,p}:=\frac1m\sum_{(\sigma',p')\in Z}\mathbb{1}\{\sigma'=\sigma,\ p'=p\}.$

The following fact rewrites certain linear forms over $u^Z$, which appear in (DLP), in terms of the pairs $(\sigma,p)\in Z$.

Fact 4.4.

Let $u^Z$ be induced by some $Z\subseteq\{\pm1\}^d\times P$ as above. Then for every $i\in[d]$,

$\sum_{\sigma\in\{\pm1\}^d}\sum_{p\in P}\sigma_i\,u^Z_{\sigma,p}=\frac1m\sum_{(\sigma,p)\in Z}\sigma_i,\qquad \sum_{\sigma\in\{\pm1\}^d}\sum_{p\in P}\sigma_ip_i\,u^Z_{\sigma,p}=\frac1m\sum_{(\sigma,p)\in Z}\sigma_ip_i.$

Initial dual solution.

Before building a feasible dual solution for (DLP) on $S$, we first consider an initial dual solution $\hat u:=u^{\hat W}$, which is just the solution induced by the multiset $\hat W$ from Algorithm 1. We shall see below that $\hat u$ is feasible in an expected sense, and that turning the expectation into a high-probability guarantee introduces an additive error in the constraints (and in the objective). Nonetheless, this is still useful, as we will later “fix” this solution into one that does satisfy the constraints (without additive error).

Let us examine this initial random solution $\hat u$. It always satisfies the second constraint of (DLP) by construction (recall that $\hat u$ is the average of $m$ sparse solutions), i.e.,

$\sum_{\sigma\in\{\pm1\}^d}\sum_{p\in P}\hat u_{\sigma,p}=m\cdot\frac1m=1.$

By linearity of expectation, $\hat u$ satisfies the first constraint in expectation (the middle term uses Fact 4.4 to provide an alternative formulation),

$\forall i\in[d],\qquad \mathbb{E}\Big[\sum_{\sigma\in\{\pm1\}^d}\sum_{p\in P}\sigma_i\,\hat u_{\sigma,p}\Big]=\mathbb{E}\Big[\frac1m\sum_{(\sigma,p)\in\hat W}\sigma_i\Big]=\frac1m\cdot m\sum_{\sigma\in\{\pm1\}^d}\sum_{p\in P}\sigma_i\,u^*_{\sigma,p}=0.$ (5)

And again by linearity of expectation, the expected objective of $\hat u$ equals that of the optimal solution $u^*$, because

$\mathbb{E}\Big[\sum_{\sigma\in\{\pm1\}^d}\sum_{p\in P}\sum_{i\in[d]}\sigma_ip_i\,\hat u_{\sigma,p}\Big]=\mathbb{E}\Big[\frac1m\sum_{i\in[d]}\sum_{(\sigma,p)\in\hat W}\sigma_ip_i\Big]=\mathrm{opt}(P).$ (6)

Almost feasibility.

We bound the deviation of $\hat u$ from satisfying the first constraint, as follows. Consider $i\in[d]$ and let $\mathcal E_i$ be the event that $\big|\sum_{\sigma\in\{\pm1\}^d}\sum_{p\in P}\sigma_i\,\hat u_{\sigma,p}\big|>\epsilon$, which by Fact 4.4 can also be written as $\big|\frac1m\sum_{(\sigma,p)\in\hat W}\sigma_i\big|>\epsilon$. Informally, we want to bound the probability of this event, because when it does not occur, the constraint is “almost satisfied”. By applying Hoeffding's inequality, where we use (5) for the expectation and the fact that $|\sigma_i|\le1$, we get (by our choice of $m$),

$\Pr[\mathcal E_i]=\Pr\Big[\Big|\frac1m\sum_{(\sigma,p)\in\hat W}\sigma_i\Big|>\epsilon\Big]\le2e^{-\Omega(\epsilon^2m)}\le\epsilon.$ (7)

A feasible solution $\bar u$.

We next use $\hat W$ to construct a new multiset $W$ whose induced dual $u^W$ is feasible, and in particular satisfies the first constraint. We will then take our final dual solution to be $\bar u:=u^W$, and it will remain to bound its objective value. At a high level, $W$ is obtained by “flipping” some of the signs appearing in $\hat W$. More precisely, given $\hat W=((\hat\sigma^1,p^1),\ldots,(\hat\sigma^m,p^m))$, we shall define new sign vectors $\sigma^1,\ldots,\sigma^m$ but keep the exact same points $p^1,\ldots,p^m\in P$, and construct $W=((\sigma^1,p^1),\ldots,(\sigma^m,p^m))$. It will be convenient to arrange the sign vectors $\hat\sigma^1,\ldots,\hat\sigma^m$ as the rows of a matrix $\hat M\in\{\pm1\}^{m\times d}$, and use this $\hat M$ to define a new matrix $M$, whose rows define the new sign vectors $\sigma^1,\ldots,\sigma^m$.

We construct $M$ (from $\hat M$) using the following procedure, which operates separately on each column $i\in[d]$. First copy column $i$ of $\hat M$ to be also column $i$ of $M$, and let $n_i^+$ and $n_i^-$ be the number of positive and negative entries in this column, respectively. If $n_i^+=n_i^-$, leave this column of $M$ as is, noticing that it sums to zero. Otherwise, flip $\frac{|n_i^+-n_i^-|}{2}$ signs chosen carefully from among the majority sign in column $i$, and this will ensure that column $i$ of $M$ sums to zero. (This is possible because $m$ is even.) The careful choice (which signs to flip) will favor row indices $j\in[m]$ with small value $|p_i^j|$. Formally, let $R_i\subseteq[m]$ be the set of rows $j$ with the majority sign in column $i$ of $\hat M$, then sort the indices $j\in R_i$ by their value $|p_i^j|$, pick the $|n_i^+-n_i^-|/2$ indices with smallest value (breaking ties arbitrarily), and flip their entries. As explained above, the rows of the resulting matrix $M$ define sign vectors $\sigma^1,\ldots,\sigma^m$, which in turn define $W=((\sigma^1,p^1),\ldots,(\sigma^m,p^m))$, and our final dual solution is $\bar u=u^W$.
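This column-balancing procedure is concrete enough to sketch in Python (ours); here points[j][i] plays the role of $p_i^j$ and M_hat that of $\hat M$:

```python
import numpy as np

def balance_columns(M_hat, points):
    """Flip the minimum number of majority-sign entries in each column of
    M_hat (in {-1,1}^{m x d}) so that every column sums to zero, preferring
    rows j where |points[j][i]| is smallest; m must be even."""
    M = M_hat.astype(float).copy()
    m, d = M.shape
    for i in range(d):
        excess = int(M[:, i].sum())           # n_i^+ - n_i^-
        if excess == 0:
            continue
        maj = 1.0 if excess > 0 else -1.0
        rows = [j for j in range(m) if M[j, i] == maj]
        rows.sort(key=lambda j: abs(points[j][i]))
        for j in rows[: abs(excess) // 2]:    # flip |excess|/2 entries
            M[j, i] = -maj
    return M
```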

Lemma 4.5.

The solution $\bar u=u^W$ is feasible for (DLP) on $S$.

Proof.

The first constraint is satisfied because the signs in every column of $M$ sum to zero. The second constraint is satisfied because flipping an entry of $M$ in column $i$ “moves” an amount of $1/m$ from a variable $u_{\sigma,p}$ to $u_{\sigma',p}$, where $\sigma,\sigma'$ differ only in coordinate $i$, but the total remains the same. Finally, $\bar u$ uses only variables $u_{\sigma,p}$ with $p\in S$, i.e., all other variables are $0$, and thus it is feasible not only for (DLP) on $P$ but also on $S$.

The next lemma bounds the decrease in the objective when constructing $W$ from $\hat W$, i.e., comparing the objective of $\bar u$ with that of $\hat u$. Its proof is based on an averaging argument.

Lemma 4.6.

For all $i\in[d]$,

$\Big|\sum_{(\sigma,p)\in\hat W}\sigma_ip_i-\sum_{(\sigma,p)\in W}\sigma_ip_i\Big|\le\frac{2|n_i^+-n_i^-|}{m}\sum_{(\sigma,p)\in\hat W}|p_i|.$

Proof.

We consider only the case $n_i^+\ge n_i^-$, as the other case is symmetric. Since $n_i^++n_i^-=m$, we have $n_i^+\ge m/2$. Recall that the construction of $M$ flips in column $i$ exactly $\frac{n_i^+-n_i^-}{2}$ signs, picking the indices $j\in R_i$ with smallest value $|p_i^j|$. Every such flip changes the objective by $2|p_i^j|$, and the number of choices is $|R_i|=n_i^+\ge m/2$, hence by an averaging argument,

$\Big|\sum_{(\sigma,p)\in\hat W}\sigma_ip_i-\sum_{(\sigma,p)\in W}\sigma_ip_i\Big|\le\frac{(n_i^+-n_i^-)/2}{|R_i|}\sum_{j\in R_i}2|p_i^j|\le\frac{n_i^+-n_i^-}{m/2}\sum_{j\in[m]}|p_i^j|.$

The objective value of $\bar u$.

Using Fact 4.4, we can write the objective value of $\bar u$ as

$L:=\sum_{i\in[d]}\sum_{\sigma\in\{\pm1\}^d}\sum_{p\in S}\sigma_ip_i\,\bar u_{\sigma,p}=\frac1m\sum_{i\in[d]}\sum_{(\sigma,p)\in W}\sigma_ip_i,$ (8)

and that of $\hat u$ as

$\hat L:=\sum_{i\in[d]}\sum_{\sigma\in\{\pm1\}^d}\sum_{p\in S}\sigma_ip_i\,\hat u_{\sigma,p}=\frac1m\sum_{i\in[d]}\sum_{(\sigma,p)\in\hat W}\sigma_ip_i.$ (9)

Our plan is to bound $L$ (with high probability) by relating it to $\hat L$, and then bound the latter using Chebyshev's inequality. These steps rely on the next two lemmas, and before proving them, we show how they imply Theorem 4.1.

Lemma 4.7.

$\mathbb{E}\big[|L-\hat L|\big]\le O(\epsilon)\,\mathrm{opt}(P).$

Lemma 4.8.

$\mathrm{Var}(\hat L)\le\frac4m\,(\mathrm{opt}(P))^2.$

Proof of Theorem 4.1.

We already established in Lemma 4.5 that $\bar u$ is a feasible solution for (DLP) on $S$, hence $\mathrm{opt}(S)\ge L$. It thus suffices to show that $\Pr[L\ge(1-O(\epsilon))\,\mathrm{opt}(P)]\ge0.8$. By Lemma 4.7 and Markov's inequality, $\Pr[|L-\hat L|\le O(\epsilon)\,\mathrm{opt}(P)]\ge0.9$. Next, we apply Chebyshev's inequality to $\hat L$, using the variance bound of Lemma 4.8 and our choice of $m$, to obtain $\Pr[|\hat L-\mathbb{E}[\hat L]|\ge\epsilon\,\mathrm{opt}(P)]\le\frac{4/m}{\epsilon^2}\le0.1$. Recall also from (6) that $\mathbb{E}[\hat L]=\mathrm{opt}(P)$. It follows by a union bound that with probability at least $0.8$,

$L\ge\hat L-O(\epsilon)\,\mathrm{opt}(P)\ge(1-O(\epsilon))\,\mathrm{opt}(P).$

Proof of Lemma 4.7.

Observe that $\mathcal E_i$ is the event that $|n_i^+-n_i^-|>\epsilon m$, in which case we can still bound $|n_i^+-n_i^-|\le m$ (which holds always). Thus, $|n_i^+-n_i^-|\le\epsilon m+\mathbb{1}_{\mathcal E_i}\cdot m$. Now applying Lemma 4.6 for every $i\in[d]$ and taking a sum, we get (after dividing by $m$)

$\frac1m\Big|\sum_{i\in[d]}\sum_{(\sigma,p)\in\hat W}\sigma_ip_i-\sum_{i\in[d]}\sum_{(\sigma,p)\in W}\sigma_ip_i\Big|\le\frac1m\sum_{i\in[d]}\frac{2|n_i^+-n_i^-|}{m}\sum_{(\sigma,p)\in\hat W}|p_i|\le\frac2m\sum_{i\in[d]}\Big[(\epsilon+\mathbb{1}_{\mathcal E_i})\sum_{(\sigma,p)\in\hat W}|p_i|\Big].$ (10)

To bound (10) further, we expand its final expression into two parts. We then bound one part in (11) using Lemma 4.3, and the other part in Claim 4.9.

$\mathbb{E}\Big[\frac1m\sum_{i\in[d]}\epsilon\sum_{(\sigma,p)\in\hat W}|p_i|\Big]=\epsilon\,\mathbb{E}\Big[\frac1m\sum_{(\sigma,p)\in\hat W}\|p\|_1\Big]\le2\epsilon\,\mathrm{opt}(P).$ (11)
Claim 4.9.

$\mathbb{E}\Big[\frac1m\sum_{i\in[d]}\mathbb{1}_{\mathcal E_i}\sum_{(\sigma,p)\in\hat W}|p_i|\Big]\le2\epsilon\,\mathrm{opt}(P).$

Proof.

The main obstacle here is that we have a product of two random variables, $\mathbb{1}_{\mathcal E_i}$ and $|p_i|$, that are not independent. Intuitively, their dependence should be relatively small, and our actual proof bounds their product by another product, of two independent random variables. Recall that $\hat W=((\hat\sigma^1,p^1),\ldots,(\hat\sigma^m,p^m))$ consists of $m$ i.i.d. samples, hence we can focus on analyzing (say) the last one; formally,

$\mathbb{E}\Big[\frac1m\sum_{i\in[d]}\mathbb{1}_{\mathcal E_i}\sum_{(\sigma,p)\in\hat W}|p_i|\Big]=\mathbb{E}\Big[\sum_{i\in[d]}\mathbb{1}_{\mathcal E_i}\,|p_i^m|\Big].$

Consider $i\in[d]$, and define $\mathcal E'_i$ to be the event that $\big|\frac1m\sum_{j=1}^{m-1}\hat\sigma_i^j\big|>\epsilon-\frac1m$. Observe that $\mathbb{1}_{\mathcal E_i}\le\mathbb{1}_{\mathcal E'_i}$, because if $\mathcal E_i$ occurs then also $\mathcal E'_i$ must occur. The crux is that $\mathcal E'_i$ is independent of $(\hat\sigma^m,p^m)$, and thus $\mathbb{E}[\mathbb{1}_{\mathcal E'_i}\,|p_i^m|]=\mathbb{E}[\mathbb{1}_{\mathcal E'_i}]\cdot\mathbb{E}[|p_i^m|]$. We can still bound $\mathbb{E}[\mathbb{1}_{\mathcal E'_i}]=\Pr[\mathcal E'_i]\le\epsilon$ by the argument we had for $\mathcal E_i$ in (7), except that we use $m-1$ samples instead of $m$. Altogether,

$\mathbb{E}\Big[\sum_{i\in[d]}\mathbb{1}_{\mathcal E_i}|p_i^m|\Big]\le\mathbb{E}\Big[\sum_{i\in[d]}\mathbb{1}_{\mathcal E'_i}|p_i^m|\Big]=\sum_{i\in[d]}\mathbb{E}[\mathbb{1}_{\mathcal E'_i}]\cdot\mathbb{E}[|p_i^m|]\le\epsilon\,\mathbb{E}[\|p^m\|_1],$

and the claim follows by using Lemma 4.3 to bound $\|p^m\|_1\le2\,\mathrm{opt}(P)$.

We now proceed with the proof of Lemma 4.7. Plugging the bounds from (11) and Claim 4.9 into (10), we obtain

$\mathbb{E}\Big[\frac1m\Big|\sum_{i\in[d]}\sum_{(\sigma,p)\in W}\sigma_ip_i-\sum_{i\in[d]}\sum_{(\sigma,p)\in\hat W}\sigma_ip_i\Big|\Big]\le O(\epsilon)\,\mathrm{opt}(P),$

which completes the proof of Lemma 4.7.

Proof of Lemma 4.8.

For $\sigma\in\{\pm1\}^d$ and $p\in P$, define $X_{\sigma,p}:=\sum_{i\in[d]}\sigma_ip_i$. Then we can write

$\hat L=\frac1m\sum_{i\in[d]}\sum_{(\sigma,p)\in\hat W}\sigma_ip_i=\frac1m\sum_{(\sigma,p)\in\hat W}X_{\sigma,p}.$

Observe that for every $(\sigma,p)$ we have $|X_{\sigma,p}|\le\sum_{i\in[d]}|p_i|\le2\,\mathrm{opt}(P)$ by Lemma 4.3. Now since the $m$ samples in $\hat W$ are independent, the variance of their sum is the sum of their variances, and we obtain

$\mathrm{Var}(\hat L)=\frac{1}{m^2}\,\mathrm{Var}\Big(\sum_{(\sigma,p)\in\hat W}X_{\sigma,p}\Big)\le\frac{1}{m^2}\cdot m\,(2\,\mathrm{opt}(P))^2=\frac4m\,(\mathrm{opt}(P))^2.$ (12)

This completes the proof of Lemma 4.8.

4.2 Lower bound for value-preserving 0-coresets

The Helly number of a body $S$, denoted $h(S)$, is the smallest positive integer $h$ such that the following holds: if every subfamily of size $h$ of a family of translates of $S$ has a non-empty intersection, then the entire family also has a non-empty intersection. Helly's theorem shows that the Helly number of every convex body is at most $d+1$.

The Helly number is equivalent to a worst-case bound for value-preserving 0-coresets (cf. [6]). In the next proposition we show that the $d+1$ upper bound is tight, that is, the Helly number of the $\ell_1$ ball is $d+1$.

Proposition 4.10.

Assume $d>2$. There exists $P\subset\mathbb{R}^d$, such that every value-preserving 0-coreset of $P$ must have size $d+1$.

Proof.

Consider the set of $d+1$ points $P=\{p_0,\ldots,p_d\}$, where $p_0=(-1,-1,\ldots,-1)$ and $p_i$ is the vector with all entries equal to $1$ except for a $-1$ in the $i$-th position. We will show that:

  • The point $\mathbf 0=(0,0,\ldots,0)$ is the only center of $P$ with a cost of $d$.

  • Each subset of $P$ containing $d$ points has cost strictly less than $d$.

To prove the first claim, let $c$ be an optimal center of $P$; clearly $c\in[-1,1]^d$. We know that for every $i$:

$\mathrm{cost}(c,P)\ge\frac12\big(\|c-p_0\|_1+\|c-p_i\|_1\big)=\frac12\big(2(d-1)+2|c_i+1|\big)=d-1+|c_i+1|.$

On the other hand, $\mathrm{cost}(c,P)\le d$, since the zero vector $\mathbf 0$ has cost $d$; consequently $|c_i+1|\le1$ for all $i$. This implies $c_i\le0$ for all $i$, leading to:

$\mathrm{cost}(c,P)\ge\|c-p_i\|_1=(d-1)+(1+c_i)-\sum_{j\ne i}c_j=d+c_i-\sum_{j\ne i}c_j.$

Since $d\ge\mathrm{cost}(c,P)$, it follows that $\sum_{j\ne i}c_j\ge c_i$ for all $i$; summing these inequalities over all $i$ gives $(d-2)\sum_jc_j\ge0$, so $\sum_jc_j\ge0$. Given that $d>2$ and every $c_j\le0$, this implies $c=\mathbf 0$.

For the second claim, consider any subset $Q$ of $d$ points. If $p_0\notin Q$, then clearly $\mathrm{opt}(Q)<d$. Assuming $p_0\in Q$ and, without loss of generality, $p_d\notin Q$, we set $c_i=-\frac{1}{d-2}$ for $i<d$, and $c_d=1$. The cost for this choice of $c$ is:

$\mathrm{cost}(c,Q)=\max\big\{2+(d-1)\,|{-1}-c_i|,\ (d-2)\,|1-c_i|+|{-1}-c_i|\big\}=\max\big\{2+(d-1)(1+c_i),\ (d-2)(1-c_i)+(1+c_i)\big\}<d,$

where the first term is the distance to $p_0$ and the second is the distance to each $p_j\in Q$ with $j<d$.
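As a sanity check, both claims can be verified numerically in a small dimension; a short Python sketch (ours) evaluates the explicit centers from the proof:

```python
import numpy as np

def cost(c, P):
    return max(np.abs(p - c).sum() for p in P)

d = 6
p0 = -np.ones(d)
P = [p0] + [np.where(np.arange(d) == i, -1.0, 1.0) for i in range(d)]

print(cost(np.zeros(d), P))          # = d: the origin attains value d
c = np.full(d, -1.0 / (d - 2))
c[-1] = 1.0
print(cost(c, P[:-1]))               # < d on the d-point subset dropping p_d
```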

5 Generalization to discrete metric spaces

In this section we show how to extend our lower bound results to other discrete metric spaces, by providing two examples, in the Kendall's tau and Jaccard metric spaces.

5.1 Theorem 2.2 in Kendall’s tau metric space

Denote by $\mathcal S_d$ the set of permutations over the set of elements $[d]$. For two permutations $\sigma_1,\sigma_2\in\mathcal S_d$, the Kendall's tau distance is the number of pairs $(i,j)$ such that the order of $i$ and $j$ is reversed between $\sigma_1$ and $\sigma_2$. Formally,

$\tau(\sigma_1,\sigma_2)=\big|\{\{i,j\}:(\sigma_1(i)-\sigma_1(j))(\sigma_2(i)-\sigma_2(j))<0\}\big|.$
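A simple reference implementation (ours), counting discordant pairs in $O(d^2)$ time:

```python
from itertools import combinations

def kendall_tau(p1, p2):
    """Kendall's tau distance between two permutations given as sequences
    over the same elements."""
    r1 = {e: i for i, e in enumerate(p1)}   # element -> position in p1
    r2 = {e: i for i, e in enumerate(p2)}
    return sum(1 for a, b in combinations(p1, 2)
               if (r1[a] - r1[b]) * (r2[a] - r2[b]) < 0)

print(kendall_tau((1, 2, 3, 4), (4, 3, 2, 1)))  # 6: all pairs inverted
```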

Following a similar approach as in Section 2, we will prove a lower bound of $2^{\Omega(d)}$ on the size of weak coresets.

Proposition 5.1.

There exists a set $P\subseteq\mathcal S_{4d}$, such that for all $\epsilon\in[0,\frac13)$, if $Q\subseteq P$ is a weak ϵ-coreset of $P$, then $|Q|\ge2^{\Omega(d)}$.

Proof.

We will design a gadget that will allow us to replicate the structure of $P$ that is given in the proof of Theorem 2.2. We denote $\sigma_1=(1,2,3,4)$, $\sigma_\delta=(1,3,2,4)$, $\sigma_0=(1,4,3,2)$, $\sigma_{-\delta}=(4,3,1,2)$, $\sigma_{-1}=(4,3,2,1)$ and $\sigma_c=(1,3,4,2)$. The distances between these base permutations are detailed in Figure 1.

Figure 1: The distances between the different base permutations used in the construction of the set $P$ in Proposition 5.1. The distances between $\sigma_1,\sigma_\delta,\sigma_0,\sigma_{-\delta},\sigma_{-1}$ are additive over the straight lines. The curved lines indicate the distances from $\sigma_c=(1,3,4,2)$ to the others.

We will now introduce some notation that will allow us to describe permutations in $\mathcal S_{4d}$ using the base permutations. For $\sigma=(\pi_1,\ldots,\pi_k)$, with a slight abuse of notation, we denote by $i+\sigma$ the permutation $(i+\pi_1,\ldots,i+\pi_k)$, and we also introduce an operator $\otimes$ to denote the product $A\otimes B=\{\sigma\hat\sigma:\sigma\in A,\hat\sigma\in B\}$, where the product of two permutations is their composition (viewing each as a permutation of $[4d]$ that fixes the elements outside its block). We extend a base permutation over $4$ elements to a permutation over $4d$ elements using the notation $\boldsymbol\sigma_x=(0+\sigma_x)(4+\sigma_x)\cdots(4(d-1)+\sigma_x)$, for $x\in\{1,\delta,0,-\delta,-1,c\}$. Now we are ready to describe $P$. Let $B=\bigotimes_{i=0}^{d-1}\{4i+\sigma_\delta,\,4i+\sigma_{-\delta}\}$, and let $P=B\cup\{\boldsymbol\sigma_1,\boldsymbol\sigma_{-1}\}$. The permutation $\boldsymbol\sigma_0$ has cost $3d$ (see also Figure 1), and this cost is optimal because the base permutations composing $\boldsymbol\sigma_1,\boldsymbol\sigma_{-1}$ are antipodal (for every $i\le d-1$, every pair that agrees with $4i+\sigma_1$ disagrees with $4i+\sigma_{-1}$, and vice versa).

Let $Q$ be a weak ϵ-coreset of $P$. Notice that $Q$ must contain both $\boldsymbol\sigma_1$ and $\boldsymbol\sigma_{-1}$: otherwise, if $\boldsymbol\sigma_1,\boldsymbol\sigma_{-1}\notin Q$, consider $\boldsymbol\sigma_0$ as a center, and if $Q$ contains just one of $\{\boldsymbol\sigma_1,\boldsymbol\sigma_{-1}\}$, say $\boldsymbol\sigma_1$, consider the permutation $\boldsymbol\sigma_c$ as a center. In both cases, we obtain that $\mathrm{opt}(Q)\le2d<\frac{1}{1+\epsilon}\,\mathrm{opt}(P)=\frac{1}{1+\epsilon}\,3d$, in contradiction to $Q$ being a weak ϵ-coreset.

Similarly to Proposition 2.3, it follows that if $|Q|\le\frac{1}{2\sqrt d}2^{0.18d}$, then there exists $b\in B$ such that the Kendall's tau distance satisfies $\tau(b,q)\le3d$ for every $q\in Q$; that is, $b$ is an optimal center of $Q$. However, $\mathrm{cost}(b,P)=4d>(1+\epsilon)\,\mathrm{opt}(P)$, by considering the opposite permutation of $b$ in $P$; thus $|Q|\ge2^{\Omega(d)}$.

5.2 Theorem 3.1 in the Jaccard metric space

For two sets $A$ and $B$, the Jaccard distance is defined as $J(A,B)=1-\frac{|A\cap B|}{|A\cup B|}$. We will use the symbol $\otimes$ to denote the product $A\otimes B=\{a\cup b:a\in A,\ b\in B\}$.
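In code, the distance is immediate; a tiny Python sketch (ours):

```python
def jaccard(A, B):
    """Jaccard distance between two finite sets."""
    A, B = set(A), set(B)
    return 1.0 - len(A & B) / len(A | B) if A | B else 0.0

print(jaccard({0, 1, 2}, {0}))  # 2/3
```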

Following a similar approach as in Section 3.1, we will prove a lower bound of $\Omega(\log d)$ on the size of weak coresets.

Proposition 5.2.

Let $P=\bigotimes_{i=0}^{d-1}\{\{3i\},\{3i,3i+1,3i+2\}\}$ be a collection of sets, i.e., points in the Jaccard metric space. For every $\epsilon<\frac13$, every weak ϵ-coreset of $P$ has $\Omega(\log d)$ points.

Proof.

Let $\mathcal U=\{0,1,\ldots,3d-1\}$ denote the universe. We say that an element of $\mathcal U$ is of type 0 if its remainder modulo $3$ is $0$, and of type 1 otherwise. Note that all type-0 elements appear in every set of $P$, and thus appear in every optimal center of every coreset of $P$. It is then easy to verify that the set of optimal centers of $P$ consists of the sets in the product $\bigotimes_{i=0}^{d-1}\{\{3i+1\},\{3i+2\}\}$, each augmented with all type-0 elements, and that $\mathrm{opt}(P)=\frac12$.

Using the same arguments as in the proof of Theorem 3.1, assume $Q\subseteq P$ is a weak ϵ-coreset of size $t\le\frac12\log d$. We denote by $A_i=\{3i\}$ and $B_i=\{3i,3i+1,3i+2\}$ the two building blocks of the set $P$, corresponding to $1$ and $-1$ in the proof of Theorem 3.1. Every set $p\in Q$ induces a partition of the blocks $\{0,\ldots,d-1\}$ into at most two parts, according to whether $p$ contains $A_i$ or $B_i$ in block $i$. Again, we use $\Pi$ to denote the common refinement of all these partitions.

Let $c$ be any optimal 1-center of $Q$, and consider a part $A\in\Pi$. Let $k$ denote the number of type-1 elements from the blocks of $A$ that appear in $c$. By replacing them with $k$ other type-1 elements from $A$, say the first $k$ type-1 elements in the blocks of $A$, we preserve the cost of $c$ over $Q$, as the number of differences $c$ has with each $p\in Q$ remains unchanged. We apply this modification to all parts of $\Pi$ and denote the new center by $\bar c$. However, by the construction of $\bar c$ there exists $p\in P$ such that $p$ intersects at most one type-1 element of $\bar c$ in every part and the union of $p$ and $\bar c$ is $\mathcal U$. Consequently, the distance between $\bar c$ and $p$ is at least $1-\frac{d+|\Pi|}{3d}\ge\frac23-\frac{2^t}{3d}\ge\frac23-\frac{\sqrt d}{3d}>(1+\epsilon)\,\mathrm{opt}(P)$ for sufficiently large $d$.

References

  • [1] Amir Abboud, MohammadHossein Bateni, Vincent Cohen-Addad, Karthik C. S., and Saeed Seddighin. On complexity of 1-center in various metrics. In APPROX/RANDOM, volume 275 of LIPIcs, pages 1:1–1:19. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPIcs.APPROX/RANDOM.2023.1.
  • [2] Pankaj K. Agarwal, Sariel Har-Peled, and Kasturi R. Varadarajan. Approximating extent measures of points. J. ACM, 51(4):606–635, 2004. doi:10.1145/1008731.1008736.
  • [3] Pankaj K. Agarwal, Sariel Har-Peled, and Kasturi R. Varadarajan. Geometric approximation via coresets. In Combinatorial and computational geometry, volume 52 of MSRI Publications, chapter 1, pages 1–30. Cambridge University Press, 2005.
  • [4] Sepideh Aghamolaei and Mohammad Ghodsi. A composable coreset for k-center in doubling metrics. In CCCG, pages 165–171, 2018. URL: http://www.cs.umanitoba.ca/%7Ecccg2018/papers/session4A-p2.pdf.
  • [5] Christian Bachmaier, Franz J. Brandenburg, Andreas Gleißner, and Andreas Hofmeier. On the hardness of maximum rank aggregation problems. J. Discrete Algorithms, 31:2–13, 2015. doi:10.1016/j.jda.2014.10.002.
  • [6] René Brandenberg and Stefan König. No dimension-independent core-sets for containment under homothetics. Discrete & Computational Geometry, 49(1):3–21, 2013. doi:10.1007/s00454-012-9462-0.
  • [7] Mihai Bădoiu and Kenneth L. Clarkson. Optimal core-sets for balls. Comput. Geom., 40(1):14–22, 2008. doi:10.1016/j.comgeo.2007.04.002.
  • [8] Mihai Bădoiu, Sariel Har-Peled, and Piotr Indyk. Approximate clustering via core-sets. In STOC, pages 250–257. ACM, 2002. doi:10.1145/509907.509947.
  • [9] Marc Bury, Michele Gentili, Chris Schwiegelshohn, and Mara Sorella. Polynomial time approximation schemes for all 1-center problems on metric rational set similarities. Algorithmica, 83(5):1371–1392, 2021. doi:10.1007/s00453-020-00787-3.
  • [10] Matteo Ceccarello, Andrea Pietracaprina, and Geppino Pucci. Solving k-center clustering (with outliers) in MapReduce and streaming, almost as accurately as sequentially. Proc. VLDB Endow., 12(7):766–778, 2019. doi:10.14778/3317315.3317319.
  • [11] Diptarka Chakraborty, Debarati Das, and Robert Krauthgamer. Approximating the median under the ulam metric. In SODA, pages 761–775. SIAM, 2021. doi:10.1137/1.9781611976465.48.
  • [12] Diptarka Chakraborty, Debarati Das, and Robert Krauthgamer. Clustering permutations: New techniques with streaming applications. In ITCS, volume 251 of LIPIcs, pages 31:1–31:24. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPIcs.ITCS.2023.31.
  • [13] Diptarka Chakraborty, Kshitij Gajjar, and Agastya Vibhuti Jha. Approximating the center ranking under ulam. In FSTTCS, volume 213 of LIPIcs, pages 12:1–12:21. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPIcs.FSTTCS.2021.12.
  • [14] Timothy M. Chan. Faster core-set constructions and data-stream algorithms in fixed dimensions. Computational Geometry, 35(1):20–35, 2006. doi:10.1016/j.comgeo.2005.10.002.
  • [15] Ke Chen. On coresets for k-median and k-means clustering in metric and euclidean spaces and their applications. SIAM Journal on Computing, 39(3):923–947, 2009. doi:10.1137/070699007.
  • [16] Flavio Chierichetti, Ravi Kumar, Sandeep Pandey, and Sergei Vassilvitskii. Finding the jaccard median. In SODA, pages 293–311. SIAM, 2010. doi:10.1137/1.9781611973075.25.
  • [17] Michael B. Cohen, Yin Tat Lee, Gary L. Miller, Jakub Pachocki, and Aaron Sidford. Geometric median in nearly linear time. In STOC, pages 9–21. ACM, 2016. doi:10.1145/2897518.2897647.
  • [18] Vincent Cohen-Addad, David Saulpic, and Chris Schwiegelshohn. A new coreset framework for clustering. In STOC, pages 169–182. ACM, 2021. doi:10.1145/3406325.3451022.
  • [19] Matan Danos. Coresets for clustering by uniform sampling and generalized rank aggregation. Master’s thesis, Weizmann Institute of Science, 2021.
  • [20] Mark de Berg, Leyla Biabani, and Morteza Monemizadeh. k-center clustering with outliers in the MPC and streaming model. In IPDPS, pages 853–863. IEEE, 2023. doi:10.1109/IPDPS54959.2023.00090.
  • [21] Dan Feldman. Core-sets: An updated survey. WIREs Data Mining and Knowledge Discovery, 10(1):e1335, 2020. doi:10.1002/widm.1335.
  • [22] Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. In STOC, pages 569–578. ACM, 2011. doi:10.1145/1993636.1993712.
  • [23] Teofilo F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293–306, 1985. doi:10.1016/0304-3975(85)90224-5.
  • [24] Peter M Gruber and Jörg M Wills. Handbook of convex geometry, Volume B. North-Holland, 1993.
  • [25] Claire Kenyon-Mathieu and Warren Schudy. How to rank with few errors. In STOC, pages 95–103. ACM, 2007. doi:10.1145/1250790.1250806.
  • [26] Ming Li, Bin Ma, and Lusheng Wang. On the closest string and substring problems. J. ACM, 49(2):157–171, 2002. doi:10.1145/506147.506150.
  • [27] Bin Ma and Xiaoming Sun. More efficient algorithms for closest string and substring problems. SIAM Journal on Computing, 39(4):1432–1443, 2010. doi:10.1137/080739069.
  • [28] Alexander Munteanu and Chris Schwiegelshohn. Coresets-methods and history: A theoreticians design pattern for approximation and streaming algorithms. Künstliche Intell., 32(1):37–53, 2018. doi:10.1007/s13218-017-0519-3.
  • [29] François Nicolas and Eric Rivals. Hardness results for the center and median string problems under the weighted and unweighted edit distances. J. Discrete Algorithms, 3(2-4):390–415, 2005. doi:10.1016/j.jda.2004.08.015.
  • [30] Rina Panigrahy. Minimum enclosing polytope in high dimensions. CoRR, cs.CG/0407020, 2004. doi:10.48550/arXiv.CS/0407020.
  • [31] Jeff M. Phillips. Coresets and sketches. In Handbook of discrete and computational geometry, chapter 48, pages 1269–1288. Chapman and Hall/CRC, 3rd edition, 2017. doi:10.1201/9781315119601.
  • [32] Alvin Yan Hong Yao and Diptarka Chakraborty. Approximate maximum rank aggregation: Beyond the worst-case. In FSTTCS, volume 284 of LIPIcs, pages 12:1–12:21. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPIcs.FSTTCS.2023.12.
  • [33] Hai Yu, Pankaj K. Agarwal, Raghunath Poreddy, and Kasturi R. Varadarajan. Practical methods for shape fitting and kinetic data structures using coresets. Algorithmica, 52(3):378–402, 2008. doi:10.1007/s00453-007-9067-9.

Appendix A Appendix

A.1 Lower bound for weak 0-coreset for inputs with a unique solution

We present an example of a set P of size 2d, where P is the sole weak 0-coreset of itself. This demonstrates that the upper bound of 2d is indeed tight.

We start with a simple proposition regarding the distances from a given point within the unit hypercube to two antipodal vertices of the cube.

Proposition A.1.

If $p\in\{-1,1\}^d$ and $x\in[-1,1]^d$, then $\|p-x\|_1+\|(-p)-x\|_1=2d$.

Proof.

$\|p-x\|_1+\|(-p)-x\|_1=\sum_{i=1}^d\big(|p_i-x_i|+|p_i+x_i|\big)=\sum_{i=1}^d\big((1-p_ix_i)+(1+p_ix_i)\big)=\sum_{i=1}^d2=2d.$

An immediate consequence of the above proposition is that every center of $P$ has cost at least $d$, for every set $P\subseteq\{-1,1\}^d$ containing an antipodal pair. Since $\mathbf 0$ has cost at most $d$ for such $P$, we get that $\mathrm{opt}(P)=d$ and that $\mathbf 0$ is an optimal center. This is summarized in the following corollary.

Corollary A.2.

If $P\subseteq[-1,1]^d$ contains an antipodal pair $\{p,-p\}\subseteq\{-1,1\}^d$, then $\mathrm{opt}(P)=d$.

Let $H$ denote the set of row vectors of the $d\times d$ Hadamard matrix, and let $P=H\cup(-H)$.

Proposition A.3.

There exists $P\subset\mathbb{R}^d$ with unique optimum, such that every weak 0-coreset of $P$ must have size $2d$.

Proof.

Following Corollary A.2, $\mathrm{opt}(P)=d$. We will further show that if $Q\subseteq P$ does not contain an antipodal pair, then $\mathrm{opt}(Q)<d$. Note that for $a,b\in P$ with $a\ne\pm b$, we have $a\cdot b=0$. Also, since $a,b\in\{-1,1\}^d$, $a\cdot b=\sum_{i:a_i=b_i}1-\sum_{i:a_i\ne b_i}1=d-2\sum_{i:a_i\ne b_i}1$. That is:

$\|a-b\|_1=2\sum_{i:a_i\ne b_i}1=d-a\cdot b=d.$ (13)

That is, the distance from every $a\in P$ to all points of $P\setminus\{a,-a\}$ is exactly $d$.

Let $c=\frac{1}{|Q|}\sum_{q\in Q}q$ be a candidate center. Since $Q$ does not contain an antipodal pair, it follows that $|Q|\le d$. Now, using Equation (13), for every $\tilde q\in Q$:

$\|c-\tilde q\|_1=\Big\|\frac{1}{|Q|}\sum_{q\in Q}(q-\tilde q)\Big\|_1\le\frac{1}{|Q|}\sum_{q\in Q,\ q\ne\tilde q}d=\frac{|Q|-1}{|Q|}\,d<d.$

That is, $\mathrm{opt}(Q)<d$. It follows that every weak 0-coreset of $P$ contains an antipodal pair. If a weak 0-coreset $Q$ is a proper subset of $P$, then there exists $q\in P$ with $-q\notin Q$ (take any point of $P\setminus Q$, or its antipode if that antipode lies in $Q$). But $\mathrm{cost}(q,Q)=d=\mathrm{opt}(Q)$, whereas $\mathrm{cost}(q,P)=2d$, so $q$ is an optimal center of $Q$ that is a poor center for $P$. Hence $P$ is the sole weak 0-coreset of $P$.

That is opt(Q)<d. It follows that every weak 0-coreset of P contains an antipodal pair. If a weak 0-coreset Q is a proper subset of P then there exists qQ with qQ. But cost(q,Q)=d=opt(Q), whereas cost(q,P)=2d. Hence P is the sole weak 0-coreset of P.