Coresets for $1$-Center in $\ell_1$ Metrics
Abstract
We explore the applicability of coresets – a small subset of the input dataset that approximates a predefined set of queries – to the $1$-center problem in $\ell_1$ spaces. This approach could potentially extend to solving the $1$-center problem in related metric spaces, and has implications for streaming and dynamic algorithms.
We show that in $\ell_1$, unlike in Euclidean space, even weak coresets exhibit exponential dependency on the underlying dimension $d$. Moreover, while inputs with a unique optimal center admit better bounds, they are not dimension independent. We then relax the guarantee of the coreset further, to merely approximate the value (optimal cost of $1$-center), and obtain a dimension-independent coreset for every desired accuracy $\epsilon > 0$. Finally, we discuss the broader implications of our findings for related metric spaces, and show explicit implications for the Jaccard and Kendall's tau distances.
Keywords and phrases: clustering, $1$-center, minimum enclosing balls, coresets, $\ell_1$ norm, Kendall's tau, Jaccard metric
Funding: Shaofeng H.-C. Jiang: Research partially supported by a national key R&D program of China No. 2021YFA1000900 and a startup fund from Peking University.
2012 ACM Subject Classification: Theory of computation → Sketching and sampling; Theory of computation → Streaming models
Editors: Raghu Meka
1 Introduction
Clustering is a fundamental task in unsupervised learning and data analysis in general, with wide-ranging applications. Typically, the input is a dataset of $n$ points in $\mathbb{R}^d$ or some other feature space of dimension $d$, and the objective is to partition the dataset into clusters, each characterized by a high degree of similarity. In the current era of big data, both the number of points and their dimension are often excessive, making the computational demands significant. In center-based clustering, the goal is to further assign to each cluster a representative “center” point. The number of centers (and ergo clusters) is often specified in advance, and denoted by $k$, giving rise to the fundamental $k$-center, $k$-median and $k$-means problems, which are famous for their simplicity and broad applicability. The case $k = 1$ is particularly important as the most basic setting, and focuses not on partitioning the points but rather on aggregating them, by identifying a representative that best captures the entire dataset.
We study the $1$-center problem in $\ell_1$ metrics (i.e., Manhattan distance). The motivation for $\ell_1$ metrics is twofold. First, $\ell_1$ metrics arise in many practical settings, especially when the dimension is large. Second, common discrete metric spaces, such as the Hamming, Kendall's tau, and Jaccard metrics, are closely related to $\ell_1$, as they can be embedded into $\mathbb{R}^d$ endowed with the $\ell_1$ metric with all distances preserved isometrically (i.e., with no distortion).
The $1$-center problem has been investigated extensively in many metric spaces. In the context of strings, $1$-center under the Hamming distance and edit distance is a classical problem known as the Closest String Problem [26, 27, 29]. In the context of permutations (a common model for rankings), $1$-center under Kendall's tau distance is known as the maximum rank aggregation problem [5, 32]. The $1$-center problem has also been studied in other discrete metrics, such as the Ulam distance between permutations and the Jaccard distance between sets [9, 13]. Additionally, recent work studied the $1$-center problem under various metrics, including $\ell_p$, edit distance, and Ulam, specifically to examine how the time complexity depends on the dimension [1]. In fact, the $1$-center problem is known to be NP-hard in many metric spaces, and it seems that each one requires distinct algorithmic techniques. For some metrics, a PTAS is known, while for others, the existence of a PTAS or even a non-trivial approximation algorithm remains an open question and an active area of research.
A coreset is a data-summarization concept, where the input dataset is replaced by a small subset that approximates its clustering properties. It can lead to significant reductions in computational resources, particularly storage and communication, and plays a pivotal role in modern algorithmic settings, such as streaming, distributed, and dynamic. It is thus crucial to understand the tradeoff between the size of a coreset and its accuracy (approximation factor). Coresets have been studied intensely since their introduction by [2], see the surveys [3, 21, 28, 31], and they have been successfully applied in a broad range of settings, for example different clustering objectives and metric spaces. Surprisingly, little is known about coresets in $\ell_1$, particularly in high dimension, leaving significant gaps in our understanding. Small coresets in $\ell_1$, if they exist and can be computed efficiently, could potentially pave the way for a unified approach to $1$-center also in related metrics, like Kendall's tau and Ulam. Moreover, studying coresets for these metrics may provide insights into the geometric structure of these metric spaces.
1.1 Problem setup
In the $1$-center problem, also known as Minimum Enclosing Ball (MEB), the input is a set $X$ of $n$ points in a metric space $(M, \mathrm{dist})$, and the goal is to find a center point $c \in M$ that minimizes the objective function $\mathrm{cost}(X, c) := \max_{x \in X} \mathrm{dist}(x, c)$. Denote the optimal value by $\mathrm{OPT}(X) := \min_{c \in M} \mathrm{cost}(X, c)$, and observe that it is monotone, i.e., $Y \subseteq X$ implies $\mathrm{OPT}(Y) \le \mathrm{OPT}(X)$. While we mostly consider the space $\mathbb{R}^d$ endowed with the $\ell_1$ norm, these definitions capture also other metrics, like Kendall's tau distance, where $M$ is the set of permutations over $[d]$.
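For concreteness, the following minimal sketch implements these definitions in the $\ell_1$ case (the names `cost` and `opt_over_candidates` are ours, and the brute-force search over a finite candidate set is for illustration only):

```python
import itertools
import numpy as np

def cost(X, c):
    """1-center objective: max ell_1 distance from center c to the points of X."""
    return max(np.abs(np.asarray(x) - c).sum() for x in X)

def opt_over_candidates(X, candidates):
    """Brute-force OPT(X), restricted to a finite candidate set of centers."""
    return min(cost(X, c) for c in candidates)

# Example: the 8 sign vectors of {-1,1}^3; the origin is an optimal center.
X = [np.array(p) for p in itertools.product([-1, 1], repeat=3)]
print(cost(X, np.zeros(3)))  # -> 3.0
```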
There are different (and sometimes inconsistent) definitions of coresets, which were usually developed when the desirable guarantees were deemed impossible or difficult to prove, leading to more relaxed guarantees. We face this same issue (especially in high dimension), and thus consider a sequence of definitions. For brevity, we present only the definitions most relevant to our work, and do not discuss possible generalizations (e.g., to $k$-center for $k > 1$) or variants that appear in the literature. A common and very useful type of coreset is a strong coreset, which approximates the cost of every center, as follows.
Definition 1.1 (Strong Coreset).
A subset $C \subseteq X$ is a strong $\epsilon$-coreset for a $1$-center instance $X$ if
$$\forall c \in M: \qquad \max_{x \in X} \mathrm{dist}(x, c) \;\le\; (1+\epsilon)\cdot \max_{x \in C} \mathrm{dist}(x, c). \tag{1}$$
The guarantee in (1) is extremely effective in reducing (the task of solving) $X$ to $C$, with a small loss in the objective value. In Euclidean spaces, strong coresets for $1$-center are known, however their size is exponential in the dimension, namely $(1/\epsilon)^{O(d)}$ [3], and unfortunately this upper bound is existentially tight (there is a folklore lower bound, which follows by considering an instance formed by an $\epsilon$-net of the unit sphere $\mathbb{S}^{d-1}$; every strong coreset must contain the entire net and thus have size $(1/\epsilon)^{\Omega(d)}$). The upper bound actually extends to every metric space with bounded doubling dimension [4, 10, 20], and therefore applies to $\mathbb{R}^d$ endowed with the $\ell_1$ norm (or any other $\ell_p$ norm).
It is thus natural to ask: Can the coreset size be improved specifically for the $\ell_1$ norm? If not, can weaker notions lead to size polynomial in $d$, or even dimension independent?
1.2 Our results
Weak coresets.
We turn our attention to a more relaxed notion, called a weak coreset, which only approximates the cost of certain centers, namely, centers that may be found by solving the coreset instance. Trivially, every strong $\epsilon$-coreset is also a weak $\epsilon$-coreset.
Definition 1.2 (Weak Coreset).
A subset $C \subseteq X$ is a weak $\epsilon$-coreset for a $1$-center instance $X$ if
$$\forall c^* \in \arg\min_{c \in M} \mathrm{cost}(C, c): \qquad \max_{x \in X} \mathrm{dist}(x, c^*) \;\le\; (1+\epsilon)\cdot\mathrm{OPT}(X). \tag{2}$$
Weak coresets were introduced in [8] specifically for $1$-center in Euclidean space. It is well-known that in Euclidean space, there is a unique optimal solution (the uniqueness follows from the fact that $\|\cdot\|_2^2$ is strictly convex, whereas $\|\cdot\|_1$ is convex but not strictly convex), which is why previous work often refers to the optimal center. The challenge then becomes how to extend this definition to other metric spaces, like the $\ell_1$ norm, that might have multiple optimal solutions. The typical use of a coreset is to apply an algorithm on $C$ and take the resulting center as an approximate solution for the original instance $X$. Since a generic algorithm might pick any of the multiple optimal centers for $C$, it is essential to have a guarantee that applies to every possible optimal solution for $C$, hence the universal quantifier (“for all”) in (2).
Remarkably, in Euclidean space, every instance admits a weak $\epsilon$-coreset of size $\lceil 1/\epsilon \rceil$, which is independent of the input size $n$ and of the Euclidean dimension $d$, and is in fact tight [7]. This result improved an earlier bound of $O(1/\epsilon^2)$ from [8], which was also dimension independent. Can similar bounds be proved for $\ell_1$ metrics?
We demonstrate that, in sharp contrast to the $\ell_2$ setting, weak coresets in $\ell_1$ are dimension dependent and moreover grow exponentially with $d$, even for a fixed $\epsilon$. At the same time, we show how to construct a strong $0$-coreset of size $2^d$, which again contrasts with the $\ell_2$ setting (and doubling metric spaces), where coresets grow with $1/\epsilon$ and are thus not applicable for $\epsilon = 0$. It is well-known that a strong coreset is composable, i.e., the union of coresets is itself a coreset for the union of the original datasets, which is very useful for designing algorithms in the streaming and dynamic settings. Thus, our coreset construction can be used in these settings, particularly in low dimension.
Theorem 1.3.
Consider the $1$-center problem in $\mathbb{R}^d$ under the $\ell_1$ norm.
- (a) Every instance admits a strong $0$-coreset of size $2^d$.
- (b) There exists an instance $X$, such that every weak $\epsilon$-coreset for $X$, for any fixed $\epsilon \in [0, 1)$, must have size $2^{\Omega(d)}$.
We prove this theorem in Section 2. The upper bound leverages the unique geometry of $\ell_1$, where only $2^d$ directions are in effect important. The lower bound builds a hard instance and employs classical techniques from coding theory, particularly the Hamming bound. We remark that an unpublished result from [30, Theorem 5], for a related problem about a convex shape, seems to imply a dimension-independent weak coreset for our problem of $1$-center in $\ell_1$ metrics. This would contradict our lower bound in Theorem 1.3, indicating that its statement is inaccurate or its proof is flawed; see also a similar discussion in [6].
Due to the exponential dependency on $d$ in the general case, it is important to identify cases with a smaller coreset size, say $\mathrm{poly}(d)$ or even dimension independent. Additionally, one may suspect that the source of the huge disparity between $\ell_1$ and Euclidean space, which does admit a dimension-independent weak coreset [7, 8], is the fact that in $\ell_1$ the optimal center need not be unique. We thus study the special case of inputs whose optimal center is unique, which may occur naturally in certain contexts, or be “enforced” by adding at most two points to the input dataset. We provide bounds that are exponentially smaller than in the general case, but still depend on $d$, which establishes that $\ell_1$ is inherently more complex than $\ell_2$.
Theorem 1.4.
Consider the $1$-center problem in $\mathbb{R}^d$ under the $\ell_1$ norm, on instances that have a unique optimal center.
- (a) Every instance with a unique optimum admits a weak $0$-coreset of size at most $2d$.
- (b) There exists an instance $X$ with a unique optimum, such that every weak $\epsilon$-coreset of $X$, for any sufficiently small fixed $\epsilon$, must have size $\Omega(\log d)$.
We prove this theorem in Section 3. We further conjecture that the lower bound can be improved to $\Omega(d)$, and prove this in the special case $\epsilon = 0$ using the Hadamard code. The upper bound in our theorem employs tools from convex geometry, specifically the Steinitz theorem, which is relatively obscure, to prove existence non-constructively (i.e., without an efficient algorithm). For the lower bound, we consider the instance $X = \{-1,1\}^d$, which has a unique optimal center at the origin, and show that if a coreset is too small, it must have an optimal center in which most coordinates are $\pm 1$, and such a center is clearly a poor solution for $X$.
Value-preserving coreset.
To further reduce the coreset size, we consider an even less restrictive variant that preserves solely the objective value $\mathrm{OPT}(X)$, without imposing any conditions on the optimal centers of the coreset. This concept, which we shall refer to as a value-preserving coreset, has been studied previously under the general term “coreset” [6, 14, 33], and it is most useful in applications that focus on measuring the similarity among points in the cluster, rather than identifying a center. It is easy to verify that every weak coreset is also a value-preserving coreset.
Definition 1.5 (Value-Preserving Coreset).
A subset $C \subseteq X$ is a value-preserving $\epsilon$-coreset for a $1$-center instance $X$ if
$$\mathrm{OPT}(C) \;\ge\; (1-\epsilon)\cdot\mathrm{OPT}(X). \tag{3}$$
We show that there exist value-preserving $\epsilon$-coresets of size $\mathrm{poly}(1/\epsilon)$, which is dimension independent, in contrast with Theorems 1.3 and 1.4. We thus establish a sharp separation between preserving an optimal solution (center point) and only preserving its value. This result is stated in the next theorem and proved in Section 4. It also answers a question posed in [6], which studies a generalization of $1$-center, called the Minimum Enclosing Polytope problem, and asks which polytopes admit a dimension-independent value-preserving coreset. (Prior to our result, showing that the $\ell_1$ ball, also known as the cross-polytope, admits a dimension-independent value-preserving coreset, the only known example was the parallelotope, which trivially has a value-preserving $0$-coreset with only $2$ points.)
Theorem 1.6.
Consider the $1$-center problem in $\mathbb{R}^d$ under the $\ell_1$ norm. Every instance admits a value-preserving $\epsilon$-coreset of size $\mathrm{poly}(1/\epsilon)$.
We prove this theorem in Section 4. Our algorithm is based on random sampling, a technique that was rarely used (if at all) for $1$-center coresets, but widely used for $k$-median and $k$-means coresets [15, 18, 22]. This is very natural because sampling is an excellent fit for objectives like $k$-median that are formed by summation, but not for objectives like $k$-center that are formed by maximization, which are sensitive to missing even one term. It is thus not surprising that our sampling is conducted not on the $1$-center problem but rather on its dual. More precisely, we formulate $1$-center as a linear program (LP) and write its dual, then solve this dual to obtain sampling probabilities for the input points. Crucially, the dual objective is formed by summation, and is thus conducive to sampling. Compared with the primal-dual framework, our algorithm is dual-only: it solves the dual LP explicitly, which is then related in the analysis to the primal via strong duality. Our sampling is similar to the randomized rounding commonly used in approximation algorithms, however its purpose here is to generate a coreset instead of a solution, and the analysis is also somewhat different.
Related metrics and applications.
We demonstrate that our results for weak coresets in $\ell_1$ (in Sections 2 and 3) apply also to other discrete metric spaces. Specifically, we provide in Section 5 examples that use the Kendall's tau and Jaccard metrics, drawing inspiration from the results obtained in the $\ell_1$ setting.
1.3 Related work
Another closely related aggregation problem is $1$-median. This problem, unlike $1$-center, can be solved in $\ell_1$ metrics easily, by simply taking in each coordinate the median value of the input points. Utilizing this structure, $1$-median in $\ell_1$ metrics has a weak $\epsilon$-coreset of size $\mathrm{poly}(1/\epsilon)$, which is independent of the dimension [19], in sharp contrast to $1$-center. The $1$-median problem also admits a PTAS (approximation arbitrarily close to $1$ in polynomial time) in many other metrics, including Kendall's tau distance [25], Jaccard distance [16], and a near-linear time PTAS in $\ell_2$ [17]. However, it is unclear if a PTAS exists for edit distance, or even for a special case called the Ulam distance, where the best upper bound is a $c$-approximation for some fixed constant $c < 2$ [11, 12].
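For instance, a minimal sketch of this coordinate-wise rule (the function name is ours):

```python
import numpy as np

def l1_median(X):
    """Exact 1-median under ell_1: the coordinate-wise median minimizes the
    sum of ell_1 distances, since the objective separates per coordinate."""
    return np.median(np.stack(X), axis=0)
```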
2 Weak $\epsilon$-coresets in $\ell_1$
2.1 Construction of a strong $0$-coreset of size $2^d$
In this section we will prove the following theorem:
Theorem 2.1.
Every instance $X \subseteq \mathbb{R}^d$ admits a strong $0$-coreset of size at most $2^d$.
Proof.
Let $S = \{-1,1\}^d$ denote the set of all sign vectors in $\mathbb{R}^d$. For each sign vector $s \in S$, let $p_s$ be a point in $X$ such that the inner product $\langle s, p_s\rangle$ is maximized. Define $C = \{p_s : s \in S\}$; clearly $|C| \le 2^d$.
Let $c \in \mathbb{R}^d$. Since $C$ is a subset of $X$, we have that $\max_{x \in C}\|x - c\|_1 \le \max_{x \in X}\|x - c\|_1$. Let $x$ be some point in $X$; then there exists a sign vector $s \in S$ (namely $s = \mathrm{sgn}(x - c)$) for which we can write the distance between $x$ and $c$ as
$$\|x - c\|_1 = \langle s, x - c\rangle.$$
By definition of $C$, there exists a point $p_s \in C$ satisfying $\langle s, p_s\rangle \ge \langle s, x\rangle$. Thus,
$$\|x - c\|_1 = \langle s, x\rangle - \langle s, c\rangle \le \langle s, p_s\rangle - \langle s, c\rangle = \langle s, p_s - c\rangle \le \|p_s - c\|_1.$$
That is, $\max_{x \in X}\|x - c\|_1 \le \max_{x \in C}\|x - c\|_1$, hence the two quantities are equal for every center $c$, as required.
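A minimal sketch of this construction (the function name is ours); note that because the coreset keeps, for each sign vector, a point maximizing the inner product with it, the same rule also maintains the coreset under streaming insertions:

```python
import itertools
import numpy as np

def strong_zero_coreset(X):
    """For each sign vector s in {-1,1}^d, keep a point of X maximizing <s, x>.
    The resulting C (of size at most 2^d) satisfies, for every center c,
    max_{x in X} ||x - c||_1 = max_{x in C} ||x - c||_1."""
    d = len(X[0])
    C = {}
    for s in itertools.product([-1, 1], repeat=d):
        sv = np.array(s)
        C[s] = max(X, key=lambda x: float(sv @ x))
    return [np.array(p) for p in {tuple(x) for x in C.values()}]  # deduplicate

# Sanity check on random points in low dimension.
rng = np.random.default_rng(0)
X = [rng.standard_normal(3) for _ in range(100)]
C = strong_zero_coreset(X)
c = rng.standard_normal(3)
assert abs(max(np.abs(x - c).sum() for x in X)
           - max(np.abs(x - c).sum() for x in C)) < 1e-9
```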
2.2 Lower bound of $2^{\Omega(d)}$ for weak $\epsilon$-coresets
In this section, we present a set $X$ of size $\binom{d}{d/2} + 2$ that has only a trivial weak $0$-coreset, namely, $X$ itself. Furthermore, weak $\epsilon$-coresets of $X$, for every fixed $\epsilon \in (0, 1)$, must have size $2^{\Omega(d)}$.
Theorem 2.2.
There exists a set $X \subseteq \mathbb{R}^d$ of size $\binom{d}{d/2} + 2$, such that
- if $C$ is a weak $0$-coreset of $X$, then $C = X$; and
- for every fixed $\epsilon \in (0, 1)$, if $C$ is a weak $\epsilon$-coreset of $X$, then $|C| = 2^{\Omega(d)}$.
Proof.
We may assume without loss of generality that $d$ is even, as otherwise we can take the construction for $d - 1$ and append a $0$ coordinate to all the points (vectors).
Denote by $\mathbb{1}$ the vector $(1, \ldots, 1) \in \mathbb{R}^d$, and let $P \subseteq \{-1,1\}^d$ be the set of sign vectors that are balanced, in the sense that they have an equal number of $+1$ and $-1$ coordinates; i.e., $|P| = \binom{d}{d/2}$. Fix a scaling factor $\delta \in (\frac12, 1)$, to be chosen separately in each of the two items below, and let $X = \{\mathbb{1}, -\mathbb{1}\} \cup \delta P$, where $\delta P$ denotes the set of points in $P$ scaled by factor $\delta$. Thus, $|X| = \binom{d}{d/2} + 2$. One can verify that the origin is an optimal center, with value $d$. However, it is not unique; to see this, observe that the antipodal pair $\pm\mathbb{1}$ establishes the optimal value $d$ (as $\|\mathbb{1} - (-\mathbb{1})\|_1 = 2d$), thus an optimal center must have the same distance to each of them, which holds for all points of $[-1,1]^d$ that lie on the hyperplane $\langle \mathbb{1}, x\rangle = 0$. The set $\delta P$ restricts the optimal solutions (inside that hyperplane) in a delicate manner, as we shall see.
We first show that $X$ is the only weak $0$-coreset of itself. Let $C$ be a weak $0$-coreset, and notice that $C$ must contain the antipodal pair $\mathbb{1}$ and $-\mathbb{1}$ (this is argued below, in the proof of the second item, and the same argument applies for $\epsilon = 0$); in particular, $\mathrm{OPT}(C) = d$. Assume towards contradiction that $C$ is a proper subset of $X$. Then there exists $\delta p \notin C$ for some $p \in P$. Consider the point $c = -\delta p$; it cannot be an optimal center of $X$, because $\|c - \delta p\|_1 = 2\delta d > d = \mathrm{OPT}(X)$.
We next show that this point is an optimal center of $C$. Indeed, observe that $q \in \{-1,1\}^d$ implies $|\delta q_i| = \delta$ for every coordinate, so each coordinate contributes either $0$ or $2\delta$. Now consider any $\delta q \in \delta P$; its distance to $c$ is:
$$\|c - \delta q\|_1 = \sum_{i=1}^d \delta\,|p_i + q_i| = \delta\,\big(d + \langle p, q\rangle\big). \tag{4}$$
In the case $q = p$, the above equals $2\delta d$. In the remaining case $q \ne p$, both sign vectors are balanced and distinct, hence they differ in at least two coordinates and $\langle p, q\rangle \le d - 4$; the choice of $\delta \le \frac{d}{2d-4}$ then implies
$$\|c - \delta q\|_1 \le \delta(2d - 4) \le d.$$
In addition, since $p$ is balanced, $\|c - \mathbb{1}\|_1 = \|c + \mathbb{1}\|_1 = d$.
We have thus confirmed that $c = -\delta p$ is an optimal center of $C$, but not of $X$. This contradicts our assumption that $C$ is a weak $0$-coreset, and completes the proof of the first item.
To prove the second item, let $X$ be as before but now with $\delta$ adjusted to the accuracy (as chosen below), and consider a weak $\epsilon$-coreset $C$ of $X$, for some fixed $\epsilon \in (0, 1)$. Notice that $C$ must contain both $\mathbb{1}$ and $-\mathbb{1}$: otherwise, if both are missing, consider the origin as a center of $C$, and if $C$ contains just one of them, say $\mathbb{1}$, consider $\frac12\mathbb{1}$ as a center of $C$. In both cases, one can verify that some optimal center of $C$ has cost exceeding $(1+\epsilon)\,\mathrm{OPT}(X)$ on $X$, in contradiction to the weak $\epsilon$-coreset guarantee.
Denote by $\rho(C)$ the minimum, over balanced sign vectors $y \in P$, of the maximum Hamming distance from $y$ to the (unscaled) points of $C$, that is, $\rho(C) := \min_{y \in P} \max\{d_H(y, q) : \delta q \in C\}$. The next proposition follows by a standard counting bound.
Proposition 2.3.
Fix $\beta \in (\frac12, 1)$ and let $H$ denote the binary entropy function. If $|C| < 2^{(1 - H(\beta))d}/(d+1)$, then $\rho(C) < \beta d$.
Proof.
Denote by $B_q$ the set of all points in $\{-1,1\}^d$ whose Hamming distance from $q$ is at least $\beta d$. Then
$$|B_q| = \sum_{k \ge \beta d}\binom{d}{k} = \sum_{k \le (1-\beta)d}\binom{d}{k} \le 2^{H(\beta)d},$$
where the last inequality is the known entropy bound for sums of binomial coefficients (applicable since $1 - \beta < \frac12$, using $H(1-\beta) = H(\beta)$). Taking a union over all $q$ with $\delta q \in C$,
$$\Big|\bigcup_q B_q\Big| \le |C|\cdot 2^{H(\beta)d} < \frac{2^d}{d+1} \le \binom{d}{d/2} = |P|,$$
where in the last inequality we used the known bound $\binom{d}{d/2} \ge 2^d/(d+1)$. Hence, there exists some point $y \in P$ outside all the sets $B_q$, i.e., for which the maximum Hamming distance to the points of $C$ is smaller than $\beta d$.
Assume towards contradiction that $|C| < 2^{(1-H(\beta))d}/(d+1)$, for a fixed $\beta \in \big(\frac12, \frac{1}{1+\epsilon}\big)$ and the scaling factor $\delta = \frac{1}{2\beta}$. Then by Proposition 2.3, there is a point $y \in P$ for which the maximum Hamming distance to the points of $C$ is smaller than $\beta d$. We next show that the point $\delta y$ is an optimal center for $C$. Similarly to Equation (4), since $y, q \in \{-1,1\}^d$, for every $\delta q \in C$ we can write:
$$\|\delta y - \delta q\|_1 = \delta\sum_{i=1}^d |y_i - q_i| = 2\delta\cdot d_H(y, q),$$
where $d_H$ denotes the Hamming distance, and thus:
$$\|\delta y - \delta q\|_1 < 2\delta\beta d = d.$$
In addition, since $y$ is balanced, its distance to each of $\pm\mathbb{1}$ is exactly $d$. In all cases, the distance from $\delta y$ to the points of $C$ is at most $d$, with equality attained at $\pm\mathbb{1} \in C$. Thus, $\delta y$ is an optimal center for $C$ (recall $\mathrm{OPT}(C) = d$). However, $-\delta y \in X$ and $\|\delta y - (-\delta y)\|_1 = 2\delta d = d/\beta > (1+\epsilon)\,d = (1+\epsilon)\,\mathrm{OPT}(X)$. This contradicts that $C$ is a weak $\epsilon$-coreset of $X$, and completes the proof of Theorem 2.2.
3 Weak $\epsilon$-coresets in $\ell_1$ for inputs with a unique solution
In this section we consider the special case where the input has a unique optimal center. We show that even in this special case, the size of the coreset depends on the dimension, and provide a lower bound of $\Omega(\log d)$ on the size of every weak $\epsilon$-coreset for any sufficiently small fixed $\epsilon$. This restriction on $\epsilon$ is necessary, as for larger values of $\epsilon$, a naive coreset construction (based on the Gonzalez algorithm for $k$-center [23]) with only two points achieves a $3$-approximation, i.e., a weak $2$-coreset. (Pick an arbitrary $x \in X$, then a point $y \in X$ that is furthest from $x$, and take $C = \{x, y\}$ as the coreset; note that $\mathrm{dist}(x,y) \le 2\,\mathrm{OPT}(X)$. For every optimal center $c$ of $C$ and every $z \in X$, we can write $\mathrm{dist}(z, c) \le \mathrm{dist}(z, x) + \mathrm{dist}(x, c) \le \mathrm{dist}(x, y) + \mathrm{OPT}(C) \le 3\,\mathrm{OPT}(X)$.) We also prove that there always exists a weak $0$-coreset of size at most $2d$. This is in stark contrast to Section 2, where we show that in the general case of multiple solutions, $0$-coresets might require size exponential in $d$. Lastly, we present a set whose only weak $0$-coreset is of size $2d$, hence the upper bound of $2d$ is tight for the case $\epsilon = 0$.
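A sketch of the naive two-point construction from the footnote above (names ours):

```python
import numpy as np

def two_point_coreset(X):
    """Naive weak 2-coreset: an arbitrary point x and a point y furthest
    from x.  Any optimal center c of {x, y} is a 3-approximate center for X:
    dist(z, c) <= dist(z, x) + dist(x, c) <= 2*OPT(X) + OPT(X)."""
    x = X[0]
    y = max(X, key=lambda z: np.abs(z - x).sum())
    return [x, y]
```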
3.1 Lower bound of $\Omega(\log d)$ for weak $\epsilon$-coresets
Theorem 3.1.
Fix a sufficiently small constant $\epsilon \ge 0$ and consider the set of points $X = \{-1,1\}^d$. Then $X$ has a unique optimal center, and every weak $\epsilon$-coreset of $X$ must have $\Omega(\log d)$ points.
Proof.
It is easy to note that the origin is the unique optimal center, with value $d$, as the input contains a point from every face of the $\ell_1$ ball of radius $d$ centered at the origin; indeed, $\mathrm{cost}(X, c) = d + \|c\|_1$ for every $c \in [-1,1]^d$ (and is even larger outside this cube). Consider a weak $\epsilon$-coreset $C$ of size $m$. Every point $x \in C$ induces a partition of $[d]$ into at most two parts, by splitting the coordinates of $x$ into the positive and the negative ones. Let $\Pi$ be the common refinement of all these partitions (i.e., placing $i, j$ in the same part of $\Pi$ if and only if for every $x \in C$ they are in the same part of the partition induced by $x$). Since each $x \in C$ induces at most two parts, $\Pi$ has at most $2^m$ parts. Observe that for every point $x \in C$, coordinates in the same part of $\Pi$ have the same sign.
Consider an optimal center $c$ for $C$. The idea is to modify it iteratively, without increasing its cost for $C$, so as to have more coordinates take values in $\{-1, 1\}$. Each iteration takes two indices $i, j$ in the same part of $\Pi$, whose coordinates in $c$ are both not in $\{-1,1\}$, say $c_i, c_j \in (-1, 1)$; the iterations stop when no such indices exist. Now add $\rho$ to $c_i$ and subtract $\rho$ from $c_j$, where $|\rho|$ is as large as possible while keeping both new values in the range $[-1, 1]$. It is easy to see that after this modification, at least one of $c_i$ and $c_j$ will be in $\{-1,1\}$, and at the same time the distance to every $x \in C$ does not change (within a part, all points of $C$ have the same sign, so the distance depends only on $c_i + c_j$). The iterations stop at a center $c'$, where in each part of $\Pi$ at most one coordinate is not in $\{-1,1\}$, and in total at most $2^m$ coordinates are not in $\{-1,1\}$.
Finally, we show that $c'$, which is an optimal center for $C$, has too large a cost for $X$. Consider the point $z \in X$ that is closest to $c'$, i.e., where each coordinate of $c'$ is rounded to the nearest value among $\{-1,1\}$. This point disagrees with $c'$ (i.e., rounding was “needed”) on at most $2^m$ coordinates, hence the distance from $c'$ to the antipodal point $-z \in X$ is at least $2(d - 2^m)$. For $C$ to be a weak $\epsilon$-coreset we need $2(d - 2^m) \le (1+\epsilon)\,\mathrm{OPT}(X) = (1+\epsilon)d$, hence $m \ge \log_2\frac{(1-\epsilon)d}{2} = \Omega(\log d)$.
We remark that the proof works also for the discrete space $\{-1,1\}^d$ (endowed with the $\ell_1$ norm), which has more limited choices for the center point, e.g., the rounded point $z$ in place of $c'$ above. Moreover, note that the distance of the modified optimal center $c'$ from the origin (the unique optimal center of $X$) is at least $d - 2^m$.
3.2 Upper bound of $2d$ for weak 0-coresets
We will show that if $X$ has a unique optimal center, then $X$ has a weak 0-coreset of size at most $2d$.
Two standard notions from convex geometry are the convex hull and the interior. The convex hull of $Y \subseteq \mathbb{R}^d$, denoted $\mathrm{conv}(Y)$, consists of all convex combinations of points in $Y$; that is, every point in $\mathrm{conv}(Y)$ can be expressed as a weighted sum of points from $Y$ with non-negative weights summing to $1$. The interior of $\mathrm{conv}(Y)$, denoted $\mathrm{int}(\mathrm{conv}(Y))$, refers to the set of points that lie strictly inside the convex hull; in other words, it consists of the points that can be surrounded by a small ball completely contained in $\mathrm{conv}(Y)$.
An important tool we use is Steinitz’s theorem [24].
Theorem 3.2 (Steinitz’s theorem).
Let $x \in \mathrm{int}(\mathrm{conv}(Y))$ for some $Y \subseteq \mathbb{R}^d$. Then there exists $Z \subseteq Y$ of size at most $2d$, such that $x \in \mathrm{int}(\mathrm{conv}(Z))$.
Next we introduce the notion of a complete set; we then provide an alternative formulation of Steinitz's theorem that will be useful in our application.
Definition 3.3.
A set $Y \subseteq \mathbb{R}^d$ is complete if for every $w \in \mathbb{R}^d \setminus \{0\}$ there exists $y \in Y$ with $\langle w, y\rangle > 0$.
Lemma 3.4.
$0 \in \mathrm{int}(\mathrm{conv}(Y))$ if and only if $Y$ is complete.
Proof.
Suppose $0 \in \mathrm{int}(\mathrm{conv}(Y))$ and fix $w \neq 0$. For a small enough $\epsilon > 0$ we have $\epsilon w \in \mathrm{conv}(Y)$; write it as a convex combination $\epsilon w = \sum_i \lambda_i y_i$, where $\lambda_i \ge 0$ and $\sum_i \lambda_i = 1$. Then,
$$\sum_i \lambda_i \langle w, y_i\rangle = \epsilon\,\|w\|_2^2 > 0.$$
If for all $i$, $\langle w, y_i\rangle \le 0$, then the sum would be non-positive; hence some $y_i \in Y$ satisfies $\langle w, y_i\rangle > 0$, and $Y$ is complete.
If $0 \notin \mathrm{int}(\mathrm{conv}(Y))$ then there exists a (supporting or separating) hyperplane containing $0$ such that $\mathrm{conv}(Y)$ lies in one of its closed halfspaces. Since $0$ is a point on the hyperplane, we can write the hyperplane equation as $\langle w, x\rangle = 0$ for some $w \neq 0$, oriented so that $\langle w, y\rangle \le 0$ for every $y \in Y$; hence $Y$ is not complete.
This leads to the following alternative statement of Steinitz’s theorem.
Corollary 3.5.
If $Y \subseteq \mathbb{R}^d$ is complete, then there exists $Z \subseteq Y$ of size at most $2d$, such that $Z$ is complete.
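Completeness is easy to test numerically, which may help build intuition for Definition 3.3 and Lemma 3.4. The following sketch (our naming, using scipy) checks whether the cone $\{w : \langle w, y\rangle \le 0 \ \forall y \in Y\}$ contains a nonzero vector:

```python
import numpy as np
from scipy.optimize import linprog

def is_complete(Y):
    """Y is complete iff no w != 0 satisfies <w, y> <= 0 for all y in Y
    (equivalently, iff 0 is interior to conv(Y), by Lemma 3.4).  We probe the
    cone {w : Yw <= 0} by maximizing each signed coordinate of w over it,
    intersected with a box to keep the LPs bounded."""
    Y = np.asarray(Y, dtype=float)
    d = Y.shape[1]
    for j in range(d):
        for sign in (1.0, -1.0):
            c = np.zeros(d)
            c[j] = -sign  # linprog minimizes, so this maximizes sign * w_j
            res = linprog(c, A_ub=Y, b_ub=np.zeros(len(Y)), bounds=(-1, 1))
            if res.status == 0 and -res.fun > 1e-9:
                return False  # found w != 0 with Yw <= 0: not complete
    return True

print(is_complete(np.vstack([np.eye(3), -np.eye(3)])))  # True
print(is_complete(np.eye(3)))                           # False
```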
W.l.o.g., assume that the origin is the unique optimal center of $X$, and let $r = \mathrm{OPT}(X)$. Since the optimum is unique, we can further assume that $\|x\|_1 = r$ for every $x \in X$; otherwise, any points that do not satisfy this can be removed without affecting the optimal solution or its cost. Given a pair $(x, s) \in X \times \{-1,1\}^d$, we denote by $H_{x,s}$ the hyperplane $\{c : \langle s, x - c\rangle = r\}$ and by $H^-_{x,s}$ the halfspace $\{c : \langle s, x - c\rangle \le r\}$. In general, $c$ is a feasible $1$-center of cost $r$ if $\langle s, x - c\rangle \le r$ for every such pair. Assuming $r$ is known, we can write an LP formulation of the $1$-center problem with variables $c = (c_1, \ldots, c_d)$ and the constraints $\langle s, x - c\rangle \le r$ for every $x \in X$ and $s \in \{-1,1\}^d$; that is, every constraint corresponds to a halfspace $H^-_{x,s}$. We further remove from the set of constraints all pairs $(x, s)$ for which $\langle s, x\rangle < r$ (i.e., constraints that are not tight at $c = 0$), and denote the updated set of constraints by $\mathcal{H}$. Note that no point of $X$ is removed entirely in this process, as each $x \in X$ satisfies $\langle \mathrm{sgn}(x), x\rangle = \|x\|_1 = r$. Since the origin is the unique optimal solution of $X$, this also does not change the set of feasible solutions, and the origin is the unique point satisfying all constraints in $\mathcal{H}$.
Proposition 3.6.
Let $\mathcal{H}' \subseteq \mathcal{H}$ be a non-empty set of constraints, and let $S' = \{s : (x, s) \in \mathcal{H}'\}$, so that $\langle s, x\rangle = r$ for every $(x, s) \in \mathcal{H}'$. Then the origin is the unique point satisfying all constraints of $\mathcal{H}'$ if and only if $S'$ is complete.
Proof.
($\Leftarrow$) Let $c \neq 0$; since $S'$ is complete, there exists $(x, s) \in \mathcal{H}'$ such that $\langle s, -c\rangle > 0$. That is, $\langle s, c\rangle < 0$, and since $\langle s, x\rangle = r$, we get $\langle s, x - c\rangle > r$; it follows that $c$ violates a constraint of $\mathcal{H}'$.
($\Rightarrow$) The set $\mathcal{H}'$ is non-empty, and clearly the origin satisfies all its constraints (each is tight at $0$). If $S'$ is not complete, then there exists $w \neq 0$ such that $\langle s, w\rangle \le 0$ for every $s \in S'$. Then for $c = -w \neq 0$ we have $\langle s, x - c\rangle = r + \langle s, w\rangle \le r$ for every $(x, s) \in \mathcal{H}'$, that is, $c$ also satisfies all constraints.
We are now ready to prove our main theorem:
Theorem 3.7.
Every instance with a unique optimum admits a weak $0$-coreset of size at most $2d$.
Proof.
By Proposition 3.6 (applied with $\mathcal{H}' = \mathcal{H}$), the set $S = \{s : (x, s) \in \mathcal{H}\}$ is complete, and by Corollary 3.5 there exists a complete $S' \subseteq S$ of size at most $2d$. For each $s \in S'$ we select a single point $x_s$ such that the pair $(x_s, s)$ is a constraint in $\mathcal{H}$. That is, we obtain a set of at most $2d$ distinct points from $X$, which we denote by $C$, and we denote by $\mathcal{H}'$ this refined set of constraints. Since $S'$ is complete, using the second direction of Proposition 3.6 we have that the origin is the unique center of cost at most $r$ for $C$; hence $\mathrm{OPT}(C) = r$, and the origin is the unique optimal center of $C$, which is also optimal for $X$. This means that $C$ is a weak $0$-coreset of size at most $2d$.
We conclude this section by providing an example of a set $X$ of size $2d$, where $X$ is the sole weak 0-coreset of itself, demonstrating that the upper bound of $2d$ cannot be further improved. The set is composed of the row vectors of the $d \times d$ Hadamard matrix and their negations. The proof is provided in Appendix A.1.
4 Value-preserving coresets
In this section we provide an algorithm for constructing a dimension-independent value-preserving $\epsilon$-coreset for 1-center. In addition, we provide a complete characterization for the special case $\epsilon = 0$, by showing a value-preserving $0$-coreset of size $d + 1$ (via Helly's theorem), and moreover that this bound is tight; see Section 4.2.
Theorem 4.1.
For every instance $X \subseteq \mathbb{R}^d$ and $0 < \epsilon < 1$, there exists a subset $C \subseteq X$ of $\mathrm{poly}(1/\epsilon)$ points such that $\mathrm{OPT}(C) \ge (1 - O(\epsilon))\,\mathrm{OPT}(X)$.
Our proof of Theorem 4.1 is based on the following LP formulation of $1$-center and its dual. Unlike many primal-dual algorithms, our algorithm needs to explicitly solve the dual LP and use an optimal dual solution to guide the coreset construction.
LP formulation of -center.
Observe that $1$-center on input $X = \{x_1, \ldots, x_n\} \subseteq \mathbb{R}^d$ is equivalent to the following mathematical program with variables $c \in \mathbb{R}^d$ and $r \in \mathbb{R}$.
$$\begin{aligned}
\text{minimize}\quad & r \\
\text{subject to}\quad & \|x_i - c\|_1 \le r, \qquad \forall i \in [n].
\end{aligned}$$
This is formally not an LP due to the absolute values in the constraints, but each such constraint can obviously be expanded into $2^d$ linear constraints, which leads to the following equivalent LP formulation.
$$\begin{aligned}
\text{minimize}\quad & r \\
\text{subject to}\quad & \langle s, x_i - c\rangle \le r, \qquad \forall i \in [n],\ s \in \{-1,1\}^d.
\end{aligned}$$
We derive the dual of this LP, with variables $(y_{i,s})_{i \in [n],\, s \in \{-1,1\}^d}$, as follows. One can think of each pair $(i, s)$ as generating a “new” point $s \circ x_i$ (the coordinate-wise product), that is obtained from $x_i$ by flipping signs.
$$\begin{aligned}
\text{maximize}\quad & \textstyle\sum_{i,s} y_{i,s}\,\langle s, x_i\rangle & \text{(DLP)} \\
\text{subject to}\quad & \textstyle\sum_{i,s} y_{i,s}\, s = 0, \\
& \textstyle\sum_{i,s} y_{i,s} = 1, \\
& y_{i,s} \ge 0, \qquad \forall i \in [n],\ s \in \{-1,1\}^d.
\end{aligned}$$
Coreset Algorithm.
Our coreset construction, presented in Algorithm 1, builds the coreset by sampling, where the probabilities come from an optimal solution $y^*$ to (DLP). Technically, our sampling step is similar to the randomized rounding that is often used in approximation algorithms, however we use it here to generate a coreset instead of a solution. The analysis of this algorithm appears in Section 4.1. We slightly abuse terminology and refer to the sample $S$ as a multiset, but formally it is always a sequence of pairs, namely $S = \big((x^{(1)}, s^{(1)}), \ldots, (x^{(m)}, s^{(m)})\big)$; hence, summing over all $(x, s) \in S$ actually means summation over $l \in [m]$.
Remark 4.2.
Algorithm 1 can be implemented in polynomial time. The most expensive step is clearly to solve the dual LP. Since it has only $d + 1$ equality constraints, one can solve it while using only $d + 1$ non-zero variables. Alternatively, one can first solve the primal LP using an ellipsoid algorithm, to find a subset of constraints that attains the same optimal value, and then solve the dual of this smaller primal LP. Moreover, inspecting the analysis, particularly (6), one can verify that it suffices to approximate the optimal value of the dual LP within factor $1 - \epsilon$.
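To make Algorithm 1 concrete, here is a minimal sketch for small $d$ that enumerates all $2^d$ sign vectors and solves (DLP) directly; the function name, the use of scipy's LP solver, and returning the points in shifted coordinates are our choices, and the sample size $m$ is left as a parameter (the analysis sets $m = \mathrm{poly}(1/\epsilon)$):

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def value_preserving_coreset(X, m, rng=None):
    """Sketch of the sampling-based construction: shift X so that 0 is in X,
    solve the dual LP (DLP) exactly, then draw m point/sign pairs i.i.d.
    from the optimal dual solution and output the sampled points."""
    rng = rng or np.random.default_rng()
    X = [np.asarray(x, float) - np.asarray(X[0], float) for x in X]  # shift
    d = len(X[0])
    signs = [np.array(s) for s in itertools.product([-1, 1], repeat=d)]
    pairs = [(i, s) for i in range(len(X)) for s in signs]
    obj = np.array([float(s @ X[i]) for i, s in pairs])
    # Equality constraints: sum y_{i,s} * s = 0 (d rows) and sum y = 1.
    A_eq = np.vstack([np.array([s[j] for _, s in pairs]) for j in range(d)]
                     + [np.ones(len(pairs))])
    b_eq = np.append(np.zeros(d), 1.0)
    res = linprog(-obj, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    assert res.status == 0
    y = np.clip(res.x, 0, None)
    y /= y.sum()  # optimal dual solution as a probability distribution
    idx = rng.choice(len(pairs), size=m, p=y)
    return [X[pairs[t][0]] for t in idx]  # coreset (in shifted coordinates)
```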
4.1 Proof of Theorem 4.1: Analysis of Algorithm 1
We shall prove that, with constant probability, the algorithm's output $C$ satisfies $\mathrm{OPT}(C) \ge (1 - O(\epsilon))\,\mathrm{OPT}(X)$. The plan is to build a dual solution from $C$, and show that with high probability, this dual solution is feasible for (DLP) on $C$ and its objective is at least $1 - O(\epsilon)$ times the optimal value (of the same dual LP) for $X$. This would imply that $\mathrm{DLP}(C) \ge (1 - O(\epsilon))\,\mathrm{DLP}(X)$, where $\mathrm{DLP}(\cdot)$ denotes the optimal value of (DLP) on a given input, and by strong LP duality this is equivalent to $\mathrm{OPT}(C) \ge (1 - O(\epsilon))\,\mathrm{OPT}(X)$, proving the theorem.
A lower bound on $\mathrm{OPT}(X)$.
Recall that Algorithm 1 shifts $X$ so that $0 \in X$. This does not change the optimal value, hence our analysis simply assumes that the input already satisfies $0 \in X$, avoiding cumbersome notations for before and after the shift. This property of $X$ yields the following immediate lower bound on $\mathrm{OPT}(X)$.
Lemma 4.3.
If $X$ satisfies $0 \in X$, then $\mathrm{OPT}(X) \ge \frac12 \max_{x \in X}\|x\|_1$.
Proof.
For every point $x \in X$, letting $c^*$ be an optimal center, $\|x\|_1 \le \|x - c^*\|_1 + \|c^* - 0\|_1 \le 2\,\mathrm{OPT}(X)$, using that $0 \in X$.
Dual solution induced by .
We now wish to use the multiset $S$ from Algorithm 1 to construct a solution for the dual LP. Informally, the idea is to construct a solution (one can also think of it as a vector with $n \cdot 2^d$ coordinates) that is the average of $m$ sparse solutions, namely, each has a single non-zero variable, whose value is $1$ and whose index is drawn at random from the probability distribution defined by $y^*$ (think of $y^*$ as a vector with $n \cdot 2^d$ coordinates that sum to $1$), using precisely the random draws made in Algorithm 1. Next, we formalize this construction in a slightly more general form.
Given a multiset $T$ of size $m$, such as $S$, we define the dual solution induced by $T$, denoted $y^T$, as follows. Take each pair $(x, s) \in T$ and build a dual solution in which the variable corresponding to this pair is set to $1$ and the other variables are set to zero, and then average all these $m$ solutions. Formally, $y^T$ is given by
$$y^T_{i,s} := \frac{1}{m}\,\Big|\big\{l \in [m] : (x^{(l)}, s^{(l)}) = (x_i, s)\big\}\Big|.$$
The following fact rewrites certain linear forms over $y^T$, that appear in (DLP), in terms of the pairs in $T$.
Fact 4.4.
Let $y^T$ be induced by some $T$ as above. Then for every $j \in [d]$,
$$\sum_{i,s} y^T_{i,s}\, s_j = \frac{1}{m}\sum_{(x,s) \in T} s_j
\qquad\text{and}\qquad
\sum_{i,s} y^T_{i,s}\,\langle s, x_i\rangle = \frac{1}{m}\sum_{(x,s) \in T}\langle s, x\rangle.$$
Initial dual solution.
Before building a feasible dual solution for (DLP) on $C$, we first consider an initial dual solution $\bar{y} := y^S$, which is just the solution induced by the multiset $S$ from Algorithm 1. We shall see below that $\bar{y}$ is feasible in an expected sense, and that turning the expectation into a high-probability guarantee introduces an additive error in the constraints (and in the objective). Nonetheless, this is still useful, as we will later “fix” this solution into one that does satisfy the constraints (without additive error).
Let us examine this initial random solution $\bar{y}$. It always satisfies the second constraint of (DLP) by construction (recall that $\bar{y}$ is the average of $m$ sparse solutions, each summing to $1$), i.e., $\sum_{i,s} \bar{y}_{i,s} = 1$.
By linearity of expectation, $\bar{y}$ satisfies the first constraint in expectation (the middle term uses Fact 4.4 to provide an alternative formulation),
$$\mathbb{E}\Big[\sum_{i,s}\bar{y}_{i,s}\,s_j\Big] = \mathbb{E}\Big[\frac{1}{m}\sum_{(x,s)\in S} s_j\Big] = \sum_{i,s} y^*_{i,s}\,s_j = 0, \qquad \forall j \in [d]. \tag{5}$$
And again by linearity of expectation, the expected objective of $\bar{y}$ is the same as that of the optimal solution $y^*$, because:
$$\mathbb{E}\Big[\sum_{i,s}\bar{y}_{i,s}\,\langle s, x_i\rangle\Big] = \sum_{i,s} y^*_{i,s}\,\langle s, x_i\rangle = \mathrm{DLP}(X) = \mathrm{OPT}(X). \tag{6}$$
Almost feasibility.
We bound the deviation of $\bar{y}$ from satisfying the first constraint, as follows. Consider $j \in [d]$ and let $E_j$ be the event that $\big|\sum_{i,s}\bar{y}_{i,s}\,s_j\big| > \epsilon$, which by Fact 4.4 can also be written as $\big|\frac{1}{m}\sum_{(x,s)\in S} s_j\big| > \epsilon$. Informally, we want to bound the probability of this event, because when it does not occur, the constraint is “almost satisfied”. By applying Hoeffding's inequality, where we use (5) for the expectation and the fact that each $s_j \in \{-1,1\}$, we get (by our choice of $m$),
$$\Pr[E_j] \le 2\exp(-\epsilon^2 m / 2) \le \mathrm{poly}(\epsilon). \tag{7}$$
A feasible solution $\hat{y}$.
We next use $S$ to construct a new multiset $\hat{S}$ whose induced dual solution $\hat{y} := y^{\hat{S}}$ is feasible, and in particular satisfies the first constraint. We will then take our final dual solution to be $\hat{y}$, and it will remain to bound its objective value. At a high level, $\hat{S}$ is obtained by “flipping” some of the signs appearing in $S$. More precisely, given $S = \big((x^{(1)}, s^{(1)}), \ldots, (x^{(m)}, s^{(m)})\big)$, we shall define new sign vectors $\hat{s}^{(1)}, \ldots, \hat{s}^{(m)}$ but keep the exact same points $x^{(1)}, \ldots, x^{(m)}$, and construct $\hat{S} = \big((x^{(1)}, \hat{s}^{(1)}), \ldots, (x^{(m)}, \hat{s}^{(m)})\big)$. It will be convenient to arrange the sign vectors as the rows of a matrix $M \in \{-1,1\}^{m \times d}$, and use this to define a new matrix $\hat{M}$, whose rows define the new sign vectors.
We construct $\hat{M}$ (from $M$) using the following procedure, which operates separately on each column $j \in [d]$. First copy column $j$ of $M$ to be also column $j$ of $\hat{M}$, and let $m_+$ and $m_-$ be the number of positive and negative entries in this column, respectively. If $m_+ = m_-$, leave this column of $\hat{M}$ as is, noticing that it sums to zero. Otherwise, flip $|m_+ - m_-|/2$ signs chosen carefully from among the majority sign in column $j$, and this will ensure that column $j$ of $\hat{M}$ sums to zero. (This is possible because $m$ is even.) The careful choice (which signs to flip) will favor row indices $l$ with small value $|x^{(l)}_j|$. Formally, let $R_j$ be the set of rows with the majority sign in column $j$ of $M$, then sort the indices $l \in R_j$ by their value $|x^{(l)}_j|$, pick the indices with smallest value (breaking ties arbitrarily), and flip their entries. As explained above, the rows of the resulting matrix $\hat{M}$ define sign vectors $\hat{s}^{(1)}, \ldots, \hat{s}^{(m)}$, which in turn define $\hat{S}$, and our final dual solution is $\hat{y} = y^{\hat{S}}$.
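A sketch of this column-balancing procedure (our naming; `M` holds the sign vectors as rows and `xs` the corresponding points):

```python
def balance_signs(M, xs):
    """In every column whose signs do not cancel, flip the surplus of the
    majority sign, preferring rows l with smallest |xs[l][j]| so that the
    dual objective changes least.  Assumes entries are +-1 and m is even."""
    M = [row[:] for row in M]
    for j in range(len(M[0])):
        surplus = sum(row[j] for row in M)  # (#positives - #negatives)
        if surplus == 0:
            continue
        maj = 1 if surplus > 0 else -1
        rows = sorted((l for l in range(len(M)) if M[l][j] == maj),
                      key=lambda l: abs(xs[l][j]))
        for l in rows[:abs(surplus) // 2]:  # flips make the column sum zero
            M[l][j] = -maj
    return M
```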
Lemma 4.5.
The solution $\hat{y}$ is feasible for (DLP) on $C$.
Proof.
The first constraint is satisfied because the signs in every column of $\hat{M}$ sum to zero. The second constraint is satisfied because flipping a sign “moves” some amount (namely $\frac{1}{m}$) from variable $y_{i,s}$ to $y_{i,s'}$, where $s, s'$ differ in coordinate $j$, but the total sum remains $1$. Finally, $\hat{y}$ uses only variables $y_{i,s}$ with $x_i \in C$, i.e., all other variables are $0$, and thus it is feasible not only for (DLP) on $X$ but also on $C$.
The next lemma bounds the decrease in objective when constructing $\hat{S}$ from $S$, i.e., comparing the objective of $\hat{y}$ with that of $\bar{y}$. Its proof is based on an averaging argument.
Lemma 4.6.
For all $j \in [d]$, letting $k_j := \frac12\big|\sum_{l \in [m]} s^{(l)}_j\big|$ denote the number of flips in column $j$,
$$\frac{1}{m}\sum_{l \in [m]} \hat{s}^{(l)}_j x^{(l)}_j \;\ge\; \frac{1}{m}\sum_{l \in [m]} s^{(l)}_j x^{(l)}_j \;-\; \frac{4 k_j}{m^2}\sum_{l \in [m]} \big|x^{(l)}_j\big|.$$
Proof.
We consider only the case where the majority sign in column $j$ is positive, as the other case is symmetric. Recall that the construction of $\hat{M}$ flips in column $j$ exactly $k_j$ signs, picking from the majority set $R_j$ the indices with smallest value $|x^{(l)}_j|$. Every such flip changes the column's contribution to the objective by at most $\frac{2}{m}|x^{(l)}_j|$ in absolute value, and the number of choices is $|R_j| \ge m/2$, hence by an averaging argument, the total decrease is at most
$$\frac{2}{m}\cdot\frac{k_j}{|R_j|}\sum_{l \in R_j}\big|x^{(l)}_j\big| \;\le\; \frac{4 k_j}{m^2}\sum_{l \in [m]}\big|x^{(l)}_j\big|.$$
The objective value of $\hat{y}$.
Using Fact 4.4, we can write the objective value of $\bar{y}$ as
$$\mathrm{val}(\bar{y}) := \sum_{i,s}\bar{y}_{i,s}\,\langle s, x_i\rangle = \frac{1}{m}\sum_{l \in [m]}\langle s^{(l)}, x^{(l)}\rangle, \tag{8}$$
and that of $\hat{y}$ as
$$\mathrm{val}(\hat{y}) := \sum_{i,s}\hat{y}_{i,s}\,\langle s, x_i\rangle = \frac{1}{m}\sum_{l \in [m]}\langle \hat{s}^{(l)}, x^{(l)}\rangle. \tag{9}$$
Our plan is to bound $\mathrm{val}(\hat{y})$ (with high probability) by relating it to $\mathrm{val}(\bar{y})$, and then bound the latter using Chebyshev's inequality. These steps rely on the next two lemmas; before proving them, we show how they imply Theorem 4.1. In what follows, let $D := \sum_{j \in [d]}\frac{4 k_j}{m^2}\sum_{l \in [m]}|x^{(l)}_j|$, and note that summing Lemma 4.6 over all $j \in [d]$ gives $\mathrm{val}(\hat{y}) \ge \mathrm{val}(\bar{y}) - D$.
Lemma 4.7.
$\mathbb{E}[D] \le O(\epsilon)\cdot\mathrm{OPT}(X)$.
Lemma 4.8.
$\mathrm{Var}[\mathrm{val}(\bar{y})] \le \frac{4\,\mathrm{OPT}(X)^2}{m}$.
Proof of Theorem 4.1.
We already established in Lemma 4.5 that $\hat{y}$ is a feasible solution for (DLP) on $C$, hence $\mathrm{OPT}(C) = \mathrm{DLP}(C) \ge \mathrm{val}(\hat{y})$. It thus suffices to show that, with constant probability, $\mathrm{val}(\hat{y}) \ge (1 - O(\epsilon))\,\mathrm{OPT}(X)$. By Lemma 4.7 and Markov's inequality, with probability at least $0.9$ we have $D \le O(\epsilon)\,\mathrm{OPT}(X)$. Next, we apply Chebyshev's inequality to $\mathrm{val}(\bar{y})$, using the variance bound of Lemma 4.8 and our choice of $m = \mathrm{poly}(1/\epsilon)$, to obtain that with probability at least $0.9$, $\big|\mathrm{val}(\bar{y}) - \mathbb{E}[\mathrm{val}(\bar{y})]\big| \le O(\epsilon)\,\mathrm{OPT}(X)$. Recall also from (6) that $\mathbb{E}[\mathrm{val}(\bar{y})] = \mathrm{OPT}(X)$. It follows by a union bound that with probability at least $0.8$,
$$\mathrm{val}(\hat{y}) \ge \mathrm{val}(\bar{y}) - D \ge (1 - O(\epsilon))\,\mathrm{OPT}(X).$$
Proof of Lemma 4.7.
Observe that $\neg E_j$ is the event that $\big|\frac{1}{m}\sum_l s^{(l)}_j\big| \le \epsilon$, in which case the number of flips satisfies $k_j \le \epsilon m / 2$; we can always bound $k_j \le m/2$ (which holds regardless). Thus,
$$D \;\le\; \sum_{j \in [d]}\frac{2\epsilon}{m}\sum_{l \in [m]}\big|x^{(l)}_j\big| \;+\; \sum_{j \in [d]}\mathbb{1}_{E_j}\cdot\frac{2}{m}\sum_{l \in [m]}\big|x^{(l)}_j\big|. \tag{10}$$
To bound (10) further, we handle its two parts separately. The first part is bounded deterministically in (11) using Lemma 4.3, and the expectation of the second part is bounded in Claim 4.9; together they prove the lemma.
$$\sum_{j \in [d]}\frac{2\epsilon}{m}\sum_{l \in [m]}\big|x^{(l)}_j\big| = \frac{2\epsilon}{m}\sum_{l \in [m]}\big\|x^{(l)}\big\|_1 \le 4\epsilon\cdot\mathrm{OPT}(X). \tag{11}$$
Claim 4.9.
$\mathbb{E}\Big[\sum_{j \in [d]}\mathbb{1}_{E_j}\cdot\frac{2}{m}\sum_{l \in [m]}\big|x^{(l)}_j\big|\Big] \le O(\epsilon)\cdot\mathrm{OPT}(X)$.
Proof.
The main obstacle here is that we have a product of two random variables, $\mathbb{1}_{E_j}$ and $|x^{(l)}_j|$, that are not independent. Intuitively, their dependence should be relatively small, and our actual proof bounds their product with another product, of two independent random variables. Recall that $S$ consists of $m$ i.i.d. samples, hence we can focus on analyzing (say) the last one; formally,
$$\mathbb{E}\Big[\sum_{j}\mathbb{1}_{E_j}\cdot\frac{2}{m}\sum_{l}\big|x^{(l)}_j\big|\Big] = 2\sum_j \mathbb{E}\big[\mathbb{1}_{E_j}\,\big|x^{(m)}_j\big|\big].$$
Consider $j \in [d]$, and define $E'_j$ to be the event that $\big|\sum_{l \in [m-1]} s^{(l)}_j\big| > \epsilon m - 1$. Observe that $E_j \subseteq E'_j$, because if $E_j$ occurs then $E'_j$ must occur as well. The crux is that $E'_j$ is independent of the last sample $(x^{(m)}, s^{(m)})$, and thus $\mathbb{E}\big[\mathbb{1}_{E'_j}|x^{(m)}_j|\big] = \Pr[E'_j]\cdot\mathbb{E}\big[|x^{(m)}_j|\big]$. We can still bound $\Pr[E'_j] \le \mathrm{poly}(\epsilon)$ by the argument we had for $E_j$ in (7), except that we use $m - 1$ instead of $m$. Altogether,
$$2\sum_j\mathbb{E}\big[\mathbb{1}_{E_j}|x^{(m)}_j|\big] \le 2\,\mathrm{poly}(\epsilon)\sum_j\mathbb{E}\big[|x^{(m)}_j|\big] = 2\,\mathrm{poly}(\epsilon)\cdot\mathbb{E}\big[\|x^{(m)}\|_1\big],$$
and the claim follows by using Lemma 4.3 to bound $\|x^{(m)}\|_1 \le 2\,\mathrm{OPT}(X)$.
Proof of Lemma 4.8.
By (8), $\mathrm{val}(\bar{y})$ is the average of $m$ i.i.d. terms $\langle s^{(l)}, x^{(l)}\rangle$, each bounded in absolute value by $\max_{x \in X}\|x\|_1 \le 2\,\mathrm{OPT}(X)$ (Lemma 4.3). Hence $\mathrm{Var}[\mathrm{val}(\bar{y})] \le \frac{1}{m}\big(2\,\mathrm{OPT}(X)\big)^2$.
4.2 Lower bound for value-preserving $0$-coresets
The Helly number of a body $K \subseteq \mathbb{R}^d$, denoted $h(K)$, is the smallest positive integer $h$, such that if every subfamily of size $h$ of a family of translates of $K$ has a non-empty intersection, then the entire family also has a non-empty intersection. Helly's theorem shows that the Helly number of every convex body is at most $d + 1$.
The Helly number is equivalent to a worst-case bound on the size of value-preserving $0$-coresets (cf. [6]). In the next proposition we show that the upper bound is tight, that is, the Helly number of the $\ell_1$ ball is $d + 1$.
Proposition 4.10.
Assume $d \ge 3$. There exists $X \subseteq \mathbb{R}^d$ of size $d + 1$, such that every value-preserving $0$-coreset of $X$ must have size $d + 1$ (i.e., the only such coreset is $X$ itself).
Proof.
Consider the set of points $X = \{-\mathbb{1}, v_1, \ldots, v_d\}$, where $\mathbb{1} = (1, \ldots, 1)$ and $v_i$ is the vector with all entries equal to 1 except for a $-1$ in the $i$-th position. We will show that:
- The point $c = 0$ is the only center of $X$ with a cost of $d$ (and $\mathrm{OPT}(X) = d$).
- Each subset of $X$ containing $d$ points has a cost strictly less than $d$.
To prove the first claim, let $c$ be an optimal center of $X$; clearly $r := \mathrm{cost}(X, c) \le d$, since the zero vector has a cost of $d$. We know that for every $i$:
$$d \ge \|v_i - c\|_1 = \sum_{j \ne i}|1 - c_j| + |{-1} - c_i| \ge \sum_{j \ne i}(1 - c_j) + (1 + c_i) = d - \sum_j c_j + 2 c_i.$$
This implies $2 c_i \le \sum_j c_j$ for all $i$; summing over all $i$ yields $2\sum_j c_j \le d\sum_j c_j$, leading to $\sum_j c_j \ge 0$ (as $d \ge 3$). On the other hand, $d \ge \|{-\mathbb{1}} - c\|_1 = \sum_j |1 + c_j| \ge d + \sum_j c_j$, hence $\sum_j c_j \le 0$. It follows that $\sum_j c_j = 0$, and then $c_i \le 0$ for all $i$. Given that these non-positive coordinates sum to $0$, this implies $c = 0$.
For the second claim, consider any subset $C \subseteq X$ of $d$ points. If $-\mathbb{1} \notin C$, then clearly $\mathrm{cost}(C, \mathbb{1}) = 2 < d$. Assuming $-\mathbb{1} \in C$ and, without loss of generality, $v_d \notin C$, we set $c_i = -\rho$ for $i < d$, and $c_d = (d-2)\rho$, for a small $\rho \in \big(0, \frac{1}{d-2}\big)$. The cost for this choice of $c$ is:
$$\|{-\mathbb{1}} - c\|_1 = (d-1)(1-\rho) + 1 + (d-2)\rho = d - \rho,
\qquad
\|v_j - c\|_1 = (1-\rho) + (d-2)(1+\rho) + 1 - (d-2)\rho = d - \rho$$
for every $j < d$; hence $\mathrm{cost}(C, c) = d - \rho < d$.
5 Generalization to discrete metric spaces
In this section we show how to extend our lower bound results to other discrete metric spaces, by providing two examples, for the Kendall's tau and Jaccard metric spaces.
5.1 Theorem 2.2 in Kendall’s tau metric space
Denote by $S_d$ the set of permutations over the set of elements $[d] = \{1, \ldots, d\}$. For two permutations $\pi, \sigma \in S_d$, the Kendall's tau distance $K(\pi, \sigma)$ is the number of pairs $i < j$ such that the order of $i$ and $j$ is reversed between $\pi$ and $\sigma$. Formally,
$$K(\pi, \sigma) = \Big|\big\{(i, j) : i < j,\ (\pi(i) - \pi(j))(\sigma(i) - \sigma(j)) < 0\big\}\Big|.$$
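For concreteness, a direct $O(d^2)$ implementation of this definition:

```python
from itertools import combinations

def kendall_tau(pi, sigma):
    """Kendall's tau distance: the number of pairs ordered differently by
    the two permutations (given as sequences of ranks)."""
    return sum(1 for i, j in combinations(range(len(pi)), 2)
               if (pi[i] - pi[j]) * (sigma[i] - sigma[j]) < 0)

print(kendall_tau([1, 2, 3], [3, 2, 1]))  # -> 3 (all pairs reversed)
```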
Following a similar approach as in Section 2, we will prove a lower bound of $2^{\Omega(d)}$ on the size of weak coresets.
Proposition 5.1.
There exists a set $X \subseteq S_d$, such that for every sufficiently small fixed $\epsilon \ge 0$, if $C$ is a weak $\epsilon$-coreset of $X$, then $|C| = 2^{\Omega(d)}$.
Proof.
We will design a gadget that allows us to replicate the structure of the set $X$ from the proof of Theorem 2.2. We fix a handful of base permutations over $4$ elements, which play the roles of the coordinate values of the $\ell_1$ construction. The distances between these base permutations are detailed in Figure 1.
We will now introduce some notation that allows us to describe permutations in $S_d$ using the base permutations. We partition the $d$ elements into blocks of $4$, and with a slight abuse of notation, we write a permutation of $S_d$ as a product of base permutations, one acting on each block, where the product of two permutations on disjoint blocks is their composition; in this way, a base permutation over $4$ elements is extended to a permutation over $d$ elements. Now we are ready to describe $X$: it consists of the balanced products of two antipodal base permutations, together with the two extreme permutations $\pi$ and $\bar{\pi}$ composed of them. The optimal cost is attained analogously to Theorem 2.2 (see also Figure 1), and it is indeed optimal because the base permutations composing $\pi$ and $\bar{\pi}$ are antipodal (for every block, every pair that agrees with one disagrees with the other, and vice versa).
Let $C$ be a weak $\epsilon$-coreset of $X$. Notice that $C$ must contain both $\pi$ and $\bar{\pi}$: otherwise, if both are missing, consider the “origin” permutation as a center, and if $C$ contains just one of them, say $\pi$, consider a permutation shifted toward $\pi$ as a center. In both cases, we obtain an optimal center of $C$ whose cost on $X$ exceeds $(1+\epsilon)\,\mathrm{OPT}(X)$, in contradiction to $C$ being a weak $\epsilon$-coreset.
Similarly to Proposition 2.3, it follows that if $|C| = 2^{o(d)}$, then there exists a permutation $\tau$ whose Kendall's tau distance to every point of $C$ is small enough for $\tau$ to be an optimal center of $C$. However, $\mathrm{cost}(X, \tau) > (1+\epsilon)\,\mathrm{OPT}(X)$, by considering the permutation of $X$ opposite to $\tau$; thus $|C| = 2^{\Omega(d)}$.
5.2 Theorem 3.1 in the Jaccard metric space
For two sets $A$ and $B$, the Jaccard distance is defined as $J(A, B) := 1 - \frac{|A \cap B|}{|A \cup B|}$. We will use the symbol $\times$ to denote the product of set families.
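And a one-line implementation of this definition:

```python
def jaccard(A, B):
    """Jaccard distance between two finite sets."""
    A, B = set(A), set(B)
    return 1 - len(A & B) / len(A | B) if A | B else 0.0

print(jaccard({1, 2, 3}, {2, 3, 4}))  # -> 0.5
```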
Following a similar approach as in Section 3.1, we will prove a lower bound of $\Omega(\log d)$ on the size of weak coresets.
Proposition 5.2.
Let $X$ be the collection of sets constructed below, i.e., points in the Jaccard metric space. For every sufficiently small fixed $\epsilon \ge 0$, a weak $\epsilon$-coreset of $X$ has $\Omega(\log d)$ points.
Proof.
Let $U$ denote the universe. We refer to elements of $U$ as type 0 if their remainder modulo 3 is 0, and as type 1 otherwise. Note that all type 0 elements appear in every set of $X$, and thus appear in every optimal center of every coreset of $X$. It is then easy to verify that the set of optimal centers of $X$ is the product, over the blocks, of the admissible choices of type 1 elements, along with all type 0 elements, and that the optimal cost is the same for all of these centers.
Using the same arguments as in the proof of Theorem 3.1, assume $C$ is a weak $\epsilon$-coreset of size $m$. We denote by the two building blocks of each set in $X$ the parts corresponding to the roles of $+1$ and $-1$ in the proof of Theorem 3.1. Every set in $C$ induces a partition of the blocks into at most two parts, depending on their assignment to one building block or the other. Again, we use $\Pi$ to denote the common refinement of all these partitions.
Let $c$ be any optimal 1-center of $C$, and consider a part $P \in \Pi$. Let $t_P$ denote the number of type 1 elements from $P$ in $c$. By selecting different type 1 elements from $P$, say the first $t_P$ type 1 indices of $P$, we preserve the cost of $c$ over $C$, as the number of differences $c$ has with each set of $C$ remains unchanged. We apply this modification to all parts of $\Pi$ and denote the new center by $c'$. However, by the construction of $c'$, there exists $x \in X$ such that $x$ intersects at most one type 1 element of $c'$ in every part, while the union of $x$ and $c'$ covers almost all type 1 elements. Consequently, the Jaccard distance between $c'$ and $x$ is too large unless $\Pi$ has $\Omega(d)$ parts, i.e., unless $m = |C| = \Omega(\log d)$.
References
- [1] Amir Abboud, MohammadHossein Bateni, Vincent Cohen-Addad, Karthik C. S., and Saeed Seddighin. On complexity of 1-center in various metrics. In APPROX/RANDOM, volume 275 of LIPIcs, pages 1:1–1:19. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPIcs.APPROX/RANDOM.2023.1.
- [2] Pankaj K. Agarwal, Sariel Har-Peled, and Kasturi R. Varadarajan. Approximating extent measures of points. J. ACM, 51(4):606–635, 2004. doi:10.1145/1008731.1008736.
- [3] Pankaj K. Agarwal, Sariel Har-Peled, and Kasturi R. Varadarajan. Geometric approximation via coresets. In Combinatorial and computational geometry, volume 52 of MSRI Publications, chapter 1, pages 1–30. Cambridge University Press, 2005.
- [4] Sepideh Aghamolaei and Mohammad Ghodsi. A composable coreset for k-center in doubling metrics. In CCCG, pages 165–171, 2018. URL: http://www.cs.umanitoba.ca/%7Ecccg2018/papers/session4A-p2.pdf.
- [5] Christian Bachmaier, Franz J. Brandenburg, Andreas Gleißner, and Andreas Hofmeier. On the hardness of maximum rank aggregation problems. J. Discrete Algorithms, 31:2–13, 2015. doi:10.1016/j.jda.2014.10.002.
- [6] René Brandenberg and Stefan König. No dimension-independent core-sets for containment under homothetics. Discrete & Computational Geometry, 49(1):3–21, 2013. doi:10.1007/s00454-012-9462-0.
- [7] Mihai Bădoiu and Kenneth L. Clarkson. Optimal core-sets for balls. Comput. Geom., 40(1):14–22, 2008. doi:10.1016/j.comgeo.2007.04.002.
- [8] Mihai Bădoiu, Sariel Har-Peled, and Piotr Indyk. Approximate clustering via core-sets. In STOC, pages 250–257. ACM, 2002. doi:10.1145/509907.509947.
- [9] Marc Bury, Michele Gentili, Chris Schwiegelshohn, and Mara Sorella. Polynomial time approximation schemes for all 1-center problems on metric rational set similarities. Algorithmica, 83(5):1371–1392, 2021. doi:10.1007/s00453-020-00787-3.
- [10] Matteo Ceccarello, Andrea Pietracaprina, and Geppino Pucci. Solving k-center clustering (with outliers) in MapReduce and streaming, almost as accurately as sequentially. Proc. VLDB Endow., 12(7):766–778, 2019. doi:10.14778/3317315.3317319.
- [11] Diptarka Chakraborty, Debarati Das, and Robert Krauthgamer. Approximating the median under the Ulam metric. In SODA, pages 761–775. SIAM, 2021. doi:10.1137/1.9781611976465.48.
- [12] Diptarka Chakraborty, Debarati Das, and Robert Krauthgamer. Clustering permutations: New techniques with streaming applications. In ITCS, volume 251 of LIPIcs, pages 31:1–31:24. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPIcs.ITCS.2023.31.
- [13] Diptarka Chakraborty, Kshitij Gajjar, and Agastya Vibhuti Jha. Approximating the center ranking under Ulam. In FSTTCS, volume 213 of LIPIcs, pages 12:1–12:21. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPIcs.FSTTCS.2021.12.
- [14] Timothy M. Chan. Faster core-set constructions and data-stream algorithms in fixed dimensions. Computational Geometry, 35(1):20–35, 2006. doi:10.1016/j.comgeo.2005.10.002.
- [15] Ke Chen. On coresets for k-median and k-means clustering in metric and euclidean spaces and their applications. SIAM Journal on Computing, 39(3):923–947, 2009. doi:10.1137/070699007.
- [16] Flavio Chierichetti, Ravi Kumar, Sandeep Pandey, and Sergei Vassilvitskii. Finding the Jaccard median. In SODA, pages 293–311. SIAM, 2010. doi:10.1137/1.9781611973075.25.
- [17] Michael B. Cohen, Yin Tat Lee, Gary L. Miller, Jakub Pachocki, and Aaron Sidford. Geometric median in nearly linear time. In STOC, pages 9–21. ACM, 2016. doi:10.1145/2897518.2897647.
- [18] Vincent Cohen-Addad, David Saulpic, and Chris Schwiegelshohn. A new coreset framework for clustering. In STOC, pages 169–182. ACM, 2021. doi:10.1145/3406325.3451022.
- [19] Matan Danos. Coresets for clustering by uniform sampling and generalized rank aggregation. Master’s thesis, Weizmann Institute of Science, 2021.
- [20] Mark de Berg, Leyla Biabani, and Morteza Monemizadeh. k-center clustering with outliers in the MPC and streaming model. In IPDPS, pages 853–863. IEEE, 2023. doi:10.1109/IPDPS54959.2023.00090.
- [21] Dan Feldman. Core-sets: An updated survey. WIREs Data Mining and Knowledge Discovery, 10(1):e1335, 2020. doi:10.1002/widm.1335.
- [22] Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. In STOC, pages 569–578. ACM, 2011. doi:10.1145/1993636.1993712.
- [23] Teofilo F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293–306, 1985. doi:10.1016/0304-3975(85)90224-5.
- [24] Peter M Gruber and Jörg M Wills. Handbook of convex geometry, Volume B. North-Holland, 1993.
- [25] Claire Kenyon-Mathieu and Warren Schudy. How to rank with few errors. In STOC, pages 95–103. ACM, 2007. doi:10.1145/1250790.1250806.
- [26] Ming Li, Bin Ma, and Lusheng Wang. On the closest string and substring problems. J. ACM, 49(2):157–171, 2002. doi:10.1145/506147.506150.
- [27] Bin Ma and Xiaoming Sun. More efficient algorithms for closest string and substring problems. SIAM Journal on Computing, 39(4):1432–1443, 2010. doi:10.1137/080739069.
- [28] Alexander Munteanu and Chris Schwiegelshohn. Coresets-methods and history: A theoreticians design pattern for approximation and streaming algorithms. Künstliche Intell., 32(1):37–53, 2018. doi:10.1007/s13218-017-0519-3.
- [29] François Nicolas and Eric Rivals. Hardness results for the center and median string problems under the weighted and unweighted edit distances. J. Discrete Algorithms, 3(2-4):390–415, 2005. doi:10.1016/j.jda.2004.08.015.
- [30] Rina Panigrahy. Minimum enclosing polytope in high dimensions. CoRR, cs.CG/0407020, 2004. doi:10.48550/arXiv.CS/0407020.
- [31] Jeff M. Phillips. Coresets and sketches. In Handbook of discrete and computational geometry, chapter 48, pages 1269–1288. Chapman and Hall/CRC, 3rd edition, 2017. doi:10.1201/9781315119601.
- [32] Alvin Yan Hong Yao and Diptarka Chakraborty. Approximate maximum rank aggregation: Beyond the worst-case. In FSTTCS, volume 284 of LIPIcs, pages 12:1–12:21. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPIcs.FSTTCS.2023.12.
- [33] Hai Yu, Pankaj K. Agarwal, Raghunath Poreddy, and Kasturi R. Varadarajan. Practical methods for shape fitting and kinetic data structures using coresets. Algorithmica, 52(3):378–402, 2008. doi:10.1007/s00453-007-9067-9.
Appendix A Appendix
A.1 Lower bound for weak 0-coresets for inputs with a unique solution
We present an example of a set $X$ of size $2d$, where $X$ is the sole weak 0-coreset of itself. This demonstrates that the upper bound of $2d$ is indeed tight.
We start with a simple proposition regarding the distances from a given point within the unit hypercube to two antipodal vertices of the cube.
Proposition A.1.
If $x \in \{-1,1\}^d$ and $c \in \mathbb{R}^d$, then $\|x - c\|_1 + \|{-x} - c\|_1 \ge 2d$, with equality when $c \in [-1,1]^d$.
Proof.
For each coordinate $i$, we have $|x_i - c_i| + |x_i + c_i| = 2\max(|x_i|, |c_i|) \ge 2$; summing over the $d$ coordinates completes the proof.
An immediate consequence of the above proposition is that the cost of every center of $C$ is at least $d$, for every set $C$ containing an antipodal pair. Since the origin has distance at most $d$ to every point of $\{-1,1\}^d$, we get that $\mathrm{OPT}(C) = d$ and that the origin is an optimal center. This is summarized in the following corollary.
Corollary A.2.
If $C \subseteq \{-1,1\}^d$ contains an antipodal pair, then $\mathrm{OPT}(C) = d$, and the origin is an optimal center.
Let $H$ denote the set of row vectors of the $d \times d$ Hadamard matrix (taking $d$ to be a power of $2$), and let $X = H \cup (-H)$.
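A sketch of this instance via the standard Sylvester construction (assuming $d$ is a power of $2$):

```python
import numpy as np

def hadamard_rows(d):
    """Rows of the d x d Sylvester Hadamard matrix (d a power of 2)."""
    H = np.array([[1]])
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    return H

H = hadamard_rows(8)
X = np.vstack([H, -H])  # the hard instance: rows and their negations
# Every two points x != +-y are orthogonal, hence at ell_1 distance exactly d.
assert all(np.abs(x - y).sum() == 8
           for i, x in enumerate(X) for j, y in enumerate(X)
           if i != j and not np.array_equal(x, -y))
```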
Proposition A.3.
There exists $X \subseteq \{-1,1\}^d$ with a unique optimum, such that every weak $0$-coreset of $X$ must have size $2d$.
Proof.
Following Corollary A.2, $\mathrm{OPT}(X) = d$. We will further show that if $C \subseteq X$ does not contain an antipodal pair, then $C$ is not a weak $0$-coreset. Note that for $x \neq \pm y \in X$, we have $\langle x, y\rangle = 0$. Also, since two orthogonal sign vectors differ in exactly $d/2$ coordinates, $\|x - y\|_1 = d$. That is:
$$\forall x, y \in X \text{ with } x \neq \pm y: \qquad \|x - y\|_1 = d. \tag{13}$$
That is, the distance from every $x \in X$ to all points of $X \setminus \{x, -x\}$ is exactly $d$.
Let $c$ be some optimal center of $C$, where $C$ does not contain an antipodal pair. If $\mathrm{cost}(C, c) = d$, then by Equation (13) every $y \in C$ is also an optimal center of $C$ (its distance to every other point of $C$ is $d$), yet $\|y - (-y)\|_1 = 2d > d$ and $-y \in X$, so $y$ is not an optimal center of $X$. Otherwise $\mathrm{cost}(C, c) < d$, and then by Proposition A.1, for every $x \in C$, $\|c - (-x)\|_1 \ge 2d - \|c - x\|_1 > d$, so $c$ itself is not an optimal center of $X$.
That is, some optimal center of $C$ has cost exceeding $d$ on $X$, so $C$ is not a weak $0$-coreset. It follows that every weak 0-coreset of $X$ contains an antipodal pair, and hence, by Corollary A.2, $\mathrm{OPT}(C) = d$. If a weak 0-coreset $C$ is a proper subset of $X$, then there exists $x \in X \setminus C$. But then $-x$ is an optimal center of $C$, as by (13) its distance to every point of $C$ is at most $d$, whereas $\|({-x}) - x\|_1 = 2d > d$. Hence $X$ is the sole weak $0$-coreset of $X$.