Coresets for Robust Clustering via Black-Box Reductions to Vanilla Case
Abstract
We devise -coresets for robust -Clustering with outliers through black-box reductions to vanilla clustering. Given an -coreset construction for vanilla clustering with size , we construct coresets of size for various metric spaces, where hides factors. This increases the size of the vanilla coreset by a small multiplicative factor of , and the additive term is up to a factor to the size of the optimal robust coreset. Plugging in recent vanilla coreset results of [Cohen-Addad, Saulpic and Schwiegelshohn, STOC’21; Cohen-Addad, Draganov, Russo, Saulpic and Schwiegelshohn, SODA’25], we obtain the first coresets for -Clustering with outliers with size near-linear in while previous results have size at least [Huang, Jiang, Lou and Wu, ICLR’23; Huang, Li, Lu and Wu, SODA’25].
Technically, we establish two conditions under which a vanilla coreset is as well a robust coreset. The first condition requires the dataset to satisfy special structures – it can be broken into “dense” parts with bounded diameter. We combine this with a new bounded-diameter decomposition that has only non-dense points to obtain the additive bound. Another sufficient condition requires the vanilla coreset to possess an extra size-preserving property. To utilize this condition, we further give a black-box reduction that turns a vanilla coreset to the one that satisfies the said size-preserving property, and this leads to the alternative additive size bound.
We also give low-space implementations of our reductions in the dynamic streaming setting. Combined with known streaming constructions for vanilla coresets [Braverman, Frahling, Lang, Sohler and Yang, ICML’17; Hu, Song, Yang and Zhong, arXiv’1802.00459], we obtain the first dynamic streaming algorithms for coresets for -Median (and -Means) with outliers, using space for inputs on a discrete grid .
Keywords and phrases:
Coresets, clustering, outliers, streaming algorithmsCategory:
Track A: Algorithms, Complexity and GamesFunding:
Shaofeng H.-C. Jiang: Research partially supported by a national key R&D program of China No. 2021YFA1000900.Copyright and License:
![[Uncaptioned image]](x1.png)
2012 ACM Subject Classification:
Theory of computation Streaming, sublinear and near linear time algorithms ; Theory of computation Facility location and clusteringEditors:
Keren Censor-Hillel, Fabrizio Grandoni, Joël Ouaknine, and Gabriele PuppisSeries and Publisher:

1 Introduction
-Clustering is a fundamental problem that is well studied in both computer science and related areas such as operations research. Given a metric space and a dataset , -Clustering aims to find a set of centers, such that the following clustering objective is minimized:
where . This formulation generally captures several well known variants of -clustering, including -Median (when ) and -Means (when ).
Unfortunately, datasets can often be noisy, and the objective of -Clustering is very sensitive to noise. In fact, even adding a single outlier point that is distant from every other point can significantly bias the clustering centers towards the outlier. Hence, robust variants of clustering are highly desired to address this issue. We consider a natural formulation suggested by [9], called -Clustering clustering with outliers, -Clustering for short, where a parameter is introduced to denote the number of outliers. The new cost function is denoted as , and it is evaluated by first removing the points that are furthest to from , denoting the resultant points as , and compute (i.e., using the vanilla -Clustering objective). Equivalently,
Coreset [27] is a powerful technique for obtaining efficient algorithms for clustering. Roughly speaking, an -coreset is a tiny proxy of the dataset that approximates the clustering objective within factor for every potential center set . Coresets are not only useful for obtaining fast algorithms, but can also be converted into streaming [27], distributed [1] and fully-dynamic algorithms [28] via merge-and-reduce framework [27]. Tremendous progress has been made on finding small coresets for (vanilla) -Clustering. Take -Median () in Euclidean (i.e., ) for example. The initial coreset size [27] has been improved in a series of works [26, 21, 45, 22, 35, 7, 16, 13], all the way to via a novel sampling framework [16, 13], and this nearly matches a lower bound of [13]. (Alternatively, in the regime of , a better bound of may be obtained [14, 34].)
The sampling framework proposed by [16] is modified to handle several variants of clustering [4], and this modified framework is recently adapted to obtain an size coreset for -Median with outliers [31], which exponentially improves a decade-old bound [23]. This bound is further improved to in a recent work [33]. However, even this improved bound is unlikely to be tight; specifically, there is a sharp transition from (i.e., the vanilla case) to (i.e., only one outlier is considered) that increases the size by at least a factor (compared with [13]). It is unclear if this transition is fundamental and whether or not it can be avoided. Technically, existing approaches for robust coresets are built on frameworks designed for the vanilla case and is adapted in an ad-hoc way. In case improved bounds are obtained for the vanilla case, it is still technically nontrivial to adapt it to the robust case.
In this paper, we systematically address these issues by proposing a new framework via black-box reductions to the vanilla case. Specifically, we wish to start from any vanilla coreset algorithm (which may be the optimal one), and figure out how to add only a few points to convert it into a robust coreset, ideally to obtain a size bound that is very close to the vanilla case. Indeed, this study helps to fundamentally understand “the price of robustness” for coresets, i.e., the exact gap in size bounds between vanilla clustering and robust clustering. Since the coreset construction is via reduction, it also opens the door to more applications, such as coreset algorithms for robust clustering in sublinear models (as long as the vanilla version can be constructed in the corresponding model).
1.1 Our Results
Our main result, stated in Theorem 1, is a novel coreset construction for -Clustering via a reduction to the vanilla case. Crucially, this reduction is black-box style which works with any vanilla coreset construction without knowing its implementation detail. As mentioned, this type of bound helps to fundamentally understand the “price of robustness” for coresets. Our result works for general metric spaces, and we choose to state the result for the Euclidean case which is arguably the most natural setting for -Clustering. Overall, this size bound has linear dependence in (albeit is evaluated on slightly larger parameters), and the main increase in size is additive.
Theorem 1 (Euclidean case).
Assume there is an algorithm that constructs an -coreset for -Clustering of size for any dataset from . Then, there is an algorithm that constructs an -coreset for -Clustering of size
(1) |
for any dataset from , where and . The two bounds in the follow from Theorem 10 and Theorem 13, respectively.
As mentioned, our result in Theorem 1 is applicable to various other metric spaces besides Euclidean, including doubling metrics, general finite metrics and shortest-path metrics for graphs that exclude a fixed minor. The size bound only needs to change the parameters of according to the metric and does not change the additive terms and . For instance, for doubling metrics, our bound simply replaces the in (1) by the doubling dimension of the metric space (up to constant factor). Detailed size bounds can be found in the full version [38].
We start with justifying the tightness of our size bounds without plugging in any concrete vanilla coreset bound . Observe that Theorem 1 actually provides two size bounds, each corresponding to one of the two terms in the of (1). The first bound of is more useful when is small. In particular, it makes our result the first to achieve a smooth transition from (vanilla case) to (robust case) in an asymptotic sense, even when plugging in the optimal vanilla coreset. Indeed, the bound becomes when , and this is asymptotically dominated by a lower bound of for vanilla coreset as shown in [13]. For the second bound, it needs to use that is larger in , but under the typical case of , this only increases by a factor of which is minor. Regarding its additive term , it is actually up to only a factor to the optimal robust coresets, which is nearly tight with respect to due to a lower bound of for robust coreset [31].
Specific Coreset Size Bounds.
Since our reduction is black box, the large body of results on coresets for vanilla clustering can be readily applied. We start with listing in Table 1 the concrete coreset bounds obtained by plugging in recent results for -Clustering [13, 14, 34, 12]. These lead to the first coresets of near-linear size in for -Clustering in all the metric spaces listed, thereby improving the previous dependency [23, 31, 33]. We can also obtain improved coresets when plugging in results designed for specific parameters. For instance, in low-dimensional Euclidean space, a coreset for -Clustering of size was given by [30]. Applying this result to Theorem 1, we obtain a coreset for -Clustering of size .
metric space | problem | size | reference |
---|---|---|---|
Euclidean | vanilla () | [16] | |
vanilla () | [13] | ||
vanilla () | [34] | ||
robust, | [23] | ||
robust | [31] | ||
robust | [33] | ||
doubling metrics | vanilla () | [16] | |
robust | [33] | ||
-point metric space | vanilla () | [16] | |
robust, | [23] | ||
robust | [33] | ||
bounded treewidth graphs | vanilla () | [16] | |
robust | [33] | ||
excluded-minor graphs | vanilla () | [12] | |
robust | [33] | ||
all the above | robust | ours |
It is worth mentioning that our entire algorithm is deterministic, provided that the given vanilla coreset algorithm is deterministic. As highlighted in recent surveys [43, 20], deterministically constructing coresets is an important and less understood open question, even for vanilla clustering without outliers. The best-known deterministic construction works only in low-dimensional Euclidean space [26], yielding a coreset of size . For Euclidean -Means, a coreset of size can be constructed deterministically [22], and the size can be improved to if we consider a slightly more general definition of coresets, known as coresets with offset [17].111In the definition of coresets with offset, the original clustering objective is preserved by adding a universal constant (the offset) to the clustering objective on the coreset. All these results can be plugged into our reduction to directly achieve a deterministic coreset construction for -Clustering, with the coreset size increasing as shown in (1).222For coresets with offset, our reductions and analysis can be easily adapted and still achieve asymptotically the same coreset size.
Streaming Implementation of Theorem 1.
We also obtain reduction style coreset construction algorithms for -Clustering over a dynamic stream of points in using small space, which follows from a streaming implementation of Theorem 1. We consider a standard streaming model proposed by [36] which has been widely employed in the literature. In this model, the input points come from a discrete grid for some integer parameter , and they arrive as a dynamic stream with point insertions and deletions. At the end of the stream, the algorithm should return a weighted set as the coreset. We present in Theorem 2 the result for the case of which is -Median with outliers.
Theorem 2 (Informal version of Theorem 14).
Assume there is a streaming algorithm that constructs an -coreset for -Median for every dataset from presented as a dynamic stream, using space and with a failure probability of at most . Then, there is a streaming algorithm that constructs an -coreset for -Median with constant probability for every dataset from presented as a dynamic stream, using space , where
Observe that we did not explicitly state the size of the coreset, and only focus on the space complexity, since the coreset size is upper bounded by the space complexity of the streaming algorithm. Moreover, once one obtains an -coreset at the end of the stream, one can always run another existing coreset construction again on top of it to obtain a possibly smaller coreset. Hence, it is only the space complexity that matters in streaming coreset construction. We also note that once the coreset is obtained from our streaming algorithm, one can continue to find a -approximation to -Median without using higher order of space.
By combining Theorem 2 with an existing streaming coreset construction for -Median [5], we obtain a space complexity of . Remarkably, the dependence on and matches that of [5], which is . A similar bound for , i.e., -Means with outliers, may be obtained by combining with a dynamic streaming algorithm for -Means (e.g., [29]). These results are the first nontrivial dynamic streaming algorithms for robust -clustering, where the dependence on is comparable to existing bounds for vanilla clustering (up to degree of polynomial). We also show (in the full version) that the linear dependence on is necessary.
Our results push forward the understanding of dynamic streaming algorithms for variants of clustering. Indeed, recent works manage to devise (offline) coresets for variants of clustering (including clustering with outlier/capacity constraints; see e.g., [33]), which can immediately imply insertion-only streaming algorithms via standard techniques such as merge-and-reduce [27], but these are not readily applicable to dynamic point streams. Our new reduction style framework is a new approach that opens a door to obtaining dynamic streaming algorithms for other variants of clustering, which may be of independent interest.
1.2 Technical Overview
Our main technical contributions are the establishment of two conditions for a vanilla coreset to become a robust coreset, and black-box reductions that turn vanilla coresets to satisfy these conditions, which yields the bounds mentioned in Theorem 1. The two conditions reveal new fundamental structures of the robust clustering problem, especially the structural relation to vanilla clustering, which may be of independent interest. Moreover, to make vanilla coresets satisfy the two conditions in a black-box way, it requires us to devise a bounded-diameter decomposition called almost-dense decomposition, a new separated duplication transform for metric spaces, as well as adapting and refining a recently developed sparse partition [37, 24] (which was also studied under the notion of consistent hashing [18, 19]). The sparse partition/consistent hashing technique is also crucially used in our streaming implementation, which suggests a framework fundamentally different from those based on quadtrees as in previous dynamic streaming algorithms for (vanilla) clustering [5, 29].
We focus on presenting these ideas for the -Median problem (which is the case) in Euclidean (although most discussion already works for general metrics). We also discuss how to implement the reductions in streaming.
1.2.1 First Reduction: Size Bound
Condition I (Lemma 9): Vanilla Coresets on Dense Datasets Are Robust
Our first condition comes from a simple intuition: we wish for every , which is perhaps the most natural way to ensure a vanilla coreset is automatically a robust coreset. Now, a sufficient condition is to ensure the dataset is somewhat uniform, in the sense that for every center set , the contribution from every data points is at most (or equivalently, each data point contributes ). This way, whatever the outliers are, the change of objective from to is always within factor.
We thus consider datasets that can be broken into parts such that each part has diameter and contains at least points. In such datasets, every data point can find points within distance , hence its contribution is naturally charged to these nearby points, which is averaged to . We call such a dataset dense, and the above intuition leads to a formal reduction that for dense datasets, a vanilla coreset is as well a robust coreset.
Reduction I (Theorem 10): Almost-dense Decomposition
Of course, a general dataset may not be dense. Nevertheless, we manage to show a weaker almost-dense decomposition exists for every dataset. This decomposition breaks the dataset into two parts: a subset that is guaranteed to be dense and a remaining subset that only consists of points. The subset actually has more refined property: it can be broken into parts such that each part has diameter at most (which is similar to the dense part) and each contains at most points (which is the “complement” property of the dense part). We call subset the sparse subset.
To construct such a decomposition, we start with an optimal solution to the robust clustering (in the algorithm it suffices to use a tri-criteria solution, see Definition 6). Let be the target diameter bound of the dense and sparse parts. We first identify those isolated points consisting of both the outliers and those “far away” points with distance more than from . The number of far away points is bounded by using an averaging argument. Now since every remaining point is within to , we cluster all these remaining points, namely , with respect to . Each cluster automatically has a diameter bound by . Finally, we take clusters with more than points as the dense subset, and the remainder, along with , forms the sparse subset, which contains at most points in total.
With such a decomposition, one can simply put the sparse subset who has points into the coreset, and use the vanilla coreset on the dense subset. This yields our first size bound (recall that is the size of the vanilla coreset) as in Theorem 1.
1.2.2 Second Reduction: Size Bound
Overview of Reduction II (Theorem 13).
Indeed, in Reduction I, the step of simply adding sparse subset into the coreset may sound suboptimal. Hence, our second algorithm is built on top of Reduction I, such that we first follow the steps in Reduction I to build a vanilla coreset on the dense subset, and then run more refined new steps to the sparse subset. In fact, the new steps can be viewed as a separate reduction that takes the sparse subset as input and generates a robust coreset. We focus on these new steps and call it Reduction II (without the steps already in Reduction I). Crucially, Reduction II itself is standalone and can run on any dataset (and thus can be independent of Reduction I), albeit it would result in a coreset of size (where is the size of its input dataset). Luckily, we would eventually run it on the sparse subset which has points, so the factor becomes , and the overall size bound is competitive (which yields the other size bound in Theorem 1).
Next, we start with describing our Condition II (which is also standalone), and discuss how our Reduction II make a vanilla coreset to satisfy this condition.
Condition II (Lemma 11): Size-preserving Vanilla Coresets Are Robust
The second condition requires a stronger property in the definition of the coreset (instead of the dense property in the dataset as in Condition I). We consider a size-preserving property, which is defined with respect to some -bounded diameter decomposition – a partition of point set such that each part has diameter at most . For a given , a (vanilla) coreset is size-preserving, if there exists a -bounded diameter decomposition , such that for every , (here we interpret the coreset, which is a weighted set, as a multiset). This “size-preserving property” of coresets intuitively means the coreset could be interpreted as moving each data point by at most distance. More formally, there is a one-to-one correspondence between dataset and coreset such that .
Our Condition II states that if a vanilla coreset satisfies the size-preserving property then it is automatically a robust coreset. Next, we provide an overview of the difficulties we encountered in the proof of Condition II and the high-level ideas for resolving them.
Naive Approach: Bounding the Gap between and .
We explore the condition to make the gap between and is small. Unlike Condition I, here we are now making no assumptions about the structure of . Observe that the precise difference between the two cost functions is the outliers: Let be the set of outliers (which is defined with respect to ), then the vanilla is larger and has an extra term compared with the robust . An immediate idea to “avoid” this term in vanilla , is to additionally put a center at each point in so that points from contribute . This works perfectly when each outlier is separated from every other point since each point in forms a singleton cluster, and this suggests a necessary condition: the vanilla coreset needs to hold for centers, i.e., it is a coreset for -Median.
Unfortunately, centers on outliers may not form singleton clusters, in which case the value can underestimate . Even if it is possible to pick another to make this said gap small, it is still challenging to also prove that is always close to for an arbitrary and (which is needed to imply ).
Key Observation: Vanilla Coresets Also Preserve the Gap.
Hence, instead of seeking for a specific center set that makes the gap small for every , we show a weaker guarantee, such that the gap is preserved between and the coreset : we allow a large gap, just that if the gap is large in , then it is also large in , and the quantity is comparable. Specifically, we show that a coreset for -Median with size-preserving property can preserve this gap. Namely,
(2) |
for some . Indeed, this, combined with the guarantee (which follows from the -Median coreset guarantee of ), implies (which is the robust coreset guarantee).
To see why the gap can be preserved, the intuition is that there are only a few points that contribute non-zero values to the gap . To put it simply, let us first consider . Then only those inliers of that are closer to outliers than to can make hold. Let denote such inliers, and let us loosely assume that only contributes non-zero values to the gap (recalling that is a one-to-one correspondence between the dataset and the coreset , such that for every ). Then the difference of these two gaps can be upper bounded by using the triangle inequality, which is sufficient if and we pick .
Unfortunately, the naive setup of , as in the analysis above, may not lead to a small (noting that is defined with respect to ). Thus, we need to choose as a subset of . On the other hand, cannot be made too small as well; otherwise, a large error of would arise. (Recall that (2), combining the coreset guarantee , forms the complete cost preservation.) We provide a simple algorithm to address this by finding an appropriate such that both and the error are ensured to be small.
Reduction II (Theorem 13): Making Vanilla Coresets Size-preserving
To utilize Condition II, we need to show that, for a given , we can build a vanilla coreset that is size-preserving using a generic vanilla coreset algorithm in a black-box way. Specifically the coreset should satisfy that there exists a -bounded partition of such that for (recalling that we treat a coreset as a multiset). We first assume a generic partition is given, and assume that for simplicity.
Key Observation: Separated Instance is Easy.
We make use of a simple observation: if the parts are well-separated, meaning they are sufficiently distant from each other, then a vanilla coreset for -Median is already (nearly) size-preserving. To see this, if we put one center at each of the ’s for but without any center at , the clustering cost is roughly and this is up to factor to (by the coreset guarantee of ). This implies a slightly relaxed property , and it can be further adjusted to become strictly size-preserving.
Key Idea: Ensuring Separation Artificially.
A straightforward idea to deal with a generic is to artificially separate the parts. More precisely, for Euclidean , we define a mapping , such that for , for some sufficiently large (which can be loosely considered as infinite). We define . Clearly, is a -bounded partition for , and now the partition is well-separated. Thus, one can apply the above argument for , obtain a size-preserving coreset for , and find the corresponding for .
New Issue: Ensuring Coreset Guarantee.
However, a crucial issue arises: how can we ensure that is a coreset for ? In other words, we need to translate the coreset guarantee of (on ) to that of (on ). It seems that the only way is to find a “bridge” center set for any fixed such that and . If such exists and can handle at least centers, then the coreset guarantee of follows. Now, let us focus on the requirement . The tricky situation arises when some points in may have the same nearest center in but belong to different parts; for instance, assume that and both have the nearest center . After mapping through , the images and can be “infinitely” far apart. Therefore, to maintain the clustering cost, we need to include two copies, and , in . In the worst case, we would need to include copies of every center in , resulting in a size of for , and thus should be a coreset for -Median. Unfortunately, this worst-case scenario is unavoidable if is generic, and the reduction to -Median falls far from satisfaction, as our ultimate goal is near-linear size in .
Solution: Grouping Parts using Sparse Partition [37].
To address the above problem, we start by using a more careful way to “prune” centers in . For a part , if it is far from , specifically , then it suffices to include only one center in for , which is , and we can verify that . Through this pruning, these far parts can increase by at most , which is acceptable.
The challenging part is then handling the close parts. For a close part , we denote by the centers that is relevant to it, i.e., contains the nearest centers to points in . This implies that , meaning we only need to include in for every . As discussed earlier, could be in the worst case.
To further reduce the total number of relevant centers, we propose a new idea: group these parts together to create a new partition of . The relevant centers of are . Our goal then becomes to find a grouping strategy such that is small.
We achieve this using a more structured partition called sparse partition [37]. Specifically, we can compute a partition of such that each part has a diameter at most , and any subset with a diameter intersects at most parts. For simplicity, assume each part in is entirely covered by exactly one part in (though this is not generally true, the proof follows similarly). Then, fits our needs perfectly. To see this, consider any center . Notice that can appear in at most of the sets , since otherwise the subset , whose diameter is , would intersect more than parts, contradicting the guarantee provided by the sparse partition. As a result, . This ensures that it suffices to construct a coreset for -Median that is size-preserving with respect to .
An issue is that is -bounded, but this can be adjusted through rescaling, resulting in a size increase by a factor of . Given that this coreset applies to a subset of size in the full algorithm, the term is acceptable.
1.2.3 Streaming Implementations
We provide an overview of how these steps can be adapted to the streaming setting using foundational streaming algorithmic tools.
-Bounded Partition.
Recall that both of our reductions rely on a -bounded partition (where , and we can guess this ). Therefore, the key is to design a streaming algorithm that finds a -bounded partition. However, in the streaming setting, it is not possible to maintain the -bounded partition explicitly, as it requires space, which is not affordable.
Our strategy is to employ a hashing version of sparse partition defined in [18] that is space efficient and data oblivious and thus suitable for streaming. This hashing partitions the entire into buckets of diameter of . Hence, after the stream ends, those non-empty buckets form a -bounded partition of . Moreover, this hashing also guarantees that any subset of points with a diameter of intersects at most buckets (similar to the sparse partition guarantee). By combining with the existence of the small-sized partition (as shown in Reduction I), we can conclude that there are very few, i.e., , distinct buckets after this hashing.
In the following, we fix such a hashing, and we outline how our reductions can be adapted to the streaming setting using foundational streaming algorithmic tools.
The First Space Bound.
We first show how to implement our Reduction I in the streaming setting. Our main task is to identify all points that lie in small buckets containing fewer than points, which correspond to the sparse subset , and construct a vanilla coreset for the points that lie in other buckets, which correspond to the dense subset. The latter is easy; once the former is achieved, we can simply run a vanilla dynamic streaming algorithm on a stream representing (which can be achieved using some tricks).
The former task can be translated into a streaming algorithm: identify at most buckets, such that each contain at most points. This task is similar to sparse recover in streaming, but it is two-level which is different from the standard one, i.e., it should recover only from those buckets that contain small number of points. Hence, we introduce a modified sparse recovery structure for this specific task, and it takes about space.
However, this dependence is still suboptimal. To achieve a near-linear space bound in , we show that there exists a subset of size such that, after removing , the remaining points span very few buckets, allowing us to set . This subset can be efficiently identified using a two-level -sampler [18, Lemma 3.3] in the dynamic stream. We then use (modified) sparse recovery to identify the sparse subset after removing , reducing the total space to .
The Second Space Bound.
Our second algorithm first applies the same streaming procedure to identify the set as in the first bound, which is directly added to the coreset. Recall that the number of non-empty buckets after removing is small. Hence, these non-empty buckets give rise to a small-sized -bounded partition of , denoted by . Our Condition II states that if we construct a vanilla coreset that is size-preserving, i.e., for every , it is directly a robust coreset. Hence, in summary, our next step is to implement the construction of a size-preserving vanilla coreset (see Reduction II) for in the dynamic streaming setting. Specifically, we map each point in the stream to , where is sufficiently large and is an integer indicating the bucket to which belongs. This transformed data is then fed to an instance of a streaming algorithm designed for vanilla coresets. At the end of the stream, we obtain a coreset from this instance and consider its pre-image (which deletes the last coordinate) as the coreset for , which is approximately size-preserving. We run a sparse recovery algorithm to recover the sizes of each bucket to calibrate the weights (to make it strictly size-preserving).
1.3 Related Work
Besides clustering with outliers, researchers have extensively explored coresets for various other clustering variants. Two particularly interesting variants are clustering with capacity constraints and clustering with fairness constraints, both of which have been investigated in [11, 44, 32, 15, 2, 4] for coreset constructions. Notably, the coresets for both these variants can be reduced to an assignment-preserving coreset [44, 32], for which several studies apply a hierarchical sampling framework [10] to construct small-sized coresets with assignment-preserving properties [2, 4]. Recently, Huang et al. [33] introduced a general unified model to capture a wide range of clustering problems with general assignment constraints, including clustering with fairness/capacity constraints, clustering with outliers (which we consider), and a fault-tolerant variant of clustering [39].
2 Preliminaries
For integer , let . Given a mapping , for a subset , denote , and for an image , denote its inverse by . Let be an underlying metric space, and we assume oracle access to . For a point set , let be the diameter of . For a set (), a clustering of with respect to is a partition of such that for every and every point , .
Weighted Set.
A set with an associated weight function is a weighted set. When we talk about a weighted set , we denote the weight function of by unless otherwise specified. For an unweighted set , we interpret it as weighted set with unit weight function . We say a weighted set is a weighted subset of , denoted by , if and for every , . For two weighted sets and , the weight of their union is defined as . We denote by the weighted set with the weight function , where , and is the support of . Here, for two weight functions and , we use to denote the weight function such that for every , . Likewise, we denote by the subtraction of weight functions.333If , we define , and similarly, if , we define .
Clustering with Outliers (Weighted).
In our proof, we need to consider the weighted version of -Clustering, and we extend the definition as follows. For a weighted set , and a center set with , we define the clustering objective for -Clustering as . Moreover, given real number , we denote by the collection of all possible weighted subset of with a total weight equal to . We define the objective for -Clustering as
(3) |
One can verify that when , the definition of is equivalent to . We call a weighted set an -outlier set of with respect to , if and . We use to denote the optimal objective value for -Clustering on , i.e., .
Coresets.
Our definition of coreset is almost identical to the one in [31, Definition 2.1], except that the coreset preserves the objective for all potential number of outliers up to simultaneously.444Although the definition of coresets in [31] does not require preserving objective for all simultaneously, their algorithm actually constructs such a coreset (see [31, Remark 3.2]).
Definition 3 (Coresets).
Given and a dataset , an -coreset of for -Clustering is a weighted set such that
A weighted set is called an -coreset for the special case of -Clustering if it is an -coreset for -Clustering with .
For analysis purposes, we also need a weaker notion of coreset, as defined in Definition 4, that allows for additive errors. Such notions were also widely used in previous coreset constructions (see e.g., [16, 13, 4, 31]).
Definition 4 (Additive-error coresets).
Given and a dataset , an -coreset of for -Clustering is a weighted set such that
An important property of coresets is that it is composable. Roughly speaking, if is a coreset of and is a coreset of , then is a coreset of . It is well known that composability holds directly from the definition of vanilla coreset, and here we verify that it also holds for our definition of coresets for -Clustering. We formally present this property with respect to the more general additive-error coreset below.
Fact 5 (Composability of coresets).
For , and two datasets , if is an -coreset of for -Clustering, and is an -coreset of for -Clustering, then is an -coreset of for -Clustering.
Tri-criteria Approximation.
Similar to many previous works [4, 31, 33], we need a tri-criteria approximation algorithm for the -Clustering problem. Here, an -approximation is a set of centers, whose cost is at most times the optimal, allowing a violation to the number of outliers.
Definition 6 (-Approximation).
Given and a dataset , we say a center set is an -approximation solution to -Clustering on if it holds that and .
There exists a randomized algorithm that computes a -approximation in near-linear time with high probability [3] (which is also applicable to general metrics). For a deterministic approach, a -approximation solution can be obtained in polynomial time (e.g., [25]). A more detailed discussion on other possible algorithms for such an approximation can be found in [31, Appendix A].
3 Formal Statements of Reductions
In this section, we present our main results, specifically two reductions based on two conditions for a vanilla coreset to become a robust coreset, and the streaming implementations of these reductions. Due to space limitations, we provide the formal statements of our results and discuss them here; the full proof can be found in the full version [38].
We begin by defining the bounded partition below, which is a useful notion in both reductions.
Definition 8 (-Bounded partition).
For a dataset from a metric space and , we say that a partition of is -bounded if, for every , .
Reduction I: Density of Datasets
Our first reduction is presented in Theorem 10. This reduction is based on the first condition, stated in Lemma 9, which proves that a vanilla coreset on a dense dataset serves as a robust coreset.
Lemma 9 (Condition I: vanilla coresets on dense datasets are robust).
Suppose an underlying metric space is given. For a dataset , , and integers , if there exists a -bounded partition of for some such that for every , , then an -coreset of for -Clustering is also an -coreset for -Clustering.
Theorem 10 (Reduction I).
Suppose an underlying metric space is given. Assume there exists an algorithm that, given , integers and an -point dataset from as input, constructs an -coreset for -Clustering of size in time .
Then there is an algorithm that, given , integers , an -point dataset from and a -approximation solution to -Clustering on as input, constructs an -coreset of for -Clustering. This algorithm invokes a constant number of times and runs in time . The constructed coreset has size
Reduction II: Size-preserving Property
Our second condition is stated in Lemma 11, which shows that a vanilla coreset that satisfies the size-preserving property is automatically a robust coreset.
Lemma 11 (Condition II: size-preserving vanilla coresets are robust).
Suppose a metric is given. For , , integers and a dataset , if a weighted set satisfies that there exists a -bounded partition of such that
-
1.
is an -coreset of for -Clustering, and
-
2.
, ,
then is an -coreset of for -Clustering.
Based on Lemma 11, we present an alternative reduction in Theorem 13. Compared with Reduction I (Theorem 10), this reduction gives an alternative size bound with respect to and : instead of . This reduction requires a slightly stronger technical guarantee from the vanilla coreset algorithm, such that the algorithm not only needs to construct a small coreset for metric (that contains the dataset), but also for a family of separated duplications of . Roughly speaking, a separated duplication of “copies” into several disjoint sub-metrics, and makes the distance between two copies large. We provide the formal definition below.
Definition 12 (Seperated duplication of a metric space).
Given a metric space , for real number and integer , a metric space is called an -separated -duplication of , if
-
1.
, it holds that ; and
-
2.
such that , it holds that .
For the special case of and , we say is a -separated -duplication of if and only if .
Theorem 13 (Reduction II).
Suppose an underlying metric space and a family of -separated -duplication of are given. Assume there exists an algorithm such that, for every , integers , metric and -point dataset from , it runs in time to construct an -coreset of size for -Clustering on .
Then, there is an algorithm that, given , integers , an -point dataset and a -approximation to -Clustering on as input, constructs an -coreset for -Clustering of size
where , and . This algorithm invokes a constant number of times and runs in time
Notice that we assume a family is given alongside in Theorem 13, since this family may depend on the specific type of (e.g., is Euclidean or shortest-path metric of a graph) and may require careful design. Luckily, in most cases, the separated duplication family does not need to be much more “complex” than , and this particularly means as long as a vanilla coreset works on , it can also deal with metrics in .
We claim that for metrics including Euclidean spaces, general finite metrics, doubling metrics and shortest-path metrics of minor-excluded graphs, the complexity/dimension parameter (such as Euclidean dimension) of the separated duplication only increases by a constant (factor). For instance, for the -dimension Euclidean metric , we show that for any and there exists a -separated -duplication of that can be realized by a -dimensional Euclidean space . Hence, a vanilla coreset construction for those metrics automatically works for the separated duplication (with only a constant factor increase in coreset size). We summarize the results for the separated duplication in Table 2.
space | property of | |
for | can be realized by | |
general metric | doubling dimension | |
ambient size | ambient size | |
graph metric | treewidth | treewidth |
excluding a fixed minor | excluding the same minor |
Streaming Implementations
Finally, we implement our reduction algorithms (both Theorem 10 and Theorem 13) in the dynamic streaming setting. We consider the standard geometric streaming model suggested by Indyk [36], where the input is from a discrete set for some integer and is presented as a stream of point insertion and deletions. We obtain two black-box reductions in Theorem 14. These reductions turn a streaming coreset construction algorithm for -Clustering to the one for -Clustering, with guarantee similar to Theorem 10 and Theorem 13, respectively.
Theorem 14 (Streaming coresets for -Clustering).
Assume there exists a streaming algorithm that, given , integers , and a dataset from presented as a dynamic stream, uses space to construct an -coreset for -Clustering with a failure probability of at most . Then there exists a streaming algorithm that, given , integers , and a dataset presented as a dynamic stream, constructs an -coreset of for -Clustering. This algorithm has a failure probability of at most and uses space , where
References
- [1] Maria-Florina Balcan, Steven Ehrlich, and Yingyu Liang. Distributed k-means and k-median clustering on general communication topologies. In NIPS, pages 1995–2003, 2013. URL: https://proceedings.neurips.cc/paper/2013/hash/7f975a56c761db6506eca0b37ce6ec87-Abstract.html.
- [2] Sayan Bandyapadhyay, Fedor V. Fomin, and Kirill Simonov. On coresets for fair clustering in metric and euclidean spaces and their applications. In ICALP, volume 198 of LIPIcs, pages 23:1–23:15. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPIcs.ICALP.2021.23.
- [3] Aditya Bhaskara, Sharvaree Vadgama, and Hong Xu. Greedy sampling for approximate clustering in the presence of outliers. In NeurIPS, pages 11146–11155, 2019. URL: https://proceedings.neurips.cc/paper/2019/hash/73983c01982794632e0270cd0006d407-Abstract.html.
- [4] Vladimir Braverman, Vincent Cohen-Addad, Shaofeng H.-C. Jiang, Robert Krauthgamer, Chris Schwiegelshohn, Mads Bech Toftrup, and Xuan Wu. The power of uniform sampling for coresets. In FOCS, pages 462–473. IEEE, 2022. doi:10.1109/FOCS54457.2022.00051.
- [5] Vladimir Braverman, Gereon Frahling, Harry Lang, Christian Sohler, and Lin F. Yang. Clustering high dimensional dynamic data streams. In ICML, volume 70 of Proceedings of Machine Learning Research, pages 576–585. PMLR, 2017. URL: http://proceedings.mlr.press/v70/braverman17a.html.
- [6] Vladimir Braverman, Shaofeng H.-C. Jiang, Robert Krauthgamer, and Xuan Wu. Coresets for ordered weighted clustering. In ICML, volume 97 of Proceedings of Machine Learning Research, pages 744–753. PMLR, 2019. URL: http://proceedings.mlr.press/v97/braverman19a.html.
- [7] Vladimir Braverman, Shaofeng H.-C. Jiang, Robert Krauthgamer, and Xuan Wu. Coresets for clustering in excluded-minor graphs and beyond. In SODA, pages 2679–2696. SIAM, 2021. doi:10.1137/1.9781611976465.159.
- [8] Vladimir Braverman, Shaofeng H.-C. Jiang, Robert Krauthgamer, and Xuan Wu. Coresets for clustering with missing values. In NeurIPS, pages 17360–17372, 2021. URL: https://proceedings.neurips.cc/paper/2021/hash/90fd4f88f588ae64038134f1eeaa023f-Abstract.html.
- [9] Moses Charikar, Samir Khuller, David M. Mount, and Giri Narasimhan. Algorithms for facility location problems with outliers. In SODA, pages 642–651. ACM/SIAM, 2001. URL: http://dl.acm.org/citation.cfm?id=365411.365555.
- [10] Ke Chen. On coresets for k-median and k-means clustering in metric and euclidean spaces and their applications. SIAM J. Comput., 39(3):923–947, 2009. doi:10.1137/070699007.
- [11] Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, and Sergei Vassilvitskii. Fair clustering through fairlets. In NIPS, pages 5029–5037, 2017. URL: https://proceedings.neurips.cc/paper/2017/hash/978fce5bcc4eccc88ad48ce3914124a2-Abstract.html.
- [12] Vincent Cohen-Addad, Andrew Draganov, Matteo Russo, David Saulpic, and Chris Schwiegelshohn. A tight vc-dimension analysis of clustering coresets with applications. In SODA, pages 4783–4808. SIAM, 2025. doi:10.1137/1.9781611978322.162.
- [13] Vincent Cohen-Addad, Kasper Green Larsen, David Saulpic, and Chris Schwiegelshohn. Towards optimal lower bounds for k-median and k-means coresets. In STOC, pages 1038–1051. ACM, 2022. doi:10.1145/3519935.3519946.
- [14] Vincent Cohen-Addad, Kasper Green Larsen, David Saulpic, Chris Schwiegelshohn, and Omar Ali Sheikh-Omar. Improved coresets for euclidean k-means. In NeurIPS, 2022. URL: http://papers.nips.cc/paper_files/paper/2022/hash/120c9ab5c58ba0fa9dd3a22ace1de245-Abstract-Conference.html.
- [15] Vincent Cohen-Addad and Jason Li. On the fixed-parameter tractability of capacitated clustering. In ICALP, volume 132 of LIPIcs, pages 41:1–41:14. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2019. doi:10.4230/LIPIcs.ICALP.2019.41.
- [16] Vincent Cohen-Addad, David Saulpic, and Chris Schwiegelshohn. A new coreset framework for clustering. In STOC, pages 169–182. ACM, 2021. doi:10.1145/3406325.3451022.
- [17] Vincent Cohen-Addad, David Saulpic, and Chris Schwiegelshohn. Deterministic clustering in high dimensional spaces: Sketches and approximation. In FOCS, pages 1105–1130. IEEE, 2023. doi:10.1109/FOCS57990.2023.00066.
- [18] Artur Czumaj, Arnold Filtser, Shaofeng H.-C. Jiang, Robert Krauthgamer, Pavel Veselý, and Mingwei Yang. Streaming facility location in high dimension via geometric hashing. arXiv preprint arXiv:2204.02095, 2022. The latest version has additional results compared to the preliminary version in [19]. doi:10.48550/arXiv.2204.02095.
- [19] Artur Czumaj, Shaofeng H.-C. Jiang, Robert Krauthgamer, Pavel Veselý, and Mingwei Yang. Streaming facility location in high dimension via geometric hashing. In FOCS, pages 450–461. IEEE, 2022. doi:10.1109/FOCS54457.2022.00050.
- [20] Dan Feldman. Core-sets: An updated survey. WIREs Data Mining Knowl. Discov., 10(1), 2020. doi:10.1002/widm.1335.
- [21] Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. In STOC, pages 569–578. ACM, 2011. doi:10.1145/1993636.1993712.
- [22] Dan Feldman, Melanie Schmidt, and Christian Sohler. Turning big data into tiny data: Constant-size coresets for k-means, pca, and projective clustering. SIAM J. Comput., 49(3):601–657, 2020. doi:10.1137/18M1209854.
- [23] Dan Feldman and Leonard J Schulman. Data reduction for weighted and outlier-resistant clustering. In Proceedings of the twenty-third annual ACM-SIAM symposium on Discrete Algorithms, pages 1343–1354. SIAM, 2012. doi:10.1137/1.9781611973099.106.
- [24] Arnold Filtser. Scattering and sparse partitions, and their applications. In ICALP, volume 168 of LIPIcs, pages 47:1–47:20. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2020. doi:10.4230/LIPIcs.ICALP.2020.47.
- [25] Zachary Friggstad, Kamyar Khodamoradi, Mohsen Rezapour, and Mohammad R. Salavatipour. Approximation schemes for clustering with outliers. ACM Trans. Algorithms, 15(2):26:1–26:26, 2019. doi:10.1145/3301446.
- [26] Sariel Har-Peled and Akash Kushal. Smaller coresets for -median and -means clustering. Discret. Comput. Geom., 37(1):3–19, 2007. doi:10.1007/s00454-006-1271-x.
- [27] Sariel Har-Peled and Soham Mazumdar. On coresets for -means and -median clustering. In STOC, pages 291–300. ACM, 2004. doi:10.1145/1007352.1007400.
- [28] Monika Henzinger and Sagar Kale. Fully-dynamic coresets. In ESA, volume 173 of LIPIcs, pages 57:1–57:21. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2020. doi:10.4230/LIPIcs.ESA.2020.57.
- [29] Wei Hu, Zhao Song, Lin F. Yang, and Peilin Zhong. Nearly optimal dynamic -means clustering for high-dimensional data. CoRR, abs/1802.00459, 2018. doi:10.48550/arXiv.1802.00459.
- [30] Lingxiao Huang, Ruiyuan Huang, Zengfeng Huang, and Xuan Wu. On coresets for clustering in small dimensional euclidean spaces. In ICML, volume 202 of Proceedings of Machine Learning Research, pages 13891–13915. PMLR, 2023. URL: https://proceedings.mlr.press/v202/huang23h.html.
- [31] Lingxiao Huang, Shaofeng H.-C. Jiang, Jianing Lou, and Xuan Wu. Near-optimal coresets for robust clustering. In ICLR. OpenReview.net, 2023. URL: https://openreview.net/forum?id=Nc1ZkRW8Vde.
- [32] Lingxiao Huang, Shaofeng H.-C. Jiang, and Nisheeth K. Vishnoi. Coresets for clustering with fairness constraints. In NeurIPS, pages 7587–7598, 2019. URL: https://proceedings.neurips.cc/paper/2019/hash/810dfbbebb17302018ae903e9cb7a483-Abstract.html.
- [33] Lingxiao Huang, Jian Li, Pinyan Lu, and Xuan Wu. Coresets for constrained clustering: General assignment constraints and improved size bounds. In SODA, pages 4732–4782. SIAM, 2025. doi:10.1137/1.9781611978322.161.
- [34] Lingxiao Huang, Jian Li, and Xuan Wu. On optimal coreset construction for euclidean -clustering. In STOC, pages 1594–1604. ACM, 2024. doi:10.1145/3618260.3649707.
- [35] Lingxiao Huang and Nisheeth K. Vishnoi. Coresets for clustering in euclidean spaces: importance sampling is nearly optimal. In STOC, pages 1416–1429. ACM, 2020. doi:10.1145/3357713.3384296.
- [36] Piotr Indyk. Algorithms for dynamic geometric problems over data streams. In STOC, pages 373–380. ACM, 2004. doi:10.1145/1007352.1007413.
- [37] Lujun Jia, Guolong Lin, Guevara Noubir, Rajmohan Rajaraman, and Ravi Sundaram. Universal approximations for tsp, steiner tree, and set cover. In STOC, pages 386–395. ACM, 2005. doi:10.1145/1060590.1060649.
- [38] Shaofeng H.-C. Jiang and Jianing Lou. Coresets for robust clustering via black-box reductions to vanilla case. CoRR, abs/2502.07669, 2025. doi:10.48550/arXiv.2502.07669.
- [39] Samir Khuller, Robert Pless, and Yoram J. Sussmann. Fault tolerant k-center problems. Theor. Comput. Sci., 242(1-2):237–245, 2000. doi:10.1016/S0304-3975(98)00222-9.
- [40] Sagi Lotan, Ernesto Evgeniy Sanches Shayda, and Dan Feldman. Coreset for line-sets clustering. In NeurIPS, 2022. URL: http://papers.nips.cc/paper_files/paper/2022/hash/f2ce95887c34393af4eb240d60017860-Abstract-Conference.html.
- [41] Konstantin Makarychev, Yury Makarychev, and Ilya P. Razenshteyn. Performance of johnson-lindenstrauss transform for k-means and k-medians clustering. In STOC, pages 1027–1038. ACM, 2019. doi:10.1145/3313276.3316350.
- [42] Yair Marom and Dan Feldman. k-means clustering of lines for big data. In NeurIPS, pages 12797–12806, 2019. URL: https://proceedings.neurips.cc/paper/2019/hash/6084e82a08cb979cf75ae28aed37ecd4-Abstract.html.
- [43] Alexander Munteanu and Chris Schwiegelshohn. Coresets-methods and history: A theoreticians design pattern for approximation and streaming algorithms. Künstliche Intell., 32(1):37–53, 2018. doi:10.1007/s13218-017-0519-3.
- [44] Melanie Schmidt, Chris Schwiegelshohn, and Christian Sohler. Fair coresets and streaming algorithms for fair k-means. In WAOA, volume 11926 of Lecture Notes in Computer Science, pages 232–251. Springer, 2019. doi:10.1007/978-3-030-39479-0_16.
- [45] Christian Sohler and David P. Woodruff. Strong coresets for k-median and subspace approximation: Goodbye dimension. In FOCS, pages 802–813. IEEE Computer Society, 2018. doi:10.1109/FOCS.2018.00081.
- [46] Murad Tukan, Xuan Wu, Samson Zhou, Vladimir Braverman, and Dan Feldman. New coresets for projective clustering and applications. In AISTATS, volume 151 of Proceedings of Machine Learning Research, pages 5391–5415. PMLR, 2022. URL: https://proceedings.mlr.press/v151/tukan22a.html.