
Coresets for Robust Clustering via Black-Box Reductions to Vanilla Case

Shaofeng H.-C. Jiang, School of Computer Science, Peking University, Beijing, China
Jianing Lou, School of Computer Science, Peking University, Beijing, China
Abstract

We devise ε-coresets for robust (k,z)-Clustering with m outliers through black-box reductions to vanilla clustering. Given an ε-coreset construction for vanilla clustering with size N, we construct coresets of size N·polylog(km ε^{-1}) + O_z(min{km ε^{-1}, m ε^{-2z} log^z(km ε^{-1})}) for various metric spaces, where O_z hides 2^{O(z log z)} factors. This increases the size of the vanilla coreset by only a small multiplicative polylog(km ε^{-1}) factor, and the additive term is within a (ε^{-1} log(km))^{O(z)} factor of the size of an optimal robust coreset. Plugging in recent vanilla coreset results of [Cohen-Addad, Saulpic and Schwiegelshohn, STOC'21; Cohen-Addad, Draganov, Russo, Saulpic and Schwiegelshohn, SODA'25], we obtain the first coresets for (k,z)-Clustering with m outliers whose size is near-linear in k, whereas previous results have size at least Ω(k^2) [Huang, Jiang, Lou and Wu, ICLR'23; Huang, Li, Lu and Wu, SODA'25].

Technically, we establish two conditions under which a vanilla coreset is also a robust coreset. The first condition requires the dataset to satisfy a special structure: it can be broken into "dense" parts with bounded diameter. We combine this with a new bounded-diameter decomposition that has only O_z(km ε^{-1}) non-dense points to obtain the O_z(km ε^{-1}) additive bound. The other sufficient condition requires the vanilla coreset to possess an extra size-preserving property. To utilize this condition, we further give a black-box reduction that turns a vanilla coreset into one satisfying the said size-preserving property, and this leads to the alternative O_z(m ε^{-2z} log^z(km ε^{-1})) additive size bound.

We also give low-space implementations of our reductions in the dynamic streaming setting. Combined with known streaming constructions for vanilla coresets [Braverman, Frahling, Lang, Sohler and Yang, ICML'17; Hu, Song, Yang and Zhong, arXiv:1802.00459], we obtain the first dynamic streaming algorithms for coresets for k-Median (and k-Means) with m outliers, using space Õ(k+m)·poly(d ε^{-1} log Δ) for inputs on a discrete grid [Δ]^d.

Keywords and phrases:
Coresets, clustering, outliers, streaming algorithms
Category:
Track A: Algorithms, Complexity and Games
Funding:
Shaofeng H.-C. Jiang: Research partially supported by a national key R&D program of China No. 2021YFA1000900.
Copyright and License:
© Shaofeng H.-C. Jiang and Jianing Lou; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Streaming, sublinear and near linear time algorithms; Theory of computation → Facility location and clustering
Related Version:
Full Version: https://arxiv.org/abs/2502.07669 [38]
Editors:
Keren Censor-Hillel, Fabrizio Grandoni, Joël Ouaknine, and Gabriele Puppis

1 Introduction

(k,z)-Clustering is a fundamental problem that is well studied in both computer science and related areas such as operations research. Given a metric space (V, dist) and a dataset X ⊆ V, (k,z)-Clustering aims to find a set C ⊆ V of k centers, such that the following clustering objective is minimized:

cost_z(X, C) := ∑_{x ∈ X} (dist(x, C))^z,

where dist(x, C) := min_{c ∈ C} dist(x, c). This formulation captures several well-known variants of k-clustering, including k-Median (when z = 1) and k-Means (when z = 2).

Unfortunately, datasets can often be noisy, and the objective of (k,z)-Clustering is very sensitive to noise. In fact, even adding a single outlier point that is distant from every other point can significantly bias the clustering centers towards the outlier. Hence, robust variants of clustering are highly desirable to address this issue. We consider a natural formulation suggested by [9], called (k,z)-Clustering with m outliers, or (k,z,m)-Clustering for short, where a parameter m is introduced to denote the number of outliers. The new cost function is denoted by cost_z^{(m)}(X, C), and it is evaluated by first removing from X the m points that are furthest from C, denoting the resulting point set by X′, and then computing cost_z(X′, C) (i.e., using the vanilla (k,z)-Clustering objective). Equivalently,

cost_z^{(m)}(X, C) := min_{Y ⊆ X, |Y| = m} cost_z(X ∖ Y, C).
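To make the two objectives concrete, here is a minimal Python sketch (illustrative only, not from the paper; the data, center set, and helper names are our own) that evaluates cost_z and cost_z^{(m)} by dropping the m points furthest from C:

```python
import math

def cost_z(X, C, z):
    """Vanilla (k,z)-clustering objective: sum of z-th powers of distances to the nearest center."""
    return sum(min(math.dist(x, c) for c in C) ** z for x in X)

def cost_z_m(X, C, z, m):
    """Robust objective: discard the m points furthest from C, then evaluate cost_z on the rest."""
    inliers = sorted(X, key=lambda x: min(math.dist(x, c) for c in C))
    if m > 0:
        inliers = inliers[:-m]
    return cost_z(inliers, C, z)

# Example: a single far-away outlier dominates cost_z but not cost_z^(1).
X = [(0, 0), (1, 0), (0, 1), (100, 100)]
C = [(0.3, 0.3)]
print(cost_z(X, C, z=2), cost_z_m(X, C, z=2, m=1))
```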

Coresets [27] are a powerful technique for obtaining efficient algorithms for clustering. Roughly speaking, an ε-coreset is a tiny proxy of the dataset that approximates the clustering objective within a (1 ± ε) factor for every potential center set C. Coresets are not only useful for obtaining fast algorithms, but can also be converted into streaming [27], distributed [1] and fully-dynamic algorithms [28] via the merge-and-reduce framework [27]. Tremendous progress has been made on finding small coresets for (vanilla) (k,z)-Clustering. Take k-Median (z = 1) in Euclidean ℝ^d (i.e., V = ℝ^d, dist = ℓ_2) for example. The initial coreset size O(k ε^{-d} log n) [27] has been improved in a series of works [26, 21, 45, 22, 35, 7, 16, 13], all the way to Õ(k ε^{-3}) via a novel sampling framework [16, 13], and this nearly matches a lower bound of Ω(k ε^{-2}) [13]. (Alternatively, in the regime where ε^{-1} is sufficiently large relative to k, a better bound of Õ(k^{4/3} ε^{-2}) may be obtained [14, 34].)

The sampling framework proposed by [16] has been modified to handle several variants of clustering [4], and this modified framework was recently adapted to obtain an O(m) + Õ(k^3 ε^{-5}) size coreset for k-Median with m outliers [31], which exponentially improves a decade-old (k+m)^{O(k+m)}·poly(d ε^{-1} log n) bound [23]. This bound was further improved to O(m) + Õ(k^2 ε^{-4}) in a recent work [33]. However, even this improved bound is unlikely to be tight; specifically, there is a sharp transition from m = 0 (i.e., the vanilla case) to m = 1 (i.e., only one outlier is considered) that increases the size by at least a k ε^{-1} factor (compared with [13]). It is unclear whether this transition is fundamental and whether or not it can be avoided. Technically, existing approaches for robust coresets are built on frameworks designed for the vanilla case and are adapted in an ad-hoc way. Even when improved bounds are obtained for the vanilla case, it is still technically nontrivial to adapt them to the robust case.

In this paper, we systematically address these issues by proposing a new framework via black-box reductions to the vanilla case. Specifically, we wish to start from any vanilla coreset algorithm (which may be the optimal one), and figure out how to add only a few points to convert it into a robust coreset, ideally to obtain a size bound that is very close to the vanilla case. Indeed, this study helps to fundamentally understand “the price of robustness” for coresets, i.e., the exact gap in size bounds between vanilla clustering and robust clustering. Since the coreset construction is via reduction, it also opens the door to more applications, such as coreset algorithms for robust clustering in sublinear models (as long as the vanilla version can be constructed in the corresponding model).

1.1 Our Results

Our main result, stated in Theorem 1, is a novel coreset construction for (k,z,m)-Clustering via a reduction to the vanilla case. Crucially, this reduction is black-box: it works with any vanilla coreset construction without knowing its implementation details. As mentioned, this type of bound helps to fundamentally understand the "price of robustness" for coresets. Our result works for general metric spaces, and we choose to state it for the Euclidean case, which is arguably the most natural setting for (k,z)-Clustering. Overall, this size bound has linear dependence on N (albeit N is evaluated on slightly larger parameters), and the main increase in size is additive.

Theorem 1 (Euclidean case).

Assume there is an algorithm that constructs an ε-coreset for (k,z)-Clustering of size N(d, k, ε^{-1}) for any dataset from ℝ^d. Then, there is an algorithm that constructs an ε-coreset for (k,z,m)-Clustering of size

min{ N(d, k, O(ε^{-1})) + A_1 ,  N(O(d), O(k log^2(km ε^{-1})), O(ε^{-1})) + A_2 }   (1)

for any dataset from ℝ^d, where A_1 = O_z(km ε^{-1}) and A_2 = O_z(m ε^{-2z} log^z(km ε^{-1})). The two bounds in the min follow from Theorem 10 and Theorem 13, respectively.

As mentioned, our result in Theorem 1 is applicable to various other metric spaces besides Euclidean, including doubling metrics, general finite metrics and shortest-path metrics of graphs that exclude a fixed minor. The size bound only changes the parameters of N according to the metric and does not change the additive terms A_1 and A_2. For instance, for doubling metrics, our bound simply replaces the d in (1) by the doubling dimension of the metric space (up to a constant factor). Detailed size bounds can be found in the full version [38].

We start by justifying the tightness of our size bounds without plugging in any concrete vanilla coreset bound N. Observe that Theorem 1 actually provides two size bounds, each corresponding to one of the two terms in the min of (1). The first bound of A_1 + N(d, k, O(ε^{-1})) is more useful when m is small. In particular, it makes our result the first to achieve a smooth transition from m = 0 (vanilla case) to m > 0 (robust case) in an asymptotic sense, even when plugging in the optimal vanilla coreset. Indeed, the O_z(km ε^{-1}) bound becomes O_z(k ε^{-1}) when m = O(1), and this is asymptotically dominated by the Ω(k ε^{-2}) lower bound for vanilla coresets shown in [13]. The second bound needs to use a k that is a polylog(km ε^{-1}) factor larger in N, but in the typical case of N(d, k, ε^{-1}) = poly(dk ε^{-1}), this only increases N by a polylog(km ε^{-1}) factor, which is minor. Its additive term A_2 is within only a (ε^{-1} log(km))^{O(z)} factor of the optimal robust coreset size, which is nearly tight with respect to m due to an Ω(m) lower bound for robust coresets [31].

Specific Coreset Size Bounds.

Since our reduction is black-box, the large body of results on coresets for vanilla clustering can be readily applied. We start by listing in Table 1 the concrete coreset bounds obtained by plugging in recent results for (k,z)-Clustering [13, 14, 34, 12]. These lead to the first coresets of near-linear size in k for (k,z,m)-Clustering in all the metric spaces listed, thereby improving the previous k^2 dependency [23, 31, 33]. We can also obtain improved coresets when plugging in results designed for specific parameter regimes. For instance, in low-dimensional Euclidean space, a coreset for (1,z)-Clustering of size 2^{O(z log z)}·Õ(d ε^{-1}) was given by [30]. Applying this result to Theorem 1, we obtain a coreset for (1,z,m)-Clustering of size 2^{O(z log z)}·Õ((m + d) ε^{-1}).

Table 1: List of relevant coreset bounds under various metric spaces. In all the bounds reported in the table we omit a minor multiplicative factor of (z log(km ε^{-1}))^{O(z)}. For our result listed in the last row, the exact form (without omitting the abovementioned minor factor) is min{ N + O_z(km ε^{-1}),  N·polylog(km ε^{-1}) + O_z(m ε^{-2z} log^z(km ε^{-1})) }, noting that the N in the second term of the min carries a polylog(km ε^{-1}) factor.
metric space M | problem | size | reference
Euclidean ℝ^d | vanilla (N) | k·d·ε^{-max{2,z}} | [16]
Euclidean ℝ^d | vanilla (N) | k·ε^{-2-z} | [13]
Euclidean ℝ^d | vanilla (N) | k^{(2z+2)/(z+2)}·ε^{-2} | [34]
Euclidean ℝ^d | robust, z=1 | (k+m)^{k+m}·(ε^{-1} d log n)^2 | [23]
Euclidean ℝ^d | robust | m + k^3·ε^{-3z-2} | [31]
Euclidean ℝ^d | robust | m + k^2·ε^{-2z-2} | [33]
doubling metrics | vanilla (N) | k·ε^{-max{2,z}}·ddim(M) | [16]
doubling metrics | robust | m + k^2·ε^{-2z}·(ddim(M) + ε^{-1}) | [33]
n-point metric space | vanilla (N) | k·ε^{-max{2,z}}·log n | [16]
n-point metric space | robust, z=1 | (k+m)^{k+m}·ε^{-2}·log^4 n | [23]
n-point metric space | robust | m + k^2·ε^{-2z}·(log n + ε^{-1}) | [33]
bounded treewidth graphs | vanilla (N) | k·ε^{-max{2,z}}·tw | [16]
bounded treewidth graphs | robust | m + k^2·ε^{-2z-2}·tw | [33]
excluded-minor graphs | vanilla (N) | k·ε^{-2}·min{ε^{-z+1}, k} | [12]
excluded-minor graphs | robust | m + k^2·ε^{-2z-2} | [33]
all the above | robust | N + min{km·ε^{-1}, m·ε^{-2z}} | ours

It is worth mentioning that our entire algorithm is deterministic, provided that the given vanilla coreset algorithm is deterministic. As highlighted in recent surveys [43, 20], deterministically constructing coresets is an important and less understood open question, even for vanilla clustering without outliers. The best-known deterministic construction works only in low-dimensional Euclidean space [26], yielding a coreset of size poly(k)·ε^{-O(d)}. For Euclidean k-Means, a coreset of size k·poly(ε^{-1}) can be constructed deterministically [22], and the size can be improved to poly(k ε^{-1}) if we consider a slightly more general definition of coresets, known as coresets with offset [17]. (In the definition of coresets with offset, the original clustering objective is preserved after adding a universal constant, the offset, to the clustering objective on the coreset.) All these results can be plugged into our reduction to directly obtain a deterministic coreset construction for (k,z,m)-Clustering, with the coreset size increasing as shown in (1). (For coresets with offset, our reductions and analysis can be easily adapted and still achieve asymptotically the same coreset size.)

Streaming Implementation of Theorem 1.

We also obtain reduction-style coreset construction algorithms for (k,z,m)-Clustering over a dynamic stream of points in ℝ^d using small space, which follow from a streaming implementation of Theorem 1. We consider a standard streaming model proposed by [36] which has been widely employed in the literature. In this model, the input points come from a discrete grid [Δ]^d for some integer parameter Δ, and they arrive as a dynamic stream with point insertions and deletions. At the end of the stream, the algorithm should return a weighted set as the coreset. We present in Theorem 2 the result for the case of z = 1, i.e., k-Median with m outliers.

Theorem 2 (Informal version of Theorem 14).

Assume there is a streaming algorithm that constructs an ε-coreset for k-Median for every dataset from [Δ]^d presented as a dynamic stream, using space W(d, Δ, k, ε^{-1}) and with a failure probability of at most 1/poly(d log Δ). Then, there is a streaming algorithm that constructs an ε-coreset for (k,m)-Median with constant probability for every dataset from [Δ]^d presented as a dynamic stream, using space min{W_1, W_2}·poly(d log Δ), where

W_1 = Õ(km ε^{-1}) + W(d, Δ, k, O(ε^{-1})),
W_2 = Õ(k + m ε^{-2}) + W(d+1, ε^{-1}·Δ·poly(d), k·poly(d), O(ε^{-1})).

Observe that we do not explicitly state the size of the coreset and only focus on the space complexity, since the coreset size is upper bounded by the space complexity of the streaming algorithm. Moreover, once one obtains an ε-coreset at the end of the stream, one can always run another existing coreset construction on top of it to obtain a possibly smaller coreset. Hence, it is only the space complexity that matters in streaming coreset construction. We also note that once the coreset is obtained from our streaming algorithm, one can continue to find a (1+ε)-approximation to (k,m)-Median without using asymptotically more space.

By combining Theorem 2 with an existing streaming coreset construction for k-Median [5], we obtain a space complexity of Õ(min{km ε^{-1}, m ε^{-2}} + k ε^{-2})·poly(d log Δ). Remarkably, the dependence on k and ε matches that of [5], which is k ε^{-2}. A similar bound for z = 2, i.e., k-Means with m outliers, may be obtained by combining with a dynamic streaming algorithm for k-Means (e.g., [29]). These results are the first nontrivial dynamic streaming algorithms for robust k-clustering, where the dependence on ε, k, d, log Δ is comparable to existing bounds for vanilla clustering (up to the degree of the polynomial). We also show (in the full version) that the linear dependence on m is necessary.

Our results push forward the understanding of dynamic streaming algorithms for variants of clustering. Indeed, recent works manage to devise (offline) coresets for variants of clustering (including clustering with outlier/capacity constraints; see e.g., [33]), which immediately imply insertion-only streaming algorithms via standard techniques such as merge-and-reduce [27], but these are not readily applicable to dynamic point streams. Our reduction-style framework is a new approach that opens the door to obtaining dynamic streaming algorithms for other variants of clustering, which may be of independent interest.

1.2 Technical Overview

Our main technical contributions are the establishment of two conditions for a vanilla coreset to become a robust coreset, and black-box reductions that turn vanilla coresets into ones satisfying these conditions, which yields the bounds mentioned in Theorem 1. The two conditions reveal new fundamental structures of the robust clustering problem, especially its structural relation to vanilla clustering, which may be of independent interest. Moreover, making vanilla coresets satisfy the two conditions in a black-box way requires us to devise a bounded-diameter decomposition called almost-dense decomposition and a new separated duplication transform for metric spaces, as well as to adapt and refine a recently developed sparse partition [37, 24] (which was also studied under the notion of consistent hashing [18, 19]). The sparse partition/consistent hashing technique is also crucially used in our streaming implementation, which suggests a framework fundamentally different from those based on quadtrees as in previous dynamic streaming algorithms for (vanilla) clustering [5, 29].

We focus on presenting these ideas for the (k,m)-Median problem (which is the z = 1 case) in Euclidean ℝ^d (although most of the discussion already works for general metrics). We also discuss how to implement the reductions in streaming.

1.2.1 First Reduction: (O(km ε^{-1}) + N) Size Bound

Condition I (Lemma 9): Vanilla Coresets on Dense Datasets Are Robust

Our first condition comes from a simple intuition: we wish cost(X, C) ≈ cost^{(m)}(X, C) for every C ⊆ ℝ^d, which is perhaps the most natural way to ensure a vanilla coreset is automatically a robust coreset. Now, a sufficient condition is that the dataset is somewhat uniform, in the sense that for every center set C ⊆ ℝ^d, the contribution from any m data points is at most ε·OPT (or equivalently, each data point contributes at most ε·OPT/m). This way, whatever the m outliers are, the change of objective from cost to cost^{(m)} is always within an ε fraction.

We thus consider datasets that can be broken into parts such that each part has diameter λ := O(ε·OPT/m) and contains at least Ω(ε^{-1} m) points. In such datasets, every data point can find Ω(ε^{-1} m) points within distance λ = O(ε·OPT/m), hence its contribution can naturally be charged to these nearby points, which averages to O(ε·OPT/m) per point. We call such a dataset dense, and the above intuition leads to a formal reduction showing that, for dense datasets, a vanilla coreset is also a robust coreset.
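This intuition can be checked numerically. Below is a toy Python experiment (illustrative only; the dataset, centers and helper names are our own) showing that on a dense dataset the vanilla objective and the robust objective nearly coincide for an arbitrary center set:

```python
import math, random

def obj(X, C, m=0):
    """cost^{(m)} for z = 1: drop the m points furthest from C, sum distances to the nearest center."""
    d = sorted(min(math.dist(x, c) for c in C) for x in X)
    return sum(d[:len(d) - m] if m > 0 else d)

random.seed(0)
# A dense dataset: 3 tight clusters, each with far more than m/eps points.
X = [(cx + random.uniform(-0.01, 0.01), cy + random.uniform(-0.01, 0.01))
     for (cx, cy) in [(0, 0), (5, 0), (0, 5)] for _ in range(2000)]
C = [(random.uniform(-1, 6), random.uniform(-1, 6)) for _ in range(3)]
m = 10
print(obj(X, C), obj(X, C, m))  # nearly equal: removing any m points barely changes the cost
```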

Reduction I (Theorem 10): Almost-dense Decomposition

Of course, a general dataset may not be dense. Nevertheless, we manage to show that a weaker almost-dense decomposition exists for every dataset. This decomposition breaks the dataset into two parts: a subset A that is guaranteed to be dense, and a remaining subset B that consists of only O(km ε^{-1}) points. The subset B actually has a more refined property: it can be broken into parts such that each part has diameter at most λ (similar to the dense part) and contains at most O(ε^{-1} m) points (the "complement" of the dense property). We call the subset B the sparse subset.

To construct such a decomposition, we start with an optimal solution C to the robust clustering (in the algorithm it suffices to use a tri-criteria solution, see Definition 6). Let λ = O(ε·OPT/m) be the target diameter bound of the dense and sparse parts. We first identify the isolated points F, consisting of both the m outliers and the "far away" points at distance more than λ from C. The number of far away points is bounded by OPT/λ = O(ε^{-1} m) using an averaging argument. Now, since every remaining point is within distance λ of C, we cluster all these remaining points, namely X ∖ F, with respect to C. Each cluster automatically has diameter bounded by O(λ). Finally, we take the clusters containing at least Ω(ε^{-1} m) points as the dense subset, and the remainder, along with F, forms the sparse subset, which contains at most O(|C|·ε^{-1} m) + O(ε^{-1} m) + m = O(km ε^{-1}) points in total.

With such a decomposition, one can simply put the sparse subset, which has O(km ε^{-1}) points, into the coreset, and use the vanilla coreset on the dense subset. This yields our first O(km ε^{-1}) + N size bound (recall that N is the size of the vanilla coreset) as in Theorem 1.
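The following Python sketch (illustrative only; parameter names are our own, and the paper's actual algorithm works with a tri-criteria solution and weighted points) mirrors the decomposition just described:

```python
import math

def almost_dense_decomposition(X, C, lam, size_threshold):
    """Split X into a dense subset and a sparse subset.

    X: list of points; C: an (approximate) center set for robust clustering;
    lam: target diameter bound (~ eps*OPT/m); size_threshold: ~ m/eps.
    """
    def d(x):
        return min(math.dist(x, c) for c in C)

    # Isolated points: points at distance more than lam from every center
    # (the paper additionally places the outliers of the tri-criteria solution here).
    F = [x for x in X if d(x) > lam]
    rest = [x for x in X if d(x) <= lam]

    # Cluster the remaining points to their nearest center; each cluster has diameter O(lam).
    clusters = {}
    for x in rest:
        i = min(range(len(C)), key=lambda j: math.dist(x, C[j]))
        clusters.setdefault(i, []).append(x)

    dense, sparse = [], list(F)
    for pts in clusters.values():
        (dense if len(pts) >= size_threshold else sparse).extend(pts)
    # |sparse| = O(k * m / eps) when lam = Theta(eps * OPT / m).
    return dense, sparse
```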

1.2.2 Second Reduction: (m ε^{-2} + N)·polylog(km ε^{-1}) Size Bound

Overview of Reduction II (Theorem 13).

Indeed, in Reduction I, the step of simply adding the sparse subset to the coreset may sound suboptimal. Hence, our second algorithm is built on top of Reduction I: we first follow the steps in Reduction I to build a vanilla coreset on the dense subset, and then run more refined new steps on the sparse subset. In fact, the new steps can be viewed as a separate reduction that takes the sparse subset as input and generates a robust coreset. We focus on these new steps and call them Reduction II (excluding the steps already in Reduction I). Crucially, Reduction II itself is standalone and can run on any dataset (and thus can be used independently of Reduction I), albeit it would result in a coreset whose size carries an O(log n) factor (where n is the size of its input dataset). Luckily, we eventually run it on the sparse subset, which has O(km ε^{-1}) points, so the log n factor becomes log(km ε^{-1}), and the overall size bound is competitive (this yields the other size bound in Theorem 1).

Next, we start by describing our Condition II (which is also standalone), and discuss how our Reduction II makes a vanilla coreset satisfy this condition.

Condition II (Lemma 11): Size-preserving Vanilla Coresets Are Robust

The second condition requires a stronger property in the definition of the coreset (instead of the density property of the dataset as in Condition I). We consider a size-preserving property, which is defined with respect to some λ-bounded diameter decomposition, i.e., a partition 𝒫 of the point set X such that each part P ∈ 𝒫 has diameter at most λ. For a given λ, a (vanilla) coreset S is size-preserving if there exists a λ-bounded diameter decomposition 𝒫 such that for every P ∈ 𝒫, |P| = |S ∩ P| (here we interpret the coreset, which is a weighted set, as a multiset). Intuitively, this size-preserving property means the coreset can be interpreted as moving each data point by at most distance λ. More formally, there is a one-to-one correspondence g between the dataset X and the coreset S such that dist(x, g(x)) ≤ λ.
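As a small illustration, the following Python snippet (our own; not from the paper) checks the size-preserving property of a candidate coreset against a given partition:

```python
from collections import Counter

def is_size_preserving(X, S, part_of):
    """Check that, for every part P of the partition, the coreset places exactly |P| points in P.

    X: list of points; S: dict mapping a coreset point to its integer weight (coreset as a multiset);
    part_of: function mapping a point to the id of its part in a lambda-bounded partition.
    """
    data_sizes = Counter(part_of(x) for x in X)
    coreset_sizes = Counter()
    for s, weight in S.items():
        coreset_sizes[part_of(s)] += weight
    return data_sizes == coreset_sizes
```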

Our Condition II states that if a vanilla coreset S satisfies the size-preserving property then it is automatically a robust coreset. Next, we provide an overview of the difficulties we encountered in the proof of Condition II and the high-level ideas for resolving them.

Naive Approach: Bounding the Gap between cost and cost^{(m)}.

We explore conditions under which the gap between cost(X, C) and cost^{(m)}(X, C) is small. Unlike Condition I, here we make no assumptions about the structure of X. Observe that the precise difference between the two cost functions lies in the outliers: let X_out be the set of outliers (which is defined with respect to C); then the vanilla cost(X, C) is larger and has an extra term ∑_{x ∈ X_out} dist(x, C) compared with the robust cost^{(m)}(X, C). An immediate idea to "avoid" this term in the vanilla cost(X, C) is to additionally put a center at each point in X_out, so that points from X_out contribute 0. This works perfectly when each outlier is separated from every other point, since each point in X_out forms a singleton cluster, and it suggests a necessary condition: the vanilla coreset S needs to hold for k′ = k + m centers, i.e., it is a coreset for k′-Median.

Unfortunately, centers placed on outliers may not form singleton clusters, in which case the value cost(X, C ∪ X_out) can underestimate cost^{(m)}(X, C). Even if it is possible to pick another C′ that makes this gap small, it is still challenging to also prove that cost(S, C ∪ C′) is always close to cost^{(m)}(S, C) for an arbitrary S and C (which is needed to imply cost^{(m)}(X, C) ≈ cost^{(m)}(S, C)).

Key Observation: Vanilla Coresets Also Preserve the Gap.

Hence, instead of seeking a specific center set C′ that makes the gap cost^{(m)}(X, C) − cost(X, C ∪ C′) small for every C, we show a weaker guarantee: the gap is preserved between X and the coreset S. We allow a large gap; we only require that if the gap is large in X, then it is also large in S, and the two quantities are comparable. Specifically, we show that a coreset S for k′-Median with the size-preserving property preserves this gap. Namely,

cost^{(m)}(S, C) − cost(S, C ∪ C′) = cost^{(m)}(X, C) − cost(X, C ∪ C′) ± ε·cost^{(m)}(X, C)   (2)

for some C′ ⊆ X_out. Indeed, this, combined with the guarantee cost(S, C ∪ C′) ∈ (1 ± ε)·cost(X, C ∪ C′) (which follows from the k′-Median coreset guarantee of S), implies cost^{(m)}(S, C) ≈ cost^{(m)}(X, C) (which is the robust coreset guarantee).

To see why the gap is preserved, the intuition is that only a few points contribute non-zero values to the gap cost^{(m)}(X, C) − cost(X, C ∪ C′). To put it simply, let us first consider C′ = X_out. Then only those inliers x of X that are closer to the outliers X_out than to C can have dist(x, C) ≠ dist(x, C ∪ C′). Let D denote the set of such inliers, and let us loosely assume that only g(D) contributes non-zero values to the gap cost^{(m)}(S, C) − cost(S, C ∪ C′) (recalling that g is a one-to-one correspondence between the dataset X and the coreset S such that dist(x, g(x)) ≤ λ for every x ∈ X). Then the difference of these two gaps can be upper bounded by |cost(D, C) − cost(g(D), C)| + |cost(D, C ∪ C′) − cost(g(D), C ∪ C′)| ≤ O(|D|·λ) using the triangle inequality, which is sufficient if |D| ≤ O(m) and we pick λ = ε·OPT/m.

Unfortunately, the naive choice of C′ = X_out, as in the analysis above, may not lead to a small D (noting that D is defined with respect to C′). Thus, we need to choose C′ as a subset of X_out. On the other hand, C′ cannot be made too small either; otherwise, a large error of ε·cost(X, C ∪ C′) would arise. (Recall that (2), combined with the coreset guarantee cost(S, C ∪ C′) ∈ (1 ± ε)·cost(X, C ∪ C′), forms the complete cost preservation.) We provide a simple algorithm to address this by finding an appropriate C′ such that both |D| and the error ε·cost(X, C ∪ C′) are ensured to be small.

Reduction II (Theorem 13): Making Vanilla Coresets Size-preserving

To utilize Condition II, we need to show that, for a given λ > 0, we can build a vanilla coreset S that is size-preserving, using a generic vanilla coreset algorithm in a black-box way. Specifically, the coreset should satisfy that there exists a λ-bounded partition 𝒫 = (P_1, …, P_t) of X such that |P_i| = |S ∩ P_i| for every i ∈ [t] (recalling that we treat a coreset as a multiset). We first assume a generic partition 𝒫 is given, and assume that t = |𝒫| = k for simplicity.

Key Observation: Separated Instance is Easy.

We make use of a simple observation: if the k parts are well-separated, meaning they are sufficiently distant from each other, then a vanilla coreset for k-Median is already (nearly) size-preserving. To see this, if we put one center at each of the P_j's for j ≠ i but no center at P_i, the clustering cost is roughly |P_i|·dist(P_i, X ∖ P_i), and this is within a (1 ± ε) factor of |S ∩ P_i|·dist(P_i, X ∖ P_i) (by the coreset guarantee of S). This implies the slightly relaxed property |S ∩ P_i| ∈ (1 ± ε)·|P_i|, which can be further adjusted to become strictly size-preserving.

Key Idea: Ensuring Separation Artificially.

A straightforward idea to deal with a generic 𝒫 is to artificially separate the parts. More precisely, for Euclidean ℝ^d, we define a mapping h: X → ℝ^{d+1} such that for x ∈ P_i, h(x) := (x, i·w) for some sufficiently large w (which can be loosely thought of as infinite). We define P′_i := h(P_i). Clearly, 𝒫′ := (P′_1, …, P′_k) is a λ-bounded partition of X′ := h(X), and now the partition 𝒫′ is well-separated. Thus, one can apply the above argument to 𝒫′, obtain a size-preserving coreset S′ for X′, and take the corresponding S for X.
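The lifting map itself is a one-liner; the sketch below (illustrative only; the function names are our own) shows the transform and its inverse of dropping the artificial last coordinate:

```python
def lift(X, part_of, w):
    """Map x in part P_i to (x, i*w): distances within a part are unchanged, while points in
    different parts become at least w apart (choose w much larger than diam(X))."""
    return [tuple(x) + (part_of(x) * w,) for x in X]

def unlift(X_lifted):
    """Inverse map: drop the artificial last coordinate."""
    return [x[:-1] for x in X_lifted]
```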

New Issue: Ensuring Coreset Guarantee.

However, a crucial issue arises: how can we ensure that S is a coreset for X? In other words, we need to translate the coreset guarantee of S′ (on X′) into one for S (on X). It seems that the only way is to find a "bridge" center set C′ for any fixed C such that cost(X′, C′) ≈ cost(X, C) and cost(S′, C′) ≈ cost(S, C). If such a C′ exists and S′ can handle at least |C′| centers, then the coreset guarantee of S follows. Now, let us focus on the requirement cost(X′, C′) ≈ cost(X, C). The tricky situation arises when some points in X have the same nearest center in C but belong to different parts; for instance, assume that x_i ∈ P_i and x_j ∈ P_j both have the nearest center c ∈ C. After mapping through h, the images x′_i = h(x_i) and x′_j = h(x_j) can be "infinitely" far apart. Therefore, to maintain the clustering cost, we need to include two copies, (c, i·w) and (c, j·w), in C′. In the worst case, we would need to include |𝒫| = k copies of every center in C, resulting in a size of k^2 for C′, and thus S′ would have to be a coreset for k^2-Median. Unfortunately, this worst-case scenario is unavoidable if 𝒫 is generic, and a reduction to k^2-Median falls far short of our goal, which is near-linear size in k.

Solution: Grouping Parts using Sparse Partition [37].

To address the above problem, we start by using a more careful way to "prune" the centers needed in C′. For a part P_i ∈ 𝒫, if it is far from C, specifically dist(P_i, C) > ε^{-1}·λ ≥ ε^{-1}·diam(P_i), then it suffices to include only one center in C′ for P_i, namely c_i = argmin_{c ∈ C} dist(c, P_i), and we can verify that cost(P_i, {c_i}) ∈ (1 ± ε)·cost(P_i, C). Through this pruning, the far parts increase |C′| by at most k, which is acceptable.

The challenging part is then handling the close parts. For a close part P_i, we denote by C_i the set of centers relevant to it, i.e., C_i contains the nearest centers of the points in P_i. This implies that cost(P_i, C) = cost(P_i, C_i), meaning we only need to include {(c, i·w) : c ∈ C_i} in C′ for every i. As discussed earlier, ∑_i |C_i| could be Ω(k^2) in the worst case.

To further reduce the total number of relevant centers, we propose a new idea: group these parts together to create a new partition (G_1, …, G_l) of X. The relevant centers of G_j are ∪_{i: P_i ⊆ G_j} C_i. Our goal then becomes to find a grouping strategy such that ∑_j |∪_{i: P_i ⊆ G_j} C_i| is small.

We achieve this using a more structured partition called a sparse partition [37]. Specifically, we can compute a partition 𝒬 = (G_1, …, G_l) of X such that each part G_j has diameter at most O(ε^{-1}·λ·log n), and any subset of diameter O(ε^{-1}·λ) intersects at most O(log n) parts. For simplicity, assume each part in 𝒫 is entirely covered by exactly one part in 𝒬 (though this is not generally true, the proof follows similarly). Then, 𝒬 = (G_1, …, G_l) fits our needs perfectly. To see this, consider any center c. Notice that c can appear in at most O(log n) of the sets ∪_{i: P_i ⊆ G_j} C_i, since otherwise the subset ∪_{i: c ∈ C_i} P_i, whose diameter is O(ε^{-1}·λ), would intersect more than O(log n) parts of 𝒬, contradicting the guarantee provided by the sparse partition. As a result, ∑_j |∪_{i: P_i ⊆ G_j} C_i| ≤ O(k log n). This ensures that it suffices to construct a coreset for O(k log n)-Median that is size-preserving with respect to 𝒬.

An issue is that 𝒬 is only O(ε^{-1}·λ·log n)-bounded, but this can be adjusted through rescaling, resulting in a size increase by a factor of poly(ε^{-1} log n). Given that this coreset is applied to a subset of size n = poly(km ε^{-1}) in the full algorithm, the log n term is acceptable.

1.2.3 Streaming Implementations

We provide an overview of how these steps can be adapted to the streaming setting using foundational streaming algorithmic tools.

λ-Bounded Partition.

Recall that both of our reductions rely on a λ-bounded partition 𝒫 (where λ = ε·OPT/m, and we can guess this OPT). Therefore, the key is to design a streaming algorithm that finds a λ-bounded partition. However, in the streaming setting, it is not possible to maintain the λ-bounded partition explicitly, as this requires Ω(n) space, which is not affordable.

Our strategy is to employ a hashing version of sparse partitions defined in [18], which is space-efficient and data-oblivious and thus suitable for streaming. This hashing partitions the entire ℝ^d into buckets of diameter O(λ). Hence, after the stream ends, the non-empty buckets form a λ-bounded partition of X. Moreover, this hashing also guarantees that any subset of points with diameter λ/poly(d) intersects at most poly(d) buckets (similar to the sparse partition guarantee). Combining this with the existence of the small-sized partition (as shown in Reduction I), we can conclude that there are very few, i.e., poly(dε^{-1})·(k+m), distinct non-empty buckets after this hashing.
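As a simplified stand-in for the consistent hashing of [18], a randomly shifted grid already gives a data-oblivious bucketing with bounded-diameter buckets, which suffices to convey the streaming idea. The sketch below is our own illustration and does not provide the stronger "few buckets intersected" guarantee of [18]:

```python
import random

def make_grid_hash(lam, d, seed=0):
    """Return a bucket-id function: a randomly shifted grid with cell side lam/sqrt(d),
    so every bucket has diameter at most lam. The map is data-oblivious, hence usable on a
    dynamic stream, where one maintains sketches/counters per bucket id rather than the partition."""
    rng = random.Random(seed)
    side = lam / (d ** 0.5)
    shift = [rng.uniform(0, side) for _ in range(d)]

    def bucket(x):
        return tuple(int((xi + si) // side) for xi, si in zip(x, shift))

    return bucket
```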

In the following, we fix such a hashing, and we outline how our reductions can be adapted to the streaming setting using foundational streaming algorithmic tools.

The First O(km)·poly(ε^{-1} d log Δ) Space Bound.

We first show how to implement our Reduction I in the streaming setting. Our main task is to identify all points that lie in small buckets containing fewer than O(ε^{-1} m) points, which correspond to the sparse subset B, and to construct a vanilla coreset for the points lying in the other buckets, which correspond to the dense subset. The latter is easy; once the former is achieved, we can simply run a vanilla dynamic streaming algorithm on a stream representing X ∖ B (which can be achieved using some tricks).

The former task can be translated into a streaming problem: identify at most a := poly(dε^{-1})·(k+m) buckets, each containing at most b := O(ε^{-1} m) points. This task is similar to sparse recovery in streaming, but it is two-level and thus differs from the standard one: it should recover points only from those buckets that contain a small number of points. Hence, we introduce a modified sparse recovery structure for this specific task, which takes about O(ab)·poly(d log Δ) = O(km + m^2)·poly(ε^{-1} d log Δ) space.

However, this m^2 dependence is still suboptimal. To achieve a near-linear space bound in m, we show that there exists a subset F ⊆ X of size m·poly(dε^{-1}) such that, after removing F, the remaining points span very few buckets, allowing us to set a := k·poly(d). This subset F can be efficiently identified using a two-level ℓ_0-sampler [18, Lemma 3.3] in the dynamic stream. We then use (modified) sparse recovery to identify the sparse subset after removing F, reducing the total space to O(km)·poly(ε^{-1} d log Δ).

The Second Õ(k+m)·poly(ε^{-1} d log Δ) Space Bound.

Our second algorithm first applies the same streaming procedure to identify the set F as in the first bound, which is directly added to the coreset. Recall that the number of non-empty buckets after removing F is small. Hence, these non-empty buckets give rise to a small-sized λ-bounded partition of X ∖ F, denoted by 𝒫. Our Condition II states that if we construct a vanilla coreset S that is size-preserving, i.e., |S ∩ P| = |P| for every P ∈ 𝒫, then it is directly a robust coreset. Hence, in summary, our next step is to implement the construction of a size-preserving vanilla coreset (see Reduction II) for X ∖ F in the dynamic streaming setting. Specifically, we map each point x in the stream to (x, φ(x)·w), where w is sufficiently large and φ(x) is an integer indicating the bucket to which x belongs. This transformed data is then fed to an instance of a streaming algorithm designed for vanilla coresets. At the end of the stream, we obtain a coreset from this instance and take its pre-image (deleting the last coordinate) as the coreset for X ∖ F, which is approximately size-preserving. We run a sparse recovery algorithm to recover the size of each bucket to calibrate the weights (to make it strictly size-preserving).

1.3 Related Work

Besides clustering with outliers, researchers have extensively explored coresets for various other clustering variants. Two particularly interesting variants are clustering with capacity constraints and clustering with fairness constraints, both of which have been investigated in [11, 44, 32, 15, 2, 4] for coreset constructions. Notably, the coresets for both these variants can be reduced to an assignment-preserving coreset [44, 32], for which several studies apply a hierarchical sampling framework [10] to construct small-sized coresets with assignment-preserving properties [2, 4]. Recently, Huang et al. [33] introduced a general unified model to capture a wide range of clustering problems with general assignment constraints, including clustering with fairness/capacity constraints, clustering with outliers (which we consider), and a fault-tolerant variant of clustering [39].

Coresets for other variants of clustering have also been considered, including projective clustering [21, 22, 46], clustering with missing values [8], ordered-weighted clustering [6], and line clustering [42, 40].

2 Preliminaries

For an integer n ≥ 1, let [n] := {1, 2, …, n}. Given a mapping φ: X → Y, for a subset Z ⊆ X, denote φ(Z) := {φ(x) : x ∈ Z}, and for an image y ∈ Y, denote its inverse by φ^{-1}(y) := {x ∈ X : φ(x) = y}. Let M = (V, dist) be an underlying metric space, and assume oracle access to dist. For a point set X ⊆ V, let diam(X) := max_{x,y ∈ X} dist(x, y) be the diameter of X. For a set C = {c_1, …, c_t} ⊆ V (t > 0), a clustering of X with respect to C is a partition {P_1, …, P_t} of X such that for every i ∈ [t] and every point x ∈ P_i, dist(x, c_i) = dist(x, C).

Weighted Set.

A set S with an associated weight function w_S: S → ℝ_{≥0} is a weighted set. When we talk about a weighted set A, we denote the weight function of A by w_A unless otherwise specified. An unweighted set B is interpreted as a weighted set with the unit weight function w_B ≡ 1. We say a weighted set Y is a weighted subset of X, denoted by (Y, w_Y) ⊆ (X, w_X), if Y ⊆ X and for every x ∈ Y, w_Y(x) ≤ w_X(x). For two weighted sets X and Y, the weight of their union Z := X ∪ Y is defined as w_Z := w_X + w_Y. We denote by X ∖ Y the weighted set Z with weight function w_Z, where w_Z = w_X − w_Y and Z is the support of w_Z. Here, for two weight functions f: X → ℝ and g: Y → ℝ, we use f + g to denote the weight function h: X ∪ Y → ℝ such that for every x ∈ X ∪ Y, h(x) := f(x) + g(x). Likewise, we denote by f − g the subtraction of weight functions. (If x ∉ X, we define f(x) := 0, and similarly, if x ∉ Y, we define g(x) := 0.)

Clustering with Outliers (Weighted).

In our proof, we need to consider the weighted version of (k,z,m)-Clustering, and we extend the definition as follows. For a weighted set X ⊆ V and a center set C ⊆ V with |C| = k, we define the clustering objective for (k,z)-Clustering as cost_z(X, C) := ∑_{x ∈ X} w_X(x)·dist^z(x, C). Moreover, given a real number h ≥ 0, we denote by X_{(h)} := {(Y, w_Y) ⊆ (X, w_X) : w_Y(Y) = h} the collection of all possible weighted subsets of X with total weight equal to h. We define the objective for (k,z,m)-Clustering as

cost_z^{(m)}(X, C) := min_{(Y, w_Y) ∈ X_{(m)}} cost_z(X ∖ Y, C).   (3)

One can verify that when m = 0, the definition of cost_z^{(m)} is equivalent to cost_z. We call a weighted set Y an m-outlier set of X with respect to C if (Y, w_Y) ∈ X_{(m)} and cost_z^{(m)}(X, C) = cost_z(X ∖ Y, C). We use OPT_z^{(m)}(X) to denote the optimal objective value for (k,z,m)-Clustering on X, i.e., OPT_z^{(m)}(X) := min_{C ∈ V^k} cost_z^{(m)}(X, C).
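For intuition, the minimum in (3) is attained by removing weight from the points furthest from C first. A small Python sketch of this greedy evaluation of the weighted objective (our own illustration, not from the paper):

```python
import math

def weighted_cost_z_m(X, w, C, z, m):
    """Weighted robust objective: greedily discount a total weight of m from the furthest points.

    X: list of points; w: dict point -> nonnegative weight; C: center set; z >= 1;
    m: outlier weight budget (0 <= m <= total weight).
    """
    items = sorted(((min(math.dist(x, c) for c in C), w[x]) for x in X), reverse=True)
    budget, total = m, 0.0
    for d, wx in items:
        drop = min(wx, budget)       # remove as much weight as possible from the furthest point
        budget -= drop
        total += (wx - drop) * d ** z
    return total
```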

Coresets.

Our definition of coresets is almost identical to the one in [31, Definition 2.1], except that the coreset preserves the objective simultaneously for every potential number of outliers 0 ≤ h ≤ m. (Although the definition of coresets in [31] does not require preserving the objective for all 0 ≤ h ≤ m simultaneously, their algorithm actually constructs such a coreset; see [31, Remark 3.2].)

Definition 3 (Coresets).

Given 0 < ε < 1 and a dataset X ⊆ V, an ε-coreset of X for (k,z,m)-Clustering is a weighted set S ⊆ X such that

∀C ∈ V^k, ∀ 0 ≤ h ≤ m:  cost_z^{(h)}(S, C) ∈ (1 ± ε)·cost_z^{(h)}(X, C).

A weighted set is called an ε-coreset for the special case of (k,z)-Clustering if it is an ε-coreset for (k,z,m)-Clustering with m = 0.

For analysis purposes, we also need a weaker notion of coreset, as defined in Definition 4, that allows for additive errors. Such notions were also widely used in previous coreset constructions (see e.g.,  [16, 13, 4, 31]).

Definition 4 (Additive-error coresets).

Given 0 < ε < 1, η ≥ 0 and a dataset X ⊆ V, an (ε, η)-coreset of X for (k,z,m)-Clustering is a weighted set S ⊆ X such that

∀C ∈ V^k, ∀ 0 ≤ h ≤ m:  |cost_z^{(h)}(S, C) − cost_z^{(h)}(X, C)| ≤ ε·cost_z^{(h)}(X, C) + η.

An important property of coresets is that they are composable. Roughly speaking, if S_A is a coreset of A and S_B is a coreset of B, then S_A ∪ S_B is a coreset of A ∪ B. It is well known that composability follows directly from the definition of vanilla coresets, and here we verify that it also holds for our definition of coresets for (k,z,m)-Clustering. We formally state this property with respect to the more general additive-error coresets below.

Fact 5 (Composability of coresets).

For 0 < ε < 1, η_1, η_2 ≥ 0, and two datasets X, Y ⊆ V, if S_X is an (ε, η_1)-coreset of X for (k,z,m)-Clustering and S_Y is an (ε, η_2)-coreset of Y for (k,z,m)-Clustering, then S_X ∪ S_Y is an (ε, η_1 + η_2)-coreset of X ∪ Y for (k,z,m)-Clustering.

Tri-criteria Approximation.

Similar to many previous works [4, 31, 33], we need a tri-criteria approximation algorithm for the (k,z,m)-Clustering problem. Here, an (α, β, γ)-approximation is a set of at most βk centers whose cost is at most α times the optimum, allowing a factor-γ violation in the number of outliers.

Definition 6 ((α,β,γ)-Approximation).

Given α, β, γ ≥ 1 and a dataset X ⊆ V, we say a center set C ⊆ V is an (α, β, γ)-approximate solution to (k,z,m)-Clustering on X if it holds that cost_z^{(γm)}(X, C) ≤ α·OPT_z^{(m)}(X) and |C| ≤ βk.

There exists a randomized algorithm that computes a (2^{O(z)}, O(1), O(1))-approximation in near-linear time with high probability [3] (and it is also applicable to general metrics). For a deterministic approach, a (2^{O(z)}, O(1), 1)-approximate solution can be obtained in polynomial time (e.g., [25]). A more detailed discussion of other possible algorithms for such an approximation can be found in [31, Appendix A].

Lemma 7 (Generalized triangle inequalities).

Let a, b ≥ 0 and ε ∈ (0, 1). Then for z ≥ 1, the following hold.

  1. (Lemma A.1 of [41]) (a + b)^z ≤ (1 + ε)^{z−1}·a^z + (1 + 1/ε)^{z−1}·b^z;

  2. (Claim 5 of [45]) (a + b)^z ≤ (1 + ε)·a^z + (3z/ε)^{z−1}·b^z.
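As a quick numeric sanity check (not a proof, and only over randomly sampled parameters), both inequalities of Lemma 7 can be verified in a few lines of Python:

```python
import random

random.seed(1)
for _ in range(10000):
    a, b = random.uniform(0, 10), random.uniform(0, 10)
    eps, z = random.uniform(0.05, 0.95), random.uniform(1, 4)
    lhs = (a + b) ** z
    # Item 1 (Lemma A.1 of [41])
    assert lhs <= (1 + eps) ** (z - 1) * a ** z + (1 + 1 / eps) ** (z - 1) * b ** z + 1e-9
    # Item 2 (Claim 5 of [45])
    assert lhs <= (1 + eps) * a ** z + (3 * z / eps) ** (z - 1) * b ** z + 1e-9
```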

3 Formal Statements of Reductions

In this section, we present our main results, specifically two reductions based on two conditions for a vanilla coreset to become a robust coreset, and the streaming implementations of these reductions. Due to space limitations, we provide the formal statements of our results and discuss them here; the full proof can be found in the full version [38].

We begin by defining the bounded partition below, which is a useful notion in both reductions.

Definition 8 (λ-Bounded partition).

For a dataset Y from a metric space (V, dist) and λ ≥ 0, we say that a partition 𝒫 = {P_i}_i of Y is λ-bounded if, for every P_i ∈ 𝒫, diam(P_i) ≤ λ.

Reduction I: Density of Datasets

Our first reduction is presented in Theorem 10. This reduction is based on the first condition, stated in Lemma 9, which proves that a vanilla coreset on a dense dataset serves as a robust coreset.

Lemma 9 (Condition I: vanilla coresets on dense datasets are robust).

Suppose an underlying metric space M = (V, dist) is given. For a dataset X ⊆ V, 0 < ε < 1, and integers k, z, m ≥ 1, if there exists a λ-bounded partition 𝒫 of X for some λ ≥ 0 such that for every P ∈ 𝒫, |P| ≥ (1 + ε^{-1})·m, then an ε-coreset of X for (k,z)-Clustering is also an (O(ε), 2^{O(z log(z+1))}·m·λ^z)-coreset of X for (k,z,m)-Clustering.

Theorem 10 (Reduction I).

Suppose an underlying metric space M = (V, dist) is given. Assume there exists an algorithm 𝒜 that, given 0 < ε < 1, integers k, z ≥ 1 and an n-point dataset from M as input, constructs an ε-coreset for (k,z)-Clustering of size N(n, k, ε^{-1}) in time T(n, k, ε^{-1}).

Then there is an algorithm that, given 0 < ε < 1, integers k, z, m ≥ 1, an n-point dataset X from M and a (2^{O(z)}, O(1), O(1))-approximate solution C ⊆ V to (k,z,m)-Clustering on X as input, constructs an ε-coreset S of X for (k,z,m)-Clustering. This algorithm invokes 𝒜 a constant number of times and runs in time Õ(nk) + T(n, k, O(ε^{-1})). The constructed coreset S has size 2^{O(z log z)}·O(km ε^{-1}) + N(n, k, O(ε^{-1})).

Reduction II: Size-preserving Property

Our second condition is stated in Lemma 11, which shows that a vanilla coreset that satisfies the size-preserving property is automatically a robust coreset.

Lemma 11 (Condition II: size-preserving vanilla coresets are robust).

Suppose a metric space M = (V, dist) is given. For 0 < ε < 1, λ > 0, integers k, z, m ≥ 1 and a dataset X ⊆ V, if a weighted set S ⊆ X satisfies that there exists a λ-bounded partition 𝒫 of X such that

  1. S is an ε-coreset of X for (k + |𝒫|, z)-Clustering, and

  2. ∀P ∈ 𝒫, |P| = w_S(S ∩ P),

then S is an (O(ε), 2^{O(z log(z+1))}·m·λ^z·ε^{1−z})-coreset of X for (k,z,m)-Clustering.

Based on Lemma 11, we present an alternative reduction in Theorem 13. Compared with Reduction I (Theorem 10), this reduction gives an alternative size bound with respect to k and m: O_ε(k + m) instead of O_ε(km). This reduction requires a slightly stronger technical guarantee from the vanilla coreset algorithm: the algorithm needs to construct a small coreset not only for the metric M (which contains the dataset), but also for a family of separated duplications of M. Roughly speaking, a separated duplication of M "copies" M into several disjoint sub-metrics and makes the distance between two copies large. We provide the formal definition below.

Definition 12 (Separated duplication of a metric space).

Given a metric space M = (V, dist), for a real number w ≥ 0 and an integer h ≥ 1, a metric space M′ = (V × [h], dist′) is called a w-separated h-duplication of M if

  1. ∀x, y ∈ V and ∀i ∈ [h], it holds that dist′((x, i), (y, i)) = dist(x, y); and

  2. ∀x, y ∈ V and ∀i, j ∈ [h] such that i ≠ j, it holds that dist′((x, i), (y, j)) ≥ max{w, dist(x, y)}.

For the special case of h = 1 and w = 0, we say M′ is a w-separated h-duplication of M if and only if M′ = M.

Theorem 13 (Reduction II).

Suppose an underlying metric space M = (V, dist) and a family ℳ^dup = {M^dup_{h,w}}_{h ≥ 1, w ≥ 0} of w-separated h-duplications of M are given. Assume there exists an algorithm 𝒜 such that, for every 0 < ε < 1, integers k, z ≥ 1, metric M_dup ∈ ℳ^dup and n-point dataset from M_dup, it runs in time T(M_dup, n, k, ε^{-1}) to construct an ε-coreset of size N(M_dup, n, k, ε^{-1}) for (k,z)-Clustering on M_dup.

Then, there is an algorithm that, given 0 < ε < 1, integers k, z, m ≥ 1, an n-point dataset X ⊆ V and a (2^{O(z)}, O(1), O(1))-approximate solution C ⊆ V to (k,z,m)-Clustering on X as input, constructs an ε-coreset for (k,z,m)-Clustering of size

A + N(M, n, k, O(ε^{-1})) + N(M^dup_{h,w}, O(km ε^{-1}), O(k log^2(km ε^{-1})), O(ε^{-1})),

where A = 2^{O(z log z)}·O(m ε^{-2z} log^z(km ε^{-1})), h = min{O(k log(km ε^{-1})), n} and w = O(z ε^{-1}·diam(X)·n^{1/z}). This algorithm invokes 𝒜 a constant number of times and runs in time

Õ(nk) + poly(km ε^{-1}) + T(M, n, k, ε^{-1}) + T(M^dup_{h,w}, h, O(k log^2(km ε^{-1})), O(ε^{-1})).

Notice that we assume a family ℳ^dup is given alongside M in Theorem 13, since this family ℳ^dup may depend on the specific type of M (e.g., whether M is Euclidean or the shortest-path metric of a graph) and may require careful design. Luckily, in most cases, the separated duplication family ℳ^dup does not need to be much more "complex" than M, which in particular means that, as long as a vanilla coreset algorithm works on M, it can also deal with metrics in ℳ^dup.

We claim that for metrics including Euclidean spaces, general finite metrics, doubling metrics and shortest-path metrics of minor-excluded graphs, the complexity/dimension parameter (such as the Euclidean dimension) of the separated duplication increases only by a constant (factor). For instance, for the d-dimensional Euclidean metric M = (ℝ^d, ℓ_2), we show that for any w and h there exists a w-separated h-duplication of M that can be realized by the (d+1)-dimensional Euclidean space (ℝ^{d+1}, ℓ_2). Hence, a vanilla coreset construction for those metrics automatically works for the separated duplication (with only a constant-factor increase in coreset size). We summarize the results for the separated duplication in Table 2.

Table 2: Summary of the results for the separated duplication. For all cases of the original metric space M that we list, we demonstrate the existence of a family ℳ^dup such that each M_dup ∈ ℳ^dup possesses the property shown in the table.
space M | property of M_dup ∈ ℳ^dup
(ℝ^d, ℓ_p) for p ≥ 1 | can be realized by (ℝ^{d+1}, ℓ_p)
general metric, doubling dimension ddim(M) | ddim(M_dup) ≤ 2·ddim(M) + 2
general metric, ambient size n | ambient size n^2
graph metric, treewidth tw | treewidth tw
graph metric, excluding a fixed minor H | excluding the same minor H
Streaming Implementations

Finally, we implement our reduction algorithms (both Theorem 10 and Theorem 13) in the dynamic streaming setting. We consider the standard geometric streaming model suggested by Indyk [36], where the input is from a discrete set [Δ]^d for some integer Δ ≥ 1 and is presented as a stream of point insertions and deletions. We obtain two black-box reductions in Theorem 14. These reductions turn a streaming coreset construction algorithm for (k,z)-Clustering into one for (k,z,m)-Clustering, with guarantees similar to Theorem 10 and Theorem 13, respectively.

Theorem 14 (Streaming coresets for (k,z,m)-Clustering).

Assume there exists a streaming algorithm that, given 0 < ε, δ < 1, integers k, z, d, Δ ≥ 1, and a dataset from [Δ]^d presented as a dynamic stream, uses space W(d, Δ, k, ε^{-1}, δ^{-1}) to construct an ε-coreset for (k,z)-Clustering with a failure probability of at most δ. Then there exists a streaming algorithm that, given 0 < ε, δ < 1, integers k, z, m, d, Δ ≥ 1, and a dataset X ⊆ [Δ]^d presented as a dynamic stream, constructs an ε-coreset of X for (k,z,m)-Clustering. This algorithm has a failure probability of at most δ and uses space 2^{O(z log z)}·min{W_1, W_2}·poly(d·log(δ^{-1}Δ)), where

W_1 = Õ((k + d^z)·m·ε^{-1}) + W(d, Δ, k, O(ε^{-1}), O(δ^{-1}·z·d·log Δ)),
W_2 = Õ(k + m·(d/ε)^{2z}) + W(d + 1, z·ε^{-1}·Δ·poly(d), k·poly(d), O(ε^{-1}), O(δ^{-1}·z·d·log Δ)).

References

  • [1] Maria-Florina Balcan, Steven Ehrlich, and Yingyu Liang. Distributed k-means and k-median clustering on general communication topologies. In NIPS, pages 1995–2003, 2013. URL: https://proceedings.neurips.cc/paper/2013/hash/7f975a56c761db6506eca0b37ce6ec87-Abstract.html.
  • [2] Sayan Bandyapadhyay, Fedor V. Fomin, and Kirill Simonov. On coresets for fair clustering in metric and euclidean spaces and their applications. In ICALP, volume 198 of LIPIcs, pages 23:1–23:15. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPIcs.ICALP.2021.23.
  • [3] Aditya Bhaskara, Sharvaree Vadgama, and Hong Xu. Greedy sampling for approximate clustering in the presence of outliers. In NeurIPS, pages 11146–11155, 2019. URL: https://proceedings.neurips.cc/paper/2019/hash/73983c01982794632e0270cd0006d407-Abstract.html.
  • [4] Vladimir Braverman, Vincent Cohen-Addad, Shaofeng H.-C. Jiang, Robert Krauthgamer, Chris Schwiegelshohn, Mads Bech Toftrup, and Xuan Wu. The power of uniform sampling for coresets. In FOCS, pages 462–473. IEEE, 2022. doi:10.1109/FOCS54457.2022.00051.
  • [5] Vladimir Braverman, Gereon Frahling, Harry Lang, Christian Sohler, and Lin F. Yang. Clustering high dimensional dynamic data streams. In ICML, volume 70 of Proceedings of Machine Learning Research, pages 576–585. PMLR, 2017. URL: http://proceedings.mlr.press/v70/braverman17a.html.
  • [6] Vladimir Braverman, Shaofeng H.-C. Jiang, Robert Krauthgamer, and Xuan Wu. Coresets for ordered weighted clustering. In ICML, volume 97 of Proceedings of Machine Learning Research, pages 744–753. PMLR, 2019. URL: http://proceedings.mlr.press/v97/braverman19a.html.
  • [7] Vladimir Braverman, Shaofeng H.-C. Jiang, Robert Krauthgamer, and Xuan Wu. Coresets for clustering in excluded-minor graphs and beyond. In SODA, pages 2679–2696. SIAM, 2021. doi:10.1137/1.9781611976465.159.
  • [8] Vladimir Braverman, Shaofeng H.-C. Jiang, Robert Krauthgamer, and Xuan Wu. Coresets for clustering with missing values. In NeurIPS, pages 17360–17372, 2021. URL: https://proceedings.neurips.cc/paper/2021/hash/90fd4f88f588ae64038134f1eeaa023f-Abstract.html.
  • [9] Moses Charikar, Samir Khuller, David M. Mount, and Giri Narasimhan. Algorithms for facility location problems with outliers. In SODA, pages 642–651. ACM/SIAM, 2001. URL: http://dl.acm.org/citation.cfm?id=365411.365555.
  • [10] Ke Chen. On coresets for k-median and k-means clustering in metric and euclidean spaces and their applications. SIAM J. Comput., 39(3):923–947, 2009. doi:10.1137/070699007.
  • [11] Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, and Sergei Vassilvitskii. Fair clustering through fairlets. In NIPS, pages 5029–5037, 2017. URL: https://proceedings.neurips.cc/paper/2017/hash/978fce5bcc4eccc88ad48ce3914124a2-Abstract.html.
  • [12] Vincent Cohen-Addad, Andrew Draganov, Matteo Russo, David Saulpic, and Chris Schwiegelshohn. A tight vc-dimension analysis of clustering coresets with applications. In SODA, pages 4783–4808. SIAM, 2025. doi:10.1137/1.9781611978322.162.
  • [13] Vincent Cohen-Addad, Kasper Green Larsen, David Saulpic, and Chris Schwiegelshohn. Towards optimal lower bounds for k-median and k-means coresets. In STOC, pages 1038–1051. ACM, 2022. doi:10.1145/3519935.3519946.
  • [14] Vincent Cohen-Addad, Kasper Green Larsen, David Saulpic, Chris Schwiegelshohn, and Omar Ali Sheikh-Omar. Improved coresets for euclidean k-means. In NeurIPS, 2022. URL: http://papers.nips.cc/paper_files/paper/2022/hash/120c9ab5c58ba0fa9dd3a22ace1de245-Abstract-Conference.html.
  • [15] Vincent Cohen-Addad and Jason Li. On the fixed-parameter tractability of capacitated clustering. In ICALP, volume 132 of LIPIcs, pages 41:1–41:14. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2019. doi:10.4230/LIPIcs.ICALP.2019.41.
  • [16] Vincent Cohen-Addad, David Saulpic, and Chris Schwiegelshohn. A new coreset framework for clustering. In STOC, pages 169–182. ACM, 2021. doi:10.1145/3406325.3451022.
  • [17] Vincent Cohen-Addad, David Saulpic, and Chris Schwiegelshohn. Deterministic clustering in high dimensional spaces: Sketches and approximation. In FOCS, pages 1105–1130. IEEE, 2023. doi:10.1109/FOCS57990.2023.00066.
  • [18] Artur Czumaj, Arnold Filtser, Shaofeng H.-C. Jiang, Robert Krauthgamer, Pavel Veselý, and Mingwei Yang. Streaming facility location in high dimension via geometric hashing. arXiv preprint arXiv:2204.02095, 2022. The latest version has additional results compared to the preliminary version in [19]. doi:10.48550/arXiv.2204.02095.
  • [19] Artur Czumaj, Shaofeng H.-C. Jiang, Robert Krauthgamer, Pavel Veselý, and Mingwei Yang. Streaming facility location in high dimension via geometric hashing. In FOCS, pages 450–461. IEEE, 2022. doi:10.1109/FOCS54457.2022.00050.
  • [20] Dan Feldman. Core-sets: An updated survey. WIREs Data Mining Knowl. Discov., 10(1), 2020. doi:10.1002/widm.1335.
  • [21] Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. In STOC, pages 569–578. ACM, 2011. doi:10.1145/1993636.1993712.
  • [22] Dan Feldman, Melanie Schmidt, and Christian Sohler. Turning big data into tiny data: Constant-size coresets for k-means, pca, and projective clustering. SIAM J. Comput., 49(3):601–657, 2020. doi:10.1137/18M1209854.
  • [23] Dan Feldman and Leonard J Schulman. Data reduction for weighted and outlier-resistant clustering. In Proceedings of the twenty-third annual ACM-SIAM symposium on Discrete Algorithms, pages 1343–1354. SIAM, 2012. doi:10.1137/1.9781611973099.106.
  • [24] Arnold Filtser. Scattering and sparse partitions, and their applications. In ICALP, volume 168 of LIPIcs, pages 47:1–47:20. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2020. doi:10.4230/LIPIcs.ICALP.2020.47.
  • [25] Zachary Friggstad, Kamyar Khodamoradi, Mohsen Rezapour, and Mohammad R. Salavatipour. Approximation schemes for clustering with outliers. ACM Trans. Algorithms, 15(2):26:1–26:26, 2019. doi:10.1145/3301446.
  • [26] Sariel Har-Peled and Akash Kushal. Smaller coresets for k-median and k-means clustering. Discret. Comput. Geom., 37(1):3–19, 2007. doi:10.1007/s00454-006-1271-x.
  • [27] Sariel Har-Peled and Soham Mazumdar. On coresets for k-means and k-median clustering. In STOC, pages 291–300. ACM, 2004. doi:10.1145/1007352.1007400.
  • [28] Monika Henzinger and Sagar Kale. Fully-dynamic coresets. In ESA, volume 173 of LIPIcs, pages 57:1–57:21. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2020. doi:10.4230/LIPIcs.ESA.2020.57.
  • [29] Wei Hu, Zhao Song, Lin F. Yang, and Peilin Zhong. Nearly optimal dynamic k-means clustering for high-dimensional data. CoRR, abs/1802.00459, 2018. doi:10.48550/arXiv.1802.00459.
  • [30] Lingxiao Huang, Ruiyuan Huang, Zengfeng Huang, and Xuan Wu. On coresets for clustering in small dimensional euclidean spaces. In ICML, volume 202 of Proceedings of Machine Learning Research, pages 13891–13915. PMLR, 2023. URL: https://proceedings.mlr.press/v202/huang23h.html.
  • [31] Lingxiao Huang, Shaofeng H.-C. Jiang, Jianing Lou, and Xuan Wu. Near-optimal coresets for robust clustering. In ICLR. OpenReview.net, 2023. URL: https://openreview.net/forum?id=Nc1ZkRW8Vde.
  • [32] Lingxiao Huang, Shaofeng H.-C. Jiang, and Nisheeth K. Vishnoi. Coresets for clustering with fairness constraints. In NeurIPS, pages 7587–7598, 2019. URL: https://proceedings.neurips.cc/paper/2019/hash/810dfbbebb17302018ae903e9cb7a483-Abstract.html.
  • [33] Lingxiao Huang, Jian Li, Pinyan Lu, and Xuan Wu. Coresets for constrained clustering: General assignment constraints and improved size bounds. In SODA, pages 4732–4782. SIAM, 2025. doi:10.1137/1.9781611978322.161.
  • [34] Lingxiao Huang, Jian Li, and Xuan Wu. On optimal coreset construction for euclidean (k,z)-clustering. In STOC, pages 1594–1604. ACM, 2024. doi:10.1145/3618260.3649707.
  • [35] Lingxiao Huang and Nisheeth K. Vishnoi. Coresets for clustering in euclidean spaces: importance sampling is nearly optimal. In STOC, pages 1416–1429. ACM, 2020. doi:10.1145/3357713.3384296.
  • [36] Piotr Indyk. Algorithms for dynamic geometric problems over data streams. In STOC, pages 373–380. ACM, 2004. doi:10.1145/1007352.1007413.
  • [37] Lujun Jia, Guolong Lin, Guevara Noubir, Rajmohan Rajaraman, and Ravi Sundaram. Universal approximations for tsp, steiner tree, and set cover. In STOC, pages 386–395. ACM, 2005. doi:10.1145/1060590.1060649.
  • [38] Shaofeng H.-C. Jiang and Jianing Lou. Coresets for robust clustering via black-box reductions to vanilla case. CoRR, abs/2502.07669, 2025. doi:10.48550/arXiv.2502.07669.
  • [39] Samir Khuller, Robert Pless, and Yoram J. Sussmann. Fault tolerant k-center problems. Theor. Comput. Sci., 242(1-2):237–245, 2000. doi:10.1016/S0304-3975(98)00222-9.
  • [40] Sagi Lotan, Ernesto Evgeniy Sanches Shayda, and Dan Feldman. Coreset for line-sets clustering. In NeurIPS, 2022. URL: http://papers.nips.cc/paper_files/paper/2022/hash/f2ce95887c34393af4eb240d60017860-Abstract-Conference.html.
  • [41] Konstantin Makarychev, Yury Makarychev, and Ilya P. Razenshteyn. Performance of johnson-lindenstrauss transform for k-means and k-medians clustering. In STOC, pages 1027–1038. ACM, 2019. doi:10.1145/3313276.3316350.
  • [42] Yair Marom and Dan Feldman. k-means clustering of lines for big data. In NeurIPS, pages 12797–12806, 2019. URL: https://proceedings.neurips.cc/paper/2019/hash/6084e82a08cb979cf75ae28aed37ecd4-Abstract.html.
  • [43] Alexander Munteanu and Chris Schwiegelshohn. Coresets-methods and history: A theoreticians design pattern for approximation and streaming algorithms. Künstliche Intell., 32(1):37–53, 2018. doi:10.1007/s13218-017-0519-3.
  • [44] Melanie Schmidt, Chris Schwiegelshohn, and Christian Sohler. Fair coresets and streaming algorithms for fair k-means. In WAOA, volume 11926 of Lecture Notes in Computer Science, pages 232–251. Springer, 2019. doi:10.1007/978-3-030-39479-0_16.
  • [45] Christian Sohler and David P. Woodruff. Strong coresets for k-median and subspace approximation: Goodbye dimension. In FOCS, pages 802–813. IEEE Computer Society, 2018. doi:10.1109/FOCS.2018.00081.
  • [46] Murad Tukan, Xuan Wu, Samson Zhou, Vladimir Braverman, and Dan Feldman. New coresets for projective clustering and applications. In AISTATS, volume 151 of Proceedings of Machine Learning Research, pages 5391–5415. PMLR, 2022. URL: https://proceedings.mlr.press/v151/tukan22a.html.