Coresets for Robust Clustering via Black-Box Reductions to Vanilla Case

Jiang, Shaofeng H.-C.; Lou, Jianing

doi:10.4230/LIPIcs.ICALP.2025.101

Coresets for Robust Clustering via Black-Box Reductions to Vanilla Case

Shaofeng H.-C. Jiang

School of Computer Science, Peking University, Beijing, China Jianing Lou

School of Computer Science, Peking University, Beijing, China

Abstract

We devise $\varepsilon$ -coresets for robust $(k,z)$ -Clustering with $m$ outliers through black-box reductions to vanilla clustering. Given an $\varepsilon$ -coreset construction for vanilla clustering with size $N$ , we construct coresets of size $N\cdot\operatorname{poly}\log(km\varepsilon^{-1})+O_{z}\left(\min\{km% \varepsilon^{-1},m\varepsilon^{-2z}\log^{z}(km\varepsilon^{-1})\}\right)$ for various metric spaces, where $O_{z}$ hides $2^{O(z\log z)}$ factors. This increases the size of the vanilla coreset by a small multiplicative factor of $\operatorname{poly}\log(km\varepsilon^{-1})$ , and the additive term is up to a $(\varepsilon^{-1}\log(km))^{O(z)}$ factor to the size of the optimal robust coreset. Plugging in recent vanilla coreset results of [Cohen-Addad, Saulpic and Schwiegelshohn, STOC’21; Cohen-Addad, Draganov, Russo, Saulpic and Schwiegelshohn, SODA’25], we obtain the first coresets for $(k,z)$ -Clustering with $m$ outliers with size near-linear in $k$ while previous results have size at least $\Omega(k^{2})$ [Huang, Jiang, Lou and Wu, ICLR’23; Huang, Li, Lu and Wu, SODA’25].

Technically, we establish two conditions under which a vanilla coreset is as well a robust coreset. The first condition requires the dataset to satisfy special structures – it can be broken into “dense” parts with bounded diameter. We combine this with a new bounded-diameter decomposition that has only $O_{z}(km\varepsilon^{-1})$ non-dense points to obtain the $O_{z}(km\varepsilon^{-1})$ additive bound. Another sufficient condition requires the vanilla coreset to possess an extra size-preserving property. To utilize this condition, we further give a black-box reduction that turns a vanilla coreset to the one that satisfies the said size-preserving property, and this leads to the alternative $O_{z}(m\varepsilon^{-2z}\log^{z}(km\varepsilon^{-1}))$ additive size bound.

We also give low-space implementations of our reductions in the dynamic streaming setting. Combined with known streaming constructions for vanilla coresets [Braverman, Frahling, Lang, Sohler and Yang, ICML’17; Hu, Song, Yang and Zhong, arXiv’1802.00459], we obtain the first dynamic streaming algorithms for coresets for $k$ -Median (and $k$ -Means) with $m$ outliers, using space $\tilde{O}(k+m)\cdot\mathrm{poly}(d\varepsilon^{-1}\log\Delta)$ for inputs on a discrete grid $[\Delta]^{d}$ .

Keywords and phrases:

Coresets, clustering, outliers, streaming algorithms

Category:

Track A: Algorithms, Complexity and Games

Funding:

Shaofeng H.-C. Jiang: Research partially supported by a national key R&D program of China No. 2021YFA1000900.

Copyright and License:

2012 ACM Subject Classification:

Theory of computation

\rightarrow

Streaming, sublinear and near linear time algorithms ; Theory of computation

\rightarrow

Facility location and clustering

Editors:

Keren Censor-Hillel, Fabrizio Grandoni, Joël Ouaknine, and Gabriele Puppis

Series and Publisher:

Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik

1 Introduction

$(k,z)$ -Clustering is a fundamental problem that is well studied in both computer science and related areas such as operations research. Given a metric space $(V,\operatorname{dist})$ and a dataset $X\subseteq V$ , $(k,z)$ -Clustering aims to find a set $C\subseteq V$ of $k$ centers, such that the following clustering objective is minimized:

\operatorname{cost}_{z}(X,C):=\sum_{x\in X}(\operatorname{dist}(x,C))^{z},

where $\operatorname{dist}(x,C):=\min_{c\in C}\operatorname{dist}(x,c)$ . This formulation generally captures several well known variants of $k$ -clustering, including $k$ -Median (when $z=1$ ) and $k$ -Means (when $z=2$ ).

Unfortunately, datasets can often be noisy, and the objective of $(k,z)$ -Clustering is very sensitive to noise. In fact, even adding a single outlier point that is distant from every other point can significantly bias the clustering centers towards the outlier. Hence, robust variants of clustering are highly desired to address this issue. We consider a natural formulation suggested by [9], called $(k,z)$ -Clustering clustering with $m$ outliers, $(k,z,m)$ -Clustering for short, where a parameter $m$ is introduced to denote the number of outliers. The new cost function is denoted as $\operatorname{cost}_{z}^{(m)}(X,C)$ , and it is evaluated by first removing the $m$ points that are furthest to $C$ from $X$ , denoting the resultant points as $X^{\prime}$ , and compute $\operatorname{cost}_{z}(X^{\prime},C)$ (i.e., using the vanilla $(k,z)$ -Clustering objective). Equivalently,

\operatorname{cost}_{z}^{(m)}(X,C):=\min_{Y\in\binom{X}{m}}\operatorname{cost}% _{z}(X\setminus Y,C).

Coreset [27] is a powerful technique for obtaining efficient algorithms for clustering. Roughly speaking, an $\varepsilon$ -coreset is a tiny proxy of the dataset that approximates the clustering objective within $(1\pm\varepsilon)$ factor for every potential center set $C$ . Coresets are not only useful for obtaining fast algorithms, but can also be converted into streaming [27], distributed [1] and fully-dynamic algorithms [28] via merge-and-reduce framework [27]. Tremendous progress has been made on finding small coresets for (vanilla) $(k,z)$ -Clustering. Take $k$ -Median ( $z=1$ ) in Euclidean $\mathbb{R}^{d}$ (i.e., $V=\mathbb{R}^{d},\operatorname{dist}=\ell_{2}$ ) for example. The initial coreset size $O(k\varepsilon^{-d}\log n)$ [27] has been improved in a series of works [26, 21, 45, 22, 35, 7, 16, 13], all the way to $\tilde{O}(k\varepsilon^{-3})$ via a novel sampling framework [16, 13], and this nearly matches a lower bound of $\Omega(k\varepsilon^{-2})$ [13]. (Alternatively, in the regime of $\varepsilon^{-1}\gg k$ , a better bound of $\tilde{O}(k^{4/3}\varepsilon^{-2})$ may be obtained [14, 34].)

The sampling framework proposed by [16] is modified to handle several variants of clustering [4], and this modified framework is recently adapted to obtain an $O(m)+\tilde{O}(k^{3}\varepsilon^{-5})$ size coreset for $k$ -Median with $m$ outliers [31], which exponentially improves a decade-old $(k+m)^{O(k+m)}\operatorname{poly}(d\varepsilon^{-1}\log n)$ bound [23]. This bound is further improved to $O(m)+\tilde{O}(k^{2}\varepsilon^{-4})$ in a recent work [33]. However, even this improved bound is unlikely to be tight; specifically, there is a sharp transition from $m=0$ (i.e., the vanilla case) to $m=1$ (i.e., only one outlier is considered) that increases the size by at least a $k\varepsilon^{-1}$ factor (compared with [13]). It is unclear if this transition is fundamental and whether or not it can be avoided. Technically, existing approaches for robust coresets are built on frameworks designed for the vanilla case and is adapted in an ad-hoc way. In case improved bounds are obtained for the vanilla case, it is still technically nontrivial to adapt it to the robust case.

In this paper, we systematically address these issues by proposing a new framework via black-box reductions to the vanilla case. Specifically, we wish to start from any vanilla coreset algorithm (which may be the optimal one), and figure out how to add only a few points to convert it into a robust coreset, ideally to obtain a size bound that is very close to the vanilla case. Indeed, this study helps to fundamentally understand “the price of robustness” for coresets, i.e., the exact gap in size bounds between vanilla clustering and robust clustering. Since the coreset construction is via reduction, it also opens the door to more applications, such as coreset algorithms for robust clustering in sublinear models (as long as the vanilla version can be constructed in the corresponding model).

1.1 Our Results

Our main result, stated in Theorem 1, is a novel coreset construction for $(k,z,m)$ -Clustering via a reduction to the vanilla case. Crucially, this reduction is black-box style which works with any vanilla coreset construction without knowing its implementation detail. As mentioned, this type of bound helps to fundamentally understand the “price of robustness” for coresets. Our result works for general metric spaces, and we choose to state the result for the Euclidean case which is arguably the most natural setting for $(k,z)$ -Clustering. Overall, this size bound has linear dependence in $N$ (albeit $N$ is evaluated on slightly larger parameters), and the main increase in size is additive.

Theorem 1 (Euclidean case).

Assume there is an algorithm that constructs an $\varepsilon$ -coreset for $(k,z)$ -Clustering of size $N(d,k,\varepsilon^{-1})$ for any dataset from $\operatorname{\mathbb{R}}^{d}$ . Then, there is an algorithm that constructs an $\varepsilon$ -coreset for $(k,z,m)$ -Clustering of size

\min\bigg{\{}N\left(d,k,O(\varepsilon^{-1})\right)+A_{1},\leavevmode\nobreak\ % N\left(O(d),O(k\log^{2}(km\varepsilon^{-1})),O(\varepsilon^{-1})\right)+A_{2}% \bigg{\}}

(1)

for any dataset from $\operatorname{\mathbb{R}}^{d}$ , where $A_{1}=O_{z}\left(km\varepsilon^{-1}\right)$ and $A_{2}=O_{z}\left(m\varepsilon^{-2z}\log^{z}(km\varepsilon^{-1})\right)$ . The two bounds in the $\min$ follow from Theorem 10 and Theorem 13, respectively.

As mentioned, our result in Theorem 1 is applicable to various other metric spaces besides Euclidean, including doubling metrics, general finite metrics and shortest-path metrics for graphs that exclude a fixed minor. The size bound only needs to change the parameters of $N$ according to the metric and does not change the additive terms $A_{1}$ and $A_{2}$ . For instance, for doubling metrics, our bound simply replaces the $d$ in (1) by the doubling dimension of the metric space (up to constant factor). Detailed size bounds can be found in the full version [38].

We start with justifying the tightness of our size bounds without plugging in any concrete vanilla coreset bound $N$ . Observe that Theorem 1 actually provides two size bounds, each corresponding to one of the two terms in the $\min$ of (1). The first bound of $A_{1}+N(d,k,O(\varepsilon^{-1}))$ is more useful when $m$ is small. In particular, it makes our result the first to achieve a smooth transition from $m=0$ (vanilla case) to $m>0$ (robust case) in an asymptotic sense, even when plugging in the optimal vanilla coreset. Indeed, the $O_{z}(km\varepsilon^{-1})$ bound becomes $O_{z}(k\varepsilon^{-1})$ when $m=O(1)$ , and this is asymptotically dominated by a lower bound of $\Omega(k\varepsilon^{-2})$ for vanilla coreset as shown in [13]. For the second bound, it needs to use $k$ that is $\operatorname{poly}\log(km\varepsilon^{-1})$ larger in $N$ , but under the typical case of $N(d,k,\varepsilon^{-1})=\operatorname{poly}(dk\varepsilon^{-1})$ , this only increases $N$ by a factor of $\operatorname{poly}\log(km\varepsilon^{-1})$ which is minor. Regarding its additive term $A_{2}$ , it is actually up to only a $(\varepsilon^{-1}\log(km))^{O(z)}$ factor to the optimal robust coresets, which is nearly tight with respect to $m$ due to a lower bound of $\Omega(m)$ for robust coreset [31].

Specific Coreset Size Bounds.

Since our reduction is black box, the large body of results on coresets for vanilla clustering can be readily applied. We start with listing in Table 1 the concrete coreset bounds obtained by plugging in recent results for $(k,z)$ -Clustering [13, 14, 34, 12]. These lead to the first coresets of near-linear size in $k$ for $(k,z,m)$ -Clustering in all the metric spaces listed, thereby improving the previous $k^{2}$ dependency [23, 31, 33]. We can also obtain improved coresets when plugging in results designed for specific parameters. For instance, in low-dimensional Euclidean space, a coreset for $(1,z)$ -Clustering of size $2^{O(z\log z)}\cdot\tilde{O}(\sqrt{d}\varepsilon^{-1})$ was given by [30]. Applying this result to Theorem 1, we obtain a coreset for $(1,m,z)$ -Clustering of size $2^{O(z\log z)}\cdot\tilde{O}((m+\sqrt{d})\varepsilon^{-1})$ .

Table 1: List of relevant coreset bounds under various metric spaces. In all the bounds reported in the table we omit a minor multiplicative factor of

(z\log(km\varepsilon^{-1}))^{O(z)}

. For our result listed in the last row, the exact form (without omitting the abovementioned minor factor) is

\min\{\mathbf{N}+O_{z}(km\varepsilon^{-1}),\mathbf{N}\cdot\operatorname{% polylog}(km\varepsilon^{-1})+O_{z}(m\varepsilon^{-2z}\log^{z}(km\varepsilon^{-% 1}))\}

, noting that the

\mathbf{N}

in the second term of

\min\{\cdot,\cdot\}

has a

\operatorname{polylog}(km\varepsilon^{-1})

factor.

metric space $M$	problem	size	reference
Euclidean $\mathbb{R}^{d}$	vanilla ( $\mathbf{N}$ )	$kd\varepsilon^{-\max\{2,z\}}$	[16]
	vanilla ( $\mathbf{N}$ )	$k\varepsilon^{-2-z}$	[13]
	vanilla ( $\mathbf{N}$ )	$k^{\frac{2z+2}{z+2}}\varepsilon^{-2}$	[34]
	robust, $z=1$	$(k+m)^{k+m}(\varepsilon^{-1}d\log n)^{2}$	[23]
	robust	$m+k^{3}\varepsilon^{-3z-2}$	[31]
	robust	$m+k^{2}\varepsilon^{-2z-2}$	[33]
doubling metrics	vanilla ( $\mathbf{N}$ )	$k\varepsilon^{-\max\{2,z\}}\cdot\operatorname{ddim}(M)$	[16]
	robust	$m+k^{2}\varepsilon^{-2z}\cdot(\operatorname{ddim}(M)+\varepsilon^{-1})$	[33]
$n$ -point metric space	vanilla ( $\mathbf{N}$ )	$k\varepsilon^{-\max\{2,z\}}\cdot\log n$	[16]
	robust, $z=1$	$(k+m)^{k+m}\varepsilon^{-2}\log^{4}n$	[23]
	robust	$m+k^{2}\varepsilon^{-2z}(\log n+\varepsilon^{-1})$	[33]
bounded treewidth graphs	vanilla ( $\mathbf{N}$ )	$k\varepsilon^{-\max\{2,z\}}\cdot\operatorname{tw}$	[16]
	robust	$m+k^{2}\varepsilon^{-2z-2}\cdot\operatorname{tw}$	[33]
excluded-minor graphs	vanilla ( $\mathbf{N}$ )	$k\varepsilon^{-2}\cdot\min\{\varepsilon^{-z+1},k\}$	[12]
	robust	$m+k^{2}\varepsilon^{-2z-2}$	[33]
all the above	robust	$\mathbf{N}+\min\{km\varepsilon^{-1},m\varepsilon^{-2z}\}$	ours

It is worth mentioning that our entire algorithm is deterministic, provided that the given vanilla coreset algorithm is deterministic. As highlighted in recent surveys [43, 20], deterministically constructing coresets is an important and less understood open question, even for vanilla clustering without outliers. The best-known deterministic construction works only in low-dimensional Euclidean space [26], yielding a coreset of size $\operatorname{poly}(k)\cdot\varepsilon^{-O(d)}$ . For Euclidean $k$ -Means, a coreset of size $k^{\operatorname{poly}(\varepsilon^{-1})}$ can be constructed deterministically [22], and the size can be improved to $\operatorname{poly}(k\varepsilon^{-1})$ if we consider a slightly more general definition of coresets, known as coresets with offset [17].¹¹1In the definition of coresets with offset, the original clustering objective is preserved by adding a universal constant (the offset) to the clustering objective on the coreset. All these results can be plugged into our reduction to directly achieve a deterministic coreset construction for $(k,z,m)$ -Clustering, with the coreset size increasing as shown in (1).²²2For coresets with offset, our reductions and analysis can be easily adapted and still achieve asymptotically the same coreset size.

Streaming Implementation of Theorem 1.

We also obtain reduction style coreset construction algorithms for $(k,z,m)$ -Clustering over a dynamic stream of points in $\mathbb{R}^{d}$ using small space, which follows from a streaming implementation of Theorem 1. We consider a standard streaming model proposed by [36] which has been widely employed in the literature. In this model, the input points come from a discrete grid $[\Delta]^{d}$ for some integer parameter $\Delta$ , and they arrive as a dynamic stream with point insertions and deletions. At the end of the stream, the algorithm should return a weighted set as the coreset. We present in Theorem 2 the result for the case of $z=1$ which is $k$ -Median with $m$ outliers.

Theorem 2 (Informal version of Theorem 14).

Assume there is a streaming algorithm that constructs an $\varepsilon$ -coreset for $k$ -Median for every dataset from $[\Delta]^{d}$ presented as a dynamic stream, using space $W(d,\Delta,k,\varepsilon^{-1})$ and with a failure probability of at most $1/\operatorname{poly}(d\log\Delta)$ . Then, there is a streaming algorithm that constructs an $\varepsilon$ -coreset for $(k,m)$ -Median with constant probability for every dataset from $[\Delta]^{d}$ presented as a dynamic stream, using space $\min\{W_{1},W_{2}\}\cdot\operatorname{poly}(d\log\Delta)$ , where

	$\displaystyle W_{1}$	$\displaystyle=\tilde{O}(km\varepsilon^{-1})+W\left(d,\Delta,k,O(\varepsilon^{-% 1})\right),$
	$\displaystyle W_{2}$	$\displaystyle=\tilde{O}(k+m\varepsilon^{-2})+W\left(d+1,\varepsilon^{-1}\Delta% ^{\operatorname{poly}(d)},k\operatorname{poly}(d),O(\varepsilon^{-1})\right).$

Observe that we did not explicitly state the size of the coreset, and only focus on the space complexity, since the coreset size is upper bounded by the space complexity of the streaming algorithm. Moreover, once one obtains an $\varepsilon$ -coreset at the end of the stream, one can always run another existing coreset construction again on top of it to obtain a possibly smaller coreset. Hence, it is only the space complexity that matters in streaming coreset construction. We also note that once the coreset is obtained from our streaming algorithm, one can continue to find a $(1+\varepsilon)$ -approximation to $(k,m)$ -Median without using higher order of space.

By combining Theorem 2 with an existing streaming coreset construction for $k$ -Median [5], we obtain a space complexity of $\tilde{O}\left(\min\{km\varepsilon^{-1},m\varepsilon^{-2}\}+k\varepsilon^{-2}% \right)\cdot\operatorname{poly}(d\log\Delta)$ . Remarkably, the dependence on $k$ and $\varepsilon$ matches that of [5], which is $k\varepsilon^{-2}$ . A similar bound for $z=2$ , i.e., $k$ -Means with $m$ outliers, may be obtained by combining with a dynamic streaming algorithm for $k$ -Means (e.g., [29]). These results are the first nontrivial dynamic streaming algorithms for robust $k$ -clustering, where the dependence on $\varepsilon,k,d,\log\Delta$ is comparable to existing bounds for vanilla clustering (up to degree of polynomial). We also show (in the full version) that the linear dependence on $m$ is necessary.

Our results push forward the understanding of dynamic streaming algorithms for variants of clustering. Indeed, recent works manage to devise (offline) coresets for variants of clustering (including clustering with outlier/capacity constraints; see e.g., [33]), which can immediately imply insertion-only streaming algorithms via standard techniques such as merge-and-reduce [27], but these are not readily applicable to dynamic point streams. Our new reduction style framework is a new approach that opens a door to obtaining dynamic streaming algorithms for other variants of clustering, which may be of independent interest.

1.2 Technical Overview

Our main technical contributions are the establishment of two conditions for a vanilla coreset to become a robust coreset, and black-box reductions that turn vanilla coresets to satisfy these conditions, which yields the bounds mentioned in Theorem 1. The two conditions reveal new fundamental structures of the robust clustering problem, especially the structural relation to vanilla clustering, which may be of independent interest. Moreover, to make vanilla coresets satisfy the two conditions in a black-box way, it requires us to devise a bounded-diameter decomposition called almost-dense decomposition, a new separated duplication transform for metric spaces, as well as adapting and refining a recently developed sparse partition [37, 24] (which was also studied under the notion of consistent hashing [18, 19]). The sparse partition/consistent hashing technique is also crucially used in our streaming implementation, which suggests a framework fundamentally different from those based on quadtrees as in previous dynamic streaming algorithms for (vanilla) clustering [5, 29].

We focus on presenting these ideas for the $(k,m)$ -Median problem (which is the $z=1$ case) in Euclidean $\mathbb{R}^{d}$ (although most discussion already works for general metrics). We also discuss how to implement the reductions in streaming.

1.2.1 First Reduction: $\left(O(km\varepsilon^{-1})+N\right)$ Size Bound

Condition I (Lemma 9): Vanilla Coresets on Dense Datasets Are Robust

Our first condition comes from a simple intuition: we wish $\operatorname{cost}(X,C)\approx\operatorname{cost}^{(m)}(X,C)$ for every $C\subseteq\operatorname{\mathbb{R}}^{d}$ , which is perhaps the most natural way to ensure a vanilla coreset is automatically a robust coreset. Now, a sufficient condition is to ensure the dataset is somewhat uniform, in the sense that for every center set $C\subseteq\operatorname{\mathbb{R}}^{d}$ , the contribution from every $m$ data points is at most $\varepsilon\operatorname{OPT}$ (or equivalently, each data point contributes $\varepsilon\operatorname{OPT}/m$ ). This way, whatever the $m$ outliers are, the change of objective from $\operatorname{cost}$ to $\operatorname{cost}^{(m)}$ is always within $\varepsilon$ factor.

We thus consider datasets that can be broken into parts such that each part has diameter $\lambda:=O(\varepsilon\operatorname{OPT}/m)$ and contains at least $\Omega(\varepsilon^{-1}m)$ points. In such datasets, every data point can find $\Omega(\varepsilon^{-1}m)$ points within distance $\lambda=O(\varepsilon\operatorname{OPT}/m)$ , hence its contribution is naturally charged to these nearby points, which is averaged to $O(\varepsilon\operatorname{OPT}/m)$ . We call such a dataset dense, and the above intuition leads to a formal reduction that for dense datasets, a vanilla coreset is as well a robust coreset.

Reduction I (Theorem 10): Almost-dense Decomposition

Of course, a general dataset may not be dense. Nevertheless, we manage to show a weaker almost-dense decomposition exists for every dataset. This decomposition breaks the dataset into two parts: a subset $A$ that is guaranteed to be dense and a remaining subset $B$ that only consists of $O(km\varepsilon^{-1})$ points. The subset $B$ actually has more refined property: it can be broken into parts such that each part has diameter at most $\lambda$ (which is similar to the dense part) and each contains at most $O(\varepsilon^{-1}m)$ points (which is the “complement” property of the dense part). We call subset $B$ the sparse subset.

To construct such a decomposition, we start with an optimal solution $C^{*}$ to the robust clustering (in the algorithm it suffices to use a tri-criteria solution, see Definition 6). Let $\lambda=O(\varepsilon\operatorname{OPT}/m)$ be the target diameter bound of the dense and sparse parts. We first identify those isolated points $F$ consisting of both the $m$ outliers and those “far away” points with distance more than $\lambda$ from $C^{*}$ . The number of far away points is bounded by $\operatorname{OPT}/\lambda=O(\varepsilon^{-1}m)$ using an averaging argument. Now since every remaining point is within $\lambda$ to $C^{*}$ , we cluster all these remaining points, namely $X\setminus F$ , with respect to $C^{*}$ . Each cluster automatically has a diameter bound by $O(\lambda)$ . Finally, we take clusters with more than $\Omega(\varepsilon^{-1}m)$ points as the dense subset, and the remainder, along with $F$ , forms the sparse subset, which contains at most $O(|C^{*}|\varepsilon^{-1}m)+O(\varepsilon^{-1}m)+m=O(km\varepsilon^{-1})$ points in total.

With such a decomposition, one can simply put the sparse subset who has $O(km\varepsilon^{-1})$ points into the coreset, and use the vanilla coreset on the dense subset. This yields our first $O(km\varepsilon^{-1})+N$ size bound (recall that $N$ is the size of the vanilla coreset) as in Theorem 1.

1.2.2 Second Reduction: $(m\varepsilon^{-2}+N)\operatorname{poly}\log(km\varepsilon^{-1})$ Size Bound

Overview of Reduction II (Theorem 13).

Indeed, in Reduction I, the step of simply adding sparse subset into the coreset may sound suboptimal. Hence, our second algorithm is built on top of Reduction I, such that we first follow the steps in Reduction I to build a vanilla coreset on the dense subset, and then run more refined new steps to the sparse subset. In fact, the new steps can be viewed as a separate reduction that takes the sparse subset as input and generates a robust coreset. We focus on these new steps and call it Reduction II (without the steps already in Reduction I). Crucially, Reduction II itself is standalone and can run on any dataset (and thus can be independent of Reduction I), albeit it would result in a coreset of size $O(\log n)$ (where $n$ is the size of its input dataset). Luckily, we would eventually run it on the sparse subset which has $O(km\varepsilon^{-1})$ points, so the $\log n$ factor becomes $\log(km\varepsilon^{-1})$ , and the overall size bound is competitive (which yields the other size bound in Theorem 1).

Next, we start with describing our Condition II (which is also standalone), and discuss how our Reduction II make a vanilla coreset to satisfy this condition.

Condition II (Lemma 11): Size-preserving Vanilla Coresets Are Robust

The second condition requires a stronger property in the definition of the coreset (instead of the dense property in the dataset as in Condition I). We consider a size-preserving property, which is defined with respect to some $\lambda$ -bounded diameter decomposition – a partition $\mathcal{P}$ of point set $X$ such that each part $P\in\mathcal{P}$ has diameter at most $\lambda$ . For a given $\lambda$ , a (vanilla) coreset $S$ is size-preserving, if there exists a $\lambda$ -bounded diameter decomposition $\mathcal{P}$ , such that for every $P\in\mathcal{P}$ , $|P|=|S\cap P|$ (here we interpret the coreset, which is a weighted set, as a multiset). This “size-preserving property” of coresets intuitively means the coreset could be interpreted as moving each data point by at most $\lambda$ distance. More formally, there is a one-to-one correspondence $g$ between dataset $X$ and coreset $S$ such that $\operatorname{dist}(x,g(x))\leq\lambda$ .

Our Condition II states that if a vanilla coreset $S$ satisfies the size-preserving property then it is automatically a robust coreset. Next, we provide an overview of the difficulties we encountered in the proof of Condition II and the high-level ideas for resolving them.

Naive Approach: Bounding the Gap between $\operatorname{cost}$ and $\operatorname{cost}^{(m)}$ .

We explore the condition to make the gap between $\operatorname{cost}(X,C)$ and $\operatorname{cost}^{(m)}(X,C)$ is small. Unlike Condition I, here we are now making no assumptions about the structure of $X$ . Observe that the precise difference between the two cost functions is the outliers: Let $X_{\mathrm{out}}$ be the set of outliers (which is defined with respect to $C$ ), then the vanilla $\operatorname{cost}(X,C)$ is larger and has an extra term $\sum_{x\in X_{\mathrm{out}}}\operatorname{dist}(x,C)$ compared with the robust $\operatorname{cost}^{(m)}(X,C)$ . An immediate idea to “avoid” this term in vanilla $\operatorname{cost}(X,C)$ , is to additionally put a center at each point in $X_{\mathrm{out}}$ so that points from $X_{\mathrm{out}}$ contribute $0$ . This works perfectly when each outlier is separated from every other point since each point in $X_{\mathrm{out}}$ forms a singleton cluster, and this suggests a necessary condition: the vanilla coreset $S$ needs to hold for $k^{\prime}=k+m$ centers, i.e., it is a coreset for $k^{\prime}$ -Median.

Unfortunately, centers on outliers may not form singleton clusters, in which case the value $\operatorname{cost}(X,C\cup X_{\mathrm{out}})$ can underestimate $\operatorname{cost}^{(m)}(X,C)$ . Even if it is possible to pick another $C^{\prime}$ to make this said gap small, it is still challenging to also prove that $\operatorname{cost}(S,C\cup C^{\prime})$ is always close to $\operatorname{cost}^{(m)}(S,C)$ for an arbitrary $S$ and $C$ (which is needed to imply $\operatorname{cost}^{(m)}(X,C)\approx\operatorname{cost}^{(m)}(S,C)$ ).

Key Observation: Vanilla Coresets Also Preserve the Gap.

Hence, instead of seeking for a specific center set $C^{\prime}$ that makes the gap $\operatorname{cost}^{(m)}(X,C)-\operatorname{cost}(X,C\cup C^{\prime})$ small for every $C$ , we show a weaker guarantee, such that the gap is preserved between $X$ and the coreset $S$ : we allow a large gap, just that if the gap is large in $X$ , then it is also large in $S$ , and the quantity is comparable. Specifically, we show that a coreset $S$ for $k^{\prime}$ -Median with size-preserving property can preserve this gap. Namely,

\operatorname{cost}^{(m)}(S,C)-\operatorname{cost}(S,C\cup C^{\prime})=% \operatorname{cost}^{(m)}(X,C)-\operatorname{cost}(X,C\cup C^{\prime})\pm% \varepsilon\operatorname{cost}^{(m)}(X,C)

(2)

for some $C^{\prime}\subseteq X_{\mathrm{out}}$ . Indeed, this, combined with the guarantee $\operatorname{cost}(S,C\cup C^{\prime})\in(1\pm\varepsilon)\operatorname{cost}% (X,C\cup C^{\prime})$ (which follows from the $k^{\prime}$ -Median coreset guarantee of $S$ ), implies $\operatorname{cost}^{(m)}(S,C)\approx\operatorname{cost}^{(m)}(X,C)$ (which is the robust coreset guarantee).

To see why the gap can be preserved, the intuition is that there are only a few points that contribute non-zero values to the gap $\operatorname{cost}^{(m)}(X,C)-\operatorname{cost}(X,C\cup C^{\prime})$ . To put it simply, let us first consider $C^{\prime}=X_{\mathrm{out}}$ . Then only those inliers $x$ of $X$ that are closer to outliers $X_{\mathrm{out}}$ than to $C$ can make $\operatorname{dist}(x,C)\neq\operatorname{dist}(x,C\cup C^{\prime})$ hold. Let $D$ denote such inliers, and let us loosely assume that only $g(D)$ contributes non-zero values to the gap $\operatorname{cost}^{(m)}(S,C)-\operatorname{cost}(S,C\cup C^{\prime})$ (recalling that $g$ is a one-to-one correspondence between the dataset $X$ and the coreset $S$ , such that $\operatorname{dist}(x,g(x))\leq\lambda$ for every $x\in X$ ). Then the difference of these two gaps can be upper bounded by $|\operatorname{cost}(D,C)-\operatorname{cost}(g(D),C)|+|\operatorname{cost}(D,% C\cup C^{\prime})-\operatorname{cost}(g(D),C\cup C^{\prime})|\leq O(|D|\cdot\lambda)$ using the triangle inequality, which is sufficient if $|D|\leq O(m)$ and we pick $\lambda=\varepsilon\operatorname{OPT}/m$ .

Unfortunately, the naive setup of $C^{\prime}=X_{\mathrm{out}}$ , as in the analysis above, may not lead to a small $D$ (noting that $D$ is defined with respect to $C^{\prime}$ ). Thus, we need to choose $C^{\prime}$ as a subset of $X_{\mathrm{out}}$ . On the other hand, $C^{\prime}$ cannot be made too small as well; otherwise, a large error of $\varepsilon\cdot\operatorname{cost}(X,C\cup C^{\prime})$ would arise. (Recall that (2), combining the coreset guarantee $\operatorname{cost}(S,C\cup C^{\prime})\in(1\pm\varepsilon)\cdot\operatorname{% cost}(X,C\cup C^{\prime})$ , forms the complete cost preservation.) We provide a simple algorithm to address this by finding an appropriate $C^{\prime}$ such that both $D$ and the error $\varepsilon\cdot\operatorname{cost}(X,C\cup C^{\prime})$ are ensured to be small.

Reduction II (Theorem 13): Making Vanilla Coresets Size-preserving

To utilize Condition II, we need to show that, for a given $\lambda>0$ , we can build a vanilla coreset $S$ that is size-preserving using a generic vanilla coreset algorithm in a black-box way. Specifically the coreset should satisfy that there exists a $\lambda$ -bounded partition $\mathcal{P}=(P_{1},\dots,P_{t})$ of $X$ such that $|P_{i}|=|S\cap P_{i}|$ for $i\in[t]$ (recalling that we treat a coreset as a multiset). We first assume a generic partition $\mathcal{P}$ is given, and assume that $t=|\mathcal{P}|=k$ for simplicity.

Key Observation: Separated Instance is Easy.

We make use of a simple observation: if the $k$ parts are well-separated, meaning they are sufficiently distant from each other, then a vanilla coreset for $k$ -Median is already (nearly) size-preserving. To see this, if we put one center at each of the $P_{j}$ ’s for $j\neq i$ but without any center at $P_{i}$ , the clustering cost is roughly $|P_{i}|\cdot\operatorname{dist}(P_{i},X\setminus P_{i})$ and this is up to $(1\pm\varepsilon)$ factor to $|S\cap P_{i}|\cdot\operatorname{dist}(P_{i},X\setminus P_{i})$ (by the coreset guarantee of $S$ ). This implies a slightly relaxed property $|S\cap P_{i}|\in(1\pm\varepsilon)|P_{i}|$ , and it can be further adjusted to become strictly size-preserving.

Key Idea: Ensuring Separation Artificially.

A straightforward idea to deal with a generic $\mathcal{P}$ is to artificially separate the parts. More precisely, for Euclidean $\operatorname{\mathbb{R}}^{d}$ , we define a mapping $h:P\to\mathbb{R}^{d+1}$ , such that for $x\in P_{i}$ , $h(x):=(x,i\cdot w)$ for some sufficiently large $w$ (which can be loosely considered as infinite). We define $P_{i}^{\prime}:=h(P_{i})$ . Clearly, $\mathcal{P}^{\prime}:=(P_{1}^{\prime},\ldots,P_{k}^{\prime})$ is a $\lambda$ -bounded partition for $X^{\prime}:=h(X)$ , and now the partition $\mathcal{P}^{\prime}$ is well-separated. Thus, one can apply the above argument for $\mathcal{P}^{\prime}$ , obtain a size-preserving coreset $S^{\prime}$ for $X^{\prime}$ , and find the corresponding $S$ for $X$ .

New Issue: Ensuring Coreset Guarantee.

However, a crucial issue arises: how can we ensure that $S$ is a coreset for $X$ ? In other words, we need to translate the coreset guarantee of $S^{\prime}$ (on $X^{\prime}$ ) to that of $S$ (on $X$ ). It seems that the only way is to find a “bridge” center set $C^{\prime}$ for any fixed $C$ such that $\operatorname{cost}(X,C)\approx\operatorname{cost}(X^{\prime},C^{\prime})$ and $\operatorname{cost}(S,C)\approx\operatorname{cost}(S^{\prime},C^{\prime})$ . If such $C^{\prime}$ exists and $S^{\prime}$ can handle at least $|C^{\prime}|$ centers, then the coreset guarantee of $S$ follows. Now, let us focus on the requirement $\operatorname{cost}(X,C)\approx\operatorname{cost}(X^{\prime},C^{\prime})$ . The tricky situation arises when some points in $X$ may have the same nearest center in $C$ but belong to different parts; for instance, assume that $x_{i}\in P_{i}$ and $x_{j}\in P_{j}$ both have the nearest center $c\in C$ . After mapping through $h$ , the images $x_{i}^{\prime}=h(x_{i})$ and $x_{j}^{\prime}=h(x_{j})$ can be “infinitely” far apart. Therefore, to maintain the clustering cost, we need to include two copies, $(c,i\cdot w)$ and $(c,j\cdot w)$ , in $C^{\prime}$ . In the worst case, we would need to include $|\mathcal{P}|=k$ copies of every center in $C$ , resulting in a size of $k^{2}$ for $C^{\prime}$ , and thus $S^{\prime}$ should be a coreset for $k^{2}$ -Median. Unfortunately, this worst-case scenario is unavoidable if $\mathcal{P}$ is generic, and the reduction to $k^{2}$ -Median falls far from satisfaction, as our ultimate goal is near-linear size in $k$ .

Solution: Grouping Parts using Sparse Partition [37].

To address the above problem, we start by using a more careful way to “prune” centers in $C^{\prime}$ . For a part $P_{i}\in\mathcal{P}$ , if it is far from $C$ , specifically $\operatorname{dist}(P_{i},C)>\varepsilon^{-1}\lambda\geq\varepsilon^{-1}% \mathrm{diam}(P_{i})$ , then it suffices to include only one center in $C^{\prime}$ for $P_{i}$ , which is $c_{i}=\arg\min_{c\in C}\operatorname{dist}(c,P_{i})$ , and we can verify that $\operatorname{cost}(P_{i},\{c_{i}\})\in(1\pm\varepsilon)\cdot\operatorname{% cost}(P_{i},C)$ . Through this pruning, these far parts can increase $C^{\prime}$ by at most $k$ , which is acceptable.

The challenging part is then handling the close parts. For a close part $P_{i}$ , we denote by $C_{i}$ the centers that is relevant to it, i.e., $C_{i}$ contains the nearest centers to points in $P_{i}$ . This implies that $\operatorname{cost}(P_{i},C)=\operatorname{cost}(P_{i},C_{i})$ , meaning we only need to include $\{(c,i\cdot w):c\in C_{i}\}$ in $C^{\prime}$ for every $i$ . As discussed earlier, $\sum_{i}|C_{i}|$ could be $\Omega(k^{2})$ in the worst case.

To further reduce the total number of relevant centers, we propose a new idea: group these parts together to create a new partition $(G_{1},\dots,G_{l})$ of $X$ . The relevant centers of $G_{j}$ are $\bigcup_{i:P_{i}\subseteq G_{j}}C_{i}$ . Our goal then becomes to find a grouping strategy such that $\sum_{j}|\bigcup_{i:P_{i}\subseteq G_{j}}C_{i}|$ is small.

We achieve this using a more structured partition called sparse partition [37]. Specifically, we can compute a partition $\mathcal{Q}=(G_{1},\dots,G_{l})$ of $X$ such that each part $G_{j}$ has a diameter at most $O(\varepsilon^{-1}\lambda\log n)$ , and any subset with a diameter $O(\varepsilon^{-1}\lambda)$ intersects at most $O(\log n)$ parts. For simplicity, assume each part in $\mathcal{P}$ is entirely covered by exactly one part in $\mathcal{Q}$ (though this is not generally true, the proof follows similarly). Then, $\mathcal{Q}=(G_{1},\dots,G_{l})$ fits our needs perfectly. To see this, consider any center $c$ . Notice that $c$ can appear in at most $O(\log n)$ of the sets $\bigcup_{i:P_{i}\subseteq G_{j}}C_{i}$ , since otherwise the subset $\bigcup_{i:c\in C_{i}}P_{i}$ , whose diameter is $O(\varepsilon^{-1}\lambda)$ , would intersect more than $O(\log n)$ parts, contradicting the guarantee provided by the sparse partition. As a result, $\sum_{j}|\bigcup_{i:P_{i}\subseteq G_{j}}C_{i}|\leq O(k\log n)$ . This ensures that it suffices to construct a coreset for $O(k\log n)$ -Median that is size-preserving with respect to $\mathcal{Q}$ .

An issue is that $\mathcal{Q}$ is $O(\varepsilon^{-1}\lambda\log n)$ -bounded, but this can be adjusted through rescaling, resulting in a size increase by a factor of $\operatorname{poly}(\varepsilon^{-1}\log n)$ . Given that this coreset applies to a subset of size $n=\operatorname{poly}(km\varepsilon^{-1})$ in the full algorithm, the $\log n$ term is acceptable.

1.2.3 Streaming Implementations

We provide an overview of how these steps can be adapted to the streaming setting using foundational streaming algorithmic tools.

$\lambda$ -Bounded Partition.

Recall that both of our reductions rely on a $\lambda$ -bounded partition $\mathcal{P}$ (where $\lambda=\varepsilon\operatorname{OPT}/m$ , and we can guess this $\operatorname{OPT}$ ). Therefore, the key is to design a streaming algorithm that finds a $\lambda$ -bounded partition. However, in the streaming setting, it is not possible to maintain the $\lambda$ -bounded partition explicitly, as it requires $\Omega(n)$ space, which is not affordable.

Our strategy is to employ a hashing version of sparse partition defined in [18] that is space efficient and data oblivious and thus suitable for streaming. This hashing partitions the entire $\mathbb{R}^{d}$ into buckets of diameter of $O(\lambda)$ . Hence, after the stream ends, those non-empty buckets form a $\lambda$ -bounded partition of $X$ . Moreover, this hashing also guarantees that any subset of points with a diameter of $\lambda/\operatorname{poly}(d)$ intersects at most $\operatorname{poly}(d)$ buckets (similar to the sparse partition guarantee). By combining with the existence of the small-sized partition (as shown in Reduction I), we can conclude that there are very few, i.e., $\operatorname{poly}(d\varepsilon^{-1})\cdot(k+m)$ , distinct buckets after this hashing.

In the following, we fix such a hashing, and we outline how our reductions can be adapted to the streaming setting using foundational streaming algorithmic tools.

The First $O(km)\operatorname{poly}(\varepsilon^{-1}d\log\Delta)$ Space Bound.

We first show how to implement our Reduction I in the streaming setting. Our main task is to identify all points that lie in small buckets containing fewer than $O(\varepsilon^{-1}m)$ points, which correspond to the sparse subset $B$ , and construct a vanilla coreset for the points that lie in other buckets, which correspond to the dense subset. The latter is easy; once the former is achieved, we can simply run a vanilla dynamic streaming algorithm on a stream representing $X\setminus B$ (which can be achieved using some tricks).

The former task can be translated into a streaming algorithm: identify at most $a:=\operatorname{poly}(d\varepsilon^{-1})\cdot(k+m)$ buckets, such that each contain at most $b:=O(\varepsilon^{-1}m)$ points. This task is similar to sparse recover in streaming, but it is two-level which is different from the standard one, i.e., it should recover only from those buckets that contain small number of points. Hence, we introduce a modified sparse recovery structure for this specific task, and it takes about $O(ab)\operatorname{poly}(d\log\Delta)=O(km+m^{2})\operatorname{poly}(% \varepsilon^{-1}d\log\Delta)$ space.

However, this $m^{2}$ dependence is still suboptimal. To achieve a near-linear space bound in $m$ , we show that there exists a subset $F\subseteq X$ of size $m\cdot\operatorname{poly}(d\varepsilon^{-1})$ such that, after removing $F$ , the remaining points span very few buckets, allowing us to set $a:=k\operatorname{poly}(d)$ . This subset $F$ can be efficiently identified using a two-level $\ell_{0}$ -sampler [18, Lemma 3.3] in the dynamic stream. We then use (modified) sparse recovery to identify the sparse subset after removing $F$ , reducing the total space to $O(km)\operatorname{poly}(\varepsilon^{-1}d\log\Delta)$ .

The Second $\tilde{O}(k+m)\cdot\operatorname{poly}(\varepsilon^{-1}d\log\Delta)$ Space Bound.

Our second algorithm first applies the same streaming procedure to identify the set $F$ as in the first bound, which is directly added to the coreset. Recall that the number of non-empty buckets after removing $F$ is small. Hence, these non-empty buckets give rise to a small-sized $\lambda$ -bounded partition of $X\setminus F$ , denoted by $\mathcal{P}$ . Our Condition II states that if we construct a vanilla coreset $S$ that is size-preserving, i.e., $|S\cap P|=|P|$ for every $P$ , it is directly a robust coreset. Hence, in summary, our next step is to implement the construction of a size-preserving vanilla coreset (see Reduction II) for $X\setminus F$ in the dynamic streaming setting. Specifically, we map each point $x$ in the stream to $(x,\varphi(x)\cdot w)$ , where $w$ is sufficiently large and $\varphi(x)$ is an integer indicating the bucket to which $x$ belongs. This transformed data is then fed to an instance of a streaming algorithm designed for vanilla coresets. At the end of the stream, we obtain a coreset from this instance and consider its pre-image (which deletes the last coordinate) as the coreset for $X\setminus F$ , which is approximately size-preserving. We run a sparse recovery algorithm to recover the sizes of each bucket to calibrate the weights (to make it strictly size-preserving).

1.3 Related Work

Besides clustering with outliers, researchers have extensively explored coresets for various other clustering variants. Two particularly interesting variants are clustering with capacity constraints and clustering with fairness constraints, both of which have been investigated in [11, 44, 32, 15, 2, 4] for coreset constructions. Notably, the coresets for both these variants can be reduced to an assignment-preserving coreset [44, 32], for which several studies apply a hierarchical sampling framework [10] to construct small-sized coresets with assignment-preserving properties [2, 4]. Recently, Huang et al. [33] introduced a general unified model to capture a wide range of clustering problems with general assignment constraints, including clustering with fairness/capacity constraints, clustering with outliers (which we consider), and a fault-tolerant variant of clustering [39].

Coresets for other variants of clustering have also been considered, including projective clustering [21, 22, 46], clustering with missing values [8], ordered-weighted clustering [6], and line clustering [42, 40].

2 Preliminaries

For integer $n\geq 1$ , let $[n]:=\{1,2,\dots,n\}$ . Given a mapping $\varphi:X\to Y$ , for a subset $Z\subseteq X$ , denote $\varphi(Z):=\{\varphi(x):x\in Z\}$ , and for an image $y\in Y$ , denote its inverse by $\varphi^{-1}(y):=\{x\in X:\varphi(x)=y\}$ . Let $M=(V,\operatorname{dist})$ be an underlying metric space, and we assume oracle access to $\operatorname{dist}$ . For a point set $X\subseteq V$ , let $\mathrm{diam}(X):=\max_{x,y\in X}\operatorname{dist}(x,y)$ be the diameter of $X$ . For a set $C=\{c_{1},\dots,c_{t}\}\subseteq V$ ( $t>0$ ), a clustering of $X$ with respect to $C$ is a partition $\{P_{1},\ldots,P_{t}\}$ of $X$ such that for every $i\in[t]$ and every point $x\in P_{i}$ , $\operatorname{dist}(x,c_{i})\leq\operatorname{dist}(x,C)$ .

Weighted Set.

A set $S$ with an associated weight function $w_{S}:S\to\operatorname{\mathbb{R}}_{\geq 0}$ is a weighted set. When we talk about a weighted set $A$ , we denote the weight function of $A$ by $w_{A}$ unless otherwise specified. For an unweighted set $B$ , we interpret it as weighted set with unit weight function $w_{B}\equiv 1$ . We say a weighted set $Y$ is a weighted subset of $X$ , denoted by $(Y,w_{Y})\subseteq(X,w_{X})$ , if $Y\subseteq X$ and for every $x\in Y$ , $w_{Y}(x)\leq w_{X}(x)$ . For two weighted sets $X$ and $Y$ , the weight of their union $Z:=X\cup Y$ is defined as $w_{Z}:=w_{X}+w_{Y}$ . We denote by $X-Y$ the weighted set $Z$ with the weight function $w_{Z}$ , where $w_{Z}=w_{X}-w_{Y}$ , and $Z$ is the support of $w_{Z}$ . Here, for two weight functions $f:X\to\operatorname{\mathbb{R}}$ and $g:Y\to\operatorname{\mathbb{R}}$ , we use $f+g$ to denote the weight function $h:X\cup Y\to\operatorname{\mathbb{R}}$ such that for every $x\in X\cup Y$ , $h(x):=f(x)+g(x)$ . Likewise, we denote by $f-g$ the subtraction of weight functions.³³3If $x\not\in X$ , we define $f(x):=0$ , and similarly, if $x\not\in Y$ , we define $g(x):=0$ .

Clustering with Outliers (Weighted).

In our proof, we need to consider the weighted version of $(k,z,m)$ -Clustering, and we extend the definition as follows. For a weighted set $X\subseteq V$ , and a center set $C\subseteq V$ with $|C|=k$ , we define the clustering objective for $(k,z)$ -Clustering as $\operatorname{cost}_{z}(X,C):=\sum_{x\in X}w_{X}(x)\cdot\operatorname{dist}^{z% }(x,C)$ . Moreover, given real number $h\geq 0$ , we denote by $\mathcal{L}_{X}^{(h)}:=\{(Y,w_{Y})\subseteq(X,w_{X}):w_{Y}(Y)=h\}$ the collection of all possible weighted subset of $X$ with a total weight equal to $h$ . We define the objective for $(k,z,m)$ -Clustering as

\operatorname{cost}_{z}^{(m)}(X,C):=\min_{(Y,w_{Y})\in\mathcal{L}_{X}^{(m)}}% \operatorname{cost}_{z}(X-Y,C).

(3)

One can verify that when $m=0$ , the definition of $\operatorname{cost}_{z}^{(m)}$ is equivalent to $\operatorname{cost}_{z}$ . We call a weighted set $Y$ an $m$ -outlier set of $X$ with respect to $C$ , if $(Y,w_{Y})\in\mathcal{L}_{X}^{(m)}$ and $\operatorname{cost}_{z}^{(m)}(X,C)=\operatorname{cost}_{z}(X-Y,C)$ . We use $\operatorname{OPT}_{z}^{(m)}(X)$ to denote the optimal objective value for $(k,z,m)$ -Clustering on $X$ , i.e., $\operatorname{OPT}_{z}^{(m)}(X):=\min_{C\in V^{k}}\operatorname{cost}_{z}^{(m)% }(X,C)$ .

Coresets.

Our definition of coreset is almost identical to the one in [31, Definition 2.1], except that the coreset preserves the objective for all potential number of outliers $0\leq h\leq m$ up to $m$ simultaneously.⁴⁴4Although the definition of coresets in [31] does not require preserving objective for all $0\leq h\leq m$ simultaneously, their algorithm actually constructs such a coreset (see [31, Remark 3.2]).

Definition 3 (Coresets).

Given $0<\varepsilon<1$ and a dataset $X\subseteq V$ , an $\varepsilon$ -coreset of $X$ for $(k,z,m)$ -Clustering is a weighted set $S\subseteq X$ such that

\forall C\in V^{k},0\leq h\leq m,\quad\operatorname{cost}_{z}^{(h)}(S,C)\in(1% \pm\varepsilon)\cdot\operatorname{cost}_{z}^{(h)}(X,C).

A weighted set is called an $\varepsilon$ -coreset for the special case of $(k,z)$ -Clustering if it is an $\varepsilon$ -coreset for $(k,z,m)$ -Clustering with $m=0$ .

For analysis purposes, we also need a weaker notion of coreset, as defined in Definition 4, that allows for additive errors. Such notions were also widely used in previous coreset constructions (see e.g., [16, 13, 4, 31]).

Definition 4 (Additive-error coresets).

Given $0<\varepsilon<1,\eta\geq 0$ and a dataset $X\subseteq V$ , an $(\varepsilon,\eta)$ -coreset of $X$ for $(k,z,m)$ -Clustering is a weighted set $S\subseteq X$ such that

\forall C\in V^{k},0\leq h\leq m,\quad\left|\operatorname{cost}_{z}^{(h)}(S,C)% -\operatorname{cost}_{z}^{(h)}(X,C)\right|\leq\varepsilon\cdot\operatorname{% cost}_{z}^{(h)}(X,C)+\eta.

An important property of coresets is that it is composable. Roughly speaking, if $S_{A}$ is a coreset of $A$ and $S_{B}$ is a coreset of $B$ , then $S_{A}\cup S_{B}$ is a coreset of $A\cup B$ . It is well known that composability holds directly from the definition of vanilla coreset, and here we verify that it also holds for our definition of coresets for $(k,z,m)$ -Clustering. We formally present this property with respect to the more general additive-error coreset below.

Fact 5 (Composability of coresets).

For $0<\varepsilon<1,\eta_{1},\eta_{2}\geq 0$ , and two datasets $X,Y\subseteq V$ , if $S_{X}$ is an $(\varepsilon,\eta_{1})$ -coreset of $X$ for $(k,z,m)$ -Clustering, and $S_{Y}$ is an $(\varepsilon,\eta_{2})$ -coreset of $Y$ for $(k,z,m)$ -Clustering, then $S_{X}\cup S_{Y}$ is an $(\varepsilon,\eta_{1}+\eta_{2})$ -coreset of $X\cup Y$ for $(k,z,m)$ -Clustering.

Tri-criteria Approximation.

Similar to many previous works [4, 31, 33], we need a tri-criteria approximation algorithm for the $(k,z,m)$ -Clustering problem. Here, an $(\alpha,\beta,\gamma)$ -approximation is a set of $\beta k$ centers, whose cost is at most $\alpha$ times the optimal, allowing a violation $\gamma$ to the number of outliers.

Definition 6 ( $(\alpha,\beta,\gamma)$ -Approximation).

Given $\alpha,\beta,\gamma\geq 1$ and a dataset $X\subseteq V$ , we say a center set $C\subseteq V$ is an $(\alpha,\beta,\gamma)$ -approximation solution to $(k,z,m)$ -Clustering on $X$ if it holds that $\operatorname{cost}_{z}^{(\gamma m)}(X,C)\leq\alpha\cdot\operatorname{OPT}_{z}% ^{(m)}(X)$ and $|C|\leq\beta k$ .

There exists a randomized algorithm that computes a $(2^{O(z)},O(1),O(1))$ -approximation in near-linear time with high probability [3] (which is also applicable to general metrics). For a deterministic approach, a $(2^{O(z)},O(1),1)$ -approximation solution can be obtained in polynomial time (e.g., [25]). A more detailed discussion on other possible algorithms for such an approximation can be found in [31, Appendix A].

Lemma 7 (Generalized triangle inequalities).

Let $a,b\geq 0$ and $\varepsilon\in(0,1)$ , then for $z\geq 1$ , the following holds.

1.

(Lemma A.1 of [41]) $(a+b)^{z}\leq(1+\varepsilon)^{z-1}\cdot a^{z}+(1+\frac{1}{\varepsilon})^{z-1}% \cdot b^{z}$ ;
2.

(Claim 5 of [45]) $(a+b)^{z}\leq(1+\varepsilon)\cdot a^{z}+(\frac{3z}{\varepsilon})^{z-1}\cdot b^% {z}$ .

3 Formal Statements of Reductions

In this section, we present our main results, specifically two reductions based on two conditions for a vanilla coreset to become a robust coreset, and the streaming implementations of these reductions. Due to space limitations, we provide the formal statements of our results and discuss them here; the full proof can be found in the full version [38].

We begin by defining the bounded partition below, which is a useful notion in both reductions.

Definition 8 ( $\lambda$ -Bounded partition).

For a dataset $Y$ from a metric space $(V,\operatorname{dist})$ and $\lambda\geq 0$ , we say that a partition $\mathcal{P}=\{P_{i}\}_{i}$ of $Y$ is $\lambda$ -bounded if, for every $P_{i}\in\mathcal{P}$ , $\mathrm{diam}(P_{i})\leq\lambda$ .

Reduction I: Density of Datasets

Our first reduction is presented in Theorem 10. This reduction is based on the first condition, stated in Lemma 9, which proves that a vanilla coreset on a dense dataset serves as a robust coreset.

Lemma 9 (Condition I: vanilla coresets on dense datasets are robust).

Suppose an underlying metric space $M=(V,\operatorname{dist})$ is given. For a dataset $X\subseteq V$ , $0<\varepsilon<1$ , and integers $k,z,m\geq 1$ , if there exists a $\lambda$ -bounded partition $\mathcal{P}$ of $X$ for some $\lambda\geq 0$ such that for every $P\in\mathcal{P}$ , $|P|\geq(1+\varepsilon^{-1})m$ , then an $\varepsilon$ -coreset of $X$ for $(k,z)$ -Clustering is also an $\left(O(\varepsilon),2^{O(z\log(z+1))}\cdot m\lambda^{z}\right)$ -coreset for $(k,z,m)$ -Clustering.

Theorem 10 (Reduction I).

Suppose an underlying metric space $M=(V,\operatorname{dist})$ is given. Assume there exists an algorithm $\mathcal{A}$ that, given $0<\varepsilon<1$ , integers $k,z\geq 1$ and an $n$ -point dataset from $M$ as input, constructs an $\varepsilon$ -coreset for $(k,z)$ -Clustering of size $N(n,k,\varepsilon^{-1})$ in time $T(n,k,\varepsilon^{-1})$ .

Then there is an algorithm that, given $0<\varepsilon<1$ , integers $k,z,m\geq 1$ , an $n$ -point dataset $X$ from $M$ and a $(2^{O(z)},O(1),O(1))$ -approximation solution $C^{*}\subseteq V$ to $(k,z,m)$ -Clustering on $X$ as input, constructs an $\varepsilon$ -coreset $S$ of $X$ for $(k,z,m)$ -Clustering. This algorithm invokes $\mathcal{A}$ a constant number of times and runs in time $\tilde{O}(nk)+T(n,k,O(\varepsilon^{-1}))$ . The constructed coreset $S$ has size $2^{O(z\log z)}\cdot O\left(km\varepsilon^{-1}\right)+N\left(n,k,O(\varepsilon^% {-1})\right).$

Reduction II: Size-preserving Property

Our second condition is stated in Lemma 11, which shows that a vanilla coreset that satisfies the size-preserving property is automatically a robust coreset.

Lemma 11 (Condition II: size-preserving vanilla coresets are robust).

Suppose a metric $M=(V,\operatorname{dist})$ is given. For $0<\varepsilon<1$ , $\lambda>0$ , integers $k,z,m\geq 1$ and a dataset $X\subseteq V$ , if a weighted set $S\subseteq X$ satisfies that there exists a $\lambda$ -bounded partition $\mathcal{P}$ of $X$ such that

1.

$S$ is an $\varepsilon$ -coreset of $X$ for $(k+|\mathcal{P}|,z)$ -Clustering, and
2.

$\forall P\in\mathcal{P}$ , $|P|=w_{S}(S\cap P)$ ,

then $S$ is an $\left(O(\varepsilon),2^{O(z\log(z+1))}\cdot m\lambda^{z}\varepsilon^{1-z}\right)$ -coreset of $X$ for $(k,z,m)$ -Clustering.

Based on Lemma 11, we present an alternative reduction in Theorem 13. Compared with Reduction I (Theorem 10), this reduction gives an alternative size bound with respect to $k$ and $m$ : $O_{\varepsilon}(k+m)$ instead of $O_{\varepsilon}(km)$ . This reduction requires a slightly stronger technical guarantee from the vanilla coreset algorithm, such that the algorithm not only needs to construct a small coreset for metric $M$ (that contains the dataset), but also for a family of separated duplications of $M$ . Roughly speaking, a separated duplication of $M$ “copies” $M$ into several disjoint sub-metrics, and makes the distance between two copies large. We provide the formal definition below.

Definition 12 (Seperated duplication of a metric space).

Given a metric space $M=(V,\operatorname{dist})$ , for real number $w\geq 0$ and integer $h\geq 1$ , a metric space $M^{\prime}=(V\times[h],\operatorname{dist}^{\prime})$ is called an $w$ -separated $h$ -duplication of $M$ , if

1.

$\forall x,y\in X,i\in[h]$ , it holds that $\operatorname{dist}^{\prime}((x,i),(y,i))=\operatorname{dist}(x,y)$ ; and
2.

$\forall x,y\in X,i,j\in[h]$ such that $i\neq j$ , it holds that $\operatorname{dist}^{\prime}((x,i),(y,j))\geq\max\{w,\operatorname{dist}(x,y)\}$ .

For the special case of $h=1$ and $w=0$ , we say $M^{\prime}$ is a $w$ -separated $h$ -duplication of $M$ if and only if $M^{\prime}=M$ .

Theorem 13 (Reduction II).

Suppose an underlying metric space $M=(V,\operatorname{dist})$ and a family $\mathcal{M}^{\mathrm{dup}}=\left\{M_{h,w}^{\mathrm{dup}}\right\}_{h\geq 1,w% \geq 0}$ of $w$ -separated $h$ -duplication of $M$ are given. Assume there exists an algorithm $\mathcal{A}$ such that, for every $0<\varepsilon<1$ , integers $k,z\geq 1$ , metric $M^{\mathrm{dup}}\in\mathcal{M}^{\mathrm{dup}}$ and $n$ -point dataset from $M^{\mathrm{dup}}$ , it runs in time $T(M^{\mathrm{dup}},n,k,\varepsilon^{-1})$ to construct an $\varepsilon$ -coreset of size $N(M^{\mathrm{dup}},n,k,\varepsilon^{-1})$ for $(k,z)$ -Clustering on $M^{\mathrm{dup}}$ .

Then, there is an algorithm that, given $0<\varepsilon<1$ , integers $k,z,m\geq 1$ , an $n$ -point dataset $X\subseteq V$ and a $(2^{O(z)},O(1),O(1))$ -approximation $C^{*}\subseteq V$ to $(k,z,m)$ -Clustering on $X$ as input, constructs an $\varepsilon$ -coreset for $(k,z,m)$ -Clustering of size

A+N(M,n,k,O(\varepsilon^{-1}))+N\left(M^{\mathrm{dup}}_{h^{\prime},w^{\prime}}% ,O(km\varepsilon^{-1}),O\left(k\log^{2}(km\varepsilon^{-1})\right),O(% \varepsilon^{-1})\right),

where $A=2^{O(z\log z)}\cdot O\left(m\varepsilon^{-2z}\log^{z}(km\varepsilon^{-1})\right)$ , $h^{\prime}=\min\{O(k\log(km\varepsilon^{-1})),n\}$ and $w^{\prime}=O(z\varepsilon^{-1}\cdot\mathrm{diam}(X)\cdot n^{1/z})$ . This algorithm invokes $\mathcal{A}$ a constant number of times and runs in time

\tilde{O}(nk)+\operatorname{poly}(km\varepsilon^{-1})+T(M,n,k,\varepsilon^{-1}% )+T\left(M^{\mathrm{dup}}_{h^{\prime},w^{\prime}},h^{\prime},O\left(k\log^{2}(% km\varepsilon^{-1})\right),O(\varepsilon^{-1})\right).

Notice that we assume a family $\mathcal{M}^{\mathrm{dup}}$ is given alongside $M$ in Theorem 13, since this family $\mathcal{M}^{\mathrm{dup}}$ may depend on the specific type of $M$ (e.g., $M$ is Euclidean or shortest-path metric of a graph) and may require careful design. Luckily, in most cases, the separated duplication family $\mathcal{M}^{\mathrm{dup}}$ does not need to be much more “complex” than $\mathcal{M}$ , and this particularly means as long as a vanilla coreset works on $M$ , it can also deal with metrics in $\mathcal{M}^{\mathrm{dup}}$ .

We claim that for metrics including Euclidean spaces, general finite metrics, doubling metrics and shortest-path metrics of minor-excluded graphs, the complexity/dimension parameter (such as Euclidean dimension) of the separated duplication only increases by a constant (factor). For instance, for the $d$ -dimension Euclidean metric $M=(\mathbb{R}^{d},\ell_{2})$ , we show that for any $w$ and $h$ there exists a $w$ -separated $h$ -duplication of $M$ that can be realized by a $(d+1)$ -dimensional Euclidean space $(\mathbb{R}^{d+1},\ell_{2})$ . Hence, a vanilla coreset construction for those metrics automatically works for the separated duplication (with only a constant factor increase in coreset size). We summarize the results for the separated duplication in Table 2.

Table 2: Summary of the results for the separated duplication. For all cases of the original metric space

M

that we list, we demonstrate the existence of a family

\mathcal{M}^{\mathrm{dup}}

, such that each

M^{\mathrm{dup}}\in\mathcal{M}^{\mathrm{dup}}

possesses the property as shown in the table.

space $M$		property of $M^{\mathrm{dup}}\in\mathcal{M}^{\mathrm{dup}}$
$(\operatorname{\mathbb{R}}^{d},\ell_{p})$ for $p\geq 1$		can be realized by $(\operatorname{\mathbb{R}}^{d+1},\ell_{p})$
general metric	doubling dimension $\operatorname{ddim}(M)$	$\operatorname{ddim}(M^{\mathrm{dup}})\leq 2\operatorname{ddim}(M)+2$
general metric	ambient size $n$	ambient size $\leq n^{2}$
graph metric	treewidth $\operatorname{tw}$	treewidth $\operatorname{tw}$
graph metric	excluding a fixed minor $H$	excluding the same minor $H$

Streaming Implementations

Finally, we implement our reduction algorithms (both Theorem 10 and Theorem 13) in the dynamic streaming setting. We consider the standard geometric streaming model suggested by Indyk [36], where the input is from a discrete set $[\Delta]^{d}$ for some integer $\Delta\geq 1$ and is presented as a stream of point insertion and deletions. We obtain two black-box reductions in Theorem 14. These reductions turn a streaming coreset construction algorithm for $(k,z)$ -Clustering to the one for $(k,z,m)$ -Clustering, with guarantee similar to Theorem 10 and Theorem 13, respectively.

Theorem 14 (Streaming coresets for $(k,z,m)$ -Clustering).

Assume there exists a streaming algorithm that, given $0<\varepsilon,\delta<1$ , integers $k,z,d,\Delta\geq 1$ , and a dataset from $[\Delta]^{d}$ presented as a dynamic stream, uses space $W(d,\Delta,k,\varepsilon^{-1},\delta^{-1})$ to construct an $\varepsilon$ -coreset for $(k,z)$ -Clustering with a failure probability of at most $\delta$ . Then there exists a streaming algorithm that, given $0<\varepsilon,\delta<1$ , integers $k,z,m,d,\Delta\geq 1$ , and a dataset $X\subseteq[\Delta]^{d}$ presented as a dynamic stream, constructs an $\varepsilon$ -coreset of $X$ for $(k,z,m)$ -Clustering. This algorithm has a failure probability of at most $\delta$ and uses space $2^{O(z\log z)}\cdot\min\{W_{1},W_{2}\}\cdot\operatorname{poly}(d\log(\delta^{-% 1}\Delta))$ , where

	$\displaystyle W_{1}$	$\displaystyle=\tilde{O}\left((k+d^{z})m\varepsilon^{-1}\right)+W\left(d,\Delta% ,k,O(\varepsilon^{-1}),O(\delta^{-1}zd\log\Delta)\right),$
	$\displaystyle W_{2}$	$\displaystyle=\tilde{O}\left(k+m(d/\varepsilon)^{2z}\right)+W\left(d+1,z% \varepsilon^{-1}\Delta^{\operatorname{poly}(d)},k\operatorname{poly}(d),O(% \varepsilon^{-1}),O(\delta^{-1}zd\log\Delta)\right).$

References

[1] Maria-Florina Balcan, Steven Ehrlich, and Yingyu Liang. Distributed k-means and k-median clustering on general communication topologies. In NIPS, pages 1995–2003, 2013. URL: https://proceedings.neurips.cc/paper/2013/hash/7f975a56c761db6506eca0b37ce6ec87-Abstract.html.
[2] Sayan Bandyapadhyay, Fedor V. Fomin, and Kirill Simonov. On coresets for fair clustering in metric and euclidean spaces and their applications. In ICALP, volume 198 of LIPIcs, pages 23:1–23:15. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPIcs.ICALP.2021.23.
[3] Aditya Bhaskara, Sharvaree Vadgama, and Hong Xu. Greedy sampling for approximate clustering in the presence of outliers. In NeurIPS, pages 11146–11155, 2019. URL: https://proceedings.neurips.cc/paper/2019/hash/73983c01982794632e0270cd0006d407-Abstract.html.
[4] Vladimir Braverman, Vincent Cohen-Addad, Shaofeng H.-C. Jiang, Robert Krauthgamer, Chris Schwiegelshohn, Mads Bech Toftrup, and Xuan Wu. The power of uniform sampling for coresets. In FOCS, pages 462–473. IEEE, 2022. doi:10.1109/FOCS54457.2022.00051.
[5] Vladimir Braverman, Gereon Frahling, Harry Lang, Christian Sohler, and Lin F. Yang. Clustering high dimensional dynamic data streams. In ICML, volume 70 of Proceedings of Machine Learning Research, pages 576–585. PMLR, 2017. URL: http://proceedings.mlr.press/v70/braverman17a.html.
[6] Vladimir Braverman, Shaofeng H.-C. Jiang, Robert Krauthgamer, and Xuan Wu. Coresets for ordered weighted clustering. In ICML, volume 97 of Proceedings of Machine Learning Research, pages 744–753. PMLR, 2019. URL: http://proceedings.mlr.press/v97/braverman19a.html.
[7] Vladimir Braverman, Shaofeng H.-C. Jiang, Robert Krauthgamer, and Xuan Wu. Coresets for clustering in excluded-minor graphs and beyond. In SODA, pages 2679–2696. SIAM, 2021. doi:10.1137/1.9781611976465.159.
[8] Vladimir Braverman, Shaofeng H.-C. Jiang, Robert Krauthgamer, and Xuan Wu. Coresets for clustering with missing values. In NeurIPS, pages 17360–17372, 2021. URL: https://proceedings.neurips.cc/paper/2021/hash/90fd4f88f588ae64038134f1eeaa023f-Abstract.html.
[9] Moses Charikar, Samir Khuller, David M. Mount, and Giri Narasimhan. Algorithms for facility location problems with outliers. In SODA, pages 642–651. ACM/SIAM, 2001. URL: http://dl.acm.org/citation.cfm?id=365411.365555.
[10] Ke Chen. On coresets for k-median and k-means clustering in metric and euclidean spaces and their applications. SIAM J. Comput., 39(3):923–947, 2009. doi:10.1137/070699007.
[11] Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, and Sergei Vassilvitskii. Fair clustering through fairlets. In NIPS, pages 5029–5037, 2017. URL: https://proceedings.neurips.cc/paper/2017/hash/978fce5bcc4eccc88ad48ce3914124a2-Abstract.html.
[12] Vincent Cohen-Addad, Andrew Draganov, Matteo Russo, David Saulpic, and Chris Schwiegelshohn. A tight vc-dimension analysis of clustering coresets with applications. In SODA, pages 4783–4808. SIAM, 2025. doi:10.1137/1.9781611978322.162.
[13] Vincent Cohen-Addad, Kasper Green Larsen, David Saulpic, and Chris Schwiegelshohn. Towards optimal lower bounds for k-median and k-means coresets. In STOC, pages 1038–1051. ACM, 2022. doi:10.1145/3519935.3519946.
[14] Vincent Cohen-Addad, Kasper Green Larsen, David Saulpic, Chris Schwiegelshohn, and Omar Ali Sheikh-Omar. Improved coresets for euclidean k-means. In NeurIPS, 2022. URL: http://papers.nips.cc/paper_files/paper/2022/hash/120c9ab5c58ba0fa9dd3a22ace1de245-Abstract-Conference.html.
[15] Vincent Cohen-Addad and Jason Li. On the fixed-parameter tractability of capacitated clustering. In ICALP, volume 132 of LIPIcs, pages 41:1–41:14. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2019. doi:10.4230/LIPIcs.ICALP.2019.41.
[16] Vincent Cohen-Addad, David Saulpic, and Chris Schwiegelshohn. A new coreset framework for clustering. In STOC, pages 169–182. ACM, 2021. doi:10.1145/3406325.3451022.
[17] Vincent Cohen-Addad, David Saulpic, and Chris Schwiegelshohn. Deterministic clustering in high dimensional spaces: Sketches and approximation. In FOCS, pages 1105–1130. IEEE, 2023. doi:10.1109/FOCS57990.2023.00066.
[18] Artur Czumaj, Arnold Filtser, Shaofeng H.-C. Jiang, Robert Krauthgamer, Pavel Veselý, and Mingwei Yang. Streaming facility location in high dimension via geometric hashing. arXiv preprint arXiv:2204.02095, 2022. The latest version has additional results compared to the preliminary version in [19]. doi:10.48550/arXiv.2204.02095.
[19] Artur Czumaj, Shaofeng H.-C. Jiang, Robert Krauthgamer, Pavel Veselý, and Mingwei Yang. Streaming facility location in high dimension via geometric hashing. In FOCS, pages 450–461. IEEE, 2022. doi:10.1109/FOCS54457.2022.00050.
[20] Dan Feldman. Core-sets: An updated survey. WIREs Data Mining Knowl. Discov., 10(1), 2020. doi:10.1002/widm.1335.
[21] Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. In STOC, pages 569–578. ACM, 2011. doi:10.1145/1993636.1993712.
[22] Dan Feldman, Melanie Schmidt, and Christian Sohler. Turning big data into tiny data: Constant-size coresets for k-means, pca, and projective clustering. SIAM J. Comput., 49(3):601–657, 2020. doi:10.1137/18M1209854.
[23] Dan Feldman and Leonard J Schulman. Data reduction for weighted and outlier-resistant clustering. In Proceedings of the twenty-third annual ACM-SIAM symposium on Discrete Algorithms, pages 1343–1354. SIAM, 2012. doi:10.1137/1.9781611973099.106.
[24] Arnold Filtser. Scattering and sparse partitions, and their applications. In ICALP, volume 168 of LIPIcs, pages 47:1–47:20. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2020. doi:10.4230/LIPIcs.ICALP.2020.47.
[25] Zachary Friggstad, Kamyar Khodamoradi, Mohsen Rezapour, and Mohammad R. Salavatipour. Approximation schemes for clustering with outliers. ACM Trans. Algorithms, 15(2):26:1–26:26, 2019. doi:10.1145/3301446.
[26] Sariel Har-Peled and Akash Kushal. Smaller coresets for $k$ -median and $k$ -means clustering. Discret. Comput. Geom., 37(1):3–19, 2007. doi:10.1007/s00454-006-1271-x.
[27] Sariel Har-Peled and Soham Mazumdar. On coresets for $k$ -means and $k$ -median clustering. In STOC, pages 291–300. ACM, 2004. doi:10.1145/1007352.1007400.
[28] Monika Henzinger and Sagar Kale. Fully-dynamic coresets. In ESA, volume 173 of LIPIcs, pages 57:1–57:21. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2020. doi:10.4230/LIPIcs.ESA.2020.57.
[29] Wei Hu, Zhao Song, Lin F. Yang, and Peilin Zhong. Nearly optimal dynamic $k$ -means clustering for high-dimensional data. CoRR, abs/1802.00459, 2018. doi:10.48550/arXiv.1802.00459.
[30] Lingxiao Huang, Ruiyuan Huang, Zengfeng Huang, and Xuan Wu. On coresets for clustering in small dimensional euclidean spaces. In ICML, volume 202 of Proceedings of Machine Learning Research, pages 13891–13915. PMLR, 2023. URL: https://proceedings.mlr.press/v202/huang23h.html.
[31] Lingxiao Huang, Shaofeng H.-C. Jiang, Jianing Lou, and Xuan Wu. Near-optimal coresets for robust clustering. In ICLR. OpenReview.net, 2023. URL: https://openreview.net/forum?id=Nc1ZkRW8Vde.
[32] Lingxiao Huang, Shaofeng H.-C. Jiang, and Nisheeth K. Vishnoi. Coresets for clustering with fairness constraints. In NeurIPS, pages 7587–7598, 2019. URL: https://proceedings.neurips.cc/paper/2019/hash/810dfbbebb17302018ae903e9cb7a483-Abstract.html.
[33] Lingxiao Huang, Jian Li, Pinyan Lu, and Xuan Wu. Coresets for constrained clustering: General assignment constraints and improved size bounds. In SODA, pages 4732–4782. SIAM, 2025. doi:10.1137/1.9781611978322.161.
[34] Lingxiao Huang, Jian Li, and Xuan Wu. On optimal coreset construction for euclidean $(k,z)$ -clustering. In STOC, pages 1594–1604. ACM, 2024. doi:10.1145/3618260.3649707.
[35] Lingxiao Huang and Nisheeth K. Vishnoi. Coresets for clustering in euclidean spaces: importance sampling is nearly optimal. In STOC, pages 1416–1429. ACM, 2020. doi:10.1145/3357713.3384296.
[36] Piotr Indyk. Algorithms for dynamic geometric problems over data streams. In STOC, pages 373–380. ACM, 2004. doi:10.1145/1007352.1007413.
[37] Lujun Jia, Guolong Lin, Guevara Noubir, Rajmohan Rajaraman, and Ravi Sundaram. Universal approximations for tsp, steiner tree, and set cover. In STOC, pages 386–395. ACM, 2005. doi:10.1145/1060590.1060649.
[38] Shaofeng H.-C. Jiang and Jianing Lou. Coresets for robust clustering via black-box reductions to vanilla case. CoRR, abs/2502.07669, 2025. doi:10.48550/arXiv.2502.07669.
[39] Samir Khuller, Robert Pless, and Yoram J. Sussmann. Fault tolerant k-center problems. Theor. Comput. Sci., 242(1-2):237–245, 2000. doi:10.1016/S0304-3975(98)00222-9.
[40] Sagi Lotan, Ernesto Evgeniy Sanches Shayda, and Dan Feldman. Coreset for line-sets clustering. In NeurIPS, 2022. URL: http://papers.nips.cc/paper_files/paper/2022/hash/f2ce95887c34393af4eb240d60017860-Abstract-Conference.html.
[41] Konstantin Makarychev, Yury Makarychev, and Ilya P. Razenshteyn. Performance of johnson-lindenstrauss transform for k-means and k-medians clustering. In STOC, pages 1027–1038. ACM, 2019. doi:10.1145/3313276.3316350.
[42] Yair Marom and Dan Feldman. k-means clustering of lines for big data. In NeurIPS, pages 12797–12806, 2019. URL: https://proceedings.neurips.cc/paper/2019/hash/6084e82a08cb979cf75ae28aed37ecd4-Abstract.html.
[43] Alexander Munteanu and Chris Schwiegelshohn. Coresets-methods and history: A theoreticians design pattern for approximation and streaming algorithms. Künstliche Intell., 32(1):37–53, 2018. doi:10.1007/s13218-017-0519-3.
[44] Melanie Schmidt, Chris Schwiegelshohn, and Christian Sohler. Fair coresets and streaming algorithms for fair k-means. In WAOA, volume 11926 of Lecture Notes in Computer Science, pages 232–251. Springer, 2019. doi:10.1007/978-3-030-39479-0_16.
[45] Christian Sohler and David P. Woodruff. Strong coresets for k-median and subspace approximation: Goodbye dimension. In FOCS, pages 802–813. IEEE Computer Society, 2018. doi:10.1109/FOCS.2018.00081.
[46] Murad Tukan, Xuan Wu, Samson Zhou, Vladimir Braverman, and Dan Feldman. New coresets for projective clustering and applications. In AISTATS, volume 151 of Proceedings of Machine Learning Research, pages 5391–5415. PMLR, 2022. URL: https://proceedings.mlr.press/v151/tukan22a.html.

[bib.bib1] [1] Maria-Florina Balcan, Steven Ehrlich, and Yingyu Liang. Distributed k-means and k-median clustering on general communication topologies. In NIPS, pages 1995–2003, 2013. URL: https://proceedings.neurips.cc/paper/2013/hash/7f975a56c761db6506eca0b37ce6ec87-Abstract.html.

[bib.bib2] [2] Sayan Bandyapadhyay, Fedor V. Fomin, and Kirill Simonov. On coresets for fair clustering in metric and euclidean spaces and their applications. In ICALP, volume 198 of LIPIcs, pages 23:1–23:15. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPIcs.ICALP.2021.23.

[bib.bib3] [3] Aditya Bhaskara, Sharvaree Vadgama, and Hong Xu. Greedy sampling for approximate clustering in the presence of outliers. In NeurIPS, pages 11146–11155, 2019. URL: https://proceedings.neurips.cc/paper/2019/hash/73983c01982794632e0270cd0006d407-Abstract.html.

[bib.bib4] [4] Vladimir Braverman, Vincent Cohen-Addad, Shaofeng H.-C. Jiang, Robert Krauthgamer, Chris Schwiegelshohn, Mads Bech Toftrup, and Xuan Wu. The power of uniform sampling for coresets. In FOCS, pages 462–473. IEEE, 2022. doi:10.1109/FOCS54457.2022.00051.

[bib.bib5] [5] Vladimir Braverman, Gereon Frahling, Harry Lang, Christian Sohler, and Lin F. Yang. Clustering high dimensional dynamic data streams. In ICML, volume 70 of Proceedings of Machine Learning Research, pages 576–585. PMLR, 2017. URL: http://proceedings.mlr.press/v70/braverman17a.html.

[bib.bib6] [6] Vladimir Braverman, Shaofeng H.-C. Jiang, Robert Krauthgamer, and Xuan Wu. Coresets for ordered weighted clustering. In ICML, volume 97 of Proceedings of Machine Learning Research, pages 744–753. PMLR, 2019. URL: http://proceedings.mlr.press/v97/braverman19a.html.

[bib.bib7] [7] Vladimir Braverman, Shaofeng H.-C. Jiang, Robert Krauthgamer, and Xuan Wu. Coresets for clustering in excluded-minor graphs and beyond. In SODA, pages 2679–2696. SIAM, 2021. doi:10.1137/1.9781611976465.159.

[bib.bib8] [8] Vladimir Braverman, Shaofeng H.-C. Jiang, Robert Krauthgamer, and Xuan Wu. Coresets for clustering with missing values. In NeurIPS, pages 17360–17372, 2021. URL: https://proceedings.neurips.cc/paper/2021/hash/90fd4f88f588ae64038134f1eeaa023f-Abstract.html.

[bib.bib9] [9] Moses Charikar, Samir Khuller, David M. Mount, and Giri Narasimhan. Algorithms for facility location problems with outliers. In SODA, pages 642–651. ACM/SIAM, 2001. URL: http://dl.acm.org/citation.cfm?id=365411.365555.

[bib.bib10] [10] Ke Chen. On coresets for k-median and k-means clustering in metric and euclidean spaces and their applications. SIAM J. Comput., 39(3):923–947, 2009. doi:10.1137/070699007.

[bib.bib11] [11] Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, and Sergei Vassilvitskii. Fair clustering through fairlets. In NIPS, pages 5029–5037, 2017. URL: https://proceedings.neurips.cc/paper/2017/hash/978fce5bcc4eccc88ad48ce3914124a2-Abstract.html.

[bib.bib12] [12] Vincent Cohen-Addad, Andrew Draganov, Matteo Russo, David Saulpic, and Chris Schwiegelshohn. A tight vc-dimension analysis of clustering coresets with applications. In SODA, pages 4783–4808. SIAM, 2025. doi:10.1137/1.9781611978322.162.

[bib.bib13] [13] Vincent Cohen-Addad, Kasper Green Larsen, David Saulpic, and Chris Schwiegelshohn. Towards optimal lower bounds for k-median and k-means coresets. In STOC, pages 1038–1051. ACM, 2022. doi:10.1145/3519935.3519946.

[bib.bib14] [14] Vincent Cohen-Addad, Kasper Green Larsen, David Saulpic, Chris Schwiegelshohn, and Omar Ali Sheikh-Omar. Improved coresets for euclidean k-means. In NeurIPS, 2022. URL: http://papers.nips.cc/paper_files/paper/2022/hash/120c9ab5c58ba0fa9dd3a22ace1de245-Abstract-Conference.html.

[bib.bib15] [15] Vincent Cohen-Addad and Jason Li. On the fixed-parameter tractability of capacitated clustering. In ICALP, volume 132 of LIPIcs, pages 41:1–41:14. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2019. doi:10.4230/LIPIcs.ICALP.2019.41.

[bib.bib16] [16] Vincent Cohen-Addad, David Saulpic, and Chris Schwiegelshohn. A new coreset framework for clustering. In STOC, pages 169–182. ACM, 2021. doi:10.1145/3406325.3451022.

[bib.bib17] [17] Vincent Cohen-Addad, David Saulpic, and Chris Schwiegelshohn. Deterministic clustering in high dimensional spaces: Sketches and approximation. In FOCS, pages 1105–1130. IEEE, 2023. doi:10.1109/FOCS57990.2023.00066.

[bib.bib18] [18] Artur Czumaj, Arnold Filtser, Shaofeng H.-C. Jiang, Robert Krauthgamer, Pavel Veselý, and Mingwei Yang. Streaming facility location in high dimension via geometric hashing. arXiv preprint arXiv:2204.02095, 2022. The latest version has additional results compared to the preliminary version in [19]. doi:10.48550/arXiv.2204.02095.

[bib.bib19] [19] Artur Czumaj, Shaofeng H.-C. Jiang, Robert Krauthgamer, Pavel Veselý, and Mingwei Yang. Streaming facility location in high dimension via geometric hashing. In FOCS, pages 450–461. IEEE, 2022. doi:10.1109/FOCS54457.2022.00050.

[bib.bib20] [20] Dan Feldman. Core-sets: An updated survey. WIREs Data Mining Knowl. Discov., 10(1), 2020. doi:10.1002/widm.1335.

[bib.bib21] [21] Dan Feldman and Michael Langberg. A unified framework for approximating and clustering data. In STOC, pages 569–578. ACM, 2011. doi:10.1145/1993636.1993712.

[bib.bib22] [22] Dan Feldman, Melanie Schmidt, and Christian Sohler. Turning big data into tiny data: Constant-size coresets for k-means, pca, and projective clustering. SIAM J. Comput., 49(3):601–657, 2020. doi:10.1137/18M1209854.

[bib.bib23] [23] Dan Feldman and Leonard J Schulman. Data reduction for weighted and outlier-resistant clustering. In Proceedings of the twenty-third annual ACM-SIAM symposium on Discrete Algorithms, pages 1343–1354. SIAM, 2012. doi:10.1137/1.9781611973099.106.

[bib.bib24] [24] Arnold Filtser. Scattering and sparse partitions, and their applications. In ICALP, volume 168 of LIPIcs, pages 47:1–47:20. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2020. doi:10.4230/LIPIcs.ICALP.2020.47.

[bib.bib25] [25] Zachary Friggstad, Kamyar Khodamoradi, Mohsen Rezapour, and Mohammad R. Salavatipour. Approximation schemes for clustering with outliers. ACM Trans. Algorithms, 15(2):26:1–26:26, 2019. doi:10.1145/3301446.

[bib.bib26] [26] Sariel Har-Peled and Akash Kushal. Smaller coresets for $k$ -median and $k$ -means clustering. Discret. Comput. Geom., 37(1):3–19, 2007. doi:10.1007/s00454-006-1271-x.

[bib.bib27] [27] Sariel Har-Peled and Soham Mazumdar. On coresets for $k$ -means and $k$ -median clustering. In STOC, pages 291–300. ACM, 2004. doi:10.1145/1007352.1007400.

[bib.bib28] [28] Monika Henzinger and Sagar Kale. Fully-dynamic coresets. In ESA, volume 173 of LIPIcs, pages 57:1–57:21. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2020. doi:10.4230/LIPIcs.ESA.2020.57.

[bib.bib29] [29] Wei Hu, Zhao Song, Lin F. Yang, and Peilin Zhong. Nearly optimal dynamic $k$ -means clustering for high-dimensional data. CoRR, abs/1802.00459, 2018. doi:10.48550/arXiv.1802.00459.

[bib.bib30] [30] Lingxiao Huang, Ruiyuan Huang, Zengfeng Huang, and Xuan Wu. On coresets for clustering in small dimensional euclidean spaces. In ICML, volume 202 of Proceedings of Machine Learning Research, pages 13891–13915. PMLR, 2023. URL: https://proceedings.mlr.press/v202/huang23h.html.

[bib.bib31] [31] Lingxiao Huang, Shaofeng H.-C. Jiang, Jianing Lou, and Xuan Wu. Near-optimal coresets for robust clustering. In ICLR. OpenReview.net, 2023. URL: https://openreview.net/forum?id=Nc1ZkRW8Vde.

[bib.bib32] [32] Lingxiao Huang, Shaofeng H.-C. Jiang, and Nisheeth K. Vishnoi. Coresets for clustering with fairness constraints. In NeurIPS, pages 7587–7598, 2019. URL: https://proceedings.neurips.cc/paper/2019/hash/810dfbbebb17302018ae903e9cb7a483-Abstract.html.

[bib.bib33] [33] Lingxiao Huang, Jian Li, Pinyan Lu, and Xuan Wu. Coresets for constrained clustering: General assignment constraints and improved size bounds. In SODA, pages 4732–4782. SIAM, 2025. doi:10.1137/1.9781611978322.161.

[bib.bib34] [34] Lingxiao Huang, Jian Li, and Xuan Wu. On optimal coreset construction for euclidean $(k,z)$ -clustering. In STOC, pages 1594–1604. ACM, 2024. doi:10.1145/3618260.3649707.

[bib.bib35] [35] Lingxiao Huang and Nisheeth K. Vishnoi. Coresets for clustering in euclidean spaces: importance sampling is nearly optimal. In STOC, pages 1416–1429. ACM, 2020. doi:10.1145/3357713.3384296.

[bib.bib36] [36] Piotr Indyk. Algorithms for dynamic geometric problems over data streams. In STOC, pages 373–380. ACM, 2004. doi:10.1145/1007352.1007413.

[bib.bib37] [37] Lujun Jia, Guolong Lin, Guevara Noubir, Rajmohan Rajaraman, and Ravi Sundaram. Universal approximations for tsp, steiner tree, and set cover. In STOC, pages 386–395. ACM, 2005. doi:10.1145/1060590.1060649.

[bib.bib38] [38] Shaofeng H.-C. Jiang and Jianing Lou. Coresets for robust clustering via black-box reductions to vanilla case. CoRR, abs/2502.07669, 2025. doi:10.48550/arXiv.2502.07669.

[bib.bib39] [39] Samir Khuller, Robert Pless, and Yoram J. Sussmann. Fault tolerant k-center problems. Theor. Comput. Sci., 242(1-2):237–245, 2000. doi:10.1016/S0304-3975(98)00222-9.

[bib.bib40] [40] Sagi Lotan, Ernesto Evgeniy Sanches Shayda, and Dan Feldman. Coreset for line-sets clustering. In NeurIPS, 2022. URL: http://papers.nips.cc/paper_files/paper/2022/hash/f2ce95887c34393af4eb240d60017860-Abstract-Conference.html.

[bib.bib41] [41] Konstantin Makarychev, Yury Makarychev, and Ilya P. Razenshteyn. Performance of johnson-lindenstrauss transform for k-means and k-medians clustering. In STOC, pages 1027–1038. ACM, 2019. doi:10.1145/3313276.3316350.

[bib.bib42] [42] Yair Marom and Dan Feldman. k-means clustering of lines for big data. In NeurIPS, pages 12797–12806, 2019. URL: https://proceedings.neurips.cc/paper/2019/hash/6084e82a08cb979cf75ae28aed37ecd4-Abstract.html.

[bib.bib43] [43] Alexander Munteanu and Chris Schwiegelshohn. Coresets-methods and history: A theoreticians design pattern for approximation and streaming algorithms. Künstliche Intell., 32(1):37–53, 2018. doi:10.1007/s13218-017-0519-3.

[bib.bib44] [44] Melanie Schmidt, Chris Schwiegelshohn, and Christian Sohler. Fair coresets and streaming algorithms for fair k-means. In WAOA, volume 11926 of Lecture Notes in Computer Science, pages 232–251. Springer, 2019. doi:10.1007/978-3-030-39479-0_16.

[bib.bib45] [45] Christian Sohler and David P. Woodruff. Strong coresets for k-median and subspace approximation: Goodbye dimension. In FOCS, pages 802–813. IEEE Computer Society, 2018. doi:10.1109/FOCS.2018.00081.

[bib.bib46] [46] Murad Tukan, Xuan Wu, Samson Zhou, Vladimir Braverman, and Dan Feldman. New coresets for projective clustering and applications. In AISTATS, volume 151 of Proceedings of Machine Learning Research, pages 5391–5415. PMLR, 2022. URL: https://proceedings.mlr.press/v151/tukan22a.html.

Coresets for Robust Clustering via Black-Box Reductions to Vanilla Case

Abstract

Keywords and phrases:

Category:

Funding:

Copyright and License:

2012 ACM Subject Classification:

Related Version:

DOI:

Event:

Editors:

Series and Publisher:

1 Introduction

1.1 Our Results

Theorem 1 (Euclidean case).

Specific Coreset Size Bounds.

Streaming Implementation of Theorem 1.

Theorem 2 (Informal version of Theorem 14).

1.2 Technical Overview

1.2.1 First Reduction: (𝑶⁢(𝒌⁢𝒎⁢𝜺−𝟏)+𝑵) Size Bound

Condition I (Lemma 9): Vanilla Coresets on Dense Datasets Are Robust

Reduction I (Theorem 10): Almost-dense Decomposition

1.2.2 Second Reduction: (𝒎⁢𝜺−𝟐+𝑵)⁢𝐩𝐨𝐥𝐲⁡𝐥𝐨𝐠⁡(𝒌⁢𝒎⁢𝜺−𝟏) Size Bound

Overview of Reduction II (Theorem 13).

Condition II (Lemma 11): Size-preserving Vanilla Coresets Are Robust

Naive Approach: Bounding the Gap between 𝐜𝐨𝐬𝐭 and 𝐜𝐨𝐬𝐭(𝒎).

Key Observation: Vanilla Coresets Also Preserve the Gap.

Reduction II (Theorem 13): Making Vanilla Coresets Size-preserving

Key Observation: Separated Instance is Easy.

Key Idea: Ensuring Separation Artificially.

New Issue: Ensuring Coreset Guarantee.

Solution: Grouping Parts using Sparse Partition [37].

1.2.3 Streaming Implementations

𝝀-Bounded Partition.

The First 𝑶⁢(𝒌⁢𝒎)⁢𝐩𝐨𝐥𝐲⁡(𝜺−𝟏⁢𝒅⁢𝐥𝐨𝐠⁡𝚫) Space Bound.

The Second 𝑶~⁢(𝒌+𝒎)⋅𝐩𝐨𝐥𝐲⁡(𝜺−𝟏⁢𝒅⁢𝐥𝐨𝐠⁡𝚫) Space Bound.

1.3 Related Work

2 Preliminaries

Weighted Set.

Clustering with Outliers (Weighted).

Coresets.

Definition 3 (Coresets).

Definition 4 (Additive-error coresets).

Fact 5 (Composability of coresets).

Tri-criteria Approximation.

Definition 6 ((α,β,γ)-Approximation).

Lemma 7 (Generalized triangle inequalities).

3 Formal Statements of Reductions

Definition 8 (λ-Bounded partition).

Reduction I: Density of Datasets

Lemma 9 (Condition I: vanilla coresets on dense datasets are robust).

Theorem 10 (Reduction I).

Reduction II: Size-preserving Property

Lemma 11 (Condition II: size-preserving vanilla coresets are robust).

Definition 12 (Seperated duplication of a metric space).

Theorem 13 (Reduction II).

Streaming Implementations

Theorem 14 (Streaming coresets for (k,z,m)-Clustering).

References

1.2.1 First Reduction: $\left(O(km\varepsilon^{-1})+N\right)$ Size Bound

1.2.2 Second Reduction: $(m\varepsilon^{-2}+N)\operatorname{poly}\log(km\varepsilon^{-1})$ Size Bound

Naive Approach: Bounding the Gap between $\operatorname{cost}$ and $\operatorname{cost}^{(m)}$ .

$\lambda$ -Bounded Partition.

The First $O(km)\operatorname{poly}(\varepsilon^{-1}d\log\Delta)$ Space Bound.

The Second $\tilde{O}(k+m)\cdot\operatorname{poly}(\varepsilon^{-1}d\log\Delta)$ Space Bound.

Definition 6 ( $(\alpha,\beta,\gamma)$ -Approximation).

Definition 8 ( $\lambda$ -Bounded partition).

Theorem 14 (Streaming coresets for $(k,z,m)$ -Clustering).