Deterministic k-Median Clustering in Near-Optimal Time
Abstract
The metric k-median problem is a textbook clustering problem. As input, we are given a metric space V of size n and an integer k, and our task is to find a subset S ⊆ V of at most k "centers" that minimizes the total distance from each point in V to its nearest center in S.
Mettu and Plaxton [UAI'02] gave a randomized algorithm for k-median that computes an O(1)-approximation in Õ(nk) time, where Õ(·) hides polylog factors in the size and the aspect ratio (see Section 2) of the metric space. They also showed that any algorithm for this problem with a bounded approximation ratio must have a running time of Ω(nk). Thus, the running time of their algorithm is optimal up to polylogarithmic factors.
For deterministic k-median, Guha et al. [FOCS'00] gave an algorithm that computes a poly(log n)-approximation in Õ(nk) time, where the degree of the polynomial in the approximation is unspecified. To the best of our knowledge, this remains the state-of-the-art approximation of any deterministic k-median algorithm with this running time.
This leads us to the following natural question: What is the best approximation of a deterministic k-median algorithm with near-optimal running time? We make progress in answering this question by giving a deterministic algorithm that computes an O(log(n/k))-approximation in Õ(nk) time. We also provide a lower bound showing that any deterministic algorithm with this running time must have an approximation ratio of Ω(log n / (log k + log log n)), establishing a gap between the randomized and deterministic settings for k-median.
Keywords and phrases: k-clustering, k-median, deterministic algorithms, approximation algorithms
Category: Track A: Algorithms, Complexity and Games
Funding: Martín Costa: Supported by a Google PhD Fellowship.
2012 ACM Subject Classification: Theory of computation → Facility location and clustering
Editors: Keren Censor-Hillel, Fabrizio Grandoni, Joël Ouaknine, and Gabriele Puppis
1 Introduction
Clustering data is one of the fundamental tasks in unsupervised learning. As input, we are given a dataset, and our task is to partition the elements of the dataset into groups called clusters so that similar elements are placed in the same cluster and dissimilar elements are placed in different clusters. One of the basic formulations of clustering is metric k-clustering, where we are given a (weighted) metric space (V, w, d) of size n, and our goal is to find a subset S ⊆ V of at most k centers that minimizes an objective function. We focus on the k-median problem, where the objective is defined as cost(S) := Σ_{x∈V} w(x)·d(x, S), where d(x, S) := min_{y∈S} d(x, y). Equivalently, we want to minimize the total weighted distance from points in V to their nearest center in S.
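To make the objective concrete, here is a minimal sketch (ours, not code from the paper) of the weighted k-median cost in Python:

```python
# Weighted k-median objective: cost(S) = sum_{x in V} w(x) * d(x, S),
# where d(x, S) = min_{y in S} d(x, y).
def cost(V, w, d, S):
    return sum(w[x] * min(d(x, y) for y in S) for x in V)

# Tiny example on the line with unit weights: points {0, 1, 10}, centers {0, 10}.
V = [0, 1, 10]
print(cost(V, {x: 1.0 for x in V}, lambda a, b: abs(a - b), [0, 10]))  # -> 1.0
```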
The State-of-the-Art for k-Median.
Metric k-median is a fundamental clustering problem with many real-world applications and has been studied extensively across many computational models [11, 26, 1, 9, 12, 2]. The problem is NP-hard, and there is a long line of work designing efficient approximation algorithms for k-median using a variety of different techniques, such as local search [3] and Lagrangian relaxation [26]. Mettu and Plaxton gave a randomized algorithm for k-median that computes an O(1)-approximation in Õ(nk) time [28], where the approximation guarantee holds with high probability. They also showed that any algorithm for this problem with a non-trivial approximation ratio must have a running time of Ω(nk). It follows that their algorithm is near-optimal, i.e. optimal up to polylogarithmic factors in the running time and up to the constant in the approximation ratio.
Deterministic Algorithms for k-Median.
Designing deterministic algorithms for fundamental problems is an important research direction within algorithms [15, 22, 4, 30, 24]. Even though the randomized complexity of k-median is well understood, we do not have the same understanding of the problem in the deterministic setting. For deterministic k-median, Mettu and Plaxton gave an algorithm that computes an O(1)-approximation in Õ(n²) time [27], and Jain and Vazirani gave an algorithm with an improved approximation of 6 and a running time of Õ(n²) [26]. Whenever k = Ω̃(n), it follows from the Ω(nk) lower bound of [28] that these algorithms are near-optimal in both the approximation ratio and running time. On the other hand, for k = o(n), these algorithms are significantly slower than the randomized O(1)-approximation algorithm of [28]. Guha et al. gave an algorithm that computes a poly(log n)-approximation in a near-optimal running time of Õ(nk) [20], where the degree of the polynomial in the approximation is unspecified. However, it is not clear how much this approximation ratio can be improved, and in particular, whether or not we can match the bounds for randomized algorithms. This leads us to the following question.
Question 1.
What is the best approximation of any deterministic algorithm for k-median that runs in Õ(nk) time?
1.1 Our Results
We make progress in answering Question 1 by giving a deterministic algorithm with near-optimal running time and an improved approximation of O(log(n/k)), proving the following theorem.
Theorem 1.
There is a deterministic algorithm for k-median that, given a metric space of size n, computes an O(log(n/k))-approximate solution in Õ(nk) time.
We obtain our algorithm by adapting the "hierarchical partitioning" approach of Guha et al. [20]. We show that a modified version of this hierarchy can be implemented efficiently by using "restricted k-clustering" algorithms – a notion that was recently introduced by Bhattacharya et al. to design fast dynamic clustering algorithms [7]. We design a deterministic algorithm for restricted k-median based on the reverse greedy algorithm of Chrobak et al. [13] and combine it with the hierarchical partitioning framework to construct our algorithm.
In addition to our algorithm, we also provide a lower bound on the approximation ratio of any deterministic algorithm with a running time of Õ(nk), proving the following theorem.
Theorem 2.
Any deterministic algorithm for k-median that runs in Õ(nk) time when given a metric space of size n has an approximation ratio of Ω(log n / (log k + log log n)).
This lower bound establishes a separation between the randomized and deterministic settings for k-median – ruling out the possibility of a deterministic O(1)-approximation algorithm that runs in near-optimal time whenever k = n^{o(1)}. For example, when k = O(1), Theorem 2 shows that any deterministic algorithm with a near-optimal running time must have an approximation ratio of Ω(log n / log log n). On the other hand, Theorem 1 gives such an algorithm with an approximation ratio of O(log n), which matches the lower bound up to a lower order term.
We prove Theorem 2 by adapting a lower bound on the query complexity of dynamic k-center given by Bateni et al. [5], where the query complexity of an algorithm is the number of queries that it makes to the distance function d. Our lower bound holds for any deterministic algorithm with a query complexity of Õ(nk). Since the query complexity of an algorithm is a lower bound on its running time, this gives us Theorem 2. In general, establishing a gap between the deterministic and randomized query complexity of a problem is an interesting research direction [29]. Our lower bound implies such a gap for k-median whenever k is sufficiently small.
For the special case of 1-median, Chang showed that, for any constant ε > 0, any deterministic algorithm with a running time of O(n^{1+ε}) has an approximation ratio of Ω(1/ε) [10]. The lower bound for 1-median by [10] uses very similar techniques to the lower bounds of Bateni et al. [5], which we adapt to obtain our result. In the full version of the paper [16], we provide a generalization of this lower bound in Theorem 12, giving a similar tradeoff between running time and approximation.
1.2 Related Work
Two other well-studied metric k-clustering problems are k-means and k-center. For k-means, the current state-of-the-art is the same as for k-median. Using randomization, it is known how to obtain an O(1)-approximation in Õ(nk) time [28]. The deterministic algorithm of [20] can be generalized to work for k-means as well as k-median (see Appendix A in the full version of the paper [16], where we describe this generalization). In particular, we can also obtain a poly(log n)-approximation for k-means in near-optimal time. On the other hand, for k-center, the situation is quite different. The classic greedy algorithm given by Gonzalez [19] is deterministic and returns a 2-approximation in O(nk) time. It is known that any non-trivial approximation algorithm must run in Ω(nk) time [5], and it is NP-hard to obtain a (2 − ε)-approximation for any constant ε > 0 [25]. Thus, this algorithm has an exactly optimal approximation ratio and an optimal running time.
Another related line of work considers the specific case of Euclidean spaces. It was recently shown how to obtain an O(1)-approximation in almost-linear time in such spaces [18]. The authors obtain their result by adapting the Õ(n²)-time deterministic algorithm of [27] using locality-sensitive hashing.
The k-median problem has also recently received much attention in the dynamic setting, where points in the metric space are inserted and deleted over time and the objective is to maintain a good solution. A long line of work [14, 23, 8, 17, 7, 6] recently led to a fully dynamic k-median algorithm with an O(1)-approximation and Õ(k) update time against adaptive adversaries, giving near-optimal update time and approximation. (Note that, since we cannot obtain a running time of o(nk) in the static setting, we cannot obtain an update time of o(k) in the dynamic setting.)
1.3 Organization
In Section 2, we give the preliminaries and describe the notation used throughout the paper. In Section 3, we give a technical overview of our results. We present our algorithm in Sections 4 and 5. We defer the proof of our lower bound to the full version of the paper [16]. In Section 6, we describe our results for the k-means problem.
2 Preliminaries
Let (V, w, d) be a weighted metric space of size |V| = n, where w : V → ℝ_{>0} is a weight function and d : V × V → ℝ_{≥0} is a metric satisfying the triangle inequality. The aspect ratio Δ of the metric space is the ratio of the maximum and minimum non-zero distances in the metric space. We use the notation Õ(·) to hide polylogarithmic factors in the size n and the aspect ratio Δ of the metric space. Given subsets U, S ⊆ V, we define the cost of the solution S with respect to U as

  cost(U, S) := Σ_{x∈U} w(x)·d(x, S),

where d(x, S) := min_{y∈S} d(x, y). (Note that we do not necessarily require that S is a subset of U.) When we are considering the cost of S w.r.t. the entire space V, we abbreviate cost(V, S) by cost(S). In the k-median problem on the metric space (V, w, d), our objective is to find a subset S ⊆ V of size at most k which minimizes cost(S). Given an integer k and subsets U, T ⊆ V, we define the optimal cost of a solution of size at most k within T with respect to U as

  OPT_k(U, T) := min_{S ⊆ T, |S| ≤ k} cost(U, S).

When U and T are the same, we abbreviate OPT_k(U, U) by OPT_k(U). Thus, the optimal solution to the k-median problem on the metric space (V, w, d) has cost OPT_k(V). For any U ⊆ V, we denote the metric subspace obtained by considering the metric and weights restricted to only the points in U by (U, w, d).
The Projection Lemma.
Given sets S, T ⊆ V, we let P_T(S) denote the projection of S onto the set T, which is defined as the subset of points y ∈ T such that some point x ∈ S has y as its closest point in T (breaking ties arbitrarily). In other words, we define P_T(S) := {argmin_{y∈T} d(x, y) | x ∈ S}. We use the following well-known projection lemma throughout the paper, which allows us to upper bound the cost of the projection P_T(S) in terms of the costs of S and T [21, 13].
Lemma 3.
For any subsets U, S, T ⊆ V, we have that cost(U, P_T(S)) ≤ 2·cost(U, S) + cost(U, T).
The following well-known corollary of the projection lemma shows that, for any set U ⊆ V, the optimal cost of the k-median problem in (U, w, d) changes by at most a factor of 2 if we are allowed to place centers anywhere in V.
Corollary 4.
For any subset U ⊆ V, we have that OPT_k(U) ≤ 2·OPT_k(U, V).
Proof.
Let S be a subset of V of size at most k that minimizes cost(U, S) and let S′ := P_U(S). Then |S′| ≤ k, and it follows from Lemma 3 that cost(U, S′) ≤ 2·cost(U, S) + cost(U, U) = 2·OPT_k(U, V), which implies the corollary.
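As a quick sanity check on Lemma 3, the following sketch (ours; unit weights, points on the real line) computes the projection P_T(S) and verifies the inequality on random instances:

```python
import random

def project(S, T, d):
    # P_T(S): for each point of S, keep its closest point of T (ties arbitrary)
    return {min(T, key=lambda t: d(s, t)) for s in S}

d = lambda a, b: abs(a - b)
cost = lambda U, S: sum(min(d(x, y) for y in S) for x in U)

for _ in range(1000):  # check: cost(U, P_T(S)) <= 2 * cost(U, S) + cost(U, T)
    U = [random.uniform(0, 100) for _ in range(30)]
    S = random.sample(U, 4)
    T = random.sample(U, 4)
    assert cost(U, project(S, T, d)) <= 2 * cost(U, S) + cost(U, T) + 1e-9
```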
3 Technical Overview
We begin by describing the hierarchical partitioning approach used by Guha et al. [20] to obtain a poly(log n)-approximation algorithm with near-optimal running time. We then discuss the limitations of this approach and describe how we overcome these limitations to obtain our result.
3.1 The Hierarchical Partitioning Framework
Guha et al. [20] showed how to combine an Õ(n²)-time k-median algorithm with a simple hierarchical partitioning procedure in order to produce a faster algorithm – while incurring some loss in the approximation. Their approach is based on the following divide-and-conquer procedure:
1. Partition the metric space (V, w, d) into metric subspaces (V_1, w, d), …, (V_p, w, d).
2. Solve the k-median problem on each subspace (V_i, w, d) to obtain a solution S_i.
3. Combine the solutions S_1, …, S_p to get a solution S for the original space (V, w, d).
The main challenge in this framework is implementing Step 3 – finding a good way to merge the solutions from the subspaces into a solution for the original space. To implement this step, they prove the following lemma, which, at a high level, shows how to use the solutions S_1, …, S_p to construct a sparsifier for the metric space (V, w, d) whose size is much smaller than the size of the space.
Lemma 5 ([20]).
Suppose that each solution S_i is a λ-approximate solution to the k-median problem in (V_i, w, d). Let V′ := S_1 ∪ ⋯ ∪ S_p and, for each y ∈ V′, let w′(y) denote the total weight of points in V that are assigned to y in the solutions S_1, …, S_p. Then any μ-approximate solution to the k-median problem in the space (V′, w′, d) is an O(λ·μ)-approximation in the space (V, w, d).
Using Lemma 5, we can compute a (weighted) subspace (V′, w′, d) that has size only p·k, which can be much smaller than the size n of the original space. Crucially, we have the guarantee that any good solution that we find within this subspace is also a good solution in the space (V, w, d). Thus, we can then use the deterministic O(1)-approximate Õ(n²)-time k-median algorithm of [27] to compute a solution S for (V′, w′, d) in Õ((pk)²) time.
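The following sketch (ours, with hypothetical helper names) shows how the weighted sparsifier of Lemma 5 would be assembled from the per-subspace solutions:

```python
def build_sparsifier(parts, solutions, w, d):
    """Given a partition of V (parts) and a solution S_i for each part,
    return V' = union of the S_i and weights w'(y) = total weight of the
    points that are assigned to y in their part's solution."""
    w_prime = {y: 0.0 for S_i in solutions for y in S_i}
    for part, S_i in zip(parts, solutions):
        for x in part:
            y = min(S_i, key=lambda c: d(x, c))  # x's center within its part
            w_prime[y] += w[x]
    return list(w_prime), w_prime
```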
A 2-Level Hierarchy.
Suppose we run this divide-and-conquer framework for one step (i.e. without recursing on the subspaces (V_i, w, d)) and just compute the solutions S_1, …, S_p for the subspaces using the Õ(n²) time algorithm of [27]. It follows immediately from Lemma 5 that the approximation ratio of the final solution S is O(1). We can also observe that, up to polylogarithmic factors, the total time taken to compute the solutions S_1, …, S_p is n²/p, since the size of each subspace is n/p. Furthermore, the time taken to compute S is (pk)². By taking p := (n/k)^{2/3}, we get that the total running time of the algorithm is Õ(n^{4/3}·k^{2/3}), giving a polynomial improvement in the running time for k = o(n).
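For intuition, the balancing behind this choice of p works out as follows (a back-of-the-envelope check, suppressing polylogarithmic factors):

\[
\frac{n^2}{p} = (pk)^2 \iff p^3 = \frac{n^2}{k^2} \iff p = \left(\frac{n}{k}\right)^{2/3},
\qquad\text{giving}\qquad
\tilde{O}\!\left(\frac{n^2}{p} + (pk)^2\right) = \tilde{O}\!\left(n^{4/3} k^{2/3}\right).
\]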
An ℓ-Level Hierarchy.
Now, suppose we run this framework for ℓ steps. To balance the running time required to compute the solutions at each level of this divide-and-conquer procedure, we want to subdivide each metric subspace at depth i in the recursion tree into (roughly) p_i further subspaces, for appropriately chosen values p_i. Guha et al. show that the running time of this algorithm is Õ(ℓ·nk·(n/k)^{1/2^ℓ}). By Lemma 5, we can also see that the approximation ratio of the final solution is 2^{O(ℓ)}. Setting the p_i appropriately, we get the following theorem.
Theorem 6.
There is a deterministic algorithm for k-median that, given a metric space of size n, computes a 2^{O(ℓ)}-approximation in Õ(ℓ·nk·(n/k)^{1/2^ℓ}) time, for any ℓ ≥ 1.
Setting ℓ := Θ(log log n), we get immediately the following corollary.
Corollary 7.
There is a deterministic algorithm for k-median that, given a metric space of size n, computes a poly(log n)-approximation in Õ(nk) time.
3.2 The Barrier to Improving the Approximation
While the sparsification technique that is described in Lemma 5 allows us to obtain much faster algorithms by sparsifying our input in a hierarchical manner, this approach has one major drawback: sparsifying the input in this manner leads to an approximation ratio that scales exponentially with the number of levels in the hierarchy. Unfortunately, this exponential growth seems unavoidable with this approach. This leads us to the following question.
Question 2.
Is there a different way to combine the solutions S_1, …, S_p that does not lead to an exponential deterioration in the approximation?
3.3 Idea I: Sparsification via Restricted k-Median
Very recently, Bhattacharya et al. introduced the notion of "restricted k-clustering" in order to design efficient and consistent dynamic clustering algorithms [7]. The restricted k-median problem on the space (V, w, d) is the same as the k-median problem, except that we are also given a subset T ⊆ V and have the additional restriction that our solution S must be a subset of T. Crucially, even though the algorithm can only place the centers in the solution within T, it receives the entire space as input and computes the cost of the solution w.r.t. the entire space.
The restricted k-median problem allows us to take a different approach toward implementing Step 3 of the divide-and-conquer framework. Instead of compressing the entire space into only p·k many weighted points, we restrict the output of the solution to these points but still consider the rest of the space while computing the cost of the solution S. This can be seen as a less aggressive way of sparsifying the metric space, where we lose less information. It turns out that this approach allows us to produce solutions of higher quality, where the approximation scales linearly in the number of levels in the hierarchy instead of exponentially.
Efficiently implementing this new hierarchy is challenging since we need to design a fast deterministic algorithm for restricted k-median with the appropriate approximation guarantees. To illustrate our approach while avoiding this challenge, we first describe a simple version of our algorithm with improved approximation and near-optimal query complexity. We later show how to design such a restricted algorithm, allowing us to implement this algorithm efficiently.
3.4 Our Algorithm With Near-Optimal Query Complexity
Let (V, w, d) be a metric space of size n and let k ≤ n be an integer. We define the value ℓ := ⌈log₂(n/k)⌉, which we use to describe our algorithm. Our algorithm works in the following 2 phases.
Phase I.
In the first phase of our algorithm, we construct a sequence of partitions P_1, …, P_ℓ of the metric space (V, w, d), such that the partition P_i is a refinement of the partition P_{i+1} (i.e. each element of P_{i+1} is the union of some elements of P_i). We let P_ℓ := {V}. Subsequently, for each i = ℓ, …, 2, we construct the partition P_{i−1} by arbitrarily partitioning each V_j ∈ P_i into subsets V_{j,1} and V_{j,2} of equal size and adding these subsets to P_{i−1}. (Note that it might not be possible for the subsets V_{j,1} and V_{j,2} to have exactly equal size. For simplicity, we ignore this detail in the technical overview.) For each V_j ∈ P_1, we define T_j := V_j.
Phase II.
The second phase of our algorithm proceeds in ℓ iterations, where we use the partitions P_1, …, P_ℓ to compute the solution in a bottom-up manner. For each V_j ∈ P_i with i ≥ 2, let T_j denote the set of points S_{j,1} ∪ S_{j,2}, where V_{j,1} and V_{j,2} are the two subsets that V_j was split into. For each i = 1, …, ℓ, our algorithm constructs the solutions for the sets in P_i as follows:
For each V_j ∈ P_i, let S_j be the optimal solution to the k-median problem in the metric space (V_j, w, d) such that S_j ⊆ T_j. Let S denote the solution S_j corresponding to the single set V ∈ P_ℓ.
Output.
The set S is the final output of our algorithm.
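To make the two phases above concrete, the following toy sketch (ours, not the authors' code; unit weights, exponential running time exactly as in Theorem 8 below) builds the binary partition hierarchy and solves each restricted instance by brute force:

```python
import itertools

def brute_force(U, cands, k, d):
    """Optimal restricted solution: min-cost S with S a subset of cands and
    |S| = min(k, |cands|), where cost is measured over all of U."""
    cost = lambda S: sum(min(d(x, y) for y in S) for x in U)
    return min(itertools.combinations(cands, min(k, len(cands))), key=cost)

def hierarchy_solve(V, k, d):
    levels = [[list(V)]]                       # P_ell = {V}; split down to ~2k
    while max(len(p) for p in levels[-1]) > 2 * k:
        levels.append([h for p in levels[-1]
                       for h in (p[:len(p) // 2], p[len(p) // 2:])])
    # level 1: solve each small subspace exactly (here T_j = V_j)
    sol = {tuple(p): brute_force(p, p, k, d) for p in levels[-1]}
    for lvl in reversed(levels[:-1]):          # combine bottom-up: T_j = S_j1 + S_j2
        sol = {tuple(p): brute_force(p, sol[tuple(p[:len(p) // 2])]
                                        + sol[tuple(p[len(p) // 2:])], k, d)
               for p in lvl}
    return sol[tuple(V)]
```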
The following theorem summarizes the behaviour of this algorithm.
Theorem 8.
There is a deterministic algorithm for k-median that, given a metric space of size n, computes an O(log(n/k))-approximation with Õ(nk) queries (but exponential running time).
We now sketch the proof of Theorem 8 by outlining the analysis of the approximation ratio and query complexity. The formal proof follows from the more general analysis in Section 5.2.
Approximation Ratio
To bound the cost of the solution returned by our algorithm, we first need to be able to relate the costs of the solutions in the hierarchy to the costs of the solutions in the subsequent levels. Given any set V_j within a partition P_i, the following key claim establishes the relationship between the cost of the solution S_j on the metric subspace (V_j, w, d) and the costs of the solutions S_{j,1} and S_{j,2} on the metric subspaces (V_{j,1}, w, d) and (V_{j,2}, w, d).
Claim 9.
For any set V_j ∈ P_i with 2 ≤ i ≤ ℓ, we have that

  cost(V_j, S_j) ≤ 2·OPT_k(V_j) + cost(V_{j,1}, S_{j,1}) + cost(V_{j,2}, S_{j,2}).
Proof.
Let T_j denote the set S_{j,1} ∪ S_{j,2}. Let S* be an optimal solution to the k-median problem in the metric space (V_j, w, d) and let P := P_{T_j}(S*) be the projection of S* onto T_j.
Since S_j is the optimal solution to the k-median problem in (V_j, w, d) such that S_j ⊆ T_j, it follows that cost(V_j, S_j) ≤ cost(V_j, P). Applying the projection lemma (Lemma 3) to the projection of S* onto T_j, we get that cost(V_j, P) ≤ 2·cost(V_j, S*) + cost(V_j, T_j). Combining these two inequalities, it follows that

  cost(V_j, S_j) ≤ 2·OPT_k(V_j) + cost(V_j, T_j).   (1)

The claim then follows from the fact that cost(V_j, T_j) ≤ cost(V_{j,1}, S_{j,1}) + cost(V_{j,2}, S_{j,2}).
By repeatedly applying Claim 9 to the sum cost(V, S), we obtain the following upper bound on cost(V, S):

  cost(V, S) ≤ 2·Σ_{i=1}^{ℓ} Σ_{V_j∈P_i} OPT_k(V_j).   (2)
We prove a generalization of Equation 2 in Claim 19. Now, let S* be an optimal solution to k-median in (V, w, d) and consider any partition P_i. Since each V_j ∈ P_i is a subspace of V, we have that OPT_k(V_j) ≤ 2·OPT_k(V_j, V) ≤ 2·cost(V_j, S*), where the first inequality follows from Corollary 4. Thus, summing over each V_j ∈ P_i, we get that

  Σ_{V_j∈P_i} OPT_k(V_j) ≤ 2·cost(V, S*) = 2·OPT_k(V).   (3)
Combining Equations 2 and 3, it follows that cost(V, S) ≤ 4ℓ·OPT_k(V). Thus, the solution S returned by our algorithm is an O(log(n/k))-approximation.
Query Complexity
Since the partitions constructed in Phase I of the algorithm are arbitrary, we do not make any queries to the metric in Phase I. Thus, we now focus on bounding the query complexity of Phase II.
The total number of queries made during the i-th iteration of Phase II is the sum of the number of queries required to compute the solutions {S_j}_{V_j∈P_i}. In the first iteration (when i = 1), we compute |P_1| = 2^{ℓ−1} many solutions, each one on a subspace of size 2k. (Again, we assume for simplicity that each set in the partition has the same size.) We can trivially compute an optimal solution in (V_j, w, d) by querying every distance between all pairs of points in V_j and then checking every possible solution. Thus, we can upper bound the number of queries by
2^{ℓ−1}·(2k)² = O(nk),
where we are using the fact that the number of sets in the partition P_1 is 2^{ℓ−1} = O(n/k). For each subsequent iteration (when i ≥ 2), we compute |P_i| = 2^{ℓ−i} many solutions, each one on a subspace of size at most 2^i·k, where the solution S_j is restricted to the set T_j, which has size at most 2k, where T_j = S_{j,1} ∪ S_{j,2} and S_{j,1} and S_{j,2} are computed in the previous iteration. Since we only need to consider solutions that are contained in some subset T_j of at most 2k points, we can find an optimal such restricted solution with O(|V_j|·k) queries. It follows that the total number of queries that we make during this iteration is at most Σ_{V_j∈P_i} O(|V_j|·k) = O(nk). Thus, the total query complexity of our algorithm is O(nk·ℓ) = Õ(nk).
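For concreteness, here is the level-by-level count under the simplifying assumption that n = 2^ℓ·k exactly (a back-of-the-envelope check):

\[
\underbrace{2^{\ell-1}\,(2k)^2}_{i=1}
\;+\;
\sum_{i=2}^{\ell} \underbrace{2^{\ell-i}}_{|P_i|} \cdot \underbrace{(2^i k)(2k)}_{|V_j|\cdot|T_j|}
\;=\;
2nk + (\ell-1)\cdot 2nk
\;=\;
2nk\ell
\;=\;
\tilde{O}(nk).
\]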
3.5 Idea II: A Deterministic Algorithm for Restricted k-Median
In order to establish the approximation guarantees of our algorithm in Section 3.4, we use the fact that, given a metric space (V, w, d) of size n and a subset T ⊆ V, we can find a solution S ⊆ T to the k-median problem such that

  cost(V, S) ≤ 2·OPT_k(V) + cost(V, T).   (4)

We use this fact in the proof of Claim 9 (see Equation 1), which is the key claim in our analysis of our algorithm. Furthermore, to establish the query complexity bound of our algorithm, we use the fact that we can find such a solution with O(n·|T|) many queries. To implement this algorithm efficiently, we need to be able to find such a solution in Õ(n·|T|) time. While designing an algorithm with these exact guarantees seems challenging (since it would require efficiently matching the bounds implied by the projection lemma), we can give an algorithm with the following relaxed guarantees, which suffice for our applications.
Lemma 10.
There is a deterministic algorithm that, given a metric space (V, w, d) of size n, a subset T ⊆ V, and a parameter k, returns a solution S ⊆ T of size at most 2k in Õ(n·|T|) time such that cost(V, S) ≤ cost(V, T) + O(log(|T|/k))·OPT_k(V).
To prove Lemma 10, we give a simple modification of the reverse greedy algorithm of Chrobak, Kenyon, and Young [13]. The reverse greedy algorithm initially creates a solution S consisting of the entire space, and proceeds to repeatedly peel off the point in S whose removal leads to the smallest increase in the cost of S until only k points remain. Chrobak et al. showed that this algorithm has an approximation ratio of O(log n) and Ω(log n / log log n). While this large approximation ratio might seem impractical for our purposes, we can make 2 simple modifications to the algorithm and analysis in order to obtain the guarantees in Lemma 10.
1. We start off by setting S := T instead of S := V. This ensures that the output is a subset of T, at the cost of an additive cost(V, T) term in the guarantee.
2. We return the set S once |S| = 2k instead of |S| = k. This allows us to obtain the guarantee that cost(V, S) ≤ cost(V, T) + O(log(|T|/k))·OPT_k(V).
We can now make the following key observation: In our algorithm from Section 3.4, whenever we compute a solution to the k-median problem, we impose the constraint that S_j ⊆ T_j for a set T_j of size O(k). Thus, if we use the algorithm from Lemma 10 to compute our solutions within the hierarchy, whenever we apply this lemma we can assume that |T| = O(k). Consequently, the approximation guarantee that we get from Lemma 10 becomes

  cost(V, S) ≤ cost(V, T) + O(1)·OPT_k(V),

matching the required guarantee from Equation 4 up to constant factors.
Dealing with Bicriteria Approximations.
One caveat of Lemma 10 is that the solutions output by the algorithm have size 2k instead of k. In other words, these are "bicriteria approximations" to the k-median problem, i.e. solutions that contain more than k points. Thus, the final output of our algorithm has size 2k. By using the extraction technique of Guha et al. [20] described in Lemma 5, it is straightforward to compute a subset of k of these points in Õ(nk) time while only incurring a constant factor increase in the approximation ratio.
Putting Everything Together.
By combining our algorithm from Section 3.4 with Lemma 10, we get our main result, which we prove in Section 5.
Theorem 11.
There is a deterministic algorithm for k-median that, given a metric space of size n, computes an O(log(n/k))-approximate solution in Õ(nk) time.
3.6 Our Lower Bound for Deterministic k-Median
We also prove the following lower bound for deterministic k-median. Due to space constraints, we defer the proof of this lower bound to the full version of the paper [16].
Theorem 12.
For every t ≥ 1, any deterministic algorithm for the k-median problem that has a running time of Õ(n·t) on a metric space of size n has an approximation ratio of Ω(log n / (log t + log log n)).
We prove this lower bound by adapting a lower bound construction by Bateni et al. [5]. In this paper, the authors provide a lower bound for dynamic k-center clustering against adaptive adversaries. Although their primary focus is k-center, their lower bounds can be extended to various k-clustering problems. The main idea is to design an adaptive adversary that controls the underlying metric space as well as the points being inserted and deleted from the metric space. Whenever the algorithm queries the distance between any two points x, y ∈ V, the adversary returns a value d(x, y) which is consistent with the distances reported for all previously queried pairs of points. Note that if we have the distances between some pairs of points (not necessarily all of them), we might be able to embed different metrics on the current space that are consistent with the queried distances. More specifically, [5] introduces two different consistent metrics, and shows that the algorithm cannot distinguish between these metrics, leading to a large approximation ratio.
We use the same technique as described above with slight modifications. The first difference is that [5] considers the problem in the fully dynamic setting, whereas our focus is on the static setting. The adversary has two advantages in the fully dynamic setting:
1. The adversary has the option to delete a (problematic) point from the space.
2. The approximation ratio of the algorithm is defined as the maximum approximation ratio throughout the entire stream of updates.
Both of these advantages are exploited in the framework of [5]: The adversary removes any “problematic” point and the approximation ratio of the algorithm is proven to be large only in special steps (referred to as “clean operations”) where the adversary has removed all of the problematic points. In the static setting, the adversary does not have these advantages, and the approximation ratio of the algorithm is only considered at the very end.
Due to the differences between the objective functions of the k-median and k-center problems, we observe that we can adapt the above framework for deterministic static k-median. One of the technical points here is to show that, if there is a problematic point x that is contained in the output S of the algorithm, we can construct the metric so that the cost of the cluster of x in the solution S becomes large. The final metric space is similar to the "uniform" metric introduced in [5], with a small modification. Since the algorithm is deterministic, the output is the same set if we run the algorithm on this final metric again. Hence, we get our lower bounds for any deterministic algorithm for the static k-median problem. Please see the full version of the paper [16] for a complete proof of our lower bound.
4 A Deterministic Algorithm for Restricted k-Median
In this section, we prove the following theorem.
Theorem 13.
There is a deterministic algorithm that, given a metric space (V, w, d) of size n, a subset T ⊆ V, and a parameter k, returns a solution S ⊆ T of size at most 2k in Õ(n·|T|) time such that cost(V, S) ≤ cost(V, T) + O(log(|T|/k))·OPT_k(V).
This algorithm is based on a variant of the reverse greedy algorithm for the k-median problem [13], which we modify for the restricted k-median problem. We refer to our modified version of this algorithm as Res-Greedy.
4.1 The Restricted Reverse Greedy Algorithm
Let (V, w, d) be a metric space of size n, let T ⊆ V, and let k be an integer. The restricted reverse greedy algorithm Res-Greedy(V, T, k) begins by initializing a set S := T and does the following:
While |S| > 2k, identify the center whose removal from S causes the smallest increase in the cost of the solution (i.e. the point y* := argmin_{y∈S} cost(V, S ∖ {y})) and remove y* from S. Once |S| = 2k, return the set S.
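A direct, unoptimized sketch of Res-Greedy (ours; unit weights by default, with the efficient implementation described in Section 4.4):

```python
def res_greedy(V, T, k, d, w=None):
    """Res-Greedy(V, T, k): start from S = T and repeatedly remove the center
    whose removal increases cost(V, S) the least, until |S| = 2k."""
    w = w or {x: 1.0 for x in V}
    # note: cost is measured over the *entire* space V, not just over T
    cost = lambda S: sum(w[x] * min(d(x, y) for y in S) for x in V)
    S = set(T)
    while len(S) > 2 * k:
        S.remove(min(S, key=lambda y: cost(S - {y})))
    return S
```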
4.2 Analysis
Let m denote the size of T and let S_m := T. Now, suppose that we run Res-Greedy(V, T, k) and consider the state of the set S throughout the run of the algorithm. Since points are only removed from S, this gives us a sequence of nested subsets S_{2k} ⊂ S_{2k+1} ⊂ ⋯ ⊂ S_m, where |S_j| = j for each 2k ≤ j ≤ m. Note that S_{2k} is the final output of our algorithm. The following lemma is the main technical lemma in the analysis of this algorithm.
Lemma 14.
For each 2k < j ≤ m, we have that

  cost(V, S_{j−1}) − cost(V, S_j) ≤ (2/(j − k))·OPT_k(V).
Using Lemma 14, we can now prove the desired approximation guarantee of our algorithm. By using a telescoping sum, we can express the cost of the solution S_{2k} as

  cost(V, S_{2k}) = cost(V, S_m) + Σ_{j=2k+1}^{m} (cost(V, S_{j−1}) − cost(V, S_j)).   (5)

Applying Lemma 14, we can upper bound the sum on the RHS of Equation 5 by

  Σ_{j=2k+1}^{m} (2/(j − k))·OPT_k(V) = 2·OPT_k(V)·Σ_{t=k+1}^{m−k} 1/t ≤ 2·ln(m/k)·OPT_k(V).   (6)

Combining Equations 5 and 6, we get that

  cost(V, S_{2k}) ≤ cost(V, T) + O(log(|T|/k))·OPT_k(V),

giving us the desired bound in Theorem 13, since S_m = T and m = |T|.
4.3 Proof of Lemma 14
The following analysis is essentially the same as the analysis given in [13], with some minor changes to accommodate the fact that we are using the algorithm for the restricted k-median problem and to allow us to obtain an improved approximation guarantee, at the cost of outputting bicriteria approximations.
We begin with the following claim, which we use later on in the analysis.
Claim 15.
For all subsets R ⊊ S ⊆ V, we have that

  Σ_{y∈R} (cost(V, S ∖ {y}) − cost(V, S)) ≤ cost(V, S ∖ R) − cost(V, S).
Proof.
For each y ∈ S, let C_y denote the cluster of the points in V that are assigned to y within the solution S. In other words, C_y is the subset of points in V whose closest point in S is y (breaking ties arbitrarily). We can now observe that

  Σ_{y∈R} (cost(V, S∖{y}) − cost(V, S))
    = Σ_{y∈R} Σ_{x∈V} w(x)·(d(x, S∖{y}) − d(x, S))
    = Σ_{y∈R} Σ_{x∈C_y} w(x)·(d(x, S∖{y}) − d(x, S))
    ≤ Σ_{y∈R} Σ_{x∈C_y} w(x)·(d(x, S∖R) − d(x, S))
    ≤ Σ_{x∈V} w(x)·(d(x, S∖R) − d(x, S))
    = cost(V, S∖R) − cost(V, S).

Here, the first and fifth lines follow from the definition of the cost function, the second line follows since d(x, S∖{y}) − d(x, S) can only be non-zero when x ∈ C_y, the third line follows from the fact that d(x, S∖{y}) ≤ d(x, S∖R) for any y ∈ R, and the fourth line from the fact that the clusters {C_y}_{y∈S} partition the set V and each term d(x, S∖R) − d(x, S) is non-negative.
Let S* denote an optimal solution to the k-median problem in the metric space (V, w, d) and let 2k < j ≤ m. We denote by P := P_{S_j}(S*) the projection of the optimal solution S* onto the set S_j. It follows that

  cost(V, S_{j−1}) − cost(V, S_j)
    ≤ min_{y∈S_j∖P} (cost(V, S_j∖{y}) − cost(V, S_j))
    ≤ (1/|S_j∖P|)·Σ_{y∈S_j∖P} (cost(V, S_j∖{y}) − cost(V, S_j))
    ≤ (1/(j − k))·Σ_{y∈S_j∖P} (cost(V, S_j∖{y}) − cost(V, S_j))
    ≤ (1/(j − k))·(cost(V, P) − cost(V, S_j))
    ≤ (2/(j − k))·OPT_k(V).

The first line follows directly from how the algorithm chooses which point to remove from S_j. (Note that, for analytic purposes, we only take the minimum over S_j∖P instead of all of S_j in this inequality.) The second line follows from the fact that the minimum value within a set of real numbers is upper bounded by its average. The third line follows from the fact that |S_j∖P| ≥ j − k, since |P| ≤ |S*| ≤ k. The fourth line follows from Claim 15, since S_j∖(S_j∖P) = P. Finally, the fifth line follows from Lemma 3, which implies that cost(V, P) ≤ 2·cost(V, S*) + cost(V, S_j) = 2·OPT_k(V) + cost(V, S_j).
4.4 Implementation
Now we describe how to implement the algorithm Res-Greedy to run in Õ(n·|T|) time. In particular, we prove the following lemma.
Lemma 16.
Given a metric space (V, w, d) of size n and a set T ⊆ V, the algorithm Res-Greedy runs in time Õ(n·|T|).
We now show how to implement Res-Greedy to run in time Õ(n·|T|). Our implementation uses similar data structures to the randomized local search in [7]. For each x ∈ V, we initialize a list L_x which contains all of the points y ∈ T, sorted in increasing order according to the distances d(x, y). We denote the i-th point in the list L_x by L_x[i]. We maintain the invariant that, at each point in time, each of the lists {L_x}_{x∈V} contains exactly the points in S. Thus, at each point in time, we have that d(x, S) = d(x, L_x[1]). By implementing each of these lists using a balanced binary tree, we can initialize them in Õ(n·|T|) time and update them in Õ(n) total time after each removal of a point from S. Since S initially has size |T|, the total time spent updating the lists is Õ(n·|T|). We also explicitly maintain the clustering C = {C_y}_{y∈S} induced by the lists, where C_y := {x ∈ V | L_x[1] = y}. We can initialize these clusters in Õ(n) time given the collection of lists and update them each time a list in {L_x}_{x∈V} is updated while only incurring a constant factor overhead in the running time. We now show that, using these data structures, we can implement each iteration of the greedy algorithm in Õ(n) time. Since the algorithm runs for at most |T| iterations, this gives us the desired running time.
Implementing an Iteration of Greedy.
Using the lists {L_x}_{x∈V} and the clustering C, we can compute

  Δ_y := Σ_{x∈C_y} w(x)·(d(x, L_x[2]) − d(x, L_x[1]))

for each y ∈ S. Since any point x ∈ V appears in exactly one cluster in C, this takes Õ(n) time in total. By observing that removing y from S causes each point x ∈ C_y to be reassigned to the center L_x[2], we can see that Δ_y is precisely the value of cost(V, S∖{y}) − cost(V, S). Since minimizing cost(V, S∖{y}) over y is equivalent to minimizing cost(V, S∖{y}) − cost(V, S), it follows that

  argmin_{y∈S} cost(V, S∖{y}) = argmin_{y∈S} Δ_y.

Thus, we let y* := argmin_{y∈S} Δ_y, remove y* from S, and proceed to update the data structures. Excluding the time taken to update the data structures, the iteration takes Õ(n) time.
5 Our Deterministic k-Median Algorithm
In this section, we prove Theorem 1, which we restate below.
Theorem 17.
There is a deterministic algorithm for k-median that, given a metric space of size n, computes an O(log(n/k))-approximate solution in Õ(nk) time.
5.1 Our Algorithm
Let (V, w, d) be a metric space of size n and let k ≤ n be an integer. We also define the value ℓ := ⌈log₂(n/k)⌉, which we use to describe our algorithm. Our algorithm works in 3 phases, which we describe below.
Phase I.
In the first phase of our algorithm, we construct a sequence of partitions P_1, …, P_ℓ of the metric space (V, w, d), such that the partition P_i is a refinement of the partition P_{i+1} (i.e. each element of P_{i+1} is the union of some elements of P_i). We start off by setting P_ℓ := {V}. Subsequently, for each i = ℓ, …, 2, we construct the partition P_{i−1} as follows:
Initialize P_{i−1} := ∅. Then, for each V_j ∈ P_i, arbitrarily partition V_j into subsets V_{j,1} and V_{j,2} such that ||V_{j,1}| − |V_{j,2}|| ≤ 1, and add these subsets to P_{i−1}.
For each V_j ∈ P_1, we define T_j := V_j.
Phase II.
The second phase of our algorithm proceeds in ℓ iterations, where we use the partitions P_1, …, P_ℓ to compute the solution in a bottom-up manner. For each V_j ∈ P_i with i ≥ 2, let T_j denote the set of points S_{j,1} ∪ S_{j,2}, where V_{j,1}, V_{j,2} ∈ P_{i−1} are the two subsets of V_j constructed in Phase I. For each i = 1, …, ℓ, our algorithm constructs the solutions for the sets in P_i as follows:
For each V_j ∈ P_i, let S_j be the solution obtained by running Res-Greedy(V_j, T_j, k) on the subspace (V_j, w, d), restricting the output to be a subset of T_j. Finally, we define S to be the solution S_j corresponding to the single set V ∈ P_ℓ.
Phase III.
Consider the set S, which contains 2k points, and let π : V → S be the projection from V to S (i.e. π maps each point x ∈ V to its closest point in S). Define a weight function w′ on S by setting w′(y) := Σ_{x∈V : π(x)=y} w(x) for each y ∈ S (i.e. w′(y) is the total weight of all points in V that are projected onto y). Let S′ be the solution obtained by running the algorithm of Mettu-Plaxton [27] on the metric space (S, w′, d).
Output.
The solution S′ is the final output of our algorithm.
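As an illustration of Phase III, the following sketch (ours; the Mettu-Plaxton solver is abstracted as a callback rather than implemented) computes the projection weights and solves the induced instance:

```python
def phase_three(V, S, k, d, w, kmedian_solver):
    """Project V onto the 2k-point bicriteria solution S, accumulate the
    projected weights w', and solve k-median on the instance (S, w', d)."""
    w_prime = {y: 0.0 for y in S}
    for x in V:
        pi_x = min(S, key=lambda y: d(x, y))  # pi(x): closest center in S
        w_prime[pi_x] += w[x]
    # kmedian_solver can be any deterministic O(1)-approximate algorithm,
    # e.g. the O~(n^2)-time algorithm of Mettu-Plaxton [27]
    return kmedian_solver(list(S), w_prime, d, k)
```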
5.2 Analysis
We now analyze our algorithm by bounding its approximation ratio and running time.
Approximation Ratio
We begin by proving the following claim, which, for any set V_j within a partition P_i, allows us to express the cost of the solution S_j w.r.t. the metric subspace (V_j, w, d) in terms of the costs of the solutions S_{j,1} and S_{j,2}.
Claim 18.
For any set V_j ∈ P_i with 2 ≤ i ≤ ℓ, we have that

  cost(V_j, S_j) ≤ O(1)·OPT_k(V_j) + cost(V_{j,1}, S_{j,1}) + cost(V_{j,2}, S_{j,2}).
Proof.
Let T_j := S_{j,1} ∪ S_{j,2}. We obtain the solution S_j by calling Res-Greedy on the metric space (V_j, w, d) while restricting the output to be a subset of T_j. Thus, it follows from Theorem 13 that

  cost(V_j, S_j) ≤ cost(V_j, T_j) + O(log(|T_j|/k))·OPT_k(V_j).

By observing that |T_j| ≤ 4k, it follows that O(log(|T_j|/k)) = O(1). Finally, the claim follows since cost(V_j, T_j) ≤ cost(V_{j,1}, S_{j,1}) + cost(V_{j,2}, S_{j,2}).
Using Claim 18, we now prove the following claim.
Claim 19.
For any 1 ≤ i ≤ ℓ, we have that

  Σ_{V_j∈P_i} cost(V_j, S_j) ≤ c·i·OPT_k(V),

where c is a sufficiently large absolute constant.
Proof.
We prove this claim by induction. Note that the base case where i = 1 follows directly from Theorem 13 and Corollary 4, since T_j = V_j (so that cost(V_j, T_j) = 0) and |T_j| = O(k) for each V_j ∈ P_1. Now, suppose that the claim holds for i − 1, where 2 ≤ i ≤ ℓ. Then we have that

  Σ_{V_j∈P_i} cost(V_j, S_j)
    ≤ Σ_{V_j∈P_i} (O(1)·OPT_k(V_j) + cost(V_{j,1}, S_{j,1}) + cost(V_{j,2}, S_{j,2}))
    = O(1)·Σ_{V_j∈P_i} OPT_k(V_j) + Σ_{V_j∈P_{i−1}} cost(V_j, S_j)
    ≤ O(1)·Σ_{V_j∈P_i} OPT_k(V_j) + c·(i − 1)·OPT_k(V)
    ≤ c·i·OPT_k(V).

The second line follows from Claim 18, and the third line from the induction hypothesis. In the fourth line, we are using the fact that, for an optimal solution S* of (V, w, d) and any partition P of V, we have

  Σ_{U∈P} OPT_k(U) ≤ 2·Σ_{U∈P} cost(U, S*) = 2·cost(V, S*) = 2·OPT_k(V),

where the first inequality follows from Corollary 4.
We get the following immediate corollary from Claim 19 by setting i = ℓ (recall that P_ℓ = {V}).
Corollary 20.
We have that cost(V, S) ≤ c·ℓ·OPT_k(V).
Using Corollary 20, we prove the following lemma.
Lemma 21.
We have that cost(V, S) = O(log(n/k))·OPT_k(V).
Proof.
This follows immediately from Corollary 20, since ℓ = ⌈log₂(n/k)⌉.

By Lemma 21, we get that S is an O(log(n/k))-approximate bicriteria solution of size 2k. Using the extraction technique of [20] (see Lemma 5), which allows us to compute an exact solution to the k-median problem from a bicriteria approximation while only incurring a constant factor loss in the approximation ratio, it follows that the solution S′ constructed in Phase III is an O(log(n/k))-approximation and has size at most k.
Running Time
We begin by proving the following lemma, which summarizes the relevant properties of the partitions P_1, …, P_ℓ constructed in Phase I of the algorithm.
Lemma 22.
For each 1 ≤ i ≤ ℓ, the set P_i is a partition of V into 2^{ℓ−i} many subsets, each of size at most n/2^{ℓ−i} + 1 ≤ 2^i·k + 1.
Proof.
We define P_ℓ as {V}, which is a trivial partition of V into 2^0 = 1 subset of size n = n/2^{ℓ−ℓ}. Now, suppose that this statement holds for the partition P_i, where 2 ≤ i ≤ ℓ. The algorithm constructs P_{i−1} by taking each V_j ∈ P_i and further partitioning it into subsets V_{j,1} and V_{j,2}, such that the difference in the sizes of these subsets is at most 1. We can also observe that the number of subsets in the partition P_{i−1} is 2·|P_i| = 2^{ℓ−(i−1)}. (Note that we do not necessarily guarantee that all of the sets in these partitions are non-empty.) Since each subset V_j ∈ P_i has size at most n/2^{ℓ−i} + 1, it follows that each subset in P_{i−1} has size at most

  ⌈(n/2^{ℓ−i} + 1)/2⌉ ≤ n/2^{ℓ−(i−1)} + 1.
Bounding the Running Time.
We now bound the running time of our algorithm. The running time of Phase I of our algorithm is Õ(n), since it takes O(n) time to construct each partition P_{i−1} given the partition P_i. The running time of Phase III of our algorithm is Õ(nk), since constructing the mapping π takes O(nk) time and running the Mettu-Plaxton algorithm on an input of size 2k takes Õ(k²) time. Thus, we now focus on bounding the running time of Phase II.
We can first observe that the running time of the i-th iteration of Phase II is dominated by the total time taken to handle the calls to the algorithm Res-Greedy. In the first iteration (when i = 1), we make |P_1| = 2^{ℓ−1} many calls to Res-Greedy, each one on a subspace of size at most 2k + 1 (by Lemma 22). Thus, by Theorem 13, the time taken to handle these calls is at most

  Σ_{V_j∈P_1} Õ(|V_j|·|T_j|) = Σ_{V_j∈P_1} Õ(|V_j|²) ≤ Õ(2^{ℓ−1}·(2k + 1)²) ≤ Õ(nk),

where the first equality follows from the fact that T_j = V_j for each V_j ∈ P_1, the second step from Lemma 22, and the third since 2^{ℓ−1} ≤ n/k. It follows that the time taken to handle these calls to Res-Greedy is Õ(nk). For each subsequent iteration (when i ≥ 2), we make |P_i| = 2^{ℓ−i} many calls to Res-Greedy, each one on a subspace of size at most 2^i·k + 1 (by Lemma 22), where the solution is restricted to the set T_j, which has size at most 4k, where T_j = S_{j,1} ∪ S_{j,2} and S_{j,1} and S_{j,2} are computed in the previous iteration. It follows from Theorem 13 that the time taken to handle these calls is at most 2^{ℓ−i}·Õ((2^i·k)·k) = Õ(2^ℓ·k²) = Õ(nk). It follows that the total time taken to handle the calls to Res-Greedy during the i-th iteration of Phase II is Õ(nk). Hence, the total time spent handling calls to Res-Greedy is Õ(nk·ℓ) = Õ(nk). The Õ(nk) running time of our algorithm follows.
6 Our Results for Deterministic k-Means
In this section, we briefly describe our results for the k-means problem, where the clustering objective is Σ_{x∈V} w(x)·d(x, S)² instead of Σ_{x∈V} w(x)·d(x, S).
6.1 Our Deterministic Algorithm for k-Means
We first note that our deterministic algorithm with near-optimal query complexity (see Section 3.4) extends seamlessly to the k-means problem, giving us the following result.
Theorem 23.
There is a deterministic algorithm for k-means that, given a metric space of size n, computes an O(log²(n/k))-approximation with Õ(nk) queries (but exponential running time).
Proof.
Consider the normalized k-means objective ncost(U, S) := (Σ_{x∈U} w(x)·d(x, S)²)^{1/2}. Since the projection lemma (Lemma 3) holds for this objective, the algorithm and analysis extend immediately to the normalized k-means problem, yielding an O(log(n/k))-approximation. By observing that a solution is an α²-approximation to k-means if and only if it is an α-approximation to normalized k-means, it follows that the algorithm produces an O(log²(n/k))-approximation for the k-means problem.
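In code, the normalization used in this proof is just a square root; a minimal sketch (ours, for illustration):

```python
import math

def kmeans_cost(U, w, d, S):
    return sum(w[x] * min(d(x, y) for y in S) ** 2 for x in U)

def normalized_cost(U, w, d, S):
    # ncost = sqrt(k-means cost); an alpha-approximation for ncost is
    # exactly an alpha^2-approximation for the k-means objective
    return math.sqrt(kmeans_cost(U, w, d, S))
```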
Unfortunately, it is not clear how to extend our efficient algorithm from Section 5 to k-means. The barrier is that we do not know how to design a good deterministic algorithm for the restricted k-means (or normalized k-means) problem. In particular, it is not known whether or not the reverse greedy algorithm can be used to produce good solutions for the k-means problem. If we try to extend the analysis to the k-means objective, the analysis fails since we need the projection lemma to hold. On the other hand, if we try to extend the analysis to the normalized k-means objective (where the projection lemma does hold), the failure point of the analysis is Claim 15.
6.2 Our Lower Bound for Deterministic k-Means
Our lower bound for deterministic k-median (Theorem 12) extends immediately to deterministic k-means. In particular, we get the following theorem.
Theorem 24.
For every t ≥ 1, any deterministic algorithm for the k-means problem that has a running time of Õ(n·t) on a metric space of size n has an approximation ratio of Ω((log n / (log t + log log n))²).
We defer the proof of this theorem to the full version of the paper [16].
References
- [1] Sara Ahmadian, Ashkan Norouzi-Fard, Ola Svensson, and Justin Ward. Better guarantees for k-means and euclidean k-median by primal-dual algorithms. SIAM Journal on Computing, 49(4):FOCS17–97, 2019. doi:10.1137/18M1171321.
- [2] Nir Ailon, Ragesh Jaiswal, and Claire Monteleoni. Streaming k-means approximation. Advances in neural information processing systems, 22, 2009.
- [3] Vijay Arya, Naveen Garg, Rohit Khandekar, Adam Meyerson, Kamesh Munagala, and Vinayaka Pandit. Local search heuristics for k-median and facility location problems. SIAM J. Comput., 33(3):544–562, 2004. doi:10.1137/S0097539702416402.
- [4] Sepehr Assadi, Andrew Chen, and Glenn Sun. Deterministic graph coloring in the streaming model. In Stefano Leonardi and Anupam Gupta, editors, STOC ’22: 54th Annual ACM SIGACT Symposium on Theory of Computing, Rome, Italy, June 20 - 24, 2022, pages 261–274. ACM, 2022. doi:10.1145/3519935.3520016.
- [5] MohammadHossein Bateni, Hossein Esfandiari, Hendrik Fichtenberger, Monika Henzinger, Rajesh Jayaram, Vahab Mirrokni, and Andreas Wiese. Optimal fully dynamic k-center clustering for adaptive and oblivious adversaries. In Proceedings of the 2023 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2677–2727. SIAM, 2023. doi:10.1137/1.9781611977554.CH101.
- [6] Sayan Bhattacharya, Martín Costa, and Ermiya Farokhnejad. Fully dynamic k-median with near-optimal update time and recourse. In 57th Annual ACM SIGACT Symposium on Theory of Computing (STOC), 2025. (To Appear).
- [7] Sayan Bhattacharya, Martín Costa, Naveen Garg, Silvio Lattanzi, and Nikos Parotsidis. Fully dynamic k-clustering with fast update time and small recourse. In 65th IEEE Symposium on Foundations of Computer Science (FOCS), 2024.
- [8] Sayan Bhattacharya, Martín Costa, Silvio Lattanzi, and Nikos Parotsidis. Fully dynamic k-clustering in Õ(k) update time. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
- [9] Jarosław Byrka, Thomas Pensyl, Bartosz Rybicki, Aravind Srinivasan, and Khoa Trinh. An improved approximation for k-median and positive correlation in budgeted optimization. ACM Transactions on Algorithms (TALG), 13(2):1–31, 2017. doi:10.1145/2981561.
- [10] Ching-Lueh Chang. Metric 1-median selection: Query complexity vs. approximation ratio. In Computing and Combinatorics - 22nd International Conference, COCOON 2016, Ho Chi Minh City, Vietnam, August 2-4, 2016, Proceedings, volume 9797 of Lecture Notes in Computer Science, pages 131–142. Springer, 2016. doi:10.1007/978-3-319-42634-1_11.
- [11] Moses Charikar, Sudipto Guha, Éva Tardos, and David B Shmoys. A constant-factor approximation algorithm for the k-median problem. In Proceedings of the thirty-first annual ACM symposium on Theory of computing, pages 1–10, 1999.
- [12] Moses Charikar, Liadan O’Callaghan, and Rina Panigrahy. Better streaming algorithms for clustering problems. In Proceedings of the thirty-fifth annual ACM symposium on Theory of computing, pages 30–39, 2003. doi:10.1145/780542.780548.
- [13] Marek Chrobak, Claire Kenyon, and Neal E. Young. The reverse greedy algorithm for the metric k-median problem. Inf. Process. Lett., 97(2):68–72, 2006. doi:10.1016/J.IPL.2005.09.009.
- [14] Vincent Cohen-Addad, Niklas Hjuler, Nikos Parotsidis, David Saulpic, and Chris Schwiegelshohn. Fully dynamic consistent facility location. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 3250–3260, 2019. URL: https://proceedings.neurips.cc/paper/2019/hash/fface8385abbf94b4593a0ed53a0c70f-Abstract.html.
- [15] Vincent Cohen-Addad, David Saulpic, and Chris Schwiegelshohn. Deterministic clustering in high dimensional spaces: Sketches and approximation. In 64th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2023, Santa Cruz, CA, USA, November 6-9, 2023, pages 1105–1130. IEEE, 2023. doi:10.1109/FOCS57990.2023.00066.
- [16] Martín Costa and Ermiya Farokhnejad. Deterministic k-median clustering in near-optimal time, 2025. arXiv:2504.15115.
- [17] Max Dupré la Tour, Monika Henzinger, and David Saulpic. Fully dynamic k-means coreset in near-optimal update time. In 32nd Annual European Symposium on Algorithms, ESA 2024, volume 308 of LIPIcs, pages 100:1–100:16. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2024. doi:10.4230/LIPICS.ESA.2024.100.
- [18] Max Dupré la Tour and David Saulpic. Almost-linear time approximation algorithm to euclidean k-median and k-means. CoRR, abs/2407.11217, 2024. doi:10.48550/arXiv.2407.11217.
- [19] Teofilo F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theor. Comput. Sci., 38:293–306, 1985. doi:10.1016/0304-3975(85)90224-5.
- [20] Sudipto Guha, Nina Mishra, Rajeev Motwani, and Liadan O’Callaghan. Clustering data streams. In 41st Annual Symposium on Foundations of Computer Science, FOCS 2000, 12-14 November 2000, Redondo Beach, California, USA, pages 359–366. IEEE Computer Society, 2000. doi:10.1109/SFCS.2000.892124.
- [21] Anupam Gupta and Kanat Tangwongsan. Simpler analyses of local search algorithms for facility location. CoRR, abs/0809.2554, 2008. arXiv:0809.2554.
- [22] Bernhard Haeupler, Yaowei Long, and Thatchaphol Saranurak. Dynamic deterministic constant-approximate distance oracles with n^ε worst-case update time. In 65th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2024, Chicago, IL, USA, October 27-30, 2024, pages 2033–2044. IEEE, 2024. doi:10.1109/FOCS61266.2024.00121.
- [23] M. Henzinger and S. Kale. Fully-dynamic coresets. In ESA, 2020.
- [24] Monika Henzinger, Jason Li, Satish Rao, and Di Wang. Deterministic near-linear time minimum cut in weighted graphs. In Proceedings of the 2024 ACM-SIAM Symposium on Discrete Algorithms, SODA 2024, Alexandria, VA, USA, January 7-10, 2024, pages 3089–3139. SIAM, 2024. doi:10.1137/1.9781611977912.111.
- [25] Wen-Lian Hsu and George L. Nemhauser. Easy and hard bottleneck location problems. Discret. Appl. Math., 1(3):209–215, 1979. doi:10.1016/0166-218X(79)90044-1.
- [26] Kamal Jain and Vijay V Vazirani. Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and lagrangian relaxation. Journal of the ACM (JACM), 48(2):274–296, 2001. doi:10.1145/375827.375845.
- [27] Ramgopal R. Mettu and C. Greg Plaxton. The online median problem. In 41st Annual Symposium on Foundations of Computer Science (FOCS), pages 339–348, 2000. doi:10.1109/SFCS.2000.892122.
- [28] Ramgopal R. Mettu and C. Greg Plaxton. Optimal time bounds for approximate clustering. In Proceedings of the 18th Conference in Uncertainty in Artificial Intelligence (UAI), pages 344–351, 2002. URL: https://dslpitt.org/uai/displayArticleDetails.jsp?mmnu=1&smnu=2&article_id=878&proceeding_id=18.
- [29] Sagnik Mukhopadhyay and Danupon Nanongkai. Weighted min-cut: sequential, cut-query, and streaming algorithms. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020, Chicago, IL, USA, June 22-26, 2020, pages 496–509. ACM, 2020. doi:10.1145/3357713.3384334.
- [30] Ofer Neiman and Shay Solomon. Simple deterministic algorithms for fully dynamic maximal matching. ACM Trans. Algorithms, 12(1):7:1–7:15, 2016. doi:10.1145/2700206.