Deterministic k-Median Clustering in Near-Optimal Time
Abstract
The metric k-median problem is a textbook clustering problem. As input, we are given a metric space V of size n and an integer k, and our task is to find a subset S ⊆ V of at most k "centers" that minimizes the total distance from each point in V to its nearest center in S.
Mettu and Plaxton [UAI'02] gave a randomized algorithm for k-median that computes an O(1)-approximation in Õ(nk) time, where Õ(·) hides polylog factors in the size and the aspect ratio (see Section 2) of the metric space. They also showed that any algorithm for this problem with a bounded approximation ratio must have a running time of Ω(nk). Thus, the running time of their algorithm is optimal up to polylogarithmic factors.
For deterministic k-median, Guha et al. [FOCS'00] gave an algorithm that computes a poly(log n)-approximation in Õ(nk) time, where the degree of the polynomial in the approximation is unspecified. To the best of our knowledge, this remains the state-of-the-art approximation of any deterministic k-median algorithm with this running time.
This leads us to the following natural question: What is the best approximation of a deterministic k-median algorithm with near-optimal running time? We make progress in answering this question by giving a deterministic algorithm that computes an O(log(n/k))-approximation in Õ(nk) time. We also provide a lower bound showing that any deterministic algorithm with this running time must have an approximation ratio of Ω(log n / (log k + log log n)), establishing a gap between the randomized and deterministic settings for k-median.
Keywords and phrases: k-clustering, k-median, deterministic algorithms, approximation algorithms
Category: Track A: Algorithms, Complexity and Games
Funding: Martín Costa: Supported by a Google PhD Fellowship.
2012 ACM Subject Classification: Theory of computation → Facility location and clustering
Editors: Keren Censor-Hillel, Fabrizio Grandoni, Joël Ouaknine, and Gabriele Puppis
1 Introduction
Clustering data is one of the fundamental tasks in unsupervised learning. As input, we are given a dataset, and our task is to partition the elements of the dataset into groups called clusters so that similar elements are placed in the same cluster and dissimilar elements are placed in different clusters. One of the basic formulations of clustering is metric k-clustering, where we are given a (weighted) metric space (V, w, d) of size n, and our goal is to find a subset S ⊆ V of at most k centers that minimizes an objective function. We focus on the k-median problem, where the objective is defined as cost(S) := Σ_{x∈V} w(x)·d(x, S), where d(x, S) := min_{y∈S} d(x, y). Equivalently, we want to minimize the total weighted distance from points in V to their nearest center in S.
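To make the objective concrete, here is a minimal sketch (ours, not code from the paper) of the weighted k-median cost in Python:

```python
# Weighted k-median objective: cost(S) = sum_{x in V} w(x) * d(x, S),
# where d(x, S) = min_{y in S} d(x, y).
def cost(V, w, d, S):
    return sum(w[x] * min(d(x, y) for y in S) for x in V)

# Tiny example on the line with unit weights: points {0, 1, 10}, centers {0, 10}.
V = [0, 1, 10]
print(cost(V, {x: 1.0 for x in V}, lambda a, b: abs(a - b), [0, 10]))  # -> 1.0
```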
The State-of-the-Art for k-Median.
Metric k-median is a fundamental clustering problem with many real-world applications and has been studied extensively across many computational models [11, 26, 1, 9, 12, 2]. The problem is NP-hard, and there is a long line of work designing efficient approximation algorithms for k-median using a variety of different techniques, such as local search [3] and Lagrangian relaxation [26]. Mettu and Plaxton gave a randomized algorithm for k-median that computes an O(1)-approximation in Õ(nk) time [28], where the approximation guarantee holds with high probability. They also showed that any algorithm for this problem with a non-trivial approximation ratio must have a running time of Ω(nk). It follows that their algorithm is near-optimal, i.e. optimal up to polylogarithmic factors in the running time and up to the constant in the approximation ratio.
Deterministic Algorithms for k-Median.
Designing deterministic algorithms for fundamental problems is an important research direction within algorithms [15, 22, 4, 30, 24]. Even though the randomized complexity of k-median is well understood, we do not have the same understanding of the problem in the deterministic setting. For deterministic k-median, Mettu and Plaxton gave an algorithm that computes an O(1)-approximation in Õ(n²) time [27], and Jain and Vazirani gave an algorithm with an improved approximation of 6 and a running time of Õ(n²) [26]. Whenever k = Ω̃(n), it follows from the Ω(nk) lower bound of [28] that these algorithms are near-optimal in both the approximation ratio and running time. On the other hand, for k = o(n), these algorithms are significantly slower than the randomized O(1)-approximation algorithm of [28]. Guha et al. gave an algorithm that computes a poly(log n)-approximation in a near-optimal running time of Õ(nk) [20], where the degree of the polynomial in the approximation is unspecified. However, it is not clear how much this approximation ratio can be improved, and in particular, whether or not we can match the bounds for randomized algorithms. This leads us to the following question.
Question 1.
What is the best approximation of any deterministic algorithm for k-median that runs in Õ(nk) time?
1.1 Our Results
We make progress in answering Question 1 by giving a deterministic algorithm with near-optimal running time and an improved approximation of O(log(n/k)), proving the following theorem.
Theorem 1.
There is a deterministic algorithm for k-median that, given a metric space of size n, computes an O(log(n/k))-approximate solution in Õ(nk) time.
We obtain our algorithm by adapting the "hierarchical partitioning" approach of Guha et al. [20]. We show that a modified version of this hierarchy can be implemented efficiently by using "restricted k-clustering" algorithms – a notion that was recently introduced by Bhattacharya et al. to design fast dynamic clustering algorithms [7]. We design a deterministic algorithm for restricted k-median based on the reverse greedy algorithm of Chrobak et al. [13] and combine it with the hierarchical partitioning framework to construct our algorithm.
In addition to our algorithm, we also provide a lower bound on the approximation ratio of any deterministic algorithm with a running time of Õ(nk), proving the following theorem.
Theorem 2.
Any deterministic algorithm for k-median that runs in Õ(nk) time when given a metric space of size n has an approximation ratio of Ω(log n / (log k + log log n)).
This lower bound establishes a separation between the randomized and deterministic settings for k-median – ruling out the possibility of a deterministic O(1)-approximation algorithm that runs in near-optimal time whenever k = n^{o(1)}. For example, when k = O(1), Theorem 2 shows that any deterministic algorithm with a near-optimal running time must have an approximation ratio of Ω(log n / log log n). On the other hand, Theorem 1 gives such an algorithm with an approximation ratio of O(log n), which matches the lower bound up to a lower order term.
We prove Theorem 2 by adapting a lower bound on the query complexity of dynamic k-center given by Bateni et al. [5], where the query complexity of an algorithm is the number of queries that it makes to the distance function d. Our lower bound holds for any deterministic algorithm with a query complexity of Õ(nk). Since the query complexity of an algorithm is a lower bound on its running time, this gives us Theorem 2. In general, establishing a gap between the deterministic and randomized query complexity of a problem is an interesting research direction [29]. Our lower bound implies such a gap for k-median whenever k is sufficiently small.
For the special case of 1-median, Chang showed that, for any constant ε > 0, any deterministic algorithm with a running time of O(n^{1+ε}) has an approximation ratio of Ω(1/ε) [10]. The lower bound for 1-median by [10] uses very similar techniques to the lower bounds of Bateni et al. [5], which we adapt to obtain our result. In the full version of the paper [16], we provide a generalization of this lower bound in Theorem 12, giving a similar tradeoff between running time and approximation.
1.2 Related Work
Two other well-studied metric k-clustering problems are k-means and k-center. For k-means, the current state-of-the-art is the same as for k-median. Using randomization, it is known how to obtain an O(1)-approximation in Õ(nk) time [28]. The deterministic algorithm of [20] can be generalized to work for k-means as well as k-median (see Appendix A in the full version of the paper [16], where we describe this generalization). In particular, we can also obtain a poly(log n)-approximation for k-means in near-optimal time. On the other hand, for k-center, the situation is quite different. The classic greedy algorithm given by Gonzalez [19] is deterministic and returns a 2-approximation in O(nk) time. It is known that any non-trivial approximation algorithm must run in Ω(nk) time [5], and it is NP-hard to obtain a (2 − ε)-approximation for any constant ε > 0 [25]. Thus, this algorithm has an exactly optimal approximation ratio and an optimal running time.
Another related line of work considers the specific case of Euclidean spaces. It was recently shown how to obtain an O(1)-approximation in almost-linear time in such spaces [18]. The authors obtain their result by adapting the Õ(n²)-time deterministic algorithm of [27] using locality-sensitive hashing.
The k-median problem has also recently received much attention in the dynamic setting, where points in the metric space are inserted and deleted over time and the objective is to maintain a good solution. A long line of work [14, 23, 8, 17, 7, 6] recently led to a fully dynamic k-median algorithm with an O(1)-approximation and Õ(k) update time against adaptive adversaries, giving near-optimal update time and approximation. (Note that, since we cannot obtain a running time of o(nk) in the static setting, we cannot obtain an update time of o(k) in the dynamic setting.)
1.3 Organization
In Section 2, we give the preliminaries and describe the notation used throughout the paper. In Section 3, we give a technical overview of our results. We present our algorithm in Sections 4 and 5. We defer the proof of our lower bound to the full version of the paper [16]. In Section 6, we describe our results for the k-means problem.
2 Preliminaries
Let (V, w, d) be a weighted metric space of size |V| = n, where w : V → ℝ_{>0} is a weight function and d : V × V → ℝ_{≥0} is a metric satisfying the triangle inequality. The aspect ratio Δ of the metric space is the ratio of the maximum and minimum non-zero distances in the metric space. We use the notation Õ(·) to hide polylogarithmic factors in the size n and the aspect ratio Δ of the metric space. Given subsets U, S ⊆ V, we define the cost of the solution S with respect to U as

  cost(U, S) := Σ_{x∈U} w(x)·d(x, S),

where d(x, S) := min_{y∈S} d(x, y). (Note that we do not necessarily require that S is a subset of U.) When we are considering the cost of S w.r.t. the entire space V, we abbreviate cost(V, S) by cost(S). In the k-median problem on the metric space (V, w, d), our objective is to find a subset S ⊆ V of size at most k which minimizes cost(S). Given an integer k and subsets U, T ⊆ V, we define the optimal cost of a solution of size at most k within T with respect to U as

  OPT_k(U, T) := min_{S ⊆ T, |S| ≤ k} cost(U, S).

When U and T are the same, we abbreviate OPT_k(U, U) by OPT_k(U). Thus, the optimal solution to the k-median problem on the metric space (V, w, d) has cost OPT_k(V). For any U ⊆ V, we denote the metric subspace obtained by considering the metric and weights restricted to only the points in U by (U, w, d).
The Projection Lemma.
Given sets S, T ⊆ V, we let P_T(S) denote the projection of S onto the set T, which is defined as the subset of points y ∈ T such that some point x ∈ S has y as its closest point in T (breaking ties arbitrarily). In other words, we define P_T(S) := {argmin_{y∈T} d(x, y) | x ∈ S}. We use the following well-known projection lemma throughout the paper, which allows us to upper bound the cost of the projection P_T(S) in terms of the costs of S and T [21, 13].
Lemma 3.
For any subsets U, S, T ⊆ V, we have that cost(U, P_T(S)) ≤ 2·cost(U, S) + cost(U, T).
The following well-known corollary of the projection lemma shows that, for any set U ⊆ V, the optimal cost of the k-median problem in (U, w, d) changes by at most a factor of 2 if we are allowed to place centers anywhere in V.
Corollary 4.
For any subset U ⊆ V, we have that OPT_k(U) ≤ 2·OPT_k(U, V).
Proof.
Let S be a subset of V of size at most k that minimizes cost(U, S) and let S′ := P_U(S). Then |S′| ≤ k, and it follows from Lemma 3 that cost(U, S′) ≤ 2·cost(U, S) + cost(U, U) = 2·OPT_k(U, V), which implies the corollary.
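As a quick sanity check on Lemma 3, the following sketch (ours; unit weights, points on the real line) computes the projection P_T(S) and verifies the inequality on random instances:

```python
import random

def project(S, T, d):
    # P_T(S): for each point of S, keep its closest point of T (ties arbitrary)
    return {min(T, key=lambda t: d(s, t)) for s in S}

d = lambda a, b: abs(a - b)
cost = lambda U, S: sum(min(d(x, y) for y in S) for x in U)

for _ in range(1000):  # check: cost(U, P_T(S)) <= 2 * cost(U, S) + cost(U, T)
    U = [random.uniform(0, 100) for _ in range(30)]
    S = random.sample(U, 4)
    T = random.sample(U, 4)
    assert cost(U, project(S, T, d)) <= 2 * cost(U, S) + cost(U, T) + 1e-9
```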
3 Technical Overview
We begin by describing the hierarchical partitioning approach used by Guha et al. [20] to obtain a poly(log n)-approximation algorithm with near-optimal running time. We then discuss the limitations of this approach and describe how we overcome these limitations to obtain our result.
3.1 The Hierarchical Partitioning Framework
Guha et al. [20] showed how to combine an Õ(n²)-time k-median algorithm with a simple hierarchical partitioning procedure in order to produce a faster algorithm – while incurring some loss in the approximation. Their approach is based on the following divide-and-conquer procedure:
1. Partition the metric space (V, w, d) into metric subspaces (V_1, w, d), …, (V_p, w, d).
2. Solve the k-median problem on each subspace (V_i, w, d) to obtain a solution S_i.
3. Combine the solutions S_1, …, S_p to get a solution S for the original space (V, w, d).
The main challenge in this framework is implementing Step 3 – finding a good way to merge the solutions from the subspaces into a solution for the original space. To implement this step, they prove the following lemma, which, at a high level, shows how to use the solutions S_1, …, S_p to construct a sparsifier for the metric space (V, w, d) whose size is much smaller than the size of the space.
Lemma 5 ([20]).
Suppose that each solution S_i is a λ-approximate solution to the k-median problem in (V_i, w, d). Let V′ := S_1 ∪ ⋯ ∪ S_p and, for each y ∈ V′, let w′(y) denote the total weight of points in V that are assigned to y in the solutions S_1, …, S_p. Then any μ-approximate solution to the k-median problem in the space (V′, w′, d) is an O(λ·μ)-approximation in the space (V, w, d).
Using Lemma 5, we can compute a (weighted) subspace (V′, w′, d) that has size only p·k, which can be much smaller than the size n of the original space. Crucially, we have the guarantee that any good solution that we find within this subspace is also a good solution in the space (V, w, d). Thus, we can then use the deterministic O(1)-approximate Õ(n²)-time k-median algorithm of [27] to compute a solution S for (V′, w′, d) in Õ((pk)²) time.
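The following sketch (ours, with hypothetical helper names) shows how the weighted sparsifier of Lemma 5 would be assembled from the per-subspace solutions:

```python
def build_sparsifier(parts, solutions, w, d):
    """Given a partition of V (parts) and a solution S_i for each part,
    return V' = union of the S_i and weights w'(y) = total weight of the
    points that are assigned to y in their part's solution."""
    w_prime = {y: 0.0 for S_i in solutions for y in S_i}
    for part, S_i in zip(parts, solutions):
        for x in part:
            y = min(S_i, key=lambda c: d(x, c))  # x's center within its part
            w_prime[y] += w[x]
    return list(w_prime), w_prime
```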
A 2-Level Hierarchy.
Suppose we run this divide-and-conquer framework for one step (i.e. without recursing on the subspaces (V_i, w, d)) and just compute the solutions S_1, …, S_p for the subspaces using the Õ(n²) time algorithm of [27]. It follows immediately from Lemma 5 that the approximation ratio of the final solution S is O(1). We can also observe that, up to polylogarithmic factors, the total time taken to compute the solutions S_1, …, S_p is n²/p, since the size of each subspace is n/p. Furthermore, the time taken to compute S is (pk)². By taking p := (n/k)^{2/3}, we get that the total running time of the algorithm is Õ(n^{4/3}·k^{2/3}), giving a polynomial improvement in the running time for k = o(n).
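For intuition, the balancing behind this choice of p works out as follows (a back-of-the-envelope check, suppressing polylogarithmic factors):

\[
\frac{n^2}{p} = (pk)^2 \iff p^3 = \frac{n^2}{k^2} \iff p = \left(\frac{n}{k}\right)^{2/3},
\qquad\text{giving}\qquad
\tilde{O}\!\left(\frac{n^2}{p} + (pk)^2\right) = \tilde{O}\!\left(n^{4/3} k^{2/3}\right).
\]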
An ℓ-Level Hierarchy.
Now, suppose we run this framework for ℓ steps. To balance the running time required to compute the solutions at each level of this divide-and-conquer procedure, we want to subdivide each metric subspace at depth i in the recursion tree into (roughly) p_i further subspaces, for appropriately chosen values p_i. Guha et al. show that the running time of this algorithm is Õ(ℓ·nk·(n/k)^{1/2^ℓ}). By Lemma 5, we can also see that the approximation ratio of the final solution is 2^{O(ℓ)}. Setting the p_i appropriately, we get the following theorem.
Theorem 6.
There is a deterministic algorithm for k-median that, given a metric space of size n, computes a 2^{O(ℓ)}-approximation in Õ(ℓ·nk·(n/k)^{1/2^ℓ}) time, for any ℓ ≥ 1.
Setting ℓ := Θ(log log n), we get immediately the following corollary.
Corollary 7.
There is a deterministic algorithm for k-median that, given a metric space of size n, computes a poly(log n)-approximation in Õ(nk) time.
3.2 The Barrier to Improving the Approximation
While the sparsification technique that is described in Lemma 5 allows us to obtain much faster algorithms by sparsifying our input in a hierarchical manner, this approach has one major drawback: sparsifying the input in this manner leads to an approximation ratio that scales exponentially with the number of levels in the hierarchy. Unfortunately, this exponential growth seems unavoidable with this approach. This leads us to the following question.
Question 2.
Is there a different way to combine the solutions S_1, …, S_p that does not lead to an exponential deterioration in the approximation?
3.3 Idea I: Sparsification via Restricted k-Median
Very recently, Bhattacharya et al. introduced the notion of "restricted k-clustering" in order to design efficient and consistent dynamic clustering algorithms [7]. The restricted k-median problem on the space (V, w, d) is the same as the k-median problem, except that we are also given a subset T ⊆ V and have the additional restriction that our solution S must be a subset of T. Crucially, even though the algorithm can only place the centers in the solution within T, it receives the entire space as input and computes the cost of the solution w.r.t. the entire space.
The restricted k-median problem allows us to take a different approach toward implementing Step 3 of the divide-and-conquer framework. Instead of compressing the entire space into only p·k many weighted points, we restrict the output of the solution to these points but still consider the rest of the space while computing the cost of the solution S. This can be seen as a less aggressive way of sparsifying the metric space, where we lose less information. It turns out that this approach allows us to produce solutions of higher quality, where the approximation scales linearly in the number of levels in the hierarchy instead of exponentially.
Efficiently implementing this new hierarchy is challenging since we need to design a fast deterministic algorithm for restricted k-median with the appropriate approximation guarantees. To illustrate our approach while avoiding this challenge, we first describe a simple version of our algorithm with improved approximation and near-optimal query complexity. We later show how to design such a restricted algorithm, allowing us to implement this algorithm efficiently.
3.4 Our Algorithm With Near-Optimal Query Complexity
Let (V, w, d) be a metric space of size n and let k ≤ n be an integer. We define the value ℓ := ⌈log₂(n/k)⌉, which we use to describe our algorithm. Our algorithm works in the following 2 phases.
Phase I.
In the first phase of our algorithm, we construct a sequence of partitions P_1, …, P_ℓ of the metric space (V, w, d), such that the partition P_i is a refinement of the partition P_{i+1} (i.e. each element of P_{i+1} is the union of some elements of P_i). We let P_ℓ := {V}. Subsequently, for each i = ℓ, …, 2, we construct the partition P_{i−1} by arbitrarily partitioning each V_j ∈ P_i into subsets V_{j,1} and V_{j,2} of equal size and adding these subsets to P_{i−1}. (Note that it might not be possible for the subsets V_{j,1} and V_{j,2} to have exactly equal size. For simplicity, we ignore this detail in the technical overview.) For each V_j ∈ P_1, we define T_j := V_j.
Phase II.
The second phase of our algorithm proceeds in ℓ iterations, where we use the partitions P_1, …, P_ℓ to compute the solution in a bottom-up manner. For each V_j ∈ P_i with i ≥ 2, let T_j denote the set of points S_{j,1} ∪ S_{j,2}, where V_{j,1} and V_{j,2} are the two subsets that V_j was split into. For each i = 1, …, ℓ, our algorithm constructs the solutions for the sets in P_i as follows:
For each V_j ∈ P_i, let S_j be the optimal solution to the k-median problem in the metric space (V_j, w, d) such that S_j ⊆ T_j. Let S denote the solution S_j corresponding to the single set V ∈ P_ℓ.
Output.
The set S is the final output of our algorithm.
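To make the two phases above concrete, the following toy sketch (ours, not the authors' code; unit weights, exponential running time exactly as in Theorem 8 below) builds the binary partition hierarchy and solves each restricted instance by brute force:

```python
import itertools

def brute_force(U, cands, k, d):
    """Optimal restricted solution: min-cost S with S a subset of cands and
    |S| = min(k, |cands|), where cost is measured over all of U."""
    cost = lambda S: sum(min(d(x, y) for y in S) for x in U)
    return min(itertools.combinations(cands, min(k, len(cands))), key=cost)

def hierarchy_solve(V, k, d):
    levels = [[list(V)]]                       # P_ell = {V}; split down to ~2k
    while max(len(p) for p in levels[-1]) > 2 * k:
        levels.append([h for p in levels[-1]
                       for h in (p[:len(p) // 2], p[len(p) // 2:])])
    # level 1: solve each small subspace exactly (here T_j = V_j)
    sol = {tuple(p): brute_force(p, p, k, d) for p in levels[-1]}
    for lvl in reversed(levels[:-1]):          # combine bottom-up: T_j = S_j1 + S_j2
        sol = {tuple(p): brute_force(p, sol[tuple(p[:len(p) // 2])]
                                        + sol[tuple(p[len(p) // 2:])], k, d)
               for p in lvl}
    return sol[tuple(V)]
```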
The following theorem summarizes the behaviour of this algorithm.
Theorem 8.
There is a deterministic algorithm for k-median that, given a metric space of size n, computes an O(log(n/k))-approximation with Õ(nk) queries (but exponential running time).
We now sketch the proof of Theorem 8 by outlining the analysis of the approximation ratio and query complexity. The formal proof follows from the more general analysis in Section 5.2.
Approximation Ratio
To bound the cost of the solution returned by our algorithm, we first need to be able to relate the costs of the solutions in the hierarchy to the costs of the solutions in the subsequent levels. Given any set V_j within a partition P_i, the following key claim establishes the relationship between the cost of the solution S_j on the metric subspace (V_j, w, d) and the costs of the solutions S_{j,1} and S_{j,2} on the metric subspaces (V_{j,1}, w, d) and (V_{j,2}, w, d).
Claim 9.
For any set V_j ∈ P_i with 2 ≤ i ≤ ℓ, we have that

  cost(V_j, S_j) ≤ 2·OPT_k(V_j) + cost(V_{j,1}, S_{j,1}) + cost(V_{j,2}, S_{j,2}).
Proof.
Let T_j denote the set S_{j,1} ∪ S_{j,2}. Let S* be an optimal solution to the k-median problem in the metric space (V_j, w, d) and let P := P_{T_j}(S*) be the projection of S* onto T_j.
Since S_j is the optimal solution to the k-median problem in (V_j, w, d) such that S_j ⊆ T_j, it follows that cost(V_j, S_j) ≤ cost(V_j, P). Applying the projection lemma (Lemma 3) to the projection of S* onto T_j, we get that cost(V_j, P) ≤ 2·cost(V_j, S*) + cost(V_j, T_j). Combining these two inequalities, it follows that

  cost(V_j, S_j) ≤ 2·OPT_k(V_j) + cost(V_j, T_j).   (1)

The claim then follows from the fact that cost(V_j, T_j) ≤ cost(V_{j,1}, S_{j,1}) + cost(V_{j,2}, S_{j,2}).
By repeatedly applying Claim 9 to the sum cost(V, S), we obtain the following upper bound on cost(V, S):

  cost(V, S) ≤ 2·Σ_{i=1}^{ℓ} Σ_{V_j∈P_i} OPT_k(V_j).   (2)
We prove a generalization of Equation 2 in Claim 19. Now, let S* be an optimal solution to k-median in (V, w, d) and consider any partition P_i. Since each V_j ∈ P_i is a subspace of V, we have that OPT_k(V_j) ≤ 2·OPT_k(V_j, V) ≤ 2·cost(V_j, S*), where the first inequality follows from Corollary 4. Thus, summing over each V_j ∈ P_i, we get that

  Σ_{V_j∈P_i} OPT_k(V_j) ≤ 2·cost(V, S*) = 2·OPT_k(V).   (3)
Combining Equations 2 and 3, it follows that cost(V, S) ≤ 4ℓ·OPT_k(V). Thus, the solution S returned by our algorithm is an O(log(n/k))-approximation.
Query Complexity
Since the partitions constructed in Phase I of the algorithm are arbitrary, we do not make any queries to the metric in Phase I. Thus, we now focus on bounding the query complexity of Phase II.
The total number of queries made during the i-th iteration of Phase II is the sum of the number of queries required to compute the solutions {S_j}_{V_j∈P_i}. In the first iteration (when i = 1), we compute |P_1| = 2^{ℓ−1} many solutions, each one on a subspace of size 2k. (Again, we assume for simplicity that each set in the partition has the same size.) We can trivially compute an optimal solution in (V_j, w, d) by querying every distance between all pairs of points in V_j and then checking every possible solution. Thus, we can upper bound the number of queries by
2^{ℓ−1}·(2k)² = O(nk),
where we are using the fact that the number of sets in the partition P_1 is 2^{ℓ−1} = O(n/k). For each subsequent iteration (when i ≥ 2), we compute |P_i| = 2^{ℓ−i} many solutions, each one on a subspace of size at most 2^i·k, where the solution S_j is restricted to the set T_j, which has size at most 2k, where T_j = S_{j,1} ∪ S_{j,2} and S_{j,1} and S_{j,2} are computed in the previous iteration. Since we only need to consider solutions that are contained in some subset T_j of at most 2k points, we can find an optimal such restricted solution with O(|V_j|·k) queries. It follows that the total number of queries that we make during this iteration is at most Σ_{V_j∈P_i} O(|V_j|·k) = O(nk). Thus, the total query complexity of our algorithm is O(nk·ℓ) = Õ(nk).
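For concreteness, here is the level-by-level count under the simplifying assumption that n = 2^ℓ·k exactly (a back-of-the-envelope check):

\[
\underbrace{2^{\ell-1}\,(2k)^2}_{i=1}
\;+\;
\sum_{i=2}^{\ell} \underbrace{2^{\ell-i}}_{|P_i|} \cdot \underbrace{(2^i k)(2k)}_{|V_j|\cdot|T_j|}
\;=\;
2nk + (\ell-1)\cdot 2nk
\;=\;
2nk\ell
\;=\;
\tilde{O}(nk).
\]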
3.5 Idea II: A Deterministic Algorithm for Restricted k-Median
In order to establish the approximation guarantees of our algorithm in Section 3.4, we use the fact that, given a metric space (V, w, d) of size n and a subset T ⊆ V, we can find a solution S ⊆ T to the k-median problem such that

  cost(V, S) ≤ 2·OPT_k(V) + cost(V, T).   (4)

We use this fact in the proof of Claim 9 (see Equation 1), which is the key claim in our analysis of our algorithm. Furthermore, to establish the query complexity bound of our algorithm, we use the fact that we can find such a solution with O(n·|T|) many queries. To implement this algorithm efficiently, we need to be able to find such a solution in Õ(n·|T|) time. While designing an algorithm with these exact guarantees seems challenging (since it would require efficiently matching the bounds implied by the projection lemma), we can give an algorithm with the following relaxed guarantees, which suffice for our applications.
Lemma 10.
There is a deterministic algorithm that, given a metric space (V, w, d) of size n, a subset T ⊆ V, and a parameter k, returns a solution S ⊆ T of size at most 2k in Õ(n·|T|) time such that cost(V, S) ≤ cost(V, T) + O(log(|T|/k))·OPT_k(V).
To prove Lemma 10, we give a simple modification of the reverse greedy algorithm of Chrobak, Kenyon, and Young [13]. The reverse greedy algorithm initially creates a solution S consisting of the entire space, and proceeds to repeatedly peel off the point in S whose removal leads to the smallest increase in the cost of S until only k points remain. Chrobak et al. showed that this algorithm has an approximation ratio of O(log n) and Ω(log n / log log n). While this large approximation ratio might seem impractical for our purposes, we can make 2 simple modifications to the algorithm and analysis in order to obtain the guarantees in Lemma 10.
1. We start off by setting S := T instead of S := V. This ensures that the output is a subset of T, at the cost of an additive cost(V, T) term in the guarantee.
2. We return the set S once |S| = 2k instead of |S| = k. This allows us to obtain the guarantee that cost(V, S) ≤ cost(V, T) + O(log(|T|/k))·OPT_k(V).
We can now make the following key observation: In our algorithm from Section 3.4, whenever we compute a solution to the k-median problem, we impose the constraint that S_j ⊆ T_j for a set T_j of size O(k). Thus, if we use the algorithm from Lemma 10 to compute our solutions within the hierarchy, whenever we apply this lemma we can assume that |T| = O(k). Consequently, the approximation guarantee that we get from Lemma 10 becomes

  cost(V, S) ≤ cost(V, T) + O(1)·OPT_k(V),

matching the required guarantee from Equation 4 up to constant factors.
Dealing with Bicriteria Approximations.
One caveat of Lemma 10 is that the solutions output by the algorithm have size 2k instead of k. In other words, these are "bicriteria approximations" to the k-median problem, i.e. solutions that contain more than k points. Thus, the final output of our algorithm has size 2k. By using the extraction technique of Guha et al. [20] described in Lemma 5, it is straightforward to compute a subset of k of these points in Õ(nk) time while only incurring a constant factor increase in the approximation ratio.
Putting Everything Together.
By combining our algorithm from Section 3.4 with Lemma 10, we get our main result, which we prove in Section 5.
Theorem 11.
There is a deterministic algorithm for k-median that, given a metric space of size n, computes an O(log(n/k))-approximate solution in Õ(nk) time.
3.6 Our Lower Bound for Deterministic k-Median
We also prove the following lower bound for deterministic k-median. Due to space constraints, we defer the proof of this lower bound to the full version of the paper [16].
Theorem 12.
For every t ≥ 1, any deterministic algorithm for the k-median problem that has a running time of Õ(n·t) on a metric space of size n has an approximation ratio of Ω(log n / (log t + log log n)).
We prove this lower bound by adapting a lower bound construction by Bateni et al. [5]. In this paper, the authors provide a lower bound for dynamic k-center clustering against adaptive adversaries. Although their primary focus is k-center, their lower bounds can be extended to various k-clustering problems. The main idea is to design an adaptive adversary that controls the underlying metric space as well as the points being inserted and deleted from the metric space. Whenever the algorithm queries the distance between any two points x, y ∈ V, the adversary returns a value d(x, y) which is consistent with the distances reported for all previously queried pairs of points. Note that if we have the distances between some pairs of points (not necessarily all of them), we might be able to embed different metrics on the current space that are consistent with the queried distances. More specifically, [5] introduces two different consistent metrics, and shows that the algorithm cannot distinguish between these metrics, leading to a large approximation ratio.
We use the same technique as described above with slight modifications. The first difference is that [5] considers the problem in the fully dynamic setting, whereas our focus is on the static setting. The adversary has two advantages in the fully dynamic setting:
1. The adversary has the option to delete a (problematic) point from the space.
2. The approximation ratio of the algorithm is defined as the maximum approximation ratio throughout the entire stream of updates.
Both of these advantages are exploited in the framework of [5]: The adversary removes any “problematic” point and the approximation ratio of the algorithm is proven to be large only in special steps (referred to as “clean operations”) where the adversary has removed all of the problematic points. In the static setting, the adversary does not have these advantages, and the approximation ratio of the algorithm is only considered at the very end.
Due to the differences between the objective functions of the k-median and k-center problems, we observe that we can adapt the above framework for deterministic static k-median. One of the technical points here is to show that, if there is a problematic point x that is contained in the output S of the algorithm, we can construct the metric so that the cost of the cluster of x in the solution S becomes large. The final metric space is similar to the "uniform" metric introduced in [5], with a small modification. Since the algorithm is deterministic, the output is the same set if we run the algorithm on this final metric again. Hence, we get our lower bounds for any deterministic algorithm for the static k-median problem. Please see the full version of the paper [16] for a complete proof of our lower bound.
4 A Deterministic Algorithm for Restricted k-Median
In this section, we prove the following theorem.
Theorem 13.
There is a deterministic algorithm that, given a metric space (V, w, d) of size n, a subset T ⊆ V, and a parameter k, returns a solution S ⊆ T of size at most 2k in Õ(n·|T|) time such that cost(V, S) ≤ cost(V, T) + O(log(|T|/k))·OPT_k(V).
This algorithm is based on a variant of the reverse greedy algorithm for the k-median problem [13], which we modify for the restricted k-median problem. We refer to our modified version of this algorithm as Res-Greedy.
4.1 The Restricted Reverse Greedy Algorithm
Let (V, w, d) be a metric space of size n, let T ⊆ V, and let k be an integer. The restricted reverse greedy algorithm Res-Greedy(V, T, k) begins by initializing a set S := T and does the following:
While |S| > 2k, identify the center whose removal from S causes the smallest increase in the cost of the solution (i.e. the point y* := argmin_{y∈S} cost(V, S ∖ {y})) and remove y* from S. Once |S| = 2k, return the set S.
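A direct, unoptimized sketch of Res-Greedy (ours; unit weights by default, with the efficient implementation described in Section 4.4):

```python
def res_greedy(V, T, k, d, w=None):
    """Res-Greedy(V, T, k): start from S = T and repeatedly remove the center
    whose removal increases cost(V, S) the least, until |S| = 2k."""
    w = w or {x: 1.0 for x in V}
    # note: cost is measured over the *entire* space V, not just over T
    cost = lambda S: sum(w[x] * min(d(x, y) for y in S) for x in V)
    S = set(T)
    while len(S) > 2 * k:
        S.remove(min(S, key=lambda y: cost(S - {y})))
    return S
```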
4.2 Analysis
Let m denote the size of T and let S_m := T. Now, suppose that we run Res-Greedy(V, T, k) and consider the state of the set S throughout the run of the algorithm. Since points are only removed from S, this gives us a sequence of nested subsets S_{2k} ⊂ S_{2k+1} ⊂ ⋯ ⊂ S_m, where |S_j| = j for each 2k ≤ j ≤ m. Note that S_{2k} is the final output of our algorithm. The following lemma is the main technical lemma in the analysis of this algorithm.
Lemma 14.
For each 2k < j ≤ m, we have that

  cost(V, S_{j−1}) − cost(V, S_j) ≤ (2/(j − k))·OPT_k(V).
Using Lemma 14, we can now prove the desired approximation guarantee of our algorithm. By using a telescoping sum, we can express the cost of the solution S_{2k} as

  cost(V, S_{2k}) = cost(V, S_m) + Σ_{j=2k+1}^{m} (cost(V, S_{j−1}) − cost(V, S_j)).   (5)

Applying Lemma 14, we can upper bound the sum on the RHS of Equation 5 by

  Σ_{j=2k+1}^{m} (2/(j − k))·OPT_k(V) = 2·OPT_k(V)·Σ_{t=k+1}^{m−k} 1/t ≤ 2·ln(m/k)·OPT_k(V).   (6)

Combining Equations 5 and 6, we get that

  cost(V, S_{2k}) ≤ cost(V, T) + O(log(|T|/k))·OPT_k(V),

giving us the desired bound in Theorem 13, since S_m = T and m = |T|.
4.3 Proof of Lemma 14
The following analysis is essentially the same as the analysis given in [13], with some minor changes to accommodate the fact that we are using the algorithm for the restricted k-median problem and to allow us to obtain an improved approximation guarantee, at the cost of outputting bicriteria approximations.
We begin with the following claim, which we use later on in the analysis.
Claim 15.
For all subsets R ⊊ S ⊆ V, we have that

  Σ_{y∈R} (cost(V, S ∖ {y}) − cost(V, S)) ≤ cost(V, S ∖ R) − cost(V, S).
Proof.
For each y ∈ S, let C_y denote the cluster of the points in V that are assigned to y within the solution S. In other words, C_y is the subset of points in V whose closest point in S is y (breaking ties arbitrarily). We can now observe that

  Σ_{y∈R} (cost(V, S∖{y}) − cost(V, S))
    = Σ_{y∈R} Σ_{x∈V} w(x)·(d(x, S∖{y}) − d(x, S))
    = Σ_{y∈R} Σ_{x∈C_y} w(x)·(d(x, S∖{y}) − d(x, S))
    ≤ Σ_{y∈R} Σ_{x∈C_y} w(x)·(d(x, S∖R) − d(x, S))
    ≤ Σ_{x∈V} w(x)·(d(x, S∖R) − d(x, S))
    = cost(V, S∖R) − cost(V, S).

Here, the first and fifth lines follow from the definition of the cost function, the second line follows since d(x, S∖{y}) − d(x, S) can only be non-zero when x ∈ C_y, the third line follows from the fact that d(x, S∖{y}) ≤ d(x, S∖R) for any y ∈ R, and the fourth line from the fact that the clusters {C_y}_{y∈S} partition the set V and each term d(x, S∖R) − d(x, S) is non-negative.
Let S* denote an optimal solution to the k-median problem in the metric space (V, w, d) and let 2k < j ≤ m. We denote by P := P_{S_j}(S*) the projection of the optimal solution S* onto the set S_j. It follows that

  cost(V, S_{j−1}) − cost(V, S_j)
    ≤ min_{y∈S_j∖P} (cost(V, S_j∖{y}) − cost(V, S_j))
    ≤ (1/|S_j∖P|)·Σ_{y∈S_j∖P} (cost(V, S_j∖{y}) − cost(V, S_j))
    ≤ (1/(j − k))·Σ_{y∈S_j∖P} (cost(V, S_j∖{y}) − cost(V, S_j))
    ≤ (1/(j − k))·(cost(V, P) − cost(V, S_j))
    ≤ (2/(j − k))·OPT_k(V).

The first line follows directly from how the algorithm chooses which point to remove from S_j. (Note that, for analytic purposes, we only take the minimum over S_j∖P instead of all of S_j in this inequality.) The second line follows from the fact that the minimum value within a set of real numbers is upper bounded by its average. The third line follows from the fact that |S_j∖P| ≥ j − k, since |P| ≤ |S*| ≤ k. The fourth line follows from Claim 15, since S_j∖(S_j∖P) = P. Finally, the fifth line follows from Lemma 3, which implies that cost(V, P) ≤ 2·cost(V, S*) + cost(V, S_j) = 2·OPT_k(V) + cost(V, S_j).
4.4 Implementation
Now we describe how to implement the algorithm Res-Greedy to run in Õ(n·|T|) time. In particular, we prove the following lemma.
Lemma 16.
Given a metric space (V, w, d) of size n and a set T ⊆ V, the algorithm Res-Greedy runs in time Õ(n·|T|).
We now show how to implement Res-Greedy to run in time Õ(n·|T|). Our implementation uses similar data structures to the randomized local search in [7]. For each x ∈ V, we initialize a list L_x which contains all of the points y ∈ T, sorted in increasing order according to the distances d(x, y). We denote the i-th point in the list L_x by L_x[i]. We maintain the invariant that, at each point in time, each of the lists {L_x}_{x∈V} contains exactly the points in S. Thus, at each point in time, we have that d(x, S) = d(x, L_x[1]). By implementing each of these lists using a balanced binary tree, we can initialize them in Õ(n·|T|) time and update them in Õ(n) total time after each removal of a point from S. Since S initially has size |T|, the total time spent updating the lists is Õ(n·|T|). We also explicitly maintain the clustering C = {C_y}_{y∈S} induced by the lists, where C_y := {x ∈ V | L_x[1] = y}. We can initialize these clusters in Õ(n) time given the collection of lists and update them each time a list in {L_x}_{x∈V} is updated while only incurring a constant factor overhead in the running time. We now show that, using these data structures, we can implement each iteration of the greedy algorithm in Õ(n) time. Since the algorithm runs for at most |T| iterations, this gives us the desired running time.
Implementing an Iteration of Greedy.
Using the lists {L_x}_{x∈V} and the clustering C, we can compute

  Δ_y := Σ_{x∈C_y} w(x)·(d(x, L_x[2]) − d(x, L_x[1]))

for each y ∈ S. Since any point x ∈ V appears in exactly one cluster in C, this takes Õ(n) time in total. By observing that removing y from S causes each point x ∈ C_y to be reassigned to the center L_x[2], we can see that Δ_y is precisely the value of cost(V, S∖{y}) − cost(V, S). Since minimizing cost(V, S∖{y}) over y is equivalent to minimizing cost(V, S∖{y}) − cost(V, S), it follows that

  argmin_{y∈S} cost(V, S∖{y}) = argmin_{y∈S} Δ_y.

Thus, we let y* := argmin_{y∈S} Δ_y, remove y* from S, and proceed to update the data structures. Excluding the time taken to update the data structures, the iteration takes Õ(n) time.
5 Our Deterministic k-Median Algorithm
In this section, we prove Theorem 1, which we restate below.
Theorem 17.
There is a deterministic algorithm for k-median that, given a metric space of size n, computes an O(log(n/k))-approximate solution in Õ(nk) time.
5.1 Our Algorithm
Let (V, w, d) be a metric space of size n and let k ≤ n be an integer. We also define the value ℓ := ⌈log₂(n/k)⌉, which we use to describe our algorithm. Our algorithm works in 3 phases, which we describe below.
Phase I.
In the first phase of our algorithm, we construct a sequence of partitions P_1, …, P_ℓ of the metric space (V, w, d), such that the partition P_i is a refinement of the partition P_{i+1} (i.e. each element of P_{i+1} is the union of some elements of P_i). We start off by setting P_ℓ := {V}. Subsequently, for each i = ℓ, …, 2, we construct the partition P_{i−1} as follows:
Initialize P_{i−1} := ∅. Then, for each V_j ∈ P_i, arbitrarily partition V_j into subsets V_{j,1} and V_{j,2} such that ||V_{j,1}| − |V_{j,2}|| ≤ 1, and add these subsets to P_{i−1}.
For each V_j ∈ P_1, we define T_j := V_j.
Phase II.
The second phase of our algorithm proceeds in ℓ iterations, where we use the partitions P_1, …, P_ℓ to compute the solution in a bottom-up manner. For each V_j ∈ P_i with i ≥ 2, let T_j denote the set of points S_{j,1} ∪ S_{j,2}, where V_{j,1}, V_{j,2} ∈ P_{i−1} are the two subsets of V_j constructed in Phase I. For each i = 1, …, ℓ, our algorithm constructs the solutions for the sets in P_i as follows:
For each V_j ∈ P_i, let S_j be the solution obtained by running Res-Greedy(V_j, T_j, k) on the subspace (V_j, w, d), restricting the output to be a subset of T_j. Finally, we define S to be the solution S_j corresponding to the single set V ∈ P_ℓ.
Phase III.
Consider the set S, which contains 2k points, and let π : V → S be the projection from V to S (i.e. π maps each point x ∈ V to its closest point in S). Define a weight function w′ on S by setting w′(y) := Σ_{x∈V : π(x)=y} w(x) for each y ∈ S (i.e. w′(y) is the total weight of all points in V that are projected onto y). Let S′ be the solution obtained by running the algorithm of Mettu-Plaxton [27] on the metric space (S, w′, d).
Output.
The solution S′ is the final output of our algorithm.
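As an illustration of Phase III, the following sketch (ours; the Mettu-Plaxton solver is abstracted as a callback rather than implemented) computes the projection weights and solves the induced instance:

```python
def phase_three(V, S, k, d, w, kmedian_solver):
    """Project V onto the 2k-point bicriteria solution S, accumulate the
    projected weights w', and solve k-median on the instance (S, w', d)."""
    w_prime = {y: 0.0 for y in S}
    for x in V:
        pi_x = min(S, key=lambda y: d(x, y))  # pi(x): closest center in S
        w_prime[pi_x] += w[x]
    # kmedian_solver can be any deterministic O(1)-approximate algorithm,
    # e.g. the O~(n^2)-time algorithm of Mettu-Plaxton [27]
    return kmedian_solver(list(S), w_prime, d, k)
```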
5.2 Analysis
We now analyze our algorithm by bounding its approximation ratio and running time.
Approximation Ratio
We begin by proving the following claim, which, for any set V_j within a partition P_i, allows us to express the cost of the solution S_j w.r.t. the metric subspace (V_j, w, d) in terms of the costs of the solutions S_{j,1} and S_{j,2}.
Claim 18.
For any set V_j ∈ P_i with 2 ≤ i ≤ ℓ, we have that

  cost(V_j, S_j) ≤ O(1)·OPT_k(V_j) + cost(V_{j,1}, S_{j,1}) + cost(V_{j,2}, S_{j,2}).
Proof.
Let T_j := S_{j,1} ∪ S_{j,2}. We obtain the solution S_j by calling Res-Greedy on the metric space (V_j, w, d) while restricting the output to be a subset of T_j. Thus, it follows from Theorem 13 that

  cost(V_j, S_j) ≤ cost(V_j, T_j) + O(log(|T_j|/k))·OPT_k(V_j).

By observing that |T_j| ≤ 4k, it follows that O(log(|T_j|/k)) = O(1). Finally, the claim follows since cost(V_j, T_j) ≤ cost(V_{j,1}, S_{j,1}) + cost(V_{j,2}, S_{j,2}).
Using Claim 18, we now prove the following claim.
Claim 19.
For any 1 ≤ i ≤ ℓ, we have that

  Σ_{V_j∈P_i} cost(V_j, S_j) ≤ c·i·OPT_k(V),

where c is a sufficiently large absolute constant.
Proof.
We prove this claim by induction. Note that the base case where i = 1 follows directly from Theorem 13 and Corollary 4, since T_j = V_j (so that cost(V_j, T_j) = 0) and |T_j| = O(k) for each V_j ∈ P_1. Now, suppose that the claim holds for i − 1, where 2 ≤ i ≤ ℓ. Then we have that

  Σ_{V_j∈P_i} cost(V_j, S_j)
    ≤ Σ_{V_j∈P_i} (O(1)·OPT_k(V_j) + cost(V_{j,1}, S_{j,1}) + cost(V_{j,2}, S_{j,2}))
    = O(1)·Σ_{V_j∈P_i} OPT_k(V_j) + Σ_{V_j∈P_{i−1}} cost(V_j, S_j)
    ≤ O(1)·Σ_{V_j∈P_i} OPT_k(V_j) + c·(i − 1)·OPT_k(V)
    ≤ c·i·OPT_k(V).

The second line follows from Claim 18, and the third line from the induction hypothesis. In the fourth line, we are using the fact that, for an optimal solution S* of (V, w, d) and any partition P of V, we have

  Σ_{U∈P} OPT_k(U) ≤ 2·Σ_{U∈P} cost(U, S*) = 2·cost(V, S*) = 2·OPT_k(V),

where the first inequality follows from Corollary 4.
We get the following immediate corollary from Claim 19 by setting i = ℓ (recall that P_ℓ = {V}).
Corollary 20.
We have that cost(V, S) ≤ c·ℓ·OPT_k(V).
Using Corollary 20, we prove the following lemma.
Lemma 21.
We have that cost(V, S) = O(log(n/k))·OPT_k(V).
Proof.
This follows immediately from Corollary 20, since ℓ = ⌈log₂(n/k)⌉.

By Lemma 21, we get that S is an O(log(n/k))-approximate bicriteria solution of size 2k. Using the extraction technique of [20] (see Lemma 5), which allows us to compute an exact solution to the k-median problem from a bicriteria approximation while only incurring a constant factor loss in the approximation ratio, it follows that the solution S′ constructed in Phase III is an O(log(n/k))-approximation and has size at most k.
Running Time
We begin by proving the following lemma, which summarizes the relevant properties of the partitions P_1, …, P_ℓ constructed in Phase I of the algorithm.
Lemma 22.
For each 1 ≤ i ≤ ℓ, the set P_i is a partition of V into 2^{ℓ−i} many subsets, each of size at most n/2^{ℓ−i} + 1 ≤ 2^i·k + 1.
Proof.
We define P_ℓ as {V}, which is a trivial partition of V into 2^0 = 1 subset of size n = n/2^{ℓ−ℓ}. Now, suppose that this statement holds for the partition P_i, where 2 ≤ i ≤ ℓ. The algorithm constructs P_{i−1} by taking each V_j ∈ P_i and further partitioning it into subsets V_{j,1} and V_{j,2}, such that the difference in the sizes of these subsets is at most 1. We can also observe that the number of subsets in the partition P_{i−1} is 2·|P_i| = 2^{ℓ−(i−1)}. (Note that we do not necessarily guarantee that all of the sets in these partitions are non-empty.) Since each subset V_j ∈ P_i has size at most n/2^{ℓ−i} + 1, it follows that each subset in P_{i−1} has size at most

  ⌈(n/2^{ℓ−i} + 1)/2⌉ ≤ n/2^{ℓ−(i−1)} + 1.
Bounding the Running Time.
We now bound the running time of our algorithm. The running time of Phase I of our algorithm is Õ(n), since it takes O(n) time to construct each partition P_{i−1} given the partition P_i. The running time of Phase III of our algorithm is Õ(nk), since constructing the mapping π takes O(nk) time and running the Mettu-Plaxton algorithm on an input of size 2k takes Õ(k²) time. Thus, we now focus on bounding the running time of Phase II.
We can first observe that the running time of the i-th iteration of Phase II is dominated by the total time taken to handle the calls to the algorithm Res-Greedy. In the first iteration (when i = 1), we make |P_1| = 2^{ℓ−1} many calls to Res-Greedy, each one on a subspace of size at most 2k + 1 (by Lemma 22). Thus, by Theorem 13, the time taken to handle these calls is at most

  Σ_{V_j∈P_1} Õ(|V_j|·|T_j|) = Σ_{V_j∈P_1} Õ(|V_j|²) ≤ Õ(2^{ℓ−1}·(2k + 1)²) ≤ Õ(nk),

where the first equality follows from the fact that T_j = V_j for each V_j ∈ P_1, the second step from Lemma 22, and the third since 2^{ℓ−1} ≤ n/k. It follows that the time taken to handle these calls to Res-Greedy is Õ(nk). For each subsequent iteration (when i ≥ 2), we make |P_i| = 2^{ℓ−i} many calls to Res-Greedy, each one on a subspace of size at most 2^i·k + 1 (by Lemma 22), where the solution is restricted to the set T_j, which has size at most 4k, where T_j = S_{j,1} ∪ S_{j,2} and S_{j,1} and S_{j,2} are computed in the previous iteration. It follows from Theorem 13 that the time taken to handle these calls is at most 2^{ℓ−i}·Õ((2^i·k)·k) = Õ(2^ℓ·k²) = Õ(nk). It follows that the total time taken to handle the calls to Res-Greedy during the i-th iteration of Phase II is Õ(nk). Hence, the total time spent handling calls to Res-Greedy is Õ(nk·ℓ) = Õ(nk). The Õ(nk) running time of our algorithm follows.
6 Our Results for Deterministic k-Means
In this section, we briefly describe our results for the k-means problem, where the clustering objective is Σ_{x∈V} w(x)·d(x, S)² instead of Σ_{x∈V} w(x)·d(x, S).
6.1 Our Deterministic Algorithm for k-Means
We first note that our deterministic algorithm with near-optimal query complexity (see Section 3.4) extends seamlessly to the k-means problem, giving us the following result.
Theorem 23.
There is a deterministic algorithm for k-means that, given a metric space of size n, computes an O(log²(n/k))-approximation with Õ(nk) queries (but exponential running time).
Proof.
Consider the normalized k-means objective ncost(U, S) := (Σ_{x∈U} w(x)·d(x, S)²)^{1/2}. Since the projection lemma (Lemma 3) holds for this objective, the algorithm and analysis extend immediately to the normalized k-means problem, yielding an O(log(n/k))-approximation. By observing that a solution is an α²-approximation to k-means if and only if it is an α-approximation to normalized k-means, it follows that the algorithm produces an O(log²(n/k))-approximation for the k-means problem.
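In code, the normalization used in this proof is just a square root; a minimal sketch (ours, for illustration):

```python
import math

def kmeans_cost(U, w, d, S):
    return sum(w[x] * min(d(x, y) for y in S) ** 2 for x in U)

def normalized_cost(U, w, d, S):
    # ncost = sqrt(k-means cost); an alpha-approximation for ncost is
    # exactly an alpha^2-approximation for the k-means objective
    return math.sqrt(kmeans_cost(U, w, d, S))
```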
Unfortunately, it is not clear how to extend our efficient algorithm from Section 5 to k-means. The barrier is that we do not know how to design a good deterministic algorithm for the restricted k-means (or normalized k-means) problem. In particular, it is not known whether or not the reverse greedy algorithm can be used to produce good solutions for the k-means problem. If we try to extend the analysis to the k-means objective, the analysis fails since we need the projection lemma to hold. On the other hand, if we try to extend the analysis to the normalized k-means objective (where the projection lemma does hold), the failure point of the analysis is Claim 15.
6.2 Our Lower Bound for Deterministic k-Means
Our lower bound for deterministic k-median (Theorem 12) extends immediately to deterministic k-means. In particular, we get the following theorem.
Theorem 24.
For every t ≥ 1, any deterministic algorithm for the k-means problem that has a running time of Õ(n·t) on a metric space of size n has an approximation ratio of Ω((log n / (log t + log log n))²).
We defer the proof of this theorem to the full version of the paper [16].
References
- [1] Sara Ahmadian, Ashkan Norouzi-Fard, Ola Svensson, and Justin Ward. Better guarantees for k-means and euclidean k-median by primal-dual algorithms. SIAM Journal on Computing, 49(4):FOCS17–97, 2019. doi:10.1137/18M1171321.
- [2] Nir Ailon, Ragesh Jaiswal, and Claire Monteleoni. Streaming k-means approximation. Advances in neural information processing systems, 22, 2009.
- [3] Vijay Arya, Naveen Garg, Rohit Khandekar, Adam Meyerson, Kamesh Munagala, and Vinayaka Pandit. Local search heuristics for k-median and facility location problems. SIAM J. Comput., 33(3):544–562, 2004. doi:10.1137/S0097539702416402.
- [4] Sepehr Assadi, Andrew Chen, and Glenn Sun. Deterministic graph coloring in the streaming model. In Stefano Leonardi and Anupam Gupta, editors, STOC ’22: 54th Annual ACM SIGACT Symposium on Theory of Computing, Rome, Italy, June 20 - 24, 2022, pages 261–274. ACM, 2022. doi:10.1145/3519935.3520016.
- [5] MohammadHossein Bateni, Hossein Esfandiari, Hendrik Fichtenberger, Monika Henzinger, Rajesh Jayaram, Vahab Mirrokni, and Andreas Wiese. Optimal fully dynamic k-center clustering for adaptive and oblivious adversaries. In Proceedings of the 2023 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2677–2727. SIAM, 2023. doi:10.1137/1.9781611977554.CH101.
- [6] Sayan Bhattacharya, Martín Costa, and Ermiya Farokhnejad. Fully dynamic k-median with near-optimal update time and recourse. In 57th Annual ACM SIGACT Symposium on Theory of Computing (STOC), 2025. (To Appear).
- [7] Sayan Bhattacharya, Martín Costa, Naveen Garg, Silvio Lattanzi, and Nikos Parotsidis. Fully dynamic k-clustering with fast update time and small recourse. In 65th IEEE Symposium on Foundations of Computer Science (FOCS), 2024.
- [8] Sayan Bhattacharya, Martín Costa, Silvio Lattanzi, and Nikos Parotsidis. Fully dynamic k-clustering in Õ(k) update time. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
- [9] Jarosław Byrka, Thomas Pensyl, Bartosz Rybicki, Aravind Srinivasan, and Khoa Trinh. An improved approximation for k-median and positive correlation in budgeted optimization. ACM Transactions on Algorithms (TALG), 13(2):1–31, 2017. doi:10.1145/2981561.
- [10] Ching-Lueh Chang. Metric 1-median selection: Query complexity vs. approximation ratio. In Computing and Combinatorics - 22nd International Conference, COCOON 2016, Ho Chi Minh City, Vietnam, August 2-4, 2016, Proceedings, volume 9797 of Lecture Notes in Computer Science, pages 131–142. Springer, 2016. doi:10.1007/978-3-319-42634-1_11.
- [11] Moses Charikar, Sudipto Guha, Éva Tardos, and David B Shmoys. A constant-factor approximation algorithm for the k-median problem. In Proceedings of the thirty-first annual ACM symposium on Theory of computing, pages 1–10, 1999.
- [12] Moses Charikar, Liadan O’Callaghan, and Rina Panigrahy. Better streaming algorithms for clustering problems. In Proceedings of the thirty-fifth annual ACM symposium on Theory of computing, pages 30–39, 2003. doi:10.1145/780542.780548.
- [13] Marek Chrobak, Claire Kenyon, and Neal E. Young. The reverse greedy algorithm for the metric k-median problem. Inf. Process. Lett., 97(2):68–72, 2006. doi:10.1016/J.IPL.2005.09.009.
- [14] Vincent Cohen-Addad, Niklas Hjuler, Nikos Parotsidis, David Saulpic, and Chris Schwiegelshohn. Fully dynamic consistent facility location. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 3250–3260, 2019. URL: https://proceedings.neurips.cc/paper/2019/hash/fface8385abbf94b4593a0ed53a0c70f-Abstract.html.
- [15] Vincent Cohen-Addad, David Saulpic, and Chris Schwiegelshohn. Deterministic clustering in high dimensional spaces: Sketches and approximation. In 64th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2023, Santa Cruz, CA, USA, November 6-9, 2023, pages 1105–1130. IEEE, 2023. doi:10.1109/FOCS57990.2023.00066.
- [16] Martín Costa and Ermiya Farokhnejad. Deterministic k-median clustering in near-optimal time, 2025. arXiv:2504.15115.
- [17] Max Dupré la Tour, Monika Henzinger, and David Saulpic. Fully dynamic k-means coreset in near-optimal update time. In 32nd Annual European Symposium on Algorithms, ESA 2024, volume 308 of LIPIcs, pages 100:1–100:16. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2024. doi:10.4230/LIPICS.ESA.2024.100.
- [18] Max Dupré la Tour and David Saulpic. Almost-linear time approximation algorithm to euclidean k-median and k-means. CoRR, abs/2407.11217, 2024. doi:10.48550/arXiv.2407.11217.
- [19] Teofilo F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theor. Comput. Sci., 38:293–306, 1985. doi:10.1016/0304-3975(85)90224-5.
- [20] Sudipto Guha, Nina Mishra, Rajeev Motwani, and Liadan O’Callaghan. Clustering data streams. In 41st Annual Symposium on Foundations of Computer Science, FOCS 2000, 12-14 November 2000, Redondo Beach, California, USA, pages 359–366. IEEE Computer Society, 2000. doi:10.1109/SFCS.2000.892124.
- [21] Anupam Gupta and Kanat Tangwongsan. Simpler analyses of local search algorithms for facility location. CoRR, abs/0809.2554, 2008. arXiv:0809.2554.
- [22] Bernhard Haeupler, Yaowei Long, and Thatchaphol Saranurak. Dynamic deterministic constant-approximate distance oracles with n^ε worst-case update time. In 65th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2024, Chicago, IL, USA, October 27-30, 2024, pages 2033–2044. IEEE, 2024. doi:10.1109/FOCS61266.2024.00121.
- [23] M. Henzinger and S. Kale. Fully-dynamic coresets. In ESA, 2020.
- [24] Monika Henzinger, Jason Li, Satish Rao, and Di Wang. Deterministic near-linear time minimum cut in weighted graphs. In Proceedings of the 2024 ACM-SIAM Symposium on Discrete Algorithms, SODA 2024, Alexandria, VA, USA, January 7-10, 2024, pages 3089–3139. SIAM, 2024. doi:10.1137/1.9781611977912.111.
- [25] Wen-Lian Hsu and George L. Nemhauser. Easy and hard bottleneck location problems. Discret. Appl. Math., 1(3):209–215, 1979. doi:10.1016/0166-218X(79)90044-1.
- [26] Kamal Jain and Vijay V Vazirani. Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and lagrangian relaxation. Journal of the ACM (JACM), 48(2):274–296, 2001. doi:10.1145/375827.375845.
- [27] Ramgopal R. Mettu and C. Greg Plaxton. The online median problem. In 41st Annual Symposium on Foundations of Computer Science (FOCS), pages 339–348, 2000. doi:10.1109/SFCS.2000.892122.
- [28] Ramgopal R. Mettu and C. Greg Plaxton. Optimal time bounds for approximate clustering. In Proceedings of the 18th Conference in Uncertainty in Artificial Intelligence (UAI), pages 344–351, 2002. URL: https://dslpitt.org/uai/displayArticleDetails.jsp?mmnu=1&smnu=2&article_id=878&proceeding_id=18.
- [29] Sagnik Mukhopadhyay and Danupon Nanongkai. Weighted min-cut: sequential, cut-query, and streaming algorithms. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020, Chicago, IL, USA, June 22-26, 2020, pages 496–509. ACM, 2020. doi:10.1145/3357713.3384334.
- [30] Ofer Neiman and Shay Solomon. Simple deterministic algorithms for fully dynamic maximal matching. ACM Trans. Algorithms, 12(1):7:1–7:15, 2016. doi:10.1145/2700206.