Dimension-Free Parameterized Approximation Schemes for Hybrid Clustering
Abstract
Hybrid k-Clustering is a model of clustering that generalizes two of the most widely studied clustering objectives: k-Center and k-Median. In this model, given a set of points P, the goal is to find k centers such that the sum of the r-distances of each point to its nearest center is minimized. The r-distance between two points x and y is defined as max{0, dist(x, y) − r} – this represents the distance of x to the boundary of the r-radius ball around y if x is outside the ball, and 0 otherwise. This problem was recently introduced by Fomin et al. [APPROX 2024], who designed a (1+ε, 1+ε)-bicriteria approximation that runs in time FPT in k, d, and 1/ε for inputs in ℝ^d; such a bicriteria solution uses balls of radius (1+ε)·r instead of r, and has cost at most (1+ε) times the cost of an optimal solution using balls of radius r.
In this paper we significantly improve upon this result by designing an approximation algorithm with the same bicriteria guarantee, but with running time that is FPT only in k and 1/ε – crucially, removing the exponential dependence on the dimension d. This resolves an open question posed in their paper. Our results extend further in several directions. First, our approximation scheme works in a broader class of metric spaces, including doubling spaces, minor-free metrics, and bounded-treewidth metrics. Secondly, our techniques yield similar bicriteria FPT approximation schemes for other variants of Hybrid k-Clustering, e.g., when the objective features the sum of z-th powers of the r-distances. Finally, we also design a coreset for Hybrid k-Clustering in doubling spaces, answering another open question from the work of Fomin et al.
Keywords and phrases:
Clustering, Parameterized algorithms, FPT approximation, k-Median, k-Center
Funding:
Ameet Gadekar: Partially supported by the Israel Science Foundation (grant No. 1042/22).
2012 ACM Subject Classification:
Theory of computation → Facility location and clustering; Theory of computation → Fixed parameter tractability
Acknowledgements:
Tanmay would like to thank his co-authors from [23] – specifically Fedor V. Fomin – for formulating the Hybrid k-Clustering problem and introducing him to it. This work was partially carried out while Ameet was at Bar-Ilan University, Israel.
Editors:
Olaf Beyersdorff, Michał Pilipczuk, Elaine Pimentel, and Nguyễn Kim Thắng

1 Introduction
k-Center, k-Median, and k-Means are among the most popular clustering problems both in theory and practice, with numerous applications in areas such as machine learning [36, 28, 6, 11, 4, 9, 38, 25], facility location problems [32, 41, 30, 29, 35, 16], and computational geometry [7, 34], among others. (These citations are not meant to be comprehensive, but representative – an interested reader may use them as a starting point for the relevant literature.) All of these problems have long been known to be NP-hard, even in the plane [22, 40, 37]. To cope with these intractability results, there have been several decades of research on designing approximation algorithms for these problems, which have led to polynomial-time constant-factor approximation algorithms for all of them in general metric spaces [27, 20]. Moreover, for more "structured" inputs – such as when the points belong to Euclidean spaces, or planar graph metrics – one can find near-optimal solutions using approximation schemes [24, 17, 19, 5, 34, 7, 2]. (This includes polynomial-time as well as FPT approximation schemes.) Furthermore, given the simplicity and ubiquity of these vanilla clustering objectives, the techniques used for obtaining these results have subsequently been used to find clusterings with additional constraints, such as capacities [18], fairness [8], and outliers [31, 3]. One recurring theme in the literature has been to define a unified clustering objective that captures classical clustering problems like k-Median and k-Center as special cases [42, 12, 15].
Along this line of research, Fomin et al. [23] recently introduced Hybrid k-Clustering. In this problem, we are given an instance ((P ∪ F, dist), k, r), where (P ∪ F, dist) is a metric space on the set P of clients and a set F of facilities with distance function dist. (Fomin et al. [23] only considered the special case of Euclidean inputs; however, the underlying problem can be defined for arbitrary metric spaces.) Here, k is a positive integer denoting the number of clusters, and r is a non-negative real denoting the radius. The goal is to find a set X ⊆ F of k centers such that cost_r(P, X) := Σ_{p ∈ P} dist_r(p, X) is minimized – here, for any point p, any set X, and a real r ≥ 0, dist_r(p, X) is defined as min_{x ∈ X} dist_r(p, x), and dist_r(p, x) := max{0, dist(p, x) − r}. Fomin et al. [23] proposed several motivations for studying this cost function. Indeed, Hybrid k-Clustering can be seen as a shape-fitting problem, where one wants to find the "best" set of k balls of radius r that fit the given set of points, where the quality of the solution is determined by the sum of distances of each point to the nearest point on the boundary of a ball – this is analogous to classical regression, where one wants to find the "best" linear function fitting the given set of points. Projective Clustering is a well-studied generalization of linear regression, where the aim is to find k affine subspaces that minimize the sum of distances from the points to these subspaces (see e.g., [43]), which is also closely related to the model considered in Hybrid k-Clustering. Furthermore, [23] gave another motivation for studying Hybrid k-Clustering – namely, placing k WiFi routers with identical circular coverage, where the clients that lie outside the coverage area need to travel to the boundary of the nearest ball in order to receive coverage. Finally, the name of the problem is motivated by the fact that the objective is a kind of "hybrid" between the k-Center and k-Median costs, and generalizes both of them.
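To make the objective concrete, here is a minimal Python sketch of the r-distance and the Hybrid k-Clustering cost as defined above; the point sets and coordinates are purely illustrative.

```python
import math

def dist_r(p, x, r):
    """r-distance of client p to center x: the distance from p to the
    boundary of the radius-r ball around x if p lies outside the ball,
    and 0 otherwise."""
    return max(0.0, math.dist(p, x) - r)

def hybrid_cost(points, centers, r):
    """Hybrid k-Clustering objective: sum, over all clients, of the
    r-distance to the nearest center."""
    return sum(min(dist_r(p, x, r) for x in centers) for p in points)

points = [(0.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
centers = [(0.0, 0.0), (10.0, 0.0)]
print(hybrid_cost(points, centers, r=0.0))  # 3.0: with r = 0 this is the k-Median cost
print(hybrid_cost(points, centers, r=1.0))  # 2.0: each client pays max(0, dist - 1)
```

Note how the same center set pays less as r grows, since every client within distance r of a center contributes 0.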
Indeed, as observed in [23], Hybrid k-Clustering with r = 0 is equivalent to k-Median, while, when r is set to the optimal radius for k-Center, Hybrid k-Clustering reduces to k-Center. These observations immediately rule out uni-criteria polynomial-time approximation schemes that violate only the cost, or only the radius, by any arbitrary factor (as formalized in Proposition 1 of [23]). Indeed, such approximation algorithms – even with running times FPT in k – would imply exact FPT algorithms for k-Center and k-Median in low-dimensional continuous Euclidean spaces. To our knowledge, such an algorithm for k-Median is not known (nor, admittedly, is a lower bound), whereas [39] shows W[1]-hardness of k-Center even in the plane. Therefore, given the current state of the art, a bicriteria approximation scheme for Hybrid k-Clustering is essentially the best possible outcome, even if one allows a running time that is FPT in k.
For α, β ≥ 1, an (α, β)-bicriteria approximation for Hybrid k-Clustering is an algorithm that returns a solution X of size at most k satisfying cost_{β·r}(P, X) ≤ α · OPT_r. Note that the bicriteria solution is allowed to use dist_{β·r} instead of dist_r, and is also allowed to find a solution of cost α · OPT_r, where OPT_r is the cost of an optimal solution w.r.t. radius r, i.e., without any violation of the radius. The main result of Fomin et al. [23] was a (1+ε, 1+ε)-bicriteria approximation for inputs in ℝ^d whose running time is FPT in k, d, and 1/ε. An exponential dependence on the dimension d appears inevitable using their approach, although it was not clear whether such a dependence is required. Indeed, for Euclidean k-Median and k-Center, one can obtain dimension-free approximation schemes with FPT dependence only on k and ε [7, 33, 1]. This naturally motivates the following question, which was explicitly asked as an open question in [23].
Question 1. “An immediate question is whether improving or removing the FPT dependence on the dimension is possible[…]”
In this work, we answer this question in the affirmative by designing a (randomized) bicriteria FPT Approximation Scheme (FPT-AS) parameterized by k and ε (such algorithms run in time f(k, ε) · n^{O(1)} and output a (1+ε, 1+ε)-bicriteria solution; another term for FPT-AS is EPAS, which stands for Efficient Parameterized Approximation Scheme) for the (continuous) Euclidean instances of Hybrid k-Clustering, stated formally in the following theorem.
Theorem 1 (Bicriteria FPT-AS for Euclidean Spaces).
There exists a randomized algorithm that, given an instance of Hybrid k-Clustering in ℝ^d for any dimension d, runs in time FPT in k and ε, and returns a (1+ε, 1+ε)-bicriteria approximation with high probability.
The algorithm of [23] involves a pre-processing step that separately handles "k-Median-like" and "k-Center-like" instances, using techniques specific to these respective problems. Then, the main algorithm handles the remaining instances that cannot be directly reduced to either problem. This approach somewhat undermines the goal of defining a unified problem that captures both of these problems as special cases. In contrast, our algorithm adopts a uniform approach by exploiting the intrinsic structure of the hybrid problem, without the need to handle some instances separately. In fact, our algorithm works for a broader class of metric spaces, called metric spaces of bounded algorithmic scatter dimension – a notion recently introduced by Abbasi et al. [1], and further studied by Bourneuf and Pilipczuk [10]. These metric spaces capture several interesting and well-studied metric spaces, see below. We give a formal definition of the notion of algorithmic scatter dimension in Section 2; however, a good mental picture to keep is to think of them as essentially doubling spaces, i.e., metric spaces with good "packing-covering" properties. (Although this is a good mental picture, it is ultimately inaccurate, since the class of metric spaces of bounded algorithmic scatter dimension is strictly larger than that of doubling spaces. Indeed, continuous Euclidean space of high dimension does not have bounded doubling dimension, yet it does have bounded algorithmic scatter dimension.) Our general result is stated below.
Theorem 2 (Informal version of Theorem 9).
Hybrid k-Clustering admits a randomized bicriteria FPT-AS, parameterized by k and ε, in metrics of bounded algorithmic ε-scatter dimension. In particular, Hybrid k-Clustering admits a randomized bicriteria FPT-AS parameterized by k and ε in continuous Euclidean spaces of any dimension, metrics of bounded doubling dimension, bounded-treewidth metrics, and metrics induced by graphs from any fixed proper minor-closed graph class.
We give a technical overview of this result in Section 1.1, and describe the approximation algorithm and its analysis in Section 3.
In their work, [23] also defined a variant of Hybrid k-Clustering that features the z-th power of the r-distances, i.e., the objective is to minimize Σ_{p ∈ P} dist_r(p, X)^z, where z ≥ 1. They designed a bicriteria FPT-AS with similar running time for the case z = 2, i.e., a hybrid of k-Means and k-Center; but they left the possibility of obtaining a similar result for general z conditional on the existence of a certain sampling-based approximation algorithm for the corresponding z-clustering problem for Euclidean inputs. Using our approach, we can obtain a bicriteria FPT-AS for any fixed z ≥ 1 in a unified way, whose running time is independent of the dimension d. In fact, our approach works for the much more general problem of Hybrid Norm k-Clustering. We discuss these extensions in Section 4.
Next, we turn to another open direction mentioned in the work of Fomin et al. [23].
Question 2. “Another intriguing question is the design of coresets for Hybrid -Clustering, which could also have some implications for [Question 1].”
In this paper, we also answer this question affirmatively by designing coresets for Hybrid k-Clustering in metric spaces of bounded doubling dimension. Specifically, we prove the following result.
Theorem 3 (Coreset for Hybrid -Clustering).
There exists an algorithm that takes as input an instance of Hybrid k-Clustering in a doubling metric of dimension d and a parameter ε ∈ (0, 1), and in FPT time returns a pair (P′, w), where P′ ⊆ P is of small size, and w : P′ → ℝ_{≥0}, such that the following property is satisfied for any X ⊆ F of size at most k:
cost_r(P′, w, X) ∈ (1 ± ε) · cost_r(P, X).
Here, cost_r(P′, w, X) := Σ_{p ∈ P′} w(p) · dist_r(p, X).
1.1 Technical Overview
In this section, we highlight the technical and conceptual contributions of our main result (Theorem 2).
A natural direction for designing a dimension-free parameterized approximation algorithm for Hybrid k-Clustering is to leverage insights from the framework of Abbasi et al. [1], who designed dimension-free parameterized approximations for a wide range of clustering problems. However, in Hybrid k-Clustering, the objective is a function of r-distances, instead of true distances as in [1]. This introduces several technical challenges. For instance, the most obvious roadblock is that the r-distances do not satisfy the triangle inequality. Moreover, many of the "nice" properties enjoyed by the true distances in "structured" metric spaces (e.g., the covering-packing property in doubling spaces, or properties of shortest paths in sparse graphs) do not extend readily to the r-distances. Therefore, we have to overcome several technical and conceptual challenges in order to leverage the ideas presented in the framework of [1]. Let us first recall the key notion from that paper that is crucially used in this framework. (In fact, the algorithms of our paper and of [1] work for a weaker notion called algorithmic scatter dimension; for ease of exposition, we work with scatter dimension in this section.)
Scatter Dimension.
Informally, an ε-scattering sequence in a metric space M = (P ∪ F, dist) is a sequence of center-point pairs (x_1, p_1), …, (x_ℓ, p_ℓ), where ℓ is some positive integer and x_i ∈ F, p_i ∈ P for i ∈ [ℓ], such that dist(x_{i+1}, p_j) ≤ 1 for all j ≤ i < ℓ, and dist(x_i, p_i) > 1 + ε for all i ∈ [ℓ]. The ε-scatter dimension of M is the maximum length of an ε-scattering sequence contained in M.
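Under this (unit-radius-normalized) definition, whether a given sequence is an ε-scattering can be verified mechanically. The following Python sketch is our illustration of the two conditions, not code from the paper:

```python
def is_eps_scattering(seq, dist, eps):
    """Check whether a sequence of (center, point) pairs is an
    eps-scattering (radii normalized to 1):
      covering   -- each center is within distance 1 of all earlier points,
      refutation -- each center is farther than 1 + eps from its own point."""
    for i, (x_i, _) in enumerate(seq):
        # covering: x_i must be within distance 1 of p_j for every j < i
        if any(dist(x_i, seq[j][1]) > 1.0 for j in range(i)):
            return False
    # refutation: dist(x_i, p_i) > 1 + eps for every pair
    return all(dist(x, p) > 1.0 + eps for x, p in seq)

# A 2-step scattering on the real line with eps = 0.5:
d = lambda a, b: abs(a - b)
print(is_eps_scattering([(2.0, 0.0), (0.5, 2.5)], d, 0.5))  # True
```

Bounded scatter dimension then simply means that no such sequence in the space can be very long.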
Now, consider a natural extension of the framework of [1] for Hybrid k-Clustering in a metric space M as follows. Let OPT be the optimal cost of the Hybrid k-Clustering instance corresponding to an optimal solution X* = {x*_1, …, x*_k}. The algorithm maintains a cluster constraint Q_i for each cluster i ∈ [k]. Each Q_i consists of a collection of requests of the form (p, δ), where p is a point and δ is a distance, representing the demand that p requires a center within distance δ. The algorithm always maintains a solution X = {x_1, …, x_k} such that x_i satisfies the cluster constraint Q_i, for all i ∈ [k]. Now consider the sequence of triples corresponding to requests in Q_i, where each triple records the center maintained by the algorithm just before the corresponding request was added to Q_i. If X is a near-optimal solution, then the algorithm terminates successfully and returns X. Otherwise, it identifies a point q whose r-distance to X is much larger than its r-distance to X*. Such a point is called a witness to X. A simple averaging argument shows that such a witness can be sampled with high probability. The algorithm then guesses the optimal cluster i of q and adds a new request (q, δ) to Q_i, for a suitable but fixed δ depending on the distance of q to X. The center x_i is recomputed to satisfy the updated cluster constraint Q_i, if possible; otherwise the algorithm reports failure (e.g., if the algorithm failed to sample a witness). The key observation is that the sequence of pairs in Q_i for a fixed radius forms an ε-scattering sequence in M. Thus, if the ε-scatter dimension of M is bounded, then the lengths of these sequences are also bounded. Furthermore, it can be shown that if the aspect ratio of the radii of requests in Q_i for all i ∈ [k] is bounded, then the number of iterations of the algorithm can be bounded, yielding an FPT-AS.
Working with the inflated radius.
A major challenge arises because the witness is defined in terms of r-distances, but the requests added to the cluster constraints are based on the true distances. Specifically, we need to ensure that a witness q whose r-distance to X is significantly larger than its r-distance to X* also has a larger true distance to X than to X*. This ensures that the request is feasible and the algorithm does not fail. However, this condition may not hold, especially when the true distances are close to r. In fact, the issue is related to a fundamental barrier identified by [23], assuming standard conjectures. (Such a framework could potentially yield a near-optimal solution that does not violate the radius, contradicting Proposition 1 in [23].) To overcome this fundamental barrier and maintain a sufficient gap between the true distances, we consider the cost of X with respect to the inflated radius (1+ε)·r. In other words, we look for a solution whose (1+ε)·r-cost is close to OPT. In this case, we redefine a witness as a point in P whose (1+ε)·r-distance to X is much larger than its r-distance to X*. These insights allow us to establish a gap between the true distance of a witness to X and its true distance to X*, thereby justifying the requests added by the algorithm.
Bounding the aspect ratio of the radii.
Abbasi et al. [1] bound the aspect ratio of the radii by initializing the solution using "good" upper bounds, and by sampling witnesses from the points that have comparable distance to X with respect to the corresponding upper bounds. The first step guarantees that the solution maintained by the algorithm always satisfies the upper bounds within a constant factor. This, together with the second step, allows one to bound the aspect ratio of the radii in each Q_i. They show that such witnesses have a good probability mass if the points are sampled proportionally to their true distances, and hence the algorithm successfully finds a witness for X with probability that is a function of k and ε. Although devising feasible and "good" upper bounds for Hybrid k-Clustering can be done with some effort, the main challenge arises in the second step, which only guarantees sampling a point whose r-distance to X is comparable with its upper bound. As mentioned before, these r-distances can be much smaller than the corresponding true distances, and hence there is no guarantee that the true distance of a witness to the current solution is comparable to its upper bound. Towards this, we introduce a novel idea of dividing witnesses into two sets – a "nearby witness" set, i.e., the set of points within distance comparable to r from X, and a "faraway witness" set, containing the points beyond this distance from X. The observation is that, since the total probability mass on the witness set, now defined using r-distances, is still high, either the nearby or the faraway witnesses must have sufficient probability mass. However, since we do not know which of the two sets has enough mass, we perform a randomized branching (i.e., make a guess) – which will be "correct" with probability 1/2. Then, when a point is sampled proportionally to its r-distance from either the "nearby" or the "faraway" set, it will be a witness with good probability. Now we consider each of the two cases separately.
Since a nearby witness has true distance to X bounded between fixed multiples of r, the aspect ratio of the radii of requests corresponding to the nearby witness set is bounded. On the other hand, for requests corresponding to the faraway witness set, we show that their radii lie in a bounded interval, using ideas similar to [1]. Note that these two arguments imply that the radii of the requests lie in two (possibly disjoint) intervals that are themselves bounded. However, it is not immediately clear that the number of these requests is bounded, unlike in [1], where the radii belonged to a single interval of bounded aspect ratio. Nevertheless, we observe that, using the techniques of [1], the number of requests can be bounded even when the radii lie in constantly many (here, two) intervals, each with a bounded aspect ratio.
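The nearby/faraway randomized branching described above can be sketched as follows. This is our illustration only: the function name, the use of (1 + eps) · r as the threshold separating the two sides, and the return convention are all our simplifications, not the paper's definitions.

```python
import random

def sample_candidate_witness(P, X, dist, r, eps):
    """Flip a fair coin to guess whether the nearby or the faraway side
    carries enough witness mass, then sample a point from that side
    proportionally to its r-distance contribution to the current cost."""
    true_d = lambda p: min(dist(p, x) for x in X)   # true distance to X
    contrib = lambda p: max(0.0, true_d(p) - r)     # r-distance to X
    if random.random() < 0.5:                       # guess: nearby side
        side = [p for p in P if true_d(p) <= (1 + eps) * r]
    else:                                           # guess: faraway side
        side = [p for p in P if true_d(p) > (1 + eps) * r]
    weights = [contrib(p) for p in side]
    if not side or sum(weights) == 0:
        return None                                 # wrong branch: no mass
    return random.choices(side, weights=weights)[0]
```

If the guessed side is the one with sufficient mass, the sampled point is a witness with good probability, which is exactly the event the analysis conditions on.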
2 Background on Algorithmic Scatter Dimension
In this paper, we consider a metric (clustering) space M = (P ∪ F, dist), where P is a finite set of points, F is a (possibly infinite) set of potential cluster centers, and dist is a metric on P ∪ F. A class of metric spaces is a (possibly infinite) set of metric spaces.
In this paper, we work with a notion that is weaker (and hence more general) than the ε-scatter dimension defined in the overview, called algorithmic ε-scatter dimension, which we explain next. To this end, we first need the following search problem.
Definition 4 (Ball Intersection Problem).
Let C be a class of metric spaces with a possibly infinite set of centers. Given a metric space M = (P ∪ F, dist) ∈ C, a finite set of distance constraints {(p, δ_p)}, and an error parameter ε > 0, the Ball Intersection problem asks to find a center x ∈ F that satisfies all distance constraints within a (1 + ε) multiplicative error, i.e., dist(x, p) ≤ (1 + ε) · δ_p for every constraint (p, δ_p), provided that some center satisfies all the constraints exactly, and to report failure otherwise.
We say that C admits a Ball Intersection algorithm B if B correctly solves the Ball Intersection problem for every metric space in C and runs in time polynomial in the size of the input and in 1/ε.
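For a finite center set, Ball Intersection admits the obvious brute-force algorithm (this is also how the oracle is simulated in discrete metric spaces later in the paper). The sketch below is ours; it returns None in place of reporting failure:

```python
def ball_intersection(F, dist, constraints, eps):
    """Brute-force Ball Intersection over a finite center set F: return a
    center x with dist(x, p) <= (1 + eps) * delta for every constraint
    (p, delta), or None (i.e., report failure) if no center qualifies."""
    for x in F:
        if all(dist(x, p) <= (1.0 + eps) * delta for p, delta in constraints):
            return x
    return None
```

For continuous Euclidean spaces such a scan is unavailable, which is why an approximate Ball Intersection algorithm, as in [1], is needed there.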
Now, we are ready to define algorithmic scatter dimension.
Definition 5 (Algorithmic ε-Scatter Dimension).
Given a class C of metric spaces with a Ball Intersection algorithm B, a space M = (P ∪ F, dist) ∈ C, and ε > 0, a (B, ε)-scattering sequence is a sequence of triples (x_1, p_1, δ_1), …, (x_ℓ, p_ℓ, δ_ℓ), where ℓ is some positive integer and x_i ∈ F, p_i ∈ P for i ∈ [ℓ], such that
- (Covering by B) x_{i+1} is a center computed by B on the constraints {(p_j, δ_j) : j ≤ i}, for i < ℓ;
- (ε-refutation) dist(x_i, p_i) > (1 + ε) · δ_i, for i ∈ [ℓ].
The algorithmic (B, ε)-scatter dimension of M is λ if any (B, ε)-scattering sequence contains at most λ triples per radius value. The algorithmic ε-scatter dimension of M is the minimum algorithmic (B, ε)-scatter dimension over all Ball Intersection algorithms B for C.
Although the algorithmic scatter dimension only restricts the number of triples in the sequence that share the same radius value, we can use the proof technique from [1] to prove the following stronger guarantee. Results marked with a star can be found in the full version of the paper.
Lemma 6 ().
Let C be a class of metric spaces of bounded algorithmic ε-scatter dimension. Then there exists a Ball Intersection algorithm B with the following property: for any metric space in C, any constant c ≥ 1, and any ε > 0, any (B, ε)-scattering sequence contains a bounded number of triples whose radii lie in a common interval of aspect ratio c.
3 Bicriteria FPT Approximation Scheme
3.1 Algorithm
Our bicriteria FPT-AS for Hybrid k-Clustering is formally stated in Algorithm 1. As input, we are given an instance of Hybrid k-Clustering, an accuracy parameter ε, access to an algorithm B for the so-called "Ball Intersection" problem (discussed later), and a guess η for the optimal cost. By a standard exponential search, we may assume that such a guess is available. At a high level, the algorithm can be divided into two phases: an initialization phase and an iterative cost-improvement phase. The initialization phase spans lines 1 to 7.
At a high level, the goal of this phase is to compute, for each point p ∈ P, an upper bound such that p must have an optimal center within this distance. Once we find such upper bounds, we use a subset of them to initialize, for each i ∈ [k], a set of requests Q_i, where each request (p, δ) demands that the i-th center in the solution must be within distance at most δ from p in every subsequent solution found by the algorithm.
Now, the algorithm moves to the iterative cost-improvement phase in lines 8 to 20, consisting of a while loop that runs as long as the current solution has not become the bicriteria approximation we are looking for. Thus, in each iteration of the while loop, we are given a solution X that satisfies all the requests, and yet has cost exceeding the target. Then, our algorithm makes a random choice of whether there is enough cost-contribution from nearby witnesses, or from faraway witnesses – here, a witness is a point whose distance to X is sufficiently larger than that in the (unknown) optimal solution X*. In each of the cases, we sample points from carefully defined sets (cf. lines 11 and 14), proportionally to their contribution to the cost of the respective sets (in the analysis, we will argue that, with good probability, we in fact sample witness points). Having sampled such a point q, we guess the index i of the optimal cluster of q in line 17. Finally, assuming q is indeed a witness, we add a request to the i-th request set Q_i, and in line 19 we recompute the i-th center using B. This algorithm either returns a center that satisfies all the requests (with an error of up to a (1 + ε) factor), or correctly reports that there is no point satisfying all the requests. Note that in discrete metric spaces, B can be simulated by simply scanning each x ∈ F and checking whether it satisfies all the requests in Q_i. On the other hand, for continuous Euclidean spaces, such an algorithm was designed in [1]. If the algorithm does not fail at this step, then we update our solution and continue to the next iteration. In the rest of the section, we will prove that the algorithm returns a (1+ε, 1+ε)-bicriteria approximation with good probability.
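The two-phase structure described above can be caricatured in Python. This is an illustrative toy, not Algorithm 1 itself: the initialization, the witness-sampling step, the choice of request radius, and all names (hybrid_fpt_as, opt_guess, …) are our simplified placeholders, and the Ball Intersection step is the brute-force scan that only works over a finite center set.

```python
import random

def hybrid_fpt_as(P, F, dist, k, r, eps, opt_guess, max_iters=100):
    """Toy sketch of the iterative cost-improvement loop: while the
    current solution is not (1 + eps)-close to the guessed optimum,
    sample a suspected witness proportionally to its r-distance
    contribution, guess its optimal cluster, record a request, and
    recompute that center by brute-force ball intersection over F."""
    dist_r = lambda p, x: max(0.0, dist(p, x) - r)
    requests = [[] for _ in range(k)]          # per-cluster constraints
    X = [random.choice(F) for _ in range(k)]   # placeholder initialization
    for _ in range(max_iters):
        cost = sum(min(dist_r(p, x) for x in X) for p in P)
        if cost <= (1 + eps) * opt_guess:
            return X                           # near-optimal solution found
        weights = [min(dist_r(p, x) for x in X) for p in P]
        q = random.choices(P, weights=weights)[0]   # witness w.h.p.
        i = random.randrange(k)                     # guess q's optimal cluster
        delta = min(dist(q, x) for x in X) / (1 + eps)
        requests[i].append((q, delta))
        # ball intersection: any center meeting all requests of cluster i
        cand = [x for x in F
                if all(dist(x, p) <= d for p, d in requests[i])]
        if not cand:
            return None                        # report failure
        X[i] = cand[0]
    return None
```

Even on this toy, the shrinking request radii are what force progress: a center violating its own request by more than a (1 + ε) factor is exactly the refutation step of a scattering sequence.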
3.2 Analysis
Throughout the analysis, we assume that we are given a sufficiently accurate guess η for the optimal cost. We divide the analysis into two parts. In the first part, we bound the running time of the algorithm using the following lemma, whose proof appears in Section 3.2.1.
Lemma 7.
Algorithm 1 terminates after a bounded number of iterations – with or without failure.
In the second part, we show the following lemma, which says that the probability that the algorithm terminates without failure is high. The proof of this lemma is presented in Section 3.2.2.
Lemma 8.
With probability bounded below by a function of k and ε, Algorithm 1 terminates without failure, i.e., returns a solution X satisfying the desired bicriteria guarantee.
Using these two lemmas and repeating the algorithm sufficiently many times, we obtain our main result.
Theorem 9 (Main Theorem).
Let C be a class of metric spaces closed under scaling distances by a positive constant. There is a randomized algorithm that takes as input an instance of Hybrid k-Clustering from C and a parameter ε > 0, and outputs a (1+ε, 1+ε)-bicriteria solution whenever, for all ε > 0, the algorithmic ε-scatter dimension of C is bounded by λ(ε), for some function λ. The running time of the algorithm is FPT in k and ε.
3.2.1 Bounding runtime using Algorithmic Scatter Dimension
First, we show some properties of the initial upper bounds (line 4) that we need later in the proof of Lemma 7.
Proof.
Suppose . Letting , we have that . Since, , we have for . Using , this means for . Therefore,
contradicting the bound on the cost of the optimal solution.
Bounding the aspect ratio of the radii
Towards proving Lemma 7, we bound the aspect ratio of the radii in the requests.
Lemma 12.
Consider a request set Q_i, for i ∈ [k]. Let (p, δ) ∈ Q_i, and let x be the center maintained by the algorithm just before adding the request (p, δ) to Q_i. Further, let x*_i be the optimal center corresponding to cluster i. Then, the resulting sequence of triples is an algorithmic ε-scattering. Furthermore, the radii of requests in Q_i lie in an interval of bounded aspect ratio around δ_0, where δ_0 is the smallest radius in Q_i that is larger than a fixed threshold.
Proof.
The proof of the first part is similar to [1].
First note that . Finally, since is computed by on and error parameter , we have that , for . Hence, is -algorithmic scattering.
Now we bound the radii of the sequence. Towards this, we partition the requests into two groups, based on how the corresponding points were sampled, as follows. We say a request belongs to the first group if its point was sampled when the If condition was satisfied (in line 12); otherwise it belongs to the second group, which corresponds to the Else condition (in line 15). We bound the aspect ratio of each group separately. For the first group, the radii of the corresponding triples lie in a bounded interval, by the definition of nearby witnesses. Now consider the second group. Recall that X is the solution maintained by the algorithm when the request is added, and note the lower bound on the sampled distance (see line 14). The following claim, whose proof is similar to [1], shows that the radii of requests in the second group are also bounded, finishing the proof of the lemma.
Proof.
Suppose, for the sake of contradiction, the algorithm does not fail and finds a center such that and . Thus, .
Therefore, we have
using Observation 11. Let be the center maintained by the algorithm when the request is added to . Then, we have
(1)
due to Line 18. Similarly, let be the center maintained by the algorithm when request is added to . Then, since , we have that , using . Therefore, using ,
This yields a contradiction to (1). Now we are ready to finish the proof of Lemma 7.
Proof of Lemma 7.
3.2.2 Bounding success probability
The goal of this subsection is to prove Lemma 8, i.e., give a lower bound on the success probability of the algorithm. To this end, we first introduce the following notion.
Definition 14 (Consistency).
Consider a fixed hypothetical optimal solution X* = {x*_1, …, x*_k}. We say that the current state of execution (specified by the request sets Q_1, …, Q_k) of Algorithm 1 is consistent with X* if for any request (p, δ) ∈ Q_i, we have dist(x*_i, p) ≤ δ.
Note that Lemma 11 implies that the initial set of requests (line 4) is feasible. Since the balls are disjoint, we can relabel the optimal centers so that each initial request is served by the correspondingly indexed optimal center. Thus, Lemma 11 implies that the initial state of the algorithm as computed in lines 1 to 6 is consistent with a fixed optimal solution X*, which is fixed henceforth. To prove Lemma 8, we will inductively argue that the state of the algorithm remains consistent with X* – note that the base case has already been shown. Most of this subsection is devoted to showing the inductive step, i.e., that given a consistent state at the start of an iteration, there is a good probability that the state of the algorithm at the end of the iteration is also consistent (cf. Lemma 22).
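The consistency condition – every recorded request being satisfied by the optimal center of its cluster – is direct to state in code. This sketch uses our own naming and our reading of the elided inequality:

```python
def is_consistent(requests, X_star, dist):
    """Consistency with a fixed optimal solution X_star: every request
    (p, delta) in the i-th request set must be satisfied by the i-th
    optimal center, i.e., dist(X_star[i], p) <= delta."""
    return all(dist(X_star[i], p) <= delta
               for i, Q_i in enumerate(requests)
               for (p, delta) in Q_i)
```

Consistency is exactly what guarantees that the Ball Intersection call never fails as long as the algorithm's random choices are "correct".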
To this end, we introduce the following definitions w.r.t. the current solution .
Definition 15 (Contribution and Witness).
For any set S ⊆ P, we use cost(S) to denote the contribution of the set S to the cost of the current solution.
We say that a point p ∈ P is an ε-witness (or simply, a witness) w.r.t. the solution X if its r-distance to X is sufficiently larger than its r-distance to X*.
The following claim can be proved by a simple averaging argument.
Claim 16 ().
.
Next, we introduce several different classifications of witnesses.
Definition 17 (Different subsets of witnesses).
For each i ∈ [k], let W_i denote the set of witnesses for which x_i is a nearest center in X (breaking ties arbitrarily). The witnesses are then partitioned into two sets according to their true distance to X. We will refer to a witness in the first set as a nearby witness and a witness in the second set as a faraway witness.
Now, we consider two different cases regarding the behavior of the algorithm.
Case 1: .
In this case, when we sample a point proportionally to its contribution, the probability of sampling a nearby witness is bounded below. This will correspond to the "good event". We prove this formally in the following lemma.
Lemma 18 (Nearby witness lemma).
Suppose that the current solution X at the start of an iteration is not yet a near-optimal solution, and further suppose that the nearby-witness case assumption holds. Then, with probability bounded below by a function of k and ε, the point q, the index i, and the value δ defined in the iteration satisfy the following properties.
1. x*_i is the closest center to q in X*, i.e., dist(q, X*) = dist(q, x*_i),
2. q is a nearby witness, i.e., (i) q is a witness, and (ii) q belongs to the nearby set, and
3. dist(x*_i, q) ≤ δ.
Proof.
First, with probability 1/2, the coin toss in line 10 correctly moves us to the "nearby witness" (if) case. We condition on this event. Next, note that, by the case assumption, the nearby witnesses carry a sufficient fraction of the sampling mass. Therefore, the point sampled in line 12 will be a nearby witness with good probability. Let i denote the index such that x*_i is the closest center to the sampled point in the optimal solution X*, and suppose we correctly guess this index in line 17. Note that the probability of the algorithm making all of these "correct" random choices is bounded below by a function of k and ε.
We condition on the said events. Since is a witness, we know that . We first observe that must be positive, which implies that . Now, we consider two cases based on the value of .
If , then , in which case,
(2)
Otherwise, , in which case . Then, consider
(3)
Thus, in either case, we have established the third item, hence completing the proof of the lemma.
Case 2: .
In this case, most of the contribution of witnesses comes from "faraway witnesses", and the "correct choice" corresponds to sampling points from the set defined in line 14. We will show that, with good probability, the sampled point is a faraway witness. Specifically, we show the following lemma.
Lemma 19 (Faraway witness lemma).
Suppose that the current solution X at the start of an iteration is not yet a near-optimal solution, and further suppose that the faraway-witness case assumption holds. Then, with probability bounded below by a function of k and ε, the point q, the index i, and the value δ defined in the iteration satisfy the following properties.
1. x*_i is the closest center to q in X*, i.e., dist(q, X*) = dist(q, x*_i),
2. q is a faraway witness, i.e., (i) q is a witness, and (ii) q belongs to the faraway set,
3. the radius δ lies in the prescribed bounded interval, and
4. dist(x*_i, q) ≤ δ.
Proof.
First, with probability 1/2, the coin toss in line 10 correctly moves us to the "faraway witness" (else) case of line 13. We condition on this event. Now, by combining Claim 16 and the case assumption, we obtain,
(4)
Next, let . It follows that,
(5)
Fix an index, and let us order the points in the corresponding set in non-decreasing order of their distances to X. First, we state the following simple consequences of various definitions for future reference.
Observation 20 ().
For each , the following bounds hold.
1. ,
2. ,
3. , and
4. .
Let us return to the points in this set. Let t denote the minimum index such that the contribution of the prefix up to index t is at least the required threshold. Note that, by a simple argument, this implies that the contribution of the corresponding suffix is also at least this threshold (see, e.g., Lemma 7.20 in [21]). Hence, we have the following lower bounds on the contribution of both sets.
(6)
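The argument above picks the smallest prefix, in the distance ordering, whose contribution reaches half of the total; the observation invoked from Lemma 7.20 in [21] is that the complementary suffix, which includes the split position itself, then also carries at least half. A minimal self-contained sketch of this split (names illustrative):

```python
def half_split(contribs):
    """Given point contributions listed in the chosen order, return the
    smallest index m such that the prefix contribs[:m+1] carries at
    least half of the total contribution."""
    total = sum(contribs)
    prefix = 0.0
    for m, c in enumerate(contribs):
        prefix += c
        if prefix >= total / 2:
            return m
    return len(contribs) - 1
```

Since the prefix strictly before the returned index carries less than half, the suffix starting at that index carries more than half as well, so both sides can be charged against half of the total contribution.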
Next, we prove the following technical claim.
Claim 21.
Let . For all , it holds that . Therefore, .
Proof.
For this proof, we use the following shorthand: , and . Now, fix an arbitrary point . For any , , which implies that
(7)
Here, the third inequality uses property 4 of Observation 20. On the other hand, note that,
(8)
Then, if we set , we obtain from (7) that . Combining this with (8), we obtain that,
(9)
Hence, we have that . Using from Observation 20 and the fact , we have . Therefore, , as desired.
Recall from (5) that ; therefore, when we sample a point from , the probability that the sampled point belongs to is at least . We condition on this event. Let be the nearest center to ; in line 17, the probability that we sample the correct index is . Thus, the total probability of the algorithm making the “correct choices” in this case is at least , and we condition on these choices. Note that item 1 is satisfied due to the correct sampling of , item 2 is satisfied due to , and item 3 is satisfied since by construction. Thus, we are left only with showing item 4, to which end we consider different cases for the distance between and its closest optimal center, .
If , then since , it easily follows that,
(10)
Otherwise, if , then, similarly to item 4 of Observation 20, we can show that,
(11)
However, since is an -witness, it follows that, . Therefore, we obtain that,
(12)
Thus, in each of the sub-cases (10) and (12), we have shown that , completing the proof of the lemma.
Lemma 18 and Lemma 19 can be combined in a straightforward way to obtain the following lemma, which completes the inductive step.
Lemma 22.
Consider an iteration of the algorithm such that the state of execution at the start of the iteration is consistent with a fixed optimal solution , and further . Then, with probability at least , the state of the algorithm at the end of the iteration is also consistent with .
Proof.
Consider the scenario described in the premise of the lemma. Then, the solution at the start of the iteration satisfies either that (nearby witness case), or that (faraway witness case). Then, Lemmas 18 and 19, along with the correctness of , together imply that, conditioned on the respective case assumption, the probability that the state of the algorithm is consistent with is at least . (Note that this probability also accounts for the result of the coin toss in line 10.) Armed with Lemma 22, it is easy to conclude the proof of Lemma 8.
Proof of Lemma 8.
As argued initially, the state of the algorithm at line 6 is consistent with the optimal solution . Then, Lemma 22 implies that, for any , if the algorithm runs for iterations, then with probability at least , the state of the algorithm remains consistent. Further, note that as long as the state of the algorithm remains consistent with , the algorithm cannot fail. Finally, Lemma 7 implies that the algorithm terminates within iterations. Therefore, the probability that the algorithm terminates without failure, i.e., breaks the while loop after finding a solution satisfying , is at least .
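The overall success probability of a single run is small but bounded away from zero, so, as is standard, it can be amplified by independent repetitions. A minimal sketch of the standard calculation, assuming each run succeeds independently with probability at least q (names illustrative):

```python
import math

def repetitions_for_confidence(q, delta):
    """Number of independent runs needed so that at least one run
    succeeds with probability >= 1 - delta, given per-run success
    probability q.  Uses (1 - q)**t <= exp(-q * t) <= delta."""
    return math.ceil(math.log(1.0 / delta) / q)
```

For example, with per-run success probability 0.1, reducing the failure probability below 0.01 takes 47 runs, since 0.9**47 is about 0.007.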
4 Extensions of the Bicriteria FPT-AS
Fomin et al. [23] defined a generalization of Hybrid -Clustering, which they call Hybrid -clustering, wherein the objective is , for fixed , which simultaneously generalizes -Clustering and -Center. They generalized their algorithm to , i.e., Hybrid -Means, but left extending it to an arbitrary value of conditional on the existence of a certain kind of sampling algorithm. In this paper, we consider a much more general Hybrid Norm -Clustering problem that captures all -Clustering problems, and much more. Here, the objective is to find a set that minimizes , where is the vector of -distances to the solution , and is a monotone norm (recall that is a norm if it satisfies the following properties for any and any : (i) iff , (ii) , and (iii) ; furthermore, is monotone if whenever , where denotes point-wise inequality). We refer the reader to [1] for a detailed background on norms and related notions.
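To make the norm objective concrete: the ℓ_p norms and the top-ℓ (ordered) norms are standard examples of monotone norms, with the k-Median-style sum corresponding to ℓ_1 and the k-Center-style max to ℓ_∞ (equivalently, top-1). A minimal sketch of evaluating such objectives on a cost vector (function names are illustrative):

```python
def lp_norm(v, p):
    """l_p norm of a cost vector; p = 1 recovers the k-Median-style sum
    objective, and large p approaches the k-Center-style max."""
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

def top_ell_norm(v, ell):
    """Top-ell norm: sum of the ell largest coordinates in absolute
    value.  ell = 1 is the max; ell = len(v) is the plain sum."""
    return sum(sorted((abs(x) for x in v), reverse=True)[:ell])
```

Both are monotone: increasing any coordinate of a nonnegative cost vector cannot decrease the value, which is exactly the property the framework requires.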
Algorithm 1 and its analysis can be extended to the general norm objective following the approach of [1]. Here, we only sketch the modifications required to the algorithm and its proof, since this is not the main focus of this work.
The initial upper bounds can be found in a similar manner. The while loop runs as long as the solution in hand satisfies , where denotes the vector of distances , and . If this is the case, then the first step is to compute a subgradient of at (in fact, it suffices to compute an approximate subgradient, as done in [1]), which is, loosely speaking, a weight function such that it suffices to focus attention in the current iteration on that satisfies . Our algorithm proceeds in a similar fashion, except that whenever points are sampled from the set (in line 12) or the set (in line 15), the probability of sampling a point is proportional to . The rest of the algorithm remains unchanged.
There are a few minor modifications required in the analysis – mainly in Section 3.2.2. First, while the definition of an -witness (or simply a witness) remains unchanged, we redefine the contribution of a subset to be . The nearby and faraway witness cases are then considered exactly as in Section 3.2.2. The proof of the analogue of Lemma 18 goes through without any modifications; whereas we need to be slightly more careful in the proof of the analogue of Lemma 19. Here, in inequalities (8) and (9) we need to take the weighted distances into account and relate to , which requires a slight worsening of constants. With these minor changes in place, we can conclude that the state of the algorithm after iterations is consistent with a fixed optimal solution with probability at least . The analysis for bounding the number of iterations remains unchanged, and hence we obtain an FPT-AS with a similar running time, modulo the constants hidden in the big-Oh notation. We omit the details.
5 Coreset
In this section, we design a coreset for Hybrid -Clustering in doubling metrics of bounded dimension. More specifically, we prove the following statement (restated for convenience; see 3).
Proof.
A formal description of our algorithm can be found in the full version of the paper. Here, we provide an overview of the construction, which is based on the grid-construction approach of [26, 2]. Let be an approximate solution that satisfies the following properties: (i) , and (ii) (note that the set is a bicriteria solution, but in a different sense, since it approximates the size and the cost while using -distances). For each , consider the balls , for . Note that, since , we have that lies in some ball , for and .
The idea is to decompose each into smaller balls, each of radius , and associate each point with the unique smallest ball containing , breaking ties arbitrarily. We say a small ball is non-empty if there is a point in associated with . Next, for each non-empty ball , pick an arbitrary point associated with and add to with weight equal to the total number of points associated with ; we call such a point the representative of the points associated with .
To bound the size of , note that, in a doubling metric of dimension , a unit ball can be covered by balls, each of radius . Hence, we have .
Recall that, for , we define the weighted cost of with respect to as . Now we show that the cost of with respect to approximately preserves the cost of with respect to . Consider a point and let be its representative in . Now, the contribution of towards is . On the other hand, its contribution towards is . Hence,
Now if , then . Hence,
Therefore, . Hence, we have
Now, suppose , then . Let be such that . Hence, we have that . Therefore, using , we have
This implies that . Thus, we have
Finally, we have , as desired.
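The ring-and-grid construction described above can be sketched as follows for the Euclidean plane: each point is assigned to a ring of geometrically growing radius around its nearest approximate center, each ring is discretized with a grid whose cell width scales with the ring radius, and one weighted representative is kept per non-empty cell. This is a simplified illustration with hypothetical names, not the exact construction of the proof:

```python
import math
from collections import Counter

def build_coreset(points, centers, r, eps):
    """One weighted representative per non-empty grid cell; cells in the
    ring at distance about 2**j * r from the center have width
    eps * r * 2**j, mirroring the geometric decomposition above."""
    reps, weights = {}, Counter()
    for p in points:
        c = min(centers, key=lambda q: math.dist(p, q))
        d = math.dist(p, c)
        j = max(0, math.ceil(math.log2(max(d, r) / r)))  # ring index
        cell = eps * r * 2 ** j                           # grid width in ring j
        key = (c, j, round(p[0] / cell), round(p[1] / cell))
        reps.setdefault(key, p)   # first point seen becomes the representative
        weights[key] += 1
    return [(reps[k], weights[k]) for k in reps]

def weighted_hybrid_cost(coreset, centers, r):
    """Weighted hybrid cost of a coreset: weight times r-distance to
    the nearest center."""
    return sum(w * max(0.0, min(math.dist(p, c) for c in centers) - r)
               for p, w in coreset)
```

On points spread along a line, nearby rings keep points almost exactly, while faraway cells merge points whose r-distances differ only by a small fraction of their distance, so the weighted cost tracks the original cost up to a small error, as in the analysis above.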
To finish the proof, we invoke the coreset construction algorithm using the approximate set obtained from the following lemma. At a high level, to obtain this lemma, we start from an -bicriteria approximation for Hybrid -Clustering from [23, 13], and use the fact that a ball of radius can be decomposed into balls of radius , converting the guarantee from to .
Lemma 23 (Approximate solution for coreset construction).
There is a polynomial-time algorithm that, given an instance of Hybrid -Clustering in a doubling metric of dimension , computes such that , where is the optimal cost for .
Proof.
Let be a -bicriteria solution for obtained from [14], as mentioned in [23]. That is, . Consider for , and decompose into smaller balls, each of radius – note that this is possible since the metric space has doubling dimension . For each smaller ball such that , add an arbitrary to . Finally, add to . Hence, .
It is easy to see that, for every , there is some such that . Now, consider some . Note that , which implies that . Therefore, , and hence for all , , which implies that,
This concludes the proof of the lemma.
6 Conclusion
In this paper, we revisit Hybrid -Clustering, which was introduced and studied recently by Fomin et al. [23]. We resolve two open questions explicitly asked in their work, namely, for continuous Euclidean instances in ℝ^d: (1) we obtain an FPT-AS for Hybrid -Clustering that does not have an FPT dependence on the dimension (in fact, the dependence on is instead , as in [23]), and (2) we design coresets for the same. Indeed, our technique also generalizes to the -clustering variant of the problem, which was implicitly considered in [23] but not completely resolved. To obtain our algorithmic result, we build upon insights from the recent framework of Abbasi et al. [1] for clustering in metric spaces of bounded algorithmic scatter dimension, which encapsulates a broad class of metric spaces. Our result thus shows that the potential of the framework introduced in [1] extends to clustering problems beyond those captured by the monotone norm setting, paving the way for obtaining FPT-ASes for other clustering problems using similar techniques. However, several interesting questions remain.
Firstly, the framework of [1] is inherently randomized due to key sampling steps, and its derandomization remains an intriguing open question. In particular, derandomizing the step that samples a witness in our algorithm is an interesting challenge. In another direction, now that the approximation landscape of the vanilla version of Hybrid -Clustering is beginning to be well-understood, it is natural to explore constrained variants of the problem such as those involving fairness or capacity constraints.
References
- [1] Fateme Abbasi, Sandip Banerjee, Jaroslaw Byrka, Parinya Chalermsook, Ameet Gadekar, Kamyar Khodamoradi, Dániel Marx, Roohani Sharma, and Joachim Spoerhase. Parameterized approximation schemes for clustering with general norm objectives. In 64th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2023, Santa Cruz, CA, USA, November 6-9, 2023, pages 1377–1399. IEEE, 2023. doi:10.1109/FOCS57990.2023.00085.
- [2] Fateme Abbasi, Sandip Banerjee, Jaroslaw Byrka, Parinya Chalermsook, Ameet Gadekar, Kamyar Khodamoradi, Dániel Marx, Roohani Sharma, and Joachim Spoerhase. Parameterized approximation for robust clustering in discrete geometric spaces. In Karl Bringmann, Martin Grohe, Gabriele Puppis, and Ola Svensson, editors, 51st International Colloquium on Automata, Languages, and Programming, ICALP 2024, July 8-12, 2024, Tallinn, Estonia, volume 297 of LIPIcs, pages 6:1–6:19. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2024. doi:10.4230/LIPICS.ICALP.2024.6.
- [3] Akanksha Agrawal, Tanmay Inamdar, Saket Saurabh, and Jie Xue. Clustering what matters: Optimal approximation for clustering with outliers. J. Artif. Intell. Res., 78:143–166, 2023. doi:10.1613/JAIR.1.14883.
- [4] Barbara Anthony, Vineet Goyal, Anupam Gupta, and Viswanath Nagarajan. A plant location guide for the unsure: Approximation algorithms for min-max location problems. Mathematics of Operations Research, 35(1):79–101, 2010.
- [5] Sanjeev Arora, Prabhakar Raghavan, and Satish Rao. Approximation schemes for euclidean k-medians and related problems. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC ’98, pages 106–113, New York, NY, USA, 1998. ACM. doi:10.1145/276698.276718.
- [6] Vijay Arya, Naveen Garg, Rohit Khandekar, Adam Meyerson, Kamesh Munagala, and Vinayaka Pandit. Local search heuristics for k-median and facility location problems. SIAM J. Comput., 33(3):544–562, 2004. doi:10.1137/S0097539702416402.
- [7] Mihai Badoiu, Sariel Har-Peled, and Piotr Indyk. Approximate clustering via core-sets. In Proceedings on 34th Annual ACM Symposium on Theory of Computing, (STOC), pages 250–257. ACM, 2002. doi:10.1145/509907.509947.
- [8] Sayan Bandyapadhyay, Fedor V. Fomin, and Kirill Simonov. On coresets for fair clustering in metric and euclidean spaces and their applications. J. Comput. Syst. Sci., 142:103506, 2024. doi:10.1016/J.JCSS.2024.103506.
- [9] Sayan Bhattacharya, Parinya Chalermsook, Kurt Mehlhorn, and Adrian Neumann. New approximability results for the robust -median problem. In Scandinavian Workshop on Algorithm Theory (SWAT’14), pages 50–61. Springer, 2014. doi:10.1007/978-3-319-08404-6_5.
- [10] Romain Bourneuf and Marcin Pilipczuk. Bounding -scatter dimension via metric sparsity. In Proceedings of the 2025 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) (to appear), 2025.
- [11] Vladimir Braverman, Shaofeng H-C Jiang, Robert Krauthgamer, and Xuan Wu. Coresets for ordered weighted clustering. In International Conference on Machine Learning (ICML’19), pages 744–753. PMLR, 2019. URL: http://proceedings.mlr.press/v97/braverman19a.html.
- [12] Jaroslaw Byrka, Krzysztof Sornat, and Joachim Spoerhase. Constant-factor approximation for ordered k-median. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 620–631. ACM, 2018. doi:10.1145/3188745.3188930.
- [13] Deeparnab Chakrabarty, Prachi Goyal, and Ravishankar Krishnaswamy. The non-uniform k-center problem. In Ioannis Chatzigiannakis, Michael Mitzenmacher, Yuval Rabani, and Davide Sangiorgi, editors, 43rd International Colloquium on Automata, Languages, and Programming, ICALP 2016, July 11-15, 2016, Rome, Italy, volume 55 of LIPIcs, pages 67:1–67:15. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2016. doi:10.4230/LIPICS.ICALP.2016.67.
- [14] Deeparnab Chakrabarty and Chaitanya Swamy. Interpolating between -median and -center: Approximation algorithms for ordered -median. In Proceedings of the 45th International Colloquium on Automata, Languages, and Programming (ICALP), volume 107 of LIPIcs, pages 29:1–29:14. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2018. doi:10.4230/LIPICS.ICALP.2018.29.
- [15] Deeparnab Chakrabarty and Chaitanya Swamy. Approximation algorithms for minimum norm and ordered optimization problems. In Moses Charikar and Edith Cohen, editors, Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 126–137. ACM, 2019. doi:10.1145/3313276.3316322.
- [16] Vincent Cohen-Addad, Fabrizio Grandoni, Euiwoong Lee, and Chris Schwiegelshohn. Breaching the 2 LMP approximation barrier for facility location with applications to k-median. In Nikhil Bansal and Viswanath Nagarajan, editors, Proceedings of the 2023 ACM-SIAM Symposium on Discrete Algorithms, SODA 2023, Florence, Italy, January 22-25, 2023, pages 940–986. SIAM, 2023. doi:10.1137/1.9781611977554.CH37.
- [17] Vincent Cohen-Addad, Philip N. Klein, and Claire Mathieu. Local search yields approximation schemes for k-means and k-median in euclidean and minor-free metrics. SIAM J. Comput., 48(2):644–667, 2019. doi:10.1137/17M112717X.
- [18] Vincent Cohen-Addad and Jason Li. On the fixed-parameter tractability of capacitated clustering. In Christel Baier, Ioannis Chatzigiannakis, Paola Flocchini, and Stefano Leonardi, editors, 46th International Colloquium on Automata, Languages, and Programming, ICALP 2019, July 9-12, 2019, Patras, Greece, volume 132 of LIPIcs, pages 41:1–41:14. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2019. doi:10.4230/LIPICS.ICALP.2019.41.
- [19] Vincent Cohen-Addad, Michał Pilipczuk, and Marcin Pilipczuk. A polynomial-time approximation scheme for facility location on planar graphs. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 560–581. IEEE, 2019.
- [20] Vincent Cohen-Addad, David Saulpic, and Chris Schwiegelshohn. A new coreset framework for clustering. In Proceddings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 169–182. ACM, 2021. doi:10.1145/3406325.3451022.
- [21] Marek Cygan, Fedor V. Fomin, Lukasz Kowalik, Daniel Lokshtanov, Dániel Marx, Marcin Pilipczuk, Michal Pilipczuk, and Saket Saurabh. Parameterized Algorithms. Springer, 2015. doi:10.1007/978-3-319-21275-3.
- [22] Tomás Feder and Daniel H. Greene. Optimal algorithms for approximate clustering. In Janos Simon, editor, Proceedings of the 20th Annual ACM Symposium on Theory of Computing, May 2-4, 1988, Chicago, Illinois, USA, pages 434–444. ACM, 1988. doi:10.1145/62212.62255.
- [23] Fedor V. Fomin, Petr A. Golovach, Tanmay Inamdar, Saket Saurabh, and Meirav Zehavi. Hybrid k-clustering: Blending k-median and k-center. In Amit Kumar and Noga Ron-Zewi, editors, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2024, August 28-30, 2024, London School of Economics, London, UK, volume 317 of LIPIcs, pages 4:1–4:19. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2024. doi:10.4230/LIPICS.APPROX/RANDOM.2024.4.
- [24] Zachary Friggstad, Mohsen Rezapour, and Mohammad R. Salavatipour. Local search yields a PTAS for k-means in doubling metrics. SIAM J. Comput., 48(2):452–480, 2019. doi:10.1137/17M1127181.
- [25] Mehrdad Ghadiri, Samira Samadi, and Santosh Vempala. Socially fair -means clustering. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 438–448, 2021. doi:10.1145/3442188.3445906.
- [26] Sariel Har-Peled and Soham Mazumdar. On coresets for k-means and k-median clustering. In László Babai, editor, Proceedings of the 36th Annual ACM Symposium on Theory of Computing, Chicago, IL, USA, June 13-16, 2004, pages 291–300. ACM, 2004. doi:10.1145/1007352.1007400.
- [27] Dorit S. Hochbaum and Wolfgang Maass. Approximation schemes for covering and packing problems in image processing and VLSI. J. ACM, 32(1):130–136, 1985. doi:10.1145/2455.214106.
- [28] Dorit S. Hochbaum and David B. Shmoys. A best possible heuristic for the k-center problem. Mathematics of Operations Research, 10(2):180–184, 1985. URL: http://www.jstor.org/stable/3689371, doi:10.1287/MOOR.10.2.180.
- [29] Kamal Jain, Mohammad Mahdian, and Amin Saberi. A new greedy approach for facility location problems. In John H. Reif, editor, Proceedings on 34th Annual ACM Symposium on Theory of Computing, May 19-21, 2002, Montréal, Québec, Canada, pages 731–740. ACM, 2002. doi:10.1145/509907.510012.
- [30] Kamal Jain and Vijay V. Vazirani. Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and lagrangian relaxation. J. ACM, 48(2):274–296, 2001. doi:10.1145/375827.375845.
- [31] Ravishankar Krishnaswamy, Shi Li, and Sai Sandeep. Constant approximation for k-median and k-means with outliers via iterative rounding. In Ilias Diakonikolas, David Kempe, and Monika Henzinger, editors, Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, Los Angeles, CA, USA, June 25-29, 2018, pages 646–659. ACM, 2018. doi:10.1145/3188745.3188882.
- [32] Alfred A Kuehn and Michael J Hamburger. A heuristic program for locating warehouses. Management science, 9(4):643–666, 1963.
- [33] Amit Kumar, Yogish Sabharwal, and Sandeep Sen. Linear time algorithms for clustering problems in any dimensions. In 32nd International Colloquium on Automata, Languages and Programming (ICALP), volume 3580 of Lecture Notes in Computer Science, pages 1374–1385. Springer, 2005. doi:10.1007/11523468_111.
- [34] Amit Kumar, Yogish Sabharwal, and Sandeep Sen. Linear-time approximation schemes for clustering problems in any dimensions. J. ACM, 57(2):5:1–5:32, 2010. doi:10.1145/1667053.1667054.
- [35] Shi Li. A 1.488 approximation algorithm for the uncapacitated facility location problem. Inf. Comput., 222:45–58, 2013. doi:10.1016/J.IC.2012.01.007.
- [36] S. Lloyd. Least squares quantization in pcm. IEEE Transactions on Information Theory, 28(2):129–137, March 1982. doi:10.1109/TIT.1982.1056489.
- [37] Meena Mahajan, Prajakta Nimbhorkar, and Kasturi R. Varadarajan. The planar k-means problem is np-hard. Theor. Comput. Sci., 442:13–21, 2012. doi:10.1016/J.TCS.2010.05.034.
- [38] Yury Makarychev and Ali Vakilian. Approximation algorithms for socially fair clustering. In Conference on Learning Theory (COLT’21), pages 3246–3264. PMLR, 2021. URL: http://proceedings.mlr.press/v134/makarychev21a.html.
- [39] Dániel Marx. Parameterized complexity of independence and domination on geometric graphs. In Hans L. Bodlaender and Michael A. Langston, editors, Parameterized and Exact Computation, Second International Workshop, IWPEC 2006, Zürich, Switzerland, September 13-15, 2006, Proceedings, volume 4169 of Lecture Notes in Computer Science, pages 154–165. Springer, 2006. doi:10.1007/11847250_14.
- [40] Nimrod Megiddo and Kenneth J Supowit. On the complexity of some common geometric location problems. SIAM journal on computing, 13(1):182–196, 1984. doi:10.1137/0213014.
- [41] John F. Stollsteimer. A working model for plant numbers and locations. Journal of Farm Economics, 45(3):631–645, 1963. URL: http://www.jstor.org/stable/1235442.
- [42] Arie Tamir. The k-centrum multi-facility location problem. Discret. Appl. Math., 109(3):293–307, 2001. doi:10.1016/S0166-218X(00)00253-5.
- [43] Murad Tukan, Xuan Wu, Samson Zhou, Vladimir Braverman, and Dan Feldman. New coresets for projective clustering and applications. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, editors, International Conference on Artificial Intelligence and Statistics, AISTATS 2022, 28-30 March 2022, Virtual Event, volume 151 of Proceedings of Machine Learning Research, pages 5391–5415. PMLR, 2022. URL: https://proceedings.mlr.press/v151/tukan22a.html.