
Clustering Point Sets Revisited

Md. Billal Hossain, Department of Computer Science, University of Texas at Dallas, TX, USA
Benjamin Raichel, Department of Computer Science, University of Texas at Dallas, TX, USA
Abstract

In the sets clustering problem one is given a collection of point sets š’«={P1,…,Pm} in ā„d, where for any set of k centers in ā„d, each Pi is assigned to its nearest center as determined by some local cost function. The goal is then to select a set of k centers to minimize some global cost function of the corresponding local assignment costs. Specifically, we consider either summing or taking the maximum cost over all Pi, where for each Pi the cost of assigning it to a center c is either max_{p∈Pi} ‖cāˆ’p‖, āˆ‘_{p∈Pi} ‖cāˆ’p‖, or āˆ‘_{p∈Pi} ‖cāˆ’p‖².

Different combinations of the global and local cost functions naturally generalize the k-center, k-median, and k-means clustering problems. In this paper, we improve the prior results for the natural generalization of k-center, give the first result for the natural generalization of k-means, and give results for generalizations of k-median and k-center which differ from those previously studied.

Keywords and phrases:
Clustering, k-center, k-median, k-means
Copyright and License:
© Md. Billal Hossain and Benjamin Raichel; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Computational geometry
Funding:
Work on this paper was partially supported by NSF CAREER AF Award CCF-1750780 and NSF AF Award CCF-2311179.
Editors:
Pat Morin and Eunjin Oh

1 Introduction

Clustering is one of the most well studied problems in computational geometry and computer science as a whole, with a variety of applications. For center based clustering, given a point set PāŠ‚ā„d and an integer parameter k>0, the goal is to select a set CāŠ‚ā„d of k centers, so as to minimize some cost function of the distances determined by assigning each point in P to its nearest center in C. Specifically, for any p∈P, let ‖pāˆ’C‖ denote the distance from p to its nearest center in C. Then the k-center, k-median, and k-means objectives are to find the set C minimizing max_{p∈P} ‖pāˆ’C‖, āˆ‘_{p∈P} ‖pāˆ’C‖, and āˆ‘_{p∈P} ‖pāˆ’C‖², respectively.

k-center, k-median, and k-means clustering are all known to be NP-hard. k-center is NP-hard to approximate within any factor less than 2 in general metric spaces [13], and even in the plane is still hard to approximate within a factor of roughly 1.82 [7]. Conversely, both the standard greedy algorithm by Gonzalez [10] and the alternative scooping method by Hochbaum and Shmoys [12] achieve a 2-approximation for k-center. The k-median and k-means problems are both known to be hard to approximate within a 1+γ factor for some constant γ>0, even in O⁢(log⁔n) dimensional Euclidean space. Different values of γ are known depending on whether the points are in low or high dimensional Euclidean space, some other metric norm, or a general metric space. Conversely, there are several constant factor approximation algorithms for both problems. For a more in depth discussion see [5] and references therein.

For points in ā„d, PTAS’s exist for these center based clustering problems when k and d are bounded. For k-center, Agarwal and Procopiuc [1] gave a (1+ε)-approximation in O⁢(n⁢log⁔k)+(k/ε)O⁢(k1āˆ’1/d) time. For k-median and k-means, Har-Peled and Mazumdar [11] used coresets to achieve a (1+ε)-approximation algorithm whose running time is linear in n. Subsequent coreset based papers improve the time dependency on k, d, and ε [4, 8]. Using a sampling based approach, [17] provided a (1+ε)-approximation with probability ≄1/2 in O⁢(2(k/ε)O⁢(1)⁢d⁢n) time, where the probability can be improved by boosting. In general, there are many approximation schemes for k-median and k-means in Euclidean settings, though we note a linear dependence on d and n is possible (e.g.Ā [17]), and for k-median specifically a polynomial dependence on k is possible for bounded d (e.g.Ā [11]).

Sets Clustering

While standard clustering focuses on clustering points, naturally one can consider clustering more general geometric objects. In this paper, we consider the sets clustering problem [20], where one must cluster a collection of sets, š’«={P1,…,Pm}, of total size n=āˆ‘i|Pi|. Here each Pi can, for example, be viewed as a sample from a given object of study, whether a physical object in 3D or a class in a high dimensional feature space. With the interpretation that all points in Pi arise from the same object, we naturally require that they all be covered by the same center.

Given a center c∈C, we let f(c,Pi) denote the cost of assigning Pi to c, and let f(C,Pi) = min_{c∈C} f(c,Pi). Here we consider the cost functions fāˆž(c,Pi) = max_{p∈Pi} ‖cāˆ’p‖, f1(c,Pi) = āˆ‘_{p∈Pi} ‖cāˆ’p‖, and f2(c,Pi) = āˆ‘_{p∈Pi} ‖cāˆ’p‖². Note these cost functions are the 1-center, 1-median, and 1-means costs, and thus are motivated by the applications of those respective problems. However, unlike the 1-center/median/means problems, here we must cluster multiple sets and thus we need an aggregate cost. We consider two possibilities for the cost of clustering š’«, namely costāˆž,β(C,š’«) = max_i fβ(C,Pi) and cost1,β(C,š’«) = āˆ‘_i fβ(C,Pi). Observe that when š’« is a collection of singleton sets, costāˆž,āˆž = costāˆž,1, cost1,āˆž = cost1,1, and cost1,2 respectively capture the k-center, k-median, and k-means objectives on the point set consisting of these singletons. Conversely, Figure 1.1 shows how when the sets in š’« are not singletons, the optimal solutions of the variants of sets clustering that we consider differ from the k-center, k-median, and k-means solutions on ∪iPi.

Figure 1.1: Example showing sets clustering differs from regular clustering, where š’«={{p1},{p2,p3},{p4},{p5}}, P={p1,p2,p3,p4,p5}, and k=3. For k-center/means/median clustering on P, the optimal centers are p5, the midpoint of (p1,p2), and the midpoint of (p3,p4). However, the optimal centers for the k-center/means/median sets clustering variants on š’« that we consider (that is, costāˆž,āˆž/cost1,2/cost1,āˆž) are the midpoint of (p1,p5), the midpoint of (p2,p3), and p4.

Several prior works considered different variants of the sets clustering problem, most notably [20]. Our aim is both to improve prior results, as well as to consider new variants of the sets clustering problem. We break our discussion into whether a particular sets clustering problem is most naturally viewed as a generalization of k-center, k-means, or k-median.

  • k-center: Minimizing costāˆž,āˆž(C,š’«) naturally generalizes the k-center problem. Previously, for points in constant dimensions, [20] provided an O(n+m log k) time 3-approximation as well as an O(nk) time (1+√3)-approximation. These results require minimum enclosing ball (MEB) computations, and thus have running times with hidden constants depending on d, though they remark that for higher dimensions they can use a (1+ε)-approximate MEB at the cost of +ε in the approximation quality.

    In Section 3, for points in any dimension d, we show that a natural adaptation of the standard O(dnk) time greedy 2-approximation algorithm for k-center remains an O(dnk) time 2-approximation for the sets clustering problem, a fact which surprisingly appears to have been overlooked in prior work. Moreover, our 2-approximation algorithm works in any metric space, and thus is optimal for general metric spaces (as it captures k-center, which is hard to approximate within a factor of 2). Additionally, for any constant d, using a standard grid based approach we provide an O(n k^{k+1}/ε^{dk}) time (1+ε)-approximation, that is, a linear time algorithm when k, d, and ε are constants.

  • k-means: Minimizing cost1,2(C,š’«) naturally generalizes the k-means problem, and we study this objective in Section 4. Surprisingly, it appears this cost function for sets clustering has not been studied in prior works, despite the ubiquity of k-means clustering. Consider the point set P obtained by replacing each set Pi with |Pi| copies of its centroid. Using previously observed facts about the geometry of the k-means objective, we prove the strong statement that an α-approximation to the k-means clustering problem on P is also an α-approximation for the sets clustering problem.

  • k-median: Minimizing cost1,1(C,š’«) generalizes the k-median problem. Previously, [20] provided a polynomial time (3+ε)-approximation in ā„d for constant d, by replacing each set Pi with |Pi| copies of its (1+ε)-approximate 1-median, and then applying a (1+ε)-approximation to k-median for constant d (such as [11]). In Section 5, we instead consider the problem of minimizing cost1,āˆž(C,š’«), which is also equivalent to k-median when š’« consists of singletons. (There is also a loose connection between cost1,āˆž(C,š’«) and clustering to minimize the sum of radii [3], though unlike k-median there is no immediate reduction.) Minimizing cost1,āˆž(C,š’«) models the case where how well a center covers a set is determined by the furthest point in the set from the center (i.e. the minimum radius ball at the center enclosing the whole set). To address this problem we apply a similar approach as [20], replacing each set with the center of its minimum enclosing ball. We argue that this provides a (1+α)-approximation, where α is the approximation quality of the k-median subroutine.

  • k-center alternative: In Section 6 we consider the problem of minimizing costāˆž,1(C,š’«). Similar to the costāˆž,āˆž objective considered in Section 3 and prior work, this problem also generalizes standard k-center clustering. It is also related to the cost1,āˆž problem considered in Section 5, but with the max and sum operators interchanged. This problem is more challenging, though we still provide a polynomial time (3+ε)-approximation for point sets in the plane, for any constant 0<ε≤1.

We emphasize that the algorithms in Sections 3, 4, and 5 for the costāˆž,āˆž, cost1,2, and cost1,āˆž objectives either modify or directly reduce to the algorithms respectively used for k-center, k-means, and k-median. Thus our running times are virtually equivalent to those of the algorithms for these standard clustering problems, which is the best one could hope for, as these standard clustering problems are special cases of our sets clustering problems. On the other hand, our polynomial time approximation for the more challenging costāˆž,1(C,š’«) objective in Section 6 is slower, and we leave optimizing its running time as an open problem for future work.

We study the combinations of maximums, sums, and sums of squares which we believe to be the most natural, though other combinations exist that we do not study, such as squaring sums of squares. Several prior works also considered clustering based on the closest point in a set to its nearest center. Specifically, let f0(C,P) = min_{c∈C,p∈P} ‖cāˆ’p‖. Then [15] considered minimizing āˆ‘_{Piāˆˆš’«} f0(C,Pi)², providing a near linear time (1+ε)-approximation for the case when k, d, and z are constants, where z = maxi |Pi|. [14] considered the problem of minimizing max_{Piāˆˆš’«} f0(C,Pi), where the Pi sets are continuous convex regions, providing a number of results, such as a (5+2√3)-approximation for the case of disks. Many other papers have considered clustering other continuous geometric objects, in particular unbounded objects such as lines or hyperplanes [9, 18, 6], though such results are less relevant than the above mentioned papers on clustering discrete sets.

2 Setup

For two points p,qāˆˆā„d, we use ‖pāˆ’q‖ to denote the Euclidean distance between p and q. Similarly, for a point qāˆˆā„d and a finite point set PāŠ‚ā„d we have ‖qāˆ’P‖=minp∈P⁔‖qāˆ’p‖.

Given a point set PāŠ‚ā„d and a set of k centers CāŠ‚ā„d, the standard cost functions for k-center, k-median, and k-means clustering are as follows.

  • kcenter(C,P) = max_{p∈P} ‖pāˆ’C‖

  • kmedian(C,P) = āˆ‘_{p∈P} ‖pāˆ’C‖

  • kmeans(C,P) = āˆ‘_{p∈P} ‖pāˆ’C‖²

Given a parameter k and point set PāŠ‚ā„d, the goal of k-center, k-median, or k-means clustering is then to find a set C of k centers minimizing the respective cost function.

We now generalize these notions to collections of point sets. So let š’«={P1,…,Pm} be a collection of m sets of points, where PiāŠ‚ā„d for all i, and n=āˆ‘i|Pi| is the total size. For a point cāˆˆā„d, let fβ(c,Pi) be some non-negative function, representing the cost of assigning the point set Pi to a center c. We consider three cost functions, denoted by β=āˆž, 1, or 2. (The different β values are meant to be vaguely reminiscent of the corresponding vector norms.)

  • fāˆž(c,Pi) = max_{p∈Pi} ‖cāˆ’p‖

  • f1(c,Pi) = āˆ‘_{p∈Pi} ‖cāˆ’p‖

  • f2(c,Pi) = āˆ‘_{p∈Pi} ‖cāˆ’p‖²

For a single point p we write fβ(c,p) = fβ(c,{p}), where in particular we have fāˆž(c,p) = f1(c,p) = ‖cāˆ’p‖ and f2(c,p) = ‖cāˆ’p‖².

Given a set C={c1,…,ck}āŠ‚ā„d of k centers, define fβ(C,Pi) = min_{c∈C} fβ(c,Pi), that is, Pi is assigned to its nearest center under the function fβ. Now the fβ(C,Pi) values over all i define a vector of length m, and we consider either its ā„“āˆž or ā„“1 norm. Namely, we define the cost functions costα,β, for α=āˆž or 1, as follows:

  • costāˆž,β(C,š’«) = max_i fβ(C,Pi)

  • cost1,β(C,š’«) = āˆ‘_i fβ(C,Pi)

Finally, let optcostα,β(k,š’«) = min_{CāŠ†ā„d,|C|=k} costα,β(C,š’«), and let optα,β(k,š’«) denote a set C of k centers achieving optcostα,β(k,š’«). For some γ≄1, a set CāŠ‚ā„d with |C|=k is referred to as a γ-approximation to optα,β(k,š’«) if costα,β(C,š’«) ≤ γ·optcostα,β(k,š’«). (That is, k is a hard constraint and the approximation is on the cost, as is standard for clustering.)

Observation 1.

Given a point set P, let š’«(P) denote the collection of |P| singleton sets {p} for each p∈P. Then kcenter(C,P) = costāˆž,āˆž(C,š’«(P)) = costāˆž,1(C,š’«(P)), kmedian(C,P) = cost1,āˆž(C,š’«(P)) = cost1,1(C,š’«(P)), and kmeans(C,P) = cost1,2(C,š’«(P)).
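To make the definitions above concrete, the following is a minimal reference sketch (the names f and cost are ours; sets are represented as lists of coordinate tuples), with Observation 1 checked on singleton sets:

```python
import math

def f(beta, c, P):
    """f_beta(c, P): the cost of assigning the point set P to the center c."""
    if beta == "inf":
        return max(math.dist(c, p) for p in P)
    if beta == 1:
        return sum(math.dist(c, p) for p in P)
    if beta == 2:
        return sum(math.dist(c, p) ** 2 for p in P)
    raise ValueError("beta must be 'inf', 1, or 2")

def cost(alpha, beta, C, sets):
    """cost_{alpha,beta}(C, P_1..P_m): aggregate the per-set costs f_beta(C, P_i)."""
    per_set = [min(f(beta, c, P) for c in C) for P in sets]
    return max(per_set) if alpha == "inf" else sum(per_set)

# Observation 1 on singletons: cost_{inf,inf} is exactly the k-center cost.
singletons = [[(0.0, 0.0)], [(2.0, 0.0)], [(0.0, 3.0)]]
assert cost("inf", "inf", [(0.0, 0.0)], singletons) == 3.0
```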

3 k-center Sets Clustering

Given a collection of point sets š’«={P1,…,Pm} in ā„d and a parameter k, [20] considered the problem of covering š’« with k equal radius balls of minimum radius, such that each Pi is entirely contained in one of the balls. Using the terminology defined above, this is equivalent to computing optāˆž,āˆž(k,š’«). For this problem, [20] gave a (1+√3)-approximation. Here we show that the standard greedy Gonzalez algorithm for k-center clustering can be adapted to yield a 2-approximation. Not only is this a better approximation ratio, but the algorithm in fact works in any metric space. Moreover, for general metrics it is not possible to get a (2āˆ’Īµ)-approximation for k-center clustering for any ε>0, unless P=NP [13]. As k-center is a special case of our problem, our approximation algorithm is thus optimal.

Given a point set P and a parameter k, we recall that the Gonzalez algorithm [10] greedily outputs a set of k centers c1,…,ck, where c1 is an arbitrary point in P, and for i>1, ci = argmax_{p∈P} ‖pāˆ’{c1,…,ciāˆ’1}‖. We argue that the following natural adaptation of this algorithm to our sets clustering problem, which greedily picks the furthest set rather than the furthest point, also yields a 2-approximation. (Figure 1.1 shows that naively running Gonzalez on P=∪iPi does not achieve a 2-approximation. Namely, Gonzalez on P might pick p1, p4, and p5 as the centers, resulting in a cost of 11 for sets clustering on š’«; however, choosing p1, the midpoint of (p2,p3), and p4 gives a cost of 4.)

Algorithm 1 š š«šžšžšš²ā¢(k,š’«={P1,…,Pm}).
Theorem 2.

Given a parameter k and a collection of point sets š’«={P1,…⁢Pm} in ā„d, š š«šžšžšš²ā¢(k,š’«) gives an O⁢(d⁢n⁢k) time 2-approximation for computing o⁢p⁢tāˆž,āˆžā¢(k,š’«), where n=āˆ‘i|Pi|.

Proof.

Let C={c1,…,ck} be the output of š š«šžšžšš²ā¢(k,š’«) and let Cāˆ—={c1āˆ—,c2āˆ—,…,ckāˆ—}=o⁢p⁢tāˆž,āˆžā¢(k,š’«) with cost rāˆ—=o⁢p⁢t⁢c⁢o⁢s⁢tāˆž,āˆžā¢(k,š’«)=minCāŠ†ā„d,|C|=k⁔c⁢o⁢s⁢tāˆž,āˆžā¢(C,š’«), where recall c⁢o⁢s⁢tāˆž,āˆžā¢(C,š’«)=maxi⁔fāˆžā¢(C,Pi).

There are two possible cases. In the first case, for all cāˆ—āˆˆCāˆ—, there exists some c∈C such that ‖cāˆ—āˆ’c‖≤rāˆ—. Consider any set Pi with cāˆ— as its nearest optimal center under fāˆž, that is fāˆžā¢(cāˆ—,Pi)=fāˆžā¢(Cāˆ—,Pi). By definition, fāˆžā¢(cāˆ—,Pi)=maxp∈Pi⁔‖cāˆ—āˆ’p‖≤rāˆ—. Thus for any point p∈Pi, by the triangle inequality, ‖pāˆ’c‖≤‖pāˆ’cāˆ—ā€–+‖cāˆ—āˆ’c‖≤2⁢rāˆ—. Therefore, fāˆžā¢(c,Pi)=maxp∈Pi⁔‖cāˆ’p‖≤2⁢rāˆ—. This in turn implies fāˆžā¢(C,Pi)≤2⁢rāˆ—, and as this holds for all i it implies C is a 2-approximation.

In the second case, there exists some cāˆ—āˆˆCāˆ— such that for any c∈C we have ‖cāˆ—āˆ’c‖>rāˆ—. For a given c∈C, let P(c) denote the set Piāˆˆš’« from which c was selected in š š«šžšžšš²(k,š’«). Observe that if ‖cāˆ—āˆ’c‖>rāˆ— then fāˆž(cāˆ—,P(c))>rāˆ—, that is, P(c) is not covered by cāˆ— in the optimal solution. However, fāˆž(Cāˆ—,Pi)≤rāˆ— for all i, and so by the pigeonhole principle, if such a cāˆ—āˆˆCāˆ— occurs then there must exist some other cjāˆ—āˆˆCāˆ— and centers cα,cβ∈C with α<β such that fāˆž(cjāˆ—,P(cα)),fāˆž(cjāˆ—,P(cβ))≤rāˆ—. This implies ‖cjāˆ—āˆ’p‖≤rāˆ— for any p∈P(cα)∪P(cβ). Thus by the triangle inequality, for any p∈P(cβ) we have ‖cĪ±āˆ’p‖≤‖cĪ±āˆ’cjāˆ—ā€–+‖cjāˆ—āˆ’p‖≤2rāˆ—, which implies fāˆž(cα,P(cβ))≤2rāˆ—.

For any 1≤i≤k, let Ci={c1,…,ci}, i.e. the first i centers chosen by š š«šžšžšš²(k,š’«), and let Ī“i=fāˆž(Ciāˆ’1,P(ci)) (where Ī“1=āˆž). Denote the cost of the final solution C output by š š«šžšžšš²(k,š’«) as Ī“k+1=max_{Pāˆˆš’«} fāˆž(C,P). Observe that Ī“1≄Γ2≄⋯≄Γk≄Γk+1. Above we argued that fāˆž(cα,P(cβ))≤2rāˆ— where α<β; thus Ī“k+1≤Γβ=fāˆž(CĪ²āˆ’1,P(cβ))≤fāˆž(Cα,P(cβ))≤fāˆž(cα,P(cβ))≤2rāˆ—, and thus C is a 2-approximation.

As for the running time, we can achieve O(dnk) time in a similar fashion to the standard k-center Gonzalez algorithm. Specifically, let Ci be the set of centers after some i iterations of the algorithm, and assume for all unmarked Pāˆˆš’« we maintain fāˆž(Ci,P). Then in round i+1, ci+1 is an arbitrary point from argmax_{unmarked Pāˆˆš’«} fāˆž(Ci,P), which can thus be found by a linear scan of the fāˆž(Ci,P) values. We also need to update the fāˆž(Ci,P) values to fāˆž(Ci+1,P). To do so, observe that for any Pāˆˆš’«, fāˆž(Ci+1,P)=min{fāˆž(Ci,P),fāˆž(ci+1,P)}. As fāˆž(ci+1,P) can be computed in O(d|P|) time, we can thus update all of these values in O(dn) time. As the algorithm performs k iterations, it thus takes O(dnk) time overall. ā—€

The algorithm used in [20] for achieving the (1+√3)-approximation requires computing the minimum enclosing ball (MEB) of each point set, and thus has a hidden constant in the running time depending on d. Thus for high dimensions, [20] use a (1+ε)-approximate MEB (which can be computed in O(dn/ε+1/ε^5) time [2]), resulting in a (1+√3+ε)-approximation factor. In comparison, our algorithm avoids computing the MEB altogether, and as the above proof only relied on the triangle inequality, the same proof verbatim yields a 2-approximation for any metric space. (This is our only result that holds for arbitrary metric spaces, which is why optα,β was defined for points specifically in ā„d.)

k-center clustering is known to be hard to approximate within a factor of roughly 1.82 even in the plane [7]; however, a (1+ε)-approximation is possible in constant dimensions, though at the cost of a running time exponential in k. We now show this standard algorithm also applies to our problem.

We will make use of the following standard observation (adapted for our settings).

Observation 3.

Let C={c1,…,ck}āŠ‚ā„d be a set of k points, and let C′={c1′,…,ck′} be a set of k points such that ‖ciāˆ’ci′‖≤x. Then for any collection of point sets š’«={P1,…⁢Pm}, we have that c⁢o⁢s⁢tāˆž,āˆžā¢(C′,š’«)≤c⁢o⁢s⁢tāˆž,āˆžā¢(C,š’«)+x. This follows since if for a given set Pjāˆˆš’« its nearest center under fāˆž was ci, then by the triangle inequality we have that fāˆžā¢(ci′,Pj)≤fāˆžā¢(ci,Pi)+x.

Theorem 4.

Given a parameter k and a collection of point sets š’«={P1,…,Pm} in ā„d for constant d, for any ε>0 there is an O(n k^{k+1}/ε^{dk}) time (1+ε)-approximation for computing optāˆž,āˆž(k,š’«), where n=āˆ‘i|Pi|.

Proof.

Let C={c1,c2,…,ck} be the centers output by š š«šžšžšš²ā¢(k,š’«), and let r=c⁢o⁢s⁢tāˆž,āˆžā¢(C,š’«). Also let Cāˆ—={c1āˆ—,c2āˆ—,…,ckāˆ—}=o⁢p⁢tāˆž,āˆžā¢(k,š’«) where rāˆ—=c⁢o⁢s⁢tāˆž,āˆžā¢(Cāˆ—,š’«). By TheoremĀ 2 we know that rāˆ—ā‰¤r≤2⁢rāˆ—.

Let ℬ⁢(x)=∪iB⁢(ci,x) where B⁢(ci,x) is the ball of radius x centered at ci. By definition of C and r, we know ∪iPiāŠ†ā„¬ā¢(r). We can assume that every center cāˆ—āˆˆCāˆ— is within distance rāˆ—ā‰¤r of some point in ∪iPi, as otherwise cāˆ— can be deleted without affecting the optimal solution quality. Thus we have that Cāˆ—āŠ†ā„¬ā¢(2⁢r).

Now consider the axis aligned grid over ā„d with cell side length Ī“=εr/(2√d). Any ball of radius 2r intersects at most O((2r/Ī“)^d)=O(1/ε^d) cells, for constant d. Thus ℬ(2r) intersects O(k/ε^d) grid cells, and as Cāˆ—āŠ†ā„¬(2r), there are thus O(k/ε^d) possible cells for each center in Cāˆ—. Now for any point q=(q1,…,qd)āˆˆā„d, let gridĪ“(q)=(Γ⌊q1/Ī“āŒ‹,…,Γ⌊qd/Ī“āŒ‹) be the lowest corner of the grid cell containing q (specifically the coordinate-wise minimal corner). Let gridĪ“(Cāˆ—)={gridĪ“(c1āˆ—),…,gridĪ“(ckāˆ—)}. As the diameter of each cell is εr/2, by Observation 3 we have that costāˆž,āˆž(gridĪ“(Cāˆ—),š’«)≤costāˆž,āˆž(Cāˆ—,š’«)+εr/2=rāˆ—+εr/2≤rāˆ—+εrāˆ—=(1+ε)rāˆ—.

The above motivates the following algorithm. First compute C={c1,c2,…,ck} using š š«šžšžšš²(k,š’«). This determines a set of O(k/ε^d) grid cells around the centers of C which must contain Cāˆ—. We try all O((k/ε^d)^k) possibilities for these k cells, and for each possibility we use the lowest corners of the respective cells as our candidate set of centers. We then return as our solution the set of such centers C′ which had the minimum costāˆž,āˆž(C′,š’«) value. As gridĪ“(Cāˆ—) will be one of the possible sets of centers considered, where we already argued that costāˆž,āˆž(gridĪ“(Cāˆ—),š’«)≤(1+ε)rāˆ—, this procedure is guaranteed to output a (1+ε)-approximation.

As for the running time, the initial call to š š«šžšžšš²(k,š’«) takes O(nk) time. There are O((k/ε^d)^k) subsets of k cells to consider, and for each one we first compute the set C′ of lower corners in O(k) time. Then we compute costāˆž,āˆž(C′,š’«), which can be done in O(nk) time. Thus the overall time is O(nk(k/ε^d)^k)=O(n k^{k+1}/ε^{dk}). ā—€
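A sketch of this grid search follows (the names are ours; centers and r are assumed to come from š š«šžšžšš², and the enumeration is exponential in k exactly as in the analysis):

```python
import itertools, math

def grid_corner(q, delta):
    """Coordinate-wise lowest corner of the grid cell containing q."""
    return tuple(delta * math.floor(x / delta) for x in q)

def grid_ptas(k, sets, centers, r, eps, d):
    """(1+eps)-approximation of Theorem 4: try all k-subsets of the corners
    of grid cells within distance 2r of the greedy centers."""
    delta = eps * r / (2 * math.sqrt(d))
    steps = math.ceil(2 * r / delta)      # cells per axis within distance 2r
    cands = {grid_corner(tuple(c[j] + o[j] * delta for j in range(d)), delta)
             for c in centers
             for o in itertools.product(range(-steps, steps + 1), repeat=d)}
    def cost(C):                          # cost_{inf,inf}(C, sets)
        return max(min(max(math.dist(c, p) for p in P) for c in C) for P in sets)
    return min(itertools.combinations(sorted(cands), k), key=cost)
```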

▶ Remark 5.

For k-center clustering it is common to consider the discrete version of the problem where the centers must come from the input point set. One can easily argue that both Theorem 2 and Theorem 4 extend to this variant. (In the proof of Theorem 4, we would restrict to non-empty grid cells for potential center locations.)

4 k-means Sets Clustering

Given š’«={P1,…⁢Pm} and a parameter k, for any set CāŠ‚ā„d of k centers recall that

c⁢o⁢s⁢t1,2⁢(C,š’«)=āˆ‘i=1mf2⁢(C,Pi)=āˆ‘i=1mminc∈C⁔f2⁢(c,Pi)=āˆ‘i=1mminc∈Cā¢āˆ‘p∈Pi‖cāˆ’p‖2.

Note that this is the natural analogue of the k-means problem to clustering point sets. Specifically, when each set is a single point, that is, Pi={pi} for all i, then this is equivalent to the standard k-means clustering cost on the point set P={p1,…,pm}, as then cost1,2(C,š’«) = āˆ‘_{i=1}^{m} ‖piāˆ’C‖².

Definition 6.

For a given point set PāŠ‚ā„d, the centroid of P is defined as p̄ = p̄(P) = (āˆ‘_{p∈P} p)/|P|.

Given a collection of sets š’«={P1,…,Pm} in ā„d, let PiĀÆ={pĀÆi⁢1,…,pĀÆi⁢|Pi|} be the set of |Pi| distinct copies of pĀÆi=p¯⁢(Pi), and let P¯⁢(š’«)=∪iPiĀÆ. Note, while pĀÆi⁢j and pĀÆi⁢k are collocated they are viewed as distinct points, and thus |PiĀÆ|=|Pi| and |P¯⁢(š’«)|=āˆ‘i|Pi|.

It is well known that the optimal solution to the 1-mean problem with respect to a point set P is the centroid, C = {p̄(P)}. We require the following standard lemma, whose proof can be found in [16].

Lemma 7.

For a point set PāŠ‚ā„d with centroid pĀÆ=p¯⁢(P) and any point xāˆˆā„d, we have, f2⁢(x,P)=f2⁢(pĀÆ,P)+|P|ā‹…f2⁢(x,pĀÆ).

Given a collection of point sets, we now argue that one can solve the sets clustering problem by solving the k-means problem on the centroids of the sets.

Theorem 8.

For a parameter k and a point set PāŠ‚ā„d, let š¤š¦šžššš§š¬š€š„š ā¢(k,P) denote any algorithm which achieves an α-approximation for the k-means problem.

Given a parameter k and collection of sets š’«={P1,…,Pm} in ā„d, š¤š¦šžššš§š¬š€š„š ā¢(k,P¯⁢(š’«)) is an α-approximation for computing o⁢p⁢t1,2⁢(k,š’«).

Proof.

For any two point sets C,PāŠ‚ā„d, by Lemma 7 we have:

f2(C,P) = min_{c∈C} f2(c,P) = min_{c∈C} (f2(p̄,P) + |P|Ā·f2(c,p̄)) = f2(p̄,P) + |P|Ā·min_{c∈C} f2(c,p̄).

Recall from Section 2 that kmeans(C,P) = āˆ‘_{p∈P} ‖pāˆ’C‖² = āˆ‘_{p∈P} min_{c∈C} f2(c,p). Thus summing the above equation over all the sets Pi gives:

c⁢o⁢s⁢t1,2⁢(C,š’«) =āˆ‘i=1mf2⁢(C,Pi)=āˆ‘i=1mf2⁢(pĀÆi,Pi)+āˆ‘i=1m|Pi|ā‹…minc∈C⁔f2⁢(c,piĀÆ)
=āˆ‘i=1mf2⁢(pĀÆi,Pi)+āˆ‘p∈P¯⁢(š’«)minc∈C⁔f2⁢(c,p)=āˆ‘i=1mf2⁢(pĀÆi,Pi)+k⁢m⁢e⁢a⁢n⁢s⁢(C,P¯⁢(š’«)).

Thus cost1,2(C,š’«) = āˆ‘_{i=1}^{m} f2(p̄i,Pi) + kmeans(C,P̄(š’«)). So let Cāˆ— denote the optimal k-means solution on P̄(š’«), that is, Cāˆ— = argmin_{CāŠ‚ā„d,|C|=k} kmeans(C,P̄(š’«)). As āˆ‘_{i=1}^{m} f2(p̄i,Pi) does not depend on C, the above equation implies that minimizing kmeans(C,P̄(š’«)) also minimizes cost1,2(C,š’«), i.e. Cāˆ— = argmin_{CāŠ‚ā„d,|C|=k} cost1,2(C,š’«).

So let C′ be the α-approximation (α≄1) returned by š¤š¦šžššš§š¬š€š„š (k,P̄(š’«)), meaning kmeans(C′,P̄(š’«)) ≤ αĀ·kmeans(Cāˆ—,P̄(š’«)). Then we have

c⁢o⁢s⁢t1,2⁢(C′,š’«) =āˆ‘i=1mf2⁢(pĀÆi,Pi)+k⁢m⁢e⁢a⁢n⁢s⁢(C′,P¯⁢(š’«))
ā‰¤āˆ‘i=1mf2⁢(pĀÆi,Pi)+α⋅k⁢m⁢e⁢a⁢n⁢s⁢(Cāˆ—,P¯⁢(š’«))
≤α⁢(āˆ‘i=1mf2⁢(pĀÆi,Pi)+k⁢m⁢e⁢a⁢n⁢s⁢(Cāˆ—,P¯⁢(š’«)))=α⋅c⁢o⁢s⁢t1,2⁢(Cāˆ—,š’«) Ā 

ā—€
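The identity at the heart of the proof is easy to check numerically; the following sketch (our own) builds P̄(š’«) and verifies that the two costs differ by exactly the constant āˆ‘i f2(p̄i,Pi), so any k-means algorithm can then be run on centroid_multiset(sets):

```python
import math, random

def centroid(P):
    return tuple(sum(p[j] for p in P) / len(P) for j in range(len(P[0])))

def centroid_multiset(sets):
    """P_bar(sets): |P_i| collocated copies of the centroid of each P_i."""
    return [centroid(P) for P in sets for _ in P]

# Check cost_{1,2}(C, sets) = sum_i f2(centroid_i, P_i) + kmeans(C, P_bar).
random.seed(0)
sets = [[(random.random(), random.random()) for _ in range(random.randint(1, 5))]
        for _ in range(6)]
C = [(0.2, 0.3), (0.8, 0.7)]
f2 = lambda c, P: sum(math.dist(c, p) ** 2 for p in P)
cost_12 = sum(min(f2(c, P) for c in C) for P in sets)
inner = sum(f2(centroid(P), P) for P in sets)
kmeans_cost = sum(min(math.dist(c, q) ** 2 for c in C) for q in centroid_multiset(sets))
assert abs(cost_12 - (inner + kmeans_cost)) < 1e-9
```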

As discussed in the introduction, there are many known (1+ε)-approximation algorithms for k-means, with various trade-offs in the running times depending on the parameters n,d,k, and ε. Moreover, as k-means is a special case of our problem, we should not expect our times to be faster than what is possible for k-means.

Before we can apply such algorithms, the above theorem requires computing P̄(š’«). However, this takes only O(dn) time, where n=āˆ‘i|Pi|, as the centroid of a set is simply the average of its points. Thus the above theorem immediately implies the following.

Corollary 9.

For a parameter k and set PāŠ‚ā„d of n points, let T⁢(n,k,d,ε) denote the running time of any algorithm which achieves a (1+ε)-approximation for k-means.

Given a parameter k and collection of sets š’«={P1,…,Pm} in ā„d, where n=āˆ‘i|Pi|, there is a (1+ε)-approximation for computing o⁢p⁢t1,2⁢(k,š’«) with O⁢(d⁢n+T⁢(n,k,d,ε)) running time.

5 k-median Sets Clustering

Given š’«={P1,…⁢Pm} and a parameter k, for any set CāŠ‚ā„d of k centers recall that

c⁢o⁢s⁢t1,āˆžā¢(C,š’«)=āˆ‘i=1mfāˆžā¢(C,Pi)=āˆ‘i=1mminc∈C⁔fāˆžā¢(c,Pi)=āˆ‘i=1mminc∈C⁔maxp∈Pi⁔‖cāˆ’p‖.

As discussed in the introduction, the above cost (as well as cost1,1(C,š’«) studied in [20]) generalizes the k-median problem to clustering point sets. Specifically, when each set consists of a single point, i.e., Pi={pi} for all i, the problem reduces to the standard k-median clustering cost on the point set P={p1,…,pm}, as then cost1,āˆž(C,š’«) = āˆ‘_{i=1}^{m} ‖piāˆ’C‖.

Lemma 10.

Let š’«={P1,…,Pm} be a collection of sets and B={b1,…,bm} where bi is the center of the minimum enclosing ball of Pi, then āˆ‘i=1mmaxp∈Pi⁔‖pāˆ’bi‖≤o⁢p⁢t⁢c⁢o⁢s⁢t1,āˆžā¢(k,š’«).

Proof.

Since bi is the center of the minimum enclosing ball of Pi, by definition we have that max_{p∈Pi} ‖pāˆ’bi‖ = min_{bāˆˆā„d} max_{p∈Pi} ‖pāˆ’b‖ = min_{bāˆˆā„d} fāˆž(b,Pi). So if Cāˆ—={c1āˆ—,c2āˆ—,…,ckāˆ—}=opt1,āˆž(k,š’«), then

āˆ‘_{i=1}^{m} max_{p∈Pi} ‖pāˆ’bi‖ = āˆ‘_{i=1}^{m} min_{bāˆˆā„d} fāˆž(b,Pi) ≤ āˆ‘_{i=1}^{m} min_{c∈Cāˆ—} fāˆž(c,Pi) = āˆ‘_{i=1}^{m} fāˆž(Cāˆ—,Pi) = optcost1,āˆž(k,š’«).

ā—€

Lemma 11.

Let š’«={P1,…,Pm} be a collection of point sets from ā„d, let B={b1,…,bm} where bi is the center of the minimum enclosing ball of Pi, and let CāŠ‚ā„d be any set of k centers, then k⁢m⁢e⁢d⁢i⁢a⁢n⁢(C,B)≤c⁢o⁢s⁢t1,āˆžā¢(C,š’«).

Proof.

First, we make the standard observation that for any point qāˆˆā„d, ‖qāˆ’bi‖ ≤ max_{p∈Pi} ‖pāˆ’q‖. (Assume q≠bi, as otherwise the inequality trivially holds.) This holds by considering the hyperplane passing through bi whose normal is in the direction from bi to q. As bi is the center of the minimum enclosing ball of Pi, the closed halfspace defined by this hyperplane and not containing q must contain a point from Pi, and thus this point is at least as far from q as bi is. Thus we have,

kmedian(C,B) = āˆ‘_{i=1}^{m} ‖biāˆ’C‖ = āˆ‘_{i=1}^{m} min_{c∈C} ‖biāˆ’c‖ ≤ āˆ‘_{i=1}^{m} min_{c∈C} max_{p∈Pi} ‖pāˆ’c‖
 = āˆ‘_{i=1}^{m} fāˆž(C,Pi) = cost1,āˆž(C,š’«)

ā—€

Theorem 12.

For a parameter k and a point set PāŠ‚ā„d, let š¤š¦šžšš¢ššš§š€š„š ā¢(k,P) denote any algorithm which achieves an α-approximation for the k-median problem.

Given a parameter k and collection of sets š’«={P1,…,Pm} in ā„d, let B={b1,…,bm} where bi is the center of the minimum enclosing ball of Pi. Then š¤š¦šžšš¢ššš§š€š„š ā¢(k,B) is a (1+α)-approximation for computing o⁢p⁢t1,āˆžā¢(k,š’«).

Proof.

Let C={c1,…,ck}āŠ‚ā„d be the set of centers returned by š¤š¦šžšš¢ššš§š€š„š ā¢(k,B), for which we have k⁢m⁢e⁢d⁢i⁢a⁢n⁢(C,B)≤α⋅minCā€²āŠ†ā„d,|C′|=k⁔k⁢m⁢e⁢d⁢i⁢a⁢n⁢(C′,B). Also let Cāˆ—=o⁢p⁢t1,āˆžā¢(k,š’«). Then using triangle inequality,

c⁢o⁢s⁢t1,āˆžā¢(C,š’«) =āˆ‘i=1mminc∈C⁔maxp∈Pi⁔‖cāˆ’pā€–ā‰¤āˆ‘i=1mminc∈C⁔maxp∈Pi⁔(‖pāˆ’bi‖+‖cāˆ’bi‖)
=āˆ‘i=1m((maxp∈Pi⁔‖pāˆ’bi‖)+(minc∈C⁔‖cāˆ’bi‖))
=āˆ‘i=1mmaxp∈Pi⁔‖pāˆ’bi‖+āˆ‘i=1mminc∈C⁔‖cāˆ’bi‖
=āˆ‘i=1mmaxp∈Pi⁔‖pāˆ’bi‖+k⁢m⁢e⁢d⁢i⁢a⁢n⁢(C,B)
≤o⁢p⁢t⁢c⁢o⁢s⁢t1,āˆžā¢(k,š’«)+k⁢m⁢e⁢d⁢i⁢a⁢n⁢(C,B) Lemma 10
≤o⁢p⁢t⁢c⁢o⁢s⁢t1,āˆžā¢(k,š’«)+α⋅minCā€²āŠ†ā„d,|C′|=k⁔k⁢m⁢e⁢d⁢i⁢a⁢n⁢(C′,B)
≤o⁢p⁢t⁢c⁢o⁢s⁢t1,āˆžā¢(k,š’«)+α⋅k⁢m⁢e⁢d⁢i⁢a⁢n⁢(Cāˆ—,B)
≤o⁢p⁢t⁢c⁢o⁢s⁢t1,āˆžā¢(k,š’«)+α⋅c⁢o⁢s⁢t1,āˆžā¢(Cāˆ—,š’«) Lemma 11
=o⁢p⁢t⁢c⁢o⁢s⁢t1,āˆžā¢(k,š’«)+α⋅o⁢p⁢t⁢c⁢o⁢s⁢t1,āˆžā¢(k,š’«)
=(1+α)⁢o⁢p⁢t⁢c⁢o⁢s⁢t1,āˆžā¢(k,š’«) Ā 

ā—€

As discussed in the introduction, there are many known (1+ε)-approximation algorithms for k-median, with various trade-offs in the running times depending on the parameters n,d,k, and ε. Moreover, as k-median is a special case of our problem, we should not expect our times to be faster than what is possible for k-median.

Before we can apply such algorithms, the above theorem requires computing the minimum enclosing ball of each set Pi, for which we can use a (1+ε)-approximate minimum enclosing ball center, computable in O(dn/ε+1/ε^5) time [2]. Doing so introduces a (1+ε) factor in Lemma 10, and thus using this together with a (1+ε)-approximation for k-median, the same analysis as in the proof of Theorem 12 yields a (1+(1+ε)²)-approximation to computing opt1,āˆž(k,š’«). By observing that (1+ε/4)² ≤ (1+ε) for any 0<ε≤1, we immediately have the following.

Corollary 13.

For a parameter k and set PāŠ‚ā„d of n points, let T⁢(n,k,d,ε) denote the running time of any algorithm which achieves a (1+ε)-approximation for k-median. Moreover, let M⁢E⁢B⁢(n,d,ε) denote the time to compute a (1+ε)-approximate minimum enclosing ball of P.

Given a parameter k and a collection of sets š’«={P1,…,Pm} in ā„d, where n=āˆ‘i|Pi|, for any 0<ε≤1 there is a (2+ε)-approximation for computing opt1,āˆž(k,š’«) with O(MEB(n,d,ε/4)+T(n,k,d,ε/4)) running time.
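As a sketch, the whole reduction can be assembled as follows, using the simple farthest-point iteration in the spirit of [2] for the approximate MEB center (kmedian_alg is a placeholder for any k-median subroutine):

```python
import math

def approx_meb_center(P, eps):
    """(1+eps)-approximate minimum enclosing ball center of P, via the simple
    farthest-point iteration of Badoiu and Clarkson [2]."""
    c = list(P[0])
    for i in range(1, math.ceil(1 / eps ** 2) + 1):
        q = max(P, key=lambda p: math.dist(c, p))   # current farthest point
        c = [cj + (qj - cj) / (i + 1) for cj, qj in zip(c, q)]
    return tuple(c)

def sets_kmedian(k, sets, eps, kmedian_alg):
    """Theorem 12 in code: replace each set by its (approximate) MEB center
    and cluster the resulting points with any k-median algorithm."""
    B = [approx_meb_center(P, eps) for P in sets]
    return kmedian_alg(k, B)
```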

6 Another Variant of k-center Sets Clustering

Above we gave improved approximation algorithms for the costāˆž,āˆž(C,š’«) objective, previously considered by [20], which naturally generalizes the k-center problem to sets. Here we consider an alternative generalization of k-center to sets clustering. Namely, given š’«={P1,…,Pm} and a parameter k, find the set CāŠ‚ā„d of k centers that minimizes:

c⁢o⁢s⁢tāˆž,1⁢(š’«,C)=maxPāˆˆš’«ā”f1⁢(C,P)=maxPāˆˆš’«ā”minc∈C⁔f1⁢(c,P)=maxPāˆˆš’«ā”minc∈Cā¢āˆ‘p∈P‖cāˆ’p‖.

This variant poses additional challenges, so for simplicity we will assume that the sets in š’« lie in ā„2, though a similar approach should extend to any constant dimension. Moreover, in this section we state our running times simply as being polynomial in n, as the precise constants in the exponents are large, unlike in earlier sections, where the running times may be linear in n (depending on the subroutines used).

Definition 14.

In the weighted k-center problem, we are given a parameter k, a set of points P={p1,…,pn}āŠ‚ā„2, and a set of non-negative weights W={w1,…,wn}, where wi is the weight associated with point pi, and the goal is to find the set CāŠ‚ā„2 of k centers which minimizes the cost function wkcenter(C,P) = max_{pi∈P} min_{c∈C} wi⋅‖piāˆ’c‖.

Observe that the weighted k-center objective can be viewed as a special instance of our c⁢o⁢s⁢tāˆž,1 objective, as a point with weight |Pi| models the case when all points in Pi are collocated, for each Piāˆˆš’«. Thus intuitively, to approximate c⁢o⁢s⁢tāˆž,1 will require approximating w⁢k⁢c⁢e⁢n⁢t⁢e⁢r.

[19] considered the weighted k-center problem where the input is instead an edge-weighted graph G=(V,E), where each vertex v has an associated non-negative weight wv. Specifically, let d(u,v) denote the shortest path distance between u and v (with respect to the edge weights, not the vertex weights); then they seek the set CāŠ†V of k centers which minimizes wkcenter(C,V) = max_{v∈V} min_{c∈C} wvĀ·d(v,c). For this problem [19] give a 2-approximation, and we now describe how this implies a 2-approximation for the case in the plane.

Lemma 15.

There exists a polynomial time 2-approximation algorithm for the weighted k-center problem in the plane. That is, given a parameter k, a set of points P={p1,…,pn}āŠ‚ā„2, and a set of non-negative weights W={w1,…,wn} where wi is the weight associated with point pi, there is a polynomial time 2-approximation to minCāŠ‚ā„2,|C|=k⁔w⁢k⁢c⁢e⁢n⁢t⁢e⁢r⁢(C,P).

Proof.

As we allow centers to be located anywhere in the plane, we first describe how to find a polynomial sized set Ī“ which contains the optimal set of centers as a subset.

For any two points pi,pj∈P, define their weighted bisector, β(pi,pj), as the subset of points qāˆˆā„2 such that wi‖piāˆ’q‖ = wj‖pjāˆ’q‖. It is well known that β(pi,pj) is a line when wi=wj, and otherwise is a circle (called the Apollonius circle) containing the point of larger weight. Consider the arrangement of all such weighted bisectors over all pairs of points in P. The vertices of the arrangement occur at intersections of bisectors, and the edges are circular arcs between vertices (or entire bisectors, when the bisector does not intersect any other bisector). Let Ī“āŠ‚ā„2 be the set containing all points in P, all vertices in the arrangement, and, for every edge on a bisector β(pi,pj), a point on that edge achieving the minimum value of wi‖piāˆ’q‖ = wj‖pjāˆ’q‖ (which may occur at a vertex of the arrangement). Observe that |Ī“| = O(n^4), as this is the complexity of the arrangement.

So let Cāˆ— be an optimal set of centers, and consider some cāˆ—āˆˆCāˆ—. We now argue that if cāˆ—āˆ‰Ī“, then it can be moved to a point in Ī“ in such a way that the weighted k-center objective does not increase. First, let P′ be the subset of points from P which call cāˆ— their nearest center in Cāˆ—, and let q = argmax_{pi∈P′} wi‖piāˆ’cāˆ—ā€–. Consider moving cāˆ— continuously from its initial location towards q (i.e. along the segment cāˆ—q). As we move along this segment the weighted distance to q strictly decreases, and moreover, until we cross a bisector β(q,pi) for some pi, q remains the point in P′ furthest from cāˆ— (i.e. the point determining cāˆ—ā€™s contribution to the overall objective). First suppose that cāˆ— is moved all the way to q without crossing a bisector involving q. In this case q remained the furthest point from cāˆ—, and so the cost of assigning P′ to cāˆ— did not increase (implying in fact that cāˆ—=q initially). In the second case, we run into some bisector β(q,pi) for some pi, at which point we stop moving cāˆ— towards q (to ensure q remains the furthest). Now if cāˆ— is at a vertex of the arrangement of bisectors, then we are done, as we included all vertices in Ī“. Otherwise, cāˆ— is in the interior of an edge in the arrangement. Now the point achieving the minimum value of the weighted distance from q (or equivalently pi) to cāˆ— along this edge is included in Ī“ (and may in fact be a vertex), and so we can move cāˆ— to this point without increasing the cost of assigning P′ to cāˆ—, as we do not cross any other bisector while moving along this edge.

Now we construct a weighted graph instance G=(V,E) of weighted k-center as follows. First, set V=Ī“ (recall PāŠ†Ī“). Next, for every pair of vertices in V, set the weight of the edge between them equal to the Euclidean distance between the corresponding points. Finally, for the vertex weights, set the weight of all vertices in Ī“āˆ–P equal to 0 (i.e. they do not have to be clustered), and for each pi∈P keep its weight wi (i.e. the weight from the weighted k-center instance in the plane). Note that by construction, the cost of any solution CāŠ†V to this graph based weighted k-center problem is equal to its cost for the corresponding weighted k-center problem in the plane. Thus, as V=Ī“ is guaranteed to contain an optimal solution to our instance in the plane, calling the polynomial time 2-approximation algorithm of [19] on this graph instance returns a solution which is also a 2-approximation for our instance in the plane. ā—€
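For illustration, the weighted bisector used in this construction has a simple closed form; the following sketch (a helper of our own) returns the Apollonius circle when the weights differ:

```python
import math

def weighted_bisector(p1, w1, p2, w2):
    """Apollonius circle {q : w1*||q - p1|| = w2*||q - p2||} in the plane,
    assuming w1 != w2 (for equal weights the bisector is a line instead).
    Obtained by squaring both sides and completing the square."""
    a, b = w1 * w1, w2 * w2
    cx = (a * p1[0] - b * p2[0]) / (a - b)
    cy = (a * p1[1] - b * p2[1]) / (a - b)
    k = (a * (p1[0] ** 2 + p1[1] ** 2) - b * (p2[0] ** 2 + p2[1] ** 2)) / (a - b)
    return (cx, cy), math.sqrt(cx * cx + cy * cy - k)

# Weights 2 and 1 at distance 1: a circle of radius 2/3 around the heavier point.
center, radius = weighted_bisector((0.0, 0.0), 2.0, (1.0, 0.0), 1.0)
assert abs(radius - 2 / 3) < 1e-9 and abs(center[0] + 1 / 3) < 1e-9
```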

Our goal now is to use the above lemma to get a constant factor approximation to o⁢p⁢tāˆž,1⁢(k,š’«). First we need the following lemma, which is the analogue of LemmaĀ 10 from the prior section.

Lemma 16.

Let š’«={P1,…,Pm} be a collection of sets. Then, for any point-set Piāˆˆš’«,

minĪ¼āˆˆā„dā¢āˆ‘p∈Piā€–Ī¼āˆ’p‖≤o⁢p⁢t⁢c⁢o⁢s⁢tāˆž,1⁢(k,š’«).
Proof.

Let Cāˆ— = optāˆž,1(k,š’«). Then for any point set Piāˆˆš’«,

min_{Ī¼āˆˆā„d} āˆ‘_{p∈Pi} ā€–Ī¼āˆ’p‖ ≤ min_{cāˆ—āˆˆCāˆ—} āˆ‘_{p∈Pi} ‖cāˆ—āˆ’p‖ ≤ max_{Piāˆˆš’«} min_{cāˆ—āˆˆCāˆ—} āˆ‘_{p∈Pi} ‖cāˆ—āˆ’p‖ = optcostāˆž,1(k,š’«).

ā—€

Theorem 17.

For a parameter k, a set of points P={p1,…,pl}āŠ‚ā„2, and weights W={w1,…,wl}, where wi≄0 is the weight associated with point pi, let š°šžš¢š š”š­šžššŠš‚šžš§š­šžš«(k,P,W) denote any algorithm which achieves an α1-approximation for the weighted k-center problem. Also, for a point set PāŠ‚ā„2, let šŸš¦šžšš¢ššš§š€š„š (P) denote any algorithm that achieves an α2-approximation for the 1-median problem.

Given a parameter k and collection of sets š’«={P1,…,Pm} in ā„2, there is a polynomial time (α1+α2+α1⋅α2)-approximation for computing o⁢p⁢tāˆž,1⁢(k,š’«).

Proof.

For any set CāŠ‚ā„2 of k centers, and any set of points X={x1,…,xm}, by the triangle inequality we have

c⁢o⁢s⁢tāˆž,1⁢(C,š’«) =maxPiāˆˆš’«ā”minc∈Cā¢āˆ‘p∈Pi‖cāˆ’p‖≤maxPiāˆˆš’«ā”minc∈C⁔(āˆ‘p∈Pi‖xiāˆ’p‖+āˆ‘p∈Pi‖cāˆ’xi‖)
≤maxPiāˆˆš’«ā¢āˆ‘p∈Pi‖xiāˆ’p‖+maxPiāˆˆš’«ā”minc∈C⁔|Pi|⋅‖cāˆ’xi‖

Now, let the optimal set of centers for our problem be Cāˆ— = optāˆž,1(k,š’«), and let C′ be the α1-approximation returned by š°šžš¢š š”š­šžššŠš‚šžš§š­šžš«(k,X,W), where X={x1,…,xm} is the set of α2-approximate 1-medians of the Pi sets, i.e. xi = šŸš¦šžšš¢ššš§š€š„š (Pi), and W={|P1|,…,|Pm|}. Then,

c⁢o⁢s⁢t (C′,š’«)āˆž,1
≤maxPiāˆˆš’«ā¢āˆ‘p∈Pi‖xiāˆ’p‖+maxPiāˆˆš’«ā”minc∈C′⁔|Pi|⋅‖cāˆ’xi‖
≤maxPiāˆˆš’«ā”(α2ā‹…minĪ¼āˆˆā„2ā¢āˆ‘p∈Piā€–Ī¼āˆ’p‖)+α1ā‹…minCĀÆāŠ‚ā„2,|CĀÆ|=k⁔(maxPiāˆˆš’«ā”minc∈C¯⁔|Pi|⋅‖cāˆ’xi‖)
≤α2ā‹…maxPiāˆˆš’«ā”minĪ¼āˆˆā„2ā¢āˆ‘p∈Piā€–Ī¼āˆ’p‖+α1ā‹…maxPiāˆˆš’«ā”minc∈Cāˆ—ā”|Pi|⋅‖cāˆ’xi‖
≤α2ā‹…c⁢o⁢s⁢t⁢o⁢p⁢tāˆž,1⁢(k,š’«)+α1ā‹…maxPiāˆˆš’«ā”minc∈Cāˆ—ā¢āˆ‘p∈Pi(‖xiāˆ’p‖+‖cāˆ’p‖) Lemmma 16
≤α2ā‹…c⁢o⁢s⁢t⁢o⁢p⁢tāˆž,1⁢(k,š’«)+α1ā‹…(maxPiāˆˆš’«ā¢āˆ‘p∈Pi‖xiāˆ’p‖+maxPiāˆˆš’«ā”minc∈Cāˆ—ā¢āˆ‘p∈Pi‖cāˆ’p‖)
≤α2ā‹…c⁢o⁢s⁢t⁢o⁢p⁢tāˆž,1⁢(k,š’«)+α1ā‹…(maxPiāˆˆš’«ā”(α2ā‹…minĪ¼āˆˆā„2ā¢āˆ‘p∈Piā€–Ī¼āˆ’p‖)+c⁢o⁢s⁢t⁢o⁢p⁢tāˆž,1⁢(k,š’«))
≤α2ā‹…c⁢o⁢s⁢t⁢o⁢p⁢tāˆž,1⁢(k,š’«)+α1ā‹…(α2ā‹…c⁢o⁢s⁢t⁢o⁢p⁢tāˆž,1⁢(k,š’«)+c⁢o⁢s⁢t⁢o⁢p⁢tāˆž,1⁢(k,š’«))
=(α1+α2+α1⋅α2)ā‹…c⁢o⁢s⁢t⁢o⁢p⁢tāˆž,1⁢(k,š’«) Ā 

ā—€

Corollary 18.

Given a parameter k and collection of sets š’«={P1,…,Pm} in ā„2, there is a polynomial time (5+ε)-approximation for computing o⁢p⁢tāˆž,1⁢(k,š’«), for any constant ε>0.

Proof.

Using [11], we can get a polynomial time (1+ε/3)-approximation for the 1-median, for any constant ε>0. Lemma 15 gives us a polynomial time 2-approximation for weighted k-center. So we have α1=2, α2=(1+ε/3), and (α1+α2+α1α2) = (2+(1+ε/3)+2(1+ε/3)) = (5+ε). Thus, applying Theorem 17 with these algorithms as subroutines gives a polynomial time (5+ε)-approximation for optāˆž,1(k,š’«). ā—€

6.1 Approximation Improvement

We now show how to improve the above (5+ε)-approximation to a (3+ε)-approximation, by constructing an approximate decision procedure and searching over candidate radii with it.

Algorithm 2 decider(š’«={P1,…⁢Pm}, k, r).
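A hedged sketch of the decision procedure, matching the description in the proof of Lemma 19 below (one_median_alg stands in for any α-approximate 1-median subroutine; the names are ours):

```python
import math

def decider(sets, k, r, one_median_alg):
    """Return centers C with cost_{inf,1}(C, sets) <= r, or None, certifying
    optcost_{inf,1}(k, sets) > r/(2 + alpha) as shown in Lemma 19."""
    f1 = lambda c, P: sum(math.dist(c, p) for p in P)
    meds = [one_median_alg(P) for P in sets]          # approximate 1-medians
    unmarked, C = set(range(len(sets))), []
    while unmarked:
        i = max(unmarked, key=lambda j: len(sets[j])) # largest unmarked set
        C.append(meds[i])
        # Mark every set covered within cost r by the newly added center.
        unmarked = {j for j in unmarked if f1(meds[i], sets[j]) > r}
        if len(C) > k:
            return None
    return C
```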
Lemma 19.

For a point set PāŠ‚ā„2, let šŸā¢š¦ā¢šžā¢šā¢š¢ā¢ššā¢š§ā¢š€ā¢š„ā¢š ā¢(P) denote any algorithm that achieves an α-approximation for the 1-median problem.

Given a collection of sets š’«={P1,…,Pm} in ā„2, a parameter k, and a radius rāˆˆā„, then decider(š’«,k,r) either returns a set CāŠ‚ā„2 of k centers such that costāˆž,1⁢(C,š’«)≤r, or correctly returns that o⁢p⁢t⁢c⁢o⁢s⁢tāˆž,1⁢(k,š’«)>r/(2+α).

Proof.

decider(š’«, k, r) starts by initializing all point sets in š’« as unmarked, and computing the approximate medians of each point set using 1medianAlg. Next, the algorithm picks the approximate median of the unmarked point set which has the most points, adds it to the set C of centers, and mark all the sets that are ≤r from this approximate median. This process is repeated until all sets are marked. At the end of the loop, if |C|≤k, then the algorithm returns C, as clearly we have obtained a valid clustering with cost ≤r.

Otherwise, |C|≄k+1, in which case the algorithm returns ā€œoptcostāˆž,1(k,š’«) > r/(2+α)ā€, which we now argue is correct. Let {c1,…,ck+1} be the first k+1 centers selected by the algorithm, where ci∈M was selected in the ith iteration, and let š’®={S1,S2,…,Sk+1}, where Siāˆˆš’« is the set such that ci = 1medianAlg(Si). Let Cāˆ— = optāˆž,1(k,š’«). Then by the pigeonhole principle, there must be some cāˆ—āˆˆCāˆ— and sets Si,Sjāˆˆš’® with i<j such that cāˆ— = argmin_{c∈Cāˆ—} āˆ‘_{s∈Si} ‖cāˆ’s‖ = argmin_{c∈Cāˆ—} āˆ‘_{s∈Sj} ‖cāˆ’s‖. Note that as i<j, by line 5 of the algorithm we know |Si|≄|Sj|. Then,

āˆ‘_{s∈Sj} ‖ciāˆ’s‖ ≤ āˆ‘_{s∈Sj} (‖ciāˆ’cāˆ—ā€– + ‖cāˆ—āˆ’s‖) = āˆ‘_{s∈Sj} ‖ciāˆ’cāˆ—ā€– + āˆ‘_{s∈Sj} ‖cāˆ—āˆ’s‖
 ≤ |Sj|·‖ciāˆ’cāˆ—ā€– + optcostāˆž,1(k,š’«) ≤ |Si|·‖ciāˆ’cāˆ—ā€– + optcostāˆž,1(k,š’«)
 ≤ āˆ‘_{s∈Si} (‖ciāˆ’s‖ + ‖cāˆ—āˆ’s‖) + optcostāˆž,1(k,š’«)
 = āˆ‘_{s∈Si} ‖ciāˆ’s‖ + āˆ‘_{s∈Si} ‖cāˆ—āˆ’s‖ + optcostāˆž,1(k,š’«)
 ≤ αĀ·min_{cā€²āˆˆā„2} āˆ‘_{s∈Si} ‖cā€²āˆ’s‖ + optcostāˆž,1(k,š’«) + optcostāˆž,1(k,š’«)
 ≤ αĀ·optcostāˆž,1(k,š’«) + 2Ā·optcostāˆž,1(k,š’«)
 = (2+α)Ā·optcostāˆž,1(k,š’«)

Since i<j, we know that ci did not cover Sj within radius r, that is, r < āˆ‘_{s∈Sj} ‖ciāˆ’s‖. Thus by the above, r < āˆ‘_{s∈Sj} ‖ciāˆ’s‖ ≤ (2+α)Ā·optcostāˆž,1(k,š’«). Therefore, the algorithm correctly returns that optcostāˆž,1(k,š’«) > r/(2+α). ā—€

Theorem 20.

Given a parameter k and a collection of point sets š’«={P1,…,Pm} in ā„2, for any constant 0<ε≤1, there exists a polynomial time (3+ε)-approximation algorithm for computing o⁢p⁢tāˆž,1⁢(k,š’«).

Proof.

Let C′ be the (5+ε)-approximate set of centers returned by Corollary 18. Let r = costāˆž,1(C′,š’«) and let rāˆ— = optcostāˆž,1(k,š’«), where by Corollary 18 we know that rāˆ— ∈ [r/(5+ε), r]. Consider the candidate radii set R={r0,r1,…,rz}, where ri = (r/(5+ε))Ā·(1+ε/4)^i and z = ⌈log_{1+ε/4}(5+ε)āŒ‰ = O(1) for any constant ε>0. Note that for any i we have ri+1/ri = (1+ε/4), and z was chosen such that rz ≄ r.

Now we iterate through the candidate radii ri∈R, in order from i=0 to z. In the ith iteration we call Algorithm 2 with radius (2+α)Ā·ri, where α is the approximation ratio of the subroutine used to compute the approximate 1-medians. We can use [11] to get an α=(1+ε/8)-approximate 1-median, which runs in polynomial time for any constant ε>0. Let j be the first iteration where Algorithm 2 returns a set C rather than ā€œoptcostāˆž,1(k,š’«) > (2+α)Ā·rj/(2+α)ā€. We claim that this set C is the desired approximation.

First, observe that this procedure is guaranteed to return some set C, that is, j is well defined. Specifically, when j=z, the call to Algorithm 2 with radius (2+α)Ā·rz must return a set C, since otherwise by Lemma 19 the optimal cost rāˆ— > (2+α)Ā·rz/(2+α) = rz ≄ r, contradicting the fact that rāˆ— ∈ [r/(5+ε), r].

Lemma 19 guarantees that costāˆž,1(C,š’«) ≤ (2+α)Ā·rj. If j=0, then we are done, as we know (2+α)Ā·rj = (3+ε/8)Ā·r0 = (3+ε/8)Ā·r/(5+ε) ≤ (3+ε/8)Ā·rāˆ— ≤ (3+ε)Ā·rāˆ—. Otherwise, j>0, in which case we know the call with rjāˆ’1 returned ā€œoptcostāˆž,1(k,š’«) > (2+α)Ā·rjāˆ’1/(2+α)ā€, and so again by Lemma 19 we know that rāˆ— > (2+α)Ā·rjāˆ’1/(2+α) = rjāˆ’1. Summarizing, rjāˆ’1 < rāˆ— ≤ costāˆž,1(C,š’«) ≤ (2+α)Ā·rj, and so the approximation quality of the solution C is bounded by the ratio:

((2+α)Ā·rj)/rjāˆ’1 = (2+α)(1+ε/4) = (3+ε/8)(1+ε/4) = 3 + (7/8)ε + ε²/32 ≤ 3+ε

Overall this is a polynomial time algorithm, as we made O⁢(1) calls to Algorithm 2, which itself is polynomial time when using a polynomial time subroutine to compute the approximate 1-medians. ā—€
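Putting the pieces together, the search in this proof can be sketched as follows (decider is the sketch above, and coarse is the (5+ε)-approximate center set of Corollary 18):

```python
import math

def sets_kcenter_sum(sets, k, eps, one_median_alg, coarse):
    """(3+eps)-approximation of Theorem 20: geometric search over the radii R."""
    alpha = 1 + eps / 8                       # 1-median approximation quality
    f1 = lambda c, P: sum(math.dist(c, p) for p in P)
    r = max(min(f1(c, P) for c in coarse) for P in sets)   # cost of coarse
    z = math.ceil(math.log(5 + eps) / math.log(1 + eps / 4))
    for i in range(z + 1):
        r_i = (r / (5 + eps)) * (1 + eps / 4) ** i
        C = decider(sets, k, (2 + alpha) * r_i, one_median_alg)
        if C is not None:                     # guaranteed by i = z (see above)
            return C
```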

References

  • [1] Pankaj K. Agarwal and Cecilia Magdalena Procopiuc. Exact and approximation algorithms for clustering. Algorithmica, 33(2):201–226, 2002. doi:10.1007/s00453-001-0110-y.
  • [2] Mihai Badoiu and Kenneth L. Clarkson. Smaller core-sets for balls. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 801–802. ACM/SIAM, 2003. URL: http://dl.acm.org/citation.cfm?id=644108.644240.
  • [3] Moses Charikar and Rina Panigrahy. Clustering to minimize the sum of cluster diameters. J. Comput. Syst. Sci., 68(2):417–441, 2004. doi:10.1016/J.JCSS.2003.07.014.
  • [4] Ke Chen. On coresets for k-median and k-means clustering in metric and Euclidean spaces and their applications. SIAM J. Comput., 39(3):923–947, 2009. doi:10.1137/070699007.
  • [5] Vincent Cohen-Addad, Karthik C. S., and Euiwoong Lee. Johnson coverage hypothesis: Inapproximability of k-means and k-median in ā„“p-metrics. In Proceedings of the 2022 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1493–1530. SIAM, 2022. doi:10.1137/1.9781611977073.63.
  • [6] Eduard Eiben, Fedor V. Fomin, Petr A. Golovach, William Lochet, Fahad Panolan, and Kirill Simonov. EPTAS for k-means clustering of affine subspaces. In DĆ”niel Marx, editor, Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2649–2659. SIAM, 2021. doi:10.1137/1.9781611976465.157.
  • [7] TomĆ”s Feder and Daniel H. Greene. Optimal algorithms for approximate clustering. In Proceedings of the 20th Annual ACM Symposium on Theory of Computing (STOC), pages 434–444. ACM, 1988. doi:10.1145/62212.62255.
  • [8] Dan Feldman, Morteza Monemizadeh, and Christian Sohler. A PTAS for k-means clustering based on weak coresets. In Proceedings of the 23rd ACM Symposium on Computational Geometry (SoCG), pages 11–18. ACM, 2007. doi:10.1145/1247069.1247072.
  • [9] Jie Gao, Michael Langberg, and Leonard J. Schulman. Clustering lines in high-dimensional space: Classification of incomplete data. ACM Trans. Algorithms, 7(1), December 2010. doi:10.1145/1868237.1868246.
  • [10] T. F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293–306, 1985. doi:10.1016/0304-3975(85)90224-5.
  • [11] Sariel Har-Peled and Soham Mazumdar. On coresets for k-means and k-median clustering. In LĆ”szló Babai, editor, Proceedings of the 36th Annual ACM Symposium on Theory of Computing (STOC), pages 291–300. ACM, 2004. doi:10.1145/1007352.1007400.
  • [12] D. S. Hochbaum and D. B. Shmoys. A best possible heuristic for the k-center problem. Mathematics of Operations Research, 10(2):180–184, 1985. doi:10.1287/moor.10.2.180.
  • [13] W. Hsu and G. L. Nemhauser. Easy and hard bottleneck location problems. Discrete Applied Mathematics, 1(3):209–215, 1979. doi:10.1016/0166-218X(79)90044-1.
  • [14] Hongyao Huang, Georgiy Klimenko, and Benjamin Raichel. Clustering with neighborhoods. In 32nd International Symposium on Algorithms and Computation (ISAAC), volume 212 of LIPIcs, pages 6:1–6:17. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPICS.ISAAC.2021.6.
  • [15] Ibrahim Jubran, Murad Tukan, Alaa Maalouf, and Dan Feldman. Sets clustering. In Proceedings of the 37th International Conference on Machine Learning (ICML), volume 119 of Proceedings of Machine Learning Research, pages 4994–5005. PMLR, 2020. URL: http://proceedings.mlr.press/v119/jubran20a.html.
  • [16] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. A local search approximation algorithm for k-means clustering. Comput. Geom., 28(2-3):89–112, 2004. doi:10.1016/J.COMGEO.2004.03.003.
  • [17] Amit Kumar, Yogish Sabharwal, and Sandeep Sen. Linear-time approximation schemes for clustering problems in any dimensions. J. ACM, 57(2), February 2010. doi:10.1145/1667053.1667054.
  • [18] Sagi Lotan, Ernesto Evgeniy Sanches Shayda, and Dan Feldman. Coreset for line-sets clustering. In Advances in Neural Information Processing Systems (NeurIPS), 2022. URL: http://papers.nips.cc/paper_files/paper/2022/hash/f2ce95887c34393af4eb240d60017860-Abstract-Conference.html.
  • [19] Qingzhou Wang and Kam-Hoi Cheng. A heuristic algorithm for the k-center problem with vertex weight. In International Symposium on Algorithms (SIGAL), volume 450 of Lecture Notes in Computer Science, pages 388–396. Springer, 1990. doi:10.1007/3-540-52921-7_88.
  • [20] Guang Xu and Jinhui Xu. Efficient approximation algorithms for clustering point-sets. Computational Geometry, 43(1):59–66, 2010. Special Issue on the 14th Annual Fall Workshop. doi:10.1016/j.comgeo.2007.12.002.