Clustering Point Sets Revisited
Abstract
In the sets clustering problem one is given a collection $\mathcal{P} = \{P_1, \ldots, P_n\}$ of point sets in $\mathbb{R}^d$, where for any set $C$ of $k$ centers in $\mathbb{R}^d$, each $P_i$ is assigned to its nearest center as determined by some local cost function. The goal is then to select a set of $k$ centers to minimize some global cost function of the corresponding local assignment costs. Specifically, we consider either summing or taking the maximum cost over all $P_i \in \mathcal{P}$, where for each $P_i$ the cost of assigning it to a center $c$ is either $\max_{p \in P_i} \|p - c\|$, $\sum_{p \in P_i} \|p - c\|$, or $\sum_{p \in P_i} \|p - c\|^2$.
Different combinations of the global and local cost functions naturally generalize the $k$-center, $k$-median, and $k$-means clustering problems. In this paper, we improve the prior results for the natural generalization of $k$-center, give the first result for the natural generalization of $k$-means, and give results for generalizations of $k$-median and $k$-center which differ from those previously studied.
Keywords and phrases: Clustering, k-center, k-median, k-means
2012 ACM Subject Classification: Theory of computation → Computational geometry
Funding: Work on this paper was partially supported by NSF CAREER AF Award CCF-1750780 and NSF AF Award CCF-2311179.
1 Introduction
Clustering is one of the most well studied problems in computational geometry and computer science as a whole, with a variety of applications. For center based clustering, given a point set $P \subset \mathbb{R}^d$ and an integer parameter $k$, the goal is to select a set $C$ of $k$ centers, so as to minimize some cost function of the distances determined by assigning each point in $P$ to its nearest center in $C$. Specifically, for any $p \in P$, let $d(p, C) = \min_{c \in C} \|p - c\|$ denote the distance from $p$ to its nearest center in $C$. Then the $k$-center, $k$-median, and $k$-means objectives are to find the set $C$ minimizing $\max_{p \in P} d(p, C)$, $\sum_{p \in P} d(p, C)$, and $\sum_{p \in P} d(p, C)^2$, respectively.
$k$-center, $k$-median, and $k$-means clustering are all known to be NP-hard. $k$-center is NP-hard to approximate within any factor less than 2 in general metric spaces [13], and even in the plane is still hard to approximate within a factor of roughly 1.82 [7]. Conversely, both the standard greedy algorithm by Gonzalez [10] and the alternative scooping method by Hochbaum and Shmoys [12] achieve a 2-approximation for $k$-center. The $k$-median and $k$-means problems are both known to be hard to approximate within a factor of $1 + c$ for some constant $c > 0$, even in Euclidean space. Different values of $c$ are known depending on whether the points are in low or high dimensional Euclidean space, some other metric norm, or a general metric space. Conversely, there are several constant factor approximation algorithms for both problems. For a more in depth discussion see [5] and references therein.
For points in $\mathbb{R}^d$, PTASs exist for these center based clustering problems when $k$ and $d$ are bounded. For $k$-center, Agarwal and Procopiuc [1] gave a $(1+\varepsilon)$-approximation in $O(n \log k) + (k/\varepsilon)^{O(k^{1-1/d})}$ time. For $k$-median and $k$-means, Har-Peled and Mazumdar [11] used coresets to achieve a $(1+\varepsilon)$-approximation algorithm whose running time is linear in $n$. Subsequent coreset based papers improve the time dependency on $k$, $d$, and $\varepsilon$ [4, 8]. Using a sampling based approach, [17] provided a $(1+\varepsilon)$-approximation with constant probability in $O(2^{(k/\varepsilon)^{O(1)}} d n)$ time, where the probability can be improved by boosting. In general, there are many approximation schemes for $k$-median and $k$-means in Euclidean settings, though we note a linear dependence on $n$ and $d$ is possible (e.g. [17]), and for $k$-median specifically a polynomial dependence on $k$ is possible for bounded $d$ (e.g. [11]).
Sets Clustering
While standard clustering focuses on clustering points, naturally one can consider clustering more general geometric objects. In this paper, we consider the sets clustering problem [20], where one must cluster a collection of sets, $\mathcal{P} = \{P_1, \ldots, P_n\}$, of total size $N = \sum_{i=1}^n |P_i|$. Here each $P_i$ can for example be viewed as a sample from a given object of study, whether say a physical object in 3d, or some class in a high dimensional feature space. With the interpretation that all points in $P_i$ arise from the same object, we thus naturally require they all be covered by the same center.
Given a center $c \in \mathbb{R}^d$, we let $f(P_i, c)$ denote the cost of assigning $P_i$ to $c$, and let $f(P_i, C) = \min_{c \in C} f(P_i, c)$. Here we consider the cost functions $f_\infty(P_i, c) = \max_{p \in P_i} \|p - c\|$, $f_1(P_i, c) = \sum_{p \in P_i} \|p - c\|$, and $f_2(P_i, c) = \sum_{p \in P_i} \|p - c\|^2$. Note these cost functions are the 1-center, 1-median, and 1-means costs, and thus are motivated by the applications of those respective problems. However, unlike the 1-center/median/means problems, here we must cluster multiple sets and thus we need an aggregate cost. We consider two possibilities for the cost of clustering $\mathcal{P}$, namely $\mathrm{cost}_{1,\alpha}(\mathcal{P}, C) = \sum_i f_\alpha(P_i, C)$ and $\mathrm{cost}_{\infty,\alpha}(\mathcal{P}, C) = \max_i f_\alpha(P_i, C)$. Observe that when $\mathcal{P}$ is a collection of singleton points, $\mathrm{cost}_{\infty,\infty}$, $\mathrm{cost}_{1,1}$, and $\mathrm{cost}_{1,2}$ respectively capture the $k$-center, $k$-median, and $k$-means objectives on the point set consisting of these singletons. Conversely, Figure 1.1 shows how when the sets in $\mathcal{P}$ are not singletons, the optimal solutions of the variants of sets clustering that we consider differ from the $k$-center, $k$-median, and $k$-means solutions on $\bigcup_i P_i$.
Several prior works considered different variants of the sets clustering problem, most notably [20]. Our aim is both to improve prior results and to consider new variants of the sets clustering problem. We break our discussion into whether a particular sets clustering problem is most naturally viewed as a generalization of $k$-center, $k$-means, or $k$-median.
- $k$-center: Minimizing $\mathrm{cost}_{\infty,\infty}$ naturally generalizes the $k$-center problem. Previously, for points in constant dimensions, [20] provided an efficient 3-approximation as well as a slower $(1+\sqrt{3})$-approximation. These results require minimum enclosing ball (MEB) computations, and thus have running times with hidden constants depending on $d$, though they remark that for higher dimensions they can use a $(1+\varepsilon)$-approximate MEB at the cost of a $(1+\varepsilon)$ factor in the approximation quality.
In Section 3, for points in any dimension $d$, we show that a natural adaptation of the standard $O(nk)$ time greedy 2-approximation algorithm for $k$-center remains an $O(Nk)$ time 2-approximation for the sets clustering problem, a fact which surprisingly appears to have been overlooked in prior work. Moreover, our 2-approximation algorithm works in any metric space, and thus is optimal for general metric spaces (as it captures $k$-center, which is hard to approximate within any factor less than 2). Additionally, for any constant $\varepsilon > 0$, using a standard grid based approach we provide an $O(Nk) \cdot O(k/\varepsilon^d)^k$ time $(1+\varepsilon)$-approximation, that is, a linear time algorithm when $k$, $d$, and $\varepsilon$ are constants.
- $k$-means: Minimizing $\mathrm{cost}_{1,2}$ naturally generalizes the $k$-means problem, and we study this objective in Section 4. Surprisingly, it appears this cost function for sets clustering has not been studied in prior works, despite the ubiquity of $k$-means clustering. Consider the point set obtained by replacing each set $P_i$ with $|P_i|$ copies of its centroid. Using previously observed facts about the geometry of the $k$-means objective, we prove the strong statement that an $\alpha$-approximation to the $k$-means clustering problem on this point set is also an $\alpha$-approximation for the sets clustering problem.
- $k$-median: Minimizing $\mathrm{cost}_{1,1}$ generalizes the $k$-median problem. Previously, [20] provided a polynomial time approximation in $\mathbb{R}^d$ for constant $d$, by replacing each set $P_i$ with $|P_i|$ copies of its approximate 1-median, and then applying a $(1+\varepsilon)$-approximation to $k$-median for constant $d$ (such as [11]). In Section 5, we instead consider the problem of minimizing $\mathrm{cost}_{1,\infty}$, which is also equivalent to $k$-median when $\mathcal{P}$ consists of singletons. (There is also a loose connection between $\mathrm{cost}_{1,\infty}$ and clustering to minimize the sum of radii [3], though unlike $k$-median there is no immediate reduction.) Minimizing $\mathrm{cost}_{1,\infty}$ models the case where how well a center covers a set is determined by the furthest point in the set from the center (i.e. the minimum radius ball at the center enclosing the whole set). To address this problem we apply a similar approach as [20], replacing each set with the center of its minimum enclosing ball. We argue that this provides a $(1+\alpha)$-approximation, where $\alpha$ is the approximation quality of the $k$-median subroutine.
- $k$-center Alternative: In Section 6 we consider the problem of minimizing $\mathrm{cost}_{\infty,1}$. Similar to the $\mathrm{cost}_{\infty,\infty}$ objective considered in Section 3 and prior work, this problem also generalizes standard $k$-center clustering. This problem is also related to the $\mathrm{cost}_{1,\infty}$ problem considered in Section 5, but where we invert the max and sum operators. This problem is more challenging, though we still can provide a polynomial time $(3+\varepsilon)$-approximation for point sets in the plane, for any constant $\varepsilon > 0$.
We emphasize that the algorithms in Sections 3, 4, and 5 for the $\mathrm{cost}_{\infty,\infty}$, $\mathrm{cost}_{1,2}$, and $\mathrm{cost}_{1,\infty}$ objectives either modify or directly reduce to the algorithms respectively used for $k$-center, $k$-means, and $k$-median. Thus our running times are virtually equivalent to the algorithms used for these standard clustering problems, which is the best one could hope for, as these standard clustering problems are special cases of our sets clustering problems. On the other hand, our polynomial time approximation for the more challenging $\mathrm{cost}_{\infty,1}$ objective in Section 6 is slower, and we leave it as an open problem for future work to optimize the running time.
We study the combinations of maximums, sums, and sums of squares which we believe to be the most natural, though other combinations exist that we do not study, such as squaring sums of squares. Several prior works also considered clustering based on the closest point in a set to its nearest center. Specifically, let $f_{\min}(P_i, c) = \min_{p \in P_i} \|p - c\|$; then [15] considered minimizing $\sum_i \min_{c \in C} f_{\min}(P_i, c)$, providing a near linear time $(1+\varepsilon)$-approximation for the case when $k$, $\varepsilon$, and $m$ are constants, where $m = \max_i |P_i|$. [14] considered the problem of minimizing $\max_i \min_{c \in C} f_{\min}(P_i, c)$, where the sets are continuous convex regions, providing a number of results such as an $O(1)$-approximation for the case of disks. Many other papers have considered clustering other continuous geometric objects, and in particular unbounded objects such as lines or hyperplanes [9, 18, 6], though such results are less relevant than the above mentioned papers on clustering discrete sets.
2 Setup
For two points $p, q \in \mathbb{R}^d$, we use $\|p - q\|$ to denote the Euclidean distance between $p$ and $q$. Similarly, for a point $p$ and a finite point set $X \subset \mathbb{R}^d$ we have $d(p, X) = \min_{x \in X} \|p - x\|$.
Given a point set $P$ and a set of centers $C$, the standard cost functions for $k$-center, $k$-median, and $k$-means clustering are as follows:
$$\mathrm{cost}_\infty(P, C) = \max_{p \in P} d(p, C), \qquad \mathrm{cost}_1(P, C) = \sum_{p \in P} d(p, C), \qquad \mathrm{cost}_2(P, C) = \sum_{p \in P} d(p, C)^2.$$
Given a parameter $k$ and point set $P$, the goal of $k$-center, $k$-median, or $k$-means clustering is then to find a set $C$ of $k$ centers minimizing the respective cost function.
We now generalize these notions to collections of point sets. So let $\mathcal{P} = \{P_1, \ldots, P_n\}$ be a collection of $n$ sets of points, where $P_i \subset \mathbb{R}^d$ for all $i$, and $N = \sum_{i=1}^n |P_i|$ is the total size. For a point $c \in \mathbb{R}^d$, let $f_\alpha(P_i, c)$ be some non-negative function, representing the cost of assigning point set $P_i$ to a center $c$. We consider three such cost functions, denoted by $f_\infty$, $f_1$, or $f_2$:
$$f_\infty(P_i, c) = \max_{p \in P_i} \|p - c\|, \qquad f_1(P_i, c) = \sum_{p \in P_i} \|p - c\|, \qquad f_2(P_i, c) = \sum_{p \in P_i} \|p - c\|^2.$$
(The different $\alpha$ values are supposed to be vaguely reminiscent of the corresponding vector norms.)
For a single point $q$ we write $f_\alpha(q, c) = f_\alpha(\{q\}, c)$, where in particular we have $f_\infty(q, c) = f_1(q, c) = \|q - c\|$ and $f_2(q, c) = \|q - c\|^2$.
Given a set $C$ of centers, define $f_\alpha(P_i, C) = \min_{c \in C} f_\alpha(P_i, c)$, that is, $P_i$ is assigned to its nearest center under the function $f_\alpha$. Now the values $f_\alpha(P_i, C)$ over all $P_i \in \mathcal{P}$ define a vector of length $n$, and we consider either its $\ell_1$ or $\ell_\infty$ norm. Namely, we define the cost functions $\mathrm{cost}_{\beta,\alpha}$, for $\beta = 1$ or $\beta = \infty$, as follows:
$$\mathrm{cost}_{1,\alpha}(\mathcal{P}, C) = \sum_{i=1}^n f_\alpha(P_i, C), \qquad \mathrm{cost}_{\infty,\alpha}(\mathcal{P}, C) = \max_{1 \le i \le n} f_\alpha(P_i, C).$$
Finally, let $\mathrm{opt}_{\beta,\alpha}(\mathcal{P}, k) = \min_{C \subset \mathbb{R}^d, |C| = k} \mathrm{cost}_{\beta,\alpha}(\mathcal{P}, C)$, and let $C^*_{\beta,\alpha}$ denote a set of $k$ centers achieving $\mathrm{opt}_{\beta,\alpha}(\mathcal{P}, k)$. For some $\gamma \ge 1$, a set $C$ with $|C| = k$ is referred to as a $\gamma$-approximation to $C^*_{\beta,\alpha}$ if $\mathrm{cost}_{\beta,\alpha}(\mathcal{P}, C) \le \gamma \cdot \mathrm{opt}_{\beta,\alpha}(\mathcal{P}, k)$. (That is, $k$ is a hard constraint and the approximation is on the cost, as is standard for clustering.)
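To make these definitions concrete, the following is a minimal Python sketch of the local costs $f_\alpha$ and global costs $\mathrm{cost}_{\beta,\alpha}$; the function names and the representation of sets as lists of coordinate tuples are our own illustration, not from the paper.

```python
import math

def f(P, c, alpha):
    """Local cost f_alpha(P, c) of assigning point set P to center c."""
    dists = [math.dist(p, c) for p in P]
    if alpha == "inf":   # 1-center cost: furthest point
        return max(dists)
    if alpha == 1:       # 1-median cost: sum of distances
        return sum(dists)
    if alpha == 2:       # 1-means cost: sum of squared distances
        return sum(x * x for x in dists)
    raise ValueError(alpha)

def f_to_centers(P, C, alpha):
    """f_alpha(P, C): P is assigned to its single nearest center in C."""
    return min(f(P, c, alpha) for c in C)

def cost(Ps, C, beta, alpha):
    """Global cost: l_1 or l_infty norm of the per-set assignment costs."""
    vals = [f_to_centers(P, C, alpha) for P in Ps]
    return sum(vals) if beta == 1 else max(vals)
```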
Observation 1.
Given a point set $P$, let $\mathcal{P}$ denote the collection of singleton sets $\{p\}$ for each $p \in P$. Then $\mathrm{cost}_{\infty,\infty}(\mathcal{P}, C) = \mathrm{cost}_\infty(P, C)$, $\mathrm{cost}_{1,1}(\mathcal{P}, C) = \mathrm{cost}_1(P, C)$, and $\mathrm{cost}_{1,2}(\mathcal{P}, C) = \mathrm{cost}_2(P, C)$.
3 k-center Sets Clustering
Given a collection $\mathcal{P} = \{P_1, \ldots, P_n\}$ of point sets in $\mathbb{R}^d$ and a parameter $k$, [20] considered the problem of covering $\mathcal{P}$ with $k$ equal radius balls of minimum radius such that each $P_i$ is entirely contained in one of the balls. Using the terminology defined above, this is equivalent to computing $C^*_{\infty,\infty}$. For this problem, [20] gave a $(1+\sqrt{3})$-approximation. Here we show that the standard greedy Gonzalez algorithm for $k$-center clustering can be adapted to yield a 2-approximation. Not only is this a better approximation ratio, but this algorithm will in fact work in any metric space. Moreover, for general metrics it is not possible to get a $(2-\varepsilon)$-approximation for $k$-center clustering for any $\varepsilon > 0$, unless P=NP [13]. As $k$-center is a special case of our problem, our approximation algorithm is thus optimal.
Given a point set $P$ and a parameter $k$, we recall that the Gonzalez algorithm [10] greedily outputs a set of centers $C = \{c_1, \ldots, c_k\}$, where $c_1$ is an arbitrary point in $P$, and for $i > 1$, $c_i = \arg\max_{p \in P} d(p, \{c_1, \ldots, c_{i-1}\})$. We argue the following natural adaptation of this algorithm to our sets clustering problem, which greedily picks the furthest set rather than the furthest point, also yields a 2-approximation. (Figure 1.1 shows that naively running Gonzalez on $\bigcup_i P_i$ does not achieve a 2-approximation. Namely, in that example Gonzalez on the union might pick centers resulting in a cost of 11 for sets clustering on $\mathcal{P}$, whereas a different choice of centers gives a cost of 4.)
Theorem 2.
Given a parameter $k$ and a collection $\mathcal{P} = \{P_1, \ldots, P_n\}$ of point sets in $\mathbb{R}^d$, the above adaptation of the Gonzalez algorithm gives an $O(Nk)$ time 2-approximation for computing $C^*_{\infty,\infty}$, where $N = \sum_{i=1}^n |P_i|$.
Proof.
Let $C = \{c_1, \ldots, c_k\}$ be the output of the adapted Gonzalez algorithm and let $C^*$ be an optimal solution with cost $r^* = \mathrm{cost}_{\infty,\infty}(\mathcal{P}, C^*)$, where recall $\mathrm{cost}_{\infty,\infty}(\mathcal{P}, C^*) = \max_i f_\infty(P_i, C^*)$.

There are two possible cases. In the first case, for every $c^* \in C^*$ there exists some $c_j \in C$ such that $\|c_j - c^*\| \le r^*$. Consider any set $P_i$ with $c^*$ as its nearest optimal center under $f_\infty$, that is $f_\infty(P_i, C^*) = f_\infty(P_i, c^*)$. By definition, $f_\infty(P_i, c^*) \le r^*$. Thus for any point $p \in P_i$, by the triangle inequality, $\|p - c_j\| \le \|p - c^*\| + \|c^* - c_j\| \le 2r^*$. Therefore, $f_\infty(P_i, c_j) \le 2r^*$. This in turn implies $f_\infty(P_i, C) \le 2r^*$, and as this holds for all $P_i \in \mathcal{P}$ it implies $C$ is a 2-approximation.

In the second case, there exists some $c^* \in C^*$, such that for any $c_j \in C$ we have $\|c_j - c^*\| > r^*$. For a given $c_j$, let $P_{\sigma(j)}$ denote the set from which $c_j$ was selected by the algorithm. Observe that if $f_\infty(P_{\sigma(j)}, C^*) = f_\infty(P_{\sigma(j)}, c^*)$ then $r^* \ge f_\infty(P_{\sigma(j)}, c^*) \ge \|c_j - c^*\| > r^*$, a contradiction; that is, no $P_{\sigma(j)}$ is covered by $c^*$ in the optimal solution. However, $f_\infty(P_{\sigma(j)}, C^*) \le r^*$ for all $j$, and so by the pigeon hole principle if such a $c^*$ occurs then there must exist some other $\hat{c} \in C^*$ and centers $c_a, c_b \in C$ with $a < b$ such that $f_\infty(P_{\sigma(a)}, \hat{c}) \le r^*$ and $f_\infty(P_{\sigma(b)}, \hat{c}) \le r^*$. This implies $\|c_a - \hat{c}\| \le r^*$, as $c_a \in P_{\sigma(a)}$. Thus by the triangle inequality, for any $q \in P_{\sigma(b)}$ we have $\|q - c_a\| \le \|q - \hat{c}\| + \|\hat{c} - c_a\| \le 2r^*$, which implies $f_\infty(P_{\sigma(b)}, c_a) \le 2r^*$.

For any $i$, let $C_i = \{c_1, \ldots, c_i\}$, i.e. the first $i$ centers chosen by the algorithm, and let $r_i = \mathrm{cost}_{\infty,\infty}(\mathcal{P}, C_i)$ (where $r_k = \mathrm{cost}_{\infty,\infty}(\mathcal{P}, C)$). Denote the cost of the final solution output by the algorithm as $r = r_k$. Observe that $r_1 \ge r_2 \ge \cdots \ge r_k = r$, as the sets $C_i$ are nested. Above we argued that $f_\infty(P_{\sigma(b)}, c_a) \le 2r^*$ where $a < b$. As the algorithm selected $P_{\sigma(b)}$ as the furthest set in round $b$, we have $r_{b-1} = f_\infty(P_{\sigma(b)}, C_{b-1}) \le f_\infty(P_{\sigma(b)}, c_a) \le 2r^*$, thus $r \le r_{b-1} \le 2r^*$, and thus $C$ is a 2-approximation.
As for the running time, we can achieve $O(Nk)$ time in a similar fashion to the standard $k$-center Gonzalez algorithm. Specifically, let $C_i$ be the set of centers after some $i$ iterations of the algorithm, and assume for all $P_j \in \mathcal{P}$ we maintain the value $f_\infty(P_j, C_i)$. Then in the $(i+1)$th round, $c_{i+1}$ is some arbitrary point from the set maximizing $f_\infty(P_j, C_i)$, which can thus be found by a linear scan of the $n$ maintained values. We also need to update the maintained values to $f_\infty(P_j, C_{i+1})$. To do so observe that for any $P_j$, $f_\infty(P_j, C_{i+1}) = \min\{f_\infty(P_j, C_i), f_\infty(P_j, c_{i+1})\}$. As $f_\infty(P_j, c_{i+1})$ can be computed in $O(|P_j|)$ time, we can thus update all of these values in $O(N)$ time. As the algorithm performs $k$ iterations, it thus takes $O(Nk)$ time overall.
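The following is a minimal Python sketch of the adapted greedy algorithm with the $O(Nk)$ bookkeeping described in the proof above; the name set_gonzalez is our own, and the paper's pseudocode may differ in inessential details.

```python
import math

def set_gonzalez(Ps, k):
    """Greedy 2-approximation for cost_{inf,inf} sets clustering.

    Ps: list of point sets (each a list of coordinate tuples); k: number
    of centers. Runs in O(Nk) time, N = total number of points."""
    def f_inf(P, c):  # 1-center cost of assigning P to center c
        return max(math.dist(p, c) for p in P)

    centers = [Ps[0][0]]  # c_1: an arbitrary point of an arbitrary set
    # Maintain f_inf(P_j, C) for every set P_j.
    costs = [f_inf(P, centers[0]) for P in Ps]
    for _ in range(1, k):
        j = max(range(len(Ps)), key=lambda i: costs[i])  # furthest set
        c = Ps[j][0]  # an arbitrary point of the furthest set
        centers.append(c)
        # Update maintained values: f(P, C u {c}) = min(f(P, C), f(P, c)).
        costs = [min(old, f_inf(P, c)) for old, P in zip(costs, Ps)]
    return centers, max(costs)  # the centers and the achieved cost
```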
The algorithm used in [20] for achieving their approximation requires computing the minimum enclosing ball (MEB) for each point set, and thus has a hidden constant in the running time depending on $d$. Thus for high dimensions, [20] use a $(1+\varepsilon)$-approximate MEB (which can be computed in time linear in the number of points for any constant $\varepsilon$ [2]), resulting in a $(1+\varepsilon)$ factor loss in the approximation quality. In comparison, our algorithm avoids computing the MEB altogether, and as the above proof only relied on the triangle inequality, the same proof verbatim yields a 2-approximation for any metric space. (This is our only result that holds for arbitrary metric spaces, which is why the setup in Section 2 was defined for points specifically in $\mathbb{R}^d$.)
$k$-center clustering is known to be hard to approximate within a factor of roughly 1.82 even in the plane [7]; however, a $(1+\varepsilon)$-approximation is possible in constant dimensions, though at the cost of a running time exponential in $k$. We now show this standard algorithm also applies to our problem.
We will make use of the following standard observation (adapted for our settings).
Observation 3.
Let $X = \{x_1, \ldots, x_m\}$ be a set of points, and let $Y = \{y_1, \ldots, y_m\}$ be a set of points such that $\|x_i - y_i\| \le \delta$ for all $i$. Then for any collection of point sets $\mathcal{P}$, we have that $\mathrm{cost}_{\infty,\infty}(\mathcal{P}, Y) \le \mathrm{cost}_{\infty,\infty}(\mathcal{P}, X) + \delta$. This follows since if for a given set $P_i$ its nearest center under $X$ was $x_j$, then by the triangle inequality we have that $f_\infty(P_i, y_j) \le f_\infty(P_i, x_j) + \|x_j - y_j\| \le f_\infty(P_i, x_j) + \delta$.
Theorem 4.
Given a parameter $k$ and a collection $\mathcal{P} = \{P_1, \ldots, P_n\}$ of point sets in $\mathbb{R}^d$ for constant $d$, for any $\varepsilon > 0$ there is an $O(Nk) \cdot O(k/\varepsilon^d)^k$ time $(1+\varepsilon)$-approximation for computing $C^*_{\infty,\infty}$, where $N = \sum_{i=1}^n |P_i|$.
Proof.
Let $C = \{c_1, \ldots, c_k\}$ be the centers output by the algorithm of Theorem 2, and let $r = \mathrm{cost}_{\infty,\infty}(\mathcal{P}, C)$. Also let $C^*$ be an optimal solution, where $r^* = \mathrm{cost}_{\infty,\infty}(\mathcal{P}, C^*)$. By Theorem 2 we know that $r^* \le r \le 2r^*$.
Let $B = \bigcup_{i=1}^k b(c_i, r)$, where $b(c_i, r)$ is the ball of radius $r$ centered at $c_i$. By definition of $C$ and $r$, we know $\bigcup_i P_i \subseteq B$. We can assume that every center $c^* \in C^*$ is within distance $r^*$ of some point in $\bigcup_i P_i$, as otherwise $c^*$ can be deleted without affecting the optimal solution quality. Thus we have that $C^* \subseteq \bigcup_{i=1}^k b(c_i, r + r^*) \subseteq \bigcup_{i=1}^k b(c_i, 2r)$.
Now consider the axis aligned grid over $\mathbb{R}^d$ with cell side length $\varepsilon r / (2\sqrt{d})$. Any ball of radius $2r$ intersects at most $O(1/\varepsilon^d)$ cells, for constant $d$. Thus $\bigcup_{i=1}^k b(c_i, 2r)$ intersects $k \cdot O(1/\varepsilon^d)$ grid cells, and as $C^*$ is contained in this union, there are thus $k \cdot O(1/\varepsilon^d)$ possible cells for each center in $C^*$. Now for any point $x$, let $g(x)$ be the lowest corner of the grid cell containing $x$ (specifically the coordinate-wise minimal corner). Let $G = \{g(c^*) \mid c^* \in C^*\}$. As the diameter of each cell is $\varepsilon r / 2 \le \varepsilon r^*$, by Observation 3 we have that $\mathrm{cost}_{\infty,\infty}(\mathcal{P}, G) \le r^* + \varepsilon r^* = (1+\varepsilon)\, r^*$.
The above motivates the following algorithm. First compute $C$ and $r$ using the algorithm of Theorem 2. This determines a set of $k \cdot O(1/\varepsilon^d)$ grid cells around the centers of $C$ which must contain $C^*$. We try all $O(k/\varepsilon^d)^k$ possibilities for these cells, and for each possibility we use the lowest corners of the respective cells as our candidate set of centers. We then return as our solution the set of such centers which had the minimum $\mathrm{cost}_{\infty,\infty}$ value. As $G$ will be one of the possible sets of centers considered, where we already argued that $\mathrm{cost}_{\infty,\infty}(\mathcal{P}, G) \le (1+\varepsilon) r^*$, this procedure is guaranteed to output a $(1+\varepsilon)$-approximation.
As for the running time, the initial call to the algorithm of Theorem 2 takes $O(Nk)$ time. There are $O(k/\varepsilon^d)^k$ subsets of cells we consider, and for each one we first compute the set of lower corners in $O(k)$ time. Then we compute $\mathrm{cost}_{\infty,\infty}$ of the candidate centers, which can be done in $O(Nk)$ time. Thus the overall time is $O(Nk) \cdot O(k/\varepsilon^d)^k$.
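Below is a hedged Python sketch of this grid search, building on the set_gonzalez sketch above. For simplicity it collects all cells meeting each ball's bounding box, a superset of the cells meeting the ball itself, which affects only the constant in the enumeration.

```python
import itertools, math

def cost_inf_inf(Ps, C):
    """cost_{inf,inf}: max over sets of the 1-center cost to the nearest center."""
    return max(min(max(math.dist(p, c) for p in P) for c in C) for P in Ps)

def grid_refine(Ps, k, eps, centers, r):
    """(1+eps)-approximation by trying grid corners near the greedy centers.

    centers, r: the 2-approximate output of set_gonzalez. A sketch for
    constant dimension d; the enumeration is exponential in k."""
    d = len(centers[0])
    side = eps * r / (2 * math.sqrt(d))
    corners = set()
    for c in centers:
        lo = [math.floor((x - 2 * r) / side) for x in c]
        hi = [math.floor((x + 2 * r) / side) for x in c]
        for idx in itertools.product(*(range(l, h + 1) for l, h in zip(lo, hi))):
            corners.add(tuple(i * side for i in idx))  # lowest corner of the cell
    best = min(itertools.combinations(sorted(corners), k),
               key=lambda C: cost_inf_inf(Ps, C))
    return list(best), cost_inf_inf(Ps, best)
```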
Remark 5.
For $k$-center clustering it is common to consider the discrete version of the problem where the centers must come from the input point set. One can easily argue that both Theorem 2 and Theorem 4 extend to this variant. (In the proof of Theorem 4, we would restrict to non-empty grid cells for potential center locations.)
4 k-means Sets Clustering
Given $\mathcal{P} = \{P_1, \ldots, P_n\}$ and a parameter $k$, for any set $C$ of $k$ centers recall that
$$\mathrm{cost}_{1,2}(\mathcal{P}, C) = \sum_{i=1}^n f_2(P_i, C) = \sum_{i=1}^n \min_{c \in C} \sum_{p \in P_i} \|p - c\|^2.$$
Note that this is the natural analogue of the $k$-means problem to clustering point sets. Specifically, when each set is a single point, that is $P_i = \{p_i\}$ for all $i$, then this is equivalent to the standard $k$-means clustering cost on the point set $P = \{p_1, \ldots, p_n\}$, as then $\mathrm{cost}_{1,2}(\mathcal{P}, C) = \sum_{i=1}^n \min_{c \in C} \|p_i - c\|^2 = \mathrm{cost}_2(P, C)$.
Definition 6.
For a given point set $P$, the centroid of $P$ is defined as $\mu(P) = \frac{1}{|P|} \sum_{p \in P} p$.
Given a collection $\mathcal{P} = \{P_1, \ldots, P_n\}$ of sets in $\mathbb{R}^d$, let $M_i$ be the set of $|P_i|$ distinct copies of $\mu(P_i)$, and let $M = \bigcup_{i=1}^n M_i$. Note, while the points in $M_i$ are collocated they are viewed as distinct points, and thus $|M_i| = |P_i|$ and $f_2(M_i, c) = |P_i| \cdot \|\mu(P_i) - c\|^2$.
It is well known that the optimal solution to the 1-mean problem with respect to a point set $P$ is the centroid $\mu(P)$. We require the following standard lemma, whose proof can be found in [16].
Lemma 7.
For a point set $P$ with centroid $\mu = \mu(P)$ and any point $c$ we have, $\sum_{p \in P} \|p - c\|^2 = \sum_{p \in P} \|p - \mu\|^2 + |P| \cdot \|\mu - c\|^2$.
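For completeness, the identity can be verified by expanding the square; since $\sum_{p \in P} (p - \mu) = 0$ by the definition of the centroid, the cross term vanishes:
$$\sum_{p \in P} \|p - c\|^2 = \sum_{p \in P} \|(p - \mu) + (\mu - c)\|^2 = \sum_{p \in P} \|p - \mu\|^2 + 2 \Big\langle \mu - c,\ \sum_{p \in P} (p - \mu) \Big\rangle + |P| \cdot \|\mu - c\|^2 = \sum_{p \in P} \|p - \mu\|^2 + |P| \cdot \|\mu - c\|^2.$$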
Given a collection of point sets, we now argue that one can solve the sets clustering problem by solving the $k$-means problem on the centroids of the sets.
Theorem 8.
For a parameter $k$ and a point set $P$, let kmeansAlg$(P, k)$ denote any algorithm which achieves an $\alpha$-approximation for the $k$-means problem.
Given a parameter $k$ and collection $\mathcal{P}$ of sets in $\mathbb{R}^d$, kmeansAlg$(M, k)$ is an $\alpha$-approximation for computing $C^*_{1,2}$.
Proof.
For any point set $P_i$ and its corresponding set $M_i$ of centroid copies, and any center $c$, by Lemma 7 we have:
$$f_2(P_i, c) = \sum_{p \in P_i} \|p - c\|^2 = \sum_{p \in P_i} \|p - \mu(P_i)\|^2 + |P_i| \cdot \|\mu(P_i) - c\|^2 = f_2(P_i, \mu(P_i)) + f_2(M_i, c).$$
Recall from Section 2 that $f_2(P_i, C) = \min_{c \in C} f_2(P_i, c)$. Thus taking the minimum over $c \in C$ above and summing over all $i$ gives:
$$\mathrm{cost}_{1,2}(\mathcal{P}, C) = \sum_{i=1}^n f_2(P_i, \mu(P_i)) + \sum_{i=1}^n f_2(M_i, C) = \sum_{i=1}^n f_2(P_i, \mu(P_i)) + \mathrm{cost}_2(M, C).$$
Thus $\mathrm{cost}_{1,2}(\mathcal{P}, C)$ and $\mathrm{cost}_2(M, C)$ differ by a value independent of $C$. So let $C^*_M$ denote the optimal $k$-means solution on $M$, that is $C^*_M = \arg\min_{|X| = k} \mathrm{cost}_2(M, X)$. As $\sum_i f_2(P_i, \mu(P_i))$ does not depend on $C$, the above equation implies that minimizing $\mathrm{cost}_2(M, C)$ also minimizes $\mathrm{cost}_{1,2}(\mathcal{P}, C)$, i.e. $\mathrm{cost}_{1,2}(\mathcal{P}, C^*_M) = \mathrm{cost}_{1,2}(\mathcal{P}, C^*_{1,2})$.
So let $C$ be the approximation returned by kmeansAlg$(M, k)$, meaning $\mathrm{cost}_2(M, C) \le \alpha \cdot \mathrm{cost}_2(M, C^*_M)$. Then we have
$$\mathrm{cost}_{1,2}(\mathcal{P}, C) = \sum_{i=1}^n f_2(P_i, \mu(P_i)) + \mathrm{cost}_2(M, C) \le \alpha \Big( \sum_{i=1}^n f_2(P_i, \mu(P_i)) + \mathrm{cost}_2(M, C^*_M) \Big) = \alpha \cdot \mathrm{cost}_{1,2}(\mathcal{P}, C^*_M) = \alpha \cdot \mathrm{cost}_{1,2}(\mathcal{P}, C^*_{1,2}).$$
As discussed in the introduction, there are many known $(1+\varepsilon)$-approximation algorithms for $k$-means, with various trade-offs in the running times depending on the parameters $n$, $d$, $k$, and $\varepsilon$. Moreover, as $k$-means is a special case of our problem, we should not expect our times to be faster than what is possible for $k$-means.
Before we can apply such algorithms the above theorem requires we compute $M$. However, this takes only $O(N)$ time, where $N = \sum_i |P_i|$, as the centroid of a set is simply the average of its points. Thus the above theorem immediately implies the following.
Corollary 9.
For a parameter $k$ and a set of $n$ points, let $T(n, k)$ denote the running time of any algorithm which achieves a $(1+\varepsilon)$-approximation for $k$-means.
Given a parameter $k$ and collection $\mathcal{P}$ of sets in $\mathbb{R}^d$, where $N = \sum_i |P_i|$, there is a $(1+\varepsilon)$-approximation for computing $C^*_{1,2}$ with $O(N) + T(N, k)$ running time.
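As an illustration of how simple the reduction is in practice, the following sketch uses scikit-learn's KMeans, with per-centroid weights standing in for the collocated copies, as the $k$-means subroutine. Note Lloyd's algorithm is only a heuristic stand-in here, not a worst-case $(1+\varepsilon)$-approximation.

```python
import numpy as np
from sklearn.cluster import KMeans  # heuristic stand-in subroutine

def sets_kmeans(Ps, k):
    """Reduce cost_{1,2} sets clustering to k-means on replicated centroids."""
    # Replace each set P_i by |P_i| copies of its centroid; equivalently,
    # one centroid point with sample weight |P_i|.
    centroids = np.array([np.mean(P, axis=0) for P in Ps])
    weights = np.array([len(P) for P in Ps])
    km = KMeans(n_clusters=k, n_init=10).fit(centroids, sample_weight=weights)
    return km.cluster_centers_
```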
5 k-median Sets Clustering
Given $\mathcal{P} = \{P_1, \ldots, P_n\}$ and a parameter $k$, for any set $C$ of $k$ centers recall that
$$\mathrm{cost}_{1,\infty}(\mathcal{P}, C) = \sum_{i=1}^n f_\infty(P_i, C) = \sum_{i=1}^n \min_{c \in C} \max_{p \in P_i} \|p - c\|.$$
As discussed in the introduction, the above cost (as well as the $\mathrm{cost}_{1,1}$ objective studied in [20]) generalizes the $k$-median problem to clustering point sets. Specifically, when each set consists of a single point, i.e., $P_i = \{p_i\}$ for all $i$, the problem reduces to the standard $k$-median clustering cost on the point set $P = \{p_1, \ldots, p_n\}$, as then $\mathrm{cost}_{1,\infty}(\mathcal{P}, C) = \sum_{i=1}^n d(p_i, C) = \mathrm{cost}_1(P, C)$.
Lemma 10.
Let $\mathcal{P} = \{P_1, \ldots, P_n\}$ be a collection of sets and $Q = \{q_1, \ldots, q_n\}$ where $q_i$ is the center of the minimum enclosing ball of $P_i$, with radius $r_i$. Then for any set $C$ of centers, $\mathrm{cost}_{1,\infty}(\mathcal{P}, C) \le \sum_{i=1}^n r_i + \sum_{i=1}^n d(q_i, C)$.
Proof.
Since $q_i$ is the center of the minimum enclosing ball of $P_i$, by definition we have that $\max_{p \in P_i} \|p - q_i\| = r_i$. So if $c_i = \arg\min_{c \in C} \|q_i - c\|$, then,
$$\mathrm{cost}_{1,\infty}(\mathcal{P}, C) = \sum_{i=1}^n \min_{c \in C} \max_{p \in P_i} \|p - c\| \le \sum_{i=1}^n \max_{p \in P_i} \|p - c_i\| \le \sum_{i=1}^n \max_{p \in P_i} \big( \|p - q_i\| + \|q_i - c_i\| \big) = \sum_{i=1}^n \big( r_i + d(q_i, C) \big).$$
Lemma 11.
Let $\mathcal{P} = \{P_1, \ldots, P_n\}$ be a collection of point sets from $\mathbb{R}^d$, let $Q = \{q_1, \ldots, q_n\}$ where $q_i$ is the center of the minimum enclosing ball of $P_i$, and let $C$ be any set of centers. Then $\mathrm{cost}_1(Q, C) = \sum_{i=1}^n d(q_i, C) \le \mathrm{cost}_{1,\infty}(\mathcal{P}, C)$.
Proof.
First, we make the standard observation that for any point $c$, $\max_{p \in P_i} \|p - c\| \ge \|q_i - c\|$. (Assume $c \ne q_i$ as otherwise the inequality trivially holds.) This holds by considering the hyperplane passing through $q_i$ whose normal is in the direction from $c$ to $q_i$. As $q_i$ is the center of the minimum enclosing ball of $P_i$, the closed halfspace defined by this hyperplane and not containing $c$ must contain a point from $P_i$, and thus this point is at least as far as $q_i$ from $c$. Thus we have,
$$\mathrm{cost}_{1,\infty}(\mathcal{P}, C) = \sum_{i=1}^n \min_{c \in C} \max_{p \in P_i} \|p - c\| \ge \sum_{i=1}^n \min_{c \in C} \|q_i - c\| = \sum_{i=1}^n d(q_i, C).$$
Theorem 12.
For a parameter $k$ and a point set $P$, let kmedianAlg$(P, k)$ denote any algorithm which achieves an $\alpha$-approximation for the $k$-median problem.
Given a parameter $k$ and collection $\mathcal{P} = \{P_1, \ldots, P_n\}$ of sets in $\mathbb{R}^d$, let $Q = \{q_1, \ldots, q_n\}$ where $q_i$ is the center of the minimum enclosing ball of $P_i$. Then kmedianAlg$(Q, k)$ is a $(1+\alpha)$-approximation for computing $C^*_{1,\infty}$.
Proof.
Let $C$ be the set of centers returned by kmedianAlg$(Q, k)$, for which we have $\mathrm{cost}_1(Q, C) \le \alpha \cdot \min_{|X| = k} \mathrm{cost}_1(Q, X)$. Also let $C^* = C^*_{1,\infty}$. Then using the triangle inequality,
$$\begin{aligned}
\mathrm{cost}_{1,\infty}(\mathcal{P}, C) &\le \sum_{i=1}^n r_i + \sum_{i=1}^n d(q_i, C) && \text{(Lemma 10)}\\
&\le \sum_{i=1}^n r_i + \alpha \sum_{i=1}^n d(q_i, C^*)\\
&\le \sum_{i=1}^n r_i + \alpha \cdot \mathrm{cost}_{1,\infty}(\mathcal{P}, C^*) && \text{(Lemma 11)}\\
&\le (1+\alpha) \cdot \mathrm{cost}_{1,\infty}(\mathcal{P}, C^*),
\end{aligned}$$
where the last inequality holds as $r_i \le f_\infty(P_i, c)$ for any point $c$, and thus $\sum_{i=1}^n r_i \le \mathrm{cost}_{1,\infty}(\mathcal{P}, C^*)$.
As discussed in the introduction, there are many known $O(1)$-approximation algorithms for $k$-median, with various trade-offs in the running times depending on the parameters $n$, $d$, $k$, and $\varepsilon$. Moreover, as $k$-median is a special case of our problem, we should not expect our times to be faster than what is possible for $k$-median.
Before we can apply such algorithms, the above theorem requires computing the minimum enclosing ball of each set $P_i$, for which we can instead use a $(1+\varepsilon)$-approximate minimum enclosing ball center, which can be computed in $O(|P_i| d / \varepsilon^2)$ time [2]. Doing so would introduce a $(1+\varepsilon)$ factor in Lemma 10, and thus using this together with an $\alpha$-approximation for $k$-median, the same analysis from the proof of Theorem 12 would yield a $(1+\varepsilon)(1+\alpha)$-approximation to computing $C^*_{1,\infty}$. By observing that $(1+\varepsilon)(1+\alpha) = 1 + \alpha + \varepsilon(1+\alpha)$, and rescaling $\varepsilon$, we immediately have the following.
Corollary 13.
For a parameter $k$ and a set of $n$ points, let $T(n, k)$ denote the running time of any algorithm which achieves an $\alpha$-approximation for $k$-median. Moreover, let $T_{\mathrm{MEB}}(m, \varepsilon)$ denote the time to compute a $(1+\varepsilon)$-approximate minimum enclosing ball of a set of $m$ points.
Given a parameter $k$ and collection $\mathcal{P} = \{P_1, \ldots, P_n\}$ of sets in $\mathbb{R}^d$, where $N = \sum_{i=1}^n |P_i|$, then for any $\varepsilon > 0$, there is a $(1 + \alpha + \varepsilon)$-approximation for computing $C^*_{1,\infty}$ with $O\big( \sum_{i=1}^n T_{\mathrm{MEB}}(|P_i|, \varepsilon) + T(n, k) \big)$ running time.
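A sketch of the reduction, using the simple iterative scheme of [2] for the approximate MEB centers and treating the $k$-median solver as an assumed black box:

```python
import numpy as np

def approx_meb_center(P, eps):
    """Approximate minimum enclosing ball center via the iterative scheme of
    Badoiu and Clarkson [2]: repeatedly step toward the furthest point; after
    O(1/eps^2) rounds the furthest distance is within (1+eps) of optimal."""
    P = np.asarray(P, dtype=float)
    c = P[0].copy()
    for i in range(1, int(np.ceil(1.0 / eps**2)) + 1):
        far = P[np.argmax(np.linalg.norm(P - c, axis=1))]
        c += (far - c) / (i + 1)  # step size 1/(i+1)
    return c

def sets_kmedian(Ps, k, eps, kmedian_alg):
    """Reduce cost_{1,inf} sets clustering to k-median on (approximate) MEB
    centers; kmedian_alg is any k-median subroutine (an assumed black box)."""
    Q = [approx_meb_center(P, eps) for P in Ps]
    return kmedian_alg(Q, k)
```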
6 Another Variant of k-center Sets Clustering
Above we gave improved approximation algorithms for the $\mathrm{cost}_{\infty,\infty}$ objective, previously considered by [20], and which naturally generalizes the $k$-center problem to sets. Here we consider an alternative generalization of $k$-center to sets clustering. Namely, given $\mathcal{P} = \{P_1, \ldots, P_n\}$ and a parameter $k$, find the set $C$ of $k$ centers that minimizes:
$$\mathrm{cost}_{\infty,1}(\mathcal{P}, C) = \max_{1 \le i \le n} f_1(P_i, C) = \max_{1 \le i \le n} \min_{c \in C} \sum_{p \in P_i} \|p - c\|.$$
This variant poses additional challenges, so for simplicity we will assume that the sets in $\mathcal{P}$ lie in $\mathbb{R}^2$, though a similar approach should extend to any constant dimension. Moreover, in this section we state our running times simply as being polynomial in $N$, as the precise constant in the exponent is large, as opposed to earlier sections where the running times may be linear in $N$ (depending on the subroutines used).
Definition 14.
In the weighted $k$-center problem, we are given a parameter $k$, a set of points $P = \{p_1, \ldots, p_n\}$ and a set of non-negative weights $\{w_1, \ldots, w_n\}$, where $w_i$ is the weight associated with point $p_i$, and the goal is to find the set $C$ of $k$ centers which minimizes the cost function: $\max_{1 \le i \le n} w_i \cdot d(p_i, C)$.
Observe that the weighted $k$-center objective can be viewed as a special instance of our $\mathrm{cost}_{\infty,1}$ objective, as a point $p_i$ with weight $w_i$ models the case when all points in $P_i$ are collocated, for each $i$. Thus intuitively, approximating $\mathrm{cost}_{\infty,1}$ will require approximating weighted $k$-center.
[19] considered the weighted $k$-center problem where the input is instead an edge-weighted graph $G = (V, E)$, where each vertex $v$ has an associated non-negative weight $w_v$. Specifically, let $d_G(u, v)$ denote the shortest path distance between $u$ and $v$ (with respect to edge weights, not vertex weights); then they seek the set of $k$ vertices $C \subseteq V$ which minimizes: $\max_{v \in V} w_v \cdot d_G(v, C)$. For this problem [19] give a 2-approximation, and we now describe how this implies a 2-approximation for the case in the plane.
Lemma 15.
There exists a polynomial time 2-approximation algorithm for the weighted $k$-center problem in the plane. That is, given a parameter $k$, a set of points $P = \{p_1, \ldots, p_n\}$ in $\mathbb{R}^2$, and a set of non-negative weights $\{w_1, \ldots, w_n\}$ where $w_i$ is the weight associated with $p_i$, there is a polynomial time 2-approximation to $\min_{|C| = k} \max_i w_i \cdot d(p_i, C)$.
Proof.
As we allow centers to be located anywhere in the plane, we first describe how to find a polynomial sized set $S$ which contains an optimal set of centers as a subset.
For any two points $p, q \in P$, define their weighted bisector, $b(p, q)$, as the subset of points $x \in \mathbb{R}^2$ such that $w_p \|x - p\| = w_q \|x - q\|$. It is well known that $b(p, q)$ is a line when $w_p = w_q$, and otherwise is a circle (called the Apollonius circle) containing the heavier weight point. Consider the arrangement of all such weighted bisectors over all pairs of points in $P$. The vertices of the arrangement occur at intersections of bisectors, and the edges are circular arcs between vertices (or entire bisectors, when the bisector does not intersect any other bisector). Let $S$ be the set containing all points in $P$, all vertices in the arrangement, and for every edge on a bisector $b(p, q)$, a point on that edge achieving the minimum value of $w_p \|x - p\|$ (which may occur at a vertex of the arrangement). Observe that $|S| = O(n^4)$ as this is the complexity of the arrangement.
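For concreteness, the weighted bisector has a simple closed form; the following short derivation, with our own normalization $\lambda = w_q / w_p$, is not from the paper. The condition $w_p \|x - p\| = w_q \|x - q\|$ becomes $\|x - p\| = \lambda \|x - q\|$, and squaring both sides and completing the square gives, for $\lambda \neq 1$,
$$\Big\| x - \frac{p - \lambda^2 q}{1 - \lambda^2} \Big\| = \frac{\lambda}{|1 - \lambda^2|} \, \|p - q\|,$$
a circle whose center lies on the line through $p$ and $q$; for $\lambda = 1$ the condition degenerates to the perpendicular bisector line of $p$ and $q$.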
So let $C^*$ be an optimal set of centers, and consider some $c \in C^*$. We now argue that if $c \notin S$, then it can be moved to a point in $S$ in such a way that the weighted $k$-center objective does not increase. First, let $P_c$ be the subset of points from $P$ which call $c$ their nearest center in $C^*$ and let $p = \arg\max_{p' \in P_c} w_{p'} \|p' - c\|$. Consider moving $c$ continuously from its initial location towards $p$ (i.e. along the segment $cp$). As we move along this segment the weighted distance to $p$ strictly decreases, and moreover until we cross a bisector $b(p, q)$ for some $q \in P_c$, $p$ remains the point in $P_c$ furthest from $c$ (i.e. the point determining $c$'s contribution to the overall objective). First suppose that $c$ is moved all the way to $p$ without crossing a bisector involving $p$. In this case $p$ remained the furthest point from $c$, and so the cost of assigning $P_c$ to $c$ did not increase, and now $c = p \in S$. In the second case, we run into some bisector $b(p, q)$ for some $q \in P_c$, at which point we stop moving towards $p$ (to ensure $p$ remains the furthest). Now if $c$ is at a vertex of the arrangement of bisectors, then we are done as we included all vertices in $S$. Otherwise, $c$ is on the interior of an edge in the arrangement. Now the point achieving the minimum value of the weighted distance from $p$ (or equivalently $q$) along this edge is included in $S$ (and may in fact be a vertex), and so we can move $c$ to this point without increasing the cost of assigning $P_c$ to $c$, as we do not cross any other bisector while moving along this edge.
Now we construct a weighted graph instance of weighted $k$-center as follows. First, set $V = S$ (recall $P \subseteq S$). Now for every pair of points in $V$ create an edge whose weight equals the Euclidean distance between the corresponding pair of points. Finally, for the vertex weights, set the weight of all vertices in $S \setminus P$ equal to 0 (i.e. they do not have to be clustered), and for each $p \in P$ its weight remains $w_p$ (i.e. the weight from the weighted $k$-center instance in the plane). Note that by construction, the cost of any solution to this graph based weighted $k$-center problem is equivalent to the cost of the same centers for the corresponding weighted $k$-center problem in the plane. Thus, as $S$ is guaranteed to contain an optimal solution to our instance in the plane, if we simply call the polynomial time 2-approximation algorithm from [19] on this graph instance, then the returned solution is also a 2-approximation for our instance in the plane.
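A minimal sketch of this graph construction, treating the 2-approximation of [19] as an assumed black box graph_2approx, and assuming points are represented as tuples contained in the candidate set:

```python
import math

def plane_weighted_kcenter(points, weights, k, candidates, graph_2approx):
    """Reduce the planar weighted k-center instance to the graph version of
    [19]. candidates is the set S, which contains an optimal solution."""
    V = list(candidates)  # the input points P are a subset of S
    n = len(V)
    # Complete graph with Euclidean edge weights: shortest path distances
    # equal straight-line distances by the triangle inequality.
    edge_w = [[math.dist(V[a], V[b]) for b in range(n)] for a in range(n)]
    # Only the original input points must be covered; the extra Steiner
    # candidates get vertex weight 0.
    w_of = dict(zip(points, weights))
    vertex_w = [w_of.get(v, 0.0) for v in V]
    return graph_2approx(edge_w, vertex_w, k)
```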
Our goal now is to use the above lemma to get a constant factor approximation to $C^*_{\infty,1}$. First we need the following lemma, which is the analogue of Lemma 10 from the prior section.
Lemma 16.
Let $\mathcal{P} = \{P_1, \ldots, P_n\}$ be a collection of sets, and for each $i$ let $m_i$ be any point. Then, for any point set $X$,
$$\mathrm{cost}_{\infty,1}(\mathcal{P}, X) \le \max_{1 \le i \le n} \big( f_1(P_i, m_i) + |P_i| \cdot d(m_i, X) \big).$$
Proof.
Let $c_i = \arg\min_{c \in X} \|m_i - c\|$. Then for any point set $P_i$,
$$f_1(P_i, X) \le f_1(P_i, c_i) = \sum_{p \in P_i} \|p - c_i\| \le \sum_{p \in P_i} \big( \|p - m_i\| + \|m_i - c_i\| \big) = f_1(P_i, m_i) + |P_i| \cdot d(m_i, X).$$
Taking the maximum over all $i$ gives the claim.
Theorem 17.
For a parameter $k$, a set of points $P = \{p_1, \ldots, p_n\}$, and weights $\{w_1, \ldots, w_n\}$ where $w_i$ is the weight associated with point $p_i$, let wkcenterAlg$(P, k)$ denote any algorithm which achieves an $\alpha$-approximation for the weighted $k$-center problem. Also, for a point set $P$, let 1medianAlg$(P)$ denote any algorithm that achieves a $\beta$-approximation for the 1-median problem.
Given a parameter $k$ and collection $\mathcal{P} = \{P_1, \ldots, P_n\}$ of sets in $\mathbb{R}^2$, there is a polynomial time $(\alpha + (1+\alpha)\beta)$-approximation for computing $C^*_{\infty,1}$, assuming both subroutines run in polynomial time.
Proof.
For any set $C$ of centers, any point set $P$, and any point $m$, by the triangle inequality we have
$$|P| \cdot d(m, C) = \min_{c \in C} |P| \cdot \|m - c\| \le \min_{c \in C} \sum_{p \in P} \big( \|m - p\| + \|p - c\| \big) = f_1(P, m) + f_1(P, C).$$
Now, let the optimal set of centers for our problem be $C^* = C^*_{\infty,1}$, and let $C$ be the $\alpha$-approximation returned by wkcenterAlg on the weighted point set $\widetilde{M} = \{\tilde{m}_1, \ldots, \tilde{m}_n\}$, where $\tilde{m}_i = $ 1medianAlg$(P_i)$ is the $\beta$-approximate 1-median of $P_i$ and the weight of $\tilde{m}_i$ is $|P_i|$. Then,
$$\begin{aligned}
\mathrm{cost}_{\infty,1}(\mathcal{P}, C) &\le \max_i \big( f_1(P_i, \tilde{m}_i) + |P_i| \cdot d(\tilde{m}_i, C) \big) && \text{(Lemma 16)}\\
&\le \max_i f_1(P_i, \tilde{m}_i) + \alpha \max_i |P_i| \cdot d(\tilde{m}_i, C^*)\\
&\le \max_i f_1(P_i, \tilde{m}_i) + \alpha \max_i \big( f_1(P_i, \tilde{m}_i) + f_1(P_i, C^*) \big) && \text{(inequality above)}\\
&\le (1+\alpha) \max_i f_1(P_i, \tilde{m}_i) + \alpha \cdot \mathrm{cost}_{\infty,1}(\mathcal{P}, C^*)\\
&\le \big( \alpha + (1+\alpha)\beta \big) \cdot \mathrm{cost}_{\infty,1}(\mathcal{P}, C^*),
\end{aligned}$$
where the second inequality holds as $C$ is an $\alpha$-approximate weighted $k$-center solution for $\widetilde{M}$ and $C^*$ is one candidate solution, and the last inequality holds as $f_1(P_i, \tilde{m}_i) \le \beta \cdot f_1(P_i, C^*) \le \beta \cdot \mathrm{cost}_{\infty,1}(\mathcal{P}, C^*)$, since $\tilde{m}_i$ is a $\beta$-approximate 1-median of $P_i$.
Corollary 18.
Given a parameter $k$ and collection $\mathcal{P}$ of sets in $\mathbb{R}^2$, there is a polynomial time $(5+\varepsilon)$-approximation for computing $C^*_{\infty,1}$, for any constant $\varepsilon > 0$.
Proof. Apply Theorem 17, using the polynomial time 2-approximation for weighted $k$-center in the plane from Lemma 15 (i.e. $\alpha = 2$), and a polynomial time $(1+\varepsilon/3)$-approximation for the 1-median problem (e.g. [11]) (i.e. $\beta = 1 + \varepsilon/3$). The resulting approximation factor is $\alpha + (1+\alpha)\beta = 2 + 3(1+\varepsilon/3) = 5 + \varepsilon$.
6.1 Approximation Improvement
We now show how to improve the above $(5+\varepsilon)$-approximation into a $(3+\varepsilon)$-approximation, by constructing a set of candidate radii and searching over it with an approximate decision procedure.
Lemma 19.
For a point set $P$, let 1medianAlg$(P)$ denote any algorithm that achieves a $\beta$-approximation for the 1-median problem.
Given a collection $\mathcal{P} = \{P_1, \ldots, P_n\}$ of sets in $\mathbb{R}^2$, a parameter $k$, and a radius $r$, then decider$(\mathcal{P}, k, r)$ (Algorithm 2) either returns a set $C$ of at most $k$ centers such that $\mathrm{cost}_{\infty,1}(\mathcal{P}, C) \le (2+\beta) r$, or correctly returns that $r < \mathrm{cost}_{\infty,1}(\mathcal{P}, C^*_{\infty,1})$.
Proof.
decider$(\mathcal{P}, k, r)$ starts by initializing all point sets in $\mathcal{P}$ as unmarked, and computing the approximate 1-median $\tilde{m}_i = $ 1medianAlg$(P_i)$ of each point set. Next, the algorithm picks the approximate median $\tilde{m}_i$ of the unmarked point set $P_i$ which has the most points, adds it to the set of centers $C$, and marks all the sets $P_j$ with $f_1(P_j, \tilde{m}_i) \le (2+\beta) r$, i.e. all sets within 1-median cost $(2+\beta) r$ of this approximate median. This process is repeated until all sets are marked. At the end of the loop, if $|C| \le k$, then the algorithm returns $C$, as clearly we have obtained a valid clustering with cost at most $(2+\beta) r$.
Otherwise, $|C| > k$, in which case the algorithm returns that $r < \mathrm{cost}_{\infty,1}(\mathcal{P}, C^*_{\infty,1})$, which we now argue is correct. Let $c_1, \ldots, c_{k+1}$ be the first $k+1$ centers selected by the algorithm, where $c_j$ was selected in the $j$th iteration, and let $c_j = \tilde{m}_{i_j}$, where $P_{i_j}$ is the set such that 1medianAlg$(P_{i_j}) = c_j$. Let $C^* = C^*_{\infty,1}$. Then by the pigeonhole principle, there must be some $c^* \in C^*$ and sets $P_{i_a}, P_{i_b}$ with $a < b$ such that $f_1(P_{i_a}, C^*) = f_1(P_{i_a}, c^*)$ and $f_1(P_{i_b}, C^*) = f_1(P_{i_b}, c^*)$. Note that as $P_{i_b}$ was still unmarked when $P_{i_a}$ was selected, by line 5 of the algorithm we know $|P_{i_a}| \ge |P_{i_b}|$. Then,
$$f_1(P_{i_b}, \tilde{m}_{i_a}) \le f_1(P_{i_b}, c^*) + |P_{i_b}| \cdot \|c^* - \tilde{m}_{i_a}\| \le f_1(P_{i_b}, c^*) + f_1(P_{i_a}, c^*) + f_1(P_{i_a}, \tilde{m}_{i_a}) \le f_1(P_{i_b}, c^*) + (1+\beta) f_1(P_{i_a}, c^*) \le (2+\beta) \cdot \mathrm{cost}_{\infty,1}(\mathcal{P}, C^*).$$
Since $a < b$, we know that $\tilde{m}_{i_a}$ did not cover $P_{i_b}$ within radius $(2+\beta) r$, that is $f_1(P_{i_b}, \tilde{m}_{i_a}) > (2+\beta) r$. Thus by the above, $(2+\beta) r < (2+\beta) \cdot \mathrm{cost}_{\infty,1}(\mathcal{P}, C^*)$. Therefore, the algorithm correctly returns that $r < \mathrm{cost}_{\infty,1}(\mathcal{P}, C^*)$.
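A Python sketch of this decision procedure (our rendering of Algorithm 2; the subroutine median_alg is an assumed $\beta$-approximate 1-median black box):

```python
import math

def decider(Ps, k, r, beta, median_alg):
    """Either returns a set C of at most k centers with
    cost_{inf,1}(Ps, C) <= (2+beta)*r, or None, certifying r < opt."""
    def f1(P, c):  # 1-median cost of assigning P to center c
        return sum(math.dist(p, c) for p in P)

    meds = [median_alg(P) for P in Ps]  # beta-approximate 1-medians
    unmarked, C = set(range(len(Ps))), []
    while unmarked:
        i = max(unmarked, key=lambda j: len(Ps[j]))  # largest unmarked set
        C.append(meds[i])
        if len(C) > k:
            return None  # more than k centers needed, so r < opt
        # Mark the chosen set and every set covered within (2+beta)*r.
        unmarked = {j for j in unmarked
                    if j != i and f1(Ps[j], meds[i]) > (2 + beta) * r}
    return C
```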
Theorem 20.
Given a parameter $k$ and a collection $\mathcal{P} = \{P_1, \ldots, P_n\}$ of point sets in $\mathbb{R}^2$, for any constant $\varepsilon > 0$, there exists a polynomial time $(3+\varepsilon)$-approximation algorithm for computing $C^*_{\infty,1}$.
Proof.
Let $C'$ be the $(5+\varepsilon_0)$-approximate set of centers returned by Corollary 18, for some fixed constant $\varepsilon_0$ (say $\varepsilon_0 = 1$). Let $\Lambda = \mathrm{cost}_{\infty,1}(\mathcal{P}, C')$ and let $\mathrm{opt} = \mathrm{cost}_{\infty,1}(\mathcal{P}, C^*_{\infty,1})$, where by Corollary 18 we know that $\Lambda / (5+\varepsilon_0) \le \mathrm{opt} \le \Lambda$. Consider the candidate radii set $R = \{r_0, \ldots, r_z\}$, where $r_j = (1+\delta)^j \cdot \Lambda / (5+\varepsilon_0)$ and $z = \lceil \log_{1+\delta}(5+\varepsilon_0) \rceil$, for a constant $\delta > 0$ depending on $\varepsilon$ to be determined. Note that for any $j$ we have $r_{j+1} = (1+\delta) r_j$, and $z$ was chosen such that $r_z \ge \Lambda \ge \mathrm{opt}$.
Now we iterate through the candidate radii in order from $r_0$ to $r_z$. In the $j$th iteration we call Algorithm 2 with radius $r_j$, where $\beta$ is the approximation ratio of the subroutine used to compute the approximate 1-medians. We can use [11] to get a $(1+\delta)$-approximate 1-median, which runs in polynomial time for any constant $\delta > 0$. Let $j'$ be the first iteration where Algorithm 2 returns a set $C$ rather than reporting that $r_{j'} < \mathrm{opt}$. We claim that this set $C$ is the desired approximation.
First, observe that this procedure is guaranteed to return some set $C$, that is, $j'$ is well defined. Specifically, when $j = z$, the call to Algorithm 2 with radius $r_z$ must return a set, since otherwise by Lemma 19 the optimal radius would satisfy $\mathrm{opt} > r_z$, which is a contradiction to the fact that $r_z \ge \mathrm{opt}$.
Lemma 19 guarantees that $\mathrm{cost}_{\infty,1}(\mathcal{P}, C) \le (2+\beta) r_{j'}$. If $j' = 0$, then we are done as we know $r_0 \le \mathrm{opt}$. Otherwise, $j' > 0$, in which case we know the call with radius $r_{j'-1}$ reported that $r_{j'-1} < \mathrm{opt}$, and so again by Lemma 19 we then know that $r_{j'} = (1+\delta) r_{j'-1} < (1+\delta)\,\mathrm{opt}$. Summarizing, $r_{j'} \le (1+\delta)\,\mathrm{opt}$, and so with $\beta = 1+\delta$ the approximation quality of the solution is bounded by the ratio:
$$\frac{\mathrm{cost}_{\infty,1}(\mathcal{P}, C)}{\mathrm{opt}} \le \frac{(2+\beta)(1+\delta)\,\mathrm{opt}}{\mathrm{opt}} = (3+\delta)(1+\delta) \le 3 + \varepsilon,$$
for an appropriate choice of the constant $\delta = \Theta(\varepsilon)$.
Overall this is a polynomial time algorithm, as we made $z + 1 = O(\log_{1+\delta}(5+\varepsilon_0)) = O(1/\delta)$ calls to Algorithm 2, which itself is polynomial time when using a polynomial time subroutine to compute the approximate 1-medians.
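Putting the pieces together, a sketch of the radius search built on the decider sketched above (the names and parameterization are our own):

```python
def radius_search(Ps, k, coarse_cost, eps, median_alg):
    """Geometric search over candidate radii using decider (an illustration
    of the Theorem 20 procedure). coarse_cost is the cost of the coarse
    (5+eps)-approximate solution, so opt is in [coarse_cost/(5+eps), coarse_cost]."""
    delta = eps / 5.0  # chosen so (2+beta)*(1+delta) <= 3+eps when beta = 1+delta
    beta = 1.0 + delta
    r = coarse_cost / (5.0 + eps)
    # Once r >= opt the decider must return a set, so the loop terminates.
    while True:
        C = decider(Ps, k, r, beta, median_alg)
        if C is not None:
            return C  # cost_{inf,1}(Ps, C) <= (2+beta)*r <= (3+eps)*opt
        r *= 1.0 + delta  # r < opt was certified; increase and retry
```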
References
- [1] Pankaj K. Agarwal and Cecilia Magdalena Procopiuc. Exact and approximation algorithms for clustering. Algorithmica, 33(2):201–226, 2002. doi:10.1007/s00453-001-0110-y.
- [2] Mihai Badoiu and Kenneth L. Clarkson. Smaller core-sets for balls. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 801–802. ACM/SIAM, 2003. URL: http://dl.acm.org/citation.cfm?id=644108.644240.
- [3] Moses Charikar and Rina Panigrahy. Clustering to minimize the sum of cluster diameters. J. Comput. Syst. Sci., 68(2):417–441, 2004. doi:10.1016/J.JCSS.2003.07.014.
- [4] Ke Chen. On coresets for k-median and k-means clustering in metric and euclidean spaces and their applications. SIAM J. Comput., 39(3):923–947, 2009. doi:10.1137/070699007.
- [5] Vincent Cohen-Addad, Karthik C. S., and Euiwoong Lee. Johnson coverage hypothesis: Inapproximability of k-means and k-median in $\ell_p$-metrics. In Proceedings of the 2022 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1493–1530. SIAM, 2022. doi:10.1137/1.9781611977073.63.
- [6] Eduard Eiben, Fedor V. Fomin, Petr A. Golovach, William Lochet, Fahad Panolan, and Kirill Simonov. EPTAS for k-means clustering of affine subspaces. In Dániel Marx, editor, Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2649–2659. SIAM, 2021. doi:10.1137/1.9781611976465.157.
- [7] Tomás Feder and Daniel H. Greene. Optimal algorithms for approximate clustering. In Proceedings of the 20th Annual ACM Symposium on Theory of Computing (STOC), pages 434–444. ACM, 1988. doi:10.1145/62212.62255.
- [8] Dan Feldman, Morteza Monemizadeh, and Christian Sohler. A PTAS for k-means clustering based on weak coresets. In Proceedings of the 23rd ACM Symposium on Computational Geometry (SOCG), pages 11–18. ACM, 2007. doi:10.1145/1247069.1247072.
- [9] Jie Gao, Michael Langberg, and Leonard J. Schulman. Clustering lines in high-dimensional space: Classification of incomplete data. ACM Trans. Algorithms, 7(1), December 2010. doi:10.1145/1868237.1868246.
- [10] T. F. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293–306, 1985. doi:10.1016/0304-3975(85)90224-5.
- [11] Sariel Har-Peled and Soham Mazumdar. On coresets for k-means and k-median clustering. In László Babai, editor, Proceedings of the 36th Annual ACM Symposium on Theory of Computing (STOC), pages 291–300. ACM, 2004. doi:10.1145/1007352.1007400.
- [12] D. S. Hochbaum and D. B. Shmoys. A best possible heuristic for the k-center problem. Mathematics of Operations Research, 10(2):180–184, 1985. doi:10.1287/moor.10.2.180.
- [13] W. Hsu and G. L. Nemhauser. Easy and hard bottleneck location problems. Discrete Applied Mathematics, 1(3):209–215, 1979. doi:10.1016/0166-218X(79)90044-1.
- [14] Hongyao Huang, Georgiy Klimenko, and Benjamin Raichel. Clustering with neighborhoods. In 32nd International Symposium on Algorithms and Computation (ISAAC), volume 212 of LIPIcs, pages 6:1–6:17. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPICS.ISAAC.2021.6.
- [15] Ibrahim Jubran, Murad Tukan, Alaa Maalouf, and Dan Feldman. Sets clustering. In Proceedings of the 37th International Conference on Machine Learning (ICML), volume 119 of Proceedings of Machine Learning Research, pages 4994–5005. PMLR, 2020. URL: http://proceedings.mlr.press/v119/jubran20a.html.
- [16] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. A local search approximation algorithm for k-means clustering. Comput. Geom., 28(2-3):89–112, 2004. doi:10.1016/J.COMGEO.2004.03.003.
- [17] Amit Kumar, Yogish Sabharwal, and Sandeep Sen. Linear-time approximation schemes for clustering problems in any dimensions. J. ACM, 57(2), February 2010. doi:10.1145/1667053.1667054.
- [18] Sagi Lotan, Ernesto Evgeniy Sanches Shayda, and Dan Feldman. Coreset for line-sets clustering. In Advances in Neural Information Processing Systems (NeurIPS), 2022. URL: http://papers.nips.cc/paper_files/paper/2022/hash/f2ce95887c34393af4eb240d60017860-Abstract-Conference.html.
- [19] Qingzhou Wang and Kam-Hoi Cheng. A heuristic algorithm for the k-center problem with vertex weight. In International Symposium on Algorithms (SIGAL), volume 450 of Lecture Notes in Computer Science, pages 388–396. Springer, 1990. doi:10.1007/3-540-52921-7_88.
- [20] Guang Xu and Jinhui Xu. Efficient approximation algorithms for clustering point-sets. Computational Geometry, 43(1):59–66, 2010. Special Issue on the 14th Annual Fall Workshop. doi:10.1016/j.comgeo.2007.12.002.
