Semi-Supervised Algorithms for Approximately Optimal and Accurate Clustering
We study k-means clustering in a semi-supervised setting. Given an oracle that returns whether two given points belong to the same cluster in a fixed optimal clustering, we investigate the following question: how many oracle queries are sufficient to efficiently recover a clustering that, with probability at least (1 - delta), simultaneously has a cost of at most (1 + epsilon) times the optimal cost and an accuracy of at least (1 - epsilon)?
We show how to achieve such a clustering on n points with O((k^2 log n) * m(Q, epsilon^4, delta/(k log n))) oracle queries, when the k clusters can be learned with an epsilon' error and a failure probability delta' using m(Q, epsilon', delta') labeled samples in the supervised setting, where Q is the set of candidate cluster centers. We show that m(Q, epsilon', delta') is small both for k-means instances in Euclidean space and for those in finite metric spaces. We further show that, for the Euclidean k-means instances, we can avoid the dependency on n in the query complexity at the expense of an increased dependency on k: specifically, we give a slightly more involved algorithm that uses O(k^4/(epsilon^2 delta) + (k^9/epsilon^4) log(1/delta) + k * m(R^r, epsilon^4/k, delta)) oracle queries.
We also show that the number of queries needed for (1 - epsilon)-accuracy in Euclidean k-means must depend linearly on the dimension of the underlying Euclidean space, and that for finite metric space k-means it must be at least logarithmic in the number of candidate centers. This shows that our query complexities capture the right dependencies on the respective parameters.
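The query model above can be made concrete with a minimal sketch. This is not the algorithm of the paper: it is a naive baseline, under the assumption that the oracle holds a fixed reference labeling, that recovers the clustering exactly by comparing each point with one representative per cluster found so far. The names `SameClusterOracle`, `kmeans_cost`, and `label_by_queries` are hypothetical and chosen for illustration only.

```python
import numpy as np


class SameClusterOracle:
    """Hypothetical oracle: answers same-cluster queries and counts them."""

    def __init__(self, labels):
        # Assumption: the fixed optimal clustering is given as a label array.
        self._labels = np.asarray(labels)
        self.queries = 0

    def same_cluster(self, i, j):
        self.queries += 1
        return bool(self._labels[i] == self._labels[j])


def kmeans_cost(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    cost = 0.0
    for c in np.unique(labels):
        members = points[labels == c]
        cost += float(((members - members.mean(axis=0)) ** 2).sum())
    return cost


def label_by_queries(oracle, n):
    """Naive baseline: compare each point with one representative per cluster."""
    representatives = []  # index of the first point seen in each cluster
    labels = np.empty(n, dtype=int)
    for i in range(n):
        for c, rep in enumerate(representatives):
            if oracle.same_cluster(i, rep):
                labels[i] = c
                break
        else:
            # No representative matched, so point i opens a new cluster.
            labels[i] = len(representatives)
            representatives.append(i)
    return labels


points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
oracle = SameClusterOracle([0, 0, 1, 1])
recovered = label_by_queries(oracle, len(points))
print(recovered.tolist(), oracle.queries, kmeans_cost(points, recovered))
```

This baseline spends up to k queries per point, so its query count grows linearly with n. The bounds stated above depend on n only through log n, or not at all in the Euclidean case, in exchange for allowing an epsilon loss in cost and accuracy.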
Clustering
Semi-supervised Learning
Approximation Algorithms
k-Means
k-Median
Theory of computation → Facility location and clustering
57:1-57:14
Regular Paper
This research was supported by ERC Starting Grant 335288-OptApprox.
https://arxiv.org/abs/1803.00926
Buddhima Gamlath
École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
Sangxia Huang
Sony Mobile Communications, Lund, Sweden
Ola Svensson
École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
10.4230/LIPIcs.ICALP.2018.57
Buddhima Gamlath, Sangxia Huang, and Ola Svensson
Creative Commons Attribution 3.0 Unported license
https://creativecommons.org/licenses/by/3.0/legalcode