Semi-Supervised Algorithms for Approximately Optimal and Accurate Clustering

Gamlath, Buddhima; Huang, Sangxia; Svensson, Ola

doi:10.4230/LIPIcs.ICALP.2018.57

Abstract

We study k-means clustering in a semi-supervised setting. Given an oracle that returns whether two given points belong to the same cluster in a fixed optimal clustering, we investigate the following question: how many oracle queries are sufficient to efficiently recover a clustering that, with probability at least (1 - delta), simultaneously has a cost of at most (1 + epsilon) times the optimal cost and an accuracy of at least (1 - epsilon)?
We show how to achieve such a clustering on n points with O((k^2 log n) * m(Q, epsilon^4, delta / (k log n))) oracle queries, when the k clusters can be learned with an epsilon' error and a failure probability delta' using m(Q, epsilon', delta') labeled samples in the supervised setting, where Q is the set of candidate cluster centers. We show that m(Q, epsilon', delta') is small both for k-means instances in Euclidean space and for those in finite metric spaces. We further show that, for the Euclidean k-means instances, we can avoid the dependency on n in the query complexity at the expense of an increased dependency on k: specifically, we give a slightly more involved algorithm that uses O(k^4/(epsilon^2 delta) + (k^9/epsilon^4) log(1/delta) + k * m(R^r, epsilon^4/k, delta)) oracle queries.
We also show that the number of queries needed for (1 - epsilon)-accuracy in Euclidean k-means must linearly depend on the dimension of the underlying Euclidean space, and for finite metric space k-means, we show that it must at least be logarithmic in the number of candidate centers. This shows that our query complexities capture the right dependencies on the respective parameters.
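The query model in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: it is a naive baseline that compares each point against one representative per cluster discovered so far, using at most k queries per point, which is the kind of linear-in-n query count that the bounds above are designed to avoid. The helper names (make_oracle, label_by_queries, kmeans_cost) and the toy data are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


def make_oracle(labels: Sequence[int]) -> Callable[[int, int], bool]:
    """Same-cluster oracle for a fixed reference clustering (hypothetical helper)."""
    def same_cluster(i: int, j: int) -> bool:
        return labels[i] == labels[j]
    return same_cluster


def label_by_queries(n: int, oracle: Callable[[int, int], bool]) -> Tuple[List[int], int]:
    """Naive baseline: query each point against one representative per known cluster."""
    reps: List[int] = []        # one representative point index per discovered cluster
    assignment: List[int] = []  # cluster index assigned to each point
    queries = 0
    for i in range(n):
        for c, r in enumerate(reps):
            queries += 1
            if oracle(i, r):
                assignment.append(c)
                break
        else:
            # No representative matched: point i starts a new cluster.
            reps.append(i)
            assignment.append(len(reps) - 1)
    return assignment, queries


def kmeans_cost(points: Sequence[Point], assignment: Sequence[int]) -> float:
    """Sum of squared Euclidean distances from each point to its cluster mean."""
    clusters: Dict[int, List[Point]] = {}
    for p, c in zip(points, assignment):
        clusters.setdefault(c, []).append(p)
    cost = 0.0
    for members in clusters.values():
        dim = len(members[0])
        mean = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        cost += sum((p[d] - mean[d]) ** 2 for p in members for d in range(dim))
    return cost


if __name__ == "__main__":
    # Toy instance (assumed data): two well-separated pairs of points.
    points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
    oracle = make_oracle([0, 0, 1, 1])
    assignment, queries = label_by_queries(len(points), oracle)
    print(assignment, queries, kmeans_cost(points, assignment))
```

On this toy instance the baseline recovers the reference clustering exactly, but its query count grows with n, whereas the first bound in the abstract depends on n only through log n factors.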

Cite As

Buddhima Gamlath, Sangxia Huang, and Ola Svensson. Semi-Supervised Algorithms for Approximately Optimal and Accurate Clustering. In 45th International Colloquium on Automata, Languages, and Programming (ICALP 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 107, pp. 57:1-57:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018) https://doi.org/10.4230/LIPIcs.ICALP.2018.57

Author Details

Buddhima Gamlath

École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland

Sangxia Huang

Sony Mobile Communications, Lund, Sweden

Ola Svensson

École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland

Funding

This research was supported by ERC Starting Grant 335288-OptApprox.
