Approximate Clustering with Same-Cluster Queries

Authors Nir Ailon, Anup Bhattacharya, Ragesh Jaiswal, Amit Kumar

Thumbnail PDF


  • Filesize: 0.63 MB
  • 21 pages

Document Identifiers

Author Details

Nir Ailon
Anup Bhattacharya
Ragesh Jaiswal
Amit Kumar

Cite AsGet BibTex

Nir Ailon, Anup Bhattacharya, Ragesh Jaiswal, and Amit Kumar. Approximate Clustering with Same-Cluster Queries. In 9th Innovations in Theoretical Computer Science Conference (ITCS 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 94, pp. 40:1-40:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


Ashtiani et al. proposed a Semi-Supervised Active Clustering framework (SSAC), where the learner is allowed to make adaptive queries to a domain expert. The queries are of the kind "do two given points belong to the same optimal cluster?", where the answers to these queries are assumed to be consistent with a unique optimal solution. There are many clustering contexts where such same cluster queries are feasible. Ashtiani et al. exhibited the power of such queries by showing that any instance of the k-means clustering problem, with additional margin assumption, can be solved efficiently if one is allowed to make O(k^2 log{k} + k log{n}) same-cluster queries. This is interesting since the k-means problem, even with the margin assumption, is NP-hard. In this paper, we extend the work of Ashtiani et al. to the approximation setting by showing that a few of such same-cluster queries enables one to get a polynomial-time (1+eps)-approximation algorithm for the k-means problem without any margin assumption on the input dataset. Again, this is interesting since the k-means problem is NP-hard to approximate within a factor (1+c) for a fixed constant 0 < c < 1. The number of same-cluster queries used by the algorithm is poly(k/eps) which is independent of the size n of the dataset. Our algorithm is based on the D^2-sampling technique, also known as the k-means++ seeding algorithm. We also give a conditional lower bound on the number of same-cluster queries showing that if the Exponential Time Hypothesis (ETH) holds, then any such efficient query algorithm needs to make Omega (k/poly log k) same-cluster queries. Our algorithm can be extended for the case where the query answers are wrong with some bounded probability. Another result we show for the k-means++ seeding is that a small modification of the k-means++ seeding within the SSAC framework converts it to a constant factor approximation algorithm instead of the well known O(log k)-approximation algorithm.
  • k-means
  • semi-supervised learning
  • query bounds


  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    PDF Downloads


  1. Ankit Aggarwal, Amit Deshpande, and Ravi Kannan. Adaptive sampling for k-means clustering. In APPROX-RANDOM, pages 15-28. Springer, 2009. Google Scholar
  2. Sara Ahmadian, Ashkan Norouzi-Fard, Ola Svensson, and Justin Ward. Better guarantees for k-means and euclidean k-median by primal-dual algorithms. CoRR, abs/1612.07925, 2016. URL:
  3. Nir Ailon, Anup Bhattacharya, Ragesh Jaiswal, and Amit Kumar. Approximate clustering with same-cluster queries. CoRR, abs/1704.01862, 2017. URL:
  4. Nir Ailon, Yudong Chen, and Huan Xu. Iterative and active graph clustering using trace norm minimization without cluster size constraints. Journal of Machine Learning Research, 16:455-490, 2015. Google Scholar
  5. David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027-1035. Society for Industrial and Applied Mathematics, 2007. Google Scholar
  6. Hassan Ashtiani, Shrinu Kushagra, and Shai Ben-David. Clustering with same-cluster queries. In Advances in neural information processing systems, pages 3216-3224, 2016. Google Scholar
  7. Pranjal Awasthi, Maria-Florina Balcan, and Konstantin Voevodski. Local algorithms for interactive clustering. In ICML, pages 550-558, 2014. Google Scholar
  8. Pranjal Awasthi, Avrim Blum, and Or Sheffet. Center-based clustering under perturbation stability. Information Processing Letters, 112(1):49-54, 2012. Google Scholar
  9. Pranjal Awasthi, Moses Charikar, Ravishankar Krishnaswamy, and Ali Kemal Sinop. The Hardness of Approximation of Euclidean k-Means. In 31st International Symposium on Computational Geometry (SoCG 2015), volume 34, pages 754-767, 2015. Google Scholar
  10. Maria-Florina Balcan and Avrim Blum. Clustering with interactive feedback. In ALT, pages 316-328. Springer, 2008. Google Scholar
  11. Maria-Florina Balcan, Avrim Blum, and Anupam Gupta. Approximate clustering without the approximation. In Proceedings of the twentieth annual ACM-SIAM symposium on Discrete algorithms, pages 1068-1077, 2009. Google Scholar
  12. Sugato Basu, Arindam Banerjee, and Raymond J Mooney. Active semi-supervision for pairwise constrained clustering. In Proceedings of the 2004 SIAM international conference on data mining, pages 333-344. SIAM, 2004. Google Scholar
  13. Anup Bhattacharya, Ragesh Jaiswal, and Nir Ailon. Tight lower bound instances for k-means++ in two dimensions. Theoretical Computer Science, 634:55-66, 2016. Google Scholar
  14. Tobias Brunsch and Heiko Röglin. A bad instance for k-means++. Theoretical Computer Science, 505:19-26, 2013. Google Scholar
  15. Vincent Cohen-Addad, Philip N. Klein, and Claire Mathieu. Local search yields approximation schemes for k-means and k-median in euclidean and minor-free metrics. In Irit Dinur, editor, IEEE 57th Annual Symposium on Foundations of Computer Science, FOCS 2016, 9-11 October 2016, Hyatt Regency, New Brunswick, New Jersey, USA, pages 353-364. IEEE Computer Society, 2016. URL:
  16. Sanjoy Dasgupta. The hardness of k-means clustering. Technical report, Department of Computer Science and Engineering, University of California, San Diego, 2008. Google Scholar
  17. Irit Dinur. The pcp theorem by gap amplification. Journal of the ACM (JACM), 54(3):12, 2007. Google Scholar
  18. Dan Feldman, Morteza Monemizadeh, and Christian Sohler. A ptas for k-means clustering based on weak coresets. In Proceedings of the twenty-third annual symposium on Computational geometry, pages 11-18. ACM, 2007. Google Scholar
  19. Zachary Friggstad, Mohsen Rezapour, and Mohammad R Salavatipour. Local search yields a ptas for k-means in doubling metrics. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 365-374. IEEE, 2016. Google Scholar
  20. Russell Impagliazzo and Ramamohan Paturi. On the complexity of k-sat. Journal of Computer and System Sciences, 62:367-375, 2001. Google Scholar
  21. Russell Impagliazzo, Ramamohan Paturi, and Francis Zane. Which problems have strongly exponential complexity? Journal of Computer and System Sciences, 63:512-530, 2001. Google Scholar
  22. Mary Inaba, Naoki Katoh, and Hiroshi Imai. Applications of weighted voronoi diagrams and randomization to variance-based k-clustering. In Proceedings of the tenth annual symposium on Computational geometry, pages 332-339. ACM, 1994. Google Scholar
  23. Ragesh Jaiswal, Amit Kumar, and Sandeep Sen. A simple d2-sampling based ptas for k-means and other clustering problems. Algorithmica, 70(1):22-46, 2014. Google Scholar
  24. Ragesh Jaiswal, Mehul Kumar, and Pulkit Yadav. Improved analysis of d²-sampling based ptas for k-means and other clustering problems. Information Processing Letters, 115(2):100-103, 2015. Google Scholar
  25. Tapas Kanungo, David M Mount, Nathan S Netanyahu, Christine D Piatko, Ruth Silverman, and Angela Y Wu. A local search approximation algorithm for k-means clustering. In Proceedings of the eighteenth annual symposium on Computational geometry, pages 10-18. ACM, 2002. Google Scholar
  26. Amit Kumar, Yogish Sabharwal, and Sandeep Sen. Linear-time approximation schemes for clustering problems in any dimensions. Journal of the ACM (JACM), 57(2):5, 2010. Google Scholar
  27. Euiwoong Lee, Melanie Schmidt, and John Wright. Improved and simplified inapproximability for k-means. Information Processing Letters, 120:40-43, 2017. Google Scholar
  28. Meena Mahajan, Prajakta Nimbhorkar, and Kasturi Varadarajan. The planar k-means problem is np-hard. Theoretical Computer Science, 442:13-21, 2012. Google Scholar
  29. Pasin Manurangsi. Almost-polynomial ratio eth-hardness of approximating densest k-subgraph. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 954-961. ACM, 2017. Google Scholar
  30. Rafail Ostrovsky, Yuval Rabani, Leonard J Schulman, and Chaitanya Swamy. The effectiveness of lloyd-type methods for the k-means problem. Journal of the ACM (JACM), 59(6):28, 2012. Google Scholar
  31. Luca Trevisan. Inapproximability of combinatorial optimization problems. CoRR, cs.CC/0409043, 2004. URL:
  32. Andrea Vattani. The hardness of k-means clustering in the plane. Technical report, Department of Computer Science and Engineering, University of California San Diego, 2009. Google Scholar
  33. Sharad Vikram and Sanjoy Dasgupta. Interactive bayesian hierarchical clustering. In International Conference on Machine Learning, pages 2081-2090, 2016. Google Scholar
  34. Konstantin Voevodski, Maria-Florina Balcan, Heiko Röglin, Shang-Hua Teng, and Yu Xia. Efficient clustering with limited distance information. CoRR, abs/1408.2045, 2014. URL:
Questions / Remarks / Feedback

Feedback for Dagstuhl Publishing

Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail