Approximate Clustering with Same-Cluster Queries

Ailon, Nir; Bhattacharya, Anup; Jaiswal, Ragesh; Kumar, Amit

doi:10.4230/LIPIcs.ITCS.2018.40

File

Author Details

Nir Ailon

Anup Bhattacharya

Ragesh Jaiswal

Amit Kumar

Cite AsGet BibTex

Nir Ailon, Anup Bhattacharya, Ragesh Jaiswal, and Amit Kumar. Approximate Clustering with Same-Cluster Queries. In 9th Innovations in Theoretical Computer Science Conference (ITCS 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 94, pp. 40:1-40:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)
https://doi.org/10.4230/LIPIcs.ITCS.2018.40

Abstract

Ashtiani et al. proposed a Semi-Supervised Active Clustering framework (SSAC), where the learner is allowed to make adaptive queries to a domain expert. The queries are of the kind "do two given points belong to the same optimal cluster?", where the answers to these queries are assumed to be consistent with a unique optimal solution. There are many clustering contexts where such same cluster queries are feasible. Ashtiani et al. exhibited the power of such queries by showing that any instance of the k-means clustering problem, with additional margin assumption, can be solved efficiently if one is allowed to make O(k^2 log{k} + k log{n}) same-cluster queries. This is interesting since the k-means problem, even with the margin assumption, is NP-hard. In this paper, we extend the work of Ashtiani et al. to the approximation setting by showing that a few of such same-cluster queries enables one to get a polynomial-time (1+eps)-approximation algorithm for the k-means problem without any margin assumption on the input dataset. Again, this is interesting since the k-means problem is NP-hard to approximate within a factor (1+c) for a fixed constant 0 < c < 1. The number of same-cluster queries used by the algorithm is poly(k/eps) which is independent of the size n of the dataset. Our algorithm is based on the D^2-sampling technique, also known as the k-means++ seeding algorithm. We also give a conditional lower bound on the number of same-cluster queries showing that if the Exponential Time Hypothesis (ETH) holds, then any such efficient query algorithm needs to make Omega (k/poly log k) same-cluster queries. Our algorithm can be extended for the case where the query answers are wrong with some bounded probability. Another result we show for the k-means++ seeding is that a small modification of the k-means++ seeding within the SSAC framework converts it to a constant factor approximation algorithm instead of the well known O(log k)-approximation algorithm.

Metrics

Access Statistics
Total Accesses (updated on a weekly basis)

0

PDF Downloads

0

Metadata Views

References

Ankit Aggarwal, Amit Deshpande, and Ravi Kannan. Adaptive sampling for k-means clustering. In APPROX-RANDOM, pages 15-28. Springer, 2009.
Sara Ahmadian, Ashkan Norouzi-Fard, Ola Svensson, and Justin Ward. Better guarantees for k-means and euclidean k-median by primal-dual algorithms. CoRR, abs/1612.07925, 2016. URL: http://arxiv.org/abs/1612.07925.
Nir Ailon, Anup Bhattacharya, Ragesh Jaiswal, and Amit Kumar. Approximate clustering with same-cluster queries. CoRR, abs/1704.01862, 2017. URL: http://arxiv.org/abs/1704.01862.
Nir Ailon, Yudong Chen, and Huan Xu. Iterative and active graph clustering using trace norm minimization without cluster size constraints. Journal of Machine Learning Research, 16:455-490, 2015.
David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027-1035. Society for Industrial and Applied Mathematics, 2007.
Hassan Ashtiani, Shrinu Kushagra, and Shai Ben-David. Clustering with same-cluster queries. In Advances in neural information processing systems, pages 3216-3224, 2016.
Pranjal Awasthi, Maria-Florina Balcan, and Konstantin Voevodski. Local algorithms for interactive clustering. In ICML, pages 550-558, 2014.
Pranjal Awasthi, Avrim Blum, and Or Sheffet. Center-based clustering under perturbation stability. Information Processing Letters, 112(1):49-54, 2012.
Pranjal Awasthi, Moses Charikar, Ravishankar Krishnaswamy, and Ali Kemal Sinop. The Hardness of Approximation of Euclidean k-Means. In 31st International Symposium on Computational Geometry (SoCG 2015), volume 34, pages 754-767, 2015.
Maria-Florina Balcan and Avrim Blum. Clustering with interactive feedback. In ALT, pages 316-328. Springer, 2008.
Maria-Florina Balcan, Avrim Blum, and Anupam Gupta. Approximate clustering without the approximation. In Proceedings of the twentieth annual ACM-SIAM symposium on Discrete algorithms, pages 1068-1077, 2009.
Sugato Basu, Arindam Banerjee, and Raymond J Mooney. Active semi-supervision for pairwise constrained clustering. In Proceedings of the 2004 SIAM international conference on data mining, pages 333-344. SIAM, 2004.
Anup Bhattacharya, Ragesh Jaiswal, and Nir Ailon. Tight lower bound instances for k-means++ in two dimensions. Theoretical Computer Science, 634:55-66, 2016.
Tobias Brunsch and Heiko Röglin. A bad instance for k-means++. Theoretical Computer Science, 505:19-26, 2013.
Vincent Cohen-Addad, Philip N. Klein, and Claire Mathieu. Local search yields approximation schemes for k-means and k-median in euclidean and minor-free metrics. In Irit Dinur, editor, IEEE 57th Annual Symposium on Foundations of Computer Science, FOCS 2016, 9-11 October 2016, Hyatt Regency, New Brunswick, New Jersey, USA, pages 353-364. IEEE Computer Society, 2016. URL: http://dx.doi.org/10.1109/FOCS.2016.46.
Sanjoy Dasgupta. The hardness of k-means clustering. Technical report, Department of Computer Science and Engineering, University of California, San Diego, 2008.
Irit Dinur. The pcp theorem by gap amplification. Journal of the ACM (JACM), 54(3):12, 2007.
Dan Feldman, Morteza Monemizadeh, and Christian Sohler. A ptas for k-means clustering based on weak coresets. In Proceedings of the twenty-third annual symposium on Computational geometry, pages 11-18. ACM, 2007.
Zachary Friggstad, Mohsen Rezapour, and Mohammad R Salavatipour. Local search yields a ptas for k-means in doubling metrics. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 365-374. IEEE, 2016.
Russell Impagliazzo and Ramamohan Paturi. On the complexity of k-sat. Journal of Computer and System Sciences, 62:367-375, 2001.
Russell Impagliazzo, Ramamohan Paturi, and Francis Zane. Which problems have strongly exponential complexity? Journal of Computer and System Sciences, 63:512-530, 2001.
Mary Inaba, Naoki Katoh, and Hiroshi Imai. Applications of weighted voronoi diagrams and randomization to variance-based k-clustering. In Proceedings of the tenth annual symposium on Computational geometry, pages 332-339. ACM, 1994.
Ragesh Jaiswal, Amit Kumar, and Sandeep Sen. A simple d2-sampling based ptas for k-means and other clustering problems. Algorithmica, 70(1):22-46, 2014.
Ragesh Jaiswal, Mehul Kumar, and Pulkit Yadav. Improved analysis of d²-sampling based ptas for k-means and other clustering problems. Information Processing Letters, 115(2):100-103, 2015.
Tapas Kanungo, David M Mount, Nathan S Netanyahu, Christine D Piatko, Ruth Silverman, and Angela Y Wu. A local search approximation algorithm for k-means clustering. In Proceedings of the eighteenth annual symposium on Computational geometry, pages 10-18. ACM, 2002.
Amit Kumar, Yogish Sabharwal, and Sandeep Sen. Linear-time approximation schemes for clustering problems in any dimensions. Journal of the ACM (JACM), 57(2):5, 2010.
Euiwoong Lee, Melanie Schmidt, and John Wright. Improved and simplified inapproximability for k-means. Information Processing Letters, 120:40-43, 2017.
Meena Mahajan, Prajakta Nimbhorkar, and Kasturi Varadarajan. The planar k-means problem is np-hard. Theoretical Computer Science, 442:13-21, 2012.
Pasin Manurangsi. Almost-polynomial ratio eth-hardness of approximating densest k-subgraph. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 954-961. ACM, 2017.
Rafail Ostrovsky, Yuval Rabani, Leonard J Schulman, and Chaitanya Swamy. The effectiveness of lloyd-type methods for the k-means problem. Journal of the ACM (JACM), 59(6):28, 2012.
Luca Trevisan. Inapproximability of combinatorial optimization problems. CoRR, cs.CC/0409043, 2004. URL: http://arxiv.org/abs/cs.CC/0409043.
Andrea Vattani. The hardness of k-means clustering in the plane. Technical report, Department of Computer Science and Engineering, University of California San Diego, 2009.
Sharad Vikram and Sanjoy Dasgupta. Interactive bayesian hierarchical clustering. In International Conference on Machine Learning, pages 2081-2090, 2016.
Konstantin Voevodski, Maria-Florina Balcan, Heiko Röglin, Shang-Hua Teng, and Yu Xia. Efficient clustering with limited distance information. CoRR, abs/1408.2045, 2014. URL: http://arxiv.org/abs/1408.2045.

Approximate Clustering with Same-Cluster Queries

Authors Nir Ailon, Anup Bhattacharya, Ragesh Jaiswal, Amit Kumar

File

Document Identifiers

Author Details

Cite AsGet BibTex

Abstract

Keywords

Metrics

References

Thanks for your feedback!

Could not send message