Correlation Clustering with Same-Cluster Queries Bounded by Optimal Cost

Authors Barna Saha, Sanjay Subramanian

Author Details

Barna Saha
  • University of California, Berkeley, USA
Sanjay Subramanian
  • Allen Institute for Artificial Intelligence, Irvine, CA, USA


The second author would like to thank Dan Roth for letting him use his machines for running experiments, Sainyam Galhotra for help with datasets, and Rajiv Gandhi for useful discussions.

Cite As

Barna Saha and Sanjay Subramanian. Correlation Clustering with Same-Cluster Queries Bounded by Optimal Cost. In 27th Annual European Symposium on Algorithms (ESA 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 144, pp. 81:1-81:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)


Several clustering frameworks with interactive (semi-supervised) queries have been studied in the past. Recently, clustering with same-cluster queries has become popular. An algorithm in this setting has access to an oracle with full knowledge of an optimal clustering, and the algorithm can ask the oracle queries of the form, "Does the optimal clustering put vertices u and v in the same cluster?" Due to its simplicity, this querying model can easily be implemented in real crowd-sourcing platforms and has attracted a lot of recent work. In this paper, we study the popular correlation clustering problem (Bansal et al., 2002) under the same-cluster querying framework. Given a complete graph G=(V,E) with positive and negative edge labels, the correlation clustering objective aims to compute a graph clustering that minimizes the total number of disagreements, that is, the negative intra-cluster edges and positive inter-cluster edges. In a recent work, Ailon et al. (2018b) provided an algorithm that approximates the correlation clustering objective within a factor of (1+epsilon) using O((k^{14} log n log k)/epsilon^6) queries when the number of clusters, k, is fixed. For many applications, k is not fixed and can grow with |V|. Moreover, the k^{14} dependence of the query complexity renders the algorithm impractical even for datasets with small values of k. In this paper, we take a different approach. Let C_{OPT} be the number of disagreements made by the optimal clustering. We present algorithms for correlation clustering whose error and query bounds are parameterized by C_{OPT} rather than by the number of clusters. Indeed, a good clustering must have small C_{OPT}. Specifically, we present an efficient algorithm that recovers an exact optimal clustering using at most 2C_{OPT} queries and an efficient algorithm that outputs a 2-approximation using at most C_{OPT} queries.
In addition, we show that, under a plausible complexity assumption, there does not exist any polynomial-time algorithm that achieves an approximation ratio better than 1+alpha, for an absolute constant alpha > 0, with o(C_{OPT}) queries. Therefore, our first algorithm achieves the optimal query bound within a factor of 2. We extensively evaluate our methods on several synthetic and real-world datasets using real crowd-sourced oracles. Moreover, we compare our approach against known correlation clustering algorithms that do not perform querying. In all cases, our algorithms exhibit superior performance.
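
To make the objective and the query model concrete, the following is a minimal Python sketch, not the paper's algorithms. It counts disagreements exactly as defined above (negative intra-cluster edges plus positive inter-cluster edges) and wraps a reference clustering in a query-counting oracle. The names `count_disagreements` and `SameClusterOracle`, the dictionary-based graph encoding, and the four-vertex toy instance are all assumptions introduced for illustration.

```python
from itertools import combinations


def count_disagreements(labels, cluster_of):
    """Count negative intra-cluster edges plus positive inter-cluster edges.

    labels: dict mapping frozenset({u, v}) to +1 or -1 for every vertex pair
            (the graph is complete, so every pair must be present).
    cluster_of: dict mapping each vertex to a cluster id.
    """
    disagreements = 0
    for u, v in combinations(sorted(cluster_of), 2):
        same = cluster_of[u] == cluster_of[v]
        sign = labels[frozenset((u, v))]
        if (same and sign < 0) or (not same and sign > 0):
            disagreements += 1
    return disagreements


class SameClusterOracle:
    """Hypothetical oracle: answers same-cluster queries and counts them."""

    def __init__(self, reference_cluster_of):
        self._reference = reference_cluster_of
        self.queries = 0

    def same_cluster(self, u, v):
        self.queries += 1
        return self._reference[u] == self._reference[v]


# Toy instance (illustrative only): the positive edges form the path b-a-c-d,
# and every other pair is negative.
vertices = ["a", "b", "c", "d"]
labels = {frozenset(p): -1 for p in combinations(vertices, 2)}
for p in [("a", "b"), ("a", "c"), ("c", "d")]:
    labels[frozenset(p)] = +1

# Clusters {a, b} and {c, d}: only the positive edge (a, c) is cut.
clustering = {"a": 0, "b": 0, "c": 1, "d": 1}
cost = count_disagreements(labels, clustering)
```

In this toy instance the positive edges do not form disjoint cliques, so no clustering has zero disagreements, and the clustering shown (with one disagreement) is optimal. An algorithm in the same-cluster query model would call `same_cluster` and be charged one query per call, which is the quantity the bounds in the abstract compare against C_{OPT}.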

Subject Classification

ACM Subject Classification
  • Theory of computation → Unsupervised learning and clustering
  • Theory of computation → Approximation algorithms analysis
Keywords
  • Clustering
  • Approximation Algorithm
  • Crowdsourcing
  • Randomized Algorithm




  1. N. Ailon, M. Charikar, and A. Newman. Aggregating Inconsistent Information: Ranking and Clustering. Symposium on the Theory of Computing (STOC), 2005.
  2. Nir Ailon, Anup Bhattacharya, and Ragesh Jaiswal. Approximate Correlation Clustering Using Same-Cluster Queries. In Latin American Symposium on Theoretical Informatics, pages 14-27. Springer, 2018.
  3. Nir Ailon, Anup Bhattacharya, Ragesh Jaiswal, and Amit Kumar. Approximate clustering with same-cluster queries. arXiv preprint, 2017.
  4. Nir Ailon, Moses Charikar, and Alantha Newman. Aggregating inconsistent information: ranking and clustering. Journal of the ACM (JACM), 55(5):23, 2008.
  5. H. Ashtiani, S. Kushagra, and S. Ben-David. Clustering with Same-Cluster Queries. Advances in Neural Information Processing Systems (NIPS), 2016.
  6. M. F. Balcan and A. Blum. Clustering with Interactive Feedback. International Conference on Algorithmic Learning Theory (ALT), 2008.
  7. N. Bansal, A. Blum, and S. Chawla. Correlation Clustering. Symposium on Foundations of Computer Science (FOCS), 2002.
  8. Sebastian Böcker, Sebastian Briesemeister, Quang Bao Anh Bui, and Anke Truß. Going weighted: Parameterized algorithms for cluster editing. Theoretical Computer Science, 410(52):5467-5480, 2009.
  9. Moses Charikar, Venkatesan Guruswami, and Anthony Wirth. Clustering with qualitative information. Journal of Computer and System Sciences, 71(3):360-383, 2005.
  10. S. Chawla, K. Makarychev, T. Schramm, and G. Yaroslavtsev. Near Optimal LP Rounding Algorithm for Correlation Clustering on Complete and Complete k-partite Graphs. Symposium on the Theory of Computing (STOC), pages 219-228, 2015.
  11. Peter Christen. Febrl: a freely available record linkage system with a graphical user interface. In Proceedings of the second Australasian workshop on Health data and knowledge management-Volume 80, pages 17-25. Australian Computer Society, Inc., 2008.
  12. Erik D Demaine, Dotan Emanuel, Amos Fiat, and Nicole Immorlica. Correlation clustering in general weighted graphs. Theoretical Computer Science, 361(2-3):172-187, 2006.
  13. I. Dinur. Mildly exponential reduction from Gap 3SAT to polynomial-gap label-cover. Electronic Colloquium on Computational Complexity (ECCC), 2016.
  14. Donatella Firmani, Sainyam Galhotra, Barna Saha, and Divesh Srivastava. Robust Entity Resolution Using a CrowdOracle. IEEE Data Eng. Bull., 41(2):91-103, 2018.
  15. Buddhima Gamlath, Sangxia Huang, and Ola Svensson. Semi-Supervised Algorithms for Approximately Optimal and Accurate Clustering. In 45th International Colloquium on Automata, Languages, and Programming, ICALP 2018, July 9-13, 2018, Prague, Czech Republic, pages 57:1-57:14, 2018.
  16. I. Giotis and V. Guruswami. Correlation Clustering with a Fixed Number of Clusters. ACM-SIAM Symposium on Discrete Algorithms (SODA), 2006.
  17. Anja Gruenheid, Besmira Nushi, Tim Kraska, Wolfgang Gatterbauer, and Donald Kossmann. Fault-tolerant entity resolution with the crowd. arXiv preprint, 2015.
  18. Shrinu Kushagra, Shai Ben-David, and Ihab Ilyas. Semi-supervised clustering for de-duplication. arXiv preprint, 2018.
  19. A. Mazumdar and B. Saha. Clustering with Noisy Queries. Advances in Neural Information Processing Systems (NIPS), 2017.
  20. Andrew McCallum. Data.
  21. Aude Oliva and Antonio Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International journal of computer vision, 42(3):145-175, 2001.
  22. Vasilis Verroios and Hector Garcia-Molina. Entity resolution with crowd errors. In 2015 IEEE 31st International Conference on Data Engineering, pages 219-230. IEEE, 2015.
  23. Vasilis Verroios, Hector Garcia-Molina, and Yannis Papakonstantinou. Waldo: An adaptive human interface for crowd entity resolution. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 1133-1148. ACM, 2017.
  24. David Williams. Probability with martingales. Cambridge university press, 1991.
  25. William E Winkler. Overview of record linkage and current research directions. In Bureau of the Census. Citeseer, 2006.