Correlation Clustering with Same-Cluster Queries Bounded by Optimal Cost

Saha, Barna; Subramanian, Sanjay

doi:10.4230/LIPIcs.ESA.2019.81

Abstract

Several clustering frameworks with interactive (semi-supervised) queries have been studied in the past. Recently, clustering with same-cluster queries has become popular. An algorithm in this setting has access to an oracle with full knowledge of an optimal clustering, and the algorithm can ask the oracle queries of the form, "Does the optimal clustering put vertices u and v in the same cluster?" Due to its simplicity, this querying model can easily be implemented in real crowd-sourcing platforms and has attracted a lot of recent work. 
In this paper, we study the popular correlation clustering problem (Bansal et al., 2002) under the same-cluster querying framework. Given a complete graph G=(V,E) with positive and negative edge labels, the correlation clustering objective is to compute a clustering of the graph that minimizes the total number of disagreements, that is, the negative intra-cluster edges and positive inter-cluster edges. In a recent work, Ailon et al. (2018b) provided an algorithm that approximates the correlation clustering objective within a factor of (1+epsilon) using O((k^{14} log{n} log{k})/epsilon^6) queries when the number of clusters, k, is fixed. For many applications, k is not fixed and can grow with |V|. Moreover, the k^{14} dependence of the query complexity renders the algorithm impractical even for datasets with small values of k.
In this paper, we take a different approach. Let C_{OPT} be the number of disagreements made by the optimal clustering. We present algorithms for correlation clustering whose error and query bounds are parameterized by C_{OPT} rather than by the number of clusters. Indeed, a good clustering must have small C_{OPT}. Specifically, we present an efficient algorithm that recovers an exact optimal clustering using at most 2C_{OPT} queries and an efficient algorithm that outputs a 2-approximation using at most C_{OPT} queries. In addition, we show that, under a plausible complexity assumption, no polynomial-time algorithm using o(C_{OPT}) queries achieves an approximation ratio better than 1+alpha for an absolute constant alpha > 0. Therefore, our first algorithm achieves the optimal query bound within a factor of 2.
We extensively evaluate our methods on several synthetic and real-world datasets using real crowd-sourced oracles. Moreover, we compare our approach against known correlation clustering algorithms that do not perform querying. In all cases, our algorithms exhibit superior performance.
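As a minimal illustration of the two notions the abstract relies on, the disagreement objective and the same-cluster oracle, the following Python sketch counts disagreements on a complete signed graph and wraps a reference clustering in a query-counting oracle. It is not the authors' algorithm; the names `disagreements` and `SameClusterOracle`, the edge representation, and the toy instance are all assumptions made for this example.

```python
from itertools import combinations


def disagreements(n, positive_edges, labels):
    """Count disagreements of a clustering on a complete signed graph.

    n: number of vertices, named 0..n-1
    positive_edges: set of frozenset({u, v}) pairs labeled '+';
                    every other pair is treated as labeled '-'
    labels: labels[v] is the cluster id assigned to vertex v
    """
    cost = 0
    for u, v in combinations(range(n), 2):
        same_cluster = labels[u] == labels[v]
        is_positive = frozenset((u, v)) in positive_edges
        # A disagreement is a '-' edge inside a cluster
        # or a '+' edge between two clusters.
        if same_cluster != is_positive:
            cost += 1
    return cost


class SameClusterOracle:
    """Hypothetical oracle: answers same-cluster queries and counts them."""

    def __init__(self, reference_labels):
        self._labels = reference_labels
        self.queries = 0

    def same_cluster(self, u, v):
        self.queries += 1
        return self._labels[u] == self._labels[v]


# Toy instance (assumed): '+' edges form the triangle {0, 1, 2} plus edge (2, 3).
positive = {frozenset(e) for e in [(0, 1), (0, 2), (1, 2), (2, 3)]}
labels = [0, 0, 0, 1]  # clusters {0, 1, 2} and {3}

cost = disagreements(4, positive, labels)  # only the '+' edge (2, 3) is cut
oracle = SameClusterOracle(labels)
answer = oracle.same_cluster(2, 3)
```

On this toy instance the clustering `{0, 1, 2}, {3}` incurs one disagreement, while placing all four vertices in one cluster incurs two. The `queries` counter is the quantity that the abstract's bounds relate to C_{OPT}.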

Cite As

Barna Saha and Sanjay Subramanian. Correlation Clustering with Same-Cluster Queries Bounded by Optimal Cost. In 27th Annual European Symposium on Algorithms (ESA 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 144, pp. 81:1-81:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019) https://doi.org/10.4230/LIPIcs.ESA.2019.81

Author Details

Barna Saha

University of California, Berkeley, USA

Sanjay Subramanian

Allen Institute for Artificial Intelligence, Irvine, CA, USA

Funding

Saha, Barna: B. Saha is partially supported by an NSF CAREER Award CCF 1652303, a Google Faculty Award and an Alfred P. Sloan fellowship.
Subramanian, Sanjay: This work was supported in part by the National Science Foundation (NSF) Research Experiences for Undergraduates (REU) program.

Acknowledgements

The second author would like to thank Dan Roth for letting him use his machines for running experiments, Sainyam Galhotra for help with datasets, and Rajiv Gandhi for useful discussions.

Supplementary Materials

Code and Data: https://www.github.com/sanjayss34/corr-clust-query-esa2019

References

N. Ailon, M. Charikar, and A. Newman. Aggregating Inconsistent Information: Ranking and Clustering. Symposium on the Theory of Computing (STOC), 2005.
Nir Ailon, Anup Bhattacharya, and Ragesh Jaiswal. Approximate Correlation Clustering Using Same-Cluster Queries. In Latin American Symposium on Theoretical Informatics, pages 14-27. Springer, 2018.
Nir Ailon, Anup Bhattacharya, Ragesh Jaiswal, and Amit Kumar. Approximate clustering with same-cluster queries. arXiv preprint, 2017. URL: http://arxiv.org/abs/1704.01862.
Nir Ailon, Moses Charikar, and Alantha Newman. Aggregating inconsistent information: ranking and clustering. Journal of the ACM (JACM), 55(5):23, 2008.
H. Ashtiani, S. Kushagra, and S. Ben-David. Clustering with Same-Cluster Queries. Advances in Neural Information Processing Systems (NIPS), 2016.
M. F. Balcan and A. Blum. Clustering with Interactive Feedback. International Conference on Algorithmic Learning Theory (ALT), 2008.
N. Bansal, A. Blum, and S. Chawla. Correlation Clustering. Symposium on Foundations of Computer Science (FOCS), 2002.
Sebastian Böcker, Sebastian Briesemeister, Quang Bao Anh Bui, and Anke Truß. Going weighted: Parameterized algorithms for cluster editing. Theoretical Computer Science, 410(52):5467-5480, 2009.
Moses Charikar, Venkatesan Guruswami, and Anthony Wirth. Clustering with qualitative information. Journal of Computer and System Sciences, 71(3):360-383, 2005.
S. Chawla, K. Makarychev, T. Schramm, and G. Yaroslavtsev. Near Optimal LP Rounding Algorithm for Correlation Clustering on Complete and Complete k-partite Graphs. Symposium on the Theory of Computing (STOC), pages 219-228, 2015.
Peter Christen. Febrl: a freely available record linkage system with a graphical user interface. In Proceedings of the second Australasian workshop on Health data and knowledge management-Volume 80, pages 17-25. Australian Computer Society, Inc., 2008.
Erik D Demaine, Dotan Emanuel, Amos Fiat, and Nicole Immorlica. Correlation clustering in general weighted graphs. Theoretical Computer Science, 361(2-3):172-187, 2006.
I. Dinur. Mildly exponential reduction from Gap 3SAT to polynomial-gap label-cover. Electronic Colloquium on Computational Complexity (ECCC), 2016.
Donatella Firmani, Sainyam Galhotra, Barna Saha, and Divesh Srivastava. Robust Entity Resolution Using a CrowdOracle. IEEE Data Eng. Bull., 41(2):91-103, 2018.
Buddhima Gamlath, Sangxia Huang, and Ola Svensson. Semi-Supervised Algorithms for Approximately Optimal and Accurate Clustering. In 45th International Colloquium on Automata, Languages, and Programming, ICALP 2018, July 9-13, 2018, Prague, Czech Republic, pages 57:1-57:14, 2018. URL: https://doi.org/10.4230/LIPIcs.ICALP.2018.57.
I. Giotis and V. Guruswami. Correlation Clustering with a Fixed Number of Clusters. ACM-SIAM Symposium on Discrete Algorithms (SODA), 2006.
Anja Gruenheid, Besmira Nushi, Tim Kraska, Wolfgang Gatterbauer, and Donald Kossmann. Fault-tolerant entity resolution with the crowd. arXiv preprint, 2015. URL: http://arxiv.org/abs/1512.00537.
Shrinu Kushagra, Shai Ben-David, and Ihab Ilyas. Semi-supervised clustering for de-duplication. arXiv preprint, 2018. URL: http://arxiv.org/abs/1810.04361.
A. Mazumdar and B. Saha. Clustering with Noisy Queries. Advances in Neural Information Processing Systems (NIPS), 2017.
Andrew McCallum. Data. URL: https://people.cs.umass.edu/~mccallum/data.html.
Aude Oliva and Antonio Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International journal of computer vision, 42(3):145-175, 2001.
Vasilis Verroios and Hector Garcia-Molina. Entity resolution with crowd errors. In 2015 IEEE 31st International Conference on Data Engineering, pages 219-230. IEEE, 2015.
Vasilis Verroios, Hector Garcia-Molina, and Yannis Papakonstantinou. Waldo: An adaptive human interface for crowd entity resolution. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 1133-1148. ACM, 2017.
David Williams. Probability with martingales. Cambridge university press, 1991.
William E Winkler. Overview of record linkage and current research directions. In Bureau of the Census. Citeseer, 2006.
