Robust Correlation Clustering

Authors Devvrit, Ravishankar Krishnaswamy, Nived Rajaraman



PDF
Thumbnail PDF

File

LIPIcs.APPROX-RANDOM.2019.33.pdf
  • Filesize: 0.58 MB
  • 18 pages

Document Identifiers

Author Details

Devvrit
  • BITS Pilani, Goa Campus, Goa, India
Ravishankar Krishnaswamy
  • Microsoft Research, Bengaluru, India
Nived Rajaraman
  • IIT Madras, Chennai, India

Cite AsGet BibTex

Devvrit, Ravishankar Krishnaswamy, and Nived Rajaraman. Robust Correlation Clustering. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 145, pp. 33:1-33:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)
https://doi.org/10.4230/LIPIcs.APPROX-RANDOM.2019.33

Abstract

In this paper, we introduce and study the Robust-Correlation-Clustering problem: given a graph G = (V,E) where every edge is either labeled + or - (denoting similar or dissimilar pairs of vertices), and a parameter m, the goal is to delete a set D of m vertices, and partition the remaining vertices V \ D into clusters to minimize the cost of the clustering, which is the sum of the number of + edges with end-points in different clusters and the number of - edges with end-points in the same cluster. This generalizes the classical Correlation-Clustering problem which is the special case when m = 0. Correlation clustering is useful when we have (only) qualitative information about the similarity or dissimilarity of pairs of points, and Robust-Correlation-Clustering equips this model with the capability to handle noise in datasets. In this work, we present a constant-factor bi-criteria algorithm for Robust-Correlation-Clustering on complete graphs (where our solution is O(1)-approximate w.r.t the cost while however discarding O(1) m points as outliers), and also complement this by showing that no finite approximation is possible if we do not violate the outlier budget. Our algorithm is very simple in that it first does a simple LP-based pre-processing to delete O(m) vertices, and subsequently runs a particular Correlation-Clustering algorithm ACNAlg [Ailon et al., 2005] on the residual instance. We then consider general graphs, and show (O(log n), O(log^2 n)) bi-criteria algorithms while also showing a hardness of alpha_MC on both the cost and the outlier violation, where alpha_MC is the lower bound for the Minimum-Multicut problem.

Subject Classification

ACM Subject Classification
  • Theory of computation → Design and analysis of algorithms
  • Theory of computation → Facility location and clustering
Keywords
  • Correlation Clustering
  • Outlier Detection
  • Clustering
  • Approximation Algorithms

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. KookJin Ahn, Graham Cormode, Sudipto Guha, Andrew McGregor, and Anthony Wirth. Correlation Clustering in Data Streams. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 2237-2246, Lille, France, 2015. PMLR. URL: http://proceedings.mlr.press/v37/ahn15.html.
  2. Nir Ailon, Moses Charikar, and Alantha Newman. Aggregating Inconsistent Information: Ranking and Clustering. In Proceedings of the Thirty-seventh Annual ACM Symposium on Theory of Computing, STOC '05, pages 684-693, New York, NY, USA, 2005. ACM. URL: https://doi.org/10.1145/1060590.1060692.
  3. Nikhil Bansal, Avrim Blum, and Shuchi Chawla. Correlation Clustering. Mach. Learn., 56(1-3):89-113, June 2004. URL: https://doi.org/10.1023/B:MACH.0000033116.57574.95.
  4. Amir Ben-Dor and Zohar Yakhini. Clustering Gene Expression Patterns. In Proceedings of the Third Annual International Conference on Computational Molecular Biology, RECOMB '99, pages 33-42, New York, NY, USA, 1999. ACM. URL: https://doi.org/10.1145/299432.299448.
  5. Moses Charikar, Venkatesan Guruswami, and Anthony Wirth. Clustering with Qualitative Information. J. Comput. Syst. Sci., 71(3):360-383, October 2005. URL: https://doi.org/10.1016/j.jcss.2004.10.012.
  6. Moses Charikar, Samir Khuller, David M. Mount, and Giri Narasimhan. Algorithms for Facility Location Problems with Outliers. In Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '01, pages 642-651, Philadelphia, PA, USA, 2001. Society for Industrial and Applied Mathematics. URL: http://dl.acm.org/citation.cfm?id=365411.365555.
  7. Sanjay Chawla and Aristides Gionis. k-means-: A Unified Approach to Clustering and Outlier Detection. In SDM, pages 189-197. SIAM, 2013. URL: http://dblp.uni-trier.de/db/conf/sdm/sdm2013.html#ChawlaG13.
  8. Shuchi Chawla, Konstantin Makarychev, Tselil Schramm, and Grigory Yaroslavtsev. Near Optimal LP Rounding Algorithm for CorrelationClustering on Complete and Complete K-partite Graphs. In Proceedings of the Forty-seventh Annual ACM Symposium on Theory of Computing, STOC '15, pages 219-228, New York, NY, USA, 2015. ACM. URL: https://doi.org/10.1145/2746539.2746604.
  9. Jiecao Chen, Erfan Sadeqi Azer, and Qin Zhang. A Practical Algorithm for Distributed Clustering and Outlier Detection. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada., pages 2253-2262, 2018. URL: http://papers.nips.cc/paper/7493-a-practical-algorithm-for-distributed-clustering-and-outlier-detection.
  10. Ke Chen. A Constant Factor Approximation Algorithm for K-median Clustering with Outliers. In Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '08, pages 826-835, Philadelphia, PA, USA, 2008. Society for Industrial and Applied Mathematics. URL: http://dl.acm.org/citation.cfm?id=1347082.1347173.
  11. Flavio Chierichetti, Nilesh Dalvi, and Ravi Kumar. Correlation Clustering in MapReduce. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '14, pages 641-650, New York, NY, USA, 2014. ACM. URL: https://doi.org/10.1145/2623330.2623743.
  12. William Cohen and Jacob Richman. Learning to Match and Cluster Entity Names. In In ACM SIGIR-2001 Workshop on Mathematical/Formal Methods in Information Retrieval, 2001. Google Scholar
  13. William W. Cohen and Jacob Richman. Learning to Match and Cluster Large High-dimensional Data Sets for Data Integration. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '02, pages 475-480, New York, NY, USA, 2002. ACM. URL: https://doi.org/10.1145/775047.775116.
  14. Erik D. Demaine, Dotan Emanuel, Amos Fiat, and Nicole Immorlica. Correlation clustering in general weighted graphs. Theoretical Computer Science, 361(2):172-187, 2006. Approximation and Online Algorithms. URL: https://doi.org/10.1016/j.tcs.2006.05.008.
  15. Jittat Fakcharoenphol, Satish Rao, and Kunal Talwar. Approximating Metrics by Tree Metrics. SIGACT News, 35(2):60-70, June 2004. URL: https://doi.org/10.1145/992287.992300.
  16. Shalmoli Gupta, Ravi Kumar, Kefu Lu, Benjamin Moseley, and Sergei Vassilvitskii. Local Search Methods for k-Means with Outliers. PVLDB, 10(7):757-768, 2017. URL: https://doi.org/10.14778/3067421.3067425.
  17. Ravishankar Krishnaswamy, Shi Li, and Sai Sandeep. Constant Approximation for K-median and K-means with Outliers via Iterative Rounding. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, pages 646-659, New York, NY, USA, 2018. ACM. URL: https://doi.org/10.1145/3188745.3188882.
  18. Shi Li and Xiangyu Guo. Distributed k-Clustering for Data with Heavy Noise. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada., pages 7849-7857, 2018. URL: http://papers.nips.cc/paper/8009-distributed-k-clustering-for-data-with-heavy-noise.
  19. Konstantin Makarychev, Yury Makarychev, and Aravindan Vijayaraghavan. Correlation Clustering with Noisy Partial Information. In Peter Grünwald, Elad Hazan, and Satyen Kale, editors, Proceedings of The 28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pages 1321-1342, Paris, France, 2015. PMLR. URL: http://proceedings.mlr.press/v40/Makarychev15.html.
  20. Andrew McCallum and Ben Wellner. Toward Conditional Models of Identity Uncertainty with Application to Proper Noun Coreference. In Proceedings of the 2003 International Conference on Information Integration on the Web, IIWEB'03, pages 79-84. AAAI Press, 2003. URL: http://dl.acm.org/citation.cfm?id=3104278.3104294.
  21. Napat Rujeerapaiboon, Kilian Schindler, Daniel Kuhn, and Wolfram Wiesemann. Size Matters: Cardinality-Constrained Clustering and Outlier Detection via Conic Optimization. SIAM Journal on Optimization, 2019. Google Scholar
  22. Chaitanya Swamy. Correlation Clustering: Maximizing Agreements via Semidefinite Programming. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms, volume 15, pages 526-527, January 2004. Google Scholar
  23. Anthony Wirth. Correlation Clustering. In Encyclopedia of Machine Learning, pages 227-231. Springer, 2010. URL: https://doi.org/10.1007/978-0-387-30164-8_176.
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail