Robust Communication-Optimal Distributed Clustering Algorithms

Robust Communication-Optimal Distributed Clustering Algorithms In this work, we study the k-median and k-means clustering problems when the data is distributed across many servers and can contain outliers. While there has been a lot of work on these problems for worst-case instances, we focus on gaining a finer understanding through the lens of beyond worst-case analysis. Our main motivation is the following: for many applications such as clustering proteins by function or clustering communities in a social network, there is some unknown target clustering, and the hope is that running a k-median or k-means algorithm will produce clusterings which are close to matching the target clustering. Worst-case results can guarantee constant factor approximations to the optimal k-median or k-means objective value, but not closeness to the target clustering. Our first result is a distributed algorithm which returns a near-optimal clustering assuming a natural notion of stability, namely, approximation stability [Awasthi and Balcan, 2014], even when a constant fraction of the data are outliers. The communication complexity is O~(sk+z) where s is the number of machines, k is the number of clusters, and z is the number of outliers. Next, we show this amount of communication cannot be improved even in the setting when the input satisfies various non-worst-case assumptions. We give a matching Omega(sk+z) lower bound on the communication required both for approximating the optimal k-means or k-median cost up to any constant, and for returning a clustering that is close to the target clustering in Hamming distance. These lower bounds hold even when the data satisfies approximation stability or other common notions of stability, and the cluster sizes are balanced. Therefore, Omega(sk+z) is a communication bottleneck, even for real-world instances. robust distributed clustering communication complexity Theory of computation~Unsupervised learning and clustering 18:1-18:16 Track A: Algorithms, Complexity and Games A full version of the paper is available at https://arxiv.org/abs/1703.00830. This work was supported in part by NSF grants CCF-1422910, CCF-1535967, IIS-1618714, an Office of Naval Research (ONR) grant N00014-18-1-2562, an Amazon Research Award, a Microsoft Research Faculty Fellowship, and a National Defense Science & Engineering Graduate (NDSEG) fellowship. Part of this work was done while Ainesh Bakshi and David Woodruff were visiting the Simons Institute for the Theory of Computing. Pranjal Awasthi Pranjal Awasthi Rutgers University, Piscataway, NJ, USA Ainesh Bakshi Ainesh Bakshi Carnegie Mellon University, Pittsburgh, PA, USA Maria-Florina Balcan Maria-Florina Balcan Carnegie Mellon University, Pittsburgh, PA, USA Colin White Colin White Carnegie Mellon University, Pittsburgh, PA, USA David P. Woodruff David P. Woodruff Carnegie Mellon University, Pittsburgh, PA, USA 10.4230/LIPIcs.ICALP.2019.18 Margareta Ackerman and Shai Ben-David. Clusterability: A theoretical study. In Artificial Intelligence and Statistics, pages 1-8, 2009. Manu Agarwal, Ragesh Jaiswal, and Arindam Pal. k-means++ under Approximation Stability. Theoretical Computer Science, 588:37-51, 2015. Gagan Aggarwal, Tomás Feder, Krishnaram Kenthapadi, Samir Khuller, Rina Panigrahy, Dilys Thomas, and An Zhu. Achieving anonymity via clustering. In Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 153-162, 2006. Sara Ahmadian, Ashkan Norouzi-Fard, Ola Svensson, and Justin Ward. Better Guarantees for k-Means and Euclidean k-Median by Primal-Dual Algorithms. arXiv preprint, 2016. URL: http://arxiv.org/abs/1612.07925. http://arxiv.org/abs/1612.07925 Sara Ahmadian and Chaitanya Swamy. Approximation Algorithms for Clustering Problems with Lower Bounds and Outliers. In 43rd International Colloquium on Automata, Languages, and Programming, ICALP 2016, July 11-15, 2016, Rome, Italy, pages 69:1-69:15, 2016. URL: http://dx.doi.org/10.4230/LIPIcs.ICALP.2016.69. http://dx.doi.org/10.4230/LIPIcs.ICALP.2016.69 Haris Angelidakis, Konstantin Makarychev, and Yury Makarychev. Algorithms for Stable and Perturbation-Resilient Problems. In Proceedings of the Annual Symposium on Theory of Computing (STOC), 2017. Vijay Arya, Naveen Garg, Rohit Khandekar, Adam Meyerson, Kamesh Munagala, and Vinayaka Pandit. Local search heuristics for k-median and facility location problems. SIAM Journal on Computing, 33(3):544-562, 2004. Pranjal Awasthi and Maria-Florina Balcan. Center based clustering: A foundational perspective. CRC, 2014. Pranjal Awasthi, Maria-Florina Balcan, Avrim Blum, Or Sheffet, and Santosh Vempala. On nash-equilibria of approximation-stable games. In International Symposium on Algorithmic Game Theory, pages 78-89. Springer, 2010. Pranjal Awasthi, Avrim Blum, and Or Sheffet. Center-based clustering under perturbation stability. Information Processing Letters, 112(1):49-54, 2012. Pranjal Awasthi and Or Sheffet. Improved spectral-norm bounds for clustering. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 37-49. Springer, 2012. Maria-Florina Balcan, Avrim Blum, and Anupam Gupta. Clustering under approximation stability. Journal of the ACM (JACM), 60(2):8, 2013. Maria-Florina Balcan, Avrim Blum, and Santosh Vempala. A discriminative framework for clustering via similarity functions. In Proceedings of the Annual Symposium on Theory of Computing (STOC), pages 671-680, 2008. Maria-Florina Balcan and Mark Braverman. Finding Low Error Clusterings. In COLT, volume 3, pages 3-4, 2009. Maria-Florina Balcan, Nika Haghtalab, and Colin White. k-center Clustering under Perturbation Resilience. In Proceedings of the Annual International Colloquium on Automata, Languages, and Programming (ICALP), 2016. Maria Florina Balcan and Yingyu Liang. Clustering under perturbation resilience. SIAM Journal on Computing, 45(1):102-155, 2016. Maria-Florina Balcan, Heiko Röglin, and Shang-Hua Teng. Agnostic Clustering. In ALT, pages 384-398. Springer, 2009. Maria-Florina F Balcan, Steven Ehrlich, and Yingyu Liang. Distributed k-means and k-median Clustering on General Topologies. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), pages 1995-2003, 2013. Ziv Bar-Yossef, Thathachar S Jayram, Ravi Kumar, and D Sivakumar. An information statistics approach to data stream and communication complexity. Journal of Computer and System Sciences, 68(4):702-732, 2004. Mohammadhossein Bateni, Aditya Bhaskara, Silvio Lattanzi, and Vahab Mirrokni. Distributed Balanced Clustering via Mapping Coresets. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), pages 2591-2599, 2014. Yonatan Bilu and Nathan Linial. Are stable instances easy? Combinatorics, Probability and Computing, 21(05):643-660, 2012. Mark Braverman, Faith Ellen, Rotem Oshman, Toniann Pitassi, and Vinod Vaikuntanathan. A tight bound for set disjointness in the message-passing model. In Proceedings of the Annual Symposium on Foundations of Computer Science (FOCS), pages 668-677. IEEE, 2013. Jarosław Byrka, Thomas Pensyl, Bartosz Rybicki, Aravind Srinivasan, and Khoa Trinh. An improved approximation for k-median, and positive correlation in budgeted optimization. In Proceedings of the Annual Symposium on Discrete Algorithms (SODA), pages 737-756. SIAM, 2015. Moses Charikar, Sudipto Guha, Éva Tardos, and David B Shmoys. A constant-factor approximation algorithm for the k-median problem. In Proceedings of the Annual Symposium on Theory of Computing (STOC), pages 1-10. ACM, 1999. Moses Charikar, Samir Khuller, David M Mount, and Giri Narasimhan. Algorithms for facility location problems with outliers. In Proceedings of the Annual Symposium on Discrete Algorithms (SODA), pages 642-651. Society for Industrial and Applied Mathematics, 2001. Chandra Chekuri and Shalmoli Gupta. Perturbation Resilient Clustering for k-Center and Related Problems via LP Relaxations. In Proceedings of the International Workshop on Approximation, Randomization, and Combinatorial Optimization Algorithms and Techniques (APPROX-RANDOM), 2018. Jiecao Chen, He Sun, David Woodruff, and Qin Zhang. Communication-Optimal Distributed Clustering. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), pages 3720-3728, 2016. Ke Chen. A constant factor approximation algorithm for k-median clustering with outliers. In Proceedings of the Annual Symposium on Discrete Algorithms (SODA), volume 8, pages 826-835, 2008. Michael B Cohen, Sam Elder, Cameron Musco, Christopher Musco, and Madalina Persu. Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 163-172. ACM, 2015. Travis Dick, Mu Li, Venkata Krishna Pillutla, Colin White, Maria Florina Balcan, and Alex Smola. Data Driven Resource Allocation for Distributed Learning. In "Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS)", 2017. Teofilo F Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293-306, 1985. Sudipto Guha, Yi Li, and Qin Zhang. Distributed Partial Clustering. In Proceedings of the Symposium on Parallelism in Algorithms and Architectures (SPAA), 2017. Rishi Gupta, Tim Roughgarden, and C Seshadhri. Decompositions of triangle-dense graphs. In Proceedings of the 5th conference on Innovations in theoretical computer science, pages 471-482. ACM, 2014. Lingxiao Huang, Shaofeng Jiang, Jian Li, and Xuan Wu. Epsilon-coresets for clustering (with outliers) in doubling metrics. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 814-825. IEEE, 2018. Zengfeng Huang, Bozidar Radunovic, Milan Vojnovic, and Qin Zhang. Communication complexity of approximate matching in distributed graphs. In LIPIcs-Leibniz International Proceedings in Informatics, volume 30. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2015. Tapas Kanungo, David M Mount, Nathan S Netanyahu, Christine D Piatko, Ruth Silverman, and Angela Y Wu. A local search approximation algorithm for k-means clustering. In Proceedings of the eighteenth annual symposium on Computational geometry, pages 10-18. ACM, 2002. Ari Kobren, Nicholas Monath, Akshay Krishnamurthy, and Andrew McCallum. A Hierarchical Algorithm for Extreme Clustering. In Proceedings of the KDD International Conference on Knowledge Discovery and Data Mining, pages 255-264. ACM, 2017. Ravishankar Krishnaswamy, Shi Li, and Sai Sandeep. Constant Approximation for k-Median and k-Means with Outliers via Iterative Rounding. CoRR, abs/1711.01323, 2017. URL: http://arxiv.org/abs/1711.01323. http://arxiv.org/abs/1711.01323 Amit Kumar and Ravindran Kannan. Clustering with spectral norm and the k-means algorithm. In Proceedings of the Annual Symposium on Foundations of Computer Science (FOCS), pages 299-308. IEEE, 2010. Shi Li and Xiangyu Guo. Distributed k-Clustering for Data with Heavy Noise. In Advances in Neural Information Processing Systems, pages 7849-7857, 2018. Konstantin Makarychev, Yury Makarychev, Maxim Sviridenko, and Justin Ward. A bi-criteria approximation algorithm for k Means. In APPROX, 2016. Gustavo Malkomes, Matt J Kusner, Wenlin Chen, Kilian Q Weinberger, and Benjamin Moseley. Fast Distributed k-Center Clustering with Outliers on Massive Data. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), pages 1063-1071, 2015. Rafail Ostrovsky, Yuval Rabani, Leonard J Schulman, and Chaitanya Swamy. The effectiveness of lloyd-type methods for the k-means problem. Journal of the ACM (JACM), 59(6):28, 2012. Kim D Pruitt, Tatiana Tatusova, Garth R Brown, and Donna R Maglott. NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy. Nucleic acids research, 2011. Alexander A. Razborov. On the distributional complexity of disjointness. Theoretical Computer Science, 106(2):385-390, 1992. Konstantin Voevodski, Maria-Florina Balcan, Heiko Röglin, Shang-Hua Teng, and Yu Xia. Min-sum clustering of protein sequences with limited distance information. In International Workshop on Similarity-Based Pattern Recognition, pages 192-206. Springer, 2011. Pranjal Awasthi, Ainesh Bakshi, Maria-Florina Balcan, Colin White, and David Woodruff Creative Commons Attribution 3.0 Unported license https://creativecommons.org/licenses/by/3.0/legalcode