Robust Communication-Optimal Distributed Clustering Algorithms

Authors Pranjal Awasthi, Ainesh Bakshi, Maria-Florina Balcan, Colin White, David P. Woodruff



File

LIPIcs.ICALP.2019.18.pdf
  • Filesize: 0.52 MB
  • 16 pages

Author Details

Pranjal Awasthi
  • Rutgers University, Piscataway, NJ, USA
Ainesh Bakshi
  • Carnegie Mellon University, Pittsburgh, PA, USA
Maria-Florina Balcan
  • Carnegie Mellon University, Pittsburgh, PA, USA
Colin White
  • Carnegie Mellon University, Pittsburgh, PA, USA
David P. Woodruff
  • Carnegie Mellon University, Pittsburgh, PA, USA

Acknowledgements

This work was supported in part by NSF grants CCF-1422910, CCF-1535967, IIS-1618714, an Office of Naval Research (ONR) grant N00014-18-1-2562, an Amazon Research Award, a Microsoft Research Faculty Fellowship, and a National Defense Science & Engineering Graduate (NDSEG) fellowship. Part of this work was done while Ainesh Bakshi and David Woodruff were visiting the Simons Institute for the Theory of Computing.

Cite As

Pranjal Awasthi, Ainesh Bakshi, Maria-Florina Balcan, Colin White, and David P. Woodruff. Robust Communication-Optimal Distributed Clustering Algorithms. In 46th International Colloquium on Automata, Languages, and Programming (ICALP 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 132, pp. 18:1-18:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)
https://doi.org/10.4230/LIPIcs.ICALP.2019.18

Abstract

In this work, we study the k-median and k-means clustering problems when the data is distributed across many servers and can contain outliers. While there has been a lot of work on these problems for worst-case instances, we focus on gaining a finer understanding through the lens of beyond worst-case analysis. Our main motivation is the following: for many applications such as clustering proteins by function or clustering communities in a social network, there is some unknown target clustering, and the hope is that running a k-median or k-means algorithm will produce clusterings which are close to matching the target clustering. Worst-case results can guarantee constant factor approximations to the optimal k-median or k-means objective value, but not closeness to the target clustering. Our first result is a distributed algorithm which returns a near-optimal clustering assuming a natural notion of stability, namely, approximation stability [Awasthi and Balcan, 2014], even when a constant fraction of the data are outliers. The communication complexity is O~(sk+z) where s is the number of machines, k is the number of clusters, and z is the number of outliers. Next, we show this amount of communication cannot be improved even in the setting when the input satisfies various non-worst-case assumptions. We give a matching Omega(sk+z) lower bound on the communication required both for approximating the optimal k-means or k-median cost up to any constant, and for returning a clustering that is close to the target clustering in Hamming distance. These lower bounds hold even when the data satisfies approximation stability or other common notions of stability, and the cluster sizes are balanced. Therefore, Omega(sk+z) is a communication bottleneck, even for real-world instances.
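The objective discussed in the abstract can be made concrete with a minimal sketch (this is only an illustration of the k-means-with-outliers cost, not the paper's distributed algorithm; all names and the sample data are hypothetical):

```python
def kmeans_cost_with_outliers(points, centers, z):
    """k-means cost when the z points farthest from their nearest
    center are discarded as outliers."""
    # Squared distance from each point to its nearest center.
    dists = [
        min(sum((p - c) ** 2 for p, c in zip(pt, ctr)) for ctr in centers)
        for pt in points
    ]
    dists.sort()
    # Drop the z largest distances (treated as outliers); sum the rest.
    return sum(dists[: len(dists) - z])

# Two tight clusters plus one far-away outlier.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (100.0, 100.0)]
centers = [(0.05, 0.0), (5.05, 5.0)]
robust_cost = kmeans_cost_with_outliers(points, centers, z=1)
plain_cost = kmeans_cost_with_outliers(points, centers, z=0)
```

With z = 1 the distant point is excluded and the cost is tiny; with z = 0 the single outlier dominates the objective, which is why robustness to outliers changes the problem so substantially.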

Subject Classification

ACM Subject Classification
  • Theory of computation → Unsupervised learning and clustering
Keywords
  • robust distributed clustering
  • communication complexity

References

  1. Margareta Ackerman and Shai Ben-David. Clusterability: A theoretical study. In Artificial Intelligence and Statistics, pages 1-8, 2009.
  2. Manu Agarwal, Ragesh Jaiswal, and Arindam Pal. k-means++ under Approximation Stability. Theoretical Computer Science, 588:37-51, 2015.
  3. Gagan Aggarwal, Tomás Feder, Krishnaram Kenthapadi, Samir Khuller, Rina Panigrahy, Dilys Thomas, and An Zhu. Achieving anonymity via clustering. In Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 153-162, 2006.
  4. Sara Ahmadian, Ashkan Norouzi-Fard, Ola Svensson, and Justin Ward. Better Guarantees for k-Means and Euclidean k-Median by Primal-Dual Algorithms. arXiv preprint, 2016. URL: http://arxiv.org/abs/1612.07925.
  5. Sara Ahmadian and Chaitanya Swamy. Approximation Algorithms for Clustering Problems with Lower Bounds and Outliers. In 43rd International Colloquium on Automata, Languages, and Programming, ICALP 2016, July 11-15, 2016, Rome, Italy, pages 69:1-69:15, 2016. URL: http://dx.doi.org/10.4230/LIPIcs.ICALP.2016.69.
  6. Haris Angelidakis, Konstantin Makarychev, and Yury Makarychev. Algorithms for Stable and Perturbation-Resilient Problems. In Proceedings of the Annual Symposium on Theory of Computing (STOC), 2017.
  7. Vijay Arya, Naveen Garg, Rohit Khandekar, Adam Meyerson, Kamesh Munagala, and Vinayaka Pandit. Local search heuristics for k-median and facility location problems. SIAM Journal on Computing, 33(3):544-562, 2004.
  8. Pranjal Awasthi and Maria-Florina Balcan. Center based clustering: A foundational perspective. CRC, 2014.
  9. Pranjal Awasthi, Maria-Florina Balcan, Avrim Blum, Or Sheffet, and Santosh Vempala. On nash-equilibria of approximation-stable games. In International Symposium on Algorithmic Game Theory, pages 78-89. Springer, 2010.
  10. Pranjal Awasthi, Avrim Blum, and Or Sheffet. Center-based clustering under perturbation stability. Information Processing Letters, 112(1):49-54, 2012.
  11. Pranjal Awasthi and Or Sheffet. Improved spectral-norm bounds for clustering. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 37-49. Springer, 2012.
  12. Maria-Florina Balcan, Avrim Blum, and Anupam Gupta. Clustering under approximation stability. Journal of the ACM (JACM), 60(2):8, 2013.
  13. Maria-Florina Balcan, Avrim Blum, and Santosh Vempala. A discriminative framework for clustering via similarity functions. In Proceedings of the Annual Symposium on Theory of Computing (STOC), pages 671-680, 2008.
  14. Maria-Florina Balcan and Mark Braverman. Finding Low Error Clusterings. In COLT, volume 3, pages 3-4, 2009.
  15. Maria-Florina Balcan, Nika Haghtalab, and Colin White. k-center Clustering under Perturbation Resilience. In Proceedings of the Annual International Colloquium on Automata, Languages, and Programming (ICALP), 2016.
  16. Maria Florina Balcan and Yingyu Liang. Clustering under perturbation resilience. SIAM Journal on Computing, 45(1):102-155, 2016.
  17. Maria-Florina Balcan, Heiko Röglin, and Shang-Hua Teng. Agnostic Clustering. In ALT, pages 384-398. Springer, 2009.
  18. Maria-Florina F Balcan, Steven Ehrlich, and Yingyu Liang. Distributed k-means and k-median Clustering on General Topologies. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), pages 1995-2003, 2013.
  19. Ziv Bar-Yossef, Thathachar S Jayram, Ravi Kumar, and D Sivakumar. An information statistics approach to data stream and communication complexity. Journal of Computer and System Sciences, 68(4):702-732, 2004.
  20. Mohammadhossein Bateni, Aditya Bhaskara, Silvio Lattanzi, and Vahab Mirrokni. Distributed Balanced Clustering via Mapping Coresets. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), pages 2591-2599, 2014.
  21. Yonatan Bilu and Nathan Linial. Are stable instances easy? Combinatorics, Probability and Computing, 21(05):643-660, 2012.
  22. Mark Braverman, Faith Ellen, Rotem Oshman, Toniann Pitassi, and Vinod Vaikuntanathan. A tight bound for set disjointness in the message-passing model. In Proceedings of the Annual Symposium on Foundations of Computer Science (FOCS), pages 668-677. IEEE, 2013.
  23. Jarosław Byrka, Thomas Pensyl, Bartosz Rybicki, Aravind Srinivasan, and Khoa Trinh. An improved approximation for k-median, and positive correlation in budgeted optimization. In Proceedings of the Annual Symposium on Discrete Algorithms (SODA), pages 737-756. SIAM, 2015.
  24. Moses Charikar, Sudipto Guha, Éva Tardos, and David B Shmoys. A constant-factor approximation algorithm for the k-median problem. In Proceedings of the Annual Symposium on Theory of Computing (STOC), pages 1-10. ACM, 1999.
  25. Moses Charikar, Samir Khuller, David M Mount, and Giri Narasimhan. Algorithms for facility location problems with outliers. In Proceedings of the Annual Symposium on Discrete Algorithms (SODA), pages 642-651. Society for Industrial and Applied Mathematics, 2001.
  26. Chandra Chekuri and Shalmoli Gupta. Perturbation Resilient Clustering for k-Center and Related Problems via LP Relaxations. In Proceedings of the International Workshop on Approximation, Randomization, and Combinatorial Optimization Algorithms and Techniques (APPROX-RANDOM), 2018.
  27. Jiecao Chen, He Sun, David Woodruff, and Qin Zhang. Communication-Optimal Distributed Clustering. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), pages 3720-3728, 2016.
  28. Ke Chen. A constant factor approximation algorithm for k-median clustering with outliers. In Proceedings of the Annual Symposium on Discrete Algorithms (SODA), volume 8, pages 826-835, 2008.
  29. Michael B Cohen, Sam Elder, Cameron Musco, Christopher Musco, and Madalina Persu. Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 163-172. ACM, 2015.
  30. Travis Dick, Mu Li, Venkata Krishna Pillutla, Colin White, Maria Florina Balcan, and Alex Smola. Data Driven Resource Allocation for Distributed Learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
  31. Teofilo F Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293-306, 1985.
  32. Sudipto Guha, Yi Li, and Qin Zhang. Distributed Partial Clustering. In Proceedings of the Symposium on Parallelism in Algorithms and Architectures (SPAA), 2017.
  33. Rishi Gupta, Tim Roughgarden, and C Seshadhri. Decompositions of triangle-dense graphs. In Proceedings of the 5th conference on Innovations in theoretical computer science, pages 471-482. ACM, 2014.
  34. Lingxiao Huang, Shaofeng Jiang, Jian Li, and Xuan Wu. Epsilon-coresets for clustering (with outliers) in doubling metrics. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 814-825. IEEE, 2018.
  35. Zengfeng Huang, Bozidar Radunovic, Milan Vojnovic, and Qin Zhang. Communication complexity of approximate matching in distributed graphs. In LIPIcs-Leibniz International Proceedings in Informatics, volume 30. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2015.
  36. Tapas Kanungo, David M Mount, Nathan S Netanyahu, Christine D Piatko, Ruth Silverman, and Angela Y Wu. A local search approximation algorithm for k-means clustering. In Proceedings of the eighteenth annual symposium on Computational geometry, pages 10-18. ACM, 2002.
  37. Ari Kobren, Nicholas Monath, Akshay Krishnamurthy, and Andrew McCallum. A Hierarchical Algorithm for Extreme Clustering. In Proceedings of the KDD International Conference on Knowledge Discovery and Data Mining, pages 255-264. ACM, 2017.
  38. Ravishankar Krishnaswamy, Shi Li, and Sai Sandeep. Constant Approximation for k-Median and k-Means with Outliers via Iterative Rounding. CoRR, abs/1711.01323, 2017. URL: http://arxiv.org/abs/1711.01323.
  39. Amit Kumar and Ravindran Kannan. Clustering with spectral norm and the k-means algorithm. In Proceedings of the Annual Symposium on Foundations of Computer Science (FOCS), pages 299-308. IEEE, 2010.
  40. Shi Li and Xiangyu Guo. Distributed k-Clustering for Data with Heavy Noise. In Advances in Neural Information Processing Systems, pages 7849-7857, 2018.
  41. Konstantin Makarychev, Yury Makarychev, Maxim Sviridenko, and Justin Ward. A bi-criteria approximation algorithm for k-Means. In APPROX, 2016.
  42. Gustavo Malkomes, Matt J Kusner, Wenlin Chen, Kilian Q Weinberger, and Benjamin Moseley. Fast Distributed k-Center Clustering with Outliers on Massive Data. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), pages 1063-1071, 2015.
  43. Rafail Ostrovsky, Yuval Rabani, Leonard J Schulman, and Chaitanya Swamy. The effectiveness of lloyd-type methods for the k-means problem. Journal of the ACM (JACM), 59(6):28, 2012.
  44. Kim D Pruitt, Tatiana Tatusova, Garth R Brown, and Donna R Maglott. NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy. Nucleic acids research, 2011.
  45. Alexander A. Razborov. On the distributional complexity of disjointness. Theoretical Computer Science, 106(2):385-390, 1992.
  46. Konstantin Voevodski, Maria-Florina Balcan, Heiko Röglin, Shang-Hua Teng, and Yu Xia. Min-sum clustering of protein sequences with limited distance information. In International Workshop on Similarity-Based Pattern Recognition, pages 192-206. Springer, 2011.