Greedy Strategy Works for k-Center Clustering with Outliers and Coreset Construction

Authors Hu Ding, Haikuo Yu, Zixiu Wang



PDF
Thumbnail PDF

File

LIPIcs.ESA.2019.40.pdf
  • Filesize: 1.26 MB
  • 16 pages

Document Identifiers

Author Details

Hu Ding
  • School of Computer Science and Technology, University of Science and Technology of China, China
Haikuo Yu
  • School of Computer Science and Technology, University of Science and Technology of China, China
Zixiu Wang
  • School of Computer Science and Technology, University of Science and Technology of China, China

Acknowledgements

The authors want to thank the anonymous reviewers for their helpful comments and suggestions for improving the paper.

Cite AsGet BibTex

Hu Ding, Haikuo Yu, and Zixiu Wang. Greedy Strategy Works for k-Center Clustering with Outliers and Coreset Construction. In 27th Annual European Symposium on Algorithms (ESA 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 144, pp. 40:1-40:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)
https://doi.org/10.4230/LIPIcs.ESA.2019.40

Abstract

We study the problem of k-center clustering with outliers in arbitrary metrics and Euclidean space. Though a number of methods have been developed in the past decades, it is still quite challenging to design quality guaranteed algorithm with low complexity for this problem. Our idea is inspired by the greedy method, Gonzalez’s algorithm, for solving the problem of ordinary k-center clustering. Based on some novel observations, we show that this greedy strategy actually can handle k-center clustering with outliers efficiently, in terms of clustering quality and time complexity. We further show that the greedy approach yields small coreset for the problem in doubling metrics, so as to reduce the time complexity significantly. Our algorithms are easy to implement in practice. We test our method on both synthetic and real datasets. The experimental results suggest that our algorithms can achieve near optimal solutions and yield lower running times comparing with existing methods.

Subject Classification

ACM Subject Classification
  • Theory of computation → Approximation algorithms analysis
Keywords
  • k-center clustering
  • outliers
  • coreset
  • doubling metrics
  • random sampling

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Ankit Aggarwal, Amit Deshpande, and Ravi Kannan. Adaptive sampling for k-means clustering. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 15-28. Springer, 2009. Google Scholar
  2. Sepideh Aghamolaei and Mohammad Ghodsi. A Composable Coreset for k-Center in Doubling Metrics. In Proceedings of the 30th Canadian Conference on Computational Geometry, CCCG 2018, August 8-10, 2018, University of Manitoba, Winnipeg, Manitoba, Canada, pages 165-171, 2018. Google Scholar
  3. Noga Alon, Seannie Dar, Michal Parnas, and Dana Ron. Testing of clustering. SIAM Journal on Discrete Mathematics, 16(3):393-417, 2003. Google Scholar
  4. Noga Alon and Joel H Spencer. The probabilistic method. John Wiley & Sons, 2004. Google Scholar
  5. Olivier Bachem, Mario Lucic, and Andreas Krause. Practical coreset constructions for machine learning. arXiv preprint arXiv:1703.06476, 2017. URL: http://arxiv.org/abs/1703.06476.
  6. Mihai Badoiu and Kenneth L. Clarkson. Smaller core-sets for balls. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 801-802, 2003. Google Scholar
  7. Mihai Badoiu, Sariel Har-Peled, and Piotr Indyk. Approximate clustering via core-sets. In Proceedings of the ACM Symposium on Theory of Computing (STOC), pages 250-257, 2002. Google Scholar
  8. Mikhail Belkin. Problems of learning on manifolds. The University of Chicago, 2003. Google Scholar
  9. Avrim Blum, John Hopcroft, and Ravindran Kannan. Foundations of data science. Vorabversion eines Lehrbuchs, 2016. Google Scholar
  10. Manuel Blum, Robert W. Floyd, Vaughan Pratt, Ronald L. Rivest, and Robert E. Tarjan. Time bounds for selection. Journal of Computer and System Sciences, 7(4):448-461, 1973. Google Scholar
  11. Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM (JACM), 36(4):929-965, 1989. Google Scholar
  12. Matteo Ceccarello, Andrea Pietracaprina, and Geppino Pucci. Solving k-center Clustering (with Outliers) in MapReduce and Streaming, almost as Accurately as Sequentially. CoRR, abs/1802.09205, 2018. Google Scholar
  13. Deeparnab Chakrabarty, Prachi Goyal, and Ravishankar Krishnaswamy. The Non-Uniform k-Center Problem. In 43rd International Colloquium on Automata, Languages, and Programming, ICALP 2016, July 11-15, 2016, Rome, Italy, pages 67:1-67:15, 2016. Google Scholar
  14. Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):15, 2009. Google Scholar
  15. Moses Charikar, Samir Khuller, David M Mount, and Giri Narasimhan. Algorithms for facility location problems with outliers. In Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms, pages 642-651. Society for Industrial and Applied Mathematics, 2001. Google Scholar
  16. Moses Charikar, Liadan O'Callaghan, and Rina Panigrahy. Better streaming algorithms for clustering problems. In Proceedings of the thirty-fifth annual ACM symposium on Theory of computing, pages 30-39. ACM, 2003. Google Scholar
  17. Amit Daniely, Nati Linial, and Michael E. Saks. Clustering is difficult only when it does not matter. CoRR, abs/1205.4891, 2012. Google Scholar
  18. Teofilo F Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293-306, 1985. Google Scholar
  19. Sudipto Guha, Yi Li, and Qin Zhang. Distributed Partial Clustering. In Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, pages 143-152. ACM, 2017. Google Scholar
  20. Sariel Har-Peled and Manor Mendel. Fast construction of nets in low-dimensional metrics and their applications. SIAM Journal on Computing, 35(5):1148-1184, 2006. Google Scholar
  21. Sariel Har-Peled and Micha Sharir. Relative (p, ε)-approximations in geometry. Discrete & Computational Geometry, 45(3):462-496, 2011. Google Scholar
  22. Dorit S Hochbaum and David B Shmoys. A best possible heuristic for the k-center problem. Mathematics of operations research, 10(2):180-184, 1985. Google Scholar
  23. Lingxiao Huang, Shaofeng Jiang, Jian Li, and Xuan Wu. Epsilon-Coresets for Clustering (with Outliers) in Doubling Metrics. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 814-825. IEEE, 2018. Google Scholar
  24. Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, and Vahab S. Mirrokni. Composable core-sets for diversity and coverage maximization. In Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS'14, Snowbird, UT, USA, June 22-27, 2014, pages 100-108, 2014. Google Scholar
  25. Anil K Jain. Data clustering: 50 years beyond K-means. Pattern recognition letters, 31(8):651-666, 2010. Google Scholar
  26. Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. An Efficient k-Means Clustering Algorithm: Analysis and Implementation. IEEE Trans. Pattern Anal. Mach. Intell., 24(7):881-892, 2002. Google Scholar
  27. Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009. Google Scholar
  28. Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998. Google Scholar
  29. Fei-Fei Li, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. Computer vision and Image understanding, 106(1):59-70, 2007. Google Scholar
  30. Shi Li and Xiangyu Guo. Distributed k-Clustering for Data with Heavy Noise. In Advances in Neural Information Processing Systems, pages 7849-7857, 2018. Google Scholar
  31. Yi Li, Philip M Long, and Aravind Srinivasan. Improved bounds on the sample complexity of learning. Journal of Computer and System Sciences, 62(3):516-527, 2001. Google Scholar
  32. Gustavo Malkomes, Matt J Kusner, Wenlin Chen, Kilian Q Weinberger, and Benjamin Moseley. Fast distributed k-center clustering with outliers on massive data. In Advances in Neural Information Processing Systems, pages 1063-1071, 2015. Google Scholar
  33. Richard Matthew McCutchen and Samir Khuller. Streaming Algorithms for k-Center Clustering with Outliers and with Anonymity. In Approximation, Randomization and Combinatorial Optimization. Algorithms and Techniques, pages 165-178. Springer, 2008. Google Scholar
  34. Jeff M. Phillips. Coresets and Sketches. Computing Research Repository, 2016. Google Scholar
  35. Kunal Talwar. Bypassing the embedding: algorithms for low dimensional metrics. In Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, pages 281-290, 2004. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail