Greedy Strategy Works for k-Center Clustering with Outliers and Coreset Construction

Ding, Hu; Yu, Haikuo; Wang, Zixiu

doi:10.4230/LIPIcs.ESA.2019.40

File

Subject Classification

ACM Subject Classification

Theory of computation → Approximation algorithms analysis

Keywords

k-center clustering
outliers
coreset
doubling metrics
random sampling

Metrics

Access Statistics
Total Accesses (updated on a weekly basis)

0

Document

0

Metadata

Abstract

We study the problem of k-center clustering with outliers in arbitrary metrics and Euclidean space. Though a number of methods have been developed in the past decades, it is still quite challenging to design quality guaranteed algorithm with low complexity for this problem. Our idea is inspired by the greedy method, Gonzalez’s algorithm, for solving the problem of ordinary k-center clustering. Based on some novel observations, we show that this greedy strategy actually can handle k-center clustering with outliers efficiently, in terms of clustering quality and time complexity. We further show that the greedy approach yields small coreset for the problem in doubling metrics, so as to reduce the time complexity significantly. Our algorithms are easy to implement in practice. We test our method on both synthetic and real datasets. The experimental results suggest that our algorithms can achieve near optimal solutions and yield lower running times comparing with existing methods.

Cite As Get BibTex

Hu Ding, Haikuo Yu, and Zixiu Wang. Greedy Strategy Works for k-Center Clustering with Outliers and Coreset Construction. In 27th Annual European Symposium on Algorithms (ESA 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 144, pp. 40:1-40:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019) https://doi.org/10.4230/LIPIcs.ESA.2019.40

Author Details

Hu Ding

School of Computer Science and Technology, University of Science and Technology of China, China

Haikuo Yu

School of Computer Science and Technology, University of Science and Technology of China, China

Zixiu Wang

School of Computer Science and Technology, University of Science and Technology of China, China

References

Ankit Aggarwal, Amit Deshpande, and Ravi Kannan. Adaptive sampling for k-means clustering. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 15-28. Springer, 2009.
Sepideh Aghamolaei and Mohammad Ghodsi. A Composable Coreset for k-Center in Doubling Metrics. In Proceedings of the 30th Canadian Conference on Computational Geometry, CCCG 2018, August 8-10, 2018, University of Manitoba, Winnipeg, Manitoba, Canada, pages 165-171, 2018.
Noga Alon, Seannie Dar, Michal Parnas, and Dana Ron. Testing of clustering. SIAM Journal on Discrete Mathematics, 16(3):393-417, 2003.
Noga Alon and Joel H Spencer. The probabilistic method. John Wiley & Sons, 2004.
Olivier Bachem, Mario Lucic, and Andreas Krause. Practical coreset constructions for machine learning. arXiv preprint arXiv:1703.06476, 2017. URL: http://arxiv.org/abs/1703.06476.
Mihai Badoiu and Kenneth L. Clarkson. Smaller core-sets for balls. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 801-802, 2003.
Mihai Badoiu, Sariel Har-Peled, and Piotr Indyk. Approximate clustering via core-sets. In Proceedings of the ACM Symposium on Theory of Computing (STOC), pages 250-257, 2002.
Mikhail Belkin. Problems of learning on manifolds. The University of Chicago, 2003.
Avrim Blum, John Hopcroft, and Ravindran Kannan. Foundations of data science. Vorabversion eines Lehrbuchs, 2016.
Manuel Blum, Robert W. Floyd, Vaughan Pratt, Ronald L. Rivest, and Robert E. Tarjan. Time bounds for selection. Journal of Computer and System Sciences, 7(4):448-461, 1973.
Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM (JACM), 36(4):929-965, 1989.
Matteo Ceccarello, Andrea Pietracaprina, and Geppino Pucci. Solving k-center Clustering (with Outliers) in MapReduce and Streaming, almost as Accurately as Sequentially. CoRR, abs/1802.09205, 2018.
Deeparnab Chakrabarty, Prachi Goyal, and Ravishankar Krishnaswamy. The Non-Uniform k-Center Problem. In 43rd International Colloquium on Automata, Languages, and Programming, ICALP 2016, July 11-15, 2016, Rome, Italy, pages 67:1-67:15, 2016.
Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):15, 2009.
Moses Charikar, Samir Khuller, David M Mount, and Giri Narasimhan. Algorithms for facility location problems with outliers. In Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms, pages 642-651. Society for Industrial and Applied Mathematics, 2001.
Moses Charikar, Liadan O'Callaghan, and Rina Panigrahy. Better streaming algorithms for clustering problems. In Proceedings of the thirty-fifth annual ACM symposium on Theory of computing, pages 30-39. ACM, 2003.
Amit Daniely, Nati Linial, and Michael E. Saks. Clustering is difficult only when it does not matter. CoRR, abs/1205.4891, 2012.
Teofilo F Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293-306, 1985.
Sudipto Guha, Yi Li, and Qin Zhang. Distributed Partial Clustering. In Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, pages 143-152. ACM, 2017.
Sariel Har-Peled and Manor Mendel. Fast construction of nets in low-dimensional metrics and their applications. SIAM Journal on Computing, 35(5):1148-1184, 2006.
Sariel Har-Peled and Micha Sharir. Relative (p, ε)-approximations in geometry. Discrete & Computational Geometry, 45(3):462-496, 2011.
Dorit S Hochbaum and David B Shmoys. A best possible heuristic for the k-center problem. Mathematics of operations research, 10(2):180-184, 1985.
Lingxiao Huang, Shaofeng Jiang, Jian Li, and Xuan Wu. Epsilon-Coresets for Clustering (with Outliers) in Doubling Metrics. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pages 814-825. IEEE, 2018.
Piotr Indyk, Sepideh Mahabadi, Mohammad Mahdian, and Vahab S. Mirrokni. Composable core-sets for diversity and coverage maximization. In Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS'14, Snowbird, UT, USA, June 22-27, 2014, pages 100-108, 2014.
Anil K Jain. Data clustering: 50 years beyond K-means. Pattern recognition letters, 31(8):651-666, 2010.
Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. An Efficient k-Means Clustering Algorithm: Analysis and Implementation. IEEE Trans. Pattern Anal. Mach. Intell., 24(7):881-892, 2002.
Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
Fei-Fei Li, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. Computer vision and Image understanding, 106(1):59-70, 2007.
Shi Li and Xiangyu Guo. Distributed k-Clustering for Data with Heavy Noise. In Advances in Neural Information Processing Systems, pages 7849-7857, 2018.
Yi Li, Philip M Long, and Aravind Srinivasan. Improved bounds on the sample complexity of learning. Journal of Computer and System Sciences, 62(3):516-527, 2001.
Gustavo Malkomes, Matt J Kusner, Wenlin Chen, Kilian Q Weinberger, and Benjamin Moseley. Fast distributed k-center clustering with outliers on massive data. In Advances in Neural Information Processing Systems, pages 1063-1071, 2015.
Richard Matthew McCutchen and Samir Khuller. Streaming Algorithms for k-Center Clustering with Outliers and with Anonymity. In Approximation, Randomization and Combinatorial Optimization. Algorithms and Techniques, pages 165-178. Springer, 2008.
Jeff M. Phillips. Coresets and Sketches. Computing Research Repository, 2016.
Kunal Talwar. Bypassing the embedding: algorithms for low dimensional metrics. In Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, pages 281-290, 2004.

Greedy Strategy Works for k-Center Clustering with Outliers and Coreset Construction

Authors Hu Ding, Haikuo Yu, Zixiu Wang

File

Document Identifiers

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

Acknowledgements

References

Thanks for your feedback!

Could not send message

Greedy Strategy Works for k-Center Clustering with Outliers and Coreset Construction

Authors Hu Ding, Haikuo Yu, Zixiu Wang

File

Document Identifiers

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

Funding

Acknowledgements

References

Thanks for your feedback!

Could not send message