Improved Algorithms for Clustering with Outliers

Feng, Qilong; Zhang, Zhen; Huang, Ziyun; Xu, Jinhui; Wang, Jianxin

doi:10.4230/LIPIcs.ISAAC.2019.61

Abstract

Clustering is a fundamental problem in unsupervised learning. In many real-world applications, the to-be-clustered data often contains various types of noises and thus needs to be removed from the learning process. To address this issue, we consider in this paper two variants of such clustering problems, called k-median with m outliers and k-means with m outliers. Existing techniques for both problems either incur relatively large approximation ratios or can only efficiently deal with a small number of outliers. In this paper, we present improved solution to each of them for the case where k is a fixed number and m could be quite large. Particularly, we gave the first PTAS for the k-median problem with outliers in Euclidean space R^d for possibly high m and d. Our algorithm runs in O(nd((1/epsilon)(k+m))^(k/epsilon)^O(1)) time, which considerably improves the previous result (with running time O(nd(m+k)^O(m+k) + (1/epsilon)k log n)^O(1))) given by [Feldman and Schulman, SODA 2012]. For the k-means with outliers problem, we introduce a (6+epsilon)-approximation algorithm for general metric space with running time O(n(beta (1/epsilon)(k+m))^k) for some constant beta>1. Our algorithm first uses the k-means++ technique to sample O((1/epsilon)(k+m)) points from input and then select the k centers from them. Compared to the more involving existing techniques, our algorithms are much simpler, i.e., using only random sampling, and achieving better performance ratios.

Manu Agarwal, Ragesh Jaiswal, and Arindam Pal. k-means++ under Approximation Stability. Theoretical Computer Science, 588:37-51, 2015.
Ankit Aggarwal, Amit Deshpande, and Raivi Kannan. Adaptive sampling for k-means clustering. In Proc. 12nd Int. Workshop and 13rd Int. Workshop on Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 15-28, 2009.
Sara Ahmadian, Ashkan Norouzi-Fard, Ola Svensson, and Justin Ward. Better Guarantees for k-Means and Euclidean k-Median by Primal-Dual Algorithms. In Proc. 58th IEEE Symposium on Foundations of Computer Science, pages 61-72, 2017.
Nir Ailon, Ragesh Jaiswal, and Clairire Monteleoni. Streaming k-means approximation. In Proc. 23rd Annual Conference on Neural Information Processing Systems, pages 10-18, 2009.
Mohammad Al Hasan, Vineet Chaoji, Saeed Salem, and Mohammed J Zaki. Robust partitional clustering by outlier and density insensitive seeding. Pattern Recognition Letters, 30(11):994-1002, 2009.
Daniel Aloise, Amit Deshpande, Pierre Hansen, and Preyas Popat. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75(2):245-248, 2009.
David Arthur and Sergei Vassilvitskii. k-means++: the adavantage of careful seeding. In Proc. 18th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027-1035, 2007.
Vijay Arya, Naveen Garg, Rohit Khandekar, Adam Meyerson, Kamesh Munagala, and Vinayaka Pandit. Local search heuristic for k-median and facility location problems. In Proc. 33rd Annual ACM Symposium on Theory of Computing, pages 21-29, 2001.
Moses Charikar, Sudipto Guha, and David B. Shmoys. A constant-factor approximation algorithm for the k-median problem. In Proc. 31st Annual ACM Symposium on Theory of Computing, pages 1-10, 1999.
Moses Charikar, Samir Khuller, David M Mount, and Giri Narasimhan. Algorithms for facility location problems with outliers. In Proc. 20th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 642-651, 2001.
Ke Chen. On k-Median clustering in high dimensions. In Proc. 17th ACM-SIAM Symposium on Discrete Algorithm, pages 1177-1185, 2006.
Ke Chen. A constant factor approximation algorithm for k-median clustering with outliers. In Proc. 27th Annual ACM-SIAM Symposium on Discrete Algorithms, volume 8, pages 826-835, 2008.
Vincent Cohen-Addad, Philip N. Klein, and Claire Mathieu. Local Search Yields Approximation Schemes for k-Means and k-Median in Euclidean and Minor-Free Metrics. In Proc. 57th IEEE Annual Symposium on Foundations of Computer Science, pages 353-364, 2016.
Feldman Dan and Leonard J. Schulman. Data reduction for weighted and outlier-resistant clustering. In Proc. 31st Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1343-1354, 2012.
Dan Feldman, Morteza Monemizadeh, and Christian Sohler. A PTAS for k-means clustering based on weak coresets. In Proc. 23rd Annual Symposium on Computational Geometry, pages 11-18, 2007.
Zachary Friggstad, Kamyar Khodamoradi, Mohsen Rezapour, and Mohammad R Salavatipour. Approximation schemes for clustering with outliers. In Proc. 37th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 398-414, 2018.
Zachary Friggstad, Mohsen Rezapour, and Mohammad R. Salavatipour. Local Search Yields a PTAS for k-Means in Doubling Metrics. In Proc. 57th IEEE Annual Symposium on Foundations of Computer Science, pages 365-374, 2016.
Anupam Gupta and Kanat Tangwongsan. Simpler analyses of local search algorithms for facility location. arXiv, 2008. URL: http://arxiv.org/abs/0809.2554.
Shalmoli Gupta, Ravi Kumar, Kefu Lu, Benjamin Moseley, and Sergei Vassilvitskii. Local search methods for k-means with outliers. Proceedings of the VLDB Endowment, 10(7):757-768, 2017.
Huang ingxiao, Jiang Shaofeng, Li Jian, and Wu Xuan. ε-Coresets for Clustering (with Outliers) in Doubling Metrics. In Proc. 59th IEEE Annual Symposium on Foundations of Computer Science, pages 814-825, 2018.
Kamal Jain, Mohammad Mahdian, and Amin Saberi. A new greedy approach for facility location problems. In Proc. 34th Annual ACM Symposium on Theory of Computing, pages 731-740, 2002.
Ragesh Jaiswal and Nitin Garg. Analysis of k-means++ for separable data. In Proc. 15th Int. Workshop and 16th Int. Workshop on Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 591-602, 2012.
Byrka Jaroslaw, Pensyl Thomas, Rybicki Bartosz, Srinivasan Aravind, and Trinh Khoa. An Improved Approximation for k-Median and Positive Correlation in Budgeted Optimization. ACM Transactions on Algorithms, 13(2):23, 2017.
Ravishankar Krishnaswamy, Shi Li, and Sai Sandeep. Constant Approximation for k-Median and k-Means with Outliers via Iterative Rounding. In Proc. 50th Annual ACM Symposium on Theory of Computing, pages 646-659, 2018.
Amit Kumar, Yogish Sabharwal, and Sandeep Sen. Linear-time approximation schemes for clustering problems in any dimensions. J. ACM, 57(2):5:1-5:32, 2010.
Shi Li and Ola Svensson. Approximating k-median via pseudo-approximation. SIAM Journal on Computing, 45(2):530-547, 2012.
Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129-137, 1982.
Rajeev Motwani and Prabhakar Raghavan. Randomized algorithms. Cambridge University Press, 1995.
Rafail Ostrovsky, Yuval Rabani, Leonard J. Schulman, and Chaitanya Swamy. The effectiveness of Lloyd-type methods for the k-means problem. J. ACM, 59(6):28:1-28:22, 2013.
Dennis Wei. A constant-factor bi-criteria approximation guarantee for k-means++. In Proc. 30th Annual Conference on Neural Information Processing Systems, pages 604-612, 2016.

Improved Algorithms for Clustering with Outliers

Authors Qilong Feng, Zhen Zhang, Ziyun Huang, Jinhui Xu, Jianxin Wang

File

Document Identifiers

Author Details

Cite AsGet BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

References

Thanks for your feedback!

Could not send message

Improved Algorithms for Clustering with Outliers

Authors Qilong Feng, Zhen Zhang, Ziyun Huang, Jinhui Xu, Jianxin Wang

File

Document Identifiers

Author Details

Funding

Cite AsGet BibTex

Abstract

Subject Classification

ACM Subject Classification

Keywords

Metrics

References

Thanks for your feedback!

Could not send message