Improved Algorithms for Clustering with Outliers

Authors Qilong Feng, Zhen Zhang, Ziyun Huang, Jinhui Xu, Jianxin Wang



PDF
Thumbnail PDF

File

LIPIcs.ISAAC.2019.61.pdf
  • Filesize: 0.53 MB
  • 12 pages

Document Identifiers

Author Details

Qilong Feng
  • School of Computer Science and Engineering, Central South University, P.R. China
Zhen Zhang
  • School of Computer Science and Engineering, Central South University, P.R. China
Ziyun Huang
  • Department of Computer Science and Software Engineering, Penn State Erie, The Behrend College, Erie, PA, USA
Jinhui Xu
  • Department of Computer Science and Engineering, State University of New York at Buffalo, USA
Jianxin Wang
  • School of Computer Science and Engineering, Central South University, P.R. China

Cite AsGet BibTex

Qilong Feng, Zhen Zhang, Ziyun Huang, Jinhui Xu, and Jianxin Wang. Improved Algorithms for Clustering with Outliers. In 30th International Symposium on Algorithms and Computation (ISAAC 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 149, pp. 61:1-61:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)
https://doi.org/10.4230/LIPIcs.ISAAC.2019.61

Abstract

Clustering is a fundamental problem in unsupervised learning. In many real-world applications, the to-be-clustered data often contains various types of noises and thus needs to be removed from the learning process. To address this issue, we consider in this paper two variants of such clustering problems, called k-median with m outliers and k-means with m outliers. Existing techniques for both problems either incur relatively large approximation ratios or can only efficiently deal with a small number of outliers. In this paper, we present improved solution to each of them for the case where k is a fixed number and m could be quite large. Particularly, we gave the first PTAS for the k-median problem with outliers in Euclidean space R^d for possibly high m and d. Our algorithm runs in O(nd((1/epsilon)(k+m))^(k/epsilon)^O(1)) time, which considerably improves the previous result (with running time O(nd(m+k)^O(m+k) + (1/epsilon)k log n)^O(1))) given by [Feldman and Schulman, SODA 2012]. For the k-means with outliers problem, we introduce a (6+epsilon)-approximation algorithm for general metric space with running time O(n(beta (1/epsilon)(k+m))^k) for some constant beta>1. Our algorithm first uses the k-means++ technique to sample O((1/epsilon)(k+m)) points from input and then select the k centers from them. Compared to the more involving existing techniques, our algorithms are much simpler, i.e., using only random sampling, and achieving better performance ratios.

Subject Classification

ACM Subject Classification
  • Theory of computation → Facility location and clustering
Keywords
  • Clustering with Outliers
  • Approximation
  • Random Sampling

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Manu Agarwal, Ragesh Jaiswal, and Arindam Pal. k-means++ under Approximation Stability. Theoretical Computer Science, 588:37-51, 2015. Google Scholar
  2. Ankit Aggarwal, Amit Deshpande, and Raivi Kannan. Adaptive sampling for k-means clustering. In Proc. 12nd Int. Workshop and 13rd Int. Workshop on Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 15-28, 2009. Google Scholar
  3. Sara Ahmadian, Ashkan Norouzi-Fard, Ola Svensson, and Justin Ward. Better Guarantees for k-Means and Euclidean k-Median by Primal-Dual Algorithms. In Proc. 58th IEEE Symposium on Foundations of Computer Science, pages 61-72, 2017. Google Scholar
  4. Nir Ailon, Ragesh Jaiswal, and Clairire Monteleoni. Streaming k-means approximation. In Proc. 23rd Annual Conference on Neural Information Processing Systems, pages 10-18, 2009. Google Scholar
  5. Mohammad Al Hasan, Vineet Chaoji, Saeed Salem, and Mohammed J Zaki. Robust partitional clustering by outlier and density insensitive seeding. Pattern Recognition Letters, 30(11):994-1002, 2009. Google Scholar
  6. Daniel Aloise, Amit Deshpande, Pierre Hansen, and Preyas Popat. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75(2):245-248, 2009. Google Scholar
  7. David Arthur and Sergei Vassilvitskii. k-means++: the adavantage of careful seeding. In Proc. 18th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027-1035, 2007. Google Scholar
  8. Vijay Arya, Naveen Garg, Rohit Khandekar, Adam Meyerson, Kamesh Munagala, and Vinayaka Pandit. Local search heuristic for k-median and facility location problems. In Proc. 33rd Annual ACM Symposium on Theory of Computing, pages 21-29, 2001. Google Scholar
  9. Moses Charikar, Sudipto Guha, and David B. Shmoys. A constant-factor approximation algorithm for the k-median problem. In Proc. 31st Annual ACM Symposium on Theory of Computing, pages 1-10, 1999. Google Scholar
  10. Moses Charikar, Samir Khuller, David M Mount, and Giri Narasimhan. Algorithms for facility location problems with outliers. In Proc. 20th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 642-651, 2001. Google Scholar
  11. Ke Chen. On k-Median clustering in high dimensions. In Proc. 17th ACM-SIAM Symposium on Discrete Algorithm, pages 1177-1185, 2006. Google Scholar
  12. Ke Chen. A constant factor approximation algorithm for k-median clustering with outliers. In Proc. 27th Annual ACM-SIAM Symposium on Discrete Algorithms, volume 8, pages 826-835, 2008. Google Scholar
  13. Vincent Cohen-Addad, Philip N. Klein, and Claire Mathieu. Local Search Yields Approximation Schemes for k-Means and k-Median in Euclidean and Minor-Free Metrics. In Proc. 57th IEEE Annual Symposium on Foundations of Computer Science, pages 353-364, 2016. Google Scholar
  14. Feldman Dan and Leonard J. Schulman. Data reduction for weighted and outlier-resistant clustering. In Proc. 31st Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1343-1354, 2012. Google Scholar
  15. Dan Feldman, Morteza Monemizadeh, and Christian Sohler. A PTAS for k-means clustering based on weak coresets. In Proc. 23rd Annual Symposium on Computational Geometry, pages 11-18, 2007. Google Scholar
  16. Zachary Friggstad, Kamyar Khodamoradi, Mohsen Rezapour, and Mohammad R Salavatipour. Approximation schemes for clustering with outliers. In Proc. 37th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 398-414, 2018. Google Scholar
  17. Zachary Friggstad, Mohsen Rezapour, and Mohammad R. Salavatipour. Local Search Yields a PTAS for k-Means in Doubling Metrics. In Proc. 57th IEEE Annual Symposium on Foundations of Computer Science, pages 365-374, 2016. Google Scholar
  18. Anupam Gupta and Kanat Tangwongsan. Simpler analyses of local search algorithms for facility location. arXiv, 2008. URL: http://arxiv.org/abs/0809.2554.
  19. Shalmoli Gupta, Ravi Kumar, Kefu Lu, Benjamin Moseley, and Sergei Vassilvitskii. Local search methods for k-means with outliers. Proceedings of the VLDB Endowment, 10(7):757-768, 2017. Google Scholar
  20. Huang ingxiao, Jiang Shaofeng, Li Jian, and Wu Xuan. ε-Coresets for Clustering (with Outliers) in Doubling Metrics. In Proc. 59th IEEE Annual Symposium on Foundations of Computer Science, pages 814-825, 2018. Google Scholar
  21. Kamal Jain, Mohammad Mahdian, and Amin Saberi. A new greedy approach for facility location problems. In Proc. 34th Annual ACM Symposium on Theory of Computing, pages 731-740, 2002. Google Scholar
  22. Ragesh Jaiswal and Nitin Garg. Analysis of k-means++ for separable data. In Proc. 15th Int. Workshop and 16th Int. Workshop on Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 591-602, 2012. Google Scholar
  23. Byrka Jaroslaw, Pensyl Thomas, Rybicki Bartosz, Srinivasan Aravind, and Trinh Khoa. An Improved Approximation for k-Median and Positive Correlation in Budgeted Optimization. ACM Transactions on Algorithms, 13(2):23, 2017. Google Scholar
  24. Ravishankar Krishnaswamy, Shi Li, and Sai Sandeep. Constant Approximation for k-Median and k-Means with Outliers via Iterative Rounding. In Proc. 50th Annual ACM Symposium on Theory of Computing, pages 646-659, 2018. Google Scholar
  25. Amit Kumar, Yogish Sabharwal, and Sandeep Sen. Linear-time approximation schemes for clustering problems in any dimensions. J. ACM, 57(2):5:1-5:32, 2010. Google Scholar
  26. Shi Li and Ola Svensson. Approximating k-median via pseudo-approximation. SIAM Journal on Computing, 45(2):530-547, 2012. Google Scholar
  27. Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129-137, 1982. Google Scholar
  28. Rajeev Motwani and Prabhakar Raghavan. Randomized algorithms. Cambridge University Press, 1995. Google Scholar
  29. Rafail Ostrovsky, Yuval Rabani, Leonard J. Schulman, and Chaitanya Swamy. The effectiveness of Lloyd-type methods for the k-means problem. J. ACM, 59(6):28:1-28:22, 2013. Google Scholar
  30. Dennis Wei. A constant-factor bi-criteria approximation guarantee for k-means++. In Proc. 30th Annual Conference on Neural Information Processing Systems, pages 604-612, 2016. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail