Faster Approximation Schemes for (Constrained) k-Means with Outliers

Authors: Zhen Zhang, Junyu Huang, Qilong Feng




File

LIPIcs.MFCS.2024.84.pdf
  • Filesize: 0.88 MB
  • 17 pages

Document Identifiers
  • DOI: 10.4230/LIPIcs.MFCS.2024.84

Author Details

Zhen Zhang
  • School of Advanced Interdisciplinary Studies, Hunan University of Technology and Business, Changsha, China
  • Xiangjiang Laboratory, Changsha, China
Junyu Huang
  • School of Computer Science and Engineering, Central South University, Changsha, China
Qilong Feng
  • School of Computer Science and Engineering, Central South University, Changsha, China

Cite As

Zhen Zhang, Junyu Huang, and Qilong Feng. Faster Approximation Schemes for (Constrained) k-Means with Outliers. In 49th International Symposium on Mathematical Foundations of Computer Science (MFCS 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 306, pp. 84:1-84:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024). https://doi.org/10.4230/LIPIcs.MFCS.2024.84

Abstract

Given a set of n points in ℝ^d and two positive integers k and m, the Euclidean k-means with outliers problem asks to remove at most m points, referred to as outliers, so as to minimize the k-means cost of the remaining points. Developing algorithms for this problem remains an active area of research due to its prevalence in applications involving noisy data. In this paper, we give a (1+ε)-approximation algorithm for the problem that runs in n²d((k+m)ε^{-1})^{O(kε^{-1})} time. When combined with a coreset construction method, the running time can be improved to be linear in n. For the case where k is a constant, this yields the first polynomial-time approximation scheme for the problem: existing algorithms with the same approximation guarantee run in polynomial time only when both k and m are constants. Furthermore, our approach generalizes to variants of k-means with outliers that impose additional constraints on instances, such as capacity and fairness constraints.
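
To make the objective concrete, the following sketch (not taken from the paper; the NumPy-based helper kmeans_outliers_cost and the toy data are illustrative assumptions) evaluates a candidate set of k centers under the k-means-with-outliers cost: each point is assigned to its nearest center, the m points with the largest squared distances are discarded as outliers, and the squared distances of the remaining points are summed.

import numpy as np

def kmeans_outliers_cost(points: np.ndarray, centers: np.ndarray, m: int) -> float:
    """Cost of a candidate solution for Euclidean k-means with at most m outliers.

    points  : (n, d) array of input points in R^d
    centers : (k, d) array of candidate centers
    m       : number of points that may be discarded as outliers
    """
    # Squared Euclidean distance from every point to every center, shape (n, k).
    diffs = points[:, None, :] - centers[None, :, :]
    sq_dists = np.einsum("nkd,nkd->nk", diffs, diffs)

    # Each point pays the squared distance to its nearest center.
    per_point_cost = sq_dists.min(axis=1)

    # Discarding the m most expensive points minimizes the cost of the rest.
    kept = np.sort(per_point_cost)[: max(points.shape[0] - m, 0)]
    return float(kept.sum())

# Toy usage: two tight clusters plus one far-away noise point that gets discarded.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cluster_a = rng.normal(loc=0.0, scale=0.1, size=(20, 2))
    cluster_b = rng.normal(loc=5.0, scale=0.1, size=(20, 2))
    noise = np.array([[100.0, 100.0]])
    data = np.vstack([cluster_a, cluster_b, noise])
    centers = np.array([[0.0, 0.0], [5.0, 5.0]])
    print(kmeans_outliers_cost(data, centers, m=1))

Under this evaluator, a (1+ε)-approximate solution is one whose cost is at most (1+ε) times the minimum achievable over all choices of k centers and m discarded points.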

Subject Classification

ACM Subject Classification
  • Theory of computation → Facility location and clustering
Keywords
  • Approximation algorithms
  • clustering
