Faster Approximation Schemes for (Constrained) k-Means with Outliers
Given a set of n points in ℝ^d and two positive integers k and m, the Euclidean k-means with outliers problem aims to remove at most m points, referred to as outliers, and minimize the k-means cost function for the remaining points. Developing algorithms for this problem remains an active area of research due to its prevalence in applications involving noisy data. In this paper, we give a (1+ε)-approximation algorithm that runs in n²d((k+m)ε^{-1})^O(kε^{-1}) time for the problem. When combined with a coreset construction method, the running time of the algorithm can be improved to be linear in n. For the case where k is a constant, this represents the first polynomial-time approximation scheme for the problem: Existing algorithms with the same approximation guarantee run in polynomial time only when both k and m are constants. Furthermore, our approach generalizes to variants of k-means with outliers incorporating additional constraints on instances, such as those related to capacities and fairness.
Approximation algorithms
clustering
Theory of computation → Facility location and clustering
84:1-84:17
Regular Paper
This work was supported by the National Natural Science Foundation of China (62202161, 62172446), the Natural Science Foundation of Hunan Province (2023JJ40240), and the Scientific Research Fund of Hunan Provincial Education Department (23B0597).
Zhen Zhang
School of Advanced Interdisciplinary Studies, Hunan University of Technology and Business, Changsha, China
Xiangjiang Laboratory, Changsha, China
https://orcid.org/0000-0002-2974-5781
Junyu Huang
School of Computer Science and Engineering, Central South University, Changsha, China
https://orcid.org/0000-0002-1747-470X
Qilong Feng
School of Computer Science and Engineering, Central South University, Changsha, China
https://orcid.org/0000-0003-1657-7448
10.4230/LIPIcs.MFCS.2024.84
Creative Commons Attribution 4.0 International license
https://creativecommons.org/licenses/by/4.0/legalcode