A Sub-Linear Time Framework for Geometric Optimization with Outliers in High Dimensions
Many real-world problems can be formulated as geometric optimization problems in high dimensions, especially in the fields of machine learning and data mining. Moreover, we often need to take into account of outliers when optimizing the objective functions. However, the presence of outliers could make the problems to be much more challenging than their vanilla versions. In this paper, we study the fundamental minimum enclosing ball (MEB) with outliers problem first; partly inspired by the core-set method from Bădoiu and Clarkson, we propose a sub-linear time bi-criteria approximation algorithm based on two novel techniques, the Uniform-Adaptive Sampling method and Sandwich Lemma. To the best of our knowledge, our result is the first sub-linear time algorithm, which has the sample size (i.e., the number of sampled points) independent of both the number of input points n and dimensionality d, for MEB with outliers in high dimensions. Furthermore, we observe that these two techniques can be generalized to deal with a broader range of geometric optimization problems with outliers in high dimensions, including flat fitting, k-center clustering, and SVM with outliers, and therefore achieve the sub-linear time algorithms for these problems respectively.
minimum enclosing ball
outliers
shape fitting
high dimensions
sub-linear time
Theory of computation~Approximation algorithms analysis
38:1-38:21
Regular Paper
A full version of the paper is available at https://arxiv.org/abs/2004.10090.
The author wants to thank the anonymous reviewers for their helpful comments and suggestions for improving the paper.
Hu
Ding
Hu Ding
School of Computer Science and Technology, University of Science and Technology of China, Anhui, China
http://staff.ustc.edu.cn/~huding/index.html
https://orcid.org/0000-0002-1307-6077
10.4230/LIPIcs.ESA.2020.38
Pankaj K. Agarwal, Esther Ezra, and Kyle Fox. Geometric optimization revisited. In Bernhard Steffen and Gerhard J. Woeginger, editors, Computing and Software Science - State of the Art and Perspectives, volume 10000 of Lecture Notes in Computer Science, pages 66-84. Springer, 2019.
Pankaj K. Agarwal, Sariel Har-Peled, and Kasturi R. Varadarajan. Geometric approximation via coresets. Combinatorial and Computational Geometry, 52:1-30, 2005.
Pankaj K. Agarwal, Sariel Har-Peled, and Hai Yu. Robust shape fitting via peeling and grating coresets. Discrete & Computational Geometry, 39(1-3):38-58, 2008.
Pankaj K. Agarwal and R. Sharathkumar. Streaming algorithms for extent problems in high dimensions. Algorithmica, 72(1):83-98, 2015.
Alok Aggarwal, Hiroshi Imai, Naoki Katoh, and Subhash Suri. Finding k points with minimum diameter and related problems. Journal of algorithms, 12(1):38-56, 1991.
Zeyuan Allen Zhu, Zhenyu Liao, and Yang Yuan. Optimization algorithms for faster computational geometry. In 43rd International Colloquium on Automata, Languages, and Programming, ICALP 2016, July 11-15, 2016, Rome, Italy, pages 53:1-53:6, 2016.
Noga Alon, Seannie Dar, Michal Parnas, and Dana Ron. Testing of clustering. SIAM Journal on Discrete Mathematics, 16(3):393-417, 2003.
Battista Biggio, Blaine Nelson, and Pavel Laskov. Poisoning attacks against support vector machines. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012, 2012.
Battista Biggio and Fabio Roli. Wild patterns: Ten years after the rise of adversarial machine learning. Pattern Recognition, 84:317-331, 2018.
Manuel Blum, Robert W. Floyd, Vaughan Pratt, Ronald L. Rivest, and Robert E. Tarjan. Time bounds for selection. Journal of Computer and System Sciences, 7(4):448-461, 1973.
Mihai Bădoiu and Kenneth L. Clarkson. Smaller core-sets for balls. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 801-802, 2003.
Mihai Bădoiu and Kenneth L. Clarkson. Optimal core-sets for balls. Computational Geometry, 40(1):14-22, 2008.
Mihai Bădoiu, Sariel Har-Peled, and Piotr Indyk. Approximate clustering via core-sets. In Proceedings of the ACM Symposium on Theory of Computing (STOC), pages 250-257, 2002.
Matteo Ceccarello, Andrea Pietracaprina, and Geppino Pucci. Solving k-center clustering (with outliers) in mapreduce and streaming, almost as accurately as sequentially. PVLDB, 12(7):766-778, 2019.
Deeparnab Chakrabarty, Prachi Goyal, and Ravishankar Krishnaswamy. The non-uniform k-center problem. In 43rd International Colloquium on Automata, Languages, and Programming, ICALP 2016, July 11-15, 2016, Rome, Italy, pages 67:1-67:15, 2016.
Timothy M. Chan and Vinayak Pathak. Streaming and dynamic algorithms for minimum enclosing balls in high dimensions. Comput. Geom., 47(2):240-247, 2014.
Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM TIST, 2(3), 2011.
Moses Charikar, Samir Khuller, David M Mount, and Giri Narasimhan. Algorithms for facility location problems with outliers. In Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms, pages 642-651. Society for Industrial and Applied Mathematics, 2001.
Moses Charikar, Liadan O'Callaghan, and Rina Panigrahy. Better streaming algorithms for clustering problems. In Proceedings of the thirty-fifth annual ACM symposium on Theory of computing, pages 30-39. ACM, 2003.
Kenneth L. Clarkson. Coresets, sparse greedy approximation, and the frank-wolfe algorithm. ACM Transactions on Algorithms, 6(4):63, 2010.
Kenneth L. Clarkson, Elad Hazan, and David P. Woodruff. Sublinear optimization for machine learning. J. ACM, 59(5):23:1-23:49, 2012.
Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20:273, 1995.
David J. Crisp and Christopher J. C. Burges. A geometric interpretation of v-SVM classifiers. In Sara A. Solla, Todd K. Leen, and Klaus-Robert Müller, editors, NIPS, pages 244-250. The MIT Press, 1999.
Artur Czumaj and Christian Sohler. Sublinear-time approximation for clustering via random sampling. In International Colloquium on Automata, Languages, and Programming, pages 396-407. Springer, 2004.
Artur Czumaj and Christian Sohler. Sublinear-time algorithms. In Property Testing - Current Research and Surveys, pages 41-64, 2010.
Hu Ding. Minimum enclosing ball revisited: Stability and sub-linear time algorithms. CoRR, abs/1904.03796, 2019.
Hu Ding. A sub-linear time framework for geometric optimization with outliers in high dimensions. CoRR, abs/2004.10090, 2020. URL: http://arxiv.org/abs/2004.10090.
http://arxiv.org/abs/2004.10090
Hu Ding and Jinhui Xu. Random gradient descent tree: A combinatorial approach for svm with outliers. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pages 2561-2567, 2015.
Hu Ding, Haikuo Yu, and Zixiu Wang. Greedy strategy works for k-center clustering with outliers and coreset construction. In 27th Annual European Symposium on Algorithms, ESA 2019, September 9-11, 2019, Munich/Garching, Germany., pages 40:1-40:16, 2019.
Alon Efrat, Micha Sharir, and Alon Ziv. Computing the smallest k-enclosing circle and related problems. Computational Geometry, 4(3):119-136, 1994.
Dan Feldman, Chongyuan Xiang, Ruihao Zhu, and Daniela Rus. Coresets for differentially private k-means clustering and applications to privacy in mobile sensor networks. In Proceedings of the 16th ACM/IEEE International Conference on Information Processing in Sensor Networks, IPSN 2017, Pittsburgh, PA, USA, April 18-21, 2017, pages 3-15, 2017.
Kaspar Fischer, Bernd Gärtner, and Martin Kutz. Fast smallest-enclosing-ball computation in high dimensions. In Algorithms - ESA 2003, 11th Annual European Symposium, Budapest, Hungary, September 16-19, 2003, Proceedings, pages 630-641, 2003.
Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95-110, 1956.
Oded Goldreich, Shafi Goldwasser, and Dana Ron. Property testing and its connection to learning and approximation. J. ACM, 45(4):653-750, 1998.
Teofilo F Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:293-306, 1985.
Ian J. Goodfellow, Patrick D. McDaniel, and Nicolas Papernot. Making machine learning robust against adversarial inputs. Commun. ACM, 61(7):56-66, 2018.
Sudipto Guha, Yi Li, and Qin Zhang. Distributed partial clustering. In Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, pages 143-152. ACM, 2017.
Laszlo Gyongyosi and Sandor Imre. Geometrical analysis of physically allowed quantum cloning transformations for quantum cryptography. Information Sciences, 285:1-23, 2014.
Sariel Har-Peled and Soham Mazumdar. Fast algorithms for computing the smallest k-enclosing circle. Algorithmica, 41(3):147-157, 2005.
Sariel Har-Peled, Dan Roth, and Dav Zimak. Maximum margin coresets for active and noise tolerant learning. In IJCAI 2007, Proceedings of the 20th International Joint Conference on Artificial Intelligence, Hyderabad, India, January 6-12, 2007, pages 836-841, 2007.
Sariel Har-Peled and Kasturi R. Varadarajan. Approximate shape fitting via linearization. In 42nd Annual Symposium on Foundations of Computer Science, FOCS 2001, 14-17 October 2001, Las Vegas, Nevada, USA, pages 66-73, 2001.
Sariel Har-Peled and Kasturi R. Varadarajan. High-dimensional shape fitting in linear time. Discret. Comput. Geom., 32(2):269-288, 2004.
Sariel Har-Peled and Yusu Wang. Shape fitting with outliers. SIAM Journal on Computing, 33(2):269-285, 2004.
Elad Hazan, Tomer Koren, and Nati Srebro. Beating SGD: learning svms in sublinear time. In John Shawe-Taylor, Richard S. Zemel, Peter L. Bartlett, Fernando C. N. Pereira, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011. Proceedings of a meeting held 12-14 December 2011, Granada, Spain, pages 1233-1241, 2011.
Ling Huang, Anthony D Joseph, Blaine Nelson, Benjamin IP Rubinstein, and J Doug Tygar. Adversarial machine learning. In Proceedings of the 4th ACM workshop on Security and artificial intelligence, pages 43-58, 2011.
Lingxiao Huang, Shaofeng Jiang, Jian Li, and Xuan Wu. Epsilon-coresets for clustering (with outliers) in doubling metrics. In 59th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2018, Paris, France, October 7-9, 2018, pages 814-825, 2018.
Piotr Indyk. Sublinear time algorithms for metric space problems. In Proceedings of the Thirty-First Annual ACM Symposium on Theory of Computing, May 1-4, 1999, Atlanta, Georgia, USA, pages 428-434, 1999.
Matthew Jagielski, Alina Oprea, Battista Biggio, Chang Liu, Cristina Nita-Rotaru, and Bo Li. Manipulating machine learning: Poisoning attacks and countermeasures for regression learning. In 2018 IEEE Symposium on Security and Privacy, SP 2018, Proceedings, 21-23 May 2018, San Francisco, California, USA, pages 19-35, 2018.
Michael Kerber and R. Sharathkumar. Approximate čech complex in low and high dimensions. In Algorithms and Computation - 24th International Symposium, ISAAC 2013, Hong Kong, China, December 16-18, 2013, Proceedings, pages 666-676, 2013.
Piyush Kumar, Joseph S. B. Mitchell, and E. Alper Yildirim. Approximate minimum enclosing balls in high dimensions using core-sets. ACM Journal of Experimental Algorithmics, 8, 2003.
Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale. arXiv preprint, 2016. URL: http://arxiv.org/abs/1611.01236.
http://arxiv.org/abs/1611.01236
Shi Li and Xiangyu Guo. Distributed k-clustering for data with heavy noise. In Advances in Neural Information Processing Systems, pages 7849-7857, 2018.
Gustavo Malkomes, Matt J Kusner, Wenlin Chen, Kilian Q Weinberger, and Benjamin Moseley. Fast distributed k-center clustering with outliers on massive data. In Advances in Neural Information Processing Systems, pages 1063-1071, 2015.
Jiří Matoušek. On enclosing k points by a circle. Information Processing Letters, 53(4):217-221, 1995.
Richard Matthew McCutchen and Samir Khuller. Streaming algorithms for k-center clustering with outliers and with anonymity. In Approximation, Randomization and Combinatorial Optimization. Algorithms and Techniques, pages 165-178. Springer, 2008.
Adam Meyerson, Liadan O'Callaghan, and Serge A. Plotkin. A k-median algorithm with running time independent of data size. Machine Learning, 56(1-3):61-87, 2004.
Nina Mishra, Dan Oblinger, and Leonard Pitt. Sublinear time approximate clustering. In Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms, pages 439-447. Society for Industrial and Applied Mathematics, 2001.
David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. On the least trimmed squares estimator. Algorithmica, 69(1):148-183, 2014.
Kobbi Nissim, Uri Stemmer, and Salil P. Vadhan. Locating a small cluster privately. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 413-427, 2016.
Rina Panigrahy. Minimum enclosing polytope in high dimensions. arXiv preprint, 2004. URL: http://arxiv.org/abs/cs/0407020.
http://arxiv.org/abs/cs/0407020
Jeff M. Phillips. Coresets and sketches. Computing Research Repository, 2016.
J. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 185-208, Cambridge, MA, 1999. MIT Press.
Ronitt Rubinfeld. Sublinear time algorithms. In Survey. Citeseer, 2006.
Ankan Saha, S. V. N. Vishwanathan, and Xinhua Zhang. New approximation algorithms for minimum enclosing convex shapes. In Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2011, San Francisco, California, USA, January 23-25, 2011, pages 1146-1160, 2011.
B. Scholkopf, A. J. Smola, K. R. Muller, and P. L. Bartlett. New support vector algorithms. Neural Computation, 12:1207-1245, 2000.
Shinya Suzumura, Kohei Ogawa, Masashi Sugiyama, and Ichiro Takeuchi. Outlier path: A homotopy algorithm for robust svm. In Tony Jebara and Eric P. Xing, editors, Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1098-1106, 2014.
Pang-Ning Tan, Michael S. Steinbach, and Vipin Kumar. Introduction to Data Mining. Addison-Wesley, 2005.
Ivor W. Tsang, James T. Kwok, and Pak-Ming Cheung. Core vector machines: Fast SVM training on very large data sets. Journal of Machine Learning Research, 6:363-392, 2005. URL: http://jmlr.org/papers/v6/tsang05a.html.
http://jmlr.org/papers/v6/tsang05a.html
Vladimir N Vapnik and A Ya Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. In Measures of complexity, pages 11-30. Springer, 2015.
Linli Xu, Koby Crammer, and Dale Schuurmans. Robust support vector machine training via convex outlier ablation. In AAAI, pages 536-542. AAAI Press, 2006.
Hai Yu, Pankaj K. Agarwal, Raghunath Poreddy, and Kasturi R. Varadarajan. Practical methods for shape fitting and kinetic data structures using coresets. Algorithmica, 52(3):378-402, 2008.
Hamid Zarrabi-Zadeh and Asish Mukhopadhyay. Streaming 1-center with outliers in high dimensions. In Proceedings of the Canadian Conference on Computational Geometry (CCCG), pages 83-86, 2009.
Hu Ding
Creative Commons Attribution 3.0 Unported license
https://creativecommons.org/licenses/by/3.0/legalcode