Relational Algorithms for k-Means Clustering

Moseley, Benjamin; Pruhs, Kirk; Samadian, Alireza; Wang, Yuyan

doi:10.4230/LIPIcs.ICALP.2021.97

Abstract

This paper gives a k-means approximation algorithm that is efficient in the relational algorithms model. A relational algorithm operates directly on a relational database, without first performing a join to convert the database into a matrix whose rows represent the data points. The running time is potentially exponentially smaller than N, the number of data points to be clustered that the relational database represents.
Few relational algorithms are known, and this paper offers techniques for designing relational algorithms as well as for characterizing their limitations. We show that, given two data points as cluster centers, if we cluster points according to their closest centers, it is NP-hard to approximate the number of points in the clusters on a general relational input. This task is trivial for conventional data inputs, and the result exemplifies that standard algorithmic techniques may not be directly applicable when designing an efficient relational algorithm. This paper then introduces a new method that leverages rejection sampling and the k-means++ algorithm to construct an O(1)-approximate k-means solution.
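
To make the two ingredients named above concrete, the following is a minimal sketch of k-means++ seeding in which each new center is drawn by rejection sampling. It runs on a conventional, fully materialized list of points, so it is not the relational algorithm of the paper; it only illustrates the sampling idea. The function names (sq_dist, d2, sample_d2_by_rejection, kmeanspp_seed) and the choice of a uniform proposal distribution are assumptions made for this illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def d2(point, centers):
    """Squared distance from a point to its closest current center."""
    return min(sq_dist(point, c) for c in centers)


def sample_d2_by_rejection(points, centers, rng):
    """Draw one point with probability proportional to d2(point, centers).

    Assumption for this sketch: candidates are proposed uniformly at random
    and accepted with probability d2(candidate) / bound, where bound is the
    largest d2 value over all points.
    """
    bound = max(d2(p, centers) for p in points)
    if bound == 0:
        # Every point coincides with a center; any choice is equally good.
        return rng.choice(points)
    while True:
        candidate = rng.choice(points)
        # rng.random() is in [0, 1), so this accepts with probability d2 / bound.
        if rng.random() * bound < d2(candidate, centers):
            return candidate


def kmeanspp_seed(points, k, seed=0):
    """Pick k initial centers: the first uniformly, the rest by D^2 sampling."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        centers.append(sample_d2_by_rejection(points, centers, rng))
    return centers


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (21, 0)]
    print(kmeanspp_seed(pts, 3, seed=1))
```

A point that already coincides with a chosen center has squared distance zero and is therefore never accepted, so the returned centers are distinct whenever the input contains at least k distinct points. Computing the bound here scans every point, which is exactly the kind of step that is cheap on a materialized matrix but cannot be assumed cheap on a relational input.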

