Relational Algorithms for k-Means Clustering

Authors Benjamin Moseley, Kirk Pruhs, Alireza Samadian, Yuyan Wang



PDF
Thumbnail PDF

File

LIPIcs.ICALP.2021.97.pdf
  • Filesize: 0.77 MB
  • 21 pages

Document Identifiers

Author Details

Benjamin Moseley
  • Carnegie Mellon University, Pittsburgh, PA, USA
Kirk Pruhs
  • University of Pittsburgh, PA, USA
Alireza Samadian
  • University of Pittsburgh, PA, USA
Yuyan Wang
  • Carnegie Mellon University, Pittsburgh, PA, USA

Cite AsGet BibTex

Benjamin Moseley, Kirk Pruhs, Alireza Samadian, and Yuyan Wang. Relational Algorithms for k-Means Clustering. In 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 198, pp. 97:1-97:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/LIPIcs.ICALP.2021.97

Abstract

This paper gives a k-means approximation algorithm that is efficient in the relational algorithms model. This is an algorithm that operates directly on a relational database without performing a join to convert it to a matrix whose rows represent the data points. The running time is potentially exponentially smaller than N, the number of data points to be clustered that the relational database represents. Few relational algorithms are known and this paper offers techniques for designing relational algorithms as well as characterizing their limitations. We show that given two data points as cluster centers, if we cluster points according to their closest centers, it is NP-Hard to approximate the number of points in the clusters on a general relational input. This is trivial for conventional data inputs and this result exemplifies that standard algorithmic techniques may not be directly applied when designing an efficient relational algorithm. This paper then introduces a new method that leverages rejection sampling and the k-means++ algorithm to construct a O(1)-approximate k-means solution.

Subject Classification

ACM Subject Classification
  • Theory of computation → Design and analysis of algorithms
Keywords
  • k-means
  • clustering
  • approximation
  • big-data
  • databases

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. URL: https://cloud.google.com/bigquery-ml/docs/bigqueryml-intro.
  2. Kaggle machine learning and data science survey. https://www.kaggle.com/kaggle/kaggle-survey-2018, 2018.
  3. Mahmoud Abo-Khamis, Sungjin Im, Benjamin Moseley, Kirk Pruhs, and Alireza Samadian. Approximate aggregate queries under additive inequalities. In Symposium on Algorithmic Principles of Computer Systems (APOCS), pages 85-99. SIAM, 2021. Google Scholar
  4. Mahmoud Abo-Khamis, Sungjin Im, Benjamin Moseley, Kirk Pruhs, and Alireza Samadian. A relational gradient descent algorithm for support vector machine training. In Symposium on Algorithmic Principles of Computer Systems (APOCS), pages 100-113. SIAM, 2021. Google Scholar
  5. Mahmoud Abo Khamis, Hung Q Ngo, XuanLong Nguyen, Dan Olteanu, and Maximilian Schleich. Ac/dc: in-database learning thunderstruck. In Second Workshop on Data Management for End-To-End Machine Learning, page 8. ACM, 2018. Google Scholar
  6. Mahmoud Abo Khamis, Hung Q. Ngo, XuanLong Nguyen, Dan Olteanu, and Maximilian Schleich. In-database learning with sparse tensors. In ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 325-340, 2018. Google Scholar
  7. Mahmoud Abo Khamis, Hung Q. Ngo, and Atri Rudra. Faq: Questions asked frequently. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS ’16, page 13–28, New York, NY, USA, 2016. Association for Computing Machinery. URL: https://doi.org/10.1145/2902251.2902280.
  8. Ankit Aggarwal, Amit Deshpande, and Ravi Kannan. Adaptive sampling for k-means clustering. In International Conference on Approximation Algorithms for Combinatorial Optimization Problems, pages 15-28. Springer, 2009. Google Scholar
  9. David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In ACM-SIAM Symposium on Discrete Algorithms, pages 1027-1035, 2007. Google Scholar
  10. Albert Atserias, Martin Grohe, and Dániel Marx. Size bounds and query plans for relational joins. In IEEE Symposium on Foundations of Computer Science, pages 739-748, 2008. Google Scholar
  11. Bahman Bahmani, Benjamin Moseley, Andrea Vattani, Ravi Kumar, and Sergei Vassilvitskii. Scalable k-means++. Proceedings of the VLDB Endowment, 5(7):622-633, 2012. Google Scholar
  12. Vladimir Braverman, Gereon Frahling, Harry Lang, Christian Sohler, and Lin F. Yang. Clustering high dimensional dynamic data streams. In International Conference on Machine Learning, pages 576-585, 2017. Google Scholar
  13. George Casella, Christian P Robert, Martin T Wells, et al. Generalized accept-reject sampling schemes. In A Festschrift for Herman Rubin, pages 342-347. Institute of Mathematical Statistics, 2004. Google Scholar
  14. Zhaoyue Cheng and Nick Koudas. Nonlinear models over normalized data. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pages 1574-1577. IEEE, 2019. Google Scholar
  15. Ryan R. Curtin, Benjamin Moseley, Hung Q. Ngo, XuanLong Nguyen, Dan Olteanu, and Maximilian Schleich. Rk-means: Fast clustering for relational data. In Silvia Chiappa and Roberto Calandra, editors, The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020, 26-28 August 2020, Online [Palermo, Sicily, Italy], volume 108 of Proceedings of Machine Learning Research, pages 2742-2752. PMLR, 2020. Google Scholar
  16. Alina Ene, Sungjin Im, and Benjamin Moseley. Fast clustering using mapreduce. In SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 681-689, 2011. Google Scholar
  17. Sudipto Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani, and Liadan O'Callaghan. Clustering data streams: Theory and practice. IEEE Transactions on Knowledge and Data Engineering, 15(3):515-528, 2003. Google Scholar
  18. Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. A local search approximation algorithm for k-means clustering. Computational Geometry, 28(2-3):89-112, 2004. Google Scholar
  19. Arun Kumar, Jeffrey Naughton, and Jignesh M. Patel. Learning generalized linear models over normalized data. In ACM SIGMOD International Conference on Management of Data, pages 1969-1984, 2015. Google Scholar
  20. Shi Li and Ola Svensson. Approximating k-median via pseudo-approximation. SIAM J. Comput., 45(2):530-547, 2016. URL: https://doi.org/10.1137/130938645.
  21. Adam Meyerson, Liadan O'Callaghan, and Serge A. Plotkin. A k-median algorithm with running time independent of data size. Machine Learning, 56(1-3):61-87, 2004. Google Scholar
  22. Benjamin Moseley, Kirk Pruhs, Alireza Samadian, and Yuyan Wang. Relational algorithms for k-means clustering, 2020. URL: http://arxiv.org/abs/2008.00358.
  23. Steffen Rendle. Scaling factorization machines to relational data. In Proceedings of the VLDB Endowment, volume 6(5), pages 337-348. VLDB Endowment, 2013. Google Scholar
  24. Maximilian Schleich, Dan Olteanu, and Radu Ciucanu. Learning linear regression models over factorized joins. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD '16, pages 3-18. ACM, 2016. Google Scholar
  25. Christian Sohler and David P. Woodruff. Strong coresets for k-median and subspace approximation: Goodbye dimension. In Symposium on Foundations of Computer Science, pages 802-813, 2018. Google Scholar
  26. Keyu Yang, Yunjun Gao, Lei Liang, Bin Yao, Shiting Wen, and Gang Chen. Towards factorized svm with gaussian kernels over normalized data. Google Scholar
  27. Clement Tak Yu and Meral Z Ozsoyoglu. An algorithm for tree-query membership of a distributed query. In Computer Software and The IEEE Computer Society’s Third International Applications Conference, pages 306-312. IEEE, 1979. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail