Moderate Dimension Reduction for k-Center Clustering

Authors: Shaofeng H.-C. Jiang, Robert Krauthgamer, Shay Sapir





Author Details

Shaofeng H.-C. Jiang
  • Peking University, Beijing, China
Robert Krauthgamer
  • Weizmann Institute of Science, Rehovot, Israel
Shay Sapir
  • Weizmann Institute of Science, Rehovot, Israel

Cite As

Shaofeng H.-C. Jiang, Robert Krauthgamer, and Shay Sapir. Moderate Dimension Reduction for k-Center Clustering. In 40th International Symposium on Computational Geometry (SoCG 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 293, pp. 64:1-64:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.SoCG.2024.64

Abstract

The Johnson-Lindenstrauss (JL) Lemma introduced the concept of dimension reduction via a random linear map, which has become a fundamental technique in many computational settings. For a set of n points in ℝ^d and any fixed ε > 0, it reduces the dimension d to O(log n) while preserving, with high probability, all the pairwise Euclidean distances within factor 1+ε. Perhaps surprisingly, the target dimension can be lower if one only wishes to preserve the optimal value of a certain problem on the pointset, e.g., Euclidean max-cut or k-means. However, for some notorious problems, like diameter (aka furthest pair), dimension reduction via the JL map to below O(log n) does not preserve the optimal value within factor 1+ε. We propose to focus on another regime, of moderate dimension reduction, where a problem’s value is preserved within factor α > 1 using target dimension (log n)/poly(α). We establish the viability of this approach and show that the famous k-center problem is α-approximated when reducing to dimension O((log n)/α² + log k). Along the way, we address the diameter problem via the special case k = 1. Our result extends to several important variants of k-center (with outliers, capacities, or fairness constraints), and the bound improves further with the input’s doubling dimension. While our poly(α)-factor improvement in the dimension may seem small, it actually has significant implications for streaming algorithms, and easily yields an algorithm for k-center in dynamic geometric streams that achieves O(α)-approximation using space poly(kdn^{1/α²}). This is the first algorithm to beat O(n) space in high dimension d, as all previous algorithms require space at least exp(d). Furthermore, it extends to the k-center variants mentioned above.
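
To make the setting concrete, the following is a minimal illustrative sketch and not the authors' algorithm or analysis. It draws a Gaussian random linear map (one standard form of the JL map), projects a pointset to a target dimension of the shape (log n)/α² + log k, and compares a k-center value before and after projection. Several pieces are assumptions made only for illustration: the helper names, the constant c = 4 in `target_dimension`, the synthetic Gaussian input, and the use of the greedy farthest-first heuristic (a classical 2-approximation for k-center) as a stand-in solver.

```python
import math
import random

def gaussian_jl_map(d, t, rng):
    """Return a t x d matrix with i.i.d. N(0, 1/t) entries (a standard JL map)."""
    scale = 1.0 / math.sqrt(t)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(t)]

def project(G, x):
    """Apply the linear map G to the point x."""
    return [sum(g * xi for g, xi in zip(row, x)) for row in G]

def dist(x, y):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def greedy_k_center(points, k):
    """Farthest-first traversal: returns (center indices, clustering radius)."""
    centers = [0]  # arbitrary first center
    nearest = [dist(p, points[0]) for p in points]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        far = max(range(len(points)), key=nearest.__getitem__)
        centers.append(far)
        nearest = [min(nd, dist(p, points[far])) for nd, p in zip(nearest, points)]
    return centers, max(nearest)

def target_dimension(n, k, alpha, c=4.0):
    """Hypothetical instantiation of O((log n)/alpha^2 + log k); c is an assumed constant."""
    return max(1, math.ceil(c * (math.log(n) / alpha ** 2 + math.log(max(k, 2)))))

# Synthetic example (assumed data): n Gaussian points in dimension d.
rng = random.Random(0)
n, d, k, alpha = 200, 100, 3, 2.0
points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]

t = target_dimension(n, k, alpha)
G = gaussian_jl_map(d, t, rng)
low = [project(G, p) for p in points]

_, r_high = greedy_k_center(points, k)
_, r_low = greedy_k_center(low, k)
print(f"target dimension t = {t}")
print(f"greedy k-center radius: original = {r_high:.3f}, projected = {r_low:.3f}")
```

Comparing the two printed radii gives only an informal feel for how much the k-center value changes under projection on one random instance; it does not verify the α-approximation guarantee stated in the abstract, which concerns the optimal value and holds with high probability over the random map.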

Subject Classification

ACM Subject Classification
  • Theory of computation → Random projections and metric embeddings
  • Theory of computation → Sketching and sampling
  • Theory of computation → Streaming models
Keywords
  • Johnson-Lindenstrauss transform
  • dimension reduction
  • clustering
  • streaming algorithms

