It’s Hard to HAC Average Linkage!

Authors: MohammadHossein Bateni, Laxman Dhulipala, Kishen N. Gowda, D. Ellis Hershkowitz, Rajesh Jayaram, Jakub Łącki



File

LIPIcs.ICALP.2024.18.pdf
  • Filesize: 1.9 MB
  • 18 pages

Author Details

MohammadHossein Bateni
  • Google Research, New York, NY, USA
Laxman Dhulipala
  • University of Maryland, College Park, MD, USA
Kishen N. Gowda
  • University of Maryland, College Park, MD, USA
D. Ellis Hershkowitz
  • Brown University, Providence, RI, USA
Rajesh Jayaram
  • Google Research, New York, NY, USA
Jakub Łącki
  • Google Research, New York, NY, USA

Acknowledgements

We thank the anonymous reviewers for their useful comments.

Cite As

MohammadHossein Bateni, Laxman Dhulipala, Kishen N. Gowda, D. Ellis Hershkowitz, Rajesh Jayaram, and Jakub Łącki. It’s Hard to HAC Average Linkage! In 51st International Colloquium on Automata, Languages, and Programming (ICALP 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 297, pp. 18:1-18:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024).
https://doi.org/10.4230/LIPIcs.ICALP.2024.18

Abstract

Average linkage Hierarchical Agglomerative Clustering (HAC) is an extensively studied and applied method for hierarchical clustering. Recent applications to massive datasets have driven significant interest in near-linear-time and efficient parallel algorithms for average linkage HAC. We provide hardness results that rule out such algorithms. On the sequential side, we establish a runtime lower bound of n^{3/2-ε} on n-node graphs for sequential combinatorial algorithms under standard fine-grained complexity assumptions. This essentially matches the best-known running time for average linkage HAC. On the parallel side, we prove that average linkage HAC likely cannot be parallelized even on simple graphs by showing that it is CC-hard on trees of diameter 4. On the possibility side, we demonstrate that average linkage HAC can be efficiently parallelized (i.e., it is in NC) on paths and can be solved in near-linear time when the height of the output cluster hierarchy is small.
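For readers unfamiliar with the procedure, the following is a minimal, deliberately naive sketch of average linkage HAC on a weighted graph. It is not one of the paper's algorithms. It assumes the usual graph-based definition, in which the similarity of two clusters A and B is the total weight of edges between them divided by |A|·|B|, and in which merging stops once no edges remain between clusters. The function name `average_linkage_hac` and the edge-dictionary input format are illustrative choices, not taken from the paper.

```python
def average_linkage_hac(n, edges):
    """Naive average linkage HAC on an undirected weighted graph.

    n     -- number of nodes, labelled 0..n-1
    edges -- dict mapping (u, v) pairs to similarity weights (assumed format)

    Returns a list of merges (a, b, similarity). Each merge creates a new
    cluster whose id is the next unused integer, starting at n.
    """
    size = {v: 1 for v in range(n)}
    # total[a][b] = sum of edge weights between clusters a and b
    total = {v: {} for v in range(n)}
    for (u, v), w in edges.items():
        total[u][v] = total[u].get(v, 0.0) + w
        total[v][u] = total[v].get(u, 0.0) + w

    merges = []
    next_id = n
    while True:
        # Scan all cluster pairs joined by an edge; keep the most similar.
        best = None
        for a in total:
            for b, w in total[a].items():
                if a < b:
                    sim = w / (size[a] * size[b])
                    if best is None or sim > best[0]:
                        best = (sim, a, b)
        if best is None:  # no edges left between clusters
            break
        sim, a, b = best

        # Combine the neighbourhoods of a and b into the new cluster c.
        c = next_id
        next_id += 1
        size[c] = size[a] + size[b]
        nbrs = {}
        for x in (a, b):
            for y, w in total[x].items():
                if y != a and y != b:
                    nbrs[y] = nbrs.get(y, 0.0) + w
                    del total[y][x]
            del total[x]
        total[c] = nbrs
        for y, w in nbrs.items():
            total[y][c] = w
        merges.append((a, b, sim))
    return merges


# Path 0-1-2-3: after merging {0,1}, the edge into node 2 is averaged over
# more node pairs, so the pair (2, 3) is merged before {0,1} absorbs node 2.
path = {(0, 1): 4, (1, 2): 3, (2, 3): 2}
print(average_linkage_hac(4, path))
```

Each round rescans every remaining inter-cluster edge, so this sketch is far slower than the running times discussed in the abstract. It also makes the merges one at a time, with each merge changing the similarities that decide the next one; this sequential dependence is the behaviour that the parallel hardness result concerns.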

Subject Classification

ACM Subject Classification
  • Theory of computation → Parallel algorithms
  • Theory of computation → Streaming, sublinear and near linear time algorithms
  • Theory of computation → Graph algorithms analysis
Keywords
  • Clustering
  • Hierarchical Graph Clustering
  • HAC
  • Fine-Grained Complexity
  • Parallel Algorithms
  • CC
