It’s Hard to HAC Average Linkage!

Bateni, MohammadHossein; Dhulipala, Laxman; Gowda, Kishen N.; Hershkowitz, D. Ellis; Jayaram, Rajesh; Łącki, Jakub

doi:10.4230/LIPIcs.ICALP.2024.18

Abstract

Average linkage Hierarchical Agglomerative Clustering (HAC) is an extensively studied and widely applied method for hierarchical clustering. Recent applications to massive datasets have driven significant interest in near-linear-time and efficient parallel algorithms for average linkage HAC. We provide hardness results that rule out such algorithms. On the sequential side, we establish a runtime lower bound of n^{3/2-ε} on n-node graphs for sequential combinatorial algorithms under standard fine-grained complexity assumptions. This essentially matches the best-known running time for average linkage HAC. On the parallel side, we prove that average linkage HAC likely cannot be parallelized even on simple graphs by showing that it is CC-hard on trees of diameter 4. On the positive side, we demonstrate that average linkage HAC can be efficiently parallelized (i.e., it is in NC) on paths and can be solved in near-linear time when the height of the output cluster hierarchy is small.
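For readers who want the procedure spelled out, here is a minimal sketch of naive average linkage HAC on a weighted graph. It assumes the usual graph formulation, which the abstract does not state: the similarity of two clusters is the total weight of the edges between them divided by the product of their sizes, and the most similar pair is merged until no edges between clusters remain. The function name `average_linkage_hac`, the edge-dictionary input, and the arbitrary tie-breaking are illustrative choices, not taken from the paper.

```python
def average_linkage_hac(n, edges):
    """Naive average linkage HAC on a weighted similarity graph.

    n:     number of nodes, labelled 0..n-1 (assumed input convention).
    edges: dict mapping (u, v) pairs to positive similarity weights.
    Returns the merges in order as (cluster_a, cluster_b, similarity).
    """
    # Every node starts as its own cluster; merged clusters get fresh ids.
    clusters = {i: frozenset([i]) for i in range(n)}

    # cut[(a, b)] is the total edge weight between clusters a and b (a < b).
    cut = {}
    for (u, v), w in edges.items():
        key = (min(u, v), max(u, v))
        cut[key] = cut.get(key, 0.0) + w

    def similarity(item):
        (a, b), w = item
        return w / (len(clusters[a]) * len(clusters[b]))

    merges = []
    next_id = n
    while cut:
        # Full scan for the most similar adjacent pair (ties broken arbitrarily).
        best = max(cut.items(), key=similarity)
        (a, b), _ = best
        merges.append((clusters[a], clusters[b], similarity(best)))

        merged = clusters.pop(a) | clusters.pop(b)

        # Redirect edges touching a or b to the new cluster, summing weights.
        new_cut = {}
        for (x, y), w in cut.items():
            if (x, y) == (a, b):
                continue  # this edge is now internal to the merged cluster
            x2 = next_id if x in (a, b) else x
            y2 = next_id if y in (a, b) else y
            key = (min(x2, y2), max(x2, y2))
            new_cut[key] = new_cut.get(key, 0.0) + w

        clusters[next_id] = merged
        cut = new_cut
        next_id += 1
    return merges


# Path 0 - 1 - 2 with weights 4 and 3.
for a, b, sim in average_linkage_hac(3, {(0, 1): 4.0, (1, 2): 3.0}):
    print(sorted(a), sorted(b), sim)
```

Each merge rescans every remaining cluster pair, so this sketch only illustrates the definition and comes nowhere near the running times discussed in the abstract. Note also that each merge depends on the outcome of the previous ones, which is the sequential dependence that the parallel hardness result concerns.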

