Shrinkage Clustering: A Fast and Size-Constrained Algorithm for Biomedical Applications

Hu, Chenyue W.; Li, Hanyang; Qutub, Amina A.

doi:10.4230/LIPIcs.WABI.2017.11

File

Subject Classification

Keywords

Clustering
Matrix Factorization
Cancer Subtyping
Gene Expression

Metrics

Access Statistics
Total Accesses (updated on a weekly basis)

0

Document

0

Metadata

Abstract

Motivation: Many common clustering algorithms require a two-step process that limits their efficiency. The algorithms need to be performed repetitively and need to be implemented together with a model selection criterion, in order to determine both the number of clusters present in the data and the corresponding cluster memberships. As biomedical datasets increase in size and prevalence, there is a growing need for new methods that are more convenient to implement and are more computationally efficient. In addition, it is often essential to obtain clusters of sufficient sample size to make the clustering result meaningful and interpretable for subsequent analysis. Results: We introduce Shrinkage Clustering, a novel clustering algorithm based on matrix factorization that simultaneously finds the optimal number of clusters while partitioning the data. We report its performances across multiple simulated and actual datasets, and demonstrate its strength in accuracy and speed in application to subtyping cancer and brain tissues. In addition, the algorithm offers a straightforward solution to clustering with cluster size constraints. Given its ease of implementation, computing efficiency and extensible structure, we believe Shrinkage Clustering can be applied broadly to solve biomedical clustering tasks especially when dealing with large datasets.

Cite As Get BibTex

Chenyue W. Hu, Hanyang Li, and Amina A. Qutub. Shrinkage Clustering: A Fast and Size-Constrained Algorithm for Biomedical Applications. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 11:1-11:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017) https://doi.org/10.4230/LIPIcs.WABI.2017.11

Author Details

Chenyue W. Hu

Hanyang Li

Amina A. Qutub

References

S. Aeberhard, D. Coomans, and O. De Vel. Comparison of classifiers in high dimensional settings. Dept. Math. Statist., James Cook Univ., North Queensland, Australia, Tech. Rep, (92-02), 1992.
Arthur Asuncion and David Newman. Uci machine learning repository, 2007.
P. S. Bradley, K. P. Bennett, and Ayhan Demiriz. Constrained k-means clustering. Microsoft Research, Redmond, pages 1-8, 2000.
Jean-Philippe Brunet, Pablo Tamayo, Todd R. Golub, and Jill P. Mesirov. Metagenes and molecular pattern discovery using matrix factorization. Proceedings of the national academy of sciences, 101(12):4164-4169, 2004.
Elisa Boari de Lima, Wagner Meira Júnior, and Raquel Cardoso de Melo-Minardi. Isofunctional protein subfamily detection using data integration and spectral clustering. PLoS Comput Biol, 12(6):e1005001, 2016.
Chris Ding, Xiaofeng He, and Horst D. Simon. On the equivalence of nonnegative matrix factorization and spectral clustering. In Proceedings of the 2005 SIAM International Conference on Data Mining, pages 606-610. SIAM, 2005.
Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226-231, 1996.
Ronald A. Fisher. The use of multiple measurements in taxonomic problems. Annals of eugenics, 7(2):179-188, 1936.
Brendan J. Frey and Delbert Dueck. Clustering by passing messages between data points. science, 315(5814):972-976, 2007.
Chenyue W. Hu, Steven M. Kornblau, John H. Slater, and Amina A. Qutub. Progeny clustering: A method to identify biological phenotypes. Scientific reports, 5, 2015.
Stephen C. Johnson. Hierarchical clustering schemes. Psychometrika, 32(3):241-254, 1967.
Da Kuang, Chris Ding, and Haesun Park. Symmetric nonnegative matrix factorization for graph clustering. In Proceedings of the 2012 SIAM international conference on data mining, pages 106-117. SIAM, 2012.
Tilman Lange, Volker Roth, Mikio L. Braun, and Joachim M. Buhmann. Stability-based validation of clustering solutions. Neural computation, 16(6):1299-1323, 2004.
Tao Li and Chris H. Q. Ding. Nonnegative matrix factorizations for clustering: A survey., 2013.
James MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, pages 281-297. California, USA, 1967.
Martin Maechler, Peter Rousseeuw, Anja Struyf, Mia Hubert, and Kurt Hornik. Cluster: cluster analysis basics and extensions. R package version, 1(2):56, 2012.
Olvi L Mangasarian, W. Nick Street, and William H. Wolberg. Breast cancer diagnosis and prognosis via linear programming. Operations Research, 43(4):570-577, 1995.
Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, et al. Introduction to information retrieval, volume 1. Cambridge University Press, 2008.
Geoffrey J. McLachlan and Kaye E. Basford. Mixture models. inference and applications to clustering. Statistics: Textbooks and Monographs, New York: Dekker, 1988, 1, 1988.
Stefano Monti, Pablo Tamayo, Jill Mesirov, and Todd Golub. Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Machine learning, 52(1-2):91-118, 2003.
Thomas J. Montine, Joshua A. Sonnen, Kathleen S. Montine, Paul K. Crane, and Eric B. Larson. Adult changes in thought study: dementia is an individually varying convergent syndrome with prevalent clinically silent diseases that may be modified by some commonly used therapeutics. Current Alzheimer Research, 9(6):718-723, 2012.
Wendy C. Moore, Deborah A. Meyers, Sally E. Wenzel, W. Gerald Teague, Huashi Li, Xingnan Li, Ralph D'Agostino Jr., Mario Castro, Douglas Curran-Everett, Anne M. Fitzpatrick, et al. Identification of asthma phenotypes using cluster analysis in the severe asthma research program. American journal of respiratory and critical care medicine, 181(4):315-323, 2010.
Dan Pelleg, Andrew W. Moore, et al. X-means: Extending K-means with Efficient Estimation of the Number of Clusters. In ICML, pages 727-734, 2000.
Peter J. Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics, 20:53-65, 1987.
Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(8):888-905, 2000.
John H Slater, James C. Culver, Byron L. Long, Chenyue W. Hu, Jingzhe Hu, Taylor F Birk, Amina A. Qutub, Mary E. Dickinson, and Jennifer L. West. Recapitulation and modulation of the cellular architecture of a user-chosen cell of interest using cell-derived, biomimetic patterning. ACS nano, 9(6):6128-6138, 2015.
Nora Speicher and Thomas Lengauer. Towards the identification of cancer subtypes by integrative clustering of molecular data. PhD thesis, Universität des Saarlandes Saarbrücken, 2012.
W. Nick Street, William H. Wolberg, and Olvi L. Mangasarian. Nuclear feature extraction for breast tumor diagnosis. In IS&T/SPIE’s Symposium on Electronic Imaging: Science and Technology, pages 861-870. International Society for Optics and Photonics, 1993.
Robert Tibshirani, Guenther Walther, and Trevor Hastie. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):411-423, 2001.
Joe H. Ward Jr. Hierarchical grouping to optimize an objective function. Journal of the American statistical association, 58(301):236-244, 1963.
Pratyaksha Wirapati, Christos Sotiriou, Susanne Kunkel, Pierre Farmer, Sylvain Pradervand, Benjamin Haibe-Kains, Christine Desmedt, Michail Ignatiadis, Thierry Sengstag, Frédéric Schütz, et al. Meta-analysis of gene expression profiles in breast cancer: toward a unified understanding of breast cancer subtyping and prognosis signatures. Breast Cancer Research, 10(4):R65, 2008.
Christian Wiwie, Jan Baumbach, and Richard Röttger. Comparing the performance of biomedical clustering methods. Nature Methods, 12(11):1033-1038, 2015.
Achim Zeileis, Kurt Hornik, Alex Smola, and Alexandros Karatzoglou. kernlab-an S4 package for kernel methods in R. Journal of statistical software, 11(9):1-20, 2004.

Shrinkage Clustering: A Fast and Size-Constrained Algorithm for Biomedical Applications

Authors Chenyue W. Hu, Hanyang Li, Amina A. Qutub

File

Document Identifiers

Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

References

Thanks for your feedback!

Could not send message