Shrinkage Clustering: A Fast and Size-Constrained Algorithm for Biomedical Applications

Authors Chenyue W. Hu, Hanyang Li, Amina A. Qutub



PDF
Thumbnail PDF

File

LIPIcs.WABI.2017.11.pdf
  • Filesize: 0.6 MB
  • 13 pages

Document Identifiers

Author Details

Chenyue W. Hu
Hanyang Li
Amina A. Qutub

Cite As Get BibTex

Chenyue W. Hu, Hanyang Li, and Amina A. Qutub. Shrinkage Clustering: A Fast and Size-Constrained Algorithm for Biomedical Applications. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 88, pp. 11:1-11:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017) https://doi.org/10.4230/LIPIcs.WABI.2017.11

Abstract

Motivation: Many common clustering algorithms require a two-step process that limits their efficiency. The algorithms need to be performed repetitively and need to be implemented together with a model selection criterion, in order to determine both the number of clusters present in the data and the corresponding cluster memberships. As biomedical datasets increase in size and prevalence, there is a growing need for new methods that are more convenient to implement and are more computationally efficient. In addition, it is often essential to obtain clusters of sufficient sample size to make the clustering result meaningful and interpretable for subsequent analysis.

Results: We introduce Shrinkage Clustering, a novel clustering algorithm based on matrix factorization that simultaneously finds the optimal number of clusters while partitioning the data. We report its performances across multiple simulated and actual datasets, and demonstrate its strength in accuracy and speed in application to subtyping cancer and brain tissues. In addition, the algorithm offers a straightforward solution to clustering with cluster size constraints. Given its ease of implementation, computing efficiency and extensible structure, we believe Shrinkage Clustering can be applied broadly to solve biomedical clustering tasks especially when dealing with large datasets.

Subject Classification

Keywords
  • Clustering
  • Matrix Factorization
  • Cancer Subtyping
  • Gene Expression

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. S. Aeberhard, D. Coomans, and O. De Vel. Comparison of classifiers in high dimensional settings. Dept. Math. Statist., James Cook Univ., North Queensland, Australia, Tech. Rep, (92-02), 1992. Google Scholar
  2. Arthur Asuncion and David Newman. Uci machine learning repository, 2007. Google Scholar
  3. P. S. Bradley, K. P. Bennett, and Ayhan Demiriz. Constrained k-means clustering. Microsoft Research, Redmond, pages 1-8, 2000. Google Scholar
  4. Jean-Philippe Brunet, Pablo Tamayo, Todd R. Golub, and Jill P. Mesirov. Metagenes and molecular pattern discovery using matrix factorization. Proceedings of the national academy of sciences, 101(12):4164-4169, 2004. Google Scholar
  5. Elisa Boari de Lima, Wagner Meira Júnior, and Raquel Cardoso de Melo-Minardi. Isofunctional protein subfamily detection using data integration and spectral clustering. PLoS Comput Biol, 12(6):e1005001, 2016. Google Scholar
  6. Chris Ding, Xiaofeng He, and Horst D. Simon. On the equivalence of nonnegative matrix factorization and spectral clustering. In Proceedings of the 2005 SIAM International Conference on Data Mining, pages 606-610. SIAM, 2005. Google Scholar
  7. Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226-231, 1996. Google Scholar
  8. Ronald A. Fisher. The use of multiple measurements in taxonomic problems. Annals of eugenics, 7(2):179-188, 1936. Google Scholar
  9. Brendan J. Frey and Delbert Dueck. Clustering by passing messages between data points. science, 315(5814):972-976, 2007. Google Scholar
  10. Chenyue W. Hu, Steven M. Kornblau, John H. Slater, and Amina A. Qutub. Progeny clustering: A method to identify biological phenotypes. Scientific reports, 5, 2015. Google Scholar
  11. Stephen C. Johnson. Hierarchical clustering schemes. Psychometrika, 32(3):241-254, 1967. Google Scholar
  12. Da Kuang, Chris Ding, and Haesun Park. Symmetric nonnegative matrix factorization for graph clustering. In Proceedings of the 2012 SIAM international conference on data mining, pages 106-117. SIAM, 2012. Google Scholar
  13. Tilman Lange, Volker Roth, Mikio L. Braun, and Joachim M. Buhmann. Stability-based validation of clustering solutions. Neural computation, 16(6):1299-1323, 2004. Google Scholar
  14. Tao Li and Chris H. Q. Ding. Nonnegative matrix factorizations for clustering: A survey., 2013. Google Scholar
  15. James MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, pages 281-297. California, USA, 1967. Google Scholar
  16. Martin Maechler, Peter Rousseeuw, Anja Struyf, Mia Hubert, and Kurt Hornik. Cluster: cluster analysis basics and extensions. R package version, 1(2):56, 2012. Google Scholar
  17. Olvi L Mangasarian, W. Nick Street, and William H. Wolberg. Breast cancer diagnosis and prognosis via linear programming. Operations Research, 43(4):570-577, 1995. Google Scholar
  18. Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, et al. Introduction to information retrieval, volume 1. Cambridge University Press, 2008. Google Scholar
  19. Geoffrey J. McLachlan and Kaye E. Basford. Mixture models. inference and applications to clustering. Statistics: Textbooks and Monographs, New York: Dekker, 1988, 1, 1988. Google Scholar
  20. Stefano Monti, Pablo Tamayo, Jill Mesirov, and Todd Golub. Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Machine learning, 52(1-2):91-118, 2003. Google Scholar
  21. Thomas J. Montine, Joshua A. Sonnen, Kathleen S. Montine, Paul K. Crane, and Eric B. Larson. Adult changes in thought study: dementia is an individually varying convergent syndrome with prevalent clinically silent diseases that may be modified by some commonly used therapeutics. Current Alzheimer Research, 9(6):718-723, 2012. Google Scholar
  22. Wendy C. Moore, Deborah A. Meyers, Sally E. Wenzel, W. Gerald Teague, Huashi Li, Xingnan Li, Ralph D'Agostino Jr., Mario Castro, Douglas Curran-Everett, Anne M. Fitzpatrick, et al. Identification of asthma phenotypes using cluster analysis in the severe asthma research program. American journal of respiratory and critical care medicine, 181(4):315-323, 2010. Google Scholar
  23. Dan Pelleg, Andrew W. Moore, et al. X-means: Extending K-means with Efficient Estimation of the Number of Clusters. In ICML, pages 727-734, 2000. Google Scholar
  24. Peter J. Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics, 20:53-65, 1987. Google Scholar
  25. Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(8):888-905, 2000. Google Scholar
  26. John H Slater, James C. Culver, Byron L. Long, Chenyue W. Hu, Jingzhe Hu, Taylor F Birk, Amina A. Qutub, Mary E. Dickinson, and Jennifer L. West. Recapitulation and modulation of the cellular architecture of a user-chosen cell of interest using cell-derived, biomimetic patterning. ACS nano, 9(6):6128-6138, 2015. Google Scholar
  27. Nora Speicher and Thomas Lengauer. Towards the identification of cancer subtypes by integrative clustering of molecular data. PhD thesis, Universität des Saarlandes Saarbrücken, 2012. Google Scholar
  28. W. Nick Street, William H. Wolberg, and Olvi L. Mangasarian. Nuclear feature extraction for breast tumor diagnosis. In IS&T/SPIE’s Symposium on Electronic Imaging: Science and Technology, pages 861-870. International Society for Optics and Photonics, 1993. Google Scholar
  29. Robert Tibshirani, Guenther Walther, and Trevor Hastie. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):411-423, 2001. Google Scholar
  30. Joe H. Ward Jr. Hierarchical grouping to optimize an objective function. Journal of the American statistical association, 58(301):236-244, 1963. Google Scholar
  31. Pratyaksha Wirapati, Christos Sotiriou, Susanne Kunkel, Pierre Farmer, Sylvain Pradervand, Benjamin Haibe-Kains, Christine Desmedt, Michail Ignatiadis, Thierry Sengstag, Frédéric Schütz, et al. Meta-analysis of gene expression profiles in breast cancer: toward a unified understanding of breast cancer subtyping and prognosis signatures. Breast Cancer Research, 10(4):R65, 2008. Google Scholar
  32. Christian Wiwie, Jan Baumbach, and Richard Röttger. Comparing the performance of biomedical clustering methods. Nature Methods, 12(11):1033-1038, 2015. Google Scholar
  33. Achim Zeileis, Kurt Hornik, Alex Smola, and Alexandros Karatzoglou. kernlab-an S4 package for kernel methods in R. Journal of statistical software, 11(9):1-20, 2004. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail