Protein Classification with Improved Topological Data Analysis

Authors Tamal K. Dey, Sayan Mandal

13 pages

Author Details

Tamal K. Dey
  • Department of Computer Science and Engineering, The Ohio State University, Columbus, USA
Sayan Mandal
  • Department of Computer Science and Engineering, The Ohio State University, Columbus, USA

Cite As

Tamal K. Dey and Sayan Mandal. Protein Classification with Improved Topological Data Analysis. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 6:1-6:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)


Abstract

Automated annotation and analysis of protein molecules have long been a topic of interest due to immediate applications in medicine and drug design. In this work, we propose a topology-based, fast, scalable, and parameter-free technique to generate protein signatures. We build an initial simplicial complex using information about the protein's constituent atoms, including their radii and existing chemical bonds, to model the hierarchical structure of the molecule. We then use simplicial collapse to construct a filtration, from which we compute persistent homology. This information constitutes our signature for the protein. In addition, we demonstrate that this technique scales well to large proteins. Our method shows sizable time and memory improvements compared to other topology-based approaches. We use the signature to train a protein domain classifier. Finally, we compare this classifier against models built from state-of-the-art structure-based protein signatures on standard datasets and achieve a substantial improvement in accuracy.
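
As a rough illustration of how a persistence-based signature can be derived from atom coordinates, the sketch below computes only the 0-dimensional persistent homology of a plain distance-based (Vietoris-Rips style) filtration using union-find. It is not the authors' method: it ignores atomic radii and chemical bonds, does not use simplicial collapse, and the function names `h0_barcode` and `signature`, as well as the choice of keeping the longest bars as features, are assumptions made purely for illustration.

```python
import math
from itertools import combinations


def h0_barcode(points):
    """Death times of 0-dimensional classes in a distance-based filtration.

    Every point is born at filtration value 0; a connected component dies
    when the shortest edge joining it to another component appears.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, sorted by Euclidean length (the filtration order).
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    deaths = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj        # merging two components kills one class
            deaths.append(length)
    return deaths                  # n - 1 finite bars; one class never dies


def signature(points, size=4):
    """Hypothetical fixed-length feature vector: the longest finite H0 bars,
    zero-padded to `size` entries so inputs of any size are comparable."""
    bars = sorted(h0_barcode(points), reverse=True)[:size]
    return bars + [0.0] * (size - len(bars))


# Toy "molecule": three atom centers on a line.
atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
print(signature(atoms))
```

A vector of this kind could then be passed to any standard supervised classifier. A full pipeline would also need higher-dimensional homology and a filtration that reflects radii and bonds, as the abstract describes.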

Subject Classification

ACM Subject Classification
  • Applied computing → Life and medical sciences

Keywords
  • topological data analysis
  • persistent homology
  • simplicial collapse
  • supervised learning
  • topology based protein feature vector
  • protein classification



