Protein Classification with Improved Topological Data Analysis

Dey, Tamal K.; Mandal, Sayan

doi:10.4230/LIPIcs.WABI.2018.6

ACM Subject Classification

Applied computing → Life and medical sciences

Keywords

topological data analysis
persistent homology
simplicial collapse
supervised learning
topology based protein feature vector
protein classification

Abstract

Automated annotation and analysis of protein molecules have long been a topic of interest due to immediate applications in medicine and drug design. In this work, we propose a topology-based, fast, scalable, and parameter-free technique to generate protein signatures. We build an initial simplicial complex using information about the protein's constituent atoms, including their radii and existing chemical bonds, to model the hierarchical structure of the molecule. Simplicial collapse is then used to construct a filtration, from which we compute persistent homology. This information constitutes our signature for the protein. We demonstrate that the technique scales well to large proteins and shows sizable time and memory improvements over other topology-based approaches. We use the signature to train a protein domain classifier. Finally, we compare this classifier against models built from state-of-the-art structure-based protein signatures on standard datasets and observe a substantial improvement in accuracy.
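
To make the idea of a persistence-based signature concrete, the following is a minimal sketch and not the authors' method. It assumes a plain Vietoris-Rips filtration on atom centers (the paper instead builds its filtration by simplicial collapse from a complex that uses atomic radii and bonds), computes only 0-dimensional persistence with a union-find structure, and bins the resulting death times into a fixed-length vector. The function names `h0_barcode` and `signature`, the bin count, the scale cap, and the toy coordinates are all hypothetical.

```python
import math
from itertools import combinations


def h0_barcode(points):
    """Death times of 0-dimensional classes in a Vietoris-Rips filtration.

    Every point is born at scale 0. A connected component dies at the
    length of the edge that first merges it into another component
    (convention assumed here: an edge enters at its own length).
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, processed in order of increasing length.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    deaths = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj        # two components merge: one class dies
            deaths.append(length)
    return deaths                  # n - 1 finite bars; one class never dies


def signature(deaths, bins=4, max_scale=4.0):
    """Histogram of death times as a fixed-length feature vector.

    bins and max_scale are illustrative choices, not values from the paper.
    """
    vec = [0] * bins
    for d in deaths:
        k = min(int(d / max_scale * bins), bins - 1)
        vec[k] += 1
    return vec


# Toy "atom centers" (hypothetical coordinates, not real protein data).
atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (5.0, 0.0, 0.0)]
deaths = h0_barcode(atoms)
print(deaths, signature(deaths))
```

A vector of this kind can be passed to any standard supervised learner. Higher-dimensional homology, which captures loops and voids, requires a full boundary-matrix reduction of the sort provided by dedicated persistence libraries.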

Cite As

Tamal K. Dey and Sayan Mandal. Protein Classification with Improved Topological Data Analysis. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 6:1-6:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018) https://doi.org/10.4230/LIPIcs.WABI.2018.6

Author Details

Tamal K. Dey

Department of Computer Science and Engineering, The Ohio State University, Columbus, USA, http://web.cse.ohio-state.edu/~dey.8/

Sayan Mandal

Department of Computer Science and Engineering, The Ohio State University, Columbus, USA, http://web.cse.ohio-state.edu/~mandal.25/
