Fold Family-Regularized Bayesian Optimization for Directed Protein Evolution

Authors Trevor S. Frisby, Christopher J. Langmead



Author Details

Trevor S. Frisby
  • Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
Christopher J. Langmead
  • Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA

Cite As

Trevor S. Frisby and Christopher J. Langmead. Fold Family-Regularized Bayesian Optimization for Directed Protein Evolution. In 20th International Workshop on Algorithms in Bioinformatics (WABI 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 172, pp. 18:1-18:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)


Abstract

Directed Evolution (DE) is a protein engineering technique that uses iterative rounds of mutagenesis and screening to search for sequences that optimize a given property (e.g., binding affinity to a specified target). Unfortunately, the underlying optimization problem is under-determined, so mutations introduced to improve the specified property may come at the expense of unmeasured but nevertheless important properties (e.g., subcellular localization). We seek to address this issue by incorporating a fold-specific regularization factor into the optimization problem. The regularization factor biases the search towards designs that resemble sequences from the fold family to which the protein belongs. We applied our method to a large library of protein GB1 mutants with binding affinity measurements to IgG-Fc. Our results demonstrate that the regularized optimization problem produces more native-like GB1 sequences with only a minor decrease in binding affinity. Specifically, the log-odds of our designs under a generative model of the GB1 fold family are 41-45% higher than those obtained without regularization, with only a 7% drop in binding affinity. Thus, our method is capable of making a trade-off between competing traits. Moreover, we demonstrate that our active-learning-driven approach reduces the wet-lab burden of identifying optimal GB1 designs by 67%, relative to recent results from the Arnold lab on the same data.
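The idea of biasing the search with a fold-family term can be illustrated with a minimal sketch. It is not the authors' implementation: this page does not specify the acquisition function, the surrogate model's interface, or the form of the generative model. The sketch assumes an upper-confidence-bound acquisition, an independent-site amino acid profile as a stand-in for the fold-family model, and a uniform background distribution. The names `log_odds`, `regularized_score`, `pick_next`, `beta`, and `lam` are hypothetical.

```python
import math

def log_odds(seq, profile, background=1.0 / 20):
    """Log-odds of `seq` under an independent-site profile vs. a uniform background.

    `profile[i]` maps amino acids to probabilities at position i (assumed model,
    standing in for a generative model of the fold family).
    """
    return sum(
        math.log(profile[i].get(aa, 1e-6) / background)
        for i, aa in enumerate(seq)
    )

def regularized_score(mean, std, seq, profile, beta=1.0, lam=0.1):
    """Upper-confidence-bound acquisition plus a weighted fold-family term.

    `mean` and `std` are the surrogate model's predicted fitness and uncertainty;
    `lam` controls how strongly native-like sequences are favored.
    """
    return mean + beta * std + lam * log_odds(seq, profile)

def pick_next(candidates, profile, beta=1.0, lam=0.1):
    """Return the candidate sequence with the highest regularized score.

    `candidates` maps each sequence to a (mean, std) pair from any surrogate.
    """
    return max(
        candidates,
        key=lambda s: regularized_score(*candidates[s], s, profile, beta, lam),
    )

# Toy two-position example with made-up numbers (not data from the paper).
toy_profile = [{"V": 0.8, "A": 0.2}, {"D": 0.9, "G": 0.1}]
toy_candidates = {"AG": (1.00, 0.1), "VD": (0.95, 0.1)}

unregularized_choice = pick_next(toy_candidates, toy_profile, lam=0.0)
regularized_choice = pick_next(toy_candidates, toy_profile, lam=0.1)
```

In this toy setting, setting `lam` to zero selects purely on predicted fitness and uncertainty, while a positive `lam` can shift the choice towards a sequence with slightly lower predicted fitness but a higher log-odds under the profile, which mirrors the trade-off the abstract describes.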

Subject Classification

ACM Subject Classification
  • Computing methodologies → Active learning settings
  • Computing methodologies → Machine learning
  • Applied computing → Computational proteomics
  • Mathematics of computing → Discrete optimization

Keywords
  • Protein design
  • Bayesian Optimization
  • Gaussian Process Regression
  • Regularization



