Backdoor Defense, Learnability and Obfuscation

Authors: Paul Christiano, Jacob Hilton, Victor Lecomte, Mark Xu



Author Details

Paul Christiano
  • Alignment Research Center, Berkeley, CA, USA
Jacob Hilton
  • Alignment Research Center, Berkeley, CA, USA
Victor Lecomte
  • Alignment Research Center, Berkeley, CA, USA
Mark Xu
  • Alignment Research Center, Berkeley, CA, USA

Acknowledgements

We are grateful to Dmitry Vaintrob for an earlier version of the results in Appendix A; to Thomas Read for finding the "Backdoor with likely input patterns" example and for help with proofs; to Andrea Lincoln, Dávid Matolcsi, Eric Neyman, George Robinson and Jack Smith for contributions to the project in its early stages; and to Geoffrey Irving, Robert Lasenby and Eric Neyman for helpful comments on drafts.

Cite As

Paul Christiano, Jacob Hilton, Victor Lecomte, and Mark Xu. Backdoor Defense, Learnability and Obfuscation. In 16th Innovations in Theoretical Computer Science Conference (ITCS 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 325, pp. 38:1-38:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025) https://doi.org/10.4230/LIPIcs.ITCS.2025.38

Abstract

We introduce a formal notion of defendability against backdoors using a game between an attacker and a defender. In this game, the attacker modifies a function to behave differently on a particular input known as the "trigger", while behaving the same almost everywhere else. The defender then attempts to detect the trigger at evaluation time. If the defender succeeds with high enough probability, then the function class is said to be defendable. The key constraint on the attacker that makes defense possible is that the attacker's strategy must work for a randomly chosen trigger.
Our definition is simple and does not explicitly mention learning, yet we demonstrate that it is closely connected to learnability. In the computationally unbounded setting, we use a voting algorithm of [Hanneke et al., 2022] to show that defendability is essentially determined by the VC dimension of the function class, in much the same way as PAC learnability. In the computationally bounded setting, we use a similar argument to show that efficient PAC learnability implies efficient defendability, but not conversely. On the other hand, we use indistinguishability obfuscation to show that the class of polynomial size circuits is not efficiently defendable. Finally, we present polynomial size decision trees as a natural example for which defense is strictly easier than learning. Thus, we identify efficient defendability as a notable intermediate concept between efficient learnability and obfuscation.
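
To make the game concrete, here is a minimal, informal Python sketch of a single round of the attacker-defender interaction described above. It is an illustration only, not the paper's formal definition: the names (play_round, never_flag, N_BITS), the bit-string input space, the single-input "patch" attacker, and the trivial defender are all assumptions made for this example.

```python
import random

N_BITS = 16  # inputs are bit strings, drawn uniformly from {0, 1}^N_BITS

def play_round(f, defender):
    """One illustrative round of the backdoor defense game.

    The attacker is handed a *randomly chosen* trigger (the key constraint
    from the abstract) and must plant a backdoor there while agreeing with
    f almost everywhere else. The defender then sees only the modified
    function and an input at evaluation time, and must flag the trigger.
    """
    trigger = random.getrandbits(N_BITS)

    # Attacker: a simple "patch" that flips the output on the trigger only,
    # so the modified function agrees with f on every other input.
    def f_backdoored(x):
        return 1 - f(x) if x == trigger else f(x)

    # Defense succeeds if the trigger is flagged but a benign input is not.
    benign = random.getrandbits(N_BITS)
    return defender(f_backdoored, trigger) and not defender(f_backdoored, benign)

if __name__ == "__main__":
    f = lambda x: x & 1                  # a toy original function
    never_flag = lambda g, x: False      # a deliberately useless defender
    wins = sum(play_round(f, never_flag) for _ in range(1000))
    print(f"defender success rate: {wins / 1000:.2f}")  # ~0.00 for this defender
```

In the paper's terms, a function class is efficiently defendable if some efficient defender wins rounds like this with high probability against every efficient attacker; the sketch only shows the shape of one round, not that quantification.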

Subject Classification

ACM Subject Classification
  • Theory of computation → Machine learning theory
  • Computing methodologies → Machine learning
  • Security and privacy → Mathematical foundations of cryptography
  • Theory of computation → Cryptographic primitives
Keywords
  • backdoors
  • machine learning
  • PAC learning
  • indistinguishability obfuscation


References

  1. Boaz Barak, Oded Goldreich, Russell Impagliazzo, Steven Rudich, Amit Sahai, Salil Vadhan, and Ke Yang. On the (im)possibility of obfuscating programs. In Annual International Cryptology Conference, pages 1-18. Springer, 2001. URL: https://doi.org/10.1007/3-540-44647-8_1.
  2. Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM (JACM), 36(4):929-965, 1989. URL: https://doi.org/10.1145/76359.76371.
  3. Dan Boneh and Brent Waters. Constrained pseudorandom functions and their applications. In Advances in Cryptology-ASIACRYPT 2013: 19th International Conference on the Theory and Application of Cryptology and Information Security, Bengaluru, India, December 1-5, 2013, Proceedings, Part II 19, pages 280-300. Springer, 2013. URL: https://doi.org/10.1007/978-3-642-42045-0_15.
  4. Elette Boyle, Shafi Goldwasser, and Ioana Ivan. Functional signatures and pseudorandom functions. In International workshop on public key cryptography, pages 501-519. Springer, 2014. URL: https://doi.org/10.1007/978-3-642-54631-0_29.
  5. Sébastien Bubeck, Yin Tat Lee, Eric Price, and Ilya Razenshteyn. Adversarial examples from computational constraints. In International Conference on Machine Learning, pages 831-840. PMLR, 2019. URL: http://proceedings.mlr.press/v97/bubeck19a.html.
  6. Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pages 1310-1320. PMLR, 2019. URL: http://proceedings.mlr.press/v97/cohen19c.html.
  7. Jacob Dumford and Walter Scheirer. Backdooring convolutional neural networks via targeted weight perturbations. In 2020 IEEE International Joint Conference on Biometrics (IJCB), pages 1-9. IEEE, 2020. URL: https://doi.org/10.1109/IJCB48548.2020.9304875.
  8. Sanjam Garg, Somesh Jha, Saeed Mahloujifar, and Mohammad Mahmoody. Adversarially robust learning could leverage computational hardness. In Algorithmic Learning Theory, pages 364-385. PMLR, 2020. URL: http://proceedings.mlr.press/v117/garg20a.html.
  9. Oded Goldreich, Shafi Goldwasser, and Silvio Micali. How to construct random functions. Journal of the ACM (JACM), 33(4):792-807, 1986. URL: https://doi.org/10.1145/6490.6503.
  10. Oded Goldreich and Leonid A Levin. A hard-core predicate for all one-way functions. In Proceedings of the Twenty-First Annual ACM Symposium on Theory of Computing, pages 25-32, 1989. URL: https://doi.org/10.1145/73007.73010.
  11. Shafi Goldwasser, Michael P Kim, Vinod Vaikuntanathan, and Or Zamir. Planting undetectable backdoors in machine learning models. In 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), pages 931-942. IEEE, 2022.
  12. Shafi Goldwasser, Jonathan Shafer, Neekon Vafa, and Vinod Vaikuntanathan. Oblivious defense in ML models: Backdoor removal without detection. arXiv preprint, 2024. URL: https://arxiv.org/abs/2411.03279.
  13. Steve Hanneke, Amin Karbasi, Mohammad Mahmoody, Idan Mehalel, and Shay Moran. On optimal learning under targeted data poisoning. Advances in Neural Information Processing Systems, 35:30770-30782, 2022.
  14. David Haussler, Nick Littlestone, and Manfred K Warmuth. Predicting 0, 1-functions on randomly drawn points. Information and Computation, 115(2):248-292, 1994.
  15. Sanghyun Hong, Nicholas Carlini, and Alexey Kurakin. Handcrafted backdoors in deep neural networks. Advances in Neural Information Processing Systems, 35:8068-8080, 2022.
  16. Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Risks from learned optimization in advanced machine learning systems. arXiv preprint, 2019. URL: https://arxiv.org/abs/1906.01820.
  17. Aayush Jain, Huijia Lin, and Amit Sahai. Indistinguishability obfuscation from well-founded assumptions. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 60-73, 2021. URL: https://doi.org/10.1145/3406325.3451093.
  18. Jinyuan Jia, Xiaoyu Cao, and Neil Zhenqiang Gong. Intrinsic certified robustness of bagging against data poisoning attacks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35(9), pages 7961-7969, 2021. URL: https://doi.org/10.1609/AAAI.V35I9.16971.
  19. Adam Tauman Kalai and Shang-Hua Teng. Decision trees are PAC-learnable from most product distributions: a smoothed analysis. arXiv preprint, 2008. URL: https://arxiv.org/abs/0812.0933.
  20. Jonathan Katz and Yehuda Lindell. Introduction to modern cryptography: principles and protocols. Chapman and Hall/CRC, 2007.
  21. Michael Kearns and Leslie Valiant. Cryptographic limitations on learning Boolean formulae and finite automata. Journal of the ACM (JACM), 41(1):67-95, 1994. URL: https://doi.org/10.1145/174644.174647.
  22. Michael J Kearns and Umesh Vazirani. An introduction to computational learning theory. MIT Press, 1994.
  23. Alaa Khaddaj, Guillaume Leclerc, Aleksandar Makelov, Kristian Georgiev, Hadi Salman, Andrew Ilyas, and Aleksander Madry. Rethinking backdoor attacks. In International Conference on Machine Learning, pages 16216-16236. PMLR, 2023. URL: https://proceedings.mlr.press/v202/khaddaj23a.html.
  24. Aggelos Kiayias, Stavros Papadopoulos, Nikos Triandopoulos, and Thomas Zacharias. Delegatable pseudorandom functions and applications. In Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security, pages 669-684, 2013. URL: https://doi.org/10.1145/2508859.2516668.
  25. Alexander Levine and Soheil Feizi. Deep partition aggregation: Provable defense against general poisoning attacks. arXiv preprint, 2020. URL: https://arxiv.org/abs/2006.14768.
  26. Yiming Li, Yong Jiang, Zhifeng Li, and Shu-Tao Xia. Backdoor learning: A survey. IEEE Transactions on Neural Networks and Learning Systems, 2022.
  27. Ryan O'Donnell. Analysis of Boolean functions. arXiv preprint, 2021. URL: https://arxiv.org/abs/2105.10386.
  28. Chris Olah. Mechanistic interpretability, variables, and the importance of interpretable bases. Transformer Circuits Thread, 2022. URL: https://www.transformer-circuits.pub/2022/mech-interp-essay.
  29. Amit Sahai and Brent Waters. How to use indistinguishability obfuscation: deniable encryption, and more. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, pages 475-484, 2014. URL: https://doi.org/10.1145/2591796.2591825.
  30. Leslie G Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134-1142, 1984. URL: https://doi.org/10.1145/1968.1972.