Backdoor Defense, Learnability and Obfuscation

Authors: Paul Christiano, Jacob Hilton, Victor Lecomte, Mark Xu



Author Details

Paul Christiano
  • Alignment Research Center, Berkeley, CA, USA
Jacob Hilton
  • Alignment Research Center, Berkeley, CA, USA
Victor Lecomte
  • Alignment Research Center, Berkeley, CA, USA
Mark Xu
  • Alignment Research Center, Berkeley, CA, USA

Acknowledgements

We are grateful to Dmitry Vaintrob for an earlier version of the results in Appendix A; to Thomas Read for finding the "Backdoor with likely input patterns" example and for help with proofs; to Andrea Lincoln, Dávid Matolcsi, Eric Neyman, George Robinson and Jack Smith for contributions to the project in its early stages; and to Geoffrey Irving, Robert Lasenby and Eric Neyman for helpful comments on drafts.

Cite As

Paul Christiano, Jacob Hilton, Victor Lecomte, and Mark Xu. Backdoor Defense, Learnability and Obfuscation. In 16th Innovations in Theoretical Computer Science Conference (ITCS 2025). Leibniz International Proceedings in Informatics (LIPIcs), Volume 325, pp. 38:1-38:21, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2025) https://doi.org/10.4230/LIPIcs.ITCS.2025.38

Abstract

We introduce a formal notion of defendability against backdoors using a game between an attacker and a defender. In this game, the attacker modifies a function to behave differently on a particular input known as the "trigger", while behaving the same almost everywhere else. The defender then attempts to detect the trigger at evaluation time. If the defender succeeds with high enough probability, then the function class is said to be defendable. The key constraint on the attacker that makes defense possible is that the attacker's strategy must work for a randomly chosen trigger.
Our definition is simple and does not explicitly mention learning, yet we demonstrate that it is closely connected to learnability. In the computationally unbounded setting, we use a voting algorithm of [Hanneke et al., 2022] to show that defendability is essentially determined by the VC dimension of the function class, in much the same way as PAC learnability. In the computationally bounded setting, we use a similar argument to show that efficient PAC learnability implies efficient defendability, but not conversely. On the other hand, we use indistinguishability obfuscation to show that the class of polynomial size circuits is not efficiently defendable. Finally, we present polynomial size decision trees as a natural example for which defense is strictly easier than learning. Thus, we identify efficient defendability as a notable intermediate concept between efficient learnability and obfuscation.
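
To make the game concrete, here is a minimal, informal Python sketch of a single round of the attacker-defender interaction described above. It is an illustration only, not the paper's formal definition: the names (play_round, never_flag, N_BITS), the bit-string input space, the single-input "patch" attacker, and the trivial defender are all assumptions made for this example.

```python
import random

N_BITS = 16  # inputs are bit strings, drawn uniformly from {0, 1}^N_BITS

def play_round(f, defender):
    """One illustrative round of the backdoor defense game.

    The attacker is handed a *randomly chosen* trigger (the key constraint
    from the abstract) and must plant a backdoor there while agreeing with
    f almost everywhere else. The defender then sees only the modified
    function and an input at evaluation time, and must flag the trigger.
    """
    trigger = random.getrandbits(N_BITS)

    # Attacker: a simple "patch" that flips the output on the trigger only,
    # so the modified function agrees with f on every other input.
    def f_backdoored(x):
        return 1 - f(x) if x == trigger else f(x)

    # Defense succeeds if the trigger is flagged but a benign input is not.
    benign = random.getrandbits(N_BITS)
    return defender(f_backdoored, trigger) and not defender(f_backdoored, benign)

if __name__ == "__main__":
    f = lambda x: x & 1                  # a toy original function
    never_flag = lambda g, x: False      # a deliberately useless defender
    wins = sum(play_round(f, never_flag) for _ in range(1000))
    print(f"defender success rate: {wins / 1000:.2f}")  # ~0.00 for this defender
```

In the paper's terms, a function class is efficiently defendable if some efficient defender wins rounds like this with high probability against every efficient attacker; the sketch only shows the shape of one round, not that quantification.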

Subject Classification

ACM Subject Classification
  • Theory of computation → Machine learning theory
  • Computing methodologies → Machine learning
  • Security and privacy → Mathematical foundations of cryptography
  • Theory of computation → Cryptographic primitives
Keywords
  • backdoors
  • machine learning
  • PAC learning
  • indistinguishability obfuscation


References

  1. Boaz Barak, Oded Goldreich, Russell Impagliazzo, Steven Rudich, Amit Sahai, Salil Vadhan, and Ke Yang. On the (im)possibility of obfuscating programs. In Annual International Cryptology Conference, pages 1-18. Springer, 2001. URL: https://doi.org/10.1007/3-540-44647-8_1.
  2. Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM (JACM), 36(4):929-965, 1989. URL: https://doi.org/10.1145/76359.76371.
  3. Dan Boneh and Brent Waters. Constrained pseudorandom functions and their applications. In Advances in Cryptology-ASIACRYPT 2013: 19th International Conference on the Theory and Application of Cryptology and Information Security, Bengaluru, India, December 1-5, 2013, Proceedings, Part II 19, pages 280-300. Springer, 2013. URL: https://doi.org/10.1007/978-3-642-42045-0_15.
  4. Elette Boyle, Shafi Goldwasser, and Ioana Ivan. Functional signatures and pseudorandom functions. In International workshop on public key cryptography, pages 501-519. Springer, 2014. URL: https://doi.org/10.1007/978-3-642-54631-0_29.
  5. Sébastien Bubeck, Yin Tat Lee, Eric Price, and Ilya Razenshteyn. Adversarial examples from computational constraints. In International Conference on Machine Learning, pages 831-840. PMLR, 2019. URL: http://proceedings.mlr.press/v97/bubeck19a.html.
  6. Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pages 1310-1320. PMLR, 2019. URL: http://proceedings.mlr.press/v97/cohen19c.html.
  7. Jacob Dumford and Walter Scheirer. Backdooring convolutional neural networks via targeted weight perturbations. In 2020 IEEE International Joint Conference on Biometrics (IJCB), pages 1-9. IEEE, 2020. URL: https://doi.org/10.1109/IJCB48548.2020.9304875.
  8. Sanjam Garg, Somesh Jha, Saeed Mahloujifar, and Mohammad Mahmoody. Adversarially robust learning could leverage computational hardness. In Algorithmic Learning Theory, pages 364-385. PMLR, 2020. URL: http://proceedings.mlr.press/v117/garg20a.html.
  9. Oded Goldreich, Shafi Goldwasser, and Silvio Micali. How to construct random functions. Journal of the ACM (JACM), 33(4):792-807, 1986. URL: https://doi.org/10.1145/6490.6503.
  10. Oded Goldreich and Leonid A Levin. A hard-core predicate for all one-way functions. In Proceedings of the Twenty-First Annual ACM Symposium on Theory of Computing, pages 25-32, 1989. URL: https://doi.org/10.1145/73007.73010.
  11. Shafi Goldwasser, Michael P Kim, Vinod Vaikuntanathan, and Or Zamir. Planting undetectable backdoors in machine learning models. In 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), pages 931-942. IEEE, 2022.
  12. Shafi Goldwasser, Jonathan Shafer, Neekon Vafa, and Vinod Vaikuntanathan. Oblivious defense in ML models: Backdoor removal without detection. arXiv preprint, 2024. URL: https://arxiv.org/abs/2411.03279.
  13. Steve Hanneke, Amin Karbasi, Mohammad Mahmoody, Idan Mehalel, and Shay Moran. On optimal learning under targeted data poisoning. Advances in Neural Information Processing Systems, 35:30770-30782, 2022.
  14. David Haussler, Nick Littlestone, and Manfred K Warmuth. Predicting 0, 1-functions on randomly drawn points. Information and Computation, 115(2):248-292, 1994.
  15. Sanghyun Hong, Nicholas Carlini, and Alexey Kurakin. Handcrafted backdoors in deep neural networks. Advances in Neural Information Processing Systems, 35:8068-8080, 2022.
  16. Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Risks from learned optimization in advanced machine learning systems. arXiv preprint, 2019. URL: https://arxiv.org/abs/1906.01820.
  17. Aayush Jain, Huijia Lin, and Amit Sahai. Indistinguishability obfuscation from well-founded assumptions. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 60-73, 2021. URL: https://doi.org/10.1145/3406325.3451093.
  18. Jinyuan Jia, Xiaoyu Cao, and Neil Zhenqiang Gong. Intrinsic certified robustness of bagging against data poisoning attacks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35(9), pages 7961-7969, 2021. URL: https://doi.org/10.1609/AAAI.V35I9.16971.
  19. Adam Tauman Kalai and Shang-Hua Teng. Decision trees are PAC-learnable from most product distributions: a smoothed analysis. arXiv preprint, 2008. URL: https://arxiv.org/abs/0812.0933.
  20. Jonathan Katz and Yehuda Lindell. Introduction to modern cryptography: principles and protocols. Chapman and Hall/CRC, 2007.
  21. Michael Kearns and Leslie Valiant. Cryptographic limitations on learning Boolean formulae and finite automata. Journal of the ACM (JACM), 41(1):67-95, 1994. URL: https://doi.org/10.1145/174644.174647.
  22. Michael J Kearns and Umesh Vazirani. An introduction to computational learning theory. MIT Press, 1994.
  23. Alaa Khaddaj, Guillaume Leclerc, Aleksandar Makelov, Kristian Georgiev, Hadi Salman, Andrew Ilyas, and Aleksander Madry. Rethinking backdoor attacks. In International Conference on Machine Learning, pages 16216-16236. PMLR, 2023. URL: https://proceedings.mlr.press/v202/khaddaj23a.html.
  24. Aggelos Kiayias, Stavros Papadopoulos, Nikos Triandopoulos, and Thomas Zacharias. Delegatable pseudorandom functions and applications. In Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security, pages 669-684, 2013. URL: https://doi.org/10.1145/2508859.2516668.
  25. Alexander Levine and Soheil Feizi. Deep partition aggregation: Provable defense against general poisoning attacks. arXiv preprint, 2020. URL: https://arxiv.org/abs/2006.14768.
  26. Yiming Li, Yong Jiang, Zhifeng Li, and Shu-Tao Xia. Backdoor learning: A survey. IEEE Transactions on Neural Networks and Learning Systems, 2022.
  27. Ryan O'Donnell. Analysis of Boolean functions. arXiv preprint, 2021. URL: https://arxiv.org/abs/2105.10386.
  28. Chris Olah. Mechanistic interpretability, variables, and the importance of interpretable bases. Transformer Circuits Thread, 2022. URL: https://www.transformer-circuits.pub/2022/mech-interp-essay.
  29. Amit Sahai and Brent Waters. How to use indistinguishability obfuscation: deniable encryption, and more. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, pages 475-484, 2014. URL: https://doi.org/10.1145/2591796.2591825.
  30. Leslie G Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134-1142, 1984. URL: https://doi.org/10.1145/1968.1972.