Backdoor Defense, Learnability and Obfuscation
Abstract
We introduce a formal notion of defendability against backdoors using a game between an attacker and a defender. In this game, the attacker modifies a function to behave differently on a particular input known as the “trigger”, while behaving the same almost everywhere else. The defender then attempts to detect the trigger at evaluation time. If the defender succeeds with high enough probability, then the function class is said to be defendable. The key constraint on the attacker that makes defense possible is that the attacker’s strategy must work for a randomly-chosen trigger.
Our definition is simple and does not explicitly mention learning, yet we demonstrate that it is closely connected to learnability. In the computationally unbounded setting, we use a voting algorithm of Hanneke et al. [13] to show that defendability is essentially determined by the VC dimension of the function class, in much the same way as PAC learnability. In the computationally bounded setting, we use a similar argument to show that efficient PAC learnability implies efficient defendability, but not conversely. On the other hand, we use indistinguishability obfuscation to show that the class of polynomial size circuits is not efficiently defendable. Finally, we present polynomial size decision trees as a natural example for which defense is strictly easier than learning. Thus, we identify efficient defendability as a notable intermediate concept in between efficient learnability and obfuscation.
Keywords and phrases: backdoors, machine learning, PAC learning, indistinguishability obfuscation
2012 ACM Subject Classification: Theory of computation → Machine learning theory; Computing methodologies → Machine learning; Security and privacy → Mathematical foundations of cryptography; Theory of computation → Cryptographic primitives
Acknowledgements: We are grateful to Dmitry Vaintrob for an earlier version of the results in Appendix A; to Thomas Read for finding the "Backdoor with likely input patterns" example and for help with proofs; to Andrea Lincoln, Dávid Matolcsi, Eric Neyman, George Robinson and Jack Smith for contributions to the project in its early stages; and to Geoffrey Irving, Robert Lasenby and Eric Neyman for helpful comments on drafts.
Editors: Raghu Meka
1 Introduction
A backdoor in a machine learning model is a modification to the model that causes it to behave differently on certain inputs that activate a secret “trigger”. There is a wide literature on backdoor attacks and defenses from both a theoretical and an empirical perspective [26]. However, prior theoretical work typically makes reference to a particular training dataset, leading to a focus on data poisoning attacks. In this work we introduce a formal notion of a backdoor that allows the attacker to modify the model arbitrarily. Our definition is simple, but nevertheless gives rise to a rich array of strategies incorporating both learning and obfuscation.
Our formal notion is focused on backdoor detection at runtime, meaning that the defender is given a particular input and must flag whether or not it activates the backdoor trigger. We focus on this case because of the existence of undetectable backdoors if the defender is instead required to flag a model as backdoored without being given any particular input [11]. Moreover, detection at runtime is sufficient in threat models where the attacker is given only one opportunity to modify the model, being akin to giving the defender an additional chance to modify the model after the attacker. We also focus on the white-box setting, in which the defender has access to a complete description of the model.
- Definition of ε-defendability (Section 3). We introduce a simple game between an attacker and a defender to define what it means for a representation class to be ε-defendable with a certain confidence. The only constraint placed on the attacker is that their backdoor strategy must work for a randomly-chosen trigger. Without this assumption, there is a symmetry between the original and backdoored functions that makes defense impossible.
- Statistical defendability (Section 4). In the absence of computational constraints on the defender, we show that a representation class F is ε-defendable with confidence tending to 1 if and only if ε · VC(F) → 0, where VC(F) is the Vapnik–Chervonenkis dimension of F. We achieve this by adapting a result of Hanneke et al. [13] that applies a learning algorithm multiple times and takes a majority vote.
- Computational defendability (Section 5). We introduce a notion of efficient defendability in which the defender’s detection strategy must run in polynomial time. We show that efficient PAC learnability implies efficient defendability, but not conversely. We also show that under certain cryptographic assumptions, the class of polynomial size circuits is not efficiently defendable, by combining a puncturable pseudorandom function with an efficient indistinguishability obfuscator.
- Defendability of decision trees (Section 6). In the setting where the distribution over inputs is uniform, we give a defense for polynomial size decision trees that runs in the time taken to evaluate a single decision tree. This provides a natural example for which defense is faster than learning.
We conclude in Section 7 with a discussion of whether efficient defendability can be further separated from efficient PAC learnability, so-called “mechanistic” defenses that resemble our defense for polynomial size decision trees, and the implications of our results for AI alignment.
2 Related work
The most closely related work to our own is that of Goldwasser, Kim, Vaikuntanathan, and Zamir [11], whose main results use a digital signature scheme to insert an undetectable backdoor into a model. The backdoor trigger can be applied to any input by perturbing it appropriately. In the black-box setting, the backdoor is undetectable in the sense that it is computationally infeasible for the defender to find any input on which the backdoored model and the original model differ. In the white-box setting, the backdoor is undetectable in the sense that it is computationally infeasible for the defender to distinguish the backdoored model from the original model. However, both of these results are in the setting where the backdoor trigger must be extracted from the model rather than detected at runtime.
In the defense at runtime setting, the same authors also show that a backdoored model can be modified by the defender to produce a model that is not backdoored. Once again, the backdoor trigger is applied by perturbing the input. The result only applies to models that satisfy a certain smoothness assumption, and mirrors the use of randomized smoothing to achieve certified adversarial robustness by Cohen et al. [6]. By contrast, we avoid making smoothness assumptions, and further develop the theory of runtime backdoor detection.
Concurrent work by Goldwasser, Shafer, Vafa, and Vaikuntanathan [12] studies a black-box, non-realizable version of backdoor defense at runtime. Rather than predicting the exact label of the non-backdoored model on the given input (which is equivalent to our notion of runtime detection), the defender must instead produce a “canonical” label that has high accuracy on average, by using black-box access to the potentially backdoored model. In this local mitigation setting, they construct efficient defenders for almost-linear and almost-polynomial functions. They also consider the global mitigation setting, in which the defender must produce an entire clean model, constructing efficient global mitigators for Fourier-heavy functions.
Adversarial examples can be thought of as backdoor triggers, but the attacker must find similar inputs rather than a model with similar outputs. The gap between statistical and computational hardness has been explored in this context by Bubeck et al. [5] and Garg et al. [8]. Much like us, they find that defense is statistically possible but computationally hard.
Returning to backdoors, another thematically similar work to our own is that of Khaddaj et al. [23], which also questions the assumptions behind different backdoor attacks and defenses. However, it remains within the data-poisoning setting, in which the backdoor is inserted using corrupted training data.
A line of work beginning with Dumford and Scheirer [7] goes beyond data-poisoning, inserting a backdoor into an already-trained model, but still using data to search for the backdoored model. More manual methods of backdoor insertion were later introduced by Hong et al. [15]. This line of work is primarily empirical rather than theoretical.
Our result on statistical defendability is heavily based on the result of Hanneke et al. [13] in the realizable case. Following Levine and Feizi [25] and Jia et al. [18], they use a voting algorithm to produce a backdoor defense in the data-poisoning setting. By combining this with a classical result of Haussler, Littlestone, and Warmuth [14] on optimal learning, they relate the optimal performance of this algorithm to the VC dimension of the function class. Our result adapts this to our more general notion of defendability.
Our result on computational defendability of polynomial size circuits uses a hybrid argument that combines a puncturable pseudorandom function and an efficient indistinguishability obfuscator. This combination has been used in a number of previous hybrid arguments, as described by Sahai and Waters [29]. More generally, the relationship between learnability and cryptographic assumptions has a long history, with some basic constructions in this vein described in the textbook of Kearns and Vazirani [22].
3 Definition of ε-defendability
Our formal notion of defendability against backdoors is based on a game played between two players, an “attacker” and a “defender”. Informally, the attacker starts with a function and is given a randomly sampled backdoor trigger. They must choose a backdoored function that almost always agrees with the original function, but disagrees with the original function on the backdoor trigger. Then the defender is given either the original function and a random input, or the backdoored function and the backdoor trigger, and they must distinguish one from the other with high confidence.
For our formal definition, we fix a security parameter n and consider classifiers f : X → {0, 1}, where X = {0, 1}^n. Formally, a representation class over X is a function mapping bit-string representations to such classifiers, but we generally keep the choice of representation implicit and conflate a representation class F with the set of functions in its image. In our results on computationally constrained defenders, we also implicitly assume that the length of these representations is polynomial in n.
Our definition also makes use of the example oracle EX(f, D), where D is a distribution over X and f : X → {0, 1}. This is an oracle that outputs (x, f(x)) for a randomly sampled x ∼ D.
Definition 3.1.
Let F be a representation class over X and let ε > 0. We say that F is ε-defendable with confidence p if there is a probabilistic algorithm (the “detection strategy”) that wins the following game with probability at least p:
- An adversary with knowledge of the detection strategy (the “attacker”) chooses a distribution D over X and a function f ∈ F.
- We randomly sample x* ∼ D (the “backdoor trigger”), and then the adversary chooses f* ∈ F (the “backdoored function”).
- The backdoored function f* is called ε-valid if f*(x*) ≠ f(x*) and Pr_{x∼D}[f*(x) ≠ f(x)] ≤ ε.
- We randomly choose the input to the detection strategy to be either (f, x) for a fresh random sample x ∼ D, or (f*, x*), with 50% probability each. The detection strategy is given access to the corresponding example oracle, as well as the representation of the function it has been handed (f or f*) and the corresponding input (x or x*), and must output either Acc (for “accept”) or Rej (for “reject”).
- If (f, x) is passed to the detection strategy, then the detection strategy wins if it outputs Acc. If (f*, x*) is passed to the detection strategy, then the detection strategy wins if either it outputs Rej or f* is not ε-valid.
In this definition, the adversary is not specified by a collection of algorithms, but is instead shorthand for a universal quantifier. Thus we could have equivalently (but perhaps convolutedly) said, “there exists a detection strategy such that, for all distributions D over X and all f ∈ F, the expectation over x* ∼ D of the minimum over all choices of f* ∈ F of the win probability of the detection strategy is at least p”.
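To make the structure of the game concrete, here is a minimal Monte Carlo sketch of one round in Python. Everything here is a reading aid rather than part of the formalism: the `attacker` and `detect` callables and the name `eps` for the validity parameter are our own conventions, the validity check is done only approximately by sampling, and we take the example oracle to be labelled by whichever function the defender receives (an assumption consistent with the proof of Theorem 4.1).

```python
import random

def play_round(attacker, detect, eps, n_samples=100_000):
    """One round of the backdoor-defense game (illustrative sketch).

    Returns True if the detection strategy wins this round.
    """
    sample_D, f = attacker.choose()            # attacker picks the distribution D and f
    trigger = sample_D()                       # the trigger x* is sampled from D
    f_star = attacker.backdoor(f, trigger)     # attacker picks f* after seeing x*

    def make_oracle(g):
        """Example oracle labelled by the function handed to the defender."""
        def oracle():
            x = sample_D()
            return x, g(x)
        return oracle

    if random.random() < 0.5:
        # Defender is shown the original function and a fresh random input.
        x = sample_D()
        return detect(f, x, make_oracle(f), eps) == "Acc"
    else:
        # Defender is shown the backdoored function and the trigger.
        # Estimate eps-validity by sampling: f* must flip the value at the
        # trigger while agreeing with f outside an eps-fraction of D.
        xs = [sample_D() for _ in range(n_samples)]
        disagreement = sum(f(x) != f_star(x) for x in xs) / n_samples
        valid = f_star(trigger) != f(trigger) and disagreement <= eps
        if not valid:
            return True                        # invalid backdoor: defender wins by default
        return detect(f_star, trigger, make_oracle(f_star), eps) == "Rej"
```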
The manner in which f, x* and f* are chosen in this definition deserves some elaboration. If the attacker were allowed to choose the backdoor trigger themselves, then it would be too easy for the adversary to insert an undetectable backdoor. So we instead require the adversary to come up with a backdoor insertion strategy that works for a randomly-chosen trigger. In fact, we do not even allow the adversary to see the backdoor trigger when they are selecting the original function f, or else they could produce f* by removing an existing backdoor rather than inserting a new one. This is illustrated by the following example.
Example (Existing backdoor removal).
Let and let , where denotes the indicator function
Take to be uniform over . If we allowed the adversary to choose both and after sampling , then they could take and for (or in the event that , instead take for some sampled uniformly at random). Then would be -valid, and and would be identically distributed for . So if then any detection strategy would win with probability at most . This would be quite an extreme failure for the defender, especially considering that the VC dimension of is only .
Thus the way in which f, x* and f* are chosen, by randomly sampling the backdoor trigger after the original function is chosen but before the backdoored function is chosen, is the key symmetry-breaking assumption that makes the game interesting.
Remark.
We could generalize this definition to include functions where is any subset of (or even more generally, any metric space). In this setting, the most natural generalization of -validity is
4 Statistical defendability
In this section we will completely determine, up to a constant factor, the values of ε and p for which an arbitrary representation class F is ε-defendable with confidence p, in the absence of any computational constraints on the detection strategy.
Our result is expressed in terms of the Vapnik–Chervonenkis (VC) dimension of the representation class F, denoted VC(F). This is defined as the maximum size of a set of points S ⊆ X shattered by F, meaning that all functions g : S → {0, 1} are realized as the restriction of some f ∈ F.
Theorem 4.1.
Let F be a representation class over X = {0, 1}^n and let ε = ε(n) > 0. Then the most (i.e., supremum) confidence with which F is ε-defendable is

1 − Θ(min(ε · VC(F), 1))

as n → ∞.
In particular, F is ε-defendable with confidence tending to 1 as n → ∞ if and only if ε · VC(F) → 0 as n → ∞.
Thus in the computationally unbounded setting, defendability is determined almost entirely by the VC dimension of the representation class, in much the same way as PAC learnability [2, 22, Chapter 3].
In fact, to prove Theorem 4.1, we will make use of a certain kind of learning algorithm called a prediction strategy. Intuitively, a prediction strategy uses the value of a function on random samples to predict the value of the function on a new random sample. The formal definition is as follows.
Definition.
Let F be a representation class over X. A prediction strategy for F is a randomized algorithm that is given access to the example oracle EX(f, D) for some distribution D over X and some f ∈ F, as well as a point x ∈ X as input, and outputs a prediction for f(x).
The sample size of a prediction strategy is the maximum number of times it calls the example oracle, maximizing over any randomness used by the algorithm, any calls to the oracle, and the choice of D, f and x.
Given a choice of D and f, the error rate of a prediction strategy is the probability that it fails to correctly predict f(x) for x ∼ D, randomizing over any randomness used by the algorithm, any calls to the oracle, and the choice of x.
In their classical paper on prediction strategies, Haussler, Littlestone, and Warmuth [14] exhibit a “1-inclusion graph” algorithm yielding the following result.
Theorem (Haussler–Littlestone–Warmuth).
For any representation class F and any positive integer m, there is a prediction strategy for F with sample size m such that for any distribution D over X and any f ∈ F, the error rate is at most VC(F)/(m + 1).
As discussed in that work, there is a close relationship between prediction strategies and PAC learning algorithms, which we define in Section 5.2. We focus on prediction strategies in this section because they make our proof easier to carry out while avoiding the introduction of unnecessary logarithmic factors.
Our proof of Theorem 4.1 is closely modeled on a proof of Hanneke et al. [13, Theorem 3.1], although we cannot quote that result directly since it is specialized to a data poisoning setup. As in that proof, our detection strategy works by applying the Haussler–Littlestone–Warmuth prediction strategy multiple times and taking a majority vote.
The basic idea for the detection strategy is that, given a function and a point as input (which could be either (f, x) or (f*, x*)), if we could successfully predict the value of the original function f at the given point, then we could distinguish the two cases by comparing this prediction to the value of the given function at that point: the two agree in the first case, but disagree in the second case whenever f* is ε-valid. To attempt to make this prediction, we use the Haussler–Littlestone–Warmuth prediction strategy. However, if we have been given the backdoored function, this can fail if we encounter a sample on which f and f* disagree. Hence we are forced to use a small sample size, but we are still left with a small constant probability of failure. Fortunately though, we can repeat this procedure multiple times with independent samples and take a majority vote, which allows us to make this small constant probability vanish.
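The following sketch shows the shape of this voting strategy. The prediction strategy is treated as a black box `predict`, and the parameters `m` and `num_votes` are left free; the proof chooses them appropriately in terms of ε and VC(F).

```python
def voting_detect(g, x, oracle, predict, m, num_votes):
    """Majority-vote detection strategy (sketch).

    g       : the function handed to the defender (either f or f*)
    x       : the accompanying input (either a fresh sample or the trigger)
    oracle  : returns a fresh labelled example from the example oracle
    predict : predict(samples, x) guesses the underlying function's value at x
              from m labelled samples (e.g. the Haussler-Littlestone-Warmuth
              prediction strategy, or a PAC learner as in Corollary 5.2)
    """
    votes_against = 0
    for _ in range(num_votes):
        samples = [oracle() for _ in range(m)]   # fresh, independent samples
        guess = predict(samples, x)              # predicted value of the original function at x
        if guess != g(x):                        # disagreement is evidence of a backdoor
            votes_against += 1
    # Reject if more than half of the independent predictions disagree with g(x).
    return "Rej" if votes_against > num_votes / 2 else "Acc"
```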
Sketch proof of Theorem 4.1.
Write d = VC(F), and note that the detection strategy can achieve a win probability of at least 1/2 simply by guessing randomly.
To show that the most confidence with which is -defendable is at most , let be a set of size shattered by . Roughly speaking, the adversary can take to be uniform over , and take to be uniform over the witnesses to the shattering. Then the adversary can insert a backdoor by simply changing the value of at and no other points of . The actual construction is slightly more careful than this, and is given in the full proof.
To show that the most confidence with which is -defendable is at least , our detection strategy is as follows. Given as input, we apply the Haussler–Littlestone–Warmuth prediction strategy with sample size and error rate at most to make a prediction for , which we in turn think of as a prediction for . We repeat this procedure times with independent samples and take a majority vote: if more than half of the predictions are different from , then we output Rej, and otherwise we output Acc.
If is passed to the detection strategy, then the probability that more than of the votes are wrong is at most , by Markov’s inequality. If is passed to the detection strategy, then there are two ways in which each vote can be wrong: either and disagree on at least one of the samples, or the prediction strategy fails to predict even if it is provided with the value of on all of these samples. The probability that or more of the votes will be wrong due to the second failure mode is at most , by Markov’s inequality again. Meanwhile, we can ensure that the probability of the first failure mode for each vote is at most by taking . Since each choice of samples was independent, the probability that or more of the votes will be wrong due to the first failure mode is at most , by Hoeffding’s inequality, which can be made arbitrarily small by taking to be sufficiently large. Hence the probability that or more of the votes will be wrong is at most , and so the overall win probability is , as required.
Intuitively, the reason that this detection strategy works is that if ε is small compared to 1/VC(F), then the attacker must choose a backdoored function that is very “strange”. Hence the detection strategy can sample a new function that is similar to the given, possibly backdoored, function but more “normal”, and see if the two functions agree on the given possible backdoor trigger. This process can be thought of as a kind of regularization.
While our proof used distillation (i.e., learning from function samples) plus ensembling (i.e., voting), other methods of regularization may also work. For example, we could also have sampled from a Boltzmann distribution centered on the given function (with exponentially decaying probability based on Hamming distance), to get the same result up to logarithmic factors. We prove this claim in Appendix A.
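For intuition, here is a toy sketch of this alternative for a small, explicitly enumerated class; the inverse temperature `beta` and the brute-force enumeration are illustrative only, and the actual argument in Appendix A is carried out analytically.

```python
import math
import random

def boltzmann_resample(candidates, g, inputs, beta, rng=random):
    """Resample a function from a Boltzmann posterior centred on g (toy sketch).

    candidates : explicit list of the functions in the class (toy setting only)
    g          : the possibly backdoored function handed to the defender
    inputs     : points on which Hamming distance to g is measured
    beta       : inverse temperature; larger beta stays closer to g
    """
    def hamming(h):
        return sum(h(x) != g(x) for x in inputs)

    weights = [math.exp(-beta * hamming(h)) for h in candidates]
    r = rng.random() * sum(weights)
    for h, w in zip(candidates, weights):
        r -= w
        if r <= 0:
            return h
    return candidates[-1]

# The defender would then compare the resampled function to g on the given
# input, rejecting on disagreement, just as in the voting strategy above.
```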
5 Computational defendability
5.1 Definition of efficient defendability
In Section 4, we related the defendability of a representation class to its VC dimension. However, the prediction strategy we used to construct the detection strategy runs in exponential time, since it involves an expensive search over orientations of a certain graph. Hence it is interesting to ask what a polynomial-time detection strategy can achieve. To explore this, we introduce the following notion.
Definition 5.1.
Let be a representation class over . We say that is efficiently defendable if there is some polynomial such that for any and any with , is -defendable with confidence using a detection strategy that runs in time polynomial in and .
Note that the condition that is crucial: if were required to be -defendable with confidence for any , then this would not be possible even for a detection strategy with unlimited time (except in trivial cases), by Theorem 4.1. (Recall that constrains the attacker, which is why this requirement helps the defender.)
As a reminder, we also assume in this section that representation classes have polynomial-length representations.
5.2 Efficient defendability and efficient PAC learnability
In this sub-section we show that efficient PAC learnability implies efficient defendability, but not conversely.
The well-studied model of probably approximately correct (PAC) learning was introduced by Valiant [30]. The notion of efficient PAC learnability bears some resemblance to our notion of efficient defendability. The following definition is taken from Kearns and Vazirani [22, Definition 4].
Definition.
Let F be a representation class over X. We say that F is efficiently PAC learnable if there is a probabilistic algorithm (the “PAC learning algorithm”) with the following properties. The PAC learning algorithm is given access to the example oracle EX(f, D) for some distribution D over X and some f ∈ F, as well as ε (the “error parameter”) and δ (the “confidence parameter”) as input. The PAC learning algorithm must run in time polynomial in n, 1/ε and 1/δ, and must output some polynomially evaluatable hypothesis h, i.e. a representation for a function h : X → {0, 1} that is evaluatable in time polynomial in n. The hypothesis must satisfy Pr_{x∼D}[h(x) ≠ f(x)] ≤ ε with probability at least 1 − δ, for any distribution D over X and any f ∈ F.
Note that ε and δ play subtly different roles in this definition to the corresponding parameters in the definition of efficient defendability. In particular, there is no condition on one parameter as a function of the other and of n in the definition of efficient PAC learnability.
We now show that efficient PAC learnability implies efficient defendability. This follows easily from the proof of Theorem 4.1. Even though that result used a prediction strategy that runs in exponential time, it is straightforward to replace it by an efficient PAC learning algorithm.
Corollary 5.2.
Let be a representation class over . If is efficiently PAC learnable, then is efficiently defendable.
Proof.
If is efficiently PAC learnable, then for any sufficiently large positive integer , there is a polynomial-time prediction strategy for with sample size and error rate at most : we simply PAC learn a hypothesis with confidence parameter and error parameter , and then evaluate this hypothesis on the given point. The definition of PAC learning guarantees that this succeeds as long as exceeds a particular polynomial in and .
Our detection strategy is then exactly the same as the one from the proof of Theorem 4.1, but using the above prediction strategy instead of the Haussler–Littlestone–Warmuth prediction strategy. Since for some polynomial of our choice, choosing as in that proof suffices to ensure that is eventually large enough for the prediction strategy to succeed. Finally, we take the number of votes to be the smallest integer greater than , which keeps the time complexity of the detection strategy polynomial in and . Then the detection strategy’s overall win probability is at least , as required.
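In code, the substitution made in this proof is a one-line wrapper: a PAC learning algorithm yields a prediction strategy with the interface used by the voting sketch above. Here `pac_learn` is a stand-in for whatever efficient PAC learning algorithm the class admits, with its error and confidence parameters already fixed.

```python
def prediction_strategy_from_pac(pac_learn):
    """Turn a PAC learning algorithm into a prediction strategy (sketch).

    pac_learn(samples) -> h : trains a hypothesis on a list of labelled samples.
    Returns a function with the predict(samples, x) interface used above.
    """
    def predict(samples, x):
        h = pac_learn(samples)   # learn a hypothesis from the m labelled samples
        return h(x)              # predict the value at x by evaluating the hypothesis
    return predict
```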
We now show that the converse implication does not hold in the random oracle model of computation. In this model, programs have access to an oracle that computes a uniformly random function.
Theorem 5.3.
In the random oracle model of computation, with high probability over the choice of random oracle, there is a polynomially evaluatable representation class over X that is efficiently defendable but not efficiently PAC learnable.
To prove this, we simply take a representation class of functions that make unique calls to the random oracle (supplying a different input to the oracle whenever either the function is different or the input to the function is different), and show that, with high probability over the choice of random oracle, this representation class is efficiently defendable but not efficiently PAC learnable. Roughly speaking, it is efficiently defendable because, for most sets of random functions with possible inputs, every pair of functions differs on around half of inputs, meaning that there are no -valid functions for the adversary to choose from. Meanwhile, it is not efficiently PAC learnable because most random functions with possible inputs differ from every possible function that a polynomial-time algorithm could output on around half of inputs. For a full proof, see Appendix C.
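For illustration, here is a small sketch of this construction with a cryptographic hash standing in for the random oracle (a substitution on our part; the proof itself works with a true random oracle). Any two functions with distinct keys disagree on roughly half of all inputs, which is what leaves the adversary with no ε-valid backdoored function to choose.

```python
import hashlib

def oracle_bit(query: bytes) -> int:
    """Stand-in for the random oracle: one (pseudo)random bit per query."""
    return hashlib.sha256(query).digest()[0] & 1

def make_function(key: bytes):
    """f_key(x) = O(key || x), so distinct keys give (nearly) independent functions."""
    return lambda x: oracle_bit(key + b"|" + x)

# Two distinct keys disagree on about half of all inputs:
f1, f2 = make_function(b"key-1"), make_function(b"key-2")
inputs = [format(i, "016b").encode() for i in range(4096)]
disagreement = sum(f1(x) != f2(x) for x in inputs) / len(inputs)
print(f"empirical disagreement rate: {disagreement:.3f}")   # close to 0.5
```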
Although this result is in the random oracle model of computation, the same proof shows that the converse does not hold in the usual model of computation either. This is because we can make the random choices “on the outside”: we fix a randomly-chosen lookup table, and have the functions use that instead of calling the random oracle. However, this representation class is not necessarily polynomially evaluatable, since the lookup table would take up exponential space.
Nevertheless, we conjecture that there is a polynomially evaluatable counterexample that uses a pseudorandom function instead of a random oracle. The above proof cannot be immediately adapted to work for an arbitrary pseudorandom function, because two distinct keys could give rise to very similar functions, but it might not be too challenging to adapt it.
Conjecture 5.4.
Assuming OWF, there is a polynomially evaluatable representation class over X that is efficiently defendable but not efficiently PAC learnable.
Here, OWF denotes the existence of a one-way function, which can be used to construct a pseudorandom function [9, 20, Chapter 6].
One way in which our counterexample is unsatisfying is that it relies on the adversary being unable to find a backdoored function that is ε-valid, making detection trivial. However, it is straightforward to extend this result to a class for which it is always possible for the adversary to find an ε-valid backdoored function, but that is nevertheless efficiently defendable.
Example (Special-cased random functions).
Let F be one of the above representation classes over X that serves as a counterexample (either functions that call a random oracle, or a random choice of functions). Now let F′ be the class obtained from F by allowing the value of a function at a single “special-cased” point to be overridden: each function in F′ agrees with some f ∈ F everywhere except possibly at one point x₀ ∈ X, where it takes a prescribed value b ∈ {0, 1}. Implicitly, the representation of a function in F′ is given by a triple consisting of the representation of the function f, the special-cased point x₀, and the special-cased value b. Then the adversary can ensure that there is always an ε-valid backdoored function by taking D to be the uniform distribution over X and special-casing the backdoor trigger. However, F′ is still efficiently defendable with high probability, since the detection strategy can check whether or not the input it has been given is the special-cased point. F′ also remains not efficiently PAC learnable with high probability.
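Under the triple representation assumed above, the detection strategy amounts to a single comparison (sketch):

```python
def detect_special_case(representation, x):
    """Detection strategy for the special-cased class (sketch).

    representation = (f, x0, b): base function, special-cased point and value.
    Reject exactly when the queried input is the special-cased point and the
    special case actually changes the function's value there.
    """
    f, x0, b = representation
    return "Rej" if x == x0 and f(x0) != b else "Acc"
```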
Thus we have an example of a non-trivial detection strategy that is faster than any PAC learning algorithm for that representation class. This shows that the learning-based detection strategy of Corollary 5.2 is not always best possible under computational constraints. Nevertheless, our example is somewhat contrived, and so in Section 6 we give an example of a more natural representation class for which defense is faster than learning.
5.3 Efficient defendability and obfuscation
We conclude our analysis of computational defendability with the following result, which essentially says that representation classes that are rich enough to support obfuscation are not efficiently defendable.
Theorem 5.5.
Assuming OWF and iO, the representation class of polynomial size Boolean circuits over X = {0, 1}^n is not efficiently defendable.
Here, OWF again denotes the existence of a one-way function, and iO denotes the existence of an efficient indistinguishability obfuscator. Roughly speaking, an efficient indistinguishability obfuscator is a probabilistic polynomial-time algorithm that takes in a circuit and outputs an “obfuscated” circuit with the same behavior. The circuit being “obfuscated” means that the obfuscations of two different circuits with the same behavior are not distinguishable by any probabilistic polynomial-time adversary. For a precise definition of an efficient indistinguishability obfuscator, we refer the reader to Sahai and Waters [29, Section 3]. The study of indistinguishability obfuscation was initiated by Barak et al. [1], and an efficient indistinguishability obfuscator was constructed from well-studied computational hardness assumptions by Jain, Lin and Sahai [17].
We use OWF in our proof of Theorem 5.5 to obtain a puncturable pseudorandom function. Roughly speaking, a puncturable pseudorandom function is a pseudorandom function such that any key (i.e., seed) can be “punctured” at a set of points. The key being “punctured” means that, when run using the punctured key, the pseudorandom function behaves the same on unpunctured points, but to a probabilistic polynomial-time adversary with knowledge of the punctured key only, the function looks pseudorandom on the punctured points when run using the original key. For a precise definition of a puncturable pseudorandom function, we again refer the reader to Sahai and Waters [29, Section 3]. The observation that puncturable pseudorandom functions can be constructed from one-way functions was made by Boneh and Waters [3], Boyle et al. [4] and Kiayias et al. [24].
Our proof of Theorem 5.5 is an example of a hybrid argument: we show that two distributions are computationally indistinguishable via a sequence of intermediate distributions. For a precise definition of computational indistinguishability and further discussion of hybrid arguments, we refer the reader to Katz and Lindell [20, Chapter 6.8]. The specific combination of a puncturable pseudorandom function and an efficient indistinguishability obfuscator has been used in a number of previous hybrid arguments, as described by Sahai and Waters [29].
With these preliminaries in place, we are now ready to prove Theorem 5.5.
Proof of Theorem 5.5.
Take . We will show that, for any polynomial , the representation class of polynomial size Boolean circuits is not -defendable with confidence using a detection strategy that runs in time polynomial in .
To see this, by OWF, let F(K, ·) be the circuit for a puncturable pseudorandom function with key K and a single output bit, and by iO, let iO(·) be an efficient indistinguishability obfuscator. The adversary proceeds by taking D to be uniform over X, and takes f = iO(F(K, ·)) for a key K chosen uniformly at random. As before, we may treat f as if it were chosen randomly, since the detection strategy’s worst-case performance can be no better than its average-case performance. Finally, given x*, the adversary takes
f* = iO(C), where C is the circuit that outputs the hard-coded bit ⟨1 − F(K, x*)⟩ on the hard-coded input ⟨x*⟩ and outputs F(K{x*}, x) on any other input x. Here K{x*} punctures the key K at the point x*, and angle brackets indicate values that are hard-coded into the circuit, rather than being computed by the circuit. Note that f* is ε-valid since it agrees with f on every point except x*.
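In pseudocode, the adversary’s construction reads as follows. This is a sketch of our reading of the proof: `prf_eval`, `puncture` and `obfuscate` are assumed primitives (a puncturable pseudorandom function and an efficient indistinguishability obfuscator supplied by the caller), not concrete library calls.

```python
def build_backdoored_function(prf_eval, puncture, obfuscate, key, trigger):
    """Adversary's construction in the proof of Theorem 5.5 (sketch).

    prf_eval(k, x) -> bit : puncturable PRF evaluation (assumed to also accept
                            punctured keys on unpunctured points)
    puncture(k, x) -> k_x : the key k punctured at the point x
    obfuscate(circuit)    : efficient indistinguishability obfuscator
    """
    flipped_bit = 1 - prf_eval(key, trigger)    # hard-coded wrong answer at the trigger
    punctured_key = puncture(key, trigger)      # reveals nothing about the value at the trigger

    def circuit(x):
        # Hard-coded behaviour on the trigger; punctured key everywhere else.
        if x == trigger:
            return flipped_bit
        return prf_eval(punctured_key, x)

    # The original function is obfuscate(lambda x: prf_eval(key, x));
    # the backdoored function is the obfuscation of the circuit above.
    return obfuscate(circuit)
```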
It remains to show that no polynomial-time detection strategy can win against this adversary with probability significantly greater than 1/2, which is equivalent to saying that (f, x) and (f*, x*) are computationally indistinguishable for x ∼ D. To see this, consider the intermediate circuit C′ that outputs the hard-coded bit ⟨F(K, x*)⟩ on the hard-coded input ⟨x*⟩ and outputs F(K{x*}, x) on any other input x, which agrees with F(K, ·) everywhere. We claim that

(f, x) ≈ (iO(F(K, ·)), x*) ≈ (iO(C′), x*) ≈ (f*, x*),

where ≈ denotes computational indistinguishability of distributions. The first equivalence is simply an identity of distributions, since x* is itself a random sample from D. The second equivalence follows by definition of indistinguishability obfuscation, with the technical detail that x* is used as the “program state” that is passed from the circuit generator to the distinguisher. The third equivalence follows from puncturability: if these two distributions were computationally distinguishable, then we could use this to computationally distinguish F(K, x*) from a uniformly random bit using only K{x*} and x*. This final construction relies on the fact that the obfuscator runs in polynomial time. By transitivity of computational indistinguishability, we have that (f, x) and (f*, x*) are computationally indistinguishable, as required.
Thus even though all efficiently PAC learnable representation classes are efficiently defendable, some representation classes are not efficiently defendable due to the possibility of obfuscation.
6 Defendability of decision trees
In Section 5.2 we gave an example of a representation class for which defense is faster than learning, in the sense that it is efficiently defendable using a detection strategy that is faster than any PAC learning algorithm. However, our example was somewhat contrived, and was not polynomially evaluatable without access to a random oracle. In this section, we give an example of a more natural representation class for which defense is faster than learning: the class of polynomial size decision trees over X = {0, 1}^n.
Definition.
A decision tree over X = {0, 1}^n is a rooted binary tree where each non-leaf node is labeled with one of the n input variables and each leaf node is labeled with a 0 or a 1. To evaluate a decision tree on an input, we start at the root node and go left if the variable labeling the current node evaluates to 0 and right if it evaluates to 1, continuing until we reach a leaf, whose label is the output. The size of a decision tree is its number of leaves.
An example of a decision tree and how to evaluate it is given in Figure 3.
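For concreteness, here is a small sketch of a decision tree representation and its evaluation, which also records the number of distinct variables queried along the root-to-leaf path (the quantity used by the detection strategy below; the representation itself is ours, not the paper’s).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    """A decision tree node: either a leaf with a label, or an internal
    node that queries input variable `var` (0-indexed)."""
    label: Optional[int] = None      # 0 or 1 at a leaf, None at an internal node
    var: Optional[int] = None        # index of the queried variable
    left: Optional["Node"] = None    # followed when x[var] == 0
    right: Optional["Node"] = None   # followed when x[var] == 1

def evaluate(tree: Node, x: list[int]) -> tuple[int, int]:
    """Return (label, number of distinct variables queried on the path)."""
    queried = set()
    node = tree
    while node.label is None:
        queried.add(node.var)
        node = node.left if x[node.var] == 0 else node.right
    return node.label, len(queried)
```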
In our study of decision trees, we focus on the uniform-PAC model of learning, in which the distribution over inputs is always uniform. Thus we say that a representation class is efficiently uniform-PAC learnable to mean that it is efficiently PAC learnable as long as the distribution is uniform. Similarly, we say that a representation class is efficiently uniform-defendable to mean that it is efficiently defendable as long as the distribution is uniform.
As far as we are aware, it is unknown whether the representation class of polynomial size decision trees over is efficiently uniform-PAC learnable, although polynomial-time PAC learning algorithms are known for certain product distributions [19]. However, the class is efficiently uniform PAC-learnable if we allow membership queries (i.e., calls to an oracle that outputs the value of the function on an input of the algorithm’s choice) [10, 27, Chapter 3.5]. By specializing the proof of Corollary 5.2, it follows that the representation class of polynomial size decision trees over is efficiently defendable, since the defender can efficiently compute the results of membership queries for themselves using the representation of the decision tree.
Nevertheless, these learning algorithms for decision trees generally use at least linearly many calls to the example or membership oracle. By contrast, we now show that there is a defense for decision trees that is much faster than this.
Theorem 6.1.
The representation class of decision trees over X = {0, 1}^n of size at most s, where s is polynomial in n, is efficiently uniform-defendable using a detection strategy that makes no calls to the example oracle and runs in the time taken to evaluate a single decision tree, up to a constant factor.
To prove this, we use a detection strategy that simply checks the depth of the leaf reached by the given input, and outputs Rej if this is too large. Roughly speaking, the reason this works is that, in order for the backdoored function to be ε-valid, the depth of the leaf reached by the backdoor trigger must be large for either the original function or the backdoored function. But since our trees have polynomially many leaves, this depth is unlikely to be large for the original function on a random input. Hence if the depth of the given function on the given input is large, then it is more likely that we have been given the backdoored function and the backdoor trigger.
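A sketch of the resulting detection strategy, using the `evaluate` helper above; the exact threshold is what the full proof pins down, and the choice written here is illustrative only.

```python
import math

def detect_deep_leaf(tree, x, num_leaves, eps):
    """Mechanistic detection strategy for decision trees (sketch).

    Rejects when the given input reaches a leaf whose path queries unusually
    many distinct variables; honest inputs rarely do this, while an eps-valid
    backdoor typically forces the trigger onto a deep leaf of the backdoored tree.
    """
    _, depth = evaluate(tree, x)
    threshold = math.log2(num_leaves / eps)   # illustrative, not the proof's choice
    return "Rej" if depth > threshold else "Acc"
```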
For a full proof of Theorem 6.1, see Appendix D. We suspect that a similar approach would lead to fast detection strategies for other natural representation classes of piecewise constant functions, perhaps even in the non-uniform setting.
The detection strategy in this proof is not directly comparable to a learning algorithm, since it does not use the example oracle at all. However, if we treat a call to the example oracle as having similar computational cost to evaluating a single decision tree, then it is substantially faster than any possible learning algorithm for this representation class. Thus there is more to defense than only learning.
7 Discussion
7.1 Separating efficient defendability from efficient PAC learnability
A central theme of this work has been that learning can be used to perform backdoor defense, but backdoor defense cannot necessarily be used to perform learning. The first of these claims is encapsulated by Theorem 4.1 in the computationally unbounded setting and Corollary 5.2 in the computationally bounded setting. For the second of these claims, we have made several steps in this direction:
- We showed in Theorem 5.3 that efficient defendability does not imply efficient PAC learnability in the random oracle model of computation.
- We deduced from this that efficient defendability does not imply efficient PAC learnability in the usual model of computation either, as long as we allow representation classes that are not polynomially evaluatable, and we conjectured that there is a polynomially evaluatable counterexample assuming the existence of a one-way function.
- We showed in Theorem 6.1 that the representation class of polynomial size decision trees is efficiently uniform-defendable using a detection strategy that is faster than any possible learning algorithm.
Nevertheless, we still do not have an example of a natural representation class that is efficiently defendable but not efficiently PAC learnable. We consider finding such an example to be a central challenge for follow-up research. One possible candidate of independent interest is the representation class of shallow (i.e., logarithmic-depth) polynomial size Boolean circuits, or equivalently, polynomial size Boolean formulas.
Question 7.1.
Under reasonable computational hardness assumptions, is the representation class of polynomial size, logarithmic-depth Boolean circuits over X = {0, 1}^n efficiently defendable?
7.2 Mechanistic defenses
Our detection strategy for decision trees in Theorem 6.1 is interesting not only because it is faster than any possible learning algorithm, but also because it works in a fundamentally different way. Given a decision tree and an input, it does not run the decision tree on any inputs other than the given one, but instead exploits the mechanism by which the value on that input is computed, by looking at the depth of the corresponding leaf. We call a defense that exploits the structure of the given representation in this kind of way mechanistic.
The existence of mechanistic defenses is another reason to suspect that there are other representation classes for which defense is strictly easier than learning. However, some sophistication may be required to construct such defenses. For example, one might hope to detect a backdoor trigger for a Boolean formula by checking the pattern of inputs to each gate, and seeing how unlikely that pattern would be for a random input to the formula. Unfortunately, though, the following example, found by Thomas Read, presents an obstacle to such an approach.
Example (Backdoor with likely input patterns).
Consider the Boolean formulas given by and
where is shorthand for . By the distributive property, is logically equivalent to , and so and disagree only on the input . But on a uniformly random input to the formula, every gate in receives every pattern of inputs with probability at least . By contrast, in the logically equivalent version of , the subformula is on , which only happens with probability on a uniformly random input.
This example suggests a possible mechanistic defense for Boolean formulas that involves first rewriting the formula. More generally, mechanistic defenses offer an exciting avenue for future research.
7.3 Implications for AI alignment
We are motivated to study backdoors as an analogy for deceptive alignment, the possibility that an advanced AI system would learn to behave cooperatively when there are subtle cues that it is in training, but uncooperatively when those cues are missing [16]. A deceptively aligned model is analogous to a backdoored function in that it behaves similarly to a fully cooperative model except on certain inputs that are rare during training. Thus, in this analogy, D is the training distribution, f is the fully cooperative model, f* is the deceptively aligned model, and x* is an input on which f* behaves uncooperatively.
Note that allowing the backdoor to be detected at runtime is appropriate in this analogy, because we have the opportunity to modify the deployed model or run a monitoring system. The flaw in the analogy comes from how the attacker is restricted. In our formal notion of defendability, the attacker must insert a backdoor that works for a randomly-chosen trigger. But in the case of deceptive alignment, there is no such restriction, and instead the backdoored behavior must arise from properties of the model architecture and the training distribution.
From this perspective, backdoor defenses that rely on learning are unsatisfying, because our assumption is that a similar process of learning gave rise to deceptive alignment in the first place. The detection strategy in our proof of Theorem 4.1 used distillation plus ensembling to produce a regularized model, but this may be exploiting the flaw in our analogy by using resampling to avoid the randomly-chosen trigger. Moreover, we would have to rely on a fast approximation to this strategy, since the exact version runs in exponential time. Similar issues apply to other methods of regularization, such as the one discussed in Appendix A, although that alternative does show more promise.
On the other hand, mechanistic defenses, as discussed in Section 7.2, may fare better, since they work very differently. Intuitively, if a detection strategy could spot the mechanism by which a deceptively aligned model concluded that it was in training, then it should transfer well to the case of deceptive alignment. A mechanistic defense may be able to do this by exploiting the fact that this mechanism is active on the backdoor trigger but inactive on most random inputs. Unfortunately though, we are lacking in examples of mechanistic defenses, which is why we are excited to see more research in this direction.
Our result that polynomial size circuits are not efficiently defendable in Theorem 5.5 is also relevant to this analogy. Although it is unlikely that our indistinguishability obfuscator-based construction would arise out of ordinary model training, it is much more plausible for a trained model to be obfuscated in a more informal sense. Indeed, reverse engineering trained neural networks is an active area of research [28]. Hence the possibility of obfuscation poses a potential problem for detecting deceptive alignment. However, in the case of deceptive alignment, we also have access to the entire training process, including the training dataset, which may be enough information for us to detect the trigger despite any potential obfuscation. By analogy, in our construction in Theorem 5.5, it would be easy for the defender to detect the trigger if they had access to the unpunctured key that was used in the construction of the backdoored function f*.
This motivates the study of variants of our formal notion of defendability that constrain the attacker in different ways, or provide more assistance to the defender. To better capture the analogy with deceptive alignment, we would be excited to see research into variants that provide the defender with more information about how the function they are given was constructed, to see if this makes defense computationally feasible.
8 Conclusion
We have introduced a formal notion of defendability against backdoors in which the attacker’s strategy must work for a randomly-chosen trigger. Despite its simplicity, this notion gives rise to a rich array of strategies. In the absence of computational constraints, defense is exactly as hard as learning. Meanwhile, in the presence of computational constraints, defense is strictly easier than learning, but impossible for function classes that are rich enough to support obfuscation. We are excited to see future work that further explores the exact relationship between defense and learning.
References
- [1] Boaz Barak, Oded Goldreich, Rusell Impagliazzo, Steven Rudich, Amit Sahai, Salil Vadhan, and Ke Yang. On the (im)possibility of obfuscating programs. In Annual international cryptology conference, pages 1–18. Springer, 2001. doi:10.1007/3-540-44647-8_1.
- [2] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM (JACM), 36(4):929–965, 1989. doi:10.1145/76359.76371.
- [3] Dan Boneh and Brent Waters. Constrained pseudorandom functions and their applications. In Advances in Cryptology-ASIACRYPT 2013: 19th International Conference on the Theory and Application of Cryptology and Information Security, Bengaluru, India, December 1-5, 2013, Proceedings, Part II 19, pages 280–300. Springer, 2013. doi:10.1007/978-3-642-42045-0_15.
- [4] Elette Boyle, Shafi Goldwasser, and Ioana Ivan. Functional signatures and pseudorandom functions. In International workshop on public key cryptography, pages 501–519. Springer, 2014. doi:10.1007/978-3-642-54631-0_29.
- [5] Sébastien Bubeck, Yin Tat Lee, Eric Price, and Ilya Razenshteyn. Adversarial examples from computational constraints. In International Conference on Machine Learning, pages 831–840. PMLR, 2019. URL: http://proceedings.mlr.press/v97/bubeck19a.html.
- [6] Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In international conference on machine learning, pages 1310–1320. PMLR, 2019. URL: http://proceedings.mlr.press/v97/cohen19c.html.
- [7] Jacob Dumford and Walter Scheirer. Backdooring convolutional neural networks via targeted weight perturbations. In 2020 IEEE International Joint Conference on Biometrics (IJCB), pages 1–9. IEEE, 2020. doi:10.1109/IJCB48548.2020.9304875.
- [8] Sanjam Garg, Somesh Jha, Saeed Mahloujifar, and Mahmoody Mohammad. Adversarially robust learning could leverage computational hardness. In Algorithmic Learning Theory, pages 364–385. PMLR, 2020. URL: http://proceedings.mlr.press/v117/garg20a.html.
- [9] Oded Goldreich, Shafi Goldwasser, and Silvio Micali. How to construct random functions. Journal of the ACM (JACM), 33(4):792–807, 1986. doi:10.1145/6490.6503.
- [10] Oded Goldreich and Leonid A Levin. A hard-core predicate for all one-way functions. In Proceedings of the twenty-first annual ACM symposium on Theory of computing, pages 25–32, 1989. doi:10.1145/73007.73010.
- [11] Shafi Goldwasser, Michael P Kim, Vinod Vaikuntanathan, and Or Zamir. Planting undetectable backdoors in machine learning models. In 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS), pages 931–942. IEEE, 2022.
- [12] Shafi Goldwasser, Jonathan Shafer, Neekon Vafa, and Vinod Vaikuntanathan. Oblivious defense in ML models: Backdoor removal without detection. arXiv preprint, 2024. arXiv:2411.03279.
- [13] Steve Hanneke, Amin Karbasi, Mohammad Mahmoody, Idan Mehalel, and Shay Moran. On optimal learning under targeted data poisoning. Advances in Neural Information Processing Systems, 35:30770–30782, 2022.
- [14] David Haussler, Nick Littlestone, and Manfred K Warmuth. Predicting 0, 1-functions on randomly drawn points. Information and Computation, 115(2):248–292, 1994.
- [15] Sanghyun Hong, Nicholas Carlini, and Alexey Kurakin. Handcrafted backdoors in deep neural networks. Advances in Neural Information Processing Systems, 35:8068–8080, 2022.
- [16] Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Risks from learned optimization in advanced machine learning systems. arXiv preprint, 2019. arXiv:1906.01820.
- [17] Aayush Jain, Huijia Lin, and Amit Sahai. Indistinguishability obfuscation from well-founded assumptions. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 60–73, 2021. doi:10.1145/3406325.3451093.
- [18] Jinyuan Jia, Xiaoyu Cao, and Neil Zhenqiang Gong. Intrinsic certified robustness of bagging against data poisoning attacks. In Proceedings of the AAAI conference on artificial intelligence, volume 35(9), pages 7961–7969, 2021. doi:10.1609/AAAI.V35I9.16971.
- [19] Adam Tauman Kalai and Shang-Hua Teng. Decision trees are PAC-learnable from most product distributions: a smoothed analysis. arXiv preprint, 2008. arXiv:0812.0933.
- [20] Jonathan Katz and Yehuda Lindell. Introduction to modern cryptography: principles and protocols. Chapman and hall/CRC, 2007.
- [21] Michael Kearns and Leslie Valiant. Cryptographic limitations on learning Boolean formulae and finite automata. Journal of the ACM (JACM), 41(1):67–95, 1994. doi:10.1145/174644.174647.
- [22] Michael J Kearns and Umesh Vazirani. An introduction to computational learning theory. MIT press, 1994.
- [23] Alaa Khaddaj, Guillaume Leclerc, Aleksandar Makelov, Kristian Georgiev, Hadi Salman, Andrew Ilyas, and Aleksander Madry. Rethinking backdoor attacks. In International Conference on Machine Learning, pages 16216–16236. PMLR, 2023. URL: https://proceedings.mlr.press/v202/khaddaj23a.html.
- [24] Aggelos Kiayias, Stavros Papadopoulos, Nikos Triandopoulos, and Thomas Zacharias. Delegatable pseudorandom functions and applications. In Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security, pages 669–684, 2013. doi:10.1145/2508859.2516668.
- [25] Alexander Levine and Soheil Feizi. Deep partition aggregation: Provable defense against general poisoning attacks. arXiv preprint, 2020. arXiv:2006.14768.
- [26] Yiming Li, Yong Jiang, Zhifeng Li, and Shu-Tao Xia. Backdoor learning: A survey. IEEE Transactions on Neural Networks and Learning Systems, 2022.
- [27] Ryan O’Donnell. Analysis of Boolean functions. arXiv preprint, 2021. arXiv:2105.10386.
- [28] Chris Olah. Mechanistic interpretability, variables, and the importance of interpretable bases. Transformer Circuits Thread, 2022. URL: https://www.transformer-circuits.pub/2022/mech-interp-essay.
- [29] Amit Sahai and Brent Waters. How to use indistinguishability obfuscation: deniable encryption, and more. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 475–484, 2014. doi:10.1145/2591796.2591825.
- [30] Leslie G Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984. doi:10.1145/1968.1972.
Appendix A Alternative version of Theorem 4.1 using a Boltzmann posterior
In this section we prove a slightly weaker version of Theorem 4.1 using a detection strategy that does not involve learning. Instead of using distillation plus ensembling as in the proof of Theorem 4.1, we employ an alternative method of regularization that involves resampling the given function from a Boltzmann posterior. This gives rise to an extra logarithmic factor in the confidence of the detection algorithm, but otherwise gives the same bound, and in particular the threshold for detecting backdoors in a class of VC dimension is still .
We are grateful to Dmitry Vaintrob for an earlier version of these results.
The remainder of this section has been omitted from this version due to the page limit. For the full version, see: https://arxiv.org/abs/2409.03077
Appendix B Proof of Theorem 4.1
See 4.1
Proof.
Write . We will show that, more specifically, for :
- (a) For any detection strategy there is an adversary limiting their win probability to at most .
- (b) There is a detection strategy that wins against any adversary with probability at least .
For part a, let be a set of size shattered by , and let be any point. The adversary proceeds by taking to assign probability to each point of and the remaining probability to the point , where . The adversary then takes to be one of the witnesses for the shattering that assigns to the point . We may treat as if it were chosen uniformly at random, since the detection strategy’s worst-case performance can be no better than its average-case performance. Finally, given , the adversary takes to be the witness that agrees with exactly on the set if , and otherwise takes . It is easy to check that is -valid if and only if . Furthermore, and are identically distributed for . Hence the best a detection strategy can do is to output Acc when given the point , and otherwise the strategy can do no better than chance. So the detection strategy’s overall win probability is at most , as required.
For part b, note first that the detection strategy can achieve a win probability of at least simply by guessing randomly, so we may assume that . To construct the detection strategy, by Haussler–Littlestone–Warmuth, take a prediction strategy for with sample size and error rate at most , for some positive integer to be chosen later. The detection strategy then works as follows:
- Given as input (which could be either or ), we use the prediction strategy to make a prediction for , which we in turn think of as a prediction for . The idea is that if we can successfully predict , then we can distinguish the two cases by comparing this to , since but if is -valid.
- We repeat this procedure times with independent samples to obtain predictions for some positive integer to be chosen later.
- Finally, we take a majority vote: if more than half of are different from , then we output Rej, and otherwise we output Acc.
To lower bound the detection strategy’s win probability, consider first the case in which is passed to the detection strategy. In this case the probability that is at most the error rate . Hence by Markov’s inequality and linearity of expectation, the probability that more than half of are different from is at most , giving the detection strategy a failure probability of at most . If instead is passed to the detection strategy, then there are two possible reasons why we could have : either and disagree on at least one of the samples provided by the example oracle, or the prediction strategy fails to predict even if it is provided with the value of on all of these samples. Write and for the indicator functions of these two events respectively during the th trial for . Then the probability that the detection strategy fails is
The second term on the right-hand side is at most by another application of Markov’s inequality and linearity of expectation. To bound the first term on the right-hand side, we may assume without loss of generality that is -valid, which implies that , by the union bound. Taking to be the unique positive integer with , we have . Since these indicator functions are independent, the first term on the right-hand side is at most by Hoeffding’s inequality, which can be made arbitrarily small by taking to be sufficiently large. Putting everything together, the detection strategy’s best overall win probability is at least
as required.
Appendix C Proof of Theorem 5.3
See 5.3
Proof.
Write for the random oracle. Let be a set of “keys”, and let , where denotes string concatenation. We will show that, over the choice of random oracle, is efficiently defendable with probability as for all , but not efficiently PAC learnable with probability as . The result then follows by the union bound.
For defendability, let and . We will show that, with probability over the choice of random oracle, if then is -defendable with confidence using the trivial detection strategy that always outputs Acc. To see this, let be the distribution over chosen by the adversary, and for write for . In order for the adversary to be able to choose a backdoored function that is -valid with high enough probability over the choice of backdoor trigger, the adversary must choose so that
Note also that
Now, for any two distinct functions in the class, the probability (over the choice of random oracle) that they agree except on a small fraction of inputs, weighted by the adversary’s chosen distribution, is exponentially small, by Hoeffding’s inequality together with the two facts noted above.
Hence by the union bound, it is impossible for the adversary to choose any with with probability at least
When this event happens the detection strategy always wins, and if then this probability is , as required.
For PAC learnability, we will show that, with probability over the choice of random oracle, there is no PAC learning algorithm for , even with access to the random oracle, that runs in time with confidence parameter and error parameter , say. To see this, note that, for all sufficiently large , there are fewer than possible hypotheses that such an algorithm could output, assuming by convention that algorithms can process at most one bit per unit time. Now for each of these hypotheses , if is the uniform distribution over and , then
by Hoeffding’s inequality. Hence by the union bound, the probability that none of these hypotheses has generalization error at most is at least
When this event happens the PAC learning algorithm fails with probability , as required.
Appendix D Proof of Theorem 6.1
See 6.1
Proof.
Given and , write for the number of distinct input variables that appear along the path from root to leaf when is evaluated at .
Fix a suitable depth threshold. We claim that, as long as the adversary takes the distribution to be uniform, the class is ε-defendable with the claimed confidence using the detection strategy that, given a tree and an input, outputs Rej if the number of distinct variables appearing along the corresponding root-to-leaf path exceeds this threshold, and Acc otherwise. This suffices, since computing this quantity takes the same time as evaluating the tree, up to a constant factor.
To see this, given and , write
and observe that
where we have written as shorthand for . The key claim is that, for any and any , if is -valid, then
The first inequality is a consequence of the fact that the number of distinct variables that appear when either or is evaluated at is at most , and makes essential use of the fact that is uniform (or at the very least, a product distribution). The second inequality follows from the definition of -valid.
Now choose , , and as in the backdoor detection game. Suppose first that is passed to the detection strategy. Then with probability at least , we have , since has at most leaves. Hence the detection strategy outputs Acc with probability at least . Now suppose instead that is passed to the detection strategy. Then with probability at least , we have , by a similar argument. But if is -valid, then this implies that , by the key claim. Hence the detection strategy outputs Rej with probability at least . So the detection strategy’s overall win probability is at least , as required.