Uniformity Testing Under User-Level Local Privacy
Abstract
We initiate the study of distribution testing under user-level local differential privacy, where each of users contributes samples from the unknown underlying distribution. This setting, albeit very natural, is significantly more challenging than the usual locally private setting, as for the same parameter the privacy guarantee must now apply to a full batch of data points. While some recent work considers distribution learning in this user-level setting, nothing was known for even the most fundamental testing task, uniformity testing (and its generalization, identity testing).
We address this gap, by providing (nearly) sample-optimal user-level LDP algorithms for uniformity and identity testing. Motivated by practical considerations, our main focus is on the private-coin, symmetric setting, which does not require users to share a common random seed nor to have been assigned a globally unique identifier.
Keywords and phrases:
Differential Privacy, Local Differential Privacy, Uniformity Testing, Identity Testing, Hypothesis Testing, User-Level Differential Privacy, Person-Level Differential PrivacyFunding:
Clément L. Canonne: Supported by an ARC DECRA (DE230101329).Copyright and License:
2012 ACM Subject Classification:
Security and privacy ; Security and privacy Usability in security and privacy ; Security and privacy Privacy protections ; Theory of computation Theory of database privacy and security ; Mathematics of computing Hypothesis testing and confidence interval computationEditor:
Shubhangi SarafSeries and Publisher:
Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
We consider the problem of uniformity testing (equivalently, identity testing [29, 24]) of distributions in the setting where each of distributed users hold independent and identically distributed observations from some unknown common distribution. This naturally captures many real-world statistical scenarios, such as when data is distributed among many users’ personal devices.
Consider the testing equivalent of the problem specified in [16], where the goal is to learn users’ most-used emoji. In this practical deployment, users are queried once per day with a prespecified privacy budget. Of course, people typically use a lot more than a single emoji during that time period, and so one would hope to obtain information about many emojis at once, from each user. Yet, despite users “sampling” from the “distribution” of emojis multiple times per day, the setting explored in earlier literature can only make use of one sample.
Can we leverage the fact that each user holds many samples to test the underlying distribution more efficiently, while still preserving each individual’s privacy as a whole?
To formalize this question, we work in the framework of Differential Privacy (DP) [26], specifically Local Differential Privacy (LDP) [33, 38], where data is made private before it leaves the device. This setting is of practical interest as data collection has grown massively, and many users are hesitant to trust a central curator with collecting and storing their data non-privately. Furthermore it has received theoretical attention as a well-parameterized model of learning under constrained information per-sample [2]. However, the usual setting of “item-level” (i.e., single-sample) LDP is ill-suited to our goal, which is to capture the fact that each user can contribute many samples: naïvely, this would correspond to viewing the data of each user as an -tuple, blowing up the domain size from to and leading to severely suboptimal algorithms. Instead, we will work in the more stringent setting of user-level LDP, whereby the privacy guarantee applies to the whole data held by any single user (see Figure 1 and Section 2 for an illustration and definition).
As we elaborate in Section 1.1, user-level LDP is a much more stringent and challenging setting, as (roughly speaking) the algorithm’s output must be remain similar even when the data of an entire user – a full batch of samples! – is changed arbitrarily.
As our focus is on capturing practically motivated settings, we also pay special attention to the type of algorithm we design, and how realistic their deployment would be. For instance, adaptive protocols, in which users interact sequentially and adapt their output to the previous message observed, are typically undesirable, as they introduce latency and a host of technical implementation challenges.
For this reason, we consider non-adaptive protocols for this problem; which themselves come in several varieties. In particular a protocol can be public-coin or private-coin, and symmetric or asymmetric. A public-coin protocol assumes the existence of a shared random seed between all participants, while a private-coin protocol does not. Symmetric protocols are such that each user runs the same algorithm locally with the same parameters, while an asymmetric protocol allows some variation between users, as decided by the central data collector or curator. These models capture the level of coordination between the curator and the distributed users such that a public-coin asymmetric protocol is the most coordinated, and a private-coin symmetric protocol is the least. For this reason we will progress through the models from most-coordination to least, as the latter model is more practical and thus preferable. Of course, if the case is any indication, this practicality may come at a cost, in terms of per-user communication [9] required or overall sample complexity [2].
Can we obtain symmetric, private-coin testing algorithms for user-level LDP uniformity testing? What is the cost of achieving these two desirable features, in terms of sample and communication complexity?
To ground these distinctions and motivate the question further, one can consider the following tasks that are easy in one model, but hard in its complement. A public-coin protocol may assume that all users can apply the same randomly sampled hash function to their data, while a private-coin protocol requires that the hash is either determined in advance, or sampled independently and sent by each user to the server. An asymmetric protocol may easily partition the users into groups of equal size, while to achieve the same effect a symmetric protocol must have users randomly partition themselves and send which group they landed in. This runs into the canonical coupon collector problem, where some groups will be under-sampled, requiring a more thorough analysis, or more samples.
1.1 Overview of our Results and Techniques
In what follows, we focus on uniformity testing, that is, the task of testing whether the unknown distribution is uniform on the domain, or at distance at least (in total variation distance) from uniform. As discussed in Section 1.2, by a now-standard reduction this directly implies the analogous statements for identity testing, where the reference distribution is a fixed, known reference instead of the uniform distribution .
Our setting (formally described in Section 2) involves users, each holding i.i.d. samples from an unknown distribution over a known domain of size . They engage in a distributed protocol, either public- or private-coin, to enable a central server to perform uniformity testing on this underlying distribution in a locally private manner, with privacy parameter . Our specific focus will be on the regime , arguably the most natural and relevant, where each user has a possibly large number of observations, but not enough that they could run a uniformity testing algorithm by themselves:111Indeed, if , then each user has enough samples to perform uniformity testing by themselves, and so the question boils down to communicating the answer to the server in a locally private manner, for which users is sufficient. When is even larger (specifically, at least the sample complexity of uniformity testing in the central DP model, then is enough: a single user can run the DP algorithm and send the outcome. we implicitly place ourselves in this regime throughout.
Our first “warm-up” result is a public-coin algorithm222Throughout, and in line with the local privacy literature, we will use “algorithm” and “protocol” interchangeably. which shows that, for uniformity testing, user-level LDP with samples for each of user behaves similarly to LDP with one sample for each of users:
Theorem 1 (See Theorem 8).
There exists an asymmetric, public-coin, user-level locally differentially private algorithm for uniformity testing (over domain of size ) which on privacy and distance parameters and takes
users, each holding i.i.d. samples from the unknown distribution and sending one bit of communication.
Whether this result seems unexpected or not to the reader, one interesting aspect is that this result essentially follows from combining, in a very simple way, two known algorithmic building blocks: the first is domain compression [6, 7], a type of hashing which, using shared randomness, allows us to to reduce the domain size without increasing the total variation distance “too much”. The second is the user-level LDP coin learning (asymmetric) protocol of [8], which lets us learn the bias of a single Bernoulli (and so, a fortiori, test it). Combining the two is then rather straightforward, modulo some bookkeeping details: (1) use domain compression to reduce the uniformity testing instance from to ; then (2) (privately, with samples) learn the bias of the resulting “coin” to accuracy .
In view of the simplicity of the first algorithm, it is natural to expect the private-coin algorithm to be similarly straightforward. As we discuss shortly, this expectation turns out to be significantly optimistic: nonetheless, we are able to establish the following, providing an analogous result in the private-coin setting. (To interpret it, it is useful to remember that, for , the sample complexity of private-coin locally private uniformity testing is known to be .)
Theorem 2 (1-bit user-level LDP uniformity testing).
There exists a private-coin, user-level locally differentially private algorithm for uniformity testing (over domain size ) which for small privacy parameter , and distance parameter takes
users, each holding i.i.d. samples from the unknown distribution. If the protocol is asymmetric, and each user sends 1 bit of communication. If the protocol is symmetric, then and each user sends bits of communication.
One may wonder whether this trade-off between symmetry and communication complexity is inherent: by a simple modification of an argument of Acharya and Sun [9] (originally in the context of (item-level) locally private distribution learning), we provide strong evidence that this trade-off is necessary for at least some parameter regime,333We cannot hope to show this trade-off for all values of , since for , as discussed before, there is a trivial, private-coin, symmetric protocol with a single bit per user. showing that this is the case for :
Proposition 3.
Any symmetric, private-coin algorithm for uniformity testing (over domain of size ) with distance parameter and sample per user requires at least bits of communication per user. (This holds regardless of whether the algorithm is locally private or not.)
For completeness, we give a proof of this result in the full version. While only stated for very small , this result shows that one cannot achieve constant communication complexity per user with symmetric private-coin protocols.
Underlying our results is a technical lemma, likely of independent interest, which gives a way to compress many samples into one bit. This compression, which is likely to also find applications in the (non-private) bandwidth-constrained model, enables us to reduce the task to the much simpler task of privatizing a single bit, rather than a higher-dimensional message. Phrased in terms of differential privacy, this considerably reduces the sensitivity of our randomizers.
Lemma 4 (Informal statement of Lemma 12).
Let with . Then, the indicator variable
is distributed as where
We briefly describe how the above lemma may lead to Theorem 2. The first (by now somewhat standard) idea is to use a good error-correcting code such as the Hadamard code to define a family of sets444For simplicity of presentation, we implicitly assume in this overview that is a power of , that is odd, and that everything is a multiple of what it needs to be. , each of size , such that
| (1) |
(note that since each set has size , if then the sum is ). Then we can partition the users into groups, where the users of group “monitor” : by computing how many of their samples fall into their assigned set . That is, each user of a given group now observes a random variable distributed as
where we can rewrite . By (1), we then have that is either (if is uniform) or at least ( is far from it). Now, our user has a in hand, and we would like them to send a single bit: this is where Theorem 2 comes in, letting the user send the bit obtained by thresholding their observation at . This yields a bit which has either bias (if is uniform, since is odd) or some bias such that .
By piecing together (and centreing) the bits from one user of each of the groups, we can view it as one -dimensional random bit vector in , with mean
This enables us to reduce our multi-sample-per-user uniformity testing problem to (privately) testing whether a product distribution over is uniform or has mean vector with norm at least
| (2) |
The good news is that we now longer have to worry about user-level privacy: each user only needs to make their one output bit -LDP, which is easy to achieve via Randomized Response: by standard arguments, this only changes the mean testing problem by replacing the parameter by . All that remains after that is to use an out-of-the-box (non-private) algorithm for mean testing of Rademacher-product distributions such as the ones in [20, 21], (not quite) establishing Theorem 2.
Are we there yet?
There are still, unfortunately, a few annoying issues with the above outline. The first is the in (2): we would like to lower bound by , but this only allows us to get , which is (much) weaker for large . Moreover, this is actually unavoidable in our reduction to mean testing of Rademacher-products!
The key insight is that this is only an issue when some of the sets have very large bias (high or low probability) under , in which case . We do not have control over this – it can happen! – but this should be an easy case to detect. And indeed, we can run, on a small number of users, an alternative protocol to detect whether there exists a with . By standard concentration arguments for maximum of Binomials, losing only a factor, we can argue that, with high probability, this will not lead to erroneously rejecting the uniform distribution ( for all ), but will detect if any of these is overly biased. If this test passes, then whether is uniform or not, we know that all are small enough, and so .
This leads us to the second annoying issue: namely, that the above outline leads to a good sample complexity, but runs two distinct sub-protocols, one of them partitioning our users in distinct groups: this is very much an asymmetric protocol, with users running distinct randomizers depending on their identity. To handle this, there is an “obvious”, general-purpose solution: let each user pick uniformly at random which of these randomizers to run, and send both the (private) output and the (non-private) index of the randomizer they picked. By a coupon collector argument, this works with high probability, at the cost of a logarithmic factor in the number of users (and an additional bits of communication per user).
As Proposition 3 asserts, the latter is necessary; the former, however, is very much wasteful. To avoid paying this logarithmic factor, we provide a generalization of the result for mean testing for Rademacher-product distributions, which allows for a different (and random) number of observations per coordinate and which we believe is of independent interest (Theorem 16). Its analysis, while intuitively simple, is rather technical and relies on a symmetrization argument by negative association: we provide it in the full version.
Finally, we observe that while our focus is privacy, we obtain novel results for the bandwidth-constrained setting [4] as well, where each user has limited communication budget with which to share their data.
Naïve approaches: “why did you not simply do this?”
When considering this model, the first and most natural approach is to consider the composition theorems of differential privacy. Simple composition of differential states that sending messages, each with -differential privacy, results in a final privacy loss of . We use these theorems by pretending that there are total users, each with , and each guaranteed privacy . Even in the most generous case when our sample complexity is some , this results in no apparent gains from the additional samples over simply picking one of the uniformly at random. In reality the situation is worse, as we frequently have , meaning that sending messages, each with privacy parameter will result in an even worse final sample complexity.
There also exists advanced composition [28, 27] that say that the same composition satisfies approximate555We do not further consider approximate-DP in this work, however one can informally think of as the “probability of not being -DP”. -differential privacy with final . Applying this once again to some arbitrary problem with , we see that our final sample complexity is again , yielding no gains.
As mentioned earlier, another simple approach exists when is large enough for each user to (non-privately) test on their own. In this regime we have each user test locally, reporting a bit that indicates accept or reject. We then simply learn the average bit using -LDP, for which is sufficient. Therefore, for the very specific regime when and , we have a final sample complexity . The existence of this approach allows us to assume for the remainder of this work.
A slightly less naïve approach (and why it is not enough).
Another approach one could consider is one based on testing via repetition, used in [23, Section 2] for uniformity testing in the streaming setting. Assuming that , we can run the following “privatized” version of this protocol. For each user, , and each domain element , let the number of samples of user that are . Then each user computes the statistic,
and applies an additive -LDP noise mechanism. Then, the final statistic computed by the server would be the average of the private versions of all the ’s, i.e., the private version of To make each private, each user would need to add noise to that is calibrated to the sensitivity, , of (i.e., the effect of changing all the samples of that user on in the worst case – if is the new statistic after changing all of user ’s samples, then the sensitivity is the maximum possible value of ). In this case, the sensitivity is . The private version of is then simply where , and correspondingly the private version of is where .
Now, by linearity of expectations, . Then for -far from , following the proof of [23, Section 2] we get
and, since and are independent, By Chebyshev’s inequality, we then get that a protocol which computes this at the server and thresholds its value at gives a valid (private, private-coin) uniformity testing algorithm as long as
Even though this is an improvement over the previously outlined methods, the sample complexity still falls short of what we are hoping for. One may hope to improve it by using techniques like the sensitivity-reducing mapping from [13, 12] that could potentially reduce the sensitivity from to . However, even in the best possible case, that would only improve the sample complexity to
As we see, even this best-case scenario would not improve the dependence on , and be very costly for small .
1.2 Prior work
Uniformity testing was first considered in the context of theoretical computer science by [30], and the optimal sample complexity was obtained in [35]. These results have since been generalized to the identity testing problem, where the reference distribution is not uniform, a problem later proven to be formally equivalent to uniformity testing [25, 29, 24]. The task has since also been considered under (differential) privacy, first in the central model of differential privacy [18], where the tight sample complexity was later shown to be [10, 14].
Under the more restrictive model of local differential privacy, the question of testing was raised in [36] and has since been wholly resolved. It is known now that the tight sample complexity for non-interactive private-coin protocols is , while non-interactive public-coin protocols achieve [2, 5, 1]. Furthermore, allowing interactivity cannot improve the sample complexity beyond the bound given in the public-coin setting [3, 15, 37].
In the shuffle model of differential privacy, where users’ responses are permuted before being received by the central curator, an upper bound of was shown in [22]. The best known lower bounds were derived from a connection to pan-privacy [17].
The user-level setting is a natural generalization of any distributed statistical problem. In practice, it is restrictive to consider users holding only one sample from the distribution of interest whether we are testing, learning, or performing any other distributed task. For this reason, distribution learning with multiple samples was studied under bandwidth constraints, where an intricate interplay between the number of samples per-user and the number of bits available to communicate the samples was shown [4]. Under user-level differential privacy, these results were extended to show that (among other things) for small , one achieves a sample complexity for learning of . Comparing this to the case when (For example [11]) we can see that the risk decreases linearly with . We could say that for the task of learning, having samples per-user is as beneficial as having times more users to query.
Even though there is no direct reduction between the two settings, and (as inferred from the previous discussion) the situation, as grows, becomes quite subtle, this gives strong evidence that our results which show a sample complexity improvement by a factor are tight, at least in the most relevant parameter regime for .
Organization
The remainder of this paper is organized as follows. Section 2 establishes necessary facts and definitions we use throughout the paper, including the necessary background on differential privacy, distribution testing, and some concentration inequalities important to this work. In Section 3, we present a result for the public-coin setting, combining known theorems from distribution testing and user-level private learning. Section 4 introduces our main technical results in the form of non-private protocols that our private protocols are built upon. In particular, we show new algorithms for various testing problems, as well as several technical lemmata we believe are of independent interest. Due to the length and technical depth of this section, we defer the question of establishing privacy to Section 5, where we show how to adapt the results of the previous section to the user-level local differential privacy setting. Finally, we conclude with some open questions surrounding user-level local differential privacy. Wherever a proof, lemma, or algorithm is deferred from the main body, we include an appropriate reference to the full version.
2 Preliminaries
We use to refer to the number of users participating in the protocol (or in the dataset) and to refer to the number of samples (or data points) each user has. Additionally, we use to denote the domain size, to denote the privacy parameter, to denote the accuracy parameter, and to denote the probability of failure (in terms of accuracy) of our protocol. We also write .
Next, for a distribution , refers to the -dimensional product distribution with each marginal . For two distributions and over the same (countably infinite) domain , we denote by the total variation distance between them, defined as
In the context of Bernoulli and Binomial distributions with parameter , we call their deviation from (i.e., ) their bias.
Finally, we use the notation , , and to denote the (sometimes slightly more convenient) analogues of the , , and notation: specifically, for two sequences , indexed by some parameter , we write if there exists such that for every , with the inequality reversed for . then denotes that both and hold. Throughout, and denote minimum and maximum: , and .
2.1 Differential Privacy
We first provide the necessary notions and results from the differential privacy (DP), starting with the definition of local differential privacy (LDP).
Definition 5 (Local Differential Privacy [32, 38]).
For , a randomized algorithm provides -Local Differential Privacy if for all ,
While Local Differential Privacy is somewhat involved to define for interactive protocols, where each user can send (in an adaptive manner) several messages, it is simpler in our setting. We consider non-interactive protocols, where each user only sends one message to the server. When each user holds several datapoints (that is, ), the above definition then directly corresponds to the user-level LDP guarantee considered in this paper.
We will rely on the (binary) Randomized Response mechanism, which is an optimal -local differential privacy protocol for 1-bit inputs [33]. Formally, let be the local randomizer for this protocol where is a random variable, and is the probability of seeing output on input .
2.2 Distribution and Uniformity Testing
Here, we formally state the problem and setting. We are interested in the problem of uniformity testing with multiple samples per user. Specifically, given a discrete distribution over elements, which we assume without loss of generality is over the domain , each user holds a multi-sample which is an -dimensional vector of i.i.d. samples from . We are interested in algorithms that can distinguish between the cases where and via the fewest possible number of users . As mentioned in the introduction, barring some kind of information constraint per-user, this problem reduces to the well-studied case where by having each user send all of their samples and treating each as its own message. This problem becomes non-trivial when we are either bandwidth-constrained or privacy-constrained. In the former, users have up to bits with which to communicate their samples. In the latter, the user is constrained by differential privacy.
In the user-level LDP setting, upon receiving private responses from the users, the server must distinguish between the two cases:
-
, and
-
,
with probability at least , while satisfying -user-level LDP. (The threshold is somewhat arbitrary, and can be amplified to any by standard arguments at a sample complexity cost of .)
2.3 Useful Probability Tools
We first recall Cantelli’s inequality, a one-sided version of Chebyshev’s inequality:
Lemma 6 (Cantelli’s inequality).
Let be a real-valued random variable with finite variance. Then, for every ,
We will also require the following standard tail bound for subgaussian random variables (see e.g., [31]):
Lemma 7.
Let be (not necessarily independent) -subgaussian random variables with mean zero. Then
and, for every ,
In particular, this implies that, for every ,
3 Public-Coin via Domain Compression
In this section, as a warmup, we show how to establish our public-coin result, Theorem 1, leveraging existing algorithm building blocks, domain compression and user-level coin estimation. We restate it below:
Theorem 8 (LDP 1-bit asymmetric public coin uniformity testing).
There exists an asymmetric, public-coin, user-level locally differentially private algorithm for uniformity testing (over domain of size ) which on privacy and distance parameters and takes
users, each holding i.i.d. samples from the unknown distribution and sending one bit of communication.
Proof (Detailed sketch).
The protocol works as follows: using public randomness, the users jointly perform domain compression, i.e., a type of hashing of the domain, reducing the domain size from to . By the following lemma, with constant probability, this preserves the distance between distributions up to a factor:
Lemma 9 (Domain Compression [24]).
There exists absolute constants such that the following holds. For any and any ,
where is a uniformly random partition of into subsets, and denotes the probability distribution on induced by and .
That is, if , then this (always) results in a distribution uniform over a domain size (i.e., a fair coin), while if was -far from uniform this results (with probability ) in a coin with bias .
All users now having samples from the same induced “coin”, all that remains is to learn the bias of this coin to accuracy in order to distinguish between the two cases. This can be done with the following existing protocol:
Theorem 10 ([4, Theorem 3.2]).
For , there exists a private-coin algorithm for person-level coin bias estimation with
assuming , where is an absolute constant.
Phrased differently, with the above one can privately learn the bias of a coin to an additive , with arbitrary (constant) probability, in the user-level LDP setting, using users.
Putting things together and recalling our setting of , the above protocol then succeeds with (small) constant probability, as long as
“as claimed.” This is so far a symmetric protocol: but the probability of success, for distributions far from uniform, is quite small, as the domain compression only preserves the distances with small probability. To amplify the probability of success to by standard techniques, we repeat the protocol on disjoint batches of users and combine their answers via a majority vote, leading to an asymmetric protocol. Finally, note that the condition on from Theorem 10 is indeed satisfied, as .
4 Low-Bandwidth Private-Coin via Hadamard Matrices
In order to prove Theorem 2 we must first introduce our novel (non-private) algorithm for uniformity testing. Due to the technical depth of this section, and as the analysis follows straightforwardly in the private case, we defer the privacy analysis to Section 5 where we give a sketch of the proof. The complete proof of the private algorithm can be found in the full version, and largely follows the same path as the proof contained in this section, differing only in the application of binary randomized response at key points.
As discussed in Section 1, our high-level approach will proceed as follows: given users, each holding samples from some unknown discrete distribution over elements, we first assign each user a group , for (intentionally dropping a group so that we have groups). As a first attempt, consider this assignment to be a deterministic function that evenly partitions the users among groups. Later, to remove this coordination step, which would lead to an asymmetric protocol, we will instead have each users select the index of their group uniformly at random. We describe our algorithm in stages: first, we give in Section 4.1 an asymmetric algorithm well-suited for “small” values of , before providing a complementary approach to detect large variations from uniformity (Section 4.2), and explaining how to put them together in Section 4.3.
4.1 An algorithm for small
Using the same indexing as for the groups, let be the set of indices where a 1 appears in the column of the Hadamard matrix , that is,
| (3) |
The properties of Hadamard matrices ensure that each of its columns is a column vector of length , where half of the positions are , and the other half are , and so for all . Hereafter, we will write to denote the probability that a sample under falls in set , i.e., . User then computes
i.e., the number (out of their samples) that lie within (the subset of the domain they are “monitoring”). One can observe the following distinction of cases:
-
If , then .
-
If , then , for some which will be related to by Lemma 13.
We call the bias observed by group . User , instead of directly sending , will then compute the one-bit indicator
and send this to the server. We will show that, given independently drawn samples as above, there exists an algorithm that distinguishes between and -far from (with probability at least ).
Theorem 11 (Asymmetric 1-bit multi-sample uniformity tester).
Given users, each holding samples of some unknown distribution on elements. There exists an algorithm (Algorithm 1) that distinguishes between and using
samples. Moreover, each user only sends one bit.
(Note that this protocol, which follows the above outline, is asymmetric, as users are partitioned in distinct groups of equal size, and users from different groups process their samples differently.) This first algorithm is particularly well suited to the “small regime” where , where it achieves the desired number of users.
The rest of this subsection is dedicated to establishing Theorem 11. In view of this, we will need three ingredients: first, demonstrating that has bias (this factor is where the dependence on in the final bound will come from); second, showing that a distribution that is -far from uniform induces bias in each group, such that ; and third, a known result on uniformity testing for Rademacher-product distributions. The proof proceeds immediately from these results: Each user sends their single bit with bias . We then take one bit from each group and arrange them into a vector. This vector simulates a sample from the product distribution with mean . Applying Lemma 13 we get that the expected norm of this vector is at least . Plugging in an -norm product tester as a black box, we conclude our proof.
As stated above, we begin by capturing the behaviour of as a function of each . Recall that is an indicator variable that declares whether a binomial exceeded the mean it “should” have in the uniform case: in this sense, going from to converts a “many-bit” sample to a “single-bit” one. We show that for with small bias , this indicator behaves as a Bernoulli with mean , and so this conversion from many bits to one still preserves both bias and a dependence on . For ease of exposition, we defer its proof to the end of the section.
Lemma 12 (From Many Bits, One).
Let be an odd integer, and . Define
where . Then , where
(Moreover, if , then .)
While the bias observed by each group is clearly distribution-dependent, we need to relate them explicitly to the distance parameter . To do so, we use the known fact that multiplication by a Hadamard matrix is -norm preserving.
Lemma 13 (Hadamard transform is norm-preserving).
As in the process described above, let be a distribution with . Let be the Hadamard matrix with entries. For each column (excluding the all-ones column), define , and let be the bias of column under distribution . Then
(This follows, for instance, from [2, Lemma 3], combined with Cauchy–Schwarz to relate total variation and distances.) For completeness, we provide a self-contained proof in the full version.
Finally, the third (and last) ingredient missing is an algorithm to test, given i.i.d. observations from a product distribution on bits, whether the mean vector is zero (uniform distribution) or has large norm:
Lemma 14 (Uniformity testing on product distributions).
Fix any . There exists an algorithm that, given a parameter and i.i.d. samples from a product distribution on with , has the following guarantees.
-
If , the algorithm returns accept w.p. ;
-
If , the algorithm returns reject w.p. ;
as long as , for some absolute constant .
For a proof of this in the case , see, e.g., [20, Section 2.1] or [21, Lemma 4.2], which establish this along the way, while focusing on testing in total variation distance, or [19, Theorem 4.1], which provides slightly stronger guarantees. We note that while stated only for , the proofs above actually implicitly show the result for the whole range of . For completeness, we provide a self-contained proof (for the whole range of ) in the full version.
With these three building blocks in hand, we are ready to analyze Algorithm 1:
Proof of Theorem 11.
Recall that user is deterministically assigned group index such that each group is of equal size.666The first column (and row) of the Hadamard matrix are all-ones, and can be ignored, or otherwise simulated if necessary, as any user’s behaviour in this group would be to deterministically send a 1. Where is clear from context or not relevant we suppress this notation, letting . We have each user send their per Algorithm 1 and focus on the server’s view. Clearly is distributed as some (yet unknown) Bernoulli , where depends on . Centering each of these, we get the Rademacher random variables
each with mean . Now, we wish to apply Lemma 14 which takes as input samples from some product distribution. To facilitate this we construct our own vector samples by concatenating one from each group. Each group contains users. So, as described in Algorithm 1, we create the vectors so that holds the bit sent by the ’th user of group .
Each therefore has mean vector . Applying Lemma 12, we get that, for all , (1) if , then ; and (2) otherwise, we have . Combining this with Lemma 13, this vector satisfies:
-
if , then ;
-
if , then
(4)
and the RHS is at least . This is exactly the setting we need to invoke Lemma 14: setting , and , this yields a final sample complexity of .
Before proceeding to the next component of our algorithm, it remains to establish Lemma 12.
Proof of Lemma 12.
We start with the “Moreover” statement: if , then . By symmetry, , and since ,
from which .
Assume without loss of generality that (as otherwise we can consider instead). To establish the first part of the statement, we will distinguish between three cases, depending on how large .
-
Second case: .777The choice of the constant in (instead of the more natural ) may appear somewhat arbitrary: this specific value for the cut-off will be useful, for technical reasons, in the third and last case. In this case, the mean of only differs by from , the mean of a standard Binomial (and the modes of the two distributions are either the same or very close integers), so the change in probability mass between the two is quite subtle. We can provide a coupling between and as follows:
where the distribution of , conditioned on , is . One can check that this satisfies (i.e., this is a valid coupling) and directly implies that a.s. Now, we have
using in the last step that . This shows that in this case .
-
Third case: . In this regime, we can rely on the Gaussian approximation, as quantified by the Berry–Esseen theorem (see, e.g., [34, Section 11.5], which guarantees that the CDF of the normalized version of ,
is pointwise close to the CDF of a standard Gaussian :
for some absolute constant (one can take ). In particular, this implies, in our case, that
(as ) (Berry–Esseen) (Studying for ) the last step using that and . This shows that, in this regime as well, .
This concludes the distinction of cases, and the proof.
4.2 An algorithm for large
The above approach gives the “right” sample complexity under the restriction that
where . We here provide a different protocol, which works well when at least one of the ’s is large. The main idea is to have each user just check is any of the sets receives a lot more (or less) than half of their samples, namely, for some suitable threshold . If is uniform, then this is highly unlikely, as by Lemma 7 the maximum deviation of subgaussian random variables (here, Binomials) from their mean is less than with overwhelming probability. But if is not uniform and one of the ’s is large, then the number of samples falling in the corresponding set is a very biased Binomial, and for every user this set will receive samples with high constant probability. We formalize this idea in Algorithm 2, starting with the non-private version; and prove the following theorem:
Theorem 15.
There exists an algorithm (Algorithm 2) with the following guarantees:
-
If , then the center outputs accept with probability at least ;
-
If there exists such that , then the center outputs reject with probability at least .
Moreover, each person sends only one bit.
Proof.
In what follows, we set .
-
If is uniform, then for all . By standard tail bounds for subgaussian random variables, this means that, for every
given our setting of (specifically, we apply Lemma 7 with to the mean-zero random variables , which are centered Binomials, and thus -subgaussian). By a union bound, we get , and the server outputs accept with probability at least .
-
Turning to the other case, assume there exists such that . Then, for every person , , and
where the last inequality holds with probability at least by Chebyshev (as is large enough, for large enough ). This implies that an expected of the ’s will be equal to : more precisely, with . The probability that such a Binomial is less that is at most (and actually ), and so the center will reject with probability at least (and actually ).
4.3 Combined Algorithm for all values of
There exists a critical regime of parameters that Algorithm 1 struggles with, leading to suboptimal sample complexity; and these are handled exactly by Algorithm 2. We can handle both regimes as follows: have the server run both protocols, the former with and the latter with users, and return accept if, and only if, both of them return accept. Assume
Then, we have the following distinction of cases:
-
If , then both protocols accept with probability at least each, so overall the centre returns accept with probability at least ;
-
If , then
-
–
If there exists such that
then, by Theorem 15, the second protocol (with one user) outputs reject with probability at least , in which case the center outputs reject.
-
–
Otherwise, then by (4) the mean of the product distribution in the first protocol is at least
(5) since for all , and . Then, concluding as in Section 4 by invoking the uniformity testing algorithm for product distributions of Lemma 14 (which also handles the full range of distance parameter ), the server rejects with probability at least , as
Either way, at least one of the two tests outputs reject with probability , and so does the center.
-
–
4.4 Symmetric Protocols via Generalized Product Testing
The above protocol is asymmetric in that, as stated, it needs users to be divided into groups. Of these, groups will run Algorithm 1, each with a different column. The last group (of only 1 user) runs Algorithm 2.
To resolve the asymmetry in Algorithm 1 we prove the following statement, which covers the setting we require to make our protocol symmetric: each of users independently selects a random coordinate and reports a sample from that coordinate. Formally, observations are obtained by choosing, independently for each , uniformly at random, and . In this case, the numbers of times each coordinate is sampled are (correlated) random variables.
Theorem 16.
There exists an algorithm (available in the full version) which, given parameters , , and sample access to distributions on , chooses at random from a multinomial distribution with parameters , , and ; and then is given i.i.d. samples from each (where the samples are independent from the choice of ’s). Then, letting , it has the following guarantees.
-
If , the algorithm returns accept w.p. ;
-
If , the algorithm returns reject w.p. ;
as long as for some absolute constant .
We defer the proof to the full version.
As Theorem 16 takes exactly the same parameters, and has exactly the same guarantees as Lemma 14, we do not restate the proof of Theorem 11. We need only note that having each user randomly sample their own index and send it to the center is indeed distributed as described above. This of course increases communication to bits.
This alone does not make the entire protocol symmetric. As we said, we still have one user assigned to running Algorithm 2. However, this is easily addressed. At the cost of one extra bit of communication per user, we can simply have all users run this second test, and then let the center select arbitrarily one of the outcomes to use. This finally gives a (non-private) algorithm, and can be summarized as follows:
Theorem 17 (Symmetric private-coin uniformity testing).
There exists a symmetric, private-coin (non-private) algorithm for uniformity testing (over domain size ) which on distance parameter takes
users, each holding i.i.d. samples from the unknown distribution, and sending bits of communication.
Note that the first term dominates when , which is the regime of interest (as otherwise a single user has enough samples to run a uniformity testing algorithm by themselves).
Of course, this result may still be somewhat underwhelming, as (albeit communication-efficient) the algorithm does not provide any privacy guarantee. In the next section, we will see how it can be easily adapted to yield our main result, Theorem 2.
5 Symmetric Locally Private Testing
We here sketch the analysis of the private analogues to the algorithms defined in Section 4, thus completing the proof of Theorem 2. We focus on making each of our two algorithms private before showing how they can be combined. In each case we will apply binary randomized response [33] to the bit returned by each user. As discussed before, the details of the analysis are provided in the full version of the paper.
Combining the two private algorithms gives us an asymmetric and symmetric protocol, each with sample complexity comparable to the case with times as many users, up to logarithmic factors.
Private Algorithm 1 for small .
We first introduce the private version of Algorithm 1, which handles the case when is small. As described above, we make this algorithm private by an application of binary randomized response to the bit returned by each user. Consider the proof of Theorem 17, which itself follows the proof of Theorem 11. By applying randomized response to the bit sent by each user, one can quickly derive that each vector coordinate is now distributed as a Rademacher with mean . Thereafter, applying the same steps as in the proof of Theorem 11 yields an algorithm with sample complexity
Noting that as grows, this rapidly converges to the non-private result.
Private Algorithm 2 for large .
Recall that the second stage of our algorithm, defined in Section 4.2 handles the case when one of the subsets defined by the Hadamard matrix is overrepresented. Non-privately we only require that a single user perform this test for all of the subsets and report their response. Under local differential privacy this would not have any high-probability guarantee. Instead we use a known (and easily derived) bound for learning coins under binary randomized response. Specifically, that one can learn a Bernoulli through randomized response up to additive error with success probability 2/3 using samples. This yields the following lemma:
Lemma 18 (Private version of Theorem 15).
Algorithm 2, when each user applies binary randomized response to their output, has the following guarantees:
-
If , then the server outputs accept with probability at least
-
If there exists such that , then the server outputs reject with probability at least
Analysis of the Combined Algorithm.
As in the non-private case, we have to consider how to combine these two algorithms. First, we consider the “easy” asymmetric case; allocating users to run the private analogue to Algorithm 1, we then only need users to run the second protocol. As the number of users required by the second protocol is clearly dominated by the first, we retain much the same sample complexity up to a logarithmic factor. We defer the proof to the full version, however it suffices to say that we derive as a final sample complexity for the asymmetric algorithm
To make the protocol symmetric we can follow the same procedure described in Section 4.4 and have all users run the second protocol. This incurs two costs, (1) we must divide the privacy budget between these two protocols, and (2) we lose a logarithmic factor in the final sample complexity.
Taking the path of least resistance, we assume that each user runs the private version of Algorithm 1 with privacy parameter , and likewise runs Algorithm 2 with . As such, we have and so gain an term in the log of Lemma 18. Setting to the sample complexity derived for the private analogue of Algorithm 2, one can see that we require
users. Combining both bounds completes the proof of Theorem 2.
6 Conclusion
User-level locally private distribution testing is still far from being understood. We observe many phase transitions as and vary. Consider the algorithm for testing in the central model of differential privacy discussed in Section 1.2, when exceeds the stated sample complexity we see that the required number of “users” goes to 1, and only 1 bit of communication is needed.
How exactly do these algorithms behave as a sliding scale between the local and central models of differential privacy? Characterizing the behaviour in each regime is an ongoing and important field of research.
Future work.
Data generation is not always homogeneous: the distributions that users sample from are not truly identical; rather, it is more likely they are sampling from distributions that could be similar or very far apart. User-level locally private distribution learning under limited heterogeneity is touched upon in [8], but we mirror their remark that this deserves further study.
A further heterogeneity that should be considered is the case when not all users hold the same number of samples . In this case each may hold for each . It is not at all obvious how this could be handled neatly, and general results for this model could greatly help practical implementations.
References
- [1] Jayadev Acharya, Clement Canonne, Cody Freitag, and Himanshu Tyagi. Test without trust: Optimal locally private distribution testing. In Kamalika Chaudhuri and Masashi Sugiyama, editors, Proceedings of Machine Learning Research, volume 89 of Proceedings of Machine Learning Research, pages 2067–2076. PMLR, 16–18 April 2019. URL: http://proceedings.mlr.press/v89/acharya19b.html.
- [2] Jayadev Acharya, Clément L. Canonne, Cody Freitag, Ziteng Sun, and Himanshu Tyagi. Inference under information constraints III: local privacy constraints. IEEE J. Sel. Areas Inf. Theory, 2(1):253–267, 2021. doi:10.1109/JSAIT.2021.3053569.
- [3] Jayadev Acharya, Clément L. Canonne, Yuhan Liu, Ziteng Sun, and Himanshu Tyagi. Interactive inference under information constraints. CoRR, abs/2007.10976, 2020. arXiv:2007.10976.
- [4] Jayadev Acharya, Clément L. Canonne, Yuhan Liu, Ziteng Sun, and Himanshu Tyagi. Distributed estimation with multiple samples per user: Sharp rates and phase transition. In Advances in Neural Information Processing Systems 34 (NeurIPS), 2021.
- [5] Jayadev Acharya, Clément L. Canonne, and Himanshu Tyagi. Inference under information constraints I: Lower bounds from chi-square contraction. IEEE Trans. Inform. Theory, 66(12):7835–7855, 2020. Preprint available at arXiv:abs/1812.11476. doi:10.1109/TIT.2020.3028440.
- [6] Jayadev Acharya, Clément L. Canonne, and Himanshu Tyagi. Inference under information constraints II: Communication constraints and shared randomness. IEEE Transactions on Information Theory, 2020. In press. Preprint available at arXiv:abs/1804.06952. doi:10.1109/TIT.2020.3028439.
- [7] Jayadev Acharya, Clément L. Canonne, Yanjun Han, Ziteng Sun, and Himanshu Tyagi. Domain compression and its application to randomness-optimal distributed goodness-of-fit. In Jacob Abernethy and Shivani Agarwal, editors, Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 3–40. PMLR, 09–12 July 2020. URL: http://proceedings.mlr.press/v125/acharya20a.html.
- [8] Jayadev Acharya, Yuhan Liu, and Ziteng Sun. Discrete distribution estimation under user-level local differential privacy. In AISTATS, volume 206 of Proceedings of Machine Learning Research, pages 8561–8585. PMLR, 2023. URL: https://proceedings.mlr.press/v206/acharya23a.html.
- [9] Jayadev Acharya and Ziteng Sun. Communication complexity in locally private distribution estimation and heavy hitters. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 51–60, Long Beach, California, USA, 09–15 June 2019. PMLR. URL: http://proceedings.mlr.press/v97/acharya19c.html.
- [10] Jayadev Acharya, Ziteng Sun, and Huanyu Zhang. Differentially private testing of identity and closeness of discrete distributions. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 6878–6891. Curran Associates, Inc., 2018. URL: http://papers.nips.cc/paper/7920-differentially-private-testing-of-identity-and-closeness-of-discrete-distributions.pdf.
- [11] Jayadev Acharya, Ziteng Sun, and Huanyu Zhang. Hadamard response: Estimating distributions privately, efficiently, and with little communication. In Kamalika Chaudhuri and Masashi Sugiyama, editors, Proceedings of Machine Learning Research, volume 89 of Proceedings of Machine Learning Research, pages 1120–1129. PMLR, 16–18 April 2019. URL: http://proceedings.mlr.press/v89/acharya19a.html.
- [12] Maryam Aliakbarpour, Arnav Burudgunte, Clément L. Canonne, and Ronitt Rubinfeld. Better private distribution testing by leveraging unverified auxiliary data. In COLT, volume 291 of Proceedings of Machine Learning Research, pages 22–63. PMLR, 2025. URL: https://proceedings.mlr.press/v291/aliakbarpour25a.html.
- [13] Maryam Aliakbarpour, Ilias Diakonikolas, Daniel Kane, and Ronitt Rubinfeld. Private testing of distributions via sample permutations. In NeurIPS, pages 10877–10888, 2019. URL: https://proceedings.neurips.cc/paper/2019/hash/8e036cc193d0af59aa9b22821248292b-Abstract.html.
- [14] Maryam Aliakbarpour, Ilias Diakonikolas, and Ronitt Rubinfeld. Differentially private identity and equivalence testing of discrete distributions. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 169–178, Stockholmsmässan, Stockholm Sweden, 10–15 July 2018. PMLR. URL: http://proceedings.mlr.press/v80/aliakbarpour18a.html.
- [15] Kareem Amin, Matthew Joseph, and Jieming Mao. Pan-private uniformity testing. In Jacob Abernethy and Shivani Agarwal, editors, Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 183–218. PMLR, 09–12 July 2020. URL: http://proceedings.mlr.press/v125/amin20a.html.
- [16] Apple Differential Privacy Team. Learning with Privacy at Scale, 2017. URL: https://machinelearning.apple.com/research/learning-with-privacy-at-scale.
- [17] Victor Balcer, Albert Cheu, Matthew Joseph, and Jieming Mao. Connecting robust shuffle privacy and pan-privacy. CoRR, abs/2004.09481, 2020. arXiv:2004.09481.
- [18] Bryan Cai, Constantinos Daskalakis, and Gautam Kamath. Priv’it: Private and sample efficient identity testing. In Proceedings of the 34th International Conference on Machine Learning, ICML ’17, pages 635–644. JMLR, Inc., 2017. URL: http://proceedings.mlr.press/v70/cai17a.html.
- [19] Clément L. Canonne, Xi Chen, Gautam Kamath, Amit Levi, and Erik Waingarten. Random restrictions of high dimensional distributions and uniformity testing with subcube conditioning. In SODA, pages 321–336. SIAM, 2021. doi:10.1137/1.9781611976465.21.
- [20] Clément L. Canonne, Ilias Diakonikolas, Daniel M. Kane, and Alistair Stewart. Testing Bayesian Networks. In Satyen Kale and Ohad Shamir, editors, Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 370–448, Amsterdam, Netherlands, 07–10 July 2017. PMLR. URL: http://proceedings.mlr.press/v65/canonne17a.html.
- [21] Clément L. Canonne, Gautam Kamath, Audra McMillan, Jonathan Ullman, and Lydia Zakynthinou. Private identity testing for high-dimensional distributions. In Advances in Neural Information Processing Systems 33, 2020. To appear. Preprint available at arXiv:abs/1905.11947.
- [22] Clément L. Canonne and Hongyi Lyu. Uniformity testing in the shuffle model: Simpler, better, faster. In Karl Bringmann and Timothy M. Chan, editors, 5th Symposium on Simplicity in Algorithms, SOSA@SODA 2022, Virtual Conference, January 10-11, 2022, pages 182–202. SIAM, 2022. doi:10.1137/1.9781611977066.13.
- [23] Clément L. Canonne and Joy Qiping Yang. Simpler distribution testing with little memory. In Merav Parter and Seth Pettie, editors, 2024 Symposium on Simplicity in Algorithms, SOSA 2024, Alexandria, VA, USA, January 8-10, 2024, pages 406–416. SIAM, 2024. doi:10.1137/1.9781611977936.37.
- [24] Clément L. Canonne. Topics and techniques in distribution testing: A biased but representative sample. Foundations and Trends® in Communications and Information Theory, 19(6):1032–1198, 2022. doi:10.1561/0100000114.
- [25] Ilias Diakonikolas and Daniel M. Kane. A new approach for testing properties of discrete distributions. In 57th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2016. IEEE Computer Society, 2016. doi:10.1109/FOCS.2016.78.
- [26] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography, volume 3876 of Lecture Notes in Comput. Sci., pages 265–284. Springer, Berlin, 2006. doi:10.1007/11681878_14.
- [27] Cynthia Dwork and Aaron Roth. The Algorithmic Foundations of Differential Privacy. Foundations and Trends® in Theoretical Computer Science, 9(3-4):211–407, 2013. doi:10.1561/0400000042.
- [28] Cynthia Dwork, Guy N. Rothblum, and Salil Vadhan. Boosting and Differential Privacy. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 51–60, 2010. ISSN: 0272-5428. doi:10.1109/FOCS.2010.12.
- [29] Oded Goldreich. The uniform distribution is complete with respect to testing identity to a fixed distribution. Electronic Colloquium on Computational Complexity (ECCC), 23:15, 2016. URL: http://eccc.hpi-web.de/report/2016/015.
- [30] Oded Goldreich and Dana Ron. On testing expansion in bounded-degree graphs. Technical Report TR00-020, Electronic Colloquium on Computational Complexity (ECCC), 2000.
- [31] Clement C. (https://math.stackexchange.com/users/75808/clement c). Tail bounds for maximum of sub-gaussian random variables. Mathematics Stack Exchange, 2023. URL:https://math.stackexchange.com/q/4002713 (version: 2023-12-21). arXiv:https://math.stackexchange.com/q/4002713.
- [32] Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. What can we learn privately? SIAM J. Comput., 40(3):793–826, 2011. doi:10.1137/090756090.
- [33] Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam D. Smith. What can we learn privately? SIAM J. Comput., 40(3):793–826, 2011. doi:10.1137/090756090.
- [34] Ryan O’Donnell. Analysis of Boolean functions. Cambridge University Press, New York, 2014. doi:10.1017/CBO9781139814782.
- [35] Liam Paninski. A coincidence-based test for uniformity given very sparsely sampled discrete data. IEEE Transactions on Information Theory, 54(10):4750–4755, 2008. doi:10.1109/TIT.2008.928987.
- [36] Or Sheffet. Locally private hypothesis testing. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4612–4621, Stockholmsmässan, Stockholm Sweden, 10–15 July 2018. PMLR. URL: http://proceedings.mlr.press/v80/sheffet18a.html.
- [37] Berrett Thomas and Butucea Cristina. Locally private non-asymptotic testing of discrete distributions is faster using interactive mechanisms. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, pages 3164–3173, Red Hook, NY, USA, 2020. Curran Associates Inc.
- [38] Stanley L. Warner. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63–69, 1965. doi:10.1080/01621459.1965.10480775.
