Locally Private Histograms in All Privacy Regimes
Abstract
Frequency estimation, a.k.a. histograms, is a workhorse of data analysis, and as such has been thoroughly studied under differential privacy. In particular, computing histograms in the local model of privacy has been the focus of a fruitful recent line of work, and various algorithms have been proposed, achieving the order-optimal error in the high-privacy (small $\varepsilon$) regime while balancing other considerations such as time- and communication-efficiency. However, to the best of our knowledge, the picture is much less clear when it comes to the medium- or low-privacy regime (large $\varepsilon$), despite its increased relevance in practice. In this paper, we investigate locally private histograms, and the closely related distribution learning task, in this medium-to-low privacy regime, and establish near-tight (and somewhat unexpected) bounds on the error achievable. As a direct corollary of our results, we obtain a protocol for histograms in the shuffle model of differential privacy, with accuracy matching previous algorithms but significantly better message and communication complexity.
Our theoretical findings emerge from a novel analysis, which appears to improve bounds across the board for the locally private histogram problem. We back up our theoretical findings with an empirical comparison of existing algorithms in all privacy regimes, to assess their typical performance and behaviour beyond the worst-case setting.
Keywords and phrases: Differential Privacy, Local Differential Privacy, Histograms, Frequency Estimation, Lower Bounds, Maximum Error
Funding: Clément L. Canonne: Supported by an ARC DECRA (DE230101329).
2012 ACM Subject Classification: Security and privacy → Usability in security and privacy; Security and privacy → Privacy protections; Theory of computation → Theory of database privacy and security
Acknowledgements: The authors would like to thank Guy Blanc for the proof of Lemma 17, and Albert Cheu for insightful discussions regarding the use of amplification by shuffling (Theorem 25). This work was done in part while the authors were visiting the Simons Institute for the Theory of Computing.
Editors: Raghu Meka
1 Introduction
Frequency estimation is a fundamental problem of statistics: besides its use for basic surveying, it is also used as a building block in distribution learning, identifying heavy hitters in sparse domains, and regression, correlation, or covariance analysis. As such, frequency estimation (and the closely related problem of heavy hitters) has been thoroughly studied in a variety of settings, ranging from streaming to privacy-preserving analytics. We here focus on the latter, and specifically on the local model of differential privacy (LDP), where the data is distributed across a large number of users, and each datum is subject to stringent differential privacy requirements. This question has, over the past decade, received a lot of attention, starting with [8, 28]; as we elaborate in Section 1.1, much is known about locally private frequency estimation, to the point that it may seem the question has already been fully resolved – both in theory and practice. However, most recent large-scale implementations of local privacy (see, e.g., [5, 18]) must balance many efficiency objectives, including the bandwidth and computational requirements (as, among others, a proxy for energy consumption), and, above all, the estimation accuracy. Keeping this accuracy in check has, in turn, led to the common use of relatively large values of the “privacy parameter” $\varepsilon$, far from the original rule of thumb of $\varepsilon \le 1$. (See, for instance, the list of real-world uses of differential privacy maintained at https://desfontain.es/blog/real-world-differential-privacy.html, which includes use cases of LDP along with the corresponding stated values of $\varepsilon$.)
In view of the relevance of frequency estimation to privacy-preserving algorithms and deployments, it is crucial that theory be informed by, and applicable to, practice; and that constant factors, bounds on achievability, and the behaviour of the estimation error in all parameter regimes be well understood. Yet, to the best of our knowledge, most if not all of the previous theoretical work on locally private frequency estimation focuses on the high-privacy regime, leaving the low-privacy regime by and large uncharted. Addressing this gap in our theoretical understanding is the focus of our work.
The question and setting
We will focus on examining the standard worst-case estimation error in $\ell_\infty$, as it is the most relevant norm for the problem, with special importance for controlling error rates in heavy hitters. Specifically, when estimating the empirical frequencies $p = (p_1, \dots, p_k)$ over a universe of size $k$ from $n$ users, the (expected) $\ell_\infty$ error is given by
$$\mathbb{E}\left[\max_{1 \le i \le k}\big|\hat{p}_i - p_i\big|\right] \tag{1}$$
where the expectation is taken over the (possibly randomised) algorithm given as input the users’ data, and $\hat{p}$ is the vector of estimated frequencies. We are interested in the worst-case error over all possible datasets, that is, the supremum of the above quantity over all inputs (equivalently, all $p$). We will seek algorithms (“protocols”) to minimize this worst-case error under the constraint of (pure) local privacy (see Section 2), parameterized by the privacy parameter $\varepsilon$. As in most distributed settings, one can consider several variants depending on whether a common random seed is available to the users (public-coin protocols) or not (private-coin protocols). It is known that for locally private frequency estimation allowing public-coin protocols can provably reduce the communication requirements [3]: specifically, the cited paper establishes a separation between public- and private-coin protocols, showing that the former can achieve vanishing error even with constant-length communication per user, a setting where the latter must incur error bounded away from zero; to complement this, it is known that using public randomness one can always reduce the per-player communication to a single bit [8]. Similarly, one can allow some back and forth between users and the central server (interactive protocols). However, the public-coin and interactive settings come at an increased deployment cost, and are often less easy (or even impossible) to implement, as they require either broadcasting from the center or sustained two-way communication between the parties. As our focus is chiefly on analyzing the most versatile setting, we hereafter restrict ourselves, for our algorithms, to the private-coin (non-interactive) setting; however, we mention that our lower bounds apply to the public-coin setting as well.
A closely related question is that of distribution learning under $\ell_\infty$ loss, which differs from frequency estimation in that the dataset is not arbitrary, but instead assumed to be drawn i.i.d. from an unknown probability distribution over $[k]$: in this sense, distribution learning is an “easier” problem than frequency estimation, and an algorithm for the latter implies one for the former. (There are some subtleties here, but it is worth noting, as a baseline, that absent privacy constraints the error of learning to $\ell_\infty$ is well known to scale as $\Theta(1/\sqrt{n})$, with no dependence on the alphabet size $k$; while frequency estimation can be done with zero error, as each user can simply send their data.) Formal definitions, as well as the relation between these two questions, can be found in Section 2. In this paper, we therefore focus on frequency estimation, and will point out the implications for distribution learning as corollaries.
Connection to shuffle privacy
To conclude this subsection, we mention that besides the use of large values of $\varepsilon$ in practical deployments of locally private algorithms, which calls for a better understanding of the tasks in this parameter regime, another motivation for studying the “low-privacy regime” comes from the emergence of another model of differential privacy, shuffle privacy [15, 21], and its increasingly widespread adoption. Indeed, it is known that one generic way to obtain shuffle private algorithms is via the so-called “amplification-by-shuffling” technique (see, e.g., [23]), whereby a locally private algorithm with high $\varepsilon$ (low privacy) yields a shuffle private protocol with small $\varepsilon$ (high privacy), with the same number of messages and communication cost. This makes characterizing the low-privacy regime of locally private algorithms a very consequential question, especially for fundamental tasks such as frequency estimation and the related distribution learning.
1.1 Prior work
Many innovative algorithms have been proposed in recent years for locally private histograms and distribution estimation [22, 3, 14, 34, 33, 25, 24], with a subset focusing on $\ell_\infty$-error. (For distribution estimation, a common error measure in the literature is $\ell_1$, corresponding to total variation distance.) A lower bound of $\Omega\big(\sqrt{\log k/(n\,\varepsilon^2)}\big)$ for frequency estimation under local differential privacy was established by [8] for most regimes of $k$, $n$, and small privacy parameter $\varepsilon$; and various LDP protocols asymptotically achieving this bound have been proposed (see Table 1).
Most recently, upper bounds on the $\ell_\infty$ error for frequency estimation were derived in [14], and evaluated empirically in [24]. For large $\varepsilon$, i.e., the “low-privacy regime,” to the best of our knowledge the best current results are (1) a bound of $O\big(\sqrt{\log k/(\varepsilon\,n)}\big)$ on the error rate established in [14], and (2) an upper bound from [29] (Theorem 2), based on CountSketch: note that the latter error does not vanish as $\varepsilon$ grows.
In [25], it was stated that under most commonly used metrics, the error is primarily driven by the variance of an algorithm. We show that, at least for $\ell_\infty$, a close look at sub-Gaussian and sub-gamma behaviour can better capture the error in low-privacy regimes. Additionally, [14] raise the question of whether their upper bound of $O\big(\sqrt{\log k/(\varepsilon\,n)}\big)$ is tight in the low-privacy regime. We show in Theorem 14 that, quite surprisingly, it is not.
Table 1: Existing LDP protocols for frequency estimation (RAPPOR among them), with their per-user communication and their error guarantees; entries marked “?” were previously unknown.
Histograms have also seen detailed analysis in the shuffle model, both as a practical question [26, 27, 15, 16, 9] and as a benchmark for reasoning about the power of the shuffle model [6]. As two points of comparison, we highlight [26], which introduced a multiple-message protocol for shuffled histograms achieving small expected $\ell_\infty$-error at the cost of several messages per user, and [6], which demonstrated an even smaller expected maximum error at the cost of multiple rounds of communication. While our results do not achieve the latter bound, we show, surprisingly, that the former can be achieved in some fairly permissive parameter regimes with only a single message.
1.2 Overview of results
We here summarise our main results, and briefly discuss the underlying techniques. While we focus on two protocols in particular, the techniques themselves are broadly applicable. Our first set of results concerns the error rate achievable by LDP protocols for frequency estimation. We first recall, as a baseline, a general transformation which converts any LDP protocol with optimal error in the high-privacy regime into another LDP protocol with reasonable error in the low-privacy regime, at the cost of a blowup in communication:
Proposition 1 (Informal; see Proposition 9).
Let $\Pi$ be any locally private protocol for frequency estimation with expected error $\alpha(n, k, \varepsilon)$ for $\varepsilon \le 1$, using $\ell$ bits of communication per user. Then there is a locally private protocol $\Pi'$ achieving error
$$O\big(\alpha(n\lfloor\varepsilon\rfloor,\, k,\, 1)\big)$$
for all $\varepsilon \ge 1$, using $\lfloor\varepsilon\rfloor \cdot \ell$ bits of communication per user.
We emphasise that this transformation is not new, and mimics an argument found in, e.g., [14]. This general transformation is quite appealing, as it provides (in theory) good performance in the low-privacy regime “for free” given any good enough LDP protocol for the high-privacy one. However, this comes at a price: first, the communication blowup, which could be impractical; second, a loss in constant factors, which while relatively small might still be prohibitive; and, perhaps more importantly, this requires changing the existing algorithm (and, as a result, the data analysis pipeline), which is often a significant hurdle. Our second result, focusing on one of the earliest, most versatile, and (at least in its “vanilla” version) conceptually simplest LDP protocols for frequency estimation, RAPPOR [22], shows that this transformation is in fact not necessary, and that RAPPOR achieves this improved bound without modification:
Theorem 2 (Informal; see Theorem 13).
The (simple version of) RAPPOR achieves expected error
$$O\left(\sqrt{\frac{\log k}{n\,\min(\varepsilon, \varepsilon^2)}}\right)$$
for all $\varepsilon > 0$, using $k$ bits of communication per user.
In comparison, using the generic transformation above on RAPPOR would require each user to send $\lfloor\varepsilon\rfloor$ $k$-bit messages and, in terms of worst-case theoretical bounds, would yield an expected error worse by a constant factor. As such, this result can be summarized as saying that (re)analyzing an existing algorithm can be better than modifying it – and, quite importantly, that it may not be necessary to change an existing algorithmic pipeline to inherit better guarantees.
The proof of the above result relies on a careful analysis of the expected maximum of sums of Bernoulli random variables, and specifically on a fine-grained analysis of their subgaussian behaviour in the “highly biased” regime. While RAPPOR is particularly amenable to this analysis, we believe that this technique is highly general and applicable to a broad range of LDP protocols, for example those following the general “scheme template” of [4].
Yet, trying to establish optimality of this scaling turned out to be very challenging. And indeed, this is for a good reason: as we show, it is actually possible to achieve significantly better error rate in the low-privacy regime – and, surprisingly, this much better error is again attained by RAPPOR, out-of-the-box:
Theorem 3 (Informal; see Theorem 14).
The (simple version of) RAPPOR achieves expected error
$$\tilde{O}\left(\sqrt{\frac{\log k}{e^{\varepsilon/2}\,n}} + \frac{\log k}{n}\right)$$
for all $\varepsilon \ge 1$, using $k$ bits of communication per user.
The $\tilde{O}(\cdot)$ notation here only hides a single $\log\log k$ factor in the second term. As the error is always at most $1$, up to this factor this is always better than Theorem 2.
The proof of this result requires going beyond the sub-gaussian concentration behaviour alluded to before, and instead analysing the maximum of these sums of Bernoulli random variables as sub-gamma random variables. More precisely, we draw upon very recent work by [17] (see also [10]) on “local Glivenko–Cantelli” results, which provide refined concentration bounds for mean estimation of high-dimensional product distributions – in our case, the distributions over $\{0,1\}^k$ arising from the use of RAPPOR.
The above result is appealing, in that it not only yields a better error rate than previously known (or, in many cases, believed to be possible) in this regime; but also in that it is achieved “for free” by an existing and widely used algorithm. However, it does have one negative aspect: namely, that RAPPOR is, in its standard version, very inefficient from a communication point of view, as it requires one $k$-bit message from each user – far from the ideal $\Theta(\log k)$ bits one could hope for. Luckily, as mentioned above, the analysis underlying Theorem 3 is quite general, and applies to a broad range of locally private estimation algorithms. While it does not lead to the improved bound for all such protocols, we show that it applies, for instance, to the recently proposed Projective Geometry Response protocol of Feldman, Nelson, Nguyen, and Talwar [24], whose performance was originally only analyzed for $\ell_2$ error:
Theorem 4 (Informal; see Theorem 18).
Projective Geometry Response (PGR) achieves expected error
$$\tilde{O}\left(\sqrt{\frac{\log k}{e^{\varepsilon}\,n}} + \frac{\log k}{n}\right)$$
for all $\varepsilon \ge 1$, using $O(\log k)$ bits of communication per user.
Note that again this is achieved “out-of-the-box” by an existing algorithm, without any modification! As a result, this inherits all of the features of PGR its authors originally established: crucially, its computational efficiency. We also point out that the stated guarantee is even slightly better than that of RAPPOR, as the first term now features an $e^{-\varepsilon}$ dependence (instead of the $e^{-\varepsilon/2}$ of Theorem 3).
We then turn to proving optimality of this error rate, and show that it is, up to a logarithmic factor, optimal for any LDP protocol:
Theorem 5 (Informal; see Theorem 24).
Any LDP protocol for frequency estimation must have, in the worst case, expected error
$$\Omega\left(\sqrt{\frac{\log k}{\varepsilon^2\,n}}\right)$$
for $\varepsilon \le 1$, and
$$\Omega\left(\sqrt{\frac{\log k}{e^{\varepsilon}\,n}} + \frac{\log k}{\varepsilon\,n}\right)$$
for $\varepsilon > 1$.
We remark that this lower bound also applies to the “easier” question of distribution learning, as we will state momentarily. The first part of this lower bound, as mentioned before, was already known; the second part is new, and, while not necessarily difficult to show in hindsight, does require significant care in combining inequalities between various information-theoretic quantities to avoid ending up with a vacuous bound. In this sense, our main contribution for the lower bound is to establish it in a self-contained, streamlined fashion, by drawing on the “chi-square contraction” framework of [2]; and – importantly – that it matches the upper bound obtained earlier in both the high- and low-privacy regimes. Another interesting aspect of this lower bound is that all three terms are derived separately, but from the same family of hard instances: a dataset where almost all users hold the same element.
Implications for shuffle privacy
By combining our new results on LDP protocols for histograms in the low-privacy regime with known “amplification-by-shuffling” results, we are able to obtain a simple, robust shuffle (approximate) DP protocol using only one message per user of logarithmic length, while achieving state-of-the-art accuracy:
Theorem 6 (Informal; see Theorem 27).
Shuffling the Projective Geometry Response protocol achieves expected error
$$O\left(\frac{\sqrt{\log k\,\log(1/\delta)}}{\varepsilon\,n}\right)$$
for all $\varepsilon \le 1$, with $O(\log k)$ bits of communication and a single message per user.
This accuracy matches that of the best known shuffle DP protocols, but with a much lower message and communication complexity. It was, to the best of our knowledge, still open whether achieving these “best of three worlds” guarantees was possible.
We conclude by providing the corollary of our results for distribution learning, using the known connection to frequency estimation (along with the standard lower bound of $\Omega(1/\sqrt{n})$ on the error rate for this question, absent privacy constraints):
Corollary 7.
For distribution learning under local privacy constraints, the minmax error achievable is at most
$$\tilde{O}\left(\sqrt{\frac{\log k}{\varepsilon^2\,n}}\right)$$
for all $\varepsilon \le 1$, and
$$\tilde{O}\left(\frac{1}{\sqrt{n}} + \sqrt{\frac{\log k}{e^{\varepsilon}\,n}} + \frac{\log k}{n}\right)$$
for $\varepsilon > 1$; and this is tight, up to a logarithmic factor (in $k$). Moreover, the upper bound is attained by PGR, using $O(\log k)$ bits of communication per user.
Organisation
After providing some background and setting notation, we establish our theoretical results (upper bounds on the error rate) in Section 3, followed by information-theoretic lower bounds in Section 4 and applications to shuffle privacy in Section 5. Finally, we discuss our results and potential future work in Section 6. Several proofs and a discussion of empirical analyses – omitted for clarity of exposition – can be found in the full version.
2 Preliminaries and notation
(Local) differential privacy
We first recall the definition of local differential privacy (LDP):
Definition 8 (Locally private randomiser).
An algorithm $R \colon \mathcal{X} \to \mathcal{Y}$ satisfies $\varepsilon$-differential privacy if for all pairs of inputs $x, x' \in \mathcal{X}$, the output distributions are $e^{\varepsilon}$-close: for all sets of outputs $S \subseteq \mathcal{Y}$,
$$\Pr[R(x) \in S] \le e^{\varepsilon} \cdot \Pr[R(x') \in S].$$
A locally private protocol is typically a pair of algorithms $(R, A)$: the randomiser $R$, executed by each user on their data, and the aggregator $A$, executed by the server to transform the randomised outputs into an unbiased estimator for the true quantity. It is well known (see, e.g., [19]) that all LDP frequency estimation algorithms have a stochastic matrix representation which maps every element in the input alphabet to a probability distribution over the output alphabet. Furthermore, [30] demonstrated that all optimal mechanisms obey a “binary” property, in that their stochastic matrix only contains probabilities proportional to either $e^{\varepsilon}$ or $1$ (appropriately normalised). Recently, another family of LDP protocols was introduced by [4] and used in [24], where each input is associated with a set of high-probability outputs determined by some useful set system.
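To make the stochastic-matrix view concrete, the following minimal sketch (ours, purely illustrative; not taken from the cited works) builds the matrix of $k$-ary randomized response – a mechanism with exactly the “binary” structure described above – and checks the $\varepsilon$-LDP property as a bound on row-wise probability ratios.

```python
import numpy as np

def k_rr_matrix(k: int, eps: float) -> np.ndarray:
    """Column-stochastic matrix of k-ary randomized response:
    entry [y, x] = Pr[output = y | input = x]."""
    lo = 1.0 / (np.exp(eps) + k - 1)            # probability of each "wrong" output
    M = np.full((k, k), lo)
    np.fill_diagonal(M, np.exp(eps) * lo)       # truthful output gets e^eps more mass
    return M

def ldp_ratio(M: np.ndarray) -> float:
    """Largest ratio Pr[y | x] / Pr[y | x'] over outputs y and inputs x, x';
    the mechanism is eps-LDP iff this is at most e^eps."""
    return float(np.max(M.max(axis=1) / M.min(axis=1)))

M = k_rr_matrix(k=5, eps=1.0)
assert np.allclose(M.sum(axis=0), 1.0)          # every column is a distribution
print(ldp_ratio(M), np.exp(1.0))                # the two coincide for this mechanism
```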
Frequency Estimation
Under frequency estimation, the aim is to estimate the empirical frequency of the observations. We denote the dataset of $n$ observations as $x_1, \dots, x_n \in [k]$, and their empirical frequencies as the probability vector $p$, where $p_i = \frac{1}{n}\,|\{j : x_j = i\}|$. That is, $p$ is an element of the probability simplex
$$\Delta_k = \Big\{p \in [0,1]^k : \textstyle\sum_{i=1}^{k} p_i = 1\Big\}.$$
A frequency estimator over an alphabet of size $k$ is a (possibly randomized) function $\hat{p} \colon [k]^n \to \Delta_k$ which approximates the true frequencies sufficiently well. The quality of the resulting approximation, denoted $\hat{p}$, is measured through a suitably chosen loss function. Typical choices of distance are the $\ell_r$ norms, for $r \in [1, \infty]$; in this paper, we will be mostly concerned with the $\ell_\infty$ distance, where $\ell_\infty(p, q) = \max_{i \in [k]} |p_i - q_i|$, and consider the expected loss (where the expectation is over the randomness of the frequency estimator).
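In code, the empirical frequency vector and the $\ell_\infty$ loss just defined look as follows (a straightforward sketch; function names are ours):

```python
import numpy as np

def empirical_frequencies(data: np.ndarray, k: int) -> np.ndarray:
    """p_i = |{j : x_j = i}| / n for each symbol i in [k]."""
    return np.bincount(data, minlength=k) / len(data)

def linf_loss(p_hat: np.ndarray, p: np.ndarray) -> float:
    """The l_infinity distance max_i |p_hat_i - p_i| used throughout."""
    return float(np.max(np.abs(p_hat - p)))

data = np.array([0, 0, 1, 2, 2, 2])             # n = 6 observations over k = 4 symbols
p = empirical_frequencies(data, k=4)            # [1/3, 1/6, 1/2, 0]
print(p, linf_loss(p + np.array([0.01, -0.01, 0.0, 0.0]), p))   # loss 0.01
```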
Distribution learning (estimation)
Given $n$ independent and identically distributed (i.i.d.) samples from an unknown probability distribution $p$ over $[k]$, the goal of distribution learning (density estimation) is to find a distribution $\hat{p}$ that approximates $p$ sufficiently well. As before, the quality of the approximation is typically measured through a suitably chosen loss function, quantifying the distance between the true distribution $p$ and its approximation $\hat{p}$. From the above, one can see that the key difference between frequency estimation and distribution learning (under the same loss function) is that in the former, observations are arbitrary, while in the latter they are assumed i.i.d. from some common underlying probability distribution. As such, the expected loss also takes into account the randomness of the samples themselves.
Frequency Estimation implies Distribution Estimation
For any given distance measure $d$, we denote the expected sampling error (under loss $d$) between the empirical histogram of $n$ i.i.d. samples and the true underlying distribution as $\alpha_{\mathrm{samp}}(n)$.
Further, denote the optimal worst-case (over $p$) expected error under loss $d$ from $n$ i.i.d. samples under $\varepsilon$-LDP constraints as $\alpha_{\mathrm{dist}}(n, \varepsilon)$. The “optimal” here refers to quantifying over all locally private distribution estimators. We can analogously express the optimal expected loss when learning the empirical frequencies as $\alpha_{\mathrm{freq}}(n, \varepsilon)$.
Fact 1.
Accurate frequency estimation implies accurate distribution estimation: namely, $\alpha_{\mathrm{dist}}(n, \varepsilon) \le \alpha_{\mathrm{freq}}(n, \varepsilon) + \alpha_{\mathrm{samp}}(n)$.
3 Better algorithms in the low-privacy regime
Although the order-optimal error rates for LDP frequency estimation and distribution learning are by now well understood in the high-privacy regime, with many distinct algorithms achieving the tight
$$\Theta\left(\sqrt{\frac{\log k}{n\,\varepsilon^2}}\right)$$
expected error bound, much less is understood about the best achievable error for large, or even medium, values of $\varepsilon$. In this section, we first revisit (and slightly generalize) an idea from [14], which shows how to convert any protocol “optimal in the high-privacy regime” into a related protocol “good enough in the low-privacy regime as well”, at the price of a blowup in communication (Section 3.1). We then focus on the specific example of RAPPOR, showing that – perhaps surprisingly – this simple protocol does already achieve the same bound without any modification nor communication blowup (Section 3.2). We finally show that for distribution learning, this very same RAPPOR in fact achieves an even better error rate than this, leveraging very recent results on concentration of the empirical mean for high-dimensional distributions, due to [17, 10] (Section 3.3).
3.1 A generic transformation, and a baseline
Here we prove the following generic statement:
Proposition 9.
Suppose there exists a symmetric (i.e., one where all users use the same randomiser) LDP protocol $\Pi$ for frequency estimation achieving expected error $\alpha(n, k, \varepsilon)$ for $\varepsilon \le 1$, with $m$ messages and $\ell$ bits of communication per user. Then, for every integer $T \ge 1$, there exists an LDP protocol $\Pi_T$ for frequency estimation achieving expected error
$$O\big(\alpha(nT,\, k,\, \varepsilon/T)\big)$$
for all $\varepsilon \le T$, with $mT$ messages and $\ell T$ bits of communication per user. Further, if $\Pi$ is a public-coin (resp. private-coin) protocol, then so is $\Pi_T$.
The idea behind this result is not new, and is used (in a slightly less general form) in [14, Appendix E.4]. We restate and prove it in this paper in a self-contained form, as we believe it to be of broader interest.
As a direct corollary (setting $T = \lceil\varepsilon\rceil$), applying the above to Hadamard Response and RAPPOR, for instance, we obtain the following bounds:
Corollary 10.
For every $\varepsilon \ge 1$, there exists a private-coin $\varepsilon$-LDP protocol (namely, a modification of Hadamard Response) for frequency estimation with expected error
$$O\left(\sqrt{\frac{\log k}{\varepsilon\,n}}\right)$$
using $\lceil\varepsilon\rceil$ messages per user, and $O(\log k)$ bits per message.
Corollary 11.
For every $\varepsilon \ge 1$, there exists a private-coin $\varepsilon$-LDP protocol (namely, a modification of RAPPOR) for frequency estimation with expected error
$$O\left(\sqrt{\frac{\log k}{\varepsilon\,n}}\right)$$
using $\lceil\varepsilon\rceil$ messages per user, and $k$ bits per message.
3.2 Tighter analysis for RAPPOR
To do so, we first recall some facts and notation about RAPPOR, which will help with the analysis of its performance. A common simplification of the RAPPOR protocol [4, 3] is to parameterise it by $\varepsilon$ as follows.
1. Given input $x \in [k]$, one-hot encode it onto a $k$-bit vector $Y$ s.t. only the $x$'th bit is 1.
2. Flip each bit independently with probability $q = \frac{1}{e^{\varepsilon/2}+1}$ and send the resulting noisy bit-vector to the server. (This is just the “Permanent Randomised Response” step of RAPPOR, parameterised by $\varepsilon$.)
3. The server then receives the $n$ noisy bit vectors $Y^{(1)}, \dots, Y^{(n)}$, computes $\bar{Y} = \frac{1}{n}\sum_{j=1}^{n} Y^{(j)}$, and estimates
$$\hat{p} = \frac{(e^{\varepsilon/2}+1)\,\bar{Y} - \mathbf{1}}{e^{\varepsilon/2}-1}, \tag{2}$$
where $\mathbf{1}$ is the all-ones vector of size $k$.
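The three steps above translate directly into code. The sketch below is our rendition (illustrative, not the reference implementation of [22]), with flip probability $q = 1/(e^{\varepsilon/2}+1)$ and the debiased estimator of (2):

```python
import numpy as np

rng = np.random.default_rng(0)

def rappor_randomise(x: int, k: int, eps: float) -> np.ndarray:
    """One-hot encode x into k bits, then flip each bit independently
    with probability q = 1 / (e^{eps/2} + 1)."""
    q = 1.0 / (np.exp(eps / 2) + 1.0)
    y = np.zeros(k, dtype=np.int8)
    y[x] = 1
    return y ^ (rng.random(k) < q)

def rappor_estimate(reports: np.ndarray, eps: float) -> np.ndarray:
    """Server-side debiasing of the coordinate-wise means, as in (2)."""
    a = np.exp(eps / 2)
    return ((a + 1) * reports.mean(axis=0) - 1) / (a - 1)

k, eps = 8, 2.0
data = rng.integers(0, k, size=5_000)
reports = np.stack([rappor_randomise(x, k, eps) for x in data])
p_hat = rappor_estimate(reports, eps)
p = np.bincount(data, minlength=k) / len(data)
print(np.max(np.abs(p_hat - p)))                # the l_infinity error of this run
```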
A standard fact, which we recall here for completeness, is that $\hat{p}$ as defined above is an unbiased estimator for the true vector of frequencies $p$:
Lemma 12 (Expectation of $\hat{p}$).
We have $\mathbb{E}[\hat{p}] = p$.
We then have $\mathbb{E}\big[\|\hat{p} - p\|_\infty\big] = \frac{e^{\varepsilon/2}+1}{e^{\varepsilon/2}-1}\cdot\mathbb{E}\big[\max_i |\bar{Y}_i - \mathbb{E}\bar{Y}_i|\big]$, and so to bound the expected error it suffices to bound the expected maximum deviation of the $\bar{Y}_i$'s from their mean. Now, each $\bar{Y}_i$ is the (normalised) sum of $n$ independent Bernoulli random variables: the standard way to analyse this maximum is to recall that a sum of $n$ independent Bernoullis is a sub-gaussian random variable with parameter (which, importantly, may not coincide with the variance of the random variable, although it does in some important cases – e.g., for a Gaussian r.v.) at most $n/4$. Standard results (see, for example, [32, Chapter 2]) on the maximum of $k$ (not necessarily independent) sub-Gaussian random variables then give
$$\mathbb{E}\big[\|\hat{p} - p\|_\infty\big] \le \frac{e^{\varepsilon/2}+1}{e^{\varepsilon/2}-1}\cdot\sqrt{\frac{\log(2k)}{2n}},$$
which indeed behaves as $O\big(\sqrt{\log k/(\varepsilon^2 n)}\big)$ for small $\varepsilon$. However, the bound quickly degrades for large $\varepsilon$, and only yields $O\big(\sqrt{\log k/n}\big)$ (no dependence on the privacy parameter at all!) as $\varepsilon$ grows.
To remedy this, we will need a tighter analysis of the subgaussian parameter of Bernoulli random variables in the “very biased” regime (which is the one we have to handle for large $\varepsilon$, as then $q = \frac{1}{e^{\varepsilon/2}+1}$ is close to $0$). Specifically, we will rely on the following characterisation of the sub-gaussian parameter of a (centered) Bernoulli random variable with parameter $p$, known as the Kearns–Saul inequality, and which can be found in, e.g., [12]:
$$\sigma^2(p) = \begin{cases}\dfrac{1-2p}{2\ln\frac{1-p}{p}}, & p \neq \frac12,\\[1ex] \dfrac14, & p = \frac12.\end{cases} \tag{3}$$
Importantly, this expression is symmetric: $\sigma^2(p) = \sigma^2(1-p)$ for all $p \in (0,1)$. In our setting, each $n\bar{Y}_i$ is the sum of $n$ independent Bernoulli random variables with one of two symmetric parameters, $q = \frac{1}{e^{\varepsilon/2}+1}$ or $1-q = \frac{e^{\varepsilon/2}}{e^{\varepsilon/2}+1}$. The sub-Gaussian parameter at play is then $\sigma^2(q) = \sigma^2(1-q) = \frac{e^{\varepsilon/2}-1}{\varepsilon\,(e^{\varepsilon/2}+1)}$. Using sub-additivity of the sub-gaussian parameter for independent random variables, we can bound the sub-Gaussian parameter of $\bar{Y}_i$ as
$$\sigma^2(\bar{Y}_i) \le \frac{1}{n}\cdot\frac{e^{\varepsilon/2}-1}{\varepsilon\,(e^{\varepsilon/2}+1)}. \tag{4}$$
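To get a feel for how much (3) gains over the regime-oblivious bound of $1/4$, here is a small numerical sketch (ours) evaluating the Kearns–Saul parameter at the biased values $q = 1/(e^{\varepsilon/2}+1)$ arising from RAPPOR:

```python
import numpy as np

def kearns_saul_sigma2(p: float) -> float:
    """Optimal sub-gaussian parameter of a centered Bernoulli(p), as in (3):
    (1 - 2p) / (2 ln((1-p)/p)) for p != 1/2, and 1/4 at p = 1/2."""
    if abs(p - 0.5) < 1e-12:
        return 0.25
    return (1 - 2 * p) / (2 * np.log((1 - p) / p))

for eps in [0.5, 2.0, 5.0, 10.0]:
    q = 1.0 / (np.exp(eps / 2) + 1.0)           # RAPPOR's flip probability
    print(f"eps={eps:4.1f}  q={q:.4f}  sigma^2(q)={kearns_saul_sigma2(q):.5f}  (naive bound: 0.25)")
```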
We can use this to bound the maximum error of the estimator.
Theorem 13 (Expected maximum error of RAPPOR).
The expected maximum error of the RAPPOR estimator $\hat{p}$ satisfies
$$\mathbb{E}\big[\|\hat{p} - p\|_\infty\big] \le \sqrt{\frac{2\,(e^{\varepsilon/2}+1)\,\log(2k)}{\varepsilon\,(e^{\varepsilon/2}-1)\,n}}, \tag{5}$$
proving the main result of this subsection.
3.3 Optimal analysis for RAPPOR
It is natural to wonder whether this dependence on the privacy parameter is order-optimal; especially as it appears in other locally private estimation tasks, such as mean estimation for high-dimensional Gaussians or product distributions [20, 1]. Quite surprisingly, we will show that for frequency estimation the error rate given in Equation 5 is not optimal for large values of , and that even the simple RAPPOR algorithm can achieve significantly better.
Theorem 14.
For $\varepsilon \ge 1$, the expected error of RAPPOR for frequency estimation satisfies
$$\mathbb{E}\big[\|\hat{p} - p\|_\infty\big] = O\left(\sqrt{\frac{\log k}{e^{\varepsilon/2}\,n}}\right) + \tilde{O}\left(\frac{\log k}{n}\right).$$
Importantly, this is better than the bound in Equation 5 whenever the resulting error is small, which is the regime of interest (small constant, or vanishing, error). To see why this better bound may hold for large $\varepsilon$, recall that the bound given in Equation 5 relies on analyzing the expected maximum of $k$ (centered) Binomial random variables using their sub-gaussian behavior. This is good when the parameters of the Binomials are not too skewed; however, in the low-privacy regime, the parameters of the Bernoulli summands become very close to $0$ (or $1$): in that case, to analyze the expected maximum of the Binomials it is tighter to see them as having a sub-gamma behavior. Details follow.
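Before the proof, a quick Monte Carlo sketch (ours; illustrative only, not part of the argument) of the phenomenon: for heavily biased Binomials, the observed expected maximum falls far below the sub-gaussian prediction, and is tracked much better by a Bernstein-style (sub-gamma) expression:

```python
import numpy as np

rng = np.random.default_rng(1)

def expected_max_deviation(n, k, q, trials=200):
    """Monte Carlo estimate of E[ max over k coords of |Bin(n, q)/n - q| ]."""
    draws = rng.binomial(n, q, size=(trials, k)) / n
    return float(np.mean(np.max(np.abs(draws - q), axis=1)))

n, k = 10_000, 1_000
for q in [0.3, 1e-3, 1e-5]:                     # from mildly to heavily biased
    emp = expected_max_deviation(n, k, q)
    subgauss = np.sqrt(np.log(2 * k) / (2 * n))            # regime-oblivious prediction
    subgamma = np.sqrt(q * np.log(k) / n) + np.log(k) / n  # Bernstein-style heuristic
    print(f"q={q:g}  empirical={emp:.2e}  sub-gaussian={subgauss:.2e}  sub-gamma={subgamma:.2e}")
```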
Proof of Theorem 14.
As alluded to above, we want to analyze the expected behavior of the maximum of Binomial random variables beyond the sub-gaussian regime. Invoking generic bounds for sub-gamma random variables, such as [11, Corollary 2.6], unfortunately does not lend itself to order-optimal bounds either. Instead, we rely on the “local Glivenko–Cantelli” bounds recently obtained by [17, 10], which provide a more refined upper bound: to introduce the result we will invoke, we first need some notation.
Definition 15.
Given $\mu \in [0,1]^k$, denote by $\tilde{\mu}$ the sequence defined by $\tilde{\mu}_i = \min(\mu_i, 1-\mu_i)$ for all $i \in [k]$, and by $\tilde{\mu}^{\downarrow}$ its non-increasing rearrangement. Finally, let $P_\mu$ denote the product distribution over $\{0,1\}^k$ with mean vector $\mu$.
With this in hand, the main result of [17] can be restated as follows:
Theorem 16 ([17, Theorem 3]).
Let $\mu \in [0,1]^k$, and suppose that $n$ is such that $n \ge \log k$. Let $\hat{\mu}_n$ denote the empirical estimator for $\mu$ given $n$ i.i.d. samples from $P_\mu$. Then the following holds:
$$\mathbb{E}\big[\|\hat{\mu}_n - \mu\|_\infty\big] = O\left(\sqrt{\frac{\tilde{\mu}^{\downarrow}_1\,\log k}{n}} + \frac{\log k \cdot \log\log k}{n}\right).$$
(Note that [10] recently improved on this result by up to a $\log\log k$ factor in the second term. For simplicity, we use the slightly weaker result of [17], as it is easier to manipulate.)
We want to apply this result to bounding $\mathbb{E}\big[\max_i |\bar{Y}_i - \mathbb{E}\bar{Y}_i|\big]$. However, we face one obstacle in doing so, as we do not have a product distribution over $\{0,1\}^k$: while the coordinates $\bar{Y}_1, \dots, \bar{Y}_k$ are indeed independent, and each of the form
$$\bar{Y}_i = \frac{1}{n}\sum_{j=1}^{n} B_{i,j},$$
where the $B_{i,j}$'s are independent Bernoulli random variables, these Bernoullis are not identically distributed: exactly $n_i := n\,p_i$ of them have parameter $1-q$, and the remaining $n - n_i$ have parameter $q$. Of course, we do not know the $n_i$'s, as this is what we are trying to estimate; all we have is that
$$\sum_{i=1}^{k} n_i = n.$$
To circumvent this issue, let us write, for each $i \in [k]$,
$$n\,\bar{Y}_i = S_i + T_i,$$
where $S_i \sim \mathrm{Bin}(n_i, 1-q)$ and $T_i \sim \mathrm{Bin}(n - n_i, q)$ are independent. We can then express
$$\mathbb{E}\Big[\max_i \big|\bar{Y}_i - \mathbb{E}\bar{Y}_i\big|\Big] \le \frac{1}{n}\,\mathbb{E}\Big[\max_i \big|S_i - \mathbb{E}S_i\big|\Big] + \frac{1}{n}\,\mathbb{E}\Big[\max_i \big|T_i - \mathbb{E}T_i\big|\Big]. \tag{6}$$
Now, instead of taking the maximum of $k$ sums of Bernoullis with different parameters but the same number of summands ($n$ summands), we take the maximum of sums of Bernoullis with the same parameter (i.e., Binomials) but different numbers of summands (at most $n$). This does not necessarily seem like an improvement, and still does not let us apply Theorem 16 to either of the two expectations. However, if we could argue that “adding summands to each Binomial” cannot decrease the expected maximum, then we would be in good shape: that is, we want to upper bound
$$\mathbb{E}\Big[\max_i \big|S_i - \mathbb{E}S_i\big|\Big]$$
by
$$\mathbb{E}\Big[\max_i \big|S'_i - \mathbb{E}S'_i\big|\Big],$$
where $S'_i \sim \mathrm{Bin}(n, 1-q)$ (instead of $S_i \sim \mathrm{Bin}(n_i, 1-q)$). Intuitively, this seems reasonable, as adding independent summands should make the Binomial more likely to deviate from its expectation. The next lemma makes this intuition rigorous:
Lemma 17.
Fix $p \in [0,1]$ and integers $n_1, \dots, n_k$, with $n_i \le n$ for all $i$. Let $X_1, \dots, X_k$ and $Y_1, \dots, Y_k$ be (not necessarily independent) random variables with $X_i \sim \mathrm{Bin}(n_i, p)$ and $Y_i \sim \mathrm{Bin}(n, p)$ for all $i$. Then
$$\mathbb{E}\Big[\max_i \big|X_i - \mathbb{E}X_i\big|\Big] \le \mathbb{E}\Big[\max_i \big|Y_i - \mathbb{E}Y_i\big|\Big].$$
Proof.
Set $\mu_i := \mathbb{E}X_i = n_i\,p$ and $\nu_i := \mathbb{E}Y_i = n\,p$ for all $i$. We give a coupling of the $X_i$'s and $Y_i$'s such that
$$\mathbb{E}[Y_i \mid X_1, \dots, X_k] = X_i + (n - n_i)\,p \quad \text{for all } i.$$
Such a coupling can be obtained by setting $Y_i := X_i + Z_i$, where $Z_i \sim \mathrm{Bin}(n - n_i, p)$ is independent of $(X_1, \dots, X_k)$. Then it is easy to check that $Y_i$ has the right distribution, since $X_i + Z_i \sim \mathrm{Bin}(n, p)$; and the conditional expectation is indeed as claimed. Using this coupling, we obtain
$$\mathbb{E}\Big[\max_i |X_i - \mu_i|\Big] = \mathbb{E}\Big[\max_i \big|\mathbb{E}[Y_i - \nu_i \mid X_1, \dots, X_k]\big|\Big] \le \mathbb{E}\Big[\max_i \mathbb{E}\big[|Y_i - \nu_i| \,\big|\, X_1, \dots, X_k\big]\Big] \le \mathbb{E}\Big[\max_i |Y_i - \nu_i|\Big],$$
the first inequality being Jensen’s, establishing the lemma. (More generally, via the existence of this coupling the argument shows that $X_i - \mu_i$ is dominated by $Y_i - \nu_i$ in the convex order, which in turn is equivalent to having $\mathbb{E}[f(X_i - \mu_i)] \le \mathbb{E}[f(Y_i - \nu_i)]$ for every convex function $f$.) Let $S'_1, \dots, S'_k$ (resp. $T'_1, \dots, T'_k$) be i.i.d. $\mathrm{Bin}(n, 1-q)$ (resp. $\mathrm{Bin}(n, q)$) random variables. Invoking Lemma 17 separately on the two expectations of Equation 6, we get
$$\mathbb{E}\Big[\max_i \big|\bar{Y}_i - \mathbb{E}\bar{Y}_i\big|\Big] \le \frac{1}{n}\,\mathbb{E}\Big[\max_i \big|S'_i - \mathbb{E}S'_i\big|\Big] + \frac{1}{n}\,\mathbb{E}\Big[\max_i \big|T'_i - \mathbb{E}T'_i\big|\Big].$$
Both of the terms in the RHS now fit the setting of Theorem 16. Moreover, since their parameters are $1-q$ (for the first expectation) and $q$ (for the second), in both cases the corresponding $\tilde{\mu}$ is the same (namely, the constant vector $(q, \dots, q)$), and applying the theorem will give the same upper bound for both expectations. Thus, Theorem 16 yields
$$\mathbb{E}\Big[\max_i \big|\bar{Y}_i - \mathbb{E}\bar{Y}_i\big|\Big] = O\left(\sqrt{\frac{q\,\log k}{n}} + \frac{\log k \cdot \log\log k}{n}\right).$$
Recalling the setting of $q = \frac{1}{e^{\varepsilon/2}+1}$, along with the fact that the debiasing factor $\frac{e^{\varepsilon/2}+1}{e^{\varepsilon/2}-1}$ is $O(1)$ for $\varepsilon \ge 1$, finally gives
$$\mathbb{E}\big[\|\hat{p} - p\|_\infty\big] = O\left(\sqrt{\frac{\log k}{e^{\varepsilon/2}\,n}} + \frac{\log k \cdot \log\log k}{n}\right). \tag{7}$$
To conclude, observe that the right-hand side is dominated by its first term for moderately large $\varepsilon$, and by its second term once $\varepsilon \gtrsim 2\log\frac{n}{\log k}$.
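As a sanity check on Lemma 17 and the coupling used in its proof, the following small simulation (ours) compares the expected maximum deviation before and after “completing” each Binomial to $n$ summands:

```python
import numpy as np

rng = np.random.default_rng(2)

# Lemma 17's coupling: X_i ~ Bin(n_i, p) and Z_i ~ Bin(n - n_i, p) independent,
# so that Y_i := X_i + Z_i ~ Bin(n, p). Completing each Binomial to n summands
# can only increase the expected maximum deviation.
n, p, trials = 2_000, 0.05, 500
n_i = np.array([100, 500, 1_000, 1_500, 2_000])

X = rng.binomial(n_i, p, size=(trials, len(n_i)))
Z = rng.binomial(n - n_i, p, size=(trials, len(n_i)))
Y = X + Z                                       # marginally Bin(n, p) in each coordinate

emax_X = np.mean(np.max(np.abs(X - n_i * p), axis=1))
emax_Y = np.mean(np.max(np.abs(Y - n * p), axis=1))
print(emax_X, emax_Y)                           # the first is (empirically) the smaller one
```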
3.4 Upper bound for Projective Geometry Response
Projective geometry response (PGR), introduced by [24], achieves optimal rates for $\ell_2$ error and communication, and near-optimal processing time. The protocol is based on the general template established in [4]: specifically, PGR relies on a set structure defined by projective geometry, as detailed next. For a prime power $s$ and an integer $t \ge 1$, the authors consider the $(t+1)$-dimensional vector space $\mathbb{F}_s^{t+1}$, whose nonzero vectors, taken up to scaling and represented by canonical representatives, form the projective space $\mathrm{PG}(t, s)$ of size $\frac{s^{t+1}-1}{s-1}$; each element of the domain is identified with one of these points. Each point $x$ in turn uniquely determines a projective hyperplane (itself a copy of $\mathrm{PG}(t-1, s)$), such that there are $\frac{s^t-1}{s-1}$ “high probability” elements $S_x$. Every one of these sets in turn has an intersection with every other set of size $\frac{s^{t-1}-1}{s-1}$. By choosing a prime power $s \approx e^{\varepsilon}+1$, optimal $\ell_2$ error is achieved.
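To make the construction concrete, here is a toy sketch (ours; a simplification for illustration, not the exact encoding of [24]) of a PGR-style randomiser: domain elements are canonical representatives of points of $\mathrm{PG}(t, s)$, the high-probability set of $x$ is taken to be the hyperplane $\{v : \langle u_x, v \rangle = 0 \bmod s\}$ as described above, and outputs in that set receive probability proportional to $e^{\varepsilon}$:

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)

def projective_points(t: int, s: int) -> np.ndarray:
    """Canonical representatives of PG(t, s): vectors in F_s^{t+1} whose first
    nonzero coordinate equals 1. There are (s^{t+1} - 1) / (s - 1) of them."""
    pts = [v for v in itertools.product(range(s), repeat=t + 1)
           if any(v) and v[next(i for i, c in enumerate(v) if c)] == 1]
    return np.array(pts)

def pgr_randomise(x: int, pts: np.ndarray, s: int, eps: float) -> int:
    """Output a point index, with probability proportional to e^eps on the
    hyperplane {v : <pts[x], v> = 0 mod s} and proportional to 1 elsewhere."""
    in_set = (pts @ pts[x]) % s == 0
    w = np.where(in_set, np.exp(eps), 1.0)
    return int(rng.choice(len(pts), p=w / w.sum()))

t, s = 2, 3                                    # toy sizes; the analysis picks s ~ e^eps + 1
pts = projective_points(t, s)
S0 = np.flatnonzero((pts @ pts[0]) % s == 0)   # high-probability set of input 0
print(len(pts), len(S0))                       # 13 = (3^3-1)/2 points; |S_0| = 4 = (3^2-1)/2
outs = [pgr_randomise(0, pts, s, eps=1.0) for _ in range(2000)]
print(np.isin(outs, S0).mean())                # ~ 4e / (4e + 9) ~ 0.55
```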
Theorem 18.
Projective geometry response (PGR) [24] achieves the optimal rate for $\ell_\infty$ error. More specifically, the expected error of Projective Geometry Response for frequency estimation satisfies
$$\mathbb{E}\big[\|\hat{p} - p\|_\infty\big] = O\left(\sqrt{\frac{\log k}{e^{\varepsilon}\,n}}\right) + \tilde{O}\left(\frac{\log k}{n}\right).$$
We will make use of the size of each subset and the size of each intersection as described above, as well as the probability of returning any element $y$ of the output alphabet, given $x$ as input:
$$\Pr[y \mid x] = \begin{cases}\dfrac{e^{\varepsilon}}{Z}, & y \in S_x,\\[1ex] \dfrac{1}{Z}, & y \notin S_x,\end{cases} \qquad \text{where } Z = e^{\varepsilon}\,\frac{s^t-1}{s-1} + \left(k - \frac{s^t-1}{s-1}\right). \tag{8}$$
The estimate of the frequency vector is then given by
$$\hat{p}_x = \frac{1}{c}\left(\frac{1}{n}\sum_{j=1}^{n}\mathbb{1}\{y_j \in S_x\} - \beta\right), \qquad x \in [k], \tag{9}$$
where $\beta := \Pr[y \in S_x \mid x' \neq x]$ denotes the probability that a user holding some $x' \neq x$ outputs an element of $S_x$, and
$$c := \Pr[y \in S_x \mid x] - \beta = \frac{(e^{\varepsilon}-1)\,s^{t-1}}{Z}, \tag{10}$$
so that $\hat{p}$ is an unbiased estimator of the true frequency vector $p$. Since a prime power can be found within a factor of $2$ of any number, we can choose $s$ such that
$$e^{\varepsilon} + 1 \le s \le 2\,(e^{\varepsilon} + 1). \tag{11}$$
(While we could instead choose $s$ below $e^{\varepsilon}+1$ and set the inequality to be an undershoot by a factor of two, an investigation in that direction led to worse constants and trickier analysis.)
Recall that, for every integer $m \ge 1$,
$$1 + s + s^2 + \dots + s^{m-1} = \frac{s^m - 1}{s - 1}, \tag{12}$$
an identity we will rely on extensively in the rest of the section.
Remark 19.
Note that, as introduced above, we must have $k = \frac{s^{t+1}-1}{s-1}$ for a prime power $s$ satisfying (11). Other values of $k$ must be rounded up, losing up to a factor $s$ in the domain size (i.e., working instead with a domain size $k' \le s\,k$). We hereafter ignore this detail, which does not affect the final bound of Theorem 18 unless $k$ is very small; and assume that $k$ is of the form stated above. For the same reason, we additionally can assume $t \ge 2$.
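The rounding in Remark 19 and the choice (11) only require locating a prime power within a factor of two of a target, which always exists (already for primes alone, by Bertrand's postulate); a minimal helper (ours) that searches upward from $m$:

```python
def is_prime_power(x: int) -> bool:
    """True iff x = p^a for a prime p and an integer a >= 1."""
    if x < 2:
        return False
    for p in range(2, int(x ** 0.5) + 1):
        if x % p == 0:                          # p is then the smallest prime factor
            while x % p == 0:
                x //= p
            return x == 1                       # prime power iff nothing else remains
    return True                                 # x itself is prime

def prime_power_in(m: int) -> int:
    """Smallest prime power s with m <= s < 2m (one exists by Bertrand's postulate)."""
    s = m
    while not is_prime_power(s):
        s += 1
    return s

print(prime_power_in(24), prime_power_in(100))  # 25 = 5^2, 101 (prime)
```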
Proof of Theorem 18.
We start with two lemmas which will be useful in proving the optimal error rate.
Lemma 20.
The common size of every subset $S_x$, $x \in [k]$, satisfies $\frac{k}{2s} \le |S_x| \le \frac{2k}{s}$.
Proof.
We can rewrite $|S_x|$ as $\frac{s^t - 1}{s - 1}$, which is at least $s^{t-1}$ and at most $2\,s^{t-1}$ for $s \ge 2$.
As $s^t \le k \le 2\,s^t$, which we have by applying Equations 11 and 12, the claimed bounds follow.
Now, remembering that whatever expected error we compute will be multiplied by the normalising constant $\frac{1}{c}$, we would like to bound that constant in terms of $\varepsilon$.
Lemma 21.
The value $c$ defined in (10) satisfies $\frac{1}{c} = O(1)$.
Proof.
First note that
$$\frac{1}{c} = \frac{e^{\varepsilon}\,\frac{s^t-1}{s-1} + k - \frac{s^t-1}{s-1}}{(e^{\varepsilon}-1)\,s^{t-1}} = \frac{s^t - 1}{(s-1)\,s^{t-1}} + \frac{k}{(e^{\varepsilon}-1)\,s^{t-1}},$$
and note that by a simple application of Equation 12 we get $\frac{s^t-1}{(s-1)\,s^{t-1}} \le 2$. Applying the identity again we get:
$$\frac{k}{s^{t-1}} = s + 1 + \frac{1}{s} + \dots + \frac{1}{s^{t-1}} \le s + 2, \tag{13}$$
where the final inequality comes from the fact that, with $s \ge 2$, the trailing terms are bounded by the geometric series. For $t = 1$ we have the same series, with the addition of a single term.
The result is then at most $2 + \frac{2(e^{\varepsilon}+1)+2}{e^{\varepsilon}-1} = O(1)$, by applying Equation 11. We are now ready to apply Theorem 16. As we did for RAPPOR, we will break the expected maximum into the sum of the error over two vectors of Binomials $U$ and $W$ where, recalling (8),
$$U_x \sim \mathrm{Bin}(n_x,\, \alpha), \qquad W_x \sim \mathrm{Bin}(n - n_x,\, \beta), \qquad x \in [k], \tag{14}$$
where $n_x$ denotes the number of users holding $x$, and $\alpha = \Pr[y \in S_x \mid x]$ and $\beta = \Pr[y \in S_x \mid x' \neq x]$ are as in (9) and (10). In this case we do not have that $\alpha = 1 - \beta$, so we will need to bound the following:
$$\min(\alpha, 1-\alpha) \quad \text{and} \quad \min(\beta, 1-\beta).$$
We briefly note that while RAPPOR has clear independence between coordinates, the result of Theorem 16 due to [17] does not require independence; it therefore applies to PGR, and is likely suitable for the analysis of other subset-based protocols.
First we will upper bound :
Lemma 22.
For defined in (14), we have , and so .
Proof.
First, is clearly less than when we remove from the denominator, and so is at most by Lemma 20. For the other term, notice that by Equation 13 we have , from which,
Overall, we get that . The conclusion follows from the AM-GM inequality, as . Next we bound :
Lemma 23.
We have .
Proof.
We will proceed in both cases by lower-bounding the denominators. First,
Next, we bound the other denominator, using Lemma 20.
As such, adding the reciprocals of both terms gives a bound of . The only step left to bound the error of PGR is to take the product of these bounds. Proceeding to do so, we arrive at:
which in the low privacy regime gives:
as claimed.
4 Worst-case, information-theoretic lower bounds
We will follow the “chi-squared lower bound” framework of [2] to obtain our information-theoretic lower bounds against non-interactive locally private protocols:
Theorem 24.
Fix any $\varepsilon > 0$. Any non-interactive (public- or private-coin) $\varepsilon$-LDP protocol for distribution estimation from $n$ users must have minmax expected error
$$\Omega\left(\min\left(1,\ \sqrt{\frac{\log k}{\varepsilon^2\,n}} + \sqrt{\frac{\log k}{e^{\varepsilon}\,n}} + \frac{\log k}{\varepsilon\,n}\right)\right).$$
Proof.
Suppose there exists a (non-interactive, public- or private-coin) $\varepsilon$-LDP protocol $\Pi$ for $n$ users which learns any $p$ to expected error $\alpha$ when each user gets an independent sample from $p$ as input:
$$\mathbb{E}\big[\|\hat{p} - p\|_\infty\big] \le \alpha, \tag{15}$$
where $\hat{p}$ is the output of $\Pi$ when run on the $n$ samples.
Now, consider the family of probability distributions $p^{(1)}, \dots, p^{(k)}$ over $[k]$, where, for $i \in [k]$, $p^{(i)}$ is defined by
$$p^{(i)} = (1-\gamma)\,u_k + \gamma\,\delta_i \tag{16}$$
(that is, $p^{(i)}$ is a mixture of the uniform distribution $u_k$ and a point mass on $i$, with mixture coefficients $1-\gamma$ and $\gamma$). Consider the following process: first, we select $V$ from $[k]$ uniformly at random, then generate $n$ i.i.d. samples from $p^{(V)}$ and run $\Pi$ on these samples, obtaining a hypothesis $\hat{p}$. We finally set $\hat{V}$ to be the element of the domain with the largest probability under $\hat{p}$, i.e.,
$$\hat{V} = \arg\max_{i \in [k]} \hat{p}_i$$
(breaking ties arbitrarily). The first claim is that doing so allows one to guess the correct value of $V$ with high probability: indeed, since the gap between the highest and second-highest probability of $p^{(V)}$ is $\gamma$, we have
$$\Pr[\hat{V} \neq V] \le \Pr\big[\|\hat{p} - p^{(V)}\|_\infty \ge \gamma/2\big] \le \frac{2\,\mathbb{E}\big[\|\hat{p} - p^{(V)}\|_\infty\big]}{\gamma} \le \frac{2\alpha}{\gamma}, \tag{17}$$
the second-to-last inequality being Markov’s. However, by Fano’s inequality applied to the Markov chain $V \to X^n \to M$ (where $X^n$ is the tuple of $n$ i.i.d. samples, and $M$ denotes the tuple of messages, along with the public randomness, resulting from the protocol $\Pi$), we get, recalling that $V$ is chosen uniformly in $[k]$, that
$$\Pr[\hat{V} \neq V] \ge 1 - \frac{I(V; M) + \log 2}{\log k}, \tag{18}$$
and so, putting Equations 17 and 18 together, we get
$$I(V; M) \ge \left(1 - \frac{2\alpha}{\gamma}\right)\log k - \log 2. \tag{19}$$
This gives us the first ingredient: a lower bound on $I(V; M)$. For the second, we need to obtain an upper bound on this same mutual information as a function of $n$, $k$, and $\varepsilon$. To do so, observe first that, by the chain rule,
$$I(V; M, U) = I(V; M \mid U) + I(V; U) = I(V; M \mid U),$$
the second equality following from the independence of the public randomness $U$ from $V$. This is convenient, as the messages are independent conditioned on $V$ (and $U$), and so we get
$$I(V; M) \le \sum_{j=1}^{n} I(V; M_j). \tag{20}$$
Consider any user $j$, using locally private randomizer $R_j$ (for notational simplicity, we drop hereafter the dependence on the public randomness $U$). Let $\bar{p}$ denote the average input distribution (over $[k]$), i.e., the uniform mixture of all the $p^{(i)}$'s. Then we can rewrite the mutual information as
$$I(V; M_j) = \frac{1}{k}\sum_{i=1}^{k} \mathrm{KL}\big(R_j(p^{(i)})\,\big\|\,R_j(\bar{p})\big), \tag{21}$$
where, for a given input distribution $q$, $R_j(q)$ denotes the output distribution (over the message space) induced by the randomizer $R_j$ on input $x \sim q$.
First lower bound (good for small $\varepsilon$).
We then proceed by upperbounding the KL divergence by the chi-square divergence and unrolling the latter’s definition, getting
$$I(V; M_j) \le \frac{1}{k}\sum_{i=1}^{k}\sum_{y} \frac{\big(R_j(p^{(i)})(y) - R_j(\bar{p})(y)\big)^2}{R_j(\bar{p})(y)}.$$
Observe that, for our choice of the $p^{(i)}$'s, $\bar{p}$ is simply the uniform distribution over $[k]$, and that for all $i$ we then have $R_j(p^{(i)}) - R_j(\bar{p}) = \gamma\,\big(R_j(\delta_i) - R_j(\bar{p})\big)$. Then, letting $R_j(y \mid i)$ denote the probability of message $y$ on input $i$, we get
$$I(V; M_j) \le \frac{\gamma^2}{k}\sum_{i=1}^{k}\sum_{y} \frac{\big(R_j(y \mid i) - R_j(\bar{p})(y)\big)^2}{R_j(\bar{p})(y)}. \tag{22}$$
Note that, for any fixed $y$, we have
$$\frac{1}{k}\sum_{i=1}^{k}\big(R_j(y \mid i) - R_j(\bar{p})(y)\big) = 0.$$
For simplicity, we write $R$ for $R_j$. We will use the fact that, $R$ being $\varepsilon$-LDP, we have
$$e^{-\varepsilon} \le \frac{R(y \mid x)}{R(\bar{p})(y)} \le e^{\varepsilon} \tag{23}$$
for every $x \in [k]$ and message $y$. With this in hand, starting from (22), we can write
$$I(V; M_j) = O\big(\gamma^2\,(e^{\varepsilon} - 1)^2\big),$$
where the derivation relies on the fact that $\sum_y R(\bar{p})(y) = 1$, on (23), and on the zero-sum identity above (details can be found in the full version). Using this last bound along with Equations 19 and 20, we get that
$$\left(1 - \frac{2\alpha}{\gamma}\right)\log k - \log 2 = O\big(n\,\gamma^2\,(e^{\varepsilon} - 1)^2\big), \tag{24}$$
i.e., optimizing over the choice of $\gamma$,
$$\alpha = \Omega\left(\sqrt{\frac{\log k}{n\,(e^{\varepsilon} - 1)^2}}\right) = \Omega\left(\sqrt{\frac{\log k}{n\,\varepsilon^2}}\right) \quad \text{for } \varepsilon \le 1, \tag{25}$$
showing the lower bound for small $\varepsilon$.
The second and third lower bounds can be established in an analogous fashion: the details can be found in the full version.
5 Amplification by shuffling
As mentioned in the Introduction, one of the key motivations for studying locally private histogram estimation in the low-privacy (high-$\varepsilon$) regime is the implication for histogram estimation in the shuffle model of privacy (see, e.g., [27, 26, 7, 16]), in light of the “plug-and-play” amplification-by-shuffling results allowing one to “translate” the former into the latter. Specifically, we will use the following result of Feldman, McMillan, and Talwar:
Theorem 25 ([23, Theorem 3.1]).
For any domain $\mathcal{X}$, let $R \colon \mathcal{X} \to \mathcal{Y}$ be an $\varepsilon_0$-DP local randomiser; and let $S$ be the algorithm that, given a tuple of $n$ messages $(y_1, \dots, y_n)$, samples a uniformly random permutation $\pi$ over $[n]$ and outputs $(y_{\pi(1)}, \dots, y_{\pi(n)})$. Then, for any $\delta \in (0,1]$ such that $\varepsilon_0 \le \log\left(\frac{n}{16\log(2/\delta)}\right)$, the composition $S \circ R^n$ is $(\varepsilon, \delta)$-DP, where
$$\varepsilon = O\left(\big(1 - e^{-\varepsilon_0}\big)\left(\frac{e^{\varepsilon_0/2}\,\sqrt{\log(1/\delta)}}{\sqrt{n}} + \frac{e^{\varepsilon_0}}{n}\right)\right).$$
In particular, if $\varepsilon_0 \le 1$ then $\varepsilon = O\big(\varepsilon_0\,\sqrt{\log(1/\delta)/n}\big)$, and if $\varepsilon_0 > 1$ then $\varepsilon = O\big(e^{\varepsilon_0/2}\,\sqrt{\log(1/\delta)/n}\big)$.
This implies the following:
Lemma 26 (Amplification by shuffling).
Fix any $\varepsilon \in (0, 1]$, $\delta \in (0, 1)$, and $n$ such that $\varepsilon^2 n \ge C\log(1/\delta)$, for a suitable absolute constant $C > 0$. Then, for
$$\varepsilon_0 = \log\left(\frac{\varepsilon^2\,n}{C\,\log(1/\delta)}\right),$$
shuffling the messages of $n$ users using the same $\varepsilon_0$-LDP randomizer satisfies (robust) $(\varepsilon, \delta)$-shuffle differential privacy.
Proof.
Note that for $\varepsilon$ as in the statement and $\varepsilon_0$ as defined, we have $\varepsilon_0 \le \log\left(\frac{n}{16\log(2/\delta)}\right)$. Applying Theorem 25, we get privacy parameter
$$O\left(e^{\varepsilon_0/2}\,\sqrt{\frac{\log(1/\delta)}{n}}\right) = O\left(\sqrt{\frac{\varepsilon^2\,n}{C\,\log(1/\delta)}\cdot\frac{\log(1/\delta)}{n}}\right) = O\left(\frac{\varepsilon}{\sqrt{C}}\right) \le \varepsilon$$
(for a suitable choice of $C$), which proves the statement.
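For intuition, here is a small numeric sketch (ours) of Lemma 26's parameter inversion. The constant `C` is an assumed placeholder standing in for the unspecified constant of the lemma (the exact value follows from Theorem 25), so the output should be read qualitatively:

```python
import math

def local_eps_for_shuffle(eps: float, delta: float, n: int, C: float = 16.0) -> float:
    """Local budget eps0 = ln(eps^2 * n / (C * ln(1/delta))) that, after shuffling
    the n reports, targets (eps, delta)-DP. C is an assumed placeholder constant."""
    assert 0 < eps <= 1 and 0 < delta < 1
    eps0 = math.log(eps ** 2 * n / (C * math.log(1 / delta)))
    # side condition in the spirit of Theorem 25: eps0 <= ln(n / (16 ln(2/delta)))
    assert eps0 <= math.log(n / (16 * math.log(2 / delta)))
    return eps0

# e.g., n = 100,000 users, target (0.5, 1e-6)-shuffle-DP:
print(local_eps_for_shuffle(eps=0.5, delta=1e-6, n=100_000))   # ~ 4.7: a low-privacy local budget
```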
Theorem 27.
For $\varepsilon \le \sqrt{\log(1/\delta)/\log k}$ and $\delta \in (0,1)$ (with $n$ as in Lemma 26), Shuffled Projective Geometry Response achieves maximum error
$$O\left(\frac{\sqrt{\log k\,\log(1/\delta)}}{\varepsilon\,n}\right),$$
with $O(\log k)$ bits of communication (and one single message) per user.
Proof.
For $\varepsilon$ as stated, the restriction on the parameters from Lemma 26 holds, and, setting $\varepsilon_0$ as in the lemma, we also have $\varepsilon_0 \ge 1$. We then invoke the bound of Theorem 18, focusing on the first part of the bound, which dominates when $e^{\varepsilon_0} \le n/\log k$. This leads to an upper bound on the error of the order
$$\sqrt{\frac{\log k}{e^{\varepsilon_0}\,n}} = \sqrt{\frac{C\,\log k\,\log(1/\delta)}{\varepsilon^2\,n^2}} = O\left(\frac{\sqrt{\log k\,\log(1/\delta)}}{\varepsilon\,n}\right),$$
as desired. It only remains to argue that the first term of the bound did, indeed, dominate the error. As mentioned above, this is the case whenever
$$\sqrt{\frac{\log k}{e^{\varepsilon_0}\,n}} \ge \tilde{O}\left(\frac{\log k}{n}\right),$$
for which a weaker, sufficient condition is $e^{\varepsilon_0} \le \frac{n}{\log k}$, that is, $\varepsilon^2\,\log k \le C\,\log(1/\delta)$. But this follows from our assumption on $\varepsilon$, such that $\varepsilon \le \sqrt{\log(1/\delta)/\log k}$. Comparing this result with the summary of local, shuffle, and central histogram error bounds available in [16, Table 1] shows that shuffled PGR achieves the best error of any protocol which sends a constant number of messages.
More specifically, focusing on three representative known protocols: the only known one-message-per-user protocol, due to [16], achieves much worse error (and has the same restriction on the parameters), and requires more bits of communication per user; while the protocol of [26], which achieves the same error as Theorem 27, uses multiple messages per user. Finally, a protocol of [6] does achieve better error, but at the cost of performing multiple rounds, each with additional communication per user. Thus, our bounds demonstrate that shuffled PGR achieves state-of-the-art $\ell_\infty$-error with only one message of $O(\log k)$ bits.
6 Discussion and future work
The logarithmic factor in the upper bound.
We prove our upper bounds using the result of [17], which is where the $\log\log k$ term appears. The authors conjectured that this term could be removed in general, but follow-up work [10] demonstrated by counterexample that this is not the case. They do, however, show that there is a distribution-dependent interpolation between this term being constant and being $\log\log k$. This logarithmic factor has implications for understanding which error regime is dominating, given the parameters $(n, k, \varepsilon)$ of the algorithm. While it is possible to imagine that a careful application of the tools in [10] could resolve this matter in general or for a specific use case, in the meantime we emphasise that empirical analysis is still crucial.
Tighter analysis of other protocols.
While we analyse RAPPOR, which admits a simple analysis due to the independence of coordinates, and Projective Geometry Response, which represents the state of the art in low-communication LDP protocols, we believe the tools introduced in this paper are applicable in a very general way to most LDP protocols. Of particular interest would be Subset Selection, which has optimal mutual information between the inputs and outputs of the local randomiser, and the two Hadamard Response protocols that we include in our empirical analysis.
Histograms in the shuffle model.
While our shuffle DP result (Theorem 27) yields better error, communication, and number of messages, it does have a limitation on the parameter range, namely $\varepsilon \lesssim \sqrt{\log(1/\delta)/\log k}$. It would be interesting to weaken this requirement to match the central DP one. Moreover, one could hope to further improve the resulting error to the central DP bound (i.e., a $\log(1/\delta)$ dependence instead of $\sqrt{\log k\,\log(1/\delta)}$), or, alternatively, prove a matching lower bound separating the two models.
Broader applicability of lower bound techniques.
While we prove lower bounds for the specific case of frequency estimation, the tools used are extremely general, and we imagine that their application could provide new lower bounds for a variety of problems in the LDP setting, especially with the low-privacy regime in mind.
References
- [1] Jayadev Acharya, Clément L. Canonne, Ziteng Sun, and Himanshu Tyagi. Unified lower bounds for interactive high-dimensional estimation under information constraints. In NeurIPS, 2023.
- [2] Jayadev Acharya, Clément L. Canonne, and Himanshu Tyagi. Inference under information constraints I: Lower bounds from chi-square contraction. IEEE Transactions on Information Theory, 66(12):7835–7855, 2020. doi:10.1109/TIT.2020.3028440.
- [3] Jayadev Acharya and Ziteng Sun. Communication complexity in locally private distribution estimation and heavy hitters. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 51–60, Long Beach, California, USA, 2019. PMLR. URL: http://proceedings.mlr.press/v97/acharya19c.html.
- [4] Jayadev Acharya, Ziteng Sun, and Huanyu Zhang. Hadamard response: Estimating distributions privately, efficiently, and with little communication. In Kamalika Chaudhuri and Masashi Sugiyama, editors, Proceedings of Machine Learning Research, volume 89 of Proceedings of Machine Learning Research, pages 1120–1129. PMLR, 2019. URL: http://proceedings.mlr.press/v89/acharya19a.html.
- [5] Apple Privacy Team. Learning with privacy at scale, 2017. URL: https://machinelearning.apple.com/research/learning-with-privacy-at-scale.
- [6] Victor Balcer and Albert Cheu. Separating Local & Shuffled Differential Privacy via Histograms, April 2020. arXiv:1911.06879, doi:10.48550/arXiv.1911.06879.
- [7] Victor Balcer and Albert Cheu. Separating local & shuffled differential privacy via histograms. In ITC, volume 163 of LIPIcs, pages 1:1–1:14. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2020. doi:10.4230/LIPICS.ITC.2020.1.
- [8] Raef Bassily and Adam Smith. Local, Private, Efficient Protocols for Succinct Histograms. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 127–135, June 2015. doi:10.1145/2746539.2746632.
- [9] Andrea Bittau, Úlfar Erlingsson, Petros Maniatis, Ilya Mironov, Ananth Raghunathan, David Lie, Mitch Rudominer, Ushasree Kode, Julien Tinnés, and Bernhard Seefeld. Prochlo: Strong privacy for analytics in the crowd. CoRR, abs/1710.00901, 2017. arXiv:1710.00901.
- [10] Moïse Blanchard and Vaclav Voracek. Tight bounds for local glivenko-cantelli. In Claire Vernade and Daniel Hsu, editors, Proceedings of The 35th International Conference on Algorithmic Learning Theory, volume 237 of Proceedings of Machine Learning Research, pages 179–220. PMLR, February 2024. URL: https://proceedings.mlr.press/v237/blanchard24a.html.
- [11] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. OUP Oxford, 2013.
- [12] V. V. Buldygin and K. K. Moskvichova. The sub-Gaussian norm of a binary random variable. Theory of Probability and Mathematical Statistics, 86:33–49, August 2013. doi:10.1090/S0094-9000-2013-00887-4.
- [13] Clément L. Canonne. A short note on learning discrete distributions, 2020. arXiv:2002.11457.
- [14] Wei-Ning Chen, Peter Kairouz, and Ayfer Özgür. Breaking the communication-privacy-accuracy trilemma. IEEE Trans. Inf. Theory, 69(2):1261–1281, 2023. doi:10.1109/TIT.2022.3218772.
- [15] Albert Cheu, Adam D. Smith, Jonathan R. Ullman, David Zeber, and Maxim Zhilyaev. Distributed differential privacy via shuffling. In EUROCRYPT (1), volume 11476 of Lecture Notes in Computer Science, pages 375–403. Springer, 2019. doi:10.1007/978-3-030-17653-2_13.
- [16] Albert Cheu and Maxim Zhilyaev. Differentially private histograms in the shuffle model from fake users. In SP, pages 440–457. IEEE, 2022. doi:10.1109/SP46214.2022.9833614.
- [17] Doron Cohen and Aryeh Kontorovich. Local glivenko-cantelli. In COLT, volume 195 of Proceedings of Machine Learning Research, page 715. PMLR, 2023. URL: https://proceedings.mlr.press/v195/cohen23a.html.
- [18] Graham Cormode, Somesh Jha, Tejas Kulkarni, Ninghui Li, Divesh Srivastava, and Tianhao Wang. Privacy at scale: Local differential privacy in practice. In SIGMOD Conference, pages 1655–1658. ACM, 2018. doi:10.1145/3183713.3197390.
- [19] John C. Duchi, Michael I. Jordan, and Martin J. Wainwright. Local privacy and statistical minimax rates. In 54th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2013, pages 429–438. IEEE Computer Society, 2013. doi:10.1109/FOCS.2013.53.
- [20] John C. Duchi and Ryan Rogers. Lower bounds for locally private estimation via communication complexity. In COLT, volume 99 of Proceedings of Machine Learning Research, pages 1161–1191. PMLR, 2019. URL: http://proceedings.mlr.press/v99/duchi19a.html.
- [21] Úlfar Erlingsson, Vitaly Feldman, Ilya Mironov, Ananth Raghunathan, Shuang Song, Kunal Talwar, and Abhradeep Thakurta. Encode, Shuffle, Analyze Privacy Revisited: Formalizations and Empirical Evaluation. arXiv:2001.03618 [cs], January 2020. arXiv:2001.03618.
- [22] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM Conference on Computer and Communications Security, CCS ’14, pages 1054–1067, New York, NY, USA, 2014. ACM. doi:10.1145/2660267.2660348.
- [23] Vitaly Feldman, Audra McMillan, and Kunal Talwar. Hiding among the clones: A simple and nearly optimal analysis of privacy amplification by shuffling. In 62nd IEEE Annual Symposium on Foundations of Computer Science, FOCS 2021, Denver, CO, USA, February 7-10, 2022, pages 954–964. IEEE, 2021. doi:10.1109/FOCS52979.2021.00096.
- [24] Vitaly Feldman, Jelani Nelson, Huy Nguyen, and Kunal Talwar. Private frequency estimation via projective geometry. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 6418–6433. PMLR, July 2022. URL: https://proceedings.mlr.press/v162/feldman22a.html.
- [25] Vitaly Feldman and Kunal Talwar. Lossless compression of efficient private local randomizers. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 3208–3219. PMLR, July 2021. URL: https://proceedings.mlr.press/v139/feldman21a.html.
- [26] Badih Ghazi, Noah Golowich, Ravi Kumar, Rasmus Pagh, and Ameya Velingker. On the power of multiple anonymous messages: Frequency estimation and selection in the shuffle model of differential privacy. In EUROCRYPT (3), volume 12698 of Lecture Notes in Computer Science, pages 463–488. Springer, 2021. doi:10.1007/978-3-030-77883-5_16.
- [27] Badih Ghazi, Ravi Kumar, Pasin Manurangsi, and Rasmus Pagh. Private counting from anonymous messages: Near-optimal accuracy with vanishing communication overhead. In ICML, volume 119 of Proceedings of Machine Learning Research, pages 3505–3514. PMLR, 2020. URL: http://proceedings.mlr.press/v119/ghazi20a.html.
- [28] Justin Hsu, Sanjeev Khanna, and Aaron Roth. Distributed private heavy hitters. In ICALP (1), volume 7391 of Lecture Notes in Computer Science, pages 461–472. Springer, 2012. doi:10.1007/978-3-642-31594-7_39.
- [29] Ziyue Huang, Yuan Qiu, Ke Yi, and Graham Cormode. Frequency estimation under multiparty differential privacy: One-shot and streaming. Proc. VLDB Endow., 15(10):2058–2070, 2022. doi:10.14778/3547305.3547312.
- [30] Peter Kairouz, Sewoong Oh, and Pramod Viswanath. Extremal mechanisms for local differential privacy. Journal of Machine Learning Research, 17(17):1–51, 2016. URL: http://jmlr.org/papers/v17/15-135.html.
- [31] Sudeep Kamath, Alon Orlitsky, Dheeraj Pichapati, and Ananda Theertha Suresh. On Learning Distributions from their Samples. In Proceedings of The 28th Conference on Learning Theory, pages 1066–1100. PMLR, June 2015. URL: http://proceedings.mlr.press/v40/Kamath15.html.
- [32] Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.
- [33] Shaowei Wang, Liusheng Huang, Pengzhan Wang, Yiwen Nie, Hongli Xu, Wei Yang, Xiang-Yang Li, and Chunming Qiao. Mutual information optimally local private discrete distribution estimation. arXiv, abs/1607.08025, 2016. arXiv:1607.08025.
- [34] Min Ye and Alexander Barg. Optimal schemes for discrete distribution estimation under locally differential privacy. IEEE Transactions on Information Theory, 64(8):5662–5676, 2018. doi:10.1109/TIT.2018.2809790.