
Count on Your Elders: Laplace vs Gaussian Noise

Joel Daniel Andersson, BARC, University of Copenhagen, Copenhagen, Denmark; Rasmus Pagh, BARC, University of Copenhagen, Copenhagen, Denmark; Teresa Anna Steiner, University of Southern Denmark, Odense, Denmark; Sahel Torkamani, University of Edinburgh, Edinburgh, United Kingdom
Abstract

In recent years, Gaussian noise has become a popular tool in differentially private algorithms, often replacing Laplace noise which dominated the early literature on differential privacy. Gaussian noise is the standard approach to approximate differential privacy, often resulting in much higher utility than traditional (pure) differential privacy mechanisms. In this paper we argue that Laplace noise may in fact be preferable to Gaussian noise in many settings, in particular when we seek to achieve (ε,δ)-differential privacy for small values of δ. We consider two scenarios:

First, we consider the problem of counting under continual observation and present a new generalization of the binary tree mechanism that uses a k-ary number system with negative digits to improve the privacy-accuracy trade-off. Our mechanism uses Laplace noise and whenever δ is sufficiently small it improves the mean squared error over the best possible (ε,δ)-differentially private factorization mechanisms based on Gaussian noise. Specifically, using k=19 we get an asymptotic improvement over the bound given in the work by Henzinger, Upadhyay and Upadhyay (SODA 2023) when δ = O(T^{−0.92}).

Second, we show that the noise added by the Gaussian mechanism can always be replaced by Laplace noise of comparable variance for the same (ε,δ)-differential privacy guarantee, and in fact for sufficiently small δ the variance of the Laplace noise becomes strictly better. This challenges the conventional wisdom that Gaussian noise should be used for high-dimensional noise.

Finally, we study whether counting under continual observation may be easier in an average-case sense than in a worst-case sense. We show that, under pure differential privacy, the expected worst-case error for a random input must be Ω(log(T)/ε), matching the known lower bound for worst-case inputs.

Keywords and phrases:
differential privacy, continual observation, streaming, prefix sums, trees
Funding:
Joel Daniel Andersson: Supported by a Data Science Distinguished Investigator grant from the Novo Nordisk Fonden and VILLUM FONDEN grant 54451.
Rasmus Pagh: Supported by a Data Science Distinguished Investigator grant from the Novo Nordisk Fonden and VILLUM FONDEN grant 54451.
Teresa Anna Steiner: Supported by a research grant (VIL51463) from VILLUM FONDEN.
Copyright and License:
© Joel Daniel Andersson, Rasmus Pagh, Teresa Anna Steiner, and Sahel Torkamani; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Security and privacy → Privacy-preserving protocols
Related Version:
Full Version: https://arxiv.org/abs/2408.07021
Acknowledgements:
We are grateful to Monika Henzinger and Jalaj Upadhyay for discussions on continual counting that inspired part of this work.
Editors:
Mark Bun

1 Introduction

In private data analysis the goal is to release the result of computations on datasets while safeguarding the privacy of the individuals whose data make up the dataset. A popular framework for achieving this is differential privacy [17], which gives probabilistic guarantees on how much can be learned about an individual’s data from the result of a computation. Algorithms providing differential privacy (DP) guarantees conceptually introduce noise, randomly perturbing the result of a corresponding exact, non-private computation. The trade-off between utility and privacy guarantees depends on the nature of the computation.

Let 𝐱, 𝐱′ ∈ {0,1}^T be input datasets. We say 𝐱 and 𝐱′ are neighboring if they are equal except for a single bit. We want to compute all prefix sums of 𝐱 with differential privacy under this neighboring relation, i.e., we want to privately release A𝐱 where A is the lower-triangular all-1s matrix. If we add the restriction that the bits 𝐱_1, 𝐱_2, …, 𝐱_T are received one by one as a stream, and we are to output (A𝐱)_i on receiving 𝐱_i, then this problem is referred to as counting under continual observation, or continual counting for short [18, 8].

The standard solution to this problem is the factorization mechanism [35], and it involves (1) factorizing A into matrices L, R such that A = LR, (2) privately releasing R𝐱 by adding noise 𝐳, and then (3) multiplying the private release by L from the left: L(R𝐱 + 𝐳) = A𝐱 + L𝐳. Defining the ℓ_p sensitivity as Δ_p = max_{i∈[T]} ‖R𝐞_i‖_p, where 𝐞_i is the ith unit basis vector, classical results on DP say that adding Laplace noise scaled to Δ_1 yields ε-DP (also called pure DP), and adding Gaussian noise scaled to Δ_2 yields (ε,δ)-DP (also called approximate DP). The scheme rests on the observation that while the sensitivity of A may be large, the sensitivity of R can be much smaller, and so adding noise L𝐳 to the final result may result in better accuracy than adding noise directly to A𝐱.
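As a concrete illustration of the factorization mechanism, the following sketch is our own (the function and variable names are not from the paper): it privatizes R𝐱 with Laplace noise calibrated to the largest column ℓ_1-norm of R and post-processes with L; instantiating it with the trivial factorization L = A, R = I recovers the plain Laplace mechanism on the stream.

```python
import numpy as np

def factorization_mechanism_laplace(L, R, x, eps, rng=None):
    """Release A @ x for A = L @ R by privatizing R @ x with Laplace noise.

    Illustrative sketch (not the paper's code): the noise is calibrated to the
    l1-sensitivity of R, i.e. its largest column l1-norm, which gives eps-DP
    when neighboring inputs differ in a single entry by at most 1.
    """
    rng = np.random.default_rng() if rng is None else rng
    delta_1 = np.abs(R).sum(axis=0).max()       # max column l1-norm of R
    z = rng.laplace(scale=delta_1 / eps, size=R.shape[0])
    return L @ (R @ x + z)                      # equals A @ x + L @ z

T = 8
A = np.tril(np.ones((T, T)))                    # lower-triangular all-1s matrix
x = np.array([1, 0, 1, 1, 0, 1, 0, 1], dtype=float)
# Trivial factorization L = A, R = I: the plain Laplace mechanism on the stream.
print(factorization_mechanism_laplace(A, np.eye(T), x, eps=1.0))
```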

While continual counting is a deceptively simple problem, we note that it still has significant gaps between upper and lower bounds [18, 29, 21, 19, 11]. The problem is used as a subroutine in various applications ranging from private learning [37, 33, 10, 14, 9] to histogram estimation [6, 7, 32, 41], so any improvement in upper bounds affects these applications.

1.1 Our contributions

Our first contribution is an easy-to-implement, tree-aggregation-based mechanism for continual counting under ε-DP, based on Laplace noise, with the properties stated in Theorem 1.

Theorem 1.

Given a constant odd integer k ≥ 3 and an integer T ≥ 2, there exists an ε-DP algorithm for continual counting that for a stream 𝐱_1, 𝐱_2, …, 𝐱_T achieves mean squared error (k(1−1/k²)/(2ε²·log(k)³))·log(T)³ + o(log(T)³), computing all outputs in time O(T) and space O(log T).

We denote the base-k logarithm by log_k, and define log(·) = log_2(·). To the best of our knowledge, this is the best bound on the mean squared error known for any ε-DP mechanism, up to lower-order asymptotically vanishing terms. It improves the error achieved by applying techniques in [31] to the binary tree mechanism (as implemented in [33] and referred to elsewhere as “Honaker Online”) by a factor 4 when k=19. While this is interesting in its own right, we note that this improvement in the leading constant has implications for how the error compares to (ε,δ)-DP mechanisms. Henzinger, Upadhyay and Upadhyay [29] produce a factorization with a bound on the mean squared error of C_{ε,δ}²·(1 + ln(4T/5)/π)², where C_{ε,δ} = (2/ε)·√(4/9 + ln((1/δ)·√(2/π))), and show that no factorization mechanism based on Gaussian noise can asymptotically improve on this error. (As the constant C_{ε,δ} is valid for all ε, δ ∈ (0,1), it seems plausible that it can be reduced via a tighter analysis of the Gaussian mechanism, as done in for example [4].) We improve on their error using Laplace noise for small enough δ.

Theorem 2.

Given an integer T ≥ 2, there exists an ε-DP algorithm for continual counting, based on Laplace noise, that improves on the mean squared error bound given in [29], up to lower-order asymptotically vanishing terms, whenever δ = O(T^{−0.92}).

Given that δ is commonly set to be o(1/T), this result is surprising. The phenomenon is discussed in Section 5, but in a nutshell a sufficiently small δ must cause the error to greatly increase if the mechanism is based on Gaussian noise. While improving constants for ε-DP is of independent interest, Theorem 2 implies that further improvements also have implications for (ε,δ)-DP.

Motivated by the dominance of Gaussian noise in applications, and the moral of Theorem 2 for Laplace noise, our second contribution is an (ε,δ)-DP guarantee for the Laplace mechanism in terms of given ℓ_1 and ℓ_2 sensitivities.

Theorem 3 (Approximate DP for the Laplace Mechanism).

Let f : 𝒳^n → ℝ^d be a function with ℓ_p-sensitivity Δ_p ≥ max_{𝒟∼𝒟′} ‖f(𝒟) − f(𝒟′)‖_p. For a given dataset 𝒟 ∈ 𝒳^n, δ ∈ (0,1) and λ > Δ_1, the output f(𝒟) + 𝖫𝖺𝗉(λ)^d satisfies (ε,δ)-differential privacy where

ε = min{ Δ_1/λ ,  (Δ_2/λ)·( Δ_2/(2λ) + √(2·ln(1/δ)) ) }.

Essentially we can implement (ε,δ)-DP for the Laplace mechanism where the resulting ε is never worse than what simple composition gives, while also being competitive with the Gaussian mechanism in the regime where it is advantageous.

Finally, in Appendix D we show the following lower bound, which matches the classical packing lower bound for continual observation under pure differential privacy [18, 20] in the case where the input is random.

Theorem 4.

Let 𝒰 be the distribution of strings of length T where every element in the string is drawn from B(1/2) (the Bernoulli distribution with p = 1/2). Let ℳ be an ε-differentially private algorithm that solves the continual counting problem for strings from 𝒰 with error at most α with probability at least 3/4 (probability over the random input and the random choices of the algorithm). Then α = Ω(ε^{−1}·log T).

1.2 Overview of technical ideas

Improved tree-based mechanism

We start out by building a complete k-ary tree of height h = ⌈log_k(T+1)⌉ on top of the input 𝐱 ∈ {0,1}^T, see Figure 1. To analyze this, it turns out that expressing a time step t as a k-ary number, i.e., as h digits in {0,…,k−1}, so that 𝐭 ∈ {0,…,k−1}^h, gives a direct means of computing which vertices contribute to a given prefix sum. Increasing the arity of the tree reduces the height, which reduces the ℓ_1-sensitivity and the number of levels from which vertices might get added, but increases the number of vertices to add per level. To mitigate the increase in the number of vertices to add, we leverage subtraction, see Figure 2.

To analyze the structure of the resulting prefix sums when subtraction is introduced, it turns out that rather than expressing t in digits from 0 to k−1, we should instead allow for negative digits. Focusing on the case of odd k, we thus represent t as 𝐭 ∈ {−(k−1)/2, …, (k−1)/2}^h. The interpretation of 𝐭 becomes: if 𝐭_ℓ ≥ 0, add 𝐭_ℓ leftmost children on level ℓ; otherwise, subtract |𝐭_ℓ| rightmost children on level ℓ. Identifying that ‖𝐭‖_1 is equal to the number of vertices used to produce the estimate at time t, we can analyze the average number of vertices added for an estimate from this representation by studying ‖𝐭‖_1 for t ∈ [1,T]. This gives, roughly, a factor-2 reduction in the mean squared error versus not using subtraction.

Figure 1: Complete 5-ary tree with 25 leaves containing inputs 𝐱_1,…,𝐱_{25} and inner vertices containing subtree sums. To compute the sum of the first 24 inputs we can add 4 inner vertices and 4 leaves (shown in green) – in general, the worst-case number of vertices for k-ary trees is k−1 per level. When subtree sums are made private by random (Laplace) noise the variance of a prefix sum estimator is proportional to the number of terms added. In Section 3.2 we analyze the privacy and utility of this natural generalization of the binary tree mechanism to k-ary trees, among other things showing that the mean number of terms is about half of the worst case.
(a) Prefix sum with only vertex addition.
(b) Prefix sum with vertex subtraction.
Figure 2: Complete ternary trees, each with 9 leaves containing inputs x_1,…,x_9 and inner vertices containing subtree sums. The trees illustrate two different possibilities for computing the sum of the first 8 inputs: (2(a)) add two inner vertices and two leaf vertices, or (2(b)) subtract one vertex (shown in red) from the sum stored in the root vertex (shown in green). The variance of using the latter method, with subtraction, is half that of the former method, assuming the noise distribution for each vertex is the same. In Section 3.3 we analyze the privacy and utility of our new generalization of the binary tree mechanism to k-ary trees that makes use of subtraction. Among other things we show that the worst-case number of terms needed per level is ⌊k/2⌋, and the mean number of terms is about half of the worst case.

Factorization mechanism with Laplace noise

To derive the (ε,δ)-DP guarantee of the Laplace mechanism (Theorem 3), we make use of a heterogeneous composition theorem [34]. Given two arbitrary multidimensional neighboring inputs, we argue for the privacy loss in each coordinate when one of them is released with respect to its neighbor. To get the aggregate privacy guarantee for the entire vector, we apply the composition theorem over the individual coordinates being released, and show that the guarantee must hold for any two neighboring outputs. It turns out that, as in the case of the Gaussian mechanism, the (ε,δ)-DP guarantee for Laplace noise naturally depends on the ℓ_2-sensitivity of the function being released.

Lower bound for random inputs

Finally, the lower bound of Theorem 4 is shown by a variation of the packing argument that has been used to show an Ω(log(T)/ε) lower bound for worst-case inputs. The core idea is to construct a set of correlated input strings that are individually close to uniformly random, but a continual observation mechanism that has low maximum error on all of them would be able to reliably distinguish these input strings. More precisely, for integers T, B, m = T/B and k ≤ B/4, we construct our collection of strings 𝐱^(0), 𝐱^(1), …, 𝐱^(m) by first decomposing [T] into blocks of size B: B_1 = [1,B], B_2 = [B+1,2B], …, B_m = [T−B+1,T], and then doing the following sampling process: 𝐱^(0) is sampled uniformly from {0,1}^T conditioned on each block B_i containing between B/4 and 3B/4 1s, and for i ∈ [m], 𝐱^(i) is derived from 𝐱^(0) by flipping k 0s to 1s in block B_i. We show that each individual string has o(1) total variation distance to the uniform distribution over {0,1}^T, and thus a mechanism accurate on the uniform distribution will also be accurate on our random strings. Also, pairs of our random strings will be 2k-neighboring with probability 1. The key insight is that the difference of the true count of 𝐱^(i) minus 𝐱^(0) is k at the end of B_i, and zero at the end of each preceding B_j for j < i, and this event is detectable if the mechanism is sufficiently accurate on both inputs. Choosing parameters carefully, a packing-like argument in the end shows that a too low error on uniformly distributed inputs contradicts differential privacy.
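As an illustration of this construction (a sketch of our own), the following code samples 𝐱^(0) by per-block rejection sampling and derives each 𝐱^(i) by flipping k zeros to ones inside block B_i; the prefix-sum difference between 𝐱^(i) and 𝐱^(0) is 0 at the end of blocks preceding B_i and exactly k at the end of B_i.

```python
import numpy as np

def sample_hard_inputs(T, B, k, rng=None):
    """Sample the correlated strings x^(0), ..., x^(m) from the sketch above.

    Our illustration: m = T // B blocks of size B, x^(0) uniform conditioned on
    each block holding between B/4 and 3B/4 ones, and x^(i) obtained from x^(0)
    by flipping k zeros to ones inside block B_i.
    """
    assert T % B == 0 and k <= B // 4
    rng = np.random.default_rng() if rng is None else rng
    m = T // B
    x0 = np.empty(T, dtype=int)
    for i in range(m):
        while True:                              # rejection sampling per block
            block = rng.integers(0, 2, size=B)
            if B / 4 <= block.sum() <= 3 * B / 4:
                x0[i * B:(i + 1) * B] = block
                break
    strings = [x0]
    for i in range(m):
        xi = x0.copy()
        zeros = np.flatnonzero(xi[i * B:(i + 1) * B] == 0) + i * B
        xi[rng.choice(zeros, size=k, replace=False)] = 1
        strings.append(xi)                       # x^(i): k extra ones in block B_i
    return strings

xs = sample_hard_inputs(T=64, B=16, k=4)
print(np.cumsum(xs[2])[15] - np.cumsum(xs[0])[15],   # 0 at the end of B_1
      np.cumsum(xs[2])[31] - np.cumsum(xs[0])[31])   # k = 4 at the end of B_2
```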

1.3 Related work

Throughout this section, we focus on unbiased mechanisms for continual counting with “good error”. Here “error” is a shorthand for the mean squared error over T outputs, and “good” is an error of O(log(T)³) for ε-DP mechanisms and O(log(T)²·log(1/δ)) for (ε,δ)-DP mechanisms. We note that there exist some biased mechanisms [19], but they focus on reducing the error for sparse inputs and give no improvement for dense streams. Unless stated otherwise, constant factor improvements are always for ε-DP.

Historically, the first mechanisms with good error are typically credited to Chan, Shi, and Song [8] and Dwork, Naor, Pitassi, and Rothblum [18], though similar constructions were proposed independently by others [25, 22]. These results are commonly referred to as binary tree mechanisms. Honaker [31] observed that the full tree used by the binary tree mechanism can be used to produce more efficient estimates since each prefix query can be computed using different combinations of vertices. A subset of Honaker’s techniques can be used to reduce the mean squared error by up to a factor 2, at some cost in efficiency. Rather than leveraging the density of the tree to improve the error, Andersson and Pagh [1] identified that making the tree sparser – only using a subset of the leaves to store inputs – allowed for a factor 2 reduction in the error while keeping the variance of outputs constant, for (ε,δ)-DP. Although they do not state a result for it, using their technique for ε-DP yields a factor 4 improvement over the binary tree mechanism, which is still more than a factor 2 away from our improvement. We also mention the work of Qardaji, Yang, and Li [38] which, while focusing on range queries, did investigate reducing the mean squared error by means of considering trees of higher arity. They claim an optimal arity of k=16, but average over a different set of queries, and do not leverage the subtraction of vertices. Cardoso and Rogers [6] used k-ary trees for prefix sums as a subroutine for (ε,δ)-DP and investigated choosing k to minimize the maximum variance.

Many recent papers [26, 2, 28, 16, 14, 29, 21] have focused on improving the error for continual counting for (ε,δ)-DP by directly searching among factorizations of A, the lower-triangular all-1s matrix. Most notably, Henzinger, Upadhyay, and Upadhyay [29] showed that there exists a near-optimal factorization L = R = A^{1/2} (also identified by Bennett [5]) whose error is optimal up to the leading constant across all factorizations. But, importantly, this is for (ε,δ)-DP and assumes using Gaussian noise. Due to the high ℓ_1 sensitivity of R = A^{1/2}, the noise needed to release R𝐱 with the (conventional) Laplace mechanism results in an ε-DP matrix mechanism with worse error than that of the binary mechanism.

The second part of our paper explores the use of Laplace noise for (ε,δ)-DP, especially in the small-δ regime – a regime we are not the first to investigate [40]. We claim no novelty in pointing out that the Gaussian mechanism scales poorly for small δ (see e.g. [15]). We do note that there is a rich literature investigating alternative (non-Gaussian) noise distributions for approximate DP [24, 30, 36, 13, 23, 42]. In [39] they compare how Laplace and Gaussian noise behave over many compositions, and in particular observe that both eventually converge to Gaussian privacy loss distributions.

Finally, we turn to lower bounds. While [29] recently established an Ω(log(T)²) lower bound on the mean squared error, the strongest known lower bound on the maximum error remains the Ω(log T) result of [18]. The hard inputs in [18] (see the proof in [20] for details) are very sparse vectors with a single block of 1s that can be reliably distinguished by any algorithm that outputs prefix sums with noise significantly smaller than the block size, yet are close to each other in neighbor distance. This structure means that a “packing argument” can be used to show a lower bound on the amount of noise needed. One might wonder if most inputs are easier and allow a smaller error – our lower bound shows that this is not the case. Some algorithmic problems have the property that worst-case problem instances can be reduced to random instances (for example [27]). Though we are not aware of such a reduction for continual counting, our lower bound can be seen as an instantiation of this general principle.

2 Preliminaries

Before introducing any mechanism, we first introduce some notation and background. We use ℕ to denote the positive integers, and for a, b ∈ ℕ let [a,b] refer to the set of integers {a, a+1, …, b}. A vector-valued quantity 𝐲 ∈ ℝ^n is expressed in bold notation and its scalar entries are written as 𝐲_i for i ∈ [1,n]. Standard definitions for ε-DP and (ε,δ)-DP are given in Appendix A.

As the main goal of this paper is the study and design of differentially private algorithms for counting queries, we will first formalize the problem. Let the vector 𝐱 ∈ {0,1}^T be an input stream, where 𝐱_t is received at time t. Then, two input streams 𝐱, 𝐱′ are neighboring inputs, denoted 𝐱 ∼ 𝐱′, if they take on the same value at every time step except one. The target function we wish to privately release is f : {0,1}^T → ℝ^T where f(𝐱)_t = Σ_{i=1}^t 𝐱_i. We say that a random algorithm ℳ : {0,1}^T → ℝ^T is a differentially private mechanism for continual counting if it satisfies either ε-DP or (ε,δ)-DP under this neighboring relation.

For the entirety of this paper, we will express the utility of an algorithm ℳ in terms of its mean squared error, defined as MSE(ℳ,𝐱) = (1/T)·𝔼[‖ℳ(𝐱) − f(𝐱)‖_2²] = (1/T)·Σ_{t=1}^T Var[ℳ(𝐱)_t], where the second equality only holds if ∀𝐱 ∈ {0,1}^T : 𝔼[ℳ(𝐱)] = f(𝐱), which is the case for all factorization mechanisms in general, and all mechanisms covered in this work in particular. That is to say, for unbiased mechanisms ℳ, their mean squared error will be equal to the average variance. Moving forward, any mention of error without further qualification will refer to the mean squared error.

3 Mechanisms

In this section, we will gradually build up towards the algorithm outlined in Theorem 1. We begin by describing the binary tree mechanism, which will be our starting point. Then we will consider how the utility improves when we extend to k-ary trees, and we give an algorithm and results for this case. Finally, our main contribution will be to incorporate subtraction of vertices in k-ary trees to further improve the utility. The formal proofs leading to Theorem 1 are given in the next section.

3.1 Binary Tree Mechanisms

The first algorithms [8, 18, 22, 25] with good utility are based on binary trees and are referred to as the binary tree mechanism. Implementation details vary, but we will consider a formulation of it where only the left subtrees are used, and in particular not the root vertex, structurally most similar to Algorithm 2 in [8]. The main idea of the algorithm is to build a binary tree of height h = ⌈log(T+1)⌉ on top of the input stream 𝐱 where each leaf stores an input 𝐱_t, and each vertex higher up in the tree stores the sum of its children. Having such a tree, it is possible to add together vertices in the tree in such a way that at most 1 vertex per level in the tree is used, up to h in total, where the sum corresponds to a prefix sum up to t ∈ [1,T]. In particular, which vertices get summed to produce a given prefix sum is described exactly by the h-bit binary representation of the current time step, where each 1 corresponds to adding a vertex in the tree to the output. To make this private, we add i.i.d. Laplace noise 𝖫𝖺𝗉(Δ_1/ε) to each vertex in the tree, where Δ_1 = h since the trees for two neighboring inputs will differ in one vertex per level, excluding the root which is not used in computations. We defer making exact statements about the corresponding utility to Section 3.2, where this description of the binary tree mechanism is a special case of Algorithm 1 for k=2.

3.2 𝒌-ary Tree Mechanisms

We will next consider trees of greater arity k ≥ 2 to improve the constants in the error. We note that using trees of higher arity has already been studied [38, 12, 6]. With a tree of greater arity, a smaller height is needed to accommodate a given number of leaves. However, it might also require adding more vertices per level to generate a given prefix sum estimate. To explore this trade-off, we introduce notation for k-ary trees next.

Definition 5 (k-ary trees).

For integers k ≥ 2 and h ≥ 1, we define a k-ary tree of height h with k^h leaves as 𝒯_{k,h}. Then for an integer 1 ≤ ℓ ≤ h+1, let 𝒯_{k,h}^ℓ = {[1 + j·k^{ℓ−1}, (j+1)·k^{ℓ−1}] : j ∈ ℤ, 0 ≤ j < k^{h−ℓ+1}} be the set of vertices on level ℓ, where a vertex I ∈ 𝒯_{k,h}^ℓ is a child of I′ ∈ 𝒯_{k,h}^{ℓ+1} if I ⊂ I′, and where 𝒯_{k,h} = ⋃_{ℓ=1}^{h+1} 𝒯_{k,h}^ℓ.

Additionally, let Σ(I) = Σ_{i∈I} 𝐱_i for I ∈ 𝒯_{k,h} when such a tree is used to store partial sums of 𝐱, and Σ̂(I) for the case where a vertex in the tree stores a partial sum plus independent noise. Also, we will find it expedient to express integers using k-ary digits, which we define next.

Definition 6 (k-ary representation of integer).

Let k, w and t ≤ k^w − 1 be positive integers. We then define 𝐭 = 𝗋𝖾𝗉_k(t,w) ∈ [0,k−1]^w as the unique vector in [0,k−1]^w satisfying Σ_{i=1}^w k^{i−1}·𝐭_i = t.

The intuition behind using k-ary digits to analyze k-ary tree mechanisms is the same as in the case where k=2 – there is a clean analogy between the two. The h-digit k-ary representation of a time step t, 𝐭 = 𝗋𝖾𝗉_k(t,h), encodes a set of vertices in 𝒯_{k,h} whose sum (when the vertices are interpreted as partial sums) equals the prefix sum up to t. In particular, 𝐭_ℓ gives the number of vertices that get added at level ℓ of the tree. With this observation in mind, we introduce Algorithm 1.

Algorithm 1 k-ary Tree Mechanism.
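As an illustration of the k-ary tree mechanism, the following offline sketch is our own reconstruction (its structure and line numbering do not match the pseudocode of Algorithm 1 referenced in the proofs below): it builds the k-ary tree of subtree sums, adds 𝖫𝖺𝗉(h/ε) noise to every non-root vertex, and answers the prefix sum at time t by adding the noisy vertices selected by the digits of 𝗋𝖾𝗉_k(t,h).

```python
import numpy as np

def rep_k(t, k, w):
    """The w base-k digits of t (Definition 6); digits[i] multiplies k**i."""
    digits = []
    for _ in range(w):
        digits.append(t % k)
        t //= k
    return digits

def k_ary_tree_mechanism(x, k, eps, rng=None):
    """Offline sketch of the k-ary tree mechanism without subtraction.

    noisy[level][j] holds the Lap(h/eps)-noised sum of the j-th block of
    k**level consecutive inputs; each prefix sum adds at most k - 1 noisy
    vertices per level, and each input affects exactly h of them.
    """
    rng = np.random.default_rng() if rng is None else rng
    T = len(x)
    h = 1
    while k**h < T + 1:                          # height supporting T = k**h - 1 inputs
        h += 1
    x_pad = np.zeros(k**h)
    x_pad[:T] = x
    noisy, sums = [], x_pad
    for level in range(h):                       # levels 0 .. h-1; the root is never noised or used
        noisy.append(sums + rng.laplace(scale=h / eps, size=len(sums)))
        sums = sums.reshape(-1, k).sum(axis=1)   # exact sums on the next level up

    out = np.zeros(T)
    for t in range(1, T + 1):
        digits = rep_k(t, k, h)
        estimate, pos = 0.0, 0                   # pos = number of leaves covered so far
        for level in range(h - 1, -1, -1):       # most significant digit first
            for _ in range(digits[level]):
                estimate += noisy[level][pos // k**level]
                pos += k**level
        out[t - 1] = estimate
    return out

x = np.random.default_rng(0).integers(0, 2, size=30)
print(np.round(k_ary_tree_mechanism(x, k=3, eps=1.0), 2))
```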

Now we state the properties of Algorithm 1. We begin by discussing the privacy of the output of the algorithm. Subsequently, we examine the variance and average number of used vertices to analyze the mean squared error. Finally, we analyze the time and space complexity of the algorithm.

Lemma 7.

The output from Algorithm 1 satisfies ε-DP.

Proof.

Note that line 11 always adds the noisy count from a vertex in the tree at a level ℓ ∈ [1,h], since p ≡ 0 (mod k^{ℓ−1}), and since 𝐭 is a valid representation of a time step t ≤ T, we also have p ≤ t, and so the vertex must be contained in the tree. Note further that everything after line 4 can be considered post-processing of privately releasing the tree 𝒯_{k,h} storing partial sums. Since we never use the root vertex of the tree at level h+1, any two neighboring inputs will differ at exactly h vertices which we use, therefore Δ_1 = h, implying that releasing the individual counts in the tree with 𝖫𝖺𝗉(h/ε) noise gives privacy.

For the utility, note that the output from Algorithm 1 at time t has variance (2h²/ε²)·‖𝗋𝖾𝗉_k(t,h)‖_1, since ‖𝗋𝖾𝗉_k(t,h)‖_1 vertices with variance 2h²/ε² are added up. Consequently, we can conclude this subsection with the following lemma and theorem.

Lemma 8.

Algorithm 1 achieves a mean squared error of (k−1)h³/(ε²(1−1/k^h)) over T = k^h − 1 outputs.

Proof.

The mean squared error over T steps is equal to the average number of vertices added together to produce an output, call this number a_[1,T], times the variance of each vertex, 2h²/ε². Observe that a_[1,T] = (1/T)·Σ_{t=1}^T ‖𝗋𝖾𝗉_k(t,h)‖_1 and that T is the greatest integer possible to represent with h digits from {0,1,…,k−1}. This implies that we can compute the average ℓ_1-weight over {0,1,…,k−1}^h as a_[0,T] = h·(1/k)·Σ_{d=0}^{k−1} |d| = h(k−1)/2. Since T·a_[1,T] = (T+1)·a_[0,T], we have that a_[1,T] = (1+1/T)·a_[0,T] = h(k−1)/(2(1−1/k^h)). Plugging this in gives a mean squared error of h(k−1)/(2(1−1/k^h)) · 2h²/ε² = (k−1)h³/(ε²(1−1/k^h)).

Theorem 9.

Given an integer T ≥ 2 and a constant integer k ≥ 2, there exists an ε-DP mechanism for continual counting that on receiving a stream 𝐱_1, 𝐱_2, …, 𝐱_T achieves a mean squared error of ((k−1)/(ε²·log(k)³))·log(T)³ + o(log(T)³), computing all outputs in time O(T) and space O(log T).

As Theorem 9 is not our main contribution, we provide a proof sketch in Appendix B. Before continuing and including subtraction as a technique, we note that Theorem 9 describes the standard binary tree mechanism for k=2, and that the leading constant in the error is minimized for k=17, in which case it equals 0.234.

3.3 𝒌-ary trees with subtraction

Next, we will investigate what further improvements are possible if we allow for the subtraction of vertices to produce outputs. The reason for doing so is rather simple: if you would need to add more than half of the children at some level for an estimate, you might as well add the parent and subtract the remaining, fewer than k/2, rightmost children. Figure 2 provides a visual example that further illustrates how incorporating vertex subtraction can effectively reduce the number of vertices required to compute the prefix sum. Intuitively this should buy us another factor-2 improvement, as instead of adding a maximum of k−1 vertices per level to produce an estimate, we should now instead combine at most (k−1)/2.

To analyze (and efficiently implement) such a mechanism, the previous integer representation using digits in [0, k−1] is no longer the natural choice. Instead, when we open up the possibility of subtracting vertices, the solution becomes to allow for negative digits.

Definition 10 (Offset k-ary representation of integers).

Let k, w and t be positive integers, with k odd, satisfying t ≤ (k^w − 1)/2. We then define 𝐭 = 𝗋𝖾𝗉̂_k(t,w) ∈ [−(k−1)/2, (k−1)/2]^w as the unique vector with integer entries in [−(k−1)/2, (k−1)/2]^w satisfying Σ_{i=1}^w k^{i−1}·𝐭_i = t.

We will return to the fact that Definition 10 only considers odd k and first focus on providing intuition for why we are using offset digits. As in the case for k-ary trees using only addition, there is a natural connection between 𝐭 = 𝗋𝖾𝗉̂_k(t,h) and vertices in 𝒯_{k,h}. Put simply: if 𝐭_ℓ > 0, then to produce the prefix sum for t, we will be adding 𝐭_ℓ ≤ (k−1)/2 leftmost children from 𝒯_{k,h}^ℓ, as in the case without subtraction except that we add fewer vertices in the worst case. If 𝐭_ℓ < 0, then we will instead subtract |𝐭_ℓ| ≤ (k−1)/2 rightmost children from 𝒯_{k,h}^ℓ. In either case, the worst-case number of vertices added or subtracted per level is reduced to at most (k−1)/2.

Now, we only consider the case of odd k for a few reasons: (1) if k is odd, then we have symmetry in the digits, meaning we add at worst as many vertices as we may subtract, which makes the analysis easier; (2) even k is not optimal, see Appendix C for more details; (3) if k is odd, then there is a simple relation for when the root vertex of a tree is used to produce an output: the root vertex is used if t is in the right half of the tree. This gives an intuitive condition on the height needed to support T inputs: h = ⌈log_k(2T)⌉ – set the height such that there is a sufficient number of left-half leaves. This can be compared to standard k-ary trees, where the same condition is to use every leaf of the tree except the rightmost one.
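To make Definition 10 concrete, the following sketch (our illustration) computes the offset digits greedily from the ordinary base-k digits by carrying whenever a digit exceeds (k−1)/2; for k = 3 and t = 8 the offset representation has ℓ_1-weight 2 instead of 4, matching the saving illustrated in Figure 2.

```python
def rep_hat_k(t, k, w):
    """Offset base-k digits of t (Definition 10); entries in [-(k-1)//2, (k-1)//2].

    Assumes k odd and 1 <= t <= (k**w - 1) // 2; digits[i] multiplies k**i.
    """
    assert k % 2 == 1 and 1 <= t <= (k**w - 1) // 2
    digits = []
    for _ in range(w):
        d = t % k
        if d > (k - 1) // 2:          # carry: write the digit as d - k, push 1 upwards
            d -= k
        digits.append(d)
        t = (t - d) // k
    return digits

# Figure 2 example: t = 8, k = 3. The standard digits (2, 2, 0) have l1-weight 4,
# while the offset digits below have l1-weight 2: add the subtree covering the
# first nine leaves and subtract its last leaf.
print(rep_hat_k(8, k=3, w=3))         # [-1, 0, 1], i.e. 8 = 9 - 1
```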

We are now ready to introduce our main contribution: Algorithm 2.

Algorithm 2 k-ary Tree Mechanism with Subtraction.

Structurally the algorithm releases prefix sums with a correlation in the noise that depends on the representation 𝗋𝖾𝗉^k(t,h) of each time step t.
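For concreteness, the following offline sketch is our own reconstruction (closer in structure to the tree-based Algorithm 3 introduced in Section 4.1 than to the streaming Algorithm 2, and its line numbers do not correspond to those referenced in the proofs): it adds 𝖫𝖺𝗉(h/ε) noise to every non-root vertex of the tree of subtree sums and assembles each prefix sum by adding leftmost blocks for positive offset digits and subtracting rightmost blocks for negative ones.

```python
import numpy as np

def rep_hat_k(t, k, w):               # offset digits (Definition 10), repeated for self-containment
    digits = []
    for _ in range(w):
        d = t % k
        if d > (k - 1) // 2:
            d -= k
        digits.append(d)
        t = (t - d) // k
    return digits

def k_ary_tree_mechanism_with_subtraction(x, k, eps, rng=None):
    """Offline sketch of the k-ary tree mechanism with subtraction (odd k).

    A tree of height h supports T <= (k**h - 1) // 2 inputs; the root is never
    used, so each input affects exactly h noisy vertices (l1-sensitivity h), and
    each prefix sum combines at most (k - 1) // 2 noisy vertices per level.
    """
    assert k % 2 == 1
    rng = np.random.default_rng() if rng is None else rng
    T = len(x)
    h = 1
    while (k**h - 1) // 2 < T:                   # smallest h with (k**h - 1) // 2 >= T
        h += 1
    x_pad = np.zeros(k**h)
    x_pad[:T] = x
    noisy, sums = [], x_pad
    for level in range(h):
        noisy.append(sums + rng.laplace(scale=h / eps, size=len(sums)))
        sums = sums.reshape(-1, k).sum(axis=1)

    out = np.zeros(T)
    for t in range(1, T + 1):
        digits = rep_hat_k(t, k, h)
        estimate, pos = 0.0, 0                   # invariant: estimate covers leaves [0, pos)
        for level in range(h - 1, -1, -1):
            d = digits[level]
            if d >= 0:                           # add d leftmost blocks after pos
                for _ in range(d):
                    estimate += noisy[level][pos // k**level]
                    pos += k**level
            else:                                # subtract |d| rightmost blocks before pos
                for _ in range(-d):
                    pos -= k**level
                    estimate -= noisy[level][pos // k**level]
        out[t - 1] = estimate
    return out

x = np.random.default_rng(1).integers(0, 2, size=12)
print(np.round(k_ary_tree_mechanism_with_subtraction(x, k=3, eps=1.0), 2))
print(np.cumsum(x))                              # exact prefix sums, for comparison
```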

4 Proving Theorem 1

In this section, we will prove the properties of Algorithm 2, as summarized in Theorem 1, starting with privacy.

4.1 Privacy

Instead of arguing about the privacy of Algorithm 2, we will argue it for another algorithm with the same output distribution, Algorithm 3, whose connection to trees is more immediate. We proceed by establishing the equivalence of the output distributions in Algorithms 2 and 3.

Algorithm 3 More tree-like version of Algorithm 2 for analysis.

The following proposition will be helpful in establishing that the outputs are identically distributed.

Proposition 11.

When iterating over t, let count_n and p_n be the values of count and p after 1 ≤ n ≤ ‖𝐭‖_1 updates in Algorithm 3. Then count_n stores a noisy prefix sum up to p_n ∈ [1, (k^h − 1)/2].

Proof.

We show this by induction on n. Observe that p_n ∈ [1, (k^h − 1)/2] for all valid n, since the first non-zero digit in 𝐭 has to be positive, and since the greatest possible integer we can represent in these digits is (k^h − 1)/2. We therefore have that p_1 ≥ 1 and that count_1 = Σ̂([1, p_1]) is a noisy prefix sum up to 𝐱_{p_1}. Assuming that after n updates count represents a noisy prefix sum up to p, we will argue that this still holds after the (n+1)st update to p. Just before count is updated, observe that p = p_n ≡ 0 (mod k^{ℓ−1}), thus implying that the corresponding partial sums added or subtracted on lines 12 and 13 exist in the tree, and that the result will be a noisy prefix sum up to p_{n+1}.

We can now prove that the two algorithms produce identically distributed outputs.

Proposition 12.

The output distributions of Algorithm 2 and Algorithm 3 are equal.

Proof.

First, observe that Algorithm 3, like Algorithm 2, outputs the correct prefix sum in expectation, which follows from Proposition 11 for n = ‖𝐭‖_1. We thus only have to argue for the equivalence of the noise added. With this target in mind, note that each time a vertex is used on lines 12 and 15 in Algorithm 3, it can be uniquely identified by the value of p after it is updated on the next line. The uniqueness can be inferred from the uniqueness of expressing the corresponding 𝐩 = 𝗋𝖾𝗉̂_k(p,h), where p ∈ [1, (k^h − 1)/2] can be interpreted as a time step. Following up, note that for any time step t, the sequence of p's that are computed on line 10 of Algorithm 2 is the same as the sequence of p's obtained after lines 13 and 16 of Algorithm 3. Since Algorithm 2 explicitly indexes the noise terms by the same p, and since each term is identically distributed across the algorithms, it follows that each individual noise term that is added to the output at each step is identical up to its sign.

What remains to account for is the lack of a factor sgn(𝐞) on line 11 in Algorithm 2. The reason why it is not needed comes from the fact that a given vertex in Algorithm 3 always contributes with the same sign: we always subtract right-most children and always add left-most children to produce prefix sum estimates. Since Laplace noise is symmetric, subtracting or adding noise has the same effect, and we can therefore drop the sign on line 11 in Algorithm 2 and still have the same output distribution, proving the statement.

We are now ready to prove that Algorithm 2 satisfies ε-DP.

Lemma 13.

Algorithm 2 satisfies ε-DP.

Proof.

We proceed by proving the privacy of Algorithm 3, the privacy of which implies the privacy of Algorithm 2 in the offline setting via Proposition 12. The privacy proof is standard and follows from identifying that if all Σ̂(I) on line 4 are released privately, then the remaining computations are a post-processing of that release. For neighboring inputs 𝐱 ∼ 𝐱′, let i be the index at which they differ. Let 𝐲 ∈ ℝ^{|𝒯_{k,h}|} be a vector where each entry is identified by a unique interval I ∈ 𝒯_{k,h} and is equal to Σ(I) given 𝐱. Define 𝐲′ analogously for 𝐱′. Since any i ∈ [1,T] is included in exactly h intervals I ∈ 𝒯_{k,h}, intentionally excluding the root vertex which is not used, it follows that the sensitivity for releasing 𝐲 is Δ_1 = ‖𝐲 − 𝐲′‖_1 = h. As 𝐲 is released on line 4 using the Laplace mechanism with the appropriate scale, and all computations afterward are post-processing of 𝐲, Algorithm 3 satisfies ε-DP.

To extend the privacy analysis to the online case where 𝐱 is streamed (non-adaptively), note that the output distribution of Algorithm 2 can be expressed as A𝐱 + L𝐳 for suitable matrices A, L. On receiving 𝐱_t, the value (A𝐱)_t + (L𝐳)_t is output. Since L𝐳 is generated independently of the input, the output distribution of the algorithm is not impacted by having 𝐱 gradually revealed. It follows that the algorithm is also ε-DP under continual observation.

The online argument in the proof is explicit about 𝐱 being chosen before any outputs are produced. We note that every non-adaptive ε-DP algorithm for continual counting is also adaptively ε-DP (see Proposition 2.1 in [14]). Privacy in the online setting can also be proved directly using a “shifting” argument for the Laplace variables 𝐳, as done in for example [18].

4.2 Utility

As intuited, allowing for the subtraction of vertices does indeed improve utility. Before stating the improvement, note that, analogous to the case without subtraction, the variance of the output from Algorithm 2 at time t will be equal to (2h²/ε²)·‖𝗋𝖾𝗉̂_k(t,h)‖_1, since line 11 is executed ‖𝗋𝖾𝗉̂_k(t,h)‖_1 times.

Lemma 14.

Algorithm 2 achieves a mean squared error of k(1−1/k²)h³/(2ε²(1−1/k^h)) over T = (k^h − 1)/2 outputs.

Proof.

First note that the mean squared error over T steps is equal to the average number of noise terms added together to produce an output, call this number a_[1,T], times the variance of the noise in each vertex. We thus want to compute a_[1,T] = (1/T)·Σ_{t=1}^T ‖𝗋𝖾𝗉̂_k(t,h)‖_1. Note that T is the greatest positive integer that can be expressed using h digits in {−(k−1)/2, …, (k−1)/2}, and −T the most negative integer. For any representation 𝐭 = 𝗋𝖾𝗉̂_k(t,h) where t ∈ [1,T], note that we have a bijection where −𝐭 encodes a unique negative integer in [−T,−1]. We therefore have that a_[1,T] = a_[−T,−1], and a_[−T,T] = (1 − 1/k^h)·a_[1,T], where the extra factor comes from not including 0. Since a_[−T,T] = h·(1/k)·Σ_{d=−(k−1)/2}^{(k−1)/2} |d| = h(k²−1)/(4k), by noting that every digit occurs the same number of times in each position, we also have that a_[1,T] = (kh/4)·(1−1/k²)/(1−1/k^h). Since Var[𝐳_p] = 2h²/ε², we arrive at a mean squared error of (kh/4)·(1−1/k²)/(1−1/k^h) · 2h²/ε² = k(1−1/k²)h³/(2ε²(1−1/k^h)).

We follow up with a corollary that states how the error scales and what constant the optimal k gives.
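The averages appearing in the proofs of Lemmas 8 and 14 are easy to verify numerically; the following brute-force check (ours, not part of the paper) compares the empirical averages of ‖𝗋𝖾𝗉_k(t,h)‖_1 and ‖𝗋𝖾𝗉̂_k(t,h)‖_1 with the closed forms h(k−1)/(2(1−1/k^h)) and (kh/4)(1−1/k²)/(1−1/k^h).

```python
def digit_weight(t, k, w, offset):
    """l1-weight of the standard digits (offset=False) or offset digits (offset=True)."""
    total = 0
    for _ in range(w):
        d = t % k
        if offset and d > (k - 1) // 2:
            d -= k
        total += abs(d)
        t = (t - d) // k
    return total

k, h = 5, 4
T_add = k**h - 1                     # Lemma 8 setting
T_sub = (k**h - 1) // 2              # Lemma 14 setting
avg_add = sum(digit_weight(t, k, h, False) for t in range(1, T_add + 1)) / T_add
avg_sub = sum(digit_weight(t, k, h, True) for t in range(1, T_sub + 1)) / T_sub
print(avg_add, h * (k - 1) / 2 / (1 - 1 / k**h))              # both ~8.0128
print(avg_sub, k * h / 4 * (1 - 1 / k**2) / (1 - 1 / k**h))   # both ~4.8077
```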

Corollary 15.

Algorithm 2 achieves a mean squared error that scales as (k(1−1/k²)/(2ε²·log(k)³))·log(T)³ + o(log(T)³), and which is minimized for k=19, yielding (0.1236/ε²)·log(T)³ + o(log(T)³).

In relation to Honaker Online, this is an improvement of the leading constant by more than a factor 4.
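The optimal arities and leading constants quoted in Section 3.2 (k = 17, constant ≈ 0.234) and in Corollary 15 (k = 19, constant ≈ 0.1236) can be reproduced with a few lines (our check):

```python
import math

def constant_add(k):                 # Theorem 9: (k - 1) / log2(k)^3
    return (k - 1) / math.log2(k) ** 3

def constant_sub(k):                 # Corollary 15: k (1 - 1/k^2) / (2 log2(k)^3)
    return k * (1 - 1 / k**2) / (2 * math.log2(k) ** 3)

best_add = min(range(2, 200), key=constant_add)
best_sub = min(range(3, 200, 2), key=constant_sub)
print(best_add, round(constant_add(best_add), 4))   # 17 0.2343
print(best_sub, round(constant_sub(best_sub), 4))   # 19 0.1236
```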

4.3 Space and time complexity

Finally, it turns out that extending to k-ary trees with subtraction does not impact the efficiency of tree-based aggregation. We will need the following proposition to prove both the space and time complexity.

Proposition 16.

Each noise term 𝐳_p used by Algorithm 2 will only be used for outputs in a time interval [t_1, t_2] ⊆ [1,T].

Proof.

Let the value of the iterate ℓ be q when 𝐳_p is added on line 11. Consider first the case when q < h. Note that for any time step t with representation 𝐭 = 𝗋𝖾𝗉̂_k(t,h), 𝐳_p is only used if the h−q most significant digits are identical across 𝐩 = 𝗋𝖾𝗉̂_k(p,h) and 𝐭. Interpreting any fixed number of most significant digits of 𝗋𝖾𝗉̂_k(t,h) as an integer, as t is incremented, will produce a monotonically non-decreasing sequence of integers, implying that any noise value 𝐳_p will start to get used at some t_1, and then after some time step t_2 ≥ t_1 will never be used again, proving the claim for q < h. If q = h, then once 𝐳_p is added for the first time at t_1, it will be added at every future time step including t_2 = T, proving the claim fully.

Lemma 17.

Algorithm 2 runs in O(k·log_k T) space.

Proof.

Note that at any time step t, we need to store the noise values used for the current time step, of which at most h(k−1)/2 are needed. Let 𝐳_p be an arbitrary noise value kept in memory at time t, and no longer used starting from some time t′ > t. By Proposition 16, 𝐳_p will never be used again after t′, and it can therefore be removed from memory. It follows that no noise values besides those used at the current time step need to be stored in memory, thus we need to store at most h(k−1)/2 = O(k·log_k T) noise values.

Lemma 18 states the time complexity of Algorithm 2, which assumes that (1) Laplace noise can be generated, and (2) 𝗋𝖾𝗉^k(t,h) can be computed, both in O(k) time. Note that the cost of (1) is unavoidable, and that (2) can be efficiently computed by incrementing the digits at each step, where an argument analogous to the analysis for incrementing a binary counter implies a time of O(kT) for incrementing the counter T times as measured in digit flips.

Lemma 18.

Algorithm 2 can be implemented to run in O(kT) time.

Proof.

First let T = (k^h − 1)/2, meaning we support as many inputs as possible using h digits. Note that there are exactly T unique noise variables used by Algorithm 2, since p will take on all values in [1,T] when iterating over t. Rather than resetting the noise on each iteration of t, we can instead update it, removing noise terms that are no longer used and adding new noise as needed. Each of them will be added exactly once and subtracted at most once (Proposition 16), therefore we run in time O(T) for this case. For (k^{h−1} + 1)/2 ≤ T < (k^h − 1)/2 = T′, note that T′/T ≤ ((k^h − 1)/2)/((k^{h−1} + 1)/2) ≤ k, and therefore if we estimate the runtime for T by that for T′, we get a runtime of O(kT) in this case.

Finally, combining Lemmas 13, 14, 17 and 18 gives us Theorem 1.

5 Laplace vs Gaussian noise

Having designed an efficient ε-DP algorithm for continual counting with a state-of-the-art bound on the error, a natural question arises: how does it compare to (ε,δ)-DP mechanisms?

Since the noise of the Gaussian mechanism grows as δ approaches zero, it is clear that for any given bounds on the ℓ_1 and ℓ_2 sensitivities, using the Gaussian mechanism with sufficiently small δ will result in more noise than the Laplace mechanism. We will see that for current continual counting methods this change happens for surprisingly large values of δ, with the exact threshold depending on the number of time steps T.

Furthermore, we adapt the Laplace mechanism to work with ℓ_2 sensitivity and show that it provides a meaningful (ε,δ)-DP guarantee with variance close to that of the Gaussian mechanism.

5.1 Laplace vs Gaussian noise for continual counting

We start by stating a lemma that compares the currently best continual counting mechanisms under pure and approximate differential privacy.

Lemma 19.

For constants B_ε, B_{ε,δ} > 0, let (B_ε/ε²)·log(T)³ + o(log(T)³) be a bound on the MSE of an ε-DP mechanism and (B_{ε,δ}/ε²)·log(T)²·log(1/δ) + o(log(T)²·log(1/δ)) be a bound on the mean squared error of an (ε,δ)-DP mechanism, both for continual counting. Then the ε-DP mechanism achieves equivalent or better mean squared error, up to lower-order asymptotically vanishing terms, for δ = O(1/T^{B_ε/B_{ε,δ}}).

Proof.

Comparing the leading terms of the errors and solving for δ yields the statement.

Proof of Theorem 2.

Algorithm 2 has a leading constant of B_ε < 0.124 for k = 19 under ε-DP, and the corresponding constant in [29] is B_{ε,δ} = 4/(π²·log(e)³) > 0.1349, so we have B_ε/B_{ε,δ} < 0.92. Lemma 19 thus implies that the pure mechanism has a smaller bound on the noise for δ = O(T^{−0.92}).
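The crossover of Lemma 19 can be evaluated directly (our sketch): with B_ε from Corollary 15 and B_{ε,δ} from [29], the ratio is below 0.92, and the resulting threshold δ = T^{−B_ε/B_{ε,δ}} is easy to tabulate for concrete T.

```python
import math

B_eps = 19 * (1 - 1 / 19**2) / (2 * math.log2(19) ** 3)    # Corollary 15, k = 19
B_eps_delta = 4 / (math.pi**2 * math.log2(math.e) ** 3)    # leading constant from [29]
print(B_eps / B_eps_delta)                                  # ~0.916 < 0.92

for T in (2**20, 2**30, 2**40):
    print(T, T ** -(B_eps / B_eps_delta))   # Laplace-based Algorithm 2 wins for delta below this
```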

5.2 Laplace noise scaled to ℓ_2 sensitivity

We next consider how we may extend the Laplace mechanism with an approximate DP guarantee similar to the Gaussian mechanism, i.e., with noise scaled to the ℓ_2 sensitivity. Since the proof of this is straightforward using known results we suspect this fact may be known, but we are not aware of it appearing in the literature. The technique for extending the Laplace mechanism to approximate DP relies on the following heterogeneous composition theorem [34].

Lemma 20 (Simplified version of Theorem 3.5 in [34]).

For any ε_j > 0 for j ∈ {1,…,k}, and δ ∈ (0,1), the class of ε_j-differentially private mechanisms satisfies (ε̃,δ)-differential privacy under k-fold adaptive composition for

ε̃ = min{ Σ_{j=1}^k ε_j ,  Σ_{j=1}^k ((e^{ε_j} − 1)·ε_j)/(e^{ε_j} + 1) + √( 2·ln(1/δ)·Σ_{j=1}^k ε_j² ) }.

Using Lemma 20, we can derive the following approximate DP guarantee for the ℓ_2 Laplace mechanism.

Lemma 21 (ℓ_2 Laplace Mechanism).

Let f : 𝒳^n → ℝ^d be a function with ℓ_2-sensitivity Δ_2 ≥ max_{𝒟∼𝒟′} ‖f(𝒟) − f(𝒟′)‖_2. For a given dataset 𝒟 ∈ 𝒳^n and ε, δ ∈ (0,1), the output f(𝒟) + 𝖫𝖺𝗉(Δ_2/a_{ε,δ})^d, where a_{ε,δ} = √(2·ln(1/δ))·(√(1 + ε/ln(1/δ)) − 1), satisfies (ε,δ)-DP.

Proof.

For x_1, …, x_k ∈ (0,1), since (e^{x_i} − 1)/(e^{x_i} + 1) ≤ x_i/2, it follows from Lemma 20 that (ε̃,δ)-differential privacy is also satisfied for

ε̃ = min{ Σ_{j=1}^k ε_j ,  (1/2)·Σ_{j=1}^k ε_j² + √( 2·ln(1/δ)·Σ_{j=1}^k ε_j² ) }.   (1)

It suffices to establish the privacy of the Laplace mechanism when restricted to arbitrary neighboring inputs x, x′. We will reason about the privacy loss for each coordinate of 𝐲 = f(x) separately relative to the neighboring output 𝐲′ = f(x′). Focusing on one coordinate 𝐲_j and its private release, we have that its release must satisfy ε_j-differential privacy where ε_j = (a_{ε,δ}/Δ_2)·|𝐲_j − 𝐲′_j|. Composing the d private releases of each coordinate using the bound (1), and noting that Σ_{j=1}^d ε_j² ≤ (a_{ε,δ}/Δ_2)²·Σ_{j=1}^d (𝐲_j − 𝐲′_j)² ≤ a_{ε,δ}², therefore gives an (ε,δ)-guarantee where ε = a_{ε,δ}·(a_{ε,δ}/2 + √(2·ln(1/δ))). Proceeding to solve for (positive) a_{ε,δ} gives the statement.

The variance needed for ℓ_2-scaled Laplace noise compared to Gaussian noise for (ε,δ)-DP is only a constant factor greater whenever ε ≤ ln(1/δ), and the constant approaches 2 as ε approaches zero.

Corollary 22.

Let 𝐲^lap and 𝐲^gauss be outputs based on the ℓ_2 Laplace mechanism (Lemma 21) and the Gaussian mechanism (Lemma 25) respectively, both to achieve the same (ε,δ)-DP guarantee where ε ≤ ln(1/δ). Then Var[𝐲_j^lap] ≤ [(ε/ln(1/δ))²/(2·(√(1 + ε/ln(1/δ)) − 1)²)] · Var[𝐲_j^gauss]. Also, lim_{ε→0} (ε/ln(1/δ))²/(2·(√(1 + ε/ln(1/δ)) − 1)²) = 2.

Proof.

Using Lemma 21 and noting that the variance of the noise added is twice the square of the scale parameter for Laplace noise, we get that

Var[𝐲_j^lap]/Var[𝐲_j^gauss] = (2Δ_2²/(2·ln(1/δ)·(√(1 + ε/ln(1/δ)) − 1)²)) / (2·ln(1.25/δ)·Δ_2²/ε²) ≤ (ε/ln(1/δ))²/(2·(√(1 + ε/ln(1/δ)) − 1)²).

To show the limit, note that for positive w ≤ 1 we have that w/(√(1+w) − 1) ≤ w/(w/2 − w²/8) = 2/(1 − w/4) → 2 as w → 0, and w/(√(1+w) − 1) ≥ 2, both based on truncated Taylor expansions of √(1+w).

Finally we prove Theorem 3 from the introduction.

Proof of Theorem 3.

From the proof of Lemma 21, we have that adding noise 𝖫𝖺𝗉(Δ_2/a) gives an (ε′,δ)-guarantee where ε′ = a·(a/2 + √(2·ln(1/δ))) if ε′, δ ∈ (0,1). Note that our setup is the same if we set a = Δ_2/λ, and that adding noise 𝖫𝖺𝗉(λ) gives a (Δ_1/λ)-DP guarantee. We can therefore always guarantee (ε,δ)-DP where ε = min{Δ_1/λ, ε′} < 1 if we enforce λ > Δ_1.

Note that while functions with low ℓ_1 sensitivity also have low ℓ_2 sensitivity, the converse does not hold. In principle we can replace Gaussian noise with Laplace noise at a limited loss in utility for approximate DP (Corollary 22), but the corresponding ε-DP guarantee afforded by the noise will be weak unless Δ_1 is small. Theorem 3 can alternatively be seen as adding an approximate DP guarantee to pure DP mechanisms built on top of the Laplace mechanism.
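As a usage sketch of Theorem 3 and Lemma 21 (our illustration; the helper names are ours), the code below computes the (ε,δ) guarantee of adding 𝖫𝖺𝗉(λ) noise given Δ_1 and Δ_2, the scale Δ_2/a_{ε,δ} from Lemma 21 for a target ε, and the corresponding Gaussian standard deviation from Lemma 25 for comparison.

```python
import math

def laplace_eps(delta_1, delta_2, lam, delta):
    """Theorem 3: the eps of (eps, delta)-DP when adding Lap(lam)^d noise, lam > delta_1."""
    eps_pure = delta_1 / lam
    eps_l2 = (delta_2 / lam) * (delta_2 / (2 * lam) + math.sqrt(2 * math.log(1 / delta)))
    return min(eps_pure, eps_l2)

def laplace_scale_l2(delta_2, eps, delta):
    """Lemma 21: scale Delta_2 / a_{eps,delta} achieving (eps, delta)-DP."""
    a = math.sqrt(2 * math.log(1 / delta)) * (math.sqrt(1 + eps / math.log(1 / delta)) - 1)
    return delta_2 / a

def gaussian_sigma(delta_2, eps, delta):
    """Lemma 25: standard deviation of the Gaussian mechanism."""
    return math.sqrt(2 * math.log(1.25 / delta)) * delta_2 / eps

eps, delta = 0.5, 1e-8
lam = laplace_scale_l2(1.0, eps, delta)
print(laplace_eps(delta_1=10.0, delta_2=1.0, lam=lam, delta=delta))   # ~0.5 via the l2 branch
print(2 * lam**2, gaussian_sigma(1.0, eps, delta)**2)                 # Laplace variance ~2x Gaussian
```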

6 Discussion

While there are no known fundamental reasons ruling out asymptotic improvements in the mean squared error for ε-DP – it might be possible to match the Ω(log(T)²) lower bound of [29] – there have been no asymptotic improvements since the binary tree mechanism [18, 7]. This state of affairs motivates exploring how much leading constants can be improved within the factorization mechanism framework. Algorithm 2 falls into this framework, is easy to implement, efficient, and offers the best known error scaling to date.

Our paper also reveals that using Laplace noise improves on the best possible (ε,δ)-differentially private factorization mechanisms based on Gaussian noise whenever δ is sufficiently small. An interesting future direction is to investigate, for fixed ε, δ and a given factorization: which noise distribution achieves optimal error? It seems unlikely that Laplace or Gaussian noise is always best – indeed, Vinterbo [42] has results of this flavor in more restricted settings.

Finally, while we show a lower bound for random inputs that matches the best lower bound known for worst-case inputs, it is still possible that random inputs are easier and allow lower error than what can be achieved for worst-case inputs.

References

  • [1] Joel Daniel Andersson and Rasmus Pagh. A smooth binary mechanism for efficient private continual observation. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. URL: http://papers.nips.cc/paper_files/paper/2023/hash/99c41fb9fd53abfdd4a0259560ef1c9d-Abstract-Conference.html.
  • [2] Joel Daniel Andersson and Rasmus Pagh. Streaming private continual counting via binning. CoRR, abs/2412.07093, 2024. To appear in the proceedings of SaTML 2025. doi:10.48550/arXiv.2412.07093.
  • [3] Joel Daniel Andersson, Rasmus Pagh, and Sahel Torkamani. Improved counting under continual observation with pure differential privacy. CoRR, abs/2408.07021v1, 2024. Earlier (non-archival) version with a different title and set of authors. Presented at TPDP 2024. arXiv:2408.07021v1.
  • [4] Borja Balle and Yu-Xiang Wang. Improving the Gaussian mechanism for differential privacy: Analytical calibration and optimal denoising. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 394–403. PMLR, 10–15 July 2018. URL: https://proceedings.mlr.press/v80/balle18a.html.
  • [5] G. Bennett. Schur multipliers. Duke Mathematical Journal, 44(3):603–639, 1977. doi:10.1215/S0012-7094-77-04426-X.
  • [6] Adrian Rivera Cardoso and Ryan Rogers. Differentially private histograms under continual observation: Streaming selection into the unknown. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, editors, International Conference on Artificial Intelligence and Statistics, AISTATS 2022, 28-30 March 2022, Virtual Event, volume 151 of Proceedings of Machine Learning Research, pages 2397–2419. PMLR, 2022. URL: https://proceedings.mlr.press/v151/rivera-cardoso22a.html.
  • [7] T. H. Hubert Chan, Mingfei Li, Elaine Shi, and Wenchang Xu. Differentially Private Continual Monitoring of Heavy Hitters from Distributed Streams. In Simone Fischer-Hübner and Matthew Wright, editors, Privacy Enhancing Technologies, Lecture Notes in Computer Science, pages 140–159, Berlin, Heidelberg, 2012. Springer. doi:10.1007/978-3-642-31680-7_8.
  • [8] T.-H. Hubert Chan, Elaine Shi, and Dawn Song. Private and Continual Release of Statistics. ACM Transactions on Information and System Security, 14(3):26:1–26:24, November 2011. doi:10.1145/2043621.2043626.
  • [9] Christopher A. Choquette-Choo, Arun Ganesh, Ryan McKenna, H. Brendan McMahan, John Rush, Abhradeep Guha Thakurta, and Zheng Xu. (amplified) banded matrix factorization: A unified approach to private training. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. URL: http://papers.nips.cc/paper_files/paper/2023/hash/ecc28b4ce9b39f5f23c3efb03e25b7bf-Abstract-Conference.html.
  • [10] Christopher A. Choquette-Choo, Hugh Brendan McMahan, J. Keith Rush, and Abhradeep Guha Thakurta. Multi-epoch matrix factorization mechanisms for private machine learning. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 5924–5963. PMLR, 2023. URL: https://proceedings.mlr.press/v202/choquette-choo23a.html.
  • [11] Edith Cohen, Xin Lyu, Jelani Nelson, Tamás Sarlós, and Uri Stemmer. Lower bounds for differential privacy under continual observation and online threshold queries. In Shipra Agrawal and Aaron Roth, editors, The Thirty Seventh Annual Conference on Learning Theory, June 30 - July 3, 2023, Edmonton, Canada, volume 247 of Proceedings of Machine Learning Research, pages 1200–1222. PMLR, 2024. URL: https://proceedings.mlr.press/v247/cohen24b.html.
  • [12] Graham Cormode, Tejas Kulkarni, and Divesh Srivastava. Answering range queries under local differential privacy. Proc. VLDB Endow., 12(10):1126–1138, June 2019. doi:10.14778/3339490.3339496.
  • [13] Yuval Dagan and Gil Kur. A bounded-noise mechanism for differential privacy. In Po-Ling Loh and Maxim Raginsky, editors, Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pages 625–661. PMLR, 02–05 July 2022. URL: https://proceedings.mlr.press/v178/dagan22a.html.
  • [14] Sergey Denisov, H. Brendan McMahan, John Rush, Adam D. Smith, and Abhradeep Guha Thakurta. Improved differential privacy for SGD via optimal private linear operators on adaptive streams. In NeurIPS, 2022. URL: http://papers.nips.cc/paper_files/paper/2022/hash/271ec4d1a9ff5e6b81a6e21d38b1ba96-Abstract-Conference.html.
  • [15] Jinshuo Dong, Weijie J. Su, and Linjun Zhang. A central limit theorem for differentially private query answering. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 14759–14770, 2021. URL: https://proceedings.neurips.cc/paper/2021/hash/7c2c48a32443ad8f805e48520f3b26a4-Abstract.html.
  • [16] Krishnamurthy Dj Dvijotham, H. Brendan McMahan, Krishna Pillutla, Thomas Steinke, and Abhradeep Thakurta. Efficient and near-optimal noise generation for streaming differential privacy. In 65th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2024, Chicago, IL, USA, October 27-30, 2024, pages 2306–2317. IEEE, 2024. doi:10.1109/FOCS61266.2024.00135.
  • [17] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Proceedings of the Third Conference on Theory of Cryptography, TCC’06, pages 265–284, Berlin, Heidelberg, 2006. Springer-Verlag. doi:10.1007/11681878_14.
  • [18] Cynthia Dwork, Moni Naor, Toniann Pitassi, and Guy N. Rothblum. Differential privacy under continual observation. In Proceedings of the forty-second ACM symposium on Theory of computing, STOC ’10, pages 715–724, New York, NY, USA, June 2010. Association for Computing Machinery. doi:10.1145/1806689.1806787.
  • [19] Cynthia Dwork, Moni Naor, Omer Reingold, and Guy N. Rothblum. Pure differential privacy for rectangle queries via private partitions. In Proceedings, Part II, of the 21st International Conference on Advances in Cryptology — ASIACRYPT 2015 - Volume 9453, pages 735–751, Berlin, Heidelberg, 2015. Springer-Verlag. doi:10.1007/978-3-662-48800-3_30.
  • [20] Cynthia Dwork and Aaron Roth. The Algorithmic Foundations of Differential Privacy. Foundations and Trends® in Theoretical Computer Science, 9(3-4):211–407, 2013. doi:10.1561/0400000042.
  • [21] Hendrik Fichtenberger, Monika Henzinger, and Jalaj Upadhyay. Constant matters: Fine-grained error bound on differentially private continual observation. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 10072–10092. PMLR, 23–29 July 2023. URL: https://proceedings.mlr.press/v202/fichtenberger23a.html.
  • [22] J. Gehrke, G. Wang, and X. Xiao. Differential privacy via wavelet transforms. In 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010), pages 225–236, Los Alamitos, CA, USA, March 2010. IEEE Computer Society. doi:10.1109/ICDE.2010.5447831.
  • [23] Quan Geng, Wei Ding, Ruiqi Guo, and Sanjiv Kumar. Optimal noise-adding mechanism in additive differential privacy. In Kamalika Chaudhuri and Masashi Sugiyama, editors, The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, volume 89 of Proceedings of Machine Learning Research, pages 11–20. PMLR, 2019. URL: http://proceedings.mlr.press/v89/geng19a.html.
  • [24] Quan Geng, Wei Ding, Ruiqi Guo, and Sanjiv Kumar. Tight analysis of privacy and utility tradeoff in approximate differential privacy. In Silvia Chiappa and Roberto Calandra, editors, The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020, 26-28 August 2020, Online [Palermo, Sicily, Italy], volume 108 of Proceedings of Machine Learning Research, pages 89–99. PMLR, 2020. URL: http://proceedings.mlr.press/v108/geng20a.html.
  • [25] Michael Hay, Vibhor Rastogi, Gerome Miklau, and Dan Suciu. Boosting the accuracy of differentially private histograms through consistency. Proc. VLDB Endow., 3(1–2):1021–1032, September 2010. doi:10.14778/1920841.1920970.
  • [26] Monika Henzinger, Nikita P. Kalinin, and Jalaj Upadhyay. Binned group algebra factorization for differentially private continual counting, 2025. arXiv:2504.04398.
  • [27] Monika Henzinger, Andrea Lincoln, and Barna Saha. The complexity of average-case dynamic subgraph counting. In Joseph (Seffi) Naor and Niv Buchbinder, editors, Proceedings of the 2022 ACM-SIAM Symposium on Discrete Algorithms, SODA 2022, Virtual Conference / Alexandria, VA, USA, January 9 - 12, 2022, pages 459–498. SIAM, 2022. doi:10.1137/1.9781611977073.23.
  • [28] Monika Henzinger and Jalaj Upadhyay. Improved differentially private continual observation using group algebra. In Proceedings of the 2025 Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2025, New Orleans, LA, USA, January 12-15, 2025, pages 2951–2970. SIAM, 2025. doi:10.1137/1.9781611978322.95.
  • [29] Monika Henzinger, Jalaj Upadhyay, and Sarvagya Upadhyay. Almost tight error bounds on differentially private continual counting. In Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 5003–5039, 2023. doi:10.1137/1.9781611977554.ch183.
  • [30] Naoise Holohan, Spiros Antonatos, Stefano Braghin, and Pól Mac Aonghusa. The bounded laplace mechanism in differential privacy. J. Priv. Confidentiality, 10(1), 2020. doi:10.29012/JPC.715.
  • [31] James Honaker. Efficient use of differentially private binary trees. Theory and Practice of Differential Privacy (TPDP 2015), London, UK, 2:26–27, 2015.
  • [32] Ziyue Huang, Yuan Qiu, Ke Yi, and Graham Cormode. Frequency estimation under multiparty differential privacy: one-shot and streaming. Proceedings of the VLDB Endowment, 15(10):2058–2070, September 2022. doi:10.14778/3547305.3547312.
  • [33] Peter Kairouz, Brendan Mcmahan, Shuang Song, Om Thakkar, Abhradeep Thakurta, and Zheng Xu. Practical and Private (Deep) Learning Without Sampling or Shuffling. In Proceedings of the 38th International Conference on Machine Learning, pages 5213–5225. PMLR, July 2021. ISSN: 2640-3498. URL: https://proceedings.mlr.press/v139/kairouz21b.html.
  • [34] Peter Kairouz, Sewoong Oh, and Pramod Viswanath. The composition theorem for differential privacy. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1376–1385, Lille, France, 07–09 July 2015. PMLR. URL: https://proceedings.mlr.press/v37/kairouz15.html.
  • [35] Chao Li, Gerome Miklau, Michael Hay, Andrew McGregor, and Vibhor Rastogi. The matrix mechanism: optimizing linear counting queries under differential privacy. The VLDB Journal, 24(6):757–781, December 2015. doi:10.1007/s00778-015-0398-x.
  • [36] Fang Liu. Generalized Gaussian mechanism for differential privacy. IEEE Trans. Knowl. Data Eng., 31(4):747–756, 2019. doi:10.1109/TKDE.2018.2845388.
  • [37] H. Brendan McMahan and Abhradeep Thakurta. Federated Learning with Formal Differential Privacy Guarantees, 2022. Accessed online 2023-05-15. URL: https://ai.googleblog.com/2022/02/federated-learning-with-formal.html.
  • [38] Wahbeh Qardaji, Weining Yang, and Ninghui Li. Understanding hierarchical methods for differentially private histograms. Proc. VLDB Endow., 6(14):1954–1965, September 2013. doi:10.14778/2556549.2556576.
  • [39] David M. Sommer, Sebastian Meiser, and Esfandiar Mohammadi. Privacy loss classes: The central limit theorem in differential privacy. Proc. Priv. Enhancing Technol., 2019(2):245–269, 2019. doi:10.2478/POPETS-2019-0029.
  • [40] Thomas Steinke and Jonathan R. Ullman. Between pure and approximate differential privacy. J. Priv. Confidentiality, 7(2), 2016. doi:10.29012/JPC.V7I2.648.
  • [41] Jalaj Upadhyay. Sublinear Space Private Algorithms Under the Sliding Window Model. In Proceedings of the 36th International Conference on Machine Learning, pages 6363–6372. PMLR, May 2019. ISSN: 2640-3498. URL: https://proceedings.mlr.press/v97/upadhyay19a.html.
  • [42] Staal Amund Vinterbo. Differential privacy for symmetric log-concave mechanisms. In Gustau Camps-Valls, Francisco J. R. Ruiz, and Isabel Valera, editors, International Conference on Artificial Intelligence and Statistics, AISTATS 2022, 28-30 March 2022, Virtual Event, volume 151 of Proceedings of Machine Learning Research, pages 6270–6291. PMLR, 2022. URL: https://proceedings.mlr.press/v151/vinterbo22a.html.

Appendix A Definitions for Differential Privacy

Here we state the standard definitions for differential privacy that we use in the paper.

Definition 23 ((ε,δ)-Differential Privacy [20]).

A randomized algorithm ℳ : 𝒳^n → 𝒴 is (ε,δ)-differentially private if for all S ⊆ 𝖱𝖺𝗇𝗀𝖾(ℳ) and all neighboring inputs 𝒟, 𝒟′ ∈ 𝒳^n, the neighboring relation written as 𝒟 ∼ 𝒟′, we have that:

Pr[ℳ(𝒟) ∈ S] ≤ exp(ε) · Pr[ℳ(𝒟′) ∈ S] + δ,

where (ε,0)-DP is referred to as ε-DP.

Lemma 24 (Laplace Mechanism [20]).

Let f : 𝒳^n → ℝ^d be a function with ℓ₁-sensitivity Δ₁ := max_{𝒟 ∼ 𝒟′} ‖f(𝒟) − f(𝒟′)‖₁. For a given data set 𝒟 ∈ 𝒳^n and ε > 0, the mechanism that releases f(𝒟) + 𝖫𝖺𝗉(Δ₁/ε)^d satisfies ε-DP.

Lemma 25 (Gaussian Mechanism [20]).

Let f : 𝒳^n → ℝ^d be a function with ℓ₂-sensitivity Δ₂ := max_{𝒟 ∼ 𝒟′} ‖f(𝒟) − f(𝒟′)‖₂. For a given data set 𝒟 ∈ 𝒳^n and ε, δ ∈ (0,1), the mechanism that releases f(𝒟) + 𝒩(0, 2 ln(1.25/δ) Δ₂²/ε²)^d satisfies (ε,δ)-DP.
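For concreteness, the following is a minimal sketch of the two mechanisms above (our own illustration; the function names and the d-dimensional vectorization are ours).

```python
import numpy as np

rng = np.random.default_rng()

def laplace_mechanism(value, l1_sensitivity, eps):
    """Release f(D) + Lap(Delta_1/eps)^d, satisfying eps-DP (Lemma 24)."""
    value = np.asarray(value, dtype=float)
    return value + rng.laplace(scale=l1_sensitivity / eps, size=value.shape)

def gaussian_mechanism(value, l2_sensitivity, eps, delta):
    """Release f(D) + N(0, 2 ln(1.25/delta) Delta_2^2 / eps^2)^d,
    satisfying (eps, delta)-DP for eps, delta in (0, 1) (Lemma 25)."""
    value = np.asarray(value, dtype=float)
    sigma = np.sqrt(2 * np.log(1.25 / delta)) * l2_sensitivity / eps
    return value + rng.normal(scale=sigma, size=value.shape)

# e.g. a histogram where one user changes a single bin by at most 1:
# laplace_mechanism(counts, l1_sensitivity=1, eps=0.5)
```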

Appendix B 𝒌-ary Tree Mechanisms: Proofs

In this appendix we briefly sketch the proof of Theorem 9. We include the argument for completeness; it is not a main contribution of this paper.

Proof sketch of Theorem 9.

The utility and privacy follow from Lemmas 7 and 8, and the online nature of the mechanism follows from the partial sums added on Line 11 only involving inputs 𝐱_i with index at most t. For the space usage, a standard argument [8, 1] for the binary tree mechanism can be extended to the k-ary tree mechanism to yield O(k log_k T). Similarly, the time needed to release T prefix sums is O(kT) (assuming noise can be generated in constant time), by extending the analysis of incrementing a binary counter to incrementing a k-ary counter.
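To illustrate the data structure behind these bounds, here is a minimal sketch of a plain streaming k-ary tree mechanism without the subtraction trick of Algorithm 2 (so it is not the paper's mechanism, only an illustration). It buffers at most k completed blocks per level, reflecting the O(k log_k T) space bound, and each input lands in one block per level, so Laplace noise of scale height/ε per block gives ε-DP.

```python
import numpy as np

def kary_prefix_sums(stream, k, eps, T):
    """Illustrative streaming k-ary tree mechanism without subtraction."""
    height = max(1, int(np.ceil(np.log(T) / np.log(k)))) + 1
    raw = [[] for _ in range(height)]    # raw sums of completed blocks, per level
    noisy = [[] for _ in range(height)]  # their noisy counterparts
    scale = height / eps                 # each input touches one block per level
    total = 0.0                          # running sum of all buffered noisy blocks
    out = []
    for x in stream:
        # a level-0 block is a single element and completes immediately
        raw[0].append(float(x))
        noise = np.random.laplace(scale=scale)
        noisy[0].append(x + noise)
        total += x + noise
        # carry: k completed blocks at level j merge into one block at level j+1
        j = 0
        while j + 1 < height and len(raw[j]) == k:
            merged = sum(raw[j])
            total -= sum(noisy[j])
            raw[j], noisy[j] = [], []
            noise = np.random.laplace(scale=scale)
            raw[j + 1].append(merged)
            noisy[j + 1].append(merged + noise)
            total += merged + noise
            j += 1
        # the buffered noisy blocks exactly tile the prefix seen so far
        out.append(total)
    return out
```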

Appendix C Leveraging subtraction for even 𝒌

In the main part of the paper we only considered subtraction for k-ary trees when k is odd. This part of the appendix motivates that choice.

First note that for even k we lose the symmetry in the offset digits, so we need to decide whether the digits in the offset representation are {−k/2+1, …, k/2} or {−k/2, …, k/2−1}. We pick the first option, as it allows for supporting more inputs, giving Definition 26.

Definition 26 (Offset (even) k-ary representation of integers).

Let k, w and t be positive integers, with k also being even, satisfying t ≤ k(k^w − 1)/(2(k−1)). We then define 𝐭 = 𝗋𝖾𝗉̃_k(t, w) ∈ [−k/2+1, k/2]^w as the unique vector with integer entries in [−k/2+1, k/2] satisfying Σ_{i=1}^{w} k^{i−1} 𝐭_i = t.

Replacing the representation used in Algorithm 2 with the one in Definition 26 and setting the height appropriately gives a valid algorithm using a k-ary tree. However, it performs worse than using odd k.
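For concreteness, the offset representation of Definition 26 can be computed digit by digit; the sketch below is our own illustration (the function name and interface are not from the paper).

```python
def offset_even_rep(t, k, w):
    """Offset k-ary representation for even k (cf. Definition 26):
    returns digits d[0..w-1] in [-k/2+1, k/2] with sum_i d[i]*k**i == t.
    Requires k even and 0 <= t <= k*(k**w - 1) // (2*(k - 1))."""
    assert k >= 2 and k % 2 == 0
    assert 0 <= t <= k * (k**w - 1) // (2 * (k - 1))
    lo = -k // 2 + 1                 # smallest allowed digit
    digits = []
    for _ in range(w):
        d = (t - lo) % k + lo        # unique digit in [lo, lo+k-1] congruent to t mod k
        digits.append(d)
        t = (t - d) // k
    assert t == 0                    # t was within the representable range
    return digits

# e.g. offset_even_rep(3, 4, 2) == [-1, 1], since 3 == -1 + 1*4
```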

Claim 27.

For even k ≥ 4, h ≥ 1, and T = k(k^h − 1)/(2(k−1)), Algorithm 2, with h correctly set, achieves a mean squared error of (k h³)/(2ε²) · 1/(1 − 1/k^h) + o(h²) when producing T outputs.

Proof sketch.

The same proof as for Lemma 14 does not work due to the asymmetry in the digits. Instead, we can express a recursion for the number of terms needed. Let ch be the total number of vertices combined to produce all prefixes that a tree of height h can support. Then we have

c_h = ((k+2)/4 + k(h−1)/4) · (k/2) · k^{h−1} + c_{h−1},

where c₀ = 0. The idea behind the recursion is as follows: the first term corresponds to all the integer representations with a positive leading digit, of which there are (k/2) · k^{h−1}; the leading digit contributes (k+2)/4 vertices on average, and the remaining digits contribute k(h−1)/4 on average. The second term covers the case where the leading digit is zero, in which case the contribution is c_{h−1}. Solving the recursion and dividing by T yields the average number of vertices added per output, c_h/T = (kh/4) · 1/(1 − 1/k^h) + o(1), which when multiplied by the variance of a vertex, 2h²/ε², gives the statement. Ignoring the lower order terms, this is worse by a factor 1/(1 − 1/k²) compared to the odd case, and we can also find the optimal k here.

Corollary 28.

Algorithm 2, with the modifications for even k, achieves a mean squared error that scales as (k/(2ε² log(k)³)) · log(T)³ + o(log(T)³), and which is minimized for k = 20, yielding (0.1238/ε²) · log(T)³ + o(log(T)³).

As the analysis is less elegant, produces more lower-order terms, and yields worse scaling, we focus on odd k in the paper.
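The recursion and the optimizing even k can also be checked numerically. The sketch below is our own illustration; logarithms are taken base 2, which matches the stated constant 0.1238 at k = 20.

```python
import math

def c(h, k):
    """Total number of vertices combined over all supported prefixes (recursion above)."""
    if h == 0:
        return 0
    return ((k + 2) / 4 + k * (h - 1) / 4) * (k / 2) * k ** (h - 1) + c(h - 1, k)

def T(h, k):
    """Number of outputs supported by a tree of height h."""
    return k * (k**h - 1) // (2 * (k - 1))

k, h = 20, 6
print(c(h, k) / T(h, k), k * h / 4)   # average vertex count per output grows like k*h/4

# Leading MSE constant k / (2 * log2(k)^3) from Corollary 28, minimized over even k
best = min(range(4, 101, 2), key=lambda j: j / (2 * math.log2(j) ** 3))
print(best, best / (2 * math.log2(best) ** 3))   # -> 20 and ~0.1239
```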

Appendix D Lower bound for continual observation on random inputs

In this section we consider continual observation in the “offline” setting, where a mechanism has access to the entire input and has to output a differentially private approximation of all prefix sums. This of course implies a lower bound for the “online” setting. We prove Theorem 4, which establishes for random inputs the same lower bound that was previously known for worst-case inputs.

Proof of Theorem 4.

We will prove the theorem via a packing argument applied to a carefully constructed collection of random strings. Without loss of generality, let B = √T be an integer divisible by 4, let k ≤ T^{1/5} be an even integer, and let T be large enough that k ≤ B/4. Divide the timeline [1, T] into blocks of length B: B₁ = [1, B], B₂ = [B+1, 2B], …, B_m = [T−B+1, T], where m = T/B. For an arbitrary string 𝐲 ∈ {0,1}^T, we use the notation 𝐲_{B_i} ∈ {0,1}^B for the substring of 𝐲 formed from the indices in B_i. We use err(·), taking as input the output of a mechanism, to denote the (random) maximum additive error over all the outputs. The argument is outlined below:

  • Construct random strings 𝐱^(0), …, 𝐱^(m) that are near-neighboring and have small total variation distance to a random string drawn from 𝒰.

  • Argue that ℳ achieves additive error at most α on these inputs with constant probability, by virtue of the small total variation distance.

  • Show that, for i ∈ [m], ℳ can be used to distinguish 𝐱^(i) from 𝐱^(0) unless α = Ω(ε^{−1} log T).

The intuition behind the argument is that a mechanism that is accurate on 𝒰 will also be accurate on our random (but highly correlated) strings 𝐱^(0), …, 𝐱^(m). Being accurate on these strings implies the ability to distinguish between them. However, since ℳ satisfies ε-DP, distinguishing between the inputs with good confidence is impossible, implying a lower bound on the additive error α.

To this end, we define the random variables X_j^(i) = ‖𝐱^(i)_{B_j}‖₁ = Σ_{ℓ ∈ B_j} 𝐱^(i)_ℓ, i.e., the number of 1s in block B_j of 𝐱^(i). With the argument outlined, consider the random strings 𝐱^(0), 𝐱^(1), …, 𝐱^(m) formed by the following process:

  1. Draw 𝐱^(0) from 𝒰 conditioned on X_j^(0) ∈ [B/4, 3B/4] for all j ∈ [m]. Call this distribution 𝒟^(0).

  2. For i ∈ [m], derive 𝐱^(i) from 𝐱^(0) by choosing k 0s in block B_i uniformly at random and flipping them to 1s. Call these distributions 𝒟^(i).

We next define the algorithm Alg(𝐲, 𝐲′), outputting the difference between private prefix sums on two arbitrary inputs 𝐲, 𝐲′ ∈ {0,1}^T. Formally, on receiving 𝐲 and 𝐲′, it runs ℳ on each, producing outputs 𝐚₁, …, 𝐚_T and 𝐚′₁, …, 𝐚′_T respectively, and at time t the algorithm outputs Alg(𝐲, 𝐲′)_t = 𝐚_t − 𝐚′_t. We will show, step by step, that the accuracy guarantee of ℳ on 𝒰 implies the ability to distinguish between our random strings 𝐱^(0), …, 𝐱^(m).
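As a small illustration of this setup (our own sketch, not from the paper): it samples 𝐱^(0) and 𝐱^(i), runs a deliberately crude ε-DP prefix-sum mechanism in place of ℳ, and inspects Alg at the ends of blocks. All parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_x0(T, B):
    """Rejection-sample x^(0) ~ D^(0): uniform bits conditioned on every
    block's number of 1s lying in [B/4, 3B/4]."""
    while True:
        x = rng.integers(0, 2, size=T)
        counts = x.reshape(T // B, B).sum(axis=1)
        if np.all((counts >= B // 4) & (counts <= 3 * B // 4)):
            return x

def flip_k_zeros(x0, i, B, k):
    """Derive x^(i) from x^(0) by flipping k zeros in block B_i to ones."""
    x = x0.copy()
    block = np.arange((i - 1) * B, i * B)
    zeros = block[x[block] == 0]
    x[rng.choice(zeros, size=k, replace=False)] = 1
    return x

def crude_private_prefix_sums(x, eps):
    """Stand-in for M: the vector of prefix sums has l1-sensitivity T, so
    adding Lap(T/eps) noise to each prefix sum is eps-DP (very inaccurate;
    only the construction is being illustrated here)."""
    T = len(x)
    return np.cumsum(x) + rng.laplace(scale=T / eps, size=T)

T, B, k, eps, i = 400, 20, 4, 1.0, 3          # hypothetical parameters
x0 = draw_x0(T, B)
xi = flip_k_zeros(x0, i, B, k)
# Alg(x^(i), x^(0)) at the end of each block: the *true* gap is 0 before
# block i and exactly k from block i onwards.
diff = crude_private_prefix_sums(xi, eps) - crude_private_prefix_sums(x0, eps)
print(diff[B - 1 :: B])
```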

Consider Alg(𝐱^(i), 𝐱^(0)) for i ∈ [1, m]. Note that the true prefix sums of 𝐱^(i) and 𝐱^(0) at the end of each block are identical up until block B_i, at the end of which they differ by exactly k. We leverage this observation by introducing the following disjoint events E_j. For j ∈ [1, m] and arbitrary input strings 𝐲, 𝐲′, we define E_j as the event that the following two statements are true: (1) for all ℓ ∈ [1, j−1]: |Alg(𝐲, 𝐲′)_{ℓB}| ≤ k/2, and (2) |Alg(𝐲, 𝐲′)_{jB}| > k/2. In other words, E_j is the event that, when tracking only the outputs at the end of blocks, Alg(𝐲, 𝐲′) outputs a value greater than k/2 in absolute value for the first time at the end of block B_j. Set α = k/4. By our construction, if ℳ(𝐱^(0)) and ℳ(𝐱^(i)) each have additive error at most α when running Alg(𝐱^(i), 𝐱^(0)), then Alg(𝐱^(i), 𝐱^(0)) has additive error at most 2α = k/2 on every prefix sum, and E_i happens with probability 1. We thus have

Pr[Alg(𝐱^(i), 𝐱^(0)) ∈ E_i] ≥ Pr[err(ℳ(𝐱^(i))) ≤ α] · Pr[err(ℳ(𝐱^(0))) ≤ α].

We want to argue that ℳ is accurate on 𝐱^(0), …, 𝐱^(m), with constant probability for each input taken separately, and we will argue via the total variation distance between each of these random strings and the random string 𝐱 ∼ 𝒰. In general, for a (possibly randomized) algorithm 𝒜 defined on S, distributions 𝒟, 𝒟′ over S with total variation distance TV(𝒟, 𝒟′) ≤ Δ, and an event E over the output, we have that if Pr_{𝒜, x∼𝒟}[𝒜(x) ∈ E] ≥ p ≥ 0, then Pr_{𝒜, x∼𝒟′}[𝒜(x) ∈ E] ≥ p − Δ. This follows from the observation that, in the extreme case, Δ probability mass can be shifted from inputs x ∈ S where Pr_𝒜[𝒜(x) ∈ E] = 1 to inputs x ∈ S where Pr_𝒜[𝒜(x) ∈ E] = 0. If we manage to show that, for i ∈ [0, m], TV(𝐱, 𝐱^(i)) = o(1), then we can lower bound Pr[Alg(𝐱^(i), 𝐱^(0)) ∈ E_i] by a positive constant. We derive bounds on these total variation distances in the next two lemmas.

Lemma 29.

Let 𝐱 ∼ 𝒰 and 𝐱^(0) ∼ 𝒟^(0). Then we have TV(𝐱, 𝐱^(0)) ≤ Δ, where Δ = 2√T · exp(−√T/8).

Proof.

Let S be the support of 𝒰 and S^(0) ⊆ S the support of 𝒟^(0). The number of 1s in an arbitrary block B_i of 𝐱 is described by X ∼ Bin(B, 1/2). Hoeffding's inequality gives Pr[B/4 ≤ X ≤ 3B/4] ≥ 1 − 2 exp(−B/8), whereby it follows that Pr[𝐱 ∈ S^(0)] ≥ (1 − 2 exp(−B/8))^{T/B}. Observing that 𝐱^(0) is uniformly distributed over S^(0) ⊆ S, we proceed to directly compute the total variation distance:

TV(𝐱, 𝐱^(0)) = (1/2) Σ_{𝐬 ∈ S} |Pr[𝐱 = 𝐬] − Pr[𝐱^(0) = 𝐬]|
= (1/2) |S^(0)| (1/|S^(0)| − 1/|S|) + (1/2) (|S| − |S^(0)|) · 1/|S| = 1 − Pr[𝐱 ∈ S^(0)]
≤ 1 − (1 − 2 exp(−B/8))^{T/B} ≤ 2T exp(−B/8)/B = 2√T exp(−√T/8),

where the last inequality uses Bernoulli's inequality and holds for T larger than an absolute constant. Next, we prove that the remaining random strings 𝐱^(1), …, 𝐱^(m) are also distributed closely to strings from 𝒰.

Lemma 30.

Let 𝐱 ∼ 𝒰 and 𝐱^(i) ∼ 𝒟^(i) for i ∈ [m]. Then TV(𝐱, 𝐱^(i)) ≤ Δ, where Δ = 10 · T^{−1/20}.

Proof.

As the total variation distance satisfies the triangle inequality, we can write

TV(𝐱, 𝐱^(i)) ≤ TV(𝐱, 𝐱^(0)) + TV(𝐱^(0), 𝐱^(i)).

The first term we have already bounded in Lemma 29. To bound the second term, we will use a property of total variation distance: it cannot increase if one applies the same transformation to both arguments. Let f be a mapping from [m] × [0, B] to a family of distributions, where f(i, ℓ) denotes the distribution described by a process where (1) a string 𝐲 ∼ 𝒟^(0) is sampled and then (2) 𝐲_{B_i} is overwritten by a uniformly sampled string in {0,1}^B, conditioned on it having ℓ 1s. For i ∈ [m], we claim that (f(i, X_i^(i)), 𝒟^(i)) and (f(i, X_i^(0)), 𝒟^(0)) are pairs of identical distributions. To see this for the first pair of distributions, consider 𝐲 ∼ f(i, X_i^(i)). By definition 𝐲_{B_j} is identically distributed to 𝐱^(i)_{B_j} for j ≠ i, and so the only special case to consider is j = i. For this case, we note that X_i^(i) is defined as the number of 1s in 𝐱^(i)_{B_i}, and that symmetry implies that any substring with that many 1s is equally likely; thus 𝐲_{B_i} must be identically distributed to 𝐱^(i)_{B_i} as well, and we are done. An analogous argument can be used for showing that f(i, X_i^(0)) and 𝒟^(0) are identically distributed.

We thus have TV(𝐱^(0), 𝐱^(i)) = TV(f(i, X_i^(0)), f(i, X_i^(i))) ≤ TV(X_i^(0), X_i^(i)), and so our problem reduces to bounding this quantity.

TV(X_i^(0), X_i^(i)) = (1/2) Σ_{ℓ=0}^{B} |Pr[X_i^(0) = ℓ] − Pr[X_i^(i) = ℓ]|
= (1/2) Σ_{ℓ=0}^{B} |Pr[X_i^(0) = ℓ] − Pr[X_i^(0) = ℓ − k]|
= (1/2) ( Σ_{ℓ=B/4}^{B/4+k−1} Pr[X_i^(0) = ℓ] + Σ_{ℓ=3B/4+1−k}^{3B/4} Pr[X_i^(0) = ℓ]
+ Σ_{ℓ=B/4+k}^{3B/4} |Pr[X_i^(0) = ℓ] − Pr[X_i^(0) = ℓ − k]| )
≤ k Pr[X_i^(0) = B/4 + k] + (1/2) Σ_{ℓ=B/4+k}^{3B/4} |Pr[X_i^(0) = ℓ] − Pr[X_i^(0) = ℓ − k]|

The last step follows from the symmetry of X_i^(0), namely Pr[X_i^(0) = B/2 + ℓ] = Pr[X_i^(0) = B/2 − ℓ], and from the probability being greater for values closer to B/2. Observe that the first B/4 − k/2 terms in the second sum of the last step are equal to the last B/4 − k/2 terms by the symmetry of the distribution around B/2 (for ℓ = B/2 + k/2, the term in the sum is equal to zero). Continuing the derivation and leveraging this symmetry, we get

TV(X_i^(0), X_i^(i)) ≤ k Pr[X_i^(0) = B/4 + k] + Σ_{ℓ=B/4+k}^{B/2+k/2−1} (Pr[X_i^(0) = ℓ] − Pr[X_i^(0) = ℓ − k])
= k Pr[X_i^(0) = B/4 + k]
+ Σ_{ℓ=0}^{k−1} (Pr[X_i^(0) = B/2 − k/2 + ℓ] − Pr[X_i^(0) = B/4 + ℓ])
≤ 2k Pr[X_i^(0) = B/2],

where the first equality follows from the sum telescoping and the second inequality follows from bounding each term in the sum by Pr[X_i^(0) = B/2]. Continuing, letting X ∼ Bin(B, 1/2), we have that

2k Pr[X_i^(0) = B/2] = 2k Pr[X = B/2 | B/4 ≤ X ≤ 3B/4] = 2k Pr[X = B/2] / (1 − 2 Pr[X < B/4])
≤ 2k Pr[X = B/2] / (1 − 2 exp(−B/8)) ≤ 6k Pr[X = B/2] = 6k · 2^{−B} · (B choose B/2)
≤ 6k · 2^{−B} · 2^B √(2/(πB)) = 6√(2/π) · k/√B ≤ 6√(2/π) · T^{−1/20},

where all inequalities are valid for T greater than some absolute constant, the bound on the central binomial coefficient is a variant of Stirling's inequality, and the last inequality uses k ≤ T^{1/5} and B = √T. We finally get

TV(𝐱, 𝐱^(i)) ≤ TV(𝐱, 𝐱^(0)) + TV(𝐱^(0), 𝐱^(i)) ≤ 2√T exp(−√T/8) + 6√(2/π) · T^{−1/20} ≤ 10 · T^{−1/20},

where the last inequality is true for T ≥ 400, thus proving the lemma. Using Lemma 29 and Lemma 30, we arrive at

Pr[Alg(𝐱^(i), 𝐱^(0)) ∈ E_i] ≥ Pr[err(ℳ(𝐱^(i))) ≤ α] · Pr[err(ℳ(𝐱^(0))) ≤ α]
≥ (Pr[err(ℳ(𝐱)) ≤ α] − TV(𝐱^(i), 𝐱))
× (Pr[err(ℳ(𝐱)) ≤ α] − TV(𝐱^(0), 𝐱))
≥ (3/4 − Δ)(3/4 − Δ) ≥ 1/2,

where the last inequality holds for T greater than an absolute constant. Next observe that 𝐱^(i) and 𝐱^(0) are k-neighboring with probability 1, and so we have Pr[Alg(𝐱^(i), 𝐱^(0)) ∈ E_i] ≤ e^{kε} · Pr[Alg(𝐱^(0), 𝐱^(0)) ∈ E_i], and therefore Pr[Alg(𝐱^(0), 𝐱^(0)) ∈ E_i] ≥ e^{−kε}/2. Since all E_j are disjoint, we get

1 ≥ Σ_{j=1}^{m} Pr[Alg(𝐱^(0), 𝐱^(0)) ∈ E_j] ≥ m e^{−kε}/2,

and therefore

k ≥ ε^{−1} log(m/2).

Recalling that m = T/B = √T, we thus have k = Ω(ε^{−1} log T) and therefore α = k/4 = Ω(ε^{−1} log T).
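The quantitative steps in Lemmas 29 and 30 can also be sanity-checked numerically for concrete parameter values; the sketch below is our own illustration with hypothetical T, B and k.

```python
from math import comb, exp, sqrt, pi

T = 160_000                       # hypothetical; B = sqrt(T) and k <= T**(1/5)
B, k = 400, 10

# Lemma 29: TV(x, x^(0)) <= 2*sqrt(T)*exp(-sqrt(T)/8)
print(2 * sqrt(T) * exp(-sqrt(T) / 8))                  # negligibly small

# Lemma 30: 2k*Pr[X = B/2 | B/4 <= X <= 3B/4] against the stated bounds
p_half = comb(B, B // 2) / 2**B                          # Pr[X = B/2]
p_tail = sum(comb(B, j) for j in range(B // 4)) / 2**B   # Pr[X < B/4]
lhs = 2 * k * p_half / (1 - 2 * p_tail)
print(lhs, 6 * sqrt(2 / pi) * k / sqrt(B), 10 * T ** (-1 / 20))
```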