
Differentially Private High-Dimensional Approximate Range Counting, Revisited

Martin Aumüller, IT University of Copenhagen, Denmark
Fabrizio Boninsegna¹, Department of Information Engineering, University of Padova, Italy
Francesco Silvestri, Department of Information Engineering, University of Padova, Italy

¹ Corresponding author; work partially done while visiting IT University of Copenhagen.
Abstract

Locality Sensitive Filters are known for offering a quasi-linear space data structure with rigorous guarantees for the Approximate Near Neighbor search (ANN) problem. Building on Locality Sensitive Filters, we derive a simple data structure for the Approximate Near Neighbor Counting (ANNC) problem under differential privacy (DP). Moreover, we provide a simple analysis leveraging a connection with concomitant statistics and extreme value theory. Our approach produces a simple data structure with a tunable parameter that regulates a trade-off between space-time and utility. Through this trade-off, our data structure achieves the same performance as the recent findings of Andoni et al. (NeurIPS 2023) while offering better utility at the cost of higher space and query time. In addition, we provide a more efficient algorithm under pure ε-DP and elucidate the connection between ANN and differentially private ANNC. As a side result, the paper provides a more compact description and analysis of Locality Sensitive Filters for Fair Near Neighbor Search, improving a previous result in Aumüller et al. (TODS 2022).

Keywords and phrases:
Differential Privacy, Locality Sensitive Filters, Approximate Range Counting, Concomitant Statistics
Copyright and License:
© Martin Aumüller, Fabrizio Boninsegna, and Francesco Silvestri; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Sorting and searching; Mathematics of computing → Probabilistic algorithms; Security and privacy
Acknowledgements:
The authors would like to thank Ninh Pham and Rasmus Pagh for useful discussions.
Funding:
This work was supported in part by the MUR PRIN 20174LF3T8 AHeAD project, by MUR PNRR CN00000013 National Center for HPC, Big Data and Quantum Computing, and by Marsden Fund (MFP-UOA2226).
Editors:
Mark Bun

1 Introduction

Since the emergence of deep learning-based text and image embeddings, such as CLIP [28], the management of collections of high-dimensional vectors has become a critical challenge. Efficiently handling these collections and supporting complex query operations is essential for various applications, including social networks [8] and recommendation systems [29]. The required query operations in applications are often similarity search primitives, which have been widely studied in the literature [30, 33]. In particular, the r-Near Neighbor Search (r-NNS) or Count (r-NNC) problems are fundamental primitives: given a set 𝒮 ⊆ ℝ^d of n vectors in d dimensions and a radius r > 0, construct a data structure that, for any query 𝐪 ∈ ℝ^d, returns a point in 𝒮 with distance at most r from 𝐪 if such a point exists, or counts the number of such points. Unfortunately, these problems suffer from the curse of dimensionality, which refers to the phenomenon that any exact data structure with polynomial size requires a query time exponential in the dimension of the input space. This is supported by popular algorithmic hardness conjectures [2, 34]. To address this issue, approximate approaches have been proposed: for a given approximation factor c > 1, we consider the (c,r)-Approximate Near Neighbor Search ((c,r)-ANNS) problem and the (c,r)-Approximate Near Neighbor Count ((c,r)-ANNC) problem. These relax the original problem constraints in such a way that the data structure may use points at distance at most cr to answer a query. For a search operation, this means that a point at distance at most cr from 𝐪 can be returned; for a count operation, points at distance between r and cr may be counted as near neighbors. Locality Sensitive Hashing (LSH) [22] and Locality Sensitive Filters (LSF) [5, 9] are the most common approaches for solving approximate near neighbor problems with theoretical guarantees.

In another line of research, there is an increasing demand for developing solutions for data analysis tasks that preserve the privacy of sensitive personal information in the input data. This was, for example, highlighted in an invited talk at PODS in 2019 by Dwork [12], who discussed the application of differential privacy (DP) [13] in the US census. Differential privacy is a widely adopted notion that can provide rigorous guarantees of privacy: intuitively, DP guarantees that the removal or addition of a single input entry cannot significantly affect the final result of an analysis task. The main idea of the DP approach is to inject well-calibrated noise into the data (structure) that protects privacy without significantly affecting the accuracy of the analysis task. Counting queries, which can be seen as a more general form of the near neighbor counting problem studied in this paper, are a well-studied area in DP (see [19, 32]). However, the error in the final count must be polynomial in n and the dimension of the input space to guarantee privacy [20, 24]. To reduce the error, the requirements should be relaxed to provide any count within a fuzzy range around the query [21]. Although this is different from the curse of dimensionality above, achieving an efficient solution similarly necessitates the use of approximate counts. Thus, it is natural to study approximate counting problems for the high-dimensional ANN problem when considering them in a DP setting.

In this paper, we focus on ANN problems for the inner product on the d-dimensional unit sphere 𝕊^{d−1} := {𝐱 ∈ ℝ^d : ‖𝐱‖₂ = 1}, which is often called cosine or angular similarity. In these settings, the goal is to find or count points with large inner products. More specifically, we study the (α,β)-ANN count and search problems with 0 ≤ β < α < 1. Let B(𝐪,α) := {𝐱 ∈ 𝕊^{d−1} : ⟨𝐱,𝐪⟩ ≥ α} be the set of unit vectors that have an inner product of at least α with 𝐪 ∈ 𝕊^{d−1}. The counting variant asks, for a query 𝐪, to count all points in a dataset 𝒮 with inner product at least α, but tolerates points with inner product at least β. That means that the resulting estimate E should satisfy |𝒮 ∩ B(𝐪,α)| ≤ E ≤ |𝒮 ∩ B(𝐪,β)|. This is a common notation in inner product search, and intuitively α and β are equivalent to r and r/c in (c,r)-ANNC. The first result for differentially private (α,β)-ANNC in high dimensions has been recently provided by Andoni et al. [4], where the authors use a linear space tree-based data structure based on the concept of LSF. (As observed by [4] as well, a solution on the unit sphere leads to a solution for the whole Euclidean space thanks to embedding methods. We defer all discussion of this embedding and its applicability to Appendix C.)

In this work, we explore the design space of locality sensitive filtering-based solutions to ANN and ANNC problems. We show that by revisiting the LSF framework for ANN it is possible to derive a simpler and more compact solution for ANN. Building on this result, we derive a novel solution for ANNC under differential privacy that extends the range of applicability of the state of the art [4] by removing some limitations on parameter ranges and differential privacy assumptions. In particular, we provide strong guarantees in the regime of pure DP (in contrast to approximate DP), and show that balancing the noise term of DP with the approximation error of LSF is not the only design choice: in fact, spending more space and query time results in more accurate solutions. The following section will provide more details on the technical contribution.

1.1 Our Contribution

Algorithm 1 DPTop-1 Data Structure for DP-ANNC.
Revisiting LSF for ANN

Our work is based on a construction for (α,β)-ANN first proposed by Aumüller et al. [6] in the context of algorithmic fairness. We provide a more compact description and analysis of LSF for (α,β)-ANN, and obtain a data structure with a lower pre-processing time of O(d·n^{1+o(1)}) (down from O(d·n^{1+ρ+o(1)}), where 0 < ρ := ρ(α,β) ≤ 1 is the strength of the filter). Moreover, assuming that some random variables follow a limiting distribution (Theorem 5), we get a more compact and simpler proof than [6], leveraging an elegant connection with concomitant statistics and extreme value theory; the parameters used in our solution naturally follow from this theory. In Section 5.1, we demonstrate that an alternative construction procedure, CloseTop-1, remains effective without assuming any limiting distributions, matching the guarantees obtained in the asymptotic regime.

From ANN to DP-ANNC

We then present a solution for (α,β)-ANNC under differential privacy. More specifically, we provide a general methodology that allows us to translate a variant of a list-of-points data structure [5] for ANN into a data structure for DP-ANNC. Intuitively, a list-of-points data structure organizes the input points in a collection of lists, and a query consists of a scan of some of these lists: this is the case, for instance, for methods based on LSH or LSF. This approach offers a way to develop the data structure in two steps: first describing a data structure satisfying certain characteristics for ANN, and then applying a suitable DP mechanism on top of it.

When the data structure is built on top of the previous result, we get the DPTop-1 data structure presented in Algorithm 1, which we will now describe in words. Given a dataset 𝒮 ⊆ 𝕊^{d−1} consisting of n points, two similarity thresholds 0 ≤ β < α < 1, and privacy parameters ε > 0 and δ ∈ [0,1), the data structure samples m = n^{O(1)} Gaussian vectors. We associate with each such vector a counter, initialized with 0; each point in 𝒮 increments the counter of the vector that maximizes the inner product. Then, the counts are made differentially private by a suitable DP mechanism make_private, for example the Truncated Laplace mechanism [17] or the ALP mechanism [7]. Depending on the mechanism used, different privacy guarantees can be provided. Since each point increments exactly one counter, the absence or presence of a data point affects only a small part of the data structure. As a result, the sensitivity of the data structure is low and only a small amount of noise has to be added. The m vectors with their noisy counts form the data structure that can be released publicly. For any query 𝐪 ∈ 𝕊^{d−1}, the estimate of the number of near neighbors with inner product at least α is the sum of the counters associated with vectors whose inner product similarity with 𝐪 is greater than η(α) = α√(2 log m) − √(2(1−α²) log log m). This choice is guided by the theory of concomitant statistics and extreme value theory of Gaussian random variables [10], which we will formally introduce in Section 2.3. As we will detail in Section 3, in the asymptotic regime – hence for n → ∞ – DPTop-1 offers a simple and elegant solution for the (α,β)-ANNC problem under differential privacy. The following theorem provides the guarantees of DPTop-1 when using the Truncated Laplace mechanism [17] as a privacy mechanism (we refer to Theorem 9 for the exact statements regarding the ANN data structure, and Theorem 13 for the exact statements of the DP-ANNC implementation).

Theorem 1.

Consider the asymptotic regime n → ∞, ε > 0, δ ∈ (0, 1/2), 0 ≤ β < α < 1, and α−β = Ω(√(log log n / log n)). Let 𝒮 = {𝐱_i}_{i=1,…,n} ⊆ 𝕊^{d−1} and let 𝐪 ∈ 𝕊^{d−1}. Then DPTop-1 (Algorithm 1) with the Truncated Laplace mechanism satisfies (ε,δ)-DP, and with probability at least 2/3, the query returns ans~ such that

(1−o(1))·|𝒮 ∩ B(𝐪,α)| − O((log(1/δ)/ε)·n^{ρ+o(1)}) ≤ ans~ ≤ |𝒮 ∩ B(𝐪,β)| + O((log(1/δ)/ε)·n^{ρ+o(1)}).

The data structure has pre-processing time O(d·n^{1+ρ/(1−α²)}), and expected query time and space O(d·n^{ρ/(1−α²)}).
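For intuition, the following is a minimal, illustrative sketch of the DPTop-1 construction and query in Python (our own code, not the authors' implementation). The helper names are ours, and for brevity the privatization step uses plain Laplace noise instead of the Truncated Laplace mechanism of [17]:

```python
import numpy as np

def build_dptop1(S, alpha, theta, eps, rng):
    """Sketch of Algorithm 1. S: (n, d) array with unit-norm rows."""
    n, d = S.shape
    m = int(np.ceil(n ** (theta / (1.0 - alpha ** 2))))   # number of Gaussian filters
    A = rng.standard_normal((m, d))                        # filter vectors a_1, ..., a_m
    best = np.argmax(S @ A.T, axis=1)                      # each point -> its closest filter
    counts = np.bincount(best, minlength=m).astype(float)
    # Privatization sketch: plain Laplace noise (eps-DP, sensitivity 1 per counter under
    # addition/removal); the paper's DPTop-1 uses the Truncated Laplace mechanism of [17].
    noisy = counts + rng.laplace(scale=1.0 / eps, size=m)
    eta = alpha * np.sqrt(2 * np.log(m)) - np.sqrt(2 * (1 - alpha ** 2) * np.log(np.log(m)))
    return A, noisy, eta

def query_dptop1(A, noisy, eta, q):
    """Estimate |S ∩ B(q, alpha)| by summing the noisy counters above the threshold."""
    return float(noisy[A @ q >= eta].sum())

# Hypothetical usage on random data.
rng = np.random.default_rng(0)
S = rng.standard_normal((2000, 64))
S /= np.linalg.norm(S, axis=1, keepdims=True)
A, noisy, eta = build_dptop1(S, alpha=0.8, theta=0.4, eps=1.0, rng=rng)
print(query_dptop1(A, noisy, eta, q=S[0]))
```

Since each point increments exactly one counter, adding independent Laplace(1/ε) noise to every counter already suffices for pure ε-DP under addition/removal; DPTop-1 uses truncated noise instead to obtain the (ε,δ) guarantees stated in Theorem 1.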

This simple algorithm matches the accuracy of the solution found by Andoni et al. [4] and results in a straightforward space partitioning of 𝕊d1, one of the main goals of [4]. Furthermore, our approach provides a solution that works for almost all similarity thresholds on the unit sphere, while [4] supports a single distance threshold and relies on embedding techniques and scaling for all other distance thresholds.

While Algorithm 1 is potentially already practical, both the space and running time requirements are worse than the solution presented in [4] due to the large number of filters m. We suggest two improvements to the algorithm that do not affect the compactness of the algorithm or of the proof. We first observe that the theoretical result in Theorem 1 works only for n → ∞: we drop this limitation in Section 5.1 thanks to a small but novel variation in the construction procedure. We then observe in Section 5.2 how to achieve almost linear pre-processing time, linear space, and d·n^{ρ+o(1)} expected query time, which is optimal due to a space-time trade-off lower bound [5] (it suffices to set ρ_u = 0 in Theorem 3.3 of the reference paper to get the lower bound for the running time), by concatenating polylog(n) data structures using a technique called tensorization [9, 6]. We call this final version TensorCloseTop-1. Table 1 summarizes the guarantees of all algorithms described in this paragraph and compares them with the state-of-the-art approach [4].

Table 1: Results for DP-ANNC, where σ = (1−α²)(1−β²)/((1−αβ)² + (α−β)²), ρ = (1−α²)(1−β²)/(1−αβ)², and σ < ρ < 2σ for any 0 ≤ β < α < 1. The bounds marked (∗) hold only in expectation. Time and space bounds omit a factor d.

Mechanism | Privacy | Additive Error | Preprocessing Time | Expected Query Time | Space
DPTop-1 w/ Truncated Laplace | (ε,δ) | O((log(1/δ)/ε)·n^{ρ+o(1)}) | O(n^{1+ρ/(1−α²)}) | O(n^{ρ/(1−α²)}) | O(n^{ρ/(1−α²)})
Andoni et al. [4], and TensorCloseTop-1 w/ Truncated Laplace | (ε,δ) | O((log(1/δ)/ε)·n^{ρ+o(1)}) | n^{1+o(1)} | n^{ρ+o(1)} | O(n)
Andoni et al. [4] | (ε,0) | O((1/ε)·n^{ρ+o(1)}) | n^{1+o(1)} (∗) | n^{ρ+o(1)} | O(n) (∗)
TensorCloseTop-1 w/ Max Projection | (ε,0) | O((1/ε)·n^{ρ+o(1)}) | n^{1+o(1)} | n^{ρ+o(1)} | O(n)
Unbalanced TensorCloseTop-1 w/ Laplace | (ε,0) | O((1/ε)·n^{σ+o(1)}) | O(n^{σ/(1−α²)}) | n^{2σ+o(1)} | O(n^{σ/(1−α²)})
Balanced and Unbalanced Data Structures

As can be seen from the theorem statement, the provided estimate comes with an additive error of O(log(1/δ)·n^{ρ+o(1)}/ε). This term includes two fundamentally different error sources. First, we might include "far away points", i.e., points with inner product below β, in the count through the ANN data structure. This error does not depend on the choice of the privacy mechanism. The second source of error, which is due to privacy, arises from summing over n^{ρ+o(1)} noisy counters, as a query is expected to search through n^{ρ+o(1)} buckets on average. DPTop-1 and the data structure of Andoni et al. [4] balance these two errors so that both are upper bounded by n^{ρ+o(1)}; however, as we will show in Sections 3 and 4, this is not necessarily an optimal trade-off. By using an unbalanced data structure – to be discussed in detail in Section 3.1 – with Laplace noise, we can achieve a more accurate result at the cost of larger running time and more space. In fact, for any parameter, our unbalanced data structure solves DP-ANNC with an additive error n^{σ+o(1)} for σ < ρ. The main insight is that the sum of n^{ρ+o(1)} noisy, unbiased, and uncorrelated counters, as provided by the Laplace mechanism, scales with n^{ρ/2+o(1)} by concentration arguments. This makes it suboptimal to impose the same upper bound on both sources of error.
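A quick toy simulation (ours, not from the paper) illustrates this concentration: the error of a sum of K independent Laplace(1/ε) noise terms grows like √K/ε rather than K/ε.

```python
import numpy as np

rng = np.random.default_rng(1)
eps, K, trials = 1.0, 10_000, 200
# Each trial sums K independent, zero-mean Laplace(1/eps) noise terms.
sums = rng.laplace(scale=1.0 / eps, size=(trials, K)).sum(axis=1)
print("typical |noise| of the sum  :", np.abs(sums).mean())   # on the order of sqrt(K)/eps ≈ 100
print("naive worst-case bound K/eps:", K / eps)                # 10000
```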

Comparison to Andoni et al. (NeurIPS 2023)

For the convenience of the reader, we now provide a quick description of the solution in [4], and highlight differences to the present work. Let η_u, η_q, T, and K be suitable parameters that depend on α and β. The data structure consists of a tree with degree T and height K; each internal node v is associated with a d-dimensional random Gaussian vector 𝐠_v and with a subset of the input set 𝒮. At the beginning, the root contains the entire input set 𝒮. Then, for each internal node, with a top-down approach, we partition the assigned vectors into T groups: each input vector 𝐱 is assigned to the child node v with smallest index such that ⟨𝐱,𝐠_v⟩ ≥ η_u. (If no such index exists, the point is not stored in the data structure.) Once the input points have been processed, we replace the list in each leaf with its noisy size by adding Truncated Laplace noise [17] (if the final value is below a given threshold, we replace the value with 0). Given a query 𝐪, we collect the counts of all leaves ℓ for which ⟨𝐪,𝐠_v⟩ ≥ η_q for all nodes v on the unique path from the root to ℓ, and return as a final estimate the sum of these counts. Using this tree data structure circumvents the evaluation problem of a too large number of filters mentioned for our variant above.
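For illustration only, here is a rough, non-private sketch of this tree-based partitioning as we read it from the description above (our own code; the concrete parameter choices and the Truncated Laplace privatization of the leaf counts in [4] are omitted):

```python
import numpy as np

def build_tree(points, depth, T, K, eta_u, d, rng):
    """Recursively split `points` (a list of unit vectors) into T children per node."""
    if depth == K:                       # leaf: in [4] this count would then be privatized
        return {"count": len(points)}
    gs = rng.standard_normal((T, d))     # one Gaussian vector per child
    groups = [[] for _ in range(T)]
    for x in points:
        for j in range(T):               # first child whose filter the point passes
            if np.dot(x, gs[j]) >= eta_u:
                groups[j].append(x)
                break                    # points passing no filter are not stored
    children = [build_tree(g, depth + 1, T, K, eta_u, d, rng) for g in groups]
    return {"gs": gs, "children": children}

def query_tree(node, q, eta_q):
    """Sum the counts of all leaves reachable through filters that the query passes."""
    if "count" in node:
        return node["count"]
    return sum(query_tree(child, q, eta_q)
               for g, child in zip(node["gs"], node["children"])
               if np.dot(q, g) >= eta_q)
```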

As the goal of their paper is to address Euclidean distance, the range of the α and β parameters is limited; their analysis works only for an α value that corresponds to a Euclidean distance of Θ(1/log^{1/8} n), and all other distances are only supported through embedding and scaling, which adds an additional distortion to the distance values. In contrast, our solution allows for a wider range of these parameters, increasing the applicability for inner product similarity on the sphere, and still yields a data structure that holds for Euclidean distance. Furthermore, by removing the tree structure, we are able to design an algorithm with few data dependencies that can likely exploit hardware accelerators (e.g., Nvidia Tensor Cores, Intel AMX) for neural networks that are optimized for batches of inner products.

1.2 Previous work

Near Neighbor Search

Locality Sensitive Hashing (LSH) [22] is one of the most used approaches for solving ANN with rigorous guarantees. However, it suffers from large space requirements. Indeed, LSH requires O(nd + n^{1+ρ}) memory words, where ρ is a parameter describing the "power" of the used LSH (e.g., ρ = O(1/c²) for Euclidean distance [3]): the data structure needs to create n^ρ hash tables, each storing all n points. Both Panigrahy [25] and Kapralov [23] provided linear space solutions using variants of LSH. An interesting technique to achieve smooth space-time trade-offs is given by Locality Sensitive Filters (LSF) [9, 5]. In the context of this work, the interesting space-time trade-off to focus on is the linear space regime [6]. Besides offering optimal space, this regime has many additional interesting properties for downstream applications. For example, very recently, Andoni et al. [4] showed their application in the context of differentially private range counting in high-dimensional data. As mentioned above, a linear space data structure stores each point of the dataset at most once, so the absence or presence of a data point only affects a small part of the data structure; in comparison, with traditional LSH-based approaches a single point is stored in many different tables of the data structure.

Differentially Private Counting Queries

Counting queries require, except for a few classes of queries, a polynomial error in n and in the space dimension to guarantee privacy [20, 24]. This incentivized Huang and Yao [21] to relax the condition, allowing for the release of any count within a fuzzy range of the query. For ball queries in ℝ^d, this is essentially the problem of releasing any count between |𝒮 ∩ B_D(𝐪,r)| and |𝒮 ∩ B_D(𝐪,cr)|, which we identify as the (c,r)-Approximate Near Neighbor Count (ANNC) problem. One of the main results in [21] is that there exists a differentially private solution of the problem with poly-logarithmic error in n at the price of an exponential dependence on the dimension d. A solution for the high-dimensional case was proposed in [4], where Andoni et al. used a linear space data structure for the (c,r)-ANN to solve the differentially private (c,r)-ANNC. The authors developed a Locality Sensitive Filtering data structure with ρ = 4c²/(c²+1)² for the differentially private (c,r)-ANNC in the Euclidean space with additive error O(n^{ρ+o(1)}) and multiplicative error 1−o(1), getting rid of the dependence on the dimension. The proposed data structure is based on a more general theory for data structures with space-time trade-offs [5], making the analysis more involved. In this paper, we will show that our data structure offers the same guarantees with a more streamlined analysis.

2 Preliminaries

2.1 Notation

We let [m] be the set of integers {1,…,m}. We denote with 𝐪 ∈ 𝕊^{d−1} a query point, and with 𝐱_ϱ a point of the dataset such that ⟨𝐱_ϱ,𝐪⟩ = ϱ. We write 𝐚_𝐱 for the random vector associated with 𝐱, and define X_𝐱 := ⟨𝐚_𝐱,𝐱⟩ and Q_𝐱 := ⟨𝐚_𝐱,𝐪⟩ as the concomitant – to be defined in Section 2.3 – of X_𝐱. If X_𝐱 = max_{i∈[m]} ⟨𝐚_i,𝐱⟩, then it is denoted as X_{𝐱,(m)} and its concomitant as Q_{𝐱,[m]}. The threshold for the query filter is η = α√(2 log m) − √(2(1−α²) log log m). A ball on the hypersphere under inner product similarity centered at 𝐪 is denoted by B(𝐪,α) := {𝐱 ∈ 𝕊^{d−1} : ⟨𝐪,𝐱⟩ ≥ α}. We call a point 𝐱 ∈ 𝒮 close to 𝐪 if ⟨𝐪,𝐱⟩ ≥ α, and far if ⟨𝐪,𝐱⟩ < β. We let n be the number of points in the dataset 𝒮 ⊆ 𝕊^{d−1}. We denote the Gaussian distribution with mean μ and variance σ² by 𝒩(μ,σ²).

2.2 Problem Definition

Definition 2 ((α,β)-ANN).

Consider a set 𝒮 ⊆ 𝕊^{d−1} of n points. The (α,β)-Approximate Near Neighbor Search (ANN) problem asks to construct a data structure for 𝒮 that, for a given query 𝐪 ∈ 𝕊^{d−1} such that B(𝐪,α) contains a point of 𝒮, returns a point in 𝒮 ∩ B(𝐪,β).

We will study a data structure that solves this problem with asymptotically high probability, hence at least 1−o(1). The inner product similarity is related to the Euclidean distance, as ‖𝐱−𝐲‖₂² = 2(1−⟨𝐱,𝐲⟩) for any 𝐱,𝐲 ∈ 𝕊^{d−1}. Therefore, for α = 1−r²/2 and β = 1−(cr)²/2, a (c,r)-ANN in (𝕊^{d−1}, ℓ₂) is equivalent to the (α,β)-ANN defined above.
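For concreteness, translating a Euclidean (c,r)-ANN instance on the unit sphere into the (α,β) thresholds used here is a one-line computation (a small helper of our own):

```python
def euclidean_to_inner_product(r, c):
    """Map (c, r)-ANN on the unit sphere to (alpha, beta) thresholds,
    using ||x - y||_2^2 = 2(1 - <x, y>) for unit vectors x, y."""
    alpha = 1.0 - r ** 2 / 2.0
    beta = 1.0 - (c * r) ** 2 / 2.0
    return alpha, beta

print(euclidean_to_inner_product(r=0.5, c=2.0))  # (0.875, 0.5)
```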

Definition 3 ((α,β)-ANNC).

Consider a set 𝒮 ⊆ 𝕊^{d−1} of n points. The Approximate Near Neighbor Counting (ANNC) problem asks to construct a data structure for 𝒮 that, for a given query 𝐪 ∈ 𝕊^{d−1}, returns a number between |𝒮 ∩ B(𝐪,α)| and |𝒮 ∩ B(𝐪,β)|.

This problem is the counting equivalent of the well-studied spherical range reporting problem (see for example [1]) that asks to enumerate all points at a certain distance from 𝐪.

2.3 Concomitant Order Statistics

The theory of concomitant order statistics offers a very elegant and intuitive tool for random projections in 𝕊^{d−1}, as highlighted in [15, 26, 27]. Let (X_1,Y_1),…,(X_m,Y_m) be m random samples from a bivariate distribution. We order the values according to X such that X_{(1)} ≤ ⋯ ≤ X_{(r)} ≤ ⋯ ≤ X_{(m)}. The Y-variate associated with X_{(r)} is denoted by Y_{[r]} and is called the concomitant of the r-th order statistic.

Relation With Random Projections

Let 𝐱,𝐲 ∈ 𝕊^{d−1} be such that ⟨𝐱,𝐲⟩ = ϱ, and let 𝐚 ∼ 𝒩(0,1)^d. Consider the random variables X = ⟨𝐱,𝐚⟩ and Y = ⟨𝐲,𝐚⟩; then (X,Y) ∼ 𝒩(0,0,1,1,ϱ), which is a standard bivariate normal distribution with correlation coefficient ϱ. (The general notation for a bivariate Gaussian distribution is 𝒩(μ_X, μ_Y, Var[X], Var[Y], Cov[X,Y]); here Cov[X,Y] = Σ_{i=1}^d x_i y_i = ϱ.) The relation between concomitants and order statistics for the bivariate normal distribution is given by the following lemma.

Lemma 4 ([10]).

Given m samples {(X_i,Y_i)}_{i=1,…,m} from the standard bivariate normal distribution 𝒩(0,0,1,1,ϱ), for any r ∈ {1,…,m} we have that Y_{[r]} = ϱ·X_{(r)} + Z_r, where Z_r is a random variable distributed as 𝒩(0, 1−ϱ²) and independent of X_{(r)}.

A standard result of concomitant order statistics states that Y_{[r]} − 𝔼[Y_{[r]}] weakly converges to a Gaussian distribution 𝒩(0, 1−ϱ²) [10]. Thus, defining F_{Y_{[m]}} as the distribution function of Y_{[m]}, we have that lim_{m→∞} F_{Y_{[m]}} = 𝒩(ϱ·𝔼[X_{(m)}], 1−ϱ²) [16]. By adding the fact that 𝔼[X_{(m)}] = √(2 log m) − o(1) [18], we get the following theorem.

Theorem 5 ([10, 18]).

Let {(X_i,Y_i)}_{i=1,…,m} be m i.i.d. samples from 𝒩(0,0,1,1,ϱ). Then Y_{[m]} weakly converges to 𝒩(ϱ√(2 log m), 1−ϱ²).

This asymptotic result serves as the basis of the intuition for our data structure: if we associate with each point 𝐱 ∈ 𝒮 the closest Gaussian vector 𝐚_𝐱 = argmax_{𝐚∈{𝐚_1,…,𝐚_m}} ⟨𝐚,𝐱⟩, then a query 𝐪 ∈ 𝕊^{d−1} with ⟨𝐪,𝐱⟩ = ϱ will find 𝐱 associated with a Gaussian vector whose inner product similarity with 𝐪 is approximately ⟨𝐪,𝐚_𝐱⟩ ≈ ϱ√(2 log m).
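This intuition is easy to check numerically; the following toy simulation (ours, not part of the paper) compares the concomitant of the maximal projection with ϱ√(2 log m); the two values agree up to a fluctuation with standard deviation √(1−ϱ²):

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, rho = 200, 5000, 0.8
x = np.zeros(d); x[0] = 1.0                                 # unit vector x
y = np.zeros(d); y[0] = rho; y[1] = np.sqrt(1 - rho ** 2)   # unit vector y with <x, y> = rho

A = rng.standard_normal((m, d))
X = A @ x                       # projections of x onto the m Gaussian vectors
Y = A @ y                       # projections of y (the concomitants)
print("concomitant of the max :", Y[np.argmax(X)])
print("rho * sqrt(2 log m)    :", rho * np.sqrt(2 * np.log(m)))
```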

2.4 Differential Privacy

Differential Privacy (DP) is a notion of indistinguishability for the outputs of protocols applied to neighboring datasets. Two datasets 𝒮 ∼ 𝒮′ are neighbors by addition/removal if they differ by the addition or removal of one point; they are neighbors by substitution if |𝒮| = |𝒮′| and they differ in exactly one point.

Definition 6 (Approximate Differential Privacy [14]).

For ε > 0 and δ ∈ [0,1), we say that a randomized algorithm 𝒜 is (ε,δ)-differentially private if for any two neighboring datasets 𝒮 ∼ 𝒮′, and any possible set of outcomes Y ⊆ range(𝒜), we have Pr[𝒜(𝒮) ∈ Y] ≤ e^ε·Pr[𝒜(𝒮′) ∈ Y] + δ.

We are mainly interested in histogram queries f : 𝒳^n → ℕ^{|𝒳|} [13], where 𝒳 is the data universe and n is the size of the data set. The most common way to privatize f(𝒮) is to obfuscate the true values by adding noise scaled to the sensitivity of the query Δf = max_{𝒮∼𝒮′} ‖f(𝒮) − f(𝒮′)‖₁ [13]. In our context, each data point contributes to exactly one counter. For the addition/removal neighboring relationship, Δf = 1; for substitution, Δf = 2. We consider three different DP mechanisms when privatizing counters, for example in the function make_private in Algorithm 1: the Truncated Laplace mechanism [17], the Laplace mechanism [13], and Max Projection [7]. More details can be found in Appendix A.1.
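As a minimal sketch (our own code) of how a counter histogram such as T in Algorithm 1 can be privatized: the first helper is the standard Laplace mechanism, the second is a truncated variant in the spirit of [17]; the truncation bound B below is our reading of that mechanism and should be treated as an assumption.

```python
import numpy as np

def laplace_histogram(counts, eps, rng):
    """eps-DP: add independent Laplace(1/eps) noise to every counter
    (sensitivity 1 under addition/removal of a single point)."""
    return counts + rng.laplace(scale=1.0 / eps, size=counts.shape)

def truncated_laplace_histogram(counts, eps, delta, rng):
    """(eps, delta)-DP sketch: Laplace noise resampled until it lies in [-B, B]."""
    B = np.log(1.0 + (np.exp(eps) - 1.0) / (2.0 * delta)) / eps  # assumed truncation bound
    noise = np.empty(counts.shape)
    for i in range(counts.size):
        z = rng.laplace(scale=1.0 / eps)
        while abs(z) > B:                 # rejection sampling of the truncated distribution
            z = rng.laplace(scale=1.0 / eps)
        noise[i] = z
    return counts + noise
```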

3 Top-1 Data Structure for ANN

Algorithm 2 Top-1 Data Structure.

Algorithm 2 describes the Top-1 data structure, which is the variant of Algorithm 1 targeting the (α,β)-ANN problem. Let 𝒜_m = (𝐚_1,…,𝐚_m) be a set of m random vectors drawn from 𝒩(0,1)^d. The data structure consists of a hash table that stores the input vectors assigned to each random vector in 𝒜_m: more specifically, we assign each input vector 𝐱 ∈ 𝒮 to the random vector in 𝒜_m with the largest inner product. For a given query vector 𝐪, the query algorithm selects all random vectors with an inner product larger than η with 𝐪. Then, it searches for an approximate near neighbor in the lists of points associated with these vectors in the hash table. We call buckets the indices of the hash table (i.e., the random vectors), and filters the function used to query the hash table (i.e., the inner product). In this section, we consider the asymptotic regime n → ∞, so as to use the limiting distribution of the extreme concomitant in Theorem 5.
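A compact sketch of the non-private Top-1 data structure (our own code, with hypothetical names; the threshold η follows the formula from Section 2.1):

```python
import numpy as np
from collections import defaultdict

class Top1:
    def __init__(self, S, alpha, theta, rng):
        n, d = S.shape
        self.S = S
        self.m = int(np.ceil(n ** (theta / (1.0 - alpha ** 2))))
        self.A = rng.standard_normal((self.m, d))             # buckets = random vectors
        self.eta = (alpha * np.sqrt(2 * np.log(self.m))
                    - np.sqrt(2 * (1 - alpha ** 2) * np.log(np.log(self.m))))
        self.table = defaultdict(list)                        # hash table: bucket -> point ids
        for idx, x in enumerate(S):
            self.table[int(np.argmax(self.A @ x))].append(idx)

    def search(self, q):
        """Buckets whose filter value exceeds the query threshold."""
        return np.flatnonzero(self.A @ q >= self.eta)

    def query(self, q, beta):
        """Scan the selected buckets; return any point with <x, q> >= beta, else None."""
        for b in self.search(q):
            for idx in self.table[b]:
                if np.dot(self.S[idx], q) >= beta:
                    return idx
        return None
```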

Lemma 7 (Probability to Find a Close Point).

For n → ∞, the set of buckets returned by Top-1 search contains a bucket with a close point, if one exists, with probability at least 1−o(1).

Proof.

Consider a close point 𝐱_α, associated with the bucket 𝐚_{𝐱_α}. That bucket is found by search if ⟨𝐚_{𝐱_α},𝐪⟩ = Q_{𝐱_α,[m]} ≥ η. From Proposition 20 and Theorem 5, we observe that

Pr[Q_{𝐱_α,[m]} < η] = Pr_{Z∼𝒩(0, 1−α²)}[Z < −√(2(1−α²) log log m)] ≤ 1/log m = O(1/log n),

where we use m = n^{θ/(1−α²)}. Thus, with probability at least 1−o(1), 𝐱_α is associated with a vector that exceeds the threshold.

Lemma 8 (Expected Number of Buckets and Far Points).

For n → ∞ and 0 ≤ β < α < 1 such that (α−β) = Ω(√(log log n / log n)), Top-1 search returns in expectation at most n^{θ+o(1)} buckets, containing in expectation at most n^{1−θ(α−β)²/((1−α²)(1−β²))+o(1)} far points.

Proof.

We observe that by setting m = n^{θ/(1−α²)} we get log log m / log m = O((1−α²)·log log n / log n). Thus, the threshold satisfies η ≥ α√(2 log m)·(1 − O(((1−α²)/α)·√(log log n / log n))), which is positive for α ≥ α−β = Ω(√(log log n / log n)). From Proposition 20, the probability that a filter exceeds the threshold is

Pr[⟨𝐚,𝐪⟩ ≥ η] ≤ Pr_{Z∼𝒩(0,1)}[Z ≥ α√(2 log m)·(1 − ((1−α²)/α)·o(1))] ≤ m^{−α² + (1−α²)·o(1)},   (1)

as the projection onto a Gaussian vector is a standard normal random variable. In expectation, a query inspects at most m^{1−α²+o(1−α²)} buckets. The claim follows by setting m = n^{θ/(1−α²)}. For the analysis of far points, we may write η ≥ β√(2 log m) + (α−β)√(2 log m)·(1 − ((1−α²)/(α−β))·O(√(log log n / log n))). The second factor is positive for (α−β) = Ω(√(log log n / log n)). Thus, by applying Theorem 5 and Proposition 20, the probability to inspect a far point 𝐱_β is

Pr[Q_{𝐱_β,[m]} ≥ η] ≤ Pr_{Z∼𝒩(β√(2 log m), 1−β²)}[Z ≥ β√(2 log m) + (α−β)√(2 log m)·(1 − ((1−α²)/(α−β))·o(1))]
≤ exp[−((α−β)²/(1−β²))·(1 − ((1−α²)/(α−β))·o(1))²·log m]
= m^{−(α−β)²/(1−β²) + ((1−α²)(α−β)/(1−β²))·o(1)},   (2)

By inserting m = n^{θ/(1−α²)} into the previous inequality, we obtain n^{−θ(α−β)²/((1−α²)(1−β²))+o(1)}, as (α−β)/(1−β²) = O(1). Since we have at most n far points, the expected number of inspected far points is at most n^{1−θ(α−β)²/((1−α²)(1−β²))+o(1)}.

The proposed data structure Top-1 (Algorithm 2) is a naive solution for the (α,β)-ANN problem with high space, pre-processing time, and query time.

Theorem 9.

Consider n → ∞. For any 0 ≤ β < α < 1 such that (α−β) = Ω(√(log log n / log n)), 0 < θ ≤ O(1), and for any dataset 𝒮 = {𝐱_i}_{i=1,…,n} in 𝕊^{d−1}, Top-1 solves the (α,β)-ANN problem with probability at least 1−o(1), using pre-processing time O(d·n^{1+θ/(1−α²)}), space O(d·max{n, n^{θ/(1−α²)}}), and expected query time O(d·max{n^{θ/(1−α²)}, n^{1−θ(α−β)²/((1−α²)(1−β²))+o(1)}}).

Proof.

The pre-processing time is O(dnm) = O(d·n^{1+θ/(1−α²)}), as for each of the n points we need to look at m random vectors of dimensionality d. Each point is assigned to only one random vector, so the space needed to store the data structure is O(d(n+m)) = O(d·max{n, n^{θ/(1−α²)}}). The running time is given by summing the running times of search and query. The buckets in search(𝐪) are found in time O(dm), while the expected running time of query(𝐪) is given by the expected number of far points present in these buckets, which is at most n^{1−θ(α−β)²/((1−α²)(1−β²))+o(1)}, and the expected number of buckets returned by search, which is at most n^{θ+o(1)}, due to Lemma 8. Thus, the expected running time is at most O(d·max{n^{θ/(1−α²)}, n^{1−θ(α−β)²/((1−α²)(1−β²))+o(1)}, n^{θ+o(1)}}). The problem is solved with probability at least 1−o(1) due to Lemma 7.

3.1 Balanced and Unbalanced Top-1

The standard way to minimize the expected query time of an algorithm that solves (α,β)-ANN is to balance the number of buckets that have to be inspected with the number of far points ("error") that are associated with those buckets. To balance the contribution of far points and the number of buckets inspected, we choose θ = (1−α²)(1−β²)/(1−αβ)², which solves the equation θ = 1 − θ(α−β)²/((1−α²)(1−β²)). We denote this specific solution as ρ to highlight its connection to standard ANN analysis. However, alternative values of θ can be chosen to achieve different trade-offs. The next corollary follows from Lemma 8.

Corollary 10 (Balanced and Unbalanced Top-1).

Consider n → ∞. For any 0 ≤ β < α < 1 such that (α−β) = Ω(√(log log n / log n)), consider σ = 2(1−α²)(1−β²)/((1−αβ)² + (α−β)²) and ρ = (1−α²)(1−β²)/(1−αβ)². Define balanced and unbalanced Top-1 as the data structures initialized with θ = ρ and θ = σ, respectively. Then,

  1. The expected number of buckets inspected by search(𝐪) is at most n^{ρ+o(1)} for balanced Top-1, and n^{σ+o(1)} for unbalanced Top-1.

  2. The buckets from search(𝐪) contain, in expectation, at most n^{ρ+o(1)} far points for balanced Top-1 and n^{σ/2+o(1)} far points for unbalanced Top-1.

  3. For any 0 ≤ β < α ≤ 1 we have σ/2 < ρ < σ.

Proof.

It follows by a simple computation from Lemma 8. As ρ/(1−α²) ≥ 1, the space and the running time of balanced Top-1 are O(d·n^{ρ/(1−α²)}); the latter considers the worst case, as the expected number of far points is n^{ρ+o(1)} ≤ n^{ρ/(1−α²)}. Space and running time follow directly for unbalanced Top-1 as σ > ρ.

The unusual behavior of unbalanced Top-1 will be further clarified in the next section, where its utility for (α,β)-DP-ANNC will be discussed. Although Top-1 benefits from a clean and straightforward analysis (due to the assumption n → ∞), it involves significant preprocessing, space, and query time requirements. These limitations will be addressed in Section 5.2 through the use of tensorization. Additionally, the assumption n → ∞ will be lifted with a minor modification to the algorithm, as detailed in Section 5.1. Nevertheless, as we will show in the next section, Top-1 still provides a meaningful solution for (α,β)-ANNC under differential privacy constraints.
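As a concrete numerical illustration (hypothetical values of our own choosing) of the two exponents in Corollary 10:

```python
def rho_sigma(alpha, beta):
    A, B = 1 - alpha ** 2, 1 - beta ** 2
    rho = A * B / (1 - alpha * beta) ** 2
    sigma = 2 * A * B / ((1 - alpha * beta) ** 2 + (alpha - beta) ** 2)
    return rho, sigma

rho, sigma = rho_sigma(alpha=0.9, beta=0.5)
print(rho, sigma, sigma / 2)   # ~0.471, ~0.616, ~0.308, so sigma/2 < rho < sigma
```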

4 From ANN to DP-ANNC

In this section, we study the relationship between ANN and DP-ANNC. We expose a general way to solve ANNC starting from a space-partitioning data structure for ANN, and discuss different differentially private mechanisms to privatize the ANNC data structure. We also show how the unbalanced data structure from the previous section can be used to increase the accuracy for (α,β)-DP-ANNC, by paying an increase in query time, pre-processing time, and space usage.

4.1 From ANN to ANNC

Algorithm 3 From ANN to ANNC using a space-partitioning data structure.

Inspired by the list-of-points data structure developed in [5], we define a family of data structures suitable for a general reduction from ANN to ANNC.

Definition 11 (Space Partitioning Data Structure).

Given a set 𝒮 ⊆ 𝕊^{d−1} and an integer m, a space-partitioning data structure for the ANN problem is defined as follows:

  • The data structure is a partition of 𝒮 into m sets ℒ = (L_1,…,L_m) such that ∪_{i∈[m]} L_i ⊆ 𝒮 and L_i ∩ L_j = ∅ for i ≠ j, together with a function 𝒬 that maps 𝕊^{d−1} ∋ 𝐪 ↦ I(𝐪) ⊆ [m]. (Technically we define the data structure using a partition that need not be full, as we allow some points to not be stored.)

  • For a query 𝐪, we obtain the set I(𝐪) = 𝒬(𝐪) and scan all points in L_i, i ∈ I(𝐪). If there exists a point with inner product at least β with 𝐪, we return it. Otherwise, we return ⊥.

The total space is |𝒬| + O(dn), where |𝒬| is the space necessary to store the function 𝒬. The query time is at most T_𝒬(𝐪) + O(d·Σ_{i∈I(𝐪)} |L_i|), where T_𝒬 is the time taken to compute I(𝐪) given the query 𝐪 ∈ 𝕊^{d−1}, and O(d·Σ_{i∈I(𝐪)} |L_i|) is the worst-case time needed to check all the points.

For example, for Algorithm 2, L_i represents the points that achieve their maximum inner product with 𝐚_i; 𝒬 consists of all 𝐚_1,…,𝐚_m (its size is dm) and computes the indices I(𝐪) of all filters that are above the query threshold. For the data structure of Andoni et al. [4] discussed in the introduction, the L_i are the leaves of the tree, and 𝒬 represents the navigation of the tree-based data structure. Algorithm 3 presents a simple transformation for a space-partitioning data structure: it suffices to substitute the actual points with their count in each list. The new (α,β)-ANNC query returns the sum of the elements contained in these lists. Since each point is stored at most once, summing the cardinality of each bucket ensures that no point is counted more than once.
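A minimal sketch of this transformation (our own code; the space-partitioning structure is abstracted as an `assign` callable that maps a point to its list index, or None if the point is not stored):

```python
def build_counting_structure(S, assign):
    """Replace each list L_i of a space-partitioning data structure by its size."""
    counts = {}
    for x in S:
        i = assign(x)
        if i is not None:
            counts[i] = counts.get(i, 0) + 1
    return counts

def annc_query(counts, I_q):
    """Sum the counters of all lists scanned for the query (the indices in I(q))."""
    return sum(counts.get(i, 0) for i in I_q)
```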

Lemma 12 (From ANN to ANNC).

Let 𝒮 = {𝐱_i}_{i=1,…,n} ⊆ 𝕊^{d−1}. Consider a space-partitioning data structure for 𝒮 such that for each 𝐪 ∈ 𝕊^{d−1}: (i) I(𝐪) contains a list with a close point with probability at least 1−o(1), and (ii) the expected number of far points in ∪_{i∈I(𝐪)} L_i is at most 𝒦. Then query in Algorithm 3 returns a value ans^ that, with probability at least 2/3, satisfies the following inequality:

(1−o(1))·|𝒮 ∩ B(𝐪,α)| ≤ ans^ ≤ |𝒮 ∩ B(𝐪,β)| + 𝒦.

The data structure in Algorithm 3 uses space |𝒬| + O(n), and the query time is T_𝒬(𝐪) + |I(𝐪)|.

4.2 From ANNC to DP-ANNC

The data structure returned by Algorithm 3 uses counters, so it is essentially a histogram. This histogram can be privatized using the algorithms presented in Section 2.4. To achieve differential privacy, we need to analyze the sensitivity of the counters T[1..m]. Let 𝒮 and 𝒮′ be two neighboring datasets that differ in exactly one point, and let 𝒟 and 𝒟′ be the data structures constructed for these data sets, respectively. Applying Algorithm 3 to 𝒟 and 𝒟′ will result in two counters T and T′ that differ by at most 1 in at most one position. If 𝒬 is data independent, i.e., 𝒬 does not depend on the actual data set 𝒮, it is sufficient to privatize the counter T, which can be done using any differentially private mechanism make_private for histograms, as shown in the aforementioned DPTop-1 (see Algorithm 1). The next theorem states the guarantees for two specific privacy mechanisms.

Theorem 13 (DP-ANNC with Truncated Laplace or Max Projection).

Let 𝒮 = {𝐱_i}_{i=1,…,n} ⊆ 𝕊^{d−1} and 𝐪 ∈ 𝕊^{d−1}. Consider a space-partitioning data structure for 𝒮 satisfying the assumptions of Lemma 12, with the addition of 𝔼[|I(𝐪)|] ≤ 𝒦 and 𝒬 being data independent. When T is privatized using the Truncated Laplace mechanism, the data structure is (ε,δ)-DP for any ε ≤ 1, requires an additional O(n) term in space and pre-processing time (compared to Lemma 12), and the query algorithm returns a value ans~ that satisfies the following inequality with probability at least 2/3:

(1−o(1))·|𝒮 ∩ B(𝐪,α)| − O((log(1/δ)/ε)·𝒦) ≤ ans~ ≤ |𝒮 ∩ B(𝐪,β)| + O((log(1/δ)/ε)·𝒦).   (3)

When T is privatized with the Max Projection mechanism, the data structure is (ε,0)-DP, requires an additional O(εn) term in space and pre-processing time, and the additive error in Equation 3 becomes O(𝒦/ε).

Due to Lemma 7 and Corollary 10, balanced Top-1 satisfies the requirements of Theorem 13 with 𝒦 = n^{ρ+o(1)}, which proves Theorem 1. Moreover, the tree-based data structure described by Andoni et al. [4] is another data structure that satisfies the requirements of Theorem 13. (Technically, as stated earlier, [4] analyze their data structure for a fixed choice of α.)

4.2.1 Usefulness of Unbalanced Data Structure

We now study how an unbiased and uncorrelated differentially private estimator of T, from the unbalanced Top-1 ANNC data structure, can be used to reduce the error compared to Theorem 13. The construction leverages the concentration of the sum of i.i.d. Laplace random variables.

Theorem 14 (DP-ANNC with Laplace Noise and Unbalanced Data Structure).

Let 𝒮 = {𝐱_i}_{i=1,…,n} ⊆ 𝕊^{d−1} and 𝐪 ∈ 𝕊^{d−1}. Consider a space-partitioning data structure for 𝒮 satisfying the assumptions of Lemma 12, with the addition of 𝔼[|I(𝐪)|] ≤ 𝒦² and 𝒬 being data independent. When T is privatized by using the Laplace mechanism, the data structure is (ε,0)-DP for any ε ≤ 1, and query returns a value ans~ that satisfies the following inequality with probability at least 2/3:

(1−o(1))·|𝒮 ∩ B(𝐪,α)| − O(𝒦/ε) ≤ ans~ ≤ |𝒮 ∩ B(𝐪,β)| + O(𝒦/ε).

The privatized data structure requires an additional O(m) term in space and pre-processing time.

Due to Lemma 7 and Corollary 10, unbalanced Top-1 satisfies the requirements of Theorem 14 with 𝒦 = n^{σ/2+o(1)}. As σ/2 < ρ, unbalanced Top-1 is always more accurate than balanced Top-1 for DP-ANNC. However, the pre-processing time and the space increase.

In the next section we provide several improvements for Top-1, aiming to get rid of the asymptotic assumption n → ∞ used to apply Theorem 5 for concomitant statistics, and to reduce the pre-processing time, the query time, and the space. These improvements regard only the balanced ANN data structure; the additional requirement of O(m) space and pre-processing time for DP-ANNC with unbalanced data structures will still be present and is an interesting open question for future work. Finally, we highlight that for errors of the form n^{C+o(1)} we may increase the range of the privacy budget to ε ≤ n^{o(1)} in Theorems 13 and 14, by the same argument provided by Andoni et al. [4].

5 Improving the Top-1 Data Structure

In this section we propose two improvements of Top-1. With CloseTop-1 we get rid of the assumption n → ∞ (we observe that, although the time and space complexities are expressed in big-O notation, i.e., with a notation asymptotic in n, the correctness of this algorithm does not require assuming a limiting distribution for the concomitants, while this was the case for Top-1), while with TensorCloseTop-1 we reduce the pre-processing time to d·n^{1+o(1)}, the space to O(dn), and obtain an expected query time of d·n^{θ+o(1)}. In addition, we discuss how TensorCloseTop-1 can solve the (c,r)-ANN in the Euclidean space in Appendix C. These improvements do not alter the core of the data structure (a hash table of points with a search function), allowing them to be utilized for DP-ANNC as discussed in the previous section.

5.1 CloseTop-1

Algorithm 4 CloseTop-1 Data Structure.

We now study CloseTop-1 (see Algorithm 4), a practical implementation of the previous asymptotic data structure. In Top-1 we associate with each point of the dataset the random vector with the highest inner product. This is an intuitive choice that leads to a simple and clear analysis via the Gaussian tails of concomitant statistics (Theorem 5). However, this is an asymptotic theorem, which results from the fact that X_𝐱 = max_{𝐚∈𝒜_m} ⟨𝐚,𝐱⟩ = √(2 log m) − o(1) and Var[X_𝐱] = o(1) [18]. In fact, it can be obtained from Lemma 4 by setting X_𝐱 = √(2 log m). The intuition of CloseTop-1 is to enforce a lower and an upper bound on X_𝐱 by construction, by associating with each point of the dataset a random vector with an inner product close to the expected maximum; hence the name of the data structure. In the proposed construction, we sample m random vectors 𝐚 ∼ 𝒩(0,1)^d, and we associate with any 𝐱 the first random vector such that √(2 log m) − (3/2)·(log log m)/√(2 log m) ≤ ⟨𝐚,𝐱⟩ ≤ √(2 log m). If at least one random vector succeeds in the association, then we say that 𝐱 collided. The key property of CloseTop-1 is that a point collides with high probability (Lemma 23), which allows us to state the following lemma.
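A sketch of the CloseTop-1 assignment rule (illustrative code of our own; the acceptance window below follows our reading of the construction just described):

```python
import numpy as np

def closetop1_assign(x, A):
    """Index of the first random vector in A whose projection of x falls in the window
    [sqrt(2 log m) - 1.5 * log log m / sqrt(2 log m), sqrt(2 log m)], or None (no collision)."""
    m = A.shape[0]
    upper = np.sqrt(2 * np.log(m))
    lower = upper - 1.5 * np.log(np.log(m)) / upper
    projections = A @ x
    hits = np.flatnonzero((projections >= lower) & (projections <= upper))
    return int(hits[0]) if hits.size else None
```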

Lemma 15.

Lemma 7 and Lemma 8 remain valid for CloseTop-1 under the same assumptions, without relying on the n → ∞ condition.

As Theorem 9 and Corollary 10 are derived from Lemmas 7 and 8, CloseTop-1 gives the same results as Top-1, without relying on the assumption that the concomitants follow a limiting distribution.

5.2 TensorCloseTop-1

In this section, we propose TensorCloseTop-1 (see Algorithm 5) to reduce the pre-processing time to d·n^{1+o(1)}, the space to O(dn), and the expected query time to d·n^{ρ+o(1)} (for the balanced data structure). The data structure uses a technique developed in [9], called tensoring, that essentially allows to simulate an exponential number of vectors by concatenating a polynomial number of data structures. The same expedient was used in [6] to get a pre-processing time of n^{1+ρ+o(1)}. This technique is similar to creating a tree, yet this data structure allows parallel evaluation of the hashes (Lines 2-3 of Algorithm 5, search). Define the concatenation factor t and assume m^{1/t} is an integer; consider t independent CloseTop-1 data structures 𝒟_1,…,𝒟_t, each using m^{1/t} Gaussian vectors 𝐚_{i,j}, where i ∈ [t] indicates the data structure and j ∈ [m^{1/t}] indicates the vector. For each point 𝐱 ∈ 𝒮, consider the t colliding vectors, one in each data structure, (𝐚_{1,i_1},…,𝐚_{t,i_t}), and map the point to the bucket (i_1,…,i_t) ∈ [m^{1/t}]^t using a hash table. Given a query 𝐪 ∈ 𝕊^{d−1}, for each data structure the indices B~_i of the random vectors with ⟨𝐚_{i,j},𝐪⟩ ≥ η are selected, hence B~_i := {j ∈ [m^{1/t}] : ⟨𝐚_{i,j},𝐪⟩ ≥ η}, and a search over all the buckets in B~_1 × ⋯ × B~_t is performed. The number of random vectors to sample is t·m^{1/t}, which is sub-linear in n provided the data structure is not designed to search for points with exceedingly high inner product similarity.
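A sketch of the tensoring step (our own code; `assign` stands for a per-structure assignment rule such as the closetop1_assign sketch in Section 5.1):

```python
import numpy as np
from collections import defaultdict
from itertools import product

def build_tensor_table(S, A_list, assign):
    """A_list holds the t independent sets of m^(1/t) Gaussian vectors.
    Each point is mapped to the tuple of its colliding indices, one per structure."""
    table = defaultdict(list)
    for idx, x in enumerate(S):
        key = tuple(assign(x, A) for A in A_list)
        if None not in key:              # only points that collide in every structure are stored
            table[key].append(idx)
    return table

def tensor_search(q, A_list, eta):
    """Cartesian product of the per-structure index sets whose filters the query passes."""
    B = [np.flatnonzero(A @ q >= eta) for A in A_list]
    return list(product(*B))
```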

Algorithm 5 TensorCloseTop-1 Data Structure.
Proposition 16 (Tensorization).

For any constant C > 0, assume 1/(1−α²) ≤ (log n)^C. Then for t = log^{1/8} n/(1−α²), m = n^{θ/(1−α²)}, and θ = O(1), we have t·m^{1/t} = n^{o(1)}.

Proof.

Just a simple computation: m^{1/t} = n^{θ/log^{1/8} n} = n^{o(1)} as θ = O(1), while t = log^{1/8} n/(1−α²) ≤ (log n)^{O(1)} = n^{o(1)}; then t·m^{1/t} = n^{o(1)}. In practice t, m^{1/t}, and m must all be integers. However, this does not affect the asymptotic behavior, since t·⌈m^{1/t}⌉ ≤ (t+1)·m^{1/t} = n^{o(1)}.

With this trick, we reduce the query time due to search (Lines 2-4 of Algorithm 5, procedure search) to n^{o(1)}. Therefore, the query time is mainly affected by how many times the data structure needs to access the hash table, which can be bounded in expectation. Under similar assumptions on α we can prove that TensorCloseTop-1 finds a close point with high probability.

Lemma 17 (Probability to Find a Close Point).

For any 0 ≤ β < α < 1 such that 1−α² = ω(log^{−3/4} n), TensorCloseTop-1 finds a close point, if one exists, with probability at least 1−o(1).

Lemma 18 (Expected Number of Buckets and Far Points).

For any 0 ≤ β < α < 1 such that (1−α²) = ω(log^{−3/4} n) and (α−β) = Ω(√(log log n / log^{7/8} n)), TensorCloseTop-1 search finds in expectation at most n^{θ+o(1)} buckets, containing at most n^{1−θ(α−β)²/((1−α²)(1−β²))+o(1)} far points.

The previous lemma states that θ plays the same role as in CloseTop-1, so it can be used to construct balanced and unbalanced TensorCloseTop-1. We now argue about the query, space, and pre-processing time.

Theorem 19.

Let 0 ≤ β < α < 1 be such that (1−α²) = ω(log^{−3/4} n), (α−β) = Ω(√(log log n / log^{7/8} n)), and 0 < θ ≤ O(1). For any dataset 𝒮 = {𝐱_i}_{i=1,…,n} in 𝕊^{d−1}, TensorCloseTop-1 solves the (α,β)-ANN problem with probability at least 1−o(1), using space O(dn), preprocessing time d·n^{1+o(1)}, and expected query time d·max{n^{θ+o(1)}, n^{1−θ(α−β)²/((1−α²)(1−β²))+o(1)}}.

Proof.

Due to Proposition 16, the data structure needs to store t·m^{1/t} = n^{o(1)} random vectors. As each point 𝐱 is stored in at most one bucket, the space is O(d(n + n^{o(1)})) = O(dn). As for each point it is necessary to compute m^{1/t} inner products at most t times, the pre-processing time is O(t·d·n·m^{1/t}) = O(d·n^{1+o(1)}). The buckets in search(𝐪) can be computed in time O(d·n^{o(1)}), so the expected query time is at most d·max{n^{θ+o(1)}, n^{1−θ(α−β)²/((1−α²)(1−β²))+o(1)}} due to Lemma 18. The problem is solved with probability at least 1−o(1) due to Lemma 17.

As Corollary 10 applies to TensorCloseTop-1 due to Lemma 18, the parameters ρ and σ respectively characterize the balanced and unbalanced versions of TensorCloseTop-1. The balanced version has an expected query time of O(d·n^{ρ+o(1)}) and, when combined with Theorem 13, achieves an additive error of O(n^{ρ+o(1)}/ε) for differentially private approximate near neighbor counting (DP-ANNC). For ε = O(1), the additional space and preprocessing requirements are O(n). In contrast, the unbalanced version of TensorCloseTop-1 has an expected query time of O(d·n^{σ+o(1)}) and, when used with Theorem 14, yields an asymptotically smaller additive error O(n^{σ/2+o(1)}/ε) for DP-ANNC. However, this approach incurs a significant drawback: the Laplace noise introduces an additional space and preprocessing overhead of O(m) = O(m~^t) = O(n^{σ/(1−α²)}). This overhead becomes the dominant cost as σ/(1−α²) ≥ 1, especially in the Euclidean ANN problem, where (1−α²)^{−1} = polylog(n). For further details on how this data structure is applied in Euclidean space, see Appendix C.

6 Conclusion and Open Problems

This paper introduced and analyzed simple linear space data structures that solve the (α,β)-ANN problem and can be transformed into efficient solutions for its counting variant under differential privacy. This provides an alternative to the data structure proposed recently by Andoni et al. [4], with a simpler construction and analysis. We provided general black-box transformations from approximate near neighbor problems to their counting variants under privacy constraints and showed that interesting error/time trade-offs are possible via unbalanced ANN data structures. The most intriguing open question was already posed by Andoni et al. [4]: Can one obtain better accuracy guarantees for range counting than by transforming near neighbor data structures, which have well-understood lower bounds [5]? For example, [1] describes a sampling-based range counting algorithm that could be a good starting point for further investigation. For the presented data structures, one should further investigate the relation between the noise error due to differential privacy and the error due to including "far points", which could give interesting trade-offs. We initiated such a study through unbalanced ANN data structures; the main obstacle for a space-efficient solution is to store "small counts" in a data structure that uses space O(f(ε)·n^{1+o(1)}) and provides unbiased counters such that the expected error of the sum of 𝒦 counters is only a factor O(√𝒦) larger than the expected per-point error. Finally, while we believe that our algorithms are simple and straightforward, an experimental comparison between the different solutions presented here and in the literature seems necessary, not only for approximate range counting, but also for filtering-based approximate near neighbor search. In fact, only the work of Pham et al. [27] provided evidence of the practical impact of filtering-based near neighbor search, and they achieve their result by a combination of LSH and LSF.

References

  • [1] Thomas D. Ahle, Martin Aumüller, and Rasmus Pagh. Parameter-free locality sensitive hashing for spherical range reporting. In SODA, pages 239–256. SIAM, 2017. doi:10.1137/1.9781611974782.16.
  • [2] Josh Alman and Ryan Williams. Probabilistic polynomials and hamming nearest neighbors. In Venkatesan Guruswami, editor, IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, Berkeley, CA, USA, 17-20 October, 2015, pages 136–150. IEEE Computer Society, 2015. doi:10.1109/FOCS.2015.18.
  • [3] Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117–122, 2008. doi:10.1145/1327452.1327494.
  • [4] Alexandr Andoni, Piotr Indyk, Sepideh Mahabadi, and Shyam Narayanan. Differentially private approximate near neighbor counting in high dimensions. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA, 2023. Curran Associates Inc.
  • [5] Alexandr Andoni, Thijs Laarhoven, Ilya Razenshteyn, and Erik Waingarten. Optimal hashing-based time-space trade-offs for approximate near neighbors. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’17, pages 47–66, USA, 2017. Society for Industrial and Applied Mathematics. doi:10.1137/1.9781611974782.4.
  • [6] Martin Aumüller, Sariel Har-Peled, Sepideh Mahabadi, Rasmus Pagh, and Francesco Silvestri. Sampling a near neighbor in high dimensions — who is the fairest of them all? ACM Trans. Database Syst., 47(1), April 2022. doi:10.1145/3502867.
  • [7] Martin Aumüller, Christian Janos Lebeda, and Rasmus Pagh. Representing sparse vectors with differential privacy, low error, optimal space, and fast access. Journal of Privacy and Confidentiality, 12(2), November 2022. doi:10.29012/jpc.809.
  • [8] Roberto J. Bayardo, Yiming Ma, and Ramakrishnan Srikant. Scaling up all pairs similarity search. In Proceedings of the 16th International Conference on World Wide Web, WWW ’07, pages 131–140, New York, NY, USA, 2007. Association for Computing Machinery. doi:10.1145/1242572.1242591.
  • [9] Tobias Christiani. A framework for similarity search with space-time tradeoffs using locality-sensitive filtering. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’17, pages 31–46, USA, 2017. Society for Industrial and Applied Mathematics. doi:10.1137/1.9781611974782.3.
  • [10] H. A. David and J. Galambos. The asymptotic theory of concomitants of order statistics. Journal of Applied Probability, 11(4):762–770, 1974. URL: http://www.jstor.org/stable/3212559.
  • [11] Devdatt Dubhashi and Alessandro Panconesi. Concentration of Measure for the Analysis of Randomized Algorithms. Cambridge University Press, USA, 1st edition, 2009.
  • [12] Cynthia Dwork. Differential privacy and the us census. In Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS ’19, page 1, New York, NY, USA, 2019. Association for Computing Machinery. doi:10.1145/3294052.3322188.
  • [13] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Proceedings of the Third Conference on Theory of Cryptography, TCC’06, pages 265–284, Berlin, Heidelberg, 2006. Springer-Verlag. doi:10.1007/11681878_14.
  • [14] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9(3–4):211–407, August 2014. doi:10.1561/0400000042.
  • [15] Kave Eshghi and Shyamsundar Rajaram. Locality sensitive hash functions based on concomitant rank order statistics. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’08, pages 221–229, New York, NY, USA, 2008. Association for Computing Machinery. doi:10.1145/1401890.1401921.
  • [16] Thomas R Fleming and David P Harrington. Counting processes and survival analysis. John Wiley & Sons, 2013.
  • [17] Quan Geng, Wei Ding, Ruiqi Guo, and Sanjiv Kumar. Tight analysis of privacy and utility tradeoff in approximate differential privacy. In Silvia Chiappa and Roberto Calandra, editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 89–99. PMLR, 26–28 August 2020. URL: https://proceedings.mlr.press/v108/geng20a.html.
  • [18] Peter Hall. On the rate of convergence of normal extremes. Journal of Applied Probability, 16(2):433–439, 1979. URL: http://www.jstor.org/stable/3212912.
  • [19] Moritz Hardt and Guy N. Rothblum. A multiplicative weights mechanism for privacy-preserving data analysis. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 61–70, 2010. doi:10.1109/FOCS.2010.85.
  • [20] Moritz Hardt and Kunal Talwar. On the geometry of differential privacy. In Proceedings of the Forty-Second ACM Symposium on Theory of Computing, STOC ’10, pages 705–714, New York, NY, USA, 2010. Association for Computing Machinery. doi:10.1145/1806689.1806786.
  • [21] Ziyue Huang and Ke Yi. Approximate Range Counting Under Differential Privacy. In Kevin Buchin and Éric Colin de Verdière, editors, 37th International Symposium on Computational Geometry (SoCG 2021), volume 189 of Leibniz International Proceedings in Informatics (LIPIcs), pages 45:1–45:14, Dagstuhl, Germany, 2021. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi:10.4230/LIPIcs.SoCG.2021.45.
  • [22] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC ’98, pages 604–613, New York, NY, USA, 1998. Association for Computing Machinery. doi:10.1145/276698.276876.
  • [23] Michael Kapralov. Smooth tradeoffs between insert and query complexity in nearest neighbor search. In PODS, pages 329–342. ACM, 2015. doi:10.1145/2745754.2745761.
  • [24] S. Muthukrishnan and Aleksandar Nikolov. Optimal private halfspace counting via discrepancy. In Proceedings of the Forty-Fourth Annual ACM Symposium on Theory of Computing, STOC ’12, pages 1285–1292, New York, NY, USA, 2012. Association for Computing Machinery. doi:10.1145/2213977.2214090.
  • [25] Rina Panigrahy. Entropy based nearest neighbor search in high dimensions. In SODA, pages 1186–1195. ACM Press, 2006. URL: http://dl.acm.org/citation.cfm?id=1109557.1109688.
  • [26] Ninh Pham. Simple yet efficient algorithms for maximum inner product search via extreme order statistics. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, KDD ’21, pages 1339–1347, New York, NY, USA, 2021. Association for Computing Machinery. doi:10.1145/3447548.3467345.
  • [27] Ninh Pham and Tao Liu. Falconn++: a locality-sensitive filtering approach for approximate nearest neighbor search. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA, 2022. Curran Associates Inc.
  • [28] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR, 2021. URL: http://proceedings.mlr.press/v139/radford21a.html.
  • [29] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web, WWW ’01, pages 285–295, New York, NY, USA, 2001. Association for Computing Machinery. doi:10.1145/371920.372071.
  • [30] Gregory Shakhnarovich, Trevor Darrell, and Piotr Indyk. Nearest-Neighbor Methods in Learning and Vision: Theory and Practice (Neural Information Processing). The MIT Press, 2006.
  • [31] Stanislaw J. Szarek and Elisabeth Werner. A nonsymmetric correlation inequality for Gaussian measure. Journal of Multivariate Analysis, 68(2):193–211, February 1999. doi:10.1006/jmva.1998.1784.
  • [32] Salil Vadhan. The Complexity of Differential Privacy, pages 347–450. Springer, Yehuda Lindell, ed., 2017. doi:10.1007/978-3-319-57048-8_7.
  • [33] Jingdong Wang, Heng Tao Shen, Jingkuan Song, and Jianqiu Ji. Hashing for similarity search: A survey. CoRR, abs/1408.2927, 2014. arXiv:1408.2927.
  • [34] Ryan Williams. A new algorithm for optimal 2-constraint satisfaction and its implications. Theoretical Computer Science, 348(2):357–365, 2005. Automata, Languages and Programming: Algorithms and Complexity (ICALP-A 2004). doi:10.1016/j.tcs.2005.09.023.

Appendix A Useful inequalities and Additional Definitions

A.1 Differentially Private Mechanisms

In this work we considered three differentially private mechanisms:

  • The Truncated Laplace mechanism [17], used also by Andoni et al. [4], which obfuscates each positive entry of f(𝒮) by adding truncated Laplace noise. The mechanism is (ε,δ)-DP and produces a biased estimator with expected absolute error O(log(1/δ)/ε). In our context, it has the advantage that it only needs to sample and store noise for the non-zero entries, i.e., at most n random variables.

  • The Laplace mechanism [13], which adds independent Laplace noise to each entry of f(𝒮). The mechanism is (ε,0)-DP and produces an unbiased estimator with uncorrelated entries and expected absolute error $O(1/\varepsilon)$ per entry. The estimator behaves well for range queries (i.e., $\sum_{i\in B}f_i(\mathcal{S})$ for some $B\subseteq[|\mathcal{X}|]$), obtaining an expected absolute error $O(\sqrt{|B|}/\varepsilon)$. However, it requires sampling and storing $|\mathcal{X}|$ random variables.

  • The Max Projection mechanism [7], which stores all “small counts” in a sketching-based data structure. The mechanism is (ε,0)-DP and produces a data structure with $O(1)$ access time returning a biased estimator with expected absolute error $O(1/\varepsilon)$. The additional space it needs is $O(\varepsilon n)$, making it a valid pure-DP alternative to the Truncated Laplace mechanism.
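The following snippet is a minimal, non-rigorous sketch of the first two mechanisms applied to a sparse histogram, assuming a dictionary `counts` mapping non-zero indices to values; the exact truncation radius and shift needed for (ε,δ)-DP in the Truncated Laplace mechanism follow [17] and are only indicated up to constants here. The Max Projection mechanism is omitted, as its sketching-based construction is considerably more involved.

```python
import numpy as np

def laplace_counter(counts, eps, domain_size, rng=None):
    """Laplace mechanism (sketch): add Lap(1/eps) noise to every entry of
    the histogram, including the zero ones, so domain_size samples are
    needed; the result is an unbiased estimator of each count."""
    rng = rng or np.random.default_rng()
    dense = np.zeros(domain_size)
    for idx, c in counts.items():
        dense[idx] = c
    return dense + rng.laplace(scale=1.0 / eps, size=domain_size)

def truncated_laplace_counter(counts, eps, delta, rng=None):
    """Truncated Laplace mechanism (sketch): noise is Laplace conditioned on
    the interval [-B, B] with B = O(log(1/delta)/eps); only the non-zero
    entries are touched, so at most n samples are drawn."""
    rng = rng or np.random.default_rng()
    B = np.log(1.0 / delta) / eps  # truncation radius, up to constants
    noisy = {}
    for idx, c in counts.items():
        z = rng.laplace(scale=1.0 / eps)
        while abs(z) > B:  # rejection sampling of the truncated distribution
            z = rng.laplace(scale=1.0 / eps)
        noisy[idx] = c + z
    return noisy
```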

A.2 Tail Bounds

We will make use of the following Gaussian tail bounds.

Proposition 20 (Gaussian Tail Bounds [11]).

Let $Z\sim\mathcal{N}(\mu,\sigma^2)$. Then, for any $t\ge0$, we have that $\Pr[|Z-\mu|\ge t]\le e^{-\frac{t^2}{2\sigma^2}}$.

Proposition 21 (Proposition 3, [31]).

Let Z be a standard normal random variable. Then, for any t>1, we have that

$$\frac{2}{\sqrt{2\pi}}\cdot\frac{e^{-\frac{t^2}{2}}}{t+\sqrt{t^2+4}}\;\le\;\Pr[Z\ge t]\;\le\;\frac{4}{\sqrt{2\pi}}\cdot\frac{e^{-\frac{t^2}{2}}}{3t+\sqrt{t^2+8}}.$$

From the previous proposition, it is often more convenient to use the following looser bounds:

Proposition 22.

Let $Z$ be a standard normal random variable. Then, for any $t>1$, we have that
$$\frac{2}{\sqrt{2\pi}(1+\sqrt{5})\,t}\,e^{-\frac{t^2}{2}}\;\le\;\Pr[Z\ge t]\;\le\;\frac{4}{\sqrt{2\pi}\,3t}\,e^{-\frac{t^2}{2}}.$$

Proof.

The bounds follow from Proposition 21. The upper bound is immediate since $3t+\sqrt{t^2+8}\ge3t$, while the lower bound follows by noticing that $t+\sqrt{t^2+4}\le(1+\sqrt{5})t$, as $\sqrt{t^2+4}\le t\sqrt{5}$ for $t>1$.
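As a quick sanity check (not part of the original argument), both propositions can be verified numerically against the exact Gaussian tail; the constants below are exactly those of Propositions 21 and 22.

```python
import numpy as np
from scipy.stats import norm

# Numerical check of Propositions 21 and 22 against the exact tail Pr[Z >= t].
for t in [1.5, 2.0, 3.0, 5.0]:
    tail = norm.sf(t)
    lo21 = 2 / (np.sqrt(2 * np.pi) * (t + np.sqrt(t**2 + 4))) * np.exp(-t**2 / 2)
    hi21 = 4 / (np.sqrt(2 * np.pi) * (3 * t + np.sqrt(t**2 + 8))) * np.exp(-t**2 / 2)
    lo22 = 2 / (np.sqrt(2 * np.pi) * (1 + np.sqrt(5)) * t) * np.exp(-t**2 / 2)
    hi22 = 4 / (np.sqrt(2 * np.pi) * 3 * t) * np.exp(-t**2 / 2)
    assert lo22 <= lo21 <= tail <= hi21 <= hi22
    print(t, lo22, tail, hi22)
```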

Appendix B Omitted Proofs

B.1 Omitted Proofs in Section 4

Proof of Lemma 12.

We start with the lower bound. Let $X$ be the random variable indicating the number of close points in $\mathcal{S}$ not included in $\widehat{\text{ans}}$. By requirement (i), the probability of not finding, and hence not counting, a close point is at most $o(1)$; therefore $\mathbb{E}[X]\le|\mathcal{S}\cap B(\mathbf{q},\alpha)|\cdot o(1)$. Using Markov's inequality, we have that $X\le|\mathcal{S}\cap B(\mathbf{q},\alpha)|\cdot o(1)$ with constant probability. Consider now the number of close points counted, $\widehat{\text{ans}}_{\text{close}}$; clearly $\widehat{\text{ans}}\ge\widehat{\text{ans}}_{\text{close}}$ and $\widehat{\text{ans}}_{\text{close}}=|\mathcal{S}\cap B(\mathbf{q},\alpha)|-X$. Therefore, with constant probability we have $\widehat{\text{ans}}_{\text{close}}\ge|\mathcal{S}\cap B(\mathbf{q},\alpha)|-|\mathcal{S}\cap B(\mathbf{q},\alpha)|\cdot o(1)=|\mathcal{S}\cap B(\mathbf{q},\alpha)|(1-o(1))$, which concludes the proof of the lower bound.

We proceed with the upper bound. Let $Y$ be the random variable indicating the number of far points in $\mathcal{S}$ included in $\widehat{\text{ans}}$; then $\widehat{\text{ans}}\le|\mathcal{S}\cap B(\mathbf{q},\beta)|+Y$. By requirement (ii), we have that $\mathbb{E}[Y]\le\mathcal{K}$. Thus, by Markov's inequality, $\widehat{\text{ans}}\le|\mathcal{S}\cap B(\mathbf{q},\beta)|+O(\mathcal{K})$ with constant probability. Combining these two bounds, we arrive at the desired result.

As the algorithm substitutes each $d$-dimensional point with a single number, the space needed to store these numbers reduces to $O(n)$. The query does not search for an ANN, but sums all the numbers stored in the counters at the indices $I(\mathbf{q})$, so the running time is $T_{\mathcal{Q}}(\mathbf{q})+|I(\mathbf{q})|$.
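The following sketch illustrates this counting variant in Python; the bucketing function `bucket_ids` is a hypothetical stand-in for the data-independent family $\mathcal{Q}$ (a simple projection-threshold filter), not the paper's actual construction, and the assignment rule (here: every passing filter) is likewise only illustrative.

```python
import numpy as np
from collections import defaultdict

def bucket_ids(vec, projections, threshold):
    """Indices of the random projections whose inner product with `vec`
    exceeds `threshold` -- a toy, data-independent LSF-style filter."""
    return np.flatnonzero(projections @ vec >= threshold)

def build_counters(points, projections, threshold):
    """Replace every stored point by +1 in the counters of its buckets."""
    T = defaultdict(int)
    for x in points:
        for i in bucket_ids(x, projections, threshold):
            T[i] += 1
    return T

def count_query(q, T, projections, threshold):
    """Sum the counters over I(q); no stored point is inspected directly."""
    return sum(T[i] for i in bucket_ids(q, projections, threshold))
```

In the differentially private versions analyzed below, it is the counters $T[i]$ that are perturbed, so the privacy analysis hinges on how many counters a single point can affect.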

Proof of Theorem 13.

As $\mathcal{Q}$ is data independent, on neighboring datasets the data structures differ only in the counters. We start by considering the truncated Laplace noise. Let $T$ be the counter from Algorithm 3 and $\tilde{T}$ be its differentially private version. The error due to differential privacy in the counts is $|\tilde{T}[i]-T[i]|\le O\!\left(\frac{\log(1/\delta)}{\varepsilon}\right)$, as the truncated Laplace mechanism adds bounded noise sampled from $\left[-C\frac{\log(1/\delta)}{\varepsilon},\,C\frac{\log(1/\delta)}{\varepsilon}\right]$ for some $C>0$. Therefore, the expected error between $\widetilde{\text{ans}}$ and $\widehat{\text{ans}}$ is at most

$$\mathbb{E}\big[|\widetilde{\text{ans}}-\widehat{\text{ans}}|\big]=\mathbb{E}\Big[\Big|\sum_{i\in I(\mathbf{q})}\big(\tilde{T}[i]-T[i]\big)\Big|\Big]\le O\!\left(\frac{\log(1/\delta)}{\varepsilon}\,\mathbb{E}[|I(\mathbf{q})|]\right)\le O\!\left(\frac{\log(1/\delta)}{\varepsilon}\,\mathcal{K}\right).$$

Thus, by Markov's inequality, we have $|\widetilde{\text{ans}}-\widehat{\text{ans}}|\le O\!\left(\frac{\log(1/\delta)}{\varepsilon}\mathcal{K}\right)$ with constant probability. The claim follows by Lemma 12 and $\varepsilon\le1$. As the Truncated Laplace mechanism only needs to sample at most $n$ random variables, the additional factor in space and pre-processing time is $O(n)$.

Max Projection returns an $(\varepsilon,0)$-DP counter $\tilde{T}$ with constant access time, using space and pre-processing time $O(\varepsilon n)$, and with error $\mathbb{E}[|T[i]-\tilde{T}[i]|]\le O(1/\varepsilon)$ (Corollary 8.3 in [7]). The analysis then follows identically.

Proof of Theorem 14.

Let $T$ be the counter from Algorithm 3, $\tilde{T}$ its differentially private version, and $I(\mathbf{q})$ the set of indices of the buckets the algorithm needs to inspect; then $\widehat{\text{ans}}=\sum_{i\in I(\mathbf{q})}T[i]$ and $\widetilde{\text{ans}}=\sum_{i\in I(\mathbf{q})}\tilde{T}[i]$. The application of Laplace noise leads to an unbiased estimator $\tilde{T}[i]$ with uncorrelated entries, so the variance of the error is

$$\mathrm{Var}\big[\widetilde{\text{ans}}-\widehat{\text{ans}}\big]=\mathbb{E}\Big[\Big(\sum_{i\in I(\mathbf{q})}\big(\tilde{T}[i]-T[i]\big)\Big)^{2}\Big]=\mathbb{E}[|I(\mathbf{q})|]\cdot\mathrm{Var}[\mathrm{Lap}(1/\varepsilon)]\le O\!\left(\frac{\mathcal{K}^{2}}{\varepsilon^{2}}\right),$$

as $\tilde{T}[i]=T[i]+Z$ where $Z\sim\mathrm{Lap}(1/\varepsilon)$ and each noise term is sampled independently. Therefore, by Jensen's inequality, $\mathbb{E}\big[|\widetilde{\text{ans}}-\widehat{\text{ans}}|\big]\le\sqrt{\mathrm{Var}\big[\widetilde{\text{ans}}-\widehat{\text{ans}}\big]}\le O\!\left(\frac{\mathcal{K}}{\varepsilon}\right)$, and then by Markov's inequality $|\widetilde{\text{ans}}-\widehat{\text{ans}}|\le O\!\left(\frac{\mathcal{K}}{\varepsilon}\right)$ holds with constant probability. The claim follows by Lemma 12 and $\varepsilon\le1$. The additional $O(m)$ space and pre-processing time is needed to store and sample $m$ i.i.d. Laplace random variables, one for each element of the partition $(L_1,\dots,L_m)$.
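The quantity bounded above is easy to simulate. The toy snippet below (ours, not the paper's data structure) sums $K$ independent $\mathrm{Lap}(1/\varepsilon)$ noise terms, one per inspected bucket, and shows an empirical absolute error of order $\sqrt{K}/\varepsilon$, comfortably within the $O(\mathcal{K}/\varepsilon)$ bound used in the proof; for comparison, a worst-case per-bucket bound of $C\log(1/\delta)/\varepsilon$ as in Theorem 13 accumulates linearly in $K$.

```python
import numpy as np

rng = np.random.default_rng(0)
eps, delta, K, trials = 0.5, 1e-6, 400, 10_000

# Error of summing K independent Lap(1/eps) noise terms (Laplace mechanism).
lap_err = np.abs(rng.laplace(scale=1.0 / eps, size=(trials, K)).sum(axis=1))
print("Laplace: empirical mean |error| =", lap_err.mean())
print("Laplace: sqrt(2K)/eps           =", np.sqrt(2 * K) / eps)

# Worst-case accumulation of per-bucket bounded noise (truncated Laplace).
print("Truncated Laplace worst case: K*log(1/delta)/eps =",
      K * np.log(1 / delta) / eps)
```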

B.2 Omitted Proofs in Section 5

Lemma 23.

The probability that a point $\mathbf{x}\in\mathbb{S}^{d-1}$ collides during the CloseTop-1 construction is at least $1-\frac{1}{m^{\Omega(1)}}$.

For the proof of Lemma 23 we first need the following technical lemma.

Lemma 24.

Let $m$ be any integer greater than 4. Define $a=\sqrt{2\log m}-\frac{3}{2}\frac{\log\log m}{\sqrt{2\log m}}$ and $b=\sqrt{2\log m}$; then, for $Z\sim\mathcal{N}(0,1)$, we have $\Pr[Z\in(a,b)]\ge\frac{2}{3\sqrt{4\pi}}\cdot\frac{\log m}{m}$.

Proof.

Using Proposition 22 we may bound $\Pr[Z\ge b]\le\frac{4}{3\sqrt{4\pi}}\cdot\frac{1}{m\sqrt{\log m}}$. For the left side of the interval, we first need to check that $a\ge1$. We have that $\sqrt{2\log m}-\frac{3}{2}\frac{\log\log m}{\sqrt{2\log m}}\ge1$ whenever $3\le\frac{4\log m-2\sqrt{2\log m}}{\log\log m}$, and $\frac{4\log m-2\sqrt{2\log m}}{\log\log m}>5$ for any $m\ge5$. Thus, by applying Proposition 22, we get

$$\Pr[Z\ge a]\ \ge\ \frac{2}{\sqrt{2\pi}(1+\sqrt{5})}\cdot\frac{1}{\sqrt{2\log m}-\frac{3}{2}\frac{\log\log m}{\sqrt{2\log m}}}\cdot\exp\!\left[-\frac{1}{2}\left(\sqrt{2\log m}-\frac{3}{2}\frac{\log\log m}{\sqrt{2\log m}}\right)^{2}\right]$$
$$\ge\ \frac{2}{\sqrt{4\pi}(1+\sqrt{5})}\cdot\frac{1}{m\sqrt{\log m}}\cdot\log^{3/2}m\cdot\exp\!\left[-\frac{9}{16}\frac{(\log\log m)^{2}}{\log m}\right]\ \ge\ \frac{4}{3\sqrt{4\pi}}\cdot\frac{\log m}{m}$$

where the last inequality holds if $\frac{9}{16}\frac{(\log\log m)^{2}}{\log m}\le\log\!\big(2(1+\sqrt{5})/3\big)$. The right-hand side is greater than $1/2$, thus it is sufficient to check that $\frac{(\log\log m)^{2}}{\log m}\le8/9$. For $m\ge5$ the left-hand side is always smaller than $1/3$ (the maximum is attained at $m=11$), thus the inequality is satisfied. Putting these two bounds together, we conclude

$$\Pr[Z\in(a,b)]=\Pr[Z\ge a]-\Pr[Z\ge b]\ \ge\ \frac{4}{3\sqrt{4\pi}}\cdot\frac{\log m}{m}\left(1-\frac{1}{(\log m)^{3/2}}\right)\ \ge\ \frac{2}{3\sqrt{4\pi}}\cdot\frac{\log m}{m}.$$

The last inequality follows from $(\log m)^{3/2}\ge(\log 5)^{3/2}\ge2$.

Proof of Lemma 23.

If the probability that a random vector succeeds in the assignment is $p$, then a point does not collide with probability $(1-p)^{m}$. Then, for $p=\Omega\!\left(\frac{\log m}{m}\right)$ (from Lemma 24), the probability of not colliding is at most $(1-p)^{m}\le e^{-pm}=\frac{1}{m^{\Omega(1)}}$.
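The order of magnitude in Lemma 24 is easy to confirm numerically; the snippet below (an illustration we add here, not part of the proof) reports the ratio $\Pr[Z\in(a,b)]/(\log m/m)$, which stays bounded between constants as $m$ grows, together with the resulting bound $e^{-pm}$ on the probability of not colliding.

```python
import numpy as np
from scipy.stats import norm

# Empirical check that Pr[Z in (a, b)] = Theta(log(m)/m) and its consequence
# for the collision probability of Lemma 23.
for m in [10**3, 10**4, 10**5, 10**6]:
    b = np.sqrt(2 * np.log(m))
    a = b - 1.5 * np.log(np.log(m)) / b
    p = norm.sf(a) - norm.sf(b)          # Pr[Z in (a, b)]
    print(m, p / (np.log(m) / m), np.exp(-p * m))
```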

Proof of Lemma 15.

The probability of not finding a close point $\mathbf{x}_{\alpha}$ is

$$\Pr[Q_{\mathbf{x}_{\alpha}}\le\eta]=\Pr_{Z\sim\mathcal{N}(0,1-\alpha^{2})}\!\big[Z\le\eta-\alpha X_{\mathbf{x}_{\alpha}}\big]\le\Pr_{Z\sim\mathcal{N}(0,1-\alpha^{2})}\!\left[Z\le-\sqrt{2(1-\alpha^{2})\log\log m}\left(1-\frac{3}{4}\sqrt{\frac{\alpha^{2}}{1-\alpha^{2}}\cdot\frac{\log\log m}{\log m}}\right)\right]$$
$$\le\Pr\!\left[Z\le-\sqrt{2(1-\alpha^{2})\log\log m}\left(1-O\!\left(\sqrt{\frac{\log\log n}{\log n}}\right)\right)\right]\le\frac{1}{\log m}\,(\log m)^{O(\sqrt{\log\log n/\log n})}\le O\!\left(\frac{1}{\log m}\right),\qquad(4)$$

where in the first equality we used Lemma 4, in the second inequality we used the fact that $X_{\mathbf{x}}\ge\sqrt{2\log m}-\frac{3}{2}\frac{\log\log m}{\sqrt{2\log m}}$ by construction, in the third inequality $\frac{\log\log m}{\log m}=O\!\left((1-\alpha^{2})\frac{\log\log n}{\log n}\right)$ for $m=n^{\frac{\theta}{1-\alpha^{2}}}$, and lastly $\lim_{n\to\infty}(\log n)^{O(\sqrt{\log\log n/\log n})}=1$. The probability of finding a close point is the probability of the joint event $\{Q_{\mathbf{x}_{\alpha}}\ge\eta\}$ and $\mathbf{x}_{\alpha}$ being stored in the data structure. Thus, by Lemma 23, we have that $\Pr[\text{find }\mathbf{x}_{\alpha}]\ge\big(1-O(\tfrac{1}{\log m})\big)\big(1-\tfrac{1}{m^{\Omega(1)}}\big)=1-o(1)$. This proves Lemma 7 for CloseTop-1. The probability of inspecting a far point $\mathbf{x}_{\beta}$ is

$$\Pr[Q_{\mathbf{x}_{\beta}}\ge\eta]=\Pr_{Z\sim\mathcal{N}(0,1-\beta^{2})}\!\big[Z\ge\eta-\beta X_{\mathbf{x}_{\beta}}\big]\le\Pr_{Z\sim\mathcal{N}(0,1-\beta^{2})}\!\left[Z\ge(\alpha-\beta)\sqrt{2\log m}-\sqrt{2(1-\alpha^{2})\log\log m}\right],\qquad(5)$$

where in the first equality we used Lemma 4, while in the following inequality we used the fact that $X_{\mathbf{x}}\le\sqrt{2\log m}$ by construction. The analysis then follows the same steps as Lemma 8. As the analysis of the expected number of buckets to inspect is unchanged, Lemma 8 holds under the same assumptions.

Proof of Lemma 17.

Let’s consider one data structure 𝒟i, due to Equation 4 we have an upper bound of O(tlogm) to not find a close point in 𝒟i. By applying a union bound over t data structures we have that

$$\Pr\!\left[\bigcup_{i=1}^{t}\{\text{not find }\mathbf{x}_{\alpha}\text{ in }\mathcal{D}_{i}\}\right]\le O\!\left(\frac{t^{2}}{\log m}\right)=O\!\left(\frac{1}{(1-\alpha^{2})\log^{3/4}n}\right)=o(1),$$

where we used $m=n^{\frac{\theta}{1-\alpha^{2}}}$ and $(1-\alpha^{2})=\omega(\log^{-3/4}n)$. We now study the probability of not storing a point. By Lemma 23, the probability of not storing a point in $\mathcal{D}_{i}$ is $\frac{1}{m^{\Omega(1/t)}}$; then, by a union bound, we have

$$\Pr\!\left[\bigcup_{i=1}^{t}\{\mathbf{x}\text{ is not stored in }\mathcal{D}_{i}\}\right]\le t\,m^{-\Omega(1/t)}=o\!\left(\log^{7/8}n\right)\,e^{-\Omega(\log^{7/8}n)}=o(1),$$

as $t=o(\log^{7/8}n)$ for $1-\alpha^{2}=\omega(\log^{-3/4}n)$, and $m^{-\Omega(1/t)}=n^{-\Omega(\log^{-1/8}n)}=e^{-\Omega(\log^{7/8}n)}$ for $m=n^{\frac{\theta}{1-\alpha^{2}}}$. Therefore, a close point is found with probability at least $1-o(1)$.

Proof of Lemma 18.

Consider one CloseTop-1 data structure $\mathcal{D}_{i}$ with $\tilde{m}=n^{\frac{1}{t}\frac{\theta}{1-\alpha^{2}}}=n^{\frac{\theta}{\log^{1/8}n}}$ random vectors. Thus $\frac{\log\log\tilde{m}}{\log\tilde{m}}=O\!\left(\frac{\log\log n}{\log^{7/8}n}\right)$, so the threshold may be written as $\eta\ge\alpha\sqrt{2\log\tilde{m}}\left(1-\frac{\sqrt{1-\alpha^{2}}}{\alpha}O\!\left(\sqrt{\frac{\log\log n}{\log^{7/8}n}}\right)\right)$, which is positive as $\alpha\ge\alpha-\beta=\Omega\!\left(\sqrt{\frac{\log\log n}{\log^{7/8}n}}\right)$. Therefore, by following the same computation as in Lemma 8 (Equation 1), the expected number of buckets to inspect in $\mathcal{D}_{i}$ is at most $\tilde{m}^{\,1-\alpha^{2}+\sqrt{1-\alpha^{2}}\,o(1)}=n^{\frac{1}{t}\left(\theta+\frac{1}{\sqrt{1-\alpha^{2}}}o(1)\right)}$. By assumption, we have $\frac{1}{1-\alpha^{2}}=o(\log^{3/4}n)$, thus $\frac{1}{\sqrt{1-\alpha^{2}}}O\!\left(\sqrt{\frac{\log\log n}{\log^{7/8}n}}\right)=o\!\left(\sqrt{\frac{\log\log n}{\log^{1/8}n}}\right)=o(1)$. By tensorization of the $t$ independent data structures, we conclude that the expected number of buckets is at most $n^{\theta+o(1)}$.

Analogously, starting from the computation in Lemma 15 (Equation 5) and substituting $m$ with $\tilde{m}$, we may lower bound the threshold by $(\alpha-\beta)\sqrt{2\log\tilde{m}}\left(1-\frac{\sqrt{1-\alpha^{2}}}{\alpha-\beta}O\!\left(\sqrt{\frac{\log\log n}{\log^{7/8}n}}\right)\right)$, which is positive for $(\alpha-\beta)=\Omega\!\left(\sqrt{\frac{\log\log n}{\log^{7/8}n}}\right)$. Thus, by following the computation in Lemma 8 (Equation 2), the probability of finding a far point is at most $\tilde{m}^{-\frac{(\alpha-\beta)^{2}}{1-\beta^{2}}+\frac{\sqrt{1-\alpha^{2}}(\alpha-\beta)}{1-\beta^{2}}o(1)}$. The probability of finding a far point in all the $t$ independent data structures is at most

$$\Pr\!\left[\bigcap_{i=1}^{t}\{\mathbf{x}_{\beta}\text{ is found in }\mathcal{D}_{i}\}\right]\le\tilde{m}^{\,t\left(-\frac{(\alpha-\beta)^{2}}{1-\beta^{2}}+\frac{\sqrt{1-\alpha^{2}}(\alpha-\beta)}{1-\beta^{2}}o(1)\right)}=n^{-\theta\frac{(\alpha-\beta)^{2}}{(1-\alpha^{2})(1-\beta^{2})}+\frac{(\alpha-\beta)}{\sqrt{1-\alpha^{2}}(1-\beta^{2})}o(1)}.$$

The last addend is still $o(1)$, as $\frac{1}{\sqrt{1-\alpha^{2}}}O\!\left(\sqrt{\frac{\log\log n}{\log^{7/8}n}}\right)=o(1)$ and $\frac{\alpha-\beta}{1-\beta^{2}}=O(1)$. Thus, as there are at most $n$ far points, the expected number of far points that are inspected is $n^{1-\theta\frac{(\alpha-\beta)^{2}}{(1-\alpha^{2})(1-\beta^{2})}}$.

Appendix C Data Structures for the Euclidean Space

In this section, we prove that the balanced TensorCloseTop-1 solves the $(c,r)$-ANN in the Euclidean space and reproduces the results for $(c,r)$-ANNC in [4]. Due to standard embedding techniques (see Lemma A.1 and Corollary A.1 in [4]), a $(c,r)$-ANN instance in $\mathbb{R}^{d}$ can be mapped into a $\big(c\frac{1-\gamma}{1+\gamma},\,r(1+\gamma)\big)$-ANN instance in $\mathbb{S}^{d'-1}$, in time $O(dd')$, with $d'=O\!\left(\frac{\log n}{\gamma^{2}}\right)$, provided that $(cr)^{2}\le\gamma/2$. Thus, the embedding asymptotically preserves the metric only for small distances $r=o(1)$ in $\mathbb{S}^{d'-1}$, which can be obtained in the original space $\mathbb{R}^{d}$ after an appropriate scaling. The relations between inner product similarity and Euclidean distance for small distances are

$$(1-\alpha^{2})=\Theta(r^{2}),\qquad(1-\beta^{2})=\Theta(r^{2}),\qquad(\alpha-\beta)=\Theta(r^{2}),\qquad(1-\alpha\beta)=\Theta(r^{2}),\qquad(6)$$

and the concatenation factor $t=\frac{\log^{1/8}n}{1-\alpha^{2}}$ is in $\Theta\!\left(\frac{\log^{1/8}n}{r^{2}}\right)$.

Theorem 25.

For any $r>0$, constant $c>1$, and a dataset $\mathcal{S}=\{x_{i}\}_{i=1,\dots,n}$ in $\mathbb{R}^{d}$, there exists a data structure that solves the $(c,r)$-ANN with probability at least $1-o(1)$, using almost linear space $n^{1+o(1)}$, pre-processing time $dn^{1+o(1)}$, and expected query time at most $dn^{o(1)}+n^{\rho+o(1)}$ for $\rho=\frac{4c^{2}}{(c^{2}+1)^{2}}$.

Proof.

To apply TensorCloseTop-1 to the embedded dataset we need to satisfy the assumptions of Theorem 19, which are: (i) $\rho=\frac{(1-\alpha^{2})(1-\beta^{2})}{(1-\alpha\beta)^{2}}=O(1)$, (ii) $\alpha-\beta=\Omega\!\left(\sqrt{\frac{\log\log n}{\log^{7/8}n}}\right)$, and (iii) $1-\alpha^{2}=\omega(\log^{-3/4}n)$. Requirement (i) is satisfied due to the asymptotic relations in Equation 6. More precisely, by substituting $\alpha=1-\frac{r^{2}}{2}$ and $\beta=1-\frac{(cr)^{2}}{2}$ we get

$$\rho=\frac{(1-\alpha^{2})(1-\beta^{2})}{(1-\alpha\beta)^{2}}=\frac{c^{2}(4-r^{2})(4-c^{2}r^{2})}{\left(2+2c^{2}-c^{2}r^{2}\right)^{2}}=\frac{4c^{2}}{(c^{2}+1)^{2}}+O(r^{2}).$$

Requirements (ii) and (iii) are satisfied for any distance $r=\omega(\log^{-3/8}n)$, due to the asymptotic relations in Equation 6. Therefore, for any $C<3/8$, by setting $\gamma=\log^{-2C}n$ and $r=\log^{-C}n$ we can scale the dataset $\mathcal{S}\subset\mathbb{R}^{d}$, apply the standard embedding techniques to get a dataset in $\mathbb{S}^{d'-1}$ with $d'=\log^{O(1)}n$, and invoke TensorCloseTop-1 to solve the $(\alpha,\beta)$-ANN on the sphere by paying an asymptotically small $\gamma$ factor. (Andoni et al. [4] set $r=\Theta(\log^{-1/8}n)$ and $\gamma=\Theta(\log^{-1/8}n)$; our analysis demonstrates more clearly that there is a broader range of possible values.) As the mapping can be computed in $O(dd')=O(d\log^{O(1)}n)=dn^{o(1)}$ time, the pre-processing time is $dn^{1+o(1)}$. The space is $O(d'(n+n^{o(1)}))=n^{1+o(1)}$ and the expected query time is at most $dn^{o(1)}+n^{\rho+o(1)}$, given by the time $dn^{o(1)}$ to embed the query into the hyper-sphere plus the query time $d'n^{\rho+o(1)}=n^{\rho+o(1)}$ of the data structure.
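The limit computed above is easy to verify numerically; the following snippet (an added illustration, with arbitrary example values of $c$ and $r$) evaluates $\rho$ from the exact substitution $\alpha=1-r^{2}/2$, $\beta=1-(cr)^{2}/2$ and compares it with $4c^{2}/(c^{2}+1)^{2}$.

```python
import numpy as np

# rho from the exact alpha, beta of the embedding vs. its r -> 0 limit.
c = 2.0
for r in [0.5, 0.1, 0.01]:
    alpha, beta = 1 - r**2 / 2, 1 - (c * r)**2 / 2
    rho = (1 - alpha**2) * (1 - beta**2) / (1 - alpha * beta)**2
    print(f"r={r}: rho={rho:.6f}  limit={4 * c**2 / (c**2 + 1)**2:.6f}")
```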

The Unbalanced TensorCloseTop-1

Unbalanced TensorCloseTop-1 can be used to solve the Euclidean DP-ANNC problem as well. The proof is the same as that of Theorem 25, with the distinction that

$$\sigma^{2}=\frac{(1-\alpha^{2})(1-\beta^{2})}{(1-\alpha\beta)^{2}+(\alpha-\beta)^{2}}=\frac{2c^{2}}{1+c^{4}}+O(r^{2}).$$

The space and pre-processing time needed is $n^{\frac{2\sigma}{1-\alpha^{2}}}=n^{\frac{2\sigma}{\Theta(r^{2})}}=n^{\mathrm{polylog}(n)}$.