Differentially Private High-Dimensional Approximate Range Counting, Revisited
Abstract
Locality Sensitive Filters are known for offering a quasi-linear space data structure with rigorous guarantees for the Approximate Near Neighbor search (ANN) problem. Building on Locality Sensitive Filters, we derive a simple data structure for the Approximate Near Neighbor Counting (ANNC) problem under differential privacy (DP). Moreover, we provide a simple analysis leveraging a connection with concomitant statistics and extreme value theory. Our approach produces a simple data structure with a tunable parameter that regulates a trade-off between space-time and utility. Through this trade-off, our data structure achieves the same performance as the recent findings of Andoni et al. (NeurIPS 2023) while offering better utility at the cost of higher space and query time. In addition, we provide a more efficient algorithm under pure -DP and elucidate the connection between ANN and differentially private ANNC. As a side result, the paper provides a more compact description and analysis of Locality Sensitive Filters for Fair Near Neighbor Search, improving a previous result in Aumüller et al. (TODS 2022).
Keywords and phrases:
Differential Privacy, Locality Sensitive Filters, Approximate Range Counting, Concomitant Statistics
2012 ACM Subject Classification:
Theory of computation → Sorting and searching; Mathematics of computing → Probabilistic algorithms; Security and privacy
Acknowledgements:
The authors would like to thank Ninh Pham and Rasmus Pagh for useful discussions.
Funding:
This work was supported in part by the MUR PRIN 20174LF3T8 AHeAD project, by MUR PNRR CN00000013 National Center for HPC, Big Data and Quantum Computing, and by Marsden Fund (MFP-UOA2226).
Editors:
Mark Bun
Series and Publisher:
Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
Since the emergence of deep learning-based text and image embeddings, such as CLIP [28], the management of collections of high-dimensional vectors has become a critical challenge. Efficiently handling these collections and supporting complex query operations is essential for various applications, including social networks [8] and recommendation systems [29]. The required query operations in applications are often similarity search primitives, which have been widely studied in the literature [30, 33]. In particular, the -Near Neighbor Search (-NNS) or Count (-NNC) problems are fundamental primitives: given a set of vectors of dimensions and a radius , construct a data structure that, for any query , returns a point in with distance at most from if such a point exists, or counts the number of such points. Unfortunately, these problems suffer from the curse of dimensionality, which refers to the phenomenon that any exact data structure with polynomial size requires a query time exponential in the dimension of the input space. This is supported by popular algorithmic hardness conjectures [2, 34]. To address this issue, approximate approaches have been proposed: for a given approximation factor , we consider the -Approximate Near Neighbor Search -ANNS problem and the -Approximate Near Neighbor Count -ANNC problem. These relax the original problem constraints in such a way that the data structure may use points at distance at most to answer a query. For a search operation, this means that a point at distance at most from can be returned; for a count operation, points at distance between and may be counted as near neighbors. Locality Sensitive Hashing (LSH) [22] and Locality Sensitive Filters (LSF) [5, 9] are the most common approaches for solving approximate near neighbor problems with theoretical guarantees.
In another line of research, there is an increasing demand for developing solutions for data analysis tasks that preserve the privacy of sensitive personal information in the input data. This was, for example, highlighted in an invited talk at PODS in 2019 by Dwork [12], who discussed the application of differential privacy (DP) [13] in the US census. Differential privacy is a widely adopted notion that can provide rigorous guarantees of privacy: intuitively, DP guarantees that the removal or addition of a single input entry cannot significantly affect the final result of an analysis task. The main idea of the DP approach is to inject well-calibrated noise into the data (structure) that protects privacy without significantly affecting the accuracy of the analysis task. Counting queries, which can be seen as a more general form of the near neighbor counting problem studied in this paper, are a well-studied area in DP (see [19, 32]). However, the error in the final count must be polynomial in and the dimension of the input space to guarantee privacy [20, 24]. To reduce the error, the requirements should be relaxed to provide any count within a fuzzy range around the query [21]. Although this is different from the curse of dimensionality above, achieving an efficient solution similarly necessitates the use of approximate counts. Thus, it is natural to study approximate counting problems for the high-dimensional ANN problem when considering them in a DP setting.
In this paper, we focus on ANN problems for the inner product on the -dimensional unit sphere , which is often called cosine or angular similarity. In these settings, the goal is to find or count points with large inner products. More specifically, we study the -ANN count and search problems with . Let be the set of unit vectors that have an inner product of at least with . The counting variant asks, for a query , to count all points in a dataset with inner product at least , but tolerates points with inner product at least . That means that the resulting estimate should satisfy . This is a common notation in inner product search, and intuitively and are equivalent to and in -ANNC. The first result for differentially private -ANNC in high dimensions was recently provided by Andoni et al. [4], where the authors use a linear space tree-based data structure based on the concept of LSF. (As observed by [4] as well, a solution on the unit sphere leads to a solution for the whole Euclidean space thanks to embedding methods; we defer all discussion of this embedding and its applicability to Appendix C.)
In this work, we explore the design space of locality sensitive filtering-based solutions to ANN and ANNC problems. We show that by revisiting the LSF framework for ANN it is possible to derive a simpler and more compact solution for ANN. Building on this result, we derive a novel solution for ANNC under differential privacy that extends the range of applicability of the state of the art [4] by removing some limitations on parameter ranges and differential privacy assumptions. In particular, we provide strong guarantees in the regime of pure DP (in contrast to approximate DP), and show that balancing the noise term of DP with the approximation error of LSF is not the only design choice: in fact, spending more space and query time results in more accurate solutions. The following section will provide more details on the technical contribution.
1.1 Our Contribution
Revisiting LSF for ANN
Our work is based on a construction for -ANN first proposed by Aumüller et al. [6] in the context of algorithmic fairness. We provide a more compact description and analysis of LSF for -ANN, and obtain a data structure with a lower pre-processing time of (down from , where is the strength of the filter). Moreover, assuming that some random variables follow a limiting distribution (Theorem 5), we get a more compact and simpler proof than [6], leveraging an elegant connection with concomitant statistics and extreme value theory, and the parameters used in our solution naturally follow from this theory. In Section 5.1, we demonstrate that an alternative construction procedure, CloseTop-1, remains effective without assuming any limiting distributions, matching the guarantees of the asymptotic construction.
From ANN to DP-ANNC
We then present a solution for -ANNC under differential privacy. More specifically, we provide a general methodology that allows us to translate a variant of a list-of-points data structure [5] for ANN into a data structure for DP-ANNC. Intuitively, a list-of-points is a data structure where input points are organized in a collection of lists and a query consists of a scan of some of these lists: this is the case, for instance, for methods based on LSH or LSF. This approach offers a way to develop the data structure in two steps: first describing a data structure satisfying certain characteristics for ANN, and then applying a suitable DP mechanism on top of it.
When the data structure is built on top of the previous result, we get the DPTop-1 data structure presented in Algorithm 1, which we will now describe in words. Given a dataset consisting of points, two similarity thresholds , and privacy parameters and , the data structure samples Gaussian vectors. We associate with each such vector a counter, initialized with 0; each point in increments the counter of the vector that maximizes the inner product. Then, the counts are made differentially private by a suitable DP mechanism make_private, for example the Truncated Laplace mechanism [17] or the ALP mechanism [7]. Depending on the mechanism used, different privacy guarantees can be provided. Since each point increments exactly one counter, the absence or presence of a data point affects only a small part of the data structure. As a result, the sensitivity of the data structure is low and only a small amount of noise has to be added. The vectors with their noisy counts form the data structure that can be released publicly. For any query , the estimate of the number of near neighbors with inner product at least is the sum of counters associated to vectors with inner product similarity to greater than . This choice is guided by the theory of concomitant statistics and extreme value theory of Gaussian random variables [10], which we will formally introduce in Section 2.3. As we will detail in Section 3, in the asymptotic regime – hence for – DPTop-1 offers a simple and elegant solution for the -ANNC problem under differential privacy. The following theorem provides the guarantees of DPTop-1 when using the Truncated Laplace mechanism [17] as a privacy mechanism (we refer to Theorem 9 for the exact statements regarding the ANN data structure, and Theorem 13 for the exact statements of the DP-ANNC implementation).
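To make this description concrete, the following is a minimal Python sketch of DPTop-1; the number of filters M, the query threshold t, and the use of plain Laplace noise inside make_private are illustrative placeholders, not the tuned values or the Truncated Laplace/ALP mechanisms analyzed below.

```python
import numpy as np

def build_dptop1(data, M, eps, rng):
    """data: (n, d) array of unit vectors. Returns (filters, noisy_counts)."""
    d = data.shape[1]
    filters = rng.standard_normal((M, d))        # M Gaussian filter vectors
    best = np.argmax(data @ filters.T, axis=1)   # for each point, the filter maximizing the inner product
    counts = np.bincount(best, minlength=M).astype(float)
    # make_private: each point touches exactly one counter, so the sensitivity is 1;
    # plain Laplace noise stands in here for the Truncated Laplace / ALP mechanisms.
    noisy_counts = counts + rng.laplace(scale=1.0 / eps, size=M)
    return filters, noisy_counts

def query_dptop1(filters, noisy_counts, q, t):
    """Estimate the number of points with large inner product with q."""
    passing = filters @ q >= t                   # filters whose inner product with q exceeds the threshold
    return noisy_counts[passing].sum()
```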
Theorem 1.
Consider the asymptotic regime, , , , and . Let and let . Then DPTop-1 (Algorithm 1) with Truncated Laplace mechanism satisfies -DP, and with probability at least 2/3, the query returns such that
The data structure has pre-processing time , expected query time and space .
This simple algorithm matches the accuracy of the solution found by Andoni et al. [4] and results in a straightforward space partitioning of , one of the main goals of [4]. Furthermore, our approach provides a solution that works for almost all similarity thresholds on the unit sphere, while [4] supports a single distance threshold and relies on embedding techniques and scaling for all other distance thresholds.
While Algorithm 1 is potentially already practical, both the space and running time requirements are worse than the solution presented in [4] due to the large number of filters . We suggest two improvements to the algorithm that do not affect the compactness of the algorithm and the proof. We first observe that the theoretical result in Theorem 1 works only for : we drop this limitation in Section 5.1 thanks to a small but novel variation in the construction procedure. We then observe in Section 5.2 how to achieve almost linear pre-processing time, linear space, and expected query time, which is optimal due to a space-time trade-off lower bound [5] (it is sufficient to set in Theorem 3.3 of that paper to obtain the lower bound on the running time), by concatenating data structures using a technique called tensorization [9, 6]. We call this final version TensorCloseTop-1. Table 1 summarizes the guarantees of all algorithms described in this paragraph and compares them with the state-of-the-art approach [4].
| Mechanism | Privacy | Additive Error | Preprocessing Time | Expected Query Time | Space |
| DPTop-1 w/ Truncated Laplace | |||||
| Andoni et al. [4], and TensorCloseTop-1 w/ Truncated Laplace | |||||
| Andoni et al. [4] | (∗) | ||||
| TensorCloseTop-1 w/ Max Projection | |||||
| Unbalanced TensorCloseTop-1 with Laplace |
Balanced and Unbalanced Data Structures
As can be seen from the theorem statement, the provided estimate comes with an additive error of . This term includes two fundamentally different error sources. First, we might include “far away points”, i.e., points with inner product below , in the count through the ANN data structure. This error does not depend on the choice of the privacy mechanism. The second source of error, which is due to privacy, arises from summing over noisy counters, as a query is expected to search through buckets on average. DPTop-1 and the data structure of Andoni et al. [4] balance these two errors so that both are upper bounded by ; however, as we will show in Sections 3 and 4 this is not necessarily an optimal trade-off. By using an unbalanced data structure – to be discussed in detail in Section 3.1 – with Laplace noise, we can achieve a more accurate result at the cost of larger running time and more space. In fact, for any parameter, our unbalanced data structure solves DP-ANNC with an additive error for . The main insight is that the sum of noisy, unbiased, and uncorrelated counters, provided by the Laplace mechanism, scales with by concentration arguments. This makes it suboptimal to provide the same upper bound for both sources of error.
Comparison to Andoni et al. (NeurIPS 2023)
For the convenience of the reader, we now provide a quick description of the solution in [4], and highlight differences to the present work. Let , , and be suitable parameters that depend on and . The data structure consists of a tree with degree and height ; each internal node is associated with a -dimensional random Gaussian vector and with a subset of the input set . At the beginning, the root contains the entire input set . Then, for each internal node, in a top-down fashion, we partition the assigned vectors into groups: each input vector is assigned to the child node with the smallest index such that the inner product is . (If no such index exists, the point is not stored in the data structure.) Once the input points have been processed, we replace the list in each leaf with its noisy size by adding Truncated Laplace noise [17] (if the final value is below a given threshold, we replace the value with 0). Given a query , we collect the counts of all leaves for which for all nodes on the unique path from the root to , and return as a final estimate the sum of these counts. Using this tree data structure circumvents the problem, mentioned for our variant above, of having to evaluate too many filters.
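For comparison, here is a rough Python sketch that mirrors the tree construction described above; the branching factor, the height, the inner-product threshold eta, and the plain Laplace noise at the leaves are placeholders for the actual parameter choices and for the Truncated Laplace mechanism (with thresholding) used in [4].

```python
import numpy as np

def build_tree(points, height, branching, eta, d, eps, rng):
    """points: list of d-dimensional unit vectors assigned to this subtree."""
    if height == 0:                                  # leaf: keep only a noisy count
        return {"count": len(points) + rng.laplace(scale=1.0 / eps)}
    gaussians = rng.standard_normal((branching, d))  # one Gaussian vector per child
    buckets = [[] for _ in range(branching)]
    for p in points:
        hits = np.flatnonzero(gaussians @ p >= eta)  # children whose vector passes p
        if hits.size:                                # otherwise the point is dropped
            buckets[hits[0]].append(p)               # smallest index wins
    return {"gaussians": gaussians,
            "children": [build_tree(b, height - 1, branching, eta, d, eps, rng)
                         for b in buckets]}

def query_tree(node, q, eta):
    if "count" in node:
        return node["count"]
    return sum(query_tree(child, q, eta)             # follow every child whose vector passes q
               for g, child in zip(node["gaussians"], node["children"]) if g @ q >= eta)
```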
As the goal of their paper is to address Euclidean distance, the range of the and parameters is limited; their analysis works only for an value that corresponds to Euclidean distance, and all other distances are only supported through embedding and scaling, which adds an additional distortion to the distance values. In contrast, our solution allows for a wider range of these parameters, increasing the applicability for inner product similarity on the sphere, while still yielding a data structure that holds for Euclidean distance. Furthermore, by removing the tree structure, we are able to design an algorithm with few data dependencies that is likely to exploit hardware accelerators (e.g., Nvidia Tensor Cores, Intel AMX) for neural networks that are optimized for batches of inner products.
1.2 Previous work
Near Neighbor Search
Locality Sensitive Hashing (LSH) [22] is one of the most used approaches for solving ANN with rigorous guarantees. However, it suffers from large space requirements. Indeed, LSH requires memory words, where is a parameter describing the "power" of the used LSH (e.g., for Euclidean distance [3]): the data structure requires creating hash tables, each storing all points. Both Panigrahy [25] and Kapralov [23] provided linear space solutions using variants of LSH. An interesting technique to achieve smooth space-time trade-offs is given by Locality Sensitive Filters (LSF) [9, 5]. In the context of this work, the interesting space-time trade-off to focus on is the linear space regime [6]. Besides offering optimal space, this regime has many additional interesting properties for downstream applications. For example, very recently, Andoni et al. [4] showed their application in the context of differentially private range counting in high dimensional data. As mentioned above, a linear space data structure involves each point of the dataset at most once, so the absence or presence of a data point only affects a small part of the data structure; in comparison, with traditional LSH-based approaches a single point is stored in many different tables in the data structure.
Differentially Private Counting Queries
Counting queries require, except for a few classes of queries, a polynomial error in and in the space dimension to guarantee privacy [20, 24]. This incentivized Huang and Yi [21] to relax the condition, allowing for the release of any count within a fuzzy range of the query. For ball queries in , this is essentially the problem of releasing any count between and , which we will identify as the -Approximate Near Neighbor Counting (ANNC) problem. One of the main results in [21] is that there exists a differentially private solution of the problem with poly-logarithmic error in at the price of an exponential dependence in the dimension . A solution for the high dimensional case was proposed in [4], where Andoni et al. proposed a linear space data structure for the -ANN to solve the differentially private -ANNC. The authors developed a Locality Sensitive Filtering data structure with , for the differentially private -ANNC in the Euclidean space with additive error and multiplicative error , getting rid of the dependence on the dimension. The proposed data structure is based on a more general theory for data structures with space-time trade-offs [5], making the analysis more involved. In this paper, we will show that our data structure offers the same guarantees with a more streamlined analysis.
2 Preliminaries
2.1 Notation
We let be the set of integers . We denote with a query point, and with a point of the dataset such that . We set as the vector associated to , and define , and as the concomitant – to be defined in Section 2.3 – of . If then it is denoted as and so the concomitant as . The threshold for the query filter is . A ball in the hyper-sphere under inner product similarity centered in is denoted as . We call a point in close to if , and far if . We consider to be the number of points in the dataset . We denote the Gaussian distribution of mean and variance as .
2.2 Problem Definition
Definition 2 (-ANN).
Consider a set of points. The Approximate Near Neighbor Search (ANN) problem asks to construct a data structure for that, for any given query such that contains a point of , returns a point in .
We will study a data structure that solves this problem with asymptotically high probability, hence at least . The inner product similarity is related to the Euclidean distance, since $\|x-y\|_2^2 = 2 - 2\langle x, y\rangle$ for any unit vectors $x, y$. Therefore, for and , a -ANN in is equivalent to the -ANN defined above.
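For concreteness, the following short derivation (valid for unit vectors) spells out this correspondence; the threshold names α and β are used here only to illustrate the standard mapping between distances and inner products, up to the paper's exact notation.

```latex
\[
  \|x - y\|_2^2 \;=\; \|x\|_2^2 + \|y\|_2^2 - 2\langle x, y\rangle \;=\; 2 - 2\langle x, y\rangle
  \qquad \text{for } x, y \in \mathbb{S}^{d-1},
\]
\[
  \text{so a distance threshold } r \text{ corresponds to } \alpha = 1 - \tfrac{r^2}{2},
  \quad \text{and the relaxed distance } cr \text{ corresponds to } \beta = 1 - \tfrac{(cr)^2}{2}.
\]
```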
Definition 3 (-ANNC).
Consider a set of points. The Approximate Near Neighbor Counting (ANNC) problem asks to construct a data structure for that, for a given query , returns a number between and .
This problem is the counting equivalent of the well-studied spherical range reporting problem (see for example [1]) that asks to enumerate all points at a certain distance from .
2.3 Concomitant Order Statistics
The theory of concomitant order statistics offers a very elegant and intuitive tool for random projections in , as highlighted in [15, 26, 27]. Let be random samples from a bivariate distribution. We order the values according to such that . The -variate associated with is denoted as and it is called the concomitant of the -th order statistic.
Relation With Random Projections
Let such that and . Consider the random variables and ; then , which is a standard bivariate normal distribution with correlation coefficient . (The general notation for a bivariate Gaussian distribution is , while .) The relation between concomitants and order statistics for the bivariate normal distribution is given by the following lemma.
Lemma 4 ([10]).
Given samples from the standard bivariate normal distribution , for any we have that , where is a random variable distributed as and independent of .
A standard result of concomitant order statistics states that weakly converges to a Gaussian distribution [10]. Thus, defining as the probability density function of , we have that [16]. By adding the fact that [18] we get the following theorem.
This asymptotic result serves as the basis of the intuition for our data structure: if we associate to each point in the closest Gaussian vector , then a query , such that , will find associated to a Gaussian vector with inner product similarity .
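The following small Python experiment illustrates this intuition numerically: among M Gaussian filters, the filter maximizing the inner product with a point p also has a large inner product with a query q satisfying ⟨q, p⟩ = alpha, concentrating around alpha·√(2 ln M). The dimension and sample sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, trials, alpha = 128, 4096, 200, 0.8

# a fixed query/point pair with inner product exactly alpha
q = np.zeros(d); q[0] = 1.0
p = np.zeros(d); p[0] = alpha; p[1] = np.sqrt(1 - alpha**2)

concomitants = []
for _ in range(trials):
    G = rng.standard_normal((M, d))        # M Gaussian filters
    i = np.argmax(G @ p)                   # filter maximizing <g, p> (the order statistic)
    concomitants.append(G[i] @ q)          # its inner product with q (the concomitant)

m_M = np.sqrt(2 * np.log(M))               # rough location of the maximum of M standard Gaussians
print(np.mean(concomitants), alpha * m_M)  # the concomitant concentrates near alpha * m_M
```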
2.4 Differential Privacy
Differential Privacy (DP) is a notion of indistinguishability for the outputs of protocols applied to neighboring datasets. Two datasets are neighboring by addition/removal if they differ by the addition or removal of one point; they are neighboring by substitution if they have the same size and differ in exactly one point.
Definition 6 (Approximate Differential Privacy [14]).
For and , we say that a randomized algorithm is -differentially private if for any two neighboring datasets , and any possible outcome of the algorithm , we have .
We are mainly interested in histogram queries [13], where is the data universe and is the size of the data set. The most common way to privatize is to obfuscate the true values by adding noise scaled to the sensitivity of the query [13]. In our context, each data point contributes to exactly one counter. For the addition/removal neighboring relationship, the sensitivity is 1; for substitution, it is 2. We consider three different DP mechanisms when privatizing counters, for example, in the function make_private in Algorithm 1: the Truncated Laplace mechanism [17], the Laplace mechanism [13], and the Max Projection mechanism [7]. More details can be found in Appendix A.1.
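A small Python check of the sensitivity claim above: since every point increments exactly one counter, removing a point changes the histogram by exactly 1 in one position (a substitution would change two positions). The filter assignment follows DPTop-1; the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, M = 64, 1000, 512
filters = rng.standard_normal((M, d))

def counters(data):
    """The DPTop-1 histogram: each point increments exactly one counter."""
    best = np.argmax(data @ filters.T, axis=1)
    return np.bincount(best, minlength=M)

data = rng.standard_normal((n, d))
data /= np.linalg.norm(data, axis=1, keepdims=True)   # project onto the unit sphere
neighbor = data[:-1]                                   # neighboring dataset by removal of one point

diff = counters(data) - counters(neighbor)
print(np.abs(diff).sum())                              # L1 sensitivity under addition/removal: 1
```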
3 Top-1 Data Structure for ANN
Algorithm 2 describes the Top-1 data structure, which is the variant of Algorithm 1 targeting the -ANN problem. Let be a set of random vectors from . The data structure consists of a hash table that stores the input vectors assigned to each random vector in : more specifically, we assign each input vector to the random vector in with the largest inner product. For a given query vector , the query algorithm selects all random vectors with an inner product larger than with . Then, it searches for an approximate near neighbor in the lists of points associated with these vectors in the hash table. We call buckets the indices of the hash table (i.e., the random vectors), and filters the function used to query the hash table (i.e., the inner product). In this section, we consider the asymptotic regime for , so as to use the limiting distribution of the extreme concomitant in Theorem 5.
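A minimal Python sketch of the (non-private) Top-1 data structure just described: a hash table from filter index to the list of assigned points, and a query that scans only buckets whose filter exceeds the query threshold. The number of filters M, the query threshold t_q, and the relaxed similarity beta are placeholders for the values fixed by the analysis below.

```python
import numpy as np
from collections import defaultdict

def build_top1(data, M, rng):
    d = data.shape[1]
    filters = rng.standard_normal((M, d))
    table = defaultdict(list)
    for idx, best in enumerate(np.argmax(data @ filters.T, axis=1)):
        table[best].append(idx)                       # each point lands in exactly one bucket
    return filters, table

def query_top1(filters, table, data, q, t_q, beta):
    for b in np.flatnonzero(filters @ q >= t_q):      # search: filters above the query threshold
        for idx in table.get(b, []):                  # query: scan the associated lists
            if data[idx] @ q >= beta:                 # report any point above the relaxed threshold
                return idx
    return None                                       # no (approximate) near neighbor found
```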
Lemma 7 (Probability to Find a Close Point).
For , the buckets returned by Top-1 search contain a close point, if one exists, with probability at least .
Proof.
Consider a close point , associated to the bucket . That bucket is found in search if . From Proposition 20 and Theorem 5, we observe that
where we use . Thus, with probability at least , is associated to a vector that exceeds the threshold.
Lemma 8 (Expected Number of Buckets and Far Points).
For , such that , Top-1 search returns in expectation at most buckets, containing in expectation at most far points.
Proof.
We observe that by setting we get . Thus, the threshold is , which is positive for . From Proposition 20, the probability that a filter exceeds the threshold is
| (1) |
as the projection over a Gaussian vector is a normal random variable. In expectation, a query inspects at most buckets. The claim follows by setting . For the analysis of far points, we may write . The second factor is positive for . Thus, by applying Theorem 5 and Proposition 20, the probability to inspect a far point is
| (2) |
By inserting in the previous inequality, we obtain , as . Since we have at most far points, the expected number of inspected far points is at most . The proposed data structure Top-1 (Algorithm 2) is a naive solution with high space, pre-processing time, and query time for the -ANN.
Theorem 9.
Consider . For any such that , , and for any dataset in , Top-1 solves with at least probability the -ANN using pre-processing time , space , and expected query time .
Proof.
The pre-processing time is given by as for each of the points we need to look at random vectors of dimensionality . Each point is assigned to only one random vector, so the space needed to store the data structure is . The running time is given by summing the running time of search and query. The buckets in search are found in time while the expected running time of query is given by the expected number of far points present in these buckets, which are at most , and the expected number of buckets returned by search, which are at most , due to Lemma 8. Thus, the expected running time is at most . The problem is solved with at least probability due to Lemma 7.
3.1 Balanced and Unbalanced Top-1
The standard way to minimize the expected query time of an algorithm that solves -ANN is to balance the number of buckets that have to be inspected with the number of far points ("error") that are associated with those buckets. To balance the contribution of far points and the number of buckets inspected, we choose , which solves the equation . We denote this specific solution as to highlight its connection to standard ANN analysis. However, alternative values of can be chosen to achieve different trade-offs. The next corollary follows from Lemma 8.
Corollary 10 (Balanced and Unbalanced Top-1).
Consider . For any such that , consider and . Define balanced and unbalanced Top-1 as the data structures initialized with and respectively. Then,
-
1.
The expected number of buckets inspected by search is at most for balanced Top-1, and for unbalanced Top-1.
-
2.
The buckets from search contain, in expectation, at most far points for balanced Top-1 and far points for unbalanced Top-1.
-
3.
For any we have .
Proof.
It follows by a simple computation from Lemma 8. As the space and the running time of balanced Top-1 is , the latter considers the worst case as the expected number of far points is . Space and running time follow directly for unbalanced Top-1 as . The unusual behavior of unbalanced Top-1 will be further clarified in the next section, where its utility for -DP-ANNC will be discussed. Although Top-1 benefits from a clean and straightforward analysis (due to the assumption ), it involves significant preprocessing, space, and query time requirements. These limitations will be addressed in Section 5.2 through the use of tensorization. Additionally, the assumption of will be lifted with a minor modification to the algorithm, as detailed in Section 5.1. Nevertheless, as we will show in the next section, Top-1 still provides a meaningful solution for -ANNC under differential privacy constraints.
4 From ANN to DP-ANNC
In this section, we study the relationship between ANN and DP-ANNC. We present a general way to solve ANNC starting from a space-partitioning data structure for ANN, and discuss different differentially private mechanisms to privatize the ANNC data structure. We also show how the unbalanced data structure from the previous section can be used to increase the accuracy for -DP-ANNC, at the cost of an increase in query time, pre-processing time, and space usage.
4.1 From ANN to ANNC
Inspired by the list-of-points data structure developed in [5], we define a family of data structures suitable for a general reduction from ANN to ANNC.
Definition 11 (Space Partitioning Data Structure).
Given a set and an integer , a space-partitioning data structure for the ANN problem is defined as follows:
-
The data structure is a partition of into sets such that and for , and a function that maps . (Technically, we do not require a full partition, as we allow some points to not be stored.)
-
For a query , we obtain the set and scan all points in . If there exists a point with inner product at least , we return it. Otherwise we return .
The total space is , where is the space necessary to store the function . The query time is at most , where is the time taken to compute given query , and is the worst-case time needed to check all the points.
For example, for Algorithm 2, represents the points that achieve their maximum inner product with . consists of all (its size is ) and computes the indices of all filters that are above the query threshold. For the data structure of Andoni et al. [4] discussed in the introduction, are the leaves of the tree, and represents the navigation of the tree-based data structure. Algorithm 3 presents a simple transformation for a space-partitioning data structure: it suffices to substitute the actual points with their count in each list. The new -ANNC query returns the sum of the elements contained in these lists. Since each point is stored at most once, summing the cardinality of each bucket ensures that no point is counted more than once.
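A minimal sketch of this transformation: the space-partitioning structure is abstracted as an assignment of every stored point to a bucket, the point lists are replaced by their sizes, and the counting query simply sums the sizes of the buckets that the ANN query would scan. The array encoding below is an assumption made for brevity.

```python
import numpy as np

def to_counting(assignment, num_buckets):
    """assignment[i] = bucket index of point i, or -1 if the point is not stored."""
    stored = assignment[assignment >= 0]
    return np.bincount(stored, minlength=num_buckets)    # bucket sizes replace the point lists

def count_query(counts, bucket_indices):
    """bucket_indices = I(q): the buckets the ANN query would scan."""
    return counts[bucket_indices].sum()                   # each stored point is counted at most once
```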
Lemma 12 (From ANN to ANNC).
Let . Consider a space-partitioning data structure for such that for each : (i) contains a list with a close point with probability at least , and (ii) the expected number of far points in is at most . Then query in Algorithm 3 returns a value that, with probability at least , satisfies the following inequality:
The data structure in Algorithm 3 uses space and the query time is .
4.2 From ANNC to DP-ANNC
The data structure returned by Algorithm 3 uses counters, which is essentially a histogram. This histogram can be privatized using the algorithms presented in Section 2.4. To achieve differential privacy, we need to analyze the sensitivity of the counters . Let and be two neighboring datasets that differ in exactly one point, and let and be the data structures constructed for these data sets, respectively. Applying Algorithm 3 to and will result in two counters and that differ by at most 1 in at most one position. If is data independent, i.e., does not depend on the actual data set , it is sufficient to privatize the counter , which can be done using any differentially private mechanism make_private for histograms, as shown in the aforementioned DPTop-1 (see Algorithm 1). The next theorem states the guarantees for two specific privacy mechanisms.
Theorem 13 (DP-ANNC with Truncated Laplace or Max Projection).
Let , . Consider a space partitioning data structure for satisfying the assumptions of Lemma 12, with the addition of and being data independent. When is privatized using the truncated Laplace mechanism, the data structure is -DP, for any , requires an additional term in space and pre-processing time (compared to Lemma 12), and the query algorithm returns a value that satisfies the following inequality with probability at least :
| (3) |
When is privatized with the Max Projection mechanism, then the data structure is -DP, requires an additional term in space and pre-processing time, and the additive error in Equation 3 becomes .
Due to Lemma 7 and Corollary 10, balanced Top-1 satisfies the requirements for Theorem 13 with , which proves Theorem 1. Moreover, the tree-based data structure described by Andoni et al. [4] is another data structure that satisfies the requirements of Theorem 13. (Technically, as stated earlier, [4] analyze their data structure for a fixed choice of .)
4.2.1 Usefulness of Unbalanced Data Structure
We now study how an unbiased and uncorrelated differentially private estimator of , from the unbalanced Top-1 ANNC data structure, can be used to reduce the error compared to Theorem 13. The construction leverages the concentration of the sum of i.i.d. Laplace random variables.
Theorem 14 (DP-ANNC with Laplace Noise and Unbalanced Data Structure).
Let , . Consider a space partitioning data structure for satisfying the assumptions of Lemma 12, with the addition of and being data independent. When is privatized by using the Laplace mechanism, the data structure is -DP, for any , and query returns a value that satisfies the following inequality with probability at least :
The privatized data structure requires an additional space, and pre-processing time.
Due to Lemma 7 and Corollary 10, unbalanced Top-1 satisfies the requirements for Theorem 14 with . As a consequence, unbalanced Top-1 is always more accurate than balanced Top-1 for DP-ANNC; however, the pre-processing time and the space increase.
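A quick numerical illustration of the insight behind the unbalanced construction: the sum of t independent Laplace(1/ε) noise terms has standard deviation √(2t)/ε, so the privacy error of a query that touches t buckets grows like √t rather than t. The parameters below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
eps, t, trials = 1.0, 1000, 2000
noise_sums = rng.laplace(scale=1.0 / eps, size=(trials, t)).sum(axis=1)  # total noise over t buckets
print(noise_sums.std(), np.sqrt(2 * t) / eps)   # empirical spread vs the predicted sqrt(2 t) / eps
```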
In the next section we provide several improvements for Top-1, aiming to get rid of the asymptotic assumption used to apply Theorem 5 for concomitant statistics, and to reduce the pre-processing time, the query time, and the space. These improvements concern only the balanced ANN data structure; the additional requirements in space and pre-processing time of for DP-ANNC with unbalanced data structures will still be present and are an interesting open question for future work. Finally, we highlight that for errors of the form we may increase the range of the privacy budget to in Theorems 13 and 14, by the same argument provided by Andoni et al. [4].
5 Improving the Top-1 Data Structure
In this section we propose two improvements of Top-1. With CloseTop-1 we get rid of the assumption of (we observe that, although the time and space complexities are expressed in big-O notation, i.e., with a notation asymptotic in , the correctness of this algorithm does not require assuming a limiting distribution for the concomitants, while this was the case for Top-1), while with TensorCloseTop-1 we reduce the pre-processing time to , the space to , and obtain an expected query time of . In addition, we discuss how TensorCloseTop-1 can solve the -ANN in the Euclidean space in Appendix C. These improvements do not alter the core of the data structure (a hash table of points with a search function), allowing them to be utilized for DP-ANNC as discussed in the previous section.
5.1 CloseTop-1
We now study CloseTop-1 (see Algorithm 4), a practical implementation of the previous asymptotic data structure. In Top-1 we associate to each point of the dataset the random vector with the highest inner product. This is an intuitive choice that leads to a simple and clear analysis by analyzing Gaussian tails of concomitant statistics (Theorem 5). However, this is an asymptotic theorem: it results from the fact that and [18]. In fact, it can be obtained by Lemma 4 by setting . The intuition of CloseTop-1 is to provide a lower and an upper bound for by construction, by associating to each point of the dataset a random vector with an inner product close to the expected maximum, hence the name of the data structure. In the proposed construction, we sample random vectors , and we associate to any the first random vector such that . If at least one random vector succeeds in the association, then we say that the point collided. The key property of CloseTop-1 is that a point collides with high probability (Lemma 23), which allows us to state the following lemma.
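A Python sketch of the CloseTop-1 assignment rule, using an assumed window of half-width slack around the rough expected maximum √(2 ln M) of M standard Gaussians; the exact window used in the analysis is given by the elided thresholds, so the constants below are placeholders.

```python
import numpy as np
from collections import defaultdict

def build_closetop1(data, M, rng, slack=0.5):
    d = data.shape[1]
    filters = rng.standard_normal((M, d))
    m_M = np.sqrt(2 * np.log(M))                 # rough expected maximum of M standard Gaussians
    lo, hi = m_M - slack, m_M + slack            # placeholder window around m_M
    table = defaultdict(list)
    for idx, p in enumerate(data):
        ips = filters @ p
        hits = np.flatnonzero((ips >= lo) & (ips <= hi))
        if hits.size:                            # the point "collides" with high probability
            table[hits[0]].append(idx)           # first random vector inside the window
    return filters, table
```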
Lemma 15.
5.2 TensorCloseTop-1
In this section, we propose TensorCloseTop-1 (see Algorithm 5), to reduce the pre-processing time to , the space to , and the expected query time to (for the balanced data structure). The data structure uses a technique developed in [9] called tensoring, which essentially allows to simulate an exponential number of vectors by concatenating a polynomial number of data structures. The same expedient was used in [6] to get a pre-processing time of . This technique is similar to creating a tree, yet this data structure allows parallel evaluation of the hashes (Lines 2–3 of Algorithm 5, search). Define the concatenation factor and assume is an integer; consider independent CloseTop-1 data structures, each using Gaussian vectors , where indicates the data structure and indicates the vector. For each point consider the colliding vectors in each data structure , then map the point to a bucket using a hash table. Given a query , for each data structure the indices of the random vectors are selected such that , hence , and a search in all the buckets is performed. The number of random vectors to sample is , which is sub-linear in , provided the data structure is not designed to search for points with exceedingly high inner product similarity.
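A Python sketch of the tensoring step: τ independent CloseTop-1 filter sets, each with a small number m of filters, are combined by using the tuple of colliding filter indices as the hash-table key; the query enumerates, per structure, the filters above the query threshold and forms the Cartesian product of candidate buckets. The window and threshold values are placeholders.

```python
import numpy as np
from collections import defaultdict
from itertools import product

def build_tensor(data, tau, m, window, rng):
    d = data.shape[1]
    structures = [rng.standard_normal((m, d)) for _ in range(tau)]   # tau independent filter sets
    lo, hi = window
    table = defaultdict(list)
    for idx, p in enumerate(data):
        key = []
        for F in structures:
            ips = F @ p
            hits = np.flatnonzero((ips >= lo) & (ips <= hi))
            if hits.size == 0:                   # the point must collide in every structure
                key = None
                break
            key.append(int(hits[0]))             # colliding vector index in this structure
        if key is not None:
            table[tuple(key)].append(idx)        # bucket = concatenation of the tau indices
    return structures, table

def search_tensor(structures, q, t_q):
    passing = [np.flatnonzero(F @ q >= t_q) for F in structures]     # per-structure passing filters
    return [tuple(int(i) for i in combo) for combo in product(*passing)]  # candidate bucket keys
```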
Proposition 16 (Tensorization).
For any constant assume . Then for , , and , we have
Proof.
Just a simple computation: as , while , then . In practice, and must all be integers; however, this does not affect the asymptotic behavior since . With this trick, we reduce the query time due to search (Lines 2–4 of Algorithm 5, procedure search) to . Therefore, the query time is mainly affected by how many times the data structure needs to access the hash table, which can be bounded in expectation. Under similar assumptions on we can prove that TensorCloseTop-1 finds a close point with high probability.
Lemma 17 (Probability to Find a Close Point).
For any such that , TensorCloseTop-1 finds a close point, if it exists, with at least probability.
Lemma 18 (Expected Number of Buckets and Far Points).
For any such that , and , TensorCloseTop-1 search finds in expectation at most buckets containing at most far points.
The previous lemma states that plays the same role as in CloseTop-1, so it can be used to construct balanced and unbalanced TensorCloseTop-1. We now analyze the query time, space, and pre-processing time.
Theorem 19.
For any such that , , and . For any dataset in , TensorCloseTop-1 solves with at least probability the -ANN using space , preprocessing time , and expected query time .
Proof.
Due to Proposition 16, the data structure needs to store random vectors. As each point is stored in at most one bucket, the space is . As for each point it is necessary to compute inner products at most times, the pre-processing time is . The buckets in search can be computed in time , so the expected query time is at most due to Lemma 18. The problem is solved with at least probability due to Lemma 17. As Corollary 10 applies to TensorCloseTop-1 due to Lemma 18, the parameters and respectively characterize the balanced and unbalanced versions of TensorCloseTop-1. The balanced version has an expected query time of and, when combined with Theorem 13, achieves an additive error of for differentially private approximate near neighbor counting (DP-ANNC). For , the additional space and preprocessing requirements are . In contrast, the unbalanced version of TensorCloseTop-1 has an expected query time of and, when used with Theorem 14, yields an asymptotically smaller additive error for DP-ANNC. However, this approach incurs a significant drawback: the Laplace noise introduces an additional space and preprocessing overhead of . This overhead becomes the dominant cost as , especially in the Euclidean ANN problem, where . For further details on how this data structure is applied in Euclidean space, see Appendix C.
6 Conclusion and Open Problems
This paper introduced and analyzed simple linear space data structures that solve the -ANN problem and can be transformed into efficient solutions for its counting variant under differential privacy. This provides an alternative to the data structure proposed recently by Andoni et al. [4], with a simpler construction and analysis. We provided general black-box transformations from approximate near neighbor problems to their counting variant under privacy constraints and showed that interesting error/time trade-offs are possible via unbalanced ANN data structures. The most intriguing open question was already posed by Andoni et al. [4]: Can one obtain better accuracy guarantees for range counting than by transforming near neighbor data structures that have well-understood lower bounds [5]? For example, [1] describes a sampling based range counting algorithm that could be a good starting point for further investigation. For the presented data structures, one should further investigate the relation between the noise error due to differential privacy and the error due to including "far points", which could give interesting trade-offs. We initiated such a study through unbalanced ANN data structures; the main obstacle for a space-efficient solution is to store "small counts" in a data structure that uses space and provides unbiased counters such that the expected error of the sum of counters is only a factor larger than the expected per-point error. Finally, while we believe that our algorithms are simple and straightforward, an experimental comparison between the different solutions presented here and in the literature seems necessary, not only for approximate range counting, but also for filtering-based approximate near neighbor search. In fact, only the work of Pham et al. [27] provided evidence of the practical impact of filtering-based near neighbor search, and they achieve their result by a combination of LSH and LSF.
References
- [1] Thomas D. Ahle, Martin Aumüller, and Rasmus Pagh. Parameter-free locality sensitive hashing for spherical range reporting. In SODA, pages 239–256. SIAM, 2017. doi:10.1137/1.9781611974782.16.
- [2] Josh Alman and Ryan Williams. Probabilistic polynomials and hamming nearest neighbors. In Venkatesan Guruswami, editor, IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, Berkeley, CA, USA, 17-20 October, 2015, pages 136–150. IEEE Computer Society, 2015. doi:10.1109/FOCS.2015.18.
- [3] Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117–122, 2008. doi:10.1145/1327452.1327494.
- [4] Alexandr Andoni, Piotr Indyk, Sepideh Mahabadi, and Shyam Narayanan. Differentially private approximate near neighbor counting in high dimensions. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA, 2023. Curran Associates Inc.
- [5] Alexandr Andoni, Thijs Laarhoven, Ilya Razenshteyn, and Erik Waingarten. Optimal hashing-based time-space trade-offs for approximate near neighbors. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’17, pages 47–66, USA, 2017. Society for Industrial and Applied Mathematics. doi:10.1137/1.9781611974782.4.
- [6] Martin Aumüller, Sariel Har-Peled, Sepideh Mahabadi, Rasmus Pagh, and Francesco Silvestri. Sampling a near neighbor in high dimensions — who is the fairest of them all? ACM Trans. Database Syst., 47(1), April 2022. doi:10.1145/3502867.
- [7] Martin Aumüller, Christian Janos Lebeda, and Rasmus Pagh. Representing sparse vectors with differential privacy, low error, optimal space, and fast access. Journal of Privacy and Confidentiality, 12(2), November 2022. doi:10.29012/jpc.809.
- [8] Roberto J. Bayardo, Yiming Ma, and Ramakrishnan Srikant. Scaling up all pairs similarity search. In Proceedings of the 16th International Conference on World Wide Web, WWW ’07, pages 131–140, New York, NY, USA, 2007. Association for Computing Machinery. doi:10.1145/1242572.1242591.
- [9] Tobias Christiani. A framework for similarity search with space-time tradeoffs using locality-sensitive filtering. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’17, pages 31–46, USA, 2017. Society for Industrial and Applied Mathematics. doi:10.1137/1.9781611974782.3.
- [10] H. A. David and J. Galambos. The asymptotic theory of concomitants of order statistics. Journal of Applied Probability, 11(4):762–770, 1974. URL: http://www.jstor.org/stable/3212559.
- [11] Devdatt Dubhashi and Alessandro Panconesi. Concentration of Measure for the Analysis of Randomized Algorithms. Cambridge University Press, USA, 1st edition, 2009.
- [12] Cynthia Dwork. Differential privacy and the us census. In Proceedings of the 38th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS ’19, page 1, New York, NY, USA, 2019. Association for Computing Machinery. doi:10.1145/3294052.3322188.
- [13] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Proceedings of the Third Conference on Theory of Cryptography, TCC’06, pages 265–284, Berlin, Heidelberg, 2006. Springer-Verlag. doi:10.1007/11681878_14.
- [14] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9(3–4):211–407, August 2014. doi:10.1561/0400000042.
- [15] Kave Eshghi and Shyamsundar Rajaram. Locality sensitive hash functions based on concomitant rank order statistics. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’08, pages 221–229, New York, NY, USA, 2008. Association for Computing Machinery. doi:10.1145/1401890.1401921.
- [16] Thomas R Fleming and David P Harrington. Counting processes and survival analysis. John Wiley & Sons, 2013.
- [17] Quan Geng, Wei Ding, Ruiqi Guo, and Sanjiv Kumar. Tight analysis of privacy and utility tradeoff in approximate differential privacy. In Silvia Chiappa and Roberto Calandra, editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 89–99. PMLR, 26–28 August 2020. URL: https://proceedings.mlr.press/v108/geng20a.html.
- [18] Peter Hall. On the rate of convergence of normal extremes. Journal of Applied Probability, 16(2):433–439, 1979. URL: http://www.jstor.org/stable/3212912.
- [19] Moritz Hardt and Guy N. Rothblum. A multiplicative weights mechanism for privacy-preserving data analysis. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 61–70, 2010. doi:10.1109/FOCS.2010.85.
- [20] Moritz Hardt and Kunal Talwar. On the geometry of differential privacy. In Proceedings of the Forty-Second ACM Symposium on Theory of Computing, STOC ’10, pages 705–714, New York, NY, USA, 2010. Association for Computing Machinery. doi:10.1145/1806689.1806786.
- [21] Ziyue Huang and Ke Yi. Approximate Range Counting Under Differential Privacy. In Kevin Buchin and Éric Colin de Verdière, editors, 37th International Symposium on Computational Geometry (SoCG 2021), volume 189 of Leibniz International Proceedings in Informatics (LIPIcs), pages 45:1–45:14, Dagstuhl, Germany, 2021. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi:10.4230/LIPIcs.SoCG.2021.45.
- [22] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC ’98, pages 604–613, New York, NY, USA, 1998. Association for Computing Machinery. doi:10.1145/276698.276876.
- [23] Michael Kapralov. Smooth tradeoffs between insert and query complexity in nearest neighbor search. In PODS, pages 329–342. ACM, 2015. doi:10.1145/2745754.2745761.
- [24] S. Muthukrishnan and Aleksandar Nikolov. Optimal private halfspace counting via discrepancy. In Proceedings of the Forty-Fourth Annual ACM Symposium on Theory of Computing, STOC ’12, pages 1285–1292, New York, NY, USA, 2012. Association for Computing Machinery. doi:10.1145/2213977.2214090.
- [25] Rina Panigrahy. Entropy based nearest neighbor search in high dimensions. In SODA, pages 1186–1195. ACM Press, 2006. URL: http://dl.acm.org/citation.cfm?id=1109557.1109688.
- [26] Ninh Pham. Simple yet efficient algorithms for maximum inner product search via extreme order statistics. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, KDD ’21, pages 1339–1347, New York, NY, USA, 2021. Association for Computing Machinery. doi:10.1145/3447548.3467345.
- [27] Ninh Pham and Tao Liu. Falconn++: a locality-sensitive filtering approach for approximate nearest neighbor search. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA, 2022. Curran Associates Inc.
- [28] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR, 2021. URL: http://proceedings.mlr.press/v139/radford21a.html.
- [29] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th International Conference on World Wide Web, WWW ’01, pages 285–295, New York, NY, USA, 2001. Association for Computing Machinery. doi:10.1145/371920.372071.
- [30] Gregory Shakhnarovich, Trevor Darrell, and Piotr Indyk. Nearest-Neighbor Methods in Learning and Vision: Theory and Practice (Neural Information Processing). The MIT Press, 2006.
- [31] Stanislaw J. Szarek and Elisabeth Werner. A nonsymmetric correlation inequality for Gaussian measure. Journal of Multivariate Analysis, 68(2):193–211, February 1999. doi:10.1006/jmva.1998.1784.
- [32] Salil Vadhan. The Complexity of Differential Privacy, pages 347–450. Springer, Yehuda Lindell, ed., 2017. doi:10.1007/978-3-319-57048-8_7.
- [33] Jingdong Wang, Heng Tao Shen, Jingkuan Song, and Jianqiu Ji. Hashing for similarity search: A survey. CoRR, abs/1408.2927, 2014. arXiv:1408.2927.
- [34] Ryan Williams. A new algorithm for optimal 2-constraint satisfaction and its implications. Theoretical Computer Science, 348(2):357–365, 2005. Automata, Languages and Programming: Algorithms and Complexity (ICALP-A 2004). doi:10.1016/j.tcs.2005.09.023.
Appendix A Useful inequalities and Additional Definitions
A.1 Differentially Private Mechanisms
In this work we considered three differentially private mechanisms:
-
The Truncated Laplace mechanism [17], used also by Andoni et al. [4], which obfuscates each positive entry of by adding truncated Laplace noise. The mechanism is -DP and produces a biased estimator with expected absolute error . In our context, it has the advantage that it only needs to sample and store the counts of the non-zero entries, i.e., at most random variables.
-
The Laplace mechanism [13], which adds independent Laplace noise to each entry of . The mechanism is -DP and produces an unbiased estimator with uncorrelated entries and absolute expected error . The estimator behaves well for range queries (i.e., for some ), obtaining an expected absolute error . However, it requires sampling and storing random variables.
-
The Max Projection mechanism [7], which stores all "small counts" in a sketching-based data structure. The mechanism is -DP and produces a data structure with access time , returning a biased estimator with expected absolute error . The additional space it needs is , making it a valid pure-DP alternative to the Truncated Laplace mechanism.
A.2 Tail Bounds
We will make use of the following Gaussian tail bounds.
Proposition 20 (Gaussian Tail Bounds [11]).
Let . Then, for any , we have that .
Proposition 21 (Proposition 3, [31]).
Let be a standard normal random variable. Then, for any , we have that
From the previous proposition, it may be more useful to use the following loose bounds:
Proposition 22.
Let be a standard normal random variable. Then, for any , we have that
Proof.
The bounds follow from Proposition 21. The upper bound is trivial, while the lower bound follows by noticing that as .
Appendix B Omitted Proofs
B.1 Omitted Proofs in Section 4
Proof of Lemma 12.
We start with the lower bound. Let be the random variable indicating the number of close points in not included in . Due to requirement (i) the probability to not find, and so to not count, a close point is at most , then . Using Markov’s inequality we have that with constant probability. Consider now the number of close points counted , clearly and . Therefore, with constant probability we have which concludes the proof for the lower bound.
We proceed with the upper bound. Let be the random variable indicating the number of far points in included in , then . Due to requirement (ii) we have that . Thus, by using Markov’s inequality with constant probability. Combining these two bounds, we arrive at the desired result.
As the algorithm substitutes each -dimensional point with a number, the space to store these numbers reduces to . The query does not search for an ANN, but sums all the numbers stored in the counter at the indices , so the running time is .
Proof of Theorem 13.
As is data independent, on neighboring datasets the data structures differ only in the counters. We start by considering the truncated Laplace noise. Let be the counter from Algorithm 3 and be the differentially private version. The error due to differential privacy in the counts is , as the truncated Laplace mechanism adds bounded noise sampled from for some . Therefore, the expected error between and is at most
Thus, by Markov’s inequality we have with constant probability. The claim follows by Lemma 12 and . As the Truncated Laplace mechanism only needs to sample at most random variables, the additional factor in space and pre-processing time is .
Max Projection returns a -DP counter with constant access time, using space and pre-processing time , and with error (Corollary 8.3 [7]). The analysis then follows identically.
Proof of Theorem 14.
Let be the counter from Algorithm 3, its differentially private version, and be the set of indices of the buckets the algorithm needs to inspect; then and . The application of Laplace noise leads to an unbiased and uncorrelated estimator, so the variance of the error is
as , where and each noise term is sampled independently. Therefore, by Jensen's inequality , and then by Markov's inequality holds with constant probability. The claim follows by Lemma 12 and . The additional space and pre-processing time is necessary to store and sample i.i.d. Laplace random variables, one for each element of the partition .
B.2 Omitted Proofs in Section 5
Lemma 23.
The probability that a point collides during CloseTop-1 construction is at least .
For the proof of Lemma 23 we first need the following technical lemma.
Lemma 24.
Let be any integer greater than . Define and , then for , we have .
Proof.
Using Proposition 22 we may bound . For the left side of the interval, we first need to check if . We have that only if . But for any . Thus, by applying Proposition 22 we get
where the last inequality holds if . The right-hand side is greater than ; thus, it is sufficient to check . For , the left-hand side is always smaller than (its maximum is reached at ); thus, the inequality is satisfied. Putting these two bounds together, we conclude
The last inequality follows from .
Proof of Lemma 23.
If the probability that a random vector succeeds in the assignment is , then a point will not collide with probability . Then for (from Lemma 24) the probability of not colliding is at most .
Proof of Lemma 15.
The probability to not find a close point is
| (4) |
where in the first equality we used Lemma 4, in the second inequality we use the fact that by construction, in the third inequality for , and lastly . The probability to find a close point is the probability of the joint event that and that is stored in the data structure. Thus, by Lemma 23, we have that . This proves Lemma 7 for CloseTop-1. The probability to inspect a far point is
| (5) |
where in the first equality we used Lemma 4, while in the following inequality we used the fact that by construction. The analysis then follows the same steps as in Lemma 8. As the analysis of the expected number of buckets to inspect is the same, Lemma 8 holds under the same assumption.
Proof of Lemma 17.
Let us consider one data structure ; due to Equation 4, we have an upper bound of on the probability of not finding a close point in . By applying a union bound over the data structures we have that
where we used , and . We now study the probability to not store a point. Due to Lemma 23 the probability to not store a point in is , then by a union bound we have
as for , and for . Therefore, a close point is found with at least probability.
Proof of Lemma 18.
Consider one CloseTop-1 data structure with random vectors. Thus , so that the threshold may be written as , which is positive for . Therefore, by following the same computation of Lemma 8 (Equation 1), the expected number of buckets to inspect in is at most . By assumption, we have , thus . By tensorization of independent data structures, we conclude that the expected number of buckets is at most .
Analogously, starting from the computation in Lemma 15 (Equation 5) and substituting with , we may lower bound the threshold with , which is positive for . Thus, by following the computation in Lemma 8 (Equation 2), the probability to find a far point is at most . The probability to find a far point in all the independent data structures is at most
The last addend is still as and . Thus, as there are at most far points, the expected number of far points that are inspected is .
Appendix C Data Structures for the Euclidean Space
In this section, we prove that balanced TensorCloseTop-1 solves the -ANN in the Euclidean space and reproduces the results for -ANNC in [4]. Due to standard embedding techniques (see Lemma A.1 and Corollary A.1 in [4]), a -ANN in can be mapped into a -ANN in , in time , with , if . Thus, the embedding asymptotically preserves the metric only for small distances in , which can be obtained in the original space after an appropriate scaling. The relations between inner product similarity and Euclidean distance for small distances are
| (6) |
and the concatenation factor is in .
Theorem 25.
For any , constant , and a dataset in , there exists a data structure that solves with at least probability the -ANN using almost linear space , pre-processing time , and query time in expectation at most for .
Proof.
To apply TensorCloseTop-1 to the embedded dataset we need to satisfy the assumptions in Theorem 19, which are: (i) , (ii) , and (iii) . Requirement (i) is satisfied due to the asymptotic relations in Equation 6. More precisely, by substituting and we get
Requirements (ii) and (iii) are satisfied for any distance , due to the asymptotic relations in Equation 6. Therefore, for any , by setting and we can scale the dataset , apply the standard embedding techniques to get a dataset in with , and invoke TensorCloseTop-1 to solve the -ANN in by paying an asymptotically small factor. (Andoni et al. [4] set and ; our analysis demonstrates more clearly that there is a broader range of possible values.) As the mapping can be computed in time, the pre-processing time is . The space is and the query time is in expectation at most , given by the time to embed the query in the hyper-sphere plus the query time of the data structure .
The Unbalanced TensorCloseTop-1
Unbalanced TensorCloseTop-1 can be used to solve the Euclidean DP-ANNC problem as well. The proof is the same as that of Theorem 25, with the distinction that
The space and pre-processing time needed is .