Robust Proximity Search for Balls using Sublinear Space

Given a set of n disjoint balls b_1, …, b_n in IR^d, we provide a data structure, of near linear size, that can answer (1 ± ε)-approximate kth-nearest neighbor queries in O(log n + 1/ε^d) time, where k and ε are provided at query time. If k and ε are provided in advance, we provide a data structure to answer such queries that requires (roughly) O(n/k) space; that is, the data structure has a sublinear space requirement if k is sufficiently large.


Introduction
The nearest neighbor problem is a fundamental problem in Computer Science [SDI06, Cha08, AI08, Cla06]. Here, one is given a set of points P, and given a query point q one needs to output the nearest point in P to q. While there is a trivial O(n) algorithm for this problem, typically the set of data points is fixed, while different queries keep arriving. Thus, one can use preprocessing to facilitate a faster query. Applications of nearest neighbor search include pattern recognition [FH49, CH67], self-organizing maps [Koh01], information retrieval [SWY75], vector compression [GG91], computational statistics [DW82], clustering [DHS01], data mining, learning, and many others. If one is interested in guaranteed performance and near linear space, there is no known way to solve this problem efficiently (i.e., with logarithmic query time) for dimension d > 2 while using near linear space for the data-structure.
Approximate Nearest Neighbor (ANN). In light of the above, major effort has been devoted to developing approximation algorithms for nearest neighbor search [AMN+98, IM98, KOR00, SDI06, Cha08, AI08, Cla06, HIM12]. In the (1 + ε)-approximate nearest neighbor problem, one is additionally given an approximation parameter ε > 0, and one is required to find a point u ∈ P such that d(q, u) ≤ (1 + ε)d(q, P). In d-dimensional Euclidean space, one can answer ANN queries in O(log n + 1/ε^{d−1}) time using linear space [AMN+98, Har11]. Unfortunately, the constant hidden in the O notation is exponential in the dimension (and this is true for all bounds mentioned in this paper), and specifically because of the 1/ε^{d−1} term in the query time, this approach is efficient only in low dimensions. Interestingly, for this data-structure, the approximation parameter ε need not be specified during the construction, and one can provide it during the query. An alternative approach is to use Approximate Voronoi Diagrams (AVDs).

Our results.
In this paper, we first show how to preprocess the set of balls into a data structure requiring O(n) space, in O(n log n) time, so that given a query point q, a number 1 ≤ k ≤ n, and ε > 0, one can compute a (1 ± ε)-approximate kth closest ball in O(log n + 1/ε^d) time. If both k and ε are available during preprocessing, one can preprocess the balls into a (k, ε)-AVD, using O((n/(kε^d)) log(1/ε)) space, so that given a query point q, a (k, ε)-ANN ball can be computed in O(log(n/k) + log(1/ε)) time.

Problem definition and notation
Input. We are given a set of disjoint balls B = {b_1, …, b_n}, where b_i = b(c_i, r_i), for i = 1, …, n.
Here b(c, r) ⊆ IR^d denotes the (closed) ball with center c and radius r ≥ 0. Additionally, we are given an approximation parameter ε ∈ (0, 1). For a point q ∈ IR^d, the distance of q to a ball b = b(c, r) is d(q, b) = max(‖q − c‖ − r, 0).
Observation 2.1. For two balls b 1 ⊆ b 2 ⊆ IR d , and any point q ∈ IR d , we have d(q, b 1 ) ≥ d(q, b 2 ).
The kth-nearest neighbor distance of q to B, denoted by d_k(q, B), is the kth smallest number among d(q, b_1), …, d(q, b_n). Similarly, for a given set of points P, d_k(q, P) denotes the kth-nearest neighbor distance of q to P.
Problem definition. We aim to build a data-structure to answer (1 ± ε)-approximate kth-nearest neighbor (i.e., (k, ε)-ANN) queries, where for any query point q ∈ IR d one needs to output a ball b ∈ B such that, (1 − ε)d k (q, B) ≤ d(q, b) ≤ (1 + ε)d k (q, B). There are different variants depending on whether ε and k are provided with the query or in advance.
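To make the target quantity concrete, the following brute-force baseline (all names are hypothetical, not from the paper) computes d(q, b) = max(‖q − c‖ − r, 0) and the exact kth-nearest ball in O(n log n) time per query; the data structures developed in this paper approximate exactly this computation:

```python
import math

def dist_point_ball(q, c, r):
    """d(q, b) = max(||q - c|| - r, 0) for the closed ball b(c, r)."""
    return max(math.dist(q, c) - r, 0.0)

def kth_nearest_ball(q, balls, k):
    """Exact kth-nearest ball: returns (distance, ball).

    `balls` is a list of (center, radius) pairs; this is the O(n log n)
    per-query baseline that the paper's data structures approximate."""
    ds = sorted((dist_point_ball(q, c, r), (c, r)) for c, r in balls)
    return ds[k - 1]
```

A (k, ε)-ANN data structure may return any ball whose distance is within a (1 ± ε) factor of the distance returned here.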

Notations.
We use cube to denote a square (d = 2), a cube (d = 3), or a hypercube (d > 3), as the dimension might be.

Observation 2.2 (Lipschitz property). For any set of balls B, any 1 ≤ k ≤ n, and any two points q, p ∈ IR^d, we have d_k(q, B) ≤ d_k(p, B) + ‖q − p‖.
Assumption 2.3. We assume all the balls are contained inside the cube [1/2 − δ, 1/2 + δ]^d, where δ = ε/4; this can be ensured by translation and scaling (which preserve the order of distances). As such, we can ignore queries outside the unit cube [0, 1]^d, as any input ball is a valid answer in this case.
For a real positive number x and a point p = (p 1 , . . . , p d ) ∈ IR d , define G x (p) to be the grid point (⌊p 1 /x⌋ x, . . . , ⌊p d /x⌋ x). The number x is the width or sidelength of the grid G x . The mapping G x partitions IR d into cubes that are called grid cells.
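The grid snapping G_x is a one-liner; a minimal sketch (hypothetical helper name), returning the lower-left corner of the grid cell containing p:

```python
import math

def grid_id(p, x):
    """G_x(p): snap the point p to the corner (floor(p_1/x)*x, ..., floor(p_d/x)*x)
    of its grid cell of sidelength x."""
    return tuple(math.floor(pi / x) * x for pi in p)
```

Two points lie in the same cell of G_x exactly when they snap to the same corner.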
Definition 2.4. A cube is a canonical cube if it is contained inside the unit cube U = [0, 1]^d, it is a cell in a grid G_r, and r is a power of two (i.e., it might correspond to a node in a quadtree having [0, 1]^d as its root cell). We will refer to such a grid G_r as a canonical grid. Note that all the cells corresponding to nodes of a compressed quadtree are canonical.
Definition 2.5. Given a set b ⊆ IR^d and a parameter δ > 0, let G≈(b, δ) denote the set of canonical grid cells of sidelength 2^{⌊lg(δ · diam(b))⌋} that intersect b; there are O(1/δ^d) such cells.

Let B be a family of balls in IR^d. Given a set X ⊆ IR^d, let B∩(X) denote the set of all balls in B that intersect X.

(Our data-structure and algorithm work for the more general case where the balls are interior disjoint, where we define the interior of a "point ball", i.e., a ball of radius 0, as the point itself. This is not the usual topological definition.)
For two compact sets X, Y ⊆ IR^d, write X ⊑ Y if and only if diam(X) ≤ diam(Y). For a set X and a set of balls B, let B⊒(X) = { b ∈ B | b ∩ X ≠ ∅ and X ⊑ b }. Let c_d denote the maximum number of pairwise disjoint balls of radius at least r that may intersect a given ball of radius r in IR^d. It can be verified that 2 ≤ c_d ≤ 3^d for all d. Clearly, we have |B⊒(b)| ≤ c_d for any ball b.

Approximate range counting for balls
Data-structure 3.1. For a given set of balls B = {b_1, …, b_n} in IR^d, we build the following data-structure, which is useful in performing several of the tasks at hand. (A) Store balls in a (compressed) quadtree. For i = 1, …, n, let G_i = G≈(b_i, 1), and let G = ∪_{i=1}^{n} G_i denote the union of these cells. Let T be a compressed quadtree decomposition of [0, 1]^d, such that all the cells of G are cells of T. We preprocess T to answer point location queries. This takes O(n log n) time [Har11]. (B) Compute a list of "large" balls intersecting each cell. For each node u of T, there is a list of balls registered with it. Formally, register a ball b_i with all the cells of G_i. Clearly, each ball is registered with O(1) cells, and it is easy to see that each cell has O(1) balls registered with it, since the balls are disjoint.
Next, for a cell □ of T we compute a list storing B⊒(□), and these balls are associated with this cell. These lists are computed in a top-down manner. To this end, propagate from a node u its list B⊒(□_u) (which we assume is already computed) down to its children. A node receiving such a list scans it, and keeps only the balls that intersect its cell (adding to this list the balls already registered with the cell). For a node ν ∈ T, let B_ν be this list. (C) Build a compressed quadtree on the centers of the balls. Let C be the set of centers of the balls of B.
Build, in O(n log n) time, a compressed quadtree T_C storing C. (D) ANN for centers of balls. Build a data structure D for answering 2-approximate k-nearest neighbor distance queries on C, the set of centers of the balls (see [HK12]), where k and ε are provided with the query. The data structure D returns a point c ∈ C such that d_k(q, C) ≤ d(q, c) ≤ 2d_k(q, C). (E) Answering approximate range searching for the centers of balls.
Given a query ball b_q = b(q, x) and a parameter δ > 0, one can, using T_C, report (approximately), in O(log n + 1/δ^d) time, the points in b_q ∩ C. Specifically, the query process computes O(1/δ^d) sets of points, such that their union X has the property that b_q ∩ C ⊆ X ⊆ (1 + δ)b_q ∩ C, where (1 + δ)b_q is the scaling of b_q by a factor of 1 + δ around its center. Indeed, compute the set G≈(b_q, δ), and then using cell queries in T_C compute the corresponding cells (this takes O(log n) time). Now, descend the quadtree to all the cells of the right size that intersect b_q. Clearly, the union of the points stored in their subtrees is the desired set. This takes O(log n + 1/δ^d) time overall.
A similar data-structure for approximate range searching is provided by Arya and Mount [AM00], and our description above is provided for the sake of completeness. Overall, it takes O(n log n) time to build this data-structure.
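As a sanity check on step (A), the registration cells of a ball can be sketched as follows (hypothetical names; a brute-force enumeration over the ball's bounding box). It returns the canonical cells of sidelength comparable to the ball's diameter whose region meets the ball's bounding box, a constant-size superset of the cells of G≈(b, 1) that the ball actually intersects:

```python
import itertools, math

def canonical_cells_for_ball(c, r):
    """Canonical cells (sidelength = largest power of two <= diam(b)) whose
    region meets the axis-aligned bounding box of b(c, r); a superset of the
    cells intersecting the ball itself. There are at most 3^d of them.
    Assumes r > 0."""
    diam = 2 * r
    x = 2.0 ** math.floor(math.log2(diam))   # canonical sidelength
    lo = [math.floor((ci - r) / x) for ci in c]
    hi = [math.floor((ci + r) / x) for ci in c]
    cells = []
    for idx in itertools.product(*[range(l, h + 1) for l, h in zip(lo, hi)]):
        cells.append(tuple(i * x for i in idx))
    return cells
```

Since diam(b)/x < 2, each axis contributes at most three grid indices, so the list has O(1) size, matching the claim that each ball is registered with O(1) cells.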

Approximate range counting among balls
We need the ability to answer approximate range counting queries on a set of balls. Specifically, given a set of balls B and a query ball b, the target is to compute the size of the set B∩(b), i.e., the number of balls of B intersecting b. To make this query computationally fast, we allow an approximation. More precisely, for a ball b = b(c, r) and δ > 0, a (1 + δ)-ball of b is any ball b̃ with b ⊆ b̃ ⊆ b(c, (1 + δ)r). The purpose here, given a query ball b, is to compute the size of the set B∩(b̃) for some (1 + δ)-ball b̃ of b. It turns out that this is challenging, as the query ball can be approximated, but the balls in B remain the same; this is to prevent the undesired situation where a "giant" ball is a valid answer for any ANN query.

Some useful tools
Lemma 3.2. Given a compressed quadtree T of size n, a convex set X, and a parameter δ > 0, one can compute the set of nodes in T that realizes G≈(X, δ) (see Definition 2.5), in O(log n + 1/δ^d) time.
Specifically, this outputs a set X_N of nodes, of size O(1/δ^d), such that their cells intersect G≈(X, δ), and their parents' cells have diameter larger than δ · diam(X). Note that the cells in X_N might be significantly larger if they are leaves of T.
Proof: Let G≈ = G≈(X, 1) be the grid approximation to X. Using cell queries on the compressed quadtree, one can compute the cells of T that correspond to these canonical cells. Specifically, for each cube □ ∈ G≈, the query either returns a node for which □ is its cell, or it returns a compressed edge of the quadtree; that is, two cells (one the parent of the other) such that □ is contained in one of them and contains the other. Such a cell query takes O(log n) time [Har11]. This returns O(1) nodes of T such that their cells cover G≈. Now, traverse down the compressed quadtree starting from these nodes and collect all the nodes of the quadtree that are relevant. Clearly, one has to go at most O(log(1/δ)) levels down the quadtree to reach these nodes, and this takes O(1/δ^d) time overall.

Lemma 3.3. Let X be any convex set in IR^d, and let δ > 0 be a parameter. Using DS 3.1, one can compute, in O(log n + 1/δ^d) time, all the balls of B that intersect X and have diameter ≥ δ · diam(X).
Proof: We compute the cells of the quadtree realizing G≈(X, δ) using Lemma 3.2. Now, from each such cell (and its parent), we extract the list of large balls intersecting it (there are O(1/δ^d) such nodes, and the size of each such list is O(1)). Next, we check for each such ball whether it intersects X and whether its diameter is at least δ · diam(X). We return the list of all such balls.

Answering a query
Given a query ball b_q = b(q, x) and an approximation parameter δ > 0, our purpose is to compute a number N such that |B∩(b(q, x))| ≤ N ≤ |B∩(b(q, x(1 + δ)))|. The query algorithm works as follows: (A) Using Lemma 3.3, compute the set X of all the balls that intersect b_q and are of radius ≥ δx/4. (B) Compute the cells of G≈(b_q(1 + δ/4), δ/4), and let N′ be the total number of points of C stored in these nodes. (C) The quantity N′ + |X| is almost the desired quantity, except that we might be counting some of the balls of X twice. To this end, let N′′ be the number of balls in X with centers in the cells of G≈(b_q(1 + δ/4), δ/4), and return N = N′ + |X| − N′′.

Correctness. We only sketch the proof, as it is straightforward. Indeed, the union of the cells of G≈(b_q(1 + δ/4), δ/4) contains b(q, x(1 + δ/4)) and is contained in b(q, (1 + δ)x). All the balls with radius smaller than δx/4 that intersect b(q, x) have their centers in cells of G≈(b_q(1 + δ/4), δ/4), and their number is computed correctly. Similarly, the "large" balls are computed correctly. The last stage ensures we do not over-count by 1 each large ball that also has its center in a cell of G≈(b_q(1 + δ/4), δ/4). It is easy to check that |B∩(b(q, x))| ≤ N ≤ |B∩(b(q, x(1 + δ)))|, and the same argument applies at x/(1 + δ).
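Replacing the quadtree machinery with exact geometric tests, the three-step count N = N′ + |X| − N′′ can be sketched naively as follows (hypothetical names; O(n) per query instead of O(log n + 1/δ^d), but with the same sandwich guarantee):

```python
import math

def ball_intersects(c1, r1, c2, r2):
    """Two closed balls intersect iff their centers are within r1 + r2."""
    return math.dist(c1, c2) <= r1 + r2

def range_count(q, x, delta, balls):
    """Naive version of the three-step count N = N' + |X| - N''.
    Large balls (radius >= delta*x/4) intersecting b(q, x) are found
    exactly; small balls are counted via their centers inside
    b(q, x(1 + delta/4)), mimicking the grid G~(b_q(1+delta/4), delta/4)."""
    big = [(c, r) for c, r in balls
           if r >= delta * x / 4 and ball_intersects(q, x, c, r)]
    centers_in = [(c, r) for c, r in balls
                  if math.dist(q, c) <= x * (1 + delta / 4)]
    double = [b for b in big if b in centers_in]   # counted in both passes
    return len(centers_in) + len(big) - len(double)
```

Every small ball intersecting b(q, x) has its center within x(1 + δ/4) of q, and every counted ball intersects b(q, x(1 + δ)), so the returned N is sandwiched as claimed.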

Query running time. Computing all the cells of G≈(b_q(1 + δ/4), δ/4), and the number of centers of C stored in them, takes O(log n + 1/δ^d) time.
Computing the "large" balls takes O(log n + 1/δ^d) time. Checking for each large ball whether it is already counted by the "small" balls takes O(1/δ^d) time, by using a grid.
Result. The above implies the following.
Lemma 3.4 (rangeCount). Given a set B of n balls in IR^d, it can be preprocessed, in O(n log n) time, into a data structure of size O(n), such that given a query ball b(q, x) and an approximation parameter δ > 0, the query algorithm rangeCount(q, x, δ) returns, in O(log n + 1/δ^d) time, a number N satisfying |B∩(b(q, x/(1 + δ)))| ≤ N ≤ |B∩(b(q, x(1 + δ)))|.

Lemma 4.1. Let q ∈ IR^d be a query point, let 1 ≤ k ≤ n, and let r = d_k(q, B). Then at least max(0, k − c_d) of the k nearest balls of B to q have radius less than r, and their centers are contained in b(q, 2r).

Proof: Consider the k nearest neighbors of q from B. Any such ball that has its center outside b(q, 2r) has radius at least r, since it intersects b = b(q, r). Since the number of balls that are of radius at least r and intersect b is bounded by c_d, there must be at least max(0, k − c_d) balls among the k nearest neighbors, each having radius less than r. Now, b(q, 2r) contains the centers of all such balls.
Idea. The basic observation is that we only need a rough approximation to the right radius, as using approximate range counting (i.e., Lemma 3.4) one can then improve the approximation. Let x_i denote the distance of q to the ith closest center in C.
There are several possibilities. Let b_j denote the jth closest ball to q, for j = 1, …, k. If x_i is much larger than d_k = d_k(q, B) for some i ≤ k, it must be that b_i, …, b_k are much larger than b(q, d_k). But then the balls b_i, …, b_k must intersect b(q, x_i/2), and their radius is at least x_i/2. We can easily compute these big balls using DS 3.1 (B), and the number of centers of the small balls close to the query, and then compute d_k exactly.
Preprocessing. We build DS 3.1 in O(n log n) time.
Answering a query. We are given a query point q ∈ IR d and a number k.
Using DS 3.1(D), compute a 2-approximation for the radius of the smallest ball centered at q containing k − i centers of C, for i = 0, …, γ, where γ = min(k, c_d), and let r_{k−i} be this radius. That is, for i = 0, …, γ, we have d_{k−i}(q, C) ≤ r_{k−i} ≤ 2d_{k−i}(q, C). In what follows, let N(x) = |B∩(b(q, x))|, and let #(x) denote the number returned by rangeCount(q, x, δ), for an appropriate constant δ.

The algorithm is executed in the following steps. (A) If #(r_k) ≥ k, return r_k. (B) Otherwise, if #(r_{k−γ}/4) < k ≤ #(r_{k−γ}), return r_{k−γ}. (C) Otherwise, let α be the maximum index such that #(r_{k−α}) ≥ k. Compute all the balls of B that are of radius at least r_{k−α}/4 and intersect the ball b(q, r_{k−α}/4), together with the centers of the remaining balls near the query, and return 2ζ for the minimum number ζ such that #(ζ) ≥ k.

Proof: The data-structure and query algorithm are described above; we next prove correctness. To prove that (A) returns a correct answer, observe that under the given assumptions the returned value is within a constant factor of d_k(q, B), where the second inequality follows from Corollary 4.2, and the third inequality follows from the guarantee of Lemma 3.4. For (B), observe that N(r_{k−γ}/4) ≤ #(r_{k−γ}/4) < k, and as such r_{k−γ}/4 < d_k(q, B). But by assumption #(r_{k−γ}) ≥ k, and so d_k(q, B) ≤ 2r_{k−γ}. For (C), first observe that α < γ, as the algorithm did not return in (A). Since α is the maximum such index, r_{k−α−1} < d_k(q, B). Also, d_k(q, B) ≤ r_{k−α}/4, as the algorithm did not return in (B). Now, the ball b(q, r_{k−α−1}) contains at least k − α − 1 centers of C, but it does not contain k − α centers. Indeed, otherwise we would have d_{k−α}(q, C) ≤ r_{k−α−1}, and so r_{k−α} ≤ 2d_{k−α}(q, C) ≤ 2r_{k−α−1}; but on the other hand, r_{k−α−1} < d_k(q, B) ≤ r_{k−α}/4, which would be a contradiction. Similarly, there is no center of any ball whose distance from q is in the range (r_{k−α−1}, r_{k−α}/2), as otherwise we would have d_{k−α}(q, C) < r_{k−α}/2, and this would mean that r_{k−α} ≤ 2d_{k−α}(q, C) < r_{k−α}, a contradiction. Now, the center of the kth closest ball is clearly more than r_{k−α−1} away from q. As such, its distance from q is at least r_{k−α}/2. Since d_k(q, B) ≤ r_{k−α}/4, it follows that the kth closest ball intersects b(q, r_{k−α}/4), and moreover its radius is at least r_{k−α}/4. Since we compute all such balls in (C), we do encounter the kth closest ball. It is easy to see that in this case we return a number ζ satisfying ζ/2 ≤ d_k(q, B) ≤ 2ζ.
As for the running time, notice that we need to use the algorithm of Lemma 3.4 O(1) times, each invocation taking O(log n) time. After this, we need another O(log n) time for the invocation of the algorithm of Lemma 3.3. As such, the total query time is O(log n).

Refining the approximation of d_k(q, B)

Lemma 4.4. Given a set B of n balls in IR^d, it can be preprocessed, in O(n log n) time, into a data structure of size O(n), such that given a query point q, numbers k and x, and an approximation parameter ε > 0, where x is a constant-factor approximation to d_k(q, B) (say, d_k(q, B) ≤ x ≤ 4d_k(q, B)), one can compute, in O(log n + 1/ε^d) time, a ball b ∈ B with (1 − ε)d_k(q, B) ≤ d(q, b) ≤ (1 + ε)d_k(q, B).

Proof: We use the same data-structure as Lemma 3.4, with the query ball b_q = b(q, 4x(1 + ε)). We compute all large balls of B that intersect b_q. Here, a large ball is a ball of radius > xε, and a ball of radius at most xε is considered a small ball. Consider the O(1/ε^d) grid cells of G≈(b_q, ε/16).
In O(1/ε^d) time we can record the number of centers of large balls inside any such cell. Clearly, any small ball that intersects b(q, 4x) has its center in some cell of G≈(b_q, ε/16). We use the quadtree T_C to find the exact number of centers, N_□, of small balls in each cell □ of G≈(b_q, ε/16), by finding the total number of centers in □ using T_C and decreasing it by the count of centers of large balls in that cell. This can be done in O(log n + 1/ε^d) time. We pick an arbitrary point in □, assign it weight N_□, and treat it as representing all the small balls in this grid cell; clearly, this introduces an error of at most εx in the distance of such a ball from q, and as such we can ignore it in our argument. At the end of this snapping process, we have O(1/ε^d) weighted points and O(1/ε^d) large balls, and we know the distance of the query point from each of these points/balls. This results in O(1/ε^d) weighted distances, and we want the smallest ℓ such that the total weight of the distances ≤ ℓ is at least k. This can be done by weighted median selection, in time linear in the number of distances, which is O(1/ε^d). Once we get the required point, we can output any ball b corresponding to it. Clearly, b satisfies the required conditions.
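The final selection step of the proof can be sketched as follows (hypothetical name); the paper uses linear-time weighted median selection, while this sketch simply sorts the O(1/ε^d) weighted distances:

```python
def kth_weighted_distance(weighted_dists, k):
    """Smallest distance l such that the total weight of the distances <= l
    is at least k. `weighted_dists` is a list of (distance, weight) pairs.
    Sorting version, O(m log m); the paper achieves O(m) via weighted
    median selection."""
    total = 0
    for d, w in sorted(weighted_dists):
        total += w
        if total >= k:
            return d
    raise ValueError("total weight is smaller than k")
```

Since there are only O(1/ε^d) distances, the extra logarithmic factor of sorting is immaterial next to the O(log n + 1/ε^d) query bound.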

The result
Theorem 4.5. Given a set of n disjoint balls B in IR^d, one can preprocess them, in O(n log n) time, into a data structure of size O(n), such that given a query point q ∈ IR^d, a number k with 1 ≤ k ≤ n, and ε > 0, one can compute, in O(log n + 1/ε^d) time, a ball b ∈ B such that (1 − ε)d_k(q, B) ≤ d(q, b) ≤ (1 + ε)d_k(q, B).

Quorum clustering
We are given a set B of n disjoint balls in IR^d, and we describe how to compute a quorum clustering for them quickly.

Let ξ ≥ 1 be some constant, and let R_1 = B. In the ith iteration, let Λ_i = b(u_i, ψ_i) be a ball whose radius is at most ξ times the radius of the smallest ball that intersects k balls of B and completely contains k − c_d balls of R_i. Next, we remove any k − c_d balls that are contained in Λ_i from R_i, to get the set R_{i+1}. We repeat this process till all balls are extracted. Notice that at each step i we only require that Λ_i intersects k balls of B (and not of R_i), but that it must contain k − c_d balls from R_i. Also, the last quorum ball may contain fewer balls. The balls Λ_1, …, Λ_m are the resulting ξ-approximate quorum clustering.
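A naive sketch of this greedy process (hypothetical names; brute force, and restricting candidate centers to input ball centers, which costs only a constant factor in the cluster radius, cf. Lemma 5.4). For brevity it drops the distinction between intersecting k balls of B and containing a group of the remaining balls, and simply covers groups of remaining balls:

```python
import math

def quorum_clusters(balls, group):
    """Greedy quorum clustering sketch: repeatedly pick a small ball fully
    covering `group` of the remaining balls, then remove them.
    `balls` is a list of (center, radius) pairs; returns cluster balls."""
    remaining = list(balls)
    clusters = []
    while remaining:
        g = min(group, len(remaining))   # last cluster may be smaller
        best = None
        for c, _ in remaining:
            # radius needed so that b(c, rad) fully contains its g
            # "cheapest" remaining balls (containment cost = dist + radius)
            rads = sorted(math.dist(c, cj) + rj for cj, rj in remaining)
            rad = rads[g - 1]
            if best is None or rad < best[0]:
                best = (rad, c)
        rad, c = best
        covered = [(cj, rj) for cj, rj in remaining
                   if math.dist(c, cj) + rj <= rad]
        clusters.append((c, rad))
        for b in covered[:g]:            # extract g covered balls
            remaining.remove(b)
    return clusters
```

On two well-separated pairs of balls with group size 2, the sketch produces one cluster per pair, as expected.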

Computing an approximate quorum clustering
Definition 5.1. For a set P of n points in IR d , and an integer ℓ, with 1 ≤ ℓ ≤ n, let r opt (P, ℓ) denote the radius of the smallest ball which contains at least ℓ points from P, i.e., r opt (P, ℓ) = min q∈IR d d ℓ (q, P).
Similarly, for a set R of n balls in IR d , and an integer ℓ, with 1 ≤ ℓ ≤ n, let R opt (R, ℓ) denote the radius of the smallest ball which completely contains at least ℓ balls from R.
The algorithm. The algorithm to construct an approximate quorum clustering is as follows. We use the algorithm of Lemma 5.2 on the set of centers C, obtaining balls o_1 = b(u_1, ψ_1), …, o_m = b(u_m, ψ_m), where o_i contains k − c_d of the centers still available at the ith iteration, and ψ_i is at most twice the radius of the smallest such ball.

Lemma 5.3. Let b = b(c, r) be a ball containing ℓ ≥ 2 centers c_1, …, c_ℓ of C. Then the ball b(c, 3r) completely contains the balls b_1, …, b_ℓ corresponding to these centers.

Proof: Consider any index i with 1 ≤ i ≤ ℓ, and any j ≠ i, which exists as ℓ ≥ 2 by assumption. Since b(c, r) contains both c_i and c_j, we have 2r ≥ ‖c_i − c_j‖ by the triangle inequality. On the other hand, as the balls b_i and b_j are disjoint, we have ‖c_i − c_j‖ ≥ r_i + r_j ≥ r_i. It follows that r_i ≤ 2r for all 1 ≤ i ≤ ℓ. As such, the ball b(c, 3r) must contain the entire ball b_i, and thus it contains all the ℓ balls b_1, …, b_ℓ corresponding to the centers.

Lemma 5.4. Let B = {b_1 = b(c_1, r_1), …, b_n = b(c_n, r_n)} be a set of n disjoint balls in IR^d. Let C = {c_1, …, c_n} be the corresponding set of centers, and let ℓ be an integer with 2 ≤ ℓ ≤ n. Then, r_opt(C, ℓ) ≤ R_opt(B, ℓ) ≤ 3r_opt(C, ℓ).

Proof: The first inequality follows since the ball realizing the optimal covering of ℓ balls clearly contains their centers as well, and therefore ℓ points from C. To see the second inequality, consider the ball b = b(c, r) realizing r_opt(C, ℓ), and apply Lemma 5.3 to it. This implies R_opt(B, ℓ) ≤ 3r_opt(C, ℓ).

The set C_i, for i = 1, …, m, is the set of centers assigned to the ith quorum cluster. That is, C_1, …, C_m form a disjoint decomposition of C, each of size k − c_d (except for the last set, which might be smaller; a technicality that we ignore for the sake of simplicity of exposition).
For i = 1, …, m, let B_i denote the set of balls corresponding to the centers in C_i. While constructing the approximate quorum clusters we are going to assign the set of balls B_{π(i)}, for i = 1, …, m, to Λ_i. Now, fix an index i with 1 ≤ i ≤ m − 1. The balls of ∪_{j=1}^{i} B_{π(j)} have been used up. Consider an optimal ball, i.e., a ball b = b(c, r) that completely contains k − c_d balls among ∪_{j=i+1}^{m} B_{π(j)}, intersects k balls of B, and is the smallest such ball possible. Fix some k − c_d balls from ∪_{j=i+1}^{m} B_{π(j)} that this optimal ball contains, and consider the set C′ of the centers of these balls. The quorum clusters o_{π(j)}, for j = i + 1, …, m, contain all these centers, by construction. Out of these indices, i.e., out of {π(i + 1), …, π(m)}, suppose p is the minimum index such that o_p contains one of these centers. When o_p was constructed, i.e., at the pth iteration of the algorithm of Lemma 5.2, all the centers of C′ were available. Now, since the optimal ball b = b(c, r) contains k − c_d available centers too, it follows that ψ_p ≤ 2r, as Lemma 5.2 guarantees this. Since k − c_d ≥ 2, by Lemma 5.3, b(u_p, 3ψ_p) contains the balls of B_p. Moreover, by the Lipschitz property (see Observation 2.2), it follows that d_k(u_p, B) ≤ d_k(c, B) + ‖u_p − c‖ ≤ r + (r + ψ_p) ≤ 4r, where the second-to-last inequality follows as the ball b = b(c, r) and the ball o_p = b(u_p, ψ_p) intersect. Therefore, for the index p we have that d_k(u_p, B) ≤ 2γ_p ≤ 3d_k(u_p, B) ≤ 12r, and also that 3ψ_p ≤ 6r. As such, ζ_p = max(2γ_p, 3ψ_p) ≤ 12r. The index π(i + 1) minimizes this quantity among the indices {π(i + 1), …, π(m)} (as we took the sorted order); as such, it follows that ζ_{i+1} ≤ 12r.

Proof: The correctness was proved in Lemma 5.5. The time bound is also easy, as the computation time is dominated by the time of Lemma 5.2, which is O(n log n).

Construction of the sublinear data-structure for (k, ε)-ANN
Here we show how to compute an approximate Voronoi diagram for approximating the kth-nearest ball that takes O(n/k) space. We assume k > 2c_d, without loss of generality, and let m = ⌈n/(k − c_d)⌉ = O(n/k). Here, k and ε are prespecified.

Preliminaries
The following notation was introduced in [HK12]. A ball b of radius r in IR^d, centered at a point c, can be interpreted as a point in IR^{d+1}, denoted by b′ = (c, r). For a regular point p ∈ IR^d, its corresponding image under this transformation is the mapped point p′ = (p, 0) ∈ IR^{d+1}; i.e., we view it as a ball of radius 0 and use the mapping defined on balls. Given a point u = (u_1, …, u_d) ∈ IR^d, we denote its Euclidean norm by ‖u‖. We consider a point u = (u_1, u_2, …, u_{d+1}) ∈ IR^{d+1} to be in the product metric of IR^d × IR, endowed with the product metric norm ‖u‖_⊕ = ‖(u_1, …, u_d)‖ + |u_{d+1}|. It can be verified that the above defines a norm, and for any u ∈ IR^{d+1} we have ‖u‖ ≤ ‖u‖_⊕ ≤ √2 ‖u‖.
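A quick numeric check of the product norm and the stated inequality ‖u‖ ≤ ‖u‖_⊕ ≤ √2‖u‖ (the explicit formula used here for ‖·‖_⊕ is an assumed reconstruction, consistent with that inequality):

```python
import math

def norm_plus(u):
    """||u||_+ : Euclidean norm of the first d coordinates plus the absolute
    value of the (d+1)st coordinate (assumed form of the product metric norm)."""
    return math.hypot(*u[:-1]) + abs(u[-1])
```

The two-sided bound is the standard comparison of the 1-norm and 2-norm of the pair (‖(u_1, …, u_d)‖, |u_{d+1}|), so it holds with equality on the coordinate axes (lower bound) and on the diagonal (upper bound).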

Construction
The input is a set B of n disjoint balls in IR d , and parameters k and ε.
The construction of the data-structure is similar to the construction of the kth-nearest neighbor data-structure from the authors' paper [HK12]. We compute, using Lemma 5.6, a ξ-approximate quorum clustering Λ_1 = b(w_1, x_1), …, Λ_m = b(w_m, x_m) of B. The algorithm then continues as follows: (A) Compute an exponential grid around each quorum cluster. Specifically, let I be the set of grid cells covering the quorum clusters and their immediate environs, where ζ_1 is a sufficiently large constant (say, ζ_1 = 256ξ). (B) Intuitively, I takes care of the region of space immediately next to a quorum cluster. For the other regions of space, we can apply a construction of an approximate Voronoi diagram for the centers of the clusters (the details are somewhat more involved). To this end, lift the quorum clusters into points in IR^{d+1}: let Σ′ = {Λ′_1, …, Λ′_m}, where Λ′_i = (w_i, x_i). Note that all points of Σ′ belong to U′ = [0, 1]^{d+1}, by Assumption 2.3. Now build a (1 + ε/8)-AVD for Σ′ using the algorithm of Arya and Malamatos [AM02], for distances specified by the ‖·‖_⊕ norm. The AVD construction provides a list of canonical cubes covering [0, 1]^{d+1}, such that for the smallest cube containing the query point, the associated point of Σ′ is a (1 + ε/8)-ANN to the query point. (Note that these cubes are not necessarily disjoint. In particular, the smallest cube containing the query point q is the one that determines the assigned approximate nearest neighbor of q.) Clip this collection of cubes to the hyperplane x_{d+1} = 0 (i.e., throw away cubes that do not have a face on this hyperplane). For a cube □ in this collection, denote by nn′(□) the point of Σ′ assigned to it. Let S be the resulting set of canonical d-dimensional cubes. (C) Let W be the space decomposition resulting from overlaying the two collections of cubes, i.e., I and S. Formally, we compute a compressed quadtree T that has all the canonical cubes of I and S as nodes, and W is the resulting decomposition of space into cells.
One can overlay two compressed quadtrees representing the two sets in linear time [BHTT10, Har11]. Here, a cell associated with a leaf is a canonical cube, and a cell associated with a compressed node is the set difference of two canonical cubes. Each node in this compressed quadtree contains two pointers: to the smallest cube of I, and to the smallest cube of S, that contains it. This information can be computed by doing a BFS on the tree. For each cell □ ∈ W we store the following.
(I) An arbitrary representative point rep_□ ∈ □. (II) The point nn′(□) ∈ Σ′ that is associated with the smallest cell of S that contains this cell. We also store an arbitrary ball b(□) ∈ B that is one of the balls completely inside the cluster specified by nn′(□); we assume we stored such a ball inside each quorum cluster when it was computed. (III) A number β_k(rep_□) that is a (1 ± ε/4)-approximation to d_k(rep_□, B), and a ball nn_k(rep_□) ∈ B that realizes this distance. To compute β_k(rep_□) and nn_k(rep_□), we use the data-structure of Section 4, see Theorem 4.5.

Answering a query
Given a query point q, compute the leaf cell (equivalently, the smallest cell) of W that contains q, by performing a point-location query in the compressed quadtree T. Let □ be this cell, and let λ* = ‖q′ − nn′(□)‖_⊕. If diam(□) ≤ (ε/8)λ*, we return nn_k(rep_□) as the approximate kth-nearest neighbor; else, we return b(□).
Proof : This follows by the Lipschitz property, see Observation 2.2.
Proof: For the point rep_□, by Observation 2.2, we have d_k(rep_□, B) ≤ d_k(q, B) + diam(□). Therefore, the ball nn_k(rep_□) satisfies d(q, nn_k(rep_□)) ≤ (1 + ε)d_k(q, B). Similarly, using the Lipschitz property, we can argue that d(q, nn_k(rep_□)) ≥ (1 − ε)d_k(q, B). Therefore, (1 − ε)d_k(q, B) ≤ d(q, nn_k(rep_□)) ≤ (1 + ε)d_k(q, B), and the required guarantees are satisfied.

Lemma 6.3. For any point q ∈ IR^d there is a quorum ball Λ_i = b(w_i, x_i), such that (A) Λ_i intersects b(q, d_k(q, B)), (B) x_i ≤ 3ξ d_k(q, B), and (C) ‖q − w_i‖ ≤ 4ξ d_k(q, B).
Proof: By assumption, k > 2c_d, and so by Lemma 4.1, among the k nearest neighbors of q there are k − c_d balls of radius at most d_k(q, B). Let B′ denote the set of these balls. Among the indices 1, …, m, let i be the minimum index such that one of these k − c_d balls is completely covered by the quorum cluster Λ_i = b(w_i, x_i). Since b(q, d_k(q, B)) intersects this ball while Λ_i completely contains it, clearly Λ_i intersects b(q, d_k(q, B)). Now consider the time Λ_i was constructed, i.e., the ith iteration of the quorum clustering algorithm. At this time, by assumption, all of B′ was available, i.e., none of its balls were assigned to earlier quorum clusters. The ball b(q, 3d_k(q, B)) contains k − c_d unused balls and touches k balls of B; as such, the smallest such ball had radius at most 3d_k(q, B). By the guarantee on the quorum clustering, x_i ≤ 3ξ d_k(q, B). As for the last part, as the balls b(q, d_k(q, B)) and Λ_i = b(w_i, x_i) intersect and x_i ≤ 3ξ d_k(q, B), we have by the triangle inequality that ‖q − w_i‖ ≤ (1 + 3ξ)d_k(q, B) ≤ 4ξ d_k(q, B), as ξ ≥ 1.
Definition 6.4. For a given query point, any quorum cluster that satisfies the conditions of Lemma 6.3 is defined to be an anchor cluster. By Lemma 6.3 an anchor cluster always exists.
Lemma 6.5. Suppose that among the quorum cluster balls Λ_1, …, Λ_m there is some anchor cluster Λ_i = b(w_i, x_i) (see Definition 6.4); then the output of the algorithm is correct.
Thus, by construction, the expanded environ of the quorum cluster b(w_i, x_i) contains the query point, see Eq. (6.1). Let j be the smallest non-negative integer such that 2^j x_i ≥ d(q, w_i). We have that 2^j x_i ≤ max(x_i, 2d(q, w_i)). As such, if □ is the smallest cell of W containing the query point q, then by Eq. (6.1), diam(□) ≤ (ε/8)d_k(q, B), if ζ_1 ≥ 256ξ. Now, by Lemma 6.1 we have that λ* ≥ d_k(q, B), so diam(□) ≤ (ε/8)λ*. Therefore, the algorithm returns nn_k(rep_□) as the (1 ± ε)-approximate kth-nearest neighbor, and by Lemma 6.2 this is a correct answer.

The result
Theorem 6.7. Given a set B of n disjoint balls in IR^d, a number k with 1 ≤ k ≤ n, and ε ∈ (0, 1), one can preprocess B in O(n log n + (n/k) C_ε log n + (n/k) C′_ε) time, where C_ε = O(ε^{−d} log ε^{−1}) and C′_ε = O(ε^{−2d} log ε^{−1}). The space used by the data-structure is O(C_ε n/k). Given a query point q, this data-structure outputs a ball b ∈ B, in O(log(n/(kε))) time, such that (1 − ε)d_k(q, B) ≤ d(q, b) ≤ (1 + ε)d_k(q, B).
Proof: If k ≤ 2c_d, then Theorem 4.5 provides the desired result. For k > 2c_d, the correctness was proved in Lemma 6.6. We only need to bound the construction time and space, as well as the query time. Computing the quorum clustering takes O(n log n) time, by Lemma 5.6. Observe that |I| = O((n/(kε^d)) log(1/ε)). From the construction of Arya and Malamatos [AM02], we have |S| = O((n/(kε^d)) log(1/ε)) (note that since we clip the construction to a hyperplane, we get 1/ε^d in the bound and not 1/ε^{d+1}). Overlaying the two compressed quadtrees representing them takes time linear in their size, that is, O(|I| + |S|).
The most expensive step is to perform the (1 ± ε/4)-approximate kth-nearest neighbor query for each cell in the resulting decomposition W, see Eq. (6.2) (i.e., computing β_k(rep_□) for each cell □ ∈ W). Using the data-structure of Section 4 (see Theorem 4.5), each query takes O(log n + 1/ε^d) time.
Overall, this takes O(n log n + |W|(log n + 1/ε^d)) = O(n log n + (n/(kε^d)) log(1/ε) log n + (n/(kε^{2d})) log(1/ε)) time, and this bounds the overall construction time.
The query algorithm is a point-location query followed by an O(1) time computation, and takes O(log(n/(kε))) time. Note that the space decomposition generated by Theorem 6.7 can be interpreted as a space decomposition of complexity O(C_ε n/k), where every cell has two input balls associated with it, which are the candidates to be the desired (k, ε)-ANN. That is, Theorem 6.7 computes a (k, ε)-AVD of the input balls.

Conclusions
In this paper, we presented a generalization of the usual (1 ± ε)-approximate kth-nearest neighbor problem in IR^d, where the input is a set of balls of arbitrary radii, while the query is a point. We first presented a data structure that takes O(n) space, with query time O(log n + ε^{−d}); here, both k and ε can be supplied at query time. Next, we presented a (k, ε)-AVD taking O(n/k) space, thus showing, surprisingly, that the problem can be solved with sublinear space if k is sufficiently large.