
Maximizing the Optimality Streak of Deferred Data Structuring (a.k.a. Database Cracking)

Yufei Tao, The Chinese University of Hong Kong, China
Abstract

This paper studies how to minimize the total cost of answering r queries over n elements in an online manner (i.e., the next query is given only after the previous query’s result is ready) when the value r ≤ n is unknown in advance. Traditional indexing, which first builds a complete index on the n elements before answering queries, may be unsuitable because the index’s construction time – usually Ω(n log n) – can become the performance bottleneck. In contrast, for many problems, a lower bound of Ω(n log(1+r)) holds on the total cost of r queries for every r ∈ [1, n]. Matching this lower bound is a primary objective of deferred data structuring (DDS), also known as database cracking in the system community. For a wide class of problems, we present generic reductions to convert traditional indexes into DDS algorithms that match the lower bound for a long range of r. For a decomposable problem, if a data structure can be built in O(n log n) time and has Q(n) query search time, our reduction yields an algorithm that runs in O(n log(1+r)) time for all r ≤ n·log n/Q(n), where the upper bound n·log n/Q(n) is asymptotically the best possible under mild constraints. In particular, if Q(n) = O(log n), then the O(n log(1+r))-time guarantee extends to all r ≤ n, with which we optimally settle a large variety of DDS problems. Our results can be generalized to a class of “spectrum indexable problems”, which subsumes the class of decomposable problems.

Keywords and phrases:
Deferred Data Structuring, Database Cracking, Data Structures
Copyright and License:
© Yufei Tao; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Data structures and algorithms for data management
Funding:
This work was partially supported by GRF projects 4203421 and 14222822 from HKRGC.
Editors:
Sudeepa Roy and Ahmet Kara

1 Introduction

Traditional indexing first creates an index on the whole dataset before starting to answer queries. For example, after building a binary search tree (BST) on an (unsorted) set S of n real values in O(n log n) time, we can use the tree to answer each predecessor query (given a search value q, a predecessor query returns the largest element in S that does not exceed q) in O(log n) time. This paradigm, however, falls short when the dataset S will only be searched a small number of times. In the extreme, if only one query needs to be answered, the best approach is to scan S in full, which requires only O(n) time. More generally, if we are to answer r queries in an online manner – that is, the next query is given after the result of the previous query has been output – it is possible to pay a total cost of O(n log(1+r)) [19]. When r ≪ n (e.g., r = polylog n or 2^{√log n}), the cost O(n log(1+r)) breaks the barrier of Ω(n log n), suggesting that sorting is unnecessary.

Situations like the above occur in many big-data applications where large volumes of data are collected but only sparingly queried. The phenomenon has motivated a line of research under the name database cracking in the system community; see [12, 13, 16, 17, 18, 23, 29, 31, 32] and the references therein. In these environments, the problem size n is so huge that even sorting is considered expensive and should be avoided as much as possible. Furthermore, it is impossible to predict how many queries – namely, the integer r mentioned earlier – will need to be supported. Instead of constructing a complete index right away, database cracking carries out only a calculated portion of the construction during each query. If r eventually reaches a certain threshold, the index will be created in full, but it is more likely that r will stop at some value significantly lower than that threshold. The challenge of database cracking is to ensure “smoothness”: as r grows, the total cost of all the r queries so far ought to increase as slowly as possible.

In the theory field, the data-structure problems at the core of database cracking were already studied in 1988 by Karp, Motwani, and Raghavan [19] under the name deferred data structuring (DDS). For a selection of problems, they explained how to answer r ∈ [1, n] queries on n elements with a total cost of O(n log(1+r)), without knowing r in advance. For every r ∈ [1, n], they proved a matching lower bound of Ω(n log(1+r)) on those problems. Through reductions, the same lower bound has been shown to hold on many other DDS problems.

This work explores the following topic: how to design a generic reduction that, given a (conventional) data structure, can turn it automatically into an efficient DDS algorithm? Such reductions may significantly simplify the design of DDS algorithms and shed new light on the intricate connections between traditional indexing and DDS.

Math Conventions.

We use ℤ+ to denote the set of positive integers. For any integer x ≥ 1, let [x] represent the set {1, 2, …, x}. Every logarithm has base 2 by default. Define Log(x) = log(1+x) for any x ≥ 1.

1.1 Problem Definitions

This subsection will formalize the problems to be investigated. Let S, called the dataset, be a set of n elements drawn from a domain 𝔻. Let ℙ be a (possibly infinite) set whose elements are called predicates. Given a predicate q ∈ ℙ, a query issued on S returns an answer, represented as Ans_q(S). We consider that, for any q, the answer Ans_q(S) can be represented using O(1) words and can be computed in O(n) time (which essentially means that the query can be answered by a brute-force algorithm such as an exhaustive scan).

Deferred Data Structuring (DDS).

We now formalize the DDS problem. Initially, an algorithm 𝒜 is provided with the dataset S in an array, where the elements are arbitrarily permuted. An adversary first chooses a predicate q1 for the first query and solicits the answer Ans_{q1}(S) from 𝒜. Iteratively, for each i ≥ 2, after the answer of the (i−1)-th query has been obtained, the adversary either decides to terminate the whole process, or chooses the predicate qi for the next query and solicits Ans_{qi}(S) from 𝒜. The adversary is permitted to observe the execution of 𝒜 and is thus capable of selecting a “bad” qi for 𝒜.

The algorithm 𝒜 is said to guarantee running time Time(n, r) for t queries if, for every r ≤ t, the first r queries are processed with a total cost of at most Time(n, r). We will concentrate on t ≤ n because this is the scenario important for database cracking.

Streak of Optimality.

While the above setup is “standard” for DDS, as far as database cracking is concerned, it makes sense to carry out the investigation from a fresh perspective. As database cracking is useful mainly in scenarios where the dataset receives relatively few queries, it is imperative to design DDS algorithms that are particularly efficient when the number of queries is small.

To formalize the notion of “particularly efficient”, we leverage the fact that, for many DDS problems (including the ones considered in this work), there exist hardness barriers dictating Time(n, r) = Ω(n·Log r) for every r ∈ [1, n] under a relevant computation model. Motivated by this, we say that an algorithm 𝒜 guarantees a log(r)-streak of LogStreak(n) if its running time satisfies Time(n, r) = O(n·Log r) for all r ≤ LogStreak(n).

The worst log(r)-streak guarantee is LogStreak(n) = O(1) – this is trivial because any query can be answered in O(n) time. Ideally, we would like to have LogStreak(n) = n, but this is not always possible, as argued later in the paper. For practical use, however, it would suffice for an algorithm to ensure LogStreak(n) = Ω(n^ϵ) for some constant ϵ > 0 because Ω(n^ϵ) queries are perhaps already too many for database cracking to be appealing.

Classes of Problems.

The above definition framework can be specialized into various problem instances that differ in the data domain 𝔻 or the predicate domain ℙ. Next, we introduce two problem classes relevant to our discussion:

  • A problem instance is decomposable if, for any disjoint sets S1, S2 ⊆ 𝔻 and any predicate q, it is possible to derive Ans_q(S1 ∪ S2) from Ans_q(S1) and Ans_q(S2) in constant time.

  • We define a problem instance to be (B(n), Q(n)) spectrum indexable if the following property holds on every dataset S ⊆ 𝔻: for every integer s ∈ [|S|], it is possible to construct a data structure on S in O(|S|·B(s)) time that can answer any query in O((|S|/s)·Q(s)) time. The term “spectrum indexable” is chosen to reflect the ability to build a “good” index structure – as far as the functions B(n) and Q(n) are concerned – for the whole spectrum of the parameter s.

Two observations are important about the above definitions:

  • (B(n), Q(n)) spectrum indexability implies that we can build a data structure on any dataset S ⊆ 𝔻 in O(|S|·B(|S|)) time to answer a query in O(Q(|S|)) time (for this purpose, simply set s = |S|).

  • Consider any decomposable problem instance with the following property: for any dataset S ⊆ 𝔻, we can build a data structure 𝒯 in O(|S|·B(|S|)) time to answer a query in O(Q(|S|)) time. Then, the problem instance must be (B(n), Q(n)) spectrum indexable. To see why, given an integer s ∈ [|S|], divide S arbitrarily into m = ⌈|S|/s⌉ disjoint subsets S1, S2, …, Sm, where all subsets have size s except possibly Sm. For each i ∈ [m], create a structure 𝒯(Si) in O(|Si|·B(s)) time; the total time to create all the m structures is O(m·s·B(s)) = O(|S|·B(s)). To answer a query q, simply search every 𝒯(Si) to obtain Ans_q(Si) in O(Q(s)) time and then combine Ans_q(S1), Ans_q(S2), …, Ans_q(Sm) into Ans_q(S) using O(m) time. The total query time is therefore O(m·Q(s)), as the sketch below illustrates.
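To make the second observation concrete, the following is a minimal Python sketch of the partition-based reduction. The primitives build, query, and combine are assumptions standing in for the problem-specific structure 𝒯, its search procedure, and the constant-time answer combiner; none of them is defined in this paper.

```python
def build_spectrum(S, s, build):
    """Partition S into ceil(|S|/s) runs of size <= s and index each run.
    Total cost: m runs, each built in O(s * B(s)) time, i.e., O(|S| * B(s))."""
    return [build(S[i:i + s]) for i in range(0, len(S), s)]

def answer(structures, q, query, combine):
    """Search every run's structure and fuse the partial answers.
    Cost: O(Q(s)) per run plus O(1) per combine, i.e., O(m * Q(s)) overall."""
    ans = None
    for T in structures:
        part = query(T, q)
        ans = part if ans is None else combine(ans, part)
    return ans
```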

1.2 Related Work

Motwani and Raghavan introduced the concept of DDS in a conference paper [24], which they later extended, together with Karp, into a journal article [19]. The article [19] presented algorithms for the following DDS problems:

  • Predecessor search, where S consists of n real values, and each query is given an arbitrary value q and returns the predecessor of q in S.

  • Halfplane containment, where S consists of n halfplanes in ℝ², and each query is given an arbitrary point q ∈ ℝ² and returns whether q is covered by all the halfplanes in S.

  • Convex hull containment, where S consists of n points in ℝ², and each query is given an arbitrary point q ∈ ℝ² and returns whether q is covered by the convex hull of S.

  • 2D linear programming, where S consists of n halfplanes in ℝ², and each query is given a 2D vector u and returns the point p in the intersection of all the n halfplanes that maximizes the dot product u·p.

  • Orthogonal range counting, where S consists of n points in ℝ^d with the dimensionality d being a fixed constant, and each query is given an arbitrary d-rectangle q – namely, an axis-parallel box of the form [x1, y1] × [x2, y2] × ⋯ × [xd, yd] – and returns the number of points of S covered by q.

For the first four problems, Karp, Motwani, and Raghavan presented algorithms achieving Time(n, r) = O(n·Log r) for all r ≤ n. For orthogonal range counting, they presented two algorithms, the first of which ensures Time(n, r) = O(n·Log^d r) for all r ≤ n, whereas the other ensures Time(n, r) = O(n·log n + n·Log^{d−1} r) for all r ≤ n. For all these problems, they proved that, under the comparison model and/or the algebraic model, the running time of any algorithm must satisfy Time(n, r) = Ω(n·Log r) for every r ∈ [1, n].

Aggarwal and Raghavan [1] presented a DDS algorithm with Time(n, r) = O(n·Log r) for all r ≤ n for nearest neighbor search, where S consists of n points in ℝ², and a query is given an arbitrary point q ∈ ℝ² and returns the point in S closest to q. This running time is optimal under the algebraic model for all r ≤ n.

A “success story” can be told about DDS on range median. In the problem’s offline version, we are given a set S of n real values in an (unsorted) array A. For any 1 ≤ x ≤ y ≤ n, let A[x : y] represent the set of elements {A[x], A[x+1], …, A[y]}. In addition, we are given r integer pairs (x1, y1), (x2, y2), …, (xr, yr) such that 1 ≤ xi ≤ yi ≤ n for each i ∈ [r]. The goal is to find the median of the set A[xi : yi] for all i ∈ [r]. In [15], Har-Peled and Muthukrishnan explained how to solve the problem in O(n·Log r + r·log n·Log r) time and proved a lower bound of Ω(n·Log r) under the comparison model for all r ≤ n. In [11] (see also [5]), Gfeller and Sanders considered the problem’s DDS version, where S is as defined earlier, and a query is given an arbitrary pair (x, y) with 1 ≤ x ≤ y ≤ n and returns the median of A[x : y]. They designed an algorithm achieving Time(n, r) = O(n·Log r) for all r ≤ n. It is easy to see that any DDS algorithm can also be utilized to solve the offline problem in Time(n, r) time. Hence, the algorithm of Gfeller and Sanders improves the offline solution of [15] and is optimal (for DDS) under the comparison model.

More remotely related to our work is [8], where Ching, Mehlhorn, and Smid considered a dynamic DDS problem in which, besides queries, an algorithm is also required to support updates on the dataset S. In another direction, Barbay et al. [2] studied the DDS version of predecessor search but analyzed their algorithm using more fine-grained parameters called “gaps” (rather than using only n and r). This direction has also been extended to dynamic DDS; see the recent works [26, 27].

As mentioned, DDS has been extensively studied in the system community under the name database cracking. The focus there is to engineer efficient heuristics to accelerate query workloads encountered in various practical scenarios, rather than establishing strong theoretical guarantees. Interested readers may refer to the representative works of [12, 13, 16, 17, 18, 23, 31, 32] as entry points into the literature.

1.3 Our Results

Our main result is a generic reduction with the following guarantee:

Theorem 1.

Suppose that B(n) and Q(n) are non-decreasing functions such that B(n) = O(log n) and Q(n) = O(n^{1−ϵ}), where ϵ > 0 is an arbitrarily small constant. Every (B(n), Q(n)) spectrum indexable problem admits a DDS algorithm with

LogStreak(n) = min{n, c·n·log n/Q(n)}   (1)

for an arbitrarily large constant c; namely, the algorithm achieves Time(n, r) = O(n·Log r) for all r ≤ min{n, c·n·log n/Q(n)}.

Under mild constraints, the streak bound in (1) is the best possible, as we argue in Section 3. As an important special case, when Q(n) = O(log n), the DDS algorithm produced by our reduction achieves LogStreak(n) = n, namely, Time(n, r) = O(n·Log r) for all r ≤ n. For many decomposable problems of high importance to database systems, data structures with O(n log n) construction time and O(log n) search time are already known (e.g., predecessor search and nearest neighbor search). As such problems must be (Log n, Log n) spectrum indexable (as explained in Section 1.1), Theorem 1 immediately produces excellent solutions to the DDS versions of all those problems, which are usually optimal under the comparison/algebraic model. A selection of those problems will be given in Section 4.1.

Theorem 1 has a pleasant message for database cracking: even data structures with slow query time can be useful for cracking! For example, for orthogonal range counting (defined in Section 1.2) in 2D space, the kd-tree, which takes O(n log n) time to build, answers a query in Q(n) = O(√n) time. Theorem 1 shows that the structure can be utilized to answer any r = O(√n·log n) queries in O(n·Log r) time, which is probably more than enough for database cracking in reality. This nicely manifests our motivation (stated in Section 1.1) for studying “DDS algorithms particularly efficient for small r”.

Theorem 1 has an instructive implication for the design of DDS algorithms – we should study how spectrum indexable the underlying problem really is. Exploration in this direction can get fairly interesting, as we will demonstrate in Section 4.2 for halfplane containment, convex hull containment, range median, and 2D linear programming (see Section 1.2 for their definitions). We can prove that each of those problems is (Log n, Q(n)) spectrum indexable for an appropriately selected function Q(n), and then leverage Theorem 1 to obtain an algorithm solving it with Time(n, r) = O(n·Log r) for all r ≤ n.

What happens to structures that take ω(n log n) time to build, or equivalently, have B(n) = ω(log n)? Section 5 will show that, under mild constraints, no generic reduction can use such a structure to obtain a DDS algorithm with LogStreak(n) = ω(1). In other words, these algorithms can achieve Time(n, r) = O(n·Log r) only in the trivial scenario where r = O(1). Nevertheless, if one accepts algorithms with guarantees of the form “Time(n, r) = O(n·Log^{O(1)} r) for all r ≤ n”, our reduction underlying Theorem 1 can be extended to produce such algorithms as long as B(n) and Q(n) are Log^{O(1)} n, as will be discussed in Section 5.

2 Warm Up: Predecessor Search

Predecessor search is the problem that has received the most attention from the DDS and database-cracking communities. This section will review two approaches developed in [19] for achieving Time(n, r) = O(n·Log r) on this problem. These approaches form the basis of nearly all existing DDS algorithms.

Bottom-Up.

Recall that the dataset S consists of n real values. We assume, w.l.o.g., that n is a power of 2. At all times, the set S is arbitrarily partitioned into disjoint subsets – referred to as runs – each having the same size s = 2^i for some i ≥ 0. Every run is sorted and stored in an array. Initially, s = 1, i.e., a run contains an individual element of S. Over time, the run size s increases monotonically. Whenever s needs to go from 2^i to 2^j for some value j > i, an overhaul is carried out to build the new runs. As a run of size 2^j can be obtained by merging 2^{j−i} runs of size 2^i in O(2^j·(j−i)) time, the overhaul can be completed in O(n·(j−i)) time. Therefore, if the current run size is s, the overall cost of producing all the runs in history is O(n·Log s).

A (predecessor) query is answered simply by performing binary search on every run. To control the cost, however, the algorithm makes sure that the current run size s is at least i·Log i before processing the i-th (i ≥ 1) query. If the condition is not met, an overhaul is invoked to increase s to the nearest power of 2 at least i·Log i. After that, the query entails a cost of O((n/s)·Log s) = O((n/(i·Log i))·log(i·Log i)) = O(n/i). If we add this up for all r queries, the sum becomes O(n·Σ_{i=1}^{r} 1/i) = O(n·Log r). As the final run size s is O(r·Log r), we can conclude that the algorithm processes r queries in O(n·Log r) time. Note that this holds only if r·Log r ≤ n (the maximum run size is n). However, when r has reached n/log n, the algorithm can afford to sort the entire S in O(n log n) = O(n·log r) time and answer every subsequent query in O(log n) = O(log r) time. Thus, the algorithm achieves Time(n, r) = O(n·Log r) for all r ≤ n; see the sketch below.
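The bottom-up approach can be sketched in runnable form as follows. The sketch enforces the schedule above (run size at least i·Log i before the i-th query) but, for brevity, omits the final switch to full sorting once r reaches n/log n.

```python
from bisect import bisect_right
from math import log2

class BottomUpPredecessor:
    """Bottom-up deferred predecessor search: S is kept partitioned into
    sorted runs of equal size, which are merged pairwise during overhauls."""

    def __init__(self, S):
        self.runs = [[x] for x in S]   # initial run size s = 1
        self.i = 0                     # number of queries answered so far

    def _overhaul(self, target):
        # Double the run size until it reaches the least power of 2 >= target
        # (or until a single run holds all of S).
        while len(self.runs) > 1 and len(self.runs[0]) < target:
            merged = [self._merge(self.runs[j], self.runs[j + 1])
                      for j in range(0, len(self.runs) - 1, 2)]
            if len(self.runs) % 2 == 1:    # cannot happen if n is a power of 2
                merged.append(self.runs[-1])
            self.runs = merged

    @staticmethod
    def _merge(a, b):                      # standard linear-time merge
        out, x, y = [], 0, 0
        while x < len(a) and y < len(b):
            if a[x] <= b[y]:
                out.append(a[x]); x += 1
            else:
                out.append(b[y]); y += 1
        return out + a[x:] + b[y:]

    def predecessor(self, q):
        self.i += 1
        self._overhaul(self.i * log2(1 + self.i))   # ensure s >= i * Log(i)
        best = None
        for run in self.runs:                       # binary search in each run
            k = bisect_right(run, q)
            if k > 0 and (best is None or run[k - 1] > best):
                best = run[k - 1]
        return best   # None if q is smaller than every element of S
```

For example, BottomUpPredecessor([5, 1, 9, 3]).predecessor(4) returns 3.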

Top-Down.

This approach mimics the following strategy for building a binary search tree (BST) 𝒯 on S: (i) find the median of S and split S into S1 and S2 at the median; (ii) store the median as the root’s key, and then build the root’s left (resp., right) subtree recursively on S1 (resp., S2). Rather than doing a full construction outright, the algorithm builds 𝒯 on an “as-needed” basis during query processing.

In the beginning, only the root exists and it is put in the unexpanded mode. In general, an unexpanded node u has no children yet, but is associated with the subset Su ⊆ S of elements that ought to be stored in its subtree. A query is answered in the same manner as in a normal BST, by traversing a root-to-leaf path π of 𝒯. The main difference is that, as the search comes to an unexpanded node u on π, the algorithm must expand it first. If |Su| ≥ 2, expanding u means creating two child nodes for u, finding the median of Su (the key of u), dividing Su at the median into two parts, and associating each part with a child. After that, u becomes expanded, with its children put in the unexpanded mode. If, on the other hand, |Su| = 1, expanding u simply means making u a leaf of 𝒯 and marking u as expanded. In any case, the expansion takes O(|Su|) time (finding the median of a set takes linear time [4]).

After r queries, the BST 𝒯 is only partially built because only the nodes on the r root-to-leaf paths traversed during query processing are expanded. The nodes at the first Log r levels (the level of a node is the number of edges on the path from the node to the root) can incur a total expansion cost of O(n·Log r). For each root-to-leaf path π, the node u at level Log r has expansion cost O(n/r), which dominates the total expansion cost of the descendants of u on π. Therefore, other than the nodes at the first Log r levels, all the other nodes in 𝒯 have an expansion cost of r·O(n/r) = O(n) in total. The algorithm therefore achieves Time(n, r) = O(n·Log r) for all r ≤ n; a code sketch follows.
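The lazy expansion can be sketched as follows. The sketch assumes distinct values and borrows Python’s statistics.median_low for brevity, whereas the analysis above requires a true linear-time selection algorithm [4].

```python
from statistics import median_low

class LazyBST:
    """Top-down deferred BST: a node is expanded only when a query's
    root-to-leaf search touches it for the first time."""

    class Node:
        __slots__ = ("elems", "key", "left", "right")
        def __init__(self, elems):
            self.elems = elems                  # the pending subset S_u
            self.key = self.left = self.right = None

    def __init__(self, S):
        self.root = self.Node(list(S))          # only the root exists at first

    def _expand(self, u):
        if u.key is not None:
            return                              # already expanded
        m = median_low(u.elems)                 # O(|S_u|) with true selection
        u.key = m
        smaller = [x for x in u.elems if x < m]
        larger = [x for x in u.elems if x > m]
        u.left = self.Node(smaller) if smaller else None
        u.right = self.Node(larger) if larger else None
        u.elems = None                          # u is now expanded

    def predecessor(self, q):
        u, best = self.root, None
        while u is not None:
            self._expand(u)                     # expansion upon first touch
            if u.key <= q:
                best, u = u.key, u.right
            else:
                u = u.left
        return best
```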

3 Reductions from Data Structures to DDS Algorithms

Section 3.1 will extend the bottom-up approach (reviewed in the previous section) into a generic reduction, which can yield DDS algorithms with large log(r)-streak bounds, provided that a crucial requirement – linear mergeability (to be defined shortly) – is met. Although this reduction will be superseded by our final reduction (presented in Section 3.2) behind Theorem 1, its discussion (i) deepens the reader’s understanding of the approach’s power and limitations, and (ii) elucidates the necessity of new ideas in the absence of linear mergeability. Finally, Section 3.3 will establish a hardness result showing that the streak bound in Theorem 1 cannot be improved significantly.

3.1 The First Reduction: Generalizing the Bottom-Up Approach

This subsection will focus on decomposable problems. For any dataset S ⊆ 𝔻, we assume the existence of a data structure 𝒯(S) that can answer any query in O(Q(n)) time. Furthermore, the structure is linearly mergeable, namely, for any disjoint S1, S2 ⊆ 𝔻, the structure on S1 ∪ S2 can be constructed from 𝒯(S1) and 𝒯(S2) in O(|S1| + |S2|) time. Note that this implies that 𝒯(S) can be built in O(n log n) time.

Lemma 2.

For a decomposable problem on which there is a linearly mergeable structure with query time Q(n) = O(n^{1−ϵ}), where ϵ > 0 is a constant, we can design a DDS algorithm that guarantees LogStreak(n) = min{n, c·n·log n/Q(n)} for an arbitrarily large constant c.

Proof.

We assume, w.l.o.g., that n is a power of 2. As with the bottom-up approach, at any moment, we divide S arbitrarily into runs of the same size s = 2^i for some i ≥ 0. For each run, build a structure on the elements therein. The initial run size s is 1. Every time s grows from 2^i to 2^j for some value j > i, an overhaul is performed to construct the runs of size 2^j. By linear mergeability, we can build the structure of a size-2^j run by merging the structures of 2^{j−i} runs of size 2^i in O(2^j·(j−i)) time. By an analysis similar to the one in Section 2, if the current run size is s, the overall cost of producing the structures for all the runs that ever appeared in history is O(n·Log s).

A query with predicate q is answered by searching the structure of every run, and then combining the answers from all the runs into Ans_q(S). The query cost is O((n/s)·Q(s)). We require that, before answering the i-th (i ≥ 1) query, the run size s must satisfy

Q(s)/s ≤ 1/i.   (2)

If this requirement is not met, we launch an overhaul to increase s to the least power of 2 fulfilling (2). This ensures that the i-th query is answered with a cost of O(n/i). Hence, the total cost of processing r queries is O(n·Log r).

What remains is to bound the cost of the overhauls. The final run size s is the least power of 2 satisfying

  • s ≤ n (the run size cannot exceed n) and

  • s/Q(s) ≥ r (because of (2)).

Rather than solving for s precisely, we instead aim to find an upper bound for it. As Q(s) = O(s^{1−ϵ}), we know Q(s) ≤ α·s^{1−ϵ} for some constant α > 0. Hence, s/Q(s) ≥ s^ϵ/α. Let ŝ be the least power of 2 satisfying

ŝ ≤ n and ŝ^ϵ/α ≥ r.

Since the condition ŝ^ϵ/α ≥ r implies ŝ/Q(ŝ) ≥ r, it holds that s ≤ ŝ.

The condition ŝ^ϵ/α ≥ r can be rewritten as ŝ ≥ (αr)^{1/ϵ}. Therefore, if (αr)^{1/ϵ} ≤ n/2, then ŝ is simply the least power of 2 at least (αr)^{1/ϵ}. This means ŝ = O(r^{1/ϵ}) and thus O(Log s) = O(Log ŝ) = O(Log r), in which case all the overhauls require O(n·Log s) = O(n·Log r) time overall.

The above argument does not work if (αr)^{1/ϵ} > n/2. However, in this case, r > n^ϵ/(2^ϵ·α), that is, r is already polynomial in n. This motivates the following brute-force strategy. When r reaches n^ϵ/(2^ϵ·α) – a moment we call the snapping point – we simply create a structure on the whole S in O(n log n) = O(n·log r) time, and use it to answer every subsequent query in O(Q(n)) time until r = min{n, c·n·log n/Q(n)}. All the queries after the snapping point have a total cost of O(n log n) = O(n·log r). We thus have obtained an algorithm that guarantees Time(n, r) = O(n·Log r) for all r ≤ min{n, c·n·log n/Q(n)}.

The above reduction crucially relies on the fact that the structure 𝒯 is linearly mergeable. Otherwise, the overhaul for creating size-2^i runs would cost Ω(n·log(2^i)) = Ω(n·i), in which case all the overhauls would end up with a total cost of Ω(n·Log² r). Next, we will present another reduction that shaves off this extra Log r factor. Before that, a Python sketch of the reduction just proved is given below.
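The following Python sketch captures the reduction just proved. Every primitive (leaf, merge, query, combine, and the function Q) is an assumption supplied by the underlying decomposable problem; the snapping-point fallback and the handling of non-power-of-2 n are omitted for brevity.

```python
class MergeableDDS:
    """Lemma 2 reduction: runs of equal size, each carrying a linearly
    mergeable structure; overhauls merge structures pairwise."""

    def __init__(self, S, leaf, merge, query, combine, Q):
        self.structs = [leaf(x) for x in S]     # run size s = 1
        self.s, self.i = 1, 0
        self.merge, self.query = merge, query
        self.combine, self.Q = combine, Q

    def _overhaul(self):
        # Grow s until Q(s)/s <= 1/i, i.e., s >= i * Q(s) -- inequality (2).
        while len(self.structs) > 1 and self.s < self.i * self.Q(self.s):
            self.structs = [self.merge(self.structs[j], self.structs[j + 1])
                            for j in range(0, len(self.structs) - 1, 2)]
            self.s *= 2                         # assumes an even number of runs

    def answer(self, q):
        self.i += 1
        self._overhaul()
        ans = None
        for T in self.structs:                  # search every run's structure
            part = self.query(T, q)
            ans = part if ans is None else self.combine(ans, part)
        return ans
```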

3.2 The Second Reduction: No Linear Mergeability

We will now drop the linear mergeability requirement of Section 3.1 and establish Theorem 1. Recall that the underlying problem is (B(n), Q(n)) spectrum indexable with B(n) = O(log n) and Q(n) = O(n^{1−ϵ}) for some constant ϵ > 0. The goal is to design an algorithm with Time(n, r) = O(n·Log r) for all r ≤ min{n, c·n·log n/Q(n)}, where c > 0 can be any constant.

Assume, w.l.o.g., that n is a power of 2. Our algorithm executes in epochs. At the beginning of the i-th (i ≥ 1) epoch, we set

s = 2^{2^i}.

For now, let us consider that s does not exceed n, a condition that will be guaranteed, as discussed later. The epoch starts by creating a structure 𝒯 – the structure promised by (B(n), Q(n)) spectrum indexability – on S in O(n·B(s)) time. The structure allows us to answer any query in O((n/s)·Q(s)) time. The i-th epoch finishes after s/Q(s) queries have been answered during the epoch (in practice, our algorithm may be slightly improved by making this number s·B(s)/Q(s), but this will not be necessary for proving Theorem 1). These queries demand a total cost of

(s/Q(s)) · O((n/s)·Q(s)) = O(n).

It is clear from the above that the total computation time of the i-th epoch is O(n·B(s)) = O(n·2^i). As n·2^i doubles when i increases by 1, the overall cost of answering r queries is O(n·2^h), where h is the number of epochs needed. Precisely, the value of h is the smallest integer x ≥ 1 satisfying two conditions:

  • C1: 2^{2^x} ≤ n (the value of s must not exceed n);

  • C2: Σ_{i=1}^{x} 2^{2^i}/Q(2^{2^i}) ≥ r (the number of queries that can be answered in x epochs must be at least r).

Rather than solving for h precisely, we aim to find an upper bound for it. Define ĥ to be the smallest integer x ≥ 1 satisfying

  • C1′: 2^{2^x} ≤ n (same as C1);

  • C2′: 2^{2^x}/Q(2^{2^x}) ≥ r.

It is clear that h ≤ ĥ.

Still, we do not solve for ĥ precisely but will instead find an upper bound for it. For this purpose, we leverage the fact that Q(n) = O(n^{1−ϵ}), which means Q(n) ≤ α·n^{1−ϵ} for some constant α > 0. Define H to be the smallest integer x ≥ 1 satisfying

  • C1″: 2^{2^x} ≤ n (same as C1 and C1′);

  • C2″: 2^{2^x}/(α·(2^{2^x})^{1−ϵ}) = (2^{2^x})^ϵ/α ≥ r, or equivalently, 2^{2^x} ≥ (αr)^{1/ϵ}.

Because condition C2″ implies condition C2′, we must have h ≤ ĥ ≤ H.

Therefore, if (αr)^{2/ϵ} ≤ √n (namely, r ≤ n^{ϵ/4}/α), the value of H is simply the least integer x ≥ 1 satisfying 2^{2^x} ≥ (αr)^{1/ϵ}. This means

2^{2^{H−1}} < (αr)^{1/ϵ}  ⟹  2^{2^H} < (αr)^{2/ϵ}   (3)
⟹  2^H = O(Log r).

In this case, all epochs incur a total cost of O(n·2^h) = O(n·2^H), which is O(n·Log r) by (3).

The above argument does not work if (αr)^{2/ϵ} > √n. However, when this happens, we must have r > n^{ϵ/4}/α. As soon as r reaches n^{ϵ/4}/α – the snapping point – we create a structure 𝒯 on the whole S in O(n log n) = O(n·log r) time, and use it to answer every subsequent query in O(Q(n)) time until r = min{n, c·n·log n/Q(n)}. The queries after the snapping point require a total cost of O(n log n) = O(n·log r). We thus have obtained an algorithm that guarantees Time(n, r) = O(n·Log r) for all r ≤ min{n, c·n·log n/Q(n)}, completing the proof of Theorem 1. A code sketch of the epoch strategy appears below.
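The epoch strategy admits the following hedged Python sketch. Here build(S, s) and query(T, q) are assumptions representing exactly what (B(n), Q(n)) spectrum indexability promises – O(n·B(s)) construction and O((n/s)·Q(s)) query time – and the snapping-point analysis is simplified to capping s at n.

```python
class EpochDDS:
    """Theorem 1 reduction: epoch i builds a fresh structure with parameter
    s = 2^(2^i) and serves s/Q(s) queries before moving on."""

    def __init__(self, S, build, query, Q):
        self.S, self.build, self.query, self.Q = S, build, query, Q
        self.i = 0        # epoch counter
        self.left = 0     # queries remaining in the current epoch
        self.T = None

    def answer(self, q):
        if self.left == 0:                       # start the next epoch
            self.i += 1
            n = len(self.S)
            exp = 2 ** self.i                    # s = 2^(2^i), capped at n
            s = n if exp >= n.bit_length() else min(2 ** exp, n)
            self.T = self.build(self.S, s)       # O(n * B(s)) construction
            if s == n:                           # full structure: reuse forever
                self.left = float("inf")
            else:                                # epoch lasts s/Q(s) queries
                self.left = max(1, s // max(1, self.Q(s)))
        self.left -= 1
        return self.query(self.T, q)             # O((n/s) * Q(s)) per query
```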

3.3 Tightness of the Streak Bound

This section will explain why the streak bound Ω(n·log n/Q(n)) in Theorem 1 is asymptotically the best possible for reduction algorithms with “reasonable” behavior.

Black-box Reductions on Decomposable Problems with Restricted Structures.

For proving hardness results, we can specialize the problem class at will, and we do so by considering only decomposable problems. A reduction algorithm 𝒜 is required to work on any decomposable problem that is (B(n), Q(n)) spectrum indexable. As explained in Section 1.1, this implies a data structure 𝒯 that can be built on any S ⊆ 𝔻 in O(|S|·B(|S|)) time and can answer any query on S in O(Q(|S|)) time.

As long as (B(n),Q(n)) spectrum indexability is retained, we can restrict the functionality of 𝒯 to make it hard for 𝒜. Specifically, 𝒯 provides only the following “services” to 𝒜:

  • 𝒜 can create a structure 𝒯(S′) on any subset S′ ⊆ S;

  • given a predicate q, 𝒜 can use 𝒯(S′) to find the answer Ans_q(S′) on S′;

  • given a predicate q, after 𝒜 has already obtained Ans_q(S1) and Ans_q(S2) for disjoint subsets S1, S2 ⊆ S, it can combine the answers into Ans_q(S1 ∪ S2) in constant time (the combining algorithm is provided by 𝒯).

Despite its limited functionalities, 𝒯 still makes the problem (B(n),Q(n)) spectrum indexable because the problem is decomposable; see the explanation in Section 1.1.

So far, no restriction has been imposed on the behavior of 𝒜, but we are ready to do so now. To answer a query, the algorithm 𝒜 needs to search the structures on a number (possibly zero) of subsets of S, say, 𝒯(S1), 𝒯(S2), …, 𝒯(St). The algorithm can choose any t ≥ 0 and any S1, S2, …, St (they do not need to be disjoint). In addition, 𝒜 is also allowed to examine another subset S𝑠𝑐𝑎𝑛 ⊆ S by paying Ω(|S𝑠𝑐𝑎𝑛|) time. With all these, the algorithm must ensure

S𝑠𝑐𝑎𝑛 ∪ S1 ∪ S2 ∪ ⋯ ∪ St = S.   (4)

The above constraint is natural because, otherwise, at least one element of S is absent from S𝑠𝑐𝑎𝑛 ∪ S1 ∪ S2 ∪ ⋯ ∪ St. If the algorithm 𝒜 “dares” to return the query answer anyway, it must have acquired certain special properties of the underlying problem. In this work, we are interested in generic reductions that are unaware of problem-specific properties.

We will refer to reduction algorithms obeying the above requirements as black-box reductions. Our algorithms in Lemma 2 and Theorem 1 belong to the black-box class.

Tightness of Theorem 1.

We will prove that any black-box reduction can answer only r = O(n·log n/Q(n)) queries in O(n·Log r) time for any function Q(n): ℤ+ → ℤ+ that

  • satisfies Q(n) = O(n) and Q(n) = Θ(Q(cn)) for any constant c > 0;

  • is sub-additive, i.e., Q(x+y) ≤ Q(x) + Q(y) holds for all x, y ≥ 1.

This will confirm the tightness of Theorem 1 on the black-box class.

We will contrive a decomposable problem and an accompanying data structure, both of which make no sense in reality but are nonetheless mathematically sound. The dataset S consists of n arbitrary elements; given any predicate, a query on S always returns |S| (the concrete forms of the elements and predicates are irrelevant). Whenever asked to “build” a data structure on S, we purposely waste |S|·log|S| time and then simply output an arbitrary permutation of S in an array. Whenever asked to “answer” a query, we purposely waste Q(|S|) time and then return |S|. The problem is clearly decomposable.

We argue that any black-box reduction algorithm 𝒜 needs Ω(Q(n)) time to answer every query. Consider an arbitrary query and suppose that 𝒜 processes it by searching the structures 𝒯(S1), …, 𝒯(St) for some t ≥ 0 and scanning S𝑠𝑐𝑎𝑛. By the design of our structure, the query cost is at least

Ω(|S𝑠𝑐𝑎𝑛|) + Σ_{i∈[t]} Q(|Si|)
  ≥ Ω(|S𝑠𝑐𝑎𝑛|) + Q(Σ_{i∈[t]} |Si|)      (by sub-additivity)
  = Ω(Q(|S𝑠𝑐𝑎𝑛|) + Q(Σ_{i∈[t]} |Si|))    (by Q(n) = O(n) and Q(n) = Θ(Q(cn)))
  = Ω(Q(|S𝑠𝑐𝑎𝑛| + Σ_{i∈[t]} |Si|))       (by sub-additivity)
  = Ω(Q(n)).                              (by (4))

By the definition of log(r)-streak, algorithm 𝒜 must process r = LogStreak(n) queries within a total cost of O(n·Log r), which obviously cannot exceed O(n·log n) (remember that r ≤ n). It thus follows that 𝒜 can process only O(n·log n/Q(n)) queries.

4 New DDS Algorithms for Concrete Problems

We now deploy Theorem 1 to develop algorithms for concrete DDS problems, focusing on decomposable problems in Section 4.1 and non-decomposable problems in Section 4.2.

4.1 Applications to Decomposable Problems

As mentioned, if a decomposable problem has a data structure 𝒯 that can be built in O(n log n) time (i.e., B(n) = O(log n)) and supports a query in Q(n) = O(log n) time, Theorem 1 directly yields a DDS algorithm with Time(n, r) = O(n·Log r) for all r ≤ n. We enumerate below a partial list of such problems with importance to database systems, for which no algorithms with the same guarantee were known previously.

  • 2D Orthogonal range counting. See Section 1.2 for the problem definition. The structure 𝒯 can be a persistent “aggregate” binary search tree (BST) [30].

  • Orthogonal range counting on rectangles. The dataset S is a set of n 2-rectangles (i.e., axis-parallel boxes) in ℝ². Given an arbitrary 2-rectangle q, a query returns how many rectangles of S intersect q. The problem can be reduced to four queries of the previous problem (orthogonal range counting on points) [33]. 𝒯 can once again be a persistent aggregate BST [30].

  • Point location. The dataset S is a planar subdivision of ℝ² defined by n line segments, where each segment is associated with the ids of the two faces incident to it. Given an arbitrary point q ∈ ℝ², a query returns the face of the subdivision containing q, which boils down to finding the segment immediately above q (i.e., the first segment “hit” by a ray shooting upwards from q). The structure 𝒯 can be a persistent BST [28] or Kirkpatrick’s structure [20].

  • k = O(1) nearest neighbor search in 2D. The dataset S is a set of n points in ℝ². Fix a constant integer k ≥ 1. Given a point q ∈ ℝ², a query returns the k points in S closest to q. The structure 𝒯 can be a point location structure [20, 28] built on an order-k Voronoi diagram (an order-k Voronoi diagram can be computed in O(n log n) time [6]). For k = 1, a DDS algorithm with Time(n, r) = O(n·Log r) for all r ≤ n has been found [1]. However, the algorithm of [1] heavily relies on the ability to merge two (order-1) Voronoi diagrams in linear time and, thus, cannot be extended to higher values of k easily.

  • Approximate nearest neighbor search in metric space. The dataset S consists of n objects in a metric space with a constant doubling dimension (this encapsulates any Euclidean space of constant dimensionality). Let dist(o1, o2) represent the distance between two objects o1 and o2 in the space. Given an arbitrary object q in the space, a query returns an object o ∈ S such that dist(o, q) ≤ (1+ϵ)·dist(o′, q) for all o′ ∈ S, where ϵ is a constant. A structure 𝒯 fulfilling our purpose can be found in [14, 22].

For orthogonal range counting in ℝ^d, where d ≥ 3 is a fixed constant (as defined in Section 1.2), one can apply Theorem 1 to achieve a somewhat unusual result. It is possible [3] to build a structure 𝒯 in O(n log n) time that answers a query in Q(n) = O(n^ϵ) time, where the constant ϵ > 0 can be made arbitrarily small. Theorem 1 thus produces a DDS algorithm with Time(n, r) = O(n·Log r) for all r ≤ n^{1−ϵ}. As ϵ can be arbitrarily close to 0, the log(r)-streak bound LogStreak(n) = n^{1−ϵ} is lower than the maximum value n only by a factor sub-polynomial in n. A remark is in order about the significance if this gap could be closed for all constant dimensionalities. If a DDS algorithm with LogStreak(n) = n could be discovered, then the algorithm would also settle the following offline version in O(n log n) time: we are given a set P of n points in ℝ^d and a set Q of n d-rectangles; the goal is to report, for each d-rectangle q ∈ Q, how many points of P are covered by q. This offline problem has been extensively studied, and yet the fastest algorithm still runs in n·log^{Θ(d)} n time to our knowledge.

4.2 Non-Decomposable but Spectrum-Indexable Problems

This subsection will utilize Theorem 1 to deal with problems that are not decomposable (at least not obviously). The key is to show that the problem is (Log n, Q(n)) spectrum indexable for a suitable Q(n). This is an interesting topic in its own right, as we demonstrate next by developing new DDS algorithms with Time(n, r) = O(n·Log r) for all r ≤ n for halfplane containment, convex hull containment, range median, and 2D linear programming (see Section 1.2 for their definitions). The original algorithms [19, 11] for these problems were all designed using the top-down approach reviewed in Section 2. Our algorithms present a contrast that illustrates how Theorem 1 can facilitate the design of DDS algorithms.

Halfplane Containment.

The problem can be converted [19] to the following equivalent form by resorting to geometry duality [7]:

  • Line through convex hull. S is a set of n points in ℝ². Given any line l in ℝ², a query determines if l intersects the convex hull of S, denoted as CH(S).

We will concentrate on the above problem instead.

Suppose, for now, that we are given a point q on l that falls outside CH(S). From q, we can shoot two tangent rays, each touching CH(S) but not entering the interior of CH(S). In Figure 1(a), where S is the set of black points, the first ray passes through point p1 ∈ S while the second passes through p2 ∈ S. The two rays form a “wedge” enclosing CH(S) within it (note that the angle of the wedge is less than 180°); we will call it the wedge of q on CH(S). Line l passes through CH(S) if and only if l goes through the wedge. If the vertices of CH(S) have been stored in clockwise order, the two tangent rays can be found in O(log n) time [25].

We will prove that the “line through convex hull” problem is (Log n, Log n) spectrum indexable. Take any integer s ∈ [1, n] and set m = ⌈n/s⌉. Divide S arbitrarily into S1, S2, …, Sm such that |Si| = s for i ∈ [m−1] and |Sm| = n − s(m−1). To build a structure 𝒯, use O(s·Log s) time to compute CH(Si) for each i ∈ [m] and store its vertices in clockwise order. The structure’s construction time is O(n·Log s). Let us see how to answer a query with line l. Suppose, once again, that a point q on l outside CH(S) is given. For each i ∈ [m], compute the wedge of q on CH(Si) in O(Log s) time. From these m wedges, it is a simple task to obtain in O(m) time the wedge of q on CH(S) (we will deal with a more general problem shortly when discussing “convex hull containment”). Now, whether l intersects CH(S) can be determined easily. The query time so far is O(m·Log s).

It remains to explain how to find q. This can be done in O(1) time if we already have the minimum axis-parallel bounding rectangle of S, denoted as MBR(S). Note that MBR(S) must contain CH(S); see Figure 1(b). It is clear that MBR(S) can be obtained from MBR(S1), MBR(S2), …, MBR(Sm) in O(m) time, while each MBR(Si) (i ∈ [m]) can be computed in O(s) time during the construction of the structure 𝒯. We thus conclude that the problem is (Log n, Log n) spectrum indexable and can now apply Theorem 1.

Convex Hull Containment.

In this problem, S is a set of n points in ℝ². Given any point q ∈ ℝ², a query determines if q is covered by CH(S). We will prove that the problem is (Log n, Log n) spectrum indexable.

Take any integer s ∈ [1, n] and set m = ⌈n/s⌉. Divide S arbitrarily into S1, S2, …, Sm such that |Si| = s for i ∈ [m−1] and |Sm| = n − s(m−1). To build a structure 𝒯, compute CH(Si) for each i ∈ [m] and store its vertices in clockwise order; this takes O(n·Log s) time as explained before. Let us see how to answer a query given point q. For each i ∈ [m], whether q is covered by CH(Si) can be checked in O(Log s) time [25]. If the answer is yes for any i, point q must be covered by CH(S), and we are done. The subsequent discussion assumes that q is outside CH(Si) for all i. In O(m·Log s) time, compute the wedge of q on CH(Si) for all i ∈ [m] as described for the previous problem.

Figure 1: Illustrations of key concepts in Section 4.2: (a) tangent rays, (b) MBR, (c) arcs, (d) dissection.

It remains to determine from the m wedges whether q is covered by CH(S). This can be re-modeled as the following problem. Place an arbitrary circle centered at q. For each wedge, its two bounding rays intersect the circle at two points that delimit an arc of less than 180°. Let a1, a2, …, am be the arcs produced this way, and define a to be the smallest arc on the circle covering them all. Figure 1(c) shows an example with m = 3. Arc a1 is subtended by points A and B, arc a2 by points C and D, and arc a3 by points E and F. Here, the smallest arc a goes clockwise from point A to F. Crucially, q is covered by CH(S) if and only if a spans at least 180° (this is the case in Figure 1(c)) – note that if q is outside CH(S), then a must be subtended by the wedge of q on CH(S), which must be less than 180°. Therefore, the goal is to determine whether a is at least 180°.

It is possible to achieve this in O(m) time (even if the m arcs are given to us in an arbitrary order). For this purpose, we process the arcs one by one, maintain the smallest arc a covering the arcs already processed, and stop the algorithm when it becomes clear that a must be at least 180°. Specifically, for i = 1, simply set a to a1. Iteratively, given the next arc ai (i ≥ 2), check in constant time if an arc of less than 180° can cover both ai and a. If so, update a to that arc; otherwise, stop the algorithm. In Figure 1(c), for example, after a2 is processed, the a we maintain goes clockwise from A to D. When processing a3, the algorithm realizes that a must be at least 180° and hence terminates.

We thus conclude that the convex hull containment problem is (Log n, Log n) spectrum indexable and can now apply Theorem 1. A sketch of the arc-maintenance step appears below.
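The one-pass arc maintenance can be sketched in Python as follows; this is an illustrative implementation, not code from the paper. Each wedge is represented as a counterclockwise arc (start angle, length) in radians with length < π, and the pass stops as soon as the smallest covering arc is forced to reach 180°.

```python
from math import pi, tau

def covers(x, y):
    """Does arc x = (start, length) cover arc y = (start, length)?"""
    return (y[0] - x[0]) % tau + y[1] <= x[1] + 1e-12

def smallest_cover(x, y):
    """Smallest arc covering x and y, or None if every covering arc spans
    at least pi (the constant-time check used in the pass above)."""
    if covers(x, y):
        return x
    if covers(y, x):
        return y
    cand = []
    for s, t in ((x, y), (y, x)):             # start at s, end where t ends
        length = (t[0] + t[1] - s[0]) % tau
        if length + 1e-12 >= s[1] and length + 1e-12 >= (t[0] - s[0]) % tau + t[1]:
            cand.append((s[0], length))
    if not cand:                              # numerical safety net
        return None
    best = min(cand, key=lambda a: a[1])
    return best if best[1] < pi else None

def covered_by_hull(wedge_arcs):
    """q is covered by CH(S) iff no arc shorter than 180 degrees covers all
    of a_1, ..., a_m. Runs in O(m) time over arcs given in any order."""
    a = wedge_arcs[0]
    for arc in wedge_arcs[1:]:
        a = smallest_cover(a, arc)
        if a is None:
            return True      # the covering arc reached 180 degrees
    return False             # one arc < 180 degrees covers all wedges
```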

Range Median.

In this problem, S is a set of n real values stored in an array A. Given an arbitrary integer pair (x, y) satisfying 1 ≤ x ≤ y ≤ n, a query returns the median of A[x : y]. We will show that the problem is (Log n, Log n) spectrum indexable. Fix any integer s ≤ n and set m = ⌈n/s⌉. Define Si = A[(i−1)·s + 1 : i·s] for each i ≤ m−1 and Sm = A[(m−1)·s + 1 : n]. Next, we will assume s ≤ √n; otherwise, simply use O(n log n) = O(n·log s) time to create a structure of [11] on the whole S, which is able to answer any query on S in O(log n) = O(log s) time.

To build a structure 𝒯, for each i ∈ [m], store Si in ascending order, but with each element of Si associated with its original position index in A. It is clear that 𝒯 can be built in O(n·Log s) time. Let us see how to answer a query with predicate (x, y). First, determine the values a, b ∈ [m] such that A[x] ∈ Sa and A[y] ∈ Sb, which can be trivially done in O(m) time. Scan Sa and identify the subset S′a = Sa ∩ A[x : y] (for each element in Sa, check if its original index in A falls in [x, y]). Because Sa is sorted, we can produce S′a in sorted order in O(s) time. In the same fashion, compute S′b = Sb ∩ A[x : y] in O(s) time. At this moment, all the elements of A[x : y] have been partitioned into b − a + 1 sorted arrays: S′a, Sa+1, Sa+2, …, Sb−1, S′b. The goal now is to find the ⌈(y−x+1)/2⌉-th smallest element in the union of these arrays. Frederickson and Johnson [10] described an algorithm to select the element of a given rank from the union of sorted arrays. Their algorithm runs in O(m·Log(n/m)) = O(m·Log s) time in our scenario. Overall, the query time is O(s + m·Log s) = O(m·Log s) because s ≤ √n (which implies s ≤ m).

We now conclude that the range median problem is (Log n, Log n) spectrum indexable and are ready to apply Theorem 1. A code sketch of the block structure follows.
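The block structure and query procedure admit the following sketch. For brevity, heapq.merge plays the role of the Frederickson–Johnson selection [10], so the sketch’s selection step costs O((y−x)·log m) instead of the O(m·Log s) used in the analysis; everything else mirrors the description above (with 0-based indexing).

```python
import heapq
from itertools import islice

def build_blocks(A, s):
    """Sort each size-s block of A, tagging every value with its index in A.
    Construction cost: O(n log s)."""
    return [sorted((v, j) for j, v in enumerate(A[i:i + s], start=i))
            for i in range(0, len(A), s)]

def range_median(blocks, s, x, y):
    """Lower median of A[x..y] (0-based, inclusive bounds)."""
    a, b = x // s, y // s
    arrays = []
    for i in range(a, b + 1):
        if i in (a, b):   # boundary blocks: keep indexes inside [x, y] only
            arrays.append([v for (v, j) in blocks[i] if x <= j <= y])
        else:             # interior blocks participate in full, already sorted
            arrays.append([v for (v, _) in blocks[i]])
    k = (y - x) // 2      # 0-based rank of the lower median
    return next(islice(heapq.merge(*arrays), k, None))
```

For example, with A = [5, 1, 4, 2, 3] and s = 2, range_median(build_blocks(A, 2), 2, 1, 3) returns 2, the median of [1, 4, 2].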

2D Linear Programming.

By resorting to geometry duality [7], the problem can be converted [19] to “line through convex hull” (which we already solved) and the following:

  • Ray exiting convex hull. Here, S is a set of n points in ℝ² such that CH(S) covers the origin. Given any ray q emanating from the origin, a query returns the edge e𝑒𝑥𝑖𝑡 of CH(S) from which q exits CH(S).

We will concentrate on the above problem instead. Before proceeding, let us state two facts about the problem:

  • Given any ray q, it is possible to find e𝑒𝑥𝑖𝑡 in O(n) time, using an algorithm in [21]; we will call this the basic algorithm.

  • We can create a structure in O(nlogn) time to answer any query in O(logn) time. First compute CH(S) and then “dissect” it using the segments connecting the origin to all the vertices; see Figure 1(d). A query with ray q can then be answered by looking for the triangle in the dissection that q goes through. We will call this the basic structure.

Unlike all the problems discussed so far, currently we cannot prove that “ray exiting convex hull” is (Log n, Log n) spectrum indexable. However, by virtue of Theorem 1, we do not have to! It suffices to show that the problem is (Log n, n^c) spectrum indexable for any positive constant c < 1. Theorem 1 then allows us to answer r ≤ n^{1−c}·log n queries in O(n·Log r) time. After r has reached n^{1−c}, we can afford to build the basic structure in O(n log n) = O(n·log r) time and answer every subsequent query in O(log n) = O(log r) time. This allows us to achieve Time(n, r) = O(n·Log r) for all r ≤ n.

We will prove that the “ray exiting convex hull” problem is (Log n, √n) spectrum indexable. We will achieve this by resorting to a result of [19], where Karp, Motwani, and Raghavan used the top-down approach to build a binary tree 𝒯 with the following properties.

  • If a node u is at level ℓ of 𝒯, then u is associated with a set S(u) of n/2^ℓ points in S.

  • The first ℓ levels of the tree can be built in O(n·ℓ) time.

  • A query is answered by traversing at most one root-to-leaf path of 𝒯. If the search descends to a node u, then the target edge e𝑒𝑥𝑖𝑡 can be found in O(|S(u)|) time by running the basic algorithm [21] on S(u).

Back to our scenario, fix any integer s ≤ n. Build the first 1 + Log s levels of 𝒯 on S in O(n·(1 + Log s)) = O(n·Log s) time. To answer a query, we descend a path of 𝒯 to a node u at level Log s if the edge e𝑒𝑥𝑖𝑡 has not already been found. The set S(u) has at most n/2^{Log s} ≤ n/s points. We can thus run the basic algorithm on S(u) to find e𝑒𝑥𝑖𝑡 in O(n/s) = O((n/s)·√s) time. Therefore, the “ray exiting convex hull” problem is (Log n, √n) spectrum indexable.

We close this section with a remark, revealed by the above discussion, about an inherent connection between the top-down approach and our reduction. In essence, we build the structure of [19] incrementally: the i-th (i ≥ 1) “epoch” (in the proof of Section 3.2) re-builds the first 1 + Log s levels of 𝒯, where s = 2^{2^i}. This is different from [19], where the nodes are “expanded upon first touch” in the way described in Section 2. In fact, all existing DDS algorithms designed based on the top-down approach can be encapsulated into our reduction framework through the bridge of “spectrum indexability”, in the manner we demonstrated for “ray exiting convex hull”.

5 DDS Using Structures with ω(n log n) Construction Time

Our discussion so far has focused on data structures with B(n) = O(log n). In this section, we will first show the necessity of this condition for black-box reductions to guarantee even a non-constant log(r)-streak bound. As a second step, we present an extension of Theorem 1 that permits the deployment of a structure with max{B(n), Q(n)} = polylog n to produce a DDS algorithm with a good Time(n, r) for all r ≤ n.

Constant Streak Bounds for B(n) = ω(log n).

Our subsequent hardness argument requires n·B(n) to be a convex function. Consider any black-box reduction algorithm 𝒜. Recall that 𝒜 is required to work on any decomposable problem with a restricted data structure (the reader may wish to review Section 3.3 before proceeding). Suppose that 𝒜 can guarantee a log(r)-streak bound of LogStreak(n) for all such problems, namely, 𝒜 can always answer r queries in O(n·Log r) time for all r ≤ LogStreak(n). We will show that, if B(n) = ω(log n), then LogStreak(n) must be O(1).

In a way similar to Section 3.3, we will contrive a decomposable problem and an accompanying data structure. The dataset S contains n arbitrary elements; given any predicate, a query on S always returns |S|. Whenever asked to build a data structure 𝒯(S) on S, we purposely waste |S|·B(|S|) time and then output an arbitrary permutation of S in an array. Whenever asked to answer a query, we immediately return |S| in constant time. In other words, the function Q(n) is fixed to 1.

Henceforth, we will fix r to the value of LogStreak(n) that 𝒜 can ensure when it is given our contrived data structure. We will assume r = ω(1); otherwise, LogStreak(n) = O(1) and our job is done. As 𝒜 answers r queries in O(n·Log r) time, at least one of the queries must have a cost of O((n/r)·Log r). We will concentrate on this particular query in the rest of the argument.

Recall from Section 3.3 that, to answer this query, algorithm 𝒜 needs to search a number (possibly zero) of structures 𝒯(S1), 𝒯(S2), …, 𝒯(St) and scan a subset S𝑠𝑐𝑎𝑛 ⊆ S. As 𝒜 needs to pay a cost of Ω(|S𝑠𝑐𝑎𝑛|) to scan S𝑠𝑐𝑎𝑛, it must hold that |S𝑠𝑐𝑎𝑛| ≤ α·(n/r)·Log r for some constant α > 0. We will consider r, which we know is ω(1), to be large enough to make α·(n/r)·Log r ≤ n/2. Because 𝒜 must obey (4), we can assert that

|S1 ∪ S2 ∪ ⋯ ∪ St| ≥ |S| − |S𝑠𝑐𝑎𝑛| ≥ n/2.   (5)

This implies t ≥ 1. Since algorithm 𝒜 must pay constant time to search each structure 𝒯(Si) (i ∈ [t]), we must have

t = O((n/r)·Log r).   (6)

However, by how we designed 𝒯, the total cost of constructing 𝒯(S1), 𝒯(S2), …, 𝒯(St) is

Σ_{i∈[t]} |Si|·B(|Si|).   (7)

Set λ = Σ_{i∈[t]} |Si|; from (5), we know λ ≥ n/2. As n·B(n) is a convex function and B(n) is non-decreasing, we know that (7) is minimized when |Si| = λ/t for all i ∈ [t]. Therefore:

(7) ≥ λ·B(λ/t) ≥ (n/2)·B(n/(2t)).   (8)

From (6), we have n/(2t) = Ω(r/Log r). Because B(n) = ω(log n), rudimentary asymptotic analysis shows that B(n/(2t)) must be ω(Log r).

We now conclude that (8), and hence (7), must be ω(n·Log r). But this contradicts the fact that algorithm 𝒜 can process r queries in O(n·Log r) time. Therefore, our assumption that r = ω(1) must be wrong.

DDS Algorithms with B(n) = ω(log n).

Data structures having ω(nlogn) construction time are still useful for DDS as long as we are not obsessed with answering r queries in O(nLogr) time. To make this formal, we modify the techniques behind Theorem 1 to obtain another generic reduction with the following guarantees.

Theorem 3.

Suppose that B(n) and Q(n) are non-decreasing functions that are both O(log^γ n), where γ ≥ 1 is a constant. Every (B(n), Q(n)) spectrum indexable problem admits a DDS algorithm with Time(n, r) = O(n·Log^γ r) for all r ≤ n.

The proof is similar to what was presented in Section 3.2 and is moved to Appendix A. An interesting application of the above is orthogonal range counting/reporting in ℝ^d, where d is a fixed constant at least 3 (see the problem definition in Section 1.2). The range tree augmented with fractional cascading [9], constructible in O(n·log^{d−1} n) time on n points, answers a counting/reporting query in O(log^{d−1} n) time. The problem is decomposable and thus (Log^{d−1} n, Log^{d−1} n) spectrum indexable. Theorem 3 directly gives a DDS algorithm with Time(n, r) = O(n·Log^{d−1} r) for all r ≤ n, which strictly improves the results of [19] as mentioned in Section 1.2.

References

  • [1] Alok Aggarwal and Prabhakar Raghavan. Deferred data structure for the nearest neighbor problem. Information Processing Letters (IPL), 40(3):119–122, 1991. doi:10.1016/0020-0190(91)90164-D.
  • [2] Jérémy Barbay, Ankur Gupta, Srinivasa Rao Satti, and Jonathan Sorenson. Near-optimal online multiselection in internal and external memory. J. Discrete Algorithms, 36:3–17, 2016. doi:10.1016/J.JDA.2015.11.001.
  • [3] Jon Louis Bentley and Hermann A. Maurer. Efficient worst-case data structures for range searching. Acta Inf., 13:155–168, 1980. doi:10.1007/BF00263991.
  • [4] Manuel Blum, Robert W. Floyd, Vaughan R. Pratt, Ronald L. Rivest, and Robert Endre Tarjan. Time bounds for selection. Journal of Computer and System Sciences (JCSS), 7(4):448–461, 1973. doi:10.1016/S0022-0000(73)80033-9.
  • [5] Gerth Stølting Brodal, Beat Gfeller, Allan Grønlund Jørgensen, and Peter Sanders. Towards optimal range medians. Theoretical Computer Science, 412(24):2588–2601, 2011. doi:10.1016/J.TCS.2010.05.003.
  • [6] Timothy M. Chan and Konstantinos Tsakalidis. Optimal deterministic algorithms for 2-d and 3-d shallow cuttings. Discrete & Computational Geometry, 56(4):866–881, 2016. doi:10.1007/S00454-016-9784-4.
  • [7] Bernard Chazelle, Leonidas J. Guibas, and D. T. Lee. The power of geometric duality. BIT Numerical Mathematics, 25(1):76–90, 1985. doi:10.1007/BF01934990.
  • [8] Yu-Tai Ching, Kurt Mehlhorn, and Michiel H. M. Smid. Dynamic deferred data structuring. Information Processing Letters (IPL), 35(1):37–40, 1990. doi:10.1016/0020-0190(90)90171-S.
  • [9] Mark de Berg, Otfried Cheong, Marc van Kreveld, and Mark Overmars. Computational Geometry: Algorithms and Applications. Springer-Verlag, 3rd edition, 2008.
  • [10] Greg N. Frederickson and Donald B. Johnson. The complexity of selection and ranking in X + Y and matrices with sorted columns. Journal of Computer and System Sciences (JCSS), 24(2):197–208, 1982. doi:10.1016/0022-0000(82)90048-4.
  • [11] Beat Gfeller and Peter Sanders. Towards optimal range medians. In Proceedings of International Colloquium on Automata, Languages and Programming (ICALP), pages 475–486, 2009. doi:10.1007/978-3-642-02927-1_40.
  • [12] Goetz Graefe, Felix Halim, Stratos Idreos, Harumi A. Kuno, and Stefan Manegold. Concurrency control for adaptive indexing. Proceedings of the VLDB Endowment (PVLDB), 5(7):656–667, 2012. doi:10.14778/2180912.2180918.
  • [13] Felix Halim, Stratos Idreos, Panagiotis Karras, and Roland H. C. Yap. Stochastic database cracking: Towards robust adaptive indexing in main-memory column-stores. Proceedings of the VLDB Endowment (PVLDB), 5(6):502–513, 2012. doi:10.14778/2168651.2168652.
  • [14] Sariel Har-Peled and Nirman Kumar. Approximate nearest neighbor search for low-dimensional queries. SIAM Journal on Computing, 42(1):138–159, 2013. doi:10.1137/110852711.
  • [15] Sariel Har-Peled and S. Muthukrishnan. Range medians. In Proceedings of European Symposium on Algorithms (ESA), pages 503–514, 2008. doi:10.1007/978-3-540-87744-8_42.
  • [16] Stratos Idreos, Martin L. Kersten, and Stefan Manegold. Database cracking. In Proceedings of Biennial Conference on Innovative Data Systems Research (CIDR), pages 68–78, 2007. URL: http://cidrdb.org/cidr2007/papers/cidr07p07.pdf.
  • [17] Stratos Idreos, Martin L. Kersten, and Stefan Manegold. Self-organizing tuple reconstruction in column-stores. In Proceedings of ACM Management of Data (SIGMOD), pages 297–308, 2009. doi:10.1145/1559845.1559878.
  • [18] Stratos Idreos, Stefan Manegold, Harumi A. Kuno, and Goetz Graefe. Merging what’s cracked, cracking what’s merged: Adaptive indexing in main-memory column-stores. Proceedings of the VLDB Endowment (PVLDB), 4(9):585–597, 2011. doi:10.14778/2002938.2002944.
  • [19] Richard M. Karp, Rajeev Motwani, and Prabhakar Raghavan. Deferred data structuring. SIAM Journal on Computing, 17(5):883–902, 1988. doi:10.1137/0217055.
  • [20] David G. Kirkpatrick. Optimal search in planar subdivisions. SIAM Journal of Computing, 12(1):28–35, 1983. doi:10.1137/0212002.
  • [21] David G. Kirkpatrick and Raimund Seidel. The ultimate planar convex hull algorithm? SIAM Journal on Computing, 15(1):287–299, 1986. doi:10.1137/0215021.
  • [22] Robert Krauthgamer and James R. Lee. Navigating nets: simple algorithms for proximity search. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 798–807, 2004. URL: http://dl.acm.org/citation.cfm?id=982792.982913.
  • [23] Konstantinos Lampropoulos, Fatemeh Zardbani, Nikos Mamoulis, and Panagiotis Karras. Adaptive indexing in high-dimensional metric spaces. Proceedings of the VLDB Endowment (PVLDB), 16(10):2525–2537, 2023. doi:10.14778/3603581.3603592.
  • [24] Rajeev Motwani and Prabhakar Raghavan. Deferred data structuring: Query-driven preprocessing for geometric search problems. In Proceedings of Symposium on Computational Geometry (SoCG), pages 303–312, 1986. doi:10.1145/10515.10548.
  • [25] Mark H. Overmars and Jan van Leeuwen. Maintenance of configurations in the plane. Journal of Computer and System Sciences (JCSS), 23(2):166–204, 1981. doi:10.1016/0022-0000(81)90012-X.
  • [26] Bryce Sandlund and Sebastian Wild. Lazy search trees. In Proceedings of Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 704–715, 2020. doi:10.1109/FOCS46700.2020.00071.
  • [27] Bryce Sandlund and Lingyi Zhang. Selectable heaps and optimal lazy search trees. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1962–1975, 2022. doi:10.1137/1.9781611977073.78.
  • [28] Neil Sarnak and Robert Endre Tarjan. Planar point location using persistent search trees. Communications of the ACM (CACM), 29(7):669–679, 1986. doi:10.1145/6138.6151.
  • [29] Felix Martin Schuhknecht, Alekh Jindal, and Jens Dittrich. An experimental evaluation and analysis of database cracking. The VLDB Journal, 25(1):27–52, 2016. doi:10.1007/S00778-015-0397-Y.
  • [30] Yufei Tao and Dimitris Papadias. Range aggregate processing in spatial databases. IEEE Transactions on Knowledge and Data Engineering (TKDE), 16(12):1555–1570, 2004. doi:10.1109/TKDE.2004.93.
  • [31] Fatemeh Zardbani, Peyman Afshani, and Panagiotis Karras. Revisiting the theory and practice of database cracking. In Proceedings of Extending Database Technology (EDBT), pages 415–418, 2020. doi:10.5441/002/EDBT.2020.46.
  • [32] Fatemeh Zardbani, Nikos Mamoulis, Stratos Idreos, and Panagiotis Karras. Adaptive indexing of objects with spatial extent. Proceedings of the VLDB Endowment (PVLDB), 16(9):2248–2260, 2023. doi:10.14778/3598581.3598596.
  • [33] Donghui Zhang, Vassilis J. Tsotras, and Dimitrios Gunopulos. Efficient aggregation over objects with extent. In Proceedings of ACM Symposium on Principles of Database Systems (PODS), pages 121–132, 2002. doi:10.1145/543613.543629.

Appendix A Proof of Theorem 3

Our algorithm executes in epochs. At the beginning of the i-th (i ≥ 1) epoch, we set s = 2^{2^i} and create a structure 𝒯 – the structure promised by (B(n), Q(n)) spectrum indexability – on S in O(n·B(s)) = O(n·2^{iγ}) time. The structure allows us to answer any query in O((n/s)·Q(s)) time. The i-th epoch finishes after s queries have been answered during the epoch. These queries have a total cost of

s · O((n/s)·Q(s)) = O(n·Q(s)) = O(n·2^{iγ}).

Hence, the total computation time of the i-th epoch is O(n·2^{iγ}). Because γ ≥ 1, we know that n·2^{iγ} at least doubles when i increases by 1. Hence, the total cost of all epochs is asymptotically dominated by O(n·2^{hγ}), where h is the number of epochs needed.

Precisely, the value of h is the smallest integer x ≥ 1 satisfying two conditions:

  • C1: 2^{2^x} ≤ n (the value of s must not exceed n);

  • C2: Σ_{i=1}^{x} 2^{2^i} ≥ r (the number of queries answerable in x epochs must be at least r).

Let H represent the smallest integer x ≥ 1 such that 2^{2^x} ≥ r. This means

2^{2^{H−1}} < r  ⟹  2^{2^H} < r².   (9)

When 2^{2^H} ≤ n, the definitions of h and H imply h ≤ H. In this case, all the epochs incur a total cost of O(n·2^{hγ}) = O(n·2^{Hγ}) = O(n·Log^γ r).

The above argument does not work if 2^{2^H} > n. However, when this happens, we know from (9) that r² > 2^{2^H} > n, leading to r > √n. As soon as r reaches √n – the snapping point – we create a structure 𝒯 on the whole S in O(n·log^γ n) time, and use it to answer every subsequent query in O(Q(n)) = O(log^γ n) time until r = n. The queries after the snapping point require a total cost of O(n·log^γ n) = O(n·Log^γ r). We thus have obtained an algorithm that guarantees Time(n, r) = O(n·Log^γ r) for all r ≤ n, completing the proof of Theorem 3.