Maximizing the Optimality Streak of Deferred Data Structuring (a.k.a. Database Cracking)
Abstract
This paper studies how to minimize the total cost of answering m queries over n elements in an online manner (i.e., the next query is given only after the previous query’s result is ready) when the value m is unknown in advance. Traditional indexing, which first builds a complete index on the n elements before answering queries, may be unsuitable because the index’s construction time – usually Ω(n log n) – can become the performance bottleneck. In contrast, for many problems, a lower bound of Ω(n log(1+m) + m) holds on the total cost of m queries for every m ≤ n. Matching this lower bound is a primary objective of deferred data structuring (DDS), also known as database cracking in the system community. For a wide class of problems, we present generic reductions to convert traditional indexes into DDS algorithms that match the lower bound for a long range of m. For a decomposable problem, if a data structure can be built in O(n log n) time and has Tq(n) query search time, our reduction yields an algorithm that runs in O(n log(1+m) + m) time for all m ≤ (n log n)/Tq(n), where the upper bound on m is asymptotically the best possible under mild constraints. In particular, if Tq(n) = O(log n), then the O(n log(1+m) + m)-time guarantee extends to all m ≤ n, with which we optimally settle a large variety of DDS problems. Our results can be generalized to a class of “spectrum indexable problems”, which subsumes the class of decomposable problems.
Keywords and phrases: Deferred Data Structuring, Database Cracking, Data Structures
2012 ACM Subject Classification: Theory of computation → Data structures and algorithms for data management
Funding: This work was partially supported by GRF projects 4203421 and 14222822 from HKRGC.
Editors: Sudeepa Roy and Ahmet Kara
1 Introduction
Traditional indexing first creates an index on the whole dataset before starting to answer queries. For example, after building a binary search tree (BST) on an (unsorted) set S of n real values in O(n log n) time, we can use the tree to answer each predecessor query (given a search value q, a predecessor query returns the largest element in S that does not exceed q) in O(log n) time. This paradigm, however, falls short when the dataset will only be searched a small number of times. In the extreme, if only one query needs to be answered, the best approach is to scan S in full, which requires only O(n) time. More generally, if we are to answer m queries in an online manner – that is, the next query is given after the result of the previous query has been output – it is possible to pay a total cost of O(n log(1+m) + m log n) [19]. When m ≪ n (e.g., m = O(1) or m = O(polylog n)), this cost breaks the barrier of Ω(n log n), suggesting that sorting S is unnecessary.
Situations like the above occur in many big-data applications where large volumes of data are collected but only sparingly queried. The phenomenon has motivated a line of research under the name database cracking in the system community; see [12, 13, 16, 17, 18, 23, 29, 31, 32] and the references therein. In these environments, the problem size n is huge, such that even sorting is considered expensive and should be avoided as much as possible. Furthermore, it is impossible to predict how many queries – namely, the integer m mentioned earlier – will need to be supported. Instead of constructing a complete index right away, database cracking carries out only a calculated portion of the construction during each query. If m eventually reaches a certain threshold, the index will be created in full, but it is more likely that m will stop at some value significantly lower than that threshold. The challenge of database cracking is to ensure “smoothness”: as m grows, the total cost of all the queries so far ought to increase as slowly as possible.
In the theory field, the data-structure problems at the core of database cracking had already been studied in 1988 by Karp, Motwani, and Raghavan [19] under the name deferred data structuring (DDS). For a selection of problems, they explained how to answer m queries on n elements with a total cost of O(n log(1+m) + m log n), without knowing m in advance (note that m·log n = O(n log(1+m)) whenever m ≤ n). For every m ≤ n, they proved a matching lower bound of Ω(n log(1+m)) on those problems. Through reductions, the same lower bound has been shown to hold on many other DDS problems.
This work explores the following topic: how to design a generic reduction that, given a (conventional) data structure, can turn it automatically into an efficient DDS algorithm? Such reductions may significantly simplify the design of DDS algorithms and shed new light on the intricate connections between traditional indexing and DDS.
Math Conventions.
We use ℕ to denote the set of positive integers. For any integer x ≥ 1, let [x] represent the set {1, 2, ..., x}; define [x] = ∅ for any integer x ≤ 0. Every logarithm has base 2 by default.
1.1 Problem Definitions
This subsection will formalize the problems to be investigated. Let S, called the dataset, be a set of n elements drawn from a domain 𝔻. Let ℙ be a (possibly infinite) set whose elements are called predicates. Given a predicate p ∈ ℙ, a query issued on S returns an answer, represented as ANS(S, p). We consider that, for any p ∈ ℙ, the answer ANS(S, p) can be represented using O(1) words and can be computed in O(|S|) time (which essentially means that the query can be answered using a brute-force algorithm such as an exhaustive scan).
Deferred Data Structuring (DDS).
We now formalize the DDS problem. Initially, an algorithm 𝒜 is provided with the dataset S in an array, where the elements are arbitrarily permuted. An adversary first chooses a predicate p₁ for the first query, and solicits the answer ANS(S, p₁) from 𝒜. Iteratively, for each i ≥ 2, after the answer of the (i−1)-th query has been obtained, the adversary either decides to terminate the whole process, or chooses the predicate p_i for the next query and solicits ANS(S, p_i) from 𝒜. The adversary is permitted to observe the execution of 𝒜 and is, thus, capable of selecting a “bad” p_i for 𝒜.
The algorithm 𝒜 is said to guarantee running time Time(m) if, for every m ≥ 1, the first m queries are processed with a total cost of at most Time(m). We will concentrate on m ≤ n because this is the scenario important for database cracking.
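To make the interaction concrete, here is a minimal Python sketch of the query loop between the algorithm and the adversary; all names (DDSAlgorithm, adversarial_session, and so on) are ours and purely illustrative:

```python
from abc import ABC, abstractmethod
from typing import Any, Sequence

class DDSAlgorithm(ABC):
    """An online algorithm: it receives the (arbitrarily permuted)
    dataset once and must answer queries as they arrive."""

    def __init__(self, dataset: Sequence[Any]):
        self.dataset = dataset

    @abstractmethod
    def answer(self, predicate: Any) -> Any:
        """Return ANS(S, p); the call may also refine internal state."""

def adversarial_session(algo: DDSAlgorithm, adversary) -> None:
    # The adversary observes each answer (indeed, the entire execution)
    # before deciding whether to stop or which predicate to issue next.
    predicate = adversary.first_predicate()
    while predicate is not None:
        result = algo.answer(predicate)
        predicate = adversary.next_predicate(result)
```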
Streak of Optimality.
While the above setup is “standard” for DDS, as far as database cracking is concerned, it makes sense to carry out the investigation from another, fresh perspective. As database cracking is useful mainly in the scenario where the dataset receives relatively few queries, it is imperative to design DDS algorithms that are particularly efficient when the number m of queries is small.
To formalize the notion of “particularly efficient”, we leverage the fact that, for many DDS problems (including the ones considered in this work), there exist hardness barriers dictating Time(m) = Ω(n log(1+m) + m) for every m ≤ n under a relevant computation model. Motivated by this, we say that an algorithm guarantees a λ-streak of optimality if its running time satisfies Time(m) = O(n log(1+m) + m) for all m ≤ λ.
The worst λ-streak guarantee is λ = Θ(1) – this is trivial because any query can be answered in O(n) time. Ideally, we would like to have λ = Θ(n), but this is not always possible as argued later in the paper. For practical use, however, it would suffice for an algorithm to ensure λ ≥ n^δ for some constant δ ∈ (0, 1) because n^δ queries are perhaps already too many for database cracking to be appealing.
Classes of Problems.
The above definition framework can be specialized into various problem instances that differ in the data domain 𝔻 or the predicate set ℙ. Next, we introduce two problem classes relevant to our discussion:
- A problem instance is decomposable if, for any disjoint sets S₁, S₂ ⊆ 𝔻 and any predicate p ∈ ℙ, it is possible to derive ANS(S₁ ∪ S₂, p) from ANS(S₁, p) and ANS(S₂, p) in constant time.
- We define a problem instance to be spectrum indexable if the following property holds on every dataset S of n elements: for every integer μ ∈ [n], it is possible to construct a data structure on S in O((n/μ)·Tc(μ)) time that can answer any query in O((n/μ)·Tq(μ)) time, where Tc and Tq are non-decreasing functions associated with the problem. The term “spectrum indexable” is chosen to reflect the ability to build a “good” index structure – as far as the functions Tc and Tq are concerned – for the whole spectrum of the parameter μ.
Two observations are important about the above definitions:
- Spectrum indexability implies that we can build a data structure on any dataset S in Tc(n) time to answer a query in Tq(n) time (for this purpose, simply set μ = n).
- Consider any decomposable problem instance with the following property: for any dataset of n elements, we can build a data structure in Tc(n) time to answer a query in Tq(n) time. Then, the problem instance must be spectrum indexable. To see why, given an integer μ ∈ [n], divide S arbitrarily into s = ⌈n/μ⌉ disjoint subsets S₁, ..., S_s where all subsets have size μ except possibly S_s. For each i ∈ [s], create a structure on S_i in Tc(μ) time; the total time to create all the structures is O((n/μ)·Tc(μ)). To answer a query with predicate p, simply search every structure to obtain ANS(S_i, p) in Tq(μ) time and then combine ANS(S₁, p), ..., ANS(S_s, p) into ANS(S, p) using O(n/μ) time. The total query time is therefore O((n/μ)·Tq(μ)).
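To illustrate the partition argument above, the following minimal Python sketch instantiates it for predecessor search, a canonical decomposable problem; sorted arrays with binary search stand in for the assumed structure, and all identifiers (PartitionIndex, build_block, and so on) are ours rather than the paper’s:

```python
import bisect

def build_block(block):
    """The 'traditional index' on one block: a sorted array, giving
    Tc(mu) = O(mu log mu) and Tq(mu) = O(log mu) for predecessor search."""
    return sorted(block)

def query_block(index, q):
    """ANS(S_i, p): the largest element <= q, or None if there is none."""
    pos = bisect.bisect_right(index, q)
    return index[pos - 1] if pos > 0 else None

class PartitionIndex:
    """Spectrum index with parameter mu: ceil(n/mu) blocks of size <= mu."""
    def __init__(self, data, mu):
        self.blocks = [build_block(data[i:i + mu])
                       for i in range(0, len(data), mu)]

    def query(self, q):
        # Search every block; combining two partial answers is O(1)
        # because the problem is decomposable (keep the larger one).
        best = None
        for idx in self.blocks:
            a = query_block(idx, q)
            if a is not None and (best is None or a > best):
                best = a
        return best

# Example: mu = 4 on eight values.
idx = PartitionIndex([7, 2, 9, 4, 1, 8, 6, 3], mu=4)
assert idx.query(5) == 4
```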
1.2 Related Work
Motwani and Raghavan introduced the concept of DDS in a conference paper [24], which together with Karp they extended into a journal article [19]. The article [19] presented algorithms for the following DDS problems:
- Predecessor search, where S consists of n real values, and each query is given an arbitrary value q ∈ ℝ and returns the predecessor of q in S.
- Halfplane containment, where S consists of n halfplanes in ℝ², and each query is given an arbitrary point q ∈ ℝ² and returns whether q is covered by all the halfplanes in S.
- Convex hull containment, where S consists of n points in ℝ², and each query is given an arbitrary point q ∈ ℝ² and returns whether q is covered by the convex hull of S.
- 2D linear programming, where S consists of n halfplanes in ℝ², and each query is given a 2D vector c and returns the point q in the intersection of all the halfplanes that maximizes the dot product c·q.
- Orthogonal range counting, where S consists of n points in ℝ^d with the dimensionality d being a fixed constant, and a query is given an arbitrary d-rectangle – namely, an axis-parallel box of the form [x₁, y₁] × [x₂, y₂] × ⋯ × [x_d, y_d] – and returns the number of points of S that are covered by it.
For the first four problems, Karp, Motwani, and Raghavan presented algorithms achieving Time(m) = O(n log(1+m) + m log n) for all m ≤ n. For orthogonal range counting, they presented two algorithms whose running times exceed the lower bound below by polylogarithmic factors. For all these problems, they proved that, under the comparison model and/or the algebraic model, the running time of any algorithm must satisfy Time(m) = Ω(n log(1+m) + m) for every m ≤ n.
Aggarwal and Raghavan [1] presented a DDS algorithm with Time(m) = O(n log(1+m) + m log n) for all m ≤ n for nearest neighbor search, where S consists of n points in ℝ², and a query is given an arbitrary point q ∈ ℝ² and returns the point in S closest to q. This running time is optimal under the algebraic model for all m ≤ n.
A “success story” can be told about DDS on range median. In the problem’s offline version, we are given a set S of n real values in an (unsorted) array A. For any 1 ≤ x ≤ y ≤ n, let A[x:y] represent the set of elements {A[i] : x ≤ i ≤ y}. In addition, we are given m integer pairs (x₁, y₁), ..., (x_m, y_m) such that 1 ≤ x_j ≤ y_j ≤ n for each j ∈ [m]. The goal is to find the median of the set A[x_j : y_j] for all j ∈ [m]. In [15], Har-Peled and Muthukrishnan explained how to solve the problem in time slightly above O(n log(1+m) + m log n) and proved a lower bound of Ω(n log(1+m)) under the comparison model for all m ≤ n. In [11] (see also [5]), Gfeller and Sanders considered the problem’s DDS version, where S is as defined earlier, and a query is given an arbitrary pair (x, y) with 1 ≤ x ≤ y ≤ n and returns the median of A[x:y]. They designed an algorithm achieving Time(m) = O(n log(1+m) + m log n) for all m ≤ n. It is easy to see that any DDS algorithm can be utilized to solve also the offline problem in Time(m) time. Hence, the algorithm of Gfeller and Sanders improves the offline solution of [15] and is optimal (for DDS) under the comparison model.
More remotely related to our work is [8], where Ching, Mehlhorn, and Smid considered a dynamic DDS problem where, besides queries, an algorithm is also required to support updates on the dataset S. Towards another direction, Barbay et al. [2] studied the DDS version of predecessor search but analyzed their algorithm using more fine-grained parameters called “gaps” (rather than using only n and m). This direction has also been extended to dynamic DDS; see the recent works [26, 27].
As mentioned, DDS has been extensively studied in the system community under the name database cracking. The focus there is to engineer efficient heuristics to accelerate query workloads encountered in various practical scenarios, rather than establishing strong theoretical guarantees. Interested readers may refer to the representative works of [12, 13, 16, 17, 18, 23, 31, 32] as entry points into the literature.
1.3 Our Results
Our main result is a generic reduction with the following guarantee:
Theorem 1.
Suppose that Tc(n) and Tq(n) are non-decreasing functions such that Tc(n) = O(n log n) and Tq(n) = O(n^{1−ε}), where ε > 0 is an arbitrarily small constant. Every spectrum indexable problem admits a DDS algorithm with

Time(m) = O(n log(1+m) + m) for all m ≤ c·(n log n)/Tq(n)   (1)

for an arbitrarily large constant c; namely, the algorithm achieves a λ-streak of optimality with λ = c·(n log n)/Tq(n).
Under mild constraints, the streak bound in (1) is the best possible, as we argue in Section 3. As an important special case, when Tq(n) = O(log n), the DDS algorithm produced by our reduction achieves λ = Θ(n), namely, Time(m) = O(n log(1+m) + m) for all m ≤ n. For many decomposable problems with high importance to database systems, data structures with construction time O(n log n) and search time O(log n) are already known (e.g., predecessor search and nearest neighbor search). As such problems must be spectrum indexable (as explained in Section 1.1), Theorem 1 immediately produces excellent solutions to the DDS versions of all those problems, which are usually optimal under the comparison/algebraic model. A selection of those problems will be given in Section 4.1.
Theorem 1 has a pleasant message for database cracking: even data structures with slow query time can be useful for cracking! For example, for orthogonal range counting (defined in Section 1.2) in 2D space, the kd-tree, which takes O(n log n) time to build, answers a query in O(√n) time. Theorem 1 shows that the structure can be utilized to answer any m ≤ c·√n·log n queries in O(n log(1+m) + m) time, which is probably more than enough for database cracking in reality. This nicely manifests our motivation (stated in Section 1.1) for studying “DDS algorithms particularly efficient for small m”.
Theorem 1 has an instructive implication for the design of DDS algorithms – we should study how spectrum indexable the underlying problem really is. Exploration in this direction can get fairly interesting, as we will demonstrate in Section 4.2 for halfplane containment, convex hull containment, range median, and 2D linear programming (see Section 1.2 for their definitions). We can prove that each of those problems is spectrum indexable for appropriately selected functions Tc and Tq, and then leverage Theorem 1 to obtain an algorithm solving it with Time(m) = O(n log(1+m) + m) for all m ≤ n.
What happens to structures that take ω(n log n) time to build, or equivalently, Tc(n) = ω(n log n)? Section 5 will show that, under mild constraints, no generic reduction can use such a structure to obtain DDS algorithms with λ = ω(1). In other words, these algorithms can achieve Time(m) = O(n log(1+m) + m) only in the trivial scenario where m = O(1). Nevertheless, if one accepts algorithms with guarantees of the form “Time(m) = O((n/m)·Tc(m) + m·Tq(n)) for all m ≤ n”, our reduction underlying Theorem 1 can be extended to produce such algorithms as long as Tc(n) and Tq(n) are polynomially bounded, as will be discussed in Section 5.
2 Warm Up: Predecessor Search
Predecessor search is the problem that has received the most attention from the DDS and database-cracking communities. This section will review two approaches developed in [19] for achieving Time(m) = O(n log(1+m) + m) for all m ≤ n on this problem. These approaches form the basis of nearly all the existing DDS algorithms.
Bottom-Up.
Recall that the dataset S consists of n real values. We assume, w.l.o.g., that n is a power of 2. At all times, the set S is arbitrarily partitioned into disjoint subsets – referred to as runs – each having the same size ρ = 2^j for some j ≥ 0. Every run is sorted and stored in an array. Initially, ρ = 1, i.e., a run contains an individual element of S. Over time, the run size ρ increases monotonically. Whenever ρ needs to go from ρ₁ to ρ₂ = 2^j·ρ₁ for some value j ≥ 1, an overhaul is carried out to build the new runs. As a run of size ρ₂ can be obtained by merging ρ₂/ρ₁ runs of size ρ₁ in O(ρ₂·log(ρ₂/ρ₁)) time, the overhaul can be completed in O(n·log(ρ₂/ρ₁)) time. Therefore, if the current run size is ρ, the overall cost of producing all the runs in history is O(n log ρ).
A (predecessor) query is answered simply by performing binary search on every run, which takes O((n/ρ)·log ρ) time. To control the cost, however, the algorithm makes sure that the current size ρ is at least i·log(1+i) before processing the i-th (i ≥ 1) query. If the condition is not met, an overhaul is invoked to increase ρ to the nearest power of 2 at least i·log(1+i). After that, the query entails a cost of O((n/ρ)·log ρ) = O(n/i) time. If we add this up for all m queries, the sum becomes O(n log(1+m)). As the final run size is O(m·log(1+m)), we can conclude that the algorithm processes m queries in O(n log(1+m)) time. Note that this holds only if m·log(1+m) ≤ n (the maximum run size is n). However, when i·log(1+i) has reached n, the algorithm can afford to sort the entire S in O(n log n) time and answer every subsequent query in O(log n) time. Thus, the algorithm achieves Time(m) = O(n log(1+m) + m) for all m ≤ n.
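The scheme can be sketched in Python as follows (our own illustrative code); it assumes the number of elements is a power of 2, as in the text, and uses the run-size schedule ρ ≥ i·log₂(1+i) from the analysis above:

```python
import bisect
from heapq import merge
from math import log2

class BottomUpPredecessor:
    """Sorted runs of equal size rho; rho doubles by pairwise merging.
    Assumes len(data) is a power of 2; once rho reaches n, the single
    remaining run is the fully sorted dataset."""

    def __init__(self, data):
        self.n = len(data)
        self.runs = [[x] for x in data]   # rho = 1
        self.rho = 1
        self.i = 0                        # number of queries so far

    def query(self, q):
        self.i += 1
        target = self.i * log2(1 + self.i)
        while self.rho < self.n and self.rho < target:
            # one overhaul step: merge runs pairwise, halving their number
            self.runs = [list(merge(a, b))
                         for a, b in zip(self.runs[0::2], self.runs[1::2])]
            self.rho *= 2
        best = None                       # binary search in every run
        for run in self.runs:
            pos = bisect.bisect_right(run, q)
            if pos and (best is None or run[pos - 1] > best):
                best = run[pos - 1]
        return best                       # None if q precedes all of S
```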
Top-Down.
This approach mimics the following strategy for building a binary search tree (BST) 𝒯 on S: (i) find the median of S, and split S into S₁ and S₂ at the median; (ii) store the median as the root’s key, and then build the root’s left (resp., right) subtree recursively on S₁ (resp., S₂). Rather than doing a full construction outright, the algorithm builds 𝒯 on an “as-needed” basis during query processing.
In the beginning, only the root exists and it is put in the unexpanded mode. In general, an unexpanded node u has no children yet, but is associated with the subset S(u) of elements that ought to be stored in its subtree. A query is answered in the same manner as in a normal BST, by traversing a root-to-leaf path π of 𝒯. The main difference is that, as the search comes to an unexpanded node u on π, the algorithm must expand it first. If |S(u)| > 1, expanding u means creating two child nodes for u, storing the median of S(u) as the key of u, dividing S(u) at the median into two parts, and associating each part with a child. After that, u becomes expanded with its children put in the unexpanded mode. If, on the other hand, |S(u)| = 1, expanding u simply means making u a leaf and marking u in the expanded mode. In any case, the expansion takes O(|S(u)|) time (finding the median of a set takes linear time [4]).
After m queries, the BST 𝒯 is partially built because only the nodes on the root-to-leaf paths traversed during query processing are expanded. A node at level j (the level of a node is the number of edges on the path from the node to the root) satisfies |S(u)| ≤ n/2^j and hence has expansion cost O(n/2^j); the nodes at the first ⌈log(1+m)⌉ levels can thus incur a total expansion cost of O(n log(1+m)). For each root-to-leaf path π, the node at level ⌈log(1+m)⌉ has expansion cost O(n/m), which dominates (up to a constant factor) the total expansion cost of the descendants of that node on π. Therefore, other than the nodes at the first ⌈log(1+m)⌉ levels, all the other nodes in 𝒯 have an expansion cost of O(m·(n/m)) = O(n) in total. The algorithm therefore achieves Time(m) = O(n log(1+m) + m) for all m ≤ n.
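The following Python sketch (ours) mirrors the top-down approach for predecessor search; statistics.median_low stands in for the linear-time selection of [4], and the values are assumed distinct for simplicity:

```python
from statistics import median_low

class LazyBSTNode:
    """A BST node that is 'expanded' only when a search first visits it."""

    def __init__(self, elems):
        self.elems = elems        # S(u): the elements destined for this subtree
        self.key = None
        self.left = self.right = None
        self.expanded = False

    def _expand(self):
        # O(|S(u)|) work: find the median and split S(u) around it.
        self.expanded = True
        if len(self.elems) == 1:
            self.key = self.elems[0]      # u becomes a leaf
            return
        self.key = median_low(self.elems)
        self.left = LazyBSTNode([x for x in self.elems if x <= self.key])
        self.right = LazyBSTNode([x for x in self.elems if x > self.key])

    def predecessor(self, q, best=None):
        if not self.expanded:
            self._expand()                # pay the expansion cost on first touch
        if self.left is None:             # leaf
            return self.key if self.key <= q else best
        if q <= self.key:
            return self.left.predecessor(q, best)
        return self.right.predecessor(q, self.key)  # key <= q is a candidate
```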
3 Reductions from Data Structures to DDS Algorithms
Section 3.1 will extend the bottom-up approach (reviewed in the previous section) into a generic reduction, which can yield DDS algorithms with large -streak bounds, provided that a crucial requirement – linear mergeability (to be defined shortly) – is met. Although this reduction will be superseded by our final reduction (presented in Section 3.2) underneath Theorem 1, its discussion (i) deepens the reader’s understanding of the approach’s power and limitations, and (ii) elucidates the necessity for new ideas in the absence of linear mergeability. Finally, Section 3.3 will establish a hardness result to show that the streak bound in Theorem 1 can no longer be improved significantly.
3.1 The First Reduction: Generalizing the Bottom-Up Approach
This subsection will focus on decomposable problems. For any dataset S' ⊆ 𝔻, we assume the existence of a data structure 𝒯(S') that can answer any query on S' in Tq(|S'|) time. Furthermore, the structure is linearly mergeable, namely, for any disjoint S₁, S₂ ⊆ 𝔻, the structure 𝒯(S₁ ∪ S₂) can be constructed from 𝒯(S₁) and 𝒯(S₂) in O(|S₁| + |S₂|) time. Note that this implies that 𝒯(S') can be built from scratch in O(|S'|·log |S'|) time (by merging in the manner of merge sort).
Lemma 2.
For a decomposable problem on which there is a linearly mergeable structure with query time Tq(n) = O(n^{1−ε}), where ε > 0 is a constant, we can design a DDS algorithm to guarantee Time(m) = O(n log(1+m) + m) for all m ≤ c·(n log n)/Tq(n), for an arbitrarily large constant c.
Proof.
We assume, w.l.o.g., that n is a power of 2. As with the bottom-up approach, at any moment, we divide S arbitrarily into runs with the same size ρ = 2^j for some j ≥ 0. For each run, build a structure on the elements therein. The initial run size is 1. Every time ρ grows from ρ₁ to ρ₂ = 2^j·ρ₁ for some value j ≥ 1, an overhaul is performed to construct the runs of size ρ₂. By linear mergeability, we can build the structure of a size-ρ₂ run by merging the structures of ρ₂/ρ₁ runs of size ρ₁ in O(ρ₂·log(ρ₂/ρ₁)) time. By an analysis similar to the one in Section 2, if the current run size is ρ, the overall cost of producing the structures for all the runs that ever appeared in history is O(n log ρ).
A query with predicate p is answered by searching the structure of every run, and then combining the answers from all the runs into ANS(S, p). The query cost is O((n/ρ)·Tq(ρ)). We require that, before answering the i-th (i ≥ 1) query, the run size ρ must satisfy

ρ/Tq(ρ) ≥ i.   (2)
If this requirement is not met, we launch an overhaul to increase ρ to the least power of 2 fulfilling (2). This ensures that the i-th query is answered with a cost of O((n/ρ)·Tq(ρ)) = O(n/i). Hence, the total cost of processing m queries is O(n log(1+m)).
What remains is to bound the cost of the overhauls. The final run size ρ* is the least power of 2 satisfying
- ρ* ≤ n (the run size cannot exceed n) and
- ρ*/Tq(ρ*) ≥ m (because of (2)).
Rather than solving ρ* precisely, we instead aim to find an upper bound for it. As Tq(ρ) = O(ρ^{1−ε}), we know Tq(ρ) ≤ c₁·ρ^{1−ε} for some constant c₁ ≥ 1. Hence, ρ/Tq(ρ) ≥ ρ^ε/c₁. Let ρ' be the least power of 2 satisfying

ρ' ≤ n and (ρ')^ε/c₁ ≥ m.

Since the condition ρ^ε/c₁ ≥ m implies ρ/Tq(ρ) ≥ m, it holds that ρ* ≤ ρ'.
The condition (ρ')^ε/c₁ ≥ m can be rewritten as ρ' ≥ (c₁·m)^{1/ε}. Therefore, if (c₁·m)^{1/ε} ≤ n, then ρ' is simply the least power of 2 at least (c₁·m)^{1/ε}. This means ρ' = O(m^{1/ε}) and thus log ρ' = O(log(1+m)), in which case all the overhauls require O(n log(1+m)) time overall.
The above argument does not work if (c₁·m)^{1/ε} > n. However, in this case, m > n^ε/c₁, that is, m is already a polynomial of n, whence log m = Θ(log n). This motivates the following brute-force strategy. When i reaches ⌈n^ε/c₁⌉ – a moment we call the snapping point – we simply create a structure on the whole S in O(n log n) time, and use it to answer every subsequent query in Tq(n) time until i = ⌈c·(n log n)/Tq(n)⌉. All the queries after the snapping point have a total cost of O((1 + c)·n log n) = O(n log(1+m)). We thus have obtained an algorithm that guarantees Time(m) = O(n log(1+m) + m) for all m ≤ c·(n log n)/Tq(n).
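A generic sketch of this reduction in Python (ours) is given below; make, merge, search, combine, and Tq are placeholder hooks for the assumed linearly mergeable structure and for the O(1) combining step of the decomposable problem:

```python
class FirstReduction:
    """Equal-size runs, one structure per run (assumes len(data) is a
    power of 2). make(x) builds a structure on a single element,
    merge(A, B) is the assumed linear-time merge, search(A, p) answers a
    query on one run, combine folds the per-run answers, and Tq(rho) is
    the structure's query-time function."""

    def __init__(self, data, make, merge, search, combine, Tq):
        self.structs = [make(x) for x in data]   # run size rho = 1
        self.rho, self.n, self.i = 1, len(data), 0
        self.merge, self.search = merge, search
        self.combine, self.Tq = combine, Tq

    def query(self, p):
        self.i += 1
        # requirement (2): grow rho until rho / Tq(rho) >= i (or rho = n)
        while self.rho < self.n and self.rho / self.Tq(self.rho) < self.i:
            self.structs = [self.merge(a, b)
                            for a, b in zip(self.structs[0::2], self.structs[1::2])]
            self.rho *= 2
        return self.combine(self.search(s, p) for s in self.structs)
```

For instance, sorted arrays with make = lambda x: [x], a heapq.merge-based merge, a binary-search-based search, and Tq = lambda rho: max(1.0, log2(rho)) recover a structure-based variant of the bottom-up algorithm of Section 2.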
The above reduction crucially relies on the fact that the structure is linearly mergeable. Otherwise, the overhaul for creating size-ρ runs would have to build every structure from scratch, at a cost of Θ((n/ρ)·Tc(ρ)) – that is, Θ(n log ρ) when Tc(ρ) = Θ(ρ log ρ) – in which case all the overhauls would end up with a total cost of O(n log²(1+m)). Next, we will present another reduction that can shave a logarithmic factor.
3.2 The Second Reduction: No Linear Mergeability
We will now drop the linear-mergeability requirement of Section 3.1 and establish Theorem 1. Recall that the underlying problem is spectrum indexable with Tc(n) = O(n log n) and Tq(n) = O(n^{1−ε}) for some constant ε > 0. The goal is to design an algorithm with Time(m) = O(n log(1+m) + m) for all m ≤ c·(n log n)/Tq(n), where c can be any constant.
Assume, w.l.o.g., that n is a power of 2. Our algorithm executes in epochs. At the beginning of the h-th (h ≥ 1) epoch, we set

μ_h = 2^{2^h}.

For now, let us consider that μ_h does not exceed n, a condition that will be guaranteed, as discussed later. The epoch starts by creating a structure – the structure promised by spectrum indexability – on S in O((n/μ_h)·Tc(μ_h)) = O(n·2^h) time. The structure allows us to answer any query in O((n/μ_h)·Tq(μ_h)) time. The h-th epoch finishes after ⌈Tc(μ_h)/Tq(μ_h)⌉ queries (in practice, our algorithm may be slightly improved by fine-tuning this number, but this will not be necessary for the purpose of proving Theorem 1) are answered during the epoch. These queries demand a total cost of

⌈Tc(μ_h)/Tq(μ_h)⌉ · O((n/μ_h)·Tq(μ_h)) = O((n/μ_h)·Tc(μ_h)) = O(n·2^h).
It is clear from the above that the total computation time of the h-th epoch is O(n·2^h). As this quantity doubles when h increases by 1, the overall cost of answering m queries is O(n·2^H), where H is the number of epochs needed. Precisely, the value of H is the smallest integer satisfying two conditions:
- C1: μ_H ≤ n (the value of μ_H must not exceed n);
- C2: Σ_{h=1}^{H} ⌈Tc(μ_h)/Tq(μ_h)⌉ ≥ m (the number of queries that can be answered by H epochs must be at least m).
Rather than solving H precisely, we aim to find an upper bound for it. Define H' to be the smallest integer satisfying
- C1': μ_{H'} ≤ n (same as C1);
- C2': Tc(μ_{H'})/Tq(μ_{H'}) ≥ m.
It is clear that H ≤ H'.
Still, we do not solve H' precisely but instead will find an upper bound for it. For this purpose, we leverage the fact that Tq(n) = O(n^{1−ε}), which means Tq(μ) ≤ c₁·μ^{1−ε} for some constant c₁ ≥ 1; combined with the trivial fact Tc(μ) ≥ μ (a structure on μ elements takes at least μ time to build), this gives Tc(μ)/Tq(μ) ≥ μ^ε/c₁. Define H'' to be the smallest integer satisfying
- C1'': μ_{H''} ≤ n (same as C1 and C1');
- C2'': (μ_{H''})^ε/c₁ ≥ m, or equivalently, 2^{ε·2^{H''}} ≥ c₁·m.
Because condition C2'' implies condition C2', we must have H' ≤ H''.
Therefore, if (c₁·m)^{1/ε} ≤ n (namely, m ≤ n^ε/c₁), the value of H'' is simply the least integer satisfying 2^{ε·2^{H''}} ≥ c₁·m. This means

2^{H''} ≤ (2/ε)·log(c₁·m) = O(log(1+m)).   (3)

In this case, all epochs incur a total cost of O(n·2^{H''}), which is O(n log(1+m)) by (3).
The above argument does not work if m > n^ε/c₁. However, when this happens, we must have log m = Θ(log n). As soon as m reaches ⌈n^ε/c₁⌉ – the snapping point – we create a structure on the whole S in Tc(n) = O(n log n) time, and use it to answer every subsequent query in Tq(n) time until m = ⌈c·(n log n)/Tq(n)⌉. The queries after the snapping point require a total cost of O((1 + c)·n log n) = O(n log(1+m)). We thus have obtained an algorithm that guarantees Time(m) = O(n log(1+m) + m) for all m ≤ c·(n log n)/Tq(n), completing the proof of Theorem 1.
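The epoch schedule can be summarized in the following Python sketch (ours); build(mu) abstracts the structure promised by spectrum indexability, and the snapping-point refinement from the proof is omitted for brevity:

```python
from math import ceil, inf

class SecondReduction:
    """Epoch scheduler of Section 3.2. build(mu) is assumed to create, in
    O((n/mu) * Tc(mu)) time, a structure on the whole dataset whose
    .query(p) runs in O((n/mu) * Tq(mu)) time; Tc and Tq are the
    problem's spectrum-indexability functions."""

    def __init__(self, n, build, Tc, Tq):
        self.n, self.build = n, build
        self.Tc, self.Tq = Tc, Tq
        self.h = 0
        self.left = 0          # queries remaining in the current epoch
        self.struct = None

    def query(self, p):
        if self.left == 0:     # the current epoch ends; start the next one
            self.h += 1
            mu = min(self.n, 2 ** (2 ** self.h))   # mu_h = 2^(2^h), capped at n
            self.struct = self.build(mu)
            # the epoch answers ceil(Tc(mu)/Tq(mu)) queries, so the queries'
            # total cost matches the construction cost O((n/mu) * Tc(mu))
            self.left = ceil(self.Tc(mu) / self.Tq(mu)) if mu < self.n else inf
        self.left -= 1
        return self.struct.query(p)
```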
3.3 Tightness of the Streak Bound
This section will explain why the streak bound in Theorem 1 is asymptotically the best possible for reduction algorithms with “reasonable” behavior.
Black-box Reductions on Decomposable Problems with Restricted Structures.
For proving hardness results, we can specialize the problem class at will, and we do so by considering only decomposable problems. A reduction algorithm 𝒜 is required to work on any decomposable problem that is spectrum indexable. As explained in Section 1.1, this implies a data structure 𝒯 that can be built on any S' ⊆ 𝔻 in Tc(|S'|) time and can answer any query on S' in Tq(|S'|) time.
As long as spectrum indexability is retained, we can restrict the functionality of 𝒯 to make it hard for 𝒜. Specifically, 𝒯 provides only the following “services” to 𝒜:
- 𝒜 can create a structure of 𝒯 on any subset S' ⊆ S;
- given a predicate p, 𝒜 can use the structure on S' to find the answer ANS(S', p);
- given a predicate p, after 𝒜 already has obtained ANS(S₁, p) and ANS(S₂, p) for disjoint subsets S₁, S₂ ⊆ S, it can combine the answers into ANS(S₁ ∪ S₂, p) in constant time (the combining algorithm is provided by 𝒯).
Despite its limited functionalities, 𝒯 still makes the problem spectrum indexable because the problem is decomposable; see the explanation in Section 1.1.
So far no restriction has been imposed on the behavior of 𝒜, but we are ready to do so now. To answer a query, the algorithm needs to search the structures on a number k (including zero) of subsets of S, say, S₁, S₂, ..., S_k. The algorithm can choose any k and any S₁, ..., S_k (they do not need to be disjoint). In addition, 𝒜 is also allowed to examine another subset T ⊆ S by paying O(|T|) time. With all these, the algorithm must ensure

|T| + Σ_{i=1}^{k} |S_i| ≥ n.   (4)
The above constraint is natural because otherwise at least one element of S is absent from T ∪ S₁ ∪ ⋯ ∪ S_k. If the algorithm “dares” to return the query answer anyway, it must have acquired certain special properties of the underlying problem. In this work, we are interested in generic reductions that are unaware of problem-specific properties.
Tightness of Theorem 1.
We will prove that any black-box reduction can answer only m = O((n log n)/f(n)) queries in O(n log(1+m) + m) time for any function f(n) that
- satisfies f(n) = Ω(log n) and f(n) = O(n^{1−ε}) for a constant ε > 0;
- is sub-additive, i.e., f(a) + f(b) ≥ f(a + b) holds for any a, b ∈ ℕ.
This will confirm the tightness of Theorem 1 on the black-box class.
We will contrive a decomposable problem and an accompanying data structure, both of which make no sense in reality but nonetheless are sound in mathematics. The dataset S consists of n arbitrary elements; given any predicate, a query on any subset S' ⊆ S always returns |S'| (the concrete forms of the elements and predicates are irrelevant). Whenever asked to “build” a data structure on S', we waste on purpose Θ(|S'|·log(1+|S'|)) time and then simply output an arbitrary permutation of S' in an array. Whenever asked to “answer” a query, we waste on purpose Θ(f(|S'|)) time and then return |S'|. The problem is clearly decomposable (two answers are combined by addition).
We argue that any black-box reduction algorithm 𝒜 needs Ω(f(n)) time to answer every query. Consider an arbitrary query and suppose that 𝒜 processes it by searching the structures on S₁, ..., S_k for some k ≥ 0 and scanning a subset T. By the design of our structure, the query cost is at least

|T| + Σ_{i=1}^{k} f(|S_i|) ≥ |T| + f(n − |T|) = Ω(f(n)),

where the first step uses (4) together with the sub-additivity and monotonicity of f, and the second holds because either |T| ≥ n/2 ≥ f(n)/2 or f(n − |T|) ≥ f(n/2) ≥ f(n)/2.
By definition of λ-streak, algorithm 𝒜 must process m = λ queries within a total cost of O(n log(1+m) + m), which obviously cannot exceed O(n log n) (remember m ≤ n). It thus follows that 𝒜 can process only λ = O((n log n)/f(n)) queries.
4 New DDS Algorithms for Concrete Problems
We now deploy Theorem 1 to develop algorithms for concrete DDS problems, focusing on decomposable problems in Section 4.1 and non-decomposable problems in Section 4.2.
4.1 Applications to Decomposable Problems
As mentioned, if a decomposable problem has a data structure that can be built in O(n log n) time (i.e., Tc(n) = O(n log n)) and supports a query in O(log n) time, Theorem 1 directly yields a DDS algorithm with Time(m) = O(n log(1+m) + m) for all m ≤ n. We enumerate below a partial list of such problems with importance to database systems, of which no algorithms with the same guarantee were known previously.
- Orthogonal range counting on points. The dataset S is a set of n points in ℝ². Given an arbitrary 2-rectangle q, a query returns the number of points of S covered by q. The structure 𝒯 can be a persistent aggregate BST [30].
- Orthogonal range counting on rectangles. The dataset S is a set of n 2-rectangles (i.e., axis-parallel boxes) in ℝ². Given an arbitrary 2-rectangle q, a query returns how many rectangles of S intersect with q. The problem can be reduced to four queries of the previous problem (orthogonal range counting on points) [33]. 𝒯 can once again be a persistent aggregate BST [30].
- Point location. The dataset S is a planar subdivision of ℝ² defined by n line segments, where each segment is associated with the ids of the two faces incident to it. Given an arbitrary point q ∈ ℝ², a query returns the face of the subdivision containing q, which boils down to finding the segment immediately above q (i.e., the first segment “hit” by a ray shooting upwards from q). The structure 𝒯 can be a persistent BST [28] or Kirkpatrick’s structure [20].
- k nearest neighbor search in 2D. The dataset S is a set of n points in ℝ². Fix a constant integer k ≥ 1. Given a point q ∈ ℝ², a query returns the k points in S closest to q. The structure 𝒯 can be a point location structure [20, 28] built on an order-k Voronoi diagram (for constant k, an order-k Voronoi diagram can be computed in O(n log n) time [6]). For k = 1, a DDS algorithm with Time(m) = O(n log(1+m) + m) for all m ≤ n has been found [1]. However, the algorithm of [1] heavily relies on the ability to merge two (order-1) Voronoi diagrams in linear time and, thus, cannot be extended to higher values of k easily.
- Approximate nearest neighbor search in metric space. The dataset S consists of n objects in a metric space with a constant doubling dimension (this encapsulates any Euclidean space with a constant dimensionality). Let dist(o₁, o₂) represent the distance between two objects o₁ and o₂ in the space. Given an arbitrary object q in the space, a query returns an object o ∈ S such that dist(o, q) ≤ α·dist(o', q) for all o' ∈ S, where α > 1 is a constant. A structure fulfilling our purpose can be found in [14, 22].
For orthogonal range counting in ℝ^d, where d is a fixed constant (as defined in Section 1.2), one can apply Theorem 1 to achieve a somewhat unusual result. It is possible [3] to build a structure in O(n log n) time that answers a query in O(n^δ) time, where the constant δ > 0 can be made arbitrarily small. Theorem 1 thus produces a DDS algorithm with Time(m) = O(n log(1+m) + m) for all m ≤ (n log n)/n^δ. As δ can be arbitrarily close to 0, the λ-streak bound is lower than the maximum possible value n only by a factor of n^δ for an arbitrarily small constant δ. A remark is in order about the significance if this gap could be closed for all constant dimensionalities. If a DDS algorithm with Time(m) = O(n log(1+m) + m) for all m ≤ n could be discovered, then the algorithm would also settle the following offline version in O(n log n) time: we are given a set P of n points in ℝ^d and a set Q of n d-rectangles; the goal is to report, for each d-rectangle q ∈ Q, how many points of P are covered by q. This offline problem has been extensively studied, and yet no O(n log n)-time algorithm is known for all constant d to our knowledge.
4.2 Non-Decomposable but Spectrum-Indexable Problems
This subsection will utilize Theorem 1 to deal with problems that are not decomposable (at least not obviously). The key is to show that the problem is spectrum indexable for suitably selected functions Tc and Tq. This is an interesting topic on its own, as we demonstrate next by developing new DDS algorithms with Time(m) = O(n log(1+m) + m) for all m ≤ n on halfplane containment, convex hull containment, range median, and 2D linear programming. The original algorithms [19, 11] for these problems were all designed using the top-down approach reviewed in Section 2. Our algorithms present a contrast that illustrates how Theorem 1 can facilitate the design of DDS algorithms.
Halfplane Containment.
The problem can be converted [19] to the following equivalent form by resorting to geometry duality [7]:
- Line through convex hull. S is a set of n points in ℝ². Given any line ℓ in ℝ², a query determines if ℓ intersects with the convex hull of S, denoted as CH(S).
We will concentrate on the above problem instead.
Suppose, for now, that we are given a point p on ℓ that falls outside CH(S). From p, we can shoot two tangent rays, each touching CH(S) but not entering the interior of CH(S). In Figure 1(a), where S is the set of black points, the first ray passes point A while the second passes point B. The two rays form a “wedge” enclosing CH(S) within (note that the angle of the wedge is less than 180°); we will call it the wedge of p on CH(S). Line ℓ passes through CH(S) if and only if ℓ goes through the wedge. If the vertices of CH(S) have been stored in clockwise order, the two tangent rays can be found in O(log n) time [25].
We will prove that the “line through convex hull” problem is spectrum indexable. Take any integer μ ∈ [n] and set s = ⌈n/μ⌉. Divide S arbitrarily into S₁, S₂, ..., S_s such that |S_i| = μ for all i ∈ [s−1] and |S_s| ≤ μ. To build a structure, use O(μ log μ) time to compute CH(S_i) for each i ∈ [s] and store its vertices in clockwise order. The structure’s construction time is O(s·μ·log μ) = O(n log μ). Let us see how to answer a query with line ℓ. Suppose, once again, that a point p on ℓ outside CH(S) is given. For each i ∈ [s], compute the wedge of p on CH(S_i) in O(log μ) time. From these wedges, it is a simple task to obtain in O(s) time the wedge of p on CH(S) (we will deal with a more general problem shortly in discussing “convex hull containment”). Now, whether ℓ intersects with CH(S) can be determined easily. The query time so far is O(s·log μ) = O((n/μ)·log μ).
It remains to explain how to find p. This can be done in O(1) time if we already have the minimum axis-parallel bounding rectangle of S, denoted as MBR(S): any point on ℓ outside MBR(S) serves the purpose because MBR(S) must contain CH(S); see Figure 1(b). It is clear that MBR(S) can be obtained from MBR(S₁), ..., MBR(S_s) in O(s) time, while each MBR(S_i) (i ∈ [s]) can be computed in O(μ) time during the construction of the structure. We thus conclude that the problem is spectrum indexable with Tc(μ) = O(μ log μ) and Tq(μ) = O(log μ), and can now apply Theorem 1.
Convex Hull Containment.
In this problem, S is a set of n points in ℝ². Given any point q ∈ ℝ², a query determines if q is covered by CH(S). We will prove that the problem is spectrum indexable.
Take any integer μ ∈ [n] and set s = ⌈n/μ⌉. Divide S arbitrarily into S₁, ..., S_s such that |S_i| = μ for all i ∈ [s−1] and |S_s| ≤ μ. To build a structure, compute CH(S_i) for each i ∈ [s] and store its vertices in clockwise order; this takes O(n log μ) time as explained before. Let us see how to answer a query given point q. For each i ∈ [s], whether q is covered by CH(S_i) can be checked in O(log μ) time [25]. If the answer is yes for any i, point q must be covered by CH(S), and we are done. The subsequent discussion assumes that q is outside CH(S_i) for all i ∈ [s]. In O(s·log μ) time, compute the wedge of q on CH(S_i) for all i ∈ [s] as described in the previous problem.
Figure 1: (a) Tangent rays. (b) MBR. (c) Arcs. (d) Dissection.
It remains to determine from the wedges whether q is covered by CH(S). This can be re-modeled as the following problem. Place an arbitrary circle centered at q. For each wedge, its two bounding rays intersect the circle into an arc of less than 180°. Let σ₁, ..., σ_s be the arcs produced this way, and define σ to be the smallest arc on the circle covering them all. Figure 1(c) shows an example with s = 3. Arc σ₁ is subtended by points A and B, arc σ₂ by points C and D, and arc σ₃ by points E and F. Here, the smallest arc σ goes clockwise from point A to F. Crucially, q is covered by CH(S) if and only if σ spans at least 180° (this is the case in Figure 1(c)) – note that if q is outside CH(S), then σ must be subtended by the wedge of q on CH(S), whose angle must be less than 180°. Therefore, the goal is to determine whether σ is at least 180°.
It is possible to achieve the purpose in O(s) time (even if the arcs are given to us in an arbitrary order). For this purpose, we process the arcs one by one, maintain the smallest arc σ covering the arcs already processed, and stop the algorithm when it becomes clear that σ must be at least 180°. Specifically, for σ₁, simply set σ to σ₁. Iteratively, given the next arc σ_i (i ≥ 2), check in constant time if an arc of less than 180° can cover both σ and σ_i. If so, update σ to that arc; otherwise, stop the algorithm. In Figure 1(c), for example, after σ₂ is processed, the σ we maintain goes clockwise from A to D. When processing σ₃, the algorithm realizes that σ must be at least 180° and hence terminates.
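The routine can be sketched in Python as follows (our own code and representation): each arc is given by its two endpoint angles in degrees, with angles increasing in the clockwise direction, and every input arc is assumed to span less than 180°:

```python
def span(a, b):
    """Extent (in degrees) of the arc from angle a to angle b."""
    return (b - a) % 360.0

def contains(arc, x):
    """Does arc = (a, b) contain the point at angle x?"""
    return span(arc[0], x) <= span(*arc)

def covers(arc, other):
    """Does arc contain the arc `other` entirely?"""
    return (contains(arc, other[0]) and contains(arc, other[1])
            and span(arc[0], other[0]) <= span(arc[0], other[1]))

def smallest_arc_reaches_180(arcs):
    """Maintain the smallest arc sigma covering the arcs seen so far;
    return True as soon as sigma must span >= 180 degrees (i.e., the
    query point q is covered by CH(S))."""
    sigma = arcs[0]
    for arc in arcs[1:]:
        if covers(sigma, arc):
            continue
        # a minimal arc covering sigma and arc is one of three candidates
        candidates = [c for c in ((sigma[0], arc[1]), (arc[0], sigma[1]), arc)
                      if covers(c, sigma) and covers(c, arc) and span(*c) < 180.0]
        if not candidates:
            return True        # sigma would have to span at least 180 degrees
        sigma = min(candidates, key=lambda c: span(*c))
    return False               # sigma stayed below 180 degrees throughout
```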
We thus conclude that the convex hull containment problem is spectrum indexable with Tc(μ) = O(μ log μ) and Tq(μ) = O(log μ), and can now apply Theorem 1.
Range Median.
In this problem, S is a set of n real values stored in an array A. Given an arbitrary integer pair (x, y) satisfying 1 ≤ x ≤ y ≤ n, a query returns the median of A[x:y] (see Section 1.2 for the notation). We will show that the problem is spectrum indexable. Fix any integer μ ∈ [n] and set s = ⌈n/μ⌉. Define S_i = A[(i−1)·μ + 1 : i·μ] for each i ∈ [s−1] and S_s = A[(s−1)·μ + 1 : n]. Next, we will assume μ ≤ √n; otherwise, simply use O(n log n) = O(n log μ) time to create a structure of [11] on the whole A, which is able to answer any query on A in O(log n) = O(log μ) time.
To build a structure, for each i ∈ [s], store S_i in ascending order, where each element of S_i is associated with its original position in A. It is clear that the structure can be built in O(n log μ) time. Let us see how to answer a query with predicate (x, y). First, determine the values a, b ∈ [s] such that (a−1)·μ + 1 ≤ x ≤ a·μ and (b−1)·μ + 1 ≤ y ≤ b·μ, which can be trivially done in O(1) time. Scan S_a and identify the subset S'_a = S_a ∩ A[x:y] (for each element in S_a, check if its original position in A falls in [x, y]). Because S_a is sorted, we can produce S'_a in sorted order in O(μ) time. In the same fashion, compute S'_b in O(μ) time. At this moment, all the elements of A[x:y] have been partitioned into at most s + 1 sorted arrays: S'_a, S_{a+1}, ..., S_{b−1}, S'_b. The goal now is to find the element of rank ⌈(y − x + 1)/2⌉ in the union of these arrays. Frederickson and Johnson [10] described an algorithm to select the element of a given rank from the union of sorted arrays. Their algorithm runs in O((n/μ)·log μ) time in our scenario. Overall, the query time is O(μ + (n/μ)·log μ) = O((n/μ)·log μ) because μ ≤ √n.
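As an illustration, here is a Python sketch (ours) of the block structure; the selection step substitutes a simple heap-based walk for the Frederickson–Johnson algorithm [10], so its query time is weaker than the bound proved above:

```python
import heapq

class RangeMedianIndex:
    """Each block of mu consecutive positions of A is stored in sorted
    order, with every value tagged by its original position."""

    def __init__(self, A, mu):
        self.mu = mu
        self.blocks = [sorted((A[j], j) for j in range(i, min(i + mu, len(A))))
                       for i in range(0, len(A), mu)]

    def median(self, x, y):
        """Lower median of A[x..y] (0-based, inclusive, x <= y)."""
        a, b = x // self.mu, y // self.mu
        runs = [[e for e in self.blocks[a] if x <= e[1] <= y]]
        if b > a:
            runs += self.blocks[a + 1:b]          # interior blocks, untouched
            runs.append([e for e in self.blocks[b] if x <= e[1] <= y])
        rank = (y - x) // 2                       # rank of the lower median
        for value, _ in heapq.merge(*runs):
            if rank == 0:
                return value
            rank -= 1
```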
We now conclude that the range median problem is spectrum indexable with Tc(μ) = O(μ log μ) and Tq(μ) = O(log μ), and are ready to apply Theorem 1.
2D Linear Programming.
By resorting to geometry duality [7], the problem can be converted [19] to “line through convex hull” (which we already solved) and the following:
- Ray exiting convex hull. Here, S is a set of n points in ℝ² such that CH(S) covers the origin. Given any ray r emanating from the origin, a query returns the edge e of CH(S) from which r exits CH(S).
We will concentrate on the above problem instead. Before proceeding, let us state two facts about the problem:
- Given any ray r, it is possible to find e in O(n) time, using an algorithm in [21]; we will call this the basic algorithm.
- We can create a structure in O(n log n) time to answer any query in O(log n) time. First compute CH(S) and then “dissect” it using the segments connecting the origin to all the vertices; see Figure 1(d). A query with ray r can then be answered by binary-searching for the triangle in the dissection that r goes through. We will call this the basic structure.
Unlike all the problems discussed so far, currently we cannot prove that “ray exiting convex hull” is spectrum indexable with Tq(μ) = O(log μ). However, by virtue of Theorem 1, we do not have to! It suffices to show that the problem is spectrum indexable with Tq(μ) = O(μ^δ) for an arbitrarily small positive constant δ. Theorem 1 then allows us to answer m queries in O(n log(1+m) + m) time for all m ≤ c·(n log n)/n^δ. After m has reached ⌈c·(n log n)/n^δ⌉, we can afford to build the basic structure in O(n log n) time to answer every subsequent query in O(log n) time. This allows us to achieve Time(m) = O(n log(1+m) + m) for all m ≤ n.
We will now prove that the “ray exiting convex hull” problem is spectrum indexable with the aforementioned functions. We will achieve the purpose by resorting to a result of [19], where Karp, Motwani, and Raghavan used the top-down approach to build a binary tree 𝒯 with the following properties.
- If a node u is at level j of 𝒯, then u is associated with a set S(u) of at most n/2^j points of S.
- The first j levels of the tree can be built in O(n·j) time.
- A query is answered by traversing at most a root-to-leaf path of 𝒯. If the search descends to a node u, then the target edge e can be found in O(|S(u)|) time by running the basic algorithm [21] on S(u).
Back to our scenario, fix any integer μ ∈ [n]. Build the first ⌈log μ⌉ levels of 𝒯 on S in O(n log μ) time. To answer a query, we descend a path of 𝒯 to a node u at level ⌈log μ⌉ if the edge e has not already been found. The set S(u) has at most n/μ points. We can thus run the basic algorithm on S(u) to find e in O(n/μ) time. The query time O(log μ + n/μ) is O((n/μ)·μ^δ) for any constant δ > 0. Therefore, the “ray exiting convex hull” problem is spectrum indexable with Tc(μ) = O(μ log μ) and Tq(μ) = O(μ^δ).
We close this section with a remark, as is revealed by the above discussion, about an inherent connection between the top-down approach and our reduction. In essence, we build the structure of [19] incrementally: the h-th (h ≥ 1) epoch (in the proof of Section 3.2) re-builds the first 2^h levels of 𝒯. This is different from [19], where the nodes are “expanded upon first touch” in the way described in Section 2. In fact, all existing DDS algorithms designed based on the top-down approach can be encapsulated into our reduction framework through the bridge of “spectrum indexability”, in the manner we demonstrated for “ray exiting convex hull”.
5 DDS Using Structures with ω(n log n) Construction Time
Our discussion so far has focused on data structures with Tc(n) = O(n log n). In this section, we will first show the necessity of this condition for black-box reductions to guarantee even a non-constant λ-streak bound. As a second step, we present an extension of Theorem 1 that permits the deployment of a structure with Tc(n) = ω(n log n) to produce a DDS algorithm with good Time(m) for all m ≤ n.
Constant Streak Bounds for Tc(n) = ω(n log n).
Our subsequent hardness argument requires Tc to be a convex function. Consider any black-box reduction algorithm 𝒜. Recall that 𝒜 is required to work on any decomposable problem with a restricted data structure (the reader may wish to review Section 3.3 before proceeding). Suppose that 𝒜 can guarantee a λ-streak bound for all such problems, namely, 𝒜 can always answer m queries in O(n log(1+m) + m) time for all m ≤ λ. We will show that, if Tc(n) = ω(n log n), then λ must be O(1).
In a way similar to Section 3.3, we will contrive a decomposable problem and an accompanying data structure. The dataset S contains n arbitrary elements; given any predicate, a query on any subset S' ⊆ S always returns |S'|. Whenever asked to build a data structure on S', we waste on purpose Θ(Tc(|S'|)) time and then output an arbitrary permutation of S' in an array. Whenever asked to answer a query, we immediately return |S'| in constant time. In other words, the function Tq is fixed to 1.
Henceforth, we will fix m to the value of λ that 𝒜 can ensure when it is given our contrived data structure. We will assume m = ω(1); otherwise, λ = O(1) and our job is done. As 𝒜 answers m queries in O(n log(1+m) + m) time, at least one of the m queries must have a cost of O((n/m)·log(1+m) + 1). We will concentrate on this particular query in the rest of the argument.
Recall from Section 3.3 that, to answer this query, algorithm 𝒜 needs to search a number k (including 0) of structures, built on subsets S₁, ..., S_k, and scan a subset T ⊆ S. As 𝒜 needs to pay a cost of |T| to scan T, it must hold that |T| ≤ c₂·((n/m)·log(1+m) + 1) for some constant c₂. We will consider m, which we know is ω(1), to be large enough to make |T| ≤ n/2. Because 𝒜 must obey (4), we can assert that

Σ_{i=1}^{k} |S_i| ≥ n/2.   (5)
This implies that the average size of S₁, ..., S_k is at least n/(2k). Since algorithm 𝒜 must pay at least constant time to search each structure on S_i (i ∈ [k]), we must have

k = O((n/m)·log(1+m) + 1).   (6)

The structures on S₁, ..., S_k must all have been constructed at some point of the execution. By the convexity of Tc and by (5), their construction costs at least Σ_{i=1}^{k} Tc(|S_i|) ≥ k·Tc(n/(2k)), which, given Tc(n) = ω(n log n) and (6), is ω(n·log(n/(2k))) = ω(n log(1+m)). This alone exceeds the total budget of O(n log(1+m) + m), a contradiction. Therefore, m – and hence λ – must be O(1).
DDS Algorithms with Tc(n) = ω(n log n).
Data structures having ω(n log n) construction time are still useful for DDS as long as we are not obsessed with answering m queries in O(n log(1+m) + m) time. To make this formal, we modify the techniques behind Theorem 1 to obtain another generic reduction with the following guarantee.
Theorem 3.
Suppose that Tc(n) and Tq(n) are non-decreasing functions that are both O(n^γ), where γ ≥ 1 is a constant. Every spectrum indexable problem admits a DDS algorithm with Time(m) = O((n/m)·Tc(m) + m·Tq(n)) for all m ≤ n.
The proof is similar to what was presented in Section 3.2 and is moved to Appendix A. An interesting application of the above is orthogonal range counting/reporting in ℝ^d, where d is a fixed constant at least 3 (see the problem definition in Section 1.2). The range tree augmented with fractional cascading [9], constructable in O(n·log^{d−1} n) time on n points, answers a counting/reporting query also in O(log^{d−1} n) time (plus the output size, in the reporting case). The problem is decomposable and thus spectrum indexable. Theorem 3 directly gives a DDS algorithm with Time(m) = O(n·log^{d−1}(1+m) + m·log^{d−1} n) for all m ≤ n, which strictly improves the results of [19] mentioned in Section 1.2.
References
- [1] Alok Aggarwal and Prabhakar Raghavan. Deferred data structure for the nearest neighbor problem. Information Processing Letters (IPL), 40(3):119–122, 1991. doi:10.1016/0020-0190(91)90164-D.
- [2] Jérémy Barbay, Ankur Gupta, Srinivasa Rao Satti, and Jonathan Sorenson. Near-optimal online multiselection in internal and external memory. J. Discrete Algorithms, 36:3–17, 2016. doi:10.1016/J.JDA.2015.11.001.
- [3] Jon Louis Bentley and Hermann A. Maurer. Efficient worst-case data structures for range searching. Acta Inf., 13:155–168, 1980. doi:10.1007/BF00263991.
- [4] Manuel Blum, Robert W. Floyd, Vaughan R. Pratt, Ronald L. Rivest, and Robert Endre Tarjan. Time bounds for selection. Journal of Computer and System Sciences (JCSS), 7(4):448–461, 1973. doi:10.1016/S0022-0000(73)80033-9.
- [5] Gerth Stolting Brodal, Beat Gfeller, Allan Gronlund Jorgensen, and Peter Sanders. Towards optimal range medians. Theoretical Computer Science, 412(24):2588–2601, 2011. doi:10.1016/J.TCS.2010.05.003.
- [6] Timothy M. Chan and Konstantinos Tsakalidis. Optimal deterministic algorithms for 2-d and 3-d shallow cuttings. Discrete & Computational Geometry, 56(4):866–881, 2016. doi:10.1007/S00454-016-9784-4.
- [7] Bernard Chazelle, Leonidas J. Guibas, and D. T. Lee. The power of geometric duality. BIT Numerical Mathematics, 25(1):76–90, 1985. doi:10.1007/BF01934990.
- [8] Yu-Tai Ching, Kurt Mehlhorn, and Michiel H. M. Smid. Dynamic deferred data structuring. Information Processing Letters (IPL), 35(1):37–40, 1990. doi:10.1016/0020-0190(90)90171-S.
- [9] Mark de Berg, Otfried Cheong, Marc van Kreveld, and Mark Overmars. Computational Geometry: Algorithms and Applications. Springer-Verlag, 3rd edition, 2008.
- [10] Greg N. Frederickson and Donald B. Johnson. The complexity of selection and ranking in X + Y and matrices with sorted columns. Journal of Computer and System Sciences (JCSS), 24(2):197–208, 1982. doi:10.1016/0022-0000(82)90048-4.
- [11] Beat Gfeller and Peter Sanders. Towards optimal range medians. In Proceedings of International Colloquium on Automata, Languages and Programming (ICALP), pages 475–486, 2009. doi:10.1007/978-3-642-02927-1_40.
- [12] Goetz Graefe, Felix Halim, Stratos Idreos, Harumi A. Kuno, and Stefan Manegold. Concurrency control for adaptive indexing. Proceedings of the VLDB Endowment (PVLDB), 5(7):656–667, 2012. doi:10.14778/2180912.2180918.
- [13] Felix Halim, Stratos Idreos, Panagiotis Karras, and Roland H. C. Yap. Stochastic database cracking: Towards robust adaptive indexing in main-memory column-stores. Proceedings of the VLDB Endowment (PVLDB), 5(6):502–513, 2012. doi:10.14778/2168651.2168652.
- [14] Sariel Har-Peled and Nirman Kumar. Approximate nearest neighbor search for low-dimensional queries. SIAM Journal on Computing, 42(1):138–159, 2013. doi:10.1137/110852711.
- [15] Sariel Har-Peled and S. Muthukrishnan. Range medians. In Proceedings of European Symposium on Algorithms (ESA), pages 503–514, 2008. doi:10.1007/978-3-540-87744-8_42.
- [16] Stratos Idreos, Martin L. Kersten, and Stefan Manegold. Database cracking. In Proceedings of Biennial Conference on Innovative Data Systems Research (CIDR), pages 68–78, 2007. URL: http://cidrdb.org/cidr2007/papers/cidr07p07.pdf.
- [17] Stratos Idreos, Martin L. Kersten, and Stefan Manegold. Self-organizing tuple reconstruction in column-stores. In Proceedings of ACM Management of Data (SIGMOD), pages 297–308, 2009. doi:10.1145/1559845.1559878.
- [18] Stratos Idreos, Stefan Manegold, Harumi A. Kuno, and Goetz Graefe. Merging what’s cracked, cracking what’s merged: Adaptive indexing in main-memory column-stores. Proceedings of the VLDB Endowment (PVLDB), 4(9):585–597, 2011. doi:10.14778/2002938.2002944.
- [19] Richard M. Karp, Rajeev Motwani, and Prabhakar Raghavan. Deferred data structuring. SIAM Journal on Computing, 17(5):883–902, 1988. doi:10.1137/0217055.
- [20] David G. Kirkpatrick. Optimal search in planar subdivisions. SIAM Journal of Computing, 12(1):28–35, 1983. doi:10.1137/0212002.
- [21] David G. Kirkpatrick and Raimund Seidel. The ultimate planar convex hull algorithm? SIAM Journal on Computing, 15(1):287–299, 1986. doi:10.1137/0215021.
- [22] Robert Krauthgamer and James R. Lee. Navigating nets: simple algorithms for proximity search. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 798–807, 2004. URL: http://dl.acm.org/citation.cfm?id=982792.982913.
- [23] Konstantinos Lampropoulos, Fatemeh Zardbani, Nikos Mamoulis, and Panagiotis Karras. Adaptive indexing in high-dimensional metric spaces. Proceedings of the VLDB Endowment (PVLDB), 16(10):2525–2537, 2023. doi:10.14778/3603581.3603592.
- [24] Rajeev Motwani and Prabhakar Raghavan. Deferred data structuring: Query-driven preprocessing for geometric search problems. In Proceedings of Symposium on Computational Geometry (SoCG), pages 303–312, 1986. doi:10.1145/10515.10548.
- [25] Mark H. Overmars and Jan van Leeuwen. Maintenance of configurations in the plane. Journal of Computer and System Sciences (JCSS), 23(2):166–204, 1981. doi:10.1016/0022-0000(81)90012-X.
- [26] Bryce Sandlund and Sebastian Wild. Lazy search trees. In Proceedings of Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 704–715, 2020. doi:10.1109/FOCS46700.2020.00071.
- [27] Bryce Sandlund and Lingyi Zhang. Selectable heaps and optimal lazy search trees. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1962–1975, 2022. doi:10.1137/1.9781611977073.78.
- [28] Neil Sarnak and Robert Endre Tarjan. Planar point location using persistent search trees. Communications of the ACM (CACM), 29(7):669–679, 1986. doi:10.1145/6138.6151.
- [29] Felix Martin Schuhknecht, Alekh Jindal, and Jens Dittrich. An experimental evaluation and analysis of database cracking. The VLDB Journal, 25(1):27–52, 2016. doi:10.1007/S00778-015-0397-Y.
- [30] Yufei Tao and Dimitris Papadias. Range aggregate processing in spatial databases. IEEE Transactions on Knowledge and Data Engineering (TKDE), 16(12):1555–1570, 2004. doi:10.1109/TKDE.2004.93.
- [31] Fatemeh Zardbani, Peyman Afshani, and Panagiotis Karras. Revisiting the theory and practice of database cracking. In Proceedings of Extending Database Technology (EDBT), pages 415–418, 2020. doi:10.5441/002/EDBT.2020.46.
- [32] Fatemeh Zardbani, Nikos Mamoulis, Stratos Idreos, and Panagiotis Karras. Adaptive indexing of objects with spatial extent. Proceedings of the VLDB Endowment (PVLDB), 16(9):2248–2260, 2023. doi:10.14778/3598581.3598596.
- [33] Donghui Zhang, Vassilis J. Tsotras, and Dimitrios Gunopulos. Efficient aggregation over objects with extent. In Proceedings of ACM Symposium on Principles of Database Systems (PODS), pages 121–132, 2002. doi:10.1145/543613.543629.
Appendix A Proof of Theorem 3
Our algorithm executes in epochs. At the beginning of the h-th (h ≥ 1) epoch, we set μ_h = min{n, 2^{2^h}} and create a structure – the structure promised by spectrum indexability – on S in O((n/μ_h)·Tc(μ_h)) time. The structure allows us to answer any query in O((n/μ_h)·Tq(μ_h)) time. The h-th epoch finishes after ⌈Tc(μ_h)/Tq(μ_h)⌉ queries are answered during the epoch. These queries have a total cost of

⌈Tc(μ_h)/Tq(μ_h)⌉ · O((n/μ_h)·Tq(μ_h)) = O((n/μ_h)·Tc(μ_h)).
Hence, the total computation time of the h-th epoch is O((n/μ_h)·Tc(μ_h)) = O(n·Tc(μ_h)/μ_h). Because μ_{h+1} = (μ_h)² until μ_h reaches n, and because Tc(λ)/λ = Ω(log λ) in the context of this section, we know that n·Tc(μ_h)/μ_h at least doubles when h increases by 1. Hence, the total cost of all epochs is asymptotically dominated by O((n/μ_H)·Tc(μ_H)), where H is the number of epochs needed.
Precisely, the value of H is the smallest integer satisfying two conditions:
- C1: μ_H ≤ n (the value of μ_H must not exceed n);
- C2: Σ_{h=1}^{H} ⌈Tc(μ_h)/Tq(μ_h)⌉ ≥ m (the number of queries answerable by H epochs must be at least m).
Let H* represent the smallest integer such that Tc(μ_{H*})/Tq(μ_{H*}) ≥ m. For the polynomially bounded functions Tc and Tq in Theorem 3, this means

(n/μ_{H*})·Tc(μ_{H*}) = O((n/m)·Tc(m)).   (9)
When μ_{H*} < n, the definitions of H and H* imply H ≤ H*. In this case, all the epochs incur a total cost of O((n/μ_{H*})·Tc(μ_{H*})) = O((n/m)·Tc(m)).
The above argument does not work if μ_{H*} = n. However, when this happens, we know from (9)’s derivation that m exceeds Tc(μ_h)/Tq(μ_h) for every epoch parameter μ_h < n, leading to a large m relative to Tc(n)/Tq(n). As soon as μ_h reaches n – the snapping point – we create a structure on the whole S in Tc(n) time, and use it to answer every subsequent query in Tq(n) time until m = n. The queries after the snapping point require a total cost of O(Tc(n) + m·Tq(n)), which is O((n/m)·Tc(m) + m·Tq(n)) for the polynomially bounded functions under consideration. We thus have obtained an algorithm that guarantees Time(m) = O((n/m)·Tc(m) + m·Tq(n)) for all m ≤ n, completing the proof of Theorem 3.