Optimal Static Fully Indexable Dictionaries
Abstract
Fully indexable dictionaries (FID) store sets of integer keys while supporting rank/select queries. They serve as basic building blocks in many succinct data structures. Despite the great importance of FIDs, no known FID is succinct with efficient query time when the universe size $U$ is a large polynomial in the number of keys $n$, which is the conventional parameter regime for dictionary problems. In this paper, we design an FID that uses $\log_2\binom{U}{n}$ bits of space plus a small redundancy, and answers rank/select queries in $O(t)$ time in the worst case, for any parameter $t$ in a suitable range, provided $U$ is a large polynomial in $n$. This time-space trade-off matches known lower bounds for FIDs [40, 41, 46].
Our techniques also lead to efficient succinct data structures for the fundamental problem of maintaining a sequence of $n$ fixed-width integers and supporting partial-sum queries, with a trade-off between query time and redundancy. Prior to this work, no known data structure for the partial-sum problem achieved constant query time with succinct space usage.
Keywords and phrases: data structures, dictionaries, space efficiency
Category: Track A: Algorithms, Complexity and Games
Funding: Renfei Zhou: Partially supported by the MongoDB PhD Fellowship.
2012 ACM Subject Classification: Theory of computation → Data structures design and analysis
Acknowledgements: The authors thank William Kuszmaul and Huacheng Yu for helpful suggestions on paper writing, and anonymous reviewers for pointing out important related works.
Editors: Keren Censor-Hillel, Fabrizio Grandoni, Joël Ouaknine, and Gabriele Puppis
1 Introduction
A fully indexable dictionary (a.k.a. rank/select dictionary; FID for short) is a fundamental data structure, which stores a set $S$ of $n$ keys from a universe $[U]$, supporting Rank and Select queries:
-
Rank(x): return the number of keys in $S$ that are smaller than or equal to $x$.
-
Select(i): return the $i$-th smallest key in $S$.
This paper focuses on static FIDs, where the key set is given at the beginning and does not change over time. There is also the dynamic case, allowing updates to the key set via insertions and deletions, which is not considered in this paper.
FIDs are powerful data structures with numerous applications. First, if we choose to think of $S$ as an indicator vector in $\{0,1\}^U$, then the problem becomes storing a bit string containing $n$ ones and $U - n$ zeros, while supporting prefix-sum queries and queries for the position of the $i$-th one. This so-called rank/select problem serves as a subroutine in many space-efficient data structures. Second, given Rank and Select, one can also support predecessor search queries (i.e., find the largest element in $S$ that does not exceed $x$) by calling Select(Rank(x)) – this, too, serves as a common data-structural subroutine. The result is that FIDs have many applications in data structures for strings [10, 14, 24, 26, 36, 37, 21, 13], trees [12, 18, 44, 35], parentheses sequences [17, 35], multisets [44], permutations [34], variations of dictionaries [8, 7], etc.
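To fix ideas, here is a toy, non-succinct reference implementation of the FID interface, including the predecessor reduction above (the names and structure are ours, purely illustrative; the constructions in this paper are far more space-efficient):

```python
from bisect import bisect_right

class ToyFID:
    """Reference semantics for Rank/Select; not space-efficient."""
    def __init__(self, keys):           # keys: distinct integers from [0, U)
        self.keys = sorted(keys)

    def rank(self, x):                  # number of keys <= x
        return bisect_right(self.keys, x)

    def select(self, i):                # i-th smallest key (1-indexed)
        return self.keys[i - 1]

    def pred(self, x):                  # predecessor via Select(Rank(x))
        r = self.rank(x)
        return self.select(r) if r > 0 else None

fid = ToyFID([3, 17, 42, 100])
assert fid.rank(42) == 3 and fid.select(2) == 17 and fid.pred(50) == 42
```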
The most fundamental question regarding FIDs is to determine their best possible time-space trade-off. If an FID storing $n$ keys from the universe $[U]$ uses $\log_2\binom{U}{n} + R$ bits of space, we say it incurs $R$ bits of redundancy, where the first term $\log_2\binom{U}{n}$ is referred to as the information-theoretic optimum. Conventionally, an FID is said to be compact if the redundancy satisfies $R = O(\log_2\binom{U}{n})$, and is said to be succinct if $R = o(\log_2\binom{U}{n})$. The time-space trade-off of an FID is the relationship between the redundancy $R$ and the worst-case query time, under a RAM with word size $w$.
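For intuition about the magnitude of this optimum, the standard binomial estimates (a well-known fact, not a new claim of this paper) give

$$\left(\frac{U}{n}\right)^{n} \le \binom{U}{n} \le \left(\frac{eU}{n}\right)^{n} \quad\Longrightarrow\quad \log_2\binom{U}{n} = n\log_2\frac{U}{n} + O(n),$$

so for a polynomial universe $U = n^{1+\Theta(1)}$, the information-theoretic optimum is $\Theta(n\log n)$ bits.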
Towards the optimal trade-off
There is vast literature on space-efficient FIDs. The earliest works [28, 33, 9] on FIDs focused on the setting where one stores $S$'s indicator vector using $U$ bits in the plain format, while using up to $o(U)$ additional bits to store auxiliary information. Compared to the information-theoretic optimum $\log_2\binom{U}{n}$, their approaches are succinct only when the set is so dense that $\log_2\binom{U}{n} = (1 - o(1))\,U$. In 2002, Raman, Raman, and Satti [44] constructed an FID with redundancy $o(U)$ and constant query time. Later, Pǎtraşcu [39] reduced the redundancy to $U/(\log U/t)^{t} + \tilde{O}(U^{3/4})$ bits with worst-case query time $O(t)$.
Some applications of FIDs demand storing dense sets where the universe size $U = O(n)$; in this case, Pǎtraşcu's FID is already succinct and achieves an ideal trade-off between query time and redundancy, which is provably optimal in this regime [41, 46]. However, when $U = n^{1+\Theta(1)}$, which is the conventional parameter regime for dictionary problems, the redundancy of Pǎtraşcu's FID is polynomially larger than the information-theoretic optimum even for large running times, and thus the FID is neither succinct nor compact. Gupta et al. [25] further improved the redundancy, obtaining a succinct FID at the cost of slower queries. It is currently not known how to construct succinct FIDs with faster query time.
On the lower-bound side, Pǎtraşcu and Thorup [40] showed a super-constant time lower bound for predecessor queries, assuming that the universe is a large polynomial in $n$ and that the data structure is compact. Their lower bound also applies to FIDs by a reduction. Another lower bound, proven by Pǎtraşcu and Viola [41, 46], shows that if an FID takes worst-case time $t$ per query, it must incur at least $n/w^{O(t)}$ bits of redundancy. The best-known lower bound for FIDs is a simple combination of these two independent lower bounds: for any given redundancy, the worst-case query time must be at least the larger of the two bounds, provided the universe is polynomial in $n$. There remains a huge gap between the lower and upper bounds.
In summary, it has remained one of the basic open questions in the field whether one can construct an FID that is both succinct and supports queries in optimal time. Moreover, although there are well-established lower bounds for the time-space trade-off of a static FID, it remains open on the upper-bound side to obtain tight bounds at any point along the trade-off curve. These are the problems that we seek to resolve in the current paper.
This paper: Tight upper bounds for FIDs
In this paper, we construct a succinct FID as shown in the following theorem.
Theorem 1.
For any parameter $t$ in a suitable range, there is a static fully indexable dictionary with query time $O(t)$ and redundancy matching the known lower bounds, in the word RAM with word size $w$.
Thus, it is possible to achieve succinctness while offering an optimal query time – in this case, the redundancy of our construction is still a vanishing fraction of the information-theoretic optimum. More generally, the time-space trade-off offered by the above theorem is provably optimal for all choices of $t$, as, in every parameter regime, it matches one of the two known lower bounds. Somewhat surprisingly, this means that the maximum of the two completely independent lower bounds forms a single tight lower bound for FIDs.
High-level technical approach
To understand the high-level approach taken by our data structure, let us consider the lower bound in [40], which exhibits a distribution of hard instances such that any data structure with near-linear space must spend super-constant time per query on these inputs.
A critical insight in the current paper is that, although it is difficult to improve the query time for these hard instances, they are well-structured and thus occupy only a small fraction of all possible inputs. Therefore, in principle, the space we actually need to store these hard instances is small.
Moreover, if we restrict only to these hard instances, then we can afford to use a data structure that is space-efficient compared to the optimal bound for FIDs but that is space-inefficient compared to the information-theoretic optimum for hard instances. So we can handle hard instances by creating data structures that are morally space inefficient (i.e., not optimal for these hard instances), but that are nonetheless space efficient in the context of the overall FID problem.
On the flip side, when an input is not hard, we can think of it morally as being like a “random” input. The entropy of random inputs is large, so we cannot waste much space. Fortunately, random instances can benefit from the same high-level techniques that have already been developed in past work [39], allowing for fast queries with good space efficiency.
Of course, “hard instances” and “random instances” are really just two extremes in a large design space of inputs. What is remarkable is that, nonetheless, it is possible to combine the high-level approaches described above in order to construct a single data structure that achieves the optimal time-space trade-off for all inputs.
Other implications of our techniques
As mentioned earlier, FIDs can naturally support predecessor search queries, so our result also improves predecessor data structures:
Corollary 2.
For any parameter $t$ as in Theorem 1, there is a data structure storing a set of $n$ keys from a universe $[U]$ that supports predecessor searches in $O(t)$ time, with the same time-space trade-off as in Theorem 1.
Prior to this result, it remained open whether one could achieve this query time while also offering succinct space [42].
By adjusting our data structure for FIDs, we show the following time-space trade-off for so-called select dictionaries, which are dictionaries that support Select but not Rank:
Theorem 3.
For any parameter $t$ in a suitable range, there is a static dictionary that answers Select queries within $O(t)$ time and whose redundancy matches the lower bound of [41], in the word RAM with word size $w$.
Here, again, the bounds that we achieve are provably optimal, as they match a lower bound previously proven by [41]. This represents a significant improvement over the previous state of the art [44], which incurred substantially larger redundancy.
Another application of our techniques is to the basic question of storing a sequence of fixed-width integers and answering partial-sum queries.
Theorem 4.
For any parameter $t$ in a suitable range, there is a data structure storing a sequence of $n$ fixed-width integers, which uses space equal to the total length of the input plus a small redundancy, and can answer the partial-sum query for any given index within $O(t)$ time.
For this partial-sum problem, the previous state of the art with comparable query times incurred substantially more redundancy [43]. Whether this could be reduced remained an open question. Our bound reduces it all the way to $n/\mathrm{polylog}(n)$ bits, for a polylogarithm of our choice.
1.1 Related works
Grossi et al. [23] studied the FID problem for polynomial universe sizes with constant query time. In this setting, achieving compact space usage is impossible due to the lower bound established by [40]. They showed a trade-off between query time and space usage.
In the special case where the universe size is linear in the number of keys, Yu [47] presented a data structure that supports Rank queries in $O(t)$ time, but not Select queries. This data structure incurs very little redundancy, outperforming Pǎtraşcu's data structure when $t$ is large, and matching the lower bound of [40] for all $t$.
A closely related setting is the dynamic FID problem, where, in addition to answering Rank and Select queries, the data structure also needs to support fast insertions and deletions. When the universe size is linear in the number of keys $n$, Li et al. [30] achieved succinct redundancy with the optimal operational time of $O(\log n/\log\log n)$. For polynomial universe sizes $U = n^{1+\Theta(1)}$, it remains open whether a succinct data structure with this operational time can be constructed.
Predecessor data structures are also closely related to FIDs and have been extensively studied across various parameter regimes and settings due to their importance. For a comprehensive overview, see the survey by Navarro and Rojas-Ledesma [38].
Another variant of the dictionary problem is the unordered dictionary, where the data structure only needs to answer membership queries – whether a given key is in the current key set. A series of works has focused on both static and dynamic unordered dictionaries [48, 27, 45, 31, 6, 32, 5], leading to a static unordered dictionary with constant query time and near-optimal redundancy [27], as well as tight upper and lower bounds for the time-space trade-off of dynamic unordered dictionaries for polynomial universe sizes [31, 6, 32].
Raman, Raman, and Satti [44] studied the indexable dictionary (ID) problem, which is similar to FIDs, but a Rank query only returns the rank of $x$ when $x$ is present in the key set; otherwise, it returns "not exist." They constructed IDs with succinct redundancy and constant query time. Note that the ID setting is easier than FIDs, allowing data structures to achieve both succinctness and constant-time queries.
A further relaxation of ID is monotone minimal perfect hashing (MMPH), where the data structure only needs to support Rank queries for elements in the key set – if $x$ is not in the key set, the query can return an arbitrary answer. (Select queries are not required.) The key point of this relaxation is that MMPH data structures use asymptotically less space: Belazzougui, Boldi, Pagh, and Vigna [3] constructed a data structure that uses only $O(n\log\log\log U)$ bits, while encoding the key set itself requires $\Theta(n\log(U/n))$ bits. This bound is shown to be optimal by Assadi, Farach-Colton, and Kuszmaul [1] (see also [29]).
There is also a line of research on rank/select problems over arbitrary alphabets [21, 19, 15, 20, 22, 2, 4]. Given a sequence over an alphabet of size $\sigma$, the select query asks for the position of the $i$-th occurrence of a given symbol; rank queries are defined similarly. These problems generalize FIDs, which correspond to the special case $\sigma = 2$. For arbitrary alphabets, the optimal time/redundancy trade-off is still not fully understood [4].
2 Preliminaries
Augmented B-trees
We will use the augmented B-trees (aB-trees for short) from [39] as a subroutine.
Let $B$ and $n$ be parameters such that $n$ is a power of $B$, and let $A \in \Sigma^n$ be an array of $n$ elements over an alphabet $\Sigma$. An aB-tree of branching factor $B$ and size $n$ is a full $B$-ary tree over $n$ leaves, which correspond to the entries of the array $A$. Additionally:
-
Each node of the aB-tree is augmented with a label from a label alphabet $\Phi$. The label of a leaf node is determined by the corresponding entry of the array, and the label of an internal node is determined by the label sequence of its children. Formally, there is a transition function $f$ determining the label of each internal node $v$: $\phi(v) = f(\phi(v_1), \ldots, \phi(v_B))$, where $v_1, \ldots, v_B$ are $v$'s children and $\phi(\cdot)$ denotes labels.
-
There is a recursive query algorithm, which starts by examining the label of the root, and then recursively traverses down a path from the root to some leaf of the tree. At each step, the algorithm examines the label of the current node and the labels of its children to determine which child to recurse on. After reaching a leaf, the algorithm outputs the answer to the query based on all the examined labels. Furthermore, this algorithm is restricted to spend only constant time at each examined node, ensuring that the query time remains at most $O(\log_B n)$.
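The following minimal sketch (our own illustration, choosing subtree sums as labels) shows the label-and-descend discipline of an aB-tree query; the real structure of [39] additionally compresses the labels via spill-over encodings rather than storing nodes as objects:

```python
class ABTree:
    """aB-tree sketch: node = (label, leaf_count, children); the label of a
    node is the sum of the array entries below it (our choice of labels)."""
    def __init__(self, arr, B):
        nodes = [(a, 1, None) for a in arr]
        while len(nodes) > 1:               # build the tree level by level
            nodes = [(sum(c[0] for c in g), sum(c[1] for c in g), list(g))
                     for g in (nodes[i:i + B] for i in range(0, len(nodes), B))]
        self.root = nodes[0]

    def prefix_sum(self, i):
        """Sum of arr[0..i]; inspects O(1) labels per level, O(log_B n) total."""
        node, acc = self.root, 0
        while node[2] is not None:          # descend toward leaf i
            for child in node[2]:
                if i < child[1]:            # leaf i lies under this child
                    node = child
                    break
                acc, i = acc + child[0], i - child[1]
        return acc + node[0]

t = ABTree([2, 5, 1, 7, 0, 3], B=3)         # 6 leaves: an incomplete tree
assert t.prefix_sum(3) == 2 + 5 + 1 + 7
```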
Beyond [39], we also consider incomplete aB-trees. Let $n$ be any integer, not necessarily a power of $B$, and let $n'$ be the smallest power of $B$ with $n' \ge n$. An incomplete aB-tree with branching factor $B$ and size $n$ is derived from a (full) aB-tree over $n'$ leaves with the same branching factor $B$, by retaining only the first $n$ leaves while removing the other leaves, and (repeatedly) deleting all internal nodes without any child. In such an incomplete aB-tree, the transition function of labels still follows the form $\phi(v) = f(\cdots)$, but the input label sequence may have length less than $B$, as $v$ might possess fewer than $B$ children.
Let $N(\phi)$ denote the number of arrays $A \in \Sigma^n$ such that the aB-tree of branching factor $B$ over $A$ has root label $\phi$. According to Theorem 8 of [39], when $n$ is a power of $B$, we can compress an aB-tree with size $n$ and root label $\phi$ to within $\log_2 N(\phi) + 2$ bits. Their proof directly works for incomplete aB-trees (where $n$ is not a power of $B$) as well.
Lemma 5 (Natural generalization of [39, Theorem 8]).
Let $B$ and $n$ be parameters satisfying the requirements of [39, Theorem 8]. Suppose there is an (incomplete) aB-tree with branching factor $B$, size $n$, and root label $\phi$; then we can compress this aB-tree to within $\log_2 N(\phi) + 2$ bits with query time $O(\log_B n)$, in the word RAM model with word size $w$. The data structure uses a lookup table of a bounded number of words, which only depends on $B$ and the transition function $f$.
Notice that the lookup table depends only on $B$ and $f$, but not on $n$. The original construction in [39], when applied to incomplete aB-trees, uses a lookup table that depends on $B$, $f$, and $n$. To avoid the dependency on $n$, we simply concatenate such lookup tables for all relevant values of $n$ together, and let aB-trees with different sizes use different parts of the (concatenated) lookup table. Later, when we apply this lemma in our data structure, we will let multiple instances with the same $B$ and $f$ (but with different $n$) share a single lookup table.
Predecessor data structures
We will use the following extension of the predecessor data structure as a subroutine.
Problem 1 (Predecessor with associated values).
Store a set of $n$ keys from $[U]$, where each key is associated with a value in $[V]$, supporting predecessor and successor queries. (We use the notation $[m]$ to denote the set $\{0, 1, \ldots, m-1\}$ for any non-negative integer $m$, and also write it for the set $\{1, \ldots, m\}$ when there is no ambiguity.)
-
Pred(x): Return the largest element $y$ in the set such that $y \le x$, together with the associated value of $y$.
-
Succ(x): Return the smallest element $y$ in the set such that $y \ge x$, together with the associated value of $y$.
By studies on dictionaries [16] and predecessor data structures [40], there is a compact construction for predecessor data structures with associated values:
Lemma 6.
For any set of $n$ keys from $[U]$ with associated values in $[V]$, there is a predecessor data structure with associated values using $O(n(\log U + \log V))$ bits, with query time $O(\log\log U)$ in the worst case.
Proof Sketch.
We maintain the following data structures simultaneously:
-
A predecessor data structure for the key set by [40], with the space and query time stated in Lemma 6.
-
A successor data structure similar to the predecessor data structure.
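As a concreteness check, here is a toy realization of Problem 1's interface (binary search on a sorted array; this illustrates the semantics only, not the compact construction behind Lemma 6):

```python
from bisect import bisect_left, bisect_right

class PredWithValues:
    def __init__(self, pairs):                    # iterable of (key, value)
        self.keys, self.vals = map(list, zip(*sorted(pairs)))

    def pred(self, x):                            # largest key <= x, with value
        i = bisect_right(self.keys, x) - 1
        return None if i < 0 else (self.keys[i], self.vals[i])

    def succ(self, x):                            # smallest key >= x, with value
        i = bisect_left(self.keys, x)
        return None if i == len(self.keys) else (self.keys[i], self.vals[i])

p = PredWithValues([(10, 'a'), (20, 'b'), (30, 'c')])
assert p.pred(25) == (20, 'b') and p.succ(25) == (30, 'c')
```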
Throughout this paper, we use the terminology “predecessor data structure” for short to refer to the data structure defined by Lemma 6.
3 Basic data structure for FIDs
In this section, we will construct a basic data structure for FIDs with a slightly worse time and space guarantee than the requirement of Theorem 1. It will serve as a subroutine in the final data structure in Section 4: The final data structure is based on the algorithm framework in this section, and by replacing a subroutine with the result in this section (Theorem 7) in a non-recursive way, it achieves the ideal time-space trade-off. Formally, we will prove the following theorem:
Theorem 7 (Weak version of Theorem 1).
For any parameter $t$ in a suitable range and a suitable constant, there is a static fully indexable dictionary with query time $O(t)$ and slightly weaker redundancy than Theorem 1, in the word-RAM model with word size $w$. The data structure uses a lookup table whose contents depend only on the shared parameters (not on the key set), and which can be shared among multiple instances of fully indexable dictionaries.
Although in most applications of FIDs we care about polynomially-sized universes $U = n^{1+\Theta(1)}$, here we also consider the parameter regime where the number of keys is much smaller. The reason is that, later in Section 4, we will use Theorem 7 to maintain short subsequences of keys.
In the remainder of this section, we construct this basic data structure to prove Theorem 7. Table 1 lists the main parameters and notations we will introduce in this section.
| Notation | Explanation |
|---|---|
| $t$ | A parameter indicating that our algorithm's time constraint is $O(t)$. |
| $R$ | The desired redundancy of our FID. |
| $B$ | The branching factor of the aB-trees in the mid parts, chosen as a function of $w$ and $t$. |
| $b$ | The number of keys in each block. |
| $\ell_{\mathrm{mid}}$ | Each mid part consists of $\ell_{\mathrm{mid}}$ bits. |
| $\ell_{\mathrm{lo}}$ | Each low part consists of $\ell_{\mathrm{lo}}$ bits. |
| $\mathrm{lo}_i$, $\mathrm{mid}_i$, $\mathrm{hi}_i$ | The values in the low, mid, and high parts. See Figure 1. |
| $d_i$ | The value in the mid and high parts together. It equals $\mathrm{hi}_i \cdot 2^{\ell_{\mathrm{mid}}} + \mathrm{mid}_i$. |
| $\psi_i$ | The prefix sum $\psi_i = d_1 + \cdots + d_i$. |
| $X$, $X_j$ | $X$ is the sum of all $d_i$ within a block; $X_j$ is its value in the $j$-th block. |
Partitioning into blocks
Let $S = \{x_1 < x_2 < \cdots < x_n\} \subseteq [U]$ be the set of keys we need to store. The first step of the construction is to divide the sequence into small blocks, and to store some inter-block data structure that reduces the entire FID problem to small-scale FID problems within each block.
Let $b$ be a parameter to be fixed later, and break the whole sequence into $\lceil n/b \rceil$ blocks of size $b$. As the sequence is monotonically increasing, the partition into blocks can be viewed as partitioning the possible range of the keys into disjoint intervals, where the $j$-th block corresponds to the interval $(x_{(j-1)b}, x_{jb}]$. When $n$ is not divisible by $b$, the size of the last block will be smaller than $b$. Fortunately, the same construction below works for any block size at most $b$, and we only illustrate our construction for blocks of size exactly $b$.
Our inter-block data structure consists of the following two parts:
-
The sequence of endpoints of the intervals, i.e., $x_b, x_{2b}, \ldots, x_n$.
-
A predecessor structure for the sequence of endpoints (Lemma 6), where each endpoint $x_{jb}$ is associated with the value $j$.
Given the above auxiliary information, we view each block as an FID with $b$ keys over a universe whose size is the length of the corresponding interval. When we perform a rank query Rank(x), the second part above helps us locate the block containing the queried element, transforming the original query into a Rank query in that block, within the query time of Lemma 6. When we perform a select query Select(i), it directly becomes a Select query within block $\lceil i/b \rceil$. The remaining question is how to answer Rank/Select queries within each block.
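In code, the inter-block reduction looks as follows (a sketch under our notation, with plain binary search standing in for the predecessor structure of Lemma 6 and a stub for the intra-block structures):

```python
from bisect import bisect_left, bisect_right

def route_rank(x, endpoints, blocks, b, n):
    """endpoints[j] = largest key of block j; blocks[j] = sorted keys of block j."""
    j = bisect_left(endpoints, x)        # stand-in for one predecessor query
    if j == len(blocks):                 # x exceeds every stored key
        return n
    return j * b + bisect_right(blocks[j], x)   # stub for the intra-block Rank

def route_select(i, blocks, b):          # i is 1-indexed
    j, r = divmod(i - 1, b)
    return blocks[j][r]                  # stub for the intra-block Select

keys, b = [3, 17, 42, 100, 105, 230], 2
blocks = [keys[i:i + b] for i in range(0, len(keys), b)]
endpoints = [blk[-1] for blk in blocks]
assert route_rank(50, endpoints, blocks, b, len(keys)) == 3
assert route_select(5, blocks, b) == 105
```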
Storing difference sequences within blocks
Throughout the remainder of this section, we mainly focus on the FID problem within a single block. Writing the keys of the block as $x_1 < \cdots < x_b$ (relative to the beginning of the block's interval), the intra-block FID problem requires us to maintain this sequence of keys over a universe whose size is the interval length. We cut the binary representation of each key into two parts, as shown in Figure 1(a):
-
Letting $\ell_{\mathrm{lo}}$ be a parameter to be determined, we define the low part as the $\ell_{\mathrm{lo}}$ least significant bits in the binary representation of each key $x_i$, denoted by the integers $\mathrm{lo}_i \in [2^{\ell_{\mathrm{lo}}}]$. We will directly store these integers in the data structure.
-
The remaining (more significant) bits are referred to as the mid-high part, denoted by the integers $y_i = \lfloor x_i / 2^{\ell_{\mathrm{lo}}} \rfloor$. For these integers, we aim to store the difference sequence, i.e., the differences $d_i = y_i - y_{i-1}$ between adjacent pairs of elements (with $y_0 = 0$).
Clearly, we have $x_i = y_i \cdot 2^{\ell_{\mathrm{lo}}} + \mathrm{lo}_i$. For short, we let $\psi_i$ denote the partial sum $\psi_i = d_1 + \cdots + d_i$, so that $\psi_i = y_i$. We further define $X$ as the sum of the entire difference sequence: $X = \sum_i d_i$. We denote by $X_j$ the value of $X$ in the $j$-th block. Clearly,

$$\sum_j X_j \le U / 2^{\ell_{\mathrm{lo}}}. \tag{1}$$
With these notations, the FID problem within a single block can be restated as follows:
Problem 2 (FID within a block).
Let $X$ be a parameter stored outside the data structure. We need to store two sequences $d_1, \ldots, d_b \ge 0$ and $\mathrm{lo}_1, \ldots, \mathrm{lo}_b \in [2^{\ell_{\mathrm{lo}}}]$ such that $\sum_i d_i = X$, with the property that $(\psi_i \cdot 2^{\ell_{\mathrm{lo}}} + \mathrm{lo}_i)_i$ forms a strictly increasing sequence, supporting the following queries:
-
PartialSum(i): Return $\psi_i \cdot 2^{\ell_{\mathrm{lo}}} + \mathrm{lo}_i$. It corresponds to the Select queries in the original FID problem.
-
Rank(q): Return the largest $i$ such that $\psi_i \cdot 2^{\ell_{\mathrm{lo}}} + \mathrm{lo}_i \le q$.
As introduced above, once we have a solution to Problem 2 with $O(t)$ query time, then with the help of the inter-block data structure, we may answer the Rank/Select queries on the original key sequence within the same time bound, up to the cost of one inter-block predecessor query.
The three-part partition
For technical reasons, we further divide each $d_i$ (i.e., each entry of the difference sequence of the mid-high part) into two smaller parts as follows. We will use different approaches to organize these parts in Section 3.1.
-
Let $\mathrm{mid}_i$ be the integer formed by the $\ell_{\mathrm{mid}}$ least significant bits of $d_i$, which we call the mid part.
-
Let $\mathrm{hi}_i$ be the integer formed by the remaining bits of $d_i$, which we call the high part.
Formally, $d_i = \mathrm{hi}_i \cdot 2^{\ell_{\mathrm{mid}}} + \mathrm{mid}_i$.
By now, we have cut the sequence of keys in a single block into three parts: the high part $(\mathrm{hi}_i)$, the mid part $(\mathrm{mid}_i)$, and the low part $(\mathrm{lo}_i)$, as shown in Figure 1(b). The following Subsection 3.1 will design (almost) separate data structures to store these three parts.
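To make the split concrete, here is a small worked instance with assumed widths $\ell_{\mathrm{lo}} = 4$ and $\ell_{\mathrm{mid}} = 3$ (the numbers are ours, purely illustrative):

$$x_i = y_i \cdot 2^{4} + \mathrm{lo}_i, \qquad d_i = y_i - y_{i-1} = 22 = \underbrace{2}_{\mathrm{hi}_i} \cdot 2^{3} + \underbrace{6}_{\mathrm{mid}_i},$$

so the key's four lowest bits go to the low part, while the difference of its remaining bits contributes $\mathrm{mid}_i = 6$ to the aB-tree of the mid part and a sparse nonzero entry $\mathrm{hi}_i = 2$ to the high part.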
3.1 Data structure within each block
In this subsection, we will design the three parts of the intra-block data structure. At a very high level, these three parts will interact as follows when performing PartialSum and Rank queries:
-
During the query PartialSum(i), we will compute the high-part prefix sum $\mathrm{hi}_1 + \cdots + \mathrm{hi}_i$, the mid-part prefix sum $\mathrm{mid}_1 + \cdots + \mathrm{mid}_i$, and $\mathrm{lo}_i$ from the high part, mid part, and low part of the data structure separately, and then combine them into the output.
-
During the Rank query, we sequentially examine the high part, mid part, and low part of the data structure. At each step, we narrow down the search interval for the desired index $i$, gradually approaching its exact value in the low part.
Below, we will provide the detailed construction of the three parts separately.
High part
The high-part sequence is sparse and thus easy to store. As
$$\sum_i \mathrm{hi}_i \cdot 2^{\ell_{\mathrm{mid}}} \le \sum_i d_i = X,$$
there are at most $X / 2^{\ell_{\mathrm{mid}}}$ non-zero entries in the high part. By choosing the parameter $\ell_{\mathrm{mid}}$ large enough, we can make the space usage of the high part so small that we can regard the entire high part as redundant information.
Let $T$ be the set of indices of all non-zero entries in the high part; then $|T| \le X / 2^{\ell_{\mathrm{mid}}}$. The high part of the data structure consists of the following two components:
-
A predecessor data structure for the set $T$, where each $i \in T$ is associated with the high-part prefix sum $\mathrm{hi}_1 + \cdots + \mathrm{hi}_i$.
-
A predecessor data structure for the set $\{\psi_i : i \in T\}$, where each $\psi_i$ is associated with $i$. (Note that all the $\psi_i$ are distinct for $i \in T$, as $d_i \ge 2^{\ell_{\mathrm{mid}}} > 0$ for all $i \in T$; hence this is a valid set.)
According to Lemma 6, there is a compact implementation of these predecessor data structures, whose space is small enough to be counted entirely toward the redundancy and whose query time is that of Lemma 6.
Mid part
We will store the mid-part sequence in an aB-tree with branching factor $B$ and size $b$, using Lemma 5. (When the block size is smaller than $b$, we use an incomplete aB-tree; this is why we need Lemma 5 to support incomplete aB-trees.) The details of the aB-tree are as follows.
-
Each leaf of the aB-tree contains a single mid-part entry $\mathrm{mid}_i$.
-
There is a label on each node that equals the sum of all the leaves in its subtree. As $\mathrm{mid}_i < 2^{\ell_{\mathrm{mid}}}$ for each $i$, every label is less than $b \cdot 2^{\ell_{\mathrm{mid}}}$.
-
The premise of Lemma 5 on the branching factor can be satisfied by setting $B$ appropriately as a function of $w$ and $t$, so that the query time of the aB-tree is $O(\log_B b) = O(t)$.
-
The size of the lookup table is a bounded number of words, as given by Lemma 5, and the table will be shared between all blocks.
-
The aB-tree can support three types of queries within $O(\log_B b)$ time:
-
1.
Given $i$, query the prefix sum $\mathrm{mid}_1 + \cdots + \mathrm{mid}_i$.
-
2.
Given a threshold $v$, query the largest index $i$ with $\mathrm{mid}_1 + \cdots + \mathrm{mid}_i \le v$.
-
3.
Given a value $v$, query the maximal interval of the sequence with respect to $v$, defined as follows.
Definition 8.
The maximal interval of a (non-negative) sequence with respect to a value $v$ is defined as the interval $[l, r]$ formed by all indices $i$ whose prefix sum equals $v$. Writing $s_i$ for the prefix sums, within the maximal interval we have $s_l = s_{l+1} = \cdots = s_r = v$, while $s_{l-1} < v$ and $s_{r+1} > v$.
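On a plain prefix-sum array, Definition 8 amounts to the following computation (our illustration; the aB-tree answers the same question along a root-to-leaf path without materializing the prefix sums):

```python
from bisect import bisect_left, bisect_right

def maximal_interval(prefix_sums, v):
    """prefix_sums: non-decreasing s_1 <= s_2 <= ... (prefix sums of a
    non-negative sequence). Returns the maximal [l, r] (1-indexed) with
    s_l = ... = s_r = v, or None if no prefix sum equals v."""
    l = bisect_left(prefix_sums, v)
    r = bisect_right(prefix_sums, v)
    return (l + 1, r) if l < r else None

assert maximal_interval([2, 5, 5, 5, 9], 5) == (2, 4)
```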
We now compute the space usage of the mid part. To use Lemma 5, we need to first store the root label before storing the aB-tree, as Lemma 5 assumes free access to the root label. After that, the aB-tree will occupy $\log_2 N(\phi) + 2$ bits, where $N(\phi)$ is the number of sequences with root label $\phi$. Recalling that the root label $\phi$ is the sum of the stored sequence, the number of possible sequences with sum $\phi$ is at most $\binom{\phi + b - 1}{b - 1}$ by stars and bars. Therefore, the number of bits used by the mid part (per block) is at most $\log_2 \binom{\phi + b - 1}{b - 1} + 2$, plus the space for the root label.
Low part
In this basic data structure, the low part is directly stored in an array. Specifically, we store the sequence $\mathrm{lo}_1, \ldots, \mathrm{lo}_b$ one by one, as integers of $\ell_{\mathrm{lo}}$ bits each, in the memory.
Note that the low-part sequence has the following locally increasing property which we will use in our query algorithms:
-
Restricted to a maximal interval of the difference sequence $(d_i)$, the subsequence of $(\mathrm{lo}_i)$ is strictly increasing. This is because, by the condition of Problem 2, the sequence $(\psi_i \cdot 2^{\ell_{\mathrm{lo}}} + \mathrm{lo}_i)$ is strictly increasing, whereas $\psi_i$ stays constant within the maximal interval, by definition.
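A sketch of the low part as a packed array (our illustration, with one big Python integer standing in for raw memory; each read costs O(1) word operations):

```python
class PackedArray:
    """n fixed-width entries stored back-to-back."""
    def __init__(self, values, width):
        self.width, self.mask, self.buf = width, (1 << width) - 1, 0
        for i, v in enumerate(values):
            assert 0 <= v <= self.mask
            self.buf |= v << (i * width)

    def read(self, i):
        return (self.buf >> (i * self.width)) & self.mask

low = PackedArray([5, 9, 13, 2], width=4)
assert low.read(2) == 13
```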
Query algorithms
Now we can formally state our algorithms for the PartialSum and Rank queries in this intra-block data structure.
For the PartialSum(i) query, we will query the high part, mid part, and low part of the data structure separately to get $\mathrm{hi}_1 + \cdots + \mathrm{hi}_i$, $\mathrm{mid}_1 + \cdots + \mathrm{mid}_i$, and $\mathrm{lo}_i$:
-
To get the high-part prefix sum, we query the first predecessor data structure of the high part, obtaining the largest index $j \in T$ with $j \le i$, along with its associated value $\mathrm{hi}_1 + \cdots + \mathrm{hi}_j$. As $j$ is the last index up to $i$ with a nonzero high part, we have $\mathrm{hi}_1 + \cdots + \mathrm{hi}_i = \mathrm{hi}_1 + \cdots + \mathrm{hi}_j$, as desired. This takes one predecessor query.
-
To get the mid-part prefix sum, we directly query the aB-tree of the mid part, which takes $O(\log_B b)$ time.
-
To get $\mathrm{lo}_i$, we directly read it out from the integer array of the low part, which takes constant time.
Finally, the algorithm returns $\bigl((\mathrm{hi}_1 + \cdots + \mathrm{hi}_i) \cdot 2^{\ell_{\mathrm{mid}}} + (\mathrm{mid}_1 + \cdots + \mathrm{mid}_i)\bigr) \cdot 2^{\ell_{\mathrm{lo}}} + \mathrm{lo}_i$. The total time cost is $O(\log_B b)$ plus one predecessor query.
For the Rank query, we will sequentially query the high part, mid part, and low part of the data structure to narrow down the search interval for the desired index $i$:
-
We first query the high part to locate the query within a maximal interval of the high-part sequence. This is achieved by querying the second predecessor data structure of the high part, from which we get two adjacent indices $i_1 < i_2$ of $T$ whose $\psi$-values sandwich the query threshold, locating the desired index within $[i_1, i_2 - 1]$. Moreover, as $i_1, i_2$ are adjacent indices in $T$, the interval $[i_1, i_2 - 1]$ forms a maximal interval of the high-part sequence with respect to the value $\mathrm{hi}_1 + \cdots + \mathrm{hi}_{i_1}$. This takes one predecessor query.
-
Recall that the mid-part step needs to find the largest index $i$ whose prefix sum $\psi_i$ does not exceed a threshold. As all indices in the interval found above share the same high-part prefix sum, this is equivalent to finding the largest $i$ such that the mid-part prefix sum $\mathrm{mid}_1 + \cdots + \mathrm{mid}_i$ does not exceed a corresponding threshold $v$. This can be done with one query to the aB-tree in the mid part (query type 2).
-
–
If the index $i$ we found satisfies the strict inequality (i.e., its prefix sum is strictly below the threshold $v$), we conclude the query with answer $i$, because no matter what the low parts are, we already know the $i$-th key is strictly smaller than the queried key $q$, and the $(i+1)$-st key is larger than $q$.
-
–
If the prefix sum at $i$ is equal to the threshold $v$, we find the maximal interval of the mid-part sequence with respect to $v$ by querying the aB-tree again (query type 3). In this case, we can locate the answer to the query within this interval, but the concrete answer depends on the information in the low part.
In both cases, the mid-part step takes $O(\log_B b)$ time.
-
Assume the previous step encounters the second case, where an interval $[l, r]$ is known to have the same $\psi_i$ for all $i \in [l, r]$, and the answer to the query is guaranteed to reside in $[l - 1, r]$. Therefore, we need to find the largest $i \in [l, r]$ such that $\mathrm{lo}_i \le q_{\mathrm{lo}}$, the low bits of the query, and that $i$ will be the answer to the query. (If even $\mathrm{lo}_l$ is larger than $q_{\mathrm{lo}}$, the answer should be $l - 1$.) Since $(\mathrm{lo}_i)$ is strictly increasing on $[l, r]$, we can use binary search to find $i$ within $O(\log m)$ time, where $m = r - l + 1$; as $m \le b$, the running time of this step is bounded by $O(\log b)$.
Hence, the total time cost of the Rank query is at most $O(\log_B b + \log b)$ plus the predecessor queries.
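The low-part step is then a plain binary search over a strictly increasing range of packed entries (a sketch with illustrative names; read(k) is the packed-array access from the previous sketch):

```python
def low_part_rank(read, l, r, q_low):
    """Largest k in [l, r] with read(k) <= q_low, or l - 1 if none; uses
    O(log(r - l + 1)) probes since read is strictly increasing on [l, r]."""
    lo, hi, ans = l, r, l - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if read(mid) <= q_low:
            ans, lo = mid, mid + 1
        else:
            hi = mid - 1
    return ans

vals = [3, 8, 11, 14]                      # strictly increasing low parts
assert low_part_rank(lambda k: vals[k], 0, 3, 12) == 2
```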
Space usage of the block
Recalling the space bounds computed above for the high part, the mid part ($\log_2\binom{\phi + b - 1}{b - 1}$ plus lower-order terms), and the low part ($b \cdot \ell_{\mathrm{lo}}$ bits), the total space used by the intra-block data structure is the sum of these three contributions; we refer to this bound as (2).
3.2 Performance of the data structure
In this subsection, we will combine all parts of our construction to get the basic data structure, and check its time complexity, lookup table size, and redundancy to complete the proof of Theorem 7.
Time complexity
Recall that our query algorithm first queries the inter-block data structure to locate the block to query (one predecessor query), and then queries one intra-block data structure to get the answer. Summing up these bounds, the whole data structure answers each query within the time bound claimed in Theorem 7.
Lookup table size
The only lookup table used by our data structure occurs in the mid part within each block, for the aB-trees of Lemma 5. As computed before, this lookup table is shared between all blocks, and its size is within the bound claimed in Theorem 7, as desired.
Redundancy
The redundancy of the whole FID consists of the following two parts:
-
The entire inter-block data structure is regarded as redundant. To store the endpoint sequence, we need to store $n/b$ keys from $[U]$, which occupies $O((n/b) \log U)$ bits. To store the predecessor data structure for the endpoint sequence, we also need $O((n/b) \log U)$ bits according to Lemma 6. Choosing the block size $b$ large enough, this inter-block space is at most the desired redundancy $R$, (3) and thus this part is covered by the desired amount of redundancy.
-
The redundancy caused by intra-block data structures, which we calculate below.
Recall that (2) upper-bounds the space usage of each intra-block data structure. Taking a summation of (2) over all blocks, we get the total space consumption of the intra-block data structures, (4), where the first inequality is due to (1), and the second inequality holds because the inter-block terms are absorbed by (3). We further obtain a bound (5) on the dominant term. Comparing this quantity with the information-theoretic lower bound for FIDs, we get (6), where the last inequality follows from the constraints on the parameters. By plugging (5) and (6) into (4), the total space usage of all intra-block data structures is at most the information-theoretic optimum plus the desired redundancy, where the remaining lower-order term is covered according to (3).
In conclusion, the total redundancy of our data structure for FIDs is bounded by the desired amount. Since we have constructed an FID with the claimed time complexity, lookup table size, and redundancy, Theorem 7 follows.
4 Advanced data structure for FIDs
Proof of Theorem 1.
We start by reviewing the bottleneck of Theorem 7. In the basic data structure, the time bottleneck is that Rank queries take $O(\log b)$ time in the low-part step. Specifically, suppose we are required to answer a Rank query. The inter-block data structure, together with the high and mid parts of the intra-block data structure, locates the desired index within a maximal interval $[l, r]$ of the difference sequence. Then, in the case where the low-part step is required, we need to further determine the maximum index $i \in [l, r]$ such that $\mathrm{lo}_i \le q_{\mathrm{lo}}$. This can be formulated as a Rank query on the increasing subsequence $\mathrm{lo}_l, \ldots, \mathrm{lo}_r$:
-
Rank(q): Return the largest index $i \in [l, r]$ such that $\mathrm{lo}_i \le q$.
In the basic data structure, we directly store the subsequence as a sorted array using $m \cdot \ell_{\mathrm{lo}}$ bits and support Rank queries by binary search within $O(\log m)$ time, where $m = r - l + 1$ is the length of the subsequence. However, in the worst case, $m$ can be as large as the block size $b$, which makes the time of the binary search $O(\log b)$. To address this bottleneck, we change the storage scheme of the low part below.
New construction for the low part
The key observation is that storing the subsequence to support Rank and PartialSum queries can be viewed as managing a smaller FID. Hence, we have the following unified data structure to store it:
-
If the length $m$ is below a threshold, then we still store the subsequence as a sorted array within $m \cdot \ell_{\mathrm{lo}}$ bits.
-
Otherwise, we will use the basic data structure in Theorem 7 (as a subroutine) to store the subsequence, with its parameters to be determined. (Theorem 7 requires that the universe size of the basic data structure be at least a large polynomial in the number of keys; this condition holds by our choice of the threshold and the parameters.) As the subsequence consists of $m$ elements, the basic data structure will have query time $O(t)$, and will use at most $m \cdot \ell_{\mathrm{lo}}$ bits of space, where the threshold on $m$ ensures that the leading term $m \cdot \ell_{\mathrm{lo}}$ is asymptotically larger than the redundancy and the lookup-table overhead. Hence, in this case, the basic data structure storing the subsequence also fits in $m \cdot \ell_{\mathrm{lo}}$ bits. We pad 0's to the end of the encoding of the basic data structure until it occupies exactly $m \cdot \ell_{\mathrm{lo}}$ bits, ensuring memory alignment.
Clearly, this unified data structure allows us to answer Rank and Select queries on the subsequence within $O(t)$ time:
-
If $m$ is below the threshold, the subsequence is stored as a sorted array. For Rank queries, we can use binary search to get the desired index, which takes $O(\log m) = O(t)$ time by the choice of the threshold; for Select queries, we can directly read out the desired entry in constant time.
-
Otherwise, the subsequence is stored using the basic data structure in Theorem 7, which supports Rank and Select queries within $O(t)$ time.
Further, we can store the entire low part using the above unified data structure for subsequences. Specifically, we first divide $[1, b]$ into the maximal intervals of the difference sequence. For each maximal interval of length $m$, we store the corresponding subsequence of the low part using the unified data structure in exactly $m \cdot \ell_{\mathrm{lo}}$ bits. Finally, we concatenate all these unified data structures from the leftmost maximal interval to the rightmost one, obtaining a string of $b \cdot \ell_{\mathrm{lo}}$ bits, which serves as the encoding of the entire low part. A sketch of the encoding decision appears below.
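The promised sketch of the unified encoding decision (our illustration; BasicFID stands in for the Theorem 7 subroutine, and the threshold name is ours):

```python
from bisect import bisect_right

class BasicFID:                      # stand-in for the Theorem 7 structure
    def __init__(self, keys):
        self.keys = sorted(keys)
    def rank(self, q):               # in the real structure: O(t) time
        return bisect_right(self.keys, q)

THRESHOLD = 8                        # stands in for the paper's length cutoff

def encode_interval(lows):
    """Short intervals: sorted array (binary search); long ones: basic FID."""
    if len(lows) <= THRESHOLD:
        return ("array", sorted(lows))
    return ("fid", BasicFID(lows))

def interval_rank(enc, q_low):
    kind, payload = enc
    return bisect_right(payload, q_low) if kind == "array" else payload.rank(q_low)

enc = encode_interval([3, 8, 11, 14])
assert interval_rank(enc, 12) == 3
```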
Finally, to obtain the advanced data structure, we start with the basic data structure and replace the encoding of the low part with the new construction above (the concatenation of unified data structures). As our new construction for the low part occupies the same space ($b \cdot \ell_{\mathrm{lo}}$ bits) as before, the redundancy of this data structure is unchanged. What makes the advanced data structure special is that the "basic data structure" appears twice: once as the overall framework, and once as the subroutine for maximal intervals in the low part. By embedding small instances of the basic data structure within a larger framework of the same basic data structure in a non-recursive manner, we improve the query time of FIDs without introducing any additional redundancy, as described below.
Query algorithms
The query algorithms for our advanced data structure are similar to those of the basic data structure in Section 3: We first use the inter-block information to transform the original queries on the entire FID into Rank/PartialSum queries within each block. Then, a three-stage process involving the high, mid, and low parts obtains the answer to the query. The only difference is that, before we access anything in the low part, we need to first compute the maximal interval that we plan to access; by comparing the length of the interval with the threshold, we determine whether the unified data structure for that maximal interval is stored as a sorted array or as a basic data structure. The algorithms to answer Rank/PartialSum queries within each block are explained below.
-
For the Rank query, using the high and mid parts of the data structure, we can either answer the query directly or locate the desired index within a maximal interval $[l, r]$. In the latter case, we read the encoding of the unified data structure for $[l, r]$, which resides in bits $(l - 1) \cdot \ell_{\mathrm{lo}} + 1$ through $r \cdot \ell_{\mathrm{lo}}$ of the encoding of the entire low part. Recall that comparing $m = r - l + 1$ with the threshold tells us whether the unified data structure is stored as a sorted array or as a basic data structure for FIDs. After that, we perform the Rank query on the unified data structure to obtain the desired index within $O(t)$ time.
-
For the PartialSum query with index $i$, we first get $\psi_i$ from the mid and high parts. Then, instead of directly reading $\mathrm{lo}_i$ from the low part as before, we now also need to know the maximal interval containing $i$ in order to access the low part. We use an algorithm similar to the Rank query to compute this maximal interval, and to extract the encoding of the corresponding subsequence. By performing a Select query on the unified data structure of this subsequence, we get $\mathrm{lo}_i$ in $O(t)$ time.
Both types of queries take $O(t)$ time, because the aB-trees in the mid part take $O(\log_B b) = O(t)$ time, while the other steps (including the high and low parts and the inter-block data structure) also fit within $O(t)$ time per query. This meets the requirement of Theorem 1.
Space of the lookup table
Finally, we check that the size of the lookup table introduced by Theorem 7 is also dominated by the desired redundancy. Recall that the lookup table consists of a bounded number of words depending only on the aB-tree parameters. Since the word size $w$ is large enough, we can choose the parameters so that the number of bits in the lookup table is significantly smaller than the redundancy, as desired.
In summary, we obtain a data structure for static FIDs with query time $O(t)$ and the desired redundancy, which concludes the proof of Theorem 1.
5 Select and partial sum
In this section, we prove Theorems 3 and 4 by adjusting the basic data structure introduced in Section 3. We will rely on the predecessor data structure from [40] when the set to store is relatively dense:
Lemma 9 (Similar to Lemma 6, see [40]).
When the set to store is sufficiently dense in its range, there is a predecessor data structure with associated values that uses space proportional to the range size and answers queries in constant time.
We follow the same notation as Section 3 in the following proofs.
5.1 Select dictionaries
Recall Theorem 3, restated from Section 1.
Proof.
Recall that we have divided the binary representations of the keys into the high, mid, and low parts, and that in the high part, for each block of keys, we stored a predecessor data structure (Lemma 6) to compute high-part prefix sums; each such predecessor query exceeds the time budget when $t$ is small. This is the only step exceeding $O(t)$ time in the process of answering Select queries.
Instead of storing predecessor data structures for each block separately, here we store one large predecessor data structure for all keys with nonzero high parts. It achieves the same functionality of computing high-part prefix sums. The number of elements with nonzero high parts is bounded as in Section 3, so the stored set is relatively dense in its range. (If the number of nonzero entries is too small, we add dummy elements.) According to Lemma 9, this predecessor data structure answers each query in constant time, and its space fits in our desired redundancy. Other parts of the data structure remain the same as in Section 3.
When we perform a Select query, the above predecessor data structure in the high part computes the prefix sum of the high part of the difference sequence in constant time; the aB-tree in the mid part takes $O(\log_B b) = O(t)$ time to return the prefix sum of the mid part (within its block); finally, the low part is read directly to obtain the low bits of the target key. The entire process takes $O(t)$ time.
5.2 Partial sum on integer sequences
Recall Theorem 4, restated from Section 1.
Proof.
Let $s_i = a_1 + \cdots + a_i$ be the partial-sum sequence of the input $a_1, \ldots, a_n$. The partial-sum problem is equivalent to storing the (multi-)set $\{s_1, \ldots, s_n\}$ of keys supporting Select queries, i.e., a select dictionary. The only distinction is that, in this problem, the difference between any two adjacent keys is bounded by the maximum value of an input entry. The data structure we design for partial-sum is similar to that of the select dictionaries, except that we adjust the parameters and change the number of bits in the high, mid, and low parts:
-
There is no high part.
-
Let $\ell$ be a parameter slightly smaller than the bit-width of the input entries. We call the $\ell$ least significant bits of each $s_i$ the low part, and store these bits directly in an array.
-
The remaining bits are called the mid part. In their difference sequence, each entry equals either the most significant bits of the corresponding input entry (i.e., $\lfloor a_i / 2^{\ell} \rfloor$) or that value plus one (due to a carry from the low bits), and is thus bounded. As in Section 3, we divide this difference sequence into blocks and use aB-trees to store them, supporting prefix-sum queries; a sketch of this layout appears after this list.
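Here is the flavor of the layout in code (a sketch under our notation, with a linear in-block scan standing in for the aB-tree):

```python
def build(a, ell, block):
    """Split prefix sums s_i into ell low bits (stored verbatim) and mid bits,
    and keep the mid difference sequence with per-block prefix sums."""
    s, acc = [], 0
    for x in a:
        acc += x
        s.append(acc)
    low = [v & ((1 << ell) - 1) for v in s]
    mid = [v >> ell for v in s]
    d = [mid[0]] + [mid[i] - mid[i - 1] for i in range(1, len(mid))]
    block_prefix = [sum(d[:j]) for j in range(0, len(d) + 1, block)]
    return low, d, block_prefix

def partial_sum(i, low, d, block_prefix, ell, block):
    j = (i + 1) // block                                  # complete blocks before i
    mid_i = block_prefix[j] + sum(d[j * block : i + 1])   # aB-tree in the paper
    return (mid_i << ell) | low[i]

low, d, bp = build([3, 1, 4, 1, 5], ell=2, block=2)
assert partial_sum(3, low, d, bp, ell=2, block=2) == 3 + 1 + 4 + 1
```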
Recall that $N(\phi)$ represents the number of instances for an aB-tree with a given size and root label $\phi$ (i.e., arrays whose entries sum to $\phi$), which is bounded by stars and bars as before. The space usage of the mid part is thus $\log_2 N(\phi) + 2$ bits per block, plus the space to store the root label of the aB-tree. Taking a summation of the space usage over all blocks, and including the bits taken by the array in the low part, the inter-block information, and the lookup tables, the total space occupied by the data structure is at most the bound claimed in Theorem 4, as desired. Similar to Theorem 3, each query takes $O(t)$ time.
References
- [1] Sepehr Assadi, Martín Farach-Colton, and William Kuszmaul. Tight bounds for monotone minimal perfect hashing. ACM Transactions on Algorithms, page 3677608, August 2024. See also SODA 2023. doi:10.1145/3677608.
- [2] Jérémy Barbay, Francisco Claude, Travis Gagie, Gonzalo Navarro, and Yakov Nekrich. Efficient fully-compressed sequence representations. Algorithmica, 69(1):232–268, May 2014. doi:10.1007/s00453-012-9726-3.
- [3] Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna. Monotone minimal perfect hashing: Searching a sorted table with O(1) accesses. In Proc. 20th ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 785–794, 2009. doi:10.1137/1.9781611973068.86.
- [4] Djamal Belazzougui and Gonzalo Navarro. Optimal lower and upper bounds for representing sequences. ACM Transactions on Algorithms, 11(4):1–21, June 2015. doi:10.1145/2629339.
- [5] Michael A. Bender, Martín Farach-Colton, John Kuszmaul, and William Kuszmaul. Modern hashing made simple. In Proc. 7th Symposium on Simplicity in Algorithms (SOSA), pages 363–373, 2024. doi:10.1137/1.9781611977936.33.
- [6] Michael A. Bender, Martín Farach-Colton, John Kuszmaul, William Kuszmaul, and Mingmou Liu. On the optimal time/space tradeoff for hash tables. In Proc. 54th ACM SIGACT Symposium on Theory of Computing (STOC), pages 1284–1297, 2022. doi:10.1145/3519935.3519969.
- [7] Daniel K. Blandford and Guy E. Blelloch. Compact dictionaries for variable-length keys and data with applications. ACM Transactions on Algorithms, 4(2):17:1–17:25, May 2008. doi:10.1145/1361192.1361194.
- [8] H. Buhrman, P. B. Miltersen, J. Radhakrishnan, and S. Venkatesh. Are bitvectors optimal? SIAM Journal on Computing, 31(6):1723–1744, January 2002. doi:10.1137/S0097539702405292.
- [9] David R. Clark. Compact PAT trees. PhD thesis, University of Waterloo, 1996.
- [10] David R. Clark and J. Ian Munro. Efficient suffix trees on secondary storage. In Proc. 7th ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 383–391, 1996.
- [11] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, fourth edition. MIT Press, April 2022.
- [12] P. Ferragina, F. Luccio, G. Manzini, and S. Muthukrishnan. Structuring labeled trees for optimal succinctness, and beyond. In Proc. 46th IEEE Symposium on Foundations of Computer Science (FOCS), pages 184–193, 2005. doi:10.1109/SFCS.2005.69.
- [13] Paolo Ferragina, Roberto Grossi, Ankur Gupta, Rahul Shah, and Jeffrey Scott Vitter. On searching compressed string collections cache-obliviously. In Proc. 27th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), pages 181–190, 2008. doi:10.1145/1376916.1376943.
- [14] Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM, 52(4):552–581, July 2005. doi:10.1145/1082036.1082039.
- [15] Paolo Ferragina, Giovanni Manzini, Veli Mäkinen, and Gonzalo Navarro. Compressed representations of sequences and full-text indexes. ACM Transactions on Algorithms, 3(2):20–es, May 2007. doi:10.1145/1240233.1240243.
- [16] Michael L. Fredman, János Komlós, and Endre Szemerédi. Storing a sparse table with O(1) worst case access time. Journal of the ACM, 31(3):538–544, June 1984. doi:10.1145/828.1884.
- [17] Richard F. Geary, Naila Rahman, Rajeev Raman, and Venkatesh Raman. A simple optimal representation for balanced parentheses. Theoretical Computer Science, 368(3):231–246, December 2006. doi:10.1016/j.tcs.2006.09.014.
- [18] Richard F. Geary, Rajeev Raman, and Venkatesh Raman. Succinct ordinal trees with level-ancestor queries. ACM Transactions on Algorithms, 2(4):510–534, October 2006. doi:10.1145/1198513.1198516.
- [19] Alexander Golynski, J. Ian Munro, and Srinivasa Rao Satti. Rank/select operations on large alphabets: a tool for text indexing. In Proc. 17th ACM-SIAM Symposium on Discrete Algorithm (SODA), pages 368–373, 2006. doi:10.1145/1109557.1109599.
- [20] Alexander Golynski, Rajeev Raman, and Srinivasa Rao Satti. On the redundancy of succinct data structures. In Proc. 11th Scandinavian Workshop on Algorithm Theory (SWAT), pages 148–159, 2008. doi:10.1007/978-3-540-69903-3_15.
- [21] Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. High-order entropy-compressed text indexes. In Proc. 14th ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 841–850, 2003. URL: http://dl.acm.org/citation.cfm?id=644108.644250.
- [22] Roberto Grossi, Alessio Orlandi, and Rajeev Raman. Optimal trade-offs for succinct string indexes. In Proc. 37th International Colloquium Conference on Automata, Languages and Programming (ICALP), pages 678–689, 2010. doi:10.1007/978-3-642-14165-2_57.
- [23] Roberto Grossi, Alessio Orlandi, Rajeev Raman, and Srinivasa Rao Satti. More haste, less waste: lowering the redundancy in fully indexable dictionaries. In Proc. 26th International Symposium on Theoretical Aspects of Computer Science (STACS), pages 517–528, 2009. doi:10.4230/LIPICS.STACS.2009.1847.
- [24] Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing, 35(2):378–407, January 2005. doi:10.1137/S0097539702402354.
- [25] Ankur Gupta, Wing-Kai Hon, Rahul Shah, and Jeffrey Scott Vitter. Compressed data structures: Dictionaries and data-aware measures. Theoretical Computer Science, 387(3):313–331, 2007. doi:10.1016/j.tcs.2007.07.042.
- [26] Wing-Kai Hon, Kunihiko Sadakane, and Wing-Kin Sung. Breaking a time-and-space barrier in constructing full-text indices. SIAM Journal on Computing, 38(6):2162–2178, January 2009. doi:10.1137/070685373.
- [27] Yang Hu, Jingxun Liang, Huacheng Yu, Junkai Zhang, and Renfei Zhou. Optimal static dictionary with worst-case constant query time. In Proc. 57th ACM Symposium on Theory of Computing (STOC), 2025.
- [28] Guy Joseph Jacobson. Succinct static data structures. PhD thesis, Carnegie Mellon University, 1988.
- [29] Dmitry Kosolobov. Simplified tight bounds for monotone minimal perfect hashing. In Proc. 35th Symposium on Combinatorial Pattern Matching (CPM), pages 19:1–19:13, 2024. doi:10.4230/LIPICS.CPM.2024.19.
- [30] Tianxiao Li, Jingxun Liang, Huacheng Yu, and Renfei Zhou. Dynamic “succincter”. In Proc. 64th IEEE Symposium on Foundations of Computer Science (FOCS), pages 1715–1733, 2023. doi:10.1109/FOCS57990.2023.00104.
- [31] Tianxiao Li, Jingxun Liang, Huacheng Yu, and Renfei Zhou. Tight cell-probe lower bounds for dynamic succinct dictionaries. In Proc. 64th IEEE Symposium on Foundations of Computer Science (FOCS), pages 1842–1862, 2023. doi:10.1109/FOCS57990.2023.00112.
- [32] Tianxiao Li, Jingxun Liang, Huacheng Yu, and Renfei Zhou. Dynamic dictionary with subconstant wasted bits per key. In Proc. 35th ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 171–207, 2024. doi:10.1137/1.9781611977912.9.
- [33] J. Ian Munro. Tables. In Proc. 16th Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS), pages 37–42, 1996. doi:10.1007/3-540-62034-6_35.
- [34] J. Ian Munro, Rajeev Raman, Venkatesh Raman, and Srinivasa Rao Satti. Succinct representations of permutations and functions. Theoretical Computer Science, 438:74–88, June 2012. doi:10.1016/j.tcs.2012.03.005.
- [35] J. Ian Munro and Venkatesh Raman. Succinct representation of balanced parentheses and static trees. SIAM Journal on Computing, 31(3):762–776, 2001. doi:10.1137/S0097539799364092.
- [36] J. Ian Munro, Venkatesh Raman, and S. Srinivasa Rao. Space efficient suffix trees. Journal of Algorithms, 39(2):205–222, 2001. doi:10.1006/jagm.2000.1151.
- [37] Gonzalo Navarro and Veli Mäkinen. Compressed full-text indexes. ACM Computing Surveys, 39(1):2–es, April 2007. doi:10.1145/1216370.1216372.
- [38] Gonzalo Navarro and Javiel Rojas-Ledesma. Predecessor search. ACM Computing Surveys, 53(5):1–35, September 2021. doi:10.1145/3409371.
- [39] Mihai Pǎtraşcu. Succincter. In Proc. 49th IEEE Symposium on Foundations of Computer Science (FOCS), pages 305–313, 2008.
- [40] Mihai Pǎtraşcu and Mikkel Thorup. Time-space trade-offs for predecessor search. In Proc. 38th ACM Symposium on Theory of Computing (STOC), pages 232–240, 2006. doi:10.1145/1132516.1132551.
- [41] Mihai Pǎtraşcu and Emanuele Viola. Cell-probe lower bounds for succinct partial sums. In Proc. 21st ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 117–122, 2010.
- [42] Giulio Ermanno Pibiri and Rossano Venturini. Dynamic Elias-Fano representation. In Proc. 28th Annual Symposium on Combinatorial Pattern Matching (CPM), pages 30:1–30:14, 2017. doi:10.4230/LIPIcs.CPM.2017.30.
- [43] Rajeev Raman, Venkatesh Raman, and Srinivasa Rao Satti. Succinct dynamic data structures. In Proc. 7th International Workshop on Algorithms and Data Structures (WADS), pages 426–437, 2001.
- [44] Rajeev Raman, Venkatesh Raman, and Srinivasa Rao Satti. Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Transactions on Algorithms, 3(4):43, November 2007. See also SODA 2002. doi:10.1145/1290672.1290680.
- [45] Rajeev Raman and Srinivasa Rao Satti. Succinct dynamic dictionaries and trees. In Proc. 30th International Colloquium on Automata, Languages and Programming (ICALP), pages 357–368, 2003.
- [46] Emanuele Viola. New sampling lower bounds via the separator. In Proc. 38th Computational Complexity Conference (CCC), pages 26:1–26:23, 2023. doi:10.4230/LIPIcs.CCC.2023.26.
- [47] Huacheng Yu. Optimal succinct rank data structure via approximate nonnegative tensor decomposition. In Proc. 51st ACM SIGACT Symposium on Theory of Computing (STOC), pages 955–966, 2019. doi:10.1145/3313276.3316352.
- [48] Huacheng Yu. Nearly optimal static Las Vegas succinct dictionary. In Proc. 52nd ACM SIGACT Symposium on Theory of Computing (STOC), pages 1389–1401, 2020. doi:10.1145/3357713.3384274.