
Succinct Rank Dictionaries Revisited

Saska Dönges, Department of Computer Science, University of Helsinki, Finland
Simon J. Puglisi, Department of Computer Science, University of Helsinki, Finland
Abstract

We study data structures for representing sets of m elements drawn from the universe [0..n−1] that support access and rank queries. A classical approach to this problem, foundational to the fields of succinct and compact data structures, is to represent the set as a bitvector X of n bits, where X[i] = 1 iff i is a member of the set. Our particular focus in this paper is on structures taking $\log_2\binom{n}{m} + o(n)$ bits, which stem from the so-called RRR bitvector scheme (Raman et al., ACM Trans. Alg., 2007). In RRR bitvectors, X is conceptually divided into n/b blocks of b bits each. A block containing c 1-bits is then encoded using $\lceil\log_2(b+1)\rceil + \lceil\log_2\binom{b}{c}\rceil$ bits, where the $\lceil\log_2(b+1)\rceil$ bits are used to encode c, and the $\lceil\log_2\binom{b}{c}\rceil$ bits are used to say which of the $\binom{b}{c}$ possible combinations the block represents. In all existing RRR implementations the code assigned to a block is its lexicographical rank amongst the $\binom{b}{c}$ combinations of its class. In this paper we explore alternative non-lexicographical assignments of codes to blocks. We show these approaches can lead to faster query times and offer relevant space-time trade-offs in practice compared to state-of-the-art implementations (Gog and Petri, Software, Prac. & Exp., 2014) from the Succinct Data Structures Library.

Keywords and phrases:
data structures, data compression, succinct data structures, compressed data structures, weighted de Bruijn sequence, text indexing, string algorithms
Copyright and License:
© Saska Dönges and Simon J. Puglisi; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation Design and analysis of algorithms
Supplementary Material:
Software  (Source code): https://github.com/saskeli/h0_bv [3]
  archived at Software Heritage: swh:1:dir:383ab8d9247150886cfacd830469e5e5bdd72551
Acknowledgements:
We thank Zsuzsanna Lipták, Taku Onodera, and Rajeev Raman for interesting discussions, and the Finnish Computing Competence Infrastructure (FCCI) for supporting this project with computational and data storage resources.
Editors:
Petra Mutzel and Nicola Prezza

1 Introduction

Bit vectors supporting rank and access queries are essential to the field of compact data structures [15, 11] and, as such, have seen intense research in the past two decades.

We are given a sequence X[0..n−1] of n bits, m of which are set to 1, and should preprocess X so that the following queries can be answered quickly:

access(i) = X[i]
rank(i) = the number of 1s among the first i bits of X.

Another related query is select(i), which returns the position of the ith 1-bit, but it is less used than the other two queries, particularly in the context of text indexing using the backward search algorithm.

Classical methods use n + o(n) bits and leave the input bit string intact, building small structures on top of it to support rank queries in constant time [6, 14]. These methods are extremely fast in practice [5, 8], but incur large space overheads on sparse or otherwise compressible inputs. The Elias-Fano representation, used within web search engines [21, 17] and full-text indexes [10], takes $2m + m\lceil\log(n/m)\rceil$ bits and supports access in O(1) time and rank in O(log(n/m)) time if o(n) extra bits are used (logarithms throughout are base 2). There are also a number of heuristic and hybrid methods that are tuned for certain scenarios, such as full-text indexing [8].

Storing a bitvector of length n with m 1-bits can be seen as the problem of storing m elements of a finite universe of size n so that membership and rank queries can be answered efficiently in practice. The information-theoretic lower bound for storing m elements from a universe of size n is $B = \lceil\log\binom{n}{m}\rceil$ bits. Raman, Raman, and Rao [19], building on earlier work by Pagh [18], describe a data structure that matches this bound up to an o(n) term and supports membership and rank queries in O(1) time. This data structure is commonly referred to as RRR and has since been heavily engineered by a series of authors [2, 16, 5], culminating in its embodiment in the Succinct Data Structures Library (SDSL, https://github.com/simongog/sdsl).

RRR is of particular interest in full-text indexing because, when applied to the bitvectors of a wavelet tree constructed over the Burrows-Wheeler transform of a string T of n symbols, the resulting structure achieves size $nH_k + o(n)$ bits, where $H_k$ is the kth-order empirical entropy of T [13, 15], a technique called implicit compression boosting [12, 4].

All RRR implementations of which we are aware divide the bitvector X into n/b blocks of b bits each and store an encoded block containing c 1-bits using $\lceil\log(b+1)\rceil + \lceil\log\binom{b}{c}\rceil$ bits. The $\lceil\log(b+1)\rceil$ bits are used to encode c, the number of 1-bits the block contains, which is referred to as the block's class. The other $\lceil\log\binom{b}{c}\rceil$ bits are used to encode the block's offset, its lexicographical rank amongst all possible $\binom{b}{c}$ combinations of b-length blocks containing c 1-bits (we provide a detailed description in Section 2).
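As a quick worked example (ours, for illustration): with $b = 15$ the class fits in $\lceil\log 16\rceil = 4$ bits, and a block of class $c = 4$ is one of $\binom{15}{4} = 1365$ combinations, so its offset takes $\lceil\log 1365\rceil = 11$ bits, 15 bits in total; a uniform block of class 0 or 15 needs only the 4 class bits.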

Contribution.

In this paper, we revisit the implementation of entropy-compressed bitvectors stemming from the basic RRR scheme. Our contributions are summarized as follows:

1. In order to later decode a block when answering an access or rank query, some RRR implementations [2, 5] use a lookup table of $2^b$ entries that is indexed with the block's class and offset to obtain the original block in O(1) time. We describe two methods for reducing the size of the lookup table: a simple sparsification technique that provides a space-time trade-off, and an approach based on weighted binary de Bruijn sequences from combinatorics on words [20] that reduces the lookup table size by a factor of b. Both these approaches allow slightly larger values of b to be used, reducing overall space usage.

2. An alternative to the lookup-table approach, taken by some RRR implementations [16, 5], is to decode the block directly from its class-offset encoding using properties of the lexicographical offset assignment. These methods decode the original block bit by bit in O(b) time. We develop an alternative non-lexicographical scheme for assigning offset values that proves slightly faster to decode in practice.

Both our new approaches above represent a departure from the lexicographical assignment of offsets that is common to all previous work on the problem.

Roadmap.

The remainder of this paper is structured as follows. In the next section, we give a detailed overview of the basic RRR approach common to all previous implementations of it, as well as the vagaries of some specific implementations. We also describe the experimental setup used throughout. Then, in Section 3 we describe methods for reducing the size of the block-decoding table relevant to some RRR implementations. Section 4 describes our new method for inverting combinations, with further engineering measures described in Section 4.1. In Section 5 we apply our new RRR implementations as an encoding option for hybrid bitvector schemes. Section 6 concludes and offers directions for further research.

2 Preliminaries

2.1 RRR bitvector structure

In this section we describe the elements of the RRR data structure that are common to existing implementations. Implementations of RRR [2, 16, 5] generally sacrifice the constant-time queries of the original paper [19] in favor of practicality. The first public implementation of RRR was due to Claude and Navarro [2], and we now describe their approach – all other implementations of which we are aware build on it.

We are given a bitvector X of n bits, m of which are set to 1. We divide X up into n/b blocks of b bits each. We assume here, for ease of exposition, that b divides n – the situation where this is not the case and the last block has fewer than b bits is trivial to deal with.

Denote the ith block as $x_i$. Let the class, $c_i$, of block $x_i$ be the number of 1-bits $x_i$ contains. There are $\binom{b}{c}$ different blocks belonging to class c, and each block corresponds to one of the $\binom{b}{c}$ combinations of its class. We define the offset, $f_i$, of block $x_i$ to be an integer in the range $[0..\binom{b}{c_i})$ that uniquely determines which element of the class (i.e., which combination) the block corresponds to. All previous work on RRR assigns a block's offset as its lexicographical rank among all blocks of the same class. We call this the block's lexicographical offset.

Classes are stored in an $(n/b)\lceil\log(b+1)\rceil$-bit array, C, that trivially supports random access. The offsets are stored in a $\sum_i\lceil\log\binom{b}{c_i}\rceil$-bit sequence, F. Note that random access to elements of F is not trivial, because they are of variable length. To enable random access to F, pointers to the start of every kth element of the offset sequence are stored in a pointer table of $\lceil n/(bk)\rceil$ words (all of the implementations considered set k to 32 blocks). To support fast rank queries, an additional $\lceil n/(bk)\rceil$-word table storing cumulative ranks up to every kth block is added.

Now, to support rank(i) we proceed as follows. Position i is contained in block $j = \lfloor i/b\rfloor$. Let $j_0 = k\lfloor j/k\rfloor$ be the nearest preceding sampled block. We scan $C[j_0..j)$ and sum the class values $c_{j_0},\ldots,c_{j-1}$, adding this to the precomputed cumulative rank at block $j_0$. This gives the rank up to the start of block $x_j$. To obtain $f_j$, the offset of the block, while scanning the class values $c_{j_0},\ldots,c_{j-1}$ we also sum the offset widths $\lceil\log\binom{b}{c_{j_0}}\rceil, \lceil\log\binom{b}{c_{j_0+1}}\rceil, \ldots, \lceil\log\binom{b}{c_{j-1}}\rceil$, which, added to the sampled pointer, gives the starting position of $f_j$ in F. As described, a scan of up to k elements of C is required, followed by a single access to F. The class and offset for a block are thus obtained. What remains is to unrank the block's offset – i.e., obtain its original bit representation – which allows us to complete the query answer by counting the bits in the relevant prefix in the case of a rank query. It is at this point that implementations begin to differ.
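The query path up to the unranking step can be summarized in code. The following is a minimal sketch of our own, with b = 15 and k = 32 hard-coded; read_bits, the decode function and the bit order of decoded blocks are illustrative assumptions, not the SDSL's actual interfaces:

#include <cstdint>
#include <vector>

// Extract w < 64 bits starting at bit position p of the packed array F.
static uint64_t read_bits(const std::vector<uint64_t>& F, uint64_t p, uint32_t w) {
    uint64_t word = p >> 6, off = p & 63;
    uint64_t v = F[word] >> off;
    if (off + w > 64) v |= F[word + 1] << (64 - off);
    return v & ((uint64_t(1) << w) - 1);
}

uint64_t decode(uint32_t c, uint64_t f); // any unranking method from this paper

struct RRRSketch {
    static constexpr uint64_t b = 15, k = 32;
    std::vector<uint8_t> classes;      // C: one class value per block
    std::vector<uint64_t> F;           // bit-packed variable-width offsets
    std::vector<uint64_t> sample_rank; // cumulative 1s at every k-th block
    std::vector<uint64_t> sample_ptr;  // bit position in F of every k-th offset
    uint8_t offset_width[b + 1];       // ceil(log2 C(b, c)) for each class c

    uint64_t rank(uint64_t i) const {
        uint64_t j = i / b, s = j / k;
        uint64_t ones = sample_rank[s], ptr = sample_ptr[s];
        for (uint64_t t = s * k; t < j; ++t) { // scan at most k - 1 classes
            ones += classes[t];                // class = popcount of block t
            ptr += offset_width[classes[t]];   // skip block t's offset in F
        }
        uint32_t c = classes[j];
        uint64_t block = c ? decode(c, read_bits(F, ptr, offset_width[c])) : 0;
        // count 1s in the first i % b bits (assuming bit 0 = first bit)
        return ones + __builtin_popcountll(block & ((uint64_t(1) << (i % b)) - 1));
    }
};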

Lookup table decoding.

The implementation of Claude and Navarro [2] uses a two-dimensional lookup table of $2^b$ entries that contains all the possible blocks of size b (an approach prescribed in Raman et al.'s original paper [19]). The ith row of the table contains all the blocks of class i (i.e., all the blocks containing i 1-bits). Having worked out the class c and the offset f of the block containing the query index as described above, the query is completed by accessing cell [c, f] of the table. Claude and Navarro store the blocks of a row in lexicographical order, but the order is not important to the query algorithm just described.

The lookup table has total size $b2^b$ bits and indeed it is this size that adds some tension to the data structure: ideally, one would like to increase b in order to decrease the $\lceil\log(b+1)\rceil$-bit overhead incurred per block for storing the class values. However, increasing b causes a rapid blowup in the table size. In order to keep the lookup table manageable, and to fit c into four bits, Claude and Navarro fix the block size to b = 15, resulting in a lookup table $15\cdot2^{15}$ bits in size, and an overhead from the class array of roughly 0.27n bits. Gog and Petri engineer this approach further in the SDSL [5] with a class called rrr_vector_15.

On-the-fly decoding.

Arguing that 25% is a non-trivial overhead in the context of succinct data structures, Navarro and Providel [16] show that the lookup table can be eliminated by decoding the bits of a block from its lexicographical offset f in O(b) time, by checking the range of values that offset f belongs to and extracting bits consecutively as follows.

For a block length b, consider all the blocks of class c arranged in lexicographical order. The first $\binom{b-1}{c}$ of them begin with a 0 and the remaining $\binom{b-1}{c-1}$ begin with a 1. Thus, if $f \geq \binom{b-1}{c}$, then the first bit of the block is a 1; in this case we decrement b and c, and decrease f by $\binom{b-1}{c}$. Otherwise, the first bit of the block is a 0, and we only decrement b. We then continue extracting the next bit from a block of length b−1. The process can be stopped when f becomes zero, at which point the remaining bits are a (possibly empty) run of zeroes followed by a (possibly empty) run of ones. All the binomial coefficients are precomputed in a table in order to make decoding faster, but this table is much smaller than the original lookup table. Navarro and Providel show that queries, while slower than with the lookup-table approach, are still acceptably fast in practice. The approach was further engineered by Gog and Petri [5] and is included in the SDSL.
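A minimal sketch of this bit-by-bit decoding (our illustration; the binomial table C and the MSB-first bit order are assumptions):

#include <cstdint>

// Pascal's triangle up to n = 64; entries with k > n stay zero.
static uint64_t C[65][65];
static void init_binomials() {
    for (int n = 0; n <= 64; ++n) {
        C[n][0] = 1;
        for (int k = 1; k <= n; ++k)
            C[n][k] = C[n - 1][k - 1] + C[n - 1][k];
    }
}

// Decode the lexicographic offset f of a length-b, class-c block back into
// its bits, most significant bit first.
uint64_t decode_lex(uint32_t b, uint32_t c, uint64_t f) {
    uint64_t block = 0;
    while (f > 0) {
        if (f >= C[b - 1][c]) {              // first remaining bit is a 1
            block |= uint64_t(1) << (b - 1);
            f -= C[b - 1][c];
            --c;
        }                                     // otherwise it is a 0
        --b;
    }
    return block | ((uint64_t(1) << c) - 1);  // f = 0: c trailing 1-bits
}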

2.2 Experimental Setup

We measure and report on experiments with our new methods throughout the paper and so define our experimental machine and test data now. Additionally, our source code is available at https://github.com/saskeli/h0_bv.

Data sets.

We used the same data sets for all of our experiments. Three of the bitvectors are derived from real-world data, and chosen to be diverse and relevant to practical compressed indexing applications. WT-WEB-1GB (available at http://people.eng.unimelb.edu.au/sgog/data/WT-WEB-1GB.gz) and WT-DNA-1GB (available at http://people.eng.unimelb.edu.au/sgog/data/WT-DNA-1GB.gz) are data sets by Gog and Petri [5].

bv-dump.bin (available at https://doi.org/10.5281/zenodo.11031751) is the concatenation of the bit vectors from the “plain matrix” representation of the spectral Burrows-Wheeler transform (SBWT) proposed by Alanko, Puglisi and Vuohtoniemi [1]. The SBWT in question is built, with k-mer size 31, from a set of 17,336,887 Illumina HiSeq 2500 reads of length 502 sampled from the human gut (SRA identifier ERR5035349) in a study on irritable bowel syndrome and bile acid malabsorption [7].

In addition, we generated random bitvectors with varying probability p of a 1-bit, with bits drawn i.i.d. All of these bit vectors are one gigabyte in size. RND-1.bin has $p = 2^{-1}$, RND-5.bin has $p = 2^{-5}$ and RND-10.bin has $p = 2^{-10}$.

Some summary statistics on the datasets are presented in Table 1.

Table 1: Details of the datasets used in performance benchmarks.

  Dataset       n              H0          Mean H0 entropy    Fraction of uniform
                                           of 64-bit blocks   64-bit blocks
  WT-WEB-1GB    8 664 343 857  0.997653    0.0997298          0.843696
  WT-DNA-1GB    8 236 129 151  0.980138    0.614706           0.276278
  bv-dump.bin   8 261 672 152  0.811278    0.755564           0.00723309
  RND-1.bin     8 589 934 592  1           0.988641           0
  RND-5.bin     8 589 934 592  0.200631    0.187785           0.131055
  RND-10.bin    8 589 934 592  0.0111723   0.00719596         0.939396

Experiments.

The experimental run is the same for each of the following three sections, with only the data structures under test changing. In particular, data structures are built for each of the data sets, after which $10^7$ random access and $10^7$ random rank queries are run and timed on the built index. The space usage of the built index along with mean query times are reported. We only report times for access queries because the relative rank performance of the implementations is the same, with rank consistently just slightly slower than access. Full results for rank and select queries as timed in our experiments, along with construction times, are available in Appendix A.

Machine and Compilation.

Tests were run on a machine with an AMD EPYC 7452 processor and 500 GB of DDR4 RAM. The operating system was AlmaLinux 8.10 (with Linux kernel 4.18.0-372.9.1.el8.x86_64). Code was compiled with GCC 13.3.0 and the performance-relevant flags -std=c++2a, -march=native, -Ofast and -DNDEBUG. Experiments were run multiple times, on multiple machines, to ensure the stability of the results.

3 Reducing the size of the lookup table

As discussed in Section 2.1, the principal hurdle to reducing the lower-order terms of the original RRR representation (and its implementation by Claude and Navarro) is the lookup table that stores all possible bitstrings of length b. This lookup table has size $b2^b$ bits, making it very sensitive to increases in b. Navarro and Providel's approach is to remove this lookup table entirely and reconstruct its entries as needed through computation, at the cost of query time. In this section we keep the lookup table, but examine ways to reduce its size, with the hope of increasing b to reduce overall size while retaining fast query times.

The two approaches we explore are sparsification of the original lookup table, and compaction of the lookup table by replacing it with a series of weighted de Bruijn sequences [20].

Sparse lookup table.

Instead of storing all $\binom{b}{c}$ possible length-b bitstrings of weight c in the table, we can store every gth entry ($\lceil\binom{b}{c}/g\rceil$ blocks in total per class) and reconstruct any desired block in at most g constant-time steps. This changes the total space required for the lookup table to $b2^b/g + b$ bits. Further space savings can be had by storing samples only for blocks with $c \leq b/2$, effectively halving the space usage. Our implementations thus use $b2^{b-1}/g + b$ bits for storing the lookup table, and decoding a block given its c and f values takes O(g) time using an algorithm that, given a block, computes its lexicographic successor in O(1) time.

We use the “Compute the lexicographically next bit permutation” algorithm from the “Bit Twiddling Hacks” website (https://graphics.stanford.edu/~seander/bithacks.html); see Listing 1. This enumeration algorithm compiles to 9 instructions per iteration when optimized by the GCC compiler.

A natural choice of gap size is g = b, leading to the lookup table taking $2^{b-1} + b$ bits.
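For b = 24 and g = 24, for instance, this is $2^{23} + 24$ bits, about one mebibyte, compared to the $24\cdot2^{24}$ bits (48 MiB) of the full table (our arithmetic, for illustration).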

Listing 1: Code for fast enumeration of the $\binom{b}{c}$ blocks of length b with c set bits.
constexpr uint32_t next(uint32_t v) {
    // Set v's trailing 0s, i.e., all bits below the lowest set bit of v.
    uint32_t t = v | (v - 1);
    // t + 1 moves the lowest block of 1s up by one position; the right-hand
    // term restores the rest of that block at the least significant end.
    return (t + 1) | (((~t & -~t) - 1) >> (__builtin_ctzl(v) + 1));
}
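Given the samples, decoding is then a short loop. A minimal sketch of our own follows; the samples layout (every g-th block of each class, in lexicographic order) is an assumed name, not our actual memory layout:

// Recover block (c, f) by stepping forward from the nearest sample.
uint32_t decode_sparse(uint32_t c, uint32_t f, uint32_t g,
                       const uint32_t* const* samples) {
    uint32_t v = samples[c][f / g];   // nearest sample at or before offset f
    for (uint32_t i = 0; i < f % g; ++i)
        v = next(v);                  // O(1) lexicographic successor
    return v;
}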

Weighted de Bruijn sequences.

In [20], Ruskey, Sawada and Williams study weighted de Bruijn sequences: circular strings of length $\binom{b}{c}$ from which all possible substrings of length b having c 1-bits can be efficiently extracted. In particular, we create circular bitstrings (the weighted de Bruijn sequences) of length $\binom{b}{c}$, such that every bitstring of length b and weight c can be created from a (b−1)-length substring of the sequence with the necessary bit appended to reach the correct weight.

These remarkable constructions seem tailor made to reduce the size of the RRR lookup table. The class c remains the same as before, but the offset f is the relative position of the encoded block within the weighted de Bruijn sequence.

Storing these bit strings in practice requires $\binom{b}{c} + b - 1$ bits, with a copy of the first b−1 bits appended to the end, to ensure that the “last” substring can be read without overhead due to the circularity of the string. Conveniently, if we concatenate these weighted de Bruijn sequences in order of increasing weight, adjacent sequences overlap by b−2 characters, provided each weighted de Bruijn sequence is the lexicographically smallest rotation; see Figure 1. Additionally, we only require the sequences of weights $0 \leq c \leq \lfloor b/2\rfloor$, since higher-weight sequences can be stored as inverted references to lower-weight sequences. This leads to the entire lookup table fitting in

$b - 1 + \sum_{c=1}^{\lfloor b/2\rfloor}\binom{b}{c} + 1 < 2^b$ bits.
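As a quick check of our own: for b = 24 this is $23 + \sum_{c=1}^{12}\binom{24}{c} + 1 = 9\,740\,709$ bits, about 1.16 MiB, versus $24\cdot2^{24}$ bits (48 MiB) for the full table.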

The required binary sequence, along with the start position of each weight's sequence, is computed using a Python script and output as static arrays for use in C++ code. These arrays are used for decoding blocks. Reading from the binary sequence consists of reading the desired bytes in big-endian order, followed by masking and shifting to generate a candidate block. After this the last bit is flipped if required, before inverting the entire candidate if $c > \lfloor b/2\rfloor$. The decoding process is visualized in Figure 2.

Figure 1: The concatenation of weighted de Bruijn sequences for b = 7, creating the full sequence that replaces the traditional lookup table. The first b symbols of the lexicographically smallest rotation of each sequence are highlighted, to show the overlap between subsequent sequences, with the starts being repeated.
Figure 2: The process of reading elements from the weighted de Bruijn sequence with b = 7 and c ∈ {3, 4}: b candidate bits are read from the offset position, the candidate bits are flipped if c = 4, after which the correct block is the exclusive OR of the candidate and $[\mathrm{popcount}(\text{candidate}) \neq c]$.
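A minimal sketch of this lookup (our illustration; seq, start, and the big-endian bit layout are assumed names and conventions for the generated arrays, and b is taken to be less than 64):

#include <cstdint>

// Decode block (c, f) of size b from the concatenated weighted de Bruijn
// sequences. seq is the padded bit sequence (big-endian within bytes) and
// start[c] the bit position where the weight-c sequence begins.
uint64_t decode_dbs(uint32_t b, uint32_t c, uint64_t f,
                    const uint8_t* seq, const uint64_t* start) {
    bool invert = c > b / 2;            // high weights are stored inverted
    uint32_t cc = invert ? b - c : c;
    uint64_t pos = start[cc] + f;
    uint64_t cand = 0;
    for (uint32_t i = 0; i < b; ++i) {  // read b candidate bits, MSB first
        uint64_t p = pos + i;
        cand = (cand << 1) | ((seq[p >> 3] >> (7 - (p & 7))) & 1);
    }
    if (invert)
        cand = ~cand & ((uint64_t(1) << b) - 1);
    // flip the final bit iff the candidate's weight is off by one
    return cand ^ (uint64_t)(__builtin_popcountll(cand) != c);
}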

Performance of reduced size lookup tables.

We benchmarked both the sparse lookup table and the weighted de Bruijn approach with b ∈ {15, 24}, using gaps g ∈ {1, 7, 15, 24, 32, 64} for the sparse lookup tables. It would have been interesting to test b = 31 as well but, unfortunately, our methods for compile-time definition of the tables failed for these bigger tables (probably due to the compiler running out of memory, or a compile-time limit on constant expression resolution).

Results of the benchmarking can be seen in Figure 3. Our approach enables a new space-time trade-off: increasing b improves compression while slowing down lookups. The RRR15 implementation by Gog and Petri [5] remains the fastest overall, but compresses worse than our de Bruijn sequence-based approach with b = 24 (DBS24).

Figure 3: Performance of the different lookup table-based approaches, with the sizes of the lookup tables counted as part of the compressed data structure size. The optimized version based on a full lookup table, due to Claude and Navarro [2] and Gog and Petri [5] (RRR15 in the figure), remains the fastest for queries, and beats both our implementations with b = 15 in space efficiency as well; this is likely due to specific optimizations for b = 15 that are missing from our implementations and would not apply to b = 24. For b = 24, our approaches improve compression, offering a space-time trade-off. The weighted de Bruijn sequence-based approach (DBS) outperforms the sparsified lookup tables (GAP) in all cases. Mean blockwise H0 entropies are added for reference, as a theoretical limit for block-based H0 compression.

4 Induced total order for b-length bitstrings

The specific encoding of the offset is irrelevant as long as the mapping from the b-length bitstring x to the (c, f) pair is bijective and $0 \leq f < \binom{b}{c}$. As we saw in Section 2.1, lexicographic order allows decoding and answering queries in O(b) time. Selecting another induced order changes the complexity of decoding. The new method we now describe allows answering queries on blocks in $O(\log c\,\log b)$ time.

For any bitstrings $x_\alpha$ and $x_\beta$ of length b and class c, define the ordering such that $\mathrm{pop}(x_\alpha \gg b/2) < \mathrm{pop}(x_\beta \gg b/2) \Rightarrow f_\alpha < f_\beta$, where $u \gg v$ denotes the binary representation of u shifted right by v bits (discarding the lower v bits of u) and pop returns the number of 1-bits in an integer (i.e., its population count). That is, the order between blocks is primarily decided by the weight of their prefixes, where a bitstring with more 1-bits in the prefix has a greater f value. This is then applied recursively to the prefixes and suffixes, giving a full recursive definition of the f values.

Based on this, when decoding any given (c, f), the first half of the block x to decode will contain exactly

$c_p = \max\left\{c' \,:\, \sum_{i=0}^{c'-1}\binom{b/2}{i}\binom{b/2}{c-i} \leq f\right\}$

one-bits.

Given $c_p$, a query targeting the prefix of the block can recurse with

$\left(c_p,\; \left\lfloor\Big(f - \textstyle\sum_{i=0}^{c_p-1}\binom{b/2}{i}\binom{b/2}{c-i}\Big) \big/ \binom{b/2}{c-c_p}\right\rfloor\right)$

and a query targeting the suffix recurses with

$\left(c - c_p,\; \Big(f - \textstyle\sum_{i=0}^{c_p-1}\binom{b/2}{i}\binom{b/2}{c-i}\Big) \bmod \binom{b/2}{c-c_p}\right).$

The recursion takes O(log b) steps, and binary searching for $c_p$ takes O(log c) time, given precomputed lookup tables of binomial coefficients. This yields a total running time of $O(\log c\,\log b)$ for determining the result of any rank or access query at the block level.

The offset values with the desired properties can be computed with a recursive algorithm, sketched below.
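The following is a minimal sketch of the encoder (ours, not the paper's exact listing), assuming b is a power of two, at most 64, and reusing the binomial table C defined in the sketch in Section 2.1:

// Encode the b lowest bits of x, of class c = popcount, as an offset in
// [0, C(b, c)) under the induced order.
uint64_t encode(uint64_t x, uint32_t b, uint32_t c) {
    if (b == 1 || c == 0 || c == b) return 0;   // a single combination left
    uint32_t h = b / 2;
    uint64_t hi = x >> h;                       // prefix (high half)
    uint64_t lo = x & ((uint64_t(1) << h) - 1); // suffix (low half)
    uint32_t cp = __builtin_popcountll(hi);     // class of the prefix
    uint64_t f = 0;
    for (uint32_t i = 0; i < cp; ++i)           // blocks with lighter prefixes
        f += C[h][i] * C[h][c - i];             // C[h][k] = 0 when k > h
    return f + encode(hi, h, cp) * C[h][c - cp] + encode(lo, h, c - cp);
}

On the running example below, encode(0b01101000, 8, 3) yields 39, matching the derivation.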

Example.

We illustrate the workings of this alternative encoding with x=01101000b, and further demonstrate querying with an access(x,2) query on the encoded block.

The encoded values for the prefix xp=0110b and suffix xs=1000b are computed recursively, but the recurrences are not written out here for brevity.

  • There are $\sum_{i=0}^{1}\binom{4}{i}\binom{4}{3-i} = 28$ vectors of class 3 where the prefix has fewer than two set bits: $f \leftarrow 28$.

  • $\mathrm{enc}(x_p) = (2, 2)$, meaning there are two blocks with b = 4 and c = 2 that are less than $x_p$, so we add the number of b = 8, c = 3 vectors with two set bits in the prefix whose prefix encoding is less than (2, 2), i.e., $2\binom{4}{1} = 8$: $f \leftarrow f + 8$.

  • The suffix encodes as $\mathrm{enc}(x_s) = (1, 3)$, so we add this offset to the value derived so far: $f \leftarrow f + 3$.

  • Thus enc(01101000b) = (3, 28 + 8 + 3) = (3, 39).

We start with access(i = 2) and (c, f) = (3, 39). As we recurse, i and (c, f) are updated accordingly.

  • Since $28 \leq f < 52$, we know that the class of the prefix of x is 2. As we recurse to the left we update (c, f) by subtracting the accumulated offset and dividing by the number of length-4 vectors with one set bit: $(c, f) \leftarrow (2, \lfloor(39 - 28)/4\rfloor) = (2, 2)$.

  • Since $1 \leq f < 5$, we know that the class of the new prefix is 1. Now we recurse to the right and update $i \leftarrow i - 2 = 0$ and $(c, f) \leftarrow (1, (2 - 1) \bmod 2) = (1, 1)$.

  • Since $1 \leq f < 2$, we know that the class of the new prefix is 1. Since we are recursing to the prefix, the only possible update for c = 1 is $(c, f) \leftarrow (1, 0)$.

  • We are now accessing a vector of length one with c = 1; i.e., for (c, f) = (1, 0), access(0) = 1.

  • So, for the original block with (c, f) = (3, 39), access(2) = 1.
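The recursion of the example translates directly into code. A minimal sketch of our own, again reusing the binomial table C and substituting a linear scan for the branchless binary search of Section 4.1; i counts bits from the most significant end, as in the example:

// Answer access(i) on a block encoded as (c, f) under the induced order.
bool access_bit(uint32_t b, uint32_t c, uint64_t f, uint32_t i) {
    if (c == 0) return false;                 // remaining bits are all 0
    if (c == b) return true;                  // remaining bits are all 1
    uint32_t h = b / 2;
    uint32_t cp = 0;                          // class of the prefix
    uint64_t acc = 0;
    while (acc + C[h][cp] * C[h][c - cp] <= f) {  // find the prefix class
        acc += C[h][cp] * C[h][c - cp];
        ++cp;
    }
    f -= acc;
    return i < h
        ? access_bit(h, cp, f / C[h][c - cp], i)           // recurse left
        : access_bit(h, c - cp, f % C[h][c - cp], i - h);  // recurse right
}

Here access_bit(8, 3, 39, 2) performs exactly the three recursive steps above and returns 1.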

4.1 Implementation details

Recursive split.

Our implementation of the method described in the previous section makes extensive use of (small) lookup tables. For our implementation with b = 64 we precompute $\binom{n}{m}$ for n ∈ {32, 16, 8} and 0 ≤ m ≤ n. The binomials are stored as static heap-ordered binary trees to enable fast branchless binary searching for binomial coefficients. In addition, the decoding of blocks of size 8 is done using precomputed lookup tables.

A drawback of our solution is the need to compute division and modulus with binomial coefficients as divisors. We were unable to make any compiler precompute fast integer division for the required binomials, and modular inverse arithmetic is impractical because implementations would require computations on 128-bit integers.

Our implementation relies on the symmetry of the recurrence, requiring a block size that is a power of two, while, for better overall data structure size, b should be $2^a - 1$ for some positive integer a, so that a bits suffice to store every possible class value. Unfortunately, asymmetric recurrences would cause additional unavoidable and unpredictable branching while decoding, with a significant negative impact on query performance.

8-bit block iterative approach.

The same principle behind the induced order can be applied to splits other than the recursive even split described above. Splitting off 8 bits at a time leads to an O(b) decode time, which matches the on-the-fly approach of Navarro and Providel [16] (and its later implementation by Gog and Petri [5]). The hope is that relying on lookup tables for the 8-bit sections of the block yields lower constant factors than other implementations.

This approach allows the use of block sizes that are not powers of two, improving the storage of class values. If the offsets of the 8-bit sub-blocks correspond to lexicographic order, the first partial sub-block can be decoded with the same lookup table as the full sub-blocks, with minimal impact from the change in b. A minimal sketch of the iterative decoder is given below.
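The sketch (ours) peels one 8-bit sub-block at a time using the induced order with an (8, b−8) split; byte_decode is an assumed name for an 8-bit class/offset lookup table, b is taken to be a multiple of 8 for simplicity (our actual variant also handles a partial first sub-block), and the binomial table C is reused:

#include <cstdint>

extern const uint8_t byte_decode[9][70]; // assumed (class, offset) -> byte table

// Decode a class-c, offset-f block of b bits, eight bits at a time.
uint64_t decode_iter(uint32_t b, uint32_t c, uint64_t f) {
    uint64_t block = 0;
    while (b > 0) {
        uint32_t cp = 0;                  // class of the next 8-bit sub-block
        uint64_t acc = 0;
        while (acc + C[8][cp] * C[b - 8][c - cp] <= f) {
            acc += C[8][cp] * C[b - 8][c - cp];
            ++cp;
        }
        f -= acc;
        uint64_t rest = C[b - 8][c - cp]; // combinations left for the tail
        block = (block << 8) | byte_decode[cp][f / rest];
        f %= rest;
        b -= 8;
        c -= cp;
    }
    return block;
}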

Figure 4: Performance of our H0-based implementations compared to implementations in the SDSL. Pictured are our de Bruijn sequence-based approach (DBS), our balanced recurrence-based approach (REC) and our block iteration-based implementation (IT), along with the lookup table (RRR) and non-lookup table (SDSL) based approaches from the SDSL, with the number denoting the block size b. The highly optimized RRR15 outperforms all other implementations in query performance, while only beating our weighted de Bruijn implementation with b = 15 (DBS15) in compression, due to the small block size. For larger block sizes our implementations compare favorably, being comparable in compression ratio while typically improving query time. Mean blockwise H0 entropies are added for reference, as a theoretical limit for block-based H0 compression.
Figure 5: Visualization of the space usage for some of the tested data structures. Lookup table sizes are insignificant. The size of the F array increases with block size, but this increase is offset by the decrease in superblock overhead and C array size.

4.2 Results

Results of the benchmarking, now including on-the-fly decoding performance, can be seen in Figure 4. The lookup table-based RRR15 remains the fastest overall, with the on-the-fly approaches lagging behind in query performance but compressing significantly better. Of the on-the-fly approaches, our symmetric recurrence (REC) and block iterative (IT) methods offer compression similar to the on-the-fly implementation in the SDSL, but typically faster lookups. On all data sets tested the new methods offer relevant space-time trade-offs.

The space consumed by various components of the data structures is shown in Figure 5. Data structure size is mainly determined by the block and superblock sizes used, and lookup table sizes are irrelevant. Space for the F array increases as block size increases, but this is offset by the shrinking of the C array and superblock overhead.

5 Hybrid implementation

Regions of bitvectors that consist of alternating runs of ones and zeros have very low H1 entropy but potentially high H0 entropy. This leads to poor compression performance for the RRR-based techniques discussed so far, which achieve H0 entropy, ignoring the o(n) term.

The hybrid bitvector of Kärkkäinen, Kempa and Puglisi [8] is a widely used, practically powerful compressed bitvector scheme that, in particular, is an essential ingredient of state-of-the-art general-purpose FM-indexes [4]. The approach is to divide the bitvector into blocks and then select the best encoding for each block from a number of possible encodings. The implementation of hybrid bitvectors in the sdsl-lite library uses 256-bit blocks, and stores each block either as: 1) minority-bit encoded (i.e., to encode the block, we store the positions of the minority bits using 8 bits each); 2) run-length encoded; or 3) in plain form. This scheme works extremely well for blocks with runs or blocks that are extremely dense or sparse, but can perform badly when this is not the case and many blocks get stored in plain form (resulting in no compression).
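To make the per-block choice concrete, here is an illustrative cost model of our own (the constants and the H0 cost are simplifications for exposition, not the sdsl-lite layout):

#include <cmath>
#include <cstdint>

enum class Enc { Minority, RunLength, Plain, H0 };

// log2 of the binomial coefficient C(n, k), via lgamma.
double log2_binom(uint32_t n, uint32_t k) {
    return (std::lgamma(n + 1.0) - std::lgamma(k + 1.0) -
            std::lgamma(n - k + 1.0)) / std::log(2.0);
}

// Pick the cheapest encoding for one 256-bit block with the given number
// of 1-bits and runs. Costs are in bits.
Enc choose_encoding(uint32_t ones, uint32_t runs) {
    const uint32_t b = 256;
    uint32_t minority = ones < b - ones ? ones : b - ones;
    double cost[4] = {
        8.0 * minority,       // one byte per minority-bit position
        8.0 * runs,           // roughly one byte per run boundary
        double(b),            // plain: the raw 256 bits
        log2_binom(b, ones)   // H0: idealized, ignoring class overhead
    };
    int best = 0;
    for (int e = 1; e < 4; ++e)
        if (cost[e] < cost[best]) best = e;
    return static_cast<Enc>(best);
}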

This hybrid scheme and the RRR-based scheme complement each other, and adding an H0 option for block encoding to the hybrid bitvector should improve compression performance.

With this in mind, we added our 64-bit block iterative encoding as an option in the 256-bit hybrid bitvector implementation included in the SDSL. As a comparison, we built another version that instead used Gog and Petri's RRR implementation with b = 256 as the extra encoding option. This was done in a way that causes no additional space overhead in the hybrid structure, but does add some significant overhead to each query, compared to the implementation without support for H0 encoding.

We ran the same tests on these structures; the results are shown in Figure 6. We observe that by including RRR as a block-encoding option, the compression of hybrid bitvectors can be significantly improved, though at some penalty to query performance.

Figure 6: Performance of the SDSL hybrid bitvector, compared to versions with an added option for compressing blocks with either our 64-bit block iterative implementation (IT) or the 256-bit RRR compression, also from the SDSL. We find that compression can be improved as expected (cf. blue and green points), but with a penalty to query performance.

6 Concluding Remarks

We have described two main improvements to the RRR bitvector data structure, the common theme of which is the non-lexicographical assignment of codes to offset values (in contrast to the lexicographical assignment used in all previous work on the problem). The first of these approaches represents all possible blocks of size b as a series of weighted de Bruijn sequences. The second provides a new on-the-fly block-decoding mechanism, based on a new induced total order for offset encoding, that answers queries in $O(\log c\,\log b)$ time. These methods offer improved practical trade-offs for H0-compressed bitvectors. We have also described a sparsification technique for reducing the size of the lookup table in the case that lexicographic assignment is used.

There are numerous avenues for further optimization of the RRR scheme. Firstly, profiling reveals that our non-lookup-table-based implementations execute between three and eight div instructions per query. Eliminating these divisions efficiently would improve the query performance of our data structures and would possibly be of independent interest. Efficiently optimizing division by 64-bit (and larger) integers is an open problem, but perhaps existing schemes for shorter integers [9] can be applied to larger integers as hardware keeps improving.

Larger block sizes have the potential to improve the compression ratio further, but implementations would require computations on wider integers (128, 256 or 512 bits), whose arithmetic operations are significantly slower than those on 64-bit integers.

Finally, the rrr_vector_15 class in the SDSL uses bit-parallelism to great effect. Since its publication [5] some 10 years ago, architectural support for bit-parallelism has improved significantly, and we expect it is possible to improve on the existing optimizations of rrr_vector_15, or to apply novel optimizations to implementations with b > 15, using new SIMD instructions.

We leave these open problems to be the focus of future research.

References

  • [1] Jarno N. Alanko, Simon J. Puglisi, and Jaakko Vuohtoniemi. Small searchable k-spectra via subset rank queries on the spectral Burrows-Wheeler transform. In SIAM Conference on Applied and Computational Discrete Algorithms, ACDA 2023, pages 225–236. SIAM, 2023. doi:10.1137/1.9781611977714.20.
  • [2] Francisco Claude and Gonzalo Navarro. Practical rank/select queries over arbitrary sequences. In String Processing and Information Retrieval, pages 176–187, Berlin, Heidelberg, 2009. Springer Berlin Heidelberg. doi:10.1007/978-3-540-89097-3_18.
  • [3] Saska Dönges. saskeli/h0_bv. Software, swhId: swh:1:dir:383ab8d9247150886cfacd830469e5e5bdd72551 (visited on 2025-07-02). URL: https://github.com/saskeli/h0_bv, doi:10.4230/artifacts.23794.
  • [4] Simon Gog, Juha Kärkkäinen, Dominik Kempa, Matthias Petri, and Simon J. Puglisi. Fixed block compression boosting in FM-indexes: Theory and practice. Algorithmica, 81(4):1370–1391, 2019. doi:10.1007/S00453-018-0475-9.
  • [5] Simon Gog and Matthias Petri. Optimized succinct data structures for massive data. Software: Practice and Experience, 44(11):1287–1314, 2014. doi:10.1002/spe.2198.
  • [6] Guy Jacobson. Space-efficient static trees and graphs. In Proc. 30th Annual Symposium on Foundations of Computer Science (FOCS), pages 549–554. IEEE Computer Society, 1989. doi:10.1109/SFCS.1989.63533.
  • [7] Ian B Jeffery, Anubhav Das, Eileen O’Herlihy, Simone Coughlan, Katryna Cisek, Michael Moore, Fintan Bradley, Tom Carty, Meenakshi Pradhan, Chinmay Dwibedi, et al. Differences in fecal microbiomes and metabolomes of people with vs without irritable bowel syndrome and bile acid malabsorption. Gastroenterology, 158(4):1016–1028, 2020.
  • [8] Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi. Hybrid compression of bitvectors for the FM-index. In Proc. Data Compression Conference, pages 302–311. IEEE, 2014. doi:10.1109/DCC.2014.87.
  • [9] Daniel Lemire, Owen Kaser, and Nathan Kurz. Faster remainder by direct computation: Applications to compilers and software libraries. Software: Practice and Experience, 49(6):953–970, 2019. doi:10.1002/spe.2689.
  • [10] Danyang Ma, Simon J. Puglisi, Rajeev Raman, and Bella Zhukova. On Elias-Fano for rank queries in FM-indexes. In Proc. 31st Data Compression Conference, pages 223–232. IEEE, 2021. doi:10.1109/DCC50243.2021.00030.
  • [11] Veli Mäkinen, Djamal Belazzougui, Fabio Cunial, and Alexandru I. Tomescu. Genome-Scale Algorithm Design: Bioinformatics in the Era of High-Throughput Sequencing (2nd edition). Cambridge University Press, 2023. URL: http://www.genome-scale.info/.
  • [12] Veli Mäkinen and Gonzalo Navarro. Implicit compression boosting with applications to self-indexing. In Proc. String Processing and Information Retrieval, 14th International Symposium, LNCS 4726, pages 229–241. Springer, 2007. doi:10.1007/978-3-540-75530-2_21.
  • [13] Giovanni Manzini. An analysis of the Burrows-Wheeler transform. J. ACM, 48(3):407–430, 2001. doi:10.1145/382780.382782.
  • [14] J. Ian Munro. Tables. In Proc. 16th Foundations of Software Technology and Theoretical Computer Science, LNCS 1180, pages 37–42. Springer, 1996. doi:10.1007/3-540-62034-6_35.
  • [15] Gonzalo Navarro. Compact Data Structures – A practical approach. Cambridge University Press, 2016. ISBN 978-1-107-15238-0. 570 pages.
  • [16] Gonzalo Navarro and Eliana Providel. Fast, small, simple rank/select on bitmaps. In Proc. SEA, pages 295–306. Springer, 2012. doi:10.1007/978-3-642-30850-5_26.
  • [17] Giuseppe Ottaviano and Rossano Venturini. Partitioned Elias-Fano indexes. In Proc. 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR, pages 273–282. ACM, 2014. doi:10.1145/2600428.2609615.
  • [18] Rasmus Pagh. Low redundancy in static dictionaries with O(1) worst case lookup time. In Proc. Automata, Languages and Programming, 26th International Colloquium, ICALP’99, LNCS 1644, pages 595–604. Springer, 1999. doi:10.1007/3-540-48523-6_56.
  • [19] Rajeev Raman, Venkatesh Raman, and Srinivasa Rao Satti. Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algorithms, 3(4):43–es, November 2007. doi:10.1145/1290672.1290680.
  • [20] Frank Ruskey, Joe Sawada, and Aaron Williams. De Bruijn sequences for fixed-weight binary strings. SIAM J. Discret. Math., 26(2):605–617, 2012. doi:10.1137/100808782.
  • [21] Sebastiano Vigna. Quasi-succinct indices. In Sixth ACM International Conference on Web Search and Data Mining, WSDM 2013, Rome, Italy, February 4-8, 2013, pages 83–92. ACM, 2013. doi:10.1145/2433396.2433409.

Appendix A Tabulated experiment results

WT-WEB-1GB WT-DNA-1GB bv-dump.bin RND-1.bin RND-5.bin RND-10.bin
GAP block size: 15, gap: 1
build time (s) 3.728 4.055 4.561 4.128 3.828 3.582
bits/bit 0.467 878 951 013 977 86 0.902 709 601 199 146 6 0.987 133 021 414 302 9 1.236 974 106 267 287 4 0.516 198 552 092 920 6 0.389 355 800 938 935 57
access (ns) 48.8088 153.222 164.145 167.843 96.8741 26.032
rank (ns) 162.941 171.973 172.029 172.431 166.629 153.58
select (ns) 913.867 931.872 971.726 1291.3 964.197 896.082
GAP block size: 15, gap: 7
build time (s) 4.136 4.457 4.784 4.553 4.225 3.987
bits/bit 0.467 845 090 862 284 1 0.902 673 980 582 676 7 0.987 097 510 927 937 7 1.236 939 952 806 085 6 0.516 164 398 631 718 8 0.389 321 647 477 733 73
access (ns) 55.2366 225.744 263.242 277.2 126.532 26.5134
rank (ns) 174.946 249.561 276.962 287.748 209.206 157.792
select (ns) 916.729 1006.02 1072.97 1405.67 1007.83 901.581
GAP block size: 15, gap: 15
build time (s) 3.796 4.132 4.476 4.152 3.866 3.638
bits/bit 0.467 842 084 518 100 1 0.902 670 817 931 956 0.987 094 358 055 348 2 1.236 936 920 419 805 3 0.516 161 366 245 438 4 0.389 318 615 091 453 4
access (ns) 56.1212 228.589 272.445 284.819 128.413 26.5664
rank (ns) 175.562 257.209 285.71 295.183 211.898 157.677
select (ns) 926.366 1020.79 1098.0 1418.99 1022.15 919.123
GAP block size: 15, gap: 24
build time (s) 3.792 4.105 4.437 4.152 3.871 3.637
bits/bit 0.467 841 102 100 958 86 0.902 669 784 436 757 4 0.987 093 327 755 460 2 1.236 935 929 492 593 2 0.516 160 375 318 226 4 0.389 317 624 164 241 37
access (ns) 56.9823 237.039 277.709 294.73 130.144 26.6282
rank (ns) 176.863 260.026 291.541 304.902 213.287 157.887
select (ns) 920.346 1023.72 1099.86 1452.77 1021.12 908.684
GAP block size: 15, gap: 32
build time (s) 3.783 4.145 4.481 4.199 3.915 3.629
bits/bit 0.467 840 688 451 636 3 0.902 669 349 280 884 4 0.987 092 893 944 981 1.236 935 512 260 083 0.516 159 958 085 716 1 0.389 317 206 931 731 05
access (ns) 57.1164 239.053 284.746 301.531 130.02 26.5804
rank (ns) 174.242 260.277 297.251 311.316 210.922 155.45
select (ns) 918.053 1022.84 1114.13 1452.71 1020.87 903.73
GAP block size: 15, gap: 64
build time (s) 3.952 4.186 4.543 4.257 3.892 3.635
bits/bit 0.467 840 067 977 652 4 0.902 668 696 547 074 7 0.987 092 243 229 262 3 1.236 934 886 411 317 4 0.516 159 332 236 950 7 0.389 316 581 082 965 6
access (ns) 59.1719 262.017 309.844 336.026 131.382 26.5396
rank (ns) 176.375 283.325 322.053 345.829 211.949 155.514
select (ns) 920.951 1065.62 1168.09 1520.98 1028.45 906.002
WT-WEB-1GB WT-DNA-1GB bv-dump.bin RND-1.bin RND-5.bin RND-10.bin
GAP block size: 24, gap: 1
build time (s) 3.413 10.835 11.661 16.962 2.974 2.399
bits/bit 0.396 031 446 736 880 8 0.860 447 697 958 992 3 0.969 373 827 457 722 8 1.204 960 980 788 164 0.461 326 225 948 883 4 0.314 638 674 708 912 7
access (ns) 57.3598 201.173 205.049 260.746 134.223 28.5569
rank (ns) 169.666 237.213 213.786 268.979 173.468 154.479
select (ns) 866.488 1005.67 1037.46 1532.87 898.629 836.867
GAP block size: 24, gap: 7
build time (s) 3.653 10.519 13.775 16.262 3.126 2.42
bits/bit 0.372 904 512 547 734 6 0.836 118 342 900 947 6 0.945 119 692 726 769 9 1.181 633 712 301 897 2 0.437 998 957 462 616 6 0.291 311 406 222 645 9
access (ns) 64.3484 246.201 285.983 298.609 186.285 29.5175
rank (ns) 179.338 268.256 291.393 308.176 233.904 158.526
select (ns) 878.575 1030.44 1089.68 1543.84 968.787 846.221
GAP block size: 24, gap: 15
build time (s) 3.531 10.505 11.493 16.338 3.27 2.547
bits/bit 0.370 848 786 213 280 8 0.833 955 734 771 221 0.942 963 770 844 603 3 1.179 560 178 484 319 3 0.435 925 423 645 038 7 0.289 237 872 405 068 04
access (ns) 64.8765 250.521 297.227 308.713 190.512 29.5091
rank (ns) 180.816 271.758 303.938 314.941 236.907 159.444
select (ns) 882.324 1029.76 1103.16 1573.71 984.492 854.255
GAP block size: 24, gap: 24
build time (s) 3.827 10.694 11.651 17.072 3.312 2.709
bits/bit 0.370 174 249 740 217 1 0.833 246 127 643 075 6 0.942 256 357 645 567 7 1.178 879 798 919 858 0.435 245 044 080 577 46 0.288 557 492 840 606 8
access (ns) 65.5028 248.99 300.185 308.375 195.675 29.8434
rank (ns) 181.514 271.901 308.174 314.475 239.596 158.259
select (ns) 885.657 1024.92 1107.33 1529.96 993.326 865.059
GAP block size: 24, gap: 32
build time (s) 3.815 10.86 11.811 16.677 3.582 2.835
bits/bit 0.369 893 197 185 287 74 0.832 950 462 539 222 0.941 961 606 664 828 8 1.178 596 311 780 837 8 0.434 961 556 941 557 25 0.288 274 005 701 586 57
access (ns) 65.5902 255.293 306.287 315.676 196.321 29.7771
rank (ns) 181.124 276.118 315.641 322.794 239.999 157.966
select (ns) 888.533 1035.55 1118.28 1546.55 998.906 868.652
GAP block size: 24, gap: 64
build time (s) 4.399 11.433 12.26 17.827 3.967 3.417
bits/bit 0.369 471 614 659 596 1 0.832 506 960 998 121 4 0.941 519 476 320 412 4 1.178 171 077 347 017 3 0.434 536 322 507 736 66 0.287 848 771 267 766
access (ns) 67.9074 276.484 333.083 349.913 201.254 29.7777
rank (ns) 183.644 296.403 342.381 355.715 245.106 157.793
select (ns) 894.616 1061.87 1165.78 1596.93 1020.69 864.823
WT-WEB-1GB WT-DNA-1GB bv-dump.bin RND-1.bin RND-5.bin RND-10.bin
IT block size: 63
build time (s) 2.289 3.238 3.997 4.031 3.982 1.903
bits/bit 0.222 369 605 619 496 95 0.712 744 527 198 054 1 0.843 265 818 362 669 2 1.071 979 860 238 881 0.292 494 886 238 166 05 0.129 363 397 103 820 78
access (ns) 74.8317 249.465 303.511 309.669 236.876 37.5112
rank (ns) 179.46 261.932 316.588 320.151 270.507 159.856
select (ns) 760.392 876.18 954.466 970.194 878.494 659.825
IT block size: 64
build time (s) 1.139 2.228 2.939 2.828 2.54 0.785
bits/bit 0.236 309 100 276 233 53 0.727 050 431 939 216 5 0.857 934 755 624 370 3 1.085 720 250 675 676 5 0.305 864 669 897 670 5 0.143 058 100 580 736 26
access (ns) 64.8315 249.066 303.555 307.486 240.434 33.4113
rank (ns) 178.502 264.999 322.533 320.614 271.908 160.267
select (ns) 764.262 888.36 971.232 1145.56 859.516 646.619
REC block size: 64
build time (s) 2.245 8.691 11.548 13.362 3.621 0.809
bits/bit 0.236 308 598 911 094 76 0.727 049 904 506 986 3 0.857 934 229 822 829 7 1.085 719 744 967 522 2 0.305 864 164 189 516 3 0.143 057 594 872 582
access (ns) 64.6253 248.801 305.583 308.281 226.252 33.549
rank (ns) 186.175 274.88 319.677 320.304 252.252 160.953
select (ns) 762.83 870.66 956.8 1106.23 850.819 703.01
DBS block size: 15
build time (s) 3.764 4.101 4.452 4.131 3.91 3.617
bits/bit 0.467 841 922 936 333 4 0.902 670 647 949 193 2 0.987 094 188 598 129 7 1.236 936 757 438 356 0.516 161 203 263 989 1 0.389 318 452 110 004 04
access (ns) 48.0073 151.446 161.857 165.004 95.4266 26.1222
rank (ns) 167.317 168.541 169.936 170.174 170.15 165.814
select (ns) 909.401 917.893 957.862 1279.39 959.579 905.097
DBS block size: 24
build time (s) 3.321 10.197 11.029 15.812 3.025 2.346
bits/bit 0.370 174 023 525 743 86 0.833 245 889 667 207 5 0.942 256 120 405 461 9 1.178 879 570 745 829 0.435 244 815 906 548 4 0.288 557 264 666 577 7
access (ns) 53.3154 160.277 173.185 176.764 133.093 28.6485
rank (ns) 171.014 182.42 180.797 183.204 172.551 167.101
select (ns) 862.827 916.101 941.277 1393.37 893.444 842.976
RRR block size: 15
build time (s) 3.538 4.769 5.765 5.158 6.343 3.11
bits/bit 0.457 483 470 794 666 3 0.900 648 555 341 946 7 0.987 155 238 708 128 2 1.234 912 143 201 290 4 0.507 886 580 132 793 1 0.370 627 164 873 278 4
access (ns) 34.0164 109.579 125.945 131.151 73.0565 20.3536
rank (ns) 105.353 149.235 157.422 157.874 134.708 93.2746
select (ns) 648.536 830.778 950.841 1278.28 925.266 821.865
WT-WEB-1GB WT-DNA-1GB bv-dump.bin RND-1.bin RND-5.bin RND-10.bin
SDSL block size: 24
build time (s) 6.468 24.749 26.207 40.311 10.079 3.153
bits/bit 0.365 175 565 440 526 9 0.833 398 985 856 804 7 0.942 412 771 499 804 2 1.177 777 904 623 911 4 0.431 538 978 170 313 1 0.278 341 012 795 709
access (ns) 69.2431 255.935 284.384 327.729 150.578 39.6841
rank (ns) 131.619 268.732 303.512 337.913 201.437 145.247
select (ns) 613.157 921.103 1083.35 1521.24 931.851 777.251
SDSL block size: 31
build time (s) 5.691 23.576 24.874 39.573 9.269 2.571
bits/bit 0.304 428 619 579 674 1 0.780 690 324 710 903 6 0.895 611 799 975 038 8 1.131 186 360 447 210 3 0.366 929 970 741 793 73 0.216 593 944 716 498 68
access (ns) 80.7888 276.749 312.463 357.589 175.915 48.5396
rank (ns) 136.336 284.458 325.647 360.295 216.23 154.463
select (ns) 582.405 909.211 1080.24 1311.47 915.935 740.606
SDSL block size: 63
build time (s) 4.318 20.965 22.133 36.357 7.999 1.654
bits/bit 0.220 891 961 599 987 9 0.713 251 545 955 347 0.843 772 811 120 335 1.071 990 387 909 185 9 0.291 513 361 658 801 43 0.125 901 700 922 089 08
access (ns) 94.2154 345.217 391.965 462.072 238.919 53.7692
rank (ns) 147.529 345.117 395.659 456.36 273.225 170.339
select (ns) 507.254 914.16 1102.63 1221.16 894.921 648.104
HYB-IT block size: 256
build time (s) 2.509 2.466 3.685 1.463 9.759 2.894
bits/bit 0.178 405 246 538 911 18 0.726 910 806 155 084 0.900 132 652 272 571 6 1.078 124 716 295 863 0.321 984 713 360 781 1 0.085 882 514 075 320 1
access (ns) 68.1073 150.018 289.457 117.611 171.909 80.628
rank (ns) 75.9432 165.043 291.335 133.961 174.076 96.8056
select (ns) nan nan nan nan nan nan
HYB-RRR block size: 256
build time (s) 3.344 25.407 42.77 2.646 22.861 2.661
bits/bit 0.179 313 915 915 168 02 0.711 185 904 735 567 4 0.853 696 703 216 206 6 1.077 480 904 180 698 8 0.281 826 643 036 107 4 0.085 882 499 174 159 02
access (ns) 126.547 665.36 976.07 143.217 400.072 81.6892
rank (ns) 133.519 668.806 947.346 160.66 407.562 98.2424
select (ns) nan nan nan nan nan nan
HYB block size: 256
build time (s) 2.264 2.029 1.813 1.277 9.86 2.554
bits/bit 0.193 016 071 886 461 17 0.760 724 461 173 873 8 1.058 504 750 251 859 9 1.078 125 081 374 309 3 0.328 133 506 118 256 53 0.085 882 514 075 320 1
access (ns) 49.4219 93.0849 99.518 94.6937 145.674 72.3655
rank (ns) 63.7183 119.323 132.481 128.628 148.508 89.4701
select (ns) nan nan nan nan nan nan