
FM-Adaptive: A Practical Data-Aware FM-Index

Hongwei Huo (corresponding author), Department of Computer Science, Xidian University, Xi’an, China
Zongtao He, Department of Computer Science, Xidian University, Xi’an, China
Pengfei Liu, Department of Computer Science, Xidian University, Xi’an, China
Jeffrey Scott Vitter (corresponding author), Department of Computer Science, Tulane University, New Orleans, LA, USA, and the University of Mississippi, MS, USA
Abstract

The FM-index provides an important solution for efficient retrieval and search in textual big data. Its variants have been widely used in many fields including information retrieval, genome analysis, and web searching. In this paper, we propose improvements via a new compressed representation of the wavelet tree of the Burrows-Wheeler transform of the input text, which incorporates the gap γ-encoding. Our theoretical analysis shows that the new index, called FM-Adaptive, achieves asymptotic space optimality within a factor of 2 in the leading term, yet in practice it achieves better compression and faster retrieval than the competing optimal compression boosting used in previous FM-indexes. We present a practical improved locate algorithm, based upon memoization, that provides substantially faster locating time by taking advantage of the overlapping-subproblems property. We design a lookup table for accelerated decoding to support fast pattern matching in a text. Extensive experiments demonstrate that FM-Adaptive provides faster query performance, often by a considerable amount, and/or comparable or better compression than other state-of-the-art FM-index methods.

Keywords and phrases:
Text indexing, Burrows-Wheeler transform, Compressed wavelet trees, Entropy-compressed, Compressed data structures
Copyright and License:
© Hongwei Huo, Zongtao He, Pengfei Liu, and Jeffrey Scott Vitter; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Information systems → Information retrieval; Theory of computation → Design and analysis of algorithms; Theory of computation → Data structures design and analysis; Theory of computation → Data compression; Theory of computation → Pattern matching
Supplementary Material:
Software  (Source Code): https://doi.org/10.24433/CO.7967727.v1 [30]
Acknowledgements:
We would like to thank Simon Gog for sharing code.
Funding:
This work was supported in part by the National Natural Science Foundation of China under Grant No. 62272358.
Editors:
Paolo Ferragina, Travis Gagie, and Gonzalo Navarro

1 Introduction

Massive data sets are being produced at unprecedented rates from sources like IoT, ultra-high-throughput next-generation sequencing, autonomous driving, the digital universe, and social networks. A large part of the data consists of text in the form of a sequence of symbols representing not only natural language, but also multimedia streams, biological sequences, and myriad forms of other media. A full-text index is a data structure that stores a text string in preprocessed form so that it can support fast string matching queries. The best-known full-text indexes are the suffix tree [45, 54] and the suffix array [43], which support pattern matching queries in optimal or almost-optimal time. However, for a text 𝒯 = 𝒯[0, n−1] consisting of n symbols drawn from an alphabet Σ of size σ, these data structures require Ω(n log n) bits in the standard unit-cost RAM, which is larger than the n log σ bits of the input text by a multiplicative factor of Ω(log_σ n). (All logarithms in this paper that do not have an explicit base are in base 2.) For example, using suffix trees and suffix arrays, full-text indexing requires approximately 36GB of memory in the most optimized implementations for short read mapping for a mammalian genome [4]. Thus indexing a text in a space-efficient way while supporting efficient pattern matching queries is a challenging problem.

The field of compressed or succinct data structures attempts to build data structures whose space is provably close to the size of the data in compressed format and that still provide fast query functionality. Theoretical breakthroughs in the late 1990s led to the development of a new generation of space-efficient indexes. In particular, the compressed suffix array (𝙲𝚂𝙰) [22, 52, 23, 53] and the FM-index [10, 21, 11] (based upon the Burrows-Wheeler transform (𝙱𝚆𝚃) [3, 44]) provide the fundamentals of how to work with text efficiently in compressed format. Much subsequent work has focused on making compressed indexes fast and space-efficient in practice in order to handle a variety of big data applications, such as sequencing data consisting of billions of short reads.

1.1 Related Work

The FM-index [10, 11, 13, 12, 15, 24, 18, 29, 31] and the compressed suffix array (𝙲𝚂𝙰) [22, 21, 23, 52, 53, 15, 42, 32, 19, 17] are space-efficient text indexes whose query times are proportional to the query pattern size plus the product of the output size and a small polylog function of n. The former maintains a succinct representation of the 𝙱𝚆𝚃 and the latter maintains a succinct representation of the neighbor function Ψ [22, 23]. Both are self-indexes in that they represent the original text, and thus it can be discarded.

Grossi and Vitter [22, 23] and Sadakane [52] introduced the compressed suffix array and the neighbor function Ψ. Ferragina and Manzini [10, 11] designed the original entropy-compressed FM-index based upon the Burrows-Wheeler transform (𝙱𝚆𝚃) [3, 44] and the mapping function 𝐿𝐹. The mapping function 𝐿𝐹 and the neighbor function Ψ are inverses of one another.

Grossi, Gupta, and Vitter [21] introduced the elegant data structure known as the wavelet tree, which has since become ubiquitous in text indexing. Using the wavelet tree, they were the first to establish a 𝙲𝚂𝙰 self-index that provably achieves the asymptotically optimal space bound with leading coefficient 1 (i.e., nℋ_k(𝒯) bits, where ℋ_k(𝒯) is the kth-order entropy of 𝒯, as defined subsequently in Definition 2). Their analysis of the wavelet tree also applied to the FM-index and was the first to show asymptotic optimality as well for the FM-index [21, 12, 13, 40, 41]. The original space bound derived for the 𝙱𝚆𝚃 had a leading term of 5nℋ_k(𝒯) [10], which was improved to the asymptotically optimal nℋ_k(𝒯) [21, 12] using wavelet trees.

Subsequent to the early theoretical breakthroughs [22, 10, 21, 11, 23], there have been numerous improvements. Simpler implementations of the 𝙲𝚂𝙰 and the FM-index achieve high-order compression without explicit partitioning into separate k-contexts, instead using a single wavelet tree for the entire text 𝒯 [15, 40, 41, 14]. (Here “k-context” denotes the length-k prefix in 𝒯 of a suffix 𝒯[𝑆𝐴[i], n−1], for some k ≥ 0, where 𝑆𝐴 is the suffix array of 𝒯.) We refer the reader to the nice survey of Navarro and Mäkinen [49], Navarro’s book [48], and other references in the literature [40, 41, 18, 25, 20, 37, 28, 27, 33, 50, 46, 16, 47, 5, 38, 31, 17, 32, 1, 34, 2, 42].

For example, Foschini et al. [15] encoded the single wavelet tree of the 𝙱𝚆𝚃 sequence using run-length encoding and the Elias γ code to encode the runs; if the k-contexts are hypothetically overlaid onto the 𝙱𝚆𝚃 sequence, the encoding of each run length adapts implicitly to the frequency statistics of the current k-context(s), thus achieving zero-order compression, and thus by Definition 2 of kth-order entropy, they encode the overall node in kth-order entropy space. We use that same idea in this paper. Mäkinen and Navarro [40, 41] did a careful analysis to show the surprising result that applying RRR [51] to a single wavelet tree of the entire 𝙱𝚆𝚃 sequence without any partitioning achieves the nℋ_k(𝒯)-bit leading space term, for a similar reason as in [15]; the subblocks formed by [51] implicitly encode each kth-order context in roughly zero-order entropy space. Interesting experiments on the practical performance of these RLE and RRR approaches and related methods appear in [24]. Gog et al. [18] used a fixed-block boosting technique and the RRR method [51] to implement the FM-index in asymptotically optimal entropy-compressed space with an extra cost of o(n log σ) bits. Mäkinen et al. [42] proposed a 𝙲𝚂𝙰-based index for highly repetitive data collections.

These references also mention many practical applications. An example is a special-purpose 𝙱𝚆𝚃-based compressed index for biological FASTQ data that exploits specific characteristics of next-generation sequence data [31].

1.2 Our Results

Developing space-efficient entropy-compressed self-indexes that achieve fast query performance both in theory and in practice has been a challenging problem. In this paper, we propose FM-Adaptive, a fast data-aware FM-index applicable to a wide range of text strings with different alphabet sizes. The key is a new representation of the wavelet tree of the 𝙱𝚆𝚃, with new tradeoffs between space occupancy and search time. We propose several auxiliary data structures to support fast access to the 𝙱𝚆𝚃.

Using the gap γ code instead of the Elias-Fano (EF) code [9, 7] used in [31] for DNA sequences, we derive a new space bound for general texts, whereas the space bound derived using the EF code in [31] holds only when the entropy is o(1). In addition, the compressed index in [31] is geared towards FASTQ data, which have a specific format, and its queries are not identical to the queries on the general-purpose compressed indexes in this paper. Our index has an additional efficiency: the gap γ code and the run-length γ code share the same lookup table, as in [28], even though the table entries for the gap γ code and for the run-length γ code, which we introduce in Section 3.1, have different meanings. This avoids the need to store the lookup table R_EF [31] for the Elias-Fano code.

Ferragina, Giancarlo, and Manzini [14] discussed the run-length code and the gap code for implementations of the wavelet tree. For a bit string ℬ of length n, they can represent ℬ in |Rle(ℬ)| ≤ max(2a, b + 2a/3)·|ℬ|·ℋ_0(ℬ) + 3b + 1 = 4nℋ_0(ℬ) + 4 bits (their Lemma 3.3) using the run-length γ code, and in |Ge(ℬ)| ≤ max(a, b)·|ℬ|·ℋ_0(ℬ) + a log|ℬ| + b + 2 = 2nℋ_0(ℬ) + 2 log n + 3 bits (their Lemma 3.4) using the gap γ code, for a = 2 and b = 1. Instead, we can represent ℬ in 2nℋ_0(ℬ) + 2 log n + 2 bits [24, 28] using the run-length γ code and in 2nℋ_0(ℬ) bits (Lemma 3) using the gap γ code, which avoids any second-order terms.

Ferragina et al. [14] addressed the problem of compressing the 𝙱𝚆𝚃 L, but not the indexing mechanics of how to provide fast random access to L or how to decode L. In this paper, we address compression as well as the problem of how to index the data. We design several efficient auxiliary data structures and a lookup table for fast access and decoding to support efficient pattern matching in a text. We also present a new locating algorithm that substantially improves the locating time in practice.

For input text 𝒯 of n symbols, we summarize below the novel contributions we make in this paper:

  1. We present a new compressed representation for the wavelet tree of the 𝙱𝚆𝚃, called FM-Adaptive, that incorporates the gap γ-encoding. Using this, we can implement FM-Adaptive in 2nℋ_k + o(n log σ) bits, for k ≤ α log_σ n − 1 and any fixed constant 0 < α < 1, where ℋ_k denotes the kth-order empirical entropy of 𝒯. We can construct FM-Adaptive in 𝒪(n log σ) time using 𝒪(n log n) bits of space. In addition, for a bit string ℬ of length n, we can represent ℬ in 2nℋ_0(ℬ) bits using the gap γ code, which avoids the second-order terms in the space bounds of [14, 24], where ℋ_0(ℬ) is the zero-order empirical entropy of ℬ, defined in Definition 2.

  2. We present an improved algorithm that provides substantially faster performance in practice for locating patterns in a text, based upon memoization, which takes advantage of the overlapping-subproblems property.

  3. We design the table R_γ (see Section 3.4) for accelerating the decoding of the gap γ-encoded blocks to support fast pattern matching in a text.

  4. Given any pattern 𝒫 of m symbols encoded in m log σ bits, using FM-Adaptive, we can count the number of occurrences of 𝒫 in 𝒯 in 𝒪(m log σ) time, we can locate all occ occurrences in 𝒪(occ · log^{1+ϵ} n) additional time, and we can retrieve a text substring 𝒯[start, start+ℓ−1] in 𝒪(ℓ log σ + log^{1+ϵ} n) time.

  5. Extensive experiments demonstrate that FM-Adaptive generally provides faster query performance, often by a considerable amount, and comparable or better compression than other state-of-the-art FM-index methods. The source code is available online [30].

1.3 Organization of the Paper

In Section 3 we introduce a new compression structure for wavelet trees. The search and locate functions are described and analyzed in Section 4. We report the experimental results in Section 5 and give final comments in Section 6.

2 Preliminaries

In this section, we define some necessary concepts and provide a brief description of the 𝙱𝚆𝚃.

2.1 Problem Formalization

Definition 1 (Text indexing problem).

Let 𝒯 = 𝒯[0, n−1] be a text string of length n over an alphabet Σ of size σ and let 𝒫 be a query pattern. The text indexing problem is to represent 𝒯 in as little space as possible while efficiently supporting the following query operations:

  • count(𝒫): returns the number of occurrences of 𝒫 in 𝒯.

  • locate(𝒫): reports the positions where 𝒫 occurs in 𝒯.

  • extract(start, ℓ): retrieves the substring 𝒯[start, start+ℓ−1].

2.2 Empirical Entropy

Definition 2 (empirical entropy).

Let 𝒯 be a text string of length n over an alphabet Σ={0,1,,σ1} of size σ. The zero-order empirical entropy of 𝒯 is defined as

ℋ_0 = ℋ_0(𝒯) = (1/n) ∑_{x∈Σ} n_x log(n/n_x),

where n_x is the number of occurrences in 𝒯 of symbol x ∈ Σ. The kth-order empirical entropy [44] of 𝒯 is defined as

ℋ_k = ℋ_k(𝒯) = (1/n) ∑_{ω∈Σ^k} n_ω ℋ_0(𝒯_ω),

where ω ∈ Σ^k designates a string of length k and 𝒯_ω denotes the string of length n_ω formed by taking the symbol immediately preceding each occurrence of ω in 𝒯 and concatenating those symbols together.
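To make Definition 2 concrete, here is a minimal C++ sketch (our illustration, not part of the paper’s implementation) that computes ℋ_0 of a string directly from the formula above:

```cpp
#include <cmath>
#include <cstdint>
#include <map>
#include <string>

// Zero-order empirical entropy H_0(T) in bits per symbol,
// following Definition 2: (1/n) * sum_x n_x * log2(n / n_x).
double H0(const std::string& T) {
    std::map<char, std::uint64_t> freq;
    for (char c : T) freq[c]++;                 // n_x for each symbol x
    const double n = static_cast<double>(T.size());
    double h = 0.0;
    for (const auto& [x, nx] : freq)
        h += static_cast<double>(nx) * std::log2(n / static_cast<double>(nx));
    return h / n;
}
```

Computing ℋ_k the same way would require grouping the symbols of 𝒯 by their length-k contexts ω and averaging the ℋ_0 values of the resulting strings 𝒯_ω, weighted by their lengths.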

2.3 The Burrows-Wheeler Transform and FM-index

Let 𝒯 = 𝒯[0, n−1] be a text string consisting of n symbols from an alphabet Σ of size σ, and let 𝑆𝐴 denote the suffix array of 𝒯. The 𝙱𝚆𝚃 L of 𝒯 is defined as L[i] = 𝒯[(𝑆𝐴[i] − 1) mod n]. As shown in Table 1, the 𝙱𝚆𝚃 of 𝒯 is the string L formed by sorting the suffixes of 𝒯 in lexicographical order, choosing the preceding symbol for each suffix, and concatenating those preceding symbols. (When the suffix is the entire string 𝒯, we use the symbol # as its preceding symbol.)

The Burrows-Wheeler transform is invertible. We can retrieve 𝒯 from its 𝙱𝚆𝚃 L by backward search [10] using the mapping function 𝐿𝐹: the value 𝐿𝐹(i) is the lexicographical rank of the suffix that begins with the symbol L[i] (i.e., the suffix starting at position 𝑆𝐴[i] − 1); namely, F[𝐿𝐹(i)] = L[i], where F is the first column of the conceptual matrix of sorted suffixes underlying the 𝙱𝚆𝚃.

We can compute 𝐿𝐹(i) as 𝐿𝐹(i) = 𝒞(L[i]) + 𝑂𝑐𝑐(L[i], i), where 𝑂𝑐𝑐(L[i], i) is the number of occurrences of symbol L[i] in L[0, i], and 𝒞(L[i]) is the number of occurrences of symbols in 𝒯 that are smaller than L[i]. Table 1 shows the 𝑆𝐴 and the 𝙱𝚆𝚃 L of the string 𝒯 = tcaaaatatatgcaacatatagtattagattgtat#, in which for each i we show the first four symbols (starting with F[i]) and the last symbol (namely, L[i]) of the conceptual suffixes of 𝒯 in lexicographical order. The mapping function 𝐿𝐹 and the neighbor function Ψ are inverses of one another: 𝐿𝐹(Ψ(i)) = Ψ(𝐿𝐹(i)) = i, as shown in Table 1.
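As a self-contained illustration of the 𝐿𝐹 walk (a naive sketch, not the paper’s indexed implementation), the following C++ code inverts a 𝙱𝚆𝚃 string. It assumes, as in Table 1, that # is the unique lexicographically smallest symbol and ends 𝒯; the 𝒪(n) scan inside Occ is exactly what FM-Adaptive replaces with a wavelet-tree rank query.

```cpp
#include <array>
#include <string>

// Invert a BWT string L back to T via the LF mapping
// LF(i) = C(L[i]) + Occ(L[i], i).
std::string invertBWT(const std::string& L) {
    const long n = static_cast<long>(L.size());
    std::array<long, 257> C{};                 // C[c] = # symbols in L smaller than c
    for (unsigned char c : L) C[c + 1]++;
    for (int c = 1; c <= 256; c++) C[c] += C[c - 1];

    auto Occ = [&](unsigned char c, long i) {  // # occurrences of c in L[0, i]
        long cnt = 0;
        for (long j = 0; j <= i; j++) cnt += (static_cast<unsigned char>(L[j]) == c);
        return cnt;
    };

    std::string T(n, ' ');
    T[n - 1] = '#';                            // row 0 is the suffix "#"
    long i = 0;                                // start the backward walk at row 0
    for (long k = n - 2; k >= 0; k--) {
        T[k] = L[i];                           // L[i] precedes the suffix of row i
        const unsigned char c = static_cast<unsigned char>(L[i]);
        i = C[c] + Occ(c, i) - 1;              // LF(i), 0-based
    }
    return T;
}
```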

Table 1: The 𝑆𝐴 and 𝙱𝚆𝚃 L of 𝒯 = tcaaaatatatgcaacatatagtattagattgtat#. The column ℬ is the bit string for the root node of the wavelet tree of L.

2.4 The Wavelet Tree

Let S denote a text string of length n over an alphabet Σ = {0, 1, …, σ−1}. We define the wavelet tree (𝚆𝚃) [21] of S, with σ leaves labelled by the symbols of the alphabet Σ, as follows: for each node v in the 𝚆𝚃, let Σ_v be the subset of symbols in the subtree rooted at v, and let S_v be the substring of S consisting of all the symbols of Σ_v. For each internal node v in the 𝚆𝚃, let B_v be a bit string of the same length as S_v, defined as B_v[i] = 0 if S_v[i] is in the left subtree of v and B_v[i] = 1 if S_v[i] is in the right subtree of v. Such a bit string representation recurs at each internal node; the substring at the internal node consists of the characters dispatched from the parent node, with their order in the text string preserved. The collective size of the bit strings at any given level of the tree is bounded by n. With proper encoding, the wavelet tree representation of S achieves the zero-order entropy space bound [21].

The powerful wavelet tree data structure [21, 15, 24] reduces the problem of compressing a string over a finite alphabet to the problem of compressing a set of bit (i.e., binary) strings. It is an elegant and versatile data structure that allows efficient access, rank, and select queries on a text string 𝒯 of n symbols from a multisymbol alphabet:

  • access(𝒯,i): returns the symbol 𝒯[i].

  • rank_c(𝒯, i): returns the number of occurrences of symbol c in 𝒯[0, i], for any 0 ≤ i ≤ n−1 and c ∈ Σ.

  • select_c(𝒯, j): returns the position in 𝒯 of the jth occurrence of symbol c, for any 1 ≤ j ≤ n.

A key result of Grossi et al. [21] is that if each bit string ℬ_i of the internal nodes of the wavelet tree is compressed to zero-order entropy, then the cumulative encoding is a zero-order entropy encoding of 𝒯:

Lemma 1 ([21]).

∑_{i=1}^{t} |ℬ_i| ℋ_0(ℬ_i) = n ℋ_0(𝒯), where t denotes the number of internal nodes of the wavelet tree.

With the extra cost for auxiliary data structures, the total space occupied by a wavelet tree is n ℋ_0(𝒯) + o(n log σ) bits of space [21, 15, 24].

We use a single wavelet tree to represent the 𝙱𝚆𝚃 L, as first done in Foschini et al. [15]. Mäkinen and Navarro [39] used a single wavelet tree to represent the S string of the run-length encoding of the FM-index (i.e., the run heads). Figure 1 shows the balanced wavelet tree for the example L = tcacaattttcatttgtgaattaatagaaag#ataa from Table 1. There are various ways [15] to form the wavelet tree, such as using a Huffman criterion [26]. Each leaf of the wavelet tree corresponds to a distinct symbol. The column labeled ℬ in Table 1 is the root bit string of the wavelet tree in Figure 1. A value of 0 (resp., 1) in ℬ means that the corresponding symbol in L lies in the left (resp., right) subtree. The bit strings for the roots of the subtrees are defined recursively.

Using the wavelet tree, we can turn each 𝑂𝑐𝑐(L[i], i) computation into one 𝚛𝚊𝚗𝚔_c() computation on the wavelet tree.
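The descent that realizes 𝚛𝚊𝚗𝚔_c() can be sketched as follows. This is a naive, uncompressed version for illustration only; FM-Adaptive answers the per-node rank_1 queries on the compressed blocks of Section 3 instead of scanning.

```cpp
#include <vector>

// One node of a balanced wavelet tree over the alphabet subrange [lo, hi].
struct WTNode {
    std::vector<bool> bits;            // 0 = symbol goes left, 1 = goes right
    WTNode *left = nullptr, *right = nullptr;
    int lo, hi;
};

// rank_c(S, i): # occurrences of symbol c in S[0, i], by descending from
// the root and re-ranking the position i in each child bitmap.
long wtRank(const WTNode* v, int c, long i) {
    while (v->left != nullptr) {       // stop at a leaf
        const int mid = (v->lo + v->hi) / 2;
        long ones = 0;
        for (long j = 0; j <= i; j++) ones += v->bits[j];   // naive rank1
        if (c <= mid) {                // c lives in the left subtree
            i = (i + 1 - ones) - 1;    // rank0(bits, i) - 1
            v = v->left;
        } else {
            i = ones - 1;              // rank1(bits, i) - 1
            v = v->right;
        }
        if (i < 0) return 0;           // c does not occur in S[0, i]
    }
    return i + 1;
}
```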

Figure 1: The balanced wavelet tree representation for the 𝙱𝚆𝚃 L of 𝒯 shown in Table 1.

3 Wavelet Tree Compression

In this section, we introduce a new compressed representation for the bit strings of wavelet trees, which incorporates the gap γ-encoding for general text strings. In Sections 3–5, we show that this representation provides excellent query performance both in theory and in practice.

3.1 Encoding Methods

We let ℬ denote a bit string of length n. Let r be the number of occurrences of the least frequent bit in ℬ, where 1 ≤ r ≤ n/2. Let 1 ≤ p_1 < p_2 < ⋯ < p_{r−1} < p_r ≤ n denote the increasing sequence of positions of the least frequent bit in ℬ. We assume for convenience p_0 = 0. Let g_j = p_j − p_{j−1} denote the gap sequence of pairwise differences of neighboring values of this increasing position sequence, for j = 1, 2, …, r. We represent each gap using the Elias γ code [8]. The resulting gap γ-encoding of ℬ is the bit string γ_gap(ℬ) = γ(g_1)γ(g_2)⋯γ(g_r). Obviously, ∑_{j=1}^{r} g_j = p_r ≤ n.

On the other hand, we can view ℬ as a sequence of maximal runs of identical bits, ℬ = b_1^{ℓ_1} b_2^{ℓ_2} ⋯ b_s^{ℓ_s} for some s ≤ n, where b_i, b_{i+1} ∈ {0, 1} and b_i ≠ b_{i+1} for 1 ≤ i < s. We can represent the length of each run using the Elias γ code [8]. The resulting run-length γ-encoding of ℬ is the bit string γ_rl(ℬ) = γ(ℓ_1)γ(ℓ_2)⋯γ(ℓ_s).
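For concreteness, here is a minimal C++ sketch of the γ encoder and the gap γ-encoding (our illustration; the bit order and packing details of the actual index may differ):

```cpp
#include <cstdint>
#include <vector>

// Append the Elias gamma code of x >= 1 to 'out':
// floor(log2 x) zeros followed by the binary representation of x.
void gammaEncode(std::uint64_t x, std::vector<bool>& out) {
    int len = 0;
    while ((x >> (len + 1)) != 0) len++;            // len = floor(log2 x)
    for (int i = 0; i < len; i++) out.push_back(false);
    for (int i = len; i >= 0; i--) out.push_back(((x >> i) & 1) != 0);
}

// Gap gamma-encoding of a bit string B (Section 3.1): gamma-encode the
// gaps g_j = p_j - p_{j-1} between consecutive positions (1-based) of
// the least frequent bit, with p_0 = 0.
std::vector<bool> gapGammaEncode(const std::vector<bool>& B) {
    long ones = 0;
    for (bool bit : B) ones += bit;
    const bool rare = (2 * ones <= static_cast<long>(B.size()));  // rarest bit value
    std::vector<bool> out;
    std::uint64_t prev = 0;                          // p_0 = 0
    for (std::uint64_t p = 1; p <= B.size(); p++)
        if (B[p - 1] == rare) {
            gammaEncode(p - prev, out);              // gap g_j
            prev = p;
        }
    return out;
}
```

The run-length γ-encoding is analogous: instead of the gaps g_j, one γ-encodes the run lengths ℓ_1, ℓ_2, …, ℓ_s.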

3.2 Structure

The starting point of the structure of the FM-Adaptive algorithm follows from [31], which we incorporate into the following discussion. We start the summary by introducing the compression structure used to represent the bit strings of the nodes of the wavelet tree. We then give an example showing how the newly introduced gap γ code works, which is the key to improving general-purpose performance.

To enable effective compression on the bit strings of the wavelet tree, we partition the bit strings of the wavelet tree nodes into blocks and categorize the blocks into three basic types:

  1. blocks consisting only of all 0s or of all 1s;

  2. blocks having relatively long runs of 0s or of 1s; and

  3. blocks having a random-like sequence of 0s and 1s.

For each block, we choose one of the following four compression methods to minimize its coding length: All0/All1, GapG0/GapG1, RLG0/RLG1, and Plain.

Specifically, for a block of type 1, consisting only of all 0s or all 1s, we use the All0/All1 code to encode it; that is, no additional bits are needed to store it. For a block of type 2, having relatively long 0-runs or 1-runs, we use either the GapG code or the RLG code to encode it. Here GapG means that we apply the gap γ code (see Section 3.1) to encode the block, and RLG means that we apply the run-length γ code (see Section 3.1) to encode the block. For a block of type 3, having a random-like sequence of 0s and 1s, we keep the block unchanged, denoted as Plain. This partitioning is combined with the mixed encoding to represent the wavelet tree nodes. A compressed wavelet tree (𝙲𝚆𝚃) is one whose node bit strings are partitioned and mixed-encoded in this way.

Let ℬ denote a bit string of an internal node of a wavelet tree. We use the following three steps to obtain a succinct representation of ℬ:

  1. Partition ℬ into blocks ℬ_j of size b, except possibly the last one.

  2. Combine a/b contiguous blocks to form a superblock of size a.

  3. Apply the mixed encoding to encode each block. We build the encoded sequence 𝒮 by concatenating the encoded blocks.

We also maintain five extra structures to support fast access to ℬ: 𝑆𝐵𝑟𝑎𝑛𝑘, 𝐵𝑟𝑎𝑛𝑘, 𝑆𝐵, B, and M, where 𝑆𝐵𝑟𝑎𝑛𝑘 stores the number of 1s in ℬ preceding the current superblock; 𝐵𝑟𝑎𝑛𝑘 stores the number of 1s in ℬ preceding the current block, relative to the beginning of its enclosing superblock; 𝑆𝐵 stores the number of bits in 𝒮 preceding the current superblock; B stores the number of bits in 𝒮 preceding the current block, relative to the beginning of its enclosing superblock; and M indicates the encoding method used in a block.

Table 2 shows the compression structure for the root bit string ℬ = 100000111100111111001100101000100100 of the wavelet tree of the 𝙱𝚆𝚃 for the example string 𝒯 from Table 1, where n = 36, b = 6, and a = 18. The table shows four encodings: All1, GapG, RLG, and Plain. For purposes of illustrating all four methods, the RLG code is used to encode block 4. But in the actual algorithm, GapG would instead be selected to encode block 4, since it would result in a shorter encoding.

Table 2: The succinct index structure for the wavelet tree root bit string ℬ using the mixed encoding of All1, GapG, RLG, and Plain, where n = 36, b = 6, and a = 18.

𝑟𝑎𝑛𝑘_1(ℬ, i) = 𝑆𝐵𝑟𝑎𝑛𝑘[⌊i/a⌋] + 𝐵𝑟𝑎𝑛𝑘[⌊i/b⌋] + 𝑙𝑟𝑎𝑛𝑘(𝒮, offset, M[⌊i/b⌋])   (1)

Using the encoded sequence 𝒮 of ℬ and the auxiliary structures 𝑆𝐵𝑟𝑎𝑛𝑘, 𝐵𝑟𝑎𝑛𝑘, 𝑆𝐵, B, and M, we can compute 𝚛𝚊𝚗𝚔_1(ℬ, i) by Equation (1). The 𝑙𝑟𝑎𝑛𝑘 operation performs an All0/All1, GapG0/GapG1, RLG0/RLG1, or Plain decoding, depending upon M[⌊i/b⌋], to return the number of 1s up to position offset within block ⌊i/b⌋, where offset = i mod b. The starting decoding position in 𝒮 is 𝑆𝐵[⌊i/a⌋] + B[⌊i/b⌋]. We can access ℬ[i] in a way similar to 𝚛𝚊𝚗𝚔_1(ℬ, i).
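The directory arithmetic of Equation (1) can be sketched in C++ as follows. The field names mirror the paper’s structures, but the type choices are ours, and only the trivial decodings of 𝑙𝑟𝑎𝑛𝑘 are shown; the GapG and RLG branches decode via the table R_γ of Section 3.4.

```cpp
#include <cstdint>
#include <vector>

enum Method : std::uint8_t { ALL0, ALL1, GAPG, RLG, PLAIN };

struct CompressedBits {
    std::vector<std::uint64_t> SBrank;  // # of 1s in B before each superblock
    std::vector<std::uint32_t> Brank;   // # of 1s before each block, within its superblock
    std::vector<std::uint64_t> SB;      // bit offset in S of each superblock
    std::vector<std::uint32_t> Boff;    // bit offset of each block within its superblock
    std::vector<std::uint8_t>  M;       // encoding method of each block
    std::vector<std::uint64_t> S;       // concatenated block encodings, bit-packed
    std::uint32_t b, a;                 // block and superblock sizes in bits
};

// lrank: # of 1s in positions [0, offset] of one block, decoding S from
// bit position 'start'.
std::uint64_t lrank(const CompressedBits& C, std::uint64_t start,
                    std::uint32_t offset, std::uint8_t method) {
    if (method == ALL0) return 0;
    if (method == ALL1) return offset + 1;
    if (method == PLAIN) {
        std::uint64_t ones = 0;
        for (std::uint32_t j = 0; j <= offset; j++) {
            const std::uint64_t bit = start + j;
            ones += (C.S[bit >> 6] >> (bit & 63)) & 1;
        }
        return ones;
    }
    return 0;                           // GapG/RLG decoding omitted in this sketch
}

// rank1(B, i), following Equation (1).
std::uint64_t rank1(const CompressedBits& C, std::uint64_t i) {
    const std::uint64_t sb = i / C.a, blk = i / C.b;
    const std::uint64_t start = C.SB[sb] + C.Boff[blk];   // where block blk begins in S
    return C.SBrank[sb] + C.Brank[blk] + lrank(C, start, i % C.b, C.M[blk]);
}
```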

Lemma 2.

Using the table R_γ for chunks of w bits from Section 3.4, we can compute 𝚛𝚊𝚗𝚔_{L[i]}(𝙲𝚆𝚃, i) in 𝒪((b/w) log σ) time, for block size b.

3.3 Construction

In this section, we describe the construction of the compressed wavelet tree 𝙲𝚆𝚃. The 𝙲𝚆𝚃 consists of the compressed bit strings of the wavelet tree nodes. We describe the compression in Algorithm 3 in Appendix A; it produces the encoded sequence 𝒮 and the five auxiliary structures 𝑆𝐵𝑟𝑎𝑛𝑘, 𝐵𝑟𝑎𝑛𝑘, 𝑆𝐵, B, and M. It is simple to see that we can construct the compressed wavelet tree in 𝒪(n log σ) time.

Theorem 1.

Given the wavelet tree of the 𝙱𝚆𝚃 of text 𝒯, we can construct the 𝙲𝚆𝚃 in 𝒪(n log σ) time.

3.4 Accelerating the Rank Computation

In order to accelerate the 𝑟𝑎𝑛𝑘_1(ℬ, i) computation for a gap γ-encoded block (GapG), we design the table R_γ, which tabulates the gap γ-decoding of every possible chunk of w = (log n)/2 bits. Using R_γ, we can process each Θ(log n) bits of a γ-encoded sequence in constant time. Each index of R_γ, corresponding to a bit string of w bits, contains three components, R1, R2, and R3:

  • R1 stores the total number of decoded entries in the chunk, i.e., the number of decoded gaps.

  • R2 stores the cumulative sum of the decoded gap values in the chunk, which corresponds to the largest decoded position.

  • R3 stores the total number of decoded bits in the chunk.

The function GapG1rank() in Algorithm 4 in Appendix A implements the 𝑙𝑟𝑎𝑛𝑘() operation using R_γ for a GapG1-encoded block. The bit operator ≪ shifts the bits in the argument to the left by the designated number of positions.
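One possible construction of R_γ is sketched below, assuming that chunks are read most-significant-bit first (the bit order and exact table layout in the actual implementation may differ). Each entry records how many complete γ codes fit in the chunk (R1), the sum of their decoded values (R2), and how many bits they consume (R3).

```cpp
#include <array>
#include <cstdint>

struct RGammaEntry { std::uint16_t R1, R2, R3; };

// Build R_gamma for chunk width W; bit j of 'chunk', counted from the
// most significant position, is ((chunk >> (W - 1 - j)) & 1).
template <int W>
std::array<RGammaEntry, (1 << W)> buildRGamma() {
    std::array<RGammaEntry, (1 << W)> R{};
    for (std::uint32_t chunk = 0; chunk < (1u << W); chunk++) {
        std::uint16_t cnt = 0, sum = 0, pos = 0;
        while (true) {
            int zeros = 0;                             // unary length prefix
            while (pos + zeros < W &&
                   ((chunk >> (W - 1 - (pos + zeros))) & 1) == 0)
                zeros++;
            if (pos + 2 * zeros + 1 > W) break;        // code spills out of the chunk
            std::uint32_t val = 0;                     // binary part, MSB first
            for (int j = 0; j <= zeros; j++)
                val = (val << 1) | ((chunk >> (W - 1 - (pos + zeros + j))) & 1);
            cnt++; sum += static_cast<std::uint16_t>(val);
            pos += static_cast<std::uint16_t>(2 * zeros + 1);
        }
        R[chunk] = {cnt, sum, pos};
    }
    return R;
}
```

For example, buildRGamma<8>() tabulates all 256 chunks of width 8; the paper uses w = (log n)/2, so a chunk fits comfortably in a machine word.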

3.5 Space Usage

We let ℬ_i denote the bit string of internal node i of the wavelet tree of the 𝙱𝚆𝚃 L, for i = 1, 2, …, t = σ−1, and let n_i denote the number of bits contained in ℬ_i. We let r be the number of occurrences of the least frequent bit in ℬ_i, where 1 ≤ r ≤ n_i/2. Applying the gap γ code to ℬ_i, we get the resulting bit string γ_gap(ℬ_i) = γ(g_1)γ(g_2)⋯γ(g_r), whose space bound is given in Lemma 3; it represents an improved space bound over the ones in [14, 24]:

Lemma 3.

|γ_gap(ℬ_i)| ≤ 2 n_i ℋ_0(ℬ_i).

Proof.

By the definition of γ_gap(ℬ_i), we have

|γ_gap(ℬ_i)| = ∑_{j=1}^{r} |γ(g_j)| = ∑_{j=1}^{r} (2⌊log g_j⌋ + 1) ≤ 2r log((∑_{j=1}^{r} g_j)/r) + r
= 2r log(p_r/r) + r ≤ 2r log(n_i/r) + 2(n_i − r) log(n_i/(n_i − r)) = 2 n_i ℋ_0(ℬ_i).

The second equality is due to the Elias γ code [8]. The first inequality is due to Jensen’s inequality, and the next line follows from ∑_{j=1}^{r} g_j = p_r and p_r ≤ n_i (see Section 3.1), and from r ≤ (n_i − r) log(n_i/(n_i − r)) when r ≤ n_i/2. The last step is due to Definition 2 of zero-order empirical entropy.

We partition ℬ_i into blocks ℬ_{ij} of size b, for 1 ≤ j ≤ ⌈n_i/b⌉, and represent each block ℬ_{ij} as a gap sequence g_1, g_2, …, g_{r_j} of positions of the least frequent bit of ℬ_{ij}, for some 1 ≤ r_j ≤ b/2.

We compress each block ℬ_{ij} by choosing the one of four compression methods that minimizes the encoding length h(ℬ_{ij}): All0/All1, GapG0/GapG1, RLG0/RLG1, and Plain. Specifically, h(ℬ_{ij}) = min{|a(ℬ_{ij})|, |γ_gap(ℬ_{ij})|, |γ_rl(ℬ_{ij})|, |p(ℬ_{ij})|}, where |·| denotes the bit length of the encoding.
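As an illustration of this minimization (our sketch; the real encoder also emits the chosen code and a method tag), the candidate lengths for one block can be computed as follows:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Bit length of the Elias gamma code of x >= 1.
inline std::uint32_t gammaLen(std::uint64_t x) {
    std::uint32_t len = 0;
    while ((x >> (len + 1)) != 0) len++;        // floor(log2 x)
    return 2 * len + 1;
}

// h(B_ij): the minimum encoding length of one block over
// All0/All1, GapG, RLG, and Plain.
std::uint32_t blockCodeLength(const std::vector<bool>& blk) {
    const std::uint32_t b = static_cast<std::uint32_t>(blk.size());
    std::uint32_t ones = 0;
    for (bool bit : blk) ones += bit;
    if (ones == 0 || ones == b) return 0;       // All0/All1: no payload bits

    const bool rare = (2 * ones <= b);          // least frequent bit value
    std::uint32_t gap = 0, rl = 0, prev = 0, runStart = 0;
    for (std::uint32_t p = 1; p <= b; p++) {
        if (blk[p - 1] == rare) { gap += gammaLen(p - prev); prev = p; }
        if (p == b || blk[p] != blk[p - 1]) {   // a maximal run ends at p-1
            rl += gammaLen(p - runStart);
            runStart = p;
        }
    }
    return std::min({gap, rl, b});              // GapG vs. RLG vs. Plain
}
```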

Theorem 2.

Given the 𝙱𝚆𝚃 L of 𝒯 of length n over an alphabet of size σ, we can represent L using the compressed wavelet tree in 2nℋ_k(𝒯) + o(n log σ) bits of space for any k such that k ≤ α log_σ n − 1 and any fixed constant 0 < α < 1, where ℋ_k(𝒯) denotes the kth-order empirical entropy of 𝒯.

Theorem 2 follows from the analysis of Foschini et al. [15] and Huo et al. [28]. The key idea of [15] was to implicitly consider the k-contexts ω for purposes of analysis. The gap γ-encoding adapts to each k-context as it is encountered, with some space inefficiency because gaps can span context blocks. However, the extra space is matched by the space savings achieved by using a single wavelet tree, which avoids all the space overhead that would be needed if there were an individual wavelet tree for each k-context.

Figure 2 shows the bit string ℬ_1 of the root node of the wavelet tree for the 𝙱𝚆𝚃 L of 𝒯 from Table 1 and Figure 1. The second row is a hypothetical partition of L by the k-contexts ω for k = 2, and each part is called a context block L_ω. Let the k-context ω of L[i] denote the length-k prefix in 𝒯 of the suffix 𝒯[𝑆𝐴[i], n−1]. For each k-context ω, the context block L_ω of L can be formed conceptually by choosing the symbol in 𝒯 preceding each occurrence of the k-context ω and concatenating those preceding symbols. The third row is a partition of ℬ_1 by contexts, and each part is called a context bit block X_{1ω}. The partition of L is the same as the partition of ℬ_1 according to contexts. The fourth row is a partition of ℬ_1 with fixed length 5. The red region (see Figure 2) comprises the context bit block 11001111, corresponding to the context block ttcatttg (a substring of L) for the context at.

Each k-context block remains contiguous in the internal nodes of the wavelet tree, as shown in red in Figure 2 for the 2-context at. By adapting Lemmas 3 and 1, the terms log g_j for a given k-context ω over all the wavelet tree nodes sum up to |L_ω| ℋ_0(L_ω) for that context, and by Definition 2 of kth-order entropy, summing over all the k-contexts ω yields nℋ_k. The analysis of [15, 28] considers how and whether the various hypothetical k-contexts overlap with the blocks and superblocks, which incurs some extra space costs. The resulting code length is stated in Theorem 2.

Figure 2: Hypothetical partition of ℬ_1 and L by k-context ω, for k = 2. The portion of ℬ_1 for the 2-context ω = at is shaded in red.

4 Pattern Matching

In this section, we use FM-Adaptive to implement three types of string matching queries: count, locate, and substring extract. We describe the count query in Appendix B.

4.1 Locate Query

We sample the suffix array 𝑆𝐴 in the same manner as in Huo et al. [32], and we denote by 𝑆𝐴_s the 𝑆𝐴 sampling, where s is the 𝑆𝐴 sampling step size. We use a conceptual bit array D of length n to record the indices corresponding to the sampled values in 𝑆𝐴_s. The locate algorithm without memoization is given in Appendix A.

Theorem 3.

Let s = log^{1+ϵ} n / log σ, for ϵ > 0, be the step size for the suffix array sampling. Given the index range in 𝑆𝐴 of pattern 𝒫, we can answer a locate query and find the occ occurrences of 𝒫 using the 𝙲𝚆𝚃 of the 𝙱𝚆𝚃 L of 𝒯 in 𝒪(occ · log^{1+ϵ} n) time using 2nℋ_k(𝒯) + o(n log σ) bits of space, for any k such that k ≤ α log_σ n − 1 and any constant 0 < α < 1, where ℋ_k(𝒯) denotes the kth-order empirical entropy of 𝒯.

Proof.

Basic operations D[i] and 𝑟𝑎𝑛𝑘_1() on the bit string take constant time. Each computation of 𝐿𝐹[i] using R_γ takes 𝒪((b/w) log σ) time by Lemma 2. We can find an 𝑆𝐴 sample in at most s steps, which takes 𝒪(s(b/w) log σ) time. Substituting the values b = 𝒪(log n), s = log^{1+ϵ} n / log σ for ϵ > 0, and w = (log n)/2, we get the time bound 𝒪(log^{1+ϵ} n). Thus the locate algorithm runs in 𝒪(occ · log^{1+ϵ} n) time, where occ is the number of occurrences of 𝒫 in 𝒯.

There are n/s samples of 𝑆𝐴, and each needs log(n/s) bits. So the space required by the sampled suffix array 𝑆𝐴_s is (n/s) log(n/s) = o(n log σ) bits, for s = log^{1+ϵ} n / log σ and ϵ > 0. We can store the bit array D in o(n) bits and support 𝚛𝚊𝚗𝚔_1(D, ·) and 𝚜𝚎𝚕𝚎𝚌𝚝_1(D, ·) in 𝒪(1) time [35, 6]. We can store R_γ in o(n) bits for w = (log n)/2. The space required by the index structure is given in Theorem 2.

Summing the space required by the 𝙲𝚆𝚃, R_γ, 𝑆𝐴_s, and D, we obtain the space bound of 2nℋ_k(𝒯) + o(n log σ) bits for the locate query.

Now we consider the practical improvement of the locate query. The improvement is based upon memoization, which takes advantage of the overlapping-subproblems property. Assume that we are given the index range [l, r] in 𝑆𝐴 of the suffixes with prefix 𝒫. According to our sampling method on the suffix array, for each j ∈ [l, r], we walk at most s steps on 𝐿𝐹 to reach a suffix array sample. For any i ≠ j with i ∈ [l, r], we could perform the same process. The key point is that the walk that finds a suffix array sample for some j ∈ [l, r] may include the walk that finds a suffix array sample for some i ∈ [l, r]; that is, i and j may share the same suffix sample. If we know the suffix position p(i) of i and the number of steps (say, #steps) walked on 𝐿𝐹 from j to i, we can obtain the suffix position p(j) of j as p(j) = p(i) + #steps. This allows us to reduce the number of steps walked on 𝐿𝐹 and thus speed up the locating process.

Algorithm 1 fastLocate(𝒫,l,r).
Algorithm 2 extract(𝑠𝑡𝑎𝑟𝑡,).

We use the structure 𝑝𝑜𝑠[] to keep the suffix positions for each j ∈ [l, r]. We compute the values of 𝑝𝑜𝑠[] in two ways:

  1. by walking on 𝐿𝐹 using only 𝑆𝐴_s; and

  2. using previously computed values of 𝑝𝑜𝑠[] together with the data structures 𝑝𝑟𝑒𝑑[] and 𝑑𝑖𝑠𝑡[] described below.

The array 𝑝𝑟𝑒𝑑 has length r − l + 1; 𝑝𝑟𝑒𝑑[i − l] records the j such that i = 𝐿𝐹^k(j) ∈ [l, r] for the first time. The array 𝑑𝑖𝑠𝑡 also has length r − l + 1; 𝑑𝑖𝑠𝑡[j − l] is the number of steps on 𝐿𝐹 needed to walk from j to i.

𝖿𝖺𝗌𝗍𝖫𝗈𝖼𝖺𝗍𝖾 in Algorithm 1 gives the pseudocode of the practical improvement to the locate query. We initialize all 𝑝𝑜𝑠[], 𝑑𝑖𝑠𝑡[], and 𝑝𝑟𝑒𝑑[] entries to −1, 0, and −1, respectively.
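The following self-contained C++ sketch captures the memoization idea. The functions LF, sampledD, and sampleSA stand in for the index’s actual structures (their names are ours), and for brevity the walk is recorded in a trail and backfilled, rather than through the pred and dist arrays of Algorithm 1.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Assumed interfaces to the index (illustrative declarations):
std::uint64_t LF(std::uint64_t i);        // LF mapping via the CWT
bool sampledD(std::uint64_t i);           // D[i] = 1 iff SA[i] is sampled
std::uint64_t sampleSA(std::uint64_t i);  // sampled SA value at such an i

// Memoized locate: walks from distinct j in [l, r] share LF steps, so a
// re-entry into [l, r] whose position is already known ends a walk early.
std::vector<std::uint64_t> fastLocate(std::uint64_t l, std::uint64_t r) {
    const std::uint64_t m = r - l + 1;
    std::vector<std::int64_t> pos(m, -1);     // -1 marks "not yet known"
    for (std::uint64_t j = l; j <= r; j++) {
        if (pos[j - l] >= 0) continue;
        std::vector<std::pair<std::uint64_t, std::uint64_t>> trail;  // (row, steps from j)
        trail.push_back({j, 0});
        std::uint64_t i = j, steps = 0, sj = 0;
        while (true) {
            if (sampledD(i)) { sj = sampleSA(i) + steps; break; }    // hit a SA sample
            i = LF(i);                        // SA[LF(i)] = SA[i] - 1
            steps++;
            if (l <= i && i <= r) {
                if (pos[i - l] >= 0) {        // overlapping subproblem: reuse it
                    sj = static_cast<std::uint64_t>(pos[i - l]) + steps;
                    break;
                }
                trail.push_back({i, steps});  // resolved from sj below
            }
        }
        for (const auto& [x, d] : trail)      // p(x) = p(j) - d
            pos[x - l] = static_cast<std::int64_t>(sj - d);
    }
    return std::vector<std::uint64_t>(pos.begin(), pos.end());
}
```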

4.2 Substring Extract Query

In this section, we consider how to retrieve text substrings 𝒯[start, start+ℓ−1] using the 𝙲𝚆𝚃 of L and 𝑆𝐴_s^{−1}. We sample the inverse suffix array 𝑆𝐴^{−1} in the same manner as in Huo et al. [32]. We denote by 𝑆𝐴_s^{−1} the 𝑆𝐴^{−1} sampling.

The workflow for retrieving a substring using the 𝙲𝚆𝚃 of L and 𝑆𝐴_s^{−1} is as follows: we first transform a given position into its rank i in 𝑆𝐴 and then retrieve 𝒯[start, start+ℓ−1] by walking on 𝐿𝐹. The first step is done by the transform() procedure and the second step by the retrieve() procedure. Both procedures are given in Algorithm 2.
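A sketch of the two procedures follows (illustrative interfaces; Algorithm 2 is the authoritative version, and boundary handling near the end of 𝒯 is omitted here). We assume the 𝑆𝐴^{−1} samples are taken at text positions that are multiples of the step size s.

```cpp
#include <cstdint>
#include <string>

// Assumed interfaces to the index (illustrative declarations):
std::uint64_t LF(std::uint64_t i);         // LF mapping via the CWT
unsigned char accessL(std::uint64_t i);    // access L[i] via the wavelet tree
std::uint64_t isaSample(std::uint64_t p);  // SA^{-1} sample at text position p
extern const std::uint64_t s;              // SA^{-1} sampling step size

// extract(start, len) = T[start, start+len-1].
std::string extract(std::uint64_t start, std::uint64_t len) {
    const std::uint64_t end = start + len - 1;
    // transform(): first sampled text position p > end, mapped to its rank in SA
    std::uint64_t p = ((end + 1 + s - 1) / s) * s;
    std::uint64_t i = isaSample(p);            // rank of the suffix T[p, n-1]
    // retrieve(): row i is the suffix starting at p, so L[i] = T[p-1];
    // each LF step moves to the row of the suffix starting one position earlier
    std::string out(len, '\0');
    while (p > start) {
        const unsigned char c = accessL(i);
        p--;                                   // c is the character T[p]
        if (p <= end) out[p - start] = static_cast<char>(c);
        i = LF(i);
    }
    return out;
}
```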

Theorem 4.

Given position start and length ℓ, we can answer a substring extract query using the 𝙲𝚆𝚃 of the 𝙱𝚆𝚃 L of 𝒯 in 𝒪(ℓ log σ + log^{1+ϵ} n) time using 2nℋ_k(𝒯) + o(n log σ) bits of space, for any k such that k ≤ α log_σ n − 1, any constant 0 < α < 1, and ϵ > 0, where ℋ_k(𝒯) denotes the kth-order empirical entropy of 𝒯.

Proof.

The time complexity of the substring extract algorithm is determined by the total running time of the transform procedure and the retrieve procedure. Each computation of 𝐿𝐹[i] using R_γ with chunk width w takes 𝒪((b/w) log σ) time, and the maximum walking length of the for loop in transform is s. Consequently, the transform procedure runs in 𝒪(s(b/w) log σ) time, and the retrieve procedure runs in 𝒪(ℓ(b/w) log σ) time.

By summing the two parts and substituting the values b = 𝒪(log n), s = log^{1+ϵ} n / log σ, and w = (log n)/2, we get the running time bound of the extract algorithm: 𝒪(s(b/w) log σ) + 𝒪(ℓ(b/w) log σ) = 𝒪(ℓ log σ + log^{1+ϵ} n) in the worst case.

The space required by the extract query is the same as that for the locate query, which is 2nℋ_k(𝒯) + o(n log σ) bits.

5 Experiments and Analysis

5.1 Experimental Setting, Datasets, and Measure

In this section we describe experiments performed using the environment described in [32]. We used C++ to implement the algorithms and constructed the suffix array with parameter 64 using Mori’s fast lightweight suffix-sorting library (github.com/y-256/libdivsufsort/).

Table 3: Statistics and distribution for the tested data sets.

Table 3 summarizes some statistical characteristics of the data sets we used for the experiments. The data sets consist of four datasets from the Pizza&Chili Corpus (pizzachili.dcc.uchile.cl/texts.html) (datasets 1–4), the highly repetitive data sets from the Repetitive Corpus (pizzachili.dcc.uchile.cl/repcorpus.html) (datasets 5–8), the human genome (called hg38 at UCSC) from UCSC (hgdownload.cse.ucsc.edu/goldenPath/hg38) (dataset 9), and NA12877R10 (dataset 10), formed by extracting 10 gigabytes of reads from NA12877_1, downloaded from EBI (www.ebi.ac.uk/ena/browser/view/ERR194146). Datasets 1–4 and 9–10 are nonrepetitive data sets, and datasets 5–8 are highly repetitive data sets. The expression n/r denotes the average run length in the 𝙱𝚆𝚃 L, and σ is the alphabet size of the input data. For hg38, we exclude the query on the pattern NN…N.

In our experiments, we examine the following state-of-the-art algorithms for their space usage and query time:

  1. AF-Index: the implementation (pizzachili.dcc.uchile.cl/indexes) of the alphabet-friendly FM-index [13, 12], which combines an existing compression boosting technique with the wavelet tree data structure. We used its latest version, af-index_v2.1.

  2. FMI-Hybrid: the recent implementation (github.com/simongog/sdsl-lite) of the FM-index [18], which uses the fixed-block boosting technique [36], in which bitvectors by default are implemented by hybrid encoding [37].

  3. RL-CSA: the implementation (jltsiren.kapsi.fi/rlcsa) of the compressed suffix array [42] that uses run-length encoding and has been optimized for highly repetitive data. We used its current version of May 2016.

  4. GeCSA: the recent implementation (codeocean.com/capsule/3554560/tree/v1) of the compressed suffix array [32], in which a new run-length δ code is introduced for the gap sequence of Ψ [22, 23].

  5. FM-Adaptive: our method described in this paper.

5.2 Improvement of the Locate Query by Memoization

In this section, we make an experimental comparison of the locate query before and after the improvement. The key point of the memoization improvement is to remember the computed suffix positions for some j[l,r] and then use these computed positions and other auxiliary information to determine the suffix positions of the remaining j so as to reduce the number of accesses to 𝐿𝐹 and thus speed up the locate query.

In the experiments, we perform 500 locate queries and calculate the average number of accesses to 𝐿𝐹 per locate query. Figure 3 shows the average number of accesses to 𝐿𝐹 per locate query before (in blue) and after (in orange) the improvement by memoization.

Figure 3: The average number of accesses to 𝐿𝐹 per locate query before (in blue) and after (in orange) memoization is used, for s = 64, 128, 256, on the data sets shown in Table 3.

As can be seen in Figure 3, the memoization of some computed suffix positions generally improves the locate query time substantially. The average number of accesses to 𝐿𝐹 is decreased by 88.69–99.11 percent on english, sources, para, w.leaders, and kernel, by 44.88–47.31 percent on hg38, by 6.29–17.82 percent on NA12877R10, by 0.05–2.2 percent on proteins, by 0.01–0.08 percent on DNA, and by 0.0–0.01 percent on influenza, for all s.

Table 4: The average pattern locating time T in milliseconds for three locate algorithms, for s=32 (top), 64 (middle) and 128 (bottom).

Figure 3 shows substantial reductions in 𝐿𝐹 invocations compared with the locate algorithm without memoization shown in Appendix A for most tested data. The memoization does not help on the datasets DNA, proteins, and influenza, because few short patterns occur frequently there, which makes the condition i = 𝐿𝐹^k(j) ∈ [l, r] rarely satisfied.

To show that the improvement of the locate query is largely due to the memoization rather than to the suffix array sampling method, we performed additional experiments with three locate algorithms on the Pizza&Chili Corpus, as well as the Repetitive Corpus, which we show in Table 4.

We denote the locate algorithm with position sampling of the suffix array in [28] as locate15, the locate algorithm without memoization with suffix array value sampling (Algorithm 5 in Appendix A) as locate, and the locate algorithm with memoization with suffix array value sampling (Algorithm 1) as fastLocate. We randomly select 1000 patterns of length 20 from the input text and average the total time over the 1000 locate queries for each of the three locate algorithms, denoted T_ps for locate15, T_vs for locate, and T_fast for fastLocate, in milliseconds, as shown in Table 4. We take block size b = 256 for the three locate algorithms. It can be seen that fastLocate is generally several times to orders of magnitude faster than locate15 and locate for suffix array sampling step sizes s = 32, 64, and 128.

5.3 Performance Evaluation

In the following sections we compare the performance of our proposed algorithms with the state-of-the-art indexing methods described above, based upon the performance criteria of compression ratio, locate query time, and extract query time. We use the following parameters for each index:

  • AF-Index: b=328, and s = 64, 128, and 256;

  • FMI-Hybrid: b=256, and s = 32, 64, and 128;

  • RL-CSA: b=64, s = 32, 64, and 128;

  • GeCSA: b=256, and s = 128, 256, and 512;

  • FM-Adaptive: b=256, and s = 64, 128, and 256.

The other parameters used are the default parameters. The testing methodology is the same as in [32]. The compression ratio is defined as the ratio of the index size to the original size of the input text. The parameters in our index are a = 16b, d = s, and w = 16.

Figure 4: Performance comparison of five indexing methods on the 𝑙𝑜𝑐𝑎𝑡𝑒 query. The measures used are compression ratio in percentage and average time T per query in milliseconds on the 10 data sets described in Section 5.1.

5.3.1 Locate Query

Figure 4 shows the compression ratio and time for the locate query of the compared indexing methods. For the tested data of NA12877R10, RL-CSA fails to build the index, since it limits the size of the input collection to less than 4 gigabytes. AF-Index fails to build the index on proteins, english, hg38, and NA12877R10 with input size larger than 1 gigabyte.

As can be seen in Figure 4, FM-Adaptive provides the best compression on the nonrepetitive datasets shown in Table 3. FM-Adaptive is typically several to dozens of times faster than FMI-Hybrid in query time. Specifically, for the nonrepetitive datasets 1–4 and 9–10, when minimizing compression ratio, FM-Adaptive gives the best compression and is 1.03–5.58 times faster than FMI-Hybrid in query time in four out of six nonrepetitive datasets. Speedwise, when minimizing query time, RL-CSA shows the fastest query time in four out of six nonrepetitive datasets, but at a cost of higher space usage. FM-Adaptive is 1.98–3.53 times faster than FMI-Hybrid in query time in four of six nonrepetitive datasets. FMI-Hybrid is essentially the same as, and a bit faster than, FM-Adaptive in query time on proteins and DNA, but it uses 15.17 and 4.7 percentage points more space.

For the repetitive datasets 5–8, when minimizing compression ratio, GeCSA gives the best compression on influenza, w.leaders, and kernel, with the exception of FMI-Hybrid on para. FMI-Hybrid shows the best compression ratio on para, but it is 152.77 times slower than GeCSA. FM-Adaptive is 74.33, 58.03, 6.39, and 1.90 times faster than FMI-Hybrid, with comparable compression, on w.leaders, para, kernel, and influenza, respectively. Speedwise, when minimizing query time, GeCSA shows the fastest query time on para, w.leaders, and kernel, about 2.45–9.93 times faster than the other three compared indexing methods, with the exception of RL-CSA on influenza; however, RL-CSA uses 15.06 percentage points more space than GeCSA. FM-Adaptive is 23.67, 16.38, 7.18, and 1.76 times faster than FMI-Hybrid, with comparable compression, on w.leaders, para, kernel, and influenza, respectively. AF-Index generally uses substantially more space on the tested data than the other compared indexing methods.

Memoization is a main reason for the improved locate query time of FM-Adaptive. Another reason is the different suffix array sampling. As can be seen in Figure 4, FM-Adaptive provides query performance faster, often by a considerable amount, than the other compared FM-index methods on the tested data sets, except for DNA, proteins, and influenza, for which memoization barely works according to Figure 3. At the same time, FM-Adaptive provides comparable or better compression.

Figure 5: Performance comparison of five indexing methods on the 𝑒𝑥𝑡𝑟𝑎𝑐𝑡 query. The measures used are compression ratio in percentage and average time T per query in microseconds on the 10 data sets described in Section 5.1.

5.3.2 Substring Extract Query

The extract query uses the same index as the locate query, so it has the same space usage as the locate query. In this section we focus upon its time performance. Figure 5 shows the compression ratio and time for the extract query for the compared indexing methods.

Figure 5 shows that RL-CSA, GeCSA, AF-Index, and FM-Adaptive give the best extract time on four, three, two, and one of the 10 tested datasets in Table 3, respectively. As can be seen in Figure 5, GeCSA shows good extract performance on repetitive data. GeCSA uses less space and is generally many times faster than the other indexing methods, with the exception of AF-Index on para. AF-Index is 1.44 times faster than GeCSA on para but uses 35.19 percentage points more space. The query time of FM-family indexing methods can be affected by the alphabet size σ of the input text. When σ is small, as for DNA, hg38, and NA12877R10, they have better or competitive times. FM-Adaptive is faster than FMI-Hybrid in query time in eight out of 10 datasets, the exceptions being proteins and english. FMI-Hybrid is 1.05 and 1.4 times faster than FM-Adaptive on proteins and english, respectively, but uses 15.17 and 5.59 percentage points more space.

6 Conclusion

We propose improvements via a new compressed representation of the wavelet tree of the Burrows-Wheeler transform of the input text, which incorporates the gap γ-encoding. Our new index achieves asymptotic space optimality within a factor of 2 in the leading term, yet in practice it achieves better compression and faster retrieval than the competing optimal compression boosting used in previous FM-indexes. We present a practical improved locate algorithm that provides substantially faster locating time.

An interesting open problem is to improve our high-order entropy-compressed text self-index to achieve optimal space and query time both in theory and in practice. Another interesting goal is to adapt our index to compress and index the human genome and reads for the analysis of ultra-high-throughput next-generation sequencing (NGS) data, while supporting approximate matching queries.

References

  • [1] Christina Boucher, Davide Cenzato, Zsuzsanna Lipták, Massimiliano Rossi, and Marinella Sciortino. r-indexing the eBWT. Information and Computation, 298:105155, 2024. doi:10.1016/J.IC.2024.105155.
  • [2] Nathaniel K. Brown, Travis Gagie, Giovanni Manzini, Gonzalo Navarro, and Marinella Sciortino. Faster run-length compressed suffix arrays. In Alessio Conte, Andrea Marin, Giovanna Rosone, and Jeffrey S. Vitter, editors, From Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi’s 60th Birthday. OASIcs, Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2025.
  • [3] M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm, 1994.
  • [4] Stefan Canzar and Steven L. Salzberg. Short read mapping: An algorithmic tour. Proceedings of the IEEE, 105(3):436–458, 2017. doi:10.1109/JPROC.2015.2455551.
  • [5] Xiaoyang Chen, Hongwei Huo, Jun Huan, Jeffrey Scott Vitter, Weiguo Zheng, and Lei Zou. MSQ-Index: A succinct index for fast graph similarity search. IEEE Transactions on Knowledge and Data Engineering, 33(6):2654–2668, 2021. doi:10.1109/TKDE.2019.2954527.
  • [6] David Richard Clark. Compact PAT trees. PhD thesis, University of Waterloo, Waterloo, Canada, 1996.
  • [7] Peter Elias. Efficient storage and retrieval by content and address of static files. J. Assoc. Comput. Mach., 21:246–260, 1974. doi:10.1145/321812.321820.
  • [8] Peter Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, 21(2):194–203, 1975. doi:10.1109/TIT.1975.1055349.
  • [9] Robert M. Fano. On the number of bits required to implement an associative memory. Computer Structures Group, MIT, Cambridge, MA, 1971.
  • [10] Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In Proceedings of the 41st Annual IEEE Symposium on Foundations of Computer Science (FOCS’00), pages 390–398, 2000. doi:10.1109/SFCS.2000.892127.
  • [11] Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM, 52(4):552–581, 2005. doi:10.1145/1082036.1082039.
  • [12] Paolo Ferragina, Giovanni Manzini, Veli Mäkinen, and Gonzalo Navarro. An alphabet-friendly FM-index. In Proceedings of the 11th International Symposium on String Processing and Information Retrieval (SPIRE’04), pages 150–160, 2004. doi:10.1007/978-3-540-30213-1_23.
  • [13] Paolo Ferragina, Giovanni Manzini, Veli Mäkinen, and Gonzalo Navarro. Compressed representations of sequences and full-text indexes. ACM Transactions on Algorithms, 3(2):Article 20, 2007.
  • [14] Paolo Ferragina, Giovanni Manzini, Veli Mäkinen, and Gonzalo Navarro. The myriad virtues of Wavelet Trees. Information and Computation, 207(8):849–866, 2009. doi:10.1016/J.IC.2008.12.010.
  • [15] Luca Foschini, Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. When indexing equals compression: Experiments with compressing suffix arrays and applications. ACM Transactions on Algorithms, 2(4):611–639, 2006. Shorter versions appear in Proceedings of the 15th Annual SIAM/ACM Symposium on Discrete Algorithms (SODA ’04), New Orleans, LA, January 2004, 636–645, and in “Fast compression with a static model in high-order entropy,” Proceedings of IEEE Data Compression Conference, Snowbird, Utah, March 2004, 62–71. doi:10.1145/1198513.1198521.
  • [16] Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. Journal of the ACM, 67(1):Article 2, 2020.
  • [17] Raffaele Giancarlo, Giovanni Manzini, Antonio Restivo, Giovanna Rosone, and Marinella Sciortino. A new class of string transformations for compressed text indexing. Information and Computation, 294:105068, 2022. doi:10.1016/J.IC.2023.105068.
  • [18] Simon Gog, Juha Kärkkäinen, Dominik Kempa, Matthias Petri, and Simon J. Puglisi. Fixed block compression boosting in FM-indexes: Theory and practice. Algorithmica, 81:1370–1391, 2019. doi:10.1007/S00453-018-0475-9.
  • [19] Simon Gog, Gonzalo Navarro, and Matthias Petri. Improved and extended locating functionality on compressed suffix arrays. Journal of Discrete Algorithms, 32:53–63, 2015. doi:10.1016/J.JDA.2015.01.006.
  • [20] Simon Gog and Matthias Petri. Optimized succinct data structures for massive data. Software: Practice and Experience, 44(11):1287–1314, 2014. doi:10.1002/SPE.2198.
  • [21] Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. High-order entropy-compressed text indexes. In Proceedings of the 14th ACM-SIAM Symposium on Discrete Algorithms (SODA’03), pages 841–850, 2003. URL: http://dl.acm.org/citation.cfm?id=644108.644250.
  • [22] Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In Proceedings of the 32nd ACM Symposium on Theory of Computing (STOC’00), pages 397–406, 2000.
  • [23] Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing, 35(2):378–407, 2005. doi:10.1137/S0097539702402354.
  • [24] Roberto Grossi, Jeffrey Scott Vitter, and Bojian Xu. Wavelet trees: From theory to practice. In Proceedings of the 1st International Conference on Data Compression, Communications and Processing, pages 210–221, 2011. doi:10.1109/CCP.2011.16.
  • [25] Wing-Kai Hon, Rahul Shah, and Jeffrey Scott Vitter. Compression, indexing, and retrieval for massive string data. In Proceedings of the 21st annual conference on Combinatorial pattern matching (CPM’10), pages 260–274, 2010. doi:10.1007/978-3-642-13509-5_24.
  • [26] David A. Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40(9):1098–1101, 1952.
  • [27] Hongwei Huo, Longgang Chen, Jeffrey Scott Vitter, and Yakov Nekrich. A practical implementation of compressed suffix arrays with applications to self-indexing. In Proceedings of the Data Compression Conference (DCC’14), pages 292–301, 2014. doi:10.1109/DCC.2014.49.
  • [28] Hongwei Huo, Longgang Chen, Heng Zhao, Jeffrey Scott Vitter, Yakov Nekrich, and Qiang Yu. A Data-aware FM-index. In Proceedings of the 17th Workshop on Algorithm Engineering and Experiments (ALENEX’15), pages 10–23, 2015. doi:10.1137/1.9781611973754.2.
  • [29] Hongwei Huo, Xiaoyang Chen, Xu Guo, and Jeffrey Scott Vitter. Efficient compression and indexing for highly repetitive DNA sequence collections. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 18(6):2394–2408, 2021. doi:10.1109/TCBB.2020.2968323.
  • [30] Hongwei Huo, Zongtao He, Pengfei Liu, and Jeffrey Scott Vitter. FM-Adaptive: A data-aware FM-index [source code], January 2022. doi:10.24433/CO.7967727.v1.
  • [31] Hongwei Huo, Pengfei Liu, Chenhui Wang, Hongbo Jiang, and Jeffrey Scott Vitter. CIndex: Compressed indexes for fast retrieval of FASTQ files. Bioinformatics, 38(2):335–343, 2022. doi:10.1093/BIOINFORMATICS/BTAB655.
  • [32] Hongwei Huo, Peng Long, and Jeffrey Scott Vitter. Practical high-order entropy-compressed text self-indexing. IEEE Transactions on Knowledge and Data Engineering, 35(3):2943–2960, 2023. doi:10.1109/TKDE.2021.3114401.
  • [33] Hongwei Huo, Zhigang Sun, Shuangjiang Li, Jeffrey Scott Vitter, et al. CS2A: A compressed suffix array-based method for short read alignment. In Proceedings of the Data Compression Conference (DCC’16), pages 271–278, 2016.
  • [34] Hongwei Huo, Yongze Yu, Zongtao He, and Jeffrey Scott Vitter. Indexing labeled property multidigraphs in entropy space, with applications. In Proceedings of the 41st IEEE International Conference on Data Engineering (ICDE’25), pages 2478–2492, 2025.
  • [35] Guy Jacobson. Space-efficient static trees and graphs. In Proceedings of the 30th Annual IEEE Symposium on Foundations of Computer Science (FOCS’89), pages 549–554, 1989. doi:10.1109/SFCS.1989.63533.
  • [36] Juha Kärkkäinen and Simon J. Puglisi. Fixed block compression boosting in FM-indexes. In Proceedings of the 18th International Symposium on String Processing and Information Retrieval (SPIRE’11), pages 174–184, 2011. doi:10.1007/978-3-642-24583-1_18.
  • [37] Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi. Hybrid compression of bitvectors for the FM-index. In DCC, pages 302–311, 2014. doi:10.1109/DCC.2014.87.
  • [38] Zhize Li, Jian Li, and Hongwei Huo. Optimal in-place suffix sorting. Information and Computation, 285(Part B):104818, 2022. doi:10.1016/J.IC.2021.104818.
  • [39] Veli Mäkinen and Gonzalo Navarro. Succinct suffix arrays based on run-length encoding. In Proceedings of the 16th Annual Symposium on Combinatorial Pattern Matching (CPM), pages 45–56, 2005. LNCS 3537. doi:10.1007/11496656_5.
  • [40] Veli Mäkinen and Gonzalo Navarro. Implicit compression boosting with applications to self-indexing. In Proceedings of the 14th International Symposium on String Processing and Information Retrieval (SPIRE’07), pages 229–241, 2007. doi:10.1007/978-3-540-75530-2_21.
  • [41] Veli Mäkinen and Gonzalo Navarro. Dynamic entropy-compressed sequences and full-text indexes. ACM Transactions on Algorithms, 4(3):article 32, 2008.
  • [42] Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki. Storage and retrieval of highly repetitive sequence collections. J Comput Biol., 17(3):281–308, 2010. doi:10.1089/CMB.2009.0169.
  • [43] Udi Manber and Gene Myers. Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing, 22(5):935–948, 1993. doi:10.1137/0222058.
  • [44] Giovanni Manzini. An analysis of the Burrows-Wheeler transform. Journal of the ACM, 48(3):407–430, 2001. doi:10.1145/382780.382782.
  • [45] Edward M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, 23(2):262–272, 1976. doi:10.1145/321941.321946.
  • [46] J. Ian Munro, Gonzalo Navarro, and Yakov Nekrich. Fast compressed self-indexes with deterministic linear-time construction. Algorithmica, 82(2):316–337, 2020. A conference version of this paper appeared in Proc. ISAAC 2017. doi:10.1007/S00453-019-00637-X.
  • [47] Gonzalo Navarro. Indexing highly repetitive string collections, part ii: Compressed indexes. ACM Computing Surveys, 54(2):Article 26, 2021.
  • [48] Gonzalo Navarro. Compact Data Structures: A Practical Approach. Cambridge University Press, New York, USA, September 2016.
  • [49] Gonzalo Navarro and Veli Mäkinen. Compressed full-text indexes. ACM Computing Surveys, 39(1):Article 2, 2007.
  • [50] Gonzalo Navarro and Nicola Prezza. Universal compressed text indexing. Theoretical Computer Science, 762:41–50, 2019. doi:10.1016/J.TCS.2018.09.007.
  • [51] Rajeev Raman, Venkatesh Raman, and S. Srinivasa Rao. Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Transactions on Algorithms, 3(4):Article 43, 2007. doi:10.1145/1290672.1290680.
  • [52] Kunihiko Sadakane. Compressed text databases with efficient query algorithms based on the compressed suffix array. In Proceedings of the 11th Symposium on Algorithms and Computation (ISAAC’00), pages 410–421, 2000. doi:10.1007/3-540-40996-3_35.
  • [53] Kunihiko Sadakane. New text indexing functionalities of the compressed suffix arrays. Journal of Algorithms, 48(2):294–313, 2003. doi:10.1016/S0196-6774(03)00087-7.
  • [54] Esko Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249–260, 1995. doi:10.1007/BF01206331.

Appendix A Algorithms

Algorithm 3 compress().
ℬ denotes a node bit string of the wavelet tree.
Algorithm 4 GapG1rank(𝒮,𝑜𝑓𝑓𝑠𝑒𝑡).
Algorithm 5 locate(𝒫,l,r).

Appendix B Count Query

Let 𝒫[0, m−1] be a pattern string of m symbols, where each symbol in 𝒫 and 𝒯 belongs to a fixed alphabet Σ of size σ ≤ n. An occurrence of the pattern at position i of 𝒯 means that the substring 𝒯[i, i+m−1] is equal to 𝒫. The count algorithm given below reports the range [l, r] of the first and the last occurrences of 𝒫 in 𝑆𝐴, in which all the suffixes are prefixed by 𝒫, using 𝒞 and the 𝙲𝚆𝚃.

Algorithm 6 count(𝒫).

During the search for 𝒫, the algorithm maintains the following invariant: let 𝑆𝐴[l, r] denote the range of suffixes with prefix 𝒫[i, m−1]. If L[k] and L[k′] are the first and the last occurrences of symbol 𝒫[i−1] in the range L[l, r], then 𝑆𝐴[𝐿𝐹(k), 𝐿𝐹(k′)] is the range of suffixes with prefix 𝒫[i−1, m−1]. The basic operation 𝐿𝐹 takes 𝒪(log σ) time on the compressed wavelet tree for b = 𝒪(log n) and w = (log n)/2, by Lemma 2. As a count query does two 𝐿𝐹 operations per symbol of 𝒫, it runs in 𝒪(m log σ) time. (Some more complicated implementations of wavelet trees, such as that in [13], make use of multiway branching to shorten the height of the tree and reduce traversal time, but we use the simpler binary branching, which works well in practice.)
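A C++ sketch of the backward search follows (illustrative interfaces; here occ(c, i) counts the occurrences of c in L[0, i−1], i.e., it is exclusive):

```cpp
#include <array>
#include <cstdint>
#include <string>
#include <utility>

// Assumed interfaces to the index (illustrative declarations):
std::uint64_t occ(unsigned char c, std::uint64_t i);  // # of c in L[0, i-1], a CWT rank query
extern const std::array<std::uint64_t, 256> Cnt;      // C(c): # symbols in T smaller than c
extern const std::uint64_t n;                         // text length

// Backward search: maintain the SA range [l, r] of suffixes prefixed by
// P[i, m-1], extending the prefix one symbol to the left per iteration
// with two occ (i.e., rank) queries. Returns {l, r}; the number of
// occurrences is r - l + 1 (none if r < l).
std::pair<std::int64_t, std::int64_t> count(const std::string& P) {
    std::int64_t l = 0, r = static_cast<std::int64_t>(n) - 1;
    for (std::int64_t i = static_cast<std::int64_t>(P.size()) - 1;
         i >= 0 && l <= r; i--) {
        const unsigned char c = static_cast<unsigned char>(P[i]);
        l = static_cast<std::int64_t>(Cnt[c] + occ(c, static_cast<std::uint64_t>(l)));
        r = static_cast<std::int64_t>(Cnt[c] + occ(c, static_cast<std::uint64_t>(r) + 1)) - 1;
    }
    return {l, r};
}
```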

Theorem 5.

Let 𝒫 be a query pattern of length m. We can answer a count query on 𝒫 using the 𝙲𝚆𝚃 of the 𝙱𝚆𝚃 L of 𝒯 in 𝒪(m log σ) time using 2nℋ_k(𝒯) + o(n log σ) bits of space, for any k such that k ≤ α log_σ n − 1 and any constant 0 < α < 1, where ℋ_k(𝒯) denotes the kth-order empirical entropy of 𝒯.