FM-Adaptive: A Practical Data-Aware FM-Index
Abstract
The FM-index provides an important solution for efficient retrieval and search in textual big data. Its variants have been widely used in many fields including information retrieval, genome analysis, and web searching. In this paper, we propose improvements via a new compressed representation of the wavelet tree of the Burrows-Wheeler transform of the input text, which incorporates the gap $\gamma$-encoding. Our theoretical analysis shows that the new index, called FM-Adaptive, achieves asymptotic space optimality within a factor of 2 in the leading term, yet it has better compression and faster retrieval in practice than the competitive optimal compression boosting used in previous FM-indexes. We present a practical improved locate algorithm that provides substantially faster locating time based upon memoization, which takes advantage of the overlapping-subproblems property. We design a lookup table for accelerated decoding to support fast pattern matching in a text. Extensive experiments demonstrate that FM-Adaptive provides faster query performance, often by a considerable amount, and/or comparable or better compression than other state-of-the-art FM-index methods.
Keywords and phrases: Text indexing, Burrows-Wheeler transform, Compressed wavelet trees, Entropy-compressed, Compressed data structures
2012 ACM Subject Classification: Information systems → Information retrieval; Theory of computation → Design and analysis of algorithms; Theory of computation → Data structures design and analysis; Theory of computation → Data compression; Theory of computation → Pattern matching
Acknowledgements: We would like to thank Simon Gog for sharing code.
Funding: This work was supported in part by the National Natural Science Foundation of China under Grant No. 62272358.
Editors: Paolo Ferragina, Travis Gagie, and Gonzalo Navarro
Series and Publisher: Open Access Series in Informatics (OASIcs), Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
Massive data sets are being produced at unprecedented rates from sources like IoT, ultra-high-throughput next-generation sequencing, autonomous driving, the digital universe, and social networks. A large part of the data consists of text in the form of a sequence of symbols representing not only natural language, but also multimedia streams, biological sequences, and myriad forms of other media. A full-text index is a data structure that stores a text string in preprocessed form so that it can support fast string matching queries. The best-known full-text indexes are the suffix tree [45, 54] and the suffix array [43], which support pattern matching queries in optimal or almost-optimal time. However, for a text of $n$ symbols drawn from an alphabet of size $\sigma$, these data structures require $O(n\log n)$ bits in the standard unit-cost RAM model, which is larger than the $n\log\sigma$ bits of the input text by a multiplicative factor of $\Theta(\log_\sigma n)$. (All logarithms in this paper that do not have an explicit base listed are in base 2.) For example, using suffix trees and suffix arrays, full-text indexing requires approximately 36GB of memory in the most optimized implementations for short read mapping for a mammalian genome [4]. Thus indexing a text in a space-efficient way while supporting efficient pattern matching queries is a challenging problem.
The field of compressed or succinct data structures attempts to build data structures whose space is provably close to the size of the data in compressed format and that still provide fast query functionality. Theoretical breakthroughs in the late 1990s led to the development of a new generation of space-efficient indexes. In particular, the compressed suffix array (CSA) [22, 52, 23, 53] and the FM-index [10, 21, 11] (based upon the Burrows-Wheeler transform (BWT) [3, 44]) provide the fundamentals of how to work with text efficiently in compressed format. Much subsequent work has focused on making compressed indexes fast and space-efficient in practice in order to handle a variety of big data applications, such as sequencing data consisting of billions of short reads.
1.1 Related Work
The FM-index [10, 11, 13, 12, 15, 24, 18, 29, 31] and the compressed suffix array (CSA) [22, 21, 23, 52, 53, 15, 42, 32, 19, 17] are space-efficient text indexes whose query times are proportional to the query pattern size plus the product of the output size and a small polylogarithmic function of $n$. The former maintains a succinct representation of the BWT, and the latter maintains a succinct representation of the neighbor function $\Phi$ [22, 23]. Both are self-indexes in that they represent the original text, which can thus be discarded.
Grossi and Vitter [22, 23] and Sadakane [52] introduced the compressed suffix array and the neighbor function $\Phi$. Ferragina and Manzini [10, 11] designed the original entropy-compressed FM-index based upon the Burrows-Wheeler transform (BWT) [3, 44] and the mapping function $LF$. The mapping function $LF$ and the neighbor function $\Phi$ are inverses of one another.
Grossi, Gupta, and Vitter [21] introduced the elegant data structure known as the wavelet tree, which has since become ubiquitous in text indexing. Using the wavelet tree, they were the first to establish a self-index that provably achieves the asymptotically optimal space bound with leading coefficient 1 (i.e., $nH_k(T)$ bits, where $H_k(T)$ is the $k$th-order entropy of $T$, as defined subsequently in Definition 2). Their analysis of the wavelet tree also applied to the FM-index and was the first to show asymptotic optimality as well for the FM-index [21, 12, 13, 40, 41]. The original space bound derived for the BWT had a leading term of $5nH_k(T)$ [10], which was improved to the asymptotically optimal $nH_k(T)$ [21, 12] using wavelet trees.
Subsequent to the early theoretical breakthroughs [22, 10, 21, 11, 23], there have been numerous improvements. Simpler implementations of the CSA and the FM-index achieve high-order compression without explicit partitioning into separate $k$-contexts, thus using a single wavelet tree for the entire text [15, 40, 41, 14]. (Here "$k$-context" denotes the length-$k$ prefix of the suffix $T[SA[i], n]$, where $SA$ is the suffix array of $T$.) We refer the reader to the nice survey of Navarro and Mäkinen [49], Navarro's book [48], and other references in the literature [40, 41, 18, 25, 20, 37, 28, 27, 33, 50, 46, 16, 47, 5, 38, 31, 17, 32, 1, 34, 2, 42].
For example, Foschini et al. [15] encoded the single wavelet tree of the BWT sequence using run-length encoding with an Elias code for the runs; if the $k$-contexts are hypothetically overlaid onto the BWT sequence, the encoding of each run length adapts implicitly to the frequency statistics of the current $k$-context(s), thus achieving zero-order compression per context, and thus, by Definition 2 of $k$th-order entropy, they encode the overall node in $k$th-order entropy space. We use that same idea in this paper. Mäkinen and Navarro [40, 41] did a careful analysis to show the surprising result that applying RRR [51] to a single wavelet tree of the entire BWT sequence without any partitioning achieves the $nH_k(T)$ leading space term, for a similar reason as in [15]; the subblocks formed by [51] implicitly encode each $k$th-order context in roughly zero-order entropy space. Interesting experiments on the practical performance of these RLE and RRR approaches and related methods appear in [24]. Gog et al. [18] used a fixed-block boosting technique [36] and the RRR method [51] to implement the FM-index in asymptotically optimal entropy-compressed space with a small extra space cost. Mäkinen et al. [42] proposed a run-length-encoded CSA-based index for highly repetitive data collections.
These references also mention many practical applications. An example is a special-purpose BWT-based compressed index for biological FASTQ data that exploits specific characteristics of next-generation sequencing data [31].
1.2 Our Results
Developing space-efficient entropy-compressed self-indexes that achieve fast query performance both in theory and practice has been a challenging problem. In this paper, we propose FM-Adaptive, a fast data-aware FM-index applicable to a wide range of text strings with different alphabet sizes. The key is a new representation of the wavelet tree of the BWT, with new tradeoffs between space occupancy and search time. We propose several auxiliary data structures to support fast access to the BWT.
Using the gap code instead of the Elias-Fano (EF) code [9, 7] used in [31] for DNA sequences, we deduce a new space bound for general texts, whereas the space bound derived using the EF code in [31] holds only when the entropy is small. In addition, the compressed index in [31] is geared towards FASTQ data, which have a specific format, and its queries are not identical to the queries on the general-purpose compressed indexes in this paper. Our index has an additional efficiency in that it allows the gap code and the run-length code to share the same lookup table, as in [28], even though the table entries for the gap code and for the run-length code, which we introduce in Section 3.1, have different meanings; this avoids the need to store the separate lookup table for the Elias-Fano code [31].
Ferragina, Giancarlo, and Manzini [14] discussed the run-length code and the gap code for implementations of the wavelet tree. For a bit string $B$ of length $n$, they bound the space of the run-length code (their Lemma 3.3) and of the gap code (their Lemma 3.4), in both cases with additive second-order terms. Instead, we can represent $B$ using the run-length code within the bounds of [24, 28] and, using the gap code, in at most $2nH_0(B)$ bits (Lemma 3), which avoids any second-order terms.
Ferragina et al. [14] addressed the problem of compressing the BWT, but not the indexing mechanics of how to provide fast random access to and decoding of the compressed representation. In this paper, we address compression as well as the problem of how to index the data. We design several efficient auxiliary data structures and a lookup table for fast access and decoding to support efficient pattern matching in a text. We also present a new locating algorithm that substantially improves the locating time in practice.
For an input text $T$ of $n$ symbols, we summarize below the novel contributions we make in this paper:
1. We present a new compressed representation for the wavelet tree of the BWT, called FM-Adaptive, that incorporates the gap $\gamma$-encoding. Using this, we can implement FM-Adaptive in $2nH_k(T) + o(n\log\sigma)$ bits, for any $k \le \alpha\log_\sigma n - 1$ and any fixed constant $0 < \alpha < 1$, where $H_k(T)$ denotes the $k$th-order empirical entropy of $T$. We can construct FM-Adaptive in $O(n\log\sigma)$ time. In addition, for a bit string $B$ of length $n$, we can represent $B$ in at most $2nH_0(B)$ bits using the gap code, which avoids, and thereby improves upon, the second-order terms in the space bounds of [14, 24], where $H_0(B)$ is the zero-order empirical entropy of $B$, defined in Definition 2.
2. We present an improved algorithm that provides substantially faster performance in practice for locating patterns in a text, based upon memoization, which takes advantage of the overlapping-subproblems property.
3. We design a lookup table (see Section 3.4) for accelerating the decoding of the gap $\gamma$-encoded blocks to support fast pattern matching in a text.
4. Given any pattern $P$ of $p$ symbols, using FM-Adaptive, we can count the number of occurrences of $P$ in $T$ in $O(p\log\sigma)$ time, locate all $occ$ occurrences in additional $O(occ \cdot d\log\sigma)$ time for suffix-array sampling step $d$, and retrieve a text substring of length $\ell$ in $O((\ell + d)\log\sigma)$ time.
5. Extensive experiments demonstrate that FM-Adaptive generally provides faster query performance, often by a considerable amount, and comparable or better compression than other state-of-the-art FM-index methods. The source code is available online [30].
1.3 Organization of the Paper
The remainder of the paper is organized as follows. Section 2 presents preliminaries. Section 3 describes the compressed wavelet tree and its space analysis. Section 4 presents the pattern matching queries. Section 5 reports the experimental results, and Section 6 concludes the paper.
2 Preliminaries
In this section, we define some necessary concepts and provide a brief description of the BWT.
2.1 Problem Formalization
Definition 1 (Text indexing problem).
Let $T[1, n]$ be a text string of length $n$ over an alphabet $\Sigma$ of size $\sigma$, and let $P[1, p]$ be a query pattern. The text indexing problem is to represent $T$ in as little space as possible while efficiently supporting the following query operations:
- count$(P)$: returns the number of occurrences of $P$ in $T$.
- locate$(P)$: reports the positions in $T$ where $P$ occurs.
- extract$(i, \ell)$: retrieves the substring $T[i, i+\ell-1]$.
2.2 Empirical Entropy
Definition 2 (empirical entropy).
Let $T$ be a text string of length $n$ over an alphabet $\Sigma$ of size $\sigma$. The zero-order empirical entropy of $T$ is defined as
$$H_0(T) = \sum_{c \in \Sigma} \frac{n_c}{n}\log\frac{n}{n_c},$$
where $n_c$ is the number of occurrences in $T$ of symbol $c$. The $k$th-order empirical entropy [44] of $T$ is defined as
$$H_k(T) = \frac{1}{n}\sum_{w \in \Sigma^k} |w_T|\, H_0(w_T),$$
where $w$ designates a string of length $k$ and $w_T$ denotes the string formed by taking the symbol immediately preceding each occurrence of $w$ in $T$ and concatenating those symbols together.
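As a small worked example of Definition 2 (our own illustration, not one of the experimental datasets), take $T = \texttt{abracadabra}$ with $n = 11$ and symbol counts a:5, b:2, r:2, c:1, d:1:
$$H_0(T) = \tfrac{5}{11}\log\tfrac{11}{5} + 2\cdot\tfrac{2}{11}\log\tfrac{11}{2} + 2\cdot\tfrac{1}{11}\log 11 \approx 0.517 + 0.894 + 0.629 \approx 2.04 \text{ bits per symbol},$$
compared with $\log 5 \approx 2.32$ bits per symbol for a fixed-length code over the five distinct symbols.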
2.3 The Burrows-Wheeler Transform and FM-index
Let $T[1, n]$ be a text string consisting of $n$ symbols from alphabet $\Sigma$ of size $\sigma$, and let $SA$ denote the suffix array of $T$. The BWT of $T$ is the string $L$ defined as $L[i] = T[SA[i]-1]$. As shown in Table 1, the BWT of $T$ is the string formed by sorting the suffixes of $T$ in lexicographical order, choosing the preceding symbol for each suffix, and concatenating those preceding symbols. (When the suffix is the entire string $T$, i.e., $SA[i] = 1$, we use the last symbol $T[n]$ as its preceding symbol.)
The Burrows-Wheeler transform is invertible. We can retrieve $T$ from its BWT $L$ by backward search [10] using the mapping function $LF$: the value $LF(i)$ is the lexicographical rank of the suffix $T[SA[i]-1, n]$, that is, $SA[LF(i)] = SA[i] - 1$; equivalently, $F[LF(i)] = L[i]$, where $F$ is the first column of the conceptually sorted suffixes.
We can compute $LF(i)$ by $LF(i) = C[c] + \mathrm{rank}_c(L, i)$, where $c = L[i]$, $\mathrm{rank}_c(L, i)$ is the number of occurrences of symbol $c$ in $L[1, i]$, and $C[c]$ is the number of occurrences in $T$ of symbols that are lexicographically smaller than $c$. Table 1 shows the $SA$ and BWT of an example string $T$, in which for each $i$ we show the first four symbols (starting with $T[SA[i]]$) and the last symbol (namely, $L[i]$) of the conceptual suffixes of $T$ in lexicographical order. The mapping function $LF$ and the neighbor function $\Phi$ are inverses of one another: $\Phi(LF(i)) = i$, as shown in Table 1.
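To make the $LF$ computation concrete, the following minimal C++ sketch builds $L$ and $C$ from a given suffix array and applies $LF$ with a naive linear-scan rank; the function names are our own, and the compressed rank structures of Section 3 replace the scans in the actual index.

```cpp
#include <array>
#include <string>
#include <vector>

// BWT from a 0-based suffix array: L[i] is the symbol preceding suffix SA[i],
// wrapping around so that the suffix starting at position 0 maps to T[n-1].
std::string buildBWT(const std::string& T, const std::vector<size_t>& SA) {
    const size_t n = T.size();
    std::string L(n, '\0');
    for (size_t i = 0; i < n; ++i)
        L[i] = T[(SA[i] + n - 1) % n];
    return L;
}

// C[c] = number of symbols in T (equivalently, in L) smaller than c.
std::array<long, 256> buildC(const std::string& L) {
    std::array<long, 256> cnt{}, C{};
    for (unsigned char c : L) ++cnt[c];
    long sum = 0;
    for (int c = 0; c < 256; ++c) { C[c] = sum; sum += cnt[c]; }
    return C;
}

// LF(i) = C[c] + rank_c(L, i), with rank done by a naive O(n) scan here;
// the compressed wavelet tree of Section 3 performs it in O(log sigma).
long lf(const std::string& L, const std::array<long, 256>& C, long i) {
    unsigned char c = L[i];
    long rank = 0;                       // occurrences of c in L[0..i]
    for (long j = 0; j <= i; ++j)
        if ((unsigned char)L[j] == c) ++rank;
    return C[c] + rank - 1;              // 0-based row of the preceding suffix
}
```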
2.4 The Wavelet Tree
Let $S$ denote a text string of length $n$ over an alphabet $\Sigma$. We define the wavelet tree (WT) [21] of $S$, with leaves labelled by the symbols of the alphabet, as follows: for each node $u$ in the WT, let $\Sigma_u$ be the subset of symbols of $\Sigma$ in the subtree rooted at $u$, and let $S_u$ be the subsequence of $S$ consisting of all the symbols of $\Sigma_u$. For each internal node $u$ in the WT, let $B_u$ be a bit string of the same length as $S_u$, defined by $B_u[i] = 0$ if $S_u[i]$ is in the left subtree of $u$ and $B_u[i] = 1$ if $S_u[i]$ is in the right subtree of $u$. Such a bit string representation occurs recursively at each internal node; the subsequence at an internal node consists of the symbols dispatched from the parent node, with their order in the text string preserved. The collective size of the bit strings at any given level of the tree is bounded by $n$. With proper encoding, the wavelet tree representation of $S$ achieves the zero-order entropy space bound [21].
The powerful wavelet tree data structure [21, 15, 24] reduces the problem of compressing a string over a finite alphabet to the problem of compressing a set of bit (i.e., binary) strings. It is an elegant and versatile data structure that allows efficient access, rank, and select queries on a text string $S$ of $n$ symbols from a multisymbol alphabet:
- access$(S, i)$: returns the symbol $S[i]$.
- rank$_c(S, i)$: returns the number of occurrences of symbol $c$ in $S[1, i]$, for any $c \in \Sigma$ and $1 \le i \le n$.
- select$_c(S, j)$: returns the position in $S$ of the $j$th occurrence of symbol $c$, for any $c \in \Sigma$.
A key result of Grossi et al. [21] is that if each bit string of the internal nodes of the wavelet tree is compressed to zero-order entropy, then the cumulative encoding is a zero-order entropy encoding of $S$:
Lemma 1 ([21]).
$\sum_{i=1}^{m} |B_{u_i}|\, H_0(B_{u_i}) = n H_0(S)$, where $u_1, \ldots, u_m$ are the internal nodes of the wavelet tree and $m$ denotes the number of internal nodes.
With the extra cost for auxiliary data structures, the total space occupied by a wavelet tree is $nH_0(S) + o(n\log\sigma)$ bits of space [21, 15, 24].
We use a single wavelet tree to represent the BWT, as first done in Foschini et al. [15]. Mäkinen and Navarro [39] used a single wavelet tree to represent the string of run heads of the run-length encoding of the FM-index. Figure 1 shows the balanced wavelet tree for the example BWT $L$ = tcacaattttcatttgtgaattaatagaaag#ataa from Table 1. There are various ways [15] to form the wavelet tree, such as using a Huffman criterion [26]. Each leaf of the wavelet tree corresponds to a distinct symbol. The bit string column in Table 1 is the root bit string of the wavelet tree in Figure 1. A value of 0 (resp., 1) in the root bit string means that the corresponding symbol of the BWT lies in the left (resp., right) subtree. The bit strings for the roots of the subtrees are defined recursively.
Using the wavelet tree, we can turn each $\mathrm{rank}_c$ computation on the BWT into $O(\log\sigma)$ $\mathrm{rank}_1$ computations on the node bit strings along one root-to-leaf path, as the sketch below illustrates.
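The following C++ sketch (our own simplified illustration, not the FM-Adaptive source) shows the recursive dispatch for a balanced, pointer-based, uncompressed wavelet tree over a byte alphabet; each query walks one root-to-leaf path of $O(\log\sigma)$ levels.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct WTNode {
    std::vector<bool> bits;     // 0 = symbol goes left, 1 = symbol goes right
    WTNode *left = nullptr, *right = nullptr;
};

// Build a balanced wavelet tree over the symbol range [lo, hi].
WTNode* build(const std::string& s, uint8_t lo, uint8_t hi) {
    if (s.empty() || lo == hi) return nullptr;   // leaf: single symbol
    uint8_t mid = lo + (hi - lo) / 2;            // symbols <= mid go left
    WTNode* node = new WTNode;
    std::string l, r;
    for (unsigned char c : s) {
        bool b = c > mid;
        node->bits.push_back(b);
        (b ? r : l).push_back(c);
    }
    node->left = build(l, lo, mid);
    node->right = build(r, mid + 1, hi);
    return node;
}

// rank_c(s, i): occurrences of c in s[0..i-1], via one root-to-leaf walk.
size_t rankWT(const WTNode* node, uint8_t lo, uint8_t hi, uint8_t c, size_t i) {
    if (!node || i == 0) return i;               // at a leaf all symbols equal c
    uint8_t mid = lo + (hi - lo) / 2;
    size_t ones = 0;
    for (size_t j = 0; j < i; ++j) ones += node->bits[j];  // naive rank1
    if (c <= mid) return rankWT(node->left, lo, mid, c, i - ones);
    return rankWT(node->right, mid + 1, hi, c, ones);
}
```

In the actual index, each node bit string is stored in the compressed, block-partitioned form of Section 3, so the per-level $\mathrm{rank}_1$ takes $O(1)$ time instead of a linear scan.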
3 Wavelet Tree Compression
In this section, we introduce a new compressed representation for the bit strings of wavelet trees, which incorporates the gap $\gamma$-encoding for general text strings. In Sections 3–5, we show that this representation provides excellent query performance both in theory and in practice.
3.1 Encoding Methods
We let $B[1, n]$ denote a bit string of length $n$. Let $m$ be the number of occurrences of the least frequent bit in $B$, so that $m \le n/2$. Let $p_1 < p_2 < \cdots < p_m$ denote the increasing sequence of positions of the least frequent bit in $B$. We assume for convenience $p_0 = 0$. Let $g_i = p_i - p_{i-1}$, for $1 \le i \le m$, denote the gap sequence of pairwise differences of neighboring values of the position sequence of the least frequent bit in $B$. We represent each gap $g_i$ using the Elias $\gamma$ code [8]. The resulting gap $\gamma$-encoding of $B$ is the bit string $\mathit{gap}(B) = \gamma(g_1)\gamma(g_2)\cdots\gamma(g_m)$. Obviously, $\sum_{i=1}^{m} g_i = p_m \le n$.
On the other hand, we can view $B$ as a sequence of $r$ maximal runs of identical bits with lengths $\ell_1, \ell_2, \ldots, \ell_r$, where $\ell_j \ge 1$, $\sum_{j=1}^{r} \ell_j = n$, and the bits of consecutive runs alternate. We can represent the length of each run using the Elias $\gamma$ code [8]. The resulting run-length encoding of $B$ is the bit string $\mathit{rle}(B) = \gamma(\ell_1)\gamma(\ell_2)\cdots\gamma(\ell_r)$.
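As an illustration of these two codes, here is a minimal C++ sketch (helper names are ours) that Elias $\gamma$-codes a gap sequence into a bit vector; the run-length encoder is identical except that it codes the run lengths $\ell_j$ instead of the gaps $g_i$.

```cpp
#include <cstdint>
#include <vector>

// Append the Elias gamma code of x >= 1: floor(log2 x) zeros,
// then the binary representation of x (which starts with a 1 bit).
void gammaEncode(std::vector<bool>& out, uint64_t x) {
    int len = 63 - __builtin_clzll(x);   // floor(log2 x); GCC/Clang builtin
    for (int i = 0; i < len; ++i) out.push_back(false);
    for (int i = len; i >= 0; --i) out.push_back((x >> i) & 1);
}

// gap(B): gamma-code the differences g_i = p_i - p_{i-1} of the (1-based)
// positions of the least frequent bit, with the convention p_0 = 0.
std::vector<bool> gapEncode(const std::vector<uint64_t>& positions) {
    std::vector<bool> out;
    uint64_t prev = 0;
    for (uint64_t p : positions) { gammaEncode(out, p - prev); prev = p; }
    return out;
}
```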
3.2 Structure
The starting point for the structure of the FM-Adaptive algorithm follows from [31], which we incorporate into the following discussion. We start the summary by introducing the compression structure used to represent the bit strings of the nodes of the wavelet tree. We then give an example to show how the newly introduced gap code works, which is the key to improving general-purpose performance.
To enable effective compression of the bit strings of the wavelet tree, we partition the bit strings of the wavelet tree nodes into blocks and categorize the blocks into three basic types:
1. blocks consisting only of all 0s or of all 1s;
2. blocks having relatively long runs of 0s or of 1s; and
3. blocks having a random-like sequence of 0s and 1s.
For each block, we choose one of the following four compression methods to minimize its coding length: All0/All1, Gap0/Gap1, RL0/RL1, and Plain.
Specifically, for a block of type 1 consisting only of all 0s or all 1s, we use the All0/All1 code to encode it; that is, no additional bits are needed to store it. For a block of type 2 having relatively long 0-runs or 1-runs, we use either the Gap0/Gap1 code or the RL0/RL1 code to encode it. Here Gap0/Gap1 means that we apply the gap code (see Section 3.1) to encode the block, and RL0/RL1 means that we apply the run-length code (see Section 3.1) to encode the block; the 0/1 variant records the value of the relevant bit. For a block of type 3 having a random-like sequence of 0s and 1s, we keep the block unchanged, denoted as Plain. This partitioning is combined with the mixed encoding to represent the wavelet tree nodes. The compressed wavelet trees (CWTs) are the ones whose node bit strings are partition-based and mixed-encoded. A sketch of the per-block method selection is given below.
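The following C++ sketch (a minimal variant of ours, not the FM-Adaptive source) shows the per-block selection by comparing the code lengths directly; the real encoder additionally distinguishes the 0/1 variants and emits the chosen code words.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

enum Method : uint8_t { All0, All1, Gap, RL, Plain };

// Bit length of the Elias gamma code of x >= 1.
static size_t gammaLen(uint64_t x) {
    int lg = 63 - __builtin_clzll(x);    // floor(log2 x)
    return 2 * (size_t)lg + 1;
}

// Choose the cheapest method for one block by comparing code lengths only.
Method chooseMethod(const std::vector<bool>& block) {
    const size_t n = block.size();
    size_t ones = 0;
    for (bool b : block) ones += b;
    if (ones == 0) return All0;              // empty code: no bits stored
    if (ones == n) return All1;

    bool least = (2 * ones <= n);            // value of the least frequent bit
    size_t gapLen = 0, prev = 0;             // |gap(block)| with gamma codes
    for (size_t i = 0; i < n; ++i)
        if (block[i] == least) { gapLen += gammaLen(i + 1 - prev); prev = i + 1; }

    size_t rlLen = 0, run = 1;               // |rle(block)|, runs gamma-coded
    for (size_t i = 1; i <= n; ++i) {
        if (i < n && block[i] == block[i - 1]) { ++run; continue; }
        rlLen += gammaLen(run); run = 1;
    }

    size_t best = std::min({gapLen, rlLen, n});
    if (best == gapLen) return Gap;          // Gap0/Gap1 chosen by `least`
    if (best == rlLen) return RL;            // RL0/RL1 chosen by the first bit
    return Plain;                            // keep the raw block
}
```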
Let $B$ denote a bit string of an internal node of a wavelet tree. We use the following three steps to obtain a succinct representation of $B$:
1. Partition $B$ into blocks of size $b$, except possibly the last one.
2. Combine $s$ contiguous blocks to form a superblock of size $s \cdot b$.
3. Apply the mixed encoding to encode each block. We build the encoded sequence $E$ by concatenating the encoded blocks.
We also maintain five extra structures to support fast access to the encoded sequence $E$: $R_s$, $R_b$, $P_s$, $P_b$, and $M$, where $R_s$ stores the number of 1s in $B$ preceding the current superblock; $R_b$ stores the number of 1s in $B$ preceding the current block, relative to the beginning of its enclosing superblock; $P_s$ stores the number of bits in $E$ preceding the current superblock; $P_b$ stores the number of bits in $E$ preceding the current block, relative to the beginning of its enclosing superblock; and $M$ indicates the encoding method used in each block.
Table 2 shows the compression structure for the root bit string 100000111100111111001100101000100100 of the wavelet tree of the BWT for the example string from Table 1, for a given block size $b$ and superblock size $s \cdot b$. The table shows four encodings: All0/All1, Gap0/Gap1, RL0/RL1, and Plain. For purposes of illustrating all four methods, block 4 is encoded there with a method other than the one the algorithm would actually select; in the actual algorithm, the method resulting in the shorter encoding would be chosen for block 4.
$$\mathrm{rank}_1(B, i) \;=\; R_s\big[\lfloor i/(s\cdot b)\rfloor\big] \;+\; R_b\big[\lfloor i/b\rfloor\big] \;+\; \mathrm{decode}\big(\lfloor i/b\rfloor,\; i \bmod b\big) \qquad (1)$$
Using the encoding sequence $E$ of $B$ and the auxiliary structures $R_s$, $R_b$, $P_s$, $P_b$, and $M$, we can compute $\mathrm{rank}_1(B, i)$ by Equation (1). The operation $\mathrm{decode}$ performs All0/All1, Gap0/Gap1, RL0/RL1, or Plain decoding, depending upon $M$, to return the number of 1s up to offset $i \bmod b$ within block $\lfloor i/b \rfloor$. The starting decoding position in $E$ is $P_s[\lfloor i/(s\cdot b)\rfloor] + P_b[\lfloor i/b\rfloor]$. We can access $B[i]$ in a way similar to $\mathrm{rank}_1(B, i)$. A sketch of the two-level computation follows.
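A C++ sketch of Equation (1) under simplifying assumptions (plain, uncompressed blocks, so the in-block decode is just a scan; array names mirror the notation above but are our own):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct RankDir {
    std::vector<bool> bits;        // stand-in for the encoded sequence E
    std::vector<uint64_t> Rs, Rb;  // 1s before superblock / before block
    size_t b, s;                   // block size, blocks per superblock
};

// Build the two rank directories in one left-to-right pass.
void buildDirs(RankDir& d) {
    uint64_t total = 0, inSuper = 0;
    size_t nblk = (d.bits.size() + d.b - 1) / d.b;
    for (size_t blk = 0; blk < nblk; ++blk) {
        if (blk % d.s == 0) { d.Rs.push_back(total); inSuper = 0; }
        d.Rb.push_back(inSuper);
        size_t end = std::min(d.bits.size(), (blk + 1) * d.b);
        for (size_t j = blk * d.b; j < end; ++j)
            if (d.bits[j]) { ++total; ++inSuper; }
    }
}

// rank1(B, i): number of 1s in B[0..i-1], following Equation (1); the
// in-block part is a scan here, but table-driven decoding (Section 3.4)
// makes it O(1) on the encoded blocks of the real index.
uint64_t rank1(const RankDir& d, size_t i) {
    size_t blk = i / d.b;
    uint64_t r = d.Rs[blk / d.s] + d.Rb[blk];
    for (size_t j = blk * d.b; j < i; ++j) r += d.bits[j];
    return r;
}
```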
Lemma 2.
Using the lookup table from Section 3.4, with one entry per possible $t$-bit chunk, we can compute $\mathrm{rank}_1(B, i)$ in $O(1)$ time, for block size $b = O(\log n)$.
3.3 Construction
In this section, we describe the construction of the compressed wavelet tree (CWT). The CWT consists of the compressed bit strings of the wavelet tree nodes. We describe the compression in Algorithm 3 in Appendix A; it produces the encoded sequence $E$ and the five auxiliary structures $R_s$, $R_b$, $P_s$, $P_b$, and $M$. It is simple to see that we can construct the compressed wavelet tree in $O(n\log\sigma)$ time.
Theorem 1.
Given the wavelet tree of the BWT of text $T$, we can construct the CWT in $O(n\log\sigma)$ time.
3.4 Accelerating the Rank Computation
In order to accelerate the rank computation for a gap $\gamma$-encoded block (Gap0/Gap1), we design a lookup table for the gap $\gamma$-encoding, with one entry for every possible chunk of $t$ bits. Using the table, we can process each $t$ bits of a $\gamma$-encoded sequence in constant time. Every entry of the table, corresponding to a bit string of $t$ bits, contains three components:
- The first component stores the total number of decoded entries in the chunk; it is the number of complete decoded gaps.
- The second component stores the cumulative sum of the decoded values (gap values) in the chunk, which corresponds to the largest decoded position.
- The third component stores the total number of decoded bits in the chunk.
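A sketch of how such a table can be precomputed, in our own minimal variant with chunk size $t = 8$ so that the table has 256 entries; each entry records the three components above for one possible chunk:

```cpp
#include <array>
#include <cstdint>

struct ChunkInfo {
    uint8_t entries;  // number of complete gamma codes decoded in the chunk
    uint16_t sum;     // cumulative sum of the decoded gap values
    uint8_t bits;     // bits of the chunk consumed by those complete codes
};

// Precompute the decode table for every possible t-bit chunk (t = 8 here).
std::array<ChunkInfo, 256> buildGammaTable() {
    std::array<ChunkInfo, 256> tab{};
    for (int chunk = 0; chunk < 256; ++chunk) {
        int pos = 0;
        ChunkInfo info{};
        while (true) {
            int zeros = 0;                        // count the unary prefix
            while (pos + zeros < 8 && !((chunk >> (7 - pos - zeros)) & 1))
                ++zeros;
            int len = 2 * zeros + 1;              // total gamma code length
            if (pos + zeros >= 8 || pos + len > 8) break;  // code incomplete
            uint16_t val = 0;                     // read the binary part
            for (int i = 0; i <= zeros; ++i)
                val = (val << 1) | ((chunk >> (7 - pos - zeros - i)) & 1);
            ++info.entries;
            info.sum += val;
            pos += len;
            info.bits = (uint8_t)pos;
        }
        tab[chunk] = info;
    }
    return tab;
}
```

At query time, the rank decoding consumes an encoded block chunk by chunk: whole chunks are charged in $O(1)$ each via the table as long as the cumulative sum stays below the target offset, and only the final chunk is decoded bit by bit.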
3.5 Space Usage
We let $B$ of length $n$ denote the bit string of an internal node of the wavelet tree of the BWT, and we let $m$ be the number of occurrences of the least frequent bit in $B$, so that $m \le n/2$. Applying the gap code to $B$, we get the resulting bit string $\mathit{gap}(B)$, whose space bound is given in Lemma 3; it represents an improved space bound over the ones in [14, 24]:
Lemma 3.
$|\mathit{gap}(B)| \le 2nH_0(B)$.
Proof.
By the definition of $\mathit{gap}(B)$, we have
$$|\mathit{gap}(B)| \;=\; \sum_{i=1}^{m}\big(2\lfloor\log g_i\rfloor + 1\big) \;\le\; 2\sum_{i=1}^{m}\log g_i + m \;\le\; 2m\log\frac{n}{m} + m \;\le\; 2nH_0(B).$$
The first equality is due to the Elias $\gamma$ code [8]. The second inequality is due to Jensen's inequality together with $\sum_{i=1}^{m} g_i \le n$ and $m \le n/2$ (see Section 3.1). The last step is due to Definition 2 of zero-order empirical entropy, which gives $nH_0(B) \ge m\log(n/m) + m/2$ when $m \le n/2$.
We partition $B$ into blocks of size $b$, and represent each block $B_j$ as the gap sequence of positions of the least frequent bit of $B_j$, as described in Section 3.1.
We compress each block $B_j$ by choosing the one of the four compression methods that minimizes its encoding length: All0/All1, Gap0/Gap1, RL0/RL1, and Plain. Specifically, the encoding length of $B_j$ is $\min\{\,|\mathit{all}(B_j)|,\ |\mathit{gap}(B_j)|,\ |\mathit{rle}(B_j)|,\ |B_j|\,\}$, where $|\cdot|$ denotes the bit length of the encoding.
Theorem 2.
Given the BWT of text $T$ of length $n$ over an alphabet of size $\sigma$, we can represent the BWT using the compressed wavelet tree (CWT) in $2nH_k(T) + o(n\log\sigma)$ bits of space, for any $k$ such that $k \le \alpha\log_\sigma n - 1$ and any fixed constant $0 < \alpha < 1$, where $H_k(T)$ denotes the $k$th-order empirical entropy of $T$.
Theorem 2 follows by the analysis of Foschini et al. [15] and Huo et al. [28]. The key idea of [15] was to implicitly consider the $k$-contexts for purposes of analysis. The gap encoding adapts to each $k$-context as it is encountered, with some space inefficiency because gaps can span context boundaries. However, the extra space is matched by the space savings achieved by using a single wavelet tree, which avoids all the space overhead that would be needed if there were an individual wavelet tree for each $k$-context.
Figure 2 shows the bit string $B$ of the root node of the wavelet tree for the BWT of $T$ from Table 1 and Figure 1. The second row is a hypothetical partition of the BWT by $k$-contexts for $k = 2$, and each part is called a context block. Let the $k$-context of $L[i]$ denote the length-$k$ prefix of the suffix $T[SA[i], n]$. For each $k$-context $w$, the context block of $w$ can be formed conceptually by choosing the symbol in $L$ preceding each occurrence of the $k$-context and concatenating those preceding symbols. The third row is a partition of $B$ by contexts, and each part is called a context bit block; the partition of $B$ is the same as the partition of the BWT according to contexts. The fourth row is a partition of $B$ into blocks of fixed length $b$. The red region (see Figure 2) comprises the context bit block corresponding to the context block (a substring of the BWT) for the context at.
Each $k$-context block remains contiguous in the internal nodes of the wavelet tree, as shown in red in Figure 2 for the 2-context at. By adapting Lemmas 3 and 1, the encoding lengths for a given $k$-context $w$ over all the wavelet tree nodes sum up to $2|w_T|H_0(w_T)$ for that context, and by Definition 2 of $k$th-order entropy, summing over all the $k$-contexts yields $2nH_k(T)$. The analysis of [15, 28] considers how and whether the various hypothetical $k$-contexts overlap with the blocks and superblocks, which incurs some extra space cost. The resulting code length is stated in Theorem 2.
4 Pattern Matching
In this section, we use FM-Adaptive to implement three types of string matching queries: count, locate, and substring extract. We defer the count query to Appendix B.
4.1 Locate Query
We sample the suffix array in the same manner as in Huo et al. [32], and we denote by $SA_d$ the sampling, where $d$ is the sampling step size. We use a conceptual bit array $D$ of length $n$ to record the indices corresponding to the sampled values in $SA$. The locate algorithm without memoization is given in Appendix A.
Theorem 3.
Let $d$ be the step size for the suffix array sampling. For the given index range $[sp, ep]$ in $SA$ of pattern $P$, we can answer a locate query and find the $occ$ occurrences of $P$ using the CWT of the BWT of $T$ in $O(occ \cdot d\log\sigma)$ time, using $2nH_k(T) + o(n\log\sigma) + O((n/d)\log n)$ bits of space, for any $k$ such that $k \le \alpha\log_\sigma n - 1$ and any constant $0 < \alpha < 1$, where $H_k(T)$ denotes the $k$th-order empirical entropy of $T$.
Proof.
Basic operations rank and select on the bit array $D$ take constant time. Each computation of $LF$ using the CWT takes $O(\log\sigma)$ time by Lemma 2. We can find a sample in at most $d$ steps, which takes $O(d\log\sigma)$ time per occurrence. Thus the locate algorithm runs in $O(occ \cdot d\log\sigma)$ time, where $occ$ is the number of occurrences of $P$ in $T$.
There are $n/d$ samples of $SA$, and each needs $\log n$ bits, so the space required by the sampled suffix array is $O((n/d)\log n)$ bits. We can store the bit array $D$ in $n + o(n)$ bits and support rank and select on it in $O(1)$ time [35, 6]. The remaining auxiliary structures occupy lower-order space. The space required by the index structure itself is given in Theorem 2.
Summing the space required by the CWT, the sampled suffix array $SA_d$, and the bit array $D$, we obtain the space bound of $2nH_k(T) + o(n\log\sigma) + O((n/d)\log n)$ bits for the locate query.
Now we consider the practical improvement of the locate query. The improvement is based upon memoization, which takes advantage of the overlapping-subproblems property. Assume that we are given the index range $[sp, ep]$ in $SA$ of the suffixes prefixed by $P$. According to our sampling method on the suffix array, for each $i \in [sp, ep]$, we can walk at most $d$ steps on $LF$ to reach a suffix array sample. For any $i, j \in [sp, ep]$ with $i \ne j$, we could also perform the same process. The key point is that the process of finding a suffix array sample for some $j$ may include the process of finding a suffix array sample for some $i$. That is, $i$ and $j$ may share the same suffix sample. If we know the suffix position $SA[i]$ and the number of steps (say, $w$) to walk on $LF$ starting at $j$ to reach $i$, we can obtain the suffix position of $j$ by $SA[j] = SA[i] + w$. This allows us to reduce the number of steps walked on $LF$ and thus speed up the locating process.
We use a structure $Loc$ to keep the suffix positions $SA[i]$ for each $i \in [sp, ep]$. We compute the values of $Loc$ in two ways:
1. by walking on $LF$ using only the sampled suffix array $SA_d$; and
2. by using already-computed values of $Loc$ and the data structures $R$ and $W$ described below.
The array $R$ records, for each row $v$ visited during the walks, the index $i$ such that the walk starting at $i$ reached $v$ for the first time; $W$ is an array of the same length, and $W[v]$ is the number of steps using $LF$ to walk from $i$ to $v$. Thus, if a later walk from $j$ reaches $v$ after $t$ steps, then $SA[j] = Loc[i] - W[v] + t$.
Algorithm 1 gives the pseudocode of the practical improvement of the locate query. We initialize all $Loc$, $R$, and $W$ entries to undefined sentinel values. A simplified sketch of the memoized walk follows.
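The following C++ sketch conveys the memoized walk under our assumptions (a bit vector $D$ marking sampled rows, the sampled values held in a map, and an $LF$ step function); it is a simplified rendering of the idea behind Algorithm 1, not the exact FM-Adaptive code.

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Memoized locate: resolve SA[i] for every row i in [sp, ep].
// lf(v) is one backward step (SA[lf(v)] = SA[v] - 1); D marks sampled rows,
// whose SA values are held in SAd. The map memoizes, for each visited row v,
// the query row that reached v first and the step count needed to get there.
std::vector<int64_t> locate(int64_t sp, int64_t ep,
                            const std::vector<bool>& D,
                            const std::unordered_map<int64_t, int64_t>& SAd,
                            int64_t (*lf)(int64_t)) {
    std::vector<int64_t> Loc(ep - sp + 1, -1);
    std::unordered_map<int64_t, std::pair<int64_t, int64_t>> RW; // v -> (i, W[v])
    for (int64_t j = sp; j <= ep; ++j) {
        int64_t v = j, steps = 0;
        while (true) {
            if (D[v]) {                           // reached a sampled row
                Loc[j - sp] = SAd.at(v) + steps;  // SA[j] = SA[v] + steps
                break;
            }
            auto it = RW.find(v);
            if (it != RW.end() && Loc[it->second.first - sp] >= 0) {
                // An earlier walk from i passed v after W[v] steps, so
                // SA[v] = Loc[i] - W[v]; hence SA[j] = Loc[i] - W[v] + steps.
                Loc[j - sp] = Loc[it->second.first - sp]
                              - it->second.second + steps;
                break;
            }
            if (it == RW.end()) RW[v] = {j, steps};  // first visit: memoize
            v = lf(v);
            ++steps;
        }
    }
    return Loc;
}
```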
4.2 Substring Extract Query
In this section, we consider how to retrieve text substrings using the CWT of the BWT and a sampled inverse suffix array. We sample the inverse suffix array in the same manner as in Huo et al. [32], and we denote by $SA^{-1}_d$ the sampling.
The workflow to retrieve a substring is that we first transform a given text position into its rank in the lexicographical order of the suffixes, using $SA^{-1}_d$, and then retrieve the substring by walking on $LF$. The first step is done by the lookup procedure and the second step by the retrieve procedure. Both procedures are given in Algorithm 2; a simplified sketch follows.
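A compact C++ sketch of this workflow under our assumptions (inverse-SA samples at every $d'$th text position, plus access to $L$ and $LF$); it is a simplification of Algorithm 2, with hypothetical helper names.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// extract(i, len): retrieve T[i .. i+len-1] (0-based) by walking LF backward
// from the inverse-SA sample at the first sampled text position >= i + len.
// Assumptions: ISAd[k] is the row of the suffix starting at text position
// k * dp; bwtAt(v) returns L[v]; lf(v) is one backward step.
std::string extract(int64_t i, int64_t len, int64_t dp,
                    const std::vector<int64_t>& ISAd,
                    char (*bwtAt)(int64_t), int64_t (*lf)(int64_t)) {
    int64_t end = i + len;                   // first position we do not need
    int64_t k = (end + dp - 1) / dp;         // index of next sampled position
    int64_t pos = k * dp;                    // its text position
    int64_t v = ISAd[k];                     // its row in the suffix order
    std::string out;
    while (pos > i) {                        // invariant: L[v] = T[pos - 1]
        --pos;
        char c = bwtAt(v);                   // c = T[pos]
        if (pos < end) out.push_back(c);     // inside the requested window
        v = lf(v);                           // suffix now starts at pos
    }
    std::reverse(out.begin(), out.end());    // symbols were collected backward
    return out;
}
```

The walk has length at most $\ell + d'$, which matches the time bound in Theorem 4 below.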
Theorem 4.
Given position $i$ and length $\ell$, we can answer a substring extract query using the CWT of the BWT of $T$ in $O((\ell + d)\log\sigma)$ time, using $2nH_k(T) + o(n\log\sigma) + O((n/d)\log n)$ bits of space, for any $k$ such that $k \le \alpha\log_\sigma n - 1$ and any constant $0 < \alpha < 1$, where $d$ is the sampling step and $H_k(T)$ denotes the $k$th-order empirical entropy of $T$.
Proof.
The time complexity of the substring extract algorithm is determined by the total running time of the lookup procedure and the retrieve procedure. Each computation of $LF$ using the CWT takes $O(\log\sigma)$ time by Lemma 2, and the maximum walking length of the for loop is $\ell + d$. Consequently, the lookup procedure takes $O(1)$ time, and the retrieve procedure runs in $O((\ell + d)\log\sigma)$ time.
By summing the two parts, we get the running time bound of the algorithm: $O((\ell + d)\log\sigma)$ in the worst case.
The space required by the extract query is the same as that for the locate query, namely $2nH_k(T) + o(n\log\sigma) + O((n/d)\log n)$ bits.
5 Experiments and Analysis
5.1 Experimental Setting, Datasets, and Measure
In this section we describe experiments performed using the environment described in [32]. We used C++ to implement the algorithms and constructed the suffix array using Mori's fast lightweight suffix-sorting library libdivsufsort (github.com/y-256/libdivsufsort) in its 64-bit variant.
Table 3 summarizes some statistical characteristics of the data sets we used for the experiments. The data sets consist of four datasets from the Pizza&Chili corpus (pizzachili.dcc.uchile.cl/texts.html) (datasets 1–4), highly repetitive data sets from the Pizza&Chili repetitive corpus (pizzachili.dcc.uchile.cl/repcorpus.html) (datasets 5–8), the human genome (called hg38 at UCSC) from UCSC (hgdownload.cse.ucsc.edu/goldenPath/hg38) (dataset 9), and NA12877R10 (dataset 10), formed by extracting 10 gigabytes of reads from NA12877_1, downloaded from EBI (www.ebi.ac.uk/ena/browser/view/ERR194146). Datasets 1–4 and 9–10 are nonrepetitive data sets, and datasets 5–8 are highly repetitive data sets. The expression $n/r$ denotes the average run length in the BWT, where $r$ is the number of runs, and $\sigma$ is the alphabet size of the input data. For hg38, we exclude the query on the pattern NN…N.
In our experiments, we examine the following state-of-the-art algorithms for their space usage and query time:
1. AF-Index: the implementation (pizzachili.dcc.uchile.cl/indexes) of the alphabet-friendly FM-index [13, 12], which combines an existing compression boosting technique with the wavelet tree data structure. We used its latest version, af-index_v2.1.
2. FMI-Hybrid: the recent implementation (github.com/simongog/sdsl-lite) of the FM-index [18], which uses the fixed-block boosting technique [36], in which bitvectors by default are implemented by hybrid encoding [37].
3. RL-CSA: the implementation (jltsiren.kapsi.fi/rlcsa) of the compressed suffix array [42] that uses run-length encoding and has been optimized for highly repetitive data. We used its current version of May 2016.
4. GeCSA: the recent implementation (codeocean.com/capsule/3554560/tree/v1) of the compressed suffix array [32], in which a new run-length code is introduced for the gap sequence of $\Phi$ [22, 23].
5. FM-Adaptive: our method described in this paper.
5.2 Improvement of the Locate Query by Memoization
In this section, we make an experimental comparison of the locate query before and after the improvement. The key point of the memoization improvement is to remember the computed suffix positions for some $i \in [sp, ep]$ and then use these computed positions and other auxiliary information to determine the suffix positions of the remaining indices, so as to reduce the number of accesses to $LF$ and thus speed up the locate query.
In the experiments, we perform 500 locate queries and calculate the average number of accesses to $LF$ for each locate query. Figure 3 shows the average number of accesses to $LF$ for each locate query before (in blue) and after (in orange) the improvement by memoization.
As can be seen in Figure 3, the memoization of computed suffix positions generally improves the locate query time substantially. The average number of accesses to $LF$ decreases markedly on english, sources, para, w.leaders, and kernel, as well as on hg38 and NA12877R10, with little or no reduction on proteins, DNA, and influenza, for all tested sampling step sizes.
Figure 3 shows substantial reductions in $LF$ invocations compared with the locate algorithm without memoization shown in Appendix A for most tested data. The memoization did not help on the datasets DNA, proteins, and influenza, because few short patterns occur frequently there, so the condition for reusing a memoized position is rarely satisfied.
To show that the improvement of the locate query is largely due to the memoization rather than to the suffix array sampling method, we performed additional experiments on three locate algorithms on the CWT, which we show in Table 4.
We consider the locate algorithm with position sampling of the suffix array as in [28], the locate algorithm without memoization but with suffix-array value sampling (Algorithm 5 in Appendix A), and the locate algorithm with memoization and suffix-array value sampling (Algorithm 1). We randomly select patterns of a fixed length from the input text and average, for each of the three locate algorithms, the total time over the locate queries, reported in milliseconds in Table 4. We use the same block size $b$ for the three locate algorithms. It can be seen that the locate algorithm with memoization is generally several times to orders of magnitude faster than the other two, across different suffix array sampling step sizes.
5.3 Performance Evaluation
In the following sections we compare the performance of our proposed algorithms with the state-of-the-art indexing methods described above, based upon the performance criteria of compression ratio, locate query time, and extract query time. We use the following parameters for each index:
- AF-Index: sample rate = 64, 128, and 256;
- FMI-Hybrid: sample rate = 32, 64, and 128;
- RL-CSA: sample rate = 32, 64, and 128;
- GeCSA: sample rate = 128, 256, and 512;
- FM-Adaptive: sampling step $d$ = 64, 128, and 256.
The other parameters used are the defaults. The testing method is the same as in [32]. The compression ratio is defined as the ratio of the index size to the original size of the input text. The parameters in our index are the block size $b$, the superblock factor $s$, and the sampling step $d$.
5.3.1 Locate Query
Figure 4 shows the compression ratio and time for the locate query of the compared indexing methods. For the NA12877R10 data, RL-CSA fails to build the index, since it limits the size of the input collection to less than 4 gigabytes. AF-Index fails to build the index on proteins, english, hg38, and NA12877R10, whose input sizes are larger than 1 gigabyte.
As can be seen in Figure 4, FM-Adaptive provides the best compression on the nonrepetitive datasets shown in Table 3. FM-Adaptive is typically several to dozens of times faster than FMI-Hybrid in query time. Specifically, for the nonrepetitive datasets, when minimizing compression ratio, FM-Adaptive gives the best compression and is several times faster than FMI-Hybrid in query time in four out of six nonrepetitive datasets. Speedwise, when minimizing query time, RL-CSA shows the fastest query time in four out of six nonrepetitive datasets, but at a cost of higher space usage; FM-Adaptive remains several times faster than FMI-Hybrid in four of the six. FMI-Hybrid is essentially the same as, or a bit faster than, FM-Adaptive in query time on proteins and DNA, but it uses noticeably more space.
For the repetitive datasets, when minimizing compression ratio, GeCSA gives the best compression on influenza, w.leaders, and kernel, with the exception of FMI-Hybrid on para. FMI-Hybrid shows the best compression ratio on para, but it is several times slower than GeCSA. FM-Adaptive is several times faster than FMI-Hybrid, with comparable compression, on w.leaders, para, kernel, and influenza. Speedwise, when minimizing query time, GeCSA shows the fastest query time on para, w.leaders, and kernel, with the exception of RL-CSA on influenza, but RL-CSA uses considerably more space than GeCSA. AF-Index generally uses substantially more space on the tested data than the other compared indexing methods.
Memoization is a main reason for the improved locate query time of FM-Adaptive. Another reason is the different suffix array sampling. As can be seen in Figure 4, FM-Adaptive provides considerably faster query performance than the other compared FM-index methods on the tested data sets, except for DNA, proteins, and influenza, for which memoization barely works according to Figure 3. At the same time, FM-Adaptive provides comparable or better compression.
5.3.2 Substring Extract Query
The extract query uses the same index as the locate query, so it has the same space usage as the locate query. In this section we focus upon its time performance. Figure 5 shows the compression ratio and time for the extract query for the compared indexing methods.
Figure 5 shows that RL-CSA, GeCSA, AF-Index, and FM-Adaptive give the best extract time on four, three, two, and one of the 10 tested datasets in Table 3, respectively. As can be seen in Figure 5, GeCSA shows good extract performance on repetitive data: it uses less space and is generally many times faster than the other indexing methods, with the exception of AF-Index on para. AF-Index is faster than GeCSA on para but uses considerably more space. The query time of the FM-family indexing methods can be affected by the alphabet size of the input text: when $\sigma$ is small, as for DNA, hg38, and NA12877R10, they have better or competitive time. FM-Adaptive is faster than FMI-Hybrid in query time in eight out of 10 datasets, the exceptions being proteins and english, on which FMI-Hybrid is somewhat faster but uses more space.
6 Conclusion
We have proposed improvements via a new compressed representation of the wavelet tree of the Burrows-Wheeler transform of the input text, which incorporates the gap $\gamma$-encoding. Our new index achieves asymptotic space optimality within a factor of 2 in the leading term, yet it provides better compression and faster retrieval in practice than the competitive optimal compression boosting used in previous FM-indexes. We also presented a practical improved locate algorithm that provides substantially faster locating time.
An interesting open problem is to improve our high-order entropy-compressed text self-index to achieve optimal space and query time both in theory and in practice. Another interesting goal is to adapt our index to compress and index the human genome and reads for the analysis of ultra-high-throughput next-generation sequencing (NGS) data, while supporting approximate matching queries.
References
- [1] Christina Boucher, Davide Cenzato, Zsuzsanna Lipták, Massimiliano Rossi, and Marinella Sciortino. r-indexing the eBWT. Information and Computation, 298:105155, 2024. doi:10.1016/J.IC.2024.105155.
- [2] Nathaniel K. Brown, Travis Gagie, Giovanni Manzini, Gonzalo Navarro, and Marinella Sciortino. Faster run-length compressed suffix arrays. In Alessio Conte, Andrea Marin, Giovanna Rosone, and Jeffrey S. Vitter, editors, From Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi’s 60th Birthday. OASIcs, Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2025.
- [3] M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm, 1994.
- [4] Stefan Canzar and Steven L. Salzberg. Short read mapping: An algorithmic tour. Proceedings of the IEEE, 105(3):436–458, 2017. doi:10.1109/JPROC.2015.2455551.
- [5] Xiaoyang Chen, Hongwei Huo, Jun Huan, Jeffrey Scott Vitter, Weiguo Zheng, and Lei Zou. MSQ-Index: A succinct index for fast graph similarity search. IEEE Transactions on Knowledge and Data Engineering, 33(6):2654–2668, 2021. doi:10.1109/TKDE.2019.2954527.
- [6] David Richard Clark. Compact PAT trees. PhD thesis, University of Waterloo, Waterloo, Canada, 1996.
- [7] Peter Elias. Efficient storage and retrieval by content and address of static files. J. Assoc. Comput. Mach., 21:246–260, 1974. doi:10.1145/321812.321820.
- [8] Peter Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, 21(2):194–203, 1975. doi:10.1109/TIT.1975.1055349.
- [9] Robert M. Fano. On the number of bits required to implement an associative memory. Computer Structures Group, MIT, Cambridge, MA, 1971.
- [10] Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In Proceedings of the 41st Annual IEEE Symposium on Foundations of Computer Science (FOCS’00), pages 390–398, 2000. doi:10.1109/SFCS.2000.892127.
- [11] Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM, 52(4):552–581, 2005. doi:10.1145/1082036.1082039.
- [12] Paolo Ferragina, Giovanni Manzini, Veli Mäkinen, and Gonzalo Navarro. An alphabet-friendly FM-index. In Proceedings of the 11th International Symposium on String Processing and Information Retrieval (SPIRE’04), pages 150–160, 2004. doi:10.1007/978-3-540-30213-1_23.
- [13] Paolo Ferragina, Giovanni Manzini, Veli Mäkinen, and Gonzalo Navarro. Compressed representations of sequences and full-text indexes. ACM Transactions on Algorithms, 3(2):Article 20, 2007.
- [14] Paolo Ferragina, Giovanni Manzini, Veli Mäkinen, and Gonzalo Navarro. The myriad virtues of Wavelet Trees. Information and Computation, 207(8):849–866, 2009. doi:10.1016/J.IC.2008.12.010.
- [15] Luca Foschini, Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. When indexing equals compression: Experiments with compressing suffix arrays and applications. ACM Transactions on Algorithms, 2(4):611–639, 2006. Shorter versions appear in Proceedings of the 15th Annual SIAM/ACM Symposium on Discrete Algorithms (SODA ’04), New Orleans, LA, January 2004, 636–645, and in “Fast compression with a static model in high-order entropy,” Proceedings of IEEE Data Compression Conference, Snowbird, Utah, March 2004, 62–71. doi:10.1145/1198513.1198521.
- [16] Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. Journal of the ACM, 67(1):Article 2, 2020.
- [17] Raffaele Giancarlo, Giovanni Manzini, Antonio Restivo, Giovanna Rosone, and Marinella Sciortino. A new class of string transformations for compressed text indexing. Information and Computation, 294:105068, 2022. doi:10.1016/J.IC.2023.105068.
- [18] Simon Gog, Juha Kärkkäinen, Dominik Kempa, Matthias Petri, and Simon J. Puglisi. Fixed block compression boosting in FM-indexes: Theory and practice. Algorithmica, 81:1370–1391, 2019. doi:10.1007/S00453-018-0475-9.
- [19] Simon Gog, Gonzalo Navarro, and Matthias Petri. Improved and extended locating functionality on compressed suffix arrays. Journal of Discrete Algorithms, 32:53–63, 2015. doi:10.1016/J.JDA.2015.01.006.
- [20] Simon Gog and Matthias Petri. Optimized succinct data structures for massive data. Software: Practice and Experience, 44(11):1287–1314, 2014. doi:10.1002/SPE.2198.
- [21] Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. High-order entropy-compressed text indexes. In Proceedings of the 14th ACM-SIAM Symposium on Discrete Algorithms (SODA’03), pages 841–850, 2003. URL: http://dl.acm.org/citation.cfm?id=644108.644250.
- [22] Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In Proceedings of the 32nd ACM Symposium on Theory of Computing (STOC’00), pages 397–406, 2000.
- [23] Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing, 35(2):378–407, 2005. doi:10.1137/S0097539702402354.
- [24] Roberto Grossi, Jeffrey Scott Vitter, and Bojian Xu. Wavelet trees: From theory to practice. In Proceedings of the 1st International Conference on Data Compression, Communications and Processing, pages 210–221, 2011. doi:10.1109/CCP.2011.16.
- [25] Wing-Kai Hon, Rahul Shah, and Jeffrey Scott Vitter. Compression, indexing, and retrieval for massive string data. In Proceedings of the 21st annual conference on Combinatorial pattern matching (CPM’10), pages 260–274, 2010. doi:10.1007/978-3-642-13509-5_24.
- [26] David A. Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40(9):1098–1101, 1952.
- [27] Hongwei Huo, Longgang Chen, Jeffrey Scott Vitter, and Yakov Nekrich. A practical implementation of compressed suffix arrays with applications to self-indexing. In Proceedings of the Data Compression Conference (DCC’14), pages 292–301, 2014. doi:10.1109/DCC.2014.49.
- [28] Hongwei Huo, Longgang Chen, Heng Zhao, Jeffrey Scott Vitter, Yakov Nekrich, and Qiang Yu. A Data-aware FM-index. In Proceedings of the 17th Workshop on Algorithm Engineering and Experiments (ALENEX’15), pages 10–23, 2015. doi:10.1137/1.9781611973754.2.
- [29] Hongwei Huo, Xiaoyang Chen, Xu Guo, and Jeffrey Scott Vitter. Efficient compression and indexing for highly repetitive DNA sequence collections. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 18(6):2394–2408, 2021. doi:10.1109/TCBB.2020.2968323.
- [30] Hongwei Huo, Zongtao He, Pengfei Liu, and Jeffrey Scott Vitter. FM-Adaptive: A data-aware FM-index [source code], January 2022. doi:10.24433/CO.7967727.v1.
- [31] Hongwei Huo, Pengfei Liu, Chenhui Wang, Hongbo Jiang, and Jeffrey Scott Vitter. CIndex: Compressed indexes for fast retrieval of FASTQ files. Bioinformatics, 38(2):335–343, 2022. doi:10.1093/BIOINFORMATICS/BTAB655.
- [32] Hongwei Huo, Peng Long, and Jeffrey Scott Vitter. Practical high-order entropy-compressed text self-indexing. IEEE Transactions on Knowledge and Data Engineering, 35(3):2943–2960, 2023. doi:10.1109/TKDE.2021.3114401.
- [33] Hongwei Huo, Zhigang Sun, Shuangjiang Li, Jeffrey Scott Vitter, et al. CS2A: A compressed suffix array-based method for short read alignment. In Proceedings of the Data Compression Conference (DCC’16), pages 271–278, 2016.
- [34] Hongwei Huo, Yongze Yu, Zongtao He, and Jeffrey Scott Vitter. Indexing labeled property multidigraphs in entropy space, with applications. In Proceedings of the 41st IEEE International Conference on Data Engineering (ICDE’25), pages 2478–2492, 2025.
- [35] Guy Jacobson. Space-efficient static trees and graphs. In Proceedings of the 30th Annual IEEE Symposium on Foundations of Computer Science (FOCS’89), pages 549–554, 1989. doi:10.1109/SFCS.1989.63533.
- [36] Juha Kärkkäinen and Simon J. Puglisi. Fixed block compression boosting in FM-indexes. In Proceedings of the 18th International Symposium on String Processing and Information Retrieval (SPIRE’11), pages 174–184, 2011. doi:10.1007/978-3-642-24583-1_18.
- [37] Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi. Hybrid compression of bitvectors for the FM-index. In DCC, pages 302–311, 2014. doi:10.1109/DCC.2014.87.
- [38] Zhize Li, Jian Li, and Hongwei Huo. Optimal in-place suffix sorting. Information and Computation, 285(Part B):104818, 2022. doi:10.1016/J.IC.2021.104818.
- [39] Veli Mäkinen and Gonzalo Navarro. Succinct suffix arrays based on run-length encoding. In Proceedings of the 16th Annual Symposium on Combinatorial Pattern Matching (CPM), pages 45–56, 2005. LNCS 3537. doi:10.1007/11496656_5.
- [40] Veli Mäkinen and Gonzalo Navarro. Implicit compression boosting with applications to self-indexing. In Proceedings of the 14th International Symposium on String Processing and Information Retrieval (SPIRE’07), pages 229–241, 2007. doi:10.1007/978-3-540-75530-2_21.
- [41] Veli Mäkinen and Gonzalo Navarro. Dynamic entropy-compressed sequences and full-text indexes. ACM Transactions on Algorithms, 4(3):article 32, 2008.
- [42] Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki. Storage and retrieval of highly repetitive sequence collections. J Comput Biol., 17(3):281–308, 2010. doi:10.1089/CMB.2009.0169.
- [43] Udi Manber and Gene Myers. Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing, 22(5):935–948, 1993. doi:10.1137/0222058.
- [44] Giovanni Manzini. An analysis of the Burrows-Wheeler transform. Journal of the ACM, 48(3):407–430, 2001. doi:10.1145/382780.382782.
- [45] Edward M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, 23(2):262–272, 1976. doi:10.1145/321941.321946.
- [46] J. Ian Munro, Gonzalo Navarro, and Yakov Nekrich. Fast compressed self-indexes with deterministic linear-time construction. Algorithmica, 82(2):316–337, 2020. A conference version of this paper appeared in Proc. ISAAC 2017. doi:10.1007/S00453-019-00637-X.
- [47] Gonzalo Navarro. Indexing highly repetitive string collections, part ii: Compressed indexes. ACM Computing Surveys, 54(2):Article 26, 2021.
- [48] Gonzalo Navarro. Compact Data Structures: A Practical Approach. Cambridge University Press, New York, USA, September 2016.
- [49] Gonzalo Navarro and Veli Mäkinen. Compressed full-text indexes. ACM Computing Surveys, 39(1):Article 2, 2007.
- [50] Gonzalo Navarro and Nicola Prezza. Universal compressed text indexing. Theoretical Computer Science, 762:41–50, 2019. doi:10.1016/J.TCS.2018.09.007.
- [51] Rajeev Raman, Venkatesh Raman, and S. Srinivasa Rao. Succinct indexable dictionaries with applications to encoding $k$-ary trees, prefix sums and multisets. ACM Transactions on Algorithms, 3(4):Article 43, 2007. doi:10.1145/1290672.1290680.
- [52] Kunihiko Sadakane. Compressed text databases with efficient query algorithms based on the compressed suffix array. In Proceedings of the 11th Symposium on Algorithms and Computation (ISAAC’00), pages 410–421, 2000. doi:10.1007/3-540-40996-3_35.
- [53] Kunihiko Sadakane. New text indexing functionalities of the compressed suffix arrays. Journal of Algorithms, 48(2):294–313, 2003. doi:10.1016/S0196-6774(03)00087-7.
- [54] Esko Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249–260, 1995. doi:10.1007/BF01206331.
Appendix A Algorithms
In the algorithms of this appendix, $B$ denotes a node bit string of the wavelet tree.
Appendix B Count Query
Given a pattern string $P[1, p]$ of $p$ symbols, where each symbol in $P$ and $T$ belongs to a fixed alphabet $\Sigma$ of size $\sigma$, an occurrence of the pattern at position $i$ of $T$ means that the substring $T[i, i+p-1]$ is equal to $P$. The count algorithm given below reports the range $[sp, ep]$ in $SA$ of the suffixes prefixed by $P$, using the array $C$ and rank operations.
During the search for $P$, the algorithm maintains the following invariant: let $[sp, ep]$ denote the range of suffixes with prefix $P[j+1, p]$. Then, for $c = P[j]$, the range of suffixes with prefix $P[j, p]$ is $[\,C[c] + \mathrm{rank}_c(L, sp - 1) + 1,\ C[c] + \mathrm{rank}_c(L, ep)\,]$. A basic rank operation takes $O(\log\sigma)$ time on the compressed wavelet tree by Lemma 2. As a count query does two rank operations per symbol of $P$, it runs in $O(p\log\sigma)$ time. (Some more complicated implementations of wavelet trees, such as that in [13], make use of multiway branching to shorten the height of the tree and reduce traversal time, but we use the simpler binary branching, which works well in practice.)
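A minimal C++ sketch of this backward search, using a naive scan-based rank over the plain BWT purely for clarity (the index instead performs each rank on the compressed wavelet tree in $O(\log\sigma)$ time):

```cpp
#include <array>
#include <string>
#include <utility>

// Occurrences of c in L[0..i]; i may be -1 (empty prefix). Naive scan;
// the compressed wavelet tree replaces this in the real index.
static long rankC(const std::string& L, unsigned char c, long i) {
    long r = 0;
    for (long j = 0; j <= i; ++j)
        if ((unsigned char)L[j] == c) ++r;
    return r;
}

// Backward search: returns the (0-based, inclusive) SA range [sp, ep] of
// suffixes prefixed by P, or an empty range (sp > ep) if P does not occur.
std::pair<long, long> count(const std::string& L,
                            const std::array<long, 256>& C,
                            const std::string& P) {
    long sp = 0, ep = (long)L.size() - 1;
    for (long j = (long)P.size() - 1; j >= 0 && sp <= ep; --j) {
        unsigned char c = P[j];
        sp = C[c] + rankC(L, c, sp - 1);   // first row with the new prefix
        ep = C[c] + rankC(L, c, ep) - 1;   // last row with the new prefix
    }
    return {sp, ep};                       // number of occurrences: ep - sp + 1
}
```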
Theorem 5.
Let $P$ be a query pattern of length $p$. We can answer a count query of $P$ using the CWT of the BWT of $T$ in $O(p\log\sigma)$ time, using $2nH_k(T) + o(n\log\sigma)$ bits of space, for any $k$ such that $k \le \alpha\log_\sigma n - 1$ and any constant $0 < \alpha < 1$, where $H_k(T)$ denotes the $k$th-order empirical entropy of $T$.