FM-Adaptive: A Practical Data-Aware FM-Index
Abstract
The FM-index provides an important solution for efficient retrieval and search in textual big data. Its variants have been widely used in many fields including information retrieval, genome analysis, and web searching. In this paper, we propose improvements via a new compressed representation of the wavelet tree of the Burrows-Wheeler transform of the input text, which incorporates the gap $\gamma$-encoding. Our theoretical analysis shows that the new index, called FM-Adaptive, achieves asymptotic space optimality within a factor of 2 in the leading term, yet it has better compression and faster retrieval in practice than the competitive optimal compression boosting used in previous FM-indexes. We present a practical improved locate algorithm that provides substantially faster locating time based upon memoization, which takes advantage of the overlapping-subproblems property. We design a lookup table for accelerated decoding to support fast pattern matching in a text. Extensive experiments demonstrate that FM-Adaptive provides faster query performance, often by a considerable amount, and/or comparable or better compression than other state-of-the-art FM-index methods.
Keywords and phrases: Text indexing, Burrows-Wheeler transform, Compressed wavelet trees, Entropy-compressed, Compressed data structures
2012 ACM Subject Classification: Information systems → Information retrieval; Theory of computation → Design and analysis of algorithms; Theory of computation → Data structures design and analysis; Theory of computation → Data compression; Theory of computation → Pattern matching
Acknowledgements: We would like to thank Simon Gog for sharing code.
Funding: This work was supported in part by the National Natural Science Foundation of China under Grant No. 62272358.
Editors: Paolo Ferragina, Travis Gagie, and Gonzalo Navarro
Series and Publisher: Open Access Series in Informatics (OASIcs), Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
Massive data sets are being produced at unprecedented rates from sources like IoT, ultra-high-throughput next-generation sequencing, autonomous driving, the digital universe, and social networks. A large part of the data consists of text in the form of a sequence of symbols representing not only natural language, but also multimedia streams, biological sequences, and myriad forms of other media. A full-text index is a data structure that stores a text string in preprocessed form so that it can support fast string matching queries. The best-known full-text indexes are the suffix tree [45, 54] and the suffix array [43], which support pattern matching queries in optimal or almost-optimal time. However, for a text of $n$ symbols drawn from an alphabet of size $\sigma$, these data structures require $O(n\log n)$ bits in the standard unit-cost RAM model, which is larger than the $n\log\sigma$ bits of the input text by a multiplicative factor of $\Theta(\log_\sigma n)$. (All logarithms in this paper that do not have an explicit base listed are in base 2.) For example, using suffix trees and suffix arrays, full-text indexing requires approximately 36GB of memory in the most optimized implementations for short read mapping for a mammalian genome [4]. Thus indexing a text in a space-efficient way while supporting efficient pattern matching queries is a challenging problem.
The field of compressed or succinct data structures attempts to build data structures whose space is provably close to the size of the data in compressed format and that still provide fast query functionality. Theoretical breakthroughs in the late 1990s led to the development of a new generation of space-efficient indexes. In particular, the compressed suffix array (CSA) [22, 52, 23, 53] and the FM-index [10, 21, 11] (based upon the Burrows-Wheeler transform (BWT) [3, 44]) provide the fundamentals of how to work with text efficiently in compressed format. Much subsequent work has focused on making compressed indexes fast and space-efficient in practice in order to handle a variety of big data applications, such as sequencing data consisting of billions of short reads.
1.1 Related Work
The FM-index [10, 11, 13, 12, 15, 24, 18, 29, 31] and the compressed suffix array (CSA) [22, 21, 23, 52, 53, 15, 42, 32, 19, 17] are space-efficient text indexes whose query times are proportional to the query pattern size plus the product of the output size and a small polylogarithmic function of $n$. The former maintains a succinct representation of the BWT, and the latter maintains a succinct representation of the neighbor function $\Phi$ [22, 23]. Both are self-indexes in that they represent the original text, which can thus be discarded.
Grossi and Vitter [22, 23] and Sadakane [52] introduced the compressed suffix array and the neighbor function $\Phi$. Ferragina and Manzini [10, 11] designed the original entropy-compressed FM-index based upon the Burrows-Wheeler transform (BWT) [3, 44] and the mapping function $LF$. The mapping function $LF$ and the neighbor function $\Phi$ are inverses of one another.
Grossi, Gupta, and Vitter [21] introduced the elegant data structure known as the wavelet tree, which has since become ubiquitous in text indexing. Using the wavelet tree, they were the first to establish a self-index that provably achieves the asymptotically optimal space bound with leading coefficient 1 (i.e., $nH_k(T)$ bits, where $H_k(T)$ is the $k$th-order entropy of $T$, as defined subsequently in Definition 2). Their analysis of the wavelet tree also applied to the FM-index and was the first to show asymptotic optimality as well for the FM-index [21, 12, 13, 40, 41]. The original space bound derived for the BWT had a leading term of $5nH_k(T)$ [10], which was improved to the asymptotically optimal $nH_k(T)$ [21, 12] using wavelet trees.
Subsequent to the early theoretical breakthroughs [22, 10, 21, 11, 23], there have been numerous improvements. Simpler implementations of the CSA and the FM-index achieve high-order compression without explicit partitioning into separate $k$-contexts, thus using a single wavelet tree for the entire text [15, 40, 41, 14]. (Here "$k$-context" denotes the length-$k$ prefix of the suffix $T[SA[i], n]$, where $SA$ is the suffix array of $T$.) We refer the reader to the nice survey of Navarro and Mäkinen [49], Navarro's book [48], and other references in the literature [40, 41, 18, 25, 20, 37, 28, 27, 33, 50, 46, 16, 47, 5, 38, 31, 17, 32, 1, 34, 2, 42].
For example, Foschini et al. [15] encoded the single wavelet tree of the BWT sequence using run-length encoding with an Elias code for the runs; if the $k$-contexts are hypothetically overlaid onto the BWT sequence, the encoding of each run length adapts implicitly to the frequency statistics of the current $k$-context(s), thus achieving zero-order compression per context, and thus, by Definition 2 of $k$th-order entropy, they encode the overall node in $k$th-order entropy space. We use that same idea in this paper. Mäkinen and Navarro [40, 41] did a careful analysis to show the surprising result that applying RRR [51] to a single wavelet tree of the entire BWT sequence without any partitioning achieves the $nH_k(T)$ leading space term, for a similar reason as in [15]; the subblocks formed by [51] implicitly encode each $k$th-order context in roughly zero-order entropy space. Interesting experiments on the practical performance of these RLE and RRR approaches and related methods appear in [24]. Gog et al. [18] used a fixed-block boosting technique [36] and the RRR method [51] to implement the FM-index in asymptotically optimal entropy-compressed space with a small extra space cost. Mäkinen et al. [42] proposed a run-length-encoded CSA-based index for highly repetitive data collections.
These references also mention many practical applications. An example is a special-purpose BWT-based compressed index for biological FASTQ data that exploits specific characteristics of next-generation sequencing data [31].
1.2 Our Results
Developing space-efficient entropy-compressed self-indexes that achieve fast query performance both in theory and practice has been a challenging problem. In this paper, we propose FM-Adaptive, a fast data-aware FM-index applicable to a wide range of text strings with different alphabet sizes. The key is a new representation of the wavelet tree of the BWT, with new tradeoffs between space occupancy and search time. We propose several auxiliary data structures to support fast access to the BWT.
Using the gap code instead of the Elias-Fano (EF) code [9, 7] used in [31] for DNA sequences, we deduce a new space bound for general texts, whereas the space bound derived using the EF code in [31] holds only when the entropy is small. In addition, the compressed index in [31] is geared towards FASTQ data, which have a specific format, and its queries are not identical to the queries on the general-purpose compressed indexes in this paper. Our index has an additional efficiency in that it allows the gap code and the run-length code to share the same lookup table, as in [28], even though the table entries for the gap code and for the run-length code, which we introduce in Section 3.1, have different meanings; this avoids the need to store the separate lookup table for the Elias-Fano code [31].
Ferragina, Giancarlo, and Manzini [14] discussed the run-length code and the gap code for implementations of the wavelet tree. For a bit string $B$ of length $n$, they bound the space of the run-length code (their Lemma 3.3) and of the gap code (their Lemma 3.4), in both cases with additive second-order terms. Instead, we can represent $B$ using the run-length code within the bounds of [24, 28] and, using the gap code, in at most $2nH_0(B)$ bits (Lemma 3), which avoids any second-order terms.
Ferragina et al. [14] addressed the problem of compressing the BWT, but not the indexing mechanics of how to provide fast random access to and decoding of the compressed representation. In this paper, we address compression as well as the problem of how to index the data. We design several efficient auxiliary data structures and a lookup table for fast access and decoding to support efficient pattern matching in a text. We also present a new locating algorithm that substantially improves the locating time in practice.
For an input text $T$ of $n$ symbols, we summarize below the novel contributions we make in this paper:
1. We present a new compressed representation for the wavelet tree of the BWT, called FM-Adaptive, that incorporates the gap $\gamma$-encoding. Using this, we can implement FM-Adaptive in $2nH_k(T) + o(n\log\sigma)$ bits, for any $k \le \alpha\log_\sigma n - 1$ and any fixed constant $0 < \alpha < 1$, where $H_k(T)$ denotes the $k$th-order empirical entropy of $T$. We can construct FM-Adaptive in $O(n\log\sigma)$ time. In addition, for a bit string $B$ of length $n$, we can represent $B$ in at most $2nH_0(B)$ bits using the gap code, which avoids, and thereby improves upon, the second-order terms in the space bounds of [14, 24], where $H_0(B)$ is the zero-order empirical entropy of $B$, defined in Definition 2.
2. We present an improved algorithm that provides substantially faster performance in practice for locating patterns in a text, based upon memoization, which takes advantage of the overlapping-subproblems property.
3. We design a lookup table (see Section 3.4) for accelerating the decoding of the gap $\gamma$-encoded blocks to support fast pattern matching in a text.
4. Given any pattern $P$ of $p$ symbols, using FM-Adaptive, we can count the number of occurrences of $P$ in $T$ in $O(p\log\sigma)$ time, locate all $occ$ occurrences in additional $O(occ \cdot d\log\sigma)$ time for suffix-array sampling step $d$, and retrieve a text substring of length $\ell$ in $O((\ell + d)\log\sigma)$ time.
5. Extensive experiments demonstrate that FM-Adaptive generally provides faster query performance, often by a considerable amount, and comparable or better compression than other state-of-the-art FM-index methods. The source code is available online [30].
1.3 Organization of the Paper
The remainder of the paper is organized as follows. Section 2 presents preliminaries. Section 3 describes the compressed wavelet tree and its space analysis. Section 4 presents the pattern matching queries. Section 5 reports the experimental results, and Section 6 concludes the paper.
2 Preliminaries
In this section, we define some necessary concepts and provide a brief description of the BWT.
2.1 Problem Formalization
Definition 1 (Text indexing problem).
Let $T[1, n]$ be a text string of length $n$ over an alphabet $\Sigma$ of size $\sigma$, and let $P[1, p]$ be a query pattern. The text indexing problem is to represent $T$ in as little space as possible while efficiently supporting the following query operations:
- count$(P)$: returns the number of occurrences of $P$ in $T$.
- locate$(P)$: reports the positions in $T$ where $P$ occurs.
- extract$(i, \ell)$: retrieves the substring $T[i, i+\ell-1]$.
2.2 Empirical Entropy
Definition 2 (empirical entropy).
Let $T$ be a text string of length $n$ over an alphabet $\Sigma$ of size $\sigma$. The zero-order empirical entropy of $T$ is defined as
$$H_0(T) = \sum_{c \in \Sigma} \frac{n_c}{n}\log\frac{n}{n_c},$$
where $n_c$ is the number of occurrences in $T$ of symbol $c$. The $k$th-order empirical entropy [44] of $T$ is defined as
$$H_k(T) = \frac{1}{n}\sum_{w \in \Sigma^k} |w_T|\, H_0(w_T),$$
where $w$ designates a string of length $k$ and $w_T$ denotes the string formed by taking the symbol immediately preceding each occurrence of $w$ in $T$ and concatenating those symbols together.
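As a small worked example of Definition 2 (our own illustration, not one of the experimental datasets), take $T = \texttt{abracadabra}$ with $n = 11$ and symbol counts a:5, b:2, r:2, c:1, d:1:
$$H_0(T) = \tfrac{5}{11}\log\tfrac{11}{5} + 2\cdot\tfrac{2}{11}\log\tfrac{11}{2} + 2\cdot\tfrac{1}{11}\log 11 \approx 0.517 + 0.894 + 0.629 \approx 2.04 \text{ bits per symbol},$$
compared with $\log 5 \approx 2.32$ bits per symbol for a fixed-length code over the five distinct symbols.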
2.3 The Burrows-Wheeler Transform and FM-index
Let $T[1, n]$ be a text string consisting of $n$ symbols from alphabet $\Sigma$ of size $\sigma$, and let $SA$ denote the suffix array of $T$. The BWT of $T$ is the string $L$ defined as $L[i] = T[SA[i]-1]$. As shown in Table 1, the BWT of $T$ is the string formed by sorting the suffixes of $T$ in lexicographical order, choosing the preceding symbol for each suffix, and concatenating those preceding symbols. (When the suffix is the entire string $T$, i.e., $SA[i] = 1$, we use the last symbol $T[n]$ as its preceding symbol.)
The Burrows-Wheeler transform is invertible. We can retrieve $T$ from its BWT $L$ by backward search [10] using the mapping function $LF$: the value $LF(i)$ is the lexicographical rank of the suffix $T[SA[i]-1, n]$, that is, $SA[LF(i)] = SA[i] - 1$; equivalently, $F[LF(i)] = L[i]$, where $F$ is the first column of the conceptually sorted suffixes.
We can compute $LF(i)$ by $LF(i) = C[c] + \mathrm{rank}_c(L, i)$, where $c = L[i]$, $\mathrm{rank}_c(L, i)$ is the number of occurrences of symbol $c$ in $L[1, i]$, and $C[c]$ is the number of occurrences in $T$ of symbols that are lexicographically smaller than $c$. Table 1 shows the $SA$ and BWT of an example string $T$, in which for each $i$ we show the first four symbols (starting with $T[SA[i]]$) and the last symbol (namely, $L[i]$) of the conceptual suffixes of $T$ in lexicographical order. The mapping function $LF$ and the neighbor function $\Phi$ are inverses of one another: $\Phi(LF(i)) = i$, as shown in Table 1.
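To make the $LF$ computation concrete, the following minimal C++ sketch builds $L$ and $C$ from a given suffix array and applies $LF$ with a naive linear-scan rank; the function names are our own, and the compressed rank structures of Section 3 replace the scans in the actual index.

```cpp
#include <array>
#include <string>
#include <vector>

// BWT from a 0-based suffix array: L[i] is the symbol preceding suffix SA[i],
// wrapping around so that the suffix starting at position 0 maps to T[n-1].
std::string buildBWT(const std::string& T, const std::vector<size_t>& SA) {
    const size_t n = T.size();
    std::string L(n, '\0');
    for (size_t i = 0; i < n; ++i)
        L[i] = T[(SA[i] + n - 1) % n];
    return L;
}

// C[c] = number of symbols in T (equivalently, in L) smaller than c.
std::array<long, 256> buildC(const std::string& L) {
    std::array<long, 256> cnt{}, C{};
    for (unsigned char c : L) ++cnt[c];
    long sum = 0;
    for (int c = 0; c < 256; ++c) { C[c] = sum; sum += cnt[c]; }
    return C;
}

// LF(i) = C[c] + rank_c(L, i), with rank done by a naive O(n) scan here;
// the compressed wavelet tree of Section 3 performs it in O(log sigma).
long lf(const std::string& L, const std::array<long, 256>& C, long i) {
    unsigned char c = L[i];
    long rank = 0;                       // occurrences of c in L[0..i]
    for (long j = 0; j <= i; ++j)
        if ((unsigned char)L[j] == c) ++rank;
    return C[c] + rank - 1;              // 0-based row of the preceding suffix
}
```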
2.4 The Wavelet Tree
Let $S$ denote a text string of length $n$ over an alphabet $\Sigma$. We define the wavelet tree (WT) [21] of $S$, with leaves labelled by the symbols of the alphabet, as follows: for each node $u$ in the WT, let $\Sigma_u$ be the subset of symbols of $\Sigma$ in the subtree rooted at $u$, and let $S_u$ be the subsequence of $S$ consisting of all the symbols of $\Sigma_u$. For each internal node $u$ in the WT, let $B_u$ be a bit string of the same length as $S_u$, defined by $B_u[i] = 0$ if $S_u[i]$ is in the left subtree of $u$ and $B_u[i] = 1$ if $S_u[i]$ is in the right subtree of $u$. Such a bit string representation occurs recursively at each internal node; the subsequence at an internal node consists of the symbols dispatched from the parent node, with their order in the text string preserved. The collective size of the bit strings at any given level of the tree is bounded by $n$. With proper encoding, the wavelet tree representation of $S$ achieves the zero-order entropy space bound [21].
The powerful wavelet tree data structure [21, 15, 24] reduces the problem of compressing a string over a finite alphabet to the problem of compressing a set of bit (i.e., binary) strings. It is an elegant and versatile data structure that allows efficient access, rank, and select queries on a text string $S$ of $n$ symbols from a multisymbol alphabet:
- access$(S, i)$: returns the symbol $S[i]$.
- rank$_c(S, i)$: returns the number of occurrences of symbol $c$ in $S[1, i]$, for any $c \in \Sigma$ and $1 \le i \le n$.
- select$_c(S, j)$: returns the position in $S$ of the $j$th occurrence of symbol $c$, for any $c \in \Sigma$.
A key result of Grossi et al. [21] is that if each bit string of the internal nodes of the wavelet tree is compressed to zero-order entropy, then the cumulative encoding is a zero-order entropy encoding of $S$:
Lemma 1 ([21]).
$\sum_{i=1}^{m} |B_{u_i}|\, H_0(B_{u_i}) = n H_0(S)$, where $u_1, \ldots, u_m$ are the internal nodes of the wavelet tree and $m$ denotes the number of internal nodes.
With the extra cost for auxiliary data structures, the total space occupied by a wavelet tree is $nH_0(S) + o(n\log\sigma)$ bits of space [21, 15, 24].
We use a single wavelet tree to represent the BWT, as first done in Foschini et al. [15]. Mäkinen and Navarro [39] used a single wavelet tree to represent the string of run heads of the run-length encoding of the FM-index. Figure 1 shows the balanced wavelet tree for the example BWT $L$ = tcacaattttcatttgtgaattaatagaaag#ataa from Table 1. There are various ways [15] to form the wavelet tree, such as using a Huffman criterion [26]. Each leaf of the wavelet tree corresponds to a distinct symbol. The bit string column in Table 1 is the root bit string of the wavelet tree in Figure 1. A value of 0 (resp., 1) in the root bit string means that the corresponding symbol of the BWT lies in the left (resp., right) subtree. The bit strings for the roots of the subtrees are defined recursively.
Using the wavelet tree, we can turn each $\mathrm{rank}_c$ computation on the BWT into $O(\log\sigma)$ $\mathrm{rank}_1$ computations on the node bit strings along one root-to-leaf path, as the sketch below illustrates.
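The following C++ sketch (our own simplified illustration, not the FM-Adaptive source) shows the recursive dispatch for a balanced, pointer-based, uncompressed wavelet tree over a byte alphabet; each query walks one root-to-leaf path of $O(\log\sigma)$ levels.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct WTNode {
    std::vector<bool> bits;     // 0 = symbol goes left, 1 = symbol goes right
    WTNode *left = nullptr, *right = nullptr;
};

// Build a balanced wavelet tree over the symbol range [lo, hi].
WTNode* build(const std::string& s, uint8_t lo, uint8_t hi) {
    if (s.empty() || lo == hi) return nullptr;   // leaf: single symbol
    uint8_t mid = lo + (hi - lo) / 2;            // symbols <= mid go left
    WTNode* node = new WTNode;
    std::string l, r;
    for (unsigned char c : s) {
        bool b = c > mid;
        node->bits.push_back(b);
        (b ? r : l).push_back(c);
    }
    node->left = build(l, lo, mid);
    node->right = build(r, mid + 1, hi);
    return node;
}

// rank_c(s, i): occurrences of c in s[0..i-1], via one root-to-leaf walk.
size_t rankWT(const WTNode* node, uint8_t lo, uint8_t hi, uint8_t c, size_t i) {
    if (!node || i == 0) return i;               // at a leaf all symbols equal c
    uint8_t mid = lo + (hi - lo) / 2;
    size_t ones = 0;
    for (size_t j = 0; j < i; ++j) ones += node->bits[j];  // naive rank1
    if (c <= mid) return rankWT(node->left, lo, mid, c, i - ones);
    return rankWT(node->right, mid + 1, hi, c, ones);
}
```

In the actual index, each node bit string is stored in the compressed, block-partitioned form of Section 3, so the per-level $\mathrm{rank}_1$ takes $O(1)$ time instead of a linear scan.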
3 Wavelet Tree Compression
In this section, we introduce a new compressed representation for the bit strings of wavelet trees, which incorporates the gap $\gamma$-encoding for general text strings. In Sections 3–5, we show that this representation provides excellent query performance both in theory and in practice.
3.1 Encoding Methods
We let $B[1, n]$ denote a bit string of length $n$. Let $m$ be the number of occurrences of the least frequent bit in $B$, so that $m \le n/2$. Let $p_1 < p_2 < \cdots < p_m$ denote the increasing sequence of positions of the least frequent bit in $B$. We assume for convenience $p_0 = 0$. Let $g_i = p_i - p_{i-1}$, for $1 \le i \le m$, denote the gap sequence of pairwise differences of neighboring values of the position sequence of the least frequent bit in $B$. We represent each gap $g_i$ using the Elias $\gamma$ code [8]. The resulting gap $\gamma$-encoding of $B$ is the bit string $\mathit{gap}(B) = \gamma(g_1)\gamma(g_2)\cdots\gamma(g_m)$. Obviously, $\sum_{i=1}^{m} g_i = p_m \le n$.
On the other hand, we can view $B$ as a sequence of $r$ maximal runs of identical bits with lengths $\ell_1, \ell_2, \ldots, \ell_r$, where $\ell_j \ge 1$, $\sum_{j=1}^{r} \ell_j = n$, and the bits of consecutive runs alternate. We can represent the length of each run using the Elias $\gamma$ code [8]. The resulting run-length encoding of $B$ is the bit string $\mathit{rle}(B) = \gamma(\ell_1)\gamma(\ell_2)\cdots\gamma(\ell_r)$.
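As an illustration of these two codes, here is a minimal C++ sketch (helper names are ours) that Elias $\gamma$-codes a gap sequence into a bit vector; the run-length encoder is identical except that it codes the run lengths $\ell_j$ instead of the gaps $g_i$.

```cpp
#include <cstdint>
#include <vector>

// Append the Elias gamma code of x >= 1: floor(log2 x) zeros,
// then the binary representation of x (which starts with a 1 bit).
void gammaEncode(std::vector<bool>& out, uint64_t x) {
    int len = 63 - __builtin_clzll(x);   // floor(log2 x); GCC/Clang builtin
    for (int i = 0; i < len; ++i) out.push_back(false);
    for (int i = len; i >= 0; --i) out.push_back((x >> i) & 1);
}

// gap(B): gamma-code the differences g_i = p_i - p_{i-1} of the (1-based)
// positions of the least frequent bit, with the convention p_0 = 0.
std::vector<bool> gapEncode(const std::vector<uint64_t>& positions) {
    std::vector<bool> out;
    uint64_t prev = 0;
    for (uint64_t p : positions) { gammaEncode(out, p - prev); prev = p; }
    return out;
}
```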
3.2 Structure
The starting point for the structure of the FM-Adaptive algorithm follows from [31], which we incorporate into the following discussion. We start the summary by introducing the compression structure used to represent the bit strings of the nodes of the wavelet tree. We then give an example to show how the newly introduced gap code works, which is the key to improving general-purpose performance.
To enable effective compression of the bit strings of the wavelet tree, we partition the bit strings of the wavelet tree nodes into blocks and categorize the blocks into three basic types:
1. blocks consisting only of all 0s or of all 1s;
2. blocks having relatively long runs of 0s or of 1s; and
3. blocks having a random-like sequence of 0s and 1s.
For each block, we choose one of the following four compression methods to minimize its coding length: All0/All1, Gap0/Gap1, RL0/RL1, and Plain.
Specifically, for a block of type 1 consisting only of all 0s or all 1s, we use the All0/All1 code to encode it; that is, no additional bits are needed to store it. For a block of type 2 having relatively long 0-runs or 1-runs, we use either the Gap0/Gap1 code or the RL0/RL1 code to encode it. Here Gap0/Gap1 means that we apply the gap code (see Section 3.1) to encode the block, and RL0/RL1 means that we apply the run-length code (see Section 3.1) to encode the block; the 0/1 variant records the value of the relevant bit. For a block of type 3 having a random-like sequence of 0s and 1s, we keep the block unchanged, denoted as Plain. This partitioning is combined with the mixed encoding to represent the wavelet tree nodes. The compressed wavelet trees (CWTs) are the ones whose node bit strings are partition-based and mixed-encoded. A sketch of the per-block method selection is given below.
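The following C++ sketch (a minimal variant of ours, not the FM-Adaptive source) shows the per-block selection by comparing the code lengths directly; the real encoder additionally distinguishes the 0/1 variants and emits the chosen code words.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

enum Method : uint8_t { All0, All1, Gap, RL, Plain };

// Bit length of the Elias gamma code of x >= 1.
static size_t gammaLen(uint64_t x) {
    int lg = 63 - __builtin_clzll(x);    // floor(log2 x)
    return 2 * (size_t)lg + 1;
}

// Choose the cheapest method for one block by comparing code lengths only.
Method chooseMethod(const std::vector<bool>& block) {
    const size_t n = block.size();
    size_t ones = 0;
    for (bool b : block) ones += b;
    if (ones == 0) return All0;              // empty code: no bits stored
    if (ones == n) return All1;

    bool least = (2 * ones <= n);            // value of the least frequent bit
    size_t gapLen = 0, prev = 0;             // |gap(block)| with gamma codes
    for (size_t i = 0; i < n; ++i)
        if (block[i] == least) { gapLen += gammaLen(i + 1 - prev); prev = i + 1; }

    size_t rlLen = 0, run = 1;               // |rle(block)|, runs gamma-coded
    for (size_t i = 1; i <= n; ++i) {
        if (i < n && block[i] == block[i - 1]) { ++run; continue; }
        rlLen += gammaLen(run); run = 1;
    }

    size_t best = std::min({gapLen, rlLen, n});
    if (best == gapLen) return Gap;          // Gap0/Gap1 chosen by `least`
    if (best == rlLen) return RL;            // RL0/RL1 chosen by the first bit
    return Plain;                            // keep the raw block
}
```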
Let $B$ denote a bit string of an internal node of a wavelet tree. We use the following three steps to obtain a succinct representation of $B$:
1. Partition $B$ into blocks of size $b$, except possibly the last one.
2. Combine $s$ contiguous blocks to form a superblock of size $s \cdot b$.
3. Apply the mixed encoding to encode each block. We build the encoded sequence $E$ by concatenating the encoded blocks.
We also maintain five extra structures to support fast access to the encoded sequence $E$: $R_s$, $R_b$, $P_s$, $P_b$, and $M$, where $R_s$ stores the number of 1s in $B$ preceding the current superblock; $R_b$ stores the number of 1s in $B$ preceding the current block, relative to the beginning of its enclosing superblock; $P_s$ stores the number of bits in $E$ preceding the current superblock; $P_b$ stores the number of bits in $E$ preceding the current block, relative to the beginning of its enclosing superblock; and $M$ indicates the encoding method used in each block.
Table 2 shows the compression structure for the root bit string 100000111100111111001100101000100100 of the wavelet tree of the BWT for the example string from Table 1, for a given block size $b$ and superblock size $s \cdot b$. The table shows four encodings: All0/All1, Gap0/Gap1, RL0/RL1, and Plain. For purposes of illustrating all four methods, block 4 is encoded there with a method other than the one the algorithm would actually select; in the actual algorithm, the method resulting in the shorter encoding would be chosen for block 4.
$$\mathrm{rank}_1(B, i) \;=\; R_s\big[\lfloor i/(s\cdot b)\rfloor\big] \;+\; R_b\big[\lfloor i/b\rfloor\big] \;+\; \mathrm{decode}\big(\lfloor i/b\rfloor,\; i \bmod b\big) \qquad (1)$$
Using the encoding sequence $E$ of $B$ and the auxiliary structures $R_s$, $R_b$, $P_s$, $P_b$, and $M$, we can compute $\mathrm{rank}_1(B, i)$ by Equation (1). The operation $\mathrm{decode}$ performs All0/All1, Gap0/Gap1, RL0/RL1, or Plain decoding, depending upon $M$, to return the number of 1s up to offset $i \bmod b$ within block $\lfloor i/b \rfloor$. The starting decoding position in $E$ is $P_s[\lfloor i/(s\cdot b)\rfloor] + P_b[\lfloor i/b\rfloor]$. We can access $B[i]$ in a way similar to $\mathrm{rank}_1(B, i)$. A sketch of the two-level computation follows.
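A C++ sketch of Equation (1) under simplifying assumptions (plain, uncompressed blocks, so the in-block decode is just a scan; array names mirror the notation above but are our own):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct RankDir {
    std::vector<bool> bits;        // stand-in for the encoded sequence E
    std::vector<uint64_t> Rs, Rb;  // 1s before superblock / before block
    size_t b, s;                   // block size, blocks per superblock
};

// Build the two rank directories in one left-to-right pass.
void buildDirs(RankDir& d) {
    uint64_t total = 0, inSuper = 0;
    size_t nblk = (d.bits.size() + d.b - 1) / d.b;
    for (size_t blk = 0; blk < nblk; ++blk) {
        if (blk % d.s == 0) { d.Rs.push_back(total); inSuper = 0; }
        d.Rb.push_back(inSuper);
        size_t end = std::min(d.bits.size(), (blk + 1) * d.b);
        for (size_t j = blk * d.b; j < end; ++j)
            if (d.bits[j]) { ++total; ++inSuper; }
    }
}

// rank1(B, i): number of 1s in B[0..i-1], following Equation (1); the
// in-block part is a scan here, but table-driven decoding (Section 3.4)
// makes it O(1) on the encoded blocks of the real index.
uint64_t rank1(const RankDir& d, size_t i) {
    size_t blk = i / d.b;
    uint64_t r = d.Rs[blk / d.s] + d.Rb[blk];
    for (size_t j = blk * d.b; j < i; ++j) r += d.bits[j];
    return r;
}
```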
Lemma 2.
Using the lookup table from Section 3.4, with one entry per possible $t$-bit chunk, we can compute $\mathrm{rank}_1(B, i)$ in $O(1)$ time, for block size $b = O(\log n)$.
3.3 Construction
In this section, we describe the construction of the compressed wavelet tree (CWT). The CWT consists of the compressed bit strings of the wavelet tree nodes. We describe the compression in Algorithm 3 in Appendix A; it produces the encoded sequence $E$ and the five auxiliary structures $R_s$, $R_b$, $P_s$, $P_b$, and $M$. It is simple to see that we can construct the compressed wavelet tree in $O(n\log\sigma)$ time.
Theorem 1.
Given the wavelet tree of the BWT of text $T$, we can construct the CWT in $O(n\log\sigma)$ time.
3.4 Accelerating the Rank Computation
In order to accelerate the rank computation for a gap $\gamma$-encoded block (Gap0/Gap1), we design a lookup table for the gap $\gamma$-encoding, with one entry for every possible chunk of $t$ bits. Using the table, we can process each $t$ bits of a $\gamma$-encoded sequence in constant time. Every entry of the table, corresponding to a bit string of $t$ bits, contains three components:
- The first component stores the total number of decoded entries in the chunk; it is the number of complete decoded gaps.
- The second component stores the cumulative sum of the decoded values (gap values) in the chunk, which corresponds to the largest decoded position.
- The third component stores the total number of decoded bits in the chunk.
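A sketch of how such a table can be precomputed, in our own minimal variant with chunk size $t = 8$ so that the table has 256 entries; each entry records the three components above for one possible chunk:

```cpp
#include <array>
#include <cstdint>

struct ChunkInfo {
    uint8_t entries;  // number of complete gamma codes decoded in the chunk
    uint16_t sum;     // cumulative sum of the decoded gap values
    uint8_t bits;     // bits of the chunk consumed by those complete codes
};

// Precompute the decode table for every possible t-bit chunk (t = 8 here).
std::array<ChunkInfo, 256> buildGammaTable() {
    std::array<ChunkInfo, 256> tab{};
    for (int chunk = 0; chunk < 256; ++chunk) {
        int pos = 0;
        ChunkInfo info{};
        while (true) {
            int zeros = 0;                        // count the unary prefix
            while (pos + zeros < 8 && !((chunk >> (7 - pos - zeros)) & 1))
                ++zeros;
            int len = 2 * zeros + 1;              // total gamma code length
            if (pos + zeros >= 8 || pos + len > 8) break;  // code incomplete
            uint16_t val = 0;                     // read the binary part
            for (int i = 0; i <= zeros; ++i)
                val = (val << 1) | ((chunk >> (7 - pos - zeros - i)) & 1);
            ++info.entries;
            info.sum += val;
            pos += len;
            info.bits = (uint8_t)pos;
        }
        tab[chunk] = info;
    }
    return tab;
}
```

At query time, the rank decoding consumes an encoded block chunk by chunk: whole chunks are charged in $O(1)$ each via the table as long as the cumulative sum stays below the target offset, and only the final chunk is decoded bit by bit.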
3.5 Space Usage
We let $B$ of length $n$ denote the bit string of an internal node of the wavelet tree of the BWT, and we let $m$ be the number of occurrences of the least frequent bit in $B$, so that $m \le n/2$. Applying the gap code to $B$, we get the resulting bit string $\mathit{gap}(B)$, whose space bound is given in Lemma 3; it represents an improved space bound over the ones in [14, 24]:
Lemma 3.
$|\mathit{gap}(B)| \le 2nH_0(B)$.
Proof.
By the definition of $\mathit{gap}(B)$, we have
$$|\mathit{gap}(B)| \;=\; \sum_{i=1}^{m}\big(2\lfloor\log g_i\rfloor + 1\big) \;\le\; 2\sum_{i=1}^{m}\log g_i + m \;\le\; 2m\log\frac{n}{m} + m \;\le\; 2nH_0(B).$$
The first equality is due to the Elias $\gamma$ code [8]. The second inequality is due to Jensen's inequality together with $\sum_{i=1}^{m} g_i \le n$ and $m \le n/2$ (see Section 3.1). The last step is due to Definition 2 of zero-order empirical entropy, which gives $nH_0(B) \ge m\log(n/m) + m/2$ when $m \le n/2$.
We partition $B$ into blocks of size $b$, and represent each block $B_j$ as the gap sequence of positions of the least frequent bit of $B_j$, as described in Section 3.1.
We compress each block $B_j$ by choosing the one of the four compression methods that minimizes its encoding length: All0/All1, Gap0/Gap1, RL0/RL1, and Plain. Specifically, the encoding length of $B_j$ is $\min\{\,|\mathit{all}(B_j)|,\ |\mathit{gap}(B_j)|,\ |\mathit{rle}(B_j)|,\ |B_j|\,\}$, where $|\cdot|$ denotes the bit length of the encoding.
Theorem 2.
Given the BWT of text $T$ of length $n$ over an alphabet of size $\sigma$, we can represent the BWT using the compressed wavelet tree (CWT) in $2nH_k(T) + o(n\log\sigma)$ bits of space, for any $k$ such that $k \le \alpha\log_\sigma n - 1$ and any fixed constant $0 < \alpha < 1$, where $H_k(T)$ denotes the $k$th-order empirical entropy of $T$.
Theorem 2 follows by the analysis of Foschini et al. [15] and Huo et al. [28]. The key idea of [15] was to implicitly consider the $k$-contexts for purposes of analysis. The gap encoding adapts to each $k$-context as it is encountered, with some space inefficiency because gaps can span context boundaries. However, the extra space is matched by the space savings achieved by using a single wavelet tree, which avoids all the space overhead that would be needed if there were an individual wavelet tree for each $k$-context.
Figure 2 shows the bit string $B$ of the root node of the wavelet tree for the BWT of $T$ from Table 1 and Figure 1. The second row is a hypothetical partition of the BWT by $k$-contexts for $k = 2$, and each part is called a context block. Let the $k$-context of $L[i]$ denote the length-$k$ prefix of the suffix $T[SA[i], n]$. For each $k$-context $w$, the context block of $w$ can be formed conceptually by choosing the symbol in $L$ preceding each occurrence of the $k$-context and concatenating those preceding symbols. The third row is a partition of $B$ by contexts, and each part is called a context bit block; the partition of $B$ is the same as the partition of the BWT according to contexts. The fourth row is a partition of $B$ into blocks of fixed length $b$. The red region (see Figure 2) comprises the context bit block corresponding to the context block (a substring of the BWT) for the context at.
Each $k$-context block remains contiguous in the internal nodes of the wavelet tree, as shown in red in Figure 2 for the 2-context at. By adapting Lemmas 3 and 1, the encoding lengths for a given $k$-context $w$ over all the wavelet tree nodes sum up to $2|w_T|H_0(w_T)$ for that context, and by Definition 2 of $k$th-order entropy, summing over all the $k$-contexts yields $2nH_k(T)$. The analysis of [15, 28] considers how and whether the various hypothetical $k$-contexts overlap with the blocks and superblocks, which incurs some extra space cost. The resulting code length is stated in Theorem 2.
4 Pattern Matching
In this section, we use FM-Adaptive to implement three types of string matching queries: count, locate, and substring extract. We defer the count query to Appendix B.
4.1 Locate Query
We sample the suffix array in the same manner as in Huo et al. [32], and we denote by $SA_d$ the sampling, where $d$ is the sampling step size. We use a conceptual bit array $D$ of length $n$ to record the indices corresponding to the sampled values in $SA$. The locate algorithm without memoization is given in Appendix A.
Theorem 3.
Let $d$ be the step size for the suffix array sampling. For the given index range $[sp, ep]$ in $SA$ of pattern $P$, we can answer a locate query and find the $occ$ occurrences of $P$ using the CWT of the BWT of $T$ in $O(occ \cdot d\log\sigma)$ time, using $2nH_k(T) + o(n\log\sigma) + O((n/d)\log n)$ bits of space, for any $k$ such that $k \le \alpha\log_\sigma n - 1$ and any constant $0 < \alpha < 1$, where $H_k(T)$ denotes the $k$th-order empirical entropy of $T$.
Proof.
Basic operations rank and select on the bit array $D$ take constant time. Each computation of $LF$ using the CWT takes $O(\log\sigma)$ time by Lemma 2. We can find a sample in at most $d$ steps, which takes $O(d\log\sigma)$ time per occurrence. Thus the locate algorithm runs in $O(occ \cdot d\log\sigma)$ time, where $occ$ is the number of occurrences of $P$ in $T$.
There are $n/d$ samples of $SA$, and each needs $\log n$ bits, so the space required by the sampled suffix array is $O((n/d)\log n)$ bits. We can store the bit array $D$ in $n + o(n)$ bits and support rank and select on it in $O(1)$ time [35, 6]. The remaining auxiliary structures occupy lower-order space. The space required by the index structure itself is given in Theorem 2.
Summing the space required by the CWT, the sampled suffix array $SA_d$, and the bit array $D$, we obtain the space bound of $2nH_k(T) + o(n\log\sigma) + O((n/d)\log n)$ bits for the locate query.
Now we consider the practical improvement of the locate query. The improvement is based upon memoization, which takes advantage of the overlapping-subproblems property. Assume that we are given the index range $[sp, ep]$ in $SA$ of the suffixes prefixed by $P$. According to our sampling method on the suffix array, for each $i \in [sp, ep]$, we can walk at most $d$ steps on $LF$ to reach a suffix array sample. For any $i, j \in [sp, ep]$ with $i \ne j$, we could also perform the same process. The key point is that the process of finding a suffix array sample for some $j$ may include the process of finding a suffix array sample for some $i$. That is, $i$ and $j$ may share the same suffix sample. If we know the suffix position $SA[i]$ and the number of steps (say, $w$) to walk on $LF$ starting at $j$ to reach $i$, we can obtain the suffix position of $j$ by $SA[j] = SA[i] + w$. This allows us to reduce the number of steps walked on $LF$ and thus speed up the locating process.
We use a structure $Loc$ to keep the suffix positions $SA[i]$ for each $i \in [sp, ep]$. We compute the values of $Loc$ in two ways:
1. by walking on $LF$ using only the sampled suffix array $SA_d$; and
2. by using already-computed values of $Loc$ and the data structures $R$ and $W$ described below.
The array $R$ records, for each row $v$ visited during the walks, the index $i$ such that the walk starting at $i$ reached $v$ for the first time; $W$ is an array of the same length, and $W[v]$ is the number of steps using $LF$ to walk from $i$ to $v$. Thus, if a later walk from $j$ reaches $v$ after $t$ steps, then $SA[j] = Loc[i] - W[v] + t$.
Algorithm 1 gives the pseudocode of the practical improvement of the locate query. We initialize all $Loc$, $R$, and $W$ entries to undefined sentinel values. A simplified sketch of the memoized walk follows.
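The following C++ sketch conveys the memoized walk under our assumptions (a bit vector $D$ marking sampled rows, the sampled values held in a map, and an $LF$ step function); it is a simplified rendering of the idea behind Algorithm 1, not the exact FM-Adaptive code.

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Memoized locate: resolve SA[i] for every row i in [sp, ep].
// lf(v) is one backward step (SA[lf(v)] = SA[v] - 1); D marks sampled rows,
// whose SA values are held in SAd. The map memoizes, for each visited row v,
// the query row that reached v first and the step count needed to get there.
std::vector<int64_t> locate(int64_t sp, int64_t ep,
                            const std::vector<bool>& D,
                            const std::unordered_map<int64_t, int64_t>& SAd,
                            int64_t (*lf)(int64_t)) {
    std::vector<int64_t> Loc(ep - sp + 1, -1);
    std::unordered_map<int64_t, std::pair<int64_t, int64_t>> RW; // v -> (i, W[v])
    for (int64_t j = sp; j <= ep; ++j) {
        int64_t v = j, steps = 0;
        while (true) {
            if (D[v]) {                           // reached a sampled row
                Loc[j - sp] = SAd.at(v) + steps;  // SA[j] = SA[v] + steps
                break;
            }
            auto it = RW.find(v);
            if (it != RW.end() && Loc[it->second.first - sp] >= 0) {
                // An earlier walk from i passed v after W[v] steps, so
                // SA[v] = Loc[i] - W[v]; hence SA[j] = Loc[i] - W[v] + steps.
                Loc[j - sp] = Loc[it->second.first - sp]
                              - it->second.second + steps;
                break;
            }
            if (it == RW.end()) RW[v] = {j, steps};  // first visit: memoize
            v = lf(v);
            ++steps;
        }
    }
    return Loc;
}
```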
4.2 Substring Extract Query
In this section, we consider how to retrieve text substrings using the CWT of the BWT and a sampled inverse suffix array. We sample the inverse suffix array in the same manner as in Huo et al. [32], and we denote by $SA^{-1}_d$ the sampling.
The workflow to retrieve a substring is that we first transform a given text position into its rank in the lexicographical order of the suffixes, using $SA^{-1}_d$, and then retrieve the substring by walking on $LF$. The first step is done by the lookup procedure and the second step by the retrieve procedure. Both procedures are given in Algorithm 2; a simplified sketch follows.
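A compact C++ sketch of this workflow under our assumptions (inverse-SA samples at every $d'$th text position, plus access to $L$ and $LF$); it is a simplification of Algorithm 2, with hypothetical helper names.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// extract(i, len): retrieve T[i .. i+len-1] (0-based) by walking LF backward
// from the inverse-SA sample at the first sampled text position >= i + len.
// Assumptions: ISAd[k] is the row of the suffix starting at text position
// k * dp; bwtAt(v) returns L[v]; lf(v) is one backward step.
std::string extract(int64_t i, int64_t len, int64_t dp,
                    const std::vector<int64_t>& ISAd,
                    char (*bwtAt)(int64_t), int64_t (*lf)(int64_t)) {
    int64_t end = i + len;                   // first position we do not need
    int64_t k = (end + dp - 1) / dp;         // index of next sampled position
    int64_t pos = k * dp;                    // its text position
    int64_t v = ISAd[k];                     // its row in the suffix order
    std::string out;
    while (pos > i) {                        // invariant: L[v] = T[pos - 1]
        --pos;
        char c = bwtAt(v);                   // c = T[pos]
        if (pos < end) out.push_back(c);     // inside the requested window
        v = lf(v);                           // suffix now starts at pos
    }
    std::reverse(out.begin(), out.end());    // symbols were collected backward
    return out;
}
```

The walk has length at most $\ell + d'$, which matches the time bound in Theorem 4 below.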
Theorem 4.
Given position $i$ and length $\ell$, we can answer a substring extract query using the CWT of the BWT of $T$ in $O((\ell + d)\log\sigma)$ time, using $2nH_k(T) + o(n\log\sigma) + O((n/d)\log n)$ bits of space, for any $k$ such that $k \le \alpha\log_\sigma n - 1$ and any constant $0 < \alpha < 1$, where $d$ is the sampling step and $H_k(T)$ denotes the $k$th-order empirical entropy of $T$.
Proof.
The time complexity of the substring extract algorithm is determined by the total running time of the lookup procedure and the retrieve procedure. Each computation of $LF$ using the CWT takes $O(\log\sigma)$ time by Lemma 2, and the maximum walking length of the for loop is $\ell + d$. Consequently, the lookup procedure takes $O(1)$ time, and the retrieve procedure runs in $O((\ell + d)\log\sigma)$ time.
By summing the two parts, we get the running time bound of the algorithm: $O((\ell + d)\log\sigma)$ in the worst case.
The space required by the extract query is the same as that for the locate query, namely $2nH_k(T) + o(n\log\sigma) + O((n/d)\log n)$ bits.
5 Experiments and Analysis
5.1 Experimental Setting, Datasets, and Measure
In this section we describe experiments performed using the environment described in [32]. We used C++ to implement the algorithms and constructed the suffix array using Mori's fast lightweight suffix-sorting library libdivsufsort (github.com/y-256/libdivsufsort) in its 64-bit variant.
Table 3 summarizes some statistical characteristics of the data sets we used for the experiments. The data sets consist of four datasets from the Pizza&Chili corpus (pizzachili.dcc.uchile.cl/texts.html) (datasets 1–4), highly repetitive data sets from the Pizza&Chili repetitive corpus (pizzachili.dcc.uchile.cl/repcorpus.html) (datasets 5–8), the human genome (called hg38 at UCSC) from UCSC (hgdownload.cse.ucsc.edu/goldenPath/hg38) (dataset 9), and NA12877R10 (dataset 10), formed by extracting 10 gigabytes of reads from NA12877_1, downloaded from EBI (www.ebi.ac.uk/ena/browser/view/ERR194146). Datasets 1–4 and 9–10 are nonrepetitive data sets, and datasets 5–8 are highly repetitive data sets. The expression $n/r$ denotes the average run length in the BWT, where $r$ is the number of runs, and $\sigma$ is the alphabet size of the input data. For hg38, we exclude the query on the pattern NN…N.
In our experiments, we examine the following state-of-the-art algorithms for their space usage and query time:
1. AF-Index: the implementation (pizzachili.dcc.uchile.cl/indexes) of the alphabet-friendly FM-index [13, 12], which combines an existing compression boosting technique with the wavelet tree data structure. We used its latest version, af-index_v2.1.
2. FMI-Hybrid: the recent implementation (github.com/simongog/sdsl-lite) of the FM-index [18], which uses the fixed-block boosting technique [36], in which bitvectors by default are implemented by hybrid encoding [37].
3. RL-CSA: the implementation (jltsiren.kapsi.fi/rlcsa) of the compressed suffix array [42] that uses run-length encoding and has been optimized for highly repetitive data. We used its current version of May 2016.
4. GeCSA: the recent implementation (codeocean.com/capsule/3554560/tree/v1) of the compressed suffix array [32], in which a new run-length code is introduced for the gap sequence of $\Phi$ [22, 23].
5. FM-Adaptive: our method described in this paper.
5.2 Improvement of the Locate Query by Memoization
In this section, we make an experimental comparison of the locate query before and after the improvement. The key point of the memoization improvement is to remember the computed suffix positions for some $i \in [sp, ep]$ and then use these computed positions and other auxiliary information to determine the suffix positions of the remaining indices, so as to reduce the number of accesses to $LF$ and thus speed up the locate query.
In the experiments, we perform 500 locate queries and calculate the average number of accesses to $LF$ for each locate query. Figure 3 shows the average number of accesses to $LF$ for each locate query before (in blue) and after (in orange) the improvement by memoization.
As can be seen in Figure 3, the memoization of computed suffix positions generally improves the locate query time substantially. The average number of accesses to $LF$ decreases markedly on english, sources, para, w.leaders, and kernel, as well as on hg38 and NA12877R10, with little or no reduction on proteins, DNA, and influenza, for all tested sampling step sizes.
Figure 3 shows substantial reductions in $LF$ invocations compared with the locate algorithm without memoization shown in Appendix A for most tested data. The memoization did not help on the datasets DNA, proteins, and influenza, because few short patterns occur frequently there, so the condition for reusing a memoized position is rarely satisfied.
To show that the improvement of the locate query is largely due to the memoization rather than to the suffix array sampling method, we performed additional experiments on three locate algorithms on the CWT, which we show in Table 4.
We consider the locate algorithm with position sampling of the suffix array as in [28], the locate algorithm without memoization but with suffix-array value sampling (Algorithm 5 in Appendix A), and the locate algorithm with memoization and suffix-array value sampling (Algorithm 1). We randomly select patterns of a fixed length from the input text and average, for each of the three locate algorithms, the total time over the locate queries, reported in milliseconds in Table 4. We use the same block size $b$ for the three locate algorithms. It can be seen that the locate algorithm with memoization is generally several times to orders of magnitude faster than the other two, across different suffix array sampling step sizes.
5.3 Performance Evaluation
In the following sections we compare the performance of our proposed algorithms with the state-of-the-art indexing methods described above, based upon the performance criteria of compression ratio, locate query time, and extract query time. We use the following parameters for each index:
- AF-Index: sample rate = 64, 128, and 256;
- FMI-Hybrid: sample rate = 32, 64, and 128;
- RL-CSA: sample rate = 32, 64, and 128;
- GeCSA: sample rate = 128, 256, and 512;
- FM-Adaptive: sampling step $d$ = 64, 128, and 256.
The other parameters used are the defaults. The testing method is the same as in [32]. The compression ratio is defined as the ratio of the index size to the original size of the input text. The parameters in our index are the block size $b$, the superblock factor $s$, and the sampling step $d$.
5.3.1 Locate Query
Figure 4 shows the compression ratio and time for the locate query of the compared indexing methods. For the NA12877R10 data, RL-CSA fails to build the index, since it limits the size of the input collection to less than 4 gigabytes. AF-Index fails to build the index on proteins, english, hg38, and NA12877R10, whose input sizes are larger than 1 gigabyte.
As can be seen in Figure 4, FM-Adaptive provides the best compression on the nonrepetitive datasets shown in Table 3. FM-Adaptive is typically several to dozens of times faster than FMI-Hybrid in query time. Specifically, for the nonrepetitive datasets, when minimizing compression ratio, FM-Adaptive gives the best compression and is several times faster than FMI-Hybrid in query time in four out of six nonrepetitive datasets. Speedwise, when minimizing query time, RL-CSA shows the fastest query time in four out of six nonrepetitive datasets, but at a cost of higher space usage; FM-Adaptive remains several times faster than FMI-Hybrid in four of the six. FMI-Hybrid is essentially the same as, or a bit faster than, FM-Adaptive in query time on proteins and DNA, but it uses noticeably more space.
For the repetitive datasets, when minimizing compression ratio, GeCSA gives the best compression on influenza, w.leaders, and kernel, with the exception of FMI-Hybrid on para. FMI-Hybrid shows the best compression ratio on para, but it is several times slower than GeCSA. FM-Adaptive is several times faster than FMI-Hybrid, with comparable compression, on w.leaders, para, kernel, and influenza. Speedwise, when minimizing query time, GeCSA shows the fastest query time on para, w.leaders, and kernel, with the exception of RL-CSA on influenza, but RL-CSA uses considerably more space than GeCSA. AF-Index generally uses substantially more space on the tested data than the other compared indexing methods.
Memoization is a main reason for the improved locate query time of FM-Adaptive. Another reason is the different suffix array sampling. As can be seen in Figure 4, FM-Adaptive provides considerably faster query performance than the other compared FM-index methods on the tested data sets, except for DNA, proteins, and influenza, for which memoization barely works according to Figure 3. At the same time, FM-Adaptive provides comparable or better compression.
5.3.2 Substring Extract Query
The extract query uses the same index as the locate query, so it has the same space usage as the locate query. In this section we focus upon its time performance. Figure 5 shows the compression ratio and time for the extract query for the compared indexing methods.
Figure 5 shows that RL-CSA, GeCSA, AF-Index, and FM-Adaptive give the best extract time on four, three, two, and one of the 10 tested datasets in Table 3, respectively. As can be seen in Figure 5, GeCSA shows good extract performance on repetitive data: it uses less space and is generally many times faster than the other indexing methods, with the exception of AF-Index on para. AF-Index is faster than GeCSA on para but uses considerably more space. The query time of the FM-family indexing methods can be affected by the alphabet size of the input text: when $\sigma$ is small, as for DNA, hg38, and NA12877R10, they have better or competitive time. FM-Adaptive is faster than FMI-Hybrid in query time in eight out of 10 datasets, the exceptions being proteins and english, on which FMI-Hybrid is somewhat faster but uses more space.
6 Conclusion
We have proposed improvements via a new compressed representation of the wavelet tree of the Burrows-Wheeler transform of the input text, which incorporates the gap $\gamma$-encoding. Our new index achieves asymptotic space optimality within a factor of 2 in the leading term, yet it provides better compression and faster retrieval in practice than the competitive optimal compression boosting used in previous FM-indexes. We also presented a practical improved locate algorithm that provides substantially faster locating time.
An interesting open problem is to improve our high-order entropy-compressed text self-index to achieve optimal space and query time both in theory and in practice. Another interesting goal is to adapt our index to compress and index the human genome and reads for the analysis of ultra-high-throughput next-generation sequencing (NGS) data, while supporting approximate matching queries.
References
- [1] Christina Boucher, Davide Cenzato, Zsuzsanna Lipták, Massimiliano Rossi, and Marinella Sciortino. r-indexing the eBWT. Information and Computation, 298:105155, 2024. doi:10.1016/J.IC.2024.105155.
- [2] Nathaniel K. Brown, Travis Gagie, Giovanni Manzini, Gonzalo Navarro, and Marinella Sciortino. Faster run-length compressed suffix arrays. In Alessio Conte, Andrea Marin, Giovanna Rosone, and Jeffrey S. Vitter, editors, From Strings to Graphs, and Back Again: A Festschrift for Roberto Grossi’s 60th Birthday. OASIcs, Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2025.
- [3] M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm, 1994.
- [4] Stefan Canzar and Steven L. Salzberg. Short read mapping: An algorithmic tour. Proceedings of the IEEE, 105(3):436–458, 2017. doi:10.1109/JPROC.2015.2455551.
- [5] Xiaoyang Chen, Hongwei Huo, Jun Huan, Jeffrey Scott Vitter, Weiguo Zheng, and Lei Zou. MSQ-Index: A succinct index for fast graph similarity search. IEEE Transactions on Knowledge and Data Engineering, 33(6):2654–2668, 2021. doi:10.1109/TKDE.2019.2954527.
- [6] David Richard Clark. Compact PAT trees. PhD thesis, University of Waterloo, Waterloo, Canada, 1996.
- [7] Peter Elias. Efficient storage and retrieval by content and address of static files. J. Assoc. Comput. Mach., 21:246–260, 1974. doi:10.1145/321812.321820.
- [8] Peter Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, 21(2):194–203, 1975. doi:10.1109/TIT.1975.1055349.
- [9] Robert M. Fano. On the number of bits required to implement an associative memory. Computer Structures Group, MIT, Cambridge, MA, 1971.
- [10] Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In Proceedings of the 41st Annual IEEE Symposium on Foundations of Computer Science (FOCS’00), pages 390–398, 2000. doi:10.1109/SFCS.2000.892127.
- [11] Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM, 52(4):552–581, 2005. doi:10.1145/1082036.1082039.
- [12] Paolo Ferragina, Giovanni Manzini, Veli Mäkinen, and Gonzalo Navarro. An alphabet-friendly FM-index. In Proceedings of the 11th International Symposium on String Processing and Information Retrieval (SPIRE’04), pages 150–160, 2004. doi:10.1007/978-3-540-30213-1_23.
- [13] Paolo Ferragina, Giovanni Manzini, Veli Mäkinen, and Gonzalo Navarro. Compressed representations of sequences and full-text indexes. ACM Transactions on Algorithms, 3(2):Article 20, 2007.
- [14] Paolo Ferragina, Giovanni Manzini, Veli Mäkinen, and Gonzalo Navarro. The myriad virtues of Wavelet Trees. Information and Computation, 207(8):849–866, 2009. doi:10.1016/J.IC.2008.12.010.
- [15] Luca Foschini, Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. When indexing equals compression: Experiments with compressing suffix arrays and applications. ACM Transactions on Algorithms, 2(4):611–639, 2006. Shorter versions appear in Proceedings of the 15th Annual SIAM/ACM Symposium on Discrete Algorithms (SODA ’04), New Orleans, LA, January 2004, 636–645, and in “Fast compression with a static model in high-order entropy,” Proceedings of IEEE Data Compression Conference, Snowbird, Utah, March 2004, 62–71. doi:10.1145/1198513.1198521.
- [16] Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. Journal of the ACM, 67(1):Article 2, 2020.
- [17] Raffaele Giancarlo, Giovanni Manzini, Antonio Restivo, Giovanna Rosone, and Marinella Sciortino. A new class of string transformations for compressed text indexing. Information and Computation, 294:105068, 2022. doi:10.1016/J.IC.2023.105068.
- [18] Simon Gog, Juha Kärkkäinen, Dominik Kempa, Matthias Petri, and Simon J. Puglisi. Fixed block compression boosting in FM-indexes: Theory and practice. Algorithmica, 81:1370–1391, 2019. doi:10.1007/S00453-018-0475-9.
- [19] Simon Gog, Gonzalo Navarro, and Matthias Petri. Improved and extended locating functionality on compressed suffix arrays. Journal of Discrete Algorithms, 32:53–63, 2015. doi:10.1016/J.JDA.2015.01.006.
- [20] Simon Gog and Matthias Petri. Optimized succinct data structures for massive data. Software: Practice and Experience, 44(11):1287–1314, 2014. doi:10.1002/SPE.2198.
- [21] Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. High-order entropy-compressed text indexes. In Proceedings of the 14th ACM-SIAM Symposium on Discrete Algorithms (SODA’03), pages 841–850, 2003. URL: http://dl.acm.org/citation.cfm?id=644108.644250.
- [22] Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In Proceedings of the 32nd ACM Symposium on Theory of Computing (STOC’00), pages 397–406, 2000.
- [23] Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing, 35(2):378–407, 2005. doi:10.1137/S0097539702402354.
- [24] Roberto Grossi, Jeffrey Scott Vitter, and Bojian Xu. Wavelet trees: From theory to practice. In Proceedings of the 1st International Conference on Data Compression, Communications and Processing, pages 210–221, 2011. doi:10.1109/CCP.2011.16.
- [25] Wing-Kai Hon, Rahul Shah, and Jeffrey Scott Vitter. Compression, indexing, and retrieval for massive string data. In Proceedings of the 21st annual conference on Combinatorial pattern matching (CPM’10), pages 260–274, 2010. doi:10.1007/978-3-642-13509-5_24.
- [26] David A. Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40(9):1098–1101, 1952.
- [27] Hongwei Huo, Longgang Chen, Jeffrey Scott Vitter, and Yakov Nekrich. A practical implementation of compressed suffix arrays with applications to self-indexing. In Proceedings of the Data Compression Conference (DCC’14), pages 292–301, 2014. doi:10.1109/DCC.2014.49.
- [28] Hongwei Huo, Longgang Chen, Heng Zhao, Jeffrey Scott Vitter, Yakov Nekrich, and Qiang Yu. A Data-aware FM-index. In Proceedings of the 17th Workshop on Algorithm Engineering and Experiments (ALENEX’15), pages 10–23, 2015. doi:10.1137/1.9781611973754.2.
- [29] Hongwei Huo, Xiaoyang Chen, Xu Guo, and Jeffrey Scott Vitter. Efficient compression and indexing for highly repetitive DNA sequence collections. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 18(6):2394–2408, 2021. doi:10.1109/TCBB.2020.2968323.
- [30] Hongwei Huo, Zongtao He, Pengfei Liu, and Jeffrey Scott Vitter. FM-Adaptive: A data-aware FM-index [source code], January 2022. doi:10.24433/CO.7967727.v1.
- [31] Hongwei Huo, Pengfei Liu, Chenhui Wang, Hongbo Jiang, and Jeffrey Scott Vitter. CIndex: Compressed indexes for fast retrieval of FASTQ files. Bioinformatics, 38(2):335–343, 2022. doi:10.1093/BIOINFORMATICS/BTAB655.
- [32] Hongwei Huo, Peng Long, and Jeffrey Scott Vitter. Practical high-order entropy-compressed text self-indexing. IEEE Transactions on Knowledge and Data Engineering, 35(3):2943–2960, 2023. doi:10.1109/TKDE.2021.3114401.
- [33] Hongwei Huo, Zhigang Sun, Shuangjiang Li, Jeffrey Scott Vitter, et al. CS2A: A compressed suffix array-based method for short read alignment. In Proceedings of the Data Compression Conference (DCC’16), pages 271–278, 2016.
- [34] Hongwei Huo, Yongze Yu, Zongtao He, and Jeffrey Scott Vitter. Indexing labeled property multidigraphs in entropy space, with applications. In Proceedings of the 41st IEEE International Conference on Data Engineering (ICDE’25), pages 2478–2492, 2025.
- [35] Guy Jacobson. Space-efficient static trees and graphs. In Proceedings of the 30th Annual IEEE Symposium on Foundations of Computer Science (FOCS’89), pages 549–554, 1989. doi:10.1109/SFCS.1989.63533.
- [36] Juha Kärkkäinen and Simon J. Puglisi. Fixed block compression boosting in FM-indexes. In Proceedings of the 18th International Symposium on String Processing and Information Retrieval (SPIRE’11), pages 174–184, 2011. doi:10.1007/978-3-642-24583-1_18.
- [37] Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi. Hybrid compression of bitvectors for the FM-index. In DCC, pages 302–311, 2014. doi:10.1109/DCC.2014.87.
- [38] Zhize Li, Jian Li, and Hongwei Huo. Optimal in-place suffix sorting. Information and Computation, 285(Part B):104818, 2022. doi:10.1016/J.IC.2021.104818.
- [39] Veli Mäkinen and Gonzalo Navarro. Succinct suffix arrays based on run-length encoding. In Proceedings of the 16th Annual Symposium on Combinatorial Pattern Matching (CPM), pages 45–56, 2005. LNCS 3537. doi:10.1007/11496656_5.
- [40] Veli Mäkinen and Gonzalo Navarro. Implicit compression boosting with applications to self-indexing. In Proceedings of the 14th International Symposium on String Processing and Information Retrieval (SPIRE’07), pages 229–241, 2007. doi:10.1007/978-3-540-75530-2_21.
- [41] Veli Mäkinen and Gonzalo Navarro. Dynamic entropy-compressed sequences and full-text indexes. ACM Transactions on Algorithms, 4(3):article 32, 2008.
- [42] Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki. Storage and retrieval of highly repetitive sequence collections. J Comput Biol., 17(3):281–308, 2010. doi:10.1089/CMB.2009.0169.
- [43] Udi Manber and Gene Myers. Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing, 22(5):935–948, 1993. doi:10.1137/0222058.
- [44] Giovanni Manzini. An analysis of the Burrows-Wheeler transform. Journal of the ACM, 48(3):407–430, 2001. doi:10.1145/382780.382782.
- [45] Edward M. McCreight. A space-economical suffix tree construction algorithm. Journal of the ACM, 23(2):262–272, 1976. doi:10.1145/321941.321946.
- [46] J. Ian Munro, Gonzalo Navarro, and Yakov Nekrich. Fast compressed self-indexes with deterministic linear-time construction. Algorithmica, 82(2):316–337, 2020. A conference version of this paper appeared in Proc. ISAAC 2017. doi:10.1007/S00453-019-00637-X.
- [47] Gonzalo Navarro. Indexing highly repetitive string collections, part ii: Compressed indexes. ACM Computing Surveys, 54(2):Article 26, 2021.
- [48] Gonzalo Navarro. Compact Data Structures: A Practical Approach. Cambridge University Press, New York, USA, September 2016.
- [49] Gonzalo Navarro and Veli Mäkinen. Compressed full-text indexes. ACM Computing Surveys, 39(1):Article 2, 2007.
- [50] Gonzalo Navarro and Nicola Prezza. Universal compressed text indexing. Theoretical Computer Science, 762:41–50, 2019. doi:10.1016/J.TCS.2018.09.007.
- [51] Rajeev Raman, Venkatesh Raman, and S. Srinivasa Rao. Succinct indexable dictionaries with applications to encoding $k$-ary trees, prefix sums and multisets. ACM Transactions on Algorithms, 3(4):Article 43, 2007. doi:10.1145/1290672.1290680.
- [52] Kunihiko Sadakane. Compressed text databases with efficient query algorithms based on the compressed suffix array. In Proceedings of the 11th Symposium on Algorithms and Computation (ISAAC’00), pages 410–421, 2000. doi:10.1007/3-540-40996-3_35.
- [53] Kunihiko Sadakane. New text indexing functionalities of the compressed suffix arrays. Journal of Algorithms, 48(2):294–313, 2003. doi:10.1016/S0196-6774(03)00087-7.
- [54] Esko Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249–260, 1995. doi:10.1007/BF01206331.
Appendix A Algorithms
In the algorithms of this appendix, $B$ denotes a node bit string of the wavelet tree.
Appendix B Count Query
Given a pattern string $P[1, p]$ of $p$ symbols, where each symbol in $P$ and $T$ belongs to a fixed alphabet $\Sigma$ of size $\sigma$, an occurrence of the pattern at position $i$ of $T$ means that the substring $T[i, i+p-1]$ is equal to $P$. The count algorithm given below reports the range $[sp, ep]$ in $SA$ of the suffixes prefixed by $P$, using the array $C$ and rank operations.
During the search for $P$, the algorithm maintains the following invariant: let $[sp, ep]$ denote the range of suffixes with prefix $P[j+1, p]$. Then, for $c = P[j]$, the range of suffixes with prefix $P[j, p]$ is $[\,C[c] + \mathrm{rank}_c(L, sp - 1) + 1,\ C[c] + \mathrm{rank}_c(L, ep)\,]$. A basic rank operation takes $O(\log\sigma)$ time on the compressed wavelet tree by Lemma 2. As a count query does two rank operations per symbol of $P$, it runs in $O(p\log\sigma)$ time. (Some more complicated implementations of wavelet trees, such as that in [13], make use of multiway branching to shorten the height of the tree and reduce traversal time, but we use the simpler binary branching, which works well in practice.)
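A minimal C++ sketch of this backward search, using a naive scan-based rank over the plain BWT purely for clarity (the index instead performs each rank on the compressed wavelet tree in $O(\log\sigma)$ time):

```cpp
#include <array>
#include <string>
#include <utility>

// Occurrences of c in L[0..i]; i may be -1 (empty prefix). Naive scan;
// the compressed wavelet tree replaces this in the real index.
static long rankC(const std::string& L, unsigned char c, long i) {
    long r = 0;
    for (long j = 0; j <= i; ++j)
        if ((unsigned char)L[j] == c) ++r;
    return r;
}

// Backward search: returns the (0-based, inclusive) SA range [sp, ep] of
// suffixes prefixed by P, or an empty range (sp > ep) if P does not occur.
std::pair<long, long> count(const std::string& L,
                            const std::array<long, 256>& C,
                            const std::string& P) {
    long sp = 0, ep = (long)L.size() - 1;
    for (long j = (long)P.size() - 1; j >= 0 && sp <= ep; --j) {
        unsigned char c = P[j];
        sp = C[c] + rankC(L, c, sp - 1);   // first row with the new prefix
        ep = C[c] + rankC(L, c, ep) - 1;   // last row with the new prefix
    }
    return {sp, ep};                       // number of occurrences: ep - sp + 1
}
```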
Theorem 5.
Let $P$ be a query pattern of length $p$. We can answer a count query of $P$ using the CWT of the BWT of $T$ in $O(p\log\sigma)$ time, using $2nH_k(T) + o(n\log\sigma)$ bits of space, for any $k$ such that $k \le \alpha\log_\sigma n - 1$ and any constant $0 < \alpha < 1$, where $H_k(T)$ denotes the $k$th-order empirical entropy of $T$.