Fast Compressed Self-Indexes with Deterministic Linear-Time Construction

We introduce a compressed suffix array representation that, on a text $T$ of length $n$ over an alphabet of size $\sigma$, can be built in $O(n)$ deterministic time, within $O(n\log\sigma)$ bits of working space, and counts the number of occurrences of any pattern $P$ in $T$ in time $O(|P| + \log\log_w \sigma)$ on a RAM machine of $w=\Omega(\log n)$-bit words. This new index outperforms all the other compressed indexes that can be built in linear deterministic time, and some others. The only faster indexes can be built in linear time only in expectation, or require $\Theta(n\log n)$ bits. We also show that, by using $O(n\log\sigma)$ bits, we can build in linear time an index that counts in time $O(|P|/\log_\sigma n + \log n(\log\log n)^2)$, which is RAM-optimal for $w=\Theta(\log n)$ and sufficiently long patterns.


Introduction
The string indexing problem consists in preprocessing a string T so that, later, we can efficiently find occurrences of patterns P in T. The most popular solutions to this problem are suffix trees [35] and suffix arrays [24]. Both can be built in O(n) deterministic time on a text T of length n over an alphabet of size σ, and the best variants can count the number of times a string P appears in T in time O(|P|), and even in time O(|P|/log_σ n) in the word-RAM model if P is given packed into |P|/log_σ n words [31]. Once counted, each occurrence can be located in O(1) time. Those optimal times, however, come with two important drawbacks: (1) the variants with this counting time cannot be built in O(n) worst-case time, and (2) the data structures use Θ(n log n) bits of space.
The reason for the first drawback is that some form of perfect hashing is always used to ensure constant time per pattern symbol (or pack of symbols). The classical suffix trees and arrays with linear-time deterministic construction offer O(|P| log σ) and O(|P| + log n) counting time, respectively. More recently, those times have been reduced to O(|P| + log σ) [10] and even to O(|P| + log log σ) [15]. Simultaneously with our work, a suffix tree variant was introduced by Bille et al. [7], which can be built in linear deterministic time and counts in time O(|P|/log_σ n + log |P| + log log σ). All those indexes, however, still suffer from the second drawback: they use Θ(n log n) bits of space, which makes them impractical in most applications that handle large text collections.
Research on the second drawback dates back almost two decades [30], and has led to indexes using nH_k(T) + o(n(H_k(T) + 1)) bits, where H_k(T) ≤ log σ is the k-th order entropy of T [25], for any k ≤ α log_σ n − 1 and any constant 0 < α < 1. That is, the indexes use asymptotically the same space as the compressed text, and can reproduce the text and search it; thus they are called self-indexes. The fastest compressed self-indexes that can be built in linear deterministic time count in time O(|P| log log σ) [1] or O(|P|(1 + log_w σ)) [6]. There exist other compressed self-indexes that obtain times O(|P|) [5] or O(|P|/log_σ n + log_σ^ε n) for any constant ε > 0 [19], but both rely on perfect hashing and are not built in linear deterministic time. All those compressed self-indexes use O((n/b) log n) further bits to locate the position of each occurrence found in time O(b), and to extract any substring S of T in time O(|S| + b).
In this paper we introduce the first compressed self-index that can be built in O(n) deterministic time (moreover, using O(n log σ) bits of space [28]) and with counting time O(|P| + log log_w σ), where w = Ω(log n) is the size in bits of the computer word. We obtain our results with a combination of the compressed suffix tree T of T and the Burrows-Wheeler transform B̄ of the reversed text T̄. We manage to simulate the suffix tree traversal for P simultaneously on T and on B̄. With a combination of deterministic dictionaries and precomputed rank values stored for sampled nodes of T, and a constant-time method to compute an extension of partial rank queries that considers small ranges in B̄, we ensure that all the suffix tree steps, except one, require constant time. The remaining one is solved with a general rank query in time O(log log_w σ). As a byproduct, we show that the compressed sequence representations that obtain those rank times [6] can also be built in linear deterministic time.
Compared with previous work, other indexes may be faster at counting, but they are either not built in linear deterministic time [5,19,31] or not compressed [31,7]. Our index outperforms all the previous compressed indexes [13,1,6], as well as some uncompressed ones [15], that can be built deterministically.
As an application of our tools, we also show that an index using O(n log σ) bits of space can be built in linear deterministic time, so that it counts in time O(|P|/log_σ n + log n (log log n)^2), which is RAM-optimal for w = Θ(log n) and sufficiently long patterns. Current indexes obtaining similar counting time require O(n log σ) construction time [19] or higher [31], or O(n log n) bits of space [31,7].
Related Work

Searches typically require counting the number of times P appears in T, and then locating the positions of T where P occurs. The vast majority of the indexes for this task are suffix tree [35] or suffix array [24] variants.
The suffix tree can be built in linear deterministic time [35,26,34], even on arbitrarily large integer alphabets [11]. The suffix array can be easily derived from the suffix tree in linear time, but it can also be built independently in linear deterministic time [23,22,21]. In their basic forms, these structures allow counting the number of occurrences of a pattern P in T in time O(|P| log σ) (suffix tree) or O(|P| + log n) (suffix array). Once counted, the occurrences can be located in constant time each.
Cole et al. [10] introduced suffix trays, a simple twist on suffix trees that reduces their counting time to O(|P| + log σ). Fischer and Gawrychowski [15] introduced wexponential search trees, which yield suffix trees with counting time O(|P| + log log σ) and support dynamism.
All these structures can be built in linear deterministic time, but require Θ(n log n) bits of space, which challenges their practicality when handling large text collections.
Faster counting is possible if we resort to perfect hashing and give up the linear deterministic construction time. In the classical suffix tree, we can easily achieve O(|P|) time by hashing the children of suffix tree nodes, and this is optimal in general. In the RAM model with word size Θ(log n), and if the consecutive symbols of P come packed into |P|/log_σ n words, the optimal time is instead O(|P|/log_σ n). This optimal time was recently reached by Navarro and Nekrich [31] (note that their time is not optimal if w = ω(log n)), with a simple application of weak-prefix search, already hinted at in the original article [2]. However, even the randomized construction time of the weak-prefix search structure is O(n log^ε n), for any constant ε > 0. By replacing the weak-prefix search with the solution of Grossi and Vitter [19] for the last nodes of the search, and using a randomized construction of their perfect hash functions, the index of Navarro and Nekrich [31] can be built in linear randomized time and count in time O(|P|/log_σ n + log_σ^ε n). Only recently, simultaneously with our work, a deterministic linear-time construction algorithm was finally obtained for an index with O(|P|/log_σ n + log |P| + log log σ) counting time [7].
Still, these structures are not compressed. Compressed suffix trees and arrays appeared in the year 2000 [30]. To date, they take the space of the compressed text and replace it, in the sense that they can extract any desired substring of T; they are thus called self-indexes. The space occupied is measured in terms of the k-th order empirical entropy of T, H_k(T) ≤ log σ [25], which is a lower bound on the space reached by any statistical compressor that encodes each symbol considering only the k previous ones. Self-indexes may occupy as little as nH_k(T) + o(n(H_k(T) + 1)) bits, for any k ≤ α log_σ n − 1 and any constant 0 < α < 1. The fastest self-indexes with linear-time deterministic construction are those of Barbay et al. [1], which counts in time O(|P| log log σ), and Belazzougui and Navarro [6, Thm. 7], which counts in time O(|P|(1 + log_w σ)). The latter requires O(n(1 + log_w σ)) construction time, but if log σ = O(log w), its counting time is O(|P|) and its construction time is O(n).
If we admit randomized linear-time constructions, then Belazzougui and Navarro [6, Thm. 10] reach O(|P|(1 + log log_w σ)) counting time. At the expense of O(n) further bits, in another work [5] they reach O(|P|) counting time. Using O(n log σ) bits, and if P comes in packed form, Grossi and Vitter [19] can count in time O(|P|/log_σ n + log_σ^ε n), for any constant ε > 0; however, their construction requires O(n log σ) time. Table 1 puts those results and our contribution in context. Our new self-index, with O(|P| + log log_w σ) counting time, linear-time deterministic construction, and nH_k(T) + o(n log σ) bits of space, dominates all the compressed indexes with linear-time deterministic construction [1,6], as well as some uncompressed ones [15] (to be fair, we do not cover the case log σ = O(log w), as in this case previous work [6, Thm. 7] already obtains our result). Our self-index also dominates a previous one with linear-time randomized construction [6, Thm. 10], which we incidentally show can also be built deterministically. The only aspect in which some of those dominated indexes may outperform ours is that they may use o(n(H_k(T) + 1)) [6, Thm. 10] or o(n) [6, Thm. 7] bits of redundancy, instead of our o(n log σ) bits. We also derive a compact index (i.e., one using O(n log σ) bits) that is built in linear deterministic time and counts in time O(|P|/log_σ n + log n (log log n)^2), which is the only one in this category unless we consider constant σ for Grossi and Vitter [19].

Preliminaries
We denote by T[i..j] the substring T[i] T[i+1] ⋯ T[j] of T[0..n − 1]. We assume that the text T ends with a special symbol $ that lexicographically precedes all other symbols in T. The alphabet size is σ and symbols are integers in [0..σ − 1] (so $ corresponds to 0). In this paper, as in the previous work on this topic, we use the word RAM model of computation: a machine word consists of w = Ω(log n) bits and we can execute standard bit and arithmetic operations in constant time. We assume for simplicity that the alphabet size is σ = O(n/log n) (otherwise the text is almost incompressible anyway [16]). We also assume log σ = ω(log w), since otherwise our goal is already reached in previous work [6, Thm. 7].

Rank and Select Queries
We define three basic queries on sequences. Let B[0..n − 1] be a sequence over the alphabet [0..σ − 1], a a symbol, i a position, and j ≥ 1. Then access(i, B) returns B[i]; rank_a(i, B) returns the number of occurrences of a in B[0..i] (with rank_a(−1, B) = 0); and select_a(j, B) returns the position of the j-th occurrence of a in B.
We can answer access queries in O(1) time and select queries in any ω(1) time, or vice versa, and rank queries in time O(log log_w σ), which is optimal [6]. These structures use n log σ + o(n log σ) bits, and we will use variants that require only compressed space. In this paper, we will show that those structures can be built in linear deterministic time.
An important special case of rank queries is the partial rank query, rank_{B[i]}(i, B), which asks for the rank of the symbol that is stored at the queried position. Unlike general rank queries, partial rank queries can be answered in O(1) time [6]. Such a structure can be built in O(n) deterministic time and requires O(n log log σ) bits of working and final space [28, Thm. A.4.1].
For this paper, we define a generalization of partial rank queries called interval rank queries, rank_a(i, j, B) = ⟨rank_a(i − 1, B), rank_a(j, B)⟩, from which in particular we can deduce the number of times a occurs in B[i..j]. If a does not occur in B[i..j], however, this query just returns null (this is why it can be regarded as a generalized partial rank query).
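As a concrete reference for these definitions, the following naive sketch spells out the query semantics; the actual structures answer them within the bounds stated above, without scanning the sequence.

```python
# Naive reference implementations that pin down the query semantics
# (positions are 0-based; the real structures answer these without scanning).

def access(i, B):
    return B[i]

def rank(a, i, B):
    # number of occurrences of a in B[0..i]; rank(a, -1, B) = 0
    return sum(1 for c in B[:i + 1] if c == a)

def select(a, j, B):
    # position of the j-th occurrence (j >= 1) of a in B, or None
    seen = 0
    for pos, c in enumerate(B):
        if c == a:
            seen += 1
            if seen == j:
                return pos
    return None

def partial_rank(i, B):
    # rank_{B[i]}(i, B): the queried symbol is the one stored at position i
    return rank(B[i], i, B)

def interval_rank(a, i, j, B):
    # <rank_a(i-1, B), rank_a(j, B)> if a occurs in B[i..j], else None
    if a not in B[i:j + 1]:
        return None
    return (rank(a, i - 1, B), rank(a, j, B))

B = "abracadabra"
print(rank('a', 6, B), select('a', 2, B), interval_rank('b', 0, 4, B))
```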
In the special case where the alphabet size is small, log σ = O(log w), we can represent B so that rank, select, and access queries are answered in O(1) time [6, Thm. 7], but we are not focusing on this case in this paper, as the problem has already been solved for this case.

Suffix Array and Suffix Tree
The suffix tree [35] for a string T[0..n − 1] is a compacted digital tree on the suffixes of T, where the leaves point to the starting positions of the suffixes. We call X_u the string leading to suffix tree node u. The suffix array [24] is an array SA[0..n − 1] such that T[SA[i]..] is the (i + 1)-th lexicographically smallest suffix of T. All the occurrences of a substring P in T correspond to suffixes of T that start with P. These suffixes descend from a single suffix tree node, called the locus of P, and also occupy a contiguous interval in the suffix array SA. Note that the locus of P is the node u closest to the root for which P is a prefix of X_u. If P has no locus node, then it does not occur in T.
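The correspondence between occurrences of P and a contiguous suffix array interval can be sketched with a plain (quadratic-time) suffix array and binary search; the code and the sample text are only illustrative.

```python
from bisect import bisect_left, bisect_right

# Illustrative suffix array plus binary search for the contiguous SA
# interval of the suffixes that start with P.

def suffix_array(T):
    return sorted(range(len(T)), key=lambda i: T[i:])  # toy; linear-time builds exist

def pattern_range(P, T, SA):
    prefixes = [T[i:i + len(P)] for i in SA]  # sorted, since the suffixes are
    lo = bisect_left(prefixes, P)
    hi = bisect_right(prefixes, P) - 1
    return lo, hi                              # empty interval iff lo > hi

T = "mississippi$"
SA = suffix_array(T)
lo, hi = pattern_range("ssi", T, SA)
print(hi - lo + 1, sorted(SA[lo:hi + 1]))  # 2 [2, 5]
```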

Compressed Suffix Array and Tree
A compressed suffix array (CSA) is a compact data structure that provides the same functionality as the suffix array. The main component of a CSA is the one that allows determining, given a pattern P, the suffix array range SA[i..j] of the suffixes starting with P. Counting is then solved as j − i + 1. For locating any cell SA[k], and for extracting any substring S of T, most CSAs make use of a sampled array SAM_b, which contains the values of SA that are multiples of b. Here b is a tradeoff parameter: CSAs require O((n/b) log n) further bits and can locate in time proportional to b, and extract S in time proportional to b + |S|. We refer to a survey [30] for a more detailed description.
A compressed suffix tree [33] is formed by a compressed suffix array and other components that add up to O(n) bits. These include in particular a representation of the tree topology that supports constant-time computation of the preorder of a node, its number of children, its j-th child, its number of descendant leaves, and lowest common ancestors, among others [32]. Computing node preorders is useful to associate satellite information to the nodes.
Both the compressed suffix array and tree can be built in O(n) deterministic time using O(n log σ) bits of space [28].

Burrows-Wheeler Transform and FM-index
The Burrows-Wheeler transform (BWT) of T is the sequence B[0..n − 1] with B[i] = T[SA[i] − 1] (taking T[−1] = T[n − 1]); that is, B[i] is the symbol preceding the (i + 1)-th lexicographically smallest suffix of T. Hence, we can build the BWT by sorting the suffixes and writing the symbols that precede the suffixes in lexicographical order.
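A direct transcription of this construction (sort the suffixes, emit the preceding symbols) looks as follows; it is a toy sketch, not an efficient build.

```python
# Toy BWT construction: sort the suffixes, then emit the symbol that
# precedes each suffix (T[-1] wraps around to the last symbol).

def bwt(T):
    # T must end with the unique, smallest symbol '$'
    SA = sorted(range(len(T)), key=lambda i: T[i:])  # O(n^2 log n) toy sort
    return "".join(T[i - 1] for i in SA)

print(bwt("mississippi$"))  # ipssm$pissii
```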
The FM-index [12,13] is a CSA that builds on the BWT. It consists of the following three main components: the sequence B, with support for access and rank queries; the array Acc[0..σ − 1], where Acc[a] is the number of occurrences of symbols smaller than a in T; and the sampled array SAM_b. The search proceeds backwards on P[0..m − 1]: if B[l..r] is the interval of the suffixes that start with P[i + 1..m − 1], then the interval of the suffixes that start with P[i..m − 1] is B[Acc[a] + rank_a(l − 1, B) .. Acc[a] + rank_a(r, B) − 1], where a = P[i]. Thus the interval of P is found by answering 2m rank queries. Any sequence representation offering rank and access queries can then be applied on B to obtain an FM-index. An LF-step, LF(i) = Acc[B[i]] + rank_{B[i]}(i, B), maps the position i of a symbol T[j] in B to the position of T[j − 1] in B; note that it requires only a partial rank query.
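The backward search can be sketched as follows; the naive rank() loop stands in for the O(log log_w σ)-time rank structures, and Acc is computed by summing symbol frequencies.

```python
from collections import Counter

# Sketch of FM-index backward search: two rank queries per pattern symbol
# shrink the BWT interval from the suffixes of P toward the whole pattern.

def bwt(T):
    SA = sorted(range(len(T)), key=lambda i: T[i:])
    return "".join(T[i - 1] for i in SA)

def fm_count(P, B):
    cnt = Counter(B)
    Acc, total = {}, 0
    for a in sorted(cnt):        # Acc[a] = number of symbols smaller than a
        Acc[a] = total
        total += cnt[a]
    def rank(a, i):              # occurrences of a in B[0..i]
        return sum(1 for c in B[:i + 1] if c == a)
    l, r = 0, len(B) - 1
    for a in reversed(P):        # process P right to left
        if a not in Acc:
            return 0
        l = Acc[a] + rank(a, l - 1)
        r = Acc[a] + rank(a, r) - 1
        if l > r:                # empty interval: P does not occur
            return 0
    return r - l + 1             # counting = interval length

print(fm_count("ssi", bwt("mississippi$")))  # 2
```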

Small Interval Rank Queries
We start by showing how a compressed data structure that supports select queries can be extended to support a new kind of query that we dub small interval rank queries. An interval rank query rank_a(i, j, B) is a small interval rank query if j − i ≤ log² σ. Our compressed index relies on the following result (Lemma 2): if we have O(t)-time access to a sequence C[0..m − 1] over the alphabet [0..σ − 1], as well as constant-time partial rank queries on it, then O(m log log σ) additional bits suffice to answer small interval rank queries on C in time O(t).

Proof. We split C into groups G_i of log² σ consecutive symbols, G_i = C[i log² σ..(i + 1) log² σ − 1]. Let A_i denote the sequence of the distinct symbols that occur in G_i. Storing A_i directly would need log σ bits per symbol. Instead, we encode each element of A_i as its first position in G_i, which needs only O(log log σ) bits. With this encoded sequence, since we have O(t)-time access to C, we have access to any element of A_i in time O(t). In addition, we store a succinct SB-tree [18] on the elements of A_i. This structure uses O(p log log u) bits to index p elements in [1..u], and supports predecessor (and membership) queries in time O(log p / log log u) plus one access to A_i. Since u = σ and p ≤ log² σ, the query time is O(t) and the space usage is bounded by O(m log log σ) bits.
For each a ∈ A_i we also keep the increasing list I_{a,i} of all the positions where a occurs in G_i. Positions are stored as differences with the left border of G_i: if C[j] = a, we store the difference j − i log² σ. Hence the elements of I_{a,i} can also be stored in O(log log σ) bits per symbol, adding up to O(m log log σ) bits. We also build an SB-tree on top of each I_{a,i} to provide for predecessor searches.
Using the SB-trees on A_i and I_{a,i}, we can answer small interval rank queries rank_a(x, y, C). Consider a group G_i = C[i log² σ..(i + 1) log² σ − 1], an index k such that i log² σ ≤ k ≤ (i + 1) log² σ, and a symbol a. We can find the largest position i log² σ ≤ r ≤ k such that C[r] = a, or determine that it does not exist: first we look for the symbol a in A_i; if a ∈ A_i, we find the predecessor of k − i log² σ in I_{a,i}. Now consider an interval C[x..y] of size at most log² σ. It intersects at most two groups, G_i and G_{i−1}. We find the rightmost occurrence y′ ≤ y of symbol a in C[x..y] as follows. First we look for the rightmost occurrence y′ ≤ y of a in G_i; if a does not occur in C[i log² σ..y], we look for the rightmost occurrence y′ ≤ i log² σ − 1 of a in G_{i−1}. If this is ≥ x, we find the leftmost occurrence x′ of a in C[x..y′] using a symmetric procedure. Once x′ ≤ y′ are found, we can compute rank_a(x′, C) and rank_a(y′, C) in O(1) time by answering partial rank queries (Section 3.1); since x′ and y′ are the extreme occurrences of a in C[x..y], the answer is ⟨rank_a(x − 1, C), rank_a(y, C)⟩ = ⟨rank_a(x′, C) − 1, rank_a(y′, C)⟩.
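The group-based scheme can be mirrored in a toy sketch, with a small group width g standing in for the log²σ-symbol groups, and naive list scans standing in for the SB-trees and the O(1)-time partial rank queries; the layout and names are illustrative.

```python
# Toy mirror of the group-based small interval rank scheme.

def build_groups(C, g):
    lists = []                      # lists[i][a] = offsets of a inside group i
    for start in range(0, len(C), g):
        occ = {}
        for off, a in enumerate(C[start:start + g]):
            occ.setdefault(a, []).append(off)  # offsets fit in O(log log σ) bits
        lists.append(occ)
    return lists

def small_interval_rank(a, x, y, C, lists, g):
    # <rank_a(x-1, C), rank_a(y, C)> if a occurs in C[x..y], else None
    def rightmost(i, k):  # rightmost occurrence of a in group i at offset <= k
        cand = [o for o in lists[i].get(a, []) if o <= k]
        return i * g + cand[-1] if cand else None
    def leftmost(i, k):   # leftmost occurrence of a in group i at offset >= k
        cand = [o for o in lists[i].get(a, []) if o >= k]
        return i * g + cand[0] if cand else None
    gi, gj = x // g, y // g         # the interval spans at most two groups
    yy = rightmost(gj, y - gj * g)
    if yy is None and gi < gj:
        yy = rightmost(gi, g - 1)
    if yy is None or yy < x:
        return None                  # a does not occur in C[x..y]
    xx = leftmost(gi, x - gi * g)
    if xx is None or xx > y:
        xx = leftmost(gj, 0)
    rank = lambda p: sum(1 for c in C[:p + 1] if c == a)  # stands in for partial rank
    return (rank(xx) - 1, rank(yy))

C = "abacbcabba"
lists = build_groups(C, 4)
print(small_interval_rank('b', 2, 7, C, lists, 4))  # (1, 3)
```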

Compressed Index
We classify the nodes of the suffix tree T of T into heavy, light, and special, as in previous work [31,28]. Let d = log σ. A node u of T is heavy if it has at least d leaf descendants and light otherwise. We say that a heavy node u is special if it has at least two heavy children.
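A minimal sketch of this classification on an explicit tree; the dict-based representation and the threshold value are illustrative, with d standing in for log σ.

```python
# Illustrative classification of tree nodes into heavy / light / special:
# heavy = at least d leaf descendants; special = heavy with >= 2 heavy children.

def classify(children, root, d):
    leaves, kind = {}, {}
    def count(u):                    # number of leaf descendants of u
        kids = children.get(u, [])
        leaves[u] = 1 if not kids else sum(count(v) for v in kids)
        return leaves[u]
    count(root)
    for u in leaves:
        if leaves[u] < d:
            kind[u] = "light"
        elif sum(1 for v in children.get(u, []) if leaves[v] >= d) >= 2:
            kind[u] = "special"      # heavy node with at least two heavy children
        else:
            kind[u] = "heavy"
    return leaves, kind

children = {"root": ["u", "v"], "u": ["a", "b", "c"], "v": ["x", "y", "z"]}
leaves, kind = classify(children, "root", 2)
print(kind["root"], kind["u"], kind["a"])  # special heavy light
```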
For every special node u, we construct a deterministic dictionary [20] D_u that contains the labels of all the heavy children of u: if the j-th child u_j of u is heavy and the first symbol on the edge from u to u_j is a_j, then we store the key a_j in D_u with j as satellite data. If a heavy node u has only one heavy child u_j and d or more light children, then we also store the data structure D_u (containing only that heavy child of u). If, instead, a heavy node has one heavy child and fewer than d light children, we just keep the index of the heavy child using O(log d) = O(log log σ) bits.
The second component of our index is the Burrows-Wheeler transform B̄ of the reversed text T̄. We store a data structure that supports rank, partial rank, select, and access queries on B̄. It is sufficient for us to support access and partial rank queries in O(1) time and rank queries in O(log log_w σ) time. We also construct the data structure described in Lemma 2, which supports small interval rank queries in O(1) time. Finally, we explicitly store the answers to some rank queries. Let B̄[l_u..r_u] denote the range of X̄_u, the reverse of X_u, for a suffix tree node u. For all data structures D_u, and for every symbol a ∈ D_u, we store the values of rank_a(l_u − 1, B̄) and rank_a(r_u, B̄).
Let us show how to store the selected precomputed answers to rank queries in O(log σ) bits per query. Following a known scheme [17], we divide the sequence B̄ (the BWT of the reversed text) into chunks of size σ, so that each precomputed value can be expressed relative to its chunk in O(log σ) bits. The sequence representation that supports access and rank queries on B̄ can be made to use nH_k(T) + o(n(H_k(T) + 1)) bits, by exploiting the fact that it is built on a BWT [6, Thm. 10]. (That solution uses constant-time select queries on B̄ instead of constant-time access, so that select can be used to perform LF^{-1}-steps in constant time. Instead, with our partial rank queries, we can perform LF-steps in constant time (recall Section 3.4), and thus have constant-time access instead of constant-time select on B̄; we actually do not use select queries at all. They avoid this solution because partial rank queries require o(n log σ) bits, which can be more than o(n(H_k(T) + 1)), but we are already paying this price.)
Apart from this space, the array Acc needs O(σ log n) = O(n) bits and SAM_b uses O((n/b) log n) bits. The total space usage of our self-index then adds up to nH_k(T) + o(n log σ) + O((n/b) log n) bits.

Pattern Search
Given a query string P, we will find in time O(|P| + log log_w σ) the range of the reversed string P̄ in B̄, the BWT of the reversed text. The backward search for P in B is replaced by an analogous backward search for P̄ in B̄; that is, processing P from left to right, we find the range of the reverse of each prefix P[0..i]. The idea is to traverse the suffix tree T in synchronization with this search on B̄, until the locus of P is found or we determine that P does not occur in T. Our procedure starts at the root node of T, with l_{−1} = 0, r_{−1} = n − 1, and i = 0. We compute the ranges B̄[l_i..r_i] that correspond to P[0..i] for i = 0, ..., |P| − 1. Simultaneously, we move down in the suffix tree. Let u denote the last visited node of T and let a = P[i]. We denote by u_a the next node that we must visit in the suffix tree, i.e., u_a is the locus of P[0..i]. We can compute l_i and r_i in O(1) time if rank_a(r_{i−1}, B̄) and rank_a(l_{i−1} − 1, B̄) are known. We will show below that these queries can be answered in constant time because either (a) the answers to the rank queries are explicitly stored in D_u, or (b) the rank query that must be answered is a small interval rank query. The only exception is the situation when we move from a heavy node to a light node in the suffix tree; in this case the rank query takes O(log log_w σ) time. We note that, once we are in a light node, we need not descend in T anymore; it is sufficient to maintain the interval in B̄.
For ease of description we distinguish between the following cases.

1. Node u is heavy and a ∈ D_u. In this case we identify the heavy child u_a of u that is labeled with a in constant time using the deterministic dictionary. We can also find l_i and r_i in time O(1) because rank_a(l_{i−1} − 1, B̄) and rank_a(r_{i−1}, B̄) are stored in D_u (B̄ being the BWT of the reversed text).

2. Node u is heavy and a ∉ D_u. In this case u_a, if it exists, is a light node. We then find it with two standard rank queries on B̄, in order to compute l_i and r_i or determine that P does not occur in T.

3. Node u is heavy but we do not keep a dictionary D_u for the node u. In this case u has at most one heavy child and fewer than d light children. We have two subcases:

3a. If u_a is the (only) heavy child, we find this out with a single comparison, as the heavy child is identified in u. However, the values rank_a(l_{i−1} − 1, B̄) and rank_a(r_{i−1}, B̄) are not stored. Since the light children of u hold fewer than d² = log² σ leaves overall, fewer than log² σ positions of B̄[l_{i−1}..r_{i−1}] contain symbols other than a, so the leftmost and rightmost occurrences of a in the interval lie within its first and last log² σ positions, respectively. We find them with the structures of Section 4 and then obtain rank_a(l_{i−1} − 1, B̄) and rank_a(r_{i−1}, B̄) with two partial rank queries, all in O(1) time.

3b. If u_a is not the heavy child, then it is light, and we proceed as in case 2, with two standard rank queries on B̄. Note that if u is light we do not need to consider this case; we may directly apply case 4.

4. Node u is light. Then the interval B̄[l_{i−1}..r_{i−1}] is of size less than d ≤ log² σ, so we compute l_i and r_i with a small interval rank query in O(1) time; if it returns null, then P does not occur in T.

Except for the cases 2 and 3b, we can find l_i and r_i in O(1) time. In cases 2 and 3b we need O(log log_w σ) time to answer general rank queries. However, these cases only take place when the node u is heavy and its child u_a is light. Since all descendants of a light node are light, those cases occur only once along the traversal of P. Hence the total time to find the range of P̄ in B̄ is O(|P| + log log_w σ). Once the range is known, we can count and report all the occurrences of P in the standard way.
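The shape of this case analysis can be mirrored in a small toy model, where a plain trie over the suffixes plays the role of the suffix tree and we merely count how often the expensive (general rank) step can fire; all names and the threshold are illustrative.

```python
# Toy model of the descent: the expensive step (heavy node -> light child)
# can fire at most once, since all descendants of a light node are light.

def build_trie(T):
    root = {}
    for i in range(len(T)):
        node = root
        for c in T[i:]:
            node = node.setdefault(c, {})
    return root

def nleaves(node):
    return 1 if not node else sum(nleaves(ch) for ch in node.values())

def search(P, root, d):
    node, heavy, expensive = root, True, 0
    for c in P:
        if c not in node:
            return 0, expensive          # P does not occur
        child = node[c]
        child_heavy = nleaves(child) >= d
        if heavy and not child_heavy:
            expensive += 1               # the single O(log log_w σ) rank query
        node, heavy = child, child_heavy
    return nleaves(node), expensive      # occurrences, expensive steps

root = build_trie("mississippi$")
print(search("ssi", root, 2))    # (2, 0): stays on heavy nodes
print(search("ssiss", root, 2))  # (1, 1): goes light exactly once
```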

Linear-Time Construction

On top of the sequences B and B̄ (the latter the BWT of the reversed text), we build the representation that supports access in O(1) and rank in O(log log_w σ) time [6]. In their original paper, those structures are built using perfect hashing, but a deterministic construction is also possible [4, Lem. 11]; we give the details next.

Sequences and Related Structures
The key part of the construction is that, within a chunk of σ symbols, we must build a virtual list I_a of the positions where each symbol a occurs, and provide predecessor search on those lists in O(log log_w σ) time. We divide each list into blocks of log² σ elements, and create a succinct SB-tree [18] on the block elements, much as in Section 4. The search time inside a block is then O(t), where t is the time to access an element of I_a, and the total extra space is O(n log log σ) bits. If there is more than one block in I_a, then the block minima are inserted into a predecessor structure [6, App. A] that finds the closest preceding block minimum in time O(log log_w σ) and uses O(n log log σ) bits. This structure uses perfect hash functions, called I(P), which provide constant-time membership queries. We replace them with deterministic dictionaries [20]. The only disadvantage of these dictionaries is that they require O(log σ) construction time per element and, since each element is inserted into O(log log_w σ) structures I(P), the total construction time per element is O(log σ log log_w σ). However, since we build these structures only on the O(n/log² σ) block minima, the total construction time is only O(n).
In the variant of the structure that provides constant-time access, the access to an element of I_a is provided via a permutation structure [29], which offers access time t with O((n/t) log σ) bits of extra space. Therefore, for any log σ = ω(log w), we can have t = O(log log_w σ) within o(n log σ) bits of space.

Structures D u
The most complex part of the construction is to fill the data of the D_u structures. We visit all the nodes of T and identify those nodes u for which the data structure D_u must be constructed. This can easily be done in linear time, using the constant-time computation of the number of descendant leaves: to determine if we must build D_u, we traverse its children u_1, u_2, ... and count their descendant leaves to decide if they are heavy or light.
We use a bit vector D to mark the preorders of the nodes u for which D_u will be constructed: if p is the preorder of node u, then u stores a structure D_u iff D[p] = 1, in which case D_u is stored in an array at position rank_1(D, p). If, instead, u does not store D_u but has one heavy child, we store its child rank in another array indexed by rank_0(D, p), using log log σ bits per cell.
The main difficulty is how to compute the symbols a to be stored in D_u, and the ranges B̄[l_u..r_u], for all the selected nodes u. It is not easy to do this through a preorder traversal of T because we would need to traverse edges that represent many symbols. Our approach, instead, is inspired by the navigation of the suffix-link tree using two BWTs given by Belazzougui et al. [3]. Let T^w denote the tree whose edges correspond to Weiner links between internal nodes of T. That is, the root of T^w is the root of T and, if we have internal nodes u, v ∈ T where X_v = a · X_u for some symbol a, then v descends from u by the symbol a in T^w. We first show that the nodes of T^w are exactly the internal nodes of T. The inclusion is clear by definition in one direction; the other is well-known, but we prove it for completeness.

Proof. We proceed by induction on |X_u|, where the base case holds by definition. Now let a non-root internal node u of T be labeled by the string X_u = aX. This means that there are at least two distinct symbols a_1 and a_2 such that both aXa_1 and aXa_2 occur in the text T. Then both Xa_1 and Xa_2 also occur in T. Hence there is an internal node u′ with X_{u′} = X in T, and a Weiner link from u′ to u. Since |X_{u′}| = |X_u| − 1, it holds by the inductive hypothesis that u′ belongs to T^w, and thus u belongs to T^w as a child of u′.
We do not build T^w explicitly, but just traverse its nodes conceptually in depth-first order and compute the symbols to store in the structures D_u and the intervals in B̄. Let u be the current node of T in this traversal and ū its corresponding locus in the suffix tree T̄ of the reversed text. Assume for now that ū is a node of T̄, too. Let [l_u, r_u] be the interval of X_u in B and [l̄_u, r̄_u] be the interval of the reverse string X̄_u in B̄. Our algorithm starts at the root nodes of T^w, T, and T̄, which correspond to the empty string, where the intervals in B and B̄ are [l_u, r_u] = [l̄_u, r̄_u] = [0, n − 1]. We will traverse only the heavy nodes, yet in some cases we will have to work on all the nodes. We ensure that we spend at most O(log σ) time on heavy nodes, and at most O(1) time on arbitrary nodes.
Upon arriving at each node u, we first compute its heavy children. From the topology of T we identify the interval [l_i, r_i] of every child u_i of u, by counting the leaves in the subtrees of the successive children of u. By reporting all the distinct symbols of B̄[l̄_u..r̄_u] (the interval of the reverse string X̄_u in the BWT B̄ of the reversed text) together with their frequencies, we identify the labels of those children. However, the labels are retrieved in arbitrary order and we cannot afford sorting them all. Yet, since the labels come associated with their frequencies in B̄[l̄_u..r̄_u], which match their numbers of leaves in the subtrees of u, we can discard the labels of the light children, that is, those appearing fewer than d times in B̄[l̄_u..r̄_u]. The remaining, heavy, labels are then sorted and associated with the successive heavy children u_i of u in T.
If our preliminary pass marked that a D_u structure must be built for u, we construct at this point the deterministic dictionary [20] with the labels a of the heavy children of u we have just identified, and associate with them the satellite data rank_a(l̄_u − 1, B̄) and rank_a(r̄_u, B̄), where [l̄_u, r̄_u] is the interval of the reverse string X̄_u in the BWT B̄ of the reversed text. This construction takes O(log σ) time per element, but it involves only heavy nodes.
We now find all the Weiner links from u. For every (heavy or light) child u_i of u, we compute the list L_i of all the distinct symbols that occur in B[l_i..r_i]. We mark those symbols in an array V[0..σ − 1] that holds three possible values: not seen, seen, and seen (at least) twice. If V[a] is not seen, then we mark it as seen; if it is seen, we mark it as seen twice; otherwise we leave it as seen twice. We collect a list E_u of the symbols that become seen twice along this process, in arbitrary order. For every symbol a in E_u there is an explicit Weiner link from u labeled by a: let X = X_u; if a occurred in L_i and L_j, then both aXa_i and aXa_j occur in T, and thus there is a suffix tree node that corresponds to the string aX. The total time to build E_u amortizes to O(n): for each child v of u, we pay O(1) time for each child the node v has in T^w; each node of T^w contributes once to the cost.
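The three-state marking can be sketched as follows; plain lists stand in for the BWT intervals and for the enumeration of distinct symbols, and the names are illustrative.

```python
# Sketch of collecting E_u: the symbols that occur in the intervals of at
# least two children, found with the three-state marking array V.

def symbols_seen_twice(child_intervals, B, sigma):
    V = ["not seen"] * sigma
    E = []
    for (l, r) in child_intervals:
        for a in set(B[l:r + 1]):        # distinct symbols of one child's interval
            if V[a] == "not seen":
                V[a] = "seen"
            elif V[a] == "seen":
                V[a] = "seen twice"
                E.append(a)              # collected once, in arbitrary order
    return E

B = [0, 1, 2, 1, 0, 3]                   # symbols as integers in [0..σ-1]
print(sorted(symbols_seen_twice([(0, 2), (3, 5)], B, 4)))  # [0, 1]
```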
The targets of the Weiner links from u correspond to the children of the node ū, the locus of the reverse string X̄_u in the suffix tree T̄ of the reversed text. To find them, we collect all the distinct symbols of B[l_u..r_u] and their frequencies. Again, we discard the symbols with frequency below d, as they lead to light nodes, which we do not have to traverse. The others are sorted and associated with the successive heavy children of ū in T̄. By counting the leaves in the successive children of ū, we obtain the intervals B̄[l̄_i..r̄_i] corresponding to those heavy children ū_i.
We are now ready to continue the traversal of T^w: for each Weiner link from u by a symbol a leading to a heavy node, which turns out to be the i-th child of ū (the locus of X̄_u in the suffix tree T̄ of the reversed text), we know that its node in T̄ is ū_i (computed from ū using the tree topology) and its interval in B is [x, y] = [Acc[a] + rank_a(l_u − 1, B), Acc[a] + rank_a(r_u, B) − 1]. This requires O(log log_w σ) time, but applies only to heavy nodes. Finally, the corresponding node in T is obtained in constant time as the lowest common ancestor of the x-th and the y-th leaves of T. In the description above we assumed for simplicity that ū is a node of T̄. In the general case ū can be located on an edge of T̄. This situation arises when all the occurrences of X̄_u in the reverse text T̄ are followed by the same symbol a. In this case there is at most one Weiner link from u, and the interval in B̄ does not change as we follow that link.
A recursive traversal of T^w might require O(nσ log n) bits for the stack, because we store several integers associated with heavy children during the computation of each node u. We can limit the stack height by determining the largest subtree among the Weiner link targets of u, traversing all the others recursively, and then moving to that largest Weiner link target without recursion [3, Lem. 1]. Since only the largest subtree of a Weiner link target can contain more than half of the nodes of the subtree of u, the stack is guaranteed to have height only O(log n). The space usage is thus O(σ log² n) = O(n log σ).
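The stack-bounding trick is generic; here is a sketch on an explicit tree with precomputed subtree sizes (names illustrative). Every recursive call enters a subtree with at most half the nodes of its parent's subtree, which bounds the recursion depth logarithmically.

```python
# Generic form of the stack-bounding trick: recurse only into the smaller
# subtrees and iterate into the largest one.

def traverse(children, size, root, visit):
    u = root
    while True:
        visit(u)
        kids = children.get(u, [])
        if not kids:
            return
        largest = max(kids, key=lambda v: size[v])
        for v in kids:
            if v != largest:
                traverse(children, size, v, visit)  # smaller subtree: recurse
        u = largest                                 # largest subtree: iterate

children = {"r": ["a", "b"], "a": ["a1"], "b": ["b1", "b2"]}
size = {"r": 6, "a": 2, "a1": 1, "b": 3, "b1": 1, "b2": 1}
order = []
traverse(children, size, "r", order.append)
print(order)  # visits every node exactly once, starting at "r"
```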
As promised, we have spent at most O(log σ) time on heavy nodes, which are O(n/d) = O(n/log σ) in total, so these costs add up to O(n). All other costs, which apply to arbitrary nodes, are O(1). The structures for partial rank queries (and the succinct SB-trees) can also be built in linear deterministic time, as shown in Section 4. Therefore our index can be constructed in O(n) time.

A Compact Index
As an application of our techniques, we show that it is possible to obtain $O(|P|/\log_\sigma n + \log^2 n)$ search time, and even $O(|P|/\log_\sigma n + \log n(\log\log n)^2)$, with an index that uses $O(n\log\sigma)$ bits and is built in linear deterministic time. We store $B$ in compressed form and a sample of the heavy nodes of $T$. Following previous work [19,31], we start from the root and store a deterministic dictionary [20] with all the highest suffix tree nodes $v$ representing strings of depth $\geq \ell = \log_\sigma n$. The key associated with each node is a $\log(n)$-bit integer formed with the first $\ell$ symbols of the string $P_v$. The satellite data are the length $|P_v|$, a position where $P_v$ occurs in $T$, and the range $B[l_v..r_v]$ of $v$. From each of those nodes $v$, we repeat the process with the first $\ell$ symbols that follow $P_v$, and so on. The difference is that no light node is inserted in those dictionaries. Let us charge the $O(\log n)$ bits of space to the children nodes, which are all heavy. If we count only the special nodes, which are $O(n/d)$, this amounts to $O((n\log n)/d)$ total bits and construction time. Recall that $d$ is the maximum subtree size of light nodes. This time we use $d = \Theta(\log n)$ to have linear construction time and $O(n)$ bits of space, and thus we will not take advantage of small rank interval queries.
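The dictionary keys can be formed as in the following sketch, where the first $\ell$ symbols of $P_v$ are packed into a single integer of at most $\log n$ bits. The encoding map `code` and the parameter names are illustrative assumptions, not part of the paper.

```python
def pack_key(prefix, code, sigma_bits):
    """Pack a string of ell symbols into one (ell * sigma_bits)-bit integer,
    at most log n bits when ell = log_sigma(n). `code` maps each symbol to
    an integer below 2**sigma_bits; this packed integer is the key stored
    in the deterministic dictionary."""
    key = 0
    for c in prefix:
        key = (key << sigma_bits) | code[c]  # append one sigma_bits-wide symbol
    return key
```

With this packing, looking up the first $\ell$ symbols of the pattern costs a single dictionary probe instead of $\ell$ per-symbol steps.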
There are, however, heavy nodes that are not special. These form possibly long chains between special nodes, which also induce chains of sampled nodes. While building the dictionaries for those nodes is trivial because they have only one sampled child, the total space may add up to $O(n\log\sigma)$ bits if there are $\Theta(n)$ heavy nodes and the sampling chooses one out of $\ell$ nodes in the chains. To avoid this, we increase the sampling step in those chains, enlarging it to $\log n$. This makes the extra space spent in sampling heavy non-special nodes $O(n)$ bits as well.
In addition, we store the text $T$ with a data structure that uses $nH_k(T) + o(n\log\sigma)$ bits. The idea is to speed up the traversal on the light nodes, as follows. For each light node $v$, we store the leaf in $T$ where the heavy path starting at $v$ ends. The heavy path chooses at each node the subtree with the most leaves, thus any traversal towards a leaf has to switch to another heavy path only $O(\log d)$ times. At each light node $v$, we go to the leaf $u$ of its heavy path, obtain its position in $T$ using the sampled array $SAM_b$ of $B$, and compare the rest of $P$ with the corresponding part of the suffix, by chunks of $\log_\sigma n$ symbols. Once we determine the number $k$ of symbols that coincide with $P$ in the path from $v$ to $u$, we perform a binary search for the highest ancestor $v'$ of $u$ where $|P_{v'}| - |P_v| \geq k$. If $|P_{v'}| - |P_v| > k$, then $P$ does not appear in $T$ (unless $P$ ends at the $k$-th character compared, in which case the locus of $P$ is $v'$). Otherwise, we continue the search from $v'$.

To store the leaf $u$ corresponding to each light node $v$, we record the difference between the preorder numbers of $u$ and $v$, which requires $O(\log d)$ bits. The node $u$ is easily found in constant time from this information [33]. We have the problem, however, that we spend $O(\log d) = O(\log\log n)$ bits per light node, which adds up to $O(n\log\log n)$ bits. To reduce this to $O(n)$, we choose a second sampling step $e = O(\log\log n)$ and do not store this information on nodes with less than $e$ leaves, which are called light-light. Those light nodes with $e$ leaves or more are called light-heavy, and those with at least two light-heavy children are called light-special. There are $O(n/e)$ light-special nodes. We store heavy path information only for light-special nodes or for light-heavy nodes that are children of heavy nodes; both are $O(n/e)$ in total. A light-heavy node $v$ that is not light-special has at most one light-heavy child $u$, and the heavy path that passes through $v$ must continue towards $u$.
Therefore, if it turns out that the search must continue from $v$ after the binary search on the heavy path, then it must continue towards the light-light children of $v$, so no heavy-path information is needed at node $v$.
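The comparison-and-binary-search step on a heavy path can be sketched as follows. This is an illustrative stand-in, not the paper's implementation: `lcp_with_suffix` compares symbol by symbol where the paper uses word-packed chunks of $\log_\sigma n$ symbols, and the ancestor search operates on the (relative) string depths of the heavy-path nodes, assumed sorted increasingly.

```python
import bisect

def lcp_with_suffix(P, j, T, pos):
    """Number k of symbols of P[j:] matching T[pos:]. The paper performs
    this comparison by word-packed chunks of log_sigma(n) symbols; here
    we compare one symbol at a time for clarity."""
    k = 0
    while j + k < len(P) and pos + k < len(T) and P[j + k] == T[pos + k]:
        k += 1
    return k

def highest_matching_ancestor(depths, k):
    """Given the string depths of the heavy-path nodes below v (sorted
    increasingly, measured relative to |P_v|), binary search the highest
    node v' with depth >= k; returns its index on the path."""
    return bisect.bisect_left(depths, k)
```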
Once we reach the first light-light node $v$, we proceed as we did for Theorem 4 on light nodes, in total time $O(e\,b) = O(\log n\log\log n)$. We need, however, the interval $B[l_v..r_v]$ before we can start the search from $v$.

Conclusions
We have shown how to build, in $O(n)$ deterministic time and using $O(n\log\sigma)$ bits of working space, a compressed self-index for a text $T$ of length $n$ over an alphabet of size $\sigma$ that searches for patterns $P$ in time $O(|P| + \log\log_w \sigma)$, on a $w$-bit word RAM machine. This improves upon previous compressed self-indexes requiring $O(|P|\log\log\sigma)$ [1] or $O(|P|(1 + \log_w \sigma))$ [6] time, upon previous uncompressed indexes requiring $O(|P| + \log\log\sigma)$ time [15] (which, however, support dynamism), and upon previous compressed self-indexes requiring $O(|P|(1 + \log\log_w \sigma))$ time and randomized construction (which we now showed how to build in linear deterministic time).