String Indexing with Compressed Patterns

Given a string S of length n, the classic string indexing problem is to preprocess S into a compact data structure that supports efficient subsequent pattern queries. In this article, we consider the basic variant where the pattern is given in compressed form and the goal is to achieve query time that is fast in terms of the compressed size of the pattern. This captures the common client-server scenario, where a client submits a query and communicates it in compressed form to a server. Instead of the server decompressing the query before processing it, we consider how to efficiently process the compressed query directly. Our main result is a novel linear space data structure that achieves near-optimal query time for patterns compressed with the classic Lempel-Ziv 1977 (LZ77) compression scheme. Along the way, we develop several data structural techniques of independent interest, including a novel data structure that compactly encodes all LZ77 compressed suffixes of a string in linear space and a general decomposition of tries that reduces the search time from logarithmic in the size of the trie to logarithmic in the length of the pattern.


Introduction
The string indexing problem is to preprocess a string S into a compact data structure that supports efficient subsequent pattern matching queries, that is, given a pattern string P, report all occurrences of P within S. In this paper, we introduce a basic variant of string indexing, called the string indexing with compressed pattern problem, where the pattern P is given in compressed form and we want to answer the query without decompressing P. The goal is to obtain a compact structure while achieving fast query times in terms of the compressed size of P. The string indexing with compressed pattern problem captures the following common client-server scenario: a client submits a query and sends it to a server which processes the query. To minimize communication time and bandwidth, the query is sent in compressed form. Naively, the server will then have to decompress the query and then process it. With an efficient solution to the string indexing with compressed pattern problem we can eliminate the overhead of decompression and speed up queries by exploiting repetitions in pattern strings.
We focus on the classic Lempel-Ziv 1977 (LZ77) [29] compression scheme. Note that since the size of an LZ77 compressed string is a lower bound for many other compression schemes (such as all grammar-based compression schemes), our results can be adapted to such compression schemes by recompressing the pattern string. To state the bounds, let n be the length of S, m be the length of P, and z be the LZ77 compressed length of P. Naively, we can solve the string indexing with compressed pattern problem by using a suffix tree of S as our data structure and answering queries by first decompressing them and then traversing the suffix tree with the uncompressed pattern. This leads to a solution with O(n) space and O(m + occ) query time. At the other extreme, we can store a trie of all the LZ77 compressed suffixes of S together with a simple tabulation, leading to a solution with O(n^3) space and O(z + occ) query time (see discussion in Section 3).
We present the first non-trivial solution to the string indexing with compressed pattern problem achieving the following bound:
Theorem 1. We can solve the string indexing with compressed pattern problem for LZ77-compressed patterns in O(n) space and O(z + log m + occ) time, where n is the length of the indexing string, m is the length of the pattern, and z is the number of phrases in the LZ77 compressed pattern.
Since any solution must use at least Ω(z + occ) time to read the input and report the occurrences, the time bound in Theorem 1 is optimal within an additive O(log m) term. In the common case when z = O(log m) or if we consider LZ77 without self-references the time bound is optimal. For simplicity, we focus on reporting queries, but the result is straightforward to extend to also support existential queries (decide if the pattern occurs in S) and counting queries (count the number of occurrences of the pattern in S) in O(z + log m) time and the same space.
To achieve Theorem 1 we develop several data structural techniques of independent interest. These include a compact data structure that encodes all LZ77 compressed suffixes of a string in linear space and a general decomposition of tries that reduces the search time from logarithmic in the size of the trie to logarithmic in the length of the pattern.
The paper is organized as follows. In Section 2 we recall basic string data structures and LZ77 compression. In Section 3 we present a simple O(n^2) space and O(z + log n + occ) time data structure that forms the basis of our solutions in the following sections. In Section 4 we show how to achieve linear space with the same time complexity. Finally, in Section 5 we show how to improve the log n term to log m.

Preliminaries
A string S of length n is a sequence S[0] · · · S[n − 1] of n characters drawn from an alphabet Σ. The string S[i] · · · S[j − 1], denoted S[i, j], is called a substring of S. The substrings S[0, j] and S[i, n] are called the j-th prefix and i-th suffix of S, respectively. We will sometimes use S_i to denote the i-th suffix of S.

Longest Common Prefix
For two strings S and S', the longest common prefix of S and S', denoted lcp(S, S'), is the maximum j ∈ {0, ..., min(|S|, |S'|)} such that S[0, j] = S'[0, j].
Given a string S of length n, there is a data structure of size O(n) that answers lcp-queries for any two suffixes of S in constant time by storing a suffix tree combined with an efficient nearest common ancestor (NCA) data structure [16,28].

Compact Trie
A compact trie for a set D of strings S 1 , . . . , S l is a rooted labeled tree T D , with the following properties: The label on each edge is a substring of one or more S i . If the set of strings is prefix free, each root-to-leaf path represents a string in the set (obtained by concatenating the labels on the edges of the path), and for every string there is a leaf corresponding to that string. Common prefixes of two strings share the same path maximally, and all internal vertices have at least two children.
The compact trie has O(l) nodes and edges and a total space complexity of O(Σ_{i=1}^{l} |S_i|). The position in the trie that corresponds to the maximum longest common prefix of a pattern P of length m and any S_i can be found in O(m) time. For a position p in the tree, which can be either a node or a position within the label of an edge, let str(p) denote the string obtained by concatenating the labels on the path from the root to p. The locus of a string P in T_D, denoted locus(P), is the deepest position p in the tree such that str(p) is a prefix of P. A compact trie on the suffixes of a string S is called the suffix tree of S and can be stored in linear space [28]. The suffix array stores the starting positions of the suffixes in the string in lexicographic order. If at every node in the suffix tree its children are stored in lexicographic order, the order of the suffix array corresponds to the order of the leaves in the suffix tree.
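The correspondence between suffix tree nodes and contiguous suffix array ranges can be illustrated with a small sketch. It builds the suffix array naively by sorting (not in linear time) and locates the range of suffixes that start with a given pattern by binary search. The function names `suffix_array` and `occurrence_range` are hypothetical and chosen only for this illustration.

```python
from bisect import bisect_left, bisect_right

def suffix_array(s):
    """Starting positions of the suffixes of s in lexicographic order (naive)."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def occurrence_range(s, sa, pattern):
    """Half-open range [lo, hi) of suffix array entries that start with pattern."""
    m = len(pattern)
    # Truncating lexicographically sorted suffixes to m characters keeps them sorted,
    # so the suffixes prefixed by the pattern form one contiguous block.
    heads = [s[i:i + m] for i in sa]
    return bisect_left(heads, pattern), bisect_right(heads, pattern)

s = "ABABACABABA"
sa = suffix_array(s)
lo, hi = occurrence_range(s, sa, "ABA")
occurrences = sorted(sa[lo:hi])  # starting positions of "ABA" in s
```

The range [lo, hi) plays the role of the interval stored at the suffix tree node below the locus of the pattern.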

LZ77
Given an input string S of length n, the LZ77 parsing divides S into z substrings f_1, f_2, ..., f_z, called phrases, in a greedy left-to-right order. The i-th phrase f_i, starting at position p_i, is either (a) the first occurrence of a character in S or (b) the longest substring that has at least one occurrence starting to the left of p_i. If there is more than one occurrence, we assume that the choice is made in a consistent way. To compress S, we can then replace each phrase f_i of type (b) with a pair (r_i, l_i) such that r_i is the distance from p_i to the start of the previous occurrence, and l_i is the length of the phrase. The occurrence of f_i at position p_i − r_i is called the source of the phrase. (This is actually the LZ77-variant of Storer and Szymanski [27]; the original one [29] adds a character to each phrase so that it outputs triples instead of pairs.) Every LZ77-compressed string is a string over the extended alphabet which consists of all possible LZ77 phrases. For any string T we denote this string by LZ(T).
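The parsing rule can be made concrete with a minimal sketch. It is a naive cubic-time parser, not an efficient one, and it resolves ties by taking the leftmost source, which is one possible reading of "a consistent way" and an assumption of this sketch. The names `lz77_parse` and `lz77_decode` are hypothetical.

```python
def lz77_parse(s):
    """Greedy LZ77 parse: single characters or (distance, length) pairs.

    Sources may overlap the phrase they define (self-references).
    """
    phrases = []
    p, n = 0, len(s)
    while p < n:
        best_len, best_src = 0, -1
        for q in range(p):  # candidate source starts, leftmost first
            l = 0
            while p + l < n and s[q + l] == s[p + l]:
                l += 1
            if l > best_len:  # strict comparison keeps the leftmost source on ties
                best_len, best_src = l, q
        if best_len == 0:
            phrases.append(s[p])  # type (a): first occurrence of a character
            p += 1
        else:
            phrases.append((p - best_src, best_len))  # type (b): pair (r, l)
            p += best_len
    return phrases

def lz77_decode(phrases):
    """Inverse of lz77_parse; copies one character at a time to handle overlaps."""
    out = []
    for f in phrases:
        if isinstance(f, tuple):
            r, l = f
            for _ in range(l):
                out.append(out[-r])
        else:
            out.append(f)
    return "".join(out)

parse = lz77_parse("ABABA$")
```

For the string ABABA$ the third phrase ABA is copied from distance 2 and overlaps its own source.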

A Simple Data Structure
In this section we will define a data structure that allows us to solve the string indexing with compressed pattern problem in O(n^2) space and O(z + log n + occ) time.
Figure 1: The phrase trie for the string ABABACABABA$. In this example, the leaves are sorted according to the lexicographic order of the original suffixes. For instance, the 6th suffix ABABA$ has the LZ77 parse A B (2,3) $, and this string corresponds to the concatenation of labels on the path from the root to the second leaf.

The Phrase Trie
The phrase trie of a string S is defined as the compact trie over the set of strings {LZ (S i $) , i = 0, . . . , |S| − 1} ∪ {$}, that is, the LZ77 parses of all suffixes of S appended by a new symbol $ which is lexicographically greater than any letter in the alphabet. For an example see Figure 1.
The phrase trie for a string S of length n has n + 1 leaves, one corresponding to every suffix of S$. Similarly as in the suffix tree, every internal node defines a consecutive range within the suffix array. Since every node has at least 2 children the number of nodes and edges is O(n).
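As a rough illustration of this definition, the following sketch inserts the LZ77 parses of all suffixes of S$ into an uncompacted trie stored as nested dictionaries. It is not the compact trie itself: unary paths are not merged and children are not ordered. The parser breaks ties towards the leftmost source, and the names `lz77_parse`, `phrase_trie`, and `count_leaves` are hypothetical.

```python
def lz77_parse(s):
    """Naive greedy LZ77 parse into characters and (distance, length) pairs."""
    phrases, p, n = [], 0, len(s)
    while p < n:
        best_len, best_src = 0, -1
        for q in range(p):
            l = 0
            while p + l < n and s[q + l] == s[p + l]:
                l += 1
            if l > best_len:
                best_len, best_src = l, q
        if best_len == 0:
            phrases.append(s[p])
            p += 1
        else:
            phrases.append((p - best_src, best_len))
            p += best_len
    return phrases

def phrase_trie(s):
    """Uncompacted trie over LZ(S_i$) for i = 0..|S|; i = |S| yields the string $."""
    root = {}
    for i in range(len(s) + 1):
        node = root
        for f in lz77_parse(s[i:] + "$"):
            node = node.setdefault(f, {})
        node[None] = i  # the key None marks a leaf and stores the suffix index
    return root

def count_leaves(node):
    return sum(1 if key is None else count_leaves(child)
               for key, child in node.items())

trie = phrase_trie("ABABACABABA")
leaves = count_leaves(trie)
```

Following the phrases A, B, (2,3), $ from the root reaches the leaf of the 6th suffix, matching the example parse given for ABABA$.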
LZ77 has the property that for two strings whose prefixes match up to some position ℓ, the LZ77-compression of the two strings will be the same up to (not necessarily including) the phrase that contains position ℓ. As such, we can use the phrase trie to find the suffix S_i of S for which the LZ77-compression of the pattern P agrees with the LZ77-compression of S_i for as long as possible. Assuming they match for k − 1 phrases, the longest match of P in S ends within the k-th phrase. If, in addition to the phrase trie, we keep a table over all possible phrases of a pattern that the longest match could end in, encoded as the triple (p, r, l) of starting position, distance to the start of the previous occurrence, and length of the phrase, and store for each such phrase the solution to the query, we can solve the string indexing with compressed pattern problem in O(n^3) space and O(z + occ) time. Instead, we will store a linear space and constant time lcp data structure for S and show that given the first phrase where the suffix S_i and the string P mismatch, we can find the lcp of P and S_i by finding the lcp of two substrings of S.
Figure 2: The k-th phrase in S' is copied from position p_k − r'_k, at which point S and S' are identical; the lcp value gives how far the suffixes starting at p_k and p_k − r'_k match in S.

Longest Common Prefixes in LZ77-Compressed Strings
We will use an intuitive property of LZ77-compressed strings: assuming two strings match up to a certain phrase k − 1, we can reduce the task of finding the lcp of the two strings to the task of finding the longest common prefix between two suffixes of one of the strings. This property is summarized in the following lemma (see also Figure 2).
Lemma 2. Let S and S' be two strings with LZ77 parses f_1 f_2 · · · and f'_1 f'_2 · · ·, respectively, such that f_i = f'_i for i < k and f_k ≠ f'_k. Let p_k be the starting position of f_k and f'_k. If f'_k is a phrase represented by a pair (r'_k, l'_k), the following holds:
lcp(S, S') = p_k + min(lcp(S[p_k, n], S[p_k − r'_k, n]), l'_k).   (1)
Proof. To prove (1), we will show by induction that S[0, p_k + i] = S'[0, p_k + i] for any i ≤ min(lcp(S[p_k, n], S[p_k − r'_k, n]), l'_k). For i = 0 this is true since S and S' are the same up until position p_k − 1. For the induction step assume it is true for all i_0 < i. We then have
S'[p_k + i − 1] = S'[p_k − r'_k + i − 1]   (2)
= S[p_k − r'_k + i − 1]   (3)
= S[p_k + i − 1],   (4)
where (2) follows from i ≤ l'_k and because p_k − r'_k is the source of phrase f'_k, (3) follows from the induction hypothesis, and (4) follows from i ≤ lcp(S[p_k, n], S[p_k − r'_k, n]).
There are two cases: For […], note that by (1), we know that S and S' have an lcp of length at least p_k + t. If t ≥ l_k, then by the uniqueness of the greedy left-to-right parsing, the k-th phrase of S and S' would be the same, contradicting our condition. Otherwise, we have l_k > t = l'_k. This together with (1) […] for every i = 0, ..., t, since r_k ≥ 1. By the greedy parsing property and since l'_k = t we know […]
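The property can be checked by brute force on small strings. The sketch below assumes the reading lcp(S, S') = p_k + min(lcp(S[p_k, n], S[p_k − r, n]), l), where (r, l) is the first phrase of S' that differs from the parse of S and p_k is its starting position. It also assumes a parser that always picks the leftmost source, so that the choice of source is consistent. All function names are hypothetical.

```python
def lz77_parse(s):
    """Naive greedy LZ77 parse into characters and (distance, length) pairs."""
    phrases, p, n = [], 0, len(s)
    while p < n:
        best_len, best_src = 0, -1
        for q in range(p):
            l = 0
            while p + l < n and s[q + l] == s[p + l]:
                l += 1
            if l > best_len:
                best_len, best_src = l, q
        if best_len == 0:
            phrases.append(s[p])
            p += 1
        else:
            phrases.append((p - best_src, best_len))
            p += best_len
    return phrases

def lcp(a, b):
    """Length of the longest common prefix of two strings (naive)."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n

def lcp_via_phrases(s, t):
    """lcp(s, t) from the first differing phrase of t, or None if not applicable."""
    fs, ft = lz77_parse(s), lz77_parse(t)
    k, p = 0, 0  # index and starting position of the first differing phrase
    while k < len(fs) and k < len(ft) and fs[k] == ft[k]:
        p += ft[k][1] if isinstance(ft[k], tuple) else 1
        k += 1
    if k >= len(fs) or k >= len(ft) or not isinstance(ft[k], tuple):
        return None  # both parses need a k-th phrase, and t's must be a pair
    r, l = ft[k]
    # Only suffixes of s are compared; t is never inspected beyond its parse.
    return p + min(lcp(s[p:], s[p - r:]), l)

example = lcp_via_phrases("AAAB", "AAB")
```

In the example the parses A (1,2) B and A (1,1) B first differ at the second phrase, which starts at position 1, and the formula yields 1 + min(2, 1) = 2.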


The Data Structure
Additionally to storing the phrase trie of S, we store the suffix array of S, and for every node in the phrase trie, the range of the leaves below it in the suffix array. Finally, we store a linear space and constant time data structure for answering lcp-queries for suffixes of S.

Algorithm
We begin by matching LZ (P ) as far as possible in the phrase trie. Let v = locus(LZ (P )). Let k be the first phrase in LZ (P ) that does not match any of the next phrases in the trie.
If v is a node, set w = v; otherwise let w be the first node below v. We proceed as follows:
If the k-th phrase in P is a single letter, we return p_k as the length of the match and the interval of positions stored at w.
If the k-th phrase is represented by (r_k, l_k), then there are two cases:
If v is on an edge, let S_i be the suffix corresponding to any leaf below v. We return p_k + min(lcp(S[i + p_k, n], S[i + p_k − r_k, n]), l_k) as the length of the match and the interval of positions stored at w.
If v is on a node, we do a binary search for the longest match in the range in the suffix array below v. That is, for the suffix S_i corresponding to the middle leaf in the range below v, we compute lcp(S[i + p_k, n], S[i + p_k − r_k, n]). If this is greater than l_k we stop the binary search. Otherwise, we check if the next position in suffix S_i is lexicographically smaller or bigger than the next position in P to see whether we go left or right in the binary search. That is, with t denoting the computed lcp value, we compare S[i + p_k + t] and S[i + p_k − r_k + t], and update our search accordingly. We also keep track of the longest match found so far. At the end of the search, we go to the longest match, and check left and right in the suffix array to find all occurrences.

Correctness
The compact trie gives us the longest matching prefix of LZ(P) = f_1 · · · f_{z_p} in the phrase trie. That is, we find all suffixes S_i with LZ(S_i) = f'_1 · · · f'_{z_i} for i = 0, ..., n − 1 such that f_1 = f'_1, ..., f_{k−1} = f'_{k−1} and f_k ≠ f'_k, and k is maximal. By the uniqueness of parsing, the longest prefix of P found in S is the prefix of at least one of these suffixes. Note that by the greedy parsing, the longest match of the k-th phrase has to end before the next node in the trie. We argue the different cases: If the k-th phrase in P is a letter, it did not appear in P before. Thus, it never appeared in any of the suffixes we matched so far. Since the next phrase in the phrase trie is different, it is either a copied position, or a different letter. In any case, the next letter of any candidate suffix does not match the next letter in P.
If f_k is represented by (r_k, l_k) there are two subcases. If v is on an edge, recall that S_i is the suffix corresponding to any leaf below the current position v. By Lemma 2 and since S_i[p] = S[p + i] for any p, we have that lcp(P, S_i) = p_k + min(lcp(S[i + p_k, n], S[i + p_k − r_k, n]), l_k). If v is on a node we have, by the same argument as before, lcp(P, S_i) = p_k + min(lcp(S[i + p_k, n], S[i + p_k − r_k, n]), l_k) for every suffix S_i below v. Further, because of the lexicographic order of the suffix array, we can binary search to find the leaf with the longest match, and by checking the adjacent positions in the suffix array we make sure to find all occurrences.

Analysis
The suffix array and the lcp data structure both use linear space in the size of S. For the phrase trie, we store the LZ77-compressed suffixes of S, which use O(Σ_{i=0}^{n−1} z_i) = O(n^2) space, where z_i is the number of phrases used to compress suffix S_i. For the time complexity, we use O(k) = O(z) time for matching the phrases in the trie. In the worst case, that is, when the locus v is on a node, we need O(log(#leaves below v)) = O(log n) constant time lcp queries. In total, we have a time complexity of O(z + log n + occ). In summary, we proved the following lemma.
Lemma 3. We can solve the string indexing with compressed pattern problem in O(n^2) space and O(z + log n + occ) time.

Space Efficient Phrase Trie
In this section, we show how to achieve the same functionality as the phrase trie while using linear space. The main idea is to store only one phrase per edge, and use Lemma 2 to navigate along an edge. That is, we no longer store the entire LZ77-compressed suffixes of S.

The Data Structure
We store a compact form of the phrase trie, which is essentially a blind trie version of the phrase trie. We store the following: We keep the tree structure of the phrase trie, and at each node, we keep a hash table, using perfect hashing [10], where the keys are the first LZ77 phrase of each outgoing edge. For each edge we store as additional information the length of the (uncompressed) substring on that edge and an arbitrarily chosen leaf below it. For an example see Figure 3. As before, we additionally store the suffix array, the range within the suffix array for each node, and a linear-sized lcp data structure for S.

Algorithm
The algorithm proceeds as follows. We start the search at the root. Assume we have matched k − 1 phrases of P and the current position in the trie is a node v. To match the next phrase we check if the k-th phrase in P is in the hash table of v.
1. If it is not, we proceed exactly as in the previous section in the case where the locus is at a node.
2. If the k-th phrase is present, let e be the corresponding edge and let i be the starting index of the leaf stored for e. Set k = k + 1. We do the following until we reach the end of edge e or get a mismatch. We differentiate between two cases.
The k-th phrase is a single letter α: If α = S[i + p_k], we set k = k + 1 and continue with the next phrase. If α ≠ S[i + p_k], we stop and return p_k as the length of the match.
The k-th phrase is represented by (r_k, l_k): If min(lcp(S[i + p_k, n], S[i + p_k − r_k, n]), l_k) ≥ l_k, we set k = k + 1 and continue with the next phrase. Otherwise, we return p_k + lcp(S[i + p_k, n], S[i + p_k − r_k, n]) as the length of the match, with the interval of positions stored at the next node.
If we reach the end of an edge, we go to the next node below and continue in the same way.
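The per-phrase comparison along an edge can be sketched in isolation. The code below compares a pattern, given only by its LZ77 phrases, against one fixed suffix S_i, using lcp queries between suffixes of S and never decompressing the pattern. It leaves out the trie, the hash tables, and the suffix array intervals, and it replaces the constant-time lcp structure by a naive loop. All names are hypothetical.

```python
def lz77_parse(s):
    """Naive greedy LZ77 parse into characters and (distance, length) pairs."""
    phrases, p, n = [], 0, len(s)
    while p < n:
        best_len, best_src = 0, -1
        for q in range(p):
            l = 0
            while p + l < n and s[q + l] == s[p + l]:
                l += 1
            if l > best_len:
                best_len, best_src = l, q
        if best_len == 0:
            phrases.append(s[p])
            p += 1
        else:
            phrases.append((p - best_src, best_len))
            p += best_len
    return phrases

def lcp_suffixes(s, a, b):
    """Naive stand-in for a constant-time lcp structure on suffixes of s."""
    n = 0
    while a + n < len(s) and b + n < len(s) and s[a + n] == s[b + n]:
        n += 1
    return n

def match_length(s, i, phrases):
    """Length of lcp(P, s[i:]) where P is given only by its LZ77 phrases."""
    p = 0  # start of the current phrase; P[0:p] == s[i:i+p] holds here
    for f in phrases:
        if isinstance(f, tuple):
            r, l = f
            t = min(lcp_suffixes(s, i + p, i + p - r), l)
            if t < l:
                return p + t  # mismatch inside this phrase
            p += l
        else:
            if i + p >= len(s) or s[i + p] != f:
                return p  # mismatch at a single-letter phrase
            p += 1
    return p

s = "ABABACABABA"
pattern_phrases = lz77_parse("ABAC")  # the pattern arrives in compressed form
full = match_length(s, 2, pattern_phrases)
partial = match_length(s, 0, pattern_phrases)
```

Each phrase costs one lcp query or one character comparison, so the comparison uses a number of queries proportional to the number of phrases rather than to the length of the pattern.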

Correctness
The correctness follows from the previous section together with Lemma 2, since we always keep the invariant that when we process the k th phrase, we already matched the k − 1 previous ones.

Analysis
The space complexity is linear since the compact phrase trie has O(n) nodes and edges and stores constant information per node and edge, using perfect hashing.
The time complexity is the same as in the previous section, since for matching full phrases, we use at most one constant time lookup in the hash table and one constant time lcp query per phrase in P . As before, the worst case for matching the k th phrase is having to do a binary search, using O(log n) constant time lcp queries. In summary, this gives the following lemma.

Lemma 4. We can solve the string indexing with compressed pattern problem in O(n) space and O(z + log n + occ) time.

Slice Tree Solution
In this section, we show how to reduce the O(log n) time overhead to O(log m). Recall that the additional O(log n) time originates from the binary search in the case where after matching k − 1 phrases we arrive at a node, and the k th phrase does not match any of the outgoing edges. In any other case, the solution from the previous section gives O(z + occ) time complexity. We use the solution from the previous section as a basis and show how to speed up the last step of matching the k th phrase. For our solution, we use Karp-Rabin fingerprints and the ART tree decomposition, which we define next.

Karp-Rabin Fingerprints
For a prime p and an x ≤ p, the Karp-Rabin fingerprint [19] of a substring S[i, j] is defined as φ_{p,x}(S[i, j]) = Σ_{k=i}^{j−1} S[k] · x^{k−i} mod p. Clearly, if S[i, j] = S[i', j'] then φ_{p,x}(S[i, j]) = φ_{p,x}(S[i', j']). Furthermore, the Karp-Rabin fingerprint has the property that for any three strings x, y and z where z = xy, given the fingerprint of any two of those strings, the third one can be computed in constant time. It follows that given the fingerprints of all suffixes of a string S, the fingerprint of any substring of S can be computed in constant time.
We assume that p and x are chosen in such a way that φ p,x is collision-free on substrings of S, that is, two distinct substrings of S have different fingerprints. For details on how to construct φ p,x see for example [4]. We will from now on use the notation φ = φ p,x .
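The constant-time computation of substring fingerprints from suffix fingerprints can be sketched as follows. The sketch assumes the polynomial form φ(S[i, j]) = Σ_{k=i}^{j−1} S[k] · x^{k−i} mod p, and the prime `P_MOD` and base `X` are arbitrary illustrative choices. It makes no attempt to verify that the fingerprint is collision-free on S. All names are hypothetical.

```python
P_MOD = (1 << 61) - 1  # a Mersenne prime, chosen arbitrarily for this sketch
X = 1_000_003          # arbitrary base below P_MOD

def suffix_fingerprints(s):
    """fp[i] is the fingerprint of the suffix s[i:]; fp[len(s)] = 0."""
    n = len(s)
    fp = [0] * (n + 1)
    for i in range(n - 1, -1, -1):
        fp[i] = (ord(s[i]) + X * fp[i + 1]) % P_MOD
    return fp

def power_table(n):
    """pw[k] = X^k mod P_MOD for k = 0..n."""
    pw = [1] * (n + 1)
    for k in range(1, n + 1):
        pw[k] = pw[k - 1] * X % P_MOD
    return pw

def substring_fingerprint(fp, pw, i, j):
    """Fingerprint of s[i:j] from two suffix fingerprints, in constant time."""
    # s[i:] is the concatenation of s[i:j] and s[j:], so
    # fp[i] = phi(s[i:j]) + X^(j-i) * fp[j]  (mod P_MOD).
    return (fp[i] - pw[j - i] * fp[j]) % P_MOD

def direct_fingerprint(t):
    """Fingerprint computed directly from the definition, for comparison."""
    return sum(ord(c) * pow(X, k, P_MOD) for k, c in enumerate(t)) % P_MOD

s = "ABABACABABA"
fp, pw = suffix_fingerprints(s), power_table(len(s))
same = substring_fingerprint(fp, pw, 0, 3) == substring_fingerprint(fp, pw, 6, 9)
```

Both compared substrings equal ABA, so their fingerprints agree. The same identity gives the fingerprint of a concatenation xy from the fingerprints of x and y.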

ART decomposition
The ART decomposition of a tree by Alstrup et al. [1] partitions a tree into a top tree and several bottom trees. Every vertex v of minimal depth with no more than χ leaves below it is the root of a bottom tree which consists of v and all its descendants. The top tree consists of all vertices that are not in any bottom tree. The following lemma gives a key property of ART trees:
Lemma 5 (Alstrup et al. [1]). The ART decomposition with parameter χ for a rooted tree T with n leaves produces a top tree with at most n/(χ + 1) leaves.
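A minimal sketch of the decomposition on an explicit tree may help. The tree is given as a dictionary mapping each internal vertex to its list of children, which is a representation assumed only for this example, and the function name `art_decomposition` is hypothetical.

```python
def art_decomposition(children, root, chi):
    """Return (top vertices, bottom tree roots, top tree leaves)."""
    leaves = {}

    def count(v):
        kids = children.get(v, [])
        leaves[v] = 1 if not kids else sum(count(c) for c in kids)
        return leaves[v]

    count(root)
    top, bottom_roots = [], []
    stack = [root]
    while stack:
        v = stack.pop()
        if leaves[v] <= chi:
            # First vertex on this root-to-leaf path with at most chi leaves below it.
            bottom_roots.append(v)
        else:
            top.append(v)
            stack.extend(children.get(v, []))
    top_set = set(top)
    top_leaves = [v for v in top
                  if not any(c in top_set for c in children.get(v, []))]
    return top, bottom_roots, top_leaves

# Complete binary tree with 8 leaves in heap numbering: vertex v has children 2v, 2v+1.
children = {v: [2 * v, 2 * v + 1] for v in range(1, 8)}
top, bottom_roots, top_leaves = art_decomposition(children, 1, chi=2)
```

Every leaf of the top tree has more than χ leaves of the original tree below it, and these leaf sets are disjoint, which is where the bound n/(χ + 1) comes from. In the example the top tree has 2 leaves and 8/3 bounds this from above.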

The Slice Tree Decomposition
The overall idea is to construct a two level decomposition of the suffix tree. First, we will divide the tree into smaller trees, the slice trees, where the heights are powers of two and increase with the depth in the tree. Each of those slice trees is decomposed using an ART decomposition. Together with Karp-Rabin fingerprints stored at the roots of each slice tree, this will allow us to efficiently carry out an approximate search for the longest match, so we can then use the slice trees to find the exact position and length. In more detail, we store the space efficient phrase trie from the previous section for matching full phrases of the pattern. Additionally, we store the Karp-Rabin fingerprints for each suffix of S, as well as the following slice tree decomposition of the suffix tree of S: We store the suffix tree together with extra nodes at any position in the suffix tree that corresponds to a string depth that is a power of two. For each node we store the range in the suffix array of the leaves below. For each level of string depth 2^i, where i = 0, ..., ⌊log n⌋, we store a static hash table with Karp-Rabin fingerprints of the substring in S from the root to every node of string depth 2^i. As in Section 4, we use perfect hashing for all hash tables in this solution.
For each node v at string depth 2^i we define a slice tree of order i. The slice tree is the subtree rooted at v, cut off such that the string height of the slice tree is (at most) 2^i. We compute an ART decomposition of each slice tree of order i with the parameter χ set to χ = 2^i. For each 1 ≤ d < 2^i, we store a hash table with fingerprints corresponding to the substrings of length d starting at the root of the slice tree and ending in the top tree.
Additionally, for every edge connecting a top tree node to a bottom tree root, we save the corresponding first letter in the suffix tree. For every leaf in the bottom tree we store the starting position of a leaf below it in the suffix tree.

Algorithm
To match P, we first match the full phrases in the phrase trie until we find the first phrase f_k which does not match any of the next phrases in the trie. If f_k is just a letter, as before, we are done. Otherwise f_k is represented by (r_k, l_k). Now:
We find the fingerprint φ(P[0, p_k]) = φ(S[i_0, i_0 + p_k]), where i_0 is a leaf below the current position in the phrase trie. Note that since S[i_0, i_0 + p_k] is a substring of S and we stored the fingerprints of all suffixes of S, we can find its fingerprint in constant time via the fingerprints of the suffixes S_{i_0} and S_{i_0 + p_k}.
In order to find the slice tree where the match ends, we do a linear search for the deepest matching fingerprint in the hash tables at the power of 2 levels in the following way: […] to avoid false positives. We keep doing this until the first level where it is not present or the check fails. For the last level where there is a match, we find the corresponding node and the slice tree rooted at that node. Note that this slice tree can be of order at most log m.
Similarly to the linear search above, we now do an exponential search for fingerprints on the levels in the top tree of the slice tree. For the lowest level in which there is a match in the top tree, we find the corresponding position v. If this is an internal node without any off-hanging bottom trees or on an edge in the top tree, then locus(P) = v. Once we have found locus(P) we can easily find and return the occurrences as before.
Otherwise, we check if the next letter in P matches any of the off-hanging bottom trees. Again, we can find this letter in constant time by looking up its source in S. If it matches, we do a binary search for the longest match with the leaves of the bottom tree, which proceeds exactly as in the phrase trie solution, but restricted to the representative leaves stored for each bottom tree leaf. For each bottom tree leaf that has a longest match with P, report all suffix tree leaves below it.
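The exponential search over levels can be sketched generically. The code below finds the largest depth d such that the first d characters of the pattern still match, assuming the match predicate is monotone (if a prefix matches, so does every shorter one). A plain set of string prefixes stands in for the per-level fingerprint hash tables, and the pattern is given uncompressed, so this only illustrates the search strategy. The names `deepest_match` and `has_prefix` are hypothetical.

```python
def deepest_match(pattern, has_prefix, max_depth):
    """Largest d in [0, max_depth] such that has_prefix(pattern[:d]) holds."""
    hi = min(max_depth, len(pattern))
    if hi == 0 or not has_prefix(pattern[:1]):
        return 0
    d = 1
    # Doubling phase: the depth grows exponentially while the probes succeed.
    while 2 * d <= hi and has_prefix(pattern[:2 * d]):
        d *= 2
    # Binary search phase: the answer lies in [d, min(2d - 1, hi)].
    lo, up = d, min(2 * d - 1, hi)
    while lo < up:
        mid = (lo + up + 1) // 2
        if has_prefix(pattern[:mid]):
            lo = mid
        else:
            up = mid - 1
    return lo

strings = ["ABABAC", "ABC"]
prefixes = {t[:d] for t in strings for d in range(len(t) + 1)}
depth = deepest_match("ABABXX", prefixes.__contains__, max_depth=8)
```

Both phases use a number of probes logarithmic in the returned depth bound, which mirrors the O(log h) cost of the search in a top tree of height h.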

Correctness
The correctness of matching the first k − 1 phrases follows from the previous section. Given that k is the first phrase that does not match any of the next phrases in the suffixes, we argue for the linear search in the power of two levels in the suffix tree. We know that the Karp-Rabin fingerprints have no false negatives, so if P[0, p_k + j] occurs in S, its fingerprint will be present in the hash table of level p_k + j. Further, we chose φ such that it has no false positives on substrings of S, so by checking φ(P[0, p_k]) = φ(S[i, i + p_k]) separately, we make sure that P[0, p_k + j] and S[i, i + p_k + j] are actually identical. Together, this means that by finding the biggest j such that p_k + j is a power of two and both conditions are fulfilled, we will find the slice tree that contains the end of the longest match.
Next, we argue for the detailed search within the slice tree. The argument for the exponential search is the same as for the linear search. When we end the exponential search, we found the position in the top tree of maximum depth that corresponds to a substring of S matching a prefix of P . So the longest match either ends there or in a bottom tree that is connected to this position. If there is more than one such bottom tree, the first letter on each edge will uniquely identify the bottom tree that contains the leaf or leaves with the longest match. If the longest match ends in a bottom tree, it is enough to do the binary search with any representative leaf in the suffix tree per leaf in the bottom tree, since for any such leaf the prefix of a given length that ends in the bottom tree is the same.

Analysis
We use linear space for the phrase trie representation of the previous section and the fingerprints of the suffixes of S. Additionally, we use O(n log n) space for the extra nodes and hash tables at the power of two levels.
For each slice tree T of order i, let |T| denote the number of nodes in the slice tree and let h = 2^i be the maximal height of the slice tree. By Lemma 5, the top tree has at most |T|/h leaves. By the definition of the slice tree, each root-to-leaf path has at most h positions. As such, the hash tables for the top tree take up O(|T|) space. Furthermore, we use constant space per leaf in the bottom tree. Each bottom tree leaf is a node in the suffix tree or an extra node, and each such node is a leaf in at most one bottom tree. So the total space for all slice trees is Σ_{T is a slice tree} O(|T|) = O(#nodes in suffix tree + #extra nodes) = O(n log n).
For the time complexity, as before, we use O(z) for matching in the phrase trie. Since we stored the fingerprints of all suffixes of S, the fingerprint of any substring of S can be found in constant time.
For the linear search of fingerprints in the suffix tree, note that the last phrase of P is at most m long. This means we stop the search after checking at most log m power of 2 levels, and a check can be done in constant time.
After the linear search we end up in a slice tree of order at most log m, which means h ≤ m. It follows that the exponential search in the top tree uses time at most O(log h) = O(log m). Further, by the definition of the ART decomposition, every bottom tree has no more than h ≤ m leaves, and as such the binary search in the bottom tree uses no more than O(log m) operations.
In total, this gives us a time complexity of O(z + log m + occ). We arrive at the following result:
Lemma 6. The slice tree solution solves the string indexing with compressed pattern problem in O(n log n) space and O(z + log m + occ) time.

Saving Space
For the solution above, we constructed O(n log n) slice trees. By the way we defined them, note that any internal node in a slice tree has to be an original node from the suffix tree.
Since there are only O(n) such nodes, we conclude that many of the slice trees consist of a single edge. We will show that by removing those, we can define a linear space solution that gives the same time complexity as in Lemma 6.

The Data Structure
We start with the slice tree solution. Call every edge that contains two or more extra nodes a long edge. For every long edge, delete every extra node except the first and last, which we call v first and v last . For every deleted node also delete the additional information stored for their slice trees, and their corresponding entries in the power of two hash tables. For each long edge, store at the hash table position of v first additionally the information that it is on a long edge, how long that edge is, and a leaf below it.

Algorithm
The algorithm proceeds almost as before. The only change is that in the linear search of power of two levels, when we match with a node that is v first of a long edge, jump directly to the last power of two level that is before the end of the edge. If the fingerprint is present, proceed normally, otherwise, the longest match ends on that edge and we do a single lcp query between the source of the phrase in S and the stored leaf to find its length.

Correctness
If we do not encounter any long edges, nothing changes. If a long edge is entirely contained in the match, we will first find v_first and then jump directly to the last power of two level on that edge, where we will find v_last, and then continue as before. If the longest match ends on a long edge, there are two cases:
1. The longest match ends before v_first or after v_last: this means that by doing the linear search we find the slice tree that the longest match ends in, thus everything follows as before.
2. The longest match ends between v_first and v_last: In this case, we will find a matching fingerprint at the level corresponding to v_first but no matching fingerprint at the level corresponding to v_last, which means we will use lcp to find the longest match with a leaf below v_first. Since the match ends on that edge, this gives us the correct length and position.

Analysis
For space complexity, note that we only keep original nodes from the suffix tree, plus at most two extra nodes per edge, so a linear number of nodes in total. Since the space used for the slice trees and power of two hash tables is linear in the number of nodes, the total space consumption is linear. The time complexity does not change. This concludes the proof of Theorem 1.