
Compressed Dictionary Matching on Run-Length Encoded Strings

Philip Bille, DTU Compute, Technical University of Denmark, Lyngby, Denmark
Inge Li Gørtz, DTU Compute, Technical University of Denmark, Lyngby, Denmark
Simon J. Puglisi, Helsinki Institute for Information Technology (HIIT) and Department of Computer Science, University of Helsinki, Finland
Simon R. Tarnow, DTU Compute, Technical University of Denmark, Lyngby, Denmark
Abstract

Given a set of pattern strings 𝒫={P1,P2,…,Pk} and a text string S, the classic dictionary matching problem is to report all occurrences of each pattern in S. We study the dictionary matching problem in the compressed setting, where the pattern strings and the text string are compressed using run-length encoding, and the goal is to solve the problem without decompression, in time and space efficient in the size of the compressed strings. Let m and n be the total length of the patterns in 𝒫 and the length of the text string S, respectively, and let m¯ and n¯ be the total number of runs in the run-length encoding of the patterns in 𝒫 and of S, respectively. Our main result is an algorithm that achieves O((m¯+n¯)loglogm+occ) expected time and O(m¯) space, where occ is the total number of occurrences of patterns in S. This is the first non-trivial solution to the problem. Since any solution must read the input, our time bound is optimal within a loglogm factor. We introduce several new techniques to achieve our bounds, including a new compressed representation of the classic Aho–Corasick automaton and a new efficient string index that supports fast queries on run-length encoded strings.

Keywords and phrases:
Dictionary matching, run-length encoding, compressed pattern matching
Funding:
Philip Bille: Danish Research Council grant DFF-8021-002498.
Inge Li Gørtz: Danish Research Council grant DFF-8021-002498.
Copyright and License:
© Philip Bille, Inge Li Gørtz, Simon J. Puglisi, and Simon R. Tarnow; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation Pattern matching
Editors:
Paola Bonizzoni and Veli Mäkinen

1 Introduction

Given a set of pattern strings 𝒫={P1,P2,…,Pk} and a string S, the dictionary matching problem (also called the multi-string matching problem) is to report all occurrences of each pattern in S. Dictionary matching is a classic and extensively studied problem in combinatorial pattern matching [1, 27, 6, 7, 8, 31, 18, 19, 39, 11, 40, 25, 33, 12, 36, 43, 16, 37, 9], with the first efficient solution due to Aho and Corasick [1] from the 1970s. Dictionary matching is also a key component in several other algorithms for combinatorial pattern matching, see e.g. [45, 26, 13, 23, 17, 22, 21, 5, 34].

A run in a string S is a maximal substring of identical characters denoted αx, where α is the character of the run and x is the length of the run. The run-length encoding (RLE) of S is obtained by replacing each run αx in S by the pair (α,x). For example, the run-length encoding of aaaabbbaaaccbaa is (a,4),(b,3),(a,3),(c,2),(b,1),(a,2).
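For concreteness, run-length encoding is computable in a single pass; the following is a small Python sketch (the function name `rle_encode` is ours, not the paper's):

```python
def rle_encode(s):
    """Return the run-length encoding of s as a list of (character, run length) pairs."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            # extend the current maximal run
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            # a new run starts
            runs.append((ch, 1))
    return runs
```

For example, `rle_encode("aaaabbbaaaccbaa")` yields `[('a', 4), ('b', 3), ('a', 3), ('c', 2), ('b', 1), ('a', 2)]`, matching the example above.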

This paper focuses on compressed dictionary matching, where 𝒫 and S are given in run-length encoded form. The goal is to solve the problem without decompression and achieve efficient time and space in terms of the total number of runs in 𝒫 and S. Compressed dictionary matching has been studied for other compression schemes [39, 40, 18, 41], and run-length encoding has been studied for other compressed pattern matching problems [4, 5, 10, 15, 46, 30]. However, no non-trivial bounds are known for the combination of the two.

Results.

We address the basic question of whether it is possible to solve compressed dictionary matching in near-linear time in the total number of runs in the input. We show the following main result.

Theorem 1.

Let 𝒫={P1,…,Pk} be a set of k strings of total length m and let S be a string of length n. Given the RLE representation of 𝒫 and S consisting of m¯ and n¯ runs, respectively, we can solve the dictionary matching problem in O((m¯+n¯)loglogm+occ) expected time and O(m¯) space, where occ is the total number of occurrences of 𝒫 in S.

Theorem 1 assumes a standard word-RAM model of computation with logarithmic word length, and space is the number of words used by the algorithm, excluding the input strings, which are assumed to be read-only.

This is the first non-trivial algorithm for compressed dictionary matching on run-length encoded strings. Since any solution must read the input, the time bound of Theorem 1 is optimal within a loglogm factor.

Techniques.

Our starting point is the classic Aho–Corasick algorithm for dictionary matching [1], which generalizes the Knuth–Morris–Pratt algorithm [42] for single-string matching to multiple strings. Given a set of pattern strings 𝒫={P1,…,Pk} of total length m, the Aho–Corasick automaton (AC automaton) for 𝒫 consists of the trie of the patterns in 𝒫. Hence, any path from the trie's root to a node v corresponds to a prefix of a pattern P ∈ 𝒫. Each node v stores a failure link pointing to the node corresponding to the longest pattern prefix that is a proper suffix of the string identified by v, and an output link pointing to the node corresponding to the longest pattern that matches a suffix of the string identified by v.

To solve dictionary matching, we read S one character at a time and traverse the AC automaton. At each step, we maintain the node corresponding to the longest suffix of the current prefix of S. If we cannot match a character, we recursively follow failure pointers (without reading further in S). If we reach a node with an output link, we output the corresponding pattern and recursively follow output links to report all other patterns matching at the current position in S. If we implement the trie using perfect hashing [35], we can perform the top-down traversal in constant time per character. Since failure links always point to a node of strictly smaller depth in the trie, it follows that we can charge the cost of traversing these to reading characters in S, and thus, the total time for traversing failure links is O(n). Each traversal of an output link results in a reported occurrence, and thus, the total time for traversing output links is O(occ). In total, this leads to a solution to dictionary matching that uses O(m) space and O(n+m+occ) expected time.
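The uncompressed algorithm just described can be sketched compactly. The following Python illustration (the names `build_ac` and `ac_search` are ours) builds the trie and failure links and, instead of explicit output links, merges each node's output list with that of its failure target, which reports the same set of occurrences:

```python
from collections import deque

def build_ac(patterns):
    # Trie as parallel arrays: goto[v] maps a character to a child,
    # fail[v] is the failure link, out[v] lists pattern indices matched at v.
    goto, fail, out = [{}], [0], [[]]
    for idx, p in enumerate(patterns):
        v = 0
        for ch in p:
            if ch not in goto[v]:
                goto.append({}); fail.append(0); out.append([])
                goto[v][ch] = len(goto) - 1
            v = goto[v][ch]
        out[v].append(idx)
    # BFS: a node's failure link is the failure transition of its parent.
    q = deque(goto[0].values())
    while q:
        v = q.popleft()
        for ch, u in goto[v].items():
            q.append(u)
            f = fail[v]
            while f and ch not in goto[f]:
                f = fail[f]
            t = goto[f].get(ch, 0)
            fail[u] = t if t != u else 0
            # merge outputs along the failure link (replaces output links)
            out[u] += out[fail[u]]
    return goto, fail, out

def ac_search(text, patterns):
    goto, fail, out = build_ac(patterns)
    v, hits = 0, []
    for i, ch in enumerate(text):
        while v and ch not in goto[v]:
            v = fail[v]          # follow failure links on a mismatch
        v = goto[v].get(ch, 0)
        for j in out[v]:
            hits.append((i - len(patterns[j]) + 1, j))  # (start, pattern index)
    return hits
```

For instance, `ac_search("xbabab", ["a", "ab", "bab"])` reports a (starting position, pattern index) pair for every occurrence of every pattern.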

At a high level, our main result in Theorem 1 can be viewed as a compressed implementation of the AC automaton that supports dictionary matching in O((m¯+n¯)loglogm+occ) expected time and O(m¯) space. Compared to the uncompressed AC automaton, we achieve the same bounds in terms of the compressed input, except for a loglogm factor in the time complexity.

To implement the AC automaton in O(m¯) space, we introduce the run-length encoded trie TRLE of 𝒫, where each run αx in a pattern P ∈ 𝒫 is encoded as a single letter “αx”, and show how to simulate the Aho–Corasick algorithm on TRLE, processing one run of S at a time. The key challenge is that the nodes of the AC automaton are not explicitly represented in TRLE, and thus we cannot directly implement the failure and output links in TRLE.

Naively, we can simulate the failure links of the AC automaton directly on TRLE. However, as in the Aho–Corasick algorithm, this leads to a solution that traverses Ω(n) failure links, which we cannot afford. Instead, we show how to efficiently construct a data structure that groups nodes and processes them simultaneously, achieving O(n¯loglogm) total processing time.

Similarly, we can simulate the output links by grouping them at the explicit nodes, but even on a simple instance such as 𝒫={a, am−1} this would result in Ω(m) output links in total, which we also cannot afford. Alternatively, we can store a single output link for each explicit node as in the AC automaton. However, the occurrences we need to report are not only patterns that are suffixes of the string seen so far, but also patterns ending inside the last processed run. Not all occurrences ending in the last run are substrings of each other, and thus storing only the longest is not sufficient. Instead, we reduce the problem of reporting all occurrences that end in a specific run of S to a new data structure problem called the truncate match reporting problem. To solve this problem, we reduce it to a problem of independent interest on colored, weighted trees called colored ancestor threshold reporting. We present an efficient solution to the truncate match reporting problem by combining several existing tree data structures in a novel way, ultimately leading to the final solution.

Along the way, we also develop a simple and general technique for compressed sorting of run-length encoded strings of independent interest, which we need to preprocess our pattern strings efficiently.

Outline.

In Section 3, we show how to efficiently sort run-length encoded strings and subsequently construct the corresponding compact trie. In Section 4, we introduce the truncate match reporting and colored ancestor threshold reporting problems, give our efficient solution to the colored ancestor threshold reporting problem, and present our reduction from the truncate match reporting problem to the colored ancestor threshold reporting problem. In Section 5, we first give a simple but inefficient version of our algorithm that focuses on navigating the run-length encoded trie. Finally, in Section 6, we extend and improve the simple version of our algorithm to achieve our main result.

2 Preliminaries

We use the following well-known basic data structure primitives. Let X be a set of n integers from a universe of size u. Given an integer x, a membership query determines whether x ∈ X, and a predecessor query returns the largest y ∈ X such that y ≤ x. We can support membership queries in O(n) space, O(n) expected preprocessing time, and constant query time using the well-known perfect hashing construction of Fredman, Komlós, and Szemerédi [35]. By combining the well-known y-fast trie of Willard [47] with perfect hashing, we can support predecessor queries with the following bounds.

Lemma 2.

Given a set of n integers from a universe of size u, we can support predecessor queries in O(n) space, O(nloglogn) expected preprocessing time, and O(loglogu) query time.
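Lemma 2 is what our algorithm relies on; as a simple stand-in for experimentation, a sorted array with binary search answers the same predecessor queries in O(log n) time. A hedged Python sketch (the class name `Predecessor` is ours):

```python
import bisect

class Predecessor:
    """Predecessor queries over a static integer set via a sorted array:
    O(log n) per query instead of the O(loglog u) of Lemma 2, but simple."""
    def __init__(self, xs):
        self.xs = sorted(xs)

    def pred(self, x):
        # Largest y in the set with y <= x, or None if no such y exists.
        i = bisect.bisect_right(self.xs, x)
        return self.xs[i - 1] if i else None
```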

3 Sorting Run-Length Encoded Strings and Constructing Compact Tries

Let 𝒫={P1,…,Pk} be a set of k strings of total length m from an alphabet of size σ. Furthermore, let 𝒫¯={P1¯,…,Pk¯} be the RLE representation of 𝒫 consisting of m¯ runs. In this section, we show that given 𝒫¯ we can sort the corresponding (uncompressed) set of strings 𝒫 and construct their compact trie in expected O(m¯+kloglogk) time and O(m¯) space. We use this compact trie construction to efficiently preprocess the patterns in our compressed dictionary matching algorithm in the following sections.

We first show the following simple reduction to standard (uncompressed) string sorting. Let t(k,m,σ) and s(k,m,σ) denote the time and space, respectively, to sort k strings of total length m from an alphabet of size σ.

Lemma 3.

Let 𝒫={P1,…,Pk} be a set of k strings of total length m from an alphabet of size σ. Given the RLE representation 𝒫¯={P1¯,…,Pk¯} of 𝒫 consisting of m¯ runs, we can sort 𝒫 in O(t(k,m¯,σm)+m¯) time and O(s(k,m¯,σm)+m¯) space.

Proof.

To sort the strings 𝒫, we construct the set of strings 𝒫~={P~1,…,P~k} such that P~i=(α1,x1)(α2,x2)⋯ is the sequence of the runs in Pi¯=α1x1α2x2⋯, represented as pairs of character and length. We say that a pair (α,x) is smaller than (β,y) if α<β, or α=β and x<y. Assume that we have the compact trie T~ of 𝒫~, where the edges are lexicographically sorted by their edge labels. We will show that we can construct the compact trie T of 𝒫 from T~ in O(m¯) time while maintaining the ordering of the edges. Observe that for a node v ∈ T, the ordering of the edge labels of the children of v is the same as the ordering obtained by encoding them as pairs of character and length, since the first character of each edge label differs. Let w be an internal node in T such that there is no node w~ ∈ T~ with sw=sw~. Since w is an internal node in T, two patterns Pi and Pj exist such that the longest common prefix of Pi and Pj is sw. Since w does not have a corresponding node in T~, there must exist a node w~ in T~ such that sw=sw~βy and w~ has at least two children with edge labels (β,y)Y and (β,z)Z where y<z. We ensure that any such node w ∈ T has a corresponding node in T~ as follows. Starting at the root of T~, we do the following at each node v. Let v1,v2,…,vd be the children of v lexicographically ordered by their edge labels (β1,y1)Y1,(β2,y2)Y2,…,(βd,yd)Yd. Starting at i=d, we do the following.

  • If βi−1=βi and Yi−1 is not the empty string, we insert a node w on the edge to vi−1 and make vi a child of w. We then set the edge labels of the edges (v,w), (w,vi−1), and (w,vi) to (βi, yi−1), Yi−1, and (βi, yi − yi−1)Yi, respectively.

  • If βi−1=βi and Yi−1 is the empty string, we make vi a child of vi−1 and set the edge label of (vi−1,vi) to (βi, yi − yi−1)Yi.

Note that if Yi−1 is empty, then no child of vi−1 can have (βi,z) as the first character of its edge label, since that run could then be encoded as (βi, yi−1+z). Hence, the parent of vi cannot change again. To maintain the order of the children of vi and w, we insert new edges using insertion sort. When w is created, it takes vi−1's place in the ordering of the children of v. We then decrement i until i=1 and then recurse on the remaining children of v. Finally, we traverse T~ and remove non-root vertices with only a single child. Since the children already appear in sorted order, we can recover the sorted order of 𝒫 from the constructed compact trie. We use O(t(k,m¯,σm)+m¯) time and O(s(k,m¯,σm)+m¯) space to sort 𝒫~. We can construct T~ from the sorted sequence of 𝒫~ in O(m¯) time (see Appendix A for the proof). We insert at most O(k) nodes in T~, and we apply insertion sort O(1) times at each node. Hence, we use O(m¯) time for the insertion sorts. In summary, we have shown Lemma 3.

Andersson and Nilsson [14] showed how to sort strings in t(k,m,σ)=O(m+kloglogk) expected time and s(k,m,σ)=O(m) space. We obtain the following result by plugging this into Lemma 3.

Corollary 4.

Let 𝒫={P1,…,Pk} be a set of k strings of total length m from an alphabet of size σ. Given the RLE representation 𝒫¯={P1¯,…,Pk¯} of 𝒫 consisting of m¯ runs, we can sort 𝒫 in O(m¯+kloglogk) expected time and O(m¯) space.

Using Corollary 4 and Lemma 3 we can efficiently construct the compact trie of 𝒫 from 𝒫¯.

Lemma 5.

Let 𝒫={P1,…,Pk} be a set of k strings of total length m from an alphabet of size σ. Given the RLE representation 𝒫¯={P1¯,…,Pk¯} of 𝒫 consisting of m¯ runs, we can construct the compact trie of 𝒫 in O(m¯+kloglogk) expected time and O(m¯) space.
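Note that ordering run sequences as pairs is not the same as the lexicographic order of the decompressed strings, which is exactly why the trie fix-up in the proof of Lemma 3 is needed: for example, (a,3)(b,1) precedes (a,5) in pair order, yet aaab > aaaaa. A direct lexicographic comparator over run sequences, exploiting that runs are maximal, can be sketched as follows (the function name `rle_cmp` is ours):

```python
def rle_cmp(a, b):
    """Compare the decompressed strings of two RLE strings lexicographically.
    a, b are lists of (character, run length) pairs with maximal runs.
    Returns -1, 0, or 1, in time proportional to the number of runs."""
    i = 0
    while i < len(a) and i < len(b):
        (ca, la), (cb, lb) = a[i], b[i]
        if ca != cb:
            return -1 if ca < cb else 1
        if la < lb:
            # a's run ends first; by maximality a's next character differs
            # from ca, while b continues with ca, so that character decides.
            if i + 1 == len(a):
                return -1                    # a is a proper prefix of b
            return -1 if a[i + 1][0] < ca else 1
        if la > lb:
            if i + 1 == len(b):
                return 1                     # b is a proper prefix of a
            return 1 if b[i + 1][0] < cb else -1
        i += 1
    if len(a) == len(b):
        return 0
    return -1 if i == len(a) else 1
```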

4 Truncated Matching

This section presents a compact data structure that efficiently supports a new type of pattern matching query that we call truncated matching. We will need this result to implement output links in our algorithm efficiently. Given a string P, we define the truncated string of P to be the string P′ such that P=P′αw, where αw is the last run in P. Let P1 and P2 be two strings of the form P1=P1′α1w1 and P2=P2′α2w2, where α1w1 and α2w2 are the last runs of P1 and P2, respectively. We say that P1 truncate matches P2 if P1′ is a suffix of P2′, α1=α2, and w1 ≤ w2 (see Figure 1).
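The definition can be tested directly on run-length encodings. The following sketch (the function name `truncate_matches` is ours) decides whether P1 truncate matches P2 from their run sequences, assuming the runs are maximal:

```python
def truncate_matches(p1, p2):
    """Decide whether P1 truncate matches P2. p1, p2 are RLE strings given
    as lists of (character, run length) pairs with maximal runs."""
    (a1, w1), (a2, w2) = p1[-1], p2[-1]
    if a1 != a2 or w1 > w2:
        return False                 # last runs must agree, with w1 <= w2
    t1, t2 = p1[:-1], p2[:-1]        # runs of the truncated strings P1', P2'
    if len(t1) > len(t2):
        return False
    if not t1:
        return True                  # the empty string is a suffix of anything
    # Every run of P1' except its first must equal the tail of P2' exactly;
    # the first run of P1' may match a suffix of the aligned run of P2'.
    if t1[1:] != t2[len(t2) - len(t1) + 1:]:
        return False
    (c1, l1), (c2, l2) = t1[0], t2[len(t2) - len(t1)]
    return c1 == c2 and l1 <= l2
```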

Let 𝒫={P1,…,Pk} be a set of k strings of the form Pi=Pi′αiwi, where αiwi is the last run of Pi for all i. The truncate match reporting problem is to construct a data structure such that given an index 1 ≤ i ≤ k, a character α, and an integer w, one can efficiently report all indices j such that Pj truncate matches Pi′αw.

Our goal in this section is to give a data structure for the truncate match reporting problem that uses O(k) space, O(m¯+kloglogk) expected preprocessing time, and supports queries in O(loglogk) time.

Figure 1: The strings P1 and P2, where P1′ and P2′ are the truncated strings of P1 and P2, respectively. If P1′ is a suffix of P2′, then P1 truncate matches P2, since the last runs have the same character and the last run of P1 is no longer than that of P2.

4.1 Colored Ancestor Threshold Reporting

We first define a data structure problem on colored, weighted trees, called colored ancestor threshold reporting, and present an efficient solution. In the next section, we reduce truncate match reporting to colored ancestor threshold reporting to obtain our result for truncate match reporting.

Let 𝒞 be a set of colors, and let T be a rooted tree with n nodes, where each node v has a (possibly empty) set of colors Cv ⊆ 𝒞 and a function πv : Cv → {1,…,W} that associates each color in Cv with a weight. The colored ancestor threshold reporting problem is to construct a data structure such that given a node v, a color c, and an integer w, one can efficiently report all ancestors u of v such that c ∈ Cu and πu(c) ≤ w.

We present a data structure for the colored ancestor threshold reporting problem that uses O(n+vT|Cv|) space, O(n+vT|Cv|) expected preprocessing time, and supports queries in O(loglogn+occ) time, where occ is the number of reported nodes.

Data Structure.

Our data structure stores the following information.

  • The tree T together with a first color ancestor data structure that supports queries of the form: given a node v and a color c, return the nearest (not necessarily proper) ancestor u of v such that c ∈ Cu.

  • Furthermore, for each color c𝒞 we store:

    • An induced tree Tc of the nodes v ∈ T with c ∈ Cv, maintaining the ancestor relationships from T. Each node v ∈ Tc has a weight wv=πv(c).

    • A path minima data structure on Tc that supports queries of the form: given two nodes u and v, return a node with minimum weight on the path between u and v (both included).

    • A level ancestor data structure on Tc that supports queries of the form: given a node v and an integer d, return the ancestor of v of depth d.

  • For each node v ∈ T: a dictionary associating each color in Cv with the corresponding node in Tc.

The total size of the induced trees is O(∑v∈T |Cv|). Hence, we can construct these and the associated dictionaries at each node in a single traversal of T using O(n+∑v∈T |Cv|) space and O(n+∑v∈T |Cv|) expected preprocessing time. We use standard path minima and level-ancestor data structures with linear space, linear preprocessing time, and constant query time on each of the induced trees [24, 2, 20, 29]. We use a first color ancestor data structure with linear space, expected linear preprocessing time, and O(loglogn) query time [28, 44, 32, 3]. In total, we use O(n+∑v∈T |Cv|) space and O(n+∑v∈T |Cv|) expected preprocessing time.

Query.

Consider a query (u,c,w). We perform a first colored ancestor query (u,c) in T that returns a node u′. We then look up the node v in Tc corresponding to u′ and perform a path minima query between v and the root of Tc. Let x be the returned node. If wx>w, we stop. Otherwise, we report the node x. Finally, we recurse on the path from the root of Tc to the parent of x and on the path from v to the child x′ of x on the path to v. To find x′, we use a level-ancestor query on v with d equal to the depth of x plus one.

The first colored ancestor query takes O(loglogn) time. Since a path minima query takes constant time, each recursion step uses constant time. The number of recursive steps is O(1+occ), and hence, the total time is O(loglogn+occ). In summary, we have shown the following result.

Lemma 6.

Let 𝒞 be a set of colors and let T be a rooted tree with n nodes, where each node v has a (possibly empty) set of colors Cv ⊆ 𝒞 and a weight function πv : Cv → {1,…,W}. We can construct a data structure for the colored ancestor threshold reporting problem that uses O(n+∑v∈T |Cv|) space, O(n+∑v∈T |Cv|) expected preprocessing time, and supports queries in O(loglogn+occ) time, where occ is the number of reported nodes.
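For intuition, the query semantics of Lemma 6 can be realized naively by walking the ancestor path. This O(depth)-time baseline (names and data layout are ours) is not the data structure above, but it specifies exactly what a query must return:

```python
def threshold_report(parent, colors, v, c, w):
    """Report all ancestors u of v (v included) with color c of weight <= w.
    parent maps node -> parent (the root is absent or maps to itself);
    colors maps node -> {color: weight}."""
    out = []
    u = v
    while True:
        cw = colors.get(u, {}).get(c)
        if cw is not None and cw <= w:
            out.append(u)
        p = parent.get(u)
        if p is None or p == u:      # reached the root
            break
        u = p
    return out
```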

4.2 Truncate Match Reporting

We now reduce truncate match reporting to colored ancestor threshold reporting.

Data Structure.

For each string Pi ∈ 𝒫, let αiwi be the last run of Pi and let Pi′ be the truncated string of Pi. We construct the compact trie T of the reversed truncated strings P1′,P2′,…,Pk′. Each Pi′ corresponds to a node in T. Note that several truncated strings can correspond to the same node if the original strings end in different runs. For each node v ∈ T, let Iv be the set of string indices whose truncated strings correspond to node v.

Our data structure consists of the following information.

  • The compact trie T.

  • For each node v in T we store the following information:

    • Cv = {αi ∣ i ∈ Iv}.

    • A function πv : Cv → {1,…,m} where πv(α) = min{wi ∣ i ∈ Iv, αi=α}.

    • For each α ∈ Cv: a list Wv,α containing the set {(wi,i) ∣ i ∈ Iv and αi=α}, sorted in increasing order of wi.

  • A colored ancestor threshold reporting data structure for T with colors Cv and weight functions πv.

  • An array A associating each string index with its corresponding node in the trie.

See Figure 2 for an example.

Figure 2: The structures of Lemma 6 and Lemma 7 for the strings {a5b1, a5b3a2, a5b3a1, a3b3a1, b2a1, b2}. To the right of each node v ∈ T is the set of colors associated with v. To the right of each node v in one of the trees Ta and Tb is the list Wv,α for α=a and α=b, respectively. Finally, πv(α) is the first weight in Wv,α, where v is the corresponding node in Tα.

The compact trie T uses O(k) space. The total size of the Iv sets is O(k), and thus ∑v∈T |Cv| = O(k). It follows that the total size of the Wv,α lists is O(k). The space of the colored ancestor threshold reporting data structure is O(k+∑v∈T |Cv|) = O(k). Thus, the total size of the data structure is O(k).

We can construct the compact trie T in O(m¯+kloglogk) expected time using Lemma 5. The preprocessing time of the colored ancestor threshold reporting data structure is expected O(k+∑v∈T |Cv|) = O(k). Finally, we can sort all the Wv,α lists in O(∑v∈T ∑α∈Cv |Wv,α| loglogk) = O(kloglogk) time [38]. Thus, the preprocessing takes O(m¯+kloglogk) expected time.

Queries.

To answer a truncate match reporting query (i,α,w), we perform a colored ancestor threshold reporting query (A[i],α,w). For each node u returned by the query, we report all string indices in Wu,α with weight at most w by a linear scan of the list, stopping when the weight exceeds w.

The colored ancestor threshold reporting query takes O(loglogk+occ) time. We have at least one occurrence for each reported node, and the occurrences are found by a scan taking linear time in the number of reported indices. In total the query takes O(loglogk+occ) time.

In summary, we have shown the following result.

Lemma 7.

Let 𝒫={P1,…,Pk} be a set of k strings. Given the RLE representation 𝒫¯={P1¯,…,Pk¯} of 𝒫 consisting of m¯ runs, we can construct a data structure for the truncate match reporting problem using O(k) space and expected O(m¯+kloglogk) preprocessing time that answers queries in O(loglogk+occ) time, where occ is the number of reported occurrences.

5 A Simple Solution

In this section, we provide a simple solution that solves the compressed dictionary matching problem in O(m¯) space and O(n¯loglogk+m¯loglogm+n+m+occ) expected time. In the next section, we show how to improve this to obtain the main result of Theorem 1.

Let TRLE be the run-length encoded trie of 𝒫, where each run αx in a pattern P𝒫 is encoded as a single letter “αx”. See Figure 3 for an example.

The idea is to use TRLE to process S one run at a time. At each step, we maintain a position in TRLE such that the string spelled on the path from the root to that position is the longest suffix of the part of the text we have seen so far. To report matches efficiently, we use the data structure for truncate match reporting.

We will assume for now that all patterns in 𝒫 have at least 2 runs. Later, we show how to deal with patterns that consist of a single run.

Data Structure.

For a node v in the trie TRLE, let sv denote the string obtained by concatenating the characters of the edges of the root-to-v path where a run is not considered as a single letter. We store the following for each node v in TRLE:

  • Dv: A dictionary of the children of v, with their edge labels as key.

  • Fv: A failure link to the node u in TRLE for which su is the longest proper suffix of sv. We define the failure link for the root to be itself.

  • iv: The pattern index such that the truncated pattern Piv′ is the longest suffix of sv among all truncated patterns. If no such pattern exists, let iv=−1.

We build the truncate match reporting data structure for the set 𝒫. Finally, for each pattern Pi=Pi′αixi, we store its length |Pi| and the length xi of its last run. See Figure 3 for an example.

Figure 3: The structure of the simple solution for the 6 patterns {a5b1,a5b3a2,a5b3a1,a3b3a1,b2a1,b2}. Here the blue arrows indicate the failure link of a node. Failure links going to the root node have been left out for simplicity. Note that the patterns are the same as in Figure 2.

The number of edges and nodes in TRLE is O(m¯). The dictionaries and failure links associated with TRLE use space proportional to the number of edges and vertices in TRLE, hence O(m¯) space. By Lemma 7 the truncate match reporting data structure uses O(k) space. In total, we use O(m¯+k)=O(m¯) space.

Query.

Given a run-length encoded string S, we report all locations of occurrences of the patterns 𝒫 in S as follows. We process S one run at a time. Let S′ be the processed part of S. We maintain the node v in TRLE such that sv is the longest suffix of S′. Initially, S′=ϵ and v is the root. Let αy be the next unprocessed run in S. We proceed as follows:

Step 1: Report Occurrences.

Query the truncate match reporting data structure with (iv,α,y) unless iv=−1. For each pattern index j returned by the query, report that pattern Pj occurs at location |S′|−|Pj|+xj.

Step 2: Follow Failure Links.

While v is not the root and αy ∉ Dv, set v=Fv. Finally, if αy ∈ Dv, let v=Dv[αy].

Time Analysis.

At each node we visit when processing S, we use constant time to determine if we follow an edge or a failure link. We traverse one edge per run in S and thus traverse at most O(n¯) edges. Since a failure link points to a proper suffix, the total number of failure links we traverse cannot exceed the number of characters we process throughout the algorithm. Since we process n characters we traverse at most O(n) failure links. Furthermore, we report from at most one node per run in S, and by Lemma 7 we use O(n¯loglogk+occ) time to report the locations of the matched patterns. In total, we use O(n+n¯loglogk+occ) time.

Correctness.

By induction on the iteration t, we will show that sv is the longest suffix of S′ in TRLE and that we report all starting positions of occurrences of patterns of 𝒫 in S′. Initially, t=0, S′=ϵ, and v is the root of TRLE, and thus sv=ϵ.

For the induction step, assume that at the beginning of iteration t, sv is the longest suffix of S′ in TRLE. Let αy be the next unprocessed run in S and let X ⊆ 𝒫 be the subset of patterns that have an occurrence ending in the unprocessed run αy. Note that any pattern P ∈ X has exactly one occurrence ending in the run αy, since P consists of at least two runs.

We first argue that we correctly report all occurrences. It follows immediately from the definition of truncate matching that X consists of all the patterns P ∈ 𝒫 that truncate match S′αy. Let Pi′ be the longest suffix of S′ among all the truncated patterns. We will show that P ∈ X iff P truncate matches Pi′αy. Since Pi′ is a suffix of S′, it follows immediately that all patterns that truncate match Pi′αy also truncate match S′αy. We show by contradiction that P ∈ X implies that P truncate matches Pi′αy. Assume that P=P′αx ∈ X but does not truncate match Pi′αy. Since P truncate matches S′αy, we have that x ≤ y and P′ is a suffix of S′. Thus, if P′αx does not truncate match Pi′αy, it must be the case that P′ is not a suffix of Pi′. But then P′ is a longer suffix of S′ than Pi′, contradicting that Pi′ is the longest suffix of S′ among all truncated patterns. Thus, the patterns that truncate match Pi′αy are exactly X, and by querying the truncate match reporting data structure with (i,α,y), we report all occurrences in X and hence the occurrences that end in the unprocessed run.

Finally, we will show that after step 2, the string sv is the longest suffix of S′αy. Let w be the node stored in v at the end of step 2, i.e., after iteration t of the algorithm v=w. Assume for the sake of contradiction that after step 2, sw is not the longest suffix of S′αy. Then either sw is not a suffix of S′αy, or there is another node u such that su is a suffix of S′αy and sw is a proper suffix of su. By the definition of the failure links and the dictionary Dv, after step 2, sw must be a suffix of S′αy. Furthermore, either sw=ϵ (w is the root) or αy ∈ Dp(w), where p(w) is the parent of w. Assume that there is a node u such that su is a suffix of S′αy and sw is a proper suffix of su. Then there exist nodes w′ and u′ such that sw=sw′αy and su=su′αy and αy ∈ Dw′ ∩ Du′. Since sw is a proper suffix of su, sw′ is a proper suffix of su′. At the beginning of step 2, su′ must be a suffix of sv, since by the induction hypothesis sv is the longest suffix of S′. Since su′ is longer than sw′, we meet u′ before w′ in the traversal in step 2. Since αy ∈ Du′, we would have stopped at node u′ and not w′, a contradiction.

Preprocessing.

First, we construct the run-length encoded trie TRLE and the truncate match reporting data structure. To compute the failure link Fu of a non-root node u, observe that if the edge (v,u) is labeled αx, then there are three cases which determine the value of Fu.

  • Either Fu=Dw[αx], where w is the first node with αx ∈ Dw reached by repeatedly following the failure links starting at node Fv.

  • If no such node exists, then if the root r has an outgoing edge αy with y<x, then Fu=Dr[αy], where y is the maximum integer y<x such that there is an outgoing edge αy from the root r.

  • Otherwise Fu=r.

We compute the failure links by a breadth-first traversal of the trie starting at the root, using the above cases. Note that the second case requires a predecessor query. In our preprocessing of the iv values, we use the reverse failure links of the nodes v ∈ TRLE, i.e., the sets Fv−1={u ∣ Fu=v}. We compute these while computing the failure links. To compute iv, we first set iv for each node v that has a child u corresponding to a pattern. We then start a breadth-first traversal of the graph induced by the reverse failure links from each of these nodes. For each node v we visit, we set iv=iu for the parent u of v in the traversal. Note that each node is visited in at most one of these traversals, since a node has exactly one failure link. For the remaining unvisited nodes, let iv=−1.

Preprocessing Time Analysis.

We can construct the run-length encoded trie of the patterns 𝒫 in expected O(m¯+kloglogk) time by Corollary 4 and the proof of Lemma 3. We use expected O(m¯) time to construct the dictionaries Dv of the nodes v in the run-length encoded trie TRLE and O(m¯+kloglogk) expected time for the truncate match reporting data structure by Lemma 7. Since a failure link points to a proper suffix, each root-to-leaf path in TRLE can traverse a number of failure links proportional to the number of characters on the path. The sum of characters over all root-to-leaf paths is O(m); hence, we traverse at most O(m) failure links while constructing the failure links. We use at most one predecessor query per node in TRLE, and thus we use O(m¯loglogm+m) time to construct the failure links by Lemma 2. Finally, computing the iv values requires O(m¯) time to traverse TRLE. In total, we use O(m¯loglogm+m+kloglogk)=O(m¯loglogm+m) expected time.

Dealing with Single Run Patterns.

To correctly report the occurrences of the single-run patterns, we construct a dictionary D such that D[α] is the list of single-run patterns of the character α sorted by their length. Let S′ be the processed part of S and let αy be the next unprocessed run in S. To report the single-run patterns occurring in the run αy, we do a linear scan of D[α]. For each single-run pattern αx in D[α], the pattern αx occurs at all the locations |S′|+i for 0 ≤ i ≤ y−x. If x>y, then αx does not occur in αy, and neither does any later pattern in the list D[α], since the patterns are sorted by length. We can identify the single-run patterns and construct D in O(m¯+kloglogk) expected time, and reporting the occurrences of the single-run patterns takes O(1+occ) time per run, neither of which affects our overall bounds.
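The single-run scan can be sketched as follows, assuming (our layout) that `D` maps each character to its list of `(length, pattern_id)` pairs sorted by length and that `offset` is the number of processed characters |S′|:

```python
def report_single_run(D, alpha, y, offset):
    """Occurrences of single-run patterns inside one text run alpha^y.
    D[alpha] is a list of (length, pattern_id) pairs sorted by length;
    offset is the number of already processed characters.
    Returns (pattern_id, start_position) pairs."""
    occs = []
    for x, pid in D.get(alpha, []):
        if x > y:
            break    # later patterns are even longer, so none can occur
        # alpha^x occurs at every offset + i for 0 <= i <= y - x
        occs.extend((pid, offset + i) for i in range(y - x + 1))
    return occs
```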

In summary, we have shown the following:

Lemma 8.

Let 𝒫={P1,…,Pk} be a set of k strings of total length m and let S be a string of length n. Given the RLE representations of 𝒫 and S consisting of m¯ and n¯ runs, respectively, we can solve the dictionary matching problem in O(n¯loglogk+m¯loglogm+n+m+occ) expected time and O(m¯) space, where occ is the total number of occurrences of 𝒫 in S.

In the next section, we show how to improve the time bound to O((m¯+n¯)loglogm+occ) expected time, thus effectively removing the linear dependency on the uncompressed lengths of the string and the patterns.

6 Full Algorithm

In the simple solution, a failure link may point to a suffix that is only one character shorter, so navigating the failure links during the algorithm can take Ω(n) time. In this section, we show how to modify the failure links so that they point to a suffix that has fewer runs. The main idea is to group nodes that differ only by the length of their first run and navigate them simultaneously.

Data Structure.

We build the same structure as in Section 5, with a slight alteration to the failure links. Let v be a node in the trie TRLE and let β be the first character in sv, i.e., sv=βxX for some x. The failure link Fv stores a pointer to the node u in TRLE such that su is the longest suffix of X represented in TRLE. Additionally, we store the length x of the first run of sv as Zv=x. We define the failure link of the root to be the root itself.

We partition the nodes of TRLE into groups as follows: Let v and u be two nodes of TRLE. If sv=βxX and su=βyX, then v and u belong to the same group G. We define the group of the root to consist only of the root node. See Figure 4. For a group G, let Gα,x be all the vertices v∈G where v has an outgoing edge labeled αx.
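Conceptually, this partition keys each node by the first character of its string together with the remainder after the first run; nodes βxX and βyX share the key (β,X). A small sketch, with node strings given explicitly for illustration:

```python
from collections import defaultdict

def group_nodes(strings_by_node):
    """Partition nodes so that nodes with strings beta^x X and beta^y X
    (same beta, same X) share a group. The root (empty string) is alone."""
    groups = defaultdict(list)
    for v, s in strings_by_node.items():
        if not s:
            groups[("", "")].append(v)   # the root forms its own group
            continue
        beta = s[0]
        x = len(s) - len(s.lstrip(beta))  # length of the first run
        groups[(beta, s[x:])].append(v)   # key: (first char, rest after run)
    return groups
```

In the actual data structure the groups are of course computed on the trie without materializing the node strings; this sketch only illustrates the equivalence relation.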

For each group G, we store the following:

  • DG: A dictionary of the labels of the outgoing edges of the nodes in G. For each label αx, DG[αx] is a predecessor structure of the nodes v∈Gα,x ordered by the length of their first run Zv.

Figure 4: The structure of the full solution for the 6 patterns {a5b1,a5b3a2,a5b3a1,a3b3a1,b2a1,b2}. Here the blue enclosures indicate grouped nodes, and the blue arrows indicate the failure link that all nodes in the group have in common. Failure links going to the root node have been left out for simplicity. Note that the patterns are the same as in Figure 2.

The dictionaries use space linear in the number of outgoing edges of the nodes in G. Since the number of edges and nodes in TRLE is O(m¯) and the groups partition the nodes of TRLE, the combined size of the dictionaries over all groups is O(m¯). In total, we use O(m¯) space.

Query.

Given a run-length encoded string S, we report all locations of occurrences of the patterns 𝒫 in S as follows. We process S one run at a time. Let S′ be the processed part of S. We maintain the node v in TRLE such that sv is the longest suffix of S′ in TRLE. Initially, S′=ϵ and v is the root. Let αy be the next unprocessed run in S, and let G denote the group of v.

Step 1: Report Occurrences.

Query the truncate match reporting data structure with (iv,α,y) unless iv=1. For each pattern j returned by the structure, report that pattern j occurs at location |S′|−|Pj|+xj.

Step 2: Follow Failure Links.

While v is not the root and Zv does not have a predecessor in DG[αy], set v=Fv (updating G to the group of the new v). Finally, if DG[αy] has a predecessor u of Zv, then set v=Du[αy].
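The following sketch models step 2 with sorted lists queried via bisect in place of the predecessor structures. The node and group layout, and the fallback at the root (taking the α-edge with the longest run length at most y, mirroring the second case of the failure-link computation in the preprocessing), are our illustrative assumptions rather than the paper's exact data structures:

```python
import bisect

class Node:
    def __init__(self, Z=0):
        self.Z = Z           # Z_v: length of the first run of s_v
        self.fail = None     # F_v: failure link, to a suffix with fewer runs
        self.children = {}   # D_v: label (char, run_length) -> child node
        self.group = None    # group of this node

class Group:
    def __init__(self):
        # D_G: label -> (sorted Z-values, parallel node list)
        self.D = {}

def step2(v, root, alpha, y):
    """Advance from state v on the text run alpha^y (sketch of step 2)."""
    label = (alpha, y)
    while v is not root:
        keys, nodes = v.group.D.get(label, ([], []))
        j = bisect.bisect_right(keys, v.Z) - 1   # predecessor: max Z_u <= Z_v
        if j >= 0:
            return nodes[j].children[label]       # v = D_u[alpha^y]
        v = v.fail
    # At the root, take the alpha-edge with the longest run x <= y, if any.
    best = None
    for (c, x), child in root.children.items():
        if c == alpha and x <= y and (best is None or x > best[0]):
            best = (x, child)
    return best[1] if best else root
```

Each iteration of the while loop follows one failure link, and the final predecessor query descends exactly one edge, matching the description above.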

Time Analysis.

At each node we visit while processing S, we use O(loglogm) time to determine whether we traverse an edge within the group or a failure link, by Lemma 2. We traverse one edge per run in S and thus traverse at most O(n¯) edges. Since the suffix a failure link points to has fewer runs and the number of runs in S is n¯, we traverse at most O(n¯) failure links. Furthermore, we report from at most one node per run in S, and by Lemma 7 we use O(n¯loglogk+occ) time to report the locations of the matched patterns. In total, we use O(n¯loglogm+occ) time.

Correctness.

By induction on the iteration t, we will show that sv is the longest suffix of S′ in TRLE and that we report all starting positions of occurrences of the patterns 𝒫 in S. Initially, t=0, S′=ϵ, and v is the root of TRLE and thus sv=ϵ. Assume that at the beginning of iteration t, sv is the longest suffix of S′ in TRLE. Let αy be the next unprocessed run in S. By the same argument as in Section 5, we report all the occurrences of the patterns that end in the unprocessed run αy. What remains is to show that after step 2 the string sv is the longest suffix of S′αy. Let w be the node assigned to v at the end of step 2, i.e., after iteration t of the algorithm v=w. Assume for the sake of contradiction that after step 2, sw is not the longest suffix of S′αy. Then either sw is not a suffix of S′αy, or there is another node u such that su is a suffix of S′αy and sw is a proper suffix of su. By the definition of the failure links and the dictionary DG, after step 2 sw must be a suffix of S′αy. Furthermore, either sw=ϵ (w is the root) or αy∈Dp(w), where p(w) is the parent of w. Assume then that there is a node u such that su is the longest suffix of S′αy and sw is a proper suffix of su. Then there is a node u′ such that su=su′αy and αy∈Du′. At the beginning of step 2, su′ must be a suffix of sv, since by the induction hypothesis sv is the longest suffix of S′. Hence, in step 2 we would have stopped at a node v′ in the group G′ of node u′. Furthermore, Zv′≥Zu′ since su′ is a suffix of sv′. Since αy∈Du′, we have αy∈DG′, and in the predecessor structure DG′[αy] the node u′ has Zu′ as a key. Since Zv′≥Zu′ and su is the longest suffix of S′αy, the node u′ is the predecessor of Zv′, and thus after step 2 we have w=Du′[αy]=u, a contradiction.

Preprocessing.

First we construct the run-length encoded trie TRLE and the truncate match reporting data structure. We then compute the groups of TRLE as follows. First we group the children of the root r based on the label of the outgoing edge, such that all children v of r with an α on the edge (r,v) are in the same group. We compute the groups of all descendant nodes as follows: given a group G, the nodes in DG[αx] constitute a new group G′ for each label αx∈DG. We then compute failure links between groups. Note that since a failure link points to a suffix with at least one fewer run, all the nodes v in a group G have the same failure link. We compute the failure link Fu for a non-root node u as follows. Let the label of the edge (v,u) be αx. There are three cases that determine the value of Fu.

  • Either Fu=Dw[αx], where w is the first non-root node with αx∈Dw reached by repeatedly following the failure links starting at node Fv.

  • Otherwise, if u is not a child of the root r and the root r has an outgoing edge labeled αy with y≤x, then Fu=Dr[αy], where y is the maximum integer y≤x such that there is an outgoing edge labeled αy from the root r.

  • Otherwise Fu=r.

We compute the failure links by traversing the trie in a breadth-first traversal starting at the root, using the above cases. Note that the second case requires a predecessor query. To compute iv we simulate the reverse failure links used in Section 5. Observe that the failure links of Section 5 are composed of failure links where both endpoints are in the same group and failure links where the endpoints are in different groups. Since a failure link points to a suffix with at least one fewer run, we have only computed the failure links between groups. To simulate the failure links within a group, observe that if v,u∈G with sv=βxX and su=βyX, then there is a failure link from v to u in the data structure of Section 5 if and only if y<x and there is no node w∈G where sw=βzX and y<z<x. Hence, by building a predecessor structure over the nodes in G we can simulate the failure links of Section 5 and compute iv.
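The within-group simulation only needs a predecessor query over the first-run lengths of the nodes in a group. A minimal sketch (names illustrative):

```python
import bisect

def section5_fail_within_group(group_Z_sorted, x):
    """Given the sorted first-run lengths Z of the nodes in a group G and the
    first-run length x of a node v in G, return the first-run length y of the
    node u that v's Section-5-style failure link would point to within G,
    or None if no such node exists.

    u is the node with the largest y < x, i.e. the strict predecessor of x."""
    j = bisect.bisect_left(group_Z_sorted, x) - 1
    return group_Z_sorted[j] if j >= 0 else None
```

When the result is None, the Section 5 failure link leaves the group, and the already computed between-group failure link applies instead.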

Preprocessing Time Analysis.

We can construct the run-length encoded trie of the patterns 𝒫 in expected O(m¯+kloglogk) time by Corollary 4 and the proof of Lemma 3. We use expected O(m¯) time to construct the dictionaries Dv of the nodes v in the run-length encoded trie TRLE and O(m¯+kloglogk) time for the truncate match reporting data structure by Lemma 7. To partition TRLE into groups we use O(m¯) time, and O(m¯loglogm) time to construct the predecessor structures of the groups by Lemma 2. Since a failure link points to a suffix with at least one fewer run, each root-to-leaf path in TRLE can traverse a number of failure links proportional to the number of runs on the path. The sum of runs over all root-to-leaf paths is O(m¯); hence, we traverse at most O(m¯) failure links as we construct the failure links. We use at most one predecessor query for each node in TRLE as we construct the failure links, and thus we use O(m¯loglogm) time to construct the failure links by Lemma 2. Finally, computing iv requires O(m¯loglogm) time to traverse TRLE and simulate the failure links. In total, we use O(m¯loglogm+kloglogk)=O(m¯loglogm) expected time. In summary, we have shown Theorem 1.


Appendix A Proof of Compact Trie Construction

Proof.

Our algorithm proceeds as follows:

Step 1: Compute Longest Common Prefixes.

Let 𝒫={P1,…,Pk} be the sorted order of the strings of total length m. We compute the longest common prefixes ℓ1,…,ℓk−1, where ℓi=𝗅𝖼𝗉(Pi,Pi+1), of each consecutive pair of strings in 𝒫. To do so, we scan each pair of strings Pi and Pi+1 from left to right to find their longest common prefix.
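Step 1 can be sketched directly (function names are illustrative):

```python
def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i

def consecutive_lcps(sorted_patterns):
    """l_i = lcp(P_i, P_{i+1}) for consecutive pairs of sorted patterns."""
    return [lcp(sorted_patterns[i], sorted_patterns[i + 1])
            for i in range(len(sorted_patterns) - 1)]
```

Each comparison scans at most min(|Pi|,|Pi+1|)+1 characters, so the total cost over all pairs is O(m).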

Step 2: Construct the Compact Trie.

To construct the compact trie T, we add the strings one at a time in sorted order to an initially empty trie. We maintain the string depth of each node during the construction. Suppose we have constructed the compact trie Ti of the strings {P1,…,Pi} and consider the rightmost path p in Ti corresponding to Pi. Let P′i+1 denote the suffix of Pi+1 not shared with Pi. We add the string P′i+1 to Ti as follows. If ℓi=|Pi|, we extend the path p with a new edge representing P′i+1. Otherwise, we traverse p bottom up to find the location at string depth ℓi to attach a new edge representing P′i+1. Note that this location is either an existing node in T or we need to split an edge and add a new node. At the end, we return the trie T=Tk.
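A sketch of step 2, maintaining the rightmost path as a stack. The dict-based node layout is illustrative, not the paper's exact representation, and the strings are assumed sorted and distinct:

```python
from os.path import commonprefix

def build_compact_trie(sorted_strings):
    """Build the compact trie of sorted, distinct strings by inserting them
    one at a time and walking the rightmost path bottom up (step 2)."""
    def node(depth):
        return {"depth": depth, "children": {}}

    root = node(0)
    path = [root]        # rightmost root-to-leaf path; top of stack is deepest
    prev = None
    for s in sorted_strings:
        l = 0 if prev is None else len(commonprefix((prev, s)))
        deep = None
        while path[-1]["depth"] > l:     # bottom-up walk of the rightmost path
            deep = path.pop()
        v = path[-1]
        if v["depth"] < l:
            # Split the edge (v -> deep) with a new node at string depth l.
            old_label = prev[v["depth"]:deep["depth"]]
            mid = node(l)
            del v["children"][old_label]
            v["children"][prev[v["depth"]:l]] = mid
            mid["children"][prev[l:deep["depth"]]] = deep
            v = mid
            path.append(mid)
        leaf = node(len(s))
        v["children"][s[l:]] = leaf      # attach the unshared suffix of s
        path.append(leaf)
        prev = s
    return root
```

Edges popped off the stack are never touched again, which is exactly the charging argument used in the analysis below the proof.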

Since the strings are sorted, step 2 inductively constructs the compact trie Ti, for i=1,…,k, of the first i strings. Thus, T=Tk is the compact trie of 𝒫. Step 1 uses O(m) time. For step 2, we bound the total number of edges traversed in the bottom-up traversals of the rightmost paths. Consider traversing the rightmost path p in Ti while adding P′i+1 and let v be the (existing or newly created) node of string depth ℓi on p where we attach a new child edge representing P′i+1. The edges on p below v are never traversed again and are part of the final trie T. Hence, the total time to traverse them is at most O(k)=O(m¯). The remaining parts of step 2 take O(m) time, and hence, the total time is O(m). We use O(k+m)=O(m) space in total.