Compressed Dictionary Matching on Run-Length Encoded Strings
Abstract
Given a set of pattern strings $\mathcal{P}$ and a text string $T$, the classic dictionary matching problem is to report all occurrences of each pattern in $T$. We study the dictionary matching problem in the compressed setting, where the pattern strings and the text string are compressed using run-length encoding, and the goal is to solve the problem without decompression and achieve efficient time and space in the size of the compressed strings. Let $m$ and $n$ be the total length of the patterns in $\mathcal{P}$ and the length of the text string $T$, respectively, and let $\overline{m}$ and $\overline{n}$ be the total number of runs in the run-length encoding of the patterns in $\mathcal{P}$ and of $T$, respectively. Our main result is an algorithm that achieves $O((\overline{m} + \overline{n})\log\log m + \mathrm{occ})$ expected time and $O(\overline{m})$ space, where $\mathrm{occ}$ is the total number of occurrences of patterns in $T$. This is the first non-trivial solution to the problem. Since any solution must read the input, our time bound is optimal within a $\log\log m$ factor. We introduce several new techniques to achieve our bounds, including a new compressed representation of the classic Aho-Corasick automaton and a new efficient string index that supports fast queries in run-length encoded strings.
Keywords and phrases: Dictionary matching, run-length encoding, compressed pattern matching
Funding: Philip Bille: Danish Research Council grant DFF-8021-002498.
2012 ACM Subject Classification: Theory of computation → Pattern matching
Editors: Paola Bonizzoni and Veli Mäkinen
1 Introduction
Given a set of pattern strings $\mathcal{P}$ and a string $T$, the dictionary matching problem (also called the multi-string matching problem) is to report all occurrences of each pattern in $T$. Dictionary matching is a classic and extensively studied problem in combinatorial pattern matching [1, 27, 6, 7, 8, 31, 18, 19, 39, 11, 40, 25, 33, 12, 36, 43, 16, 37, 9], with the first efficient solution due to Aho and Corasick [1] from the 1970s. Dictionary matching is also a key component in several other algorithms for combinatorial pattern matching, see e.g. [45, 26, 13, 23, 17, 22, 21, 5, 34].
A run in a string is a maximal substring of identical characters, denoted $\alpha^\ell$, where $\alpha$ is the character of the run and $\ell$ is the length of the run. The run-length encoding (RLE) of a string is obtained by replacing each run $\alpha^\ell$ in the string by the pair $(\alpha, \ell)$. For example, the run-length encoding of $aaabbbbc$ is $(a,3)(b,4)(c,1)$.
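The following is a minimal sketch of run-length encoding as just defined: each maximal run of identical characters becomes a (character, length) pair. The function names are ours, not the paper's.

```python
def rle_encode(s: str) -> list:
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)   # extend the current run
        else:
            runs.append((ch, 1))               # start a new run
    return runs

def rle_decode(runs) -> str:
    return "".join(ch * length for ch, length in runs)

assert rle_encode("aaabbbbc") == [("a", 3), ("b", 4), ("c", 1)]
assert rle_decode(rle_encode("aaabbbbc")) == "aaabbbbc"
```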
This paper focuses on compressed dictionary matching, where $\mathcal{P}$ and $T$ are given in run-length encoded form. The goal is to solve the problem without decompression and achieve efficient time and space in terms of the total number of runs in $\mathcal{P}$ and $T$. Compressed dictionary matching has been studied for other compression schemes [39, 40, 18, 41], and run-length encoding has been studied for other compressed pattern matching problems [4, 5, 10, 15, 46, 30]. However, no non-trivial bounds are known for the combination of the two.
Results.
We address the basic question of whether it is possible to solve compressed dictionary matching in near-linear time in the total number of runs in the input. We show the following main result.
Theorem 1.
Let $\mathcal{P}$ be a set of strings of total length $m$ and let $T$ be a string of length $n$. Given the RLE representations of $\mathcal{P}$ and $T$ consisting of $\overline{m}$ and $\overline{n}$ runs, respectively, we can solve the dictionary matching problem in $O((\overline{m} + \overline{n})\log\log m + \mathrm{occ})$ expected time and $O(\overline{m})$ space, where $\mathrm{occ}$ is the total number of occurrences of patterns from $\mathcal{P}$ in $T$.
Theorem 1 assumes a standard word-RAM model of computation with logarithmic word length, and space is the number of words used by the algorithm, excluding the input strings, which are assumed to be read-only.
This is the first non-trivial algorithm for compressed dictionary matching on run-length encoded strings. Since any solution must read the input, the time bound of Theorem 1 is optimal within a $\log\log m$ factor.
Techniques.
Our starting point is the classic Aho-Corasick algorithm for dictionary matching [1] that generalizes the Knuth-Morris-Pratt algorithm [42] for single string matching to multiple strings. Given a set of pattern strings $\mathcal{P}$ of total length $m$, the Aho-Corasick automaton (AC automaton) for $\mathcal{P}$ consists of the trie of the patterns in $\mathcal{P}$. Hence, any path from the trie's root to a node $v$ corresponds to a prefix of a pattern in $\mathcal{P}$. Each node $v$ stores a special failure link that points to the node corresponding to the longest prefix of a pattern matching a proper suffix of the string identified by $v$, and an output link that points to the node corresponding to the longest pattern that matches a suffix of the string identified by $v$.
To solve dictionary matching, we read one character of $T$ at a time and traverse the AC automaton. At each step, we maintain the node corresponding to the longest pattern prefix matching a suffix of the current prefix of $T$. If we cannot match a character, we recursively follow failure links (without reading further in $T$). If we reach a node with an output link, we output the corresponding pattern and recursively follow output links to report all other patterns matching at the current position in $T$. If we implement the trie using perfect hashing [35], we can perform the top-down traversal in constant time per character. Since failure links always point to a node of strictly smaller depth in the trie, it follows that we can charge the cost of traversing these to reading characters in $T$, and thus, the total time for traversing failure links is $O(n)$. Each traversal of an output link results in a reported occurrence, and thus, the total time for traversing output links is $O(\mathrm{occ})$. In total, this leads to a solution to dictionary matching that uses $O(m)$ space and $O(n + m + \mathrm{occ})$ expected time.
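The following sketch illustrates the uncompressed AC automaton just described (trie, failure links, and outputs), using Python dictionaries as a stand-in for the perfect-hashing dictionaries of [35]. For brevity the sketch merges output sets along failure links rather than chaining output links; it is illustrative only.

```python
from collections import deque

def build_ac(patterns):
    goto, fail, out = [{}], [0], [set()]       # node 0 is the root
    for idx, p in enumerate(patterns):
        v = 0
        for ch in p:
            if ch not in goto[v]:              # create a new trie node
                goto[v][ch] = len(goto)
                goto.append({}); fail.append(0); out.append(set())
            v = goto[v][ch]
        out[v].add(idx)
    queue = deque(goto[0].values())
    while queue:                               # BFS visits nodes in depth order
        v = queue.popleft()
        for ch, u in goto[v].items():
            f = fail[v]
            while f and ch not in goto[f]:     # walk failure links of the parent
                f = fail[f]
            w = goto[f].get(ch, 0)
            fail[u] = w if w != u else 0
            out[u] |= out[fail[u]]             # merged outputs (the paper chains
            queue.append(u)                    # output links instead)
    return goto, fail, out

def ac_search(text, patterns):
    goto, fail, out = build_ac(patterns)
    v = 0
    for i, ch in enumerate(text):
        while v and ch not in goto[v]:         # failure links: O(n) in total
            v = fail[v]
        v = goto[v].get(ch, 0)
        for j in out[v]:                       # an occurrence ends at position i
            yield (i - len(patterns[j]) + 1, j)

print(sorted(ac_search("ushers", ["he", "she", "his", "hers"])))
# -> [(1, 1), (2, 0), (2, 3)]  (pattern j starts at the given position)
```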
At a high level, our main result in Theorem 1 can be viewed as a compressed implementation of the AC automaton that implements dictionary matching in $O((\overline{m} + \overline{n})\log\log m + \mathrm{occ})$ expected time and $O(\overline{m})$ space. Compared to the uncompressed AC automaton, we achieve the same bounds in terms of the compressed input, except for a $\log\log m$ factor in the time complexity.
To implement the AC automaton in $O(\overline{m})$ space, we introduce the run-length encoded trie $R$ of $\mathcal{P}$, where each run $\alpha^\ell$ in a pattern is encoded as a single letter "$\alpha^\ell$", and show how to simulate the action of the Aho-Corasick algorithm on $R$, processing one run of $T$ at a time. The key challenge is that the nodes of the AC automaton are not all explicitly represented in $R$, and thus, we cannot directly implement the failure and output links in $R$.
Naively, we can simulate the failure links of the AC automaton directly on $R$. However, as in the Aho-Corasick algorithm, this leads to a solution traversing up to $n$ failure links, which we cannot afford. Instead, we show how to efficiently construct a data structure to group nodes and process them simultaneously, bounding the total number of failure link traversals by the number of runs in the text.
Similarly, we can simulate the output links by grouping them at the explicit nodes, but even on simple instances this would result in too many output links in total, which we also cannot afford. Alternatively, we can store a single output link for each explicit node as in the AC automaton. However, the occurrences we need to report are not only patterns that are suffixes of the string seen until now, but also patterns ending inside the last processed run. Not all occurrences ending in the last run are substrings of each other, and thus, saving only the longest is not sufficient. Instead, we reduce the problem of reporting all occurrences that end in a specific run of $T$ to a new data structure problem called the truncate match reporting problem. To solve this problem, we reduce it to a problem of independent interest on colored weighted trees called colored ancestor threshold reporting. We present an efficient solution to the truncate match reporting problem by combining several existing tree data structures in a novel way, ultimately leading to the final solution.
Along the way, we also develop a simple and general technique for compressed sorting of run-length encoded strings of independent interest, which we need to preprocess our pattern strings efficiently.
Outline.
In Section 3, we show how to efficiently sort run-length encoded strings and subsequently construct the corresponding compact trie efficiently. In Section 4, we introduce the truncate match reporting and colored ancestor threshold reporting problems, our efficient solution to the colored ancestor threshold reporting problem, and our reduction from the truncate match reporting problem to the colored ancestor threshold reporting problem. In Section 5, we first give a simple and inefficient version of our algorithm that focuses on navigating the run-length encoded trie. Finally, in Section 6, we extend and improve the simple version of our algorithm to achieve our main result.
2 Preliminaries
We use the following well-known basic data structure primitives. Let $X$ be a set of integers from a universe of size $u$. Given an integer $x$, a membership query determines if $x \in X$. A predecessor query returns the largest $y \in X$ such that $y \le x$. We can support membership queries in $O(|X|)$ space, $O(|X|)$ expected preprocessing time, and constant query time using the well-known perfect hashing construction of Fredman, Komlós, and Szemerédi [35]. By combining the well-known $y$-fast trie of Willard [47] with perfect hashing, we can support predecessor queries with the following bounds.
Lemma 2.
Given a set $X$ of integers from a universe of size $u$, we can support predecessor queries in $O(|X|)$ space, $O(|X|)$ expected preprocessing time, and $O(\log\log u)$ query time.
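The following is a simple stand-in for Lemma 2: predecessor queries over a static set of integers. A sorted array with binary search answers queries in $O(\log |X|)$ time; the $y$-fast trie of Willard [47] combined with perfect hashing improves this to the $O(\log\log u)$ query time the lemma states. The class below is illustrative.

```python
import bisect

class Predecessor:
    def __init__(self, xs):
        self.xs = sorted(set(xs))
    def pred(self, x):
        """Return the largest y in the set with y <= x, or None."""
        i = bisect.bisect_right(self.xs, x)
        return self.xs[i - 1] if i else None

P = Predecessor([3, 7, 19])
assert P.pred(10) == 7 and P.pred(2) is None and P.pred(19) == 19
```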
3 Sorting Run-Length Encoded Strings and Constructing Compact Tries
Let $\mathcal{P}$ be a set of $k$ strings of total length $m$ from an alphabet of size $\sigma$. Furthermore, let $\overline{\mathcal{P}}$ be the RLE representation of $\mathcal{P}$ consisting of $\overline{m}$ runs. In this section, we show that given $\overline{\mathcal{P}}$ we can sort the corresponding (uncompressed) set of strings $\mathcal{P}$ and construct their compact trie in $O(\overline{m}\log\log\overline{m})$ expected time and $O(\overline{m})$ space. We use this compact trie construction to efficiently preprocess our patterns in our compressed dictionary matching algorithm in the following sections.
We first show the following simple reduction to standard (uncompressed) string sorting. Let $T_{\mathrm{sort}}(N, \sigma)$ and $S_{\mathrm{sort}}(N, \sigma)$ denote the time and space, respectively, to sort strings of total length $N$ from an alphabet of size $\sigma$.
Lemma 3.
Let $\mathcal{P}$ be a set of strings of total length $m$ from an alphabet of size $\sigma$. Given the RLE representation of $\mathcal{P}$ consisting of $\overline{m}$ runs, we can sort $\mathcal{P}$ in $O(\overline{m} + T_{\mathrm{sort}}(\overline{m}, \sigma m))$ time and $O(\overline{m} + S_{\mathrm{sort}}(\overline{m}, \sigma m))$ space.
Proof.
To sort the strings $\mathcal{P} = \{P_1, \ldots, P_k\}$, we construct the set of strings $\Pi = \{\pi_1, \ldots, \pi_k\}$ such that $\pi_i$ is the sequence of the runs in $P_i$, represented as pairs of character and length. We say that a pair $(\alpha_1, \ell_1)$ is smaller than $(\alpha_2, \ell_2)$ if $\alpha_1 < \alpha_2$, or $\alpha_1 = \alpha_2$ and $\ell_1 < \ell_2$. Assume that we have the compact trie $T_\Pi$ of $\Pi$, where the edges are lexicographically sorted by their edge labels. We will show that we can construct the compact trie $T_\mathcal{P}$ of $\mathcal{P}$ from $T_\Pi$ in $O(\overline{m})$ time and maintain the ordering of the edges. Observe that for a node $v$ in $T_\mathcal{P}$, the ordering of the edge labels of the children of $v$ is equivalent to the ordering achieved by encoding them as pairs of character and length, since the first character of each edge label differs. Let $v$ be an internal node in $T_\mathcal{P}$ such that there is no corresponding node in $T_\Pi$. Since $v$ is an internal node in $T_\mathcal{P}$, two patterns $P_i$ and $P_j$ exist such that the longest common prefix of $P_i$ and $P_j$ is the string spelled by $v$. Since $v$ does not have a corresponding node in $T_\Pi$, there must exist a node $w$ in $T_\Pi$ with at least two children whose edge labels start with pairs $(\alpha, \ell_1)$ and $(\alpha, \ell_2)$ where $\ell_1 < \ell_2$. We ensure that any such node $v$ has a corresponding node as follows. Starting at the root of $T_\Pi$, we do the following at each node $w$. Let $c_1, \ldots, c_d$ be the children of $w$, lexicographically ordered by their edge labels $L_1, \ldots, L_d$, and consider consecutive children $c_i$ and $c_{i+1}$ whose labels start with pairs $(\alpha, \ell_i)$ and $(\alpha, \ell_{i+1})$ with $\ell_i < \ell_{i+1}$. Let $L_i'$ and $L_{i+1}'$ denote $L_i$ and $L_{i+1}$ with their first pair removed. Starting at $i = 1$, we proceed as follows.
-
If $L_i'$ is not the empty string, we insert a node $v$ on the edge to $c_i$ and make $c_{i+1}$ a child of $v$. We then set the edge labels of the edges $(w, v)$, $(v, c_i)$, and $(v, c_{i+1})$ to $(\alpha, \ell_i)$, $L_i'$, and $(\alpha, \ell_{i+1} - \ell_i)L_{i+1}'$, respectively.
-
If $L_i'$ is the empty string, we make $c_{i+1}$ a child of $c_i$ and set the edge label of $(c_i, c_{i+1})$ to $(\alpha, \ell_{i+1} - \ell_i)L_{i+1}'$.
Note that if $L_i'$ is empty, then no child of $c_i$ will have $\alpha$ as the first character on its edge label, since that run could then have been encoded as part of the pair $(\alpha, \ell_i)$. Hence, the parent of $c_{i+1}$ cannot change again. To maintain the order of the children of $w$ and $v$, we insert new edges using insertion sort. When $v$ is created, it takes $c_i$'s place in the ordering of the children of $w$. We then continue the scan with the next pair of children and recurse on the (possibly new) children of $w$. Finally, we traverse the tree and remove non-root vertices with only a single child. Since the children already appear in sorted order, we can recover the sorted order of $\mathcal{P}$ from the constructed compact trie. We use $T_{\mathrm{sort}}(\overline{m}, \sigma m)$ time and $S_{\mathrm{sort}}(\overline{m}, \sigma m)$ space to sort $\Pi$. We can construct $T_\Pi$ from the sorted sequence of $\Pi$ in $O(\overline{m})$ time (see Appendix A for the proof). We insert at most $O(k)$ nodes in $T_\Pi$, and the insertion sort performs constant amortized work per inserted edge. Hence, we use $O(\overline{m})$ time doing the insertion sort. In summary, we have shown Lemma 3.
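To make the role of the trie repair concrete, note that the pair order and the string order can disagree: as pairs, $(a,3)(b,1)$ sorts before $(a,5)$, but as strings $a^5 < a^3b$. The sketch below is not the algorithm of the proof; it directly compares two canonical RLE strings run by run, without decompressing, and realizes the order that the trie transformation recovers. All names are ours.

```python
from functools import cmp_to_key

def cmp_rle(p, q):
    """Compare two canonical RLE strings (maximal runs, adjacent characters
    differ) in the lexicographic order of their decompressed strings."""
    i = j = 0
    while i < len(p) and j < len(q):
        (a, la), (b, lb) = p[i], q[j]
        if a != b:
            return -1 if a < b else 1        # first characters already differ
        if la == lb:
            i += 1; j += 1                   # identical runs: skip both
        elif la < lb:
            i += 1                           # p's run ends first, q continues with a
            if i == len(p):
                return -1                    # p is a proper prefix of q
            return -1 if p[i][0] < a else 1  # p's next char (never a) vs a
        else:
            j += 1
            if j == len(q):
                return 1
            return -1 if a < q[j][0] else 1
    if i < len(p):
        return 1                             # q is a proper prefix of p
    return -1 if j < len(q) else 0

P = [[("a", 3), ("b", 1)], [("a", 5)], [("b", 2)]]
P.sort(key=cmp_to_key(cmp_rle))
print(P)   # a^5, then a^3 b, then b^2, i.e. the decompressed order
```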
Andersson and Nilsson [14] showed how to sort strings of total length $N$ in $O(N\log\log N)$ expected time and $O(N)$ space. We obtain the following result by plugging this into Lemma 3.
Corollary 4.
Let $\mathcal{P}$ be a set of strings of total length $m$ from an alphabet of size $\sigma$. Given the RLE representation of $\mathcal{P}$ consisting of $\overline{m}$ runs, we can sort $\mathcal{P}$ in $O(\overline{m}\log\log\overline{m})$ expected time and $O(\overline{m})$ space.
Using Corollary 4 and the construction in the proof of Lemma 3, we can efficiently construct the compact trie of $\mathcal{P}$ from $\overline{\mathcal{P}}$.
Lemma 5.
Let $\mathcal{P}$ be a set of strings of total length $m$ from an alphabet of size $\sigma$. Given the RLE representation of $\mathcal{P}$ consisting of $\overline{m}$ runs, we can construct the compact trie of $\mathcal{P}$ in $O(\overline{m}\log\log\overline{m})$ expected time and $O(\overline{m})$ space.
4 Truncated Matching
This section presents a compact data structure that efficiently supports a new type of pattern matching queries that we call truncated matching. We will need this result to implement output links in our algorithm efficiently. Given a string $S$, we define the truncated string of $S$ to be the string $\hat{S}$ such that $S = \hat{S}\alpha^\ell$, where $\alpha^\ell$ is the last run in $S$. Let $S_1$ and $S_2$ be two strings of the form $S_1 = \hat{S}_1\alpha_1^{\ell_1}$ and $S_2 = \hat{S}_2\alpha_2^{\ell_2}$, where $\alpha_1^{\ell_1}$ and $\alpha_2^{\ell_2}$ are the last runs of $S_1$ and $S_2$, respectively. We say that $S_1$ truncate matches $S_2$ if $\hat{S}_1$ is a suffix of $\hat{S}_2$, $\alpha_1 = \alpha_2$, and $\ell_1 \le \ell_2$ (see Figure 1).
Let $\mathcal{P} = \{P_1, \ldots, P_k\}$ be a set of strings of the form $P_i = \hat{P}_i\alpha_i^{\ell_i}$, where $\alpha_i^{\ell_i}$ is the last run of $P_i$ for all $i$. The truncate match reporting problem is to construct a data structure such that given an index $i$, a character $\alpha$, and an integer $\ell$, one can efficiently report all indices $j$ such that $P_j$ truncate matches $\hat{P}_i \cdot \alpha^\ell$.
Our goal in this section is to give a data structure for the truncate match reporting problem that uses $O(\overline{m})$ space and $O(\overline{m}\log\log\overline{m})$ expected preprocessing time, and supports queries in $O(\log\log\overline{m} + \mathrm{occ})$ time, where $\overline{m}$ is the total number of runs in $\mathcal{P}$ and $\mathrm{occ}$ is the number of reported indices.
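The following brute-force sketch only pins down the semantics of the queries defined above; the data structure of this section answers the same queries within the stated bounds. All names are ours.

```python
def split_last_run(runs):
    """Return (truncated string as runs, last-run char, last-run length)."""
    (a, l) = runs[-1]
    return runs[:-1], a, l

def truncate_matches(p_runs, hat_s, alpha, ell):
    """Does pattern P truncate match the string hat_s + alpha^ell ?"""
    hat_p, a, l = split_last_run(p_runs)
    def expand(rs): return "".join(c * k for c, k in rs)
    return expand(hat_s).endswith(expand(hat_p)) and a == alpha and l <= ell

def truncate_report(patterns, i, alpha, ell):
    hat_pi, _, _ = split_last_run(patterns[i])
    return [j for j, p in enumerate(patterns)
            if truncate_matches(p, hat_pi, alpha, ell)]

P = [[("a", 2), ("b", 3)], [("b", 1)], [("a", 1), ("b", 5)]]
# Query (i=0, alpha='b', ell=4): which patterns truncate match a^2 b^4 ?
print(truncate_report(P, 0, "b", 4))   # -> [0, 1]  (b^5 is too long)
```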
4.1 Colored Ancestor Threshold Reporting
We first define a data structure problem on colored, weighted trees, called colored ancestor threshold reporting, and present an efficient solution. In the next section, we reduce truncate match reporting to colored ancestor threshold reporting to obtain our result for truncate match reporting.
Let $C$ be a set of colors, and let $\mathcal{T}$ be a rooted tree with $t$ nodes, where each node $v$ has a (possibly empty) set of colors $C(v) \subseteq C$ and a function $w_v$ that associates each color in $C(v)$ with a weight. The colored ancestor threshold reporting problem is to construct a data structure such that given a node $v$, a color $c$, and an integer $\ell$, one can efficiently report all ancestors $u$ of $v$ such that $c \in C(u)$ and $w_u(c) \le \ell$.
We present a data structure for the colored ancestor threshold reporting problem that uses $O(N)$ space and $O(N)$ expected preprocessing time, and supports queries in $O(\log\log N + \mathrm{occ})$ time, where $N = t + \sum_v |C(v)|$ and occ is the number of reported nodes.
Data Structure.
Our data structure stores the following information.
- The tree $\mathcal{T}$ together with a first color ancestor data structure that supports queries of the form: given a node $v$ and a color $c$, return the nearest (not necessarily proper) ancestor $u$ of $v$ such that $c \in C(u)$.
- Furthermore, for each color $c$ we store:
  – An induced tree $\mathcal{T}_c$ of the nodes that have color $c$, maintaining the ancestor relationships from $\mathcal{T}$. Each node $u$ in $\mathcal{T}_c$ has weight $w_u(c)$.
  – A path minima data structure on $\mathcal{T}_c$ that supports queries of the form: given two nodes $u$ and $u'$, return a node with minimum weight on the path between $u$ and $u'$ (both included).
  – A level ancestor data structure on $\mathcal{T}_c$ that supports queries of the form: given a node $u$ and an integer $d$, return the ancestor of $u$ at depth $d$.
- For each node $v$: a dictionary associating each color $c \in C(v)$ with the corresponding node in $\mathcal{T}_c$.
The total size of the induced trees is $O(N)$. Hence, we can construct these and the associated dictionaries at each node in a single traversal of $\mathcal{T}$ using $O(N)$ space and expected preprocessing time. We use standard path minima and level ancestor data structures with linear space and preprocessing time and constant query time on each of the induced trees [24, 2, 20, 29]. We use a first color ancestor data structure, which uses linear space, expected linear preprocessing time, and $O(\log\log N)$ query time [28, 44, 32, 3]. In total, we use $O(N)$ space and $O(N)$ expected preprocessing time.
Query.
Consider a query $(v, c, \ell)$. We perform a first color ancestor query on $(v, c)$ in $\mathcal{T}$ that returns a node $u$. We then look up the node in $\mathcal{T}_c$ corresponding to $u$ and perform a path minima query between $u$ and the root of $\mathcal{T}_c$. Let $z$ be the returned node. If $w_z(c) > \ell$, we stop. Otherwise, we report the node $z$. Finally, we recurse on the path from the root of $\mathcal{T}_c$ to the parent of $z$ and on the path from the child of $z$ on the path to $u$ down to $u$. To find that child, we use a level ancestor query on $u$ in $\mathcal{T}_c$ with $d$ equal to the depth of $z$ plus one.
The first color ancestor query takes $O(\log\log N)$ time. Since a path minima query takes constant time, each recursion step uses constant time. The number of recursive steps is $O(1 + \mathrm{occ})$, and hence, the total time is $O(\log\log N + \mathrm{occ})$. In summary, we have shown the following result.
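The following minimal sketch runs the query recursion above on a single root-to-$u$ path of the induced tree $\mathcal{T}_c$. The constant-time path minima structure is replaced by a naive scan; the divide-and-conquer around each reported minimum is exactly the recursion described in the text. Names are ours.

```python
def report_threshold(path_weights, ell):
    """Report positions on the path (root first) whose weight is <= ell."""
    out = []
    def rec(lo, hi):                                          # half-open range
        if lo >= hi:
            return
        z = min(range(lo, hi), key=path_weights.__getitem__)  # path minimum
        if path_weights[z] > ell:
            return                            # even the minimum is too heavy
        out.append(z)                         # report z and split the path at z
        rec(lo, z)
        rec(z + 1, hi)
    rec(0, len(path_weights))
    return out

print(report_threshold([5, 2, 9, 4, 7], 4))   # -> [1, 3]
```

Note that every recursive call either reports a node or terminates immediately, which is why the number of calls, and hence path minima queries, is proportional to the output size.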
Lemma 6.
Let $C$ be a set of colors and let $\mathcal{T}$ be a rooted tree with $t$ nodes, where each node $v$ has a (possibly empty) set of colors $C(v) \subseteq C$ and a weight function $w_v$. We can construct a data structure for the colored ancestor threshold reporting problem that uses $O(N)$ space and $O(N)$ expected preprocessing time, and supports queries in $O(\log\log N + \mathrm{occ})$ time, where $N = t + \sum_v |C(v)|$ and occ is the number of reported nodes.
4.2 Truncate Match Reporting
We now reduce truncate match reporting to colored ancestor threshold reporting.
Data Structure.
For each string $P_i$, let $\alpha_i^{\ell_i}$ be the last run of $P_i$ and let $\hat{P}_i$ be the truncated string of $P_i$. We construct the compact trie $\mathcal{C}$ of the reversed truncated strings $\mathrm{rev}(\hat{P}_1), \ldots, \mathrm{rev}(\hat{P}_k)$. Each $\mathrm{rev}(\hat{P}_i)$ corresponds to a node in $\mathcal{C}$. Note that several truncated strings can correspond to the same node if the original strings end in different runs. For each node $v$, let $S_v$ be the set of string indices whose truncated strings correspond to node $v$.
Our data structure consists of the following information.
- The compact trie $\mathcal{C}$.
- For each node $v$ in $\mathcal{C}$ we store the following information:
  – The set of colors $C(v) = \{\alpha_i \mid i \in S_v\}$.
  – A function $w_v$ where $w_v(\alpha) = \min\{\ell_i \mid i \in S_v \text{ and } \alpha_i = \alpha\}$.
  – For each $\alpha \in C(v)$: a list $L_v(\alpha)$ containing the set $\{i \in S_v \mid \alpha_i = \alpha\}$ sorted in increasing order of $\ell_i$.
- A colored ancestor threshold reporting data structure for $\mathcal{C}$ with the colors $C(v)$ and weight functions $w_v$.
- An array associating each string index $i$ with its corresponding node in the trie.
See Figure 2 for an example.
The compact trie $\mathcal{C}$ uses $O(\overline{m})$ space. The total size of the sets $S_v$ is $k$, and thus $\sum_v |C(v)| \le k$. It follows that the total size of the lists $L_v(\alpha)$ is $O(k)$. The space of the colored ancestor threshold reporting data structure is then $O(\overline{m})$ by Lemma 6. Thus the total size of the data structure is $O(\overline{m})$.
Queries.
To answer a truncate match reporting query $(i, \alpha, \ell)$, we perform a colored ancestor threshold reporting query $(v_i, \alpha, \ell)$, where $v_i$ is the node corresponding to string index $i$ (found using the array). For each node $u$ returned by the query, we return all string indices in $L_u(\alpha)$ with run length at most $\ell$ by doing a linear list scan, stopping when the run length exceeds $\ell$.
The colored ancestor threshold reporting query takes $O(\log\log\overline{m} + \mathrm{occ})$ time. We have at least one occurrence for each reported node, and the occurrences are found by a scan taking linear time in the number of reported indices. In total, the query takes $O(\log\log\overline{m} + \mathrm{occ})$ time.
In summary, we have shown the following result.
Lemma 7.
Let $\mathcal{P}$ be a set of strings. Given the RLE representation of $\mathcal{P}$ consisting of $\overline{m}$ runs, we can construct a data structure for the truncate match reporting problem using $O(\overline{m})$ space and $O(\overline{m}\log\log\overline{m})$ expected preprocessing time, that can answer queries in $O(\log\log\overline{m} + \mathrm{occ})$ time, where occ is the number of reported occurrences.
5 A Simple Solution
In this section, we provide a simple solution that solves the compressed dictionary matching problem in $O(\overline{m})$ space and $O(n + m + (\overline{m} + \overline{n})\log\log m + \mathrm{occ})$ expected time. In the next section, we show how to improve this to obtain the main result of Theorem 1.
Let $R$ be the run-length encoded trie of $\mathcal{P}$, where each run $\alpha^\ell$ in a pattern is encoded as a single letter "$\alpha^\ell$". See Figure 3 for an example.
The idea is to use $R$ to process $T$ one run at a time. At each step we maintain a node in $R$ such that the string of the root-to-node path is the longest suffix of the part of the text we have seen so far. To efficiently report matches we use the data structure for truncate match reporting.
We will assume for now that the patterns all have at least 2 runs. Later we will show how to deal with patterns that consist of a single run.
Data Structure.
For a node $v$ in the trie $R$, let $\mathrm{str}(v)$ denote the string obtained by concatenating the characters of the edges on the root-to-$v$ path, where a run is expanded rather than considered as a single letter. We store the following for each node $v$ in $R$:
- $\mathrm{child}(v)$: A dictionary of the children of $v$, with their edge labels as keys.
- $\mathrm{fail}(v)$: A failure link to the node $u$ in $R$ for which $\mathrm{str}(u)$ is the longest proper suffix of $\mathrm{str}(v)$ in $R$. We define the failure link for the root to be the root itself.
- $\mathrm{out}(v)$: The pattern index $i$ such that the truncated pattern $\hat{P}_i$ is the longest suffix of $\mathrm{str}(v)$ among all truncated patterns. If no such pattern exists, let $\mathrm{out}(v) = \bot$.
We build the truncate match reporting data structure of Lemma 7 for the set $\mathcal{P}$. Finally, for each pattern $P_i$ we store its length $|P_i|$ and the length $\ell_i$ of the last run of $P_i$. See Figure 3 for an example.
The number of edges and nodes in $R$ is $O(\overline{m})$. The dictionaries and failure links associated with $R$ use space proportional to the number of edges and vertices in $R$, hence $O(\overline{m})$ space. By Lemma 7 the truncate match reporting data structure uses $O(\overline{m})$ space. In total, we use $O(\overline{m})$ space.
Query.
Given a run-length encoded string $T$, we report all locations of occurrences of the patterns in $T$ as follows. We process one run at a time. Let $T'$ be the processed part of $T$. We maintain the node $v$ in $R$ such that $\mathrm{str}(v)$ is the longest suffix of $T'$ in $R$. Initially, $T' = \varepsilon$ and $v$ is the root. Let $\alpha^\ell$ be the next unprocessed run in $T$. We proceed as follows:
- Step 1: Report Occurrences. Query the truncate match reporting data structure with $(\mathrm{out}(v), \alpha, \ell)$ unless $\mathrm{out}(v) = \bot$. For each pattern $P_i$ returned by the query, report that pattern $P_i$ occurs at location $|T'| + \ell_i - |P_i| + 1$.
- Step 2: Follow Failure Links. While $v$ is not the root and $\alpha^\ell \notin \mathrm{child}(v)$, set $v \leftarrow \mathrm{fail}(v)$. Finally, if $\alpha^\ell \in \mathrm{child}(v)$, let $v \leftarrow \mathrm{child}(v)[\alpha^\ell]$; otherwise ($v$ is the root), let $v$ be the child of the root labeled $\alpha^{\ell'}$ for the largest $\ell' \le \ell$, if any, found with a predecessor query. A code sketch of this matching loop is given below.
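The following runnable sketch implements the matching loop just described. A node carries the dictionary $\mathrm{child}$ keyed by whole run labels, the failure link $\mathrm{fail}$, and $\mathrm{out}$, the stored pattern index used for truncate match reporting. The trie below is built by hand for the single pattern $P_0 = a^2b^3$, and report_truncate is a stub with the interface of the Section 4 data structure; all names are illustrative.

```python
class Node:
    def __init__(self):
        self.child, self.fail, self.out = {}, None, None

def match(root, rle_text, report_truncate):
    v, text_pos = root, 0                   # text_pos = |processed part T'|
    for (alpha, ell) in rle_text:
        # Step 1: report occurrences ending inside the run alpha^ell.
        if v.out is not None:
            for (plen, last_len) in report_truncate(v.out, alpha, ell):
                yield text_pos + last_len - plen   # 0-indexed start position
        # Step 2: follow failure links until the run extends the path.
        # (The root's predecessor case for a shorter run alpha^l' is omitted.)
        while v is not root and (alpha, ell) not in v.child:
            v = v.fail
        if (alpha, ell) in v.child:
            v = v.child[(alpha, ell)]
        text_pos += ell

root, x, y = Node(), Node(), Node()
root.fail = root
root.child[("a", 2)] = x; x.fail, x.out = root, 0   # str(x) = a^2 = trunc(P0)
x.child[("b", 3)] = y; y.fail = root                # str(y) = a^2 b^3 = P0

def report_truncate(i, alpha, ell):
    # P0 truncate matches trunc(P0) * alpha^ell iff alpha == 'b' and ell >= 3;
    # returns (pattern length, last-run length) pairs used for reporting.
    return [(5, 3)] if (i == 0 and alpha == "b" and ell >= 3) else []

print(list(match(root, [("a", 2), ("b", 4)], report_truncate)))   # -> [0]
```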
Time Analysis.
At each node we visit when processing $\alpha^\ell$, we use constant time to determine whether we follow an edge or a failure link, plus at most one $O(\log\log m)$-time predecessor query per run at the root. We traverse one edge per run in $T$ and thus traverse at most $\overline{n}$ edges. Since a failure link points to a proper suffix, the total number of failure links we traverse cannot exceed the number of characters we process throughout the algorithm. Since we process $n$ characters, we traverse at most $n$ failure links. Furthermore, we report from at most one node per run in $T$, and by Lemma 7 we use $O(\log\log\overline{m} + \mathrm{occ}_j)$ time per run to report the locations of the matched patterns. In total, we use $O(n + \overline{n}\log\log m + \mathrm{occ})$ time.
Correctness.
By induction on the iteration $j$, we will show that $\mathrm{str}(v)$ is the longest suffix of $T'$ in $R$ and that we report all starting positions of occurrences of patterns in $T'$. Initially, $j = 0$, $T' = \varepsilon$, and $v$ is the root of $R$, and thus $\mathrm{str}(v) = \varepsilon = T'$.
For the induction step, assume that at the beginning of iteration $j$, $\mathrm{str}(v)$ is the longest suffix of $T'$ in $R$. Let $\alpha^\ell$ be the next unprocessed run in $T$ and let $\mathcal{P}' \subseteq \mathcal{P}$ be the subset of patterns that have an occurrence ending in the unprocessed run. Note that any pattern in $\mathcal{P}'$ has exactly one occurrence that ends in the unprocessed run, since every pattern consists of at least two runs.
We first argue that we correctly report all occurrences. It follows immediately from the definition of truncate match that $\mathcal{P}'$ consists of all the patterns that truncate match $T'\alpha^\ell$. Let $o = \mathrm{out}(v)$, so that $\hat{P}_o$ is the longest suffix of $\mathrm{str}(v)$ among all the truncated patterns. We will show that $P_i \in \mathcal{P}'$ iff $P_i$ truncate matches $\hat{P}_o\alpha^\ell$. Since $\hat{P}_o$ is a suffix of $T'$, it follows immediately that all patterns that truncate match $\hat{P}_o\alpha^\ell$ also truncate match $T'\alpha^\ell$. We show by contradiction that $P_i \in \mathcal{P}'$ implies that $P_i$ truncate matches $\hat{P}_o\alpha^\ell$. Assume that $P_i \in \mathcal{P}'$ but $P_i$ does not truncate match $\hat{P}_o\alpha^\ell$. Since $P_i$ truncate matches $T'\alpha^\ell$, we have that $\alpha_i = \alpha$, $\ell_i \le \ell$, and $\hat{P}_i$ is a suffix of $T'$; since $\hat{P}_i$ is a node string and $\mathrm{str}(v)$ is the longest suffix of $T'$ in $R$, $\hat{P}_i$ is also a suffix of $\mathrm{str}(v)$. Thus, if $P_i$ does not truncate match $\hat{P}_o\alpha^\ell$, it must be the case that $\hat{P}_i$ is not a suffix of $\hat{P}_o$. But then $\hat{P}_i$ is a longer suffix of $\mathrm{str}(v)$ than $\hat{P}_o$, contradicting that $\hat{P}_o$ is the longest suffix of $\mathrm{str}(v)$ among all truncated patterns. Thus the patterns that truncate match $\hat{P}_o\alpha^\ell$ are exactly $\mathcal{P}'$, and by querying the truncate match reporting data structure with $(o, \alpha, \ell)$, we report all occurrences which end in the unprocessed run.
Finally, we show that after step 2 the string $\mathrm{str}(v')$ is the longest suffix of $T'\alpha^\ell$, where $v'$ is the node $v$ is set to by the end of step 2, after which the algorithm sets $T' \leftarrow T'\alpha^\ell$. Assume for the sake of contradiction that $\mathrm{str}(v')$ is not the longest suffix of $T'\alpha^\ell$. Then either $\mathrm{str}(v')$ is not a suffix of $T'\alpha^\ell$, or there is another node $z$ such that $\mathrm{str}(z)$ is a suffix of $T'\alpha^\ell$ and $\mathrm{str}(v')$ is a proper suffix of $\mathrm{str}(z)$. By the definition of the failure links and the dictionaries, $\mathrm{str}(v')$ after step 2 is a suffix of $T'\alpha^\ell$. Furthermore, either $v'$ is the root, or $v' = \mathrm{child}(w)[\alpha^{\ell''}]$ for some $\ell'' \le \ell$, where $w$ is the parent of $v'$. Assume that there is a node $z$ such that $\mathrm{str}(z)$ is a suffix of $T'\alpha^\ell$ and $\mathrm{str}(v')$ is a proper suffix of $\mathrm{str}(z)$, and let $x$ be the parent of $z$, so that $z = \mathrm{child}(x)[\alpha^\ell]$ and $\mathrm{str}(z) = \mathrm{str}(x)\alpha^\ell$ (the case where $z$ lies fully inside the run is handled by the final predecessor step at the root). Since $\mathrm{str}(z)$ is a suffix of $T'\alpha^\ell$, $\mathrm{str}(x)$ is a suffix of $T'$. At the beginning of step 2, $\mathrm{str}(x)$ must then be a suffix of $\mathrm{str}(v)$, since by the induction hypothesis $\mathrm{str}(v)$ is the longest suffix of $T'$ in $R$. Since $\mathrm{str}(x)$ is longer than $\mathrm{str}(\mathrm{parent}(v'))$, we meet $x$ before $\mathrm{parent}(v')$ in the failure link traversal in step 2. Since $\alpha^\ell \in \mathrm{child}(x)$, we would have stopped at node $x$ and not $\mathrm{parent}(v')$, a contradiction.
Preprocessing.
First we construct the run-length encoded trie $R$ and the truncate match reporting data structure. In order to compute the failure link for a non-root node $v$, observe that if the edge to $v$ is labeled $\alpha^\ell$, then there are three scenarios which determine the value of $\mathrm{fail}(v)$.
- Either $\mathrm{fail}(v) = \mathrm{child}(u)[\alpha^\ell]$, where $u$ is the first node such that $\alpha^\ell \in \mathrm{child}(u)$ and $\mathrm{child}(u)[\alpha^\ell] \neq v$, which is reached by repeatedly following the failure links starting at node $\mathrm{fail}(\mathrm{parent}(v))$.
- If no such node exists, then if the root has an outgoing edge labeled $\alpha^{\ell'}$ with $\ell' < \ell$, we have $\mathrm{fail}(v) = \mathrm{child}(\mathrm{root})[\alpha^{\ell'}]$, where $\ell'$ is the maximum such integer.
- Otherwise $\mathrm{fail}(v) = \mathrm{root}$.
We compute the failure links by traversing the trie in a breadth-first traversal starting at the root using the above cases. Note that the second case requires a predecessor query over the run lengths of the root's outgoing $\alpha$-edges. In our preprocessing of the $\mathrm{out}(\cdot)$ values we use the reverse failure links of a node $v$, i.e., the set of nodes $u$ with $\mathrm{fail}(u) = v$. We compute these while computing the failure links. To compute $\mathrm{out}(\cdot)$, we first set $\mathrm{out}(v) = i$ for each node $v$ which has a child corresponding to a pattern $P_i$. We then start a breadth-first traversal of the graph induced by the reverse failure links from each of these nodes. For each node $u$ we visit that has not already been assigned a value, we set $\mathrm{out}(u) = \mathrm{out}(v)$, where $v$ is the parent of $u$ in the traversal. Note that each node is visited in at most one of these traversals, since a node has exactly one failure link. For the remaining unvisited nodes $v$, let $\mathrm{out}(v) = \bot$.
Preprocessing Time Analysis.
We can construct the run-length encoded trie of the patterns in $O(\overline{m}\log\log\overline{m})$ expected time by Corollary 4 and the proof of Lemma 3. We use $O(\overline{m})$ expected time to construct the dictionaries of the nodes in the run-length encoded trie and $O(\overline{m}\log\log\overline{m})$ expected time for the truncate match reporting data structure by Lemma 7. Since a failure link points to a proper suffix, each root-to-leaf path in $R$ can traverse a number of failure links proportional to the number of characters on the path. The sum of characters over all root-to-leaf paths is $O(m)$; hence, we traverse at most $O(m)$ failure links as we construct the failure links. We use at most one predecessor query for each node in $R$ as we construct the failure links, and thus we use $O(\overline{m}\log\log m)$ time to construct the failure links by Lemma 2. Finally, computing $\mathrm{out}(\cdot)$ requires $O(\overline{m})$ time to traverse $R$. In total, we use $O(m + \overline{m}\log\log m)$ expected time.
Dealing with Single Run Patterns.
To correctly report the occurrences of the single-run patterns, we construct a dictionary $D$ such that $D(\alpha)$ is the list of single-run patterns over the character $\alpha$ sorted by their length. Let $T'$ be the processed part of $T$ and let $\alpha^\ell$ be the next unprocessed run in $T$. To report the single-run patterns occurring in the run, we do a linear scan of $D(\alpha)$. For each single-run pattern $\alpha^{\ell_i}$ in $D(\alpha)$, the pattern occurs at all the locations $|T'| + s$ for $s = 1, \ldots, \ell - \ell_i + 1$. If $\ell_i > \ell$ then $\alpha^{\ell_i}$ does not occur in the run and neither does any later pattern in the list, since they are sorted by length. We can identify the single-run patterns and construct $D$ in $O(\overline{m}\log\log\overline{m})$ expected time, and reporting the occurrences of the single-run patterns takes constant time per occurrence, neither of which impacts our bounds.
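The following is a sketch of the single-run handling just described: for each character, the single-run patterns are kept sorted by length, and a text run $\alpha^\ell$ contains the pattern $\alpha^{l}$ at exactly $\ell - l + 1$ starting positions. Names are ours.

```python
def report_single_runs(single, alpha, ell, text_pos):
    """single[c] = sorted lengths of the single-run patterns over c;
    text_pos = length of the processed text before this run (0-indexed)."""
    out = []
    for l in single.get(alpha, []):
        if l > ell:
            break                       # later patterns are even longer
        for s in range(ell - l + 1):    # every starting offset inside the run
            out.append(text_pos + s)
    return out

single = {"a": [1, 2]}                  # patterns "a" and "aa"
print(report_single_runs(single, "a", 3, 10))
# "a" starts at 10, 11, 12 and "aa" at 10, 11 -> [10, 11, 12, 10, 11]
```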
In summary, we have shown the following:
Lemma 8.
Let $\mathcal{P}$ be a set of strings of total length $m$ and let $T$ be a string of length $n$. Given the RLE representations of $\mathcal{P}$ and $T$ consisting of $\overline{m}$ and $\overline{n}$ runs, respectively, we can solve the dictionary matching problem in $O(n + m + (\overline{m} + \overline{n})\log\log m + \mathrm{occ})$ expected time and $O(\overline{m})$ space, where occ is the total number of occurrences of patterns from $\mathcal{P}$ in $T$.
In the next section, we show how to improve the time bound to $O((\overline{m} + \overline{n})\log\log m + \mathrm{occ})$ expected time, thus effectively removing the linear dependency on the uncompressed lengths of the text and the patterns.
6 Full Algorithm
In the simple solution, a failure link could point to a suffix that is only one character shorter. Thus navigating the failure links during the algorithm can take time proportional to $n$. In this section, we show how to modify the failure links such that they point to a suffix that has fewer runs. The main idea is to group nodes that only differ by the length of their first run and navigate them simultaneously.
Data Structure.
We build the same structure as in Section 5, with a slight alteration to the failure links. Let $v$ be a node in the trie and let $\alpha$ be the first character of $\mathrm{str}(v)$, i.e., $\mathrm{str}(v) = \alpha^j S$ for some $j \ge 1$, where $S$ does not start with $\alpha$. The failure link $\mathrm{fail}(v)$ stores a pointer to the node $u$ in $R$ such that $\mathrm{str}(u)$ is the longest suffix of $S$ in $R$. Additionally, we store the length of the first run of $\mathrm{str}(v)$ as $\mathrm{first}(v) = j$. We define the failure link for the root to be the root itself.
We partition the nodes of $R$ into groups as follows: Let $u$ and $v$ be two nodes of $R$ and write $\mathrm{str}(u) = \alpha^j S$ and $\mathrm{str}(v) = \beta^{j'}S'$, where $\alpha^j$ and $\beta^{j'}$ are the first runs. If $\alpha = \beta$ and $S = S'$, then $u$ and $v$ belong to the same group $G$. We define the group of the root to consist only of the root node. See Figure 4. For a group $G$ and a label $\alpha^\ell$, let $G_{\alpha^\ell}$ be all the vertices $v \in G$ where $v$ has an outgoing edge labeled $\alpha^\ell$.
For each group $G$, we store the following:
- $D_G$: A dictionary of the labels of the outgoing edges of the nodes in $G$. For each label $\alpha^\ell$, $D_G(\alpha^\ell)$ is a predecessor structure of the nodes $G_{\alpha^\ell}$ ordered by the length of their first run $\mathrm{first}(\cdot)$.
The dictionaries use space linear in the number of outgoing edges of the nodes in $G$. Since the number of edges and nodes in $R$ is $O(\overline{m})$ and the groups partition the nodes of $R$, the combined size of the dictionaries of the groups is $O(\overline{m})$. In total, we use $O(\overline{m})$ space.
Query.
Given a run-length encoded string $T$, we report all locations of occurrences of the patterns in $T$ as follows. We process one run at a time. Let $T'$ be the processed part of $T$. We maintain the node $v$ in $R$ such that $\mathrm{str}(v)$ is the longest suffix of $T'$ in $R$. Initially, $T' = \varepsilon$ and $v$ is the root. Let $\alpha^\ell$ be the next unprocessed run in $T$, and let $G$ denote the group of the current node $v$.
- Step 1: Report Occurrences. Query the truncate match reporting data structure with $(\mathrm{out}(v), \alpha, \ell)$ unless $\mathrm{out}(v) = \bot$. For each pattern $P_i$ returned by the structure, report that pattern $P_i$ occurs at location $|T'| + \ell_i - |P_i| + 1$.
- Step 2: Follow Failure Links. While $v$ is not the root and $\mathrm{first}(v)$ has no predecessor in $D_G(\alpha^\ell)$, set $v \leftarrow \mathrm{fail}(v)$. Finally, if $\mathrm{first}(v)$ has a predecessor $\mathrm{first}(u)$ in $D_G(\alpha^\ell)$, then set $v \leftarrow \mathrm{child}(u)[\alpha^\ell]$; otherwise ($v$ is the root), set $v$ to the child of the root labeled $\alpha^{\ell'}$ for the largest $\ell' \le \ell$, if any. A sketch of the grouping that supports these predecessor queries is given below.
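The following sketch illustrates the grouping: nodes whose strings differ only in the length of their first run share a group. A node's string is represented here by its run sequence; the group key keeps the first-run character and the remaining runs, and drops the first-run length. The representation is ours.

```python
from collections import defaultdict

def group_nodes(node_run_strings):
    groups = defaultdict(list)
    for runs in node_run_strings:
        (a, l), tail = runs[0], tuple(runs[1:])
        groups[(a, tail)].append((l, runs))    # first(v) = l is kept as key
    for members in groups.values():
        members.sort()                         # sorted by first-run length,
    return groups                              # ready for predecessor queries

nodes = [[("a", 1), ("b", 2)], [("a", 3), ("b", 2)], [("a", 2)]]
for key, members in group_nodes(nodes).items():
    print(key, [l for l, _ in members])
# (('a', (('b', 2),)), [1, 3])   -- a^1 b^2 and a^3 b^2 share a group
# (('a', ()), [2])               -- a^2 is alone in its group
```

With the first-run lengths of each group sorted, Step 2 replaces the character-by-character failure chain of Section 5 by a single predecessor query per visited group.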
Time Analysis.
At each node we visit when processing $\alpha^\ell$, we use $O(\log\log m)$ time to determine whether we traverse an edge in the group or follow a failure link, by Lemma 2. We traverse one edge per run in $T$ and thus traverse at most $\overline{n}$ edges. Since the suffix a failure link points to has fewer runs, and processing each of the $\overline{n}$ runs of $T$ increases the number of runs of the current suffix by at most one, we traverse at most $\overline{n}$ failure links in total. Furthermore, we report from at most one node per run in $T$, and by Lemma 7 we use $O(\log\log\overline{m} + \mathrm{occ}_j)$ time per run to report the locations of the matched patterns. In total, we use $O(\overline{n}\log\log m + \mathrm{occ})$ time.
Correctness.
By induction on the iteration $j$, we will show that $\mathrm{str}(v)$ is the longest suffix of $T'$ in $R$ and that we report all starting positions of occurrences of the patterns in $T'$. Initially, $T' = \varepsilon$ and $v$ is the root of $R$, and thus $\mathrm{str}(v) = \varepsilon = T'$. Assume that at the beginning of iteration $j$, $\mathrm{str}(v)$ is the longest suffix of $T'$ in $R$. Let $\alpha^\ell$ be the next unprocessed run in $T$. By the same argument as in Section 5, we report all the occurrences of the patterns that end in the unprocessed run $\alpha^\ell$. What remains is to show that after step 2 the string $\mathrm{str}(v')$ is the longest suffix of $T'\alpha^\ell$, where $v'$ is the node we set $v$ to in the end of step 2, after which $T' \leftarrow T'\alpha^\ell$. Assume for the sake of contradiction that $\mathrm{str}(v')$ is not the longest suffix of $T'\alpha^\ell$. Then either $\mathrm{str}(v')$ is not a suffix of $T'\alpha^\ell$, or there is another node $z$ such that $\mathrm{str}(z)$ is a suffix of $T'\alpha^\ell$ and $\mathrm{str}(v')$ is a proper suffix of $\mathrm{str}(z)$. By the definition of the failure links and the dictionaries, $\mathrm{str}(v')$ after step 2 is a suffix of $T'\alpha^\ell$. Assume then that there is a node $z$ such that $\mathrm{str}(z)$ is the longest suffix of $T'\alpha^\ell$ and $\mathrm{str}(v')$ is a proper suffix of $\mathrm{str}(z)$. Then there is a node $x$ such that $z = \mathrm{child}(x)[\alpha^\ell]$ and $\mathrm{str}(z) = \mathrm{str}(x)\alpha^\ell$. At the beginning of step 2, $\mathrm{str}(x)$ must be a suffix of $\mathrm{str}(v)$, since by the induction hypothesis $\mathrm{str}(v)$ is the longest suffix of $T'$ in $R$. Hence, in step 2 the failure link traversal visits a node $y$ in the group $G'$ of $x$ with $\mathrm{first}(x) \le \mathrm{first}(y)$, since $\mathrm{str}(x)$ is a suffix of $\mathrm{str}(v)$ and the failure links skip whole runs. Since $x \in G'_{\alpha^\ell}$, the predecessor structure $D_{G'}(\alpha^\ell)$ contains $\mathrm{first}(x)$ as a key, and since $\mathrm{str}(x)\alpha^\ell$ is the longest suffix, $\mathrm{first}(x)$ is the predecessor of $\mathrm{first}(y)$ in $D_{G'}(\alpha^\ell)$. Thus after step 2 we have $v' = z$, a contradiction.
Preprocessing.
First we construct the run-length encoded trie $R$ and the truncate match reporting data structure. We then compute the groups of $R$ as follows. First we group the children of the root based on the character of the label on their incoming edge, such that all children of the root with an $\alpha$-run on the edge are in the same group. We compute the groups of all descendant nodes as follows. Given a group $G$ and a label $\alpha^\ell$, all the nodes in $\{\mathrm{child}(v)[\alpha^\ell] \mid v \in G_{\alpha^\ell}\}$ constitute a new group. We then compute failure links between groups. Note that since a failure link points to a suffix with at least one less run, all the nodes in a group have the same failure link. We compute the failure link for a non-root node $v$ as follows. Let the label of the edge to $v$ be $\alpha^\ell$. There are three scenarios which determine the value of $\mathrm{fail}(v)$.
- Either $\mathrm{fail}(v) = \mathrm{child}(u')[\alpha^\ell]$, where $u$ is the first non-root node, reached by repeatedly following the failure links starting at node $\mathrm{fail}(\mathrm{parent}(v))$, such that $\mathrm{first}(u)$ has a predecessor $\mathrm{first}(u')$ in $D_{G}(\alpha^\ell)$ for the group $G$ of $u$.
- If no such node exists, $v$ is not a child of the root, and the root has an outgoing edge labeled $\alpha^{\ell'}$ with $\ell' \le \ell$, then $\mathrm{fail}(v) = \mathrm{child}(\mathrm{root})[\alpha^{\ell'}]$, where $\ell'$ is the maximum such integer.
- Otherwise $\mathrm{fail}(v) = \mathrm{root}$.
We compute the failure links by traversing the trie in a breadth-first traversal starting at the root using the above cases. Note that the second case requires a predecessor query. To compute $\mathrm{out}(\cdot)$ we simulate the reverse failure links used in Section 5. Observe that the failure links of Section 5 are composed of failure links where both endpoints are in the same group and failure links where the endpoints are in different groups. Since a failure link now points to a suffix with at least one less run, we have only computed the failure links between groups. To simulate the failure links within a group, observe that for $u, v \in G$ there is a failure link going from $v$ to $u$ in the data structure of Section 5 iff $\mathrm{first}(u) < \mathrm{first}(v)$ and there is no node $w \in G$ with $\mathrm{first}(u) < \mathrm{first}(w) < \mathrm{first}(v)$. Hence, by building a predecessor structure over the nodes in each group ordered by their first-run lengths, we can simulate the failure links of Section 5 and compute $\mathrm{out}(\cdot)$ as before.
Preprocessing Time Analysis.
We can construct the run-length encoded trie of the patterns in $O(\overline{m}\log\log\overline{m})$ expected time by Corollary 4 and the proof of Lemma 3. We use $O(\overline{m})$ expected time to construct the dictionaries of the nodes in the run-length encoded trie and $O(\overline{m}\log\log\overline{m})$ expected time for the truncate match reporting data structure by Lemma 7. To partition $R$ into groups we use $O(\overline{m})$ time, and $O(\overline{m})$ expected time to construct the predecessor structures of the groups by Lemma 2. Since a failure link points to a suffix with at least one less run, each root-to-leaf path in $R$ can traverse a number of failure links proportional to the number of runs on the path. The sum of runs over all root-to-leaf paths is $O(\overline{m})$; hence, we traverse at most $O(\overline{m})$ failure links as we construct the failure links. We use at most one predecessor query per traversed failure link as we construct the failure links, and thus we use $O(\overline{m}\log\log m)$ time to construct the failure links by Lemma 2. Finally, computing $\mathrm{out}(\cdot)$ requires $O(\overline{m}\log\log m)$ time to traverse $R$ and simulate the failure links. In total, we use $O(\overline{m}\log\log m)$ expected time. In summary, we have shown Theorem 1.
References
- [1] Alfred V. Aho and Margaret J. Corasick. Efficient string matching: An aid to bibliographic search. Commun. ACM, 18(6):333–340, 1975. doi:10.1145/360825.360855.
- [2] Stephen Alstrup and Jacob Holm. Improved algorithms for finding level ancestors in dynamic trees. In Proc. 27th ICALP, pages 73–84, 2000. doi:10.1007/3-540-45022-X_8.
- [3] Stephen Alstrup, Thore Husfeldt, and Theis Rauhe. Marked ancestor problems. In Proc. 39th FOCS, pages 534–544, 1998. doi:10.1109/SFCS.1998.743504.
- [4] Amihood Amir and Gary Benson. Efficient two-dimensional compressed matching. In Proc. 2nd DCC, pages 279–288, 1992. doi:10.1109/DCC.1992.227453.
- [5] Amihood Amir, Gary Benson, and Martin Farach. Optimal two-dimensional compressed matching. J. Algorithms, 24(2):354–379, 1997. doi:10.1006/JAGM.1997.0860.
- [6] Amihood Amir and Martin Farach. Adaptive dictionary matching. In Proc. 32nd FOCS, pages 760–766, 1991. doi:10.1109/SFCS.1991.185445.
- [7] Amihood Amir, Martin Farach, Zvi Galil, Raffaele Giancarlo, and Kunsoo Park. Dynamic dictionary matching. J. Comput. Syst. Sci., 49(2):208–222, 1994. doi:10.1016/S0022-0000(05)80047-9.
- [8] Amihood Amir, Martin Farach, Ramana M. Idury, Johannes A. La Poutré, and Alejandro A. Schäffer. Improved dynamic dictionary matching. Inf. Comput., 119(2):258–282, 1995. doi:10.1006/INCO.1995.1090.
- [9] Amihood Amir, Tsvi Kopelowitz, Avivit Levy, Seth Pettie, Ely Porat, and B. Riva Shalom. Mind the gap! - online dictionary matching with one gap. Algorithmica, 81(6):2123–2157, 2019. doi:10.1007/S00453-018-0526-2.
- [10] Amihood Amir, Gad M. Landau, and Dina Sokol. Inplace run-length 2d compressed search. Theor. Comput. Sci., 290(3):1361–1383, 2003. doi:10.1016/S0304-3975(02)00041-5.
- [11] Amihood Amir, Avivit Levy, Ely Porat, and B. Riva Shalom. Dictionary matching with one gap. In Proc. 25th CPM, pages 11–20, 2014. doi:10.1007/978-3-319-07566-2_2.
- [12] Amihood Amir, Avivit Levy, Ely Porat, and B. Riva Shalom. Dictionary matching with a few gaps. Theor. Comput. Sci., 589:34–46, 2015. doi:10.1016/J.TCS.2015.04.011.
- [13] Amihood Amir, Moshe Lewenstein, and Ely Porat. Faster algorithms for string matching with k mismatches. J. Algorithms, 50(2):257–275, 2004. doi:10.1016/S0196-6774(03)00097-X.
- [14] Arne Andersson and Stefan Nilsson. A new efficient radix sort. In Proc. 35th FOCS, pages 714–721, 1994. doi:10.1109/SFCS.1994.365721.
- [15] Alberto Apostolico, Gad M. Landau, and Steven Skiena. Matching for run-length encoded strings. J. Complex., 15(1):4–16, 1999. doi:10.1006/JCOM.1998.0493.
- [16] Tanver Athar, Carl Barton, Widmer Bland, Jia Gao, Costas S. Iliopoulos, Chang Liu, and Solon P. Pissis. Fast circular dictionary-matching algorithm. Math. Struct. Comput. Sci., 27(2):143–156, 2017. doi:10.1017/S0960129515000134.
- [17] Arturs Backurs and Piotr Indyk. Which regular expression patterns are hard to match? In Proc. 57th FOCS, pages 457–466, 2016. doi:10.1109/FOCS.2016.56.
- [18] Djamal Belazzougui. Succinct dictionary matching with no slowdown. In Proc. 21st CPM, pages 88–100, 2010. doi:10.1007/978-3-642-13509-5_9.
- [19] Djamal Belazzougui. Worst-case efficient single and multiple string matching on packed texts in the word-ram model. J. Discrete Algorithms, 14:91–106, 2012. doi:10.1016/J.JDA.2011.12.011.
- [20] Omer Berkman and Uzi Vishkin. Finding level-ancestors in trees. J. Comput. Syst. Sci., 48(2):214–230, 1994. doi:10.1016/S0022-0000(05)80002-9.
- [21] Philip Bille, Inge Li Gørtz, Hjalte Wedel Vildhøj, and David Kofoed Wind. String matching with variable length gaps. Theoret. Comput. Sci., 443:25–34, 2012. doi:10.1016/J.TCS.2012.03.029.
- [22] Philip Bille and Mikkel Thorup. Regular expression matching with multi-strings and intervals. In Proc. 21st SODA, 2010.
- [23] William I. Chang and Eugene L. Lawler. Sublinear approximate string matching and biological applications. Algorithmica, 12(4):327–344, 1994. doi:10.1007/BF01185431.
- [24] Bernard Chazelle. Computing on a free tree via complexity-preserving mappings. Algorithmica, 2:337–361, 1987. doi:10.1007/BF01840366.
- [25] Raphaël Clifford, Allyx Fontaine, Ely Porat, Benjamin Sach, and Tatiana Starikovskaya. Dictionary matching in a stream. In Proc. 23rd ESA, pages 361–372, 2015. doi:10.1007/978-3-662-48350-3_31.
- [26] Richard Cole, Lee-Ad Gottlieb, and Moshe Lewenstein. Dictionary matching and indexing with errors and don’t cares. In Proc. 36th STOC, pages 91–100, 2004. doi:10.1145/1007352.1007374.
- [27] Beate Commentz-Walter. A string matching algorithm fast on the average. In Hermann A. Maurer, editor, Proc. 6th ICALP, volume 71, pages 118–132, 1979. doi:10.1007/3-540-09510-1_10.
- [28] Paul F. Dietz. Fully persistent arrays (extended abstract). In Proc. 1st WADS, pages 67–74, 1989. doi:10.1007/3-540-51542-9_8.
- [29] Paul F. Dietz. Finding level-ancestors in dynamic trees. In Proc. 2nd WADS, pages 32–40, 1991. doi:10.1007/BFB0028247.
- [30] Mohamed Y. Eltabakh, Wing-Kai Hon, Rahul Shah, Walid G. Aref, and Jeffrey Scott Vitter. The SBC-tree: An index for run-length compressed sequences. In Proc. 11th EDBT, pages 523–534, 2008. doi:10.1145/1353343.1353407.
- [31] Paolo Ferragina and Fabrizio Luccio. Dynamic dictionary matching in external memory. Inf. Comput., 146(2):85–99, 1998. doi:10.1006/INCO.1998.2733.
- [32] Paolo Ferragina and S. Muthukrishnan. Efficient dynamic method-lookup for object oriented languages (extended abstract). In Proc. 4th ESA, pages 107–120, 1996. doi:10.1007/3-540-61680-2_50.
- [33] Johannes Fischer, Travis Gagie, Pawel Gawrychowski, and Tomasz Kociumaka. Approximating LZ77 via small-space multiple-pattern matching. In Proc. 23rd ESA, pages 533–544, 2015. doi:10.1007/978-3-662-48350-3_45.
- [34] Johannes Fischer, Travis Gagie, Paweł Gawrychowski, and Tomasz Kociumaka. Approximating lz77 via small-space multiple-pattern matching. In Proc. 23rd ESA, pages 533–544, 2015.
- [35] Michael L. Fredman, János Komlós, and Endre Szemerédi. Storing a sparse table with O(1) worst case access time. In Proc. 23rd FOCS, pages 165–169, 1982. doi:10.1109/SFCS.1982.39.
- [36] Arnab Ganguly, Wing-Kai Hon, and Rahul Shah. A framework for dynamic parameterized dictionary matching. In Proc. 15th SWAT, pages 10:1–10:14, 2016. doi:10.4230/LIPICS.SWAT.2016.10.
- [37] Shay Golan and Ely Porat. Real-time streaming multi-pattern search for constant alphabet. In Proc. 25th ESA, pages 41:1–41:15, 2017. doi:10.4230/LIPICS.ESA.2017.41.
- [38] Yijie Han. Deterministic sorting in O(n log log n) time and linear space. J. Algorithms, 50(1):96–105, 2004. doi:10.1016/J.JALGOR.2003.09.001.
- [39] Wing-Kai Hon, Tsung-Han Ku, Rahul Shah, Sharma V. Thankachan, and Jeffrey Scott Vitter. Faster compressed dictionary matching. Theor. Comput. Sci., 475:113–119, 2013. doi:10.1016/J.TCS.2012.10.050.
- [40] Tomohiro I, Takaaki Nishimoto, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Compressed automata for dictionary matching. Theor. Comput. Sci., 578:30–41, 2015. doi:10.1016/J.TCS.2015.01.019.
- [41] Takuya Kida, Masayuki Takeda, Ayumi Shinohara, Masamichi Miyazaki, and Setsuo Arikawa. Multiple pattern matching in LZW compressed text. In Proceedings of the 8th Data Compression Conference, pages 103–112, 1998. doi:10.1109/DCC.1998.672136.
- [42] Donald E. Knuth, James H. Morris Jr., and Vaughan R. Pratt. Fast pattern matching in strings. SIAM J. Comput., 6(2):323–350, 1977. doi:10.1137/0206024.
- [43] Tsvi Kopelowitz, Ely Porat, and Yaron Rozen. Succinct online dictionary matching with improved worst-case guarantees. In Proc. 27th CPM, pages 6:1–6:13, 2016. doi:10.4230/LIPICS.CPM.2016.6.
- [44] S. Muthukrishnan and Martin Müller. Time and space efficient method-lookup for object-oriented programs (extended abstract). In Proc. 7th SODA, pages 42–51, 1996. URL: http://dl.acm.org/citation.cfm?id=313852.313882.
- [45] Gonzalo Navarro. A guided tour to approximate string matching. ACM Comput. Surv., 33(1):31–88, 2001. doi:10.1145/375360.375365.
- [46] Yoshifumi Sakai. Computing the longest common subsequence of two run-length encoded strings. In Proc. 23rd ISAAC, pages 197–206, 2012. doi:10.1007/978-3-642-35261-4_23.
- [47] Dan E. Willard. Log-logarithmic worst-case range queries are possible in space theta(n). Inf. Process. Lett., 17(2):81–84, 1983. doi:10.1016/0020-0190(83)90075-3.
Appendix A Proof of Compact Trie Construction
Proof.
Our algorithm proceeds as follows:
Step 1: Compute Longest Common Prefixes.
Let $S_1, \ldots, S_k$ be the sorted order of the strings, of total length $N$. We compute the longest common prefixes $\mathrm{lcp}_i = \mathrm{lcp}(S_{i-1}, S_i)$, where $2 \le i \le k$, of each consecutive pair of strings in the sorted order. To do so, we scan each pair of strings $S_{i-1}$ and $S_i$ from left to right to find their longest common prefix.
Step 2: Construct the Compact Trie.
To construct the compact trie $\mathcal{T}$, we add the strings one at a time from left to right to an initially empty trie. We maintain the string depth of each node during the construction. Suppose we have constructed the compact trie $\mathcal{T}_{i-1}$ of the strings $S_1, \ldots, S_{i-1}$, and consider the rightmost path in $\mathcal{T}_{i-1}$, corresponding to $S_{i-1}$. Let $s$ denote the suffix of $S_i$ not shared with $S_{i-1}$, i.e., $S_i = S_i[1..\mathrm{lcp}_i] \cdot s$. We add the string $S_i$ to $\mathcal{T}_{i-1}$ as follows. If $\mathrm{lcp}_i = |S_{i-1}|$, we extend the rightmost path with a new edge representing $s$. Otherwise, we traverse the rightmost path bottom-up to find the location at string depth $\mathrm{lcp}_i$ and attach a new edge representing $s$ there. Note that this location is either an existing node in $\mathcal{T}_{i-1}$ or we need to split an edge and add a new node. At the end, we return the trie $\mathcal{T}_k$.
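The following sketch implements the construction just described: insert the sorted strings left to right, climb the rightmost path to string depth $\mathrm{lcp}_i$, and attach the remaining suffix, splitting an edge if necessary. It assumes the strings are distinct and sorted, and relies on Python dicts preserving insertion order (3.7+), so that popitem() removes the rightmost edge; all names are ours.

```python
class TNode:
    def __init__(self, depth):
        self.depth, self.children = depth, {}   # children: edge label -> node

def build_compact_trie(sorted_strings):
    root = TNode(0)
    rightmost, prev = [root], ""                # rightmost root-to-leaf path
    for s in sorted_strings:
        lcp = 0                                 # Step 1: lcp with predecessor
        while lcp < min(len(prev), len(s)) and prev[lcp] == s[lcp]:
            lcp += 1
        while rightmost[-1].depth > lcp:        # Step 2: climb bottom-up
            rightmost.pop()
        v = rightmost[-1]
        if v.depth < lcp:                       # branch point splits an edge
            label, child = v.children.popitem()
            mid = TNode(lcp)
            v.children[label[: lcp - v.depth]] = mid
            mid.children[label[lcp - v.depth:]] = child
            rightmost.append(mid)
            v = mid
        leaf = TNode(len(s))
        v.children[s[lcp:]] = leaf              # attach the unshared suffix
        rightmost.append(leaf)
        prev = s
    return root

t = build_compact_trie(["ab", "abc", "ad"])
print(sorted(t.children))                       # ['a']
print(sorted(t.children["a"].children))         # ['b', 'd']
```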
Since the strings are sorted, step 2 inductively constructs the compact trie $\mathcal{T}_i$ of the first $i$ strings. Thus, $\mathcal{T}_k$ is the compact trie of $S_1, \ldots, S_k$. Step 1 uses $O(N)$ time. For step 2, we bound the total number of edges traversed in the bottom-up traversals of the rightmost path. Consider traversing the rightmost path in $\mathcal{T}_{i-1}$ while adding $S_i$, and let $v$ be the (existing or newly created) node of string depth $\mathrm{lcp}_i$ on the path where we attach a new child edge representing $s$. The edges on the path below $v$ are never traversed again and are part of the final trie $\mathcal{T}_k$. Hence, the total time to traverse them is at most $O(k)$, the number of edges in $\mathcal{T}_k$. The remaining parts of step 2 take constant time per string, and hence, the total time is $O(N)$. We use $O(N)$ space in total.