Pattern Matching on Run-Length Grammar-Compressed Strings in Linear Time
Abstract
Run-length straight-line programs (RLSLPs) are a technique for grammar-based compression that can represent any string within space optimal with respect to δ, the substring complexity of the string. We address the compressed pattern matching problem for RLSLPs: given a text compressed as an RLSLP and an uncompressed pattern, determine whether the pattern occurs in the text. This paper proposes an algorithm that solves this problem in time linear in the size of the grammar plus the length of the pattern.
Keywords and phrases: pattern matching, run-length straight-line programs, compression, suffix tree
Funding: Ryo Yoshinaka: JSPS KAKENHI 18K11150, 20H05703, 23K11325, 24H00697, 24K14827.
2012 ACM Subject Classification: Theory of computation → Design and analysis of algorithms
Acknowledgements: We are grateful to the anonymous reviewers for their thorough review and helpful recommendations.
Editors: Paola Bonizzoni and Veli Mäkinen

1 Introduction
Pattern matching is the problem of determining whether a given pattern P occurs in a given text T. It is one of the most fundamental problems in computer science, with applications in various fields such as text processing, signal processing, database searching, and bioinformatics. There are many algorithms for solving the pattern matching problem [5, 23].
Nowadays, datasets are often stored in a compressed form, and it is impractical to decompress the data each time to find a pattern in it. Thus, pattern matching on compressed data without decompression has attracted much attention. The problem of compressed pattern matching [29] is, given a pattern P and a compressed representation of a text T, to determine whether P occurs in T without decompressing it. The complexity of matching algorithms is mainly measured with respect to the size of the compressed representation and the length m of the pattern, rather than the length N of the text.
Since the pioneering work of Amir and Benson [1], many compressed pattern matching algorithms have been proposed for various compression methods, depending on the specific properties of the methods [2, 19, 4]. Among them, Gawrychowski [8] showed an algorithm for Lempel–Ziv compression (commonly known as LZ77) that runs in O(n log(N/n) + m) time for a parse of size n. Gawrychowski [9] also showed an O(n + m)-time algorithm for Lempel–Ziv–Welch (LZW) compression. Recently, as a landmark, Ganardi and Gawrychowski [6] successfully developed an algorithm for straight-line programs (SLPs) that runs in O(g + m) time for an SLP of size g. An SLP is a context-free grammar that generates exactly one string, and SLPs are extensively employed to generalize various grammar-based compression techniques [12, 28, 32, 21, 10, 3].
The complexity of pattern matching on compressed data is inherently linked to the compressibility of the data itself, as the size is influenced by the specific compression algorithm employed. The substring complexity δ due to Kociumaka et al. [17] serves as a measure of the compressibility of highly repetitive strings, enabling the evaluation of the compression performance of various methods. They showed that LZ77 can represent any text using O(δ log(N/δ)) space, which is asymptotically optimal with respect to δ. In contrast, the smallest SLP may require asymptotically more space. LZ77 achieves high compression ratios compared to other methods and is widely used in practical applications such as zip. However, processing data directly in its LZ77-compressed form is challenging. Even accessing the i-th character of the text is difficult, and solving compressed pattern matching in linear time is not feasible [8]. The difficulty of handling LZ77-compressed data is evident from the fact that the matching algorithm shown in [8] begins by converting the LZ77-compressed data into an SLP.
Nishimoto et al. [26] enhanced SLPs with “run-length rules” of the form X → Y^k to handle repetitions more efficiently. Run-length straight-line programs (RLSLPs), like LZ77, can represent any text using O(δ log(N/δ)) space. Furthermore, RLSLPs preserve the simplicity and usability characteristic of grammar-based compression. For example, with an RLSLP, a substring of length ℓ at any position in the text can be accessed in O(ℓ + log N) time [17], and longest common extension (LCE) queries can be answered in O(log N) time [13]. Additionally, RLSLPs are used in applications such as constructing suffix arrays [13] and indexes [16] within O(δ log(N/δ)) space. Despite their various applications, no linear-time matching algorithm for RLSLPs has been proposed.
In this paper, we propose a linear-time compressed pattern matching algorithm for strings compressed by RLSLPs, as stated in the following theorem.
Theorem 1.
Given a pattern P of length m and an RLSLP G of size g, we can decide whether P occurs in the text described by G in O(m + g) time.
In the linear-time algorithm for compressed pattern matching on SLPs [6], it is crucial to represent substrings of the text where the pattern may occur as a concatenation of a constant number of substrings of the pattern. We adopt this idea in our algorithm for RLSLPs. The challenge lies in representing substrings of the text as concatenations of substrings of the pattern even when a rule of the form X → Y^k is present, which does not exist in SLPs. To address this, handling repetitions of substrings within the pattern is essential. A cover suffix tree [27] is a data structure that extends a suffix tree by adding nodes corresponding to substrings of a string whose squares are substrings of the string. However, for our problem, handling only squares, i.e., substrings that repeat twice, is insufficient. Instead, we need to determine the exact number of times a given substring repeats in a string. We address this issue by introducing a new data structure, an extension of the cover suffix tree, that can handle all repetitions within a string.
In the whole paper we assume the standard word RAM model, which operates on w-bit words, where w ≥ log N and w ≥ log m, with the standard arithmetic (excluding integer division) and bitwise operations in constant time.
2 Preliminaries
Let Σ be a finite alphabet. A string w of length n is a finite sequence of n symbols in Σ. For a string w, we write w[i] for the i-th character of w. The length of w is denoted by |w|. The string of length 0 is called the empty string and is denoted by ε. The concatenation of two strings u and v is denoted by u · v or uv. For a string w, we define w^0 = ε and w^k = w^(k−1) w for any integer k ≥ 1. A string of the form w^k for a positive integer k ≥ 2 is called a repetition of w, and w^2 in particular is referred to as the square of w. A string w is primitive if w cannot be represented as u^k for any string u and integer k ≥ 2. A string u with w = u^k is a root of w, and if u is primitive, u is the primitive root of w. The substring of a string w starting at position i and ending at position j is denoted by w[i..j]. In particular, a substring w[1..j] is called a prefix, and a substring w[i..|w|] is called a suffix of w. Proper prefixes and suffixes of w are those different from w. The suffix w[i..|w|] is also denoted by w[i..]. If i > j, then w[i..j] is the empty string. Let Substr(w), Pref(w), and Suf(w) be the sets of all substrings, prefixes, and suffixes of w, respectively. We say that a string u occurs in a string w at position i if u = w[i..i+|u|−1]. For a string w and an integer ℓ (0 ≤ ℓ < |w|), the ℓth-rotation of w is w[ℓ+1..] · w[1..ℓ]. A period of a string w is a positive integer p such that w[i] = w[i+p] for all i with 1 ≤ i ≤ |w| − p. A substring w[i..j] is a run in w if, for its smallest period p, (1) j − i + 1 ≥ 2p, (2) i = 1 or w[i−1] ≠ w[i+p−1], and (3) j = |w| or w[j+1] ≠ w[j−p+1]. A run is specified by the triple (i, j, p) of its position and smallest period. If u ∈ Substr(w) and e is the largest integer such that u^e ∈ Substr(w), then u^e and e are called the maximum repetition and the maximum repetition count of u in w, respectively.
Example 2.
For w = aabaab, all runs in w are aa, aa, and aabaab, represented by (1, 2, 1), (4, 5, 1), and (1, 6, 3), respectively. On the other hand, all maximum repetitions with maximum repetition count at least 2 of primitive substrings of w are a^2 = aa and (aab)^2 = aabaab.
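The notions of primitive root and maximum repetition admit straightforward (quadratic-time) reference implementations; the function names below are ours, and the paper's data structures answer these queries far more efficiently:

```python
def primitive_root(w: str) -> str:
    """Return the primitive root u of w, i.e., the shortest u with w = u^k."""
    n = len(w)
    for d in range(1, n + 1):
        if n % d == 0 and w[:d] * (n // d) == w:
            return w[:d]

def max_repetition_count(u: str, w: str) -> int:
    """Largest e such that u^e is a substring of w (e >= 1 iff u occurs in w)."""
    e = 0
    while u * (e + 1) in w:
        e += 1
    return e

# for instance, with w = aabaab:
w = "aabaab"
assert primitive_root("aabaab") == "aab"
assert max_repetition_count("a", w) == 2     # maximum repetition aa
assert max_repetition_count("aab", w) == 2   # maximum repetition aabaab
```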
A straight-line program (SLP) is a context-free grammar describing exactly one string. Without loss of generality, we assume the grammar is in Chomsky normal form, i.e., each rule is of the form X → YZ or X → a, where X, Y, Z are nonterminals and a is a terminal. For any rule X → YZ, X does not appear in the derivation of Y nor Z. A run-length straight-line program (RLSLP) [26] is an extended SLP which can also have rules of the form X → Y^k where k ≥ 2. We call rules of the forms X → YZ and X → Y^k binary rules and run-length rules, respectively. The terminal string derived from a nonterminal X is denoted by val(X).
Example 3.
Consider the run-length straight-line program (RLSLP) with the rules A1 → a, A2 → b, A3 → A1 A2, A4 → A3^3, and A5 → A4 A1.
Here, A5 is the start symbol. We have val(A1) = a, val(A2) = b, val(A3) = ab, val(A4) = ababab, and val(A5) = abababa.
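The RLSLP formalism can be made concrete with a tiny interpreter; the rule encoding ("term", "bin", "rep") and the function name val are ours, and expanding a grammar of course defeats compression, so this only fixes the semantics of the three rule forms:

```python
def val(rules: dict, X: str) -> str:
    """Return the terminal string derived from nonterminal X."""
    kind = rules[X][0]
    if kind == "term":                 # X -> a
        return rules[X][1]
    if kind == "bin":                  # X -> Y Z (binary rule)
        _, Y, Z = rules[X]
        return val(rules, Y) + val(rules, Z)
    _, Y, k = rules[X]                 # X -> Y^k (run-length rule)
    return val(rules, Y) * k

rules = {
    "A1": ("term", "a"),
    "A2": ("term", "b"),
    "A3": ("bin", "A1", "A2"),
    "A4": ("rep", "A3", 3),
    "A5": ("bin", "A4", "A1"),
}
assert val(rules, "A4") == "ababab"
assert val(rules, "A5") == "abababa"
```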
3 Previous Work and Challenges
This section gives an overview of Ganardi and Gawrychowski's [6] compressed pattern matching algorithm on SLPs and the challenges in extending it to RLSLPs.
Consider an SLP G of size g for a text T of length N and a pattern P of length m. For each nonterminal X of G, let s(X) be the longest element of Pref(P) ∩ Suf(val(X)) and p(X) be the longest element of Suf(P) ∩ Pref(val(X)).
Example 4.
For a pattern P = aabab and a nonterminal X with val(X) = abaa, we have s(X) = aa and p(X) = ab.
We can verify that P occurs in T if and only if there exists a rule X → YZ such that P occurs in s(Y) p(Z), provided m ≥ 2. This fact has often been used to solve the compressed pattern matching problems efficiently [4, 14, 9, 8], where those strings are represented and processed as positions on P. However, no linear-time algorithm is known that computes s(X) and p(X) for all nonterminals in an arbitrary SLP. To achieve a linear-time solution for SLP-compressed pattern matching, Ganardi and Gawrychowski [6] invented a brilliant idea to compute approximations of them, which we call PSI-information (Prefix-Suffix-Infix information) in this paper. A PSI-information is a tuple consisting of either four substrings or a single substring of P.
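The occurrence criterion above can be checked naively on explicit strings; s_of and p_of below are our hypothetical helpers computing s(·) and p(·) by brute force, and the final assertion checks that occurrences crossing the seam of a concatenation are exactly the occurrences inside s(Y) p(Z):

```python
def s_of(valA: str, P: str) -> str:
    """Longest prefix of P that is a suffix of valA (brute force)."""
    for l in range(min(len(valA), len(P)), -1, -1):
        if valA.endswith(P[:l]):
            return P[:l]

def p_of(valA: str, P: str) -> str:
    """Longest suffix of P that is a prefix of valA (brute force)."""
    for l in range(min(len(valA), len(P)), -1, -1):
        if valA.startswith(P[len(P) - l:]):
            return P[len(P) - l:]

assert s_of("abaa", "aabab") == "aa" and p_of("abaa", "aabab") == "ab"

# occurrences of P crossing the Y|Z seam vs. occurrences in s(Y)p(Z)
Y, Z, P = "babaa", "babba", "aabab"
cross = any((Y + Z)[i:i + len(P)] == P
            for i in range(max(0, len(Y) - len(P) + 1), len(Y)))
assert cross == (P in s_of(Y, P) + p_of(Z, P))
```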
Definition 5.
The set of PSI-information for a nonterminal X is defined as follows. If val(X) is a substring of P, i.e., val(X) ∈ Substr(P), the PSI-information for X consists of a single substring, namely val(X) itself represented by one of its positions in P.
Otherwise, the PSI-information for X consists of four substrings of P that together represent the prefix and suffix of val(X) relevant to occurrences of P.
They showed an algorithm that computes some PSI-information for each rule X → YZ recursively, assuming that the PSI-information for Y and Z has already been computed. Depending on whether each of val(Y) and val(Z) is a substring of P, there are four cases (Cases 1 to 4). In every case, the PSI-information for X is assembled from those for Y and Z, and the computation is reduced to at most one call of the substring concatenation query (scq), defined below.
- Substring concatenation query
-
(scq): Given two substrings u and v of P represented by their positions, return a position of uv in P if uv ∈ Substr(P); otherwise, return “No”.
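A naive linear-scan version of the scq fixes its semantics; substrings are passed as 1-indexed (start, length) pairs as in the paper, while this per-query scan is only a specification and Lemma 6 below is about answering such queries in batches far faster:

```python
def scq(P: str, u: tuple, v: tuple):
    """Return a 1-indexed position of uv in P, or None (i.e., "No")."""
    (i, lu), (j, lv) = u, v
    uv = P[i - 1:i - 1 + lu] + P[j - 1:j - 1 + lv]
    k = P.find(uv)
    return k + 1 if k >= 0 else None

P = "ababbab"
assert scq(P, (1, 2), (6, 2)) == 1     # "ab" + "ab" = "abab" occurs at 1
assert scq(P, (5, 2), (5, 2)) is None  # "ba" + "ba" = "baba" does not occur
```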
For scqs, it is not known whether we can answer an arbitrary single query in constant time with O(m)-time preprocessing. However, they showed the following alternative solution for batched queries, which is a key to the linear-time pattern matching algorithm.
Lemma 6 ([6], Theorem 1.3).
We can preprocess a string of length m in O(m) time so that we can answer any batch of q scqs in O(q) total time.
Lemma 7 ([6]).
For rules X_i → Y_i Z_i (1 ≤ i ≤ r), if the PSI-information for Y_i and Z_i has already been computed, we can determine whether P occurs in val(X_i) for all 1 ≤ i ≤ r in O(m + r) total time.
Let H_h be the set of nonterminals whose derivation tree is of height h. PSI-information for nonterminals is computed in the order of H_1, H_2, and so on, in a batched style. From Lemma 6, PSI-information for all X ∈ H_h can be computed in O(|H_h|) time for each h.
Lemma 8 ([6]).
We can preprocess P in O(m) time so that, given rules X_i → Y_i Z_i (1 ≤ i ≤ r) where the PSI-information for Y_i and Z_i has already been computed, we can compute the PSI-information for all X_i's in O(r) total time.
Another key element supporting the linear-time algorithm is balancing SLPs to ensure that the height of the derivation tree is O(log N), due to Ganardi et al. [7]. With this technique, the total time complexity of the algorithm is bounded by O(m + g), as we will trace in the proof of Lemma 27.
We now turn our attention to extending the algorithm to RLSLPs. Concerning the height of the derivation trees, Navarro et al. [24] showed that any RLSLP can be balanced in linear time without increasing its asymptotic size. Thus, the crucial issue in realizing a linear-time RLSLP-compressed pattern matching algorithm is to establish the counterparts of Lemmas 7 and 8 for the run-length rules. Those will appear in Section 5 as Lemmas 28 and 26. Of course, we do not take the naive and computationally expensive approach that breaks a run-length rule down into binary rules and applies Ganardi and Gawrychowski's technique. The following two types of queries will play important roles in achieving our goals, just as Lemmas 7 and 8 are based on scqs.
- Maximum repetition query
-
(mrq): Given a nonempty substring u of P represented by a position, answer one of the positions of its maximum repetition in P and the maximum repetition count of u.
- Primitive root query
-
(prq): Given a nonempty substring u of P represented by a position, answer one of the positions in P of its primitive root.
Example 9.
Let P = ababbab be a pattern. Given the position 6 of the substring ab, the answer to the mrq is the position 1 of its maximum repetition abab and the maximum repetition count 2. Notice that the maximum repetition does not necessarily include the queried position. Also, given the position of the substring abab, the answer to the prq is any of the positions 1, 3, 6 of the primitive substring ab.
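Brute-force versions of the two queries, on explicit strings rather than positions, may help fix their semantics (for illustration only; the paper precomputes the answers on a repetition-informed suffix tree):

```python
def prq(P: str, u: str) -> int:
    """Return a 1-indexed position in P of the primitive root of u.

    u is assumed to occur in P."""
    n = len(u)
    root = next(u[:d] for d in range(1, n + 1)
                if n % d == 0 and u[:d] * (n // d) == u)
    return P.find(root) + 1

def mrq(P: str, u: str):
    """Return a position of the maximum repetition u^e in P and the count e.

    u is assumed to occur in P."""
    e = 1
    while u * (e + 1) in P:
        e += 1
    return P.find(u * e) + 1, e

P = "ababbab"
assert mrq(P, "ab") == (1, 2)        # maximum repetition abab at position 1
assert prq(P, "abab") in (1, 3, 6)   # primitive root ab
```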
If we can answer mrqs efficiently, we can efficiently compute the PSI-information of X from that of Y for run-length rules X → Y^k.
Lemma 10.
Consider X → Y^k with val(Y) ∈ Substr(P). If val(Y)^k ∈ Substr(P), then val(X) ∈ Substr(P). Otherwise, for the maximum repetition count e of val(Y) in P, we have that s(X) is the longest element of Pref(P) ∩ Suf(val(Y)^(e+1)) and p(X) is the longest element of Suf(P) ∩ Pref(val(Y)^(e+1)).
Proof.
In the case where val(Y)^k ∈ Substr(P), by repeatedly applying the binary-rule computation for the case where both operands are substrings of P, we can conclude val(X) ∈ Substr(P). Suppose val(Y)^k ∉ Substr(P) and let e be the maximum repetition count of val(Y) in P; note that k > e. The string p(X) is a prefix of val(Y)^k, so it must be of the form val(Y)^j v for some proper prefix v of val(Y). Since val(Y)^j ∈ Substr(P), we have j ≤ e, and thus p(X) ∈ Pref(val(Y)^(e+1)). Conversely, every element of Suf(P) ∩ Pref(val(Y)^(e+1)) is a prefix of val(Y)^k by e + 1 ≤ k. Together with the symmetric argument on s(X), we conclude the lemma.

Consider deciding whether P occurs in val(X) for a run-length rule X → Y^k. If P occurs in val(X), either m ≤ 2|val(Y)| or m > 2|val(Y)|. The former case can be handled in the same way as the binary rule case. For handling the latter case, we use prqs and mrqs based on Lemma 11. Note that all the periods of P can be computed in O(m) time using the so-called Z-algorithm [11, 22].
Lemma 11.
Consider X → Y^k with m > 2|val(Y)|. Let u be the primitive root of val(Y), e the maximum repetition count of u in P, and q = k|val(Y)|/|u|, so that val(X) = u^q. Then, P occurs in val(X) if and only if |u| is a period of P and q ≥ e + a + b, where P is written as s u^e p for a proper suffix s and a proper prefix p of u, a = 1 if s ≠ ε and a = 0 otherwise, and b = 1 if p ≠ ε and b = 0 otherwise.
Proof.
Let q be the integer such that val(X) = u^q. For P to occur in val(X), P must have period |u|. In this case, since u^e occurs in P with the maximum count e, there are a proper suffix s and a proper prefix p of u such that P = s u^e p. If s = p = ε, P = u^e occurs in val(X) just if q ≥ e. Otherwise, val(X) must have one more block of u to cover each of s and/or p.
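As a concrete aid for the Z-algorithm mentioned above: p is a period of P exactly when the Z-value at position p equals m − p (0-indexed Z array). A minimal sketch:

```python
def z_array(s: str):
    """Z[i] = length of the longest common prefix of s and s[i:]."""
    m = len(s)
    Z = [0] * m
    Z[0] = m
    l = r = 0
    for i in range(1, m):
        if i < r:
            Z[i] = min(r - i, Z[i - l])
        while i + Z[i] < m and s[Z[i]] == s[i + Z[i]]:
            Z[i] += 1
        if i + Z[i] > r:
            l, r = i, i + Z[i]
    return Z

def periods(P: str):
    """All periods of P in O(m) time: p is a period iff Z[p] = m - p."""
    m = len(P)
    Z = z_array(P)
    return [p for p in range(1, m + 1) if p == m or Z[p] == m - p]

assert periods("ababab") == [2, 4, 6]
assert periods("aabaab") == [3, 6]
```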
Our proposed algorithm for RLSLP-compressed pattern matching precomputes the answers to mrqs and prqs in an enhanced suffix tree. The next section describes the details of the data structure and its construction. Throughout the rest of this paper, we fix a pattern P of length m and a balanced RLSLP G of size g that represents a text T of length N.
4 Repetition-informed Suffix Tree
The longest common prefix lcp(u, v) of strings u and v is the longest string in Pref(u) ∩ Pref(v). The suffix tree of P is defined as follows.
Definition 12 ([31]).
Let P̂ = P$ where $ is a special symbol that does not occur in P. The suffix tree of P consists of the set V of explicit nodes and the set E of edges, where
-
V = { lcp(s, s′) : s, s′ ∈ Suf(P̂) },
-
E = { (u, v) ∈ V × V : u is a proper prefix of v, and no w ∈ V \ {u, v} satisfies u ∈ Pref(w) and w ∈ Pref(v) }, where the edge (u, v) is labeled by the string x with v = ux.
Note that whereas the mathematical definition above is given with strings, node names and edge labels are represented as occurrence positions of those strings in P̂. This paper often uses strings for readability where the actual computation is done on positions, unless confusion arises. We assume one can access in constant time the node s for any s ∈ Suf(P̂) and the parent of an arbitrary node. The suffix tree can be constructed in O(m) time [30]. Every substring of P̂ that is not an explicit node of the suffix tree is called an implicit node. An implicit node u is a conceptual node that exists between two explicit nodes and is specified as (v, |u|), where v is the shortest extension of u in V. A node is either an explicit node or an implicit node.
We extend suffix trees by adding more explicit nodes.
Definition 13.
Let W ⊆ Substr(P̂) and V be the set of all explicit nodes in the suffix tree of P. The extended suffix tree of P with a substring set W is defined by
-
V_W = V ∪ W,
-
E_W = { (u, v) ∈ V_W × V_W : u is a proper prefix of v, and no w ∈ V_W \ {u, v} satisfies u ∈ Pref(w) and w ∈ Pref(v) }.
We use an extended suffix tree to store the answers to the mrqs and prqs on the respective nodes. We promote an implicit node u to an explicit node only if the answers to the queries on u are nontrivial, in the sense that either the primitive root of u is not u itself, or the maximum repetition of u is not u itself. In the trivial case, the answers to mrqs and prqs can be the queried positions themselves, with the maximum repetition count 1. Let Q be the set of nonempty strings for which the mrq or the prq is nontrivial:
Q = { u ∈ Substr(P) \ {ε} : u^2 ∈ Substr(P) or u is not primitive }.
The goal of this section is to construct the repetition-informed suffix tree RIST(P), defined as follows.
Definition 14.
An extended suffix tree of P is said to be repetition-informed, and denoted by RIST(P), if each node u ∈ Q retains mrep(u), mrc(u), and proot(u), where
-
mrep(u): one of the positions of the maximum repetition of u in P,
-
mrc(u): the integer e such that u^e is the maximum repetition of u in P,
-
proot(u): one of the positions of the primitive root of u in P.
That is, the answer to the mrq on u is mrep(u) and mrc(u), and the one to the prq on u is proot(u).
The suffix tree and the repetition-informed suffix tree of an example pattern are shown in Figure 1. Although the values mrep(u) and proot(u) are positions on P, for intuitive understandability, Figure 1 presents them as links from u to the nodes corresponding to those positions. We remark that mrc(u) can be computed from the length of the maximum repetition and |u|. However, under the assumptions of the computer model, integer division cannot be done in constant time. So, we will compute the values mrc(u) in the preprocessing step and let the nodes remember them.
4.1 Node identification queries
Recall that instances of the mrqs and prqs are positions on P, while the answers are stored in the nodes of the repetition-informed suffix tree. This subsection shows that we can efficiently convert positions on P to the corresponding nodes of an extended suffix tree (Lemma 17).
- Node identification query
-
(niq): Given a substring u of P represented by a position, return the node corresponding to u on the extended suffix tree of P.
We answer niqs using waqs, defined below. In the weighted ancestor problem, we are given a node-weighted tree where the weight of any node is a nonnegative integer greater than that of its parent.
- Weighted ancestor query
-
(waq): Given a node v and a nonnegative integer t, return the furthest ancestor of v with weight at least t.
Kociumaka et al. [15] showed that a batch of q waqs on a tree of size n can be answered in O(n + q) time, provided that the node weights are polynomially bounded in n. Their algorithm first sorts the queries by weight and then processes them in non-increasing order. This necessity of sorting is the reason why the weights must be polynomially bounded and the queries must be handled in a batch. In other words, as long as the queries are given in non-increasing order of weights, multiple waqs can be answered in an online manner with the same time complexity. It is more convenient for our discussion to present their result in this online computation version.
Lemma 15.
We can answer q waqs on a node-weighted tree of size n in O(n + q) total time in an online manner, provided that the queries are ordered by non-increasing weights.
Ganardi and Gawrychowski [6] improved upon the result by Kociumaka et al.
Lemma 16.
We can preprocess a node-weighted tree of size n in O(n) time so that q waqs can be answered in O(q) total time.
Lemma 16 encompasses Lemma 15, but since the algorithm in [15] is simpler, Lemma 15 is used when it is sufficient.
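To fix the semantics of waqs, here is a simple binary-lifting implementation, with O(n log n) preprocessing and O(log n) per query; it is weaker than Lemmas 15 and 16, and it assumes the queried node itself has weight at least t:

```python
class WAQ:
    """Weighted ancestor queries by binary lifting.

    Weights must strictly increase from the root toward the leaves."""

    def __init__(self, parent, weight):          # parent[root] = -1
        n = len(parent)
        self.weight = weight
        self.up = [parent[:]]                    # up[k][v] = 2^k-th ancestor
        for k in range(1, max(1, n.bit_length())):
            prev = self.up[-1]
            self.up.append([prev[v] if prev[v] < 0 else prev[prev[v]]
                            for v in range(n)])

    def query(self, v, t):
        """Furthest ancestor of v whose weight is still at least t."""
        for up_k in reversed(self.up):           # greedy, largest jumps first
            if up_k[v] >= 0 and self.weight[up_k[v]] >= t:
                v = up_k[v]
        return v

# a root-to-leaf path 0 - 1 - 2 - 3 with weights 0, 2, 4, 7
w = WAQ([-1, 0, 1, 2], [0, 2, 4, 7])
assert w.query(3, 3) == 2    # node 2 is the highest ancestor with weight >= 3
assert w.query(3, 1) == 1
```

Since the weights are monotone along every root-to-leaf path, the greedy descent over jump sizes is correct: a jump of size 2^k keeps the weight at least t exactly when the target ancestor is not above the answer node.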
A result similar to the following lemma is used in the algorithm presented in [6].
Lemma 17.
We can preprocess the extended suffix tree, of size O(m), in O(m) time so that we can answer q niqs in O(q) total time.
Proof.
Define the weight of each explicit node to be the length of its corresponding string. Preprocessing the extended suffix tree for waqs takes O(m) time by Lemma 16. Given a substring u = P[i..j] as input for an niq, we use a waq with the leaf node P̂[i..] and the weight |u| to identify an explicit node v. If |v| = |u| holds, u is the node v. Otherwise, u is an implicit node between v and its parent, since the parent of v is shorter than u. The implicit node is represented by (v, |u|). It takes O(q) time to process q waqs.
Lemma 17 is based on Lemma 16 and used in the matching algorithm in Section 5. A simplified version of Lemma 17 is obtained based on Lemma 15, which is sufficient for discussions in the rest of this section.
Lemma 18.
We can answer q niqs on the extended suffix tree of size O(m) in O(m + q) time in an online manner, provided that the queries are ordered by non-increasing length.
4.2 Constructing repetition-informed suffix trees
Li et al. [20] showed that the number of distinct substrings of the form u^2 for some u in any string of length n is less than n. We immediately obtain the following lemma.
Lemma 19.
|Q| ∈ O(m) holds.
This ensures that the size of every extended suffix tree used below is O(m), so that niqs on them can be answered within the bounds of Lemmas 17 and 18.
Define
Q₂ = { u ∈ Substr(P) : u is primitive and u^2 ∈ Substr(P) }.
Obviously, Q₂ ⊆ Q holds. Our construction consists of the following steps:
-
1.
constructing the extended suffix tree of P with the roots of all runs in P;
-
2.
identifying the nodes for all elements of Q₂ and computing the answers to the repetition queries on them;
-
3.
constructing RIST(P) by adding the nodes for the remaining elements of Q together with their answers.
(We construct the intermediate structures for clear illustration of the preprocessing algorithm. However, in practice, it suffices to identify the nodes of Q₂ and calculate the answers to repetition queries in Step 2. One may embed those nodes into the tree together with the other nodes of Q at once in Step 3.)
The following subsections explain the respective steps.
4.2.1 Construction of the extended suffix tree with the roots of runs
We first enumerate the roots of all runs in P as positions by Kolpakov and Kucherov's algorithm [18], which enumerates all the runs of a string in linear time, as their positions and smallest periods. Then, we identify the corresponding nodes in the suffix tree by niqs and make them explicit. Conversion of implicit nodes into explicit nodes may appear very simple. When an implicit node u to convert is represented by (v, |u|), we should introduce a new explicit node between the explicit node v and the parent of v. We replace the edge between v and its parent by two edges, one between the parent and u and one between u and v. This can be done in constant time. One has to notice, however, that in our approach the implicit nodes to be converted are all given simultaneously by niqs for efficiency. Converting one implicit node alters the tree structure, so the representations of some of the other implicit nodes may be affected. Suppose two implicit nodes u and u′ are represented by (v, |u|) and (v, |u′|), respectively, with |u| < |u′|, i.e., u ∈ Pref(u′). If u′ is converted first, then the representation (v, |u|) of u becomes invalid, since the explicit node immediately below u is now the new explicit node u′ rather than v. One should not embed a new explicit node for u as the parent of v. This potential disturbance can be avoided by converting u before u′.
Lemma 20.
Given a set of n substrings of P represented by their positions, we can construct the extended suffix tree of P with those substrings in O(m + n) time.
Proof.
We first construct the suffix tree of P and sort the input substrings by their lengths. Since the lengths of the input substrings are at most m, one can sort them in O(m + n) time by bucket sort. We then identify the corresponding nodes using niqs in non-increasing order of length. The conversion of the identified implicit nodes is performed in the reverse order, which ensures that the representations of those nodes remain valid when they are converted. By Lemma 18, one can identify and convert the nodes in O(m + n) time.
Corollary 21.
One can construct the extended suffix tree of P with the roots of all runs in O(m) time.
Proof.
One can enumerate the positions of the runs in P in linear time by Kolpakov and Kucherov's algorithm [18]. Then, applying Lemma 20 to the positions of their roots, we obtain the tree. We note that the same substring u may be the root of different runs ρ and ρ′ in P, in which case we identify the same node twice or more. This is not a problem at all when constructing the tree. For the succeeding procedure, among those runs, we pick a longest one, denoted as run(u), and retain its occurrence position in the identified node.
4.2.2 Construction of the extended suffix tree with the nodes for Q₂
The goal of this subsection is to identify the nodes for all elements of Q₂ on the tree and to compute mrep(u), mrc(u), and proot(u) for them. We first focus on the node identification.
Identifying the nodes of Q₂
Consider a run P[i..j] with smallest period p and root r = P[i..i+p−1], and write P[i..j] = r^c r′ where r′ is a proper prefix of r. If c ≥ 3, this run contains the squares of all the rotations of r. If c = 2, the run contains the squares of the ℓth-rotations of r for 0 ≤ ℓ ≤ |r′|. Conversely, every u ∈ Q₂ is a rotation of the root of some run with smallest period |u| in which u^2 occurs.
One can identify those nodes by niqs on positions shifted within the runs; that is, for the run above, we pose niqs on the positions i + ℓ of the rotations for all admissible ℓ. However, a naive implementation of this idea may be redundant and inefficient: the same rotation may arise from different runs, so the same node may be identified several times. In order to bound the number of niqs by O(m), we maintain found nodes in rotation lists consisting of rotation links. A rotation link is a link from the node for a rotation u to the node for the next rotation of u. Therefore, identifying the nodes of Q₂ and creating the rotation lists are equivalent tasks. Examples of rotation lists are shown in Figure 2.
Suppose the same string arises as a rotation from two different runs ρ and ρ′. If we process ρ′ after ρ, then on the way of constructing the rotation list starting from the root of ρ′, we reach a node that already belongs to the rotation list of ρ. Then, without tracing the rotation links one by one, we can jump to the end of the rotation list of ρ and continue growing the list from there if rotations remain. Otherwise, this finishes the rotation list starting at the root of ρ′. This is the basic idea of using the rotation lists for efficiently finding the nodes of Q₂. See Figure 3.
Algorithm 1 (CreateAllRotLists) computes all the rotation lists from the roots of the runs in non-increasing order of the lengths of their corresponding strings. The function CreateRotList creates the list of nodes in this order. Those nodes are created but not embedded into the tree (if they are not explicit) until we find all the nodes of Q₂. The subroutine GetRotLink poses an niq on the position shifted by one from a node u to obtain the node u′ for the next rotation. The rotation link is remembered in u as rot(u), which will be used later. The value cnt(u) maintains the number of nodes in the rotation list counting from its head up to u; cnt(u) is initialized to be zero to denote that the node u has not been visited. We remember the end node of a rotation list starting at u as end(u) if the list is non-circular.
When extending the list from a node, as long as we do not encounter another node already in a list, we repeatedly call GetRotLink and get rotation links using niqs to extend the rotation list. At some point, we may reach a node v of another list. In this case, the current list shall be extended to the end of the list containing v. If v has been processed earlier, i.e., end(v) is determined, we jump to end(v). If not, we recursively call CreateRotList to find the end of the list from v. In either case, we will reach the end. Then, the count may reach the number of all rotations, in which case the list is complete. Otherwise, we continue extending the list.
It is possible that the nested recursive calls of CreateRotList circulate. In this case, we visit a node for the second time. This is detected when end of that node has not yet been determined (Line 22). In this case, the list is circular and we discontinue extending the list.
When the algorithm halts, each rotation list has just one node u at which the construction started. For that node u, the following holds:
-
if end(u) is determined, u is the head of a non-circular rotation list which ends in end(u),
-
if end(u) is not determined, u is in a circular rotation list.
Those values will be useful when computing mrep(u) and mrc(u) for the nodes u in each rotation list.
Lemma 22.
Given a position of run(u) for the root u of each run in P, one can compute all rotation lists in O(m) time.
Proof.
One can easily verify the above invariant properties of the variables used in the algorithm. We discuss the time complexity. For each node u, GetRotLink (Line 17) is called at most once, namely from the node of which u is the next rotation. Thus, the number of calls of GetRotLink is bounded by |Q₂| ∈ O(m) by Lemma 19. Each call of GetRotLink involves an niq. Since we process the elements in non-increasing order of length, those queries are posed in non-increasing order, and thus Lemma 18 applies. Hence, the total time used by the function GetRotLink is bounded by O(m). Therefore, the algorithm runs in time proportional to m.
Computing answers to queries on the nodes of Q₂
We now compute mrep(u), mrc(u), and proot(u) for all u ∈ Q₂. The value of proot(u) coincides with a position of u itself, which has already been computed when creating the rotation lists. Concerning the others, we remark that
Σ_{u ∈ Q₂} mrc(u) ∈ O(m)
by Lemma 19.
We compute the values by following each rotation list. During the iteration, we maintain (the length of) the longest pseudo-run with u, denoted by L(u), when visiting u. A pseudo-run with u is defined to be a substring of P of the form u^c v with c ≥ 2 and v a proper prefix of u. The maximum repetition and the maximum repetition count of u are easily computed from the length of the longest pseudo-run with u. We can compute L(u′) for the next rotation u′ from L(u) based on the fact that the longest pseudo-run with u′ is obtained from that with u by trimming its first character and possibly extending it at the end, where the trimmed and extending characters are assumed to be absent when the pseudo-run touches the corresponding end of P.
Suppose a rotation list is not circular and its head node is u. In this case, there is no rotation whose pseudo-run could extend that of u from the left. Thus, L(u) arises from run(u). The maximum repetition of u is the prefix of the longest pseudo-run with u of length mrc(u)|u|. So, mrep(u) and mrc(u) can be computed. We proceed to the next node u′ in the rotation list. If run(u′) = run(u), the longest pseudo-run with u′ is the one obtained from L(u). If run(u′) ≠ run(u), L(u′) is the longer of the one obtained from L(u) and the pseudo-run arising from run(u′). In either case, mrep(u′) and mrc(u′) can be computed from L(u′). We repeat this process.
If a rotation list is circular, we begin the process from a node u with the largest |run(u)| in the list, which is not necessarily the head. This node is found by traversing the rotation list once. In this case, there exists a preceding node u′ in the list whose pseudo-run could in principle extend that of u. However, by the choice of u, it is guaranteed that run(u) is at least as long as any such pseudo-run. Thus, L(u) arises from run(u). We proceed to the next node in the same manner as in the non-circular case.
Note that the maximum repetition count mrc(u) is obtained from L(u) and |u| by dividing L(u) by |u|. We perform division by subtraction; that is, to compute the quotient, we repeatedly subtract |u| from L(u) until the value becomes smaller than |u|. This division costs O(mrc(u)) time. Thus, the total division cost is O(Σ_{u ∈ Q₂} mrc(u)) ⊆ O(m). (During the traversal of rotation lists, by maintaining the remainder of the division of L(u) by |u| when computing those values for u, one does not have to perform this “division” for the succeeding nodes.)
Therefore, we can compute mrep(u), mrc(u), and proot(u) for all u ∈ Q₂ in O(m) time.
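The division-by-subtraction step can be sketched as follows; its cost is proportional to the quotient, which is exactly what the amortization argument above charges for:

```python
def exponent_by_subtraction(L: int, period: int) -> int:
    """Compute floor(L / period) without the division instruction."""
    e = 0
    while L >= period:
        L -= period
        e += 1
    return e

assert exponent_by_subtraction(17, 5) == 3
assert exponent_by_subtraction(12, 3) == 4
```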
The nodes in the rotation lists, equipped with the repetition information, are embedded into the tree in O(m) time by Lemma 20. Lemma 23 summarizes Section 4.2.2.
Lemma 23.
Given the extended suffix tree with the roots of runs and the positions of the runs, one can compute the extended suffix tree with the nodes for Q₂, together with their repetition information, in O(m) time.
4.2.3 Construction of RIST(P)
Every element of Q \ Q₂ is of the form r^j for some r ∈ Q₂ and j ≥ 2. Using the values of mrep(r), mrc(r), and proot(r) for r ∈ Q₂, we compute a position of r^j, mrc(r^j), mrep(r^j), and proot(r^j) for each j in the ascending order. Let e = mrc(r) and i be the start position of mrep(r). A position of r^j is given by (i, j|r|). The value proot(r^j) is set to be proot(r). To compute mrc(r^j), we use the fact that (r^j)^c ∈ Substr(P) if and only if jc ≤ e, so mrc(r^j) is the largest c satisfying jc ≤ e. The value is obtained by initializing c to be mrc(r^(j−1)) and repeatedly decrementing it one by one while checking the above inequality. Then, mrep(r^j) is given by the position (i, j|r| · mrc(r^j)). Since c is monotonically decremented from e, the process to compute the concerned values for all repetitions of r in P can be performed in O(e) time. Summing the cost for all r ∈ Q₂, we have O(m).
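The monotone decrementing described above can be sketched as follows (function name ours); the total number of decrements over all j is O(e), since the value only goes down from e:

```python
def mrc_of_powers(e: int, j_max: int):
    """mrc(r^j) = floor(e / j) for j = 2..j_max, computed without division.

    e is the maximum repetition count of the primitive root r."""
    counts = {}
    c = e                       # mrc(r^1) = e
    for j in range(2, j_max + 1):
        while c * j > e:        # decrement until c = floor(e / j)
            c -= 1
        counts[j] = c
    return counts

assert mrc_of_powers(10, 5) == {2: 5, 3: 3, 4: 2, 5: 2}
```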
Lemma 24.
We can construct RIST(P) in O(m) time.
Proof.
After computing the positions of r^j, mrc(r^j), mrep(r^j), and proot(r^j) for all r ∈ Q₂ and 2 ≤ j ≤ mrc(r), we modify the tree of Lemma 23 into RIST(P) and let the new explicit nodes retain those values. Note that Lemma 20 is valid for updating extended suffix trees.
Lemma 25.
We can preprocess P in O(m) time so that q mrqs and prqs can be answered in O(q) total time.
Proof.
Given (positions of) substrings of P as input, we identify the corresponding nodes in RIST(P) using niqs. For each queried substring u, if u turns out to be in Q, we return mrep(u) and mrc(u), or proot(u). Otherwise, we return the queried position itself, because the maximum repetition and the primitive root of u are u itself, with the maximum repetition count 1. From Lemma 17, q mrqs or prqs can be answered in O(q) total time.
5 Pattern Matching on RLSLP
Let R_h (resp. B_h) be the set of nonterminals X in G such that the derivation tree of X has height h and X has a rule of the form X → Y^k (resp. X → YZ or X → a). PSI-information for nonterminals is computed in the order of increasing height h. PSI-information for all X ∈ B_h can be computed in O(|B_h|) time by Lemma 8. We can also compute PSI-information for all X ∈ R_h in O(|R_h|) time.
Lemma 26.
We can preprocess P in O(m) time so that, given run-length rules X_i → Y_i^{k_i} (1 ≤ i ≤ r) where the PSI-information for Y_i has already been computed, we can compute the PSI-information for all X_i's in O(r) total time.
Proof.
Lemma 27.
We can compute the PSI-information for all nonterminals in G in O(m + g) time.
Proof.
Recall that G is balanced: the height of the derivation tree is O(log N). By Lemmas 8 and 26, the total time complexity is O(m + g). For each run-length rule X → Y^k, we can also determine whether P occurs in val(X) in linear time.
Lemma 28.
For run-length rules X_i → Y_i^{k_i} (1 ≤ i ≤ r), if the PSI-information for Y_i has already been computed, we can determine whether P occurs in val(X_i) for all 1 ≤ i ≤ r in O(m + r) total time.
Proof.
Theorem 1.
Given a pattern P of length m and an RLSLP G of size g, we can decide whether P occurs in the text T described by G in O(m + g) time.
Proof.
From Lemma 27, we can compute the PSI-information for all nonterminals in G in O(m + g) time. From Lemmas 7 and 28, using the computed PSI-information, we can decide whether P occurs in val(X) for all binary rules X → YZ in O(m + g) time, and whether P occurs in val(X) for all run-length rules X → Y^k in O(m + g) time. Therefore, the total time complexity is O(m + g).
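To illustrate the overall bottom-up structure (this is a folklore quadratic-space-per-symbol baseline, not the linear-time algorithm of this paper), the following sketch keeps, for every nonterminal, only a prefix context and a suffix context of length m − 1, checks newly created occurrences at each rule, and handles run-length rules with O(m/|val(Y)|) explicit copies:

```python
def occurs(rules, order, P):
    """Baseline RLSLP matcher; order lists nonterminals bottom-up."""
    m = len(P)
    assert m >= 2                       # m = 1 reduces to character search
    first = lambda s: s[:m - 1]         # prefix context of length < m
    last = lambda s: s[-(m - 1):]       # suffix context of length < m
    pre, suf, found = {}, {}, False
    for X in order:
        r = rules[X]
        if r[0] == "term":              # X -> a
            s = px = sx = r[1]
        elif r[0] == "bin":             # X -> YZ: new matches cross the seam
            _, Y, Z = r
            s = suf[Y] + pre[Z]
            px, sx = pre[Y] + pre[Z], suf[Y] + suf[Z]
        else:                           # X -> Y^k (run-length rule)
            _, Y, k = r
            if len(pre[Y]) < m - 1:     # val(Y) is short, hence fully known
                y = pre[Y]
                s = y * min(k, m // len(y) + 2)
                px = sx = y * min(k, m)
            else:                       # a match spans at most one seam
                s = suf[Y] + pre[Y]
                px, sx = pre[Y], suf[Y]
        found = found or P in s
        pre[X], suf[X] = first(px), last(sx)
    return found

rules = {
    "A1": ("term", "a"), "A2": ("term", "b"),
    "A3": ("bin", "A1", "A2"),          # val = ab
    "A4": ("rep", "A3", 3),             # val = ababab
    "A5": ("bin", "A4", "A1"),          # val = abababa
}
order = ["A1", "A2", "A3", "A4", "A5"]
assert occurs(rules, order, "baba") is True
assert occurs(rules, order, "bb") is False
```

The contexts bound the work per rule by O(m), giving O(gm) time overall; the point of the paper's PSI-information and repetition queries is to replace these explicit length-(m − 1) strings by constant-size descriptors.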
6 Conclusion
In this paper, we presented a linear-time algorithm for pattern matching on run-length grammar-compressed strings by generalizing Ganardi and Gawrychowski's algorithm for straight-line programs [6]. The algorithm applies to any RLSLP and runs in O(m + g) time for a pattern of length m and an RLSLP of size g.
It remains an open problem whether there exists a linear-time algorithm for pattern matching on iterated straight-line programs [25], which are a further extension of RLSLPs.
References
- [1] Amihood Amir and Gary Benson. Efficient two-dimensional compressed matching. In Data Compression Conference 1992, pages 279–288, 1992. doi:10.1109/DCC.1992.227453.
- [2] Amihood Amir, Gary Benson, and Martin Farach. Let sleeping files lie: Pattern matching in Z-compressed files. Journal of Computer and System Sciences, 52(2):299–307, 1996. doi:10.1006/jcss.1996.0023.
- [3] Philip Bille, Gad M. Landau, Rajeev Raman, Kunihiko Sadakane, Srinivasa Rao Satti, and Oren Weimann. Random access to grammar-compressed strings and trees. SIAM Journal on Computing, 44(3):513–539, 2015. doi:10.1137/130936889.
- [4] Martin Farach and Mikkel Thorup. String matching in Lempel-Ziv compressed strings. Algorithmica, 20(4):388–404, 1998. doi:10.1007/PL00009202.
- [5] Simone Faro and Thierry Lecroq. The exact online string matching problem: A review of the most recent results. ACM Comput. Surv., 45(2), March 2013. doi:10.1145/2431211.2431212.
- [6] Moses Ganardi and Paweł Gawrychowski. Pattern matching on grammar-compressed strings in linear time. In Proceedings of the 2022 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2833–2846, 2022. doi:10.1137/1.9781611977073.110.
- [7] Moses Ganardi, Artur Jeż, and Markus Lohrey. Balancing straight-line programs. J. ACM, 68(4), June 2021. doi:10.1145/3457389.
- [8] Paweł Gawrychowski. Pattern matching in Lempel-Ziv compressed strings: Fast, simple, and deterministic. In Algorithms – ESA 2011, pages 421–432, 2011. doi:10.1007/978-3-642-23719-5_36.
- [9] Paweł Gawrychowski. Optimal pattern matching in LZW compressed strings. ACM Trans. Algorithms, 9(3), June 2013. doi:10.1145/2483699.2483705.
- [10] Keisuke Goto, Hideo Bannai, Shunsuke Inenaga, and Masayuki Takeda. Fast $q$-gram mining on SLP compressed strings. Journal of Discrete Algorithms, 18:89–99, 2013. doi:10.1016/j.jda.2012.07.006.
- [11] Dan Gusfield. Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, 1997. doi:10.1017/CBO9780511574931.
- [12] Marek Karpinski, Wojciech Rytter, and Ayumi Shinohara. Pattern-matching for strings with short descriptions. In Proc. The 6th Annual Symposium on Combinatorial Pattern Matching (CPM95), volume 937 of Lecture Notes in Computer Science, pages 205–214. Springer, 1995. doi:10.1007/3-540-60044-2_44.
- [13] Dominik Kempa and Tomasz Kociumaka. Collapsing the hierarchy of compressed data structures: Suffix arrays in optimal compressed space. In 2023 IEEE 64th Annual Symposium on Foundations of Computer Science (FOCS), pages 1877–1886, November 2023. doi:10.1109/FOCS57990.2023.00114.
- [14] Takuya Kida, Tetsuya Matsumoto, Yusuke Shibata, Masayuki Takeda, Ayumi Shinohara, and Setsuo Arikawa. Collage system: a unifying framework for compressed pattern matching. Theoretical Computer Science, 298(1):253–272, 2003. doi:10.1016/S0304-3975(02)00426-7.
- [15] Tomasz Kociumaka, Marcin Kubica, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. A linear-time algorithm for seeds computation. ACM Trans. Algorithms, 16(2), April 2020. doi:10.1145/3386369.
- [16] Tomasz Kociumaka, Gonzalo Navarro, and Francisco Olivares. Near-optimal search time in $\delta$-optimal space. In LATIN 2022: Theoretical Informatics, pages 88–103, 2022. doi:10.1007/978-3-031-20624-5_6.
- [17] Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Toward a definitive compressibility measure for repetitive sequences. IEEE Transactions on Information Theory, 69(4):2074–2092, 2023. doi:10.1109/TIT.2022.3224382.
- [18] Roman Kolpakov and Gregory Kucherov. Finding maximal repetitions in a word in linear time. In 40th Annual Symposium on Foundations of Computer Science (Cat. No.99CB37039), pages 596–604, 1999. doi:10.1109/SFFCS.1999.814634.
- [19] S. Rao Kosaraju. Pattern matching in compressed texts. In Foundations of Software Technology and Theoretical Computer Science, pages 349–362, 1995. doi:10.1007/3-540-60692-0_60.
- [20] Shuo Li, Jakub Pachocki, and Jakub Radoszewski. A note on the maximum number of $k$-powers in a finite word. The Electronic Journal of Combinatorics, 31(3), 2024. doi:10.37236/11270.
- [21] Markus Lohrey. Algorithmics on SLP-compressed strings: A survey. Groups - Complexity - Cryptology, 4(2):241–299, 2012. doi:10.1515/gcc-2012-0016.
- [22] Michael G. Main and Richard J. Lorentz. An algorithm for finding all repetitions in a string. J. Algorithms, 5(3):422–432, 1984. doi:10.1016/0196-6774(84)90021-X.
- [23] Gonzalo Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88, 2001. doi:10.1145/375360.375365.
- [24] Gonzalo Navarro, Francisco Olivares, and Cristian Urbina. Balancing run-length straight-line programs. In String Processing and Information Retrieval, pages 117–131, 2022. doi:10.1007/978-3-031-20643-6_9.
- [25] Gonzalo Navarro and Cristian Urbina. Iterated straight-line programs. In LATIN 2024: Theoretical Informatics, pages 66–80, 2024. doi:10.1007/978-3-031-55598-5_5.
- [26] Takaaki Nishimoto, Tomohiro I, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Fully Dynamic Data Structure for LCE Queries in Compressed Space. In 41st International Symposium on Mathematical Foundations of Computer Science (MFCS 2016), volume 58 of Leibniz International Proceedings in Informatics (LIPIcs), pages 72:1–72:14, 2016. doi:10.4230/LIPIcs.MFCS.2016.72.
- [27] Jakub Radoszewski. Linear Time Construction of Cover Suffix Tree and Applications. In 31st Annual European Symposium on Algorithms (ESA 2023), volume 274 of Leibniz International Proceedings in Informatics (LIPIcs), pages 89:1–89:17, 2023. doi:10.4230/LIPIcs.ESA.2023.89.
- [28] Wojciech Rytter. Grammar compression, LZ-encodings, and string algorithms with implicit input. In Automata, Languages and Programming (ICALP 2004), pages 15–27, 2004. doi:10.1007/978-3-540-27836-8_5.
- [29] Masayuki Takeda and Ayumi Shinohara. Pattern Matching on Compressed Text, in Encyclopedia of Algorithms, pages 1538–1542. Springer New York, 2016. doi:10.1007/978-1-4939-2864-4_81.
- [30] Esko Ukkonen. On-line construction of suffix trees. Algorithmica, 14:249–260, 1995. doi:10.1007/BF01206331.
- [31] Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory (SWAT 1973), pages 1–11, 1973. doi:10.1109/SWAT.1973.13.
- [32] Takanori Yamamoto, Hideo Bannai, Shunsuke Inenaga, and Masayuki Takeda. Faster subsequence and don’t-care pattern matching on compressed texts. In Combinatorial Pattern Matching, pages 309–322, 2011. doi:10.1007/978-3-642-21458-5_27.