
Pattern Matching on Run-Length Grammar-Compressed Strings in Linear Time

Yuto Iguchi, Graduate School of Information Sciences, Tohoku University, Sendai, Japan; Ryo Yoshinaka, Graduate School of Information Sciences, Tohoku University, Sendai, Japan; Ayumi Shinohara, Graduate School of Information Sciences, Tohoku University, Sendai, Japan
Abstract

Run-length straight-line programs (RLSLPs) are a technique for grammar-based compression that allows any string to be represented in space optimal with respect to δ, the substring complexity of the string. We address the compressed pattern matching problem for RLSLPs: given a compressed text in RLSLP format and an uncompressed pattern, determine whether the pattern appears in the text. This paper proposes an algorithm that solves this problem in linear time with respect to the size of the grammar and the length of the pattern.

Keywords and phrases:
pattern matching, run-length straight-line programs, compression, suffix tree
Funding:
Ryo Yoshinaka: JSPS KAKENHI 18K11150, 20H05703, 23K11325, 24H00697, 24K14827.
Ayumi Shinohara: JSPS KAKENHI 21K11745.
Copyright and License:
© Yuto Iguchi, Ryo Yoshinaka, and Ayumi Shinohara; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Design and analysis of algorithms
Acknowledgements:
We are grateful to the anonymous reviewers for their thorough review and helpful recommendations.
Editors:
Paola Bonizzoni and Veli Mäkinen

1 Introduction

Pattern matching is the problem of determining whether a given pattern P occurs in a given text T. It is one of the most fundamental problems in computer science, with applications in various fields such as text processing, signal processing, database searching, and bioinformatics. There are many algorithms for solving the pattern matching problem [5, 23].

Nowadays, datasets are often stored in a compressed form, and it is impractical to decompress the data each time to find a pattern in it. Thus, pattern matching on compressed data without decompression has attracted much attention. The problem of compressed pattern matching [29] is, given a pattern P and a compressed representation c(T) of a text T, to determine whether P occurs in T without decompressing it. The complexity of matching algorithms is mainly measured with respect to m=|P| and g=|c(T)|, rather than n=|T|.

Since the pioneering work of Amir and Benson [1], many compressed pattern matching algorithms have been proposed for various compression methods, depending on the specific properties of the methods [2, 19, 4]. Among them, Gawrychowski [8] showed an algorithm for Lempel-Ziv compression (commonly known as LZ77) that runs in O(g log(n/g) + m) time. Gawrychowski [9] also showed an O(g + m)-time algorithm for Lempel-Ziv-Welch (LZW) compression. Recently, as a landmark, Ganardi and Gawrychowski [6] successfully developed an algorithm for straight-line programs (SLPs) that runs in O(g + m) time. An SLP is a context-free grammar that generates exactly one string, and SLPs are extensively employed to generalize various grammar-based compression techniques [12, 28, 32, 21, 10, 3].

The complexity of pattern matching on compressed data is inherently linked to the compressibility of the data itself, as the size g is influenced by the specific compression algorithm employed. The substring complexity δ, due to Kociumaka et al. [17], serves as a measure of the compressibility of highly repetitive strings, enabling the evaluation of the compression performance of various methods. They showed that LZ77 can represent any text using O(δ log(n/δ)) space, which is asymptotically optimal with respect to δ. In contrast, an SLP requires O(δ log²(n/δ)) space. LZ77 achieves high compression ratios compared to other methods and is widely used in practical applications such as zip. However, processing data directly in its LZ77-compressed form is challenging. Even accessing the i-th character of the text is difficult, and solving compressed pattern matching in linear time is not feasible [8]. The difficulty of handling LZ77-compressed data is evident from the fact that the matching algorithm shown in [8] begins by converting the LZ77-compressed data into an SLP.

Nishimoto et al. [26] enhanced SLPs with “run-length rules” of the form A → B^k to handle repetitions more efficiently. Run-length straight-line programs (RLSLPs), like LZ77, can represent any text using O(δ log(n/δ)) space. Furthermore, RLSLPs preserve the simplicity and usability characteristic of grammar-based compression algorithms. For example, in RLSLPs, a substring of length l at any position in the text can be accessed in O(l + log n) time [17], and longest common extension (LCE) queries can be answered in O(log n) time [13]. Additionally, RLSLPs are used in applications such as constructing suffix arrays [13] and indexes [16] within O(δ log(n/δ)) space. Despite their various applications, no linear-time matching algorithm for RLSLPs has been proposed.

In this paper, we propose a linear-time compressed pattern matching algorithm for strings compressed by RLSLPs, as stated in the following theorem.

Theorem 1.

Given a pattern P of length m, and an RLSLP 𝒢 of size g, we can decide whether P occurs in the text described by 𝒢 in time O(g+m).

In the linear-time algorithm for compressed pattern matching on SLPs [6], it is crucial to represent substrings of the text where the pattern may occur as a concatenation of a constant number of substrings of the pattern. We adapt this idea to our algorithm for RLSLPs. The challenge lies in representing substrings of the text as concatenations of substrings of the pattern even in the presence of rules of the form A → B^k (k ≥ 3), which do not exist in SLPs. To address this, handling repetitions of substrings within the pattern is essential. A cover suffix tree [27] is a data structure that extends a suffix tree by adding nodes corresponding to substrings of a string whose squares are substrings of the string. However, for our problem, handling only squares, i.e., substrings that repeat twice, is insufficient. Instead, we need to determine the exact number of times a given substring repeats in a string. We address this issue by introducing a new data structure, an extension of the cover suffix tree, that can handle all repetitions within a string.

Throughout the paper, we assume the standard word RAM model, which operates on 𝗐-bit words, where 𝗐 ≥ log n and 𝗐 ≥ log m, with the standard arithmetic (excluding integer division) and bitwise operations.

2 Preliminaries

Let Σ be a finite alphabet. A string of length n is a finite sequence of n symbols in Σ. For a string s = a1⋯an, we write s[i] = ai for the i-th character of s. The length of s is denoted by |s|. A string of length 0 is called the empty string and is denoted by ε. The concatenation of two strings s and t is denoted by s·t or st. For a string s, we define s^1 = s and s^k = s·s^{k−1} for any integer k > 1. A string of the form s^k for a positive integer k ≥ 1 is called a repetition of s, and s² in particular is referred to as the square of s. A string s is primitive if s cannot be represented as u^k for any string u and integer k > 1. A string u is a root of u^k (k ≥ 1), and if u is primitive, u is the primitive root of u^k. The substring of a string s starting at position i and ending at position j is denoted by s[i..j]. Especially, a substring s[1..j] is called a prefix, and a substring s[i..|s|] is called a suffix of s. Proper prefixes and suffixes of s are those different from s. The suffix s[i..|s|] is also denoted by s[i..]. If i > j, then s[i..j] is the empty string. Let 𝑆𝑢𝑏𝑠𝑡𝑟(s), 𝑃𝑟𝑒𝑓(s), and 𝑆𝑢𝑓𝑓(s) be the sets of all substrings, prefixes, and suffixes of s, respectively. We say that a string u occurs in a string s at position (i, j) if u = s[i..j]. For a string s and an integer k (0 ≤ k < |s|), the k-th rotation of s is 𝑟𝑜𝑡(s, k) = s[k+1..|s|] s[1..k]. A period of a string u is a positive integer p such that u[i] = u[i+p] for all i with 1 ≤ i ≤ |u| − p. A substring u[i..j] is a run in u if for its smallest period p, (1) 2p ≤ j − i + 1, (2) i = 1 or u[i−1] ≠ u[i−1+p], and (3) j = |u| or u[j+1] ≠ u[j+1−p]. A run is specified by the pair of its position (i, j) and smallest period p. If u^k ∈ 𝑆𝑢𝑏𝑠𝑡𝑟(s) and u^{k+1} ∉ 𝑆𝑢𝑏𝑠𝑡𝑟(s), then u^k and k are called the maximum repetition and the maximum repetition count of u in s, respectively.

Example 2.

For P = aabcbcbcbabcbc, all runs in P are P[1..2] = aa, P[3..9] = bcbcbcb, and P[11..14] = bcbc, represented by ((1,2),1), ((3,9),2), and ((11,14),2), respectively. On the other hand, all maximum repetitions v^k with maximum repetition count k ≥ 2 of primitive substrings of P are P[1..2] = aa, P[3..8] = bcbcbc, and P[4..9] = cbcbcb.
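The runs and maximum repetitions of Example 2 can be checked directly against the definitions. The following brute-force sketch (the helper names `runs` and `max_rep_count` are ours, not the paper's) is quadratic and intended only as a sanity check, not as part of the algorithm.

```python
def runs(s):
    """All runs ((i, j), p) of s, as 1-indexed inclusive positions with smallest period p."""
    n, found = len(s), set()
    for i in range(n):
        for j in range(i + 1, n):
            sub = s[i:j + 1]
            L = len(sub)
            # smallest period of sub
            p = next(q for q in range(1, L + 1)
                     if all(sub[t] == sub[t + q] for t in range(L - q)))
            if 2 * p <= L:                                        # condition (1)
                if (i == 0 or s[i - 1] != s[i - 1 + p]) and \
                   (j == n - 1 or s[j + 1] != s[j + 1 - p]):      # conditions (2), (3)
                    found.add(((i + 1, j + 1), p))
    return found

def max_rep_count(s, u):
    """Maximum repetition count of u in s."""
    k = 0
    while u * (k + 1) in s:
        k += 1
    return k

P = "aabcbcbcbabcbc"
print(sorted(runs(P)))         # [((1, 2), 1), ((3, 9), 2), ((11, 14), 2)]
print(max_rep_count(P, "bc"))  # 3
```

The three runs printed are exactly those of Example 2, and the maximum repetition bcbcbc of bc has count 3.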

A straight-line program (SLP) is a context-free grammar describing exactly one string. Without loss of generality, we assume the grammar is in Chomsky normal form, i.e., each rule is of the form A → BC or A → a, where A, B, C are nonterminals and a is a terminal. For any rule A → BC, A does not appear in the derivation of B or C. A run-length straight-line program (RLSLP) [26] is an extended SLP which can also have rules of the form A → B^k where k ≥ 3. We call rules of the forms A → BC and A → B^k binary rules and run-length rules, respectively. The terminal string derived from a nonterminal A is denoted by A~.

Example 3.

Consider the following run-length straight-line program (RLSLP):

S → AB,  A → C^4,  B → a,  C → DE,  D → b,  E → c.

Here, S is the start symbol. We have E~ = c, D~ = b, C~ = D~E~ = bc, B~ = a, A~ = C~^4 = bcbcbcbc, and S~ = A~B~ = bcbcbcbca.
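The derivation in Example 3 can be replayed with a small evaluator. The dictionary encoding of the rules below is our own illustration of an RLSLP, not a format defined in the paper: a terminal rule maps to a string, a binary rule to a pair of nonterminals, and a run-length rule to a pair (B, k) with an integer k.

```python
def expand(rules, sym):
    """Return the terminal string derived from nonterminal sym."""
    rhs = rules[sym]
    if isinstance(rhs, str):                 # A -> a
        return rhs
    if isinstance(rhs[1], int):              # A -> B^k (run-length rule)
        return expand(rules, rhs[0]) * rhs[1]
    return expand(rules, rhs[0]) + expand(rules, rhs[1])  # A -> BC (binary rule)

# The RLSLP of Example 3
rules = {"S": ("A", "B"), "A": ("C", 4), "B": "a", "C": ("D", "E"), "D": "b", "E": "c"}
print(expand(rules, "S"))  # bcbcbcbca
```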

The size of an SLP or an RLSLP is the number of nonterminals in the grammar. Any SLP or RLSLP of size g describing a string of length n can be converted in O(g) time into an equivalent SLP or RLSLP of size O(g) whose derivation tree has height O(log n) [7, 24].

3 Previous Work and Challenges

This section gives an overview of Ganardi and Gawrychowski's [6] compressed pattern matching algorithm on SLPs and the challenges in extending it to RLSLPs.

Consider an SLP 𝒢 of size g for a text T of length n and a pattern P of length m ≥ 2. For each nonterminal A of 𝒢, let 𝗉𝗋𝖾𝖿𝗂𝗑P(A) be the longest element of 𝑃𝑟𝑒𝑓(A~) ∩ 𝑆𝑢𝑓𝑓(P) and 𝗌𝗎𝖿𝖿𝗂𝗑P(A) be the longest element of 𝑆𝑢𝑓𝑓(A~) ∩ 𝑃𝑟𝑒𝑓(P).

Example 4.

For a pattern P=ababaa and a nonterminal A with A~=abaaabaabababa, we have 𝗉𝗋𝖾𝖿𝗂𝗑P(A)=abaa and 𝗌𝗎𝖿𝖿𝗂𝗑P(A)=ababa.
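From these definitions, the two values can be computed by brute force on small examples; the function names below are ours, and A stands for the derived string A~. This quadratic computation is for illustration only: as noted next, no linear-time algorithm is known that computes these values exactly for all nonterminals.

```python
def prefix_P(P, A):
    """Longest prefix of A that is also a suffix of P."""
    return max((A[:k] for k in range(len(A) + 1) if P.endswith(A[:k])), key=len)

def suffix_P(P, A):
    """Longest suffix of A that is also a prefix of P."""
    return max((A[len(A) - k:] for k in range(len(A) + 1)
                if P.startswith(A[len(A) - k:])), key=len)

P, A = "ababaa", "abaaabaabababa"
print(prefix_P(P, A), suffix_P(P, A))  # abaa ababa
```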

We can verify that P occurs in T if and only if there exists a rule A → BC such that P occurs in 𝗌𝗎𝖿𝖿𝗂𝗑P(B) 𝗉𝗋𝖾𝖿𝗂𝗑P(C). This fact has often been used to solve compressed pattern matching problems efficiently [4, 14, 9, 8], where those strings are represented and processed as positions on P. However, no linear-time algorithm is known that computes 𝗉𝗋𝖾𝖿𝗂𝗑P(A) and 𝗌𝗎𝖿𝖿𝗂𝗑P(A) for all nonterminals A in an arbitrary SLP. To achieve a linear-time solution for SLP-compressed pattern matching, Ganardi and Gawrychowski [6] invented a brilliant idea: computing approximations of them, which we call PSI-information (Prefix-Suffix-Infix information) in this paper. A PSI-information is a tuple consisting of either four substrings or a single substring of P.

Definition 5.

The set Ψ(A) of PSI-information for a nonterminal A is defined as follows. If A~ is a substring of P, i.e., A~ ∈ 𝑆𝑢𝑏𝑠𝑡𝑟(P),

Ψ(A) = {(A~)}.

Otherwise,

Ψ(A) = {(x, y, u, v) ∈ 𝑆𝑢𝑏𝑠𝑡𝑟(P)^4 ∣ 𝗉𝗋𝖾𝖿𝗂𝗑P(A) ∈ 𝑃𝑟𝑒𝑓(xy), xy ∈ 𝑃𝑟𝑒𝑓(A~), 𝗌𝗎𝖿𝖿𝗂𝗑P(A) ∈ 𝑆𝑢𝑓𝑓(uv), uv ∈ 𝑆𝑢𝑓𝑓(A~)}.

They showed the following algorithm that computes some ψA ∈ Ψ(A) for each rule A → BC recursively, assuming that ψB ∈ Ψ(B) and ψC ∈ Ψ(C) are already computed. Depending on the types of ψB and ψC, there are four cases.

Case 1

ψB = (B~) and ψC = (C~):
Let ψA := (B~C~) if B~C~ ∈ 𝑆𝑢𝑏𝑠𝑡𝑟(P). Otherwise, let ψA := (B~, C~, B~, C~).

Case 2

ψB = (x, y, u, v) and ψC = (C~):
Let ψA := (x, y, u, vC~) if vC~ ∈ 𝑆𝑢𝑏𝑠𝑡𝑟(P). Otherwise, let ψA := (x, y, v, C~).

Case 3

ψB = (B~) and ψC = (x, y, u, v):
Let ψA := (B~x, y, u, v) if B~x ∈ 𝑆𝑢𝑏𝑠𝑡𝑟(P). Otherwise, let ψA := (B~, x, u, v).

Case 4

ψB = (xB, yB, uB, vB) and ψC = (xC, yC, uC, vC):
Let ψA := (xB, yB, uC, vC).

In every case, the computation is reduced to at most one call of the substring concatenation query (scq), defined below.

Substring concatenation query

(scq): Given two substrings u and v of P represented by their positions, return a position of uv in P if uv ∈ 𝑆𝑢𝑏𝑠𝑡𝑟(P); otherwise, return “No”.
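The four cases can be sketched as follows, with a naive scq in place of the batched structure of Lemma 6 below; for readability this sketch carries explicit strings, whereas the actual algorithm manipulates positions on P. The function names are ours.

```python
def scq(P, u, v):
    """Naive substring concatenation query: a 1-indexed position of uv in P, or None."""
    i = P.find(u + v)
    return (i + 1, i + len(u) + len(v)) if i >= 0 else None

def combine(P, psi_B, psi_C):
    """Combine PSI-information for a binary rule A -> BC (Cases 1-4).
    A 1-tuple (s,) means the derived string s is itself a substring of P."""
    if len(psi_B) == 1 and len(psi_C) == 1:            # Case 1
        b, c = psi_B[0], psi_C[0]
        return (b + c,) if scq(P, b, c) else (b, c, b, c)
    if len(psi_C) == 1:                                # Case 2
        x, y, u, v = psi_B
        c = psi_C[0]
        return (x, y, u, v + c) if scq(P, v, c) else (x, y, v, c)
    if len(psi_B) == 1:                                # Case 3
        b = psi_B[0]
        x, y, u, v = psi_C
        return (b + x, y, u, v) if scq(P, b, x) else (b, x, u, v)
    return (psi_B[0], psi_B[1], psi_C[2], psi_C[3])    # Case 4
```

For example, with P = ababaa, combining ψB = (ab) and ψC = (ab) yields (abab) since abab occurs in P, while combining two four-tuples falls into Case 4 and needs no query at all.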

For scqs, it is not known whether we can answer any single query in O(1) time with O(m)-time preprocessing. However, they showed the following alternative solution for batched queries, which is a key to the linear-time pattern matching algorithm.

Lemma 6 ([6], Theorem 1.3).

We can preprocess a string P of length m in O(m) time so that we can answer q scqs in O(q + m/𝗐) time.

Lemma 7 ([6]).

For q rules Ai → BiCi (1 ≤ i ≤ q), if the PSI-information for Bi and Ci has already been computed, we can determine whether P occurs in 𝗌𝗎𝖿𝖿𝗂𝗑P(Bi) 𝗉𝗋𝖾𝖿𝗂𝗑P(Ci) for all i in O(q + m) total time.

Let Lc be the set of nonterminals whose derivation tree is of height c. PSI-information for nonterminals is computed in the order of L1, L2, …, in a batched style. From Lemma 6, PSI-information for all A ∈ Lc can be computed in O(|Lc| + m/𝗐) time for each c ≥ 1.

Lemma 8 ([6]).

We can preprocess P in O(m) time so that given q rules Ai → BiCi (1 ≤ i ≤ q), where the PSI-information for Bi and Ci has already been computed, we can compute the PSI-information for all Ai's in O(q + m/𝗐) total time.

Another key element supporting the linear-time algorithm is balancing SLPs so that the height of the derivation tree is O(log n), due to Ganardi et al. [7]. With this technique, the total time complexity of the algorithm is bounded by O(g + m), as we will trace in the proof of Lemma 27.

We now turn our attention to extending the algorithm to RLSLPs. Concerning the height of the derivation trees, Navarro et al. [24] showed that any RLSLP can be balanced in linear time without increasing its asymptotic size. Thus, the crucial issue in realizing a linear-time RLSLP-compressed pattern matching algorithm is to establish the counterparts of Lemmas 7 and 8 for the run-length rules. Those will appear in Section 5 as Lemmas 28 and 26. Of course, we do not take the naive and computationally expensive solution that breaks down a run-length rule A → B^k into O(log k) binary rules and applies Ganardi and Gawrychowski's technique. The following two types of queries play important roles in achieving our goals, just as Lemmas 7 and 8 are based on scqs.

Maximum repetition query

(mrq): Given a nonempty substring v of P represented by a position, answer one of the positions of its maximum repetition v^k in P and the maximum repetition count k.

Primitive root query

(prq): Given a nonempty substring of P represented by a position, answer one of the positions of its primitive root.

Example 9.

Let P = aabcbcbcbabc be a pattern. Given the position (11,12) of the substring P[11..12] = bc, the answer to the mrq is the position (3,8) of the substring bcbcbc and the maximum repetition count 3. Notice that the maximum repetition does not necessarily include the queried position. Also, given the position (4,7) of the substring P[4..7] = cbcb, the answer to the prq is any of the positions (4,5), (6,7), (8,9) of the primitive root cb.
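Both query types are easy to answer by brute force, which suffices to replay Example 9; the helpers below are ours, and the point of Section 4 is to precompute these answers so that they can simply be read off a tree instead.

```python
def mrq(P, i, j):
    """Maximum repetition query for v = P[i..j] (1-indexed): a position of v^k, and k."""
    v = P[i - 1:j]
    k = 1
    while v * (k + 1) in P:
        k += 1
    s = P.find(v * k) + 1
    return (s, s + k * len(v) - 1), k

def prq(P, i, j):
    """Primitive root query: a position of the primitive root of P[i..j]."""
    v = P[i - 1:j]
    p = next(d for d in range(1, len(v) + 1)
             if len(v) % d == 0 and v[:d] * (len(v) // d) == v)
    s = P.find(v[:p]) + 1
    return (s, s + p - 1)

P = "aabcbcbcbabc"
print(mrq(P, 11, 12))  # ((3, 8), 3)
print(prq(P, 4, 7))    # (4, 5)
```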

If we can answer mrqs efficiently, we can efficiently compute the PSI-information of A from that of B for run-length rules A → B^k.

Lemma 10.

Consider A → B^k and ψB ∈ Ψ(B). If B~ ∉ 𝑆𝑢𝑏𝑠𝑡𝑟(P), then ψB ∈ Ψ(A). Otherwise, for the maximum repetition count l of B~ in P, we have

(B~^k) ∈ Ψ(A) if l ≥ k,  and  (B~^l, B~, B~^l, B~) ∈ Ψ(A) if l < k.
Proof.

In the case where B~ ∉ 𝑆𝑢𝑏𝑠𝑡𝑟(P), by repeatedly applying Case 4 of the binary rule case, we can conclude ψB ∈ Ψ(A). Suppose B~ ∈ 𝑆𝑢𝑏𝑠𝑡𝑟(P) and let l be the maximum repetition count of B~ in P. If k ≤ l, clearly A~ = B~^k occurs in P. Otherwise, 𝗉𝗋𝖾𝖿𝗂𝗑P(A) must be of the form B~^{l′}w for some l′ ≤ l and a proper prefix w of B~. Thus, 𝗉𝗋𝖾𝖿𝗂𝗑P(A) ∈ 𝑃𝑟𝑒𝑓(B~^{l+1}) and B~^{l+1} ∈ 𝑃𝑟𝑒𝑓(A~) by l + 1 ≤ k. Together with the symmetric argument on 𝗌𝗎𝖿𝖿𝗂𝗑P(A), we conclude (B~^l, B~, B~^l, B~) ∈ Ψ(A).

Consider deciding whether P occurs in A~. If P ∈ 𝑆𝑢𝑏𝑠𝑡𝑟(A~), then either P ∈ 𝑆𝑢𝑏𝑠𝑡𝑟(B~B~) or B~ ∈ 𝑆𝑢𝑏𝑠𝑡𝑟(P). The former case can be handled in the same way as the binary rule case. For handling the latter case, we use prqs and mrqs based on Lemma 11. Note that all the periods of P can be computed in O(m) time using the so-called Z-algorithm [11, 22].

Lemma 11.

Consider A → B^k with B~ ∈ 𝑆𝑢𝑏𝑠𝑡𝑟(P). Let v be the primitive root of B~, l the maximum repetition count of v in P, and P[i..j] = v^l. Then, P ∈ 𝑆𝑢𝑏𝑠𝑡𝑟(A~) if and only if |v| is a period of P and (l + a + b)|v| ≤ |A~|, where a = 0 if i = 1 and a = 1 otherwise, and b = 0 if j = m and b = 1 otherwise.

Proof.

Let p be the integer such that B~ = v^p. For P to occur in A~ = B~^k = v^{pk}, P must have period |v|. In this case, since B~ occurs in P, there are a proper suffix v′ and a prefix v″ of v such that P = v′v^l v″. If v′ = v″ = ε, then P occurs in A~ just if l ≤ pk. Otherwise, A~ must have one more block of v to cover each of v′ and/or v″.
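On a small instance, the condition of Lemma 11 can be cross-checked against a direct occurrence test. The sketch below takes B~ = v itself (so p = 1 and A~ = v^k); the helper name `lemma11_condition` is ours.

```python
def lemma11_condition(P, v, A_len):
    """Condition of Lemma 11 for A~ a power of the primitive string v with |A~| = A_len."""
    if any(P[t] != P[t + len(v)] for t in range(len(P) - len(v))):
        return False                      # |v| is not a period of P
    l = 1
    while v * (l + 1) in P:
        l += 1                            # maximum repetition count of v in P
    i = P.find(v * l) + 1
    j = i + l * len(v) - 1                # P[i..j] = v^l (1-indexed)
    a = 0 if i == 1 else 1
    b = 0 if j == len(P) else 1
    return (l + a + b) * len(v) <= A_len

# Cross-check against brute force: P occurs in v^k iff the condition holds.
P, v = "cabcabca", "abc"
for k in range(2, 7):
    assert lemma11_condition(P, v, k * len(v)) == (P in v * k)
```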

Our proposed algorithm for RLSLP-compressed pattern matching precomputes the answers to mrqs and prqs in an enhanced suffix tree. The next section describes the details of the data structure and its construction. Throughout this paper, we fix a pattern P of length m ≥ 2 and a balanced RLSLP 𝒢 of size g that represents a text T of length n.

4 Repetition-informed Suffix Tree

The longest common prefix 𝑙𝑐𝑝(x, y) of strings x and y is x[1..k] for k = max{k′ ∣ x[1..k′] = y[1..k′]}. The suffix tree of P is defined as follows.

Definition 12 ([31]).

Let S = 𝑆𝑢𝑏𝑠𝑡𝑟(P$), where $ is a special symbol that does not occur in P. The suffix tree ST(P) of P consists of explicit nodes V and edges E, where

  • V = {𝑙𝑐𝑝(x, y) ∣ x, y ∈ 𝑆𝑢𝑓𝑓(P$)},

  • E = {(x, y, xy) ∈ V × S × V ∣ xy[1..k] ∉ V for 1 ≤ k < |y|}.

Note that whereas the mathematical definition above is given with strings, node names and edge labels are represented as occurrence positions of those in P. This paper often uses strings for readability where the actual computation is done on positions, unless confusion arises. We assume one can access in constant time the node P[i..]$ for any i ≤ m and the parent of an arbitrary node. The suffix tree can be constructed in O(m) time [30]. Every substring of P that is not an explicit node of the suffix tree is called an implicit node. An implicit node u ∈ S ∖ V is a conceptual node that exists between two explicit nodes and is specified as (uv, |v|) ∈ V × ℕ, where uv is the shortest extension of u in V. A node is either an explicit node or an implicit node.

We extend suffix trees by adding more explicit nodes.

Definition 13.

Let S = 𝑆𝑢𝑏𝑠𝑡𝑟(P$) and let V0 be the set of all explicit nodes in ST(P). The extended suffix tree 𝐸𝑆𝑇(P, U) = (V, E) of P with a substring set U ⊆ S is defined by

  • V = V0 ∪ U,

  • E = {(x, y, xy) ∈ V × S × V ∣ xy[1..k] ∉ V for 1 ≤ k < |y|}.

We use an extended suffix tree to store the answers to the mrqs and prqs on the respective nodes. We promote implicit nodes v to explicit nodes only if the answers to the queries on v are nontrivial, in the sense that either the primitive root of v is not v itself, or the maximum repetition of v is not v itself. In the trivial case, the answers to mrqs and prqs can be the queried positions themselves, with maximum repetition count 1. Let Γ ⊆ 𝑆𝑢𝑏𝑠𝑡𝑟(P) be the set of nonempty strings for which the mrq or the prq is nontrivial:

Γ = {v ∈ 𝑆𝑢𝑏𝑠𝑡𝑟(P) ∖ {ε} ∣ v² ∈ 𝑆𝑢𝑏𝑠𝑡𝑟(P), or v = u^k for some u ∈ Σ⁺ and k ≥ 2}.
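On small examples, Γ can be computed by brute force straight from this definition (the helper `gamma` is ours); the assertion checks the bound |Γ| < 2m established below as Lemma 19, here on the pattern of Figure 1.

```python
def gamma(P):
    """Nonempty substrings v of P with v*v in Substr(P), or v = u^k for some k >= 2."""
    subs = {P[i:j] for i in range(len(P)) for j in range(i + 1, len(P) + 1)}
    def is_proper_power(v):
        return any(len(v) % d == 0 and v[:d] * (len(v) // d) == v
                   for d in range(1, len(v)))
    return {v for v in subs if v + v in subs or is_proper_power(v)}

P = "aabcbcbcbc"
G = gamma(P)
print(sorted(G, key=lambda v: (len(v), v)))
# ['a', 'aa', 'bc', 'cb', 'bcbc', 'cbcb', 'bcbcbc', 'cbcbcb', 'bcbcbcbc']
assert len(G) < 2 * len(P)  # the bound of Lemma 19
```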

The goal of this section is to construct 𝑅𝐸𝑆𝑇(P,Γ), defined as follows.

Definition 14.

An extended suffix tree 𝐸𝑆𝑇(P, U) is said to be repetition-informed, and denoted by 𝑅𝐸𝑆𝑇(P, U), if each node x ∈ U retains MaxRep(x), RepCount(x), and PrimRoot(x), where

  • MaxRep(x): one of the positions of the maximum repetition of x in P,

  • RepCount(x): the integer k such that xk is the maximum repetition of x in P,

  • PrimRoot(x): one of the positions of the primitive root of x in P.

That is, the answer to the mrq on x is MaxRep(x) and RepCount(x) and the one to the prq on x is PrimRoot(x).

The suffix tree ST(P) and the repetition-informed suffix tree 𝑅𝐸𝑆𝑇(P, Γ) for P = aabcbcbcbc are shown in Figure 1. Although the values MaxRep(x) and PrimRoot(x) are positions on P, for intuitive understandability, Figure 1 presents them as links from x to the nodes corresponding to those positions. We remark that RepCount(x) can be computed from MaxRep(x) and |x|. However, under the assumptions of our computation model, integer division cannot be done in constant time. So, we compute these values in the preprocessing step and let 𝑅𝐸𝑆𝑇(P, Γ) remember them.

(a) The suffix tree.
(b) The repetition-informed suffix tree.
Figure 1: The suffix tree and the repetition-informed suffix tree 𝑅𝐸𝑆𝑇(P, Γ) of P = aabcbcbcbc. The gray-highlighted nodes are in Γ. Each node x ∈ Γ has MaxRep(x) and PrimRoot(x) (the positions of the substrings corresponding to the nodes reached by following the red dashed and blue dotted arrows from x, respectively), and RepCount(x) (shown in the respective nodes).

4.1 Node identification queries

Recall that instances of the mrqs and prqs are positions on P, while the answers are stored in the nodes of the repetition-informed suffix tree. This subsection shows that we can efficiently convert positions on P to the corresponding nodes of an extended suffix tree (Lemma 17).

Node identification query

(niq): Given a substring of P represented by a position, return the corresponding node on the extended suffix tree of P.

We answer niqs using waqs, defined below. In the weighted ancestor problem, we are given a node-weighted tree where the weight of any node is a nonnegative integer greater than that of its parent.

Weighted ancestor query

(waq): Given a node u and a nonnegative integer k, return the furthest ancestor of u with weight at least k.

Kociumaka et al. [15] showed that a batch of q waqs on a tree of size s can be answered in O(s+q) time, provided that the node weights are polynomially bounded in s. Their algorithm first sorts the queries by weight and then processes them in the non-increasing order. This necessity of sorting is the reason why the weights must be polynomially bounded and the queries must be handled in a batch. In other words, as long as the queries are given in non-increasing order of weights, multiple waqs can be answered in an online manner with the same time complexity. It is more convenient for our discussion to present their result in this online computation version.

Lemma 15.

We can answer q waqs on a node-weighted tree of size s in O(q+s) total time in an online manner, provided that the queries are ordered by non-increasing weights.

Ganardi and Gawrychowski [6] improved upon the result by Kociumaka et al.

Lemma 16.

We can preprocess a node-weighted tree of size s in O(s) time so that q waqs can be answered in O(q + s/𝗐) total time.

Lemma 16 encompasses Lemma 15, but since the algorithm in [15] is simpler, Lemma 15 is used when it is sufficient.

A result similar to the following lemma is used in the algorithm presented in [6].

Lemma 17.

We can preprocess the extended suffix tree of size s in O(s) time so that we can answer q niqs in O(q + s/𝗐) time.

Proof.

Define the weight of each explicit node to be the length of its corresponding string. Preprocessing the extended suffix tree for waqs takes O(s) time by Lemma 16. Given a substring P[i..j] as input for an niq, we use a waq with the leaf node P[i..]$ and the length |P[i..j]| = j − i + 1 to identify an explicit node x. If |x| = |P[i..j]| holds, x is the node P[i..j]. Otherwise, P[i..j] is an implicit node between x and its parent y, since |y| < |P[i..j]| < |x|. The implicit node P[i..j] is represented by (x, |x| − |P[i..j]|). It takes O(q + s/𝗐) time to process q waqs.

Lemma 17 is based on Lemma 16 and used in the matching algorithm in Section 5. A simplified version of Lemma 17 is obtained based on Lemma 15, which is sufficient for discussions in the rest of this section.

Lemma 18.

We can answer q niqs on the extended suffix tree of size s in O(q+s) time in an online manner, provided that the queries are ordered by non-increasing length.

4.2 Constructing repetition-informed suffix trees

Li et al. [20] showed that the number of distinct substrings of the form u^k with k ≥ 2 in any string of length m is less than m. We immediately obtain the following lemma.

Lemma 19.

|Γ|<2m holds.

This ensures that the size of 𝑅𝐸𝑆𝑇(P, Γ) is O(m) and that q niqs on 𝑅𝐸𝑆𝑇(P, Γ) can be answered in O(q + m/𝗐) time.

Define

Γ0 = {v ∈ Γ ∣ v is primitive and P has a run v^k v′ for some k ≥ 2 and v′ ∈ 𝑃𝑟𝑒𝑓(v)},
Γ1 = {v ∈ Γ ∣ v is primitive}.

Obviously, Γ0 ⊆ Γ1 ⊆ Γ holds. Our construction consists of the following steps:

  1. constructing 𝐸𝑆𝑇(P, Γ0);

  2. constructing 𝑅𝐸𝑆𝑇(P, Γ1);

  3. constructing 𝑅𝐸𝑆𝑇(P, Γ).

(We construct the intermediate structure 𝑅𝐸𝑆𝑇(P, Γ1) for a clear illustration of the preprocessing algorithm. However, in practice, it suffices to identify the nodes of Γ1 and calculate the answers to the repetition queries in Step 2; one may embed those nodes into 𝐸𝑆𝑇(P, Γ0) together with the other nodes of Γ at once in Step 3.)

The following subsections explain the respective steps.

4.2.1 Construction of 𝑬𝑺𝑻(𝑷,𝚪𝟎)

We first enumerate the elements of Γ0, as positions, by Kolpakov and Kucherov's algorithm [18], which enumerates all the runs of a string in linear time, as their positions and smallest periods. Then, we identify the corresponding nodes in ST(P) by niqs and make them explicit. Conversion of implicit nodes into explicit nodes may appear very simple: when an implicit node v to convert is represented by (x, l), we introduce a new explicit node between the explicit node x and the parent y of x, replacing the edge between y and x by two edges, one between y and v and one between v and x. This can be done in constant time. Notice, however, that in our approach, the implicit nodes to be converted are all given simultaneously by niqs for efficiency. Converting one implicit node alters the tree structure, so the representations of some of the other implicit nodes may be affected. Suppose two implicit nodes v1 and v2 are represented by (x, l1) and (x, l2), respectively, with |v1| < |v2|, i.e., l1 > l2. If v2 is converted first, then the representation (x, l1) of v1 becomes invalid, since the explicit node immediately below v1 is now the new explicit node v2 rather than x; one must not embed a new explicit node for v1 as the parent of x. This potential disturbance can be avoided by converting v1 before v2.

Lemma 20.

Given a set U of q substrings of P represented by their positions, we can construct 𝐸𝑆𝑇(P,U) in O(q+m) time.

Proof.

We first construct ST(P) and sort the input substrings by their lengths. Since the lengths of the input substrings are at most m, one can sort them in O(q + m) time by bucket sort. We then identify the corresponding nodes using niqs in non-increasing order of length. The conversion of the identified implicit nodes is performed in the reverse order, which ensures that the representations of those nodes remain valid when they are converted. By Lemma 18, one can identify and convert the nodes in O(q + m) time.

Corollary 21.

One can construct 𝐸𝑆𝑇(P,Γ0) in O(m) time.

Proof.

One can enumerate the positions of Γ0 in P in linear time by Kolpakov and Kucherov's algorithm [18]. Then, applying Lemma 20 to those positions, we obtain 𝐸𝑆𝑇(P, Γ0). We note that the same substring x ∈ Γ0 may occur in different runs x^i x′ and x^j x″ in P, in which case we identify the same node twice or more. This is not a problem at all when constructing 𝐸𝑆𝑇(P, Γ0). For the succeeding procedure, among those runs, we pick a longest run, denoted by LongestRun(x), and retain an occurrence position of LongestRun(x) in the identified node.

4.2.2 Construction of 𝑹𝑬𝑺𝑻(𝑷,𝚪𝟏)

The goal of this subsection is to identify the nodes for all elements v of Γ1 on 𝐸𝑆𝑇(P,Γ0) and compute MaxRep(v), RepCount(v), and PrimRoot(v). We first focus on the node identification.

Identifying the nodes of 𝚪𝟏

Let LongestRun(x) = x^k x′, where x′ is a proper prefix of x. If k ≥ 3, this run contains the squares of all the rotations of x, i.e., 𝑟𝑜𝑡(x, i) ∈ Γ1 for 0 ≤ i ≤ |x| − 1. If k = 2, the run has squares of the form 𝑟𝑜𝑡(x, i)², i.e., 𝑟𝑜𝑡(x, i) ∈ Γ1, for 0 ≤ i ≤ |x′|. Therefore, for ℓx = min{|x| − 1, |LongestRun(x)| − 2|x|},

Γ1 = {𝑟𝑜𝑡(x, i) ∣ x ∈ Γ0 and 0 ≤ i ≤ ℓx}

holds. One can identify those nodes 𝑟𝑜𝑡(x, i) by niqs on the positions obtained by shifting x in LongestRun(x) by i; that is, if LongestRun(x) = P[sx..tx], we pose niqs on P[sx+i..sx+i+|x|−1] for all 0 ≤ i ≤ ℓx. However, a naive implementation of this idea may be redundant and inefficient: it is possible that v = 𝑟𝑜𝑡(x, i) = 𝑟𝑜𝑡(y, j) ∈ Γ1 for different x, y ∈ Γ0 and i, j ≥ 0, so the same node may be identified several times. In order to bound the number of niqs by |Γ1|, we maintain the found nodes in rotation lists consisting of rotation links. A rotation link is a link from 𝑟𝑜𝑡(x, i) to 𝑟𝑜𝑡(x, i+1) for x ∈ Γ0 and 0 ≤ i < ℓx. Therefore, identifying the nodes of Γ1 and creating the rotation lists are equivalent tasks. Examples of rotation lists are shown in Figure 2.

(a) P=abcabccabcabc.
(b) P=abacabacababb.
Figure 2: The rotation lists on 𝐸𝑆𝑇(P,Γ0), represented by red nodes and solid arrows.
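The niq positions posed for one run can be illustrated directly (the helper is ours): with ℓx = min{|x|−1, |LongestRun(x)|−2|x|}, every position P[sx+i..sx+i+|x|−1] with 0 ≤ i ≤ ℓx carries a rotation of x whose square occurs in P.

```python
def rotation_positions(sx, tx, xlen):
    """1-indexed positions of rot(x, i), 0 <= i <= l_x, inside the run P[sx..tx] = x^k x'."""
    lx = min(xlen - 1, (tx - sx + 1) - 2 * xlen)
    return [(sx + i, sx + i + xlen - 1) for i in range(lx + 1)]

# The run P[3..9] = bcbcbcb of P = aabcbcbcbabcbc, with primitive root x = bc:
P = "aabcbcbcbabcbc"
positions = rotation_positions(3, 9, 2)
print(positions)  # [(3, 4), (4, 5)]
for (i, j) in positions:
    assert P[i - 1:j] * 2 in P  # the square of each rotation occurs in P
```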

Suppose v = 𝑟𝑜𝑡(x, i) = 𝑟𝑜𝑡(y, j) ∈ Γ1 for different x, y ∈ Γ0, i ≤ ℓx, j ≤ ℓy, and i < j. If we process y after x, on the way of constructing the rotation list starting at y, we reach x = 𝑟𝑜𝑡(y, j−i). Then, without tracing the rotation links 𝑟𝑜𝑡(y, j−i+1), 𝑟𝑜𝑡(y, j−i+2), …, we can jump to the end 𝑟𝑜𝑡(x, ℓx) = 𝑟𝑜𝑡(y, j−i+ℓx) of the rotation list of x and continue growing the rotation list if ℓy > ℓx + j − i. If ℓy ≤ ℓx + j − i, this finishes the rotation list starting at y. This is the basic idea of using the rotation lists for efficiently finding the nodes of Γ1. See Figure 3.

Figure 3: Suppose P contains runs r0 = v0² a0a1a2, r2 = v2² a2a3a4a5a6a7a8a9, and r5 = v5² a5a6a7, where v0 = a0a1⋯a9 and vi = 𝑟𝑜𝑡(v0, i). The explicit nodes in 𝐸𝑆𝑇(P, Γ0) are drawn by solid circles, among which v0, v2, v5 are in Γ0, while the implicit nodes are shown by dotted circles. The runs r0, r2, and r5 should induce the rotation links shown by red, blue, and green arrows, respectively, but duplicated links are traversed only once, by skipping the computation of the dashed links. These links form a circular rotation list. If r0 were missing in P, the rotation list would be non-circular. If r2 were missing in P, we would have two disconnected rotation lists.
Algorithm 1 Creating rotation lists.

Algorithm 1 (CreateAllRotLists) computes all the rotation lists from the nodes of Γ0, in non-increasing order of the lengths of their corresponding strings. The function CreateRotList(x) creates the list consisting of the nodes 𝑟𝑜𝑡(x, 0), …, 𝑟𝑜𝑡(x, ℓx) in this order. Those nodes are created but not embedded into 𝐸𝑆𝑇(P, Γ0) (if they are not explicit) until we find all the nodes of Γ1. The subroutine GetRotLink(v) poses an niq on the position shifted by one from v to obtain the node 𝑟𝑜𝑡(v, 1). The rotation link is remembered in v as 𝑅𝑜𝑡𝐿𝑖𝑛𝑘(v), which will be used later. The value 𝑆𝑖𝑧𝑒(x) maintains the number of nodes in the rotation list counting from x ∈ Γ0, up to 𝑇𝑎𝑟𝑔𝑒𝑡𝑆𝑖𝑧𝑒(x) = ℓx + 1. 𝑆𝑖𝑧𝑒(x) is initialized to zero to denote that the node x has not been visited. We remember the end node of a rotation list starting at x as 𝑇𝑎𝑖𝑙(x) if the list is non-circular.

When extending the list from x ∈ Γ0, as long as we do not encounter another node in Γ0, we repeatedly call GetRotLink and obtain rotation links using niqs to extend the rotation list. At some point, we may reach another node y ∈ Γ0. In this case, the list from x shall be extended to the end of the list from y. If y has been processed earlier, i.e., 𝑆𝑖𝑧𝑒(y) ≠ 0, we jump to 𝑇𝑎𝑖𝑙(y). If not, we recursively call CreateRotList(y) to find the end of the list from y. In either case, we will reach 𝑇𝑎𝑖𝑙(y). At that point, 𝑆𝑖𝑧𝑒(x) may reach or exceed 𝑇𝑎𝑟𝑔𝑒𝑡𝑆𝑖𝑧𝑒(x), in which case the list from x is complete; otherwise, we continue extending the list.

It is possible that the nested recursive calls of CreateRotList circulate. In this case, we visit a node x ∈ Γ0 for the second time. This is detected when 𝑇𝑎𝑖𝑙(x) has not yet been determined (Line 22). In this case, the list is circular, and we discontinue extending it.

When the algorithm halts, each rotation list has just one explicit node x with 𝐻𝑒𝑎𝑑(x) = 𝖳. For that node x, the following holds:

  • if 𝑇𝑎𝑖𝑙(x) ≠ 𝗇𝗎𝗅𝗅, x is the head of a non-circular rotation list which ends at 𝑇𝑎𝑖𝑙(x);

  • if 𝑇𝑎𝑖𝑙(x) = 𝗇𝗎𝗅𝗅, x is in a circular rotation list.

Those values will be useful when computing MaxRep(v) and RepCount(v) for nodes v in each rotation list.

Lemma 22.

Given a position of LongestRun(x) for each x ∈ Γ0 and 𝐸𝑆𝑇(P,Γ0), one can compute all rotation lists in O(m) time.

Proof.

One can easily verify the invariant properties of the variables used in the algorithm stated above. We discuss the time complexity. For each node v ∈ Γ1, GetRotLink(v) (Line 17) is called at most once, namely from CreateRotList(x) where x is the node in Γ0 with v = 𝑟𝑜𝑡(x,i) for the smallest such i. Thus, the number of calls of GetRotLink is bounded by O(m) by Lemma 19. Each call of GetRotLink involves an niq. Since we process the elements of Γ0 in non-increasing order of length, those queries are posed in non-increasing order, and thus Lemma 18 applies. Hence, the total time used by the function GetRotLink is bounded by O(m). Therefore, the algorithm runs in time proportional to |Γ1| = O(m).

Computing answers to queries on 𝚪𝟏

We now compute MaxRep(v) and RepCount(v) for all v ∈ Γ1. The value of PrimRoot(v) coincides with the position of v ∈ Γ1 itself, which has already been computed when creating the rotation lists. Concerning RepCount(v), we remark that

∑_{v∈Γ1} RepCount(v) = |Γ| < 2m

by Lemma 19.

We compute the values by following each rotation list. During the iteration, we maintain (the length of) the longest pseudo-run with v, denoted by 𝐿𝑃𝑅(v), when visiting v ∈ Γ1. A pseudo-run with v is defined to be a substring of P of the form u = v^k v′ with k ≥ 2 and v′ ∈ 𝑃𝑟𝑒𝑓(v). The maximum repetition and the maximum repetition count of v are easily computed from the length of the longest pseudo-run with v. We can compute 𝐿𝑃𝑅(𝑟𝑜𝑡(v,1)) from 𝐿𝑃𝑅(v) based on the fact that

𝐿𝑃𝑅(𝑟𝑜𝑡(v,1)) is the longer of 𝐿𝑃𝑅(v)[2:] and LongestRun(𝑟𝑜𝑡(v,1)),

where 𝐿𝑃𝑅(u) and LongestRun(u) are assumed to be empty if they are not defined: 𝐿𝑃𝑅(u) is undefined when u ∉ Γ1, and LongestRun(u) is undefined when u ∉ Γ0.

Suppose a rotation list is not circular and its head node is v0 ∈ Γ0. In this case, there is no v ∈ Γ1 ∖ {v0} such that v0 = 𝑟𝑜𝑡(v,1). Thus, 𝐿𝑃𝑅(v0) = LongestRun(v0) holds. The maximum repetition of v0 is the prefix v0^k of 𝐿𝑃𝑅(v0) = v0^k v0′. So, RepCount(v0) and MaxRep(v0) can be computed. We proceed to the next node v1 = 𝑟𝑜𝑡(v0,1) = 𝑅𝑜𝑡𝐿𝑖𝑛𝑘(v0) in the rotation list. If v1 ∉ Γ0, the longest pseudo-run with v1 is 𝐿𝑃𝑅(v1) = 𝐿𝑃𝑅(v0)[2:]. If v1 ∈ Γ0, 𝐿𝑃𝑅(v1) is the longer of 𝐿𝑃𝑅(v0)[2:] and LongestRun(v1). In either case, RepCount(v1) and MaxRep(v1) can be computed from 𝐿𝑃𝑅(v1). We repeat this process.

If a rotation list is circular, we begin the process from a node x ∈ Γ0 with the largest LongestRun(x) in the list, which is not necessarily the head. This node is found by traversing the rotation list once. In this case, there exists v ∈ Γ1 in the list such that x = 𝑟𝑜𝑡(v,1). However, by the choice of x, it is guaranteed that LongestRun(x) is longer than 𝐿𝑃𝑅(v)[2:]. Thus, 𝐿𝑃𝑅(x) = LongestRun(x) holds. We proceed to the next node 𝑟𝑜𝑡(x,1) = 𝑅𝑜𝑡𝐿𝑖𝑛𝑘(x) in the same manner as in the non-circular case.
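The 𝐿𝑃𝑅 propagation along a list can be sketched in a few lines. This is a toy model under the same simplifications as before: nodes are plain strings, `longest_run` is a hypothetical precomputed map defined only on the Γ0 nodes, and the list is given head first (for a circular list, re-rooted at the node with the largest LongestRun).

```python
def repetition_info_along_list(rot_list, longest_run):
    """rot_list: nodes in rotation order, head first.
    longest_run: maps each Gamma0 node to its longest run in P (as a string).
    Returns RepCount(v) for every node v on the list."""
    rep_count = {}
    lpr = ""                                 # longest pseudo-run so far
    for v in rot_list:
        candidate = longest_run.get(v, "")   # empty if v is not in Gamma0
        shifted = lpr[1:]                    # LPR(prev)[2:] in 1-based notation
        # LPR(rot(v,1)) is the longer of LPR(v)[2:] and LongestRun(rot(v,1))
        lpr = max(shifted, candidate, key=len)
        rep_count[v] = len(lpr) // len(v)    # k with LPR(v) = v^k v'
    return rep_count
```

For instance, with the list ["ab", "ba"] and longest run "ababa" for "ab", the node "ba" inherits the pseudo-run "baba" = (ba)², so both nodes get repetition count 2.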

Note that the maximum repeat count k is obtained from 𝐿𝑃𝑅(v) = v^k v′ and v by dividing |𝐿𝑃𝑅(v)| by |v|. We perform division by subtraction; that is, to compute ⌊a/b⌋, we repeatedly subtract b from a until the value becomes negative. This division costs O(a/b) time. Thus, the total division cost is²

∑_{v∈Γ1} O(|𝐿𝑃𝑅(v)|/|v| + 1) = ∑_{v∈Γ1} O(RepCount(v) + 1) ⊆ O(|Γ|) ⊆ O(m).

² During the traversal of rotation lists, by maintaining the remainder of the division of |LongestRun(x)| by |x| when computing those values for x ∈ Γ0, one does not have to perform this “division” for v ∈ Γ1 ∖ Γ0.

Therefore, we can compute MaxRep(v), RepCount(v), and PrimRoot(v) for all vΓ1 in O(m) time.
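The “division by subtraction” used above is trivial to state in code; the point is that each call costs time proportional to the quotient, so over all rotation lists the costs telescope into the O(m) bound:

```python
def div_by_subtraction(a, b):
    """Compute floor(a / b) using only subtraction, in O(a/b + 1) steps.
    (Stopping once the remainder drops below b is equivalent to the
    until-negative formulation in the text.)"""
    q = 0
    while a >= b:
        a -= b
        q += 1
    return q
```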

The nodes in the rotation lists equipped with repetition information are embedded into 𝐸𝑆𝑇(P,Γ0) in O(m) time by Lemma 20. Lemma 23 summarizes Section 4.2.2.

Lemma 23.

Given Γ0 and 𝐸𝑆𝑇(P,Γ0), one can compute 𝑅𝐸𝑆𝑇(P,Γ1) in O(m) time.

4.2.3 Construction of 𝑹𝑬𝑺𝑻(𝑷,𝚪)

Every element of Γ is of the form v^i for some v ∈ Γ1 and i ≤ RepCount(v). Using the values of PrimRoot(v), MaxRep(v) and RepCount(v) for v ∈ Γ1, we compute a position of v^i, PrimRoot(v^i), RepCount(v^i) and MaxRep(v^i) for each i ∈ {2,…,RepCount(v)} in ascending order. Let kv = RepCount(v) and s be the start position of MaxRep(v). A position of v^i is given by (s, s + |v|·i − 1). The value PrimRoot(v^i) is set to be PrimRoot(v). To compute ci = RepCount(v^i) = ⌊kv/i⌋, we use the fact that ci ≤ ci−1 = RepCount(v^{i−1}) and that ci is the largest value satisfying i·ci ≤ kv. The value ci is obtained by initializing c to be ci−1 and repeatedly decrementing it one by one while checking the above inequality. Then, MaxRep(v^i) is computed as (s, s + |v|·i·ci − 1). Since c is monotonically decremented from kv, the process to compute the concerned values for all repetitions of v in Γ can be performed in O(kv) time. Summing the cost over all v ∈ Γ1, we have ∑_{v∈Γ1} O(kv) ⊆ O(m).
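A minimal sketch of this monotone decrement, computing ci = ⌊kv/i⌋ for all i without any machine division (the names kv and ci follow the text above):

```python
def rep_counts_of_powers(kv):
    """Compute c_i = RepCount(v^i) = floor(kv / i) for i = 2..kv.
    The counter c only ever decreases, starting from kv, so the total
    number of decrements over the whole loop is at most kv: O(kv) time."""
    counts = {}
    c = kv                      # c_1 = RepCount(v^1) = kv
    for i in range(2, kv + 1):
        while i * c > kv:       # c_i is the largest c with i * c <= kv
            c -= 1
        counts[i] = c
    return counts
```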

Lemma 24.

We can construct 𝑅𝐸𝑆𝑇(P,Γ) in O(m) time.

Proof.

After computing positions of v, PrimRoot(v), RepCount(v) and MaxRep(v) for all vΓΓ0, we modify 𝐸𝑆𝑇(P,Γ0) into 𝐸𝑆𝑇(P,Γ) and let the explicit nodes vΓ retain the values MaxRep(v), RepCount(v) and PrimRoot(v). Note that Lemma 20 is valid for updating extended suffix trees.

Lemma 25.

We can preprocess P in O(m) time so that q mrqs and prqs can be answered in O(q+m/𝗐) time.

Proof.

We can construct 𝑅𝐸𝑆𝑇(P,Γ) in O(m) time (Lemma 24) and preprocess it for niqs in O(m) time (Lemmas 19 and 17).

Given q (positions of) substrings of P as input, we identify the corresponding nodes v1,…,vq in 𝑅𝐸𝑆𝑇(P,Γ) using niqs. For each vi, if vi turns out to be in Γ, we return MaxRep(vi) and RepCount(vi) (for an mrq) or PrimRoot(vi) (for a prq). Otherwise, we return vi itself, because the maximum repetition and the primitive root of vi are vi itself. By Lemma 17, q mrqs or prqs can be answered in O(q + m/𝗐) time.
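As a toy model of this query answering (with the niq replaced by a dictionary keyed by the substring itself, and `info` a hypothetical precomputed map holding (MaxRep, RepCount, PrimRoot) triples for the nodes in Γ):

```python
def answer_queries(substrings, info):
    """Return the (MaxRep, RepCount, PrimRoot) triple for each queried
    substring: the precomputed values if it is a node of Gamma, and the
    substring itself (with count 1) otherwise."""
    answers = []
    for s in substrings:
        if s in info:                  # node lies in Gamma: stored values
            answers.append(info[s])
        else:                          # s is its own maximum repetition
            answers.append((s, 1, s))  # ... and its own primitive root
    return answers
```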

5 Pattern Matching on RLSLP

Let Mc (resp. Lc) be the set of nonterminals A in 𝒢 such that the derivation tree of A has height c and A has a rule of the form A → B^k (resp. A → BC or A → a). PSI-information for nonterminals is computed in the order L1, M2, L2, M3, …. PSI-information for all A ∈ Lc can be computed in O(|Lc| + m/𝗐) time by Lemma 8. We can also compute PSI-information for all A ∈ Mc in O(|Mc| + m/𝗐) time.

Lemma 26.

We can preprocess P in O(m) time so that, given q rules Ai → Bi^ki (1 ≤ i ≤ q), where the PSI-information for each Bi has already been computed, we can compute the PSI-information for all Ai in O(q + m/𝗐) total time.

Proof.

By Lemmas 10 and 25.

Lemma 27.

We can compute PSI-information for all nonterminals in 𝒢 in O(g+m) time.

Proof.

Recall that 𝒢 is balanced: the height of the derivation tree is bounded by O(log n). By Lemmas 8 and 26, the total time complexity is ∑_{c=1}^{O(log n)} ((|Lc| + m/𝗐) + (|Mc| + m/𝗐)) = O(g + m).

For each run-length rule A → B^k, we can determine whether P occurs in B̃^k in linear time.

Lemma 28.

For q rules Ai → Bi^ki (1 ≤ i ≤ q), if the PSI-information for each Bi has already been computed, we can determine whether P occurs in B̃i^ki in O(q + m) total time.

Proof.

We precompute all of the periods of P. By Lemmas 11 and 25, the claim holds.
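Lemma 28 relies on PSI-information, but the underlying reason run-length rules are tractable can be illustrated independently: any occurrence of P inside a power w^k spans at most ⌈m/|w|⌉ + 1 consecutive copies of w, so it suffices to search a bounded power regardless of k. A self-contained sketch, using naive substring search in place of the paper's machinery:

```python
def occurs_in_power(pattern, w, k):
    """Decide whether pattern occurs in w repeated k times.
    An occurrence of length m = |pattern| can always be shifted (by the
    periodicity of w^k) to start inside the first copy of w, and then it
    spans at most ceil(m/|w|) + 1 copies, so searching that many suffices."""
    m = len(pattern)
    copies = min(k, -(-m // len(w)) + 1)   # ceil(m/|w|) + 1, capped at k
    return pattern in w * copies
```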

Theorem 1.

Given a pattern P of length m and an RLSLP 𝒢 of size g, we can decide whether P occurs in the text described by 𝒢 in time O(g+m).

Proof.

From Lemma 27, we can compute the PSI-information for all nonterminals in 𝒢 in O(g+m) time. From Lemmas 7 and 28, using the computed PSI-information, we can decide whether P occurs in 𝗌𝗎𝖿𝖿𝗂𝗑P(B)·𝗉𝗋𝖾𝖿𝗂𝗑P(C) for all binary rules of the form A → BC in O(g+m) time, and whether P occurs in B̃^k for all run-length rules of the form A → B^k in O(g+m) time. Therefore, the total time complexity is O(g+m).

6 Conclusion

In this paper, we presented a linear-time algorithm for pattern matching on run-length grammar-compressed strings by generalizing Ganardi and Gawrychowski’s algorithm for straight-line programs [6]. The algorithm applies to any RLSLP and runs in O(g+m) time for a pattern of length m and an RLSLP of size g.

It remains an open problem if there exists a linear-time algorithm for pattern matching on iterative straight-line programs [25], which are a further extension of RLSLPs.

References

  • [1] Amihood Amir and Gary Benson. Efficient two-dimensional compressed matching. In Data Compression Conference, 1992., pages 279–288, 1992. doi:10.1109/DCC.1992.227453.
  • [2] Amihood Amir, Gary Benson, and Martin Farach. Let sleeping files lie: Pattern matching in Z-compressed files. Journal of Computer and System Sciences, 52(2):299–307, 1996. doi:10.1006/jcss.1996.0023.
  • [3] Philip Bille, Gad M. Landau, Rajeev Raman, Kunihiko Sadakane, Srinivasa Rao Satti, and Oren Weimann. Random access to grammar-compressed strings and trees. SIAM Journal on Computing, 44(3):513–539, 2015. doi:10.1137/130936889.
  • [4] Martin Farach and Mikkel Thorup. String matching in Lempel-Ziv compressed strings. Algorithmica, 20(4):388–404, 1998. doi:10.1007/PL00009202.
  • [5] Simone Faro and Thierry Lecroq. The exact online string matching problem: A review of the most recent results. ACM Comput. Surv., 45(2), March 2013. doi:10.1145/2431211.2431212.
  • [6] Moses Ganardi and Paweł Gawrychowski. Pattern matching on grammar-compressed strings in linear time. In Proceedings of the 2022 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2833–2846, 2022. doi:10.1137/1.9781611977073.110.
  • [7] Moses Ganardi, Artur Jeż, and Markus Lohrey. Balancing straight-line programs. J. ACM, 68(4), June 2021. doi:10.1145/3457389.
  • [8] Paweł Gawrychowski. Pattern matching in Lempel-Ziv compressed strings: Fast, simple, and deterministic. In Algorithms – ESA 2011, pages 421–432, 2011. doi:10.1007/978-3-642-23719-5_36.
  • [9] Paweł Gawrychowski. Optimal pattern matching in LZW compressed strings. ACM Trans. Algorithms, 9(3), June 2013. doi:10.1145/2483699.2483705.
  • [10] Keisuke Goto, Hideo Bannai, Shunsuke Inenaga, and Masayuki Takeda. Fast q-gram mining on SLP compressed strings. Journal of Discrete Algorithms, 18:89–99, 2013. doi:10.1016/j.jda.2012.07.006.
  • [11] Dan Gusfield. Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, 1997. doi:10.1017/CBO9780511574931.
  • [12] Marek Karpinski, Wojciech Rytter, and Ayumi Shinohara. Pattern-matching for strings with short descriptions. In Proc. The 6th Annual Symposium on Combinatorial Pattern Matching (CPM95), volume 937 of Lecture Notes in Computer Science, pages 205–214. Springer, 1995. doi:10.1007/3-540-60044-2_44.
  • [13] Dominik Kempa and Tomasz Kociumaka. Collapsing the hierarchy of compressed data structures: Suffix arrays in optimal compressed space. In 2023 IEEE 64th Annual Symposium on Foundations of Computer Science (FOCS), pages 1877–1886, November 2023. doi:10.1109/FOCS57990.2023.00114.
  • [14] Takuya Kida, Tetsuya Matsumoto, Yusuke Shibata, Masayuki Takeda, Ayumi Shinohara, and Setsuo Arikawa. Collage system: a unifying framework for compressed pattern matching. Theoretical Computer Science, 298(1):253–272, 2003. doi:10.1016/S0304-3975(02)00426-7.
  • [15] Tomasz Kociumaka, Marcin Kubica, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. A linear-time algorithm for seeds computation. ACM Trans. Algorithms, 16(2), April 2020. doi:10.1145/3386369.
  • [16] Tomasz Kociumaka, Gonzalo Navarro, and Francisco Olivares. Near-optimal search time in δ-optimal space. In LATIN 2022: Theoretical Informatics, pages 88–103, 2022. doi:10.1007/978-3-031-20624-5_6.
  • [17] Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Toward a definitive compressibility measure for repetitive sequences. IEEE Transactions on Information Theory, 69(4):2074–2092, 2023. doi:10.1109/TIT.2022.3224382.
  • [18] Roman Kolpakov and Gregory Kucherov. Finding maximal repetitions in a word in linear time. In 40th Annual Symposium on Foundations of Computer Science (Cat. No.99CB37039), pages 596–604, 1999. doi:10.1109/SFFCS.1999.814634.
  • [19] S. Rao Kosaraju. Pattern matching in compressed texts. In Foundations of Software Technology and Theoretical Computer Science, pages 349–362, 1995. doi:10.1007/3-540-60692-0_60.
  • [20] Shuo Li, Jakub Pachocki, and Jakub Radoszewski. A note on the maximum number of k-powers in a finite word. The Electronic Journal of Combinatorics, 31(3), 2024. doi:10.37236/11270.
  • [21] Markus Lohrey. Algorithmics on SLP-compressed strings: A survey. Groups - Complexity - Cryptology, 4(2):241–299, 2012. doi:10.1515/gcc-2012-0016.
  • [22] Michael G. Main and Richard J. Lorentz. An O(nlogn) algorithm for finding all repetitions in a string. J. Algorithms, 5(3):422–432, 1984. doi:10.1016/0196-6774(84)90021-X.
  • [23] Gonzalo Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88, 2001. doi:10.1145/375360.375365.
  • [24] Gonzalo Navarro, Francisco Olivares, and Cristian Urbina. Balancing run-length straight-line programs. In String Processing and Information Retrieval, pages 117–131, 2022. doi:10.1007/978-3-031-20643-6_9.
  • [25] Gonzalo Navarro and Cristian Urbina. Iterated straight-line programs. In LATIN 2024: Theoretical Informatics, pages 66–80, 2024. doi:10.1007/978-3-031-55598-5_5.
  • [26] Takaaki Nishimoto, Tomohiro I, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Fully Dynamic Data Structure for LCE Queries in Compressed Space. In 41st International Symposium on Mathematical Foundations of Computer Science (MFCS 2016), volume 58 of Leibniz International Proceedings in Informatics (LIPIcs), pages 72:1–72:14, 2016. doi:10.4230/LIPIcs.MFCS.2016.72.
  • [27] Jakub Radoszewski. Linear Time Construction of Cover Suffix Tree and Applications. In 31st Annual European Symposium on Algorithms (ESA 2023), volume 274 of Leibniz International Proceedings in Informatics (LIPIcs), pages 89:1–89:17, 2023. doi:10.4230/LIPIcs.ESA.2023.89.
  • [28] Wojciech Rytter. Grammar compression, LZ-encodings, and string algorithms with implicit input. In Automata, Languages and Programming (ICALP 2004), pages 15–27, 2004. doi:10.1007/978-3-540-27836-8_5.
  • [29] Masayuki Takeda and Ayumi Shinohara. Pattern Matching on Compressed Text, in Encyclopedia of Algorithms, pages 1538–1542. Springer New York, 2016. doi:10.1007/978-1-4939-2864-4_81.
  • [30] Esko Ukkonen. On-line construction of suffix trees. Algorithmica, 14:249–260, 1995. doi:10.1007/BF01206331.
  • [31] Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory (SWAT 1973), pages 1–11, 1973. doi:10.1109/SWAT.1973.13.
  • [32] Takanori Yamamoto, Hideo Bannai, Shunsuke Inenaga, and Masayuki Takeda. Faster subsequence and don’t-care pattern matching on compressed texts. In Combinatorial Pattern Matching, pages 309–322, 2011. doi:10.1007/978-3-642-21458-5_27.