Pattern Matching on Run-Length Grammar-Compressed Strings in Linear Time
Abstract
Run-length straight-line programs (RLSLPs) are a technique for grammar-based compression that can represent any string within space optimal with respect to δ, the substring complexity of the string. We address the compressed pattern matching problem for RLSLPs: given a text compressed as an RLSLP and an uncompressed pattern, determine whether the pattern occurs in the text. This paper proposes an algorithm that solves this problem in time linear in the size of the grammar plus the length of the pattern.
Keywords and phrases: pattern matching, run-length straight-line programs, compression, suffix tree
Funding: Ryo Yoshinaka: JSPS KAKENHI 18K11150, 20H05703, 23K11325, 24H00697, 24K14827.
2012 ACM Subject Classification: Theory of computation → Design and analysis of algorithms
Acknowledgements: We are grateful to the anonymous reviewers for their thorough review and helpful recommendations.
Editors: Paola Bonizzoni and Veli Mäkinen

1 Introduction
Pattern matching is the problem of determining whether a given pattern P occurs in a given text T. It is one of the most fundamental problems in computer science, with applications in various fields such as text processing, signal processing, database searching, and bioinformatics. There are many algorithms for solving the pattern matching problem [5, 23].
Nowadays, datasets are often stored in a compressed form, and it is impractical to decompress the data each time to find a pattern in it. Thus, pattern matching on compressed data without decompression has attracted much attention. The problem of compressed pattern matching [29] is, given a pattern P and a compressed representation of a text T, to determine whether P occurs in T without decompressing it. The complexity of matching algorithms is mainly measured with respect to the size of the compressed representation and the length m of the pattern, rather than the length N of the text.
Since the pioneering work of Amir and Benson [1], many compressed pattern matching algorithms have been proposed for various compression methods, depending on the specific properties of the methods [2, 19, 4]. Among them, Gawrychowski [8] showed an algorithm for Lempel–Ziv compression (commonly known as LZ77) that runs in O(n log(N/n) + m) time for a parse of size n. Gawrychowski [9] also showed an O(n + m)-time algorithm for Lempel–Ziv–Welch (LZW) compression. Recently, as a landmark, Ganardi and Gawrychowski [6] successfully developed an algorithm for straight-line programs (SLPs) that runs in O(g + m) time for an SLP of size g. An SLP is a context-free grammar that generates exactly one string, and SLPs are extensively employed to generalize various grammar-based compression techniques [12, 28, 32, 21, 10, 3].
The complexity of pattern matching on compressed data is inherently linked to the compressibility of the data itself, as the size is influenced by the specific compression algorithm employed. The substring complexity δ due to Kociumaka et al. [17] serves as a measure of the compressibility of highly repetitive strings, enabling the evaluation of the compression performance of various methods. They showed that LZ77 can represent any text using O(δ log(N/δ)) space, which is asymptotically optimal with respect to δ. In contrast, the smallest SLP may require asymptotically more space. LZ77 achieves high compression ratios compared to other methods and is widely used in practical applications such as zip. However, processing data directly in its LZ77-compressed form is challenging. Even accessing the i-th character of the text is difficult, and solving compressed pattern matching in linear time is not feasible [8]. The difficulty of handling LZ77-compressed data is evident from the fact that the matching algorithm shown in [8] begins by converting the LZ77-compressed data into an SLP.
Nishimoto et al. [26] enhanced SLPs with “run-length rules” of the form X → Y^k to handle repetitions more efficiently. Run-length straight-line programs (RLSLPs), like LZ77, can represent any text using O(δ log(N/δ)) space. Furthermore, RLSLPs preserve the simplicity and usability characteristic of grammar-based compression. For example, with an RLSLP, a substring of length ℓ at any position in the text can be accessed in O(ℓ + log N) time [17], and longest common extension (LCE) queries can be answered in O(log N) time [13]. Additionally, RLSLPs are used in applications such as constructing suffix arrays [13] and indexes [16] within O(δ log(N/δ)) space. Despite their various applications, no linear-time matching algorithm for RLSLPs has been proposed.
In this paper, we propose a linear-time compressed pattern matching algorithm for strings compressed by RLSLPs, as stated in the following theorem.
Theorem 1.
Given a pattern P of length m and an RLSLP G of size g, we can decide whether P occurs in the text described by G in O(m + g) time.
In the linear-time algorithm for compressed pattern matching on SLPs [6], it is crucial to represent substrings of the text where the pattern may occur as a concatenation of a constant number of substrings of the pattern. We adopt this idea in our algorithm for RLSLPs. The challenge lies in representing substrings of the text as concatenations of substrings of the pattern even when a rule of the form X → Y^k is present, which does not exist in SLPs. To address this, handling repetitions of substrings within the pattern is essential. A cover suffix tree [27] is a data structure that extends a suffix tree by adding nodes corresponding to substrings of a string whose squares are substrings of the string. However, for our problem, handling only squares, i.e., substrings that repeat twice, is insufficient. Instead, we need to determine the exact number of times a given substring repeats in a string. We address this issue by introducing a new data structure, an extension of the cover suffix tree, that can handle all repetitions within a string.
In the whole paper we assume the standard word RAM model, which operates on w-bit words, where w ≥ log N and w ≥ log m, with the standard arithmetic (excluding integer division) and bitwise operations in constant time.
2 Preliminaries
Let Σ be a finite alphabet. A string w of length n is a finite sequence of n symbols in Σ. For a string w, we write w[i] for the i-th character of w. The length of w is denoted by |w|. The string of length 0 is called the empty string and is denoted by ε. The concatenation of two strings u and v is denoted by u · v or uv. For a string w, we define w^0 = ε and w^k = w^(k−1) w for any integer k ≥ 1. A string of the form w^k for a positive integer k ≥ 2 is called a repetition of w, and w^2 in particular is referred to as the square of w. A string w is primitive if w cannot be represented as u^k for any string u and integer k ≥ 2. A string u with w = u^k is a root of w, and if u is primitive, u is the primitive root of w. The substring of a string w starting at position i and ending at position j is denoted by w[i..j]. In particular, a substring w[1..j] is called a prefix, and a substring w[i..|w|] is called a suffix of w. Proper prefixes and suffixes of w are those different from w. The suffix w[i..|w|] is also denoted by w[i..]. If i > j, then w[i..j] is the empty string. Let Substr(w), Pref(w), and Suf(w) be the sets of all substrings, prefixes, and suffixes of w, respectively. We say that a string u occurs in a string w at position i if u = w[i..i+|u|−1]. For a string w and an integer ℓ (0 ≤ ℓ < |w|), the ℓth-rotation of w is w[ℓ+1..] · w[1..ℓ]. A period of a string w is a positive integer p such that w[i] = w[i+p] for all i with 1 ≤ i ≤ |w| − p. A substring w[i..j] is a run in w if, for its smallest period p, (1) j − i + 1 ≥ 2p, (2) i = 1 or w[i−1] ≠ w[i+p−1], and (3) j = |w| or w[j+1] ≠ w[j−p+1]. A run is specified by the triple (i, j, p) of its position and smallest period. If u ∈ Substr(w) and e is the largest integer such that u^e ∈ Substr(w), then u^e and e are called the maximum repetition and the maximum repetition count of u in w, respectively.
Example 2.
For w = aabaab, all runs in w are aa, aa, and aabaab, represented by (1, 2, 1), (4, 5, 1), and (1, 6, 3), respectively. On the other hand, all maximum repetitions with maximum repetition count at least 2 of primitive substrings of w are a^2 = aa and (aab)^2 = aabaab.
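The notions of primitive root and maximum repetition admit straightforward (quadratic-time) reference implementations; the function names below are ours, and the paper's data structures answer these queries far more efficiently:

```python
def primitive_root(w: str) -> str:
    """Return the primitive root u of w, i.e., the shortest u with w = u^k."""
    n = len(w)
    for d in range(1, n + 1):
        if n % d == 0 and w[:d] * (n // d) == w:
            return w[:d]

def max_repetition_count(u: str, w: str) -> int:
    """Largest e such that u^e is a substring of w (e >= 1 iff u occurs in w)."""
    e = 0
    while u * (e + 1) in w:
        e += 1
    return e

# for instance, with w = aabaab:
w = "aabaab"
assert primitive_root("aabaab") == "aab"
assert max_repetition_count("a", w) == 2     # maximum repetition aa
assert max_repetition_count("aab", w) == 2   # maximum repetition aabaab
```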
A straight-line program (SLP) is a context-free grammar describing exactly one string. Without loss of generality, we assume the grammar is in Chomsky normal form, i.e., each rule is of the form X → YZ or X → a, where X, Y, Z are nonterminals and a is a terminal. For any rule X → YZ, X does not appear in the derivation of Y nor Z. A run-length straight-line program (RLSLP) [26] is an extended SLP which can also have rules of the form X → Y^k where k ≥ 2. We call rules of the forms X → YZ and X → Y^k binary rules and run-length rules, respectively. The terminal string derived from a nonterminal X is denoted by val(X).
Example 3.
Consider the run-length straight-line program (RLSLP) with the rules A1 → a, A2 → b, A3 → A1 A2, A4 → A3^3, and A5 → A4 A1.
Here, A5 is the start symbol. We have val(A1) = a, val(A2) = b, val(A3) = ab, val(A4) = ababab, and val(A5) = abababa.
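The RLSLP formalism can be made concrete with a tiny interpreter; the rule encoding ("term", "bin", "rep") and the function name val are ours, and expanding a grammar of course defeats compression, so this only fixes the semantics of the three rule forms:

```python
def val(rules: dict, X: str) -> str:
    """Return the terminal string derived from nonterminal X."""
    kind = rules[X][0]
    if kind == "term":                 # X -> a
        return rules[X][1]
    if kind == "bin":                  # X -> Y Z (binary rule)
        _, Y, Z = rules[X]
        return val(rules, Y) + val(rules, Z)
    _, Y, k = rules[X]                 # X -> Y^k (run-length rule)
    return val(rules, Y) * k

rules = {
    "A1": ("term", "a"),
    "A2": ("term", "b"),
    "A3": ("bin", "A1", "A2"),
    "A4": ("rep", "A3", 3),
    "A5": ("bin", "A4", "A1"),
}
assert val(rules, "A4") == "ababab"
assert val(rules, "A5") == "abababa"
```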
3 Previous Work and Challenges
This section gives an overview of Ganardi and Gawrychowski's [6] compressed pattern matching algorithm on SLPs and the challenges in extending it to RLSLPs.
Consider an SLP G of size g for a text T of length N and a pattern P of length m. For each nonterminal X of G, let s(X) be the longest element of Pref(P) ∩ Suf(val(X)) and p(X) be the longest element of Suf(P) ∩ Pref(val(X)).
Example 4.
For a pattern P = aabab and a nonterminal X with val(X) = abaa, we have s(X) = aa and p(X) = ab.
We can verify that P occurs in T if and only if there exists a rule X → YZ such that P occurs in s(Y) p(Z), provided m ≥ 2. This fact has often been used to solve the compressed pattern matching problems efficiently [4, 14, 9, 8], where those strings are represented and processed as positions on P. However, no linear-time algorithm is known that computes s(X) and p(X) for all nonterminals in an arbitrary SLP. To achieve a linear-time solution for SLP-compressed pattern matching, Ganardi and Gawrychowski [6] invented a brilliant idea to compute approximations of them, which we call PSI-information (Prefix-Suffix-Infix information) in this paper. A PSI-information is a tuple consisting of either four substrings or a single substring of P.
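The occurrence criterion above can be checked naively on explicit strings; s_of and p_of below are our hypothetical helpers computing s(·) and p(·) by brute force, and the final assertion checks that occurrences crossing the seam of a concatenation are exactly the occurrences inside s(Y) p(Z):

```python
def s_of(valA: str, P: str) -> str:
    """Longest prefix of P that is a suffix of valA (brute force)."""
    for l in range(min(len(valA), len(P)), -1, -1):
        if valA.endswith(P[:l]):
            return P[:l]

def p_of(valA: str, P: str) -> str:
    """Longest suffix of P that is a prefix of valA (brute force)."""
    for l in range(min(len(valA), len(P)), -1, -1):
        if valA.startswith(P[len(P) - l:]):
            return P[len(P) - l:]

assert s_of("abaa", "aabab") == "aa" and p_of("abaa", "aabab") == "ab"

# occurrences of P crossing the Y|Z seam vs. occurrences in s(Y)p(Z)
Y, Z, P = "babaa", "babba", "aabab"
cross = any((Y + Z)[i:i + len(P)] == P
            for i in range(max(0, len(Y) - len(P) + 1), len(Y)))
assert cross == (P in s_of(Y, P) + p_of(Z, P))
```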
Definition 5.
The set of PSI-information for a nonterminal X is defined as follows. If val(X) is a substring of P, i.e., val(X) ∈ Substr(P), the PSI-information for X consists of a single substring, namely val(X) itself represented by one of its positions in P.
Otherwise, the PSI-information for X consists of four substrings of P that together represent the prefix and suffix of val(X) relevant to occurrences of P.
They showed an algorithm that computes some PSI-information for each rule X → YZ recursively, assuming that the PSI-information for Y and Z has already been computed. Depending on whether each of val(Y) and val(Z) is a substring of P, there are four cases (Cases 1 to 4). In every case, the PSI-information for X is assembled from those for Y and Z, and the computation is reduced to at most one call of the substring concatenation query (scq), defined below.
- Substring concatenation query
-
(scq): Given two substrings u and v of P represented by their positions, return a position of uv in P if uv ∈ Substr(P); otherwise, return “No”.
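A naive linear-scan version of the scq fixes its semantics; substrings are passed as 1-indexed (start, length) pairs as in the paper, while this per-query scan is only a specification and Lemma 6 below is about answering such queries in batches far faster:

```python
def scq(P: str, u: tuple, v: tuple):
    """Return a 1-indexed position of uv in P, or None (i.e., "No")."""
    (i, lu), (j, lv) = u, v
    uv = P[i - 1:i - 1 + lu] + P[j - 1:j - 1 + lv]
    k = P.find(uv)
    return k + 1 if k >= 0 else None

P = "ababbab"
assert scq(P, (1, 2), (6, 2)) == 1     # "ab" + "ab" = "abab" occurs at 1
assert scq(P, (5, 2), (5, 2)) is None  # "ba" + "ba" = "baba" does not occur
```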
For scqs, it is not known whether we can answer an arbitrary single query in constant time with O(m)-time preprocessing. However, they showed the following alternative solution for batched queries, which is a key to the linear-time pattern matching algorithm.
Lemma 6 ([6], Theorem 1.3).
We can preprocess a string of length m in O(m) time so that we can answer any batch of q scqs in O(q) total time.
Lemma 7 ([6]).
For rules X_i → Y_i Z_i (1 ≤ i ≤ r), if the PSI-information for Y_i and Z_i has already been computed, we can determine whether P occurs in val(X_i) for all 1 ≤ i ≤ r in O(m + r) total time.
Let H_h be the set of nonterminals whose derivation tree is of height h. PSI-information for nonterminals is computed in the order of H_1, H_2, and so on, in a batched style. From Lemma 6, PSI-information for all X ∈ H_h can be computed in O(|H_h|) time for each h.
Lemma 8 ([6]).
We can preprocess P in O(m) time so that, given rules X_i → Y_i Z_i (1 ≤ i ≤ r) where the PSI-information for Y_i and Z_i has already been computed, we can compute the PSI-information for all X_i's in O(r) total time.
Another key element supporting the linear-time algorithm is balancing SLPs to ensure that the height of the derivation tree is O(log N), due to Ganardi et al. [7]. With this technique, the total time complexity of the algorithm is bounded by O(m + g), as we will trace in the proof of Lemma 27.
We now turn our attention to extending the algorithm to RLSLPs. Concerning the height of the derivation trees, Navarro et al. [24] showed that any RLSLP can be balanced in linear time without increasing its asymptotic size. Thus, the crucial issue in realizing a linear-time RLSLP-compressed pattern matching algorithm is to establish the counterparts of Lemmas 7 and 8 for the run-length rules. Those will appear in Section 5 as Lemmas 28 and 26. Of course, we do not take the naive and computationally expensive approach that breaks a run-length rule down into binary rules and applies Ganardi and Gawrychowski's technique. The following two types of queries will play important roles in achieving our goals, just as Lemmas 7 and 8 are based on scqs.
- Maximum repetition query
-
(mrq): Given a nonempty substring u of P represented by a position, answer one of the positions of its maximum repetition in P and the maximum repetition count of u.
- Primitive root query
-
(prq): Given a nonempty substring u of P represented by a position, answer one of the positions in P of its primitive root.
Example 9.
Let P = ababbab be a pattern. Given the position 6 of the substring ab, the answer to the mrq is the position 1 of its maximum repetition abab and the maximum repetition count 2. Notice that the maximum repetition does not necessarily include the queried position. Also, given the position of the substring abab, the answer to the prq is any of the positions 1, 3, 6 of the primitive substring ab.
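Brute-force versions of the two queries, on explicit strings rather than positions, may help fix their semantics (for illustration only; the paper precomputes the answers on a repetition-informed suffix tree):

```python
def prq(P: str, u: str) -> int:
    """Return a 1-indexed position in P of the primitive root of u.

    u is assumed to occur in P."""
    n = len(u)
    root = next(u[:d] for d in range(1, n + 1)
                if n % d == 0 and u[:d] * (n // d) == u)
    return P.find(root) + 1

def mrq(P: str, u: str):
    """Return a position of the maximum repetition u^e in P and the count e.

    u is assumed to occur in P."""
    e = 1
    while u * (e + 1) in P:
        e += 1
    return P.find(u * e) + 1, e

P = "ababbab"
assert mrq(P, "ab") == (1, 2)        # maximum repetition abab at position 1
assert prq(P, "abab") in (1, 3, 6)   # primitive root ab
```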
If we can answer mrqs efficiently, we can efficiently compute the PSI-information of X from that of Y for run-length rules X → Y^k.
Lemma 10.
Consider X → Y^k with val(Y) ∈ Substr(P). If val(Y)^k ∈ Substr(P), then val(X) ∈ Substr(P). Otherwise, for the maximum repetition count e of val(Y) in P, we have that s(X) is the longest element of Pref(P) ∩ Suf(val(Y)^(e+1)) and p(X) is the longest element of Suf(P) ∩ Pref(val(Y)^(e+1)).
Proof.
In the case where val(Y)^k ∈ Substr(P), by repeatedly applying the binary-rule computation for the case where both operands are substrings of P, we can conclude val(X) ∈ Substr(P). Suppose val(Y)^k ∉ Substr(P) and let e be the maximum repetition count of val(Y) in P; note that k > e. The string p(X) is a prefix of val(Y)^k, so it must be of the form val(Y)^j v for some proper prefix v of val(Y). Since val(Y)^j ∈ Substr(P), we have j ≤ e, and thus p(X) ∈ Pref(val(Y)^(e+1)). Conversely, every element of Suf(P) ∩ Pref(val(Y)^(e+1)) is a prefix of val(Y)^k by e + 1 ≤ k. Together with the symmetric argument on s(X), we conclude the lemma.

Consider deciding whether P occurs in val(X) for a run-length rule X → Y^k. If P occurs in val(X), either m ≤ 2|val(Y)| or m > 2|val(Y)|. The former case can be handled in the same way as the binary rule case. For handling the latter case, we use prqs and mrqs based on Lemma 11. Note that all the periods of P can be computed in O(m) time using the so-called Z-algorithm [11, 22].
Lemma 11.
Consider X → Y^k with m > 2|val(Y)|. Let u be the primitive root of val(Y), e the maximum repetition count of u in P, and q = k|val(Y)|/|u|, so that val(X) = u^q. Then, P occurs in val(X) if and only if |u| is a period of P and q ≥ e + a + b, where P is written as s u^e p for a proper suffix s and a proper prefix p of u, a = 1 if s ≠ ε and a = 0 otherwise, and b = 1 if p ≠ ε and b = 0 otherwise.
Proof.
Let q be the integer such that val(X) = u^q. For P to occur in val(X), P must have period |u|. In this case, since u^e occurs in P with the maximum count e, there are a proper suffix s and a proper prefix p of u such that P = s u^e p. If s = p = ε, P = u^e occurs in val(X) just if q ≥ e. Otherwise, val(X) must have one more block of u to cover each of s and/or p.
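As a concrete aid for the Z-algorithm mentioned above: p is a period of P exactly when the Z-value at position p equals m − p (0-indexed Z array). A minimal sketch:

```python
def z_array(s: str):
    """Z[i] = length of the longest common prefix of s and s[i:]."""
    m = len(s)
    Z = [0] * m
    Z[0] = m
    l = r = 0
    for i in range(1, m):
        if i < r:
            Z[i] = min(r - i, Z[i - l])
        while i + Z[i] < m and s[Z[i]] == s[i + Z[i]]:
            Z[i] += 1
        if i + Z[i] > r:
            l, r = i, i + Z[i]
    return Z

def periods(P: str):
    """All periods of P in O(m) time: p is a period iff Z[p] = m - p."""
    m = len(P)
    Z = z_array(P)
    return [p for p in range(1, m + 1) if p == m or Z[p] == m - p]

assert periods("ababab") == [2, 4, 6]
assert periods("aabaab") == [3, 6]
```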
Our proposed algorithm for RLSLP-compressed pattern matching precomputes the answers to mrqs and prqs in an enhanced suffix tree. The next section describes the details of the data structure and its construction. Throughout the rest of this paper, we fix a pattern P of length m and a balanced RLSLP G of size g that represents a text T of length N.
4 Repetition-informed Suffix Tree
The longest common prefix lcp(u, v) of strings u and v is the longest string in Pref(u) ∩ Pref(v). The suffix tree of P is defined as follows.
Definition 12 ([31]).
Let P̂ = P$ where $ is a special symbol that does not occur in P. The suffix tree of P consists of the set V of explicit nodes and the set E of edges, where
-
V = { lcp(s, s′) : s, s′ ∈ Suf(P̂) },
-
E = { (u, v) ∈ V × V : u is a proper prefix of v, and no w ∈ V \ {u, v} satisfies u ∈ Pref(w) and w ∈ Pref(v) }, where the edge (u, v) is labeled by the string x with v = ux.
Note that whereas the mathematical definition above is given with strings, node names and edge labels are represented as occurrence positions of those strings in P̂. This paper often uses strings for readability where the actual computation is done on positions, unless confusion arises. We assume one can access in constant time the node s for any s ∈ Suf(P̂) and the parent of an arbitrary node. The suffix tree can be constructed in O(m) time [30]. Every substring of P̂ that is not an explicit node of the suffix tree is called an implicit node. An implicit node u is a conceptual node that exists between two explicit nodes and is specified as (v, |u|), where v is the shortest extension of u in V. A node is either an explicit node or an implicit node.
We extend suffix trees by adding more explicit nodes.
Definition 13.
Let W ⊆ Substr(P̂) and V be the set of all explicit nodes in the suffix tree of P. The extended suffix tree of P with a substring set W is defined by
-
V_W = V ∪ W,
-
E_W = { (u, v) ∈ V_W × V_W : u is a proper prefix of v, and no w ∈ V_W \ {u, v} satisfies u ∈ Pref(w) and w ∈ Pref(v) }.
We use an extended suffix tree to store the answers to the mrqs and prqs on the respective nodes. We promote an implicit node u to an explicit node only if the answers to the queries on u are nontrivial, in the sense that either the primitive root of u is not u itself, or the maximum repetition of u is not u itself. In the trivial case, the answers to mrqs and prqs can be the queried positions themselves, with the maximum repetition count 1. Let Q be the set of nonempty strings for which the mrq or the prq is nontrivial:
Q = { u ∈ Substr(P) \ {ε} : u^2 ∈ Substr(P) or u is not primitive }.
The goal of this section is to construct the repetition-informed suffix tree RIST(P), defined as follows.
Definition 14.
An extended suffix tree of P is said to be repetition-informed, and denoted by RIST(P), if each node u ∈ Q retains mrep(u), mrc(u), and proot(u), where
-
mrep(u): one of the positions of the maximum repetition of u in P,
-
mrc(u): the integer e such that u^e is the maximum repetition of u in P,
-
proot(u): one of the positions of the primitive root of u in P.
That is, the answer to the mrq on u is mrep(u) and mrc(u), and the one to the prq on u is proot(u).
The suffix tree and the repetition-informed suffix tree of an example pattern are shown in Figure 1. Although the values mrep(u) and proot(u) are positions on P, for intuitive understandability, Figure 1 presents them as links from u to the nodes corresponding to those positions. We remark that mrc(u) can be computed from the length of the maximum repetition and |u|. However, under the assumptions of the computer model, integer division cannot be done in constant time. So, we will compute the values mrc(u) in the preprocessing step and let the nodes remember them.
4.1 Node identification queries
Recall that instances of the mrqs and prqs are positions on P, while the answers are stored in the nodes of the repetition-informed suffix tree. This subsection shows that we can efficiently convert positions on P to the corresponding nodes of an extended suffix tree (Lemma 17).
- Node identification query
-
(niq): Given a substring u of P represented by a position, return the node corresponding to u on the extended suffix tree of P.
We answer niqs using waqs, defined below. In the weighted ancestor problem, we are given a node-weighted tree where the weight of any node is a nonnegative integer greater than that of its parent.
- Weighted ancestor query
-
(waq): Given a node v and a nonnegative integer t, return the furthest ancestor of v with weight at least t.
Kociumaka et al. [15] showed that a batch of q waqs on a tree of size n can be answered in O(n + q) time, provided that the node weights are polynomially bounded in n. Their algorithm first sorts the queries by weight and then processes them in non-increasing order. This necessity of sorting is the reason why the weights must be polynomially bounded and the queries must be handled in a batch. In other words, as long as the queries are given in non-increasing order of weights, multiple waqs can be answered in an online manner with the same time complexity. It is more convenient for our discussion to present their result in this online computation version.
Lemma 15.
We can answer q waqs on a node-weighted tree of size n in O(n + q) total time in an online manner, provided that the queries are ordered by non-increasing weights.
Ganardi and Gawrychowski [6] improved upon the result by Kociumaka et al.
Lemma 16.
We can preprocess a node-weighted tree of size n in O(n) time so that q waqs can be answered in O(q) total time.
Lemma 16 encompasses Lemma 15, but since the algorithm in [15] is simpler, Lemma 15 is used when it is sufficient.
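To fix the semantics of waqs, here is a simple binary-lifting implementation, with O(n log n) preprocessing and O(log n) per query; it is weaker than Lemmas 15 and 16, and it assumes the queried node itself has weight at least t:

```python
class WAQ:
    """Weighted ancestor queries by binary lifting.

    Weights must strictly increase from the root toward the leaves."""

    def __init__(self, parent, weight):          # parent[root] = -1
        n = len(parent)
        self.weight = weight
        self.up = [parent[:]]                    # up[k][v] = 2^k-th ancestor
        for k in range(1, max(1, n.bit_length())):
            prev = self.up[-1]
            self.up.append([prev[v] if prev[v] < 0 else prev[prev[v]]
                            for v in range(n)])

    def query(self, v, t):
        """Furthest ancestor of v whose weight is still at least t."""
        for up_k in reversed(self.up):           # greedy, largest jumps first
            if up_k[v] >= 0 and self.weight[up_k[v]] >= t:
                v = up_k[v]
        return v

# a root-to-leaf path 0 - 1 - 2 - 3 with weights 0, 2, 4, 7
w = WAQ([-1, 0, 1, 2], [0, 2, 4, 7])
assert w.query(3, 3) == 2    # node 2 is the highest ancestor with weight >= 3
assert w.query(3, 1) == 1
```

Since the weights are monotone along every root-to-leaf path, the greedy descent over jump sizes is correct: a jump of size 2^k keeps the weight at least t exactly when the target ancestor is not above the answer node.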
A result similar to the following lemma is used in the algorithm presented in [6].
Lemma 17.
We can preprocess the extended suffix tree, of size O(m), in O(m) time so that we can answer q niqs in O(q) total time.
Proof.
Define the weight of each explicit node to be the length of its corresponding string. Preprocessing the extended suffix tree for waqs takes O(m) time by Lemma 16. Given a substring u = P[i..j] as input for an niq, we use a waq with the leaf node P̂[i..] and the weight |u| to identify an explicit node v. If |v| = |u| holds, u is the node v. Otherwise, u is an implicit node between v and its parent, since the parent of v is shorter than u. The implicit node is represented by (v, |u|). It takes O(q) time to process q waqs.
Lemma 17 is based on Lemma 16 and used in the matching algorithm in Section 5. A simplified version of Lemma 17 is obtained based on Lemma 15, which is sufficient for discussions in the rest of this section.
Lemma 18.
We can answer q niqs on the extended suffix tree of size O(m) in O(m + q) time in an online manner, provided that the queries are ordered by non-increasing length.
4.2 Constructing repetition-informed suffix trees
Li et al. [20] showed that the number of distinct substrings of the form u^2 for some u in any string of length n is less than n. We immediately obtain the following lemma.
Lemma 19.
|Q| ∈ O(m) holds.
This ensures that the size of every extended suffix tree used below is O(m), so that niqs on them can be answered within the bounds of Lemmas 17 and 18.
Define
Q₂ = { u ∈ Substr(P) : u is primitive and u^2 ∈ Substr(P) }.
Obviously, Q₂ ⊆ Q holds. Our construction consists of the following steps:
-
1.
constructing the extended suffix tree of P with the roots of all runs in P;
-
2.
identifying the nodes for all elements of Q₂ and computing the answers to the repetition queries on them;
-
3.
constructing RIST(P) by adding the nodes for the remaining elements of Q together with their answers.
(We construct the intermediate structures for clear illustration of the preprocessing algorithm. However, in practice, it suffices to identify the nodes of Q₂ and calculate the answers to repetition queries in Step 2. One may embed those nodes into the tree together with the other nodes of Q at once in Step 3.)
The following subsections explain the respective steps.
4.2.1 Construction of the extended suffix tree with the roots of runs
We first enumerate the roots of all runs in P as positions by Kolpakov and Kucherov's algorithm [18], which enumerates all the runs of a string in linear time, as their positions and smallest periods. Then, we identify the corresponding nodes in the suffix tree by niqs and make them explicit. Conversion of implicit nodes into explicit nodes may appear very simple. When an implicit node u to convert is represented by (v, |u|), we should introduce a new explicit node between the explicit node v and the parent of v. We replace the edge between v and its parent by two edges, one between the parent and u and one between u and v. This can be done in constant time. One has to notice, however, that in our approach the implicit nodes to be converted are all given simultaneously by niqs for efficiency. Converting one implicit node alters the tree structure, so the representations of some of the other implicit nodes may be affected. Suppose two implicit nodes u and u′ are represented by (v, |u|) and (v, |u′|), respectively, with |u| < |u′|, i.e., u ∈ Pref(u′). If u′ is converted first, then the representation (v, |u|) of u becomes invalid, since the explicit node immediately below u is now the new explicit node u′ rather than v. One should not embed a new explicit node for u as the parent of v. This potential disturbance can be avoided by converting u before u′.
Lemma 20.
Given a set of n substrings of P represented by their positions, we can construct the extended suffix tree of P with those substrings in O(m + n) time.
Proof.
We first construct the suffix tree of P and sort the input substrings by their lengths. Since the lengths of the input substrings are at most m, one can sort them in O(m + n) time by bucket sort. We then identify the corresponding nodes using niqs in non-increasing order of length. The conversion of the identified implicit nodes is performed in the reverse order, which ensures that the representations of those nodes remain valid when they are converted. By Lemma 18, one can identify and convert the nodes in O(m + n) time.
Corollary 21.
One can construct the extended suffix tree of P with the roots of all runs in O(m) time.
Proof.
One can enumerate the positions of the runs in P in linear time by Kolpakov and Kucherov's algorithm [18]. Then, applying Lemma 20 to the positions of their roots, we obtain the tree. We note that the same substring u may be the root of different runs ρ and ρ′ in P, in which case we identify the same node twice or more. This is not a problem at all when constructing the tree. For the succeeding procedure, among those runs, we pick a longest one, denoted as run(u), and retain its occurrence position in the identified node.
4.2.2 Construction of the extended suffix tree with the nodes for Q₂
The goal of this subsection is to identify the nodes for all elements of Q₂ on the tree and to compute mrep(u), mrc(u), and proot(u) for them. We first focus on the node identification.
Identifying the nodes of Q₂
Consider a run P[i..j] with smallest period p and root r = P[i..i+p−1], and write P[i..j] = r^c r′ where r′ is a proper prefix of r. If c ≥ 3, this run contains the squares of all the rotations of r. If c = 2, the run contains the squares of the ℓth-rotations of r for 0 ≤ ℓ ≤ |r′|. Conversely, every u ∈ Q₂ is a rotation of the root of some run with smallest period |u| in which u^2 occurs.
One can identify those nodes by niqs on positions shifted within the runs; that is, for the run above, we pose niqs on the positions i + ℓ of the rotations for all admissible ℓ. However, a naive implementation of this idea may be redundant and inefficient: the same rotation may arise from different runs, so the same node may be identified several times. In order to bound the number of niqs by O(m), we maintain found nodes in rotation lists consisting of rotation links. A rotation link is a link from the node for a rotation u to the node for the next rotation of u. Therefore, identifying the nodes of Q₂ and creating the rotation lists are equivalent tasks. Examples of rotation lists are shown in Figure 2.
Suppose the same string arises as a rotation from two different runs ρ and ρ′. If we process ρ′ after ρ, then on the way of constructing the rotation list starting from the root of ρ′, we reach a node that already belongs to the rotation list of ρ. Then, without tracing the rotation links one by one, we can jump to the end of the rotation list of ρ and continue growing the list from there if rotations remain. Otherwise, this finishes the rotation list starting at the root of ρ′. This is the basic idea of using the rotation lists for efficiently finding the nodes of Q₂. See Figure 3.
Algorithm 1 (CreateAllRotLists) computes all the rotation lists from the roots of the runs in non-increasing order of the lengths of their corresponding strings. The function CreateRotList creates the list of nodes in this order. Those nodes are created but not embedded into the tree (if they are not explicit) until we find all the nodes of Q₂. The subroutine GetRotLink poses an niq on the position shifted by one from a node u to obtain the node u′ for the next rotation. The rotation link is remembered in u as rot(u), which will be used later. The value cnt(u) maintains the number of nodes in the rotation list counting from its head up to u; cnt(u) is initialized to be zero to denote that the node u has not been visited. We remember the end node of a rotation list starting at u as end(u) if the list is non-circular.
When extending the list from a node, as long as we do not encounter another node already in a list, we repeatedly call GetRotLink and get rotation links using niqs to extend the rotation list. At some point, we may reach a node v of another list. In this case, the current list shall be extended to the end of the list containing v. If v has been processed earlier, i.e., end(v) is determined, we jump to end(v). If not, we recursively call CreateRotList to find the end of the list from v. In either case, we will reach the end. Then, the count may reach the number of all rotations, in which case the list is complete. Otherwise, we continue extending the list.
It is possible that the nested recursive calls of CreateRotList circulate. In this case, we visit a node for the second time. This is detected when end of that node has not yet been determined (Line 22). In this case, the list is circular and we discontinue extending the list.
When the algorithm halts, each rotation list has just one node u at which the construction started. For that node u, the following holds:
-
if end(u) is determined, u is the head of a non-circular rotation list which ends in end(u),
-
if end(u) is not determined, u is in a circular rotation list.
Those values will be useful when computing mrep(u) and mrc(u) for the nodes u in each rotation list.
Lemma 22.
Given a position of run(u) for the root u of each run in P, one can compute all rotation lists in O(m) time.
Proof.
One can easily verify the above invariant properties of the variables used in the algorithm. We discuss the time complexity. For each node u, GetRotLink (Line 17) is called at most once, namely from the node of which u is the next rotation. Thus, the number of calls of GetRotLink is bounded by |Q₂| ∈ O(m) by Lemma 19. Each call of GetRotLink involves an niq. Since we process the elements in non-increasing order of length, those queries are posed in non-increasing order, and thus Lemma 18 applies. Hence, the total time used by the function GetRotLink is bounded by O(m). Therefore, the algorithm runs in time proportional to m.
Computing answers to queries on the nodes of Q₂
We now compute mrep(u), mrc(u), and proot(u) for all u ∈ Q₂. The value of proot(u) coincides with a position of u itself, which has already been computed when creating the rotation lists. Concerning the others, we remark that
Σ_{u ∈ Q₂} mrc(u) ∈ O(m)
by Lemma 19.
We compute the values by following each rotation list. During the iteration, we maintain (the length of) the longest pseudo-run with u, denoted by L(u), when visiting u. A pseudo-run with u is defined to be a substring of P of the form u^c v with c ≥ 2 and v a proper prefix of u. The maximum repetition and the maximum repetition count of u are easily computed from the length of the longest pseudo-run with u. We can compute L(u′) for the next rotation u′ from L(u) based on the fact that the longest pseudo-run with u′ is obtained from that with u by trimming its first character and possibly extending it at the end, where the trimmed and extending characters are assumed to be absent when the pseudo-run touches the corresponding end of P.
Suppose a rotation list is not circular and its head node is u. In this case, there is no rotation whose pseudo-run could extend that of u from the left. Thus, L(u) arises from run(u). The maximum repetition of u is the prefix of the longest pseudo-run with u of length mrc(u)|u|. So, mrep(u) and mrc(u) can be computed. We proceed to the next node u′ in the rotation list. If run(u′) = run(u), the longest pseudo-run with u′ is the one obtained from L(u). If run(u′) ≠ run(u), L(u′) is the longer of the one obtained from L(u) and the pseudo-run arising from run(u′). In either case, mrep(u′) and mrc(u′) can be computed from L(u′). We repeat this process.
If a rotation list is circular, we begin the process from a node u with the largest |run(u)| in the list, which is not necessarily the head. This node is found by traversing the rotation list once. In this case, there exists a preceding node u′ in the list whose pseudo-run could in principle extend that of u. However, by the choice of u, it is guaranteed that run(u) is at least as long as any such pseudo-run. Thus, L(u) arises from run(u). We proceed to the next node in the same manner as in the non-circular case.
Note that the maximum repetition count mrc(u) is obtained from L(u) and |u| by dividing L(u) by |u|. We perform division by subtraction; that is, to compute the quotient, we repeatedly subtract |u| from L(u) until the value becomes smaller than |u|. This division costs O(mrc(u)) time. Thus, the total division cost is O(Σ_{u ∈ Q₂} mrc(u)) ⊆ O(m). (During the traversal of rotation lists, by maintaining the remainder of the division of L(u) by |u| when computing those values for u, one does not have to perform this “division” for the succeeding nodes.)
Therefore, we can compute mrep(u), mrc(u), and proot(u) for all u ∈ Q₂ in O(m) time.
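The division-by-subtraction step can be sketched as follows; its cost is proportional to the quotient, which is exactly what the amortization argument above charges for:

```python
def exponent_by_subtraction(L: int, period: int) -> int:
    """Compute floor(L / period) without the division instruction."""
    e = 0
    while L >= period:
        L -= period
        e += 1
    return e

assert exponent_by_subtraction(17, 5) == 3
assert exponent_by_subtraction(12, 3) == 4
```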
The nodes in the rotation lists, equipped with the repetition information, are embedded into the tree in O(m) time by Lemma 20. Lemma 23 summarizes Section 4.2.2.
Lemma 23.
Given the extended suffix tree with the roots of runs and the positions of the runs, one can compute the extended suffix tree with the nodes for Q₂, together with their repetition information, in O(m) time.
4.2.3 Construction of RIST(P)
Every element of Q \ Q₂ is of the form r^j for some r ∈ Q₂ and j ≥ 2. Using the values of mrep(r), mrc(r), and proot(r) for r ∈ Q₂, we compute a position of r^j, mrc(r^j), mrep(r^j), and proot(r^j) for each j in the ascending order. Let e = mrc(r) and i be the start position of mrep(r). A position of r^j is given by (i, j|r|). The value proot(r^j) is set to be proot(r). To compute mrc(r^j), we use the fact that (r^j)^c ∈ Substr(P) if and only if jc ≤ e, so mrc(r^j) is the largest c satisfying jc ≤ e. The value is obtained by initializing c to be mrc(r^(j−1)) and repeatedly decrementing it one by one while checking the above inequality. Then, mrep(r^j) is given by the position (i, j|r| · mrc(r^j)). Since c is monotonically decremented from e, the process to compute the concerned values for all repetitions of r in P can be performed in O(e) time. Summing the cost for all r ∈ Q₂, we have O(m).
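The monotone decrementing described above can be sketched as follows (function name ours); the total number of decrements over all j is O(e), since the value only goes down from e:

```python
def mrc_of_powers(e: int, j_max: int):
    """mrc(r^j) = floor(e / j) for j = 2..j_max, computed without division.

    e is the maximum repetition count of the primitive root r."""
    counts = {}
    c = e                       # mrc(r^1) = e
    for j in range(2, j_max + 1):
        while c * j > e:        # decrement until c = floor(e / j)
            c -= 1
        counts[j] = c
    return counts

assert mrc_of_powers(10, 5) == {2: 5, 3: 3, 4: 2, 5: 2}
```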
Lemma 24.
We can construct RIST(P) in O(m) time.
Proof.
After computing the positions of r^j, mrc(r^j), mrep(r^j), and proot(r^j) for all r ∈ Q₂ and 2 ≤ j ≤ mrc(r), we modify the tree of Lemma 23 into RIST(P) and let the new explicit nodes retain those values. Note that Lemma 20 is valid for updating extended suffix trees.
Lemma 25.
We can preprocess P in O(m) time so that q mrqs and prqs can be answered in O(q) total time.
Proof.
Given (positions of) substrings of P as input, we identify the corresponding nodes in RIST(P) using niqs. For each queried substring u, if u turns out to be in Q, we return mrep(u) and mrc(u), or proot(u). Otherwise, we return the queried position itself, because the maximum repetition and the primitive root of u are u itself, with the maximum repetition count 1. From Lemma 17, q mrqs or prqs can be answered in O(q) total time.
5 Pattern Matching on RLSLP
Let R_h (resp. B_h) be the set of nonterminals X in G such that the derivation tree of X has height h and X has a rule of the form X → Y^k (resp. X → YZ or X → a). PSI-information for nonterminals is computed in the order of increasing height h. PSI-information for all X ∈ B_h can be computed in O(|B_h|) time by Lemma 8. We can also compute PSI-information for all X ∈ R_h in O(|R_h|) time.
Lemma 26.
We can preprocess P in O(m) time so that, given run-length rules X_i → Y_i^{k_i} (1 ≤ i ≤ r) where the PSI-information for Y_i has already been computed, we can compute the PSI-information for all X_i's in O(r) total time.
Proof.
Lemma 27.
We can compute the PSI-information for all nonterminals in G in O(m + g) time.
Proof.
Recall that G is balanced: the height of the derivation tree is O(log N). By Lemmas 8 and 26, the total time complexity is O(m + g). For each run-length rule X → Y^k, we can also determine whether P occurs in val(X) in linear time.
Lemma 28.
For run-length rules X_i → Y_i^{k_i} (1 ≤ i ≤ r), if the PSI-information for Y_i has already been computed, we can determine whether P occurs in val(X_i) for all 1 ≤ i ≤ r in O(m + r) total time.
Proof.
Theorem 1.
Given a pattern P of length m and an RLSLP G of size g, we can decide whether P occurs in the text T described by G in O(m + g) time.
Proof.
From Lemma 27, we can compute the PSI-information for all nonterminals in G in O(m + g) time. From Lemmas 7 and 28, using the computed PSI-information, we can decide whether P occurs in val(X) for all binary rules X → YZ in O(m + g) time, and whether P occurs in val(X) for all run-length rules X → Y^k in O(m + g) time. Therefore, the total time complexity is O(m + g).
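To illustrate the overall bottom-up structure (this is a folklore quadratic-space-per-symbol baseline, not the linear-time algorithm of this paper), the following sketch keeps, for every nonterminal, only a prefix context and a suffix context of length m − 1, checks newly created occurrences at each rule, and handles run-length rules with O(m/|val(Y)|) explicit copies:

```python
def occurs(rules, order, P):
    """Baseline RLSLP matcher; order lists nonterminals bottom-up."""
    m = len(P)
    assert m >= 2                       # m = 1 reduces to character search
    first = lambda s: s[:m - 1]         # prefix context of length < m
    last = lambda s: s[-(m - 1):]       # suffix context of length < m
    pre, suf, found = {}, {}, False
    for X in order:
        r = rules[X]
        if r[0] == "term":              # X -> a
            s = px = sx = r[1]
        elif r[0] == "bin":             # X -> YZ: new matches cross the seam
            _, Y, Z = r
            s = suf[Y] + pre[Z]
            px, sx = pre[Y] + pre[Z], suf[Y] + suf[Z]
        else:                           # X -> Y^k (run-length rule)
            _, Y, k = r
            if len(pre[Y]) < m - 1:     # val(Y) is short, hence fully known
                y = pre[Y]
                s = y * min(k, m // len(y) + 2)
                px = sx = y * min(k, m)
            else:                       # a match spans at most one seam
                s = suf[Y] + pre[Y]
                px, sx = pre[Y], suf[Y]
        found = found or P in s
        pre[X], suf[X] = first(px), last(sx)
    return found

rules = {
    "A1": ("term", "a"), "A2": ("term", "b"),
    "A3": ("bin", "A1", "A2"),          # val = ab
    "A4": ("rep", "A3", 3),             # val = ababab
    "A5": ("bin", "A4", "A1"),          # val = abababa
}
order = ["A1", "A2", "A3", "A4", "A5"]
assert occurs(rules, order, "baba") is True
assert occurs(rules, order, "bb") is False
```

The contexts bound the work per rule by O(m), giving O(gm) time overall; the point of the paper's PSI-information and repetition queries is to replace these explicit length-(m − 1) strings by constant-size descriptors.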
6 Conclusion
In this paper, we presented a linear-time algorithm for pattern matching on run-length grammar-compressed strings by generalizing Ganardi and Gawrychowski's algorithm for straight-line programs [6]. The algorithm applies to any RLSLP and runs in O(m + g) time for a pattern of length m and an RLSLP of size g.
It remains an open problem whether there exists a linear-time algorithm for pattern matching on iterated straight-line programs [25], which are a further extension of RLSLPs.
References
- [1] Amihood Amir and Gary Benson. Efficient two-dimensional compressed matching. In Data Compression Conference 1992, pages 279–288, 1992. doi:10.1109/DCC.1992.227453.
- [2] Amihood Amir, Gary Benson, and Martin Farach. Let sleeping files lie: Pattern matching in Z-compressed files. Journal of Computer and System Sciences, 52(2):299–307, 1996. doi:10.1006/jcss.1996.0023.
- [3] Philip Bille, Gad M. Landau, Rajeev Raman, Kunihiko Sadakane, Srinivasa Rao Satti, and Oren Weimann. Random access to grammar-compressed strings and trees. SIAM Journal on Computing, 44(3):513–539, 2015. doi:10.1137/130936889.
- [4] Martin Farach and Mikkel Thorup. String matching in Lempel-Ziv compressed strings. Algorithmica, 20(4):388–404, 1998. doi:10.1007/PL00009202.
- [5] Simone Faro and Thierry Lecroq. The exact online string matching problem: A review of the most recent results. ACM Comput. Surv., 45(2), March 2013. doi:10.1145/2431211.2431212.
- [6] Moses Ganardi and Paweł Gawrychowski. Pattern matching on grammar-compressed strings in linear time. In Proceedings of the 2022 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2833–2846, 2022. doi:10.1137/1.9781611977073.110.
- [7] Moses Ganardi, Artur Jeż, and Markus Lohrey. Balancing straight-line programs. J. ACM, 68(4), June 2021. doi:10.1145/3457389.
- [8] Paweł Gawrychowski. Pattern matching in Lempel-Ziv compressed strings: Fast, simple, and deterministic. In Algorithms – ESA 2011, pages 421–432, 2011. doi:10.1007/978-3-642-23719-5_36.
- [9] Paweł Gawrychowski. Optimal pattern matching in LZW compressed strings. ACM Trans. Algorithms, 9(3), June 2013. doi:10.1145/2483699.2483705.
- [10] Keisuke Goto, Hideo Bannai, Shunsuke Inenaga, and Masayuki Takeda. Fast $q$-gram mining on SLP compressed strings. Journal of Discrete Algorithms, 18:89–99, 2013. doi:10.1016/j.jda.2012.07.006.
- [11] Dan Gusfield. Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, 1997. doi:10.1017/CBO9780511574931.
- [12] Marek Karpinski, Wojciech Rytter, and Ayumi Shinohara. Pattern-matching for strings with short descriptions. In Proc. The 6th Annual Symposium on Combinatorial Pattern Matching (CPM95), volume 937 of Lecture Notes in Computer Science, pages 205–214. Springer, 1995. doi:10.1007/3-540-60044-2_44.
- [13] Dominik Kempa and Tomasz Kociumaka. Collapsing the hierarchy of compressed data structures: Suffix arrays in optimal compressed space. In 2023 IEEE 64th Annual Symposium on Foundations of Computer Science (FOCS), pages 1877–1886, November 2023. doi:10.1109/FOCS57990.2023.00114.
- [14] Takuya Kida, Tetsuya Matsumoto, Yusuke Shibata, Masayuki Takeda, Ayumi Shinohara, and Setsuo Arikawa. Collage system: a unifying framework for compressed pattern matching. Theoretical Computer Science, 298(1):253–272, 2003. doi:10.1016/S0304-3975(02)00426-7.
- [15] Tomasz Kociumaka, Marcin Kubica, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. A linear-time algorithm for seeds computation. ACM Trans. Algorithms, 16(2), April 2020. doi:10.1145/3386369.
- [16] Tomasz Kociumaka, Gonzalo Navarro, and Francisco Olivares. Near-optimal search time in $\delta$-optimal space. In LATIN 2022: Theoretical Informatics, pages 88–103, 2022. doi:10.1007/978-3-031-20624-5_6.
- [17] Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Toward a definitive compressibility measure for repetitive sequences. IEEE Transactions on Information Theory, 69(4):2074–2092, 2023. doi:10.1109/TIT.2022.3224382.
- [18] Roman Kolpakov and Gregory Kucherov. Finding maximal repetitions in a word in linear time. In 40th Annual Symposium on Foundations of Computer Science (Cat. No.99CB37039), pages 596–604, 1999. doi:10.1109/SFFCS.1999.814634.
- [19] S. Rao Kosaraju. Pattern matching in compressed texts. In Foundations of Software Technology and Theoretical Computer Science, pages 349–362, 1995. doi:10.1007/3-540-60692-0_60.
- [20] Shuo Li, Jakub Pachocki, and Jakub Radoszewski. A note on the maximum number of $k$-powers in a finite word. The Electronic Journal of Combinatorics, 31(3), 2024. doi:10.37236/11270.
- [21] Markus Lohrey. Algorithmics on SLP-compressed strings: A survey. Groups - Complexity - Cryptology, 4(2):241–299, 2012. doi:10.1515/gcc-2012-0016.
- [22] Michael G. Main and Richard J. Lorentz. An algorithm for finding all repetitions in a string. J. Algorithms, 5(3):422–432, 1984. doi:10.1016/0196-6774(84)90021-X.
- [23] Gonzalo Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88, 2001. doi:10.1145/375360.375365.
- [24] Gonzalo Navarro, Francisco Olivares, and Cristian Urbina. Balancing run-length straight-line programs. In String Processing and Information Retrieval, pages 117–131, 2022. doi:10.1007/978-3-031-20643-6_9.
- [25] Gonzalo Navarro and Cristian Urbina. Iterated straight-line programs. In LATIN 2024: Theoretical Informatics, pages 66–80, 2024. doi:10.1007/978-3-031-55598-5_5.
- [26] Takaaki Nishimoto, Tomohiro I, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Fully Dynamic Data Structure for LCE Queries in Compressed Space. In 41st International Symposium on Mathematical Foundations of Computer Science (MFCS 2016), volume 58 of Leibniz International Proceedings in Informatics (LIPIcs), pages 72:1–72:14, 2016. doi:10.4230/LIPIcs.MFCS.2016.72.
- [27] Jakub Radoszewski. Linear Time Construction of Cover Suffix Tree and Applications. In 31st Annual European Symposium on Algorithms (ESA 2023), volume 274 of Leibniz International Proceedings in Informatics (LIPIcs), pages 89:1–89:17, 2023. doi:10.4230/LIPIcs.ESA.2023.89.
- [28] Wojciech Rytter. Grammar compression, LZ-encodings, and string algorithms with implicit input. In Automata, Languages and Programming (ICALP 2004), pages 15–27, 2004. doi:10.1007/978-3-540-27836-8_5.
- [29] Masayuki Takeda and Ayumi Shinohara. Pattern Matching on Compressed Text, in Encyclopedia of Algorithms, pages 1538–1542. Springer New York, 2016. doi:10.1007/978-1-4939-2864-4_81.
- [30] Esko Ukkonen. On-line construction of suffix trees. Algorithmica, 14:249–260, 1995. doi:10.1007/BF01206331.
- [31] Peter Weiner. Linear pattern matching algorithms. In 14th Annual Symposium on Switching and Automata Theory (SWAT 1973), pages 1–11, 1973. doi:10.1109/SWAT.1973.13.
- [32] Takanori Yamamoto, Hideo Bannai, Shunsuke Inenaga, and Masayuki Takeda. Faster subsequence and don’t-care pattern matching on compressed texts. In Combinatorial Pattern Matching, pages 309–322, 2011. doi:10.1007/978-3-642-21458-5_27.