Abstract 1 Introduction 2 Upper Bound on the Number of 𝒌-Runs 3 -Suffix Trees and Their Applications 4 Counting p-Squares in 𝓞(𝒏𝝈𝐥𝐨𝐠𝝈) Time 5 Counting Generalised Squares in 𝓞(𝒏𝐥𝐨𝐠𝒏) Time References

Fast Computation of k-Runs, Parameterized Squares, and Other Generalised Squares

Yuto Nakashima ORCID Department of Informatics, Kyushu University, Fukuoka, Japan Jakub Radoszewski ORCID Institute of Informatics, University of Warsaw, Poland Tomasz Waleń ORCID Institute of Informatics, University of Warsaw, Poland
Abstract

A k-mismatch square is a string of the form XY where X and Y are two equal-length strings that have at most k mismatches. Kolpakov and Kucherov [Theor. Comput. Sci., 2003] defined two notions of k-mismatch repeats, called k-repetitions and k-runs, each representing a sequence of consecutive k-mismatch squares of equal length. They proposed algorithms for computing k-repetitions and k-runs working in 𝒪(nklogk+𝗈𝗎𝗍𝗉𝗎𝗍) time for a string of length n over an integer alphabet, where 𝗈𝗎𝗍𝗉𝗎𝗍 is the number of the reported repeats. We show that 𝗈𝗎𝗍𝗉𝗎𝗍=𝒪(nklogk), both in case of k-repetitions and k-runs, which implies that the complexity of their algorithms is actually 𝒪(nklogk). We apply this result to computing parameterized squares.

A parameterized square is a string of the form XY such that X and Y parameterized-match, i.e., there exists a bijection f on the alphabet such that f(X)=Y. Two parameterized squares XY and XY are equivalent if they parameterized match. Recently Hamai et al. [SPIRE 2024] showed that a string of length n over an alphabet of size σ contains less than nσ non-equivalent parameterized squares, improving an earlier bound by Kociumaka et al. [Theor. Comput. Sci., 2016]. We apply our bound for k-mismatch repeats to propose an algorithm that reports all non-equivalent parameterized squares in 𝒪(nσlogσ) time. We also show that the number of non-equivalent parameterized squares can be computed in 𝒪(nlogn) time. This last algorithm applies to squares under any substring compatible equivalence relation and also to counting squares that are distinct as strings. In particular, this improves upon the 𝒪(nσ)-time algorithm of Gawrychowski et al. [CPM 2023] for counting order-preserving squares that are distinct as strings if σ=ω(logn).

Keywords and phrases:
string algorithm, k-mismatch square, parameterized square, order-preserving square, maximum gapped repeat
Funding:
Yuto Nakashima: JSPS KAKENHI Grant Number JP25K00136.
Jakub Radoszewski: Supported by the Polish National Science Center, grant no. 2022/46/E/ST6/00463.
Copyright and License:
[Uncaptioned image] © Yuto Nakashima, Jakub Radoszewski, and Tomasz Waleń; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation Pattern matching
Related Version:
Full Version: https://arxiv.org/abs/2509.02179
Editors:
Anne Benoit, Haim Kaplan, Sebastian Wild, and Grzegorz Herman

1 Introduction

A string of the form XX, for any string X, is called a square (or a tandem repeat). Squares are a classic notion in combinatorics on words (see, e.g., the early works of Thue [53] on avoidability of square substrings), text algorithms (starting from an algorithm for computing square substrings by Main and Lorentz [40]), and bioinformatics (see Gusfield’s book [22]). A string of length n contains at most n distinct square substrings [8] and all of them can be computed in 𝒪(n) time [4, 9, 14, 23] or even 𝒪(n/logσn) time [10] assuming that the string is over an integer alphabet of size σ. A maximal sequence of consecutive squares in a string is called a run (or a generalised run; cf. Section 2); see [33]. A string of length n contains at most n runs [3] and they can all be computed in 𝒪(n) time [3, 17]. Our work is devoted to efficient algorithms for computing known generalizations of squares and runs: k-mismatch squares represented as k-runs or k-repetitions, parameterized squares, and generalised squares that include parameterized squares, order-preserving squares, and Cartesian-tree squares.

We assume that positions in a string X are numbered from 1 to |X|, so that X[i] is the ith character of X. For integers i,j such that 1ij|X|, by X[i..j]=X[i..j+1) we denote the substring composed of characters X[i],X[i+1],,X[j]. We use similar notation for integer intervals: [i..j]=[i..j+1)={i,i+1,,j}.

We say that a length-n string is over an integer alphabet if its letters belong to [0..n𝒪(1)]. We use the word-RAM model of computation.

1.1 𝒌-Mismatch Squares and 𝒌-Runs

For two equal-length strings X and Y, their Hamming distance is defined as dH(X,Y)=|{i[1..|X|]:X[i]Y[i]}|. A k-mismatch square (also known under the name of k-mismatch tandem repeat) is a string XY such that |X|=|Y| and dH(X,Y)k. Let T be a string of length n. A k-run of period in T (cf. [38]) is a maximal substring T[a..b) such that T[i..i+2) is a k-mismatch square for every i[a..b2]; see Figure 1. Maximality means that the k-run is extended to the right and left as much as possible provided that the definition is still satisfied.

Figure 1: String T[1..26] contains two 2-runs with period 8, T[1..18] and T[5..24]. The two 2-runs represent three 2-mismatch squares (below) and five 2-mismatch squares (above), respectively; mismatches are shown in red. Strings T[x..x+16) for x{4,10,11} are not 2-mismatch squares.

Landau, Schmidt and Sokol [39] showed an algorithm for computing k-runs in a string of length n that works in 𝒪(nklog(n/k)) time; hence, the total number of k-runs reported is 𝒪(nklog(n/k)). Kolpakov and Kucherov [34, 35] showed that all k-runs (called there runs of k-mismatch tandem repeats) in a string of length n can be computed in 𝒪(nklogk+𝗈𝗎𝗍𝗉𝗎𝗍) time, where 𝗈𝗎𝗍𝗉𝗎𝗍 is the number of k-runs. Many other algorithms for computing approximate tandem repeats under various metrics, in the context of computational biology and with the aid of statistical methods and heuristics, were proposed [6, 7, 15, 16, 28, 31, 36, 43, 45, 46, 47, 48, 49, 50, 51, 52, 55, 56]. In Section 2 we show the following theorem.

Theorem 1.

A string of length n contains 𝒪(nklogk) k-runs.

Theorem 1 provides the first 𝒪(n) upper bound on the number of k-runs for k=𝒪(1) and implies that the algorithm of Kolpakov and Kucherov computing k-runs actually works in 𝒪(nklogk) time. Actually, we show a stronger condition that a string T of length n contains 𝒪(nklogk) uniform k-runs. Intuitively, a uniform k-run is a maximal sequence of consecutive k-mismatch squares of the same length in which the mismatches of all squares are at the same positions. (See Section 2 for a formal definition.) Our bound on the number of uniform k-runs implies the bound for k-runs as well as an 𝒪(nklogk) upper bound on the number of k-repetitions as defined in [34, 35] (called there k-mismatch globally defined repetitions).

To prove Theorem 1, we explore a combinatorial relation between k-runs and maximum gapped repeats [32] and apply the optimal 𝒪(nα) bound on the number of maximal α-gapped repeats in a length-n string [20, 27].

1.2 Parameterized Squares

For a string X, by Alph(X) we denote the set of characters of X. Two strings X, Y parameterized match if |X|=|Y| and there is a bijection f:Alph(X)Alph(Y) such that f(X)=Y (i.e., |X|=|Y| and f(X[i])=Y[i] for all i[1..|X|]). A parameterized square (p-square, in short) is a string XY such that X parameterized matches Y (see Figure 2). Two p-squares XY and XY are called equivalent if they parameterized match.

Parameterized matching was introduced by Baker [2] motivated by applications in code refactoring and plagiarism detection. The notion of p-squares was introduced by Kociumaka, Radoszewski, Rytter, and Waleń [30] who showed that a string of length n over alphabet of size σ contains at most 2σ!n non-equivalent p-squares. They also considered avoidability of parameterized cubes. The bounds from [30] were recently improved by Hamai, Taketsugu, Nakashima, Inenaga, and Bannai [24]; they showed that a length-n string over alphabet of size σ contains less than σn non-equivalent p-squares. This bound automatically implies that such a string contains at most σ!σn p-squares that are distinct as strings (improving the 2(σ!)2n upper bound from [30]). We show that all non-equivalent p-squares can be computed efficiently; our approach also extends to p-squares that are distinct as strings. See Section 4.

Theorem 2.

All non-equivalent p-square substrings in a string of length n over alphabet of size σ can be computed in 𝒪(nσlogσ) time.

All p-square substrings that are distinct as strings can be computed in 𝒪(nσlogσ+𝗈𝗎𝗍𝗉𝗎𝗍) time, where 𝗈𝗎𝗍𝗉𝗎𝗍 is the number of p-squares reported.

By the above discussion, all p-square substrings that are distinct as strings can be computed in 𝒪(n(σ+1)!) time. The key ingredient in the algorithm behind Theorem 2 is a relation between p-squares and σ-mismatch squares (and uniform σ-runs).

1.3 Generalised Squares under Substring Consistent Equivalence Relations

We then show an alternative algorithm for counting p-squares whose complexity does not depend on the alphabet size σ. The algorithm is stated for squares under any substring consistent equivalence relation (SCER in short). An SCER is a relation on strings such that XY implies that (1) |X|=|Y| and (2) X[i..j]Y[i..j] for all 1ij|X|; see [42]. Known examples of SCERs include parameterized matching, order-preserving matching [29, 37], Cartesian tree matching [44], and palindrome pattern matching [26]. Two strings X and Y order-preserving match if |X|=|Y|, sets Alph(X) and Alph(Y) are totally ordered, and there is a bijection f:Alph(X)Alph(Y) that is increasing (i.e., if x<y then f(x)<f(y)) such that f(X)=Y. Strings X, Y Cartesian-tree match if they have the same shape of a Cartesian tree (cf. [54]). Finally, strings X and Y palindrome match if for all 1ij|X|, X[i..j] is a palindrome if and only if Y[i..j] is a palindrome.

An -square is a string XY such that XY under SCER . Thus p-squares and order-preserving squares [30] (op-squares, in short) as well as Cartesian-tree squares [44] (CT-squares) and squares in the sense of palindrome matching (palindrome-squares) are -squares for the respective SCERs . See Figure 2 for an example.

Figure 2: For string S=1322434412323, X=132 243 is an op-square (hence, automatically, a p-square and a CT-square), Y=2243 4412 is a p-square, and Z=412 323 is a CT-square. Among the three substrings, only Z is not a palindrome-square, as its second half is a palindrome whereas its first half is not.

In a prefix consistent equivalence relation (PCER), condition (2) of an SCER only needs to hold for i=1. An SCER is a PCER. We say that :Σ is an encoding function if

XY(|X|=|Y|i[1..|X|](X[1..i])=(Y[1..i])). (1)
Observation 3.

For every PCER on strings over integer alphabet there exists an encoding function.

Proof.

Let (X) be defined as the number of the equivalence class under of X among all strings of length |X|. Let us verify that satisfies equivalence (1). () If XY, then, by definition, |X|=|Y| and X[1..i]Y[1..i] for all i[1..|X|]. Then indeed (X[1..i])=(Y[1..i]). () Taking i=|X|=|Y|, we obtain (X)=(Y). As |X|=|Y|, we have XY.

The encoding function defined in the proof of Observation 3 could be hard to compute efficiently for a particular PCER . It is also stronger, as it satisfies XY(|X|=|Y|(X)=(Y)). Luckily, for the aforementioned known SCERs efficient encoding functions are known, as shown in the following Example 4.

For an encoding function , let π(n), ρ(n) be integer sequences such that for a string T of length n over integer alphabet, after π(n) preprocessing time, (T[i..j]) for any i,j can be computed in ρ(n) time.

Example 4.

For parameterized matching, there exists a classic 𝑝𝑟𝑒𝑣-encoding, see [2], that is an encoding function:

𝑝𝑟𝑒𝑣(X)={|X|iX[i]=X[|X|] and X[j]X[|X|] for all j[i+1..|X|)0otherwise

The 𝑝𝑟𝑒𝑣-encodings of all prefixes of a string T can be computed by sorting all pairs (T[i],i). If T is over an integer alphabet, this is performed in π𝑝𝑟𝑒𝑣(n)=𝒪(n) time. Then 𝑝𝑟𝑒𝑣(T[i..j]) can be computed from 𝑝𝑟𝑒𝑣(T[1..j]) in ρ𝑝𝑟𝑒𝑣(n)=𝒪(1) time.

For order-preserving matching, one can use an encoding as pairs (α(X),β(X)). Here

α(X) is the largest j<|X| such that X[j]=max{X[k]:k[1..|X|),X[k]X[|X|]},

and if there is no such j, then α(X)=0. Similarly,

β(X) is the largest j<|X| such that X[j]=min{X[k]:k[1..|X|),X[k]X[|X|]},

and β(X)=0 if no such j exists; see [13, 29, 37]. As we require the encoding function to return integers, we can set (X)=α(X)|X|+β(X), since α(X),β(X)[0..|X|). Then [13, Lemma 4] implies that is an encoding function for order-preserving matching and [13, Lemma 24] gives π(n)=𝒪(nlogn), ρ(n)=𝒪(logn/loglogn).

For Cartesian-tree matching, [44, Theorem 1] shows that the following parent-distance representation:

𝑃𝐷(X)={|X|max1j<|X|{j:X[j]X[|X|]}if such j exists0otherwise

is an encoding function. Encodings of all prefixes of a string T can be computed in π𝑃𝐷(n)=𝒪(n) time using a folklore nearest smaller value algorithm. In [44, Section 5.2] it is noted that 𝑃𝐷(T[i..j]) can be computed from 𝑃𝐷(T[1..j]) in ρ𝑃𝐷(n)=𝒪(1) time.

By XR we denote the reverse of string X. In [26, Lemma 2] it is shown that the length of the longest suffix palindrome of X, formally:

𝐿𝑝𝑎𝑙(X)=max{|X|k+1:X[k..|X|]=X[k..|X|]R},

is an encoding function for palindrome matching.

Computing encodings of substrings 𝐿𝑝𝑎𝑙(T[i..j]) can be reduced in linear time to 2D range successor queries as follows.

In the 2D range successor problem we are given n points in an n×n grid and we are to answer queries that, given a rectangle in the grid, report a point in the rectangle with the smallest first coordinate, if any. Such queries can be answered in 𝒪(loglogn) time after 𝒪(nlogn) preprocessing [18].

In the reduction, we compute all maximal palindromes in T in 𝒪(n) time using Manacher’s algorithm [41]. Let us show how we deal with odd-length palindromes; the approach for even-length palindromes is analogous. For a maximal palindrome T[cr..c+r] with c[1..n] and r0, we create a point (c,c+r). To compute 𝐿𝑝𝑎𝑙(T[i..j]), it suffices to find the point in the rectangle [(i+j)/2..j]×[j..) with the smallest first coordinate. If x is the sought coordinate, the longest odd-length suffix palindrome of T[i..j] has length 2(jx)+1.

By [18], we get π𝐿𝑝𝑎𝑙(n)=𝒪(nlogn) and ρ𝐿𝑝𝑎𝑙(n)=𝒪(loglogn).

In Section 5 we show the next theorem on efficient counting of -squares.

Theorem 5.

Let be an SCER and be its encoding function. The number of non-equivalent -square substrings in a length-n string can be computed in 𝒪(π(n)+nρ(n)+nlogn) time.

The number of -square substrings in a length-n string that are distinct as strings can be computed in the same time complexity.

Consequently, in a length-n string, the numbers of non-equivalent p-squares, op-squares, CT-squares, and palindrome-squares, as well as the numbers of the respective squares that are distinct as strings can be computed in 𝒪(nlogn) time, as the respective SCERs enjoy encoding functions with π(n)=𝒪(nlogn) and ρ(n)=𝒪(logn) as shown in Example 4. In particular, the algorithm of Theorem 5 in the case that σ=ω(logn) improves upon the 𝒪(nσ)-time algorithm of Gawrychowski, Ghazawi, and Landau [19] for reporting (thus, counting) op-squares that are distinct as strings. Moreover, the obtained complexity for counting non-equivalent p-squares (p-squares that are distinct as strings, respectively) is better than the one in Theorem 2 if σ=ω(logn/loglogn) (σ=Ω(loglogn), respectively).

Theorem 5 involves constructing a -suffix tree data structure (see Section 3 for a definition).

2 Upper Bound on the Number of 𝒌-Runs

This section introduces an upper bound on the number of k-runs (proof of Theorem 1) which implies their efficient computation by the algorithm of Kolpakov and Kucherov [34, 35]. A position i in string T is called an -mismatching position if i[1..n] and T[i]T[i+]. In particular, T[i..i+2) is a k-mismatch square if and only if T[i..i+) contains at most k -mismatching positions. Let us formally define a uniform k-run.

Definition 6.

A substring T[a..b) is called a uniform k-run of period if ba2 and the sets {j[i..i+):T[j]T[j+]} of -mismatching positions for all i[a..b2] have cardinality at most k and are all the same. We are interested in maximal uniform k-runs.

A gapped repeat in a string T is a fragment of the form UVU, for |U|>0. Fragments denoted by U are called arms of the gapped repeat (left and right arm). The period of the gapped repeat is defined as |UV|. An α-gapped repeat (for α1) in T is a gapped repeat UVU such that |UV|α|U|. An (α-)gapped repeat is called maximal if its arms cannot be extended simultaneously with the same character to the right or to the left. A maximal (α-)gapped repeat will be called an (α-)MGR, for short. Gawrychowski, I, Inenaga, Köppl, and Manea [20] showed that, for a real α1, a string of length n contains 𝒪(nα) α-MGRs.

A generalised run is a pair (T[x..y),p) such that p satisfies 2pyx and (1) x=1 or T[x1..y) does not have period p, and (2) y=|T|+1 or T[x..y] does not have period p. That is, a generalised run is a 0-run. A run is a generalised run in which p is the smallest period of T[x..y). A string of length n contains less than n runs and 𝒪(n) generalised runs [3].

Definition 7.

We say that an MGR UVU=T[x..y) with period =|UV| in T induces a uniform k-run T[a..b) of period if [x..y)[a..b).

We further say that a generalised run (T[x..y),) induces a uniform k-run T[a..b) of period if [x..y)[a..b).

Let us emphasize that the uniform k-run does not need to be a substring of the MGR or generalised run. Intuitively, an MGR induces a uniform k-run if the left half of at least one k-mismatch square of the k-run contains a position of the left arm of the MGR; see Figure 3. If an MGR or a generalised run induces a uniform k-run, we say that the k-run is induced by the MGR or run, respectively. We also say that the MGR or generalised run induces all k-mismatch squares in the uniform k-run that it induces.

Figure 3: String T[1..26] from Figure 1 with 8-mismatching positions shown in red. The string contains three uniform 2-runs with period 8, T[1..18], T[5..22], and T[8..24] (in Figure 1, the last two form a single 2-run). They are all induced by a 3-MGR UVU for U=T[8..10]=T[16..18] and V=T[11..15]. In the proof of Lemma 8 for this MGR, we have L={7,4,0} and R={11,15,17}.
Lemma 8.

An MGR induces at most 2k+1 uniform k-runs.

Proof.

Consider an MGR T[x..y) with period . Let R be the set of k+1 smallest -mismatching positions in [y..y) in T; if there are no such k+1 positions, we select all -mismatching positions in [y..y) in T to the set R. Similarly, let L be the set of k+1 largest -mismatching positions in [x+1..x) in T; if there are no k+1 such positions, we select all -mismatching positions in [x+1..x). If |R|k, we insert y to R. If |L|k, we insert x to L. This way, if a k-mismatch square T[i..i+2) is induced by the MGR T[x..y), then iI for I=[minL+1..maxR] (as otherwise the left half of the square would contain more than k -mismatching positions or would not touch T[x..y)).

Let us consider a window of length sliding left-to-right over T with the left end point of the window located in interval I. There are at most 2k moments when the number of -mismatching positions in the window changes. Between them, we obtain uniform k-runs if the number of -mismatching positions in the window is at most k.

An analogous lemma for generalised runs is obtained with the same proof (replacing “MGR” by “generalised run”); see Figure 4.

Figure 4: String T[1..26] with its 6-mismatching positions shown in red. A generalised run T[6..21] with period 6 (rectangle) induces five uniform 2-runs with period 6 shown above. In this example, the generalised run corresponds to a run with period 3.
Lemma 9.

A generalised run induces at most 2k+1 uniform k-runs.

A uniform k-run can be induced by several MGRs and generalised runs.

Example 10.

The middle uniform 2-run T[5..22] in Figure 3 is induced also by two other MGRs with period 8, T[5..14] (with arms T[5..6]=T[13..14]=𝚊𝚊) and T[12..22] (with arms T[12..14]=T[20..22]=𝚋𝚊𝚊).

For a gapped repeat UVU, the fraction |UV||U| is called the gap ratio. That is, the gap ratio is the minimum α such that UVU is an α-gapped repeat. We assign to each gapped repeat UVU in T a weight equal to the reciprocal of its gap ratio, that is, |U||UV|.

Lemma 11.

If 4k and T[a..b) is a uniform k-run of period , then it is induced by a generalised run or by some number of (2k+2)-MGRs of total weight at least 14.

Proof.

Let p1,,pt be all -mismatching positions among [a..b), listed in an increasing order. We define sentinel p0 being the maximum -mismatching position that is smaller than a; if there is no such position, we set p0=0. Similarly, we define sentinel pt+1 as the minimum -mismatching position that is at least b; if no such position exists, we set pt+1=n+1.

Let us fix s[0..t]. We have Us:=T[ps+1..ps+1)=T[ps++1..ps+1+), so Ws:=T[ps+1..ps+1+) has period . Moreover, we have (1) ps=0 or T[ps]T[ps+] and (2) ps+1+=n+1 or T[ps+1]T[ps+1+]. If 0<|Us|<, then Ws is an MGR with arms equal to Us. If |Us|, Ws is a generalised run. The MGR or generalised run induces T[a..b) unless ps+1=a or ps=b1.

Assume that the uniform k-run T[a..b) is not induced by a generalised run. If a string Ws, for s[0..t], is not a (2k+2)-MGR of period , the length of its left arm Us is smaller than 2k+2. Hence, for strings Ws that are not (2k+2)-MGRs, the total length of their left arms Us is at most 2. The remaining (2k+2)-MGRs among Ws have total length of left arms at least (ba)2k2k (accounting for the at most k -mismatching positions p1,,pt) which is at least 4 by the assumption of the lemma. This means that the total weight of (2k+2)-MGRs that induce the k-run, defined as the sum of the lengths of their left arms divided by their periods, all equal to , is at least 14.

We show a theorem that implies Theorem 1.

Theorem 12.

A string of length n contains 𝒪(nklogk) uniform k-runs.

Proof.

Let T be a string of length n. Let us consider a bipartite graph G=(V1V2,E) such that vertices in V1 are uniform k-runs in T, vertices in V2 are (2k+2)-MGRs and generalised runs in T, and there is an edge uwE, for uV1 and wV2, if k-run u is induced by the MGR or generalised run w. Our goal is to bound |V1| from above.

Let us remove from G all vertices in V1 that are uniform k-runs with period at most 4k. There are at most 4kn such k-runs (as there are at most 4kn substrings of T of length at most 4k). Let us also remove from G all vertices in V2 that are generalised runs and all vertices in V1 that are adjacent to at least one of the removed vertices in V2. By [3], this way at most 1.5n vertices are removed from V2, so by Lemma 9, 𝒪(nk) vertices are removed from V1. Let G=(V1V2,E) be the remaining graph. We assume that each edge has a weight equal to the weight of the MGR in V2 it is incident to. In order to bound |V1|, we will consider the total weight of edges in E.

For each α[2..2k+2], let xα be the number of α-MGRs in T that are not (α1)-MGRs. Each such MGR has weight at most 1α1 and 2k+13k edges incident to it (cf. Lemma 8). Thus the sum of weights of edges in E is bounded from above by

α=22k+23kxαα1 (2)

The improved upper bound on the number of α-MGRs of I and Köppl [27] implies that

i=2αxi<13nαfor every α[2..2k+2] (3)

We show how to bound the number (2) in general.

Claim 13.

For every sequence (x2,,x2k+2) of non-negative integers that satisfies the condition (3), the value (2) is 𝒪(nklogk).

Proof.

As long as there exists α[3..2k+2] such that xα>13n, we select i[2..α) such that xi<13n or i=2 and xi<26n, decrement xα and increment xi. Such i exists by (3) and the pigeonhole principle. The operation does not change any lhs in (3) and increases (2).

In the end, we have x226n and xα13n for all α[3..2k+2]. Hence, (2) is bounded as:

α=22k+23kxαα1 78k+39nkα=22k+11α 78k+39nk(ln(2k+1)+1)=𝒪(nklogk).

By the claim, the sum of weights of edges in G is 𝒪(nklogk). By Lemma 11, for each uniform k-run with period 4k that is not induced by a generalised run, the total weight of edges incident to it is at least 14. This concludes that |V1|=𝒪(nklogk), so the total number of uniform k-runs (across all periods [1..n/2]) is |V1|=𝒪(nklogk).

A k-run of period contains a prefix being a uniform k-run of period , constructed by taking k-mismatch squares of length 2 at subsequent positions as long as their sets of -mismatching positions are the same. By maximality, no two k-runs with equal period start at the same position, so this way each k-run produces a different uniform k-run. Thus the number of k-runs in T does not exceed the number of uniform k-runs, which proves Theorem 1. By the same argument, each k-repetition implies a unique uniform k-run and so the number of k-repetitions is 𝒪(nklogk). (We refer the reader to the definition of k-repetition in [34, 35, 38] as this notion is not used below.)

3 -Suffix Trees and Their Applications

We introduce useful tools for the next two sections. Cole and Hariharan [11] defined a quasi-suffix collection as a collection of strings S1,S2,,Sn that satisfies the following conditions:

  1. 1.

    |S1|=n and |Si|=|Si1|1.

  2. 2.

    No Si is a prefix of another Sj.

  3. 3.

    Suppose strings Si and Sj have a common prefix of length >0. Then Si+1 and Sj+1 have a common prefix of length at least 1.

A quasi-suffix collection is specified implicitly by a character oracle that given i, j returns Si[j]. They obtain the following result.

Theorem 14 ([11]).

The compacted trie of a quasi-suffix collection of n strings can be constructed in 𝒪(n) expected time assuming that the character oracle works in 𝒪(1) time.

For a string T of length n, the -suffix tree of T is a compacted trie of strings 𝐶𝑜𝑑𝑒(T[i..n])#, i[1..n+1], where 𝐶𝑜𝑑𝑒(X)=(X[1])(X[1..2])(X) for a string X and # is an end-marker that is not an encoding of any string. We extend the sequences π, ρ so that for any string T[0..σ)n, after π(n,σ) preprocessing time, (T[i..j]) for any i,j can be computed in ρ(n,σ) time. By γ(n,σ) we denote an upper bound on the total number of distinct characters in the strings stored in the -suffix tree of a string in [0..σ)n. Theorem 14 implies the following corollary. (A similar result, but only for order-preserving equivalence, was shown in [19, Lemma 16].)

Corollary 15.

Let be an SCER and be its encoding function. The -suffix tree of a string in [0..σ)n can be constructed in worst case time:

𝒪(π(n,σ)+nρ(n,σ)+min{nlog2logn/logloglogn,nγ(n,σ)}).

Proof.

Let us verify that strings 𝐶𝑜𝑑𝑒(T[i..n])# satisfy the conditions 1-3 of a quasi-suffix collection. Condition 1 for n+1 is obvious. Condition 2 follows due to the end-marker. As for condition 3, assume LCP(Si,Sj)=>0 and let ij as otherwise the conclusion is trivial. Then <|Si|,|Sj| because of the end-marker. By equivalence (1), we have T[i..i+)T[j..j+). Because is an SCER, we have T[i+1..i+)T[j+1..j+). Hence, LCP(Si+1,Sj+1)1 by equivalence (1).

An oracle for the quasi-suffix collection Si answers queries in 𝒪(ρ(n,σ)) time after 𝒪(π(n,σ)) preprocessing. The only source of randomness in the algorithm behind Theorem 14 is the need to maintain, for each explicit node of the current tree, a dictionary indexed by the next character on an outgoing edge. If we store the respective characters per each node in a dynamic predecessor data structure of Andersson and Thorup [1], the total space remains 𝒪(n) and each predecessor query is answered in 𝒪(log2logn/logloglogn) time in the worst case. Alternatively, one can store all the children of a node in a list, to achieve 𝒪(γ(n,σ)) space per an explicit node and 𝒪(γ(n,σ)) time for a predecessor query. The complexity follows.

 Remark 16.

For SCERs mentioned in Example 4, we obtain the following time complexities for constructing -suffix trees in the respective settings: 𝒪(nlog2logn/logloglogn) for parameterized matching and Cartesian tree matching, 𝒪(nlogn) for palindrome matching (that matches the complexity from [26]), and 𝒪(nlogn/loglogn) for order-preserving matching (a faster, 𝒪(nlogn)-time construction was proposed in [13]).

For two strings X and Y, by LCP(X,Y) we denote max{0:X[1..]Y[1..]}. As in the case of standard suffix trees, by equivalence (1), having an -suffix tree of T and the data structure answering lowest common ancestor queries for nodes in 𝒪(1) time after 𝒪(n) preprocessing [25, 5], we can answer LCP queries about pairs of substrings of T in 𝒪(1) time.

The longest previous -factor array LPF for a string T is an array such that

LPF[i]=max{0:T[i..i+)T[j..j+) for some j[1..i)}.

Computing this array can be stated in terms of the -suffix tree: for a leaf with index i of the tree we are to find a leaf with index j<i such that the lowest common ancestor of the two leaves is as low as possible. The actual value LPF[i] is then the (weighted) depth of this ancestor. In [13, Theorem 16] it was shown that this problem, stated for an arbitrary rooted (weighted) tree with n leaves, can be solved in 𝒪(n) time. Thus, the LPF array for a length-n string can be computed in 𝒪(n) time if the -suffix tree is available.

4 Counting p-Squares in 𝓞(𝒏𝝈𝐥𝐨𝐠𝝈) Time

In this section we consider T[0..σ)n and denotes the relation of parameterized matching. We use the following function 𝐄 for (see Example 17):

𝐄(X)=|Alph(U)| where U is the longest suffix of X[1..|X|) without letter X[|X|].
Example 17.

𝐄(X) for X=[1,2,1,1,2,3,2,1,4] is as follows:

i 1 2 3 4 5 6 7 8 9
X[i] 1 2 1 1 2 3 2 1 4
𝐄(X[1..i]) 0 1 1 0 1 2 1 2 3

A similar function was used in the context of parameterized matching in [30]. Lemma 18 below shows that 𝐄 is an encoding function for parameterized matching and Lemma 19 shows that it can be computed efficiently.

The () part of the proof of Lemma 18 is similar to the proof of [30, Lemma 5.4]. A proof of the lemma can be found in the full version.

Lemma 18.

𝐄 is an encoding function for parameterized matching.

Lemma 19.

We have π𝐄(n,σ)=𝒪(nσ), ρ𝐄(n,σ)=𝒪(σ), and γ𝐄(n,σ)=σ+1.

Proof.

Let T[0..σ)n. By definition, 𝐄(T[i..j])[0..σ) for any indices i,j. Hence, γ𝐄(n,σ)=σ+1 (including the sentinel). In order to compute encodings of substrings of T, we can store for every c[0..σ), the numbers of characters c in respective prefixes of T. Moreover, for every index i, we store the position 𝑝𝑟𝑒𝑣[i][0..i) of the last occurrence of character T[i] in T[1..i); 𝑝𝑟𝑒𝑣[i]=0 if there is no such occurrence. The array 𝑝𝑟𝑒𝑣 can be computed from left to right in 𝒪(n+σ) time by storing an array of size σ of rightmost occurrences of each character. Then indeed π𝐄(n,σ)=𝒪(nσ). When computing 𝐄(T[i..j]), we can compare the counts of every character c at prefixes T[1..j) and T[1..i) where i=max(𝑝𝑟𝑒𝑣[j]+1,i). Hence, ρ𝐄(n,σ)=𝒪(σ).

We define strings T, T of length n such that

T[i]=𝐄(T[1..i]),T[i]=𝐄((T[i..n])R).

Our goal is to reduce reporting non-equivalent p-squares to computing uniform k-runs. We use two auxiliary lemmas.

Lemma 20.

If T[i..i+2) is a p-square, then T[i..i+2) and T[i..i+2) are σ-mismatch squares.

Proof.

We show a proof that T[i..i+2) is a σ-mismatch square; the proof that T[i..i+2) is a σ-mismatch square is symmetric.

Let i1,,it[0..), for t[1..σ], be the positions of leftmost occurrences of characters from [0..σ) in T[i..i+); if some character is not present in T[i..i+), it is not included in the sequence. Let j[0..){i1,,it}. Let j[0..j) be the maximum index such that T[i+j]=T[i+j], so T[i++j]=T[i++j] and T[i++j′′]T[i++j] for all j′′[j+1..j) because T[i..i+)T[i+..i+2). Then

T[i+j]=|Alph(T[i+j+1..i+j))|=|Alph(T[i++j+1..i++j))|=T[i++j]

by the fact that is an SCER. Hence, T[i..i+) and T[i+..i+2) have at most tσ mismatches, as required.

Lemma 21.

If T[i..i+2) is a p-square and T[i+]=T[i+2], then T[i+1..i+2] is a p-square. Similarly, if T[i..i+2) is a p-square and T[i1]=T[i+1], then T[i1..i+21) is a p-square.

Proof.

Again, we only prove the first part of the lemma. We have T[i..i+)T[i+..i+2), so T[i+1..i+)T[i+1+..i+2) since is an SCER. If

T[i+2]=T[i+]<|Alph(T[i+1..i+))|=|Alph(T[i+1+..i+2))|,

then T[i+]=𝐄(T[i+1..i+]), T[i+2]=𝐄(T[i+1+..i+2]), and T[i+1..i+]T[i+1+..i+2] by the fact that 𝐄 is an encoding function for parameterized matching (Lemma 18). Otherwise,

T[i+]Alph(T[i+1..i+))andT[i+2]Alph(T[i+1+..i+2)),

so we immediately obtain T[i+1..i+]T[i+1+..i+2]. In both cases, T[i+1..i+2] is a p-square.

If T[a..b) is a uniform k-run with period , then there is no -mismatching position in T[a+..b), i.e., T[a+..b)=T[a+2..b). Hence, if T[a..b) is a uniform k-run with period and T[a..a+2) is a p-square, then by Lemma 21, each string T[i..i+2) for i[a..b2] is a p-square. With this intuition, we are ready to obtain the reduction.

Lemma 22.

Let T[0..σ)n. Reporting non-equivalent p-squares in T reduces in 𝒪(nσ+r) time to computing uniform σ-runs in T and T, where r is the number of these σ-runs.

Reporting p-squares in T that are distinct as strings reduces to the same problem in 𝒪(nσ+r+𝗈𝗎𝗍𝗉𝗎𝗍) time, where 𝗈𝗎𝗍𝗉𝗎𝗍 is the number of p-squares reported.

Proof.

For a uniform k-run T[a..b) of period , we define its interval as [a..b2]. For each [1..n/2], let 𝒫 and 𝒫 be sets of intervals of uniform σ-runs of period in T and (T)R, respectively. Let 𝒫={[nb2..na2]:[a..b]𝒫}. Finally, let 𝒫=(𝒫)(𝒫).

Claim 23.

If T[i..i+2) is a p-square in T, then i𝒫.

Proof.

By Lemma 20, if T[i..i+2) is a p-square in T, then T[i..i+2) and T[i..i+2)=(T)R[ni2..ni) are σ-mismatch squares. Therefore, there exist intervals [a..b]𝒫 and [a..b]𝒫 such that i[a..b] and ni2[a..b]. In particular, i𝒫. We have [nb2..na2]𝒫 and i[nb2..na2], so i𝒫.

We store 𝒫 as a union of non-empty intervals of the form {IJ:I𝒫,J𝒫}. Let this representation be denoted as .

All the representations can be computed from 𝒫 and 𝒫 in 𝒪(n+(|𝒫|+|𝒫|)) total time. Let us bucket sort all endpoints of intervals in 𝒫 and 𝒫, for each . When processing the endpoints for a given , we keep track of the number of intervals containing a given position. This counter never exceeds 2 as intervals in each of 𝒫 and 𝒫 are pairwise disjoint. Whenever the counter reaches 2, for some endpoint a, it will drop at the next endpoint encountered, say b. Then [a..b) is inserted into .

The next claim shows that, for each interval in , all positions correspond to p-squares of length 2 or none of them does.

Claim 24.

Let [a..b]. For each i[a..b], T[i..i+2) is a p-square if and only if T[a..a+2) is a p-square.

Proof.

Let us fix i[a..b]. Let [a..b]𝒫 and [a′′..b′′]𝒫 be intervals such that [a..b]=[a..b][a′′..b′′].

Assume first that T[a..a+2) is a p-square. As 𝒫 was computed from uniform σ-runs, we have T[a+2..b+2)=T[a+..b+), so T[a+2..i+2)=T[a+..i+). By Lemma 21 applied for each subsequent position in [a..i), T[i..i+2) is a p-square.

Now assume that T[i..i+2) is a p-square. As [a′′..b′′]𝒫, [nb′′2..na′′2]𝒫. By definition, we have (T)R[nb′′..na′′)=(T)R[nb′′..na′′), i.e., T[a′′..b′′)=T[a′′+..b′′+). In particular, T[a..i)=T[a+..i+). By Lemma 21 applied for each position in [a+1..i] in decreasing order, T[a..a+2) is a p-square.

By Corollaries 15, 18, and 19, after 𝒪(nσ) preprocessing we can check if a given even-length substring of T is a p-square in 𝒪(1) time using an LCP query. For each [1..n/2] and each interval [a..b], we check if T[a..a+2) is a p-square. By Claim 24, if so, we obtain an interval of occurrences of p-squares, and if not, then none of the positions in [a..b] is the start of a p-square of length 2 in T. The total number of intervals in is linear in the number of uniform σ-runs in T and T, so we obtain 𝒪(r) intervals of occurrences of p-squares. By Claim 23, we do not miss any p-squares.

The final step of reporting non-equivalent p-squares mimics an analogous algorithm for standard squares of Bannai, Inenaga, and Köppl [4]. We report non-equivalent p-squares at their leftmost occurrence in T and use the LPF array to identify them in intervals of occurrences. More precisely, we compute the LPF array in 𝒪(nσ) time and construct in 𝒪(n) time a data structure for answering Range Minimum Queries (RmQs) over LPF in 𝒪(1) time per query; see [5, 25]. For each of the 𝒪(r) intervals [a..b] of starting positions of p-squares of length 2, we want to list all positions i[a..b] such that LPF[i]<2 and report corresponding leftmost occurrences of p-squares T[i..i+2). We ask an RmQ on LPF on the interval [a..b]; let the minimum be attained for i[a..b]. If LPF[i]2, the algorithm is finished. Otherwise, we report T[i..i+2) and run the algorithm recursively on intervals [a..i1] and [i+1..b]. The total time complexity is proportional to the number of reported p-squares plus 1. By [24], T contains 𝒪(nσ) non-equivalent p-squares. This completes an 𝒪(nσ+r)-time reduction to computing uniform σ-runs.

When reporting all p-squares that are distinct as strings, it suffices to replace the LPF array with the standard LPF=LPF= array. Such an array can be computed in 𝒪(n) time after the letters of the string have been sorted [12]. Then the computations take 𝒪(nσ+r+𝗈𝗎𝗍𝗉𝗎𝗍) time.

By Theorem 12, string T (and T) contains 𝒪(nσlogσ) uniform σ-runs. They can be computed in 𝒪(nσlogσ) time; we use the algorithm of Kolpakov and Kucherov [34, 35] to compute all σ-runs in T and then partition each σ-run to uniform σ-runs using kangaroo jumps. That is, let T[a..b) be a σ-run with period. We compute two values:

d1 =1+LCP(T[a..b2),T[a+..b))
d2 =1+LCP(T[a+..b),T[a+2..b))

that, intuitively, find the first -mismatching position in the σ-run, if any (d1), and the first -mismatching position after position a+1, if any (d2). Let d=min(d1,d2). We then report a uniform σ-run T[a..a+d+2) and continue processing the (non-maximal) σ-run T[a+d..b) until its length drops below 2. The LCP-queries in (T)R are answered in 𝒪(1) time [5, 25]. Thus Lemma 22 implies Theorem 2.

5 Counting Generalised Squares in 𝓞(𝒏𝐥𝐨𝐠𝒏) Time

In this section we show an algorithm that counts non-equivalent -squares, for an SCER with encoding function , in 𝒪(π(n)+nρ(n)+nlogn) time.

For a string T of length n and positive integer pn/2, we denote

Squaresp={i[1..n2p+1]:T[i..i+2p) is an -square}.

An interval representation of a set X of integers is X=[i1..j1][i2..j2][it..jt], where j1+1<i2, …, jt1+1<it; t is called the size of the representation.

Counterparts of the following lemma corresponding to order-preserving squares and Cartesian-tree squares were shown in [21] and [44], respectively. Our proof generalises these proofs; it can be found in the full version.

Lemma 25.

Let be an SCER and be its encoding function. Given a string T of length n, the interval representations of the sets Squaresp for all 1pn/2 have total size 𝒪(nlogn) and can be computed in 𝒪(π(n)+nρ(n)+nlogn) time.

We count non-equivalent -squares using the LPF array. We would like to count an -square at the position of its leftmost occurrence. For an integer array A[1..n], we denote a range count query for i,j[1..n], x as 𝖱𝖺𝗇𝗀𝖾𝖢𝗈𝗎𝗇𝗍A(i,j,x)=|{k[i..j]:A[k]<x}|. Then our problem reduces to computing the sum

p=1n/2[i..j]Squaresp𝖱𝖺𝗇𝗀𝖾𝖢𝗈𝗎𝗇𝗍LPF(i,j,2p). (4)

We say that an array A[1..n] is a linear oscillation array if i=1n1|A[i+1]A[i]|=𝒪(n). The following fact is folklore for the LPF array; we prove it for LPF in the full version.

Lemma 26.

For any SCER , LPF is a linear oscillation array.

We will now show that the sum from Equation 4 can be computed in 𝒪(nlogn) time. We use a very simple abstract problem.

Counting Problem
Input:
A set Y[1..n], initially empty, and an integer , initially =0.
Operation: One of: (1) insert an element x[1..n] to Z; (2) delete an element xY from Y; (3) increment by 1; (4) decrement by 1. After each operation, output |Y[+1..n]|.

The Counting Problem can be solved with a basic indicator array for Y.

Lemma 27.

After 𝒪(n)-time preprocessing, each operation in the Counting Problem can be answered in 𝒪(1) time.

Proof.

We store an indicator array of bits C[1..n] such that C[i]=1 if and only if iY. Initially, C0. We also store a value a=|Y[+1..n]|; initially a=0. After each operation, the current value of a is returned.

When we insert an element x[1..n] to Y, we set C[x] to 1 and increment a if x>. Symmetrically, when we delete xY from Y, we set C[x] to 0 and decrement a if x<. When we increment , we subtract C[] from a. When we decrement , we add C[] to a.

We are ready to obtain the final main result.

Proof of Theorem 5.

First we show how to compute the number of non-equivalent -squares in a length-n string. The LPF array can be computed in 𝒪(π(n)+nρ(n)+nlog2logn/logloglogn) time (Corollary 15). We compute the interval representations of the sets Squaresp using Lemma 25 in 𝒪(π(n)+nρ(n)+nlogn) time. The interval representations have total size 𝒪(nlogn). We need to compute the sum of results of 𝒪(nlogn) 𝖱𝖺𝗇𝗀𝖾𝖢𝗈𝗎𝗇𝗍 queries as in Equation 4. We will do it using the Counting Problem.

Let us bucket sort all start and end points of intervals across all sets Squaresp in 𝒪(nlogn) total time. We create an instance of the Counting Problem. For each position k from 1 to n, we proceed as follows. First, for each interval [k..j]Squaresp, for any p, we insert 2p to the set Y. Then, we change the current value in the problem to LPF[k] by performing increments or decrements, as appropriate. Afterwards, we add the returned value of the Counting Problem to the final result. Finally, for each interval [i..k]Squaresp, for any p, we delete 2p from the set Y.

For each p[1..n/2], the intervals in Squaresp are pairwise disjoint. By Lemma 25, 𝒪(nlogn) insertions and deletions are performed in the Counting Problem. The total number of increments and decrements is bounded by 𝒪(n) by Lemma 26 (and the fact that the values in this array are in [0..n)). In total, the sum (4) is computed in 𝒪(nlogn) time. We obtain the first part of Theorem 5.

As before, if one wishes to compute all -squares that are distinct as substrings, it suffices to replace the LPF array in the algorithm by the standard LPF=LPF= array. We obtain the second part of Theorem 5.

References

  • [1] Arne Andersson and Mikkel Thorup. Dynamic ordered sets with exponential search trees. Journal of the ACM, 54(3):13, 2007. doi:10.1145/1236457.1236460.
  • [2] Brenda S. Baker. Parameterized pattern matching: Algorithms and applications. Journal of Computer and System Sciences, 52(1):28–42, 1996. doi:10.1006/JCSS.1996.0003.
  • [3] Hideo Bannai, Tomohiro I, Shunsuke Inenaga, Yuto Nakashima, Masayuki Takeda, and Kazuya Tsuruta. The “Runs” theorem. SIAM Journal on Computing, 46(5):1501–1514, 2017. doi:10.1137/15M1011032.
  • [4] Hideo Bannai, Shunsuke Inenaga, and Dominik Köppl. Computing all distinct squares in linear time for integer alphabets. In Juha Kärkkäinen, Jakub Radoszewski, and Wojciech Rytter, editors, 28th Annual Symposium on Combinatorial Pattern Matching, CPM 2017, July 4-6, 2017, Warsaw, Poland, volume 78 of LIPIcs, pages 22:1–22:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2017. doi:10.4230/LIPICS.CPM.2017.22.
  • [5] Michael A. Bender and Martin Farach-Colton. The LCA problem revisited. In Gaston H. Gonnet, Daniel Panario, and Alfredo Viola, editors, LATIN 2000: Theoretical Informatics, 4th Latin American Symposium, Punta del Este, Uruguay, April 10-14, 2000, Proceedings, volume 1776 of Lecture Notes in Computer Science, pages 88–94. Springer, 2000. doi:10.1007/10719839_9.
  • [6] Gary Benson. An algorithm for finding tandem repeats of unspecified pattern size. In Sorin Istrail, Pavel A. Pevzner, and Michael S. Waterman, editors, Proceedings of the Second Annual International Conference on Research in Computational Molecular Biology, RECOMB 1998, New York, NY, USA, March 22-25, 1998, pages 20–29. ACM, 1998. doi:10.1145/279069.279079.
  • [7] Gary Benson. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Research, 27(2):573–580, 1999. doi:10.1093/nar/27.2.573.
  • [8] Srečko Brlek and Shuo Li. On the number of squares in a finite word. Combinatorial Theory, 5(1), 2025. doi:10.5070/C65165014.
  • [9] Panagiotis Charalampopoulos, Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń, and Wiktor Zuba. Efficient enumeration of distinct factors using package representations. In Christina Boucher and Sharma V. Thankachan, editors, String Processing and Information Retrieval – 27th International Symposium, SPIRE 2020, Orlando, FL, USA, October 13-15, 2020, Proceedings, volume 12303 of Lecture Notes in Computer Science, pages 247–261. Springer, 2020. doi:10.1007/978-3-030-59212-7_18.
  • [10] Panagiotis Charalampopoulos, Manal Mohamed, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń, and Wiktor Zuba. Counting distinct square substrings in sublinear time. In Paweł Gawrychowski, Filip Mazowiecki, and Michał Skrzypczak, editors, 50th International Symposium on Mathematical Foundations of Computer Science, MFCS 2025, August 25–29, 2025, Warsaw, Poland, LIPIcs, 2025. doi:10.4230/LIPICS.MFCS.2025.36.
  • [11] Richard Cole and Ramesh Hariharan. Faster suffix tree construction with missing suffix links. SIAM Journal on Computing, 33(1):26–42, 2003. doi:10.1137/S0097539701424465.
  • [12] Maxime Crochemore and Lucian Ilie. Computing longest previous factor in linear time and applications. Information Processing Letters, 106(2):75–80, 2008. doi:10.1016/J.IPL.2007.10.006.
  • [13] Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Marcin Kubica, Alessio Langiu, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. Order-preserving indexing. Theoretical Computer Science, 638:122–135, 2016. doi:10.1016/J.TCS.2015.06.050.
  • [14] Maxime Crochemore, Costas S. Iliopoulos, Marcin Kubica, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. Extracting powers and periods in a word from its runs structure. Theoretical Computer Science, 521:29–41, 2014. doi:10.1016/J.TCS.2013.11.018.
  • [15] Olivier Delgrange and Eric Rivals. STAR: an algorithm to search for tandem approximate repeats. Bioinformatics, 20:2812–2820, 2004. doi:10.1093/bioinformatics/bth335.
  • [16] Nevzat Onur Domaniç and Franco P. Preparata. A novel approach to the detection of genomic approximate tandem repeats in the Levenshtein metric. Journal of Computational Biology, 14(7):873–891, 2007. PMID: 17803368. doi:10.1089/cmb.2007.0018.
  • [17] Jonas Ellert and Johannes Fischer. Linear time runs over general ordered alphabets. In Nikhil Bansal, Emanuela Merelli, and James Worrell, editors, 48th International Colloquium on Automata, Languages, and Programming, ICALP 2021, July 12-16, 2021, Glasgow, Scotland (Virtual Conference), volume 198 of LIPIcs, pages 63:1–63:16. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPICS.ICALP.2021.63.
  • [18] Younan Gao, Meng He, and Yakov Nekrich. Fast preprocessing for optimal orthogonal range reporting and range successor with applications to text indexing. In Fabrizio Grandoni, Grzegorz Herman, and Peter Sanders, editors, 28th Annual European Symposium on Algorithms, ESA 2020, September 7-9, 2020, Pisa, Italy (Virtual Conference), volume 173 of LIPIcs, pages 54:1–54:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2020. doi:10.4230/LIPICS.ESA.2020.54.
  • [19] Pawel Gawrychowski, Samah Ghazawi, and Gad M. Landau. Order-preserving squares in strings. In Laurent Bulteau and Zsuzsanna Lipták, editors, 34th Annual Symposium on Combinatorial Pattern Matching, CPM 2023, June 26-28, 2023, Marne-la-Vallée, France, volume 259 of LIPIcs, pages 13:1–13:19. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPICS.CPM.2023.13.
  • [20] Pawel Gawrychowski, Tomohiro I, Shunsuke Inenaga, Dominik Köppl, and Florin Manea. Tighter bounds and optimal algorithms for all maximal α-gapped repeats and palindromes – finding all maximal α-gapped repeats and palindromes in optimal worst case time on integer alphabets. Theory of Computing Systems, 62(1):162–191, 2018. doi:10.1007/S00224-017-9794-5.
  • [21] Garance Gourdel, Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, Arseny M. Shur, and Tomasz Waleń. String periods in the order-preserving model. Information and Computation, 270, 2020. doi:10.1016/J.IC.2019.104463.
  • [22] Dan Gusfield. Algorithms on Strings, Trees, and Sequences – Computer Science and Computational Biology. Cambridge University Press, 1997. doi:10.1017/CBO9780511574931.
  • [23] Dan Gusfield and Jens Stoye. Linear time algorithms for finding and representing all the tandem repeats in a string. Journal of Computer and System Sciences, 69(4):525–546, 2004. doi:10.1016/J.JCSS.2004.03.004.
  • [24] Rikuya Hamai, Kazushi Taketsugu, Yuto Nakashima, Shunsuke Inenaga, and Hideo Bannai. On the number of non-equivalent parameterized squares in a string. In Zsuzsanna Lipták, Edleno Silva de Moura, Karina Figueroa, and Ricardo Baeza-Yates, editors, String Processing and Information Retrieval – 31st International Symposium, SPIRE 2024, Puerto Vallarta, Mexico, September 23-25, 2024, Proceedings, volume 14899 of Lecture Notes in Computer Science, pages 174–183. Springer, 2024. doi:10.1007/978-3-031-72200-4_13.
  • [25] Dov Harel and Robert Endre Tarjan. Fast algorithms for finding nearest common ancestors. SIAM Journal on Computing, 13(2):338–355, 1984. doi:10.1137/0213024.
  • [26] Tomohiro I, Shunsuke Inenaga, and Masayuki Takeda. Palindrome pattern matching. Theoretical Computer Science, 483:162–170, 2013. doi:10.1016/J.TCS.2012.01.047.
  • [27] Tomohiro I and Dominik Köppl. Improved upper bounds on all maximal α-gapped repeats and palindromes. Theoretical Computer Science, 753:1–15, 2019. doi:10.1016/J.TCS.2018.06.033.
  • [28] Haim Kaplan, Ely Porat, and Nira Shafrir. Finding the position of the k-mismatch and approximate tandem repeats. In Lars Arge and Rusins Freivalds, editors, Algorithm Theory – SWAT 2006, 10th ScandinavianWorkshop on Algorithm Theory, Riga, Latvia, July 6-8, 2006, Proceedings, volume 4059 of Lecture Notes in Computer Science, pages 90–101. Springer, 2006. doi:10.1007/11785293_11.
  • [29] Jinil Kim, Peter Eades, Rudolf Fleischer, Seok-Hee Hong, Costas S. Iliopoulos, Kunsoo Park, Simon J. Puglisi, and Takeshi Tokuyama. Order-preserving matching. Theoretical Computer Science, 525:68–79, 2014. doi:10.1016/J.TCS.2013.10.006.
  • [30] Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. Maximum number of distinct and nonequivalent nonstandard squares in a word. Theoretical Computer Science, 648:84–95, 2016. doi:10.1016/J.TCS.2016.08.010.
  • [31] Roman Kolpakov, Ghizlane Bana, and Gregory Kucherov. mreps: Efficient and flexible detection of tandem repeats in DNA. Nucleic Acids Research, 31:3672–3678, 2003. doi:10.1093/nar/gkg617.
  • [32] Roman Kolpakov, Mikhail Podolskiy, Mikhail Posypkin, and Nickolay Khrapov. Searching of gapped repeats and subrepetitions in a word. Journal of Discrete Algorithms, 46-47:1–15, 2017. doi:10.1016/J.JDA.2017.10.004.
  • [33] Roman M. Kolpakov and Gregory Kucherov. Finding maximal repetitions in a word in linear time. In 40th Annual Symposium on Foundations of Computer Science, FOCS 1999, pages 596–604. IEEE Computer Society, 1999. doi:10.1109/SFFCS.1999.814634.
  • [34] Roman M. Kolpakov and Gregory Kucherov. Finding approximate repetitions under Hamming distance. In Friedhelm Meyer auf der Heide, editor, Algorithms – ESA 2001, 9th Annual European Symposium, Aarhus, Denmark, August 28-31, 2001, Proceedings, volume 2161 of Lecture Notes in Computer Science, pages 170–181. Springer, 2001. doi:10.1007/3-540-44676-1_14.
  • [35] Roman M. Kolpakov and Gregory Kucherov. Finding approximate repetitions under Hamming distance. Theoretical Computer Science, 303(1):135–156, 2003. doi:10.1016/S0304-3975(02)00448-6.
  • [36] Arun Krishnan and Francis Tang. Exhaustive whole-genome tandem repeats search. Bioinformatics, 20(16):2702–2710, May 2004. doi:10.1093/bioinformatics/bth311.
  • [37] Marcin Kubica, Tomasz Kulczyński, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. A linear time algorithm for consecutive permutation pattern matching. Information Processing Letters, 113(12):430–433, 2013. doi:10.1016/J.IPL.2013.03.015.
  • [38] Gregory Kucherov and Dina Sokol. Approximate tandem repeats. In Ming-Yang Kao, editor, Encyclopedia of Algorithms, pages 1–6. Springer Berlin Heidelberg, Berlin, Heidelberg, 2014. doi:10.1007/978-3-642-27848-8_24-2.
  • [39] Gad M. Landau, Jeanette P. Schmidt, and Dina Sokol. An algorithm for approximate tandem repeats. Journal of Computational Biology, 8(1):1–18, 2001. doi:10.1089/106652701300099038.
  • [40] Michael G. Main and Richard J. Lorentz. An O(n log n) algorithm for finding all repetitions in a string. Journal of Algorithms, 5(3):422–432, 1984. doi:10.1016/0196-6774(84)90021-X.
  • [41] Glenn K. Manacher. A new linear-time "on-line" algorithm for finding the smallest initial palindrome of a string. Journal of the ACM, 22(3):346–351, 1975. doi:10.1145/321892.321896.
  • [42] Yoshiaki Matsuoka, Takahiro Aoki, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Generalized pattern matching and periodicity under substring consistent equivalence relations. Theoretical Computer Science, 656:225–233, 2016. doi:10.1016/J.TCS.2016.02.017.
  • [43] Angelika Merkel and Neal Gemmell. Detecting short tandem repeats from genome data: opening the software black box. Briefings in Bioinformatics, 9:355–366, 2008. doi:10.1093/bib/bbn028.
  • [44] Sung Gwan Park, Magsarjav Bataa, Amihood Amir, Gad M. Landau, and Kunsoo Park. Finding patterns and periods in Cartesian tree matching. Theoretical Computer Science, 845:181–197, 2020. doi:10.1016/J.TCS.2020.09.014.
  • [45] Marco Pellegrini, M. Elena Renda, and Alessio Vecchio. TRStalker: an efficient heuristic for finding fuzzy tandem repeats. Bioinformatics, 26(12):i358–i366, June 2010. doi:10.1093/bioinformatics/btq209.
  • [46] Jeff Reneker, Chi-Ren Shyu, Peiyu Zeng, Joseph C. Polacco, and Walter Gassmann. ACMES: fast multiple-genome searches for short repeat sequences with concurrent cross-species information retrieval. Nucleic Acids Research, 32:W649–W653, 2004. doi:10.1093/nar/gkh455.
  • [47] E. Rivals, J. Delahaye, M. Dauchet, and O. Delgrange. A first step toward chromosome analysis by compression algorithms. In Proceedings First International Symposium on Intelligence in Neural and Biological Systems. INBS 1995, pages 233–239, 1995. doi:10.1109/INBS.1995.404256.
  • [48] Eric Rivals, Olivier Delgrange, Jean-Paul Delahaye, Max Dauchet, Marie-Odile Delorme, Alain Hénaut, and Emmanuelle Ollivier. Detection of significant patterns by compression algorithms: the case of approximate tandem repeats in DNA sequences. Computer Applications in the Biosciences, 13(2):131–136, 1997. doi:10.1093/BIOINFORMATICS/13.2.131.
  • [49] Marie-France Sagot and Eugene W. Myers. Identifying satellites and periodic repetitions in biological sequences. Journal of Computational Biology, 5(3):539–553, 1998. doi:10.1089/CMB.1998.5.539.
  • [50] Tiago José P. Sobreira, Alan M. Durham, and Arthur Gruber. TRAP: automated classification, quantification and annotation of tandemly repeated sequences. Bioinformatics, 22(3):361–362, 2006. doi:10.1093/bioinformatics/bti809.
  • [51] Dina Sokol, Gary Benson, and Justin Tojeira. Tandem repeats over the edit distance. Bioinformatics, 23(2):e30–e35, January 2007. doi:10.1093/bioinformatics/btl309.
  • [52] Dina Sokol and Justin Tojeira. Speeding up the detection of tandem repeats over the edit distance. Theoretical Computer Science, 525:103–110, 2014. Advances in Stringology. doi:10.1016/j.tcs.2013.04.021.
  • [53] Axel Thue. Über unendliche Zeichenreihen. Norske Videnskabers Selskabs Skrifter Mat.-Nat. Kl., 7:1–22, 1906.
  • [54] Jean Vuillemin. A unifying look at data structures. Communications of the ACM, 23(4):229–239, April 1980. doi:10.1145/358841.358852.
  • [55] Ydo Wexler, Zohar Yakhini, Yechezkel Kashi, and Dan Geiger. Finding approximate tandem repeats in genomic sequences. In Philip E. Bourne and Dan Gusfield, editors, Proceedings of the Eighth Annual International Conference on Computational Molecular Biology, 2004, San Diego, California, USA, March 27-31, 2004, pages 223–232. ACM, 2004. doi:10.1145/974614.974644.
  • [56] Hongxia Zhou, Liping Du, and Hong Yan. Detection of tandem repeats in dna sequences based on parametric spectral estimation. IEEE Transactions on Information Technology in Biomedicine, 13(5):747–755, 2009. doi:10.1109/TITB.2008.920626.