Fast Computation of -Runs, Parameterized Squares, and Other Generalised Squares
Abstract
A -mismatch square is a string of the form where and are two equal-length strings that have at most mismatches. Kolpakov and Kucherov [Theor. Comput. Sci., 2003] defined two notions of -mismatch repeats, called -repetitions and -runs, each representing a sequence of consecutive -mismatch squares of equal length. They proposed algorithms for computing -repetitions and -runs working in time for a string of length over an integer alphabet, where is the number of the reported repeats. We show that , both in case of -repetitions and -runs, which implies that the complexity of their algorithms is actually . We apply this result to computing parameterized squares.
A parameterized square is a string of the form such that and parameterized-match, i.e., there exists a bijection on the alphabet such that . Two parameterized squares and are equivalent if they parameterized match. Recently Hamai et al. [SPIRE 2024] showed that a string of length over an alphabet of size contains less than non-equivalent parameterized squares, improving an earlier bound by Kociumaka et al. [Theor. Comput. Sci., 2016]. We apply our bound for -mismatch repeats to propose an algorithm that reports all non-equivalent parameterized squares in time. We also show that the number of non-equivalent parameterized squares can be computed in time. This last algorithm applies to squares under any substring compatible equivalence relation and also to counting squares that are distinct as strings. In particular, this improves upon the -time algorithm of Gawrychowski et al. [CPM 2023] for counting order-preserving squares that are distinct as strings if .
Keywords and phrases:
string algorithm, -mismatch square, parameterized square, order-preserving square, maximum gapped repeatFunding:
Yuto Nakashima: JSPS KAKENHI Grant Number JP25K00136.Copyright and License:
2012 ACM Subject Classification:
Theory of computation Pattern matchingEditors:
Anne Benoit, Haim Kaplan, Sebastian Wild, and Grzegorz HermanSeries and Publisher:
Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
A string of the form , for any string , is called a square (or a tandem repeat). Squares are a classic notion in combinatorics on words (see, e.g., the early works of Thue [53] on avoidability of square substrings), text algorithms (starting from an algorithm for computing square substrings by Main and Lorentz [40]), and bioinformatics (see Gusfield’s book [22]). A string of length contains at most distinct square substrings [8] and all of them can be computed in time [4, 9, 14, 23] or even time [10] assuming that the string is over an integer alphabet of size . A maximal sequence of consecutive squares in a string is called a run (or a generalised run; cf. Section 2); see [33]. A string of length contains at most runs [3] and they can all be computed in time [3, 17]. Our work is devoted to efficient algorithms for computing known generalizations of squares and runs: -mismatch squares represented as -runs or -repetitions, parameterized squares, and generalised squares that include parameterized squares, order-preserving squares, and Cartesian-tree squares.
We assume that positions in a string are numbered from 1 to , so that is the th character of . For integers such that , by we denote the substring composed of characters . We use similar notation for integer intervals: .
We say that a length- string is over an integer alphabet if its letters belong to . We use the word-RAM model of computation.
1.1 -Mismatch Squares and -Runs
For two equal-length strings and , their Hamming distance is defined as . A -mismatch square (also known under the name of -mismatch tandem repeat) is a string such that and . Let be a string of length . A -run of period in (cf. [38]) is a maximal substring such that is a -mismatch square for every ; see Figure 1. Maximality means that the -run is extended to the right and left as much as possible provided that the definition is still satisfied.
Landau, Schmidt and Sokol [39] showed an algorithm for computing -runs in a string of length that works in time; hence, the total number of -runs reported is . Kolpakov and Kucherov [34, 35] showed that all -runs (called there runs of -mismatch tandem repeats) in a string of length can be computed in time, where is the number of -runs. Many other algorithms for computing approximate tandem repeats under various metrics, in the context of computational biology and with the aid of statistical methods and heuristics, were proposed [6, 7, 15, 16, 28, 31, 36, 43, 45, 46, 47, 48, 49, 50, 51, 52, 55, 56]. In Section 2 we show the following theorem.
Theorem 1.
A string of length contains -runs.
Theorem 1 provides the first upper bound on the number of -runs for and implies that the algorithm of Kolpakov and Kucherov computing -runs actually works in time. Actually, we show a stronger condition that a string of length contains uniform -runs. Intuitively, a uniform -run is a maximal sequence of consecutive -mismatch squares of the same length in which the mismatches of all squares are at the same positions. (See Section 2 for a formal definition.) Our bound on the number of uniform -runs implies the bound for -runs as well as an upper bound on the number of -repetitions as defined in [34, 35] (called there -mismatch globally defined repetitions).
1.2 Parameterized Squares
For a string , by we denote the set of characters of . Two strings , parameterized match if and there is a bijection such that (i.e., and for all ). A parameterized square (p-square, in short) is a string such that parameterized matches (see Figure 2). Two p-squares and are called equivalent if they parameterized match.
Parameterized matching was introduced by Baker [2] motivated by applications in code refactoring and plagiarism detection. The notion of p-squares was introduced by Kociumaka, Radoszewski, Rytter, and Waleń [30] who showed that a string of length over alphabet of size contains at most non-equivalent p-squares. They also considered avoidability of parameterized cubes. The bounds from [30] were recently improved by Hamai, Taketsugu, Nakashima, Inenaga, and Bannai [24]; they showed that a length- string over alphabet of size contains less than non-equivalent p-squares. This bound automatically implies that such a string contains at most p-squares that are distinct as strings (improving the upper bound from [30]). We show that all non-equivalent p-squares can be computed efficiently; our approach also extends to p-squares that are distinct as strings. See Section 4.
Theorem 2.
All non-equivalent p-square substrings in a string of length over alphabet of size can be computed in time.
All p-square substrings that are distinct as strings can be computed in time, where is the number of p-squares reported.
By the above discussion, all p-square substrings that are distinct as strings can be computed in time. The key ingredient in the algorithm behind Theorem 2 is a relation between p-squares and -mismatch squares (and uniform -runs).
1.3 Generalised Squares under Substring Consistent Equivalence Relations
We then show an alternative algorithm for counting p-squares whose complexity does not depend on the alphabet size . The algorithm is stated for squares under any substring consistent equivalence relation (SCER in short). An SCER is a relation on strings such that implies that (1) and (2) for all ; see [42]. Known examples of SCERs include parameterized matching, order-preserving matching [29, 37], Cartesian tree matching [44], and palindrome pattern matching [26]. Two strings and order-preserving match if , sets and are totally ordered, and there is a bijection that is increasing (i.e., if then ) such that . Strings , Cartesian-tree match if they have the same shape of a Cartesian tree (cf. [54]). Finally, strings and palindrome match if for all , is a palindrome if and only if is a palindrome.
An -square is a string such that under SCER . Thus p-squares and order-preserving squares [30] (op-squares, in short) as well as Cartesian-tree squares [44] (CT-squares) and squares in the sense of palindrome matching (palindrome-squares) are -squares for the respective SCERs . See Figure 2 for an example.
In a prefix consistent equivalence relation (PCER), condition (2) of an SCER only needs to hold for . An SCER is a PCER. We say that is an encoding function if
| (1) |
Observation 3.
For every PCER on strings over integer alphabet there exists an encoding function.
Proof.
Let be defined as the number of the equivalence class under of among all strings of length . Let us verify that satisfies equivalence (1). If , then, by definition, and for all . Then indeed . Taking , we obtain . As , we have .
The encoding function defined in the proof of Observation 3 could be hard to compute efficiently for a particular PCER . It is also stronger, as it satisfies . Luckily, for the aforementioned known SCERs efficient encoding functions are known, as shown in the following Example 4.
For an encoding function , let , be integer sequences such that for a string of length over integer alphabet, after preprocessing time, for any can be computed in time.
Example 4.
For parameterized matching, there exists a classic -encoding, see [2], that is an encoding function:
The -encodings of all prefixes of a string can be computed by sorting all pairs . If is over an integer alphabet, this is performed in time. Then can be computed from in time.
For order-preserving matching, one can use an encoding as pairs . Here
and if there is no such , then . Similarly,
and if no such exists; see [13, 29, 37]. As we require the encoding function to return integers, we can set , since . Then [13, Lemma 4] implies that is an encoding function for order-preserving matching and [13, Lemma 24] gives , .
For Cartesian-tree matching, [44, Theorem 1] shows that the following parent-distance representation:
is an encoding function. Encodings of all prefixes of a string can be computed in time using a folklore nearest smaller value algorithm. In [44, Section 5.2] it is noted that can be computed from in time.
By we denote the reverse of string . In [26, Lemma 2] it is shown that the length of the longest suffix palindrome of , formally:
is an encoding function for palindrome matching.
Computing encodings of substrings can be reduced in linear time to 2D range successor queries as follows.
In the 2D range successor problem we are given points in an grid and we are to answer queries that, given a rectangle in the grid, report a point in the rectangle with the smallest first coordinate, if any. Such queries can be answered in time after preprocessing [18].
In the reduction, we compute all maximal palindromes in in time using Manacher’s algorithm [41]. Let us show how we deal with odd-length palindromes; the approach for even-length palindromes is analogous. For a maximal palindrome with and , we create a point . To compute , it suffices to find the point in the rectangle with the smallest first coordinate. If is the sought coordinate, the longest odd-length suffix palindrome of has length .
By [18], we get and .
In Section 5 we show the next theorem on efficient counting of -squares.
Theorem 5.
Let be an SCER and be its encoding function. The number of non-equivalent -square substrings in a length- string can be computed in time.
The number of -square substrings in a length- string that are distinct as strings can be computed in the same time complexity.
Consequently, in a length- string, the numbers of non-equivalent p-squares, op-squares, CT-squares, and palindrome-squares, as well as the numbers of the respective squares that are distinct as strings can be computed in time, as the respective SCERs enjoy encoding functions with and as shown in Example 4. In particular, the algorithm of Theorem 5 in the case that improves upon the -time algorithm of Gawrychowski, Ghazawi, and Landau [19] for reporting (thus, counting) op-squares that are distinct as strings. Moreover, the obtained complexity for counting non-equivalent p-squares (p-squares that are distinct as strings, respectively) is better than the one in Theorem 2 if (, respectively).
2 Upper Bound on the Number of -Runs
This section introduces an upper bound on the number of -runs (proof of Theorem 1) which implies their efficient computation by the algorithm of Kolpakov and Kucherov [34, 35]. A position in string is called an -mismatching position if and . In particular, is a -mismatch square if and only if contains at most -mismatching positions. Let us formally define a uniform -run.
Definition 6.
A substring is called a uniform -run of period if and the sets of -mismatching positions for all have cardinality at most and are all the same. We are interested in maximal uniform -runs.
A gapped repeat in a string is a fragment of the form , for . Fragments denoted by are called arms of the gapped repeat (left and right arm). The period of the gapped repeat is defined as . An -gapped repeat (for ) in is a gapped repeat such that . An (-)gapped repeat is called maximal if its arms cannot be extended simultaneously with the same character to the right or to the left. A maximal (-)gapped repeat will be called an (-)MGR, for short. Gawrychowski, I, Inenaga, Köppl, and Manea [20] showed that, for a real , a string of length contains -MGRs.
A generalised run is a pair such that satisfies and (1) or does not have period , and (2) or does not have period . That is, a generalised run is a 0-run. A run is a generalised run in which is the smallest period of . A string of length contains less than runs and generalised runs [3].
Definition 7.
We say that an MGR with period in induces a uniform -run of period if .
We further say that a generalised run induces a uniform -run of period if .
Let us emphasize that the uniform -run does not need to be a substring of the MGR or generalised run. Intuitively, an MGR induces a uniform -run if the left half of at least one -mismatch square of the -run contains a position of the left arm of the MGR; see Figure 3. If an MGR or a generalised run induces a uniform -run, we say that the -run is induced by the MGR or run, respectively. We also say that the MGR or generalised run induces all -mismatch squares in the uniform -run that it induces.
Lemma 8.
An MGR induces at most uniform -runs.
Proof.
Consider an MGR with period . Let be the set of smallest -mismatching positions in in ; if there are no such positions, we select all -mismatching positions in in to the set . Similarly, let be the set of largest -mismatching positions in in ; if there are no such positions, we select all -mismatching positions in . If , we insert to . If , we insert to . This way, if a -mismatch square is induced by the MGR , then for (as otherwise the left half of the square would contain more than -mismatching positions or would not touch ).
Let us consider a window of length sliding left-to-right over with the left end point of the window located in interval . There are at most moments when the number of -mismatching positions in the window changes. Between them, we obtain uniform -runs if the number of -mismatching positions in the window is at most .
An analogous lemma for generalised runs is obtained with the same proof (replacing “MGR” by “generalised run”); see Figure 4.
Lemma 9.
A generalised run induces at most uniform -runs.
A uniform -run can be induced by several MGRs and generalised runs.
Example 10.
The middle uniform 2-run in Figure 3 is induced also by two other MGRs with period 8, (with arms ) and (with arms ).
For a gapped repeat , the fraction is called the gap ratio. That is, the gap ratio is the minimum such that is an -gapped repeat. We assign to each gapped repeat in a weight equal to the reciprocal of its gap ratio, that is, .
Lemma 11.
If and is a uniform -run of period , then it is induced by a generalised run or by some number of -MGRs of total weight at least .
Proof.
Let be all -mismatching positions among , listed in an increasing order. We define sentinel being the maximum -mismatching position that is smaller than ; if there is no such position, we set . Similarly, we define sentinel as the minimum -mismatching position that is at least ; if no such position exists, we set .
Let us fix . We have , so has period . Moreover, we have (1) or and (2) or . If , then is an MGR with arms equal to . If , is a generalised run. The MGR or generalised run induces unless or .
Assume that the uniform -run is not induced by a generalised run. If a string , for , is not a -MGR of period , the length of its left arm is smaller than . Hence, for strings that are not -MGRs, the total length of their left arms is at most . The remaining -MGRs among have total length of left arms at least (accounting for the at most -mismatching positions ) which is at least by the assumption of the lemma. This means that the total weight of -MGRs that induce the -run, defined as the sum of the lengths of their left arms divided by their periods, all equal to , is at least .
We show a theorem that implies Theorem 1.
Theorem 12.
A string of length contains uniform -runs.
Proof.
Let be a string of length . Let us consider a bipartite graph such that vertices in are uniform -runs in , vertices in are -MGRs and generalised runs in , and there is an edge , for and , if -run is induced by the MGR or generalised run . Our goal is to bound from above.
Let us remove from all vertices in that are uniform -runs with period at most . There are at most such -runs (as there are at most substrings of of length at most ). Let us also remove from all vertices in that are generalised runs and all vertices in that are adjacent to at least one of the removed vertices in . By [3], this way at most vertices are removed from , so by Lemma 9, vertices are removed from . Let be the remaining graph. We assume that each edge has a weight equal to the weight of the MGR in it is incident to. In order to bound , we will consider the total weight of edges in .
For each , let be the number of -MGRs in that are not -MGRs. Each such MGR has weight at most and edges incident to it (cf. Lemma 8). Thus the sum of weights of edges in is bounded from above by
| (2) |
The improved upper bound on the number of -MGRs of I and Köppl [27] implies that
| (3) |
We show how to bound the number (2) in general.
Proof.
As long as there exists such that , we select such that or and , decrement and increment . Such exists by (3) and the pigeonhole principle. The operation does not change any lhs in (3) and increases (2).
In the end, we have and for all . Hence, (2) is bounded as:
By the claim, the sum of weights of edges in is . By Lemma 11, for each uniform -run with period that is not induced by a generalised run, the total weight of edges incident to it is at least . This concludes that , so the total number of uniform -runs (across all periods ) is .
A -run of period contains a prefix being a uniform -run of period , constructed by taking -mismatch squares of length at subsequent positions as long as their sets of -mismatching positions are the same. By maximality, no two -runs with equal period start at the same position, so this way each -run produces a different uniform -run. Thus the number of -runs in does not exceed the number of uniform -runs, which proves Theorem 1. By the same argument, each -repetition implies a unique uniform -run and so the number of -repetitions is . (We refer the reader to the definition of -repetition in [34, 35, 38] as this notion is not used below.)
3 -Suffix Trees and Their Applications
We introduce useful tools for the next two sections. Cole and Hariharan [11] defined a quasi-suffix collection as a collection of strings that satisfies the following conditions:
-
1.
and .
-
2.
No is a prefix of another .
-
3.
Suppose strings and have a common prefix of length . Then and have a common prefix of length at least .
A quasi-suffix collection is specified implicitly by a character oracle that given , returns . They obtain the following result.
Theorem 14 ([11]).
The compacted trie of a quasi-suffix collection of strings can be constructed in expected time assuming that the character oracle works in time.
For a string of length , the -suffix tree of is a compacted trie of strings , , where for a string and is an end-marker that is not an encoding of any string. We extend the sequences , so that for any string , after preprocessing time, for any can be computed in time. By we denote an upper bound on the total number of distinct characters in the strings stored in the -suffix tree of a string in . Theorem 14 implies the following corollary. (A similar result, but only for order-preserving equivalence, was shown in [19, Lemma 16].)
Corollary 15.
Let be an SCER and be its encoding function. The -suffix tree of a string in can be constructed in worst case time:
Proof.
Let us verify that strings satisfy the conditions 1-3 of a quasi-suffix collection. Condition 1 for is obvious. Condition 2 follows due to the end-marker. As for condition 3, assume and let as otherwise the conclusion is trivial. Then because of the end-marker. By equivalence (1), we have . Because is an SCER, we have . Hence, by equivalence (1).
An oracle for the quasi-suffix collection answers queries in time after preprocessing. The only source of randomness in the algorithm behind Theorem 14 is the need to maintain, for each explicit node of the current tree, a dictionary indexed by the next character on an outgoing edge. If we store the respective characters per each node in a dynamic predecessor data structure of Andersson and Thorup [1], the total space remains and each predecessor query is answered in time in the worst case. Alternatively, one can store all the children of a node in a list, to achieve space per an explicit node and time for a predecessor query. The complexity follows.
Remark 16.
For SCERs mentioned in Example 4, we obtain the following time complexities for constructing -suffix trees in the respective settings: for parameterized matching and Cartesian tree matching, for palindrome matching (that matches the complexity from [26]), and for order-preserving matching (a faster, -time construction was proposed in [13]).
For two strings and , by we denote . As in the case of standard suffix trees, by equivalence (1), having an -suffix tree of and the data structure answering lowest common ancestor queries for nodes in time after preprocessing [25, 5], we can answer queries about pairs of substrings of in time.
The longest previous -factor array for a string is an array such that
Computing this array can be stated in terms of the -suffix tree: for a leaf with index of the tree we are to find a leaf with index such that the lowest common ancestor of the two leaves is as low as possible. The actual value is then the (weighted) depth of this ancestor. In [13, Theorem 16] it was shown that this problem, stated for an arbitrary rooted (weighted) tree with leaves, can be solved in time. Thus, the array for a length- string can be computed in time if the -suffix tree is available.
4 Counting p-Squares in Time
In this section we consider and denotes the relation of parameterized matching. We use the following function for (see Example 17):
Example 17.
for is as follows:
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 1 | 1 | 2 | 3 | 2 | 1 | 4 | |
| 0 | 1 | 1 | 0 | 1 | 2 | 1 | 2 | 3 |
A similar function was used in the context of parameterized matching in [30]. Lemma 18 below shows that is an encoding function for parameterized matching and Lemma 19 shows that it can be computed efficiently.
The () part of the proof of Lemma 18 is similar to the proof of [30, Lemma 5.4]. A proof of the lemma can be found in the full version.
Lemma 18.
is an encoding function for parameterized matching.
Lemma 19.
We have , , and .
Proof.
Let . By definition, for any indices . Hence, (including the sentinel). In order to compute encodings of substrings of , we can store for every , the numbers of characters in respective prefixes of . Moreover, for every index , we store the position of the last occurrence of character in ; if there is no such occurrence. The array can be computed from left to right in time by storing an array of size of rightmost occurrences of each character. Then indeed . When computing , we can compare the counts of every character at prefixes and where . Hence, .
We define strings , of length such that
Our goal is to reduce reporting non-equivalent p-squares to computing uniform -runs. We use two auxiliary lemmas.
Lemma 20.
If is a p-square, then and are -mismatch squares.
Proof.
We show a proof that is a -mismatch square; the proof that is a -mismatch square is symmetric.
Let , for , be the positions of leftmost occurrences of characters from in ; if some character is not present in , it is not included in the sequence. Let . Let be the maximum index such that , so and for all because . Then
by the fact that is an SCER. Hence, and have at most mismatches, as required.
Lemma 21.
If is a p-square and , then is a p-square. Similarly, if is a p-square and , then is a p-square.
Proof.
Again, we only prove the first part of the lemma. We have , so since is an SCER. If
then , , and by the fact that is an encoding function for parameterized matching (Lemma 18). Otherwise,
so we immediately obtain . In both cases, is a p-square.
If is a uniform -run with period , then there is no -mismatching position in , i.e., . Hence, if is a uniform -run with period and is a p-square, then by Lemma 21, each string for is a p-square. With this intuition, we are ready to obtain the reduction.
Lemma 22.
Let . Reporting non-equivalent p-squares in reduces in time to computing uniform -runs in and , where is the number of these -runs.
Reporting p-squares in that are distinct as strings reduces to the same problem in time, where is the number of p-squares reported.
Proof.
For a uniform -run of period , we define its interval as . For each , let and be sets of intervals of uniform -runs of period in and , respectively. Let . Finally, let .
Claim 23.
If is a p-square in , then .
Proof.
By Lemma 20, if is a p-square in , then and are -mismatch squares. Therefore, there exist intervals and such that and . In particular, . We have and , so .
We store as a union of non-empty intervals of the form . Let this representation be denoted as .
All the representations can be computed from and in total time. Let us bucket sort all endpoints of intervals in and , for each . When processing the endpoints for a given , we keep track of the number of intervals containing a given position. This counter never exceeds 2 as intervals in each of and are pairwise disjoint. Whenever the counter reaches 2, for some endpoint , it will drop at the next endpoint encountered, say . Then is inserted into .
The next claim shows that, for each interval in , all positions correspond to p-squares of length or none of them does.
Claim 24.
Let . For each , is a p-square if and only if is a p-square.
Proof.
Let us fix . Let and be intervals such that .
Assume first that is a p-square. As was computed from uniform -runs, we have , so . By Lemma 21 applied for each subsequent position in , is a p-square.
Now assume that is a p-square. As , . By definition, we have , i.e., . In particular, . By Lemma 21 applied for each position in in decreasing order, is a p-square.
By Corollaries 15, 18, and 19, after preprocessing we can check if a given even-length substring of is a p-square in time using an query. For each and each interval , we check if is a p-square. By Claim 24, if so, we obtain an interval of occurrences of p-squares, and if not, then none of the positions in is the start of a p-square of length in . The total number of intervals in is linear in the number of uniform -runs in and , so we obtain intervals of occurrences of p-squares. By Claim 23, we do not miss any p-squares.
The final step of reporting non-equivalent p-squares mimics an analogous algorithm for standard squares of Bannai, Inenaga, and Köppl [4]. We report non-equivalent p-squares at their leftmost occurrence in and use the array to identify them in intervals of occurrences. More precisely, we compute the array in time and construct in time a data structure for answering Range Minimum Queries (RmQs) over in time per query; see [5, 25]. For each of the intervals of starting positions of p-squares of length , we want to list all positions such that and report corresponding leftmost occurrences of p-squares . We ask an RmQ on on the interval ; let the minimum be attained for . If , the algorithm is finished. Otherwise, we report and run the algorithm recursively on intervals and . The total time complexity is proportional to the number of reported p-squares plus 1. By [24], contains non-equivalent p-squares. This completes an -time reduction to computing uniform -runs.
When reporting all p-squares that are distinct as strings, it suffices to replace the array with the standard array. Such an array can be computed in time after the letters of the string have been sorted [12]. Then the computations take time.
By Theorem 12, string (and ) contains uniform -runs. They can be computed in time; we use the algorithm of Kolpakov and Kucherov [34, 35] to compute all -runs in and then partition each -run to uniform -runs using kangaroo jumps. That is, let be a -run with period. We compute two values:
that, intuitively, find the first -mismatching position in the -run, if any (), and the first -mismatching position after position , if any (). Let . We then report a uniform -run and continue processing the (non-maximal) -run until its length drops below . The LCP-queries in are answered in time [5, 25]. Thus Lemma 22 implies Theorem 2.
5 Counting Generalised Squares in Time
In this section we show an algorithm that counts non-equivalent -squares, for an SCER with encoding function , in time.
For a string of length and positive integer , we denote
An interval representation of a set of integers is , where , …, ; is called the size of the representation.
Counterparts of the following lemma corresponding to order-preserving squares and Cartesian-tree squares were shown in [21] and [44], respectively. Our proof generalises these proofs; it can be found in the full version.
Lemma 25.
Let be an SCER and be its encoding function. Given a string of length , the interval representations of the sets for all have total size and can be computed in time.
We count non-equivalent -squares using the array. We would like to count an -square at the position of its leftmost occurrence. For an integer array , we denote a range count query for , as . Then our problem reduces to computing the sum
| (4) |
We say that an array is a linear oscillation array if . The following fact is folklore for the LPF array; we prove it for in the full version.
Lemma 26.
For any SCER , is a linear oscillation array.
We will now show that the sum from Equation 4 can be computed in time. We use a very simple abstract problem.
Counting Problem
Input:
A set , initially empty, and an integer , initially .
Operation: One of: (1) insert an element to ; (2) delete an element from ; (3) increment by 1; (4) decrement by 1. After each operation, output .
The Counting Problem can be solved with a basic indicator array for .
Lemma 27.
After -time preprocessing, each operation in the Counting Problem can be answered in time.
Proof.
We store an indicator array of bits such that if and only if . Initially, . We also store a value ; initially . After each operation, the current value of is returned.
When we insert an element to , we set to 1 and increment if . Symmetrically, when we delete from , we set to 0 and decrement if . When we increment , we subtract from . When we decrement , we add to .
We are ready to obtain the final main result.
Proof of Theorem 5.
First we show how to compute the number of non-equivalent -squares in a length- string. The array can be computed in time (Corollary 15). We compute the interval representations of the sets using Lemma 25 in time. The interval representations have total size . We need to compute the sum of results of queries as in Equation 4. We will do it using the Counting Problem.
Let us bucket sort all start and end points of intervals across all sets in total time. We create an instance of the Counting Problem. For each position from 1 to , we proceed as follows. First, for each interval , for any , we insert to the set . Then, we change the current value in the problem to by performing increments or decrements, as appropriate. Afterwards, we add the returned value of the Counting Problem to the final result. Finally, for each interval , for any , we delete from the set .
For each , the intervals in are pairwise disjoint. By Lemma 25, insertions and deletions are performed in the Counting Problem. The total number of increments and decrements is bounded by by Lemma 26 (and the fact that the values in this array are in ). In total, the sum (4) is computed in time. We obtain the first part of Theorem 5.
As before, if one wishes to compute all -squares that are distinct as substrings, it suffices to replace the array in the algorithm by the standard array. We obtain the second part of Theorem 5.
References
- [1] Arne Andersson and Mikkel Thorup. Dynamic ordered sets with exponential search trees. Journal of the ACM, 54(3):13, 2007. doi:10.1145/1236457.1236460.
- [2] Brenda S. Baker. Parameterized pattern matching: Algorithms and applications. Journal of Computer and System Sciences, 52(1):28–42, 1996. doi:10.1006/JCSS.1996.0003.
- [3] Hideo Bannai, Tomohiro I, Shunsuke Inenaga, Yuto Nakashima, Masayuki Takeda, and Kazuya Tsuruta. The “Runs” theorem. SIAM Journal on Computing, 46(5):1501–1514, 2017. doi:10.1137/15M1011032.
- [4] Hideo Bannai, Shunsuke Inenaga, and Dominik Köppl. Computing all distinct squares in linear time for integer alphabets. In Juha Kärkkäinen, Jakub Radoszewski, and Wojciech Rytter, editors, 28th Annual Symposium on Combinatorial Pattern Matching, CPM 2017, July 4-6, 2017, Warsaw, Poland, volume 78 of LIPIcs, pages 22:1–22:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2017. doi:10.4230/LIPICS.CPM.2017.22.
- [5] Michael A. Bender and Martin Farach-Colton. The LCA problem revisited. In Gaston H. Gonnet, Daniel Panario, and Alfredo Viola, editors, LATIN 2000: Theoretical Informatics, 4th Latin American Symposium, Punta del Este, Uruguay, April 10-14, 2000, Proceedings, volume 1776 of Lecture Notes in Computer Science, pages 88–94. Springer, 2000. doi:10.1007/10719839_9.
- [6] Gary Benson. An algorithm for finding tandem repeats of unspecified pattern size. In Sorin Istrail, Pavel A. Pevzner, and Michael S. Waterman, editors, Proceedings of the Second Annual International Conference on Research in Computational Molecular Biology, RECOMB 1998, New York, NY, USA, March 22-25, 1998, pages 20–29. ACM, 1998. doi:10.1145/279069.279079.
- [7] Gary Benson. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Research, 27(2):573–580, 1999. doi:10.1093/nar/27.2.573.
- [8] Srečko Brlek and Shuo Li. On the number of squares in a finite word. Combinatorial Theory, 5(1), 2025. doi:10.5070/C65165014.
- [9] Panagiotis Charalampopoulos, Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń, and Wiktor Zuba. Efficient enumeration of distinct factors using package representations. In Christina Boucher and Sharma V. Thankachan, editors, String Processing and Information Retrieval – 27th International Symposium, SPIRE 2020, Orlando, FL, USA, October 13-15, 2020, Proceedings, volume 12303 of Lecture Notes in Computer Science, pages 247–261. Springer, 2020. doi:10.1007/978-3-030-59212-7_18.
- [10] Panagiotis Charalampopoulos, Manal Mohamed, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń, and Wiktor Zuba. Counting distinct square substrings in sublinear time. In Paweł Gawrychowski, Filip Mazowiecki, and Michał Skrzypczak, editors, 50th International Symposium on Mathematical Foundations of Computer Science, MFCS 2025, August 25–29, 2025, Warsaw, Poland, LIPIcs, 2025. doi:10.4230/LIPICS.MFCS.2025.36.
- [11] Richard Cole and Ramesh Hariharan. Faster suffix tree construction with missing suffix links. SIAM Journal on Computing, 33(1):26–42, 2003. doi:10.1137/S0097539701424465.
- [12] Maxime Crochemore and Lucian Ilie. Computing longest previous factor in linear time and applications. Information Processing Letters, 106(2):75–80, 2008. doi:10.1016/J.IPL.2007.10.006.
- [13] Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Marcin Kubica, Alessio Langiu, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. Order-preserving indexing. Theoretical Computer Science, 638:122–135, 2016. doi:10.1016/J.TCS.2015.06.050.
- [14] Maxime Crochemore, Costas S. Iliopoulos, Marcin Kubica, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. Extracting powers and periods in a word from its runs structure. Theoretical Computer Science, 521:29–41, 2014. doi:10.1016/J.TCS.2013.11.018.
- [15] Olivier Delgrange and Eric Rivals. STAR: an algorithm to search for tandem approximate repeats. Bioinformatics, 20:2812–2820, 2004. doi:10.1093/bioinformatics/bth335.
- [16] Nevzat Onur Domaniç and Franco P. Preparata. A novel approach to the detection of genomic approximate tandem repeats in the Levenshtein metric. Journal of Computational Biology, 14(7):873–891, 2007. PMID: 17803368. doi:10.1089/cmb.2007.0018.
- [17] Jonas Ellert and Johannes Fischer. Linear time runs over general ordered alphabets. In Nikhil Bansal, Emanuela Merelli, and James Worrell, editors, 48th International Colloquium on Automata, Languages, and Programming, ICALP 2021, July 12-16, 2021, Glasgow, Scotland (Virtual Conference), volume 198 of LIPIcs, pages 63:1–63:16. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPICS.ICALP.2021.63.
- [18] Younan Gao, Meng He, and Yakov Nekrich. Fast preprocessing for optimal orthogonal range reporting and range successor with applications to text indexing. In Fabrizio Grandoni, Grzegorz Herman, and Peter Sanders, editors, 28th Annual European Symposium on Algorithms, ESA 2020, September 7-9, 2020, Pisa, Italy (Virtual Conference), volume 173 of LIPIcs, pages 54:1–54:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2020. doi:10.4230/LIPICS.ESA.2020.54.
- [19] Pawel Gawrychowski, Samah Ghazawi, and Gad M. Landau. Order-preserving squares in strings. In Laurent Bulteau and Zsuzsanna Lipták, editors, 34th Annual Symposium on Combinatorial Pattern Matching, CPM 2023, June 26-28, 2023, Marne-la-Vallée, France, volume 259 of LIPIcs, pages 13:1–13:19. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPICS.CPM.2023.13.
- [20] Pawel Gawrychowski, Tomohiro I, Shunsuke Inenaga, Dominik Köppl, and Florin Manea. Tighter bounds and optimal algorithms for all maximal -gapped repeats and palindromes – finding all maximal -gapped repeats and palindromes in optimal worst case time on integer alphabets. Theory of Computing Systems, 62(1):162–191, 2018. doi:10.1007/S00224-017-9794-5.
- [21] Garance Gourdel, Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, Arseny M. Shur, and Tomasz Waleń. String periods in the order-preserving model. Information and Computation, 270, 2020. doi:10.1016/J.IC.2019.104463.
- [22] Dan Gusfield. Algorithms on Strings, Trees, and Sequences – Computer Science and Computational Biology. Cambridge University Press, 1997. doi:10.1017/CBO9780511574931.
- [23] Dan Gusfield and Jens Stoye. Linear time algorithms for finding and representing all the tandem repeats in a string. Journal of Computer and System Sciences, 69(4):525–546, 2004. doi:10.1016/J.JCSS.2004.03.004.
- [24] Rikuya Hamai, Kazushi Taketsugu, Yuto Nakashima, Shunsuke Inenaga, and Hideo Bannai. On the number of non-equivalent parameterized squares in a string. In Zsuzsanna Lipták, Edleno Silva de Moura, Karina Figueroa, and Ricardo Baeza-Yates, editors, String Processing and Information Retrieval – 31st International Symposium, SPIRE 2024, Puerto Vallarta, Mexico, September 23-25, 2024, Proceedings, volume 14899 of Lecture Notes in Computer Science, pages 174–183. Springer, 2024. doi:10.1007/978-3-031-72200-4_13.
- [25] Dov Harel and Robert Endre Tarjan. Fast algorithms for finding nearest common ancestors. SIAM Journal on Computing, 13(2):338–355, 1984. doi:10.1137/0213024.
- [26] Tomohiro I, Shunsuke Inenaga, and Masayuki Takeda. Palindrome pattern matching. Theoretical Computer Science, 483:162–170, 2013. doi:10.1016/J.TCS.2012.01.047.
- [27] Tomohiro I and Dominik Köppl. Improved upper bounds on all maximal -gapped repeats and palindromes. Theoretical Computer Science, 753:1–15, 2019. doi:10.1016/J.TCS.2018.06.033.
- [28] Haim Kaplan, Ely Porat, and Nira Shafrir. Finding the position of the k-mismatch and approximate tandem repeats. In Lars Arge and Rusins Freivalds, editors, Algorithm Theory – SWAT 2006, 10th ScandinavianWorkshop on Algorithm Theory, Riga, Latvia, July 6-8, 2006, Proceedings, volume 4059 of Lecture Notes in Computer Science, pages 90–101. Springer, 2006. doi:10.1007/11785293_11.
- [29] Jinil Kim, Peter Eades, Rudolf Fleischer, Seok-Hee Hong, Costas S. Iliopoulos, Kunsoo Park, Simon J. Puglisi, and Takeshi Tokuyama. Order-preserving matching. Theoretical Computer Science, 525:68–79, 2014. doi:10.1016/J.TCS.2013.10.006.
- [30] Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. Maximum number of distinct and nonequivalent nonstandard squares in a word. Theoretical Computer Science, 648:84–95, 2016. doi:10.1016/J.TCS.2016.08.010.
- [31] Roman Kolpakov, Ghizlane Bana, and Gregory Kucherov. mreps: Efficient and flexible detection of tandem repeats in DNA. Nucleic Acids Research, 31:3672–3678, 2003. doi:10.1093/nar/gkg617.
- [32] Roman Kolpakov, Mikhail Podolskiy, Mikhail Posypkin, and Nickolay Khrapov. Searching of gapped repeats and subrepetitions in a word. Journal of Discrete Algorithms, 46-47:1–15, 2017. doi:10.1016/J.JDA.2017.10.004.
- [33] Roman M. Kolpakov and Gregory Kucherov. Finding maximal repetitions in a word in linear time. In 40th Annual Symposium on Foundations of Computer Science, FOCS 1999, pages 596–604. IEEE Computer Society, 1999. doi:10.1109/SFFCS.1999.814634.
- [34] Roman M. Kolpakov and Gregory Kucherov. Finding approximate repetitions under Hamming distance. In Friedhelm Meyer auf der Heide, editor, Algorithms – ESA 2001, 9th Annual European Symposium, Aarhus, Denmark, August 28-31, 2001, Proceedings, volume 2161 of Lecture Notes in Computer Science, pages 170–181. Springer, 2001. doi:10.1007/3-540-44676-1_14.
- [35] Roman M. Kolpakov and Gregory Kucherov. Finding approximate repetitions under Hamming distance. Theoretical Computer Science, 303(1):135–156, 2003. doi:10.1016/S0304-3975(02)00448-6.
- [36] Arun Krishnan and Francis Tang. Exhaustive whole-genome tandem repeats search. Bioinformatics, 20(16):2702–2710, May 2004. doi:10.1093/bioinformatics/bth311.
- [37] Marcin Kubica, Tomasz Kulczyński, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. A linear time algorithm for consecutive permutation pattern matching. Information Processing Letters, 113(12):430–433, 2013. doi:10.1016/J.IPL.2013.03.015.
- [38] Gregory Kucherov and Dina Sokol. Approximate tandem repeats. In Ming-Yang Kao, editor, Encyclopedia of Algorithms, pages 1–6. Springer Berlin Heidelberg, Berlin, Heidelberg, 2014. doi:10.1007/978-3-642-27848-8_24-2.
- [39] Gad M. Landau, Jeanette P. Schmidt, and Dina Sokol. An algorithm for approximate tandem repeats. Journal of Computational Biology, 8(1):1–18, 2001. doi:10.1089/106652701300099038.
- [40] Michael G. Main and Richard J. Lorentz. An O(n log n) algorithm for finding all repetitions in a string. Journal of Algorithms, 5(3):422–432, 1984. doi:10.1016/0196-6774(84)90021-X.
- [41] Glenn K. Manacher. A new linear-time "on-line" algorithm for finding the smallest initial palindrome of a string. Journal of the ACM, 22(3):346–351, 1975. doi:10.1145/321892.321896.
- [42] Yoshiaki Matsuoka, Takahiro Aoki, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Generalized pattern matching and periodicity under substring consistent equivalence relations. Theoretical Computer Science, 656:225–233, 2016. doi:10.1016/J.TCS.2016.02.017.
- [43] Angelika Merkel and Neal Gemmell. Detecting short tandem repeats from genome data: opening the software black box. Briefings in Bioinformatics, 9:355–366, 2008. doi:10.1093/bib/bbn028.
- [44] Sung Gwan Park, Magsarjav Bataa, Amihood Amir, Gad M. Landau, and Kunsoo Park. Finding patterns and periods in Cartesian tree matching. Theoretical Computer Science, 845:181–197, 2020. doi:10.1016/J.TCS.2020.09.014.
- [45] Marco Pellegrini, M. Elena Renda, and Alessio Vecchio. TRStalker: an efficient heuristic for finding fuzzy tandem repeats. Bioinformatics, 26(12):i358–i366, June 2010. doi:10.1093/bioinformatics/btq209.
- [46] Jeff Reneker, Chi-Ren Shyu, Peiyu Zeng, Joseph C. Polacco, and Walter Gassmann. ACMES: fast multiple-genome searches for short repeat sequences with concurrent cross-species information retrieval. Nucleic Acids Research, 32:W649–W653, 2004. doi:10.1093/nar/gkh455.
- [47] E. Rivals, J. Delahaye, M. Dauchet, and O. Delgrange. A first step toward chromosome analysis by compression algorithms. In Proceedings First International Symposium on Intelligence in Neural and Biological Systems. INBS 1995, pages 233–239, 1995. doi:10.1109/INBS.1995.404256.
- [48] Eric Rivals, Olivier Delgrange, Jean-Paul Delahaye, Max Dauchet, Marie-Odile Delorme, Alain Hénaut, and Emmanuelle Ollivier. Detection of significant patterns by compression algorithms: the case of approximate tandem repeats in DNA sequences. Computer Applications in the Biosciences, 13(2):131–136, 1997. doi:10.1093/BIOINFORMATICS/13.2.131.
- [49] Marie-France Sagot and Eugene W. Myers. Identifying satellites and periodic repetitions in biological sequences. Journal of Computational Biology, 5(3):539–553, 1998. doi:10.1089/CMB.1998.5.539.
- [50] Tiago José P. Sobreira, Alan M. Durham, and Arthur Gruber. TRAP: automated classification, quantification and annotation of tandemly repeated sequences. Bioinformatics, 22(3):361–362, 2006. doi:10.1093/bioinformatics/bti809.
- [51] Dina Sokol, Gary Benson, and Justin Tojeira. Tandem repeats over the edit distance. Bioinformatics, 23(2):e30–e35, January 2007. doi:10.1093/bioinformatics/btl309.
- [52] Dina Sokol and Justin Tojeira. Speeding up the detection of tandem repeats over the edit distance. Theoretical Computer Science, 525:103–110, 2014. Advances in Stringology. doi:10.1016/j.tcs.2013.04.021.
- [53] Axel Thue. Über unendliche Zeichenreihen. Norske Videnskabers Selskabs Skrifter Mat.-Nat. Kl., 7:1–22, 1906.
- [54] Jean Vuillemin. A unifying look at data structures. Communications of the ACM, 23(4):229–239, April 1980. doi:10.1145/358841.358852.
- [55] Ydo Wexler, Zohar Yakhini, Yechezkel Kashi, and Dan Geiger. Finding approximate tandem repeats in genomic sequences. In Philip E. Bourne and Dan Gusfield, editors, Proceedings of the Eighth Annual International Conference on Computational Molecular Biology, 2004, San Diego, California, USA, March 27-31, 2004, pages 223–232. ACM, 2004. doi:10.1145/974614.974644.
- [56] Hongxia Zhou, Liping Du, and Hong Yan. Detection of tandem repeats in dna sequences based on parametric spectral estimation. IEEE Transactions on Information Technology in Biomedicine, 13(5):747–755, 2009. doi:10.1109/TITB.2008.920626.
