Counting Distinct Square Substrings in Sublinear Time
Abstract
We show that the number of distinct squares in a packed string of length over an alphabet of size can be computed in time in the word-RAM model of computation. This paper is the first to introduce a sublinear time algorithm for the packed version of squares counting. The packed representation of a string of length over an alphabet of size is given as a sequence of machine words in the word-RAM model (a machine word consists of bits).
Previously it was known how to count distinct squares in time [Gusfield and Stoye, JCSS 2004], even for a string over an integer alphabet, see [Crochemore et al., TCS 2014; Bannai et al., CPM 2017; Charalampopoulos et al., SPIRE 2020]. We use techniques of squares extraction from runs described by Crochemore et al. [TCS 2014]. However, the packed model requires novel approaches. In particular, we need an sized representation of all long-period runs (runs with periods that are ) which guarantees sublinear time counting of potentially linearly-many implied squares. The long-period runs with a string period that is periodic itself (called layer runs) are an obstacle, since their number can be . Fortunately, the number of all other long-period runs is and we can construct an implicit representation of all long-period runs in time by adopting the insights of Amir et al. [ESA 2019], combined with sublinear time tools provided by the PILLAR model of computations in case of packed strings. We count squares in layer runs in sublinear time by exploiting combinatorial properties of types of pyramidally-shaped groups of layer runs. As a by-product, we discover several new structural properties of runs.
Another difficulty is to compute, in sublinear time, locations of Lyndon roots of runs in packed strings, which is needed for grouping of runs that can generate equal squares. To overcome this difficulty, we introduce sparse-Lyndon roots which are based on the notion of string synchronizers proposed by Kempa and Kociumaka [STOC 2019].
Keywords and phrases:
square in a string, packed model, run (maximal repetition), Lyndon wordFunding:
Jakub Radoszewski: Supported by the Polish National Science Center, grant no. 2022/46/E/ST6/00463.Copyright and License:
2012 ACM Subject Classification:
Theory of computation Pattern matchingEditors:
Paweł Gawrychowski, Filip Mazowiecki, and Michał SkrzypczakSeries and Publisher:
Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
We consider a problem of counting distinct squares (and more generally, powers) in a string. Such problems are important not only from a purely theoretical point of view, but are also relevant in some applications in bioinformatics (see the book [27]). Strings of the form , for a non-empty string , called squares (or tandem repeats), are the most natural type of repetition.
A fundamental algorithmic problem related to squares is checking if a given string of length is square-free, that is, if it avoids square substrings. Thue’s construction of an infinite ternary square-free string [44] can be viewed as the beginning of combinatorics on words. The first -time algorithm for checking square-freeness was given by Main and Lorentz [40]. An -time algorithm for this problem, for the case of a constant-sized alphabet, was proposed by Crochemore [18]. Subsequently, -time algorithms for square-freeness over an integer alphabet and over a general ordered alphabet follow from Kolpakov and Kucherov’s [36] and Ellert and Fischer’s [23] algorithms for computing runs under these assumptions, respectively. Most recently, Ellert, Gawrychowski, and Gourdel [24] obtained an -time algorithm for testing square-freeness of a string over a general unordered alphabet of size ; they also showed that the algorithm is optimal under these assumptions. Square-freeness was also studied in on-line [29, 37], parallel [2, 3, 20] and dynamic [1] settings.
A much more challenging problem than testing square-freeness, that has received significant attention, is computing the number of distinct square substrings of a given string. Fraenkel and Simpson were the first to show that a string of length contains dictinct squares [26]. Brlek and Li very recently, using arguments from linear algebra and graph theory, improved the upper bound of Fraenkel and Simpson to just ; see [11, 10].
Linear-time algorithms for counting distinct square substrings were proposed by Gusfield and Stoye [28], Crochemore et al. [19], Bannai, Inenaga, and Köppl [6], and Charalampopoulos et al. [14]; notably, the last three results work for a string over an integer (generally, linearly sortable) alphabet. As already mentioned, testing square-freeness is a simpler problem than counting distinct squares. In particular, for a general (ordered) alphabet, element distinctness can be reduced in linear time to counting squares111As for the reduction, a sequence contains a repeating element, if and only if, string , for a square-free string (say, a prefix of the infinite ternary square free string [44]), contains less than distinct square substrings., and the latter problem is hard: it requires time in the comparison model [8].
In the word-RAM model of computation with word size , we may store up to string characters in a single machine word, where is the size of the alphabet. The packed representation of a length- string over an integer alphabet is a sequence of integers, each encoding a fragment of of length .
A recent line of work has yielded -time solutions for several basic stringology problems in the setting where the input string(s) is/are given in packed form (the packed setting). These include pattern matching [7] and indexing [31, 41], computing the LZ factorization and BWT [22, 30, 31, 32], the longest common substring [13], the longest palindromic substring [16], the Lyndon array [4], and covers of a string [42]. While for some of the discussed problems, such as pattern matching, optimal -time algorithms exist, for several others, such as BWT construction, the best known algorithms run in time . A recent work of Kempa and Kociumaka [33] shed light on the source of difficulty of several stringology problems for which the state-of-the-art solutions take time.
We are the first to study the problem of counting squares in the packed setting. The problem is formally defined as follows.
Packed Counting Of Distinct Squares
Input: A string of length over alphabet given in a packed representation.
Output: , where denotes the set of all squares that are equal to some substring of .
Example 1.
Consider string ; see Figure 1 for a comparison. This string contains 2000 distinct squares. We have
We settle the time complexity of the Packed Counting Of Distinct Squares problem, classifiying it as one of the elementary stringology problems that admit an -time solution in the packed setting.
Theorem 2.
The Packed Counting Of Distinct Squares problem can be solved in time.
Moreover, our algorithm can report distinct squares, for any between 0 and the actual number of distinct squares in the string, in time. Our algorithm generalizes readily to powers with higher exponent: for any integer , a string of length contains at most powers with exponent [39] and we can compute the actual number of those in a packed string in time.
Other related work.
A string of length contains substrings that are primitively rooted squares and they can all be computed in time; see [17, 21, 43]. The same representation is computed by Apostolico and Breslauer’s parallel algorithm in [3]. We note that these algorithms compute all occurrences of primitively rooted squares and are not concerned with whether any two computed substrings correspond to the same square.
Technical Overview
Let be a string of length over alphabet . It was already observed by Crochemore et al. [19] that distinct squares in can be computed from runs (maximal periodic fragments). This is because a square can be extended to a unique run with the same period as the primitive root of and a string of length contains runs that can be computed in time [36, 5, 23]. Amir et al. [1] (see also [12]) proposed an algorithm that efficiently maintains a representation of runs in a dynamic string. We note that their algorithm works in the PILLAR model of Charalampopoulos, Kociumaka, and Wellnitz [15] and it can be used to compute a representation of all runs in with period at least in time. All the remaining runs in either fit in a machine word or are so-called -runs (see [31]); the number of the latter is . As a result, a representation of all runs in can be computed in time and space. This representation can be used to compute the longest square in a string in time in a straightforward way.
Computing the number of distinct squares from this representation is much more challenging. The first obstacle towards achieving this goal is the difficulty in grouping the runs with respect to their Lyndon roots in packed strings. We overcome this obstacle by introducing a version of Lyndon roots more suitable for the packed model, called here sparse Lyndon roots. Positions of such nonstandard roots are based on synchronizing sets of positions (see Kempa and Kociumaka [31]).
Let us call squares whose root is both primitive and highly periodic (here: containing at least 4 occurrences of the period) special, while remaining squares are called plain. Plain squares can be efficiently counted by combining tabulation with the approach of [19] applied on a selected -sized subset of the runs of , after grouping these runs by their Lyndon roots (for small-period runs) or sparse Lyndon roots (for long-period runs). Counting special squares is significantly more challenging. We tackle this problem by processing certain families of runs. The runs corresponding to special squares with large period (larger than ) are called here layer-runs. The crucial point is that these layer-runs can be grouped in groups called “pyramids”, though they can contain together layer-runs. These “pyramids” are very regular and counting squares in them can be done in batches in sublinear time. Let us provide a trivial yet illustrative example.
Example 3.
Consider string . For each , we have a square of the form that occurs at (0-based) position ; this square is primitively rooted and, in fact, is a run. These are the only primitively rooted squares in other than primitively rooted squares abab and baba; see Figure 1.
Using the approach of Crochemore et al. [19], we can count all plain squares in the string from Example 3 by processing a constant number of runs: , , bb, and for . However, there are runs of the form for , each corresponding to a special square, and we clearly cannot afford to iterate over these runs in time. All such special squares in our example have their first half in run and their second half in run , and that these two runs have the same Lyndon root (ab).
We employ an -sized representation of all runs that generate special squares. As presented intuitively in Example 3, the representation relies on pairs and of runs that have the same Lyndon root and satisfy . Counting special squares from said representation is reduced to appropriately grouping the computed pairs of runs.
2 Preliminaries
For a string , its positions are numbered as . If , is called the empty string. A string consisting of characters is a substring of . A fragment of is a positioned substring; that is, a fragment represents the substring . Where possible, we treat fragments and substrings as equivalent. For two fragments and , we write if . Two fragments and are called neighboring if . For two neighboring fragments and , by we denote the fragment and by we denote the fragment .
We say that a positive integer is a period of string if and for all ; equivalently, . Fine and Wilf’s periodicity lemma [25] asserts that if a string of length has periods and such that , then the string has period . By we denote the smallest period of to which we refer as the period of .
A non-empty string is called primitive if the equality for a positive integer implies that . By we denote a cyclic rotation of the string , obtained by moving the first characters of to its end. A string is called a Lyndon string if it is primitive and lexicographically minimal in the class of its cyclic rotations.
We say that a string is periodic if and highly periodic if . The Lyndon root of a periodic string , denoted by , is the lexicographically smallest rotation of . For example, .
Definition 4.
The Lyndon representation of a periodic string is a quadruple such that:
-
, and
-
with and . ( and/or can be the empty string.)
Example 5.
We have for
A run in a string is a fragment that is periodic, that is, , and inclusion-maximal, that is,
-
or and
-
or .
We denote the set of runs in by . A string of length contains at most runs and they can be computed in time [5], even if the string is over an arbitrary ordered alphabet [23].
The Lyndon position of a run with Lyndon root is the unique position such that .
A square is a string of the form for a non-empty string . A square is called primitively rooted if is primitive and non-primitively-rooted otherwise.
We say that a square is generated by a periodic string if is contained in and . By the periodicity lemma, in this case is a multiple of . We denote by the set of squares generated by a periodic string .
Example 6.
For the underlined run with period 2 in , we have .
For a set of periodic fragments of we denote
The following observation states that each square in a string is generated by a run.
Observation 7 ([19]).
Unfortunately, from the point of view of counting, the same square string can be generated by many runs. Dealing with squares whose first half is both primitive and (highly) periodic is the most challenging. The next section is devoted to runs that generate such squares.
The next fact follows by using radix sort (with bucket sort).
Fact 8.
A list of -tuples of integers in , for any constant , can be sorted lexicographically in a stable manner in time.
3 Pyramids of Runs
In this section we describe the structure of runs that generate special squares.
Definition 9 (Subperiodic strings).
For a periodic string we define
A periodic string is called subperiodic if .
Example 10.
The string with period 5 generates a subperiodic square of length . Thus is subperiodic, and . Observe that all the remaining squares generated by , in particular, and , are not subperiodic.
A special square is a square which is primitively rooted and subperiodic. All other squares are called plain.
For two neighboring runs with equal period in , we have [36]. Two such runs can induce a collection of runs with subperiod . We formalize this structure in the following definition and provide illustrations in Figures 2 and 3.
Definition 11.
Let and be neighboring runs in with period and equal Lyndon roots. A pyramid of runs is the set
If , run is called a layer-run (or a layer for brevity).
Remark 12.
We show that all layers in are contained in except possibly the longest layer (for example the red layer in Figure 3).
Lemma 13.
If for some layer in pyramid , then the first half of is contained in and the other half in .
Proof.
Let . Layer is subperiodic with and thus must have period at least , so .
First we show that cannot be a fragment of (or of ). Indeed, this would mean that has period as well as period . By the periodicity lemma applied to , would divide . Consequently, would be a period of , a contradiction.
Let us now show, by contradiction, that each half of is contained in or in . Suppose that this is not the case. One of the two halves is fully contained in one of and and hence has period , while the other half contains a position that is in but not in and a position that is in but not in , and hence has period greater than due to the maximality of runs and . We have thus obtained a contradiction.
Next we obtain a combinatorial characterization of all runs in a pyramid.
Definition 14.
A layer with a maximal period in a pyramid is called a max-layer. We denote by the set of layer-runs in without the max-layer. The elements of are called regular layers.
Example 15.
Consider a pyramid and let . A canonical representation of pyramid consists of the (endpoints of) runs and , its max-layer, and sequences specifying the starting positions, ending positions, and periods of its regular layers. In a canonical representation, the end positions of regular layers form an arithmetic progression with difference , whereas the starting positions form an arithmetic progression with difference . Moreover, the periods of all regular layers form an arithmetic progression with difference . The lemma below is proved in the following Lemmas 17 and 18.
Lemma 16.
Any non-empty pyramid admits a canonical representation.
We use the following notation for a fixed pyramid . Let , , assuming without loss of generality that , and . Let the Lyndon positions of and be and , respectively, and define . For each , we denote and .
Lemma 17.
The set is equal to
For each , the period of run is .
Proof.
First, let us argue that for each . We note that (by the definition of and the fact that the two strings are contained in and , respectively), so is a period of by definition, and . Observe that has period and hence it cannot have an occurrence starting before position and ending after position – as all fragments satisfying these conditions do not have period by the maximality of and . Thus, does not have any period smaller than .
: Let . We have
as and are neighboring. In particular, is periodic. Let us show that is a maximal fragment with period . As , the periodicities of and and left maximality of imply that
which shows the left maximality of . A symmetric argument yields right maximality. Hence, is indeed a run.
The prefix of has length and period , so is subperiodic and . Moreover, cannot be smaller than , as then there would be a run with period smaller than overlapping one of runs on at least positions, which is impossible due to the periodicity lemma.
: Let us fix some . As is subperiodic, , and by the definition of , and . Consider a square . By Lemma 13, and . Since primitive strings do not match non-trivial rotations of themselves, we have that occurs only at positions of contained in such that . This implies that has to be equivalent to . Then, for some , . By the definition of , . Finally, there can be at most one run with period that contains at least positions of and and we have shown the existence of such a run in , so .
Lemma 18.
For the set defined in Lemma 17, there is at most one run .
Proof.
Consider the case when there exists a run containing position . We then know that has subperiod and is periodic with period . Consider the square . By Lemma 13, we know that the first half of this square is contained in , while the second half is contained in . Hence, the square has subperiod . This means that , which is an interval of length at most . Now, also has to be equivalent to , so there is a single possible value for it, say .
Observe that if is not contained in , then is a common prefix of and , which means that and hence . Now, a run containing position would have period in by symmetric arguments to those above. Then, we would have , a contradiction to the fact that a square with period generated by would have its first half in .
Finally, if our attempt to compute a run containing position fails, we perform symmetric computations to find a run that contains position , if one exists.
4 Computing a Representation of All Runs
In this section we show that a representation of all runs in a string of length over an alphabet can be computed in time.
First we show how to compute runs with large periods. Some of these runs are grouped in pyramids.
Amir et al. [1] showed how to compute squares and runs in a dynamic string. Their techniques can be interpreted in the so-called PILLAR model, introduced by Charalampopoulos, Kociumaka, and Wellnitz [15]. Recent optimal data structures for queries [31] and queries [35] in the packed setting imply that any problem on strings of total length that can be solved in time in the PILLAR model, can be solved in time in the packed setting. All in all, we obtain the following fact whose proof closely follows [1, 12]; for completeness, it is provided in the full version.
Fact 19 (see [1, 12]).
Let be a string given in packed form. For any constant , in time , we can compute a set of runs such that none of them is a regular layer of any pyramid of and a set of pyramids given by their canonical representations, such that , and, for , we have that
-
is a superset of all runs in of period at least , and
-
.
We do not include max-layers in set as they can be common to many pyramids.
Remark 20.
As shown in the proof of Fact 19 (given in the full version) and implicitly in [1, 12], for any parameter , the number of both max-layers with period at least and non-layer-runs with period at least in a length- string is . We note that the number of all layer runs with period at least can be : for any , the string from Example 3 has at least layer runs with period at least .
Definition 21 (Clusters of runs).
For a set of runs in and a set of integers , we define a cluster of runs:
The size of a cluster of runs is defined as .
In this work, in all considered clusters of runs, we have .
Example 22.
The string contains runs , , . Thus the following string
contains a cluster of runs .
A -run is a run of length at least with period at most .
Lemma 23 ([31, Section 6.1.2],[13, Lemma 10]).
For a positive integer , a string contains -runs. Moreover, if , given a packed representation of , we can compute all -runs in in time. Within the same complexity, we can compute the Lyndon position of each -run.
The next lemma can be proved using tabulation; a proof can be found in the full version.
Lemma 24.
Given a string in packed form and an integer , we can compute all runs of length smaller than and period at most , represented as clusters of runs, in time. The sum of lengths of lists across all clusters of runs is .
Putting everything together, we obtain the following proposition. We may need to trim arithmetic progressions of regular layers to avoid double reporting runs with small periods.
Proposition 25.
A representation of all runs in a string consisting of a disjoint union of runs, regular layers of pyramids, and clusters of runs, can be computed in time.
5 Grouping Runs via Lyndon Roots and Sparse-Lyndon Roots
The next observation is crucial in counting distinct square substrings of a string. Let us denote by and the sets of runs and squares in with Lyndon root .
Observation 26 ([19]).
Consider two runs and in a string . Then, implies that . In particular, for any Lyndon string , we have
Crochemore et al. [19] considered all runs in the string in groups consisting of runs with equal Lyndon root. The algorithm for grouping of runs that they used consists of the following three steps:
-
1.
Computing a Lyndon position for each run.
-
2.
Sorting runs with equal periods in the order of the suffixes starting at Lyndon positions. It is guaranteed that runs from the same group are listed consecutively.
-
3.
Partitioning the sorted list of runs obtained for each period into groups by issuing an -query for each pair of subsequent runs in the list.
We use Proposition 25 to compute an -sized representation of all the runs in . For runs with small periods, we use the aforementioned approach combined with tabulation (for runs that are not -runs) and the following fact for -runs.
Fact 27 ([31, Section 6.1.2],[13, Lemma 10]).
If , given a packed representation of , all -runs in can be sorted by their Lyndon roots in time.
A proof of Lemma 28 is given in the full version.
Lemma 28.
All runs in with periods at most , for a given , can be grouped by equal Lyndon roots in time. Among possibly many runs corresponding to equal substrings, at least one needs to be reported, but not necessarily all.
One issue with adapting the aforementioned approach to grouping runs with large periods is that we do not know how to compute the Lyndon positions of runs in if their period is greater than for a constant , in time.
Using cyclic equivalence queries of Kociumaka et al. [35] that allow to check if two substrings of are cyclic rotations of each other, we can check if two runs have the same Lyndon root in time after -time preprocessing, but this is not sufficient for grouping the runs by Lyndon roots. Moreover, it is unknown whether minimal cyclic rotation queries can be implemented efficiently in the packed model. The fastest known solution, by Kociumaka [34], answers minimal cyclic rotation queries in time but requires preprocessing; improving the preprocessing time to sublinear here seems to be challenging.
Instead, we use a string synchronizing set, as defined by Kempa and Kociumaka [31], to select a unique position within each long-period run (that might not be the Lyndon position) in a consistent way that allows us to group such runs by Lyndon roots.
Definition 29 (Synchronizing set [31]).
For a length- string and a positive integer , a set is a -synchronizing set of if it satisfies the following two conditions:
-
1.
Consistency: If , then if and only if .
-
2.
Density: For ,
if and only if .
Remark 30.
Informally, in the simpler case that is cube-free, a -synchronizing set of is an -sized set of synchronizing positions in such that each length- fragment of (except for the end of the string) contains at least one synchronizing position, and the leftmost synchronizing positions within two length- matching fragments of are consistent.
Crucially, string synchronizing sets for small values of can be constructed in optimal time in the packed setting.
Theorem 31 ([31, Proposition 8.10, Theorem 8.11]).
For a string with and , there exists a -synchronizing set of size that can be constructed in time, if is given in a packed representation.
Henceforth, we fix and a -synchronizing set for computed in time using Theorem 31. We next define sparse-Lyndon positions, noting that their existence is only guaranteed for runs whose periods are long enough, and use them to group runs by Lyndon roots.
Definition 32 (Sparse-Lyndon position).
Position is a sparse-Lyndon position for a periodic fragment with period if is the lexicographically minimal string among .
If a periodic fragment with period has a sparse-Lyndon position , we call the sparse-Lyndon root of and denote it as .
The following key lemma shows that Lyndon positions can indeed be replaced by sparse-Lyndon positions for the sake of grouping runs by Lyndon roots.
Lemma 33.
Both of the following hold.
-
(a)
If a run has period , then has a unique sparse-Lyndon root.
-
(b)
Two runs and with period have the same Lyndon root if and only if they have the same sparse-Lyndon root.
Proof.
(a) Let . We will first show that is non-empty. String has period as . By the periodicity lemma [25], if had a period at most , then the string would have a period that is smaller than and divides , which is not possible as this would imply that has a period . We have . String contains a fragment of length with period greater than ; indeed, otherwise the periods of all such fragments would be equal by the periodicity lemma and this would imply that has a period at most . Let be such a fragment with period greater than . By density, . We have and , so . Hence, indeed, .
This shows the existence of a sparse-Lyndon root of . As no two distinct suffixes are equal, the sparse-Lyndon root of is unique.
The implication “” is obvious. As for the implication “”, let and and assume that is the sparse-Lyndon position of . By the assumption, there exists such that . By the fact that , we have , so by consistency, .
To prove that is the sparse-Lyndon position of , assume to the contrary that there exists a position such that . Consequently, as no two cyclic rotations of a primitive string are equal. By the assumption, there exists such that . By the fact that and consistency, . We have
which contradicts the assumption that is the sparse-Lyndon root of .
Definition 34.
The sparse-Lyndon representation of a periodic fragment of is a quadruple such that:
-
, and
-
with and .
We use the next lemma to obtain the main result of this section.
Lemma 35 ([31, Theorem 4.3]).
Given the packed representation of a text and a -synchronizing set of of size for , we can compute in time the lexicographic order of all suffixes of starting at positions in .
Proposition 36.
All runs in computed as in Proposition 25, except for the regular layers of pyramids, can be grouped by equal Lyndon roots in time. For runs with period at most , we compute their Lyndon representations, and for the remaining runs, we compute their sparse-Lyndon representations.
Proof.
Recall that . Runs with periods at most are grouped by their Lyndon roots using Lemma 28. The remaining runs are grouped by their sparse-Lyndon roots, and thus by Lyndon roots due to Lemma 33, using Lemma 35 as follows.
Let , with , be a -synchronizing set of constructed as in Theorem 31. By Lemma 35, in time we can construct an array (“sparse RANK” array) such that
Then, in time, we construct a data structure that can answer range minimum queries over in time [9].
Let and be sentinels. Let denote the set of all runs with period that are not regular layers of any pyramid. By Fact 19, set can be computed in time. For each run , we need to compute an interval such that and . By Lemma 33(b), this interval is not empty and hence . The sparse-Lyndon position of each such run can then be computed in time as the argmin of a range minimum query over . The positions and are computed for all runs simultaneously in time by bucket sorting the set and merging the obtained sorted list with the synchronizing set in a merge-sort fashion.
The remainder of the algorithm mimics steps 2 and 3; see the discussion after Observation 26. Namely, in time, we bucket sort the runs with large periods by pairs , where is the sparse-Lyndon position and is the period of the run. Runs with equal sparse-Lyndon roots form consecutive sublists of the sorted lists. The equality of sparse-Lyndon roots of consecutive runs in the sorted list can be checked in time using longest common extension queries after an -time preprocessing [31, Theorem 5.4]. Thus, the grouping is performed in time.
6 Squares Generated by Pyramids
We show that a special square (that is, a square with a primitive and highly periodic half) is always generated by a layer of a pyramid. The proof of the lemma uses the assumption of at least 4 occurrences of the period in a special square half.
Lemma 37.
Let be a fragment of . Then is a special square if and only if there exists a pyramid in and a layer such that and .
Proof.
() Let be a special square fragment of and . Let and be runs with period that contain the first and the second half of the considered occurrence of in , respectively. We have , as otherwise would have period and, by the periodicity lemma, would not be primitive.
By Observation 7, there exists a run in such that . By definition, we have . Moreover, is a subperiodic run with and . If we had , then there would exist a run in with period that overlaps or – say, – on at least positions. The overlap length would be greater than , so by the periodicity lemma, the overlap would have period that divides , so would have period ; a contradiction.
Clearly, the runs are neighboring. We have or is a max-layer of some pyramid with . In either case, and , as required.
() Let be a layer of some pyramid with
We have since is subperiodic. By Lemma 13, one half of is contained in and the other in .
Period does not divide as otherwise we would have . Moreover, by the periodicity lemma, does not have a period that would divide . Thus, is primitive and highly periodic, which means that is a special square.
Definition 38 (Pyramid type).
Let be neighboring runs with period in . We define the type of the pyramid as a triad where (see Figure 4):
Remark 39.
The strings and are cyclically equivalent if is non-empty.
Example 40.
Let , and . Then .
We extend the notation to pyramids as follows.
Definition 41 (Special squares generated by pyramids).
We say that the elements of are generated by the pyramid .
Lemma 42.
The sets of squares generated by two pyramids of different types are disjoint.
Proof.
Let and be pairs of neighboring runs with equal periods such that . We will show that the sets and are disjoint.
Assume there exists . We have
for some runs and . Let and . By Lemma 37, is a special square with Let . Square does not have period (as is primitive). Hence, we can define
-
as the smallest position in such that ;
-
as the largest position in such that .
We have . Then, we have
so . This contradiction concludes the proof.
By we denote the set of (special) squares generated by regular layers of .
Lemma 43.
Both of the following hold:
-
(a)
If is a pyramid, then
-
(b)
If and are pyramids such that , then
Proof.
Proof of (a).
Each run generates squares. By Lemma 17, for run , this number of squares equals
Proof of (b).
An application of Lemma 17 to and for produces equal runs for subsequent values of if . Thus, if , then for each run in , an equal run is present in . For the max-layer of and the regular layer with the same period, we have . Runs in a pyramid have different periods, so they generate disjoint sets of squares. The max-layer of thus generates a special square that is not generated by .
7 Counting Squares
For brevity, primitively rooted squares are called p-squares and non-primitively rooted squares are called np-squares (see [38]). We note that special squares are, in particular, p-squares.
7.1 Counting Plain Squares
Recall that a square is plain unless it is special; that is, is plain if it is an np-square or it is not highly periodic. The next lemma follows from [19] (and Fact 8); see full version.
Lemma 44 (see [19, Theorem 13]).
Assume we are given periodic fragments in grouped by their Lyndon roots and that the Lyndon representations of all these periodic fragments are available. The numbers of distinct p-squares and distinct np-squares generated by these periodic fragments can be computed in time.
The same conclusion holds if we are given periodic fragments in grouped by their sparse-Lyndon roots and that their sparse-Lyndon representations are available.
In each case, any distinct corresponding squares can be reported in time.
Lemma 45.
The number of np-squares in can be computed in time.
Proof.
We use Lemma 44 for counting np-squares generated by all runs that are not regular layers, grouped as in Proposition 36. For runs with small periods, we use Lyndon representations, and for the remaining runs we use sparse-Lyndon representations. There are such runs, so np-squares are counted in time.
Lemma 46.
The number of plain p-squares in can be computed in time.
Proof.
By Lemma 37, plain p-squares are generated by runs grouped as in Proposition 36. For each run computed in Proposition 36, we check if it is also reported as a max-layer in Proposition 25. This can be checked globally for all runs in time using bucket sort. The runs that turned out to be max-layers are cut into smaller periodic fragments that generate plain p-squares (to avoid counting of special squares) as shown below.
Consider a max-layer with . Let be the sequence of runs with period , sorted with respect to their starting positions, such that is a max-layer of for all . Further, let for each .
For convenience, for all , set and . For , let us denote . Due to the periodicity of , for any , we have . Additionally, any occurrence of a plain p-square generated by in is contained in some . Further, each does not generate any special square of length , as it only contains a single maximal periodic fragment with period that is of length at least .
Hence, it suffices to use the strings among , , and that are of length at least instead of . Those strings are periodic and their (sparse-) Lyndon representations can be inferred in time from the (sparse-) Lyndon representation of the max-layer .
We obtain runs that are not max-layers from Proposition 36 and periodic fragments constructed as described above from max-layers ( periodic fragments from each max-layer). By Lemma 44, plain p-squares can be counted in time.
7.2 Counting Special Squares
By Lemma 37, special squares are only generated by layers of pyramids.
Lemma 47.
All pyramids can be grouped by their types in time.
Proof.
Recall that the type of a pyramid is where , is a length- suffix of run , is a length- prefix of run and . By Proposition 36, if , we know the Lyndon roots of , and otherwise, we know their sparse-Lyndon roots.
The Lyndon roots of and are the same. We have and for the common Lyndon root and some values that can be computed from the Lyndon representations of in time. Instead of grouping pyramids by triads , it suffices to group them by quadruples . Grouping by Lyndon roots is performed in Proposition 36. The remaining elements of quadruples are integers in , so we can bucket sort the quadruples by them in time in a stable way (so that we do not break the grouping by Lyndon roots) using Fact 8.
The same argument, with sparse-Lyndon roots instead of Lyndon roots, applies for grouping pyramids by types in case the period of runs , is greater than .
Lemma 48.
The number of special squares in can be computed in time.
Proof.
We group the pyramids by their types using Lemma 47. By Lemma 42, special squares generated by layers from each group can be considered separately. By Lemma 43(b), we can remove all pyramids of the same type that are not of maximal size (in terms of the number of layers). Among the remaining pyramids, all special squares generated by regular layers are counted using Lemma 43(a). Special squares generated by max-layers are counted by partitioning each max-layer into periodic fragments generating only special squares as in the proof of Lemma 46 and then counting all squares generated by such periodic fragments using Lemma 44.
8 Final Remarks
Any squares, for positive up to the number of distinct squares in , can be listed as follows. Lemma 44 allows to report subsequent squares. For special squares, we list squares generated by subsequent runs in the tallest pyramid of each type.
Theorem 49.
Given a string of length over alphabet in packed form and integer , we can output distinct squares in in time.
Our algorithms generalize to powers with any given exponent in the same time complexity. In this case, we do not need to consider regular layers, as they generate no powers of exponent greater than 2. Thus an analogue of Lemma 44 suffices.
Theorem 50.
Given a string of length over alphabet in packed form and integer , we can compute in time the number of distinct -th powers in .
References
- [1] Amihood Amir, Itai Boneh, Panagiotis Charalampopoulos, and Eitan Kondratovsky. Repetition detection in a dynamic string. In 27th Annual European Symposium on Algorithms, ESA 2019, volume 144 of LIPIcs, pages 5:1–5:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2019. doi:10.4230/LIPICS.ESA.2019.5.
- [2] Alberto Apostolico. Optimal parallel detection of squares in strings. Algorithmica, 8(4):285–319, 1992. doi:10.1007/BF01758848.
- [3] Alberto Apostolico and Dany Breslauer. An optimal O(log log n)-time parallel algorithm for detecting all squares in a string. SIAM Journal on Computing, 25(6):1318–1331, 1996. doi:10.1137/S0097539793260404.
- [4] Hideo Bannai and Jonas Ellert. Lyndon arrays in sublinear time. In 31st Annual European Symposium on Algorithms, ESA 2023, volume 274 of LIPIcs, pages 14:1–14:16. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPICS.ESA.2023.14.
- [5] Hideo Bannai, Tomohiro I, Shunsuke Inenaga, Yuto Nakashima, Masayuki Takeda, and Kazuya Tsuruta. The “runs” theorem. SIAM Journal on Computing, 46(5):1501–1514, 2017. doi:10.1137/15M1011032.
- [6] Hideo Bannai, Shunsuke Inenaga, and Dominik Köppl. Computing all distinct squares in linear time for integer alphabets. In 28th Annual Symposium on Combinatorial Pattern Matching, CPM 2017, volume 78 of LIPIcs, pages 22:1–22:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2017. doi:10.4230/LIPICS.CPM.2017.22.
- [7] Oren Ben-Kiki, Philip Bille, Dany Breslauer, Leszek Gasieniec, Roberto Grossi, and Oren Weimann. Towards optimal packed string matching. Theoretical Computer Science, 525:111–129, 2014. doi:10.1016/J.TCS.2013.06.013.
- [8] Michael Ben-Or. Lower bounds for algebraic computation trees (preliminary report). In Proceedings of the 15th Annual ACM Symposium on Theory of Computing, STOC 1983, pages 80–86. ACM, 1983. doi:10.1145/800061.808735.
- [9] Michael A. Bender and Martin Farach-Colton. The LCA problem revisited. In LATIN 2000: Theoretical Informatics, 4th Latin American Symposium, volume 1776 of Lecture Notes in Computer Science, pages 88–94. Springer, 2000. doi:10.1007/10719839_9.
- [10] Srečko Brlek and Shuo Li. On the number of distinct squares in finite sequences: Some old and new results. In Combinatorics on Words - 14th International Conference, WORDS 2023, volume 13899 of Lecture Notes in Computer Science, pages 35–44. Springer, 2023. doi:10.1007/978-3-031-33180-0_3.
- [11] Srečko Brlek and Shuo Li. On the number of squares in a finite word. Combinatorial Theory, 5(1), 2025. doi:10.5070/C65165014.
- [12] Panagiotis Charalampopoulos. Data structures for strings in the internal and dynamic settings. PhD thesis, King’s College London, UK, 2020. URL: https://kclpure.kcl.ac.uk/ws/portalfiles/portal/155221105/2021_Charalampopoulos_Panagiotis_1559341_ethesis.pdf.
- [13] Panagiotis Charalampopoulos, Tomasz Kociumaka, Solon P. Pissis, and Jakub Radoszewski. Faster algorithms for longest common substring. In 29th Annual European Symposium on Algorithms, ESA 2021, volume 204 of LIPIcs, pages 30:1–30:17. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPICS.ESA.2021.30.
- [14] Panagiotis Charalampopoulos, Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, Tomasz Waleń, and Wiktor Zuba. Efficient enumeration of distinct factors using package representations. In String Processing and Information Retrieval - 27th International Symposium, SPIRE 2020, volume 12303 of Lecture Notes in Computer Science, pages 247–261. Springer, 2020. doi:10.1007/978-3-030-59212-7_18.
- [15] Panagiotis Charalampopoulos, Tomasz Kociumaka, and Philip Wellnitz. Faster approximate pattern matching: A unified approach. In 61st IEEE Annual Symposium on Foundations of Computer Science, FOCS 2020, pages 978–989. IEEE, 2020. doi:10.1109/FOCS46700.2020.00095.
- [16] Panagiotis Charalampopoulos, Solon P. Pissis, and Jakub Radoszewski. Longest palindromic substring in sublinear time. In 33rd Annual Symposium on Combinatorial Pattern Matching, CPM 2022, volume 223 of LIPIcs, pages 20:1–20:9. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2022. doi:10.4230/LIPICS.CPM.2022.20.
- [17] Maxime Crochemore. An optimal algorithm for computing the repetitions in a word. Information Processing Letters, 12(5):244–250, 1981. doi:10.1016/0020-0190(81)90024-7.
- [18] Maxime Crochemore. Transducers and repetitions. Theoretical Computer Science, 45(1):63–86, 1986. doi:10.1016/0304-3975(86)90041-1.
- [19] Maxime Crochemore, Costas S. Iliopoulos, Marcin Kubica, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. Extracting powers and periods in a word from its runs structure. Theoretical Computer Science, 521:29–41, 2014. doi:10.1016/J.TCS.2013.11.018.
- [20] Maxime Crochemore and Wojciech Rytter. Efficient parallel algorithms to test square-freeness and factorize strings. Information Processing Letters, 38(2):57–60, 1991. doi:10.1016/0020-0190(91)90223-5.
- [21] Maxime Crochemore and Wojciech Rytter. Squares, cubes, and time-space efficient string searching. Algorithmica, 13(5):405–425, 1995. doi:10.1007/BF01190846.
- [22] Jonas Ellert. Sublinear time Lempel-Ziv (LZ77) factorization. In String Processing and Information Retrieval - 30th International Symposium, SPIRE 2023, volume 14240 of Lecture Notes in Computer Science, pages 171–187. Springer, 2023. doi:10.1007/978-3-031-43980-3_14.
- [23] Jonas Ellert and Johannes Fischer. Linear time runs over general ordered alphabets. In 48th International Colloquium on Automata, Languages, and Programming, ICALP 2021, volume 198 of LIPIcs, pages 63:1–63:16. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPICS.ICALP.2021.63.
- [24] Jonas Ellert, Paweł Gawrychowski, and Garance Gourdel. Optimal square detection over general alphabets. In Proceedings of the 2023 ACM-SIAM Symposium on Discrete Algorithms, SODA 2023, pages 5220–5242. SIAM, 2023. doi:10.1137/1.9781611977554.CH189.
- [25] Nathan J. Fine and Herbert S. Wilf. Uniqueness theorems for periodic functions. Proceedings of the American Mathematical Society, 16(1):109–114, 1965. doi:10.2307/2034009.
- [26] Aviezri S. Fraenkel and Jamie Simpson. How many squares can a string contain? Journal of Combinatorial Theory A, 82(1):112–120, 1998. doi:10.1006/JCTA.1997.2843.
- [27] Dan Gusfield. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology. Cambridge University Press, 1997. doi:10.1017/CBO9780511574931.
- [28] Dan Gusfield and Jens Stoye. Linear time algorithms for finding and representing all the tandem repeats in a string. Journal of Computer and System Sciences, 69(4):525–546, 2004. doi:10.1016/J.JCSS.2004.03.004.
- [29] Jin-Ju Hong and Gen-Huey Chen. Efficient on-line repetition detection. Theoretical Computer Science, 407(1-3):554–563, 2008. doi:10.1016/J.TCS.2008.08.038.
- [30] Dominik Kempa. Optimal construction of compressed indexes for highly repetitive texts. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, pages 1344–1357. SIAM, 2019. doi:10.1137/1.9781611975482.82.
- [31] Dominik Kempa and Tomasz Kociumaka. String synchronizing sets: sublinear-time BWT construction and optimal LCE data structure. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, pages 756–767. ACM, 2019. doi:10.1145/3313276.3316368.
- [32] Dominik Kempa and Tomasz Kociumaka. Lempel-Ziv (LZ77) factorization in sublinear time. In 65th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2024, pages 2045–2055. IEEE, 2024. doi:10.1109/FOCS61266.2024.00122.
- [33] Dominik Kempa and Tomasz Kociumaka. On the hardness hierarchy for the O(nlog n) complexity in the word RAM. In Proceedings of the 57th Annual ACM Symposium on Theory of Computing, STOC 2025, pages 290–300. ACM, 2025. doi:10.1145/3717823.3718291.
- [34] Tomasz Kociumaka. Minimal suffix and rotation of a substring in optimal time. In 27th Annual Symposium on Combinatorial Pattern Matching, CPM 2016, volume 54 of LIPIcs, pages 28:1–28:12. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2016. doi:10.4230/LIPICS.CPM.2016.28.
- [35] Tomasz Kociumaka, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. Internal pattern matching queries in a text and applications. SIAM Journal on Computing, 53(5):1524–1577, 2024. doi:10.1137/23M1567618.
- [36] Roman M. Kolpakov and Gregory Kucherov. Finding maximal repetitions in a word in linear time. In 40th Annual Symposium on Foundations of Computer Science, FOCS 1999, pages 596–604. IEEE Computer Society, 1999. doi:10.1109/SFFCS.1999.814634.
- [37] Dmitry Kosolobov. Online detection of repetitions with backtracking. In Combinatorial Pattern Matching - 26th Annual Symposium, CPM 2015, volume 9133 of Lecture Notes in Computer Science, pages 295–306. Springer, 2015. doi:10.1007/978-3-319-19929-0_25.
- [38] Marcin Kubica, Jakub Radoszewski, Wojciech Rytter, and Tomasz Waleń. On the maximum number of cubic subwords in a word. European Journal of Combinatorics, 34(1):27–37, 2013. doi:10.1016/J.EJC.2012.07.012.
- [39] Shuo Li, Jakub Pachocki, and Jakub Radoszewski. A note on the maximum number of k-powers in a finite word. Electronic Journal of Combinatorics, 31(3), 2024. doi:10.37236/11270.
- [40] Michael G. Main and Richard J. Lorentz. An algorithm for finding all repetitions in a string. Journal of Algorithms, 5(3):422–432, 1984. doi:10.1016/0196-6774(84)90021-X.
- [41] J. Ian Munro, Gonzalo Navarro, and Yakov Nekrich. Text indexing and searching in sublinear time. In 31st Annual Symposium on Combinatorial Pattern Matching, CPM 2020, volume 161 of LIPIcs, pages 24:1–24:15. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2020. doi:10.4230/LIPICS.CPM.2020.24.
- [42] Jakub Radoszewski and Wiktor Zuba. Computing string covers in sublinear time. In String Processing and Information Retrieval - 31st International Symposium, SPIRE 2024, volume 14899 of Lecture Notes in Computer Science, pages 272–288. Springer, 2024. doi:10.1007/978-3-031-72200-4_21.
- [43] Jens Stoye and Dan Gusfield. Simple and flexible detection of contiguous repeats using a suffix tree. Theoretical Computer Science, 270(1-2):843–856, 2002. doi:10.1016/S0304-3975(01)00121-9.
- [44] A. Thue. Über unendliche Zeichenreihen. Norske Videnskabers Selskabs Skrifter Mat.-Nat. Kl., 7:1–22, 1906.
