Abstract 1 Introduction and Related Work 2 Preliminaries 3 Eliminating Twin-Palindromes in 𝓞(𝒏𝐥𝐨𝐠𝒏) Time 4 Eliminating Twin-Palindromes in 𝓞(𝒏𝐥𝐨𝐠𝒏) Time 5 Eliminating Twin-Palindromes in 𝓞(𝒏) Time References Appendix A Eliminating Twin-Palindromes in 𝓞(𝒏) Comparisons

Minimal Generators in Optimal Time

Jonas Ellert ORCID DIENS, École normale supérieure de Paris, PSL Research University, France Paweł Gawrychowski ORCID Institute of Computer Science, University of Wrocław, Poland Tatiana Starikovskaya ORCID DIENS, École normale supérieure de Paris, PSL Research University, France
Abstract

A walk of length n on a string S of length m is a function f:{1,,n}{1,,m} such that i{2,,n}:|f(i)f(i1)|1. The walk generates the string T of length n defined by i{1,,n}:T[i]=S[f(i)]. Intuitively, this can be seen as walking n steps in S and outputting the encountered symbols, where in each step we either remain at the same position, or move one position to the left or to the right. The minimal generator of a string T is the shortest string S such that a walk on S generates T. Recently, it was shown that each string admits exactly one (up to reversal) minimal generator (Pratt-Hartmann, CPM 2024). However, no efficient algorithm for computing the minimal generator was known. We provide an optimal algorithm for this task, taking 𝒪(n) time for a string of length n over general unordered alphabet, i.e., accessing the string only by equality comparisons of symbols. The main challenge is to detect substrings of the form axbx~axb and replace them with axb, where a,b are symbols and x is a string with reversal x~. We solve this problem with a non-trivial adaptation of Manacher’s classic algorithm for computing maximal palindromic substrings (Manacher, J. ACM 1975). To obtain the final algorithm, we solve small subinstances of the problem in optimal time by adapting the “Four Russians” technique to strings over general unordered alphabet, which may be of independent interest.

Keywords and phrases:
string algorithms, walking on words, minimal generator, palindromic substrings, general unordered alphabet, decision tree complexity
Funding:
Jonas Ellert: Partially supported by grant ANR-20-CE48-0001 from the French National Research Agency (ANR).
Paweł Gawrychowski: Partially supported by the Polish National Science Centre grant number 2023/51/B/ST6/01505.
Tatiana Starikovskaya: Partially supported by grant ANR-20-CE48-0001 from the French National Research Agency (ANR).
Copyright and License:
[Uncaptioned image] © Jonas Ellert, Paweł Gawrychowski, and Tatiana Starikovskaya; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation Pattern matching
Editors:
Paola Bonizzoni and Veli Mäkinen

1 Introduction and Related Work

Given some non-empty string S of length m, go to any position in S. Now repeat the following routine n times. Print the symbol at the current position, and then either stay at that position, or move one position to the left (if possible), or move one position to the right (if possible). The result is a string T of length n written in an append-only manner. Pratt-Hartmann describes this procedure as going for a walk on S [22], and he formally models the walk as a function f:{1,,n}{1,,m} that satisfies i{2,,n}:|f(i)f(i1)|1. Using this function, T is defined by i{1,,n}:T[i]=S[f(i)]. Intuitively, we say that some string S generates another string T (or that S is a generator of T) if and only if there is a (not necessarily unique) walk on S that results in T.

While a string may have many distinct generators, a natural question is to ask for a generator of minimal length, henceforth called a minimal generator.111In [22, 21], a minimal generator is called a primitive generator. We use different terminology to avoid confusion with the other definition of primitiveness (where a string is primitive if it cannot be written as the k times concatenation of a shorter string for some integer k). Pratt-Hartmann shows that the minimal generator of a string is unique up to reversal. While he does not consider the problem of efficiently computing the minimal generator of a given string, he implicitly provides a simple algorithmic idea for doing so. Particularly, he shows that the minimal generator of T can be obtained by starting with T and then repeatedly substituting (i) substrings of the form aa with a, (ii) substrings of the form axbx~axb with axb, and (iii) prefixes (resp. suffixes) of the form axbx~a with bx~a (resp. axb). Here, a,b are symbols and x is a possibly empty string with reversal x~. As soon as no further substitutions are possible, we have obtained the minimal generator. We provide an 𝒪(n)-time algorithm that computes a minimal generator via the described substitutions. It works over general unordered alphabet, i.e., it accesses the input string only via equality comparisons of symbols.

Theorem 1.

There is an algorithm that computes the minimal generator of a length-n string over general unordered alphabet in 𝒪(n) time.

1.1 Technical Overview and Roadmap

In Section 2, we introduce the formal definitions and notation used throughout the paper, and then show that substitutions of types (i) and (iii) are easy to realize. The remainder of the paper focuses on substitutions of type (ii).

In Section 3, we provide a simple algorithm that performs all substitutions of type (ii) in 𝒪(nlogn) time. It follows the same divide-and-conquer strategy as Main and Lorentz’s classic algorithm for detecting squares [18]. First, we consider the halves T1=T[1..n2] and T2=T[n2+1..n] and recursively perform all type-(ii) substitutions within them. Then, any remaining substitution in T1T2 must cross the boundary between T1 and T2. Using a modified version of Manacher’s algorithm for computing maximal palindromic substrings [19], we find and perform all substitutions at the boundary in linear time.

In Section 4, we improve the time to 𝒪(nlogn).222logn denotes the extremely slowly growing iterated logarithm defined by logn=0 if n1 and logn=1+loglogn otherwise. Instead of splitting T into halves, we split it into 𝒪(n/log2n) blocks of length 𝒪(log2n). We recursively perform the type-(ii) substitutions in each block. Then, a more sophisticated adaptation of Manacher’s algorithm combines the reduced blocks and performs the remaining substitutions in linear time.

In Section 5, we finally achieve linear time by modifying the 𝒪(nlogn) time algorithm. Once the recursion of the algorithm reaches blocks of size s=o(logloglogn) (which happens after a constant number of levels of recursion), we stop the recursion and solve the blocks of the current level in optimal time by using precomputed data structures. Particularly, by brute force, we precompute a decision tree of minimal height for solving blocks of size s. Peculiarly, this results in a time-wise optimal solution, but we do not immediately know the time complexity of this solution. We show that the height of the decision tree is indeed 𝒪(s) in Appendix A. Hence we solve instances of size s in linear time, as required. A similar technique has been used in [20, 25].

1.2 Related Work

Optimum Stack Generation.

In [23], Tarjan introduced the optimum stack generation problem, where one is given a finite alphabet Σ, a stack, and a string αΣn, and must find the shortest sequence of stack operations  – push, print the top character in the stack, and pop  – printing α. The stack must be empty in the beginning and at the end of the algorithm. While the problem seems to be quite similar to the problem of computing a minimal generator, no fast algorithms are known. A simple dynamic programming algorithm solves it in 𝒪(n3) time, and the current best solution is by Bringmann et al. [6] and requires 𝒪~(n2.8244) randomised time or 𝒪~(n2.8603) deterministic time on a constant-size alphabet.

Repetition Detection in Dynamic Strings.

Our algorithm repeatedly replaces substrings of the form axbx~axb with axb, where a,b are symbols and x is a string. In other words, it must be able to detect substrings of form axbx~axb in a dynamic string, where one is allowed to update the input by deleting its substrings. The problem of detecting repetitive structures in a dynamic string has been an active branch of research recently. The problem of maintaining the longest palindromic substring was considered in [1, 24, 13, 3], where the update time of the current best algorithm by Amir et al. [3] is 𝒪~(n). The longest square of a dynamic string can be maintained in polylog(n) time per update [7], while maintaining the runs can be done in no(1) time [2].

Repetition Detection and Other Problems Over Unordered Alphabets.

Our algorithm accesses the string via equality comparisons of symbols, i.e., we do not assume an order on the alphabet. Since order is not needed to define repetitive structures (like palindromes, squares, or runs), it is an intriguing question whether the presence of order facilitates the detection of such structures. We start with some examples where order is beneficial. Squares and runs can be detected in linear time over an ordered alphabet [10], whereas there is a tight bound of Θ(nlogσ) symbol comparisons to decide square-freeness of a string over unordered alphabet of size σ [12]. There is an 𝒪(nlogσ)-time algorithm for reporting all runs on such a string [9, 11]. Computing the Lempel-Ziv factorization has a tight time bound of Θ(nlogσ) for ordered alphabet [17], and Θ(nσ) for unordered alphabet [9, 12]. On the other hand, there are linear time algorithms for several problems over unordered alphabet, e.g., pattern matching in one [15] and two [14] dimensions, detecting the longest palindromic substring [19], computing an unbordered conjugate [8], finding the leftmost critical factorization [16], and (as shown in this paper) computing a minimal generator.

2 Preliminaries

Intervals, Alphabets, Strings.

For i,j, we write [i,j]=[i,j+1)=(i1,j]=(i1,j+1) to denote the integer range {i,i+1,,j} (or the empty set if i>j). Let n. A string T of length |T|=n is a sequence T[1]T[2]T[n] of symbols from an alphabet Σ (or the empty string ϵ if n=0). We simply write T[1..n] to denote that T is a string of length n. Every pair i,j[1,n] defines a substring T[i..j]=T[i..j+1)=T(i1..j]=T(i1..j+1)=T[i]T[i+1]T[j] (which equals ϵ if i>j). Index i is the position of substring T[i..j]. A substring T[1..i] is a prefix of T, while a substring T[i..n] is a suffix of T. The reversal of T is defined as T~=rev(T)=T[n]T[n1]T[1]. Another string S[1..m] is a substring of T if there is i[1,nm+1] such that S[1..m]=T[i..i+m), i.e., j[1,m]:T[i+j1]=S[j]. From now on, we use T and S to respectively denote the input and output string, and α,β,γ,δ to denote intermediate strings. We use the terms string and array interchangeably.

Palindromes, i.e., strings that satisfy α=α~, are essential for the study of minimal generators. A fundamental tool for detecting palindromic substrings is Manacher’s algorithm [19]. It computes, for each position i[1,n] of T, the maximal radius [0,min(i1,ni)] such that T[i..i+] is a palindrome in overall 𝒪(n) time. Note that, while not strictly required, familiarity with the details of Manacher’s algorithm will be beneficial for understanding the new algorithm for computing minimal generators.

Model of Computation.

An oracle string T[1..n] can only be accessed via the following type of oracle query. Given query positions i,j[1,n], the oracle answers if T[i]=T[j]. Algorithms that receive an oracle string as input are more commonly described as working over general (unordered) alphabet, or as accessing the input via equality comparisons. We present algorithms that (apart from the oracle queries) work in the word RAM model, i.e., they perform an interleaved sequence of oracle queries and word RAM operations. The time complexity of an algorithm is the sum of the number of oracle queries and the number of word RAM operations performed. The size of a machine word is assumed to be at least log2n bits.

Since an oracle string has no physical representation, it is not obvious how to perform traditional string operations like extracting, deleting, or overwriting substrings. We allow such operations by wrapping the input oracle string T[1..n] as follows. Every string S[1..m] encountered during the algorithm execution (e.g., a substring extracted from T or the computed output string) is physically represented by an array A[1,n]m, indicating that i[1,m]:S[i]=T[A[i]]. This also holds for the input string T[1..n] itself, which is represented by B[1..n] with i[1,n]:B[i]=i. Hence we can perform operations that modify or create copies of (sub)strings directly on the physical representations. Given the physical representations, we can still perform any symbol comparison in constant time with a single oracle query to T[1..n]. We hide this additional level of complexity from now on.

Walking on Strings.

Let S[1..m] be a string, and let n+. Then f:[1,n][1,m] is a walk on S if and only if i[2,n]:|f(i)f(i1)|1. The walk f on S generates the string T[1..n] defined by i[1,n]:T[i]=S[f(i)]. More generally, a string S generates another string T if and only if there is a walk on S that generates T. We further say that S generates T from left to right if and only if there is a walk f on S that generates T with f(1)=1 and f(|T|)=|S|. Trivially, if S generates T from left to right, then |S||T|. The following properties are straight-forward. (The first one is obtained by composing the two generating walks, the second one by concatenating them.)

Observation 2 (Properties of Left-to-Right Generation).

Let α,β,γ,δ be strings.

  1. (i)

    Transitivity: If α generates β from left to right, and β generates γ from left to right,
    then α generates γ from left to right.

  2. (ii)

    Concatenability: If α generates β from left to right, and γ generates δ from left to right,
    then αγ generates βδ from left to right.

Minimal Generators.

A minimal generator of T is a string that generates T and is of minimal length (i.e., there is no shorter generator). The minimal generator of T is unique up to reversal [22]. Whether or not a generator is minimal can be characterized by using the following types of substrings. A (non-trivial) palindrome is a string y (of length at least two) such that y=y~. A twin-palindrome is a string of the form axbx~axb for symbols a,b and a string x. A unit-square is a string of the form aa for a symbol a.

Lemma 3 ([22, Lemma 6]).

A string S is a minimal generator of another string T if and only if S generates T and both of the following conditions hold. No suffix or prefix of S is a non-trivial palindrome. No substring of S is a unit-square or a twin-palindrome.

Particularly, a minimal generator can be obtained from T by repeatedly applying the following substitutions until no further substitutions are possible. (The correctness is straight-forward, see Observation 2 and the proof of [22, Lemma 6]).

Lemma 4.

Let α,β,γ,x be strings, and let a,b be symbols.

  1. (i)

    If αaaβ generates γ, then so does αaβ.
    If αaaβ generates γ from left to right, then so does αaβ.

  2. (ii)

    If αaxbx~axbβ generates γ, then so does αaxbβ.
    If αaxbx~axbβ generates γ from left to right, then so does αaxbβ.

  3. (iii)

    If axbx~aα generates γ, then so does bx~aα.
    If αaxbx~a generates γ, then so does αaxb.

Our algorithm for computing the minimal generator performs the following three steps. First, we eliminate all unit-squares by applying Lemma 4(i) until no longer possible. Second, we eliminate twin-palindromes by applying Lemma 4(ii) until no longer possible, which introduces no new substrings of length two, and thus no new unit-squares. Third, we eliminate non-trivial suffix and prefix palindromes by applying Lemma 4(iii) until no longer possible. This introduces no new substrings, and thus the string remains free of unit-squares and twin-palindromes. The first and third step are easily performed using the auxiliary lemmas below. In the remainder of the paper, we focus on the elimination of twin-palindromes in the second step, for which we make the following observation.

Observation 5.

Let k+. A string α[1..3k+1] is a twin-palindrome if and only if α[1..2k+1] and α[k+1..3k+1] are palindromes (which justifies the term “twin-palindrome”).

Lemma 6.

Let T[1..n] be a string. There is an algorithm that, in 𝒪(n) time, computes a string S that generates T from left to right and contains no unit-squares.

Proof.

We start with β=α[1]. For each i[2,n] in increasing order, we append α[i] to β if and only if α[i]α[i1], which clearly takes 𝒪(n) time and yields the desired result.

Lemma 7.

Let T[1..n] be a string that contains neither unit-squares nor twin-palindromes. There is an algorithm that, in 𝒪(n) time, computes a minimal generator of T.

Proof.

We start with a copy of T and modify it as follows. As long as there is a non-trivial prefix palindrome xax~, we replace it with ax~. Then, as long as there is a non-trivial suffix palindrome xax~, we replace it with xa. (In both cases, a is a symbol and x is a non-empty string.) This results in a generator of T by Lemma 4(iii). Note that neither of the replacements introduces any new substrings because they merely truncate the string. Hence we introduce no new unit-squares or twin-palindromes, and the resulting string must be a minimal generator of T by Lemma 3.

We will find a non-trivial prefix or suffix palindrome xax~ in 𝒪(x) time, or determine that none exists is 𝒪(n) time. Whenever we spend 𝒪(x) time to discover a prefix or suffix palindrome, we also reduce the length of the string by x, and thus the time sums to 𝒪(n).

It remains to show how to find a prefix palindrome xax~. We proceed in up to log2n phases. In phase j (starting with j=1), we aim to find a prefix palindrome of length in range (2j1,2j]. We run Manacher’s algorithm [19] and find the longest prefix palindrome of T[1..min(n,2j)] in 𝒪(2j) time. If it is non-trivial, then it is of length at least 2j1, since otherwise we would have discovered it in an earlier phase. Also, its length must be odd because T is free of unit-squares (and even-length palindromes contain a unit-square at their center). As soon as we find a non-trivial prefix palindrome, we report it and perform no further phases. If we discover a non-trivial prefix palindrome in phase j, then the time sums to 𝒪(2j), which is linear in the length of the palindrome. Otherwise, we finish all phases in 𝒪(n) time and there is no non-trivial palindromic prefix. The procedure for suffix palindromes is symmetric.

3 Eliminating Twin-Palindromes in 𝓞(𝒏𝐥𝐨𝐠𝒏) Time

In this section, we show how to eliminate twin-palindromes that cross a fixed position. This ultimately results in the following theorem.

Theorem 8.

Let k+, and let α[1..n], β[1..m] be strings that contain no twin-palindromes of length at most k. There is an algorithm that computes n[0,n] and m[0,m] such that

  1. (i)

    α[1..nn]β(m..m] generates αβ from left to right, and

  2. (ii)

    α[1..nn]β(m..m] contains no twin-palindrome of length at most k, and

  3. (iii)

    if α and β contain no unit-squares, then neither does α[1..nn]β(m..m].

The algorithm runs in 𝒪(k+n+m) time.

The Theorem follows from a sequence of simpler results, which we describe in the remainder of the section. Before that, we show that the Theorem directly implies an 𝒪(nlogn) time algorithm for eliminating all twin-palindromes.

Corollary 9.

Let T[1..n] be a string. There is an algorithm that, in 𝒪(nlogn) time, computes a string S that generates T from left to right and contains neither twin-palindromes nor unit-squares.

Proof.

Assume that T contains no unit-squares. This assumption is without loss of generality, as otherwise we can eliminate unit-squares with Lemma 6, and then show the correctness of the Corollary for the resulting shorter string. We terminate in constant time if n=1. Otherwise, we recursively apply the Corollary to T1=T[1..n2] and T2=T(n2..n], resulting in strings T1 and T2 that contain neither twin-palindromes nor unit-squares. Furthermore, T1 generates T1 from left to right, and T2 generates T2 from left to right. Hence, by Observation 2(ii), T1T2 generates T from left to right. We apply Theorem 8 with k=n to T1 and T2. In 𝒪(n) time, we obtain n[0,|T1|] and m[0,|T2|] such that S=T1[1..nn]T2(m..|T2|] generates T1T2 (and by Observation 2(i) also T) from left to right. Furthermore, S contains no unit-squares or twin-palindromes. We spend 𝒪(n) time per level of recursion, and 𝒪(nlogn) time overall.

Now we prove Theorem 8. We start with an algorithm for finding a shortest twin-palindromic substring (or verifying that none exist) in linear time. At first glance, prioritizing short twin-palindromes might seem counter-intuitive; one might expect that long twin-palindromes should be eliminated early to quickly reduce the size of the input. However, finding long twin-palindromes appears to be more challenging, and finding the shortest one still suffices for proving Theorem 8.

Lemma 10.

Let α[1..n] be a string. There is an algorithm that returns the position and length of a shortest twin-palindromic substring of α, or reports that α contains no twin-palindromes, in 𝒪(n) time.

Proof.

We compute the maximal radius R[i] of a palindrome centered at each i[1,n], i.e., R[i]=max{h[0,min(i1,ni)]α[ih..i+h] is a palindrome}. A twin-palindrome must be of length 3k+1 for some positive integer k. Fix k and consider a substring α[p..p+3k]. Observation 5 states that α[p..p+3k] is a twin-palindrome if and only if α[p..p+2k] and α[p+k..p+3k] are palindromes, which is the case if and only if R[p+k]k and R[p+2k]k. This dictates a simple algorithm for detecting all twin-palindromes. We consider each pair ,r[1,n] with <r. Let k=r and p=k, which implies =p+k and r=p+2k. If R[]k and R[r]k, then we report α[p..p+3k] as a twin-palindrome. There are Θ(n2) pairs, and each pair is processed in constant time.

We cannot afford to consider each pair of positions. Hence we show that 𝒪(n) pairs are sufficient to detect a shortest twin-palindrome. Let α[p..p+3k] be any such twin-palindrome, i.e., α contains no twin-palindrome of length less than 3k+1. Let =p+k and r=p+2k. We already know that R[]k and R[r]k. We claim that m(,r):R[m]<k. Indeed, R[m]k with m(,r) implies that α[..2m] is a palindrome of length 2(m)+1 centered at position m. Due to R[]k>m, we know that α[2m..m] is a palindrome of length 2(m)+1 centered at position . Thus, Observation 5 implies that α[2m..2m] is a twin-palindrome of length 3(m)+1<3k+1. However, this contradicts the fact that α contains no twin-palindrome of length less than 3k+1.

Now we exploit the new insights algorithmically. For each position i[1,n], we compute prv[i]=max({j[1,i)R[j]R[i]}{0}) and nxt[i]=min({j(i,n]R[j]R[i]}{n+1}), i.e., the positions of the previous and next non-smaller value of R[i] in R. For a shortest twin-palindrome α[p..p+3k] with =p+k and r=p+2k, we have shown that R[]k, R[r]k, and m(,r):R[m]<k. Hence either nxt[]=r or prv[r]= (or both). Thus, instead of processing all the Θ(n2) possible pairs of positions, we instead process only the 𝒪(n) pairs {(,nxt[])[1,n] and nxt[]n}{(prv[r],r)r[1,n] and prv[r]>0}. If no twin-palindrome is found, we report that α contains no twin-palindrome. Otherwise, among the discovered twin-palindromes, we choose and report one of minimal length. Given R, nxt, and prv, we can generate and process the pairs in 𝒪(n) time. Computing R takes 𝒪(n) time with Manacher’s algorithm [19]. Computing nxt and prv takes 𝒪(n) time with a simple folklore algorithm (see, e.g., [5, Lemma 1]).

Given two strings that contain no twin-palindromes (possibly substrings of a longer string, in the representation described in Section 2), we can use the Lemma above to find a twin-palindrome in their concatenation. Most importantly, this yields an output-sensitive algorithm: if the algorithm reports a twin-palindrome, its runtime is linear in the length of that twin-palindrome; otherwise, the runtime is linear in the combined length of the input strings. This property will later allow us to amortize the time over the number of symbols removed from the input when computing the minimal generator.

Lemma 11.

Let α[1..n], β[1..m] be strings that contain no twin-palindromes. There is an algorithm that computes n[1,n] and m[1,m] such that α(nn..n]β[1..m] is a twin-palindrome in 𝒪(n+m) time, or outputs that no such values exist in 𝒪(n+m) time.

Proof.

Similarly to Lemma 7, we proceed in up to log2(n+m) phases. In phase j (starting with j=1), we aim to find a twin-palindrome α(nn..n]β[1..m] with n+m(2j1,2j]. As soon as we find such n,m, we return them without executing further phases. We will perform phase j in 𝒪(2j) time. Hence, if we find n,m in phase j, the total time sums to 𝒪(2j), which is 𝒪(n+m) due to n+m(2j1,2j]. If all phases finish without finding n,m, then the time sums to 𝒪(2log2(n+m))=𝒪(n+m).

It remains to show how to perform phase j in 𝒪(2j) time. Since we aim to find n,m2j, it suffices to consider α=α[max(1,n2j+1),n] and β=β[1..min(m,2j)] of total length 𝒪(2j). We apply Lemma 10 to γ=αβ in 𝒪(2j) time. Since α=γ[1..min(n,2j)] and β=γ(min(n,2j)..|γ|] are free of twin-palindromes, every twin-palindrome γ(a..b] satisfies a<min(n,2j)<b. Hence, if Lemma 10 reports γ(a..b] as a twin-palindrome, we return n=min(n,2j)a and m=bmin(n,2j). Note that ba>2j1, since otherwise the twin-palindrome would have already been discovered during the previous phase. If Lemma 10 reports no twin-palindrome, then αβ contains no twin-palindrome.

We are now ready to show Theorem 8. The algorithm first finds a twin-palindrome of length k crossing the boundary between α and β. By Lemma 11, this step takes either 𝒪() time if the twin-palindrome exists, or 𝒪(k) time otherwise. If over half of the twin-palindrome is in α (resp., β), the algorithm cuts its left (resp., right) part out of α and β shortening them by Ω() symbols, and applies the algorithm recursively to the resulting strings α and β. As a result, the total time is linear in the sum of k and on the number of the symbols that were cut off.

See 8

Proof.

If at least one of the strings is empty, then we trivially return n=m=0. If neither α nor β contain unit-squares, then we can without loss of generality assume that αβ contains no unit-squares. This is because the only possible unit-square in αβ is α[n]β[1], and if α[n]=β[1] we can simply run the algorithm for α and β[2..m] instead.

If both strings are non-empty, we apply Lemma 11 to α0=α(max(0,nk)..n] and β0=β[1..min(m,k)]. This either results in n0[1,min(n,k)] and m0[1,min(m,k)] such that α(nn0..n]β[1..m0] is a twin-palindrome in 𝒪(n0+m0) time, or determines that no such values exist in 𝒪(k) time. Since α and β do not contain any twin-palindromes of length at most k, it is clear that αβ contains no twin-palindrome of length at most k if and only if no values n0,m0 are found. Hence, if no values are found, we output n=m=0.

Otherwise, there are symbols a,b and a string x such that α(nn0..n]β[1..m0]=axbx~axb. Let h=|axb|(n0+m0)/2. Assume that n0m0, then it holds hn0 and α(nn0..nn0+h]=axb. By Lemma 4(ii), α1β1 with α1=α[1..nn0+h] and β1=β(m0..m] generates αβ from left to right. Similarly, if n0<m0, then h<m0 and β(m0h..m0]=axb. In this case, again by Lemma 4(ii), α1β1 with α1=α[1..nn0] and β1=β(m0h..m] generates αβ from left to right. Either way, h(n0+m0)/2 implies |α1β1|=n+mn0m0+hn+m(n0+m0)/2.

We finish the computation by applying Theorem 8 recursively to α1 and β1 (where we do not need to materialize α1 and β1, as they are respectively a prefix and a suffix of α and β). This results in values n1 and m1 such that α1[1..|α1|n1]β1(m1..|β1|] generates α1β1 from left to right and contains no twin-palindrome of length at most k. We have already established that α1β1 generates αβ from left to right, and hence also α1[1..|α1|n1]β1(m1..|β1|] generates αβ from left to right due to Observation 2(i). Thus, we return n=n|α1|+n1 and m=m|β1|+m1. Note that we obtained α1β1 by replacing axbx~axb with axb in αβ. This clearly does not introduce any new substrings of length two, and thus, if αβ contains no unit-square, then neither does α1β1, and (recursively) neither does α1[1..|α1|n1]β(m1..|β1|]=α[1..nn]β(m..m].

Recursively applying the Theorem to α1 and β1 takes 𝒪(k+n1+m1) time. (This is true by induction over the sum of output values, as we have already shown the correctness for the base case where n=m=0.) Apart from that, we spend 𝒪(n0+m0) time, dominated by applying Lemma 11. The total time is 𝒪(k+n0+n1+m0+m1), which is 𝒪(k+n|α1|+n1+m|β1|+m1)=𝒪(k+n+m) due to |α1β1|n+m(n0+m0)/2.

4 Eliminating Twin-Palindromes in 𝓞(𝒏𝐥𝐨𝐠𝒏) Time

In this section, we improve the time to 𝒪(nlogn). In principle, the solution is similar to the 𝒪(nlogn) time algorithm from Corollary 9, which cuts T into halves, eliminates the twin-palindromes in each half recursively, and finally eliminates all the twin-palindromes across the boundary. We accelerate the algorithm by instead cutting T into blocks of size roughly (log2n)2, which reduces the levels of recursion from 𝒪(logn) to 𝒪(logn). However, we need a much more sophisticated algorithm for eliminating the twin-palindromes that cross any of the block boundaries. The algorithm is built on the following structural lemma.

Lemma 12.

Let α[1..n] be a string that contains neither twin-palindromes nor unit-squares.

  1. (i)

    There are at most 𝒪(logn) palindromic suffixes of α.

  2. (ii)

    For any symbol b, there is at most one twin-palindromic suffix of αb, and (if it exists) it is of length 3k+1, where 2k+1 is the length of the longest palindromic suffix of αb.

Proof.

For both properties, we only have to consider palindromes of odd length because α contains no unit-square, and every palindrome of even length contains a unit-square at its center. For (i), assume that α has two non-trivial palindromic suffixes of respective lengths 2k1+1 and 2k2+1 such that k2(k1,2k1]. Let =k2k1k1. Then α[n2k1..n] is a palindrome centered at position nk1, and its central portion of length 2+1 is α[nk1..nk1+]=α[nk2..nk2+2]. Similarly, α[n2k2..n] is a palindrome centered at nk2, and its central portion of length 2+1 is α[nk2..nk2+]. However, by Observation 5, if α[nk2..nk2+2] and α[nk2..nk2+] are palindromes, then α[nk2..nk2+2] is a twin-palindrome, which is a contradiction. (See Figure 1(a).) Therefore, if we consider all the non-trivial palindromic suffixes in increasing order of length, then each suffix must be over twice as long as the previous one, and hence their number is 𝒪(logn).

(a) The string α[1..n] has palindromic suffixes α[n2k1..n] and α[n2k2..n] with k2(k1,2k1]. This implies a twin-palindrome α[nk2..nk2+2], where =k2k1.

(b) The string β[1..m]=αb has palindromic suffix β[m2k2..m] and twin-palindromic suffix β[m3k1..m] with 3k1+1<2k2. By mirroring the twin-palindrome over the center of the palindrome (indicated by the dotted line), we obtain another twin-palindrome β[m2k2..m2k2+3k1]=rev(β[m3k1..m]), which is a substring of α.
Figure 1: Supplementary drawings for the proof of Lemma 12.

For (ii), let m=n+1 and β[1..m]=αb. Let k2 be chosen such that β[m2k2..m] is the longest palindromic suffix of β. If β has some twin-palindromic suffix β[m3k1..m], then it also has palindromic suffix β[m2k1..m], which implies k1k2. If k1=k2, then β[m3k1..m] is the claimed unique twin-palindromic suffix. It remains to consider the case k1>k2. In a moment, we will show that k1>k2 implies 3k1+1<2k2, i.e., the twin-palindromic suffix β[m3k1..m] is of length less than 2k2. However, due to the palindromic suffix β[m2k2..m], it holds β[m2k2..m2k2+3k1]=rev(β[m3k1..m]). The reversal of a twin-palindrome is also a twin-palindrome, and thus β[m2k2..m2k2+3k1] is a twin-palindromic substring of α=β[1..m) (particularly due to m2k2+3k1<m; see Figure 1(b)). This contradicts the fact that α contains no twin-palindromes.

Finally, we show that k1>k2 indeed implies 3k1+1<2k2. If k1>1, then we observe that α has non-trivial suffix palindromes of respective lengths 2k11=2(k11)+1 and 2k21=2(k21)+1 (by truncating the palindromes β[m2k1..m] and β[m2k2..m] by one symbol on either side). As seen in the proof of (i), it holds k21(k11,2k12]. This implies k22k1, which in turn implies 3k1+13k2/2+1<2k2. (The latter inequality is due to 1<k1<k2 and thus k23.) If k1=1, then the twin-palindromic suffix at hand is β[m3..m]=α[m3..m)b=abab with a=α[m1]. If k2=2 then β[m4..m]=babab. However, then α has twin-palindromic suffix α[m3..m)=baba, which is a contradiction. Hence k2>2, which implies 3k1+1=4<2k2.

Theorem 13.

Let T[1..n] be a string that contains neither unit-squares nor twin-palindromes of length less than (log2n)2. There is an algorithm that, in 𝒪(n) time, computes a string S that generates T from left to right and contains neither twin-palindromes nor unit-squares.

Proof.

Let s=(log2n)2. We use a modified version of Manacher’s algorithm for maximal palindromic substrings [19]. In each step, the algorithm maintains the following data structures:

  1. (1)

    Strings α[1..a] and β[1..b] are the current working strings such that

    1. (i)

      αβ generates T from left to right, and

    2. (ii)

      αβ contains neither unit-squares nor twin-palindromes of length less than s, and

    3. (iii)

      α contains no twin-palindromes.

  2. (2)

    List C=[c1,c2,,cq] with c1<c2<<cq stores a subset of [1,a] such that

    1. (i)

      if c[1,a] is the center of a palindromic suffix of α, then c is in C (but not every element of C is necessarily the center of a palindromic suffix of α), and

    2. (ii)

      c1 is the center of the longest palindromic suffix of α (hence C is non-empty).

  3. (3)

    A prefix of array R[1..n] contains, for each i[1,a]{c1,c2,,cq}, the maximal radius of a palindrome centered at position i with respect to α, i.e.,

    R[i]=max{h[0,min(i1,ai)]α[ih..i+h] is a palindrome}.
  4. (4)

    A prefix of array E[1..n] contains, for each j[1,a], a list of centers defined as follows. The list E[j] contains exactly all the i[1,a]{c1,c2,,cq} for which i+R[i]=j. (Informally, we store each center at the end position of its maximal palindrome.)

We initialize the algorithm with α=T[1], β=T[2..n], and C=[c1] with c1=1. Without loss of generality, we assume that T[1] has no occurrence in T[2..n] (which can be achieved by prepending a unique symbol in front of T). This will ensure that α never becomes empty. Now we alternate between appending a symbol and eliminating twin-palindromes. Appending a symbol means that we remove the first symbol of β and append it as the last symbol of α, i.e., we replace α with αβ[1], and β with β[2..b]. Since their concatenation αβ does not change, Properties (1i) and (1ii) remain satisfied. As part of the appending routine, we update C, R, and E so that they satisfy Properties (2), (3), and (4), which we describe in detail below. Afterwards, only Property (1iii) is possibly not satisfied, i.e., it might be that the new α contains a twin-palindrome. In this case, the eliminating subroutine truncates possibly both α and β and updates C, R, and E so that all properties are satisfied. If β is empty after running the eliminating subroutine, then α generates T from left to right and contains neither twin-palindromes nor unit-squares.

Appending a Symbol.

Now we describe how to maintain C, R, and E after appending a symbol. By appending a symbol, we transitioned from working strings α[1..a] and β[1..b] to αβ[1] and β[2..b]. The list C currently contains values c1,c2,,cq, and we append the newly added position cq+1=a+1 to its end. If αβ[1] has a palindromic suffix with center c<a+1, then also α has a palindromic suffix with center c. Hence the center of the longest palindromic suffix of αβ[1] is already contained in C. We consider the positions ci with i[1,q] in increasing order. We first check if ci is the center of a palindromic suffix of α, which is always the case for c1. For ci with i1, we exploit that α[2c1a..a] is a palindrome, which implies α[2cia..a]=rev(α[2c1a..2c12ci+a]). Hence α[2cia..a] is a palindrome centered at position ci if and only if α[2c1a..2c12ci+a] is a palindrome centered at position ci=2c1ci<c1, which is the case if and only if R[ci]aci. We have already computed R[ci] because ci cannot be in C due to ci<c1. Therefore, we can check if α[2cia..a] is a palindromic suffix in constant time. If α[2cia..a] is a palindromic suffix of α, we check if α[2cia1..a]β[1] is a palindromic suffix of αβ[1], which is the case if and only if α[2cia1]=β[1]. If α[2cia1]β[1], then we proceed with the next larger value of i. Otherwise, we know that α[2cia1..a]β[1] is the longest palindromic suffix of αβ[1], and its center is ci. If no i[1,q] leads to a palindromic suffix of αβ[1], then the longest palindromic suffix is β[1] with center cq+1=a+1.

From now on, let i[1,q+1] be the value such that ci is the center of the longest palindromic suffix of αβ[1]. Now we update C, R, and E. We remove c1,c2,,ci1 from the list. If we actually removed positions from the list (i.e., i>1), we have to update R and E accordingly. In this case, it is clear that α[2c1a..a] is the maximal palindrome with center c1 with respect to the string αβ. We assign R[c1]=ac1 and append c1 to E[a]. For the remaining cj with j[2,i), we use the same observation as before: it holds α[2cia..a]=rev(α[2c1a..2c12ci+a]), and we know that this string is not a palindrome. Hence the radii of the maximal palindromes centered at positions ci and ci are identical, and we assign R[ci]=R[ci] and append ci to E[ci+R[ci]].

We spent 𝒪(i) time for computing i, removing i1 elements from C, and updating the corresponding entries in R and E. We also added a+1 to C, and thus we can amortize the 𝒪(i) time over the i elements that were either added to or removed from C. Each removed element was previously added, and thus the overall time for the appending subroutine is linear in the number of times we add an element to C.

Eliminating Twin-Palindromes.

It remains to show how to eliminate twin-palindromes after appending a symbol. For this part, let α[1..a] and β[1..b] be the working strings after appending a symbol, i.e., we just transitioned from strings α[1..a) and α[a]β to α and β. We have already ensured that C, R, and E satisfy the claimed properties. We know that α[1..a) does not contain any twin-palindromes, and the only property we have to establish is that α contains no twin-palindromes, i.e., we are only interested in twin-palindromic suffixes of α. By Lemma 12(ii), the only potential twin-palindromic suffix of α is α[a3k..a] with k=ac11. Since α[a2k..a] is a palindrome, α[a3k..a] is a twin-palindrome if and only if α[a3k..ak] is a palindrome (by Observation 5), which is the case if and only if R[a2k]k. We have already computed R[a2k] because a2k=2c1a<c1 is not contained in C. If R[a2k]<k (or a2k), then we terminate the eliminating subroutine in 𝒪(1) time.

Otherwise, i.e., if R[a2k]k, we know that α[a3k..a] is the only twin-palindrome in α. There are symbols c,d and a string x with α[a3k..a]=cxdx~cxd. We truncate α to α[1..a2k], i.e., we substitute cxdx~cxd with cxd. Since αβ does not contain unit-squares, and the substitution introduces no new substrings of length two, α[1..a2k]β also does not contain unit-squares. Also, α[1..a2k] generates α from left to right, and thus α[1..a2k]β generates αβ and consequently T from left to right. Finally, αβ does not contain twin-palindromes of length less than s, and therefore neither does α[1..a2k] or β. However, their concatenation α[1..a2k]β may contain twin-palindromes of length less than s. We apply Theorem 8 with parameter s to α[1..a2k] and β, resulting in values a and b such that α[1..a2ka]β(b..b] generates α[1..a2k]β (and hence also αβ, and consequently T) from left to right, and contains neither unit-squares nor twin-palindromes of length at most s. This takes 𝒪(s+a+b) time.

We replace α with α[1..a2ka] and β with β(b..b]. It remains to update C, R, and E. Note that a2k=2c1a<c1, and thus none of the positions c1,,cq currently stored in C lie within α[1..a2ka]. Hence we remove all elements from C. Currently, each R[i] with i[1,a2ka] contains the radius of the longest palindrome centered at position i with respect to α. If i+R[i]<a2ka, then R[i] is also the radius of the longest palindrome centered at position i with respect to α[1..a2ka], and thus we do not need to update such R[i]. If, however, i+R[i]a2ka, then i is the center of a suffix palindrome of α[1..a2ka], and we need to add i to the list C. In this case, we do not care about the content of R[i], and hence we can leave the entire array R unchanged.

We cannot afford to scan R and check if i+R[i]a2ka for each position. Instead, we directly identify the positions that need to be added to C by using E. We collect all the elements stored in all the lists E[j] with j[a2ka,a]. Let i be one of these elements. If i>a2ka, then we simply discard it. The remaining elements are exactly the positions i for which i is the center of a suffix palindrome of α[1..a2ka]. We sort them in increasing order, and place them into C. Finally, we replace E[a2ka] with an empty list.

Analyzing the Time Complexity.

If the eliminating subroutine does not finish in constant time, then we permanently reduce the length of the working strings by 2k+a+b, particularly by eliminating a twin-palindrome of length 3k+1 from αβ. Since αβ does not contain a twin-palindrome of length less than s, it holds k=Ω(s). Hence we reduce the length by Ω(s+a+b), and we can afford the 𝒪(s+a+b) time for applying Theorem 8. We also discard at most 𝒪(k+a) elements that were collected from E[a2ka..a], which we can afford to do in 𝒪(k+a) time because we reduced the length by Ω(k+a). The remaining elements correspond to suffix palindromes of α[1..a2ka], and by Lemma 12(i) there are at most 𝒪(logn)=𝒪(s) of them. Hence we can afford to sort them and to add them to the list C. During the entire algorithm execution, we add 𝒪(ns) elements to C in this way.

Over all iterations of the appending subroutine, we append additional 𝒪(n) elements to C. The total time for the appending subroutine is linear in the number of elements added to C, and hence it is 𝒪(n), which concludes the proof.

4.1 Obtaining the Recursive Algorithm

In this section, we show our 𝒪(nlogn)-time algorithm. As mentioned earlier, it is recursive: it partitions the input into blocks, eliminates twin-palindromes and unit-squares in them recursively, and then concatenates the blocks together to form the output. To concatenate the blocks together, we use the following procedure:

Lemma 14.

Let α1,,αk be strings, where each string is of length m and contains no twin-palindromes. There is an algorithm that, in 𝒪(km) time, computes a string β such that

  1. (i)

    β generates α1α2αk from left to right, and

  2. (ii)

    β contains no twin-palindrome of length at most m, and

  3. (iii)

    if none of α1,α2,,αk contains unit-squares, then neither does β.

Proof.

We proceed in k steps numbered 1,2,k. After running step i, the algorithm has computed a string βi that generates α1α2αi from left to right. If none of α1,α2,,αi contains unit-squares, then neither does βi. Furthermore, βi contains no twin-palindromes of length at most m, and it is stored in the first |βi| cells of an output array A[1..km].

In step i=1, we write β1=α1 to the output array, which trivially satisfies the claimed properties. In step i with i>1, we apply Theorem 8 with parameter m to βi1 and αi. This results in n[0,|βi1|] and m[0,|αi|] such that βi1[1..|βi1|n]αi(m..|αi|] generates βi1αi (and thus also α1α2αi) from left to right and does not contain a twin-palindrome of length at most m. Hence we can use βi=βi1[1..|βi1|n]αi(m..|αi|], and we write αi(m..|αi|] to A(|βi1|n..|βi1|+|αi|nm]. Now A[1..|βi1|+|αi|nm] contains βi. We have already established that, if none of α1,,αi contains unit-squares, then neither does βi1, and in this case Theorem 8 ensures that also βi contains no unit-squares.

The time for step i is dominated by applying Theorem 8, which takes 𝒪(m+n+m)=𝒪(m+|βi1|+|αi||βi|) time. This sums to 𝒪(km) over all steps.

We are now ready to present the main result of this section:

Corollary 15.

Let T[1..n] be a string. There is a recursive algorithm that, in 𝒪(nlogn) time, computes a string S that generates T from left to right and contains neither twin-palindromes nor unit-squares.

Proof.

Assume that T contains no unit-squares. This assumption is without loss of generality, as otherwise we can eliminate unit-squares with Lemma 6, and then show the correctness of the Corollary for the resulting shorter string. The algorithm operates in a recursive manner. We cut T into b=n/m blocks of length m=(log2n)2 (the final block might be shorter). For each block Bi, where i[1,b], we obtain a string Bi that generates Bi from left to right and contains neither twin-palindromes nor unit-squares. This is done by applying the algorithm recursively to each block except for the final one. For the final block Bb, we obtain Bb by applying Corollary 9 in 𝒪(mlogm)𝒪(n) time instead. For x+, let t(x) be the time needed by Corollary 15 for a length-x string. Then the time needed for computing the reduced blocks is 𝒪(n)+n/(log2n)2t((log2n)2).

Next, we obtain a single string T that generates B1B2Bb (and thus also T) from left to right and contains neither unit-squares, nor twin-palindromes of length at most m. This takes 𝒪(bm)=𝒪(n) time with Lemma 14. Finally, we apply Theorem 13 to T, which takes 𝒪(n) time and results in a string S that generates T (and thus also T) from left to right. Furthermore, S contains neither twin-palindromes nor unit-squares. The total time is t(n)=𝒪(n)+n/(log2n)2t((log2n)2), which resolves to t(n)=𝒪(nlogn). (Every two levels of recursion, the block size decreases from some length m to (log2(log2m)2)2), which is less than log2m if m exceeds a sufficiently large constant. Hence 2logn levels of recursion are sufficient to reach blocks of constant size.)

5 Eliminating Twin-Palindromes in 𝓞(𝒏) Time

Let L be the function defined by x:L(x)=(log2x)2. As seen in the proof of Corollary 15, there is a constant c such that the algorithm runs in time

t(n)=cn+n/L(n)t(L(n))cn+n/L(n)t(L(n)).

Particularly, we reduce the problem from a single instance of size n to at most n/L(n) instances of size L(n) each. Let s=L(L(L(L(x))))=o(logloglogn). After four levels of recursion, we have to solve at most n/s instances of size s. It then holds t(n)4cn+n/st(s). If we solve each instance in 𝒪(s) time, then the total time is linear. If the string was given over a polynomial integer alphabet, then we could reduce the alphabet of each instance (in batch using radix-sorting), and then use a precomputed lookup table to solve each instance in constant time. The lookup table would store the solution to each possible instance of size s. (This approach is commonly referred to as the “Four Russians” technique.333The technique originates from the work of Arlazarov, Dinitz, Kronrod, and Faradzhev [4], who proposed an efficient method for computing the transitive closure of a directed graph by precomputing solutions for small blocks of the adjacency matrix. The term “Four Russians” technique has since come to broadly refer to the use of precomputed solutions for small subproblems, typically of logarithmic size relative to the input. The name reflects the authors’ affiliation in Moscow, though not all were Russian.) However, we cannot use strings over general unordered alphabet to access lookup tables. Instead, in Lemma 16 (below), we adapt the technique such that the precomputation no longer only considers all possible instances of size s, but also all algorithms that potentially solve such instances efficiently. Among all the algorithms, we choose the fastest one that actually works correctly. This way, we show that we can indeed achieve 𝒪(s) time per instance after performing an 𝒪(222s)=o(n) time preprocessing. This improves the time complexity of Corollary 15 to 𝒪(n). By applying Lemma 7 as a postprocessing, we obtain the main result from Theorem 1.

Lemma 16.

Let n+. After an 𝒪(222n) time preprocessing, we can answer the following type of query in 𝒪(n) time. Given a string T[1..n], output a string S that contains neither twin-palindromes nor unit-squares and generates T from left to right.

Proof.

For some unknown value N+, assume that there is an algorithm that solves a query by performing less than N comparisons of symbols of the query string (running in arbitrary time). The algorithm can be implemented as a decision tree such that it answers queries in 𝒪(N) time. The decision tree is a binary tree with nodes on N levels, i.e., with at most 2N1 nodes. Each internal node has two children and is annotated with a position pair from [1,n]×[1,n]. Each leaf is annotated with a length m[1,n] and a string M[1,n]m. A query T is then answered as follows. We start at the root. As long as we are at an internal node, we perform the symbol comparison dictated by its annotation. If the outcome of the comparison is positive, we move to the left child, and otherwise to the right child. Once we reach a leaf labeled with m and M, we return the string S[1..m] defined by i[1,m]:S[i]=T[M[i]]. This takes 𝒪(N) comparisons and time. By adding dummy comparisons, we can assume without loss of generality that the decision tree is complete.

It remains to be shown how to compute the decision tree. Rather than directly obtaining the correct tree, we instead generate all the possible trees admitted by the domains of the annotations. We also generate all the possible length-n strings over alphabet [1,n]. Note that every length-n string over arbitrary alphabet is isomorphic to one of these strings. For each decision tree, and for each possible query string, we run the query procedure. We then check if the output string contains neither twin-palindromes nor unit-squares, and actually generates the query string from left to right. Since there is an algorithm that requires less than N comparisons, at least one of the decision trees must work for all possible input strings. As soon as we find such a tree, we terminate and use this tree to answer all future queries.

Now we analyze the preprocessing time. We will use z=yyy with y=n+N as an upper bound. Given all of its annotations, we can generate a tree in 𝒪(2Nn)𝒪(z) time. There are less than (nn+1)(2N)𝒪(z) ways of annotating a tree, and they can be easily enumerated. Hence the trees are generated in 𝒪(z2) time. Generating the nn𝒪(z) query strings takes 𝒪(nn+1)𝒪(z) time. Given a tree and a query string, answering the query and checking if the output string is free of twin-palindromes and unit-squares takes 𝒪(N+n3)𝒪(z) time (by testing each substring naively). We check if the output string actually generates the query string from left to right by trying all 𝒪(3n) possible walks in 𝒪(3nn)𝒪(z) time. By multiplying the number 𝒪(z) of possible trees with the number 𝒪(z) of possible instances and the time 𝒪(z) to verify that a tree works for a given instance, the overall preprocessing time is 𝒪(z3).

Since we do not know the value of N in advance, we simply start with N=1 and keep increasing N by one until the procedure succeeds, which increases the time to 𝒪(Nz3)𝒪(z4). By Corollary 15, we know that N𝒪(nlogn) suffices, which readily implies 𝒪(z4)𝒪(222n). (This would even hold for N=𝒪(poly(n)).) Finally, we terminate with the minimal value N such that a query can be answered in fewer than N symbol comparisons in the worst case, and hence the 𝒪(N) query time is optimal. In Appendix A, Lemma 17, we show that N=𝒪(n) comparisons are sufficient, which concludes the proof.

References

  • [1] Amihood Amir and Itai Boneh. Dynamic palindrome detection. CoRR, abs/1906.09732, 2019. arXiv:1906.09732.
  • [2] Amihood Amir, Itai Boneh, Panagiotis Charalampopoulos, and Eitan Kondratovsky. Repetition detection in a dynamic string. In Proceedings of the 27th Annual European Symposium on Algorithms (ESA 2019), pages 5:1–5:18, 2019. doi:10.4230/LIPICS.ESA.2019.5.
  • [3] Amihood Amir, Panagiotis Charalampopoulos, Solon P. Pissis, and Jakub Radoszewski. Longest common substring made fully dynamic. In Proceedings of the 27th Annual European Symposium on Algorithms (ESA 2019), pages 6:1–6:17, 2019. doi:10.4230/LIPIcs.ESA.2019.6.
  • [4] V. L. Arlazarov, E. A. Dinitz, M. A. Kronrod, and I. A. Faradzhev. On economical construction of the transitive closure of an oriented graph. Dokl. Akad. Nauk SSSR, 194(3):487–488, 1970. (Original in Russian). URL: https://www.mathnet.ru/eng/dan35675.
  • [5] Jérémy Barbay, Johannes Fischer, and Gonzalo Navarro. LRM-trees: Compressed indices, adaptive sorting, and compressed permutations. Theor. Comput. Sci., 459:26–41, 2012. doi:10.1016/J.TCS.2012.08.010.
  • [6] Karl Bringmann, Fabrizio Grandoni, Barna Saha, and Virginia Vassilevska Williams. Truly subcubic algorithms for language edit distance and RNA folding via fast bounded-difference min-plus product. SIAM J. Comput., 48(2):481–512, 2019. doi:10.1137/17M112720X.
  • [7] Panagiotis Charalampopoulos. Data structures for strings in the internal and dynamic settings. PhD thesis, King’s College London (University of London), UK, 2020. doi:10.12681/eadd/57602.
  • [8] Jean-Pierre Duval, Thierry Lecroq, and Arnaud Lefebvre. Linear computation of unbordered conjugate on unordered alphabet. Theor. Comput. Sci., 522:77–84, 2014. doi:10.1016/J.TCS.2013.12.008.
  • [9] Jonas Ellert. Efficient string algorithmics across alphabet realms. PhD thesis, Technical University of Dortmund, Germany, 2024. doi:10.17877/DE290R-24255.
  • [10] Jonas Ellert and Johannes Fischer. Linear time runs over general ordered alphabets. In Proceedings of the 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021), pages 63:1–63:16, 2021. doi:10.4230/LIPICS.ICALP.2021.63.
  • [11] Jonas Ellert, Pawel Gawrychowski, and Garance Gourdel. Optimal square detection over general alphabets. CoRR, abs/2303.07229, 2023. doi:10.48550/arXiv.2303.07229.
  • [12] Jonas Ellert, Paweł Gawrychowski, and Garance Gourdel. Optimal square detection over general alphabets. In Proceedings of the 34th Annual Symposium on Discrete Algorithms (SODA 2023), pages 5220–5242, 2023. doi:10.1137/1.9781611977554.ch189.
  • [13] Mitsuru Funakoshi, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Faster queries for longest substring palindrome after block edit. In Proceedings of the 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019), pages 27:1–27:13, 2019. doi:10.4230/LIPIcs.CPM.2019.27.
  • [14] Zvi Galil and Kunsoo Park. Truly alphabet-independent two-dimensional pattern matching. In Proceedings of the 33rd Annual Symposium on Foundations of Computer Science (FOCS 1992), pages 247–256, 1992. doi:10.1109/SFCS.1992.267767.
  • [15] Donald E. Knuth, James H. Morris Jr., and Vaughan R. Pratt. Fast pattern matching in strings. SIAM J. Comput., 6(2):323–350, 1977. doi:10.1137/0206024.
  • [16] Dmitry Kosolobov. Finding the leftmost critical factorization on unordered alphabet. CoRR, abs/1509.01018, 2015. arXiv:1509.01018.
  • [17] Dmitry Kosolobov. Lempel-Ziv factorization may be harder than computing all runs. In Proceedings of the 32nd International Symposium on Theoretical Aspects of Computer Science (STACS 2015), pages 582–593, 2015. doi:10.4230/LIPICS.STACS.2015.582.
  • [18] Michael G. Main and Richard J. Lorentz. An O(n log n) algorithm for finding all repetitions in a string. J. Algorithms, 5(3):422–432, 1984. doi:10.1016/0196-6774(84)90021-X.
  • [19] Glenn K. Manacher. A new linear-time “on-line” algorithm for finding the smallest initial palindrome of a string. J. ACM, 22(3):346–351, 1975. doi:10.1145/321892.321896.
  • [20] Seth Pettie and Vijaya Ramachandran. An optimal minimum spanning tree algorithm. J. ACM, 49(1):16–34, 2002. doi:10.1145/505241.505243.
  • [21] Ian Pratt-Hartmann. Walking on words. CoRR, abs/2208.08913, 2022. doi:10.48550/arXiv.2208.08913.
  • [22] Ian Pratt-Hartmann. Walking on words. In Proceedings of the 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024), pages 25:1–25:17, 2024. doi:10.4230/LIPICS.CPM.2024.25.
  • [23] Robert E. Tarjan. Problems in data structures and algorithms. In Graph Theory, Combinatorics and Algorithms: Interdisciplinary Applications, pages 17–39. Springer US, Boston, MA, 2005. doi:10.1007/0-387-25036-0_2.
  • [24] Yuki Urabe, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Longest Lyndon substring after edit. In Proceedings of the 29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018), pages 19:1–19:10, 2018. doi:10.4230/LIPIcs.CPM.2018.19.
  • [25] Hao Yuan and Mikhail J. Atallah. Data structures for range minimum queries in multidimensional arrays. In Proceedings of the 21st Annual Symposium on Discrete Algorithms (SODA 2010), pages 150–160, 2010. doi:10.1137/1.9781611973075.14.

Appendix A Eliminating Twin-Palindromes in 𝓞(𝒏) Comparisons

Lemma 17.

Let T[1..n] be a string. There is an algorithm that, using 𝒪(n) symbol comparisons, computes a string S that generates T from left to right and contains neither twin-palindromes nor unit-squares.

Limiting the Number of Oracle Queries with Positive Answer.

As explained in Section 2, all strings encountered by the algorithm have an explicit physical representation. If we encounter strings S1[1..m1] and S2[1..m2], then they are physically stored as arrays A1[1..m1] and A2[1..m2] such that for i[1,m1],j[1,m2] it holds S1[i]=T[A1[i]] and S2[j]=T[A2[j]]. Hence S1[i]=S2[j] if and only if T[A1[i]]=T[A2[j]], i.e., we can perform each symbol comparison by asking exactly one oracle query to T.

Recall that T[1..n] itself has physical representation B[1..n] with i[1,n]:B[i]=i. By manipulating B, we can ensure that the algorithm asks at most n1 oracle queries with positive answer. We will ensure that i[1,n]:T[i]=T[B[i]] at all times. Instead of directly issuing an oracle query, we test if T[i]=T[j] with i,j[1,n] as follows.

  1. 1.

    If B[i]=B[j], we return that T[i]=T[j].

  2. 2.

    Otherwise, we ask the oracle if T[B[i]]=T[B[j]].

    1. (a)

      If T[B[i]]T[B[j]], we return that T[i]T[j].

    2. (b)

      If T[B[i]]=T[B[j]], we scan B and replace all occurrences of B[j] with B[i]. We then return that T[i]=T[j]. (This naively implements a union-find data structure.)

In Case 2b, the number of distinct values in B decreases. Hence we encounter the case at most n1 times. This is the only case in which we ask an oracle query and obtain a positive answer, hence the total number of such queries will be at most n1. Throughout the remainder of the section, we will use the previously described strategy for all symbol comparisons. Hence we perform at most 𝒪(n) oracle queries with positive answer, and we only have to count the number of comparisons with negative answers.

Eliminating Twin-Palindromes.

Assume that T contains no unit-squares. This assumption is without loss of generality, as otherwise we can eliminate unit-squares with Lemma 6, and then show the correctness of the Lemma for the resulting shorter string. We proceed in 𝒪(n) phases numbered 1,2,,K. Each phase consists of 𝒪(n) iterations. Before each iteration of phase k[1,K], each of the following conditions is satisfied:

  1. 1.

    We have computed a string Sk[1..m] that generates T from left to right and contains neither unit-squares nor twin-palindromes of length less than 3k+1.

  2. 2.

    We have computed an array Rk[1..m] that contains the maximal radius of the palindrome centered at each position of Sk, truncated to length k, i.e., for i[1,m] it holds

    Rk[i]=max{h[0,min(i1,mi,k)]Sk[ih..i+h] is a palindrome}.
  3. 3.

    We have charged some entries of Rk. However, Rk[i] cannot be charged if Rk[i]=k. We will use charged entries to account for symbol comparisons later.

As soon as we produce Sk[1..m] with 3k+1>m, we know that Sk contains no twin-palindromes. Now we describe how to implement an iteration of phase k. The goal of phase k is to eliminate twin-palindromes of length exactly 3k+1. By Observation 5, some substring Sk[p..p+3k] is a twin-palindrome if and only if Sk[p..p+2k] and Sk[p+k..p+3k] are palindromes, which is the case if and only if Rk[p+k]=k and Rk[p+2k]=k. Hence, given Rk, we can find a twin-palindrome of length 3k+1 (or determine that none exists) without performing any symbol comparisons.

Performing a Final Iteration.

If we cannot find a twin-palindrome of length 3k+1, then we immediately finish the iteration and proceed to the next phase k+1. A twin-palindrome cannot have length 3k+2 or 3k+3 for non-negative integer k, and thus Sk contains no twin-palindromes of length less than 3(k+1)1. Hence we can use Sk+1=Sk, and we only have to compute Rk+1. For any position i[1,m], if Rk[i]<k, then we can directly assign Rk+1[i]=Rk[i] without performing any comparisons. In this case, we also transfer the charge (if any) from Rk[i] to Rk+1[i]. If Rk[i]=k, then we check if Sk[ik1]=Sk[i+k+1] (assuming ik11 and i+k+1m). If the answer is yes, we assign Rk+1[i]=k+1 (and do not charge the position). If the answer is no, then we assign Rk+1[i]=k and charge position i. Note that we charge one previously uncharged position for each comparison with negative outcome. Later, we use the charges to count the overall number of comparisons.

Performing a Regular Iteration.

It remains to show how to proceed in phase k if we actually find a twin-palindrome Sk[p..p+3k]. In this case, Sk[p..p+k] generates Sk[p..p+3k] from left to right, and thus Sk[1..p+k]Sk(p+3k..m] generates Sk from left to right. However, Sk[1..p+k]Sk(p+3k..m] may contain twin-palindromes of length less than 3k+1 that were not present in Sk. We apply Theorem 8 with parameter 3k to Sk[1..p+k] and Sk(p+3k..m] and obtain values n,m such that Sk[1..p+kn]Sk(p+3k+m..m] generates Sk from left to right and contains neither unit-square nor twin-palindrome of length less than 3k+1.

Let =p+kn and r=p+3k+m. We will replace Sk with Sk=Sk[1..]Sk(r..m]. Therefore, we also have to replace Rk with the appropriate array Rk. Note that any entry Rk[i] depends solely on a short local substring Sk[ik..i+k]. This allows us to copy all except 𝒪(k) entries from Rk to Rk. More precisely:

  • For i[1,k], it holds Rk[i]=Rk[i] due to Sk[1..]=Sk[1..].

  • For i(r+k,m], it holds Rk[i]=Rk[i(r)] due to Sk(r..m]=Sk(..mr+].

The only missing entries are the Rk[i] with i([1,m(r)](k,+k]). We compute each such value from scratch. This requires only one comparison with negative outcome per value, or 𝒪(k) comparisons with negative outcome overall.

Finally, we have to handle the charged positions. We simply transfer the charges from Rk[1..k] to Rk[1..k], and from Rk(r+k..m] to Rk(r+k..m] (respectively only if >k and r+k<m). We have accounted for all charges, except for the ones in interval Rk[max(1,k+1)..min(m,r+k)]. This interval is of length r+2k=𝒪(k+n+m). We will pay for the charges directly, without transferring them to the next iteration. In this iteration, we have to pay for the following comparisons with negative outcome:

  • 𝒪(k+n+m) comparisons for applying Theorem 8 with parameter 3k, and

  • 𝒪(k) comparisons for computing the entries of Rk that cannot be copied from Rk, and

  • 𝒪(k+n+m) comparisons that were performed in previous iterations, stored as charges of entries in Rk that could not be copied to Rk.

This sums to 𝒪(k+n+m) comparisons with negative outcome, which we can afford because we permanently reduced the length of the string by |Sk||Sk|=r=2k+n+m. After the algorithm terminates, there are at most 𝒪(n) charges on the final Rk that have not been accounted for. Hence the number of comparisons with negative outcome is 𝒪(n) overall, which concludes the proof.