Minimal Generators in Optimal Time
Abstract
A walk of length on a string of length is a function such that . The walk generates the string of length defined by . Intuitively, this can be seen as walking steps in and outputting the encountered symbols, where in each step we either remain at the same position, or move one position to the left or to the right. The minimal generator of a string is the shortest string such that a walk on generates . Recently, it was shown that each string admits exactly one (up to reversal) minimal generator (Pratt-Hartmann, CPM 2024). However, no efficient algorithm for computing the minimal generator was known. We provide an optimal algorithm for this task, taking time for a string of length over general unordered alphabet, i.e., accessing the string only by equality comparisons of symbols. The main challenge is to detect substrings of the form and replace them with , where are symbols and is a string with reversal . We solve this problem with a non-trivial adaptation of Manacher’s classic algorithm for computing maximal palindromic substrings (Manacher, J. ACM 1975). To obtain the final algorithm, we solve small subinstances of the problem in optimal time by adapting the “Four Russians” technique to strings over general unordered alphabet, which may be of independent interest.
Keywords and phrases:
string algorithms, walking on words, minimal generator, palindromic substrings, general unordered alphabet, decision tree complexityFunding:
Jonas Ellert: Partially supported by grant ANR-20-CE48-0001 from the French National Research Agency (ANR).Copyright and License:
2012 ACM Subject Classification:
Theory of computation Pattern matchingEditors:
Paola Bonizzoni and Veli MäkinenSeries and Publisher:
Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction and Related Work
Given some non-empty string of length , go to any position in . Now repeat the following routine times. Print the symbol at the current position, and then either stay at that position, or move one position to the left (if possible), or move one position to the right (if possible). The result is a string of length written in an append-only manner. Pratt-Hartmann describes this procedure as going for a walk on [22], and he formally models the walk as a function that satisfies . Using this function, is defined by . Intuitively, we say that some string generates another string (or that is a generator of ) if and only if there is a (not necessarily unique) walk on that results in .
While a string may have many distinct generators, a natural question is to ask for a generator of minimal length, henceforth called a minimal generator.111In [22, 21], a minimal generator is called a primitive generator. We use different terminology to avoid confusion with the other definition of primitiveness (where a string is primitive if it cannot be written as the times concatenation of a shorter string for some integer ). Pratt-Hartmann shows that the minimal generator of a string is unique up to reversal. While he does not consider the problem of efficiently computing the minimal generator of a given string, he implicitly provides a simple algorithmic idea for doing so. Particularly, he shows that the minimal generator of can be obtained by starting with and then repeatedly substituting substrings of the form with , substrings of the form with , and prefixes (resp. suffixes) of the form with (resp. ). Here, are symbols and is a possibly empty string with reversal . As soon as no further substitutions are possible, we have obtained the minimal generator. We provide an -time algorithm that computes a minimal generator via the described substitutions. It works over general unordered alphabet, i.e., it accesses the input string only via equality comparisons of symbols.
Theorem 1.
There is an algorithm that computes the minimal generator of a length- string over general unordered alphabet in time.
1.1 Technical Overview and Roadmap
In Section 2, we introduce the formal definitions and notation used throughout the paper, and then show that substitutions of types and are easy to realize. The remainder of the paper focuses on substitutions of type .
In Section 3, we provide a simple algorithm that performs all substitutions of type in time. It follows the same divide-and-conquer strategy as Main and Lorentz’s classic algorithm for detecting squares [18]. First, we consider the halves and and recursively perform all type- substitutions within them. Then, any remaining substitution in must cross the boundary between and . Using a modified version of Manacher’s algorithm for computing maximal palindromic substrings [19], we find and perform all substitutions at the boundary in linear time.
In Section 4, we improve the time to .222 denotes the extremely slowly growing iterated logarithm defined by if and otherwise. Instead of splitting into halves, we split it into blocks of length . We recursively perform the type- substitutions in each block. Then, a more sophisticated adaptation of Manacher’s algorithm combines the reduced blocks and performs the remaining substitutions in linear time.
In Section 5, we finally achieve linear time by modifying the time algorithm. Once the recursion of the algorithm reaches blocks of size (which happens after a constant number of levels of recursion), we stop the recursion and solve the blocks of the current level in optimal time by using precomputed data structures. Particularly, by brute force, we precompute a decision tree of minimal height for solving blocks of size . Peculiarly, this results in a time-wise optimal solution, but we do not immediately know the time complexity of this solution. We show that the height of the decision tree is indeed in Appendix A. Hence we solve instances of size in linear time, as required. A similar technique has been used in [20, 25].
1.2 Related Work
Optimum Stack Generation.
In [23], Tarjan introduced the optimum stack generation problem, where one is given a finite alphabet , a stack, and a string , and must find the shortest sequence of stack operations – push, print the top character in the stack, and pop – printing . The stack must be empty in the beginning and at the end of the algorithm. While the problem seems to be quite similar to the problem of computing a minimal generator, no fast algorithms are known. A simple dynamic programming algorithm solves it in time, and the current best solution is by Bringmann et al. [6] and requires randomised time or deterministic time on a constant-size alphabet.
Repetition Detection in Dynamic Strings.
Our algorithm repeatedly replaces substrings of the form with , where are symbols and is a string. In other words, it must be able to detect substrings of form in a dynamic string, where one is allowed to update the input by deleting its substrings. The problem of detecting repetitive structures in a dynamic string has been an active branch of research recently. The problem of maintaining the longest palindromic substring was considered in [1, 24, 13, 3], where the update time of the current best algorithm by Amir et al. [3] is . The longest square of a dynamic string can be maintained in time per update [7], while maintaining the runs can be done in time [2].
Repetition Detection and Other Problems Over Unordered Alphabets.
Our algorithm accesses the string via equality comparisons of symbols, i.e., we do not assume an order on the alphabet. Since order is not needed to define repetitive structures (like palindromes, squares, or runs), it is an intriguing question whether the presence of order facilitates the detection of such structures. We start with some examples where order is beneficial. Squares and runs can be detected in linear time over an ordered alphabet [10], whereas there is a tight bound of symbol comparisons to decide square-freeness of a string over unordered alphabet of size [12]. There is an -time algorithm for reporting all runs on such a string [9, 11]. Computing the Lempel-Ziv factorization has a tight time bound of for ordered alphabet [17], and for unordered alphabet [9, 12]. On the other hand, there are linear time algorithms for several problems over unordered alphabet, e.g., pattern matching in one [15] and two [14] dimensions, detecting the longest palindromic substring [19], computing an unbordered conjugate [8], finding the leftmost critical factorization [16], and (as shown in this paper) computing a minimal generator.
2 Preliminaries
Intervals, Alphabets, Strings.
For , we write to denote the integer range (or the empty set if ). Let . A string of length is a sequence of symbols from an alphabet (or the empty string if ). We simply write to denote that is a string of length . Every pair defines a substring (which equals if ). Index is the position of substring . A substring is a prefix of , while a substring is a suffix of . The reversal of is defined as . Another string is a substring of if there is such that , i.e., . From now on, we use and to respectively denote the input and output string, and to denote intermediate strings. We use the terms string and array interchangeably.
Palindromes, i.e., strings that satisfy , are essential for the study of minimal generators. A fundamental tool for detecting palindromic substrings is Manacher’s algorithm [19]. It computes, for each position of , the maximal radius such that is a palindrome in overall time. Note that, while not strictly required, familiarity with the details of Manacher’s algorithm will be beneficial for understanding the new algorithm for computing minimal generators.
Model of Computation.
An oracle string can only be accessed via the following type of oracle query. Given query positions , the oracle answers if . Algorithms that receive an oracle string as input are more commonly described as working over general (unordered) alphabet, or as accessing the input via equality comparisons. We present algorithms that (apart from the oracle queries) work in the word RAM model, i.e., they perform an interleaved sequence of oracle queries and word RAM operations. The time complexity of an algorithm is the sum of the number of oracle queries and the number of word RAM operations performed. The size of a machine word is assumed to be at least bits.
Since an oracle string has no physical representation, it is not obvious how to perform traditional string operations like extracting, deleting, or overwriting substrings. We allow such operations by wrapping the input oracle string as follows. Every string encountered during the algorithm execution (e.g., a substring extracted from or the computed output string) is physically represented by an array , indicating that . This also holds for the input string itself, which is represented by with . Hence we can perform operations that modify or create copies of (sub)strings directly on the physical representations. Given the physical representations, we can still perform any symbol comparison in constant time with a single oracle query to . We hide this additional level of complexity from now on.
Walking on Strings.
Let be a string, and let . Then is a walk on if and only if . The walk on generates the string defined by . More generally, a string generates another string if and only if there is a walk on that generates . We further say that generates from left to right if and only if there is a walk on that generates with and . Trivially, if generates from left to right, then . The following properties are straight-forward. (The first one is obtained by composing the two generating walks, the second one by concatenating them.)
Observation 2 (Properties of Left-to-Right Generation).
Let be strings.
-
(i)
Transitivity: If generates from left to right, and generates from left to right,
then generates from left to right. -
(ii)
Concatenability: If generates from left to right, and generates from left to right,
then generates from left to right.
Minimal Generators.
A minimal generator of is a string that generates and is of minimal length (i.e., there is no shorter generator). The minimal generator of is unique up to reversal [22]. Whether or not a generator is minimal can be characterized by using the following types of substrings. A (non-trivial) palindrome is a string (of length at least two) such that . A twin-palindrome is a string of the form for symbols and a string . A unit-square is a string of the form for a symbol .
Lemma 3 ([22, Lemma 6]).
A string is a minimal generator of another string if and only if generates and both of the following conditions hold. No suffix or prefix of is a non-trivial palindrome. No substring of is a unit-square or a twin-palindrome.
Particularly, a minimal generator can be obtained from by repeatedly applying the following substitutions until no further substitutions are possible. (The correctness is straight-forward, see Observation 2 and the proof of [22, Lemma 6]).
Lemma 4.
Let be strings, and let be symbols.
-
(i)
If generates , then so does .
If generates from left to right, then so does . -
(ii)
If generates , then so does .
If generates from left to right, then so does . -
(iii)
If generates , then so does .
If generates , then so does .
Our algorithm for computing the minimal generator performs the following three steps. First, we eliminate all unit-squares by applying Lemma 4(i) until no longer possible. Second, we eliminate twin-palindromes by applying Lemma 4(ii) until no longer possible, which introduces no new substrings of length two, and thus no new unit-squares. Third, we eliminate non-trivial suffix and prefix palindromes by applying Lemma 4(iii) until no longer possible. This introduces no new substrings, and thus the string remains free of unit-squares and twin-palindromes. The first and third step are easily performed using the auxiliary lemmas below. In the remainder of the paper, we focus on the elimination of twin-palindromes in the second step, for which we make the following observation.
Observation 5.
Let . A string is a twin-palindrome if and only if and are palindromes (which justifies the term “twin-palindrome”).
Lemma 6.
Let be a string. There is an algorithm that, in time, computes a string that generates from left to right and contains no unit-squares.
Proof.
We start with . For each in increasing order, we append to if and only if , which clearly takes time and yields the desired result.
Lemma 7.
Let be a string that contains neither unit-squares nor twin-palindromes. There is an algorithm that, in time, computes a minimal generator of .
Proof.
We start with a copy of and modify it as follows. As long as there is a non-trivial prefix palindrome , we replace it with . Then, as long as there is a non-trivial suffix palindrome , we replace it with . (In both cases, is a symbol and is a non-empty string.) This results in a generator of by Lemma 4(iii). Note that neither of the replacements introduces any new substrings because they merely truncate the string. Hence we introduce no new unit-squares or twin-palindromes, and the resulting string must be a minimal generator of by Lemma 3.
We will find a non-trivial prefix or suffix palindrome in time, or determine that none exists is time. Whenever we spend time to discover a prefix or suffix palindrome, we also reduce the length of the string by , and thus the time sums to .
It remains to show how to find a prefix palindrome . We proceed in up to phases. In phase (starting with ), we aim to find a prefix palindrome of length in range . We run Manacher’s algorithm [19] and find the longest prefix palindrome of in time. If it is non-trivial, then it is of length at least , since otherwise we would have discovered it in an earlier phase. Also, its length must be odd because is free of unit-squares (and even-length palindromes contain a unit-square at their center). As soon as we find a non-trivial prefix palindrome, we report it and perform no further phases. If we discover a non-trivial prefix palindrome in phase , then the time sums to , which is linear in the length of the palindrome. Otherwise, we finish all phases in time and there is no non-trivial palindromic prefix. The procedure for suffix palindromes is symmetric.
3 Eliminating Twin-Palindromes in Time
In this section, we show how to eliminate twin-palindromes that cross a fixed position. This ultimately results in the following theorem.
Theorem 8.
Let , and let , be strings that contain no twin-palindromes of length at most . There is an algorithm that computes and such that
-
(i)
generates from left to right, and
-
(ii)
contains no twin-palindrome of length at most , and
-
(iii)
if and contain no unit-squares, then neither does .
The algorithm runs in time.
The Theorem follows from a sequence of simpler results, which we describe in the remainder of the section. Before that, we show that the Theorem directly implies an time algorithm for eliminating all twin-palindromes.
Corollary 9.
Let be a string. There is an algorithm that, in time, computes a string that generates from left to right and contains neither twin-palindromes nor unit-squares.
Proof.
Assume that contains no unit-squares. This assumption is without loss of generality, as otherwise we can eliminate unit-squares with Lemma 6, and then show the correctness of the Corollary for the resulting shorter string. We terminate in constant time if . Otherwise, we recursively apply the Corollary to and , resulting in strings and that contain neither twin-palindromes nor unit-squares. Furthermore, generates from left to right, and generates from left to right. Hence, by Observation 2(ii), generates from left to right. We apply Theorem 8 with to and . In time, we obtain and such that generates (and by Observation 2(i) also ) from left to right. Furthermore, contains no unit-squares or twin-palindromes. We spend time per level of recursion, and time overall.
Now we prove Theorem 8. We start with an algorithm for finding a shortest twin-palindromic substring (or verifying that none exist) in linear time. At first glance, prioritizing short twin-palindromes might seem counter-intuitive; one might expect that long twin-palindromes should be eliminated early to quickly reduce the size of the input. However, finding long twin-palindromes appears to be more challenging, and finding the shortest one still suffices for proving Theorem 8.
Lemma 10.
Let be a string. There is an algorithm that returns the position and length of a shortest twin-palindromic substring of , or reports that contains no twin-palindromes, in time.
Proof.
We compute the maximal radius of a palindrome centered at each , i.e., . A twin-palindrome must be of length for some positive integer . Fix and consider a substring . Observation 5 states that is a twin-palindrome if and only if and are palindromes, which is the case if and only if and . This dictates a simple algorithm for detecting all twin-palindromes. We consider each pair with . Let and , which implies and . If and , then we report as a twin-palindrome. There are pairs, and each pair is processed in constant time.
We cannot afford to consider each pair of positions. Hence we show that pairs are sufficient to detect a shortest twin-palindrome. Let be any such twin-palindrome, i.e., contains no twin-palindrome of length less than . Let and . We already know that and . We claim that . Indeed, with implies that is a palindrome of length centered at position . Due to , we know that is a palindrome of length centered at position . Thus, Observation 5 implies that is a twin-palindrome of length . However, this contradicts the fact that contains no twin-palindrome of length less than .
Now we exploit the new insights algorithmically. For each position , we compute and , i.e., the positions of the previous and next non-smaller value of in . For a shortest twin-palindrome with and , we have shown that , , and . Hence either or (or both). Thus, instead of processing all the possible pairs of positions, we instead process only the pairs . If no twin-palindrome is found, we report that contains no twin-palindrome. Otherwise, among the discovered twin-palindromes, we choose and report one of minimal length. Given , nxt, and prv, we can generate and process the pairs in time. Computing takes time with Manacher’s algorithm [19]. Computing nxt and prv takes time with a simple folklore algorithm (see, e.g., [5, Lemma 1]).
Given two strings that contain no twin-palindromes (possibly substrings of a longer string, in the representation described in Section 2), we can use the Lemma above to find a twin-palindrome in their concatenation. Most importantly, this yields an output-sensitive algorithm: if the algorithm reports a twin-palindrome, its runtime is linear in the length of that twin-palindrome; otherwise, the runtime is linear in the combined length of the input strings. This property will later allow us to amortize the time over the number of symbols removed from the input when computing the minimal generator.
Lemma 11.
Let , be strings that contain no twin-palindromes. There is an algorithm that computes and such that is a twin-palindrome in time, or outputs that no such values exist in time.
Proof.
Similarly to Lemma 7, we proceed in up to phases. In phase (starting with ), we aim to find a twin-palindrome with . As soon as we find such , we return them without executing further phases. We will perform phase in time. Hence, if we find in phase , the total time sums to , which is due to . If all phases finish without finding , then the time sums to .
It remains to show how to perform phase in time. Since we aim to find , it suffices to consider and of total length . We apply Lemma 10 to in time. Since and are free of twin-palindromes, every twin-palindrome satisfies . Hence, if Lemma 10 reports as a twin-palindrome, we return and . Note that , since otherwise the twin-palindrome would have already been discovered during the previous phase. If Lemma 10 reports no twin-palindrome, then contains no twin-palindrome.
We are now ready to show Theorem 8. The algorithm first finds a twin-palindrome of length crossing the boundary between and . By Lemma 11, this step takes either time if the twin-palindrome exists, or time otherwise. If over half of the twin-palindrome is in (resp., ), the algorithm cuts its left (resp., right) part out of and shortening them by symbols, and applies the algorithm recursively to the resulting strings and . As a result, the total time is linear in the sum of and on the number of the symbols that were cut off.
See 8
Proof.
If at least one of the strings is empty, then we trivially return . If neither nor contain unit-squares, then we can without loss of generality assume that contains no unit-squares. This is because the only possible unit-square in is , and if we can simply run the algorithm for and instead.
If both strings are non-empty, we apply Lemma 11 to and . This either results in and such that is a twin-palindrome in time, or determines that no such values exist in time. Since and do not contain any twin-palindromes of length at most , it is clear that contains no twin-palindrome of length at most if and only if no values are found. Hence, if no values are found, we output .
Otherwise, there are symbols and a string such that . Let . Assume that , then it holds and . By Lemma 4(ii), with and generates from left to right. Similarly, if , then and . In this case, again by Lemma 4(ii), with and generates from left to right. Either way, implies
We finish the computation by applying Theorem 8 recursively to and (where we do not need to materialize and , as they are respectively a prefix and a suffix of and ). This results in values and such that generates from left to right and contains no twin-palindrome of length at most . We have already established that generates from left to right, and hence also generates from left to right due to Observation 2(i). Thus, we return and . Note that we obtained by replacing with in . This clearly does not introduce any new substrings of length two, and thus, if contains no unit-square, then neither does , and (recursively) neither does .
4 Eliminating Twin-Palindromes in Time
In this section, we improve the time to . In principle, the solution is similar to the time algorithm from Corollary 9, which cuts into halves, eliminates the twin-palindromes in each half recursively, and finally eliminates all the twin-palindromes across the boundary. We accelerate the algorithm by instead cutting into blocks of size roughly , which reduces the levels of recursion from to . However, we need a much more sophisticated algorithm for eliminating the twin-palindromes that cross any of the block boundaries. The algorithm is built on the following structural lemma.
Lemma 12.
Let be a string that contains neither twin-palindromes nor unit-squares.
-
(i)
There are at most palindromic suffixes of .
-
(ii)
For any symbol , there is at most one twin-palindromic suffix of , and (if it exists) it is of length , where is the length of the longest palindromic suffix of .
Proof.
For both properties, we only have to consider palindromes of odd length because contains no unit-square, and every palindrome of even length contains a unit-square at its center. For (i), assume that has two non-trivial palindromic suffixes of respective lengths and such that . Let . Then is a palindrome centered at position , and its central portion of length is . Similarly, is a palindrome centered at , and its central portion of length is . However, by Observation 5, if and are palindromes, then is a twin-palindrome, which is a contradiction. (See Figure 1(a).) Therefore, if we consider all the non-trivial palindromic suffixes in increasing order of length, then each suffix must be over twice as long as the previous one, and hence their number is .
For (ii), let and . Let be chosen such that is the longest palindromic suffix of . If has some twin-palindromic suffix , then it also has palindromic suffix , which implies . If , then is the claimed unique twin-palindromic suffix. It remains to consider the case . In a moment, we will show that implies , i.e., the twin-palindromic suffix is of length less than . However, due to the palindromic suffix , it holds . The reversal of a twin-palindrome is also a twin-palindrome, and thus is a twin-palindromic substring of (particularly due to ; see Figure 1(b)). This contradicts the fact that contains no twin-palindromes.
Finally, we show that indeed implies . If , then we observe that has non-trivial suffix palindromes of respective lengths and (by truncating the palindromes and by one symbol on either side). As seen in the proof of (i), it holds . This implies , which in turn implies . (The latter inequality is due to and thus .) If , then the twin-palindromic suffix at hand is with . If then . However, then has twin-palindromic suffix , which is a contradiction. Hence , which implies .
Theorem 13.
Let be a string that contains neither unit-squares nor twin-palindromes of length less than . There is an algorithm that, in time, computes a string that generates from left to right and contains neither twin-palindromes nor unit-squares.
Proof.
Let . We use a modified version of Manacher’s algorithm for maximal palindromic substrings [19]. In each step, the algorithm maintains the following data structures:
-
(1)
Strings and are the current working strings such that
-
(i)
generates from left to right, and
-
(ii)
contains neither unit-squares nor twin-palindromes of length less than , and
-
(iii)
contains no twin-palindromes.
-
(i)
-
(2)
List with stores a subset of such that
-
(i)
if is the center of a palindromic suffix of , then is in (but not every element of is necessarily the center of a palindromic suffix of ), and
-
(ii)
is the center of the longest palindromic suffix of (hence is non-empty).
-
(i)
-
(3)
A prefix of array contains, for each , the maximal radius of a palindrome centered at position with respect to , i.e.,
-
(4)
A prefix of array contains, for each , a list of centers defined as follows. The list contains exactly all the for which . (Informally, we store each center at the end position of its maximal palindrome.)
We initialize the algorithm with , , and with . Without loss of generality, we assume that has no occurrence in (which can be achieved by prepending a unique symbol in front of ). This will ensure that never becomes empty. Now we alternate between appending a symbol and eliminating twin-palindromes. Appending a symbol means that we remove the first symbol of and append it as the last symbol of , i.e., we replace with , and with . Since their concatenation does not change, Properties (1i) and (1ii) remain satisfied. As part of the appending routine, we update , , and so that they satisfy Properties (2), (3), and (4), which we describe in detail below. Afterwards, only Property (1iii) is possibly not satisfied, i.e., it might be that the new contains a twin-palindrome. In this case, the eliminating subroutine truncates possibly both and and updates , , and so that all properties are satisfied. If is empty after running the eliminating subroutine, then generates from left to right and contains neither twin-palindromes nor unit-squares.
Appending a Symbol.
Now we describe how to maintain , , and after appending a symbol. By appending a symbol, we transitioned from working strings and to and . The list currently contains values , and we append the newly added position to its end. If has a palindromic suffix with center , then also has a palindromic suffix with center . Hence the center of the longest palindromic suffix of is already contained in . We consider the positions with in increasing order. We first check if is the center of a palindromic suffix of , which is always the case for . For with , we exploit that is a palindrome, which implies . Hence is a palindrome centered at position if and only if is a palindrome centered at position , which is the case if and only if . We have already computed because cannot be in due to . Therefore, we can check if is a palindromic suffix in constant time. If is a palindromic suffix of , we check if is a palindromic suffix of , which is the case if and only if . If , then we proceed with the next larger value of . Otherwise, we know that is the longest palindromic suffix of , and its center is . If no leads to a palindromic suffix of , then the longest palindromic suffix is with center .
From now on, let be the value such that is the center of the longest palindromic suffix of . Now we update , , and . We remove from the list. If we actually removed positions from the list (i.e., ), we have to update and accordingly. In this case, it is clear that is the maximal palindrome with center with respect to the string . We assign and append to . For the remaining with , we use the same observation as before: it holds , and we know that this string is not a palindrome. Hence the radii of the maximal palindromes centered at positions and are identical, and we assign and append to .
We spent time for computing , removing elements from , and updating the corresponding entries in and . We also added to , and thus we can amortize the time over the elements that were either added to or removed from . Each removed element was previously added, and thus the overall time for the appending subroutine is linear in the number of times we add an element to .
Eliminating Twin-Palindromes.
It remains to show how to eliminate twin-palindromes after appending a symbol. For this part, let and be the working strings after appending a symbol, i.e., we just transitioned from strings and to and . We have already ensured that , , and satisfy the claimed properties. We know that does not contain any twin-palindromes, and the only property we have to establish is that contains no twin-palindromes, i.e., we are only interested in twin-palindromic suffixes of . By Lemma 12(ii), the only potential twin-palindromic suffix of is with . Since is a palindrome, is a twin-palindrome if and only if is a palindrome (by Observation 5), which is the case if and only if . We have already computed because is not contained in . If (or ), then we terminate the eliminating subroutine in time.
Otherwise, i.e., if , we know that is the only twin-palindrome in . There are symbols and a string with . We truncate to , i.e., we substitute with . Since does not contain unit-squares, and the substitution introduces no new substrings of length two, also does not contain unit-squares. Also, generates from left to right, and thus generates and consequently from left to right. Finally, does not contain twin-palindromes of length less than , and therefore neither does or . However, their concatenation may contain twin-palindromes of length less than . We apply Theorem 8 with parameter to and , resulting in values and such that generates (and hence also , and consequently ) from left to right, and contains neither unit-squares nor twin-palindromes of length at most . This takes time.
We replace with and with . It remains to update , , and . Note that , and thus none of the positions currently stored in lie within . Hence we remove all elements from . Currently, each with contains the radius of the longest palindrome centered at position with respect to . If , then is also the radius of the longest palindrome centered at position with respect to , and thus we do not need to update such . If, however, , then is the center of a suffix palindrome of , and we need to add to the list . In this case, we do not care about the content of , and hence we can leave the entire array unchanged.
We cannot afford to scan and check if for each position. Instead, we directly identify the positions that need to be added to by using . We collect all the elements stored in all the lists with . Let be one of these elements. If , then we simply discard it. The remaining elements are exactly the positions for which is the center of a suffix palindrome of . We sort them in increasing order, and place them into . Finally, we replace with an empty list.
Analyzing the Time Complexity.
If the eliminating subroutine does not finish in constant time, then we permanently reduce the length of the working strings by , particularly by eliminating a twin-palindrome of length from . Since does not contain a twin-palindrome of length less than , it holds . Hence we reduce the length by , and we can afford the time for applying Theorem 8. We also discard at most elements that were collected from , which we can afford to do in time because we reduced the length by . The remaining elements correspond to suffix palindromes of , and by Lemma 12(i) there are at most of them. Hence we can afford to sort them and to add them to the list . During the entire algorithm execution, we add elements to in this way.
Over all iterations of the appending subroutine, we append additional elements to . The total time for the appending subroutine is linear in the number of elements added to , and hence it is , which concludes the proof.
4.1 Obtaining the Recursive Algorithm
In this section, we show our -time algorithm. As mentioned earlier, it is recursive: it partitions the input into blocks, eliminates twin-palindromes and unit-squares in them recursively, and then concatenates the blocks together to form the output. To concatenate the blocks together, we use the following procedure:
Lemma 14.
Let be strings, where each string is of length and contains no twin-palindromes. There is an algorithm that, in time, computes a string such that
-
(i)
generates from left to right, and
-
(ii)
contains no twin-palindrome of length at most , and
-
(iii)
if none of contains unit-squares, then neither does .
Proof.
We proceed in steps numbered . After running step , the algorithm has computed a string that generates from left to right. If none of contains unit-squares, then neither does . Furthermore, contains no twin-palindromes of length at most , and it is stored in the first cells of an output array .
In step , we write to the output array, which trivially satisfies the claimed properties. In step with , we apply Theorem 8 with parameter to and . This results in and such that generates (and thus also ) from left to right and does not contain a twin-palindrome of length at most . Hence we can use , and we write to . Now contains . We have already established that, if none of contains unit-squares, then neither does , and in this case Theorem 8 ensures that also contains no unit-squares.
The time for step is dominated by applying Theorem 8, which takes time. This sums to over all steps.
We are now ready to present the main result of this section:
Corollary 15.
Let be a string. There is a recursive algorithm that, in time, computes a string that generates from left to right and contains neither twin-palindromes nor unit-squares.
Proof.
Assume that contains no unit-squares. This assumption is without loss of generality, as otherwise we can eliminate unit-squares with Lemma 6, and then show the correctness of the Corollary for the resulting shorter string. The algorithm operates in a recursive manner. We cut into blocks of length (the final block might be shorter). For each block , where , we obtain a string that generates from left to right and contains neither twin-palindromes nor unit-squares. This is done by applying the algorithm recursively to each block except for the final one. For the final block , we obtain by applying Corollary 9 in time instead. For , let be the time needed by Corollary 15 for a length- string. Then the time needed for computing the reduced blocks is .
Next, we obtain a single string that generates (and thus also ) from left to right and contains neither unit-squares, nor twin-palindromes of length at most . This takes time with Lemma 14. Finally, we apply Theorem 13 to , which takes time and results in a string that generates (and thus also ) from left to right. Furthermore, contains neither twin-palindromes nor unit-squares. The total time is , which resolves to . (Every two levels of recursion, the block size decreases from some length to , which is less than if exceeds a sufficiently large constant. Hence levels of recursion are sufficient to reach blocks of constant size.)
5 Eliminating Twin-Palindromes in Time
Let be the function defined by . As seen in the proof of Corollary 15, there is a constant such that the algorithm runs in time
Particularly, we reduce the problem from a single instance of size to at most instances of size each. Let . After four levels of recursion, we have to solve at most instances of size . It then holds . If we solve each instance in time, then the total time is linear. If the string was given over a polynomial integer alphabet, then we could reduce the alphabet of each instance (in batch using radix-sorting), and then use a precomputed lookup table to solve each instance in constant time. The lookup table would store the solution to each possible instance of size . (This approach is commonly referred to as the “Four Russians” technique.333The technique originates from the work of Arlazarov, Dinitz, Kronrod, and Faradzhev [4], who proposed an efficient method for computing the transitive closure of a directed graph by precomputing solutions for small blocks of the adjacency matrix. The term “Four Russians” technique has since come to broadly refer to the use of precomputed solutions for small subproblems, typically of logarithmic size relative to the input. The name reflects the authors’ affiliation in Moscow, though not all were Russian.) However, we cannot use strings over general unordered alphabet to access lookup tables. Instead, in Lemma 16 (below), we adapt the technique such that the precomputation no longer only considers all possible instances of size , but also all algorithms that potentially solve such instances efficiently. Among all the algorithms, we choose the fastest one that actually works correctly. This way, we show that we can indeed achieve time per instance after performing an time preprocessing. This improves the time complexity of Corollary 15 to . By applying Lemma 7 as a postprocessing, we obtain the main result from Theorem 1.
Lemma 16.
Let . After an time preprocessing, we can answer the following type of query in time. Given a string , output a string that contains neither twin-palindromes nor unit-squares and generates from left to right.
Proof.
For some unknown value , assume that there is an algorithm that solves a query by performing less than comparisons of symbols of the query string (running in arbitrary time). The algorithm can be implemented as a decision tree such that it answers queries in time. The decision tree is a binary tree with nodes on levels, i.e., with at most nodes. Each internal node has two children and is annotated with a position pair from . Each leaf is annotated with a length and a string . A query is then answered as follows. We start at the root. As long as we are at an internal node, we perform the symbol comparison dictated by its annotation. If the outcome of the comparison is positive, we move to the left child, and otherwise to the right child. Once we reach a leaf labeled with and , we return the string defined by . This takes comparisons and time. By adding dummy comparisons, we can assume without loss of generality that the decision tree is complete.
It remains to be shown how to compute the decision tree. Rather than directly obtaining the correct tree, we instead generate all the possible trees admitted by the domains of the annotations. We also generate all the possible length- strings over alphabet . Note that every length- string over arbitrary alphabet is isomorphic to one of these strings. For each decision tree, and for each possible query string, we run the query procedure. We then check if the output string contains neither twin-palindromes nor unit-squares, and actually generates the query string from left to right. Since there is an algorithm that requires less than comparisons, at least one of the decision trees must work for all possible input strings. As soon as we find such a tree, we terminate and use this tree to answer all future queries.
Now we analyze the preprocessing time. We will use with as an upper bound. Given all of its annotations, we can generate a tree in time. There are less than ways of annotating a tree, and they can be easily enumerated. Hence the trees are generated in time. Generating the query strings takes time. Given a tree and a query string, answering the query and checking if the output string is free of twin-palindromes and unit-squares takes time (by testing each substring naively). We check if the output string actually generates the query string from left to right by trying all possible walks in time. By multiplying the number of possible trees with the number of possible instances and the time to verify that a tree works for a given instance, the overall preprocessing time is .
Since we do not know the value of in advance, we simply start with and keep increasing by one until the procedure succeeds, which increases the time to . By Corollary 15, we know that suffices, which readily implies . (This would even hold for .) Finally, we terminate with the minimal value such that a query can be answered in fewer than symbol comparisons in the worst case, and hence the query time is optimal. In Appendix A, Lemma 17, we show that comparisons are sufficient, which concludes the proof.
References
- [1] Amihood Amir and Itai Boneh. Dynamic palindrome detection. CoRR, abs/1906.09732, 2019. arXiv:1906.09732.
- [2] Amihood Amir, Itai Boneh, Panagiotis Charalampopoulos, and Eitan Kondratovsky. Repetition detection in a dynamic string. In Proceedings of the 27th Annual European Symposium on Algorithms (ESA 2019), pages 5:1–5:18, 2019. doi:10.4230/LIPICS.ESA.2019.5.
- [3] Amihood Amir, Panagiotis Charalampopoulos, Solon P. Pissis, and Jakub Radoszewski. Longest common substring made fully dynamic. In Proceedings of the 27th Annual European Symposium on Algorithms (ESA 2019), pages 6:1–6:17, 2019. doi:10.4230/LIPIcs.ESA.2019.6.
- [4] V. L. Arlazarov, E. A. Dinitz, M. A. Kronrod, and I. A. Faradzhev. On economical construction of the transitive closure of an oriented graph. Dokl. Akad. Nauk SSSR, 194(3):487–488, 1970. (Original in Russian). URL: https://www.mathnet.ru/eng/dan35675.
- [5] Jérémy Barbay, Johannes Fischer, and Gonzalo Navarro. LRM-trees: Compressed indices, adaptive sorting, and compressed permutations. Theor. Comput. Sci., 459:26–41, 2012. doi:10.1016/J.TCS.2012.08.010.
- [6] Karl Bringmann, Fabrizio Grandoni, Barna Saha, and Virginia Vassilevska Williams. Truly subcubic algorithms for language edit distance and RNA folding via fast bounded-difference min-plus product. SIAM J. Comput., 48(2):481–512, 2019. doi:10.1137/17M112720X.
- [7] Panagiotis Charalampopoulos. Data structures for strings in the internal and dynamic settings. PhD thesis, King’s College London (University of London), UK, 2020. doi:10.12681/eadd/57602.
- [8] Jean-Pierre Duval, Thierry Lecroq, and Arnaud Lefebvre. Linear computation of unbordered conjugate on unordered alphabet. Theor. Comput. Sci., 522:77–84, 2014. doi:10.1016/J.TCS.2013.12.008.
- [9] Jonas Ellert. Efficient string algorithmics across alphabet realms. PhD thesis, Technical University of Dortmund, Germany, 2024. doi:10.17877/DE290R-24255.
- [10] Jonas Ellert and Johannes Fischer. Linear time runs over general ordered alphabets. In Proceedings of the 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021), pages 63:1–63:16, 2021. doi:10.4230/LIPICS.ICALP.2021.63.
- [11] Jonas Ellert, Pawel Gawrychowski, and Garance Gourdel. Optimal square detection over general alphabets. CoRR, abs/2303.07229, 2023. doi:10.48550/arXiv.2303.07229.
- [12] Jonas Ellert, Paweł Gawrychowski, and Garance Gourdel. Optimal square detection over general alphabets. In Proceedings of the 34th Annual Symposium on Discrete Algorithms (SODA 2023), pages 5220–5242, 2023. doi:10.1137/1.9781611977554.ch189.
- [13] Mitsuru Funakoshi, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Faster queries for longest substring palindrome after block edit. In Proceedings of the 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019), pages 27:1–27:13, 2019. doi:10.4230/LIPIcs.CPM.2019.27.
- [14] Zvi Galil and Kunsoo Park. Truly alphabet-independent two-dimensional pattern matching. In Proceedings of the 33rd Annual Symposium on Foundations of Computer Science (FOCS 1992), pages 247–256, 1992. doi:10.1109/SFCS.1992.267767.
- [15] Donald E. Knuth, James H. Morris Jr., and Vaughan R. Pratt. Fast pattern matching in strings. SIAM J. Comput., 6(2):323–350, 1977. doi:10.1137/0206024.
- [16] Dmitry Kosolobov. Finding the leftmost critical factorization on unordered alphabet. CoRR, abs/1509.01018, 2015. arXiv:1509.01018.
- [17] Dmitry Kosolobov. Lempel-Ziv factorization may be harder than computing all runs. In Proceedings of the 32nd International Symposium on Theoretical Aspects of Computer Science (STACS 2015), pages 582–593, 2015. doi:10.4230/LIPICS.STACS.2015.582.
- [18] Michael G. Main and Richard J. Lorentz. An O(n log n) algorithm for finding all repetitions in a string. J. Algorithms, 5(3):422–432, 1984. doi:10.1016/0196-6774(84)90021-X.
- [19] Glenn K. Manacher. A new linear-time “on-line” algorithm for finding the smallest initial palindrome of a string. J. ACM, 22(3):346–351, 1975. doi:10.1145/321892.321896.
- [20] Seth Pettie and Vijaya Ramachandran. An optimal minimum spanning tree algorithm. J. ACM, 49(1):16–34, 2002. doi:10.1145/505241.505243.
- [21] Ian Pratt-Hartmann. Walking on words. CoRR, abs/2208.08913, 2022. doi:10.48550/arXiv.2208.08913.
- [22] Ian Pratt-Hartmann. Walking on words. In Proceedings of the 35th Annual Symposium on Combinatorial Pattern Matching (CPM 2024), pages 25:1–25:17, 2024. doi:10.4230/LIPICS.CPM.2024.25.
- [23] Robert E. Tarjan. Problems in data structures and algorithms. In Graph Theory, Combinatorics and Algorithms: Interdisciplinary Applications, pages 17–39. Springer US, Boston, MA, 2005. doi:10.1007/0-387-25036-0_2.
- [24] Yuki Urabe, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Longest Lyndon substring after edit. In Proceedings of the 29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018), pages 19:1–19:10, 2018. doi:10.4230/LIPIcs.CPM.2018.19.
- [25] Hao Yuan and Mikhail J. Atallah. Data structures for range minimum queries in multidimensional arrays. In Proceedings of the 21st Annual Symposium on Discrete Algorithms (SODA 2010), pages 150–160, 2010. doi:10.1137/1.9781611973075.14.
Appendix A Eliminating Twin-Palindromes in Comparisons
Lemma 17.
Let be a string. There is an algorithm that, using symbol comparisons, computes a string that generates from left to right and contains neither twin-palindromes nor unit-squares.
Limiting the Number of Oracle Queries with Positive Answer.
As explained in Section 2, all strings encountered by the algorithm have an explicit physical representation. If we encounter strings and , then they are physically stored as arrays and such that for it holds and . Hence if and only if , i.e., we can perform each symbol comparison by asking exactly one oracle query to .
Recall that itself has physical representation with . By manipulating , we can ensure that the algorithm asks at most oracle queries with positive answer. We will ensure that at all times. Instead of directly issuing an oracle query, we test if with as follows.
-
1.
If , we return that .
-
2.
Otherwise, we ask the oracle if .
-
(a)
If , we return that .
-
(b)
If , we scan and replace all occurrences of with . We then return that . (This naively implements a union-find data structure.)
-
(a)
In Case 2b, the number of distinct values in decreases. Hence we encounter the case at most times. This is the only case in which we ask an oracle query and obtain a positive answer, hence the total number of such queries will be at most . Throughout the remainder of the section, we will use the previously described strategy for all symbol comparisons. Hence we perform at most oracle queries with positive answer, and we only have to count the number of comparisons with negative answers.
Eliminating Twin-Palindromes.
Assume that contains no unit-squares. This assumption is without loss of generality, as otherwise we can eliminate unit-squares with Lemma 6, and then show the correctness of the Lemma for the resulting shorter string. We proceed in phases numbered . Each phase consists of iterations. Before each iteration of phase , each of the following conditions is satisfied:
-
1.
We have computed a string that generates from left to right and contains neither unit-squares nor twin-palindromes of length less than .
-
2.
We have computed an array that contains the maximal radius of the palindrome centered at each position of , truncated to length , i.e., for it holds
-
3.
We have charged some entries of . However, cannot be charged if . We will use charged entries to account for symbol comparisons later.
As soon as we produce with , we know that contains no twin-palindromes. Now we describe how to implement an iteration of phase . The goal of phase is to eliminate twin-palindromes of length exactly . By Observation 5, some substring is a twin-palindrome if and only if and are palindromes, which is the case if and only if and . Hence, given , we can find a twin-palindrome of length (or determine that none exists) without performing any symbol comparisons.
Performing a Final Iteration.
If we cannot find a twin-palindrome of length , then we immediately finish the iteration and proceed to the next phase . A twin-palindrome cannot have length or for non-negative integer , and thus contains no twin-palindromes of length less than . Hence we can use , and we only have to compute . For any position , if , then we can directly assign without performing any comparisons. In this case, we also transfer the charge (if any) from to . If , then we check if (assuming and ). If the answer is yes, we assign (and do not charge the position). If the answer is no, then we assign and charge position . Note that we charge one previously uncharged position for each comparison with negative outcome. Later, we use the charges to count the overall number of comparisons.
Performing a Regular Iteration.
It remains to show how to proceed in phase if we actually find a twin-palindrome . In this case, generates from left to right, and thus generates from left to right. However, may contain twin-palindromes of length less than that were not present in . We apply Theorem 8 with parameter to and and obtain values such that generates from left to right and contains neither unit-square nor twin-palindrome of length less than .
Let and . We will replace with . Therefore, we also have to replace with the appropriate array . Note that any entry depends solely on a short local substring . This allows us to copy all except entries from to . More precisely:
-
For , it holds due to .
-
For , it holds due to .
The only missing entries are the with . We compute each such value from scratch. This requires only one comparison with negative outcome per value, or comparisons with negative outcome overall.
Finally, we have to handle the charged positions. We simply transfer the charges from to , and from to (respectively only if and ). We have accounted for all charges, except for the ones in interval . This interval is of length . We will pay for the charges directly, without transferring them to the next iteration. In this iteration, we have to pay for the following comparisons with negative outcome:
-
comparisons for applying Theorem 8 with parameter , and
-
comparisons for computing the entries of that cannot be copied from , and
-
comparisons that were performed in previous iterations, stored as charges of entries in that could not be copied to .
This sums to comparisons with negative outcome, which we can afford because we permanently reduced the length of the string by . After the algorithm terminates, there are at most charges on the final that have not been accounted for. Hence the number of comparisons with negative outcome is overall, which concludes the proof.
