Morphisms and BWT-Run Sensitivity
Abstract
We study how the application of morphisms affects the number of equal-letter runs in the Burrows–Wheeler Transform (BWT). This parameter has emerged as a key repetitiveness measure in compressed indexing. We focus on the notion of BWT-run sensitivity after application of morphisms. For binary alphabets, we characterize the class of injective morphisms that preserve the number of BWT-runs up to a bounded additive increase by showing that it coincides with the known class of primitivity-preserving morphisms, which are those that map primitive words to primitive words. We further prove that deciding whether a given binary morphism has bounded BWT-run sensitivity is possible in polynomial time with respect to the total length of the images of the two letters. Additionally, we explore new structural and combinatorial properties of synchronizing and recognizable morphisms. These results establish new connections between BWT-based compressibility, code theory, and symbolic dynamics.
Keywords and phrases:
Burrows–Wheeler transform, BWT-runs, morphism, pure code, repetitivenessFunding:
Gabriele Fici: Supported by MUR project PRIN 2022 APML – 20229BCXNW, funded by the European Union – Mission 4 “Education and Research” C2 – Investment 1.1.Copyright and License:
2012 ACM Subject Classification:
Theory of computation Design and analysis of algorithms ; Theory of computation Formal languages and automata theory ; Mathematics of computing Combinatorics on wordsEditors:
Paweł Gawrychowski, Filip Mazowiecki, and Michał SkrzypczakSeries and Publisher:
Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
Morphisms are a powerful combinatorial mechanism for generating a collection of repetitive texts, and have been largely used in the field of combinatorics on words and formal languages [22, 31]. Formally, a morphism maps each character of an alphabet to a word over the same or another alphabet, by preserving the operation of concatenation. That is, if is a morphism and and are words, then . Iterating morphisms can produce long and often highly repetitive sequences, which makes them a natural model for studying repetitiveness in words. Morphisms find applications in a wide range of contexts. Injective morphisms are widely used in information theory, data compression, and cryptography, as they define uniquely decodable codes [4]. More recently, morphisms have been employed in combination with copy-paste mechanisms to define novel compression schemes, known as NU-systems [28], further highlighting their versatility in modeling and processing repetitive data.
The Burrows–Wheeler Transform (BWT) is a reversible transformation introduced in 1994 in the field of data compression [6] and now underpins some of the most used tools in bioinformatics, such as bwa [21, 34] and bowtie [20, 19]. It permutes the characters of a text in a way that makes it more compressible, by clustering characters that precede similar contexts in the text. This property often results in long runs of identical characters, particularly in repetitive texts. The number of such equal-letter runs, known as BWT-runs, has recently emerged as a measure of repetitiveness [27]. Several measures have been proposed to quantify repetitiveness in strings [26], such as the number of phrases in the Lempel–Ziv parsing, the size of the smallest context-free grammar generating the text, the size of the smallest string attractor [17, 8]. Among these, the measure has recently attracted considerable attention due to its close connection with compressed indexing structures, such as the -index [13], which use space proportional to and support efficient pattern matching and retrieval in highly repetitive text collections, including genomic datasets and versioned document archives. Akagi et al. [1] explored the question of how much one character edit affects compression-based repetitiveness measures. In [14], the effect of single edit operations on the measure has also been analyzed.
In this paper, we study how the application of a binary morphism affects the measure , i.e., the number of BWT-runs. We focus on two notions of BWT-run sensitivity, which capture how much the number of BWT-runs can change when a morphism is applied to a word. The additive sensitivity function gives, for every , the maximum increase in the number of BWT-runs that can occur when applying the morphism to any word of length , while the multiplicative sensitivity function gives, for every , the maximum ratio between the number of BWT-runs after and before the morphism is applied, over all words of length . These notions allow us to quantify the impact of a morphism on the compressibility of the resulting text. An initial approach to the study of how morphisms affect the number of BWT-runs was given in [12], where we showed that Sturmian morphisms are the only binary injective morphisms that preserve the number of BWT-runs. Here, we tackle the problem of characterizing those binary morphisms that preserve the BWT-based compressibility of a text, in the sense that they have an additive sensitivity function bounded by a constant. We first prove that all non-injective binary morphisms have a bounded additive sensitivity function. In the injective case, which is the most interesting one, we prove that this class coincides with the known class of primitivity-preserving morphisms, which are those that map primitive words to primitive words. As a direct consequence, for these morphisms the multiplicative sensitivity function is also bounded. Primitivity-preserving morphisms are a well-studied class in algebraic theory of codes, and they are crucial in applications involving symbolic sequences, code synchronization, and the structural analysis of words [32, 30, 11, 25, 22, 4, 15].
In addition to establishing a novel connection between BWT-based compressed indexing, combinatorics on words, and code theory, a key contribution of our paper consists in identifying new combinatorial and structural properties of primitivity-preserving, recognizable, and synchronizing morphisms. These properties are central to our main results but also hold independent interest in information theory and symbolic dynamics, where such morphisms play a fundamental role in coding, synchronization, and symbolic representations of dynamical systems [3, 4]. In fact, recognizability ensures that the morphic image of a word can be uniquely decomposed, up to rotations, into a sequence of morphic images of the letters of the alphabet. Synchronizing morphisms guarantee that a window of bounded length is sufficient to detect boundaries between codewords, a property that is crucial for decoding and synchronization in data streams.
We further show that all binary morphisms have bounded multiplicative sensitivity, but this result does not extend to alphabets with more than two symbols.
Furthermore, we prove that it is decidable in polynomial time whether the additive sensitivity function of a binary morphism is bounded by a constant, which makes our results practically applicable to the design of compression and indexing techniques that work directly on morphic encodings of highly repetitive text collections. Such a result builds upon fundamental results in the field of combinatorics on words, including properties of codes and solutions to word equations.
The rest of the paper is organized as follows. In Section 2, we present the preliminaries on words, morphisms, and the BWT. Section 3 introduces new combinatorial and structural properties of primitivity-preserving, recognizable, and synchronizing morphisms. Section 4 formalizes the sensitivity of with respect to the application of morphisms and motivates our measures. Section 5 contains our main theorem characterizing the morphisms with a bounded additive sensitivity function. Section 6 discusses the multiplicative case, and Section 7 concludes with final remarks and open problems.
2 Preliminaries
Basics
Let be a finite sorted set of letters , which we call an alphabet. A finite word is any finite sequence of letters where , for , and is the length of the word. The empty word, denoted by , is the unique word of length . The set of all finite words (resp. all non-empty words) over the alphabet is denoted by (resp. ). For a letter , denotes the number of occurrences of in . The vector is called the Parikh vector of .
If and are words, the concatenation of and is the word . We let denote the concatenation of the words in that order, and the concatenation of the word with itself times.
For any , we use the notation to denote the word , which we call a factor of . If , then we assume . We let denote the set of all factors of . For any , we write . A factor of is proper if it is different from itself. The factor is called a prefix when , and a suffix when . The longest common prefix between two words and is the longest word that is a prefix of both words. The length of this word is denoted by . The longest common suffix and the associated function are defined symmetrically.
The run-length encoding of a word , denoted , is the sequence of pairs ) with and , such that and for every . The length is the number of equal-letter runs in .
A rotation, or conjugate, of the word is a word of the form , for some , obtained by shifting letters cyclically. We let denote the multiset of all the rotations of . A word in is called a circular factor of .
A word is primitive if for every word , implies ; otherwise, is called non-primitive (or a power). A word of length is primitive if and only if it has exactly distinct rotations, i.e., if has all-distinct elements. We let denote the set of all primitive words in , and the set of all non-primitive words in . We say two non-empty words commute if . This is equivalent to saying that both and are not primitive.
Codes and morphisms
A set is a code if for all and with , the equation implies that and , for all . Or equivalently, every word has a unique factorization in words in . Given a word , a word is an -factor of if there exists a rotation of (which can be itself) that can be factored as such that .
Lemma 1.
A set , , is a code if and only if and do not commute, i.e., .
If and do not commute, then they are not powers of the same word, but in principle this does not exclude the case that either , , or both are non-primitive. For example, is a code.
Let and be two alphabets. A morphism is a map from to such that for all words . Therefore, a morphism can be defined by specifying its action on the letters of , and can therefore be denoted as . The size of the morphism is defined as . When , for all and , we have and .
Remark 2.
Let be a morphism. If and are conjugates, then so are and . Moreover, since every conjugate of a power is a power, if is a power, so is for every conjugate of .
A morphism is cyclic if there exists such that for each . Otherwise, it is called acyclic.
As shown in the following proposition, there is a very strong relation between codes and injective morphisms.
Proposition 3 ([4]).
Let be a code. Then, any morphism which induces a bijection of some alphabet onto is injective. Conversely, let be an injective morphism. Then, is a code.
Remark 4.
Examples of injective morphisms are the Fibonacci morphism , the Thue–Morse morphism , and the period-doubling morphism .
From the relationship between codes and morphisms, many properties of codes are reflected in the corresponding properties of injective morphisms. Combinatorial properties of injective morphisms are explored in Section 3.
The Fibonacci morphism belongs to a wider class of morphisms called Sturmian morphisms, strictly related to the well-known Sturmian words [5]. Sturmian morphisms can be defined as those that can be obtained by composition from: the Fibonacci morphism , the morphism , and the morphism .
Let us suppose that both and are endowed with a total order relation that yields a lexicographic order, denoted by and , respectively. A morphism is abelian order-preserving if for every pair of distinct words having the same Parikh vector, it holds that . A morphism is abelian order-reversing if for every pair of distinct words and having the same Parikh vector, it holds that . We simply write whenever and are clear from the context.
When , the following result holds.
Lemma 5 ([12]).
Let be an acyclic morphism. Then is either abelian order-preserving or abelian order-reversing.
For our purposes, the fact that binary acyclic morphisms are either abelian order-preserving or abelian order-reversing is a crucial property, since it implies that they preserve or reverse the order on the set of rotations of any given binary word.
Burrows–Wheeler Transform
The Burrows–Wheeler transform (BWT) of a word , denoted by , is a permutation of the letters of obtained by sorting all the rotations of in ascending lexicographic order and then concatenating the last letter of each rotation. The original word can be recovered if one stores the position where it appears in the list of sorted rotations. Figure 1 shows the sorted rotations of the word and .
We let denote the number of equal-letter runs of , i.e.,
Such a value can be considered as a measure of the repetitiveness of . In fact, if a word is highly repetitive, the number of equal-letter runs of its BWT tends to be small. From Figure 1, one can see that .
One can verify that for each word , and, consequently, and for every morphism .
Let be a non-primitive word, i.e., , for some and . It is well known that if , then [24]. This implies that .
Some results proved in [24, 29, 10] establish a strong connection between the BWT and Sturmian morphisms, as synthesized in the following theorem.
Theorem 6.
Let be a word over that is not a power of a single letter. Then the following are equivalent:
-
1.
for a Sturmian morphism and for some .
-
2.
.
3 New combinatorial properties of injective morphisms
This section focuses on some combinatorial properties and characterizations of some classes of morphisms which are well-known in the context of coding theory and symbolic dynamics. The results provided in this section may be of independent interest and will later be related to BWT-run sensitivity in the next sections.
3.1 Primitivity-Preserving Morphisms
A morphism is called primitivity-preserving if for every , it holds that , that is, primitive words are mapped to primitive words. Primitivity-preserving morphisms are injective, and the associated codes are known in the literature as pure codes [25]. Such codes have been introduced in [30] to study the relationships between locally testable languages and synchronizing properties of codes.
Given a morphism , we call a primitive word a –power if , for some primitive word and an integer . Intuitively, it is a word that witnesses the non-primitivity-preserving property of a morphism. By we refer to the set of all –power words. From the definition, hence, if and only if the morphism is primitivity-preserving.
Example 7.
Let be the period-doubling morphism. The word is a –power, since . Hence, , and is not primitivity-preserving.
Example 8.
Let . The word is a –power, since . Hence, , and is not primitivity-preserving.
In this section, we prove a new characterization of the decompositions of binary primitivity-preserving morphisms. To do so, we first recall the following lemma, characterizing the combinatorial structure of binary primitivity-preserving morphisms.
Lemma 9 ([16, Theorem 3.1]).
Let be an injective morphism, with two distinct primitive words. Then is a primitivity-preserving morphism if and only if all words in are primitive.
The following lemma describes what happens when the property of Lemma 9 is not verified. In particular, it considers the combinatorial structure of the non-primitive words generated by the morphism when applied to some primitive word distinct from a single letter. Recall that if , for some primitive word and , then we can derive that or (see [23]).
Lemma 10 ([2, 33]).
Let be an injective morphism, and let . Then, there is at most one primitive word and one integer such that , i.e. . Moreover, let . Then .
The next lemma provides a characterization of the structure of the set for an injective morphism .
Lemma 11.
Let be an injective morphism, and let . We can distinguish the two cases:
-
1.
. Then only one of the following occurs:
-
(a)
;
-
(b)
, for some ;
-
(c)
.
-
(a)
-
2.
. Then there exists a unique such that , and only one of the following occurs:
-
(a)
;
-
(b)
, for some .
-
(a)
Note that, among the cases described in Lemma 11, the Case 1a is the only one in which every primitivity-preserving morphism falls. In this case, both and are primitive words. If the morphism is not primitivity-preserving and , then we deduce from Lemma 11 that either only one between and is a non-primitive word (Case 1b), or both are non-primitive words (Case 1c).
A classification of the non-primitivity-preserving morphisms that fall in Cases 2a and 2b, with , and their respective -power words, can be derived from a result given in [15, Theorem 8].
The set of primitivity-preserving morphisms is closed under composition, as stated in the following lemma.
Lemma 12.
Let , be two morphisms. If and are both primitivity-preserving, then is primitivity-preserving too.
However, it is possible to obtain primitivity-preserving morphisms even as a composition of morphisms that do not necessarily satisfy this property. The following proposition gives a complete characterization.
Proposition 13.
Let be an injective morphism. The morphism is primitivity-preserving if and only if, for all such that , it holds that is a primitivity-preserving morphism and is an injective morphism such that .
Example 14.
Let . One can verify that , where and are the period-doubling morphism and the Thue–Morse morphism, respectively. We have that is not primitivity-preserving and . Since is a primitivity-preserving morphism and , from Proposition 13 it follows that is primitivity-preserving too.
Example 15.
Let , and consider the morphism , where is the Thue–Morse morphism, which is primitivity-preserving. In this case, is not primitivity-preserving (since ), nor is .
3.2 Recognizable Morphisms
In this subsection, we focus on some structural and combinatorial properties of morphisms that generate words admitting a unique factorization in circular factors, similarly to the notion of circular code [4].
Let . An injective morphism is recognizable on if for every non-empty word and every word , there exist, and are unique, , , , and , such that and . If , we simply say that is recognizable.
In other words, every image under a recognizable morphism has a unique circular factorization in words of . Equivalently, a recognizable morphism on can be regarded as an injective map on the necklaces over , i.e., for all , it holds that if and only if .
Example 16.
Let us consider the injective morphism . Such a morphism is recognizable since every word in has a unique circular factorization into the words and , as shown in Figure 2(a).
The recognizability of a morphism is well studied in the context of bi-infinite words and symbolic dynamics [3]. Here, it is adapted to necklaces, or circular words, which can be seen as periodic bi-infinite words. Note that most of the results known in the literature on bi-infinite words focus on the aperiodic case. Therefore, the results provided in this section can also be interpreted as contributions toward the less-explored setting of periodic bi-infinite words.
The following lemma establishes the close relationship between recognizable morphisms on and a property related to the so-called very pure codes, which are properly included in the class of pure codes [30].
Lemma 17 ([4, Proposition 7.1.1]).
An injective morphism is recognizable if and only if for every , if then .
In the case of binary injective morphisms, the next lemma provides a useful characterization of recognizable morphisms.
Lemma 18.
A primitivity-preserving morphism is recognizable if and only if and are not conjugates.
Example 19.
Consider the injective morphism . Since and are conjugates, by Lemma 18, is not recognizable on , as shown in Figure 2(b), where two distinct circular factorizations of the word are indicated. Indeed, and are equal up to rotations. Figure 3(a) shows two distinct circular factorizations of into and . Hence, is also not recognizable. One can note that , where , confirming the characterization established in the following theorem.
The following result shows an important structural property of the primitivity-preserving morphisms that are not recognizable. In particular, we prove that they can always be obtained by composing another morphism with the Thue–Morse morphism.
Theorem 20.
Let be a primitivity-preserving morphism. Then exactly one of the following cases occurs:
-
1.
is recognizable;
-
2.
for some injective morphism , where is the Thue–Morse morphism.
Proof.
We show that, under the hypothesis of the theorem, it holds that if and only if for some injective morphism , and by Lemma 18, the thesis directly follows. For the first direction, one can define , and therefore . For the other direction, if , then , and the thesis follows.
3.3 Synchronizing Morphisms
An important notion in the context of injective morphisms is that of synchronization pair, which intuitively marks a position within a factor of a morphic image where the boundary between two codewords can be uniquely identified. Synchronization provides a way to “align” a segment of the morphic image of a circular word with the images of the letters of the alphabet.
Let , , and . We say that is a synchronization pair of on if and, for all and , implies and , for some such that .
Observe that a morphism is recognizable on if and only if every word admits at least one synchronization pair, since from it one can uniquely recover the preimage , up to rotations.
The following notion of synchronization with delay gives a quantitative measure of the width of a window sliding along the morphic image of a circular word that guarantees the detection of a synchronization point. This is an adaptation to our setting of a definition commonly used in the context of HD0L-systems [7, 18].
We say that a morphism is synchronizing with delay for if every circular factor of of length at least admits a synchronization pair. Given , we say that is synchronizing with delay for if it is synchronizing with a finite delay for every and
It has been proved [30, Theorem 5.1] that a morphism is recognizable if and only if it is synchronizing with finite delay for . The following example shows that the recognizability of a morphism on a proper subset of does not necessarily imply being synchronizing with finite delay for that subset.
Example 21.
Let be the Thue–Morse morphism. Figure 3(a) shows that the Thue–Morse morphism is not recognizable. In fact, and are equal up to rotations and produce distinct circular factorizations. However, is recognizable on , where , as shown in Figure 3(b). Observe that even though is not recognizable, is recognizable on , since and are synchronization pairs that occur in every word of . In the figure, the unique circular factorization of is depicted; the two black squares identify the synchronization pairs and . Moreover, is synchronizing with delay on ; more in general it is synchronizing with delay for and for . However, is not synchronizing with finite delay for , since the supremum of all minimum finite delays for all the words in is unbounded.
Let . Unlike the previous cases, the synchronization pairs and occur in every word of of length at least , hence is synchronizing with delay for . In fact, as shown in Figure 3(c), every factor of having length at least contains a black square that identifies a synchronization pair.
We now give a new combinatorial characterization of synchronizing morphisms with finite delay on , for any . This characterization is based on the powers of single letters occurring in in the case of non-recognizable primitivity-preserving morphisms, while it is based on the -power words in the case of non-primitivity-preserving morphisms.
Lemma 22.
Let be a non-recognizable primitivity-preserving morphism and let and , where . Then is synchronizing with finite delay on if and only if at least one of the sets or is finite.
By using analogous techniques, one can prove that for any non-primitivity-preserving morphism, there exists a such that each -length factor in has a synchronization pair, unless for some .
Lemma 23.
Let be a non-primitivity-preserving morphism. Then, there exists an integer such that every factor has a synchronization pair.
From the previous lemma, the following result can be derived:
Theorem 24.
Let be a non-primitivity-preserving morphism and let . Then is synchronizing with finite delay on if and only if the set is finite.
4 Sensitivity of the Measure to the Application of Morphisms
Notice that may be negative for some word . For example, let be the morphism over a 3-letter alphabet defined as and let . One has that and . However, when is defined over a binary alphabet, one can prove that is always non-negative [12, Theorem 14].
Definition 25.
The BWT additive sensitivity function and BWT multiplicative sensitivity function for a morphism are, respectively, the functions
Note that the additive sensitivity function is always a non-negative function, as for every , for any letter .
Example 26.
Let us consider the period-doubling morphism . Let us compute the value of the BWT additive sensitivity function for when . From Table 1, it is possible to conclude that .
Remark 27.
Observe that for any pair of morphisms and , and any word , it holds that .
The following lemma shows that cyclic morphisms produce words with a fixed number of BWT-runs, whatever the words on which they are applied.
Lemma 28.
Let be a cyclic morphism. Then, there exist two constants , which depend on , such that and , for all .
Proof.
Recall that a binary morphism is cyclic if and only if there exist two integers and a non-empty word such that and . Hence, for each word it holds that . Let us fix the claimed constants and . For all , let us consider the word . By Theorem 6, it follows that . The proof follows by observing that since is constant, the values of and are maximal when assumes the smallest value, that in the case of binary words is , i.e., and .
Example 29.
Let us consider the cyclic morphism . It is possible to verify that for every , one has , for some integer depending on . This means that for every . For every length , we can consider the word . We have , which is the lowest value that can take on a binary word. Then, and , for .
The following characterization of Sturmian morphisms in terms of the BWT additive sensitivity function was proved in [12].
Proposition 30 ([12, Theorem 21]).
Let be a binary injective morphism. Then for every if and only if is a Sturmian morphism.
In the same paper, we showed that the Thue–Morse morphism increases by 2 the BWT-runs of every binary word, while in the case of the period-doubling morphism , for each we can find an -length word for which . We summarize these results in the following proposition.
Proposition 31 ([12, Lemma 24 and Proposition 31]).
Let and be the Thue–Morse and the period-doubling morphisms, respectively. The following properties hold:
-
1.
, for all ;
-
2.
.
Note that is not the only morphism for which the additive sensitivity function is . In [12] it is proved that this property also holds for the Thue–Morse-like morphisms , for some , and any composition of these morphisms with any Sturmian morphism.
Example 32.
5 Characterization of Binary BWT-Run Preserving Morphisms
In this section, we are interested in morphisms having a bounded BWT additive sensitivity function.
Definition 33.
Let be an integer. A morphism is called -BWT-run preserving if for all , . We simply say BWT-run preserving if such a exists.
By Lemma 28, every cyclic morphism is BWT-run preserving since it has a bounded BWT additive sensitivity function. As a main result of this section, we characterize the acyclic (and therefore, by Remark 4, injective) binary morphisms which are BWT-run preserving. In particular, we prove that they coincide with the primitivity-preserving morphisms.
We first give a lemma, in which we prove that the finite-delay synchronization of a morphism on the images of a language results in a bounded increase in the number of BWT-runs.
Lemma 34.
Let , where , and let be synchronizing with delay on . Then, there exists such that for all .
The following lemma proves one direction of the main result.
Lemma 35.
Let be an injective morphism. If is primitivity-preserving, then is BWT-run preserving.
Proof.
If is primitivity-preserving, then by Theorem 20, either (i) is recognizable or (ii) there exist an integer and a morphism such that and for all . If we fall in case (i), the thesis follows from the equivalence between recognizable morphisms and synchronizing morphisms with bounded delay [30, Theorem 5.1] and Lemma 34.
If instead we fall in case (ii), then by Proposition 31, it follows that increases the BWT-runs by (at most) 2. Hence, the thesis is equivalent to showing that there exists such that , for every : by using Remark 27, this would prove that the BWT additive sensitive function is bounded by a constant, that is for all . By Proposition 13, we can distinguish between two subcases: (ii.a) is recognizable and (ii.b) is not primitivity-preserving and . If (ii.a), the proof follows analogously to (i). If (ii.b), then by Lemma 10 all the -power words in the set consist of either a single letter or rotations of a unique word in the set . Observe that and , for all , while for and . Observe that only the latter does not have a finite size; however, for , by hypothesis, we have . Hence, contains only a finite number of powers of elements from , and the proof follows by Theorem 24.
Now we prove the opposite direction. We consider a class of morphisms that we use to decompose a generic morphism. For any , let denote the injective morphism . Observe that if , then is not primitivity-preserving.
In the following proposition, we prove that such morphisms have an unbounded additive sensitivity function.
Proposition 36.
Let , for some . Then, .
In the following proposition, we consider a larger class of morphisms with an unbounded additive sensitivity function.
Proposition 37.
Given an injective morphism , let and such that . Then,
where . Moreover, if , then .
The following lemma shows that if a morphism has bounded additive sensitivity, then it is primitivity-preserving.
Lemma 38.
Let be a non-primitivity-preserving injective morphism. For each , there exists a word such that .
Proof.
If or are not primitive, then the thesis follows by Proposition 37, so let us assume that .
Recall that a Lyndon word is a primitive word that is lexicographically smaller than all its proper conjugates. Since is injective, not primitivity-preserving, and both images are primitive words, by Lemma 11 there exists some Lyndon word such that , for some , and is primitive. Let and . Observe that: i) ; and ii) . Hence, . Then, by Proposition 37, there exists a word such that . Since the concatenation of two Lyndon words and , with , is a Lyndon word (see [9]), then, for every , and are Lyndon words, hence, by Lemma 9, the morphism is primitivity-preserving, and by Lemma 35 is BWT-run preserving. Finally, by using Remark 27, one has , and the thesis follows.
Example 39.
Let . Let and , where is Lyndon and is primitive. It holds that We define the morphisms and , as described in the proof of Lemma 38. Indeed, , as and . The morphism can be written as , and by Proposition 37 there exists such that . On the other hand, is primitivity-preserving, so it must be the case that .
Theorem 40.
Let be an injective morphism. Then is BWT-run preserving if and only if it is primitivity-preserving.
Finally, we can show that there exists a finite test case, as stated in the following theorem.
Theorem 41.
Let be an injective morphism. It is decidable in polynomial time in the size of whether is BWT-run preserving.
Proof.
By Lemma 9, to decide whether a given morphism , for some , is primitivity-preserving, we have to check the primitiveness of all the possible non-trivial solutions of the equation . Let and . In [15], it has been proved that there are at most words to check, each of these having length . Since the primitiveness can be checked in linear time in the size of the words, the total time complexity is .
6 Morphisms with bounded multiplicative sensitivity
Even though in the case of binary morphisms the additive sensitivity is not always bounded by a constant, it is natural to wonder whether the multiplicative sensitivity is. As shown in the following example, this is not the case when the alphabet size is greater than 2.
Example 42.
Let be the -th Fibonacci word with a letter such that appended. Define as , , and . Then . It is known that [14]. Hence, .
Observe that if , then the morphism is not primitivity-preserving. We first show that is bounded.
Lemma 43.
Let be a word that contains at least two ’s and one . Let be the length of the longest circular run of ’s in (i.e., the longest run of ’s in any string in ). It holds that
where . Moreover, it holds and .
Proof.
Let be the length of the longest circular run of ’s in . Since is order-preserving, and because the last characters of and are and , respectively, the sequence obtained by taking the last character of each image of the lexicographically sorted rotations of is exactly . Moreover, the string formed by concatenating the last characters of the range of rotations starting with in the BWT matrix of is exactly . Similarly, the string formed by concatenating the last characters of the (disjoint) ranges of rotations starting with for is exactly . Strictly in between the ranges of rotations starting with and for some , there is a range of rotations starting with for each , all ending with the character . In the worst case, each of these blocks of rotations can only increase the number of runs by . Hence, the additive increase is at most 2 times the number of circular factors of the form in . This proves the first claim of the proposition.
For the second claim, observe that a change of character occurs in correspondence with each block of rotations starting with , for each such that . Hence, the second claim follows because .
We now give a sketch of the main result of this section.
Theorem 44.
For every morphism , there exists an integer such that .
Proof (sketch).
We assume is injective, as otherwise the result follows from Lemma 28. By Proposition 37, can be decomposed as with and . By Lemma 43, both and , hence is bounded if and only if is bounded. If is primitivity-preserving, then by Lemma 35 we are done. Hence, we are left to show the case when is not primitivity-preserving and both images are primitive. We give a sketch for this case.
Let be a non-primitivity-preserving injective morphism with . By Lemma 10, there exists a primitive word with , such that and with and .
As a consequence of Lemma 23, there exists an integer , which depends only on , such that every rotation with a -length prefix contains a synchronization pair. Hence, we can partition these rotations according to their length- prefix, and the characters preceding these rotations can be determined.
The remaining rotations starting with a power of some rotation of are handled in a similar (though more complicated) fashion with respect to how rotations starting with a power of were handled in Lemma 43. This yields an upper-bound for depending on the value instead of .
7 Conclusions and future work
In this paper, we have provided a complete characterization of binary injective morphisms that preserve the number of BWT-runs up to a bounded additive increase. We have shown that this class coincides with the class of binary primitivity-preserving morphisms.
Primitivity-preserving morphisms could be considered a general effective tool for studying and evaluating repetitiveness measures, since such measures remain invariant, up to small constants, when applied to powers of a word. This suggests that such morphisms could be seen as a unifying framework for the analysis and comparison of different repetitiveness measures.
It would be interesting to explore the design of compression and indexing techniques based on BWT-runs that operate directly on morphic encodings of highly repetitive text collections. This could have applications, for example, in the domain of privacy-preserving algorithms. Although our current approach allows for polynomial-time decision procedures for testing whether a given binary morphism is BWT-run preserving or, equivalently, primitivity-preserving, more efficient algorithms could yield significant improvements in terms of scalability and practical performance.
Furthermore, BWT-run sensitivity could support a new classification of morphisms, providing new insights for their structural behavior and the impact on repetitiveness measures.
Finally, we plan to investigate how to extend our results to morphisms over larger alphabets.
References
- [1] Tooru Akagi, Mitsuru Funakoshi, and Shunsuke Inenaga. Sensitivity of string compressors and repetitiveness measures. Information and Computation, 291:104999, 2023. doi:10.1016/J.IC.2022.104999.
- [2] Evelyne Barbin-Le Rest and Michel Le Rest. Sur la combinatoire des codes à deux mots. Theoretical Computer Science, 41:61–80, 1985. doi:10.1016/0304-3975(85)90060-X.
- [3] Marie-Pierre Béal, Dominique Perrin, and Antonio Restivo. Unambiguously coded shifts. European Journal of Combinatorics, 119:103812, 2024. doi:10.1016/J.EJC.2023.103812.
- [4] Jean Berstel, Dominique Perrin, and Christophe Reutenauer. Codes and Automata, volume 129 of Encyclopedia of mathematics and its applications. Cambridge University Press, 2010.
- [5] Jean Berstel and Patrice Séébold. A Characterization of Sturmian Morphisms. In MFCS, volume 711 of Lecture Notes in Computer Science, pages 281–290. Springer, 1993. doi:10.1007/3-540-57182-5_20.
- [6] Michael Burrows and David Wheeler. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994.
- [7] Julien Cassaigne. An algorithm to test if a given circular HD0L-language avoids a pattern. In IFIP Congress (1), volume A-51 of IFIP Transactions, pages 459–464. North-Holland, 1994.
- [8] Julien Cassaigne, France Gheeraert, Antonio Restivo, Giuseppe Romana, Marinella Sciortino, and Manon Stipulanti. New string attractor-based complexities for infinite words. Journal of Combinatorial Theory, Series A, 208:105936, 2024. doi:10.1016/J.JCTA.2024.105936.
- [9] Kuo Tsai Chen, Ralph H. Fox, and Roger C. Lyndon. Free differential calculus, IV. The quotient groups of the lower central series. Annals of Mathematics, 68(1):81–95, 1958.
- [10] Wai-Fong Chuan. Sturmian morphisms and alpha-words. Theoretical Computer Science, 225(1-2):129–148, 1999. doi:10.1016/S0304-3975(97)00239-9.
- [11] Pál Dömösi, Sándor Horváth, Masami Ito, László Kászonyi, and Masashi Katsura. Formal languages consisting of primitive words. In Zoltán Ésik, editor, Fundamentals of Computation Theory, pages 194–203, Berlin, Heidelberg, 1993. Springer Berlin Heidelberg. doi:10.1007/3-540-57163-9_15.
- [12] Gabriele Fici, Giuseppe Romana, Marinella Sciortino, and Cristian Urbina. On the Impact of Morphisms on BWT-Runs. In CPM, volume 259 of LIPIcs, pages 10:1–10:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPICS.CPM.2023.10.
- [13] Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space. Journal of the ACM, 67(1):2:1–2:54, 2020. doi:10.1145/3375890.
- [14] Sara Giuliani, Shunsuke Inenaga, Zsuzsanna Lipták, Giuseppe Romana, Marinella Sciortino, and Cristian Urbina. Bit catastrophes for the Burrows-Wheeler transform. Theory of Computing Systems, 69(2):19, 2025. doi:10.1007/S00224-024-10212-9.
- [15] Štěpán Holub, Martin Raska, and Stepán Starosta. Binary codes that do not preserve primitivity. Journal of Automated Reasoning, 67(3):25, 2023. doi:10.1007/S10817-023-09674-2.
- [16] Cheng-Chi Huang. A note on pure codes. Acta Informatica, 47(5-6):347–357, 2010. doi:10.1007/S00236-010-0122-7.
- [17] Dominik Kempa and Nicola Prezza. At the roots of dictionary compression: string attractors. In STOC, pages 827–840. ACM, 2018. doi:10.1145/3188745.3188814.
- [18] Karel Klouda and Stepán Starosta. Characterization of circular D0L-systems. Theor. Comput. Sci., 790:131–137, 2019. doi:10.1016/J.TCS.2019.04.021.
- [19] Ben Langmead and Steven L. Salzberg. Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4):357–359, 2012.
- [20] Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3):R25, 2009.
- [21] Heng Li and Richard Durbin. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics, 26(5):589–595, 2010. doi:10.1093/BIOINFORMATICS/BTP698.
- [22] M. Lothaire. Combinatorics on Words. Cambridge University Press, 1997.
- [23] Roger C. Lyndon and Marcel-Paul Schützenberger. The equation in a free group. Michigan Mathematical Journal, 9(4):289–298, 1962.
- [24] Sabrina Mantaci, Antonio Restivo, and Marinella Sciortino. Burrows–Wheeler transform and Sturmian words. Information Processing Letters, 86(5):241–246, 2003. doi:10.1016/S0020-0190(02)00512-4.
- [25] Victor Mitrana. Primitive morphisms. Information Processing Letters, 64(6):277–281, 1997. doi:10.1016/S0020-0190(97)00178-6.
- [26] Gonzalo Navarro. Indexing highly repetitive string collections, part I: Repetitiveness measures. ACM Computing Surveys, 54(2):article 29, 2021.
- [27] Gonzalo Navarro. The compression power of the BWT: technical perspective. Communications of the ACM, 65(6):90, 2022. doi:10.1145/3531443.
- [28] Gonzalo Navarro and Cristian Urbina. Repetitiveness measures based on string morphisms. Theoretical Computer Science, 1043:115259, 2025. doi:10.1016/J.TCS.2025.115259.
- [29] Geneviève Paquin. On a generalization of Christoffel words: epichristoffel words. Theoretical Computer Science, 410(38-40):3782–3791, 2009. doi:10.1016/J.TCS.2009.05.014.
- [30] Antonio Restivo. On a Question of McNaughton and Papert. Information and Control, 25(1):93–101, 1974. doi:10.1016/S0019-9958(74)90821-3.
- [31] Michel Rigo. Formal Languages, Automata and Numeration Systems 1: Introduction to Combinatorics on Words. Wiley, 2014.
- [32] Huei-Jan Shyr and Gabriel Thierrin. Codes, languages and MOL schemes. RAIRO Theoretical Informatics and Applications, 11(4):293–301, 1977. doi:10.1051/ITA/1977110402931.
- [33] Huei-Jan Shyr and Shyr-Shen Yu. Non-primitive words in the language . Soochow Journal of Mathematics, 20:535–546, 1994.
- [34] Md. Vasimuddin, Sanchit Misra, Heng Li, and Srinivas Aluru. Efficient architecture-aware acceleration of BWA-MEM for multicore systems. In 2019 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2019, Rio de Janeiro, Brazil, May 20-24, 2019, pages 314–324, Los Alamitos, CA, 2019. IEEE Computer Society. doi:10.1109/IPDPS.2019.00041.
