Abstract 1 Introduction 2 Preliminaries 3 New results for 𝒓 vs 𝒓𝑩 4 Clustering effects of 𝗕𝗪𝗧 and 𝗕𝗕𝗪𝗧 5 Discussion References

On the Compressiveness of the Burrows-Wheeler Transform

Hideo Bannai ORCID M&D Data Science Center, Institute of Integrated Research, Institute of Science Tokyo, Japan Tomohiro I ORCID Department of Artificial Intelligence, Kyushu Institute of Technology, Fukuoka, Japan Yuto Nakashima ORCID Department of Informatics, Kyushu University, Fukuoka, Japan
Abstract

The Burrows-Wheeler transform (BWT) is a reversible transform that converts a string w into another string 𝖡𝖶𝖳(w). The size of the run-length encoded BWT (RLBWT) can be interpreted as a measure of repetitiveness in the class of representations called dictionary compression which are essentially representations based on copy and paste operations. In this paper, we shed new light on the compressiveness of BWT and the bijective BWT (BBWT). We first extend previous results on the relations of their run-length compressed sizes r and rB. We also show that the so-called “clustering effect” of BWT and BBWT can be captured by measures other than empirical entropy or run-length encoding. In particular, we show that BWT and BBWT do not increase the repetitiveness of the string with respect to various measures based on dictionary compression by more than a polylogarithmic factor. Furthermore, we show that there exists an infinite family of strings that are maximally incompressible by any dictionary compression measure, but become very compressible after applying BBWT. An interesting implication of this result is that it is possible to transcend dictionary compression in some cases by simply applying BBWT before applying dictionary compression.

Keywords and phrases:
Data Compression, Bijective Burrows-Wheeler Transform, Fibonacci words
Funding:
Hideo Bannai: JSPS KAKENHI Grant Number JP24K02899.
Tomohiro I: JSPS KAKENHI Grant Number JP22K11907 and JP24K02899, JST AIP Acceleration Research JPMJCR24U4.
Yuto Nakashima: JSPS KAKENHI Grant Number JP21K17705 and JP23H04386.
Copyright and License:
[Uncaptioned image] © Hideo Bannai, Tomohiro I, and Yuto Nakashima; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation Data compression
; Mathematics of computing Combinatorics on words
Editors:
Paola Bonizzoni and Veli Mäkinen

1 Introduction

The Burrows-Wheeler Transform (BWT) [5] is a reversible mapping from a string to another string that enables compression and efficient pattern search, and is the theoretical cornerstone for essential tools in the field of bioinformatics [17, 18]. The compressibility of BWT has been studied in various contexts, but more recently, rather than statistical measures such as empirical entropy which are not helpful in highly repetitive datasets, the size r of the run-length encoded BWT (RLBWT) and its relation to the many other repetitiveness measures related to dictionary compression has become an important topic of study (See [21] for a comprehensive survey).

Dictionary compression is a family of compressed representations which are essentially based on copy and paste operations. Kempa and Prezza [15] proposed the notion of string attractors to view dictionary compression in a uniform way, and showed that the size γ of the smallest string attractor lower bounds all other measures of dictionary compression, since they implicitly give a string attractor of the same size as their representations: the size b of the smallest bidirectional macro scheme (BMS) [23], the size z of the LZ77 parsing [25], the size grl of the smallest grammar with run-length rules, the size g of the smallest grammar, r, and more. Another measure δ based on substring complexity is known to lowerbound γ [16].

Concerning r in relation to the other measures, Navarro et al. [22] showed that given an RLBWT of size r, we can construct a BMS of size 2r, thus showing b=O(r). Kempa and Kociumaka [14] showed r=O(zlog2n), and further r=O(δlog2n)111More precisely, r=O(δlogδmax(1,lognδlogδ)). Also, z=O(blogn) [22] has been shown, and therefore z=O(rlogn) or, r=Ω(z/logn). Recently, the size rB of the run-length encoded bijective BWT (RLBBWT) has been studied [2, 1], and was shown, similarly to r, that rB=O(zlog2n) [1]. Also, the existence of an infinite family of strings such that r=o(rB), namely, r=O(1) and rB=Ω(logn) was shown [1]. The existence of the opposite case, i.e., if there are string families for which rB=o(r), was left open.

The “clustering effect” of the BWT in terms of the run length encoding was studied by Mantaci et al. [19], where they proved that BWT can increase the size of the run-length encoding of the string by a factor of at most 2. The maximal clustering effect for run-length encoding can be seen for example with the Fibonacci words: Fibonacci words of length n have a run-length encoding of size Θ(n), while r=2. While this is remarkable when viewed from run-length encoding since a maximally incompressible string (in terms of rle) is transformed into a compressible one, in terms of the smallest bidirectional macro scheme, Fibonacci words and their BWT have the same size 4, and one might argue that BWT is merely transforming a compressible string into another compressible string.

In this paper, we extend previous results on the relation between r and rB. We also investigate the clustering effect of BWT and BBWT in terms of other repetitiveness measures and show that they can be much more powerful than previously perceived, giving an example where BBWT transforms a maximally incompressible string (in terms of dictionary compression) into a compressible one.

The contributions of this paper are as follows.

  1. 1.

    We show an infinite family of strings such that r=Ω(logn) and rB=O(1), answering affirmatively an open question about the existence of a family of strings where rB=o(r) raised by Badkobeh et al. [1].

  2. 2.

    We show a polylogarithmic (O(log4n)) upper bound on the multiplicative difference between r and rB.

  3. 3.

    We analyze the clustering effect of BWT and BBWT in terms of other repetitiveness measures, and show that the application of BWT and BBWT can only increase various repetitiveness measures of a string by a polylogarithmic factor.

  4. 4.

    We show that BBWT on the strings whose BBWT images are Fibonacci words exhibit a maximal clustering effect in terms of the repetitiveness measures δ, γ, b, v, or r, from almost linear (Θ(n/logn)) to constant. The main combinatorial part of our proof is an elegant characterization of the LF mapping function on Fibonacci strings that uses the Zeckendorf representation of the positions.

    As a byproduct, this result shows that it is possible in some cases to transcend dictionary compression by simply applying BBWT before applying dictionary compression.

2 Preliminaries

Let Σ denote a set of symbols called the alphabet. A string is an element of Σ. If w=xyz for any strings w,x,y,z, then, x, y, z are respectively a prefix, substring, and suffix of w and are a proper prefix, substring, or suffix if they are not equal to w. The length of string x is denoted by |x|. The empty string, i.e., the string of length 0 is denoted by ε. The symbol at position i will be denoted by x[i], and we will use a 0-based index, i.e., x=x[0]x[|x|1]. For integers i,j[0,|x|), let x[i..j]=x[i]x[j] if ij and x[i..j]=ε if i>j. We will also use x[i..j)=x[i..j1]. For any string x, x0=ε, and for any integer i1, xi=xi1x. A string w is primitive if it cannot be represented as w=xk for some string x and integer k2. For any symbol cΣ let |x|c=|{ix[i]=c}|, i.e., the number of c’s in x.

For any string x, let 𝗋𝗈𝗍(x)=x[1..|x|)x[0], denote a left cyclic rotation of x by one symbol. Notice that for any integer i, 𝗋𝗈𝗍i(x) naturally corresponds to the cyclic rotation of x by i symbols to the left if i>0, and to the right if i<0. We will use 𝗋𝗈𝗍(x) to denote the lexicographic smallest rotation of x. A conjugacy class is an equivalence class of the equivalence relation defined by xy𝗋𝗈𝗍(x)=𝗋𝗈𝗍(y).

A string w is a Lyndon word if it is lexicographically smaller than any of its, proper rotations, i.e., w<𝗋𝗈𝗍k(w) for any k[1,|w|). From the definition, it is clear that a Lyndon word must be primitive. A rotation of a primitive string x is always primitive and 𝗋𝗈𝗍(x) is Lyndon and unique, which we will sometimes refer to as the Lyndon rotation of x. It is known that Lyndon words cannot have a non-empty proper border (a proper prefix that is also a proper suffix). The Lyndon factorization [6] of a string w is a unique partitioning of w into a sequence of non-increasing Lyndon words, i.e., w=L1e1L(w)e(w) where each Li(i[1,(w)]), which we will call Lyndon factors of w, is a Lyndon word and Li>Li+1 for i[1,(w)). It is known that any occurrence of a Lyndon word in w must be a substring of a Lyndon factor of w.

The ω-order <ω between primitive strings or same length strings x,y is defined as x<ωyx<y.222As we will not be comparing non-primitive strings of different lengths in this paper, the definition here is a simplified version of the original. The ω-order can be different from the standard lexicographic order, but are identical when comparing strings of the same length or when comparing Lyndon words.

For a string x[0..|x|), let 𝗋𝖺𝗇𝗄c(i,x)=|x[0..i)|c, i.e., the number of symbols c in x[0..i). Let ρ(x) denote the size of the run-length encoding of x, i.e., the maximal number of same symbol runs in x. Let ρc(x) denote the number of runs of symbol c in the run-length encoding of x.

2.1 Repetitiveness Measures

A set of positions Γ is a string attractor [15] of w if any substring of w has an occurrence in w that covers a position in Γ. The size of the smallest string attractor of w is denoted by γ(w). The measure δ [16] is defined as maxk[1,|w|]Sk/k, where Sk is the number of distinct length-k substrings of w.

The Burrows-Wheeler transform (BWT) [5] 𝖡𝖶𝖳(w) of a string w is defined as the sequence of last (or equivalently, previous, in the cyclic sense) symbols of all rotations of w, in lexicographic order of the rotations. The size of the run-length encoding of 𝖡𝖶𝖳(w) will be denoted by r(w), i.e., r(w)=ρ(𝖡𝖶𝖳(w)). The bijective BWT (BBWT) [11] 𝖡𝖡𝖶𝖳(w) of a string w is defined as the sequence of last (or again, previous, in the cyclic sense) symbols of all the rotations of all the Lyndon factors of w, in ω-order of the rotations. Actually, BWT can be understood as a special case of BBWT: 𝖡𝖶𝖳(w)=𝖡𝖡𝖶𝖳(𝗋𝗈𝗍(w)) because 𝗋𝗈𝗍(w)=Le for some Lyndon word L and integer e=|w|/|L|, and the ω-order is equivalent to lexicographic order in this case since all the compared strings are of the same length. The size of the run-length encoding of 𝖡𝖡𝖶𝖳(w) will be denoted by rB(w), i.e., rB(w)=ρ(𝖡𝖡𝖶𝖳(w)).

The inverse transform of 𝖡𝖶𝖳 and 𝖡𝖡𝖶𝖳 on a string x can be defined by the LF mapping, which is a function Ψx(i)=j over positions of x where c=x[i]=s[j], 𝗋𝖺𝗇𝗄c(i,x)=𝗋𝖺𝗇𝗄c(j,s), and s is the string obtained by sorting the multiset of symbols of x in increasing order. Ψx is a permutation and thus forms cycles on the set of positions of x. A cycle (i,Ψ1(i),,Ψk1(i)) where k is the smallest positive integer such that Ψk(i)=i, corresponds to a (cyclic) string x[Ψk1(i)]x[Ψ1(i)]x[i]. This string is always primitive, and by concatenating, in non-increasing order, all the Lyndon rotations of the strings corresponding to all cycles, it can be shown that 𝖡𝖡𝖶𝖳1(x) is obtained. Ψx can be interpreted as returning, given the ω-order rank of a given rotation of a cycle, the ω-order rank of the previous rotation of the cycle. Note that when x is a 𝖡𝖶𝖳 image of a primitive string, Ψx consists of only a single cycle and that although the Lyndon rotation will give the string w for which 𝖡𝖡𝖶𝖳(w)=x, we have 𝖡𝖶𝖳(w)=x for any rotation w of w.

A Bidirectional Macro Scheme (BMS) [23] of a string w is a partitioning of w into phrases, where each phrase is either a single symbol, or is a substring that has an occurrence elsewhere in w which we call the source, or the reference of the phrase. The references of the phrases must be such that the induced reference for each position in the phrases are acyclic, i.e., the referencing on the position forms a forest where the roots are the positions corresponding to single symbol phrases. The size of the smallest bidirectional macro scheme for w is denoted by b(w).

The LZ77 factorization [25] of a string w is a BMS of w where all references are left-referencing, i.e., they must point to a smaller position, and the phrases are determined greedily from left-to-right. It is known that LZ77 is the smallest among left-referencing BMS. The size of LZ77 of w is denoted by z(w).

The lex-parse [22] of a string w is a BMS of w where all references point to a rotation of smaller lexicographic (or equivalently ω-order) rotation, and the phrases are determined greedily from left-to-right. Similarly, it is known that lex-parse is the smallest among BMS with such constraint. The size of lex-parse of w is denoted by v(w).

2.2 Fibonacci Words

Since we will later use Fibonacci words or their slight modifications to show some of our results, we introduce them here.

The Fibonacci words are defined recursively as: F0=𝚋, F1=𝚊, and for any integer i2, Fi=Fi1Fi2. Fibonacci words can also be defined via a morphism ϕ defined as ϕ(𝚊)=𝚊𝚋, and ϕ(𝚋)=𝚊, and Fi=ϕi(𝚋) for all i0. The lengths of the Fibonacci words correspond to the Fibonacci sequence, i.e., f0=f1=1, and fi=fi1+fi2. For technical reasons, we define fi=0 for i<0. The observation below follows from a simple induction.

Observation 1.

For any i1, |Fi|𝚊=fi1, and |Fi|𝚋=fi2.

3 New results for 𝒓 vs 𝒓𝑩

3.1 String family giving 𝒓(𝒘)/𝒓𝑩(𝒘)=𝛀(𝐥𝐨𝐠𝒏)

Here, we answer an open question raised by Badkobeh et al. [1], and show that there exists an infinite family of strings such that rB is asymptotically strictly smaller than r.

Theorem 2.

There exists an infinite family of strings such that r=Ω(rBlogn).

Proof.

Consider string wk=𝚋F2k, where F2k=𝗋𝗈𝗍(F2k). Then, for any k3, rB(wk)=3 (Lemma 3), and r(wk)=2k (Corollary 6).

Lemma 3.

Let wk=𝚋F2k, where F2k=𝗋𝗈𝗍(F2k). Then, rB(wk)=3 for any k3.

Proof.

The Lyndon factorization of wk results in the factors 𝚋 and F2k, and it is known that 𝖡𝖡𝖶𝖳(F2k)=𝖡𝖶𝖳(F2k)=𝚋f2k2𝚊f2k1. Since 𝚋 is greater, in ω-order, than any rotation of F2k, we have 𝖡𝖡𝖶𝖳(wk)=𝚋f2k2𝚊f2k1𝚋 and rB(wk)=3. The rest of the proof will focus on showing r(wk)=2k. We show two proofs, one that relies heavily on previous results, and the other an alternate direct proof.

3.1.1 Proof for 𝒓(𝒘𝒌)=𝟐𝒌

Actually, it turns out that wk is a rotation of the strings whose BWT are shown to have 2k runs in Proposition 3 of [12] and in Proposition 3 of [13]. The former defines it as v2k𝚋 for v2k=(F2k)R, where wR is the reverse of string w, i.e., (wR)[i]=w[|w|1i]. The latter defines it as inserting 𝚋 at position f2k12 of F2k, and mentions that it is a rotation of the former string. Our proof of r(wk)=2k further connects these strings with the lexicographically smallest rotation F2k of F2k.

The following is known:

Theorem 4 (Theorem 1 in [7]).

The rotation of Fn with rank ρ in the lexicographically sorted list of all the rotations of Fn, for n2, ρ[0..fn) is the rotation 𝗋𝗈𝗍i(Fn) where

i={(ρfn21)modfn if n odd((ρ+1)fn21)modfn if n even.

Therefore, we have

Corollary 5.

For every integer k1, F2k=𝗋𝗈𝗍f2k11(F2k).

Proof.

From Theorem 4 with ρ=0, F2k=𝗋𝗈𝗍i(F2k) where i=(f2k2+1)f2k11(modf2k). which means that inserting a 𝚋 at position f2k12 of F2k is a rotation of 𝚋F2k, leading to:

Corollary 6.

Let wk=𝚋F2k, where F2k=𝗋𝗈𝗍(F2k). Then, r(wk)=2k for any k3.

3.1.2 Alternate proof for 𝒓(𝒘𝒌)=𝟐𝒌

We also give an alternate direct proof based on morphisms, which is perhaps slightly simpler compared to the proof for r(v2k𝚋)=2k presented in [12] which relies on additional results on special factors of standard words [4]. Results on morphisms and their effects on r have been studied by Fici et al. [9], but to the best of our knowledge, their results do not directly apply to our case.

We use a string morphism θ defined as: θ(𝚊)=𝚊𝚊𝚋, θ(𝚋)=𝚊𝚋. The following claim can be shown by a simple induction.

Claim 7 (Claim 5 in [20]).

For any string w{𝚊,𝚋}, θ(w)=𝗋𝗈𝗍2(ϕ2(w)).

To prove Theorem 2, we first show the following lemma.

Lemma 8.

For every integer k0, F2k=θk(𝚋).

Proof.

It clearly holds for k=0,1. For k2, we show that θk(𝚋)=𝗋𝗈𝗍f2k11(F2k) by induction on k. We can see that the statement holds for k=2 since θ2(𝚋)=𝚊𝚊𝚋𝚊𝚋=𝗋𝗈𝗍2(F4). Suppose that θk(𝚋)=𝗋𝗈𝗍f2k11(F2k) holds for some k2. We have

θk+1(𝚋) =θ(θk(𝚋))=θ(𝗋𝗈𝗍f2k11(F2k)) by induction hypothesis
=θ(𝗋𝗈𝗍1(F2k2F2k1))
=𝗋𝗈𝗍2(ϕ2(𝗋𝗈𝗍1(F2k2F2k1)) by Claim 7

Since the last symbol of F2k is F2k[f2k11]=𝚊 and |ϕ2(𝚊)|=3, we have

θk+1(𝚋) =𝗋𝗈𝗍2(𝗋𝗈𝗍3(F2kF2k+1))
=𝗋𝗈𝗍f2k+11(F2k+2)
=𝗋𝗈𝗍f2(k+1)11(F2(k+1))

Hence, the statement also holds for k=k+1, and θk(𝚋)=𝗋𝗈𝗍f2k11(F2k) for every k2. By combining it with Corollary 5, we obtain F2k=θk(𝚋). We are ready to prove the following lemma.

Lemma 9.

Let wk=𝚋F2k, where F2k=𝗋𝗈𝗍(F2k). Then, r(wk)=2k for any k2.

Proof.

We first prove that r(yk)=2k+2 for yk=𝚌F2k for any k3, where 𝚌 is a symbol that is greater than 𝚊,𝚋. From Lemma 8, we have yk=𝚌F2k=𝚌θk(𝚋). Below, we extend the definition of θ so that θ(𝚌)=𝚌, so yk=θk(𝚌𝚋). We claim that

𝖡𝖶𝖳(yk) =𝚌𝚋f2k2kj=1k(𝚊f2j2𝚋) (1)

We show this statement by induction on k. If k=3, y3=𝚌𝚊𝚊𝚋𝚊𝚊𝚋𝚊𝚋𝚊𝚊𝚋𝚊𝚋 and 𝖡𝖶𝖳(y3)=𝚌𝚋2𝚊𝚋𝚊2𝚋𝚊5𝚋 hold. Assume that the statement holds for some k3. We show the statement also holds for k=k+1. Figure 1 depicts yk+1=θ(yk). We consider the three disjoint sets of the rotations of yk+1 as follows: the rotations (i) starting with 𝚊2, (ii) starting with 𝚊𝚋, (iii) starting with 𝚋 or 𝚌.

  1. (i)

    From the definition of θ, the number of rotations starting with 𝚊2 of yk+1 is f2k1(=|F2k|𝚊). These rotations are the first f2k1 rotations in the lexicographically increasingly sorted list of all rotations of yk+1. By the definition of yk, the lexicographically smallest rotation is 𝗋𝗈𝗍(yk), and the preceding symbol of this rotation is 𝚌. The other rotations of this case are preceded by 𝚋. Hence, 𝖡𝖶𝖳(yk+1)[0..f2k11]=𝚌𝚋f2k11.

  2. (ii)

    Let 𝗉𝗈𝗌 be the set of positions in yk+1 that are preceded by an occurrence of 𝚊𝚋. From the definitions of yk and θ, every occurrence of 𝚊𝚋 in yk+1 is a suffix of an occurrence of the substrings (𝚊𝚊𝚋 or 𝚊𝚋) produced from θ and an occurrence of 𝚊 or 𝚋 in yk. Therefore, we have |𝗉𝗈𝗌|=|yk|1=f2k and if s(i) denotes the position in yk that produces the corresponding symbol at position i in yk+1 by the morphism θ, then 𝗋𝗈𝗍i(yk+1)=θ(𝗋𝗈𝗍s(i)(yk)) for any i𝗉𝗈𝗌. Thus, for any positions i1,i2𝗉𝗈𝗌,

    𝗋𝗈𝗍i12(yk+1)<𝗋𝗈𝗍i22(yk+1) 𝗋𝗈𝗍i1(yk+1)<𝗋𝗈𝗍i2(yk+1)
    θ(𝗋𝗈𝗍s(i1)(yk))<θ(𝗋𝗈𝗍s(i2)(yk))
    𝗋𝗈𝗍s(i1)(yk)<𝗋𝗈𝗍s(i2)(yk). (2)

    where the last relation follows from the fact that θ is an order preserving morphism (i.e., s<tθ(s)<θ(t) holds for any s,t). We can also see that for any i𝗉𝗈𝗌, the symbol yk+1[i3] preceding the occurrence of 𝚊𝚋 is 𝚊 (resp. 𝚋) iff yk[s(i)1] is 𝚊 (resp. 𝚋). Thus, together with Equation 2, the sequence of entries in 𝖡𝖶𝖳(yk+1) corresponding to rotations that start with 𝚊𝚋 are equivalent to the sequence of entries in 𝖡𝖶𝖳(yk) for rotations that start with 𝚊 or 𝚋, i.e., 𝖡𝖶𝖳(yk+1)[f2k1..f2k+11]=𝖡𝖶𝖳(yk)[1..f2k]=𝚋f2k2kj=1k(𝚊f2j2𝚋). See also Figure 1.

  3. (iii)

    From the definition of θ, the number of rotations starting with 𝚋 in yk+1 is f2k, and the lexicographically largest rotation is yk+1 itself. These rotations are the last f2k+1 rotations in the lexicographically increasingly sorted list of all rotations of yk+1. The preceding symbol of the largest rotation, or equivalently, the last symbol of yk is 𝚋. All rotations starting with 𝚋 are preceded by 𝚊. Hence, 𝖡𝖶𝖳(yk+1)[f2k+1..f2k+2]=𝚊f2k𝚋.

All together, we have

𝖡𝖶𝖳(yk+1) =𝚌𝚋f2k11(𝗂)𝚋f2k2kj=1k(𝚊f2j2𝚋)(𝗂𝗂)𝚊f2k𝚋(𝗂𝗂𝗂)
=𝚌𝚋f2k(k+1)j=1k+1(𝚊f2j2𝚋)

and the statement holds for k=k+1, proving Equation 1. Therefore, r(yk)=2k+2. We note that Equation 1 holds for k=2, but since f2k22=0, r(y2)=5.

The last symbol of F2k is 𝚋, so changing the 𝚌 in yk to 𝚋 changes the unique occurrence of 𝚋𝚌 in all rotations of yk except for yk itself, to the unique occurrence of 𝚋𝚋 in all rotations of wk except for wk itself. Therefore, the lexicographic order of rotations of wk are equivalent to those in yk, except for wk itself. We can see that 𝖡𝖶𝖳(wk) will have the following changes compared to 𝖡𝖶𝖳(yk): 1) the lexicographically smallest rotation 𝗋𝗈𝗍(wk) is preceded by 𝚋 instead of 𝚌, and 2) wk is the lexicographically smallest rotation that starts with 𝚋 (rather than 𝚌). 1) changes 𝚌 to 𝚋 and thus decreases the run-length by 1. 2) moves the last 𝚋 right after the second to last 𝚋 in 𝖡𝖶𝖳(yk) (the preceding symbol of the largest rotation starting with 𝚊) and thus decreases the run-length by 1. Therefore, in total, r(wk)=2k which can be confirmed to hold for k=2 as well.

Figure 1: Illustration of the proof of Lemma 9. Every position in the set 𝗉𝗈𝗌 is preceded by an occurrence of 𝚊𝚋. We can obtain the lexicographic rank of the rotation 𝗋𝗈𝗍i2(yk+1) that starts with 𝚊𝚋 by using the corresponding rotation 𝗋𝗈𝗍s(i)(yk). The preceded symbols (underlined symbols) of yk+1[i3] and yk[s(i)1] are the always same. The figure shows the case of yk+1[j3]=yk[s(j)1]=𝚊 by the position i and the case of yk+1[j3]=yk[s(j)1]=𝚋 by the position j.

3.2 Bounds for 𝒓(𝒘)/𝒓𝑩(𝒘)

We show poly-logarithmic upper and lower bounds on the ratio r(w)/rB(w) for any string.

We first show a simple upper bound on the multiplicative rotation sensitivity of z.

Lemma 10.

For any strings w,w of length n in the same conjugacy class, z(w)=O(z(w)logn).

Proof.

It is easy to see that b(w)=Θ(b(w)). This is because, given any BMS of w of size b, we can reuse the parsing and referencing, except for the following changes, to construct a BMS for w of size 2b+1: (1) if w starts in the middle of a BMS phrase of w, we split the phrase, and (2) if a phrase references a substring of w but is split in w, we split the phrase. Since only one of the phrases of the split phrase in (1) is split by (2), the total number of phrases is at most 2b+1.

Since b(w)z(w)=O(b(w)logn) and b(w)z(w)=O(b(w)logn), we have z(w)/z(w)=O(logn).

Lemma 11.

For any strings w,w of length n in the same conjugacy class, rB(w)=O(rB(w)log4n).

Proof.

Badkobeh et al. [1] showed (1) b(w)=O(rB(w)), which implies z(w)=O(rB(w)logn), and (2) rB(w)=O(z(w)log2n). Therefore, rB(w)/rB(w)=O(z(w)/z(w)log3n)=O(log4n) from Lemma 10.

Corollary 12.

For any string w of length n, rB(w)=O(r(w)log4n) and r(w)=O(rB(w)log4n).

4 Clustering effects of 𝗕𝗪𝗧 and 𝗕𝗕𝗪𝗧

4.1 In terms of 𝝆

The clustering effect of 𝖡𝖶𝖳 was measured by the length ρ of the run-length encoding. Namely, the following statement was claimed by Mantaci et al. [19].

Theorem 13 (Theorem 3.3 in [19]).

For any string w, ρ(𝖡𝖶𝖳(w))2ρ(w).

While the statement is true, we believe there is a non-trivial case that was not covered in their original proof. Here, we address this case and also show that the same statement holds for 𝖡𝖡𝖶𝖳. Note that the following Theorem 14 implies Theorem 13 because 𝖡𝖶𝖳(w)=𝖡𝖡𝖶𝖳(𝗋𝗈𝗍(w)) and ρ(𝗋𝗈𝗍(w))ρ(w).

Theorem 14.

For any string w, ρ(𝖡𝖡𝖶𝖳(w))2ρ(w).

Proof.

We first recall the arguments of [19] for BWT. Consider a range [i..j] of x=𝖡𝖶𝖳(w) that corresponds to rotations of w that start with a symbol cΣ. Then, x[i..j] will consist of |w|cρc(w) occurrences of c’s that correspond to those preceding a c in the same run of c’s in w, and ρc(w) occurrence of symbols not equal to c that precede a maximal run of c’s in w. Thus, ρ(x[i..j]) is maximized when the ρc(w) non-c symbols in x[i..j] are not adjacent to each other. Mantaci et al. argued that this implies ρ(x[i..j])2ρc(w) summing up to 2ρ(w) for all c. However, they did not give a reason to why ρ(x[i..j]) could not be 2ρc(w)+1 (summing up to ρ(w)+σ), which is possibly the case when x[i..j] starts and ends with the symbol c. We show that in such a case, an LF mapping with a single cycle cannot be defined, and ultimately, ρ(x[i..j])2ρc(w) holds.

Assume that x[i]=x[j]=c and that all ρc(w) non-c symbols in x[i..j] are separated by one or more c’s. We claim that in such a case, there exists a position k[i..j] s.t. Ψx(k)=k resulting in a cycle of a single symbol. If there was no c in x[0..i), then, since x[i] is the first c in x, Ψx(i)=i must hold. Thus, consider the case that there exist c’s in x[0..i). For ikj, since the difference in the number of c’s in x[0..k] and the length of the range [i..k] (corresponding to the ki+1 lexicographically (or in ω order) smallest rotations that start with c) is monotonically non-increasing in k and changes by at most 1, there exists some ikj such that |x[0..k)|c=|x[0..k]|c=ki+1. Note that such k must necessarily satisfy x[k]c, and therefore k<j. However, from the assumption of x[i..j], x[k+1]=c holds and since |x[0..k+1]|c=ki+2, this implies that Ψx(k+1)=k+1. Therefore ρ(x[i..j])2ρc(w) holds.

For x=𝖡𝖡𝖶𝖳(w), similarly consider the range [i..j] of 𝖡𝖡𝖶𝖳(w) such that the corresponding rotations of the Lyndon factors of w start with the symbol cΣ. Since it is known that Lyndon factors of w (except for single symbol factors) must start and end at run-boundaries of w [10], a maximal run of c’s in w is either a substring of a Lyndon factor of w, or a maximal run of a single symbol Lyndon factors. If it is a substring of a Lyndon factor but not a prefix, then there is one non-c symbol per such run that precedes the run and occurs in x[i..j]. If it is a prefix of a Lyndon factor longer than 1, then the previous symbol (or the last symbol of the Lyndon factor) is non-c due to the Lyndon factor not having a proper border. Otherwise, the maximal run is a maximal run of single symbol Lyndon factors c, in which case the previous symbol for these occurring in x[i..j] will also be c. Thus, the arguments in the previous paragraph for BWT hold for BBWT as well, except for the last part: it can be that x[i]=x[j]=c and Ψ(k)=k for some k[i,j]. However, this implies that this corresponds to a single symbol Lyndon factor which would not have introduced a non-c symbol in x[i..j]. Therefore ρ(x[i..j])2ρc(w) still holds.

4.2 In terms of other repetitiveness measures

We consider measuring the clustering effect of 𝖡𝖶𝖳 and 𝖡𝖡𝖶𝖳 other than ρ. Here, we show that 𝖡𝖶𝖳 or 𝖡𝖡𝖶𝖳 can only increase the repetitiveness of the string by a polylogarithmic factor.

Theorem 15.

For repetitiveness measure μ{δ,γ,b,v,z,grl} and any string w of length n, it holds that μ(𝖡𝖶𝖳(w))=O(μ(w)log2n). Moreover, μ(𝖡𝖡𝖶𝖳(w))=O(μ(w)log2n) for μ{z,grl}, and μ(𝖡𝖡𝖶𝖳(w))=O(μ(w)log6n) for μ{δ,γ,b,v}.

Proof.

The proof for 𝖡𝖶𝖳 follows from a simple observation that for any string w, μ(w)=O(ρ(w)) implying μ(𝖡𝖶𝖳(w))=O(ρ(𝖡𝖶𝖳(w)). Since r(w)=ρ(𝖡𝖶𝖳(w))=O(δ(w)log2n), we have μ(𝖡𝖶𝖳(w))=O(δ(w)log2n)=O(μ(w)log2n), where the last relation follows from the fact that δ lower bounds all the repetitiveness measures considered.

For 𝖡𝖡𝖶𝖳, μ(𝖡𝖡𝖶𝖳(w))=O(ρ(𝖡𝖡𝖶𝖳(w))) holds, but only rB(w)=ρ(𝖡𝖡𝖶𝖳(w))=O(z(w)log2n) has been proved, from which rB(w)=O(grl(w)log2n) and the statement follows for μ{z,grl}. For the other measures, Corollary 12 gives us rB(w)=O(r(w)log4n)=O(δ(w)log6n) from which the statement follows.

4.3 A family of significantly “clustered” strings by BBWT

Finally, we show an infinite family of strings for which BBWT exhibits an asymptotically maximal clustering effect in terms of δ,γ,b,v,r, and slightly weaker in terms of z,grl,g. Namely, we consider the family of strings whose BBWT image are Fibonacci words.

Theorem 16.

Let w be a string of length n where 𝖡𝖡𝖶𝖳(w)=Fk, where k is prime. Then, for μ{δ,γ,b,v,r}, μ(𝖡𝖡𝖶𝖳(w))=O(lognnμ(w)) and for μ{z,grl,g}, μ(𝖡𝖡𝖶𝖳(w))=O(log2nnμ(w)).

In order to understand w, we first introduce the tools we use to characterize the LF mapping ΨFk on Fibonacci words.

The Zeckendorf representation [24] of a non-negative integer is a unique (sub-)set of distinct non-consecutive Fibonacci numbers {fkk1} that sum up to the integer. We represent a non-negative integer i as a bit string Z(i), where i=Z(i)=j0Z(i)[j]fj+1. We use Zk(i) to denote the length-k prefix Z(i)[0..k1] of Z(i). Note that for any i[0,fk), it suffices that a subset of {f1,,fk1} is used, and thus Z(i) requires only up to (k1) bits, i.e., 1’s will only occur in Z(i)[0..k2], and the rest Z(i)[k1..] will be 0.

We observe that length i0 prefix of a Fibonacci word can be uniquely factorized into a sequence of distinct non-consecutive Fibonacci words in order of decreasing length, where the length of each Fibonacci word corresponds to an element in the Zeckendorf representation of i.

Lemma 17.

For any i0,

F[0..i) =j=k20Fj+1Z(i)[j]=Fk1Z(i)[k2]F1Z(i)[0], (3)

where k is the smallest integer such that i<fk.

Proof.

Consider a greedy factorization of F[0..i) that takes the longest prefix of the remaining string that is a Fibonacci word. Let k be the smallest integer such that ifk. If i=fk then we simply take Fk as the last element. Otherwise, we take Fk1. Notice that in this case, the remaining string is F[fk1..i)=Fk[fk1..i)=Fk2[0..ifk1) of length i=ifk1<fk2, and we repeat the process to find a greedy factorization of Fk2[0..i). The remaining string is a proper prefix of Fk2 and thus Fk2 will not be chosen next, implying that consecutive Fibonacci words are not chosen. Thus, the lengths of the sequence of strings will be a set of non-consecutive set of distinct Fibonacci numbers that sum up to i. The lemma follows from the uniqueness of the Zeckendorf representation of i.

The following relation about the least significant bit in the Zeckendorf representation and the ith symbol in the (infinite) Fibonacci word is known.

Lemma 18 (Problem 6 in [8]).

For i=0,1,,

F[i] ={𝚊if Z(i)[0]=0,𝚋if Z(i)[0]=1.

Now, we make observations on the LF-mapping for Fibonacci words.

Lemma 19.

The LF-mapping function for the Fibonacci word Fk is, for any 0i<fk,

ΨFk(i)={𝗋𝖺𝗇𝗄𝚊(i,Fk)if Fk[i]=𝚊,fk1+𝗋𝖺𝗇𝗄𝚋(i,Fk)if Fk[i]=𝚋.
Proof.

Straightforward from the definition of ΨFk and Observation 1.

The next lemma shows that the LF mappings on Fk can be interpreted as rotation operations on the Zeckendorf representation of the position to be mapped.

Lemma 20.

For any i[0,fk),

Zk(ΨFk(i)) ={𝗋𝗈𝗍(Zk(i))if Fk[i]=𝚊,𝗋𝗈𝗍2(Zk(i))if Fk[i]=𝚋.
Proof.

See Figure 2 for a concrete example. Let j=ΨFk(i). Since i,j[0,fk), the most significant bits of Zk(i) and Zk(j), corresponding to fk, are always zero, i.e., Zk(i)[k1]=Zk(j)[k1]=0.

From Lemma 18, Fk[i]=𝚊 implies that Zk(i)[0]=0. Thus, 𝗋𝗈𝗍(Zk(i))=Zk(i)[1..k1]0. Therefore, we have

ΨFk(i) =𝗋𝖺𝗇𝗄𝚊(i,Fk) by Lemma 19
=|j=k20Fj+1Z(i)[j]|𝚊 by Lemma 17
=j=0k2Zk(i)[j]fj=j=0k1Zk(i)[j]fj
=Zk(i)[0]fk+j=0k2Zk(i)[j+1]fj+1
=j=0k1Zk(i)[(j+1)modk]fj+1
=𝗋𝗈𝗍(Zk(i))

On the other hand, again from Lemma 18, Fk[i]=𝚋 implies that Zk(i)[0]=1 and since the Zeckendorf representation does not use consecutive Fibonacci numbers, Zk(i)[1]=0. Thus, 𝗋𝗈𝗍2(Zk(i))=Zk(i)[2..k1]10. Therefore, we have

𝗋𝖺𝗇𝗄𝚋(i,Fk) =|j=k20Fj+1Z(i)[j]|𝚋 by Lemma 17
=j=0k2Zk(i)[j]fj1=j=0k1Zk(i)[j]fj1
=Zk(i)[0]f1+Zk(i)[1]f0+j=0k3Zk(i)[j+2]fj+1
=j=0k3Zk(i)[j+2]fj+1

Furthermore,

ΨFk(i) =𝗋𝖺𝗇𝗄𝚋(i,Fk)+fk1 by Lemma 19
=j=0k3Zk(i)[j+2]fj+1+fk1
=j=0k3Zk(i)[j+2]fj+1+Zk(i)[0]fk1+Zk(i)[1]fk
=j=0k1Zk(i)[(j+2)modk]fj+1
=𝗋𝗈𝗍2(Zk(i)).

thus proving the lemma (see also Figure 2).

Figure 2: Example of Lemma 20 for F7. The 𝚊 at position 18 is the 12th 𝚊 in F7[0..18], and thus the LF mapping should point to position 11, whose Zeckendorf representation Z7(11) is a 1-bit left rotation of Z7(18). The 𝚋 at position 17 is the 7th 𝚋 in F7[0..17], and since there are 13 𝚊’s in F7, the LF mapping should point to position 13+71=19, whose Zeckendorf representation Z7(19) is a 2-bit left rotation of Z7(17).

We are now ready to prove Theorem 16: See 16

Proof.

For any Fibonacci word Fk, δ(Fk)γ(Fk)b(Fk)v(Fk)r(Fk)=2 and z(Fk),grl(Fk),g(Fk)=O(logn) are known. In the rest of the proof, we show δ(w)=Ω(n/logn) implying the same lower bound for all other measures which finishes the proof.

From Lemma 20, ΨFk can be interpreted as a rotation operation on a k-bit string corresponding to the Zeckendorf representation of the given position. Therefore, the number of cycles that ΨFk produces is equivalent to the number of conjugacy classes among bit strings corresponding to the Zeckendorf representations of the set of positions [0,Fk). Since the Zeckendorf representations do not use adjacent Fibonacci numbers, this is equivalent to the number C(k) of binary necklaces of length k that do not contain “11”. This corresponds to entry A000358 in the on-line encyclopedia of integer sequences, which is known [3] to be:

C(k) =1kd|kφ(k/d)(fd2+fd), (4)

where d|k represents that d is a divisor of k, and φ is Euler’s totient function which returns, for integer n, the number of positive integers less than n that are relatively prime to n (i.e., having gcd of 1). For prime k, Equation 4 becomes:

C(k) =1k(φ(1)(fk2+fk)+φ(k)(f1+f1))=1k(fk2+fk+k1)fk/k.

As seen in the proof of Lemma 20, an 𝚊 in Fk implies a 1-bit left rotation on Zk for ΨFk, and 𝚋 in Fk implies a 2-bit left rotation on Zk for ΨFk. Each necklace will then correspond to some cyclic string w, where

|w|𝚊+2|w|𝚋 =k, (5)

except for the string 𝚊 corresponding to the bit string 0k, since all other necklaces must be primitive (due to k being prime) and a total of k-bits must be rotated in order to come back to the same bit string. Consider the lexicographically smallest rotations of all of them, which are Lyndon words. All of them are distinct, and any two of them (except for the single 𝚊) cannot be a substring of the other due to the constraint on their lengths (Equation 5). The string w=𝖡𝖡𝖶𝖳1(Fk) is a concatenation (in non-increasing order) of all these strings with the single 𝚊 appended at the end, corresponding to the Lyndon factorization of w. Since Lyndon words can only occur as a substring of a Lyndon factor, it follows that there are Ω(C(k)) distinct uniquely occurring strings (Lyndon factors) that do not overlap. Therefore, γ(w)=Ω(C(k))=Ω(n/logn) holds.

Furthermore, since the uniquely occurring Lyndon factors of w satisfy Equation 5 and have length at most k1, any length 2k substring of w will contain at least one full occurrence of a Lyndon factor of w. Since all such strings must also have unique occurrences and therefore be distinct, δ(w)(fk2k)/2k=Ω(n/logn) holds.

For example, for 𝖡𝖡𝖶𝖳1(F7), the binary necklaces of length 7 that do not contain “11” are: 0000000, 0000001, 0000101, 0001001, 0010101, which respectively correspond to 𝚊, 𝚊𝚊𝚊𝚊𝚊𝚋, 𝚊𝚊𝚊𝚋𝚋, 𝚊𝚊𝚋𝚊𝚋, and 𝚊𝚋𝚋𝚋, and therefore, 𝖡𝖡𝖶𝖳1(F7)=𝚊𝚋𝚋𝚋𝚊𝚊𝚋𝚊𝚋𝚊𝚊𝚊𝚋𝚋𝚊𝚊𝚊𝚊𝚊𝚋𝚊.

We note that 𝖡𝖡𝖶𝖳1(Fk) is asymptotically maximally incompressible by dictionary compression since it is known that for any string, δγz=O(n/logσn) holds, which is O(n/logn) for binary strings. We also note that Theorem 16 holds for general k, which will be shown in the full version of the paper.

5 Discussion

The log factors in our bounds, especially in Theorem 15 are most likely not tight; Our focus here was on showing the existence of such bounds. A natural open question is how many log factors can be shaved.

References

  • [1] Golnaz Badkobeh, Hideo Bannai, and Dominik Köppl. Bijective BWT based compression schemes. In Zsuzsanna Lipták, Edleno Silva de Moura, Karina Figueroa, and Ricardo Baeza-Yates, editors, String Processing and Information Retrieval - 31st International Symposium, SPIRE 2024, Puerto Vallarta, Mexico, September 23-25, 2024, Proceedings, volume 14899 of Lecture Notes in Computer Science, pages 16–25. Springer, 2024. doi:10.1007/978-3-031-72200-4_2.
  • [2] Elena Biagi, Davide Cenzato, Zsuzsanna Lipták, and Giuseppe Romana. On the number of equal-letter runs of the bijective Burrows-Wheeler transform. In Giuseppa Castiglione and Marinella Sciortino, editors, Proceedings of the 24th Italian Conference on Theoretical Computer Science, Palermo, Italy, September 13-15, 2023, volume 3587 of CEUR Workshop Proceedings, pages 129–142. CEUR-WS.org, 2023. URL: https://ceur-ws.org/Vol-3587/4564.pdf.
  • [3] M. Bona. Handbook of Enumerative Combinatorics. Discrete Mathematics and Its Applications. CRC Press, 2015. URL: https://books.google.co.jp/books?id=j3kZBwAAQBAJ.
  • [4] Jean-Pierre Borel and Christophe Reutenauer. On Christoffel classes. RAIRO Theor. Informatics Appl., 40(1):15–27, 2006. doi:10.1051/ITA:2005038.
  • [5] Michael Burrows and David J. Wheeler. A block-sorting lossless data compression algorithm. Technical report, Digital Equipment Corporation, Systems Reseach Center, 1994. SRC Research Report 124.
  • [6] Kuo Tsai Chen, Ralph H. Fox, and Roger C. Lyndon. Free differential calculus, IV. The quotient groups of the lower central series. Annals of Mathematics, 68(1):81–95, 1958.
  • [7] Manolis Christodoulakis, Costas S. Iliopoulos, and Yoan José Pinzón Ardila. Simple algorithm for sorting the fibonacci string rotations. In Jirí Wiedermann, Gerard Tel, Jaroslav Pokorný, Mária Bieliková, and Julius Stuller, editors, SOFSEM 2006: Theory and Practice of Computer Science, 32nd Conference on Current Trends in Theory and Practice of Computer Science, Merín, Czech Republic, January 21-27, 2006, Proceedings, volume 3831 of Lecture Notes in Computer Science, pages 218–225. Springer, 2006. doi:10.1007/11611257_19.
  • [8] Maxime Crochemore, Thierry Lecroq, and Wojciech Rytter. Fibonacci Words and Fibonacci Numeration System, pages 17–52. Cambridge University Press, 2021.
  • [9] Gabriele Fici, Giuseppe Romana, Marinella Sciortino, and Cristian Urbina. On the impact of morphisms on BWT-runs. In Laurent Bulteau and Zsuzsanna Lipták, editors, 34th Annual Symposium on Combinatorial Pattern Matching, CPM 2023, June 26-28, 2023, Marne-la-Vallée, France, volume 259 of LIPIcs, pages 10:1–10:18. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPICS.CPM.2023.10.
  • [10] Yuta Fujishige, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Almost linear time computation of maximal repetitions in run length encoded strings. In Yoshio Okamoto and Takeshi Tokuyama, editors, 28th International Symposium on Algorithms and Computation, ISAAC 2017, December 9-12, 2017, Phuket, Thailand, volume 92 of LIPIcs, pages 33:1–33:12. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2017. doi:10.4230/LIPICS.ISAAC.2017.33.
  • [11] Joseph Yossi Gil and David Allen Scott. A bijective string sorting transform. CoRR, abs/1201.3077, 2012. arXiv:1201.3077.
  • [12] Sara Giuliani, Shunsuke Inenaga, Zsuzsanna Lipták, Nicola Prezza, Marinella Sciortino, and Anna Toffanello. Novel results on the number of runs of the Burrows-Wheeler-transform. In Tomás Bures, Riccardo Dondi, Johann Gamper, Giovanna Guerrini, Tomasz Jurdzinski, Claus Pahl, Florian Sikora, and Prudence W. H. Wong, editors, SOFSEM 2021: Theory and Practice of Computer Science - 47th International Conference on Current Trends in Theory and Practice of Computer Science, SOFSEM 2021, Bolzano-Bozen, Italy, January 25-29, 2021, Proceedings, volume 12607 of Lecture Notes in Computer Science, pages 249–262. Springer, 2021. doi:10.1007/978-3-030-67731-2_18.
  • [13] Sara Giuliani, Shunsuke Inenaga, Zsuzsanna Lipták, Giuseppe Romana, Marinella Sciortino, and Cristian Urbina. Bit catastrophes for the Burrows-Wheeler transform. In Frank Drewes and Mikhail Volkov, editors, Developments in Language Theory - 27th International Conference, DLT 2023, Umeå, Sweden, June 12-16, 2023, Proceedings, volume 13911 of Lecture Notes in Computer Science, pages 86–99. Springer, 2023. doi:10.1007/978-3-031-33264-7_8.
  • [14] Dominik Kempa and Tomasz Kociumaka. Resolution of the Burrows-Wheeler transform conjecture. Commun. ACM, 65(6):91–98, 2022. doi:10.1145/3531445.
  • [15] Dominik Kempa and Nicola Prezza. At the roots of dictionary compression: string attractors. In Ilias Diakonikolas, David Kempe, and Monika Henzinger, editors, Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, Los Angeles, CA, USA, June 25-29, 2018, pages 827–840. ACM, 2018. doi:10.1145/3188745.3188814.
  • [16] Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Toward a definitive compressibility measure for repetitive sequences. IEEE Transactions on Information Theory, 69(4):2074–2092, 2023. doi:10.1109/TIT.2022.3224382.
  • [17] Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10:R25, 2009.
  • [18] Heng Li and Richard Durbin. Fast and accurate short read alignment with burrows–wheeler transform. Bioinformatics, 25(14):1754–1760, May 2009. doi:10.1093/bioinformatics/btp324.
  • [19] Sabrina Mantaci, Antonio Restivo, Giovanna Rosone, Marinella Sciortino, and Luca Versari. Measuring the clustering effect of BWT via RLE. Theor. Comput. Sci., 698:79–87, 2017. doi:10.1016/J.TCS.2017.07.015.
  • [20] Takuya Mieno, Shunsuke Inenaga, and Takashi Horiyama. RePair grammars are the smallest grammars for Fibonacci words. In Hideo Bannai and Jan Holub, editors, 33rd Annual Symposium on Combinatorial Pattern Matching, CPM 2022, June 27-29, 2022, Prague, Czech Republic, volume 223 of LIPIcs, pages 26:1–26:17. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2022. doi:10.4230/LIPICS.CPM.2022.26.
  • [21] Gonzalo Navarro. Indexing highly repetitive string collections, part II: compressed indexes. ACM Comput. Surv., 54(2):26:1–26:32, 2022. doi:10.1145/3432999.
  • [22] Gonzalo Navarro, Carlos Ochoa, and Nicola Prezza. On the approximation ratio of ordered parsings. IEEE Trans. Inf. Theory, 67(2):1008–1026, 2021. doi:10.1109/TIT.2020.3042746.
  • [23] James A. Storer and Thomas G. Szymanski. Data compression via textual substitution. J. ACM, 29(4):928–951, 1982. doi:10.1145/322344.322346.
  • [24] Edouard Zeckendorf. Representations des nombres naturels par une somme de nombres de Fibonacci on de nombres de Lucas. Bulletin de La Society Royale des Sciences de Liege, pages 179–182, 1972. URL: https://cir.nii.ac.jp/crid/1570009749187075840.
  • [25] Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory, 23(3):337–343, 1977. doi:10.1109/TIT.1977.1055714.