A Survey of the Bijective Burrows-Wheeler Transform
Abstract
The Bijective BWT (BBWT), conceived by Scott in 2007, later summarized in a preprint by Gil and Scott in 2009 (arXiv 2012), is a variant of the Burrows-Wheeler Transform which is bijective: every string is the BBWT of some string. Indeed, the BBWT of a string is the extended BWT [Mantaci et al., 2007] of the factors of its Lyndon factorization. The BBWT has been receiving increasing interest in recent years. In this paper, we survey existing research on the BBWT, starting with its history and motivation. We then present algorithmic topics including construction algorithms with various complexities and an index on top of the BBWT for pattern matching. We subsequently address some properties of the BBWT as a compressor, discussing robustness to operations such as reversal, edits, rotation, as well as compression power. We close with listing other bijective variants of the BWT and open problems concerning the BBWT.
Keywords and phrases:
Burrows-Wheeler Transform, compression, text indexing, repetitiveness measure, Lyndon words, index construction algorithms, bijective string transformationFunding:
Hideo Bannai: JSPS KAKENHI Grant Number JP24K02899Copyright and License:
2012 ACM Subject Classification:
Theory of computation Data structures design and analysis ; Mathematics of computing Combinatorics on wordsEditors:
Paolo Ferragina, Travis Gagie, and Gonzalo NavarroSeries and Publisher:
Open Access Series in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
The famous Burrows-Wheeler transform [17] (BWT) of a string is defined as the string obtained by concatenating the last letter of all rotations of the string, in the lexicographic order of the rotations. The BWT is often described as a reversible transform, meaning it must be injective, but clearly it is not, since all rotations of the string share the same output. Therefore, to make it injective and thus reversible, information to encode which of the strings among all rotations was the original is required; typically in the form of an explicit end-of-string symbol (often denoted by ), or an integer indicating the position in the BWT that corresponds to the lexicographic rank of the original string among all rotations, and therefore to the last character of the original string. It is also clear by a simple counting argument, that the transform is not surjective, meaning that there exist strings which are not images of BWT. Note that this is true even with the addition of the extra information to make it injective [37]. For example, (assuming ) is not a BWT image.111As a famous example, banana is not a BWT image, nor is ; in other words, one cannot insert anywhere for it to become a BWT image. For more on this problem, see [37].
The bijective variant of the BWT, Bijective BWT (BBWT, a.k.a. BWTS for BWT Scottified) considered in this paper was first announced by David Allen Scott in a Usenet thread in December 2007 [66], where the process of making the BWT bijective was termed the Scottification of the BWT. While Scott provided a working implementation together with explanations using concrete examples, a rigorous formal description of the procedure was not given, and the excitement of the discovery was unfortunately met with great skepticism towards its correctness and usefulness. In 2009, Joseph Gil, together with Scott, made available a preprint on Scott’s website [68] – later uploaded to arXiv in 2012 [32] – which contained a more accessible description of the process and a proof of correctness. Also in 2009, Kufleitner [47] provided another formal description of the BBWT, along with another BWT variant (which we will not cover in this survey). Scott later mentions [67] that his paper with Gil had been submitted to the Annual ACM-SIAM Symposium on Discrete Algorithms (SODA) but was rejected, highlighting the reviewer comment:
“Of course one can easily achieve invertability [sic!] in Burrows-Wheeler transform, by appending a special character to the input string.”
To the best of our knowledge, the paper has not been subsequently published in a journal or proceedings222An interesting parallel with the original BWT paper [17]..
Three advantages of BBWT over BWT are mentioned by Gil and Scott. First, it allows to get rid of the extra information mentioned above. Gil and Scott report empirical evidence that compression using BBWT is almost always better (albeit only slightly) than compression using BWT, on the Calgary corpus [9]. Second, they claim that in the context of applying encryption after compression, bijectivity is preferable, since not being bijective would reveal information to an attacker. Third, it is simply mathematically more elegant.
As can be seen from the reviewer’s comment to their paper (which misses the point), the difference between BBWT and BWT may be quite subtle and perhaps difficult to understand. Although the immediately observable merits of BBWT may have been scarce, some recent works have started to realize and tap into the value of this transform.
2 Basics
Let denote the alphabet, i.e., the set of symbols (or characters), and the set of (finite) strings (or texts) over . We index strings from . We denote the empty string by , the length of a string by , and the substring from to by , where if . We will also use to denote . The reverse of string is denoted . Let . Then, is the rotation of that starts at position . A cyclic string is a string whose first character is considered as being subsequent to its last character . In other words, arithmetic on the positions of is done modulo .
A string is called primitive if it cannot be represented as for some string and integer ; otherwise is called a power. A string is a rotation (or conjugate) of a string if there exist such that and . Conjugacy is an equivalence relation, and it is easy to see that the conjugacy class has cardinality if and only if is primitive.
We denote the number of the maximal same symbol runs, or simply runs, of a string by . For example, . Note that is the size of the run-length encoding of .
We assume a total order on and denote by the lexicographic order on induced by this total order. Furthermore, we define the omega-order (also called infinite periodic order [32, 64]) on as follows: if either , or and . Here, denotes the infinite string obtained by concatenating string an infinite number of times. Note that the two orders and coincide if neither of two strings is a proper prefix of the other but may differ otherwise. For example, but . The omega-order can also be viewed as an order on cyclic strings.
Since all rotations of a string have the same length, the two orders and coincide on the conjugacy classes. Given a string , let denote the smallest rotation of in lexicographic, or equivalently in this case, in omega-order, i.e. . If then is called a necklace333Note that a necklace is sometimes defined as a conjugacy class w.r.t. rotation. The two definitions are closely connected: a necklace viewed as a string is a particular representative of a necklace viewed as a conjugacy class. In this sense, e.g. for counting arguments, the two definitions are equivalent.; if is in addition primitive, then it is called a Lyndon word. Accordingly, is also referred to as necklace rotation resp. Lyndon rotation of . For instance, is a necklace and is a Lyndon word.
The Lyndon factorization [20] of a string is a factorization where for all , and is a Lyndon word, and . Note that our definition differs slightly from the usual one in that also here we are using the omega-order rather than the lexicographic order. Since on Lyndon words, the two orders coincide [27, Theorem 20], the two definitions are equivalent. We will refer to above, for , as the th Lyndon factor of . It is known that the Lyndon factorization of a string is unique [20] and can be computed in linear time in a left-to-right scan with Duval’s algorithm [28].
The suffix array of a string of length is a permutation of the index set , defined as follows: if the th suffix of is the th in lexicographic order of all suffixes (counting from ). The conjugate array (also called circular suffix array and denoted ) of a string is a permutation of the index set , defined as follows: if the th rotation is the th in lexicographic order (or, in this case equivalently, in omega-order) among all rotations of . (If is not primitive, then there are equal rotations, which are listed in in text order.) Note that and coincide if is a Lyndon word, or if it is terminated by an end-of-string symbol, but otherwise they can differ.
Standard Sturmian words, or simply standard words, are strings that can be defined via a so-called directive sequence: this is an infinite sequence of integers such that and for all , . This sequence generates a sequence of finite strings , called standard words, as follows: , , and for . For instance, the directive sequence generates the well-known sequence of finite Fibonacci words: , …For more details, see [52, Ch. 2].444The indexing of Fibonacci words is not uniform in the literature, so one may find a shift by one or two in different papers. In particular, even and odd order Fibonacci words may be exchanged.
| 10 | aaabracadabr | r | 10 | 0 |
| 11 | aabracadabra | a | 0 | 1 |
| 7 | abraaabracad | d | 9 | 2 |
| 0 | abracadabraa | a | 1 | 3 |
| 3 | acadabraaabr | r | 11 | 4 |
| 5 | adabraaabrac | c | 8 | 5 |
| 8 | braaabracada | a | 2 | 6 |
| 1 | bracadabraaa | a | 3 | 7 |
| 4 | cadabraaabra | a | 4 | 8 |
| 6 | dabraaabraca | a | 5 | 9 |
| 9 | raaabracadab | b | 6 | 10 |
| 2 | racadabraaab | b | 7 | 11 |
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| a | b | r | a | c | a | d | a | b | r | a | a |
2.1 The BWT and the BBWT
The Burrows-Wheeler Transform (BWT) of a string is a permutation of the characters of . Consider the multiset of rotations of , . If for a primitive string , then each element of will appear times in this multiset. Now can be obtained by concatenating the last characters of these strings when they are given in omega-order (or, equivalently in this case, in lexicographic order). For example, . See also Figure 1. The number of runs of the BWT of a string is denoted , i.e. . Thus, .
Let be a string of length . The standard permutation of is a permutation of the index set of , defined as follows: if either , or and . In other words, is the position of character in the stable sort of the characters of . For example, for , , where we give both in one-line notation and in cycle notation. When is the BWT of some string, then the standard permutation of is also known as LF-mapping [29], which can be used to recover the original string up to rotation. For example, given as above, we can recover the Lyndon word with , spelling it back-to-front, as follows: , and for , . We get , which is the Lyndon rotation of abracadabraa. The process can be modified to uniquely recover the input string by either adding the lexicographic rank of the rotation to be recovered, or by adding a unique end-of-string marker to the string.
Since all rotations of have the same BWT, it is clear that the BWT is not a bijection on . It is well known that a string is the BWT of a primitive string if and only if is cyclic. Likhomanov and Shur gave a more general characterization [50]: is the BWT of some string if and only if the number of cycles of equals the greatest common divisor of the lengths of the runs of . This characterization encompasses powers, as well.
In fact, the number of strings of length which are the BWT of some string (referred to as BWT images in [50]) equals the number of necklaces of length . This follows from the definition of the BWT and the fact that the LF-mapping uniquely recovers the conjugacy class of the input string. In other words, the BWT is a bijection between necklaces of length and BWT images of length . The number of necklaces of length is , where is the alphabet size and is the number of Lyndon words of length over an alphabet of size , or equivalently, the number of conjugacy classes of primitive words of length . This can be seen by a counting argument on necklaces, distinguishing the case where the necklace is primitive (of which there are many), and the case where it is a power , with primitive (in which case is a divisor of ). It is known [51] that , where is the Möbius function defined as: , if is the product of distinct primes, or otherwise ( is divisible by a square number). Note that for all , the number of necklaces of length is at least , with equality if and only if is prime.555It is sometimes imprecisely stated, e.g. [32, 47], that th of all strings are BWT images, but this is only true for prime.
The extended BWT (eBWT) [54] is essentially666The original definition of eBWT does not assume the strings to be Lyndon words, but stores information concerning the rotations of the input strings. a bijection between multisets of Lyndon words of total length and strings of length . It is defined as the concatenation of the last characters of the rotations of the input strings, given in omega-order. Underlying the is the observation that the BWT is a special case of the inverse of the Gessel-Reutenauer bijection [31] between permutations of a given type and descent set and strings of a given type (see [21] for more details).
The is a straightforward generalization of the BWT in that for any primitive string , . Given a string , the standard permutation of (i.e., its LF-mapping) can be used to construct a multiset with equal to in the same way as before, yielding one string per cycle of . For more details on the , see [18] in this volume. Note that the eBWT is surjective: every string is the eBWT of something. However, the preimage is not necessarily a string; in general, it is a multiset of strings.
The bijective BWT (BBWT), on the other hand, is a variant of the BWT which is a bijection from to : every string is the BBWT of some string (of the same length).
Definition 1 (Bijective BWT).
The Bijective BWT, , of a string is the string obtained by concatenating the last character of the rotations of the multiset of the Lyndon factors of , in -order of the rotations.
The BBWT is thus the eBWT of the multiset of Lyndon factors of (see Example 2). The inverse of the BBWT is constructed as follows: Take a string , and let be the multiset of Lyndon words that we get via the LF-mapping from , i.e., one string per cycle of . Now concatenate the strings in non-increasing order. This yields the unique string whose BBWT is .
We define the generalized conjugate array of a string as follows: if the conjugate of starting in position is th in omega-order among all conjugates of the Lyndon factors of , where is the Lyndon factor containing position of .
Example 2.
The BBWT of the th Fibonacci word is given by . The Lyndon factorization of is . See Figure 2 for a visualization. We can recover the original string as follows. The stand- ard permutation of is . From this, we get: aab, aab, aabab, ab. Sorting these Lyndon words yields , and thus, we get .
| 7 | aabaababaabab | b | 0 | 7 | aab | b |
| 2 | aababaabaabab | b | 1 | 10 | aab | b |
| 10 | aababaababaab | b | 2 | 2 | aabab | b |
| 5 | abaabaababaab | b | 3 | 8 | aba | a |
| 0 | abaababaabaab | b | 4 | 11 | aba | a |
| 8 | abaababaababa | a | 5 | 5 | abaab | b |
| 3 | ababaabaababa | a | 6 | 3 | ababa | a |
| 11 | ababaababaaba | a | 7 | 0 | ab | b |
| 6 | baabaababaaba | a | 8 | 9 | baa | a |
| 1 | baababaabaaba | a | 9 | 12 | baa | a |
| 9 | baababaababaa | a | 10 | 6 | baaba | a |
| 4 | babaabaababaa | a | 11 | 4 | babaa | a |
| 12 | babaababaabaa | a | 12 | 1 | ba | a |
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| a | b | a | a | b | a | b | a | a | b | a | a | b |
Notice that the standard BWT can be considered as a special case of BBWT since . Indeed, the two transforms coincide exactly on the set of necklaces:
Proposition 3 ([12, Proposition 1]).
if and only if is a necklace.
If we fix the inverse BWT to be always the necklace representative of the conjugacy class, then one can consider the BWT as a bijection from a subset of strings (necklaces) to a subset of strings (BWT images). The BBWT thus is a generalization of the BWT, since it is a bijection from to and coincides with the BWT on this subset.
Similarly to the BWT, we will be interested in the number of runs of the BBWT of a string , denoted , i.e. . Thus, .
In the rest of the paper, denotes the length of the string under investigation (if no misunderstanding is possible).
3 Construction
Gil and Scott [32] sketched an algorithm based on the linear-time suffix array construction algorithm DC3 by Kärkkäinen, Sanders, and Burkhardt [41], for which they claim that the sorting “shall require time”. Unfortunately, the authors seem not to address the soundness of their algorithm nor provide source or pseudo code to give practical evidence that their approach indeed computes the BBWT. Mantaci et al. [54, before Proposition 10] conjectured that a generalization of DC3 should make it possible to compute the eBWT (and hence the BBWT) in linear time.
3.1 Dynamic construction
The first analysis of a BBWT construction algorithm with sound time bounds is due to Bannai et al. [6], who draw a connection to the algorithm of [14] constructing the eBWT, and were able to adapt this algorithm for computing the BBWT in time. This BBWT construction algorithm works in an online fashion, i.e., it builds the BBWT per Lyndon factor, starting with the first one, and then incrementally indexes each further factor until adding the last Lyndon factor. Compared to the BWT, there is no known algorithm computing the BWT in an online model that allows reading a specific prefix of the text (such as its Lyndon factors). However, it is possible to construct the BWT online by reading the text from the reverse order (e.g., [22]) – a setting for which we do not know whether a solution for the BBWT exists. The main idea of the BWT construction algorithm is that it computes the BWT of , where is a sentinel character smaller than all other characters appearing in . The construction starts with BWT just storing , and at any time later, the BWT contains exactly one at a specific position. By the definition of the LF mapping, going from this position backwards would give the last character of . Now, assume that we have built the BWT of and want to insert . For that, we perform a backward search step from the position of for the character . Say we end up at a position , then we exchange with but insert at position and recurse, see Figure 3 for a visualization. The same trick works without for the BBWT if we process the Lyndon factors sequentially, starting from the leftmost and lexicographically largest one. This is because, when inserting a new Lyndon factor , we know that it is lexicographically smaller than all other Lyndon factors. Hence, we can insert the last character of at the first position in the BBWT and backward search to find the insertion position of the previous character in . This gives the following complexities.
Theorem 4 ([6, Theorem 22]).
We can construct the BBWT of a string of length in time in an online model, where we receive at each step one Lyndon factor of the input.
3.2 Linear-time algorithms
In this section, we will sketch the two existing linear-time algorithms for BBWT construction. Both algorithms are non-trivial modifications of linear-time suffix array construction algorithms. The first, due to Bannai et al. [7] (2021), is based on the well-known SAIS-algorithm of Nong et al. [63]. The second, due to Olbrich et al. [64] is a modification of the algorithm GSACA by Baier [3]. In what follows, we will sketch the proofs of the following theorem:
Theorem 5 (Linear time construction of BBWT [7, 8, 15, 64]).
We can construct the BBWT of a string of length in time.
In the case of a string that is terminated by an end-of-string character, the BWT can be computed in linear time from the suffix array, by noticing that if , and otherwise. This, however, uses the fact that the last suffix is the smallest among all suffixes, which is not necessarily the case in general. In actual fact, the conjugate array would be needed, since for all , we have , where the subtraction is modulo . One solution is to first compute the Lyndon rotation of because for Lyndon words, the suffix array and the conjugate array coincide. The first algorithm that directly computes the BWT of a string without an end-of-string character, and without computing Lyndon rotations, was given in [15] (and was inspired by the SAIS-based algorithm [7] we are about to see).
For the BBWT, we need is the of , due to the following connection:
The two linear-time algorithms below take advantage of this idea by modifying known SA-construction algorithms to construct the instead of the .
3.2.1 A SAIS-based algorithm
The first -time construction algorithm of the BBWT was presented in [7, 8]. It is an adaptation of the SAIS (Suffix Array by Induced Sorting) algorithm [63], a recursive linear-time algorithm. The algorithm of [7, 8] was then adapted and simplified for eBWT construction by Boucher et al. [15]. Some of the simplifications are also applicable for BBWT construction, so our presentation here is a mixture of the two.
We first give a quick recap of SAIS. The SAIS algorithm first categorizes each text position in - or -type, according to whether the suffix is larger () or smaller () than the next suffix . -type positions following -type positions are called -positions (for “leftmost ”). The recursive step is done on a new text, where each so-called LMS-substring of is replaced by a meta-character; LMS-substrings are minimal substrings containing two consecutive LMS-positions. The meta-characters are assigned to the LMS-substrings respecting their LMS-order , a slight modification of the lexicographic order.777Even though the original definition is more complex, it can be shown that for two LMS-substrings : if either is a proper prefix of , or neither is a prefix of the other, and . For a suffix of let be its LMS-prefix, i.e., is the prefix of that is a suffix of the LMS-substring in which lies (and therefore, the full LMS-substring if is an LMS-position). Nong et al. [63] show that if then , while if , then the order is determined by the subsequent LMS-substrings. This is the basis of the recursive step. The recursion terminates when all meta-characters are distinct, at which point the suffix array of the meta-text on the current recursion depth can be trivially computed. The suffix array of the meta-text implies the relative order of the suffixes starting at LMS-positions in , which is then used to induce the -positions of the remaining suffixes in two linear passes over , the first left-to-right, inducing -type positions, the second right-to-left, inducing -type positions.
This algorithm has to be adapted for (a) computing the omega-order, rather than the lexicographic order, and (b) between conjugates of different strings, rather than suffixes of the same string.
Let us first assume that there are no Lyndon factors of length one. The first modification is that the position types must be defined cyclically within the Lyndon factors. This can be done in linear time, similarly to the SAIS algorithm, using the fact that the smallest rotation of Lyndon words start in their first position. Second, it can be shown that, using the cyclically defined types, the LMS-substrings and LMS-prefixes, defined cyclically, have a similar property: then , while if , then the order is determined by the subsequent LMS-substrings [15, Lemma 4]. This again enables recursion, using the additional insight that the starting positions of Lyndon factors constitute a subset of the starting positions of the LMS substrings [8, Lemma 3.4]. This implies that the meta-string’s Lyndon factorization is the one induced by the Lyndon factorization of . Finally, the inducing step (inducing the using the relative order of the LMS-conjugates) can be done analogously to SAIS, because here, too, all -type conjugates come before all -type conjugates in the same bucket.
It can also be shown that length- strings are always placed between - and -type conjugates in the same bucket. Accordingly, the algorithm removes length- Lyndon factors at the beginning and places them in the correct place at the end, when the rest of the has been filled in. For an example, see Example 6 and Figure 4.
Finally, Boucher et al. [15] showed that essentially the same algorithm works on any multiset of input strings, given in any order – their work targets the eBWT construction, where the input strings are not assumed to be Lyndon words, nor is it assumed that they are given in a relative order. The main algorithmic change in that case is in how the type assignment is done (scanning each string at most twice).
Example 6.
We give an example with a recursive call in Figure 4, on the string . We start on the upper left box with the classification of the text positions of into - and -types, omitting positions that correspond to Lyndon factors of length . Each succeeding an (considered cyclically) gets a star () to denote an LMS-type position. The Lyndon factorization of is denoted by vertical bars between the characters of . Next we replace the LMS-substrings by meta-characters , according to their rank in LMS-order, to form the meta-text based on these meta-characters. Since not all characters of are distinct, we recurse on (right side); here, all LMS conjugates are distinct such that we can immediately start the inducing step. We insert the entry for at the end of the recursive call at the end of the -bucket of (there being no -type conjugates for the largest character). translates back to give the relative order of the LMS-conjugates of , which is then used to induce the entire , in the usual way, except that induces cyclically w.r.t. the factor it is in. Finally, the entries for are inserted at the beginning of the -bucket of at the end of the execution (there being no -type conjugates for the smallest character). We give the final at the bottom of the figure.
3.2.2 A GSACA-based algorithm
Baier [3] in 2016 presented the first non-recursive linear-time suffix array construction algorithm, called GSACA. The algorithm came with an implementation, which was, however, not competitive with the best practical suffix construction tools. A modified version by Bertram et al. [10] later proved to be very efficient practically, as well. The original GSACA algorithm was then adapted for eBWT and BBWT construction by Olbrich et al. [64].
As was shown in [30], the GSACA algorithm uses Lyndon prefixes to compute a partial order of the suffixes. Given a string , its Lyndon prefix is the longest prefix which is a Lyndon word, which is also the first factor in its Lyndon factorization [28]. For example, the Lyndon prefix of is . GSACA first groups the suffixes according to their Lyndon prefixes (Phase 1). This order then gets refined by sorting the suffixes within the same group (Phase 2). Since the groups appear in lexicographic order according to the groups’ Lyndon prefixes, this yields the final suffix array. The crucial observation is that if are the Lyndon prefixes of strings , respectively, then implies .
Astonishingly, the distribution of the suffixes into their groups can be done in one single right-to-left scan. First the suffixes are partitioned into groups according to their first character; these are necessarily the first characters of their Lyndon prefixes. This initial partition is then refined by removing elements from groups and extending their labels to longer prefixes, where the label of a group is a common prefix shared among all group elements. At the end of this scan, each element will be in a group whose label equals the suffix’s Lyndon prefix. This is done as follows: proceeding from the highest group towards the lowest (i.e., right to left), for every suffix in the current group, we will take another suffix from a lower group, if such a exists, and insert it into a new group right after its current group, with its label extended by the label of ’s group. Here, . See Figure 5 for an example. This completes Phase 1.
Now, in Phase 2, the correct order within each group is induced by starting from the smallest suffix, which in the original algorithm, is the end-of-string character at position . In the modified algorithm for computing the BBWT, we need to take care of the circular manner of inducing within Lyndon factors. Note that with the Lyndon grouping from Phase 1, we also have the Lyndon factorization of , so we can use the first position of each Lyndon factor to start the inducing step.
Example 7.
We give an example on the same string as in Example 6, see Figure 5. In Phase 1, we first place all suffixes in a group according to their first character. We then start with the highest group, the one with label r: for the element , , which is in group . We remove from its group and move it into a new group, immediately after, with label , which is the concatenation of the label of ’s group (d) and the label of ’s group (r). Similarly, and cause the removal of resp. from their initial group and are inserted into a new group, immediately after, with label br. Then we move to the group dr, whose only element does not have any element to induce. in group d induces the removal of and its insertion into the new group ad, etc. The resulting Lyndon grouping is shown at the bottom left of the figure. In Phase 2, each group is sorted, via induced sorting with one pass from left to right. In (a) we show the result of the original algorithm constructing the of , from which the BWT can be computed. (Note that in that case, the algorithm assumes that has a final dollar-symbol.) In (b) we show the result if the inducing is done w.r.t. Lyndon factors, to get the and thus the BBWT. Differences between the two must necessarily lie within Lyndon groups. For example, note the difference in the last group (labeled r), where is now placed after the other two elements because it is followed by d in its Lyndon factor dr (and not by a as in ). We give at the bottom of the figure.
3.3 In-place construction
Köppl et al. [46] discovered that the BBWT construction algorithm of Theorem 4, which incrementally builds the BBWT from the text by one Lyndon factor at a time, can be used in the setting of in-place construction, where only additional working space is allowed to turn into its BBWT. Here, the main observation is that the in-place construction algorithm of [22] for the BWT can be changed to compute the BBWT instead of the BWT. That is because we can compute the Lyndon factorization with Duval’s algorithm with constant space while constructing the BBWT. In detail, we let Duval’s algorithm detect the next Lyndon factor, then pause it to consume this factor, and recurse by resuming Duval’s algorithm.
Theorem 8 (in-place construction [46]).
We can compute the BBWT in time using words of working space.
For restoring the original text from the BBWT, there is a need for supporting the FL-mapping, the inverse of the LF-mapping. The FL-mapping can be implemented by a select query on , which however needs time for a solution with constant working space [59] for a constant .
Theorem 9 ([46]).
We can invert the BBWT in time using words of working space for a constant .
A more involved analysis revealed that it is even possible to turn BWT into BBWT directly without inversion using time and working space. The idea is to run Duval’s algorithm using FL-mappings to detect the Lyndon factorization. Since Duval’s algorithm uses three pointers to text positions, we can simulate the pointers by pointers into the BWT and use the FL-mapping to advance a pointer. When a Lyndon factor has been detected, we try to move such that the Lyndon factor has its own orbit in the standard permutation of the BBWT. However, this change can also break the cycle of the remaining BWT, for which additional moves have to be performed.
The authors also gave an algorithm for computing the run-length compressed BBWT online, Lyndon factor by Lyndon factor. Similar to computing the run-length compressed BWT online reading the reversed text, the authors also show how to compute the run-length compressed BBWT in space and time.
3.4 Implementations
On the practical side, we are aware of the implementations by David A. Scott888https://web.archive.org/web/20210615215623/http://bijective.dogma.net/bwts.zip, Branden Brown999https://github.com/zephyrtronium/bwst, Yuta Mori in his OpenBWT library101010https://web.archive.org/web/20170306035431/https://encode.ru/attachment.php?attachmentid=959&d=1249146089, and of Neal Burns111111https://github.com/NealB/Bijective-BWT. While the first two are naive but easily understandable implementations calling general sorting algorithms on all conjugates to directly compute the BBWT, the implementation of Yuta Mori seems to be an adaptation of the SAIS algorithm to induce the BBWT. The implementation of Neal Burns takes an already computed suffix array as input, and transforms into by shifting values to the right until they fit. Hence, the running time is based on the lengths of these shifts, which can be , but seem to be negligible in practice for common texts.
Finally, the two linear-time algorithms we presented, due to Bannai et al. [8]121212https://github.com/mmpiatkowski/bbwt and Olbrich et al. [64]131313https://gitlab.com/qwerzuiop/lfgsaca, have both been implemented.
4 Indexing
The eBWT is a generalization of the BWT for multiple input texts. As such, it is possible to build an FM-index-like data structure upon the eBWT. Indeed, a more advanced compressed text index built upon the eBWT, the so-called extended -index, was recently presented in [16], which can efficiently answer pattern matching queries while requiring only space.
The reason this kind of index is not immediately applicable for the BBWT is because, by interpreting the BBWT as the multiset of Lyndon factors of , we would be restricted to patterns that occur within the Lyndon factors. In order to use the BBWT for an index for regular pattern matching on the original text, we must (a) remove superfluous matches that match the Lyndon factor in a cyclic sense but do not occur in the original string, and (b) find a way to detect matches that span multiple Lyndon factors, as these are not found by the standard backward search on the BBWT.
Such an index was presented in [6, 8]. We present this index under the facilitated setting that all Lyndon factors of are distinct, i.e., with . This imposes not a restriction because we first detect the actual powers in a precomputation step before discarding duplicate Lyndon factors. After building the index we will introduce, we augment it such that the correct numbers are counted for each Lyndon factor [8, after Theorem 4.7]. We also limit our survey to counting queries, i.e., to return the number of occurrences of a pattern in the text.
The index we are going to present works like the FM-index. The only difference is that we need to take action whenever the LF-mapping interval contains a position corresponding to the starting position of a Lyndon factor. Otherwise, it applies the backward search step in exactly the same way as the FM-index does. This is because a Lyndon word cannot cross two Lyndon factors of the text, meaning that any pattern that is a Lyndon word cannot be the suffix of some and the prefix of some , since necessarily . Consequently, a Lyndon pattern can be found with the same techniques as used by the FM-index since it can only occur inside a Lyndon factor of the text. For finding all patterns , we use the Lyndon factorization of -by our observation, we can be assured to match at least the last factor of correctly.
By the above observation, we are sure that we can find the last Lyndon factor of by the backward search without any modification. However, a backward search step of the preceding character may already give unexpected answers if this last factor is a prefix of a factor of the text. For instance, for the particular case that the pattern ends with , the LF-mapping may report an occurrence of this pattern (regardless of the remaining prefix) because the LF-mapping extracts the original cyclic Lyndon words by iterative application. Hence, only at those backward search steps that read the last character of a previous Lyndon factor of the pattern, we can observe phenomena that need to be taken care of, namely:
- a superfluous match:
-
caused by matching with the same Lyndon factor of the text, and
- a missed match:
-
caused by the fact that we do not check the characters starting before in text order for a potential occurrence.
Both phenomena happen at most times, where is the number of Lyndon factors of . Each time, we need to take care of at most one superfluous and one missed match. We track these individual matches by mapping them individually with the LF-mapping such that the final cost for matching are rank queries. To get the final result, we note that after the longest so-called significant suffix [39] of has been matched, there is only one match remaining to keep track of. Since the longest significant suffix can contain different Lyndon factors, we obtain the following result.
Theorem 10 (Pattern matching using BBWT [6, 8]).
Given a text and a pattern of length , we can compute all occurrences of in with the FM-index built on with rank/select operations.
The rank/select operations in Theorem 10 are not only necessary for the backward search, but for a bit vector with rank/select support marking the positions in corresponding to the distinct Lyndon factors.
The BBWT can also be run-length compressed into space while still supporting count queries. For that, the bit vector can be entropy-compressed [8].
5 Compression
The BWT was originally introduced for compression: Burrows and Wheeler state that the BWT “tends to group characters together so that the probability of finding a character close to another instance of the same character is increased substantially” [17]. An intuitive reason to why this is so, is that since the rotations of the string are sorted in lexicographic (or -)order, the BWT is a sequence of symbols that precede substrings with a similar context. It has been shown that achieving zero order entropy compression on appropriately partitioned blocks of the BWT achieves higher order entropy compression of the original string [57]. Thus, a string can be compressed by first applying BWT, and then applying some simple compression algorithms such as Move-To-Front encoding or run-length encoding.
While the analysis of the compression performance of the BWT has received much attention, the compression performance of the BBWT is less well studied. Scott [66] and Gil & Scott [32] reported the compression performance of BWT and BBWT on the Large Calgary corpus [9], often used as a benchmark dataset. On all but one of the files, the BBWT resulted in a slightly better compression than the BWT: the average gain was around of the original text size. These data sets are very small by today’s standards. We only know of empirical evaluations on larger data sets which compare the number of runs in the BWT and BBWT for various data sets (the Calgary Corpus, Canterbury Corpus, Pizza&Chili Corpus, and Silesia Corpus), where it is reported that they are roughly equal [8].
Recently, it has become recognized that, for highly repetitive data that are prevalent today in many real-world applications, analyses of compressors in terms of entropy measures, i.e., on statistical properties of the data, are not useful as they do not capture the repetitiveness of the data very well [61]. Instead, a trend is to analyze the compression performance of compressors in terms of the sizes of compressed representations of the individual strings, using representations, in particular, based on dictionary compression, that can better model the repetitiveness of the data.
The systematic study of repetitiveness measures related to dictionary compression was initiated by Kempa and Prezza [44]. Dictionary compression is a family of compressed representations that are essentially based on copy and paste operations, and include representations, for example, those computed by well-known algorithms like LZ77, grammar-based representations (e.g., LZ78, RE-PAIR), run-length encoding, and run-length encoded BWT (RLBWT). See [61] for a comprehensive survey on the subject.
In the following, we introduce results focused on theoretical analyses on the compressiveness of BBWT. Recall that, for a string , denotes the number of runs of and the number of runs of .
5.1 as a measure of dictionary compression
The Run-Length compressed BWT (RLBWT) [53] is a representation of of size . From its definition, it is not clear why the RLBWT can be considered as a form of dictionary compression, since the transform is applied before the run-length encoding. However, suppose for each run in the BWT, we define the reference (copying) of a symbol to point to the previous symbol in the run (except for the first one in the run). Then, if we consider all of these references on corresponding positions of the text, the text can be partitioned into at most phrases, where each phrase is either a single symbol that corresponds to the beginning of a run, or, a maximal substring such that the offsets of the references are the same for all positions in the phrase [62, Theorem 9]141414We note that [62] assumes that strings are terminated with a and the proved bound is . For strings that are not terminated by , there can be phrases, e.g., with .. The latter phrases can then be encoded by two integers – the length and offset, resulting in a representation for of size , known as bidirectional macro schemes (BMSs) [70], the most general (and most powerful) representation in dictionary compression.
The above relation can be shown as follows: Suppose two lexicographically adjacent rotations and have the same BWT symbol (i.e., ). Then, position will reference position , with offset . Due to the property of LF mapping, and will also be adjacent rotations, and, if they have the same BWT symbol as well, position will reference position , with offset and therefore have the same offset. This continues until the adjacent rotations do not have the same BWT symbol, which can happen at most times – at boundaries of runs. A minor point we have overlooked above is that the offsets and would always be the same in the cyclic sense, but they can become different in the linear sense when or , since position wraps around to position . This can happen at most twice. Since there are exactly single symbol phrases, the total number of phrases is at most .
Badkobeh et al. [2] showed that for the Run-Length compressed BBWT (RLBBWT), a similar BMS of size can be induced. An important observation to achieve this bound is that can be lower bounded by the number of distinct Lyndon factors of the Lyndon factorization of [16, 2]. This follows from [16, Corollary 9], which states that:
Lemma 11 ([16, Corollary 9]).
Let be a conjugate-free set of primitive strings of total length , the number of runs of its eBWT. Then . Moreover, if all strings have the same length , then .
Since, in the case of BBWT, where all the Lyndon factors are distinct and primitive, we have:
Corollary 12 ([2]).
For any string , .
Applying the same referencing of symbols for BBWT as BWT described above, the same arguments hold, except that there can be more instances where the references wrap around, i.e., when or corresponds to the beginning of a Lyndon factor. However, this is at most twice for each distinct Lyndon factor, since identical Lyndon factors occur adjacently. Therefore, by Corollary 12, the total number of phrases can be bounded by .
Theorem 13 ([2, Lemma 1]).
For any string , there exists a BMS of size that represents .
Therefore, RLBWT and RLBBWT respectively induce BMSs of size and , where all references point to a smaller (omega-order) position, and can be considered as a form of dictionary compression.
5.2 Relation to LZ77
Denote by the size of the LZ77151515We use the name LZ77 which is prevalent in the literature, though it has been mentioned that LZ76 is more precise [60]. factorization [48] of . It is known that is the size of a smallest BMS where all references point to a smaller (text-order) position. The question of whether can be bounded in terms of was answered by Kempa and Kociumaka [43, Theorem 3.2], who showed that for any string , holds.
Badkobeh et al. [2] showed that this proof can be extended to the case of . The term appears in Kempa and Kociumaka’s proof when considering, for given , the set of all substrings of of length , where can be bounded by since any substring of must have an occurrence containing the last position of an LZ77 phrase (including the last phrase). The main difference in the proof is in the definition of the set , which should be modified to be the set of all substrings of length of , rather than . This can be bounded by , since such a substring is either a substring of , or, is a substring of that is not a substring of . is obtained by applying the result by Urabe et al. [71] that shows . The rest of the proof is essentially the same, giving:
Theorem 14 ([2, Theorem 1]).
For any string , .
5.3 Robustness of
The following results are known about the change that may undergo if some operation is applied to .
5.3.1 String reversal
Giuliani et al. [34] studied the robustness of with respect to string reversal, i.e., how and can differ. They gave an infinite family of strings (which they called Fibonacci-plus words) for which and , i.e., with a runs-ratio that is logarithmic in the length of the string. This lower bound is nearly tight, in view of the known upper bound of [43]. Strings with higher runs-ratio than the family of [34] (albeit only experimentally) were given in [33].
Theorem 15 ([12]).
There exists an infinite family of strings such that and .
The family provided in the proof of Theorem 15 are Lyndon rotations of Fibonacci words: On the one hand, these words have , by Proposition 3 and Theorem 23. On the other, the authors show that has exactly runs. This is done by studying the Lyndon factorization of : the authors show that it consist of the Lyndon rotations of Fibonacci words of order to , with Lyndon rotations of even order words, in increasing order, first, followed by Lyndon rotations of odd order words, in decreasing order, with the last Lyndon factor, a, repeated twice. Therefore, equals , where , since the BWT is invariant w.r.t. rotation and multiple occurrences of a string in a multiset do not impact on the number of runs of the eBWT [54]. The authors then give the exact form of , which is shown to contain exactly runs.
In the same paper, experimental results were given both on (multiplicative difference) and on (additive difference), as well as on the connection between these two functions and the number of distinct Lyndon factors.
5.3.2 Single character edits
A side result of [34] is that the BWT is sensitive to one-character edits, in the sense that there exist strings on which , where results from by appending one character to . Akagi et al. [1] introduced the term sensitivity of a measure to denote the asymptotic change when a string undergoes single-character edits of different types. This was studied in detail in [35, 36] and logarithmic lower bounds were given for the multiplicative sensitivity , and an bound for the additive sensitivity, .
Jeon and Köppl [40] showed that this result can be translated to the BBWT with the same bounds by studying, among others, the Lyndon rotations of Fibonacci words and applying Corollary 12.
Theorem 16 ([40]).
There is a family of strings with and a family of strings with such that and and differ from and by one edit, respectively.
5.4 The relationship between and
Badkobeh et al. [2] considered the difference between and . Notice that a lower bound on the multiplicative difference between and is a lower bound on the multiplicative difference between and for any rotation of due to Proposition 3, since is equivalent to , the RLBBWT of a specific rotation of . A family of strings where could be a factor larger than was reported.
Theorem 17 ([2, Theorem 2]).
There exists an infinite family of strings such that and .
The family for Theorem 17 is the Fibonacci words of odd order, i.e., . It is well known that for any . For , using a result by Melançon on the Lyndon factorization of [58], it is shown that the number of Lyndon factors is . Then, by Corollary 12, is obtained.
The opposite case was shown in [5]: that can be a factor larger than .
Theorem 18 ([5, Theorem 2]).
There exists an infinite family of strings such that and .
The family used in the proof of Theorem 18 was also related to Fibonacci words. It is noticed that prepending a symbol to the Lyndon rotation of a Fibonacci word is conjugate to the string used by Giuliani et al. [34, 35, 36] for BWT, thus giving runs, while is easily verifiable to be . An alternate (perhaps simpler) direct proof based on morphisms is also given in [5].
As for upper bounds, poly-logarithmic bounds were derived from the upper bounds of both and mentioned above, as well as other known bounds and [44, 45, 61].
Theorem 19 ([5, Lemma 11]).
For any and , .
Similar to what was observed for the lower bound, Theorem 19 implies an upper bound on the multiplicative difference between and , as well.
Corollary 20 ([5, Corollary 12]).
For any string , .
5.5 Rotation and BBWT
As seen in Theorem 18, rotating the string can in some cases improve the compression performance of RLBBWT by a factor. Such a rotation can be encoded as an extra integer ( bits). Recall that in order to recover the string, BWT required extra information (an integer of bits161616Here, we do not consider the case of adding an explicit terminal symbol – using an extra explicit terminal symbol outside the alphabet may increase the encoding of each symbol by bit (e.g. when for some ), which would then require an extra bits.) to specify which rotation was the original string. Since the BWT is equivalent to BBWT for a specific rotation of (Proposition 3), it is always possible to find an encoding of a string using RLBBWT preceded by a rotation, that uses at most the space required for RLBWT (and the extra information). It is known that can be computed in linear time [69]. Badkobeh et al. raise the question of whether the optimal rotation, i.e., can be computed in sub-quadratic time, and give partial results towards this goal [2].
Badkobeh et al. further consider a compression scheme where multiple rotations and BBWT operations can be used. They give the following conjecture for strings with the same Parikh vector, where the Parikh vector of a string is a vector of the length of the alphabet storing the frequencies of the characters in .
Conjecture 21 ([2, Conjecture 1]).
Given two strings with the same Parikh vector, it is possible to transform one to the other by using only rotation and BBWT operations.
If the conjecture is true, this implies that any string can be represented, for example, by its Parikh vector and a sequence of integers alternately representing rotations or the number of times BBWT should be applied to obtain the string.
They show that the conjecture holds for the case where the alphabet is binary, or, when each symbol occurs only once.
Theorem 22 ([2, Theorem 4]).
Given two strings of the same length with the same Parikh vector, it is possible to transform one to the other by using only rotation and BBWT operations if all symbols are distinct, or if the alphabet is binary.
The theorem is proved by showing that, under the assumption of the alphabet, a string that is not the lexicographically smallest string can always be transformed, using a combination of rotations and an inverse BBWT operation, into a lexicographically smaller string. The theorem then follows from the bijectivity of rotations and BBWT.
5.6 Clustering effects
As mentioned at the beginning of Section 5, BWT is known to have a property which tends to group characters together. Other than by entropy arguments, this so-called clustering effect has also been analyzed with respect to the run-length encoding.
The binary strings which are clustered completely by BWT have been completely characterized by Mantaci et al. [56]:
Theorem 23 ([56, Corollary 11]).
if and only if is the power of a rotation of a standard Sturmian word.
A similar result was shown for the BBWT in [12]:
Theorem 24 ([12, Theorem 2]).
if and only if for some , or is a power of the Lyndon rotation of a standard word.
In the first case, , while in the second case, , such that is the th power of the Lyndon rotation of a standard word, where .
Note that Fibonacci words (a subset of standard Sturmian words) have many runs: . This implies that in some cases, the BWT is able to transform, with respect to run-length encoding, a least compressible string to a most compressible string. The same holds for , again due to Proposition 3.
It is important to note that the BWT does not always make the string more compressible171717In fact, there can exist no such lossless compression algorithm [49]., and there exist strings for which . For example, . However, Mantaci et al. [55] showed that can never be more than twice the number of runs of :
Theorem 25 ([55, Theorem 3.3]).
For any string , .
Badkobeh et al. [2] showed that the result can be further generalized for BBWT.
Theorem 26 ([5, Theorem 14]).
For any string , .
These results indicate that, with respect to run-length encoding, BWT and BBWT transform the string in an asymmetric way; while it is possible sometimes to make the string drastically compressible (Theorems 23 and 24), it will not make it much less compressible than it currently is (Theorems 25 and 26).
The examples of Theorems 23 and 24 are in some sense remarkable with respect to run-length encoding. However, the compressibility of the string with respect to dictionary compression does not change: the example only shows a compressible string (with respect to and ) being transformed into a compressible string.
Bannai et al. [5] analyze how BWT and BBWT change the string with respect to other repetitiveness measures. For both BWT and BBWT, it is proved that the transforms will not increase the repetitiveness measure by more than a poly-logarithmic factor.
Theorem 27 ([5, Theorem 15]).
, hold for any repetitiveness measure such that and .
The proof is based on the simple observation that since , . A poly-logarithmic bound for BBWT can be shown as well, with the use of Corollary 20. Thus, and .
Furthermore, it is shown that there is an infinite family of strings such that the BBWT transforms a maximally incompressible string, with respect to any measure lower-bounded by , to a very compressible string.
Theorem 28 ([5, Theorem 16]).
There exists an infinite family of strings such that and for any s.t. for all , and for any s.t. for all , where is the size of the smallest grammar compressed representation for .
The family of strings identified is again the Fibonacci words, which are very compressible, but here used as images of BBWT, i.e., is such that for some . It is proved that the string obtained by applying the inverse BBWT to Fibonacci words are not compressible by dictionary compression. Thus, BBWT transforms strings in an asymmetric way with respect to other repetitiveness measures as well. An interesting implication of this result is that in some cases, by applying BBWT in a pre-processing step before applying a dictionary compression, it is possible to transcend any dictionary compressor. It is not yet known whether analogous strings exist for BWT; the same construction using inverse BWT does not work since Fibonacci words are not always BWT images.
The proof of Theorem 28 is based on an elegant characterization of the LF mapping (or ) on the Fibonacci word . It utilizes a bit string representation of integers based on the fact (re)discovered181818See https://proofwiki.org/wiki/Zeckendorf%27s_Theorem/Historical_Note. by Zeckendorf [72], that any integer can be uniquely represented as the sum of a set of distinct and non-consecutive Fibonacci numbers. It is shown that if a position in is represented as a -bit string based on the Zeckendorf representation (for , the th bit is 1, if and only if the st Fibonacci number is used), the bit-string corresponding to the position is a rotation of . This implies that each conjugacy class of -bit strings for positions in correspond to a cycle in the standard permutation of . Then, by showing that the length of cyclic strings corresponding to the cycles of can be bounded by , and, when is prime, that the Lyndon rotation of each of the strings occur uniquely in , it follows that .
6 A class of bijective BWTs
In some sense, it would be more natural to speak of the BBWT as a bijective Burrows–Wheeler transform rather than the bijective one, since it is not the only variant that is an isomorphism on that sorts rotations of substrings of the input via the inverse of the Gessel–Reutenauer transform. In fact, we can exchange the Lyndon factorization with any factorization such that
-
it is uniquely defined,
-
it produces a set of primitive factors, and
-
there is a unique way of restoring the text from this set of primitive strings.
The Lyndon factorization is not the only factorization that has these properties. Daykin et al. [23] introduced the term unique maximal factorization family (UMFF) for factorizations that are unique and greedy in the choice of the next factor. The Lyndon factorization is an example of a UMFF, but not the only one. In [25] and [24], a bijective variant of the BWT was proposed, based on the so-called V-order and B-order, respectively. For both variants, the authors gave a semantic definition similar to Lyndon words, called V-words and B-words, obtained by replacing the lexicographic order with the V-order or the B-order, respectively. Both approaches apply a factorization similar to the Lyndon factorization and sort the conjugates in the respective order. The authors give linear-time constructions for both approaches based on SAIS and show how to recover the original text.
Hendrian et al. [38] proposed a bijective BWT variant based on the Galois factorization [65]. An extension of the Galois factorization is the generalized Lyndon factorization [26], which applies a generalized lexicographic order, using a different order of the characters for each text position to compare. Similarly, we can create a bijective variant of the BWT that is based on the Nyldon factorization [19] or the canonical inverse Lyndon factorization. For the latter, so-called anti-Lyndon words and inverse Lyndon words were proposed in [13]. Anti-Lyndon words are Lyndon words for the inverse order of the characters. An inverse Lyndon word is lexicographically larger than all of its proper suffixes [13, Definition 3.1]. (Inverse Lyndon words can also be characterized via properties of their rotations [13, Proposition 3.1], which we skip here since these properties need further definitions.) In particular, an inverse Lyndon word can have a border, while an anti-Lyndon word cannot. A class of factorizations that decompose a string into inverse Lyndon words was studied in [13]. The authors also gave one canonical factorization that uniquely factorizes a string into inverse Lyndon words, where the inverse Lyndon factors are in lexicographically increasing order, i.e., . However, an obstacle to defining a bijective BWT variant based on inverse Lyndon factors is that an inverse Lyndon word is not necessarily primitive. It would be interesting to study whether any of these factorizations can form a bijective variant of the BWT which can be used for pattern matching.
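To illustrate the distinction between inverse Lyndon and anti-Lyndon words, the following minimal sketch checks both properties directly from their suffix-based characterizations (a word is Lyndon, for a given character order, exactly when it is strictly smaller than all of its proper suffixes). The word bab used below is only an illustrative example: it is an inverse Lyndon word with border b, but not an anti-Lyndon word.

```python
def is_inverse_lyndon(w):
    """w is lexicographically strictly larger than all of its proper suffixes."""
    return all(w[i:] < w for i in range(1, len(w)))

def is_anti_lyndon(w):
    """w is a Lyndon word for the reversed character order, i.e. strictly smaller
    than all of its proper suffixes when the order of the letters is inverted."""
    inv = lambda s: [-ord(c) for c in s]   # negating code points reverses the order
    return all(inv(w) < inv(w[i:]) for i in range(1, len(w)))

print(is_inverse_lyndon("bab"), is_anti_lyndon("bab"))  # True False: "bab" has border "b"
print(is_inverse_lyndon("ba"),  is_anti_lyndon("ba"))   # True True
```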
7 Conclusion
We presented the BBWT, a bijective variant of the BWT which is mathematically elegant in the sense that it is designed to be invertible without requiring additional information such as end-of-string markers or marking a dedicated position in the BWT. We have surveyed the following topics.
Construction.
We studied the history of construction algorithms, starting with time, leading to current time construction algorithms. It is also possible to construct the BBWT in-place in time. All approaches have in common that they adapt algorithms for suffix array or BWT construction.
Indexing and pattern matching.
The BBWT can be used for text indexing, similar to BWT-based or eBWT-based FM-indexing. For that, a slight modification of the FM-index suffices, which takes special measures whenever the search interval matches a prefix of a Lyndon factor of the pattern with the starting position of a Lyndon factor of the text. This additional handling can slow down the computation by requiring up to a multiplicative factor of additional queries to a rank/select data structure built on the BBWT, where is the pattern length.
Compression performance.
Like the BWT, the BBWT tends to group similar characters together, making it effective for run-length encoding. However, this grouping can be affected by rotating the input text and by the number of Lyndon factors of the input, which serves as a lower bound on the number of character runs. We draw connections to dictionary compression techniques like LZ77 and macro schemes and give pointers to studies that show that the BBWT can achieve comparable or better compression ratios than the BWT in specific cases.
Open problems
-
Can we match a pattern of length with the techniques of Theorem 10 with rank operations? The multiplicative term seems too pessimistic: we are confident that a refined analysis could improve the time complexity.
-
Reachability: Can we express any primitive string by its Parikh vector and a sequence of BBWT applications and rotations such that applying this sequence of transforms on the lexicographically smallest string having as its Parikh vector leads to ? Given the Parikh vector is for an alphabet , the lexicographically smallest string having as its Parikh vector is . Cyclic rotations and the BBWT are bijective transformations, which we can apply sequentially, arbitrarily often and in any order, and it seems that there is always such a sequence of transforms that turns into , cf. Conjecture 21.
-
Can we find the cyclic rotation of any primitive string that leads to the minimum number of runs in the BBWT in time? (Cf. Section 5.5)
-
Can we compute the smallest integer such that in time for any string of length ?
-
How far can and differ, for instance with respect to Hamming distance? This problem is relevant for the analysis of the worst-case complexity of the (practical) algorithm computing from by shifting.
References
- [1] Tooru Akagi, Mitsuru Funakoshi, and Shunsuke Inenaga. Sensitivity of string compressors and repetitiveness measures. Inf. Comput., 291:104999, 2023. doi:10.1016/J.IC.2022.104999.
- [2] Golnaz Badkobeh, Hideo Bannai, and Dominik Köppl. Bijective BWT based compression schemes. In Zsuzsanna Lipták, Edleno Silva de Moura, Karina Figueroa, and Ricardo Baeza-Yates, editors, String Processing and Information Retrieval - 31st International Symposium, SPIRE 2024, Puerto Vallarta, Mexico, September 23-25, 2024, Proceedings, volume 14899 of Lecture Notes in Computer Science, pages 16–25. Springer, 2024. doi:10.1007/978-3-031-72200-4_2.
- [3] Uwe Baier. Linear-time suffix sorting - A new approach for suffix array construction. In Roberto Grossi and Moshe Lewenstein, editors, 27th Annual Symposium on Combinatorial Pattern Matching, CPM 2016, June 27-29, 2016, Tel Aviv, Israel, volume 54 of LIPIcs, pages 23:1–23:12. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2016. doi:10.4230/LIPICS.CPM.2016.23.
- [4] Hideo Bannai and Jonas Ellert. Lyndon arrays in sublinear time. In Proc. ESA, volume 274 of LIPIcs, pages 14:1–14:16, 2023. doi:10.4230/LIPIcs.ESA.2023.14.
- [5] Hideo Bannai, Tomohiro I, and Yuto Nakashima. On the compressiveness of the Burrows-Wheeler transform. In Paola Bonizzoni and Veli Mäkinen, editors, 36th Annual Symposium on Combinatorial Pattern Matching, CPM 2025, June 17-19, 2025, Milan, Italy, volume 331 of LIPIcs, pages 17:1–17:15. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2025. doi:10.4230/LIPICS.CPM.2025.17.
- [6] Hideo Bannai, Juha Kärkkäinen, Dominik Köppl, and Marcin Piatkowski. Indexing the bijective BWT. In Nadia Pisanti and Solon P. Pissis, editors, 30th Annual Symposium on Combinatorial Pattern Matching, CPM 2019, June 18-20, 2019, Pisa, Italy, volume 128 of LIPIcs, pages 17:1–17:14. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2019. doi:10.4230/LIPICS.CPM.2019.17.
- [7] Hideo Bannai, Juha Kärkkäinen, Dominik Köppl, and Marcin Piatkowski. Constructing the bijective and the extended Burrows-Wheeler Transform in linear time. In Pawel Gawrychowski and Tatiana Starikovskaya, editors, 32nd Annual Symposium on Combinatorial Pattern Matching, CPM 2021, July 5-7, 2021, Wrocław, Poland, volume 191 of LIPIcs, pages 7:1–7:16. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPICS.CPM.2021.7.
- [8] Hideo Bannai, Juha Kärkkäinen, Dominik Köppl, and Marcin Piatkowski. Constructing and indexing the bijective and extended Burrows-Wheeler transform. Inf. Comput., 297:105153, 2024. doi:10.1016/J.IC.2024.105153.
- [9] Timothy C. Bell, Ian H. Witten, and John G. Cleary. Modeling for text compression. ACM Comput. Surv., 21(4):557–591, 1989. doi:10.1145/76894.76896.
- [10] Nico Bertram, Jonas Ellert, and Johannes Fischer. Lyndon words accelerate suffix sorting. In Petra Mutzel, Rasmus Pagh, and Grzegorz Herman, editors, 29th Annual European Symposium on Algorithms, ESA 2021, September 6-8, 2021, Lisbon, Portugal (Virtual Conference), volume 204 of LIPIcs, pages 15:1–15:13. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2021. doi:10.4230/LIPICS.ESA.2021.15.
- [11] Elena Biagi, Davide Cenzato, Zsuzsanna Lipták, and Giuseppe Romana. On the number of equal-letter runs of the Bijective Burrows-Wheeler Transform. In Giuseppa Castiglione and Marinella Sciortino, editors, Proceedings of the 24th Italian Conference on Theoretical Computer Science, Palermo, Italy, September 13-15, 2023, volume 3587 of CEUR Workshop Proceedings, pages 129–142. CEUR-WS.org, 2023. URL: https://ceur-ws.org/Vol-3587/4564.pdf.
- [12] Elena Biagi, Davide Cenzato, Zsuzsanna Lipták, and Giuseppe Romana. On the number of equal-letter runs of the Bijective Burrows-Wheeler Transform. Theoretical Computer Science, page 115004, 2024. doi:10.1016/j.tcs.2024.115004.
- [13] Paola Bonizzoni, Clelia De Felice, Rocco Zaccagnino, and Rosalba Zizza. Inverse Lyndon words and inverse Lyndon factorizations of words. Adv. Appl. Math., 101:281–319, 2018. doi:10.1016/j.aam.2018.08.005.
- [14] Silvia Bonomo, Sabrina Mantaci, Antonio Restivo, Giovanna Rosone, and Marinella Sciortino. Sorting conjugates and suffixes of words in a multiset. Int. J. Found. Comput. Sci., 25(8):1161, 2014. doi:10.1142/S0129054114400309.
- [15] Christina Boucher, Davide Cenzato, Zsuzsanna Lipták, Massimiliano Rossi, and Marinella Sciortino. Computing the original eBWT faster, simpler, and with less memory. In Thierry Lecroq and Hélène Touzet, editors, String Processing and Information Retrieval - 28th International Symposium, SPIRE 2021, Lille, France, October 4-6, 2021, Proceedings, volume 12944 of Lecture Notes in Computer Science, pages 129–142. Springer, 2021. doi:10.1007/978-3-030-86692-1_11.
- [16] Christina Boucher, Davide Cenzato, Zsuzsanna Lipták, Massimiliano Rossi, and Marinella Sciortino. -indexing the eBWT. Inf. Comput., 298:105155, 2024. doi:10.1016/J.IC.2024.105155.
- [17] Michael Burrows and David J. Wheeler. A block-sorting lossless data compression algorithm. Technical report, Digital Equipment Corporation, Systems Research Center, 1994. SRC Research Report 124.
- [18] Davide Cenzato, Zsuzsanna Lipták, Nadia Pisanti, Giovanna Rosone, and Marinella Sciortino. BWT for string collections. In Paolo Ferragina, Travis Gagie, and Gonzalo Navarro, editors, The Expanding World of Compressed Data: A Festschrift for Giovanni Manzini’s 60th Birthday, volume 131 of OASIcs, pages 3:1–3:29, 2025.
- [19] Émilie Charlier, Manon Philibert, and Manon Stipulanti. Nyldon words. J. Comb. Theory, Ser. A, 167:60–90, 2019. doi:10.1016/j.jcta.2019.04.002.
- [20] K. T. Chen, R. H. Fox, and R. C. Lyndon. Free differential calculus, IV. The quotient groups of the lower central series. Annals of Mathematics, 68(1):81–95, 1958. URL: http://www.jstor.org/stable/1970044.
- [21] Maxime Crochemore, Jacques Désarménien, and Dominique Perrin. A note on the Burrows-Wheeler transformation. Theor. Comput. Sci., 332(1-3):567–572, 2005. doi:10.1016/J.TCS.2004.11.014.
- [22] Maxime Crochemore, Roberto Grossi, Juha Kärkkäinen, and Gad M. Landau. Computing the Burrows–Wheeler transform in place and in small space. J. Discrete Algorithms, 32:44–52, 2015. doi:10.1016/j.jda.2015.01.004.
- [23] David E. Daykin, Jacqueline W. Daykin, and William F. Smyth. Combinatorics of unique maximal factorization families (UMFFs). Fundam. Informaticae, 97(3):295–309, 2009. doi:10.3233/FI-2009-202.
- [24] Jacqueline W. Daykin, Richard Groult, Yannick Guesnet, Thierry Lecroq, Arnaud Lefebvre, Martine Léonard, and Élise Prieur-Gaston. Binary block order Rouen transform. Theor. Comput. Sci., 656:118–134, 2016. doi:10.1016/J.TCS.2016.05.028.
- [25] Jacqueline W. Daykin and William F. Smyth. A bijective variant of the Burrows-Wheeler Transform using V-order. Theor. Comput. Sci., 531:77–89, 2014. doi:10.1016/J.TCS.2014.03.014.
- [26] Francesco Dolce, Antonio Restivo, and Christophe Reutenauer. On generalized Lyndon words. Theor. Comput. Sci., 777:232–242, 2019. doi:10.1016/j.tcs.2018.12.015.
- [27] Francesco Dolce, Antonio Restivo, and Christophe Reutenauer. Some variations on Lyndon words (invited talk). In Nadia Pisanti and Solon P. Pissis, editors, 30th Annual Symposium on Combinatorial Pattern Matching, CPM 2019, June 18-20, 2019, Pisa, Italy, volume 128 of LIPIcs, pages 2:1–2:14. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2019. doi:10.4230/LIPICS.CPM.2019.2.
- [28] Jean-Pierre Duval. Factorizing words over an ordered alphabet. J. Algorithms, 4(4):363–381, 1983. doi:10.1016/0196-6774(83)90017-2.
- [29] Paolo Ferragina and Giovanni Manzini. Indexing compressed text. J. ACM, 52(4):552–581, 2005. doi:10.1145/1082036.1082039.
- [30] Frantisek Franek, Michael Liut, and William F. Smyth. On Baier’s sort of maximal Lyndon substrings. In Jan Holub and Jan Zdárek, editors, Prague Stringology Conference 2018, Prague, Czech Republic, August 27-28, 2018, pages 63–78. Czech Technical University in Prague, Faculty of Information Technology, Department of Theoretical Computer Science, 2018. URL: http://www.stringology.org/event/2018/p07.html.
- [31] Ira M. Gessel and Christophe Reutenauer. Counting permutations with given cycle structure and descent set. J. Comb. Theory A, 64(2):189–215, 1993.
- [32] Joseph Yossi Gil and David Allen Scott. A bijective string sorting transform. CoRR, abs/1201.3077, 2012. arXiv:1201.3077.
- [33] Sara Giuliani. Sensitivity of the Burrows-Wheeler Transform to Small Modifications, and Other Problems on String Compressors in Bioinformatics. PhD thesis, University of Verona, 2023.
- [34] Sara Giuliani, Shunsuke Inenaga, Zsuzsanna Lipták, Nicola Prezza, Marinella Sciortino, and Anna Toffanello. Novel results on the number of runs of the Burrows–Wheeler-transform. In Proc. SOFSEM, volume 12607 of LNCS, pages 249–262, 2021. doi:10.1007/978-3-030-67731-2_18.
- [35] Sara Giuliani, Shunsuke Inenaga, Zsuzsanna Lipták, Giuseppe Romana, Marinella Sciortino, and Cristian Urbina. Bit catastrophes for the Burrows-Wheeler Transform. In Frank Drewes and Mikhail Volkov, editors, Developments in Language Theory - 27th International Conference, DLT 2023, Umeå, Sweden, June 12-16, 2023, Proceedings, volume 13911 of Lecture Notes in Computer Science, pages 86–99. Springer, 2023. doi:10.1007/978-3-031-33264-7_8.
- [36] Sara Giuliani, Shunsuke Inenaga, Zsuzsanna Lipták, Giuseppe Romana, Marinella Sciortino, and Cristian Urbina. Bit Catastrophes for the Burrows-Wheeler Transform. Theory of Computing Systems, 2025. to appear.
- [37] Sara Giuliani, Zsuzsanna Lipták, Francesco Masillo, and Romeo Rizzi. When a dollar makes a BWT. Theor. Comput. Sci., 857:123–146, 2021. doi:10.1016/J.TCS.2021.01.008.
- [38] Diptarama Hendrian, Dominik Köppl, Ryo Yoshinaka, and Ayumi Shinohara. Algorithms for Galois words: Detection, factorization, and rotation. In Proc. CPM, volume 296 of LIPIcs, pages 18:1–18:16, 2024. doi:10.4230/LIPICS.CPM.2024.18.
- [39] Tomohiro I, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Faster Lyndon factorization algorithms for SLP and LZ78 compressed text. Theor. Comput. Sci., 656:215–224, 2016. doi:10.1016/J.TCS.2016.03.005.
- [40] Hyodam Jeon and Dominik Köppl. Compression sensitivity of the Burrows–Wheeler transform and its bijective variant. Mathematics, 13(7)(1070):1–46, March 2025. doi:10.3390/math13071070.
- [41] Juha Kärkkäinen, Peter Sanders, and Stefan Burkhardt. Linear work suffix array construction. J. ACM, 53(6):918–936, 2006. doi:10.1145/1217856.1217858.
- [42] Dominik Kempa and Tomasz Kociumaka. String synchronizing sets: sublinear-time BWT construction and optimal LCE data structure. In Proc. STOC, pages 756–767, 2019. doi:10.1145/3313276.3316368.
- [43] Dominik Kempa and Tomasz Kociumaka. Resolution of the Burrows-Wheeler transform conjecture. Commun. ACM, 65(6):91–98, 2022. doi:10.1145/3531445.
- [44] Dominik Kempa and Nicola Prezza. At the roots of dictionary compression: string attractors. In Ilias Diakonikolas, David Kempe, and Monika Henzinger, editors, Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, Los Angeles, CA, USA, June 25-29, 2018, pages 827–840. ACM, 2018. doi:10.1145/3188745.3188814.
- [45] Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Toward a definitive compressibility measure for repetitive sequences. IEEE Trans. Inf. Theory, 69(4):2074–2092, 2023. doi:10.1109/TIT.2022.3224382.
- [46] Dominik Köppl, Daiki Hashimoto, Diptarama Hendrian, and Ayumi Shinohara. In-place bijective Burrows-Wheeler Transforms. In Inge Li Gørtz and Oren Weimann, editors, 31st Annual Symposium on Combinatorial Pattern Matching, CPM 2020, June 17-19, 2020, Copenhagen, Denmark, volume 161 of LIPIcs, pages 21:1–21:15. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2020. doi:10.4230/LIPICS.CPM.2020.21.
- [47] Manfred Kufleitner. On bijective variants of the Burrows-Wheeler Transform. In Jan Holub and Jan Zdárek, editors, Proceedings of the Prague Stringology Conference 2009, Prague, Czech Republic, August 31 - September 2, 2009, pages 65–79. Prague Stringology Club, Department of Computer Science and Engineering, Faculty of Electrical Engineering, Czech Technical University in Prague, 2009. URL: http://www.stringology.org/event/2009/p07.html.
- [48] Abraham Lempel and Jacob Ziv. On the complexity of finite sequences. IEEE Trans. Inf. Theory, 22(1):75–81, 1976. doi:10.1109/TIT.1976.1055501.
- [49] Ming Li and Paul M. B. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications, Third Edition. Texts in Computer Science. Springer, 2008. doi:10.1007/978-0-387-49820-1.
- [50] Konstantin M. Likhomanov and Arseny M. Shur. Two combinatorial criteria for BWT images. In Alexander S. Kulikov and Nikolay K. Vereshchagin, editors, Computer Science - Theory and Applications - 6th International Computer Science Symposium in Russia, CSR 2011, St. Petersburg, Russia, June 14-18, 2011. Proceedings, volume 6651 of Lecture Notes in Computer Science, pages 385–396. Springer, 2011. doi:10.1007/978-3-642-20712-9_30.
- [51] M. Lothaire. Combinatorics on Words. Cambridge Mathematical Library. Cambridge University Press, 2nd edition, 1997.
- [52] M. Lothaire. Algebraic Combinatorics on Words. Cambridge University Press, 2002.
- [53] Veli Mäkinen and Gonzalo Navarro. Succinct suffix arrays based on run-length encoding. Nord. J. Comput., 12(1):40–66, 2005.
- [54] Sabrina Mantaci, Antonio Restivo, Giovanna Rosone, and Marinella Sciortino. An extension of the Burrows-Wheeler Transform. Theor. Comput. Sci., 387(3):298–312, 2007. doi:10.1016/J.TCS.2007.07.014.
- [55] Sabrina Mantaci, Antonio Restivo, Giovanna Rosone, Marinella Sciortino, and Luca Versari. Measuring the clustering effect of BWT via RLE. Theor. Comput. Sci., 698:79–87, 2017. doi:10.1016/J.TCS.2017.07.015.
- [56] Sabrina Mantaci, Antonio Restivo, and Marinella Sciortino. Burrows-Wheeler transform and Sturmian words. Inf. Process. Lett., 86(5):241–246, 2003. doi:10.1016/S0020-0190(02)00512-4.
- [57] Giovanni Manzini. An analysis of the Burrows-Wheeler transform. J. ACM, 48(3):407–430, 2001. doi:10.1145/382780.382782.
- [58] Guy Melançon. Lyndon words and singular factors of Sturmian words. Theor. Comput. Sci., 218(1):41–59, 1999. doi:10.1016/S0304-3975(98)00249-7.
- [59] J. Ian Munro and Venkatesh Raman. Selection from read-only memory and sorting with minimum data movement. Theor. Comput. Sci., 165(2):311–323, 1996. doi:10.1016/0304-3975(95)00225-1.
- [60] Gonzalo Navarro. Compact Data Structures: A Practical Approach. Cambridge University Press, 2016.
- [61] Gonzalo Navarro. Indexing highly repetitive string collections, part I: Repetitiveness measures. ACM Comput. Surv., 54(2):29:1–29:31, 2022. doi:10.1145/3434399.
- [62] Gonzalo Navarro, Carlos Ochoa, and Nicola Prezza. On the approximation ratio of ordered parsings. IEEE Trans. Inf. Theory, 67(2):1008–1026, 2021. doi:10.1109/TIT.2020.3042746.
- [63] Ge Nong, Sen Zhang, and Wai Hong Chan. Two efficient algorithms for linear time suffix array construction. IEEE Trans. Computers, 60(10):1471–1484, 2011. doi:10.1109/TC.2010.188.
- [64] Jannik Olbrich, Enno Ohlebusch, and Thomas Büchler. Generic non-recursive suffix array construction. ACM Trans. Algorithms, 20(2):18, 2024. doi:10.1145/3641854.
- [65] Christophe Reutenauer. Mots de Lyndon généralisés. Séminaire Lotharingien de Combinatoire, 54(B54h):1–16, 2005.
- [66] David A. Scott. A truely BIJECTIVE BWT is here! https://groups.google.com/g/comp.compression/c/SDTLJypCWvc/m/ElbLTWJbnH8J, December 2007. Usenet thread, accessed Mar. 11, 2025.
- [67] David A. Scott. BIJECTIVE BWTS PAPER. https://groups.google.com/g/comp.compression/c/IN4XwmstgHk/m/X1RHMF2lyTgJ, September 2009. Usenet thread, accessed Mar. 11, 2025.
- [68] David A. Scott. The paper “A Bijective String Sorting Transform”. https://groups.google.com/g/comp.compression/c/hmMqcFOlCn8/m/AUQ7jasIiaoJ, July 2009. Usenet thread, accessed Mar. 11, 2025.
- [69] Yossi Shiloach. Fast canonization of circular strings. J. Algorithms, 2(2):107–121, 1981. doi:10.1016/0196-6774(81)90013-4.
- [70] James A. Storer and Thomas G. Szymanski. Data compression via textual substitution. J. ACM, 29(4):928–951, 1982. doi:10.1145/322344.322346.
- [71] Yuki Urabe, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. On the size of overlapping Lempel-Ziv and Lyndon factorizations. In Nadia Pisanti and Solon P. Pissis, editors, 30th Annual Symposium on Combinatorial Pattern Matching, CPM 2019, June 18-20, 2019, Pisa, Italy, volume 128 of LIPIcs, pages 29:1–29:11. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2019. doi:10.4230/LIPICS.CPM.2019.29.
- [72] Édouard Zeckendorf. Représentation des nombres naturels par une somme de nombres de Fibonacci ou de nombres de Lucas. Bulletin de la Société Royale des Sciences de Liège, pages 179–182, 1972. URL: https://cir.nii.ac.jp/crid/1570009749187075840.
