On Inverting the Burrows-Wheeler Transform
Abstract
We study the relationship between four fundamental problems: sorting, suffix sorting, element distinctness and BWT inversion. Our main contribution is an $\Omega(n \log n)$ lower bound for BWT inversion in the comparison model. As a corollary, we obtain a new proof of the classical $\Omega(n \log n)$ lower bound for sorting, which we believe to be of didactic interest for those who are not familiar with the Burrows-Wheeler transform.
Keywords and phrases: Burrows-Wheeler transform, sorting, suffix array, element distinctness
Category: Research
2012 ACM Subject Classification: Theory of computation → Design and analysis of algorithms; Theory of computation → Models of computation
Funding: Funded by the Helsinki Institute for Information Technology (HIIT).
Editors: Alessio Conte, Andrea Marino, Giovanna Rosone, and Jeffrey Scott Vitter
Series and Publisher: Open Access Series in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
Sorting is typically one of the introductory topics in a first course on algorithms and data structures. Knuth devotes almost four hundred pages to the problem in The Art of Computer Programming [12], and Cormen et al. use insertion sort as a first example of an algorithm [3].
Consider a sorted alphabet (i.e., an alphabet endowed with a total order). The problem of sorting can be stated as follows.
Problem 1 (Sorting).
Given a string $S = s_1 s_2 \cdots s_n$, compute a permutation $\pi$ of $\{1, 2, \dots, n\}$ such that $s_{\pi(i)} \leq s_{\pi(i+1)}$ for every $1 \leq i \leq n - 1$.
For example, if $S = \mathtt{cab}$, then the output of Problem 1 is the permutation $\pi$ such that $\pi(1) = 2$, $\pi(2) = 3$, and $\pi(3) = 1$. Note that the permutation $\pi$ of Problem 1 is uniquely defined if and only if the $s_i$'s are pairwise distinct. More generally, $\pi$ becomes uniquely defined if we add the additional requirement that, for every $1 \leq i \leq n - 1$, if $s_{\pi(i)} = s_{\pi(i+1)}$, then $\pi(i) < \pi(i+1)$. The permutation $\pi$ that satisfies this additional requirement is the stable sort of $S$.
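To fix ideas, here is a minimal Python sketch (our illustration, not part of the original problem statement) that computes the stable sort of a string; it uses 0-based indices, whereas the paper uses 1-based indices.

```python
def stable_sort_permutation(s):
    # Python's sorted() is stable: ties between equal characters are
    # broken by the original position, exactly as required by the
    # stable sort of Problem 1 (with 0-based indices).
    return sorted(range(len(s)), key=lambda i: s[i])

# The stable sort of "cab" is [1, 2, 0]: s[1] = 'a' <= s[2] = 'b' <= s[0] = 'c'.
print(stable_sort_permutation("cab"))
```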
Both Knuth and Cormen et al. consider the comparison model, in which no restriction on the (possibly infinite) sorted alphabet is assumed, and the only way to obtain information on the mutual order between the characters in $S$ is by solving queries of the following type: given $1 \leq i, j \leq n$, decide whether $s_i \leq s_j$. We assume that each query takes $O(1)$ time.
In the comparison model, the (worst-case) complexity of Problem 1 is $\Theta(n \log n)$: there exist algorithms (e.g., merge sort, which computes the stable sort of $S$) solving Problem 1 in $O(n \log n)$ time, and any algorithm solving Problem 1 has complexity $\Omega(n \log n)$. The (classical) proof of the lower bound [12, 3] shows a stronger result: to solve Problem 1, in the worst case we need $\Omega(n \log n)$ queries $s_i \leq s_j$, and this is true even if we know that the $s_i$'s are pairwise distinct. In other words, $\Omega(n \log n)$ is not only an algorithmic lower bound, but it also captures the minimum number of operations required to have enough information for solving Problem 1.
The comparison model yields a simple, informative, and mathematically appealing setting for studying the complexity of sorting, but it can hardly be considered a realistic computational model. For example, if the $s_i$'s are known to be integers in a polynomial range (i.e., $0 \leq s_i \leq n^{O(1)}$ for every $i$), Problem 1 can be solved in $O(n)$ time via radix sort, which also computes the stable sort of $S$ [10].
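As an illustration (ours, under the assumption that the $s_i$'s are nonnegative integers bounded by a fixed polynomial in $n$), the following Python sketch computes the stable sorting permutation via LSD radix sort in base $n$, using a constant number of stable counting passes:

```python
def radix_sort_permutation(s):
    # Stable sorting permutation of nonnegative integers in [0, n^c),
    # via LSD radix sort in base n: c stable passes, O(c * n) time.
    base = max(len(s), 2)
    perm = list(range(len(s)))
    shift = 1
    while shift <= max(s, default=0):
        buckets = [[] for _ in range(base)]
        for i in perm:  # one stable pass on the current base-n digit
            buckets[(s[i] // shift) % base].append(i)
        perm = [i for bucket in buckets for i in bucket]
        shift *= base
    return perm

print(radix_sort_permutation([30, 1, 20, 1]))  # [1, 3, 2, 0]
```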
The landscape becomes more complex if one considers alternative models of computation. On the one hand, it is possible to obtain models that are more flexible than the comparison model by allowing more complex queries in $O(1)$ time. Assume that the alphabet is the set of all real numbers, and consider a function $f(x_1, x_2, \dots, x_n)$. In the comparison model, in $O(1)$ time we can test whether $f(s_1, s_2, \dots, s_n) \leq 0$ for any choice of $f$ of the form $f = x_i - x_j$. In the linear decision tree model, one is allowed more general tests of the type $f(s_1, s_2, \dots, s_n) \leq 0$, where we consider a linear function $f$. In the algebraic decision tree model, we can choose $f$ to be an arbitrary polynomial. Even if the linear decision tree model and the algebraic decision tree model are more general than the comparison model, the complexity of Problem 1 in all these models is still $\Omega(n \log n)$ [4, 1]. On the other hand, it is possible to consider (realistic) variants of the RAM model. In the word-RAM model with word size $w \geq \log_2 n$, where $n$ is the size of the input, the complexity of Problem 1 is still open, even though it is known to be $O(n \log \log n)$ [8, 9]. Other variants of these models are possible, but we will focus only on the comparison model and the case of integers in a polynomial range, which are arguably the most common settings in the literature.
In this paper, we show that the complexity of Problem 1 (Sorting) is the same as the complexity of other fundamental problems: suffix sorting, element distinctness and BWT inversion. The complexity of these problems is $\Theta(n \log n)$ in the comparison model and $\Theta(n)$ in the case of integers in a polynomial range. Our main contribution is an $\Omega(n \log n)$ lower bound for BWT inversion in the comparison model (Theorem 1). We also show that Theorem 1 implies a new proof of the classical $\Omega(n \log n)$ lower bound for sorting (Corollary 2), which we believe to be of didactic interest for those who are not familiar with the Burrows-Wheeler transform.
2 Sorting, BWT Inversion and Related Problems
Problem 1 is closely related to several fundamental problems. Here are some examples.
Problem 2 (Suffix sorting).
Given a string $S = s_1 s_2 \cdots s_n$, compute the permutation $\pi$ of $\{1, 2, \dots, n\}$ such that $s_{\pi(i)} s_{\pi(i)+1} \cdots s_n$ is the $i$-th lexicographically smallest suffix of $S$ for every $1 \leq i \leq n$.
For example, if $S = \mathtt{banana}$, then the output of Problem 2 is the permutation $\pi$ such that $\pi(1) = 6$, $\pi(2) = 4$, $\pi(3) = 2$, $\pi(4) = 1$, $\pi(5) = 5$, and $\pi(6) = 3$. Note that $\pi$ is uniquely defined because the suffixes of a string are pairwise distinct.
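For illustration (ours), suffix sorting can be implemented naively in Python by sorting the suffixes directly; this takes $O(n^2 \log n)$ time in the worst case, unlike the efficient algorithms discussed below:

```python
def suffix_sort(s):
    # Return the permutation of Problem 2 (1-based start positions)
    # by sorting the suffixes of s directly. Naive and illustrative
    # only: O(n^2 log n) time in the worst case.
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])

print(suffix_sort("banana"))  # [6, 4, 2, 1, 5, 3]
```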
Problem 3 (Element distinctness).
Given a string $S = s_1 s_2 \cdots s_n$, decide whether there exist $1 \leq i < j \leq n$ such that $s_i = s_j$.
For example, if $S = \mathtt{cab}$, then the output of Problem 3 is "no".
The complexity of Problems 2 and 3 is $\Theta(n)$ in the case of integers in a polynomial range, and $\Theta(n \log n)$ in the comparison model. Let us show how to obtain these bounds by reducing these problems to Problem 1 (Sorting).
Let us consider the case of integers in a polynomial range.
-
Problem 2 can be solved in $O(n)$ time by using any linear-time algorithm for building the suffix array (see [15] for a survey). Historically, the linear time complexity was first proved by Farach, who solved the more general problem of building the suffix tree of a string in linear time [5]. Conceptually, the easiest algorithm is probably Kärkkäinen et al.'s algorithm [11].
-
Problem 3 can be solved in $O(n)$ time: we compute the stable sort $\pi$ of $S$ via radix sort, and then check whether $s_{\pi(i)} = s_{\pi(i+1)}$ for some $1 \leq i \leq n - 1$.
Let us consider the comparison model.
-
On the one hand, Problem 2 can be solved in $O(n \log n)$ time by (i) sorting the $s_i$'s via any comparison-based algorithm, (ii) replacing each $s_i$ with its rank in the sorted list of all $s_i$'s (which does not affect the mutual order of the suffixes), and (iii) applying any of the linear-time suffix sorting algorithms mentioned earlier (step (ii) is illustrated in the sketch after this list). On the other hand, to solve Problem 2 we need $\Omega(n \log n)$ queries $s_i \leq s_j$ in the worst case, and the same lower bound is true even if we know that the $s_i$'s are pairwise distinct. Indeed, given the permutation $\pi$ of Problem 2, we have $s_{\pi(i)} \leq s_{\pi(i+1)}$ for every $1 \leq i \leq n - 1$ (because the suffix $s_{\pi(i)} s_{\pi(i)+1} \cdots s_n$ is lexicographically smaller than the suffix $s_{\pi(i+1)} s_{\pi(i+1)+1} \cdots s_n$), so $\pi$ is also a correct output for Problem 1, and the conclusion follows from the $\Omega(n \log n)$ lower bound for Problem 1 mentioned earlier.
-
On the one hand, Problem 3 can be solved in $O(n \log n)$ time: we can argue as in the case of integers in a polynomial range, but we use any comparison-based algorithm to compute a permutation $\pi$ as in Problem 1 (see the sketch after this list). On the other hand, to solve Problem 3 we need $\Omega(n \log n)$ queries $s_i \leq s_j$ in the worst case, as we show next. Fix an integer $n$, and consider an algorithm that solves Problem 3 for every string $S = s_1 s_2 \cdots s_n$. Let $q(n)$ be the number of queries $s_i \leq s_j$ solved by the algorithm in the worst case. If the $s_i$'s are pairwise distinct, the permutation $\pi$ of Problem 1 is uniquely defined, and the algorithm for Problem 3 must return "no". To this end, the algorithm must solve the query $s_{\pi(i)} \leq s_{\pi(i+1)}$ (or the query $s_{\pi(i+1)} \leq s_{\pi(i)}$) for every $1 \leq i \leq n - 1$, because otherwise the algorithm could not infer that $s_{\pi(i)} \neq s_{\pi(i+1)}$ from the other queries that it solves, and so it would not have enough information to solve Problem 3 correctly on input $S$. Consequently, if for every query $s_i \leq s_j$ solved by the algorithm we also consider the query $s_j \leq s_i$, we obtain at most $2q(n)$ queries. In particular, we consider both the query $s_{\pi(i)} \leq s_{\pi(i+1)}$ and the query $s_{\pi(i+1)} \leq s_{\pi(i)}$ for every $1 \leq i \leq n - 1$, from which we can infer that $s_{\pi(i)} < s_{\pi(i+1)}$ for every $1 \leq i \leq n - 1$. We conclude that at most $2q(n)$ queries $s_i \leq s_j$ are sufficient in the worst case to compute $\pi$ and so solve Problem 1 for every $S$ in which the $s_i$'s are pairwise distinct. Hence, we obtain $q(n) = \Omega(n \log n)$ from the lower bound for Problem 1 mentioned earlier.
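The two reductions above can be made concrete with a short Python sketch (our illustration; for simplicity it uses a Python dictionary for the rank table, whereas a strictly comparison-based implementation would only compare characters):

```python
def ranks(s):
    # Step (ii) of the reduction for Problem 2: replace each character
    # by its rank among the distinct characters of s. This preserves
    # the mutual order of the suffixes.
    rank = {c: r for r, c in enumerate(sorted(set(s)))}
    return [rank[c] for c in s]

def element_distinctness(s):
    # Problem 3 via comparison sorting: compute a sorting permutation,
    # then compare adjacent characters in sorted order. Returns True
    # iff two characters of s are equal (the answer "yes").
    pi = sorted(range(len(s)), key=lambda i: s[i])
    return any(s[pi[k]] == s[pi[k + 1]] for k in range(len(s) - 1))

print(ranks("banana"))              # [1, 0, 2, 0, 2, 0]
print(element_distinctness("cab"))  # False: the answer is "no"
```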
Let us show that Problem 1 is related to another fundamental problem in string processing and compression. To this end, we need to introduce the Burrows-Wheeler transform (BWT) of a string [2]. Let $T = t_1 t_2 \cdots t_n$ be a string such that $t_n = \$$, where $\$$ is a special character such that (i) $\$$ does not occur anywhere else in the string and (ii) $\$$ is smaller than all the other characters. For example, we can consider the string $T = \mathtt{banana\$}$ (where $n = 7$). For every $1 \leq i \leq n$, let $T_i = t_i t_{i+1} \cdots t_n t_1 t_2 \cdots t_{i-1}$ be the $i$-th circular suffix of $T$. For example, if $T = \mathtt{banana\$}$, we have $T_1 = \mathtt{banana\$}$, $T_2 = \mathtt{anana\$b}$, $T_3 = \mathtt{nana\$ba}$, $T_4 = \mathtt{ana\$ban}$, $T_5 = \mathtt{na\$bana}$, $T_6 = \mathtt{a\$banan}$, and $T_7 = \mathtt{\$banana}$. Note that the circular suffixes of $T$ are pairwise distinct because $\$$ occurs in a different position in each of them.
We build the square matrix $M$ of size $n \times n$ such that, for every $1 \leq i \leq n$, the $i$-th row is equal to the $i$-th (lexicographically) smallest circular suffix of $T$ (see Table 1 for the matrix of size $7 \times 7$ obtained from $T = \mathtt{banana\$}$). Note that the first row of $M$ is $T_n$ because $\$$ is the smallest character.
For every $1 \leq j \leq n$, let $C_j$ be the $j$-th column of $M$ from left to right. Notice that every $C_j$ is a rearrangement of the characters in $T$. Let $F = C_1$ and $L = C_n$ be the first column and the last column of $M$, respectively. In Table 1, we have $F = \mathtt{\$aaabnn}$ and $L = \mathtt{annb\$aa}$. Note that $F$ can be obtained by sorting the characters of $T$. By definition, the Burrows-Wheeler transform of $T$ is the column $L$, that is, $BWT(T) = L$. In Table 1, we have $BWT(\mathtt{banana\$}) = \mathtt{annb\$aa}$.
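As a sanity check, the definition can be implemented directly in Python (our illustration; it materializes and sorts all $n$ circular suffixes, so it takes $O(n^2 \log n)$ time and is not meant to be efficient):

```python
def bwt(t):
    # Sort the circular suffixes (the rows of the matrix M) and
    # concatenate their last characters (the column L).
    rows = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(row[-1] for row in rows)

print(bwt("banana$"))  # annb$aa ('$' < 'a' also holds in ASCII)
```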
Crucially, the Burrows-Wheeler transform of a string is an encoding of the string: given $BWT(T)$, we can retrieve the string $T$. In Table 1, given $\mathtt{annb\$aa}$, we can retrieve $\mathtt{banana\$}$. To prove this, we only need to show that from $L$ we can retrieve the matrix $M$, because then the unique row of the matrix ending with $\$$ yields the original string $T$. We can retrieve $M$ by computing all columns $C_j$'s. We know that $C_n = L$, and we can retrieve $C_1$ by sorting all characters in $L$. Let us show how to retrieve $C_2$. We know $L$ and $C_1$, so we know all pairs of consecutive characters in $T$ (in every row of $M$, the last character is followed by the first character, because the rows are circular suffixes of $T$). If we sort these pairs lexicographically and we pick the last element of each pair, we retrieve $C_2$. In Table 1, all pairs of consecutive characters in $T$ are $\mathtt{ba}$, $\mathtt{an}$, $\mathtt{na}$, $\mathtt{an}$, $\mathtt{na}$, $\mathtt{a\$}$, $\mathtt{\$b}$. By sorting these pairs, we obtain $\mathtt{\$b}$, $\mathtt{a\$}$, $\mathtt{an}$, $\mathtt{an}$, $\mathtt{ba}$, $\mathtt{na}$, $\mathtt{na}$, and by picking the last element of each pair, we can infer that $C_2 = \mathtt{b\$nnaaa}$. Let us show how to retrieve $C_3$. We know $L$, $C_1$ and $C_2$, so we know all triples of consecutive characters in $T$. If we sort these triples lexicographically and we pick the last element of each triple, we retrieve $C_3$. In Table 1, all triples of consecutive characters in $T$ are $\mathtt{ban}$, $\mathtt{ana}$, $\mathtt{nan}$, $\mathtt{ana}$, $\mathtt{na\$}$, $\mathtt{a\$b}$, $\mathtt{\$ba}$. By sorting these triples, we obtain $\mathtt{\$ba}$, $\mathtt{a\$b}$, $\mathtt{ana}$, $\mathtt{ana}$, $\mathtt{ban}$, $\mathtt{na\$}$, $\mathtt{nan}$, and by picking the last element of each triple, we can infer that $C_3 = \mathtt{abaan\$n}$. In the same way, we can retrieve $C_4, C_5, \dots, C_{n-1}$.
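The column-by-column reconstruction just described can be implemented as follows (a Python sketch of ours: instead of keeping only the tuples of consecutive characters, it keeps the growing prefixes of the rows of $M$, which carries the same information):

```python
def invert_bwt_slow(l):
    # After k rounds, `rows` holds the k-character prefixes of the rows
    # of M in sorted order (round 1 yields C_1, round 2 yields C_1 C_2,
    # and so on). Roughly O(n^2 log n) time: illustrative only.
    rows = [""] * len(l)
    for _ in range(len(l)):
        rows = sorted(l[i] + rows[i] for i in range(len(l)))
    # T is the unique row of M ending with $.
    return next(row for row in rows if row.endswith("$"))

print(invert_bwt_slow("annb$aa"))  # banana$
```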
Since $BWT(T)$ is an encoding of $T$, we can store $BWT(T)$ instead of $T$ without losing information. The reason why storing $BWT(T)$ may be more beneficial than storing $T$ is that the string $BWT(T)$ tends to be repetitive if several substrings of $T$ have multiple occurrences, so we can exploit the repetitiveness of $BWT(T)$ to compress $T$. This property motivated the introduction of the Burrows-Wheeler transform in the original paper [2] and can be stated in precise mathematical terms via the notion of entropy [13]. Surprisingly, it is also possible to solve pattern matching on the original string $T$ by augmenting the compressed representation of the Burrows-Wheeler transform with space-efficient data structures, thus obtaining the FM-index [6].
We described an (inefficient) algorithm to invert the Burrows-Wheeler transform. Let us state the problem formally.
Problem 4 (BWT inversion).
Given a string $L = \ell_1 \ell_2 \cdots \ell_n$ such that $L = BWT(T)$ for some ($\$$-terminated) string $T$, compute $T$.
For example, if $L = \mathtt{annb\$aa}$, then $T = \mathtt{banana\$}$ (see Table 1).
3 Our Results
In this section, we show that the complexity of Problem 4 is $\Theta(n \log n)$ in the comparison model and $\Theta(n)$ in the case of integers in a polynomial range. The only bound that cannot be inferred from the original paper on the Burrows-Wheeler transform [2] is the $\Omega(n \log n)$ lower bound in the comparison model. Notice that, consequently, the four problems considered in this paper have the same complexity.
To prove the $\Omega(n \log n)$ bound for Problem 4, we will proceed differently from Problems 2 and 3. We will prove the lower bound directly, without relying on the lower bound for Problem 1 (Sorting). Then, we will use the lower bound for Problem 4 (BWT inversion) to infer the lower bound for Problem 1 (Sorting). In addition to establishing interesting relationships between fundamental problems, our approach yields a new proof of the celebrated $\Omega(n \log n)$ bound for sorting, which we believe to be of didactic interest.
Let us start with the lower bound for BWT inversion.
Theorem 1.
In the comparison model, to solve Problem 4 we need $\Omega(n \log n)$ queries $\ell_i \leq \ell_j$ in the worst case. The same lower bound holds even if we know that the $\ell_i$'s are pairwise distinct.
Proof.
Fix an integer $n \geq 2$. Consider $n$ distinct characters $c_1, c_2, \dots, c_{n-1}, \$$ such that $\$ < c_1 < c_2 < \dots < c_{n-1}$. Then, the set:

$$\mathcal{T} = \{ t_1 t_2 \cdots t_{n-1} \$ \mid t_1 t_2 \cdots t_{n-1} \text{ is a permutation of } c_1, c_2, \dots, c_{n-1} \}$$

has size $(n-1)!$. For every $T \in \mathcal{T}$, the string $BWT(T)$ is an encoding of $T$, so the set:

$$\mathcal{L} = \{ BWT(T) \mid T \in \mathcal{T} \}$$

has also size $(n-1)!$. Notice that for every string $L = \ell_1 \ell_2 \cdots \ell_n \in \mathcal{L}$, if $L = BWT(T)$ for some $T \in \mathcal{T}$, then the $\ell_i$'s are pairwise distinct.
Consider any algorithm solving Problem 4 for every input $L \in \mathcal{L}$. The algorithm can gather information on the mutual order between the $\ell_i$'s only by solving queries $\ell_i \leq \ell_j$. The decision on the next query can depend on the outcomes of the previous queries, so we can describe the behavior of the algorithm on all inputs through a binary decision tree (see Figure 1 for the case $n = 3$). Since the algorithm correctly solves Problem 4 for every input $L \in \mathcal{L}$, the outcomes of all queries on a path from the root to a leaf cannot be consistent with two distinct elements of $\mathcal{L}$, otherwise the algorithm would not have enough information to compute $T$. For example, in Figure 1, if the outcome of the first query is "no" and the outcome of the second query is "yes", the algorithm may still be unable to distinguish two strings $L, L' \in \mathcal{L}$: we have $L = BWT(T)$ and $L' = BWT(T')$ for two distinct $T, T' \in \mathcal{T}$, so the algorithm must necessarily solve an additional query $\ell_i \leq \ell_j$ before returning its output.
Assume that the longest path in the tree consists of $k$ edges. Then, the number of paths from the root to a leaf is upper bounded by $2^k$, so we must have $2^k \geq |\mathcal{L}| = (n-1)!$, which implies $k \geq \log_2((n-1)!) = \Omega(n \log n)$. This means that there exists $L \in \mathcal{L}$ for which the algorithm needs to solve $\Omega(n \log n)$ queries $\ell_i \leq \ell_j$.
The proof of Theorem 1 is similar to the classical proof of the $\Omega(n \log n)$ lower bound for sorting [12, 3]. In the classical proof, one typically uses the fact that a binary tree with $m$ leaves must have height at least $\log_2 m$. Here we used a slightly more direct pigeonhole argument to prove the inequality $2^k \geq (n-1)!$. The worst-case entropy of a set $\mathcal{S}$ is $\log_2 |\mathcal{S}|$ by a similar pigeonhole argument [14], so one may argue that in the proof of Theorem 1 we used an entropy-based argument. This appears to establish an interesting analogy, because we have already mentioned that the compressibility of the Burrows-Wheeler transform can be described through the notion of entropy [13].
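For completeness, the implication $2^k \geq (n-1)! \Rightarrow k = \Omega(n \log n)$ used above can be spelled out with a standard estimate:

$$k \;\geq\; \log_2\bigl((n-1)!\bigr) \;=\; \sum_{i=2}^{n-1} \log_2 i \;\geq\; \sum_{i=\lceil n/2 \rceil}^{n-1} \log_2 i \;\geq\; \left\lfloor \frac{n}{2} \right\rfloor \log_2 \frac{n}{2} \;=\; \Omega(n \log n).$$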
Now, let us describe an efficient algorithm to solve Problem 4. The original paper by Burrows and Wheeler [2] follows an approach based on a permutation called the LF-mapping. Here we will use a different approach based on the permutation $\psi$, which is the inverse of the LF-mapping. The permutation $\psi$ plays a crucial role in compressed suffix arrays [7], and it captures the connection between Problem 4 (BWT inversion) and Problem 1 (Sorting) more explicitly. We are given $L = \ell_1 \ell_2 \cdots \ell_n$ such that $L = BWT(T)$ for some ($\$$-terminated) string $T = t_1 t_2 \cdots t_n$, and we need to compute $T$. By definition, $L = C_n$, where $C_n$ is the last column of the matrix $M$, so we know the last column of $M$, and we must retrieve $T$.
Let $\psi$ be the stable sort of $L$ (see Problem 1 and Table 1). Since $\$$ is the smallest character, we have $\ell_{\psi(1)} = \$$. We will prove that $t_i = F[\psi^i(1)]$ for every $1 \leq i \leq n$, where $F[j]$ denotes the $j$-th character of $F$. For example, in Table 1 we have $\psi(1) = 5$, $\psi^2(1) = 4$, $\psi^3(1) = 7$, $\psi^4(1) = 3$, $\psi^5(1) = 6$, $\psi^6(1) = 2$, and $\psi^7(1) = 1$, so $F[\psi(1)] F[\psi^2(1)] \cdots F[\psi^7(1)] = \mathtt{banana\$} = T$. After computing $\psi$, we can retrieve all the $t_i$'s in $O(n)$ time by computing all the powers $\psi^i(1)$'s. In the comparison model, we can compute $\psi$ in $O(n \log n)$ time, and in the case of integers in a polynomial range we can compute $\psi$ in $O(n)$ time, so the complexity of Problem 4 is $O(n \log n)$ in the comparison model and $O(n)$ in the case of integers in a polynomial range. We are left with proving that $t_i = F[\psi^i(1)]$ for every $1 \leq i \leq n$.
Let $A$ be the array such that, for every $1 \leq i \leq n$, the row $i$ of the matrix $M$ is equal to the circular suffix $T_{A[i]}$ (see Table 1). Then, $A$ yields a permutation of $\{1, 2, \dots, n\}$. Moreover, we have $F[i] = t_{A[i]}$ and $\ell_i = t_{A[i]-1}$ for every $1 \leq i \leq n$, where we assume $t_0 = t_n$. In particular, $A[1] = n$, so we have $F[1] = t_n = \$$ and $\ell_1 = t_{n-1}$. Notice that $A^{-1}$ also yields a permutation of $\{1, 2, \dots, n\}$. From $F[i] = t_{A[i]}$ we obtain $t_j = F[A^{-1}(j)]$ for every $1 \leq j \leq n$.
Let us prove that $A[\psi(i)] = (A[i] \bmod n) + 1$ for every $1 \leq i \leq n$ (for example, in Table 1 we have $A[\psi(3)] = A[6] = 5 = A[3] + 1$). Fix a character $c$ that occurs in $T$. Let $p$ be the smallest integer such that $F[p] = c$, and let $q$ be the largest integer such that $F[q] = c$. We only have to prove that $A[\psi(i)] = (A[i] \bmod n) + 1$ for every $p \leq i \leq q$, because by picking all possible values of $c$ from smallest to largest we cover every $i$ between $1$ and $n$ ($c = \$$ corresponds to $i = 1$). From the definitions of $p$ and $q$ we obtain that in every row of $M$ and in every column of $M$ there are exactly $q - p + 1$ characters equal to $c$, and in particular $\ell_{\psi(p)} = \ell_{\psi(p+1)} = \dots = \ell_{\psi(q)} = c$. Since $\psi$ is the stable sort of $L$, we obtain $\psi(p) < \psi(p+1) < \dots < \psi(q)$. This implies that row $\psi(p)$ is lexicographically smaller than row $\psi(p+1)$, row $\psi(p+1)$ is lexicographically smaller than row $\psi(p+2)$, $\dots$, row $\psi(q-1)$ is lexicographically smaller than row $\psi(q)$. We have $\ell_{\psi(i)} = c$ for every $p \leq i \leq q$, so we conclude that $R_p$ is lexicographically smaller than $R_{p+1}$, $R_{p+1}$ is lexicographically smaller than $R_{p+2}$, $\dots$, $R_{q-1}$ is lexicographically smaller than $R_q$, where $R_i$ is the string obtained from row $\psi(i)$ by moving its last character $c$ to the front, for every $p \leq i \leq q$. At the same time, for every $1 \leq j \leq n$ we have $F[j] = c$ if and only if $p \leq j \leq q$, and row $p$ is lexicographically smaller than row $p+1$, row $p+1$ is lexicographically smaller than row $p+2$, $\dots$, row $q-1$ is lexicographically smaller than row $q$. Since the $R_i$'s and the rows $p, p+1, \dots, q$ are two increasing enumerations of the same $q - p + 1$ rows of $M$ (the rows starting with $c$), we obtain that $R_i$ is equal to row $i$ for every $p \leq i \leq q$, that is, $T_{A[\psi(i)] - 1} = T_{A[i]}$ (indices modulo $n$), so $A[\psi(i)] = (A[i] \bmod n) + 1$ for every $p \leq i \leq q$, as claimed.
Let us prove that $A[\psi^i(1)] = i$ for every $1 \leq i \leq n$ (for example, in Table 1 we have $A[\psi^3(1)] = A[7] = 3$). We proceed by induction on $i$. For $i = 1$, we know that $A[\psi(1)] = (A[1] \bmod n) + 1 = (n \bmod n) + 1 = 1$. Now assume that $i > 1$. By the inductive hypothesis, we know that $A[\psi^{i-1}(1)] = i - 1$. In particular, $i - 1 < n$, so $(i - 1) \bmod n = i - 1$, and we obtain $A[\psi^i(1)] = (A[\psi^{i-1}(1)] \bmod n) + 1 = (i - 1) + 1 = i$.
We are now ready to prove the main claim: we have $t_i = F[A^{-1}(i)] = F[\psi^i(1)]$ for every $1 \leq i \leq n$.
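Putting everything together, the efficient inversion algorithm admits a short Python implementation (our sketch, using 0-based indices, so that index $0$ plays the role of the starting index $1$ in the text):

```python
def invert_bwt(l):
    # psi is the stable sort of L; F is L sorted. Then T is recovered
    # as t_i = F[psi^i(1)] (here with 0-based indices).
    # O(n log n) time with comparison sorting; O(n) with radix sort.
    psi = sorted(range(len(l)), key=lambda i: l[i])
    f = [l[i] for i in psi]
    t, j = [], 0
    for _ in range(len(l)):
        j = psi[j]       # j = psi^i(starting row)
        t.append(f[j])   # the i-th character of T
    return "".join(t)

print(invert_bwt("annb$aa"))  # banana$
```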
We conclude our paper by showing that Theorem 1 implies a new proof of the $\Omega(n \log n)$ lower bound for sorting.
Corollary 2.
In the comparison model, to solve Problem 1 we need $\Omega(n \log n)$ queries $s_i \leq s_j$ in the worst case. The same lower bound holds even if we know that the $s_i$'s are pairwise distinct.
Proof.
Consider any algorithm solving Problem 1 for every input $L = \ell_1 \ell_2 \cdots \ell_n$ such that the $\ell_i$'s are pairwise distinct. Since the $\ell_i$'s are pairwise distinct, the permutation $\psi$ of Problem 1 is uniquely defined, and $\psi$ is also the stable sort of $L$. Let $q(n)$ be the number of queries $\ell_i \leq \ell_j$ solved by the algorithm in the worst case to compute $\psi$. Assume that $L = BWT(T)$, where $T$ is a ($\$$-terminated) string. After computing $\psi$, we can compute $T$ in $O(n)$ time (because $t_i = F[\psi^i(1)]$ for every $1 \leq i \leq n$, as we have seen before) without solving any additional query $\ell_i \leq \ell_j$. Consequently, $q(n)$ queries are sufficient in the worst case to solve Problem 4 for every input $L$ such that the $\ell_i$'s are pairwise distinct. By Theorem 1, we conclude $q(n) = \Omega(n \log n)$.
4 Conclusions
In this paper, we have shown that, in the comparison model, inverting the Burrows-Wheeler transform has complexity $\Theta(n \log n)$. As a corollary, we have obtained a new proof of the $\Omega(n \log n)$ sorting lower bound. Our main goal was to highlight how the ideas behind the Burrows-Wheeler transform are deeply intertwined with the most fundamental results in Computer Science.
References
- [1] Michael Ben-Or. Lower bounds for algebraic computation trees. In Proceedings of the fifteenth Annual ACM Symposium on Theory of Computing, pages 80–86, 1983.
- [2] Michael Burrows and David J. Wheeler. A block-sorting lossless data compression algorithm. Technical report, Systems Research Center, 1994.
- [3] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press, 2022.
- [4] David P. Dobkin and Richard J. Lipton. On the complexity of computations under varying sets of primitives. Journal of Computer and System Sciences, 18(1):86–91, 1979. doi:10.1016/0022-0000(79)90054-0.
- [5] Martin Farach. Optimal suffix tree construction with large alphabets. In Proceedings 38th Annual Symposium on Foundations of Computer Science, pages 137–143. IEEE, 1997. doi:10.1109/SFCS.1997.646102.
- [6] Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In Proceedings 41st Annual Symposium on Foundations of Computer Science, pages 390–398. IEEE, 2000. doi:10.1109/SFCS.2000.892127.
- [7] Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In Proceedings of the thirty-second Annual ACM Symposium on Theory of Computing, pages 397–406, 2000.
- [8] Torben Hagerup. Sorting and searching on the word RAM. In STACS 98: 15th Annual Symposium on Theoretical Aspects of Computer Science, Paris, France, February 25–27, 1998, Proceedings, pages 366–398. Springer, 1998. doi:10.1007/BFB0028575.
- [9] Yijie Han. Deterministic sorting in $O(n \log \log n)$ time and linear space. Journal of Algorithms, 50(1):96–105, 2004. doi:10.1016/J.JALGOR.2003.09.001.
- [10] John E. Hopcroft, Jeffrey D. Ullman, and Alfred V. Aho. Data Structures and Algorithms. Addison-Wesley, Boston, MA, USA, 1983.
- [11] Juha Kärkkäinen, Peter Sanders, and Stefan Burkhardt. Linear work suffix array construction. Journal of the ACM (JACM), 53(6):918–936, 2006. doi:10.1145/1217856.1217858.
- [12] Donald E. Knuth. The Art of Computer Programming: Sorting and Searching, volume 3. Addison-Wesley Professional, 1998.
- [13] Giovanni Manzini. An analysis of the Burrows-Wheeler transform. Journal of the ACM (JACM), 48(3):407–430, 2001. doi:10.1145/382780.382782.
- [14] Gonzalo Navarro. Compact data structures: A practical approach. Cambridge University Press, 2016.
- [15] Simon J. Puglisi, William F. Smyth, and Andrew H. Turpin. A taxonomy of suffix array construction algorithms. ACM Computing Surveys (CSUR), 39(2):4–es, 2007.
