Algorithms for Computing Very Large BWTs:
a Short Survey
Abstract
The Burrows-Wheeler Transform (BWT) is a fundamental string transformation that, although initially introduced for data compression, has been extensively utilized across various domains, including text indexing and pattern matching within large datasets. Although the BWT can be computed in linear time, the constants involved make the task impractical for large datasets, and as highlighted by Ferragina et al. [26], "to use it, one must first build it!". Thus, the construction of the BWT remains a significant challenge. For these reasons, during the past three decades there has been a succession of new algorithms for its construction using techniques that work in external memory or that use text compression. In this survey, we review some of the most important advancements and tools presented in the past years for computing large BWTs by exploiting external memory or text compression approaches, without using additional information about the data.
Keywords and phrases:
Burrows-Wheeler transform, Extended Burrows-Wheeler transform, external memory, text compression, longest common prefix
Funding:
Diego Díaz-Domínguez: Supported by the European Union's Horizon Europe research and innovation programme under grant agreement No 101060011.
2012 ACM Subject Classification:
Theory of computation → Data structures design and analysis
Editors:
Paolo Ferragina, Travis Gagie, and Gonzalo Navarro
Series and Publisher:
Open Access Series in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
The term “big data” has become widely used, encompassing a variety of data types. Genomic data is one of the most prominent forms of big data, due to its size, complexity, and impact [65]. The advancements in high-throughput sequencing technologies have resulted in an unprecedented expansion of genomic databases, which are expected to soon reach sizes of hundreds of terabytes. The scale and complexity of these datasets present significant challenges in terms of data storage, access, and processing.
One of the primary tasks when working with such large datasets is the construction of efficient indexes, which facilitate fast data retrieval and enable a variety of applications in genomics research. Until the early 2000s, the standard solutions for this task were the suffix array or the suffix tree. However, these approaches quickly become intractable as the text grows. Ferragina and Manzini solved this limitation with their FM-index [27], a succinct self-index (a data structure that encodes a text and an index for it in space proportional to the text size) with suffix array functionality. Then, Lam et al. [41] noticed that the FM-index is efficient in performing pattern matching of short sequences with few mismatches over large genomes. This functionality led to the creation of popular bioinformatics tools such as bwa [46] and bowtie [42] based on the FM-index, and since then its use has been extended to other applications such as de novo assembly [62], variant detection [43, 60] and sequence compression [35, 61, 33, 31].
The main component of the FM-index is the Burrows-Wheeler transform [12], a string transformation introduced in the Data Compression field by Michael Burrows and David Wheeler in 1994. It is a reversible transformation that rearranges the symbols of a string over an alphabet $\Sigma$, possibly creating long runs of identical symbols. This structure enables compressing the original string using simple algorithms, such as Run-Length Encoding (RLE) followed by Move-to-Front (MTF) and Huffman coding (as employed in the bzip2 compressor); see [1] for further details.
Since the sequencing process generates a collection of sequences, each representing a read, it introduced challenges in how to represent, process, and analyze these sequence collections. Applying the BWT to each sequence individually would not capture the features shared among the different sequences, and hence there emerged a necessity to extend the BWT to sets of strings. Moreover, the need to analyze multiple genomes together has recently emerged to study the complete genetic diversity within populations [18]. Here, we consider the extension of the Burrows-Wheeler transform to a string collection, in which a distinct end-marker symbol is appended to each string (see [14, 15] for a survey about the different BWT variants). When an end-marker symbol is appended to each string, there are no strings in the collection that are prefixed by other strings. Such a transform applied to a string collection is also known in the literature as multi-string BWT [23, 8] (in [15], the same transform is called Multidollar-EBWT). Over the years, a multitude of strategies and tools have been introduced to build the BWT and the multi-string BWT, whose efficiency may be dictated by particular needs (e.g. input alphabet, redundant data, available resources, knowledge of a reference string).
In-memory algorithms to compute standard single-string BWTs are linear but do not scale to large inputs with multiple strings as, in those cases, they might not be able to fit everything in RAM (e.g., SAIS [56] and its BWT variant [57]). A possible solution is to keep large portions of data on disk and bring them back to RAM and caches whenever the BWT algorithm requires it. This strategy reduces the space usage in fast memory levels considerably, but comes with a cost: while accessing caches requires nanoseconds in most modern computer architectures, accessing data from disks (aka I/O) requires milliseconds. This difference is the so-called I/O bottleneck, which amounts to several orders of magnitude [34, 25], and makes it necessary to adapt in-memory algorithms to minimize I/Os. An alternative solution is to keep the data in compressed form to fit more information in less space. This technique reduces the number of I/Os but requires extra time to compress and decompress. In addition, it is necessary to adapt the algorithm to make it compression-aware. In practice, disk-based and compression-aware algorithms are not incompatible and are often combined in the literature.
In this survey, we explore BWT construction algorithms that address the scalability problem by using techniques that work in external memory or by exploiting text compression approaches. We have selected some of the most representative algorithms from these two categories. Those that work in external memory (Section 3) must address the challenge of making the best possible use of the available RAM, minimizing and optimizing operations on external memory, and avoiding random disk accesses. The optimal I/O volume when working with a set of strings of total size $N$ is $O(N)$ bytes, which, to the best of our knowledge, is never achieved. The ordering of suffixes necessary to compute BWTs and suffix arrays is in general carried out with bucket sort algorithms that write data to disk in sequential order, and it is often implemented in a virtual way, ordering indices instead of the actual structures. Some of the algorithms approach the problem by computing partial BWTs, which could be either BWTs of substrings, BWTs of subcollections, or BWTs of suffixes of fixed lengths or up to fixed lengths.
We first present bwt-disk [26] (Section 3.1), an external memory approach that works on a single input string. We then present BCR and BCRext [5] (Section 3.2), (semi-)external algorithms which, although not competitive for collections of long strings, stand out due to their simplicity, making them a good choice for sets of strings over any alphabet with limited lengths and for building BWTs with a reduced, possibly minimum, number of runs [13, 15]. In Section 3.3, we present eGSA [49], which works best with an amount of RAM significantly larger than the input size and is extremely efficient, but requires a comparatively large amount of disk space, even though it has good I/O costs. In Section 3.4, we present eGap [23], which slightly improves on the I/O costs of eGSA and is faster when less RAM is available, also thanks to a specific heuristic. Finally, in Section 3.5, we present BWT+LCP [8], which has the same asymptotic complexity as eGap with no heuristic and fixed size alphabet.
In Section 4, we consider algorithms of the second class that leverage compression techniques that exploit the high repetitiveness of the input. This way more of the input fits in memory as close as possible to the CPU for faster processing, and repetitions in the input are eliminated. The risk, though, is introducing a significant overhead when strings must be decompressed for access. The art in designing these algorithms is finding compression schemes whose structure actually enhances the BWT computation process instead of hindering it.
We present in Section 4.1 Big-BWT [11], which takes as input a single string and performs a pre-processing step that encodes the input building a dictionary and a parse (prefix-free parsing). This strategy for building the BWT works well when the dictionary and the parse generated are much smaller than the input string. Finally, in Section 4.2, we present grlBWT [22], which joins the two worlds as an algorithm that uses both external memory and compression techniques. It maintains data in compressed form throughout the computation to reduce space usage and repetitions.
2 Preliminaries
Let $\Sigma$ be a finite ordered alphabet of size $\sigma$. A string $T$ over $\Sigma$ of length $n$ is a sequence of $n$ characters (or symbols) from $\Sigma$, where $T[i]$ denotes the $i$-th character of $T$, and $T[i, j]$ denotes the substring $T[i] \cdots T[j]$, for $1 \le i \le j \le n$. By convention $T[i, j] = \varepsilon$ if $i > j$, where $\varepsilon$ is the empty string. Any substring of $T$ of the form $T[1, i]$, $1 \le i \le n$, is called a prefix, and any substring of the form $T[i, n]$, $1 \le i \le n$, is called a suffix. Let $T$ be a string of length $n$ over $\Sigma$ and $\$$ be a special character, which is not in $\Sigma$ and is lexicographically smaller than any other character in $\Sigma$. Throughout this paper, we consider $T$ with the special character $\$$ (referred to as end-marker) appended as its last character.
For any two strings $S$ and $T$, we write $S < T$ in the lexicographic order if $S$ is a proper prefix of $T$, or if there exists an index $j$ such that $S[j] < T[j]$ and $S[i] = T[i]$ for all $i < j$.
We consider $\mathcal{S} = \{S_1, S_2, \ldots, S_m\}$ to be a collection of $m$ strings, where each $S_i$ is a string over $\Sigma$ whose last character is the end-marker $\$_i$, for any $1 \le i \le m$. The total number of characters in $\mathcal{S}$ is $N$. By definition, the end-markers do not appear elsewhere in $\mathcal{S}$ and are smaller than any other character in $\Sigma$. In addition, we assume that the end-markers are sorted according to the order of the input strings (in order not to increase the alphabet size, tools typically use the same character, such as $\$$, to represent all distinct end-markers in the output string; each end-marker's index can be stored separately), i.e., $\$_i < \$_{i'}$ if $i < i'$.
We use the disk model of computation, which has two levels: a small but fast memory of $M$ bits (i.e., RAM) and a slow but unbounded memory (i.e., disk). Given an input of size $n$, a procedure runs in RAM using words of $w = \Theta(\log n)$ bits that can be manipulated in constant time. Additionally, the procedure can transfer blocks of $B$ bits between the two memory levels (I/O operation). We express the space complexity in bits and the I/O cost as the number of data transfers. Additionally, we write $\log$ for logarithms to the base two.
Suffix array and LCP.
Let $T$ be a string of length $n$ over $\Sigma$ terminated by the end-marker $\$$. The suffix array $sa$ of $T$ [50] stores the permutation of $\{1, \ldots, n\}$ listing the starting positions of the suffixes of $T$ in increasing lexicographic order, i.e., $T[sa[i], n] < T[sa[i+1], n]$, for all $1 \le i < n$.
The generalized suffix array (GSA) of the collection $\mathcal{S}$ is the array of pairs of integers $(i, j)$ corresponding to the lexicographically sorted suffixes $S_i[j, |S_i|]$, where $1 \le i \le m$ and $1 \le j \le |S_i|$.
The longest common prefix array $lcp$ stores the length of the longest common prefix (lcp) between lexicographically consecutive suffixes, i.e., $lcp[i] = \mathrm{lcp}(T[sa[i-1], n], T[sa[i], n])$, for all $1 < i \le n$, and $lcp[1] = 0$ for convenience.
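The following minimal Python sketch (0-based indexing, plain comparison sorting rather than the linear-time algorithms cited in this survey; the function names are ours, for illustration only) makes the definitions of $sa$ and $lcp$ concrete.

```python
def suffix_array(t: str) -> list[int]:
    # sa[i] = starting position of the i-th lexicographically smallest suffix
    return sorted(range(len(t)), key=lambda i: t[i:])

def lcp_array(t: str, sa: list[int]) -> list[int]:
    def lcp(a: str, b: str) -> int:
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        return k
    # lcp[0] = 0 by convention; lcp[i] compares consecutive sorted suffixes
    return [0] + [lcp(t[sa[i - 1]:], t[sa[i]:]) for i in range(1, len(sa))]

t = "banana$"
sa = suffix_array(t)             # [6, 5, 3, 1, 0, 4, 2]
print(sa, lcp_array(t, sa))      # lcp = [0, 0, 1, 3, 0, 0, 2]
```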
Burrows-Wheeler Transform (BWT) and Multi-string BWT.
The Burrows-Wheeler Transform (BWT) [12] is a reversible text transformation that, given as input a string $T$, produces a permutation of its characters: the output string $bwt$ is obtained by concatenating the characters that circularly precede the lexicographically sorted suffixes of $T$, i.e., $bwt[i] = T[sa[i] - 1]$ if $sa[i] > 1$, and $bwt[i] = T[n]$ (the end-marker) if $sa[i] = 1$.
The reversibility of the BWT is guaranteed by the following two properties [12]. Let $F$ be the string formed by lexicographically sorting the characters of $T$:
-
for all $i$, the character $F[i]$ circularly follows the character $bwt[i]$ in the string $T$;
-
for each character $c$, the $r$-th occurrence of $c$ in $bwt$ corresponds to the $r$-th occurrence of $c$ in $F$. In particular, if $bwt[i] = c$, then the position in $F$ corresponding to that $c$-occurrence is given by $LF(i) = C[c] + rank_c(bwt, i)$, where $C[c]$ is the total number of characters in $T$ that are smaller than $c$ and $rank_c(bwt, i)$ is the number of occurrences of character $c$ in the substring $bwt[1, i]$.
The function described above that maps character occurrences from $bwt$ to $F$ is known as LF-mapping [27]. The LF-mapping yields a permutation of the positions $\{1, \ldots, n\}$ which forms a single cycle over all positions.
Example 1.
Given the string of length , the output string is . The LF-mapping yields the following permutation .
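As a worked illustration of the definition and of the LF-mapping, the sketch below (ours, quadratic and purely in memory, not one of the construction algorithms surveyed here) builds the BWT of a $-terminated string by sorting its suffixes and then inverts it by walking the LF-mapping.

```python
def bwt(t: str) -> str:
    # t must end with a unique end-marker '$' smaller than every other symbol
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    return "".join(t[i - 1] for i in sa)        # t[-1] == '$' covers sa[i] == 0

def inverse_bwt(b: str) -> str:
    # C[c] = number of characters in b smaller than c
    C, total = {}, 0
    for c in sorted(set(b)):
        C[c] = total
        total += b.count(c)
    # LF[i] = C[c] + rank of this occurrence of c = b[i] among b[0..i]
    seen, LF = {}, []
    for c in b:
        seen[c] = seen.get(c, 0) + 1
        LF.append(C[c] + seen[c] - 1)
    # recover the text backwards, starting from the row whose suffix is '$'
    out, i = [], 0
    for _ in range(len(b) - 1):
        out.append(b[i])
        i = LF[i]
    return "".join(reversed(out)) + "$"

s = "mississippi$"
print(bwt(s))                        # ipssm$pissii
print(inverse_bwt(bwt(s)) == s)      # True
```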
Historically, the BWT was extended to a collection of strings by Mantaci et al. [51] (EBWT), where cyclic rotations of the input strings are sorted (in this case, one needs to use the $\omega$-order defined in [51]) without appending any end-marker to the input strings. Here, we consider the generalization that appends to each input string in the collection a different end-marker and computes the transformed string by lexicographically sorting the suffixes of the input strings, as done for instance by [5, 24, 22]. Therefore, we define the BWT extended to a collection of strings $\mathcal{S}$, the multi-string BWT, as the permutation of the characters of $\mathcal{S}$ obtained by concatenating the characters circularly preceding the lexicographically sorted suffixes of $\mathcal{S}$.
Since each $S_i$ is terminated by a distinct end-marker symbol, the multi-string BWT can be defined in terms of the generalized suffix array (GSA). In particular, if $GSA[k] = (i, j)$, the $k$-th smallest suffix of the strings in $\mathcal{S}$ is $S_i[j, |S_i|]$. So, for $1 \le k \le N$,
$bwt[k] = S_i[j-1]$ if $j > 1$, and $bwt[k] = \$_i$ if $j = 1$, where $(i, j) = GSA[k]$.   (1)
Equivalently, one can define the multi-string BWT in terms of the suffix array of the concatenated string $T = S_1 S_2 \cdots S_m$ (see [23, 24]), i.e.
$bwt[k] = T[sa[k]-1]$ if $sa[k] > 1$, and $bwt[k] = T[N]$ if $sa[k] = 1$.   (2)
Note that both definitions in Equations (1) and (2), as well as the EBWT, are such that each input string cycles on itself. In fact, the two definitions above are equivalent, but we chose to present both of them because they imply different constructions. In presenting each algorithm we will refer to the definition that inspired the algorithm’s strategy.
Finally, the multi-string BWT is reversible by definition: one can reconstruct the collection with the strings sorted according to their original order, given that the end-markers in $\mathcal{S}$ are distinct [5, 6]. In addition, differently from the BWT applied to a single string, the LF-mapping associated with the multi-string BWT defines a permutation of the positions $1$ through $N$ that can be decomposed into $m$ cycles, one for each string in the collection.
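A minimal Python sketch of Equation (1): the multi-string BWT is obtained by sorting all suffixes of the collection and emitting, for each of them, the character that circularly precedes it within its own string. Distinct end-markers are simulated by breaking ties with the string index, and a single '$' is written in the output; this toy example is ours, not code from any of the tools discussed below.

```python
def multi_string_bwt(collection: list[str]) -> str:
    strings = [s + "$" for s in collection]
    # (suffix, string index, start position) for every suffix of every string
    suffixes = [(s[j:], i, j) for i, s in enumerate(strings) for j in range(len(s))]
    suffixes.sort(key=lambda x: (x[0], x[1]))    # '$' is the smallest symbol used
    # the end-marker circularly precedes the suffix starting at position 0
    return "".join(strings[i][j - 1] for _, i, j in suffixes)

print(multi_string_bwt(["GATA", "TAGA"]))        # AATGTGA$A$
```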
Induced suffix sorting (ISS).
Induced suffix sorting (ISS) [40, 37, 56] is a recursive technique that deduces the lexicographical order of the suffixes of a string starting from a subset of partially sorted suffixes. One of the most popular variants of this idea is SAIS [56], a linear-time algorithm to calculate the suffix array [56, 48, 55, 47] and the BWT [57, 3, 9, 22].
Each recursive level $i$ of SAIS receives a string $T^i$ (with $T^1 = T$ when $i = 1$) and returns the suffix array $sa^i$ of $T^i$. The algorithm classifies the positions of $T^i$ into two types: L-type or S-type. A character $T^i[j]$ is L-type if $T^i[j] > T^i[j+1]$, or if $T^i[j] = T^i[j+1]$ and $T^i[j+1]$ is L-type. On the other hand, $T^i[j]$ is S-type if $T^i[j] < T^i[j+1]$, or if $T^i[j] = T^i[j+1]$ and $T^i[j+1]$ is S-type. By definition, the last position of $T^i$ is S-type. Furthermore, $T^i[j]$ is LMS-type (leftmost S-type) if $T^i[j]$ is S-type and $T^i[j-1]$ is L-type. A substring $T^i[j, j']$ is an LMS substring if $T^i[j]$ and $T^i[j']$ are LMS-type and no position between them is LMS-type.
Every suffix whose starting position is LMS-type is also a local minimum in the text. SAIS exploits this fact and performs three linear-time scans of $T^i$ to induce, from the relative order of the suffixes starting at LMS-type positions, a partial sort of all the suffixes of $T^i$. In this survey, this process is called an ISS pass.
For any position $j$ of $T^i$, let $e(j)$ be the smallest position greater than $j$ such that $T^i[e(j)]$ is of LMS type, or the last position of $T^i$ if there are no LMS-type positions after $j$. Additionally, let $\mathcal{D}$ be the set of strings over the alphabet of $T^i$ that label the sequences $T^i[j, e(j)]$. For two strings $X, Y \in \mathcal{D}$, their LMS order is
$X \prec_{lms} Y$ iff $X$ is lexicographically smaller than $Y$, with the convention that, when one of the two strings is a proper prefix of the other, the longer string is regarded as the smaller one.   (3)
Let $E_X$ be the set of positions $j$ such that $T^i[j, e(j)]$ corresponds to the same string $X \in \mathcal{D}$. The relation $X \prec_{lms} Y$ implies that, for any pair of positions $j \in E_X$ and $j' \in E_Y$, the suffix starting at $j$ precedes the suffix starting at $j'$. The ISS pass places the suffixes of $T^i$ starting at the positions in $E_X$ contiguously in a range of $sa^i$ whose location is determined by the rank of $X$ among the elements of $\mathcal{D}$. Thus, the suffixes in $E_X$ are lexicographically smaller than the suffixes in $E_Y$ if $X \prec_{lms} Y$. However, this sorting is partial because the relative order of the elements within $E_X$ and within $E_Y$ is still unknown.
SAIS now replaces the LMS substrings in $T^i$ by their ranks to create a new string $T^{i+1}$, with $|T^{i+1}| \le |T^i|/2$, which it passes as input to the next recursive level $i+1$ to calculate $sa^{i+1}$. When the algorithm returns to level $i$ from level $i+1$, it performs another ISS pass. However, this time it uses the information in $sa^{i+1}$ as the starting point for the induction, and thus the output of the pass becomes the fully sorted $sa^i$. If $i > 1$, SAIS returns to the previous level $i-1$, otherwise it finishes.
Each new $T^{i+1}$ is at most half the size of $T^i$. Consequently, the total cost of the algorithm is $O(n)$ time and $O(n \log n)$ bits of working space. SAIS can also be easily adapted to compute the generalized suffix array $gsa$ of a string collection [48].
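The classification step of SAIS is easy to make concrete. The sketch below (ours, 0-based indexing) marks each position of a $-terminated string as L-type or S-type and lists the LMS positions and LMS substrings; it is only the first ingredient of SAIS, not the full induced sorting.

```python
def classify(t: str):
    n = len(t)
    types = ["S"] * n                          # the final end-marker is S-type by definition
    for j in range(n - 2, -1, -1):
        if t[j] > t[j + 1] or (t[j] == t[j + 1] and types[j + 1] == "L"):
            types[j] = "L"
    lms = [j for j in range(1, n) if types[j] == "S" and types[j - 1] == "L"]
    # consecutive LMS positions delimit the (overlapping) LMS substrings
    lms_substrings = [t[lms[k]:lms[k + 1] + 1] for k in range(len(lms) - 1)]
    return "".join(types), lms, lms_substrings

print(classify("gtccttaacc$"))   # ('SLSSLLSSLLS', [2, 6, 10], ['cctta', 'aacc$'])
```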
Karp-Rabin fingerprint.
The Karp-Rabin (KR) fingerprint method [38] is a rolling hash technique that associates strings with integers. The KR fingerprints of all the substrings of a fixed length of a string of length $n$ can be computed in $O(n)$ time.
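A minimal sketch of the rolling computation (the base and modulus below are illustrative choices of ours, not taken from any of the cited papers):

```python
def kr_fingerprints(t: str, w: int, base: int = 256, mod: int = (1 << 61) - 1) -> list[int]:
    fp = 0
    for c in t[:w]:
        fp = (fp * base + ord(c)) % mod
    fps = [fp]
    top = pow(base, w - 1, mod)                # weight of the character leaving the window
    for j in range(w, len(t)):
        fp = ((fp - ord(t[j - w]) * top) * base + ord(t[j])) % mod
        fps.append(fp)                         # fingerprint of the window ending at position j
    return fps

print(kr_fingerprints("mississippi", 4))
```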
Run-length encoding.
A run in a string $T$ is a maximal substring consisting of repetitions of only one character, i.e., $T[i, i+\ell-1] = c^\ell$ for some $c \in \Sigma$ and $\ell \ge 1$. The run-length encoding (RLE) of a string $T$ encodes any such run of $T$ by the pair $(c, \ell)$.
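In Python, the encoding step is a one-liner (a toy illustration of the definition above):

```python
from itertools import groupby

def rle(t: str) -> list[tuple[str, int]]:
    # each maximal run c^l is encoded as the pair (c, l)
    return [(c, len(list(g))) for c, g in groupby(t)]

print(rle("AAACCGTTTT"))    # [('A', 3), ('C', 2), ('G', 1), ('T', 4)]
```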
Grammar compression.
This technique consists of computing a small context-free grammar that generates only the string $T$ [39, 16]. A grammar is a tuple $\mathcal{G} = (\Sigma, V, R, S)$, where $\Sigma$ is the alphabet of $T$, $V$ is a set of nonterminal symbols, $R$ is a relation describing the set of rewriting rules, and $S \in V$ is the start symbol. Each nonterminal $A \in V$ has a replacement encoded by a rule $A \to C \in R$, with $C \in (\Sigma \cup V)^*$. Given two strings $X, Y \in (\Sigma \cup V)^*$, $X$ rewrites $Y$ (denoted $X \Rightarrow Y$) if $Y$ is obtained from $X$ by replacing an occurrence of a nonterminal $A$ with the right-hand side of a rule $A \to C$ in $R$. The expansion of a nonterminal $A$ is the string over $\Sigma$ resulting from successive rewrites starting with $A$ and ending with a string containing only terminals. Expanding the start symbol $S$ produces $T$. Grammar compression algorithms create exactly one rewriting rule for every nonterminal $A \in V$, so there is only one possible sequence of rewrites that starts in $S$ and ends in $T$.
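A minimal sketch of this notation, with a toy grammar of our own: expanding the start symbol by repeatedly rewriting nonterminals reproduces the original string.

```python
def expand(symbol: str, rules: dict[str, list[str]]) -> str:
    if symbol not in rules:                     # terminal character
        return symbol
    return "".join(expand(s, rules) for s in rules[symbol])

# one rule per nonterminal, so the derivation of the start symbol 'S' is unique
rules = {"S": ["A", "A", "B"], "A": ["a", "b"], "B": ["A", "c"]}
print(expand("S", rules))                       # abababc
```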
3 Disk-based strategies
When the data to be indexed exceeds the capacity of internal memory, external-memory algorithms become necessary to handle inputs efficiently. These algorithms are specifically designed to minimize random access and optimize disk I/O operations, as accessing external storage is significantly slower than accessing RAM. They typically rely on techniques such as sequential scans, buffering, and external sorting to process data in manageable chunks while reducing disk seek times. Efficient external-memory algorithms are crucial for indexing massive datasets where internal-memory approaches would be impractical due to space constraints.
3.1 bwt-disk [Ferragina et al., 2012]
The algorithm bwt-disk (an implementation is available at https://people.unipmn.it/manzini/bwtdisk/) [26] is an external memory algorithm that constructs the BWT for a (single) input string $T$ of length $n$, without first building the suffix array for the complete string. It operates using an amount of RAM proportional to the block length $m$ chosen by the user and $O(n)$ bits of disk space, in addition to the disk space needed to store the input and the output. Moreover, all data on disk is accessed exclusively through sequential scans.
The algorithm logically partitions the input string into $k = \lceil n/m \rceil$ consecutive blocks of length $m$, such that $T = T_k T_{k-1} \cdots T_1$ (so $T_1$ is the rightmost block). Each block represents a substring of $T$ with length $m$, except possibly $T_k$, which may be shorter if $n$ is not a multiple of $m$. The BWT is computed through $k$ passes, one for each block, which process the text backwards, from $T_1$ to $T_k$. The key idea is that, during each pass $h$, only the characters of $T_h$ need to be inserted into $bwt(T_{h-1} \cdots T_1)$ to obtain $bwt(T_h T_{h-1} \cdots T_1)$, and the relative order of the suffixes of $T_{h-1} \cdots T_1$ remains unchanged.
In Example 1, suppose is partitioned into blocks of length . Then, we have , and . Initially, is computed in internal memory, and will be obtained by sequentially inserting each character of into .
For each pass $h$ (for $h = 2, \ldots, k$), the algorithm loads $T_h$ from disk into a string buffer in internal memory. Also, an auxiliary bit array, called $gt$, is assumed to be stored on disk, indicating whether each suffix of the portion $T_{h-1} \cdots T_1$ processed so far is greater than the suffix $T_{h-1} \cdots T_1$ itself. The part of $gt$ referring to the suffixes starting in $T_{h-1}$ is loaded into a bit array in internal memory.
Continuing from Example 1, at pass , we have the string buffer and .
Then, the pass proceeds through the following four steps.
In Step 1, the suffix array containing the lexicographic ordering of the suffixes of $T$ starting in block $T_h$ (each extending up to the last character of $T$) is computed in internal memory, without accessing any character of $T$ on disk. To do that, the suffixes are compared using their prefixes stored in the string buffer; whenever the comparison of two suffixes would go beyond the characters available in the buffer, the algorithm uses the corresponding value in $gt$ to determine their relative order. For example, given two suffixes starting at positions $i$ and $j$ of $T_h$, with $i < j$, if the prefix of the first read from the buffer is lexicographically smaller (or greater) than the prefix of the second, then the suffix starting at $i$ is smaller (or greater) than the suffix starting at $j$. Otherwise, if the two prefixes are equal, their relative order is determined by the corresponding bit stored in $gt$.
In Example 1, at pass , the resulting suffix array for , , is computed using only the string buffer. Therefore, the suffix ordering in is: .
In Step 2, the array $bwt_{int}$ is computed from the suffix array of Step 1. For each position, the value of $bwt_{int}$ is set to the character of $T_h$ preceding the corresponding suffix; if that suffix starts at the first position of $T_h$, $bwt_{int}$ is assigned a special character $\#$, which does not appear in $T$. Note that $bwt_{int}$ is not the actual BWT of the substring $T_h$, as the suffixes considered extend up to the last character of $T$ and their relative order may differ.
Continuing from Example 1, at pass , is computed from and .
In Step 3, the algorithm computes a counter array, $gap$, which gives in $gap[i]$ how many suffixes of the already processed portion $T_{h-1} \cdots T_1$ lie lexicographically between the suffixes in positions $i-1$ and $i$ from the current block $T_h$. This counter array is computed with a single sequential scan over the data on disk, using the bits of $gt$ and a small internal-memory data structure supporting constant-time rank queries over $bwt_{int}$. At Step 3, the algorithm also computes the content of $gt$ for the next pass as a by-product of the computation of $gap$.
Again, in Example 1, at pass , we have , since three suffixes from (namely , , and ) are smaller than the first suffix of , and the suffix from is smaller than only the last suffix of .
Finally, in Step 4, the algorithm merges $bwt_{ext}$ (the BWT built in the previous passes) with $bwt_{int}$ to create the new $bwt_{ext}$ for the next pass. To do that I/O-efficiently, the algorithm uses the counter array $gap$: for each $i$, $gap[i]$ characters from the old $bwt_{ext}$ are copied to the new one, followed by the value $bwt_{int}[i]$. Whenever the special character $\#$ is retrieved from the old $bwt_{ext}$ during this process, it is replaced by the last symbol of the current block $T_h$, which is the character that actually precedes that suffix in $T$.
In Example 1, at the end of pass , , then the first three values of are retrieved (namely ) and copied to as directly to disk. Then, the value in (namely ) is copied to , together with the next two values in (namely and ) as and , resulting in . Finally, and the last value from (namely ) is copied to and the last value from (namely ) is copied as well, yielding the final result: .
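The merge of Step 4 can be sketched in a few lines of Python. The sketch below (our own simplification) interleaves the symbols of the block into the old partial BWT according to a gap array; the replacement of the placeholder character and all disk handling are omitted.

```python
def merge_with_gap(old_bwt: str, new_symbols: str, gap: list[int]) -> str:
    # gap[j] = number of old symbols that precede the j-th new symbol
    out, p = [], 0
    for j, c in enumerate(new_symbols):
        out.append(old_bwt[p:p + gap[j]])       # copy gap[j] symbols of the old partial BWT
        p += gap[j]
        out.append(c)                           # then place the block's symbol
    out.append(old_bwt[p:])                     # remaining old symbols
    return "".join(out)

# interleave 'XYZ' into 'abcd' with gaps [1, 0, 2]
print(merge_with_gap("abcd", "XYZ", [1, 0, 2]))   # aXYbcZd
```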
Theoretical costs.
The running time of bwt-disk is $O(n^2/m)$, for a (single) string $T$ over a constant alphabet, using $O(m \log n)$ bits of RAM, where $m$ is the block length. The algorithm executes $\lceil n/m \rceil$ passes over $O(n \log \sigma)$ bits of disk data, using $O(n)$ bits of disk working space. The total number of I/Os is $O(n^2/(m B))$.
3.2 BCR and BCRext [Bauer et al., 2013]
In this section we describe the core of a strategy for constructing the BWT of a string collection according to the definition in Equation (1), which considers the strings in a circular way, so that the symbol preceding the whole string $S_i$ (i.e., its longest suffix) is the end-marker $\$_i$ of that string. The heart of this strategy is common to two algorithms, named BCR and BCRext, introduced in [5] and part of the BEETL library (https://github.com/BEETL/BEETL). BCR works in (semi-)external memory, whereas BCRext works in external memory.
The common strategy does not require linearizing the strings in $\mathcal{S}$ by concatenating them and does not require explicitly computing the GSA. Rather, it builds the BWT incrementally in as many steps as the length of the longest string in $\mathcal{S}$ (including the end-markers), by exploiting the basic properties of the BWT and the LF-mapping. The basic idea is to left-align (or, equivalently, right-align) all the strings in $\mathcal{S}$ and, by scanning all the strings right-to-left at the same time, to incrementally construct the partial BWT of the suffixes of $\mathcal{S}$ with length at most $j$, for increasing values of $j$.
For simplicity, in this description, we assume that all the strings have the same length $k$ (excluding the end-markers) and that they are right-aligned. So, at each iteration only suffixes of the same length are considered. Both algorithms build the BWT in $k+1$ steps. At each step $j = 0, \ldots, k$, they build a partial BWT, denoted $bwt^{(j)}$, containing the BWT of the suffixes of $\mathcal{S}$ of length up to $j$.
Since we consider distinct end-markers with $\$_i < \$_{i'}$ if $i < i'$, it is easy to verify that $bwt^{(0)}$ is obtained by concatenating the last (non-end-marker) symbol of each string in the same order as the strings appear in the collection: $bwt^{(0)} = S_1[k]\, S_2[k] \cdots S_m[k]$.
For each step $j \ge 1$, both algorithms simulate the insertion of all suffixes of length $j$ into the list of sorted suffixes, i.e., they insert all the symbols (circularly) preceding the suffixes of length $j$ at the correct positions into $bwt^{(j-1)}$ in order to obtain $bwt^{(j)}$.
In the last step $j = k$, the end-markers are inserted at the correct positions into $bwt^{(k-1)}$ and $bwt^{(k)} = bwt(\mathcal{S})$ is obtained.
At iteration $j$, let $p$ be the position in $bwt^{(j-1)}$ where the symbol $c$ preceding the suffix $w$ of length $j-1$ of some string was inserted at the previous iteration. To insert the symbol preceding the suffix $cw$ of length $j$ of the same string into $bwt^{(j)}$, we compute its insertion position using the LF-mapping (we omit some details for space reasons): $LF(p) = C[c] + rank_c(bwt^{(j-1)}, p)$, where $C[c]$ is the number of suffixes already in the sorted list that begin with a symbol lexicographically smaller than $c$, and $rank_c(bwt^{(j-1)}, p)$ is the function that returns the number of occurrences of $c$ in the prefix $bwt^{(j-1)}[1, p]$.
By using Example 2, the base case of BCR/BCRext consists in concatenating all symbols preceding the suffixes obtaining (because the partial sorted suffixes is: since ). Then, they simulate the insertion of the suffixes into the partial sorted suffix list by inserting the symbols preceding the respectively suffixes into , resulting in . The process continues by simulating the insertion of the suffixes into the sorted suffix list, then inserting the symbols preceding the suffixes in order to obtain . And so on. In the last step, they simulate the inserting of the suffixes and insert the symbols in the partial .
If the strings have different lengths, an end-marker is inserted when the longest suffix of a string in is reached, and then no further symbols of that string will be inserted in the following iterations.
Note that, in practice, neither algorithm needs to explicitly append end-marker symbols to the strings, since these symbols are only inserted in the last step and the symbols concatenated in $bwt^{(0)}$ follow the input order. Moreover, to avoid increasing the alphabet size, both algorithms write the same end-marker symbol in the output string, but they can store and output the index associated with each end-marker in a separate file. This is particularly important for decoding. An interested reader can refer to [5, 6].
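The following minimal in-memory Python sketch (our own, for equal-length strings with '$' as the smallest symbol) reproduces the incremental insertion strategy described above: at every round, the symbol preceding each new suffix is placed at a position derived, via an LF-mapping computation, from where the previous symbol of the same string was inserted. The file partitioning of BCR and the external radix sort of BCRext are omitted.

```python
def bcr_bwt(strings: list[str]) -> str:
    k = len(strings[0])                        # assume all strings have the same length
    bwt = [s[-1] for s in strings]             # symbols preceding the end-marker suffixes
    pos = list(range(len(strings)))            # where each string's last symbol was inserted
    seen = {"$": len(strings)}                 # first symbols of the suffixes inserted so far
    for j in range(1, k + 1):
        for s in strings:                      # register the first symbol of each new suffix
            c = s[k - j]
            seen[c] = seen.get(c, 0) + 1
        batch = []
        for i, s in enumerate(strings):
            c = s[k - j]
            smaller = sum(v for d, v in seen.items() if d < c)
            before = bwt[:pos[i]].count(c)     # occurrences of c strictly before pos[i]
            prev = s[k - j - 1] if j < k else "$"
            batch.append((smaller + before, i, prev))
        for target, i, prev in sorted(batch):  # insert at increasing final positions
            bwt.insert(target, prev)
            pos[i] = target
    return "".join(bwt)

print(bcr_bwt(["GATA", "TAGA"]))               # AATGTGA$A$, as in Equation (1)
```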
In order to reduce the I/O operations, both algorithms split each partial BWT into $\sigma + 1$ files $B_0, B_1, \ldots, B_\sigma$, where $B_0$ contains the symbols preceding the suffixes consisting of only the end-marker symbols and $B_z$ ($1 \le z \le \sigma$) contains the symbols preceding the lexicographically sorted suffixes of $\mathcal{S}$ of length at most $j$ and starting with the $z$-th smallest symbol of $\Sigma$.
The difference between the two algorithms lies in the data structures they employ, and therefore in the strategy they adopt to insert symbols incrementally.
In fact, BCR needs a data structure to keep track of each symbol to be inserted, the index of the sequence to which it belongs, and the position where each symbol was inserted in the previous iteration. Once the positions at which the new symbols are to be inserted have been updated, BCR needs to sort this information so that, at each iteration, it can insert the symbols into each of the files $B_z$ sequentially.
BCRext computes and updates the positions where new symbols are to be inserted in a similar way to BCR, but it keeps this information in auxiliary files, and the strings themselves are sorted externally by using a least-significant-digit radix sort. So it uses a negligible amount of internal memory. At the start of iteration $j$, the suffixes of length $j-1$ of $\mathcal{S}$ are assumed to be ordered, this ordering being partitioned into external files according to the first symbols of those suffixes. Auxiliary files are kept such that each contains the positions of the corresponding suffixes in the partial BWT files, ordered in the same way.
In [20], the authors show that BCR can be augmented in order to compute the LCP array and the generalized suffix array.
Moreover, in both algorithms the collection is processed in the input order, but they can be modified to use a different ordering of the strings in $\mathcal{S}$ so as to reduce the number of runs in the output BWT (see [19, 44, 6, 13, 14]).
Theoretical costs.
BCR runs using $O(m \log(mk))$ bits of internal memory, with a worst-case time complexity of $O(k\, sort(m))$, where $sort(m)$ is the time taken to sort $m$ integers, and $O(k N \log \sigma)$ bits of I/O. BCRext, instead, uses only $O(k \log \sigma + \log(mk))$ bits of internal memory, with a worst-case time complexity of $O(k N)$ and $O(k N \log \sigma)$ bits of I/O (here $k$ denotes the maximum string length).
3.3 eGSA [Louza et al., 2017]
The algorithm eGSA (an implementation is available at https://github.com/felipelouza/egsa) [49] works in external memory and computes the (generalized) BWT for a string collection of total length $N$ according to the definition in Equation (1), with each string having the same end-marker for computational efficiency. The algorithm works in two phases, as follows.
Phase 1.
The collection $\mathcal{S}$ is partitioned into sub-collections $\mathcal{S}_1, \ldots, \mathcal{S}_r$ based on the available memory, such that each $\mathcal{S}_\ell$ has size smaller than the number of available words in RAM. For each sub-collection $\mathcal{S}_\ell$, the arrays $sa_\ell$ and $lcp_\ell$ are computed in internal memory and stored to disk. The authors used gSACA-K [48] combined with the $\Phi$-algorithm [36]. Together, these algorithms run in linear time. As each value $sa_\ell[i]$ and $lcp_\ell[i]$ is written to disk, the corresponding BWT symbol and a substring of fixed length of the suffix starting at $sa_\ell[i]$ (a prefix of that suffix) are computed and written to disk. In particular, this prefix, combined with previously written values, is used to reduce external memory accesses in Phase 2.
In Example 2, suppose the input collection is partitioned into two sub-collections and , with prefix size . First, and are computed in internal memory. While each pair and is written to disk, the corresponding values in and are obtained and stored on disk as well. Next, and are computed in internal memory, and the values of and are obtained and stored on disk in the same way.
Phase 2.
The computed arrays of the sub-collections $\mathcal{S}_1, \ldots, \mathcal{S}_r$ are merged with the help of a binary heap and internal memory buffers designed to reduce string comparisons on disk. Each array is split into blocks of elements, except possibly the last block, ensuring that one block of each sub-collection can simultaneously fit in internal memory. Initially, the first block of each sub-collection is loaded from disk into a buffer $B_\ell$. The heading element of each $B_\ell$ is inserted into a binary min-heap $H$. Repeatedly, the smallest suffix, say from $\mathcal{S}_\ell$, is removed from $H$, and the corresponding BWT value is written to an output buffer. Whenever the output buffer is full, it is written to disk. Then, the next element from the same block buffer $B_\ell$ is inserted into the heap. Whenever a buffer $B_\ell$ is empty, the next corresponding block of $\mathcal{S}_\ell$ is loaded from disk into $B_\ell$.
In Example 2, considering the block size , we have the following initial blocks in the buffers and . We assume there is enough internal memory to hold and simultaneously. The heading elements and are inserted into the heap .
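The core of Phase 2 is a k-way merge. The sketch below (our own, fully in memory) merges the sorted suffix lists of the sub-collections with a min-heap and emits the BWT characters; eGSA's disk blocks, string buffers, LCP shortcuts and induction are all omitted, and suffixes are compared directly.

```python
import heapq

def merge_sub_collections(sub_collections: list[list[str]]) -> str:
    runs = []
    for sub in sub_collections:
        strings = [s + "$" for s in sub]
        # sorted (suffix, preceding character) pairs for this sub-collection
        pairs = sorted((s[j:], s[j - 1]) for s in strings for j in range(len(s)))
        runs.append(pairs)
    heap = [(pairs[0][0], r, 0, pairs[0][1]) for r, pairs in enumerate(runs)]
    heapq.heapify(heap)
    out = []
    while heap:
        suffix, r, idx, ch = heapq.heappop(heap)
        out.append(ch)                          # BWT character of the overall smallest suffix
        if idx + 1 < len(runs[r]):              # refill the heap from the same run
            nxt_suffix, nxt_ch = runs[r][idx + 1]
            heapq.heappush(heap, (nxt_suffix, r, idx + 1, nxt_ch))
    return "".join(out)

print(merge_sub_collections([["GATA"], ["TAGA"]]))   # AATGTGA$A$
```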
Note, however, that each heap comparison during Phase 2 may require accessing strings in disk randomly. To reduce disk accesses, the following strategies are presented.
For each sub-collection, a string buffer of bounded maximum size is used to store in internal memory a prefix of the heading suffix from that sub-collection. Whenever a suffix is inserted into the heap, the buffer receives a prefix of it (the prefix written to disk in Phase 1, possibly truncated to the buffer size), terminated by an end-of-buffer marker. In this way, a starting portion of the suffix is already loaded in internal memory when it is compared with other suffixes in the heap. Only if the end-of-buffer marker is reached during a comparison is the suffix accessed on disk; likewise, if the prefix was truncated, the next substrings are read from disk.
Continuing from Example 2, when the elements compared in are from and from , their string buffers and provide enough information to decide that the smallest suffix comes from (without accessing the disk) and is added to the output buffer . The next element from , , is inserted into the heap . The prefix is updated by considering that the current element from shared one symbol in common with the previous one, resulting in .
Another strategy is the usage of the LCP between nodes in $H$ to avoid string comparisons in heap insertions. Let the root of $H$ come from sub-collection $\mathcal{S}_\ell$ and consider its two children: if the LCP between the root and the next element from $B_\ell$ is larger than the LCP between the root and each of its children, then the next smallest value of $H$ also comes from buffer $B_\ell$. Also, if the LCP between the root and the next element from $B_\ell$ is smaller than or equal to the LCP between the root and a child, the comparison between their corresponding suffixes can start from the minimum LCP value. The LCP values in $H$ are updated as nodes are swapped and can be used to speed up suffix comparisons as well.
In Example 2, again when the elements from and from are compared in , the LCP between the corresponding nodes in is computed and stored. Subsequently, the node representing is removed from , and the element is inserted. Since the LCP between the new element from and the previous one is equal to and the LCP between the corresponding nodes in is also equal to , the next comparison in between the elements from and may start from the second symbol of and . In particular, this suffix comparison will require accessing the disk, as is a prefix of .
Finally, eGSA also induces suffixes: as each suffix is removed from $H$, if the symbol preceding it is not smaller than its first symbol, the preceding suffix will be induced. To do that, the corresponding value is inserted into a (queue) induced buffer associated with that preceding symbol. Each induced buffer is written to disk as it gets full. Then, when the first suffix starting with a given symbol becomes the smallest value in $H$, the induced values on disk and in the buffers are used to decide which heading element of the blocks is the smallest in $H$ and to send it to the output buffer, with no string comparison. As these values are sent to the output, the corresponding nodes in $H$ are consumed.
In Example 2, when the element from is the smallest in , the suffix will be induced. To do that, is inserted into the induced buffer . Then, when the first suffix starting with becomes the smallest in , the heading element from is sent directly to the output buffer, without string comparison.
Theoretical costs.
The running time of eGSA is $O(N \cdot maxlcp)$, where $maxlcp$ is the length of the longest common prefix between suffixes of the input strings. The total number of I/Os is $O((N k)/B)$, where $k$ is the length of the longest string in the collection and $B$ is the disk block size.
3.4 eGap [Egidi et al., 2019]
The algorithm eGap (an implementation is available at https://github.com/felipelouza/egap/) [23] is an external memory algorithm for the computation of the BWT of a collection of strings $\mathcal{S} = \{S_1, \ldots, S_m\}$. The algorithm works in two phases, first using the available RAM to compute in memory the BWTs of subcollections of $\mathcal{S}$, and then merging the results into the final BWT.
Phase 1.
eGap uses the optimal linear-time algorithm gSACA-K [48] for the first phase; the size of the subcollections is determined based on the available memory, given as an input parameter. The gSACA-K algorithm is used to compute the suffix array of each subcollection, from which, by Equation (2), the corresponding multi-string BWT is obtained.
Phase 2.
The second phase is based on the gap algorithm [24], in turn derived from an earlier algorithm by Holt and McMillan [32]. The ideas from gap are used to merge the BWTs of the subcollections.
For simplicity of exposition, we describe here the algorithm when merging single-string BWTs, but the same algorithm can merge multi-string BWTs. The latter is important because it allows to adopt a multi-stage strategy in which the BWTs computed in Phase 1 are merged in successive rounds.
The merging phase does not need to access the original strings but requires only their BWTs $bwt_1, \ldots, bwt_m$. It works by iteratively sorting the symbols according to prefixes of increasing length of their contexts: after iteration $h$, they are sorted according to the length-$h$ prefixes of their contexts. The original BWTs are not explicitly sorted during the iterations; rather, the algorithm produces a recipe for performing the merge: the final iteration yields a length-$N$ vector $Z$, with values in $\{1, \ldots, m\}$, such that $Z[i] = t$ iff the $i$-th entry of the merged BWT comes from $bwt_t$.
More formally, let $Z^{(h)}$ be the approximation of the array $Z$ after iteration $h$, with $Z^{(0)}$ the initial one. Then the following property holds:
Property Z.
For every $h \ge 1$ and every pair $t, t'$, the $i$-th occurrence of $t$ precedes the $i'$-th occurrence of $t'$ in $Z^{(h)}$ iff the length-$h$ context of the $i$-th entry of $bwt_t$ is lexicographically smaller than the length-$h$ context of the $i'$-th entry of $bwt_{t'}$, or the two contexts are equal and $t \le t'$.∎
For instance, assume that we are merging the three BWTs of the strings of Example 2. The BWTs are , and . Then, which corresponds to a partially merged bwt , where the suffixes are ordered only by their length two prefixes. For instance, the twelfth and thirteenth character are as opposed to in the final since their suffixes both start by and comes from whereas is from .
Array $Z^{(h)}$ is obtained through a sequential scan of $Z^{(h-1)}$ together with the input BWTs. The array $Z^{(h)}$ is partitioned into buckets, one for each character of the alphabet. This is achieved through an array $C$ (in [23] the array is denoted differently), where $C[c]$ is defined as the number of occurrences of characters smaller than $c$ in the input BWTs. Therefore $C[c]$ marks the beginning of the $c$-bucket in $Z^{(h)}$. When scanning $Z^{(h-1)}$, if $Z^{(h-1)}[i]$ is $t$, the algorithm looks at the next unread character in $bwt_t$: if it is $c$, then it stores $t$ in the next free position of bucket $c$ in $Z^{(h)}$. So, continuing the example above, the character bucket of will be . Notice that now the twelfth character will be since its suffix is prefixed by and the thirteenth character will be , since its suffix starts with .
From Property Z, it follows that $Z^{(h)}$ can be logically partitioned into blocks corresponding to sets of symbols whose contexts share the same length-$h$ prefixes. An additional bit array $B$ keeps track of these blocks: $B[i] = 1$ iff a new block starts at position $i$. When all entries of the array $B$ are non-zero, $Z^{(h)}$ will not change in subsequent iterations and it is the required array $Z$. In our example, at the last iteration .
Then, a sequential scan of $Z$ together with the input BWTs allows writing the characters of the merged BWT in sequential order. Back to our example: is the first from , so it's ; the second is the first from (it's ); then we have an from ; and then the second character from (again an ) since .
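The following in-memory Python sketch (ours) captures the merging idea: starting from the trivial interleave, the vector $Z$ is repeatedly redistributed into character buckets until it stabilizes, and the stable $Z$ is then used to interleave the input BWTs. The bit array $B$, the irrelevant-range skipping and all external-memory machinery of eGap are omitted, and convergence is detected naively.

```python
def gap_merge(bwts: list[str]) -> str:
    alphabet = sorted(set("".join(bwts)))
    z = [t for t, b in enumerate(bwts) for _ in b]     # Z^(0): sorted by the empty context
    while True:
        buckets = {c: [] for c in alphabet}
        ptr = [0] * len(bwts)                          # next unread position of each BWT
        for t in z:                                    # one sequential scan of Z^(h-1)
            c = bwts[t][ptr[t]]
            ptr[t] += 1
            buckets[c].append(t)                       # place t in the c-bucket of Z^(h)
        new_z = [t for c in alphabet for t in buckets[c]]
        if new_z == z:                                 # fixed point: Z will not change any more
            break
        z = new_z
    ptr, out = [0] * len(bwts), []
    for t in z:                                        # final scan: interleave the BWT characters
        out.append(bwts[t][ptr[t]])
        ptr[t] += 1
    return "".join(out)

# merging the single-string BWTs of GATA$ and TAGA$ gives the multi-string BWT
print(gap_merge(["ATG$A", "AGTA$"]))                   # AATGTGA$A$
```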
The input BWTs are read from disk and never moved to memory.
Phase 2 of eGap actually only uses two copies of the array to store $Z^{(h)}$ for all values of $h$: namely, $Z^{old}$ and $Z^{new}$, whose roles are swapped at each iteration. They can be maintained in external memory, since access to each bucket of $Z^{old}$ and $Z^{new}$ is sequential.
In [24, Lemma 5], it is proven that if $B[i]$ is set to $1$ at iteration $h$, then $Z^{(h')}[i]$ will not change any more at any iteration $h' > h$. Since the roles of $Z^{old}$ and $Z^{new}$ are swapped, starting at iteration $h+2$, processing entry $i$ to compute the new array is useless. So an additional array is used to keep track of entries that will not change again: if at iteration $h$ the algorithm finds that $B[i] = 1$, then the corresponding entry of this array is set as well. Any sequence of set entries defines an irrelevant range that can be skipped in the subsequent processing. So, at each iteration, irrelevant ranges are sequentially read from file to skip them, and new irrelevant ranges are sequentially written to file. Since maintaining and skipping ranges has a cost, only ranges of a significant size (according to a configurable parameter) are considered.
Computing the LCP.
The eGap algorithm can also compute the LCP of the input collection. This is done in two steps: during Phase 2, if at iteration $h$ a new block boundary appears at position $i$, a pair $(i, h-1)$ is written to file to record that $lcp[i] = h-1$ (then $B[i]$ is set to $1$). Notice that in each file all entries are ordered according to their first components. In an additional third phase of the algorithm, the entries from all temporary files are merged using a standard external memory multiway merge.
Theoretical costs.
We analyze costs assuming a fixed-size alphabet. Phase 1 of the algorithm runs in total time $O(N)$, splitting the input into subcollections of size $O(M)$, if $M$ is the available RAM.
Without skipping the irrelevant ranges, Phase 2 would require $maxlcp$ (the maximum LCP value) sequential scans of $O(N)$ items. By skipping irrelevant ranges, the overall amount of data directly read/written by the algorithm is $O(N \cdot avelcp)$ items, where $avelcp$ is the arithmetic average of the entries in the LCP array. On the other hand, skipping irrelevant ranges causes an overhead in terms of disk blocks; so the overall cost of Phase 2 can be bounded above by $O(N \cdot avelcp / B)$ sequential I/Os.
RAM usage is flexible. In Phase 1, the larger the memory used, the faster the algorithm. Phase 2 uses a negligible amount of RAM, but a semi-external version that keeps part of the data in RAM was also implemented.
The overall cost of Phase 3, which sorts the LCP entries, is $O((N/B) \log_K maxlcp)$ sequential I/Os, since it requires $O(\log_K maxlcp)$ rounds (where $K$ is a configurable parameter), each one merging $K$ LCP files by sequentially reading and writing all the entries.
3.5 BWT+LCP [Bonizzoni et al., 2021]
The BWT+LCP algorithm [8], implemented in a prototype called bwt-lcp-em (https://github.com/AlgoLab/bwt-lcp-em), has similarities with BCR/BCRext [5] (Section 3.2) and uses ideas from [32], but proposes a different strategy.
The paper [8] uses a specific definition of suffixes that, contrary to the definition adopted in this survey, does not include the end-marker when counting the length of a suffix. So a suffix can have length $0$, and in this case it is not the empty string but consists of the end-marker alone. We chose here to use a notation that agrees with this choice, for the sake of the reader who wants to delve into the details of [8]. To avoid confusion, in the following we will call these suffixes no-suffixes.
The algorithm works in two distinct phases.
It first computes partial multi-string BWTs, and then merges the BWTs to the final BWT, making largely use of external memory.
The partial BWT $bwt'_j$ (in [8] the partial BWTs are denoted differently; we use the prime to explicitly distinguish these partial BWTs from the partial BWTs computed in BCR) only takes into account the no-suffixes of length exactly $j$. Notice the difference from the partial BWTs of BCR (Section 3.2), which are relative to all suffixes of length up to $j$. Also notice the difference from eGap (Section 3.4), where the partial BWTs computed in the first phase are complete BWTs of subcollections.
Phase 1.
The partial BWTs are computed with the support of positional representations of the input strings: these are arrays $T_j$ such that element $T_j[i]$ is the $j$-th character from the end of string $S_i$, without counting the end-marker. The end-marker is added as a first character to each string. So $T_1$ contains all the last characters of the strings.
The first partial BWT $bwt'_0$ contains the concatenation of the last characters of each string, since it is the BWT of the length-$0$ no-suffixes (i.e., suffixes consisting only of the end-markers), and so it is identical to $T_1$.
To produce $bwt'_j$ from $bwt'_{j-1}$, for each $j = 1, \ldots, K$ (where $K$ is the maximum length of the strings), a radix-sort pass is used, ordering the strings from the rightmost character. In this phase, the positional array currently accessed must be kept in main memory, since it is read in random order. The sorting is implemented using interleave arrays $N_j$: at the beginning of the $j$-th iteration, $N_{j-1}$ records, position by position, from which string the next character must be read. As $N_{j-1}$ is scanned, characters from the required strings are read: if character $c$ is read from string $S_i$, the symbol preceding the corresponding suffix is appended to the partial BWT under construction and $i$ is appended to the bucket for character $c$. The concatenation of all buckets, ordered according to the alphabetical ordering of the characters, is the new $N_j$. The array $N_0$ is trivially initialized with the input order of the strings.
Let’s consider again the strings of Example 2. To show how Phase 1 works, let’s see how is obtained from and . Notice that since it contains the th character from the end of each string, not counting the end marker. According to , the first character must be read from string , so the first character of is . At the same time, the first index in the bucket is . The second character must come from (it is , and bucket gets a ), and the third from (it is an and bucket gets a ). So and , as the concatenation of the buckets for , and in alphabetical order.
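A minimal in-memory Python sketch (ours) of Phase 1 under the conventions above: at iteration j, one radix pass over the previous order buckets the strings by their j-th character from the end, and the partial BWT of the length-j no-suffixes is emitted in the resulting order. End-markers are written as '$' and all external-memory details are omitted.

```python
def partial_bwts(strings: list[str]) -> list[str]:
    alphabet = sorted(set("".join(strings)))
    order = list(range(len(strings)))                   # length-0 no-suffixes: input order
    bwts = ["".join(s[-1] for s in strings)]            # bwt'_0: last character of each string
    for j in range(1, max(len(s) for s in strings) + 1):
        buckets = {c: [] for c in alphabet}
        for i in order:                                  # radix pass on the j-th char from the end
            if len(strings[i]) >= j:
                buckets[strings[i][-j]].append(i)
        order = [i for c in alphabet for i in buckets[c]]
        # symbol preceding each length-j no-suffix: the next char to the left, or '$'
        bwts.append("".join(strings[i][-j - 1] if len(strings[i]) > j else "$" for i in order))
    return bwts

for j, b in enumerate(partial_bwts(["GATA", "TAGA"])):
    print(j, b)        # 0 AA / 1 TG / 2 AA / 3 TG / 4 $$
```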
Phase 2.
When all partial BWTs have been built, they are merged into the final BWT. Notice that the ordering within each partial BWT will be preserved in the merge, since no-suffixes of the same length are already correctly ordered relative to each other.
The merge is first carried out virtually, building an array $Z$ that describes how the merge must be performed. Specifically, $Z[k] = j$ if the $k$-th element of the final BWT is the character preceding a length-$j$ no-suffix. Therefore, to construct the output BWT, element $k$ will be taken from $bwt'_j$ with $j = Z[k]$.
To build $Z$, the no-suffixes of different lengths must be ordered relative to each other. $Z$ is approximated in successive iterations, and the no-suffixes are ordered starting from their last characters using a radix-sort-like algorithm.
Precisely, at iteration $h$, an approximation $Z^{(h)}$ is computed from $Z^{(h-1)}$. The initial $Z^{(0)}$ trivially proposes to concatenate all the partial BWTs in increasing order of the suffix lengths, that is, it lists first all the entries of $bwt'_0$, then those of $bwt'_1$, and so on.
Then, at iteration $h$, $Z^{(h-1)}$ is scanned sequentially and, for each entry, if $Z^{(h-1)}[k] = j$ then the next unread character is read from $bwt'_j$. If it is not an end-marker, the length $j+1$ is appended to the bucket of that character. The $\$$-bucket is fixed as a sequence of $0$s, since the suffixes starting with the end-markers are no-suffixes of length $0$. Then $Z^{(h)}$ is the concatenation of all buckets, in alphabetical order of the characters.
Let us see this in our running example: the partial bwts are: , , , , and . Scanning , since , the first character is from . Then the character is , and so bucket gets as first index . The second character is again from , since ; it is a and thus bucket gets as first index . All buckets are filled continuing in this fashion, except that when an end marker is encountered, it is disregarded: so, for instance, nothing is done for , since the first character of is . Finally, the concatenation of all buckets in alphabetical order yields
From the final $Z$, the output BWT is computed as it is done in eGap using the array $Z$, except that here the partial BWTs $bwt'_j$ are used. In our example, . The first three characters of are from , since starts with three s. Then we have an which is the first character of , since , and so on.
Computing the LCP.
As implied by its name, the BWT+LCP algorithm [8] can also compute the LCP array of the input sequence collection. This is done during the second phase: as the array $Z$ is computed, at each iteration the algorithm keeps track of the longest common prefixes locally encountered at each location, by inspecting the corresponding BWT characters.
Theoretical costs.
For a fixed-size alphabet, the algorithm runs in $O(N \cdot maxlcp)$ time, where $maxlcp$ is the maximum LCP value, when the alphabet is constant and memory addresses can be stored in a single memory word. It uses main memory proportional to the number of input strings for the positional representation (where each string has length at most $K$, the maximum length of the input strings) and requires $O(N \cdot maxlcp / B)$ I/Os.
4 Strategies exploiting compressibility
Another way to scale the computation of large BWTs is to apply a lightweight compression scheme on top of the text to produce a small data representation from which we derive the BWT. This strategy has several advantages: (i) the compressed representation fits smaller memories that are closer to the CPU, so the transmission of data is more efficient, and we need fewer bits of satellite information on top of the text (i.e., less RAM and disk usage); (ii) the deduplication of strings via compression avoids redundant BWT computations. Overall, if we use the correct tools, this approach reduces both time and space. However, it also has some pitfalls. A greedy compression scheme (like Lempel-Ziv or RePair) greatly reduces space usage, but also imposes considerable overhead, defeating the whole purpose of compressing. Moreover, the compact data representation we produce must follow some structure that facilitates the computation of the BWT, otherwise decompressing the strings every time we need to operate on them makes things more inefficient.
In this section, we present algorithms that exploit the repetitiveness of the text to compute the BWT.
4.1 Big-BWT [Boucher et al., 2019]
Boucher et al. [11] introduced a pre-processing technique, called prefix-free parsing, to compute the BWT for a (single) highly repetitive input string of length $n$ in $O(n)$ time. The rationale behind this technique is to apply a simple compression scheme so as to exploit the string repetitiveness, and then to compute the BWT from the encoded string. The pre-processing step generates a dictionary and a parse of the input string, and it is effective if both the dictionary and the parse together are much smaller than the original input string. In this case, indeed, the computation executed in internal memory is more resource-efficient. The Big-BWT tool (https://gitlab.com/manzai/Big-BWT) implements this technique.
Experiments showed that the size of the parse is typically the most demanding component for very large and repetitive inputs. Thus, in order to reduce the memory requirements, recursive prefix-free parsing has been recently introduced in [58], where the prefix-free parse is applied to the parse generated by prefix-free parsing the input string.
In the following we outline the key phases of the computation via prefix-free parsing and highlight the properties of this technique.
Parsing phase.
The main idea of the parsing phase is to divide the input string into overlapping phrases of variable length that will constitute the dictionary. More precisely, the string considered is $T' = T\$^w$, where $T$ is the input string of length $n$ over $\Sigma$, the $\$$s are end-markers not in $\Sigma$, and $w$ is a fixed parameter.
In order to parse $T'$, let $E$ be a set of strings of fixed length $w$ over $\Sigma$ (called trigger strings), augmented with the strings marking the beginning and the end of $T'$. The dictionary $D$ comprises all the strings $d$ such that (i) $d$ is a substring of $T'$, (ii) exactly one prefix of $d$ is in $E$, (iii) exactly one suffix of $d$ is in $E$, (iv) no other substring of $d$ is in $E$. The sequence of dictionary phrase occurrences that forms $T'$ is the parse $P$, in which each phrase occurrence is encoded by a meta-character given by the lexicographic rank of that phrase in $D$. Then, $T'$ is represented by the pair $(D, P)$, where $D$ is lexicographically sorted and $P$ is composed of characters that correspond to positions in $D$. For instance, let be the string in Example 1, , and trigger strings given by , and . Then, . The parse of is which results in , when using the lexicographic rank to identify phrases in .
In Big-BWT [11], $D$ is iteratively built through one pass over $T'$ using a Karp-Rabin hash function and a parameter $p$. The trigger strings are implicitly found by passing a sliding window of length $w$ over $T'$: wherever the KR fingerprint of the content of the current window is $0$ modulo $p$, a trigger string is found. Then, the current phrase terminates at the end of the window and is added to the dictionary (recording also the phrase frequency), and the next phrase starts at the beginning of the current window. Thus, the input string is decomposed into overlapping variable-length phrases, each starting and ending with a trigger string of length $w$. After sorting the dictionary $D$, the final parse $P$ is generated.
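The sketch below (ours) mimics the parsing step on a small scale: a window of length w slides over the padded string, a toy deterministic hash stands in for the Karp-Rabin fingerprint, and a phrase is closed whenever the hash of the window is 0 modulo p. The sentinel handling and the parameters w and p are illustrative simplifications, not Big-BWT's actual choices.

```python
def prefix_free_parse(t: str, w: int = 2, p: int = 5):
    def window_hash(win: str) -> int:                  # toy stand-in for a KR fingerprint
        h = 0
        for c in win:
            h = (h * 256 + ord(c)) % p
        return h

    s = t + "$" * w                                    # pad so the parse ends at a trigger
    phrases, start = [], 0
    for j in range(1, len(s) - w + 1):
        if window_hash(s[j:j + w]) == 0 or j == len(s) - w:
            phrases.append(s[start:j + w])             # the phrase ends with the trigger window
            start = j                                  # the next phrase starts with that window
    dictionary = sorted(set(phrases))                  # lexicographically sorted dictionary D
    rank = {ph: r for r, ph in enumerate(dictionary)}
    return dictionary, [rank[ph] for ph in phrases]    # D and the parse P

d, parse = prefix_free_parse("GATTACATTAGATTACAT")
print(d, parse)
```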
The main property of the prefix-free parsing procedure is that none of the suffixes of length greater than $w$ of the phrases in $D$ is a prefix of any other [11, Lemma 1]. This property is fundamental for the next phase, when building the BWT of $T$.
Building the BWT of the input string.
To construct the BWT of $T$, we follow the definition that involves appending the end-marker to $T$ and then lexicographically sorting its suffixes.
Any suffix of $T$ corresponds to a unique suffix of $T'$ of length greater than $w$ (the same suffix followed by the $w$ end-markers). Then, the permutation that lexicographically sorts the suffixes of $T'$ of length greater than $w$ also lexicographically sorts the corresponding suffixes of $T$ [11, Lemma 5].
Moreover, any suffix of $T'$ of length greater than $w$ has exactly one prefix that is a phrase suffix, i.e., a suffix of a phrase in the dictionary $D$ [11, Lemma 3]. This implies that the order among phrase suffixes can be carried over to the suffixes of $T'$. Indeed, given two suffixes $\alpha$ and $\beta$ of $T'$ (of length greater than $w$) and their unique corresponding prefixes $a$ and $b$ that are phrase suffixes, $a < b$ implies $\alpha < \beta$. However, if $a = b$, some information about $\alpha$ and $\beta$ must be taken into account.
More precisely, one can think of constructing the string $bwt(T)$ in two passes according to the list of lexicographically sorted suffixes (of length greater than $w$) of the phrases in $D$, where each suffix is considered a number of times equal to its phrase frequency. In a first pass, one builds the sequence of characters preceding any suffix of $T'$ whose phrase-suffix prefix is either a proper suffix of only one dictionary phrase or an element of $D$ that occurs once in $P$. Both cases have no ambiguity, since the preceding character is uniquely determined. In a second pass, one deals with the sequence of characters preceding suffixes whose phrase-suffix prefix is either an element of $D$ occurring more than once in $P$ or a proper suffix of different dictionary phrases. In this second instance, one needs the list of lexicographically sorted suffixes of $P$ to put the characters in sequence according to their correct order [11, Lemma 7]. Considering the string of Example 1 and the dictionary defined above, the list of lexicographically sorted suffixes of length greater than $w$ is , where for each suffix its phrase frequency is reported within brackets. Then, in the first pass, the sequence of characters built is , where the dashes left correspond to the characters preceding the two occurrences of the repeated dictionary phrase. Finally, the BWT is completed considering the lexicographic order of the suffixes of $P$ starting with the meta-character corresponding to that phrase.
In Big-BWT, the BWT of the parse $P$ is computed in linear time using the suffix array construction algorithm SACA-K [54]. However, instead of the string $bwt(P)$, it stores an inverted list that associates to each dictionary phrase the list of positions in $bwt(P)$ where that phrase occurs. The inverted list is a format more suitable for the next step, which consists in lexicographically sorting the suffixes in $D$, and then, by scanning them, placing the corresponding characters in the $bwt(T)$ string. The sorting of the phrase suffixes is accomplished by applying the gSACA-K [48] algorithm.
Theoretical costs.
The computation for an input string of length $n$ uses $O(|D| + |P|)$ space, where $|D|$ is the sum of the lengths of the dictionary phrases and $|P|$ is the number of elements in the parse, and it takes $O(n)$ time when working in internal memory.
4.2 grlBWT [Díaz-Domínguez and Navarro, 2023]
Díaz-Domínguez and Navarro presented in [22] an external memory approach that computes the BWT of a collection of $m$ strings and $N$ symbols according to Equation (1). Their method, called grlBWT (an implementation is available at https://github.com/ddiazdom/grlBWT), builds on SAIS (Section 2) but maintains data in compressed form to reduce RAM usage and redundant calculations. We refer the reader to Section 2 for the notation used in this section.
Each recursive level $i$ of grlBWT receives a collection $\mathcal{S}^i$ and returns the BWT of $\mathcal{S}^i$ as defined in Equation (1). To do so, it computes the set $\mathcal{D}^i$ of strings labelling the LMS substrings of $\mathcal{S}^i$ and uses this set to fill in a preliminary BWT. The output is preliminary because it has some incomplete areas that could not be inferred from $\mathcal{D}^i$. The algorithm then produces a grammar $\mathcal{G}^i$ that generates the strings in $\mathcal{D}^i$, to finish the preliminary BWT later. Subsequently, grlBWT replaces the LMS substrings of $\mathcal{S}^i$ by their corresponding nonterminals in $\mathcal{G}^i$ to create another collection $\mathcal{S}^{i+1}$ that it passes as input to the recursive level $i+1$. The base case of the recursion is when all strings in $\mathcal{S}^i$ have length one, in which case the BWT is the input itself. When grlBWT returns to level $i$ from level $i+1$, it combines the information from $bwt(\mathcal{S}^{i+1})$ and $\mathcal{G}^i$ to fill the incomplete areas of the preliminary BWT and thus produce $bwt(\mathcal{S}^i)$.
In practice, grlBWT is an iterative method with two phases, the parsing phase and the induction phase. The former simulates the descent of the recursive levels and the latter simulates the ascent.
Parsing phase.
During each parsing round $i$, grlBWT parses $\mathcal{S}^i$ to compute $\mathcal{D}^i$ and then applies ISS over $\mathcal{D}^i$ to simulate the partially sorted suffix array of $\mathcal{S}^i$ that SAIS obtains in an ISS pass. Recall from Section 2 that the ISS pass partitions the suffix array such that each block contains the suffixes of $\mathcal{S}^i$ prefixed by the same string of $\mathcal{D}^i$. The algorithm builds the preliminary BWT using this partition and the fact that each block producing an equal-symbol run does not need further processing.
The parsing round starts by calculating $\mathcal{D}^i$ along with an array that stores, for each phrase, the number of times it appears as an LMS substring in $\mathcal{S}^i$. Subsequently, grlBWT runs ISS over $\mathcal{D}^i$ to obtain an array where the suffixes of the phrases are arranged in LMS order (Equation (3)), breaking ties between equal suffixes arbitrarily. The algorithm scans this array from left to right to visit every block of suffixes labelled by the same string, and uses the frequency information to compute the extent of the corresponding block of the preliminary BWT. The algorithm skips each suffix of length one because it overlaps the first symbol of some other LMS substring. When grlBWT reaches a suffix, it inspects the symbol that precedes its occurrences in the phrases of $\mathcal{D}^i$. If the suffix is always a proper suffix and is preceded by the same symbol $c$, grlBWT appends an equal-symbol run of $c$ to the preliminary BWT. If it is a proper suffix but is preceded by different symbols, then it appends an incomplete block instead, marked with a special symbol. The third option is that the suffix matches a full string of $\mathcal{D}^i$; in that case, grlBWT appends a block marked with another special symbol that indicates a different type of incomplete block.
The next step is to build a grammar (Section 2) that generates the set of different strings producing incomplete blocks in . Thus, for a phrase , grlBWT creates the rule , where is the order of in . In addition, it replaces on the right-hand side with , where is the order of the longest suffix which is also a phrase in . Recursively, stores the second longest suffix that also occurs in . This structure repeats times to cover the proper suffixes of in , and the last two symbols form the rule , where denotes the empty symbol. Let be the sequence of pairs in the recursive rewrites of . Notice that every encodes the nonterminal expanding to the -th suffix of (from left to right) and the symbol that precedes it. Additionally, stores the rightmost symbol of that does not overlap the following LMS substring in .
To depict the steps of the first parsing round, let us consider again the collection of Example 2. The underlined symbols are at LMS-type positions. After computing the LMS substrings from , the method obtains the set of phrases and the corresponding array with their frequencies in . Then, it uses ISS to compute and from to produce:
Recall that refers to the -th suffix of the phrase in (from left to right). The partition of according to the suffixes of becomes . After block , all the blocks have length one because their corresponding suffixes (in grey above) are unique in . Also note that the method regards as the same symbol. In , the first block is incomplete because the method cannot infer the relative order of the occurrences of in from and . Moreover, the blocks , , , , and are also incomplete (with symbol ) because their corresponding suffixes match full phrases of . The next step is to construct the grammar from the strings that produce incomplete blocks in (i.e., ). The original rules are . However, the method only keeps in the right-hand sides of the proper suffixes associated with incomplete blocks, so the rules become , , , , , with being the null symbol. Finally, the parsing round stores and , computes the new collection and continues to the next parsing round.
Induction phase.
Every iteration induces the order of symbols in the incomplete blocks of using and . The block is associated with a string that only labels LMS substrings of , and if is the -th incomplete run of , then is the nonterminal assigned to in . Consider the partition of according to their buckets in the fully sorted version of . Each block stores the symbols that precede the suffixes in prefixed by (i.e., those in ). There is a map from to the range with the suffixes of prefixed by . Consequently, filling in reduces to scanning from left to right, and for each in , run and append to .
On the other hand, each is associated with a string that occurs as a proper suffix in different LMS substrings of , say and , and is preceded by different symbols. That is, and . If has order in , then is the -th incomplete run of and is the number of times occurs as a suffix in the LMS substring of .
The encoding of is convenient to fill because tells that goes to the incomplete run number , goes to the incomplete run number , and so on. Thus, a symbol associated with in goes to after the symbols associated with in . In other words, it is possible to fill all incomplete blocks of in one linear decompression of . Another important observation is that a run produces copies of . Thus, grlBWT calls once and copies the content times in each incomplete block in .
The induction round works as follows: Create an array with buckets, where is the number of incomplete blocks of labelled . Then, scan from left to right and, for each run in , execute and append copies of every symbol in the bucket of . In addition, replace by in . Next, perform a sorted merge of , , and (observe that the three arrays are over the alphabet ), replacing each of with the next symbols of , and each with the next symbols of .
Let us complete the preliminary BWT of the parsing round from our current example using the multi-string and the grammar . The rules of this grammar are , , , , . Observe that the alphabet of is a subset of the symbols on the left-hand side of . The method first creates the array with one bucket as contains one block labelled . The length of is three because the cumulative length of the blocks in labelled is three. The method scans from left to right to compute on the symbol of each run, and uses the output to fill the buckets of . The combined information of and yields . In turn, this information updates to , where the underlined symbols represent the replacement of by . Note that and do not produce symbols for because their suffixes do not produce incomplete blocks . The method also updates each with the rightmost symbol in that belongs to , so becomes . Finally, the method merges and to produce the final multi-string of in Example 2. In this case, the underlined symbols originally belonged to .
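The bucket-filling idea of the induction round can be sketched as follows. This toy Python code assumes an uncompressed list of (symbol, run length) pairs for the BWT of the next level and a plain dictionary mapping each nonterminal to its (suffix nonterminal, preceding symbol) pairs; grlBWT stores both objects in compressed form, so the sketch only mirrors the logic, not the implementation.

```python
# A minimal sketch of the bucket-filling step of the induction round, on toy
# data structures. bwt_next is the BWT of the next level as (symbol, run_length)
# pairs; rules maps a nonterminal to the (suffix_nonterminal, preceding_symbol)
# pairs produced by the parsing round.

from collections import defaultdict

def fill_buckets(bwt_next, rules):
    """Scan the next-level BWT left to right and, for every run (X, k),
    decompress X once and append k copies of each preceding symbol to the
    bucket of the corresponding suffix nonterminal."""
    buckets = defaultdict(list)          # one bucket per incomplete-block label
    for symbol, run_len in bwt_next:     # each equal-symbol run
        for suffix_nt, prev_sym in rules.get(symbol, []):
            buckets[suffix_nt].extend([prev_sym] * run_len)
    return buckets

# Toy usage: nonterminal 3 expands to two suffixes preceded by 'a' and 'b'.
example_bwt = [(3, 2), (5, 1)]
example_rules = {3: [(7, 'a'), (8, 'b')], 5: [(7, 'c')]}
print(fill_buckets(example_bwt, example_rules))
# defaultdict(<class 'list'>, {7: ['a', 'a', 'c'], 8: ['b', 'b']})
```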
Theoretical costs.
Under the RAM model, grlBWT runs in time and uses bits of working memory, where is the longest string in . However, this cost is reached only with adversarial inputs; practical scenarios (e.g., ) run in time. On the other hand, grlBWT is a semi-external approach that keeps , and on disk and accesses them linearly, thus requiring I/Os. In turn, the RAM usage depends on the size of . This set is small when is repetitive, but there is no known theoretical bound for its size.
5 Future directions
There are several promising directions for future research. The first is the efficient computation of auxiliary components in BWT-based compressed self-indexes. Examples of such components are the suffix array samples of the r-index [28], the samples of the subsampled r-index [17], or the LCP-based thresholds [2]. An important challenge in these examples is how to compute the data structures without building full versions of sa and lcp.
From a systems perspective, the design of disk-based algorithms optimized for Solid State Drives (SSDs) could lead to significant speedups by taking advantage of SSD-specific features such as high random-access speeds and internal parallelism. Moreover, accelerating BWT construction on GPU architectures could offer significant performance improvements for massive datasets (see, for instance, the BWT-based compressor libbsc, https://github.com/IlyaGrebnov/libbsc, which uses NVIDIA GPU acceleration).
On the algorithmic side, the Prefix-Free Parsing (PFP) strategy should be further explored and adapted to the case of string collections. To our knowledge, the only PFP variant that works on collections is PFP-eBWT [10], which computes the extended BWT (EBWT) of Mantaci et al. [51]. On the other hand, designing new lightweight text compression schemes that are compatible with the construction of the BWT could enhance scalability to massive datasets. This idea is particularly relevant for grlBWT, where the compression of the input is the main bottleneck. Recently, Díaz-Domínguez [21] proposed a parallel grammar-construction algorithm that handles terabytes of data efficiently and whose scheme shares some similarities with grlBWT. Adapting this technique could achieve considerable reductions in space and time.
Finally, there is growing interest in designing indexing data structures (representations that encode the BWT and support LF-mapping and backward-search queries) for BWTs over large alphabets (see [29, 4, 64] and references therein). The standard solution to this problem is the wavelet tree [30], which represents the BWT as a binary tree of logarithmic height with one leaf per alphabet symbol. However, when the alphabet is large, the logarithmic factor that results from navigating the tree becomes a considerable overhead. Furthermore, the bits required to store the tree also make the wavelet tree impractical in this scenario. In this regard, Nishimoto and Tabei recently proposed a data structure [53] that encodes the BWT in space proportional to its number of runs and reduces the navigation penalty to a constant regardless of the alphabet. However, this solution is space-efficient only if the number of runs is much smaller than the text length, which occurs when the text encoded by the BWT is highly repetitive. In practical scenarios where repetitive collections also have some noise, this solution grows rapidly in size [7].
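To illustrate the traversal cost discussed above, here is a minimal pointer-based wavelet tree with rank support written in Python; it uses plain lists where practical implementations use succinct bitvectors with constant-time rank, and it is meant only to show the per-level alphabet split and the logarithmic descent.

```python
# A minimal pointer-based wavelet tree over a small alphabet, supporting
# rank(symbol, i) = number of occurrences of `symbol` in the first i positions.
# Practical implementations replace the Python lists with bitvectors that
# answer rank queries in constant time and little extra space.

class WaveletTree:
    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        if len(self.alphabet) == 1:
            self.leaf = True
            self.size = len(seq)
            return
        self.leaf = False
        mid = len(self.alphabet) // 2
        self.left_syms = set(self.alphabet[:mid])
        # bits[i] = 0 if seq[i] goes to the left child, 1 otherwise
        self.bits = [0 if c in self.left_syms else 1 for c in seq]
        self.left = WaveletTree([c for c in seq if c in self.left_syms],
                                self.alphabet[:mid])
        self.right = WaveletTree([c for c in seq if c not in self.left_syms],
                                 self.alphabet[mid:])

    def rank(self, symbol, i):
        """Occurrences of `symbol` in seq[0:i]."""
        if self.leaf:
            return min(i, self.size)
        go_left = symbol in self.left_syms
        ones = sum(self.bits[:i])            # occurrences of bit 1 in bits[0:i]
        nxt = i - ones if go_left else ones  # position mapped into the child
        child = self.left if go_left else self.right
        return child.rank(symbol, nxt)

wt = WaveletTree("abracadabra")
print(wt.rank('a', 8))   # 4 occurrences of 'a' in "abracada"
```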
6 Final Remarks
In this short survey, we highlight key algorithms for computing large BWTs by leveraging external memory or text compression techniques with no additional information about the data. It is intended as a first step towards a more in-depth comparison of the corresponding tools, potentially including experimental results to highlight the differences among them. However, a direct comparison may not be entirely fair, given the differences in the intended use cases of the tools. Some are designed for a single string, others for string collections; some work in external memory (semi- or fully), while others work in internal memory. Additionally, some tools are prototypes, whereas others are engineered solutions. Table 1 summarizes the complexities of the algorithms at the basis of these tools.
| Algorithm | Time Complexity | Memory Usage | I/O Complexity |
|---|---|---|---|
| Big-BWT [11] |  |  | - |
| bwt-disk [26] |  | words |  |
| BCR [5] |  | words |  |
| BCRext [5] |  | words |  |
| eGSA [49] |  | words |  |
| eGap [23] |  | words |  |
| BWT+LCP [8] |  |  |  |
| grlBWT [22] |  | bits |  |
In particular, bwt-disk was the first tool for constructing the BWT for a single string in external memory, without the need to first compute the suffix array. It optimizes disk access by using only sequential scans, which enables it to fully leverage modern caching systems. Additionally, bwt-disk stores all input, output, and intermediate files in compressed form, which allows it to use less total disk space than the uncompressed input for real-world datasets.
Specifically, BCR/BCRext was developed to build the BWT of huge collections of short strings over any alphabet. The experimental results in [5] showed that BCR is capable of computing the BWT of one billion strings, each 100 characters long (approximately GB of data), using only GB of RAM, while BCRext requires a negligible amount of RAM. Moreover, BCR can output auxiliary data structures (such as the Document Array, the Generalized Suffix Array and the Longest Common Prefix array).
eGSA was the first external memory algorithm for constructing generalized suffix arrays. It can also output the BWT for collections of strings of different sizes. The experimental results in [49] showed that the eGSA tool can index up to GB of data using only GB of RAM. One disadvantage of eGSA is the large amount of disk working space it requires. It is also observed in [49] that eGSA's running time degrades when the RAM is restricted to the input size.
eGap is very flexible about the use of RAM. The first phase adapts to the available memory, and the second phase has also been implemented in an alternative semi-external version. Moreover, the algorithm can output additional information (such as the Document Array) to solve, in a single sequential scan of the input, three well-known problems on collections of sequences: maximal repeats, all-pairs suffix-prefix overlaps, and succinct de Bruijn graphs.
BWT+LCP splits the problem of computing BWTs and LCPs into subproblems in a different way from other proposals, as we explained in Section 3.5, and it shows good performance in experiments.
In the context of tools that employ compression techniques, Big-BWT has been demonstrated to reduce the workspace when computing the BWT of highly repetitive inputs. The experiments in [11] showed that the prefix-free parsing procedure implemented in Big-BWT produces a dictionary and a parse that are considerably smaller than the input and can fit in internal memory, even when the input is of considerable size. For example, on a dataset of Salmonella genomes, the dictionary and the parse together took only about GB, compared to the GB of uncompressed input.
The current implementation of grlBWT can handle high volumes of repetitive data efficiently. In the experiments in [22], the tool could process 400 human genomes ( TB) in hours, with a memory peak of GB. The performance decreased on non-repetitive data (i.e., collections of DNA sequencing reads), although grlBWT remained one of the most efficient methods. On all inputs, the main bottleneck was the text compression carried out during the parsing phase, which took a large fraction of the running time. Currently, grlBWT only outputs the multi-string BWT; the construction of other data structures such as lcp or sa is not yet supported.
A comparison with almost all disk-based methods is available in [23]. More comparisons can be found in the original papers, such as in [22].
Nevertheless, there are a number of other tools in the literature, some of which are optimized to work in internal memory and/or on specific alphabets (such as, for instance, the DNA alphabet). For example, the recent ropebwt3 [45] constructs the BWT of large DNA sequence sets and enables efficient searching via the FM-index. It builds on the result reported in [26], which has seen multiple implementations [63, 59]. It uses libsais (source code: https://github.com/IlyaGrebnov/libsais) to construct a partial multi-string BWT from a sequence subset, and then merges it into an existing BWT. The tool ropebwt3 (https://github.com/lh3/ropebwt3) is particularly optimized for highly redundant data, such as pangenomes or high-coverage sequencing reads. It encodes the BWT using run-length encoding and a dynamic B+-tree (see also [44]). Another recent proposal [52] (called CMS-BWT, https://github.com/fmasillo/CMS-BWT) introduces a method that takes as input a string collection and a reference string, and computes the BWT of the (single) concatenated string by exploiting the high repetitiveness of the data through the compressed matching statistics introduced in [47].
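As a simple illustration of the representation these run-length-based tools manipulate, the following Python snippet run-length encodes a BWT string; actual tools such as ropebwt3 keep the runs inside dynamic, index-friendly structures (e.g., a B+-tree) so that new sequences can be merged in, rather than in a Python list.

```python
# A minimal sketch of run-length encoding a BWT string, the representation on
# which run-length FM-indexes build; real tools store the runs in dynamic
# structures that also support rank queries and in-place merging.

from itertools import groupby

def rle(bwt):
    """Encode a BWT string as a list of (symbol, run_length) pairs."""
    return [(sym, sum(1 for _ in grp)) for sym, grp in groupby(bwt)]

print(rle("GGG$AAATTTTAA"))
# [('G', 3), ('$', 1), ('A', 3), ('T', 4), ('A', 2)]
```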
References
- [1] Donald Adjeroh, Timothy Bell, and Amar Mukherjee. The Burrows–Wheeler transform: data compression, suffix arrays, and pattern matching. Springer, Boston, MA, 2008.
- [2] Hideo Bannai, Travis Gagie, et al. Refining the r-index. Theoretical Computer Science, 812:96–108, 2020. doi:10.1016/J.TCS.2019.08.005.
- [3] Hideo Bannai, Juha Kärkkäinen, Dominik Köppl, and Marcin Piątkowski. Constructing the Bijective and the Extended Burrows–Wheeler Transform in Linear Time. In Proc. 32nd Annual Symposium on Combinatorial Pattern Matching (CPM), pages 7:1–7:16, 2021. doi:10.4230/LIPIcs.CPM.2021.7.
- [4] Jérémy Barbay, Travis Gagie, Gonzalo Navarro, and Yakov Nekrich. Alphabet partitioning for compressed rank/select and applications. In Proc. 21st International Symposium on Algorithms and Computation (ISAAC), pages 315–326, 2010. doi:10.1007/978-3-642-17514-5_27.
- [5] Markus J. Bauer, Anthony J. Cox, and Giovanna Rosone. Lightweight algorithms for constructing and inverting the BWT of string collections. Theoretical Computer Science, 483(0):134–148, 2013. doi:10.1016/j.tcs.2012.02.002.
- [6] Gianmarco Bertola, Anthony J. Cox, Veronica Guerrini, and Giovanna Rosone. A Class of Heuristics for Reducing the Number of BWT-Runs in the String Ordering Problem. In Proc. 35th Annual Symposium on Combinatorial Pattern Matching (CPM), pages 7:1–7:15, 2024. doi:10.4230/LIPIcs.CPM.2024.7.
- [7] Nico Bertram, Johannes Fischer, and Lukas Nalbach. Move-r: optimizing the r-index. In Proc. 22nd International Symposium on Experimental Algorithms (SEA), pages 1:1–1:19, 2024. doi:10.4230/LIPICS.SEA.2024.1.
- [8] Paola Bonizzoni, Gianluca Della Vedova, Yuri Pirola, Marco Previtali, and Raffaella Rizzi. Computing the multi-string BWT and LCP array in external memory. Theoretical Computer Science, 862:42–58, 2021. doi:10.1016/j.tcs.2020.11.041.
- [9] Christina Boucher, Davide Cenzato, Zsuzsanna Lipták, Massimiliano Rossi, and Marinella Sciortino. Computing the original eBWT faster, simpler, and with less memory. In Proc. 28th International Symposium on String Processing and Information Retrieval (SPIRE), pages 129–142, 2021. doi:10.1007/978-3-030-86692-1_11.
- [10] Christina Boucher, Davide Cenzato, Zsuzsanna Lipták, Massimiliano Rossi, and Marinella Sciortino. Computing the original eBWT faster, simpler, and with less memory. In Proceedings of 28th International Symposium in String Processing and Information Retrieval SPIRE 2021, volume 12944 of Lecture Notes in Computer Science, pages 129–142, 2021. doi:10.1007/978-3-030-86692-1_11.
- [11] Christina Boucher, Travis Gagie, Alan Kuhnle, Ben Langmead, Giovanni Manzini, and Taher Mun. Prefix-free parsing for building big BWTs. Algorithms for Molecular Biology, 14(1):13:1–13:15, 2019. doi:10.1186/S13015-019-0148-5.
- [12] Michael Burrows and David J. Wheeler. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994.
- [13] Davide Cenzato, Veronica Guerrini, Zsuzsanna Lipták, and Giovanna Rosone. Computing the optimal BWT of very large string collections. In Proc. 33rd Data Compression Conference (DCC), pages 71–80, 2023. doi:10.1109/DCC55655.2023.00015.
- [14] Davide Cenzato and Zsuzsanna Lipták. A survey of BWT variants for string collections. Bioinformatics, 40(7):btae333, 2024. doi:10.1093/bioinformatics/btae333.
- [15] Davide Cenzato, Zsuzsanna Lipták, Nadia Pisanti, Giovanna Rosone, and Marinella Sciortino. BWT for string collections. Submitted to the Festschrift in honor of Giovanni Manzini.
- [16] Moses Charikar, Eric Lehman, Ding Liu, Rina Panigrahy, Manoj Prabhakaran, Amit Sahai, and Abhi Shelat. The smallest grammar problem. IEEE Transactions on Information Theory, 51(7):2554–2576, 2005. doi:10.1109/TIT.2005.850116.
- [17] Dustin Cobas, Travis Gagie, and Gonzalo Navarro. A fast and small subsampled r-index. In Proc. 32nd Annual Symposium on Combinatorial Pattern Matching (CPM), page article 13, 2021.
- [18] The Computational Pan-Genomics Consortium. Computational pan-genomics: status, promises and challenges. Briefings in Bioinformatics, 19(1):118–135, 2016. doi:10.1093/bib/bbw089.
- [19] Anthony J. Cox, Markus J. Bauer, Tobias Jakobi, and Giovanna Rosone. Large-scale compression of genomic sequence databases with the Burrows–Wheeler transform. Bioinformatics, 28(11):1415–1419, 2012. doi:10.1093/bioinformatics/bts173.
- [20] Anthony J. Cox, Fabio Garofalo, Giovanna Rosone, and Marinella Sciortino. Lightweight LCP construction for very large collections of strings. Journal of Discrete Algorithms, 37:17–33, 2016. doi:10.1016/J.JDA.2016.03.003.
- [21] Diego Díaz-Domínguez. Efficient terabyte-scale text compression via stable local consistency and parallel grammar processing. arXiv preprint arXiv:2411.12439, 2024. To appear in Proc. 23rd Symposium on Experimental Algorithms (SEA 2025).
- [22] Diego Díaz-Domínguez and Gonzalo Navarro. Efficient construction of the BWT for repetitive text using string compression. Information and Computation, 294:105088, 2023. doi:10.1016/j.ic.2023.105088.
- [23] Lavinia Egidi, Felipe A. Louza, Giovanni Manzini, and Guilherme P. Telles. External memory BWT and LCP computation for sequence collections with applications. Algorithms for Molecular Biology, 14(1):6:1–6:15, 2019. doi:10.1186/S13015-019-0140-0.
- [24] Lavinia Egidi and Giovanni Manzini. Lightweight merging of compressed indices based on BWT variants. Theoretical Computer Science, 812:214–229, 2020. doi:10.1016/j.tcs.2019.11.001.
- [25] Paolo Ferragina. Pearls of Algorithm Engineering. Cambridge University Press, 2023.
- [26] Paolo Ferragina, Travis Gagie, and Giovanni Manzini. Lightweight data indexing and compression in external memory. Algorithmica, 63(3):707–730, 2012. doi:10.1007/S00453-011-9535-0.
- [27] Paolo Ferragina and Giovanni Manzini. An experimental study of a compressed index. Information Sciences, 135(1):13–28, 2001. doi:10.1016/S0020-0255(01)00098-6.
- [28] Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in bwt-runs bounded space. Journal of the ACM (JACM), 67(1):1–54, 2020. doi:10.1145/3375890.
- [29] Alexander Golynski. Rank/select operations on large alphabets: a tool for text. In Proc. 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), volume 122, page 368, 2006.
- [30] Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. High-order entropy-compressed text indexes. In Proc. 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), volume 3, pages 841–850, 2003.
- [31] Veronica Guerrini, Felipe A. Louza, and Giovanna Rosone. Parallel lossy compression for large FASTQ files. In 16th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC), pages 97–120, 2023. doi:10.1007/978-3-031-38854-5_6.
- [32] James Holt and Leonard McMillan. Merging of multi-string BWTs with applications. Bioinformatics, 30(24):3524–3531, 2014. doi:10.1093/bioinformatics/btu584.
- [33] Hongwei Huo, Pengfei Liu, Chenhui Wang, Hongbo Jiang, and Jeffrey Scott Vitter. CIndex: compressed indexes for fast retrieval of FASTQ files. Bioinformatics, 38(2):335–343, 2021. doi:10.1093/bioinformatics/btab655.
- [34] Bruce Jacob, David Wang, and Spencer Ng. Memory systems: cache, DRAM, disk. Morgan Kaufmann, 2010.
- [35] Lilian Janin, Giovanna Rosone, and Anthony J. Cox. Adaptive reference-free compression of sequence quality scores. Bioinformatics, 30(1):24–30, 2014. doi:10.1093/bioinformatics/btt257.
- [36] Juha Kärkkäinen, Giovanni Manzini, and Simon J. Puglisi. Permuted longest-common-prefix array. In Proc. 20th Annual Symposium on Combinatorial Pattern Matching (CPM), pages 181–192, 2009. doi:10.1007/978-3-642-02441-2_17.
- [37] Juha Kärkkäinen, Peter Sanders, and Stefan Burkhardt. Linear work suffix array construction. Journal of the ACM (JACM), 53(6):918–936, 2006. doi:10.1145/1217856.1217858.
- [38] Richard M. Karp and Michael O. Rabin. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, 31(2):249–260, 1987. doi:10.1147/rd.312.0249.
- [39] John C. Kieffer and En Hui Yang. Grammar–based codes: a new class of universal lossless source codes. IEEE Transactions on Information Theory, 46(3):737–754, 2000. doi:10.1109/18.841160.
- [40] Pang Ko and Srinivas Aluru. Space efficient linear time construction of suffix arrays. Journal of Discrete Algorithms, 3(2-4):143–156, 2005. doi:10.1016/j.jda.2004.08.002.
- [41] Tak Wah Lam, Wing-Kin Sung, Siu-Lung Tam, Chi-Kwong Wong, and Siu-Ming Yiu. Compressed indexing and local alignment of DNA. Bioinformatics, 24(6):791–797, 2008. doi:10.1093/bioinformatics/btn032.
- [42] Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome biology, 10:1–10, 2009. doi:10.1186/gb-2009-10-3-r25.
- [43] Heng Li. Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly. Bioinformatics, 28(14):1838–1844, 2012. doi:10.1093/bioinformatics/bts280.
- [44] Heng Li. Fast construction of FM-index for long sequence reads. Bioinformatics, 30(22):3274–3275, 2014. doi:10.1093/bioinformatics/btu541.
- [45] Heng Li. BWT construction and search at the terabase scale. Bioinformatics, 40(12):btae717, 2024. doi:10.1093/bioinformatics/btae717.
- [46] Heng Li and Richard Durbin. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics, 26(5):589–595, 2010. doi:10.1093/bioinformatics/btp698.
- [47] Zsuzsanna Lipták, Francesco Masillo, and Simon J Puglisi. Suffix sorting via matching statistics. Algorithms for Molecular Biology, 19(1):11, 2024. doi:10.1186/s13015-023-00245-z.
- [48] Felipe A. Louza, Simon Gog, and Guilherme P. Telles. Inducing enhanced suffix arrays for string collections. Theoretical Computer Science, 678:22–39, 2017. doi:10.1016/J.TCS.2017.03.039.
- [49] Felipe A. Louza, Guilherme P. Telles, Steve Hoffmann, and Cristina Dutra de Aguiar Ciferri. Generalized enhanced suffix array construction in external memory. Algorithms for Molecular Biology, 12(1):26:1–26:16, 2017. doi:10.1186/S13015-017-0117-9.
- [50] Udi Manber and Eugene W. Myers. Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing, 22(5):935–948, 1993. doi:10.1137/0222058.
- [51] Sabrina Mantaci, Antonio Restivo, Giovanna Rosone, and Marinella Sciortino. An extension of the Burrows–Wheeler transform. Theoretical Computer Science, 387(3):298–312, 2007. doi:10.1016/j.tcs.2007.07.014.
- [52] Francesco Masillo. Matching Statistics Speed up BWT Construction. In Proc. 31st Annual European Symposium on Algorithms (ESA 2023), pages 83:1–83:15, 2023. doi:10.4230/LIPIcs.ESA.2023.83.
- [53] Takaaki Nishimoto and Yasuo Tabei. Optimal-Time Queries on BWT-Runs Compressed Indexes. In Proc. 48th International Colloquium on Automata, Languages, and Programming (ICALP), volume 198, pages 101:1–101:15, 2021. doi:10.4230/LIPICS.ICALP.2021.101.
- [54] Ge Nong. Practical linear-time O(1)-workspace suffix sorting for constant alphabets. ACM Transactions on Information Systems, 31(3):15:1–15:15, 2013. doi:10.1145/2493175.2493180.
- [55] Ge Nong, Wai Hong Chan, Sheng Qing Hu, and Yi Wu. Induced sorting suffixes in external memory. ACM Transactions on Information Systems (TOIS), 33(3):1–15, 2015. doi:10.1145/2699665.
- [56] Ge Nong, Sen Zhang, and Wai Hong Chan. Two efficient algorithms for linear time suffix array construction. IEEE Transactions on Computers, 60(10):1471–1484, 2011. doi:10.1109/TC.2010.188.
- [57] Daisuke Okanohara and Kunihiko Sadakane. A linear-time Burrows–Wheeler transform using induced sorting. In Proc. 16th International Symposium on String Processing and Information Retrieval (SPIRE), pages 90–101, 2009. doi:10.1007/978-3-642-03784-9_9.
- [58] Marco Oliva, Travis Gagie, and Christina Boucher. Recursive prefix-free parsing for building big bwts. In 2023 Data Compression Conference (DCC), pages 62–70, 2023. doi:10.1109/DCC55655.2023.00014.
- [59] Marco Oliva, Massimiliano Rossi, Jouni Sirén, Giovanni Manzini, Tamer Kahveci, Travis Gagie, and Christina Boucher. Efficiently merging r-indexes. In Proc. 31st Data Compression Conference (DCC), pages 203–212, 2021. doi:10.1109/DCC50243.2021.00028.
- [60] Nicola Prezza, Nadia Pisanti, Marinella Sciortino, and Giovanna Rosone. Variable-order reference-free variant discovery with the Burrows–Wheeler Transform. BMC Bioinformatics, 21-S(8):260, 2020. doi:10.1186/S12859-020-03586-3.
- [61] Yoshihiro Shibuya and Matteo Comin. Better quality score compression through sequence-based quality smoothing. BMC Bioinformatics, 20-S(9):302:1–302:11, 2019. doi:10.1186/s12859-019-2883-5.
- [62] Jared T. Simpson and Richard Durbin. Efficient de novo assembly of large genomes using compressed data structures. Genome Research, 22(3):549–56, 2012. doi:10.1101/gr.126953.111.
- [63] Jouni Sirén. Burrows–Wheeler Transform for Terabases. In Proc. 16th Data Compression Conference (DCC), pages 211–220, 2016. doi:10.1109/DCC.2016.17.
- [64] Jouni Sirén, Erik Garrison, Adam M Novak, Benedict Paten, and Richard Durbin. Haplotype-aware graph indexes. Bioinformatics, 36(2):400–407, 2020. doi:10.1093/BIOINFORMATICS/BTZ575.
- [65] Clare Turnbull, Richard H. Scott, Ellen Thomas, Louise Jones, Nirupa Murugaesu, Freya Boardman Pretty, Dina Halai, Emma Baple, Clare Craig, Angela Hamblin, Shirley Henderson, Christine Patch, Amanda O’Neill, Andrew Devereau, Katherine Smith, Antonio Rueda Martin, Alona Sosinsky, Ellen M. McDonagh, Razvan Sultana, Michael Mueller, Damian Smedley, Adam Toms, Lisa Dinh, Tom Fowler, Mark Bale, Tim Hubbard, Augusto Rendon, Sue Hill, and Mark J. Caulfield. The 100 000 Genomes Project: bringing whole genome sequencing to the NHS. BMJ, 361, 2018. doi:10.1136/bmj.k1687.
