A Taxonomy of LCP-Array Construction Algorithms
Abstract
The combination of the suffix array and the LCP-array can be used to solve many string processing problems efficiently. We review some of the most important sequential LCP-array construction algorithms in random access memory.
Keywords and phrases: longest common prefix array, suffix array, Burrows-Wheeler transform
2012 ACM Subject Classification: Theory of computation → Data structures design and analysis
Editors: Paolo Ferragina, Travis Gagie, and Gonzalo Navarro
Series and Publisher: Open Access Series in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
In a landmark paper, Weiner [43] presented the concept of suffix trees and devised the first linear-time construction algorithm. For several decades, the suffix tree was the data structure in stringology, because it can be used to efficiently solve a "myriad" of string processing problems [4]. For instance, with the help of the suffix tree, exact pattern matching can be done in O(m) time (assuming a constant alphabet size), where m is the length of the pattern searched for. It should be pointed out that the search time is independent of the length n of the text.
The suffix array was devised by Manber and Myers [30] and independently by Gonnet et al. [19] under the name PAT array. Ten years later, it was shown independently and contemporaneously by Kärkkäinen and Sanders [23], Kim et al. [26], Ko and Aluru [27], and Hon et al. [20] that a direct linear-time construction of the suffix array is possible.
The LCP-array first appeared in the seminal paper of Manber and Myers [30], where it was called the Hgt-array. The authors presented an O(n log n)-time LCP-array construction algorithm (LACA for short) and showed that the augmentation of classical binary search with the LCP-array enables exact pattern matching on the suffix array in O(m + log n) time. The first linear-time LACA was devised by Kasai et al. [25]. They also demonstrated the importance of the LCP-array by presenting a linear-time algorithm that simulates the bottom-up traversal of a suffix tree with a suffix array and the LCP-array. Their work was taken one step further by Abouelhoda et al. [1], who showed that suffix trees can be completely replaced with enhanced suffix arrays. An enhanced suffix array is the combination of the suffix array with augmenting data structures, which depend on the application at hand. For example, exact pattern matching can be done in O(m) time with the suffix array, the LCP-array, and constant-time range minimum queries [16] on the LCP-array [1] – again assuming constant alphabet size; for non-constant alphabet sizes one can extend this idea accordingly [15]. Algorithms on enhanced suffix arrays are not only more space efficient than those on suffix trees, but they are also faster and easier to implement. Since the LCP-array is of utmost importance in this context, it has been studied extensively.
2 Preliminaries
Let T be a text of length n on an ordered alphabet Σ of constant size σ. We assume that T has the sentinel $ at the end (and nowhere else), which is smaller than any other character. For 1 ≤ i ≤ n, T[i] denotes the character at position i in T. For i ≤ j, T[i..j] denotes the substring of T starting at position i and ending at position j. Furthermore, T_i = T[i..n] denotes the i-th suffix of T.
The suffix array SA of the text T is an array of integers in the range 1 to n specifying the lexicographic ordering of the n suffixes of T, that is, it satisfies T_{SA[1]} < T_{SA[2]} < ... < T_{SA[n]}. The suffix array can be built in linear time; we refer to the overview article of [37] for suffix array construction algorithms and to [35] for recent developments.
The inverse suffix array ISA is an array of size n such that for any i with 1 ≤ i ≤ n the equality ISA[SA[i]] = i holds. Obviously, it takes only linear time to invert the suffix array.
The LCP-array contains the lengths of the longest common prefixes between consecutive suffixes in SA. Formally, the LCP-array is an array such that LCP[i] = lcp(T_{SA[i-1]}, T_{SA[i]}) for 2 ≤ i ≤ n, where lcp(u, v) denotes the length of the longest common prefix between two strings u and v. Here the value LCP[1] is undefined, but – depending on the application – it may make sense to define it as 0 or -1. Table 1 shows the LCP-array of our example string.
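The definitions above can be made concrete in a few lines of Python. The following naive sketch (our own illustration with our own names and example string, not one of the algorithms reviewed in this article; it takes quadratic time and uses 0-indexed arrays) computes SA, ISA, and the LCP-array directly from the definitions:

```python
def suffix_array(t):
    """Naive suffix array: sort the suffix start positions lexicographically.
    The sentinel '$' (0x24) precedes all letters in ASCII, as required."""
    return sorted(range(len(t)), key=lambda i: t[i:])

def inverse_suffix_array(sa):
    """ISA[SA[r]] = r for all ranks r."""
    isa = [0] * len(sa)
    for r, i in enumerate(sa):
        isa[i] = r
    return isa

def lcp_len(a, b):
    """Length of the longest common prefix of strings a and b."""
    k = 0
    while k < min(len(a), len(b)) and a[k] == b[k]:
        k += 1
    return k

def lcp_array(t, sa):
    """LCP[r] = lcp of the suffixes of rank r-1 and r; LCP[0] is set to 0 here."""
    return [0] + [lcp_len(t[sa[r - 1]:], t[sa[r]:]) for r in range(1, len(sa))]
```

For example, for t = "banana$" the suffix array is [6, 5, 3, 1, 0, 4, 2] and the LCP-array is [0, 0, 1, 3, 0, 0, 2].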
The Burrows-Wheeler transform [9] converts T into the string BWT defined by BWT[i] = T[SA[i] - 1] for all i with SA[i] ≠ 1, and BWT[i] = $ otherwise. Given the suffix array, this transformation takes only linear time.
The LF-array is an array of size n such that LF[i] = ISA[SA[i] - 1] for all i with SA[i] ≠ 1, and LF[i] = 1 otherwise. That is, LF[i] is the index at which the suffix T_{SA[i]-1} occurs in the suffix array. The ψ-array is the inverse of the LF-array, which can easily be computed in linear time from the BWT; see e.g. [34, Section 7.2.2].
The C-array is an array of size σ. For each character c ∈ Σ, the entry C[c] is defined as follows: if we consider all characters in T that are smaller than c, then C[c] is the overall number of their occurrences in T.
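As a small illustration of these definitions, the following Python sketch (our own; the function names and the example string are ours, and all arrays are 0-indexed) derives the BWT, the C-array, and the LF-mapping from a text and its suffix array:

```python
def bwt_from_sa(t, sa):
    """BWT[i] = t[sa[i] - 1], i.e., the character preceding the i-th smallest
    suffix; where sa[i] == 0 we take the sentinel position t[-1]."""
    return ''.join(t[i - 1] if i > 0 else t[-1] for i in sa)

def c_array(bwt):
    """C[c] = number of occurrences of characters smaller than c.
    The BWT is a permutation of the text, so counting on it is equivalent."""
    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1
    c, total = {}, 0
    for ch in sorted(counts):
        c[ch] = total
        total += counts[ch]
    return c

def lf_array(bwt):
    """LF[i] = C[bwt[i]] + rank(bwt, i), the position in SA of the suffix
    starting one character to the left of the suffix at position i."""
    c = c_array(bwt)
    seen, lf = {}, []
    for ch in bwt:
        lf.append(c[ch] + seen.get(ch, 0))  # rank of ch among bwt[0..i-1]
        seen[ch] = seen.get(ch, 0) + 1
    return lf
```

For t = "banana$" with suffix array [6, 5, 3, 1, 0, 4, 2], the BWT is "annb$aa" and the LF-array is [1, 5, 6, 4, 0, 2, 3].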
3 The First Linear Time LACA
We first present Kasai et al.'s LACA [25], which computes the LCP-array from the string T, its suffix array SA, and its inverse suffix array ISA; see Algorithm 1. It starts with the longest suffix T_1 of T, computes j = ISA[1] and then LCP[j] by a left-to-right comparison of the characters in T_1 and its immediately preceding suffix T_{SA[j-1]} in the suffix array. The same is done for the other suffixes of T by incrementing i successively. As we shall see, some character comparisons can safely be skipped.
In our example, Algorithm 1 first compares T_1 with its immediately preceding suffix in the suffix array and sets LCP[ISA[1]]. In the next iteration of the for-loop, it compares T_2 with its preceding suffix and sets LCP[ISA[2]]. In the following iterations of the for-loop, the underlined characters in T will be skipped in further comparisons of adjacent suffixes. For example, in the third iteration, Algorithm 1 must compare T_3 with its preceding suffix, and the characters already known to match will be skipped in the comparison. To prove the correctness of Algorithm 1, we must show that if h characters match in iteration i of the for-loop, then at least h - 1 characters match in iteration i + 1. So suppose that the longest common prefix of T_i and its preceding suffix T_j is cω, where c is a character and ω is a string of length h - 1. In the next iteration, Algorithm 1 compares T_{i+1} with its preceding suffix, which is not necessarily T_{j+1}. However, since T_j is lexicographically smaller than T_i, it follows that T_{j+1} is lexicographically smaller than T_{i+1}. Moreover, all the suffixes in between T_{j+1} and T_{i+1} must share the prefix ω because the suffixes are ordered lexicographically. In particular, T_{i+1} and its preceding suffix have ω as a common prefix. Therefore, h - 1 characters (namely ω) can be skipped in the comparison of T_{i+1} and its preceding suffix. If T_{j+1} immediately precedes T_{i+1} in the suffix array, then we can directly conclude that their longest common prefix is ω; otherwise further character comparisons are required to determine their longest common prefix. Obviously, T_{j+1} immediately precedes T_{i+1} if BWT[k] = BWT[k-1], where k = ISA[i+1]. This result is summarized in the next lemma, which will be important later. The first proof of Lemma 1 can be found in [25]. A value LCP[k] is called reducible if BWT[k] = BWT[k-1]; otherwise it is called irreducible.
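Kasai et al.'s algorithm can be sketched in Python as follows (our own rendering of the pseudocode of Algorithm 1, 0-indexed, with the undefined first entry set to 0):

```python
def kasai_lcp(t, sa, isa):
    """Linear-time LACA of Kasai et al.: process suffixes in text order
    and reuse h - 1 matched characters in the next iteration."""
    n = len(t)
    lcp = [0] * n
    h = 0                           # matched characters carried over
    for i in range(n):              # suffixes in text order T_i
        r = isa[i]
        if r > 0:
            j = sa[r - 1]           # suffix immediately preceding T_i in SA
            while i + h < n and j + h < n and t[i + h] == t[j + h]:
                h += 1
            lcp[r] = h
            if h > 0:
                h -= 1              # at least h - 1 characters match next time
        else:
            h = 0                   # T_i has no lexicographic predecessor
    return lcp
```

On t = "banana$" with SA = [6, 5, 3, 1, 0, 4, 2] and ISA = [4, 3, 6, 2, 5, 1, 0], this yields [0, 0, 1, 3, 0, 0, 2].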
Lemma 1.
Suppose that LCP[k] is reducible, where k = ISA[i]. Then we have LCP[k] = LCP[ISA[i-1]] - 1.
It will now be shown that Algorithm 1 constructs the LCP-array in O(n) time. We use an amortized analysis to show that the while-loop is executed at most 2n times. It is readily verified that this implies the linear run-time. For each i, the character comparisons in the while-loop end with at most one mismatch, so there are at most n mismatches (redundant character comparisons) in total. If a position in T is involved in a match, then this particular occurrence of the character will be skipped in further suffix comparisons. More precisely, if we have T[i + h] = T[j + h] in the while-loop, then position i + h will not be involved in a successful character comparison again, because the next comparison of adjacent suffixes skips the first h - 1 matched characters. So there are at most n matches. To illustrate the argument, the positions in our example text that are involved in a match are marked by underlining the respective character in T. Since every position is marked at most once (because it is skipped in further comparisons), it is clear that the overall number of character comparisons is at most n (mismatches) plus the number of underlined characters.
In Algorithm 1, the text T as well as the arrays SA, ISA, and LCP should be randomly accessible. Under the common assumption that an array entry takes 4 bytes, each of the arrays SA, ISA, and LCP requires 4n bytes. Consequently, for an 8-bit alphabet, Algorithm 1 needs 13n bytes of working memory.
In the context of compressed suffix trees, Mäkinen [29, Proposition 3.2] first showed that the LCP-array can be constructed with less working space. Manzini [31] observed that the additional array ISA is superfluous because it is possible to store the required information about ISA in the LCP-array itself. If the LCP-array initially stores the ISA-array, then the relevant cell still contains the value ISA[i] in iteration i of the for-loop of Algorithm 2. That is, the variable j gets the value ISA[i] in line 4, and the cell can safely be overwritten in line 9. The correctness of Algorithm 2 follows directly from this fact. Algorithm 2 requires 9n bytes of working memory because it needs random access to the text and both input arrays.
Manzini [31] proposed another LACA, which saves even more space by overwriting the suffix array. This LACA is based on Lemma 1: if LCP[k] is reducible, where k = ISA[i], then one can immediately set LCP[k] = LCP[ISA[i-1]] - 1 because the character comparisons in lines 7-8 of Algorithm 2 are superfluous. It follows as a consequence that in this case the access to SA[k-1] is superfluous, too. In other words, the value SA[k-1] is required only if LCP[k] is irreducible. In a preprocessing phase, Manzini's second LACA determines the values in the suffix array that are actually required and stores them in an auxiliary array. The auxiliary array is then used instead of SA, and SA can safely be overwritten. However, this LACA is of limited value in practice because in most applications the suffix array is needed as well.
4 The Φ-Algorithm
Kärkkäinen, Manzini, and Puglisi [22] proposed a variant of Kasai et al.'s algorithm, which first computes a permuted LCP-array (the PLCP-array) with the help of the so-called Φ-array and then derives the LCP-array from the PLCP-array. They called it "Φ-algorithm" because it uses the Φ-array, which "is in some way symmetric to the ψ array."
The Φ-array and the PLCP-array, respectively, are defined as follows: for all i with i = SA[k] and 2 ≤ k ≤ n, let Φ[i] = SA[k-1] and PLCP[i] = lcp(T_i, T_{Φ[i]}).
Fig. 1 shows our example. So in the suffix array, the suffix T_i is immediately preceded by the suffix T_{Φ[i]} unless ISA[i] = 1. Note that we have SA[1] = n because we assume that T is terminated by $. For i ≠ n, we have

PLCP[i] = LCP[ISA[i]] .    (1)

In case i = n, Equation (1) holds also true: ISA[n] = 1 and PLCP[n] = 0 = LCP[1] because T_n = $ has no lexicographic predecessor (and LCP[1] is defined to be 0 here). Thus, the PLCP-array is a permutation of the LCP-array. The difference between the two arrays is that the lcp-values occur in text order (position order) in the PLCP-array, whereas they occur in suffix array order (lexicographic order) in the LCP-array. The pseudocode of the Φ-algorithm is shown in Algorithm 3. Its properties (correctness, linear run-time) can be shown as in the analysis of Algorithm 1.
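The three phases of the Φ-algorithm can be sketched in Python as follows (our own rendering, 0-indexed; Algorithm 3 in the text is pseudocode, and the names are ours):

```python
def phi_lcp(t, sa):
    """Phi-algorithm: (1) Phi[sa[r]] = sa[r-1] gives each suffix its
    lexicographic predecessor in text order; (2) PLCP is computed in text
    order, reusing matched characters as in Kasai et al.'s algorithm;
    (3) PLCP is permuted into suffix-array order to obtain LCP."""
    n = len(t)
    phi = [0] * n
    for r in range(1, n):
        phi[sa[r]] = sa[r - 1]
    phi[sa[0]] = -1                 # no lexicographic predecessor
    plcp = [0] * n
    h = 0
    for i in range(n):              # text order: h decreases by at most 1
        j = phi[i]
        if j >= 0:
            while i + h < n and j + h < n and t[i + h] == t[j + h]:
                h += 1
            plcp[i] = h
            h = max(h - 1, 0)
        else:
            h = 0
    return [0] + [plcp[sa[r]] for r in range(1, n)]
```

On t = "banana$" with SA = [6, 5, 3, 1, 0, 4, 2], the result is again [0, 0, 1, 3, 0, 0, 2]; note that only the first two phases touch the text, and the last phase only streams SA and PLCP.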
Experimental comparisons of Algorithm 1 and the Φ-algorithm can be found in [22, 18]. The bottom line is that the Φ-algorithm is faster because of better memory locality: it merely needs sequential access to the Φ-array and the PLCP-array in its second for-loop. Consequently, these arrays can be streamed, and the number of possible cache misses is limited to 3n: n in the computation of the Φ-array, n in line 6 (remark: if PLCP[i] is reducible, then Φ[i] = Φ[i-1] + 1 and hence T[Φ[i]] is cached), and n in the conversion of the PLCP-array into the LCP-array (in virtually all applications, lcp-values are required to be in suffix array order). By contrast, nearly every array access in Algorithm 1 (including the computation of ISA) can result in a cache miss.
If we assume that the arrays SA, Φ, PLCP, and LCP are all kept in working memory, then Algorithm 3 requires 17n bytes of RAM (including n bytes for the text). However, the algorithm merely needs random access to the Φ-array in the first for-loop, to the text in the second for-loop, and to the PLCP-array in the third for-loop. All other arrays can be streamed from or to disk. Thus, since the Φ-array can be overwritten by the PLCP-array, 5n bytes of RAM are sufficient.
Kärkkäinen et al. [22] presented another algorithm for computing the PLCP-array based on irreducible lcp-values. We call a value PLCP[i] reducible if T[i-1] = T[Φ[i]-1]; otherwise it is called irreducible. The proof of the next lemma is similar to that of Lemma 1.
Lemma 2.
If PLCP[i] is reducible, then PLCP[i] = PLCP[i-1] - 1.
The above-mentioned algorithm first determines all irreducible lcp-values in a brute-force manner: each irreducible value is computed by comparing the respective suffixes from the very beginning. Afterwards, it uses Lemma 2 to compute the remaining reducible lcp-values. The authors showed that the sum of all irreducible lcp-values is at most 2n log n. Hence, the overall run-time is O(n log n). An experimental comparison of this algorithm and the Φ-algorithm showed that the Φ-algorithm is faster in practice; see [22].
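A possible rendering of this approach in Python (our own sketch, 0-indexed; it tests reducibility via T[i-1] = T[Φ(i)-1], applies Lemma 2 to the reducible values, and brute-forces only the irreducible ones):

```python
def irreducible_lcp(t, sa):
    """PLCP via irreducible lcp-values, then permuted into LCP order."""
    n = len(t)
    phi = [-1] * n
    for r in range(1, n):
        phi[sa[r]] = sa[r - 1]
    plcp = [0] * n
    for i in range(n):
        j = phi[i]
        if j < 0:
            plcp[i] = 0                    # no predecessor
        elif i > 0 and j > 0 and t[i - 1] == t[j - 1]:
            plcp[i] = plcp[i - 1] - 1      # reducible: Lemma 2
        else:
            h = 0                          # irreducible: brute force
            while i + h < n and j + h < n and t[i + h] == t[j + h]:
                h += 1
            plcp[i] = h
    return [0] + [plcp[sa[r]] for r in range(1, n)]
```

On t = "banana$" only three values are irreducible and compared from scratch; the sum of the irreducible values bounds the total work.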
5 Computing the LCP-Array along with SA
The LCP-array can also be computed while sorting the suffixes lexicographically, which we show here for the two well-known suffix array construction algorithms DC3 [24] and SAIS [33]. The adaptations for LCP-array computation were described in the original article for DC3, and by Bingmann et al. for SAIS [8]. Intuitively, it seems clear that LCP-array computation during suffix array construction should be possible, as any suffix sorting algorithm must, at least implicitly, determine the lcp-values between lexicographically adjacent suffixes to determine their order, for otherwise it could not make the right decisions on the order of the suffixes.
5.1 A Unified Perspective
We first look at the common features that both algorithms share, which form the basis for the LCP-computation. We assume knowledge about DC3 and SAIS, as a full exposition of those algorithms would take too much space in this article. We also leave out some details of LCP-array construction and refer the readers to the original articles at various points; the focus here is more on a common exposition of how the computations of lcp-values can be woven into suffix sorting algorithms.
Both DC3 and SAIS are recursive algorithms, and for sorting all the suffixes of a text they first sort a subset of suffixes recursively, stored in an array SA* such that T_{SA*[i]} < T_{SA*[i+1]} for all i. From SA*, they use different mechanisms to compute the full suffix array SA. In DC3, SA* contains all suffixes at positions i with i mod 3 ∈ {1, 2} (see top of Fig. 2), whereas in SAIS, SA* contains those positions i with T_{i-1} > T_i < T_{i+1} (called LMS-suffixes for leftmost smallest suffixes; see top of Fig. 3 with LMS-suffixes indicated by a star). In both algorithms, we talk about "c-buckets" for an arbitrary character c, by which we mean the subarray of SA consisting of the suffixes starting with c, as in counting or bucket sort – in the figures those buckets are depicted by dotted lines.
By induction, we can assume that we also get the LCP-array for all the lexicographically adjacent suffixes in SA* from the recursion; call it LCP*. The details of how to compute LCP* differ between the two algorithms and are a bit tedious, but they are not the core of the algorithmic ideas. The reason why this is not trivial is that because SA* has been computed recursively from the suffix array of a reduced text T' (each character in T' representing several characters of T – 3 in DC3 (hence the name), and a non-constant number in SAIS), the lcp-values need to be scaled to actually represent the number of characters of shared prefixes in T, and not in T'. We assume here that this scaling step has already been performed – see again Fig. 2 for DC3 and Fig. 3 for SAIS, where, in the latter case, both arrays SA* and LCP* are already stored in the final arrays SA and LCP, as the algorithm works in-place (skip the gray values in LCP for the moment, as they will be written in the course of the algorithm that follows).
Then both SAIS and DC3 compute the suffix array SA from SA*, and the algorithms can be enhanced to compute the full LCP-array from LCP*. Both algorithms finally place suffixes into SA in a left-to-right manner (for SAIS, this actually happens in two passes, the second one right-to-left but symmetric). Say the algorithm just wrote SA[i] and now needs to compute the lcp-value between T_{SA[i]} and its lexicographic predecessor T_{SA[i-1]} at position i - 1 in SA; the result will have to be stored at LCP[i]. As the values in SA are filled from left to right, the value SA[i-1] has been set in a previous iteration of this loop. (In SAIS, this left-to-right filling is per bucket, but the principle remains the same, as only suffixes in the same c-bucket have an lcp-value greater than 0.)
The easy case is if both SA[i] and SA[i-1] were neighbors in SA*; then we know their lcp-value and we can just copy it from LCP*. Consider, for example, the two suffixes placed at adjacent positions of SA in Fig. 2 that are also neighbors in SA*: their lcp-value can simply be copied from LCP* (left red arrow). The same situation might occur in SAIS; e.g., in Fig. 3 a suffix is placed to the right of its former neighbor in SA*, whose lcp-value can just be copied as well.
The more difficult case is when SA[i] and SA[i-1] were not neighbors in SA*. Here, the techniques differ between the algorithms, but they share some common ideas. We first compare the two characters T[SA[i]] and T[SA[i-1]] (in DC3, if they are equal, we also compare T[SA[i]+1] and T[SA[i-1]+1]). If the 1 or 2 compared characters are all the same (otherwise the lcp-value has been determined), it is clear that the value must be one more than lcp(T_{SA[i-1]+1}, T_{SA[i]+1}) (possibly two more than lcp(T_{SA[i-1]+2}, T_{SA[i]+2}) in DC3, see above). Now this value can be computed by a range minimum query on LCP*, which gives the minimum value between two specified array indices. More formally, the range minimum query (RMQ) problem on a general array A is to find a data structure such that later, for two specified indices i and j with i ≤ j, the minimum in A[i..j] can be returned efficiently, in symbols: RMQ_A(i, j) = min{A[k] : i ≤ k ≤ j}. For these queries, linear preprocessing schemes exist that can answer the queries in constant time [16]. Armed with this tool, in DC3 we can simply compute lcp(T_{SA[i-1]+1}, T_{SA[i]+1}) as LCP*[RMQ_{LCP*}(ISA*[SA[i-1]+1] + 1, ISA*[SA[i]+1])] (and analogously with offset 2), where ISA* is the inverse of SA*, defined by ISA*[SA*[k]] = k for all values actually stored in SA* – note that for all the positions queried we have i mod 3 ∈ {1, 2}; this is also why the distinction between comparing 1 or 2 characters has been made.
For example, in Fig. 2, the value written in green is obtained by comparing one character and answering the corresponding range minimum query on LCP*; the value written in blue is obtained analogously with two compared characters.
The justification of this approach is the following: Let A be an arbitrary array storing strings in lexicographically sorted order, and let L be an array such that L[k] stores the lcp-value between the strings A[k-1] and A[k] for all k ≥ 2. Then due to the lexicographic sortedness of A, the lcp-value between two arbitrary strings A[i] and A[j] is the minimum value from L[i+1..j] (assuming i < j), in symbols: lcp(A[i], A[j]) = L[RMQ_L(i + 1, j)]. In our case, we exploit this fact with the suffixes in SA* in the role of A and LCP* in the role of L.
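The fact just stated can be illustrated with a simple sparse-table RMQ in Python (our own sketch; it uses O(n log n) space and is not one of the linear-space constant-time schemes of [16]):

```python
class LcpRmq:
    """Sparse table over an LCP-array, answering
    lcp(T_{SA[i]}, T_{SA[j]}) = min(LCP[i+1..j]) for ranks i < j."""
    def __init__(self, lcp):
        self.n = len(lcp)
        self.table = [list(lcp)]            # table[k][i] = min(lcp[i..i+2^k-1])
        k = 1
        while (1 << k) <= self.n:
            prev, half = self.table[-1], 1 << (k - 1)
            self.table.append(
                [min(prev[i], prev[i + half])
                 for i in range(self.n - (1 << k) + 1)])
            k += 1

    def query(self, l, r):                  # min of lcp[l..r], inclusive
        k = (r - l + 1).bit_length() - 1
        return min(self.table[k][l], self.table[k][r - (1 << k) + 1])

    def suffix_lcp(self, rank_i, rank_j):   # lcp of the suffixes at two ranks
        lo, hi = min(rank_i, rank_j), max(rank_i, rank_j)
        return self.query(lo + 1, hi)
```

With the LCP-array [0, 0, 1, 3, 0, 0, 2] of "banana$", the lcp of the suffixes of ranks 1 and 3 ("a$" and "anana$") is min(LCP[2..3]) = 1.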
In SAIS, the situation is different, as there is (most likely) no constant d such that SA[i] + d and SA[i-1] + d are both in SA*. Instead, we make use of the fact that (in the left-to-right scan) both suffixes T_{SA[i-1]+1} and T_{SA[i]+1} are lexicographically smaller than T_{SA[i-1]} and T_{SA[i]}, respectively, and are thus already present in SA (say at positions p with SA[p] = SA[i-1] + 1 and q with SA[q] = SA[i] + 1), including the lcp-values with their respective lexicographic predecessors and all lcp-values in between (the entire subarray LCP[p..q] has already been written). So lcp(T_{SA[i-1]+1}, T_{SA[i]+1}) can easily be computed in the same manner as in DC3, namely as LCP[RMQ_LCP(p + 1, q)]. Look at the computation of the orange values in Fig. 3: when the scan writes the new suffix, the two shifted suffixes are already present in the b-bucket, and the lcp-values between them are all known (gray values in LCP below the b-bucket); hence we can perform an RMQ from positions 9 to 11 in LCP, which gives a 2, and the sought lcp-value is one more than this minimum.
Commenting on the differences between the RMQs in DC3 and SAIS, we can say that in DC3 the RMQs are performed on the static array LCP* (for which optimal preprocessing schemes exist), whereas for SAIS the array on which RMQs are performed grows semi-dynamically from left to right. We comment below on how this can be done efficiently.
5.2 DC3
The above description of DC3 skipped one important detail of its inner workings: when it returns from the recursion, before it finally computes the full suffix array SA, it first needs to compute an array SA0 containing the lexicographic order of the suffixes starting at positions i with i mod 3 = 0. This is done by a left-to-right scan through SA*: for k in ascending order, if SA*[k] mod 3 = 1, place SA*[k] - 1 at the first free position in SA0's c-bucket for c = T[SA*[k] - 1]. (By "first free position" we mean the first position not already written in that bucket, as in classical counting- or bucket-sort algorithms.)
Having computed SA0 from SA*, we now explain how to compute the array LCP0 storing the lcp-values of lexicographically adjacent suffixes in SA0. When placing j = SA*[k] - 1 into SA0, let j' be its immediate predecessor in SA0 in the same c-bucket (first suffixes in a bucket are simple, as their lcp-values are 0). Since both T_{j+1} and T_{j'+1} are suffixes in SA*, we can compute lcp(T_{j'}, T_j) = 1 + LCP*[RMQ_{LCP*}(ISA*[j'+1] + 1, ISA*[j+1])], where ISA* is again the inverse of SA*. Note that this is another usage of the RMQ data structure on LCP* that needs to be precomputed anyway for the later steps.
For example, in Fig. 2, the value of LCP0 written in orange is obtained by such a range minimum query on LCP* over the ranks of the two shifted suffixes, increased by one for the common first character.
The rest of the algorithm works exactly as explained above – with the addition that if two suffixes from SA0 are placed adjacently into SA, we can look up their lcp-value directly in LCP0. In more detail, the final computation of SA can be viewed as merging the arrays SA* and SA0, as they are both sorted lexicographically and, together, contain all the suffixes of T. Hence, if in this merging a value from SA0 is written to SA[i] and we already know that SA[i-1] also came from SA0, their lcp-value can be directly retrieved from LCP0 and written to LCP[i], as the two suffixes are neighbors in SA0. See Fig. 2 for an example, where the value shown is written right after its neighbor, both of which come from SA0 at adjacent positions 5 and 6 (right red arrow). We note here that neither SA0 nor LCP0 need be materialized, but both can be computed on the fly during the merging process.
5.3 SAIS
We now give more details about the LCP-array computation during the SAIS-algorithm. Let us first consider the scaling of the lcp-values to compute LCP*. As a reminder, the suffixes in SA* are the LMS-suffixes, including the sentinel character $ at the end. The recursion works by computing the suffix array of a reduced text T' with meta-characters ranging from one LMS-position to the next. In our example, the recursion stops immediately, as all meta-characters of the reduced text are different. Now in order to see the issue with LCP-computation, look at two suffixes of T': their order was determined by the order of their first meta-characters, and the lcp-value of the suffixes – in terms of meta-characters – is 0. But the meta-characters themselves can share a common prefix, ab in this case. Hence, the correct lcp-value for the corresponding suffixes of T is 2. Here, we do not give the full details of how the array LCP* is actually computed correctly, but refer the reader to [8, Lemma 3.1] for the correct way to do it, and to [14, Sect. 3.3] for how to compute it efficiently.
For what follows, we also need to classify the other suffixes: if T_i > T_{i+1}, then T_i is called an L-suffix (for "larger"); if T_i < T_{i+1}, then T_i is called an S-suffix (for "smaller"). The left-to-right pass (as shown in the example in Fig. 3) actually sorts the L-suffixes from the LMS-suffixes (whose order has been computed recursively), whereas the subsequent right-to-left pass sorts the S-suffixes into the already sorted LMS-suffixes. In a c-bucket in SA for a character c, all the L-suffixes come first, followed by a mix of S- and LMS-suffixes. E.g., at the bottom of Fig. 3 one can see that in the a-bucket the only L-suffix comes first, followed by an LMS-suffix, an S-suffix, and three more LMS-suffixes, in this order.
A further detail is the computation of the lcp-value between the last L-suffix in a bucket and its lexicographic successor (which is either an S- or an LMS-suffix). In Fig. 3, this happens in the a-bucket, where the L-suffix is placed right before an LMS-suffix, which had an lcp-value of 0 in LCP* because it was the lexicographically first suffix starting with an a among those in SA*. However, with the placement of the L-suffix, the value is not 0 anymore, but needs to be updated to 1 (second gray arrow from the left). It can be shown [8, Lemma 3.2] that in the c-bucket only the character c, possibly repeated many times, can contribute to the lcp-value of those suffixes. Hence, a naive LCP-computation per bucket suffices to compute these lcp-values, as any character of T can contribute at most once to this naive computation (in its corresponding bucket).
The last issue is that the RMQs needed for the LCP-computation are performed on an array that values are constantly being written to; no data structures are known for constant-time range minimum queries on dynamic arrays. However, the updates occur in a regular manner, namely from left to right per bucket in the left-to-right scan (and again, symmetrically, in the right-to-left scan). We sketch here the solution for such semi-dynamic RMQs [8, Sect. 3.2], based on LRM-trees [5, Def. 1] – also known under the slightly misleading name 2d-Min-Heaps [16, Def. 5.3]. We maintain an LRM-tree for each bucket, which initially contains only the LMS-suffixes of that bucket with their respective lcp-values (as computed in the recursive call). When a new L-suffix along with its lcp-value ℓ is written into its c-bucket, we climb up the rightmost path of the bucket's LRM-tree until we find an element whose corresponding lcp-entry is strictly smaller than ℓ (the tree has an artificial root holding lcp-value -∞, which guarantees that such an element always exists). The new element is then added as that node's new rightmost leaf; an easy amortized argument shows that this results in overall linear time. The LRM-tree is stored with a data structure for constant-time lowest common ancestor queries (LCAs) that supports dynamic leaf additions in constant worst-case time [10]. Then the minimum in any range of the processed portion of the c-bucket can be found in constant time [16, Lemma 5.5] by LCA-queries on the LRM-tree.
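The rightmost-path climbing amounts to maintaining a monotonic stack. The following Python sketch (our own illustration; it covers only the append-only construction of the LRM-tree's parent pointers, not the dynamic-LCA machinery) shows why the total work is linear: every element is pushed and popped at most once.

```python
def lrm_tree_parents(values):
    """Append-only LRM-tree construction: parent[i] is the nearest position
    j < i with values[j] strictly smaller than values[i]; the artificial
    root -1 conceptually holds -infinity, so a parent always exists."""
    parent = []
    stack = [-1]                    # indices on the current rightmost path
    for i, v in enumerate(values):
        # climb the rightmost path until a strictly smaller entry is found
        while stack[-1] != -1 and values[stack[-1]] >= v:
            stack.pop()
        parent.append(stack[-1])    # attach i as new rightmost leaf
        stack.append(i)
    return parent
```

Appending an element only ever adds a leaf to the tree, which is the property exploited for the dynamic LCA data structure mentioned in the text.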
Interestingly, this solution for RMQs on semi-dynamic arrays is the only scenario known to the authors where LRM-trees are actually advantageous over the better known Cartesian trees [42, Sect. 3.1], as only in LRM-trees does appending elements to the array result in pure leaf additions to the tree. In a Cartesian tree, appending a single new element might result in more complicated relinking operations on the tree topology (albeit in constant amortized time), and no data structure for constant-time LCAs is known for fully dynamic trees. A final caveat is that one would probably not want to implement this approach due to the complicated data structure for dynamic LCAs – see [8, Sect. 3.2] for alternatives that work better in practice.
6 Computing the LCP-array from the BWT
Next, we will present a LACA that is based on the BWT; it is due to Beller et al. [7]. We start with some prerequisites.
Let ω be a substring of T. The ω-interval is the interval [lo..hi] in the suffix array such that ω is a common prefix of the suffixes T_{SA[lo]}, ..., T_{SA[hi]}, but of neither T_{SA[lo-1]} nor T_{SA[hi+1]}. Table 2 shows the intervals of the substrings of our example string. Since the empty string ε is a common prefix of all suffixes of T, the ε-interval is [1..n]. The LACA in Algorithm 4 is based on the procedure getIntervals, which has the following functionality: for an ω-interval [lo..hi], the call getIntervals([lo..hi]) returns the list of all cω-intervals, where c is a character and cω is a substring of T. In our example, getIntervals applied to the ε-interval generates the c-intervals of all characters c occurring in T. Beller et al. [7] describe an implementation of the procedure getIntervals that is based on the wavelet tree of the BWT of T. A call to getIntervals takes O(k log σ) time if it returns a list with k intervals (so each interval can be generated in O(log σ) time).
Algorithm 4 shows how the LCP-array of a string T can be obtained solely based on the procedure getIntervals. The algorithm maintains a queue Q, which initially contains the ε-interval [1..n]. Moreover, the variable ℓ stores the current lcp-value (initially, ℓ = -1) and size memorizes how many elements in Q correspond to the current lcp-value (initially, size = 1). Algorithm 4 computes lcp-values in increasing order (first the artificial entry LCP[n+1] = -1, then the 0-entries, and so on) as follows: whenever it dequeues an element from Q, say the ω-interval [lo..hi] with |ω| = ℓ + 1, it tests whether LCP[hi+1] = ⊥. If so, it assigns ℓ to LCP[hi+1], generates all non-empty cω-intervals by a call to getIntervals, and adds them to (the back of) Q. Otherwise, it does nothing.
Let us illustrate the algorithm by computing the artificial entry LCP[n+1] and all 0-entries in the LCP-array of our example in Table 2. Initially, ℓ = -1 and the queue Q contains the ε-interval [1..n]. Consequently, the first interval that is pulled from the queue is the ε-interval, and size is decreased by one. Since LCP[n+1] = ⊥, case 1 in Algorithm 4 applies. Thus, LCP[n+1] is set to -1, and getIntervals generates the c-intervals of all characters c occurring in T; these intervals are put into the queue Q. In the second iteration of the while-loop, we have size = 0. Therefore, ℓ is increased by one and the variable size gets the new value |Q|; at that point, all intervals in Q correspond to the new lcp-value ℓ = 0. These c-intervals [lo..hi] are now dequeued one after the other (each time, size is decreased by one). Whenever LCP[hi+1] = ⊥, case 1 applies: LCP[hi+1] is set to 0, and the intervals returned by getIntervals are added to the queue Q (observe that getIntervals does not generate an interval for a string that is not a substring of T). For the interval ending at position n, however, LCP[n+1] has already been set, so case 2 in Algorithm 4 applies and nothing is enqueued. In the next iteration of the while-loop, ℓ is increased by one and size is set to the current size of Q. At that point in time, the six elements in Q are the following intervals, which correspond to the lcp-value 1:
where the notation [lo..hi]_ω indicates that [lo..hi] is the ω-interval. The reader is invited to compute all remaining entries in the LCP-array by executing the algorithm by hand.
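For illustration, here is a self-contained Python sketch of the queue-based computation (our own rendering, 0-indexed; it simulates getIntervals with plain prefix-count ranks, taking O(σ) time per call instead of using a wavelet tree):

```python
from collections import deque

def lcp_from_bwt(bwt):
    """Compute the LCP-array from the BWT alone, in the style of Algorithm 4."""
    n = len(bwt)
    alphabet = sorted(set(bwt))
    occ = {c: [0] * (n + 1) for c in alphabet}  # occ[c][i] = #c in bwt[0..i-1]
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    C, total = {}, 0                            # C[c] = #chars smaller than c
    for c in alphabet:
        C[c] = total
        total += occ[c][n]

    def get_intervals(lo, hi):
        """All non-empty c-omega-intervals of the omega-interval [lo, hi]."""
        for c in alphabet:
            l, r = C[c] + occ[c][lo], C[c] + occ[c][hi + 1] - 1
            if l <= r:
                yield (l, r)

    lcp = [None] * (n + 1)
    queue = deque([(0, n - 1)])                 # the epsilon-interval
    ell, size = -1, 1                           # current value, intervals left
    while queue:
        if size == 0:                           # next lcp-value starts
            ell += 1
            size = len(queue)
        lo, hi = queue.popleft()
        size -= 1
        if lcp[hi + 1] is None:                 # case 1: entry still undefined
            lcp[hi + 1] = ell
            queue.extend(get_intervals(lo, hi))
        # case 2: entry already set, do nothing
    lcp[0] = 0                                  # undefined; 0 by convention
    return lcp[:n]                              # drop the artificial entry
```

On the BWT "annb$aa" of "banana$", this returns [0, 0, 1, 3, 0, 0, 2].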
Theorem 3.
Algorithm 4 correctly computes the LCP-array.
Proof.
We proceed by induction on ℓ. After the first iteration of the while-loop (which leads to the artificial entry LCP[n+1] = -1), we have ℓ = 0 and the queue Q contains the c-interval [lo..hi] for every character c occurring in T. The algorithm sets LCP[hi+1] = 0 unless hi = n (in which case LCP[n+1] has already been set). This is certainly correct because the suffix T_{SA[hi]} starts with the character c and the suffix T_{SA[hi+1]} starts with a character different from c. Clearly, in this way the LCP-array receives all entries with value 0. Let ℓ ≥ 1. By the inductive hypothesis, we may assume that Algorithm 4 has correctly computed all lcp-values smaller than ℓ. Immediately after the value of ℓ was incremented in line 8 of Algorithm 4, the queue Q solely contains intervals that correspond to the lcp-value ℓ (i.e., maximal intervals in which the suffixes share a common prefix of length ℓ + 1). Let the ωa-interval [lo..hi] be in Q, where |ω| = ℓ and a is a character. If LCP[hi+1] = ⊥, then we know from the induction hypothesis that LCP[hi+1] ≥ ℓ, i.e., ω is a common prefix of the suffixes T_{SA[hi]} and T_{SA[hi+1]}. On the other hand, ωa is a prefix of T_{SA[hi]} but not of T_{SA[hi+1]}. Consequently, ω is the longest common prefix of T_{SA[hi]} and T_{SA[hi+1]}. Hence Algorithm 4 assigns the correct value ℓ to LCP[hi+1].
We still have to prove that all entries of the LCP-array with value ℓ are really set. So let k, 2 ≤ k ≤ n, be an index with LCP[k] = ℓ. Since ℓ ≥ 1, the longest common prefix of T_{SA[k-1]} and T_{SA[k]} can be written as bω, where b ∈ Σ, ω ∈ Σ*, and |bω| = ℓ. Let u = bωa be the length-(ℓ+1) prefix of T_{SA[k-1]} and v = bωc be the length-(ℓ+1) prefix of T_{SA[k]}, where a ≠ c. Because u is a prefix of T_{SA[k-1]} and v is a prefix of T_{SA[k]}, it follows that bω is the longest common prefix of u and v. Let [lo..hi] be the ω-interval, p be the index with SA[p] = SA[k-1] + 1, and q be the index with SA[q] = SA[k] + 1. Clearly, lo ≤ p < q ≤ hi. Let m, p < m ≤ q, be the smallest index at which the corresponding suffix T_{SA[m]} does not start with ωa; see Fig. 4. Consequently, LCP[m] = |ω| = ℓ - 1. According to the inductive hypothesis, Algorithm 4 assigns the value ℓ - 1 to LCP[m]. Therefore, getIntervals is called with the ωa-interval [p'..m-1] for some p' ≤ p. Since u = bωa is a prefix of T_{SA[k-1]} and BWT[p] = T[SA[k-1]] = b, it follows that the bωa-interval, say [x..y], is not empty. Moreover, y = k - 1 because T_{SA[j]} does not start with u for all j ≥ k. Thus, the u-interval [x..k-1] is in the list returned by getIntervals. Hence it is added to the queue Q. At some point in time, it will be removed from Q, and then LCP[k] will be set to ℓ.
Theorem 4.
Algorithm 4 has a worst-case time complexity of O(n log σ).
Proof.
We use an amortized analysis to prove that each of the cases 1 and 2 can occur at most n times. Case 1 occurs as often as an entry of the LCP-array is filled, and this happens exactly n times. It remains for us to analyze how often case 2 can occur. We claim that for a fixed position j, 1 ≤ j ≤ n, there is at most one substring with an occurrence ending at position j for which its interval belongs to case 2. To see this, note that if the ω-interval belongs to case 2, then none of the left-extensions of ω is generated; more precisely, none of the cω-intervals, where c ∈ Σ with cω a substring of T, will be enqueued. Now if two different substrings with occurrences ending at the same position j both belonged to case 2, then the shorter one would be a suffix of the longer one, so the longer one would be an (iterated) left-extension of the shorter one and would never have been enqueued. This proves the claim. As there are only n possibilities for j, it follows that case 2 occurs at most n times. In summary, the procedure getIntervals can create at most 2n intervals because every interval belongs to exactly one case. Each interval can be generated in O(log σ) time, so the run-time of Algorithm 4 is O(n log σ).
It is shown in [7] that Algorithm 4 can be implemented in such a way that the intervals in the queue can be stored in O(n) bits. Furthermore, n log σ + o(n log σ) bits are required for the wavelet tree and n bytes are needed for the LCP-array itself (lcp-values that do not fit into one byte are kept in a separate data structure). Under the assumption that the alphabet size is constant, the algorithm hence requires n bytes plus O(n) bits of RAM in total.
In essence, Algorithm 4 enumerates all substring-intervals in a breadth-first manner, and Beller et al. [7] demonstrated how this can be done space-efficiently. Belazzougui [6] showed that the enumeration can also be done space-efficiently in a depth-first manner. Prezza and Rosone [36] analyzed the space consumption of the two alternatives and came to the conclusion that Beller et al.'s algorithm is more space-efficient on large alphabets, whereas Belazzougui's algorithm is more space-efficient on small alphabets. Prezza and Rosone designed an algorithm that combines the two alternatives: if the alphabet size exceeds a certain threshold (depending on n), they apply the algorithm of Beller et al., and otherwise they use Belazzougui's algorithm to compute the LCP-array. The resulting algorithm fills the LCP-array in O(n log σ) time, using only o(n log σ) bits of working space on top of the BWT and the LCP-array [36, Theorem 3].
7 Final Remarks
We have only scratched the surface of LCP-array construction and reviewed some classic algorithms; in particular, we did not cover results on the efficient storage of lcp-values, such as sampled [41] or compressed [13, 39] LCP-arrays. There are also variants of the LCP-array, e.g. for labeled graphs [3], k-mer spectra [2], or Wheeler DFAs [11], as well as LCP-arrays with mismatches [32], none of which were covered here. We also did not look at more advanced models of computation, like parallel shared memory [40], distributed memory [17], or external memory [8, 21]. Finally, LCP-arrays for collections of strings have also been considered [12, 28]. All citations in this section should only be seen as points of entry for further literature research, as they are far from exhaustive. We close this review article by mentioning that for large repetitive data sets the currently best practical algorithm is based on prefix-free parsing [38, Lemma 2].
References
- [1] Mohamed Ibrahim Abouelhoda, Stefan Kurtz, and Enno Ohlebusch. Replacing suffix trees with enhanced suffix arrays. J. Discrete Algorithms, 2(1):53–86, 2004. doi:10.1016/S1570-8667(03)00065-0.
- [2] Jarno N. Alanko, Elena Biagi, and Simon J. Puglisi. Longest common prefix arrays for succinct k-spectra. In Franco Maria Nardini, Nadia Pisanti, and Rossano Venturini, editors, String Processing and Information Retrieval - 30th International Symposium, SPIRE 2023, Pisa, Italy, September 26-28, 2023, Proceedings, volume 14240 of Lecture Notes in Computer Science, pages 1–13. Springer, 2023. doi:10.1007/978-3-031-43980-3_1.
- [3] Jarno N. Alanko, Davide Cenzato, Nicola Cotumaccio, Sung-Hwan Kim, Giovanni Manzini, and Nicola Prezza. Computing the LCP array of a labeled graph. In Shunsuke Inenaga and Simon J. Puglisi, editors, 35th Annual Symposium on Combinatorial Pattern Matching, CPM 2024, June 25-27, 2024, Fukuoka, Japan, volume 296 of LIPIcs, pages 1:1–1:15. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2024. doi:10.4230/LIPICS.CPM.2024.1.
- [4] Alberto Apostolico. The myriad virtues of subword trees. In Combinatorial Algorithms on Words, pages 85–96. Springer, 1985.
- [5] Jérémy Barbay, Johannes Fischer, and Gonzalo Navarro. LRM-trees: Compressed indices, adaptive sorting, and compressed permutations. Theor. Comput. Sci., 459:26–41, 2012. doi:10.1016/J.TCS.2012.08.010.
- [6] Djamal Belazzougui. Linear time construction of compressed text indices in compact space. In Proc. 46th Annual ACM Symposium on Theory of Computing, pages 148–193, 2014.
- [7] Timo Beller, Simon Gog, Enno Ohlebusch, and Thomas Schnattinger. Computing the longest common prefix array based on the Burrows-Wheeler transform. Journal of Discrete Algorithms, 18:22–31, 2013. doi:10.1016/J.JDA.2012.07.007.
- [8] Timo Bingmann, Johannes Fischer, and Vitaly Osipov. Inducing suffix and LCP arrays in external memory. ACM J. Exp. Algorithmics, 21(1):2.3:1–2.3:27, 2016. doi:10.1145/2975593.
- [9] Michael Burrows and David J. Wheeler. A block-sorting lossless data compression algorithm. Research Report 124, Digital Systems Research Center, 1994.
- [10] Richard Cole and Ramesh Hariharan. Dynamic LCA queries on trees. SIAM J. Comput., 34(4):894–923, 2005. doi:10.1137/S0097539700370539.
- [11] Alessio Conte, Nicola Cotumaccio, Travis Gagie, Giovanni Manzini, Nicola Prezza, and Marinella Sciortino. Computing matching statistics on Wheeler DFAs. In Ali Bilgin, Michael W. Marcellin, Joan Serra-Sagristà, and James A. Storer, editors, Data Compression Conference, DCC 2023, Snowbird, UT, USA, March 21-24, 2023, pages 150–159. IEEE, 2023. doi:10.1109/DCC55655.2023.00023.
- [12] Lavinia Egidi, Felipe A. Louza, Giovanni Manzini, and Guilherme P. Telles. External memory BWT and LCP computation for sequence collections with applications. Algorithms Mol. Biol., 14(1):6:1–6:15, 2019. doi:10.1186/S13015-019-0140-0.
- [13] Johannes Fischer. Wee LCP. Inf. Process. Lett., 110(8-9):317–320, 2010. doi:10.1016/J.IPL.2010.02.010.
- [14] Johannes Fischer. Inducing the LCP-array. In Frank Dehne, John Iacono, and Jörg-Rüdiger Sack, editors, Algorithms and Data Structures - 12th International Symposium, WADS 2011, New York, NY, USA, August 15-17, 2011. Proceedings, volume 6844 of Lecture Notes in Computer Science, pages 374–385. Springer, 2011. doi:10.1007/978-3-642-22300-6_32.
- [15] Johannes Fischer and Volker Heun. Finding range minima in the middle: Approximations and applications. Math. Comput. Sci., 3(1):17–30, 2010. doi:10.1007/S11786-009-0007-8.
- [16] Johannes Fischer and Volker Heun. Space-efficient preprocessing schemes for range minimum queries on static arrays. SIAM J. Comput., 40(2):465–492, 2011. doi:10.1137/090779759.
- [17] Patrick Flick and Srinivas Aluru. Parallel distributed memory construction of suffix and longest common prefix arrays. In Jackie Kern and Jeffrey S. Vetter, editors, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2015, Austin, TX, USA, November 15-20, 2015, pages 16:1–16:10. ACM, 2015. doi:10.1145/2807591.2807609.
- [18] Simon Gog and Enno Ohlebusch. Compressed suffix trees: Efficient computation and storage of LCP-values. Journal of Experimental Algorithmics, 18:2.1, 2013. doi:10.1145/2444016.2461327.
- [19] Gaston H. Gonnet, Ricardo Baeza-Yates, and Tim Snider. New indices for text: PAT trees and PAT arrays. In Information Retrieval: Data Structures and Algorithms, chapter 5, pages 66–82. Prentice-Hall, Englewood Cliffs, NJ, 1992.
- [20] Wing-Kai Hon, Kunihiko Sadakane, and Wing-Kin Sung. Breaking a time-and-space barrier in constructing full-text indices. In Proc. 44th Annual IEEE Symposium on Foundations of Computer Science, pages 251–260, 2003. doi:10.1109/SFCS.2003.1238199.
- [21] Juha Kärkkäinen and Dominik Kempa. Better external memory LCP array construction. ACM J. Exp. Algorithmics, 24(1):1.3:1–1.3:27, 2019. doi:10.1145/3297723.
- [22] Juha Kärkkäinen, Giovanni Manzini, and Simon J. Puglisi. Permuted longest-common-prefix array. In Proc. 20th Annual Symposium on Combinatorial Pattern Matching, volume 5577 of Lecture Notes in Computer Science, pages 181–192. Springer, 2009. doi:10.1007/978-3-642-02441-2_17.
- [23] Juha Kärkkäinen and Peter Sanders. Simple linear work suffix array construction. In Proc. 30th International Colloquium on Automata, Languages and Programming, volume 2719 of Lecture Notes in Computer Science, pages 943–955. Springer, 2003. doi:10.1007/3-540-45061-0_73.
- [24] Juha Kärkkäinen, Peter Sanders, and Stefan Burkhardt. Linear work suffix array construction. J. ACM, 53(6):918–936, 2006. doi:10.1145/1217856.1217858.
- [25] Toru Kasai, Gunho Lee, Hiroki Arimura, Setsuo Arikawa, and Kunsoo Park. Linear-time longest-common-prefix computation in suffix arrays and its applications. In Proc. 12th Annual Symposium on Combinatorial Pattern Matching, volume 2089 of Lecture Notes in Computer Science, pages 181–192. Springer, 2001. doi:10.1007/3-540-48194-X_17.
- [26] Dong Kyue Kim, Jeong Seop Sim, Heejin Park, and Kunsoo Park. Linear-time construction of suffix arrays. In Proc. 14th Annual Symposium on Combinatorial Pattern Matching, volume 2676 of Lecture Notes in Computer Science, pages 186–199. Springer-Verlag, 2003. doi:10.1007/3-540-44888-8_14.
- [27] Pang Ko and Srinivas Aluru. Space efficient linear time construction of suffix arrays. In Proc. 14th Annual Symposium on Combinatorial Pattern Matching, volume 2676 of Lecture Notes in Computer Science, pages 200–210. Springer-Verlag, 2003. doi:10.1007/3-540-44888-8_15.
- [28] Felipe A. Louza, Guilherme P. Telles, Steve Hoffmann, and Cristina Dutra de Aguiar Ciferri. Generalized enhanced suffix array construction in external memory. Algorithms Mol. Biol., 12(1):26:1–26:16, 2017. doi:10.1186/S13015-017-0117-9.
- [29] Veli Mäkinen. Compact suffix array – a space-efficient full-text index. Fundamenta Informaticae, 56(1-2):191–210, 2003. URL: http://content.iospress.com/articles/fundamenta-informaticae/fi56-1-2-12.
- [30] Udi Manber and Gene Myers. Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing, 22(5):935–948, 1993. doi:10.1137/0222058.
- [31] Giovanni Manzini. Two space saving tricks for linear time LCP array computation. In Proc. 9th Scandinavian Workshop on Algorithm Theory, volume 3111 of Lecture Notes in Computer Science, pages 372–383. Springer, 2004. doi:10.1007/978-3-540-27810-8_32.
- [32] Giovanni Manzini. Longest common prefix with mismatches. In Costas S. Iliopoulos, Simon J. Puglisi, and Emine Yilmaz, editors, String Processing and Information Retrieval - 22nd International Symposium, SPIRE 2015, London, UK, September 1-4, 2015, Proceedings, volume 9309 of Lecture Notes in Computer Science, pages 299–310. Springer, 2015. doi:10.1007/978-3-319-23826-5_29.
- [33] Ge Nong, Sen Zhang, and Wai Hong Chan. Two efficient algorithms for linear time suffix array construction. IEEE Trans. Computers, 60(10):1471–1484, 2011. doi:10.1109/TC.2010.188.
- [34] Enno Ohlebusch. Bioinformatics Algorithms: Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction. Oldenbusch Verlag, 2013.
- [35] Jannik Olbrich, Enno Ohlebusch, and Thomas Büchler. Generic non-recursive suffix array construction. ACM Transactions on Algorithms, 20(2), 2024. doi:10.1145/3641854.
- [36] Nicola Prezza and Giovanna Rosone. Space-efficient computation of the LCP array from the Burrows-Wheeler transform. In Proc. 30th Annual Symposium on Combinatorial Pattern Matching, volume 128 of Leibniz International Proceedings in Informatics, pages 7:1–7:18, 2019. doi:10.4230/LIPICS.CPM.2019.7.
- [37] Simon J. Puglisi, Bill Smyth, and Andrew Turpin. A taxonomy of suffix array construction algorithms. ACM Computing Surveys, 39(2):Article 4, 2007.
- [38] Massimiliano Rossi, Marco Oliva, Ben Langmead, Travis Gagie, and Christina Boucher. MONI: A pangenomic index for finding maximal exact matches. J. Comput. Biol., 29(2):169–187, 2022. doi:10.1089/CMB.2021.0290.
- [39] Kunihiko Sadakane. Compressed suffix trees with full functionality. Theory Comput. Syst., 41(4):589–607, 2007. doi:10.1007/S00224-006-1198-X.
- [40] Julian Shun. Fast parallel computation of longest common prefixes. In Trish Damkroger and Jack J. Dongarra, editors, International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2014, New Orleans, LA, USA, November 16-21, 2014, pages 387–398. IEEE Computer Society, 2014. doi:10.1109/SC.2014.37.
- [41] Jouni Sirén. Sampled longest common prefix array. In Amihood Amir and Laxmi Parida, editors, Combinatorial Pattern Matching, 21st Annual Symposium, CPM 2010, New York, NY, USA, June 21-23, 2010. Proceedings, volume 6129 of Lecture Notes in Computer Science, pages 227–237. Springer, 2010. doi:10.1007/978-3-642-13509-5_21.
- [42] Jean Vuillemin. A unifying look at data structures. Commun. ACM, 23(4):229–239, 1980. doi:10.1145/358841.358852.
- [43] Peter Weiner. Linear pattern matching algorithms. In Proc. 14th IEEE Annual Symposium on Switching and Automata Theory, pages 1–11, 1973. doi:10.1109/SWAT.1973.13.
