
Faster Run-Length Compressed Suffix Arrays

Nathaniel K. Brown, Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA; Travis Gagie (corresponding author), Faculty of Computer Science, Dalhousie University, Halifax, Canada; Giovanni Manzini, Department of Computer Science, University of Pisa, Italy; Gonzalo Navarro, Department of Computer Science, University of Chile, Santiago, Chile; Marinella Sciortino, Department of Mathematics and Computer Science, University of Palermo, Italy
Abstract

We first review how we can store a run-length compressed suffix array (RLCSA) for a text T of length n over an alphabet of size σ whose Burrows-Wheeler Transform (BWT) consists of r runs in O(r log(n/r) + r log σ + σ) bits such that later, given a character a and the suffix-array (SA) interval for P, we can find the SA interval for aP in O(log r_a + log log n) time, where r_a is the number of runs of copies of a in the BWT. We then show how to modify the RLCSA such that we find the SA interval for aP in only O(log r_a) time, without increasing its asymptotic space bound. Our key idea is applying a result by Nishimoto and Tabei (ICALP 2021) and then replacing rank queries on sparse bitvectors by a constant number of select queries. We also review two-level indexing and discuss how our faster RLCSA may be useful in improving it. Finally, we briefly discuss how two-level indexing may speed up a recent heuristic for finding maximal exact matches of a pattern with respect to an indexed text.

Keywords and phrases:
Run-length compressed suffix arrays, interpolative coding, two-level indexing
Category:
Research
Funding:
Nathaniel K. Brown: Supported in part by a Johns Hopkins University Computer Science PhD Fellowship and NIH grants R21HG013433 and R01HG011392.
Travis Gagie: Funded in part by NSERC grant RGPIN-07185-2020.
Giovanni Manzini: Partially supported by the NextGeneration EU programme PNRR ECS00000017 Tuscany Health Ecosystem (Spoke 6, CUP: I53C22000780001) and by the project PAN-HUB funded by the Italian Ministry of Health (ID: T4-AN-07, CUP: I53C22001300001).
Gonzalo Navarro: Funded in part by CeBiB under Basal Funds FB0001 and AFB240001, ANID, Chile.
Marinella Sciortino: Partially supported by the project “ACoMPA” (CUP B73C24001050001) funded by the NextGeneration EU programme PNRR ECS00000017 Tuscany Health Ecosystem (Spoke 6) and by MUR PRIN project “PINC” (no. 2022YRB97K).
Copyright and License:
© Nathaniel K. Brown, Travis Gagie, Giovanni Manzini, Gonzalo Navarro, and Marinella Sciortino; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation → Pattern matching
Acknowledgements:
Many thanks to the first author’s CSCI 4419 / 6106 students Anas Alhadi, Nour Allam, Dove Begleiter, Nithin Bharathi Kabilan Karpagavalli, Suchith Sridhar Khajjayam and Hamza Wahed, and to Christina Boucher, Ben Langmead, Jouni Sirén and Mohsen Zakeri.
Editors:
Alessio Conte, Andrea Marino, Giovanna Rosone, and Jeffrey Scott Vitter

1 Introduction

Grossi and Vitter’s compressed suffix arrays (CSAs) [11] and Ferragina and Manzini’s FM-indexes [8] are sometimes treated as almost interchangeable, but their query-time bounds are quite different. With a CSA for a text T of length n over an alphabet of size σ, when given a character a and the suffix-array (SA) interval for a pattern P we can find the SA interval for aP in O(log n_a) time, where n_a is the number of occurrences of a in the text; with an FM-index we use O(log σ) time. This difference carries over to run-length compressed suffix arrays (RLCSAs) [18, 24] and run-length compressed FM-indexes (RLFM-indexes) [10, 17], with both taking space proportional to the number r of runs in the Burrows-Wheeler Transform (BWT) of the text but the former being generally faster for texts over large alphabets with relatively few runs of each character, and the latter being faster for texts over smaller alphabets.

In Section 2 we review (with some artistic license) CSAs and RLCSAs. In Subsection 3.1 we show how to use interpolative coding to build an RLCSA for T that takes O(r log(n/r) + r log σ + σ) bits and allows us to find the SA interval for aP from that of P in O(log r_a + log log n) time, where r_a is the number of those runs in the BWT containing copies of a. In Subsection 3.2 we review a result by Nishimoto and Tabei [20] about splitting the runs in the BWT so that we can evaluate LF in constant time, without increasing the number of runs by more than a constant factor. In Subsection 3.3 we present our main result: how to modify the RLCSA from Section 2 such that finding the SA interval for aP takes only O(log r_a) time, without increasing the asymptotic space bound. In Section 4 we discuss two-level indexing, for which we build one index for the text and another for the parse of the text, and how our faster RLCSA may be more suitable for indexing parses than current options. Finally, in Section 5 we briefly discuss how two-level indexing may speed up a recent heuristic for finding long maximal exact matches (MEMs) of a pattern with respect to an indexed text.

2 Preliminaries

Suppose we are given a text T[0..n−1] over an alphabet of size σ and asked to index it such that, given a pattern P[0..m−1], we can quickly count the number of occurrences of P in T. More specifically, we want to find the interval in the suffix array (SA) of T containing the starting positions of occurrences of P. Consider the matrix whose rows are the lexicographically sorted cyclic shifts of T and let F and L be the first and last column of that matrix, respectively; this means F contains the characters in T in lexicographic order and L is the BWT of T.

2.1 Compressed suffix arrays

The key idea behind compressed suffix arrays (CSAs) is to store Ψ[0..n−1] compactly while supporting certain searches on it quickly, where Ψ[0..n−1] is the permutation of {0, …, n−1} such that Ψ[i] is the position of SA entry (SA[i]+1) mod n in SA[0..n−1] or, equivalently, the position in L of F[i]. (This means Ψ is the inverse of the LF mapping used in FM-indexes.) By definition, Ψ consists of at most σ increasing intervals – one for each distinct character that occurs in the text, corresponding to the interval of suffixes starting with that character – and if we can support fast binary searches on these intervals then we can support fast pattern matching.

For example, consider the text

T=𝙲𝙲𝚃𝙶𝙶𝙶𝙲𝙶𝙰𝚃$𝙲𝚃𝚃𝙰𝙲𝙰𝙲𝙶𝙰𝚃$𝙶𝚃𝚃𝙰𝙲𝙲𝙰𝙶𝙲𝚃$𝙲𝚃𝚃𝙰𝙲𝙶𝙲𝙶𝙲𝚃$𝙲𝚃𝙶𝙰𝙲𝙶𝙰𝙰𝚃𝚃$𝙲𝚃𝚃𝙰𝙲𝙶𝙲𝙶𝙰𝚃#,

for which SA, Ψ, F and L are shown on the left in Figure 1. If we know SA[22..28] is the SA interval for CG (in the green rectangle) and we want the SA interval for GCG, then we can search in the increasing interval

Ψ[36..48] = 6, 9, 14, 15, 16, 23, 24, 28, 29, 30, 42, 46, 63

for G (in the red rectangle, with Ψ values between 22 and 28 shown as orange arrows and the others shown as black arrows) to find the successor Ψ[41] = 23 of 22 and the predecessor Ψ[43] = 28 of 28. We thus learn that the SA interval for GCG is SA[41..43] (in the blue rectangle). Knowing this, we can continue backward stepping.
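To make this backward step concrete, here is a minimal Python sketch, with off-the-shelf binary search standing in for the searchable encoding of Ψ described later; the names psi, char_interval and backward_step are ours, and a real CSA stores Ψ compactly rather than as a plain array.

```python
from bisect import bisect_left, bisect_right

def backward_step(psi, char_interval, a, s, e):
    """Given the SA interval [s..e] for P, return the SA interval for aP,
    or None if aP does not occur. char_interval[a] = (lo, hi) is the
    increasing interval of psi for character a."""
    lo, hi = char_interval[a]
    new_s = bisect_left(psi, s, lo, hi + 1)        # successor of s
    new_e = bisect_right(psi, e, lo, hi + 1) - 1   # predecessor of e
    return (new_s, new_e) if new_s <= new_e else None
```

On the running example, backward_step(psi, {'G': (36, 48)}, 'G', 22, 28) returns (41, 43), matching the search described above.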

Figure 1: For T=𝙲𝙲𝚃𝙶𝙶𝙶𝙲𝙶𝙰𝚃$𝙲𝚃𝚃𝙰𝙲𝙰𝙲𝙶𝙰𝚃$𝙶𝚃𝚃𝙰𝙲𝙲𝙰𝙶𝙲𝚃$𝙲𝚃𝚃𝙰𝙲𝙶𝙲𝙶𝙲𝚃$𝙲𝚃𝙶𝙰𝙲𝙶𝙰𝙰𝚃𝚃$𝙲𝚃𝚃𝙰𝙲𝙶𝙲𝙶𝙰𝚃# we show SA, Ψ, F and L on the left and Ψ′, F′ and L′ on the right. If we know SA[22..28] is the SA interval for CG (in the green rectangle on the left) and we want the SA interval for GCG, then we can search in the increasing interval Ψ[36..48] = 6, 9, 14, 15, 16, 23, 24, 28, 29, 30, 42, 46, 63 for G (in the red rectangle on the left, with Ψ values between 22 and 28 shown as orange arrows and the others shown as black arrows) to find the successor Ψ[41] = 23 of 22 and the predecessor Ψ[43] = 28 of 28. We thus learn that the SA interval for GCG is SA[41..43] (in the blue rectangle on the left).
On the other hand, if we know SA[22..28] starts at offset 0 in the run of character L′[12] – that is, at offset 0 in the 13th run, counting from 1 – and ends at offset 1 in the run of character L′[15] (in the green rectangle on the right), then we can search in the increasing interval Ψ′[25..32] = 1, 3, 7, 13, 15, 22, 25, 39 for G (in the red rectangle, with Ψ′ values between 12 and 15 shown as orange arrows and the others shown as black arrows) to find the successor Ψ′[28] = 13 of 12 and the predecessor Ψ′[29] = 15 of 15 (in the blue rectangle on the right). We then use select and rank queries on two n-bit sparse bitvectors to find the SA interval for GCG, the runs of L containing that interval’s starting and ending positions, and those positions’ offsets in those runs.

2.2 Run-length compressed suffix arrays revisited

Run-length compressed suffix arrays (RLCSAs) were introduced in [24] for indexing highly repetitive collections. In this section we present an alternative, but functionally equivalent, description of RLCSAs which is more suitable for describing our improvements.

Definition 1.

For a text T[0..n−1], the array L′[0..r−1] stores the sequence of r characters of the runs in the run-length encoding of L.

Definition 2.

For a text T[0..n−1], the array F′[0..r−1] stores the r characters of L′ in lexicographic order.

Definition 3.

For a text T[0..n−1], the array Ψ′[0..r−1] is the permutation of {0, …, r−1} such that Ψ′[i] is the position of F′[i] in L′.

In this paper we view an RLCSA as a data structure storing Ψ′[0..r−1] compactly while supporting certain searches on it quickly. By the definition of Ψ′, it still consists of at most σ increasing intervals – one for each distinct character that occurs in T, corresponding to the interval of suffixes starting with that character – and if we can still support fast binary searches on these intervals then we can still support fast pattern matching.

For example, consider

T=𝙲𝙲𝚃𝙶𝙶𝙶𝙲𝙶𝙰𝚃$𝙲𝚃𝚃𝙰𝙲𝙰𝙲𝙶𝙰𝚃$𝙶𝚃𝚃𝙰𝙲𝙲𝙰𝙶𝙲𝚃$𝙲𝚃𝚃𝙰𝙲𝙶𝙲𝙶𝙲𝚃$𝙲𝚃𝙶𝙰𝙲𝙶𝙰𝙰𝚃𝚃$𝙲𝚃𝚃𝙰𝙲𝙶𝙲𝙶𝙰𝚃#

again, for which Ψ′, F′ and L′ are shown on the right in Figure 1. If we know the SA interval SA[22..28] for CG starts at offset 0 in the run of character L′[12] and ends at offset 1 in the run of character L′[15] (in the green rectangle) and we want the SA interval for GCG, then we can search in the increasing interval

Ψ′[25..32] = 1, 3, 7, 13, 15, 22, 25, 39

for G (in the red rectangle, with Ψ′ values between 12 and 15 shown as orange arrows and the others shown as black arrows) to find the successor Ψ′[28] = 13 of 12 and the predecessor Ψ′[29] = 15 of 15.

Because L′ and F′ do not have the predecessor-successor relationship of L and F, we cannot deduce that the SA interval for GCG starts in the run of character L′[28] and ends in the run of character L′[29] (and, in fact, in this example it does not). Instead, we store two n-bit SD-bitvectors [21], B_L and B_F, with r copies of 1 each. The 1s in B_L mark the starting positions of runs in L and the 1s in B_F mark the positions in F of the marked characters in L. In our example

B_F = 111001101110011111111111000101110110111001010011110000010101111000
B_L = 100000110111011001011111010010011100111000110111111111110001011110

The interval B_F[41..43] in B_F, starting immediately before the bit with offset 0 in the block whose starting position is marked with the 29th copy of 1 and ending immediately before the bit with offset 1 in the block whose starting position is marked with the 30th copy of 1, is shown in blue. (We are interested in the blocks marked with the 29th and 30th copies of 1 because we count from 0 in the j column in Figure 1, so those blocks correspond to Ψ′[28] and Ψ′[29].) We can find this interval with 2 select1 queries on B_F, which take constant time.

The corresponding interval B_L[41..43] in B_L is also shown in blue, starting immediately before the bit with offset 3 in the block whose starting position is marked with the 22nd copy of 1 and ending immediately before the bit with offset 1 in the block whose starting position is marked with the 24th copy of 1. We can find the 2 indices 22 and 24 with 2 rank1 queries on B_L, which take O(log log n) time. This means the SA interval for GCG is SA[41..43] and it starts at offset 3 in the run of character L′[21] and ends at offset 1 in the run of character L′[23]. Knowing this we can continue backward stepping.
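As a sketch of this machinery in Python, we can fake the two SD-vectors with sorted lists of 1-positions; the bisect-based rank1 here takes O(log r) time rather than the O(log log n) of the cited structures, and the class and function names are ours.

```python
from bisect import bisect_right

class SparseBV:
    """Stand-in for an n-bit SD-vector: the sorted positions of its 1s."""
    def __init__(self, ones):
        self.ones = sorted(ones)
    def select1(self, j):        # position of the j-th 1, counting from 1
        return self.ones[j - 1]
    def rank1(self, i):          # number of 1s in positions [0..i]
        return bisect_right(self.ones, i)

def map_F_run_to_L_run(BF, BL, f_run, offset):
    """Turn (run in F, offset) into (run in L, offset), runs counted from 0:
    select1 on B_F gives the absolute position, rank1 on B_L finds the run
    of L containing it, and select1 on B_L gives the offset."""
    pos = BF.select1(f_run + 1) + offset
    l_run = BL.rank1(pos) - 1
    return l_run, pos - BL.select1(l_run + 1)
```

In the example, the SA interval for GCG starts at position B_F.select1(29) + 0 = 41, and mapping it into L gives run 21 (the 22nd, counting from 1) with offset 3, as above.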

The RLCSA in Sirén’s PhD thesis [24] for a text T[0..n−1] with r BWT runs takes O(r log(n/r) + r log σ + σ log n) bits. Given a character a and the SA interval for P, it can find the SA interval for aP in O(log n) time.

3 Faster RLCSAs

3.1 Searchable interpolative coding

Suppose we are given an increasing list ℓ_1, …, ℓ_k of k integers in the range [0..n−1]. To encode them with interpolative coding [19], we first write ℓ_{⌈k/2⌉} using ⌊lg(n−1)⌋+1 bits (except that we write 0 using 1 bit). All the numbers ℓ_1, …, ℓ_{⌈k/2⌉−1} are in the range [0..ℓ_{⌈k/2⌉}−1], so we can encode them recursively. All the numbers ℓ_{⌈k/2⌉+1}, …, ℓ_k are in the range [ℓ_{⌈k/2⌉}+1..n−1], so we can encode them recursively as ℓ_{⌈k/2⌉+1} − ℓ_{⌈k/2⌉} − 1, …, ℓ_k − ℓ_{⌈k/2⌉} − 1. Each encoding has O(log n) bits, so we can read it in O(1) time. If we imagine the list stored as keys in a balanced binary search tree then we encode the keys according to a pre-order traversal: when we reach each key ℓ_i, we know ℓ_i lies between the numbers shown to the left and right of it and we encode ℓ_i using the maximum number of bits we would need for any key in that range.

For example, if n = 66, k = 13 and the list is 6, 9, 14, 15, 16, 23, 24, 28, 29, 30, 42, 46, 63 then, as illustrated in Figure 2, we start by encoding ℓ_7 = 24 using ⌊lg 65⌋+1 = 7 bits as 0011000. We then encode ℓ_3 = 14 using ⌊lg 23⌋+1 = 5 bits as 01110. We then encode ℓ_1, ℓ_2, ℓ_5, ℓ_4, ℓ_6 = 6, 9, 16, 15, 23 as 0110, 010, 0001, 0, 110, and ℓ_10, ℓ_8, ℓ_9, ℓ_12, ℓ_11, ℓ_13 = 30, 28, 29, 46, 42, 63 as 000101, 011, 0, 001111, 1011, 10000. When we reach 46, say, in a pre-order traversal of the tree in Figure 2, we know it lies between 31 and 65, so we encode it using ⌊lg(65−31)⌋+1 = 6 bits as (46−31)_2 = 001111.

Figure 2: A balanced binary search tree storing the k = 13 keys from the increasing list 6, 9, 14, 15, 16, 23, 24, 28, 29, 30, 42, 46, 63, with each key in the range [0..n−1 = 65]. When we reach each key in a pre-order traversal or binary search, we know it lies between the two values shown to its left and right, so we can encode it as the binary number shown below it, using a total of O(k log(n/k) + k) bits. If we store a bitvector marking the start of each encoding as visited in an in-order traversal, as shown below the tree, then we can omit the leading 0s from the encodings and support binary search in time O(log k) without changing our asymptotic space bound.
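The following Python sketch mirrors this pre-order encoding; the function name is ours, and we emit each code as a bit string rather than packing bits.

```python
def interpolative_encode(keys, vlo, vhi, out):
    """Pre-order interpolative coding of the sorted list keys, whose values
    all lie in [vlo..vhi]: emit the middle key relative to vlo, then recurse
    on the two halves with tightened value ranges."""
    if not keys:
        return
    mid = (len(keys) - 1) // 2                  # the ceil(k/2)-th key
    width = max((vhi - vlo).bit_length(), 1)    # floor(lg(vhi - vlo)) + 1
    out.append(format(keys[mid] - vlo, 'b').zfill(width))
    interpolative_encode(keys[:mid], vlo, keys[mid] - 1, out)
    interpolative_encode(keys[mid + 1:], keys[mid] + 1, vhi, out)

codes = []
interpolative_encode([6, 9, 14, 15, 16, 23, 24, 28, 29, 30, 42, 46, 63],
                     0, 65, codes)
# codes starts '0011000' (24 in 7 bits), '01110' (14 in 5 bits), as above
```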

The binary search tree has height ⌈lg k⌉ and the bottom level contains at most k keys. By Jensen’s Inequality, we encode those keys using O(k log(n/k) + k) bits. Similarly, there are at most k/2^h keys at height h and we encode those keys using O((k/2^h) log(n/(k/2^h)) + k/2^h) = O((k/2^h) log(n/k) + k(h+1)/2^h) bits. Since

∑_{h=0}^{⌈lg k⌉} k(h+1)/2^h = O(k),

we use O(k log(n/k) + k) bits in total.

In this paper we want to perform binary search on the list – the reader may have noticed that our example is Ψ[36..48] from Figure 1 – so we want fast access to the encodings of the numbers in it in the order we check them in a binary search. We can store the encodings according to an in-order traversal instead of a pre-order traversal, and store an uncompressed bitvector with as many bits as there are in the concatenation of the encodings and 1s marking where the encodings start. Since the bitvector delimits the encodings, however, we can delete the leading 0s from each encoding before concatenating them and building the bitvector. The in-order encodings for our example are shown below the tree in Figure 2, with the leading 0s removed, and the bitvector is shown below them. Since the bitvector uses at most as many bits as the encodings, we still use O(k log(n/k) + k) bits in total and – although random access still takes O(log k) time – we can perform binary search in O(log k) total time. This scheme is similar to Teuhola’s [25] and Claude, Nicholson and Seco’s [5].

To find the successor of 22 in the list, we start at the root knowing n = 66 and k = 13 and perform select1(7) and select1(8) queries on the bitvector to find the starting and ending positions of the encoding 0011000 of ℓ_7 = 24 in the range [0..65]. Since 22 < 24, we then perform select1(3) and select1(4) queries to find the starting and ending positions of the encoding 01110 of ℓ_3 = 14 in the range [0..23]. Since 22 > 14, we then perform select1(5) and select1(6) queries to find the starting and ending positions of the encoding 0001 of ℓ_5 = 16 in the range [15..23]. Since 22 > 16, we then perform select1(6) and select1(7) queries to find the starting and ending positions of the encoding 110 of ℓ_6 = 23 in the range [17..23]. Since 22 < 23, we know the successor of 22 in the list is 23. We can find the predecessor of 28 in O(log k) time symmetrically.
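Here is a Python sketch of such a search: we simulate the bitvector-delimited layout with a list of in-order codes (leading 0s stripped) and descend the implicit tree while tracking the value range, decoding only the O(log k) codes the binary search touches. The names and the list-of-strings representation are ours.

```python
def build_codes(keys, lo, hi, vlo, vhi, codes):
    """codes[i] = code of the i-th smallest key, leading 0s stripped
    (the delimiting bitvector described above recovers the boundaries)."""
    if lo > hi:
        return
    mid = (lo + hi) // 2
    codes[mid] = format(keys[mid] - vlo, 'b').lstrip('0')
    build_codes(keys, lo, mid - 1, vlo, keys[mid] - 1, codes)
    build_codes(keys, mid + 1, hi, keys[mid] + 1, vhi, codes)

def successor(codes, k, n, x):
    """Smallest key >= x (or None), decoding one code per step."""
    lo, hi, vlo, vhi, best = 0, k - 1, 0, n - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        val = vlo + int(codes[mid] or '0', 2)
        if val >= x:
            best, hi, vhi = val, mid - 1, val - 1
        else:
            lo, vlo = mid + 1, val + 1
    return best

keys = [6, 9, 14, 15, 16, 23, 24, 28, 29, 30, 42, 46, 63]
codes = [''] * len(keys)
build_codes(keys, 0, len(keys) - 1, 0, 65, codes)
assert successor(codes, len(keys), 66, 22) == 23   # as in the walkthrough
```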

If we apply interpolative coding with fast binary search to the increasing interval of Ψ for a character a in a text T of length n, then we use O(n_a log(n/n_a) + n_a) bits and can support binary search in O(log n_a) time, where n_a is the frequency of a in T. If we do this for all the characters then we use O(n(H_0(T) + 1)) bits, where H_0(T) is the 0th-order empirical entropy of T. If we instead encode the increasing interval of Ψ′ for a with interpolative coding, then we use O(r_a log(r/r_a) + r_a) bits and can support binary search in O(log r_a) time, where r_a is the number of runs of copies of a in the BWT of T (and, equivalently, the number of copies of a in L′). If we do this for all the characters then we use O(r(H_0(L′) + 1)) bits, where L′ is again the sequence of r characters of the runs in the run-length encoding of L. To be able to find the increasing interval for a in Ψ′, we store an r-bit uncompressed bitvector with 1s marking where the intervals start.

Theorem 4.

We can store Ψ′ for T in O(r(H_0(L′) + 1)) ⊆ O(r log σ) bits and support binary search in the increasing interval for a character a in O(log r_a) time, where r_a is the number of runs of copies of a in the BWT of T.

To use Theorem 4 in an RLCSA, we store

  • an uncompressed bitvector marking which distinct characters occur in T, in O(σ) bits;

  • the SD-vectors B_F and B_L in O(r log(n/r)) bits;

  • an uncompressed bitvector with 1s marking where the intervals for the characters start in Ψ′, in O(r) bits;

  • Ψ′ in O(r log σ) bits.

If we are given a and the SA interval for P then we can find the SA interval for aP by

  • using a rank query on the first uncompressed bitvector to find a’s rank among the distinct characters that occur in T, in O(1) time;

  • using rank queries on B_L to find the runs in L overlapping the SA interval for P, in O(log log n) time;

  • using select queries on the second uncompressed bitvector to find the interval for a in Ψ′, in O(1) time;

  • using binary search in the interval for a in Ψ′ to find the successor and predecessor of the ranks of the first and last runs in L overlapping the SA interval for P, in O(log r_a) time;

  • using select queries on B_F and arithmetic to find the SA interval for aP in O(1) time.

We store O(r log(n/r) + r log σ + σ) bits in total and find the SA interval for aP in O(log r_a + log log n) total time. Notice that the O(log log n) term in the query time comes only from the rank query on B_L.

Corollary 5.

We can store an RLCSA for T in O(r log(n/r) + r log σ + σ) bits such that, given a character a and the SA interval for P, we can find the SA interval for aP in O(log r_a + log log n) time.

3.2 Splitting Theorem for RLCSAs

Nishimoto and Tabei [20] showed how we can split the runs in L such that no block in B_F overlaps more than a constant number of blocks in B_L without increasing the number of runs by more than a constant factor, and then store LF in O(r log n) bits and evaluate it in constant time. Brown, Gagie and Rossi [4] slightly generalized their key theorem:

Theorem 6 (Nishimoto and Tabei [20]; Brown, Gagie and Rossi [4]).

Let π be a permutation on {0, …, n−1},

P = {0} ∪ {i : 0 < i ≤ n−1, π(i) ≠ π(i−1) + 1},

and Q = {π(i) : i ∈ P}. For any integer d ≥ 2, we can construct P′ with P ⊆ P′ ⊆ {0, …, n−1} and Q′ = {π(i) : i ∈ P′} such that

  • if q, q′ ∈ Q′ and q is the predecessor of q′ in Q′, then |[q, q′) ∩ P′| < 2d,

  • |P′| ≤ d|P|/(d−1).

If L[i] = L[i−1] then LF(i) = LF(i−1) + 1, so

{0} ∪ {i : 0 < i ≤ n−1, LF(i) ≠ LF(i−1) + 1}

has cardinality r. If LF(i) = LF(i−1) + 1 then, since Ψ and LF are inverse permutations, Ψ[j] = Ψ[j−1] + 1 where j = LF(i). Therefore,

{0} ∪ {j : 0 < j ≤ n−1, Ψ[j] ≠ Ψ[j−1] + 1}

also has cardinality r, and applying Theorem 6 with d = 2 to Ψ splits the runs in the BWT such that no block in B_F overlaps more than 3 blocks in B_L, without increasing the number of runs by more than a factor of 3/2. In our example, the number of runs increases by only 1, from 40 to 41, as shown below with the split block – corresponding to the first run of 6 copies of T in L – marked in brackets (red in the original):

B_F = 1110011011100111111111110001011101101110010100111[100100]10101111000
B_L = [100100]110111011001011111010010011100111000110111111111110001011110

Suppose we apply Theorem 6 with d = 2 to Ψ and then store, for 0 ≤ b < r, the index of the block in B_L containing LF(i_b) and LF(i_b)’s offset in that block, where i_b is the starting position of block b in B_L. Nishimoto and Tabei called this the move table for LF (see also [4, 26]) and it takes a total of O(r log n) bits. If we know position j is in block b in B_L with offset j − i_b then, since the block in B_F to which LF maps block b in B_L now overlaps at most the block containing LF(i_b) and the next 2 blocks in B_L, we can find the index of the block in B_L containing LF(j) = LF(i_b) + j − i_b and LF(j)’s offset in that block with at most 2 constant-time select queries on B_L. We could use at most 2 constant-time lookups instead if we have the starting positions of the blocks in B_L stored explicitly in another O(r log n) bits.
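A Python sketch of the constant-time evaluation this buys us; move and starts are hypothetical names, with starts holding the block starting positions in L (the explicitly stored variant just mentioned) plus a sentinel n at the end.

```python
def lf_move(move, starts, b, off):
    """Evaluate LF on the position at offset off in block b of L.
    move[b] = (block containing LF of block b's first position, the offset
    of that image within its block). After splitting with Theorem 6, the
    while loop iterates at most twice."""
    t, o = move[b]
    o += off                                 # LF(i_b + off) = LF(i_b) + off
    while o >= starts[t + 1] - starts[t]:    # image spills into a later block
        o -= starts[t + 1] - starts[t]
        t += 1
    return t, o
```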

3.3 A faster RLCSA without rank queries

Recall that the O(log log n) term in the query-time bound in Corollary 5 comes only from the use of rank queries on an SD-vector. Select queries on sparse bitvectors can easily be supported in constant time and space polynomial in the number of 1s and, since rank and select queries can be combined to support predecessor queries, rank queries on compact sparse bitvectors inherit lower bounds from predecessor queries [3] – so they cannot be implemented in constant time. Therefore, to get rid of that O(log log n) term, we must somehow avoid rank queries.

We could replace the rank queries with a move table, but that would result in an O(r log n) term in our space bound. Instead, we introduce an uncompressed 2r-bit bitvector B_FL indicating how the starting positions of the blocks in F and L are interleaved. Specifically, we scan B_F and B_L simultaneously – assuming we have already applied Theorem 6 to them so that no block in F overlaps more than 3 blocks in L (so r is a constant factor larger than it was before the application of the theorem) – and

  • if we see 0s in both bitvectors in position i then we write nothing;

  • if we see a 1 in B_F and a 0 in B_L then we write a 1 (indicating that a block starts in F);

  • if we see a 0 in B_F and a 1 in B_L then we write a 0 (indicating that a block starts in L);

  • if we see 1s in both bitvectors then we write a 0 and then a 1 (indicating that blocks start in both L and F).

This way, B_FL.select1(j) tells us which at most 3 blocks in L – those corresponding to the 0 preceding the jth copy of 1 in B_FL and possibly to the next 2 copies of 0 – could overlap block j in F (counting from 1). We can then find the starting positions of those blocks in L using at most 3 select queries on B_L.
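A Python sketch of the construction, with the bitvectors as plain 0/1 lists (names ours):

```python
def build_bfl(BF, BL):
    """Interleave the block starts of F and L (B_F and B_L given as 0/1
    lists of length n) into the 2r-bit vector B_FL, following the four
    cases above."""
    bfl = []
    for f, l in zip(BF, BL):
        if f and l:
            bfl.extend((0, 1))    # blocks start in both L and F
        elif f:
            bfl.append(1)         # a block starts only in F
        elif l:
            bfl.append(0)         # a block starts only in L
    return bfl                    # one 1 per block of F, one 0 per block of L
```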

For our example, taking B_F and B_L to be as shown just after Theorem 6, we have

      012345678901234567890123456789012345678901234567890123456789012345
B_F = 111001101110011111111111000101110110111001010011110010010101111000
B_L = 100100110111011001011111010010011100111000110111111111110001011110

and

       0123456789012345678901234567890123456789
B_FL = 0111010101010100101110110101010101010110
       012345678901234567890123456789012345678901
       100110101011001001010101000100011011010100

(with the grey digits only to show positions; those of the second half of B_FL run from 40 to 81).

In position 38 we see 1s in both B_F and B_L, so we write 01 in B_FL (in positions 49 and 50, respectively); in positions 39 and 40 we see 0s in both B_F and B_L, so we write nothing; in position 41 we see a 1 in B_F and a 0 in B_L, so we write a 1 in B_FL; in position 42 we see a 0 in B_F and a 1 in B_L, so we write a 0 in B_FL; in position 43 we see 1s in both B_F and B_L, so we write 01 in B_FL; in position 44 we see 0s in both B_F and B_L, so we write nothing; and in position 45 we see a 0 in B_F and a 1 in B_L, so we write a 0 in B_FL (in position 55). Admittedly, when n = 66 and, after applying Theorem 6, 2r = 82, it seems foolish to store a 2r-bit uncompressed bitvector instead of simply storing B_L uncompressed. This is due to the small size of our example, however; for massive and highly repetitive datasets, r can easily be hundreds of times smaller than n.

Suppose we know the SA interval SA[41..43] for aP starts at offset 0 in block 28 in F and ends at offset 1 in block 29 in F and we want to find which blocks contain its starting and ending positions in L and the offsets of those positions. In Section 2, we performed 2 rank queries on B_L, but now we perform queries B_FL.select1(29) = 51 and B_FL.select1(30) = 54 (with arguments 29 and 30 instead of 28 and 29 because we mark with a 1 the start of the first block in F, which we index with 0; the results 51 and 54 are indexed from 0 as well). Since the 29th and 30th copies of 1 are B_FL[51] and B_FL[54], they are preceded by the 51 − 29 + 1 = 23rd and 54 − 30 + 1 = 25th copies of 0, respectively.

Because we applied Theorem 6, this means the 29th and 30th blocks in F overlap the 23rd block in L and possibly the 24th and 25th blocks, and the 25th block in L and possibly the 26th and 27th blocks, respectively. Notice that, because we split the 34th block in F but the first block in L for Theorem 6, the block numbers we find in F are the same as in Section 2 but the block numbers we find in L are incremented. Although in general we need 6 select queries on B_L, in this case we can use only 5 – B_L.select1(23), …, B_L.select1(27) – to find where these blocks begin in constant time, and determine which contain the starting and ending positions of the SA interval SA[41..43]: the 23rd and the 25th, respectively.
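In code, the replacement is only a couple of lines; here bfl_ones is the sorted list of 1-positions in B_FL (as in the earlier sparse-bitvector sketch), and the function name is ours.

```python
def first_candidate_l_block(bfl_ones, j):
    """First of the <= 3 blocks in L that can overlap block j in F (both
    counted from 1): the j-th 1 of B_FL sits at position bfl_ones[j - 1]
    and is preceded by exactly that position minus (j - 1) zeros."""
    p = bfl_ones[j - 1]     # position of the j-th 1 in B_FL, indexed from 0
    return p - j + 1        # rank of the 0 preceding it, counting from 1
```

For block 29 in F this returns 51 − 29 + 1 = 23 and for block 30 it returns 54 − 30 + 1 = 25, matching the walkthrough above; at most 3 select queries on B_L per block then give the candidate starting positions.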

In short, we replace a rank query on the SD-bitvector B_L by queries on the uncompressed bitvector B_FL and constant-time select queries on B_L. This gives us the following theorem:

Theorem 7.

We can store an RLCSA for T in O(r log(n/r) + r log σ + σ) bits such that, given a character a and the SA interval for P, we can find the SA interval for aP in O(log r_a) time, where r_a is the number of runs of copies of a in the BWT of T.

Instead of viewing B_FL as replacing slow rank queries while using the same overall space, we can also view it (and B_F and B_L) as replacing an O(r log n)-bit move table while using the same overall query time. Brown, Gagie and Rossi [4] implemented a similar approach to speeding up LF computations in an RLFM-index, but only alluded to it briefly in their paper – the path to Bitvector in their Figure 3 – and gave no analysis nor bounds. We conjecture that a similar approach can also be applied to reduce the size of fast move tables for φ and φ⁻¹ [13], which return SA[i−1] and SA[i+1] when given SA[i].

4 Two-level indexing

Corollary 5 and Theorem 7 suggest that RLCSAs should perform well compared to FM-indexes and RLFM-indexes when the BWT is over a fairly large alphabet and the number of runs of each character is fairly small; Ordóñez, Navarro and Brisaboa [23] have confirmed this experimentally. When indexing a highly repetitive text over a small alphabet, we can make RLCSAs more practical by storing a table of k-tuples that tells us in which range of Ψ′ to search based on which character we are trying to match and which k−1 characters we have just matched. (This table can be represented with a bitvector to save space.) The table for our example from Figure 1 and k = 2 is shown below:

#C 0
$C 1..2
$G 3
AA 4
AC 4..7
AG 8
AT 9..12
CA 13..14
CC 15..16
CG 17..19
CT 20..24
GA 25..27
GC 28..29
GG 30..31
GT 32
T# 33
T$ 33
TA 34..35
TG 36..37
TT 38..39

This says that if we want the SA interval for GCG and we have just matched the suffix CG, then we should search in the range Ψ′[28..29]. On the other hand, notice that the largest range of Ψ′ in which we will ever search is now Ψ′[20..24] – of length 5 – when we are trying to match a C after just matching a T; without such a table, the largest range we search is Ψ′[13..24] – of length 12 – when trying to match a C.
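As a sketch, the table is just a map from k-tuples to subintervals of Ψ′ that narrows the binary search; the dict below mirrors a few entries of the k = 2 table above, and the names are ours.

```python
# (character to match) + (k-1 characters just matched) -> inclusive range
# of Psi' to search; a few entries of the k = 2 table above.
psi_ranges = {"GC": (28, 29), "CG": (17, 19), "CT": (20, 24)}

def narrowed_interval(a, matched_suffix, k=2):
    """Range of Psi' to binary-search when prepending a to the suffix of
    the pattern matched so far; None means aP cannot occur."""
    return psi_ranges.get(a + matched_suffix[:k - 1])
```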

There are interesting cases in which we want to index highly repetitive texts over large alphabets, however. For example, consider indexing a minimizer digest of a pangenome – considering minimizers as meta-characters from a large alphabet instead of tuples of characters from a small alphabet [1, 2, 7, 27] – or two-level indexing such a text. For two-level indexing we build one index for the text and another for a parse of the text; the alphabet of the parse is the dictionary of distinct phrases, which is usually large, but the parse itself is usually much smaller than the text and its BWT is usually still run-length compressible (albeit less than the BWT of the text) when the text is highly repetitive.

Something like two-level indexing was proposed by Deng, Hon, Köppl and Sadakane [6] but they did not use an index for the text and its absence made their implementation quite slow for all but very long patterns. Hong, Oliva, Köppl, Bannai, Boucher and Gagie [12] described another approach, which we will review here, but they used standard FM-indexes for the text and the parse instead of RLFM-indexes, so their two-level index was noticeably faster but hundreds of times larger than its competitors.

Consider the 50 similar toy genomes of length 50 each in Figure 3. Suppose we parse their concatenation similarly to rsync, by inserting a phrase break whenever we see a trigger string – ACA, ACG, CGC, CGG, GAC, GAG, GAT, GTG, GTT, TCG or TCT – or when we reach the TA# at the end of the text. (Considering # = $ = 0, A = 1, C = 2, G = 3 and T = 4 and viewing triples as 3-digit numbers in base 5, the trigger strings are the triples in the concatenation whose values are congruent to 0 modulo 6.) If we replace each phrase in the parse by its lexicographic rank in the dictionary of distinct phrases, counting from 1, and terminate the sequence with a 0, we get the 563-number sequence shown in Figure 4. With a larger example, of course, we obtain longer phrases on average and better compression from the parsing.

CTTCCGCGGTGATAAAGGGGGCGGTAATGTCGCGAAACAGTCTTTTCTA$
CTTACGCGGTGATACAGGGGGCCGTAATTTCGCGGAACAGTCTTTTCTA$
CTTACGCGACGATCCAGGGGGCGGTAATTTCGCGGAACAGTCTTTTCTA$
CTTATGCGATGATCCTGGGGGCGGTAATTTCGCGGAACAGTCTTTTCTA$
CTTACGCGGTGATCCAGGGGGCGGTAATTTCGCGGAACACTCTTCTCTA$
CTTACGCGATGATCCAGTGGGCGGTCTTTTCGCGGAACAGTCTTTTCGA$
CTTACGCGGTGATCCAGGGGGCGGTAATTTCGCGCAACAGTCTTTTCTA$
CTTACGCGGTGATCCAGGGGGCGGTAATTTCTCGGAACAGTCTTTTCTA$
CTTATGCGGTGATCCACGGGGCGGAAGTATCGCGGAACAGTCTTTTTTA$
CTTACGCGATGATCCAGGGGGCGGTAACTTCGCGGAACAGTCTTTTCTA$
CTTACGCGACGATCCAGGGGGCAGTAATTTCGCGGAACAGTCTTTTCTA$
CATACGCGGGGATCCAAGGGGCGGTAATTTCGCGGAACAGTCTTTGACA$
CTTTCACGGTGATCCAGGGGTGGGTAATTTCGCGGAACAGTCTTTTCTA$
CTTACGCGGTGATCCAGGGGGCGGTAATTTCGCGGAACAATCTTTTCTA$
CTTACGCGATGATCCAGGGGGCGGTAATTTCGCGGAACAGTCTTTTCTA$
CTTACGCGGTGATCCAGGGGGCGCTAATTTCGCGAAACAGTCTTTTCTA$
CTTACGTGGTGATCCAGGGGGCGGTAATTTCGAGGAACAGTCTTTAATA$
CTTACGCGGTGATCCAGGGCGCGGTAATTTCGCGGAACAATCTTTTCTA$
CTTACGCGATGATCCAGGGGGCGGTAATTTCGCGGAACAGTCTAATCTA$
CTTACGCGGTGATCCAGGGGGCGGTAATTTCGCGGAACAGTCTTTTCTA$
CTTACGCGGTGATCCAGGGGGCGGTAATTTCGCGGAACAGTCTTTTCTA$
CTTACGCGGTGATCCAGGGGGCGGTAATTTCGCGGAACAATCTTTTCTA$
CTTACGCGATGATCCAGGGGGCGGTAATTGCGCGGAACAGTCTTTTCTA$
CTTACGCGGTGATCCAGGGGGCGGTAATTTCGGGGAACAGTCTTTTCTA$
CTTACGCGATCTTCCAGGGGGCCGAAATTTCGCGTAACAGTCTTTTCTA$
CTTACGCGGTGATTCAGGGGGCGGTAATTTCGCGGATCAGTCTTTTCTA$
CTTACGCGGTTATCCAGGGGGTGGTACTTTCGGTGAACAGTCTGTTCTA$
CTTACGCGGTGATCCAGGGGGCAGTAATTTCGCGGAACAGTCTTTTCTA$
CTTACGCGATGATCCAGGGGGCGGTAATTTCGCTGAACAGTCTTTTCTA$
CTTACGCGATGATCCATTGGGCGGTAATTCCGCGGAACAGTCTTTTCTA$
CTTACGCGATGATCCATGGGGCGGTAATTTCGCGGAACAGTCTTTTCTA$
CTTACGCGATGATCCAGGGGGCTGTATTTTCGCGGAACAGTCTTTTCTA$
CTTACGCGCTGATCCAGGGGGCGGTAATTTCGCGGAACAGTCTTTTCTA$
CTTACGAGATGAGCTAGGGGGCGGTAATTTCGCGGAACAGTCTTTTCTA$
CTTACGCGATGCTCCAGGGGGCGGTGATTTCGCGGAACAGTCTTTTCTA$
CTTTCGCGATGATCCAGGGGGCGGTCATTTCGCGGAACAGTCTTTTCTA$
CTTACGCGGTGATCCAGGGGGCGGTAATTTCGCGGCACAGTCCTTTCTA$
CTTACGCGGTGATCCAGGGGGCGGTAATTTCGCGGAACAGTCTTTTCTA$
CTTACGCGGTGATCCAGGGGGCGGTAATTTCGCGGAACAGTCTTTTCTA$
CTTACTCGGTGATCCAGGGGGCGGTAATTTCGCGGAACAGTCTTTTCTT$
CTTACGCGATGATCCAGAGGGCGGTAATTTCGCGGAACAGTCTTTTCTA$
CTTACGCGGTGATCCAGGGGGCGGTAATTCCGCGGAACAGTCTTTTCTA$
CTTACGCGGGGATCCAGGGGGCGGTAATTTCGCGGAACAATCTTTTCTA$
CTTACGCGGTGATCCAGGGGTCGGTAATTTCGCGGAACAGTCTTTTCTA$
CATACGCGGTGATCCAGGGGGCGGTAATTTCGCGGAACAGTCTTTTCTA$
CTTACGCGGTGATCCAGGGGGCGGTAATTTCGCGGAACAGTCTTTTCTA$
CCTACGCGATGATGCAGGGGGCGGTAATTTCGCGGAACAGTCTTTCCTA$
CTTACGCGATGTTCCAGGGGGCGGTAATTTCGCGGAATAGTTTTTTCTA$
CTTACGCGGTGATCCAGCGGGCGGTAATTTCGCGGAATAGTCTTTTCTA$
CTTACGCGGTAATCCAGGGGGCGGTAATGTCGCGGAACAGTCTTTTCTA#
Figure 3: A set of 50 similar toy genomes of length 50 each, with the first 49 separated by copies of $ and the last one terminated by #.
44, 55, 79, 19, 11, 70, 22, 46, 64, 88, 6, 22, 55, 79, 19, 17, 59, 22, 55, 12,
64, 88, 6, 22, 48, 45, 19, 32, 73, 22, 55, 12, 64, 88, 8, 50, 39, 73, 22, 55,
12, 64, 88, 6, 22, 55, 79, 19, 32, 73, 22, 55, 12, 43, 78, 41, 6, 622, 50, 50,
36, 58, 78, 87, 22, 55, 12, 64, 87, 6, 22, 55, 79, 19, 32, 73, 22, 51, 12, 64,
88, 6, 22, 55, 79, 19, 32, 74, 40, 45, 12, 64, 88, 9, 79, 19, 26, 45, 58, 13,
22, 55, 12, 64, 90, 22, 50, 50, 32, 68, 22, 55, 12, 64, 88, 6, 22, 48, 45, 19,
30, 22, 55, 12, 64, 88, 4, 22, 55, 57, 25, 73, 22, 55, 12, 64, 86, 2, 1, 45,
79, 19, 35, 60, 22, 55, 12, 64, 88, 6, 22, 55, 79, 19, 32, 73, 22, 55, 12, 21,
88, 6, 22, 50, 50, 32, 73, 22, 55, 12, 64, 88, 6, 22, 55, 79, 19, 31, 73, 22,
46, 64, 88, 6, 79, 65, 19, 32, 73, 18, 47, 64, 83, 22, 55, 79, 19, 29, 55, 73,
22, 55, 12, 21, 88, 6, 22, 50, 50, 32, 73, 22, 55, 12, 64, 16, 6, 22, 55, 79,
19, 32, 73, 22, 55, 12, 64, 88, 6, 22, 55, 79, 19, 32, 73, 22, 55, 12, 64, 88,
6, 22, 55, 79, 19, 32, 73, 22, 55, 12, 21, 88, 6, 22, 50, 50, 32, 72, 55, 12,
64, 88, 6, 22, 55, 79, 19, 32, 73, 45, 56, 64, 88, 6, 22, 50, 41, 77, 22, 61,
64, 88, 6, 22, 55, 79, 19, 75, 73, 22, 55, 19, 24, 88, 6, 22, 55, 82, 20, 62,
45, 79, 12, 64, 66, 41, 6, 22, 55, 79, 19, 30, 22, 55, 12, 64, 88, 6, 22, 50,
50, 32, 73, 22, 80, 64, 88, 6, 22, 50, 50, 38, 71, 55, 12, 64, 88, 6, 22, 50,
50, 37, 73, 22, 55, 12, 64, 88, 6, 22, 50, 50, 33, 22, 55, 12, 64, 88, 6, 22,
51, 81, 32, 73, 22, 55, 12, 64, 88, 6, 18, 19, 49, 42, 73, 22, 55, 12, 64, 88,
6, 22, 50, 54, 79, 19, 85, 22, 55, 12, 64, 88, 10, 22, 50, 50, 32, 76, 22, 55,
12, 64, 88, 6, 22, 55, 79, 19, 32, 73, 22, 55, 23, 63, 6, 22, 55, 79, 19, 32,
73, 22, 55, 12, 64, 88, 6, 22, 55, 79, 19, 32, 73, 22, 55, 12, 64, 88, 7, 45,
79, 19, 32, 73, 22, 55, 12, 64, 88, 67, 22, 50, 50, 27, 58, 73, 22, 55, 12, 64,
88, 6, 22, 55, 79, 19, 32, 71, 55, 12, 64, 88, 6, 22, 55, 57, 32, 73, 22, 55,
12, 21, 88, 6, 22, 55, 79, 19, 34, 45, 73, 22, 55, 12, 64, 88, 4, 22, 55, 79,
19, 32, 73, 22, 55, 12, 64, 88, 6, 22, 55, 79, 19, 32, 73, 22, 55, 12, 64, 88,
5, 22, 50, 50, 52, 73, 22, 55, 12, 64, 84, 22, 50, 66, 32, 73, 22, 55, 15, 89,
6, 22, 55, 79, 19, 28, 53, 73, 22, 55, 14, 88, 6, 22, 55, 69, 70, 22, 55, 12,
64, 88, 3, 0
Figure 4: The 563-number sequence (20 numbers per line) over the alphabet {0, …, 90} we get from the concatenation of the toy genomes in Figure 3 by parsing, replacing each phrase by its rank in the dictionary (counting from 1) and appending a 0.
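A Python sketch of this trigger-string parsing; we end a phrase at the end of each trigger-string occurrence, which is one plausible reading of the rsync-like scheme, so the exact boundary convention here is an assumption.

```python
TRIGGERS = {"ACA", "ACG", "CGC", "CGG", "GAC", "GAG", "GAT",
            "GTG", "GTT", "TCG", "TCT"}

def parse(text, w=3):
    """Split text into phrases, breaking after each trigger string and at
    the end of the text (assumed boundary convention)."""
    phrases, start = [], 0
    for i in range(len(text) - w + 1):
        if text[i:i + w] in TRIGGERS:
            phrases.append(text[start:i + w])
            start = i + w
    if start < len(text):
        phrases.append(text[start:])
    return phrases
```

Replacing each phrase by its lexicographic rank in the dictionary of distinct phrases (counting from 1) and appending a 0 should then yield a sequence like the one in Figure 4.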

Run-length compression naturally works better on the BWT of the concatenation of the genomes than on the BWT of the parse, as shown in Figures 5 and 6. Again, with a larger example we would achieve better compression, also from the run-length compressed BWT (RLBWT) of the parse. Even this small example, however, gives some intuition for how the dictionary of distinct phrases in the parse is usually large, but the parse is usually much smaller than the text and its BWT is usually still run-length compressible. In this case there are 90 distinct phrases in the dictionary, the parse is less than a quarter as long as the text, and the average run length in its RLBWT is slightly more than 2. An FM-index based on the RLBWT of the parse would generally use at least about ⌈lg 90⌉ = 7 rank queries on bitvectors for each backward step. The most common value in the runs, 19, occurs in only 16 runs, so we should spend at most about lg 16 = 4 steps in each binary search.

𝙰8 𝚃1 𝙰41 𝚃28 𝙶1 𝚃18 𝙲1 𝚃1 𝙶2 𝚃1 𝙶27 𝙰1 𝙶10 𝙲1 𝚃1 𝙰1 𝙶6 𝚃1 𝙲1
𝙰1 𝙶1 𝚃1 𝙶2 𝚃2 𝙲4 𝚃39 𝙰1 𝚃3 𝙶1 𝙰5 𝚃1 𝙲1 𝙰41 𝚃1 𝙶2 𝚃42 𝙲2 𝚃2 𝙲1
𝙰1 𝚃1 𝙲1 𝙶1 𝙲1 𝙶2 𝙲1 𝙶1 𝙰1 𝙲6 𝙰1 𝙲13 𝚃1 𝙲22 𝙰1 𝙲22 𝚃1 𝙲22 𝚃1 𝙰1
𝙶2 𝙲2 𝙰2 𝙶10 𝙰1 𝙶25 𝚃1 𝙶6 𝚃1 𝙰1 𝙶1 𝙰4 𝙶15 𝚃2 𝙶1 𝙲1 𝙰2 𝙶2 𝙰3 𝙲1
𝙰27 𝙲1 𝙰2 𝙶1 𝙰9 𝚃1 𝙰1 𝙶1 𝙲1 𝙰4 𝙶1 𝙲1 𝚃1 𝙰1 𝙲6 𝙰1 𝙲12 𝙶1 𝙲11 𝚃1
𝙲10 𝙶2 𝙰35 𝚃1 𝙰7 𝙲1 $2 𝙲2 𝚃45 𝙶1 𝚃3 𝙶1 𝚃1 $1 𝚃3 𝙶2 𝙲1 𝙶2 𝙰1 𝚃1
𝙰2 𝙶17 𝚃2 𝙰9 𝚃1 𝙰7 𝚃1 𝙰1 𝚃18 𝙲1 𝚃2 𝙲1 𝚃9 𝙶1 𝚃9 𝙰3 𝙶1 𝙲1 𝙰22 𝚃1
𝙶1 𝚃1 𝙶35 𝚃1 𝙶9 𝚃1 𝙶2 𝙰1 𝙶17 𝚃1 𝙶23 𝚃1 𝙶18 𝚃1 𝙶4 𝙰1 𝙶4 𝙲1 𝙰1 𝚃17
𝙲1 𝚃28 𝙶1 𝙲1 𝙶2 𝚃1 𝙰1 𝚃1 𝙰1 𝙶2 𝙲1 𝙶1 𝚃2 $44 𝚃1 #1 𝙰1 𝚃2 $1 𝚃1
$1 𝙰1 𝙲1 𝚃44 𝙲4 𝙶6 𝚃1 𝙶15 𝚃1 𝙶22 𝚃1 𝙲3 𝚃1 𝙲1 𝙰1 𝚃2 𝙶2 𝚃4 𝙲1 𝚃9
𝙶1 𝚃10 𝙲1 𝚃13 𝙲1 𝙰1 𝙲13 𝚃1 𝙲2 𝚃2 𝙲1 𝙶1 𝚃1 𝙶4 𝙲16 𝚃1 𝙲4 𝚃1 𝙶2 𝙲38
𝙶1 𝙲4 𝙰1 𝙲2 𝙶1 𝙲1 𝙶8 𝙲1 𝙶29 𝙲2 𝚃1 𝙲20 𝙶1 𝙲3 𝙰1 𝚃1 𝙲2 𝙶1 𝙲6 𝙰1
𝙲10 𝙶1 𝙲26 𝙶2 𝙲1 𝙶54 𝙰1 𝙶5 𝚃1 𝙶16 𝙰1 𝙶8 𝙲1 𝙶7 𝚃1 𝙶2 𝙲3 𝙶5 𝙲1 𝙶13
𝙰1 𝙶2 𝚃1 𝙶19 𝙰23 𝚃1 𝙰18 𝙶1 𝚃1 𝙶3 𝙲25 𝙶1 𝙲14 𝚃1 𝙲1 𝙶1 𝙲10 𝚃1 𝙲18 𝙶2
𝙲2 𝙶17 𝙰1 𝙶3 𝙲1 𝙰1 𝙶21 𝙰1 𝚃1 𝙶1 𝙰1 𝚃2 𝙶1 𝙰6 𝙶1 𝙰37 𝙶28 𝙰1 𝙶2 𝙲1
𝙶1 𝚃2 𝙰1 𝚃1 𝙲8 𝚃1 𝙲15 𝙰1 𝙲22 𝙰1 𝙶2 𝚃1 𝙶1 𝙲1 𝙶6 𝙲1 𝙶35 𝙰1 𝚃14 𝙲1
𝚃3 𝙰1 𝚃14 𝙰1 𝚃11 𝙶1 𝙲1 𝙰2 𝚃1 𝙶1 𝚃2 𝙶1 𝚃2 𝙰1 𝙶1 𝙰7 𝚃1 𝙰22 𝚃1 𝙰5
𝙲1 𝙰7 𝚃4 𝙰1 𝙶1 𝚃2 𝙶1 𝚃12 𝙶1 𝚃23 𝙰1 𝚃6 𝙲1 𝚃1 𝙶1 𝚃1 𝙲1 𝚃13 𝙲1 𝚃16
𝙰1 𝚃13 𝙶1 𝚃2 𝙶1 𝚃1 𝙰1 𝙲1 𝙶13 𝙰3 𝙶20 𝙰1 𝙶10 𝙲1 𝚃1 𝙰1 𝙶3 𝙰1 𝙶4 𝙰1
𝙶1 𝙰1 𝙶5 𝙰1 𝙶1 𝙲1 𝙰1 𝙶7 𝙰1 𝙶2 𝙰2 𝙶2 𝙰5 𝙶2 𝙰2 𝚃1 𝙰2 𝚃1 𝙶1 𝙰1
𝙲1 𝙶3 𝙲1 𝙰3 𝙲2 𝚃2 𝙲42 𝙶1 𝙲2 𝚃1 𝙰1 𝙲1 𝙶1 𝙰2 𝙲1 𝚃19 𝙲1 𝚃47 𝙶1 𝚃21
𝙲1 𝚃2 𝙰2 𝚃1 𝙲3 𝚃1 𝙰2 𝙲1 𝙰9 𝚃1 𝙰8 𝚃1 𝙰20 𝙲1 𝚃28 𝙲1 𝚃12 𝙰1 𝚃1 𝙲1
𝚃1 𝙲2 𝙰1 𝙲20 𝚃1 𝙲20 𝚃2 𝙲1 𝙶1
Figure 5: The RLBWT of the concatenation of the toy genomes shown in Figure 3, consisting of 449 runs (20 runs per line).
3, 2, 86, 8812, 41, 886, 89, 41, 88, 87, 884, 16, 63, 8811, 19,
555, 79, 5525, 51, 55, 45, 552, 58, 55, 64, 19, 6, 73, 79, 55,
792, 45, 792, 65, 799, 45, 795, 18, 79, 82, 123, 70, 73, 62, 67,
90, 63, 10, 63, 5, 6, 84, 73, 6, 737, 87, 70, 30, 732, 68,
59, 30, 73, 33, 732, 60, 733, 76, 73, 85, 73, 13, 733, 4, 63,
83, 68, 4, 68, 77, 73, 55, 19, 57, 19, 50, 194, 50, 19, 50,
193, 57, 19, 50, 19, 81, 50, 196, 66, 19, 50, 19, 50, 19, 503,
74, 78, 66, 50, 49, 12, 0, 40, 48, 73, 26, 34, 62, 7, 1,
22, 18, 22, 19, 5010, 8, 2212, 50, 223, 50, 28, 50, 2217, 71, 22,
71, 228, 72, 2211, 29, 44, 2221, 45, 55, 45, 27, 36, 17, 35, 22,
20, 23, 12, 47, 129, 56, 122, 80, 122, 46, 1210, 61, 46, 125, 79,
50, 64, 88, 32, 55, 11, 69, 38, 322, 31, 32, 55, 323, 52, 25,
45, 32, 37, 42, 32, 58, 32, 39, 325, 53, 32, 75, 323, 19, 32,
41, 43, 58, 45, 55, 9, 5514, 45, 553, 45, 55, 54, 6, 22, 51,
55, 64, 19, 64, 78, 647, 212, 646, 14, 6411, 21, 64, 24, 645, 15,
64
Figure 6: The RLBWT of the sequence shown in Figure 4, consisting of 226 runs (15 runs per line).

To search for a pattern, we start by backward stepping in the index for the text until we reach the left end of the rightmost trigger string in the pattern. We keep a count of how often each trigger string occurs in the text and an n-bit sparse bitvector with 1s marking the lexicographic ranks of the lexicographically least suffixes starting with each trigger string. This way, when we reach the left end of the rightmost trigger string, with a rank query on that bitvector we can compute the lexicographic ranks of the suffixes starting with the suffix of the pattern we have processed so far among all the suffixes starting with trigger strings, and map from the index for the text into the index for the parse. The width of the BWT interval stays the same and may span several lexicographically consecutive phrases in the dictionary – all those starting with the suffix of the pattern we have processed so far – but it is possible to start a backward search in the index for the parse with a lexicographic range of phrases rather than with a single phrase.

When we reach the left end of the leftmost trigger string in the pattern, we can use the same bitvector to map back into the index for the text and match the remaining prefix of the pattern with that. While matching the pattern phrase by phrase against the index for the parse, we can either compare against phrases in the stored dictionary or just use Karp-Rabin hashes (allowing some probability of false-positive matches). We still have to parse the pattern, but that requires a single sequential pass, while FM-indexes in particular are known for poor memory locality. The key idea is that, ideally, we match most of the pattern phrase by phrase instead of character by character, reducing the number of cache misses.

We plan to reimplement two-level indexes for collections of similar genomes, with RLFM-indexes for the collections themselves and with CSAs, standard RLCSAs and our sped-up RLCSAs from Theorem 7 for the parses of those collections, and compare them experimentally. We also plan to try indexing minimizer digests with CSAs and RLCSAs.

5 Boyer-Moore-Li with two-level indexing

Olbrich, Büchler and Ohlebusch [22] recently showed how working with rsync-like parses of genomes instead of the genomes themselves can speed up multiple alignment. More specifically, they find, and use as anchors, maximal substrings (called multi-MUMs) of the parses that occur exactly once in each parse. In this section we speculate about how two-level indexing may similarly speed up searches for long maximal exact matches (MEMs). A MEM of a pattern P[0..m−1] with respect to a text T is a substring P[i..j] of P such that

  • P[i..j] occurs in T,

  • i = 0 or P[i−1..j] does not occur in T,

  • j = m−1 or P[i..j+1] does not occur in T.

Finding long MEMs is an important task in bioinformatics and there are many tools for it.

Li [14] gave a practical algorithm, called forward-backward, for finding all the MEMs of P with respect to T using FM- or RLFM-indexes for T and its reverse T^rev. Assume all the distinct characters in P occur in T; otherwise, we split P into maximal substrings consisting only of copies of characters occurring in T and find the MEMs of those with respect to T. We first use the index for T^rev to find the longest prefix P[0..e_1] of P that occurs in T, which is the leftmost MEM. If e_1 = m−1 then we are done; otherwise, P[e_1 + 1] is in the next MEM, so we use the index for T to find the longest suffix P[s_2..e_1 + 1] of P[0..e_1 + 1] that occurs in T. The next MEM starts at s_2, so conceptually we recurse on P[s_2..m−1]. The total number of backward steps in the two indexes is proportional to the total length of all the MEMs.
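A Python sketch of forward-backward, with naive `in` tests standing in for backward steps in the two indexes (tests that extend the match to the right correspond to the index for T^rev, tests that shrink it from the left to the index for T); it reports MEMs as inclusive (start, end) pairs.

```python
def forward_backward(P, T):
    """All MEMs of P with respect to T, as pairs (i, j) with P[i..j]
    maximal; the substring tests stand in for FM-index backward steps."""
    m, out, s = len(P), [], 0
    while s < m:
        e = s
        while e < m and P[s:e + 1] in T:   # extend right (index for T^rev)
            e += 1
        if e == s:                         # P[s] does not occur in T at all
            s += 1
            continue
        out.append((s, e - 1))             # P[s..e-1] occurs, P[s..e] does not
        if e == m:
            break
        s += 1
        while P[s:e + 1] not in T:         # shrink from the left (index for T)
            s += 1                         # the next MEM starts at this s
    return out
```

(For simplicity this sketch re-extends each MEM from scratch, so it is quadratic in the MEM lengths; the index-based version pays one backward step per character instead.)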

Gagie [9] proposed a heuristic for speeding up forward-backward when we are interested only in MEMs of length at least L. We call this heuristic Boyer-Moore-Li, following a suggestion from Finlay Maguire [16]. Since any MEM of length at least L starting in P[0..L−1] includes P[L−1], we first use the index for T to find the longest suffix P[s..L−1] of P[0..L−1] that occurs in T. If s = 0 then we fall back on forward-backward to find the leftmost MEM and the starting position of the next MEM. Otherwise, since we know there are no MEMs of length at least L starting in P[0..s−1], conceptually we recurse on P[s..m−1]. Li [15] tested Boyer-Moore-Li and found it practical enough that he incorporated it into his tool ropebwt3.
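A matching Python sketch of the heuristic, again with naive substring tests; the windowing details are our reading of the description above.

```python
def boyer_moore_li(P, T, L):
    """MEMs of P with respect to T of length at least L, as inclusive
    (start, end) pairs."""
    out, i, m = [], 0, len(P)
    while i + L <= m:
        s = i                     # longest suffix of window P[i..i+L-1] in T
        while s < i + L and P[s:i + L] not in T:
            s += 1
        if s > i:                 # no MEM of length >= L starts in P[i..s-1]
            i = s
            continue
        e = i + L                 # whole window occurs: fall back on
        while e < m and P[i:e + 1] in T:      # forward-backward to find
            e += 1                            # the MEM containing it
        out.append((i, e - 1))
        i += 1
        while P[i:e + 1] not in T:            # start of the next MEM
            i += 1
    return out
```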

Suppose we build an rsync-like parse of T[0..n−1] and two-level indexes for T and T^rev based on that parse, and parse P when we get it. With a naïve two-level version of Boyer-Moore-Li, we would simply use the two-level indexes in place of the normal FM- or RLFM-indexes for T and T^rev. We conjecture, however, that we can do better in practice.

Let P[k] be the last character of the last phrase that ends strictly before P[L], let P[j] be the first character of the first phrase such that P[j..k] occurs in T, and let P[i] be the second character of the phrase preceding the one containing P[j]. Notice we can find i, j and k by matching phrase by phrase using only the top level (for the parse) of the two-level index for T. If i > 0 then we can immediately discard P[0..i−1] and conceptually recurse on P[i..m−1]; otherwise, we proceed normally.

Of course, the value i is at most the value s found by regular Boyer-Moore-Li and could be much smaller, in which case discarding P[0..i−1] benefits us much less than discarding P[0..s−1]. We hope this is usually not the case and we look forward to testing Boyer-Moore-Li with two-level indexing.

References

  • [1] Omar Y Ahmed, Massimiliano Rossi, Travis Gagie, Christina Boucher, and Ben Langmead. SPUMONI 2: improved classification using a pangenome index of minimizer digests. Genome Biology, 24:122, 2023.
  • [2] Lorraine AK Ayad, Gabriele Fici, Ragnar Groot Koerkamp, Grigorios Loukides, Rob Patro, Giulio Ermanno Pibiri, and Solon P Pissis. U-index: A universal indexing framework for matching long patterns. arXiv, 2025. arXiv:2502.14488.
  • [3] Paul Beame and Faith E Fich. Optimal bounds for the predecessor problem and related problems. Journal of Computer and System Sciences, 65:38–72, 2002. doi:10.1006/JCSS.2002.1822.
  • [4] Nathaniel K Brown, Travis Gagie, and Massimiliano Rossi. RLBWT tricks. In Proceedings of the 20th International Symposium on Experimental Algorithms (SEA), 2022.
  • [5] Francisco Claude, Patrick K Nicholson, and Diego Seco. On the compression of search trees. Information Processing & Management, 50:272–283, 2014. doi:10.1016/J.IPM.2013.11.002.
  • [6] Jin-Jie Deng, Wing-Kai Hon, Dominik Köppl, and Kunihiko Sadakane. FM-indexing grammars induced by suffix sorting for long patterns. In Proceedings of the Data Compression Conference (DCC), 2022.
  • [7] Barış Ekim, Bonnie Berger, and Rayan Chikhi. Minimizer-space de Bruijn graphs: Whole-genome assembly of long reads in minutes on a personal computer. Cell Systems, 12:958–968, 2021.
  • [8] Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM, 52:552–581, 2005. doi:10.1145/1082036.1082039.
  • [9] Travis Gagie. How to find long maximal exact matches and ignore short ones. In Proceedings of the 28th Conference on Developments in Language Theory (DLT), 2024.
  • [10] Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. Journal of the ACM, 67:1–54, 2020. doi:10.1145/3375890.
  • [11] Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing, 35:378–407, 2005. doi:10.1137/S0097539702402354.
  • [12] Aaron Hong, Marco Oliva, Dominik Köppl, Hideo Bannai, Christina Boucher, and Travis Gagie. PFP-FM: an accelerated FM-index. Algorithms for Molecular Biology, 19:15, 2024. doi:10.1186/S13015-024-00260-8.
  • [13] Juha Kärkkäinen, Giovanni Manzini, and Simon J Puglisi. Permuted longest-common-prefix array. In Proceedings of the 20th Symposium on Combinatorial Pattern Matching (CPM), 2009.
  • [14] Heng Li. Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly. Bioinformatics, 28:1838–1844, 2012. doi:10.1093/BIOINFORMATICS/BTS280.
  • [15] Heng Li. BWT construction and search at the terabase scale. Bioinformatics, 40:btae717, 2024.
  • [16] Finlay Maguire. Personal communication, 2024.
  • [17] Veli Mäkinen and Gonzalo Navarro. Succinct suffix arrays based on run-length encoding. Nordic Journal of Computing, 12:40–66, 2005.
  • [18] Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki. Storage and retrieval of highly repetitive sequence collections. Journal of Computational Biology, 17:281–308, 2010. doi:10.1089/CMB.2009.0169.
  • [19] Alistair Moffat and Lang Stuiver. Binary interpolative coding for effective index compression. Information Retrieval, 3:25–47, 2000. doi:10.1023/A:1013002601898.
  • [20] Takaaki Nishimoto and Yasuo Tabei. Optimal-time queries on BWT-runs compressed indexes. In Proceedings of the 48th International Colloquium on Automata, Languages, and Programming (ICALP), 2021.
  • [21] Daisuke Okanohara and Kunihiko Sadakane. Practical entropy-compressed rank/select dictionary. In Proceedings of the 9th Workshop on Algorithm Engineering and Experiments (ALENEX), 2007.
  • [22] Jannik Olbrich, Thomas Büchler, and Enno Ohlebusch. Generating multiple alignments on a pangenomic scale. Bioinformatics, 41(3):btaf104, 2025.
  • [23] Alberto Ordóñez, Gonzalo Navarro, and Nieves R Brisaboa. Grammar compressed sequences with rank/select support. Journal of Discrete Algorithms, 43:54–71, 2017. doi:10.1016/J.JDA.2016.10.001.
  • [24] Jouni Sirén. Compressed Full-Text Indexes for Highly Repetitive Collections. PhD thesis, University of Helsinki, 2012.
  • [25] Jukka Teuhola. Interpolative coding of integer sequences supporting log-time random access. Information Processing & Management, 47:742–761, 2011. doi:10.1016/J.IPM.2010.11.006.
  • [26] Mohsen Zakeri, Nathaniel K Brown, Omar Y Ahmed, Travis Gagie, and Ben Langmead. Movi: a fast and cache-efficient full-text pangenome index. iScience, 27, 2024.
  • [27] Alan Zheng, Ishmeal Lee, Vikram S. Shivakumar, Omar Y. Ahmed, and Ben Langmead. Fast and flexible minimizer digestion with digest. bioRxiv, 2025. URL: https://www.biorxiv.org/content/early/2025/01/08/2025.01.02.631161.