Abstract 1 Introduction 2 Background 3 Methods 4 Discussion References Appendix A Proofs Appendix B Algorithm Pseudocodes

An Efficient Data Structure and Algorithm for Long-Match Query in Run-Length Compressed BWT

Ahsan Sanaullah Department of Computer Science, University of Central Florida, Orlando, FL, USA Degui Zhi ORCID McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, TX, USA Shaojie Zhang ORCID Department of Computer Science, University of Central Florida, Orlando, FL, USA
Abstract

String matching problems in bioinformatics are typically for finding exact substring matches between a query and a reference text. Previous formulations often focus on maximum exact matches (MEMs). However, multiple occurrences of substrings of the query in the text that are long enough but not maximal may not be captured by MEMs. Such long matches can be informative, especially when the text is a collection of similar sequences such as genomes. In this paper, we describe a new type of match between a pattern and a text that aren’t necessarily maximal in the query, but still contain useful matching information: locally maximal exact matches (LEMs). There are usually a large amount of LEMs, so we only consider those above some length threshold . These are referred to as long LEMs. The purpose of long LEMs is to capture substring matches between a query and a text that are not necessarily maximal in the pattern but still long enough to be important. Therefore efficient long LEMs finding algorithms are desired for these datasets. However, these datasets are too large to query on traditional string indexes. Fortunately, these datasets are very repetitive. Recently, compressed string indexes that take advantage of the redundancy in the data but retain efficient querying capability have been proposed as a solution. We therefore give an efficient algorithm for computing all the long LEMs of a query and a text in a BWT runs compressed string index. We describe an O(m+occ) expected time algorithm that relies on an O(r) words space string index for outputting all long LEMs of a pattern with respect to a text given the matching statistics of the pattern with respect to the text. Here m is the length of the query, occ is the number of long LEMs outputted, and r is the number of runs in the BWT of the text. The O(r) space string index we describe relies on an adaptation of the move data structure by Nishimoto and Tabei. We are able to support LCP[i] queries in constant time given SA[i]. In other words, we answer PLCP[i] queries in constant time. These PLCP queries enable the efficient long LEM query. Long LEMs may provide useful similarity information between a pattern and a text that MEMs may ignore. This information is particularly useful in pangenome and biobank scale haplotype panel contexts.

Keywords and phrases:
BWT, LEM, Long LEM, MEM, Run Length Compressed BWT, Move Data Structure, Pangenome
Copyright and License:
[Uncaptioned image] © Ahsan Sanaullah, Degui Zhi, and Shaojie Zhang; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation Pattern matching
; Theory of computation Data compression
Related Version:
Preprint: https://arxiv.org/abs/2505.15698 [41]
Funding:
This work was supported by the National Institutes of Health grant R01 HG010086.
Editors:
Broňa Brejová and Rob Patro

1 Introduction

Bioinformatics sequence data is often large and very repetitive. Furthermore, efficient matching queries on the data are frequently needed for many biological analyses. Therefore, bioinformatics problems have incentivized and profited from the development of efficient string indexes. The Burrows-Wheeler transform (BWT) has thus been used in bioinformatics algorithms. The BWT is a permutation of a text that has found wide use in string indexing and data compression [11]. Position i in the BWT of the text is essentially the character preceding the i-th lexicographically smallest suffix of the text. Due to this lexicographic sorting, adjacent characters in the BWT correspond to the characters preceding highly locally similar suffixes of the text. Therefore, the BWT of highly repetitive texts tends to have large runs of one character, with an overall small number of runs. The BWT of highly repetitive texts therefore compresses well. In fact, the number of runs in BWT, r, is sometimes used as a measure of the repetitiveness of a string [35]. Finally, given only the BWT of a text, the text can be reconstructed in linear time [11] and the BWT of a text can be constructed in linear time by construction of the suffix array [22]. The BWT ordering also allows efficient string indexes. In other words, given a pattern, find all occurrences of the pattern within the text. String indexes have been shown that output all occurrences of a pattern (a locate query) in space linear to the product of the length of the text and the size of the alphabet and time linear to the sum of the length of the pattern and the number of occurrences [18].

Compressed string indexes have also been shown [18, 3, 32]. These indexes output all occurrences of a pattern in space sublinear to the size of the text. Although the time complexity of locating these occurrences is not linear in the length of the pattern and the number of occurrences, they are typically independent of the length of the text barring logarithmic factors and close to linear in the length of the pattern and number of occurrences. In particular for highly repetitive texts, the space of the index can be much smaller than the space of the text. Notably, recent compressed string indexes have achieved space linear to the number of runs in the BWT (r[19, 36]. The r-index by Gagie et al. was the first compressed string index offering close to linear time locate queries in O(r) space [19]. Nishimoto and Tabei recently improved on this result with their OptBWTR, which achieves linear time locate queries for texts with alphabets of size polylogarithmic in the length of the text. OptBWTR relies on the move data structure, which was introduced in the same paper [36].

Compressed string indexes have been fruitfully applied to the growing collection of bioinformatics data. Over the past two decades, large collections of genomics data have grown increasingly larger in size. For example, the UK Biobank has whole genome sequencing data of roughly one million haplotypes [29], and the All of Us program has released whole genome sequencing data of half a million haplotypes [7]. Furthermore, recent arguments have been made that a human reference pangenome should be used instead of a singular human reference genome to avoid reference bias in downstream analyses [33, 46, 43]. The Human Pangenome Reference Consortium has released a draft human pangenome reference of more than two hundred high quality phased diploid assemblies and is planning to release over three hundred and fifty in the final release [30, 12]. The UK Biobank whole genome sequencing data has 1.5 billion variants, the All of Us whole genome sequencing data has 1 billion variants, and the typical diploid assembly in the draft human pangenome has 6 billion bases. Therefore, these datasets have 1,500 trillion, 250 trillion, and 1.2 trillion characters each respectively. However, while very large, these datasets are very repetitive. Furthermore, queries on these datasets are frequently needed for biological applications including read mapping[27, 24], read alignment [28], read classification for metagenomes [45, 1, 15] or pangenomes [10]. Many general purpose compressed string indexes have also been implemented for exact pattern matching and matching substrings of the pattern [38, 49, 16, 26]. These indexes may compute the maximal exact matches (MEMs) of the pattern with respect to the text. MEMs are matches between the pattern and the text that cannot be extended in the pattern.

While MEMs typically refer to matches that are maximal in the pattern, matches that are simultaneously maximal in the pattern and the text may sometimes be desired. Notably, in two data structures related to the BWT, algorithms for outputting matches that are simultaneously maximal in the pattern and the text have already been developed. These data structures are the positional Burrows-Wheeler transform (PBWT) and the graph Burrows-Wheeler transform (GBWT) [17, 44]. In the PBWT and GBWT, matches that cannot be extended in the pattern are referred to as set maximal matches, and matches that cannot be simultaneously extended in the pattern are referred to as locally maximal matches. Locally maximal matches that are longer than some length threshold are referred to as -long matches, or long matches for short. Algorithms for outputting set maximal matches and long matches have been published in the PBWT in uncompressed [17, 34, 40] and compressed space [13, 42, 48, 8]. Algorithms for outputting these matches have also been published for the GBWT in compressed space [39].

In this paper, we use these concepts in the traditional pattern and text context, and name matches that cannot be simultaneously extended in the pattern and the text locally maximal exact matches (LEMs). LEMs that are longer than are long LEMs. The distinction between matches that do not extend in the pattern and matches that do not extend simultaneously in the pattern and the text has been made before. Notably, in ropebwt3 MEMs refers to LEMs of our paper and super maximal exact matches (SMEMs) refers to MEMs of our paper [26]. The term SMEM has been used in place of MEM in a few papers to avoid the confusion in terminology [8, 15, 26, 13], however MEM is still the most common term by far for matches that cannot be extended in the pattern. The authors are not aware of any published algorithms for the computation of LEMs or long LEMs.

In this work, we describe an algorithm for outputting all long LEMs of a pattern with respect to a text in O(m+occ) expected time given the matching statistics of the pattern with respect to the text, where m is the length of the pattern and occ is the number of long LEMs it has with respect to the text. In order to do so, we modify the OptBWTR data structure of Nishimoto and Tabei to also compute LCP[i] given SA[i] (i.e. compute PLCP). We name this modified OptBWTR, OptBWTRL (i.e. OptBWTR for long LEMs or OptBWTR with LCP). OptBWTRL maintains the O(r) words space complexity of OptBWTR and computes ϕ[i] and PLCP[i] in constant time. The long LEM finding algorithm also requires as input an OptBWTRL of the text. We also discuss possible future work related to this paper, including avenues for improving the results, utilization of constant time PLCP computation to speed up matching statistics computation, and biological applications of long LEMs. Long LEMs may have many biological applications, from identity by descent segment detection and local ancestry inferences, to seeds or anchors for approximate matching algorithms for genome to genome alignment, genome to pangenome, read to genome or other alignments. In this paper, our main contributions are the following:

  • OptBWTRL: OptBWTRL is an O(r) words space data structure that maintains the capabilities of OptBWTR and adds the ability to compute ϕ,PLCP, and long LEMs efficiently. r is the number of runs in the BWT of the text.

    • PLCP: OptBWTRL enables constant time PLCP[i] computation in O(r) space. Note that PLCP[SA[j]]=LCP[j], therefore PLCP computation in constant time allows LCP[j] computation in constant time given SA[j].

    • Long LEM Query: We describe an O(m+occ) expected time long LEM query for pattern P and text T given the matching statistics of P with respect to T. The underlying index (OptBWTRL) uses O(r) space. m is the length of P and occ is the number of long LEMs P has with respect to T. A deterministic time bound for a similar algorithm we show is O(m+occlogoccloglogocc).

  • Long LEM Query with random access to the text: Given O(tRA) time random access to the text and a BWT related index, algorithms for computing matching statistics efficiently are known. Therefore, our long LEM query algorithm results in the following.

    • In Uncompressed Space: An algorithm for long LEM query in O(m+occ) expected time in uncompressed string indexes such as the FM Index (Corollary 3.2, variant with O(nσ) space, where n is the length of the text and σ is the size of the alphabet) [18].

    • In Compressed Space: An algorithm for long LEM query in O(mlognδ+occ) expected time in O(r+δlognδ) space given a block tree [23, 4] (with random access to the text in O(lognδ) time in O(nlognδ) space) and an OptBWTRL of the text.

2 Background

In this section, we review definitions used throughout the rest of the paper. We begin with strings, then in Section 2.1, we review BWT related concepts. In Section 2.2, we give a short overview of the results of Nishimoto and Tabei in [36]. Matching statistics are reviewed in Section 2.3. Finally we review maximal exact matches (MEMs) and define locally maximal exact matches (LEMs) in Section 2.4.

Let Σ={1,2,3,,σ} be an ordered alphabet of size σ. The size (number of characters it contains) of a string T is represented by |T|. T refers to a text of length n (|T|=n) where the last character is $. The character $ is lexicographically smaller than all other characters in T and occurs only in the last position of T. The i-th character of T is T[i], i[1,n]. T[i,j] refers to the substring of T that starts at position i and ends at position j, inclusive (T[i,j]=T[i]T[i+1]T[i+1]T[j]). Prefix i of T is the string T[1,i], suffix i of T is T[i,n]. The longest common prefix of two strings T and T is referred to by lcp(T,T). |lcp(T,T)| is the largest value i s.t. imin(|T|,|T|) and T[1,i]=T[1,i] (then, lcp(T,T)=T[1,i]=T[1,i]). A string T being lexicographically smaller than T is represented by TT. If T=T, TT and TT. If TT, TT iff T=lcp(T,T) or T[|lcp(T,T)|+1]<T[|lcp(T,T)|+1].

2.1 Burrows-Wheeler Transform

The Suffix Array (SA) of a text T is an array of length n=|T| where the i-th position stores the index of the i-th lexicographically smallest suffix of T. Therefore, T[SA[1],n]T[SA[2],n]T[SA[3],n]T[SA[n],n]. The Burrows-Wheeler Transform (BWT) of a text T is a string of length n where the i-th character in the string is the SA[i]1-th character of T (the n-th character if SA[i]=1). The LF array is an array of length n that stores the position of the previous suffix in the suffix array, LF[i]=j s.t. SA[j]=SA[i]1 for all SA[j][1,n1], LF[i]=j s.t. SA[j]=n for SA[i]=1. The ϕ array stores at position i, the suffix above suffix i in the suffix array, i.e. if SA[k]=i, ϕ[i]=SA[k1] (ϕ[i]=SA[n] if i=SA[1]). The ϕ1 array stores at position i, the suffix below suffix i in the suffix array, i.e. if SA[k]=i, ϕ1[i]=SA[k+1] (ϕ1[i]=SA[1] if i=SA[n]). Therefore, ϕ[ϕ1[i]]=i and ϕ1[ϕ[i]]=i. The LCP array is an array of length n where LCP[i] stores the length of the longest common prefix of suffix SA[i] and SA[i1]. LCP[1]=0 and for i[2,n], LCP[i]=|lcp(T[SA[i],n],T[SA[i1],n])|. The PLCP (permuted LCP) array is an array of length n where the LCP array is stored by suffix index. Therefore, if SA[i]=j, PLCP[j]=LCP[i]=|lcp(T[j,n],T[ϕ[j],n])|. Finally, the inverse suffix array, ISA, is an array of length n that stores at position i the position of suffix i in the suffix array, if ISA[i]=j, SA[j]=i. ISA[SA[i]]=i and SA[ISA[i]]=i. The SA,LF,ϕ,ϕ1, and ISA arrays are permutations of the integers in [1,n]. SA and ISA are inverses of each other and ϕ and ϕ1 are inverses of each other.

The run-length Burrows-Wheeler Transform (RLBWT) is the run-length encoding of the BWT of a text. Call L the BWT of text T. Then, L is partitioned into r nonempty substrings L1,L2,,Lr. Li is a substring of L corresponding to the i-th run of L. A run is a maximal repetition of the same character in L. Therefore, Li[1]=Li[2]==Li[|Li|] for all i[1,r] and Li[1]Li+1[1] for all i[1,r1]. li is the starting position of the run Li in L. The RLBWT is represented as r pairs (Li[1],li) for i[1,r]. All of these structures can be seen in Figure 1 for a text T=missisismississippi$.

Figure 1: BWT and related structures for T=missisismississippi$. SA,LCP,LF,F, and L are ordered by position in SA while ISA,PLCP,ϕ, and ϕ1 are ordered by position in the text.

2.2 Move Data Structure

The move data structure is a data structure for representing a permutation of a contiguous range of integers efficiently. It was introduced by Nishimoto and Tabei [36]. In the original introduction, the structure was described for a permutation of [1,n]. This of course may be extended to any bijective function from a contiguous range of integers to another contiguous range of integers. The move data structure takes space proportional to the number of intervals conserved in the function. An interval is conserved in a bijective function from a contiguous range of integers to another contiguous range of integers if for any i,j in the interval, f(i)f(j)=ij (therefore, f(i)=f(j)+ij and f(i)i=f(j)j). The move data structure computes the represented function in constant time. The important arrays, LF and ϕ1, are permutations of [1,n] with O(r) conserved intervals, where r is the number of runs in the BWT. Therefore, Nishimoto and Tabei define the OptBWTR data structures using move data structures. OptBWTR supports efficient count and locate queries in BWT-runs compressed space. Below, we more formally review some of the results from their paper [36].

2.2.1 Disjoint Interval Sequence

I=(p1,q1),(p2,q2),,(pk,qk) is a sequence of k pairs of integers. Let pk+1=n+1. Then i is a disjoint interval sequence iff there exists a permutation π of [1,k] s.t. (i) p1=1<p2<<pkn, (ii) qπ[1]=1, and (iii) qπ[i]=qπ[i1]+(pπ[i1]+1pπ[i1]). [pi,pi+11] is referred to as the i-th input interval, and [qi,qi+(pi+1pi)1] as the i-th output interval. The input intervals don’t overlap, and their union is [1,n]. The output intervals don’t overlap and their union is [1,n].

A move query on a disjoint interval sequence I takes as input (i,x), where i is an index in [1,n] and x is the index of the input interval sequence that contains it, i[1,n] and pxi<px+1 and x[1,k]. The move query outputs (i,x) where i=qx+(ipx) and pxi<px+1, i.e. i is the mapping of position i from the input to output intervals by I and x is the index of the input interval that contains i. f, a permutation of [1,n] with k conserved intervals, can be represented by a disjoint interval sequence where the input intervals are the conserved intervals and the output intervals are the mapping of the input intervals by f. Then, a move query of (i,x) returning (i,x) computes f by f(i)=i.

Nishimoto and Tabei show that move queries on a disjoint interval sequence of k input intervals (and therefore k output intervals) can be computed in constant time and O(k) space with the move data structure. The move data structure is built by splitting the k input intervals of I into at most 2k intervals. This results in a disjoint interval sequence of at most 2k input intervals (and an equivalent number of output intervals) that represents the same permutation as the original disjoint interval sequence. The split interval sequence of i that the move data structure is built on is referred to as a balanced interval sequence. The notation for a balanced interval sequence of I is B(I), and the notation for a move data structure of I is F(I). In this paper, we occasionally use input interval of F(I) as shorthand for input interval of B(I) (for example, i-th input interval of a move data structure refers to the i-th input interval of the balanced interval sequence it was built on). Brown et al. extend the balanced interval sequence result of Nishimoto and Tabei to splitting I’s k intervals into at most k+kd1 intervals, resulting in a move data structure with O(d) time move query computation for any d2 [9].

2.2.2 OptBWTR

The arrays LF and ϕ1 are permutations of [1,n] with O(r) conserved intervals. For LF, a conserved interval is within a run in the BWT. For ϕ1, a conserved interval is a range of suffixes of T that don’t occur at the bottom of a run in the BWT (except the first position of the interval may be at the bottom of a run). Therefore, Nishimoto and Tabei define the OptBWTR data structure as the combination of the move data structures of the LF and ϕ1 functions along with a rank-select data structure on an O(r) length string Lfirst. OptBWTR supports O(mloglogwσ) time count queries and O(mloglogwσ+occ) time locate queries in O(r) words of space, where r is the number of runs in the BWT of the text, m is the length of the pattern, occ is the number of occurrences of the pattern in the text, w is the word size, σ is the size of the alphabet (Theorem 9 of [36]). The input intervals of B(ILF), the disjoint interval sequence of the move data structure of LF, are contained within a run in the BWT. Call the i-th input interval of B(ILF) [pi,pi+11]. Then, Lfirst=L[p1]L[p2]L[p3]L[pk], where k2r is the number of input intervals of B(ILF), L is the BWT of T, and i[1,k],j[pi,pi+11]L[j]=L[pi]. Call B(ISA) the disjoint interval sequence of the move data structure of ϕ1, and [pi,pi+1] its i-th input interval. OptBWTR is composed of:

  • move data structures for LF and ϕ1 (F(ILF) and F(ISA) respectively),

  • a rank-select data structure on Lfirst (R(Lfirst)),

  • samples of the SA at the beginning of input intervals of the LF move data structure (SA+, where SA+[i]=SA[pi]), and

  • the index of the input interval of the ϕ1 move data structure that contains each SA sample in SA+ (SAindex+, where SAindex+=ySA+[i][py,py+1]).

2.3 Matching Statistics

The matching statistics of a pattern P with respect to a text T represents information on the local similarity of the pattern to the text. The matching statistics of P with respect to T, MPST, is an array of length |P|=m that stores at position i three values: MPST[i].len, MPST[i].suff, and MPST[i].row. MPST[i].len is the length of the longest substring of P starting at i that occurs in T. MPST[i].suff is a suffix of T that has a longest common prefix with P[i,m] of length MPST[i].len (or equivalently, MPST[i].suff is the starting position of an occurrence of P[i,i+MPST[i].len1] in T). MPST[i].row is the index in the SA of T that has value MPST.suff. Formally for all i[1,m],

  • MPST[i].len=maxj[1,|T|]|lcp(P[i,m],T[j,|T|])|,

  • |lcp(T[MPST[i].suff,n],P[i,m])|=MPST[i].len, and

  • SA[MPST[i].row]=MPST[i].suff.

When P and T are clear from the context, we omit them from MPST and refer to the matching statistics of P with respect to T as MS.

2.4 Maximal and Locally Maximal Exact Matches

For a pattern P and a text T (|P|=m,|T|=n), a maximal exact match (MEM), P[i,j]=T[i,j], is a match between P and T that cannot be extended left or right in the pattern. Formally, (i=1 or P[i1,j] doesn’t occur in T) and (j=m or P[i,j+1] doesn’t occur in T). A MEM can be fully specified by the triple (i,i,k) where k=ji+1 is the length of the match and i and i are the starting positions of the match in the pattern and the text respectively.

Figure 2: MEMs and LEMs of a pattern (haplotype) vs a text (pangenome). Haplotype i is the sequence of characters between $i1 and $i. The text is the concatenation of the haplotypes T=actgacccactgaaactcgggccctt$1actggggactgaaactcgggccctt$2. MEMs and long LEMs (length threshold for long LEMs: =10) of the pattern (a haplotype not contained in the pangenome) with respect to the text (the pangenome) are highlighted. MEMs are shaded in while LEMs are boxed in. In this example, MEMs are only able to detect relationships among the haplotypes most closely related to the pattern haplotype. Haplotypes similar to the pattern but not maximally similar at any location remain undetected. Notably, haplotype 2 is very similar to the pattern but doesn’t contain any MEMs with it. The number of undetected similar haplotypes in biobank scale haplotype panels may be an order of magnitude larger.

For a pattern P and a text T, a locally maximal exact match (LEM), P[i,j]=T[i,j], is a match between P and T that cannot be simultaneously extended in the pattern and the text. The match cannot be simultaneously extended left in the pattern and the text. Likewise, it cannot be simultaneously extended right in the pattern and the text. Formally, (i=1 or i=1 or P[i1,j]T[i1,j]) and (j=m or j=n or P[i,j+1]T[i,j+1]). A LEM can also be fully specified by the triple (i,i,k) where k is the length of the LEM and k=ji+1. For some length threshold , a long LEM is a LEM with length at least . See Figure 2 for a depiction of MEMs and LEMs in a text representing a pangenome.

3 Methods

Here we describe the main results of our paper. In Section 3.1, we prove move data structures can compute ϕ and PLCP in constant time. Then we describe OptBWTRL, our modification of OptBWTR that utilizes these move data structures. In Section 3.2, we describe multiple algorithms for long LEM query provided an OptBWTRL of the text and matching statistics of the pattern with respect to the text.

3.1 Computing LCP with Move Data Structures

We define pj+ to be the j-th smallest suffix that occurs at the top of a run in the BWT. Therefore let (i) p1+<p2+<<pr+<pr+1+=n+1 and (ii) {p1+,p2+,,pr+,pr+1+}={SA[l1],SA[l2],,SA[lr],n+1}. Lemma 1 and its proof are phrased very similarly to Lemma 4 in [36] to demonstrate its derivativeness and the similarity of the properties.

Lemma 1.

(i) Let x be the integer satisfying px+i<px+1+ for some integer i[1,n]. Then LCP[ISA[i]]=LCP[ISA[px+]](ipx+).

Proof.

Lemma 1(i) clearly holds for i=px+. We show that Lemma 4(i) holds for ipx+ (i.e., i>px+). Let st be the position in SA with sa-value px++t for an integer t[1,y] (i.e., SA[st]=px++t) where y=ipx+. Two adjacent positions st1 and st are contained in an interval [lv,lv+|Lv|1] on LCP which corresponds to the v-th run Lv of L. This is because st is not the starting position of a run, i.e., (SA[st]=px++t){p1+,p2+,,pr+}. The LF function maps st to st1, where s0 is the position with sa-value px+. LF also maps st1 to st11 by Lemma 3(i) of [36]. LCP[st1]=LCP[st]+1 due to st and st1 being in the same interval on L, Lv. These relationships produce y equalities LCP[s0]=LCP[s1]+1,LCP[s1]=LCP[s2]+1,,LCP[sy1]=LCP[sy]+1. The equalities lead to LCP[s0]=LCP[sy]+y, and therefore LCP[sy]=LCP[s0]y. Which represents LCP[ISA[i]]=LCP[ISA[px+](ipx+) by ISA[i]=sy,ISA[px+]=s0, and y=(ipx+).

Lemma 2.

(i) Let x be the integer satisfying px+i<px+1+ for some integer i[1,n]. Then PLCP[i]=PLCP[px+](ipx+).

Proof.

By Lemma 1 and PLCP[j]=LCP[ISA[j]] for all j[1,n] [21].

3.1.1 Move Data Structure for ϕ

pj+ remains as defined in the previous section. Let δ+ be a permutation of [1,r] satisfying ϕ(pδ+[1]+)<ϕ(pδ+[2]+)<<ϕ(pδ+[r]+). ϕ has the following properties on RLBWT.

Lemma 3.

The following three statements hold: (i) Let x be the integer satisfying px+i<px+1+ for some integer i[1,n]. Then ϕ(i)=ϕ(px+)+(ipx+); (ii) ϕ(pδ+[1]+)=1 and ϕ(pδ+[i]+)=ϕ(pδ+[i1]+)+d where d=pδ+[i1]+1+pδ+[i1]+; (iii) p1+=1.

Proof.

See Appendix A.

We can compute ϕ by using a move data structure. A sequence Iϕ consists of r pairs (p1+,ϕ(p1+)),(p2+,ϕ(p2+)),,(pr+,ϕ(pr+)). Iϕ satisfies the three conditions of a disjoint interval sequence by Lemma 3, and ϕ is equal to the bijective function represented by Iϕ.

Lemma 4.

(i) Iϕ is a disjoint interval sequence. (ii) ϕ is equal to the bijective function represented by Iϕ.

Proof.

(i) Iϕ has the following three properties: (a) p1+=1<p2+<<pr+n holds by Lemma 3(iii) and the definition of the sequence p1+,p2+,,pr+1+, (b) ϕ(pδ+[1]+)=1 by Lemma 3(ii), and (c) ϕ(pδ+[i]+)=ϕ(pδ+[i1]+)+(pδ+[i1]+1+pδ+[i1]+). Therefore Iϕ satisfies the three conditions of the disjoint interval sequence.

(ii) Let fϕ be the bijective function represented by Iϕ. Then fϕ(i)=ϕ(px+)+(ipx+) where x is the integer such that px+i<px+1+ holds. On the other hand, ϕ(i)=ϕ(px+)+(ipx+) holds by Lemma 3(i). Therefore fϕ(i)=ϕ(i) and fϕ and ϕ are the same function.

Let F(Iϕ) be the move data structure built on the balanced interval sequence B(Iϕ) for Iϕ. By Lemma 6 of [36], F(Iϕ) requires O(r) words of space. By the results of Section 3.2 of [36], evaluation of a move query using a move data structure for a balanced disjoint interval sequence takes constant time. Finally, ϕ(i)=i holds for a move query Move(B(Iϕ),i,x)=(i,x) by Lemma 4. Therefore we have proved (i) of the following lemma.

Lemma 5.

(i) There exists a move data structure F(Iϕ) that computes ϕ(i) in O(r) space and constant time given x, the index of the input interval of Iϕ that contains i. (ii) This move data structure can be modified to also compute PLCP[i] in O(r) space and constant time given i and x. Call the modified move data structure F(Iϕ,PLCP).

Proof.

Say that B(Iϕ) has k+ input intervals and the i-th input interval is [pi+,pi+1+]. Then we modify the move data structure F(Iϕ) by adding an array LCP+ of size k+. LCP+[i] stores the value LCP[ISA[px+]] for each x[1,k+]. (PLCP[px+]=LCP[ISA[px+]].) PLCP[i]=PLCP[px+](ipx+) by Lemma 1. Therefore PLCP[i] is computed in constant time by evaluating LCP+[x](ipx+). Call this modified move data structure F(Iϕ,PLCP).

A similar function that we may need to compute is LCP[i+1] given SA[i], i.e. given SA[i]=y, compute |lcp(T[y,n],T[ϕ1(y),n])|=PLCP[ϕ1(y)]=LCP[i+1]. Nishimoto and Tabei described F(ISA), a move data structure computing ϕ1. F(ISA) can be modified to compute PLCP[ϕ1(y)] in constant time as well in a similar fashion to the modification of the F(Iϕ) data structure. Call the disjoint interval sequence F(ISA) is built on B(ISA). Call the i-th input interval of B(ISA) [pi,pi+11], where B(ISA) has k input intervals and pk+1=n+1. Note that by the construction of Nishimoto and Tabei, every suffix at the bottom of a BWT run is the start of an input interval, {SA[l21],SA[l31],SA[l41],,SA[lr1],SA[n]}{p1,p2,,pk}. Then for any i,j[px,px+11],ϕ1(i)ϕ1(j)=ij. Therefore ϕ1(i)=ϕ1(j)+(ij) (see Lemma 4 in [36]). Below, we prove (ii) that for any i,j[px,px+11],PLCP[ϕ1(i)]PLCP[ϕ1(j)]=ji, therefore PLCP[ϕ1(i)]=PLCP[ϕ1(j)]+ji.

Lemma 6.

Let x be the integer satisfying pxi<px+1 for some i[1,n]. Then (i) PLCP[ϕ1(i)]=PLCP[ϕ1(px)](ipx). Therefore, (ii) for any i,j[px,px+11], PLCP[ϕ1(i)]=PLCP[ϕ1(px)](ipx), PLCP[ϕ1(j)]=PLCP[ϕ1(px)](jpx), and PLCP[ϕ1(i)]PLCP[ϕ1(j)]=ji.

Proof.

Lemma 6(i) clearly holds for i=px. We show that Lemma 6(i) holds for px<i<px+1 (i.e. ipx). Let st be the position in the SA with sa-value px+t for an integer t[1,y] where y=ipx. Two adjacent positions st and st+1 are contained in an interval [lv,lv+11] corresponding to the v-th run in the BWT (Lv). This is because st is not the ending position of a run, SA[st]{p1,p2,,pk}. The LF function maps st to st1, where s0 is the position in the SA with value px. LF also maps st+1 to st1+1 by Lemma 3(i) of [36]. PLCP[ϕ1(px+t1)]=LCP[st1+1]=LCP[st+1]+1=PLCP[ϕ1(px+t)]+1 since st and st+1 are in the same interval in the BWT, Lv. These relationships produce y equalities PLCP[ϕ1(px)]=PLCP[ϕ1(px+1)]+1,PLCP[ϕ1(px+1)]=PLCP[ϕ1(px+2)]+1,…,PLCP[ϕ1(px+y1)]=PLCP[ϕ1(px+y)]+1. This leads to PLCP[ϕ1(px)]=PLCP[ϕ1(px+y)]+y. Which leads to PLCP[ϕ1(i)]=PLCP[ϕ1(px)](ipx) by y=ipx and px+y=i.

Therefore, the move data structure that computes ϕ1(i), F(ISA), can be modified to compute PLCP[ϕ1(i)] as well.

Lemma 7.

F(ISA) can be modified to compute PLCP[ϕ1(i)] as well as ϕ1(i) in constant time and O(r) space given x, the index of the input interval of B(ISA) that contains i. Call the modified move data structure F(Iϕ1,PLCP).

Proof.

We modify the F(ISA) move data structure by LCP, an array of size k where the x-th element stores the value PLCP[ϕ1(px)]. Then, PLCP[ϕ1(i)] can be computed in constant time by evaluating LCP[x](ipx) by LCP[x]=PLCP[px] and Lemma 6(i). We call this modified move data structure F(Iϕ1,PLCP).

3.1.2 OptBWTRL

We slightly modify OptBWTR by adding a move data structure that computes ϕ and PLCP and arrays that allow jumping to the closest input intervals corresponding to adjacent runs in the BWT in constant time. We call it OptBWTRL, L for LCP and long LEMs. In addition to the structures of OptBWTR, OptBWTRL contains F(Iϕ,PLCP), ND, PD, SA, SAϕ+, SAindex, and SAϕ. Furthermore, the F(ISA) move data structure of OptBWTR is replaced by the F(Iϕ1,PLCP) move data structure described in Lemma 7. Recall that B(ILF) is the disjoint interval sequence the move data structure F(ILF) is built on. Let B(ILF) contain k input intervals where the i-th input interval is [pi,pi+11], and pk+1=n+1. Further recall that every input interval is contained in a run in the BWT, i.e. for all i[1,k], j,j[pi,pi+11],L[j]=L[j]. Then, ND and PD are arrays of length k where ND contains the index of the next input interval with a different character in the BWT and PD contains the index of the previous input interval with a different character in the BWT. Formally, for all i[1,k], ND[i]=min{j>i|L[pi]L[pj]}, and PD[i]=max{j<i|L[pi]L[pj]}. If no such j exists, ND[i]=k+1 and PD[i]=1. ND and PD can be constructed in O(k) (and therefore, O(r)) time given Lfirst. SA are samples of the SA at the ends of input intervals of B(ILF). SAϕ+ are the indices of the input intervals of the top of B(ILF) input interval suffix array samples in F(Iϕ,PLCP). SAindex and SAϕ are the indices of the input intervals of the bottom of B(ILF) input interval suffix array samples in F(Iϕ1,PLCP) and F(Iϕ,PLCP) respectively. Below, let [pi+,pi+1+1] and [pi,pi+11] be the i-th input intervals of F(Iϕ,PLCP) and F(Iϕ1,PLCP) respectively. Then, OptBWTRL differs from OptBWTR in the following ways.

  • Replaced F(ISA) with F(Iϕ1,PLCP) from Lemma 7.

  • Added F(Iϕ,PLCP) from Lemma 5.

  • Added ND and PD. ND[i]=minj>i{Lfirst[j]Lfirst[i] or j=n+1}. PD[i]=maxj>i{Lfirst[j]Lfirst[i] or j=1}.

  • Added SA. SA[i]=SA[li+11].

  • Added SAϕ+. SAϕ+[i]=j s.t. pj+SA+[i]<pj+1+.

  • Added SAindex. SAindex[i]=j s.t. pjSA[i]<pj+1.

  • Added SAϕ. SAϕ[i]=j s.t. pj+SA[i]<pj+1+.

3.2 Computing Long LEMs

Here, we describe an algorithm for outputting all the long LEMs of a pattern P with respect to a text T in O(m+occ) expected time using an index of size O(r) words given the matching statistics of P with respect to T and an OptBWTRL of T. m is the length of P and occ is the number of long LEMs P has with T. Furthermore, the matching statistics are slightly augmented to contain the input intervals it’s corresponding data are contained in. In particular, the input interval of F(ILF) that MS.row is contained in is stored as MS.i, the input interval of F(Iϕ,PLCP) that MS.suff is contained in is stored as MS.w, and the input interval of F(Iϕ1,PLCP) that MS.suff is contained in is stored as MS.x. Note that the long LEM query algorithm we present here does not necessarily result in an O(m+occ) expected time algorithm for outputting all long LEMs of P with respect to T given a OptBWTRL of T because an algorithm for computing the matching statistics of P with respect to T in O(m) time and O(r) space is not known.

We define the balanced salcp-interval of a string P as a 13-tuple (b,d,e,SA[b],SA[d],SA[e],i,j,k,v,w,x,y) where [b,e] is the sa-interval of P, d[b,e], i,j, and k are the indexes of the input intervals of B(ILF) that contain b,d, and e respectively, v and w are the indexes of the input intervals of B(Iϕ,PLCP) containing SA[b] and SA[d] respectively, and x and y are the indexes of the input intervals of B(Iϕ1,PLCP) of SA[d] and SA[e] respectively. The balanced salcp-interval keeps track of three positions in the sa-interval: the top (b), bottom (e), and the middle (d). d is any position in the interval, it may be equivalent to the top or the bottom. Each position also maintains its corresponding suffix array value and index of the input interval of the position in F(ILF) (i,j, and k for top, middle, and bottom respectively). Finally, the top maintains the index of the input interval of its sa-value in F(Iϕ,PLCP) (v), the bottom maintains the index of the input interval of its sa-value in F(Iϕ1,PLCP) (y), and the middle maintains the index of the input interval of its sa-value in both F(Iϕ,PLCP) and F(Iϕ1,PLCP) (w and x respectively). The balanced salcp-interval of a string P with no occurrences in T is undefined.

The high level idea of the long LEM finding algorithm is to compute the balanced salcp-interval of adjacent substrings of length of the pattern while outputting long LEMs along the way. I.E. given the balanced salcp-interval of P[f+1,f+], compute the salcp-interval of P[f,f+1] and output all long LEMs of the form P[f+1,g]=T[f,g]. We call this problem long salcp-interval advancement. Given an algorithm for long salcp-interval advancement in O(t) time, a straightforward long LEM computation algorithm is iterating from f=m1, repeatedly advancing the salcp-interval and outputting all long LEMs in O(mt) time. In Section 3.2.1, we outline an algorithm for balanced salcp-interval extension and in Section 3.2.2, we outline an algorithm for long salcp-interval advancement. These algorithms result in an O(m+occ) expected time algorithm for long LEM computation.

3.2.1 Balanced salcp-interval Extension

Here, we provide algorithms for obtaining the balanced salcp-interval of cP given the balanced salcp-interval of P and an OptBWTRL of T. The first algorithm runs in O(loglogwσ) time by making use of the rank-select structure on Lfirst. The second runs in time linear to the number of runs in the balanced salcp-interval of P, rP, by iterating through them. Call the balanced salcp-interval of P (b,d,e,SA[b],SA[d],SA[e],i,j,k,v,w,x,y) and the balanced salcp-interval of cP (b,d,e,SA[b],SA[d],SA[e],i,j,k,v,w,x,y). Recall that pj,pj+, and pj are the starting indexes of the j-th input intervals of F(ILF),F(Iϕ,PLCP), and F(Iϕ1,PLCP) respectively.

We first discuss the computation of the top values, b,SA[b],i, and v. If L[b]=c, then b=LF[b] and i can be computed with F(ILF) in constant time using (b,i). SA[b]=SA[b]1, and v=v if SA[b]pv+, otherwise v=v1. If L[b]c, b=LF[b^], where b^ is the first location in [b,e] such that L[b^]=c. If i^ is the index of first input interval ii^k such that Lfirst[i^]=c, then b^=pi^, where pa is the starting position of the a-th input interval of F(ILF). i^ can be computed in O(loglogwσ) time using R(Lfirst) or O(rP) time by iterating through the runs of balanced salcp-interval of P using the ND array. Then, i and b=LF[b^] can be computed with F(ILF) in constant time using (b^,i^). SA[b]=SA+[i^]1, and v=SAϕ+[i^]1.

The bottom values e,SA[e],k, and y can be computed in a similar fashion. If L[e]=c, then e=LF[e] and k can be computed with F(ILF) in constant time using (e,k). SA[e]=SA[e]1, and y=y if SA[e]pk, otherwise y=y1. If L[e]c, then e=LF[e^], where e^ is the last location in [b,e] such that L[e^]=c. If k^ is the index of the last input input interval ik^k such that Lfirst[k^]=c, then e^=pk^+11. k^ can be computed in O(loglogwσ) time using R(Lfirst) or O(rp) time by iterating through the runs of the balanced salcp-interval of P using the PD array. Then, k and e=LF[e^] can be computed with F(ILF) in constant time using (e^,k^). Finally, SA[e]=SA[k^]1, and y=SAindex+[k^]1.

Lastly, the middle values d,SA[d],j,w and x need to be computed. Pseudocode for middle value computation is provided as Algorithm 4 in Appendix B. If L[d]=c, then d=LF[d] and j can be computed in constant time with F(ILF) using (d,j). SA[d]=SA[d1]. w=w if SA[d]pw+, otherwise w=w1. Finally, x=x if SA[d]px, otherwise x=x1. If L[d]c and cP occurs in T, then there is a preceding or succeeding input interval of B(ILF) that intersects with the balanced salcp-interval of P and has value c in the BWT. Suppose there is a preceding interval, j^. Then the middle values can be updated similar to the bottom values. Let d^=pj^+11, then j and d=LF[d^] are computed in constant time with F(ILF), SA[d]=SA[j^]1, x=SAindex[j^]1, and w=SAϕ[j^] if SA[d^]pSAϕ[j^]+, otherwise w=SAϕ[j^]1. If there is no preceding interval, then set j^ to the index of the succeeding interval. Then the middle values can be updated similar to the top values. Let d^=pj^, then j and d=LF[d^] are computed in constant time with F(ILF), SA[d]=SA+[j^], w=SAϕ+[j^]1, and x=SAindex+[j^] if SA[d^]pSAindex+[j^], otherwise x=SAindex+[j^]1. The index, j^, of the preceding or succeeding interval in the salcp-interval of P with value c in the BWT can be found in O(loglogwσ) time with R(Lfirst) or O(rP) time by iterating through the runs in the BWT with PD and ND. Therefore, the balanced salcp-interval of cP can be computed in O(loglogwσ) time or O(rP) time given the balanced salcp-interval of P. See Algorithm 3 in Appendix B for the O(rP) time algorithm pseudocode.

3.2.2 Long salcp-interval Advancement

Let occstart,f+1 be the number of long LEMs of the form P[f+1,g]=T[f,g] and occend,f+1. be the number of long LEMs of the form P[h,f+1]=T[f′′,g′′]. Here, we describe an algorithm that computes the balanced salcp-interval of P[f,f+1] and a dynamic dictionary of the suffixes of T present in the balanced salcp-interval of P[f,f+1]. This algorithm also outputs all occstart,f+1 long LEMs of the form P[f+1,g]=T[f,g]. The algorithm runs in O(occstart,f+1+occend,f+1) expected time and requires as input the balanced salcp-interval of P[f+1,f+], an OptBWTRL of T, and a dynamic dictionary of the suffixes of T present in the balanced salcp-interval of P[f+1,f+].

We begin with the description of the dynamic dictionary, dictocc. There are numerous dynamic dictionary data structures that support expected constant time insertion, deletion, and queries [5, 37, 6, 14]. Therefore, we maintain a dynamic dictionary of the suffixes in the balanced salcp-interval. More precisely, if the balanced salcp-interval of P[f+1,f+] is (b,d,e,SA[b],SA[d],SA[e],i,j,k,v,w,x,y), then the dynamic dictionary provided as input to the long salcp-interval advancement algorithm has eb+1 elements. a[b,e],SA[a](f+1) is contained in the dictionary and has the value (f+1)+|lcp(T[SA[a],n],P[f+1,m])|1=f+|lcp(T[SA[a],n],P[f+1,m])| associated with it. I.E. the value associated with each suffix SA[a] of the text contained in the dictionary is the ending position (in the pattern) of the longest match between suffix SA[a] of the text and suffix f+1 of the pattern. It is not possible that multiple suffixes of T share the same key in dictocc. Each suffix of T can occur only once in the dictionary because each suffix of T can occur only once in any balanced salcp-interval. Each suffix of T can occur only once in any balanced salcp-interval since each suffix of T occurs exactly once in SA.

Here, we describe the procedure for outputting all occstart,f+1 long LEMs of the form P[f+1,g]=T[f,g] in O(occstart,f+1) expected time (we call this outputMatches). The high level idea is to iterate through the input intervals of B(ILF), skipping intervals corresponding to a run of P[f] in constant time per run using ND. We outline two functions: outputMatchesDown(s,ι,z) and outputMatchesUp(s,ι,z). For both functions, s represents a suffix of T and ι is the index of the input interval that contains it in F(Iϕ1,PLCP) and F(Iϕ,PLCP) in outputMatchesDown and outputMatchesUp respectively. z represents the number of matches to output (directly above s in SA for outputMatchesUp and directly below s in SA for outputMatchesDown) including s. outputMatchesUp(s,ι,1) outputs a match P[f+1,g]=T[s,s+g(f+1)], where g=dictocc[s(f+1)], and removes the key-value pair (s(f+1),g) from dictocc. outputMatchesUp(s,ι,z) for z>1 similarly outputs a match P[f+1,g]=T[s,s+g(f+1)] where g=dictocc[s(f+1)], then removes the key-value pair (s(f+1),g) from dictocc. Then, it recurses on outputMatchesUp(s,ι,z1), where ι and s=ϕ(s) are computed in constant time using F(Iϕ,PLCP). outputMatchesDown(s,ι,z) operates in the same way as outputMatchesDown except it computes ϕ1 instead of ϕ (using F(Iϕ1,PLCP) instead of F(Iϕ,PLCP)). It is simple to see that outputMatchesUp(s,ι,z) and outputMatchesDown(s,ι,z) operate in O(z) expected time and output z matches each. Now we utilize outputMatchesUp and outputMatchesDown to output the occstart,f+1 long LEMs of the form P[f+1,g]=T[f,g]. If the salcp-interval of P[f+1,f+] is fully contained in one input interval of F(ILF), then i=k. If Lfirst[i]=P[f], then there are no matches to output, otherwise, every suffix in the balanced salcp-interval needs to be outputted and we do so by calling outputMatchesUp(SA[d],w,db+1) and outputMatchesDown(ϕ1(SA[d]),x,eb), where x and ϕ1(SA[d]) are computed with (SA[d],x) and F(Iϕ1,PLCP). In the case where the balanced salcp-interval of P[f+1,f+] is not fully contained in one input interval (ik), we do the following. For the first input interval, i, if Lfirst[i]P[f], then the pi+1b long LEMs starting at SA[pi+11],SA[pi+12],,SA[b] in the text are outputted in O(pi+1b) expected time by calling outputMatchesUp(SA[i],SAϕ[i],pi+1b). For any middle input interval o, i<o<j, if Lfirst[o]=P[f], then this run in the BWT is skipped, o=ND[o]. Otherwise, if Lfirst[o]P[f], then the long LEMs starting at SA[po],SA[po+1],,SA[po+11] are outputted by calling outputMatchesDown(SA+[o],SAindex+[o],po+1po). For the last input interval, k, if Lfirst[k]P[f], then the epk+1 long LEMs starting at SA[pk],SA[pk+1],,SA[e] are outputting by calling outputMatchesDown(SA+[k],SAindex+[k],epk+1). Overall, outputting the occstart,f+1 long LEMs of the form P[f+1,g]=T[f,g] takes O(occstart,f+1+rP[f+1,f+]) expected time. Furthermore, for every run of character P[f] intersecting the salcp-interval of P[f+1,f+] except the first one, there is a run of characters P[f]. Therefore rP[f+1,f+]=O(occstart,f+1) Therefore outputting the occstart,f+1 long LEMs of the form P[f+1,g]=T[f,g] takes O(occstart,f+1) expected time. See Algorithms 5, 6, and 7 in Appendix B for outputMatches and related pseudocodes.

Finally, we must compute the balanced salcp-interval of P[f,f+1]. First suppose that the balanced salcp-interval of P[f+1,f+] is nonempty. Then, we use the algorithm described in Section 3.2.1 to obtain the salcp-interval of P[f,f+] in O(rP[f+1,f+]) time. Now, let the salcp-interval of P[f,f+] be (b^,d^,e^,SA[b^],SA[d^],SA[e^],i^,j^,k^,v^,w^,x^,y^) and the salcp-interval of P[f,f+1] be (b,d,e,SA[b],SA[d],SA[e],i,j,k,v,w,x,y). These salcp-intervals differ only by those suffixes of the text whose lcp with P[f,m] has length exactly . There are exactly occend,f+1 such suffixes. Furthermore, PLCP[SA[b]]< and PLCP[ϕ1(SA[e])]<. Finally, b<ab^, LCP[a]=PLCP[SA[a]], and e^a<e, LCP[a+1]=PLCP[SA[a+1]]=PLCP[ϕ1(a)]. Therefore, we initialize b=b^, SA[b]=SA[b^], i=i^, and v=v^. Then, while LCP[b]=PLCP[SA[b]], we (i) set i=i1 if b=pi, (ii) set b=b1, (iii) update SA[b] and v by F(Iϕ,PLCP), and (iv) insert the key SA[b]f into dictocc with value f+1. When LCP[b]=PLCP[SA[b]]<, the final value b has been computed. Similarly for e, we initialize e=e^,SA[e]=SA[e^],k=k^, and y=y^. Then, while LCP[e+1]=PLCP[SA[e+1]]=PLCP[ϕ(SA[e])], we (i) set k=k1 if e=pk+11, (ii) set e=e1, (iii) update SA[e] and y by F(Iϕ1,PLCP), and (iv) insert the key SA[e]f into dictocc with value f+1. When LCP[e+1]=PLCP[SA[e+1]]=PLCP[ϕ1(e)]<, the final value e has been computed. This takes constant time per suffix added to the interval, therefore O(occend,f+1) time.

If the balanced salcp-interval of P[f+1,f+] is empty, the balanced salcp-interval of P[f,f+1] is only nonempty if MS[f].len=. If it is, we initialize the balanced salcp-interval of P[f,f+1] to (b^=MS[f].row,d^=MS[f].row,e^=MS[f].row,SA[b^]=MS[f].suff,SA[d^]=MS[f].suff,SA[e^]=MS[f].suff,i^=MS[f].i,j^=MS[f].i,k^=MS[f].i,v^=MS[f].w,w^=MS[f].w,x^=MS[f].x,y^=MS[f].x) and insert the key MS.sufff into dictocc with value f+1. Then, the interval is expanded to the salcp-interval of P[f,f+1] in O(occend,f+1) time as in the other case.

In the case where the balanced salcp-interval of P[f+1,f+] is empty, long salcp-interval advancement is performed in O(occend,f+1) expected time . If it is not empty, the algorithm we have described first performs salcp-interval extension, obtaining the salcp-interval of P[f,f+] in O(rP[f+1,f+]) time and then takes O(occend,f+1) expected time to compute the salcp-interval of P[f,f+1] from the salcp-interval of P[f,f+]. Finally, rP[f+1,f+]=O(occend,f+1). Therefore, the algorithm described here performs the long salcp-interval advancement in O(occstart,f+1+occend,f+1) expected time. See Algorithm 2 in the Appendix for the pseudocode of this algorithm.

3.3 Time Complexity

If the above algorithm is iterated from f=m0, all long MEMs of the pattern with respect to the text are outputted. The time complexity of the algorithm is the sum of the time complexity of the m long salcp-interval advancements. Note that the sum of occstart,f+1 for f=m0 is occ and the sum of the occend,f+1 for f=m0 is also occ. Therefore, the time complexity of the algorithm overall is O(m+occ) expected time. See Algorithm 1 in Appendix B for pseudocode. The algorithm takes O(r) space for the OptBWTRL and O(occ) space for maintaining the dynamic dictionary [5]. Also note that if a deterministic time bound is desired, this algorithm runs in O(m+occlogoccloglogocc) time with the same space by replacing the dictionary with a deterministic dictionary implemented by exponential search trees [47, 2]. Recall these complexities are when given the modified matching statistics. A linear time algorithm for computing matching statistics in O(r) space is not known. However, note that since the values of matching statistics are only needed for positions i where MS[i].len=, a straightforward algorithm for long LEM query follows from our algorithm in O(mloglogwσ+occ) expected time when matching statistics are not given as input. This algorithm is obtained by computing the salcp-interval of each P[i,i+1] independently in O(loglogwσ) time using the standard count algorithm described by Nishimoto and Tabei [36] followed by performing the long LEM query described here. The long LEM query algorithm described here results in an O(m+occ) expected time long LEM query algorithm in uncompressed string indexes since algorithms for O(m) time matching statistics computation are known in uncompressed space.

4 Discussion

In this paper, we have described OptBWTRL, a modification of OptBWTR by Nishimoto and Tabei [36]. OptBWTRL adds the ability to compute PLCP and ϕ in constant time with additional move data structures. It also retains a space complexity of O(r) words. We also define locally maximal exact matches (LEMs), a match that cannot be simultaneously extended in the pattern and the text instead of one that is only unable to be extended in the pattern (MEMs). Finally, we describe an algorithm for outputting all LEMs with length at least in O(m+occ) expected time given an OptBWTRL of the text and the matching statistics of the pattern with respect to the text. Note that this doesn’t result in a linear time algorithm for computing long LEMs in O(m+occ) expected time in O(r) space because an algorithm for computing matching statistics of a pattern with respect to a text in linear time in O(r) space is not known. A deterministic bound for our long LEM query algorithm is O(m+occlogoccloglogocc). Finally, our long LEM query admits a direct computation of long LEMs in O(m+occ) expected time without being provided matching statistics as input. This algorithm may be faster than computation of matching statistics followed by O(m+occ) long LEM query in some cases, especially when is small.

It is likely that the move data structures F(Iϕ,PLCP) and F(Iϕ1,PLCP) can be merged into one data structure that still takes O(r) space and computes ϕ,ϕ1,PLCP[i], and PLCP[ϕ1(i)] in constant time in one data structure. This would greatly reduce the number of samples needed per input interval F(ILF). It would also allow bidirectional movement in the SA with one input interval index. This is left as future work. Possible other future work includes a practical implementation of the structures and algorithms described here, possibly as a modification of MOVI or b-move [49, 16]. Thirdly, the ability to compute PLCP in constant time may speed up matching statistics computation in compressed space. The intuition is that when MS[i].lenMS[i+1].len, then when MS[i].len is large, the sa-interval of P[i,i+MS[i].len1] is small and is faster to compute with PLCP and ϕ and ϕ1 than with reverse LF. When MS[i].len is small, computing the sa-interval is faster with reverse LF. In Heng Li’s forward-backward algorithm [25], the new sa-interval is always computed by reverse LF. Computing the sa-interval with PLCP and reverse LF simultaneously is likely to be faster in practice than reverse LF alone while retaining the same worst case time complexity. The authors are currently exploring this idea. Furthermore, variable length threshold long LEMs may be useful. I.E. output long LEMs that are x% of the length of the MEMs in the same area. The authors believe a linear time algorithm for this or a similar problem given matching statistics exist. Finally, applications utilizing the long LEMs of a pattern with respect to the text is a possible fruitful direction for future work. Particularly in biological applications.

Long LEMs may have many biological applications. In general, in any application where long MEMs are used, long LEMs may also be used. Note that MEMs are a subset of LEMs and long MEMs are a subset of long LEMs. For example, in biobank scale haplotype datasets, long matches (long LEMs) in the PBWT have revealed genealogical relationships that set maximal matches (MEMs) are not able to uncover. As the compressive power of compressed string indexes increases and the number of variants in biobank scale whole genome sequencing data increases, storing unaligned genomes becomes more viable. In that case, algorithms for outputting long LEMs are needed to replace the long match algorithms in the PBWT. These matches have many applications from identity by descent segment detection, haplotype phasing, haplotype imputation, inferring genealogical relationships, and ancestry inference. Utilizing unaligned matches from a large collection of haplotype sequences instead of aligned matches from haplotypes aligned to a linear reference genome may also reduce reference bias. Finally, novel applications for long LEMs may exist, long LEMs may be used as seeds for seed and extend algorithms. They may be used as anchors for approximate matching matching algorithms [20] possibly for long read alignment to either a reference pangenome or a linear reference genome [31]. Lastly, genome to genome or genome to pangenome long LEMs detection may find similar sequences in the genomes on different genomic regions. MEMs detection may miss these similar sequences on different genomic regions because these matches will typically be overshadowed by larger encompassing matches that occur in roughly the same region in the pattern and the text. The long LEMs may therefore reveal old structural variants that MEMs and general alignment algorithms are both unable to reveal. MEMs don’t reveal these variants due to looking for only the largest matches in a region on the pattern while alignment algorithms don’t due to better alignments existing in closeby genomic regions or alignment algorithms being too computationally expensive to run on very large datasets.

Overall, we have provided a linear time algorithm for outputting all long LEMs of a pattern with respect to a text in BWT runs compressed space given the matching statistics of the pattern with respect to the text. We have also applied the move data structure of Nishimoto and Tabei to computation of PLCP in constant time. Therefore, we can compute LCP[i] given SA[i] in constant time. We apply these results to modify the OptBWTR, creating OptBWTRL. OptBWTRL is an O(r) space data structure that computes ϕ and PLCP in constant time and long LEMs in linear time given matching statistics. These algorithms result in a linear time long LEM query algorithm in uncompressed string indexes.

References

  • [1] Omar Y Ahmed, Massimiliano Rossi, Travis Gagie, Christina Boucher, and Ben Langmead. SPUMONI 2: improved classification using a pangenome index of minimizer digests. Genome Biology, 24(1):122, 2023. doi:10.1186/s13059-023-02958-1.
  • [2] Arne Andersson and Mikkel Thorup. Dynamic ordered sets with exponential search trees. J. ACM, 54(3):13–es, June 2007. doi:10.1145/1236457.1236460.
  • [3] Djamal Belazzougui, Fabio Cunial, Travis Gagie, Nicola Prezza, and Mathieu Raffinot. Composite repetition-aware data structures. In Ferdinando Cicalese, Ely Porat, and Ugo Vaccaro, editors, Combinatorial Pattern Matching, pages 26–39, Cham, 2015. Springer International Publishing. doi:10.1007/978-3-319-19929-0_3.
  • [4] Djamal Belazzougui, Manuel Cáceres, Travis Gagie, Paweł Gawrychowski, Juha Kärkkäinen, Gonzalo Navarro, Alberto Ordóñez, Simon J. Puglisi, and Yasuo Tabei. Block trees. Journal of Computer and System Sciences, 117:1–22, 2021. doi:10.1016/j.jcss.2020.11.002.
  • [5] Michael A. Bender, Martín Farach-Colton, John Kuszmaul, William Kuszmaul, and Mingmou Liu. On the optimal time/space tradeoff for hash tables. In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2022, pages 1284–1297, New York, NY, USA, 2022. Association for Computing Machinery. doi:10.1145/3519935.3519969.
  • [6] Michael A. Bender, Martín Farach-Colton, John Kuszmaul, and William Kuszmaul. Modern Hashing Made Simple, pages 363–373. Society for Industrial and Applied Mathematics, 2024. doi:10.1137/1.9781611977936.33.
  • [7] Alexander G. Bick, Ginger A. Metcalf, Kelsey R. Mayo, Lee Lichtenstein, Shimon Rura, Robert J. Carroll, Anjene Musick, Jodell E. Linder, I. King Jordan, Shashwat Deepali Nagar, Shivam Sharma, Robert Meller, Melissa Basford, Eric Boerwinkle, Mine S. Cicek, Kimberly F. Doheny, Evan E. Eichler, Stacey Gabriel, Richard A. Gibbs, David Glazer, Paul A. Harris, Gail P. Jarvik, Anthony Philippakis, Heidi L. Rehm, Dan M. Roden, Stephen N. Thibodeau, Scott Topper, Ashley L. Blegen, Samantha J. Wirkus, Victoria A. Wagner, Jeffrey G. Meyer, Donna M. Muzny, Eric Venner, Michelle Z. Mawhinney, Sean M. L. Griffith, Elvin Hsu, Hua Ling, Marcia K. Adams, Kimberly Walker, Jianhong Hu, Harsha Doddapaneni, Christie L. Kovar, Mullai Murugan, Shannon Dugan, Ziad Khan, Niall J. Lennon, Christina Austin-Tse, Eric Banks, Michael Gatzen, Namrata Gupta, Emma Henricks, Katie Larsson, Sheli McDonough, Steven M. Harrison, Christopher Kachulis, Matthew S. Lebo, Cynthia L. Neben, Marcie Steeves, Alicia Y. Zhou, Joshua D. Smith, Christian D. Frazar, Colleen P. Davis, Karynne E. Patterson, Marsha M. Wheeler, Sean McGee, Christina M. Lockwood, Brian H. Shirts, Colin C. Pritchard, Mitzi L. Murray, Valeria Vasta, Dru Leistritz, Matthew A. Richardson, Jillian G. Buchan, Aparna Radhakrishnan, Niklas Krumm, Brenna W. Ehmen, Sophie Schwartz, M. Morgan T. Aster, Kristian Cibulskis, Andrea Haessly, Rebecca Asch, Aurora Cremer, Kylee Degatano, Akum Shergill, Laura D. Gauthier, Samuel K. Lee, Aaron Hatcher, George B. Grant, Genevieve R. Brandt, Miguel Covarrubias, Ashley Able, Ashley E. Green, Jennifer Zhang, Henry R. Condon, Yuanyuan Wang, Moira K. Dillon, C. H. Albach, Wail Baalawi, Seung Hoan Choi, Xin Wang, Elisabeth A. Rosenthal, Andrea H. Ramirez, Sokny Lim, Siddhartha Nambiar, Bradley Ozenberger, Anastasia L. Wise, Chris Lunt, Geoffrey S. Ginsburg, Joshua C. Denny, The All of Us Research Program Genomics Investigators, Manuscript Writing Group, All of Us Research Program Genomics Principal Investigators, Mayo Biobank, Genome Center: Baylor-Hopkins Clinical Genome Center,  Color Genome Center: Broad, Mass General Brigham Laboratory for Molecular Medicine, Genome Center: University of Washington, Data, Research Center, All of Us Research Demonstration Project Teams, and NIH All of Us Research Program Staff. Genomic data in the All of Us Research Program. Nature, 627(8003):340–346, March 2024. doi:10.1038/s41586-023-06957-x.
  • [8] Paola Bonizzoni, Christina Boucher, Davide Cozzi, Travis Gagie, Dominik Köppl, and Massimiliano Rossi. Data structures for SMEM-finding in the PBWT. In Franco Maria Nardini, Nadia Pisanti, and Rossano Venturini, editors, String Processing and Information Retrieval, pages 89–101, Cham, 2023. Springer Nature Switzerland. doi:10.1007/978-3-031-43980-3_8.
  • [9] Nathaniel K. Brown, Travis Gagie, and Massimiliano Rossi. RLBWT Tricks. In Christian Schulz and Bora Uçar, editors, 20th International Symposium on Experimental Algorithms (SEA 2022), volume 233 of Leibniz International Proceedings in Informatics (LIPIcs), pages 16:1–16:16, Dagstuhl, Germany, 2022. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi:10.4230/LIPIcs.SEA.2022.16.
  • [10] Nathaniel K. Brown, Vikram S. Shivakumar, and Ben Langmead. Improved pangenomic classification accuracy with chain statistics. In Sriram Sankararaman, editor, Research in Computational Molecular Biology, pages 190–208, Cham, 2025. Springer Nature Switzerland. doi:10.1007/978-3-031-90252-9_12.
  • [11] Michael Burrows. A block-sorting lossless data compression algorithm. SRS Research Report, 124, 1994.
  • [12] Human Pangenome Reference Consortium. HPRC data release 2. [online]. URL: https://humanpangenome.org/hprc-data-release-2/.
  • [13] Davide Cozzi, Massimiliano Rossi, Simone Rubinacci, Travis Gagie, Dominik Köppl, Christina Boucher, and Paola Bonizzoni. μ-PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data. Bioinformatics, 39(9):btad552, September 2023. doi:10.1093/bioinformatics/btad552.
  • [14] Erik D. Demaine, Friedhelm Meyer auf der Heide, Rasmus Pagh, and Mihai Pǎtraşcu. De dictionariis dynamicis pauco spatio utentibus. In José R. Correa, Alejandro Hevia, and Marcos Kiwi, editors, LATIN 2006: Theoretical Informatics, pages 349–361, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg. doi:10.1007/11682462_34.
  • [15] Lore Depuydt, Omar Y. Ahmed, Jan Fostier, Ben Langmead, and Travis Gagie. Run-length compressed metagenomic read classification with SMEM-finding and tagging. bioRxiv, 2025. doi:10.1101/2025.02.25.640119.
  • [16] Lore Depuydt, Luca Renders, Simon Van de Vyver, Lennart Veys, Travis Gagie, and Jan Fostier. b-move: Faster Bidirectional Character Extensions in a Run-Length Compressed Index. In Solon P. Pissis and Wing-Kin Sung, editors, 24th International Workshop on Algorithms in Bioinformatics (WABI 2024), volume 312 of Leibniz International Proceedings in Informatics (LIPIcs), pages 10:1–10:18, Dagstuhl, Germany, 2024. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi:10.4230/LIPIcs.WABI.2024.10.
  • [17] Richard Durbin. Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT). Bioinformatics, 30(9):1266–1272, January 2014. doi:10.1093/bioinformatics/btu014.
  • [18] Paolo Ferragina and Giovanni Manzini. Indexing compressed text. J. ACM, 52(4):552–581, July 2005. doi:10.1145/1082036.1082039.
  • [19] Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM, 67(1), January 2020. doi:10.1145/3375890.
  • [20] Chirag Jain, Daniel Gibney, and Sharma V. Thankachan. Algorithms for colinear chaining with overlaps and gap costs. Journal of Computational Biology, 29(11):1237–1251, 2022. doi:10.1089/cmb.2022.0266.
  • [21] Juha Kärkkäinen, Giovanni Manzini, and Simon J. Puglisi. Permuted longest-common-prefix array. In Gregory Kucherov and Esko Ukkonen, editors, Combinatorial Pattern Matching, pages 181–192, Berlin, Heidelberg, 2009. Springer Berlin Heidelberg. doi:10.1007/978-3-642-02441-2_17.
  • [22] Pang Ko and Srinivas Aluru. Space efficient linear time construction of suffix arrays. Journal of Discrete Algorithms, 3(2):143–156, 2005. Combinatorial Pattern Matching (CPM) Special Issue. doi:10.1016/j.jda.2004.08.002.
  • [23] Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Towards a definitive measure of repetitiveness. In Yoshiharu Kohayakawa and Flávio Keidi Miyazawa, editors, LATIN 2020: Theoretical Informatics, pages 207–219, Cham, 2020. Springer International Publishing. doi:10.1007/978-3-030-61792-9_17.
  • [24] Ben Langmead and Steven L Salzberg. Fast gapped-read alignment with Bowtie 2. Nature methods, 9(4):357–359, 2012. doi:10.1038/nmeth.1923.
  • [25] Heng Li. Exploring single-sample snp and indel calling with whole-genome de novo assembly. Bioinformatics, 28(14):1838–1844, May 2012. doi:10.1093/bioinformatics/bts280.
  • [26] Heng Li. BWT construction and search at the terabase scale. Bioinformatics, 40(12):btae717, November 2024. doi:10.1093/bioinformatics/btae717.
  • [27] Heng Li and Richard Durbin. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14):1754–1760, May 2009. doi:10.1093/bioinformatics/btp324.
  • [28] Heng Li and Richard Durbin. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics, 26(5):589–595, January 2010. doi:10.1093/bioinformatics/btp698.
  • [29] Shuwei Li, Keren J Carss, Bjarni V Halldorsson, Adrian Cortes, and UK Biobank Whole-Genome Sequencing Consortium. Whole-genome sequencing of half-a-million UK Biobank participants. medRxiv, 2023. doi:10.1101/2023.12.06.23299426.
  • [30] Wen-Wei Liao, Mobin Asri, Jana Ebler, Daniel Doerr, Marina Haukness, Glenn Hickey, Shuangjia Lu, Julian K Lucas, Jean Monlong, Haley J Abel, et al. A draft human pangenome reference. Nature, 617(7960):312–324, 2023. doi:10.1038/s41586-023-05896-x.
  • [31] Yongchao Liu and Bertil Schmidt. Long read alignment based on maximal exact match seeds. Bioinformatics, 28(18):i318–i324, September 2012. doi:10.1093/bioinformatics/bts414.
  • [32] Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki. Storage and retrieval of highly repetitive sequence collections. Journal of Computational Biology, 17(3):281–308, 2010. doi:10.1089/cmb.2009.0169.
  • [33] Karen H Miga and Ting Wang. The need for a human pangenome reference sequence. Annual Review of Genomics and Human Genetics, 22(1):81–102, 2021. doi:10.1146/annurev-genom-120120-081921.
  • [34] Ardalan Naseri, Erwin Holzhauser, Degui Zhi, and Shaojie Zhang. Efficient haplotype matching between a query and a panel for genealogical search. Bioinformatics, 35(14):i233–i241, July 2019. doi:10.1093/bioinformatics/btz347.
  • [35] Gonzalo Navarro. Indexing highly repetitive string collections, part I: Repetitiveness measures. ACM Comput. Surv., 54(2), March 2021. doi:10.1145/3434399.
  • [36] Takaaki Nishimoto and Yasuo Tabei. Optimal-Time Queries on BWT-Runs Compressed Indexes. In Nikhil Bansal, Emanuela Merelli, and James Worrell, editors, 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021), volume 198 of Leibniz International Proceedings in Informatics (LIPIcs), pages 101:1–101:15, Dagstuhl, Germany, 2021. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi:10.4230/LIPIcs.ICALP.2021.101.
  • [37] Rajeev Raman and Satti Srinivasa Rao. Succinct dynamic dictionaries and trees. In Jos C. M. Baeten, Jan Karel Lenstra, Joachim Parrow, and Gerhard J. Woeginger, editors, Automata, Languages and Programming, pages 357–368, Berlin, Heidelberg, 2003. Springer Berlin Heidelberg. doi:10.1007/3-540-45061-0_30.
  • [38] Massimiliano Rossi, Marco Oliva, Ben Langmead, Travis Gagie, and Christina Boucher. MONI: A pangenomic index for finding maximal exact matches. Journal of Computational Biology, 29(2):169–187, 2022. PMID: 35041495. doi:10.1089/cmb.2021.0290.
  • [39] Ahsan Sanaullah, Seba Villalobos, Degui Zhi, and Shaojie Zhang. Haplotype matching with GBWT for pangenome graphs. bioRxiv, 2025. doi:10.1101/2025.02.03.634410.
  • [40] Ahsan Sanaullah, Degui Zhi, and Shaojie Zhang. d-PBWT: dynamic positional burrows–wheeler transform. Bioinformatics, 37(16):2390–2397, February 2021. doi:10.1093/bioinformatics/btab117.
  • [41] Ahsan Sanaullah, Degui Zhi, and Shaojie Zhang. An efficient data structure and algorithm for long-match query in run-length compressed BWT, 2025. doi:10.48550/arXiv.2505.15698.
  • [42] Pramesh Shakya, Ahsan Sanaullah, Degui Zhi, and Shaojie Zhang. Dynamic μ-PBWT: Dynamic run-length compressed pbwt for biobank scale data. In Sriram Sankararaman, editor, Research in Computational Molecular Biology, pages 209–226, Cham, 2025. Springer Nature Switzerland. doi:10.1007/978-3-031-90252-9_13.
  • [43] Vipin Singh, Shweta Pandey, and Anshu Bhardwaj. From the reference human genome to human pangenome: Premise, promise and challenge. Frontiers in Genetics, 13:1042550, 2022. doi:10.3389/fgene.2022.1042550.
  • [44] Jouni Sirén, Erik Garrison, Adam M Novak, Benedict Paten, and Richard Durbin. Haplotype-aware graph indexes. Bioinformatics, 36(2):400–407, July 2019. doi:10.1093/bioinformatics/btz575.
  • [45] Li Song and Ben Langmead. Centrifuger: lossless compression of microbial genomes for efficient and accurate metagenomic sequence classification. Genome Biology, 25(1):106, 2024. doi:10.1007/978-1-0716-3989-4_22.
  • [46] Dylan J. Taylor, Jordan M. Eizenga, Qiuhui Li, Arun Das, Katharine M. Jenike, Eimear E. Kenny, Karen H. Miga, Jean Monlong, Rajiv C. McCoy, Benedict Paten, and Michael C. Schatz. Beyond the Human Genome Project: The age of complete human genome sequences and pangenome references. Annual Review of Genomics and Human Genetics, 25(Volume 25, 2024):77–104, 2024. doi:10.1146/annurev-genom-021623-081639.
  • [47] Mikkel Thorup. Mihai pǎtraşcu: obituary and open problems. SIGACT News, 44(1):110–114, March 2013. doi:10.1145/2447712.2447737.
  • [48] Victor Wang, Ardalan Naseri, Shaojie Zhang, and Degui Zhi. Syllable-PBWT for space-efficient haplotype long-match query. Bioinformatics, 39(1):btac734, November 2022. doi:10.1093/bioinformatics/btac734.
  • [49] Mohsen Zakeri, Nathaniel K Brown, Omar Y Ahmed, Travis Gagie, and Ben Langmead. Movi: a fast and cache-efficient full-text pangenome index. iScience, 27(12), 2024. doi:10.1016/j.isci.2024.111464.

Appendix A Proofs

Lemma 3. [Restated, see original statement.]

The following three statements hold: (i) Let x be the integer satisfying px+i<px+1+ for some integer i[1,n]. Then ϕ(i)=ϕ(px+)+(ipx+); (ii) ϕ(pδ+[1]+)=1 and ϕ(pδ+[i]+)=ϕ(pδ+[i1]+)+d where d=pδ+[i1]+1+pδ+[i1]+; (iii) p1+=1.

Proof.

(i) Lemma 3(i) clearly holds for i=px+. We show that Lemma 3(i) holds for ipx+ (i.e., i>px+). Let st be the position in SA with sa-value px++t for an integer t[1,y] (i.e., SA[st]=px++t), where y=ipx+. Two adjacent positions st and st1 are contained in an interval [lv,lv+|Lv|1] on SA (i.e., st,st1[lv,lv+|Lv|1]), which corresponds to the v-th run Lv of L. This is because st is not the starting position of a run, i.e. (SA[st]=px++t){p1+,p2+,,pr+}. The LF function maps st to st1, where s0 is the position with sa-value vx. LF also maps st1 to st11, by Lemma 3(i) of [36]. The two mapping relationships established by LF produce y equalities ϕ(SA[s1])=ϕ(SA[s0])+1,ϕ(SA[s2])=ϕ(SA[s1])+1,,ϕ(SA[sy])=ϕ(SA[sy1])+1. The equalities lead to ϕ(SA[sy])=ϕ(SA[s0])+y, which represents ϕ(i)=ϕ(px+)+(ipx+) by SA[sy]=i,SA[s0]=px+, and y=ipx+.

(ii) Let p be the integer satisfying Lp=$. Then there exists an integer q such that pq+ is the sa-value at position lp+1 (lp+1=lp+1) if pr; otherwise if p=r, pq+ is the sa-value at position 1 and q=1. ϕ(pq+)=1, because SA[lp]=1 always holds. Hence ϕ(pδ+[1]+)=1 holds by δ+[1]=q.

Next, ϕ(pδ+[i]+)=ϕ(pδ+[i1]+)+d holds for any i[2,r] because (a) ϕ maps the interval [pδ+[i]+,pδ+[i]++d1] into the interval [ϕ(pδ+[i]+),ϕ(pδ+[i]+)+d1] by Lemma 3(i) for any i[1,r], (b) ϕ is a bijection from [1,n] to [1,n], and (c) ϕ(pδ+[1]+)<ϕ(pδ+[2]+)<<ϕ(pδ+[r]+) holds.

(iii) Recall that p is an integer satisfying Lp=$. Then there exists an integer q such that pq+ is the sa-value at position lp. Finally, recall SA[lp]=1. Hence, v1=vq=1 holds.

Appendix B Algorithm Pseudocodes

Pseudocode for the long LEM query algorithm is provided in Algorithm 1. The long LEM query algorithm is separated into subroutines to aid understanding. The long LEM query algorithm is provided as input the length threshold , the query/pattern P, and an OptBWTRL of the text T. Subroutines have access to their calling function’s variables implicitly. Structures of the OptBWTRL of T are referenced directly, i.e., ND[i] instead of OptBWTRLT.ND[i]. Finally, operations on move data structures use the notation of Nishimoto and Tabei [36].

Algorithm 1 LongLEMQuery: Long LEM Query.
Algorithm 2 LongAdvance: Long salcp-interval Advancement.
Algorithm 3 BalancedExtend: Balanced salcp-interval extension.
Algorithm 4 ComputeMiddle.
Algorithm 5 OutputMatches.
Algorithm 6 OutputMatchesDown.
Algorithm 7 OutputMatchesUp.