An Efficient Data Structure and Algorithm for Long-Match Query in Run-Length Compressed BWT
Abstract
String matching problems in bioinformatics are typically for finding exact substring matches between a query and a reference text. Previous formulations often focus on maximum exact matches (MEMs). However, multiple occurrences of substrings of the query in the text that are long enough but not maximal may not be captured by MEMs. Such long matches can be informative, especially when the text is a collection of similar sequences such as genomes. In this paper, we describe a new type of match between a pattern and a text that aren’t necessarily maximal in the query, but still contain useful matching information: locally maximal exact matches (LEMs). There are usually a large amount of LEMs, so we only consider those above some length threshold . These are referred to as long LEMs. The purpose of long LEMs is to capture substring matches between a query and a text that are not necessarily maximal in the pattern but still long enough to be important. Therefore efficient long LEMs finding algorithms are desired for these datasets. However, these datasets are too large to query on traditional string indexes. Fortunately, these datasets are very repetitive. Recently, compressed string indexes that take advantage of the redundancy in the data but retain efficient querying capability have been proposed as a solution. We therefore give an efficient algorithm for computing all the long LEMs of a query and a text in a BWT runs compressed string index. We describe an expected time algorithm that relies on an words space string index for outputting all long LEMs of a pattern with respect to a text given the matching statistics of the pattern with respect to the text. Here is the length of the query, is the number of long LEMs outputted, and is the number of runs in the BWT of the text. The space string index we describe relies on an adaptation of the move data structure by Nishimoto and Tabei. We are able to support queries in constant time given . In other words, we answer queries in constant time. These queries enable the efficient long LEM query. Long LEMs may provide useful similarity information between a pattern and a text that MEMs may ignore. This information is particularly useful in pangenome and biobank scale haplotype panel contexts.
Keywords and phrases:
BWT, LEM, Long LEM, MEM, Run Length Compressed BWT, Move Data Structure, PangenomeCopyright and License:
2012 ACM Subject Classification:
Theory of computation Pattern matching ; Theory of computation Data compressionFunding:
This work was supported by the National Institutes of Health grant R01 HG010086.Editors:
Broňa Brejová and Rob PatroSeries and Publisher:
Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
Bioinformatics sequence data is often large and very repetitive. Furthermore, efficient matching queries on the data are frequently needed for many biological analyses. Therefore, bioinformatics problems have incentivized and profited from the development of efficient string indexes. The Burrows-Wheeler transform (BWT) has thus been used in bioinformatics algorithms. The BWT is a permutation of a text that has found wide use in string indexing and data compression [11]. Position in the BWT of the text is essentially the character preceding the -th lexicographically smallest suffix of the text. Due to this lexicographic sorting, adjacent characters in the BWT correspond to the characters preceding highly locally similar suffixes of the text. Therefore, the BWT of highly repetitive texts tends to have large runs of one character, with an overall small number of runs. The BWT of highly repetitive texts therefore compresses well. In fact, the number of runs in BWT, , is sometimes used as a measure of the repetitiveness of a string [35]. Finally, given only the BWT of a text, the text can be reconstructed in linear time [11] and the BWT of a text can be constructed in linear time by construction of the suffix array [22]. The BWT ordering also allows efficient string indexes. In other words, given a pattern, find all occurrences of the pattern within the text. String indexes have been shown that output all occurrences of a pattern (a locate query) in space linear to the product of the length of the text and the size of the alphabet and time linear to the sum of the length of the pattern and the number of occurrences [18].
Compressed string indexes have also been shown [18, 3, 32]. These indexes output all occurrences of a pattern in space sublinear to the size of the text. Although the time complexity of locating these occurrences is not linear in the length of the pattern and the number of occurrences, they are typically independent of the length of the text barring logarithmic factors and close to linear in the length of the pattern and number of occurrences. In particular for highly repetitive texts, the space of the index can be much smaller than the space of the text. Notably, recent compressed string indexes have achieved space linear to the number of runs in the BWT () [19, 36]. The r-index by Gagie et al. was the first compressed string index offering close to linear time locate queries in space [19]. Nishimoto and Tabei recently improved on this result with their OptBWTR, which achieves linear time locate queries for texts with alphabets of size polylogarithmic in the length of the text. OptBWTR relies on the move data structure, which was introduced in the same paper [36].
Compressed string indexes have been fruitfully applied to the growing collection of bioinformatics data. Over the past two decades, large collections of genomics data have grown increasingly larger in size. For example, the UK Biobank has whole genome sequencing data of roughly one million haplotypes [29], and the All of Us program has released whole genome sequencing data of half a million haplotypes [7]. Furthermore, recent arguments have been made that a human reference pangenome should be used instead of a singular human reference genome to avoid reference bias in downstream analyses [33, 46, 43]. The Human Pangenome Reference Consortium has released a draft human pangenome reference of more than two hundred high quality phased diploid assemblies and is planning to release over three hundred and fifty in the final release [30, 12]. The UK Biobank whole genome sequencing data has 1.5 billion variants, the All of Us whole genome sequencing data has 1 billion variants, and the typical diploid assembly in the draft human pangenome has 6 billion bases. Therefore, these datasets have 1,500 trillion, 250 trillion, and 1.2 trillion characters each respectively. However, while very large, these datasets are very repetitive. Furthermore, queries on these datasets are frequently needed for biological applications including read mapping[27, 24], read alignment [28], read classification for metagenomes [45, 1, 15] or pangenomes [10]. Many general purpose compressed string indexes have also been implemented for exact pattern matching and matching substrings of the pattern [38, 49, 16, 26]. These indexes may compute the maximal exact matches (MEMs) of the pattern with respect to the text. MEMs are matches between the pattern and the text that cannot be extended in the pattern.
While MEMs typically refer to matches that are maximal in the pattern, matches that are simultaneously maximal in the pattern and the text may sometimes be desired. Notably, in two data structures related to the BWT, algorithms for outputting matches that are simultaneously maximal in the pattern and the text have already been developed. These data structures are the positional Burrows-Wheeler transform (PBWT) and the graph Burrows-Wheeler transform (GBWT) [17, 44]. In the PBWT and GBWT, matches that cannot be extended in the pattern are referred to as set maximal matches, and matches that cannot be simultaneously extended in the pattern are referred to as locally maximal matches. Locally maximal matches that are longer than some length threshold are referred to as -long matches, or long matches for short. Algorithms for outputting set maximal matches and long matches have been published in the PBWT in uncompressed [17, 34, 40] and compressed space [13, 42, 48, 8]. Algorithms for outputting these matches have also been published for the GBWT in compressed space [39].
In this paper, we use these concepts in the traditional pattern and text context, and name matches that cannot be simultaneously extended in the pattern and the text locally maximal exact matches (LEMs). LEMs that are longer than are long LEMs. The distinction between matches that do not extend in the pattern and matches that do not extend simultaneously in the pattern and the text has been made before. Notably, in ropebwt3 MEMs refers to LEMs of our paper and super maximal exact matches (SMEMs) refers to MEMs of our paper [26]. The term SMEM has been used in place of MEM in a few papers to avoid the confusion in terminology [8, 15, 26, 13], however MEM is still the most common term by far for matches that cannot be extended in the pattern. The authors are not aware of any published algorithms for the computation of LEMs or long LEMs.
In this work, we describe an algorithm for outputting all long LEMs of a pattern with respect to a text in expected time given the matching statistics of the pattern with respect to the text, where is the length of the pattern and is the number of long LEMs it has with respect to the text. In order to do so, we modify the OptBWTR data structure of Nishimoto and Tabei to also compute given (i.e. compute ). We name this modified OptBWTR, OptBWTRL (i.e. OptBWTR for long LEMs or OptBWTR with LCP). OptBWTRL maintains the words space complexity of OptBWTR and computes and in constant time. The long LEM finding algorithm also requires as input an OptBWTRL of the text. We also discuss possible future work related to this paper, including avenues for improving the results, utilization of constant time computation to speed up matching statistics computation, and biological applications of long LEMs. Long LEMs may have many biological applications, from identity by descent segment detection and local ancestry inferences, to seeds or anchors for approximate matching algorithms for genome to genome alignment, genome to pangenome, read to genome or other alignments. In this paper, our main contributions are the following:
-
OptBWTRL: OptBWTRL is an words space data structure that maintains the capabilities of OptBWTR and adds the ability to compute and long LEMs efficiently. is the number of runs in the BWT of the text.
-
–
PLCP: OptBWTRL enables constant time computation in space. Note that , therefore computation in constant time allows computation in constant time given .
-
–
Long LEM Query: We describe an expected time long LEM query for pattern and text given the matching statistics of with respect to . The underlying index (OptBWTRL) uses space. is the length of and is the number of long LEMs has with respect to . A deterministic time bound for a similar algorithm we show is .
-
–
-
Long LEM Query with random access to the text: Given time random access to the text and a BWT related index, algorithms for computing matching statistics efficiently are known. Therefore, our long LEM query algorithm results in the following.
-
–
In Uncompressed Space: An algorithm for long LEM query in expected time in uncompressed string indexes such as the FM Index (Corollary 3.2, variant with space, where is the length of the text and is the size of the alphabet) [18].
- –
-
–
2 Background
In this section, we review definitions used throughout the rest of the paper. We begin with strings, then in Section 2.1, we review BWT related concepts. In Section 2.2, we give a short overview of the results of Nishimoto and Tabei in [36]. Matching statistics are reviewed in Section 2.3. Finally we review maximal exact matches (MEMs) and define locally maximal exact matches (LEMs) in Section 2.4.
Let be an ordered alphabet of size . The size (number of characters it contains) of a string is represented by . refers to a text of length () where the last character is . The character is lexicographically smaller than all other characters in and occurs only in the last position of . The -th character of is , . refers to the substring of that starts at position and ends at position , inclusive (). Prefix of is the string , suffix of is . The longest common prefix of two strings and is referred to by . is the largest value s.t. and (then, ). A string being lexicographically smaller than is represented by . If , and . If , iff or .
2.1 Burrows-Wheeler Transform
The Suffix Array () of a text is an array of length where the -th position stores the index of the -th lexicographically smallest suffix of . Therefore, . The Burrows-Wheeler Transform (BWT) of a text is a string of length where the -th character in the string is the -th character of (the -th character if ). The array is an array of length that stores the position of the previous suffix in the suffix array, s.t. for all , s.t. for . The array stores at position , the suffix above suffix in the suffix array, i.e. if , ( if ). The array stores at position , the suffix below suffix in the suffix array, i.e. if , ( if ). Therefore, and . The array is an array of length where stores the length of the longest common prefix of suffix and . and for , . The (permuted ) array is an array of length where the array is stored by suffix index. Therefore, if , . Finally, the inverse suffix array, , is an array of length that stores at position the position of suffix in the suffix array, if , . and . The and arrays are permutations of the integers in . and are inverses of each other and and are inverses of each other.
The run-length Burrows-Wheeler Transform (RLBWT) is the run-length encoding of the BWT of a text. Call the BWT of text . Then, is partitioned into nonempty substrings . is a substring of corresponding to the -th run of . A run is a maximal repetition of the same character in . Therefore, for all and for all . is the starting position of the run in . The RLBWT is represented as pairs for . All of these structures can be seen in Figure 1 for a text .
2.2 Move Data Structure
The move data structure is a data structure for representing a permutation of a contiguous range of integers efficiently. It was introduced by Nishimoto and Tabei [36]. In the original introduction, the structure was described for a permutation of . This of course may be extended to any bijective function from a contiguous range of integers to another contiguous range of integers. The move data structure takes space proportional to the number of intervals conserved in the function. An interval is conserved in a bijective function from a contiguous range of integers to another contiguous range of integers if for any in the interval, (therefore, and ). The move data structure computes the represented function in constant time. The important arrays, and , are permutations of with conserved intervals, where is the number of runs in the BWT. Therefore, Nishimoto and Tabei define the OptBWTR data structures using move data structures. OptBWTR supports efficient count and locate queries in BWT-runs compressed space. Below, we more formally review some of the results from their paper [36].
2.2.1 Disjoint Interval Sequence
is a sequence of pairs of integers. Let . Then is a disjoint interval sequence iff there exists a permutation of s.t. (i) , (ii) , and (iii) . is referred to as the -th input interval, and as the -th output interval. The input intervals don’t overlap, and their union is . The output intervals don’t overlap and their union is .
A move query on a disjoint interval sequence takes as input , where is an index in and is the index of the input interval sequence that contains it, and and . The move query outputs where and , i.e. is the mapping of position from the input to output intervals by and is the index of the input interval that contains . , a permutation of with conserved intervals, can be represented by a disjoint interval sequence where the input intervals are the conserved intervals and the output intervals are the mapping of the input intervals by . Then, a move query of returning computes by .
Nishimoto and Tabei show that move queries on a disjoint interval sequence of input intervals (and therefore output intervals) can be computed in constant time and space with the move data structure. The move data structure is built by splitting the input intervals of into at most intervals. This results in a disjoint interval sequence of at most input intervals (and an equivalent number of output intervals) that represents the same permutation as the original disjoint interval sequence. The split interval sequence of that the move data structure is built on is referred to as a balanced interval sequence. The notation for a balanced interval sequence of is , and the notation for a move data structure of is . In this paper, we occasionally use input interval of as shorthand for input interval of (for example, -th input interval of a move data structure refers to the -th input interval of the balanced interval sequence it was built on). Brown et al. extend the balanced interval sequence result of Nishimoto and Tabei to splitting ’s intervals into at most intervals, resulting in a move data structure with time move query computation for any [9].
2.2.2 OptBWTR
The arrays and are permutations of with conserved intervals. For , a conserved interval is within a run in the BWT. For , a conserved interval is a range of suffixes of that don’t occur at the bottom of a run in the BWT (except the first position of the interval may be at the bottom of a run). Therefore, Nishimoto and Tabei define the OptBWTR data structure as the combination of the move data structures of the and functions along with a rank-select data structure on an length string . OptBWTR supports time count queries and time locate queries in words of space, where is the number of runs in the BWT of the text, is the length of the pattern, is the number of occurrences of the pattern in the text, is the word size, is the size of the alphabet (Theorem 9 of [36]). The input intervals of , the disjoint interval sequence of the move data structure of , are contained within a run in the BWT. Call the -th input interval of . Then, , where is the number of input intervals of , is the BWT of , and . Call the disjoint interval sequence of the move data structure of , and its -th input interval. OptBWTR is composed of:
-
move data structures for and ( and respectively),
-
a rank-select data structure on (),
-
samples of the at the beginning of input intervals of the LF move data structure (, where ), and
-
the index of the input interval of the move data structure that contains each sample in (, where ).
2.3 Matching Statistics
The matching statistics of a pattern with respect to a text represents information on the local similarity of the pattern to the text. The matching statistics of with respect to , , is an array of length that stores at position three values: , , and . is the length of the longest substring of starting at that occurs in . is a suffix of that has a longest common prefix with of length (or equivalently, is the starting position of an occurrence of in ). is the index in the SA of that has value . Formally for all ,
-
,
-
, and
-
.
When and are clear from the context, we omit them from and refer to the matching statistics of with respect to as .
2.4 Maximal and Locally Maximal Exact Matches
For a pattern and a text (, a maximal exact match (MEM), , is a match between and that cannot be extended left or right in the pattern. Formally, ( or doesn’t occur in ) and ( or doesn’t occur in ). A MEM can be fully specified by the triple where is the length of the match and and are the starting positions of the match in the pattern and the text respectively.
For a pattern and a text , a locally maximal exact match (LEM), , is a match between and that cannot be simultaneously extended in the pattern and the text. The match cannot be simultaneously extended left in the pattern and the text. Likewise, it cannot be simultaneously extended right in the pattern and the text. Formally, ( or or ) and ( or or ). A LEM can also be fully specified by the triple where is the length of the LEM and . For some length threshold , a long LEM is a LEM with length at least . See Figure 2 for a depiction of MEMs and LEMs in a text representing a pangenome.
3 Methods
Here we describe the main results of our paper. In Section 3.1, we prove move data structures can compute and in constant time. Then we describe OptBWTRL, our modification of OptBWTR that utilizes these move data structures. In Section 3.2, we describe multiple algorithms for long LEM query provided an OptBWTRL of the text and matching statistics of the pattern with respect to the text.
3.1 Computing LCP with Move Data Structures
We define to be the -th smallest suffix that occurs at the top of a run in the BWT. Therefore let (i) and (ii) . Lemma 1 and its proof are phrased very similarly to Lemma 4 in [36] to demonstrate its derivativeness and the similarity of the properties.
Lemma 1.
(i) Let be the integer satisfying for some integer . Then .
Proof.
Lemma 1(i) clearly holds for . We show that Lemma 4(i) holds for (i.e., ). Let be the position in with sa-value for an integer (i.e., ) where . Two adjacent positions and are contained in an interval on which corresponds to the -th run of . This is because is not the starting position of a run, i.e., . The LF function maps to , where is the position with sa-value . LF also maps to by Lemma 3(i) of [36]. due to and being in the same interval on , . These relationships produce equalities . The equalities lead to , and therefore . Which represents by and .
Lemma 2.
(i) Let be the integer satisfying for some integer . Then .
Proof.
3.1.1 Move Data Structure for
remains as defined in the previous section. Let be a permutation of satisfying . has the following properties on RLBWT.
Lemma 3.
The following three statements hold: (i) Let x be the integer satisfying for some integer . Then ; (ii) and where ; (iii) .
Proof.
See Appendix A.
We can compute by using a move data structure. A sequence consists of pairs . satisfies the three conditions of a disjoint interval sequence by Lemma 3, and is equal to the bijective function represented by .
Lemma 4.
(i) is a disjoint interval sequence. (ii) is equal to the bijective function represented by .
Proof.
(i) has the following three properties: (a) holds by Lemma 3(iii) and the definition of the sequence , (b) by Lemma 3(ii), and (c) . Therefore satisfies the three conditions of the disjoint interval sequence.
(ii) Let be the bijective function represented by . Then where is the integer such that holds. On the other hand, holds by Lemma 3(i). Therefore and and are the same function.
Let be the move data structure built on the balanced interval sequence for . By Lemma 6 of [36], requires words of space. By the results of Section 3.2 of [36], evaluation of a move query using a move data structure for a balanced disjoint interval sequence takes constant time. Finally, holds for a move query by Lemma 4. Therefore we have proved (i) of the following lemma.
Lemma 5.
(i) There exists a move data structure that computes in space and constant time given , the index of the input interval of that contains . (ii) This move data structure can be modified to also compute in space and constant time given and . Call the modified move data structure .
Proof.
Say that has input intervals and the -th input interval is . Then we modify the move data structure by adding an array of size . stores the value for each . (.) by Lemma 1. Therefore is computed in constant time by evaluating . Call this modified move data structure .
A similar function that we may need to compute is given , i.e. given , compute . Nishimoto and Tabei described , a move data structure computing . can be modified to compute in constant time as well in a similar fashion to the modification of the data structure. Call the disjoint interval sequence is built on . Call the -th input interval of , where has input intervals and . Note that by the construction of Nishimoto and Tabei, every suffix at the bottom of a BWT run is the start of an input interval, . Then for any . Therefore (see Lemma 4 in [36]). Below, we prove (ii) that for any , therefore .
Lemma 6.
Let be the integer satisfying for some . Then (i) . Therefore, (ii) for any , , , and .
Proof.
Lemma 6(i) clearly holds for . We show that Lemma 6(i) holds for (i.e. ). Let be the position in the SA with sa-value for an integer where . Two adjacent positions and are contained in an interval corresponding to the -th run in the BWT (). This is because is not the ending position of a run, . The LF function maps to , where is the position in the SA with value . LF also maps to by Lemma 3(i) of [36]. since and are in the same interval in the BWT, . These relationships produce equalities ,,…,. This leads to . Which leads to by and .
Therefore, the move data structure that computes , , can be modified to compute as well.
Lemma 7.
can be modified to compute as well as in constant time and space given , the index of the input interval of that contains . Call the modified move data structure .
Proof.
We modify the move data structure by , an array of size where the -th element stores the value . Then, can be computed in constant time by evaluating by and Lemma 6(i). We call this modified move data structure .
3.1.2 OptBWTRL
We slightly modify OptBWTR by adding a move data structure that computes and and arrays that allow jumping to the closest input intervals corresponding to adjacent runs in the BWT in constant time. We call it OptBWTRL, L for LCP and long LEMs. In addition to the structures of OptBWTR, OptBWTRL contains , , , , , , and . Furthermore, the move data structure of OptBWTR is replaced by the move data structure described in Lemma 7. Recall that is the disjoint interval sequence the move data structure is built on. Let contain input intervals where the -th input interval is , and . Further recall that every input interval is contained in a run in the BWT, i.e. for all , . Then, and are arrays of length where contains the index of the next input interval with a different character in the BWT and contains the index of the previous input interval with a different character in the BWT. Formally, for all , , and . If no such exists, and . and can be constructed in (and therefore, ) time given . are samples of the at the ends of input intervals of . are the indices of the input intervals of the top of input interval suffix array samples in . and are the indices of the input intervals of the bottom of input interval suffix array samples in and respectively. Below, let and be the -th input intervals of and respectively. Then, OptBWTRL differs from OptBWTR in the following ways.
3.2 Computing Long LEMs
Here, we describe an algorithm for outputting all the long LEMs of a pattern with respect to a text in expected time using an index of size words given the matching statistics of with respect to and an OptBWTRL of . is the length of and is the number of long LEMs has with . Furthermore, the matching statistics are slightly augmented to contain the input intervals it’s corresponding data are contained in. In particular, the input interval of that is contained in is stored as , the input interval of that is contained in is stored as , and the input interval of that is contained in is stored as . Note that the long LEM query algorithm we present here does not necessarily result in an expected time algorithm for outputting all long LEMs of with respect to given a OptBWTRL of because an algorithm for computing the matching statistics of with respect to in time and space is not known.
We define the balanced salcp-interval of a string as a 13-tuple where is the sa-interval of , , and are the indexes of the input intervals of that contain and respectively, and are the indexes of the input intervals of containing and respectively, and and are the indexes of the input intervals of of and respectively. The balanced salcp-interval keeps track of three positions in the sa-interval: the top (), bottom (), and the middle (). is any position in the interval, it may be equivalent to the top or the bottom. Each position also maintains its corresponding suffix array value and index of the input interval of the position in ( and for top, middle, and bottom respectively). Finally, the top maintains the index of the input interval of its sa-value in (), the bottom maintains the index of the input interval of its sa-value in (), and the middle maintains the index of the input interval of its sa-value in both and ( and respectively). The balanced salcp-interval of a string with no occurrences in is undefined.
The high level idea of the long LEM finding algorithm is to compute the balanced salcp-interval of adjacent substrings of length of the pattern while outputting long LEMs along the way. I.E. given the balanced salcp-interval of , compute the salcp-interval of and output all long LEMs of the form . We call this problem long salcp-interval advancement. Given an algorithm for long salcp-interval advancement in time, a straightforward long LEM computation algorithm is iterating from , repeatedly advancing the salcp-interval and outputting all long LEMs in time. In Section 3.2.1, we outline an algorithm for balanced salcp-interval extension and in Section 3.2.2, we outline an algorithm for long salcp-interval advancement. These algorithms result in an expected time algorithm for long LEM computation.
3.2.1 Balanced salcp-interval Extension
Here, we provide algorithms for obtaining the balanced salcp-interval of given the balanced salcp-interval of and an OptBWTRL of . The first algorithm runs in time by making use of the rank-select structure on . The second runs in time linear to the number of runs in the balanced salcp-interval of , , by iterating through them. Call the balanced salcp-interval of and the balanced salcp-interval of . Recall that and are the starting indexes of the -th input intervals of , and respectively.
We first discuss the computation of the top values, and . If , then and can be computed with in constant time using . , and if , otherwise . If , , where is the first location in such that . If is the index of first input interval such that , then , where is the starting position of the -th input interval of . can be computed in time using or time by iterating through the runs of balanced salcp-interval of using the array. Then, and can be computed with in constant time using . , and .
The bottom values , and can be computed in a similar fashion. If , then and can be computed with in constant time using . , and if , otherwise . If , then , where is the last location in such that . If is the index of the last input input interval such that , then . can be computed in time using or time by iterating through the runs of the balanced salcp-interval of using the array. Then, and can be computed with in constant time using . Finally, , and .
Lastly, the middle values and need to be computed. Pseudocode for middle value computation is provided as Algorithm 4 in Appendix B. If , then and can be computed in constant time with using . . if , otherwise . Finally, if , otherwise . If and occurs in , then there is a preceding or succeeding input interval of that intersects with the balanced salcp-interval of and has value in the BWT. Suppose there is a preceding interval, . Then the middle values can be updated similar to the bottom values. Let , then and are computed in constant time with , , , and if , otherwise . If there is no preceding interval, then set to the index of the succeeding interval. Then the middle values can be updated similar to the top values. Let , then and are computed in constant time with , , , and if , otherwise . The index, , of the preceding or succeeding interval in the salcp-interval of with value in the BWT can be found in time with or time by iterating through the runs in the BWT with and . Therefore, the balanced salcp-interval of can be computed in time or time given the balanced salcp-interval of . See Algorithm 3 in Appendix B for the time algorithm pseudocode.
3.2.2 Long salcp-interval Advancement
Let be the number of long LEMs of the form and . be the number of long LEMs of the form . Here, we describe an algorithm that computes the balanced salcp-interval of and a dynamic dictionary of the suffixes of present in the balanced salcp-interval of . This algorithm also outputs all long LEMs of the form . The algorithm runs in expected time and requires as input the balanced salcp-interval of , an OptBWTRL of , and a dynamic dictionary of the suffixes of present in the balanced salcp-interval of .
We begin with the description of the dynamic dictionary, . There are numerous dynamic dictionary data structures that support expected constant time insertion, deletion, and queries [5, 37, 6, 14]. Therefore, we maintain a dynamic dictionary of the suffixes in the balanced salcp-interval. More precisely, if the balanced salcp-interval of is , then the dynamic dictionary provided as input to the long salcp-interval advancement algorithm has elements. is contained in the dictionary and has the value associated with it. I.E. the value associated with each suffix of the text contained in the dictionary is the ending position (in the pattern) of the longest match between suffix of the text and suffix of the pattern. It is not possible that multiple suffixes of share the same key in . Each suffix of can occur only once in the dictionary because each suffix of can occur only once in any balanced salcp-interval. Each suffix of can occur only once in any balanced salcp-interval since each suffix of occurs exactly once in .
Here, we describe the procedure for outputting all long LEMs of the form in expected time (we call this ). The high level idea is to iterate through the input intervals of , skipping intervals corresponding to a run of in constant time per run using . We outline two functions: and . For both functions, represents a suffix of and is the index of the input interval that contains it in and in and respectively. represents the number of matches to output (directly above in for and directly below in for ) including . outputs a match , where , and removes the key-value pair from . for similarly outputs a match where , then removes the key-value pair from . Then, it recurses on , where and are computed in constant time using . operates in the same way as except it computes instead of (using instead of ). It is simple to see that and operate in expected time and output matches each. Now we utilize and to output the long LEMs of the form . If the salcp-interval of is fully contained in one input interval of , then . If , then there are no matches to output, otherwise, every suffix in the balanced salcp-interval needs to be outputted and we do so by calling and , where and are computed with and . In the case where the balanced salcp-interval of is not fully contained in one input interval , we do the following. For the first input interval, , if , then the long LEMs starting at in the text are outputted in expected time by calling . For any middle input interval , , if , then this run in the BWT is skipped, . Otherwise, if , then the long LEMs starting at are outputted by calling . For the last input interval, , if , then the long LEMs starting at are outputting by calling . Overall, outputting the long LEMs of the form takes expected time. Furthermore, for every run of character intersecting the salcp-interval of except the first one, there is a run of characters . Therefore Therefore outputting the long LEMs of the form takes expected time. See Algorithms 5, 6, and 7 in Appendix B for and related pseudocodes.
Finally, we must compute the balanced salcp-interval of . First suppose that the balanced salcp-interval of is nonempty. Then, we use the algorithm described in Section 3.2.1 to obtain the salcp-interval of in time. Now, let the salcp-interval of be and the salcp-interval of be . These salcp-intervals differ only by those suffixes of the text whose with has length exactly . There are exactly such suffixes. Furthermore, and . Finally, , , and , . Therefore, we initialize , , , and . Then, while , we (i) set if , (ii) set , (iii) update and by , and (iv) insert the key into with value . When , the final value has been computed. Similarly for , we initialize and . Then, while , we (i) set if , (ii) set , (iii) update and by , and (iv) insert the key into with value . When , the final value has been computed. This takes constant time per suffix added to the interval, therefore time.
If the balanced salcp-interval of is empty, the balanced salcp-interval of is only nonempty if . If it is, we initialize the balanced salcp-interval of to and insert the key into with value . Then, the interval is expanded to the salcp-interval of in time as in the other case.
In the case where the balanced salcp-interval of is empty, long salcp-interval advancement is performed in expected time . If it is not empty, the algorithm we have described first performs salcp-interval extension, obtaining the salcp-interval of in time and then takes expected time to compute the salcp-interval of from the salcp-interval of . Finally, . Therefore, the algorithm described here performs the long salcp-interval advancement in expected time. See Algorithm 2 in the Appendix for the pseudocode of this algorithm.
3.3 Time Complexity
If the above algorithm is iterated from , all long MEMs of the pattern with respect to the text are outputted. The time complexity of the algorithm is the sum of the time complexity of the long salcp-interval advancements. Note that the sum of for is and the sum of the for is also . Therefore, the time complexity of the algorithm overall is expected time. See Algorithm 1 in Appendix B for pseudocode. The algorithm takes space for the OptBWTRL and space for maintaining the dynamic dictionary [5]. Also note that if a deterministic time bound is desired, this algorithm runs in time with the same space by replacing the dictionary with a deterministic dictionary implemented by exponential search trees [47, 2]. Recall these complexities are when given the modified matching statistics. A linear time algorithm for computing matching statistics in space is not known. However, note that since the values of matching statistics are only needed for positions where , a straightforward algorithm for long LEM query follows from our algorithm in expected time when matching statistics are not given as input. This algorithm is obtained by computing the salcp-interval of each independently in time using the standard count algorithm described by Nishimoto and Tabei [36] followed by performing the long LEM query described here. The long LEM query algorithm described here results in an expected time long LEM query algorithm in uncompressed string indexes since algorithms for time matching statistics computation are known in uncompressed space.
4 Discussion
In this paper, we have described OptBWTRL, a modification of OptBWTR by Nishimoto and Tabei [36]. OptBWTRL adds the ability to compute and in constant time with additional move data structures. It also retains a space complexity of words. We also define locally maximal exact matches (LEMs), a match that cannot be simultaneously extended in the pattern and the text instead of one that is only unable to be extended in the pattern (MEMs). Finally, we describe an algorithm for outputting all LEMs with length at least in expected time given an OptBWTRL of the text and the matching statistics of the pattern with respect to the text. Note that this doesn’t result in a linear time algorithm for computing long LEMs in expected time in space because an algorithm for computing matching statistics of a pattern with respect to a text in linear time in space is not known. A deterministic bound for our long LEM query algorithm is . Finally, our long LEM query admits a direct computation of long LEMs in expected time without being provided matching statistics as input. This algorithm may be faster than computation of matching statistics followed by long LEM query in some cases, especially when is small.
It is likely that the move data structures and can be merged into one data structure that still takes space and computes and in constant time in one data structure. This would greatly reduce the number of samples needed per input interval . It would also allow bidirectional movement in the with one input interval index. This is left as future work. Possible other future work includes a practical implementation of the structures and algorithms described here, possibly as a modification of MOVI or b-move [49, 16]. Thirdly, the ability to compute in constant time may speed up matching statistics computation in compressed space. The intuition is that when , then when is large, the sa-interval of is small and is faster to compute with and and than with reverse . When is small, computing the sa-interval is faster with reverse . In Heng Li’s forward-backward algorithm [25], the new sa-interval is always computed by reverse . Computing the sa-interval with and reverse simultaneously is likely to be faster in practice than reverse alone while retaining the same worst case time complexity. The authors are currently exploring this idea. Furthermore, variable length threshold long LEMs may be useful. I.E. output long LEMs that are of the length of the MEMs in the same area. The authors believe a linear time algorithm for this or a similar problem given matching statistics exist. Finally, applications utilizing the long LEMs of a pattern with respect to the text is a possible fruitful direction for future work. Particularly in biological applications.
Long LEMs may have many biological applications. In general, in any application where long MEMs are used, long LEMs may also be used. Note that MEMs are a subset of LEMs and long MEMs are a subset of long LEMs. For example, in biobank scale haplotype datasets, long matches (long LEMs) in the PBWT have revealed genealogical relationships that set maximal matches (MEMs) are not able to uncover. As the compressive power of compressed string indexes increases and the number of variants in biobank scale whole genome sequencing data increases, storing unaligned genomes becomes more viable. In that case, algorithms for outputting long LEMs are needed to replace the long match algorithms in the PBWT. These matches have many applications from identity by descent segment detection, haplotype phasing, haplotype imputation, inferring genealogical relationships, and ancestry inference. Utilizing unaligned matches from a large collection of haplotype sequences instead of aligned matches from haplotypes aligned to a linear reference genome may also reduce reference bias. Finally, novel applications for long LEMs may exist, long LEMs may be used as seeds for seed and extend algorithms. They may be used as anchors for approximate matching matching algorithms [20] possibly for long read alignment to either a reference pangenome or a linear reference genome [31]. Lastly, genome to genome or genome to pangenome long LEMs detection may find similar sequences in the genomes on different genomic regions. MEMs detection may miss these similar sequences on different genomic regions because these matches will typically be overshadowed by larger encompassing matches that occur in roughly the same region in the pattern and the text. The long LEMs may therefore reveal old structural variants that MEMs and general alignment algorithms are both unable to reveal. MEMs don’t reveal these variants due to looking for only the largest matches in a region on the pattern while alignment algorithms don’t due to better alignments existing in closeby genomic regions or alignment algorithms being too computationally expensive to run on very large datasets.
Overall, we have provided a linear time algorithm for outputting all long LEMs of a pattern with respect to a text in BWT runs compressed space given the matching statistics of the pattern with respect to the text. We have also applied the move data structure of Nishimoto and Tabei to computation of in constant time. Therefore, we can compute given in constant time. We apply these results to modify the OptBWTR, creating OptBWTRL. OptBWTRL is an space data structure that computes and in constant time and long LEMs in linear time given matching statistics. These algorithms result in a linear time long LEM query algorithm in uncompressed string indexes.
References
- [1] Omar Y Ahmed, Massimiliano Rossi, Travis Gagie, Christina Boucher, and Ben Langmead. SPUMONI 2: improved classification using a pangenome index of minimizer digests. Genome Biology, 24(1):122, 2023. doi:10.1186/s13059-023-02958-1.
- [2] Arne Andersson and Mikkel Thorup. Dynamic ordered sets with exponential search trees. J. ACM, 54(3):13–es, June 2007. doi:10.1145/1236457.1236460.
- [3] Djamal Belazzougui, Fabio Cunial, Travis Gagie, Nicola Prezza, and Mathieu Raffinot. Composite repetition-aware data structures. In Ferdinando Cicalese, Ely Porat, and Ugo Vaccaro, editors, Combinatorial Pattern Matching, pages 26–39, Cham, 2015. Springer International Publishing. doi:10.1007/978-3-319-19929-0_3.
- [4] Djamal Belazzougui, Manuel Cáceres, Travis Gagie, Paweł Gawrychowski, Juha Kärkkäinen, Gonzalo Navarro, Alberto Ordóñez, Simon J. Puglisi, and Yasuo Tabei. Block trees. Journal of Computer and System Sciences, 117:1–22, 2021. doi:10.1016/j.jcss.2020.11.002.
- [5] Michael A. Bender, Martín Farach-Colton, John Kuszmaul, William Kuszmaul, and Mingmou Liu. On the optimal time/space tradeoff for hash tables. In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2022, pages 1284–1297, New York, NY, USA, 2022. Association for Computing Machinery. doi:10.1145/3519935.3519969.
- [6] Michael A. Bender, Martín Farach-Colton, John Kuszmaul, and William Kuszmaul. Modern Hashing Made Simple, pages 363–373. Society for Industrial and Applied Mathematics, 2024. doi:10.1137/1.9781611977936.33.
- [7] Alexander G. Bick, Ginger A. Metcalf, Kelsey R. Mayo, Lee Lichtenstein, Shimon Rura, Robert J. Carroll, Anjene Musick, Jodell E. Linder, I. King Jordan, Shashwat Deepali Nagar, Shivam Sharma, Robert Meller, Melissa Basford, Eric Boerwinkle, Mine S. Cicek, Kimberly F. Doheny, Evan E. Eichler, Stacey Gabriel, Richard A. Gibbs, David Glazer, Paul A. Harris, Gail P. Jarvik, Anthony Philippakis, Heidi L. Rehm, Dan M. Roden, Stephen N. Thibodeau, Scott Topper, Ashley L. Blegen, Samantha J. Wirkus, Victoria A. Wagner, Jeffrey G. Meyer, Donna M. Muzny, Eric Venner, Michelle Z. Mawhinney, Sean M. L. Griffith, Elvin Hsu, Hua Ling, Marcia K. Adams, Kimberly Walker, Jianhong Hu, Harsha Doddapaneni, Christie L. Kovar, Mullai Murugan, Shannon Dugan, Ziad Khan, Niall J. Lennon, Christina Austin-Tse, Eric Banks, Michael Gatzen, Namrata Gupta, Emma Henricks, Katie Larsson, Sheli McDonough, Steven M. Harrison, Christopher Kachulis, Matthew S. Lebo, Cynthia L. Neben, Marcie Steeves, Alicia Y. Zhou, Joshua D. Smith, Christian D. Frazar, Colleen P. Davis, Karynne E. Patterson, Marsha M. Wheeler, Sean McGee, Christina M. Lockwood, Brian H. Shirts, Colin C. Pritchard, Mitzi L. Murray, Valeria Vasta, Dru Leistritz, Matthew A. Richardson, Jillian G. Buchan, Aparna Radhakrishnan, Niklas Krumm, Brenna W. Ehmen, Sophie Schwartz, M. Morgan T. Aster, Kristian Cibulskis, Andrea Haessly, Rebecca Asch, Aurora Cremer, Kylee Degatano, Akum Shergill, Laura D. Gauthier, Samuel K. Lee, Aaron Hatcher, George B. Grant, Genevieve R. Brandt, Miguel Covarrubias, Ashley Able, Ashley E. Green, Jennifer Zhang, Henry R. Condon, Yuanyuan Wang, Moira K. Dillon, C. H. Albach, Wail Baalawi, Seung Hoan Choi, Xin Wang, Elisabeth A. Rosenthal, Andrea H. Ramirez, Sokny Lim, Siddhartha Nambiar, Bradley Ozenberger, Anastasia L. Wise, Chris Lunt, Geoffrey S. Ginsburg, Joshua C. Denny, The All of Us Research Program Genomics Investigators, Manuscript Writing Group, All of Us Research Program Genomics Principal Investigators, Mayo Biobank, Genome Center: Baylor-Hopkins Clinical Genome Center, Color Genome Center: Broad, Mass General Brigham Laboratory for Molecular Medicine, Genome Center: University of Washington, Data, Research Center, All of Us Research Demonstration Project Teams, and NIH All of Us Research Program Staff. Genomic data in the All of Us Research Program. Nature, 627(8003):340–346, March 2024. doi:10.1038/s41586-023-06957-x.
- [8] Paola Bonizzoni, Christina Boucher, Davide Cozzi, Travis Gagie, Dominik Köppl, and Massimiliano Rossi. Data structures for SMEM-finding in the PBWT. In Franco Maria Nardini, Nadia Pisanti, and Rossano Venturini, editors, String Processing and Information Retrieval, pages 89–101, Cham, 2023. Springer Nature Switzerland. doi:10.1007/978-3-031-43980-3_8.
- [9] Nathaniel K. Brown, Travis Gagie, and Massimiliano Rossi. RLBWT Tricks. In Christian Schulz and Bora Uçar, editors, 20th International Symposium on Experimental Algorithms (SEA 2022), volume 233 of Leibniz International Proceedings in Informatics (LIPIcs), pages 16:1–16:16, Dagstuhl, Germany, 2022. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi:10.4230/LIPIcs.SEA.2022.16.
- [10] Nathaniel K. Brown, Vikram S. Shivakumar, and Ben Langmead. Improved pangenomic classification accuracy with chain statistics. In Sriram Sankararaman, editor, Research in Computational Molecular Biology, pages 190–208, Cham, 2025. Springer Nature Switzerland. doi:10.1007/978-3-031-90252-9_12.
- [11] Michael Burrows. A block-sorting lossless data compression algorithm. SRS Research Report, 124, 1994.
- [12] Human Pangenome Reference Consortium. HPRC data release 2. [online]. URL: https://humanpangenome.org/hprc-data-release-2/.
- [13] Davide Cozzi, Massimiliano Rossi, Simone Rubinacci, Travis Gagie, Dominik Köppl, Christina Boucher, and Paola Bonizzoni. -PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data. Bioinformatics, 39(9):btad552, September 2023. doi:10.1093/bioinformatics/btad552.
- [14] Erik D. Demaine, Friedhelm Meyer auf der Heide, Rasmus Pagh, and Mihai Pǎtraşcu. De dictionariis dynamicis pauco spatio utentibus. In José R. Correa, Alejandro Hevia, and Marcos Kiwi, editors, LATIN 2006: Theoretical Informatics, pages 349–361, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg. doi:10.1007/11682462_34.
- [15] Lore Depuydt, Omar Y. Ahmed, Jan Fostier, Ben Langmead, and Travis Gagie. Run-length compressed metagenomic read classification with SMEM-finding and tagging. bioRxiv, 2025. doi:10.1101/2025.02.25.640119.
- [16] Lore Depuydt, Luca Renders, Simon Van de Vyver, Lennart Veys, Travis Gagie, and Jan Fostier. b-move: Faster Bidirectional Character Extensions in a Run-Length Compressed Index. In Solon P. Pissis and Wing-Kin Sung, editors, 24th International Workshop on Algorithms in Bioinformatics (WABI 2024), volume 312 of Leibniz International Proceedings in Informatics (LIPIcs), pages 10:1–10:18, Dagstuhl, Germany, 2024. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi:10.4230/LIPIcs.WABI.2024.10.
- [17] Richard Durbin. Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT). Bioinformatics, 30(9):1266–1272, January 2014. doi:10.1093/bioinformatics/btu014.
- [18] Paolo Ferragina and Giovanni Manzini. Indexing compressed text. J. ACM, 52(4):552–581, July 2005. doi:10.1145/1082036.1082039.
- [19] Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM, 67(1), January 2020. doi:10.1145/3375890.
- [20] Chirag Jain, Daniel Gibney, and Sharma V. Thankachan. Algorithms for colinear chaining with overlaps and gap costs. Journal of Computational Biology, 29(11):1237–1251, 2022. doi:10.1089/cmb.2022.0266.
- [21] Juha Kärkkäinen, Giovanni Manzini, and Simon J. Puglisi. Permuted longest-common-prefix array. In Gregory Kucherov and Esko Ukkonen, editors, Combinatorial Pattern Matching, pages 181–192, Berlin, Heidelberg, 2009. Springer Berlin Heidelberg. doi:10.1007/978-3-642-02441-2_17.
- [22] Pang Ko and Srinivas Aluru. Space efficient linear time construction of suffix arrays. Journal of Discrete Algorithms, 3(2):143–156, 2005. Combinatorial Pattern Matching (CPM) Special Issue. doi:10.1016/j.jda.2004.08.002.
- [23] Tomasz Kociumaka, Gonzalo Navarro, and Nicola Prezza. Towards a definitive measure of repetitiveness. In Yoshiharu Kohayakawa and Flávio Keidi Miyazawa, editors, LATIN 2020: Theoretical Informatics, pages 207–219, Cham, 2020. Springer International Publishing. doi:10.1007/978-3-030-61792-9_17.
- [24] Ben Langmead and Steven L Salzberg. Fast gapped-read alignment with Bowtie 2. Nature methods, 9(4):357–359, 2012. doi:10.1038/nmeth.1923.
- [25] Heng Li. Exploring single-sample snp and indel calling with whole-genome de novo assembly. Bioinformatics, 28(14):1838–1844, May 2012. doi:10.1093/bioinformatics/bts280.
- [26] Heng Li. BWT construction and search at the terabase scale. Bioinformatics, 40(12):btae717, November 2024. doi:10.1093/bioinformatics/btae717.
- [27] Heng Li and Richard Durbin. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14):1754–1760, May 2009. doi:10.1093/bioinformatics/btp324.
- [28] Heng Li and Richard Durbin. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics, 26(5):589–595, January 2010. doi:10.1093/bioinformatics/btp698.
- [29] Shuwei Li, Keren J Carss, Bjarni V Halldorsson, Adrian Cortes, and UK Biobank Whole-Genome Sequencing Consortium. Whole-genome sequencing of half-a-million UK Biobank participants. medRxiv, 2023. doi:10.1101/2023.12.06.23299426.
- [30] Wen-Wei Liao, Mobin Asri, Jana Ebler, Daniel Doerr, Marina Haukness, Glenn Hickey, Shuangjia Lu, Julian K Lucas, Jean Monlong, Haley J Abel, et al. A draft human pangenome reference. Nature, 617(7960):312–324, 2023. doi:10.1038/s41586-023-05896-x.
- [31] Yongchao Liu and Bertil Schmidt. Long read alignment based on maximal exact match seeds. Bioinformatics, 28(18):i318–i324, September 2012. doi:10.1093/bioinformatics/bts414.
- [32] Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki. Storage and retrieval of highly repetitive sequence collections. Journal of Computational Biology, 17(3):281–308, 2010. doi:10.1089/cmb.2009.0169.
- [33] Karen H Miga and Ting Wang. The need for a human pangenome reference sequence. Annual Review of Genomics and Human Genetics, 22(1):81–102, 2021. doi:10.1146/annurev-genom-120120-081921.
- [34] Ardalan Naseri, Erwin Holzhauser, Degui Zhi, and Shaojie Zhang. Efficient haplotype matching between a query and a panel for genealogical search. Bioinformatics, 35(14):i233–i241, July 2019. doi:10.1093/bioinformatics/btz347.
- [35] Gonzalo Navarro. Indexing highly repetitive string collections, part I: Repetitiveness measures. ACM Comput. Surv., 54(2), March 2021. doi:10.1145/3434399.
- [36] Takaaki Nishimoto and Yasuo Tabei. Optimal-Time Queries on BWT-Runs Compressed Indexes. In Nikhil Bansal, Emanuela Merelli, and James Worrell, editors, 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021), volume 198 of Leibniz International Proceedings in Informatics (LIPIcs), pages 101:1–101:15, Dagstuhl, Germany, 2021. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. doi:10.4230/LIPIcs.ICALP.2021.101.
- [37] Rajeev Raman and Satti Srinivasa Rao. Succinct dynamic dictionaries and trees. In Jos C. M. Baeten, Jan Karel Lenstra, Joachim Parrow, and Gerhard J. Woeginger, editors, Automata, Languages and Programming, pages 357–368, Berlin, Heidelberg, 2003. Springer Berlin Heidelberg. doi:10.1007/3-540-45061-0_30.
- [38] Massimiliano Rossi, Marco Oliva, Ben Langmead, Travis Gagie, and Christina Boucher. MONI: A pangenomic index for finding maximal exact matches. Journal of Computational Biology, 29(2):169–187, 2022. PMID: 35041495. doi:10.1089/cmb.2021.0290.
- [39] Ahsan Sanaullah, Seba Villalobos, Degui Zhi, and Shaojie Zhang. Haplotype matching with GBWT for pangenome graphs. bioRxiv, 2025. doi:10.1101/2025.02.03.634410.
- [40] Ahsan Sanaullah, Degui Zhi, and Shaojie Zhang. d-PBWT: dynamic positional burrows–wheeler transform. Bioinformatics, 37(16):2390–2397, February 2021. doi:10.1093/bioinformatics/btab117.
- [41] Ahsan Sanaullah, Degui Zhi, and Shaojie Zhang. An efficient data structure and algorithm for long-match query in run-length compressed BWT, 2025. doi:10.48550/arXiv.2505.15698.
- [42] Pramesh Shakya, Ahsan Sanaullah, Degui Zhi, and Shaojie Zhang. Dynamic -PBWT: Dynamic run-length compressed pbwt for biobank scale data. In Sriram Sankararaman, editor, Research in Computational Molecular Biology, pages 209–226, Cham, 2025. Springer Nature Switzerland. doi:10.1007/978-3-031-90252-9_13.
- [43] Vipin Singh, Shweta Pandey, and Anshu Bhardwaj. From the reference human genome to human pangenome: Premise, promise and challenge. Frontiers in Genetics, 13:1042550, 2022. doi:10.3389/fgene.2022.1042550.
- [44] Jouni Sirén, Erik Garrison, Adam M Novak, Benedict Paten, and Richard Durbin. Haplotype-aware graph indexes. Bioinformatics, 36(2):400–407, July 2019. doi:10.1093/bioinformatics/btz575.
- [45] Li Song and Ben Langmead. Centrifuger: lossless compression of microbial genomes for efficient and accurate metagenomic sequence classification. Genome Biology, 25(1):106, 2024. doi:10.1007/978-1-0716-3989-4_22.
- [46] Dylan J. Taylor, Jordan M. Eizenga, Qiuhui Li, Arun Das, Katharine M. Jenike, Eimear E. Kenny, Karen H. Miga, Jean Monlong, Rajiv C. McCoy, Benedict Paten, and Michael C. Schatz. Beyond the Human Genome Project: The age of complete human genome sequences and pangenome references. Annual Review of Genomics and Human Genetics, 25(Volume 25, 2024):77–104, 2024. doi:10.1146/annurev-genom-021623-081639.
- [47] Mikkel Thorup. Mihai pǎtraşcu: obituary and open problems. SIGACT News, 44(1):110–114, March 2013. doi:10.1145/2447712.2447737.
- [48] Victor Wang, Ardalan Naseri, Shaojie Zhang, and Degui Zhi. Syllable-PBWT for space-efficient haplotype long-match query. Bioinformatics, 39(1):btac734, November 2022. doi:10.1093/bioinformatics/btac734.
- [49] Mohsen Zakeri, Nathaniel K Brown, Omar Y Ahmed, Travis Gagie, and Ben Langmead. Movi: a fast and cache-efficient full-text pangenome index. iScience, 27(12), 2024. doi:10.1016/j.isci.2024.111464.
Appendix A Proofs
Lemma 3. [Restated, see original statement.]
The following three statements hold: (i) Let x be the integer satisfying for some integer . Then ; (ii) and where ; (iii) .
Proof.
(i) Lemma 3(i) clearly holds for . We show that Lemma 3(i) holds for (i.e., ). Let be the position in with sa-value for an integer (i.e., ), where . Two adjacent positions and are contained in an interval on SA (i.e., ), which corresponds to the -th run of . This is because is not the starting position of a run, i.e. . The LF function maps to , where is the position with sa-value . LF also maps to , by Lemma 3(i) of [36]. The two mapping relationships established by LF produce equalities . The equalities lead to , which represents by , and .
(ii) Let be the integer satisfying . Then there exists an integer such that is the sa-value at position () if ; otherwise if , is the sa-value at position 1 and . , because always holds. Hence holds by .
Next, holds for any because (a) maps the interval into the interval by Lemma 3(i) for any , (b) is a bijection from to , and (c) holds.
(iii) Recall that is an integer satisfying . Then there exists an integer such that is the sa-value at position . Finally, recall . Hence, holds.
Appendix B Algorithm Pseudocodes
Pseudocode for the long LEM query algorithm is provided in Algorithm 1. The long LEM query algorithm is separated into subroutines to aid understanding. The long LEM query algorithm is provided as input the length threshold , the query/pattern , and an OptBWTRL of the text . Subroutines have access to their calling function’s variables implicitly. Structures of the OptBWTRL of are referenced directly, i.e., instead of . Finally, operations on move data structures use the notation of Nishimoto and Tabei [36].
