Graph Indexing Beyond Wheeler Graphs
Abstract
After the discovery of the FM index, which linked the Burrows-Wheeler transform (BWT) to pattern matching on strings, several contemporaneous strands of research began on indexing more complex structures with the BWT, such as tries, finite languages, de Bruijn graphs, and aligned sequences. These directions can now be viewed as culminating in the theory of Wheeler Graphs, but they sometimes go beyond it. This chapter reviews the significant body of “proto Wheeler Graph” indexes, many of which exploit characteristics of their specific case to outperform Wheeler graphs, especially in practice.
Keywords and phrases: indexing, compression, compressed data structures, string algorithms, pattern matching
Funding: Jarno N. Alanko: Funded by the Helsinki Institute for Information Technology (HIIT).
2012 ACM Subject Classification: Theory of computation → Sorting and searching; Theory of computation → Graph algorithms analysis; Theory of computation → Pattern matching; Theory of computation → Formal languages and automata theory
Acknowledgements: We wish to thank Hideo Bannai for spotting a mistake in the original version of Definition 22.
Editors: Paolo Ferragina, Travis Gagie, and Gonzalo Navarro
Series and Publisher: Open Access Series in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
After the discovery of the FM index [22], designed for pattern matching on strings using the Burrows-Wheeler transform (BWT) [11], many groups explored BWT-inspired indexing of more complex structures, such as tries [21], finite languages [48, 49], de Bruijn graphs [10, 5], and aligned sequences [20, 43]. These directions can now be viewed as culminating in the theory of Wheeler Graphs, which is covered in the next chapter. Before that, however, in this chapter we review the significant body of “proto Wheeler Graph” indexes, many of which exploit characteristics of their specific case to outperform Wheeler graphs, especially in practice. While many of the approaches reviewed (but not all) can indeed be encapsulated in the general Wheeler Graph framework, the connection usually covers only the last indexing step. Construction of the object to be indexed is typically the most complex step, and it is tailored for each approach. These construction considerations, especially, go beyond the Wheeler Graph framework.
We start with key notions and index structures for strings and string collections. Before proceeding to general graphs, we extend the techniques to tries. Then we continue with de Bruijn graphs that represent overlaps of length-k substrings of a string collection. The studied extensions of the BWT are then shown to work for general graphs, albeit with the caveat of worst-case exponential index size. This worst-case behavior is then avoided by resorting back to de Bruijn graphs, but this time indexing the paths of a graph rather than a string collection. Finally, we explore an approach to directly build graphs from aligned sequences that have indexing properties similar to de Bruijn graphs, but without a fixed parameter for the substring length.
2 Notation
2.1 Strings
Let T = T[1..n] be a string of length n over an alphabet Σ of size σ. If the string and the alphabet are clear from the context, we often denote their sizes simply by n and σ. We write substrings as T[i..j] and call substrings of the form T[1..j] and T[i..n] prefixes and suffixes, respectively. Let $ be the smallest symbol in the alphabet. We often use it as a sentinel that occurs only at the end of the string.
Let T be a string of length n, let i be an offset in it, and let c be a symbol. We define T.rank(i, c) as the number of occurrences of symbol c in the prefix T[1..i]. For convenience, we define T.rank(i, c) = 0 when i < 1 and T.rank(i, c) = T.rank(n, c) when i > n.
We define the inverse operation T.select(i, c) as the position of the occurrence of symbol c of rank i. Let occ(c) = T.rank(n, c) be the total number of occurrences of symbol c in the string. For 1 ≤ i ≤ occ(c), we have T.select(i, c) = j such that T[j] = c and T.rank(j, c) = i. We also define T.select(0, c) = 0 and T.select(occ(c) + 1, c) = n + 1 for convenience and leave the function undefined for the remaining values of i.
A bitvector is a data structure that stores a binary string and supports efficient rank/select queries. If we store the string as such, we can support both queries in O(1) time and n + o(n) bits of space [14, 37]. When the bitvector is sparse, we can achieve better space usage by using Elias–Fano encoding to store the positions with ones [31, 42]. If there are k ones in a bitvector of length n, we can store it in k(log(n/k) + O(1)) bits of space and support select queries in O(1) time and rank queries in O(log(n/k)) time.
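To make these conventions concrete, the following is a minimal Python sketch of a bitvector with 1-based rank and select. The class name and linear-scan select are illustrative only; a succinct implementation would use o(n) extra bits rather than prefix sums.

```python
# A minimal (non-succinct) bitvector, following the conventions above:
# rank(i, c) counts occurrences of bit c in B[1..i] (clamped outside
# [1, n]), and select(i, c) returns the position of the i-th occurrence.
class Bitvector:
    def __init__(self, bits):
        self.bits = bits  # list of 0/1, conceptually 1-based
        # ranks[i] = number of ones in B[1..i].
        self.ranks = [0]
        for b in bits:
            self.ranks.append(self.ranks[-1] + b)

    def rank(self, i, c=1):
        i = max(0, min(i, len(self.bits)))  # clamping, as in the text
        ones = self.ranks[i]
        return ones if c == 1 else i - ones

    def select(self, i, c=1):
        # Position of the occurrence of bit c with rank i (linear scan).
        count = 0
        for pos, b in enumerate(self.bits, start=1):
            if b == c:
                count += 1
                if count == i:
                    return pos
        raise ValueError("rank out of range")

bv = Bitvector([1, 0, 0, 1, 1, 0])
assert bv.rank(4) == 2     # ones in B[1..4]
assert bv.select(2) == 4   # the second one is at position 4
assert bv.rank(4, 0) == 2  # zeros in B[1..4]
```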
2.2 String collections
Let S = {T_1, ..., T_m} be an ordered collection of strings of total length n. We assume that each string T_i ends with a sentinel $_i and that $_i < $_j iff i < j to make the suffixes of the strings all distinct in lexicographic comparisons. However, we represent the sentinels using the same symbol $ in the actual strings. When the collection is clear from the context, we use m and n to denote the number of strings and the total length of the strings. When the collection consists of a single string T, we may consider the collection and the string to be the same.
Definition 1 (Multiple sequence alignment).
Let S = {T_1, ..., T_m} be a collection of strings over alphabet Σ, and let − ∉ Σ be a gap symbol. Let π be a function such that π(X) is string X with all occurrences of the gap symbol removed. A multiple sequence alignment MSA of collection S is a matrix over Σ ∪ {−} with m rows (all of the same length) such that for each row i, we have π(MSA[i]) = T_i.
Let MSA[i, k] and MSA[j, k] be two positions in the same column k. If we have MSA[i, k] ≠ − (MSA[j, k] ≠ −), let a (b) be the corresponding symbol. We interpret the alignment in the following way:
- If a = b, positions MSA[i, k] and MSA[j, k] are a match.
- If a ≠ b, positions MSA[i, k] and MSA[j, k] are a mismatch.
- If MSA[i, k] ≠ − and MSA[j, k] = −, string T_i has an insertion relative to string T_j and string T_j has a deletion relative to string T_i at the position.
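As a sketch of Definition 1, the snippet below checks a small hypothetical alignment against its collection. The names `pi`, `is_valid_msa`, and the example rows are ours, chosen for illustration.

```python
# Checking Definition 1: pi removes gaps, and each MSA row must spell
# out the corresponding string of the collection.
GAP = "-"

def pi(row):
    """Remove all gap symbols from an alignment row."""
    return row.replace(GAP, "")

def is_valid_msa(msa, strings):
    """A valid MSA has one equal-width row per string, with pi(row) == T_i."""
    return (len(msa) == len(strings)
            and len(set(map(len, msa))) == 1
            and all(pi(row) == s for row, s in zip(msa, strings)))

msa = ["GATT-ACA",
       "GA-TTACA"]
assert is_valid_msa(msa, ["GATTACA", "GATTACA"])
```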
Recombinations let us build new strings by breaking existing strings in two at aligned positions and concatenating the prefix of one string with the suffix of another. Using the above notation, if positions MSA[i, k] and MSA[j, k] are a match corresponding to offsets x in T_i and y in T_j, we can recombine strings T_i and T_j at those positions as T_i[1..x] · T_j[y+1..] and T_j[1..y] · T_i[x+1..]. This corresponds to switching between rows i and j in column k of the multiple sequence alignment.
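A minimal sketch of a single recombination, assuming 1-based offsets x and y that are a match (the function name is ours):

```python
# Cross over two strings at a matched pair of offsets: the matched
# symbol must be identical in both strings (a match in the MSA sense).
def recombine(ti, x, tj, y):
    """Return T_i[1..x] + T_j[y+1..] and T_j[1..y] + T_i[x+1..]."""
    assert ti[x - 1] == tj[y - 1], "positions must be a match"
    return ti[:x] + tj[y:], tj[:y] + ti[x:]

# The 'G' at offset 4 of AAAG matches the 'G' at offset 3 of CCGC.
assert recombine("AAAG", 4, "CCGC", 3) == ("AAAGC", "CCG")
```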
In later sections, we discuss data structures for indexing all possible recombinations of a string collection compatible with a given alignment. We formalize this as follows.
Definition 2 (Closure under recombinations).
Let S be a collection of strings and let ∼ be an equivalence relation for its suffixes such that if T_i[x..] ∼ T_j[y..], we have T_i[x] = T_j[y]. The closure of collection S under recombinations is a collection C(S) such that:
1. S ⊆ C(S).
2. For any two strings A, B ∈ C(S) and positions x, y such that A[x..] ∼ B[y..], it holds that A[1..x] · B[y+1..] ∈ C(S).
3. Given a string D = A[1..x] · B[y+1..] as above, D[z..] ∼ A[z..] if z ≤ x and D[z..] ∼ B[y + (z − x)..] if z > x.
According to rule 3, each suffix of the recombined string inherits the alignment of the suffix starting from the corresponding position in the source strings A and B.
2.3 Arrays over sorted suffixes
Let S = {T_1, ..., T_m} be an ordered collection of strings. We define the (generalized) suffix array SA of collection S as an array of pointers to the suffixes of the strings in lexicographic order. Each pointer SA[i] = (j, x) refers to the suffix T_j[x..] of string T_j. Given two pointers SA[i] = (j, x) and SA[i'] = (j', x'), we have i < i' iff T_j[x..] < T_{j'}[x'..] in lexicographic order. We therefore call i the lexicographic rank of suffix T_j[x..] among the suffixes of collection S. Note that for i ≤ m, we have SA[i] = (i, |T_i|), as the sentinels are the smallest suffixes.
The multi-string Burrows–Wheeler transform (BWT) [11] of collection S is a string of length n representing a permutation of the symbols in strings T_1, ..., T_m. We define it using the suffix array. If SA[i] = (j, x) and x > 1, the BWT stores the preceding symbol BWT[i] = T_j[x − 1]. When x = 1, we set BWT[i] = $.
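The definitions above can be prototyped with a naive quadratic construction; the sketch below (function names ours) sorts suffixes by appending per-string sentinel ranks, then writes the symbol preceding each sorted suffix, with sentinels rendered as a single `$`.

```python
# Naive generalized suffix array and multi-string BWT. Sentinels are
# ordered by string index for sorting but all written as '$'.
def suffix_array(strings):
    """Pointers (j, x) to all suffixes (0-based), lexicographic order.
    Offset x == len(T_j) denotes the sentinel-only suffix."""
    suffixes = [(j, x) for j, t in enumerate(strings)
                for x in range(len(t) + 1)]
    # chr(j) stands in for sentinel $_j; it sorts below ordinary
    # symbols as long as the alphabet lies above chr(m).
    return sorted(suffixes, key=lambda p: strings[p[0]][p[1]:] + chr(p[0]))

def multi_bwt(strings):
    """BWT[i] is the symbol preceding the i-th smallest suffix."""
    return "".join(strings[j][x - 1] if x > 0 else "$"
                   for j, x in suffix_array(strings))

assert multi_bwt(["banana"]) == "annb$aa"   # classic single-string case
assert multi_bwt(["AB", "AA"]) == "BAA$$A"
```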
In addition to the suffix array and the Burrows–Wheeler transform, we often define other arrays over the lexicographically sorted suffixes. The longest common prefix (LCP) array stores the length of the longest common prefix of two lexicographically adjacent suffixes. If SA[i] = (j, x) and SA[i − 1] = (j', x'), we store the length of the longest common prefix of suffixes T_j[x..] and T_{j'}[x'..] in LCP[i], with LCP[1] = 0.
2.4 FM-index
The Burrows–Wheeler transform can be inverted using a function called LF-mapping that maps the lexicographic rank of a suffix to the lexicographic rank of its cyclic predecessor. We define it in two parts. Let SA[i] = (j, x). If x > 1, we define LF(i) = i' such that SA[i'] = (j, x − 1). For x = 1, we define LF(i) = i' such that SA[i'] = (j, |T_j|).
We can compute the LF-mapping for most values of i using the BWT. Let C be an array such that C[c] is the total number of occurrences of all symbols c' < c in collection S. Let c = BWT[i]. If c ≠ $, which is equivalent to SA[i] = (j, x) with x > 1, we have LF(i) = C[c] + BWT.rank(i, c). Here C[c] is the number of suffixes starting with a symbol smaller than c and BWT.rank(i, c) is the lexicographic rank of suffix T_j[x − 1..] among the suffixes starting with symbol c.
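These formulas translate directly into code. The sketch below (helper names ours) computes rank by scanning the BWT, which a real index would replace with a wavelet tree.

```python
# LF(i) = C[c] + rank_BWT(i, c) for c = BWT[i], assuming c != '$'.
# Indices are 1-based, matching the text.
def c_array(bwt):
    """C[c] = number of symbols smaller than c in the collection."""
    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1
    c, total = {}, 0
    for ch in sorted(counts):   # '$' sorts below letters in ASCII
        c[ch] = total
        total += counts[ch]
    return c

def lf(bwt, i):
    """LF-mapping for 1-based rank i, assuming BWT[i] != '$'."""
    c = bwt[i - 1]
    return c_array(bwt)[c] + bwt[:i].count(c)

bwt = "annb$aa"  # BWT of "banana$"
assert c_array(bwt) == {"$": 0, "a": 1, "b": 4, "n": 5}
assert lf(bwt, 1) == 2   # "$" maps to its predecessor "a$"
assert lf(bwt, 7) == 4   # "nana$" maps to "anana$"
```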
Inverse LF-mapping, denoted as Ψ, can be computed as Ψ(i) = BWT.select(i − C[c], c) by identifying the symbol c with which suffix SA[i] starts, that is, the symbol c such that C[c] < i ≤ C[c] + occ(c), and then applying select.
We can generalize LF-mapping for hypothetical suffixes. Let rank(X) be the lexicographic rank of string X among the suffixes of collection S. That is, rank(X) is the number of suffixes Y of the collection such that Y < X. Then, given a symbol c, we define the generalized LF-mapping LF(X, c) as the lexicographic rank of string cX. We can compute it as LF(X, c) = C[c] + BWT.rank(rank(X), c).
Let [sp, ep] be the lexicographic range of suffixes starting with string X. That is, for each sp ≤ i ≤ ep, we have SA[i] = (j, x) such that T_j[x..x + |X| − 1] = X. Then for any symbol c, the range of suffixes starting with cX is [C[c] + BWT.rank(sp − 1, c) + 1, C[c] + BWT.rank(ep, c)]. This backward search step forms the core of the FM-index [22]. Given a pattern P, we can determine the range of suffixes starting with P with O(|P|) rank operations on the BWT.
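The backward search step can be sketched as follows (names ours; rank is again computed by scanning for brevity):

```python
# Backward search: maintain the lexicographic range [sp, ep] of
# suffixes prefixed by the processed pattern suffix, extending the
# pattern one symbol at a time from right to left. 1-based indices.
def backward_search(bwt, c_arr, pattern):
    """Range [sp, ep] of suffixes starting with pattern, or None."""
    sp, ep = 1, len(bwt)
    for c in reversed(pattern):
        if c not in c_arr:
            return None
        sp = c_arr[c] + bwt[:sp - 1].count(c) + 1
        ep = c_arr[c] + bwt[:ep].count(c)
        if sp > ep:
            return None
    return sp, ep

bwt = "annb$aa"  # BWT of "banana$"
C = {"$": 0, "a": 1, "b": 4, "n": 5}
assert backward_search(bwt, C, "ana") == (3, 4)   # "ana$" and "anana$"
assert backward_search(bwt, C, "nana") == (7, 7)
```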
Another core operation is locate(i), which converts the lexicographic rank of a suffix to its starting position SA[i]. To compute it, the FM-index needs suffix array samples in addition to the BWT. Let B be a bitvector that marks the sampled suffix array positions with ones, and let SA' be the subsequence of the suffix array corresponding to those positions. If B[i] = 1, we can compute SA[i] as SA'[B.rank(i, 1)]. Otherwise we rely on iterated LF-mapping and the fact that SA[LF(i)] = (j, x − 1) when SA[i] = (j, x) with x > 1. If we need k steps to find the nearest sampled position, we have SA[i] = (j, x + k), where SA[LF^k(i)] = (j, x). For this scheme to function, we need to sample SA[i] = (j, x) whenever x = 1. If we sample every position with x ≡ 1 (mod d) for a parameter d, we can compute locate(i) with O(d) rank operations.
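The sampling scheme can be sketched for the single-string case as follows. For brevity we store the samples in a plain dictionary from sampled rank to text position instead of the bitvector B and subsequence SA' described above; the names are ours.

```python
# locate(i): iterate LF until a sampled rank is reached, then add the
# number of steps taken. For a single string T$, SA[LF(i)] = SA[i] - 1
# whenever SA[i] > 1, and position 1 must be sampled for termination.
def locate(bwt, c_arr, samples, i):
    """samples: dict from sampled rank to 1-based text position."""
    steps = 0
    while i not in samples:
        c = bwt[i - 1]
        i = c_arr[c] + bwt[:i].count(c)   # LF(i)
        steps += 1
    return samples[i] + steps

bwt = "annb$aa"  # BWT of "banana$"
C = {"$": 0, "a": 1, "b": 4, "n": 5}
samples = {5: 1, 3: 4, 1: 7}  # ranks of text positions 1, 4, 7 (d = 3)
assert locate(bwt, C, samples, 4) == 2   # suffix "anana$"
assert locate(bwt, C, samples, 7) == 3   # suffix "nana$"
```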
2.5 Graphs
Let G = (V, E) be a graph with a finite set of vertices V and a set of edges E ⊆ V × V. We assume that the graph is directed: for any two vertices u, v ∈ V with u ≠ v, edges (u, v) and (v, u) are distinct. A path is a string of vertices P = v_1 ⋯ v_k such that (v_i, v_{i+1}) ∈ E for 1 ≤ i < k. We call the graph acyclic if there is no path that visits a vertex more than once.
A vertex-labeled graph is a tuple G = (V, E, Σ, ℓ), where (V, E) is a graph, Σ is an alphabet, and ℓ : V → Σ is a function that assigns a label ℓ(v) to each vertex v ∈ V. If P = v_1 ⋯ v_k is a path, we write ℓ(P) to refer to the string ℓ(v_1) ⋯ ℓ(v_k) formed by concatenating the vertex labels. We can similarly define edge-labeled graphs as tuples G = (V, E, Σ, ℓ), where ℓ : E → Σ is a function that assigns labels to edges. Later, we extend the notions to vertices and edges labeled with non-empty strings.
Definition 3 (Alignment graph).
Let S be a collection of strings over alphabet Σ, and let ∼ be an equivalence relation for its suffixes such that if T_i[x..] ∼ T_j[y..], we have T_i[x] = T_j[y]. The alignment graph of collection S with respect to ∼ is a vertex-labeled graph G = (V, E, Σ, ℓ) such that:
1. The set of vertices V is the set of equivalence classes for relation ∼.
2. For any two vertices u, v ∈ V, we have (u, v) ∈ E if and only if there is a string T_i and a position x such that T_i[x..] ∈ u and T_i[x + 1..] ∈ v.
3. The label ℓ(v) of vertex v is the shared first symbol T_i[x], where T_i[x..] ∈ v.
Note that the set of labels of paths in the alignment graph contains the set of substrings of strings T_1, ..., T_m. If ∼ is an equivalence relation based on a multiple sequence alignment such that T_i[x..] ∼ T_j[y..] when the corresponding positions are a match, the resulting alignment graph is always acyclic.
3 XBWT: Indexing Labeled Trees
The Burrows–Wheeler transform was originally defined for a single string. Extending the definition to an ordered collection of strings is straightforward enough. But a collection of strings is still a linear structure, while paths in a graph can both diverge and converge. The first step towards using the BWT for graph indexing was the extended Burrows–Wheeler transform (XBWT) [21], which can index labeled trees. The key technique was using select queries on a bitvector for expanding a (co-)lexicographic range of vertices to a (co-)lexicographic range of outgoing edges.
In this section, we present a variant of the XBWT that is similar to the indexes for vertex-labeled graphs discussed in Sections 4 and 5. There are other fundamentally equivalent representations. The representations differ in their handling of leaf vertices and outdegrees, on whether they consider vertex-labeled or edge-labeled trees, and on whether the search proceeds from the root towards the leaves or the other way around. Some of them achieve better space bounds than what is presented here.
3.1 Data structure
A vertex-labeled tree with root r ∈ V is a vertex-labeled graph such that the indegree of the root is 0 and the indegree of every other vertex is 1, with a unique path from the root to every vertex v ∈ V. Given an edge (u, v), we call vertex u the parent of vertex v and vertex v a child of u. We call vertices with outdegree 0 the leaves of the tree.
XBWT uses vertices in place of suffixes. If P(v) is the path from the root to vertex v, the vertices are sorted by the label ℓ(P(v)) in colexicographic order (that is, in lexicographic order by reversed path labels). For a given vertex v, the BWT stores the concatenation of the labels ℓ(u) for each outgoing edge (v, u). We encode the outdegree d of each vertex in unary as a one followed by d zeros and concatenate the outdegrees in the same order to form bitvector L. In order to handle the leaves, we add a technical vertex # such that ℓ(#) = $. For each leaf v, we encode an edge (v, #). As with strings, we assume that $ < ℓ(v) for every vertex v ∈ V. If the colexicographic rank of vertex v is i, the labels of its children are stored in BWT[L.select(i, 1) − i + 1 .. L.select(i + 1, 1) − (i + 1)]. See Figure 1 for an example. With an uncompressed wavelet tree [29] for the BWT and an uncompressed bitvector for L, the XBWT data structure requires O(n log σ) bits of space, not including the C array.
Let [sp, ep] be the colexicographic range of vertices with path labels ending with string X. Given a symbol c, we want to find the range of vertices with path labels ending with Xc. We first expand the range into a colexicographic range of outgoing edges [sp', ep'] = [L.select(sp, 1) − sp + 1, L.select(ep + 1, 1) − (ep + 1)]. Every edge in that range is encoded with the label of the child vertex, and the colexicographic rank of that vertex is based on a path label ending with Xc. Hence we can use a backward search step (Section 2.4) to get the range of children with label c: [C[c] + BWT.rank(sp' − 1, c) + 1, C[c] + BWT.rank(ep', c)]. Note that the backward search step becomes a forward search step, as the XBWT is based on colexicographic order. Given a string X, we can therefore compute the range using O(|X|) rank operations on BWT and select operations on L.
If we know the colexicographic rank i_r of the root, we can match entire path labels instead of suffixes by starting the above search from range [i_r, i_r]. We can also use the XBWT for tree navigation. Given a vertex v with colexicographic rank i, the range of its children in the BWT is [L.select(i, 1) − i + 1, L.select(i + 1, 1) − (i + 1)]. The colexicographic rank of the child at position p of that range, with label c = BWT[p], is C[c] + BWT.rank(p, c). Conversely, the rank of the parent of a non-root vertex v with label c = ℓ(v) is L.rank(L.select(p, 0), 1), where p = BWT.select(i − C[c], c) is the position of the edge leading to v.
3.2 Applications
The initial XBWT paper proposed using the structure as a fast and space-efficient representation for hierarchical documents, such as XML [21]. In such applications, it is important that a vertex can have multiple children with the same label (such as list items). When storing string collections (often called string dictionaries in this context), we can collapse the tree into a trie, improving space usage. Compared to other representations for string collections, the XBWT is smaller but slower [36]. The slowness comes from symbol-to-symbol navigation and the poor memory locality of iterated LF-mapping. However, the same FM-index functionality enables substring searches not supported by faster representations based on hashing or front coding.
The Aho–Corasick algorithm can find the occurrences of multiple patterns in a string in time linear in the length of the string plus the number of occurrences. It is based on a trie of the patterns augmented with failure links (suffix links). A failure link from vertex v points to the vertex u such that the path label of u is the longest proper suffix of the path label of v among all vertices of the trie. The XBWT can be augmented with an O(n)-bit balanced parentheses representation to support failure links, making it a space-efficient Aho–Corasick automaton [35]. Here the relative slowness matters less, as the algorithm is based on symbol-to-symbol navigation anyway.
Because the XBWT represents a collection of strings, it can be built from other representations, such as the suffix array and the Burrows–Wheeler transform [35]. The link also works in the other direction: we can go from the XBWT directly to a run-length encoded multi-string BWT [12]. As a byproduct, we can efficiently find the order of the strings that minimizes the number of runs in the BWT [12]. Because the order of the strings determines how well the suffixes close to the end of the string can be run-length encoded, this can be valuable with highly repetitive collections of short strings.
4 SBWT: Indexing k-Spectra
The k-spectrum of a string T is the set of distinct k-mers (substrings of length k) occurring in T. In this section we give an overview of how k-spectra can be indexed for k-mer lookup queries via the Spectral Burrows-Wheeler transform (SBWT) [5]. Given a query k-mer, a k-mer lookup query returns the colexicographic rank of the k-mer in the k-spectrum. It is not difficult to see that a structure for k-mer lookup queries gives a convenient representation of a de Bruijn graph.
The SBWT is a distillation of the ideas found in the BOSS [10] and Wheeler graph data structures. It is indeed an evolution of the BOSS data structure, which is an indexed representation of the edge-centric de Bruijn graph. The SBWT differs from BOSS in being vertex-centric and in being more general: the SBWT is a mathematical transformation which can be indexed in various ways, one of which closely resembles the BOSS data structure. In addition, the SBWT graph is also a Wheeler graph [25], covered in the next chapter of this book. This implies that the generic Wheeler graph index could be used to index the graph. However, that index requires storage of the sequence of indegrees and outdegrees of the vertices in the graph, whereas in the SBWT this is not required, because every vertex apart from the vertex of the empty string has indegree exactly 1, and the sequence of outdegrees is already included in the sizes of the outgoing edge label sets. Thus, the SBWT can be seen as a specialization of the Wheeler graph framework to k-spectra, taking advantage of the properties of the special case.
4.1 The Spectral Burrows-Wheeler Transform
In this section we define the Spectral Burrows-Wheeler transform, SBWT. We begin with a few basic definitions:
Definition 4 (-spectrum).
A string’s k-spectrum, denoted with S_k(T), consists of all distinct k-mers present in the string: S_k(T) = {T[i..i + k − 1] : 1 ≤ i ≤ |T| − k + 1}. The k-spectrum of a set of strings T_1, ..., T_m is the union S_k(T_1) ∪ ⋯ ∪ S_k(T_m).
The padded k-spectrum adds a minimal set of $-padded dummy k-mers to ensure that in the de Bruijn graph (to be defined shortly), every non-dummy k-mer has an incoming path of length at least k:
Definition 5 (Source vertex set).
Let R be a k-spectrum. The source vertex set of R is the subset of R containing the k-mers x such that the prefix x[1..k − 1] is not a suffix of any k-mer in R.
Definition 6 (Padded -spectrum).
Let R be the k-spectrum of T_1, ..., T_m, and let R_s be the source vertex set of R for a given k. The padded k-spectrum R' is the set R ∪ {$^{k−i} · x[1..i] : x ∈ R_s, 0 ≤ i < k}, where each dummy consists of k − i copies of $ followed by the first i characters of x, and $ ∉ Σ is a special character, which is smaller than all other characters of the alphabet.
The number of dummy vertices can be calculated with a simple formula: it is equal to the number of internal vertices of trie(R_s), where trie(R_s) is the trie of the source vertex set of the set of strings.
For example, if we consider two strings ACAGTG and ATCAGA, and k = 3, then the k-spectrum is R = {ACA, CAG, AGT, GTG, ATC, TCA, AGA}, the source vertex set is {ACA, ATC}, and the padded k-spectrum is R' = R ∪ {$$$, $$A, $AC, $AT}.
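Definitions 4-6 can be computed directly; the sketch below (function names ours) reproduces the example.

```python
# k-spectrum, source vertex set, and padded k-spectrum (Definitions 4-6).
def spectrum(strings, k):
    return {t[i:i + k] for t in strings for i in range(len(t) - k + 1)}

def padded_spectrum(strings, k):
    r = spectrum(strings, k)
    # All proper suffixes of k-mers; only length-(k-1) ones can match.
    suffixes = {x[j:] for x in r for j in range(1, k)}
    sources = {x for x in r if x[:k - 1] not in suffixes}
    dummies = {"$" * (k - i) + x[:i] for x in sources for i in range(k)}
    return r | dummies

r = spectrum(["ACAGTG", "ATCAGA"], 3)
assert r == {"ACA", "CAG", "AGT", "GTG", "ATC", "TCA", "AGA"}
rp = padded_spectrum(["ACAGTG", "ATCAGA"], 3)
assert rp - r == {"$$$", "$$A", "$AC", "$AT"}
```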
Definition 7 (de Bruijn graph).
The de Bruijn graph of order k of a set of strings T_1, ..., T_m is an edge-labeled graph G = (V, E, Σ, ℓ), where:
1. The set of vertices is the set of k-mers from the padded k-spectrum R' of T_1, ..., T_m, in symbols V = R'.
2. For any two vertices (k-mers) u, v ∈ V, we have (u, v) ∈ E if and only if u[2..k] = v[1..k − 1].
3. The label of an edge (u, v), denoted with ℓ(u, v), is the character v[k]. The character $ never appears as an edge label.
See Figure 2 for an illustration of a de Bruijn graph.
We can now define the Spectral BWT.
Definition 8 (Spectral Burrows-Wheeler transform (SBWT) [5]).
Let R' be a padded k-spectrum and let x_1, ..., x_n be the elements of R' in colexicographic order. The SBWT is the sequence of sets X_1, ..., X_n with X_i ⊆ Σ such that X_i = {c ∈ Σ : x_i[2..k] · c ∈ R'} if i = 1 or x_i[2..k] ≠ x_{i−1}[2..k], otherwise X_i = ∅.
Continuing the example above and adding a new sequence TTGTC, we have R' = {ACA, CAG, AGT, GTG, ATC, TCA, AGA, TTG, TGT, GTC, $$$, $$A, $AC, $AT, $$T, $TT}. The colexicographically ordered list of R' is:
$$$,
$$A,
ACA,
TCA,
AGA,
$AC,
ATC,
GTC,
CAG,
GTG,
TTG,
$$T,
$AT,
AGT,
TGT,
$TT, and the SBWT of R' is the sequence of sets:
{A,T},
{C,T},
{G},
∅,
∅,
{A},
{A},
∅,
{A,T},
{T},
∅,
{T},
{C},
{C,G},
∅,
{G}.
The sets in the SBWT represent the labels of outgoing edges in the vertex-centric de Bruijn graph, such that we only include outgoing edges from k-mers that have a different suffix of length k − 1 than the preceding k-mer in the colexicographically sorted list. The padding with dollar-symbols is a technical detail required to make the SBWT work. It ensures that every k-mer of the k-spectrum has an incoming path of length at least k in the de Bruijn graph. Alternatively, the sequences of the set on which the SBWT is built could be made cyclic, thus removing the need for the padded k-spectrum R'.
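As a sketch (function names ours), the SBWT sets can be reconstructed from a padded k-spectrum by colexicographic sorting; below we use the 16-element padded spectrum of the running example and a DNA alphabet.

```python
# Building the SBWT (Definition 8): sort the padded k-spectrum in
# colexicographic order and emit the outgoing-label sets, emptying the
# set whenever the (k-1)-suffix repeats.
def sbwt(padded, k):
    order = sorted(padded, key=lambda x: x[::-1])  # '$' < 'A' in ASCII
    sets = []
    for idx, x in enumerate(order):
        if idx > 0 and order[idx - 1][1:] == x[1:]:
            sets.append(set())  # same (k-1)-suffix as the previous k-mer
        else:
            sets.append({c for c in "ACGT" if x[1:] + c in padded})
    return order, sets

# The padded 3-spectrum of the running example (16 k-mers).
padded = {"ACA", "CAG", "AGT", "GTG", "ATC", "TCA", "AGA", "TTG", "TGT",
          "GTC", "$$$", "$$A", "$AC", "$AT", "$$T", "$TT"}
order, sets = sbwt(padded, 3)
assert order[2] == "ACA" and sets[2] == {"G"}
assert order[8] == "CAG" and sets[8] == {"A", "T"}
assert sets[3] == set()  # TCA shares suffix CA with ACA
```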
We now describe how to implement efficient k-mer membership queries on the spectrum of the input strings, using only the information encoded in the SBWT. It is helpful to think of the SBWT as encoding the set of k-mers as a list arranged in colexicographic order (as in the example above). In the context of k-mer lookup, the SBWT allows navigation of this ordered list via a search routine based on subset rank queries.
Definition 9 (Subset rank query).
Let X_1, ..., X_n be a sequence of subsets of an alphabet Σ. A subset rank query rank(i, c) takes as input an index i and a character c ∈ Σ, and returns the number of subsets X_j with j ≤ i such that c ∈ X_j.
The search routine, introduced below, works by searching the k-mer character by character from left to right. It does so by maintaining the interval of k-mers in the colexicographically sorted list that are suffixed by the prefix of the query which has been processed so far.
Since the k-mers in R' are colexicographically sorted in the SBWT, the subset of k-mers in R' that share a string α as a suffix are next to each other, and they are called the colexicographic interval of α. The colexicographic interval of a string longer than k is defined as the range of k-mers in the sorted list of R' suffixed by the last k characters of the string. Now let [i, j] denote the colexicographic interval of the string α, where i and j are respectively the colexicographic ranks of the smallest and largest k-mer in the SBWT having α as a suffix. Finding the interval [i', j'] of αc, where c is a character from Σ, is equivalent to following all edges labeled with c from the vertices in [i, j]. Due to the way the edges are defined, the end points of these edges form a contiguous range such that i' is the destination of the first outgoing edge labeled c from [i, j], and j' is the destination of the last one. Let C[c] be the number of edge labels smaller than c in the graph. We now have the following formulas:
i' = 1 + C[c] + rank(i − 1, c) + 1
j' = 1 + C[c] + rank(j, c)    (1)
where rank is a subset rank query on the SBWT sequence. The “+1” at the start of the formulas is to skip over the vertex of the all-$ k-mer, which has no incoming edge. The values C[c] can be precomputed for all characters c ∈ Σ. By iterating these formulas k times, we have the k-mer search routine.
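A sketch of the search routine follows (function names ours), using a naive linear-scan subset rank and the 1-based interval update i ← C[c] + rank(i − 1, c) + 2, j ← C[c] + rank(j, c) + 1, where the extra +1 in each formula accounts for the all-$ vertex; a real index would answer subset rank with the plain matrix or a subset wavelet tree.

```python
# k-mer lookup on the SBWT by iterating the interval update formulas.
def subset_rank(sets, i, c):
    """Number of sets among X_1..X_i (1-based prefix) containing c."""
    return sum(1 for x in sets[:i] if c in x)

def kmer_interval(sets, kmer):
    """Colexicographic interval [i, j] of the k-mer, or None if absent."""
    labels = [c for x in sets for c in x]
    i, j = 1, len(sets)
    for c in kmer:
        c_smaller = sum(1 for d in labels if d < c)  # C[c]
        i = 1 + c_smaller + subset_rank(sets, i - 1, c) + 1
        j = 1 + c_smaller + subset_rank(sets, j, c)
        if i > j:
            return None
    return i, j

# SBWT sets of the running example, in colexicographic order.
sets = [{"A", "T"}, {"C", "T"}, {"G"}, set(), set(), {"A"}, {"A"}, set(),
        {"A", "T"}, {"T"}, set(), {"T"}, {"C"}, {"C", "G"}, set(), {"G"}]
assert kmer_interval(sets, "CAG") == (9, 9)
assert kmer_interval(sets, "GTC") == (8, 8)
assert kmer_interval(sets, "AAA") is None
```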
The simplest data structure for subset rank queries is the plain matrix. See Figure 3 for an example.
Definition 10 (Plain Matrix representation).
The plain matrix representation of the SBWT sequence X_1, ..., X_n is a binary σ × n matrix M, such that M[c][i] = 1 iff subset X_i includes the character c of the alphabet.
The rows of the matrix are indexed for constant-time rank queries [37] (implementation from [28]). A single rank query on row c at index i in M answers rank(i, c) in constant time, where c is a character of the alphabet. There exist alternative succinct data structures for subset rank, such as the subset wavelet tree, details of which can be found in [4, 5]. A concatenation of the sets indexed as a wavelet tree leads to something resembling the BOSS data structure. All k-mers of a longer query string can be searched efficiently by adding the LCS array of the SBWT [3]. At the end of each single k-mer search, the LCS array allows extending the colexicographic interval to one matching only the last characters observed in the previous k-mer.
5 GCSA: Indexing Finite Languages
The generalized compressed suffix array (GCSA) [48, 49] is a generalization of the FM-index to graphs. It was originally designed for indexing acyclic vertex-labeled graphs (or equivalently finite languages) that arise from aligned DNA sequences, using techniques similar to the XBWT (Section 3).
GCSA can be built for any such graph, and it supports efficient queries in it, but the index can be exponentially larger than the graph. Some other BWT-based graph indexes, such as BWBBLE [32] and vBWT [34], take the opposite approach. By encoding the graph as a string, they make index construction fast and the index itself small. Queries can be slower than in GCSA, as they often require branching in the graph.
BOSS [10] represents de Bruijn graphs using a data structure equivalent to the GCSA. This equivalence led to a realization that the GCSA approach works with any graph that is sufficiently similar to a de Bruijn graph. Such graphs can recognize a strict subclass of regular languages [49]. GCSA2 [46] uses this realization in a practical way to index any vertex-labeled graph by approximating it with a high-order (pruned) de Bruijn graph.
GCSA requires that the vertices of the graph can be sorted unambiguously by the labels of maximal paths starting from them. This is a global property that is sometimes difficult to reason about. Wheeler graphs [25] replaced this requirement with an essentially equivalent local property. Vertices are sorted by their labels, and ties are broken by inheriting the order of the successor vertices. (The original formulation used edge-labeled graphs and inherited the order from the predecessors.) This alternate characterization made reasoning about Wheeler graphs easier in many situations.
5.1 Structural runs in the BWT
Consider a collection that consists of multiple identical copies of the same string T. Then for any position x, the suffixes starting at that position are consecutive in lexicographic order. Because the preceding symbol (T[x − 1] if x > 1; otherwise $) is also the same for every string, the suffixes are in the same run of that symbol in the BWT of the collection. Hence the BWT of a collection of identical strings always has the same number of runs, regardless of the number of copies. We now consider what happens when we apply edits to a collection of identical strings.
Definition 11 (Aligned suffixes).
Let S be a collection of strings, and let MSA be a multiple sequence alignment of the collection. For any two strings T_i, T_j and positions x, y in them, suffixes T_i[x..] and T_j[y..] are aligned if and only if the positions corresponding to T_i[x] and T_j[y] are a match in MSA. We represent this with equivalence relation ∼_MSA.
Definition 12 (Structural BWT runs).
Let S be a collection of strings, let ∼_MSA be an alignment of the suffixes, and let BWT and SA be the Burrows–Wheeler transform and the suffix array of the collection. A structural BWT run is a maximal interval [sp, ep] such that for any sp ≤ i, i' ≤ ep:
1. T_j[x..] ∼_MSA T_{j'}[x'..], where SA[i] = (j, x) and SA[i'] = (j', x'); and
2. BWT[i] = BWT[i'].
We use equivalence relation ∼_run to denote that suffixes are in the same structural BWT run.
Let S be a highly repetitive collection of similar strings. We assume that there is a sequence of s edit operations that transforms a collection of identical copies of string T into S. We further assume that there is a multiple sequence alignment corresponding to the edit operations, where suffixes that have been derived from the same suffix of the original string start with a match or a mismatch in the same column. Note that different sequences of edit operations leading to the same collection may correspond to different alignments.
Initially, there are |T| structural runs, one for each position in string T. Each edit operation that changes a substring T_j[x..y] can affect the suffixes of T_j in three ways:
1. The symbol in the BWT can change for the suffixes starting at positions x + 1 to y + 1.
2. Suffixes starting at positions x − ℓ to x − 1, for some ℓ ≥ 0, may leave their current structural runs due to changes in lexicographic ranks.
3. New unaligned suffixes can be created and existing suffixes may be removed if substring T_j[x..y] is replaced with a string of different length.
Every affected suffix can increase the number of structural runs by at most two: by becoming a new run and by splitting an existing run in two.
Mäkinen et al. [38] showed that if the original string is random and we apply random edits, ℓ is O(log n) in the expected case. If we assume single-symbol edits, s edit operations increase the number of structural runs by at most O(s log n) in the expected case. Because structural runs are (possibly non-maximal) BWT runs, there are at most r + O(s log n) runs in the BWT of collection S in the expected case. Here r is the number of runs in the BWT of string T.
5.2 Indexing the graph of structural runs
Let S be a string collection and ∼_run an equivalence relation that denotes that two suffixes are in the same structural BWT run. Let G be the alignment graph of the collection with respect to ∼_run.
Graphs like this have been used as space-efficient representations of the collection. Both SimDNA [33] and FM-index of alignment [40, 39] index the collection using a graph based on a specific type of alignment. The alignment alternates between common regions, where all strings are aligned, and non-common regions, where they diverge. BWT tunneling [7] also uses a similar graph to improve BWT-based data compression by not storing run lengths within unary paths. Our goal is instead to extend the FM-index of the collection to index all strings in the closure using the graph.
We could augment the FM-index with a bitvector B that marks the start (or equivalently the end) of each structural run with ones. After each backward search step, we use some rank/select operations on B to extend the interval [sp, ep] to the smallest interval [sp', ep'] that contains it and consists of entire structural runs. If we continue the search from interval [sp', ep'], we are effectively accepting recombinations within each vertex in the range. No matter which suffixes we used to reach a vertex, we can continue the search to the predecessor of any suffix in the vertex. See Figure 4 for an example.
5.3 GCSA structure
Because we discard the information on how we arrived at the current vertex, we no longer need to keep track of the exact BWT interval. Hence we choose the GCSA representation [48] (Figure 5) that corresponds more closely to the structure of the graph. We keep the vertices in the same order as the structural runs. Let π be a function that maps a vertex v to its lexicographic rank π(v). Lexicographic range [sp, ep] now corresponds to vertices v such that sp ≤ π(v) ≤ ep.
The data structure consists of three components: bitvector B_in, string BWT, and bitvector B_out. We use them all for computing a backward search step from the range [sp, ep] of vertices matching a string X to the range [sp', ep'] of vertices matching cX for a symbol c.
1. Bitvector B_in encodes the indegree of each vertex in unary. Similar to using L in XBWT (Section 3.1), we map the range [sp, ep] of vertices to a range of incoming edges with select queries on B_in.
2. The BWT range for vertex v is the concatenation of the labels ℓ(u) for each vertex u with (u, v) ∈ E, in an arbitrary order. A backward search step with this BWT maps the range of incoming edges to a range of outgoing edges, ordered by the predecessor vertex.
3. Bitvector B_out encodes the outdegree of each vertex in unary. With rank queries on B_out, we get the interval [sp', ep'] of the predecessors of the vertices in [sp, ep] with label c.
Together these take O(|E| log σ) bits, assuming an uncompressed wavelet tree for the BWT and uncompressed bitvectors, and excluding the C array. To handle the edge cases with source/sink vertices, we add technical edges connecting vertices with outdegree 0 to vertices with indegree 0.
This data structure also facilitates graph navigation. For instance, if v is a vertex with indegree 1, we can navigate backward along a unary path from v. This is effectively the same as using LF-mapping to navigate from a suffix to its predecessor. The rank of the first incoming edge to v can be found with a select query on B_in. If the indegree is greater than 1, vertex v has multiple predecessors and the path is no longer unary. With a backward search step on the BWT, we get the rank of the corresponding outgoing edge, and a rank query on B_out gives the rank of the predecessor vertex. Navigation like this is used in locate queries (Section 5.5).
The alignment graph based on structural runs allowed us to index the original strings as well as their recombinations within the runs. Our ultimate goal is allowing recombinations at any aligned position. In principle, we already have the necessary tools for it. Because ∼_run is a refinement of ∼_MSA, the closure under ∼_run is contained in the closure under ∼_MSA. In other words, recombinations at aligned positions include recombinations at aligned positions within the same BWT run. We could therefore build the GCSA structure for the closure under ∼_MSA. The challenge is doing that without representing the closure explicitly.
5.4 GCSA construction
Before proceeding, we relax the definition of structural runs. The GCSA structure only requires that the equivalence classes of suffixes are lexicographic ranges that start with the same symbol. By not requiring that they also share the preceding symbol, we simplify the discussion and make the alignment graph smaller.
Definition 13 (Structural SA runs).
Let be a collection of strings, let be an alignment of the suffixes, and let be the suffix array of the collection. A structural SA run is a maximal interval such that for any with and , we have . We use equivalence relation to denote that suffixes are in the same structural SA run.
The algorithm we use for building GCSA [48, 49] is similar to the prefix-doubling approach to suffix array construction. We start from graph and build a series of graphs for increasing path lengths . If we consider the graphs as finite automata, they all recognize the same language . There are two types of vertices:
1. The vertex corresponds to a path of length starting from vertex in graph . For every maximal path starting from vertex in graph , we require that is a prefix of .
2. The vertex corresponds to a half-open lexicographic range of path labels. There is an arbitrary path of length starting from vertex in graph such that . For every maximal path starting from vertex in graph , we require that . For every other vertex with , we require that (if is of type 1) or (if is of type 2).
In both cases, we set .
If all vertices are of type 2, we have a graph that is equivalent to , with vertices that can be sorted unambiguously by the labels of maximal paths starting from them. For each suffix of the closure , there is a vertex such that and a path starting from with label . Therefore the vertices of correspond to (possibly non-maximal) structural runs. If we merge them to form maximal runs, we get the graph . See Figure 5 for an example.
We derive graph from graph with a doubling step followed by a pruning step. Let be a type 1 vertex, and let be the lexicographic rank of in the set of distinct path labels . In the doubling step, we combine it with all vertices such that is a valid path. For every combination, we create a new vertex with and . The new vertex will be of the same type as vertex . This requires that graph is reverse deterministic; see below for a discussion. If is a type 2 vertex, we copy it as vertex with . The doubling step finishes by sorting the new vertices by their ranks and replacing the pairs of ranks with integer values.
In the pruning step, we merge lexicographic ranges of type 1 vertices into type 2 vertices. Let be the set of vertices with rank . If all vertices share the same vertex, the shared path label uniquely defines the starting vertex. We can therefore merge them into a type 2 vertex. Similarly, if sets and also share the starting vertex, we can merge them into a single vertex.
If is the longest string in , the prefix-doubling algorithm finishes after at most doubling steps. In the worst case, the final graph can be exponentially larger than the graph we started from. However, if the distances between the positions of the original string with edits in any string are sufficiently high, graph will have vertices and edges in the expected case [49]. If we assume a random string and random edits, the phase transition to exponential growth occurs at average distance .
We do not need to maintain the edges of the intermediate graphs. Determining them for the final graph is simple enough. Given a vertex of the final graph and an edge of the original graph, we know that there must be a corresponding edge such that in the final graph. We can therefore generate the edges as and sort them by . This also sorts them by . By scanning the lists of vertices and edges in sorted order, we can then match edges to source vertices by .
This construction algorithm only works if the original graph is reverse deterministic. For every vertex , we require that each predecessor with has a distinct label . If two predecessors share a label, for every path starting from vertex . The pruning step cannot merge them, because no extension of the path label can determine the starting vertex uniquely. On the other hand, if the graph is reverse deterministic and is a type 2 vertex, always determines the unique predecessor .
Determinizing a finite automaton takes exponential time in the worst case, because the states of the determinized automaton correspond to sets of states of the non-deterministic automaton. However, when the textbook algorithm was used to determinize the graph before GCSA construction, it was observed to be fast [49]. This was later explained using Wheeler graphs [16]. The exponent in the time complexity is not the total number of states but the maximal number of states that are not comparable in lexicographic order. Intuitively, the closer the graph is to a Wheeler graph, the faster the determinization is. If the graph is simple enough that the prefix-doubling algorithm only increases its size by a constant factor, it must already be close to being a Wheeler graph.
5.5 Locating the hits
Let be the original alignment graph and the final graph in GCSA construction. During the construction, we annotated each vertex in each intermediate graph (including and ) with the corresponding vertex of the original graph. As the notation implies, we want GCSA locate queries to return positions in the original graph. While it would be easier to return positions in (we built GCSA for that graph, after all), it is only a technical construct. Positions in the original graph (or columns in the alignment) are more likely to be meaningful to the user.
A locate query in an FM-index relies on the fact that . If a position has not been sampled, we can derive the value from the predecessor. Let be a function that gives each vertex a unique integer identifier. To apply the same idea with GCSA, we require that for each edge such that is the only predecessor of and is the only successor of . If we sample for each vertex that has multiple predecessors or for the only predecessor , the locate algorithm works as before [48, 49]. (We also want to sample at regular intervals on unary paths to ensure query performance.)
This approach was originally designed for sampling the LCP array [45]. If is not at the start of a run, . We therefore need to sample the irreducible values, while the rest can be derived from the predecessor. More generally, the approach can be understood as a sampled tag array [8]. We annotate each position in the FM-index with a tag, which can store arbitrary information related to the position. If most tags can be derived easily from the only predecessor, we can store the tag array space-efficiently and still support efficient queries.
5.6 Simplifying the graph
As mentioned earlier, there is a phase transition in GCSA construction. As the distances between positions with edits in the alignment fall below a critical threshold, the number of vertices and edges in the final graph starts growing rapidly. This makes GCSA impossible to build in practice for many graphs. One way to handle this is to add a context length requirement to the definition of alignment [48, 49].
Definition 14 (-aligned suffixes).
Let be a collection of strings, let MSA be a multiple sequence alignment for the collection, and let be a context length. For any two strings and positions in them, suffixes and are -aligned if and only if positions and are a match in MSA and . We represent this with an equivalence relation.
We similarly define structural SA runs based on -aligned suffixes and denote them with equivalence relation . Then we start GCSA construction from graph and use prefix-doubling to build graph . In practice, context length imposes a minimum distance between recombinations. Regions that are heavily branching in gradually become simpler in as increases. That makes it more likely that as path length grows, path labels become unique at a faster rate than the number of type 1 vertices grows.
5.7 Simplified data structure
The final graph is always reverse deterministic. To see that, consider any two edges and such that . Because and are type 2 vertices, it must be that and , where is a maximal path starting from . As the lexicographic ranges for two type 2 vertices cannot overlap, we have .
For each vertex and symbol , there is at most one vertex such that and . We can therefore replace the indegree bitvector and the BWT with a binary matrix [49]. For each symbol , let be a bitvector of length , with whenever vertex has a predecessor with . The first two parts of a backward search step become .
We replaced a select query on a bitvector and a rank query on a string with a single rank query on a bitvector. A backward search step now consists of four rank queries on bitvectors. Depending on the data structures used, this can be considerably faster than with the full structure. With uncompressed bitvectors, the size bound increases from bits to bits, which can be significant with large alphabets. With Elias–Fano encoded bitvectors for , we get bits at the expense of somewhat slower queries. A practical hybrid approach [46] uses uncompressed bitvectors for common symbols and compressed bitvectors for rare ones.
5.8 Graphs with cycles
BOSS [10] represents de Bruijn graphs with a data structure that is equivalent to the full GCSA structure (Section 5.3). While a de Bruijn graph often contains cycles, its vertices are type 2 vertices, with as the interval. Here is a string of length and is the order of the de Bruijn graph. We could therefore use GCSA, including the more efficient simplified structure (Section 5.7), with de Bruijn graphs. The main difference is that while LF-mapping and backward search follow the edges in BOSS, they go against the direction of the edges in GCSA.
A natural question is which classes of graphs can be represented using GCSA, and what is the corresponding class of formal languages. Because vertex-labeled graphs can be understood as finite automata, the class of languages is a subclass of regular languages. Sirén et al. [49] showed that the class is a proper subclass: GCSA can represent the obvious automaton recognizing language but no automaton recognizing . The characterization of type 2 vertices with lexicographic ranges suggests that the class of graphs admitting the GCSA representation is a generalization of de Bruijn graphs. The properties of this class were later investigated based on the Wheeler graph formalization [1, 2].
5.9 GCSA2
GCSA was designed for indexing pangenome graphs based on aligned DNA sequences. The construction algorithm assumes that the alignments can be represented as a multiple sequence alignment or an acyclic graph. Such models are appropriate for simple variants, such as substitutions, insertions, and deletions. Other forms of variation, such as rearrangements (same substrings appear in different order in different strings) and inversions (a substring is replaced with its reverse complement) can be represented better with cycles in the graph. Such graphs often do not have an equivalent Wheeler graph that can be indexed with GCSA.
GCSA2 [46] is a reimplementation of GCSA for general pangenome graphs. Because the graph cannot be fully indexed, GCSA2 approximates it with an order- de Bruijn graph. It matches patterns of length at most correctly, while longer matches may be false positives combining partial matches in different parts of the original graph. The initial implementation used , which was chosen to make excessive growth of the number of vertices less likely. This was later increased to to allow mapping perfectly matching 150 bp and 250 bp reads as a single hit [47].
Let be the original graph and be the graph after rounds of prefix-doubling (Section 5.4). We replace the final pruning step with a merging step that also merges all type 1 vertices sharing , regardless of whether the vertices match. The resulting pruned de Bruijn graph is equivalent to an order- de Bruijn graph but potentially much smaller. To handle some edge cases, we must ensure that all type 2 vertices have intervals defined by a shared prefix. We build GCSA2 for graph . As a consequence of the final merge, is now a set of vertices in the original graph.
Pangenome graphs contain regions that are dense in variation. While GCSA2 construction stops prefix-doubling early to prevent the worst exponential growth, such regions must often be simplified to keep the size of the index manageable. Because the original GCSA considered the alignments and the graph computational artifacts, it could handle complex regions by changing the definition of alignment (Section 5.6). GCSA2 cannot do that, because the provided alignments are assumed to be biologically meaningful. Instead, a preprocessing step removes everything except the reference genome in regions where there are too many branching vertices in a short window [26]. There is also an option to avoid sequence loss by unfolding the original strings in the complex region while maintaining a mapping to the corresponding positions in the original graph. Variation is then replaced with a set of disjoint paths representing unaligned strings [47].
GCSA2 supports some additional functionality beyond that provided by FM-indexes. It builds a trie of path labels for vertices and represents it using next/previous smaller value queries and range minimum queries over the LCP array [23]. The resulting suffix tree-like functionality is used for finding maximal exact matches between the pattern and the graph [41]. A similar approach was first used with variable-order de Bruijn graphs [9] and later generalized to Wheeler graphs [15]. There is also a data structure for reporting the number of distinct values for vertices in a range . It is based on compressed versions [24] of Sadakane’s document counting structure [44].
6 Founder Graphs
In this section, we deal with graphs built from the segmentation of multiple sequence alignments (MSAs), which is a line of research originating from the study of founder sequences [50]. Similarly to the GCSA approach from Section 5, respecting the MSA positions (i.e. the columns) results in an acyclic graph spelling the input strings and their recombinations. However, differently from GCSA, in these segmentation-induced graphs, which we call founder graphs, the recombinations depend exclusively on the chosen segments.
Many approaches reviewed so far fix a -mer length to avoid indexing exponentially many paths of a graph. While such a fixed parameter yields practically efficient solutions, it is of great theoretical value to study more general approaches. Wheeler Graphs [25] have been proposed as such, but they appear too general: Wheeler Graph recognition is an NP-complete problem [27], so to use the Wheeler graph machinery one first needs to restrict attention to graph classes that somehow avoid this complexity. Founder graphs have similar hardness issues: in Section 6.2 we present the negative results from [17] stating that matching in founder graphs is not easier than matching in general labeled graphs. However, this hardness can be avoided: in Section 6.3 we show that suitable segmentations of MSAs respecting a specific uniqueness property induce graph classes that are easy to index for linear-time pattern matching. These graphs are not limited to a fixed -mer length, but can handle node labels of arbitrary length (provided that they are unique) and arbitrary queries. After delving into the details, we continue discussing the connection to Wheeler Graphs in Section 6.3. We conclude this section with a simple extension of the uniqueness properties of de Bruijn graphs and founder graphs to overlap graphs.
6.1 Inducing graphs from multiple sequence alignments
We define a segmentation of as a partition of the columns into segments , , …, such that for , and additionally we forbid any segment from spelling the empty string, in symbols for all , .
Definition 15 (elastic founder graph [20]).
Let be a multiple sequence alignment. Given segmentation , …, , the elastic founder graph (EFG) induced by is a vertex-labeled graph , where assigns a non-empty label to every vertex, and the set of vertices is partitioned into blocks , …, such that (unique strings in a block) and for all we have that
Moreover, it holds that if and only if there exists and such that , , and .
If , that is, all vertices in are connected to all vertices in for each , then the EFG is an elastic degenerate string (EDS) [30] with no empty string, and we can omit the edge set . Moreover, given an EFG, we define its EDS relaxation as the corresponding EDS over .
See Figure 6. Given an elastic founder graph , for each vertex we denote as the vertex label length . We define height as the maximum number of vertices in a block, and as its maximum vertex label length. We can extend to label paths in the graph: given a -path , with for , we define ; for an edge , we call its edge label. Then, we say that some string occurs in if there exists path in such that is a substring of . More specifically, we indicate with triple the subpath of spelling starting from position in and ending in position in , that is, . The length of subpath is then . We say that occurs starting from the beginning of if ; occurs starting inside if .
6.2 Hardness of matching EFGs
Starting from an MSA, we can construct EFG . After this step, we would like to answer string matching queries for , namely determining whether a given pattern string occurs in . We will now show that, conditioned on some complexity hypotheses, this problem requires quadratic time complexity. This justifies the approach we will adopt in later sections, where we assume an additional property to hold in EFGs in order to break through this quadratic barrier.
The quadratic lower bound is obtained via fine-grained complexity techniques and is conditioned on the Strong Exponential Time Hypothesis (SETH) and the Orthogonal Vectors Conjecture (OVC). Since Williams [51] showed that SETH implies OVC, it suffices to show a subquadratic reduction from the Orthogonal Vectors (OV) problem to obtain this result. Let us first give a formal definition of OV.
Definition 16 (OV).
Let be two sets of binary vectors of length . Determine whether there exist such that .
The notation indicates the scalar product when used for two single entries of vectors and , while it refers to the dot product when applied to the vectors themselves. Currently, it is conjectured that no algorithm can solve OV in truly subquadratic time.
Definition 17 (Orthogonal Vectors Conjecture (OVC)).
Let be the two sets of an OV instance, each containing binary vectors of length . (In this section, keeping in line with the usual notation in the OV problem, we use to denote the size of and , instead of the number of columns of the MSA.) For any constant , no algorithm can solve OV in time , where poly is a polynomial function on .
Then, we can show that determining whether string occurs in graph is at least as hard as solving OV, obtaining a quadratic conditional lower bound.
Theorem 18 ([20]).
For any constant , it is not possible to find a match for a query string in an EFG in either or time, unless OVC fails. This holds even if restricted to an alphabet of size .
We provide an intuition for the proof of Theorem 18 by sketching the subquadratic reduction from OV.
Starting from sets of binary vectors of length , the goal of the reduction is to build string and EFG such that occurs in if and only if there exist such that . Moreover, this reduction should be performed in not more than time, for some constant .
Pattern.
We start by constructing string from set of vectors . We build by combining string gadgets , one for each vector in , plus some additional characters. For example, vector results in string
Full string is then the concatenation .
Graph.
We build graph by combining three different sub-graphs: , , (for left, middle, and right). Our final goal is to build a graph structured in three logical “rows”. We denote the three rows of as , , , respectively. Sub-graph is able to match any prefix of pattern and is connected only to row , which can also match any prefix of . Symmetrically, sub-graph is able to match any suffix of pattern and is connected only to row , which can also match any suffix of . Thus, the first and the third rows of , along with subgraphs and (introduced to allow slack), can match any vector. The second row matches only sub-patterns encoding vectors that are orthogonal to the vectors of set . The key is to structure the graph such that the pattern is forced to use the second row to obtain a full match. We present the full structure of the graph in Figure 8, which shows the graph built on top of vector set . In particular, consists of gadgets , one for each vector . The key elements of these sub-graphs are gadgets , and (see Figure 7), which allow stacking together multiple instances of strings , , , . The overall structure mimics the one in [17, 19], except for the new idea from Figure 7 ([20]).
6.3 Indexable founder graphs
As we have seen, EFGs are not likely to support fast pattern matching without additional conditions. To alleviate this hardness, a uniqueness property has been introduced:
Definition 19 (semi-repeat-free indexability property [20]).
An EFG is semi-repeat-free or indexable (denoted as iEFG) if each subpath spelling for is such that and , that is, only occurs from the beginning of vertices in block . The semi-repeat-free property is a generalization of the stronger repeat-free property, imposing that each only occurs in from the beginning of .
The simpler repeat-freeness condition, requiring to start only from the beginning of vertex , already enables fast pattern matching, but it is open whether there is a fast algorithm to construct such an EFG from the MSA [20].
Before proceeding to the construction of iEFGs, let us first gather some intuition on why such graphs are useful. To fix ideas, assume the input MSA has no gaps . Then semi-repeat-freeness coincides with repeat-freeness, as the strings in each block are of the same length and cannot be prefixes of one another. Consider searching for query string in an iEFG. An FM-index for the collection of edge labels (concatenations of adjacent vertex labels) can be used for checking if occurs inside a vertex or an edge. Longer occurrences must span at least one complete vertex label. One can build an Aho-Corasick automaton on the vertex labels to find all candidate occurrences. Consider vertex with occurring closest to the end of , say, at . Consider a trie built on the set . We can check if can be traced from the root of to verify whether occurs starting from the beginning of . To verify whether occurs ending at the end of a vertex s.t. , we can use tries : we can scan from right to left in . If we reach a leaf, it corresponds to a unique vertex due to the (semi-)repeat-free property. The scanning can continue from , until we have verified an occurrence of or detected that there was no occurrence starting from our candidate vertex .
Claim 20.
If is found neither inside an edge label nor with the verification algorithm above, it has no occurrence in iEFG.
Proof.
For contradiction, assume has an occurrence in . Since such an occurrence is not found within an edge, it must span at least one vertex of . Therefore the verification algorithm has started with some vertex such that , but did not report this occurrence. That is, the substring occurs starting somewhere other than at the beginning of , and thus our iEFG is not repeat-free.
This tailored Aho-Corasick approach can be made more space-efficient by an expanded backward search mechanism [20] similar to the one used for de Bruijn graphs in Section 4.1. Alternatively, one can reduce this case of repeat-free graphs induced from MSAs without gaps to Wheeler Graphs [20].
It is important to note why none of the above approaches works if the MSA has gaps and our iEFG is only semi-repeat-free: is allowed to occur starting from another vertex of the same block. Due to this, straightforward extensions lead to false positive matches, and it is open whether there is a reduction to Wheeler Graphs. However, there is a way to fix the Aho-Corasick approach: consider instead the edges of the graph, as they have a uniqueness property analogous to the one above for vertices in the repeat-free case. One can essentially repeat similar scanning of a suffix and a prefix of on indexes built on the set of edges, but there is a non-trivial detail in connecting the searches that we omit here [43]:
Theorem 21 ([43]).
A semi-repeat-free EFG can be indexed in polynomial time into a data structure occupying bits of space, where , is the total length of the vertex labels, and is the height of . Later, one can find out in time if a given query string occurs in .
Next, we will consider how to construct iEFGs. To fix ideas, we consider only the problem of maximizing the number of blocks in iEFG; there exist similar solutions for other objective functions like minimizing the maximum segment length and minimizing the maximum height of a block.
Recall from Definition 15 that an EFG is defined by a graph induced by a segmentation , …, of an MSA. We denote this graph .
Definition 22 ([20]).
Segment of is semi-repeat-free if for any string occurs in gaps-removed row only at position , where is equal to minus the number of gaps in . (The original definition mistakenly considered the gaps in .)
Lemma 23 (Characterization of semi-repeat-free segments [20]).
The graph induced from a segmentation is semi-repeat-free if and only if all segments of are semi-repeat-free.
Suppose you have found a semi-repeat-free segmentation until column and want to extend it with a new semi-repeat-free segment into a longer segmentation. The following definition gives the minimum length of such an extension.
Definition 24 (Minimal right extensions [20]).
Given , for each we define value as the smallest integer greater than such that segment is semi-repeat-free. If no such integer exists, we define .
To see how to compute these values, let us for now assume our MSA is gapless. One can build a generalized suffix tree of the rows of the MSA. Consider the set of leaves of the suffix tree corresponding to suffixes of the rows starting at column . The goal is to find a set of suffix tree vertices that cover these leaves and only them. Among such sets, one should find the one that contains a vertex with the smallest string depth, as that will determine . This is called the exclusive ancestor set problem and can be solved in time linear in the number of leaves in the input set [43]. Updating the set of leaves when moving from column to can also be done in linear time, but with general MSAs one needs to tailor the algorithm to take gaps into account (e.g. in string depth), which is non-trivial but possible.
Theorem 25 ([43]).
Given with and , we can compute the minimal right extensions for (Definition 24) in time .
Corollary 26 ([43]).
Given with and , the construction of an optimal semi-repeat-free segmentation maximizing the number of blocks can be done in time .
Proof.
Algorithm [20, Algorithm 1] by Equi et al. solves the problem in time, assuming the minimal right extensions are given (Theorem 25).
6.4 Repeat-free graphs: generalizing de Bruijn graphs with vertex uniqueness
We generalize iEFGs respecting the repeat-free property to overlap graphs respecting an equivalent and homonymous property. We show that graphs where long strings are constrained to have a unique starting location in the graph are a straightforward generalization of de Bruijn graphs, and are still indexable in polynomial time for linear-time pattern matching.
We define labeled graphs where each vertex is labeled with a non-empty string of variable length, and each directed edge specifies a suffix-prefix overlap between the connected vertices, as follows.
Definition 27 (Overlap graph).
An overlap graph is a quadruple where is a finite set of vertices, labels the vertices, is the set of directed edges, and defines the length of a (possibly empty) suffix-prefix overlap between the vertices of each edge such that the following holds: and
Condition forces the overlap to be shorter than one full vertex label.
See Figure 9. We extend to label paths of as follows: given path ( for all ), we define path label as if , otherwise it is equal to
that is, the string spelled by path considers the overlaps between adjacent vertices only once.
6.4.1 Hardness of exact matching in overlap graphs
We say that a pattern occurs in an overlap graph if is a substring of for some path in . Then, it is immediate to see that the problem of matching a query in an overlap graph is a generalization of string matching in labeled graphs, since the latter are vertex-labeled graphs with overlap equal to 0 for all edges. Thus, the conditional lower bounds of [19, 18] hold also for overlap graphs. However, two classes of graphs that bypass the conditional lower bounds by having a uniqueness property of the possible strings spelled in the graph are de Bruijn graphs and the iEFGs we considered earlier.
Observation 28.
Overlap graphs (Definition 27) are a generalization of EFGs (Definition 15), since the latter are directed graphs, the vertices are labeled with non-empty strings, and the edges do not present any overlap ( for all ).
Observation 29.
Overlap graphs (Definition 27) are a generalization of de Bruijn graphs (Definition 7) and compacted de Bruijn graphs (see e.g. [13]), since these can be represented as -mers or unitigs connected by directed edges with a fixed overlap of ( for all ).
Observation 30 (Uniqueness property of dBGs).
In a (compacted) de Bruijn Graph , we have that for every possible -mer , if occurs in then is uniquely associated to a single node in .
6.4.2 Indexable overlap graphs
We study the natural extension of the repeat-free property (Definition 19) and the uniqueness property of de Bruijn Graphs (Observation 30) to overlap graphs, where the occurrences of patterns ignore overlaps between nodes. For this, we need to redefine the interpretation of occurrence positions, as an occurrence starting inside the overlapping part of an edge can be seen to start from or from .
Definition 31.
Let query have an occurrence in an overlap graph with . We say that such an occurrence is redundant, as it can be written equivalently as . Occurrences that are not redundant are called non-redundant.
Definition 32.
Query occurs in an overlap graph starting inside a node if it has a non-redundant occurrence with . Query occurs in an overlap graph starting from the beginning of a node if all its non-redundant occurrences are of the form .
Definition 33.
We say that an overlap graph is repeat-free if each non-redundant occurrence of for is such that and , that is, occurs in only starting from the beginning of .
Then, we argue that the expanded backward search algorithm [20] for repeat-free EFGs is also correct for overlap graphs, since it performs the following steps:
1. preprocess by indexing string for pattern matching queries and bit arrays , for rank and select queries, where and are defined such that and if and only if is the lexicographic interval of some node label , ; then, given a non-empty lexicographical interval of a string in , we can answer in constant time whether is a prefix of for some node ;
2. perform backward search for in and, when the current lexicographical interval is fully contained in the lexicographical interval of some , expand the interval to that of and continue the search.
The correctness follows from the repeat-free property and the lexicographic order: whenever we have read a substring of – initially, a suffix of – whose lexicographical interval is contained in that of some , it must be that is a prefix of , and by the repeat-free property all matches of in start from the beginning of . By expanding the lexicographic range to that of , we are discarding the characters and can continue the search without losing any occurrence.
Theorem 34.
A repeat-free overlap graph G = (V, E) can be indexed in O(N log N) time, where N = Σ_{v ∈ V} (in(v) + out(v) + 1)·|ℓ(v)| and in(v) and out(v) are the indegree and outdegree of vertex v, to answer whether a query Q occurs in G in O(|Q|) time.
We leave it for future work how to construct such repeat-free overlap graphs; this is at least as hard as constructing repeat-free founder graphs from an MSA with gaps, for which we do not have an efficient construction either.
7 Conclusions
This chapter has reviewed BWT-inspired indexing of more complex structures, such as tries, finite languages, de Bruijn graphs, and aligned sequences. As we have seen, much can be gained by tailoring the BWT to these specific types of objects. The next chapter covers a more general indexing framework for labeled graphs, via the notion of Wheeler Graphs.
Even for the special cases we have covered above, many open problems remain. Although enormous progress has been made toward making these data structures practical – in many cases to the point that they are used routinely in a variety of bioinformatics pipelines (e.g., the SBWT lies at the heart of the Themisto pseudoalignment software [6]) – further improvements in practical performance are desirable. Construction algorithms in particular have room for improvement, and precisely how best to parallelize both construction and querying is currently under-explored.
References
- [1] Jarno Alanko, Giovanna D’Agostino, Alberto Policriti, and Nicola Prezza. Regular languages meet prefix sorting. In Proc. SODA 2020, pages 911–930. SIAM, 2020. doi:10.1137/1.9781611975994.55.
- [2] Jarno Alanko, Giovanna D’Agostino, Alberto Policriti, and Nicola Prezza. Wheeler languages. Information and Computation, 281:104820, 2021. doi:10.1016/j.ic.2021.104820.
- [3] Jarno N. Alanko, Elena Biagi, and Simon J. Puglisi. Longest common prefix arrays for succinct k-spectra. In Proc. SPIRE 2023, volume 14240 of LNCS, pages 1–13. Springer, 2023. doi:10.1007/978-3-031-43980-3_1.
- [4] Jarno N. Alanko, Elena Biagi, Simon J. Puglisi, and Jaakko Vuohtoniemi. Subset wavelet trees. In Proc. SEA 2023, LIPIcs. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPICS.SEA.2023.4.
- [5] Jarno N. Alanko, Simon J. Puglisi, and Jaakko Vuohtoniemi. Small searchable k-spectra via subset rank queries on the spectral Burrows-Wheeler transform. In Proc. ACDA, pages 225–236. SIAM, 2023. doi:10.1137/1.9781611977714.20.
- [6] Jarno N. Alanko, Jaakko Vuohtoniemi, Tommi Mäklin, and Simon J. Puglisi. Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes. Bioinform., 39(Supplement-1):260–269, 2023. doi:10.1093/BIOINFORMATICS/BTAD233.
- [7] Uwe Baier. On undetected redundancy in the Burrows-Wheeler transform. In Proc. CPM 2018, volume 105 of LIPIcs, pages 3:1–3:15, 2018. doi:10.4230/LIPIcs.CPM.2018.3.
- [8] Andrej Baláž, Travis Gagie, Adrián Goga, Simon Heumos, Gonzalo Navarro, Alessia Petescia, and Jouni Sirén. Wheeler maps. In Proc. LATIN 2024, volume 14578 of LNCS, pages 178–192. Springer, 2024. doi:10.1007/978-3-031-55598-5_12.
- [9] Christina Boucher, Alex Bowe, Travis Gagie, Simon J. Puglisi, and Kunihiko Sadakane. Variable-order de Bruijn graphs. In Proc. DCC 2015, pages 383–392. IEEE, 2015. doi:10.1109/DCC.2015.70.
- [10] Alexander Bowe, Taku Onodera, Kunihiko Sadakane, and Tetsuo Shibuya. Succinct de Bruijn graphs. In Proc. WABI 2012, pages 225–235. Springer, 2012. doi:10.1007/978-3-642-33122-0_18.
- [11] Michael Burrows and David J. Wheeler. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994. URL: http://www.hpl.hp.com/techreports/Compaq-DEC/SRC-RR-124.html.
- [12] Bastien Cazaux and Eric Rivals. Linking BWT and XBW via Aho-Corasick automaton: Applications to run-length encoding. In Proc. CPM 2019, volume 128 of LIPIcs, pages 24:1–24:20, 2019. doi:10.4230/LIPIcs.CPM.2019.24.
- [13] Rayan Chikhi, Antoine Limasset, and Paul Medvedev. Compacting de Bruijn graphs from sequencing data quickly and in low memory. Bioinform., 32(12):201–208, 2016. doi:10.1093/BIOINFORMATICS/BTW279.
- [14] David R. Clark and J. Ian Munro. Efficient suffix trees on secondary storage (extended abstract). In Proc. SODA 1996, pages 383–391. SIAM, 1996. doi:10.1145/313852.314087.
- [15] Alessio Conte, Nicola Cotumaccio, Travis Gagie, Giovanni Manzini, Nicola Prezza, and Marinella Sciortino. Computing matching statistics on Wheeler DFAs. In Proc. DCC 2023, pages 150–159. IEEE, 2023. doi:10.1109/DCC55655.2023.00023.
- [16] Nicola Cotumaccio, Giovanna D’Agostino, Alberto Policriti, and Nicola Prezza. Co-lexicographically ordering automata and regular languages - part I. Journal of the ACM, 70(4):1–73, 2023. doi:10.1145/3607471.
- [17] Massimo Equi, Roberto Grossi, Veli Mäkinen, and Alexandru I. Tomescu. On the complexity of string matching for graphs. In Proc. ICALP 2019, volume 132 of LIPIcs, pages 55:1–55:15, 2019. doi:10.4230/LIPICS.ICALP.2019.55.
- [18] Massimo Equi, Veli Mäkinen, and Alexandru I. Tomescu. Graphs cannot be indexed in polynomial time for sub-quadratic time string matching, unless SETH fails. Theor. Comput. Sci., 975:114128, 2023. doi:10.1016/J.TCS.2023.114128.
- [19] Massimo Equi, Veli Mäkinen, Alexandru I. Tomescu, and Roberto Grossi. On the complexity of string matching for graphs. ACM Trans. Algorithms, 19(3):21:1–21:25, 2023. doi:10.1145/3588334.
- [20] Massimo Equi, Tuukka Norri, Jarno Alanko, Bastien Cazaux, Alexandru I. Tomescu, and Veli Mäkinen. Algorithms and complexity on indexing founder graphs. Algorithmica, 85(6):1586–1623, 2023. doi:10.1007/S00453-022-01007-W.
- [21] Paolo Ferragina, Fabrizio Luccio, Giovanni Manzini, and S. Muthukrishnan. Compressing and indexing labeled trees, with applications. Journal of the ACM, 57(1):4, 2009. doi:10.1145/1613676.1613680.
- [22] Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM, 52(4):552–581, 2005. doi:10.1145/1082036.1082039.
- [23] Johannes Fischer, Veli Mäkinen, and Gonzalo Navarro. Faster entropy-bounded compressed suffix trees. Theoretical Computer Science, 410(51):5354–5364, 2009. doi:10.1016/j.tcs.2009.09.012.
- [24] Travis Gagie, Aleksi Hartikainen, Kalle Karhu, Juha Kärkkäinen, Gonzalo Navarro, Simon J. Puglisi, and Jouni Sirén. Document retrieval on repetitive string collections. Information Retrieval Journal, 20(3):253–291, 2017. doi:10.1007/s10791-017-9297-7.
- [25] Travis Gagie, Giovanni Manzini, and Jouni Sirén. Wheeler graphs: A framework for BWT-based data structures. Theoretical Computer Science, 698:67–78, 2017. doi:10.1016/J.TCS.2017.06.016.
- [26] Erik Garrison, Jouni Sirén, Adam M. Novak, Glenn Hickey, Jordan M. Eizenga, Eric T. Dawson, William Jones, Shilpa Garg, Charles Markello, Michael F. Lin, Benedict Paten, and Richard Durbin. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nature Biotechnology, 36(9):875–879, 2018. doi:10.1038/nbt.4227.
- [27] Daniel Gibney and Sharma V. Thankachan. On the complexity of recognizing wheeler graphs. Algorithmica, 84(3):784–814, 2022. doi:10.1007/S00453-021-00917-5.
- [28] Simon Gog and Matthias Petri. Optimized succinct data structures for massive data. Softw. Pract. Exp., 44(11):1287–1314, 2014. doi:10.1002/SPE.2198.
- [29] Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. High-order entropy-compressed text indexes. In Proc. SODA 2003, pages 841–850. SIAM, 2003. URL: http://dl.acm.org/citation.cfm?id=644108.644250.
- [30] Roberto Grossi, Costas S. Iliopoulos, Chang Liu, Nadia Pisanti, Solon P. Pissis, Ahmad Retha, Giovanna Rosone, Fatima Vayani, and Luca Versari. On-line pattern matching on similar texts. In Proc. CPM 2017, volume 78 of LIPIcs, pages 9:1–9:14. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2017. doi:10.4230/LIPICS.CPM.2017.9.
- [31] Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing, 35(2):378–407, 2005. doi:10.1137/S0097539702402354.
- [32] Lin Huang, Victoria Popic, and Serafim Batzoglou. Short read alignment with populations of genomes. Bioinformatics, 29(13):i361–i370, 2013. doi:10.1093/bioinformatics/btt215.
- [33] Songbo Huang, T.W. Lam, W.K. Sung, S.L. Tam, and S.M. Yiu. Indexing similar DNA sequences. In Proc. AAIM 2010, volume 6124 of LNCS, pages 180–190. Springer, 2010. doi:10.1007/978-3-642-14355-7_19.
- [34] Sorina Maciuca, Carlos del Ojo Elias, Gil McVean, and Zamin Iqbal. A natural encoding of genetic variation in a Burrows-Wheeler transform to enable mapping and genome inference. In Proc. WABI 2016, volume 9838 of LNCS, pages 222–233. Springer, 2016. doi:10.1007/978-3-319-43681-4_18.
- [35] Giovanni Manzini. XBWT tricks. In Proc. SPIRE 2016, volume 9954 of LNCS, pages 80–92. Springer, 2016. doi:10.1007/978-3-319-46049-9_8.
- [36] Miguel A. Martínez-Prieto, Nieves Brisaboa, Rodrigo Cánovas, Francisco Claude, and Gonzalo Navarro. Practical compressed string dictionaries. Information Systems, 56:73–108, 2016. doi:10.1016/j.is.2015.08.008.
- [37] J. Ian Munro. Tables. In Proc. FSTTCS 1996, volume 1180 of LNCS. Springer, 1996. doi:10.1007/3-540-62034-6_35.
- [38] Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki. Storage and retrieval of highly repetitive sequence collections. Journal of Computational Biology, 17(3):281–308, 2010. doi:10.1089/cmb.2009.0169.
- [39] Joong Chae Na, Hyunjoon Kim, Seunghwan Min, Heejin Park, Thierry Lecroq, Martine Léonard, Laurent Mouchard, and Kunsoo Park. FM-index of alignment with gaps. Theoretical Computer Science, 710:148–157, 2018. doi:10.1016/j.tcs.2017.02.020.
- [40] Joong Chae Na, Hyunjoon Kim, Heejin Park, Thierry Lecroq, Martine Léonard, Laurent Mouchard, and Kunsoo Park. FM-index of alignment: A compressed index for similar strings. Theoretical Computer Science, 638:159–170, 2016. doi:10.1016/j.tcs.2015.08.008.
- [41] Enno Ohlebusch, Simon Gog, and Adrian Kügel. Computing matching statistics and maximal exact matches on compressed full-text indexes. In Proc. SPIRE 2010, volume 6393 of LNCS, pages 347–358. Springer, 2010. doi:10.1007/978-3-642-16321-0_36.
- [42] Daisuke Okanohara and Kunihiko Sadakane. Practical entropy-compressed rank/select dictionary. In Proc. ALENEX 2007, pages 60–70. SIAM, 2007. doi:10.1137/1.9781611972870.6.
- [43] Nicola Rizzo, Massimo Equi, Tuukka Norri, and Veli Mäkinen. Elastic founder graphs improved and enhanced. Theor. Comput. Sci., 982:114269, 2024. doi:10.1016/J.TCS.2023.114269.
- [44] Kunihiko Sadakane. Succinct data structures for flexible text retrieval systems. Journal of Discrete Algorithms, 5(1):12–22, 2007. doi:10.1016/j.jda.2006.03.011.
- [45] Jouni Sirén. Sampled longest common prefix array. In Proc. CPM 2010, volume 6129 of LNCS, pages 227–237. Springer, 2010. doi:10.1007/978-3-642-13509-5_21.
- [46] Jouni Sirén. Indexing variation graphs. In Proc. ALENEX 2017, pages 13–27. SIAM, 2017. doi:10.1137/1.9781611974768.2.
- [47] Jouni Sirén, Erik Garrison, Adam M. Novak, Benedict Paten, and Richard Durbin. Haplotype-aware graph indexes. Bioinformatics, 36(2):400–407, 2020. doi:10.1093/bioinformatics/btz575.
- [48] Jouni Sirén, Niko Välimäki, and Veli Mäkinen. Indexing finite language representation of population genotypes. In Proc. WABI 2011, volume 6833 of LNCS, pages 270–281. Springer, 2011. doi:10.1007/978-3-642-23038-7_23.
- [49] Jouni Sirén, Niko Välimäki, and Veli Mäkinen. Indexing graphs for path queries with applications in genome research. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 11(2):375–388, 2014. doi:10.1109/TCBB.2013.2297101.
- [50] Esko Ukkonen. Finding founder sequences from a set of recombinants. In Proc. WABI 2002, volume 2452 of LNCS, pages 277–286. Springer, 2002. doi:10.1007/3-540-45784-4_21.
- [51] Ryan Williams. A new algorithm for optimal 2-constraint satisfaction and its implications. Theor. Comput. Sci., 348(2-3):357–365, 2005. doi:10.1016/J.TCS.2005.09.023.
