Abstract 1 Introduction 2 Notation 3 XBWT: Indexing Labeled Trees 4 SBWT: Indexing k-Spectra 5 GCSA: Indexing Finite Languages 6 Founder Graphs 7 Conclusions References

Graph Indexing Beyond Wheeler Graphs

Jarno N. Alanko ORCID Department of Computer Science, University of Helsinki, Finland Elena Biagi ORCID Department of Computer Science, University of Helsinki, Finland Massimo Equi ORCID Department of Computer Science, Aalto University, Finland Veli Mäkinen ORCID Department of Computer Science, University of Helsinki, Finland Simon J. Puglisi ORCID Department of Computer Science, University of Helsinki, Finland Nicola Rizzo ORCID Department of Computer Science, University of Helsinki, Finland Kunihiko Sadakane ORCID Department of Mathematical Informatics, University of Tokyo, Japan Jouni Sirén ORCID Genomics Institute, University of California, Santa Cruz, CA, USA
Abstract

After the discovery of the FM index, which linked the Burrows-Wheeler transform (BWT) to pattern matching on strings, several contemporaneous strands of research began on indexing more complex structures with the BWT, such as tries, finite languages, de Bruijn graphs, and aligned sequences. These directions can now be viewed as culminating in the theory of Wheeler Graphs, but sometimes they go beyond. This chapter reviews the significant body of “proto Wheeler Graph” indexes, many of which exploit characteristics of their specific case to outperform Wheeler graphs, especially in practice.

Keywords and phrases:
indexing, compression, compressed data structures, string algorithms, pattern matching
Funding:
Jarno N. Alanko: Funded by the Helsinki Institute for Information Technology (HIIT).
Elena Biagi: Funded by the Academy of Finland via grant 339070.
Massimo Equi: Partially funded by the Research Council of Finland, Grant 359104.
Veli Mäkinen: Partially funded by the European Union’s Horizon Europe research and innovation programme under grant agreement No 101060011 (TeamPerMed).
Simon J. Puglisi: Funded by the Academy of Finland via grant 339070.
Nicola Rizzo: Funded by the European Union’s Horizon Europe research and innovation programme under grant agreement No 101060011 (TeamPerMed).
Jouni Sirén: Funded by the National Human Genome Research Institute (NHGRI) under award numbers U01HG013748 and U41HG010972.
Copyright and License:
[Uncaptioned image] © Jarno N. Alanko, Elena Biagi, Massimo Equi, Veli Mäkinen, Simon J. Puglisi, Nicola Rizzo, Kunihiko Sadakane, and Jouni Sirén; licensed under Creative Commons License CC-BY 4.0
2012 ACM Subject Classification:
Theory of computation Sorting and searching
; Theory of computation Graph algorithms analysis ; Theory of computation Pattern matching ; Theory of computation Formal languages and automata theory
Acknowledgements:
We wish to thank Hideo Bannai for spotting a mistake in the original version of Definition 22.
Editors:
Paolo Ferragina, Travis Gagie, and Gonzalo Navarro

1 Introduction

After the discovery of the FM index [22], designed for pattern matching on strings using the Burrows-Wheeler transform (BWT)[11], many groups explored BWT-inspired indexing of more complex structures, such as tries [21], finite languages [48, 49], de Bruijn graphs [10, 5], and aligned sequences [20, 43]. These directions can now be viewed as culminating in the theory of Wheeler Graphs, which is covered in the next chapter. Before that, however, in this chapter we review the significant body of “proto Wheeler Graph” indexes, many of which exploit characteristics of their specific case to outperform Wheeler graphs, especially in practice. While indeed many of the approaches reviewed (but not all) can be encapsulated in the general Wheeler Graph framework, the connection usually covers only the last indexing step. Construction of the object to be indexed is typically the most complex step and this is tailored for each approach. Especially these construction considerations go beyond the Wheeler Graph framework.

We start with key notions and index structures for strings and string collections. Before proceeding to general graphs, we extend the techniques to tries. Then we continue with de Bruijn graphs that represent overlaps of length k substrings of a string collection. The studied extensions of BWT are then shown to work for general graphs, albeit with the caveat of worst case exponential index size. This worst case behavior is then avoided by resorting back to de Bruijn graphs, but this time indexing paths of a graph rather than string collection. Finally, we explore an approach to directly build graphs from aligned sequences that have indexing properties similar to de Bruijn graphs, but without fixed parameter k for the substring length.

2 Notation

2.1 Strings

Let S=S[1]S[n] be a string of length |S| over an alphabet Σ of size |Σ|. If the string and the alphabet are clear from the context, we often denote their sizes as n=|S| and σ=|Σ|. We write substrings as S[i:j]=S[i]S[j] and call substrings of the form S[:j]=S[1:j] and S[i:]=S[i:n] prefixes and suffixes, respectively. Let $Σ be the smallest symbol in the alphabet. We often use it as a sentinel S[n]=$ that occurs only at the end of the string.

Let S be a string of length n, let i be an offset in it, and let cΣ be a symbol. We define S.rank(i,c) as the number of occurrences of symbol c in the prefix S[1:i]. For convenience, we define S.rank(i,c)=0 when i<1 and S.rank(i,c)=S.rank(n,c) when i>n.

We define the inverse operation S.select(i,c) as the position of the occurrence of c of rank i. Let nc=S.rank(n,c) be the total number of occurrences of symbol c in the string. For 1inc, we have j=S.select(i,c) such that S.rank(j,c)=i and S.rank(j1,c)=i1. We also define S.select(0,c)=0 and S.select(nc+1,c)=n+1 for convenience and leave the function undefined for the remaining values of i.

A bitvector is a data structure that stores a binary string and supports efficient rank/select queries. If we store the string as such, we can support both queries in O(1) time and n+o(n) bits of space [14, 37]. When the bitvector is sparse, we can achieve better space usage by using Elias–Fano encoding to store the positions with ones [31, 42]. If there are n1 ones in a bitvector of length n, we can store it in n1log(n/n1)+O(n1) bits of space and support select queries in O(1) time and rank queries in O(log(n/n1)) time.

2.2 String collections

Let 𝒮=(S1,,Sm) be an ordered collection of |𝒮| strings of total length 𝒮. We assume that each string Si ends with a sentinel $i and that $i<$j iff i<j to make the suffixes of the strings all distinct in lexicographic comparisons. However, we represent the sentinels using the same symbol $ in the actual strings. When the collection is clear from the context, we use m=|𝒮| and N=𝒮 to denote the number of strings and the total length of the strings. When the collection consists of a single string S, we may consider the collection and the string to be the same.

Definition 1 (Multiple sequence alignment).

Let 𝒮=(S1,,Sm) be a collection of strings over alphabet Σ, and let Σ be a gap symbol. Let 𝗌𝗉𝖾𝗅𝗅:(Σ{})Σ be a function such that 𝗌𝗉𝖾𝗅𝗅(S) is string S with all occurrences of the gap symbol removed. A multiple sequence alignment of collection 𝒮 is a matrix with m rows and n columns 𝖬𝖲𝖠[1:m,1:n](Σ{})m×n such that for each row i, we have Si=𝗌𝗉𝖾𝗅𝗅(𝖬𝖲𝖠[i,1:n]).

Let (i,j) and (i,j) be two positions in the same column. If we have 𝖬𝖲𝖠[i,j] (if 𝖬𝖲𝖠[i,j]), let Si[x] (Si[y]) be the corresponding symbol. We interpret the alignment in the following way:

  • If 𝖬𝖲𝖠[i,j]=𝖬𝖲𝖠[i,j], positions Si[x] and Si[y] are a match.

  • If 𝖬𝖲𝖠[i,j]𝖬𝖲𝖠[i,j], positions Si[x] and Si[y] are a mismatch.

  • If 𝖬𝖲𝖠[i,j] and 𝖬𝖲𝖠[i,j]=, string Si has an insertion relative to string Si and string Si has a deletion relative to string Si at the position.

Recombinations let us build new strings by breaking existing strings in two at aligned positions and concatenating the prefix of one string with the suffix of another. Using the above notation, if positions Si[x] and Si[y] are a match, we can recombine strings Si and Si at those positions as Si[:x1]Si[y:] and Si[:y1]Si[x:]. This corresponds to switching between rows i and i in column j of the multiple sequence alignment.

In later sections, we discuss data structures for indexing all possible recombinations of a string collection compatible with a given alignment. We formalize this as follows.

Definition 2 (Closure under recombinations).

Let 𝒮 be a collection of strings and let be an equivalence relation for its suffixes such that if S[j:]S[j:], we have S[j]=S[j]. The closure of collection 𝒮 under recombinations is a collection 𝒮 such that:

  1. 1.

    𝒮𝒮

  2. 2.

    For any two strings S,S𝒮 and positions j,j such that S[j:]S[j:], it holds that S′′=S[:j1]S[j:]𝒮.

  3. 3.

    Given a string S′′𝒮 as above, S′′[j′′:]S[j′′:] if j′′<j and S′′[j′′:]S[j′′j+j:] if j′′j.

According to rule 3, each suffix S′′[j′′:] of the recombined string S′′ inherits the alignment of the suffix starting from the corresponding position in the source strings S and S.

2.3 Arrays over sorted suffixes

Let 𝒮 be an ordered collection of strings. We define the (generalized) suffix array of collection 𝒮 as an array of pointers SA[1:N] to the suffixes of the strings in lexicographic order. Each pointer SA[x]=(i,j) refers to the suffix Si[j:] of string Si. Given two pointers SA[x]=(i,j) and SA[x]=(i,j), we have x<x iff Si[j:]<Si[j:] in lexicographic order. We therefore call x the lexicographic rank of suffix Si[j:] among the suffixes of collection 𝒮. Note that for 1xm, we have SA[x]=(x,|Sx|).

The multi-string Burrows–Wheeler transform (BWT) [11] of collection 𝒮 is a string BWT[1:N] representing a permutation of the symbols in strings S𝒮. We define it using the suffix array. If SA[x]=(i,j) and j>1, the BWT stores the preceding symbol BWT[x]=Si[j1]. When j=1, we set BWT[x]=$.

In addition to the suffix array and the Burrows–Wheeler transform, we often define other arrays over the lexicographically sorted suffixes. The longest common prefix (LCP) array LCP[1:N] stores the length of the longest common prefix of two lexicographically adjacent suffixes. If SA[x]=(i,j) and SA[x1]=(i,j), we store the LCP of suffixes Si[j:] and Si[j:] in LCP[x], with LCP[1]=0.

2.4 FM-index

The Burrows–Wheeler transform can be inverted using a function called LF-mapping that maps the lexicographic rank of a suffix to the lexicographic rank of its cyclic predecessor. We define it in two parts. Let SA[x]=(i,j). If j>1, we define LF(x)=y such that SA[y]=(i,j1). We sometimes write this as SA[LF(x)]=SA[x]1. For j=1, we define LF(x)=y such that SA[y]=(i,|Si|).

We can compute the LF-mapping for most values of x using the BWT. Let C[1:σ] be an array such that C[c] is the total number of occurrences of all symbols c<c in collection 𝒮. Let SA[x]=(i,j). If j>1, which is equivalent to BWT[x]$, we have LF(x)=C[BWT[x]]+BWT.rank(x,BWT[x]). Here C[BWT[x]] is the number of suffixes starting with a symbol smaller than BWT[x]=Si[j1] and BWT.rank(x,BWT[x]) is the lexicographic rank of suffix Si[j1:] among the suffixes starting with symbol BWT[x].

Inverse LF-mapping, denoted as Ψ, can be computed as Ψ(x)=LF1(x) by identifying a symbol c such that C[c]<xC[x+1] and then applying Ψ(x)=BWT.select(xC[c],c).

We can generalize LF-mapping for hypothetical suffixes. Let S be the lexicographic rank of string X among the suffixes of collection 𝒮. That is, x is the number of suffixes Y of the collection such that YX. Then, given a symbol cΣ with c$, we define the generalized LF-mapping as the lexicographic rank of string cX. We can compute it as LF(x,c)=C[c]+BWT.rank(x,c).

Let find(X)=[sp:ep] be the lexicographic range of suffixes starting with string X. That is, for each x[sp:ep], we have SA[x]=(i,j) such that Si[j:j+|X|1]=X. Then for any symbol c, the range of suffixes starting with cX is find(cX)=[LF(sp1,c)+1:LF(ep,c)]. This backward search step forms the core of the FM-index [22]. Given a pattern P, we can determine the range of suffixes find(P) starting with P with O(|P|) rank operations on the BWT.

Another core operation is locate(x)=SA[x]=(i,j), which converts the lexicographic rank of a suffix to its starting position. To compute it, the FM-index needs suffix array samples in addition to the BWT. Let B[1:N] be a bitvector that marks the sampled suffix array positions with 1, and let S be the subsequence of the suffix array corresponding to those positions. If B[x]=1, we can compute locate(x) as S[B.rank(x,1)]. Otherwise we rely on iterated LF-mapping and the fact that SA[LF(x)]=SA[x]1 in the general case. If we need k steps to find the nearest sampled position, we have locate(x)=locate(LFk(x))+k. For this scheme to function, we need to sample SA[x]=(i,j) whenever j=1. If we sample SA[x] every time j1(mods) for a parameter s>0, we can compute locate(x) with O(s) rank operations.

2.5 Graphs

Let G=(V,E) be a graph with a finite set of vertices V and a set of edges EV×V. We assume that the graph is directed. For any two vertices u,vV, with uv, edges (u,v) and (v,u) are distinct. A path is a string P[1:n] of vertices such that (P[i],P[i+1])E for 1i<n. We call the graph acyclic if there is no path that visits a vertex vV more than once.

A vertex-labeled graph is a tuple G=(V,E,Σ,), where (V,E) is a graph, Σ is an alphabet, and :VΣ is a function that assigns a label (v)Σ to each vertex vV. If P is a path, we write (P)=(P[1])(P[n]) to refer to the string formed by concatenating the vertex labels. We can similarly define edge-labeled graphs as tuples G=(V,E,Σ,), where :EΣ is a function that assigns labels to edges. Later, we extend the notions to vertices and edges labeled with non-empty strings.

Definition 3 (Alignment graph).

Let 𝒮 be a collection of strings over alphabet Σ, and let be an equivalence relation for its suffixes such that if S[j:]S[j:], we have S[j]=S[j]. The alignment graph of collection 𝒮 with respect to is a vertex-labeled graph G=(V,E,Σ,) such that:

  1. 1.

    The set of vertices V is the set of equivalence classes for relation .

  2. 2.

    For any two vertices u,vV, we have (u,v)E if and only if there is a string S𝒮 and a position j such that S[j:]u and S[j+1:]v.

  3. 3.

    The label of vertex vV is the shared first symbol (v)=S[j], where S[j:]v.

Note that the set of labels (P) for paths P in the alignment graph is the set of substrings of strings S𝒮. If is an equivalence relation based on a multiple sequence alignment such that S[j:]S[j:] when the corresponding positions match, the resulting alignment graph is always acyclic.

3 XBWT: Indexing Labeled Trees

The Burrows–Wheeler transform was originally defined for a single string. Extending the definition to an ordered collection of strings is straightforward enough. But a collection of strings is still a linear structure, while paths in a graph can both diverge and converge. The first step towards using the BWT for graph indexing was the extended Burrows–Wheeler transform (XBWT) [21], which can index labeled trees. The key technique was using select queries on a bitvector for expanding a (co-)lexicographic range [sp:ep] of vertices to a (co-)lexicographic range [spout:epout] of outgoing edges.

In this section, we present a variant of the XBWT that is similar to the indexes for vertex-labeled graphs discussed in Sections 4 and 5. There are other fundamentally equivalent representations. The representations differ in their handling of leaf vertices and outdegrees, on whether they consider vertex-labeled or edge-labeled trees, and on whether the search proceeds from the root towards the leaves or the other way around. Some of them achieve better space bounds than what is presented here.

3.1 Data structure

A vertex-labeled tree with root vrV is a vertex-labeled graph T=(V,E,Σ,) such that the indegree of the root is 0 and the indegree of every other vertex vV is 1, with a unique path Pv from the root to every vertex vV. Given an edge (u,v)E, we call vertex u the parent of vertex v and v a child of u. We call vertices with outdegree 0 the leaves of the tree.

Figure 1: XBWT for the trie of strings GATATAT, GATTACAT, and GATTAGAT. The XBWT consists of the bitvector Bout and the string BWT.

XBWT uses vertices in place of suffixes. The vertices are sorted by the label (Pv) in colexicographic order (in lexicographic order by reverse path labels). For a given vertex uV, the BWT stores the concatenation of the labels (v) for each outgoing edge (u,v)E. We encode the outdegree d in unary as a one followed by d1 zeros and concatenate the outdegrees in the same order to form bitvector Bout. In order to handle the leaves, we add a technical vertex v$V such that (v$)=$. For each leaf v, we encode an edge (v,v$)E. As with strings, we assume that (v)$ for every vertex vV. If the colexicographic rank of vertex vV is x, the labels of its children are stored in BWT[Bout.select(x,1):Bout.select(x+1,1)1]. See Figure 1 for an example. With an uncompressed wavelet tree [29] for the BWT and an uncompressed bitvector for Bout, the XBWT data structure requires O(|V|logσ) bits of space, not including the C array.

Let find(X)=[sp:ep] be the colexicographic range of vertices vV with path labels (Pv) ending with string X. Given a symbol cΣ, we want to find the range find(Xc) of vertices with path labels ending with Xc. We first expand the range into a colexicographic range of outgoing edges [spout:epout]=[Bout.select(sp,1):Bout.select(ep+1,1)1]. Every edge (u,v)E in that range is encoded with the label (v) of the child vertex, and the colexicographic rank of that vertex is based on path label (Pv)=(Pu)(v). Hence we can use a backward search step (Section 2.4) to get the range of children with label c: find(Xc)=[LF(spout1,c)+1:LF(epout,c)] Note that the backward search step becomes a forward search step, as the XBWT is based on colexicographic order. Given a string X, we can therefore compute find(X) using O(|X|) rank operations on BWT and O(|X|) select operations on Bout.

If we know the colexicographic rank xr of the root vr, we can match entire path labels instead of suffixes by starting the above search from range [xr:xr]. We can also use the XBWT for tree navigation. Given a vertex vV with colexicographic rank x, the range of its children is [Bout.select(x,1):Bout.select(x+1,1)1]. The colexicographic rank of its i-th child is LF(Bout.select(x,1)+i1) and the rank of its parent is Bout.rank(Ψ(x),1).

3.2 Applications

The initial XBWT paper proposed using the structure as a fast and space-efficient representation for hierarchical documents, such as XML [21]. In such applications, it is important that a vertex can have multiple children with the same label (such as list items). When storing string collections (often called string dictionaries in this context), we can collapse the tree into a trie, improving space usage. Compared to other representations for string collections, the XBWT is smaller but slower [36]. The slowness comes from symbol-to-symbol navigation and the poor memory locality of iterated LF-mapping. However, the same FM-index functionality enables substring searches not supported by faster representations based on hashing or front coding.

The Aho–Corasick algorithm can find the occurrences of multiple patterns 𝒫 in a string S in linear time O(|S|+𝒫+occ), where occ is the number of occurrences. It is based on a trie T=(V,E,Σ,) of the patterns augmented with failure links (suffix links). A failure link from vertex vV points to a vertex uV such that (Pu) is the longest proper suffix of (Pv) among all vertices uV. The XBWT can be augmented with a 2|V|(1+o(1))-bit balanced parentheses representation to support failure links, making it a space-efficient Aho–Corasick automaton [35]. Here the relative slowness matters less, as the algorithm is based on symbol-to-symbol navigation anyway.

Because the XBWT represents a collection of strings, it can be built from other representations, such as the suffix array and the Burrows–Wheeler transform [35]. The link also works in the other direction: we can go from the XBWT directly to a run-length encoded multi-string BWT [12]. As a byproduct, we can efficiently find the order of the strings that minimizes the number of runs in the BWT [12]. Because the order of the strings determines how well the suffixes close to the end of the string can be run-length encoded, this can be valuable with highly repetitive collections of short strings.

4 SBWT: Indexing k-Spectra

The k-spectrum of a string S is the set of distinct k-mers occurring in S. In this section we give an overview of how k-spectra can be indexed for k-mer lookup queries via the Spectral Burrows-Wheeler transform (SBWT) [5]. Given a query k-mer, a k-mer lookup query returns the colexicographic rank of the k-mer in the k-spectrum. It is not difficult to see that a structure for k-mer lookup queries gives a convenient representation of a de Bruijn graph.

The SBWT is a distillation of the ideas found in the BOSS [10] and Wheeler graph data structures. It is indeed an evolution of the BOSS data structure, which is an indexed representation of the edge-centric de Bruijn graph. The SBWT differs from the BOSS being vertex-centric, and more general – the SBWT is a mathematical transformation which can be indexed in various ways, one of which closely resembles the BOSS data structure. In addition, the SBWT graph is also a Wheeler graph [25] covered in the next chapter of this book. This implies that the generic Wheeler graph index could be used to index the graph. However, that index requires storage of the sequence of indegrees and outdegrees of the vertices in the graph, whereas in the SBWT this is not required because every vertex apart from the vertex of the empty string has indegree of exactly 1, and the sequence of outdegrees is already included in the sizes of the sets of the outgoing edge label sets. Thus, the SBWT can be seen as a specialization of the Wheeler graph framework to k-spectra, taking advantage of the properties of the special case.

4.1 The Spectral Burrows-Wheeler Transform

In this section we define the Spectral Burrows-Wheeler transform, SBWT. We begin with few basic definitions:

Definition 4 (k-spectrum).

A string’s k-spectrum, denoted with Sk(S), consists of all unique k-mers present in the string: {S[i:i+k1]|i=1,,|S|k+1}. The k-spectrum Sk(S1,,Sm) of a set of strings S1,,Sm is the union i=1mSk(Si).

padded k-spectrum adds a minimal set of $-padded dummy k-mers to ensure that in the de Bruijn graph (to be defined shortly), every non-dummy k-mer has an incoming path of length at least k:

Definition 5 (Source vertex set).

Let R=Sk(S1,,Sm) be a k-spectrum, the source vertex set R is a subset of R containing k-mers Y such that Y[1:k1] is not a suffix of any k-mer in R.

Definition 6 (Padded k-spectrum).

Let R be a k-spectrum of 𝒮=(S1,,Sm), and let RR be the source vertex set of 𝒮 for a given k. The padded k-spectrum is the set Sk+(S1,,Sm)=R{$k}YR{$kiY[1:i]|i=1,,k1}, where $ is a special character of Σ, which is smaller than all other characters of the alphabet.

The number of dummy vertices can be calculated with a simple formula. It is indeed equal to the number of internal vertices of 𝒯(R), where 𝒯(R) is the trie of the source vertex set of a set of strings.

For example, if we consider two strings S1= ACAGTG and S2= ATCAGA, and k=3, then R=S3(S1,S2)={ACA, CAG, AGT, GTG, ATC, TCA, AGA}, R={ACA, ATC}, and the padded k-spectrum S3+(S1,S2)=S3(S1,S2){$$$, $$A, $AC, $AT}.

Definition 7 (de Bruijn graph).

The de Bruijn graph of a set of strings 𝒮=(S1,,Sm) is an edge-labeled graph G=(V,E,Σ,,k), where:

  1. 1.

    The set of vertices V is the set of k-mers from the padded k-spectrum of 𝒮, in symbols V=Sk+(S1,,Sm).

  2. 2.

    For any two vertices, k-mers, u,vV, we have (u,v)E if and only if u[2:k]= v[1:k1].

  3. 3.

    The label of an edge e=(u,v)E, denoted with (e) or (u,v), is the character v[k]. $ never appears as an edge label.

See figure 2 for an illustration of a de Bruijn graph.

Figure 2: The de Bruijn graph of 𝒮= {ACAGTG, ATCAGA, TTGTCAGTGT}, k=3. The source vertex set, k-mers with no predecessor in 𝒮, is striped, and the solid-colored vertices represent the dummy vertex set. Dashed edges are pruned from the graph as the vertices they point to can be reached from other (solid) edges.

We can now define the Spectral BWT.

Definition 8 (Spectral Burrows-Wheeler transform (SBWT) [5]).

Let R+ be a padded k-spectrum and let X1,,X|R+| be the elements of R+ in colexicographic order. The SBWT is the sequence of sets A1,,A|R+| with AiΣ such that Ai= if i>1 and Xi[2:k]=Xi1[2:k], otherwise Ai={cΣ|Xi[2k]cR+}.

Continuing the example above and adding a new sequence:
𝒮={ACAGTG,ATCAGA,TTGTCAGTGT}. The colexicographically ordered list of S3+(𝒮) is: $$$, $$A, ACA, TCA, AGA, $AC, ATC, GTC, CAG, GTG, TTG, $$T, $AT, AGT, TGT, $TT, and the SBWT of 𝒮 is the sequence of sets: {A,T}, {C,T}, {G}, , , {A}, {A}, , {A,T}, {T}, , {T}, {C}, {C,G}, , {G}.

The sets in the SBWT represent the labels of outgoing edges in the vertex-centric de Bruijn graph, such that we only include outgoing edges from k-mers that have a different suffix of length k1 than the preceeding k-mer in the colexicographically sorted list. The padding of dollar-symbols is a technical detail required to make the SBWT work. It ensures that every k-mer of the k-spectrum has an incoming path of length at least k in the de Bruijn graph. Alternatively, the sequences of the set 𝒮 on which the SBWT is built could be made cyclic, thus removing the need for the padded k-spectrum Sk+(𝒮).

We now describe how to implement efficient k-mer membership queries on the spectrum of the input strings, using only the information encoded in the SBWT. It is helpful to think of the SBWT as encoding the set R+ of k-mers as a list arranged in colexicographical order (as in the example above). In the context of k-mer lookup, the SBWT allows navigation of this ordered list via a search routine based on subset rank queries.

Definition 9 (Subset rank query).

Let X1,Xn be a sequence of subsets of an alphabet Σ={1,,σ}. A subset rank query takes as input an index i and a character cΣ, and returns the number of subsets Xj with ji such that cXj.

The search routine, introduced below, works by searching the k-mer character by character from left to right. It does so maintaining the interval of k-mers in the colexicographically-sorted list that are suffixed by the prefix which has been processed so far.

Since the k-mers in R+ are colexicographically sorted in the SBWT, the subset of k-mers in R+ that share a string α as a suffix are next to each other, and they are called the colexicographic interval of α. The colexicographic interval of a string α longer than k is defined as the range of k-mers in the sorted list of R+ suffixed by the last k characters of α. Now let [s:e]α denote the colexicographic interval of the string α, where s and e are respectively the colexicographic ranks of the smallest and largest k-mer in the SBWT having substring α as a suffix. Finding the interval [s:e] of αc, where c is a character from Σ{$}, is equivalent to following all edges labeled with c from the vertices in [s:e]. Due to the way the edges are defined, the end points of these edges are a contiguous range [s:e] such that s is the destination of the first outgoing edge labeled c from [s:e], and e is the destination of the last one. Let C[c] be the number of edge labels with label smaller than c in the graph. We now have the following formulas:

s=1+C[c]+subsetrankc(s1)+1e=1+C[c]+subsetrankc(e) (1)

where subsetrankc is a subset rank query on the SBWT sequence. The “+1” at the start of the formulas is to skip over the vertex of  $k. The values C[c] can be precomputed for all characters cΣ. By iterating these formulas k times, we have the k-mer search routine.

Figure 3: MatrixSBWT of 𝒮={ACAGTG,ATCAGA,TTGTCAGTGT} with k=3. The dashed lines indicate borders of suffix groups. Two adjacent columns are in the same group if they share the same suffix of length k1. Bits may be moved horizontally inside a suffix group without affecting the k-mer set encoded in the matrix.

The simplest data structures for subset rank queries is the plain matrix. See Figure 3 for an example.

Definition 10 (Plain Matrix representation).

The plain matrix representation of the SBWT sequence is a |Σ{$}|×n binary matrix M, such that M[i][j]=1 iff subset Xj includes the ith character in the alphabet.

The rows of the matrix M are indexed for constant time rank queries [37] (implementation from [28]). A single rank query on row i at index j in M would answer in constant time subsetrankc(j), where c is the ith character of the alphabet. There exist alternative succinct data structures for subset rank, such as the subset wavelet tree, details of which can be found in [4, 5]. A concatenation of the sets indexed as a wavelet tree leads to something resembling the BOSS data structure. All k-mers of a longer string can be searched efficiently by adding the LCS array of the SBWT [3]. At the end of each single k-mer seach, the LCS array allows to extend the colexicographic interval to the last k1 characters observed in the previous k-mer.

5 GCSA: Indexing Finite Languages

The generalized compressed suffix array (GCSA) [48, 49] is a generalization of the FM-index to graphs. It was originally designed for indexing acyclic vertex-labeled graphs (or equivalently finite languages) that arise from aligned DNA sequences, using techniques similar to the XBWT (Section 3).

GCSA can be built for any such graph, and it supports efficient queries in it, but the index can be exponentially larger than the graph. Some other BWT-based graph indexes, such as BWBBLE [32] and vBWT [34], take the opposite approach. By encoding the graph as a string, they make index construction fast and the index itself small. Queries can be slower than in GCSA, as they often require branching in the graph.

BOSS [10] represents de Bruijn graphs using a data structure equivalent to the GCSA. This equivalence led to a realization that the GCSA approach works with any graph that is sufficiently similar to a de Bruijn graph. Such graphs can recognize a strict subclass of regular languages [49]. GCSA2 [46] uses this realization in a practical way to index any vertex-labeled graph by approximating it with a high-order (pruned) de Bruijn graph.

GCSA requires that the vertices of the graph can be sorted unambiguously by the labels of maximal paths starting from them. This is a global property that is sometimes difficult to reason about. Wheeler graphs [25] replaced this requirement with an essentially equivalent local property. Vertices are sorted by their labels, and ties are broken by inheriting the order of the successor vertices. (The original formulation used edge-labeled graphs and inherited the order from the predecessors.) This alternate characterization made reasoning about Wheeler graphs easier in many situations.

5.1 Structural runs in the BWT

Consider a collection that consists of m identical copies of the same string S. Then for any position j, the suffixes starting at that position are consecutive in lexicographic order. Because the preceding symbol S[j1] (if j>1; otherwise S[n]=$) is also the same for every string, the suffixes are in the same run of symbol S[j1] in the BWT of the collection. Hence the BWT of a collection of m identical strings always has the same number of runs, regardless of the number of copies. We now consider what happens when we apply edits to a collection of identical strings.

Definition 11 (Aligned suffixes).

Let 𝒮=(S1,,Sm) be a collection of strings, and let MSA be a multiple sequence alignment of the collection. For any two strings Si,Si𝒮 and positions j,j in them, suffixes Si[j:] and Si[j:] are aligned if and only if positions Si[j] and Si[j] are a match in MSA. We represent this with equivalence relation 0

Definition 12 (Structural BWT runs).

Let 𝒮=(S1,,Sm) be a collection of strings, let 0 be an alignment of the suffixes, and let BWT and SA be the Burrows–Wheeler transform and the suffix array of the collection. A structural BWT run is a maximal interval [sp:ep] such that for any x,x[sp:ep]:

  1. 1.

    Si[j:]0Si[j:], where SA[x]=(i,j) and SA[x]=(i,j); and

  2. 2.

    BWT[x]=BWT[x].

We use equivalence relation 0b to denote that suffixes are in the same structural BWT run.

Let 𝒮=(S1,,Sm) be a highly repetitive collection of similar strings. We assume that there is a sequence of s edit operations that transforms a collection of m copies of string S into 𝒮, with smn. We further assume that there is a multiple sequence alignment corresponding to the edit operations, where suffixes that have been derived from the same suffix of the original string S start with a match or a mismatch in the same column. Note that different sequences of edit operations leading to the same collection may correspond to different alignments.

Initially, there are n structural runs, one for each position in string S. Each edit operation that changes a substring Si[j:j] can affect the suffixes of Si in three ways:

  1. 1.

    The symbol in the BWT can change for the suffixes starting from Si[j+1] to Si[j+1].

  2. 2.

    Suffixes starting from Si[jk] to Si[j], for some k0, may leave their current structural runs due to changes in lexicographic ranks.

  3. 3.

    New unaligned suffixes can be created and existing suffixes may be removed if substring Si[j:j] is replaced with a string of different length.

Every affected suffix can increase the number of structural runs by at most two: by becoming a new run and by splitting an existing run in two.

Mäkinen et al. [38] showed that if the original string S is random and we apply random edits, k is at most O(logσN) in the expected case. If we assume single-symbol edits, s edit operations increase the number of structural runs by at most O(slogσN) in the expected case. Because structural runs are (possibly non-maximal) BWT runs, there are at most r+O(slogσN) runs in the BWT of collection 𝒮 in the expected case. Here r is the number of runs in the BWT of string S.

5.2 Indexing the graph of structural runs

Let 𝒮 be a string collection and 0b an equivalence relation that denotes that two suffixes are in the same structural BWT run. Let G0b=(V,E,Σ,) be the alignment graph of the collection with respect to 0b.

Graphs like this have been used as space-efficient representations of the collection. Both SimDNA [33] and FM-index of alignment [40, 39] index the collection using a graph based on a specific type of alignment. The alignment alternates between common regions, where all strings are aligned, and non-common regions, where they diverge. BWT tunneling [7] also uses a similar graph to improve BWT-based data compression by not storing run lengths within unary paths. Our goal is instead to extend the FM-index of the collection to index all strings in the closure 𝒮0b using the graph.

Figure 4: Structural runs in the BWT of strings S1= GATTACAT$, S2= GATATAT$, and S3= GATTAGAT$. The table lists the suffixes in lexicographic order, the BWT symbol for each suffix, and the starting position of the suffix in the MSA (also shown). Bitvector Br marks positions at the start of each structural BWT run. The graphs from top to bottom are: G0 representing the MSA; G0b representing structural BWT runs; and G0s representing structural SA runs (see Section 5.4).

We could augment the FM-index with a bitvector Br[1:N] that marks the start (or equivalently the end) of each structural run with ones. After each backward search step, we use some rank/select operations on Br to extend the interval [sp:ep] to the smallest interval [sp:ep] that contains it and consists of entire structural runs. If we continue the search from interval [sp:ep], we are effectively accepting recombinations within each vertex in the range. No matter which suffixes we used to reach a vertex, we can continue the search to the predecessor of any suffix in the vertex. See Figure 4 for an example.

5.3 GCSA structure

Because we discard the information on how we arrived to the current vertex, we no longer need to keep track of the exact BWT interval. Hence we choose the GCSA representation [48] (Figure 5) that corresponds more closely to the structure of the graph. We keep the vertices in the same order as the structural runs. Let rankV:V[1:|V|] be a function that maps a vertex vV to its lexicographic rank rankV(v). Lexicographic range [sp:ep] now corresponds to vertices vV such that rankV(v)[sp:ep].

The data structure consists of three components Bin, BWT, and Bout. We use them all for computing a backward search step from find(X)=[sp:ep] to find(cX)=[sp:ep] for a string X and a symbol c.

  1. 1.

    Bitvector Bin encodes the indegree of each vertex in unary. Similar to using Bout in XBWT (Section 3.1), we map the range of vertices to a range of incoming edges as [spin:epin]=[Bin.select(sp,1):Bin.select(ep+1,1)1].

  2. 2.

    The BWT range for vertex vV is the concatenation of (u) for each vertex uV with (u,v)E in an arbitrary order. A backward search step with this BWT maps the range [spin:epin] of incoming edges to a range [spout:epout]=[LF(spin1,c)+1:LF(epin,c)] of outgoing edges, ordered by the predecessor vertex.

  3. 3.

    Bitvector Bout encodes the outdegree of each vertex in unary. With rank queries, we get the interval of the predecessors of the vertices in [sp:ep] with label c as [sp:ep]=[Bout.rank(spout,1):Bout.rank(epout,1)].

Together these take |E|(logσ+2)(1+o(1)) bits, assuming an uncompressed wavelet tree for the BWT and uncompressed bitvectors and excluding the C array. To handle the edge cases with source/sink vertices, we add technical edges connecting vertices with outdegree 0 to vertices with indegree 0.

This data structure also facilitates graph navigation. For instance, if vV is a vertex and x=rankV(v), we can navigate backward along a unary path from v. This is effectively the same as using LF(x) to navigate from a suffix to its predecessor. The rank of the first incoming edge to v is xin=Bin.select(x,1). If Bin[xin+1]=0, vertex v has multiple predecessors and the path is no longer unary. With xout=LF(xin), we get the rank of the corresponding outgoing edge, and Bout.rank(xout,1) is the rank of the predecessor vertex. Navigation like this is used in locate queries (Section 5.5).

The alignment graph based on structural runs allowed us to index the original strings as well as their recombinations within the runs. Our ultimate goal is allowing recombinations at any aligned position. In principle, we already have the necessary tools for it. Because 0b is a refinement of 0, we have 𝒮0=(𝒮0)0b. In other words, recombinations at aligned positions include recombinations at aligned positions within the same BWT run. We could therefore build the GCSA structure for the closure 𝒮0. The challenge is doing that without representing the closure explicitly.

5.4 GCSA construction

Before proceeding, we relax the definition of structural runs. The GCSA structure only requires that the equivalence classes of suffixes are lexicographic ranges that start with the same symbol. By not requiring that they also share the preceding symbol, we simplify the discussion and make the alignment graph G0s smaller.

Definition 13 (Structural SA runs).

Let 𝒮=(S1,,Sm) be a collection of strings, let 0 be an alignment of the suffixes, and let SA be the suffix array of the collection. A structural SA run is a maximal interval [sp:ep] such that for any x,x[sp:ep] with SA[x]=(i,j) and SA[x]=(i,j), we have Si[j:]0Si[j:]. We use equivalence relation 0s to denote that suffixes are in the same structural SA run.

The algorithm we use for building GCSA [48, 49] is similar to the prefix-doubling approach to suffix array construction. We start from graph G1=G0=(V,E,Σ,) and build a series of graphs Gl=(Vl,El,Σ,l) for increasing path lengths l. If we consider the graphs finite automata, they all recognize the same language 𝒮0. There are two types of vertices vVl:

  1. 1.

    The vertex corresponds to a path Pv of length |Pv|=l starting from vertex locate(v)=Pv[1]V in graph G0. For every maximal path P starting from vertex v in graph Gl, we require that (Pv) is a prefix of l(P).

  2. 2.

    The vertex corresponds to a half-open lexicographic range [Xv:Yv) of path labels. There is an arbitrary path Pv of length |Pv|l starting from vertex locate(v)=Pv[1]V in graph G0 such that (Pv)[Xv:Yv). For every maximal path P starting from vertex v in graph Gl, we require that l(P)[Xv:Yv). For every other vertex uVl with uv, we require that (Pu)[Xv:Yv) (if u is of type 1) or [Xu:Yu)[Xv:Yv)= (if u is of type 2).

In both cases, we set l(v)=(locate(v)).

Figure 5: GCSA example for the string collection in Figure 4. We show the original graph (top) again with each vertex v annotated with id(v) (see Section 5.5). The bottom graph is the final graph G0s=(V,E,Σ,) for the closure, with each vertex vV annotated with (Pv). Paths Pv are chosen to be prefixes of the lexicographically smallest maximal path starting from the vertex. The table lists each vertex vV identified by the label (Pv), the corresponding parts of the GCSA structure, and the identifier id(locate(v)) of the corresponding vertex in the original graph.

If all vertices vVl are of type 2, we have a graph Gl that is equivalent to G0, with vertices that can be sorted unambiguously by the labels of maximal paths starting from them. For each suffix Si[j:] of the closure 𝒮0, there is a vertex vVl such that Si[j:][Xv:Yv) and a path P starting from v with label l(P)=Si[j:]. Therefore the vertices of Gl correspond to (possibly non-maximal) structural runs. If we merge them to form maximal runs, we get the graph G0s. See Figure 5 for an example.

We derive graph G2l from graph Gl with a doubling step followed by a pruning step. Let uVl be a type 1 vertex, and let rankl(u) be the lexicographic rank of (Pu) in the set of distinct path labels {(Pv)vVl}. In the doubling step, we combine it with all vertices vVl such that PuPv is a valid path. For every combination, we create a new vertex w with Pw=PuPv and rank2l(w)=(rankl(u),rankl(v)). The new vertex will be of the same type as vertex v. This requires that graph G0 is reverse deterministic; see below for a discussion. If uVl is a type 2 vertex, we copy it as vertex wV2l with rank2l(w)=(rankl(u),0). The doubling step finishes by sorting the new vertices by their ranks and replacing the pairs of ranks with integer values.

In the pruning step, we merge lexicographic ranges of type 1 vertices into type 2 vertices. Let V2lx={vV2lrank2l(v)=x} be the set of vertices with rank x. If all vertices vV2lx share the same locate(v) vertex, the shared path label uniquely defines the starting vertex. We can therefore merge them into a type 2 vertex. Similarly, if sets V2lx and V2lx+1 also share the starting vertex, we can merge them into a single vertex.

If Smax is the longest string in 𝒮0, the prefix-doubling algorithm finishes after at most log|Smax| doubling steps. In the worst case, the final graph G0s can be exponentially larger than the graph G0 we started from. However, if the distances between the positions of the original string S with edits in any string S𝒮 are sufficiently high, graph G0s will have O(|V|) vertices and O(|E|) edges in the expected case [49]. If we assume a random string and random edits, the phase transition to exponential growth occurs at average distance O(logσ|S|).

We do not need to maintain the edges of the intermediate graphs. Determining them for the final graph G0s=(V,E,Σ,) is simple enough. Given a vertex vV of the final graph and an edge (u,locate(v))E of the original graph, we know that there must be a corresponding edge (u,v)E such that locate(u)=u in the final graph. We can therefore generate the edges as (u,v) and sort them by ((u),(Pv)). This also sorts them by (Pu). By scanning the lists of vertices and edges in sorted order, we can then match edges (u,v) to source vertices u by locate(u)=u.

This construction algorithm only works if the original graph G0 is reverse deterministic. For every vertex vV, we require that each predecessor uV with (u,v)E has a distinct label (u). If two predecessors u,uV share a label, (uP)=(uP) for every path P starting from vertex v. The pruning step cannot merge them, because no extension of the path label can determine the starting vertex uniquely. On the other hand, if the graph is reverse deterministic and vV is a type 2 vertex, (uPv) always determines the unique predecessor u.

Determinizing a finite automaton takes exponential time in the worst case, because the states of the determinized automaton correspond to sets of states of the non-deterministic automaton. However, when the textbook algorithm was used to determinize the graph before GCSA construction, it was observed to be fast [49]. This was later explained using Wheeler graphs [16]. The exponent in the time complexity is not the total number of states but the maximal number of states that are not comparable in lexicographic order. Intuitively, the closer the graph is to a Wheeler graph, the faster the determinization is. If the graph is simple enough that the prefix-doubling algorithm only increases its size by a constant factor, it must already be close to being a Wheeler graph.

5.5 Locating the hits

Let G0=(V,E,Σ,) be the original alignment graph and G0s=(V,E,Σ,) the final graph in GCSA construction. During the construction, we annotated each vertex vVl in each intermediate graph Gl=(Vl,El,Σ,l) (including G0 and G0s) with the corresponding vertex locate(v)V of the original graph. As the notation implies, we want GCSA locate queries return positions in the original graph. While it would be easier to return positions in G0s (we built GCSA for that graph, after all), it is only a technical construct. Positions in the original graph (or columns in the alignment) are more likely to be meaningful to the user.

A locate query in an FM-index relies on the fact that SA[x]=SA[LF(x)]+1. If a position has not been sampled, we can derive the value from the predecessor. Let id:V[1:|V|] be a function that gives each vertex vV a unique integer identifier. To apply the same idea with GCSA, we require that id(v)=id(u)+1 for each edge (u,v)E such that u is the only predecessor of v and v is the only successor of u. If we sample id(locate(v)) for each vertex vV that has multiple predecessors or id(locate(v))id(locate(u))+1 for the only predecessor u, the locate algorithm works as before [48, 49]. (We also want to sample id(locate(v)) at regular intervals on unary paths to ensure query performance.)

This approach was originally designed for sampling the LCP array [45]. If BWT[x] is not at the start of a run, LCP[x]=LCP[LF(x)]1. We therefore need to sample the irreducible values, while the rest can be derived from the predecessor. More generally, the approach can be understood as a sampled tag array [8]. We annotate each position in the FM-index with a tag, which can store arbitrary information related to the position. If most tags can be derived easily from the only predecessor, we can store the tag array space-efficiently and still support efficient queries.

5.6 Simplifying the graph

As mentioned earlier, there is a phase transition in GCSA construction. As the distances between positions with edits in the alignment get below a critical threshold, the number of vertices and edges in the final graph G0s starts growing rapidly. This makes the GCSA impossible to build in practice for many graphs. One way to handle this is adding a context length requirement to the definition of alignment [48, 49].

Definition 14 (k-aligned suffixes).

Let 𝒮=(S1,,Sm) be a collection of strings, let MSA be a multiple sequence alignment for the collection, and let k0 be a context length. For any two strings Si,Si𝒮 and positions j,j in them, suffixes Si[j:] and Si[j:] are k-aligned if and only if positions Si[j] and Si[j] are a match in MSA and Si[j:j+k]=Si[j:j+k]. We represent this with equivalence relation k

We similarly define structural SA runs based on k-aligned suffixes and denote them with equivalence relation ks. Then we start GCSA construction from graph Gk and use prefix-doubling to build graph Gks. In practice, context length k imposes a minimum distance between recombinations. Regions that are heavily branching in G0 gradually become simpler in Gk as k increases. That makes it more likely that as path length l grows, path labels become unique at a faster rate than the number of type 1 vertices grows.

5.7 Simplified data structure

The final graph Gks=(V,E,Σ,) is always reverse deterministic. To see that, consider any two edges (u,v)E and (u,v)E such that (u)=(u). Because u and u are type 2 vertices, it must be that (uP)=(u)(P)[Xu:Yu) and (uP)=(u)(P)[Xu:Yu), where P is a maximal path starting from v. As the lexicographic ranges for two type 2 vertices cannot overlap, we have u=u.

For each vertex vV and symbol cΣ, there is at most one vertex uV such that (u,v)E and (u)=c. We can therefore replace the indegree bitvector Bin and the BWT with a binary matrix [49]. For each symbol cΣ, let Bc be a bitvector of length |V|, with Bc[rankV(v)]=1 whenever vertex vV has a predecessor uV with (u)=c. The first two parts of a backward search step become [spout,epout]=[C[c]+Bc.rank(sp1,1)+1:C[c]+Bc.rank(ep,1)].

We replaced a select query on a bitvector and a rank query on a string with a single rank query on a bitvector. A backward search step now consists of four rank queries on bitvectors. Depending on the data structures used, this can be considerably faster than with the full structure. With uncompressed bitvectors, the size bound increases from |E|(logσ+2)(1+o(1)) bits to (σ|V|+|E|)(1+o(1)) bits, which can be significant with large alphabets. With Elias–Fano encoded bitvectors for Bc, we get |E|(log(σ|V|/|E|)+O(1)) bits at the expense of somewhat slower queries. A practical hybrid approach [46] uses uncompressed bitvectors for common symbols and compressed bitvectors for rare ones.

SBWT [5] (Section 4.1) goes even further with a representation, where the rank queries on Bc and Bout can be done with a single rank query on a bitvector. This comes at the expense of some functionality. In particular, SBWT does not tell directly whether vertex vV has a predecessor with symbol cΣ.

5.8 Graphs with cycles

BOSS [10] represents de Bruijn graphs with a data structure that is equivalent to the full GCSA structure (Section 5.3). While a de Bruijn graph often contains cycles, its vertices v are type 2 vertices, with [Xv:Yv)={SΣSvis a prefix ofS} as the interval. Here Sv is a string of length |Sv|=k and k is the order of the de Bruijn graph. We could therefore use GCSA, including the more efficient simplified structure (Section 5.7), with de Bruijn graphs. The main difference is that while LF-mapping and backward search follow the edges in BOSS, they go against the direction of the edges in GCSA.

A natural question is which classes of graphs can be represented using GCSA, and what is the corresponding class of formal languages. Because vertex-labeled graphs can be understood as finite automata, the class of languages is a subclass of regular languages. Sirén et al. [49] showed that the class is a proper subclass: GCSA can represent the obvious automaton recognizing language {a,b} but no automaton recognizing {a,b}{a,c}. The characterization of type 2 vertices with lexicographic ranges suggests that the class of graphs admitting the GCSA representation is a generalization of de Bruijn graphs. The properties of this class were later investigated based on the Wheeler graph formalization [1, 2].

5.9 GCSA2

GCSA was designed for indexing pangenome graphs based on aligned DNA sequences. The construction algorithm assumes that the alignments can be represented as a multiple sequence alignment or an acyclic graph. Such models are appropriate for simple variants, such as substitutions, insertions, and deletions. Other forms of variation, such as rearrangements (same substrings appear in different order in different strings) and inversions (a substring is replaced with its reverse complement) can be represented better with cycles in the graph. Such graphs often do not have an equivalent Wheeler graph that can be indexed with GCSA.

GCSA2 [46] is a reimplementation of GCSA for general pangenome graphs. Because the graph cannot be fully indexed, GCSA2 approximates it with an order-k de Bruijn graph. It matches patterns of length at most k correctly, while longer matches may be false positives combining partial matches in different parts of the original graph. The initial implementation used k=128, which was chosen to make excessive growth of the number of vertices less likely. This was later increased to k=256 to allow mapping perfectly matching 150 bp and 250 bp reads as a single hit [47].

Let G=(V,E,Σ,) be the original graph and Gk=(V,E,Σ,) be the graph after logk rounds of prefix-doubling (Section 5.4). We replace the final pruning step with a merging step that also merges all type 1 vertices vV sharing rankk(v), regardless of whether the locate(v) vertices match. The resulting pruned de Bruijn graph is equivalent to an order-k de Bruijn graph but potentially much smaller. To handle some edge cases, we must ensure that all type 2 vertices vV have intervals [Xv:Yv) defined by a shared prefix. We build GCSA2 for graph Gk. As a consequence of the final merge, locate(v)V is now a set of vertices in the original graph.

Pangenome graphs contain regions that are dense in variation. While GCSA2 construction stops prefix-doubling early to prevent the worst exponential growth, such regions must often be simplified to keep the size of the index manageable. Because the original GCSA considered the alignments and the graph computational artifacts, it could handle complex regions by changing the definition of alignment (Section 5.6). GCSA2 cannot do that, because the provided alignments are assumed to be biologically meaningful. Instead, a preprocessing step removes everything except the reference genome in regions where there are too many branching vertices in a short window [26]. There is also an option to avoid sequence loss by unfolding the original strings in the complex region while maintaining a mapping to the corresponding positions in the original graph. Variation is then replaced with a set of disjoint paths representing unaligned strings [47].

GCSA2 supports some additional functionality beyond that provided by FM-indexes. It builds a trie of path labels (Pv) for vertices vV and represents it using next/previous smaller value queries and range minimum queries over the LCP array [23]. The resulting suffix tree-like functionality is used for finding maximal exact matches between the pattern and the graph [41]. A similar approach was first used with variable-order de Bruijn graphs [9] and later generalized to Wheeler graphs [15]. There is also a data structure for reporting the number of distinct locate(v) values for vertices vV in a range [sp:ep]. It is based on compressed versions [24] of Sadakane’s document counting structure [44].

6 Founder Graphs

In this section, we deal with graphs built from the segmentation of multiple sequence alignments (MSAs), which is a line of research originating from the study of founder sequences [50]. Similarly to the GCSA approach from Section 5, respecting the MSA positions (i.e. the columns) results in an acyclic graph spelling the input strings and their recombination. However, differently from GCSA, in these segmentation-induced graphs that we call founder graphs, the recombinations depend exclusively on the chosen segments.

Many approaches reviewed so far fix a k-mer length to avoid indexing exponentially many paths of a graph. While such fixed parameter yields practically efficient solutions, it is of great theoretical value to study more general approaches. Wheeler Graphs [25] have been proposed as such, but they appear too general; Wheeler Graph recognition is an NP-complete problem [27] and thus to use Wheeler graph machinery one first needs to limit to graph classes that somehow avoid this complexity. Founder graphs also have similar hardness issues: in Section 6.2 we present the negative results from [17] stating that matching in founder graphs is not easier than matching in general labeled graphs. However, this hardness can be avoided, as we present in Section 6.3 that suitable segmentations of MSAs respecting a specific uniqueness property induce graph classes that are easy to index for linear-time pattern matching. These graphs are not limited to a fixed k-mer length, but can handle node labels of arbitrary length (provided that they are unique) and arbitrary queries. After delving into more details, we continue discussing the connection to Wheeler Graphs in Sect. 6.3. We conclude this section with a simple extension of the uniqueness properties of de Bruijn graphs and founder graphs to overlap graphs.

6.1 Inducing graphs from multiple sequence alignments

We define a segmentation 𝒳 of 𝖬𝖲𝖠[1:m,1:n] as a partition of columns [1:n] in b segments [1=x1:y1], [x2:y2], …, [xb:yb=n] such that yj=xj+11 for j[1:b1], and additionally we forbid any segmentation to spell the empty string in one of its segments, in symbols 𝗌𝗉𝖾𝗅𝗅(𝖬𝖲𝖠[i,xj:yj])ε for all i[1:m], j[1:b].

Definition 15 (elastic founder graph [20]).

Let 𝖬𝖲𝖠[1:m,1:n](Σ{})m×n be a multiple sequence alignment. Given segmentation 𝒳=[x1:y1], …, [xb:yb], the elastic founder graph (EFG) induced by 𝒳 is a vertex-labeled graph G=(V,E,), where :VΣ+ assigns a non-empty label to every vertex, and the set of vertices V is partitioned into blocks V1, …, Vb such that |{(v):vVk}|=|Vk| (unique strings in a block) and for all k[1:b] we have that

{(v):vVk}={𝗌𝗉𝖾𝗅𝗅(𝖬𝖲𝖠[i,xk:yk])| 1im}.

Moreover, it holds that (v,w)E if and only if there exists k[1:b1] and i[1:m] such that vVk, wVk+1, and 𝗌𝗉𝖾𝗅𝗅(𝖬𝖲𝖠[i,xk:yk+1])=(v)(w).

If E=k=1b1(Vk×Vk+1), that is all vertices in Vk are connected to all vertices in Vk+1 for each b[1..b], then the EFG is an elastic degenerate string (EDS[30] with no empty string, and we can omit edge set E. Moreover, given an EFG, we define its EDS relaxation as the corresponding EDS over V1,,Vb.

Figure 6: Example of 𝖬𝖲𝖠[1:5,1:24], segmentation 𝒳, and elastic founder graph G=(V,E,) induced by 𝒳.

See Figure 6. Given an elastic founder graph G=(V,E,), for each vertex vV we denote as v the vertex label length |(v)|. We define height H(G)=maxk=1b|Vk| as the maximum number of vertices in a block, and L(G)=maxvVv as its maximum vertex label length. We can extend to label paths in the graph: given a u1uk-path P=u1uk, with (ui,ui+1)E for i[1..k1], we define (P)=(u1)(uk); for an edge (u,v)E, we call (uv) its edge label. Then, we say that some string QΣ+ occurs in G if there exists path P in G such that Q is a substring of (P). More specifically, we indicate with triple (i,u1uk,j) the subpath of P spelling Q starting from position i[1:u1] in u1 and ending in position j[1:uk] in uk, that is, Q=(u1)[i:](u2uk1)(uk)[:j]. The length of subpath (i,u1uk,j) is then |Q|. We say that Q occurs starting from the beginning of u1 if i=1; Q occurs starting inside u1 if i[2:u1].

6.2 Hardness of matching EFGs

Starting from an MSA, we can construct EFG G. After this step, we would like to answer string matching queries for G, namely determining whether a given pattern string Q occurs in G. We will now show that, conditioned on some complexity hypotheses, this problem requires quadratic time complexity. This justifies the approach we will adopt in later sections, where we assume an additional property to hold in EFGs in order to break through this quadratic barrier.

The quadratic lower bound is obtained via fine-grained complexity techniques and is conditioned on the Strong Exponential Time Hypothesis (SETH) and the Orthogonal Vectors Conjecture (OVC). Since Williams [51] showed that SETH implies OVC, it suffices to show a subquadratic reduction from the Orthogonal Vectors (OV) problem to obtain this result. Let us first give a formal definition of OV.

Definition 16 (OV).

Let X,Y{0,1}d be two sets of n binary vectors of length d. Determine whether there exist xX,yY such that xy=i=1dx[i]y[i]=0.

The notation x[i]y[i] indicates the scalar product when used for two single entries of vectors x and y, while it refers to the dot product xy when applied on the vectors themselves. Currently, it is conjectured that no algorithm can solve OV in truly subquadratic time.

Definition 17 (Orthogonal Vectors Conjecture (OVC)).

Let X,Y be the two sets of an OV instance, each containing n binary vectors of length d.111In this section, keeping in line with the usual notation in the OV problem, we use n to denote the size of X and Y, instead of the number of columns of the MSA. For any constant ϵ>0, no algorithm can solve OV in time O(poly(d)n2ϵ), where poly is a polynomial function on d.

Figure 7: G𝚋𝚎, G𝟶 and G𝟷. Each gadget is organized into three rows, each row encoding a different partitioning of the strings 𝚋𝚋𝚋𝚋, 𝚎𝚎𝚎𝚎, 𝟶𝟶𝟶𝟶, 𝟷𝟷𝟷𝟷. This ensures that, when combining these gadgets into the final graph, edges can be controlled to go within the same row, or to the row below. Figure from [20].

Then, we can show that determining whether string Q occurs in graph G is at least as hard as solving OV, obtaining a quadratic conditional lower bound.

Theorem 18 ([20]).

For any constant ϵ>0, it is not possible to find a match for a query string Q in an EFG G=(V,E,) in either O(|E|1ϵ|Q|) or O(|E||Q|1ϵ) time, unless OVC fails. This holds even if restricted to an alphabet of size 4.

We will provide an intuition on how to prove Theorem 18 by giving an idea of how to perform a subquadratic reduction from OV.

Starting from sets X,Y{0,1}d of n binary vectors of length d, the goal of the reduction is to build string Q and EFG G such that Q occurs in G if and only if there exist xX,yY such that xy=0. Moreover, this reduction should be performed in not more than O(n2ϵpoly(d)) time, for some constant ϵ>0.

Pattern.

We start by constructing string Q from set of vectors X. We build Q by combining string gadgets Q1,,Qn, one for each vector in X, plus some additional characters. For example, vector xi=101 results in string

Qi=𝚋𝚋𝚋𝚋Qi,1Qi,2Qi,3𝚎𝚎𝚎𝚎, where Qi,1=𝟷𝟷𝟷𝟷,
Qi,2=𝟶𝟶𝟶𝟶,Qi,3=𝟷𝟷𝟷𝟷.

Full string Q is then the concatenation Q=𝚋𝚋𝚋𝚋Q1Q2Qn𝚎𝚎𝚎𝚎.

Graph.

We build graph G combining together three different sub-graphs: GL, GM, GR (for left, middle and right). Our final goal is to build a graph structured in three logical “rows”. We denote the three rows of GM as GM1, GM2, GM3, respectively. Sub-graphs GL is able to match any prefix of pattern Q and is connected only to row GM1, which can also match any prefix of Q. Simmetrically, sub-graphs GR is able to match any suffix of pattern Q and it is connected only to row GM3, which can also match any suffix of Q. Thus, the first and the third rows of G, along with subgraphs GL and GR (introduced to allow slack), can match any vector. The second row matches only sub-patterns encoding vectors that are orthogonal to the vectors of set Y. The key is to structure the graph such that the pattern is forced to utilize the second row to obtain a full match. We present the full structure of the graph in Figure 8, which shows the graph built on top of vector set {100, 011, 010}. In particular, GM consists of n gadgets GMj, one for each vector yjY. The key elements of these sub-graphs are gadgets G𝚋𝚎, G𝟶 and G𝟷 (see Figure 7), which allow to stack together multiple instances of strings 𝚋4, 𝚎4, 𝟷4, 𝟶4. The overall structure mimics the one in [17, 19], except for the new idea from Figure 7 ([20]).

(a) Sub-graph GL. The last segment belongs to sub-graph GM and shows the connection.
(b) Sub-graph GM for vectors y1=100, y2=011 and y3=010. The dashed rectangles highlight the single GMj gadgets.
(c) Sub-graph GR. The first segment belongs to sub-graph GM and shows the connection.
Figure 8: An example of graph G. To visualize the entire graph, watch the three sub-figures from top to bottom and from left to right. We also show two example occurrences of a query string Q constructed from x1=101, x2=110, x3=100 (left-most), and from x1=101, x2=100, x3=110 (right-most), respectively. The shade of gray changes from one Qi to the next. Any such occurrence must pass through the middle row of GM. Figure from [20].

6.3 Indexable founder graphs

As we have seen, EFGs are not likely to support fast pattern matching without additional conditions. To alleviate this hardness, a uniqueness property has been introduced:

Definition 19 (semi-repeat-free indexability property [20]).

An EFG G=(V,E,) is semi-repeat-free or indexable (denoted as iEFG) if each subpath (i,u1up,j) spelling (v) for vVk is such that i=1 and u1Vk, that is, (v) only occurs from the beginning of vertices in block Vk. The semi-repeat-free property is a generalization of the stronger repeat-free property, imposing that each (v) only occurs in G from the beginning of v.

The simpler repeat-freeness condition requiring (v) to start only from the beginning of vertex v already enables fast pattern matching, but it is open if there is a fast algorithm to construct such EFG from the MSA [20].

Before proceeding to the construction of iEFGs, let us first gather some intuition why such graphs are useful. To fix ideas, assume the input MSA has no gaps . Then semi-repeat-freeness matches with repeat-freeness, as the strings in each block are of the same length, and cannot be prefixes of one another. Consider searching for query string Q[1:m] in iEFG. An FM-index for the collection of edge labels (concatenation of adjacent vertex labels) can be used for checking if Q occurs inside a vertex or an edge. Longer occurrences must span at least one complete vertex label. One can build an Aho-Corasick automaton on the vertex labels to find all candidate occurrences. Consider vertex v with (v) occurring closest to the end of Q, say, at Q[i:j]. Consider a trie built on the set F(v)={(w)(v,w)E}. We can check if Q[j+1..m] can be traced from the root of F(v) to verify if Q[i:] occurs starting from the beginning of v. To verify if Q[1:i1] occurs ending at the end of vertex u s.t. (u,v)E, we can use tries R(v)={(u)1(u,v)E}: We can scan Q[:i1] from right to left in R(v). If we reach a leaf, it corresponds to a unique vertex u due to the (semi-)repeat-free property. The scanning can continue from R(u), until we have verified an occurrence for Q or detected that there was no occurrence starting from our candidate vertex v.

Claim 20.

If Q is found neither inside an edge label nor with the verification algorithm above, it has no occurrence in iEFG.

Proof.

For contradiction, assume Q has an occurrence (i,u1uk,j) in G. Since such occurrence is not found within an edge, it must span at least one vertex of G. Therefore the verification algorithm has started with some vertex v such that Q[i:j]=(v), but did not report this occurrence. That is, the substring Q[i:j]=(v) occurs starting somewhere else than in the beginning of v and thus our iEFG is not repeat-free.

This tailored Aho-Corasick approach can be made more space-efficient by an expanded backward search mechanism [20] similar to one used for de Bruijn graphs in Section 4.1. Alternatively, one can reduce this case of repeat-free graphs induced from MSAs with no gaps to Wheeler Graphs [20].

It is important to note why none of the above approaches work if the MSA has gaps and our iEFG is semi-repeat-free: (v) is allowed to occur starting from another vertex v from the same block. Due to this, straightforward extensions lead to false positive matches, and it is open if there is a reduction to Wheeler Graphs. However, there is a way to fix the Aho-Corasick approach: consider instead edges of the graph, as they have a uniqueness property analogous to the one above for vertices in the repeat-free case. One can basically repeat similar scanning of suffix and prefix of Q on indexes built on the set of edges, but there is a non-trivial detail in connecting the searches we omit here [43]:

Theorem 21 ([43]).

A semi-repeat-free EFG G=(V,E,) can be indexed in polynomial time into a data structure occupying O(|D|log|D|) bits of space, where |D|=O(NH(G)), N=vVv is the total length of the vertex labels, and H(G) is the height of G. Later, one can find out in O(|Q|) time if a given query string Q occurs in G.

Next, we will consider how to construct iEFGs. To fix ideas, we consider only the problem of maximizing the number of blocks in iEFG; there exist similar solutions for other objective functions like minimizing the maximum segment length and minimizing the maximum height of a block.

Recall from Definition 15 that an EFG is defined by a graph induced by a segmentation 𝒳=[x1:y1], …, [xb:yb] of an MSA. We denote this graph G(𝒳).

Definition 22 ([20]).

Segment [x:y] of 𝖬𝖲𝖠[1:m,1:n] is semi-repeat-free if for any i,i[1:m] string 𝗌𝗉𝖾𝗅𝗅(𝖬𝖲𝖠[i,x:y]) occurs in gaps-removed row 𝗌𝗉𝖾𝗅𝗅(𝖬𝖲𝖠[i,1:n]) only at position g(i,x), where g(i,x) is equal to x minus the number of gaps in 𝖬𝖲𝖠[i,1:x1].222The original definition considers mistakenly gaps in 𝖬𝖲𝖠[i,1:x].

Lemma 23 (Characterization of semi-repeat-free segments [20]).

The graph G(𝒳) induced from a segmentation 𝒳 is semi-repeat-free if and only if all segments of 𝒳 are semi-repeat-free.

Consider you have found a semi-repeat-free segmentation until column x and want to extend it with a new semi-repeat-free segment into a longer segmentation. The following definition tells the minimum length of such extension.

Definition 24 (Minimal right extensions [20]).

Given 𝖬𝖲𝖠[1:m,1:n], for each 0xn1 we define value f(x) as the smallest integer greater than x such that segment [x+1:f(x)] is semi-repeat-free. If such integer does not exist, we define f(x)=.

To see how to compute these f(x) values, let us for now assume our MSA is gapless. One can build a generalized suffix tree of the rows of the MSA. Consider the set of leaves of the suffix tree corresponding to suffixes of the rows starting at column x+1. The goal is to find a set of suffix tree vertices that cover these leaves and only them. Among such sets, one should find the one that contains a vertex with the smallest string depth, as that will determine f(x). This is called the exclusive ancestor set problem and can be solved in linear time in the number of leaves in the input set [43]. Updating the set of leaves when moving from column x to x1 can also be done in linear time, but with general MSAs, one needs to tailor the algorithm to take into gaps (e.g. in string depth), which is non-trivial but possible.

Theorem 25 ([43]).

Given 𝖬𝖲𝖠[1:m,1:n]{Σ{}}m×n with Σ=[1:σ] and σmn, we can compute the minimal right extensions f(x) for 0x<n (Definition 24) in time O(mn).

Corollary 26 ([43]).

Given 𝖬𝖲𝖠[1:m,1:n]{Σ{}}m×n with Σ=[1:σ] and σmn, the construction of an optimal semi-repeat-free segmentation maximizing the number of blocks can be done in time O(mn).

Proof.

Algorithm [20, Algorithm 1] by Equi et al. solves the problem in O(n) time, assuming minimal right extensions given (Theorem 25).

6.4 Repeat-free graphs: generalizing de Bruijn graphs with vertex uniqueness

We generalize iEFGs respecting the repeat-free property to overlap graphs respecting an equivalent and homonymous property. We show that graphs where long strings are constrained to have a unique starting location in the graph are a straightforward generalization of de Bruijn graphs, and are still indexable in polynomial time for linear-time pattern matching.

We define labeled graphs where each vertex is labeled with a non-empty string of variable length, and each directed edge specifies a suffix-prefix overlap between the connected vertices, as follows.

Definition 27 (Overlap graph).

An overlap graph is a quadruple G=(V,E,Σ,,ov) where V is a finite set of vertices, :VΣ+ labels the vertices, EV×V is the set of directed edges, and ov:E defines the length of a (possibly empty) suffix-prefix overlap between the vertices of each edge uvE such that the following holds: 0ov(uv)<min(u,v) and

(u)[uov(uv)+1:]=(v)[:ov(uv)].

Condition ov(uv)<min(u,v) imposes the overlap to be less than one full vertex.

See Figure 9. We extend to label paths of G as follows: given path P[1:k] ((P[i],P[i+1]E for all i[1:k1]), we define path label (P[1]P[k]) as (u1) if k=1, otherwise it is equal to

(u1)[u1ov(u1u2)+1:](uk1)[uk1ov(uk1uk)+1](uk),

that is, the string spelled by path P considers the overlaps between adjacent vertices only once.

Figure 9: An overlap graph with four vertices u,v,w, and x, where the overlaps are represented above each edge. Note that in the spelling of edges and paths we do not repeat the overlapping substrings.

6.4.1 Hardness of exact matching in overlap graphs

We say that a pattern QΣ+ occurs in an overlap graph G if Q is a substring of (P) for some path P in G. Then, it is immediate to see that the problem of matching a query in an overlap graph is a generalization of string matching in labeled graphs, since the latter are vertex-labeled graphs with overlap equal to 0 for all edges. Then, the conditional lower bounds of [19, 18] holds also for overlap graphs. However, two classes of graphs that bypass the conditional lower bounds by having a uniqueness property of the possible strings spelled in the graph are de Bruijn graphs and iEFGs we considered earlier.

Observation 28.

Overlap graphs (Definition 27) are a generalization of EFGs (Definition 15), since the latter are directed graphs, the vertices are labeled with non-empty strings, and the edges do not present any overlap (ov(uv)=0 for all uvE).

Observation 29.

Overlap graphs (Definition 27) are a generalization of de Bruijn graphs (Definition 7) and compacted de Bruijn graphs (see e.g. [13]), since these can be represented as k-mers or unitigs connected by directed edges with a fixed overlap of k1 (ov(uv)=k1 for all uvE).

Observation 30 (Uniqueness property of dBGs).

In a (compacted) de Bruijn Graph G=(V,E,Σ,,ov), we have that for every possible k-mer SΣk, if S occurs in G then S is uniquely associated to a single node in G.

6.4.2 Indexable overlap graphs

We study the natural extension of the repeat-free property (Definition 19) and the uniqueness property of de Bruijn Graphs (Observation 30) to overlap graphs, where the occurrences of patterns ignore overlaps between nodes. For this, we need to redefine the interpretation of occurrence positions, as an occurrence starting inside the overlapping part of an edge (u,v) can be seen to start from u or from v.

Definition 31.

Let query Q have an occurrence (i,u1u2uk,j) in an overlap graph G with u1ov(u1u2)<iu1. We say that such occurrence is redundant, as it can be written equivalently as (i(u1ov(u1u2)),u2uk,j). Occurrences that are not redundant are called non-redundant.

Definition 32.

Query Q occurs in an overlap graph G starting inside a node if it has a non-redundant occurrence (i,u1,u2,,uk,j) with i>1. Query Q occurs in an overlap graph G starting from the beginning of a node if all its non-redundant occurrences are of the form (1,u1,u2,,uk,j).

Definition 33.

We say that an overlap graph G=(V,E,Σ,,ov) is repeat-free if each non-redundant occurrence (i,u1uk,j) of (u) for uV is such that i=1 and u1=u, that is, (u) occurs in G only starting from the beginning of u.

Then, we argue that the expanded backward search algorithm [20] for repeat-free EFGs is correct also for overlap graphs, since it performs the following steps:

  1. 1.

    preprocess G by indexing string

    Tedges=#(u,v)E(uv)#

    for pattern matching queries and bit arrays B[1:|Tedges|+1], E[1:|Tedges|+1] for rank and select queries, where B and E are defined such as B[k]=1 and E[r]=1 if and only if [s:e] is the lexicographic interval of some node label (v), vV; then, given a non-empty lexicographical interval [s:e] of string S in Tedges, we can answer in constant time whether (v) is prefix of S for some node vV;

  2. 2.

    perform backward search for Q in Tedges and, when the current lexicographical interval is fully contained in the lexicographical interval of some (v), expand the interval to that of (v) and continue the search.

The correctness follows from the repeat-free property and the lexicographic order: whenever we have read a substring Q[i:j] of Q – initially, a suffix of Q – whose lexicographical interval is contained in that of some (v), it must be that (v) is a prefix of Q[i:] and by the repeat-free property all matches of Q[i:j] in G start from the beginning of v. By expanding the lexicographic range to that of (v), we are discarding the characters Q[i+|(v)|:j] and we can continue with the search without loosing any occurrence.

Theorem 34.

A repeat-free overlap graph G=(V,E,Σ,,ov) can be indexed in
O(|Tedges|)=O(LD) time, where L=maxuvE|(uv)| and D=maxvVmax(δ+(v),δ(v)), where δ+(v) and δ(v) are the indegree and outdegree of vertex v, to answer whether QΣ+ occurs in G in O(|Q|) time.

We leave it for future work how to construct such repeat-free overlap graphs, which is at least as hard as constructing repeat-free founder graphs from MSA with gaps, for which we do not have an efficient construction either.

Figure 10: Hierarchy of vertex-labeled graph (with non-empty long labels) admitting a poly-time index for linear-time pattern matching. The classes of graphs with known linear-time construction algorithms are highlighted in gray. Note that semi-repeat-free EFGs are a superclass of repeat-free EFGs, that in turn contain repeat-free non-elastic founder graphs, and that the class of overlap graphs (Definition 27) contains all shown classes.

7 Conclusions

This chapter has reviewed BWT-inspired indexing of more complex structures, such as tries, finite languages, de Bruijn graphs, and aligned sequences. As we have seen, much can be gained by tailoring the BWT to these specific types of objects. The next chapter covers a more general indexing framework for labeled graphs, via the notion of Wheeler Graphs.

Even for the special cases we have covered above, many open problems remain. Although enormous progress has been made toward making these data structures practical – in many cases to the point that they are used routinely a variety of bioinformatics pipelines (e.g., the SBWT lies at the heart of the Themisto pseudoalignment software [6]) – further improvements in performance practical performance are desirable. Construction algorithms in particular have room for improvement, and precisely how best to add parallelization to both construction and querying algorithms is currently under-explored.

References

  • [1] Jarno Alanko, Giovanna D’Agostino, Alberto Policriti, and Nicola Prezza. Regular languages meet prefix sorting. In Proc. SODA 2020, pages 911–930. SIAM, 2020. doi:10.1137/1.9781611975994.55.
  • [2] Jarno Alanko, Giovanna D’Agostino, Alberto Policriti, and Nicola Prezza. Wheeler languages. Information and Computation, 281:104820, 2021. doi:10.1016/j.ic.2021.104820.
  • [3] Jarno N. Alanko, Elena Biagi, and Simon J. Puglisi. Longest common prefix arrays for succinct k-spectra. In Proc. SPIRE, LNCS 14240, pages 1–13. Springer, 2023. doi:10.1007/978-3-031-43980-3_1.
  • [4] Jarno N. Alanko, Elena Biagi, Simon J. Puglisi, and Jaakko Vuohtoniemi. Subset wavelet trees. In Proc. of the 21st International Symposium on Experimental Algorithms (SEA), LIPIcs. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPICS.SEA.2023.4.
  • [5] Jarno N. Alanko, Simon J. Puglisi, and Jaakko Vuohtoniemi. Small searchable k-spectra via subset rank queries on the spectral Burrows-Wheeler transform. In Proc. ACDA, pages 225–236. SIAM, 2023. doi:10.1137/1.9781611977714.20.
  • [6] Jarno N. Alanko, Jaakko Vuohtoniemi, Tommi Mäklin, and Simon J. Puglisi. Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes. Bioinform., 39(Supplement-1):260–269, 2023. doi:10.1093/BIOINFORMATICS/BTAD233.
  • [7] Uwe Baier. On undetected redundancy in the Burrows-Wheeler transform. In Proc. CPM 2018, volume 105 of LIPIcs, pages 3:1–3:15, 2018. doi:10.4230/LIPIcs.CPM.2018.3.
  • [8] Andrej Baláž, Travis Gagie, Adrián Goga, Simon Heumos, Gonzalo Navarro, Alessia Petescia, and Jouni Sirén. Wheeler maps. In Proc. LATIN 2024, volume 14578 of LNCS, pages 178–192. Springer, 2024. doi:10.1007/978-3-031-55598-5_12.
  • [9] Christina Boucher, Alex Bowe, Travis Gagie, Simon J. Puglisi, and Kunihiko Sadakane. Variable-order de Bruijn graphs. In Proc. DCC 2015, pages 383–392. IEEE, 2015. doi:10.1109/DCC.2015.70.
  • [10] Alexander Bowe, Taku Onodera, Kunihiko Sadakane, and Tetsuo Shibuya. Succinct de Bruijn graphs. In International workshop on algorithms in bioinformatics, pages 225–235. Springer, 2012. doi:10.1007/978-3-642-33122-0_18.
  • [11] Michael Burrows and David J. Wheeler. A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994. URL: http://www.hpl.hp.com/techreports/Compaq-DEC/SRC-RR-124.html.
  • [12] Bastien Cazaux and Eric Rivals. Linking BWT and XBW via Aho-Corasick automaton: Applications to run-length encoding. In Proc. CPM 2019, volume 128 of LIPIcs, pages 24:1–24:20, 2019. doi:10.4230/LIPIcs.CPM.2019.24.
  • [13] Rayan Chikhi, Antoine Limasset, and Paul Medvedev. Compacting de bruijn graphs from sequencing data quickly and in low memory. Bioinform., 32(12):201–208, 2016. doi:10.1093/BIOINFORMATICS/BTW279.
  • [14] David R. Clark and J. Ian Munro. Efficient suffix trees on secondary storage (extended abstract). In Proc. SODA 1996, pages 383–391. SIAM, 1996. doi:10.1145/313852.314087.
  • [15] Alessio Conte, Nicola Cotumaccio, Travis Gagie, Giovanni Manzini, Nicola Prezza, and Marinella Sciortino. Computing matching statistics on Wheeler DFAs. In Proc. DCC 2023, pages 150–159. IEEE, 2023. doi:10.1109/DCC55655.2023.00023.
  • [16] Nicola Cotumaccio, Giovanna D’Agostino, Alberto Policriti, and Nicola Prezza. Co-lexicographically ordering automata and regular languages - part I. Journal of the ACM, 70(4):1–73, 2023. doi:10.1145/3607471.
  • [17] Massimo Equi, Roberto Grossi, Veli Mäkinen, and Alexandru I. Tomescu. On the complexity of string matching for graphs. In 46th International Colloquium on Automata, Languages, and Programming (ICALP), volume 132 of LIPIcs, pages 55:1–55:15, 2019. doi:10.4230/LIPICS.ICALP.2019.55.
  • [18] Massimo Equi, Veli Mäkinen, and Alexandru I. Tomescu. Graphs cannot be indexed in polynomial time for sub-quadratic time string matching, unless SETH fails. Theor. Comput. Sci., 975:114128, 2023. doi:10.1016/J.TCS.2023.114128.
  • [19] Massimo Equi, Veli Mäkinen, Alexandru I. Tomescu, and Roberto Grossi. On the complexity of string matching for graphs. ACM Trans. Algorithms, 19(3):21:1–21:25, 2023. doi:10.1145/3588334.
  • [20] Massimo Equi, Tuukka Norri, Jarno Alanko, Bastien Cazaux, Alexandru I. Tomescu, and Veli Mäkinen. Algorithms and complexity on indexing founder graphs. Algorithmica, 85(6):1586–1623, 2023. doi:10.1007/S00453-022-01007-W.
  • [21] Paolo Ferragina, Fabrizio Luccio, Giovanni Manzini, and S. Muthukrishnan. Compressing and indexing labeled trees, with applications. Journal of the ACM, 57(1):4, 2009. doi:10.1145/1613676.1613680.
  • [22] Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM, 52(4):552–581, 2005. doi:10.1145/1082036.1082039.
  • [23] Johannes Fischer, Veli Mäkinen, and Gonzalo Navarro. Faster entropy-bounded compressed suffix trees. Theoretical Computer Science, 410(51):5354–5364, 2009. doi:10.1016/j.tcs.2009.09.012.
  • [24] Travis Gagie, Aleksi Hartikainen, Kalle Karhu, Juha Kärkkäinen, Gonzalo Navarro, Simon J. Puglisi, and Jouni Sirén. Document retrieval on repetitive string collections. Information Retrieval Journal, 20(3):253–291, 2017. doi:10.1007/s10791-017-9297-7.
  • [25] Travis Gagie, Giovanni Manzini, and Jouni Sirén. Wheeler graphs: A framework for BWT-based data structures. Theoretical Computer Science, 698:67–78, 2017. doi:10.1016/J.TCS.2017.06.016.
  • [26] Erik Garrison, Jouni Sirén, Adam M. Novak, Glenn Hickey, Jordan M. Eizenga, Eric T. Dawson, William Jones, Shilpa Garg, Charles Markello, Michael F. Lin, Benedict Paten, and Richard Durbin. Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nature Biotechnology, 36(9):875–879, 2018. doi:10.1038/nbt.4227.
  • [27] Daniel Gibney and Sharma V. Thankachan. On the complexity of recognizing wheeler graphs. Algorithmica, 84(3):784–814, 2022. doi:10.1007/S00453-021-00917-5.
  • [28] Simon Gog and Matthias Petri. Optimized succinct data structures for massive data. Softw. Pract. Exp., 44(11):1287–1314, 2014. doi:10.1002/SPE.2198.
  • [29] Roberto Grossi, Ankur Gupta, and Jeffrey Scott Vitter. High-order entropy-compressed text indexes. In Proc. SODA 2003, pages 841–850. SIAM, 2003. URL: http://dl.acm.org/citation.cfm?id=644108.644250.
  • [30] Roberto Grossi, Costas S. Iliopoulos, Chang Liu, Nadia Pisanti, Solon P. Pissis, Ahmad Retha, Giovanna Rosone, Fatima Vayani, and Luca Versari. On-line pattern matching on similar texts. In Proc. CPM 2017, volume 78 of LIPIcs, pages 9:1–9:14. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2017. doi:10.4230/LIPICS.CPM.2017.9.
  • [31] Roberto Grossi and Jeffrey Scott Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM Journal on Computing, 35(2):378–407, 2005. doi:10.1137/S0097539702402354.
  • [32] Lin Huang, Victoria Popic, and Serafim Batzoglou. Short read alignment with populations of genomes. Bioinformatics, 29(13):i361–i370, 2013. doi:10.1093/bioinformatics/btt215.
  • [33] Songbo Huang, T.W. Lam, W.K. Sung, S.L. Tam, and S.M. Yiu. Indexing similar DNA sequences. In Proc. AAIM 2010, volume 6124 of LNCS, pages 180–190. Springer, 2010. doi:10.1007/978-3-642-14355-7_19.
  • [34] Sorina Maciuca, Carlos del Ojo Elias, Gil McVean, and Zamin Iqbal. A natural encoding of genetic variation in a Burrows-Wheeler transform to enable mapping and genome inference. In Proc. WABI 2016, volume 9838 of LNCS, pages 222–233. Springer, 2016. doi:10.1007/978-3-319-43681-4_18.
  • [35] Giovanni Manzini. XBWT tricks. In Proc. SPIRE 2016, volume 9954 of LNCS, pages 80–92. Springer, 2016. doi:10.1007/978-3-319-46049-9_8.
  • [36] Miguel A. Martínez-Prieto, Nieves Brisaboa, Rodrigo Cánovas, Francisco Claude, and Gonzalo Navarro. Practical compressed string dictionaries. Information Systems, 56:73–108, 2016. doi:10.1016/j.is.2015.08.008.
  • [37] J. Ian Munro. Tables. In Proc. FSTTCS 1996, volume 1180 of LNCS. Springer, 1996. doi:10.1007/3-540-62034-6_35.
  • [38] Veli Mäkinen, Gonzalo Navarro, Jouni Sirén, and Niko Välimäki. Storage and retrieval of highly repetitive sequence collections. Journal of Computational Biology, 17(3):281–308, 2010. doi:10.1089/cmb.2009.0169.
  • [39] Joong Chae Na, Hyunjoon Kim, Seunghwan Min, Heejin Park, Thierry Lecroq, Martine Léonard, Laurent Mouchard, and Kunsoo Park. FM-index of alignment with gaps. Theoretical Computer Science, 710:148–157, 2018. doi:10.1016/j.tcs.2017.02.020.
  • [40] Joong Chae Na, Hyunjoon Kim, Heejin Park, Thierry Lecroq, Martine Léonard, Laurent Mouchard, and Kunsoo Park. FM-index of alignment: A compressed index for similar strings. Theoretical Computer Science, 638:159–170, 2016. doi:10.1016/j.tcs.2015.08.008.
  • [41] Enno Ohlebusch, Simon Gog, and Adrian Kügel. Computing matching statistics and maximal exact matches on compressed full-text indexes. In Proc. SPIRE 2010, volume 6393 of LNCS, pages 347–358. Springer, 2010. doi:10.1007/978-3-642-16321-0_36.
  • [42] Daisuke Okanohara and Kunihiko Sadakane. Practical entropy-compressed rank/select dictionary. In Proc. ALENEX 2007, pages 60–70. SIAM, 2007. doi:10.1137/1.9781611972870.6.
  • [43] Nicola Rizzo, Massimo Equi, Tuukka Norri, and Veli Mäkinen. Elastic founder graphs improved and enhanced. Theor. Comput. Sci., 982:114269, 2024. doi:10.1016/J.TCS.2023.114269.
  • [44] Kunihiko Sadakane. Succinct data structures for flexible text retrieval systems. Journal of Discrete Algorithms, 5(1):12–22, 2007. doi:10.1016/j.jda.2006.03.011.
  • [45] Jouni Sirén. Sampled longest common prefix array. In Proc. CPM 2010, volume 6129 of LNCS, pages 227–237. Springer, 2010. doi:10.1007/978-3-642-13509-5_21.
  • [46] Jouni Sirén. Indexing variation graphs. In Proc. ALENEX 2017, pages 13–27. SIAM, 2017. doi:10.1137/1.9781611974768.2.
  • [47] Jouni Sirén, Erik Garrison, Adam M. Novak, Benedict Paten, and Richard Durbin. Haplotype-aware graph indexes. Bioinformatics, 36(2):400–407, 2020. doi:10.1093/bioinformatics/btz575.
  • [48] Jouni Sirén, Niko Välimäki, and Veli Mäkinen. Indexing finite language representation of population genotypes. In Proc. WABI 2011, volume 6833 of LNCS, pages 270–281. Springer, 2011. doi:10.1007/978-3-642-23038-7_23.
  • [49] Jouni Sirén, Niko Välimäki, and Veli Mäkinen. Indexing graphs for path queries with applications in genome research. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 11(2):375–388, 2014. doi:10.1109/TCBB.2013.2297101.
  • [50] Esko Ukkonen. Finding founder sequences from a set of recombinants. In Roderic Guigó and Dan Gusfield, editors, Algorithms in Bioinformatics, Second International Workshop, WABI 2002, Rome, Italy, September 17-21, 2002, Proceedings, volume 2452 of Lecture Notes in Computer Science, pages 277–286. Springer, 2002. doi:10.1007/3-540-45784-4_21.
  • [51] Ryan Williams. A new algorithm for optimal 2-constraint satisfaction and its implications. Theor. Comput. Sci., 348(2-3):357–365, 2005. doi:10.1016/J.TCS.2005.09.023.