Document

**Published in:** LIPIcs, Volume 242, 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)

Motivation. To keep up with the scale of genomic databases, several methods rely on local sensitive hashing methods to efficiently find potential matches within large genome collections. Existing solutions rely on Minhash or Hyperloglog fingerprints and require reading the whole index to perform a query. Such solutions can not be considered scalable with the growing amount of documents to index.
Results. We present NIQKI, a novel structure with well-designed fingerprints that lead to theoretical and practical query time improvements, outperforming state-of-the-art by orders of magnitude. Our contribution is threefold. First, we generalize the concept of Hyperminhash fingerprints in (h,m)-HMH fingerprints that can be tuned to present the lowest false positive rate given the expected sub-sampling applied. Second, we provide a structure able to index any kind of fingerprints based on inverted indexes that provide optimal queries, namely linear with the size of the output. Third, we implemented these approaches in a tool dubbed NIQKI that can index and calculate pairwise distances for over one million bacterial genomes from GenBank in a few days on a small cluster. We show that our approach can be orders of magnitude faster than state-of-the-art with comparable precision. We believe this approach can lead to tremendous improvements, allowing fast queries and scaling on extensive genomic databases.

Clément Agret, Bastien Cazaux, and Antoine Limasset. Toward Optimal Fingerprint Indexing for Large Scale Genomics. In 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 242, pp. 25:1-25:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)

Copy BibTex To Clipboard

@InProceedings{agret_et_al:LIPIcs.WABI.2022.25, author = {Agret, Cl\'{e}ment and Cazaux, Bastien and Limasset, Antoine}, title = {{Toward Optimal Fingerprint Indexing for Large Scale Genomics}}, booktitle = {22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)}, pages = {25:1--25:15}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-243-3}, ISSN = {1868-8969}, year = {2022}, volume = {242}, editor = {Boucher, Christina and Rahmann, Sven}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2022.25}, URN = {urn:nbn:de:0030-drops-170598}, doi = {10.4230/LIPIcs.WABI.2022.25}, annote = {Keywords: Data Structure, Indexation, Local Sensitive Hashing, Genomes, Databases} }

Document

**Published in:** LIPIcs, Volume 212, 32nd International Symposium on Algorithms and Computation (ISAAC 2021)

We study the problem of matching a string in a labeled graph. Previous research has shown that unless the Orthogonal Vectors Hypothesis (OVH) is false, one cannot solve this problem in strongly sub-quadratic time, nor index the graph in polynomial time to answer queries efficiently (Equi et al. ICALP 2019, SOFSEM 2021). These conditional lower-bounds cover even deterministic graphs with binary alphabet, but there naturally exist also graph classes that are easy to index: E.g. Wheeler graphs (Gagie et al. Theor. Comp. Sci. 2017) cover graphs admitting a Burrows-Wheeler transform -based indexing scheme. However, it is NP-complete to recognize if a graph is a Wheeler graph (Gibney, Thankachan, ESA 2019).
We propose an approach to alleviate the construction bottleneck of Wheeler graphs. Rather than starting from an arbitrary graph, we study graphs induced from multiple sequence alignments. Elastic degenerate strings (Bernadini et al. SPIRE 2017, ICALP 2019) can be seen as such graphs, and we introduce here their generalization: elastic founder graphs. We first prove that even such induced graphs are hard to index under OVH. Then we introduce two subclasses that are easy to index. Moreover, we give a near-linear time algorithm to construct indexable elastic founder graphs. This algorithm is based on an earlier segmentation algorithm for gapless multiple sequence alignments inducing non-elastic founder graphs (Mäkinen et al., WABI 2020), but uses more involved techniques to cope with repetitive string collections synchronized with gaps. Finally, we show that one of the subclasses admits a reduction to Wheeler graphs in polynomial time.

Massimo Equi, Tuukka Norri, Jarno Alanko, Bastien Cazaux, Alexandru I. Tomescu, and Veli Mäkinen. Algorithms and Complexity on Indexing Elastic Founder Graphs. In 32nd International Symposium on Algorithms and Computation (ISAAC 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 212, pp. 20:1-20:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{equi_et_al:LIPIcs.ISAAC.2021.20, author = {Equi, Massimo and Norri, Tuukka and Alanko, Jarno and Cazaux, Bastien and Tomescu, Alexandru I. and M\"{a}kinen, Veli}, title = {{Algorithms and Complexity on Indexing Elastic Founder Graphs}}, booktitle = {32nd International Symposium on Algorithms and Computation (ISAAC 2021)}, pages = {20:1--20:18}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-214-3}, ISSN = {1868-8969}, year = {2021}, volume = {212}, editor = {Ahn, Hee-Kap and Sadakane, Kunihiko}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ISAAC.2021.20}, URN = {urn:nbn:de:0030-drops-154532}, doi = {10.4230/LIPIcs.ISAAC.2021.20}, annote = {Keywords: orthogonal vectors hypothesis, multiple sequence alignment, segmentation} }

Document

**Published in:** LIPIcs, Volume 191, 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)

The hierarchical overlap graph (HOG) is a graph that encodes overlaps from a given set P of n strings, as the overlap graph does. A best known algorithm constructs HOG in O(||P|| log n) time and O(||P||) space, where ||P|| is the sum of lengths of the strings in P. In this paper we present a new algorithm to construct HOG in O(||P||) time and space. Hence, the construction time and space of HOG are better than those of the overlap graph, which are O(||P|| + n²).

Sangsoo Park, Sung Gwan Park, Bastien Cazaux, Kunsoo Park, and Eric Rivals. A Linear Time Algorithm for Constructing Hierarchical Overlap Graphs. In 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 191, pp. 22:1-22:9, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{park_et_al:LIPIcs.CPM.2021.22, author = {Park, Sangsoo and Park, Sung Gwan and Cazaux, Bastien and Park, Kunsoo and Rivals, Eric}, title = {{A Linear Time Algorithm for Constructing Hierarchical Overlap Graphs}}, booktitle = {32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)}, pages = {22:1--22:9}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-186-3}, ISSN = {1868-8969}, year = {2021}, volume = {191}, editor = {Gawrychowski, Pawe{\l} and Starikovskaya, Tatiana}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2021.22}, URN = {urn:nbn:de:0030-drops-139736}, doi = {10.4230/LIPIcs.CPM.2021.22}, annote = {Keywords: overlap graph, hierarchical overlap graph, shortest superstring problem, border array} }

Document

**Published in:** LIPIcs, Volume 172, 20th International Workshop on Algorithms in Bioinformatics (WABI 2020)

We introduce a compact pangenome representation based on an optimal segmentation concept that aims to reconstruct founder sequences from a multiple sequence alignment (MSA). Such founder sequences have the feature that each row of the MSA is a recombination of the founders. Several linear time dynamic programming algorithms have been previously devised to optimize segmentations that induce founder blocks that then can be concatenated into a set of founder sequences. All possible concatenation orders can be expressed as a founder block graph. We observe a key property of such graphs: if the node labels (founder segments) do not repeat in the paths of the graph, such graphs can be indexed for efficient string matching. We call such graphs segment repeat-free founder block graphs.
We give a linear time algorithm to construct a segment repeat-free founder block graph given an MSA. The algorithm combines techniques from the founder segmentation algorithms (Cazaux et al. SPIRE 2019) and fully-functional bidirectional Burrows-Wheeler index (Belazzougui and Cunial, CPM 2019). We derive a succinct index structure to support queries of arbitrary length in the paths of the graph.
Experiments on an MSA of SARS-CoV-2 strains are reported. An MSA of size 410× 29811 is compacted in one minute into a segment repeat-free founder block graph of 3900 nodes and 4440 edges. The maximum length and total length of node labels is 12 and 34968, respectively. The index on the graph takes only 3% of the size of the MSA.

Veli Mäkinen, Bastien Cazaux, Massimo Equi, Tuukka Norri, and Alexandru I. Tomescu. Linear Time Construction of Indexable Founder Block Graphs. In 20th International Workshop on Algorithms in Bioinformatics (WABI 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 172, pp. 7:1-7:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

Copy BibTex To Clipboard

@InProceedings{makinen_et_al:LIPIcs.WABI.2020.7, author = {M\"{a}kinen, Veli and Cazaux, Bastien and Equi, Massimo and Norri, Tuukka and Tomescu, Alexandru I.}, title = {{Linear Time Construction of Indexable Founder Block Graphs}}, booktitle = {20th International Workshop on Algorithms in Bioinformatics (WABI 2020)}, pages = {7:1--7:18}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-161-0}, ISSN = {1868-8969}, year = {2020}, volume = {172}, editor = {Kingsford, Carl and Pisanti, Nadia}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2020.7}, URN = {urn:nbn:de:0030-drops-127961}, doi = {10.4230/LIPIcs.WABI.2020.7}, annote = {Keywords: Pangenome indexing, founder reconstruction, multiple sequence alignment, compressed data structures, string matching} }

Document

**Published in:** LIPIcs, Volume 143, 19th International Workshop on Algorithms in Bioinformatics (WABI 2019)

Recent large-scale community sequencing efforts allow at an unprecedented level of detail the identification of genomic regions that show signatures of natural selection. Traditional methods for identifying such regions from individuals' haplotype data, however, require excessive computing times and therefore are not applicable to current datasets. In 2019, Cunha et al. (Proceedings of BSB 2019) suggested the maximal perfect haplotype block as a very simple combinatorial pattern, forming the basis of a new method to perform rapid genome-wide selection scans. The algorithm they presented for identifying these blocks, however, had a worst-case running time quadratic in the genome length. It was posed as an open problem whether an optimal, linear-time algorithm exists. In this paper we give two algorithms that achieve this time bound, one conceptually very simple one using suffix trees and a second one using the positional Burrows-Wheeler Transform, that is very efficient also in practice.

Jarno Alanko, Hideo Bannai, Bastien Cazaux, Pierre Peterlongo, and Jens Stoye. Finding All Maximal Perfect Haplotype Blocks in Linear Time. In 19th International Workshop on Algorithms in Bioinformatics (WABI 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 143, pp. 8:1-8:9, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

Copy BibTex To Clipboard

@InProceedings{alanko_et_al:LIPIcs.WABI.2019.8, author = {Alanko, Jarno and Bannai, Hideo and Cazaux, Bastien and Peterlongo, Pierre and Stoye, Jens}, title = {{Finding All Maximal Perfect Haplotype Blocks in Linear Time}}, booktitle = {19th International Workshop on Algorithms in Bioinformatics (WABI 2019)}, pages = {8:1--8:9}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-123-8}, ISSN = {1868-8969}, year = {2019}, volume = {143}, editor = {Huber, Katharina T. and Gusfield, Dan}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2019.8}, URN = {urn:nbn:de:0030-drops-110388}, doi = {10.4230/LIPIcs.WABI.2019.8}, annote = {Keywords: Population genomics, selection coefficient, haplotype block, positional Burrows-Wheeler Transform} }

Document

**Published in:** LIPIcs, Volume 128, 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)

The boom of genomic sequencing makes compression of sets of sequences inescapable. This underlies the need for multi-string indexing data structures that helps compressing the data. The most prominent example of such data structures is the Burrows-Wheeler Transform (BWT), a reversible permutation of a text that improves its compressibility. A similar data structure, the eXtended Burrows-Wheeler Transform (XBW), is able to index a tree labelled with alphabet symbols. A link between a multi-string BWT and the Aho-Corasick automaton has already been found and led to a way to build a XBW from a multi-string BWT. We exhibit a stronger link between a multi-string BWT and a XBW by using the order of the concatenation in the multi-string. This bijective link has several applications: first, it allows one to build one data structure from the other; second, it enables one to compute an ordering of the input strings that optimises a Run-Length measure (i.e., the compressibility) of the BWT or of the XBW.

Bastien Cazaux and Eric Rivals. Linking BWT and XBW via Aho-Corasick Automaton: Applications to Run-Length Encoding. In 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 128, pp. 24:1-24:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

Copy BibTex To Clipboard

@InProceedings{cazaux_et_al:LIPIcs.CPM.2019.24, author = {Cazaux, Bastien and Rivals, Eric}, title = {{Linking BWT and XBW via Aho-Corasick Automaton: Applications to Run-Length Encoding}}, booktitle = {30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)}, pages = {24:1--24:20}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-103-0}, ISSN = {1868-8969}, year = {2019}, volume = {128}, editor = {Pisanti, Nadia and P. Pissis, Solon}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2019.24}, URN = {urn:nbn:de:0030-drops-104955}, doi = {10.4230/LIPIcs.CPM.2019.24}, annote = {Keywords: Data Structure, Algorithm, Aho-Corasick Tree, compression, RLE} }

Document

**Published in:** LIPIcs, Volume 113, 18th International Workshop on Algorithms in Bioinformatics (WABI 2018)

Given a threshold L and a set R = {R_1, ..., R_m} of m strings (haplotype sequences), each having length n, the minimum segmentation problem for founder reconstruction is to partition [1,n] into set P of disjoint segments such that each segment [a,b] in P has length at least L and the number d(a,b)=|{R_i[a,b] : 1 <= i <= m}| of distinct substrings at segment [a,b] is minimized over [a,b] in P. The distinct substrings in the segments represent founder blocks that can be concatenated to form max{d(a,b) : [a,b] in P} founder sequences representing the original R such that crossovers happen only at segment boundaries. We give an optimal O(mn) time algorithm to solve the problem, improving over earlier O(mn^2). This improvement enables to exploit the algorithm on a pan-genomic setting of input strings being aligned haplotype sequences of complete human chromosomes, with a goal of finding a representative set of references that can be indexed for read alignment and variant calling. We implemented the new algorithm and give some experimental evidence on the practicality of the approach on this pan-genomic setting.

Tuukka Norri, Bastien Cazaux, Dmitry Kosolobov, and Veli Mäkinen. Minimum Segmentation for Pan-genomic Founder Reconstruction in Linear Time. In 18th International Workshop on Algorithms in Bioinformatics (WABI 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 113, pp. 15:1-15:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

Copy BibTex To Clipboard

@InProceedings{norri_et_al:LIPIcs.WABI.2018.15, author = {Norri, Tuukka and Cazaux, Bastien and Kosolobov, Dmitry and M\"{a}kinen, Veli}, title = {{Minimum Segmentation for Pan-genomic Founder Reconstruction in Linear Time}}, booktitle = {18th International Workshop on Algorithms in Bioinformatics (WABI 2018)}, pages = {15:1--15:15}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-082-8}, ISSN = {1868-8969}, year = {2018}, volume = {113}, editor = {Parida, Laxmi and Ukkonen, Esko}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2018.15}, URN = {urn:nbn:de:0030-drops-93175}, doi = {10.4230/LIPIcs.WABI.2018.15}, annote = {Keywords: Pan-genome indexing, founder reconstruction, dynamic programming, positional Burrows-Wheeler transform, range minimum query} }

Document

**Published in:** LIPIcs, Volume 103, 17th International Symposium on Experimental Algorithms (SEA 2018)

Given a set P of words, the Shortest Linear Superstring (SLS) problem is an optimisation problem that asks for a superstring of P of minimal length. SLS has applications in data compression, where a superstring is a compact representation of P, and in bioinformatics where it models the first step of genome assembly. Unfortunately SLS is hard to solve (NP-hard) and to closely approximate (MAX-SNP-hard). If numerous polynomial time approximation algorithms have been devised, few articles report on their practical performance. We lack knowledge about how closely an approximate superstring can be from an optimal one in practice. Here, we exhibit a linear time algorithm that reports an upper and a lower bound on the length of an optimal superstring. The upper bound is the length of an approximate superstring. This algorithm can be used to evaluate beforehand whether one can get an approximate superstring whose length is close to the optimum for a given instance. Experimental results suggest that its approximation performance is orders of magnitude better than previously reported practical values. Moreover, the proposed algorithm remainso efficient even on large instances and can serve to explore in practice the approximability of SLS.

Bastien Cazaux, Samuel Juhel, and Eric Rivals. Practical lower and upper bounds for the Shortest Linear Superstring. In 17th International Symposium on Experimental Algorithms (SEA 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 103, pp. 18:1-18:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

Copy BibTex To Clipboard

@InProceedings{cazaux_et_al:LIPIcs.SEA.2018.18, author = {Cazaux, Bastien and Juhel, Samuel and Rivals, Eric}, title = {{Practical lower and upper bounds for the Shortest Linear Superstring}}, booktitle = {17th International Symposium on Experimental Algorithms (SEA 2018)}, pages = {18:1--18:14}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-070-5}, ISSN = {1868-8969}, year = {2018}, volume = {103}, editor = {D'Angelo, Gianlorenzo}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2018.18}, URN = {urn:nbn:de:0030-drops-89530}, doi = {10.4230/LIPIcs.SEA.2018.18}, annote = {Keywords: greedy, approximation, overlap, Concat-Cycles, cyclic cover, linear time, text compression} }

Document

**Published in:** LIPIcs, Volume 105, 29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018)

A superstring of a set of words P = {s_1, ..., s_p } is a string that contains each word of P as substring. Given P, the well known Shortest Linear Superstring problem (SLS), asks for a shortest superstring of P. In a variant of SLS, called Multi-SLS, each word s_i comes with an integer m(i), its multiplicity, that sets a constraint on its number of occurrences, and the goal is to find a shortest superstring that contains at least m(i) occurrences of s_i. Multi-SLS generalizes SLS and is obviously as hard to solve, but it has been studied only in special cases (with words of length 2 or with a fixed number of words). The approximability of Multi-SLS in the general case remains open. Here, we study the approximability of Multi-SLS and that of the companion problem Multi-SCCS, which asks for a shortest cyclic cover instead of shortest superstring. First, we investigate the approximation of a greedy algorithm for maximizing the compression offered by a superstring or by a cyclic cover: the approximation ratio is 1/2 for Multi-SLS and 1 for Multi-SCCS. Then, we exhibit a linear time approximation algorithm, Concat-Greedy, and show it achieves a ratio of 4 regarding the superstring length. This demonstrates that for both measures Multi-SLS belongs to the class of APX problems.

Bastien Cazaux and Eric Rivals. Superstrings with multiplicities. In 29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 105, pp. 21:1-21:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

Copy BibTex To Clipboard

@InProceedings{cazaux_et_al:LIPIcs.CPM.2018.21, author = {Cazaux, Bastien and Rivals, Eric}, title = {{Superstrings with multiplicities}}, booktitle = {29th Annual Symposium on Combinatorial Pattern Matching (CPM 2018)}, pages = {21:1--21:16}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-074-3}, ISSN = {1868-8969}, year = {2018}, volume = {105}, editor = {Navarro, Gonzalo and Sankoff, David and Zhu, Binhai}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2018.21}, URN = {urn:nbn:de:0030-drops-86881}, doi = {10.4230/LIPIcs.CPM.2018.21}, annote = {Keywords: greedy algorithm, approximation, overlap, cyclic cover, APX, subset system} }

X

Feedback for Dagstuhl Publishing

Feedback submitted

Please try again later or send an E-mail