Document

Complete Volume

**Published in:** LIPIcs, Volume 273, 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)

LIPIcs, Volume 273, WABI 2023, Complete Volume

23rd International Workshop on Algorithms in Bioinformatics (WABI 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 273, pp. 1-400, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

@Proceedings{belazzougui_et_al:LIPIcs.WABI.2023, title = {{LIPIcs, Volume 273, WABI 2023, Complete Volume}}, booktitle = {23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)}, pages = {1--400}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-294-5}, ISSN = {1868-8969}, year = {2023}, volume = {273}, editor = {Belazzougui, Djamal and Ouangraoua, A\"{i}da}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2023}, URN = {urn:nbn:de:0030-drops-186250}, doi = {10.4230/LIPIcs.WABI.2023}, annote = {Keywords: LIPIcs, Volume 273, WABI 2023, Complete Volume} }

Document

Front Matter

**Published in:** LIPIcs, Volume 273, 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)

Front Matter, Table of Contents, Preface, Conference Organization

23rd International Workshop on Algorithms in Bioinformatics (WABI 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 273, pp. 0:i-0:xiv, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

@InProceedings{belazzougui_et_al:LIPIcs.WABI.2023.0, author = {Belazzougui, Djamal and Ouangraoua, A\"{i}da}, title = {{Front Matter, Table of Contents, Preface, Conference Organization}}, booktitle = {23rd International Workshop on Algorithms in Bioinformatics (WABI 2023)}, pages = {0:i--0:xiv}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-294-5}, ISSN = {1868-8969}, year = {2023}, volume = {273}, editor = {Belazzougui, Djamal and Ouangraoua, A\"{i}da}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2023.0}, URN = {urn:nbn:de:0030-drops-186267}, doi = {10.4230/LIPIcs.WABI.2023.0}, annote = {Keywords: Front Matter, Table of Contents, Preface, Conference Organization} }

Document

**Published in:** LIPIcs, Volume 242, 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)

We apply Invertible Bloom Lookup Tables (IBLTs) to the comparison of k-mer sets originating from large DNA sequence datasets. We show that for similar datasets, IBLTs provide a more space-efficient and, at the same time, more accurate method for estimating Jaccard similarity of underlying k-mer sets, compared to MinHash, which is a go-to sketching technique for efficient pairwise similarity estimation. This is achieved by combining IBLTs with k-mer sampling based on syncmers, which constitute a context-independent alternative to minimizers and provide an unbiased estimator of Jaccard similarity. A key property of our method is that the data structures involved require space proportional to the difference of the k-mer sets and independent of the size of the sets themselves. As another application, we show how our ideas can be applied to efficiently compute (an approximation of) the k-mers that differ between two datasets, still using space only proportional to their number. We experimentally illustrate our results on both simulated and real data (SARS-CoV-2 and Streptococcus pneumoniae genomes).
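To make the quantities in this abstract concrete, the sketch below computes the exact Jaccard similarity of two k-mer sets and samples k-mers with closed syncmers. It is only an illustration and not the paper's IBLT-based method: the function names are hypothetical, s-mers are ordered lexicographically (real tools typically order by a hash), and ties for the smallest s-mer are accepted at either end.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def is_closed_syncmer(kmer, s):
    """A k-mer is a closed syncmer if its smallest s-mer sits at the
    first or last s-mer position (lexicographic order is an assumption)."""
    smers = [kmer[i:i + s] for i in range(len(kmer) - s + 1)]
    smallest = min(smers)
    return smers[0] == smallest or smers[-1] == smallest


def syncmer_sample(seq, k, s):
    """Keep only the k-mers of seq that are closed syncmers."""
    return {x for x in kmers(seq, k) if is_closed_syncmer(x, s)}


a = kmers("ACGTAC", 3)
b = kmers("ACGTTC", 3)
similarity = jaccard(a, b)  # 2 shared k-mers out of 6 distinct ones
```

Whether a k-mer is selected depends only on the k-mer itself, which is what "context-independent" refers to: the same k-mer is sampled (or not) in both datasets.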

Yoshihiro Shibuya, Djamal Belazzougui, and Gregory Kucherov. Efficient Reconciliation of Genomic Datasets of High Similarity. In 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 242, pp. 14:1-14:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)

@InProceedings{shibuya_et_al:LIPIcs.WABI.2022.14, author = {Shibuya, Yoshihiro and Belazzougui, Djamal and Kucherov, Gregory}, title = {{Efficient Reconciliation of Genomic Datasets of High Similarity}}, booktitle = {22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)}, pages = {14:1--14:14}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-243-3}, ISSN = {1868-8969}, year = {2022}, volume = {242}, editor = {Boucher, Christina and Rahmann, Sven}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2022.14}, URN = {urn:nbn:de:0030-drops-170481}, doi = {10.4230/LIPIcs.WABI.2022.14}, annote = {Keywords: k-mers, sketching, Invertible Bloom Lookup Tables, IBLT, MinHash, syncmers, minimizers} }

Document

**Published in:** LIPIcs, Volume 201, 21st International Workshop on Algorithms in Bioinformatics (WABI 2021)

Motivation. k-mer counting is a common task in bioinformatic pipelines, with many dedicated tools available. Output formats could rely on quotienting to reduce the space of k-mers in hash tables; however, counts are not usually stored in space-efficient formats. Overall, k-mer count tables for genomic data take considerable space, easily reaching tens of GB. Furthermore, such tables do not support efficient random-access queries in general.
Results. In this work, we design an efficient representation of k-mer count tables supporting fast random-access queries. We propose to apply Compressed Static Functions (CSFs), with space proportional to the empirical zero-order entropy of the counts. For very skewed distributions, like those of k-mer counts in whole genomes, the only currently available implementation of CSFs does not provide a compact enough representation. By adding a Bloom filter to a CSF we obtain a Bloom-enhanced CSF (BCSF), effectively overcoming this limitation. Furthermore, by combining BCSFs with minimizer-based bucketing of k-mers, we build even smaller representations breaking the empirical entropy lower bound, for large enough k. We also extend these representations to the approximate case, gaining additional space. We experimentally validate these techniques on k-mer count tables of whole genomes (E. coli and C. elegans) as well as on k-mer document frequency tables for 29 E. coli genomes. In the case of exact counts, our representation takes about half the space of the empirical entropy, for large enough k.
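The "empirical zero-order entropy of the counts" mentioned above can be computed directly from a count table. The sketch below is a hedged illustration with hypothetical function names; it does not build a CSF, it only evaluates the entropy figure against which such representations are measured.

```python
from collections import Counter
from math import log2


def kmer_counts(seq, k):
    """Count every length-k substring of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def zero_order_entropy(values):
    """Empirical zero-order entropy, in bits per value, of a sequence:
    H0 = sum over distinct values v of (n_v / n) * log2(n / n_v)."""
    freq = Counter(values)
    n = len(values)
    return sum((c / n) * log2(n / c) for c in freq.values())


table = kmer_counts("ACGTACGT", 4)
h0 = zero_order_entropy(list(table.values()))
# len(table) * h0 is the number of bits an entropy-bounded encoding of
# the count values would aim for, ignoring lower-order terms.
```

A very skewed table, where almost every k-mer has count 1, yields an entropy far below one bit per k-mer, which is the regime the abstract singles out as difficult.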

Yoshihiro Shibuya, Djamal Belazzougui, and Gregory Kucherov. Space-Efficient Representation of Genomic k-Mer Count Tables. In 21st International Workshop on Algorithms in Bioinformatics (WABI 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 201, pp. 8:1-8:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

@InProceedings{shibuya_et_al:LIPIcs.WABI.2021.8, author = {Shibuya, Yoshihiro and Belazzougui, Djamal and Kucherov, Gregory}, title = {{Space-Efficient Representation of Genomic k-Mer Count Tables}}, booktitle = {21st International Workshop on Algorithms in Bioinformatics (WABI 2021)}, pages = {8:1--8:19}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-200-6}, ISSN = {1868-8969}, year = {2021}, volume = {201}, editor = {Carbone, Alessandra and El-Kebir, Mohammed}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2021.8}, URN = {urn:nbn:de:0030-drops-143619}, doi = {10.4230/LIPIcs.WABI.2021.8}, annote = {Keywords: k-mer counting, data structures, compression, minimizers, compressed static function, Bloom filter, empirical entropy} }

Document

**Published in:** LIPIcs, Volume 191, 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)

The weighted ancestor problem is a well-known generalization of the predecessor problem to trees. It is known to require Ω(log log n) time for queries provided 𝒪(n polylog n) space is available and weights are from [0..n], where n is the number of tree nodes. However, when applied to suffix trees, the problem, surprisingly, admits an 𝒪(n)-space solution with constant query time, as was shown by Gawrychowski, Lewenstein, and Nicholson (Proc. ESA 2014). This variant of the problem can be reformulated as follows: given the suffix tree of a string s, we need a data structure that can locate in the tree any substring s[p..q] of s in 𝒪(1) time (as if one descended from the root reading s[p..q] along the way). Unfortunately, the data structure of Gawrychowski et al. has no efficient construction algorithm, limiting its wider usage as an algorithmic tool. In this paper we resolve this issue, describing a data structure for weighted ancestors in suffix trees with constant query time and a linear construction algorithm. Our solution is based on a novel approach using so-called irreducible LCP values.
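For orientation, a weighted ancestor query can be answered naively by walking up the tree. The sketch below assumes one common formulation (weights strictly increase from the root downwards, and the query returns the ancestor closest to the root whose weight is at least w, with a node counted as its own ancestor); the array layout and function name are hypothetical. It takes time proportional to the depth of the node, unlike the constant-time structure the abstract describes.

```python
def weighted_ancestor(parent, weight, v, w):
    """Return the ancestor of v closest to the root with weight >= w,
    or None if even v itself has weight < w.

    parent[u] is the parent of node u (None for the root) and weight[u]
    its weight; weights are assumed to increase strictly from the root
    towards the leaves.
    """
    if weight[v] < w:
        return None
    # Climb while the parent still satisfies the weight threshold.
    while parent[v] is not None and weight[parent[v]] >= w:
        v = parent[v]
    return v


# Tiny example tree: 0 is the root, 1 is its child, 2 and 3 hang off 1.
parent = [None, 0, 1, 1]
weight = [0, 2, 5, 3]
node = weighted_ancestor(parent, weight, 2, 1)
```

In a suffix tree, taking string depth as the weight and a leaf for suffix p as the start node turns this query into locating the substring s[p..q], with w = q - p + 1.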

Djamal Belazzougui, Dmitry Kosolobov, Simon J. Puglisi, and Rajeev Raman. Weighted Ancestors in Suffix Trees Revisited. In 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 191, pp. 8:1-8:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

@InProceedings{belazzougui_et_al:LIPIcs.CPM.2021.8, author = {Belazzougui, Djamal and Kosolobov, Dmitry and Puglisi, Simon J. and Raman, Rajeev}, title = {{Weighted Ancestors in Suffix Trees Revisited}}, booktitle = {32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)}, pages = {8:1--8:15}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-186-3}, ISSN = {1868-8969}, year = {2021}, volume = {191}, editor = {Gawrychowski, Pawe{\l} and Starikovskaya, Tatiana}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2021.8}, URN = {urn:nbn:de:0030-drops-139594}, doi = {10.4230/LIPIcs.CPM.2021.8}, annote = {Keywords: suffix tree, weighted ancestors, irreducible LCP, deterministic substring hashing} }

Document

**Published in:** LIPIcs, Volume 161, 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)

We study a document retrieval problem in the new framework where D text documents are organized in a category tree with a pre-defined number h of categories. This situation occurs e.g. with taxonomic trees in biology or subject classification systems for scientific literature. Given a string pattern p and a category (level in the category tree), we wish to efficiently retrieve the t categorical units containing this pattern and belonging to the category. We propose several efficient solutions for this problem. One of them uses n(log σ (1+o(1)) + log D + O(h)) + O(Δ) bits of space and O(|p| + t) query time, where n is the total length of the documents, σ is the size of the alphabet used in the documents and Δ is the total number of nodes in the category tree. Another solution uses n(log σ (1+o(1)) + O(log D)) + O(Δ) + O(D log n) bits of space and O(|p| + t log D) query time. We finally propose other solutions which are more space-efficient at the expense of a slight increase in query time.

Djamal Belazzougui and Gregory Kucherov. Efficient Tree-Structured Categorical Retrieval. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 4:1-4:11, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

@InProceedings{belazzougui_et_al:LIPIcs.CPM.2020.4, author = {Belazzougui, Djamal and Kucherov, Gregory}, title = {{Efficient Tree-Structured Categorical Retrieval}}, booktitle = {31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)}, pages = {4:1--4:11}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-149-8}, ISSN = {1868-8969}, year = {2020}, volume = {161}, editor = {G{\o}rtz, Inge Li and Weimann, Oren}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.4}, URN = {urn:nbn:de:0030-drops-121299}, doi = {10.4230/LIPIcs.CPM.2020.4}, annote = {Keywords: pattern matching, document retrieval, category tree, space-efficient data structures} }

Document

**Published in:** LIPIcs, Volume 128, 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)

Given a string T on an alphabet of size sigma, we describe a bidirectional Burrows-Wheeler index that takes O(|T| log sigma) bits of space, and that supports the addition and removal of one character, on the left or right side of any substring of T, in constant time. Previously known data structures that used the same space allowed constant-time addition to any substring of T, but they could support removal only from specific substrings of T. We also describe an index that supports bidirectional addition and removal in O(log log |T|) time, and that takes a number of words proportional to the number of left and right extensions of the maximal repeats of T. We use such fully-functional indexes to implement bidirectional, frequency-aware, variable-order de Bruijn graphs with no upper bound on their order, and supporting natural criteria for increasing and decreasing the order during traversal.

Djamal Belazzougui and Fabio Cunial. Fully-Functional Bidirectional Burrows-Wheeler Indexes and Infinite-Order De Bruijn Graphs. In 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 128, pp. 10:1-10:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

@InProceedings{belazzougui_et_al:LIPIcs.CPM.2019.10, author = {Belazzougui, Djamal and Cunial, Fabio}, title = {{Fully-Functional Bidirectional Burrows-Wheeler Indexes and Infinite-Order De Bruijn Graphs}}, booktitle = {30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)}, pages = {10:1--10:15}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-103-0}, ISSN = {1868-8969}, year = {2019}, volume = {128}, editor = {Pisanti, Nadia and Pissis, Solon P.}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2019.10}, URN = {urn:nbn:de:0030-drops-104811}, doi = {10.4230/LIPIcs.CPM.2019.10}, annote = {Keywords: BWT, suffix tree, CDAWG, de Bruijn graph, maximal repeat, string depth, contraction, bidirectional index} }

Document

**Published in:** LIPIcs, Volume 128, 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)

A string S[1,n] is a power (or repetition or tandem repeat) of order k and period n/k, if it can be decomposed into k consecutive identical blocks of length n/k. Powers and periods are fundamental structures in the study of strings, and algorithms to compute them efficiently have been widely studied. Recently, Fici et al. (Proc. ICALP 2016) introduced an antipower of order k to be a string composed of k distinct blocks of the same length, n/k, called the antiperiod. An arbitrary string will have antiperiod t if it is a prefix of an antipower with antiperiod t. In this paper, we describe an efficient algorithm for computing the smallest antiperiod of a string S of length n in O(n) time. We also describe an algorithm to compute all the antiperiods of S that runs in O(n log n) time.
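The definition above can be checked by brute force. The sketch below is a naive illustration, not the O(n) algorithm of the paper, and its function names are hypothetical. It rests on one stated assumption: a trailing partial block can always be completed into a block different from all the others (for example with a symbol that does not occur in the string), so only the complete blocks need to be pairwise distinct.

```python
def has_antiperiod(s, t):
    """Check whether s is a prefix of an antipower with antiperiod t.

    Assumption: a trailing partial block can always be extended to a
    block that differs from every complete block, so only the complete
    length-t blocks are compared.
    """
    full = [s[i:i + t] for i in range(0, len(s) - t + 1, t)]
    return len(set(full)) == len(full)


def smallest_antiperiod(s):
    """Return the smallest t for which has_antiperiod(s, t) holds,
    or None for the empty string."""
    for t in range(1, len(s) + 1):
        if has_antiperiod(s, t):
            return t
    return None


t = smallest_antiperiod("abcabc")  # blocks "ab", "ca", "bc" are distinct
```

Each call to `has_antiperiod` reads the whole string, so this scan is quadratic in the worst case.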

Hayam Alamro, Golnaz Badkobeh, Djamal Belazzougui, Costas S. Iliopoulos, and Simon J. Puglisi. Computing the Antiperiod(s) of a String. In 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). Leibniz International Proceedings in Informatics (LIPIcs), Volume 128, pp. 32:1-32:11, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2019)

@InProceedings{alamro_et_al:LIPIcs.CPM.2019.32, author = {Alamro, Hayam and Badkobeh, Golnaz and Belazzougui, Djamal and Iliopoulos, Costas S. and Puglisi, Simon J.}, title = {{Computing the Antiperiod(s) of a String}}, booktitle = {30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)}, pages = {32:1--32:11}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-103-0}, ISSN = {1868-8969}, year = {2019}, volume = {128}, editor = {Pisanti, Nadia and Pissis, Solon P.}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2019.32}, URN = {urn:nbn:de:0030-drops-105035}, doi = {10.4230/LIPIcs.CPM.2019.32}, annote = {Keywords: antiperiod, antipower, power, period, repetition, run, string} }

Document

**Published in:** LIPIcs, Volume 103, 17th International Symposium on Experimental Algorithms (SEA 2018)

Computing the matching statistics of a string S with respect to a string T on an alphabet of size sigma is a fundamental primitive for a number of large-scale string analysis applications, including the comparison of entire genomes, for which space is a pressing issue. This paper takes from theory to practice an existing algorithm that uses just O(|T| log sigma) bits of space, and that computes a compact encoding of the matching statistics array in O(|S| log sigma) time. The techniques used to speed up the algorithm are of general interest, since they optimize queries on the existence of a Weiner link from a node of the suffix tree, and parent operations after unsuccessful Weiner links. Thus, they can be applied to other matching statistics algorithms, as well as to any suffix tree traversal that relies on such calls. Some of our optimizations yield a matching statistics implementation that is up to three times faster than a plain version of the algorithm, depending on the similarity between S and T. In genomic datasets of practical significance we achieve speedups of up to 1.8, but our fastest implementations take on average twice the time of an existing implementation based on the LCP array. The key advantage is that our implementations need between one half and one fifth of the competitor's memory, and they approach comparable running times when S and T are very similar.
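As a reference point for what is being computed, the sketch below evaluates matching statistics by brute force, using the usual definition: MS[i] is the length of the longest prefix of S[i..] that occurs somewhere in T. The function name is hypothetical, and the approach has nothing in common with the BWT-based algorithm of the paper beyond its output; it is only suitable for tiny inputs.

```python
def matching_statistics(s, t):
    """Return the list MS where MS[i] is the length of the longest
    prefix of s[i:] that occurs as a substring of t (brute force)."""
    ms = []
    for i in range(len(s)):
        length = 0
        # Extend the match one character at a time while it still occurs in t.
        while i + length < len(s) and s[i:i + length + 1] in t:
            length += 1
        ms.append(length)
    return ms


ms = matching_statistics("abcab", "cab")
```

Consecutive entries never drop by more than one (MS[i+1] >= MS[i] - 1), a standard property that compact encodings of the array can exploit.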

Djamal Belazzougui, Fabio Cunial, and Olgert Denas. Fast matching statistics in small space. In 17th International Symposium on Experimental Algorithms (SEA 2018). Leibniz International Proceedings in Informatics (LIPIcs), Volume 103, pp. 17:1-17:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2018)

@InProceedings{belazzougui_et_al:LIPIcs.SEA.2018.17, author = {Belazzougui, Djamal and Cunial, Fabio and Denas, Olgert}, title = {{Fast matching statistics in small space}}, booktitle = {17th International Symposium on Experimental Algorithms (SEA 2018)}, pages = {17:1--17:14}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-070-5}, ISSN = {1868-8969}, year = {2018}, volume = {103}, editor = {D'Angelo, Gianlorenzo}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.SEA.2018.17}, URN = {urn:nbn:de:0030-drops-89528}, doi = {10.4230/LIPIcs.SEA.2018.17}, annote = {Keywords: Matching statistics, maximal repeat, Burrows-Wheeler transform, wavelet tree, suffix tree topology} }

Document

**Published in:** LIPIcs, Volume 78, 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017)

Given a string T, it is known that its suffix tree can be represented using the compact directed acyclic word graph (CDAWG) with e_T arcs, taking overall O(e_T+e_REV(T)) words of space, where REV(T) is the reverse of T, and supporting some key operations in time between O(1) and O(log(log(n))) in the worst case. This representation is especially appealing for highly repetitive strings, like collections of similar genomes or of version-controlled documents, in which e_T grows sublinearly in the length of T in practice. In this paper we augment this representation, supporting a number of additional queries in worst-case time between O(1) and O(log(n)) in the RAM model, without increasing space complexity asymptotically. Our technique, based on a heavy path decomposition of the suffix tree, also enables a representation of the suffix array, of the inverse suffix array, and of T itself, that takes O(e_T) words of space, and that supports random access in O(log(n)) time. Furthermore, we establish a connection between the reversed CDAWG of T and a context-free grammar that produces T and only T, which might be of independent interest.

Djamal Belazzougui and Fabio Cunial. Representing the Suffix Tree with the CDAWG. In 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 78, pp. 7:1-7:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)

@InProceedings{belazzougui_et_al:LIPIcs.CPM.2017.7, author = {Belazzougui, Djamal and Cunial, Fabio}, title = {{Representing the Suffix Tree with the CDAWG}}, booktitle = {28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017)}, pages = {7:1--7:13}, series = {Leibniz International Proceedings in Informatics (LIPIcs)}, ISBN = {978-3-95977-039-2}, ISSN = {1868-8969}, year = {2017}, volume = {78}, editor = {K\"{a}rkk\"{a}inen, Juha and Radoszewski, Jakub and Rytter, Wojciech}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2017.7}, URN = {urn:nbn:de:0030-drops-73402}, doi = {10.4230/LIPIcs.CPM.2017.7}, annote = {Keywords: CDAWG, suffix tree, heavy path decomposition, maximal repeat, context-free grammar} }
