11 Search Results for "Kucherov, Gregory"

b-move: Faster Bidirectional Character Extensions in a Run-Length Compressed Index

Authors: Lore Depuydt, Luca Renders, Simon Van de Vyver, Lennart Veys, Travis Gagie, and Jan Fostier

Published in: LIPIcs, Volume 312, 24th International Workshop on Algorithms in Bioinformatics (WABI 2024)

Due to the increasing availability of high-quality genome sequences, pan-genomes are gradually replacing single consensus reference genomes in many bioinformatics pipelines to better capture genetic diversity. Traditional bioinformatics tools using the FM-index face memory limitations with such large genome collections. Recent advancements in run-length compressed indices, like Gagie et al.’s r-index and Nishimoto and Tabei’s move structure, alleviate memory constraints but focus primarily on backward search for MEM-finding. Arakawa et al.’s br-index initiates complete approximate pattern matching using bidirectional search in run-length compressed space, but with significant computational overhead due to complex memory access patterns. We introduce b-move, a novel bidirectional extension of the move structure, enabling fast, cache-efficient bidirectional character extensions in run-length compressed space. It achieves bidirectional character extensions up to 8 times faster than the br-index, closing the performance gap with FM-index-based alternatives, while maintaining the br-index’s favorable memory characteristics. For example, all available complete E. coli genomes on NCBI’s RefSeq collection can be compiled into a b-move index that fits into the RAM of a typical laptop. Thus, b-move proves practical and scalable for pan-genome indexing and querying. We provide a C++ implementation of b-move, supporting efficient lossless approximate pattern matching including locate functionality, available at https://github.com/biointec/b-move under the AGPL-3.0 license.

Cite as

Lore Depuydt, Luca Renders, Simon Van de Vyver, Lennart Veys, Travis Gagie, and Jan Fostier. b-move: Faster Bidirectional Character Extensions in a Run-Length Compressed Index. In 24th International Workshop on Algorithms in Bioinformatics (WABI 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 312, pp. 10:1-10:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

  author =	{Depuydt, Lore and Renders, Luca and Van de Vyver, Simon and Veys, Lennart and Gagie, Travis and Fostier, Jan},
  title =	{{b-move: Faster Bidirectional Character Extensions in a Run-Length Compressed Index}},
  booktitle =	{24th International Workshop on Algorithms in Bioinformatics (WABI 2024)},
  pages =	{10:1--10:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-340-9},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{312},
  editor =	{Pissis, Solon P. and Sung, Wing-Kin},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2024.10},
  URN =		{urn:nbn:de:0030-drops-206546},
  doi =		{10.4230/LIPIcs.WABI.2024.10},
  annote =	{Keywords: Pan-genomics, FM-index, r-index, Move Structure, Bidirectional Search, Approximate Pattern Matching, Lossless Alignment, Cache Efficiency}
Approximate Suffix-Prefix Dictionary Queries

Authors: Wiktor Zuba, Grigorios Loukides, Solon P. Pissis, and Sharma V. Thankachan

Published in: LIPIcs, Volume 306, 49th International Symposium on Mathematical Foundations of Computer Science (MFCS 2024)

In the all-pairs suffix-prefix (APSP) problem [Gusfield et al., Inf. Process. Lett. 1992], we are given a dictionary R of r strings, S₁,…,S_r, of total length n, and we are asked to find the length SPL_{i,j} of the longest string that is both a suffix of S_i and a prefix of S_j, for all i,j ∈ [1..r]. APSP is a classic problem in string algorithms with applications in bioinformatics, especially in sequence assembly. Since r = |R| is typically very large in real-world applications, considering all r² pairs of strings explicitly is prohibitive. This is when the data structure variant of APSP makes sense, in the same spirit as distance oracles computing shortest paths between any two vertices given online. We show how to quickly locate k-approximate matches (under the Hamming or the edit distance) in R using a version of the k-errata tree [Cole et al., STOC 2004] that we introduce. Let SPL^k_{i,j} be the length of the longest suffix of S_i that is at distance at most k from a prefix of S_j. In particular, for any k = 𝒪(1), we show an 𝒪(n log^k n)-sized data structure to support the following queries:
- One-to-One^k(i,j): output SPL^k_{i,j} in 𝒪(log^k n log log n) time.
- Report^k(i,d): output all j ∈ [1..r], such that SPL^k_{i,j} ≥ d, in 𝒪(log^k n (log n/log log n + output)) time, where output denotes the size of the output.
In fact, our algorithms work for any value of k, not just for k = 𝒪(1), but the formulas bounding the complexities get much more complicated for larger values of k.
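To make the definitions of SPL_{i,j} and SPL^k_{i,j} concrete, here is a naive Python sketch that evaluates them by brute force under the Hamming distance. It is only an illustration of the quantities being queried, not the k-errata-tree data structure from the paper; the function names `suffix_prefix_length` and `all_pairs_suffix_prefix` are hypothetical.

```python
def hamming(x: str, y: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(a != b for a, b in zip(x, y))


def suffix_prefix_length(s: str, t: str, k: int = 0) -> int:
    """Length of the longest suffix of s within Hamming distance k of a prefix of t.

    With k = 0 this is the exact SPL value from the abstract.
    Brute force: tries every candidate length, longest first.
    """
    for length in range(min(len(s), len(t)), 0, -1):
        if hamming(s[len(s) - length:], t[:length]) <= k:
            return length
    return 0


def all_pairs_suffix_prefix(strings, k: int = 0):
    """Table of SPL^k values for all ordered pairs (i, j), including i == j."""
    return [[suffix_prefix_length(s, t, k) for t in strings] for s in strings]


if __name__ == "__main__":
    for row in all_pairs_suffix_prefix(["abcab", "cabca", "ab"]):
        print(row)
```

Filling the whole table this way touches all r² pairs, which is exactly the cost that the query-based data structure variant is meant to avoid.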

Cite as

Wiktor Zuba, Grigorios Loukides, Solon P. Pissis, and Sharma V. Thankachan. Approximate Suffix-Prefix Dictionary Queries. In 49th International Symposium on Mathematical Foundations of Computer Science (MFCS 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 306, pp. 85:1-85:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

  author =	{Zuba, Wiktor and Loukides, Grigorios and Pissis, Solon P. and Thankachan, Sharma V.},
  title =	{{Approximate Suffix-Prefix Dictionary Queries}},
  booktitle =	{49th International Symposium on Mathematical Foundations of Computer Science (MFCS 2024)},
  pages =	{85:1--85:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-335-5},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{306},
  editor =	{Kr\'{a}lovi\v{c}, Rastislav and Ku\v{c}era, Anton{\'\i}n},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.MFCS.2024.85},
  URN =		{urn:nbn:de:0030-drops-206416},
  doi =		{10.4230/LIPIcs.MFCS.2024.85},
  annote =	{Keywords: all-pairs suffix-prefix, suffix-prefix queries, suffix tree, k-errata tree}
Distribution-Free Proofs of Proximity

Authors: Hugo Aaronson, Tom Gur, Ninad Rajgopal, and Ron D. Rothblum

Published in: LIPIcs, Volume 300, 39th Computational Complexity Conference (CCC 2024)

Motivated by the fact that input distributions are often unknown in advance, distribution-free property testing considers a setting in which the algorithmic task is to accept functions f: [n] → {0,1} having a certain property Π and reject functions that are ε-far from Π, where the distance is measured according to an arbitrary and unknown input distribution 𝒟 ∼ [n]. As usual in property testing, the tester is required to do so while making only a sublinear number of input queries, but as the distribution is unknown, we also allow a sublinear number of samples from the distribution 𝒟. In this work we initiate the study of distribution-free interactive proofs of proximity (df-IPPs) in which the distribution-free testing algorithm is assisted by an all powerful but untrusted prover. Our main result is that for any problem Π ∈ NC, any proximity parameter ε > 0, and any (trade-off) parameter τ ≤ √n, we construct a df-IPP for Π with respect to ε, that has query and sample complexities τ+O(1/ε), and communication complexity Õ(n/τ + 1/ε). For τ as above and sufficiently large ε (namely, when ε > τ/n), this result matches the parameters of the best-known general purpose IPPs in the standard uniform setting. Moreover, for such τ, its parameters are optimal up to poly-logarithmic factors under reasonable cryptographic assumptions for the same regime of ε as the uniform setting, i.e., when ε ≥ 1/τ. For smaller values of ε (i.e., when ε < τ/n), our protocol has communication complexity Ω(1/ε), which is worse than the Õ(n/τ) communication complexity of the uniform IPPs (with the same query complexity). With the aim of improving on this gap, we further show that for IPPs over specialised, but large distribution families, such as sufficiently smooth distributions and product distributions, the communication complexity can be reduced to Õ(n/τ^{1-o(1)}). 
In addition, we show that for certain natural families of languages, such as symmetric and (relaxed) self-correctable languages, it is possible to further improve the efficiency of distribution-free IPPs.

Cite as

Hugo Aaronson, Tom Gur, Ninad Rajgopal, and Ron D. Rothblum. Distribution-Free Proofs of Proximity. In 39th Computational Complexity Conference (CCC 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 300, pp. 24:1-24:18, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

  author =	{Aaronson, Hugo and Gur, Tom and Rajgopal, Ninad and Rothblum, Ron D.},
  title =	{{Distribution-Free Proofs of Proximity}},
  booktitle =	{39th Computational Complexity Conference (CCC 2024)},
  pages =	{24:1--24:18},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-331-7},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{300},
  editor =	{Santhanam, Rahul},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CCC.2024.24},
  URN =		{urn:nbn:de:0030-drops-204204},
  doi =		{10.4230/LIPIcs.CCC.2024.24},
  annote =	{Keywords: Property Testing, Interactive Proofs, Distribution-Free Property Testing}
Track A: Algorithms, Complexity and Games
Optimal Bounds for Distinct Quartics

Authors: Panagiotis Charalampopoulos, Paweł Gawrychowski, and Samah Ghazawi

Published in: LIPIcs, Volume 297, 51st International Colloquium on Automata, Languages, and Programming (ICALP 2024)

A fundamental concept related to strings is that of repetitions. It has been extensively studied in many versions, from both purely combinatorial and algorithmic angles. One of the most basic questions is how many distinct squares, i.e., distinct strings of the form UU, a string of length n can contain as fragments. It turns out that this is always 𝒪(n), and the bound cannot be improved to sublinear in n [Fraenkel and Simpson, JCTA 1998]. Several similar questions about repetitions in strings have been considered, and by now we seem to have a good understanding of their repetitive structure. For higher-dimensional strings, the basic concept of periodicity has been successfully extended and applied to design efficient algorithms - it is inherently more complex than for regular strings. Extending the notion of repetitions and understanding the repetitive structure of higher-dimensional strings is however far from complete. Quartics were introduced by Apostolico and Brimkov [TCS 2000] as analogues of squares in two dimensions. Charalampopoulos, Radoszewski, Rytter, Waleń, and Zuba [ESA 2020] proved that the number of distinct quartics in an n×n 2D string is 𝒪(n²log²n) and that they can be computed in 𝒪(n²log²n) time. Gawrychowski, Ghazawi, and Landau [SPIRE 2021] constructed an infinite family of n×n 2D strings with Ω(n²log n) distinct quartics. This brings the challenge of determining asymptotically tight bounds. Here, we settle both the combinatorial and the algorithmic aspects of this question: the number of distinct quartics in an n×n 2D string is 𝒪(n²log n) and they can be computed in the worst-case optimal 𝒪(n²log n) time. As expected, our solution heavily exploits the periodic structure implied by occurrences of quartics. However, the two-dimensional nature of the problem introduces some technical challenges. 
Somewhat surprisingly, we overcome the final challenge for the combinatorial bound using a result of Marcus and Tardos [JCTA 2004] for permutation avoidance on matrices.
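As a small illustration of the one-dimensional notion the abstract starts from, the following Python sketch enumerates the distinct squares (fragments of the form UU) of a string by brute force. It does not handle quartics or 2D strings and is unrelated to the paper's algorithm; `distinct_squares` is a hypothetical helper name.

```python
def distinct_squares(s: str) -> set:
    """Return the set of distinct fragments of s that have the form UU."""
    squares = set()
    n = len(s)
    for start in range(n):
        # half is |U|; the square occupies s[start : start + 2 * half]
        for half in range(1, (n - start) // 2 + 1):
            if s[start:start + half] == s[start + half:start + 2 * half]:
                squares.add(s[start:start + 2 * half])
    return squares


if __name__ == "__main__":
    print(sorted(distinct_squares("aabaab")))
```

This runs in cubic time in the worst case and is meant only for checking small examples by hand.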

Cite as

Panagiotis Charalampopoulos, Paweł Gawrychowski, and Samah Ghazawi. Optimal Bounds for Distinct Quartics. In 51st International Colloquium on Automata, Languages, and Programming (ICALP 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 297, pp. 39:1-39:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

  author =	{Charalampopoulos, Panagiotis and Gawrychowski, Pawe{\l} and Ghazawi, Samah},
  title =	{{Optimal Bounds for Distinct Quartics}},
  booktitle =	{51st International Colloquium on Automata, Languages, and Programming (ICALP 2024)},
  pages =	{39:1--39:17},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-322-5},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{297},
  editor =	{Bringmann, Karl and Grohe, Martin and Puppis, Gabriele and Svensson, Ola},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2024.39},
  URN =		{urn:nbn:de:0030-drops-201823},
  doi =		{10.4230/LIPIcs.ICALP.2024.39},
  annote =	{Keywords: 2D strings, quartics, repetitions, periodicity}
Track A: Algorithms, Complexity and Games
Better Space-Time-Robustness Trade-Offs for Set Reconciliation

Authors: Djamal Belazzougui, Gregory Kucherov, and Stefan Walzer

Published in: LIPIcs, Volume 297, 51st International Colloquium on Automata, Languages, and Programming (ICALP 2024)

We consider the problem of reconstructing the symmetric difference between similar sets from their representations (sketches) of size linear in the number of differences. Exact solutions to this problem are based on error-correcting coding techniques and suffer from a large decoding time. Existing probabilistic solutions based on Invertible Bloom Lookup Tables (IBLTs) are time-efficient but offer insufficient success guarantees for many applications. Here we propose a tunable trade-off between the two approaches, combining the efficiency of IBLTs with exponentially decreasing failure probability. The proof relies on a refined analysis of IBLTs proposed in (Bæk Tejs Houen et al. SOSA 2023), which is of independent interest. We also propose a modification of our algorithm that enables telling apart the elements of each set in the symmetric difference.
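For readers unfamiliar with IBLT-based reconciliation, the following is a minimal Python sketch of a textbook-style IBLT over integer keys: each set is sketched independently, the sketches are subtracted cell by cell, and the symmetric difference is recovered by peeling "pure" cells. This is a plain IBLT, not the tunable construction proposed in the paper; the class name, the partitioned layout, and the choice of SHA-256 as hash function are assumptions made for the example.

```python
import hashlib


def _h(x: int, salt: int) -> int:
    """Deterministic 64-bit hash of an integer key under a given salt."""
    digest = hashlib.sha256(f"{salt}:{x}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


class IBLT:
    def __init__(self, cells_per_hash: int, num_hashes: int = 3):
        self.m, self.k = cells_per_hash, num_hashes
        size = self.m * self.k
        self.count = [0] * size      # signed number of keys per cell
        self.key_xor = [0] * size    # XOR of the keys in the cell
        self.chk_xor = [0] * size    # XOR of the key checksums in the cell

    def _cells(self, x: int):
        # One cell in each of the k disjoint sub-tables.
        return [j * self.m + _h(x, j) % self.m for j in range(self.k)]

    def _update(self, x: int, delta: int):
        for c in self._cells(x):
            self.count[c] += delta
            self.key_xor[c] ^= x
            self.chk_xor[c] ^= _h(x, -1)

    def insert(self, x: int):
        self._update(x, 1)

    def subtract(self, other: "IBLT") -> "IBLT":
        diff = IBLT(self.m, self.k)
        for c in range(len(self.count)):
            diff.count[c] = self.count[c] - other.count[c]
            diff.key_xor[c] = self.key_xor[c] ^ other.key_xor[c]
            diff.chk_xor[c] = self.chk_xor[c] ^ other.chk_xor[c]
        return diff

    def decode(self):
        """Peel pure cells; returns (only_in_self, only_in_other, success). Mutates the table."""
        only_self, only_other = set(), set()
        progress = True
        while progress:
            progress = False
            for c in range(len(self.count)):
                pure = self.count[c] in (1, -1) and _h(self.key_xor[c], -1) == self.chk_xor[c]
                if pure:
                    x, sign = self.key_xor[c], self.count[c]
                    (only_self if sign == 1 else only_other).add(x)
                    self._update(x, -sign)  # remove x from all of its cells
                    progress = True
        empty = not any(self.count) and not any(self.key_xor) and not any(self.chk_xor)
        return only_self, only_other, empty


if __name__ == "__main__":
    a, b = IBLT(200), IBLT(200)
    for x in set(range(1000)) | {5001, 5002}:
        a.insert(x)
    for x in set(range(1000)) | {7001}:
        b.insert(x)
    print(a.subtract(b).decode())
```

The table size depends only on the expected number of differences, not on the sizes of the sets. Decoding can fail when the peeling process gets stuck, which is the failure probability the abstract refers to; the example uses a deliberately oversized table so that failure is very unlikely.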

Cite as

Djamal Belazzougui, Gregory Kucherov, and Stefan Walzer. Better Space-Time-Robustness Trade-Offs for Set Reconciliation. In 51st International Colloquium on Automata, Languages, and Programming (ICALP 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 297, pp. 20:1-20:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

  author =	{Belazzougui, Djamal and Kucherov, Gregory and Walzer, Stefan},
  title =	{{Better Space-Time-Robustness Trade-Offs for Set Reconciliation}},
  booktitle =	{51st International Colloquium on Automata, Languages, and Programming (ICALP 2024)},
  pages =	{20:1--20:19},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-322-5},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{297},
  editor =	{Bringmann, Karl and Grohe, Martin and Puppis, Gabriele and Svensson, Ola},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2024.20},
  URN =		{urn:nbn:de:0030-drops-201639},
  doi =		{10.4230/LIPIcs.ICALP.2024.20},
  annote =	{Keywords: data structures, hashing, set reconciliation, invertible Bloom lookup tables, random hypergraphs, BCH codes}
Track A: Algorithms, Complexity and Games
On the Cut-Query Complexity of Approximating Max-Cut

Authors: Orestis Plevrakis, Seyoon Ragavan, and S. Matthew Weinberg

Published in: LIPIcs, Volume 297, 51st International Colloquium on Automata, Languages, and Programming (ICALP 2024)

We consider the problem of query-efficient global max-cut on a weighted undirected graph in the value oracle model examined by [Rubinstein et al., 2018]. Graph algorithms in this cut query model and other query models have recently been studied for various other problems such as min-cut, connectivity, bipartiteness, and triangle detection. Max-cut in the cut query model can also be viewed as a natural special case of submodular function maximization: on query S ⊆ V, the oracle returns the total weight of the cut between S and V\S. Our first main technical result is a lower bound stating that a deterministic algorithm achieving a c-approximation for any c > 1/2 requires Ω(n) queries. This uses an extension of the cut dimension to rule out approximation (prior work of [Graur et al., 2020] introducing the cut dimension only rules out exact solutions). Secondly, we provide a randomized algorithm with Õ(n) queries that finds a c-approximation for any c < 1. We achieve this using a query-efficient sparsifier for undirected weighted graphs (prior work of [Rubinstein et al., 2018] holds only for unweighted graphs). To complement these results, for most constants c ∈ (0,1], we nail down the query complexity of achieving a c-approximation, for both deterministic and randomized algorithms (up to logarithmic factors). Analogously to general submodular function maximization in the same model, we observe a phase transition at c = 1/2: we design a deterministic algorithm for global c-approximate max-cut in O(log n) queries for any c < 1/2, and show that any randomized algorithm requires Ω(n/log n) queries to find a c-approximate max-cut for any c > 1/2. Additionally, we show that any deterministic algorithm requires Ω(n²) queries to find an exact max-cut (enough to learn the entire graph).
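The cut query model can be sketched in a few lines of Python. The example below defines a hypothetical `CutOracle` that answers queries S ⊆ V with the total weight of the cut between S and V\S, and then shows the standard observation that a quadratic number of queries suffices to learn every edge weight, using w(u,v) = (cut({u}) + cut({v}) − cut({u,v}))/2. It illustrates the model only; it is not one of the algorithms or lower bounds from the paper.

```python
from itertools import combinations


class CutOracle:
    """Value oracle for cuts of a weighted undirected graph on vertices 0..n-1."""

    def __init__(self, n: int, weights: dict):
        self.n = n
        self._w = {frozenset(edge): w for edge, w in weights.items()}
        self.queries = 0  # number of cut queries asked so far

    def cut(self, subset) -> float:
        """Total weight of edges with exactly one endpoint in subset."""
        self.queries += 1
        subset = set(subset)
        return sum(w for edge, w in self._w.items() if len(edge & subset) == 1)


def learn_graph(oracle: CutOracle) -> dict:
    """Recover all edge weights with n + n(n-1)/2 cut queries."""
    singles = [oracle.cut({v}) for v in range(oracle.n)]
    learned = {}
    for u, v in combinations(range(oracle.n), 2):
        # Edge (u, v) is counted in both singleton cuts but not in cut({u, v}).
        w = (singles[u] + singles[v] - oracle.cut({u, v})) / 2
        if w:
            learned[(u, v)] = w
    return learned


if __name__ == "__main__":
    oracle = CutOracle(4, {(0, 1): 2, (1, 2): 3, (0, 3): 1.5})
    print(learn_graph(oracle), oracle.queries)
```

Once the graph is known, an exact max-cut can be computed offline without further queries, which is why learning the whole graph is the natural baseline for the query bounds discussed in the abstract.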

Cite as

Orestis Plevrakis, Seyoon Ragavan, and S. Matthew Weinberg. On the Cut-Query Complexity of Approximating Max-Cut. In 51st International Colloquium on Automata, Languages, and Programming (ICALP 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 297, pp. 115:1-115:20, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)

  author =	{Plevrakis, Orestis and Ragavan, Seyoon and Weinberg, S. Matthew},
  title =	{{On the Cut-Query Complexity of Approximating Max-Cut}},
  booktitle =	{51st International Colloquium on Automata, Languages, and Programming (ICALP 2024)},
  pages =	{115:1--115:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-322-5},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{297},
  editor =	{Bringmann, Karl and Grohe, Martin and Puppis, Gabriele and Svensson, Ola},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ICALP.2024.115},
  URN =		{urn:nbn:de:0030-drops-202587},
  doi =		{10.4230/LIPIcs.ICALP.2024.115},
  annote =	{Keywords: query complexity, maximum cut, approximation algorithms, graph sparsification}
Improving the Sensitivity of MinHash Through Hash-Value Analysis

Authors: Gregory Kucherov and Steven Skiena

Published in: LIPIcs, Volume 259, 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)

MinHash sketching is an important algorithm for efficient document retrieval and bioinformatics. We show that the values of the matching MinHash codes convey additional information about the Jaccard similarity of two sets S and T over and above the fact that the MinHash codes agree. This observation holds the potential to increase the sensitivity of MinHash-based retrieval systems. We analyze the expected Jaccard similarity of two sets as a function of observing a matching MinHash value a under a reasonable prior distribution on intersection set sizes, and present a practical approach to using MinHash values to improve the sensitivity of traditional Jaccard similarity estimation, based on the Kolmogorov-Smirnov statistical test for sample distributions. Experiments over a wide range of hash function counts and set similarities show a small but consistent improvement over chance at predicting over/under-estimation, yielding an average accuracy of 61% over the range of experiments.
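As background, the traditional estimator that the abstract seeks to improve can be sketched as follows: each set is summarized by its minimum hash value under several hash functions, and the Jaccard similarity is estimated as the fraction of positions where the two signatures agree. This Python sketch covers only that baseline, not the hash-value analysis or the Kolmogorov-Smirnov refinement from the paper; the function names and the use of seeded BLAKE2b as a hash family are assumptions.

```python
import hashlib


def _hash(item, seed: int) -> int:
    """64-bit hash of an item under a seed (stand-in for a random hash family)."""
    data = f"{seed}:{item}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes: int = 256):
    """Minimum hash value of a non-empty set under each of num_hashes functions."""
    return [min(_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b) -> float:
    """Fraction of hash functions whose minimum values agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    s = set(range(0, 300))
    t = set(range(100, 400))  # true Jaccard similarity is 200 / 400 = 0.5
    print(estimate_jaccard(minhash_signature(s), minhash_signature(t)))
```

The baseline uses only whether the minima agree. The signatures also retain the matching values themselves, which is the extra information the abstract proposes to exploit.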

Cite as

Gregory Kucherov and Steven Skiena. Improving the Sensitivity of MinHash Through Hash-Value Analysis. In 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 259, pp. 20:1-20:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

  author =	{Kucherov, Gregory and Skiena, Steven},
  title =	{{Improving the Sensitivity of MinHash Through Hash-Value Analysis}},
  booktitle =	{34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)},
  pages =	{20:1--20:12},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-276-1},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{259},
  editor =	{Bulteau, Laurent and Lipt\'{a}k, Zsuzsanna},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2023.20},
  URN =		{urn:nbn:de:0030-drops-179740},
  doi =		{10.4230/LIPIcs.CPM.2023.20},
  annote =	{Keywords: MinHash sketching, sequence similarity, hashing}
Efficient Reconciliation of Genomic Datasets of High Similarity

Authors: Yoshihiro Shibuya, Djamal Belazzougui, and Gregory Kucherov

Published in: LIPIcs, Volume 242, 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)

We apply Invertible Bloom Lookup Tables (IBLTs) to the comparison of k-mer sets originating from large DNA sequence datasets. We show that for similar datasets, IBLTs provide a more space-efficient and, at the same time, more accurate method for estimating Jaccard similarity of underlying k-mer sets, compared to MinHash, which is a go-to sketching technique for efficient pairwise similarity estimation. This is achieved by combining IBLTs with k-mer sampling based on syncmers, which constitute a context-independent alternative to minimizers and provide an unbiased estimator of Jaccard similarity. A key property of our method is that the data structures involved require space proportional to the difference of the k-mer sets and are independent of the size of the sets themselves. As another application, we show how our ideas can be applied in order to efficiently compute (an approximation of) k-mers that differ between two datasets, still using space only proportional to their number. We experimentally illustrate our results on both simulated and real data (SARS-CoV-2 and Streptococcus pneumoniae genomes).

Cite as

Yoshihiro Shibuya, Djamal Belazzougui, and Gregory Kucherov. Efficient Reconciliation of Genomic Datasets of High Similarity. In 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 242, pp. 14:1-14:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)

  author =	{Shibuya, Yoshihiro and Belazzougui, Djamal and Kucherov, Gregory},
  title =	{{Efficient Reconciliation of Genomic Datasets of High Similarity}},
  booktitle =	{22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)},
  pages =	{14:1--14:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-243-3},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{242},
  editor =	{Boucher, Christina and Rahmann, Sven},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2022.14},
  URN =		{urn:nbn:de:0030-drops-170481},
  doi =		{10.4230/LIPIcs.WABI.2022.14},
  annote =	{Keywords: k-mers, sketching, Invertible Bloom Lookup Tables, IBLT, MinHash, syncmers, minimizers}
Space-Efficient Representation of Genomic k-Mer Count Tables

Authors: Yoshihiro Shibuya, Djamal Belazzougui, and Gregory Kucherov

Published in: LIPIcs, Volume 201, 21st International Workshop on Algorithms in Bioinformatics (WABI 2021)

Motivation. k-mer counting is a common task in bioinformatic pipelines, with many dedicated tools available. Output formats could rely on quotienting to reduce the space of k-mers in hash tables; however, counts are not usually stored in space-efficient formats. Overall, k-mer count tables for genomic data take considerable space, easily reaching tens of GB. Furthermore, such tables do not support efficient random-access queries in general. Results. In this work, we design an efficient representation of k-mer count tables supporting fast random-access queries. We propose to apply Compressed Static Functions (CSFs), with space proportional to the empirical zero-order entropy of the counts. For very skewed distributions, like those of k-mer counts in whole genomes, the only currently available implementation of CSFs does not provide a compact enough representation. By adding a Bloom filter to a CSF we obtain a Bloom-enhanced CSF (BCSF), effectively overcoming this limitation. Furthermore, by combining BCSFs with minimizer-based bucketing of k-mers, we build even smaller representations breaking the empirical entropy lower bound, for large enough k. We also extend these representations to the approximate case, gaining additional space. We experimentally validate these techniques on k-mer count tables of whole genomes (E. coli and C. elegans) as well as on k-mer document frequency tables for 29 E. coli genomes. In the case of exact counts, our representation takes about half the space of the empirical entropy, for large enough k.
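The quantity the abstract uses as a yardstick, the empirical zero-order entropy of the counts, can be computed directly from a k-mer count table. The Python sketch below counts k-mers in a toy sequence and evaluates that entropy in bits per k-mer. It does not implement CSFs, Bloom filters, or minimizer bucketing, and it ignores reverse complements; the function names are hypothetical.

```python
from collections import Counter
from math import log2


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every length-k substring of the sequence (no canonicalization)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def zero_order_entropy(values) -> float:
    """Empirical zero-order entropy, in bits per value, of a collection of values."""
    freq = Counter(values)
    total = sum(freq.values())
    return -sum((c / total) * log2(c / total) for c in freq.values())


if __name__ == "__main__":
    counts = count_kmers("ACGTACGT", 4)
    print(dict(counts))
    print(zero_order_entropy(counts.values()))
```

For skewed tables, where most k-mers occur once, this entropy is well below one bit per k-mer, which is the regime the abstract describes as hard for existing CSF implementations.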

Cite as

Yoshihiro Shibuya, Djamal Belazzougui, and Gregory Kucherov. Space-Efficient Representation of Genomic k-Mer Count Tables. In 21st International Workshop on Algorithms in Bioinformatics (WABI 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 201, pp. 8:1-8:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

  author =	{Shibuya, Yoshihiro and Belazzougui, Djamal and Kucherov, Gregory},
  title =	{{Space-Efficient Representation of Genomic k-Mer Count Tables}},
  booktitle =	{21st International Workshop on Algorithms in Bioinformatics (WABI 2021)},
  pages =	{8:1--8:19},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-200-6},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{201},
  editor =	{Carbone, Alessandra and El-Kebir, Mohammed},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2021.8},
  URN =		{urn:nbn:de:0030-drops-143619},
  doi =		{10.4230/LIPIcs.WABI.2021.8},
  annote =	{Keywords: k-mer counting, data structures, compression, minimizers, compressed static function, Bloom filter, empirical entropy}
Efficient Tree-Structured Categorical Retrieval

Authors: Djamal Belazzougui and Gregory Kucherov

Published in: LIPIcs, Volume 161, 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)

We study a document retrieval problem in a new framework where D text documents are organized in a category tree with a pre-defined number h of categories. This situation occurs e.g. with taxonomic trees in biology or subject classification systems for scientific literature. Given a string pattern p and a category (level in the category tree), we wish to efficiently retrieve the t categorical units containing this pattern and belonging to the category. We propose several efficient solutions for this problem. One of them uses n(log σ(1+o(1)) + log D + O(h)) + O(Δ) bits of space and O(|p|+t) query time, where n is the total length of the documents, σ is the size of the alphabet used in the documents and Δ is the total number of nodes in the category tree. Another solution uses n(log σ(1+o(1)) + O(log D)) + O(Δ) + O(D log n) bits of space and O(|p| + t log D) query time. We finally propose other solutions which are more space-efficient at the expense of a slight increase in query time.

Cite as

Djamal Belazzougui and Gregory Kucherov. Efficient Tree-Structured Categorical Retrieval. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 4:1-4:11, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

  author =	{Belazzougui, Djamal and Kucherov, Gregory},
  title =	{{Efficient Tree-Structured Categorical Retrieval}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{4:1--4:11},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.4},
  URN =		{urn:nbn:de:0030-drops-121299},
  doi =		{10.4230/LIPIcs.CPM.2020.4},
  annote =	{Keywords: pattern matching, document retrieval, category tree, space-efficient data structures}
Combinatorics and Algorithmics of Strings (Dagstuhl Seminar 14111)

Authors: Maxime Crochemore, James D. Currie, Gregory Kucherov, and Dirk Nowotka

Published in: Dagstuhl Reports, Volume 4, Issue 3 (2014)

Strings (aka sequences or words) form the most basic and natural data structure. They occur whenever information is electronically transmitted (as bit streams), when natural language text is spoken or written down (as words over, for example, the Latin alphabet), in the process of hereditary transmission in living cells (through DNA sequences) or protein synthesis (as sequences of amino acids), and in many other contexts. Given this universal form of representing information, the need to process strings is apparent and is actually a core purpose of computer use. Algorithms to efficiently search through, analyze, (de-)compress, match, encode and decode strings are therefore of chief interest. Combinatorial problems about strings lie at the core of such algorithmic questions. Many such combinatorial problems are common to the string processing efforts in the different fields of application. The purpose of this seminar is to bring together researchers from different disciplines whose interests are string processing algorithms and related combinatorial problems on words. The two main areas of interest for this seminar are Combinatorics on Words and Stringology. This report documents the program and the outcomes of Dagstuhl Seminar 14111 "Combinatorics and Algorithmics of Strings".

Cite as

Maxime Crochemore, James D. Currie, Gregory Kucherov, and Dirk Nowotka. Combinatorics and Algorithmics of Strings (Dagstuhl Seminar 14111). In Dagstuhl Reports, Volume 4, Issue 3, pp. 28-46, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2014)

  author =	{Crochemore, Maxime and Currie, James D. and Kucherov, Gregory and Nowotka, Dirk},
  title =	{{Combinatorics and Algorithmics of Strings (Dagstuhl Seminar 14111)}},
  pages =	{28--46},
  journal =	{Dagstuhl Reports},
  ISSN =	{2192-5283},
  year =	{2014},
  volume =	{4},
  number =	{3},
  editor =	{Crochemore, Maxime and Currie, James D. and Kucherov, Gregory and Nowotka, Dirk},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/DagRep.4.3.28},
  URN =		{urn:nbn:de:0030-drops-45524},
  doi =		{10.4230/DagRep.4.3.28},
  annote =	{Keywords: combinatorics on words, string algorithms, automata}