DROPS

Document

DOI: 10.4230/LIPIcs.CPM.2023.20

Improving the Sensitivity of MinHash Through Hash-Value Analysis

Authors: Gregory Kucherov and Steven Skiena

Published in: LIPIcs, Volume 259, 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)

Abstract

MinHash sketching is an important algorithm for efficient document retrieval and bioinformatics. We show that the value of the matching MinHash codes convey additional information about the Jaccard similarity of S and T over and above the fact that the MinHash codes agree. This observation holds the potential to increase the sensitivity of minhash-based retrieval systems. We analyze the expected Jaccard similarity of two sets as a function of observing a matching MinHash value a under a reasonable prior distribution on intersection set sizes, and present a practical approach to using MinHash values to improve the sensitivity of traditional Jaccard similarity estimation, based on the Kolmogorov-Smirnov statistical test for sample distributions. Experiments over a wide range of hash function counts and set similarities show a small but consistent improvement over chance at predicting over/under-estimation, yielding an average accuracy of 61% over the range of experiments.

Cite as

Gregory Kucherov and Steven Skiena. Improving the Sensitivity of MinHash Through Hash-Value Analysis. In 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 259, pp. 20:1-20:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)

Copy BibTex To Clipboard

@InProceedings{kucherov_et_al:LIPIcs.CPM.2023.20,
  author =	{Kucherov, Gregory and Skiena, Steven},
  title =	{{Improving the Sensitivity of MinHash Through Hash-Value Analysis}},
  booktitle =	{34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)},
  pages =	{20:1--20:12},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-276-1},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{259},
  editor =	{Bulteau, Laurent and Lipt\'{a}k, Zsuzsanna},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2023.20},
  URN =		{urn:nbn:de:0030-drops-179740},
  doi =		{10.4230/LIPIcs.CPM.2023.20},
  annote =	{Keywords: MinHash sketching, sequence similarity, hashing}
}

Document

DOI: 10.4230/LIPIcs.WABI.2022.14

Efficient Reconciliation of Genomic Datasets of High Similarity

Authors: Yoshihiro Shibuya, Djamal Belazzougui, and Gregory Kucherov

Published in: LIPIcs, Volume 242, 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)

Abstract

We apply Invertible Bloom Lookup Tables (IBLTs) to the comparison of k-mer sets originated from large DNA sequence datasets. We show that for similar datasets, IBLTs provide a more space-efficient and, at the same time, more accurate method for estimating Jaccard similarity of underlying k-mer sets, compared to MinHash which is a go-to sketching technique for efficient pairwise similarity estimation. This is achieved by combining IBLTs with k-mer sampling based on syncmers, which constitute a context-independent alternative to minimizers and provide an unbiased estimator of Jaccard similarity. A key property of our method is that involved data structures require space proportional to the difference of k-mer sets and are independent of the size of sets themselves. As another application, we show how our ideas can be applied in order to efficiently compute (an approximation of) k-mers that differ between two datasets, still using space only proportional to their number. We experimentally illustrate our results on both simulated and real data (SARS-CoV-2 and Streptococcus Pneumoniae genomes).

Cite as

Yoshihiro Shibuya, Djamal Belazzougui, and Gregory Kucherov. Efficient Reconciliation of Genomic Datasets of High Similarity. In 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 242, pp. 14:1-14:14, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022)

Copy BibTex To Clipboard

@InProceedings{shibuya_et_al:LIPIcs.WABI.2022.14,
  author =	{Shibuya, Yoshihiro and Belazzougui, Djamal and Kucherov, Gregory},
  title =	{{Efficient Reconciliation of Genomic Datasets of High Similarity}},
  booktitle =	{22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)},
  pages =	{14:1--14:14},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-243-3},
  ISSN =	{1868-8969},
  year =	{2022},
  volume =	{242},
  editor =	{Boucher, Christina and Rahmann, Sven},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2022.14},
  URN =		{urn:nbn:de:0030-drops-170481},
  doi =		{10.4230/LIPIcs.WABI.2022.14},
  annote =	{Keywords: k-mers, sketching, Invertible Bloom Lookup Tables, IBLT, MinHash, syncmers, minimizers}
}

Document

DOI: 10.4230/LIPIcs.WABI.2021.8

Space-Efficient Representation of Genomic k-Mer Count Tables

Authors: Yoshihiro Shibuya, Djamal Belazzougui, and Gregory Kucherov

Published in: LIPIcs, Volume 201, 21st International Workshop on Algorithms in Bioinformatics (WABI 2021)

Abstract

Motivation. k-mer counting is a common task in bioinformatic pipelines, with many dedicated tools available. Output formats could rely on quotienting to reduce the space of k-mers in hash tables, however counts are not usually stored in space-efficient formats. Overall, k-mer count tables for genomic data take a considerable space, easily reaching tens of GB. Furthermore, such tables do not support efficient random-access queries in general. Results. In this work, we design an efficient representation of k-mer count tables supporting fast random-access queries. We propose to apply Compressed Static Functions (CSFs), with space proportional to the empirical zero-order entropy of the counts. For very skewed distributions, like those of k-mer counts in whole genomes, the only currently available implementation of CSFs does not provide a compact enough representation. By adding a Bloom Filter to a CSF we obtain a Bloom-enhanced CSF (BCSF) effectively overcoming this limitation. Furthermore, by combining BCSFs with minimizer-based bucketing of k-mers, we build even smaller representations breaking the empirical entropy lower bound, for large enough k. We also extend these representations to the approximate case, gaining additional space. We experimentally validate these techniques on k-mer count tables of whole genomes (E.Coli and C.Elegans) as well as on k-mer document frequency tables for 29 E.Coli genomes. In the case of exact counts, our representation takes about a half of the space of the empirical entropy, for large enough k’s.

Cite as

Yoshihiro Shibuya, Djamal Belazzougui, and Gregory Kucherov. Space-Efficient Representation of Genomic k-Mer Count Tables. In 21st International Workshop on Algorithms in Bioinformatics (WABI 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 201, pp. 8:1-8:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2021)

Copy BibTex To Clipboard

@InProceedings{shibuya_et_al:LIPIcs.WABI.2021.8,
  author =	{Shibuya, Yoshihiro and Belazzougui, Djamal and Kucherov, Gregory},
  title =	{{Space-Efficient Representation of Genomic k-Mer Count Tables}},
  booktitle =	{21st International Workshop on Algorithms in Bioinformatics (WABI 2021)},
  pages =	{8:1--8:19},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-200-6},
  ISSN =	{1868-8969},
  year =	{2021},
  volume =	{201},
  editor =	{Carbone, Alessandra and El-Kebir, Mohammed},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2021.8},
  URN =		{urn:nbn:de:0030-drops-143619},
  doi =		{10.4230/LIPIcs.WABI.2021.8},
  annote =	{Keywords: k-mer counting, data structures, compression, minimizers, compressed static function, Bloom filter, empirical entropy}
}

Document

DOI: 10.4230/LIPIcs.CPM.2020.4

Efficient Tree-Structured Categorical Retrieval

Authors: Djamal Belazzougui and Gregory Kucherov

Published in: LIPIcs, Volume 161, 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)

Abstract

We study a document retrieval problem in the new framework where D text documents are organized in a category tree with a pre-defined number h of categories. This situation occurs e.g. with taxomonic trees in biology or subject classification systems for scientific literature. Given a string pattern p and a category (level in the category tree), we wish to efficiently retrieve the t categorical units containing this pattern and belonging to the category. We propose several efficient solutions for this problem. One of them uses n(logσ(1+o(1))+log D+O(h)) + O(Δ) bits of space and O(|p|+t) query time, where n is the total length of the documents, σ the size of the alphabet used in the documents and Δ is the total number of nodes in the category tree. Another solution uses n(logσ(1+o(1))+O(log D))+O(Δ)+O(Dlog n) bits of space and O(|p|+tlog D) query time. We finally propose other solutions which are more space-efficient at the expense of a slight increase in query time.

Cite as

Djamal Belazzougui and Gregory Kucherov. Efficient Tree-Structured Categorical Retrieval. In 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020). Leibniz International Proceedings in Informatics (LIPIcs), Volume 161, pp. 4:1-4:11, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2020)

Copy BibTex To Clipboard

@InProceedings{belazzougui_et_al:LIPIcs.CPM.2020.4,
  author =	{Belazzougui, Djamal and Kucherov, Gregory},
  title =	{{Efficient Tree-Structured Categorical Retrieval}},
  booktitle =	{31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
  pages =	{4:1--4:11},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-149-8},
  ISSN =	{1868-8969},
  year =	{2020},
  volume =	{161},
  editor =	{G{\o}rtz, Inge Li and Weimann, Oren},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/LIPIcs.CPM.2020.4},
  URN =		{urn:nbn:de:0030-drops-121299},
  doi =		{10.4230/LIPIcs.CPM.2020.4},
  annote =	{Keywords: pattern matching, document retrieval, category tree, space-efficient data structures}
}

Document

DOI: 10.4230/DagRep.4.3.28

Combinatorics and Algorithmics of Strings (Dagstuhl Seminar 14111)

Authors: Maxime Crochemore, James D. Currie, Gregory Kucherov, and Dirk Nowotka

Published in: Dagstuhl Reports, Volume 4, Issue 3 (2014)

Abstract

Strings (aka sequences or words) form the most basic and natural data structure. They occur whenever information is electronically transmitted (as bit streams), when natural language text is spoken or written down (as words over, for example, the Latin alphabet), in the process of heredity transmission in living cells (through DNA sequences) or the protein synthesis (as sequence of amino acids), and in many more different contexts. Given this universal form of representing information, the need to process strings is apparent and is actually a core purpose of computer use. Algorithms to efficiently search through, analyze, (de-)compress, match, encode and decode strings are therefore of chief interest. Combinatorial problems about strings lie at the core of such algorithmic questions. Many such combinatorial problems are common in the string processing efforts in the different fields of application. The purpose of this seminar is to bring together researchers from different disciplines whose interests are string processing algorithms and related combinatorial problems on words. The two main areas of interest for this seminar are Combinatorics on Words and Stringology. This report documents the program and the outcomes of Dagstuhl Seminar 14111 "Combinatorics and Algorithmics of Strings".

Cite as

Maxime Crochemore, James D. Currie, Gregory Kucherov, and Dirk Nowotka. Combinatorics and Algorithmics of Strings (Dagstuhl Seminar 14111). In Dagstuhl Reports, Volume 4, Issue 3, pp. 28-46, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2014)

Copy BibTex To Clipboard

@Article{crochemore_et_al:DagRep.4.3.28,
  author =	{Crochemore, Maxime and Currie, James D. and Kucherov, Gregory and Nowotka, Dirk},
  title =	{{Combinatorics and Algorithmics of Strings (Dagstuhl Seminar 14111)}},
  pages =	{28--46},
  journal =	{Dagstuhl Reports},
  ISSN =	{2192-5283},
  year =	{2014},
  volume =	{4},
  number =	{3},
  editor =	{Crochemore, Maxime and Currie, James D. and Kucherov, Gregory and Nowotka, Dirk},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops-dev.dagstuhl.de/entities/document/10.4230/DagRep.4.3.28},
  URN =		{urn:nbn:de:0030-drops-45524},
  doi =		{10.4230/DagRep.4.3.28},
  annote =	{Keywords: combinatorics on words, string algorithms, automata}
}

5 Search Results for "Kucherov, Gregory"

Improving the Sensitivity of MinHash Through Hash-Value Analysis

Abstract

Cite as

Efficient Reconciliation of Genomic Datasets of High Similarity

Abstract

Cite as

Space-Efficient Representation of Genomic k-Mer Count Tables

Abstract

Cite as

Efficient Tree-Structured Categorical Retrieval

Abstract

Cite as

Combinatorics and Algorithmics of Strings (Dagstuhl Seminar 14111)

Abstract

Cite as

Thanks for your feedback!

Could not send message