Search Results

Documents authored by Hera, Mahmudur Rahman


Document
Cosine Similarity Estimation Using FracMinHash: Theoretical Analysis, Safety Conditions, and Implementation

Authors: Mahmudur Rahman Hera and David Koslicki

Published in: LIPIcs, Volume 312, 24th International Workshop on Algorithms in Bioinformatics (WABI 2024)


Abstract
Motivation. The increasing number and volume of genomic and metagenomic data necessitates scalable and robust computational models for precise analysis. Sketching techniques utilizing k-mers from a biological sample have proven to be useful for large-scale analyses. In recent years, FracMinHash has emerged as a popular sketching technique and has been used in several useful applications. Recent studies on FracMinHash proved unbiased estimators for the containment and Jaccard indices. However, theoretical investigations for other metrics, such as the cosine similarity, are still lacking. Theoretical contributions. In this paper, we present a theoretical framework for estimating cosine similarity from FracMinHash sketches. We establish conditions under which this estimation is sound, and recommend a minimum scale factor s for accurate results. Experimental evidence supports our theoretical findings. Practical contributions. We also present frac-kmc, a fast and efficient FracMinHash sketch generator program. frac-kmc is the fastest known FracMinHash sketch generator, delivering accurate and precise results for cosine similarity estimation on real data. We show that by computing FracMinHash sketches using frac-kmc, we can estimate pairwise cosine similarity speedily and accurately on real data. frac-kmc is freely available here: https://github.com/KoslickiLab/frac-kmc/.

Cite as

Mahmudur Rahman Hera and David Koslicki. Cosine Similarity Estimation Using FracMinHash: Theoretical Analysis, Safety Conditions, and Implementation. In 24th International Workshop on Algorithms in Bioinformatics (WABI 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 312, pp. 6:1-6:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)


Copy BibTex To Clipboard

@InProceedings{hera_et_al:LIPIcs.WABI.2024.6,
  author =	{Hera, Mahmudur Rahman and Koslicki, David},
  title =	{{Cosine Similarity Estimation Using FracMinHash: Theoretical Analysis, Safety Conditions, and Implementation}},
  booktitle =	{24th International Workshop on Algorithms in Bioinformatics (WABI 2024)},
  pages =	{6:1--6:16},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-340-9},
  ISSN =	{1868-8969},
  year =	{2024},
  volume =	{312},
  editor =	{Pissis, Solon P. and Sung, Wing-Kin},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.WABI.2024.6},
  URN =		{urn:nbn:de:0030-drops-206502},
  doi =		{10.4230/LIPIcs.WABI.2024.6},
  annote =	{Keywords: Hashing, sketching, FracMinHash, Min-Hash, k-mer, similarity, theory}
}
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail