Cosine Similarity Estimation Using FracMinHash: Theoretical Analysis, Safety Conditions, and Implementation

Authors: Mahmudur Rahman Hera, David Koslicki




Author Details

Mahmudur Rahman Hera
  • School of Electrical Engineering and Computer Science, Pennsylvania State University, University Park, PA, USA
David Koslicki
  • School of Electrical Engineering and Computer Science, Pennsylvania State University, University Park, PA, USA
  • Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, PA, USA
  • Department of Biology, Pennsylvania State University, University Park, PA, USA

Acknowledgements

We want to thank Marek Kokot and Sebastian Deorowicz for providing us with directions to navigate through the source code of KMC.

Cite As

Mahmudur Rahman Hera and David Koslicki. Cosine Similarity Estimation Using FracMinHash: Theoretical Analysis, Safety Conditions, and Implementation. In 24th International Workshop on Algorithms in Bioinformatics (WABI 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 312, pp. 6:1-6:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.WABI.2024.6

Abstract

Motivation. The increasing number and volume of genomic and metagenomic datasets necessitate scalable and robust computational models for precise analysis. Sketching techniques utilizing k-mers from a biological sample have proven to be useful for large-scale analyses. In recent years, FracMinHash has emerged as a popular sketching technique and has been used in several useful applications. Recent studies on FracMinHash proved unbiased estimators for the containment and Jaccard indices. However, theoretical investigations for other metrics, such as the cosine similarity, are still lacking.

Theoretical contributions. In this paper, we present a theoretical framework for estimating cosine similarity from FracMinHash sketches. We establish conditions under which this estimation is sound, and recommend a minimum scale factor s for accurate results. Experimental evidence supports our theoretical findings.

Practical contributions. We also present frac-kmc, a fast and efficient FracMinHash sketch generator program. frac-kmc is the fastest known FracMinHash sketch generator, delivering accurate and precise results for cosine similarity estimation on real data. We show that by computing FracMinHash sketches using frac-kmc, we can estimate pairwise cosine similarity quickly and accurately on real data. frac-kmc is freely available at https://github.com/KoslickiLab/frac-kmc/.
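
To make the idea in the abstract concrete, the following is a minimal Python sketch of estimating cosine similarity from FracMinHash sketches. It is an illustration under stated assumptions, not the frac-kmc implementation: the function names (`kmer_hash`, `frac_minhash_sketch`, `cosine_similarity`) are hypothetical, the hash function (BLAKE2b truncated to 64 bits) is an arbitrary choice, the scale factor is assumed to be the fraction of hash space retained (a value in (0, 1]), and canonical k-mers are not handled.

```python
import hashlib
import math
from collections import Counter

HASH_MAX = 2**64  # size of the 64-bit hash space (assumption for this sketch)


def kmer_hash(kmer: str) -> int:
    """Map a k-mer to an integer in [0, HASH_MAX) with a fixed hash function."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_minhash_sketch(sequence: str, k: int, scale: float) -> Counter:
    """Keep only k-mers whose hash falls in the lowest `scale` fraction of the
    hash space, and record how often each retained k-mer occurs."""
    threshold = scale * HASH_MAX
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        h = kmer_hash(sequence[i:i + k])
        if h < threshold:
            counts[h] += 1
    return counts


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity of two abundance vectors stored as hash -> count maps.
    Returns NaN when either sketch is empty, since the ratio is undefined."""
    dot = sum(count * b[h] for h, count in a.items() if h in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return float("nan")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    seq_a = "ACGTACGTTGCAACGTAGCTAGCTAGGATCCA" * 20
    seq_b = "ACGTACGTTGCAACGTAGCTAGCTAGGATGGA" * 20
    exact = cosine_similarity(frac_minhash_sketch(seq_a, 7, 1.0),
                              frac_minhash_sketch(seq_b, 7, 1.0))
    estimate = cosine_similarity(frac_minhash_sketch(seq_a, 7, 0.5),
                                 frac_minhash_sketch(seq_b, 7, 0.5))
    print(f"all k-mers: {exact:.4f}, scale 0.5: {estimate:.4f}")
```

With a scale of 1.0 every k-mer is retained, so the result equals the cosine similarity of the full k-mer abundance vectors; smaller scales trade accuracy for smaller sketches. The NaN branch marks the case where a sketch is empty and no estimate can be formed, which is one reason a sufficiently large scale factor matters.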

Subject Classification

ACM Subject Classification
  • Applied computing → Computational biology
Keywords
  • Hashing
  • sketching
  • FracMinHash
  • Min-Hash
  • k-mer
  • similarity
  • theory
