Improving the Sensitivity of MinHash Through Hash-Value Analysis

Authors Gregory Kucherov , Steven Skiena



PDF
Thumbnail PDF

File

LIPIcs.CPM.2023.20.pdf
  • Filesize: 0.78 MB
  • 12 pages

Document Identifiers

Author Details

Gregory Kucherov
  • LIGM, CNRS/Université Gustave Eiffel, Marne-la-Vallée, France
Steven Skiena
  • Dept. of Computer Science, Stony Brook University, Stony Brook, NY, USA

Cite As Get BibTex

Gregory Kucherov and Steven Skiena. Improving the Sensitivity of MinHash Through Hash-Value Analysis. In 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 259, pp. 20:1-20:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023) https://doi.org/10.4230/LIPIcs.CPM.2023.20

Abstract

MinHash sketching is an important algorithm for efficient document retrieval and bioinformatics. We show that the value of the matching MinHash codes convey additional information about the Jaccard similarity of S and T over and above the fact that the MinHash codes agree. This observation holds the potential to increase the sensitivity of minhash-based retrieval systems. We analyze the expected Jaccard similarity of two sets as a function of observing a matching MinHash value a under a reasonable prior distribution on intersection set sizes, and present a practical approach to using MinHash values to improve the sensitivity of traditional Jaccard similarity estimation, based on the Kolmogorov-Smirnov statistical test for sample distributions. Experiments over a wide range of hash function counts and set similarities show a small but consistent improvement over chance at predicting over/under-estimation, yielding an average accuracy of 61% over the range of experiments.

Subject Classification

ACM Subject Classification
  • Theory of computation → Bloom filters and hashing
Keywords
  • MinHash sketching
  • sequence similarity
  • hashing

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Daniel N. Baker and Ben Langmead. Dashing: fast and accurate genomic distances with HyperLogLog. Genome Biology, 20:265, 2019. URL: https://doi.org/10.1186/s13059-019-1875-0.
  2. Andrei Broder. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of Sequences 1997, SEQUENCES '97, pages 21-, Washington, DC, USA, 1997. IEEE Computer Society. URL: http://dl.acm.org/citation.cfm?id=829502.830043.
  3. Andrei Z. Broder. Identifying and filtering near-duplicate documents. In Combinatorial Pattern Matching, 11th Annual Symposium, CPM 2000, Montreal, Canada, June 21-23, 2000, Proceedings, pages 1-10, 2000. URL: https://doi.org/10.1007/3-540-45123-4_1.
  4. Andrei Z Broder, Moses Charikar, Alan M Frieze, and Michael Mitzenmacher. Min-wise independent permutations. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pages 327-336, 1998. Google Scholar
  5. C. Titus Brown and Luiz Irber. Sourmash: a library for minhash sketching of dna. Journal of Open Source Software, 1(5):27, 2016. URL: https://doi.org/10.21105/joss.00027.
  6. Moses S Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of the thiry-fourth annual ACM symposium on Theory of computing, pages 380-388, 2002. Google Scholar
  7. Lianhua Chi, Bin Li, and Xingquan Zhu. Context-preserving hashing for fast text classification. In Proceedings of the 2014 SIAM International Conference on Data Mining, pages 100-108. SIAM, 2014. Google Scholar
  8. Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, Michael Mitzenmacher, Alessandro Panconesi, and Prabhakar Raghavan. On compressing social networks. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 219-228, 2009. Google Scholar
  9. Edith Cohen. Min-hash sketches: A brief survey, 2016. URL: http://www.cohenwang.com/edith/Surveys/minhash.pdf.
  10. Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the twentieth annual symposium on Computational geometry, pages 253-262, 2004. Google Scholar
  11. Otmar Ertl. Setsketch: Filling the gap between minhash and hyperloglog. Proc. VLDB Endow., 14(11):2244-2257, 2021. URL: https://doi.org/10.14778/3476249.3476276.
  12. Dennis Fetterly, Mark Manasse, Marc Najork, and Janet Wiener. A large-scale study of the evolution of web pages. In Proceedings of the 12th international conference on World Wide Web, pages 669-678, 2003. Google Scholar
  13. Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. Discrete Mathematics & Theoretical Computer Science, DMTCS Proceedings vol. AH, 2007 Conference on Analysis of Algorithms (AofA 07), January 2007. URL: https://doi.org/10.46298/dmtcs.3545.
  14. Aristides Gionis, Piotr Indyk, Rajeev Motwani, et al. Similarity search in high dimensions via hashing. In Vldb, volume 99, pages 518-529, 1999. Google Scholar
  15. Monika Henzinger. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 284-291, 2006. Google Scholar
  16. Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pages 604-613, 1998. Google Scholar
  17. Nitin Jindal and Bing Liu. Opinion spam and analysis. In Proceedings of the 2008 international conference on web search and data mining, pages 219-230, 2008. Google Scholar
  18. Andrey Kolmogorov. Sulla determinazione empirica di una lgge di distribuzione. Inst. Ital. Attuari, Giorn., 4:83-91, 1933. Google Scholar
  19. Ping Li, Anshumali Shrivastava, Joshua Moore, and Arnd König. Hashing algorithms for large-scale learning. Advances in neural information processing systems, 24, 2011. Google Scholar
  20. Brian D. Ondov, Gabriel Starrett, Anna Sappington, Aleksandra Kostic, Sergey Koren, Christopher B. Buck, and Adam M. Phillippy. Mash screen: high-throughput sequence containment estimation for genome discovery. Genome Biology, 20:232, 2019. URL: https://doi.org/10.1186/s13059-019-1841-x.
  21. Brian D Ondov, Todd J Treangen, Páll Melsted, Adam B Mallonee, Nicholas H Bergman, Sergey Koren, and Adam M Phillippy. Mash: fast genome and metagenome distance estimation using minhash. Genome Biology, 17(1):1-14, 2016. Google Scholar
  22. Anshumali Shrivastava and Ping Li. In defense of MinHash over SimHash. In Artificial Intelligence and Statistics, pages 886-894. PMLR, 2014. Google Scholar
  23. Nickolay Smirnov. Table for estimating the goodness of fit of empirical distributions. The annals of mathematical statistics, 19(2):279-281, 1948. Google Scholar
  24. Jingdong Wang, Heng Tao Shen, Jingkuan Song, and Jianqiu Ji. Hashing for similarity search: A survey. CoRR, abs/1408.2927, 2014. URL: https://arxiv.org/abs/1408.2927.
  25. Wei Wu, Bin Li, Ling Chen, Junbin Gao, and Chengqi Zhang. A review for weighted MinHash algorithms. IEEE Transactions on Knowledge and Data Engineering, 34(6):2553-2573, 2020. Google Scholar
  26. Wei Wu, Bin Li, Ling Chen, Junbin Gao, and Chengqi Zhang. A review for weighted minhash algorithms. IEEE Trans. Knowl. Data Eng., 34(6):2553-2573, 2022. URL: https://doi.org/10.1109/TKDE.2020.3021067.
  27. Yun William Yu and Griffin M. Weber. Hyperminhash: Minhash in loglog space. IEEE Trans. Knowl. Data Eng., 34(1):328-339, 2022. URL: https://doi.org/10.1109/TKDE.2020.2981311.
  28. XiaoFei Zhao. Bindash, software for fast genome distance estimation on a typical personal laptop. Bioinformatics, 35(4):671-673, 2019. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail