Improving the Sensitivity of MinHash Through Hash-Value Analysis

Kucherov, Gregory; Skiena, Steven

doi:10.4230/LIPIcs.CPM.2023.20

File

Subject Classification

ACM Subject Classification

Theory of computation → Bloom filters and hashing

Keywords

MinHash sketching
sequence similarity
hashing

Metrics

Access Statistics
Total Accesses (updated on a weekly basis)

0

PDF Downloads

0

Metadata Views

Abstract

MinHash sketching is an important algorithm for efficient document retrieval and bioinformatics. We show that the value of the matching MinHash codes convey additional information about the Jaccard similarity of S and T over and above the fact that the MinHash codes agree. This observation holds the potential to increase the sensitivity of minhash-based retrieval systems. We analyze the expected Jaccard similarity of two sets as a function of observing a matching MinHash value a under a reasonable prior distribution on intersection set sizes, and present a practical approach to using MinHash values to improve the sensitivity of traditional Jaccard similarity estimation, based on the Kolmogorov-Smirnov statistical test for sample distributions. Experiments over a wide range of hash function counts and set similarities show a small but consistent improvement over chance at predicting over/under-estimation, yielding an average accuracy of 61% over the range of experiments.

Cite As Get BibTex

Gregory Kucherov and Steven Skiena. Improving the Sensitivity of MinHash Through Hash-Value Analysis. In 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 259, pp. 20:1-20:12, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023) https://doi.org/10.4230/LIPIcs.CPM.2023.20

Author Details

Gregory Kucherov

LIGM, CNRS/Université Gustave Eiffel, Marne-la-Vallée, France

Steven Skiena

Dept. of Computer Science, Stony Brook University, Stony Brook, NY, USA

References

Daniel N. Baker and Ben Langmead. Dashing: fast and accurate genomic distances with HyperLogLog. Genome Biology, 20:265, 2019. URL: https://doi.org/10.1186/s13059-019-1875-0.
Andrei Broder. On the resemblance and containment of documents. In Proceedings of the Compression and Complexity of Sequences 1997, SEQUENCES '97, pages 21-, Washington, DC, USA, 1997. IEEE Computer Society. URL: http://dl.acm.org/citation.cfm?id=829502.830043.
Andrei Z. Broder. Identifying and filtering near-duplicate documents. In Combinatorial Pattern Matching, 11th Annual Symposium, CPM 2000, Montreal, Canada, June 21-23, 2000, Proceedings, pages 1-10, 2000. URL: https://doi.org/10.1007/3-540-45123-4_1.
Andrei Z Broder, Moses Charikar, Alan M Frieze, and Michael Mitzenmacher. Min-wise independent permutations. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pages 327-336, 1998.
C. Titus Brown and Luiz Irber. Sourmash: a library for minhash sketching of dna. Journal of Open Source Software, 1(5):27, 2016. URL: https://doi.org/10.21105/joss.00027.
Moses S Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of the thiry-fourth annual ACM symposium on Theory of computing, pages 380-388, 2002.
Lianhua Chi, Bin Li, and Xingquan Zhu. Context-preserving hashing for fast text classification. In Proceedings of the 2014 SIAM International Conference on Data Mining, pages 100-108. SIAM, 2014.
Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, Michael Mitzenmacher, Alessandro Panconesi, and Prabhakar Raghavan. On compressing social networks. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 219-228, 2009.
Edith Cohen. Min-hash sketches: A brief survey, 2016. URL: http://www.cohenwang.com/edith/Surveys/minhash.pdf.
Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the twentieth annual symposium on Computational geometry, pages 253-262, 2004.
Otmar Ertl. Setsketch: Filling the gap between minhash and hyperloglog. Proc. VLDB Endow., 14(11):2244-2257, 2021. URL: https://doi.org/10.14778/3476249.3476276.
Dennis Fetterly, Mark Manasse, Marc Najork, and Janet Wiener. A large-scale study of the evolution of web pages. In Proceedings of the 12th international conference on World Wide Web, pages 669-678, 2003.
Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. Discrete Mathematics & Theoretical Computer Science, DMTCS Proceedings vol. AH, 2007 Conference on Analysis of Algorithms (AofA 07), January 2007. URL: https://doi.org/10.46298/dmtcs.3545.
Aristides Gionis, Piotr Indyk, Rajeev Motwani, et al. Similarity search in high dimensions via hashing. In Vldb, volume 99, pages 518-529, 1999.
Monika Henzinger. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 284-291, 2006.
Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pages 604-613, 1998.
Nitin Jindal and Bing Liu. Opinion spam and analysis. In Proceedings of the 2008 international conference on web search and data mining, pages 219-230, 2008.
Andrey Kolmogorov. Sulla determinazione empirica di una lgge di distribuzione. Inst. Ital. Attuari, Giorn., 4:83-91, 1933.
Ping Li, Anshumali Shrivastava, Joshua Moore, and Arnd König. Hashing algorithms for large-scale learning. Advances in neural information processing systems, 24, 2011.
Brian D. Ondov, Gabriel Starrett, Anna Sappington, Aleksandra Kostic, Sergey Koren, Christopher B. Buck, and Adam M. Phillippy. Mash screen: high-throughput sequence containment estimation for genome discovery. Genome Biology, 20:232, 2019. URL: https://doi.org/10.1186/s13059-019-1841-x.
Brian D Ondov, Todd J Treangen, Páll Melsted, Adam B Mallonee, Nicholas H Bergman, Sergey Koren, and Adam M Phillippy. Mash: fast genome and metagenome distance estimation using minhash. Genome Biology, 17(1):1-14, 2016.
Anshumali Shrivastava and Ping Li. In defense of MinHash over SimHash. In Artificial Intelligence and Statistics, pages 886-894. PMLR, 2014.
Nickolay Smirnov. Table for estimating the goodness of fit of empirical distributions. The annals of mathematical statistics, 19(2):279-281, 1948.
Jingdong Wang, Heng Tao Shen, Jingkuan Song, and Jianqiu Ji. Hashing for similarity search: A survey. CoRR, abs/1408.2927, 2014. URL: https://arxiv.org/abs/1408.2927.
Wei Wu, Bin Li, Ling Chen, Junbin Gao, and Chengqi Zhang. A review for weighted MinHash algorithms. IEEE Transactions on Knowledge and Data Engineering, 34(6):2553-2573, 2020.
Wei Wu, Bin Li, Ling Chen, Junbin Gao, and Chengqi Zhang. A review for weighted minhash algorithms. IEEE Trans. Knowl. Data Eng., 34(6):2553-2573, 2022. URL: https://doi.org/10.1109/TKDE.2020.3021067.
Yun William Yu and Griffin M. Weber. Hyperminhash: Minhash in loglog space. IEEE Trans. Knowl. Data Eng., 34(1):328-339, 2022. URL: https://doi.org/10.1109/TKDE.2020.2981311.
XiaoFei Zhao. Bindash, software for fast genome distance estimation on a typical personal laptop. Bioinformatics, 35(4):671-673, 2019.

Improving the Sensitivity of MinHash Through Hash-Value Analysis

Authors Gregory Kucherov , Steven Skiena

File

Document Identifiers

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

References

Thanks for your feedback!

Could not send message

Improving the Sensitivity of MinHash Through Hash-Value Analysis

Authors Gregory Kucherov , Steven Skiena

File

Document Identifiers

Subject Classification

ACM Subject Classification

Keywords

Metrics

Abstract

Cite As Get BibTex

Author Details

Funding

References

Thanks for your feedback!

Could not send message