Large-Scale Similarity Joins With Guarantees (Invited Talk)

Author Rasmus Pagh

Cite As

Rasmus Pagh. Large-Scale Similarity Joins With Guarantees (Invited Talk). In 18th International Conference on Database Theory (ICDT 2015). Leibniz International Proceedings in Informatics (LIPIcs), Volume 31, pp. 15-24, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2015)


The ability to handle noisy or imprecise data is becoming increasingly important in computing. In the database community, the notion of similarity join has been studied extensively, yet existing solutions have offered weak performance guarantees. Either they are based on deterministic filtering techniques that often, but not always, succeed in reducing computational costs, or they are based on randomized techniques that have improved guarantees on computational cost but come with a probability of not returning the correct result. The aim of this paper is to give an overview of randomized techniques for high-dimensional similarity search, and to discuss recent advances towards making these techniques more widely applicable by eliminating the probability of error and improving the locality of data access.
  • Similarity join
  • filtering
  • locality-sensitive hashing
  • recall
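
To make the trade-off named in the abstract concrete, the following is a minimal sketch of a randomized similarity join, assuming Jaccard similarity over non-empty sets of integers and a standard MinHash signature split into bands. It is not the method of the paper: the function names, the parameters `bands` and `rows`, and the linear hash family are illustrative assumptions. A pair with Jaccard similarity s becomes a candidate with probability 1 - (1 - s^rows)^bands, so qualifying pairs can be missed, which is the recall problem the abstract refers to.

```python
import random
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def minhash_signature(s, hash_params, prime):
    """One minimum per hash function h(x) = (a*x + b) mod prime."""
    return tuple(min((a * x + b) % prime for x in s) for a, b in hash_params)


def lsh_similarity_join(sets, threshold, bands=20, rows=3, seed=0):
    """Return index pairs (i, j), i < j, with Jaccard >= threshold.

    Randomized: every reported pair is verified, so there are no false
    positives, but a qualifying pair may be missed (recall below 1).
    """
    rng = random.Random(seed)
    prime = 2_147_483_647  # illustrative choice of modulus
    hash_params = [
        (rng.randrange(1, prime), rng.randrange(prime))
        for _ in range(bands * rows)
    ]
    signatures = [minhash_signature(s, hash_params, prime) for s in sets]

    # Two sets become a candidate pair if they agree on all rows of a band.
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for idx, sig in enumerate(signatures):
            buckets[sig[band * rows:(band + 1) * rows]].append(idx)
        for bucket in buckets.values():
            candidates.update(combinations(bucket, 2))

    # Verification step removes false positives.
    return {
        (i, j) for i, j in candidates
        if jaccard(sets[i], sets[j]) >= threshold
    }


def exact_similarity_join(sets, threshold):
    """Quadratic baseline that always returns the correct result."""
    return {
        (i, j) for i, j in combinations(range(len(sets)), 2)
        if jaccard(sets[i], sets[j]) >= threshold
    }
```

Comparing the output of `lsh_similarity_join` against `exact_similarity_join` on the same input gives the recall of one run. Increasing `bands` raises recall at the cost of more candidate pairs to verify.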

