Large-Scale Similarity Joins With Guarantees (Invited Talk)

Pagh, Rasmus

doi:10.4230/LIPIcs.ICDT.2015.15

Abstract

The ability to handle noisy or imprecise data is becoming increasingly important in computing. In the database community, the notion of similarity join has been studied extensively, yet existing solutions offer only weak performance guarantees. Either they are based on deterministic filtering techniques that often, but not always, succeed in reducing computational costs, or they are based on randomized techniques that have improved guarantees on computational cost but come with a probability of not returning the correct result. The aim of this paper is to give an overview of randomized techniques for high-dimensional similarity search, and to discuss recent advances towards making these techniques more widely applicable by eliminating the probability of error and improving the locality of data access.
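
To make the trade-off described in the abstract concrete, the following is a minimal sketch of a randomized similarity join, assuming Jaccard similarity over sets of integers and the standard MinHash-with-banding construction. It is not taken from the talk itself: the function names, the parameters `bands` and `rows`, and the hash family are illustrative assumptions. Candidate pairs are verified exactly, so no incorrect pair is reported, but a qualifying pair can be missed with some probability. That missed-pair probability is the kind of error the abstract refers to.

```python
import random
from collections import defaultdict
from itertools import combinations

# Modulus for the illustrative hash family h(x) = (a*x + b) mod PRIME.
PRIME = (1 << 61) - 1


def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def minhash_signature(s, coeffs):
    """Return one min-wise hash value per (a, b) pair in coeffs.

    Assumes s is a non-empty set of non-negative integers.
    """
    return tuple(min((a * x + b) % PRIME for x in s) for a, b in coeffs)


def lsh_similarity_join(sets, threshold, bands=20, rows=3, seed=0):
    """Return sorted pairs (i, j), i < j, with Jaccard >= threshold.

    A pair becomes a candidate if its signatures agree on all `rows`
    values of at least one band. Every candidate is verified exactly,
    so there are no false positives; false negatives are possible.
    """
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, PRIME), rng.randrange(PRIME))
              for _ in range(bands * rows)]
    sigs = [minhash_signature(s, coeffs) for s in sets]

    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for i, sig in enumerate(sigs):
            key = sig[band * rows:(band + 1) * rows]
            buckets[key].append(i)
        for ids in buckets.values():
            # ids is in increasing order, so each pair has i < j.
            candidates.update(combinations(ids, 2))

    return sorted((i, j) for i, j in candidates
                  if jaccard(sets[i], sets[j]) >= threshold)


def brute_force_join(sets, threshold):
    """Quadratic baseline that always returns the complete result."""
    return [(i, j) for i, j in combinations(range(len(sets)), 2)
            if jaccard(sets[i], sets[j]) >= threshold]
```

Calling `lsh_similarity_join(sets, 0.6)` on a list of sets returns a subset of what `brute_force_join(sets, 0.6)` returns. Identical sets always share a bucket and are therefore always found, while pairs just above the threshold may be missed unless `bands` is increased.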

