Document Open Access Logo

Dynamic Enumeration of Similarity Joins

Authors Pankaj K. Agarwal, Xiao Hu, Stavros Sintos, Jun Yang



PDF
Thumbnail PDF

File

LIPIcs.ICALP.2021.11.pdf
  • Filesize: 0.94 MB
  • 19 pages

Document Identifiers

Author Details

Pankaj K. Agarwal
  • Duke University, Durham, NC, USA
Xiao Hu
  • Duke University, Durham, NC, USA
Stavros Sintos
  • University of Chicago, IL, USA
Jun Yang
  • Duke University, Durham, NC, USA

Cite AsGet BibTex

Pankaj K. Agarwal, Xiao Hu, Stavros Sintos, and Jun Yang. Dynamic Enumeration of Similarity Joins. In 48th International Colloquium on Automata, Languages, and Programming (ICALP 2021). Leibniz International Proceedings in Informatics (LIPIcs), Volume 198, pp. 11:1-11:19, Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2021)
https://doi.org/10.4230/LIPIcs.ICALP.2021.11

Abstract

This paper considers enumerating answers to similarity-join queries under dynamic updates: Given two sets of n points A,B in ℝ^d, a metric ϕ(⋅), and a distance threshold r > 0, report all pairs of points (a, b) ∈ A × B with ϕ(a,b) ≤ r. Our goal is to store A,B into a dynamic data structure that, whenever asked, can enumerate all result pairs with worst-case delay guarantee, i.e., the time between enumerating two consecutive pairs is bounded. Furthermore, the data structure can be efficiently updated when a point is inserted into or deleted from A or B. We propose several efficient data structures for answering similarity-join queries in low dimension. For exact enumeration of similarity join, we present near-linear-size data structures for 𝓁₁, 𝓁_∞ metrics with log^{O(1)} n update time and delay. We show that such a data structure is not feasible for the 𝓁₂ metric for d ≥ 4. For approximate enumeration of similarity join, where the distance threshold is a soft constraint, we obtain a unified linear-size data structure for 𝓁_p metric, with log^{O(1)} n delay and update time. In high dimensions, we present an efficient data structure with worst-case delay-guarantee using locality sensitive hashing (LSH).

Subject Classification

ACM Subject Classification
  • Theory of computation → Data structures and algorithms for data management
Keywords
  • dynamic enumeration
  • similarity joins
  • worst-case delay guarantee

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Peyman Afshani. Improved pointer machine and I/O lower bounds for simplex range reporting and related problems. In Proc. 28th Annual Symp. Comput. Geom., pages 339-346, 2012. Google Scholar
  2. P. K. Agarwal. Simplex range searching and its variants: A review. In M. Loebl, J. Nešetřil, and R. Thomas, editors, A Journey Through Discrete Mathematics, pages 1-30. Springer, 2017. Google Scholar
  3. P. K. Agarwal and J. Erickson. Geometric range searching and its relatives. In B. Chazelle, J. E. Goodman, and R. Pollack, editors, Contemporary Mathematics, volume 223, pages 1-56. American Math. Society, 1999. Google Scholar
  4. Pankaj K. Agarwal, Xiao Hu, Stavros Sintos, and Jun Yang. Dynamic enumeration of similarity joins. URL: http://arxiv.org/abs/2105.01818.
  5. Pankaj K Agarwal, Junyi Xie, Jun Yang, and Hai Yu. Monitoring continuous band-join queries over dynamic data. In Int. Symp. Algorithms Comput., pages 349-359. Springer, 2005. Google Scholar
  6. Pankaj K Agarwal, Junyi Xie, Jun Yang, and Hai Yu. Scalable continuous query processing by tracking hotspots. In Proc. 32nd Int. Conf. Very Large Databases, pages 31-42, 2006. Google Scholar
  7. Dror Aiger, Haim Kaplan, and Micha Sharir. Reporting neighbors in high-dimensional euclidean space. SIAM J. Comput., 43(4):1363-1395, 2014. Google Scholar
  8. A. Al-Badarneh, A. Al-Abdi, M. Sana’a, and H. Najadat. Survey of similarity join algorithms based on mapreduce. MATTER: International Journal of Science and Technology, 2016. Google Scholar
  9. Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Proc. 47th Annual IEEE Symp. Foundations Comput. Sci., pages 459-468, 2006. Google Scholar
  10. N. Augsten and M. Böhlen. Similarity joins in relational database systems. Synthesis Lectures on Data Management, 5(5):1-124, 2013. Google Scholar
  11. Guillaume Bagan, Arnaud Durand, and Etienne Grandjean. On acyclic conjunctive queries and constant delay enumeration. In Proc. 21th Int. Workshop Computer Science Logic, pages 208-222, 2007. Google Scholar
  12. J. Bentley. Decomposable searching problems. Technical report, CMU, 1978. Google Scholar
  13. J. Bentley and J. Friedman. Data structures for range searching. ACM Comput. Surv., 11(4):397-409, 1979. Google Scholar
  14. Christoph Berkholz, Jens Keppeler, and Nicole Schweikardt. Answering conjunctive queries under updates. In Proc. 36th ACM SIGMOD-SIGACT-SIGAI Symp. Principles Database Systems, pages 303-318, 2017. Google Scholar
  15. P. Callahan. Dealing with Higher dimensions: The Well-Separated Pair Decomposition and Its Applications. PhD thesis, Johns Hopkins University, 1995. Google Scholar
  16. N. Carmeli and M. Kröll. Enumeration complexity of conjunctive queries with functional dependencies. Theory of Computing Systems, pages 1-33, 2019. Google Scholar
  17. T. Chan. Optimal partition trees. Discrete & Comput. Geom., 47(4):661-690, 2012. Google Scholar
  18. Sirish Chandrasekaran and Michael J Franklin. Streaming queries over streaming data. In Proc. 28th Int. Conf. Very Large Databases, pages 203-214, 2002. Google Scholar
  19. S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In Proc. 22nd Int. Conf. Data Eng., pages 5-5, 2006. Google Scholar
  20. B. Chazelle and B. Rosenberg. Simplex range reporting on a pointer machine. Comput. Geom.: Theory Apps., 5(5):237-247, 1996. Google Scholar
  21. Jianjun Chen, David J DeWitt, Feng Tian, and Yuan Wang. Niagaracq: A scalable continuous query system for internet databases. In Proc. 19th ACM SIGMOD Int. Conf. Management Data, pages 379-390, 2000. Google Scholar
  22. Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proc. 20th Annual Symp. Comput. Geom., pages 253-262, 2004. Google Scholar
  23. M. De Berg, M. Van Kreveld, M. Overmars, and O. C. Schwarzkopf. Computational Geometry: Algorithms and Applications. Springer, 3rd edition, 2008. Google Scholar
  24. J. Erickson. Static-to-dynamic transformations. URL: http://jeffe.cs.illinois.edu/teaching/datastructures/notes/01-statictodynamic.pdf.
  25. Françoise Fabret, H Arno Jacobsen, François Llirbat, Joăo Pereira, Kenneth A Ross, and Dennis Shasha. Filtering algorithms and implementation for very fast publish/subscribe systems. In Proc. 20th ACM SIGMOD Int. Conf. Management Data, pages 115-126, 2001. Google Scholar
  26. John Fischer and Sariel Har-Peled. Dynamic well-separated pair decomposition made easy. In Proc. 17th Canadian Conf. Comput. Geom., pages 235-238, 2005. Google Scholar
  27. Aristides Gionis, Piotr Indyk, Rajeev Motwani, et al. Similarity search in high dimensions via hashing. In Proc. 25nd Int. Conf. Very Large Databases, volume 99 (6), pages 518-529, 1999. Google Scholar
  28. S. Har-Peled. Geometric Approximation Algorithms. Number 173 in Mathematical Surveys and Monographs. American Math. Society, 2011. Google Scholar
  29. Sariel Har-Peled, Piotr Indyk, and Rajeev Motwani. Approximate nearest neighbor: Towards removing the curse of dimensionality. Theory of Computing, 8(1):321-350, 2012. Google Scholar
  30. Sariel Har-Peled and Manor Mendel. Fast construction of nets in low-dimensional metrics and their applications. SIAM J. Comput., 35(5):1148-1184, 2006. Google Scholar
  31. Muhammad Idris, Martín Ugarte, and Stijn Vansummeren. The dynamic yannakakis algorithm: Compact and efficient query processing under updates. In Proc. 36th ACM SIGMOD Int. Conf. Management Data, pages 1259-1274, 2017. Google Scholar
  32. Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proc. 30th Annual ACM Symp. Theory of Comput., pages 604-613, 1998. Google Scholar
  33. Edwin H Jacox and Hanan Samet. Metric space similarity joins. ACM Trans. Database Systems, 33(2):1-38, 2008. Google Scholar
  34. Ahmet Kara, Hung Q Ngo, Milos Nikolic, Dan Olteanu, and Haozhe Zhang. Counting triangles under updates in worst-case optimal time. In Proc. 22nd Int. Conf. Database Theory, pages 4:1-4:18, 2019. Google Scholar
  35. Ahmet Kara, Milos Nikolic, Dan Olteanu, and Haozhe Zhang. Trade-offs in static and dynamic evaluation of hierarchical queries. In Proc. 39th ACM SIGMOD-SIGACT-SIGAI Symp. Principles Database Systems, pages 375-392, 2020. Google Scholar
  36. Hans-Peter Lenhof and Michiel Smid. Sequential and parallel algorithms for the k closest pairs problem. Int. J. Comput. Geom. & Appl., 5(03):273-288, 1995. Google Scholar
  37. J. Matoušek. Efficient partition trees. Discrete Comput. Geom., 8(3):315-334, 1992. Google Scholar
  38. M. Overmars and J. van Leeuwen. Worst-case optimal insertion and deletion methods for decomposable searching problems. Inf. Process. Lett., 12(4):168-173, 1981. Google Scholar
  39. Mark H Overmars. The design of dynamic data structures, volume 156. Springer Science & Business Media, 1987. Google Scholar
  40. Rodrigo Paredes and Nora Reyes. Solving similarity joins and range queries in metric spaces with the list of twin clusters. J. Discrete Algorithms, 7(1):18-35, 2009. Google Scholar
  41. H. Samet. Spatial data structures: Quadtree, octrees and other hierarchical methods, 1989. Google Scholar
  42. Luc Segoufin. Enumerating with constant delay the answers to a query. In Proc. 16th Int. Conf. Database Theory, pages 10-20, 2013. Google Scholar
  43. Y. Silva, W. Aref, and M. Ali. The similarity join database operator. In Proc. 26th Int. Conf. Data Engi., pages 892-903, 2010. Google Scholar
  44. Yasin N Silva, Jason Reed, Kyle Brown, Adelbert Wadsworth, and Chuitian Rong. An experimental survey of mapreduce-based similarity joins. In Int. Conf. Similarity Search Appl., pages 181-195. Springer, 2016. Google Scholar
  45. J. Wang, G. Li, and J. Feng. Can we beat the prefix filtering? an adaptive framework for similarity join and search. In Proc. 31th ACM SIGMOD Int. Conf. Management Data, pages 85-96, 2012. Google Scholar
  46. Q. Wang and K. Yi. Maintaining acyclic foreign-key joins under updates. In Proc. 39th ACM SIGMOD Int. Conf. Management Data, pages 1225-1239, 2020. Google Scholar
  47. D. E. Willard. Applications of range query theory to relational data base join and selection operations. J. Comput. Syst. Sci., 52(1):157-169, 1996. Google Scholar
  48. Dan E Willard. Polygon retrieval. SIAM J. Comput., 11(1):149-165, 1982. Google Scholar
  49. Kun-Lung Wu, Shyh-Kwei Chen, and Philip S Yu. Interval query indexing for efficient stream processing. In Proc. 30th ACM Int. Conf. Inform. Knowledge Management, pages 88-97, 2004. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail