Range-Efficient Consistent Sampling and Locality-Sensitive Hashing for Polygons

Authors Joachim Gudmundsson, Rasmus Pagh



PDF
Thumbnail PDF

File

LIPIcs.ISAAC.2017.42.pdf
  • Filesize: 0.52 MB
  • 13 pages

Document Identifiers

Author Details

Joachim Gudmundsson
Rasmus Pagh

Cite AsGet BibTex

Joachim Gudmundsson and Rasmus Pagh. Range-Efficient Consistent Sampling and Locality-Sensitive Hashing for Polygons. In 28th International Symposium on Algorithms and Computation (ISAAC 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 92, pp. 42:1-42:13, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)
https://doi.org/10.4230/LIPIcs.ISAAC.2017.42

Abstract

Locality-sensitive hashing (LSH) is a fundamental technique for similarity search and similarity estimation in high-dimensional spaces. The basic idea is that similar objects should produce hash collisions with probability significantly larger than objects with low similarity. We consider LSH for objects that can be represented as point sets in either one or two dimensions. To make the point sets finite size we consider the subset of points on a grid. Directly applying LSH (e.g. min-wise hashing) to these point sets would require time proportional to the number of points. We seek to achieve time that is much lower than direct approaches. Technically, we introduce new primitives for range-efficient consistent sampling (of independent interest), and show how to turn such samples into LSH values. Another application of our technique is a data structure for quickly estimating the size of the intersection or union of a set of preprocessed polygons. Curiously, our consistent sampling method uses transformation to a geometric problem.
Keywords
  • Locality-sensitive hashing
  • probability distribution
  • polygon
  • min-wise hashing
  • consistent sampling

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. S. Arya and D. M. Mount. Approximate range searching. Computational Geometry - Theory and Applications, 17:135-152, 2000. Google Scholar
  2. Y. Bachrach and E. Porat. Sketching for big data recommender systems using fast pseudo-random fingerprints. In Proceedings of 40th International Colloquium on Automata, Languages, and Programming (ICALP), pages 459-471. Springer, 2013. Google Scholar
  3. J. L. Bentley. Algorithms for Klee’s rectangle problems. Unpublished note, Computer Science Department, Carnegie Mellon University, 1977. Google Scholar
  4. K. Bringmann and T. Friedrich. Approximating the volume of unions and intersections of high-dimensional geometric objects. Computational Geometry - Theory and Applications, 43(6-7):601-610, 2010. Google Scholar
  5. A. Z. Broder. On the resemblance and containment of documents. In Proceedings of International Conference on Compression and Complexity of Sequences (SEQUENCES), pages 21-29. IEEE, 1997. Google Scholar
  6. A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. Computer Networks and ISDN Systems, 29(8):1157-1166, 1997. Google Scholar
  7. J. L. Carter and M. N. Wegman. Universal classes of hash functions. In Proceedings of 9th ACM Symposium on Theory of Computing (STOC), pages 106-112. ACM, 1977. Google Scholar
  8. T. M. Chan. Klee’s measure problem made easy. In Proceedings of 54th IEEE Symposium on Foundations of Computer Science (FOCS), pages 410-419, 2013. Google Scholar
  9. E. Charrier and L. Buzer. Approximating a real number by a rational number with a limited denominator: A geometric approach. Discrete Applied Mathematics, 157:3473-3484, 2009. Google Scholar
  10. E. Cohen. Size-estimation framework with applications to transitive closure and reachability. J. Comp. Syst. Sci., 55(3):441-453, 1997. Google Scholar
  11. E. Cohen and H. Kaplan. Summarizing data using bottom-k sketches. In Proceedings of 26th annual ACM Symposium on Principles of Distributed Computing (PODC), pages 225-234. ACM, 2007. Google Scholar
  12. S. Dahlgaard and M. Thorup. Approximately minwise independence with twisted tabulation. In Proceedings of 14th Scandinavian Symposium and Workshops on Algorithm Theory (SWAT), pages 134-145. Springer International Publishing, 2014. Google Scholar
  13. J. Gudmundsson and R. Pagh. Range-efficient consistent sampling and locality-sensitive hashing for polygons. CoRR, abs/1701.05290, 2017. Google Scholar
  14. B. Haeupler, M. Manasse, and K. Talwar. Consistent weighted sampling made fast, small, and easy. arXiv:1410.4266, 2014. Google Scholar
  15. J. Hershberger. Stable snap rounding. Computational Geometry, 46(4):403-–416, 2013. Google Scholar
  16. P. Indyk. A small approximately min-wise independent family of hash functions. Journal of Algorithms, 38(1):84-90, 2001. Google Scholar
  17. Sergey Ioffe. Improved consistent sampling, weighted minhash and L1 sketching. In Proceedings of 10th IEEE International Conference on Data Mining (ICDM), pages 246-255, 2010. Google Scholar
  18. V. Klee. Can the measure of ∪[a_i, b_i] be computed in less than o(n log n) steps? American Mathematical Monthly, 84:284-285, 1977. Google Scholar
  19. M. H. Overmars and C.-K. Yap. New upper bounds in Klee’s measure problem. SIAM Journal on Computing, 20(6):1034-1045, 1991. Google Scholar
  20. A. Pavan and S. Tirthapura. Range-efficient counting of distinct elements in a massive data stream. SIAM Journal on Computing, 37(2):359-379, 2007. Google Scholar
  21. M. Thorup. Bottom-k and priority sampling, set similarity and subset sums with minimal independence. In Proceedings of 45th ACM Symposium on Theory of Computing (STOC), pages 371-380. ACM, 2013. Google Scholar
  22. S. Tirthapura and D. Woodruff. Rectangle-efficient aggregation in spatial data streams. In Proceedings of 31st Symposium on Principles of Database Systems (PODS), pages 283-294. ACM, 2012. Google Scholar
  23. J. van Leeuwen and D. Wood. The measure problem for rectangular ranges in d-space. Journal of Algorithms, 2(3):282-300, 1981. Google Scholar
  24. N. Y. Zolotykh. On the number of vertices in integer linear programming problems. Technical report, University of Nizhni Novograd, 2000. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail