Understanding the Moments of Tabulation Hashing via Chaoses

Authors Jakob Bæk Tejs Houen, Mikkel Thorup




File

LIPIcs.ICALP.2022.74.pdf
  • Filesize: 0.74 MB
  • 19 pages

Author Details

Jakob Bæk Tejs Houen
  • BARC, Department of Computer Science, University of Copenhagen, Denmark
Mikkel Thorup
  • BARC, Department of Computer Science, University of Copenhagen, Denmark

Acknowledgements

We thank the anonymous reviewers for their careful reading of our manuscript and their many insightful comments and suggestions.

Cite As

Jakob Bæk Tejs Houen and Mikkel Thorup. Understanding the Moments of Tabulation Hashing via Chaoses. In 49th International Colloquium on Automata, Languages, and Programming (ICALP 2022). Leibniz International Proceedings in Informatics (LIPIcs), Volume 229, pp. 74:1-74:19, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2022) https://doi.org/10.4230/LIPIcs.ICALP.2022.74

Abstract

Simple tabulation hashing dates back to Zobrist in 1970 and is defined as follows: Each key is viewed as c characters from some alphabet Σ, we have c fully random hash functions h₀, …, h_{c - 1} : Σ → {0, …, 2^l - 1}, and a key x = (x₀, …, x_{c - 1}) is hashed to h(x) = h₀(x₀) ⊕ … ⊕ h_{c - 1}(x_{c - 1}), where ⊕ is the bitwise XOR operation. The previous results on tabulation hashing by Pǎtraşcu and Thorup [J.ACM'11] and by Aamand et al. [STOC'20] focused on proving Chernoff-style tail bounds on hash-based sums, e.g., the number of keys hashing to a given value, for simple tabulation hashing, but their bounds do not cover the entire tail. Thus their results cannot bound moments. The paper by Dahlgaard et al. [FOCS'15] provides a bound on the moments of certain hash-based sums, but their bound only holds for constant moments, and we need logarithmic moments.
Chaoses are random variables of the form ∑ a_{i₀, …, i_{c - 1}} X_{i₀} ⋅ … ⋅ X_{i_{c - 1}} where X_i are independent random variables. Chaoses are a well-studied concept from probability theory, and tight analyses are known in several instances, e.g., when the independent random variables are standard Gaussian variables and when the independent random variables have logarithmically convex tails. We notice that hash-based sums of simple tabulation hashing can be seen as a sum of chaoses that are not independent. This motivates us to use techniques from the theory of chaoses to analyze hash-based sums of simple tabulation hashing.
In this paper, we obtain bounds for all the moments of hash-based sums for simple tabulation hashing which are tight up to constants depending only on c. In contrast with the previous attempts, our approach will mostly be analytical and does not employ intricate combinatorial arguments. The improved analysis of simple tabulation hashing allows us to obtain bounds for the moments of hash-based sums for the mixed tabulation hashing introduced by Dahlgaard et al. [FOCS'15]. With simple tabulation hashing, there are certain inputs for which the concentration is much worse than with fully random hashing. However, with mixed tabulation, we get logarithmic moment bounds that are only a constant factor worse than those with fully random hashing for any possible input. This is a strong addition to other powerful probabilistic properties of mixed tabulation hashing proved by Dahlgaard et al.
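The definition of simple tabulation hashing given in the abstract can be illustrated with a minimal Python sketch. It is not taken from the paper: the function name `make_simple_tabulation`, the parameters `char_bits` and `out_bits`, and the choice to split an integer key into fixed-width characters are assumptions made for illustration, and Python's seeded pseudorandom generator stands in for the fully random character tables. The last lines compute one hash-based sum of the kind the abstract mentions, the number of keys hashing to a given value.

```python
import random


def make_simple_tabulation(c, char_bits, out_bits, seed=0):
    """Return a simple tabulation hash function h for keys of c characters.

    Assumptions for this sketch: the alphabet is {0, ..., 2^char_bits - 1},
    hash values lie in {0, ..., 2^out_bits - 1}, and a seeded PRNG stands in
    for the "fully random" tables h_0, ..., h_{c-1}.
    """
    rng = random.Random(seed)
    alphabet_size = 1 << char_bits
    # One lookup table per character position.
    tables = [
        [rng.getrandbits(out_bits) for _ in range(alphabet_size)]
        for _ in range(c)
    ]
    mask = alphabet_size - 1

    def h(x):
        value = 0
        for i in range(c):
            # Extract the i-th character x_i of the key (low characters first).
            ch = (x >> (i * char_bits)) & mask
            # XOR the per-character table lookups together.
            value ^= tables[i][ch]
        return value

    return h


# Example: 32-bit keys viewed as c = 4 characters of 8 bits, 16-bit hash values.
h = make_simple_tabulation(c=4, char_bits=8, out_bits=16, seed=1)

# A hash-based sum: the number of keys in a set that hash to a given value.
keys = range(10_000)
target = 0
count = sum(1 for x in keys if h(x) == target)
```

Because each hash value is an XOR of table lookups, four keys that form a "rectangle" in two character positions always XOR to zero. This kind of dependence between keys is one reason simple tabulation needs a dedicated analysis rather than the standard bounds for fully random hashing.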

Subject Classification

ACM Subject Classification
  • Theory of computation → Pseudorandomness and derandomization
Keywords
  • hashing
  • concentration bounds
  • moment bounds

References

  1. Anders Aamand, Debarati Das, Evangelos Kipouridis, Jakob Bæk Tejs Knudsen, Peter M. R. Rasmussen, and Mikkel Thorup. No repetition: Fast streaming with highly concentrated hashing. CoRR, 2020. URL: http://arxiv.org/abs/2004.01156.
  2. Anders Aamand, Jakob Bæk Tejs Knudsen, Mathias Bæk Tejs Knudsen, Peter Michael Reichstein Rasmussen, and Mikkel Thorup. Fast hashing with strong concentration bounds. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020, pages 1265-1278, New York, NY, USA, 2020. Association for Computing Machinery. URL: https://doi.org/10.1145/3357713.3384259.
  3. Anders Aamand, Mathias Bæk Tejs Knudsen, and Mikkel Thorup. Power of d choices with simple tabulation. In Ioannis Chatzigiannakis, Christos Kaklamanis, Dániel Marx, and Donald Sannella, editors, 45th International Colloquium on Automata, Languages, and Programming, ICALP 2018, July 9-13, 2018, Prague, Czech Republic, volume 107 of LIPIcs, pages 5:1-5:14. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2018. URL: https://doi.org/10.4230/LIPIcs.ICALP.2018.5.
  4. Anders Aamand and Mikkel Thorup. Non-empty bins with simple tabulation hashing. In Timothy M. Chan, editor, Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2019, San Diego, California, USA, January 6-9, 2019, pages 2498-2512. SIAM, 2019. URL: https://doi.org/10.1137/1.9781611975482.153.
  5. Radosław Adamczak and Rafał Latała. Tail and moment estimates for chaoses generated by symmetric random variables with logarithmically concave tails. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, 48(4):1103-1136, 2012. URL: https://doi.org/10.1214/11-AIHP441.
  6. George Bennett. Probability inequalities for the sum of independent random variables. Journal of the American Statistical Association, 57(297):33-45, 1962. URL: https://doi.org/10.1080/01621459.1962.10482149.
  7. Vladimir Braverman, Kai-Min Chung, Zhenming Liu, Michael Mitzenmacher, and Rafail Ostrovsky. AMS without 4-wise independence on product domains. In Jean-Yves Marion and Thomas Schwentick, editors, 27th International Symposium on Theoretical Aspects of Computer Science, STACS 2010, March 4-6, 2010, Nancy, France, volume 5 of LIPIcs, pages 119-130. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2010. URL: https://doi.org/10.4230/LIPIcs.STACS.2010.2449.
  8. Herman Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Annals of Mathematical Statistics, 23(4):493-507, 1952.
  9. S. Dahlgaard, M. B. T. Knudsen, E. Rotenberg, and M. Thorup. Hashing for statistics over k-partitions. In 2015 IEEE 56th Annual Symposium on Foundations of Computer Science, pages 1292-1310, 2015. URL: https://doi.org/10.1109/FOCS.2015.83.
  10. Søren Dahlgaard, Mathias Bæk Tejs Knudsen, and Mikkel Thorup. Practical hash functions for similarity estimation and dimensionality reduction. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 6618-6628, USA, 2017. Curran Associates Inc. URL: http://dl.acm.org/citation.cfm?id=3295222.3295407.
  11. Martin Dietzfelbinger and Michael Rink. Applications of a splitting trick. In Proceedings of the 36th ICALP, pages 354-365, 2009.
  12. A. I. Dumey. Indexing for rapid random access memory systems. Computers and Automation, 5(12):6-9, 1956.
  13. Philippe Flajolet and G. Nigel Martin. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182-209, 1985. Announced at FOCS'83.
  14. Philippe Flajolet, Éric Fusy, Olivier Gandouet, et al. Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm. In Analysis of Algorithms (AOFA), 2007.
  15. Stefan Heule, Marc Nunkesser, and Alex Hall. Hyperloglog in practice: Algorithmic engineering of a state of the art cardinality estimation algorithm. In Proceedings of the EDBT 2013 Conference, pages 683-692, 2013.
  16. Konrad Kolesko and Rafał Latała. Moment estimates for chaoses generated by symmetric random variables with logarithmically convex tails. Statistics & Probability Letters, 107:210-214, 2015. URL: https://doi.org/10.1016/j.spl.2015.08.019.
  17. Rafał Latała. Estimates of moments and tails of Gaussian chaoses. The Annals of Probability, 34(6):2315-2331, 2006. URL: https://doi.org/10.1214/009117906000000421.
  18. Joseph Lehec. Moments of the Gaussian Chaos, pages 327-340. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011. URL: https://doi.org/10.1007/978-3-642-15217-7_13.
  19. Ping Li, Art B. Owen, and Cun-Hui Zhang. One permutation hashing. In Proc. 26th NIPS, pages 3122-3130, 2012.
  20. Krzysztof Oleszkiewicz. On a nonsymmetric version of the Khinchine-Kahane inequality. In Stochastic inequalities and applications, pages 157-168. Springer, 2003.
  21. Mihai Patrascu and Mikkel Thorup. Twisted tabulation hashing. In Sanjeev Khanna, editor, Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2013, New Orleans, Louisiana, USA, January 6-8, 2013, pages 209-228. SIAM, 2013. URL: https://doi.org/10.1137/1.9781611973105.16.
  22. Mihai Pǎtraşcu and Mikkel Thorup. The power of simple tabulation hashing. J. ACM, 59(3), June 2012. URL: https://doi.org/10.1145/2220357.2220361.
  23. Anshumali Shrivastava and Ping Li. Densifying one permutation hashing via rotation for fast near neighbor search. In Proc. 31st International Conference on Machine Learning (ICML), pages 557-565, 2014.
  24. Anshumali Shrivastava and Ping Li. Improved densification of one permutation hashing. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, UAI 2014, Quebec City, Quebec, Canada, July 23-27, 2014, pages 732-741, 2014.
  25. Mikkel Thorup. Simple tabulation, fast expanders, double tabulation, and high independence. In 54th Annual Symposium on Foundations of Computer Science (FOCS), pages 90-99, 2013.
  26. Mark N. Wegman and Larry Carter. New classes and applications of hash functions. Journal of Computer and System Sciences, 22(3):265-279, 1981. Announced at FOCS'79.
  27. Albert Lindsey Zobrist. A new hashing method with application for game playing. Technical Report 88, Computer Sciences Department, University of Wisconsin, Madison, Wisconsin, 1970.