Learned Monotone Minimal Perfect Hashing

Authors Paolo Ferragina , Hans-Peter Lehmann , Peter Sanders , Giorgio Vinciguerra



PDF
Thumbnail PDF

File

LIPIcs.ESA.2023.46.pdf
  • Filesize: 0.96 MB
  • 17 pages

Document Identifiers

Author Details

Paolo Ferragina
  • University of Pisa, Italy
Hans-Peter Lehmann
  • Karlsruhe Institute of Technology, Germany
Peter Sanders
  • Karlsruhe Institute of Technology, Germany
Giorgio Vinciguerra
  • University of Pisa, Italy

Acknowledgements

We thank Stefan Walzer for early discussions leading to this paper.

Cite AsGet BibTex

Paolo Ferragina, Hans-Peter Lehmann, Peter Sanders, and Giorgio Vinciguerra. Learned Monotone Minimal Perfect Hashing. In 31st Annual European Symposium on Algorithms (ESA 2023). Leibniz International Proceedings in Informatics (LIPIcs), Volume 274, pp. 46:1-46:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2023)
https://doi.org/10.4230/LIPIcs.ESA.2023.46

Abstract

A Monotone Minimal Perfect Hash Function (MMPHF) constructed on a set S of keys is a function that maps each key in S to its rank. On keys not in S, the function returns an arbitrary value. Applications range from databases, search engines, data encryption, to pattern-matching algorithms. In this paper, we describe LeMonHash, a new technique for constructing MMPHFs for integers. The core idea of LeMonHash is surprisingly simple and effective: we learn a monotone mapping from keys to their rank via an error-bounded piecewise linear model (the PGM-index), and then we solve the collisions that might arise among keys mapping to the same rank estimate by associating small integers with them in a retrieval data structure (BuRR). On synthetic random datasets, LeMonHash needs 34% less space than the next larger competitor, while achieving about 16 times faster queries. On real-world datasets, the space usage is very close to or much better than the best competitors, while achieving up to 19 times faster queries than the next larger competitor. As far as the construction of LeMonHash is concerned, we get an improvement by a factor of up to 2, compared to the competitor with the next best space usage. We also investigate the case of keys being variable-length strings, introducing the so-called LeMonHash-VL: it needs space within 13% of the best competitors while achieving up to 3 times faster queries than the next larger competitor.

Subject Classification

ACM Subject Classification
  • Theory of computation → Data compression
  • Information systems → Point lookups
Keywords
  • compressed data structure
  • monotone minimal perfect hashing
  • retrieval

Metrics

  • Access Statistics
  • Total Accesses (updated on a weekly basis)
    0
    PDF Downloads

References

  1. Sepehr Assadi, Martin Farach-Colton, and William Kuszmaul. Tight bounds for monotone minimal perfect hashing. In Proc. 34th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 456-476, 2023. URL: https://doi.org/10.1137/1.9781611977554.CH20.
  2. Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna. Monotone minimal perfect hashing: searching a sorted table with O(1) accesses. In Proc. 20th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 785-794, 2009. URL: https://doi.org/10.1137/1.9781611973068.86.
  3. Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna. Theory and practice of monotone minimal perfect hashing. ACM J. Exp. Algorithmics, 16, 2011. URL: https://doi.org/10.1145/1963190.2025378.
  4. Djamal Belazzougui, Fabiano C. Botelho, and Martin Dietzfelbinger. Hash, displace, and compress. In Proc. 17th Annual European Symposium on Algorithms (ESA), pages 682-693, 2009. URL: https://doi.org/10.1007/978-3-642-04128-0_61.
  5. Djamal Belazzougui, Fabio Cunial, Juha Kärkkäinen, and Veli Mäkinen. Linear-time string indexing and analysis in small space. ACM Trans. Algorithms, 16(2):17:1-17:54, 2020. URL: https://doi.org/10.1145/3381417.
  6. Djamal Belazzougui, Gonzalo Navarro, and Daniel Valenzuela. Improved compressed indexes for full-text document retrieval. J. Discrete Algorithms, 18:3-13, 2013. URL: https://doi.org/10.1016/j.jda.2012.07.005.
  7. Dominik Bez, Florian Kurpicz, Hans-Peter Lehmann, and Peter Sanders. High Performance Construction of RecSplit Based Minimal Perfect Hash Functions. In 31st Annual European Symposium on Algorithms (ESA 2023), volume 274 of LIPIcs, pages 19:1-19:16, Dagstuhl, Germany, 2023. Schloss Dagstuhl - Leibniz-Zentrum für Informatik. URL: https://doi.org/10.4230/LIPIcs.ESA.2023.19.
  8. Antonio Boffa, Paolo Ferragina, Francesco Tosoni, and Giorgio Vinciguerra. Compressed string dictionaries via data-aware subtrie compaction. In Proc. 29th International Symposium on String Processing and Information Retrieval (SPIRE), pages 233-249, 2022. URL: https://doi.org/10.1007/978-3-031-20643-6_17.
  9. Antonio Boffa, Paolo Ferragina, and Giorgio Vinciguerra. A learned approach to design compressed rank/select data structures. ACM Trans. Algorithms, 18(3):24:1-24:28, 2022. URL: https://doi.org/10.1145/3524060.
  10. Paolo Boldi, Massimo Santini, and Sebastiano Vigna. A large time-aware web graph. SIGIR Forum, 42(2):33-38, 2008. URL: https://doi.org/10.1145/1480506.1480511.
  11. Alexandra Boldyreva, Nathan Chenette, and Adam O'Neill. Order-preserving encryption revisited: Improved security analysis and alternative solutions. In Proc. 31st Annual International Cryptology Conference (CRYPTO), pages 578-595, 2011. URL: https://doi.org/10.1007/978-3-642-22792-9_33.
  12. Jarrod A. Chapman, Isaac Ho, Sirisha Sunkara, Shujun Luo, Gary P. Schroth, and Daniel S. Rokhsar. Meraculous: De novo genome assembly with short paired-end reads. PLOS ONE, 6(8):1-13, August 2011. URL: https://doi.org/10.1371/journal.pone.0023501.
  13. David Richard Clark. Compact Pat Trees. PhD thesis, University of Waterloo, Canada, 1996. Google Scholar
  14. Peter C. Dillinger, Lorenz Hübschle-Schneider, Peter Sanders, and Stefan Walzer. Fast succinct retrieval and approximate membership using ribbon. In Proc. 20th International Symposium on Experimental Algorithms (SEA), pages 4:1-4:20, 2022. URL: https://doi.org/10.4230/LIPICS.SEA.2022.4.
  15. Peter Elias. Efficient storage and retrieval by content and address of static files. J. ACM, 21(2):246-260, 1974. URL: https://doi.org/10.1145/321812.321820.
  16. Emmanuel Esposito, Thomas Mueller Graf, and Sebastiano Vigna. RecSplit: Minimal perfect hashing via recursive splitting. In Proc. 22nd Symposium on Algorithm Engineering and Experiments (ALENEX), pages 175-185, 2020. URL: https://doi.org/10.1137/1.9781611976007.14.
  17. Robert Mario Fano. On the number of bits required to implement an associative memory. Technical report, MIT, Computer Structures Group, 1971. Project MAC, Memorandum 61". Google Scholar
  18. Paolo Ferragina, Roberto Grossi, Ankur Gupta, Rahul Shah, and Jeffrey Scott Vitter. On searching compressed string collections cache-obliviously. In Proc. 27th ACM Symposium on Principles of Database Systems (PODS), pages 181-190, 2008. URL: https://doi.org/10.1145/1376916.1376943.
  19. Paolo Ferragina, Hans-Peter Lehmann, Peter Sanders, and Giorgio Vinciguerra. Learned monotone minimal perfect hashing, 2023. URL: https://doi.org/10.48550/arXiv.2304.11012.
  20. Paolo Ferragina, Fabrizio Lillo, and Giorgio Vinciguerra. On the performance of learned data structures. Theor. Comput. Sci., 871:107-120, 2021. URL: https://doi.org/10.1016/J.TCS.2021.04.015.
  21. Paolo Ferragina, Giovanni Manzini, and Giorgio Vinciguerra. Compressing and querying integer dictionaries under linearities and repetitions. IEEE Access, 10:118831-118848, 2022. URL: https://doi.org/10.1109/ACCESS.2022.3221520.
  22. Paolo Ferragina and Gonzalo Navarro. Pizza&Chili corpus. Accessed: February 2023. URL: http://pizzachili.dcc.uchile.cl/texts.html.
  23. Paolo Ferragina and Giorgio Vinciguerra. Learned data structures. In Luca Oneto, Nicolò Navarin, Alessandro Sperduti, and Davide Anguita, editors, Recent Trends in Learning From Data, pages 5-41. Springer International Publishing, 2020. URL: https://doi.org/10.1007/978-3-030-43883-8_2.
  24. Paolo Ferragina and Giorgio Vinciguerra. The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds. PVLDB, 13(8):1162-1175, 2020. URL: https://doi.org/10.14778/3389133.3389135.
  25. Edward A. Fox, Qi Fan Chen, Amjad M. Daoud, and Lenwood S. Heath. Order-preserving minimal perfect hash functions and information retrieval. ACM Trans. Inf. Syst., 9(3):281-308, 1991. URL: https://doi.org/10.1145/125187.125200.
  26. Edward A. Fox, Qi Fan Chen, and Lenwood S. Heath. A faster algorithm for constructing minimal perfect hash functions. In Proc. 15th Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR), pages 266-273, 1992. URL: https://doi.org/10.1145/133160.133209.
  27. Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM, 67(1):2:1-2:54, 2020. URL: https://doi.org/10.1145/3375890.
  28. LeMonHash - GitHub. https://github.com/ByteHamster/LeMonHash, 2023.
  29. MMPHF-Experiments - GitHub. https://github.com/ByteHamster/MMPHF-Experiments, 2023.
  30. Simon Gog, Timo Beller, Alistair Moffat, and Matthias Petri. From theory to practice: Plug and play with succinct data structures. In Proc. 13th International Symposium on Experimental Algorithms (SEA), pages 326-337, 2014. URL: https://doi.org/10.1007/978-3-319-07959-2_28.
  31. Google. Google ngram exports. Accessed: March 2023. URL: https://storage.googleapis.com/books/ngrams/books/datasetsv3.html.
  32. Roberto Grossi, Alessio Orlandi, and Rajeev Raman. Optimal trade-offs for succinct string indexes. In Proc. 37th International Colloquium on Automata, Languages and Programming (ICALP), pages 678-689, 2010. URL: https://doi.org/10.1007/978-3-642-14165-2_57.
  33. Roberto Grossi and Giuseppe Ottaviano. Fast compressed tries through path decompositions. ACM J. Exp. Algorithmics, 19(1), 2014. URL: https://doi.org/10.1145/2656332.
  34. Guy Jacobson. Space-efficient static trees and graphs. In Proc. 30th IEEE Symposium on Foundations of Computer Science (FOCS), pages 549-554, 1989. URL: https://doi.org/10.1109/SFCS.1989.63533.
  35. Andreas Kipf, Ryan Marcus, Alexander van Renen, Mihail Stoian, Alfons Kemper, Tim Kraska, and Thomas Neumann. SOSD: A benchmark for learned indexes. CoRR, abs/1911.13014, 2019. Google Scholar
  36. Evgenios M. Kornaropoulos, Silei Ren, and Roberto Tamassia. The price of tailoring the index to your data: Poisoning attacks on learned index structures. In Proc. 48th International Conference on Management of Data (SIGMOD), pages 1331-1344, 2022. URL: https://doi.org/10.1145/3514221.3517867.
  37. Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, and Neoklis Polyzotis. The case for learned index structures. In Proc. 44th International Conference on Management of Data (SIGMOD), pages 489-504, 2018. URL: https://doi.org/10.1145/3183713.3196909.
  38. Florian Kurpicz. Engineering compact data structures for rank and select queries on bit vectors. In Proc. 29th International Symposium on String Processing and Information Retrieval (SPIRE), pages 257-272, 2022. URL: https://doi.org/10.1007/978-3-031-20643-6_19.
  39. Florian Kurpicz, Hans-Peter Lehmann, and Peter Sanders. PaCHash: Packed and compressed hash tables. In Proc. 25th Symposium on Algorithm Engineering and Experiments (ALENEX), pages 162-175, 2023. URL: https://doi.org/10.1137/1.9781611977561.ch14.
  40. Hans-Peter Lehmann, Peter Sanders, and Stefan Walzer. SicHash - small irregular cuckoo tables for perfect hashing. In Proc. 25th Symposium on Algorithm Engineering and Experiments (ALENEX), pages 176-189, 2022. URL: https://doi.org/10.1137/1.9781611977561.ch15.
  41. Hyeontaek Lim, Bin Fan, David G. Andersen, and Michael Kaminsky. SILT: a memory-efficient, high-performance key-value store. In Proc. 23rd ACM Symposium on Operating Systems Principles (SOSP), pages 1-13. ACM, 2011. URL: https://doi.org/10.1145/2043556.2043558.
  42. Antoine Limasset, Guillaume Rizk, Rayan Chikhi, and Pierre Peterlongo. Fast and scalable minimal perfect hashing for massive key sets. In Proc. 16th International Symposium on Experimental Algorithms (SEA), pages 25:1-25:16, 2017. URL: https://doi.org/10.4230/LIPICS.SEA.2017.25.
  43. Bohdan S. Majewski, Nicholas C. Wormald, George Havas, and Zbigniew J. Czech. A family of perfect hashing methods. Comput. J., 39(6):547-554, 1996. URL: https://doi.org/10.1093/COMJNL/39.6.547.
  44. Ingo Müller, Peter Sanders, Robert Schulze, and Wei Zhou. Retrieval and perfect hashing using fingerprinting. In Proc. 13th International Symposium on Experimental Algorithms (SEA), pages 138-149, 2014. URL: https://doi.org/10.1007/978-3-319-07959-2_12.
  45. J. Ian Munro and Venkatesh Raman. Succinct representation of balanced parentheses and static trees. SIAM J. Comput., 31(3):762-776, 2001. URL: https://doi.org/10.1137/S0097539799364092.
  46. Gonzalo Navarro. Spaces, trees, and colors: The algorithmic landscape of document retrieval on sequences. ACM Comput. Surv., 46(4):1-47, 2014. URL: https://doi.org/10.1145/2535933.
  47. Gonzalo Navarro. Compact data structures: a practical approach. Cambridge University Press, 2016. Google Scholar
  48. Gonzalo Navarro and Javiel Rojas-Ledesma. Predecessor search. ACM Comput. Surv., 53(5), 2020. URL: https://doi.org/10.1145/3409371.
  49. Giuseppe Ottaviano and Rossano Venturini. Partitioned Elias-Fano indexes. In Proc. 37th International ACM Conference on Research and Development in Information Retrieval (SIGIR), pages 273-282, 2014. URL: https://doi.org/10.1145/2600428.2609615.
  50. Giulio E. Pibiri and Roberto Trani. PTHash: Revisiting FCH minimal perfect hashing. In Proc. 44th International ACM Conference on Research and Development in Information Retrieval (SIGIR), pages 1339-1348, 2021. URL: https://doi.org/10.1145/3404835.3462849.
  51. Ibrahim Sabek, Kapil Vaidya, Dominik Horn, Andreas Kipf, Michael Mitzenmacher, and Tim Kraska. Can learned models replace hash functions? PVLDB, 16(3):532-545, 2022. URL: https://doi.org/10.14778/3570690.3570702.
  52. Sebastiano Vigna. Broadword implementation of rank/select queries. In Proc. 7th International Workshop on Experimental Algorithms (WEA), pages 154-168. Springer, 2008. URL: https://doi.org/10.1007/978-3-540-68552-4_12.
  53. Stefan Walzer. Peeling close to the orientability threshold - spatial coupling in hashing-based data structures. In Proc. 32nd ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2194-2211, 2021. URL: https://doi.org/10.1137/1.9781611976465.131.
  54. Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, 2nd edition, 1999. Google Scholar
Questions / Remarks / Feedback
X

Feedback for Dagstuhl Publishing


Thanks for your feedback!

Feedback submitted

Could not send message

Please try again later or send an E-mail