PHOBIC: Perfect Hashing With Optimized Bucket Sizes and Interleaved Coding

Authors: Stefan Hermann, Hans-Peter Lehmann, Giulio Ermanno Pibiri, Peter Sanders, Stefan Walzer



File

LIPIcs.ESA.2024.69.pdf
  • Filesize: 1.1 MB
  • 17 pages

Author Details

Stefan Hermann
  • Karlsruhe Institute of Technology, Germany
Hans-Peter Lehmann
  • Karlsruhe Institute of Technology, Germany
Giulio Ermanno Pibiri
  • Ca' Foscari University of Venice, Italy
  • ISTI-CNR, Pisa, Italy
Peter Sanders
  • Karlsruhe Institute of Technology, Germany
Stefan Walzer
  • Karlsruhe Institute of Technology, Germany

Acknowledgements

This paper is based on the Master’s thesis [Hermann, 2023] of Stefan Hermann, which contains a more detailed evaluation and description of the GPU implementation.

Cite As

Stefan Hermann, Hans-Peter Lehmann, Giulio Ermanno Pibiri, Peter Sanders, and Stefan Walzer. PHOBIC: Perfect Hashing With Optimized Bucket Sizes and Interleaved Coding. In 32nd Annual European Symposium on Algorithms (ESA 2024). Leibniz International Proceedings in Informatics (LIPIcs), Volume 308, pp. 69:1-69:17, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2024)
https://doi.org/10.4230/LIPIcs.ESA.2024.69

Abstract

A minimal perfect hash function (or MPHF) maps a set of n keys to [n] := {1, …, n} without collisions. Such functions find widespread application, e.g., in bioinformatics and databases. In this paper we revisit PTHash, a construction technique particularly designed for fast queries. PTHash distributes the input keys into small buckets and, for each bucket, searches for a hash function seed that places its keys in the output domain without collisions. The collection of all seeds is then stored in a compressed way. Since the first buckets are easier to place, buckets are considered in non-increasing order of size. Additionally, PTHash heuristically produces an imbalanced distribution of bucket sizes by distributing 60% of the keys into 30% of the buckets. Our main contribution is to characterize, up to lower order terms, an optimal choice for the expected bucket sizes, improving construction throughput for space-efficient configurations both in theory and in practice. Further contributions include a new encoding scheme for seeds that works across partitions of the data structure and a GPU parallelization. Compared to PTHash, PHOBIC is 0.17 bits/key more space efficient for the same query time and construction throughput. For a configuration with fast queries, our GPU implementation can construct an MPHF at 2.17 bits/key in 28 ns/key, which can be queried in 37 ns/query on the CPU.
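
The bucket-and-seed idea described in the abstract can be illustrated with a minimal Python sketch. It is not the PTHash or PHOBIC implementation: it assumes distinct string keys, assigns keys to buckets uniformly (no skewed or optimized bucket sizes), stores the seeds uncompressed, and maps to the 0-based range {0, …, n-1}. The names `_h`, `build`, `query`, and `avg_bucket_size` are hypothetical and chosen only for this illustration.

```python
import hashlib


def _h(key: str, seed: int) -> int:
    """Deterministic 64-bit hash of (seed, key); blake2b is an arbitrary choice."""
    data = f"{seed}:{key}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "little")


def build(keys, avg_bucket_size=4):
    """Return one seed per bucket so that all keys land on distinct slots.

    Assumes the keys are distinct; duplicates would make the search loop forever.
    """
    n = len(keys)
    num_buckets = max(1, n // avg_bucket_size)

    # Step 1: distribute keys into small buckets (seed 0 is reserved for this).
    buckets = [[] for _ in range(num_buckets)]
    for k in keys:
        buckets[_h(k, 0) % num_buckets].append(k)

    # Step 2: process buckets in non-increasing order of size, because large
    # buckets are easiest to place while most output slots are still free.
    order = sorted(range(num_buckets), key=lambda b: -len(buckets[b]))

    taken = [False] * n
    seeds = [0] * num_buckets
    for b in order:
        if not buckets[b]:
            continue  # empty bucket: its seed is never queried
        seed = 1
        while True:
            slots = [_h(k, seed) % n for k in buckets[b]]
            # Accept the seed only if the slots are pairwise distinct and free.
            if len(set(slots)) == len(slots) and not any(taken[s] for s in slots):
                break
            seed += 1
        for s in slots:
            taken[s] = True
        seeds[b] = seed
    return seeds


def query(key: str, seeds, n: int) -> int:
    """Evaluate the function: look up the bucket's seed, then hash with it."""
    b = _h(key, 0) % len(seeds)
    return _h(key, seeds[b]) % n


if __name__ == "__main__":
    keys = [f"key{i}" for i in range(100)]
    seeds = build(keys)
    positions = sorted(query(k, seeds, len(keys)) for k in keys)
    assert positions == list(range(len(keys)))  # bijection onto {0, ..., n-1}
```

In this sketch the seeds found for late, small buckets tend to be large because few free slots remain, which hints at why the choice of bucket sizes and the encoding of the seeds affect both construction time and space.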

Subject Classification

ACM Subject Classification
  • Theory of computation → Data compression
  • Information systems → Point lookups
Keywords
  • Compressed Data Structures
  • Minimal Perfect Hashing
  • GPU

References

  1. Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna. Fast prefix search in little space, with applications. In ESA (1), volume 6346 of Lecture Notes in Computer Science, pages 427-438. Springer, 2010. URL: https://doi.org/10.1007/978-3-642-15775-2_37.
  2. Djamal Belazzougui, Fabiano C. Botelho, and Martin Dietzfelbinger. Hash, displace, and compress. In ESA, volume 5757 of Lecture Notes in Computer Science, pages 682-693. Springer, 2009. URL: https://doi.org/10.1007/978-3-642-04128-0_61.
  3. Djamal Belazzougui and Gonzalo Navarro. Alphabet-independent compressed text indexing. ACM Trans. Algorithms, 10(4):23:1-23:19, 2014. URL: https://doi.org/10.1145/2635816.
  4. Piotr Beling. Fingerprinting-based minimal perfect hashing revisited. ACM J. Exp. Algorithmics, 28:1.4:1-1.4:16, 2023. URL: https://doi.org/10.1145/3596453.
  5. Dominik Bez, Florian Kurpicz, Hans-Peter Lehmann, and Peter Sanders. High performance construction of recsplit based minimal perfect hash functions. In ESA, volume 274 of LIPIcs, pages 19:1-19:16. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2023. URL: https://doi.org/10.4230/LIPICS.ESA.2023.19.
  6. Fabiano C. Botelho, Rasmus Pagh, and Nivio Ziviani. Simple and space-efficient minimal perfect hash functions. In WADS, volume 4619 of Lecture Notes in Computer Science, pages 139-150. Springer, 2007. URL: https://doi.org/10.1007/978-3-540-73951-7_13.
  7. Andrei Z. Broder and Michael Mitzenmacher. Survey: Network applications of Bloom filters: A survey. Internet Math., 1(4):485-509, 2003. URL: https://doi.org/10.1080/15427951.2004.10129096.
  8. Chin-Chen Chang and Chih-Yang Lin. Perfect hashing schemes for mining association rules. Comput. J., 48(2):168-179, 2005. URL: https://doi.org/10.1093/COMJNL/BXH074.
  9. Jarrod A. Chapman, Isaac Ho, Sirisha Sunkara, Shujun Luo, Gary P. Schroth, and Daniel S. Rokhsar. Meraculous: De novo genome assembly with short paired-end reads. PLOS ONE, 6(8):1-13, August 2011. URL: https://doi.org/10.1371/journal.pone.0023501.
  10. Yann Collet. xxHash: Extremely fast non-cryptographic hash algorithm. URL: https://github.com/Cyan4973/xxHash.
  11. Victoria G. Crawford, Alan Kuhnle, Christina Boucher, Rayan Chikhi, and Travis Gagie. Practical dynamic de bruijn graphs. Bioinform., 34(24):4189-4195, 2018. URL: https://doi.org/10.1093/BIOINFORMATICS/BTY500.
  12. Emmanuel Esposito, Thomas Mueller Graf, and Sebastiano Vigna. RecSplit: Minimal perfect hashing via recursive splitting. In ALENEX, pages 175-185. SIAM, 2020. URL: https://doi.org/10.1137/1.9781611976007.14.
  13. Edward A. Fox, Qi Fan Chen, and Lenwood S. Heath. A faster algorithm for constructing minimal perfect hash functions. In SIGIR, pages 266-273. ACM, 1992. URL: https://doi.org/10.1145/133160.133209.
  14. Marco Genuzio, Giuseppe Ottaviano, and Sebastiano Vigna. Fast scalable construction of (minimal perfect hash) functions. In SEA, volume 9685 of Lecture Notes in Computer Science, pages 339-352. Springer, 2016. URL: https://doi.org/10.1007/978-3-319-38851-9_23.
  15. Solomon W. Golomb. Run-length encodings (corresp.). IEEE Trans. Inf. Theory, 12(3):399-401, 1966. URL: https://doi.org/10.1109/TIT.1966.1053907.
  16. Stefan Hermann. Accelerating minimal perfect hash function construction using gpu parallelization. Master’s thesis, Karlsruhe Institute for Technology (KIT), 2023. URL: https://doi.org/10.5445/IR/1000164413.
  17. Stefan Hermann, Hans-Peter Lehmann, Giulio Ermanno Pibiri, Peter Sanders, and Stefan Walzer. PHOBIC: perfect hashing with optimized bucket sizes and interleaved coding. CoRR, abs/2404.18497, 2024. URL: https://doi.org/10.48550/arXiv.2404.18497.
  18. Aaron Kiely. Selecting the Golomb parameter in Rice coding. IPN progress report, 42:159, 2004.
  19. Hans-Peter Lehmann, Peter Sanders, and Stefan Walzer. Shockhash: Near optimal-space minimal perfect hashing beyond brute-force (extended version). CoRR, abs/2310.14959, 2023. URL: https://doi.org/10.48550/arXiv.2310.14959.
  20. Hans-Peter Lehmann, Peter Sanders, and Stefan Walzer. SicHash - small irregular cuckoo tables for perfect hashing. In ALENEX, pages 176-189. SIAM, 2023. URL: https://doi.org/10.1137/1.9781611977561.CH15.
  21. Hans-Peter Lehmann, Peter Sanders, and Stefan Walzer. Shockhash: Towards optimal-space minimal perfect hashing beyond brute-force. In ALENEX. SIAM, 2024. URL: https://doi.org/10.1137/1.9781611977929.15.
  22. Antoine Limasset, Guillaume Rizk, Rayan Chikhi, and Pierre Peterlongo. Fast and scalable minimal perfect hashing for massive key sets. In SEA, volume 75 of LIPIcs, pages 25:1-25:16. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2017. URL: https://doi.org/10.4230/LIPICS.SEA.2017.25.
  23. Yi Lu, Balaji Prabhakar, and Flavio Bonomi. Perfect hashing for network applications. In ISIT, pages 2774-2778. IEEE, 2006. URL: https://doi.org/10.1109/ISIT.2006.261567.
  24. Kurt Mehlhorn. On the program size of perfect and universal hash functions. In FOCS, pages 170-175. IEEE Computer Society, 1982. URL: https://doi.org/10.1109/SFCS.1982.80.
  25. Ingo Müller, Peter Sanders, Robert Schulze, and Wei Zhou. Retrieval and perfect hashing using fingerprinting. In SEA, volume 8504 of Lecture Notes in Computer Science, pages 138-149. Springer, 2014. URL: https://doi.org/10.1007/978-3-319-07959-2_12.
  26. Giulio Ermanno Pibiri. Sparse and skew hashing of k-mers. Bioinformatics, 38(Supplement_1):i185-i194, 2022.
  27. Giulio Ermanno Pibiri and Roberto Trani. Pthash: Revisiting FCH minimal perfect hashing. In SIGIR, pages 1339-1348. ACM, 2021. URL: https://doi.org/10.1145/3404835.3462849.
  28. Giulio Ermanno Pibiri and Roberto Trani. Parallel and external-memory construction of minimal perfect hash functions with pthash. IEEE Trans. Knowl. Data Eng., 36(3):1249-1259, 2024. URL: https://doi.org/10.1109/TKDE.2023.3303341.
  29. Giulio Ermanno Pibiri and Rossano Venturini. Efficient data structures for massive N-gram datasets. In SIGIR, pages 615-624. ACM, 2017. URL: https://doi.org/10.1145/3077136.3080798.
  30. Robert F Rice. Some practical universal noiseless coding techniques, 1979.
  31. Peter Sanders. Emulating MIMD behaviour on SIMD-machines. In EUROSIM, pages 313-320. Elsevier, 1994.