Fast and Scalable Minimal Perfect Hashing for Massive Key Sets

Limasset, Antoine; Rizk, Guillaume; Chikhi, Rayan; Peterlongo, Pierre

doi:10.4230/LIPIcs.SEA.2017.25

File

LIPIcs.SEA.2017.25.pdf

Filesize: 0.67 MB
16 pages

Document Identifiers

URN: urn:nbn:de:0030-drops-76196

Cite As

Antoine Limasset, Guillaume Rizk, Rayan Chikhi, and Pierre Peterlongo. Fast and Scalable Minimal Perfect Hashing for Massive Key Sets. In 16th International Symposium on Experimental Algorithms (SEA 2017). Leibniz International Proceedings in Informatics (LIPIcs), Volume 75, pp. 25:1-25:16, Schloss Dagstuhl – Leibniz-Zentrum für Informatik (2017)
https://doi.org/10.4230/LIPIcs.SEA.2017.25

Abstract

Minimal perfect hash functions provide space-efficient and collision-free hashing on static sets. Existing algorithms and implementations that build such functions have practical limitations on the number of input elements they can process, due to high construction time, RAM or external memory usage. We revisit a simple algorithm and show that it is highly competitive with the state of the art, especially in terms of construction time and memory usage. We provide a parallel C++ implementation called BBhash. It is capable of creating a minimal perfect hash function of 10^{10} elements in less than 7 minutes using 8 threads and 5 GB of memory, and the resulting function uses 3.7 bits/element. To the best of our knowledge, this is also the first implementation that has been successfully tested on an input of cardinality 10^{12}. Source code: https://github.com/rizkg/BBHash
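To make the central notion concrete: a minimal perfect hash function maps a static set of n keys one-to-one onto the integers 0..n-1. The abstract does not describe the algorithm itself, so the Python sketch below is only an illustration under stated assumptions. It uses a simple cascade of collision-filtered bit arrays, which is one well-known way to obtain such a function. The names `build`, `lookup`, and `_slot`, the `gamma` parameter, and the use of BLAKE2b are all hypothetical choices made here; none of this is BBhash code, and it makes no attempt at the space or speed figures reported above.

```python
import hashlib


def _slot(key: str, level: int, size: int) -> int:
    """Hash `key` with a per-level salt and reduce it to a slot in [0, size)."""
    digest = hashlib.blake2b(
        key.encode("utf-8"),
        digest_size=8,
        salt=level.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little") % size


def build(keys, gamma=2.0, max_levels=32):
    """Build a toy minimal perfect hash function over distinct string keys.

    At each level the remaining keys are hashed into `gamma * len(remaining)`
    slots. A key that lands alone in its slot is assigned the next free index;
    keys that collide are retried at the next level with a different salt.
    """
    remaining = list(keys)
    levels = []   # one (size, bits, ranks) tuple per level
    offset = 0    # next unused output index
    for level in range(max_levels):
        if not remaining:
            break
        size = max(1, int(gamma * len(remaining)))
        counts = [0] * size
        for key in remaining:
            counts[_slot(key, level, size)] += 1
        bits = [False] * size
        # Illustration only: a compact implementation would store the bits
        # plus a rank structure rather than one full integer per slot.
        ranks = [0] * size
        for pos, count in enumerate(counts):
            if count == 1:
                bits[pos] = True
                ranks[pos] = offset
                offset += 1
        levels.append((size, bits, ranks))
        remaining = [k for k in remaining if counts[_slot(k, level, size)] != 1]
    # Any keys still colliding after max_levels go into an explicit table.
    fallback = {key: offset + i for i, key in enumerate(remaining)}
    return levels, fallback


def lookup(mphf, key: str) -> int:
    """Return the index of `key`; only meaningful for keys given to build()."""
    levels, fallback = mphf
    for level, (size, bits, ranks) in enumerate(levels):
        pos = _slot(key, level, size)
        if bits[pos]:
            return ranks[pos]
    return fallback[key]


# Usage: every key receives a distinct index in 0..n-1.
keys = [f"key-{i}" for i in range(1000)]
mphf = build(keys)
assert sorted(lookup(mphf, k) for k in keys) == list(range(len(keys)))
```

As with any minimal perfect hash function, the structure is defined only for the static key set it was built from: looking up a key outside that set returns an arbitrary index or raises `KeyError` in this sketch.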

Keywords

Minimal Perfect Hash Functions
Algorithms
Data Structures
Big Data
