MorphisHash: Improving Space Efficiency of ShockHash for Minimal Perfect Hashing
Abstract
A minimal perfect hash function (MPHF) maps a set of $n$ keys to the $n$ positions $\{1,\dots,n\}$ without collisions. Representing an MPHF requires at least $\log_2 e \approx 1.44$ bits per key. ShockHash is a technique to construct an MPHF that requires just slightly more space. It gives each key two random candidate positions. If each key can be mapped to one of its two candidate positions such that exactly one key is mapped to each position, then an MPHF is found. If not, ShockHash repeats the process with a new set of random candidate positions. ShockHash has to store how many repetitions were required and, for each key, to which of the two candidate positions it is mapped. However, when a given set of candidate positions can be used as an MPHF, then there is not only one but multiple ways of mapping the keys to one of their candidate positions such that the mapping results in an MPHF. This redundancy accounts for the majority of the remaining space overhead in ShockHash. In this paper, we present MorphisHash, which almost completely eliminates this redundancy. Our theoretical result is that MorphisHash saves $\Theta(\log n)$ bits in expectation compared to ShockHash. This corresponds to a factor of 20 less space overhead in practice. Just like ShockHash, MorphisHash can be used as a building block within RecSplit to obtain MorphisHash-RS. When compared for the same space consumption, MorphisHash-RS can be constructed up to 21 times faster than ShockHash-RS. The technique to accomplish this might be of more general interest for compressing data structures.
Keywords and phrases:
compressed data structure, perfect hashing, random graph, pseudoforest, component
2012 ACM Subject Classification:
Theory of computation → Data compression; Information systems → Point lookups; Theory of computation → Bloom filters and hashing; Mathematics of computing → Random graphs
Supplementary Material:
Software: https://github.com/stefanfred/MorphisHash [10], archived at swh:1:dir:72bf7952109795567ca30d31efc4c557b44dfc17
Acknowledgements:
I thank Stefan Walzer for proofreading an earlier version of this paper.
Funding:
This work was supported by funding from the pilot program Core Informatics at KIT (KiKIT) of the Helmholtz Association (HGF).
Editors:
Anne Benoit, Haim Kaplan, Sebastian Wild, and Grzegorz Herman
Series and Publisher:
Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik
1 Introduction
Given a set $S$ of $n$ keys, a minimal perfect hash function (MPHF) maps each key to a unique position in $\{1,\dots,n\}$. MPHFs have a wide range of applications including compressed full-text indexes [2], computer networks [18], databases [6], prefix-search data structures [1], language models [24], bioinformatics [7, 23], and Bloom filters [5]. Different techniques exist for constructing an MPHF. They offer a variety of trade-offs between construction time, space consumption, and query time. The space lower bound of an MPHF is $\log_2 e \approx 1.44$ bits per key [19].
ShockHash.
Our technique builds on ShockHash [15, 16]. Similar to Cuckoo hashing [22], each key $x$ is given two candidate positions $h_0^s(x)$ and $h_1^s(x)$ using hash functions $h_0^s$ and $h_1^s$ with seed $s$. ShockHash finds a seed $s$ such that all keys can be mapped to one of their candidate positions and there is exactly one key mapped to each position. ShockHash needs to store the seed $s$, once found. Additionally, it needs to store for each key whether the candidate position $h_0^s(x)$ or $h_1^s(x)$ is used. This can be represented using a function $f : S \to \{0,1\}$. Such a mapping can be stored efficiently using a retrieval data structure [8] which requires about 1 bit per key. A key $x$ is queried using $h^s_{f(x)}(x)$.
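To make the query path concrete, the following is a minimal C++ sketch of this lookup. The names `candidate0`, `candidate1` and `Retrieval` are our own placeholders for $h_0^s$, $h_1^s$ and the 1-bit retrieval structure; the actual ShockHash implementation uses different, heavily optimized primitives.

```cpp
#include <cstdint>
#include <functional>
#include <string>

// Toy stand-ins for the seeded hash functions h_0^s and h_1^s
// (assumption: any seeded hash into {0, ..., n-1} serves for illustration).
uint64_t candidate0(const std::string& key, uint64_t seed, uint64_t n) {
    return (std::hash<std::string>{}(key) ^ (seed * 0x9e3779b97f4a7c15ULL)) % n;
}
uint64_t candidate1(const std::string& key, uint64_t seed, uint64_t n) {
    return (std::hash<std::string>{}(key) ^ ((seed + 1) * 0xff51afd7ed558ccdULL)) % n;
}

// Placeholder for a compact 1-bit retrieval structure storing f : S -> {0,1}.
struct Retrieval {
    std::function<bool(const std::string&)> f;  // a real structure stores this in ~1 bit/key
    bool query(const std::string& key) const { return f(key); }
};

// ShockHash query: h^s_{f(x)}(x), i.e. the candidate selected by the stored bit.
uint64_t shockHashQuery(const std::string& key, uint64_t seed, uint64_t n,
                        const Retrieval& f) {
    return f.query(key) ? candidate1(key, seed, n) : candidate0(key, seed, n);
}
```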
A different perspective on ShockHash is that with each seed $s$ it samples a random graph. The $n$ possible output positions are the nodes of that graph. The keys are the edges, connecting the nodes of their respective candidate positions. A seed is accepted if the graph can be oriented, i.e. each edge is given a direction, such that the indegree of each node is exactly 1. This is possible if and only if the graph is a pseudoforest – a graph where each component is a cycle with trees branching from it. The edges of the cycle of each component are oriented either all "clockwise" or all "counterclockwise". The edges in the trees are uniquely oriented away from their cycle. Hence, the indegree of each node is 1. ShockHash arbitrarily chooses one of the two possible orientations of each cycle and stores the resulting orientation of each edge in a retrieval structure. Hence, for each cycle there is one bit of redundancy.
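The acceptance test "is the sampled graph a pseudoforest?" can be phrased as a union-find computation: a seed is rejected as soon as some component would receive a second cycle. Below is a small self-contained sketch of this check (our illustration; ShockHash's actual filter is engineered differently but tests the same property).

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Union-find check that a multigraph with n nodes and n edges is a
// pseudoforest: every component may contain at most one cycle.
struct PseudoforestCheck {
    std::vector<uint32_t> parent;
    std::vector<bool> hasCycle;  // per root: does the component already contain a cycle?

    explicit PseudoforestCheck(uint32_t n) : parent(n), hasCycle(n, false) {
        std::iota(parent.begin(), parent.end(), 0);
    }
    uint32_t find(uint32_t v) {
        while (parent[v] != v) v = parent[v] = parent[parent[v]];  // path halving
        return v;
    }
    // Returns false as soon as some component would contain two cycles.
    bool addEdge(uint32_t u, uint32_t v) {
        uint32_t ru = find(u), rv = find(v);
        if (ru == rv) {              // edge closes a cycle (loops included)
            if (hasCycle[ru]) return false;
            hasCycle[ru] = true;
            return true;
        }
        parent[ru] = rv;             // merge the two components
        if (hasCycle[ru] && hasCycle[rv]) return false;
        hasCycle[rv] = hasCycle[ru] || hasCycle[rv];
        return true;
    }
};
```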
Contribution.
In this paper we introduce MorphisHash which exhausts this remaining redundancy. MorphisHash is a recursive acronym: MorphisHash is an overloaded retrieval structure for perfect hashing using ShockHash. Our key observation is that the possible orientations of a pseudoforest can be described as the solution of a linear equation system. A retrieval structure that stores the edge orientations can also be described as the solution of a linear equation system. This allows us to concatenate the equation systems using matrix multiplication. We achieve compression by reducing the dimensionality of the solution space of the combined equation system. Our theoretical insight is that a random pseudoforest has $\Theta(\log n)$ components in expectation and MorphisHash can convert this into $\Theta(\log n)$ bits of expected space savings compared to ShockHash by utilizing the freedom of choosing the orientation of each component's cycle. Our experiments show that MorphisHash has about a factor of 20 less space overhead than ShockHash at the cost of a constant factor more construction time.
Partitioning.
Note that within this paper, $n$ denotes the input size for one instance of MorphisHash. In Section 5, an MPHF is obtained by splitting a large input key set into a linear number of MorphisHash instances, each of size $n$, and concatenating them afterwards. Partitioning is required mainly for keeping construction times feasible. Furthermore, our per-instance space savings translate to linear space savings in terms of the large input.
Outline.
We begin in Section 2 with related space efficient PHF construction techniques. We present MorphisHash in Section 3 and analyze it in Section 4. We explain implementation details in Section 5. Finally, Section 6 discusses experiments and the paper is concluded in Section 7. Our compression technique might be of more general interest and we give further examples in the full version of this paper [11].
2 Related Work
We provide a brief overview of other space efficient PHF techniques. For a detailed survey of state-of-the-art techniques we refer to [14].
RecSplit.
RecSplit [9, 3] first hashes the keys into partitions of about 2000 keys. RecSplit then finds a seed of a hash function that splits the partition into smaller subsets of equal size. This is applied recursively resulting in a tree-like structure. Once sufficiently small subsets are obtained, RecSplit uses brute-force search to find an MPHF within each leaf. Very recently a significant improvement to RecSplit has been made with the introduction of Consensus [17]. Instead of allowing arbitrarily large seeds, Consensus-RecSplit uses a fixed number of bits for each seed and backtracks in the splitting tree if a seed cannot be found within the allowed space. Consensus-RecSplit is currently the most space efficient technique with just 0.001 bits per key overhead.
PHOBIC.
Another PHF construction technique is PHOBIC [12]. Again, the keys are hashed into partitions of about 2000 keys. Within each partition the keys are hashed to buckets which have an average size of about 10. For each bucket, PHOBIC uses brute force search to find a seed of a hash function such that all keys of that bucket are hashed to positions to which no keys of previous buckets have been hashed. The buckets are inserted in non-increasing order of size because it is much easier to insert the larger buckets when the output domain is almost empty. This effect is utilized further by intentionally making some of the buckets larger. PHOBIC has fast queries at the cost of more space overhead.
3 MorphisHash
ShockHash samples random graphs until stumbling on a pseudoforest. The only remaining degree of freedom when orienting the pseudoforest is that each component contains a cycle and there are two ways to orient each cycle. We address this remaining redundancy with MorphisHash.
The first ingredient of MorphisHash is the insight that all allowed orientations of a graph can be expressed as an affine subspace of $\mathbb{F}_2^S$, where $S$ is the set of keys. To show this, we define $y \in \mathbb{F}_2^S$ as the vector representing the orientation of each edge such that an edge $x$ is oriented to node $h^s_{y_x}(x)$. We now consider a given pseudoforest and one possible orientation $y$. We can flip the orientation of a cycle by adding a vector $c \in \mathbb{F}_2^S$ to $y$ where $c_x = 1$ if and only if edge $x$ is part of that cycle. This can be done for each cycle independently. Clearly, the dimension of this subspace is equal to the number of components. Part of this section is to describe the linear equation system of which the solution space is our desired affine subspace.
The second ingredient is a 1-bit retrieval data structure. The retrieval structure works by storing a bit vector $z \in \mathbb{F}_2^m$, where the parameter $m$ is discussed in detail later. The orientation of an edge $x$ is described using $\langle r^s(x), z \rangle$, where $r^s : S \to \mathbb{F}_2^m$ is a hash function and $s$ is the ShockHash seed. Using linear equations that involve hash functions is a common technique for retrieval data structures [8].
The beauty of MorphisHash is that we can concatenate both linear equation systems using a simple matrix multiplication to find a retrieval structure which directly orients the edges correctly. We can then decrease the number of bits that the retrieval structure is allowed to use which reduces the dimension of the solution space and therefore extracts the remaining redundancy.
We now show the equation system that describes the allowed edge orientations. As a first step we show that the constraints of an MPHF can be weakened in the following sense:
Lemma 1.
A function $p : S \to \{1,\dots,n\}$ with $|S| = n$ is an MPHF (i.e. a bijection) if and only if for all $v \in \{1,\dots,n\}$ we have that $|p^{-1}(v)|$ is odd.
Proof.
Clearly, if $p$ is bijective then $|p^{-1}(v)| = 1$ is odd for all $v$. If $p$ is not bijective then it is not surjective (there are as many keys as positions), so there is at least one $v$ with $|p^{-1}(v)| = 0$, which is even.
Linear Equations in Graphs.
This allows us to count the indegree of each node using arithmetic in $\mathbb{F}_2$: If the indegree of all nodes is odd then the orientation results in a valid MPHF. Recall the definition of $y \in \mathbb{F}_2^S$ as the vector representing the orientation of each edge such that an edge $x$ is oriented to node $h^s_{y_x}(x)$. A different perspective is that an edge $x$ is oriented towards node $h_1^s(x)$ if and only if $y_x = 1$, which will be useful in the following equation. We define $B \in \mathbb{F}_2^{n \times n}$ as the incidence matrix of the graph, i.e. $B_{v,x} = 1$ if and only if $v$ is an endpoint of the non-loop edge $x$. We also define $d \in \mathbb{F}_2^n$ where $d_v$ counts, modulo 2, the number of edges $x$ that are mapped to position $v$ using the candidate function $h_0^s$. Finally, this allows us to count the indegree of a node $v$ modulo 2 as
$$\operatorname{indeg}(v) \bmod 2 = d_v \oplus (By)_v.$$
The operator $\oplus$ is logical XOR. According to Lemma 1, setting $y$ such that the indegree of each node is odd results in a valid orientation of all edges. The complete equation system therefore simplifies to $By = \mathbf{1} \oplus d$, where $\mathbf{1}$ is the all-ones vector; it has a solution if and only if the graph is a pseudoforest. Note that in case of a loop, the respective column in the incidence matrix is a zero column.
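As a small worked example (our own, not taken from the paper), consider $n = 3$ keys $x_1, x_2, x_3$ with candidate pairs $(h_0^s, h_1^s)$ equal to $(1,2)$, $(2,3)$ and $(1,3)$, so the sampled graph is a triangle (one component, one cycle). Then

```latex
\[
B = \begin{pmatrix} 1 & 0 & 1 \\ 1 & 1 & 0 \\ 0 & 1 & 1 \end{pmatrix}, \qquad
d = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}, \qquad
By = \mathbf{1} \oplus d = \begin{pmatrix} 1 \\ 0 \\ 1 \end{pmatrix}.
\]
```

The rows of $B$ sum to zero, so $B$ has defect 1 and the system has exactly two solutions, $y = (0,0,1)^\top$ and $y = (1,1,0)^\top$ – precisely the two orientations of the cycle. For instance, $y = (0,0,1)^\top$ maps $x_1 \mapsto 1$, $x_2 \mapsto 2$, $x_3 \mapsto 3$, a bijection.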
The Retrieval Data Structure.
To store the orientation of the edges we cannot use $y$ directly because we do not know the index of a key during query time. We therefore employ the idea of a retrieval structure. Our retrieval structure consists of a bit vector $z \in \mathbb{F}_2^m$, where $m$ is a tuning parameter. The orientation of edge $x$ is then described using a scalar product $y_x = \langle r^s(x), z \rangle$, where $r^s : S \to \mathbb{F}_2^m$ is a hash function and $s$ is the ShockHash seed. MorphisHash needs to find $z$ such that all edges are properly oriented. The above linear equation is given for each key and can therefore be written using a matrix $R \in \mathbb{F}_2^{n \times m}$. Each row is the hash $r^s(x)$ of the respective key $x$. We have $y = Rz$ and substitute it into $By = \mathbf{1} \oplus d$, resulting in $BRz = \mathbf{1} \oplus d$. If a solution for $z$ exists then the graph is a pseudoforest, $y = Rz$ is a valid orientation, and $z$ requires exactly $m$ bits to store using a bit string. We discuss the selection of $m$ both in theory (Section 4) and practice (Section 6). Querying our MPHF for a key $x$ is now straightforward: $h^s_{\langle r^s(x), z \rangle}(x)$.
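A minimal C++ sketch of this query, under the assumption $m \le 64$ so that the row $r^s(x)$ and the solution vector $z$ fit into one machine word; `rowHash` is our placeholder for $r^s$, and `candidate0`/`candidate1` refer to the earlier ShockHash sketch:

```cpp
#include <bit>
#include <cstdint>
#include <functional>
#include <string>

// Seeded hashes into {0, ..., n-1} as defined in the earlier sketch.
uint64_t candidate0(const std::string& key, uint64_t seed, uint64_t n);
uint64_t candidate1(const std::string& key, uint64_t seed, uint64_t n);

// Placeholder hash producing the m-bit retrieval row r^s(x).
uint64_t rowHash(const std::string& key, uint64_t seed, uint32_t m) {
    uint64_t h = std::hash<std::string>{}(key) * 0x9e3779b97f4a7c15ULL ^ seed;
    return m == 64 ? h : h & ((uint64_t{1} << m) - 1);
}

// MorphisHash query: h^s_{<r^s(x), z>}(x). The scalar product over F_2 is
// the parity of the bitwise AND of the row with the solution vector z.
uint64_t morphisHashQuery(const std::string& key, uint64_t seed, uint64_t n,
                          uint32_t m, uint64_t z) {
    bool bit = std::popcount(rowHash(key, seed, m) & z) & 1;
    return bit ? candidate1(key, seed, n) : candidate0(key, seed, n);
}
```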
If the system $BRz = \mathbf{1} \oplus d$ does not have a solution for a sampled graph, this can have two reasons: (1) the graph is not a pseudoforest, or (2) no valid orientation of the pseudoforest lies within the solution space of the retrieval structure. If there is no solution we reject the seed and ShockHash continues with searching for a new seed.
The already existing step in ShockHash of checking whether a seed yields a pseudoforest is therefore redundant. However, solving an equation system is computationally more expensive than the original ShockHash pseudoforest check. We therefore keep the original pseudoforest check as a filter and only need to solve the equation system a few times in practice. MorphisHash is illustrated using an example in Figure 1.
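The construction can then be sketched as follows: for each seed, run the cheap pseudoforest filter and, only if it passes, assemble the $n \times m$ system $BRz = \mathbf{1} \oplus d$ (row $v$ is the XOR of the rows $r^s(x)$ of all non-loop edges $x$ incident to $v$, with right-hand side $1 \oplus d_v$) and solve it over $\mathbb{F}_2$. The solver below is plain Gaussian elimination for $m \le 64$, our simplification of what an implementation might do; the paper's code uses optimized solvers.

```cpp
#include <bit>
#include <cstdint>
#include <optional>
#include <vector>

// Solve an n x m linear system over F_2. rows[i] holds the m coefficient
// bits of equation i, rhs[i] its right-hand side. Returns a solution z,
// or std::nullopt if the system is unsolvable (then the seed is rejected).
std::optional<uint64_t> solveF2(std::vector<uint64_t> rows,
                                std::vector<uint8_t> rhs, uint32_t m) {
    std::vector<int> pivotOfColumn(m, -1);
    for (size_t i = 0; i < rows.size(); ++i) {
        // Eliminate all columns that already have a pivot row.
        for (uint32_t c = 0; c < m; ++c) {
            if (((rows[i] >> c) & 1) && pivotOfColumn[c] >= 0) {
                rows[i] ^= rows[pivotOfColumn[c]];
                rhs[i] ^= rhs[pivotOfColumn[c]];
            }
        }
        if (rows[i] == 0) {
            if (rhs[i]) return std::nullopt;  // contradiction 0 = 1
        } else {
            // The lowest remaining set bit becomes this row's pivot column.
            pivotOfColumn[std::countr_zero(rows[i])] = static_cast<int>(i);
        }
    }
    // Back-substitution, highest pivot column first; free variables stay 0.
    uint64_t z = 0;
    for (int c = static_cast<int>(m) - 1; c >= 0; --c) {
        if (pivotOfColumn[c] < 0) continue;
        uint64_t row = rows[pivotOfColumn[c]];
        // Bit c is the lowest set bit of row, so row & z only touches
        // columns above c whose values are already fixed.
        if (rhs[pivotOfColumn[c]] ^ (std::popcount(row & z) & 1)) {
            z |= uint64_t{1} << c;
        }
    }
    return z;
}
```

If `solveF2` returns a value, the seed and $z$ are stored; otherwise the seed search continues.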
Bipartite MorphisHash.
A variant of ShockHash is bipartite ShockHash. Assume that $n$ is even; the extension to odd $n$ can be found in the original paper [15]. In bipartite ShockHash, the ranges of the two hash functions are made disjoint using $h_0^s : S \to \{1,\dots,n/2\}$ and $h_1^s : S \to \{n/2+1,\dots,n\}$. The seed is also split into two independent parts $s_0$ and $s_1$. Seeds where not all positions are hit by at least one candidate position are filtered out. Bipartite ShockHash only checks whether pairs of $s_0$ and $s_1$ that passed the filter result in a pseudoforest. MorphisHash uses ShockHash as a black box and can be applied to the bipartite case just as well as to the non-bipartite case. In an obvious manner, we refer to bipartite MorphisHash if bipartite ShockHash is used.
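For illustration, the bipartite candidate computation can be sketched as follows (0-indexed positions; `mix` is our placeholder for a seeded hash, and the filtering of unproductive seed halves is omitted):

```cpp
#include <cstdint>
#include <functional>
#include <string>

// Placeholder seeded hash (assumption: any good mixer works here).
uint64_t mix(const std::string& key, uint64_t seed) {
    return std::hash<std::string>{}(key) ^ (seed * 0x9e3779b97f4a7c15ULL);
}

// Bipartite candidates for even n: h_0^{s_0} hits the lower half of the
// output range and h_1^{s_1} the upper half, with independent seeds.
uint64_t candidate0Bip(const std::string& key, uint64_t s0, uint64_t n) {
    return mix(key, s0) % (n / 2);              // {0, ..., n/2 - 1}
}
uint64_t candidate1Bip(const std::string& key, uint64_t s1, uint64_t n) {
    return n / 2 + mix(key, s1) % (n / 2);      // {n/2, ..., n - 1}
}
```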
4 Analysis
In this section we analyze non-bipartite MorphisHash for large $n$. Our analysis is split in two parts. First, we analyze the number of components of a random pseudoforest. We use this in the second part to show the space efficiency of MorphisHash compared to ShockHash. Note that we employ the common simple uniform hashing assumption [21], which assumes that hash functions behave like truly random functions.
4.1 The Number of Components in a Random Pseudoforest
In a pseudoforest each component is a tree with one additional edge. An equivalent view is that each component of the pseudoforest is a cycle with trees branching from it.
Our first result uses the following graph model. Let $F$ be the set of all functions from $\{1,\dots,n\}$ to $\{1,\dots,n\}$. Every $f \in F$ corresponds to a directed graph $G_f$ with node set $\{1,\dots,n\}$ and edge set $\{(v, f(v)) : v \in \{1,\dots,n\}\}$. All $G_f$ are pseudoforests, because if we start from any node and follow its edges we will eventually end up in a cycle. Furthermore, there can only be one cycle in each component because each node has an out-degree of one. In the following we refer to all $G_f$ as maximal directed pseudoforests. All edges of the pseudoforest which are not part of a cycle point towards the cycle, and the edges within a cycle are all directed in the same direction. We will first need the following results.
Lemma 2 ([25]).
Let $T_{n,k}$ be the number of undirected forests having node set $\{1,\dots,n\}$ with $k$ components where the nodes $1,\dots,k$ belong to different trees. We have $T_{n,k} = k \cdot n^{n-k-1}$.
Lemma 3.
The number of functions $f \in F$ on a node set of size $\ell$ such that each node of $G_f$ is part of a cycle is $\ell!$.
Proof.
If each node in $G_f$ is part of a cycle then the indegree of each node is 1. Hence, $f$ is a bijection and there are $\ell!$ possible bijections. We can now combine the previous results to analyze how the number of nodes in cycles is distributed in maximal directed pseudoforests.
Lemma 4.
The number of functions $f \in F$ such that $G_f$ has exactly $\ell$ nodes in cycles is $M_\ell = \binom{n}{\ell} \cdot \ell! \cdot \ell \cdot n^{n-\ell-1}$.
Proof.
There are $\binom{n}{\ell}$ ways of partitioning the nodes into (1) cycles with a total of $\ell$ nodes and (2) trees that are attached to the cycles. According to Lemma 3 there are $\ell!$ ways to create cycles with the $\ell$ chosen nodes. The nodes in the cycles are the roots of trees. According to Lemma 2 the number of labeled rooted forests with $n$ nodes and $\ell$ roots is $\ell \cdot n^{n-\ell-1}$. This number carries over to graphs where each node has one outgoing edge, because all tree edges are uniquely oriented towards their parent node. We therefore have $M_\ell = \binom{n}{\ell} \cdot \ell! \cdot \ell \cdot n^{n-\ell-1}$.
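The counting formula is easy to sanity-check exhaustively for tiny $n$. The following self-contained program (our own verification, not part of the paper) enumerates all $n^n$ mappings for $n = 5$ and compares the number of mappings with exactly $\ell$ cyclic nodes against $\binom{n}{\ell}\,\ell!\,\ell\,n^{n-\ell-1}$:

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int n = 5;
    long long total = 1;
    for (int i = 0; i < n; ++i) total *= n;            // n^n mappings
    std::vector<long long> count(n + 1, 0);
    std::vector<int> f(n);
    for (long long code = 0; code < total; ++code) {
        long long c = code;
        for (int i = 0; i < n; ++i) { f[i] = c % n; c /= n; }  // decode mapping
        int cyclic = 0;
        for (int v = 0; v < n; ++v) {                  // v is cyclic iff f^k(v) = v
            int w = v;
            for (int k = 0; k < n; ++k) {
                w = f[w];
                if (w == v) { ++cyclic; break; }
            }
        }
        ++count[cyclic];
    }
    for (int l = 1; l <= n; ++l) {
        long long formula = 1;
        for (int i = n - l + 1; i <= n; ++i) formula *= i;   // binom(n,l) * l!
        formula *= (l == n) ? 1 : l;                   // Takacs: l * n^(n-l-1),
        for (int i = 0; i < n - l - 1; ++i) formula *= n;    // read as 1 for l = n
        std::printf("l=%d: counted %lld, formula %lld\n", l, count[l], formula);
    }
    return 0;
}
```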
The previous result is based on maximal directed pseudoforests. However, in MorphisHash we sample graphs in a different manner. Let $\hat{F}$ be the set of functions from $S$ to $\{1,\dots,n\}^2$. Each $\hat{f} \in \hat{F}$ corresponds to an undirected graph $G_{\hat{f}}$ with node set $\{1,\dots,n\}$ whose edges are the pairs $\hat{f}(x)$ for $x \in S$. We refer to the set of all $G_{\hat{f}}$ as the hashed graph model. Graphs in this model may have multiple edges and loops. MorphisHash uniformly samples functions $\hat{f} \in \hat{F}$. We are interested in the distribution of the number of nodes in cycles of $G_{\hat{f}}$ conditioned on the event that $G_{\hat{f}}$ is a pseudoforest. In the following, we transfer our result on maximal directed pseudoforests to the hashed graph model.
Lemma 5.
Let $K_f$ be the number of components of $G_f$, let $c$ be an appropriate normalization constant, and let $M_\ell$ be as in Lemma 4; then
$$\Pr\left[G_{\hat{f}} \text{ has } \ell \text{ nodes in cycles} \;\middle|\; G_{\hat{f}} \text{ is a pseudoforest}\right] = c \cdot M_\ell \cdot \mathbb{E}\left[2^{-K_f} \;\middle|\; G_f \text{ has } \ell \text{ nodes in cycles}\right].$$
Proof.
We begin with a relation between maximal directed pseudoforests $G_f$ and hashed pseudoforests $G_{\hat{f}}$. A hashed pseudoforest $G_{\hat{f}}$ is related to a maximal directed pseudoforest $G_f$ if we can obtain $G_f$ by orienting the edges of $G_{\hat{f}}$. Let $D_{\ell,k,d,o}$ be the number of maximal directed pseudoforests and $P_{\ell,k,d,o}$ the number of hashed pseudoforests with $\ell$ nodes in cycles, $k$ components, $d$ double edges and $o$ loops. Within this subset, each cycle of the hashed graph can be oriented in two possible ways to obtain a maximal directed pseudoforest. However, changing the orientation of cycles that are a loop or a double edge does not result in a different maximal directed pseudoforest. Hence, we have $2^{k-d-o}$ orientations which result in different maximal directed pseudoforests, i.e. $D_{\ell,k,d,o} = 2^{k-d-o} \cdot P_{\ell,k,d,o}$.
Let $A_{\ell,k,d,o}$ be the number of elements $\hat{f} \in \hat{F}$ such that $G_{\hat{f}}$ is a pseudoforest with $\ell$ nodes in cycles, $k$ components, $d$ double edges and $o$ loops. We show that $A_{\ell,k,d,o} = P_{\ell,k,d,o} \cdot \frac{n!}{2^d} \cdot \frac{2^n}{2^o}$. We have to consider the number of functions in $\hat{F}$ that result in the same hashed pseudoforest. The order of the edges can be permuted in $n!$ ways without changing the underlying pseudoforest, except for double edges. The number of possible permutations decreases by a factor of two for each double edge ($n!/2^d$). Analogously, the nodes within the edges can be switched without changing the underlying graph in $2^n$ ways, except for loops ($2^n/2^o$).
By definition we have $M_\ell = \sum_{k,d,o} D_{\ell,k,d,o}$ and $A_\ell = \sum_{k,d,o} A_{\ell,k,d,o}$, where $A_\ell$ is the number of $\hat{f} \in \hat{F}$ such that $G_{\hat{f}}$ is a pseudoforest with $\ell$ nodes in cycles. For brevity let $E_\ell = \mathbb{E}\left[2^{-K_f} \mid G_f \text{ has } \ell \text{ nodes in cycles}\right]$.
We have
$$A_\ell = \sum_{k,d,o} P_{\ell,k,d,o} \cdot \frac{n!}{2^d} \cdot \frac{2^n}{2^o} = n! \, 2^n \sum_{k,d,o} D_{\ell,k,d,o} \cdot 2^{-k} = n! \, 2^n \, M_\ell \, E_\ell,$$
where we used Lemma 4 for $M_\ell$. The sum $\sum_\ell A_\ell$ is the number of all hashed pseudoforests and thus only depends on $n$. The constant $c$ is chosen such that the probabilities add up to 1. The next step is to analyze $E_\ell$. To this end, we first require some definitions and more general results.
Lemma 6.
The number of components of a pseudoforest with $n$ nodes and $\ell$ nodes in cycles follows the same distribution as the number of components of a graph of $\ell$ nodes where each component is a cycle.
Note that this refers to both graph models $F$ and $\hat{F}$.
Proof.
The order in which the edges of a graph are sampled does not change the distribution of the graph. Given a graph of $n$ nodes and $\ell$ edges such that all edges are part of cycles, the probability that the remaining $n - \ell$ edges form trees with roots in the existing cycles is independent of how the nodes of the cycles are connected to form any number of components.
Configuration Model.
The configuration model [20] can be used to describe distributions of random graphs. In the model, each node is given a fixed number of half-edges. The graph is obtained by repeatedly connecting half-edges by uniformly sampling from the set of all remaining half-edges.
Lemma 7.
The distribution of the hashed graph $G_{\hat{f}}$ for uniformly random $\hat{f} \in \hat{F}$ is equal to the distribution of the following graph.
The graph is obtained in two steps. First, the degree of each node is revealed by distributing the $2n$ half-edges. In a second step, the edges of the graph are obtained in a sequence of rounds. In each round an unmatched half-edge is chosen arbitrarily and matched to a distinct unmatched half-edge, chosen uniformly at random. The choice of the next half-edge to match may depend on the set of half-edges matched previously.
We refer to ShockHash [15, Lemma 5] for a proof. We require an analogous result for $F$.
Lemma 8.
The distribution of the maximal directed pseudoforest $G_f$ for uniformly random $f \in F$ is equal to the distribution of the following directed graph.
The number of outgoing edges of each node is fixed to 1. The graph is obtained in two steps. First, the indegree of each node is revealed by distributing the $n$ incoming half-edges. In a second step, the edges of the graph are obtained in a sequence of rounds. In each round an unmatched outgoing half-edge at a node $u$ is chosen arbitrarily and matched to a distinct unmatched incoming half-edge, chosen uniformly at random. The choice of $u$ may depend on the set of half-edges matched previously.
Again, the proof is analogous to ShockHash [15, Lemma 5].
Lemma 9.
For $\ell \ge 1$ we have
$$\frac{1}{2\sqrt{\ell}} \;\le\; \prod_{i=1}^{\ell}\left(1 - \frac{1}{2i}\right) \;\le\; \frac{1}{\sqrt{2\ell+1}}.$$
Proof.
By induction over $\ell$, using $\ell (2\ell-1)^2 \ge 4\ell^2(\ell-1)$ for the lower bound and $(2\ell-1)(2\ell+1) \le (2\ell)^2$ for the upper bound.
We now have the available tools to analyze the number of components conditioned on the number of nodes in cycles.
Lemma 10.
$\mathbb{E}\left[2^{-K}\right] = \prod_{i=1}^{\ell}\left(1 - \frac{1}{2i}\right) = \Theta(1/\sqrt{\ell})$, where $K$ is the number of components of $G_f$ and $f \in F$ is chosen uniformly at random conditioned on $G_f$ having $\ell$ nodes in cycles.
Note that the following proof has similarities to ShockHash [15, Lemma 6].
Proof.
We consider the number of components of a maximal directed pseudoforest with $\ell$ nodes, conditioned on the event that each of its components is a cycle. According to Lemma 6, this number follows the same distribution as the number of components of a maximal directed pseudoforest with $n$ nodes and $\ell$ nodes in cycles. We proceed as described in Lemma 8 by revealing the graph in a sequence of rounds. First, the indegree of each node is revealed. Each component is a cycle and therefore each node has indegree one. We choose the outgoing half-edge at an arbitrary node $u$. The outgoing half-edge is matched with one of the $\ell$ incoming half-edges. Let $v$ be the node of the matched incoming half-edge. There are two cases.
1. With probability $1/\ell$ we have $v = u$. In this case, a loop is created, a cycle is closed and the next outgoing half-edge to match is chosen arbitrarily.
2. Otherwise we merge the two nodes into a single one, which does not change the number of components. The outgoing half-edge at node $v$ is matched next.
In both cases, the distribution of the remaining graph is that of the same process with $\ell - 1$ nodes. Because of this independence, we can multiply the expectation values to obtain the recurrence
$$\mathbb{E}\left[2^{-K_\ell}\right] = \left(\frac{1}{\ell} \cdot \frac{1}{2} + \left(1 - \frac{1}{\ell}\right)\right) \mathbb{E}\left[2^{-K_{\ell-1}}\right] = \left(1 - \frac{1}{2\ell}\right) \mathbb{E}\left[2^{-K_{\ell-1}}\right].$$
With the base case $\mathbb{E}\left[2^{-K_0}\right] = 1$ the recurrence is solved by $\prod_{i=1}^{\ell}\left(1 - \frac{1}{2i}\right)$ and upper bounded by $1/\sqrt{2\ell+1} = O(1/\sqrt{\ell})$, where we used Lemma 9. Analogously, Lemma 9 gives the lower bound $1/(2\sqrt{\ell}) = \Omega(1/\sqrt{\ell})$.
A similar result for hashed pseudoforests instead of maximal directed pseudoforests is the following.
Lemma 11.
$\mathbb{E}[K] = H_{2\ell} - \frac{1}{2}H_\ell = \frac{1}{2}\ln(\ell) + \Theta(1)$, and $K = \Omega(\log \ell)$ with constant probability, where $K$ is the number of components of $G_{\hat{f}}$ and $\hat{f} \in \hat{F}$ is chosen uniformly at random conditioned on $G_{\hat{f}}$ having $\ell$ nodes in cycles.
Proof.
The proof is analogous to the previous one. We use Lemma 7 to match a half-edge to one of the remaining half-edges. In each round with $i$ nodes remaining, the number of components increases by one with probability $\frac{1}{2i-1}$. Resolving the respective recurrence results in
$$\mathbb{E}[K] = \sum_{i=1}^{\ell} \frac{1}{2i-1} = H_{2\ell} - \frac{1}{2}H_\ell,$$
where $H_k$ is the $k$-th harmonic number. We are interested in a lower bound for the probability that the number of components is at least a constant fraction of this expectation. We can use the variance to find such a bound. The probability distribution of the number of components can be described in terms of a Poisson binomial distribution. The variance of a Poisson binomial distribution is bounded above by its expected value, so $\mathbb{E}[K]$ is an upper bound for the variance. Using Cantelli's inequality ($\Pr[X \le \mathbb{E}[X] - \lambda] \le \frac{\sigma^2}{\sigma^2 + \lambda^2}$) with $\lambda = \Theta(\mathbb{E}[K])$ we find that with constant probability the pseudoforest has at least $\Omega(\log \ell)$ components. The previous result shows that the number of components increases logarithmically in the number of nodes in cycles. The next step is therefore to show that the cycles are sufficiently large.
Lemma 12.
$\Pr\left[G_{\hat{f}} \text{ has at least } \sqrt{n} \text{ nodes in cycles} \;\middle|\; G_{\hat{f}} \text{ is a pseudoforest}\right] = \Omega(1)$, where $\hat{f} \in \hat{F}$ is chosen uniformly at random.
Proof.
Let $p_\ell$ be the probability of sampling a hashed pseudoforest with $\ell$ nodes in cycles as determined in Lemma 5. We need to show that the probability of sampling a pseudoforest with at least $\sqrt{n}$ nodes in cycles is at least a constant factor of the probability of sampling a pseudoforest with less than $\sqrt{n}$ nodes in cycles, i.e. $\sum_{\ell \ge \sqrt{n}} p_\ell = \Omega\bigl(\sum_{\ell < \sqrt{n}} p_\ell\bigr)$.
A stronger statement is $\sum_{\sqrt{n} \le \ell < 2\sqrt{n}} p_\ell \ge \sum_{\ell < \sqrt{n}} p_\ell$.
In the following we show $p_{\ell + \sqrt{n}} \ge p_\ell$ for all $\ell < \sqrt{n}$.
This pointwise comparison is an even stronger statement.
For brevity, we omit rounding of $\sqrt{n}$.
Using the bounds of Lemma 10 and Stirling's approximation to compare $p_{\ell+\sqrt{n}}$ with $p_\ell$ via Lemma 5, the ratio is at least 1 for all $\ell < \sqrt{n}$, which completes the proof.
Combining the results gives us the following.
Theorem 13.
$\Pr[K = \Omega(\log n)] = \Omega(1)$, where $K$ is the number of components of $G_{\hat{f}}$ and $\hat{f} \in \hat{F}$ is chosen uniformly at random conditioned on $G_{\hat{f}}$ being a pseudoforest.
Proof.
According to Lemma 12, at least $\sqrt{n}$ nodes are in cycles with constant probability. Using Lemma 11 this results in $\Omega(\log \sqrt{n}) = \Omega(\log n)$ components with constant probability.
Corollary 14.
$\mathbb{E}[K] = \Theta(\log n)$, where $K$ is the number of components of $G_{\hat{f}}$ and $\hat{f} \in \hat{F}$ is chosen uniformly at random conditioned on $G_{\hat{f}}$ being a pseudoforest.
Proof.
Use Theorem 13 for the lower bound. At most all $n$ nodes can be in cycles of the pseudoforest, so we can apply Lemma 11 for the upper bound.
4.2 Space Savings of MorphisHash
We now show that MorphisHash can convert each component into space savings compared to ShockHash. As a first step we translate the previous graph results into the world of matrices.
Lemma 15.
The defect of the incidence matrix of a pseudoforest is at least the number of the pseudoforest's components.
Proof.
For each component, summing up the rows of its nodes results in a zero row, because both endpoints of each edge of that component are included in the summed rows. The row sets summed up for the individual components are disjoint, so the resulting vanishing combinations are in particular linearly independent.
Lemma 16 ([4]).
The probability that a random square matrix (over any finite field) with $n$ rows and $n$ columns has full rank approaches a positive constant for large $n$; over $\mathbb{F}_2$, this constant is $\prod_{i \ge 1}(1 - 2^{-i}) \approx 0.2888$.
Lemma 17.
The probability that a random rectangular matrix (over any finite field) with $n$ rows and $m \le n$ columns has full column rank is bounded below by a positive constant for large $n$.
Proof.
A necessary condition for a square matrix with $n$ rows to have full rank is that its first $m$ columns have full rank. The probability that this rectangular submatrix has full rank is therefore bounded below by the constant of Lemma 16. We now show our main result.
Theorem 18.
$\Pr[\exists z \in \mathbb{F}_2^m : BRz = \mathbf{1} \oplus d] = \Omega(1)$ for $m = n - \Theta(\log n)$, where the retrieval matrix $R$, the incidence matrix $B$ and the vector $d$ of $G_{\hat{f}}$ are as described in the algorithm, and $\hat{f} \in \hat{F}$ is chosen uniformly at random conditioned on $G_{\hat{f}}$ being a pseudoforest.
Proof.
In Theorem 13 we showed that a hashed pseudoforest has at least $\Theta(\log n)$ components with constant probability. According to Lemma 15, a direct consequence is that the incidence matrix $B$ of a random pseudoforest has a defect of at least $\Theta(\log n)$ with constant probability. According to Lemma 17, the probability that $R$ has full rank is at least constant. The system $BRz = \mathbf{1} \oplus d$ has a solution if there is a vector $y$ which solves $By = \mathbf{1} \oplus d$ and simultaneously lies in the image of $R$. Such a vector exists if the two solution spaces intersect, which happens once the dimension of the affine solution space of $By = \mathbf{1} \oplus d$ (the defect of $B$) plus the dimension $m$ of the image of $R$ reaches $n$. Since $R$ and $B$ are uncorrelated, we have with constant probability that both $R$ has full rank and simultaneously $B$ has a defect of at least $\Theta(\log n)$. With $m = n - \Theta(\log n)$ we therefore have a solution with constant probability. We use the above result to measure the space savings compared to ShockHash.
Corollary 19.
Compared to ShockHash, MorphisHash saves $\Theta(\log n)$ bits in expectation while requiring a constant factor more time.
Proof.
ShockHash requires at least $n$ bits to store the orientations of all keys in a retrieval structure. MorphisHash has to store the solution vector $z$ instead, requiring exactly $m$ bits. According to Theorem 18 we can choose $m = n - \Theta(\log n)$ to obtain the desired space savings. However, there is a constant probability that a seed has to be rejected because there is no solution for $z$. MorphisHash therefore has to check a constant factor more seeds in expectation, consequently increasing the expected space required to store the seed by a constant number of bits. Analogously, the construction time grows by a constant factor in expectation. The time required to solve an expected constant number of equation systems is dominated by the seed search.
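The accounting behind this proof can be summarized as a back-of-the-envelope comparison (our rendering of the argument above):

```latex
\[
\underbrace{n}_{\substack{\text{ShockHash}\\ \text{retrieval}}}
\;-\;
\underbrace{m}_{\substack{\text{MorphisHash}\\ \text{retrieval}}}
\;=\; \Theta(\log n)
\qquad \text{vs.} \qquad
\underbrace{O(1)}_{\substack{\text{extra expected}\\ \text{seed bits}}},
\]
```

so the logarithmic savings in the retrieval structure dominate the constant expected increase in seed size.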
5 Partitioning
The time required to find a seed in MorphisHash and ShockHash grows exponentially with $n$. To keep construction feasible for a large number of keys, we first partition the input into equally sized subsets of manageable size. For consistency, we use $n$ for the size of those subsets and refer to them as base cases in the following. MorphisHash is then applied to those base cases. Note that the following two partitioning schemes are also applied to ShockHash and we refer to the ShockHash paper for more details [15, Section 7].
MorphisHash-RS.
We use RecSplit [9] to recursively split the input into smaller subsets. Once sufficiently small subsets are obtained we apply MorphisHash on those subsets as a base case. RecSplit is space efficient but has significant query time overheads caused by traversing the tree. In MorphisHash-RS we store the solution vector $z$ of the retrieval structure directly next to the corresponding seed to improve locality.
MorphisHash-Flat.
The input keys are first hashed into buckets. Using thresholds, some keys are bumped such that no bucket exceeds the desired base case size $n$. The bumped keys are then used to fill up buckets which did not reach the desired size. We apply MorphisHash on each bucket.
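A minimal sketch of one way such a threshold can be selected, assuming an 8-bit secondary hash $t(x)$ per key and one threshold per bucket (the actual MorphisHash-Flat encoding is engineered differently): keys with $t(x)$ above the bucket's threshold are bumped, so a query can decide from $t(x)$ alone whether a key resides in its bucket.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Given the 8-bit secondary hashes of all keys that map to one bucket,
// choose the largest threshold theta such that keeping exactly the keys
// with t(x) <= theta does not exceed the base case size n. Returns -1 if
// all keys of the bucket must be bumped.
int chooseThreshold(std::vector<uint8_t> t, size_t n) {
    if (t.size() <= n) return 255;             // bucket not overfull: keep all
    std::sort(t.begin(), t.end());
    // t[n] is the first value that must not be kept; everything strictly
    // below it stays, which keeps at most n keys (ties are bumped as well).
    return static_cast<int>(t[n]) - 1;
}
```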
6 Experiments
In this section we experimentally evaluate MorphisHash. We show the effect of the new parameter $m$ introduced by MorphisHash. We then compare MorphisHash-Flat and MorphisHash-RS with state-of-the-art competitors. Our source code is publicly available under the General Public License [10]. We integrate MorphisHash into an existing benchmark framework [13], which was used for the comparison with competitors. The benchmark framework is described in detail in [14]. We perform all experiments on a Core i7-11700 CPU which has 48 KiB L1 and 512 KiB L2 data cache per core. The CPU has a total of 16 MiB L3 cache. The machine has 64 GiB of dual-channel DDR4-3200 RAM. Note that our experiments are at a scale where variances are relatively small; we therefore omit them for better readability.
6.1 MorphisHash vs ShockHash without partitioning
In the following we compare bipartite MorphisHash with bipartite ShockHash. MorphisHash has the additional parameter $m$, which determines the size of the retrieval structure. Smaller values result in less space overhead at the cost of more seed tests. Note that in theory we choose $m = n - \Theta(\log n)$, but in practice, where $n$ is small, it is more intuitive to work with $m = n - \Delta$ for a small constant $\Delta$. Figure 2 shows that Bip. MorphisHash has to check roughly a constant factor more seeds than Bip. ShockHash when $m$ is fixed to $n$ minus a constant; the factor grows with $\Delta$. At the same time, MorphisHash almost completely eliminates the remaining space overhead. MorphisHash comes below 0.1 bits of space overhead for the larger tested values of $\Delta$ while ShockHash has about 2 bits of space overhead. Thus, MorphisHash has roughly 20 times less space overhead.
Figure 4 gives another perspective. In this plot, we fix $n$ and vary $\Delta$. Interestingly, ShockHash even outperforms MorphisHash for $\Delta = 0$. This is because MorphisHash requires $n$ bits for the retrieval structure in this case, just like ShockHash. However, there is still the chance that the equation system of MorphisHash has no solution, resulting in more retries and therefore in a higher space consumption of the seed and a higher construction time. For smaller $m$ the space overhead approaches 0. In the extreme case of $m = 0$ and a respective zero-dimensional retrieval vector, MorphisHash is equivalent to simple brute force search because all keys can only use the $h_0^s$ candidate function. The average successful seed grows rapidly with smaller $m$ as it is increasingly less likely to stumble upon a pseudoforest which has at least $\Delta$ many components. A pseudoforest with less than $\Delta$ components may still result in a solvable (overdetermined) equation system, but this becomes exponentially less likely with every component below $\Delta$. The expected number of components for the $n$ used in Figure 4 is 1.56, as shown in the figure.
6.2 Choosing $\Delta$ in MorphisHash-RS and MorphisHash-Flat
Space improvements in MorphisHash-RS and MorphisHash-Flat can be made either by (1) increasing the base case size $n$, which reduces the space overhead of the partitioning technique, or (2) decreasing $m$, which reduces the space overhead of MorphisHash. In both cases the construction time increases. We experimented with different values of $\Delta$ to identify the configurations which dominate the construction time and space trade-off. This way, we determined a good choice of $\Delta$ for MorphisHash-RS and for MorphisHash-Flat (see Table 1). MorphisHash-Flat is a less space efficient partitioning technique compared to MorphisHash-RS. Its space savings can therefore be made more easily by increasing $n$ instead of decreasing $m$.
6.3 Comparison to Competitors
Table 1: A selection of configurations of MorphisHash and its competitors.

| Method | Space (bits/key) | Query (ns/query) | Construction (ns/key) |
| --- | --- | --- | --- |
| Consensus-RS | 1.447 | 222 | 6 733 |
| Bip. MorphisHash-RS, base case size =, $m$ = $n$ − | 1.501 | 137 | 6 669 |
| Bip. ShockHash-RS, base case size = | 1.523 | 147 | 7 186 |
| Bip. MorphisHash-Flat, base case size =, $m$ = $n$ − | 1.541 | 75 | 6 330 |
| Bip. ShockHash-Flat, base case size = | 1.554 | 76 | 6 676 |
| PHOBIC, IC-R | 1.749 | 49 | 6 426 |
| Bip. ShockHash-RS, base case size = | 1.489 | 131 | 172 738 |
| Bip. MorphisHash-RS, base case size =, $m$ = $n$ − | 1.489 | 139 | 8 085 |
We compare MorphisHash-RS and MorphisHash-Flat with state-of-the-art competitors. We select the following space efficient competitors based on a recent survey [14]: Consensus-RS [17], ShockHash-Flat, ShockHash-RS and PHOBIC [12]. We test a wide range of configurations for each competitor and compare them in Figure 5. A selection of configurations is shown in Table 1. As can be seen in the plot and the table, MorphisHash-RS is about 0.02 bits per key more space efficient than ShockHash-RS when compared for equal construction time. This corresponds to a reduction of 27% in space overhead. When compared for equal space consumption of 1.489 bits per key, MorphisHash-RS is 21 times faster to construct (see Table 1). For MorphisHash-Flat we select $\Delta$ less aggressively compared to MorphisHash-RS (Section 6.2) and obtain a space improvement of about 0.01 bits per key. According to Figure 5, MorphisHash dominates ShockHash in the overall space, construction and query time trade-off. The next best competitor in terms of space efficiency is PHOBIC, which is a clear winner in terms of query throughput. In the other direction, we have the recently published Consensus-RS, which can reach space overheads as low as 0.001 bits per key at the cost of additional query time. A negative result regarding non-minimal PHFs can be found in the full version of this paper [11].
7 Conclusion and Future Work
MorphisHash almost completely eliminates the remaining redundancy in ShockHash. This is particularly effective when combined with a space efficient partitioning technique. Our compression scheme might be of more general interest; further examples can be found in the full version of this paper [11].
In future work we plan to improve the space efficiency of partitioning techniques as those are a major source of space overhead. We are hopeful that a partitioning technique that involves the novel Consensus technique puts further trade-offs for MorphisHash into reach.
References
- [1] Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna. Fast prefix search in little space, with applications. In ESA (1), volume 6346 of Lecture Notes in Computer Science, pages 427–438. Springer, 2010. doi:10.1007/978-3-642-15775-2_37.
- [2] Djamal Belazzougui and Gonzalo Navarro. Alphabet-independent compressed text indexing. ACM Trans. Algorithms, 10(4):23:1–23:19, 2014. doi:10.1145/2635816.
- [3] Dominik Bez, Florian Kurpicz, Hans-Peter Lehmann, and Peter Sanders. High performance construction of RecSplit based minimal perfect hash functions. In ESA, volume 274 of LIPIcs, pages 19:1–19:16. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2023. doi:10.4230/LIPICS.ESA.2023.19.
- [4] Jean Bourgain, Van H Vu, and Philip Matchett Wood. On the singularity probability of discrete random matrices. Journal of Functional Analysis, 258(2):559–603, 2010.
- [5] Andrei Z. Broder and Michael Mitzenmacher. Network applications of Bloom filters: A survey. Internet Math., 1(4):485–509, 2003. doi:10.1080/15427951.2004.10129096.
- [6] Chin-Chen Chang and Chih-Yang Lin. Perfect hashing schemes for mining association rules. Comput. J., 48(2):168–179, 2005. doi:10.1093/COMJNL/BXH074.
- [7] Victoria G. Crawford, Alan Kuhnle, Christina Boucher, Rayan Chikhi, and Travis Gagie. Practical dynamic de Bruijn graphs. Bioinform., 34(24):4189–4195, 2018. doi:10.1093/BIOINFORMATICS/BTY500.
- [8] Peter C Dillinger, Lorenz Hübschle-Schneider, Peter Sanders, and Stefan Walzer. Fast succinct retrieval and approximate membership using ribbon. arXiv preprint, 2021. arXiv:2109.01892.
- [9] Emmanuel Esposito, Thomas Mueller Graf, and Sebastiano Vigna. RecSplit: Minimal perfect hashing via recursive splitting. In ALENEX, pages 175–185. SIAM, 2020. doi:10.1137/1.9781611976007.14.
- [10] Stefan Hermann. MorphisHash - GitHub. https://github.com/stefanfred/MorphisHash, 2025.
- [11] Stefan Hermann. MorphisHash: Improving space efficiency of ShockHash for minimal perfect hashing. arXiv preprint, 2025. doi:10.48550/arXiv.2503.10161.
- [12] Stefan Hermann, Hans-Peter Lehmann, Giulio Ermanno Pibiri, Peter Sanders, and Stefan Walzer. PHOBIC: perfect hashing with optimized bucket sizes and interleaved coding. In ESA, volume 308 of LIPIcs, pages 69:1–69:17. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2024. doi:10.4230/LIPIcs.ESA.2024.69.
- [13] Hans-Peter Lehmann. MPHF Experiments – GitHub. https://github.com/ByteHamster/MPHF-Experiments, 2025.
- [14] Hans-Peter Lehmann, Thomas Mueller, Rasmus Pagh, Giulio Ermanno Pibiri, Peter Sanders, Sebastiano Vigna, and Stefan Walzer. Modern minimal perfect hashing: A survey. arXiv preprint, 2025. doi:10.48550/arXiv.2506.06536.
- [15] Hans-Peter Lehmann, Peter Sanders, and Stefan Walzer. ShockHash: Near optimal-space minimal perfect hashing beyond brute-force. arXiv preprint, invited to Algorithmica, 2024. doi:10.48550/arXiv.2310.14959.
- [16] Hans-Peter Lehmann, Peter Sanders, and Stefan Walzer. ShockHash: Towards optimal-space minimal perfect hashing beyond brute-force. In ALENEX. SIAM, 2024. doi:10.1137/1.9781611977929.15.
- [17] Hans-Peter Lehmann, Peter Sanders, Stefan Walzer, and Jonatan Ziegler. Combined search and encoding for seeds, with an application to minimal perfect hashing. In ESA, LIPIcs. Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2025.
- [18] Yi Lu, Balaji Prabhakar, and Flavio Bonomi. Perfect hashing for network applications. In ISIT, pages 2774–2778. IEEE, 2006. doi:10.1109/ISIT.2006.261567.
- [19] Kurt Mehlhorn. On the program size of perfect and universal hash functions. In FOCS, pages 170–175. IEEE Computer Society, 1982. doi:10.1109/SFCS.1982.80.
- [20] Mark E. J. Newman. Networks: An Introduction. Oxford University Press, 2010.
- [21] Anna Pagh and Rasmus Pagh. Uniform hashing in constant time and optimal space. SIAM Journal on Computing, 38(1):85–96, 2008. doi:10.1137/060658400.
- [22] Rasmus Pagh and Flemming Friche Rodler. Cuckoo hashing. Journal of Algorithms, 51(2):122–144, 2004. doi:10.1016/J.JALGOR.2003.12.002.
- [23] Giulio Ermanno Pibiri. Sparse and skew hashing of k-mers. Bioinformatics, 38(Supplement_1):i185–i194, 2022. doi:10.1093/BIOINFORMATICS/BTAC245.
- [24] Giulio Ermanno Pibiri and Rossano Venturini. Efficient data structures for massive N-gram datasets. In SIGIR, pages 615–624. ACM, 2017. doi:10.1145/3077136.3080798.
- [25] Lajos Takács. On Cayley's formula for counting forests. Journal of Combinatorial Theory, Series A, 53(2):321–323, 1990. doi:10.1016/0097-3165(90)90064-4.
