Last Updated: 2023-12-01

Are simple hash functions good enough?

Hash functions are critical components of almost every computer program and a basic building block of data structures. They are used to retrieve data, perform fast similarity searches, implement caches, route network traffic, and count objects, to name just a few applications. All of these applications rely on a property of some hash functions: that they map inputs to a set of outputs in a uniform manner. For instance, associative arrays - often called hash maps, dictionaries, or unordered maps by software engineers - rely on a hash function that uniformly maps keys to a series of 'slots' which store information about the values. If a hash function is too biased, it can force the program to fall back on slow collision-resolution algorithms - often making computations infeasible.

In practice, most hash functions are designed to distribute data uniformly at random across a codomain, say of size $n$, so that any key $k$ chosen at random from the domain has a probability $\frac{1}{n}$ of mapping to each output value. Take, for example, Daniel J. Bernstein's djb2_32 hash:

use std::num::Wrapping;

fn djb2_32(bytes: &[u8]) -> u32 {
    // Initial value h_0 = 5381
    let mut hash = Wrapping(5381);

    for &b in bytes {
        // hash = 33 * hash + b, computed as (hash << 5) + hash with wrapping arithmetic
        hash = (hash << 5) + hash + Wrapping(b as u32);
    }

    hash.0
}

This forms the relation:

h_m = \left( 33 \; h_{m-1} + b_{m-1} \right) \mod 2^{32}, \quad h_0 = 5381

While extremely simple, this hash function is actually quite clever. It takes the form of a linear-congruential generator (LCG), applying an affine transformation followed by a modulus. We can view this in two ways. Say we consume $m$ bytes; our sequence will take the form:

h_m = \left( 33^m h_0 \mod 2^{32} + 33^{m-1} b_0 \mod 2^{32} + \cdots + 33 \, b_{m-2} \mod 2^{32} + b_{m-1} \mod 2^{32} \right) \mod 2^{32}

Which, in effect, means that $h_m$ is the sum of $m$ different LCGs modulo $2^{32}$. From another perspective, say we consume a sequence of fixed bytes $b_m = B$. In this scenario, if $B$ is odd (and therefore coprime to $2^{32}$), the recurrence satisfies the Hull-Dobell Theorem and forms an LCG with a full period of $2^{32}$. (Doesn't your linear-congruential generator satisfy the Hull-Dobell Theorem?) I imagine this was the reasoning behind choosing these specific coefficients.
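
For reference (a standard statement of the theorem, not part of the original post), the Hull-Dobell Theorem says that the LCG $h_m = (a \, h_{m-1} + c) \mod M$ attains the full period $M$ for every seed exactly when

\gcd(c, M) = 1, \qquad a \equiv 1 \pmod{p} \text{ for every prime } p \mid M, \qquad a \equiv 1 \pmod{4} \text{ if } 4 \mid M

With $a = 33$ and $M = 2^{32}$, the last two conditions hold because $a - 1 = 32$ is divisible by both 2 and 4, so only the additive constant, here the fixed byte $B$, needs to be odd.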

While many non-cryptographic hash functions are available, a number of fast ones have been designed in the past decade. Some recent examples include Murmur, CityHash, XXHash, and t1hash. These functions claim superior performance in terms of both speed and quality, although they come with a complexity trade-off: (1) they are often architecture-specific, (2) some perform unaligned accesses, and (3) they often require language-specific FFI bindings. Many benchmarks claim that simple hash functions, like fnv1a, have "serious quality issues." However, many of these benchmarks are also unrealistic, involving the construction of worst-case key pairs and checking that there are no patterns in the hash outputs. So, I wanted to answer the question: Do any of these simple hash functions break down on real-world datasets? If so, what are their failure modes? To do this, I designed two tests that simulate real-world use cases and ran a number of hash functions across three datasets.

Hash Functions Under Test

I gathered some simple "low quality" hash functions as well as some "high quality" hash functions. These include:

Low quality hash functions:

  • djb2
  • fnv1a
  • adler32

High quality hash functions:

  • city (CityHash)
  • xx (xxHash)
  • spooky (SpookyHash)
  • murmur3 (MurmurHash3)

Some of these hashes also have 64-bit variants including city, xx, spooky, fnv1a, and djb2.

Datasets

I wanted to test each hash function on a variety of large datasets that provide sample scenarios from networking, bioinformatics, and natural language processing. These include:

  • All words in the English, German, and French languages as provided by the GNU ASpell dictionary version 2.1.
    • The French dictionary contains 221,377 words.
    • The American English dictionary contains 123,985 words.
    • The German dictionary contains 304,736 words.
  • All possible private IPv4 addresses, as unsigned bytes in network byte order (see the sketch after this list).
    • This includes 17,891,328 addresses.
    • Most addresses are contiguous, differing in a single bit.
  • All unique 14-mers (i.e., length-14 substrings) in the human genome (all contigs from GRCh38).
    • For example, 'AAGAGTCAGTTATT' is a 14-mer.
    • Comprising 203,091,438 unique length-14 substrings from the human genome. These sequences cover most of the possible combinations in the genome's four-character alphabet (A, T, C, G).
    • This dataset is challenging to hash because the differentiating information is contained in a small subset of the input bits.
    • I did not canonicalize these k-mers, so they could be drawn from either the 5'→3' or 3'→5' strand.
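
To make the second dataset concrete, here is a minimal sketch (my own reconstruction, not the original tooling) of how the private-address keys can be generated; the three RFC 1918 blocks sum to the 17,891,328 addresses listed above:

use std::net::Ipv4Addr;

// Enumerate every private (RFC 1918) IPv4 address and emit it as four
// unsigned bytes in network byte order: 2^24 + 2^20 + 2^16 = 17,891,328 keys.
fn private_ipv4_keys() -> Vec<[u8; 4]> {
    let ranges: [(u32, u32); 3] = [
        (u32::from(Ipv4Addr::new(10, 0, 0, 0)), 1 << 24),    // 10.0.0.0/8
        (u32::from(Ipv4Addr::new(172, 16, 0, 0)), 1 << 20),  // 172.16.0.0/12
        (u32::from(Ipv4Addr::new(192, 168, 0, 0)), 1 << 16), // 192.168.0.0/16
    ];

    let mut keys = Vec::with_capacity(17_891_328);
    for (start, len) in ranges {
        for addr in start..start + len {
            // to_be_bytes() yields the big-endian (network byte order) representation
            keys.push(addr.to_be_bytes());
        }
    }
    keys
}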

Multinomial Non-Uniformity Test

In practice, most hash functions are used to associate an item with a specific 'slot' in memory, and many algorithms depend on the premise that the distribution of items across these slots is no worse than what a uniform random assignment would produce. This test is unique in modeling the real-world behavior of the hash function rather than its behavior under a synthetic benchmark. Since the ranges of the hash functions are large (i.e., either $2^{32}$ or $2^{64}$), we need to choose a function that maps these to our slot index, $b_i$. This is most commonly accomplished by taking the modulus of the output with the number of slots. After hashing $k$ items, the resulting distribution should be modeled by a multinomial across the $n$ slots. On average, $\frac{k}{n}$ values should hash to each slot, and we can use the $\chi^2$ distribution to test whether the distribution differs significantly from what a random hash function would produce. By assuming $n$ is sufficiently large, we can compute the test statistic using the following formula:

\chi^2 = \sum_{i = 0}^{n-1} \frac{\left(b_i - \frac{k}{n}\right)^2}{\frac{k}{n}}

Then, compute the $p$-value using the chi-squared CDF. I performed around 60 of these tests across all the datasets, so I'll only list a few here.
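
As a minimal sketch of this procedure (my own illustration, reusing the djb2_32 function from above; the helper name is mine and the final CDF step is left to a statistics library), the slot assignment and test statistic look like this:

// Bin hashed keys into n slots via a modulus and compute the chi-squared
// statistic against the uniform expectation of k/n items per slot.
fn chi_squared_statistic(hashes: &[u64], n_slots: usize) -> f64 {
    let mut buckets = vec![0u64; n_slots];
    for &h in hashes {
        buckets[(h % n_slots as u64) as usize] += 1;
    }

    let k = hashes.len() as f64;
    let expected = k / n_slots as f64;

    buckets
        .iter()
        .map(|&b| {
            let diff = b as f64 - expected;
            diff * diff / expected
        })
        .sum()
}

fn main() {
    let words = ["Hund", "Katze", "Maus", "Vogel"];
    let hashes: Vec<u64> = words.iter().map(|w| djb2_32(w.as_bytes()) as u64).collect();
    let stat = chi_squared_statistic(&hashes, 1024);
    // The p-value comes from the chi-squared CDF with n - 1 degrees of freedom,
    // e.g. via a statistics crate; that step is omitted here.
    println!("chi-squared statistic: {stat}");
}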

German Word List

$|b| = 1024$

hash_function   p_value     average   p50   p75   p99
city64          0.041558    297.594   298   309   341
city32          0.558001    297.594   298   310   338
xx64            0.0619553   297.594   298   309   342
xx32            0.146449    297.594   297   310   340
spooky64        0.978192    297.594   297   309   337
spooky32        0.978192    297.594   297   309   337
murmur3 32      0.288355    297.594   298   309   335
fnv1a64         0.073901    297.594   298   309   340
fnv1a32         0.19892     297.594   297   310   340
adler32         0           297.594   271   357   548
djb2_32         0.499734    297.594   298   309   336
djb2_64         0.499734    297.594   298   309   336

$|b| = 1031$

hash_function   p_value     average   p50   p75   p99
city64          0.394728    295.573   295   307   336
city32          0.658379    295.573   295   306   337
xx64            0.349555    295.573   295   307   335
xx32            0.809289    295.573   295   307   335
spooky64        0.944751    295.573   295   307   333
spooky32        0.0605966   295.573   296   307   339
murmur3 32      0.421352    295.573   295   307   339
fnv1a64         0.224086    295.573   296   307   337
fnv1a32         0.84226     295.573   295   306   336
adler32         0.784829    295.573   295   307   337
djb2_32         0.992628    295.573   296   306   332
djb2_64         0.487786    295.573   296   308   334

$|b| = 435337$ (load factor $k / |b| \approx 0.7$)

hash_function   p_value     average   p50   p75   p99
city64          0.433201    0.7       1     1     3
city32          0.81538     0.7       1     1     3
xx64            0.46347     0.7       1     1     3
xx32            0.737051    0.7       1     1     3
spooky64        0.0797217   0.7       1     1     3
spooky32        0.390342    0.7       1     1     3
murmur3 32      0.696641    0.7       1     1     3
fnv1a64         0.0207139   0.7       1     1     3
fnv1a32         0.648008    0.7       1     1     3
adler32         0           0.7       0     1     4
djb2_32         0.110557    0.7       1     1     3
djb2_64         0.0788193   0.7       1     1     3

With the exception of adler32, all the hash functions hold up well against these ASCII inputs. When the number of slots is prime and the table size is small, adler32 performs at its best. I think that's likely because the sum wraps around the modulus, creating something closer to a uniform distribution, though this does not necessarily mean it should be used.

Private IP Ranges

$|b| = 65536$

hash_function   p_value     average   p50   p75   p99
city64          0.161976    273       273   284   313
city32          0.550778    273       273   284   312
xx64            0.150364    273       273   284   312
xx32            0.962877    273       273   284   312
spooky64        0.960136    273       273   284   312
spooky32        0.960136    273       273   284   312
murmur3 32      0.229322    273       273   284   312
fnv1a64         1           273       273   278   284
fnv1a32         1           273       273   275   280
adler32         0           273       0     0     1653
djb2_32         0           273       254   303   372
djb2_64         0           273       254   303   372

$|b| = 25559057$ (load factor $k / |b| \approx 0.7$)

hash_function   p_value     average   p50   p75   p99
city64          0.727569    0.7       1     1     3
city32          0.738734    0.7       1     1     3
xx64            0.510211    0.7       1     1     3
xx32            1           0.7       1     1     3
spooky64        0.331874    0.7       1     1     3
spooky32        0.823507    0.7       1     1     3
murmur3 32      1           0.7       1     1     3
fnv1a64         1           0.7       1     1     3
fnv1a32         1           0.7       1     1     3
adler32         0           0.7       0     0     0
djb2_32         0           0.7       0     0     55
djb2_64         0           0.7       0     0     55

This is likely the most challenging test of the three because many of these IPs are differentiated by single bits. Both adler32 and djb2_32 fail. In particular, the adler32 hash function distributes hashes amongst only 1% of the allocated buckets! Interestingly enough, for $|b| = 65536$, fnv1a seems to distribute the values very uniformly. In expectation, the 99th percentile should approach 312; interestingly, this doesn't happen for fnv1a. (I could probably compute the 17712414th order statistic to determine whether this is significant, but that seems like a bit of a nightmare.)

All k-mers in GRCh38

$|b| = 65536$

hash_function   p_value     average   p50    p75    p99
city64          0.808561    3098.93   3099   3136   3229
city32          0.238145    3098.93   3099   3136   3229
xx64            0.0837374   3098.93   3099   3137   3230
xx32            0.388023    3098.93   3099   3137   3229
spooky64        0.0890488   3098.93   3099   3136   3230
spooky32        0.0890488   3098.93   3099   3136   3230
murmur3 32      0.754288    3098.93   3099   3136   3230
fnv1a64         1           3098.93   3099   3136   3227
fnv1a32         0.99998     3098.93   3099   3136   3228
adler32         0           3098.93   0      0      0
djb2_32         0           3098.93   3054   4210   4975
djb2_64         0           3098.93   3054   4210   4975

$|b| = 290130625$ (load factor $k / |b| \approx 0.7$)

hash_function   p_value        average   p50   p75   p99
city64          0.567872       0.7       1     1     3
city32          8.5713e-09     0.7       1     1     3
xx64            0.0503037      0.7       1     1     3
xx32            0.000792609    0.7       1     1     3
spooky64        0.0415268      0.7       1     1     3
spooky32        1.90627e-06    0.7       1     1     3
murmur3 32      1.57968e-07    0.7       1     1     3
fnv1a64         0.833133       0.7       1     1     3
fnv1a32         1.01819e-10    0.7       1     1     3
adler32         0              0.7       0     0     0
djb2_32         0.00151188     0.7       1     1     3
djb2_64         0.0428499      0.7       1     1     3

The k-mer test seemed to induce failures in all the 32-bit variants. While we can say these differ statistically from the uniform distribution, that does not mean the difference will significantly impact application performance. The output actually seems fairly well distributed, at least judging from the 50th, 75th, and 99th percentiles.

Sparse Collisions Test

While the non-uniformity test is simple to administer, interpreting its results can be challenging because you have to compare across distributions. This motivated me to develop a test to characterize the likelihood of observing a certain number of hash collisions across the entire dataset. The "Sparse Collisions Test" is simple: it hashes all the keys (for example, all the words in the German language) and counts the number of collisions. The real challenge lies in determining whether the number of collisions we measure is significant. Finding the likelihood of observing $q$ collisions when $k$ values are hashed is a variation on the famously unintuitive Birthday Problem.
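
The counting step itself is trivial; as a minimal sketch (my own illustration, again reusing djb2_32 from above, with a hypothetical count_collisions helper), each hash value that has already been seen counts as one collision:

use std::collections::HashSet;

// Hash every key and count how many hash values repeat. The full 32-bit
// output is used directly, before any reduction to a slot index.
fn count_collisions<'a, I>(keys: I, hash: impl Fn(&[u8]) -> u32) -> u64
where
    I: IntoIterator<Item = &'a [u8]>,
{
    let mut seen = HashSet::new();
    let mut collisions = 0;
    for key in keys {
        if !seen.insert(hash(key)) {
            collisions += 1;
        }
    }
    collisions
}

fn main() {
    let words = ["Hund", "Katze", "Maus"];
    let keys = words.iter().map(|w| w.as_bytes());
    println!("collisions: {}", count_collisions(keys, djb2_32));
}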

Characterizing the full distribution for each scenario proved difficult, and I believe there might not be a closed-form formula without approximation. After considerable effort, I was able to develop a formula to compute the likelihood of a specific number of collisions. It operates by summing a combinatorial formula over all partitions of the input space. Using dynamic programming, the exact distribution can be computed in $O(k^k)$ time and $O(k)$ space. This is only practical for small inputs. Fortunately, by limiting the space of partitions considered and eliminating those which would almost certainly not occur, I was able to make more progress. In the end, I was able to characterize the expected number of collisions within the private IP address space and the German word list for the 32-bit variants with an error on the order of $10^{-8}$. Further information about how the distribution was derived is included in an appendix.

German Word List

The expected probability distribution can be computed from the partitions formula:

[Figure: Expected German Word List Collision Distribution]
[Figure: Expected Cumulative German Word List Collision Distribution]

Algorithm     Collisions   Percentage
city64        0            0
city32        11           0.0036
xx64          0            0
xx32          10           0.0033
spooky64      0            0
spooky32      10           0.0033
murmur3 32    15           0.0049
fnv1a64       0            0
fnv1a32       18           0.0060
adler32       68006        22.32
djb2_32       17           0.0056
djb2_64       2            0.00066

For 32-bit hash functions, we should expect fewer than 22 collisions at $p = 0.001$, a criterion that only adler32 fails to meet. The 64-bit hash functions can be bounded by the Birthday Problem; accordingly, we expect no collisions to occur, and any number of collisions is statistically significant at the 0.001 level. Thus, we can say djb2_64 also differs significantly from a random hash function.
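
As a rough check on that bound (my own back-of-the-envelope arithmetic, assuming an ideal 64-bit hash), the probability of seeing even a single collision among the $k = 304{,}736$ German words is

P(\text{at least one collision}) \approx \frac{k(k-1)}{2 \cdot 2^{64}} \approx 2.5 \times 10^{-9} \ll 0.001

so even the two collisions recorded for djb2_64 are far outside what chance would explain.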

All Private IP Addresses

The expected probability distribution can be computed from the partitions formula:

[Figure: Expected Private IP Address Collision Distribution]
[Figure: Expected Cumulative Private IP Address Collision Distribution]

Algorithm     Collisions   Percentage
city64        0            0
city32        37534        0.21
xx64          0            0
xx32          0            0
spooky64      0            0
spooky32      37143        0.21
murmur3 32    0            0
fnv1a64       0            0
fnv1a32       0            0
adler32       17530308     97.98
djb2_32       17571285     98.21
djb2_64       17571285     98.21

For the 32-bit hash functions, we would expect fewer than 37,812 collisions at $p = 0.001$. As in the previous test, any collisions for the 64-bit hashes are significant at the 0.001 level. So for this test, djb2_32, adler32, and djb2_64 perform significantly worse than would be expected from a random hash function. On the other hand, fnv1a_32, xx32, and murmur3 actually perform significantly better than would be expected from a random hash function. city32 and spooky32 perform in line with our expectations.
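
For context (my own approximation, not part of the original analysis), the expected number of collisions for an ideal 32-bit hash over $k = 17{,}891{,}328$ keys is roughly

E[Q] \approx \frac{k^2}{2 \cdot 2^{32}} \approx 3.7 \times 10^{4}

which lines up with the counts observed for city32 (37,534) and spooky32 (37,143), as well as the $p = 0.001$ cutoff of 37,812.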

Unique K-mers in GRCh38

With my current methods, I can't compute the expected probability distribution for $k = 203091438$; it's too computationally expensive. I ran the tests anyway so I could list the results.

Algorithm     Collisions   Percentage
city64        0            0
city32        4726992      2.33
xx64          0            0
xx32          4707102      2.33
spooky64      0            0
spooky32      4726688      2.33
murmur3 32    4723849      2.33
fnv1a64       0            0
fnv1a32       4724280      2.33
adler32       202966890    99.94
djb2_32       5324427      2.62
djb2_64       0            0

Conclusion

After conducting all these experiments, my biggest takeaway is that hash benchmarking suites are probably not measuring real hashing performance. In these tests, fnv1a, a simple hash function from the early 90s, held up remarkably well. While I think measuring the randomness of hash functions is interesting both theoretically and as a fun engineering exercise, I believe these hyper-optimized hash functions offer very marginal benefits. Of course, I am open to changing my mind. This would happen if I am presented with a real-world dataset that elicits bad behavior from a simple hash function like fnv1a. There might be some dataset for which city and spooky outperform their simpler predecessors. You can't really prove that these hash functions are "good"; you can only show that under certain situations they perform poorly.

Many early hash functions like adler32 and djb2 were designed in an era when hashing performance was an important consideration, and they were typically used for specific applications. adler32 was used in zlib, where entropy was abundant. This accounts for its significant shortcomings with short string inputs. I believe djb2 was designed for ASCII strings. ASCII data, like German and English words, contains a lot of inherent entropy, meaning that weaker hash functions like djb2 perform quite well. The main issue with djb2 is that the multiplier does not provide avalanching over the entire output space. Replacing 33 with a larger multiplier, like 22695477, considerably boosts its performance. I think the reason Bernstein used 33 is that he designed the hash in the 90s, when computing resources were limited: the multiplication by 33 can be replaced with a bit shift and an addition.
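
A minimal sketch of that variant (my own illustration, not Bernstein's code): the shift-and-add trick is swapped for a full multiplication by the larger constant.

// djb2-style hash with the larger multiplier mentioned above in place of 33.
// Wrapping arithmetic preserves the modulo-2^32 behavior of the original.
fn djb2_32_large_multiplier(bytes: &[u8]) -> u32 {
    let mut hash: u32 = 5381;
    for &b in bytes {
        hash = hash.wrapping_mul(22_695_477).wrapping_add(b as u32);
    }
    hash
}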


Appendix

Deriving the Expected Collision Distribution

In order to characterize the collision distribution, we want to obtain the probability that $q$ collisions occur, $P(Q = q)$, for an idealized random hash function.

Let us define the hash function over some alphabet, $\Sigma$. This hash function maps an arbitrary input to one of the $n$ slots, that is, $f: \Sigma^{\mathbb{N}} \to [1, n]$. Each input has a $\frac{1}{n}$ probability of mapping to each output. We are interested in the probability that $q$ collisions occur within a set of $k$ values.

The distribution of hashes over the slots is multinomial, since the number of trials is fixed, the trials are independent, and there is a fixed probability $p_i = \frac{1}{n}$ that a value hashes into each bucket. Therefore, the probability of a specific assignment of slot counts $b_0, b_1, \cdots, b_{n-1}$ is given by $\frac{k!}{n^k \prod_{i=0}^{n-1} b_i!}$ where $\sum_{i=0}^{n-1} b_i = k$. We could evaluate this by considering all possible distributions of values in the buckets, summing the probabilities of each distribution that contributes to $q$ collisions. However, many of these are duplicative. For example, if $n = 2$ and $k = 4$, $b_0 = 1, b_1 = 3$ and $b_0 = 3, b_1 = 1$ occur with equal likelihood. Thus, we can compute the probability of achieving any set of bucket-count multiplicities $c_0, c_1, \cdots, c_k$, where $c_i$ is the number of buckets containing exactly $i$ values, by multiplying the probability of one such outcome by the number of ways in which it can occur:

P\left(c_0, c_1, \cdots, c_k\right) = \frac{\left( \sum_{j = 0}^{k} c_j \right)!}{\prod_{i = 0}^{k} c_i!} \cdot \frac{k!}{n^k \prod_{i=0}^{k} i!^{\,c_i}}

To calculate the probability that $q$ collisions occur, this needs to be summed over all partitions of $k$. That is, all natural-number coefficients $c_1, c_2, \cdots, c_k$ which satisfy $k = \sum_{i=1}^k i \cdot c_i$. The number of collisions is given by $q = \sum_{j=2}^{k} c_j (j - 1)$.
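
To make the notation concrete (a small example of my own), for $k = 4$ keys:

4 = 1 + 1 + 2 \;\Rightarrow\; c_1 = 2,\ c_2 = 1,\ q = c_2(2 - 1) = 1 \qquad\qquad 4 = 4 \;\Rightarrow\; c_4 = 1,\ q = c_4(4 - 1) = 3

That is, two singleton buckets plus one bucket holding a pair yields one collision, while all four keys landing in a single bucket yields three.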

I would have liked to obtain a closed-form formula, even via an approximation. But, there is no known closed-form formula for partitions. If anyone knows of an appropriate approximation, let me know.

Computing the Expected Collision Distribution

The equation given above can be computed efficiently using a few approximations. First, factorials can be approximated through the log-gamma function, computed via the Lanczos approximation to roughly 16 significant digits. The log-gamma values for $[0, k] \cup [n-k, n]$ can be cached to make these calls $O(1)$. The expected collision distribution can then be computed through a depth-first search over the partition space. Unfortunately, the number of partitions grows exponentially; for example, there are around $10^{60}$ possible partitions for the German word list dataset. However, many possible outcomes have a near-zero likelihood of occurring. For example, the likelihood that all $k$ values hash to the same bucket is $\left(\frac{1}{n}\right)^{k-1}$. Near-perfect approximations can be obtained by limiting the depth of the search and the number of partitions considered at a given depth.
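
For small cases, the distribution can also be sanity-checked by simulation (my own cross-check, not part of the original tooling); the collision count in a single trial is just $k$ minus the number of occupied buckets:

// Monte Carlo estimate of P(Q = q) for an idealized random hash with
// k keys and n slots. Uses a small xorshift PRNG to stay dependency-free.
fn simulate_collision_distribution(k: usize, n: usize, trials: usize) -> Vec<f64> {
    let mut counts = vec![0u64; k]; // counts[q] = number of trials with q collisions
    let mut state: u64 = 0x9E3779B97F4A7C15;

    for _ in 0..trials {
        let mut occupied = vec![false; n];
        let mut distinct = 0;
        for _ in 0..k {
            // xorshift64 step; the result modulo n picks a random slot
            state ^= state << 13;
            state ^= state >> 7;
            state ^= state << 17;
            let slot = (state % n as u64) as usize;
            if !occupied[slot] {
                occupied[slot] = true;
                distinct += 1;
            }
        }
        counts[k - distinct] += 1; // q = k - (number of occupied slots)
    }

    counts.iter().map(|&c| c as f64 / trials as f64).collect()
}

fn main() {
    // Toy parameters; the real datasets are far too large for this approach,
    // but small cases can be checked against the partition formula.
    let probs = simulate_collision_distribution(20, 64, 100_000);
    for (q, p) in probs.iter().enumerate().take(6) {
        println!("P(Q = {q}) ~ {p:.4}");
    }
}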

Additional Results

I have made all my results, as well as the program I used to compute the collision distributions, available in a Git repository.