Last Updated: 2023-12-01

Are simple hash functions good enough?

Hash functions are critical components of almost every computer program and a basic building block of data structures. They are used to retrieve data, perform fast similarity searches, implement caches, route network traffic, and count objects, to name just a few applications. All of these applications rely on a property of some hash functions: that they map inputs to a set of outputs in a uniform manner. For instance, associative arrays - often called hash maps, dictionaries, or unordered maps by software engineers - rely on a hash function that uniformly maps keys to a series of 'slots' which store information about the values. If a hash function is too biased, it can force the program to fall back on slow collision-resolution algorithms - often making computations infeasible.

In practice, most hash functions are designed to distribute data uniformly at random across a codomain, say of size $n$, so that any key $k$ chosen at random from the domain has a probability $\frac{1}{n}$ of mapping to each output value. Take, for example, Daniel J. Bernstein's djb2_32 hash:

use std::num::Wrapping;

fn djb2_32(bytes: &[u8]) -> u32 {
    // Initial value h_0 = 5381
    let mut hash = Wrapping(5381);

    for &b in bytes {
        // hash = 33 * hash + b, computed as (hash << 5) + hash with wrapping arithmetic
        hash = (hash << 5) + hash + Wrapping(b as u32);
    }

    hash.0
}

This forms the relation:

h_m = \left( 33 \; h_{m-1} + b_{m-1} \right) \mod 2^{32}, \quad h_0 = 5381

While extremely simple, this hash function is actually quite clever. It takes the form of a linear-congruential generator (LCG), applying an affine transformation followed by a modulus. We can view this in two ways. Say we consume $m$ bytes; our sequence will take the form:

h_m = \left( 33^m h_0 \mod 2^{32} + 33^{m-1} b_0 \mod 2^{32} + \cdots + 33 \, b_{m-2} \mod 2^{32} + b_{m-1} \mod 2^{32} \right) \mod 2^{32}

Which, in effect, means that $h_m$ is the sum of $m$ different LCGs modulo $2^{32}$. From another perspective, say we consume a sequence of fixed bytes $b_m = B$. In this scenario, if $B$ is odd (and therefore coprime to $2^{32}$), the recurrence satisfies the Hull-Dobell Theorem and forms an LCG with a full period of $2^{32}$. (Doesn't your linear-congruential generator satisfy the Hull-Dobell Theorem?) I imagine this was the reasoning behind choosing these specific coefficients.
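
For reference (a standard statement of the theorem, not part of the original post), the Hull-Dobell Theorem says that the LCG $h_m = (a \, h_{m-1} + c) \mod M$ attains the full period $M$ for every seed exactly when

\gcd(c, M) = 1, \qquad a \equiv 1 \pmod{p} \text{ for every prime } p \mid M, \qquad a \equiv 1 \pmod{4} \text{ if } 4 \mid M

With $a = 33$ and $M = 2^{32}$, the last two conditions hold because $a - 1 = 32$ is divisible by both 2 and 4, so only the additive constant, here the fixed byte $B$, needs to be odd.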

While many non-cryptographic hash functions are available, a number of fast ones have been designed in the past decade. Some recent examples include Murmur, CityHash, XXHash, and t1hash. These functions claim superior performance in terms of both speed and quality, although they come with a complexity trade-off: (1) they are often architecture-specific, (2) some perform unaligned accesses, and (3) they often require language-specific FFI bindings. Many benchmarks claim that simple hash functions, like fnv1a, have "serious quality issues." However, many of these benchmarks are also unrealistic, involving the construction of worst-case key pairs and checking that there are no patterns in the hash outputs. So, I wanted to answer the question: Do any of these simple hash functions break down on real-world datasets? If so, what are their failure modes? To do this, I designed two tests that simulate real-world use cases and ran a number of hash functions across three datasets.

Hash Functions Under Test

I gathered some simple "low quality" hash functions as well as some "high quality" hash functions. These include:

Low quality hash functions:

  • djb2
  • fnv1a
  • adler32

High quality hash functions:

  • city (CityHash)
  • xx (xxHash)
  • spooky (SpookyHash)
  • murmur3 (MurmurHash3)

Some of these hashes also have 64-bit variants including city, xx, spooky, fnv1a, and djb2.

Datasets

I wanted to test each hash function on a variety of large datasets that provide sample scenarios from networking, bioinformatics, and natural language processing. These include:

  • All words in the English, German, and French languages as provided by the GNU ASpell dictionary version 2.1.
    • The French dictionary contains 221,377 words.
    • The American English dictionary contains 123,985 words.
    • The German dictionary contains 304,736 words.
  • All possible private IPv4 addresses, as unsigned bytes in network byte order (see the sketch after this list).
    • This includes 17,891,328 addresses.
    • Most addresses are contiguous, differing in a single bit.
  • All unique 14-mers (i.e., length-14 substrings) in the human genome (all contigs from GRCh38).
    • For example, 'AAGAGTCAGTTATT' is a 14-mer.
    • Comprising 203,091,438 unique length-14 substrings from the human genome. These sequences cover most of the possible combinations in the genome's four-character alphabet (A, T, C, G).
    • This dataset is challenging to hash because the differentiating information is contained in a small subset of the input bits.
    • I did not canonicalize these k-mers, so they could be drawn from either the 5'→3' or 3'→5' strand.
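
To make the second dataset concrete, here is a minimal sketch (my own reconstruction, not the original tooling) of how the private-address keys can be generated; the three RFC 1918 blocks sum to the 17,891,328 addresses listed above:

use std::net::Ipv4Addr;

// Enumerate every private (RFC 1918) IPv4 address and emit it as four
// unsigned bytes in network byte order: 2^24 + 2^20 + 2^16 = 17,891,328 keys.
fn private_ipv4_keys() -> Vec<[u8; 4]> {
    let ranges: [(u32, u32); 3] = [
        (u32::from(Ipv4Addr::new(10, 0, 0, 0)), 1 << 24),    // 10.0.0.0/8
        (u32::from(Ipv4Addr::new(172, 16, 0, 0)), 1 << 20),  // 172.16.0.0/12
        (u32::from(Ipv4Addr::new(192, 168, 0, 0)), 1 << 16), // 192.168.0.0/16
    ];

    let mut keys = Vec::with_capacity(17_891_328);
    for (start, len) in ranges {
        for addr in start..start + len {
            // to_be_bytes() yields the big-endian (network byte order) representation
            keys.push(addr.to_be_bytes());
        }
    }
    keys
}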

Multinomial Non-Uniformity Test

In practice, most hash functions are used to associate an item with a specific 'slot' in memory, and many algorithms depend on the premise that the distribution of items across these slots is no worse than what a uniform random assignment would produce. This test is unique in modeling the real-world behavior of the hash function rather than its behavior under a synthetic benchmark. Since the ranges of the hash functions are large (i.e., either $2^{32}$ or $2^{64}$), we need to choose a function that maps these to our slot index, $b_i$. This is most commonly accomplished by taking the modulus of the output with the number of slots. After hashing $k$ items, the resulting distribution should be modeled by a multinomial across the $n$ slots. On average, $\frac{k}{n}$ values should hash to each slot, and we can use the $\chi^2$ distribution to test whether the distribution differs significantly from what a random hash function would produce. By assuming $n$ is sufficiently large, we can compute the test statistic using the following formula:

\chi^2 = \sum_{i = 0}^{n-1} \frac{\left(b_i - \frac{k}{n}\right)^2}{\frac{k}{n}}

Then, compute the $p$-value using the chi-squared CDF. I performed around 60 of these tests across all the datasets, so I'll only list a few here.
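
As a minimal sketch of this procedure (my own illustration, reusing the djb2_32 function from above; the helper name is mine and the final CDF step is left to a statistics library), the slot assignment and test statistic look like this:

// Bin hashed keys into n slots via a modulus and compute the chi-squared
// statistic against the uniform expectation of k/n items per slot.
fn chi_squared_statistic(hashes: &[u64], n_slots: usize) -> f64 {
    let mut buckets = vec![0u64; n_slots];
    for &h in hashes {
        buckets[(h % n_slots as u64) as usize] += 1;
    }

    let k = hashes.len() as f64;
    let expected = k / n_slots as f64;

    buckets
        .iter()
        .map(|&b| {
            let diff = b as f64 - expected;
            diff * diff / expected
        })
        .sum()
}

fn main() {
    let words = ["Hund", "Katze", "Maus", "Vogel"];
    let hashes: Vec<u64> = words.iter().map(|w| djb2_32(w.as_bytes()) as u64).collect();
    let stat = chi_squared_statistic(&hashes, 1024);
    // The p-value comes from the chi-squared CDF with n - 1 degrees of freedom,
    // e.g. via a statistics crate; that step is omitted here.
    println!("chi-squared statistic: {stat}");
}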

German Word List

$|b| = 1024$

hash_function   p_value     average   p50   p75   p99
city64          0.041558    297.594   298   309   341
city32          0.558001    297.594   298   310   338
xx64            0.0619553   297.594   298   309   342
xx32            0.146449    297.594   297   310   340
spooky64        0.978192    297.594   297   309   337
spooky32        0.978192    297.594   297   309   337
murmur3 32      0.288355    297.594   298   309   335
fnv1a64         0.073901    297.594   298   309   340
fnv1a32         0.19892     297.594   297   310   340
adler32         0           297.594   271   357   548
djb2_32         0.499734    297.594   298   309   336
djb2_64         0.499734    297.594   298   309   336

$|b| = 1031$

hash_function   p_value     average   p50   p75   p99
city64          0.394728    295.573   295   307   336
city32          0.658379    295.573   295   306   337
xx64            0.349555    295.573   295   307   335
xx32            0.809289    295.573   295   307   335
spooky64        0.944751    295.573   295   307   333
spooky32        0.0605966   295.573   296   307   339
murmur3 32      0.421352    295.573   295   307   339
fnv1a64         0.224086    295.573   296   307   337
fnv1a32         0.84226     295.573   295   306   336
adler32         0.784829    295.573   295   307   337
djb2_32         0.992628    295.573   296   306   332
djb2_64         0.487786    295.573   296   308   334

$|b| = 435337$ (load factor $k / |b| \approx 0.7$)

hash_function   p_value     average   p50   p75   p99
city64          0.433201    0.7       1     1     3
city32          0.81538     0.7       1     1     3
xx64            0.46347     0.7       1     1     3
xx32            0.737051    0.7       1     1     3
spooky64        0.0797217   0.7       1     1     3
spooky32        0.390342    0.7       1     1     3
murmur3 32      0.696641    0.7       1     1     3
fnv1a64         0.0207139   0.7       1     1     3
fnv1a32         0.648008    0.7       1     1     3
adler32         0           0.7       0     1     4
djb2_32         0.110557    0.7       1     1     3
djb2_64         0.0788193   0.7       1     1     3

With the exception of adler32, all the hash functions hold up well against these ASCII inputs. When the number of slots is prime and the table size is small, adler32 performs at its best. I think that's likely because the sum wraps around the modulus, creating something closer to a uniform distribution, though this does not necessarily mean it should be used.

Private IP Ranges

$|b| = 65536$

hash_function   p_value     average   p50   p75   p99
city64          0.161976    273       273   284   313
city32          0.550778    273       273   284   312
xx64            0.150364    273       273   284   312
xx32            0.962877    273       273   284   312
spooky64        0.960136    273       273   284   312
spooky32        0.960136    273       273   284   312
murmur3 32      0.229322    273       273   284   312
fnv1a64         1           273       273   278   284
fnv1a32         1           273       273   275   280
adler32         0           273       0     0     1653
djb2_32         0           273       254   303   372
djb2_64         0           273       254   303   372

$|b| = 25559057$ (load factor $k / |b| \approx 0.7$)

hash_function   p_value     average   p50   p75   p99
city64          0.727569    0.7       1     1     3
city32          0.738734    0.7       1     1     3
xx64            0.510211    0.7       1     1     3
xx32            1           0.7       1     1     3
spooky64        0.331874    0.7       1     1     3
spooky32        0.823507    0.7       1     1     3
murmur3 32      1           0.7       1     1     3
fnv1a64         1           0.7       1     1     3
fnv1a32         1           0.7       1     1     3
adler32         0           0.7       0     0     0
djb2_32         0           0.7       0     0     55
djb2_64         0           0.7       0     0     55

This is likely the most challenging test of the three because many of these IPs are differentiated by single bits. Both adler32 and djb2_32 fail. In particular, the adler32 hash function distributes hashes amongst only 1% of the allocated buckets! Interestingly enough, for $|b| = 65536$, fnv1a seems to distribute the values very uniformly. In expectation, the 99th percentile should approach 312; interestingly, this doesn't happen for fnv1a. (I could probably compute the 17712414th order statistic to determine whether this is significant, but that seems like a bit of a nightmare.)

All k-mers in GRCh38

$|b| = 65536$

hash_function   p_value     average   p50    p75    p99
city64          0.808561    3098.93   3099   3136   3229
city32          0.238145    3098.93   3099   3136   3229
xx64            0.0837374   3098.93   3099   3137   3230
xx32            0.388023    3098.93   3099   3137   3229
spooky64        0.0890488   3098.93   3099   3136   3230
spooky32        0.0890488   3098.93   3099   3136   3230
murmur3 32      0.754288    3098.93   3099   3136   3230
fnv1a64         1           3098.93   3099   3136   3227
fnv1a32         0.99998     3098.93   3099   3136   3228
adler32         0           3098.93   0      0      0
djb2_32         0           3098.93   3054   4210   4975
djb2_64         0           3098.93   3054   4210   4975

$|b| = 290130625$ (load factor $k / |b| \approx 0.7$)

hash_function   p_value        average   p50   p75   p99
city64          0.567872       0.7       1     1     3
city32          8.5713e-09     0.7       1     1     3
xx64            0.0503037      0.7       1     1     3
xx32            0.000792609    0.7       1     1     3
spooky64        0.0415268      0.7       1     1     3
spooky32        1.90627e-06    0.7       1     1     3
murmur3 32      1.57968e-07    0.7       1     1     3
fnv1a64         0.833133       0.7       1     1     3
fnv1a32         1.01819e-10    0.7       1     1     3
adler32         0              0.7       0     0     0
djb2_32         0.00151188     0.7       1     1     3
djb2_64         0.0428499      0.7       1     1     3

The k-mer test seemed to induce failures in all the 32-bit variants. While we can say these differ statistically from the uniform distribution, that does not mean the difference will significantly impact application performance. The output actually seems fairly well distributed, at least judging from the 50th, 75th, and 99th percentiles.

Sparse Collisions Test

While the non-uniformity test is simple to administer, interpreting its results can be challenging because you have to compare across distributions. This motivated me to develop a test to characterize the likelihood of observing a certain number of hash collisions across the entire dataset. The "Sparse Collisions Test" is simple: it hashes all the keys (for example, all the words in the German language) and counts the number of collisions. The real challenge lies in determining whether the number of collisions we measure is significant. Finding the likelihood of observing $q$ collisions when $k$ values are hashed is a variation on the famously unintuitive Birthday Problem.
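
The counting step itself is trivial; as a minimal sketch (my own illustration, again reusing djb2_32 from above, with a hypothetical count_collisions helper), each hash value that has already been seen counts as one collision:

use std::collections::HashSet;

// Hash every key and count how many hash values repeat. The full 32-bit
// output is used directly, before any reduction to a slot index.
fn count_collisions<'a, I>(keys: I, hash: impl Fn(&[u8]) -> u32) -> u64
where
    I: IntoIterator<Item = &'a [u8]>,
{
    let mut seen = HashSet::new();
    let mut collisions = 0;
    for key in keys {
        if !seen.insert(hash(key)) {
            collisions += 1;
        }
    }
    collisions
}

fn main() {
    let words = ["Hund", "Katze", "Maus"];
    let keys = words.iter().map(|w| w.as_bytes());
    println!("collisions: {}", count_collisions(keys, djb2_32));
}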

Characterizing the full distribution for each scenario proved difficult, and I believe there might not be a closed-form formula without approximation. After considerable effort, I was able to develop a formula to compute the likelihood of a specific number of collisions. It operates by summing a combinatorial formula over all partitions of the input space. Using dynamic programming, the exact distribution can be computed in $O(k^k)$ time and $O(k)$ space. This is only practical for small inputs. Fortunately, by limiting the space of partitions considered and eliminating those which would almost certainly not occur, I was able to make more progress. In the end, I was able to characterize the expected number of collisions within the private IP address space and the German word list for the 32-bit variants with an error on the order of $10^{-8}$. Further information about how the distribution was derived is included in an appendix.

German Word List

The expected probability distribution can be computed from the partitions formula:

[Figure: Expected German Word List Collision Distribution]
[Figure: Expected Cumulative German Word List Collision Distribution]

Algorithm     Collisions   Percentage
city64        0            0
city32        11           0.0036
xx64          0            0
xx32          10           0.0033
spooky64      0            0
spooky32      10           0.0033
murmur3 32    15           0.0049
fnv1a64       0            0
fnv1a32       18           0.0060
adler32       68006        22.32
djb2_32       17           0.0056
djb2_64       2            0.00066

For 32-bit hash functions, we should expect fewer than 22 collisions at $p = 0.001$, a criterion that only adler32 fails to meet. The 64-bit hash functions can be bounded by the Birthday Problem; accordingly, we expect no collisions to occur, and any number of collisions is statistically significant at the 0.001 level. Thus, we can say djb2_64 also differs significantly from a random hash function.
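
As a rough check on that bound (my own back-of-the-envelope arithmetic, assuming an ideal 64-bit hash), the probability of seeing even a single collision among the $k = 304{,}736$ German words is

P(\text{at least one collision}) \approx \frac{k(k-1)}{2 \cdot 2^{64}} \approx 2.5 \times 10^{-9} \ll 0.001

so even the two collisions recorded for djb2_64 are far outside what chance would explain.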

All Private IP Addresses

The expected probability distribution can be computed from the partitions formula:

[Figure: Expected Private IP Address Collision Distribution]
[Figure: Expected Cumulative Private IP Address Collision Distribution]

Algorithm     Collisions   Percentage
city64        0            0
city32        37534        0.21
xx64          0            0
xx32          0            0
spooky64      0            0
spooky32      37143        0.21
murmur3 32    0            0
fnv1a64       0            0
fnv1a32       0            0
adler32       17530308     97.98
djb2_32       17571285     98.21
djb2_64       17571285     98.21

For the 32-bit hash functions, we would expect fewer than 37,812 collisions at $p = 0.001$. As in the previous test, any collisions for the 64-bit hashes are significant at the 0.001 level. So for this test, djb2_32, adler32, and djb2_64 perform significantly worse than would be expected from a random hash function. On the other hand, fnv1a_32, xx32, and murmur3 actually perform significantly better than would be expected from a random hash function. city32 and spooky32 perform in line with our expectations.
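
For context (my own approximation, not part of the original analysis), the expected number of collisions for an ideal 32-bit hash over $k = 17{,}891{,}328$ keys is roughly

E[Q] \approx \frac{k^2}{2 \cdot 2^{32}} \approx 3.7 \times 10^{4}

which lines up with the counts observed for city32 (37,534) and spooky32 (37,143), as well as the $p = 0.001$ cutoff of 37,812.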

Unique K-mers in GRCh38

With my current methods, I can't compute the expected probability distribution for $k = 203091438$; it's too computationally expensive. I ran the tests anyway so I could list the results.

Algorithm     Collisions   Percentage
city64        0            0
city32        4726992      2.33
xx64          0            0
xx32          4707102      2.33
spooky64      0            0
spooky32      4726688      2.33
murmur3 32    4723849      2.33
fnv1a64       0            0
fnv1a32       4724280      2.33
adler32       202966890    99.94
djb2_32       5324427      2.62
djb2_64       0            0

Conclusion

After conducting all these experiments, my biggest takeaway is that hash benchmarking suites are probably not measuring real hashing performance. In these tests, fnv1a, a simple hash function from the early 90s, held up remarkably well. While I think measuring the randomness of hash functions is interesting both theoretically and as a fun engineering exercise, I believe these hyper-optimized hash functions offer very marginal benefits. Of course, I am open to changing my mind. This would happen if I am presented with a real-world dataset that elicits bad behavior from a simple hash function like fnv1a. There might be some dataset for which city and spooky outperform their simpler predecessors. You can't really prove that these hash functions are "good"; you can only show that under certain situations they perform poorly.

Many early hash functions like adler32 and djb2 were designed in an era when hashing performance was an important consideration, and they were typically used for specific applications. adler32 was used in zlib, where entropy was abundant. This accounts for its significant shortcomings with short string inputs. I believe djb2 was designed for ASCII strings. ASCII data, like German and English words, contains a lot of inherent entropy, meaning that weaker hash functions like djb2 perform quite well. The main issue with djb2 is that the multiplier does not provide avalanching over the entire output space. Replacing 33 with a larger multiplier, like 22695477, considerably boosts its performance. I think the reason Bernstein used 33 is that he designed the hash in the 90s, when computing resources were limited: the multiplication by 33 can be replaced with a bit shift and an addition.
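
A minimal sketch of that variant (my own illustration, not Bernstein's code): the shift-and-add trick is swapped for a full multiplication by the larger constant.

// djb2-style hash with the larger multiplier mentioned above in place of 33.
// Wrapping arithmetic preserves the modulo-2^32 behavior of the original.
fn djb2_32_large_multiplier(bytes: &[u8]) -> u32 {
    let mut hash: u32 = 5381;
    for &b in bytes {
        hash = hash.wrapping_mul(22_695_477).wrapping_add(b as u32);
    }
    hash
}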


Appendix

Deriving the Expected Collision Distribution

In order to characterize the collision distribution, we want to obtain the probability that $q$ collisions occur, $P(Q = q)$, for an idealized random hash function.

Let us define the hash function over some alphabet, $\Sigma$. This hash function maps an arbitrary input to one of the $n$ slots, that is, $f: \Sigma^{\mathbb{N}} \to [1, n]$. Each input has a $\frac{1}{n}$ probability of mapping to each output. We are interested in the probability that $q$ collisions occur within a set of $k$ values.

The distribution of hashes over the slots is multinomial, since the number of trials is fixed, the trials are independent, and there is a fixed probability $p_i = \frac{1}{n}$ that a value hashes into each bucket. Therefore, the probability of a specific assignment of slot counts $b_0, b_1, \cdots, b_{n-1}$ is given by $\frac{k!}{n^k \prod_{i=0}^{n-1} b_i!}$ where $\sum_{i=0}^{n-1} b_i = k$. We could evaluate this by considering all possible distributions of values in the buckets, summing the probabilities of each distribution that contributes to $q$ collisions. However, many of these are duplicative. For example, if $n = 2$ and $k = 4$, $b_0 = 1, b_1 = 3$ and $b_0 = 3, b_1 = 1$ occur with equal likelihood. Thus, we can compute the probability of achieving any set of bucket-count multiplicities $c_0, c_1, \cdots, c_k$, where $c_i$ is the number of buckets containing exactly $i$ values, by multiplying the probability of one such outcome by the number of ways in which it can occur:

P\left(c_0, c_1, \cdots, c_k\right) = \frac{\left( \sum_{j = 0}^{k} c_j \right)!}{\prod_{i = 0}^{k} c_i!} \cdot \frac{k!}{n^k \prod_{i=0}^{k} i!^{\,c_i}}

To calculate the probability that $q$ collisions occur, this needs to be summed over all partitions of $k$. That is, all natural-number coefficients $c_1, c_2, \cdots, c_k$ which satisfy $k = \sum_{i=1}^k i \cdot c_i$. The number of collisions is given by $q = \sum_{j=2}^{k} c_j (j - 1)$.
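
To make the notation concrete (a small example of my own), for $k = 4$ keys:

4 = 1 + 1 + 2 \;\Rightarrow\; c_1 = 2,\ c_2 = 1,\ q = c_2(2 - 1) = 1 \qquad\qquad 4 = 4 \;\Rightarrow\; c_4 = 1,\ q = c_4(4 - 1) = 3

That is, two singleton buckets plus one bucket holding a pair yields one collision, while all four keys landing in a single bucket yields three.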

I would have liked to obtain a closed-form formula, even via an approximation. But, there is no known closed-form formula for partitions. If anyone knows of an appropriate approximation, let me know.

Computing the Expected Collision Distribution

The equation given above can be computed efficiently using a few approximations. First, factorials can be approximated through the log-gamma function, computed via the Lanczos approximation to roughly 16 significant digits. The log-gamma values for $[0, k] \cup [n-k, n]$ can be cached to make these calls $O(1)$. The expected collision distribution can then be computed through a depth-first search over the partition space. Unfortunately, the number of partitions grows exponentially; for example, there are around $10^{60}$ possible partitions for the German word list dataset. However, many possible outcomes have a near-zero likelihood of occurring. For example, the likelihood that all $k$ values hash to the same bucket is $\left(\frac{1}{n}\right)^{k-1}$. Near-perfect approximations can be obtained by limiting the depth of the search and the number of partitions considered at a given depth.
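
For small cases, the distribution can also be sanity-checked by simulation (my own cross-check, not part of the original tooling); the collision count in a single trial is just $k$ minus the number of occupied buckets:

// Monte Carlo estimate of P(Q = q) for an idealized random hash with
// k keys and n slots. Uses a small xorshift PRNG to stay dependency-free.
fn simulate_collision_distribution(k: usize, n: usize, trials: usize) -> Vec<f64> {
    let mut counts = vec![0u64; k]; // counts[q] = number of trials with q collisions
    let mut state: u64 = 0x9E3779B97F4A7C15;

    for _ in 0..trials {
        let mut occupied = vec![false; n];
        let mut distinct = 0;
        for _ in 0..k {
            // xorshift64 step; the result modulo n picks a random slot
            state ^= state << 13;
            state ^= state >> 7;
            state ^= state << 17;
            let slot = (state % n as u64) as usize;
            if !occupied[slot] {
                occupied[slot] = true;
                distinct += 1;
            }
        }
        counts[k - distinct] += 1; // q = k - (number of occupied slots)
    }

    counts.iter().map(|&c| c as f64 / trials as f64).collect()
}

fn main() {
    // Toy parameters; the real datasets are far too large for this approach,
    // but small cases can be checked against the partition formula.
    let probs = simulate_collision_distribution(20, 64, 100_000);
    for (q, p) in probs.iter().enumerate().take(6) {
        println!("P(Q = {q}) ~ {p:.4}");
    }
}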

Additional Results

I have made all my results, as well as the program I used to compute the collision distributions, available in a Git repository.