# Are simple hash functions good enough?

Hash functions are critical components of almost every computer program and a basic building block of data structures. They are used to retrieve data, perform fast similarity searches, implement caches, route network traffic, and count objects, to name just a few applications. All of these applications rely on a property of some hash functions: that they map inputs to a set of outputs in a uniform manner. For instance, associative arrays (often called hash maps, dictionaries, or unordered maps by software engineers) rely on a hash function that uniformly maps keys to a series of 'slots' which store information about the values. If a hash function is too biased, it can cause the program to fall back on slow collision resolution algorithms, often making computations infeasible.

In practice, most hash functions are designed to distribute data uniformly at random across a codomain, say of size $n$, so that any key $k$ chosen at random from the domain has a probability $\frac{1}{n}$ of mapping to each output value. Take, for example, Daniel J. Bernstein's `djb2_32` hash:

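```rust
use std::num::Wrapping;

/// Bernstein's djb2 hash with a 32-bit state.
fn djb2_32(bytes: &[u8]) -> u32 {
    // Wrapping arithmetic provides the implicit reduction modulo 2^32.
    let mut hash = Wrapping(5381u32);
    for &b in bytes {
        // (hash << 5) + hash is hash * 33.
        hash = (hash << 5) + hash + Wrapping(b as u32);
    }
    hash.0
}
```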

This forms the relation:

`$h_m = \left( 33 \; h_{m-1} + b_{m-1}\right) \mod 2^{32}, \; h_0 = 5381$`

While extremely simple, this hash function is actually quite clever. It takes the form of a linear-congruential generator (LCG), applying an affine transformation followed by a modulus. We can view this in two ways. Say we consume $m$ bytes; our sequence will take the form:

`$h_m = (33^m h_0 \mod 2^{32} + 33^{m-1} b_0 \mod 2^{32} + \cdots + 33 b_{m-2} \mod 2^{32} + b_{m-1} \mod 2^{32}) \mod 2^{32}$`

In effect, this means that $h_m$ is the sum of $m$ different LCGs modulo $2^{32}$. From another perspective, say we consume a sequence of fixed bytes $b_m = B$. In this scenario, if $B$ is odd, the recurrence satisfies the Hull-Dobell Theorem (the increment $B$ is coprime to $2^{32}$, and $33 - 1 = 32$ is divisible by 4) and forms an LCG with the full period of $2^{32}$. (Doesn't your linear-congruential generator satisfy the Hull-Dobell Theorem?) I imagine this was the reasoning behind choosing these specific coefficients.
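As a quick sanity check on the unrolled sum, here is a minimal sketch that computes the hash both ways. The name `djb2_32_unrolled` is my own, and `djb2_32` is rewritten with `wrapping_mul` purely for brevity; it is meant to behave the same as the version above.

```rust
/// Iterative form: h_m = 33 * h_{m-1} + b_{m-1} (mod 2^32), h_0 = 5381.
fn djb2_32(bytes: &[u8]) -> u32 {
    bytes
        .iter()
        .fold(5381u32, |h, &b| h.wrapping_mul(33).wrapping_add(b as u32))
}

/// Unrolled form: 33^m * h_0 + sum over i of 33^(m-1-i) * b_i (mod 2^32).
/// Hypothetical helper, written only to illustrate the expansion.
fn djb2_32_unrolled(bytes: &[u8]) -> u32 {
    let m = bytes.len() as u32;
    let mut sum = 33u32.wrapping_pow(m).wrapping_mul(5381);
    for (i, &b) in bytes.iter().enumerate() {
        let exponent = m - 1 - i as u32;
        sum = sum.wrapping_add(33u32.wrapping_pow(exponent).wrapping_mul(b as u32));
    }
    sum
}
```

Both functions should agree on every input, since reduction modulo $2^{32}$ commutes with addition and multiplication.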

Many non-cryptographic hash functions are available, and a number of fast ones have been designed in the past decade. Some recent examples include Murmur, CityHash, XXHash, and t1hash. These functions claim superior performance in terms of both speed and quality, although they come with a complexity trade-off: (1) they are often architecture-specific, (2) some perform unaligned accesses, and (3) they often require language-specific FFI bindings. Many benchmarks claim that simple hash functions, like `fnv1a`, have "serious quality issues." However, many of these benchmarks are also unrealistic, involving the construction of worst-case key pairs and ensuring there are no patterns in the hash outputs. So, I wanted to answer the question: Do any of these simple hash functions break down on real-world datasets? If so, what are their failure modes? To do this, I designed two tests that simulate real-world use cases and tested a number of hash functions across three datasets.

## Hash Functions Under Test

I gathered some simple "low quality" hash functions as well as some "high quality" hash functions. These include:

Low quality hash functions:

- `adler32`: Mark Adler's version of the Fletcher checksum. `adler32` is considered unreliable for short inputs, as per RFC 3309.
- `djb2_32`: A simple non-cryptographic hash devised by Daniel J. Bernstein.
- `fnv1a32`: A widely-used hash designed by Glenn Fowler, Phong Vo, and Landon Noll.

High quality hash functions:

- `spooky32`: A hash function designed by Bob Jenkins.
- `murmur3`: A hash function designed by Austin Appleby in 2008.
- `city32`: A fast hash function developed by Google.
- `xx32`: Claims to be the fastest x86 non-cryptographic hashing algorithm.

Some of these hashes also have 64-bit variants, including `city`, `xx`, `spooky`, `fnv1a`, and `djb2`.

## Datasets

I wanted to test each hash function on a variety of large datasets. These provide sample scenarios from networking, bioinformatics, and natural language processing. These include:

- All words in the English, German, and French languages as provided by the GNU ASpell dictionary version `2.1`.
  - The French dictionary contains 221,377 words.
  - The American English dictionary contains 123,985 words.
  - The German dictionary contains 304,736 words.
- All possible private IPv4 addresses, as unsigned bytes in network byte order.
  - This includes 17,891,328 addresses.
  - Most addresses are contiguous, differing in a single bit.
- All unique 12-mers (or 12-length substrings) in the human genome (e.g. all contigs from GRCh38).
  - For example, 'AAGAGTCAGTTATT' is a 12-mer.
  - Comprising 203,091,438 unique 12-length substrings from the human genome. These sequences cover most of the possible combinations in the genome's four-character alphabet (A, T, C, G).
  - This dataset is challenging to hash because the differentiating information is contained in a small subset of the input bits.
  - I did not canonicalize these k-mers, so they could be drawn from the 5'→3' or 3'→5' strands.

## Multinomial Non-Uniformity Test

In practice, most hash functions are used to associate an item with a specific 'slot' in memory, and many algorithms depend on the premise that the distribution of items across these slots is no worse than one that could be produced by a uniform random distribution. This test models the real-world behavior of the hash function rather than its behavior under a synthetic benchmark. Since the ranges of the hash functions are large (i.e., either $2^{32}$ or $2^{64}$), we need to choose a function that maps each output to a slot index. This is most commonly accomplished by taking the modulus of the output with the number of slots. After hashing $k$ items, the resulting slot counts $b_0, b_1, \cdots, b_{n-1}$ should be modeled by a multinomial distribution across the $n$ slots. On average, $\frac{k}{n}$ values should hash to each slot, and we can use the $\chi^2$ distribution to test whether the counts differ significantly from those that would be produced by a random hash function. By assuming $n$ is sufficiently large, we can compute the test statistic using the following formula:

`$\chi^2 = \sum_{i = 0}^{n-1} \frac{(b_i - \frac{k}{n})^2}{\frac{k}{n}}$`
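The statistic itself is cheap to compute. The following is a minimal sketch, assuming the hashes have already been computed and widened to `u64`; the helper names `slot_counts` and `chi_squared` are mine and are not taken from the test program.

```rust
/// Counts how many hashes land in each of `n` slots, using slot = hash mod n.
fn slot_counts(hashes: &[u64], n: usize) -> Vec<u64> {
    let mut counts = vec![0u64; n];
    for &h in hashes {
        counts[(h % n as u64) as usize] += 1;
    }
    counts
}

/// Pearson's chi-squared statistic against a uniform expectation of k/n per slot.
fn chi_squared(counts: &[u64]) -> f64 {
    let k: u64 = counts.iter().sum();
    let expected = k as f64 / counts.len() as f64;
    counts
        .iter()
        .map(|&b| {
            let deviation = b as f64 - expected;
            deviation * deviation / expected
        })
        .sum()
}
```

A perfectly even spread gives a statistic of 0, and the value grows as the slot counts drift away from $\frac{k}{n}$. The chi-squared CDF is not part of the Rust standard library, so this sketch stops at the statistic.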

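Then, compute the $p$-value using the chi-squared CDF. I performed around 60 of these tests across all the datasets, so I'll only list a few here. In the tables below, $|b|$ is the number of slots, `average` is the mean number of items per slot, and `p50`, `p75`, and `p99` are percentiles of the per-slot counts.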

### German Word List

$|b| = 1024$

| hash_function | p_value | average | p50 | p75 | p99 |
|---|---|---|---|---|---|
| city64 | 0.041558 | 297.594 | 298 | 309 | 341 |
| city32 | 0.558001 | 297.594 | 298 | 310 | 338 |
| xx64 | 0.0619553 | 297.594 | 298 | 309 | 342 |
| xx32 | 0.146449 | 297.594 | 297 | 310 | 340 |
| spooky64 | 0.978192 | 297.594 | 297 | 309 | 337 |
| spooky32 | 0.978192 | 297.594 | 297 | 309 | 337 |
| murmur3 32 | 0.288355 | 297.594 | 298 | 309 | 335 |
| fnv1a64 | 0.073901 | 297.594 | 298 | 309 | 340 |
| fnv1a32 | 0.19892 | 297.594 | 297 | 310 | 340 |
| adler32 | 0 | 297.594 | 271 | 357 | 548 |
| djb2_32 | 0.499734 | 297.594 | 298 | 309 | 336 |
| djb2_64 | 0.499734 | 297.594 | 298 | 309 | 336 |

$|b| = 1031$

| hash_function | p_value | average | p50 | p75 | p99 |
|---|---|---|---|---|---|
| city64 | 0.394728 | 295.573 | 295 | 307 | 336 |
| city32 | 0.658379 | 295.573 | 295 | 306 | 337 |
| xx64 | 0.349555 | 295.573 | 295 | 307 | 335 |
| xx32 | 0.809289 | 295.573 | 295 | 307 | 335 |
| spooky64 | 0.944751 | 295.573 | 295 | 307 | 333 |
| spooky32 | 0.0605966 | 295.573 | 296 | 307 | 339 |
| murmur3 32 | 0.421352 | 295.573 | 295 | 307 | 339 |
| fnv1a64 | 0.224086 | 295.573 | 296 | 307 | 337 |
| fnv1a32 | 0.84226 | 295.573 | 295 | 306 | 336 |
| adler32 | 0.784829 | 295.573 | 295 | 307 | 337 |
| djb2_32 | 0.992628 | 295.573 | 296 | 306 | 332 |
| djb2_64 | 0.487786 | 295.573 | 296 | 308 | 334 |

$|b| = 435337$ (load factor $k / |b| \approx 0.7$)

| hash_function | p_value | average | p50 | p75 | p99 |
|---|---|---|---|---|---|
| city64 | 0.433201 | 0.7 | 1 | 1 | 3 |
| city32 | 0.81538 | 0.7 | 1 | 1 | 3 |
| xx64 | 0.46347 | 0.7 | 1 | 1 | 3 |
| xx32 | 0.737051 | 0.7 | 1 | 1 | 3 |
| spooky64 | 0.0797217 | 0.7 | 1 | 1 | 3 |
| spooky32 | 0.390342 | 0.7 | 1 | 1 | 3 |
| murmur3 32 | 0.696641 | 0.7 | 1 | 1 | 3 |
| fnv1a64 | 0.0207139 | 0.7 | 1 | 1 | 3 |
| fnv1a32 | 0.648008 | 0.7 | 1 | 1 | 3 |
| adler32 | 0 | 0.7 | 0 | 1 | 4 |
| djb2_32 | 0.110557 | 0.7 | 1 | 1 | 3 |
| djb2_64 | 0.0788193 | 0.7 | 1 | 1 | 3 |

With the exception of `adler32`, all the hash functions hold up well against these ASCII inputs. When the number of slots is prime and the table size is small, `adler32` performs at its best. I think that's likely because the sum wraps around the modulus, creating something closer to a uniform distribution, though this does not necessarily mean it should be used.

### Private IP Ranges

$|b| = 65536$

| hash_function | p_value | average | p50 | p75 | p99 |
|---|---|---|---|---|---|
| city64 | 0.161976 | 273 | 273 | 284 | 313 |
| city32 | 0.550778 | 273 | 273 | 284 | 312 |
| xx64 | 0.150364 | 273 | 273 | 284 | 312 |
| xx32 | 0.962877 | 273 | 273 | 284 | 312 |
| spooky64 | 0.960136 | 273 | 273 | 284 | 312 |
| spooky32 | 0.960136 | 273 | 273 | 284 | 312 |
| murmur3 32 | 0.229322 | 273 | 273 | 284 | 312 |
| fnv1a64 | 1 | 273 | 273 | 278 | 284 |
| fnv1a32 | 1 | 273 | 273 | 275 | 280 |
| adler32 | 0 | 273 | 0 | 0 | 1653 |
| djb2_32 | 0 | 273 | 254 | 303 | 372 |
| djb2_64 | 0 | 273 | 254 | 303 | 372 |

$|b| = 25559057$ (load factor $k / |b| \approx 0.7$)

| hash_function | p_value | average | p50 | p75 | p99 |
|---|---|---|---|---|---|
| city64 | 0.727569 | 0.7 | 1 | 1 | 3 |
| city32 | 0.738734 | 0.7 | 1 | 1 | 3 |
| xx64 | 0.510211 | 0.7 | 1 | 1 | 3 |
| xx32 | 1 | 0.7 | 1 | 1 | 3 |
| spooky64 | 0.331874 | 0.7 | 1 | 1 | 3 |
| spooky32 | 0.823507 | 0.7 | 1 | 1 | 3 |
| murmur3 32 | 1 | 0.7 | 1 | 1 | 3 |
| fnv1a64 | 1 | 0.7 | 1 | 1 | 3 |
| fnv1a32 | 1 | 0.7 | 1 | 1 | 3 |
| adler32 | 0 | 0.7 | 0 | 0 | 0 |
| djb2_32 | 0 | 0.7 | 0 | 0 | 55 |
| djb2_64 | 0 | 0.7 | 0 | 0 | 55 |

This is likely the most challenging test of the three because many of these IPs are differentiated by single bits. Both `adler32` and `djb2_32` fail. In particular, the `adler32` hash function only distributes hashes amongst 1% of the allocated buckets! Interestingly enough, for $|b| = 65536$, `fnv1a` seems to distribute the values very uniformly. In expectation, the 99th percentile should approach 312; interestingly, this doesn't happen for `fnv1a`. (I could probably run the 17712414th order statistic to find if this is significant, but that seems like a bit of a nightmare.)

### All k-mers in GRCh38

$|b| = 65536$

| hash_function | p_value | average | p50 | p75 | p99 |
|---|---|---|---|---|---|
| city64 | 0.808561 | 3098.93 | 3099 | 3136 | 3229 |
| city32 | 0.238145 | 3098.93 | 3099 | 3136 | 3229 |
| xx64 | 0.0837374 | 3098.93 | 3099 | 3137 | 3230 |
| xx32 | 0.388023 | 3098.93 | 3099 | 3137 | 3229 |
| spooky64 | 0.0890488 | 3098.93 | 3099 | 3136 | 3230 |
| spooky32 | 0.0890488 | 3098.93 | 3099 | 3136 | 3230 |
| murmur3 32 | 0.754288 | 3098.93 | 3099 | 3136 | 3230 |
| fnv1a64 | 1 | 3098.93 | 3099 | 3136 | 3227 |
| fnv1a32 | 0.99998 | 3098.93 | 3099 | 3136 | 3228 |
| adler32 | 0 | 3098.93 | 0 | 0 | 0 |
| djb2_32 | 0 | 3098.93 | 3054 | 4210 | 4975 |
| djb2_64 | 0 | 3098.93 | 3054 | 4210 | 4975 |

$|b| = 290130625$ (load factor $k / |b| \approx 0.7$)

| hash_function | p_value | average | p50 | p75 | p99 |
|---|---|---|---|---|---|
| city64 | 0.567872 | 0.7 | 1 | 1 | 3 |
| city32 | 8.5713e-09 | 0.7 | 1 | 1 | 3 |
| xx64 | 0.0503037 | 0.7 | 1 | 1 | 3 |
| xx32 | 0.000792609 | 0.7 | 1 | 1 | 3 |
| spooky64 | 0.0415268 | 0.7 | 1 | 1 | 3 |
| spooky32 | 1.90627e-06 | 0.7 | 1 | 1 | 3 |
| murmur3 32 | 1.57968e-07 | 0.7 | 1 | 1 | 3 |
| fnv1a64 | 0.833133 | 0.7 | 1 | 1 | 3 |
| fnv1a32 | 1.01819e-10 | 0.7 | 1 | 1 | 3 |
| adler32 | 0 | 0.7 | 0 | 0 | 0 |
| djb2_32 | 0.00151188 | 0.7 | 1 | 1 | 3 |
| djb2_64 | 0.0428499 | 0.7 | 1 | 1 | 3 |

The k-mer test seemed to induce failures in all the 32-bit hash functions. While we can say these statistically differ from the uniform distribution, this does not mean it will significantly impact the performance of our application. The hashes actually seem to be fairly well distributed, at least judging by the 50th, 75th, and 99th percentiles.

## Sparse Collisions Test

While the non-uniformity test is simple to administer, interpreting its results can be challenging because you have to compare across distributions. This motivated me to develop a test to characterize the likelihood of observing a certain number of hash collisions throughout the entire data set. The "Sparse Collisions Test" is simple: it hashes all the keys (for example, all the words in the German language) and counts the number of collisions. The real challenge lies in determining whether the number of collisions we measure is significant. Finding the likelihood of observing $q$ collisions when $k$ values are hashed is a variation on the famously unintuitive Birthday Problem.
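The counting step can be sketched in a few lines. This is only an illustration, assuming a collision is counted each time a key produces a hash that an earlier key already produced; the function name `count_collisions` and the closure-based interface are my own choices.

```rust
use std::collections::HashSet;

/// Number of keys whose hash was already produced by an earlier key,
/// i.e. (number of keys) - (number of distinct hash values).
fn count_collisions<K, H>(keys: &[K], hash: H) -> usize
where
    H: Fn(&K) -> u32,
{
    let mut seen = HashSet::with_capacity(keys.len());
    let mut collisions = 0;
    for key in keys {
        // `insert` returns false when the hash value is already present.
        if !seen.insert(hash(key)) {
            collisions += 1;
        }
    }
    collisions
}
```

For example, passing a deliberately weak hash such as `|s: &&str| s.len() as u32` over `["a", "bb", "cc", "ddd", "e"]` reports two collisions, because two pairs of words share a length.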

Characterizing the full distribution for each scenario proved difficult, and I believe there might not be a closed-form formula without approximation. After considerable effort, I was able to develop a formula to compute the likelihood of a specific number of collisions. It works by summing a combinatorial formula over all partitions of the input space. Using dynamic programming, the exact distribution can be computed in $O\left( k^k \right)$ time and $O\left( k \right)$ space, which is only practical for small inputs. Fortunately, by limiting the space of partitions considered and eliminating those which almost certainly would not occur, I was able to make more progress. In the end, I was able to characterize the expected number of collisions within the private IP address space and German word list for the 32-bit variants with an error on the order of $10^{-8}$. Further information about how the distribution was derived is included in the appendix.
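The mean of that distribution, unlike the full distribution, does have a simple form. For an ideal hash, the expected number of occupied slots is $n\left(1 - (1 - \frac{1}{n})^k\right)$, so the expected number of collisions is $k$ minus that. The sketch below uses this standard occupancy result rather than the partition method described above, and the function name is hypothetical.

```rust
/// Expected collisions when `k` keys are hashed uniformly into `n` slots:
/// k - n * (1 - (1 - 1/n)^k).
/// `ln_1p` and `exp_m1` keep precision when 1/n is tiny (e.g. n = 2^32).
fn expected_collisions(k: f64, n: f64) -> f64 {
    let log_empty = k * (-1.0 / n).ln_1p(); // ln of P(a given slot stays empty)
    let expected_occupied = n * -log_empty.exp_m1(); // n * (1 - e^log_empty)
    k - expected_occupied
}
```

For the German word list ($k = 304736$, $n = 2^{32}$) this evaluates to roughly 10.8, which gives a feel for the scale of collision counts to expect from a 32-bit hash on that dataset.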

### German Word List

The expected probability distribution can be computed from the partitions formula. The observed collision counts are:

| Algorithm | Collisions | Percentage |
|---|---|---|
| city64 | 0 | 0 |
| city32 | 11 | 0.0036 |
| xx64 | 0 | 0 |
| xx32 | 10 | 0.0033 |
| spooky64 | 0 | 0 |
| spooky32 | 10 | 0.0033 |
| murmur3 32 | 15 | 0.0049 |
| fnv1a64 | 0 | 0 |
| fnv1a32 | 18 | 0.0060 |
| adler32 | 68006 | 22.32 |
| djb2_32 | 17 | 0.0056 |
| djb2_64 | 2 | 0.00066 |

For 32-bit hash functions, we should expect fewer than 22 collisions at $p=0.001$, a criterion that only `adler32` fails to meet. The 64-bit hash functions can be bounded by the Birthday Problem: we expect no collisions to occur, so any number of collisions is statistically significant at the 0.001 level. Thus, we can say `djb2_64` also differs significantly from a random hash function.
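The Birthday Problem bound for the 64-bit case is easy to check numerically. This is a minimal sketch under the usual assumption of an ideal, uniformly random hash; the function name is mine.

```rust
/// Probability that at least two of `k` keys collide when each key is
/// assigned independently and uniformly to one of `n` slots.
fn prob_any_collision(k: u64, n: f64) -> f64 {
    // ln P(no collision) = sum over i in 0..k of ln(1 - i/n)
    let log_none: f64 = (0..k).map(|i| (-(i as f64) / n).ln_1p()).sum();
    // 1 - exp(log_none), computed without cancellation for tiny probabilities
    -log_none.exp_m1()
}
```

With $k = 304736$ and $n = 2^{64}$ the result is on the order of $10^{-9}$, far below 0.001, which is why even a single collision from a 64-bit hash counts as significant here. The classic sanity check, 23 people and 365 days, gives about 0.507.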

### All Private IP Addresses

The expected probability distribution can be computed from the partitions formula. The observed collision counts are:

| Algorithm | Collisions | Percentage |
|---|---|---|
| city64 | 0 | 0 |
| city32 | 37534 | 0.21 |
| xx64 | 0 | 0 |
| xx32 | 0 | 0 |
| spooky64 | 0 | 0 |
| spooky32 | 37143 | 0.21 |
| murmur3 32 | 0 | 0 |
| fnv1a64 | 0 | 0 |
| fnv1a32 | 0 | 0 |
| adler32 | 17530308 | 97.98 |
| djb2_32 | 17571285 | 98.21 |
| djb2_64 | 17571285 | 98.21 |

For the 32-bit hash functions, we would expect fewer than 37,812 collisions at $p=0.001$. As in the previous test, any collisions for the 64-bit hashes are significant at the 0.001 level. So for this test, `djb2_32`, `adler32`, and `djb2_64` perform significantly worse than what would be expected from a random hash function. On the other hand, `fnv1a32`, `xx32`, and `murmur3` actually perform significantly better than what would be expected from a random hash function. `city32` and `spooky32` perform in line with our expectations.

### Unique K-mers in GRCh38

With my current methods, I can't compute the expected probability distribution for $k=203091438$. It's too computationally expensive. I ran the tests anyway so I could list the results.

| Algorithm | Collisions | Percentage |
|---|---|---|
| city64 | 0 | 0 |
| city32 | 4726992 | 2.33 |
| xx64 | 0 | 0 |
| xx32 | 4707102 | 2.33 |
| spooky64 | 0 | 0 |
| spooky32 | 4726688 | 2.33 |
| murmur3 32 | 4723849 | 2.33 |
| fnv1a64 | 0 | 0 |
| fnv1a32 | 4724280 | 2.33 |
| adler32 | 202966890 | 99.94 |
| djb2_32 | 5324427 | 2.62 |
| djb2_64 | 0 | 0 |

## Conclusion

After conducting all these experiments, my biggest takeaway is that hash benchmarking suites are probably not measuring real hashing performance. In these tests, `fnv1a`, a simple hash function from the early 90s, held up remarkably well. While I think measuring the randomness of hash functions is interesting both theoretically and as a fun engineering exercise, I believe these hyper-optimized hash functions offer very marginal benefits. Of course, I am open to changing my mind. This would happen if I am presented with a real-world dataset that elicits bad behavior from a simple hash function like `fnv1a`. There might be some dataset for which `city` and `spooky` outperform their simpler predecessors. You can't really prove that these hash functions are "good"; you can only show that under certain situations they perform poorly.

Many early hash functions like `adler32` and `djb2` were designed in an era when hashing performance was an important consideration, and they were typically used for specific applications. `adler32` was used in zlib, where entropy was abundant. This accounts for its significant shortcomings with short string inputs. I believe `djb2` was designed for ASCII strings. ASCII data, like German and English words, contains a lot of inherent entropy, meaning that weaker hash functions like `djb2` perform quite well. The main issue with `djb2` is that the multiplier 33 does not provide avalanching over the entire output space. Replacing 33 with a better multiplier, like 22695477, considerably boosts its performance. I think the reason Bernstein used 33 is that he designed it in the 90s when computing resources were limited: the multiplication could be replaced with a bit shift and an addition.
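To make the constant swap concrete, here is a hedged sketch of a `djb2`-style hash with the constant passed in as a parameter. The names `mul_hash_32` and `times_33_shift_add` are hypothetical, and this is not the exact variant I benchmarked.

```rust
/// djb2-style hash with a caller-chosen constant.
/// Passing 33 reproduces djb2_32; 22695477 is the alternative mentioned above.
fn mul_hash_32(bytes: &[u8], multiplier: u32) -> u32 {
    bytes.iter().fold(5381u32, |h, &b| {
        h.wrapping_mul(multiplier).wrapping_add(b as u32)
    })
}

/// The shift-and-add trick: (h << 5) + h equals h * 33 modulo 2^32.
fn times_33_shift_add(h: u32) -> u32 {
    (h << 5).wrapping_add(h)
}
```

The shift-and-add form only works because 33 is $2^5 + 1$; an arbitrary constant such as 22695477 needs a real multiplication, which is cheap on modern hardware.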

## Appendix

### Deriving the Expected Collision Distribution

In order to characterize the collision distribution, we want to obtain the probability that $q$ collisions occur, $P(Q=q)$, for an idealized random hash function.

Let us define the hash function over some alphabet, $\Sigma$. This hash function maps an arbitrary input to one of the $n$ slots, that is, $f: \Sigma^\mathbb{N} \to [1, n]$. Each input has a $\frac{1}{n}$ probability of mapping to each output. We are interested in the probability that $q$ collisions occur within a set of $k$ values.

The distribution of hashes over the slots is multinomial, since the number of trials is fixed, the trials are independent, and there is a fixed probability $p_i = \frac{1}{n}$ that a value hashes into each bucket. Therefore, the probability of a specific assignment of slot counts $b_0, b_1, \cdots, b_{n-1}$ is given by $\frac{k!}{n^k \prod_{i=0}^{n-1} b_i!}$ where $\sum_{i=0}^{n-1} b_i = k$. We could evaluate this by considering all possible distributions of values in the buckets, summing the probabilities of each distribution that contributes to $q$ collisions. However, many of these are duplicative. For example, if $n = 2$ and $k = 4$, $b_0 = 1, b_1 = 3$ and $b_0 = 3, b_1 = 1$ occur with equal likelihood. Thus, we can instead group outcomes by bucket counts $c_0, c_1, \cdots, c_k$, where $c_j$ is the number of slots that hold exactly $j$ values, and compute the probability of each group by multiplying the probability of one such outcome by the number of ways in which it can occur:

`$P\left(c_0, c_1, \cdots, c_k\right) = \frac{ \left( \sum_{j = 0}^k c_j \right) ! }{ \prod_{i = 0}^{k} c_i! } \cdot \frac{k!}{n^k \prod_{i=0}^{k} (i!)^{c_i}}$`

To calculate the probability that $q$ collisions occur, this needs to be summed over all partitions of $k$. That is, all non-negative integer coefficients $c_1, c_2, \cdots, c_k$ which satisfy $k = \sum_{i=1}^k i \cdot c_i$, with the remaining $c_0 = n - \sum_{i=1}^k c_i$ slots left empty. The number of collisions is given by $q = \sum_{j=2}^{k} c_j (j - 1)$.
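For very small inputs, the sum over partitions can be evaluated directly. The following is an illustrative sketch only, not the optimized program: it uses plain `f64` factorials, which overflow past $170!$, and all function names are hypothetical.

```rust
fn factorial(x: u32) -> f64 {
    (1..=x).map(f64::from).product()
}

/// Returns `dist` where dist[q] = P(Q = q) for k keys hashed uniformly into n slots.
fn collision_distribution(k: u32, n: u32) -> Vec<f64> {
    let mut dist = vec![0.0; k as usize + 1];
    let mut c = vec![0u32; k as usize + 1]; // c[j] = slots holding exactly j keys
    visit(k, k, k, n, &mut c, &mut dist);
    dist
}

/// Depth-first search over partitions of k: pick how many slots hold
/// `size` keys, then recurse on the smaller sizes.
fn visit(remaining: u32, size: u32, k: u32, n: u32, c: &mut [u32], dist: &mut [f64]) {
    if size == 0 {
        let occupied: u32 = c[1..].iter().sum();
        if remaining != 0 || occupied > n {
            return; // not a partition of k, or it needs more slots than exist
        }
        c[0] = n - occupied;
        // Ways to assign these counts to slots: n! / prod_j c_j!
        let ways = factorial(n) / c.iter().map(|&cj| factorial(cj)).product::<f64>();
        // Probability of one such assignment: k! / (n^k * prod_j (j!)^{c_j})
        let denom: f64 = c
            .iter()
            .enumerate()
            .map(|(j, &cj)| factorial(j as u32).powi(cj as i32))
            .product();
        let p = ways * factorial(k) / (f64::from(n).powi(k as i32) * denom);
        dist[(k - occupied) as usize] += p; // q = k - occupied slots
        return;
    }
    for count in 0..=remaining / size {
        c[size as usize] = count;
        visit(remaining - count * size, size - 1, k, n, c, dist);
    }
    c[size as usize] = 0;
}
```

As a check, `collision_distribution(3, 3)` should return probabilities of $\frac{2}{9}$, $\frac{2}{3}$, and $\frac{1}{9}$ for 0, 1, and 2 collisions, which matches counting the 27 equally likely assignments by hand.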

I would have liked to obtain a closed-form formula, even via an approximation. However, there is no known closed-form formula for partitions. If anyone knows of an appropriate approximation, let me know.

### Computing the Expected Collision Distribution

The equation given above can be computed efficiently using a few approximations. First, factorials can be approximated through the use of the log gamma function with 16-bit floating point accuracy. This is provided by the Lanczos Gamma Approximation. The log gamma values for $[0, k] \cup [n-k, n]$ can be cached to make these calls $O(1)$. The expected collision distribution can then be computed through a depth-first search over the partition space. Unfortunately, the number of partitions grows exponentially. For example, there are around $10^{60}$ possible partitions for the German Word List dataset. Many possible outcomes have near-zero likelihoods of occurring. For example, the likelihood that all $k$ values hash to the same bucket is $(\frac{1}{n})^{k-1}$. Near-perfect approximations can be obtained by limiting the depth of the search and the number of partitions at a given depth.

### Additional Results

I have made all my results, as well as the program I used to compute the collision distributions, available in a Git repository.