# How to collapse the universe?

This post has some tidbits regarding the problem of computing a ‘fingerprint’ of a long sequence of characters, often called ‘string hashing’. Most of the results I will describe are quite old, but they are scattered upon several papers, so I think it is worthwhile to have a post where they are  put together. This is of course not a survey.

In our setting we have a set $U$, which is typically very large – think for instance all strings of at most $2^{32}$ bits. A hash function $h$ takes an item from $U$ and outputs a ‘fingerprint’ from a range $B$. Think of $B$ as say $64$ bits which is hopefully unique over a small set of items: we want to come up with a family of hash functions $H$, such that for every pair of items $m_1,m_2\in U$, when sampling $h$ from $H$, the collision probability $h(m_1)=h(m_2)$ is small. The parameters we care about are $|H|$, since $\log |H|$ is a lower bound on the key size, that is the number of random bits needed to specify $h$. We also care about the running time of $h$, and of course the collision probability we obtain. In this post I will not discuss directly the running time of $h$, though running time is arguably the most important parameter, instead I’ll focus on the tradeoff between the collision probability and $|H|$.

The first observation is that if we insist on having the collision probability as small as $1/|B|$,  ($2^{-64}$ in the example above), then $\log |H|$ must be roughly at least $\log |U|$ which means in the example above we will need a very long key.

Denote  by $\epsilon$ the collision probability we want to obtain. Stinson proved that $|H| \geq \frac{|U|(|B|-1)}{|U|(\epsilon |B|-1)+|B|^2(1-\epsilon)}$ which is roughly $|U|/|B|$ when $\epsilon = 1/|B|$, but says very little when $\epsilon$ is slightly larger. A sketch of Stinson’s short and clever proof is at the end of the post.

This bound is almost met by  Carter and Wegman‘s classic multilinear hashing. Here $B$ is assumed to be a finite field, the items in $U$ are comprised of at most $\ell$ field elements, and the key $k_0,...,k_\ell$ is comprised of $\ell + 1$ field elements sampled uniformly. The multilinear hash family is defined as $h(m_1...) = k_0 + \sum k_im_i$ where all operations are in the field. If we forbid messages to end with zero then it is pairwise independent. A short proof could be found in a paper by Daniel Lemire.

In practice we would be willing to sacrifice a little in the collision probability and have a hash function with a smaller key length, one that is closer to the length of the fingerprint, and that has a weaker dependency on the size of the domain. There are many ways of doing that, with various tradeoffs. Here are two general and quite straightforward methods. Lets call them recursive and iterative.

Recursive: The main idea is to compose  weeker functions in a tree-like structure. The origin of this idea (as many others) is in Carter and Wegman. We use as a building block a hash function that takes as input shorter fixed length strings. To be concrete lets assume we have a family of functions that takes $256$ bits and output $64$ bits. There are many good ways of designing such functions. The input is viewed as  $\ell$ chunks of $256$ bits each and apply the hash function on each of these blocks, concatenating the output obtaining a string with $\ell/4$ chunks of $256$. We repeat the operation on the new string, with a freshly sampled hash function, and so on. In total we need to sample $\log_{4} \ell$ such function. In the example above, if each of the hash functions has a key of $256$ bits, and the original input is $2^{32}$ bits, then the total length of keys turns out to be approximately $3300$ bits. The collision probability could be bounded by multiplying the collision probability of the basic hash functions with the number of iterations,  $13$ in this case.  Mikkel Thorup has a more detailed description and some experiments here, as does Sarkar.

Iterative: This idea is known by various names such as CBC-MAC and Pearson hashing. Here we assume the message is comprised of at most $\ell$ blocks of $n$ bits each: $m_1,\ldots, m_\ell$. The hash function is indexed by a permutation $\pi:\{0,1\}^n \rightarrow \{0,1\}^n$. The $i'th$ stage of the computation has state $C_i \in \{0,1\}^n$. Starting with $C_0 = 0^n$ the function iteratively computes $C_i = \pi(C_{i-1}\oplus m_i)$. The output is $C_\ell$. Ideally we would have liked the permutation to be completely random, this cannot be done for reasonable values of $n$, so in practice a pseudorandom permutation is used, though the exact definition of pseudorandomness in this context could be debated. Assuming the permutation is indeed random, what is the collision probability? It is shown here to be $O(\ell^{o(1)}/2^n)$ as long as $\ell \leq 2^{n/4}$. The proof holds for a stronger attack model. I do not see an inherent reason for the bound on $\ell$.

I now sketch Stinson’s proof for the bound mentioned above, which is a nice application of the second moment method. Recall that $U$ denotes the domain, the range is denoted by $B$ and its size by $b$. The collision probability is denoted by $\epsilon$. For every $h\in H$ and $y \in B$ let $Q_{hy}\subseteq U$ denote $h^{-1}(y)$. Now let’s pick some specific $h$ and $y$. For every $g\in H, g\neq h$ and $x\in B$ define $\mu_{g,x} = |Q_{h,y} \cap Q_{g,x}|$, i.e. the number of items that are mapped to $y$ by $h$ and to $x$ by $g$. Let $\mu$ denote the expected size of $\mu_{g,x}$ when $x$ is sampled from $B$ and $g$ from $H\setminus \{h\}$. A moment’s reflection shows this average to be $|Q_{h,y}|/b$ (for each item in $Q_{h,y}$ we have a $1/b$ chance of guessing right where it lands under $g$). The key observation is that in order for $\mu_{g,x}$ to be large, many items should collide on $x$ under $g$, which is unlikely, so the fact that the collision probability of $H$ is bounded gives us a handle on the variance of $\mu$. This in turn would imply a bound on $|H|$. Consider two items $u,v$ in $Q_{h,y}$. Since they collide under $h$, there are at most $\epsilon |H| -1$ other function under which they collide. So,

$\sum_{u\neq v \in Q_{h,y}}\sum_{g\neq h,x\in B}\mathbf{1}_{g(u)=g(v)=x} \leq \binom{|Q_{h,y}|}{2}(\epsilon|H| - 1)$

changing the order of summation this means

$\sum_{g\neq h,x\in B}\binom{\mu_{g,x}}{2} \leq \binom{|Q_{h,y}|}{2}(\epsilon|H| - 1)$

Moving things around we have

$\sum_{g\neq h,x\in B}\mu_{g,x}^2 \leq |Q_{h,y}|(|Q_{h,y}|-1)(\epsilon|H| - 1)+|Q_{h,y}|(|H|-1)$

So we have an upper bound on the second moment of $\mu_{g,x}$ in terms of $|H|$ and $|Q_{h,y}|$. Of course we also know that the variance of $\mu_{g,x}$ is non negative, which after calculations implies the bound. Finally in order to get a bound in terms of $|U|$ we observe that we can pick $h$ and $y$ such that $|Q_{h,y}|$ is at least  $|U|/b$.