This post has some tidbits regarding the problem of computing a ‘fingerprint’ of a long sequence of characters, often called ‘string hashing’. Most of the results I will describe are quite old, but they are scattered across several papers, so I think it is worthwhile to have a post where they are put together. This is of course not a survey.
In our setting we have a set $U$, which is typically very large – think for instance of all strings of at most $n$ bits, for some large $n$. A hash function $h$ takes an item from $U$ and outputs a ‘fingerprint’ from a range $R$. Think of the fingerprint as a short string of bits, one which is hopefully unique over a small set of items: we want to come up with a family of hash functions $H$, such that for every pair of distinct items $x \neq y$, when sampling $h$ uniformly from $H$, the collision probability $\Pr_h[h(x)=h(y)]$ is small. The parameters we care about are the size of the family $|H|$, since $\log_2|H|$ is a lower bound on the key size, that is, the number of random bits needed to specify $h$. We also care about the running time of $h$, and of course the collision probability we obtain. In this post I will not discuss the running time of $h$ directly, though running time is arguably the most important parameter; instead I’ll focus on the tradeoff between the collision probability and $\log_2|H|$.
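To make these definitions concrete, here is a toy illustration in Python. The particular family $h_{a,b}(x) = ((ax+b) \bmod p) \bmod |R|$, the small prime and the pair of items are arbitrary choices for the example, not anything canonical:

```python
import random

P = 2**13 - 1        # a small prime, just for the toy example
RANGE_SIZE = 2**8    # |R| = 256 possible fingerprints

def sample_hash():
    """Sample h from the family by sampling its key (a, b): roughly 2*log2(P) random bits."""
    a = random.randrange(1, P)
    b = random.randrange(P)
    return lambda x: ((a * x + b) % P) % RANGE_SIZE

# Estimate the collision probability Pr_h[h(x) = h(y)] for one fixed pair x != y.
x, y = 1234, 5678
trials = 200_000
collisions = 0
for _ in range(trials):
    h = sample_hash()
    if h(x) == h(y):
        collisions += 1
print(collisions / trials, "vs. 1/|R| =", 1 / RANGE_SIZE)  # the two should be comparable
```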
The first observation is that if we insist on having the collision probability as small as $1/|R|$, then $|H|$ must be roughly at least $|U|$, which means that in the example above we will need a very long key.
Denote by $\epsilon$ the collision probability we want to obtain. Stinson proved a lower bound on $|H|$ in terms of $\epsilon$, $|U|$ and $|R|$; it is roughly $|U|$ when $\epsilon$ is close to $1/|R|$, but it says very little when $\epsilon$ is even slightly larger. A sketch of Stinson’s short and clever proof is at the end of the post.
This bound is almost met by Carter and Wegman’s classic multilinear hashing. Here the range $R$ is assumed to be a finite field, the items in $U$ are comprised of at most $k$ field elements, and the key is comprised of $k+1$ field elements $a_0, a_1, \dots, a_k$ sampled uniformly. The multilinear hash family is defined as
$$h_{a_0,\dots,a_k}(m_1,\dots,m_\ell) = a_0 + \sum_{i=1}^{\ell} a_i m_i,$$
where all operations are in the field. If we forbid messages to end with zero then it is pairwise independent. A short proof can be found in a paper by Daniel Lemire.
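Here is a small Python sketch of the multilinear construction over the integers modulo a Mersenne prime; the prime $2^{61}-1$ and the encoding of a message as a list of field elements are my choices for the example rather than anything dictated by the scheme:

```python
import secrets

P = 2**61 - 1  # a Mersenne prime; we work in the field GF(P)

def sample_key(max_len):
    """Key: max_len + 1 uniform field elements (a_0, plus one a_i per message element)."""
    return [secrets.randbelow(P) for _ in range(max_len + 1)]

def multilinear_hash(key, message):
    """h(m_1..m_l) = a_0 + sum_i a_i * m_i, all arithmetic mod P.

    message: a list of at most max_len field elements (integers in [0, P)),
    which for pairwise independence should not end with 0."""
    assert len(message) < len(key)
    acc = key[0]
    for a_i, m_i in zip(key[1:], message):
        acc = (acc + a_i * m_i) % P
    return acc

# usage
key = sample_key(max_len=4)
print(multilinear_hash(key, [17, 42, 3]))
```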
In practice we would be willing to sacrifice a little in the collision probability in exchange for a hash function with a smaller key length, one that is closer to the length of the fingerprint and that has a weaker dependency on the size of the domain. There are many ways of doing that, with various tradeoffs. Here are two general and quite straightforward methods; let’s call them recursive and iterative.
Recursive: The main idea is to compose weaker functions in a tree-like structure. The origin of this idea (like many others) is in Carter and Wegman. We use as a building block a hash function that takes as input short, fixed-length strings. To be concrete, let’s assume we have a family of functions that takes $2b$ bits as input and outputs $b$ bits. There are many good ways of designing such functions. The input is viewed as chunks of $2b$ bits each; we apply the hash function to each of these chunks and concatenate the outputs, obtaining a string of half the length made of chunks of $b$ bits. We repeat the operation on the new string, with a freshly sampled hash function, and so on. In total we need to sample about $\log_2(n/b)$ such functions for an input of $n$ bits, so if each of the hash functions has a key of $s$ bits, the total key length is approximately $s\log_2(n/b)$ bits. The collision probability can be bounded by the collision probability of the basic hash function multiplied by the number of levels, about $\log_2(n/b)$ in this case. Mikkel Thorup has a more detailed description and some experiments here, as does Sarkar.
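Here is a Python sketch of the recursive construction. For the fixed-length building block I use a pairwise-independent map that hashes a pair of field elements down to one element; the block size and the particular building block are illustrative choices, and any good fixed-length family would do:

```python
import secrets

P = 2**61 - 1  # work over GF(P); each "chunk" of the message is one field element

def sample_block_key():
    """Key for the building block h(x, y) = a0 + a1*x + a2*y (mod P)."""
    return tuple(secrets.randbelow(P) for _ in range(3))

def block_hash(key, x, y):
    a0, a1, a2 = key
    return (a0 + a1 * x + a2 * y) % P

def tree_hash(keys, chunks):
    """Hash a list of chunks down to a single field element.

    keys: one freshly sampled block key per level (about log2(len(chunks)) of them)."""
    level = 0
    while len(chunks) > 1:
        if len(chunks) % 2:
            # naive padding; a real implementation would also encode the length
            chunks = chunks + [0]
        key = keys[level]
        chunks = [block_hash(key, chunks[i], chunks[i + 1])
                  for i in range(0, len(chunks), 2)]
        level += 1
    return chunks[0]

# usage: a message of 8 chunks is hashed in 3 levels, hence 3 block keys
message = [secrets.randbelow(P) for _ in range(8)]
keys = [sample_block_key() for _ in range(3)]
print(tree_hash(keys, message))
```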
Iterative: This idea is known by various names, such as CBC-MAC and Pearson hashing. Here we assume the message is comprised of at most $\ell$ blocks of $b$ bits each: $m_1,\dots,m_\ell$. The hash function is indexed by a permutation $\pi$ of $\{0,1\}^b$. The $i$-th stage of the computation has a state $s_i \in \{0,1\}^b$. Starting with a fixed $s_0$, say $s_0 = 0$, the function iteratively computes $s_i = \pi(s_{i-1} \oplus m_i)$. The output is $s_\ell$. Ideally we would have liked the permutation to be completely random; this cannot be done for reasonable values of $b$, so in practice a pseudorandom permutation is used, though the exact definition of pseudorandomness in this context could be debated. Assuming the permutation is indeed random, what is the collision probability? It is shown here to be essentially as small as one could hope for, as long as the number of blocks $\ell$ is not too large. The proof holds for a stronger attack model. I do not see an inherent reason for the bound on $\ell$.
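And here is a Python sketch of the iterative construction, with byte-sized blocks so that the permutation can be stored as an explicit table, in the spirit of Pearson hashing (a CBC-MAC would instead use a block cipher as the permutation on much larger blocks; the 8-bit block size is only to keep the toy self-contained):

```python
import random

BLOCK_BITS = 8                 # tiny blocks so the permutation fits in a table
BLOCK_SIZE = 1 << BLOCK_BITS

def sample_permutation():
    """The key: a uniformly random permutation of {0,1}^8, stored as a lookup table."""
    table = list(range(BLOCK_SIZE))
    random.shuffle(table)
    return table

def iterative_hash(perm, blocks):
    """s_0 = 0; s_i = pi(s_{i-1} XOR m_i); the output is the final state."""
    state = 0
    for block in blocks:
        state = perm[state ^ block]
    return state

# usage (an 8-bit fingerprint is of course only a toy)
perm = sample_permutation()
print(iterative_hash(perm, [0x48, 0x65, 0x6C, 0x6C, 0x6F]))  # the bytes of "Hello"
```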
I now sketch Stinson’s proof of the bound mentioned above, which is a nice application of the second moment method. Recall that $U$ denotes the domain; the range is denoted by $R$ and its size by $m$. The collision probability is denoted by $\epsilon$. For every $h \in H$ and $z \in R$ let $B_{h,z}$ denote the preimage $h^{-1}(z)$, that is, the set of items mapped to $z$ by $h$. Now let’s pick some specific $h_0 \in H$ and $z_0 \in R$. For every $h \in H$ and $z \in R$ define $X_{h,z} = |B_{h_0,z_0} \cap B_{h,z}|$, i.e. the number of items that are mapped to $z_0$ by $h_0$ and to $z$ by $h$. Let $\mu$ denote the expected value of $X_{h,z}$ when $h$ is sampled from $H$ and $z$ from $R$. A moment’s reflection shows this average to be $|B_{h_0,z_0}|/m$ (for each item in $B_{h_0,z_0}$ we have a $1/m$ chance of guessing right where it lands under $h$). The key observation is that in order for $X_{h,z}$ to be large, many items should collide on $z$ under $h$, which is unlikely, so the fact that the collision probability of $H$ is bounded gives us a handle on the variance of $X_{h,z}$. This in turn yields the bound.
Consider two distinct items $x \neq y$ in $B_{h_0,z_0}$. Since they collide under $h_0$, there are at most $\epsilon|H| - 1$ other functions under which they collide. Summing this over all ordered pairs of distinct items in $B_{h_0,z_0}$ and changing the order of summation, we get an upper bound on $\sum_{h,z} X_{h,z}^2$, that is, on the second moment of $X_{h,z}$, in terms of $\epsilon$ and $|B_{h_0,z_0}|$. Of course we also know that the variance of $X_{h,z}$ is non-negative, which after a short calculation implies the bound. Finally, in order to get a bound in terms of $|U|$ and $m$, we observe that we can pick $h_0$ and $z_0$ such that $|B_{h_0,z_0}|$ is at least $|U|/m$.
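For concreteness, here is one way to carry out the calculation, writing $b = |B_{h_0,z_0}|$ and summing over the functions other than $h_0$; take it as an illustration of the method rather than as the exact form of Stinson’s bound. Counting colliding pairs function by function and then pair by pair,

$$\sum_{h\neq h_0}\sum_{z\in R} X_{h,z}^2 \;=\; \sum_{x,y\in B_{h_0,z_0}} \bigl|\{h\neq h_0 : h(x)=h(y)\}\bigr| \;\le\; b\,(|H|-1) + b(b-1)\bigl(\epsilon|H|-1\bigr),$$

while non-negativity of the variance (equivalently, Cauchy–Schwarz) gives

$$\sum_{h\neq h_0}\sum_{z\in R} X_{h,z}^2 \;\ge\; \frac{\Bigl(\sum_{h\neq h_0}\sum_{z\in R} X_{h,z}\Bigr)^2}{(|H|-1)\,m} \;=\; \frac{(|H|-1)\,b^2}{m}.$$

Combining the two inequalities and rearranging,

$$|H| \;\ge\; \frac{b\,(m-1)}{m(b-1)\,\epsilon - b + m}$$

whenever the denominator is positive. In particular, for $\epsilon = 1/m$ this gives $|H| \ge b \ge |U|/m$, which is the kind of key-length lower bound discussed at the beginning of the post.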