“Balls and Bins?”, you ask, “Is there anything left to prove there?” Surprisingly, there are really natural questions that are open. Today I want to talk about one such question.
First a quick primer. Balls and Bins processes model randomized allocations processes, used in hashing or more general load balancing schemes. Suppose that I have balls (think items) to be thrown into
bins (think hash buckets). I want a simple process that will keep the loads balanced, while allowing quick decentralized lookup.
The simplest randomized process one can think of is what’s called the one choice process: for each ball, we independently and uniformly-at-random pick one of the bins and place it there.
We measure the balance in terms of the additive gap: the difference between the maximum load and the average load. Many of us studied the case in a randomized algorithms class, and we know that the gap is
except with negligible probability (for the rest of the post, I will skip this “except with negligible probability” qualifier). What happens when
is much larger than
? It can be shown that for large enough
, the gap is
(see this paper for more precise behaviour of the gap).
Can we do better? Azar, Broder, Karlin and Upfal analyzed the following two choice process: balls come sequentially, and each ball picks two bins independently and uniformly at random (say, with replacement), and goes into the less loaded of the two bins (with ties broken arbitrarily). They showed that the gap of the two choice process when is significantly better: only
. What about the case when
is much larger than
? Berenbrink et al. showed that the gap stays at
, for arbitrary
.
While the decrease from to
is surprising enough, I find the latter distinction much more striking. When
is a large polynomial in
, the gap for one choice scheme keeps going up, while that in two choice case remains put. For some quick intuition on why this happens, pause for a minute to think about the case
(A hint is in the comments). Then read on.
So now that we understand one and two choice, let’s move on (or rather, dig in between). Yuval Peres, Udi Wieder and I analyzed the -choice process, where we place a ball in a uniformly random bin with probability
, and in the lesser loaded to two random bins with probability
. The
intuition would suggest that this slight bias towards balance would keep the gap from growing with
, and this is indeed what we show: for
bounded away from
, the gap is
.
Analyzing the -choice process helps us understand another natural process. Given a regular graph
on
, one can define the balls and bins process on
as follows: balls come sequentially, and each ball picks an edge of
at random, and goes to the (bin corrseponding to the) lesser loaded endpoint. Thus when
is made up of
self loops, we get the one choice process. When
is the complete graph (with self loops), we get the two choice process. Using the
-choice analysis, it can be shown that whenever
is connected, the gap is independent of
. And when
is an expander, the gap is
.
And that brings us to a simple open question: the case of being a cycle. To restate the question:
bins on a cycle. Repeatedly pick two adjacent bins, and put a ball in the lesser loaded of the two. How large does the gap get?
We can show the gap is and
. Empirically neither of these bounds seems remotely tight. Can you improve them?
First, an apology for skipping a lot of other relevant literature.
The promised hint for the 2 bin case: Let X_1 and X_2 denote the loads of the two bins. Notice that the gap is |X_1-X_2|/2. How does the random variable (X_1-X_2) evolve in the one choice case? How about the two choice case?
Perhaps you could emphasize that by “gap”, you mean it in the standard English sense of “difference”, as opposed to ratio (which is what usually “integrality gap” means, say).
Thanks Arvind for the remark. i have added the qualificiation “additive” to clarify that.
Is the $\Omega(\log n)$ lower bound with respect to $m=n$? If so, I’m a bit surprised. If one always picks the bin with even index (according to some fixed enumeration of the bins along the cycle) instead of picking the lesser loaded one, one arrives at the single-choice model with $n$ balls and $n/2$ bins, which should upper-bound the two-choice cycle and whose gap I thought is also $O(\ln n / \ln\ln n)$ …
You are right. For $m=n$, pretty much any process would give you a maximum load of $O(log n/log log n$. The lower bound is for larger $m$. More precisely, if $m=0.25 n log n$, the average is $0.25 log n$. For one choice, there will then be a bin with load more than $0.75 log n$. In the cycle case, you get an edge which is picked $0.75 log n$ times, so that at least one of its endpoints has load at least $0.35 log n$, which is $\Omega(log n)$ more than the average.