Discrepancy Bounds from Convex Geometry

In the last post we discussed some questions about discrepancy and the ‘Six Standard Deviations Suffice’ theorem stated below (without the {6}, which is not too important, but makes for a great title):

Theorem 1 For vectors {a^1,\ldots,a^n \in \{1,-1\}^n}, there exists {\epsilon \in \{1,-1\}^n} such that for every {j \in [n]}, {|\langle a^j,\epsilon\rangle| = O(\sqrt{n})}.

In this post (the second and most technical of this three-post series) we will see a proof of the above result. The theorem was proved by Spencer in 1985 using Beck’s partial coloring approach. Independently of Spencer, the result was proved by Gluskin in 1988 in a convex-geometric context. Here we will review Gluskin’s proof, which is quite beautiful.

Gluskin’s proof will also give us an excuse to look at some elegant (and simple to describe) results in convex geometry which may be of use elsewhere. Finally, the geometric view here will actually be useful in the next post, when we discuss an algorithmic proof. Gluskin’s paper is truly remarkable and seems to reinvent several key ideas from scratch, such as Sidak’s lemma and a Gaussian version of Kanter’s lemma, and even has the partial coloring approach implicit in it. I recommend taking a look at the paper even if it is a bit of a tough read. Much of the content in this post is based on discussions with Shachar Lovett and Oded Regev, but all mistakes are mine.

Gluskin’s proof follows the partial coloring approach with the crucial lemma proved using a volume argument. The partial coloring method was introduced by Beck in 1981, and all proofs of Theorem 1 and many other important discrepancy results in fact use this method. Here, instead of looking for an {\epsilon \in \{1,-1\}^n} solution as in the theorem, one first looks for a {\{1,0,-1\}^n} solution. On its own this is meaningless, as we can just output the all-zeros vector. The main idea is to instead look for a {\{1,0,-1\}^n} solution which has {\Omega(n)} support. We then recurse on the set of coordinates which are set to {0}. If everything goes according to plan, as the ambient dimension decreases geometrically, we get geometrically decreasing discrepancy bounds whose sum we can tolerate. I won’t go into the details here, but let’s accept that it suffices to show the following:
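
To see why the recursion gives what we want, suppose each round starts with at most half the coordinates of the previous round and incurs discrepancy {c\sqrt{m}} on {m} remaining coordinates; the total discrepancy is then a geometric sum. Here is a quick numerical check of that telescoping bound (the constant {c} and the exact halving schedule are illustrative assumptions, not part of the proof):

```python
import math

def total_discrepancy_bound(n, c=1.0):
    """Sum the per-round bounds c*sqrt(m) as m halves from n down to 1."""
    total, m = 0.0, n
    while m >= 1:
        total += c * math.sqrt(m)
        m //= 2
    return total

# The telescoping sum is at most c*sqrt(n)/(1 - 1/sqrt(2)) < 3.42*c*sqrt(n).
n = 1 << 20
assert total_discrepancy_bound(n) <= 3.42 * math.sqrt(n)
```

So the partial colorings indeed combine into a full coloring with discrepancy {O(\sqrt{n})}.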

Lemma 2 For vectors {a^1,\ldots,a^n \in \{1,-1\}^n}, there exists {\epsilon \in \{1,0,-1\}^n} such that for every {j \in [n]}, {|\langle a^j,\epsilon\rangle| = O(\sqrt{n})} and {|Support(\epsilon)| = \Omega(n)}.

To prove the above partial-coloring-lemma, let us first rephrase the problem in a geometric language. Let {\mathcal{K} \subseteq R^n} be the symmetric convex set (symmetric meaning {x \in \mathcal{K}} implies {-x \in \mathcal{K}}) defined as follows for {\Delta = O(\sqrt{n})} to be chosen later:

\displaystyle \mathcal{K} = \{x: |\langle a^j,x\rangle| \leq \Delta, \forall j \in [n]\}.

We want to show that {\mathcal{K}} contains a {\{1,0,-1\}^n} lattice point of large support. We show this indirectly by proving that {\mathcal{K}} instead contains a lot of points from {\{1,0,-1\}^n}. Gluskin does this by a clever volume argument: first show that the volume of {\mathcal{K} \cap [-1,1]^n} is large and then apply Minkowski’s theorem to show that there are many lattice points. To lower bound the first volume, Gluskin actually works in the Gaussian space.

I don’t have a clear intuitive reason for why the Gaussian distribution is better than the Lebesgue measure in this context. But if one looks at the analysis, a clear advantage is that projections behave better (when considering volumes) in the Gaussian space. For example, if we take a set like {S = \{x \in R^n\,:\,|x_1| \leq 1\}}, then the Lebesgue volume of {S} is infinite, but its projection onto the first coordinate has finite volume. In the Gaussian case, both volumes are the same.

We next go over all the main pieces in Gluskin’s proof.

Sidak’s Lemma  Suppose we have a standard Gaussian vector {g \sim \mathcal{N}(0,1)^n}. Then, for any unit vector {v \in R^n}, {\langle v,g\rangle} has the standard normal distribution. Now, suppose we have several unit vectors {v_1,\ldots,v_m \in R^n}. Then, the random variables {X_i = \langle v_i,g\rangle} are individually standard normals, but are correlated with one another. Sidak’s lemma (1967) says that no matter what the correlations among the {X_i}‘s are, when bounding the probability that none of the {X_i}‘s is too large, the “worst behaviour” one could expect is for them to be independent. Concretely:

Lemma 3 (Sidak’s Lemma) Let {v_1,\ldots,v_m \in R^n} and let {g \sim \mathcal{N}(0,1)^n} be a standard Gaussian vector. Then, for all {t_1,\ldots,t_m \in R_+},

\displaystyle Pr\left[|\langle v_1,g\rangle| \leq t_1 \;\wedge\; |\langle v_2,g\rangle| \leq t_2 \;\wedge\; \cdots \;\wedge\; |\langle v_m,g\rangle| \leq t_m\right] \geq \prod_{i=1}^m Pr\left[|\langle v_i,g\rangle| \leq t_i\right].

The proof of the lemma is actually not too hard and an excellent exposition can be found in this paper.
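
To get a feel for the inequality, here is a small Monte Carlo sanity check (not part of Gluskin’s argument; the two directions, the threshold {t = 1}, and the sample size are arbitrary choices of mine):

```python
import math
import random

random.seed(0)

def mc_sidak(v1, v2, t, trials=20000):
    """Monte Carlo estimates of the joint and product slab probabilities."""
    joint = p1 = p2 = 0
    n = len(v1)
    for _ in range(trials):
        g = [random.gauss(0.0, 1.0) for _ in range(n)]
        in1 = abs(sum(a * b for a, b in zip(v1, g))) <= t
        in2 = abs(sum(a * b for a, b in zip(v2, g))) <= t
        joint += in1 and in2
        p1 += in1
        p2 += in2
    return joint / trials, (p1 / trials) * (p2 / trials)

# Two strongly correlated unit directions: the joint probability of landing
# in both slabs visibly exceeds the product of the marginals.
joint, product = mc_sidak([1.0, 0.0], [math.sqrt(0.5), math.sqrt(0.5)], 1.0)
assert joint >= product
```

For strongly correlated directions the joint probability clearly beats the product, which is exactly the slack the lemma guarantees.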

The lemma is actually a very special case of a longstanding open problem called the Correlation Conjecture. Let me digress a little bit to state this beautiful question. In the above setup, let {C_i = \{x \in R^n: |\langle v_i,x\rangle| \leq t_i\}} be the slab with normal {v_i}. Then, Sidak’s lemma says that for {g \sim \mathcal{N}(0,1)^n},

\displaystyle Pr[g \in C_1 \cap C_2 \cap \cdots \cap C_m] \geq Pr[g \in C_1] \cdot Pr[g \in C_2] \cdots Pr[g \in C_m].

The correlation conjecture asserts that this inequality is in fact true for all symmetric convex sets (in fact, it suffices to consider {m=2}). Sidak’s lemma says the conjecture is true for slabs. It is also known to be true for ellipsoids. The statement for ellipsoids also has a discrepancy implication, leading to a vector generalization of Spencer’s theorem (pointed out to me by Krzysztof Oleszkiewicz). But that’s for another day.

Kanter’s Lemma The second inequality we need is a comparison inequality due to Kanter. The lemma essentially lets us lift certain relations between two distributions {p,q} to their product distributions {p^n, q^n}, and I think it should be useful in other contexts. For instance, I recently used it in this paper in a completely different context. To state the lemma, we need the notion of peakedness of distributions.

Let {p,q} be two symmetric distributions on {R^m} for some {m}. We say {p} is less peaked than {q} (written {p \preceq q}) if for all symmetric convex sets {\mathcal{K}}, {p(\mathcal{K}) \leq q(\mathcal{K})}. Intuitively, this means that {p} is putting less of its mass near the origin than {q} (hence the term less peaked). For example, {\mathcal{N}(0,2) \preceq \mathcal{N}(0,1)}.
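
The example {\mathcal{N}(0,2) \preceq \mathcal{N}(0,1)} can be checked directly in one dimension, where the symmetric convex sets are just the intervals {[-t,t]}; a quick Python check using the error function:

```python
import math

def mass_in_interval(sigma, t):
    """P(|X| <= t) for X ~ N(0, sigma^2), via the error function."""
    return math.erf(t / (sigma * math.sqrt(2)))

# N(0,2) puts less mass than N(0,1) in every symmetric interval [-t, t],
# i.e. N(0,2) is less peaked than N(0,1).
for t in [x / 10 for x in range(1, 51)]:
    assert mass_in_interval(math.sqrt(2), t) <= mass_in_interval(1.0, t)
```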

Kanter’s lemma says that the peakedness relation tensorises provided we have unimodality. A univariate distribution is unimodal if the corresponding probability density function has a single maximum and no other local maxima. I won’t define what it means for a multivariate distribution to be unimodal here, but we only need the lemma for univariate distributions. See this survey for the formal definition.

Lemma 4 (Kanter’s lemma) Let {p,q} be two symmetric distributions on {R^n} such that {p \preceq q} and let {\mu} be a symmetric unimodal distribution on {R^m}. Then, the product distributions {p \times \mu}, {q \times \mu} on {R^{n+m}} satisfy {p \times \mu \preceq q \times \mu}.
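
Here is a Monte Carlo illustration of the lemma in the smallest interesting case, with {p = \mathcal{N}(0,2)}, {q = \mathcal{N}(0,1)}, {\mu} uniform on {[-1,1]}, and the unit disk as the symmetric convex test set (all of these choices are mine, purely for illustration):

```python
import math
import random

random.seed(0)

def mass_of_disk(sigma, trials=20000):
    """MC estimate of (N(0, sigma^2) x Unif[-1,1])(unit disk)."""
    hits = 0
    for _ in range(trials):
        x = random.gauss(0.0, sigma)      # first coordinate ~ N(0, sigma^2)
        y = random.uniform(-1.0, 1.0)     # second coordinate ~ Unif[-1, 1]
        hits += x * x + y * y <= 1.0
    return hits / trials

# p = N(0,2) is less peaked than q = N(0,1); Kanter's lemma promises the
# products with Unif[-1,1] compare the same way on any symmetric convex set.
assert mass_of_disk(math.sqrt(2)) <= mass_of_disk(1.0)
```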

The proof of the lemma is not too hard, but is non-trivial in that it uses the Brunn–Minkowski inequality. Combining the above lemma with the not-too-hard fact that the standard Gaussian distribution is less peaked than the uniform distribution on {[-1,1]}, we get:

Corollary 5 Let {\mu} be the uniform distribution on {[-1,1]}. Then, {\mathcal{N}(0,1)^n \preceq \mu^n}.
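
The “not-too-hard fact” underlying the corollary amounts, in dimension one, to {Pr[|g| \leq t] \leq Pr[|u| \leq t]} for all {t}, where {g \sim \mathcal{N}(0,1)} and {u \sim \mu}; a quick check:

```python
import math

def gauss_mass(t):
    """P(|g| <= t) for g ~ N(0, 1)."""
    return math.erf(t / math.sqrt(2))

def unif_mass(t):
    """P(|u| <= t) for u uniform on [-1, 1]."""
    return min(t, 1.0)

# N(0,1) is less peaked than Unif[-1,1]: less mass in every interval [-t, t].
for t in [x / 20 for x in range(1, 80)]:
    assert gauss_mass(t) <= unif_mass(t)
```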

Minkowski’s Theorem The final piece we need is Minkowski’s classical theorem from lattice geometry:

Theorem 6 (Minkowski’s Theorem) Let {C \subseteq R^n} be a symmetric convex set of Lebesgue volume more than {2^n \cdot \ell} for an integer {\ell \geq 1}. Then, {C} contains at least {\ell} points from the integer lattice {\mathbb{Z}^n}.
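
As a concrete illustration of the theorem (a toy example of mine, not part of the proof), take {n = 2} and let {C} be a disk: whenever its area exceeds {4\ell}, it must contain at least {\ell} integer points, and a direct count confirms this comfortably:

```python
import math

def lattice_points_in_disk(r):
    """Count integer points (x, y) with x^2 + y^2 <= r^2."""
    R = int(math.floor(r))
    return sum(1 for x in range(-R, R + 1) for y in range(-R, R + 1)
               if x * x + y * y <= r * r)

# Minkowski with n = 2: Lebesgue area > 4*l forces at least l lattice points.
r = 5.0
l = int(math.pi * r * r // 4)   # largest l with area > 4*l
assert lattice_points_in_disk(r) >= l
```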

Putting Things Together  We will prove the partial coloring lemma Lemma 2. The proof will be a sequence of simple implications using the above lemmas. Recall the definition of {\mathcal{K}}:

\displaystyle \mathcal{K} = \{x: |\langle a^j,x\rangle| \leq \Delta, \forall j \in [n]\}.

Our goal (Lemma 2) is equivalent to showing that {\mathcal{K}} contains a {\{1,0,-1\}^n} point of large support.

Note that {\mathcal{K}} is the intersection of {n} slabs. Therefore, by Sidak’s lemma, for {g \sim \mathcal{N}(0,1)^n},

\displaystyle Pr[g \in \mathcal{K}] \geq \prod_{j=1}^n Pr[|\langle a^j,g\rangle| \leq \Delta] \geq \prod_{j=1}^n\left(1-2\exp(-\Delta^2/(2n))\right),

where the last inequality follows from the fact that {\langle a^j,g\rangle} has the Gaussian distribution with standard deviation {\sqrt{n}}. Therefore, if we pick {\Delta = C\sqrt{n}} for a sufficiently big constant {C}, then

\displaystyle Pr[g \in \mathcal{K}] \geq (3/4)^n.
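
To see what “sufficiently big” means concretely: each slab has Gaussian mass at least {1 - 2\exp(-C^2/2)} when {\Delta = C\sqrt{n}}, and already {C = 3} makes each factor at least {3/4} (a one-line check; the specific {C} is my choice):

```python
import math

# With Delta = C*sqrt(n), each slab has Gaussian mass >= 1 - 2*exp(-C^2/2).
# C = 3 already pushes the per-slab mass above 3/4, so Pr[g in K] >= (3/4)^n.
C = 3.0
per_slab = 1 - 2 * math.exp(-C * C / 2)
assert per_slab >= 0.75
```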

Now, let {\mu} be the uniform distribution on {[-1,1]}. Then, by Corollary 5 and the definition of peakedness, {\mu^n(\mathcal{K}) \geq (3/4)^n}. Hence, the Lebesgue volume of {\mathcal{K}' = \mathcal{K} \cap [-1,1]^n} is at least {2^n (3/4)^n = (3/2)^n}. Therefore, for sufficiently small {\delta > 0}, the Lebesgue volume of {(2-\delta)\mathcal{K}'} is at least {(2-\delta)^n \cdot (3/2)^n = 2^n \cdot ((2-\delta) \cdot 3/4)^n = 2^n \cdot 2^{\Omega(n)}}. Thus, applying Minkowski’s theorem to the symmetric convex set {(2-\delta)\mathcal{K}'}, we get that {(2-\delta)\mathcal{K}'} contains at least {2^{\Omega(n)}} lattice points.

Now, note that the only lattice points in {(2-\delta)\mathcal{K}'} are elements of {\{1,0,-1\}^n} (since {(2-\delta)\mathcal{K}' \subseteq (-2,2)^n}), and each of them lies in {(2-\delta)\mathcal{K}}, hence has discrepancy at most {2\Delta = O(\sqrt{n})}. Therefore there are at least {2^{\Omega(n)}} points from {\{1,0,-1\}^n} with discrepancy {O(\sqrt{n})}. By a simple counting argument, at least one of these lattice points {\epsilon} must have {\Omega(n)} non-zero coordinates – which is exactly what we need to prove Lemma 2!
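
The counting step can be made concrete: there are at most {\sum_{k \leq cn}\binom{n}{k}2^k} points of {\{1,0,-1\}^n} with support at most {cn}, which is exponentially smaller than the {2^{\Omega(n)}} points we found once {c} is small enough. A small numerical check (the constants {0.2} and {0.01} below are illustrative, standing in for the unspecified {\Omega(n)} constants):

```python
import math

def log2_small_support_count(n, c):
    """log2 of the number of {-1,0,1}^n points with at most c*n nonzeros."""
    total = sum(math.comb(n, k) * 2 ** k for k in range(int(c * n) + 1))
    return math.log2(total)

# If K holds 2^{0.2 n} lattice points, they cannot all have support below
# 0.01*n: there are far too few such sparse points.
n = 400
assert log2_small_support_count(n, 0.01) < 0.2 * n
```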

Discussion  The above argument can actually be simplified by replacing the use of Kanter’s lemma with an appropriate version of Minkowski’s theorem for Gaussian volume, as done here. But I like any excuse to discuss Kanter’s lemma.

More importantly, the proof seems to be more amenable to generalization. The core of the proof really is to use Sidak’s lemma to lower bound the Gaussian volume of the convex set {\mathcal{K}}. Whenever you have such a statement, you should also get a corresponding discrepancy statement. In particular, the matrix discrepancy conjecture from the last post essentially reduces to the following probability question:

Question Is it true that for a universal constant {C}, for all symmetric matrices {A_1,\ldots,A_n \in R^{n \times n}} with {\|A_i\| \leq 1},

\displaystyle Pr_{g \sim \mathcal{N}(0,1)^n}\left[\|g_1 A_1 + g_2 A_2 + \cdots + g_n A_n \| \leq C \sqrt{n}\right] \geq (3/4)^n\,?

Acknowledgments Thanks to Shachar Lovett, Oded Regev and Nikhil Srivastava for helpful suggestions, comments, and corrections during the preparation of this post.

4 thoughts on “Discrepancy Bounds from Convex Geometry”

  1. Very nice post!

    I may be missing something, but I think Giannopoulos’s proof also gives the implication from Gaussian volume lower bounds to a partial coloring.

    Let K be the set of “small discrepancy fractional colorings” (the same set you define), and assume \Pr_g[g \in K] > 2^{-c_1 n}. Then let f_K(g) be the number of sign sequences \varepsilon \in \{-1, 1\}^n such that g \in 0.01\varepsilon + K. From looking at the pdf of the standard gaussian, \Pr[g \in 0.01\varepsilon + K] \geq 2^{-c_2 n} for some constant c_2, so in expectation f_K(g) \geq 2^n \cdot 2^{-c_2 n} = 2^{(1 - c_2)n}. So there must exist a point x which is in at least N = 2^{(1-c_2)n} sets 0.01\varepsilon^{1} + K, \ldots, 0.01\varepsilon^{N} + K. For any two i, j, 0.01(\varepsilon^i - \varepsilon^j) \in 2K, i.e. \frac{\varepsilon^i - \varepsilon^j}{2} \in 100K, and since N is so large, there must exist two i, j such that \frac{\varepsilon^i - \varepsilon^j}{2} has large support. I chose 0.01 very conservatively, much better constants should do the job.

    But Kanter’s lemma is cool in any case!

    1. Indeed, Giannopoulos’s proof is simpler – one can view the argument you described (a typo: c_2 should be a constant) as a kind of Gaussian version of Minkowski’s theorem which by itself is a nice thing to know.

      1. Sorry for the typo. I wish there was a way to preview comments by the way.

        I could’ve been more explicit: \Pr[g \in x + K] \geq e^{-0.5\|x\|_2^2} \Pr[g \in K] for any centrally symmetric K and any x, so c_2 = c_1 + 0.005\log_2 e \leq c_1 + 0.008. I guess I was really conservative with constants.
