In the previous posts we talked about various discrepancy questions and saw a proof of the six standard deviations suffice result. Besides being of interest in combinatorics, discrepancy theory has several remarkable applications in algorithms. Check this excellent book for a taste of these results. Here I will briefly discuss two (one old and one new) such results:

• Rounding linear programs by discrepancy theory: specifically, the beautiful argument of Lovász, Spencer and Vesztergombi (LSV) bounding linear discrepancy in terms of hereditary discrepancy. LSV is also an excellent place to look if you are running short on beautiful, innocent-in-appearance, time-sinking combinatorial conjectures.

Unfortunately, to keep the post short(er), I will not discuss the recent breakthrough result of Rothvoss on bin packing, which uses discrepancy theory (in a much more sophisticated way) for rounding a well-studied linear program (the “Gilmore-Gomory” LP). I highly recommend reading the paper.

• A recent algorithmic proof of Spencer/Gluskin theorem due to Shachar Lovett and myself.

Rounding Linear Programs One of the most basic techniques in combinatorial optimization is linear programming relaxation. Let us phrase this in a language suitable for the present context (which is in fact fairly generic). We have a constraint matrix ${A \in \mathbb{R}^{m \times n}}$, a target vector ${b \in \mathbb{R}^m}$, and our goal is to find an ${x \in \{0,1\}^n}$ so as to minimize ${\|Ax - b\|_\infty}$. One typical approach is to relax this discrete problem and instead solve the linear program

$\displaystyle \min_{y \in [0,1]^n} \|Ay - b\|_\infty \ \ \ \ \ (1)$

(which can be done efficiently). The next step, and often the most challenging one, is to round the fractional solution ${y \in [0,1]^n}$ to an integer solution in ${\{0,1\}^n}$ with little “loss”. How well can we do this for a constraint matrix ${A}$, and general vectors ${b}$? This is captured by the notion of linear discrepancy introduced by LSV:

$\displaystyle lindisc(A) = \max_{y \in [0,1]^n} \min_{x \in \{0,1\}^n} \|Ax - Ay\|_\infty. \ \ \ \ \ (2)$

LSV introduced the above notion originally as a generalization of discrepancy which can be formulated in our current context as follows:

$\displaystyle disc(A) = \min_{\varepsilon \in \{1,-1\}^n} \|A\varepsilon\|_\infty. \ \ \ \ \ (3)$

This corresponds exactly to the notion of discrepancy we studied in the last two posts. On the other hand, we can also write ${disc(A) = 2 \min_{x \in \{0,1\}^n} \| Ax - A e/2\|_\infty}$, where ${e}$ denotes the all-ones vector. Thus, ${disc(A)}$ corresponds to the special case of linear discrepancy where we only have to round one particular fractional solution (${e/2}$).

The remarkable result of LSV is that while the above definition seems to be much weaker than ${lindisc}$ (which has to round all fractional solutions) there is a natural extension, hereditary discrepancy, of ${disc}$ which is nearly as strong. For a matrix ${A \in \mathbb{R}^{m \times n}}$ and ${S \subseteq [n]}$, let ${A_S}$ denote the ${\mathbb{R}^{m \times |S|}}$ sub-matrix corresponding to the columns of ${A}$ indexed by ${S}$. Then, define

$\displaystyle herdisc(A) = \max_S disc(A_S). \ \ \ \ \ (4)$

Hereditary discrepancy is a natural and more “robust” version of discrepancy. As LSV phrase it, discrepancy can be small by accident, whereas hereditary discrepancy seems to carry more structural information. For example, let ${A}$ be a random ${1,-1}$ matrix with the constraint that each row has sum ${0}$. Then, ${disc(A) = 0}$, but ${herdisc(A) = \Omega(\sqrt{n})}$ (this needs proving, but is not too hard) – which makes intuitive sense as we expect random matrices to have little structure.
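To make the contrast concrete, here is a quick brute-force check of a small instance of this phenomenon. This is hypothetical demo code of my own (not from LSV); both computations enumerate exponentially many candidates and are only meant for tiny matrices:

```python
from itertools import combinations, product

def disc(rows):
    """Brute-force discrepancy: minimize, over all sign vectors eps,
    the maximum |<row, eps>| over the rows."""
    n = len(rows[0])
    return min(max(abs(sum(r[i] * e[i] for i in range(n))) for r in rows)
               for e in product((1, -1), repeat=n))

def herdisc(rows):
    """Hereditary discrepancy: maximum of disc over all column subsets."""
    n = len(rows[0])
    return max(disc([[r[i] for i in S] for r in rows])
               for k in range(1, n + 1) for S in combinations(range(n), k))

# A {1,-1} matrix whose rows sum to 0: the all-ones coloring certifies
# disc = 0, but restricting to the first two columns forces discrepancy 2.
A = [[1, -1, 1, -1],
     [1, 1, -1, -1]]
print(disc(A), herdisc(A))  # → 0 2
```

Deleting columns destroys the accidental cancellation in each row, which is exactly the sense in which hereditary discrepancy is more robust.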

It is also worth noting that several notable results in discrepancy theory which bound the discrepancy in fact also bound hereditary discrepancy. For example, Spencer’s original proof as well as Gluskin’s argument from last post (with a little bit more work) in fact show the following:

Theorem 1 For all matrices ${A \in [-1,1]^{n \times n}}$, ${herdisc(A) = O(\sqrt{n})}$.

LSV show the following connection between linear and hereditary discrepancies:

Theorem 2 For any matrix ${A}$, ${ lindisc(A) \leq herdisc(A)}$.

In other words, any fractional solution for a linear program of the form Equation 1 can be rounded to an integer solution with an additive error of at most ${herdisc(A)}$.

Let me now describe the cute proof, which can be explained in a few lines. Suppose you have a fractional solution ${y \in [0,1]^n}$. Our goal is to find an ${x \in \{0,1\}^n}$ such that ${\|Ay- Ax\|_\infty}$ is small. We will construct such an ${x}$ by iteratively making ${y}$ more integral. Let us write out the binary expansion of each coordinate of ${y}$: for each ${i \in [n]}$, ${y_i = 0.y_{i1} y_{i2} \cdots }$. To avoid unnecessary technical issues, let us suppose that each coordinate has a finite expansion of length ${m}$.

We will build a sequence of solutions ${y^0 = y, y^1,\ldots,y^m = x}$ such that the coordinates of ${y^i}$ when written in binary will have expansions of length at most ${m-i}$. Let us look at how to get ${y^1}$; we can get the rest similarly.

Let ${S = \{i \in [n] : y_{im} = 1\}}$. By the definition of ${herdisc}$, there exists a sign vector ${\varepsilon_S \in \{1,-1\}^S}$ such that ${\|A_S \varepsilon_S \|_\infty \leq herdisc(A)}$. Let ${y^1 = y + \varepsilon_S/2^m}$ (interpreting ${\varepsilon_S}$ as a vector in ${\{1,-1,0\}^n}$ in the natural way). Clearly, the binary expansions of the coordinates of ${y^1}$ have length at most ${m-1}$. Further, ${\|Ay - Ay^1\|_\infty = \|A_S \varepsilon_S\|_\infty/2^m \leq herdisc(A)/2^m}$. Iterating this argument, we get ${y^1,\ldots,y^m = x}$ such that ${\|Ay^i - A y^{i+1}\|_\infty \leq herdisc(A)/2^{m-i}}$. Therefore,

$\displaystyle \|A y - Ax\|_\infty \leq \sum_{i=0}^{m-1} \frac{herdisc(A)}{2^{m-i}} \leq herdisc(A).$
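The whole argument translates into code almost verbatim. Below is a self-contained sketch of my own (a brute-force search stands in for the low-discrepancy signing, so it only runs on tiny instances); it uses exact dyadic fractions to mirror the binary-expansion bookkeeping:

```python
from fractions import Fraction
from itertools import product

def best_signing(rows, S):
    """Brute-force stand-in for the herdisc oracle: signs on columns S
    minimizing the maximum |A_S eps| over the rows."""
    return min(product((1, -1), repeat=len(S)),
               key=lambda e: max(abs(sum(r[S[j]] * e[j] for j in range(len(S))))
                                 for r in rows))

def lsv_round(rows, y, m):
    """Round y (coordinates dyadic with at most m bits) to {0,1}^n,
    clearing the lowest-order bit of every coordinate in each pass."""
    y, n = list(y), len(y)
    for bit in range(m, 0, -1):
        scale = Fraction(1, 2 ** bit)
        S = [i for i in range(n) if (y[i] / scale) % 2 == 1]  # bit is set
        if not S:
            continue
        eps = best_signing(rows, S)
        for j, i in enumerate(S):
            y[i] += eps[j] * scale  # perturb by +-2^{-bit}, clearing the bit
    return y

A = [[1, -1, 1, -1], [1, 1, 1, 1]]
y = [Fraction(1, 2), Fraction(1, 4), Fraction(3, 4), Fraction(1, 2)]
x = lsv_round(A, y, 2)
print([int(v) for v in x])  # an integral 0/1 vector close to y under A
```

Each pass changes ${Ay}$ by at most ${herdisc(A)/2^{bit}}$, so the total rounding error telescopes exactly as in the display above.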

Constructive Discrepancy Minimization by Walking on the Edges We just saw a way to round linear programs as in Equation 1 with error bounded by their discrepancy. As appealing as this is, it comes with one important caveat. The original motivation for looking at LP relaxations was that we can solve them efficiently. For this to make sense, we need the rounding procedure to be efficient as well. In our present case, to make the rounding efficient, we need to find a small-discrepancy solution efficiently (find ${\varepsilon_S}$ given ${A_S}$ as above). Unfortunately, this is in general NP-hard in a very strong sense, as was shown recently by Charikar, Newman and Nikolov.

However, what about specific bounds like in Theorem 1? Spencer’s original proof as well as Gluskin’s do not give an efficient algorithm for finding a good coloring. This is inherent in both arguments: they rely (directly, or indirectly via Minkowski’s theorem) on the pigeon-hole principle (with exponentially many “holes”), which is quite non-algorithmic. In fact, Alon and Spencer conjectured in their book several years ago that finding a coloring with discrepancy ${O(\sqrt{n})}$ as in the theorem is computationally hard.

Fortunately for us, like all good conjectures, this was proven to be false by a breakthrough result of Nikhil Bansal. Nikhil’s argument studies a carefully constructed semi-definite programming relaxation of the problem and then gives a new and amazing rounding algorithm for the SDP.

Here I will briefly discuss a different proof of Theorem 1 due to Lovett and myself which will also lead to an efficient algorithm for finding a coloring as required.

Let us first revisit Beck’s partial-coloring approach described in the last post, which says that to prove the theorem it suffices to show the following.

Lemma 3 For vectors ${a^1,\ldots,a^n \in [-1,1]^n}$, there exists ${\varepsilon \in \{1,0,-1\}^n}$ such that for every ${j \in [n]}$, ${|\langle a^j,\varepsilon\rangle| = O(\sqrt{n})}$ and ${|Support(\varepsilon)| = \Omega(n)}$.

As in the last post, let us also rephrase the problem in a geometric language. Let ${\mathcal{K} \subseteq \mathbb{R}^n}$ be the symmetric convex set (symmetric meaning ${x \in \mathcal{K}}$ implies ${-x \in \mathcal{K}}$) defined as follows for ${\Delta = O(\sqrt{n})}$ to be chosen later:

$\displaystyle \mathcal{K} = \{\,x \,:\, |\langle a^j,x\rangle| \leq \Delta \;\forall j \in [n] \text{ (the \emph{discrepancy} constraints)},\;\; |x_i| \leq 1 \;\forall i \in [n] \text{ (the \emph{coloring} constraints)}\,\}.$

The partial coloring lemma is equivalent to showing that ${\mathcal{K}}$ contains a ${\{1,0,-1\}^n}$ lattice point of large support. As it turns out, we don’t really need to find a lattice point in ${\mathcal{K}}$ but any point with many ${\{1,-1\}}$ (or close to ${\{1,-1\}}$) coordinates will serve us equally well. Concretely, we want:

Lemma 4 For ${\mathcal{K}}$ as above, there exists ${x \in \mathcal{K}}$ such that ${|\{i: |x_i| = 1\}| = \Omega(n)}$.

The above is equivalent to finding a vertex of ${\mathcal{K}}$ which is tight on ${\Omega(n)}$ coloring constraints. For intuition let us use the distance from the origin as a proxy for how many coordinates are close to ${1}$ in absolute value. Thus, roughly speaking our goal is to find a vertex of ${\mathcal{K}}$ as far away from the origin as possible. This is the interpretation we will use.

Let us think of trying to find the vertex ${x}$ iteratively. Our starting point would be the all-zeros vector. Now what? Well, we want to get away from the origin but still stay inside the polytope ${\mathcal{K}}$. Being somewhat greedy and lazy, we will just toss some coins and update our ${x}$ by doing Brownian motion (if this is uncomfortable, think of taking a discrete walk with tiny random Gaussian steps). The point is that the random walk will steadily move away from the origin.

Now, in the course of performing this random walk, at some time we will touch the boundary of ${\mathcal{K}}$, or in other words hit some constraint(s) of ${\mathcal{K}}$. Now what? Well, as before we still want to get away from the origin but do not want to leave the polytope. So, being greedy and lazy as before (tough habits to change), we will continue doing Brownian motion but now constrain ourselves to only take steps which lie in the face of the polytope that we hit. We thus move from doing Brownian motion in ${n}$ dimensions to doing one in a subspace (corresponding to the face that we hit) of dimension at most ${n-1}$. We now repeat this process: every time we hit a new constraint, we restrict ourselves to move only within the face we hit. Eventually, we should reach a vertex of the polytope ${\mathcal{K}}$.

This is pretty much the actual algorithm which we call the Edge-Walk algorithm (except that to make it implementable we take tiny Gaussian random steps instead of doing Brownian motion; and this makes the analysis easier too). I will refer to the paper for the full details of the algorithm and its analysis, but let me just say what the punchline is: you are more likely to hit nearer constraints than farther ones.
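Here is a toy implementation of this idea. It is my own simplified sketch, not the parameter-tuned algorithm from the paper: fixed step size, crude near-tightness tests, and no guarantees. It takes small Gaussian steps, projects them onto the subspace orthogonal to the constraints hit so far, and freezes a constraint once it becomes (nearly) tight:

```python
import math
import random

def project_out(v, basis):
    """Project v onto the orthogonal complement of span(basis);
    basis is maintained orthonormal."""
    v = v[:]
    for u in basis:
        d = sum(a * b for a, b in zip(v, u))
        v = [a - d * b for a, b in zip(v, u)]
    return v

def add_direction(basis, w):
    """Gram-Schmidt: extend the orthonormal basis by w, if independent."""
    w = project_out(w, basis)
    norm = math.sqrt(sum(a * a for a in w))
    if norm > 1e-9:
        basis.append([a / norm for a in w])

def edge_walk(rows, delta, step=0.02, tol=1e-3, iters=20000):
    n = len(rows[0])
    x = [0.0] * n
    tight = []  # orthonormal basis for the normals of the tight constraints
    for _ in range(iters):
        g = [step * random.gauss(0, 1) for _ in range(n)]
        x = [a + b for a, b in zip(x, project_out(g, tight))]
        for i in range(n):                       # coloring constraints
            if abs(x[i]) >= 1 - tol:
                x[i] = math.copysign(1.0, x[i])  # freeze the coordinate
                add_direction(tight, [float(j == i) for j in range(n)])
        for r in rows:                           # discrepancy constraints
            if abs(sum(a * b for a, b in zip(r, x))) >= delta - tol:
                add_direction(tight, [float(a) for a in r])
        if len(tight) >= n:                      # reached a vertex
            break
    return x

random.seed(0)
n = 8
rows = [[random.choice((1, -1)) for _ in range(n)] for _ in range(n)]
x = edge_walk(rows, delta=3 * math.sqrt(n))
print(sum(abs(v) >= 0.999 for v in x), "coordinates frozen at +-1")
```

Once a coordinate freezes at ${\pm 1}$, all future steps have zero component in that direction, so the walk stays on the face it hit, exactly as in the description above.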

Before moving on, note that the Edge-Walk algorithm can be defined for any polytope ${\mathcal{K}}$. This leads to the meta-question: Can we say anything interesting about the distribution on the vertices of a polytope induced by the walk? The analysis from our paper implicitly gives one such property which has to do with the distances of the constraints defining the vertex from the origin. Understanding this distribution better might be useful elsewhere.

Discussion This concludes the three-post series. Perhaps sometime in the future there will be posts on other notable results in discrepancy theory (like the Beck-Fiala theorem). Keeping with the trend from the last two posts, let me end with another open question, one which substantially strengthens Theorem 1:

Komlós Conjecture Given unit vectors ${v_1,\ldots,v_n \in \mathbb{R}^d}$, there exists ${\varepsilon \in \{1,-1\}^n}$ such that

$\displaystyle \left\|\varepsilon_1 v_1 + \varepsilon_2 v_2 + \cdots + \varepsilon_n v_n\right\|_\infty = O(1).$

Theorem 1 follows from the above if we take ${v_i}$‘s to be the normalized columns of the matrix ${A}$. The above conjecture also strengthens another beautiful conjecture due to Beck and Fiala; but that’s for another day. The most remarkable aspect of the above conjecture is that there is no dependence on the dimension. The best bound we know is due to Banaszczyk who showed a bound of ${O(\sqrt{\log d})}$ which we’ll also leave for another day.
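To spell out the reduction (using the standard form of the conjecture for vectors of Euclidean norm at most ${1}$): each column ${A_{\cdot i}}$ of a matrix ${A \in [-1,1]^{n \times n}}$ has ${\ell_2}$ norm at most ${\sqrt{n}}$, so we may apply the conjecture to ${v_i = A_{\cdot i}/\sqrt{n}}$:

```latex
\left\| \sum_{i=1}^n \varepsilon_i v_i \right\|_\infty
  = \frac{1}{\sqrt{n}} \left\| \sum_{i=1}^n \varepsilon_i A_{\cdot i} \right\|_\infty
  = \frac{\| A \varepsilon \|_\infty}{\sqrt{n}} = O(1)
  \quad \Longrightarrow \quad
  \| A \varepsilon \|_\infty = O(\sqrt{n}).
```

Applying the same argument to every column subset gives the hereditary bound of Theorem 1.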

In the last post we discussed some questions about discrepancy and the ‘Six Standard Deviations Suffice’ theorem stated below (without the ${6}$, which is not too important, but makes for a great title):

Theorem 1 For vectors ${a^1,\ldots,a^n \in \{1,-1\}^n}$, there exists ${\epsilon \in \{1,-1\}^n}$ such that for every ${j \in [n]}$, ${|\langle a^j,\epsilon\rangle| = O(\sqrt{n})}$.

In this post (second and the most technical of the three post series) we will see a proof of the above result. The theorem was proved by Spencer in 1985 using Beck’s partial coloring approach. Independently of Spencer, the result was proved by Gluskin in 1988 in a convex geometric context. Here we will review Gluskin’s proof which is quite beautiful.

Gluskin’s proof will also give us an excuse to look at some elegant (and simple to describe) results in convex geometry which may be of use elsewhere. Finally, the geometric view here will actually be useful in the next post when we discuss an algorithmic proof. Gluskin’s paper is truly remarkable: it seems to reinvent several key ideas from scratch, such as Sidak’s lemma and a version of Kanter’s lemma for Gaussians, and it even has the partial coloring approach implicit in it. I recommend taking a look at the paper even if it is a bit of a tough read. Much of the content in the post is based on discussions with Shachar Lovett and Oded Regev, but all mistakes are mine.

Gluskin’s proof follows the partial coloring approach, with the crucial lemma proved using a volume argument. The partial coloring method was introduced by Beck in 1981, and all proofs of Theorem 1 and many other important discrepancy results in fact use this method. Here, instead of looking for an ${\epsilon \in \{1,-1\}^n}$ solution as in the theorem, one first looks for a ${\{1,0,-1\}^n}$ solution. This is meaningless to begin with, as we could just output the all-zeros vector. The main idea is to instead look for a ${\{1,0,-1\}^n}$ solution with support of size ${\Omega(n)}$. We then recurse on the set of coordinates which are set to ${0}$. If everything goes according to plan, as we geometrically decrease the ambient dimension we similarly get geometrically decreasing discrepancy bounds, which we can tolerate. I won’t go into the details here, but let’s accept that it suffices to show the following:

Lemma 2 For vectors ${a^1,\ldots,a^n \in \{1,-1\}^n}$, there exists ${\epsilon \in \{1,0,-1\}^n}$ such that for every ${j \in [n]}$, ${|\langle a^j,\epsilon\rangle| = O(\sqrt{n})}$ and ${|Support(\epsilon)| = \Omega(n)}$.

To prove the above partial-coloring-lemma, let us first rephrase the problem in a geometric language. Let ${\mathcal{K} \subseteq R^n}$ be the symmetric convex set (symmetric meaning ${x \in \mathcal{K}}$ implies ${-x \in \mathcal{K}}$) defined as follows for ${\Delta = O(\sqrt{n})}$ to be chosen later:

$\displaystyle \mathcal{K} = \{x: |\langle a^j,x\rangle| \leq \Delta, \forall j \in [n]\}.$

We want to show that ${\mathcal{K}}$ contains a ${\{1,0,-1\}^n}$ lattice point of large support. We show this indirectly by proving that ${\mathcal{K}}$ instead contains a lot of points from ${\{1,0,-1\}^n}$. Gluskin does this by a clever volume argument: first show that the volume of ${\mathcal{K} \cap [-1,1]^n}$ is large and then apply Minkowski’s theorem to show that there are many lattice points. To lower bound the first volume, Gluskin actually works in the Gaussian space.

I don’t have a clear intuitive reason for why the Gaussian distribution is better than the Lebesgue measure in this context. But if one looks at the analysis, a clear advantage is that projections behave better (when considering volumes) in Gaussian space. For example, if we take a set like ${S = \{x \in R^n\,:\,|x_1| \leq 1\}}$, then the Lebesgue volume of ${S}$ is infinite, but the volume of its projection onto the first coordinate is finite. In Gaussian space, both volumes are the same.

We next go over all the main pieces in Gluskin’s proof.

Sidak’s Lemma  Suppose we have a standard Gaussian vector ${g \sim \mathcal{N}(0,1)^n}$. Then, for any unit vector ${v \in R^n}$, ${\langle v,g\rangle}$ has the standard normal distribution. Now, suppose we have several unit vectors ${v_1,\ldots,v_m \in R^n}$. Then, the random variables ${X_i = \langle v_i,g\rangle}$ are individually standard normals, but are correlated with one another. Sidak’s lemma (1967) says that no matter what the correlations of ${X_i}$‘s are, to bound the probability that none of the ${X_i}$‘s is too large, the “worst-behaviour” one could expect is for them to be independent. Concretely:

Lemma 3 (Sidak’s Lemma) Let ${v_1,\ldots,v_m \in R^n}$ and let ${g \sim \mathcal{N}(0,1)^n}$ be a standard Gaussian vector. Then, for all ${t_1,\ldots,t_m \in R_+}$,

$\displaystyle Pr\left[|\langle v_1,g\rangle| \leq t_1 \;\wedge\; |\langle v_2,g\rangle| \leq t_2\;\wedge\; \cdots \;\wedge\; |\langle v_m,g\rangle| \leq t_m\right] \geq$

$\displaystyle \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\; Pr\left[|\langle v_1,g\rangle |\leq t_1\right] \cdot Pr\left[|\langle v_2,g\rangle|\leq t_2\right] \cdots Pr\left[|\langle v_m,g\rangle|\leq t_m\right].$

The proof of the lemma is actually not too hard and an excellent exposition can be found in this paper.
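For a quick numerical sanity check (a demo of my own, not from the paper): take two strongly correlated directions and estimate both sides of the inequality by Monte Carlo.

```python
import math
import random

random.seed(1)

# Two highly correlated unit vectors in R^2 and a common threshold t.
v1, v2 = (1.0, 0.0), (math.cos(0.3), math.sin(0.3))
t, N = 1.0, 200000

joint = hit1 = hit2 = 0
for _ in range(N):
    g = (random.gauss(0, 1), random.gauss(0, 1))
    a = abs(v1[0] * g[0] + v1[1] * g[1]) <= t
    b = abs(v2[0] * g[0] + v2[1] * g[1]) <= t
    joint += a and b
    hit1 += a
    hit2 += b

lhs, rhs = joint / N, (hit1 / N) * (hit2 / N)
print(lhs >= rhs)  # Sidak: the joint probability beats the product
```

Strong correlation makes the two events nearly coincide, so the joint probability stays close to a single marginal (about ${0.68}$ here), well above the product (about ${0.47}$).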

The lemma is actually a very special case of a longstanding open problem called the Correlation Conjecture. Let me digress a little to state this beautiful question. In the above setup, define the slabs ${C_i = \{x \in R^n: |\langle v_i,x\rangle| \leq t_i\}}$. Then, Sidak’s lemma says that for ${g \sim \mathcal{N}(0,1)^n}$,

$\displaystyle Pr[g \in C_1 \cap C_2 \cap \cdots \cap C_m] \geq Pr[g \in C_1] \cdot Pr[g \in C_2] \cdots Pr[g \in C_m].$

The correlation conjecture asserts that this inequality is in fact true for all symmetric convex sets (in fact, we only need to look at ${m=2}$). Sidak’s lemma says the conjecture is true for slabs. It is also known to be true for ellipsoids. The statement for ellipsoids also has a discrepancy implication leading to a vector generalization of Spencer’s theorem (joint work with Shachar Lovett and Oded Regev). But that’s for another day.

Kanter’s Lemma The second inequality we need is a comparison inequality due to Kanter. The lemma essentially lets us lift certain relations between two distributions ${p,q}$ to their product distributions ${p^n, q^n}$, and I think it should be useful in other contexts. For instance, I recently used it in this paper in a completely different context. To state the lemma, we need the notion of peakedness of distributions.

Let ${p,q}$ be two symmetric distributions on ${R^m}$ for some ${m}$. We say ${p}$ is less peaked than ${q}$ (written ${p \preceq q}$) if for all symmetric convex sets ${\mathcal{K}}$, ${p(\mathcal{K}) \leq q(\mathcal{K})}$. Intuitively, this means that ${p}$ is putting less of its mass near the origin than ${q}$ (hence the term less peaked). For example, ${\mathcal{N}(0,2) \preceq \mathcal{N}(0,1)}$.

Kanter’s lemma says that the peakedness relation tensorises provided we have unimodality. A univariate distribution is unimodal if the corresponding probability density function has a single maximum and no other local maxima. I won’t define what it means for a multivariate distribution to be unimodal here, but we only need the lemma for univariate distributions. See this survey for the formal definition.

Lemma 4 (Kanter’s lemma) Let ${p,q}$ be two symmetric distributions on ${R^n}$ such that ${p \preceq q}$ and let ${\mu}$ be a unimodal distribution on ${R^m}$. Then, the product distributions ${p \times \mu}$, ${q \times \mu}$ on ${R^{n+m}}$ satisfy ${p \times \mu \preceq q \times \mu}$.

The proof of the lemma is not too hard, but is non-trivial in that it uses the Brunn-Minkowski inequality. Combining the above lemma with the not-too-hard fact that the standard Gaussian distribution is less peaked than the uniform distribution on ${[-1,1]}$, we get:

Corollary 5 Let ${\mu}$ be the uniform distribution on ${[-1,1]}$. Then, ${\mathcal{N}(0,1)^n \preceq \mu^n}$.
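The one-dimensional ingredient here can be checked directly (a small demo of my own): the symmetric convex subsets of ${R}$ are just the intervals ${[-t,t]}$, and the standard Gaussian never puts more mass on them than the uniform distribution on ${[-1,1]}$.

```python
import math

def gauss_mass(t):
    """P(|N(0,1)| <= t), via the error function."""
    return math.erf(t / math.sqrt(2))

def unif_mass(t):
    """P(|U[-1,1]| <= t)."""
    return min(t, 1.0)

# "Less peaked": the Gaussian puts no more mass on any interval [-t, t].
for t in (0.25, 0.5, 1.0, 2.0):
    print(t, round(gauss_mass(t), 3), "<=", unif_mass(t))
```

Kanter’s lemma then lifts this one-dimensional comparison, coordinate by coordinate, to the product statement of Corollary 5.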

Minkowski’s Theorem The final piece we need is the classical Minkowski’s theorem from lattice geometry:

Theorem 6 (Minkowski’s Theorem) Let ${C \subseteq R^n}$ be a symmetric convex set of Lebesgue volume more than ${2^n \cdot \ell}$ for an integer ${\ell \geq 1}$. Then, ${C}$ contains at least ${\ell}$ points from the integer lattice ${\mathbb{Z}^n}$.

Putting Things Together  We will prove the partial coloring lemma (Lemma 2). The proof will be a sequence of simple implications using the above lemmas. Recall the definition of ${\mathcal{K}}$:

$\displaystyle \mathcal{K} = \{x: |\langle a^j,x\rangle| \leq \Delta, \forall j \in [n]\}.$

Our goal (Lemma 2) is equivalent to showing that ${\mathcal{K}}$ contains a ${\{1,0,-1\}^n}$ point of large support.

Note that ${\mathcal{K}}$ is the intersection of ${n}$ slabs. Therefore, by Sidak’s lemma, for ${g \sim \mathcal{N}(0,1)^n}$,

$\displaystyle Pr[g \in \mathcal{K}] \geq \prod_{j=1}^n Pr[|\langle a^j,g\rangle| \leq \Delta] \geq \prod_{j=1}^n\left(1-2\exp(-\Delta^2/2n)\right),$

where the last inequality follows from the fact that ${\langle a^j,g\rangle}$ is Gaussian with standard deviation ${\sqrt{n}}$. Therefore, if we pick ${\Delta = C\sqrt{n}}$ for a sufficiently large constant ${C}$, then

$\displaystyle Pr[g \in \mathcal{K}] \geq (3/4)^n.$

Now, let ${\mu}$ be the uniform distribution on ${[-1,1]}$. Then, by Corollary 5 and the definition of peakedness, ${\mu^n(\mathcal{K}) \geq (3/4)^n}$. Hence, the Lebesgue volume of ${\mathcal{K}' = \mathcal{K} \cap [-1,1]^n}$ is at least ${2^n (3/4)^n = (3/2)^n}$. Therefore, for sufficiently small ${\delta > 0}$, the Lebesgue volume of ${(2-\delta)\mathcal{K}'}$ is at least ${(2-\delta)^n \cdot (3/2)^n = 2^n \cdot 2^{\Omega(n)}}$. Thus, applying Minkowski’s theorem to the symmetric convex set ${(2-\delta) \mathcal{K}'}$, we get that ${(2-\delta)\mathcal{K}'}$ contains at least ${2^{\Omega(n)}}$ lattice points.

Now, note that the only lattice points in ${(2-\delta)\mathcal{K}'}$ are elements of ${\{1,0,-1\}^n}$, and each of them lies in ${(2-\delta)\mathcal{K}}$, i.e., has discrepancy at most ${(2-\delta)\Delta = O(\sqrt{n})}$. Therefore, there are at least ${2^{\Omega(n)}}$ points of ${\{1,0,-1\}^n}$ with discrepancy ${O(\sqrt{n})}$. By a simple counting argument, at least one of these points, ${\epsilon}$, must have ${\Omega(n)}$ non-zero coordinates – which is exactly what we need to prove Lemma 2!
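For tiny ${n}$ one can watch this conclusion happen by sheer enumeration (an illustrative brute-force demo of my own; the actual content of the lemma is of course the ${2^{\Omega(n)}}$ count for large ${n}$):

```python
import math
import random
from itertools import product

random.seed(2)
n = 8
rows = [[random.choice((1, -1)) for _ in range(n)] for _ in range(n)]
delta = 2 * math.sqrt(n)   # a concrete stand-in for the O(sqrt(n)) bound

# Enumerate all {1,0,-1}^n points of K and keep those with large support.
in_K = [x for x in product((1, 0, -1), repeat=n)
        if all(abs(sum(r[i] * x[i] for i in range(n))) <= delta for r in rows)]
partial = [x for x in in_K if sum(v != 0 for v in x) >= n // 2]

print(len(in_K), "points of {1,0,-1}^n in K,", len(partial), "with support >= n/2")
```

Any one of the surviving large-support points is a valid partial coloring in the sense of Lemma 2.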

Discussion  The above argument can actually be simplified by replacing the use of Kanter’s lemma with an appropriate version of Minkowski’s theorem for Gaussian volume, as done here. But I like any excuse to discuss Kanter’s lemma.

More importantly, the proof seems to be more amenable to generalization. The core of the proof is really to use Sidak’s lemma to lower bound the Gaussian volume of the convex set ${\mathcal{K}}$. Whenever you have such a statement, you should then get a corresponding discrepancy statement. In particular, the matrix discrepancy conjecture from the last post essentially reduces to the following probability question:

Question Is it true that for a universal constant ${C}$, for all symmetric matrices ${A_1,\ldots,A_n \in R^{n \times n}}$ with ${\|A_i\| \leq 1}$,

$\displaystyle Pr_{g \sim \mathcal{N}(0,1)^n}\left[\|g_1 A_1 + g_2 A_2 + \cdots + g_n A_n \| \leq C \sqrt{n}\right] \geq (3/4)^n\,?$

Acknowledgments Thanks to Shachar Lovett, Oded Regev and Nikhil Srivastava for helpful suggestions, comments, and corrections during the preparation of this post.

This blog post is an introduction to the “Sum of Squares” (SOS) algorithm from my biased perspective. This post is rather long – I apologize. You might prefer to view/print it in pdf format. If you’d rather “see the movie”, I’ll be giving a TCS+ seminar about this topic on Wednesday, February 26th 1pm EST. If you prefer the live version, I’ll be giving a summer course in Stockholm from June 30 till July 4th. I’ve been told that there are worse places to be in than Stockholm in June, and there are definitely worse things to do than taking a course on analysis of Boolean functions from Ryan O’Donnell, which will also be part of the same summer school.

This post starts with an 1888 existential proof by Hilbert, given a constructive version by Motzkin in 1965. We will then go through proof complexity and semidefinite programming to describe the SOS algorithm and how it can be analyzed. Most of this is based on my recent paper with Kelner and Steurer and a (yet unpublished) follow-up work of ours, but we’ll also discuss notions that can be found in our previous work with Brandao, Harrow and Zhou. While our original motivation to study this algorithm came from the Unique Games Conjecture, our methods turn out to be useful for problems from other domains as well. In particular, we will see an application to the Sparse Coding problem (also known as dictionary learning) that arises in machine learning, computer vision and image processing, and computational neuroscience. In fact, we will come full circle, as we will see how polynomials related to Motzkin’s end up playing a role in our analysis of this algorithm.

I am still a novice myself in this area, and so this post is likely to contain inaccuracies, misattributions, and just plain misunderstandings. Comments are welcome! For deeper coverage of this topic, see Pablo Parrilo’s lecture notes, and Monique Laurent’s monograph. A great accessible introduction from a somewhat different perspective is given in Amir Ali Ahmadi’s blog posts (part I and part II). In addition to the optimization applications discussed here, Amir mentions that “in dynamics and control, SOS optimization has enabled a paradigm shift from classical linear control to … nonlinear controllers that are provably safer, more agile, and more robust” and that SOS has been used in “areas as diverse as quantum information theory, robotics, geometric theorem proving, formal verification, derivative pricing, stochastic optimization, and game theory”.

The FOCS 2014 call for papers is out. The PC chose to forgo hard page limits for submissions, trusting authors to apply their best judgment to make their submissions as easy to read and review as possible. Toward this end, here are some words of advice for potential authors. None of this is mandatory, but I hope you keep it in mind as you prepare your submission. I believe the effort to make your paper more readable will pay off not just in improving your FOCS submission, but also the future online, proceedings, and journal versions of your paper.

Remember the audience.

One of the challenging aspects of writing a FOCS submission (and in fact any scientific paper) is that it needs to address different types of readers simultaneously. A non-exhaustive list includes:

1) The expert in your field who wants to verify all the details and understand how your approach differs from their 2007 paper.

2) A non-expert reviewer who wants to understand what you did, why the question is motivated, and get some sense of the techniques you used.

3) A PC member (such as yours truly) who was not assigned to review the paper, and wants to get some approximation of the above by just skimming it for a few minutes.

A general rule of thumb for addressing these different readers is to make sure that the paper’s first few sections are accessible to a general theoretical CS audience, while later sections might contain the details that are mainly of interest to experts in the field. This brings us to our next point.

Make the paper easy to skim.

While there is no hard page limit, FOCS reviewers are not expected to read all submissions in full. In practice, this means you should follow what I call “Impagliazzo’s Rule”: the first X pages of the paper should make the reader want to read the next X pages, for any value of X.

In particular, you should make sure that your results, your techniques, the motivation for your work, its context and novelty compared to prior works are clearly stated early in the paper. If your main theorem is hard to state succinctly, you can state an informal version, or an important special case, adding a forward reference to the place where it’s stated in full.

The above applies not just to the results but to the techniques as well. Don’t wait until the technical sections to tell us about your novel ideas. Some of the best-written papers follow the introduction with a section such as “Our techniques”, “Proof outline”, “Warmup” or “Toy problem” that illustrates the ideas behind the proofs in an informal and accessible way.

While modesty is a fine virtue, you don’t want to overdo it in a FOCS submission and hide your contributions in some remote corner of the paper. Of course, you don’t want to go too far in the other direction either, and so you should also

State all caveats.

As scientists, we bend over backwards to show the potential flaws, caveats, and room for improvement in our work, and I expect nothing less from FOCS authors. It can be extremely frustrating for a reviewer to find out that the result is restricted in a significant way only when she reaches Section 4 of the paper. All restrictions, caveats, assumptions, and limitations should be described early in the paper. In fact, some caveats are so major that you shouldn’t wait to state them even until the introduction. For example, if you prove a lower bound that holds only for monotone circuits, then not only should this be clearly stated in the abstract, the word “monotone” should probably appear in the title. Generally speaking, if you’ve made a choice in modeling the problem that makes it easier, you should discuss this and explain what would have changed had you made a different choice. Similarly, any relations and overlap with prior works should be clearly described early in the paper. If the result is a generalization of prior work, explain how they differ and what motivates the generalization. If it improves on some parameters but is worse in others, a discussion of the significance of these is in order. If there is a related work you are aware of, even if it’s not yet formally published, or was done after your work, you should still cite it and explain the relation between the two works and the chronology.

Kill the suspense.

A scientific paper is not a novel and, ideally, readers should not be kept in suspense or surprised, positively or negatively. The FOCS PC is an incredibly talented group of people, but you should still write your paper in a “foolproof” way, trying to anticipate all the questions and misunderstandings that a reader may have (especially one who needs to review 40 papers under time pressure).

For example, it can be extremely annoying for an author to get a review saying “the proof of the main theorem can be vastly simplified by using X” where X is the thing you tried first and doesn’t work. The way to avoid it is to add a section titled “First attempt” where you discuss X and explain why it fails. Similarly, if there is a paper that at first look seems related to your work, but turns out to be irrelevant, then you should still cite it and explain why it’s not actually related.

Another annoyance is when the reviews give the impression that the paper was rejected for being “too simple”. I and the rest of the FOCS PC believe that simplicity is a great virtue and never a cause for rejection. But you don’t want the reviewer to be surprised by the simplicity, discovering only late in the paper that the proof is a 3 line reduction to some prior work. If the proof is simple then be proud of this fact, and announce it right from the start. Similarly, if the proof of a particular lemma involves some routine applications of a standard method, you don’t need to remove it or move it to the appendix, but do add a sentence saying this at the proof’s start, so the less detail-oriented reviewers will know to skip ahead. This applies in the other case as well: if the proof involves a novel twist or a subtle point, you should add a sentence alerting the reader to look out for that point.

Summary

Writing a scientific paper is often a hard task, and I and the rest of the PC deeply appreciate your decision to send us your work and the effort you make to communicate it as clearly as possible. We hope you find the above suggestions useful, and are looking forward to reading your submission.

In this series of three posts I want to discuss some recent and old advances in discrepancy theory and their applications to algorithms. Discrepancy minimization is quite a rich and beautiful area, as evidenced in these two books. Here I will focus on a specific perspective — that of “Beating the Union Bound” — which, while limiting, captures the main issues in the examples and applications we consider. The description below is particularly efficient in hiding the main combinatorial motivations for the questions in discrepancy theory, but hopefully we will make up for it by other means.

As a preview of what’s to come, in the next post I will discuss Gluskin’s geometric approach to proving the ‘Six Standard Deviations Suffice’ theorem (Theorem 1 below, but with a constant different from ${6}$), and in the third post we will look at some algorithmic applications and approaches. Much of the content in these posts is based on discussions with Shachar Lovett and Oded Regev, but all mistakes are mine.

We all know what the union bound is. But if you don’t, here’s how the well-known mathematical genius S. Holmes described it: “When you have eliminated the impossible, whatever remains, however improbable, must be the truth”. In non-sleuth terms this says that if you have arbitrary events ${E_1,\ldots,E_m}$ over some probability space, then

$\displaystyle {Pr[E_1 \cup E_2 \cup \cdots \cup E_m] \leq Pr[E_1] + \cdots + Pr[E_m]}$.

In particular, if the latter sum is less than ${1}$, then there exists a sample point where none of the events ${E_1,\ldots,E_m}$ occur.
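The inequality is easy to check exactly on a tiny probability space. Here is a small Python sketch (the events and numbers are mine, chosen purely for illustration): over ten fair coin flips, let ${E_i}$ be the event "flips ${i}$ and ${i+1}$ both come up heads"; the events overlap heavily, yet the union bound holds.

```python
from fractions import Fraction
from itertools import product

# Exact (non-Monte-Carlo) check of the union bound on 10 fair coin flips.
# Event E_i: flips i and i+1 both come up heads (probability 1/4 each).
n = 10
space = list(product([0, 1], repeat=n))
events = [[w for w in space if w[i] == 1 and w[i + 1] == 1]
          for i in range(n - 1)]

pr_union = Fraction(len({w for E in events for w in E}), len(space))
pr_sum = sum(Fraction(len(E), len(space)) for E in events)

assert pr_union <= pr_sum          # the union bound
print(pr_union, "<=", pr_sum)
```

Here the right hand side is ${9/4 > 1}$, so the "sum less than one" existence conclusion does not apply, but the inequality itself of course still holds.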

By now, we have seen several strikingly simple and powerful applications of this simple precept – e.g., the existence of good Ramsey graphs (Erdős), good error correcting codes (Shannon), and good metric embeddings (Johnson-Lindenstrauss). The union bound and its further variants remain one of the main techniques we have for showing existential results.

However, the union bound is indeed quite naive in many important contexts and is not always enough to get what we want (as nothing seems to be these days). One particularly striking example of this is the Lovász local lemma (LLL). But let us not digress along this (beautiful) path. Here we will discuss how “beating” the union bound can lead to some important results in the context of discrepancy theory.

Discrepancy and Beating the Union Bound: Six Suffice

A basic probabilistic inequality which we use day-in and day-out is the Chernoff bound. The special case of interest to us is the following: for any vector ${a = (a_1,\ldots,a_n) \in \{1,-1\}^n}$ and ${t > 0}$,

$\displaystyle Pr_{\epsilon \sim \{1,-1\}^n}\left[\left|\sum_i \epsilon_i a_i\right| > t \sqrt{n}\right] \leq 2 \exp(-t^2/2). \ \ \ \ \ (1)$

In words, the probability that the sum ${\sum_i \epsilon_i a_i}$ falls ${t}$ standard deviations away is at most ${2\exp(-t^2/2)}$. Combining the above with a simple union bound we get the following: For vectors ${a^1,\ldots,a^n \in \{1,-1\}^n}$,

$\displaystyle Pr_{\epsilon \sim \{1,-1\}^n}\left[\exists j \in [n],\; \left|\sum_{i=1}^n \epsilon_i a^j_i\right| > t\sqrt{n}\right] \leq 2 n \exp(-t^2/2). \ \ \ \ \ (2)$

In particular, if we choose ${t = O(\sqrt{\log n})}$, then the right hand side above is less than ${1}$ and we get that there exists ${\epsilon \in \{1,-1\}^n}$ such that for every ${j \in [n]}$, ${|\langle a^j, \epsilon\rangle| \leq O(\sqrt{n \log n})}$. Can we do better? It is important to note that the above argument is actually tight for uniformly random signs, so we want to do better by a careful choice of signs. This is exactly what Spencer showed in his seminal “Six Standard Deviations Suffice” result (1985) (Spencer’s proof, as well as all other known proofs, easily generalizes to the case when there are more vectors than degrees of freedom and we only require each vector to have bounded entries):

Theorem 1 For all ${n \geq 1}$, vectors ${a^1,\ldots,a^n \in \{1,-1\}^n}$, there exists ${\epsilon \in \{1,-1\}^n}$ such that for every ${j \in [n]}$, ${|\langle a^j,\epsilon \rangle| \leq 6\sqrt{n}}$.

We will see a proof of the theorem in the next post.
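Before moving on, here is a quick numerical sanity check (mine, not from the post) of the union-bound calculation (2): for uniformly random signs, the normalized discrepancy ${\max_j |\langle a^j, \epsilon\rangle|/\sqrt{n}}$ indeed comes out around ${\sqrt{2\log n}}$.

```python
import numpy as np

# For random +-1 vectors a^1..a^n (rows of A) and uniformly random signs eps,
# max_j |<a^j, eps>| / sqrt(n) should be roughly sqrt(2 log n) (~3.9 here),
# matching the union-bound calculation rather than Spencer's bound.
rng = np.random.default_rng(0)
n = 2000
A = rng.choice([-1, 1], size=(n, n))   # rows are the vectors a^1, ..., a^n
eps = rng.choice([-1, 1], size=n)      # uniformly random signs
ratio = np.abs(A @ eps).max() / np.sqrt(n)
print(ratio)
```

Note that for moderate ${n}$ this ratio is still below ${6}$; Spencer's improvement over random signs only kicks in asymptotically, once ${\sqrt{2\log n}}$ exceeds the constant.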

Discrepancy and Beating the Union Bound (and some): Paving Conjecture

As discussed in this post a few months back, the stunning result (adjective is mine) of Adam Marcus, Daniel Spielman and Nikhil Srivastava proving the paving conjecture (and hence resolving the Kadison-Singer problem) can also be cast in a discrepancy language. Let me repeat this from Nikhil’s post for completeness. Let us look at a specific variant of the ‘Matrix Chernoff Bound’ (see Nikhil’s post for references; here ${\|\;\|}$ denotes the spectral norm of a matrix):

Theorem 2 Given symmetric matrices ${A_1,\ldots,A_m \in \mathbb{R}^{n \times n}}$,

$\displaystyle Pr_{\epsilon \sim \{1,-1\}^m}\left[\left\|\sum_{i=1}^m \epsilon_i A_i\right\| \geq t \left\|\sum_{i=1}^m A_i^2\right\|^{1/2}\right] \leq 2 n \exp(-t^2/2).$

Note that the above is a substantial generalization of Equation 2, which corresponds to the special case when the matrices ${A_j}$ are diagonal with entries given by the vectors ${a^j}$. In a very naive but still meaningful way, one can view the factor ${n}$ on the right hand side as again appearing partially because of a union bound.

An implication of the above theorem is that for symmetric matrices ${A_1,\ldots,A_m \in \mathbb{R}^{n \times n}}$, for uniformly random signs ${\epsilon \sim \{1,-1\}^m}$, with high probability

$\displaystyle \left\|\sum_{i=1}^m \epsilon_i A_i\right\| = O(\sqrt{\log n})\left\|\sum_{i=1}^m A_i^2\right\|^{1/2}. \ \ \ \ \ (3)$

Just as for the scalar case, the ${\sqrt{\log n}}$ factor is necessary in general for uniformly random signs. Can we do better? There are two facets to this question both of which seem to be quite important and basic (and perhaps correspondingly hard):

• Question 1: Can the ${\sqrt{\log n}}$ factor be improved by picking the signs carefully instead of uniformly at random?
• Question 2: Can the ${\sqrt{\log n}}$ factor be improved for uniformly random signs under some nice conditions on the matrices ${A_i}$‘s?

Here we will focus on the first question, but let me say a couple of words about the second one. In all known examples showing tightness of the matrix Chernoff bound, the matrices ${A_1,\ldots,A_m}$ seem to have a lot of commutative structure (e.g., diagonal matrices) and intuitively non-commutativity should help us in improving the bound. Perhaps there is a quantitative way of capturing non-commutativity to do better (see the discussion here for one such attempt).

Regarding the first question, Spencer’s theorem gives a positive answer in the special case when ${A_i}$‘s are diagonal matrices with ${\{1,-1\}}$ entries (or more generally, bounded entries). The breakthrough result of Marcus, Spielman and Srivastava amounts to giving an exact characterization of when we can do better if the matrices ${A_i}$‘s are rank one positive semi-definite matrices.
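A small experiment (my own illustration, not from any of the papers above) makes both questions concrete: for random diagonal ${\pm 1}$ matrices the ratio ${\|\sum_i \epsilon_i A_i\| / \|\sum_i A_i^2\|^{1/2}}$ behaves like ${\sqrt{2 \log n}}$, while for dense random symmetric matrices it comes out noticeably smaller, in line with the intuition that non-commutativity helps.

```python
import numpy as np

# Compare the matrix-Chernoff ratio for two families of symmetric matrices:
# (a) random diagonal +-1 matrices (highly commutative), and
# (b) dense GOE-like symmetric matrices with spectral norm O(1).
rng = np.random.default_rng(1)
n = m = 200

def ratio(mats):
    eps = rng.choice([-1, 1], size=len(mats))
    signed = sum(e * M for e, M in zip(eps, mats))
    sq = sum(M @ M for M in mats)
    # spectral norm of the signed sum vs sqrt of the norm of the sum of squares
    return np.linalg.norm(signed, 2) / np.sqrt(np.linalg.norm(sq, 2))

diag = [np.diag(rng.choice([-1.0, 1.0], size=n)) for _ in range(m)]
dense = []
for _ in range(m):
    B = rng.standard_normal((n, n))
    dense.append((B + B.T) / np.sqrt(2 * n))   # symmetric, norm ~ 2

r_diag, r_dense = ratio(diag), ratio(dense)
print(r_diag, r_dense)
```

In this run the diagonal family shows the ${\sqrt{\log n}}$-type loss while the dense family does not; this is only suggestive, of course, not a proof of anything.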

Discrepancy and Beating the Union Bound: Matrix Spencer Theorem?

The above discussions prompt the following question:

Conjecture For any symmetric matrices ${A_1,\ldots,A_n \in {\mathbb R}^{n \times n}}$ with ${\|A_i\| \leq 1}$, there exist signs ${\epsilon_1,\ldots,\epsilon_n}$ such that ${\|\sum_i \epsilon_i A_i\| = O(\sqrt{n})}$.

The above conjecture is a strong generalization of Spencer’s theorem which corresponds to the matrices ${A_i}$‘s being diagonal. The earliest reference I am aware of for it is this paper. I can’t think of any concrete applications of the conjecture, but I quite like it because of the simplicity of the statement. Admittedly, I also have a personal bias: my first foray into discrepancy theory (with Shachar Lovett and Oded Regev) was to study this question.

One can view the conjecture as giving a partial answer to Question 1. In the case when ${\|A_i\| \leq 1}$, ${\|\sum_i A_i^2\| \leq \sum_i \|A_i^2\| \leq n}$, so that the right hand side of Equation 3 is ${O(\sqrt{n \log n})}$. Thus, the above conjecture beats the matrix Chernoff bound by getting rid of the ${\sqrt{\log n}}$ term, but instead introduces a term depending on the “sum-of-norms” of the ${A_i^2}$‘s as opposed to having a dependence on the “norm-of-the-sum” of ${A_i^2}$‘s.

In the next post, I will discuss Gluskin’s proof of Theorem 1, which for now (to me) seems to be the most promising approach for proving the conjecture.

Acknowledgments

Thanks to Shachar Lovett, Oded Regev and Nikhil Srivastava for helpful suggestions, comments, and corrections during the preparation of this post.

Today, we have a guest post from Frank McSherry talking about a clever approach to using Differential Privacy for handling pesky dependencies that get in the way of proving measure concentration results.

———————

In this post I’ll explain a cute use of differential privacy as a tool in probabilistic analysis. This is a great example of differential privacy being useful for something other than privacy itself, although there are other good examples too. The main problem we’ll be looking at is the analysis of mostly independent random variables, through the lens of a clustering problem I worked on many years ago.

In the problem of clustering data drawn from a Gaussian mixture, you assume that you are provided access to a large volume of data, each point of which is a sample from one of a few high-dimensional Gaussian distributions. Each Gaussian distribution is determined by a mean vector and a covariance matrix, and each has a mixing weight describing its relative fraction of the overall population of samples. There are a few goals you might have for such a collection of data, but the one we are going to look at is the task of clustering the data into parts corresponding to the Gaussian distributions from which they were drawn. Under what conditions on the Gaussians is such a thing possible? [Please note: this work is a bit old, and does not reflect state-of-the-art results in the area; rather, we are using it to highlight the super-cool use of differential privacy.]

The main problem is that while each coordinate of a Gaussian distribution is concentrated, there is some large number ${d}$ of coordinates, and so the proximity of a sample to its mean vector ${\mu}$ is not particularly great. You end up with bounds that look a bit like

$\displaystyle Pr [ | x - \mu |_2^2 \gg d \cdot \sigma^2 ] \le \delta \; ,$

where ${x}$ is the sample, ${\mu}$ is the mean vector, ${d}$ the ambient dimensionality, and ${\sigma^2}$ the norm of the covariance matrix. The probability ${\delta}$ gets determined by thinking really hard and its specific value won’t be particularly important for us here.

Dimitris Achlioptas and I had an algorithm for doing this based on spectral projections: we would find the optimal low-rank subspace determined by the singular value decomposition (the rank taken to be ${k}$, the number of Gaussians in the mixture) and argue that, under some separation conditions involving the means and covariance matrices, the space spanned by these projections was basically the same as the space spanned by the mean vectors of the Gaussian distributions. This is great because when you project a Gaussian sample, you are projecting its mean plus some noise. As the true mean lies in the target space, it stays where it is. When you project Gaussian noise onto a fixed subspace, it stays Gaussian, but with far fewer dimensions. The particular form of these results looks something like this, with a projection matrix ${P}$ applied to the sample ${x}$ before subtracting ${\mu}$.

$\displaystyle Pr [ |P x - \mu |_2^2 \gg k \cdot \tau^2 ] \le \delta \; ,$

where ${\tau}$ is close to ${\sigma}$. This means that while ${x}$ stays centered on ${\mu}$, the contribution of the noise more-or-less vanishes. At least, the ${d}$ is reduced to ${k}$, and that can be quite a lot. Hooray!
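This dimension reduction is easy to see numerically. A minimal sketch (parameters mine, chosen for illustration): project independent Gaussian noise onto a fixed ${k}$-dimensional subspace and compare squared norms.

```python
import numpy as np

# Projecting d-dimensional Gaussian noise onto a *fixed* (independent)
# k-dimensional subspace: the squared norm drops from ~d*sigma^2 to ~k*sigma^2.
rng = np.random.default_rng(2)
d, k, sigma = 1000, 5, 1.0
x = sigma * rng.standard_normal(d)                 # noise around mu = 0
Q, _ = np.linalg.qr(rng.standard_normal((d, k)))   # orthonormal basis, independent of x
full_norm = np.sum(x ** 2)            # concentrates around d * sigma^2
proj_norm = np.sum((Q.T @ x) ** 2)    # concentrates around k * sigma^2
print(full_norm, proj_norm)
```

The whole point of the post is that this calculation is *not* valid when the subspace depends on the sample; differential privacy is what rescues it below.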

The problem is that this “more-or-less vanishes” thing is really only true when the target space and the random noise are independent. However, since the optimal low-rank subspace was determined from the data, it isn’t independent of any of the noise we are projecting. It’s slightly dependent on the noise, and in the wrong way (it attempts to accommodate the noise, which isn’t what we want if we want to project *out* the noise). In particular, you don’t immediately get access to the sorts of bounds above.

You could do things the way Dimitris and I did, which was a fairly complicated mess of cross-training (randomly partition the data and use each half to determine the subspace for the other), and you end up with a paper that spends most of its time developing algorithms and notation to enforce the independence (the cross-training needs to be done recursively, but we can’t afford to cut the samples in half at each level, blah blah, brain explodes). You can read all about it here. We’re going to do things a bit simpler now.

Enter differential privacy. Recall, for a moment, the informal statement of differential privacy: a randomized computation has differential privacy if the probability of any output occurrence is not substantially adjusted when a single input element is added or removed. What a nice privacy definition!

Now, let’s think of it slightly differently, in terms of dependence and independence. If ${P}$ is the result of a differentially private computation on a dataset ${X = \{ x_0, x_1, \ldots \}}$, then ${P}$ is not substantially differently distributed than the same computation run on ${X \setminus \{ x_i \}}$. If this second distribution enjoys some nice property with high probability, for example due to its independence from ${x_i}$, then it remains very likely that ${P}$ has the property as well. The probability that the property no longer holds can only increase by a factor of ${\exp(\epsilon)}$ when we add ${x_i}$ back in to the input.

For example, let’s consider the probability of the property: “the squared length of the projection of ${x_i-\mu}$ onto the optimal subspace is much larger than ${k \cdot \tau^2}$”. When the input data are ${X \setminus \{ x_i \}}$, resulting in a projection-valued random variable we’ll name ${P_i}$, this probability is small because ${P_i}$ is independent of ${x_i}$.

$\displaystyle Pr [ | P_i x_i - \mu |_2^2 \gg k \cdot \tau^2 ] < \delta \; .$

When the input data are ${X}$, resulting in a projection-valued random variable ${P}$, this probability is not easily bounded by independence, but can be bounded by differential privacy: if the computation producing ${P}$ is ${\epsilon}$-differentially private, then the probability can increase by at most ${\exp(\epsilon)}$:

$\displaystyle Pr [ | P x_i - \mu |_2^2 \gg k \cdot \tau^2 ] < \exp(\epsilon) \times \delta \; .$

We can even live with fairly beefy values of ${\epsilon}$ and still get a result here. Let’s take ${\epsilon = 1}$ for concreteness.

Now let’s discuss differentially private optimal subspace computation. One standard way to compute optimal low dimensional subspaces is by taking the covariance matrix of the data and computing its top singular vectors, using the singular value decomposition. One standard way to release the covariance matrix while preserving differential privacy is to add Laplace noise proportional to the largest magnitude permitted of any of the data points, to each of the entries of the covariance matrix. Since our Gaussian data are nicely concentrated, they aren’t likely to be terribly enormous, and a data-independent upper bound will work great here.

What we get is a noisy covariance matrix, which we then subject to the singular value decomposition. There are some nice theorems about how the SVD is robust in the presence of noise, which is actually the same reason it is used as a tool to filter out all of the noise that the Gaussian distributions added in the first place. So, even though we added a bunch of noise to the covariance matrix, we still get a fairly decent approximation to the space spanned by the means of the Gaussian distributions (as long as the number of samples is larger than the dimensionality of the samples). At the same time, because the resulting subspace is differentially private with respect to the samples, we can still use the concentration bounds typically reserved for the projection of random noise on independent subspaces, as long as we can absorb a factor of ${\exp(\epsilon)}$.
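Here is a bare-bones sketch of this pipeline (a toy version of mine; in particular, the exact sensitivity calibration of the Laplace noise is glossed over, so the noise scale below should not be taken as the correct one for a real privacy guarantee):

```python
import numpy as np

# Toy pipeline: clip rows, perturb the second-moment matrix with Laplace
# noise, then take the top singular subspace of the noisy matrix.
rng = np.random.default_rng(3)

def private_subspace(X, k, eps, bound):
    # Clip rows to norm <= bound so that one sample has bounded influence.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X * np.minimum(1.0, bound / norms)
    C = X.T @ X
    # Laplace noise; a careful analysis would calibrate the scale to the
    # exact L1 sensitivity of C, which this sketch does not reproduce.
    N = rng.laplace(scale=bound ** 2 / eps, size=C.shape)
    _, _, Vt = np.linalg.svd(C + (N + N.T) / 2)   # symmetrized noise
    return Vt[:k].T                                # d x k orthonormal basis

# Toy mixture: two well-separated Gaussians along the first coordinate.
d, npts = 50, 5000
mu = np.zeros(d); mu[0] = 10.0
X = rng.choice([-1.0, 1.0], size=(npts, 1)) * mu + rng.standard_normal((npts, d))
V = private_subspace(X, k=1, eps=1.0, bound=30.0)
print(abs(V[:, 0] @ mu) / np.linalg.norm(mu))   # close to 1: direction recovered
```

With this many samples the signal in the covariance matrix dwarfs the added Laplace noise, which is exactly the regime the post describes.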

At its heart, this problem was about recovering a tenuous independence which was very important for simplicity (and arguably tractability) of analysis. It shows up in lots of learning problems, especially in validating models: we would typically split data into test and training, to permit an evaluation of learned results without the issue of overfitting. Here differential privacy makes things simpler: if your learning process is differentially private, you did not overfit your data (much).

This week’s post touches on subjects spanning almost 2000 years — we start with a cryptographic problem and go back in time to discover a theorem that could have been known to the Greeks. Its content is based on a paper co-authored with Anton Mityagin and Kobbi Nissim that appeared in ANTS VII in 2006. The paper was motivated by a cryptographic question previously introduced by Claus-Peter Schnorr, but the machinery that we ended up using had more to do with extremal graph theory, projective geometry, and combinatorial number theory. It fits in nicely with the overarching theme of this blog, which is the interconnectedness of mathematics and CS theory, and leaves open several intriguing questions at the intersection of these areas.

I. Motivation

Consider the problem of computing the discrete logarithm in a generic group of a known prime order ${p}$: given two random elements ${g}$ and ${h}$, find ${x}$ so that ${h=g^x}$. Instead of having access to the group itself, we may only manipulate encodings of its elements (basically, a random mapping of the group ${{\mathbb Z}/p{\mathbb Z}}$ to a sufficiently large alphabet) via a group oracle. The group oracle accepts encodings of two elements and returns the encoding of their product. Think of it as a model of an abstract group, where the result of multiplying two group elements is treated as a new formal variable.

Let us try solving the discrete logarithm problem in this model. Given the encodings of two elements ${g}$ and ${h}$, one can multiply them, obtaining the encoding of ${gh}$, square the result, etc. In general, it is possible to compute (encodings of) elements of the form ${g^{a_i} h^{b_i}}$, where ${(a_i,b_i)}$ are pairs of integers modulo ${p}$ (all arithmetic not involving ${g}$ or ${h}$ is going to be modulo ${p}$ from now on). Of course, there can be multiple ways of arriving at the same element. For instance, ${g^2h^2 = (gh)^2}$ (as the group is of prime order, it is necessarily Abelian). Unless we do it on purpose, all elements that we obtain from the group oracle are going to be distinct with overwhelming probability over ${x=\log_g h}$ (assume that the group order ${p}$ is large, say, at least ${2^{80}}$). Indeed, if ${g^{a_i} h^{b_i} = g^{a_j} h^{b_j}}$, then ${a_i+b_ix=a_j+b_jx},$ which happens for ${(a_i,b_i)\neq(a_j,b_j)}$ with probability at most ${1/p}$. On the other hand, if we do get a non-trivial relationship, we can recover ${x}$ right away.
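The ${1/p}$ bound boils down to the fact that a non-trivial pair of lines agrees at no more than one point of ${{\mathbb Z}/p{\mathbb Z}}$. A tiny exact check (toy numbers mine):

```python
# a_i + b_i*x = a_j + b_j*x has at most one solution x mod a prime p,
# so over a uniformly random x each pair collides with probability <= 1/p.
p = 101
lines = [(3, 7), (12, 95), (3, 8), (50, 7)]   # pairs (a, b)
for i in range(len(lines)):
    for j in range(i + 1, len(lines)):
        (a1, b1), (a2, b2) = lines[i], lines[j]
        sols = [x for x in range(p) if (a1 + b1 * x) % p == (a2 + b2 * x) % p]
        assert len(sols) <= 1   # zero solutions when b1 == b2, one otherwise
```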

In other words, the group oracle keeps outputting some random encodings that tell us nothing useful about the elements ${g}$ and ${h}$ (we could sample encodings from the same distribution ourselves, without access to the oracle), until it returns an element that we did see before, which immediately gives away the answer to the discrete logarithm problem.

If ${x}$ is chosen uniformly at random from ${{\mathbb Z}/p{\mathbb Z}}$, the success probability of any algorithm in the generic group model making no more than ${n}$ group operations is bounded by ${{n\choose 2}/p = O(n^2/p)}$: each pair of elements output by the group oracle collides with probability at most ${1/p}$, there are at most ${{n\choose 2}}$ such pairs, union bound, check and mate. A formal version of this handwavy argument is due to Victor Shoup, and it gives a tight (up to a constant) bound on the success probability of any algorithm for solving the discrete logarithm problem in the generic group model.

A simple algorithm matches this bound. Let ${n = 3m}$. Compute ${g, g^2, g^3,\dots,g^m}$ (by repeated multiplication by ${g}$), then ${g^{2m},g^{3m},\dots,g^{m^2}}$ (by repeated multiplication by ${g^m}$), and, using the elements already available, compute ${hg,hg^2,hg^3,\dots,hg^m}$. If ${x\in [-m,m^2-m]}$, there is going to be a collision between ${g^{im}}$ and ${hg^j=g^{x+j}}$ for some ${i}$ and ${j \leq m}$. This algorithm is known as the baby-step giant-step method — we are making “baby” steps when multiplying ${h}$ by powers of ${g}$, and “giant” steps when computing powers of ${g^m}$. If ${m = \lceil \sqrt{p}\rceil}$, the discrete logarithm problem is solved with probability 1.
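For concreteness, here is the baby-step giant-step method written out in Python, instantiated in the multiplicative group modulo a prime rather than an abstract generic group (the modulus and exponents below are my toy choices):

```python
from math import isqrt

# Baby-step giant-step: find x with g^x = h (mod q) in O(sqrt(q)) multiplications.
def bsgs(g, h, q):
    m = isqrt(q) + 1
    baby = {}                          # h * g^j -> j   (the "baby" steps)
    cur = h % q
    for j in range(m):
        baby.setdefault(cur, j)
        cur = cur * g % q
    giant = pow(g, m, q)               # repeatedly multiply by g^m ("giant" steps)
    cur = giant
    for i in range(1, m + 1):
        if cur in baby:                # g^{i m} = h g^j  =>  x = i*m - j
            return i * m - baby[cur]
        cur = cur * giant % q
    return None

q = 1000003                            # a prime modulus (toy example)
x = 123457
h = pow(2, x, q)
assert pow(2, bsgs(2, h, q), q) == h   # the recovered exponent works
```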

The above argument suggests that in order to solve the discrete logarithm problem in the generic group model one would want to maximize the probability of observing a collision. Collisions have a simple geometric interpretation: each time the algorithm computes ${g^{a_i}h^{b_i}}$, it draws the line ${y = a_i+b_ix}$ in the space ${({{\mathbb Z}/p{\mathbb Z}})^2}$. An element ${z}$ is “covered” if two lines intersect above it: ${a_i+b_iz = a_j+b_jz}$. The adversary is trying to cover as many elements as possible with the fewest lines.

As we have just seen, the number of group operations required to solve the discrete logarithm problem in the generic group when ${g}$ and ${h}$ are chosen uniformly at random is ${\Theta(\sqrt{p})}$. The question becomes much more interesting if we constrain the joint distribution of ${g}$ and ${h}$.

What is the complexity of the discrete logarithm problem measured as the number of group operations, if ${h = g^x}$, where ${x}$ is sampled uniformly from ${S \subset {{\mathbb Z}/p{\mathbb Z}}}$?

It turns out that this question has been answered for some simple sets ${S}$, but it is wide open in general.

II. Geometric Formulation

We re-formulate the problem using the language of finite field geometry.

Given a subset ${S}$ of ${{{\mathbb Z}/p{\mathbb Z}}}$, define its DL-complexity, denoted as ${C(S)}$, as the minimal number of lines in ${({{\mathbb Z}/p{\mathbb Z}})^2}$ whose intersection points projected to the ${x}$-axis cover ${S}$.

In the notation of the previous section, the adversary is drawing lines ${y=a_i+b_ix}$. It scores a hit when two lines intersect above point ${z\in S}$, i.e., ${a_i+b_iz=a_j+b_jz}$. The adversary’s goal is to cover the entire set ${S}$ with the smallest number of lines, which would correspond to solving the discrete logarithm problem for the case when ${h=g^x}$ and ${x\in S}$.

What are the most basic facts about ${C(S)}$?

1. ${C(S) = O(\sqrt{p})}$. Indeed, we know that the (generic) baby-step giant-step algorithm covers the entire ${{{\mathbb Z}/p{\mathbb Z}}}$ with ${\Theta(\sqrt{p})}$ lines.
2. ${C(S) \leq |S| + 1}$ — duh! It suffices to draw the single line ${y = 0}$ together with, for each element ${z\in S}$, the line ${y=x - z}$.
3. ${C(S) > \sqrt{|S|}}$: if ${n = C(S)}$ lines can cover the entire ${S}$, then the number of intersection points, which is less than ${n^2}$, is at least ${|S|}$.
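Facts 2 and 3 are easy to check concretely. A short sketch (small example mine) that computes the set covered by a family of lines ${y = a + bx}$ over ${{\mathbb Z}/p{\mathbb Z}}$ and verifies the ${|S|+1}$-line cover:

```python
# Lines are pairs (a, b) representing y = a + b*x over Z/pZ; an element z is
# covered when two non-parallel lines agree at x = z.
def covered(lines, p):
    hit = set()
    for i, (a1, b1) in enumerate(lines):
        for (a2, b2) in lines[i + 1:]:
            if b1 != b2:               # parallel lines never meet
                hit.add((a1 - a2) * pow(b2 - b1, -1, p) % p)
    return hit

p = 101
S = {3, 17, 42, 99}
lines = [(0, 0)] + [(-z % p, 1) for z in S]   # the |S|+1 cover from fact 2
assert S <= covered(lines, p)
# fact 3: n lines produce fewer than n^2 intersection points
assert len(covered(lines, p)) <= len(lines) * (len(lines) - 1) // 2
```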

Putting these bounds together in a schematic picture drawn in log-log scale, we can see that ${C(S)}$ lives inside the shaded triangle.
The most intriguing part of the triangle is the upper-left corner, marked with the target sign, that corresponds to sets that are as small as ${\sqrt{p}}$ but have the property that solving the discrete logarithm problem in these subsets is as hard as in the entire ${{{\mathbb Z}/p{\mathbb Z}}}$. How can we get there, or just get closer? But first, why do we care at all?

One, rather vague, motivation is that we are interested in characterizing these subsets because they capture the complexity of the discrete logarithm problem. Another, due to Claus-Peter Schnorr, who defined the problem in 2000, is that the amount of entropy needed to sample an element of such a set is half of ${\log_2 p}$. The observation that got us going back in 2005 was that modular exponentiation takes an amount of time that depends on the exponent. Wouldn’t it be nice if we could choose exponents that allow for faster exponentiation algorithms? These exponents would cover only a fraction of the entire space, which naturally led us to the question of understanding the discrete logarithm problem restricted to a subset — a question that turned out to be very interesting in its own right.

The first result, going back to Schnorr, is very encouraging:

For a random ${S \subset {{\mathbb Z}/p{\mathbb Z}}}$ of size ${O(\sqrt{p})}$, ${C(S) > |S|/\log p}$ with probability at least ${1-1/p}$.

It means that a random subset has essentially the maximal possible DL-complexity (up to a ${\log p}$ factor) with very high probability. Unfortunately, using (truly) random subsets forecloses the possibility of extracting any gains in exponentiation relative to the average case. Second, it does not quite answer the question of whether any specific sets are particularly hard for the discrete logarithm problem.

In the rest of this post we explore several approaches towards constructing explicit sets and sets with succinct representation for which we can prove a lower bound on their DL-complexity stronger than ${\sqrt{|S|}}$.

III. A first attempt

Rather than trying to solve the problem in full generality, let’s constrain the definition of ${C(S)}$ to capture only generalizations of the baby-step giant-step method. Let us call this restriction ${C_\mathrm{bsgs1}}$, defined as follows:

Given a subset ${S}$ of ${{{\mathbb Z}/p{\mathbb Z}}}$, let ${C_\mathrm{bsgs1}(S)}$ be the minimal number ${n}$ so that ${S}$ is covered by the intersections of two sets of lines ${\{y=a_i\}_{i=1}^n}$ and ${\{y=x+b_j\}_{j=1}^n}$, where ${a_i,b_j\in {{\mathbb Z}/p{\mathbb Z}}}$.

Recall that the intersection of two lines covers an element of ${S}$ if these lines intersect at a point whose first coordinate is in ${S}$.

The definition of ${C_\mathrm{bsgs1}}$ complexity considers only horizontal lines (analogous to the giant steps of the algorithm, ${g^{a_i}}$) and parallel slanted lines (corresponding to the elements ${hg^{b_j}}$). The 1 in BSGS1 refers to the fact that all slanted lines have slope of exactly 1 (for now — this condition will be relaxed later).

Can we come up with a constraint on ${S}$ that would guarantee that ${C_\mathrm{bsgs1}(S) \gg \sqrt{|S|}}$? It turns out that we can.

Assume for a moment that all pairwise sums of elements in ${S}$ are distinct, i.e., no four elements satisfy the following equation: ${a+b=c+d}$, where ${a,b,c,d\in S}$, unless ${\{a,b\}=\{c,d\}}$. If this is the case, at least one of the intersection points of the lines in the following configuration will miss an element of ${S}$:

To see why it is so, observe that ${x_1+y_2=x_2+y_1}$ — a contradiction with ${S}$‘s not having solutions to this equation.

We now introduce one more way of thinking about these lines in ${({{\mathbb Z}/p{\mathbb Z}})^2}$ that are trying to hit elements of ${S}$ (we promise it is going to be the last!). Associate lines with the vertices of a graph and draw an edge between two vertices if the intersection point of the corresponding lines projects to ${S}$ (“kills an element of ${S}$”).

If all pairwise sums of ${S}$ are distinct, then the graph whose nodes are the horizontal and slanted lines does not contain a 4-cycle. This property alone is sufficient to bound the total number of edges in the graph (and thus the number of elements of ${S}$ hit by these lines) by ${O(n^{3/2})}$. For bipartite graphs, which is our case, this bound is the answer to the Zarankiewicz problem, and it can be established via a simple counting argument.

If ${2n}$ lines cannot cover more than ${O(n^{3/2})}$ elements of ${S}$, it means that ${C_\mathrm{bsgs1}(S) = \Omega(|S|^{2/3})}$.

What’s left to do is to construct sets whose pairwise sums never repeat. They are known as modular Sidon sets, with several beautiful constructions resulting in sets of astonishingly high density. Ponder it for a moment: we want a subset of ${{{\mathbb Z}/p{\mathbb Z}}}$ such that no two pairs of its elements sum to the same thing. Obviously, by the pigeonhole principle, the size of such a set is ${O(\sqrt{p})}$. This bound is tight, as there exist — explicit, and efficiently enumerable — sets of size ${\Theta(\sqrt{p})}$!
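To make the Sidon property concrete, here is a classical explicit example, shown in the group ${{\mathbb Z}_p \times {\mathbb Z}_p}$ rather than in ${{\mathbb Z}/p{\mathbb Z}}$ itself (the modular constructions alluded to above are more delicate): the “parabola” ${\{(x, x^2 \bmod p)\}}$.

```python
# The parabola {(x, x^2 mod p)} is a Sidon set in Z_p x Z_p: if
# (a,a^2)+(b,b^2) = (c,c^2)+(d,d^2) then a+b = c+d and a^2+b^2 = c^2+d^2,
# hence ab = cd, so {a,b} and {c,d} are roots of the same quadratic.
p = 101
S = [(x, x * x % p) for x in range(p)]
sums = set()
for i in range(len(S)):
    for j in range(i, len(S)):
        s = ((S[i][0] + S[j][0]) % p, (S[i][1] + S[j][1]) % p)
        assert s not in sums           # all pairwise sums are distinct
        sums.add(s)
print(len(S), len(sums))               # p elements, p*(p+1)/2 distinct sums
```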

Notice that when two lines cover an element of ${S}$, their coefficients satisfy an especially simple condition: if ${a_i = x + b_j}$, where ${x \in S}$, then ${a_i - b_j\in S}$. Let ${A = \{a_1,\dots,a_n\}}$ and ${B=\{b_1,\dots,b_n\}}$. If all of ${S}$ is covered by intersections between the lines ${\{y=a_i\}}$ and ${\{y=x + b_j\}}$, then ${S\subset A-B}$, where ${A-B}$ is the sumset of ${A}$ and ${(-B)}$. Using the language of additive combinatorics, Erdős and D. Newman posed the problem, in 1977, of constructing subsets of $\{1,\dots,p\}$ that cannot be covered by sumsets of small sets. They proved that the set of “small squares” ${SQ = \{x^2 \mid x < \sqrt{p}\}}$ has this property, or in our terminology, ${C_\mathrm{bsgs1}(SQ) = \Omega(|SQ|^{2/3-\epsilon})}$ for any ${\epsilon > 0}$.

IV. Moving upwards

Let’s relax the constraint of the previous definition by allowing two classes of lines — horizontal and arbitrarily slanted, but the only hits that count are due to intersections between lines of different classes. Call the resulting measure of complexity ${C_\mathrm{bsgs}(S)}$:

Given a subset ${S}$ of ${{{\mathbb Z}/p{\mathbb Z}}}$, let ${C_\mathrm{bsgs}(S)}$ be the minimal number ${n}$ so that ${S}$ is covered by the intersections of two classes of lines ${\{y=a_i\}_{i=1}^n}$ and ${\{y=c_jx+b_j\}_{j=1}^n}$, for ${a_i,b_j,c_j\in {{\mathbb Z}/p{\mathbb Z}}}$, where only intersections between lines of different classes count towards covering ${S}$.

By analogy with the previous argument, we’d like to identify a local property of ${S}$ that will result in a non-trivial bound on ${C_\mathrm{bsgs}(S)}$. More concretely, we should be looking for some condition on a small number of elements of ${S}$ that makes them difficult to cover by a few lines of the two different classes.

Fortunately, one such property is not that difficult to find. Consider the following drawing:

The intercept theorem (also known as Thales’ theorem) implies that ${|A_1B_1|:|B_1C_1|=|A_2B_2|:|B_2C_2|}$, and consequently (applying it a second time),

$\displaystyle {(x_1-y_1)/(y_1-z_1)= (x_2-y_2)/(y_2-z_2).\qquad(*)}$

Conversely, if the 6-tuple ${(x_1,y_1,z_1,x_2,y_2,z_2)}$ is such that ${(x_1-y_1) (y_2-z_2) \neq (x_2-y_2) (y_1-z_1) }$, these points cannot be covered all at once by three horizontal and two slanted lines.

Consider again the bipartite graph drawn on the sets of ${n}$ horizontal and ${n}$ slanted lines, where two lines are adjacent in the graph if their intersection point covers an element of ${S}$. What is the maximal density of this graph if it is prohibited from containing the ${K_{3,2}}$ subgraph? Somewhat surprisingly, the answer is asymptotically the same as before, namely, the number of edges in the graph is ${O(n^{3/2})}$. Therefore, if the set ${S}$ avoids 6-tuples satisfying (*), then ${C_\mathrm{bsgs}(S)=\Omega(|S|^{2/3})}$.

What about constructing sets that have that property? A short answer is that we don’t know how to do so explicitly, but at least there exist sets satisfying this property with succinct representation.

V. Going all the way

Having flexed our muscles on the watered-down notions of a set’s DL-complexity, let us try to extend our technique to handle the most general case of unrestricted lines, where anything goes and every intersection counts towards the attacker’s goal of covering the entire set ${S}$.

Once again, we’d like to find a local property with global repercussions. Concretely, we should be looking for a configuration of lines whose intersection points satisfy some avoidable condition, similar to ${x_1+y_2=x_2+y_1}$ or the quadratic polynomial of the previous section. It may seem that we should look no further than Menelaus’ theorem, which gives us just that. If your projective geometry is a bit rusty, Menelaus’ theorem applies to the six intersection points of four lines in the plane:
It states, in the form most relevant to us, that

$\displaystyle{(x_A-x_E)(x_B-x_F)(x_C-x_D)=- (x_E-x_B)(x_F-x_C)(x_D-x_A).\qquad(**)}$
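A quick sanity check of (**) in exact rational arithmetic (a sketch; I label the lines in the standard Menelaus setup, with ${A,B,C}$ the vertices of the triangle cut out by three of the lines and ${E,F,D}$ the points where the fourth, transversal, line crosses ${AB}$, ${BC}$, ${CA}$, respectively):

```python
from fractions import Fraction as F

def meet(l1, l2):
    """x-coordinate of the intersection of two non-parallel lines y = m*x + c."""
    (m1, c1), (m2, c2) = l1, l2
    return (c2 - c1) / (m1 - m2)

# Three lines forming a triangle, plus a transversal (slopes all distinct).
lAB, lBC, lCA = (F(0), F(0)), (F(-1), F(2)), (F(1), F(0))
lT = (F(2), F(-3))

xA, xB, xC = meet(lAB, lCA), meet(lAB, lBC), meet(lBC, lCA)
xE, xF, xD = meet(lT, lAB), meet(lT, lBC), meet(lT, lCA)

lhs = (xA - xE) * (xB - xF) * (xC - xD)
rhs = -(xE - xB) * (xF - xC) * (xD - xA)
```

Grouping the factors as ${\frac{x_A-x_E}{x_E-x_B}\cdot\frac{x_B-x_F}{x_F-x_C}\cdot\frac{x_C-x_D}{x_D-x_A}=-1}$ recovers the familiar signed-ratio form of Menelaus, since projection to the ${x}$-axis preserves ratios of signed segments along each (non-vertical) line.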

It seems like a nice local property, but what about its global ramifications? Namely, if we manage to construct a set in which no 6-tuple satisfies the cubic polynomial (**), what can we say about the number of lines required to cover that set? Well, our luck runs out here. Recall that we used the local property to guarantee that the graph, whose nodes corresponded to lines and whose edges corresponded to elements of ${S}$ covered by intersection points, excluded a certain subgraph. First it was a 4-cycle, then ${K_{3,2}}$. Unfortunately, if the graph merely excludes a complete graph on four vertices, which is what Menelaus’ theorem guarantees for sets avoiding (**), the number of edges in that graph can be as large as ${\Theta(n^2)}$. This is a consequence of Turán’s theorem (or Erdős–Stone), which yields no bound better than that unless the excluded subgraph is bipartite.

The only path forward is to find a Menelaus-like theorem that allows us to exclude a bipartite graph. It turns out that the minimal such configuration involves seven lines and 12 intersection points:
Most compactly, the theorem states that the following determinant evaluates to 0:

$\displaystyle \det\left(\begin{matrix} x_1 - y_1 & x_1 - z_1 & z_1(x_1 - y_1) & y_1(x_1 - z_1) \\ x_2 - y_2 & x_2 - z_2 & z_2(x_2 - y_2) & y_2(x_2 - z_2) \\ x_3 - y_3 & x_3 - z_3 & z_3(x_3 - y_3) & y_3(x_3 - z_3) \\ x_4 - y_4 & x_4 - z_4 & z_4(x_4 - y_4) & y_4(x_4 - z_4) \end{matrix}\right)=0.$
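The identity can be checked in exact rational arithmetic. In this sketch (a hedged verification for one concrete choice of the class of three lines; the four lines of the other class carry arbitrary parameters, with slopes chosen to avoid parallels), ${x_i,y_i,z_i}$ are the ${x}$-coordinates of the intersections of the ${i}$-th line of the four-line class with the three lines of the other class:

```python
from fractions import Fraction as F

def meet(l1, l2):
    """x-coordinate of the intersection of two non-parallel lines y = m*x + c."""
    (m1, c1), (m2, c2) = l1, l2
    return (c2 - c1) / (m1 - m2)

def det4(M):
    """4x4 determinant by Laplace expansion along the first row."""
    def det(rows):
        if len(rows) == 1:
            return rows[0][0]
        return sum((-1) ** j * rows[0][j]
                   * det([r[:j] + r[j + 1:] for r in rows[1:]])
                   for j in range(len(rows)))
    return det(M)

L3 = [(F(0), F(0)), (F(0), F(1)), (F(2), F(1))]                   # class of three lines
L4 = [(F(3), F(5)), (F(-1), F(4)), (F(5), F(-2)), (F(7), F(11))]  # class of four lines

rows = []
for m in L4:
    x, y, z = (meet(m, l) for l in L3)
    rows.append([x - y, x - z, z * (x - y), y * (x - z)])

d = det4(rows)
```

Only intersections between the two classes enter the matrix, which is why parallel lines within a class (the two horizontal ones above) are harmless.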

Using the same argument as before, if ${S}$ avoids solutions to the above equation in 12 variables and total degree 6, the “hit” graph defined over the ${n}$ lines avoids ${K_{3,4}}$. A variant of the Zarankiewicz bound guarantees that such a graph has ${O(n^{2-1/3})=O(n^{5/3})}$ edges (the exponent in the Zarankiewicz bound depends only on the size of the smaller part of the excluded bipartite graph). Since each element of the set ${S}$ corresponds to at least one edge of the “hit” graph, ${|S|=O( n^{5/3})}$ and consequently ${C(S)=\Omega(|S|^{3/5})}$, which is better than the trivial bound ${|S|^{1/2}}$. Finding explicit constructions remains a challenge, although it is easy to demonstrate the existence of such sets with succinct representation by the probabilistic method.

VI. Bipartite Menelaus’ Theorem and Open Problems

Even though our original motivation was rooted in cryptography, we ended up proving a fact of projective geometry. In an equivalent form, closest to the standard formulation of Menelaus’ theorem, it asserts an identity between products of signed line segments, where a segment is positive if it points in the same direction as the line it is part of (for some arbitrary but fixed orientation), and negative otherwise.

The classic (and classical — after all, Menelaus of Alexandria lived in the first century AD) theorem is implied by ours. Indeed, in the degenerate case when ${A_1=C_1}$, ${B_2=C_2}$, and ${A_3=B_3}$, following a furious round of cancellations, we end up with Menelaus’. This explains why we refer to our “12-point” theorem as bipartite Menelaus’: it is the minimal Menelaus-like theorem that involves lines separated into two classes.

We did search far and wide for evidence that this theorem had been known before, and came up empty. In retrospect, such a theorem is inevitable: the number of intersection points (i.e., equations) grows quadratically in the number of lines, while each line requires only two free parameters to describe. Still, this counting argument gives no insight into why bipartite Menelaus’ theorem is what it is. Is there a purely geometric proof? Is it a consequence of a simpler or deeper fact about projective geometries over finite fields? We’d love to know.

Let’s measure our progress against the initial goal of finding explicit sets that are as hard as the entire group against the discrete-logarithm-finding adversary. We are not there yet: we did develop some machinery for arguing that certain sets are more resistant than the most pessimistic square-root bound implies, but these sets are hard to construct and too small to be useful. What about proving that some natural sets, such as the set of squares (as in Erdős-Newman) or the set of cubes, have high DL-complexity? It is conceivable that the combinatorial approach based on excluded subgraphs is not sufficient to get us to the sweet spot of sets of size and DL-complexity ${\Theta(p^{1/2-o(1)})}$. What can?

A necessary disclaimer: the generic group model is just that, a model. Any instantiation of the abstract group allows direct observation of the group elements, and may enable attacks not captured by the model. For instance, the representation of group elements as integers modulo ${q}$ has enough structure that index calculus is exponentially more effective in $({{\mathbb Z}/q{\mathbb Z}})^*$ than any generic algorithm. On the positive side, for many groups, such as some elliptic curves or prime-order subgroups of $({{\mathbb Z}/q{\mathbb Z}})^*$ for sufficiently large ${q}$, no algorithms for finding discrete logarithms faster than generic methods are presently known. This motivates studying generic groups as a useful abstraction of many groups of cryptographic significance.

Notes

The abstract (generic) group model was introduced in papers by Nechaev and Shoup, where the hardness of the discrete logarithm in that model was shown to be ${\Theta(\sqrt{p})}$. Several generic methods for computing discrete logarithms with similar total running time are known: Shanks’s baby-step giant-step method and Pollard’s rho and kangaroo (lambda) methods. These algorithms can be adapted to intervals $[a,b]\subset \mathbb{Z}/p\mathbb{Z}$ to run in time $\sqrt{|a-b|}$, matching the pessimistic square-root bound. For small-weight subsets of $\mathbb{Z}/p\mathbb{Z}$, see the work of Stinson and references therein. Canetti put forward a variant of the Decisional Diffie-Hellman assumption where one of the exponents is sampled from an arbitrary distribution of bounded min-entropy. Chateauneuf, Ling, and Stinson gave a combinatorial characterization of algorithms for computing discrete logarithms in terms of slope coverings, and showed how weak Sidon sets are related to optimal algorithms.

Erdős and Newman defined the notion of bases for subsets of $\{1,\dots,p\}$, which corresponds (up to a factor of 4) to BSGS1-complexity in ${{\mathbb Z}/p{\mathbb Z}}$. They showed that random subsets of size $p^{1/2-\epsilon}$ have a basis of size $\Theta(|S|)$, and that the basis of the set of squares has size $\Omega(|S|^{2/3})$. Subsuming the counting argument of Erdős and Newman, Schnorr proved that the discrete logarithm problem has essentially optimal (up to a logarithmic factor) hardness on random subsets. Resolving the question of Erdős and Newman, Alon, Bukh, and Sudakov showed that for sets ${S}$ of size exactly ${\sqrt{p}}$, even their restricted DL-complexity ${C_\mathrm{bsgs1}(S)}$ is ${o(|S|)}$. They also extended the analysis of BSGS1-complexity from the set of squares to higher powers.
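For concreteness, a minimal sketch of Shanks’s baby-step giant-step method for the whole group (a standard textbook implementation, not code from any of the cited papers):

```python
import math

def bsgs(g, h, p):
    """Find x with g**x == h (mod p) in O(sqrt(p)) time and space, or None."""
    m = math.isqrt(p - 1) + 1                   # giant-step size, about sqrt(p)
    baby = {pow(g, j, p): j for j in range(m)}  # baby steps: g^j for j < m
    giant = pow(g, -m, p)                       # g^(-m) mod p (Python 3.8+)
    gamma = h % p
    for i in range(m):                          # giant steps: h * g^(-i*m)
        if gamma in baby:
            return i * m + baby[gamma]
        gamma = gamma * giant % p
    return None
```

Writing the unknown logarithm as ${x=im+j}$ with ${0\le i,j<m}$ is exactly the two-class line covering discussed above: the baby steps are the horizontal lines and the giant steps are the slanted ones.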