# Discrepancy and Rounding Linear Programs

In the previous posts we talked about various discrepancy questions and saw a proof of the six standard deviations suffice result. Besides being of interest in combinatorics, discrepancy theory has several remarkable applications in algorithms. Check this excellent book for a taste of these results. Here I will briefly discuss two (one old and one new) such results:

• Rounding linear programs by discrepancy theory: specifically, the beautiful argument of Lovasz, Spencer and Vesztergombi (LSV) on bounding linear discrepancy in terms of hereditary discrepancy. LSV is also an excellent place to look if you are running short on beautiful, innocent-in-appearance, time-sinking combinatorial conjectures.

Unfortunately, to keep the post short(er), I will not discuss the recent breakthrough result of Rothvoss on bin-packing which uses discrepancy theory (in a much more sophisticated way) for rounding a well-studied linear program (the “Gilmore-Gomory” LP). I highly recommend reading the paper.

• A recent algorithmic proof of Spencer/Gluskin theorem due to Shachar Lovett and myself.

## Rounding Linear Programs

One of the most basic techniques for combinatorial optimization is linear programming relaxation. Let us phrase this in a language suitable for the present context (which is in fact fairly generic). We have a constraint matrix ${A \in \mathbb{R}^{m \times n}}$, a target vector ${b \in \mathbb{R}^m}$, and our goal is to find an ${x \in \{0,1\}^n}$ so as to minimize ${\|Ax - b\|_\infty}$. One typical approach is to relax this discrete problem and instead solve the linear program

$\displaystyle \min_{y \in [0,1]^n} \|Ay - b\|_\infty \ \ \ \ \ (1)$

(which can be done efficiently). The next step, and often the most challenging one, is to round the fractional solution ${y \in [0,1]^n}$ to an integer solution in ${\{0,1\}^n}$ with little “loss”. How well can we do this for a constraint matrix ${A}$, and general vectors ${b}$? This is captured by the notion of linear discrepancy introduced by LSV:

$\displaystyle lindisc(A) = \max_{y \in [0,1]^n} \min_{x \in \{0,1\}^n} \|Ax - Ay\|_\infty. \ \ \ \ \ (2)$

LSV introduced the above notion originally as a generalization of discrepancy which can be formulated in our current context as follows:

$\displaystyle disc(A) = \min_{\varepsilon \in \{1,-1\}^n} \|A\varepsilon\|_\infty. \ \ \ \ \ (3)$

This corresponds exactly to the notion of discrepancy we studied in the last two posts. On the other hand, we can also write ${disc(A) = 2 \min_{x \in \{0,1\}^n} \| Ax - A e/2\|_\infty}$, where ${e}$ denotes the all-ones vector (substitute ${\varepsilon = 2x - e}$). Thus, ${disc(A)}$ corresponds to the special case of linear discrepancy where we are only trying to round one particular fractional solution, ${y = e/2}$.

The remarkable result of LSV is that while the above definition seems to be much weaker than ${lindisc}$ (which has to round all fractional solutions) there is a natural extension, hereditary discrepancy, of ${disc}$ which is nearly as strong. For a matrix ${A \in \mathbb{R}^{m \times n}}$ and ${S \subseteq [n]}$, let ${A_S}$ denote the ${\mathbb{R}^{m \times |S|}}$ sub-matrix corresponding to the columns of ${A}$ indexed by ${S}$. Then, define

$\displaystyle herdisc(A) = \max_S disc(A_S). \ \ \ \ \ (4)$

Hereditary discrepancy is a natural and more “robust” version of discrepancy. As LSV phrase it, discrepancy can be small by accident, whereas hereditary discrepancy seems to carry more structural information. For example, let ${A}$ be a random ${1,-1}$ matrix with the constraint that each row has sum ${0}$. Then, ${disc(A) = 0}$ (the all-ones vector ${\varepsilon}$ works), but ${herdisc(A) = \Omega(\sqrt{n})}$ (this needs proving, but is not too hard) – which makes intuitive sense, as we expect random matrices to have little structure.
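For tiny matrices, both quantities can be computed by brute force. The following sketch (exponential time, purely illustrative) checks a small instance of the example above: a ${\pm 1}$ matrix whose rows sum to zero has discrepancy ${0}$, yet restricting to a subset of the columns forces nonzero discrepancy.

```python
import itertools
import numpy as np

def disc(A):
    # Brute-force discrepancy: min over sign vectors eps of ||A eps||_inf.
    n = A.shape[1]
    return min(np.abs(A @ np.array(s)).max()
               for s in itertools.product([1, -1], repeat=n))

def herdisc(A):
    # Hereditary discrepancy: max of disc over all nonempty column subsets.
    n = A.shape[1]
    return max(disc(A[:, list(S)])
               for k in range(1, n + 1)
               for S in itertools.combinations(range(n), k))

# Rows sum to zero, so eps = all-ones witnesses disc(A) = 0;
# the column subset {0, 1} alone already has discrepancy 2.
A = np.array([[1, 1, -1, -1],
              [1, -1, 1, -1]])
print(disc(A), herdisc(A))  # disc is 0, yet herdisc is strictly positive
```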

It is also worth noting that several notable results in discrepancy theory which bound the discrepancy in fact also bound hereditary discrepancy. For example, Spencer’s original proof as well as Gluskin’s argument from last post (with a little bit more work) in fact show the following:

Theorem 1 For all matrices ${A \in [-1,1]^{n \times n}}$, ${herdisc(A) = O(\sqrt{n})}$.

LSV show the following connection between linear and hereditary discrepancies:

Theorem 2 For any matrix ${A}$, ${ lindisc(A) \leq herdisc(A)}$.

In other words, any fractional solution for a linear program of the form Equation 1 can be rounded to an integer solution with an additive error of at most ${herdisc(A)}$.
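Concretely, the relaxation in Equation 1 becomes an ordinary linear program once we introduce a slack variable ${t}$ for the ${\ell_\infty}$ norm. A minimal sketch, assuming scipy is available (the matrix and target below are made up for illustration):

```python
import numpy as np
from scipy.optimize import linprog

def solve_lp_relaxation(A, b):
    # min_{y in [0,1]^n} ||A y - b||_inf, written as an LP with slack t:
    #   min t  subject to  A y - t*1 <= b  and  -A y - t*1 <= -b
    m, n = A.shape
    c = np.zeros(n + 1)
    c[-1] = 1.0                                # objective: minimize t
    ones = np.ones((m, 1))
    A_ub = np.vstack([np.hstack([A, -ones]),
                      np.hstack([-A, -ones])])
    b_ub = np.concatenate([b, -b])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, 1)] * n + [(0, None)])
    return res.x[:n], res.x[-1]                # fractional y and its error

A = np.array([[1.0, 1.0], [1.0, -1.0]])
b = np.array([1.0, 0.0])
y, err = solve_lp_relaxation(A, b)
# y = [0.5, 0.5] attains A y = b exactly, so err is 0
```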

Let me now describe the cute proof, which can be explained in a few lines. Suppose you have a fractional solution ${y \in [0,1]^n}$. Our goal is to find an ${x \in \{0,1\}^n}$ such that ${\|Ay - Ax\|_\infty}$ is small. We will construct such an ${x}$ by iteratively making ${y}$ more integral. Let us write out the binary expansion of each of the coordinates of ${y}$: for each ${i \in [n]}$, ${y_i = 0.y_{i1} y_{i2} \cdots }$. To avoid unnecessary technical issues, let us suppose that each coordinate has a finite expansion of length ${m}$ (this ${m}$ has nothing to do with the number of rows of ${A}$).

We will build a sequence of solutions ${y^0 = y, y^1,\ldots,y^m = x}$ such that the coordinates of ${y^i}$ when written in binary will have expansions of length at most ${m-i}$. Let us look at how to get ${y^1}$; we can get the rest similarly.

Let ${S = \{i \in [n] : y_{im} = 1\}}$, the set of coordinates whose last bit is ${1}$. By the definition of ${herdisc}$, there exists a vector ${\varepsilon \in \{1,-1\}^S}$ such that ${\|A_S \varepsilon \|_\infty \leq herdisc(A)}$. Let ${y^1 = y + \varepsilon/2^m}$ (interpreting ${\varepsilon}$ as a vector in ${\{1,-1,0\}^n}$ by padding it with zeros outside ${S}$). Clearly, the binary expansions of the coordinates of ${y^1}$ have length at most ${m-1}$: adding ${\pm 2^{-m}}$ to a coordinate whose last bit is ${1}$ clears that bit. Further, ${\|Ay - Ay^1\|_\infty = \|A_S \varepsilon\|_\infty/2^m \leq herdisc(A)/2^m}$. Iterating this argument, we get ${y^1,\ldots,y^m = x}$ such that ${\|Ay^i - A y^{i+1}\|_\infty \leq herdisc(A)/2^{m-i}}$. Therefore,

$\displaystyle \|A y - Ax\|_\infty \leq \sum_{i=0}^{m-1} \frac{herdisc(A)}{2^{m-i}} \leq herdisc(A).$
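The proof above is effectively an algorithm, modulo the need to find a low-discrepancy signing at each step. A toy sketch that uses brute force for that step (exponential time, illustration only; the matrix and the fractional point are made up):

```python
import itertools
from fractions import Fraction
import numpy as np

def best_signs(A_S):
    # Brute-force low-discrepancy signing of the columns of A_S.
    k = A_S.shape[1]
    return min((np.array(s) for s in itertools.product([1, -1], repeat=k)),
               key=lambda s: np.abs(A_S @ s).max())

def lsv_round(A, y, m):
    # y has binary expansions of length <= m; at each step, clear the
    # last remaining bit of every coordinate using a signing of A_S.
    y = [Fraction(v) for v in y]
    for step in range(m, 0, -1):
        scale = Fraction(1, 2**step)
        S = [i for i, v in enumerate(y) if (v / scale) % 2 == 1]
        if S:
            eps = best_signs(A[:, S])
            for i, s in zip(S, eps):
                y[i] += int(s) * scale      # y_i gains/loses 2^{-step}
    return np.array([int(v) for v in y])

A = np.array([[1.0, -1.0, 1.0], [1.0, 1.0, -1.0]])
y = [Fraction(1, 2), Fraction(3, 4), Fraction(1, 4)]
x = lsv_round(A, y, m=2)
# x is integral, and ||Ax - Ay||_inf is at most herdisc(A)
```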

## Constructive Discrepancy Minimization by Walking on the Edges

We just saw a way to round linear programs as in Equation 1 with error bounded by their discrepancy. As appealing as this is, it comes with one important caveat. The original motivation for looking at LP relaxations was that we can solve them efficiently. For this to make sense, we need the rounding procedure to be efficient as well. In our present case, to make the rounding efficient, we need to find a small-discrepancy solution efficiently (find ${\varepsilon}$ given ${A_S}$ as above). Unfortunately, this is in general NP-hard in a very strong way, as was shown recently by Charikar, Newman and Nikolov.

However, what about specific bounds like in Theorem 1? Spencer’s original proof as well as Gluskin’s proof do not give an efficient algorithm for finding a good coloring. This is inherent to both arguments: they rely (directly, or indirectly via Minkowski’s theorem) on the pigeonhole principle (with exponentially many “holes”), which is quite non-algorithmic. In fact, Alon and Spencer conjectured several years ago (in *the* book) that finding a coloring with discrepancy ${O(\sqrt{n})}$ as in the theorem is computationally hard.

Fortunately for us, like all good conjectures, this was proven to be false by a breakthrough result of Nikhil Bansal. Nikhil’s argument studies a carefully constructed semi-definite programming relaxation of the problem and then gives a new and amazing rounding algorithm for the SDP.

Here I will briefly discuss a different proof of Theorem 1 due to Lovett and myself which will also lead to an efficient algorithm for finding a coloring as required.

Let us first revisit Beck’s partial-coloring approach described in the last post, which says that to prove the theorem, it suffices to show the following.

Lemma 3 For vectors ${a^1,\ldots,a^n \in [-1,1]^n}$, there exists ${\varepsilon \in \{1,0,-1\}^n}$ such that for every ${j \in [n]}$, ${|\langle a^j,\varepsilon\rangle| = O(\sqrt{n})}$ and ${|Support(\varepsilon)| = \Omega(n)}$.

As in the last post, let us also rephrase the problem in a geometric language. Let ${\mathcal{K} \subseteq \mathbb{R}^n}$ be the symmetric convex set (symmetric meaning ${x \in \mathcal{K}}$ implies ${-x \in \mathcal{K}}$) defined as follows, for ${\Delta = O(\sqrt{n})}$ to be chosen later:

$\displaystyle \mathcal{K} = \{\,x \,:\, |\langle a^j,x\rangle| \leq \Delta \;\;\forall j \in [n] \text{ (the \emph{discrepancy} constraints)}, \;\; |x_i| \leq 1 \;\;\forall i \in [n] \text{ (the \emph{coloring} constraints)}\,\}.$

The partial coloring lemma is equivalent to showing that ${\mathcal{K}}$ contains a ${\{1,0,-1\}^n}$ lattice point of large support. As it turns out, we don’t really need to find a lattice point in ${\mathcal{K}}$ but any point with many ${\{1,-1\}}$ (or close to ${\{1,-1\}}$) coordinates will serve us equally well. Concretely, we want:

Lemma 4 For ${\mathcal{K}}$ as above, there exists ${x \in \mathcal{K}}$ such that ${|\{i: |x_i| = 1\}| = \Omega(n)}$.

The above amounts to finding a vertex of ${\mathcal{K}}$ which is tight on ${\Omega(n)}$ coloring constraints. For intuition, let us use the distance from the origin as a proxy for how many coordinates are close to ${1}$ in absolute value. Thus, roughly speaking, our goal is to find a vertex of ${\mathcal{K}}$ as far away from the origin as possible. This is the interpretation we will use.

Let us think of trying to find the vertex ${x}$ iteratively. Our starting point would be the all-zeros vector. Now what? Well, we want to get away from the origin but still stay inside the polytope ${\mathcal{K}}$. Being somewhat greedy and lazy, we will just toss some coins and update our ${x}$ by doing Brownian motion (if this is uncomfortable, think of taking a discrete walk with tiny random Gaussian steps). The point is that the random walk will steadily move away from the origin.

Now, in the course of performing this random walk, at some time we will touch the boundary of ${\mathcal{K}}$, or in other words hit some constraint(s) of ${\mathcal{K}}$. Now what? Well, as before we still want to get away from the origin but do not want to leave the polytope. So as before, being greedy and lazy (tough habits to change), we will continue doing Brownian motion but now constrain ourselves to only take steps which lie in the face of the polytope that we hit. So we move from doing Brownian motion in ${n}$ dimensions to doing one in a subspace (corresponding to the face that we hit) of dimension at most ${n-1}$. We now repeat this process: every time we hit a new constraint we restrict ourselves to only move in the face we hit. Repeating the above step, we should eventually reach a vertex of the polytope ${\mathcal{K}}$.

This is pretty much the actual algorithm which we call the Edge-Walk algorithm (except that to make it implementable we take tiny Gaussian random steps instead of doing Brownian motion; and this makes the analysis easier too). I will refer to the paper for the full details of the algorithm and its analysis, but let me just say what the punchline is: you are more likely to hit nearer constraints than farther ones.

Before moving on, note that the Edge-Walk algorithm can be defined for any polytope ${\mathcal{K}}$. This leads to the meta-question: Can we say anything interesting about the distribution on the vertices of a polytope induced by the walk? The analysis from our paper implicitly gives one such property which has to do with the distances of the constraints defining the vertex from the origin. Understanding this distribution better might be useful elsewhere.
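To make the preceding description concrete, here is a toy sketch of such a walk: a discrete Gaussian walk with projection onto the faces it hits. This is not the actual implementation from the paper; the step size, tolerances, and the random instance at the bottom are all made up for illustration.

```python
import numpy as np

def edge_walk(C, b, step=0.05, tol=1e-9, iters=20000, seed=0):
    # Gaussian random walk inside K = {x : |<c_j, x>| <= b_j}, started at 0.
    # Once the walk hits a face, later steps are projected onto that face.
    rng = np.random.default_rng(seed)
    n = C.shape[1]
    x = np.zeros(n)
    tight = np.zeros(len(b), dtype=bool)   # constraints whose face we walk on
    for _ in range(iters):
        g = step * rng.standard_normal(n)
        if tight.any():
            B = C[tight]
            # project g onto the subspace orthogonal to the tight normals
            g -= B.T @ np.linalg.lstsq(B @ B.T, B @ g, rcond=None)[0]
        if np.allclose(g, 0.0):
            break                          # no free directions: near a vertex
        if np.all(np.abs(C @ (x + g)) <= b):
            x = x + g
        else:
            # binary-search the largest t in [0,1] keeping x + t*g inside K
            lo, hi = 0.0, 1.0
            for _ in range(50):
                mid = 0.5 * (lo + hi)
                if np.all(np.abs(C @ (x + mid * g)) <= b):
                    lo = mid
                else:
                    hi = mid
            x = x + lo * g
            tight |= (b - np.abs(C @ x)) <= tol
    return x

# Made-up demo on the partial-coloring polytope of a random +-1 matrix.
n = 6
A = np.random.default_rng(1).choice([-1.0, 1.0], size=(n, n))
C = np.vstack([A, np.eye(n)])                       # discrepancy + coloring rows
b = np.concatenate([np.full(n, 2.0 * np.sqrt(n)), np.ones(n)])
x = edge_walk(C, b)
# heuristically, many coordinates of x end up stuck at +-1 (tight coloring rows)
```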

## Discussion

This concludes the three-post series. Perhaps sometime in the future there will be posts on other notable results in discrepancy theory (like the Beck-Fiala theorem). Keeping with the trend from the last two posts, let me end with another open question, one which strengthens Theorem 1 considerably:

Komlos Conjecture Given unit vectors ${v_1,\ldots,v_n \in \mathbb{R}^d}$, there exists ${\varepsilon \in \{1,-1\}^n}$ such that

$\displaystyle \left\|\varepsilon_1 v_1 + \varepsilon_2 v_2 + \cdots + \varepsilon_n v_n\right\|_\infty = O(1).$

Theorem 1 follows from the above if we take the ${v_i}$’s to be the normalized columns of the matrix ${A}$. The above conjecture also strengthens another beautiful conjecture due to Beck and Fiala; but that’s for another day. The most remarkable aspect of the above conjecture is that there is no dependence on the dimension. The best bound we know is due to Banaszczyk, who showed a bound of ${O(\sqrt{\log d})}$; that proof too we’ll leave for another day.