In this post, I am going to give you a simple, self-contained, and fruitful demonstration of a recently introduced proof technique called the method of interlacing families of polynomials, which was also mentioned in an earlier post. This method, which may be seen as an incarnation of the probabilistic method, is relevant in situations when you want to show that at least one matrix from some collection of matrices must have eigenvalues in a certain range. The basic methodology is to write down the characteristic polynomial of each matrix you care about, compute the average of all these polynomials (literally by averaging the coefficients), and then reason about the eigenvalues of the individual matrices by studying the zeros of the average polynomial. The relationship between the average polynomial and the individuals hinges on a phenomenon known as interlacing.

We will use the method to prove the following sharp variant of Bourgain and Tzafriri’s restricted invertibility theorem, which may be seen as a robust, quantitative version of the fact that every rank ${k}$ matrix contains a set of ${k}$ linearly independent columns.

Theorem 1 Suppose ${v_1,\ldots,v_m\in{\mathbb R}^n}$ are vectors with ${\sum_{i=1}^mv_iv_i^T=I_n}$. Then for every ${k<n}$ there is a subset ${\sigma\subset [m]}$ of size ${k}$ with

$\displaystyle \lambda_k\left(\sum_{i\in\sigma}v_iv_i^T\right)\geq \left(1-\sqrt{\frac{k}{n}}\right)^2\frac{n}{m}.$

That is, any set of vectors ${v_1,\ldots,v_m}$ with variance ${\sum_{i=1}^m \langle x,v_i\rangle^2}$ equal to one in every direction ${\|x\|=1}$ must contain a large subset which is far from being linearly degenerate, in the sense of having large eigenvalues (compared to ${n/m}$, which is the average squared length of the vectors). Such variance-one sets go by many other names in different contexts: they are also called isotropic sets, decompositions of the identity, and tight frames. This type of theorem was first proved by Bourgain and Tzafriri in 1987, and later generalized and sharpened to include the form stated here.
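On tiny instances, Theorem 1 can even be sanity-checked by brute force. The sketch below is illustrative only; building the isotropic set from a QR factorization is just one convenient choice, and the exhaustive search over subsets is obviously not how the theorem is proved:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, m, k = 3, 6, 2

# Isotropic set: columns v_1..v_m of an n x m matrix B with B B^T = I_n
# (transpose of a random matrix with orthonormal columns).
Q, _ = np.linalg.qr(rng.standard_normal((m, n)))
B = Q.T

bound = (1 - np.sqrt(k / n)) ** 2 * n / m

# Exhaustively search for a size-k subset sigma achieving the bound.
best = max(
    np.linalg.eigvalsh(sum(np.outer(B[:, i], B[:, i]) for i in S))[-k]
    for S in combinations(range(m), k)
)
print(best >= bound)
```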

The original applications of Theorem 1 and its variants were mainly in Banach space theory and harmonic analysis. More recently, it was used in theoretical CS by Nikolov, Talwar, and Zhang in the contexts of differential privacy and discrepancy minimization. Another connection with TCS was discovered by Joel Tropp, who showed that the set ${\sigma}$ can be found algorithmically using a semidefinite program whose dual is related to the Goemans-Williamson SDP for Max-Cut.

In more concrete notation, the theorem says that every rectangular matrix ${B_{n\times m}}$ with ${BB^T=I_n}$ contains an ${n\times k}$ column submatrix ${B_S}$, ${S\subset [m]}$, with ${\sigma_k^2(B_S)\ge (1-\sqrt{k/n})^2(n/m)}$, where ${\sigma_k}$ is the ${k}$th largest singular value. Written this way, we see some similarity with the column subset selection problem in data mining, which seeks to extract a maximally nondegenerate set of 'representative' columns from a data matrix. There are also useful generalizations of Theorem 1 for arbitrary rectangular ${B}$.

As I said earlier, the technique is based on studying the roots of averages of polynomials. In general, averaging polynomials coefficient-wise can do unpredictable things to the roots. For instance, the average of ${(x-1)(x-2)}$ and ${(x-3)(x-4)}$, which are both real-rooted quadratics, is ${x^2-5x+7}$, which has complex roots ${2.5\pm(\sqrt{3}/2)i}$. Even when the roots of the average are real, there is in general no simple relationship between the roots of two polynomials and the roots of their average.
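This example is easy to check numerically; the following quick illustration (using numpy) averages the coefficients and inspects the roots:

```python
import numpy as np

# Coefficient-wise average of (x-1)(x-2) = x^2 - 3x + 2 and
# (x-3)(x-4) = x^2 - 7x + 12.
p = np.array([1.0, -3.0, 2.0])
q = np.array([1.0, -7.0, 12.0])
avg = (p + q) / 2                 # x^2 - 5x + 7

roots = np.roots(avg)             # 2.5 +- (sqrt(3)/2) i : not real
print(roots)
```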

The main insight is that there are nonetheless many situations where averaging the coefficients of polynomials also has the effect of averaging each of the roots individually, and that it is possible to identify and exploit these situations. To speak about this concretely, we will need to give the roots names. There is no canonical way to do this for arbitrary polynomials, whose roots are just sets of points in the complex plane. However, when all the roots are real there is a natural ordering given by the real line; we will use this ordering to label the roots of a real-rooted polynomial ${p(x)=\prod_{i=1}^n (x-\alpha_i)}$ in descending order ${\alpha_1\ge\ldots\ge\alpha_n}$.

Interlacing

We will use the following classical notion to characterize precisely the good situations mentioned above.

Definition 2 (Interlacing) Let ${f}$ be a degree ${n}$ polynomial with all real roots ${\{\alpha_i\}}$, and let ${g}$ be degree ${n}$ or ${n-1}$ with all real roots ${\{\beta_i\}}$ (ignoring ${\beta_n}$ in the degree ${n-1}$ case). We say that ${g}$ interlaces ${f}$ if their roots alternate, i.e.,

$\displaystyle \beta_n\le\alpha_n\le \beta_{n-1}\le \ldots \le \beta_1\le \alpha_1.$

Following Fisk, we denote this as ${g\longrightarrow f}$, to indicate that the largest root belongs to ${f}$.

If there is a single ${g}$ which interlaces a family of polynomials ${f_1,\ldots,f_m}$, we say that they have a common interlacing.

It is an easy exercise that ${f_1,\ldots,f_m}$ of degree ${n}$ have a common interlacing iff there are closed intervals ${I_n\le I_{n-1}\le\ldots\le I_1}$ (where ${\le}$ means to the left of) such that the ${i}$th roots of all the ${f_j}$ are contained in ${I_i}$. It is also easy to see that a set of polynomials has a common interlacing iff every pair of them has a common interlacing (this may be viewed as Helly’s theorem on the real line).

We now state the main theorem about averaging polynomials with common interlacings.

Theorem 3 Suppose ${f_1,\ldots,f_m}$ are real-rooted of degree ${n}$ with positive leading coefficients. Let ${\lambda_k(f_j)}$ denote the ${k^{th}}$ largest root of ${f_j}$ and let ${\mu}$ be any distribution on ${[m]}$. If ${f_1,\ldots,f_m}$ have a common interlacing, then for all ${k=1,\ldots,n}$

$\displaystyle \min_j \lambda_k(f_j)\le \lambda_k(\mathop{\mathbb E}_{j\sim \mu} f_j)\le \max_j \lambda_k(f_j).$

The proof of this theorem is a three-line exercise. Since it is the crucial fact upon which the entire technique relies, I encourage you to find this proof for yourself (Hint: apply the intermediate value theorem inside each interval ${I_i}$.) You can also look at the picture below, which shows what happens for two cubic polynomials with a quadratic common interlacing.
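For a numerical illustration of Theorem 3, here are two hand-picked real-rooted cubics whose ${k}$th largest roots sit in disjoint intervals (so a common interlacing exists):

```python
import numpy as np

# Two real-rooted cubics with a common interlacing: their k-th largest
# roots lie in the disjoint intervals [4,5], [2,3], [0,1].
r1, r2 = [4.0, 2.0, 0.0], [5.0, 3.0, 1.0]
f1, f2 = np.poly(r1), np.poly(r2)

# Coefficient-wise average; its roots are real (2.5 and (5 ± sqrt(13))/2).
avg_roots = np.sort(np.roots((f1 + f2) / 2).real)[::-1]   # descending

# Theorem 3: the k-th largest root of the average lies between the min
# and max of the k-th largest roots of the individual polynomials.
for k in range(3):
    print(min(r1[k], r2[k]) <= avg_roots[k] <= max(r1[k], r2[k]))
```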

One of the nicest features of common interlacings is that their existence is equivalent to certain real-rootedness statements. Often, this characterization gives us a systematic way to argue that common interlacings exist, rather than having to rely on cleverness and pull them out of thin air. The following seems to have been discovered a number of times (for instance, Fell or Chudnovsky & Seymour); the proof of it included below assumes that the roots of a polynomial are continuous functions of its coefficients (which may be shown using elementary complex analysis).

Theorem 4 If ${f_1,\ldots,f_m}$ are degree ${n}$ polynomials and all of their convex combinations ${\sum_{i=1}^m \mu_if_i}$ have real roots, then they have a common interlacing.

Proof: Since common interlacing is a pairwise condition, it suffices to handle the case of two polynomials ${f_0}$ and ${f_1}$. Let

$\displaystyle f_t := (1-t)f_0+tf_1$

with ${t\in [0,1]}$. Assume without loss of generality that ${f_0}$ and ${f_1}$ have no common roots (if they do, divide them out and put them back in at the end). As ${t}$ varies from ${0}$ to ${1}$, the roots of ${f_t}$ define ${n}$ continuous curves in the complex plane ${C_1,\ldots,C_n}$, each beginning at a root of ${f_0}$ and ending at a root of ${f_1}$. By our assumption the curves must all lie on the real line. Observe that no curve can pass through a root of ${f_0}$ or ${f_1}$ at an intermediate time: if ${f_t(r)=0}$ for some ${t \in (0,1)}$ and ${f_0(r)=0}$, then ${f_t(r)=(1-t)f_0(r)+tf_1(r)=tf_1(r)}$ forces ${f_1(r)=0}$ as well, contradicting the no common roots assumption. Thus, each curve defines a closed interval containing exactly one root of ${f_0}$ and one root of ${f_1}$, and these intervals do not overlap except possibly at their endpoints, establishing the existence of a common interlacing. $\Box$

Characteristic Polynomials and Rank One Updates

A very natural and relevant example of interlacing polynomials comes from matrices. Recall that the characteristic polynomial of a matrix ${A}$ is given by

$\displaystyle \chi(A)(x):=\det(xI-A)$

and that its roots are the eigenvalues of ${A}$. The following classical fact tells us that rank one updates create interlacing.

Lemma 5 (Cauchy’s Interlacing Theorem) If ${A}$ is a symmetric matrix and ${v}$ is a vector then

$\displaystyle \chi(A)\longrightarrow \chi(A+vv^T).$

Proof: There are many ways to prove this, and it is a nice exercise. One way which I particularly like, and which will be relevant for the rest of this post, is to observe that

$\displaystyle \begin{array}{rcl} \det(xI-A-vv^T) &=& \det(xI-A)\det(I-vv^T(xI-A)^{-1})\qquad (*) \\&=&\det(xI-A)(1-v^T(xI-A)^{-1}v) \\&=&\chi(A)(x)\left(1-\sum_{i=1}^n\frac{\langle v,u_i\rangle^2}{x-\lambda_i}\right),\end{array}$

where ${u_i}$ and ${\lambda_i}$ are the eigenvectors and eigenvalues of ${A}$. Interlacing then follows by inspecting the poles and zeros of the rational function on the right hand side. $\Box$
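Cauchy's theorem is easy to observe numerically; in the sketch below (a random instance, using numpy's symmetric eigensolver) the two spectra alternate in exactly the pattern of Definition 2:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# A random symmetric matrix and a rank-one update of it.
M = rng.standard_normal((n, n))
A = (M + M.T) / 2
v = rng.standard_normal(n)

a = np.linalg.eigvalsh(A)                    # ascending eigenvalues
b = np.linalg.eigvalsh(A + np.outer(v, v))

# chi(A) --> chi(A + vv^T): in ascending order,
# a_1 <= b_1 <= a_2 <= b_2 <= ... <= a_n <= b_n,
# so the largest root belongs to the updated matrix.
print(all(a[i] <= b[i] + 1e-10 for i in range(n)))
print(all(b[i] <= a[i + 1] + 1e-10 for i in range(n - 1)))
```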

We are now in a position to do something nontrivial. Suppose ${A}$ is a symmetric ${n\times n}$ real matrix and ${v_1,\ldots,v_m}$ are some vectors in ${{\mathbb R}^n}$. Cauchy’s theorem tells us that the polynomials

$\displaystyle \chi(A+v_1v_1^T),\chi(A+v_2v_2^T),\ldots,\chi(A+v_mv_m^T)$

have a common interlacing, namely ${\chi(A)}$. Thus, Theorem 3 implies that for every ${k}$, there exists a ${j}$ so that the ${k}$th largest eigenvalue of ${A+v_jv_j^T}$ is at least the ${k}$th largest root of the average polynomial

$\displaystyle \mathop{\mathbb E}_j \chi(A+v_jv_j^T).$

We can compute this polynomial using the calculation ${(*)}$ as follows:

$\displaystyle \mathop{\mathbb E} \chi(A+v_jv_j^T) = \chi(A)\left(1-\sum_{i=1}^n\frac{ \mathop{\mathbb E} \langle v_j,u_i\rangle^2}{x-\lambda_i}\right).$

In general, this polynomial depends on the squared inner products ${\langle v_j,u_i\rangle^2}$. When ${\sum_{i=1}^m v_iv_i^T=I}$, however, we have ${\mathop{\mathbb E}_j \langle v_j,u_i\rangle^2=1/m}$ for all ${u_i}$, and:

$\displaystyle \mathop{\mathbb E}_j \chi(A+v_jv_j^T) = \chi(A)(x)\left(1-\sum_{i=1}^n\frac{ 1/m}{x-\lambda_i}\right)=\chi(A)(x)-(1/m)\frac{\partial}{\partial x}\chi(A)(x).\qquad (**)$

That is, adding a random rank one matrix in the isotropic case corresponds to subtracting off a multiple of the derivative from the characteristic polynomial. Note that there is no dependence on the vectors ${v_j}$ in this expression, and it has 'forgotten' all of the eigenvectors ${u_i}$. This is where the gain is: we have reduced a high-dimensional linear algebra problem (of finding a ${v_j}$ for which ${A+v_jv_j^T}$ has certain eigenvalues, which may be difficult when the matrices involved do not commute) to a univariate calculus / analysis problem (given a polynomial, figure out what subtracting the derivative does to its roots). Moreover, the latter problem is amenable to a completely different set of tools than the original eigenvalue problem.
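The identity ${(**)}$ can be verified numerically. The following sketch builds an isotropic set from a QR factorization (one convenient construction among many) and compares the two sides coefficient by coefficient:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 7

# Isotropic set: columns v_1..v_m of an n x m matrix B with B B^T = I_n.
Q, _ = np.linalg.qr(rng.standard_normal((m, n)))
B = Q.T

M = rng.standard_normal((n, n))
A = (M + M.T) / 2

# Left side of (**): coefficient-wise average of chi(A + v_j v_j^T).
avg = np.mean([np.poly(A + np.outer(B[:, j], B[:, j])) for j in range(m)],
              axis=0)

# Right side of (**): chi(A) - (1/m) * (d/dx) chi(A).
chiA = np.poly(A)
rhs = chiA - np.concatenate([[0.0], np.polyder(chiA)]) / m

print(np.allclose(avg, rhs))
```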

As a sanity check, if we apply the above deduction to ${A=0}$, we find that any isotropic set ${\sum_{i=1}^m v_iv_i^T=I}$ must contain a ${j}$ such that ${\lambda_1(v_jv_j^T)}$ is at least the largest root of

$\displaystyle \chi(0)(x)-(1/m)\chi(0)'(x)=x^n-(n/m)x^{n-1},$

which is just ${n/m}$. This makes sense since ${\lambda_1(v_jv_j^T)=\|v_j\|^2}$, and the average squared length of the vectors is indeed ${n/m}$ since ${\mathrm{trace}(\sum_i v_iv_i^T)=n}$.
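A quick numerical version of this sanity check (the isotropic set is again built from a QR factorization, a convenient but arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 3, 8

# Isotropic set: columns of B with B B^T = I_n.
Q, _ = np.linalg.qr(rng.standard_normal((m, n)))
B = Q.T

# lambda_1(v_j v_j^T) = ||v_j||^2, and the squared column norms average
# to exactly n/m because trace(B B^T) = n; so some column must have
# ||v_j||^2 >= n/m, matching the largest root of x^n - (n/m) x^{n-1}.
sq_norms = np.sum(B**2, axis=0)
print(sq_norms.max(), n / m)
```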

Differential Operators and Induction

The real power of the method comes from being able to apply it inductively to a sum of many independent random ${v_j}$'s, rather than to a single rank one update. In this case, establishing the necessary common interlacings requires a combination of Theorem 4 and Cauchy's theorem. A central role is played by the differential operators ${(I-(1/m)\frac{\partial}{\partial x})}$ seen above, which I will henceforth denote as ${(I-(1/m)D)}$. The proof relies on the following key properties of these operators:

Lemma 6 (Properties of Differential Operators)

1. If ${ \mathbf{X} }$ is a random vector with ${\mathop{\mathbb E} \mathbf{X} \mathbf{X} ^T = cI}$ then

$\displaystyle \mathop{\mathbb E} \chi(A+ \mathbf{X} \mathbf{X} ^T) = (I-cD)\chi(A).$

2. If ${f}$ has real roots then so does ${(I-cD)f}$.
3. If ${f_1,\ldots,f_m}$ have a common interlacing, then so do ${(I-cD)f_1,\ldots, (I-cD)f_m}$.

Proof: Part (1) was essentially shown in ${(**)}$. Part (2) follows by applying ${(*)}$ to the diagonal matrix ${A}$ with diagonal entries equal to the roots of ${f}$, and plugging in ${v=\sqrt{c}\cdot(1,1,\ldots,1)^T}$, so that ${f=\chi(A)}$ and ${(I-cD)f=\chi(A+vv^T)}$.

For part (3), Theorem 3 tells us that all convex combinations ${\sum_{i=1}^m\mu_if_i}$ have real roots. By part (2) it follows that all

$\displaystyle (I-cD)\sum_{i=1}^m\mu_if_i = \sum_{i=1}^m \mu_i (I-cD)f_i$

also have real roots. By Theorem 4, this means that the ${(I-cD)f_i}$ must have a common interlacing. $\Box$
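Part (2) can also be observed numerically; a small sketch:

```python
import numpy as np

rng = np.random.default_rng(3)

# Part (2) of Lemma 6: applying (I - cD) to a real-rooted polynomial
# produces another real-rooted polynomial.
f = np.poly(rng.standard_normal(6))    # monic, real-rooted, degree 6
c = 0.3
g = f - c * np.concatenate([[0.0], np.polyder(f)])

# The imaginary parts of the roots of (I - cD)f should all vanish.
print(np.max(np.abs(np.roots(g).imag)))
```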

We are now ready to perform the main induction which will give us the proof of Theorem 1.

Lemma 7 Let ${ \mathbf{X} }$ be uniformly chosen from ${\{v_i\}_{i\le m}}$ so that ${\mathop{\mathbb E} \mathbf{X} \mathbf{X} ^T=(1/m)I}$, and let ${ \mathbf{X} _1,\ldots, \mathbf{X} _k}$ be i.i.d. copies of ${ \mathbf{X} }$. Then there exists a choice of indices ${j_1,\ldots,j_k}$ satisfying

$\displaystyle \lambda_k \left(\chi(\sum_{i=1}^k v_{j_i}v_{j_i}^T)\right) \ge \lambda_k \left(\mathop{\mathbb E} \chi(\sum_{i=1}^k \mathbf{X} _i \mathbf{X} _i^T)\right).$

Proof: For any partial assignment ${j_1,\ldots,j_\ell}$ of the indices, consider the 'conditional expectation' polynomial:

$\displaystyle q_{j_1,\ldots,j_\ell}(x) := \mathop{\mathbb E}_{ \mathbf{X} _{\ell+1},\ldots, \mathbf{X} _k} \chi\left(\sum_{i=1}^\ell v_{j_i}v_{j_i}^T + \sum_{i=\ell+1}^k \mathbf{X} _i \mathbf{X} _i^T\right).$

We will show that there exists a ${j_{\ell+1}\in [m]}$ such that

$\displaystyle \lambda_k(q_{j_1,\ldots,j_{\ell+1}})\ge \lambda_k(q_{j_1,\ldots,j_\ell}),\ \ \ \ \ (1)$

which by induction will complete the proof. Consider the matrix

$\displaystyle A = \sum_{i=1}^\ell v_{j_i}v_{j_i}^T.$

By Cauchy’s interlacing theorem ${\chi(A)}$ interlaces ${\chi(A+v_{j_{\ell+1}}v_{j_{\ell+1}}^T)}$ for every ${j_{\ell+1}\in [m]}$. Lemma 6 tells us that the ${(I-(1/m)D)}$ operators preserve common interlacing, so the polynomials

$\displaystyle (I-(1/m)D)^{k-(\ell+1)}\chi(A+v_{j_{\ell+1}}v_{j_{\ell+1}}^T) = q_{j_1,\ldots,j_\ell,j_{\ell+1}}(x)$

(by applying Lemma 6 ${k-(\ell+1)}$ times) must also have a common interlacing. Thus, some ${j_{\ell+1}\in [m]}$ must satisfy (1), as desired. $\Box$

Bounding the Roots: Laguerre Polynomials

To finish the proof of Theorem 1, it suffices by Lemma 7 to prove a lower bound on the ${k}$th largest root of the expected polynomial ${\mathop{\mathbb E} \chi\left(\sum_{i=1}^k \mathbf{X} _i \mathbf{X} _i^T\right)}$. By applying Lemma 6 ${k}$ times to ${\chi(0)=x^n}$, we find that

$\displaystyle \mathop{\mathbb E} \chi\left(m\cdot \sum_{i=1}^k \mathbf{X} _i \mathbf{X} _i^T\right) = (1-D)^kx^n =: p_k(x).\ \ \ \ \ (2)$

This looks like a nice polynomial, and we are free to use any method we like to bound its roots.

The easiest way is to observe that

$\displaystyle p_k(x)=x^{n-k}L_k^{(n-k)}(x),$

where ${L_k^{(n-k)}(x)}$ is a degree ${k}$ associated Laguerre polynomial. These are a classical family of orthogonal polynomials and a lot is known about the locations of their roots; in particular, there is the following estimate due to Krasikov.

Lemma 8 (Roots of Laguerre Polynomials) The roots of the associated Laguerre polynomial

$\displaystyle L_k^{(n-k)}(x):= (1-D)^{n}x^k\ \ \ \ \ (3)$

are contained in the interval ${[n(1-\sqrt{k/n})^2,n(1+\sqrt{k/n})^2].}$

It follows by Lemma 8 that ${\lambda_k(p_k)\ge \lambda_k(L_k^{(n-k)})\ge n(1-\sqrt{k/n})^2}$, which immediately yields Theorem 1 by Lemma 7 and (2).
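Both the factorization of ${p_k}$ and the root bound of Lemma 8 can be checked numerically for particular ${n}$ and ${k}$ (a sketch; numpy's root-finder stands in for exact arithmetic):

```python
import numpy as np

n, k = 20, 5

# p_k(x) = (1 - D)^k x^n, computed coefficient-wise as in (2).
p = np.zeros(n + 1)
p[0] = 1.0                                   # coefficients of x^n
for _ in range(k):
    p = p - np.concatenate([[0.0], np.polyder(p)])

# p_k = x^{n-k} * (degree-k Laguerre factor); stripping the trailing
# zero coefficients leaves exactly that factor.
q = np.trim_zeros(p, 'b')
lag = np.sort(np.roots(q).real)

lower = n * (1 - np.sqrt(k / n)) ** 2        # = 5 here
upper = n * (1 + np.sqrt(k / n)) ** 2        # = 45 here
print(lag[0], lower, lag[-1], upper)
```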

If you think that appealing to Laguerre polynomials was magical, it is also possible to prove the root bound of Lemma 8 from scratch in less than a page using the 'barrier function' argument from this paper, which is also intimately related to the formulas ${(*)}$ and ${(**)}$.

Conclusion

The argument given here is a special case of a more general principle: that expected characteristic polynomials of certain random matrices can be expressed in terms of differential operators, which can then be used to establish the existence of the necessary common interlacings as well as to analyze the roots of the expected polynomials themselves. In the isotropic case of Bourgain-Tzafriri presented here, all of these objects can be chosen to be univariate polynomials. Morally, this is because the covariance matrices of all of the random vectors involved are multiples of the identity (which trivially commute with each other), and all of the characteristic polynomials involved are simple univariate linear transformations of each other (such as ${(I-cD)}$). The above argument can also be made to work in the non-isotropic case, yielding improvements over previously known bounds. This is the subject of a paper in preparation, Interlacing Families III, with Adam Marcus and Dan Spielman.

On the other hand, the proofs of Kadison-Singer and existence of Ramanujan graphs involve analyzing sums of independent rank one matrices which come from non-identically distributed distributions whose covariance matrices do not commute. At a high level, this is what creates the need to consider multivariate characteristic polynomials and differential operators, which are then analyzed using techniques from the theory of real stable polynomials.

Acknowledgments

Everything in this post is joint work with Adam Marcus and Dan Spielman. Thanks to Raghu Meka for helpful comments in the preparation of this post.

And is linked from the Call for Papers. (The url for the server is http://focs2014.cs.princeton.edu , though please do read the call for papers, and the advice linked below, before submitting.)

We are continuing the FOCS 2013 experiment of less regulation and more responsibility on the paper formatting, so please do read my advice for authors before submitting.

The deadline is 4:30pm Eastern Time on Wednesday, April 2nd, but I’d suggest you try to submit the day before to allow extra time for fixing all those typos and implementing all those suggestions that usually come to you just as you press the “submit” button. (Not to mention to allow time for computer crashes, network outages, latex bugs, bathroom breaks, and all those other pesky issues that tend to pop up 5 minutes before the deadline.) While you’re at it, I recommend you post the paper also on the arXiv, ECCC, or ePrint, so other people can benefit from your work.

Now go and prove that missing lemma – I’m looking forward to reading your submissions!

STOC 2014: Call for Workshop and Tutorial Proposals

(also available on the conference website)

Workshop and Tutorial Day: Saturday, May 31, 2014

Workshop and Tutorial Co-Chairs: Kunal Talwar and Chris Umans

On Saturday, May 31, immediately preceding the main conference, STOC 2014 will hold a workshop-and-tutorials day. We invite groups of interested researchers to submit workshop or tutorial proposals. The goal of a workshop is to provide an informal forum for researchers to discuss important research questions, directions, and challenges. Connections between theoretical computer science and other areas, topics that are not well represented at STOC, and open problems are encouraged as workshop topics. Organizers are completely free to choose their workshop formats (invited speakers, panel discussions, etc.). The program for May 31 may also involve tutorials, each consisting of 1-2 survey talks on a particular area, and we welcome tutorial proposals as well.

STOC does not have funds to pay travel expenses or honoraria to invited workshop and tutorial speakers, but we do anticipate funds for breaks with snacks and drinks. Workshop and tutorials attendance will be free for all STOC registrants, and will also be available separately for a reduced registration fee, for those not attending STOC itself.

Proposal submission: Workshop and tutorial proposals should be no longer than 2 pages. Please include a list of names and email addresses of the organizers, a description of the topic and the goals of the workshop or tutorial, the proposed workshop format (invited talks, contributed talks, panel, etc.), and proposed or tentatively confirmed speakers if known. Please also indicate the preferred length of time for your workshop or tutorial, along with the minimum acceptable time. We anticipate a 4-5 hour block for each workshop and a 1-3 hour block for each tutorial. Please feel free to contact the Workshop and Tutorial Co-Chairs at the email addresses below if you have questions about workshop or tutorial proposals.

Submission deadline: Proposals should be submitted by March 28, 2014, via email to kunal@microsoft.com and umans@cs.caltech.edu. Proposers will be notified by April 10, 2014, about whether their proposals have been accepted.

In the previous posts we talked about various discrepancy questions and saw a proof of the six standard deviations suffice result. Besides being of interest in combinatorics, discrepancy theory has several remarkable applications in algorithms. Check this excellent book for a taste of these results. Here I will briefly discuss two (one old and one new) such results:

• Rounding linear programs by discrepancy theory: specifically, the beautiful argument of Lovasz, Spencer and Vesztergombi (LSV) on bounding linear discrepancy in terms of hereditary discrepancy. LSV is also an excellent place to look if you are running short on beautiful, innocent-in-appearance, time-sinking combinatorial conjectures.

Unfortunately, to keep the post short(-er), I will not discuss the recent breakthrough result of Rothvoss on bin-packing, which uses discrepancy theory (in a much more sophisticated way) for rounding a well-studied linear program (the “Gilmore-Gomory” LP). I highly recommend reading the paper.

• A recent algorithmic proof of Spencer/Gluskin theorem due to Shachar Lovett and myself.

Rounding Linear Programs One of the most basic techniques for combinatorial optimization is linear programming relaxation. Let us phrase this in a language suitable for the present context (which is in fact fairly generic). We have a constraint matrix ${A \in \mathbb{R}^{m \times n}}$, a target vector ${b \in \mathbb{R}^m}$, and our goal is to find an ${x \in \{0,1\}^n}$ so as to minimize ${\|Ax - b\|_\infty}$. One typical approach is to relax this discrete problem and instead solve the linear program

$\displaystyle \min_{y \in [0,1]^n} \|Ay - b\|_\infty \ \ \ \ \ (1)$

(which can be done efficiently). The next step, and often the most challenging one, is to round the fractional solution ${y \in [0,1]^n}$ to an integer solution in ${\{0,1\}^n}$ with little “loss”. How well can we do this for a constraint matrix ${A}$, and general vectors ${b}$? This is captured by the notion of linear discrepancy introduced by LSV:

$\displaystyle lindisc(A) = \max_{y \in [0,1]^n} \min_{x \in \{0,1\}^n} \|Ax - Ay\|_\infty. \ \ \ \ \ (2)$

LSV introduced the above notion originally as a generalization of discrepancy which can be formulated in our current context as follows:

$\displaystyle disc(A) = \min_{\varepsilon \in \{1,-1\}^n} \|A\varepsilon\|_\infty. \ \ \ \ \ (3)$

This corresponds exactly to the notion of discrepancy we studied in the last two posts. On the other hand, we can also write ${disc(A) = 2 \min_{x \in \{0,1\}^n} \| Ax - A e/2\|_\infty}$, where ${e}$ denotes the all-ones vector. Thus, ${disc(A)}$ corresponds to the special case of linear discrepancy where we only need to round one particular fractional solution, namely ${e/2}$.

The remarkable result of LSV is that while the above definition seems to be much weaker than ${lindisc}$ (which has to round all fractional solutions) there is a natural extension, hereditary discrepancy, of ${disc}$ which is nearly as strong. For a matrix ${A \in \mathbb{R}^{m \times n}}$ and ${S \subseteq [n]}$, let ${A_S}$ denote the ${\mathbb{R}^{m \times |S|}}$ sub-matrix corresponding to the columns of ${A}$ indexed by ${S}$. Then, define

$\displaystyle herdisc(A) = \max_S disc(A_S). \ \ \ \ \ (4)$

Hereditary discrepancy is a natural and more “robust” version of discrepancy. As LSV phrase it, discrepancy can be small by accident, whereas hereditary discrepancy seems to carry more structural information. For example, let ${A}$ be a random ${1,-1}$ matrix with the constraint that each row has sum ${0}$. Then, ${disc(A) = 0}$, but ${herdisc(A) = \Omega(\sqrt{n})}$ (this needs proving, but is not too hard) – which makes intuitive sense as we expect random matrices to have little structure.
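For tiny matrices both quantities can be computed by brute force; the sketch below (helper names are my own) reproduces the "small by accident" phenomenon on a ${2\times 4}$ example:

```python
import numpy as np
from itertools import combinations, product

def disc(A):
    """Brute-force disc(A): min over sign vectors eps of ||A eps||_inf."""
    n = A.shape[1]
    return min(np.abs(A @ np.array(eps)).max()
               for eps in product([1, -1], repeat=n))

def herdisc(A):
    """Brute-force herdisc(A): max of disc over all column submatrices."""
    n = A.shape[1]
    return max(disc(A[:, list(S)])
               for r in range(1, n + 1)
               for S in combinations(range(n), r))

# A +-1 matrix with zero row sums: disc is 0 "by accident" (take eps to
# be the all-ones vector), but the hereditary discrepancy is not.
A = np.array([[1, 1, -1, -1],
              [1, -1, 1, -1]])
print(disc(A), herdisc(A))    # -> 0 2
```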

It is also worth noting that several notable results in discrepancy theory which bound the discrepancy in fact also bound hereditary discrepancy. For example, Spencer’s original proof as well as Gluskin’s argument from last post (with a little bit more work) in fact show the following:

Theorem 1 For all matrices ${A \in [-1,1]^{n \times n}}$, ${herdisc(A) = O(\sqrt{n})}$.

LSV show the following connection between linear and hereditary discrepancies:

Theorem 2 For any matrix ${A}$, ${ lindisc(A) \leq herdisc(A)}$.

In other words, any fractional solution for a linear program of the form Equation 1 can be rounded to an integer solution with an additive error of at most ${herdisc(A)}$.

Let me now describe the cute proof, which can be explained in a few lines. Suppose we have a fractional solution ${y \in [0,1]^n}$. Our goal is to find an ${x \in \{0,1\}^n}$ such that ${\|Ay- Ax\|_\infty}$ is small. We will construct such an ${x}$ by iteratively making ${y}$ more integral. Write out the binary expansion of each coordinate of ${y}$: for each ${i \in [n]}$, ${y_i = 0.y_{i1} y_{i2} \cdots}$. To avoid unnecessary technical issues, let us suppose that each coordinate has a finite expansion of length ${t}$ (we write ${t}$ rather than ${m}$, which is already the number of rows of ${A}$).

We will build a sequence of solutions ${y^0 = y, y^1,\ldots,y^t = x}$ such that the coordinates of ${y^i}$, written in binary, have expansions of length at most ${t-i}$. Let us look at how to get ${y^1}$; the rest are obtained similarly.

Let ${S = \{i \in [n] : y_{it} = 1\}}$. By the definition of ${herdisc}$, there exists a sign vector ${\varepsilon_S \in \{1,-1\}^S}$ such that ${\|A_S \varepsilon_S \|_\infty \leq herdisc(A)}$. Let ${y^1 = y + \varepsilon_S/2^t}$ (interpreting ${\varepsilon_S}$ as a vector in ${\{1,-1,0\}^n}$ in the natural way). Clearly, the binary expansions of the coordinates of ${y^1}$ have length at most ${t-1}$. Further, ${\|Ay - Ay^1\|_\infty = \|A_S \varepsilon_S\|_\infty/2^t \leq herdisc(A)/2^t}$. Iterating this argument, we get ${y^1,\ldots,y^t = x}$ such that ${\|Ay^i - A y^{i+1}\|_\infty \leq herdisc(A)/2^{t-i}}$. Therefore,

$\displaystyle \|A y - Ax\|_\infty \leq \sum_{i=0}^{t-1} \frac{herdisc(A)}{2^{t-i}} \leq herdisc(A).$
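The iteration can be sketched in code. In the toy implementation below (function names are my own, and the discrepancy "oracle" is a brute-force search over sign vectors, so only tiny matrices are feasible), each pass clears the least significant remaining binary digit:

```python
import numpy as np
from itertools import product

def best_signs(A_S):
    """Sign vector minimizing ||A_S eps||_inf (brute force)."""
    k = A_S.shape[1]
    return min((np.array(e) for e in product([1, -1], repeat=k)),
               key=lambda e: np.abs(A_S @ e).max())

def lsv_round(A, y, bits):
    """Round y in [0,1]^n (coordinates are multiples of 2^-bits) to
    x in {0,1}^n by clearing one binary digit at a time."""
    y = y.copy()
    for b in range(bits, 0, -1):
        digits = np.round(y * 2**b).astype(int)   # y * 2^b is integral here
        S = np.where(digits % 2 == 1)[0]          # coords whose b-th digit is 1
        if len(S) == 0:
            continue
        eps = best_signs(A[:, S])
        y[S] += eps / 2**b                        # clears digit b either way
    return y

A = np.array([[1., 1., -1., -1.],
              [1., -1., 1., -1.]])
y = np.array([0.25, 0.5, 0.75, 0.375])            # 3-bit fractions
x = lsv_round(A, y, bits=3)
err = np.abs(A @ y - A @ x).max()
print(x, err)                                     # err is at most herdisc(A)
```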

Constructive Discrepancy Minimization by Walking on the Edges We just saw a way to round linear programs as in Equation 1 with error bounded by their discrepancy. As appealing as this is, it comes with one important caveat. The original motivation for looking at LP relaxations was that we can solve them efficiently. For this to make sense, we need the rounding procedure to be efficient as well. In our present case, to make the rounding efficient, we need to find a small discrepancy solution efficiently (find ${\varepsilon_S}$ given ${A_S}$ as above). Unfortunately, this in general is NP-hard in a very strong way, as was shown recently by Charikar, Newman and Nikolov.

However, what about specific bounds like in Theorem 1? Spencer’s original proof as well as Gluskin’s proof do not give an efficient algorithm for finding a good coloring. This is fundamentally inherent in the two arguments: they rely (directly, or indirectly via Minkowski’s theorem) on the pigeon-hole principle (with exponentially many “holes”), which is quite non-algorithmic. In fact, Alon and Spencer conjectured (in 'the' book) several years ago that finding a coloring with discrepancy ${O(\sqrt{n})}$ as in the theorem is computationally hard.

Fortunately for us, like all good conjectures, this was proven to be false by a breakthrough result of Nikhil Bansal. Nikhil’s argument studies a carefully constructed semi-definite programming relaxation of the problem and then gives a new and amazing rounding algorithm for the SDP.

Here I will briefly discuss a different proof of Theorem 1 due to Lovett and myself which will also lead to an efficient algorithm for finding a coloring as required.

Let us first revisit Beck’s partial-coloring approach described in last post which says that to prove the theorem, it suffices to show the following.

Lemma 3 For vectors ${a^1,\ldots,a^n \in [-1,1]^n}$, there exists ${\varepsilon \in \{1,0,-1\}^n}$ such that for every ${j \in [n]}$, ${|\langle a^j,\varepsilon\rangle| = O(\sqrt{n})}$ and ${|Support(\varepsilon)| = \Omega(n)}$.

As in the last post, let us also rephrase the problem in a geometric language. Let ${\mathcal{K} \subseteq \mathbb{R}^n}$ be the symmetric convex set (symmetric meaning that ${x \in \mathcal{K}}$ implies ${-x \in \mathcal{K}}$) defined as follows, for ${\Delta = O(\sqrt{n})}$ to be chosen later:

$\displaystyle \mathcal{K} = \{\,x : |\langle a^j,x\rangle| \leq \Delta \ \forall j \in [n] \text{ (the \emph{discrepancy} constraints)},\ |x_i| \leq 1 \ \forall i \in [n] \text{ (the \emph{coloring} constraints)}\,\}.$

The partial coloring lemma is equivalent to showing that ${\mathcal{K}}$ contains a ${\{1,0,-1\}^n}$ lattice point of large support. As it turns out, we don’t really need to find a lattice point in ${\mathcal{K}}$ but any point with many ${\{1,-1\}}$ (or close to ${\{1,-1\}}$) coordinates will serve us equally well. Concretely, we want:

Lemma 4 For ${\mathcal{K}}$ as above, there exists ${x \in \mathcal{K}}$ such that ${|\{i: |x_i| = 1\}| = \Omega(n)}$.

The above is equivalent to finding a vertex of ${\mathcal{K}}$ which is tight on ${\Omega(n)}$ coloring constraints. For intuition let us use the distance from the origin as a proxy for how many coordinates are close to ${1}$ in absolute value. Thus, roughly speaking our goal is to find a vertex of ${\mathcal{K}}$ as far away from the origin as possible. This is the interpretation we will use.

Let us think of trying to find the vertex ${x}$ iteratively. Our starting point would be the all-zeros vector. Now what? Well, we want to get away from the origin but still stay inside the polytope ${\mathcal{K}}$. Being somewhat greedy and lazy, we will just toss some coins and update our ${x}$ by doing Brownian motion (if this is uncomfortable, think of taking a discrete walk with tiny random Gaussian steps). The point is that the random walk will steadily move away from the origin.

Now, in the course of performing this random walk at some time we will touch the boundary of ${\mathcal{K}}$ or in other words hit some constraint(s) of ${\mathcal{K}}$. Now what? Well, as before we still want to get away from the origin but do not want to cross the polytope. So as before, being greedy and lazy (tough habits to change), we will continue doing Brownian motion but now constrain ourselves to only take steps which lie in the face of the polytope that we hit. So we move from doing Brownian motion in ${n}$-dimensions to doing one in a subspace (corresponding to the face that we hit) of dimension at most ${n-1}$. We now repeat this process: every time we hit a new constraint we restrict ourselves to only move in the face we hit. Repeating the above step, we should eventually reach a vertex of the polytope ${\mathcal{K}}$.

This is pretty much the actual algorithm which we call the Edge-Walk algorithm (except that to make it implementable we take tiny Gaussian random steps instead of doing Brownian motion; and this makes the analysis easier too). I will refer to the paper for the full details of the algorithm and its analysis, but let me just say what the punchline is: you are more likely to hit nearer constraints than farther ones.
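Here is a toy sketch of the walk just described (the function, step size, tolerances, and termination rule are my own simplifications for illustration, not the actual Edge-Walk algorithm or its analysis): discrete Gaussian steps, projected so as to stay inside every face already hit, and truncated so the walk never leaves ${\mathcal{K}}$.

```python
import numpy as np

def edge_walk(A, delta, step=0.05, max_iters=50000, tol=1e-9, seed=0):
    """Toy sketch of the Edge-Walk idea (my simplification, not the paper's
    exact algorithm): take tiny Gaussian steps inside
    K = {x : |<a_j, x>| <= delta for all j, |x_i| <= 1 for all i},
    and once a face is hit, keep all further motion inside that face."""
    m, n = A.shape
    rng = np.random.default_rng(seed)
    normals = np.vstack([A, np.eye(n)])               # slab + cube constraint normals
    bounds = np.concatenate([np.full(m, float(delta)), np.ones(n)])
    x = np.zeros(n)
    tight = []                                        # indices of faces already hit
    for _ in range(max_iters):
        g = rng.normal(size=n)
        if tight:                                     # project the step onto the face:
            Q, _ = np.linalg.qr(normals[tight].T)     # remove components along the
            g = g - Q @ (Q.T @ g)                     # normals of tight constraints
        if np.linalg.norm(g) < 1e-8:
            break                                     # no free direction left: a vertex
        g *= step / np.linalg.norm(g)
        t = 1.0                                       # largest fraction of the step
        for row, b in zip(normals, bounds):           # that keeps x inside K
            s, v = row @ g, row @ x
            if s > tol:
                t = min(t, (b - v) / s)
            elif s < -tol:
                t = min(t, (-b - v) / s)
        x = x + max(t, 0.0) * g
        slack = bounds - np.abs(normals @ x)
        tight = list(np.nonzero(slack <= 1e-7)[0])    # faces we are (numerically) on
    return x, tight
```

Note the design choice that makes termination automatic: the projected steps never move off a face already hit, so the set of tight constraints only grows and the free subspace shrinks until the walk is pinned at (or near) a vertex.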

Before moving on, note that the Edge-Walk algorithm can be defined for any polytope ${\mathcal{K}}$. This leads to the meta-question: Can we say anything interesting about the distribution on the vertices of a polytope induced by the walk? The analysis from our paper implicitly gives one such property which has to do with the distances of the constraints defining the vertex from the origin. Understanding this distribution better might be useful elsewhere.

Discussion This concludes the three-post series. Perhaps sometime in the future there will be other posts on other notable results in discrepancy theory (like the Beck-Fiala theorem). Keeping with the trend from the last two posts, let me end with another open question, which strengthens Theorem 1 considerably:

Komlos Conjecture Given unit vectors ${v_1,\ldots,v_n \in \mathbb{R}^d}$, there exists ${\varepsilon \in \{1,-1\}^n}$ such that

$\displaystyle \left\|\varepsilon_1 v_1 + \varepsilon_2 v_2 + \cdots + \varepsilon_n v_n\right\|_\infty = O(1).$

Theorem 1 follows from the above if we take the ${v_i}$’s to be the normalized columns of the matrix ${A}$. The above conjecture also strengthens another beautiful conjecture due to Beck and Fiala; but that’s for another day. The most remarkable aspect of the above conjecture is that the bound has no dependence on the dimension. The best bound we know is due to Banaszczyk, who showed a bound of ${O(\sqrt{\log d})}$, which we’ll also leave for another day.
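To spell out the reduction (the notation here is mine): let ${A}$ be the ${\pm 1}$ matrix whose rows are the vectors ${a^j}$ of Theorem 1, and let ${u_1,\ldots,u_n}$ be its columns, so that ${\|u_i\| = \sqrt{n}}$. Applying the conjecture to the unit vectors ${v_i = u_i/\sqrt{n}}$ gives signs ${\epsilon}$ with

$\displaystyle \max_j |\langle a^j,\epsilon\rangle| = \|A\epsilon\|_\infty = \sqrt{n}\left\|\sum_{i=1}^n \epsilon_i \frac{u_i}{\sqrt{n}}\right\|_\infty = O(\sqrt{n}).$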

In the last post we discussed some questions about discrepancy and the ‘Six Standard Deviations Suffice’ theorem stated below (without the ${6}$, which is not too important, but makes for a great title):

Theorem 1 For vectors ${a^1,\ldots,a^n \in \{1,-1\}^n}$, there exists ${\epsilon \in \{1,-1\}^n}$ such that for every ${j \in [n]}$, ${|\langle a^j,\epsilon\rangle| = O(\sqrt{n})}$.
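For calibration, note that a uniformly random coloring already achieves ${O(\sqrt{n \log n})}$: each ${\langle a^j,\epsilon\rangle}$ is a sum of ${n}$ random signs, so a Chernoff bound plus a union bound over the ${n}$ constraints applies. The content of the theorem is removing the ${\sqrt{\log n}}$ factor. A quick simulation of this baseline (the instance and parameters are arbitrary choices for illustration):

```python
import numpy as np
from math import log, sqrt

rng = np.random.default_rng(0)
n = 1000
A = rng.choice([-1, 1], size=(n, n))   # rows play the role of the vectors a^j
eps = rng.choice([-1, 1], size=n)      # a uniformly random coloring
disc = np.max(np.abs(A @ eps))         # max_j |<a^j, eps>|
# Chernoff + union bound: with high probability disc = O(sqrt(n log n))
assert disc <= 3 * sqrt(n * log(n))
```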

In this post (the second and most technical of the three-post series) we will see a proof of the above result. The theorem was proved by Spencer in 1985 using Beck’s partial coloring approach. Independently of Spencer, the result was proved by Gluskin in 1988 in a convex-geometric context. Here we will review Gluskin’s proof, which is quite beautiful.

Gluskin’s proof will also give us an excuse to look at some elegant (and simple to describe) results in convex geometry which may be of use elsewhere. Finally, the geometric view here will actually be useful in the next post when we discuss an algorithmic proof. Gluskin’s paper is truly remarkable and seems to reinvent several key ideas from scratch, such as Sidak’s lemma and a version of Kanter’s lemma for Gaussians, and even has the partial coloring approach implicit in it. I recommend taking a look at the paper even if it is a bit of a tough read. Much of the content in the post is based on discussions with Shachar Lovett and Oded Regev, but all mistakes are mine.

Gluskin’s proof follows the partial coloring approach with the crucial lemma proved using a volume argument. The partial coloring method was introduced by Beck in 1981, and all known proofs of Theorem 1, as well as many other important discrepancy results, use this method. Here, instead of looking for an ${\epsilon \in \{1,-1\}^n}$ solution as in the theorem, one first looks for a ${\{1,0,-1\}^n}$ solution. On its own this is meaningless, as we can just output the all-zeros vector. The main idea is to instead look for a ${\{1,0,-1\}^n}$ solution with support of size ${\Omega(n)}$. We then recurse on the set of coordinates which are set to ${0}$. If everything goes according to plan, since the ambient dimension decreases geometrically, we get geometrically decreasing discrepancy bounds, which we can tolerate. I won’t go into the details here, but let’s accept that it suffices to show the following:

Lemma 2 For vectors ${a^1,\ldots,a^n \in \{1,-1\}^n}$, there exists ${\epsilon \in \{1,0,-1\}^n}$ such that for every ${j \in [n]}$, ${|\langle a^j,\epsilon\rangle| = O(\sqrt{n})}$ and ${|\mathrm{Support}(\epsilon)| = \Omega(n)}$.
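To see why such a partial coloring lemma suffices, here is the accounting behind the recursion, in a form that suppresses the dependence of the bound on the number of constraints (which the actual argument has to track). If round ${t}$ of the recursion runs in dimension ${n_t \leq (1-c)^t n}$ and incurs discrepancy ${C\sqrt{n_t}}$, then the total discrepancy is at most

$\displaystyle \sum_{t \geq 0} C\sqrt{(1-c)^t\, n} = C\sqrt{n}\sum_{t \geq 0}\left(\sqrt{1-c}\right)^t = O(\sqrt{n}),$

a geometric series dominated by its first term.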

To prove the above partial coloring lemma, let us first rephrase the problem in geometric language. Let ${\mathcal{K} \subseteq R^n}$ be the symmetric convex set (symmetric meaning ${x \in \mathcal{K}}$ implies ${-x \in \mathcal{K}}$) defined as follows, for ${\Delta = O(\sqrt{n})}$ to be chosen later:

$\displaystyle \mathcal{K} = \{x: |\langle a^j,x\rangle| \leq \Delta, \forall j \in [n]\}.$

We want to show that ${\mathcal{K}}$ contains a ${\{1,0,-1\}^n}$ lattice point of large support. We show this indirectly by proving that ${\mathcal{K}}$ instead contains a lot of points from ${\{1,0,-1\}^n}$. Gluskin does this by a clever volume argument: first show that the volume of ${\mathcal{K} \cap [-1,1]^n}$ is large and then apply Minkowski’s theorem to show that there are many lattice points. To lower bound the first volume, Gluskin actually works in the Gaussian space.

I don’t have a clear intuitive reason for why the Gaussian distribution is better than the Lebesgue measure in this context. But if one looks at the analysis, a clear advantage is that projections behave better (when considering volumes) in Gaussian space. For example, if we take a set like ${S = \{x \in R^n\,:\,|x_1| \leq 1\}}$, then the Lebesgue volume of ${S}$ is infinite, but the volume of its projection onto the first coordinate is finite. In the Gaussian case, both volumes are the same.

We next go over all the main pieces in Gluskin’s proof.

Sidak’s Lemma  Suppose we have a standard Gaussian vector ${g \sim \mathcal{N}(0,1)^n}$. Then, for any unit vector ${v \in R^n}$, ${\langle v,g\rangle}$ has the standard normal distribution. Now, suppose we have several unit vectors ${v_1,\ldots,v_m \in R^n}$. Then, the random variables ${X_i = \langle v_i,g\rangle}$ are individually standard normals, but are correlated with one another. Sidak’s lemma (1967) says that no matter what the correlations of the ${X_i}$’s are, to bound the probability that none of the ${X_i}$’s is too large, the “worst behaviour” one could expect is for them to be independent. Concretely:

Lemma 3 (Sidak’s Lemma) Let ${v_1,\ldots,v_m \in R^n}$ and let ${g \sim \mathcal{N}(0,1)^n}$ be a standard Gaussian vector. Then, for all ${t_1,\ldots,t_m \in R_+}$,

$\displaystyle Pr\left[|\langle v_1,g\rangle| \leq t_1 \;\wedge\; |\langle v_2,g\rangle| \leq t_2\;\wedge\; \cdots \;\wedge\; |\langle v_m,g\rangle| \leq t_m\right] \geq$

$\displaystyle \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\; Pr\left[|\langle v_1,g\rangle |\leq t_1\right] \cdot Pr\left[|\langle v_2,g\rangle|\leq t_2\right] \cdots Pr\left[|\langle v_m,g\rangle|\leq t_m\right].$

The proof of the lemma is actually not too hard and an excellent exposition can be found in this paper.
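Since each ${\langle v_i,g\rangle}$ is exactly standard normal, the product on the right-hand side can be computed in closed form, which makes the lemma easy to sanity-check numerically. Here is a quick Monte Carlo check on strongly correlated slabs (the vectors, thresholds, and sample size are arbitrary choices for illustration):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)
n, m, t = 6, 5, 1.0
# Strongly correlated unit vectors: a common direction plus small noise.
base = rng.normal(size=n)
V = base + 0.4 * rng.normal(size=(m, n))
V /= np.linalg.norm(V, axis=1, keepdims=True)

G = rng.normal(size=(200_000, n))                  # samples of g ~ N(0,1)^n
joint = np.mean(np.all(np.abs(G @ V.T) <= t, axis=1))
product = erf(t / sqrt(2)) ** m                    # each <v_i, g> is standard normal
assert joint >= product                            # Sidak's inequality
```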

The lemma is actually a very special case of a longstanding open problem called the Correlation Conjecture. Let me digress a little bit to state this beautiful question. In the above setup, let ${C_i = \{x \in R^n: |\langle v_i,x\rangle| \leq t_i\}}$ be the corresponding slab. Then, Sidak’s lemma says that for ${g \sim \mathcal{N}(0,1)^n}$,

$\displaystyle Pr[g \in C_1 \cap C_2 \cap \cdots \cap C_m] \geq Pr[g \in C_1] \cdot Pr[g \in C_2] \cdots Pr[g \in C_m].$

The correlation conjecture asserts that this inequality is in fact true for all symmetric convex sets (in fact, we only need to look at ${m=2}$). Sidak’s lemma says the conjecture is true for slabs. It is also known to be true for ellipsoids. The statement for ellipsoids also has a discrepancy implication leading to a vector generalization of Spencer’s theorem (pointed to me by Krzysztof Oleszkiewicz). But that’s for another day.

Kanter’s Lemma The second inequality we need is a comparison inequality due to Kanter. The lemma essentially lets us lift certain relations between two distributions ${p,q}$ to their product distributions ${p^n, q^n}$, and I think it should be useful in other contexts. For instance, I recently used it in this paper in a completely different context. To state the lemma, we need the notion of peakedness of distributions.

Let ${p,q}$ be two symmetric distributions on ${R^m}$ for some ${m}$. We say ${p}$ is less peaked than ${q}$ (written ${p \preceq q}$) if for all symmetric convex sets ${\mathcal{K}}$, ${p(\mathcal{K}) \leq q(\mathcal{K})}$. Intuitively, this means that ${p}$ is putting less of its mass near the origin than ${q}$ (hence the term less peaked). For example, ${\mathcal{N}(0,2) \preceq \mathcal{N}(0,1)}$.
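The example ${\mathcal{N}(0,2) \preceq \mathcal{N}(0,1)}$ (reading the second parameter as the variance) can be checked directly in one dimension, where the symmetric convex sets are exactly the intervals ${[-t,t]}$, so peakedness reduces to comparing two error functions:

```python
from math import erf, sqrt

def mass(sigma, t):
    """P(|X| <= t) for X ~ N(0, sigma^2)."""
    return erf(t / (sigma * sqrt(2)))

# N(0,2), i.e. variance 2, puts less mass on every interval [-t, t]
# than N(0,1) does, so it is less peaked:
for t in [0.1, 0.5, 1.0, 2.0, 5.0]:
    assert mass(sqrt(2), t) <= mass(1.0, t)
```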

Kanter’s lemma says that the peakedness relation tensorises provided we have unimodality. A univariate distribution is unimodal if the corresponding probability density function has a single maximum and no other local maxima. I won’t define what it means for a multivariate distribution to be unimodal here, but we only need the lemma for univariate distributions. See this survey for the formal definition.

Lemma 4 (Kanter’s lemma) Let ${p,q}$ be two symmetric distributions on ${R^n}$ such that ${p \preceq q}$ and let ${\mu}$ be a unimodal distribution on ${R^m}$. Then, the product distributions ${p \times \mu}$, ${q \times \mu}$ on ${R^{n+m}}$ satisfy ${p \times \mu \preceq q \times \mu}$.

The proof of the lemma is not too hard, but is non-trivial in that it uses the Brunn-Minkowski inequality. Combining the above lemma with the not-too-hard fact that the standard Gaussian distribution is less peaked than the uniform distribution on ${[-1,1]}$, we get:

Corollary 5 Let ${\mu}$ be the uniform distribution on ${[-1,1]}$. Then, ${\mathcal{N}(0,1)^n \preceq \mu^n}$.
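The base fact feeding into Corollary 5 can also be checked directly in one dimension, where peakedness is just a comparison of interval masses (a quick numeric check using the exact Gaussian CDF):

```python
from math import erf, sqrt

def gauss_mass(t):
    return erf(t / sqrt(2))   # P(|N(0,1)| <= t)

def unif_mass(t):
    return min(t, 1.0)        # P(|U[-1,1]| <= t)

# N(0,1) is less peaked than the uniform distribution on [-1,1]:
for t in [0.01, 0.1, 0.5, 1.0, 3.0]:
    assert gauss_mass(t) <= unif_mass(t)
```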

Minkowski’s Theorem The final piece we need is the classical Minkowski theorem from lattice geometry:

Theorem 6 (Minkowski’s Theorem) Let ${C \subseteq R^n}$ be a symmetric convex set of Lebesgue volume more than ${2^n \cdot \ell}$ for an integer ${\ell \geq 1}$. Then, ${C}$ contains at least ${\ell}$ points from the integer lattice ${\mathbb{Z}^n}$.

Putting Things Together  We will now prove the partial coloring lemma (Lemma 2). The proof will be a sequence of simple implications using the above lemmas. Recall the definition of ${\mathcal{K}}$:

$\displaystyle \mathcal{K} = \{x: |\langle a^j,x\rangle| \leq \Delta, \forall j \in [n]\}.$

Our goal (Lemma 2) is equivalent to showing that ${\mathcal{K}}$ contains a ${\{1,0,-1\}^n}$ point of large support.

Note that ${\mathcal{K}}$ is the intersection of ${n}$ slabs. Therefore, by Sidak’s lemma, for ${g \sim \mathcal{N}(0,1)^n}$,

$\displaystyle Pr[g \in \mathcal{K}] \geq \prod_{j=1}^n Pr[|\langle a^j,g\rangle| \leq \Delta] \geq \prod_{j=1}^n\left(1-2\exp(-\Delta^2/2n)\right),$

where the last inequality follows from the fact that ${\langle a^j,g\rangle}$ has the Gaussian distribution with standard deviation ${\sqrt{n}}$. Therefore, if we pick ${\Delta = O(\sqrt{n})}$ sufficiently big, then

$\displaystyle Pr[g \in \mathcal{K}] \geq (3/4)^n.$
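Concretely, “sufficiently big” here is a fixed constant: with ${\Delta = c\sqrt{n}}$, the tail bound above gives success probability ${1 - 2\exp(-c^2/2)}$ per slab, so ${c = \sqrt{2\ln 8} \approx 2.04}$ already suffices for the ${(3/4)^n}$ bound. A quick check of this numerology:

```python
from math import exp, log, sqrt

# <a^j, g> ~ N(0, n), so Pr[|<a^j, g>| > c*sqrt(n)] <= 2*exp(-c**2/2).
c = sqrt(2 * log(8))                   # chosen so that exp(-c**2/2) = 1/8
per_slab = 1 - 2 * exp(-c ** 2 / 2)    # lower bound per slab
# Combined with Sidak's lemma this gives Pr[g in K] >= (3/4)^n.
assert per_slab >= 0.75 - 1e-9
```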

Now, let ${\mu}$ be the uniform distribution on ${[-1,1]}$. Then, by Corollary 5 and the definition of peakedness, ${\mu^n(\mathcal{K}) \geq (3/4)^n}$. Hence, the Lebesgue volume of ${\mathcal{K}' = \mathcal{K} \cap [-1,1]^n}$ is at least ${2^n (3/4)^n = (3/2)^n}$. Therefore, for sufficiently small ${\delta > 0}$, the Lebesgue volume of ${(2-\delta)\mathcal{K}'}$ is at least ${(2-\delta)^n \cdot (3/2)^n \geq 2^n \cdot 2^{\Omega(n)}}$. Thus, by applying Minkowski’s theorem to the symmetric convex set ${(2-\delta) \mathcal{K}'}$, we get that ${(2-\delta)\mathcal{K}'}$ contains at least ${2^{\Omega(n)}}$ lattice points.

Now, note that the only lattice points in ${(2-\delta)\mathcal{K}'}$ are elements of ${\{1,0,-1\}^n}$ inside ${\mathcal{K}}$. Therefore ${\mathcal{K}}$ has at least ${2^{\Omega(n)}}$ points from ${\{1,0,-1\}^n}$. By a simple counting argument, at least one of these lattice points ${\epsilon}$ must have ${\Omega(n)}$ non-zero coordinates – which is exactly what we need to prove Lemma 2!
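The counting step can be made quantitative: the number of low-support points is

$\displaystyle \left|\left\{x \in \{1,0,-1\}^n : |\mathrm{Support}(x)| < cn\right\}\right| = \sum_{k < cn} \binom{n}{k} 2^k \leq 2^{(H(c)+c)n},$

where ${H(c) = c\log_2(1/c) + (1-c)\log_2(1/(1-c))}$ is the binary entropy function. Since ${H(c)+c \rightarrow 0}$ as ${c \rightarrow 0}$, for a small enough constant ${c}$ this count is smaller than the ${2^{\Omega(n)}}$ points guaranteed above, so some point ${\epsilon}$ has support at least ${cn}$.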

Discussion  The above argument can actually be simplified by replacing the use of Kanter’s lemma with an appropriate version of Minkowski’s theorem for Gaussian volume, as done here. But I like any excuse to discuss Kanter’s lemma.

More importantly, the proof seems to be more amenable to generalization. The core of the proof is really to use Sidak’s lemma to lower bound the Gaussian volume of the convex set ${\mathcal{K}}$. Whenever you have such a statement, you should get a corresponding discrepancy statement. In particular, the matrix discrepancy conjecture from the last post essentially reduces to the following probability question:

Question Is it true that for a universal constant ${C}$, for all symmetric matrices ${A_1,\ldots,A_n \in R^{n \times n}}$ with ${\|A_i\| \leq 1}$,

$\displaystyle Pr_{g \sim \mathcal{N}(0,1)^n}\left[\|g_1 A_1 + g_2 A_2 + \cdots + g_n A_n \| \leq C \sqrt{n}\right] \geq (3/4)^n\,?$

Acknowledgments Thanks to Shachar Lovett, Oded Regev and Nikhil Srivastava for helpful suggestions, comments, and corrections during the preparation of this post.

This blog post is an introduction to the “Sum of Squares” (SOS) algorithm from my biased perspective. This post is rather long – I apologize. You might prefer to view/print it in pdf format. If you’d rather “see the movie”, I’ll be giving a TCS+ seminar about this topic on Wednesday, February 26th 1pm EST. If you prefer the live version, I’ll be giving a summer course in Stockholm from June 30 till July 4th. I’ve been told that there are worse places to be in than Stockholm in June, and there are definitely worse things to do than taking a course on analysis of Boolean functions from Ryan O’Donnell, which will also be part of the same summer school.

This post starts with an 1888 existential proof by Hilbert, given a constructive version by Motzkin in 1965. We will then go through proof complexity and semidefinite programming to describe the SOS algorithm and how it can be analyzed. Most of this is based on my recent paper with Kelner and Steurer and a (yet unpublished) followup work of ours, but we’ll also discuss notions that can be found in our previous work with Brandao, Harrow and Zhou. While our original motivation to study this algorithm came from the Unique Games Conjecture, our methods turn out to be useful for problems from other domains as well. In particular, we will see an application to the Sparse Coding problem (also known as dictionary learning), which arises in machine learning, computer vision and image processing, and computational neuroscience. In fact, we will come full circle, as we will see how polynomials related to Motzkin’s end up playing a role in our analysis of this algorithm.

I am still a novice myself in this area, and so this post is likely to contain inaccuracies, misattributions, and just plain misunderstandings. Comments are welcome! For deeper coverage of this topic, see Pablo Parrilo’s lecture notes, and Monique Laurent’s monograph. A great accessible introduction from a somewhat different perspective is given in Amir Ali Ahmadi’s blog posts (part I and part II). In addition to the optimization applications discussed here, Amir mentions that “in dynamics and control, SOS optimization has enabled a paradigm shift from classical linear control to … nonlinear controllers that are provably safer, more agile, and more robust” and that SOS has been used in “areas as diverse as quantum information theory, robotics, geometric theorem proving, formal verification, derivative pricing, stochastic optimization, and game theory”.

The FOCS 2014 call for papers is out. The PC chose to forgo hard page limits for submissions, trusting authors to apply their best judgment to make their submissions as easy to read and review as possible. Toward this end, here are some words of advice for potential authors. None of this is mandatory, but I hope you keep it in mind as you prepare your submission. I believe the effort to make your paper more readable will pay off not just in improving your FOCS submission, but also the future online, proceedings, and journal versions of your paper.

Remember the audience.

One of the challenging aspects of writing a FOCS submission (and in fact any scientific paper) is that it needs to address different types of readers simultaneously. A non-exhaustive list includes:

1) The expert in your field who wants to verify all the details and understand how your approach differs from their 2007 paper.

2) A non-expert reviewer who wants to understand what you did, why the question is motivated, and get some sense of the techniques you used.

3) A PC member (such as yours truly) who was not assigned to review the paper, and wants to get some approximation of the above by just skimming it for a few minutes.

A general rule of thumb for addressing those different readers is to make sure that the paper’s first few sections are accessible to a general theoretical CS audience, while later sections might contain the details that are mainly of interest to experts in the field. This brings us to our next point.

While there is no hard page limit, FOCS reviewers are not expected to read all submissions in full.  In practice, this means you should follow what I call “Impagliazzo’s Rule”: The first X pages of the paper should make the reader want to read the next X pages, for any value of X.

In particular, you should make sure that your results, your techniques, the motivation for your work, its context and novelty compared to prior works are clearly stated early in the paper. If your main theorem is hard to state succinctly, you can state an informal version, or an important special case, adding a forward reference to the place where it’s stated in full.

The above applies not just to results but to the techniques as well. Don’t wait until the technical section to tell us about your novel ideas. Some of the best-written papers follow the introduction with a section such as “Our techniques”, “Proof outline”, “Warmup” or “Toy problem” that illustrates the ideas behind the proofs in an informal and accessible way.

While modesty is a fine virtue, you don’t want to overdo it in a FOCS submission and hide your contributions in some remote corner of the paper. Of course, you don’t want to go too far in the other direction either, and so you should also be upfront about your work’s limitations.

As scientists, we bend over backwards to point out the potential flaws, caveats, and room for improvement in our work, and I expect nothing less from FOCS authors. It can be extremely frustrating for a reviewer to find out that the result is restricted in a significant way only when she reaches Section 4 of the paper. All restrictions, caveats, assumptions, and limitations should be described early in the paper. In fact, some caveats are so major that you shouldn’t wait even until the introduction to state them. For example, if you prove a lower bound that holds only for monotone circuits, then not only should this be clearly stated in the abstract, the word “monotone” should probably appear in the title. Generally speaking, if you’ve made a choice in modeling the problem that makes it easier, you should discuss this and explain what would have changed had you made a different choice. Similarly, any relations and overlap with prior works should be clearly described early in the paper. If the result is a generalization of prior work, explain how they differ and what motivates the generalization. If it improves in some parameters but is worse in others, a discussion of the significance of these tradeoffs is in order. If there is a related work you are aware of, even if it’s not yet formally published, or was done after yours, you should still cite it and explain the relation and chronology between the two works.

Kill the suspense.

A scientific paper is not a novel: ideally, readers should not be kept in suspense or be surprised, negatively or positively. The FOCS PC is an incredibly talented group of people, but you should still write your paper in a “foolproof” way, trying to anticipate all questions and misunderstandings that a reader may have (especially one that needs to review 40 papers under time pressure).

For example, it can be extremely annoying as an author to get a review saying “the proof of the main theorem can be vastly simplified by using X”, where X is the thing you tried first, which doesn’t work. The way to avoid this is to add a section titled “First attempt” where you discuss X and explain why it fails. Similarly, if there is a paper that at first glance seems related to your work but turns out to be irrelevant, you should still cite it and explain why it’s not actually related.

Another annoyance is when the reviews give the impression that the paper was rejected for being “too simple”. I and the rest of the FOCS PC believe that simplicity is a great virtue and never a cause for rejection. But you don’t want the reviewer to be surprised by the simplicity, discovering only late in the paper that the proof is a 3 line reduction to some prior work. If the proof is simple then be proud of this fact, and announce it right from the start. Similarly, if the proof of a particular lemma involves some routine applications of a standard method, you don’t need to remove it or move it to the appendix, but do add a sentence saying this at the proof’s start, so the less detail-oriented reviewers will know to skip ahead. This applies in the other case as well: if the proof involves a novel twist or a subtle point, you should add a sentence alerting the reader to look out for that point.

Summary

Writing a scientific paper is often a hard task, and I and the rest of the PC deeply appreciate your decision to send us your work and make the effort to communicate it as clearly as possible. We hope you find the above suggestions useful, and are looking forward to reading your submission.