We will use the method to prove the following sharp variant of Bourgain and Tzafriri’s restricted invertibility theorem, which may be seen as a robust, quantitative version of the fact that every rank matrix contains a set of linearly independent columns.

**Theorem 1** Suppose are vectors with . Then for every there is a subset of size with

That is, any set of vectors with variance equal to one in every direction must contain a large subset which is far from being linearly degenerate, in the sense of having large eigenvalues (compared to , which is the average squared length of the vectors). Such variance one sets go by many other names in different contexts: they are also called isotropic sets, decompositions of the identity, and tight frames. This type of theorem was first proved by Bourgain and Tzafriri in 1987, and later generalized and sharpened to include the form stated here.

The original applications of Theorem 1 and its variants were mainly in Banach space theory and harmonic analysis. More recently, it was used in theoretical CS by Nikolov, Talwar, and Zhang in the contexts of differential privacy and discrepancy minimization. Another connection with TCS was discovered by Joel Tropp, who showed that the set can be found algorithmically using a semidefinite program whose dual is related to the Goemans-Williamson SDP for Max-Cut.

In more concrete notation, the theorem says that every rectangular matrix with contains an column submatrix with , where is the kth largest singular value. Written this way, we see some similarity with the column subset selection problem in data mining, which seeks to extract a maximally nondegenerate set of ‘representative’ columns from a data matrix. There are also useful generalizations of Theorem 1 for arbitrary rectangular .

As I said earlier, the technique is based on studying the roots of averages of polynomials. In general, averaging polynomials coefficient-wise can do unpredictable things to the roots. For instance, the average of and , which are both real-rooted quadratics, is , which has complex roots . Even when the roots of the average are real, there is in general no simple relationship between the roots of two polynomials and the roots of their average.
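To see the failure concretely, here is a minimal sketch that averages two real-rooted quadratics coefficient-wise and inspects the roots. The pair (x - 1)(x - 2) and (x - 3)(x - 4) is an assumption chosen for illustration, not necessarily the pair intended above, and `quadratic_roots` is a hypothetical helper.

```python
import cmath

def quadratic_roots(a, b, c):
    """Both roots of a*x^2 + b*x + c, returned as complex numbers."""
    disc = cmath.sqrt(b * b - 4 * a * c)
    return ((-b + disc) / (2 * a), (-b - disc) / (2 * a))

# (x - 1)(x - 2) = x^2 - 3x + 2 and (x - 3)(x - 4) = x^2 - 7x + 12,
# stored as coefficient tuples (a, b, c). Illustrative choice only.
p = (1.0, -3.0, 2.0)
q = (1.0, -7.0, 12.0)

# Coefficient-wise average: x^2 - 5x + 7, whose discriminant is negative.
avg = tuple((pc + qc) / 2 for pc, qc in zip(p, q))

roots_p = quadratic_roots(*p)
roots_q = quadratic_roots(*q)
roots_avg = quadratic_roots(*avg)
```

Both inputs have real roots, while the roots of the average have a nonzero imaginary part.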

The main insight is that there are nonetheless many situations where averaging the coefficients of polynomials also has the effect of averaging each of the roots individually, and that it is possible to identify and exploit these situations. To speak about this concretely, we will need to give the roots names. There is no canonical way to do this for arbitrary polynomials, whose roots are just sets of points in the complex plane. However, when all the roots are real there is a natural ordering given by the real line; we will use this ordering to label the roots of a real-rooted polynomial in descending order .

**Interlacing**

We will use the following classical notion to characterize precisely the good situations mentioned above.

**Definition 2 (Interlacing)** Let be a degree polynomial with all real roots , and let be degree or with all real roots (ignoring in the degree case). We say that interlaces if their roots alternate, i.e.,

Following Fisk, we denote this as , to indicate that the largest root belongs to .

If there is a single which interlaces a family of polynomials , we say that they have a *common interlacing*.

It is an easy exercise that of degree have a common interlacing iff there are closed intervals (where means to the left of) such that the th roots of all the are contained in . It is also easy to see that a set of polynomials has a common interlacing iff every pair of them has a common interlacing (this may be viewed as Helly’s theorem on the real line).

We now state the main theorem about averaging polynomials with common interlacings.

**Theorem 3** Suppose are real-rooted of degree with positive leading coefficients. Let denote the largest root of and let be any distribution on . If have a common interlacing, then for all

The proof of this theorem is a three line exercise. Since it is the crucial fact upon which the entire technique relies, I encourage you to find this proof for yourself (Hint: Apply the intermediate value theorem inside each interval .) You can also look at the picture below, which shows what happens for two cubic polynomials with a quadratic common interlacing.
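The hint can be checked numerically. The sketch below is an illustration under assumptions: the two cubics, the weight, and the helper names are all hypothetical. It takes two cubics whose k-th roots share an interval, confirms that the average changes sign across each interval, and locates its roots by bisection, so the k-th root of the average lands between the k-th roots of the two polynomials.

```python
def poly_from_roots(roots):
    """Return a function evaluating the monic polynomial with the given roots."""
    def p(x):
        value = 1.0
        for r in roots:
            value *= (x - r)
        return value
    return p

def bisect_root(f, lo, hi, iters=80):
    """Find a root of f in [lo, hi], assuming f(lo) and f(hi) have opposite signs."""
    flo = f(lo)
    for _ in range(iters):
        mid = (lo + hi) / 2
        fmid = f(mid)
        if (flo < 0) == (fmid < 0):
            lo, flo = mid, fmid
        else:
            hi = mid
    return (lo + hi) / 2

# Roots listed in descending order, matching the labelling used in the text.
roots_f = [5.0, 3.0, 1.0]
roots_g = [6.0, 4.0, 2.0]
f, g = poly_from_roots(roots_f), poly_from_roots(roots_g)
weight = 0.5  # any weight in [0, 1] is a distribution on {f, g}

def avg(x):
    return weight * f(x) + (1 - weight) * g(x)

# The k-th roots of f and g both lie in the k-th interval, so f and g have a
# common interlacing; the intermediate value theorem applies in each interval.
intervals = [(5.0, 6.0), (3.0, 4.0), (1.0, 2.0)]
roots_avg = [bisect_root(avg, lo, hi) for lo, hi in intervals]
```

In particular, for every k at least one of the two polynomials has a k-th root at least as large as the k-th root of the average.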

One of the nicest features of common interlacings is that their existence is *equivalent* to certain real-rootedness statements. Often, this characterization gives us a systematic way to argue that common interlacings exist, rather than having to rely on cleverness and pull them out of thin air. The following seems to have been discovered a number of times (for instance, Fell or Chudnovsky & Seymour); the proof of it included below assumes that the roots of a polynomial are continuous functions of its coefficients (which may be shown using elementary complex analysis).

**Theorem 4** If are degree polynomials and all of their convex combinations have real roots, then they have a common interlacing.

*Proof:* Since common interlacing is a pairwise condition, it suffices to handle the case of two polynomials and . Let

with . Assume without loss of generality that and have no common roots (if they do, divide them out and put them back in at the end). As varies from to , the roots of define continuous curves in the complex plane , each beginning at a root of and ending at a root of . By our assumption the curves must all lie in the real line. Observe that no curve can cross a root of either or in the middle: if for some and , then immediately we also have , contradicting the no common roots assumption. Thus, each curve defines a closed interval containing exactly one root of and one root of , and these intervals do not overlap except possibly at their endpoints, establishing the existence of a common interlacing.

**Characteristic Polynomials and Rank One Updates**

A very natural and relevant example of interlacing polynomials comes from matrices. Recall that the *characteristic polynomial* of a matrix is given by

and that its roots are the eigenvalues of . The following classical fact tells us that rank one updates create interlacing.

**Lemma 5 (Cauchy’s Interlacing Theorem)** If is a symmetric matrix and is a vector then

*Proof:* There are many ways to prove this, and it is a nice exercise. One way which I particularly like, and which will be relevant for the rest of this post, is to observe that

where and are the eigenvectors and eigenvalues of . Interlacing then follows by inspecting the poles and zeros of the rational function on the right hand side.
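For a quick numerical check of the interlacing, here is a small sketch for the 2-by-2 case, where the eigenvalues have a closed form. The matrix, the vector, and the helper `eigenvalues_2x2` are hypothetical choices made for illustration.

```python
import math

def eigenvalues_2x2(a, b, c):
    """Eigenvalues of the symmetric matrix [[a, b], [b, c]], largest first."""
    mean = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    return (mean + radius, mean - radius)

# A hypothetical symmetric matrix A = [[a, b], [b, c]] and vector v = (s, t).
a, b, c = 2.0, 0.5, -1.0
s, t = 0.8, -0.6

old = eigenvalues_2x2(a, b, c)
# A + v v^T has entries a + s^2, b + s*t, c + t^2.
new = eigenvalues_2x2(a + s * s, b + s * t, c + t * t)

# Interlacing after a rank one update: new[0] >= old[0] >= new[1] >= old[1].
interlaces = new[0] >= old[0] >= new[1] >= old[1]
```

The largest root belongs to the updated matrix, in line with the convention for interlacing used above.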

We are now in a position to do something nontrivial. Suppose is a symmetric real matrix and are some vectors in . Cauchy’s theorem tells us that the polynomials

have a common interlacing, namely . Thus, Theorem 3 implies that for every , there exists a so that the th largest eigenvalue of is at least the th largest root of the average polynomial

We can compute this polynomial using the calculation as follows:

In general, this polynomial depends on the squared inner products . When , however, we have for all , and:

That is, adding a random rank one matrix in the isotropic case corresponds to subtracting off a multiple of the derivative from the characteristic polynomial. Note that there is *no dependence* on the vectors in this expression, and it has ‘forgotten’ all of the eigenvectors . This is where the gain is: we have reduced a high-dimensional linear algebra problem (of finding a for which has certain eigenvalues, which may be difficult when the matrices involved do not commute) to a univariate calculus / analysis problem (given a polynomial, figure out what subtracting the derivative does to its roots). Moreover, the latter problem is amenable to a completely different set of tools than the original eigenvalue problem.
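The identity is easy to test on a tiny instance. The sketch below assumes m = 3 vectors in the plane at angles 0, 120 and 240 degrees, scaled so that their outer products sum to the identity, and an arbitrary symmetric 2-by-2 matrix; with a uniformly random choice among m such vectors, the multiple of the derivative works out to 1/m. All names and numbers here are illustrative assumptions.

```python
import math

def charpoly_2x2(a, b, c):
    """Coefficients [1, c1, c0] of det(xI - [[a, b], [b, c]]) = x^2 + c1*x + c0."""
    return [1.0, -(a + c), a * c - b * b]

# Assumed isotropic set: m = 3 vectors in R^2 with sum_i v_i v_i^T = I.
m = 3
scale = math.sqrt(2 / m)
vectors = [(scale * math.cos(2 * math.pi * i / m),
            scale * math.sin(2 * math.pi * i / m)) for i in range(m)]

# Assumed symmetric matrix A = [[a, b], [b, c]].
a, b, c = 2.0, 0.5, -1.0
chi = charpoly_2x2(a, b, c)

# Average of the characteristic polynomials of A + v_i v_i^T over i.
average = [0.0, 0.0, 0.0]
for s, t in vectors:
    updated = charpoly_2x2(a + s * s, b + s * t, c + t * t)
    average = [acc + coef / m for acc, coef in zip(average, updated)]

# chi - (1/m) * chi', using chi'(x) = 2x + c1.
predicted = [chi[0], chi[1] - 2 * chi[0] / m, chi[2] - chi[1] / m]
```

The two coefficient lists agree up to floating point error, even though the individual polynomials being averaged depend on the directions of the vectors.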

As a sanity check, if we apply the above deduction to , we find that any isotropic set must contain a such that is at least the largest root of

which is just . This makes sense since , and the average squared length of the vectors is indeed since .

**Differential Operators and Induction**

The real power of the method comes from being able to apply it inductively to a sum of many independent random ‘s at once, rather than just once. In this case, establishing the necessary common interlacings requires a combination of Theorem 4 and Cauchy’s theorem. A central role is played by the differential operators seen above, which I will henceforth denote as . The proof relies on the following key properties of these operators:

Lemma 6 (Properties of Differential Operators)

- If is a random vector with then
- If has real roots then so does .
- If have a common interlacing, then so do .

*Proof:* Part (1) was essentially shown in . Part (2) follows by applying to the matrix with diagonal entries equal to the roots of , and plugging in , so that and .

For part (3), Theorem 3 tells us that all convex combinations have real roots. By part (2) it follows that all

also have real roots. By Theorem 4, this means that the must have a common interlacing.

We are now ready to perform the main induction which will give us the proof of Theorem 1.

**Lemma 7** Let be uniformly chosen from so that , and let be i.i.d. copies of . Then there exists a choice of indices satisfying

*Proof:* For any partial assignment of the indices, consider the ‘conditional expectation’ polynomial:

We will show that there exists a such that

which by induction will complete the proof. Consider the matrix

By Cauchy’s interlacing theorem interlaces for every . Lemma 6 tells us operators preserve common interlacing, so the polynomials

(by applying Lemma 6 times) must also have a common interlacing. Thus, some must satisfy (1), as desired.

**Bounding the Roots: Laguerre Polynomials**

To finish the proof of Theorem 1, it suffices by Lemma 7 to prove a lower bound on the th largest root of the expected polynomial . By applying Lemma 6 times to , we find that

This looks like a nice polynomial, and we are free to use any method we like to bound its roots.
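To get a feel for this polynomial, one can compute its coefficients exactly. The sketch below assumes that the expected polynomial is obtained by starting from x^n (the characteristic polynomial of the zero matrix) and applying the map p -> p - p'/m once per random vector; that reading, the function names, and the sample values n = 3, m = 6 are assumptions made for illustration.

```python
from fractions import Fraction

def apply_one_minus_scaled_derivative(coeffs, m):
    """Map p to p - p'/m, where coeffs[j] is the coefficient of x^j."""
    result = list(coeffs)
    for j in range(1, len(coeffs)):
        # The x^(j-1) coefficient of p' is j * coeffs[j].
        result[j - 1] -= Fraction(j, m) * coeffs[j]
    return result

def expected_polynomial(n, m, k):
    """Apply p -> p - p'/m to x^n a total of k times."""
    coeffs = [Fraction(0)] * n + [Fraction(1)]
    for _ in range(k):
        coeffs = apply_one_minus_scaled_derivative(coeffs, m)
    return coeffs

one_step = expected_polynomial(n=3, m=6, k=1)   # x^3 - (1/2) x^2
two_steps = expected_polynomial(n=3, m=6, k=2)  # x^3 - x^2 + (1/6) x
```

With a single step the polynomial is x^2 (x - 1/2), whose largest root is n/m = 1/2, in agreement with the sanity check carried out earlier for one random vector.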

The easiest way is to observe that

where is a degree associated Laguerre polynomial. These are a classical family of orthogonal polynomials and a lot is known about the locations of their roots; in particular, there is the following estimate due to Krasikov.

**Lemma 8 (Roots of Laguerre Polynomials)** The roots of the associated Laguerre polynomial

It follows by Lemma 8 that , which immediately yields Theorem 1 by Lemma 7 and (2).

If you think that appealing to Laguerre polynomials was magical, it is also possible to prove the bound (3) from scratch in less than a page using the ‘barrier function’ argument from this paper, which is also intimately related to the formulas and .

**Conclusion**

The argument given here is a special case of a more general principle: that expected characteristic polynomials of certain random matrices can be expressed in terms of differential operators, which can then be used to establish the existence of the necessary common interlacings as well as to analyze the roots of the expected polynomials themselves. In the isotropic case of Bourgain-Tzafriri presented here, all of these objects can be chosen to be univariate polynomials. Morally, this is because the covariance matrices of all of the random vectors involved are multiples of the identity (which trivially commute with each other), and all of the characteristic polynomials involved are simple univariate linear transformations of each other (such as ). The above argument can also be made to work in the non-isotropic case, yielding improvements over previously known bounds. This is the subject of a paper in preparation, *Interlacing Families III*, with Adam Marcus and Dan Spielman.

On the other hand, the proofs of Kadison-Singer and existence of Ramanujan graphs involve analyzing sums of independent rank one matrices which come from *non-identically distributed* distributions whose covariance matrices do not commute. At a high level, this is what creates the need to consider multivariate characteristic polynomials and differential operators, which are then analyzed using techniques from the theory of real stable polynomials.

**Acknowledgments**

Everything in this post is joint work with Adam Marcus and Dan Spielman. Thanks to Raghu Meka for helpful comments in the preparation of this post.


We are continuing the FOCS 2013 experiment of less regulation and more responsibility on the paper formatting, so please do read my advice for authors before submitting.

The deadline is **4:30pm Eastern Time on Wednesday, April 2nd**, but I’d suggest you try to submit the day before to allow extra time for fixing all those typos and implementing all those suggestions that usually come to you just as you press the “submit” button. (Not to mention to allow time for computer crashes, network outages, latex bugs, bathroom breaks, and all those other pesky issues that tend to pop up 5 minutes before the deadline.) While you’re at it, I recommend you post the paper also on the arXiv, ECCC, or ePrint, so other people can benefit from your work.

Now go and prove that missing lemma – I’m looking forward to reading your submissions!


(also available on the conference website)

Workshop and Tutorial Day: Saturday, May 31, 2014

Workshop and Tutorial Co-Chairs: Kunal Talwar and Chris Umans

On Saturday, May 31, immediately preceding the main conference, STOC 2014 will hold a workshop-and-tutorials day. We invite groups of interested researchers to submit workshop or tutorial proposals. The goal of a workshop is to provide an informal forum for researchers to discuss important research questions, directions, and challenges. Connections between theoretical computer science and other areas, topics that are not well represented at STOC, and open problems are encouraged as workshop topics. Organizers are completely free to choose their workshop formats (invited speakers, panel discussions, etc.). The program for May 31 may also involve tutorials, each consisting of 1-2 survey talks on a particular area, and we welcome tutorial proposals as well.

STOC does not have funds to pay travel expenses or honoraria to invited workshop and tutorial speakers, but we do anticipate funds for breaks with snacks and drinks. Workshop and tutorials attendance will be free for all STOC registrants, and will also be available separately for a reduced registration fee, for those not attending STOC itself.

**Proposal submission:** Workshop and tutorial proposals should be no longer than 2 pages. Please include a list of names and email addresses of the organizers, a description of the topic and the goals of the workshop or tutorial, the proposed workshop format (invited talks, contributed talks, panel, etc.), and proposed or tentatively confirmed speakers if known. Please also indicate the preferred length of time for your workshop or tutorial, along with the minimum acceptable time. We anticipate a 4-5 hour block for each workshop and a 1-3 hour block for each tutorial. Please feel free to contact the Workshop and Tutorial Co-Chairs at the email addresses below if you have questions about workshop or tutorial proposals.

**Submission deadline:** Proposals should be submitted by **March 28, 2014**, via email to kunal@microsoft.com and umans@cs.caltech.edu. Proposers will be notified by April 10, 2014, about whether their proposals have been accepted.


- Rounding linear programs by discrepancy theory: specifically, the beautiful argument of Lovasz, Spencer and Vesztergombi (LSV) on bounding *linear discrepancy* in terms of *hereditary discrepancy*. LSV is also an excellent place to look if you are running short on beautiful, innocent-in-appearance, time-sinking combinatorial conjectures. Unfortunately, to keep the post short(er), I will not discuss the recent breakthrough result of Rothvoss on bin-packing, which uses discrepancy theory (in a much more sophisticated way) for rounding a well-studied linear program (the “Gilmore-Gomory” LP). I highly recommend reading the paper.

- A recent algorithmic proof of the Spencer/Gluskin theorem due to Shachar Lovett and myself.

**Rounding Linear Programs** One of the most basic techniques for combinatorial optimization is linear programming relaxation. Let us phrase this in a language suitable for the present context (which is in fact fairly generic). We have a constraint matrix , a target vector and our goal is to find a so as to minimize . One typical approach is to *relax* this discrete problem and instead solve the linear program

(which can be done efficiently). The next step, and often the most challenging one, is to round the fractional solution to an integer solution in with little “loss”. How well can we do this for a constraint matrix , and general vectors ? This is captured by the notion of *linear discrepancy* introduced by LSV:

LSV introduced the above notion originally as a generalization of discrepancy which can be formulated in our current context as follows:

This corresponds exactly to the notion of discrepancy we studied in the last two posts. On the other hand, we can also write , where denotes the all-ones vector. Thus, corresponds to the special case of linear discrepancy where we are only trying to round a particular () fractional solution.

The remarkable result of LSV is that while the above definition seems to be much weaker than (which has to round *all* fractional solutions) there is a natural extension, *hereditary discrepancy*, of which is nearly as strong. For a matrix and , let denote the sub-matrix corresponding to the columns of indexed by . Then, define

Hereditary discrepancy is a natural and more “robust” version of discrepancy. As LSV phrase it, discrepancy can be small by accident, whereas hereditary discrepancy seems to carry more structural information. For example, let be a random matrix with the constraint that each row has sum . Then, , but (this needs proving, but is not too hard) – which makes intuitive sense as we expect random matrices to have little structure.
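A toy computation illustrates how discrepancy can be small by accident. The sketch below computes both quantities by brute force, taking hereditary discrepancy to be the maximum of the discrepancy over all column subsets and using sign vectors as colorings. The matrix `B`, the column-doubling trick, and the function names are illustrative assumptions, and the enumeration is exponential, so it is only meant for tiny matrices.

```python
from itertools import combinations, product

def disc(A, columns=None):
    """Brute-force discrepancy: min over signs of max_i |sum_j A[i][j] * sign_j|."""
    n = len(A[0])
    cols = list(range(n)) if columns is None else list(columns)
    best = None
    for signs in product((1, -1), repeat=len(cols)):
        worst = max(abs(sum(row[j] * s for j, s in zip(cols, signs))) for row in A)
        best = worst if best is None else min(best, worst)
    return best

def herdisc(A):
    """Maximum of disc over all subsets of columns (exponential time)."""
    n = len(A[0])
    return max(disc(A, S) for r in range(n + 1) for S in combinations(range(n), r))

B = [[1, 1, 1],
     [1, 1, 0],
     [0, 1, 1]]
# Duplicating every column lets the two copies cancel, so disc drops to 0,
# while restricting to the first three columns recovers B.
doubled = [row + row for row in B]
```

Here `disc(doubled)` is 0 even though `disc(B)` is positive, and `herdisc(doubled)` is at least `disc(B)` because `B` is one of the column restrictions.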

It is also worth noting that several notable results in discrepancy theory which bound the discrepancy in fact also bound hereditary discrepancy. For example, Spencer’s original proof as well as Gluskin’s argument from last post (with a little bit more work) in fact show the following:

LSV show the following connection between linear and hereditary discrepancies:

**Theorem 2** For any matrix , .

In other words, any fractional solution for a linear program of the form Equation 1 can be rounded to an integer solution with an additive error of at most .

Let me now describe the cute proof which can be explained in a few lines. Suppose you have a fractional solution . Our goal is to find an such that is small. We will construct such a by iteratively making *more* integral. Let us write out the binary expansion of each of the coordinates of : for each , . To avoid unnecessary technical issues, let us suppose that each coordinate has a finite expansion of length .

We will build a sequence of solutions such that the coordinates of when written in binary will have expansions of length at most . Let us look at how to get ; we can get the rest similarly.

Let . By the definition of , there exists a vector such that . Let (interpreting as a vector in in the natural way). Clearly, the binary expansions of the coordinates of have length at most . Further, . Iterating this argument, we get such that . Therefore,
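The bit-by-bit rounding can be sketched in a few lines. This is an illustration under assumptions: the low-discrepancy coloring is found by brute force (the helper `best_signs` is hypothetical and exponential time), colorings are sign vectors, and the matrix and starting point are made up for the example.

```python
from fractions import Fraction
from itertools import product

def best_signs(A, S):
    """Brute-force signs on the columns in S minimizing max_i |sum_j A[i][j] * sign_j|."""
    best, best_val = None, None
    for signs in product((1, -1), repeat=len(S)):
        val = max(abs(sum(row[j] * s for j, s in zip(S, signs))) for row in A)
        if best_val is None or val < best_val:
            best, best_val = signs, val
    return dict(zip(S, best))

def round_bit_by_bit(A, x, T):
    """Round x (entries are multiples of 2^-T in [0, 1]) to a 0/1 vector."""
    x = list(x)
    for t in range(T, 0, -1):
        step = Fraction(1, 2 ** t)
        # Coordinates whose t-th binary digit is 1.
        S = [j for j, xj in enumerate(x) if (xj / step) % 2 == 1]
        signs = best_signs(A, S)
        for j in S:
            x[j] += signs[j] * step  # clears the t-th digit
    return x

A = [[1, 1, 0], [0, 1, 1], [1, 0, 1]]
x = [Fraction(3, 8), Fraction(5, 8), Fraction(1, 2)]
y = round_bit_by_bit(A, x, T=3)
error = max(abs(sum(a * (xj - yj) for a, xj, yj in zip(row, x, y))) for row in A)
```

Round t moves the point by 2^-t times a sign vector supported on a column subset, so its contribution to the error is at most 2^-t times the discrepancy of that restriction, and the contributions sum geometrically.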

**Constructive Discrepancy Minimization by Walking on the Edges** We just saw a way to round linear programs as in Equation 1 with error bounded by their discrepancy. As appealing as this is, it comes with one important caveat. The original motivation for looking at LP relaxations was that we can solve them efficiently. For this to make sense, we need the rounding procedure to be efficient as well. In our present case, to make the rounding efficient, we need to find a small discrepancy solution efficiently (find given as above). Unfortunately, this in general is NP-hard in a very strong way as was shown recently by Charikar, Newman and Nikolov.

However, what about specific bounds like in Theorem 1? Spencer’s original proof as well as Gluskin’s proof do not give an efficient algorithm for finding a good coloring. This is inherent to the two arguments: they rely (directly or indirectly via Minkowski’s theorem) on the pigeon-hole principle (with exponentially many “holes”) which is quite non-algorithmic. In fact, Alon and Spencer conjectured (in ‘the’ book) several years ago that finding a coloring with discrepancy as in the theorem is computationally hard.

Fortunately for us, like all good conjectures, this was proven to be false by a breakthrough result of Nikhil Bansal. Nikhil’s argument studies a carefully constructed semi-definite programming relaxation of the problem and then gives a new and amazing rounding algorithm for the SDP.

Here I will briefly discuss a different proof of Theorem 1 due to Lovett and myself which will also lead to an efficient algorithm for finding a coloring as required.

Let us first revisit Beck’s partial-coloring approach described in last post which says that to prove the theorem, it suffices to show the following.

**Lemma 3** For vectors , there exists such that for every , and .

As in the last post, let us also rephrase the problem in a geometric language. Let be the symmetric convex set (symmetric meaning implies ) defined as follows for to be chosen later:

The partial coloring lemma is equivalent to showing that contains a lattice point of large support. As it turns out, we don’t really need to find a lattice point in but any point with many (or close to ) coordinates will serve us equally well. Concretely, we want:

The above is equivalent to finding a vertex of which is tight on coloring constraints. For intuition let us use the distance from the origin as a proxy for how many coordinates are close to in absolute value. Thus, roughly speaking our goal is to find a vertex of as far away from the origin as possible. This is the interpretation we will use.

Let us think of trying to find the vertex iteratively. Our starting point would be the all-zeros vector. Now what? Well, we want to get away from the origin but still stay inside the polytope . Being somewhat greedy and lazy, we will just toss some coins and update our by doing Brownian motion (if this is uncomfortable, think of taking a discrete walk with tiny random Gaussian steps). The point is that the random walk will steadily move away from the origin.

Now, in the course of performing this random walk at some time we will touch the boundary of or in other words *hit* some constraint(s) of . Now what? Well, as before we still want to get away from the origin but do not want to cross the polytope. So as before, being greedy and lazy (tough habits to change), we will continue doing Brownian motion but now constrain ourselves to only take steps which lie in the face of the polytope that we hit. So we move from doing Brownian motion in -dimensions to doing one in a subspace (corresponding to the face that we hit) of dimension at most . We now repeat this process: every time we hit a new constraint we restrict ourselves to only move in the face we hit. Repeating the above step, we should eventually reach a vertex of the polytope .

This is pretty much the actual algorithm which we call the **Edge-Walk** algorithm (except that to make it implementable we take tiny Gaussian random steps instead of doing Brownian motion; and this makes the analysis easier too). I will refer to the paper for the full details of the algorithm and its analysis, but let me just say what the punchline is: you are more likely to hit nearer constraints than farther ones.
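The walk described above can be sketched as follows. This is a simplified variant written for illustration, not the algorithm as analyzed in the paper: each Gaussian step is projected onto the face cut out by the currently tight constraints and then truncated so that it stops exactly at the first new constraint it would cross. The truncation rule, the function name `edge_walk`, and all parameters (`c`, `steps`, `gamma`, `tol`, `seed`) are assumptions.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def edge_walk(vectors, c, steps=2000, gamma=0.05, tol=1e-9, seed=0):
    """Random walk inside {x : |x_i| <= 1, |<v_j, x>| <= c}, sticking to faces it hits."""
    rng = random.Random(seed)
    n = len(vectors[0])
    # Every constraint has the form |<a, x>| <= bound.
    constraints = [([1.0 if k == i else 0.0 for k in range(n)], 1.0) for i in range(n)]
    constraints += [(list(v), c) for v in vectors]
    x = [0.0] * n
    for _ in range(steps):
        # Orthonormal basis (Gram-Schmidt) for the normals of the tight constraints.
        basis = []
        for a, bound in constraints:
            if abs(dot(a, x)) >= bound - tol:
                r = list(a)
                for q in basis:
                    coef = dot(r, q)
                    r = [ri - coef * qi for ri, qi in zip(r, q)]
                norm = dot(r, r) ** 0.5
                if norm > 1e-9:
                    basis.append([ri / norm for ri in r])
        if len(basis) == n:
            break  # no free directions left: x is a vertex
        # Small Gaussian step, projected onto the current face.
        g = [rng.gauss(0.0, gamma) for _ in range(n)]
        for q in basis:
            coef = dot(g, q)
            g = [gi - coef * qi for gi, qi in zip(g, q)]
        # Shrink the step so that no constraint is crossed.
        alpha = 1.0
        for a, bound in constraints:
            ax, ag = dot(a, x), dot(a, g)
            if abs(ag) > 1e-12:
                limit = (bound - ax) / ag if ag > 0 else (-bound - ax) / ag
                alpha = min(alpha, max(limit, 0.0))
        x = [xi + alpha * gi for xi, gi in zip(x, g)]
    return x

vectors = [[1.0, 1.0, 1.0, 1.0], [1.0, -1.0, 1.0, -1.0]]
x = edge_walk(vectors, c=1.5)
```

The returned point always stays inside the polytope; whether it reaches a vertex within the step budget depends on the parameters, and this sketch makes no claim about how many coordinates end up at plus or minus one.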

Before moving on, note that the Edge-Walk algorithm can be defined for any polytope . This leads to the meta-question: Can we say anything interesting about the distribution on the vertices of a polytope induced by the walk? The analysis from our paper implicitly gives one such property which has to do with the distances of the constraints defining the vertex from the origin. Understanding this distribution better might be useful elsewhere.

**Discussion** This concludes the three post series. Perhaps, sometime in the future there will be other posts on other notable results in discrepancy theory (like the Beck-Fiala theorem). Keeping with the trend from the last two posts, let me end with another open question which strengthens Theorem 1 in a strong way:

**Komlos Conjecture** Given unit vectors , there exists such that

Theorem 1 follows from the above if we take ‘s to be the normalized columns of the matrix . The above conjecture also strengthens another beautiful conjecture due to Beck and Fiala; but that’s for another day. The most remarkable aspect of the above conjecture is that there is no dependence on the dimension. The best bound we know is due to Banaszczyk who showed a bound of which we’ll also leave for another day.


**Theorem 1** For vectors , there exists such that for every , .

In this post (second and the most technical of the three post series) we will see a proof of the above result. The theorem was proved by Spencer in 1985 using Beck’s partial coloring approach. Independently of Spencer, the result was proved by Gluskin in 1988 in a convex geometric context. Here we will review Gluskin’s proof which is quite beautiful.

Gluskin’s proof will also give us an excuse to look at some elegant (and simple to describe) results in convex geometry which may be of use elsewhere. Finally, the geometric view here will actually be useful in the next post when we discuss an algorithmic proof. Gluskin’s paper is truly remarkable and seems to reinvent several key ideas from scratch such as Sidak’s lemma, a version of Kanter’s lemma for Gaussians in convex geometry and even has the partial coloring approach implicit in it. I recommend taking a look at the paper even if it is a bit of a tough read. Much of the content in the post is based on discussions with Shachar Lovett and Oded Regev, but all mistakes are mine.

Gluskin’s proof follows the partial coloring approach with the crucial lemma proved using a volume argument. The partial coloring method was introduced by Beck in 1981 and all proofs of Theorem 1 and many other important discrepancy results in fact use this method. Here, instead of looking for a solution as in the theorem, one looks for a solution first. This is meaningless to begin with as we can just output the all zeros vector. The main idea is to instead look for a solution which has support. We then recurse on the set of coordinates which are set to . If everything goes according to plan, as we are geometrically decreasing the ambient dimension, we similarly get geometrically decreasing discrepancy bounds which we can tolerate. I won’t go into the details here, but let’s accept that it suffices to show the following:

**Lemma 2** For vectors , there exists such that for every , and .

To prove the above partial-coloring-lemma, let us first rephrase the problem in a geometric language. Let be the symmetric convex set (symmetric meaning implies ) defined as follows for to be chosen later:

We want to show that contains a lattice point of large support. We show this indirectly by proving that instead contains a lot of points from . Gluskin does this by a clever volume argument: first show that the volume of is large and then apply Minkowski’s theorem to show that there are many lattice points. To lower bound the first volume, Gluskin actually works in the Gaussian space.

I don’t have a clear intuitive reason for why the Gaussian distribution is better than the Lebesgue measure in this context. But if one looks at the analysis, a clear advantage is that projections behave better (when considering volumes) in the Gaussian space. For example, if we take a set like , then the Lebesgue volume of is infinite, but if we project along the first coordinate it becomes finite. In the Gaussian case, both volumes are the same.

We next go over all the main pieces in Gluskin’s proof.

**Sidak’s Lemma** Suppose we have a standard Gaussian vector . Then, for any unit vector , has the standard normal distribution. Now, suppose we have several unit vectors . Then, the random variables are individually standard normals, but are correlated with one another. Sidak’s lemma (1967) says that no matter what the correlations of ‘s are, to bound the probability that none of the ‘s is too large, the “worst-behaviour” one could expect is for them to be independent. Concretely:

**Lemma 3 (Sidak’s Lemma)** Let and let be a standard Gaussian vector. Then, for all ,

The proof of the lemma is actually not too hard and an excellent exposition can be found in this paper.

The lemma is actually a very special case of a longstanding open problem called the *Correlation Conjecture*. Let me digress a little bit to state this beautiful question. In the above setup, let *slab* . Then, Sidak’s lemma says that for ,

The correlation conjecture asserts that this inequality is in fact true for all symmetric convex sets (in fact, we only need to look at ). Sidak’s lemma says the conjecture is true for slabs. It is also known to be true for ellipsoids. The statement for ellipsoids also has a discrepancy implication leading to a vector generalization of Spencer’s theorem (pointed out to me by Krzysztof Oleszkiewicz). But that’s for another day.

**Kanter’s Lemma** The second inequality we need is a *comparison* inequality due to Kanter. The lemma essentially lets us lift certain relations between two distributions to their product distributions and I think should be useful in other contexts. For instance, I recently used it in this paper in a completely different context. To state the lemma, we need the notion of *peakedness* of distributions.

Let be two symmetric distributions on for some . We say is *less peaked* than (written ) if for all symmetric convex sets , . Intuitively, this means that is putting less of its mass near the origin than (hence the term less peaked). For example, .

Kanter’s lemma says that the peakedness relation tensorises provided we have *unimodality*. A univariate distribution is unimodal if the corresponding probability density function has a single maximum and no other local maxima. I won’t define what it means for a multivariate distribution to be unimodal here, but we only need the lemma for univariate distributions. See this survey for the formal definition.

**Lemma 4 (Kanter’s lemma)** Let be two symmetric distributions on such that and let be a unimodal distribution on . Then, the product distributions , on , satisfy .

The proof of the lemma is not too hard, but is non-trivial in that it uses the Brunn-Minkowski inequality. Combining the above lemma with the not-too-hard fact that the standard Gaussian distribution is less peaked than the uniform distribution on , we get:

**Minkowski’s Theorem** The final piece we need is the classical Minkowski’s theorem from lattice geometry:

**Theorem 6 (Minkowski’s Theorem)** Let be a symmetric convex set of Lebesgue volume more than for an integer . Then, contains at least points from the integer lattice .

**Putting Things Together** We will prove the partial coloring lemma (Lemma 2). The proof will be a sequence of simple implications using the above lemmas. Recall the definition of :

Our goal (Lemma 2) is equivalent to showing that contains a point of large support.

Note that is the intersection of slabs. Therefore, by Sidak’s lemma, for ,

where the last inequality follows from the fact that has the Gaussian distribution with standard deviation . Therefore, if we pick , sufficiently big then

Now, let be the uniform distribution on . Then, by Corollary 5, and the definition of peakedness, . Hence, the Lebesgue volume of is at least . Therefore, for sufficiently small , Lebesgue volume of . Thus, by applying Minkowski’s theorem to the symmetric convex set we get that has at least lattice points.

Now, note that the only lattice points in are elements of inside . Therefore has at least points from . By a simple counting argument at least one of these lattice points, , must have non-zero coordinates, which is exactly what we need to prove Lemma 2!

**Discussion** The above argument can actually be simplified by replacing the use of Kanter’s lemma with an appropriate version of Minkowski’s theorem for Gaussian volume as done here. But I like any excuse to discuss Kanter’s lemma.

More importantly, the proof seems to be more amenable to generalization. The core of the proof really is to use Sidak’s lemma to lower bound the Gaussian volume of the convex set . Whenever you have such a statement you should also get a corresponding discrepancy statement. In particular, the *matrix discrepancy* conjecture from the last post essentially reduces to the following probability question:

Question Is it true that for a universal constant , for all symmetric matrices with ,

**Acknowledgments** Thanks to Shachar Lovett, Oded Regev and Nikhil Srivastava for helpful suggestions, comments, and corrections during the preparation of this post.


This post starts with an 1888 existential proof by Hilbert, given a constructive version by Motzkin in 1965. We will then go through proof complexity and semidefinite programming to describe the SOS algorithm and how it can be analyzed. Most of this is based on my recent paper with Kelner and Steurer and a (yet unpublished) followup work of ours, but we’ll also discuss notions that can be found in our previous work with Brandao, Harrow and Zhou. While our original motivation to study this algorithm came from the Unique Games Conjecture, our methods turn out to be useful to problems from other domains as well. In particular, we will see an application for the *Sparse Coding* problem (also known as dictionary learning) that arises in machine learning, computer vision and image processing, and computational neuroscience. In fact, we will close a full circle as we will see how polynomials related to Motzkin’s end up playing a role in our analysis of this algorithm.

I am still a novice myself in this area, and so this post is likely to contain inaccuracies, misattributions, and just plain misunderstandings. Comments are welcome! For deeper coverage of this topic, see Pablo Parrilo’s lecture notes, and Monique Laurent’s monograph. A great accessible introduction from a somewhat different perspective is given in Amir Ali Ahmadi’s blog posts (part I and part II). In addition to the optimization applications discussed here, Amir mentions that “in dynamics and control, SOS optimization has enabled a paradigm shift from classical linear control to … nonlinear controllers that are provably safer, more agile, and more robust” and that SOS has been used in “areas as diverse as quantum information theory, robotics, geometric theorem proving, formal verification, derivative pricing, stochastic optimization, and game theory”.

**1. Sum of squares and the Positivstellensatz**

One of the tedious exercises in high school mathematics involves finding the minimal value of a polynomial such as

To solve this, we were taught to go over all critical points (where the derivative vanishes) and check their values, so eventually we get a picture like this

and can deduce that the minimum is achieved at the point .

As theoretical Computer Scientists, we can make this feeling of tedium precise: this method involves exhaustive search over all local optima and hence can take time for an -variate polynomial, even if it is only of degree . Unfortunately, in general, we do not know of a better way. Indeed, in practice people use generalizations of the high school method such as Gradient Descent to solve such problems, and unsurprisingly often these algorithms do not return the global minimum but get stuck at a local optimum. In fact, we know one cannot do better in the worst case, as one can easily transform an -variable 3SAT formula into an -variable degree- polynomial which will have as its global minimum if and only if is satisfiable. (It is easy to come up with a degree polynomial such that for every , equals the number of clauses satisfied by , while the second term is strictly positive for any . See also Ahmadi’s blog post for a reduction to a degree polynomial.)
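To illustrate the first ingredient of such a reduction, here is a minimal Python sketch of the standard arithmetization of a 3SAT formula: a degree-3 polynomial that, on 0/1 inputs, counts the satisfied clauses. The toy formula and the function names are hypothetical, and the sketch covers only the clause-counting part, not the extra term that penalizes non-Boolean inputs.

```python
from itertools import product

# A clause is a tuple of non-zero ints: i means x_i, -i means NOT x_i.
# Hypothetical toy formula, chosen only for illustration.
FORMULA = [(1, -2, 3), (-1, 2, 3), (1, 2, -3), (-1, -2, -3)]
NUM_VARS = 3

def clause_polynomial(clause, x):
    """Degree-3 polynomial that is 1 if x satisfies the clause, 0 otherwise (0/1 inputs)."""
    falsified = 1
    for lit in clause:
        value = x[abs(lit) - 1]
        # A positive literal is falsified when x_i = 0, a negated one when x_i = 1.
        falsified *= (1 - value) if lit > 0 else value
    return 1 - falsified

def satisfied_count_polynomial(formula, x):
    """Sum of the clause polynomials: the number of satisfied clauses."""
    return sum(clause_polynomial(c, x) for c in formula)

def satisfied_count_direct(formula, x):
    """Reference implementation that evaluates each clause logically."""
    return sum(any((x[abs(l) - 1] == 1) == (l > 0) for l in c) for c in formula)

for x in product((0, 1), repeat=NUM_VARS):
    print(x, satisfied_count_polynomial(FORMULA, x))
```

Maximizing this polynomial over Boolean points is exactly MAX-3SAT, which is why minimizing even low-degree polynomials is hard in the worst case.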

However, sometimes there is a nicer, more global way to argue about the extremal points of polynomials. For example, one can see that the minimum of the polynomial in (1) is equal to by writing it as

using the simple (but deep!) fact that a square of a number is never negative.

Bruce Reznick paraphrased Jane Austen to say

It is a truth universally acknowledged, that a mathematical object whose orderings are non-negative must be in want of a representation as a sum of squares.

Despite this saying, the 3SAT example above implies that (assuming NP ≠ coNP) there exists a non-negative polynomial that is not equal to a sum of squares (SOS) of other polynomials, as otherwise we could have had a short proof of unsatisfiability for every 3SAT formula. However proving the existence of such a polynomial unconditionally is not so trivial. Hilbert first gave a non-constructive proof in 1888, and it took almost 80 years until Motzkin gave an explicit example: the Arithmetic-Mean Geometric-Mean (AMGM) inequality implies that the polynomial

is always non-negative since the last term is the geometric mean of , , and , but it turns out that it does not equal a sum of squares. (See Reznick’s or Schmüdgen’s surveys for the proof and more on the history of this problem.)
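For readers who like to see numbers, here is a quick Python sanity check. It assumes the standard form of Motzkin’s polynomial, 1 + x^4 y^2 + x^2 y^4 - 3 x^2 y^2, and evaluates it on a grid; the grid and the variable names are arbitrary choices, and a grid check is of course evidence rather than a proof.

```python
def motzkin(x, y):
    """Motzkin's polynomial in its standard (assumed) form."""
    return 1 + x**4 * y**2 + x**2 * y**4 - 3 * x**2 * y**2

# Grid over [-3, 3]^2 with step 0.05; it contains the points (+-1, +-1) exactly.
grid = [i / 20 for i in range(-60, 61)]
min_value = min(motzkin(x, y) for x in grid for y in grid)

print("minimum on the grid:", min_value)
print("value at (1, 1):", motzkin(1.0, 1.0))
```

The zeros at (±1, ±1) are exactly the points where the three terms 1, x^4 y^2 and x^2 y^4 coincide, which is the equality case of AMGM.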

However, we can still prove Motzkin’s polynomial is non-negative via a simple and short “SOS proof” since it is the sum of squares of four *rational* functions:

Already Hilbert was aware of this notion of SOS proofs, and asked as his 17th problem whether we can always certify the non-negativity of any polynomial by such a proof. Through works of Artin (1927), Krivine (1964), and Stengle (1974), we know the answer is a resounding *yes!*. These results, which serve as the foundation of real algebraic geometry, are known as the *Positivstellensatz* and hold for even more general polynomial equalities and inequalities. In particular, given some polynomials , the set of equations

is unsatisfiable if and only if this can be certified by finding polynomials and a sum of squares polynomial such that

(The Positivstellensatz applies to polynomial *inequalities* as well, but in the context of optimization, one can always transform an inequality into the equality by introducing an auxiliary variable . Even without using inequalities, the Positivstellensatz fundamentally relies on the fact that the real numbers, as opposed to the complex numbers, form an ordered field.)

**2. Proof systems and algorithms**

As Theoretical Computer Scientists, we typically try to make *qualitative* questions into *quantitative* ones, and indeed in 1999 Grigoriev and Vorobjov defined the *Positivstellensatz proof complexity* of a set of equations as the minimal *degree* needed for the polynomials and in (4). Note that a degree SOS proof can always be written down using coefficients, and hence the 3SAT example suggests that at least some equations require a large (i.e., ) degree, and such a result was indeed shown by Grigoriev in 2001.

Bounds on the degree turn out to be very important for optimization as well. Several researchers, including N. Shor, Y. Nesterov, P. Parrilo, and J. Lasserre, realized independently that it is possible to search for a degree SOS proof in time . This follows from a correspondence between polynomials that are sums of squares and positive semidefinite (p.s.d.) matrices, combined with the fact that the latter convex set has an efficient separation oracle, and hence can be efficiently optimized over. (This is known as semidefinite programming.) Indeed, for any -variable degree- polynomial we can define an matrix such that for every , . (As written this assumes is a homogeneous polynomial, but this can be suitably generalized to the non-homogeneous case as well.) By definition, the matrix is p.s.d. if and only if it equals for some vectors . Therefore one can see that if is p.s.d. then each one of those ‘s defines a degree polynomial such that for every . The other direction works in a similar way.
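The correspondence between p.s.d. matrices and sums of squares is easy to try on a small case. The NumPy sketch below takes a hypothetical bivariate quartic (not one from this post), writes it as a quadratic form in the monomials x^2, y^2, xy, and reads off a sum-of-squares decomposition from the eigenvectors of the matrix. The polynomial, the matrix Q and all names are illustrative assumptions.

```python
import numpy as np

# Hypothetical example polynomial:
#   p(x, y) = 2x^4 + 2x^3 y - x^2 y^2 + 5y^4
# written as z^T Q z with the monomial vector z = (x^2, y^2, x*y).
Q = np.array([[ 2.0, -3.0, 1.0],
              [-3.0,  5.0, 0.0],
              [ 1.0,  0.0, 5.0]])

def p(x, y):
    return 2 * x**4 + 2 * x**3 * y - x**2 * y**2 + 5 * y**4

def monomials(x, y):
    return np.array([x**2, y**2, x * y])

# Q is p.s.d. exactly when all of its eigenvalues are non-negative
# (up to floating-point error; this Q has one zero eigenvalue).
eigenvalues, eigenvectors = np.linalg.eigh(Q)
assert eigenvalues.min() > -1e-9

# Each eigenpair (lam, v) contributes one square: q(x, y) = sqrt(lam) * <v, z(x, y)>.
coefficients = eigenvectors * np.sqrt(np.clip(eigenvalues, 0.0, None))

def sum_of_squares(x, y):
    q_values = coefficients.T @ monomials(x, y)
    return float(np.sum(q_values**2))

print(p(1.0, 2.0), sum_of_squares(1.0, 2.0))
```

The matrix Q is not unique (the monomials satisfy relations such as x^2 * y^2 = (xy)^2), which is exactly why finding a p.s.d. representative is a semidefinite feasibility problem rather than a direct computation.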

*So, we can efficiently certify the unsatisfiability of a set of polynomial equations if it has a low degree SOS proof. But under what conditions will it have such a proof? And what about finding solutions for satisfiable polynomial equations?*

Progress on these questions has been quite slow. For computational problems of interest, degree upper and lower bounds have both been very hard to come by. We essentially knew only one degree lower bound: Grigoriev’s result mentioned above (later rediscovered by Schoenebeck and expanded upon by Tulsiani). As for upper bounds, until very recently we essentially had none, in the sense of having no significant algorithmic results using the SOS algorithm that did not already follow from weaker algorithms. However, some recent results, including the quasipolynomial time quantum separability algorithm of Brandao, Christandl and Yard, and our results on solving “hard” unique games instances, gave some signs that the SOS framework does have the potential to solve problems beyond the reach of other techniques. In a very recent work with Kelner and Steurer, we show a general way to exploit the power of SOS, which I will now describe.

**3. Pseudoexpectations and Combining algorithms**

Our approach for using SOS to solve computational problems is based on the following wise proverb

It is easier to solve a problem if you already have a solution.

Specifically, fix the set of polynomial equations (3). A *combining algorithm* is an algorithm that takes as input a (multi) set of solutions for (3), and outputs a single solution . Based on the proverb above, I guess most readers of this blog would be able to come up with such an algorithm. Now, to make things more challenging, imagine that does not get represented as a list of all its elements (which, after all, could be exponentially large), but rather only gets some *low order statistics* of . That is, for some smallish (say or ), gets a vector such that for every , (where we identify with the distribution over its elements). This is indeed a harder task, but it still seems much easier than solving the problem from scratch. Surprisingly, it turns out that you can often reduce solving the equations to constructing such a combining algorithm. Our approach works as follows— given a combining algorithm , instead of giving it moments of an actual distribution over solutions (which of course we don’t have), we feed it with a “fake moment” vector . If we’re lucky, won’t notice the difference and will still output a good solution .

How do we construct those “fake moments”? And when will we be lucky? Those “fake moments” are more formally known as *pseudoexpectations*, and can be found using a semidefinite program which is the dual to the SOS program. In particular, they will obey some of the consistency properties of actual moments, most importantly that for every degree polynomial , if we combine the fake moments given by linearly to compute the presumed value of then it will be non-negative. Those properties imply that if we have a *proof* that works as a combining algorithm, and that proof can be encapsulated in the SOS framework with not too high a degree, then that proof in fact shows that will still output a good solution even when it is fed the fake moments.
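The key consistency property, that the presumed expectation of every square is non-negative, amounts to a moment matrix being p.s.d. The NumPy sketch below illustrates this at the lowest level (moments up to degree 2 in two variables): moments of an actual distribution always pass the test, and an inconsistent table of numbers can fail it. The points, the matrices and the function names are all made up for illustration; a genuine pseudoexpectation would come out of an SDP solver, which is not shown here.

```python
import numpy as np

# Hypothetical multiset of "solutions" in R^2.
solutions = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [-1.0, 0.0]])

def moment_matrix(points):
    """E[z z^T] for z = (1, x1, x2): all moments of degree at most 2."""
    z = np.hstack([np.ones((len(points), 1)), points])
    return z.T @ z / len(points)

def expectation_of_square(M, c):
    """Presumed value of E[p(x)^2] for p(x) = c0 + c1*x1 + c2*x2, using only the moments in M."""
    c = np.asarray(c, dtype=float)
    return float(c @ M @ c)

M_true = moment_matrix(solutions)

# An inconsistent table of "moments": it claims E[x1] = 2 but E[x1^2] = 1,
# so the presumed value of E[(x1 - 2)^2] comes out negative.
M_inconsistent = np.array([[1.0, 2.0, 0.0],
                           [2.0, 1.0, 0.0],
                           [0.0, 0.0, 1.0]])

print(np.linalg.eigvalsh(M_true).min() >= -1e-9)              # real moments: p.s.d.
print(expectation_of_square(M_inconsistent, [-2.0, 1.0, 0.0]))  # negative: rejected
```

The semidefinite program searches over exactly the tables that pass this test (at higher degree, and together with the constraints of the system), whether or not they come from a real distribution.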

**4. Machine learning application: the sparse coding problem**

To make things more concrete, we now sketch an example taken from an upcoming paper with Kelner and Steurer. One can see other examples in our original paper. One of the challenging tasks for machine learning is to find good *representations* for data. For example, representing a picture as a bitmap of pixels works great for projecting it on a screen, but is not as well suited for trying to decide if it’s a picture of a cat or a dog. Finding the “right” representation for, say, images, is often a first task not just in learning applications, but also for other tasks such as edge detection, image denoising, image completion and more. *Sparsity* is one way to capture “rightness” of representation. For example, it is often the case that the data is sparse when represented in the “right” linear basis (such as the Fourier or wavelet bases). Olshausen and Field (1997) argued that this may be the way that images are represented in the visual cortex, and they used a heuristic alternating-minimization based algorithm to recover the following basis (also known as a *dictionary*) for natural images.

But of course as theorists we are not satisfied with a heuristic that works fast and achieves good results. We want a slower algorithm with worse performance that we can prove something about. This is where Sum of Squares comes to the rescue. (At least on some of the counts: it is definitely slower, and we can prove something about it; however “unfortunately” beyond just amenability to analysis there is an (unverified) chance that it actually outperforms the heuristics on some instances, since at least in principle it can avoid getting stuck in local minima.) Recently there have been several works giving rigorous guarantees for sparse coding (e.g., [SWW12, AGM13, AAJNT13]) but all of them require the representations to be “super-sparse”, that is, to have fewer than significant non-zero coordinates. (There are some works handling the denser case such as [FJK96, GVX13, ABGM14], but they make rather strong assumptions on the dictionary and/or distribution of samples.) In our new work, we use the SOS algorithm to recover the dictionary as long as the number of nonzero coefficients is less than for some sufficiently small constant . (For these parameters our algorithm takes quasipolynomial time; it takes polynomial time if the number of nonzero coefficients is at most for an arbitrarily small .)

We now sketch the ideas behind the solution. For simplicity of notation, let’s assume that the “right basis” is in fact an orthonormal basis , and that we obtain examples of the form , where the ‘s are independently and identically distributed random variables with and for some . Like the works [AGM13, AAJNT13], our methods apply to much more general settings, including non-independent ‘s and overcomplete dictionaries, but this is a good case to keep in mind.

We focus on the task of approximately recovering a single basis vector: once you have such a subroutine, recovering all vectors is not that hard. We do so in three steps:

- Using the examples observed, we construct a system of polynomial equations. We argue that any solution to is a good solution for us (namely close to one of the basis vectors).
More concretely, we will find some polynomial whose projection on the sphere looks somewhat like this

and whose maximum points correspond to the unknown vectors of our dictionary.

- On its own, phrasing the question as a polynomial system is not necessarily helpful, since is not convex, and so in general we don’t know how to solve it. However, we do show how to solve the easier problem of coming up with a *combining algorithm* that, given the low order moments of a distribution over solutions of , manages to recover a single solution.
- We then verify that all our arguments in Steps 1 and 2 can be encapsulated in the SOS proof system with a low degree, and hence will still work even when fed “fake moments”, thus establishing the result.

We now describe how to implement those three steps.

**Step 1: The system of equations.** Given examples each one of the form , we construct the polynomial , and consider the equations defined as follows:

(As mentioned before, we can always translate the inequality into an equality by adding an auxiliary variable.)

As grows, converges to the polynomial

It suffices to take to get good enough convergence for our purposes, and so we will just assume that is equal to (5). Expanding the parentheses, and using the fact that and , we see that

which equals

since the ‘s are an orthonormal basis.
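To see this kind of computation in action, here is a small Python check under assumptions that I am spelling out myself: the limiting polynomial is taken to be the fourth moment E⟨y, u⟩^4 of a sample y = Σ x_i a_i, and the coefficients x_i are independent with mean 0, variance 1 and fourth moment τ. Under these assumptions the expansion gives 3‖u‖^4 + (τ − 3) Σ ⟨a_i, u⟩^4. The basis, the coefficient distribution and all names are hypothetical; the expectation is computed exactly by enumeration rather than by sampling.

```python
from itertools import product
import math

# Assumed orthonormal basis of R^3 (hypothetical, for illustration only).
R = 1 / math.sqrt(2)
BASIS = [(R, R, 0.0), (R, -R, 0.0), (0.0, 0.0, 1.0)]

# Assumed coefficient distribution: x_i = +S or -S with probability p/2 each, else 0.
# With S = 1/sqrt(p) this gives E[x_i] = 0, E[x_i^2] = 1 and E[x_i^4] = 1/p = TAU.
P_NONZERO = 0.1
S = 1 / math.sqrt(P_NONZERO)
TAU = 1 / P_NONZERO
OUTCOMES = [(S, P_NONZERO / 2), (-S, P_NONZERO / 2), (0.0, 1 - P_NONZERO)]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def exact_fourth_moment(u):
    """E <y, u>^4 for y = sum_i x_i a_i, enumerating all 3^3 outcomes of (x_1, x_2, x_3)."""
    total = 0.0
    for combo in product(OUTCOMES, repeat=len(BASIS)):
        prob = math.prod(pr for _, pr in combo)
        inner = sum(x * dot(a, u) for (x, _), a in zip(combo, BASIS))
        total += prob * inner**4
    return total

def closed_form(u):
    """3 ||u||^4 + (TAU - 3) * sum_i <a_i, u>^4."""
    return 3 * dot(u, u) ** 2 + (TAU - 3) * sum(dot(a, u) ** 4 for a in BASIS)

for u in [BASIS[0], (1.0, 0.0, 0.0), (0.3, -0.5, 0.8)]:
    print(exact_fourth_moment(u), closed_form(u))
```

On the unit sphere the first term is constant, so the polynomial is largest where Σ ⟨a_i, u⟩^4 is largest, namely at the basis vectors: in this toy instance the value is τ = 10 at a basis vector and only 6.5 at the unit vector (1, 0, 0).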

Note that for all , and hence the system is feasible. Moreover, if and then

which for equals (using the fact that the ‘s are an orthonormal basis), and this means

which means that

for some .

**Step 2: Combining algorithm.** I will describe a particularly simple combining algorithm that will involve moments up to logarithmic order, and hence result in a quasipolynomial time algorithm. Under some conditions (namely ) we are able to give a polynomial-time algorithm.

The combining algorithm gets access to a distribution over solutions of , and needs to output a single solution. We do so by choosing a set of random Gaussian vectors for , and considering the matrix defined as

where . We output the top eigenvector of .

We claim that with probability , this vector will be highly correlated with one of the ‘s. To see this, note that every element in the support of is very close to one of the ‘s. In particular, there is an such that is close to with probability at least . We can assume without loss of generality, and so is a convex combination of two distributions and such that every element in the support of is close to .

Note that for any unit vector , for a standard Gaussian and hence . Now let us condition on the event that all of the vectors satisfy ; this will happen with probability. This would mean that is roughly for that is close to , while it has a much lower value for ‘s that are not close to it. Thus, almost all the weight in (7) is on the ‘s close to , meaning that is roughly equal to a constant times , and in particular its top eigenvector will be close to .
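The following NumPy sketch runs this combining algorithm on a toy instance. It assumes that the matrix has the form E[p(u) u uᵀ] over u drawn from the distribution, with p(u) the product of the squared inner products ⟨w_j, u⟩^2, which is my reading of the description above. The dimension, the number of Gaussian vectors, the noise level, the seed and all names are arbitrary illustrative choices, and the distribution here is an actual one rather than a pseudo-distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, samples_per_vector = 5, 10, 20

# Assumed dictionary: a random orthonormal basis (the columns of A).
A, _ = np.linalg.qr(rng.standard_normal((n, n)))

# Stand-in for the distribution D: unit vectors very close to the basis vectors.
points = []
for i in range(n):
    noisy = A[:, i] + 1e-4 * rng.standard_normal((samples_per_vector, n))
    points.append(noisy / np.linalg.norm(noisy, axis=1, keepdims=True))
U = np.vstack(points)                      # shape (n * samples_per_vector, n)

# Random Gaussian directions w_1, ..., w_k and the weights p(u) = prod_j <w_j, u>^2.
W = rng.standard_normal((k, n))
weights = np.prod((U @ W.T) ** 2, axis=1)  # one weight per point of D

# Assumed form of the matrix: M = E_{u ~ D} [ p(u) * u u^T ].
M = (U * weights[:, None]).T @ U / len(U)

# Output the top eigenvector of M (eigh returns eigenvalues in ascending order).
eigenvalues, eigenvectors = np.linalg.eigh(M)
v = eigenvectors[:, -1]
correlations = np.abs(A.T @ v)             # |<a_i, v>| for each basis vector

print(np.round(correlations, 3))
```

Without the reweighting the matrix would be close to a multiple of the identity and its eigenvectors would carry no information; the random product p(u) breaks the symmetry in favor of one basis vector, and the top eigenvector then aligns with it.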

**Step 3: Lifting to pseudoexpectations.** Step 3 is in some sense the heart of the proof, moving from a combining algorithm, that requires an actual distribution (which we don’t have), to a rounding algorithm, that only needs a pseudo-distribution (which we can find using semidefinite programming). However, it is also the most tedious, since it involves going over the steps of the proof one by one, verifying that each one only used SOS arguments. However, occasionally “lifting” the proofs for pseudoexpectations requires more care. Rather than giving the full proof here, we show just one example of how we deal with one of those more subtle cases.

Recall that while deriving (6), we used the following simple inequality:

(substituting for ). One challenge in lifting this inequality to pseudo-expectations is that the operation is not a low degree polynomial. Since the norm can be approximated by the norm for large , it turns out it suffices to prove the statement

for some large . Of course to make this a polynomial statement, we need to raise this to the power and get

(9) becomes non-trivial to prove in SOS for so let us focus on the case. The LHS has the form

The RHS has the form

Fixing some particular , and dividing both sides by we see we need to show that

(10) follows from

but this is just our old friend the AMGM inequality . We might worry that this is closely related to Motzkin’s polynomial, whose claim to fame is exactly the fact that despite being non-negative, it is not a sum of squares. There are two answers to this concern. The first one is that we don’t care, since Motzkin’s polynomial does have a low degree SOS proof (using its representation as a sum of squares of rational functions) and this is good enough for us to show that it holds for pseudo-expectations as well. The second one is that, at least in my experience, if you stare at (11) long enough, then eventually you get tired of staring, look up and realize you are in a workshop on semidefinite programming, where you can easily find an expert or two that will show you why the RHS minus the LHS is actually a sum of squares. (Apparently, this was already proven by Hurwitz in 1891; see equation (1.6) here.) Thus (11) holds if the ‘s come from a pseudo-expectation, which is what we wanted to prove. (Going over the proof, one can see that it extends to much more general settings of both the dictionary and the distribution.)
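To give a flavor of what such a certificate looks like, here is the four-variable case of AMGM, where the gap a^4 + b^4 + c^4 + d^4 − 4abcd equals (a² − b²)² + (c² − d²)² + 2(ab − cd)². This classical identity is offered only as an illustration and is not necessarily the form in which (11) is proved; the Python below just checks it numerically on hypothetical random test points.

```python
import random

def amgm_gap(a, b, c, d):
    """a^4 + b^4 + c^4 + d^4 - 4abcd, non-negative by AMGM applied to the fourth powers."""
    return a**4 + b**4 + c**4 + d**4 - 4 * a * b * c * d

def sos_certificate(a, b, c, d):
    """The same quantity written explicitly as a sum of squares of polynomials."""
    return (a**2 - b**2) ** 2 + (c**2 - d**2) ** 2 + 2 * (a * b - c * d) ** 2

random.seed(0)
test_points = [tuple(random.uniform(-2, 2) for _ in range(4)) for _ in range(1000)]
max_error = max(abs(amgm_gap(*pt) - sos_certificate(*pt)) for pt in test_points)

print("largest discrepancy between the two sides:", max_error)
```

Because every term on the right is the square of a polynomial of degree 2, the inequality has a degree-4 SOS proof, and so it continues to hold when the variables are replaced by a pseudo-expectation of sufficient degree.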

**Conclusion**

While until recently we had very few tools to take advantage of the SOS algorithm (at least in the sense of having rigorous analysis), we now have some indications that, when applied to the right problems, it can be a powerful tool that may be able to achieve goals that resisted previous attempts. We have seen some examples of this phenomenon, but I hope (and believe) the best is yet to come, and am looking forward to seeing how this research area develops in the near future.

**Acknowledgements.** Many thanks to David Steurer for great suggestions, corrections, and insights that greatly improved this blog post, as well as for patiently explaining to me for the nth time the difference between the Positivstellensatz and Nullstellensatz. Thanks to Amir Ali Ahmadi, Jon Kelner, and Pablo Parrilo for showing me the SOS proof for (11) in the ICERM workshop on semidefinite programming and graph partitioning. Thanks to Amir for also pointing out an error in my previous reduction from 3SAT.


**Remember the audience.**

One of the challenging aspects of writing a FOCS submission (and in fact any scientific paper) is that it needs to address different types of readers simultaneously. A non-exhaustive list includes:

1) The expert in your field that wants to verify all the details and understand how your approach differs from their 2007 paper.

2) A non-expert reviewer that wants to understand what you did, why the question is motivated, and get some sense of the techniques you used.

3) A PC member (such as yours truly) that was not assigned to review the paper, and wants to get some approximation of the above by just skimming it for a few minutes.

A general rule of thumb for addressing those different readers is to make sure that the paper’s first few sections are accessible to a general theoretical CS audience, while later sections might contain the details that are mainly of interest to experts in the field. This brings us to our next point.

**Put your best foot forward.**

While there is no hard page limit, FOCS reviewers are not expected to read all submissions in full. In practice, this means you should follow what I call “Impagliazzo’s Rule”: The first X pages of the paper should make the reader want to read the next X pages, for any value of X.

In particular, you should make sure that your **results**, your **techniques**, the **motivation** for your work, its **context** and **novelty** compared to prior works are **clearly stated early in the paper**. If your main theorem is hard to state succinctly, you can state an informal version, or an important special case, adding a forward reference to the place where it’s stated in full.

The above applies not just to results but also to the techniques. Don’t wait until the technical section to tell us about your novel ideas. Some of the best written papers follow the introduction with a section such as “Our techniques”, “Proof outline”, “Warmup” or “Toy problem” that illustrates the ideas behind the proofs in an informal and accessible way.

While modesty is a fine virtue, you don’t want to overdo it in a FOCS submission, and hide your contributions in some remote corners of the paper. Of course, you don’t want to go too far in the other direction, and so you should also

**Put your worst foot forward.**

As scientists, we bend over backwards to show the potential flaws, caveats, and room for improvement in our work, and I expect nothing less from FOCS authors. It can be extremely frustrating for a reviewer to find out that the result is restricted in a significant way only when she reaches Section 4 of the paper. All **restrictions**, **caveats**, **assumptions**, and **limitations** should be described early in the paper. In fact, some caveats are so major that you shouldn’t wait to state them even until the introduction. For example, if you prove a lower bound that holds only for monotone circuits, then not only should this be clearly stated in the abstract, the word “monotone” should probably appear in the title. Generally speaking, if you’ve made a choice in modeling the problem that makes it easier, you should discuss this and explain what would have changed had you made a different choice. Similarly, any **relations and overlap with prior works** should be clearly described early in the paper. If the result is a generalization of prior work, explain how they differ and what motivates the generalization. If it improves in some parameters but is worse in others, a discussion of the significance of these is in order. If there is a related work you are aware of, even if it’s not yet formally published, or was done after your work, you should still cite it and explain the relation between the two works and the chronology.

**Kill the suspense.**

A scientific paper is not a novel and, ideally, readers should not be staying in suspense or be surprised negatively or positively. The FOCS PC is an incredibly talented group of people, but you should still write your paper in a “**foolproof**” way, trying to **anticipate all questions and misunderstandings** that a reader may have (especially one that needs to review 40 papers under time pressure).

For example, it can be extremely annoying for an author to get a review saying “the proof of the main theorem can be vastly simplified by using X” where X is the thing you tried first and doesn’t work. The way to avoid it is to add a section titled “First attempt” where you discuss X and explain why it fails. Similarly, if there is a paper that at first look seems related to your work, but turns out to be irrelevant, then you should still cite it and explain why it’s not actually related.

Another annoyance is when the reviews give the impression that the paper was rejected for being “too simple”. I and the rest of the FOCS PC believe that simplicity is a great virtue and never a cause for rejection. But you don’t want the reviewer to be surprised by the simplicity, discovering only late in the paper that the proof is a 3 line reduction to some prior work. If the proof is simple then be proud of this fact, and announce it right from the start. Similarly, if the proof of a particular lemma involves some routine applications of a standard method, you don’t need to remove it or move it to the appendix, but do add a sentence saying this at the proof’s start, so the less detail-oriented reviewers will know to skip ahead. This applies in the other case as well: if the proof involves a novel twist or a subtle point, you should add a sentence alerting the reader to look out for that point.

**Summary**

Writing a scientific paper is often a hard task, and I and the rest of the PC deeply appreciate your decision to send us your work and make the effort to communicate it as clearly as possible. We hope you find the above suggestions useful, and are looking forward to reading your submission.


As a preview of what’s to come, in the next post I will discuss Gluskin’s geometric approach to proving the `Six Standard Deviations Suffice’ theorem (Theorem 1 below, but with a constant different from ) and in the third post we will look at some algorithmic applications and approaches. Much of the content in the posts is based on discussions with Shachar Lovett and Oded Regev, but all mistakes are mine.

We all know what the union bound is. But if you don’t, here’s how the well-known mathematical genius S. Holmes described it: “When you have eliminated the impossible, whatever remains, however improbable, must be the truth”. In non-sleuth terms this says that if you have arbitrary events over some probability space, then

.

In particular, if the latter sum is less than , then there exists a sample point where none of the events occur.

By now, we have seen several strangely simple and powerful applications of this simple precept – e.g., existence of good Ramsey graphs (Erdős), existence of good error correcting codes (Shannon), existence of good metric embeddings (Johnson-Lindenstrauss). And the union bound and further variants of it are one of the main techniques we have for showing existential results.

However, the union bound is indeed quite naive in many important contexts and is not always enough to get what we want (as nothing seems to be these days). One particularly striking example of this is the Lovász local lemma (LLL). But let us not digress along this (beautiful) path. Here we will discuss how “beating” the union bound can lead to some important results in the context of discrepancy theory.

**Discrepancy and Beating the Union Bound: Six Suffice** A basic probabilistic inequality which we use day-in and day-out is the Chernoff bound. The special case of interest to us is the following: for any vector and ,

In words, the probability that the sum falls standard deviations away is at most . Combining the above with a simple union bound we get the following: For vectors ,

In particular, if we choose , then the right hand side above is less than and we get that there exists such that for every , . Can we do better? It is important to note that the above argument is actually tight for uniformly random signs and we want to do better by a careful choice of signs. This is exactly what Spencer showed in his seminal “Six Standard Deviations Suffice” result (1985) (Spencer’s proof as well as all other proofs easily generalize to the case when there are more vectors than the degrees of freedom and we only require each vector to have bounded entries.):

Theorem 1 For all , vectors , there exists such that for every , .

We will see a proof of the theorem in the next post.
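To make the Chernoff-plus-union-bound argument concrete, here is a small exhaustive experiment in Python. It takes n random ±1 vectors (a hypothetical input, with n = 10 so that all 2^n sign patterns can be enumerated), sets λ = sqrt(2 ln(4n)) so that the union bound 2n·exp(−λ²/2) equals 1/2, and measures what fraction of sign vectors actually stay within λ·sqrt(n), as well as the best discrepancy achievable by a careful choice. The constants and names are illustrative assumptions.

```python
from itertools import product
import math
import random

n = 10
random.seed(1)
# Hypothetical input: n random vectors in {-1, 1}^n.
vectors = [[random.choice((-1, 1)) for _ in range(n)] for _ in range(n)]

def discrepancy(signs):
    """max_j |<a_j, signs>| for the fixed list of vectors."""
    return max(abs(sum(a_i * s_i for a_i, s_i in zip(a, signs))) for a in vectors)

# Chernoff plus union bound: Pr[discrepancy > lam * sqrt(n)] <= 2n * exp(-lam^2 / 2).
# With lam = sqrt(2 ln(4n)) the right-hand side is exactly 1/2.
lam = math.sqrt(2 * math.log(4 * n))
threshold = lam * math.sqrt(n)

all_signs = list(product((-1, 1), repeat=n))
values = [discrepancy(s) for s in all_signs]
fraction_good = sum(v <= threshold for v in values) / len(values)
best = min(values)

print(f"threshold {threshold:.2f}, fraction within threshold {fraction_good:.3f}, best {best}")
```

The union bound only guarantees that at least half of the sign vectors are good; the gap between the threshold and the best value found by exhaustive search is the room that a careful choice of signs can exploit.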

**Discrepancy and Beating the Union Bound (and some): Paving Conjecture** As discussed in this post a few months back, the stunning result (adjective is mine) of Adam Marcus, Daniel Spielman and Nikhil Srivastava proving the paving conjecture (and hence resolving the Kadison-Singer problem) can also be cast in a discrepancy language. Let me repeat this from Nikhil’s post for completeness. Let us look at a specific variant of the ‘Matrix Chernoff Bound’ (see Nikhil’s post for references; here denotes the spectral norm of a matrix):

Theorem 2 Given symmetric matrices ,

Note that the above is a substantial generalization of Equation 2 which corresponds to the special case when the matrices ‘s are diagonal matrices with entries given by the vectors . In a very naive but still meaningful way, one can view the factor on the right hand side as again appearing partially because of a union bound.

An implication of the above theorem is that for symmetric matrices , for uniformly random signs , with high probability

Just as for the scalar case, the factor is necessary in general for uniformly random signs. Can we do better? There are two facets to this question both of which seem to be quite important and basic (and perhaps correspondingly hard):

*Question 1:* Can the factor be improved by picking the signs carefully instead of uniformly at random?

*Question 2:* Can the factor be improved for uniformly random signs under some nice conditions on the matrices ‘s?

Here we will focus on the first question, but let me say a couple of words about the second one. In all known examples showing tightness of the matrix Chernoff bound, the matrices seem to have a lot of commutative structure (e.g., diagonal matrices) and intuitively non-commutativity should help us in improving the bound. Perhaps there is a quantitative way of capturing non-commutativity to do better (see the discussion here for one such attempt).

Regarding the first question, Spencer’s theorem gives a positive answer in the special case when ‘s are diagonal matrices with entries (or more generally, bounded entries). The breakthrough result of Marcus, Spielman and Srivastava amounts to giving an exact characterization of when we can do better if the matrices ‘s are rank one positive semi-definite matrices.

**Discrepancy and Beating the Union: Matrix Spencer Theorem?**

The above discussions prompt the following question:

Conjecture For any symmetric matrices with , there exist signs such that .

The above conjecture is a strong generalization of Spencer’s theorem which corresponds to the matrices ‘s being diagonal. The earliest reference I am aware of for it is this paper. I can’t think of any concrete applications of the conjecture, but I quite like it because of the simplicity of the statement. Admittedly, I also have a personal bias: my first foray into discrepancy theory (with Shachar Lovett and Oded Regev) was to study this question.

One can view the conjecture as giving a partial answer to *Question 1*. In the case when , , so that the right hand side of Equation 3 is . Thus, the above conjecture beats the matrix Chernoff bound by getting rid of the term, but instead introduces a term depending on the “sum-of-norms” of the ‘s as opposed to having a dependence on the “norm-of-the-sum” of ‘s.

In the next post, I will discuss Gluskin’s proof of Theorem 1, which for now (to me) seems to be the most promising approach for proving the conjecture.

**Acknowledgments**

Thanks to Shachar Lovett, Oded Regev and Nikhil Srivastava for helpful suggestions, comments, and corrections during the preparation of this post.


———————

In this post I’ll explain a cute use of differential privacy as a tool in probabilistic analysis. This is a great example of differential privacy being useful for something other than privacy itself, although there are other good examples too. The main problem we’ll be looking at is the analysis of mostly independent random variables, through the lens of a clustering problem I worked on many years ago.

In the problem of clustering data drawn from a Gaussian mixture, you assume that you are provided access to a large volume of data, each point of which is a sample from one of a few high-dimensional Gaussian distributions. Each of the Gaussian distributions is determined by a mean vector and a co-variance matrix, and each Gaussian has a mixing weight describing its relative fraction of the overall population of samples. There are a few goals you might have from such a collection of data, but the one we are going to look at is the task of clustering the data into parts corresponding to the Gaussian distributions from which they were drawn. Under what conditions on the Gaussians is such a thing possible? [please note: this work is a bit old, and does not reflect state of the art results in the area; rather, we are using it to highlight the super-cool use of differential privacy].

The main problem is that while each coordinate of a Gaussian distribution is concentrated, there are some large number of them, and the proximity of a sample to some mean vector is not particularly great. You end up with bounds that look a bit like

where is the sample, is the mean vector, the ambient dimensionality, and the norm of the covariance matrix. The probability gets determined by thinking really hard and its specific value won’t be particularly important for us here.

Dimitris Achlioptas and I had an algorithm for doing this based on spectral projections: we would find the optimal low-rank subspace determined by the singular value decomposition (the rank taken to be , the number of Gaussians in the mixture) and argue that under some separation conditions involving the means and co-variance matrices, the space spanned by these projections was basically the same as the space spanned by the mean vectors of the Gaussian distributions. This is great because when you project a Gaussian sample, you are projecting its mean plus some noise. As the true mean lies in the target space, it stays where it is. When you project Gaussian noise onto a fixed subspace, it stays Gaussian, but with far fewer dimensions. The particular form of these results looks something like this, with a projection matrix applied to the sample before subtracting from .

where is close to . This means that while stays centered on , the contribution of the noise more-or-less vanishes. At least, the is reduced to and that can be quite a lot. Hooray!

The problem is that this “more-or-less vanishes” thing is really only true when the target space and the random noise are independent. However, since the optimal low-rank subspace was determined from the data, it isn’t independent of any of the noise we are projecting. It’s slightly dependent on the noise, and in the wrong way (it attempts to accommodate the noise, which isn’t what we want if we want to project *out* the noise). In particular, you don’t immediately get access to the sorts of bounds above.

You could do things the way Dimitris and I did, which was a fairly complicated mess of cross-training (randomly partition the data and use each half to determine the subspace for the other), and you end up with a paper that spends most of its time determining algorithms and notation to enforce the independence (the cross-training needs to be done recursively, but we can’t afford to cut the samples in half at each level, blah blah, brain explodes). You can read all about it here. We’re going to do things a bit simpler now.

Enter differential privacy. Recall, for a moment, the informal statement of differential privacy: a randomized computation has differential privacy if the probability of any output occurrence is not substantially adjusted when a single input element is added or removed. What a nice privacy definition!

Now, let’s think of it slightly differently, in terms of dependence and independence. If is the result of a differentially private computation on a dataset , then is not substantially differently distributed than the same computation run on . If this second distribution enjoys some nice property with high probability, for example due to its independence from , then it remains very likely that has the property as well. The probability that the property no longer holds can only increase by a factor of when we add back in to the input.

For example, let’s consider the probability of the property: “the squared length of the projection of onto the optimal subspace is much larger than ”. When the input data are , resulting in a projection-valued random variable we’ll name , this probability is small because is independent of .

When the input data are , resulting in a projection-valued random variable , this probability is not easily bounded by independence, but can be bounded by differential privacy: if the computation producing is -differentially private, then the probability can increase by at most :

We can even live with fairly beefy values of and still get a result here. Let’s take for concreteness.

Now let’s discuss differentially private optimal subspace computation. One standard way to compute optimal low dimensional subspaces is by taking the covariance matrix of the data and computing its top singular vectors, using the singular value decomposition. One standard way to release the covariance matrix while preserving differential privacy is to add Laplace noise proportional to the largest magnitude permitted of any of the data points, to each of the entries of the covariance matrix. Since our Gaussian data are nicely concentrated, they aren’t likely to be terribly enormous, and a data-independent upper bound will work great here.
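A minimal sketch of the noisy release step follows, under assumptions that are mine rather than the post's: the "covariance" is taken to be the unnormalized, uncentered second-moment matrix; every coordinate is clipped to `[-bound, bound]` to enforce the data-independent bound; and noise is calibrated to the upper triangle only, whose L1 sensitivity to adding or removing one point is at most `bound**2 * d * (d + 1) / 2`. The names `laplace` and `noisy_second_moment` are hypothetical, and the exact calibration used in the original analysis may differ.

```python
import random


def laplace(scale, rng):
    """Sample Laplace(0, scale) as a difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_second_moment(data, bound, epsilon, rng):
    """Release sum_k x_k x_k^T with Laplace noise on each entry.

    Assumption: coordinates are clipped to [-bound, bound], so one
    record changes the upper triangle by at most
    bound^2 * d * (d + 1) / 2 in L1 norm.
    """
    d = len(data[0])
    clipped = [[max(-bound, min(bound, v)) for v in row] for row in data]
    scale = bound * bound * d * (d + 1) / (2.0 * epsilon)
    out = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i, d):
            total = sum(row[i] * row[j] for row in clipped)
            # Noise the upper triangle, then mirror it (post-processing).
            out[i][j] = out[j][i] = total + laplace(scale, rng)
    return out
```

The top singular vectors of the returned matrix can then be computed with any SVD routine; since that step only touches the noisy matrix, it is post-processing and does not affect the privacy guarantee.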

What we get is a noisy covariance matrix, which we then subject to the singular value decomposition. There are some nice theorems about how the SVD is robust in the presence of noise, which is actually the same reason it is used as a tool to filter out all of that noise that the Gaussian distributions added in the first place. So, even though we added a bunch of noise to the covariance matrix, we still get a fairly decent approximation to the space spanned by the means of the Gaussian distributions (as long as the number of samples is larger than the dimensionality of the samples). At the same time, because the resulting subspace is differentially private with respect to the samples, we can still use the concentration bounds typically reserved for the projection of random noise on independent subspaces, as long as we can absorb a factor of .

At its heart, this problem was about recovering a tenuous independence which was very important for simplicity (and arguably tractability) of analysis. It shows up in lots of learning problems, especially in validating models: we would typically split data into test and training, to permit an evaluation of learned results without the issue of overfitting. Here differential privacy makes things simpler: if your learning process is differentially private, you did not overfit your data (much).


**I. Motivation**

Consider the problem of computing the discrete logarithm in a generic group of a known prime order : given two random elements and , find so that . Instead of having access to the group itself, we may only manipulate encodings of its elements (basically, a random mapping of the group to a sufficiently large alphabet) via a group oracle. The group oracle accepts encodings of two elements and returns the encoding of their product. Think of it as a model of an abstract group, where the result of multiplying two group elements is treated as a new formal variable.
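The model can be mimicked with a toy oracle, sketched below under stated assumptions: group elements are represented internally by their exponents modulo the prime order `p`, each distinct element receives a fresh random 64-bit label on first use, and the caller only ever sees labels. The class name, the `seed` parameter, and the label width are illustrative choices, not part of the original description.

```python
import random


class GenericGroupOracle:
    """Toy generic-group oracle for a cyclic group of prime order p."""

    def __init__(self, p, x, seed=0):
        self.p = p
        self._rng = random.Random(seed)
        self._label_of = {}  # exponent -> random label
        self._exp_of = {}    # label -> exponent (hidden from the caller)
        self.g = self._encode(1)      # encoding of the generator g
        self.h = self._encode(x % p)  # encoding of h = g^x

    def _encode(self, exponent):
        """Return the label of g^exponent, sampling it on first use."""
        if exponent not in self._label_of:
            label = self._rng.getrandbits(64)
            while label in self._exp_of:  # keep the labels distinct
                label = self._rng.getrandbits(64)
            self._label_of[exponent] = label
            self._exp_of[label] = exponent
        return self._label_of[exponent]

    def mul(self, u, v):
        """Group operation on two previously issued labels."""
        return self._encode((self._exp_of[u] + self._exp_of[v]) % self.p)
```

An algorithm interacting with this object learns nothing from a label except whether it has seen that label before, which is exactly the situation the following paragraphs analyze.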

Let us try solving the discrete logarithm problem in this model. Given the encodings of two elements and , one can multiply them, obtaining the encoding of , square the result, etc. In general, it is possible to compute (encodings of) elements of the form , where are pairs of integers modulo (all arithmetic not involving or is going to be modulo from now on). Of course, there can be multiple ways of arriving at the same element. For instance, (as the group is of prime order, it is necessarily Abelian). Unless we do it on purpose, all elements that we obtain from the group oracle are going to be distinct with overwhelming probability over (assume that the group order is large, say, at least ). Indeed, if , then which happens for with probability at most . On the other hand, if we do get a non-trivial relationship, we can recover right away.
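To spell out the recovery step, here is the computation in notation assumed for illustration (the original formulas did not survive extraction): write $h = g^x$, and suppose two distinct exponent pairs $(a_1, b_1) \neq (a_2, b_2)$ produce the same group element.

```latex
\begin{align*}
g^{a_1} h^{b_1} = g^{a_2} h^{b_2}
  &\iff a_1 + b_1 x \equiv a_2 + b_2 x \pmod{p} \\
  &\iff (b_1 - b_2)\, x \equiv a_2 - a_1 \pmod{p} \\
  &\iff x \equiv (a_2 - a_1)(b_1 - b_2)^{-1} \pmod{p}
  \qquad \text{when } b_1 \not\equiv b_2 .
\end{align*}
```

If $b_1 \equiv b_2$ the pairs differ only in $a$, and the two elements can never coincide. Otherwise exactly one value of $x$ modulo $p$ produces the collision, which is where the collision probability of at most $1/p$ for a uniformly random $x$ comes from.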

In other words, the group oracle keeps outputting some random encodings that tell us nothing useful about the elements and (we could sample encodings from the same distribution ourselves, without access to the oracle), until it returns an element that we did see before, which immediately gives away the answer to the discrete logarithm problem.

If is chosen uniformly at random from , the success probability of any algorithm in the generic group model making no more than group operations is bounded by : each pair of elements output by the group oracle collides with probability at most , there are at most such pairs, union bound, check and mate. A formal version of this handwavy argument is due to Victor Shoup, which gives a tight (up to a constant) bound on the success probability of any algorithm for solving the discrete logarithm problem in the generic group model.

A simple algorithm matches this bound. Let . Compute (by repeated multiplications by ), (by repeated multiplications by ), and using the elements already available, compute . If , there’s going to be a collision between and for some and . This algorithm is known as the baby-step giant-step method: we are making “baby” steps when we are multiplying by powers of , and “giant” steps when we are computing powers of . If , the discrete logarithm problem is solved with probability 1.
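A minimal sketch of baby-step giant-step follows, instantiated in a concrete group rather than through an oracle: `g` is assumed to generate a subgroup of prime order `p` inside the integers modulo a prime `q`. The indexing (baby steps $h g^i$, giant steps $g^{jm}$, answer $x = jm - i$) is one common variant chosen here because it needs only multiplications; the post's exact indexing was lost in extraction and may differ.

```python
from math import isqrt


def bsgs(g, h, p, q):
    """Return x in [0, p) with g**x == h (mod q), or None.

    Assumes g has prime order p modulo q and h lies in the subgroup
    generated by g.
    """
    m = isqrt(p - 1) + 1  # m = ceil(sqrt(p)), so m * m >= p

    # Baby steps: store h * g^i for i = 1..m.
    baby = {}
    cur = h % q
    for i in range(1, m + 1):
        cur = cur * g % q
        baby.setdefault(cur, i)

    # Giant steps: walk through g^(j*m) for j = 1..m.
    giant = pow(g, m, q)
    cur = 1
    for j in range(1, m + 1):
        cur = cur * giant % q
        if cur in baby:
            # h * g^i == g^(j*m)  implies  x == j*m - i (mod p)
            return (j * m - baby[cur]) % p
    return None
```

For example, with `q = 2039` and `p = 1019` (so that `q = 2p + 1`), the element `g = 4` generates the subgroup of order `p`, and `bsgs(4, pow(4, x, 2039), 1019, 2039)` returns `x` after roughly `2 * 32` multiplications.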

The above argument suggests that in order to solve the discrete logarithm problem in the generic group model one would want to maximize the probability of observing a collision. Collisions have a simple geometric interpretation: each time the algorithm computes , it draws a line in the space. An element is “covered” if two lines intersect above this element: . The adversary is trying to cover as many elements as possible with the fewest number of lines.

As we have just seen, the number of group operations required to solve the discrete logarithm problem in the generic group when and are chosen uniformly at random is . The question becomes much more interesting if we constrain the joint distribution of and .

What is the complexity of the discrete logarithm problem measured as the number of group operations, if , where is sampled uniformly from ?

It turns out that this question has been answered for some simple sets , but it is wide open in general.

**II. Geometric Formulation**

We re-formulate the problem using the language of finite field geometry.

Given a subset of , define its *DL-complexity*, denoted as , as the minimal number of lines in whose intersection points projected to the -axis cover .

In the notation of the previous section, the adversary is drawing lines . It scores a hit when two lines intersect above point , i.e., . The adversary’s goal is to cover the entire set with the smallest number of lines, which would correspond to solving the discrete logarithm problem for the case when and .

What are the most basic facts about ?

- . Indeed, we know that the (generic) baby-step giant-step algorithm covers the entire with lines.
- — duh! It suffices to draw a single line and one line for each element of .
- : if lines can cover the entire , then the number of intersection points, which is less than , is at least .
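The covering picture is easy to experiment with. The sketch below assumes each line is given as a pair `(a, b)` standing for $y = a + bx$ over the integers modulo a prime `p` (a convention picked for illustration), and returns the set of $x$-coordinates at which two of the lines meet. The function name is hypothetical.

```python
def covered_points(lines, p):
    """x-coordinates (mod prime p) where two of the given lines meet.

    Each line is a pair (a, b) representing y = a + b*x (mod p).
    """
    covered = set()
    for i, (a1, b1) in enumerate(lines):
        for a2, b2 in lines[i + 1:]:
            slope_diff = (b1 - b2) % p
            if slope_diff == 0:
                continue  # parallel (or identical) lines never meet
            # a1 + b1*x == a2 + b2*x  =>  x == (a2 - a1) / (b1 - b2)
            covered.add((a2 - a1) * pow(slope_diff, -1, p) % p)
    return covered
```

As a check of the second bullet: one slanted line `(0, 1)` together with a horizontal line `(s, 0)` for each element `s` of a set covers exactly that set, using one more line than the set has elements.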

Putting these bounds together on this schematic picture drawn in the log-log scale, we can see that lives inside the shaded triangle.

The most intriguing part of the triangle is the upper-left corner, marked with the target sign, that corresponds to sets that are as small as but have the property that solving the discrete logarithm problem in these subsets is as hard as in the entire . How can we get there, or just get closer? But first, why do we care at all?

One rather vague motivation is that we are interested in characterizing these subsets because they capture the complexity of the discrete logarithm problem. Another, due to Claus-Peter Schnorr, who defined the problem in 2000, is that the amount of entropy needed to sample an element of that set is half of . The observation that got us going back in 2005 was that modular exponentiation takes an amount of time that depends on the exponent. Wouldn’t it be nice if we could choose exponents that allowed for faster exponentiation algorithms? These exponents could cover only a fraction of the entire space, which naturally led us to the question of understanding the discrete logarithm problem restricted to a subset, which turned out to be very interesting in its own right.

The first result, going back to Schnorr, is very encouraging:

For a random of size , with probability at least .

It means a random subset has essentially maximal possible DL-complexity (up to a factor) with very high probability. Unfortunately, this result has two drawbacks. First, using (truly) random subsets forecloses the possibility of extracting any gains in exponentiation relative to the average case. Second, it does not quite answer the question of whether any specific sets are particularly hard for the discrete logarithm problem.

In the rest of this post we explore several approaches towards constructing explicit sets and sets with succinct representation for which we can prove a lower bound on their DL-complexity stronger than .

**III. A first attempt**

Rather than trying to solve the problem in full generality, let’s constrain the definition of to capture only generalizations of the baby-step giant-step method. Let us call this restriction , defined as follows:

Given a subset of , let be the minimal number so that is covered by intersection of two sets of lines and , where .

Recall that the intersection of two lines covers an element of if these lines intersect at a point whose first coordinate is in .

The definition of complexity considers only horizontal lines (analogous to the giant steps of the algorithm, ) and parallel slanted lines (corresponding to the elements ). The 1 in BSGS1 refers to the fact that all slanted lines have slope of exactly 1 (for now — this condition will be relaxed later).

Can we come up with a constraint on that would guarantee that ? It turns out that we can.

Assume for a moment that all pairwise sums of elements in are distinct, i.e., no four elements satisfy the following equation: , where , unless . If this is the case, at least one of the intersection points of the lines in the following configuration will miss an element of :

To see why it is so, observe that — a contradiction with ‘s not having solutions to this equation.

We now introduce one more way of thinking about these lines in that are trying to hit elements of (we promise it is going to be the last!). Associate lines with the vertices of a graph and draw an edge between two vertices if the intersection point of the corresponding lines projects to (“kills an element of ”).

If all pairwise sums of are distinct, then the graph whose nodes are the horizontal and slanted lines does not have a 4-cycle. This property alone is sufficient to bound the total number of edges in the graph (and thus the number of elements of hit by these lines) to be less than . If the graph is bipartite, which is our case, this bound is known as the Zarankiewicz problem, which can be established via a simple counting argument.

If lines cannot cover more than elements of , it means that .
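The link between distinct pairwise sums and the absence of 4-cycles can be checked directly on a small example. The sketch below assumes horizontal lines $y = a$ and slope-1 lines $y = x + b$ modulo a prime `p`, which meet at $x = a - b$, and builds the bipartite "hit" graph over all such lines. The function names are hypothetical, and the example set $\{0, 1, 3, 9\}$ modulo 13 is chosen because all of its pairwise sums are distinct.

```python
from itertools import combinations


def hit_graph(s, p):
    """Adjacency of the bipartite hit graph for a set s modulo p.

    Horizontal line y = a is joined to slanted line y = x + b
    whenever their intersection point x = a - b lies in s.
    """
    s = {v % p for v in s}
    return {a: {b for b in range(p) if (a - b) % p in s} for a in range(p)}


def has_four_cycle(adj):
    """A bipartite graph has a 4-cycle iff two vertices on one side
    share at least two neighbours."""
    return any(len(adj[u] & adj[v]) >= 2 for u, v in combinations(adj, 2))
```

With `p = 13`, the graph for `[0, 1, 3, 9]` has no 4-cycle, while the graph for `[0, 1, 2, 3]` (where 0 + 3 = 1 + 2) does.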

What’s left to do is to construct sets whose pairwise sums never repeat. They are known as modular Sidon sets, with several beautiful constructions resulting in sets of astonishingly high density. Ponder it for a moment: we want a subset of such that no two pairs of its elements sum to the same thing. Obviously, by the pigeonhole principle, the size of such a set is . This bound is tight, as there exist explicit and efficiently enumerable sets of size !
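For experimentation, the defining property is simple to test. The sketch below checks that all sums $a + b$ with $a \le b$ are distinct modulo `p` (including the case $a = b$, which is an assumption about the intended definition), and builds a set greedily. The greedy procedure is only a stand-in for illustration: it is not one of the dense algebraic constructions alluded to above and generally produces noticeably smaller sets.

```python
from itertools import combinations_with_replacement


def is_modular_sidon(s, p):
    """True if all sums a + b (a <= b, both in s) are distinct mod p."""
    seen = set()
    for a, b in combinations_with_replacement(sorted({v % p for v in s}), 2):
        total = (a + b) % p
        if total in seen:
            return False
        seen.add(total)
    return True


def greedy_sidon(p):
    """Scan 0..p-1 and keep each element that creates no repeated sum."""
    chosen, sums = [], set()
    for x in range(p):
        new_sums = {(x + y) % p for y in chosen} | {(2 * x) % p}
        if not (new_sums & sums):
            chosen.append(x)
            sums |= new_sums
    return chosen
```

Since a set of size $k$ produces $k(k+1)/2$ distinct sums under this definition, any output `S` of `greedy_sidon(p)` satisfies `len(S) * (len(S) + 1) // 2 <= p`, which is the pigeonhole bound in action.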

Notice that when two lines cover an element of , their coefficients satisfy an especially simple condition: if , where , then . Let and . If all of is covered by intersections between lines and , then , where is the sumset of and . Using the language of additive combinatorics, Erdős and D. Newman posed in 1977 the problem of constructing subsets of that cannot be covered by sumsets of small sets. They proved that the set of “small squares” has this property, or in our terminology, for any .

**IV. Moving upwards**

Let’s relax the constraint of the previous definition by allowing two classes of lines — horizontal and arbitrarily slanted, but the only hits that count are due to intersections between lines of different classes. Call the resulting measure of complexity :

Given a subset of , let be the minimal number so that is covered by intersection of two classes of lines and , for , where only intersections between lines of different classes count towards covering .

By analogy with the previous argument, we’d like to identify a local property on that will result in a non-trivial bound on . More concretely, we should be looking for some condition on a small number of elements of that make them difficult to cover by few lines of two different classes.

Fortunately, one such property is not that difficult to find. Consider the following drawing:

The intercept theorem (known also as Thales’ theorem) implies that , and consequently (applying it a second time),

Conversely, if the 6-tuple is such that , these points cannot be covered all at once by three horizontal and two slanted lines.

Consider again the bipartite graph drawn on the sets of horizontal and slanted lines, where two lines are adjacent in the graph if their intersection point covers an element of . What is the maximal density of this graph if it is prohibited from containing the subgraph? Somewhat surprisingly, the answer is asymptotically the same as before, namely, the number of edges in the graph is . Therefore, if the set avoids 6-tuples satisfying (*), then .

What about constructing sets that have that property? A short answer is that we don’t know how to do so explicitly, but at least there exist sets satisfying this property with succinct representation.

**V. Going all the way**

Having flexed our muscles with the watered-down notions of sets’ DL-complexity, let us try to extend our technique to handle the most general case of unrestricted lines, where everything goes and all intersections count towards the attacker’s goal of covering the entire set .

Once again, we’d like to find a local property with global repercussions. Concretely, we should be looking for a configuration of lines whose intersection points satisfy some avoidable condition, similar to or the quadratic polynomial of the previous section. It may seem that we should look no further than Menelaus’ theorem, which gives us just that. If your projective geometry is a bit rusty, Menelaus’ theorem applies to the six intersection points of four lines in the plane:

It states, in the form most relevant to us that

It seems like a nice local property but what about its global ramifications? Namely, if we manage to construct a set such that no 6-tuple satisfies the cubic polynomial (**), what can we say about the number of lines required to cover that set? Well, our luck runs out here. Recall that we used the local property to guarantee that the graph, whose nodes corresponded to lines and edges corresponded to elements of covered by intersection points, excluded a certain subgraph. First, it was a 4-cycle, then . Unfortunately, if the graph excludes a complete graph on four vertices, which Menelaus’ theorem guarantees for sets avoiding (**), the number of edges in that graph can be as large as . This is a consequence of Turán’s theorem (or Erdős–Stone), which yields no bound better than that unless the excluded subgraph is bipartite.

The only path forward is to find a Menelaus-like theorem that allows us to exclude a bipartite graph. It turns out that the minimal such configuration involves seven lines and 12 intersection points:

Most compactly, the theorem states that the following determinant evaluates to 0:

Using the same argument as before, if avoids solutions to the above equation on 12 variables and total degree 6, the “hit” graph defined over the lines avoids the graph. A variant of the Zarankiewicz bound guarantees that such a graph has edges (the exponent in the Zarankiewicz bound depends only on the size of the *smaller* part of the excluded bipartite graph). Since each element of the set corresponds to at least one edge of the “hit” graph, and consequently , which is better than the trivial bound . Finding explicit constructions remains a challenge, although it is easy to demonstrate the existence of such sets with succinct representation by the probabilistic method.

**VI. Bipartite Menelaus’ Theorem and Open Problems**

Even though our original motivation was rooted in cryptography, we ended up proving a fact of projective geometry. In an equivalent form, which is most similar to the standard formulation of Menelaus’ theorem, it asserts that

where the line segments are signed: positive if they point in the same direction as the line they are part of (for some arbitrary but fixed orientation), and negative otherwise.

The classic (and classical — after all, Menelaus of Alexandria lived in the first century AD) theorem is implied by ours. Indeed, in the degenerate case when , , and , following a furious round of cancellations, we end up with Menelaus’. This explains why we refer to our “12-point’’ theorem as bipartite Menelaus’: it is the minimal Menelaus-like theorem that involves lines separated into two classes.

We did search far and wide for evidence that this theorem had been known before, and came up empty. In retrospect, such a theorem is inevitable — the number of intersections (i.e., equations) grows quadratically in the number of lines, each of which only requires two free variables to describe. This is a counting argument that really gives no insight into why bipartite Menelaus’ theorem is what it is. Is there a purely geometric proof? Is it a consequence of a simpler/deeper fact about projective geometries over finite fields? We’d love to know.

Let’s measure our progress against the initial goal of finding explicit sets that are as hard as the entire group against the discrete-logarithm-finding adversary. We are not there yet. We did develop some machinery for arguing that some sets are more resistant than the most pessimistic square-root bound implies, but these sets are hard to construct and too small to be useful. What about proving that some natural sets, such as the sets of squares, as in Erdős-Newman, or cubes, have high DL-complexity? It is conceivable that the combinatorial approach based on excluded subgraphs is not sufficient to get us to the sweet spot of sets of size and DL-complexity . What can?

A necessary disclaimer: the generic group model is just that — a model. Any instantiation of the abstract group allows direct observation of the group elements, and may enable attacks not captured by the model. For instance, representation of the group elements as integers modulo has enough structural properties that index calculus is exponentially more effective in than any generic algorithm. On the positive side, for many groups, such as some elliptic curves or prime-order subgroups of for sufficiently large , no algorithms for finding discrete logarithms faster than generic methods are presently known. It motivates studying generic groups as a useful abstraction of many groups of cryptographic significance.

**Notes**

The abstract (generic) group model was introduced in the papers by Nechaev and Shoup, and hardness of the discrete logarithm in that model was shown to be . Several generic methods for computing discrete logarithms with similar total running time are known: Shanks’ baby-step giant-step method, Pollard’s rho and kangaroo (lambda) methods. These algorithms can be adapted to intervals to work in time , matching the pessimistic square-root bound. For small-weight subsets of see work of Stinson and references therein. Canetti put forward a variant of the Decisional Diffie-Hellman assumption where one of the exponents is sampled from an arbitrary distribution of bounded min-entropy. Chateauneuf, Ling, and Stinson gave a combinatorial characterization of algorithms for computing discrete logarithms in terms of slope coverings, and showed how weak Sidon sets are related to optimal algorithms. Erdős and Newman defined the notion of bases for subsets of , which corresponds (up to a factor of 4) to BSGS1-complexity in . They showed that random subsets of size have basis of size and for sets of squares their basis is . Subsuming the counting argument of Erdős and Newman, Schnorr proved that the discrete logarithm problem has essentially optimal (up to a logarithmic factor) hardness on random subsets. Resolving the question of Erdős and Newman, Alon, Bukh, and Sudakov showed that for sets of size exactly even their restricted DL-complexity is . They also extended the analysis of BSGS1-complexity for the set of squares to that of higher powers.
