Let me say right off the bat that I think implicit (and, as Michael says, sometimes explicit) bias is a very real phenomenon. Moreover, such biases are not just a problem in the sense that they are “unfair” to authors, but they cause real harm to science, in suppressing the contributions from certain authors. Nor do I have any principled objection to anonymization: I do for example practice anonymous grading in my courses for exactly this reason. I also don’t buy the suggestion that we must know the author’s identity to evaluate if the proof is correct. Reviewers can (and do) evaluate whether a proof makes sense without needing to trust the author.

However, there is a huge difference between grading a problem set and refereeing a paper. In the latter case, and in particular in theoretical computer science, you often need the expertise of very particular people who have worked in this area. By the time the paper is submitted to a conference, these experts have often already seen it, either because it was posted on arXiv/ECCC/ePrint, or because they have seen a talk on it, or perhaps they have already discussed it with the authors by email.

More generally, these days much of theoretical CS is moving to the model where papers are first posted online, and by the time they are submitted to a conference they have circulated quite a bit around the relevant experts. Posting papers online is very good for science and should be encouraged, as it allows fast dissemination of results, but it does make the anonymous submission model obsolete.

One could say that if the author’s identity is revealed then there is no harm, since in such a case we simply revert to the original form of non-anonymous submissions. However, the fact that the authors’ identity is known to *some but not all* participants in the process (e.g., maybe some reviewers but not others) makes some conflicts and biases invisible. Moreover, the fact that the author’s identity is not “officially” known causes a lot of practical headaches.

For example, as a PC member you can’t just shoot a quick email to an expert to ask for a quick opinion on the paper, since they may well be the author themselves (as happened to me several times as a CRYPTO PC member), or someone closely related to them. Second, you often have the case where the reviewer knows who the authors are, and has some history with them, even if it’s not a formal conflict, but the program committee member does not know this information. In particular, using anonymous submissions completely precludes using a *disclosure based* model for conflicts of interest (where reviewers disclose their relations with the authors in their reviews); instead you have to move to an *exclusion based* model, where reviewers meeting some explicit criteria are ruled out.

If anonymous submissions don’t work well for theory conferences, does it mean we just have to accept biases? I don’t think so. I believe there are a number of things we could attempt. First, while completely anonymizing submissions might not work well, we could try to make the author names less prominent, for example by having them on the last page of the submission instead of the first, and not showing them in the conference software. Also, we could try “fairness through awareness”. As I mentioned in my tips for future FOCS/STOC chairs, one potential approach is to tag papers by authors who never had a prior STOC/FOCS paper (one could possibly also tag papers by authors from under-represented groups). One wouldn’t give such papers *preferential* treatment, but rather just make sure they get extra attention. For example, we could add an extra review for such papers. That review might end up being positive or negative, but would counter the bias of dismissing some works out of hand.

To summarize, I agree with Michael’s and Suresh’s sentiments that biases are harmful and should be combated. I just don’t think anonymous submissions are the way to go about that.

]]>The *Unique Games* ($UG(c,s)$) problem with parameters $1 \geq c > s \geq 0$ is the following: given a set of linear equations each involving at most two variables over some finite field, to distinguish between the *completeness* case where there exists an assignment to the variables satisfying at least a $c$ fraction of the equations, and the *soundness* case where every assignment satisfies fewer than an $s$ fraction.
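
To make the definition concrete, here is a small Python sketch. It assumes, purely for illustration, that every equation has the form $x_i - x_j = b \pmod q$ for a prime $q$ (a special case of two-variable linear equations), and the names `satisfied_fraction` and `best_value` are hypothetical. The brute-force search takes exponential time and is only meant to show what the completeness and soundness cases refer to.

```python
from itertools import product

def satisfied_fraction(equations, assignment, q):
    """Fraction of equations x_i - x_j = b (mod q) that the assignment satisfies."""
    satisfied = sum(
        1 for (i, j, b) in equations
        if (assignment[i] - assignment[j]) % q == b % q
    )
    return satisfied / len(equations)

def best_value(equations, num_vars, q):
    """Maximum satisfied fraction over all q**num_vars assignments (brute force)."""
    return max(
        satisfied_fraction(equations, assignment, q)
        for assignment in product(range(q), repeat=num_vars)
    )

# Toy instance over Z_3 (illustrative, not from the post). Adding the three
# equations gives 0 = 1 (mod 3), so no assignment satisfies all of them.
EQUATIONS = [(0, 1, 1), (1, 2, 1), (2, 0, 2)]
VALUE = best_value(EQUATIONS, num_vars=3, q=3)
```

Here `VALUE` is 2/3, so this toy instance falls in the completeness case for any $c \leq 2/3$ and in the soundness case for any $s > 2/3$.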

Clearly the problem becomes easier the larger the gap between $c$ and $s$. The unique games conjecture is that the problem is as hard as it can be, in the sense that it is NP hard for $c$ arbitrarily close to one and $s$ arbitrarily close to zero (as a function of the field size which we assume tends to infinity in what follows). In other words, the difficulty of $UG(c,s)$ as a function of $c$ and $s$ is conjectured by the UGC to look like this:

Until today, what was known about unique games could be summarized in this (not to scale) figure:

That is, when $c$ and $s$ are sufficiently close to each other, $UG(c,s)$ was known to be NP hard (see for example this paper), and in fact with a linear-blowup reduction establishing exponential hardness (i.e., under the exponential time hypothesis). On the other hand, when either the completeness parameter $c$ is sufficiently close to one or the soundness $s$ is sufficiently close to zero, there was a known subexponential time algorithm of Arora, Steurer and me (see also here) for $UG(c,s)$. (That is, an algorithm running in time $2^{n^{\epsilon}}$ for some $\epsilon$ which tends to zero as either completeness tends to one or soundness tends to zero.)

However, that algorithm was of course only an upper bound, and we did not know whether it could be improved further to $2^{n^{o(1)}}$ or even polynomial time. Moreover, its mere existence showed that in some sense the techniques of the previously known NP hardness results for $UG(c,s)$ (which used a linear blow up reduction) are *inherently inapplicable* to establishing the UGC, which requires hardness in a completely different regime.

The new result of Khot, Minzer and Safra (when combined with the prior ones) shows that $UG(c,s)$ is NP hard for $c$ arbitrarily close to half and $s$ arbitrarily close to zero. (The result is presented as hardness of 2 to 2 games with completeness close to one, but immediately implies hardness of unique games with completeness close to half.) That is, the new picture of unique games’ complexity is as follows:

This establishes for the first time hardness of unique games in the regime for which a sub-exponential time algorithm was known, and hence (necessarily) uses a reduction with some (large) polynomial blowup. While it is theoretically still possible for the unique games conjecture to be false (as I personally believed would be the case until this latest sequence of results) the most likely scenario is now that the UGC is true, and the complexity of the problem looks something like the following:

That is, for every $c, s$, the best running time is roughly $2^{n^{f(c,s)}}$ where $f$ is a function that is always positive but tends to zero as $c$ tends to one or $s$ tends to zero (and achieves the value one in a positive measure region of the plane). Of course we are still far from proving this, let alone characterizing this function, but this is very exciting progress nonetheless.

I personally am also deeply interested in the question of whether the algorithm that captures this curve is the sum of squares algorithm. Since SoS does capture the known subexponential algorithms for unique games, the new work provides more evidence that this is the case.

]]>

CS 121 is a required course for Computer Science concentrators, and so we had about 160 students. There was great variability in students’ preparation and background, and many of them had not taken a proof-based course before. That, combined with the inevitable first-time kinks, made the first several weeks challenging for both the students and the teaching team. That said, I am overall very pleased with the students’ performance. In a course that contained fairly advanced material, students overall did quite well in the problem sets, midterm and final exams. I was also very pleased with my team of teaching fellows (headed by an amazing undergraduate student – Juan Perdomo) who had to deal with teaching a new iteration of the course, including many concepts that they themselves weren’t so familiar with.

Perhaps the most significant change I made from the standard presentation is to make **non uniform computation** (specifically straightline programs / circuits with NAND gates) the initial computational model, rather than automata. I am quite happy with this choice and intend to keep it for the following reasons:

- Boolean gates such as NAND have much tighter correspondence to actual hardware and convey the notion that this is not an arbitrary abstract model but intends to capture computation as is physically realizable.
- Starting with a model for finite functions allows us to avoid dealing with infinity in the first few lectures.
- Many of the conceptual lessons of the course – that we can model computation mathematically, that we can encode an algorithm as a string and give it as input to other algorithms, that there is a universal algorithm, and that some functions are harder than others – can already be explained in the finite non uniform setting.
- The non uniform model is crucial for talking about **cryptography** (e.g., explaining notions such as “128 bits of security” or giving a model where bitcoin proofs of work make sense), **pseudorandomness** and **quantum computing**. Cryptography and pseudorandomness are the most compelling examples of *“mining hardness”* or *“making lemonade out of the computational difficulty lemon”* which is a core take away concept. Further I believe that it is crucial to talk about quantum computing in a course that aims to model computation as it exists in the world we live in.
- A more minor point is that the non uniform model and the notion of “unrolling the loop” to simulate a uniform computation by a non uniform one make certain proofs, such as the Cook-Levin Theorem, Godel’s Incompleteness Theorem, and BPP in P/poly, technically much easier.
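
The points above about NAND straightline programs can be illustrated with a short Python sketch. The encoding of a program as a list of wire-index pairs and the names `NAND`, `eval_nand_program` and `XOR_PROGRAM` are assumptions made for this example (they are not the course's own NAND language). The sketch only shows a finite function computed by NAND gates, with the program itself passed as data to an evaluator.

```python
def NAND(a, b):
    """The NAND gate on bits: 0 only when both inputs are 1."""
    return 1 - (a & b)

def eval_nand_program(program, inputs):
    """Evaluate a straightline NAND program given as data.

    Wires 0..len(inputs)-1 hold the inputs. Each pair (i, j) in `program`
    appends a new wire holding NAND(wire i, wire j). The last wire is the output.
    """
    wires = list(inputs)
    for i, j in program:
        wires.append(NAND(wires[i], wires[j]))
    return wires[-1]

# XOR of two bits with four NAND gates:
# wire 2 = NAND(a, b), wire 3 = NAND(a, wire 2),
# wire 4 = NAND(b, wire 2), wire 5 = NAND(wire 3, wire 4).
XOR_PROGRAM = [(0, 1), (0, 2), (1, 2), (3, 4)]

TRUTH_TABLE = {(a, b): eval_nand_program(XOR_PROGRAM, (a, b))
               for a in (0, 1) for b in (0, 1)}
```

Because the program is just a list, it can be handed to `eval_nand_program` like any other input, which is the "algorithm as a string" idea in miniature.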

So how did the students like it? The overall satisfaction with the course was **3.6** (on a 1-5 scale) which gives me one more reason to be thankful that I’m a tenured professor and not an Uber driver. On the positive side, 60% of the students rated the course as “very good” or “excellent” and 83% rated it as “good”, “very good” or “excellent”.

Here are the student answers to the question “What did you learn from this course? How did it change you?”. As expected, they are a mixed bag. One answer was *“I learned about the theory of computation. This course made me realize I do not want to study the theory of computation.”* which I guess means that the course helped this student fulfill the ancient goal of knowing thyself.

In terms of difficulty and workload, 44% of the students found it “difficult” and 23% found it “very difficult” which is a little (but not significantly) more than the average difficulty level for CS classes at Harvard. While the mean amount of hours (outside lectures) spent on this course per week was 11.6 (par for the course in CS classes), you don’t need the sum of squares algorithm to see that the distribution is a mixture model:

I imagine that the students with less math preparation needed to work much more (but perhaps also gained more in terms of their math skills).

**Lessons learned for next time:**

- There is more work to be done on the text, especially to make it more accessible to students not used to reading mathematical definitions and notations. I plan to add more Sipser-style “proof ideas” to the theorems in the text, and add more plain English exposition, especially in the earlier chapters.
- Many students got hung up on the details of the computational models, in particular my “Turing machine analog” which was the NAND++ programming language. I need to find a way to strike the right balance between making sure there is a precise and well-defined model, and being able to properly prove theorems about it, and getting the broader point across that the particular details of the model don’t matter.
- I find the idea of incorporating programming-languages based models in this course appealing, and have made some use of Jupyter notebooks in this course. I need to spend more thought on how to use these tools in the pedagogically most useful way.

and Lisa Zhang.

The workshop will take place on June 19-22, 2018 at Harvard University. I am a local co-organizer with Madhu Sudan and Salil Vadhan. Having WIT at Harvard brings back great memories for me, since I was involved in the first WIT at Princeton in 2008. In that workshop Gillat Kol, Barna Saha, and Shubhangi Saraf (now speakers and organizer) were participants.

I encourage any female graduate student in theoretical computer science to strongly consider applying for this workshop by filling out the form here.

]]>ITCS is back on the East Coast, and will be at MIT from January 11-14, 2018. As you know, ITCS is a conference that is unique in many respects: it’s a conference that emphasizes dialog and discussion among all sub-areas of TCS, facilitating it with a single track structure and “chair rants” providing the context for each session. Submissions, refereeing and presentations emphasize the “I” in ITCS: new concepts and models, new lines of inquiry, new techniques or novel use of existing techniques, and new connections between areas.

All in all, great fun! This year, ITCS will run for four full days with lots of activities. **Tickets are going fast: the deadlines for early registration and the hotel block are both December 28, 2017.**

A great tradition at ITCS is the “graduating bits” session, where graduating PhD students and postdocs give brief overviews of their research in advance of going out on the job market. If you fit the description, you should sign up here.

Following the success of the poster session at ITCS’17 and STOC’18, we will have one too, at the Marriott the first evening of the conference. To sign up, go here.

We hope to see many of you at ITCS!

Your local organizers,

Costis, Yael & Vinod

]]>

Sam Hopkins just completed a heroic 6 part blog post sequence on using the Sum of Squares algorithm for unsupervised learning.

The goal of **unsupervised learning** is to recover the underlying *structure* of a distribution $\mathcal{D}$ given samples $X_1,\ldots,X_n$ sampled from $\mathcal{D}$. This is phrased as positing a *model* for the distribution $\mathcal{D}$ as having the form $\mathcal{D}_\theta$ where $\theta$ is a parameter, and the goal is to recover the parameter $\theta$ from the examples. We can consider the information-theoretic or *computationally unbounded* setting where the question is of **identifiability** – is there enough data to recover (an approximation of) the parameter, and the more realistic setting where we are computationally bounded and the question becomes one of **recovery** – can we efficiently recover the parameter from the data.

Theoretical computer scientists are used to the setting where every problem can be solved if you have enough computational power, and so intuitively would think that identifiability and recovery have nothing to do with each other, and that the former would be a much easier task than the latter. Indeed this is often the case, but Sam discussed some important cases where we can (almost) automatically transform a proof of identifiability into an efficient algorithm for recovery.

While reading these 6 (not too long) blog posts might take an afternoon, it is an afternoon you would learn something in, and I highly recommend this series, especially for students who might be interested in pursuing questions in the theory of machine learning or the applications of semidefinite programming.

The series is as follows:

- In part I Sam describes the general model of unsupervised learning and focuses on the classical problem (dating at least as far back as the 19th century) of learning a Gaussian mixture model, but one for which significant progress was just made by Sam and others. He wisely focuses on the one dimensional case, and gives a proof of identifiability for this case.
- In part II Sam introduces the Sum of Squares proof system, and starts on an SoS version (and statement) of his identifiability proof from part I. He completes this proof of identifiability in the (relatively short) part III.
- In part IV, Sam completes the transformation of this identifiability proof into an algorithm for the one dimensional Gaussian mixture model.
- The above already gives a complete description of how to transform an identifiability proof to an algorithm, but of course we really care about the *high dimensional* case. In part V Sam generalizes the identifiability proof to higher dimensions.
- In part VI Sam completes the transformation of the higher dimensional identifiability proof to an algorithm, and also includes an overview of other papers and resources on this general area.

While I realize that today’s social media is trending towards 140 characters, I hope that some people still have the attention span for a longer exposition, and I do believe those that read it will find it worthwhile.

Thanks so much Sam!

]]>*Did you find this series helpful? Unhelpful? What could be done better? If there were to be another tutorial series on a Sum of Squares-related topic, what would you like to see? Let me know in the comments!*

Last time we developed our high-dimensional clustering algorithm for Gaussian mixtures. In this post we will make our SoS identifiability proofs high-dimensional. In what is hopefully a familiar pattern by now, these identifiability proofs will also amount to an analysis of the clustering algorithm from part 5. At the end of this post is a modest discussion of some of the literature on the SoS proofs-to-algorithms method we have developed in this series.

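We decided to remember the following properties of a collection $X_1,\ldots,X_n$ of samples from a $d$-dimensional Gaussian mixture model.

(1) $X_1,\ldots,X_n$ break up into clusters $S_1,\ldots,S_k \subseteq [n]$ which partition $[n]$, and each $S_i$ has size exactly $N = n/k$.

(2) Each cluster has bounded moments, *and this has a small certificate*: for some $t \in \mathbb{N}$, if $\mu_i = \mathbb{E}_{j \sim S_i} X_j$ is the mean of the $i$-th cluster,

$\left\| \mathbb{E}_{j \sim S_i} \left([X_j - \mu_i]^{\otimes t/2}\right)\left([X_j - \mu_i]^{\otimes t/2}\right)^\top - \mathbb{E}_{Y \sim \mathcal{N}(0,I)} \left(Y^{\otimes t/2}\right)\left(Y^{\otimes t/2}\right)^\top \right\|_F^2 \leq 1$.

(3) The means are separated: if $i \neq j$ then $\|\mu_i - \mu_j\| \geq \Delta$.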

Our identifiability proof follows the template we have laid out in the non-SoS proof from part 1 and the one-dimensional SoS proof from later parts. The first thing to do is prove a key fact about a pair of overlapping clusters with bounded moments.

The next fact is the high-dimensional analogue of Fact 2. (We are not going to prove a high-dimensional analogue of Fact 1; the reader should at this point have all the tools to work it out for themselves.) We remind the reader of the key family of polynomial inequalities.

Given a collection of points , let be the following set of polynomial inequalities in indeterminates :

for all

where as usual . As before, for a subset , we use the notation .

Fact 3.

Let . Let have ; let be its mean. Let be a power of . Suppose satisfies.

Then

The main difference between the conclusions of Fact 3 and Fact 2 is that both sides of the inequality are multiplied by , as compared to Fact 2. As we will see, this is because an additional dependence on the vector-valued polynomial is introduced by the need to project the high-dimensional vectors onto the line . We have already tackled a situation where an SoS-provable inequality seemed to require cancelling terms of left and right in order to be useful (i.e. when we used Fact 2 to prove Lemma 2), and similar ideas will work here.

The main innovation in proving Fact 3 is the use of the -th moment inequality in . Other than that, we follow the proof of Fact 2 almost line by line. The main proposition needed to use the -th moment inequality is:

Proposition..

*Proof of Proposition.*

Expanding the polynomial we get

where is a multi-index over and “even” means that every index in occurs with even multiplicity. (Other terms vanish by symmetry of .) Since is always even in the sum, the monomial is a square, and by standard properties of Gaussian moments. Hence,

.

QED.

*Proof of Fact 3.*

As usual, we write things out in terms of , then apply Holder’s inequality and the triangle inequality. First of all,

By SoS Holder’s inequality, we get

By the same reasoning as in Fact 2, using and we get

By our usual squaring and use of , we also get

(We want both sides to be squared so that we are set up to eventually use SoS Cauchy-Schwarz.) We are left with the last two terms, which are -th moments in the direction . If we knew

and similarly

then we would be done. We start with the second inequality. We write the polynomial on the LHS as

Squaring again as usual, it would be enough to bound both and . For the former, using the Proposition above we get

.

For the latter, notice that

where is the matrix

.

Hence by SoS Cauchy-Schwarz, we get

Putting these together, we get

,

the second of the inequalities we wanted. Proving the first one is similar, using the hypothesis

in place of . QED.

The last thing is to use Fact 3 to prove Lemma 3, our high-dimensional SoS identifiability lemma. Predictably, it is almost identical to our previous proof of Lemma 2 using Fact 2.

Lemma 3.

Let . Let be a partition of into pieces of size such that for each , the collection of vectors obeys the following moment bound:.

where is the average and is some number in which is a power of . Let be such that for every .

Let be indeterminates. Let be the set of equations and inequalities defined above. Thinking of the variables as defining a set via its indicator, let be the formal expression Let be a degree pseudoexpectation which satisfies . Then

*Proof of Lemma 3.*

Let satisfy . As in Lemmas 1 and 2, we will endeavor to bound for each .

By -separation,

where we implicitly used the SoS triangle inequality on each coordinate of the vectors and .

So,

Now we are going to bound . By symmetry the same argument will apply to the same expression with and exchanged.

We apply Fact 3:

Then we use pseudoexpectation Cauchy-Schwarz to get

.

Putting these two together and canceling we get

.

Also, clearly . So we get

We started out with the goal of bounding . We have found that

Applying pseudoexpectation Holder’s inequality, we find that

.

Rearranging things, we get

Now proceeding as in the proof of Lemma 2, we know

so

QED.

**Remark: **We cheated ever so slightly in this proof. First of all, we did not state a version of pseudoexpectation Holder’s which allows the exponent , just one which allows . The correct version can be found in Lemma A.4 of this paper. That inequality will work only when is large enough; I think suffices. To handle smaller probably one must remove the square from both sides of Fact 3, which will require a hypothesis which does not use the squared Frobenious norm. This is possible; see e.g. my paper with Jerry Li.

The conclusion of Lemma 3 is almost identical to the conclusion of Lemma 2, and so the rest of the analysis of the high-dimensional clustering algorithm proceeds exactly as in the one-dimensional case. At the end, to show that the clustering algorithm works with high probability to cluster samples from a separated Gaussian mixture model, one uses straightforward concentration of measure to show that if are enough samples from a -separated mixture model, then the satisfy the hypotheses of Lemma 3 with high probability. This concludes our “proof” of the main theorem from way back in part 1.

The reader interested in further applications of the Sum of Squares method to unsupervised learning problems may consult some of the following works.

- [Barak, Kelner, Steurer] on dictionary learning
- [Potechin, Steurer] on tensor completion
- [Ge, Ma] on random overcomplete tensors

Though we have not seen it in these posts, often the SoS method overlaps with questions about tensor decomposition. For some examples in this direction, see the [Barak, Kelner, Steurer] dictionary learning paper above, as well as

- [Ma, Shi, Steurer] on tensor decomposition

The SoS method can often be used to design algorithms which have more practical running times than the large SDPs we have discussed here. (This often requires further ideas, to avoid solving large semidefinite programs.) See e.g.:

- [Hopkins, Schramm, Shi, Steurer] on extracting spectral algorithms (rather than SDP-based algorithms) from SoS proofs
- [Schramm, Steurer] developing a sophisticated spectral method for dictionary learning via SoS proofs
- [Hopkins, Kothari, Potechin, Raghavendra, Schramm, Steurer] with a meta-theorem on when it is possible to extract spectral algorithms from SoS proofs

Another common tool in constructing SoS proofs for unsupervised learning problems which we did not see here are concentration bounds for random matrices whose entries are low-degree polynomials in independent random variables. For some examples along these lines, see

- [Hopkins, Shi, Steurer] on tensor principal component analysis
- [Rao, Raghavendra, Schramm] on random constraint satisfaction problems

as well as several of the previous papers.

]]>Last time we finished our algorithm design and analysis for clustering one-dimensional Gaussian mixtures. Clustering points on $\mathbb{R}$ isn’t much of a challenge. In this post we will finally move to the high-dimensional setting. We will see that most of the ideas and arguments so far carry over nearly unchanged.

In keeping with the method we are advocating throughout the posts, the first thing to do is return to the non-SoS cluster identifiability proof from Part 1 and see how to generalize it to collections of points in dimension $d > 1$. We encourage the reader to review that proof.

Our first step in designing that proof was to correctly choose a property of a collection of samples from a Gaussian mixture which we would rely on for identifiability. The property we chose was that the points $X_1,\ldots,X_n$ break into clusters $S_1,\ldots,S_k$ of equal size so that each cluster has bounded empirical $t$-th moments and the means of the clusters are separated.

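Here is our first attempt at a high-dimensional generalization: $X_1,\ldots,X_n \in \mathbb{R}^d$ break into clusters $S_1,\ldots,S_k$ of equal size such that

(1) for each cluster $S_i$ and $u \in \mathbb{R}^d$,

$\mathbb{E}_{j \sim S_i} |\langle X_j - \mu_i, u \rangle|^t \leq 2 \cdot t^{t/2} \cdot \|u\|^t$

where $\mu_i$ is the empirical mean of cluster $S_i$, and

(2) those means are separated: $\|\mu_i - \mu_j\| \geq \Delta$ for $i \neq j$.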

The first property says that every one-dimensional projection of every cluster has Gaussian $t$-th moments. The second should be familiar: we just replaced distance on the line with distance in $\mathbb{R}^d$.
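
As a rough numerical illustration of property (1), the following Python sketch computes the empirical moment of a cluster along a direction $u$. The function name `projected_moment` and the toy points are hypothetical, and the sketch only evaluates the quantity being bounded; it does not check any particular bound.

```python
def projected_moment(points, u, t):
    """Empirical t-th absolute moment of the centered points projected onto u."""
    n = len(points)
    dim = len(points[0])
    # Empirical mean of the cluster, coordinate by coordinate.
    mean = [sum(p[c] for p in points) / n for c in range(dim)]
    # Inner products <X_j - mean, u> for every point in the cluster.
    projections = [
        sum((p[c] - mean[c]) * u[c] for c in range(dim)) for p in points
    ]
    return sum(abs(x) ** t for x in projections) / n

# Toy cluster in the plane (illustrative only), centered at the origin.
CLUSTER = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
MOMENT_ALONG_X = projected_moment(CLUSTER, (1.0, 0.0), t=4)
MOMENT_ALONG_DIAGONAL = projected_moment(CLUSTER, (1.0, 1.0), t=4)
```

Property (1) asks that such moments be controlled for every direction $u$ at once, which is what makes it harder to certify than the one-dimensional condition.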

The main steps in our one-dimensional non-SoS identifiability proofs were Fact 1 and Lemma 1. We will give an informal discussion on their high-dimensional generalizations; for the sake of brevity we will skip a formal non-SoS identifiability proof this time and go right to the SoS proof.

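The key idea is: for any pair of sets $S, S' \subseteq [n]$ such that $\{X_j\}_{j \in S}$ and $\{X_j\}_{j \in S'}$ satisfy the empirical $t$-th moment bound (1) with respect to empirical means $\mu$ and $\mu'$ respectively, if $\|\mu - \mu'\| \geq \Delta$, then the one-dimensional projections

$\left\{ \left\langle X_j, \frac{\mu - \mu'}{\|\mu - \mu'\|} \right\rangle \right\}_{j \in S}$ and $\left\{ \left\langle X_j, \frac{\mu - \mu'}{\|\mu - \mu'\|} \right\rangle \right\}_{j \in S'}$

are collections of numbers in $\mathbb{R}$ which satisfy the hypotheses of our one-dimensional identifiability arguments. All we did was choose the right one-dimensional projection of the high-dimensional points to capture the separation between $\mu$ and $\mu'$.

(The reader is encouraged to work this out for themselves; it is easiest to shift all the points so that without loss of generality one of the two means is the origin.)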
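
The choice of projection can be sketched in a few lines of Python. This is a minimal illustration with hypothetical names (`empirical_mean`, `project_onto_mean_direction`) and made-up points: it projects two clusters onto the unit vector pointing from one empirical mean to the other, and the projected means then differ by exactly the distance between the original means.

```python
import math

def empirical_mean(points):
    """Coordinate-wise average of a list of points."""
    n = len(points)
    return [sum(p[c] for p in points) / n for c in range(len(points[0]))]

def project_onto_mean_direction(points, mu, mu_prime):
    """Project each point onto the unit vector (mu - mu') / ||mu - mu'||."""
    diff = [a - b for a, b in zip(mu, mu_prime)]
    norm = math.sqrt(sum(x * x for x in diff))
    unit = [x / norm for x in diff]
    return [sum(pc * uc for pc, uc in zip(p, unit)) for p in points]

# Two toy clusters in the plane (illustrative only); their means are 5 apart.
CLUSTER_A = [(0.0, 0.0), (2.0, 0.0)]     # mean (1, 0)
CLUSTER_B = [(4.0, 3.0), (6.0, 3.0)]     # mean (5, 3)
MU, MU_PRIME = empirical_mean(CLUSTER_A), empirical_mean(CLUSTER_B)

PROJ_A = project_onto_mean_direction(CLUSTER_A, MU, MU_PRIME)
PROJ_B = project_onto_mean_direction(CLUSTER_B, MU, MU_PRIME)
# Difference between the projected means of the two clusters.
GAP = sum(PROJ_A) / len(PROJ_A) - sum(PROJ_B) / len(PROJ_B)
```

In this example `GAP` equals the distance between the two means, so the separation survives the projection to one dimension.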

We are going to face two main difficulties in turning the high-dimensional non-SoS identifiability proofs into SoS proofs.

(1) The one-dimensional projections above have $\|\mu - \mu'\|$ in the denominator, which is not a low-degree polynomial. This is easy to handle, and we have seen similar things before: we will just clear denominators of all inequalities in the proofs, and raise both sides to a high-enough power that we get polynomials.

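(2) The high-dimensional $t$-th moment bound has a “for all $u$” quantification. That is, if $w = w_1,\ldots,w_n$ are indeterminates as in our one-dimensional proof, to be interpreted as the indicators for membership in a candidate cluster $S \subseteq [n]$, we would like to enforce

$\max_{u \in \mathbb{R}^d, \, \|u\| = 1} \frac{1}{N} \sum_{i \in [n]} w_i \langle X_i - \mu(w), u \rangle^t \leq 2 \cdot t^{t/2}$,

where $N$ is the cluster size and $\mu(w) = \frac{1}{N} \sum_{i \in [n]} w_i X_i$.

Because of the $\max$, this is not a polynomial inequality in $w$. This turns out to be a serious problem, and it will require us to strengthen our assumptions about the points $X_1,\ldots,X_n$.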

In order for the SoS algorithm to successfully cluster $X_1,\ldots,X_n$, it needs to *certify* that each of the clusters it produces satisfies the $t$-th empirical moment property. Exactly why this is so, and whether it would also be true for non-SoS algorithms, is an interesting topic for discussion. But, for the algorithm to succeed, in particular a short certificate of the above inequality must exist! It is probably not true that such a certificate exists for an arbitrary collection of points in $\mathbb{R}^d$ satisfying the $t$-th empirical moment bound. Thus, we will add the existence of such a certificate as an assumption on our clusters.

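When $Y_1,\ldots,Y_m$ are sufficiently-many samples from a $d$-dimensional standard Gaussian, the following matrix inequality is a short certificate of the $t$-th empirical moment property:

$\left\| \mathbb{E}_{i \sim [m]} \left(Y_i^{\otimes t/2}\right)\left(Y_i^{\otimes t/2}\right)^\top - \mathbb{E}_{Y \sim \mathcal{N}(0,I)} \left(Y^{\otimes t/2}\right)\left(Y^{\otimes t/2}\right)^\top \right\|_F \leq 1$

where the norm is the Frobenius norm (spectral norm would have been sufficient but the inequality is easier to verify with Frobenius norm instead, and this just requires taking a few more samples). This inequality says that the empirical $t$-th moment matrix of $Y_1,\ldots,Y_m$ is close to its expectation in Frobenius norm. It certifies the $t$-th moment bound, because for any $u \in \mathbb{R}^d$, we would have

$\mathbb{E}_{i \sim [m]} \langle Y_i, u \rangle^t \leq \mathbb{E}_{Y \sim \mathcal{N}(0,I)} \langle Y, u \rangle^t + \|u\|^t \leq 2 \cdot t^{t/2} \cdot \|u\|^t$

by analyzing the quadratic forms of the empirical and true $t$-th moment matrices at the vector $u^{\otimes t/2}$.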

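In our high-dimensional SoS identifiability proof, we will remember the following things about the samples $X_1,\ldots,X_n$ from the underlying Gaussian mixture.

- $X_1,\ldots,X_n$ break into clusters $S_1,\ldots,S_k$, each of size $N = n/k$, so that if $\mu_i$ is the empirical mean of the $i$-th cluster, $\|\mu_i - \mu_j\| \geq \Delta$ if $i \neq j$, and
- For each cluster $S_i$:

$\left\| \mathbb{E}_{j \sim S_i} \left([X_j - \mu_i]^{\otimes t/2}\right)\left([X_j - \mu_i]^{\otimes t/2}\right)^\top - \mathbb{E}_{Y \sim \mathcal{N}(0,I)} \left(Y^{\otimes t/2}\right)\left(Y^{\otimes t/2}\right)^\top \right\|_F^2 \leq 1$.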

Now we are prepared to describe our high-dimensional algorithm for clustering Gaussian mixtures. For variety’s sake, this time we are going to describe the algorithm before the identifiability proof. We will finish up the high-dimensional identifiability proof, and hence the analysis of the following algorithm, in the next post, which will be the last in this series.

Given a collection of points , let be the following set of polynomial inequalities in indeterminates :

for all

where as usual .

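The algorithm is: given $X_1,\ldots,X_n$, find a degree-$O(t)$ pseudoexpectation $\tilde{\mathbb{E}}$ of minimal $\|\tilde{\mathbb{E}} ww^\top\|_F$ satisfying the polynomial inequalities above. Run the rounding procedure from the one-dimensional algorithm on $\tilde{\mathbb{E}} ww^\top$.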

]]>Last time we finished our SoS identifiability proof for one-dimensional Gaussian mixtures. In this post, we are going to turn it into an algorithm. In Part 3 we proved Lemma 2, which we restate here.

Lemma 2.

Let . Let be a partition of into pieces of size such that for each , the collection of numbers obeys the following moment bound:where is the average and is a power of in . Let be such that for every .

Let be indeterminates. Let be the following set of equations and inequalities.

As before is the polynomial . Thinking of the variables as defining a set via its indicator, let be the formal expression

Let be a degree pseudoexpectation which satisfies . Then

We will design a convex program to exploit the SoS identifiability proof, and in particular Lemma 2. Then we describe a (very simple) rounding procedure and analyze it, which will complete our description and analysis of the one-dimensional algorithm.

Let’s look at the hypothesis of Lemma 2. It asks for a pseudoexpectation of degree $O(t)$ which satisfies the inequalities $\mathcal{A}$. First of all, note that the inequalities $\mathcal{A}$ depend only on the vectors $X_1,\ldots,X_n$ to be clustered, and in particular not on the hidden partition $S_1,\ldots,S_k$, so they are fair game to use in our algorithm. Second, it is not too hard to check that the set of pseudoexpectations satisfying $\mathcal{A}$ is convex, and in fact is the feasible region of a semidefinite program with $n^{O(t)}$ variables!

It is actually possible to design a rounding algorithm which takes any pseudoexpectation satisfying and produces a cluster, up to about a -fraction of misclassified points. Then the natural approach to design an algorithm to find all the clusters is to iterate:

(1) find such a pseudoexpectation via semidefinite programming

(2) round to find a cluster

(3) remove all the points in from , go to (1).

This is a viable algorithm, but analyzing it is a little painful because misclassifications from early rounds of the rounding algorithm must be taken into account when analyzing later rounds, and in particular a slightly stronger version of Lemma 2 is needed, to allow some error from early misclassifications.

We are going to avoid this pain by imposing some more structure on the pseudoexpectation our algorithm eventually rounds, to enable our rounding scheme to recover all the clusters without re-solving a convex program. This is not possible if one is only promised a pseudoexpectation which satisfies : observe, for example, that one can choose the pseudodistribution to be a probability distribution supported on one point , the indicator of cluster . This particular is easy to round to extract , but contains no information about the remaining clusters .

We are going to use a trick reminiscent of entropy maximization to ensure that the pseudoexpectation we eventually round contains information about all the clusters . Our convex program will be:

where is the Frobenius norm of the matrix .

It may not be so obvious why is a good thing to minimize, or what it has to do with entropy maximization. We offer the following interpretation. Suppose that instead of pseudodistributions, we were able to minimize over all which are supported on vectors which are the indicators of the clusters . Such a distribution is specified by nonnegative for which sum to , and the Frobenius norm is given by

where we have used orthogonality if . Since all the clusters have size , we have , and we have , where is the -norm, or collision probability, of . This collision probability is minimized when is uniform.
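To make this interpretation concrete, here is a small numerical sketch, not part of the formal argument. It assumes the indicators are $a_1, \ldots, a_k$ for $k$ contiguous clusters of size $n/k$ and calls the weights `mu` (both names are ours). Under those assumptions, orthogonality gives the Frobenius norm of $\sum_i \mu_i a_i a_i^\top$ as $(n/k) \|\mu\|_2$, and the code compares that closed form against a brute-force sum over matrix entries.

```python
import math

def frobenius_norm_of_mixture(mu, n):
    """Brute-force Frobenius norm of sum_i mu[i] * a_i a_i^T.

    Assumption for this sketch: cluster i is the contiguous block of
    indices [i * n/k, (i + 1) * n/k), and k = len(mu) divides n.
    """
    k = len(mu)
    size = n // k
    total = 0.0
    for r in range(n):
        for c in range(n):
            # Entry (r, c) equals mu[i] when r and c share cluster i, else 0.
            if r // size == c // size:
                total += mu[r // size] ** 2
    return math.sqrt(total)

def closed_form(mu, n):
    """(n/k) * ||mu||_2, the value predicted by orthogonality of the a_i."""
    return (n / len(mu)) * math.sqrt(sum(m * m for m in mu))
```

Trying a uniform `mu`, a skewed one, and a point mass on a single cluster shows the norm growing as the weights concentrate, which is why minimizing it pushes toward a solution that spreads its mass over all the clusters.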

We can analyze our convex program via the following corollary of Lemma 2.

Corollary 1.

Let and be as in Lemma 2. Let be the degree pseudoexpectation solving

Let be the uniform distribution over vectors where is the indicator of cluster . Then

where is the Frobenius norm.

*Proof of Corollary 1.*

The uniform distribution over is a feasible solution to the convex program with , by the calculation preceding the corollary. So if is the minimizer, we know .

We expand the norm:

To bound the last term we use Lemma 2. In the notation of that Lemma,

Remember that . Putting these together, we get

QED.
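For readers who want the expansion spelled out, here is a sketch in notation introduced only for this illustration: $P = \tilde{\mathbb{E}}\, ww^\top$ is the matrix of the minimizer, $Q = \frac{1}{k} \sum_{i \le k} a_i a_i^\top$ is the matrix of the uniform distribution over cluster indicators $a_i$, and $\langle \cdot, \cdot \rangle$ is the entrywise inner product. The names $P$ and $Q$ are ours, and the sketch assumes clusters of size $n/k$.

```latex
\begin{align*}
\|P - Q\|^2
  &= \|P\|^2 + \|Q\|^2 - 2 \langle P, Q \rangle \\
  &\le 2 \|Q\|^2 - 2 \langle P, Q \rangle
     && \text{(minimality of } P \text{ and feasibility of } Q\text{)} \\
  &= \frac{2 n^2}{k^3} - \frac{2}{k} \sum_{i \le k} \tilde{\mathbb{E}}\, \langle w, a_i \rangle^2
     && \text{(using } \|a_i\|^2 = n/k \text{ and } \langle a_i, a_j \rangle = 0 \text{ for } i \ne j\text{)}.
\end{align*}
```

Since $\langle w, a_i \rangle = \sum_{j \in S_i} w_j$, the remaining sum is, under this reading, the kind of quantity that Lemma 2 bounds from below.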

We are basically done with the algorithm now. Observe that the matrix contains all the information about the clusters . In fact the clusters can just be read off of the rows of .

Once one has in hand a matrix which is close in Frobenius norm to , extracting the clusters is still a matter of reading them off of the rows of the matrix (choosing rows at random to avoid hitting one of the small number of rows which could be wildly far from their sisters in ).

We will prove the following fact at the end of this post.

Fact: rounding.

Let be a partition of into parts of size . Let be the indicator matrix for same-cluster membership. That is, if and are in the same cluster . Suppose $M \in \mathbb{R}^{n \times n}$ satisfies . There is a polynomial-time algorithm which takes and with probability at least produces a partition of into clusters of size such that, up to a permutation of ,

Now we sketch a proof of Theorem 1 in the case . Our algorithm is: given , solve

then apply the rounding algorithm from the rounding fact to and output the resulting partition.

If the vectors satisfy the hypothesis of Lemma 2, then by Corollary 1, we know

where is the uniform distribution over indicators for clusters . Hence the rounding algorithm produces a partition of such that

Since a standard Gaussian has , elementary concentration shows that the vectors satisfy the hypotheses of Lemma 1 with probability so long as .

The last thing to do in this post is prove the rounding algorithm Fact. This has little to do with SoS; the algorithm is elementary and combinatorial. We provide it for completeness.

The setting is: there is a partition of into parts of size . Let be the indicator matrix for cluster membership; i.e. if and only if and are in the same cluster . Given a matrix such that , the goal is to recover a partition of which is close to up to a permutation of .

Let be the -th row of and similarly for . Let be a parameter we will set later.

The algorithm is:

(1) Let be the set of active indices.

(2) Pick uniformly.

(3) Let be those indices for which .

(4) Add to the list of clusters and let $\mathcal{I} := \mathcal{I} \setminus S$.

(5) If , go to (2).

(6) (postprocess) Assign remaining indices to clusters arbitrarily, then move indices arbitrarily from larger clusters to smaller ones until all clusters have size .
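The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the distance threshold in step (3) is exposed as a hypothetical parameter `radius`, the stopping rule in step (5) is replaced by "stop once `k` clusters are found or no active indices remain", and `n` is assumed to be divisible by `k`. The function name `round_to_clusters` is ours.

```python
import math
import random

def round_to_clusters(M, k, radius, seed=0):
    """Greedy row-based rounding of an n x n matrix M into k clusters.

    Assumptions for this sketch: `radius` stands in for the threshold of
    step (3), and the loop stops after k clusters or when no index is active.
    """
    n = len(M)
    size = n // k  # target cluster size; assumes k divides n
    rng = random.Random(seed)
    active = set(range(n))  # step (1)
    clusters = []
    while active and len(clusters) < k:
        i = rng.choice(sorted(active))  # step (2): uniform active index
        # Step (3): all active rows within `radius` of row i (includes i).
        S = [j for j in sorted(active) if math.dist(M[i], M[j]) <= radius]
        clusters.append(S)  # step (4)
        active -= set(S)
    # Step (6): pad to k clusters, then rebalance to equal sizes.
    while len(clusters) < k:
        clusters.append([])
    leftovers = sorted(active)
    for c in clusters:
        while len(c) > size:
            leftovers.append(c.pop())
    for c in clusters:
        while len(c) < size:
            c.append(leftovers.pop())
    return [sorted(c) for c in clusters]
```

The rebalancing moves indices arbitrarily, as in step (6); that is harmless when only a few indices are bad.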

Fact: rounding.

If then with probability at least the rounding algorithm outputs disjoint clusters , each of size , such that up to a permutation of , .

*Proof.*

Call an index *good* if . An index is bad if it is not good. By hypothesis . Each bad index contributes at least to the left side and , so there are at most bad indices.

If are good indices and both are in the same cluster , then if the algorithm chooses , the resulting cluster will contain . If , then also if is good but is in some other cluster , the cluster formed upon choosing will not contain . Thus if the algorithm never chooses a bad index, before postprocessing the clusters it outputs will (up to a global permutation of ) satisfy that contains all the good indices in . Hence in this case only bad indices can be misclassified, so the postprocessing step moves at most indices, and in the end the cluster again errs from on at most indices.

Consider implementing the algorithm by drawing a list of indices before seeing (i.e. obliviously); then, when the algorithm requires a random index, we give it the next index in our list which is in (and halt with no output if no such index exists). It’s not hard to see that this implementation fails only with probability at most . Furthermore, by a union bound the list contains a bad index only with probability . Choosing thus completes the proof. QED.

Let’s have a brief recap. We are designing an algorithm to cluster samples from Gaussian mixture models on . Our plan is to do this by turning a simple *identifiability proof* into an algorithm. For us, “simple” means that the proof is captured by the low degree Sum of Squares (SoS) proof system.

We have so far addressed only the case (which will remain true in this post). In part 1 we designed our identifiability proof, not yet trying to formally capture it in SoS. The proof was simple in the sense that it used only the triangle inequality and Holder’s inequality. In part 2 we defined SoS proofs formally, and stated and proved an SoS version of one of the key facts in the identifiability proof (Fact 2).

In this post we are going to finish up our SoS identifiability proof. In the next post, we will see how to transform the identifiability proof into an algorithm.

We recall our setting formally. Although our eventual goal is to cluster samples drawn from a mixture of Gaussians, we decided to remember only a few properties of such a collection of samples, which will hold with high probability.

The properties are:

(1) They break up into clusters of equal size, , such that for some , each cluster obeys the empirical moment bound,

where is the empirical mean of the cluster , and

(2) Those means are separated: .

The main statement of cluster identifiability was Lemma 1, which we restate for convenience here.

Lemma 1.

Let . Let be a partition of into pieces of size such that for each , the collection of numbers obeys the following moment bound:

where is the average and is some number in . Let be such that for every . Suppose is large enough that .

Let have size and be such that obey the same moment-boundedness property:

for the same , where is the mean . Then there exists an such that

for some universal constant .

Our main goal in this post is to state and prove an SoS version of Lemma 1. We have already proved the following Fact 2, an SoS analogue of Fact 1 which we used to prove Lemma 1.

Fact 2.

Let . Let have ; let be its mean. Let be a power of . Suppose satisfies

Let be indeterminates. Let be the following set of equations and inequalities.

Then

We are going to face a couple of problems.

(1) The statement and proof of Lemma 1 are not sufficiently symmetric for our purposes: it is hard to phrase things like “there exists a cluster such that…” as statements directly about polynomials. We will handle this by giving a more symmetric version of Lemma 1, with a more symmetric proof.

(2) Our proof of Lemma 1 uses the conclusion of Fact 1 in the form

whereas Fact 2 concludes something slightly different:

The difference in question is that the polynomials in Fact 2 are degree , and appears on both sides of the inequality. If we were not worried about SoS proofs, we could just cancel terms in the second inequality and take -th roots to obtain the first, but these operations are not necessarily allowed by the SoS proof system.

One route to handling this would be to state and prove a version of Lemma 1 which concerns only degree . This is probably possible but definitely inconvenient. Instead we will exhibit a common approach to situations where it would be useful to cancel terms and take roots but the SoS proof system doesn’t quite allow it: we will work simultaneously with SoS proofs and with their dual objects, *pseudodistributions*.

We will tackle issues (1) and (2) in turn, starting with the (a)symmetry issue.

We pause here to record an alternative version of Lemma 1, with an alternative proof. This second version is conceptually the same as the one we gave in part 1, but it avoids breaking the symmetry among the clusters , whereas this was done at the very beginning of the first proof, by choosing the ordering of the clusters by . Maintaining this symmetry requires a slight reformulation of the proof, but will eventually make it easier to phrase the proof in the Sum of Squares proof system. In this proof we will also avoid the assumption ; in exchange, we will pay a factor of rather than in the final bound.

Alternative version of Lemma 1.

Let .

Let be a partition of into pieces of size such that for each , the collection of numbers obeys the following moment bound:

where is the average and is some number in . Let be such that for every .

Let have size and be such that obey the same moment-boundedness property:

.

for the same , where . Then

We remark on the conclusion of this alternative version of Lemma 1. Notice that are nonnegative numbers which sum to . The conclusion of the lemma is that for . Since the sum of their squares is at least , one obtains

matching the conclusion of our first version of Lemma 1 up to an extra factor of .
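The step from a bound on the sum of squares to a bound on the largest term is worth writing out. In the sketch below, $x_1, \ldots, x_k$ stand for the nonnegative numbers in question, which sum to $1$, and $1 - \delta$ is a placeholder (introduced here, not taken from the lemma) for the lower bound on the sum of their squares.

```latex
\[
\sum_{i \le k} x_i^2
  \;\le\; \Big( \max_{i \le k} x_i \Big) \sum_{i \le k} x_i
  \;=\; \max_{i \le k} x_i ,
\qquad \text{so} \qquad
\sum_{i \le k} x_i^2 \ge 1 - \delta
  \;\Longrightarrow\;
\max_{i \le k} x_i \ge 1 - \delta .
\]
```

In words: if the squares of a probability vector sum to nearly $1$, then one coordinate carries nearly all of the mass.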

*Proof of alternative version of Lemma 1.*

Let again have size with mean and -th moment bound . Since partition ,

We will endeavor to bound for every pair . Since ,

Certainly and similarly for , so this is at most

Using Fact 1, this in turn is at most . So, we obtained

for every .

Putting this together with our first bound on , we get

QED.

Now that we have resolved the asymmetry issue in our earlier version of Lemma 1, it is time to move on to pseudodistributions, the dual objects of SoS proofs, so that we can tackle the last remaining hurdles to proving an SoS version of Lemma 1.

Pseudodistributions are the convex duals of SoS proofs. As with SoS proofs, there are several expositions covering elementary definitions and results in detail (e.g. the lecture notes of Barak and Steurer, here and here). We will define what we need to keep the tutorial self-contained but refer the reader elsewhere for further discussion. Here we follow the exposition in those lecture notes.

As usual, let be some indeterminates. For a finitely-supported function and a function , define

If defines a probability distribution, then is the operator sending a function to its expectation under .

A finitely-supported is a degree pseudodistribution if

(1)

(2) for every polynomial of degree at most .

When is clear from context, we usually suppress it and write . Furthermore, if is an operator and for some pseudodistribution , we often abuse terminology and call a pseudoexpectation.
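A toy example may help separate pseudodistributions from honest distributions. The sketch below uses a hypothetical signed function `mu` on three points of the real line, and assumes the usual convention that a degree-$d$ pseudodistribution requires $\tilde{\mathbb{E}}\, p^2 \geq 0$ for polynomials $p$ of degree at most $d/2$. This `mu` has a negative weight, so it is not a probability distribution, yet it passes the degree-2 test; it fails the degree-4 test on $p(x) = 1 - x^2$.

```python
def pseudo_expectation(mu, f):
    """Return sum_x mu(x) * f(x) for a finitely supported mu = {point: weight}."""
    return sum(weight * f(x) for x, weight in mu.items())

# Hypothetical signed weights: they sum to 1, but one of them is negative.
mu = {-1.0: 1.0, 0.0: -1.0, 1.0: 1.0}

def passes_degree_two_spot_check(mu, grid):
    """Spot-check normalization and E[(a + b x)^2] >= 0 over a grid of (a, b).

    For this particular mu the value is a^2 + 2 b^2, so the check succeeds
    for every real a and b; the grid is only a sanity check, not a proof.
    """
    normalized = abs(pseudo_expectation(mu, lambda x: 1.0) - 1.0) < 1e-12
    nonnegative = all(
        pseudo_expectation(mu, lambda x, a=a, b=b: (a + b * x) ** 2) >= -1e-12
        for a in grid
        for b in grid
    )
    return normalized and nonnegative

# Degree-4 witness: the square of p(x) = 1 - x^2 has negative pseudoexpectation.
degree_four_witness = pseudo_expectation(mu, lambda x: (1 - x * x) ** 2)
```

So this `mu` behaves like a distribution as far as degree-2 tests can tell, and raising the degree is what exposes it.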

If is a family of polynomial inequalities and is a degree pseudodistribution, we say satisfies if for every and such that one has

We are not going to rehash the basic duality theory of SoS proofs and pseudodistributions here, but we will need the following basic fact, which is easy to prove from the definitions.

Fact: weak soundness of SoS proofs.

Suppose and that is a degree pseudodistribution which satisfies . Then for every SoS polynomial , if $\deg h + \ell \leq d$ then .

We call this “weak soundness” because somewhat stronger statements are available, which more readily allow several SoS proofs to be composed. See Barak and Steurer’s notes for more.

The following fact exemplifies what we mean in the claim that pseudodistributions help make up for the inflexibility of SoS proofs to cancel terms in inequalities.

Fact: pseudoexpectation Cauchy-Schwarz.

Let be a degree pseudoexpectation on indeterminates . Let and be polynomials of degree at most . Then

As a consequence, if has degree and is a power of , by induction

*Proof of pseudoexpectation Cauchy-Schwarz.*

For variety, we will do this proof in the language of matrices rather than linear operators. Let be the matrix indexed by monomials among of degree at most , with entries . If is a polynomial of degree at most , we can think of as a vector indexed by monomials (whose entries are the coefficients of ) such that . Hence,

QED.
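The matrix view in this proof is easy to try numerically. The sketch below is an illustration under simplifying assumptions: it uses one variable and an actual probability distribution on a few points (which is in particular a pseudodistribution), so the inequality it checks is just ordinary Cauchy-Schwarz for the positive semidefinite moment matrix. The helper names are ours. Polynomials are coefficient vectors indexed by the monomials $1, x, x^2, \ldots$.

```python
def moment_matrix(points, weights, half_degree):
    """Matrix with entry (a, b) equal to E[x^(a+b)] for 0 <= a, b <= half_degree."""
    size = half_degree + 1
    return [
        [sum(w * x ** (a + b) for x, w in zip(points, weights)) for b in range(size)]
        for a in range(size)
    ]

def bilinear(M, p, q):
    """p^T M q, which equals E[p(x) q(x)] for coefficient vectors p and q."""
    return sum(p[a] * M[a][b] * q[b] for a in range(len(p)) for b in range(len(q)))

def cauchy_schwarz_gap(M, p, q):
    """(p^T M p)(q^T M q) - (p^T M q)^2, nonnegative whenever M is PSD."""
    return bilinear(M, p, p) * bilinear(M, q, q) - bilinear(M, p, q) ** 2
```

For example, with points `[-1.0, 0.5, 2.0]`, weights `[0.2, 0.5, 0.3]`, `p = [1.0, 2.0]` and `q = [-3.0, 1.0]`, the gap is nonnegative and `bilinear(M, p, q)` agrees with the expectation of the product computed directly.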

We will want a second, similar fact.

Fact: pseudoexpectation Holder’s.

Let be a degree sum of squares polynomial, , and a degree pseudoexpectation. Then

The proof of pseudoexpectation Holder’s is similar to several we have already seen; it can be found as Lemma A.4 in this paper by Barak, Kelner, and Steurer.

We are ready to state and prove our SoS version of Lemma 1. The reader is encouraged to compare the statement of Lemma 2 to the alternative version of Lemma 1. The proof will be almost identical to the proof of the alternative version of Lemma 1.

Lemma 2.

Let . Let be a partition of into pieces of size such that for each , the collection of numbers obeys the following moment bound:

where is the average and is a power of in . Let be such that for every .

Let be indeterminates. Let be the following set of equations and inequalities.

As before is the polynomial . Thinking of the variables as defining a set via its indicator, let be the formal expression

Let be a degree pseudoexpectation which satisfies . Then

*Proof of Lemma 2.*

We will endeavor to bound from above for every . Since we want to use the degree polynomials in Fact 2, we get started with

by (repeated) pseudoexpectation Cauchy-Schwarz.

Since and are -separated, i.e. , we also have

where the indeterminate is and we have only used the SoS triangle inequality. Hence,

Applying Fact 2 and soundness to the right-hand side, we get

Now using that and hence and similarly for , we get

By pseudoexpectation Cauchy-Schwarz

which, combined with the preceding, rearranges to

By pseudoexpectation Holder’s,

All together, we got

Now we no longer have to worry about SoS proofs; we can just cancel the terms on either side of the inequality to get

Putting this together with

finishes the proof. QED.
