
On double blind reviews in theory conferences

January 11, 2018

Michael Mitzenmacher points to two posts of Suresh Venkatasubramanian on the issue of so-called "double blind reviews" (i.e., anonymous submissions) in theory conferences. In short, both Michael and Suresh think they are a good idea. I agree with much of their motivation, but, based on my experience in both non-blinded (e.g., STOC/FOCS) and blinded (e.g., CRYPTO) conferences as both reviewer and author, I do not think double blind reviewing is a good fit for theoretical computer science.

Let me say right off the bat that I think implicit (and, as Michael says, sometimes explicit) bias is a very real phenomenon. Moreover, such biases are not just a problem in the sense that they are "unfair" to authors; they cause real harm to science by suppressing the contributions of certain authors. Nor do I have any principled objection to anonymization: I do, for example, practice anonymous grading in my courses for exactly this reason. I also don't buy the suggestion that we must know the author's identity to evaluate whether a proof is correct. Reviewers can (and do) evaluate whether a proof makes sense without needing to trust the author.

However, there is a huge difference between grading a problem set and refereeing a paper. In the latter case, and in particular in theoretical computer science, you often need the expertise of very particular people who have worked in this area. By the time the paper is submitted to a conference, these experts have often already seen it, either because it was posted on the arxiv/eccc/eprint, or because they have seen a talk on it, or perhaps because they have already discussed it with the authors by email.

More generally, these days much of theoretical CS is moving to the model where papers are first posted online, and by the time they are submitted to a conference they have circulated quite a bit around the relevant experts. Posting papers online is very good for science and should be encouraged, as it allows fast dissemination of results, but it does make the anonymous submission model obsolete.

One could say that if the authors' identity is revealed then there is no harm, since in such a case we simply revert to the original form of non-anonymous submissions. However, the fact that the authors' identity is known to some but not all participants in the process (e.g., maybe to some reviewers but not others) makes some conflicts and biases invisible. Moreover, the fact that the authors' identity is not "officially" known causes a lot of practical headaches.

For example, as a PC member you can't just shoot a quick email to an expert to ask for a quick opinion on the paper, since they may well be an author themselves (as happened to me several times as a CRYPTO PC member), or someone closely related to the authors. Moreover, you often have the case where a reviewer knows who the authors are and has some history with them, even if it is not a formal conflict, but the program committee member does not know this. In particular, anonymous submissions completely preclude a disclosure-based model for conflicts of interest (where reviewers disclose their relations with the authors in their reviews); instead you have to move to an exclusion-based model, where reviewers meeting some explicit criteria are ruled out.

If anonymous submissions don't work well for theory conferences, does that mean we just have to accept biases? I don't think so. I believe there are a number of things we could attempt. First, while completely anonymizing submissions might not work well, we could try to make the author names less prominent, for example by putting them on the last page of the submission instead of the first, and not showing them in the conference software. Also, we could try "fairness through awareness". As I mentioned in my tips for future FOCS/STOC chairs, one potential approach is to tag papers by authors who have never had a prior STOC/FOCS paper (one could possibly also tag papers by authors from under-represented groups). One wouldn't give such papers preferential treatment, but rather just make sure they get extra attention. For example, we could add an extra review for such papers. That review might end up being positive or negative, but it would counter the bias of dismissing some works out of hand.

To summarize, I agree with Michael’s and Suresh’s sentiments that biases are harmful and should be combated. I just don’t think anonymous submissions are the way to go about that.

Unique Games Conjecture – halfway there?

January 10, 2018

(Edit: scribe notes on my lectures on this topic are now up.)

Subhash Khot, Dor Minzer and Muli Safra just posted an exciting manuscript online. In it, they confirm the combinatorial hypothesis I've posted about before on the structure of non-expanding sets in the degree-two short code graph (or, equivalently, in the Grassmann graph). Together with their prior work with Dinur and Kindler, and a small assist from an expansion-to-testing reduction of Kothari, Steurer, and myself, this establishes (an imperfect completeness variant of) Khot's two-to-one conjecture and the intermediate complexity conjecture. This also makes significant progress towards proving, or at least giving evidence for, the more famous unique games conjecture (UGC). In fact, as I explain below, there is a technical sense in which it goes "halfway" towards proving the UGC.

The Unique Games (UG(s,c)) problem with parameters 0 \leq s \leq c \leq 1 is the following: given a set of linear equations, each involving at most two variables over some finite field, distinguish between the completeness case, where there exists an assignment to the variables satisfying at least a c fraction of the equations, and the soundness case, where every assignment satisfies fewer than an s fraction.
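To make the distinguishing task concrete, here is a tiny, purely illustrative Python sketch (my own toy example, not from the papers discussed here) that builds a unique games instance in the standard special form x_u - x_v = c \pmod q and brute-forces the best satisfiable fraction. Real instances are of course far too large for brute force; the point is only to pin down what UG(s,c) asks.

```python
import itertools
import random

# A toy unique games instance: equations of the form x_u - x_v = c (mod q).
# Note the "unique" structure: fixing x_v leaves exactly one satisfying x_u.
q = 5   # alphabet / field size
n = 4   # number of variables (tiny, so brute force below is feasible)
m = 8   # number of equations

random.seed(0)
equations = []
for _ in range(m):
    u, v = random.sample(range(n), 2)
    equations.append((u, v, random.randrange(q)))

def satisfied_fraction(assignment, equations, q):
    """Fraction of equations x_u - x_v = c (mod q) satisfied by an assignment."""
    good = sum(1 for (u, v, c) in equations
               if (assignment[u] - assignment[v]) % q == c)
    return good / len(equations)

# Brute force over all q^n assignments (only possible because n is tiny).
best = max(satisfied_fraction(a, equations, q)
           for a in itertools.product(range(q), repeat=n))
print(f"best satisfiable fraction: {best:.2f}")

# UG(s,c) is the promise problem: decide whether best >= c or best < s.
```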

Clearly the UG(s,c) problem  becomes easier the larger the gap between c and s. The unique games conjecture is that the problem is as hard as it can be, in the sense that it is NP hard for c arbitrarily close to one and s arbitrarily close to zero (as a function of the field size which we assume tends to infinity in what follows).  In other words, the difficulty of UG(s,c) as a function of c and s is conjectured by the UGC to look like this:

[Figure: the conjectured difficulty of UG(s,c) under the UGC, as a function of c and s]

 

(Note that the problem does not make sense when c<s. Also, in this figure we ignore regions of measure zero. In particular for c = 1-o(1)  or s=o(1) the problem can be solved by a semidefinite program where the precise o(1) factors depend on the field size; it is known that if the UGC is true then this dependence is optimal.)

 

Until today, what was known about  unique games could be summarized in this (not to scale) figure:

[Figure: the previously known complexity landscape of UG(s,c), not to scale]

That is, when s and c are sufficiently close to each other, UG(s,c) was known to be NP hard (see for example this paper), and in fact via a linear-blowup reduction establishing exponential hardness (i.e., 2^{\tilde{\Omega}(n)} hardness under the exponential time hypothesis). On the other hand, when either the completeness parameter is sufficiently close to one or the soundness is sufficiently close to zero, there was a known subexponential time algorithm of Arora, Steurer, and myself (see also here) for UG(s,c). (That is, an algorithm running in time 2^{n^\xi} for some \xi<1 which tends to zero as either the completeness tends to one or the soundness tends to zero.)

However, that algorithm was of course only an upper bound, and we did not know whether it could be improved further to 2^{n^{o(1)}} or even polynomial time. Moreover, its mere existence showed that in some sense the techniques of the previously known NP hardness results for UG(s,c) (which used a linear blow up reduction) are inherently inapplicable to establishing the UGC which requires hardness in a completely different regime.

The new result of Khot, Minzer and Safra (when combined with the prior ones) shows that UG(s,c) is NP hard for c arbitrarily close to half and s arbitrarily close to zero. (The result is presented as hardness of 2 to 2 games with completeness close to one, but immediately implies hardness of unique games with completeness close to half.) That is, the new picture of unique games’ complexity is as follows:

[Figure: the complexity landscape of UG(s,c) after the Khot-Minzer-Safra result]

This establishes for the first time hardness of unique games in the regime for which a sub-exponential time algorithm was known, and hence (necessarily) uses a reduction with some (large) polynomial blowup. While it is theoretically still possible for the unique games conjecture to be false (as I personally believed would be the case until this latest sequence of results), the most likely scenario is now that the UGC is true, and the complexity of the UG(s,c) problem looks something like the following:

[Figure: the conjectured running-time exponent \xi(c,s) for UG(s,c) over the (c,s) plane]

 

That is, for every 0<s<c<1, the best running time is roughly 2^{n^{\xi(c,s)}}, where \xi:(0,1)^2 \rightarrow (0,1] is a function that is always positive but tends to zero as c tends to one or s tends to zero (and achieves the value one on a region of the plane of positive measure). Of course we are still far from proving this, let alone characterizing this function, but this is very exciting progress nonetheless.

I personally am also deeply interested in the question of whether the algorithm that captures this curve is the sum of squares algorithm. Since SoS does capture the known subexponential algorithms for unique games, the new work provides more evidence that this is the case.

 

 

 

Intro TCS course post-mortem

January 3, 2018

This fall I taught CS 121 – "Introduction to Theoretical Computer Science" – at Harvard. It is analogous to courses known at other universities as "Introduction to the Theory of Computation", "Automata and Computability", or "Great Ideas in Theoretical Computer Science", which are often taught using Sipser's excellent book. However, I decided to significantly revise the course and write my own text (which is still a work in progress but is available online, with the source on a GitHub repository).

CS 121 is a required course for Computer Science concentrators, and so we had about 160 students. There was great variability in students' preparation and background; many of them had not taken a proof-based course before. That, combined with the inevitable first-time kinks, made the first several weeks challenging for both the students and the teaching team. That said, I am overall very pleased with the students' performance. In a course that contained fairly advanced material, students did quite well on the problem sets, the midterm, and the final exam. I was also very pleased with my team of teaching fellows (headed by an amazing undergraduate student – Juan Perdomo) who had to deal with teaching a new iteration of the course, including many concepts that they themselves weren't so familiar with.

Perhaps the most significant change I made from the standard presentation is to make non uniform computation (specifically, straightline programs / circuits with NAND gates) the initial computational model, rather than automata. I am quite happy with this choice and intend to keep it for the following reasons (a small illustrative sketch of such a program appears after the list):

  • Boolean gates such as NAND have much tighter correspondence to actual hardware and convey the notion that this is not an arbitrary abstract model but intends to capture computation as is physically realizable.
  • Starting with a model for finite functions allows us to avoid dealing with infinity in the first few lectures.
  • Many of the conceptual lessons of the course – that we can model computation mathematically, that we can encode an algorithm as a string and give it as input to other algorithms, that there is a universal algorithm, and that some functions are harder than others – can already be explained in the finite non uniform setting.
  • The non uniform model is crucial for talking about cryptography (e.g., explaining notions such as “128 bits of security” or giving a model where bitcoin proofs of work make sense), pseudorandomness and quantum computing.  Cryptography and pseudorandomness are the most compelling examples of “mining hardness” or “making lemonade out of the computational difficulty lemon” which is a core take away concept. Further  I believe that it is crucial to talk about quantum computing in a course that aims to model computation as it exists in the world we live in.
  • A more minor point is that the non uniform model, and the notion of "unrolling the loop" to simulate a uniform computation by a non uniform one, makes certain proofs, such as the Cook-Levin Theorem, Gödel's Incompleteness Theorem, and BPP in P/poly, technically much easier.
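To give a concrete (and deliberately simplistic) feel for the model, here is a short Python sketch of a straight-line program of NAND gates computing the XOR of two bits; this is only my own illustration and does not use the actual NAND syntax from the course materials.

```python
def NAND(a, b):
    """NAND of two bits (0 or 1)."""
    return 1 - a * b

def xor(x0, x1):
    # A straight-line program: each line applies NAND to previously computed
    # values, with no loops or branches. "Unrolling the loop" of a uniform
    # program yields exactly this kind of object, one program per input length.
    u = NAND(x0, x1)
    v = NAND(x0, u)
    w = NAND(x1, u)
    return NAND(v, w)

# Check the truth table.
for x0 in (0, 1):
    for x1 in (0, 1):
        print(x0, x1, "->", xor(x0, x1))
```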

 

So how did the students like it? The overall satisfaction with the course was 3.6 (on a 1-5 scale), which gives me one more reason to be thankful that I'm a tenured professor and not an Uber driver. On the positive side, 60% of the students rated the course as "very good" or "excellent", and 83% rated it as "good", "very good", or "excellent".

Here are the student answers to  the question “What did you learn from this course? How did it change you?”. As expected, they are a mixed bag. One answer was “I learned about the theory of computation. This course made me realize I do not want to study the theory of computation. ” which I guess means that the course helped this student fulfill the ancient goal of knowing thyself.

In terms of difficulty and workload, 44% of the students found it "difficult" and 23% found it "very difficult", which is a little (but not significantly) more than the average difficulty level for CS classes at Harvard. While the mean number of hours (outside lectures) spent on this course per week was 11.6 (par for the course among CS classes), you don't need the sum of squares algorithm to see that the distribution is a mixture model:

[Figure: histogram of weekly hours students reported spending on CS 121]

I imagine that the students with less math preparation needed to work much more (but perhaps also gained more in terms of their math skills).

 

Lessons learned for next time:

  • There is more work to be done on the text, especially to make it more accessible to students not used to reading mathematical definitions and notation. I plan to add more Sipser-style "proof ideas" to the theorems in the text, and to add more plain-English exposition, especially in the earlier chapters.
  • Many students got hung up on the details of the computational models, in particular my "Turing machine analog", the NAND++ programming language. I need to find a way to strike the right balance between having a precise and well-defined model about which we can properly prove theorems, and getting the broader point across that the particular details of the model don't matter.
  • I find the idea of incorporating programming-language based models in this course appealing, and have made some use of Jupyter notebooks in it. I need to spend more thought on how to use these tools in the most pedagogically useful way.

Women In Theory – registration deadline getting closer

December 24, 2017

The deadline to register for the Women In Theory workshop is January 16, 2018. As Omer Reingold posted, this is a wonderful workshop with a strong set of speakers (confirmed speakers include Bonnie Berger, Yael Kalai, Julia Kempe, Gillat Kol, Nancy Lynch, and Barna Saha). It is sure to have great technical content, as well as be a wonderful experience, thanks to the organization of Tal Rabin, Shubhangi Saraf, and Lisa Zhang.

The workshop will take place on June 19-22, 2018 at Harvard University. I am a local co-organizer, together with Madhu Sudan and Salil Vadhan. Having WIT at Harvard brings back great memories for me, since I was involved in the first WIT at Princeton in 2008. In that workshop, Gillat Kol, Barna Saha, and Shubhangi Saraf (now speakers and an organizer) were participants.

I encourage any female graduate student in theoretical computer science to strongly consider applying for this workshop by filling out the form here.

ITCS early registration deadline

December 22, 2017

Message from Costis, Yael, and Vinod:

ITCS is back on the east coast, and will be held at MIT from January 11-14, 2018. As you know, ITCS is a conference that is unique in many respects: it emphasizes dialog and discussion among all sub-areas of TCS, facilitating this with a single-track structure and "chair rants" that provide context for each session. Submissions, refereeing, and presentations emphasize the "I" in ITCS: new concepts and models, new lines of inquiry, new techniques or novel use of existing techniques, and new connections between areas.

All in all, great fun! This year, ITCS will run for four full days with lots of activities. Tickets are going fast: the deadline for early registration and hotel block are both December 28, 2017.

A great tradition at ITCS is the “graduating bits” session, where graduating PhD students and postdocs give brief overviews of their research in advance of going out on the job market. If you fit the description, you should sign up here.

Following the success of the poster session at ITCS’17 and STOC’18, we will have one too, at the Marriott the first evening of the conference. To sign up, go here.

We hope to see many of you at ITCS!
 
Your local organizers,
 
Costis, Yael & Vinod

 

Sam Hopkins’s 6 part learning via SoS series

December 21, 2017

(I'm a non-native speaker – is it Hopkins' or Hopkins's? –Boaz)

Sam Hopkins just completed a heroic 6 part blog post sequence on using the Sum of Squares algorithm for  unsupervised learning.

The goal of unsupervised learning is to recover the underlying structure of a distribution \mathcal{D} given samples X_1,\ldots,X_n drawn from \mathcal{D}. This is phrased by positing a model for the distribution \mathcal{D} of the form P(X|\theta), where \theta is a parameter, and the goal is to recover the parameter \theta from the examples. We can consider the information-theoretic (or computationally unbounded) setting, where the question is one of identifiability – is there enough data to recover (an approximation of) the parameter? – and the more realistic setting where we are computationally bounded, and the question becomes one of recovery – can we efficiently recover the parameter from the data?
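To make the setup concrete, here is a small illustrative Python sketch (my own, not from Sam's posts) where the parameter \theta consists of the two means of an even, one-dimensional Gaussian mixture, and a naive recovery procedure estimates them by splitting the samples at the median. This only works because the means are very well separated; Sam's posts are precisely about doing this robustly and in high dimension.

```python
import random
import statistics

random.seed(0)

# Model P(X | theta): an even mixture of N(mu1, 1) and N(mu2, 1).
theta = (-3.0, 3.0)            # the hidden parameter we would like to recover
n = 2000
samples = [random.gauss(theta[random.randrange(2)], 1.0) for _ in range(n)]

# Naive recovery: split around the median and average each half.
med = statistics.median(samples)
left = [x for x in samples if x <= med]
right = [x for x in samples if x > med]
theta_hat = (statistics.mean(left), statistics.mean(right))

print("true parameter:     ", theta)
print("recovered parameter:", tuple(round(t, 2) for t in theta_hat))
```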

Theoretical computer scientists are used to the setting where every problem can be solved if you have enough computational power, and so intuitively would think that identifiability and recovery have nothing to do with each other, and that the former would be a much easier task than the latter. Indeed this is often the case, but Sam discussed some important cases where we can (almost) automatically transform a proof of identifiability into an efficient algorithm for recovery.

While reading these 6 (not too long) blog posts might take an afternoon, it is an afternoon you would learn something in, and I highly recommend this series, especially for students who might be interested in pursuing questions in the theory of machine learning or the applications of semidefinite programming.

The series is as follows:

  • In Part I Sam describes the general model of unsupervised learning and focuses on the classical problem (dating at least as far back as the 19th century) of learning a Gaussian mixture model, on which significant progress was recently made by Sam and others. He wisely focuses on the one-dimensional case, and gives a proof of identifiability for this case.
  • In Part II Sam introduces the Sum of Squares proof system, and begins an SoS statement and proof of the identifiability result from Part I. He completes this proof of identifiability in the (relatively short) Part III.
  • In Part IV, Sam completes the transformation of this identifiability proof into an algorithm for the one-dimensional Gaussian mixture model.
  • The above already gives a complete description of how to transform an identifiability proof into an algorithm, but of course we really care about the high-dimensional case. In Part V Sam generalizes the identifiability proof to higher dimensions.
  • In Part VI Sam completes the transformation of the higher-dimensional identifiability proof into an algorithm, and also includes an overview of other papers and resources on this general area.

 

While I realize that today’s social media is trending towards 140 characters, I hope that some people still have the attention span for a longer exposition, and I do believe those that read it will find it worthwhile.

Thanks so much Sam!

Clustering and Sum of Squares Proofs, Part 6

December 21, 2017
by Sam Hopkins

This is the 6th and final part of a series on clustering, Gaussian mixtures, and Sum of Squares (SoS) proofs. If you have not read them yet, I recommend starting with Part 1, Part 2, Part 3, Part 4, and Part 5. Also, if you find errors please mention them in the comments (or otherwise get in touch with me) and I will fix them ASAP.

Did you find this series helpful? Unhelpful? What could be done better? If there were to be another tutorial series on a Sum of Squares-related topic, what would you like to see? Let me know in the comments!

Last time we developed our high-dimensional clustering algorithm for Gaussian mixtures. In this post we will make our SoS identifiability proofs high-dimensional. In what is hopefully a familiar pattern by now, these identifiability proofs will also amount to an analysis of the clustering algorithm from part 5. At the end of this post is a modest discussion of some of the literature on the SoS proofs-to-algorithms method we have developed in this series.

Setting

We decided to remember the following properties of a collection X_1,\ldots,X_n of samples from a d-dimensional Gaussian mixture model.

(1) X_1,\ldots,X_n break up into clusters S_1,\ldots,S_k \subseteq [n] which partition [n], and each S_i has size exactly |S_i| = N = n/k.

(2) Each cluster has bounded moments, and this has a small certificate: for some t \in \mathbb{N}, if \mu_i = \mathbb{E}_{j \sim S_i} X_j is the mean of the i-th cluster,

\left \| \mathbb{E}_{j \sim S_i} \left ( [X_j - \mu_i]^{\otimes t/2} \right ) \left ( [X_j - \mu_i]^{\otimes t/2} \right )^\top - \mathbb{E}_{Y \sim \mathcal{N}(0, I)} \left ( Y^{\otimes t/2} \right ) \left ( Y^{\otimes t/2} \right)^\top \right \|_F^2 \leq 1.

(3) The means are separated: if i \neq j then \|\mu_i - \mu_j\| \geq \Delta.
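As a quick numerical aside (mine, not part of the original argument), in the easiest case t = 2 condition (2) simply says that the empirical covariance of a cluster is within Frobenius distance 1 of the identity. With N samples in dimension d the expected squared Frobenius error is on the order of d^2/N, so the condition holds comfortably once N \gg d^2:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 10, 2000          # dimension and cluster size, with N >> d^2

# One cluster: N samples from the standard Gaussian N(0, I_d).
X = rng.standard_normal((N, d))
Xc = X - X.mean(axis=0)

# For t = 2 the certificate in condition (2) compares the empirical covariance
# to the covariance of N(0, I), i.e. the matrix M = Sigma_hat - I.
Sigma_hat = (Xc.T @ Xc) / N
M = Sigma_hat - np.eye(d)

print("squared Frobenius norm:", np.sum(M**2))   # typically around d^2/N, well below 1
```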

Our identifiability proof follows the template we have laid out in the non-SoS proof from part 1 and the one-dimensional SoS proof from later parts. The first thing to do is prove a key fact about a pair of overlapping clusters with bounded moments.

Fact 3

The next fact is the high-dimensional analogue of Fact 2. (We are not going to prove a high-dimensional analogue of Fact 1; the reader should at this point have all the tools to work it out for themselves.) We remind the reader of the key family of polynomial inequalities.

Given a collection of points X_1,\ldots,X_n \in \mathbb{R}^d, let \mathcal{B} be the following set of polynomial inequalities in indeterminates w_1,\ldots,w_n:

w_i^2 = w_i for all i \in [n]

\sum_{i \in [n]} w_i = N

\left \| \frac 1 N \sum_{i \in [n]}  \! w_i \left ( [X_i - \mu]^{\otimes t/2} \right ) \!  \left ( [X_i - \mu]^{\otimes t/2} \right )^\top  \! - \! \mathbb{E}_{Y \sim \mathcal{N}(0, I)} \left ( Y^{\otimes t/2} \right ) \! \left ( Y^{\otimes t/2} \right)^\top \right \|_F^2 \leq 1

where as usual \mu(w) = \frac 1 N \sum_{i \in [n]} w_i X_i. As before, for a subset S \subseteq [n], we use the notation |T \cap S| = |T \cap S|(w) = \sum_{i \in S} w_i.

Fact 3.
Let X_1,\ldots,X_n \in \mathbb{R}^d. Let S \subseteq [n] have |S| = N; let \mu_S = \mathbb{E}_{i \sim S} X_i be its mean. Let t be a power of 2. Suppose S satisfies

\left \| \mathbb{E}_{j \sim S} \left ( [X_j - \mu_S]^{\otimes t/2} \right ) \left ( [X_j - \mu_S]^{\otimes t/2} \right )^\top - \mathbb{E}_{Y \sim \mathcal{N}(0, I)} \left ( Y^{\otimes t/2} \right ) \left ( Y^{\otimes t/2} \right)^\top \right \|_F^2 \leq 1.

Then

\mathcal{B} \vdash_{O(t)} \left (  \frac{|T \cap S|}N \right )^{2t} \|\mu - \mu_S\|^{4t} \leq 2^{O(t)} \cdot t^{t} \cdot \left ( \frac{|T \cap S|}N \right ) ^{2(t-1)} \cdot \|\mu - \mu_S\|^{2t}.

The main difference between the conclusion of Fact 3 and that of Fact 2 is that both sides of the inequality have been multiplied by \|\mu - \mu_S\|^t. As we will see, this is because an additional dependence on the vector-valued polynomial (\mu - \mu_S)(w) is introduced by the need to project the high-dimensional vectors X_i onto the line \mu - \mu_S. We have already tackled a situation where an SoS-provable inequality seemed to require cancelling terms on the left and right in order to be useful (i.e. when we used Fact 2 to prove Lemma 2), and similar ideas will work here.

The main innovation in proving Fact 3 is the use of the t-th moment inequality in \mathcal{B}. Other than that, we follow the proof of Fact 2 almost line by line. The main proposition needed to use the t-th moment inequality is:

Proposition. \vdash_t \mathbb{E}_{Y \sim \mathcal{N}(0,I)} \langle Y, u \rangle^t \leq t^{t/2} \cdot \|u\|^t.

Proof of Proposition.

Expanding the polynomial \mathbb{E}_{Y \sim \mathcal{N}(0,I)}\langle Y, u \rangle^t we get

\mathbb{E}_{Y \sim \mathcal{N}(0,I)}\langle Y, u \rangle^t = \sum_{|\alpha| = t, \alpha \text{ even}} u^\alpha \cdot \mathbb{E} Y^\alpha

where \alpha is a multi-index over [d] and "even" means that every index in \alpha occurs with even multiplicity. (Other terms vanish by symmetry of Y.) Since \alpha is always even in the sum, the monomial u^\alpha is a square, and \mathbb{E} Y^\alpha \leq t^{t/2} by standard properties of Gaussian moments. Hence,

\vdash_t \sum_{|\alpha| = t, \alpha \text{ even}} u^\alpha \cdot \mathbb{E} Y^\alpha \leq t^{t/2} \cdot \sum_{|\alpha| = t, \alpha \text{ even}} u^\alpha \leq t^{t/2} \cdot \| u \|^t.

QED.
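The Proposition is a statement about what is provable in low-degree SoS, but the underlying numerical inequality is easy to sanity-check by simulation (this is my own aside and says nothing about SoS degrees): for Y \sim \mathcal{N}(0,I), the inner product \langle Y, u \rangle is Gaussian with variance \|u\|^2, so \mathbb{E} \langle Y, u \rangle^t = (t-1)!! \cdot \|u\|^t, which is indeed at most t^{t/2} \cdot \|u\|^t.

```python
import numpy as np

rng = np.random.default_rng(0)
d, t, trials = 8, 4, 200_000          # t must be even; here t = 4

u = rng.standard_normal(d)            # an arbitrary fixed direction
Y = rng.standard_normal((trials, d))  # samples of Y ~ N(0, I_d)

lhs = np.mean((Y @ u) ** t)                      # Monte Carlo estimate of E <Y,u>^t
rhs = t ** (t / 2) * np.linalg.norm(u) ** t      # the bound t^{t/2} ||u||^t

print(f"E <Y,u>^t (empirical) ~ {lhs:.2f}")
print(f"t^(t/2) * ||u||^t     = {rhs:.2f}")
# Exact value is (t-1)!! * ||u||^t = 3 * ||u||^4 for t = 4, well below the bound.
```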

Proof of Fact 3.

As usual, we write things out in terms of X_1,\ldots,X_n, then apply Holder’s inequality and the triangle inequality. First of all,

\left ( \sum_{i \in S} w_i \right )^t \|\mu - \mu_S\|^{2t} = \left ( \sum_{i \in S} w_i \langle \mu - \mu_S, \mu - \mu_S \rangle \right )^t = \left ( \sum_{i \in S} w_i \langle (\mu - X_i) - (\mu_S - X_i), \mu - \mu_S \rangle \right )^t.

By SoS Holder’s inequality, we get

\mathcal{B} \vdash_{O(t)} \left ( \sum_{i \in S} w_i \right )^t \|\mu - \mu_S\|^{2t} \leq \left ( \sum_{i \in S} w_i \right )^{t-1} \cdot \sum_{i \in S} w_i \langle (\mu - X_i) - (\mu_S - X_i), \mu - \mu_S \rangle^t.

By the same reasoning as in Fact 2, using \vdash_t (a - b)^t \leq 2^t (a^t + b^t) and w_i = w_i^2 \leq 1 we get

\mathcal{B} \vdash_{O(t)} \left ( \sum_{i \in S} w_i \right )^t \|\mu - \mu_S\|^{2t} \leq 2^t \cdot \left ( \sum_{i \in S} w_i \right )^{t-1} \cdot \left [ \sum_{i \in [n]} w_i \langle X_i - \mu, \mu - \mu_S \rangle^t + \sum_{i \in S} \langle X_i - \mu_S, \mu - \mu_S \rangle^t \right ].

By our usual squaring and use of (a+b)^2 \leq 2 (a^2 + b^2), we also get

\mathcal{B} \vdash_{O(t)} \left ( \sum_{i \in S} w_i \right )^{2t} \|\mu - \mu_S\|^{4t} \leq 2^{O(t)} \cdot \left ( \sum_{i \in S} w_i \right )^{2(t-1)} \cdot \left [ \left ( \sum_{i \in [n]} w_i \langle X_i - \mu, \mu - \mu_S \rangle^t \right )^2 + \left ( \sum_{i \in S} \langle X_i - \mu_S, \mu - \mu_S \rangle^t \right )^2 \right ].

(We want both sides to be squared so that we are set up to eventually use SoS Cauchy-Schwarz.) We are left with the last two terms, which are t-th moments in the direction \mu - \mu_S. If we knew

\mathcal{B} \vdash_{O(t)} \left ( \sum_{i \in [n]} w_i \langle X_i - \mu, \mu - \mu_S \rangle^t \right )^2 \leq 2^{O(t)} t^{t} \cdot \|\mu - \mu_S\|^{2t} \cdot N^2

and similarly

\mathcal{B} \vdash_{O(t)} \left ( \sum_{i \in S} \langle X_i - \mu_S, \mu - \mu_S \rangle^t \right )^2 \leq 2^{O(t)} t^{t} \cdot \|\mu - \mu_S\|^{2t} \cdot N^2

then we would be done. We start with the second inequality. We write the polynomial inside the square on the LHS, normalized by N, as

\frac 1 N \sum_{i \in S} \langle X_i - \mu_S, \mu - \mu_S \rangle^t

= \mathbb{E}_{Y \sim \mathcal{N}(0,I)} \langle Y, \mu - \mu_S \rangle^t + \left ( \frac 1 N \sum_{i \in S} \langle X_i - \mu_S, \mu - \mu_S \rangle^t  - \mathbb{E}_{Y \sim \mathcal{N}(0,I)} \langle Y, \mu - \mu_S \rangle^t \right )

Squaring again as usual, it would be enough to bound both \left ( \mathbb{E}_{Y \sim \mathcal{N}(0,I)} \langle Y, \mu - \mu_S \rangle^t \right )^2 and  \left ( \frac 1 N \sum_{i \in S} \langle X_i - \mu_S, \mu - \mu_S \rangle^t  - \mathbb{E}_{Y \sim \mathcal{N}(0,I)} \langle Y, \mu - \mu_S \rangle^t \right )^2. For the former, using the Proposition above we get

\vdash_{2t} \left ( \mathbb{E}_{Y \sim \mathcal{N}(0,I)} \langle Y, \mu - \mu_S \rangle^t \right )^2 \leq 2^{O(t)} t^t \|\mu - \mu_S\|^{2t}.

For the latter, notice that

\left ( \frac 1 N \sum_{i \in S} \langle X_i - \mu_S, \mu - \mu_S \rangle^t  - \mathbb{E}_{Y \sim \mathcal{N}(0,I)} \langle Y, \mu - \mu_S \rangle^t \right )^2 = \left \langle M, \left [  (\mu - \mu_S)^{\otimes t/2} \right ] \left [ (\mu - \mu_S)^{\otimes t/2} \right ]^\top \right \rangle^2

where M is the matrix

M = \mathbb{E}_{j \sim S} \left ( [X_j - \mu_S]^{\otimes t/2} \right ) \left ( [X_j - \mu_S]^{\otimes t/2} \right )^\top - \mathbb{E}_{Y \sim \mathcal{N}(0, I)} \left ( Y^{\otimes t/2} \right ) \left ( Y^{\otimes t/2} \right)^\top.

Hence by SoS Cauchy-Schwarz, we get

\vdash_{O(t)} \left \langle M, \left [  (\mu - \mu_S)^{\otimes t/2} \right ] \left [ (\mu - \mu_S)^{\otimes t/2} \right ]^\top \right \rangle^2 \leq \|M\|_F^2 \cdot \|\mu - \mu_S\|^{2t} \leq \|\mu - \mu_S\|^{2t}

Putting these together, we get

\mathcal{B} \vdash_{O(t)} \left ( \sum_{i \in S} \langle X_i - \mu_S, \mu - \mu_S \rangle^t \right )^2 \leq 2^{O(t)} t^{t} \cdot \|\mu - \mu_S\|^{2t} \cdot N^2,

the second of the inequalities we wanted. Proving the first one is similar, using the hypothesis

\left \| \frac 1 N \sum_{i \in [n]}  \! w_i \left ( [X_i - \mu]^{\otimes t/2} \right ) \!  \left ( [X_i - \mu]^{\otimes t/2} \right )^\top  \! - \! \mathbb{E}_{Y \sim \mathcal{N}(0, I)} \left ( Y^{\otimes t/2} \right ) \! \left ( Y^{\otimes t/2} \right)^\top \right \|_F^2 \leq 1

in place of \|M\|_F^2 \leq 1. QED.
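For readers who want a quick reality check on the Cauchy-Schwarz step above (my own numerical aside, not part of the proof): for fixed numerical values it is just the inequality \langle M, vv^\top \rangle^2 \leq \|M\|_F^2 \cdot \|v\|^4 with v = (\mu - \mu_S)^{\otimes t/2}, and the SoS version asserts that this has a low-degree proof when v is a vector of polynomials. A tiny numpy sketch of the numerical inequality:

```python
import numpy as np

rng = np.random.default_rng(1)

for _ in range(5):
    D = 7                              # dimension of the flattened tensor v
    M = rng.standard_normal((D, D))    # stand-in for the moment-difference matrix
    v = rng.standard_normal(D)         # stand-in for (mu - mu_S)^{tensor t/2}

    lhs = np.tensordot(M, np.outer(v, v)) ** 2       # <M, v v^T>^2
    rhs = np.sum(M**2) * np.dot(v, v) ** 2           # ||M||_F^2 * ||v||^4
    print(f"{lhs:10.3f} <= {rhs:10.3f} : {bool(lhs <= rhs)}")
```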

Lemma 3

The last thing is to use Fact 3 to prove Lemma 3, our high-dimensional SoS identifiability lemma. Predictably, it is almost identical to our previous proof of Lemma 2 using Fact 2.

Lemma 3.
Let X_1,\ldots,X_n \in \mathbb{R}^d. Let S_1,\ldots,S_k be a partition of [n] into k pieces of size N = n/k such that for each S_i, the collection of vectors \{X_j\}_{j \in S_i} obeys the following moment bound:

\left \| \mathbb{E}_{j \sim S_i} \left ( [X_j - \mu_i]^{\otimes t/2} \right ) \left ( [X_j - \mu_i]^{\otimes t/2} \right )^\top - \mathbb{E}_{Y \sim \mathcal{N}(0, I)} \left ( Y^{\otimes t/2} \right ) \left ( Y^{\otimes t/2} \right)^\top \right \|_F^2 \leq 1.

where \mu_i is the average \mathbb{E}_{j \sim S_i} X_j and t is some number in \mathbb{N} which is a power of 2. Let \Delta > 0 be such that \| \mu_i - \mu_j \| \geq \Delta for every i \neq j.

Let w_1,\ldots,w_n,\mu be indeterminates. Let \mathcal{B} be the set of equations and inequalities defined above. Thinking of the variables w_i as defining a set T \subseteq [n] via its 0/1 indicator, let |T \cap S_j| be the formal expression |T \cap S_j| = \sum_{i \in S_j} w_i. Let \tilde{\mathbb{E}} be a degree O(t) pseudoexpectation which satisfies \mathcal{B}. Then

\tilde{\mathbb{E}} \left [ \sum_{i \in [k]} \left ( \frac{|T \cap S_i|}{N} \right )^2 \right ] \geq 1 - \frac{2^{O(t)} t^{t/2} k^2}{\Delta^t}.

Proof of Lemma 3.
Let \tilde{\mathbb{E}} satisfy \mathcal{B}. As in Lemmas 1 and 2, we will endeavor to bound \tilde{\mathbb{E}} |T \cap S_i|^t |T \cap S_j|^t for each i \neq j.

By \Delta-separation,

\vdash_{2t} \Delta^{2t} \leq \|\mu_i - \mu_j\|^{2t} = \|(\mu_i - \mu) - (\mu_j - \mu)\|^{2t} \leq 2^{2t} \left (  \|\mu_i - \mu\|^{2t} + \|\mu_j - \mu\|^{2t} \right),

where we implicitly used the SoS triangle inequality \vdash_t (a-b)^t \leq 2^{t} (a^t + b^t) on each coordinate of the vectors \mu_i - \mu and \mu_j - \mu.

So,

\tilde{\mathbb{E}} |T \cap S_i|^t |T \cap S_j|^t \leq \tilde{\mathbb{E}} \left [ \frac{\|\mu_i - \mu\|^{2t} + \|\mu_j - \mu\|^{2t}}{2^{2t} \Delta^{2t}}  |T \cap S_i|^t |T \cap S_j|^t \right ].

Now we are going to bound \tilde{\mathbb{E}} \|\mu_i - \mu\|^{2t} |T \cap S_i|^t |T \cap S_j|^t. By symmetry the same argument will apply to the same expression with i and j exchanged.

We apply Fact 3:

\tilde{\mathbb{E}} |T \cap S_i|^t |T \cap S_j|^t \|\mu_i - \mu\|^{2t} \leq 2^{O(t)} t^{t} \cdot N^2 \cdot \tilde{\mathbb{E}} |T \cap S_i|^{t-2} |T \cap S_j|^t \|\mu_i - \mu\|^t.

Then we use pseudoexpectation Cauchy-Schwarz to get

\tilde{\mathbb{E}} |T \cap S_i|^{t-2} |T \cap S_j|^t \|\mu_i - \mu\|^t \leq \left ( \tilde{\mathbb{E}} |T \cap S_i|^{t-4} |T \cap S_j|^t \right )^{1/2} \left ( \tilde{\mathbb{E}} |T \cap S_i|^t |T \cap S_j|^t \|\mu_i - \mu\|^{2t} \right )^{1/2}.

Putting these two together and canceling \left ( \tilde{\mathbb{E}} |T \cap S_i|^t |T \cap S_j|^t \|\mu_i - \mu\|^{2t} \right )^{1/2} we get

\tilde{\mathbb{E}} |T \cap S_i|^t |T \cap S_j|^t \|\mu_i - \mu\|^{2t} \leq 2^{O(t)} t^{2t} \cdot N^4 \tilde{\mathbb{E}} |T \cap S_i|^{t-4} |T \cap S_j|^t.

Also, clearly \mathcal{B} \vdash_{O(1)} |T \cap S_j| \leq N. So we get

\tilde{\mathbb{E}} |T \cap S_i|^t |T \cap S_j|^t \|\mu_i - \mu\|^{2t} \leq 2^{O(t)} t^{2t} \cdot N^8 \cdot \tilde{\mathbb{E}} |T \cap S_i|^{t-4} |T \cap S_j|^{t-4} \|\mu_i - \mu\|^t.

We started out with the goal of bounding \tilde{\mathbb{E}} |T \cap S_i|^t |T \cap S_j|^t. We have found that

\tilde{\mathbb{E}} |T \cap S_i|^t |T \cap S_j|^t \leq \frac{2^{O(t)} t^{2t} N^8}{\Delta^{4t}} \tilde{\mathbb{E}} |T \cap S_i|^{t-4} |T \cap S_j|^{t-4}.

Applying pseudoexpectation Holder’s inequality, we find that

\tilde{\mathbb{E}} |T \cap S_i|^{t-4} |T \cap S_j|^{t-4} \leq \left ( \tilde{\mathbb{E}} |T \cap S_i|^t \tilde{\mathbb{E}} |T \cap S_j|^t \right )^{(t-4)/t}.

Rearranging things, we get

\left ( \tilde{\mathbb{E}} |T \cap S_i|^t |T \cap S_j|^t \right )^{1/t} \leq \frac{2^{O(t)} t^{t/2} N^2}{\Delta^{t}}.

Now proceeding as in the proof of Lemma 2, we know

\tilde{\mathbb{E}} \left ( \sum_{i \in [k]} \frac{|T \cap S_i|}{N} \right )^2 = 1

so

\tilde{\mathbb{E}} \sum_{i \in [k]} \left ( \frac{|T \cap S_i|}{N} \right )^2 \geq 1 - \frac{2^{O(t)} t^{t/2} k^2}{\Delta^{t}}.

QED.

Remark: We cheated ever so slightly in this proof. First of all, we did not state a version of pseudoexpectation Holder's inequality which allows the exponent (t-4)/t, just one which allows (t-2)/t. The correct version can be found in Lemma A.4 of this paper. That inequality will work only when t is large enough; I think t \geq 4 suffices. To handle smaller t one probably must remove the square from both sides of Fact 3, which will require a hypothesis which does not use the squared Frobenius norm. This is possible; see e.g. my paper with Jerry Li.

Putting Things Together

The conclusion of Lemma 3 is almost identical to the conclusion of Lemma 2, and so the rest of the analysis of the high-dimensional clustering algorithm proceeds exactly as in the one-dimensional case. At the end, to show that the clustering algorithm works with high probability to cluster samples from a separated Gaussian mixture model, one uses straightforward concentration of measure to show that if X_1,\ldots,X_n are sufficiently many samples from a \Delta-separated mixture model, then they satisfy the hypotheses of Lemma 3 with high probability. This concludes our "proof" of the main theorem from way back in Part 1.
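As a last numerical aside (again mine, not part of the original analysis), the concentration step really is mild: with enough samples per cluster, the empirical cluster means are close to the true means, so the empirical clusters inherit the \Delta-separation up to a small loss. A quick numpy sketch under these (illustrative) parameter choices:

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, N, Delta = 20, 3, 1000, 10.0

# True means, pairwise separated by at least Delta (placed along coordinate axes).
true_means = np.zeros((k, d))
for i in range(k):
    true_means[i, i] = i * Delta

# N samples per cluster from N(true_mean, I_d).
clusters = [true_means[i] + rng.standard_normal((N, d)) for i in range(k)]
emp_means = np.array([c.mean(axis=0) for c in clusters])

# Mean-estimation error is about sqrt(d/N), much smaller than Delta ...
print("mean estimation errors:",
      np.linalg.norm(emp_means - true_means, axis=1).round(3))

# ... so the empirical means remain pairwise Delta-separated, up to a small loss.
for i in range(k):
    for j in range(i + 1, k):
        dist = np.linalg.norm(emp_means[i] - emp_means[j])
        print(f"||mu_{i} - mu_{j}|| = {dist:.2f}")
```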

Related literature

The reader interested in further applications of the Sum of Squares method to unsupervised learning problems may consult some of the following works.

Though we have not seen it in these posts, often the SoS method overlaps with questions about tensor decomposition. For some examples in this direction, see the [Barak, Kelner, Steurer] dictionary learning paper above, as well as

The SoS method can often be used to design algorithms which have more practical running times than the large SDPs we have discussed here. (This often requires further ideas, to avoid solving large semidefinite programs.) See e.g.:

Another common tool in constructing SoS proofs for unsupervised learning problems, which we did not see here, is the use of concentration bounds for random matrices whose entries are low-degree polynomials in independent random variables. For some examples along these lines, see

as well as several of the previous papers.