# Clustering and Sum of Squares Proofs, Part 6

*This is the 6th and final part of a series on clustering, Gaussian mixtures, and Sum of Squares (SoS) proofs. If you have not read them yet, I recommend starting with Part 1, Part 2, Part 3, Part 4, and Part 5. Also, if you find errors please mention them in the comments (or otherwise get in touch with me) and I will fix them ASAP.
*

*Did you find this series helpful? Unhelpful? What could be done better? If there were to be another tutorial series on a Sum of Squares-related topic, what would you like to see? Let me know in the comments!*

Last time we developed our high-dimensional clustering algorithm for Gaussian mixtures. In this post we will make our SoS identifiability proofs high-dimensional. In what is hopefully a familiar pattern by now, these identifiability proofs will also amount to an analysis of the clustering algorithm from part 5. At the end of this post is a modest discussion of some of the literature on the SoS proofs-to-algorithms method we have developed in this series.

## Setting

We decided to remember the following properties of a collection of samples from a -dimensional Gaussian mixture model.

(1) break up into clusters which partition , and each has size exactly .

(2) Each cluster has bounded moments, *and this has a small certificate*: for some , if is the mean of the -th cluster,

.

(3) The means are separated: if then .

Our identifiability proof follows the template we have laid out in the non-SoS proof from part 1 and the one-dimensional SoS proof from later parts. The first thing to do is prove a key fact about a pair of overlapping clusters with bounded moments.

## Fact 3

The next fact is the high-dimensional analogue of Fact 2. (We are not going to prove a high-dimensional analogue of Fact 1; the reader should at this point have all the tools to work it out for themselves.) We remind the reader of the key family polynomial inequalities.

Given a collection of points , let be the following set of polynomial inequalities in indeterminates :

for all

where as usual . As before, for a subset , we use the notation .

Fact 3.

Let . Let have ; let be its mean. Let be a power of . Suppose satisfies.

Then

The main difference between the conclusions of Fact 3 and Fact 2 is that both sides of the inequality are multiplied by , as compared to Fact 2. As we will see, this is because an additional dependence on the vector-valued polynomial is introduced by the need to project the high-dimensional vectors onto the line . We have already tackled a situation where an SoS-provable inequality seemed to require cancelling terms of left and right in order to be useful (i.e. when we used Fact 2 to prove Lemma 2), and similar ideas will work here.

The main innovation in proving Fact 3 is the use of the -th moment inequality in . Other than that, we follow the proof of Fact 2 almost line by line. The main proposition needed to use the -th moment inequality is:

Proposition..

*Proof of Proposition.*

Expanding the polynomial we get

where is a multi-index over and “even” means that every index in occurs with even multiplicity. (Other terms vanish by symmetry of .) Since is always even in the sum, the monomial is a square, and by standard properties of Gaussian moments. Hence,

.

QED.

*Proof of Fact 3.*

As usual, we write things out in terms of , then apply Holder’s inequality and the triangle inequality. First of all,

By SoS Holder’s inequality, we get

By the same reasoning as in Fact 2, using and we get

By our usual squaring and use of , we also get

(We want both sides to be squared so that we are set up to eventually use SoS Cauchy-Schwarz.) We are left with the last two terms, which are -th moments in the direction . If we knew

and similarly

then we would be done. We start with the second inequality. We write the polynomial on the LHS as

Squaring again as usual, it would be enough to bound both and . For the former, using the Proposition above we get

.

For the latter, notice that

where is the matrix

.

Hence by SoS Cauchy-Schwarz, we get

Putting these together, we get

,

the second of the inequalities we wanted. Proving the first one is similar, using the hypothesis

in place of . QED.

## Lemma 3

The last thing is to use Fact 3 to prove Lemma 3, our high-dimensional SoS identifiability lemma. Predictably, it is almost identical to our previous proof of Lemma 2 using Fact 2.

Lemma 3.

Let . Let be a partition of into pieces of size such that for each , the collection of vectors obeys the following moment bound:.

where is the average and is some number in which is a power of . Let be such that for every .

Let be indeterminates. Let be the set of equations and inequalities defined above. Thinking of the variables as defining a set via its indicator, let be the formal expression Let be a degree pseudoexpectation which satisfies . Then

*Proof of Lemma 3.*

Let satisfy . As in Lemmas 1 and 2, we will endeavor to bound for each .

By -separation,

where we implicitly used the SoS triangle inequality on each coordinate of the vectors and .

So,

Now we are going to bound . By symmetry the same argument will apply to the same expression with and exchanged.

We apply Fact 3:

Then we use pseudoexpectation Cauchy-Schwarz to get

.

Putting these two together and canceling we get

.

Also, clearly . So we get

We started out with the goal of bounding . We have found that

Applying pseudoexpectation Holder’s inequality, we find that

.

Rearranging things, we get

Now proceeding as in the proof of Lemma 2, we know

so

QED.

**Remark: **We cheated ever so slightly in this proof. First of all, we did not state a version of pseudoexpectation Holder’s which allows the exponent , just one which allows . The correct version can be found in Lemma A.4 of this paper. That inequality will work only when is large enough; I think suffices. To handle smaller probably one must remove the square from both sides of Fact 3, which will require a hypothesis which does not use the squared Frobenious norm. This is possible; see e.g. my paper with Jerry Li.

## Putting Things Together

The conclusion of Lemma 3 is almost identical to the conclusion of Lemma 2, and so the rest of the analysis of the high-dimensional clustering algorithm proceeds exactly as in the one-dimensional case. At the end, to show that the clustering algorithm works with high probability to cluster samples from a separated Gaussian mixture model, one uses straightforward concentration of measure to show that if are enough samples from a -separated mixture model, then the satisfy the hypotheses of Lemma 3 with high probability. This concludes our “proof” of the main theorem from way back in part 1.

## Related literature

The reader interested in further applications of the Sum of Squares method to unsupervised learning problems may consult some of the following works.

- [Barak, Kelner, Steurer] on dictionary learning
- [Potechin, Steurer] on tensor completion
- [Ge, Ma] on random overcomplete tensors

Though we have not seen it in these posts, often the SoS method overlaps with questions about tensor decomposition. For some examples in this direction, see the [Barak, Kelner, Steurer] dictionary learning paper above, as well as

- [Ma, Shi, Steurer] on tensor decomposition

The SoS method can often be used to design algorithms which have more practical running times than the large SDPs we have discussed here. (This often requires further ideas, to avoid solving large semidefinite programs.) See e.g.:

- [Hopkins, Schramm, Shi, Steurer] on extracting spectral algorithms (rather than SDP-based algorithms) from SoS proofs
- [Schramm, Steurer] developing a sophisticated spectral method for dictionary learning via SoS proofs
- [Hopkins, Kothari, Potechin, Raghavendra, Schramm, Steurer] with a meta-theorem on when it is possible to extract spectral algorithms from SoS proofs

Another common tool in constructing SoS proofs for unsupervised learning problems which we did not see here are concentration bounds for random matrices whose entries are low-degree polynomials in independent random variables. For some examples along these lines, see

- [Hopkins, Shi, Steurer] on tensor principal component analysis
- [Rao, Raghavendra, Schramm] on random constraint satisfaction problems

as well as several of the previous papers.

## Trackbacks

Comments are closed.