This is the 6th and final part of a series on clustering, Gaussian mixtures, and Sum of Squares (SoS) proofs. If you have not read them yet, I recommend starting with Part 1, Part 2, Part 3, Part 4, and Part 5. Also, if you find errors please mention them in the comments (or otherwise get in touch with me) and I will fix them ASAP.
Did you find this series helpful? Unhelpful? What could be done better? If there were to be another tutorial series on a Sum of Squares-related topic, what would you like to see? Let me know in the comments!
Last time we developed our high-dimensional clustering algorithm for Gaussian mixtures. In this post we will make our SoS identifiability proofs high-dimensional. In what is hopefully a familiar pattern by now, these identifiability proofs will also amount to an analysis of the clustering algorithm from part 5. At the end of this post is a modest discussion of some of the literature on the SoS proofs-to-algorithms method we have developed in this series.
Setting
We decided to remember the following properties of a collection of samples $X_1,\ldots,X_n$ from a $d$-dimensional Gaussian mixture model.

(1) The samples break up into clusters $S_1,\ldots,S_k$ which partition $[n]$, and each $S_j$ has size exactly $n/k$.
(2) Each cluster has bounded moments, and this has a small certificate: for some $t$, if $\mu_j$ is the mean of the $j$-th cluster, then in every direction $u$ the empirical $t$-th moments satisfy
$$\frac{k}{n} \sum_{i \in S_j} \langle X_i - \mu_j, u \rangle^t \le 2 \cdot t^{t/2} \cdot \|u\|_2^t,$$
and this inequality has a low-degree SoS proof in the indeterminates $u$.
(3) The means are separated: if $i \neq j$ then $\|\mu_i - \mu_j\| \geq \Delta$.
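To make properties (1)-(3) concrete, here is a minimal numpy sketch (an illustration only; the spherical-covariance assumption and all parameter choices are mine, not from the series) that generates an equal-size mixture and spot-checks the cluster sizes, a directional $t$-th moment bound, and the mean separation.

```python
import numpy as np

# Hypothetical sanity check of properties (1)-(3) on synthetic data:
# a spherical Gaussian mixture with k equal-size clusters in d dimensions.
rng = np.random.default_rng(0)
k, d, n_per, t = 3, 50, 2000, 4
Delta = 10.0

# Place the k means on a sphere of radius Delta, so they are well separated.
means = rng.normal(size=(k, d))
means = Delta * means / np.linalg.norm(means, axis=1, keepdims=True)
clusters = [means[j] + rng.normal(size=(n_per, d)) for j in range(k)]

# (1) clusters of equal size n/k partitioning the sample
assert all(len(C) == n_per for C in clusters)

# (2) empirical directional t-th moments behave like Gaussian moments:
#     average of <X_i - mu_j, u>^t over a cluster vs. 2 * t^{t/2} * ||u||^t
u = rng.normal(size=d)
for j, C in enumerate(clusters):
    mu_hat = C.mean(axis=0)
    lhs = np.mean(((C - mu_hat) @ u) ** t)
    rhs = 2 * t ** (t / 2) * np.linalg.norm(u) ** t
    print(f"cluster {j}: moment {lhs:.2f} <= bound {rhs:.2f}: {lhs <= rhs}")

# (3) pairwise separation of the means
for i in range(k):
    for j in range(i + 1, k):
        print(f"||mu_{i} - mu_{j}|| = {np.linalg.norm(means[i] - means[j]):.2f}")
```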
Our identifiability proof follows the template we have laid out in the non-SoS proof from part 1 and the one-dimensional SoS proof from later parts. The first thing to do is prove a key fact about a pair of overlapping clusters with bounded moments.
Fact 3
The next fact is the high-dimensional analogue of Fact 2. (We are not going to prove a high-dimensional analogue of Fact 1; the reader should at this point have all the tools to work it out for themselves.) We remind the reader of the key family of polynomial inequalities.
Given a collection of points $X_1,\ldots,X_n \in \mathbb{R}^d$, let $\mathcal{A}$ be the following set of polynomial inequalities in indeterminates $w_1,\ldots,w_n$ and $u_1,\ldots,u_d$:

$w_i^2 = w_i$ for all $i \in [n]$

$\sum_{i=1}^n w_i = \frac{n}{k}$

$\sum_{i=1}^n w_i \langle X_i - \mu(w), u \rangle^t \le 2 \cdot \frac{n}{k} \cdot t^{t/2} \cdot \|u\|_2^t$

where as usual $\mu(w) = \frac{k}{n} \sum_{i=1}^n w_i X_i$. As before, for a subset $T \subseteq [n]$, we use the notation $|T \cap W|$ for the formal expression $\sum_{i \in T} w_i$, thinking of $w$ as the $0/1$ indicator of a set $W$.
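For intuition, here is a small numpy sketch (a spot check under hypothetical parameters, nothing like an SoS certificate) verifying that the $0/1$ indicator of a true cluster satisfies the constraints above for a few sampled directions $u$, taking the constant in the moment constraint to be the $2 \cdot \frac{n}{k} \cdot t^{t/2}$ written above.

```python
import numpy as np

# Spot check (not an SoS certificate): w = indicator of one true cluster
# satisfies w_i^2 = w_i, sum_i w_i = n/k, and the t-th moment constraint
# for a handful of random directions u. Setup and constants are illustrative.
rng = np.random.default_rng(1)
k, d, n_per, t = 3, 50, 2000, 4
means = 10.0 * rng.normal(size=(k, d)) / np.sqrt(d)
X = np.vstack([means[j] + rng.normal(size=(n_per, d)) for j in range(k)])
n = len(X)

w = np.zeros(n)
w[:n_per] = 1.0                       # indicator of the first cluster

assert np.allclose(w ** 2, w)         # w_i^2 = w_i
assert np.isclose(w.sum(), n / k)     # sum_i w_i = n/k

mu_w = (k / n) * (w @ X)              # mu(w) = (k/n) * sum_i w_i X_i
for _ in range(5):
    u = rng.normal(size=d)
    lhs = np.sum(w * ((X - mu_w) @ u) ** t)
    rhs = 2 * (n / k) * t ** (t / 2) * np.linalg.norm(u) ** t
    print(f"moment constraint: {lhs:.1f} <= {rhs:.1f}: {lhs <= rhs}")
```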
Fact 3.
Let $X_1,\ldots,X_n \in \mathbb{R}^d$. Let $S \subseteq [n]$ have $|S| = n/k$; let $\mu_S$ be its mean. Let $t$ be a power of $2$. Suppose the collection of vectors $\{X_i\}_{i \in S}$ satisfies the $t$-th moment bound from property (2) above. Then
$$\mathcal{A} \vdash \; \Big( |S \cap W| \cdot \|\mu_S - \mu(w)\|^{2t} \Big)^2 \;\leq\; 2^{O(t)} \cdot \Big( \frac{n}{k} \cdot t^{t/2} \cdot \|\mu_S - \mu(w)\|^{t} \Big)^2.$$
The main difference between the conclusions of Fact 3 and Fact 2 is that both sides of the inequality are multiplied by a power of $\|\mu_S - \mu(w)\|$, as compared to Fact 2. As we will see, this is because an additional dependence on the vector-valued polynomial $\mu_S - \mu(w)$ is introduced by the need to project the high-dimensional vectors $X_i$ onto the line spanned by $\mu_S - \mu(w)$. We have already tackled a situation where an SoS-provable inequality seemed to require cancelling terms from the left and right sides in order to be useful (i.e. when we used Fact 2 to prove Lemma 2), and similar ideas will work here.
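To recall the flavor of that cancellation trick in a toy setting (this is only an illustration of the maneuver, not the exact inequality used below): suppose a pseudoexpectation satisfies $\widetilde{\mathbb{E}}[p^2] \leq B \cdot \widetilde{\mathbb{E}}[p]$ for some polynomial $p$ with $\widetilde{\mathbb{E}}[p] \geq 0$ (for instance, $p$ a square) and some number $B \geq 0$. Then by pseudoexpectation Cauchy-Schwarz,
$$\widetilde{\mathbb{E}}[p]^2 = \widetilde{\mathbb{E}}[p \cdot 1]^2 \leq \widetilde{\mathbb{E}}[p^2] \cdot \widetilde{\mathbb{E}}[1] = \widetilde{\mathbb{E}}[p^2] \leq B \cdot \widetilde{\mathbb{E}}[p],$$
and so $\widetilde{\mathbb{E}}[p] \leq B$: we have "cancelled" a factor of $\widetilde{\mathbb{E}}[p]$ from both sides even though dividing by a polynomial is not an SoS-legal operation.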
The main innovation in proving Fact 3 is the use of the $t$-th moment inequality in $\mathcal{A}$. Other than that, we follow the proof of Fact 2 almost line by line. The main proposition needed to use the $t$-th moment inequality is:
Proposition.
Let $u = (u_1,\ldots,u_d)$ be indeterminates and let $X \sim \mathcal{N}(0,\mathrm{Id}_d)$. Then the polynomial $\mathbb{E}\,\langle X, u \rangle^t$ (a polynomial in $u$) is a sum of squares.
Proof of Proposition.
Expanding the polynomial, we get
$$\mathbb{E}\,\langle X, u \rangle^t = \sum_{\alpha \text{ even}} \binom{t}{\alpha} \cdot \mathbb{E}[X^\alpha] \cdot u^\alpha,$$
where $\alpha$ is a multi-index over $[d]$ and “even” means that every index in $\alpha$ occurs with even multiplicity. (Other terms vanish by symmetry of the Gaussian distribution.) Since $\alpha$ is always even in the sum, the monomial $u^\alpha$ is a square, and $\mathbb{E}[X^\alpha] \geq 0$ by standard properties of Gaussian moments. Hence, $\mathbb{E}\,\langle X, u \rangle^t$ is a sum of squares.
QED.
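The following small sympy computation (my own illustration, for small $t$ and $d$) carries out exactly this expansion: it computes $\mathbb{E}\,\langle X, u\rangle^t$ for $X \sim \mathcal{N}(0,\mathrm{Id})$ symbolically and checks that every surviving monomial has even exponents and a nonnegative coefficient, so the polynomial is visibly a sum of squares.

```python
import sympy as sp

# Symbolic check of the Proposition for small t and d: expand E<X,u>^t for
# X ~ N(0, Id) and verify each surviving monomial is even with a
# nonnegative coefficient, hence the polynomial is a sum of squares.
t, d = 4, 3
u = sp.symbols(f"u1:{d + 1}")
x = sp.symbols(f"x1:{d + 1}")

def gaussian_moment(k):
    # E[g^k] for g ~ N(0,1): (k-1)!! if k is even, 0 if k is odd.
    return 0 if k % 2 else sp.factorial2(k - 1)

inner = sum(xi * ui for xi, ui in zip(x, u))
expectation = sp.S(0)
for term in sp.expand(inner ** t).as_ordered_terms():
    coeff, moment = term, sp.S(1)
    for xi in x:
        moment *= gaussian_moment(sp.degree(term, xi))
        coeff = coeff.subs(xi, 1)   # strip the x-part, keep coefficient * u-monomial
    expectation += moment * coeff   # independence: E[X^alpha] = prod_i E[X_i^alpha_i]

expectation = sp.expand(expectation)
print(expectation)                  # for t = 4 this is 3 * (u1^2 + u2^2 + u3^2)^2, expanded
for exponents, c in expectation.as_poly(*u).terms():
    assert all(e % 2 == 0 for e in exponents) and c >= 0
print("every monomial is even and has a nonnegative coefficient")
```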
Proof of Fact 3.
As usual, we write things out in terms of the individual samples $X_i$, then apply Holder’s inequality and the triangle inequality. First of all,
By SoS Holder’s inequality, we get
By the same reasoning as in Fact 2, using the constraints in $\mathcal{A}$, we get
By our usual squaring and use of the SoS triangle inequality, we also get
(We want both sides to be squared so that we are set up to eventually use SoS Cauchy-Schwarz.) We are left with the last two terms, which are $t$-th moments in the direction $\mu_S - \mu(w)$. If we knew a suitable SoS inequality bounding the first of these terms, and similarly one bounding the second, then we would be done. We start with the second inequality. We write the polynomial on the LHS as
Squaring again as usual, it would be enough to bound two terms: one coming from the ideal Gaussian moments and one coming from the empirical deviation from them. The former is handled using the Proposition above.
For the latter, notice that it can be written as a quadratic form whose underlying matrix is the deviation of the empirical $t$-th moment matrix of the cluster from its Gaussian counterpart. Hence by SoS Cauchy-Schwarz, we get
Putting these together, we get the second of the inequalities we wanted. Proving the first one is similar, using the other moment-bound hypothesis in place of the one just used. QED.
Lemma 3
The last thing is to use Fact 3 to prove Lemma 3, our high-dimensional SoS identifiability lemma. Predictably, the proof is almost identical to our previous proof of Lemma 2 using Fact 2.
Lemma 3.
Let $X_1,\ldots,X_n \in \mathbb{R}^d$. Let $S_1,\ldots,S_k$ be a partition of $[n]$ into $k$ pieces of size $n/k$ such that for each $j \in [k]$, the collection of vectors $\{X_i\}_{i \in S_j}$ obeys the moment bound from property (2) above, where $\mu_j$ is the average $\frac{k}{n} \sum_{i \in S_j} X_i$ and $t$ is some number which is a power of $2$. Let $\Delta > 0$ be such that $\|\mu_i - \mu_j\| \geq \Delta$ for every $i \neq j$.
Let $w = (w_1,\ldots,w_n)$ be indeterminates. Let $\mathcal{A}$ be the set of equations and inequalities defined above. Thinking of the variables $w$ as defining a set $W$ via its $0/1$ indicator, let $|S_j \cap W|$ be the formal expression $\sum_{i \in S_j} w_i$. Let $\widetilde{\mathbb{E}}$ be a degree-$O(t)$ pseudoexpectation which satisfies $\mathcal{A}$. Then for every $i \neq j$, the quantity $\widetilde{\mathbb{E}}\, |S_i \cap W| \cdot |S_j \cap W|$ is small compared to $(n/k)^2$, with a bound depending on $\Delta$ and $t$.
Proof of Lemma 3.
Let $\widetilde{\mathbb{E}}$ satisfy $\mathcal{A}$. As in Lemmas 1 and 2, we will endeavor to bound $\widetilde{\mathbb{E}}\, |S_i \cap W| \cdot |S_j \cap W|$ for each pair $i \neq j$.
By $\Delta$-separation,
where we implicitly used the SoS triangle inequality on each coordinate of the vectors $\mu_i - \mu(w)$ and $\mu(w) - \mu_j$.
So,
Now we are going to bound the term involving $\mu_i - \mu(w)$. By symmetry the same argument will apply to the same expression with $i$ and $j$ exchanged.
We apply Fact 3:
Then we use pseudoexpectation Cauchy-Schwarz to get
.
Putting these two together and canceling we get
.
Also, clearly . So we get
We started out with the goal of bounding $\widetilde{\mathbb{E}}\, |S_i \cap W| \cdot |S_j \cap W|$. We have found that
Applying pseudoexpectation Holder’s inequality, we find that
.
Rearranging things, we get
Now proceeding as in the proof of Lemma 2, we know
so
QED.
Remark: We cheated ever so slightly in this proof. First of all, we did not state a version of pseudoexpectation Holder’s inequality which allows the exponent $t$, just a more restricted one. The correct version can be found in Lemma A.4 of this paper. That inequality will work only when $t$ is large enough. To handle smaller $t$, probably one must remove the square from both sides of Fact 3, which will require a hypothesis which does not use the squared Frobenius norm. This is possible; see e.g. my paper with Jerry Li.
Putting Things Together
The conclusion of Lemma 3 is almost identical to the conclusion of Lemma 2, and so the rest of the analysis of the high-dimensional clustering algorithm proceeds exactly as in the one-dimensional case. At the end, to show that the clustering algorithm works with high probability to cluster samples from a separated Gaussian mixture model, one uses straightforward concentration of measure to show that if $X_1,\ldots,X_n$ are enough samples from a $\Delta$-separated mixture model, then they satisfy the hypotheses of Lemma 3 with high probability. This concludes our “proof” of the main theorem from way back in part 1.
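As a small illustration of this concentration step (again with hypothetical parameters, and checking only a few fixed test directions rather than the certificate the proof actually needs), one can watch the empirical directional $t$-th moments of a Gaussian cluster settle near their population values as the number of samples grows:

```python
import numpy as np

# Illustration of the concentration step: as the number of samples per
# cluster grows, the worst empirical directional t-th moment over a few
# fixed random test directions approaches the population value
# E<g,u>^t = (t-1)!! * ||u||^t, comfortably below 2 * t^{t/2} * ||u||^t.
# (The real proof needs a certificate valid for all u, not a spot check.)
rng = np.random.default_rng(3)
d, t = 50, 4
dirs = rng.normal(size=(20, d))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
population = np.prod(np.arange(t - 1, 0, -2))       # (t-1)!!, equal to 3 for t = 4

for n_per in [100, 1000, 10000]:
    X = rng.normal(size=(n_per, d))
    X -= X.mean(axis=0)                             # center at the empirical mean
    worst = max(np.mean((X @ u) ** t) for u in dirs)
    print(f"n/k = {n_per:5d}: worst empirical moment {worst:.2f} "
          f"(population {population}, bound {2 * t ** (t / 2):.0f})")
```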
Related literature
The reader interested in further applications of the Sum of Squares method to unsupervised learning problems may consult some of the following works.
- [Barak, Kelner, Steurer] on dictionary learning
- [Potechin, Steurer] on tensor completion
- [Ge, Ma] on random overcomplete tensors
Though we have not seen it in these posts, the SoS method often overlaps with questions about tensor decomposition. For some examples in this direction, see the [Barak, Kelner, Steurer] dictionary learning paper above, as well as
- [Ma, Shi, Steurer] on tensor decomposition
The SoS method can often be used to design algorithms with more practical running times than the large SDPs we have discussed here. (This typically requires further ideas to avoid solving large semidefinite programs.) See e.g.:
- [Hopkins, Schramm, Shi, Steurer] on extracting spectral algorithms (rather than SDP-based algorithms) from SoS proofs
- [Schramm, Steurer] developing a sophisticated spectral method for dictionary learning via SoS proofs
- [Hopkins, Kothari, Potechin, Raghavendra, Schramm, Steurer] with a meta-theorem on when it is possible to extract spectral algorithms from SoS proofs
Another common tool in constructing SoS proofs for unsupervised learning problems, which we did not see here, is concentration bounds for random matrices whose entries are low-degree polynomials in independent random variables. For some examples along these lines, see
- [Hopkins, Shi, Steurer] on tensor principal component analysis
- [Raghavendra, Rao, Schramm] on random constraint satisfaction problems
as well as several of the previous papers.