This is part 4 of a continuing series on clustering, Gaussian mixtures, and Sum of Squares (SoS) proofs. If you have not read them yet, I recommend starting with Part 1, Part 2, and Part 3. Also, if you find errors please mention them in the comments (or otherwise get in touch with me) and I will fix them ASAP.
Last time we finished our SoS identifiability proof for one-dimensional Gaussian mixtures. In this post, we are going to turn it into an algorithm. In Part 3 we proved Lemma 2, which we restate here.
Lemma 2.
Let $X_1, \ldots, X_n \in \mathbb{R}$. Let $S_1, \ldots, S_k$ be a partition of $[n]$ into $k$ pieces of size $n/k$ such that for each $j \in [k]$, the collection of numbers $\{X_i\}_{i \in S_j}$ obeys the following moment bound:

$\frac{k}{n} \sum_{i \in S_j} (X_i - \mu_j)^t \le 2 \cdot t^{t/2},$

where $\mu_j$ is the average $\frac{k}{n} \sum_{i \in S_j} X_i$ and $t$ is a power of $2$. Let $\Delta > 0$ be such that $|\mu_j - \mu_{j'}| \ge \Delta$ for every $j \ne j'$.
Let $w = (w_1, \ldots, w_n)$ be indeterminates. Let $\mathcal{A}$ be the following set of equations and inequalities:

$w_i^2 = w_i \ \text{ for all } i \in [n], \qquad \sum_{i \le n} w_i = \tfrac{n}{k}, \qquad \tfrac{k}{n} \sum_{i \le n} w_i \, (X_i - \mu(w))^t \le 2 \cdot t^{t/2}.$

As before, $\mu(w)$ is the polynomial $\frac{k}{n} \sum_{i \le n} w_i X_i$. Thinking of the variables $w$ as defining a set $S \subseteq [n]$ via its $0/1$ indicator, let $|S \cap S_j|$ be the formal expression

$|S \cap S_j| = \sum_{i \in S_j} w_i.$

Let $\tilde{\mathbb{E}}$ be a degree-$O(t)$ pseudoexpectation which satisfies $\mathcal{A}$. Then $\tilde{\mathbb{E}}$ must believe that $S$ essentially coincides with a single one of the clusters $S_1, \ldots, S_k$: quantitatively, for every $j \ne j'$ the pseudoexpected overlap $\tilde{\mathbb{E}}\, |S \cap S_j| \cdot |S \cap S_{j'}|$ is small (see Part 3 for the precise bound).
We will now design a convex program to exploit the SoS identifiability proof, and in particular Lemma 2. Then we will describe a (very simple) rounding procedure and analyze it, which will complete our description and analysis of the one-dimensional algorithm.
Let’s look at the hypothesis of Lemma 2. It asks for a pseudoexpectation of degree $O(t)$ which satisfies the inequalities $\mathcal{A}$. First of all, note that the inequalities $\mathcal{A}$ depend only on the points $X_1, \ldots, X_n$ to be clustered, and in particular not on the hidden partition $S_1, \ldots, S_k$, so they are fair game to use in our algorithm. Second, it is not too hard to check that the set of pseudoexpectations satisfying $\mathcal{A}$ is convex, and is in fact the feasible region of a semidefinite program with $n^{O(t)}$ variables!
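To make the "feasible region of a semidefinite program" claim concrete, here is a minimal sketch in Python with cvxpy, using only degree-$2$ pseudomoments and only the Boolean and size constraints from $\mathcal{A}$; the actual algorithm needs degree-$O(t)$ moments and must also encode the empirical moment-bound constraint, but the flavor is the same.

```python
# Illustrative only: the degree-2 pseudomoment matrix for w_1, ..., w_n,
# with the constraints w_i^2 = w_i and sum_i w_i = n/k, is the feasible
# region of a small SDP.
import cvxpy as cp

def degree_two_pseudomoments(n, k):
    # M is indexed by the monomials {1, w_1, ..., w_n}; PSD=True also makes it symmetric.
    M = cp.Variable((n + 1, n + 1), PSD=True)
    constraints = [M[0, 0] == 1]                        # pseudoexpectation of 1 is 1
    for i in range(1, n + 1):
        constraints.append(M[i, i] == M[0, i])          # w_i^2 = w_i
    for j in range(n + 1):
        constraints.append(cp.sum(M[j, 1:]) == (n / k) * M[j, 0])   # sum_i w_i = n/k
    # The block M[1:, 1:] plays the role of the matrix of pseudomoments E[w_i w_j].
    return M, constraints
```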
It is actually possible to design a rounding algorithm which takes any pseudoexpectation satisfying $\mathcal{A}$ and produces a cluster, up to a small fraction of misclassified points. Then the natural approach to finding all the clusters is to iterate:
(1) find such a pseudoexpectation via semidefinite programming,
(2) round it to find a cluster,
(3) remove the points of that cluster from the input, go to (1).
This is a viable algorithm, but analyzing it is a little painful, because misclassifications from early rounds of the rounding algorithm must be taken into account when analyzing later rounds; in particular, a slightly stronger version of Lemma 2 is needed, allowing some error from earlier misclassifications.
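In code, the peeling strategy would look roughly as follows; `find_pseudoexpectation` and `round_to_one_cluster` are placeholder names for the SDP solve and the single-cluster rounding step, and this is only a sketch of the approach we are about to set aside.

```python
# Sketch of the iterate-and-remove approach (not the one we will ultimately use).
def peel_off_clusters(points, k, find_pseudoexpectation, round_to_one_cluster):
    remaining = list(range(len(points)))
    clusters = []
    for _ in range(k):
        # (1) find a pseudoexpectation satisfying A on the remaining points
        pseudo_exp = find_pseudoexpectation([points[i] for i in remaining])
        # (2) round it to extract (approximately) one cluster
        cluster = round_to_one_cluster(pseudo_exp, remaining)
        clusters.append(cluster)
        # (3) remove those points and repeat
        remaining = [i for i in remaining if i not in set(cluster)]
    return clusters
```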
We are going to avoid this pain by imposing some more structure on the pseudoexpectation our algorithm eventually rounds, so that our rounding scheme can recover all the clusters without re-solving a convex program. This is not possible if one is only promised a pseudoexpectation which satisfies $\mathcal{A}$: observe, for example, that one can choose the pseudodistribution to be an actual probability distribution supported on the single point $a_1$, the $0/1$ indicator of the cluster $S_1$. This particular pseudodistribution is easy to round to extract $S_1$, but it contains no information about the remaining clusters $S_2, \ldots, S_k$.
We are going to use a trick reminiscent of entropy maximization to ensure that the pseudoexpectation we eventually round contains information about all the clusters $S_1, \ldots, S_k$. Our convex program will be:

$\text{minimize } \| \tilde{\mathbb{E}}\, w w^\top \|_F \quad \text{over degree-}O(t) \text{ pseudoexpectations } \tilde{\mathbb{E}} \text{ satisfying } \mathcal{A},$

where $\| \cdot \|_F$ is the Frobenius norm of the matrix $\tilde{\mathbb{E}}\, w w^\top$ (the $n \times n$ matrix whose $(i,i')$ entry is $\tilde{\mathbb{E}}\, w_i w_{i'}$).
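Continuing the degree-$2$ sketch from above, the objective is just the Frobenius norm of the pseudomoment block, which is a convex function of the SDP variables:

```python
import cvxpy as cp

# Reuses the degree_two_pseudomoments helper from the earlier sketch (illustrative only).
M, constraints = degree_two_pseudomoments(n=30, k=3)
objective = cp.Minimize(cp.norm(M[1:, 1:], "fro"))   # minimize || pseudomoments of w w^T ||_F
problem = cp.Problem(objective, constraints)
problem.solve()
pseudo_second_moment = M.value[1:, 1:]               # candidate matrix to hand to a rounding step
```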
It may not be so obvious why $\|\tilde{\mathbb{E}}\, ww^\top\|_F$ is a good thing to minimize, or what it has to do with entropy maximization. We offer the following interpretation. Suppose that instead of pseudodistributions, we were able to minimize $\|\mathbb{E}_{w \sim D}\, ww^\top\|_F$ over all probability distributions $D$ which are supported on the vectors $a_1, \ldots, a_k$ which are the $0/1$ indicators of the clusters $S_1, \ldots, S_k$. Such a distribution $D$ is specified by nonnegative numbers $p_1, \ldots, p_k$ (with $p_j$ the probability of $a_j$) which sum to $1$, and the Frobenius norm is given by

$\| \mathbb{E}_{w \sim D}\, ww^\top \|_F^2 = \big\| \sum_{j \le k} p_j\, a_j a_j^\top \big\|_F^2 = \sum_{j, j' \le k} p_j p_{j'} \langle a_j, a_{j'} \rangle^2 = \sum_{j \le k} p_j^2 \langle a_j, a_j \rangle^2,$

where we have used the orthogonality $\langle a_j, a_{j'} \rangle = 0$ if $j \ne j'$. Since all the clusters have size $n/k$, we have $\langle a_j, a_j \rangle = n/k$, and hence

$\| \mathbb{E}_{w \sim D}\, ww^\top \|_F = \frac{n}{k} \cdot \|p\|_2,$

where $\|p\|_2$ is the $\ell_2$-norm, whose square is the collision probability, of the vector $p = (p_1, \ldots, p_k)$. This collision probability is minimized when $p$ is uniform, in which case the norm is $\frac{n}{k} \cdot \frac{1}{\sqrt{k}} = \frac{n}{k^{3/2}}$.
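Here is a quick numerical sanity check of the identity above (plain numpy, nothing SoS-specific): for any distribution $p$ over the cluster indicators, the Frobenius norm of the second moment matrix equals $\frac{n}{k} \cdot \|p\|_2$.

```python
import numpy as np

n, k = 12, 3
clusters = np.array_split(np.arange(n), k)            # a partition of [n] into k pieces of size n/k
a = np.zeros((k, n))
for j, S in enumerate(clusters):
    a[j, S] = 1.0                                      # a_j = 0/1 indicator of cluster j

p = np.random.dirichlet(np.ones(k))                    # an arbitrary distribution over the a_j's
second_moment = sum(p[j] * np.outer(a[j], a[j]) for j in range(k))
assert np.isclose(np.linalg.norm(second_moment, "fro"), (n / k) * np.linalg.norm(p))
```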
We can analyze our convex program via the following corollary of Lemma 2.
Corollary 1.
Let $X_1, \ldots, X_n$ and $S_1, \ldots, S_k$ be as in Lemma 2. Let $\tilde{\mathbb{E}}$ be the degree-$O(t)$ pseudoexpectation solving the convex program above. Let $\mathcal{U}$ be the uniform distribution over the vectors $a_1, \ldots, a_k$, where $a_j$ is the $0/1$ indicator of cluster $S_j$. Then $\| \tilde{\mathbb{E}}\, ww^\top - \mathbb{E}_{w \sim \mathcal{U}}\, ww^\top \|_F$ is small, where $\| \cdot \|_F$ is the Frobenius norm.
Proof of Corollary 1.
The uniform distribution over $a_1, \ldots, a_k$ is a feasible solution to the convex program, with objective value $\frac{n}{k^{3/2}}$, by the calculation preceding the corollary. So if $\tilde{\mathbb{E}}$ is the minimizer, we know $\| \tilde{\mathbb{E}}\, ww^\top \|_F \le \frac{n}{k^{3/2}}$.
We expand the norm:

$\| \tilde{\mathbb{E}}\, ww^\top - \mathbb{E}_{w \sim \mathcal{U}}\, ww^\top \|_F^2 = \| \tilde{\mathbb{E}}\, ww^\top \|_F^2 + \| \mathbb{E}_{w \sim \mathcal{U}}\, ww^\top \|_F^2 - 2 \langle \tilde{\mathbb{E}}\, ww^\top,\ \mathbb{E}_{w \sim \mathcal{U}}\, ww^\top \rangle.$

The first term is at most $\frac{n^2}{k^3}$ by minimality, and the second is exactly $\frac{n^2}{k^3}$.
To bound the last term we use Lemma 2. In the notation of that Lemma,

$\langle \tilde{\mathbb{E}}\, ww^\top,\ \mathbb{E}_{w \sim \mathcal{U}}\, ww^\top \rangle = \frac{1}{k} \sum_{j \le k} \tilde{\mathbb{E}}\, \langle w, a_j \rangle^2 = \frac{1}{k} \sum_{j \le k} \tilde{\mathbb{E}}\, |S \cap S_j|^2.$

Remember that $\mathcal{A}$ contains the equation $\sum_{j \le k} |S \cap S_j| = \sum_{i \le n} w_i = \frac{n}{k}$. Putting these together with the bound from Lemma 2, we get the conclusion of the corollary.
QED.
We are basically done with the algorithm now. Observe that the matrix $\mathbb{E}_{w \sim \mathcal{U}}\, ww^\top$ contains all the information about the clusters $S_1, \ldots, S_k$: in fact, the clusters can just be read off of the rows of $\mathbb{E}_{w \sim \mathcal{U}}\, ww^\top$ (two rows are identical if and only if the corresponding indices lie in the same cluster).
Once one has in hand a matrix which is close in Frobenius norm to $\mathbb{E}_{w \sim \mathcal{U}}\, ww^\top$, extracting the clusters is still a matter of reading them off of the rows of the matrix (choosing rows at random to avoid hitting one of the small number of rows which could be wildly far from their sisters in $\mathbb{E}_{w \sim \mathcal{U}}\, ww^\top$).
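For a concrete (if trivial) check of the "read off the rows" claim, note that two rows of $\mathbb{E}_{w \sim \mathcal{U}}\, ww^\top$ coincide exactly when their indices lie in the same cluster:

```python
import numpy as np

n, k = 12, 3
clusters = np.array_split(np.arange(n), k)
a = np.zeros((k, n))
for j, S in enumerate(clusters):
    a[j, S] = 1.0                                      # 0/1 indicator of cluster j

# E_{w ~ U} w w^T for U uniform over a_1, ..., a_k; entry (u, v) is 1/k iff u, v share a cluster.
target = sum(np.outer(a[j], a[j]) for j in range(k)) / k

same_cluster = lambda u, v: any(u in S and v in S for S in clusters)
for u in range(n):
    for v in range(n):
        assert np.allclose(target[u], target[v]) == same_cluster(u, v)
```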
We will prove the following fact at the end of this post.
Fact: rounding.
Let $S_1, \ldots, S_k$ be a partition of $[n]$ into $k$ parts of size $n/k$. Let $A$ be the $0/1$ indicator matrix for same-cluster membership. That is, $A_{uv} = 1$ if $u$ and $v$ are in the same cluster $S_j$, and $A_{uv} = 0$ otherwise. Suppose $M \in \mathbb{R}^{n \times n}$ satisfies the condition that $\|M - A\|_F$ is sufficiently small.
There is a polynomial-time algorithm which takes $M$ and with good probability produces a partition $C_1, \ldots, C_k$ of $[n]$ into clusters of size $n/k$ such that, up to a permutation of $[k]$, each $C_j$ differs from $S_j$ on only a small fraction of indices.
Putting things together for one-dimensional Gaussian mixtures
Now we sketch a proof of Theorem 1 in the one-dimensional case. Our algorithm is: given $X_1, \ldots, X_n$, solve the convex program above (minimizing $\|\tilde{\mathbb{E}}\, ww^\top\|_F$ over degree-$O(t)$ pseudoexpectations satisfying $\mathcal{A}$), then apply the rounding algorithm from the rounding Fact to (a rescaling of) $\tilde{\mathbb{E}}\, ww^\top$ and output the resulting partition.
If the points $X_1, \ldots, X_n$ satisfy the hypothesis of Lemma 2, then by Corollary 1, we know that $\tilde{\mathbb{E}}\, ww^\top$ is close in Frobenius norm to $\mathbb{E}_{w \sim \mathcal{U}}\, ww^\top$, where $\mathcal{U}$ is the uniform distribution over the $0/1$ indicators of the clusters $S_1, \ldots, S_k$. Hence the rounding algorithm produces a partition $C_1, \ldots, C_k$ of $[n]$ such that, up to a permutation of $[k]$, each $C_j$ errs from $S_j$ on only a small fraction of indices.
Since a standard Gaussian $g$ satisfies $\mathbb{E}\, g^t \le t^{t/2}$ for even $t$, elementary concentration shows that the points $X_1, \ldots, X_n$ satisfy the hypotheses of Lemma 1 with high probability, so long as $n$ is large enough.
Rounding algorithm
The last thing to do in this post is to prove the rounding Fact. This has little to do with SoS; the algorithm is elementary and combinatorial. We provide it for completeness.
The setting is: there is a partition $S_1, \ldots, S_k$ of $[n]$ into $k$ parts of size $n/k$. Let $A$ be the $0/1$ indicator matrix for cluster membership; i.e. $A_{uv} = 1$ if and only if $u$ and $v$ are in the same cluster $S_j$. Given a matrix $M$ such that $\|M - A\|_F$ is small, the goal is to recover a partition $C_1, \ldots, C_k$ of $[n]$ which is close to $S_1, \ldots, S_k$, up to a permutation of $[k]$.
Let $M_u$ be the $u$-th row of $M$, and similarly for $A_u$. Let $\tau > 0$ be a parameter we will set later.
The algorithm is:
(1) Let $\mathcal{I} = [n]$ be the set of active indices.
(2) Pick $u \in \mathcal{I}$ uniformly.
(3) Let $S$ be those indices $v \in \mathcal{I}$ for which $\|M_u - M_v\| \le \tau$.
(4) Add $S$ to the list of clusters and let $\mathcal{I} := \mathcal{I} \setminus S$.
(5) If the list has fewer than $k$ clusters, go to (2).
(6) (postprocess) Assign remaining indices to clusters arbitrarily, then move indices arbitrarily from larger clusters to smaller ones until all clusters have size $n/k$.
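Here is a short sketch of this procedure in Python. The choice of threshold `tau`, the stopping rule, and the details of the rebalancing step are illustrative; the analysis below is what pins down how they must relate to $n$, $k$, and the error $\|M - A\|_F$.

```python
import numpy as np

def round_rows(M, k, tau, seed=0):
    """Greedy row-clustering sketch: group active indices whose rows of M are tau-close."""
    n = M.shape[0]
    rng = np.random.default_rng(seed)
    active = set(range(n))                                   # (1) active indices
    clusters = []
    while len(clusters) < k and active:
        u = int(rng.choice(sorted(active)))                  # (2) pick an active index at random
        S = [v for v in active if np.linalg.norm(M[u] - M[v]) <= tau]   # (3) its tau-neighborhood
        clusters.append(S)                                   # (4) record the cluster
        active -= set(S)                                     #     and deactivate it
    while len(clusters) < k:                                 # (pad with empty clusters if needed)
        clusters.append([])
    for v in sorted(active):                                 # (6) hand out any leftover indices,
        min(clusters, key=len).append(v)
    target = n // k
    while any(len(C) < target for C in clusters):            #     then rebalance to size n/k
        donor = max(clusters, key=len)
        min(clusters, key=len).append(donor.pop())
    return clusters
```

On the exact matrix $A$, any $0 < \tau < \sqrt{2n/k}$ recovers the clusters exactly, since rows from the same cluster are identical and rows from different clusters are $\sqrt{2n/k}$ apart; the point of the Fact is that the same happens, up to a few misclassified indices, when $M$ is merely close to $A$.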
Fact: rounding (restated).
If $\|M - A\|_F$ is sufficiently small and $\tau$ is chosen appropriately, then with good probability the rounding algorithm outputs disjoint clusters $C_1, \ldots, C_k$, each of size $n/k$, such that, up to a permutation of $[k]$, each $C_j$ differs from $S_j$ on only a small fraction of indices.
Proof.
Call an index $u$ good if $M_u$ is close to $A_u$ in Euclidean norm, up to a threshold depending on $\tau$; an index is bad if it is not good. By hypothesis, $\|M - A\|_F^2$ is small. Each bad index contributes at least the squared threshold to $\|M - A\|_F^2 = \sum_u \|M_u - A_u\|^2$, so there are at most a small number of bad indices.
If $u, v$ are good indices and both are in the same cluster $S_j$, then if the algorithm chooses $u$, the resulting cluster will contain $v$. If $\tau$ is chosen appropriately (rows of $A$ for indices in different clusters differ in $2n/k$ coordinates, so good rows from different clusters are roughly $\sqrt{2n/k}$ apart, while good rows from the same cluster are within twice the goodness threshold of each other), then also if $v$ is good but is in some other cluster $S_{j'}$, the cluster formed upon choosing $u$ will not contain $v$. Thus if the algorithm never chooses a bad index, the clusters $C_1, \ldots, C_k$ it outputs before postprocessing will (up to a global permutation of $[k]$) satisfy that $C_j$ contains all the good indices in $S_j$. Hence in this case only bad indices can be misclassified, so the postprocessing step moves only a small number of indices, and in the end the cluster $C_j$ again errs from $S_j$ on at most a small number of indices.
Consider implementing the algorithm by drawing a list of random indices before seeing $M$ (i.e. obliviously); then, when the algorithm requires a random index, we give it the next index in our list which is in $\mathcal{I}$ (and halt with no output if no such index exists). It’s not hard to see that this implementation fails only with small probability. Furthermore, by a union bound, the list contains a bad index only with small probability. Choosing the parameters appropriately thus completes the proof. QED.