*This is part 4 of a continuing series on clustering, Gaussian mixtures, and Sum of Squares (SoS) proofs. If you have not read them yet, I recommend starting with Part 1, Part 2, and Part 3. Also, if you find errors please mention them in the comments (or otherwise get in touch with me) and I will fix them ASAP.*

Last time we finished our SoS identifiability proof for one-dimensional Gaussian mixtures. In this post, we are going to turn it into an algorithm. In Part 3 we proved Lemma 2, which we restate here.

Lemma 2.

Let . Let be a partition of into pieces of size such that for each , the collection of numbers obeys the following moment bound:where is the average and is a power of in . Let be such that for every .

Let be indeterminates. Let be the following set of equations and inequalities.

As before is the polynomial . Thinking of the variables as defining a set via its indicator, let be the formal expression

Let be a degree pseudoexpectation which satisfies . Then

We will we design a convex program to exploit the SoS identifiability proof, and in particular Lemma 2. Then we describe a (very simple) rounding procedure and analyze it, which will complete our description and analysis of the one-dimensional algorithm.

Let’s look at the hypothesis of Lemma 2. It asks for a pseudoexpectation of degree which satisfies the inequalities . First of all, note that the inequalities depend only on the vectors to be clustered, and in particular not on the hidden partition , so they are fair game to use in our algorithm. Second, it is not too hard to check that the set of pseudoexpectations satisfying is convex, and in fact the feasible region of a semidefinite program with variables!

It is actually possible to design a rounding algorithm which takes any pseudoexpectation satisfying and produces a cluster, up to about a -fraction of misclassified points. Then the natural approach to design an algorithm to find all the clusters is to iterate:

(1) find such a pseudoexpectation via semidefinite programming

(2) round to find a cluster

(3) remove all the points in from , go to (1).

This is a viable algorithm, but analyzing it is a little painful because misclassifications from early rounds of the rounding algorithm must be taken into account when analyzing later rounds, and in particular a slightly stronger version of Lemma 2 is needed, to allow some error from early misclassifications.

We are going to avoid this pain by imposing some more structure on the pseudoexpectation our algorithm eventually rounds, to enable our rounding scheme to recover all the clusters without re-solving a convex program. This is not possible if one is only promised a pseudoexpectation which satisfies : observe, for example, that one can choose the pseudodistribution to be a probability distribution supported on one point , the indicator of cluster . This particular is easy to round to extract , but contains no information about the remaining clusters .

We are going to use a trick reminiscent of entropy maximization to ensure that the pseudoexpectation we eventually round contains information about all the clusters . Our convex program will be:

where is the Frobenius norm of the matrix .

It may not be so obvious why is a good thing to minimize, or what it has to do with entropy maximization. We offer the following interpretation. Suppose that instead of pseudodistributions, we were able to minimize over all which are supported on vectors which are the indicators of the clusters . Such a distribution is specified by nonnegative for which sum to , and the Frobenius norm is given by

where we have used orthogonality if . Since all the clusters have size , we have , and we have , where is the -norm, whose square is the collision probability, of . This collision probability is minimized when is uniform.

We can analyze our convex program via the following corollary of Lemma 2.

Corollary 1.

Let and be as in Lemma 2. Let be the degree pseudoexpectation solvingLet be the uniform distribution over vectors where is the indicator of cluster . Then

where is the Frobenius norm.

*Proof of Corollary 1.*

The uniform distribution over is a feasible solution to the convex program with , by the calculation preceding the corollary. So if is the minimizer, we know .

We expand the norm:

To bound the last term we use Lemma 2. In the notation of that Lemma,

Remember that . Putting these together, we get

QED.

We are basically done with the algorithm now. Observe that the matrix contains all the information about the clusters . In fact the clusters can just be read off of the rows of .

Once one has in hand a matrix which is close in Frobenius norm to , extracting the clusters is still a matter of reading them off of the rows of the matrix (choosing rows at random to avoid hitting one of the small number of rows which could be wildly far from their sisters in ).

We will prove the following fact at the end of this post.

Fact: rounding.

Let be a partition of into parts of size . Let be the indicator matrix for same-cluster membership. That is, if and are in the same cluster . Suppose $M \in \mathbb{R}^{n \times n}$ satisfies .There is a polynomial-time algorithm which takes and with probability at least produces a partition of into clusters of size such that, up to a permutation of ,

## Putting things together for one-dimensional Gaussian mixtures

Now we sketch a proof of Theorem 1 in the case . Our algorithm is: given , solve

then apply the rounding algorithm from the rounding fact to and output the resulting partition.

If the vectors satisfy the hypothesis of Lemma 2, then by Corollary 1, we know

where is the uniform distribution over indicators for clusters . Hence the rounding algorithm produces a partition of such that

Since a standard Gaussian has , elementary concentration shows that the vectors satisfy the hypotheses of Lemma 1 with probability so long as .

## Rounding algorithm

The last thing to do in this post is prove the rounding algorithm Fact. This has little to do with SoS; the algorithm is elementary and combinatorial. We provide it for completeness.

The setting is: there is a partition of into parts of size . Let be the indicator matrix for cluster membership; i.e. if and only if and are in the same cluster . Given a matrix such that , the goal is to recover a partition of which is close to up to a permutation of .

Let be the -th row of and similarly for . Let be a parameter we will set later.

The algorithm is:

(1) Let be the set of active indices.

(2) Pick uniformly.

(3) Let be those indices for which .

(4) Add to the list of clusters and let $\mathcal{I} := \mathcal{I} \setminus S$.

(5) If , go to (2).

(6) (postprocess) Assign remaining indices to clusters arbitrarily, then move indices arbitrarily from larger clusters to smaller ones until all clusters have size .

Fact: rounding.

If then with probability at least the rounding algorithm outputs disjoint clusters , each of size , such that up to a permutation of , .

*Proof.*

Call an index *good* if . An index is bad if it is not good. By hypothesis . Each bad index contributes at least to the left side and , so there are at most bad indices.

If are good indices and both are in the same cluster , then if the algorithm chooses , the resulting cluster will contain . If , then also if is good but is in some other cluster , the cluster formed upon choosing will not contain . Thus if the algorithm never chooses a bad index, before postprocessing the clusters it outputs will (up to a global permutation of ) satisfy that contains all the good indices in . Hence in this case only bad indices can be misclassified, so the postprocessing step moves at most indices, and in the end the cluster again errs from on at most indices.

Consider implementing the algorithm by drawing a list of indices before seeing (i.e. obliviously), then when the algorithm requires random index we give it the next index in our list which is in (and halt with no output if no such index exists). It’s not hard to see that this implementation fails only with probability at most . Furthermore, by a union bound the list contains a bad index only with probability . Choosing thus completes the proof. QED.

Nice (series of) post(s)!

Two minor details:

* “Frobenious” should read “Frobenius”

* the collision probability is not the $2$-norm of the probability distribution (or rather its pmf), but is the squared $2$-norm