Clustering and Sum of Squares Proofs, Part 4

This is part 4 of a continuing series on clustering, Gaussian mixtures, and Sum of Squares (SoS) proofs. If you have not read them yet, I recommend starting with Part 1, Part 2, and Part 3. Also, if you find errors please mention them in the comments (or otherwise get in touch with me) and I will fix them ASAP.

Last time we finished our SoS identifiability proof for one-dimensional Gaussian mixtures. In this post, we are going to turn it into an algorithm. In Part 3 we proved Lemma 2, which we restate here.

Lemma 2.
Let X_1,\ldots,X_n \in \mathbb{R}. Let S_1,\ldots,S_k be a partition of [n] into k pieces of size N = n/k such that for each S_i, the collection of numbers \{X_j\}_{j \in S_i} obeys the following moment bound:

\mathbb{E}_{j \sim S_i} |X_j - \mu_i|^t \leq 2 \cdot t^{t/2}

where \mu_i is the average \mathbb{E}_{j \sim S_i} X_j and t is a power of 2 in \mathbb{N}. Let \Delta > 0 be such that |\mu_i - \mu_j| \geq \Delta for every i \neq j.

Let w_1,\ldots,w_n be indeterminates. Let \mathcal{A} be the following set of equations and inequalities.

w_i^2 = w_i \text{ for } i \in [n]
\sum_{i \in [n]} w_i = N
\frac 1 N \sum_{i \in [n]} w_i \cdot (X_i - \mu)^t \leq 2 \cdot t^{t/2}.

As before \mu = \mu(w) is the polynomial \tfrac 1 N \sum_{i \in [n]} w_i X_i. Thinking of the variables w_i as defining a set T \subseteq [n] via its 0/1 indicator, let |T \cap S_j| be the formal expression

|T \cap S_j| = \sum_{i \in S_j} w_i.

Let \tilde{\mathbb{E}} be a degree O(t) pseudoexpectation which satisfies \mathcal{A}. Then

\tilde{\mathbb{E}} \left [  \sum_{i \in [k]} \left ( \frac{|T \cap S_i|}{N} \right ) ^2 \right ] \geq 1 - \frac{2^{O(t)} t^{t/2} k^2}{\Delta^t}.

We will design a convex program to exploit the SoS identifiability proof, and in particular Lemma 2. Then we will describe a (very simple) rounding procedure and analyze it, which will complete our description and analysis of the one-dimensional algorithm.

Let’s look at the hypothesis of Lemma 2. It asks for a pseudoexpectation of degree O(t) which satisfies the inequalities \mathcal{A}. First of all, note that the inequalities \mathcal{A} depend only on the points X_1,\ldots,X_n to be clustered, and in particular not on the hidden partition S_1,\ldots,S_k, so they are fair game to use in our algorithm. Second, it is not too hard to check that the set of pseudoexpectations satisfying \mathcal{A} is convex, and is in fact the feasible region of a semidefinite program with n^{O(t)} variables!

It is actually possible to design a rounding algorithm which takes any pseudoexpectation satisfying \mathcal{A} and produces a cluster, up to about a 2^{O(t)} t^{t/2} \text{poly}(k) / \Delta^t-fraction of misclassified points. Then the natural approach to design an algorithm to find all the clusters is to iterate:

(1) find such a pseudoexpectation \tilde{\mathbb{E}} via semidefinite programming
(2) round to find a cluster T
(3) remove all the points in T from X_1,\ldots,X_n, go to (1).

This is a viable algorithm, but analyzing it is a little painful because misclassifications from early rounds of the rounding algorithm must be taken into account when analyzing later rounds, and in particular a slightly stronger version of Lemma 2 is needed, to allow some error from early misclassifications.

We are going to avoid this pain by imposing some more structure on the pseudoexpectation our algorithm eventually rounds, to enable our rounding scheme to recover all the clusters without re-solving a convex program. This is not possible if one is only promised a pseudoexpectation \tilde{\mathbb{E}} which satisfies \mathcal{A}: observe, for example, that \tilde{\mathbb{E}} could be the expectation operator of an actual probability distribution supported on a single point w \in \{0,1\}^n, namely the indicator of cluster S_1. This particular \tilde{\mathbb{E}} is easy to round to extract S_1, but contains no information about the remaining clusters S_2,\ldots,S_k.

We are going to use a trick reminiscent of entropy maximization to ensure that the pseudoexpectation we eventually round contains information about all the clusters S_1,\ldots,S_k. Our convex program will be:

\text{min } \|\tilde{\mathbb{E}} ww^\top \| \text{ such that } \tilde{\mathbb{E}} \text{ satisfies } \mathcal{A},

where \| \tilde{\mathbb{E}} ww^\top \| is the Frobenius norm of the matrix \tilde{\mathbb{E}} ww^\top.
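
To make the program a bit more concrete, here is a minimal sketch of its degree-2 fragment using cvxpy (my choice of tool, not the post's). The function name degree2_fragment and its interface are hypothetical, and the fragment omits the degree-t moment inequality in \mathcal{A} (which requires higher-degree moments of w), so it only illustrates how pseudoexpectation constraints become semidefinite-programming constraints; it is not the actual degree-O(t) algorithm.

import cvxpy as cp
import numpy as np

def degree2_fragment(X, N):
    # Degree-2 moment matrix indexed by (1, w_1, ..., w_n):
    # M[0,0] = pE[1], M[0,i] = pE[w_i], M[i,j] = pE[w_i w_j].
    n = len(X)
    M = cp.Variable((n + 1, n + 1), symmetric=True)
    constraints = [M >> 0, M[0, 0] == 1]
    # w_i^2 = w_i becomes pE[w_i^2] = pE[w_i]
    constraints += [M[i, i] == M[0, i] for i in range(1, n + 1)]
    # sum_i w_i = N, multiplied by 1 and by each w_j
    constraints += [cp.sum(M[0, 1:]) == N]
    constraints += [cp.sum(M[j, 1:]) == N * M[0, j] for j in range(1, n + 1)]
    # NOTE: the degree-t moment inequality from A (which is where X enters)
    # needs moments of w of higher degree and is omitted from this sketch.
    cp.Problem(cp.Minimize(cp.norm(M[1:, 1:], "fro")), constraints).solve()
    return M.value[1:, 1:]   # the n x n block pE[w w^T]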

It may not be so obvious why \| \tilde{\mathbb{E}} ww^\top \| is a good thing to minimize, or what it has to do with entropy maximization. We offer the following interpretation. Suppose that instead of minimizing over pseudodistributions, we were able to minimize \| \mathbb{E}_{a \sim \mu} aa^\top \| over all distributions \mu supported on the vectors a_1,\ldots,a_k \in \{0,1\}^n which are the 0/1 indicators of the clusters S_1,\ldots,S_k. Such a distribution \mu is specified by nonnegative numbers \mu(a_i) \in \mathbb{R} for i \in [k] which sum to 1, and the Frobenius norm is given by

\|\mathbb{E}_{a \sim \mu} aa^\top \|^2 = \left \langle \sum_{i \in [k]} \mu(a_i) a_i a_i^\top, \sum_{i \in [k]} \mu(a_i) a_i a_i^\top \right \rangle = \sum_{i \in [k]} \mu(a_i)^2 \cdot \|a_i\|^4,

where we have used the orthogonality \langle a_i,a_j \rangle = 0 for i \neq j. Since all the clusters have size n/k, we have \|a_i\|^4 = (n/k)^2, and hence \|\mathbb{E}_{a \sim \mu} aa^\top \| = \|\mu\| \cdot (n/k), where \|\mu\| is the 2-norm of \mu (its square is the collision probability of \mu). This collision probability is minimized when \mu is uniform.
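
As a quick numerical sanity check of this identity (the toy setup and names below are mine), one can verify \|\mathbb{E}_{a \sim \mu} aa^\top\| = \|\mu\| \cdot (n/k) on a small partition:

import numpy as np

n, k = 12, 3
N = n // k
a = [np.zeros(n) for _ in range(k)]          # 0/1 indicators of a partition of [n]
for i in range(k):
    a[i][i * N:(i + 1) * N] = 1.0
mu = np.ones(k) / k                          # uniform distribution over the indicators
E_aaT = sum(mu[i] * np.outer(a[i], a[i]) for i in range(k))
print(np.linalg.norm(E_aaT))                 # Frobenius norm: N / sqrt(k) = 2.309...
print(np.linalg.norm(mu) * N)                # ||mu|| * (n/k): the same number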

We can analyze our convex program via the following corollary of Lemma 2.

Corollary 1.
Let X_1,\ldots,X_n and S_1,\ldots,S_k be as in Lemma 2. Let \tilde{\mathbb{E}} be the degree O(t) pseudoexpectation solving

\text{min } \|\tilde{\mathbb{E}} ww^\top \| \text{ such that } \tilde{\mathbb{E}} \text{ satisfies } \mathcal{A}.

Let \mu be the uniform distribution over vectors a_1,\ldots,a_k \in \{0,1\}^n where a_i is the 0/1 indicator of cluster S_i. Then

\| \tilde{\mathbb{E}} ww^\top - \mathbb{E}_{\mu} aa^\top \| \leq \|\mathbb{E}_\mu aa^\top\| \cdot \left ( \frac{2^{O(t)} t^{t/2} k^2}{\Delta^t} \right )^{1/2},

where \|\cdot \| is the Frobenius norm.

Proof of Corollary 1.
The uniform distribution \mu over a_1,\ldots,a_k is a feasible solution to the convex program with \|\mathbb{E}_{a \sim \mu} aa^\top \|^2 = (n/k)^2 \cdot 1/k, by the calculation preceding the corollary. So if \tilde{\mathbb{E}} is the minimizer, we know \|\tilde{\mathbb{E}} ww^\top \|^2 \leq n^2/k^3.

We expand the norm:

\| \tilde{\mathbb{E}} ww^\top - \mathbb{E}_{\mu} aa^\top \|^2 = \| \tilde{\mathbb{E}} ww^\top\|^2 + \|\mathbb{E}_\mu aa^\top\|^2 - 2 \langle \tilde{\mathbb{E}} ww^\top, \mathbb{E} aa^\top \rangle \leq 2 \left ( \frac {n^2}{k^3} - \langle \tilde{\mathbb{E}} ww^\top, \mathbb{E} aa^\top \rangle \right ).

To bound the last term we use Lemma 2. In the notation of that Lemma,

\langle \tilde{\mathbb{E}} ww^\top, \mathbb{E} aa^\top \rangle = \frac 1k \tilde{\mathbb{E}} \sum_{i \in [k]} |S_i \cap T|^2 \geq \frac 1k \cdot N^2 \cdot \left ( 1 - \frac{2^{O(t)} t^{t/2} k^2}{\Delta^t} \right ).

Remember that N = n/k, so \tfrac{n^2}{k^3} = \tfrac{N^2}{k} = \|\mathbb{E}_\mu aa^\top\|^2. Putting these together (and absorbing the extra factor of 2 into the 2^{O(t)}), we get

\| \tilde{\mathbb{E}} ww^\top - \mathbb{E}_{\mu} aa^\top \| \leq \|\mathbb{E} aa^\top\| \cdot \left ( \frac{2^{O(t)} t^{t/2} k^2}{\Delta^t} \right )^{1/2}.

QED.

We are basically done with the algorithm now. Observe that the matrix \mathbb{E}_\mu aa^\top contains all the information about the clusters S_1,\ldots,S_k. In fact the clusters can just be read off of the rows of \mathbb{E}_\mu aa^\top: row i is exactly \tfrac 1k times the 0/1 indicator of the cluster containing i.

Once one has in hand a matrix M = \tilde{\mathbb{E}} ww^\top which is close in Frobenius norm to \mathbb{E} aa^\top, extracting the clusters is still a matter of reading them off of the rows of the matrix (choosing rows at random to avoid hitting one of the small number of rows which could be wildly far from their sisters in \mathbb{E} aa^\top).

We will prove the following fact at the end of this post.

Fact: rounding.
Let S_1,\ldots,S_k be a partition of [n] into k parts of size N = n/k. Let A \in \{0,1\}^{n \times n} be the 0/1 indicator matrix for same-cluster membership; that is, A_{ij} = 1 if and only if i and j are in the same cluster S_\ell. Suppose M \in \mathbb{R}^{n \times n} satisfies \|M - A\| \leq \epsilon \|A\|.

There is a polynomial-time algorithm which takes M and with probability at least 1 - O(k^2 \epsilon^2) produces a partition T_1,\ldots,T_k of [n] into clusters of size N such that, up to a permutation of [k],

\frac{|S_i \cap T_i|}{N} \geq 1 - O(\epsilon^2 k).

 

Putting things together for one-dimensional Gaussian mixtures

Now we sketch a proof of Theorem 1 in the case d=1. Our algorithm is: given X_1,\ldots,X_n, solve

\arg \min \| \tilde{\mathbb{E}} ww^\top \| \text{ such that } \tilde{\mathbb{E}} \text{ is degree } O(t) \text{ and satisfies } \mathcal{A}

then apply the rounding algorithm from the rounding fact to \tilde{\mathbb{E}} ww^\top and output the resulting partition.
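
In code, the whole one-dimensional algorithm is just this SDP solve followed by the rounding step. The sketch below glues together the two hypothetical functions from this post (degree2_fragment above, which omits the degree-t constraint, and round_clusters from the rounding section below), so it is a toy stand-in rather than the full degree-O(t) algorithm.

def cluster_1d(X, k):
    # Solve (a fragment of) the convex program, then round its n x n moment block.
    n = len(X)
    M = degree2_fragment(X, n // k)   # stand-in for the full degree-O(t) SDP solve
    return round_clusters(M, k)       # defined in the rounding section below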

If the points X_1,\ldots,X_n satisfy the hypothesis of Lemma 2, then by Corollary 1, we know

\|\tilde{\mathbb{E}} ww^\top - \mathbb{E}_\mu aa^\top \| \leq \| \mathbb{E}_\mu aa^\top \| \cdot \left ( \frac{2^{O(t)} t^{t/2} k^2}{\Delta^t} \right )^{1/2}

where \mu is the uniform distribution over 0/1 indicators for clusters S_1,\ldots,S_k. Hence the rounding algorithm produces a partition T_1,\ldots,T_k of [n] such that

\frac{|S_i \cap T_i|}{N} \geq 1 - O \left ( \frac{2^{O(t)} t^{t/2} k^3}{\Delta^t} \right ).

Since a standard Gaussian has \mathbb{E} |X|^t \leq t^{t/2}, elementary concentration shows that the points X_1,\ldots,X_n satisfy the hypotheses of Lemma 2 with probability 0.99 so long as n \geq \text{poly}(k).
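
For intuition (this is just a sanity check under a toy setup of my own, not part of the proof), one can check empirically that a large Gaussian sample satisfies the empirical moment bound \mathbb{E}_j |X_j - \mu|^t \leq 2 \cdot t^{t/2} with plenty of room to spare:

import numpy as np

rng = np.random.default_rng(0)
t, m = 8, 100_000
X = rng.standard_normal(m)
empirical = np.mean(np.abs(X - X.mean()) ** t)   # empirical t-th central moment
print(empirical, 2 * t ** (t / 2))               # roughly 105 versus 8192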

Rounding algorithm

The last thing to do in this post is prove the Fact about the rounding algorithm. This has little to do with SoS; the algorithm is elementary and combinatorial. We provide it for completeness.

The setting is: there is a partition S_1,\ldots,S_k of [n] into k parts of size n/k. Let A \in \{0,1\}^{n \times n} be the 0/1 indicator matrix for cluster membership; i.e. A_{ij} = 1 if and only if i and j are in the same cluster S_\ell. Given a matrix M such that \|M - A \| \leq \epsilon \|A\|, the goal is to recover a partition T_1,\ldots,T_k of [n] which is close to S_1,\ldots,S_k up to a permutation of [k].

Let M_i be the i-th row of M and similarly for A_i. Let \delta > 0 be a parameter we will set later.

The algorithm is:

(1) Let \mathcal{I} = [n] be the set of active indices.
(2) Pick i \sim \mathcal{I} uniformly.
(3) Let S \subseteq \mathcal{I} be those indices j for which \|M_j - M_i\| \leq \delta \cdot \sqrt{n/k}.
(4) Add S to the list of clusters and let \mathcal{I} := \mathcal{I} \setminus S.
(5) If |\mathcal{I}| > n/2k, go to (2).
(6) (postprocess) Assign remaining indices to clusters arbitrarily, then move indices arbitrarily from larger clusters to smaller ones until all clusters have size n/k.
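
A minimal Python sketch of this procedure (assuming numpy; the function name round_clusters, the cap of k clusters inside the loop, and the particular rebalancing used in step (6) are my own choices, not from the post):

import numpy as np

def round_clusters(M, k, delta=0.01, seed=0):
    # M is an n x n array assumed close to the cluster-membership matrix A
    # in Frobenius norm; delta is the threshold parameter from the algorithm.
    rng = np.random.default_rng(seed)
    n = M.shape[0]
    N = n // k
    threshold = delta * np.sqrt(n / k)
    active = list(range(n))
    clusters = []
    # steps (1)-(5): pick a random active row, grab all active rows within the threshold
    while len(active) > n / (2 * k) and len(clusters) < k:  # capping at k clusters is a safeguard
        i = rng.choice(active)
        S = [j for j in active if np.linalg.norm(M[j] - M[i]) <= threshold]
        clusters.append(S)
        active = [j for j in active if j not in S]
    # step (6): assign remaining indices arbitrarily, then rebalance so every
    # cluster has exactly N = n/k indices
    while len(clusters) < k:
        clusters.append([])
    clusters[0].extend(active)
    while min(len(S) for S in clusters) < N:
        big = max(clusters, key=len)
        small = min(clusters, key=len)
        small.append(big.pop())
    return clusters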

Fact: rounding.
If \delta = 0.01 then with probability at least 1 - O(k^2 \epsilon^2) the rounding algorithm outputs disjoint clusters T_1,\ldots,T_k, each of size n/k, such that up to a permutation of [k], |T_i \cap S_i| \geq (1 - O(\epsilon^2 k)) \cdot \tfrac nk.

Proof.
Call an index i \in [n] good if \|M_i - A_i\| \leq \tfrac \delta 2 \cdot \sqrt{n/k} = \tfrac \delta 2 \|A_i\|. An index is bad if it is not good. By hypothesis \sum_{i \in [n]} \|M_i - A_i\|^2 = \|M - A\|^2 \leq \epsilon^2 \|A\|^2. Each bad index contributes at least \tfrac {\delta^2}{4} \|A_i\|^2 to the left side and \|A_i\|^2 = \|A\|^2 /n, so there are at most 4 \epsilon^2 n / \delta^2 bad indices.

If i,j are good indices and both are in the same cluster S_\ell, then if the algorithm chooses i, the resulting cluster will contain j. If \delta < 0.1, then also if j is good but lies in some other cluster S_{\ell'}, the cluster formed upon choosing i will not contain j. Thus if the algorithm never chooses a bad index, then before postprocessing the clusters T_1,\ldots,T_k it outputs will (up to a global permutation of [k]) satisfy that T_i contains all the good indices in S_i. Hence in this case only bad indices can be misclassified, so the postprocessing step moves at most 4 \epsilon^2 n / \delta^2 indices, and in the end the cluster T_i again errs from S_i on at most 8 \epsilon^2 n / \delta^2 indices.

Consider implementing the algorithm by drawing a list of k^2 indices before seeing M (i.e. obliviously); then, when the algorithm requires a random index i \in \mathcal{I}, we give it the next index in our list which is in \mathcal{I} (and halt with no output if no such index exists). It’s not hard to see that this implementation fails only with probability at most O(1/k). Furthermore, by a union bound the list contains a bad index only with probability O(k^2 \epsilon^2 / \delta^2). Choosing \delta = 0.01 thus completes the proof. QED.
