# Clustering and Sum of Squares Proofs, Part 4

This is part 4 of a continuing series on clustering, Gaussian mixtures, and Sum of Squares (SoS) proofs. If you have not read them yet, I recommend starting with Part 1, Part 2, and Part 3. Also, if you find errors please mention them in the comments (or otherwise get in touch with me) and I will fix them ASAP.

Last time we finished our SoS identifiability proof for one-dimensional Gaussian mixtures. In this post, we are going to turn it into an algorithm. In Part 3 we proved Lemma 2, which we restate here.

Lemma 2.
Let $X_1,\ldots,X_n \in \mathbb{R}$. Let $S_1,\ldots,S_k$ be a partition of $[n]$ into $k$ pieces of size $N = n/k$ such that for each $S_i$, the collection of numbers $\{X_j\}_{j \in S_i}$ obeys the following moment bound:

$\mathbb{E}_{j \sim S_i} |X_j - \mu_i|^t \leq 2 \cdot t^{t/2}$

where $\mu_i$ is the average $\mathbb{E}_{j \sim S_i} X_j$ and $t$ is a power of $2$ in $\mathbb{N}$. Let $\Delta > 0$ be such that $|\mu_i - \mu_j| \geq \Delta$ for every $i \neq j$.

Let $w_1,\ldots,w_n$ be indeterminates. Let $\mathcal{A}$ be the following set of equations and inequalities.

$w_i^2 = w_i \text{ for } i \in [n]$
$\sum_{i \in [n]} w_i = N$
$\frac 1 N \sum_{i \in [n]} w_i \cdot (X_i - \mu)^t \leq 2 \cdot t^{t/2}.$

As before $\mu = \mu(w)$ is the polynomial $\tfrac 1 N \sum_{i \in [n]} w_i X_i$. Thinking of the variables $w_i$ as defining a set $T \subseteq [n]$ via its $0/1$ indicator, let $|T \cap S_j|$ be the formal expression

$|T \cap S_j| = \sum_{i \in S_j} w_i.$

Let $\tilde{\mathbb{E}}$ be a degree $O(t)$ pseudoexpectation which satisfies $\mathcal{A}$. Then

$\tilde{\mathbb{E}} \left [ \sum_{i \in [k]} \left ( \frac{|T \cap S_i|}{N} \right ) ^2 \right ] \geq 1 - \frac{2^{O(t)} t^{t/2} k^2}{\Delta^t}.$

We will we design a convex program to exploit the SoS identifiability proof, and in particular Lemma 2. Then we describe a (very simple) rounding procedure and analyze it, which will complete our description and analysis of the one-dimensional algorithm.

Let’s look at the hypothesis of Lemma 2. It asks for a pseudoexpectation of degree $O(t)$ which satisfies the inequalities $\mathcal{A}$. First of all, note that the inequalities $\mathcal{A}$ depend only on the vectors $X_1,\ldots,X_n$ to be clustered, and in particular not on the hidden partition $S_1,\ldots,S_k$, so they are fair game to use in our algorithm. Second, it is not too hard to check that the set of pseudoexpectations satisfying $\mathcal{A}$ is convex, and in fact the feasible region of a semidefinite program with $n^{O(t)}$ variables!

It is actually possible to design a rounding algorithm which takes any pseudoexpectation satisfying $\mathcal{A}$ and produces a cluster, up to about a $2^{O(t)} t^{t/2} \text{poly}(k) / \Delta^t$-fraction of misclassified points. Then the natural approach to design an algorithm to find all the clusters is to iterate:

(1) find such a pseudoexpectation $\tilde{\mathbb{E}}$ via semidefinite programming
(2) round to find a cluster $T$
(3) remove all the points in $T$ from $X_1,\ldots,X_n$, go to (1).

This is a viable algorithm, but analyzing it is a little painful because misclassifications from early rounds of the rounding algorithm must be taken into account when analyzing later rounds, and in particular a slightly stronger version of Lemma 2 is needed, to allow some error from early misclassifications.

We are going to avoid this pain by imposing some more structure on the pseudoexpectation our algorithm eventually rounds, to enable our rounding scheme to recover all the clusters without re-solving a convex program. This is not possible if one is only promised a pseudoexpectation $\tilde{\mathbb{E}}$ which satisfies $\mathcal{A}$: observe, for example, that one can choose the pseudodistribution $\mu$ to be a probability distribution supported on one point $w \in \{0,1\}^n$, the indicator of cluster $S_1$. This particular $\tilde{\mathbb{E}}$ is easy to round to extract $S_1$, but contains no information about the remaining clusters $S_2,\ldots,S_k$.

We are going to use a trick reminiscent of entropy maximization to ensure that the pseudoexpectation we eventually round contains information about all the clusters $S_1,\ldots,S_k$. Our convex program will be:

$\text{min } \|\tilde{\mathbb{E}} ww^\top \| \text{ such that } \tilde{\mathbb{E}} \text{ satisfies } \mathcal{A},$

where $\| \tilde{\mathbb{E}} ww^\top \|$ is the Frobenius norm of the matrix $\tilde{\mathbb{E}} ww^\top$.

It may not be so obvious why $\| \tilde{\mathbb{E}} ww^\top \|$ is a good thing to minimize, or what it has to do with entropy maximization. We offer the following interpretation. Suppose that instead of pseudodistributions, we were able to minimize $\| \mathbb{E}_{a \sim \mu} aa^\top \|$ over all $\mu$ which are supported on vectors $a_1,\ldots,a_k \in \{0,1\}^n$ which are the $0/1$ indicators of the clusters $S_1,\ldots,S_k$. Such a distribution $\mu$ is specified by nonnegative $\mu(a_i) \in \mathbb{R}$ for $i \in [k]$ which sum to $1$, and the Frobenius norm is given by

$\|\mathbb{E}_{a \sim \mu} aa^\top \|^2 = \left \langle \sum_{i \in [k]} \mu(a_i) a_i a_i^\top, \sum_{i \in [k]} \mu(a_i) a_i a_i^\top \right \rangle = \sum_{i \in [k]} \mu(a_i)^2 \cdot \|a_i\|^4,$

where we have used orthogonality $\langle a_i,a_j \rangle = 0$ if $i \neq j$. Since all the clusters have size $n/k$, we have $\|a_i\|^4 = (n/k)^2$, and we have $\|\mathbb{E}_{a \sim \mu} aa^\top \| = \|\mu\| \cdot (n/k)^2$, where $\|\mu\|$ is the $2$-norm, whose square is the collision probability, of $\mu$. This collision probability is minimized when $\mu$ is uniform.

We can analyze our convex program via the following corollary of Lemma 2.

Corollary 1.
Let $X_1,\ldots,X_n$ and $S_1,\ldots,S_k$ be as in Lemma 2. Let $\tilde{\mathbb{E}}$ be the degree $O(t)$ pseudoexpectation solving

$\text{min } \|\tilde{\mathbb{E}} ww^\top \| \text{ such that } \tilde{\mathbb{E}} \text{ satisfies } \mathcal{A}.$

Let $\mu$ be the uniform distribution over vectors $a_1,\ldots,a_k \in \{0,1\}^n$ where $a_i$ is the $0/1$ indicator of cluster $S_i$. Then

$\| \tilde{\mathbb{E}} ww^\top - \mathbb{E}_{\mu} aa^\top \| \leq \|\mathbb{E} aa^\top\| \cdot \left ( \frac{2^{O(t)} t^{t/2} k^2}{\Delta^t} \right )^{1/2}.$

where $\|\cdot \|$ is the Frobenius norm.

Proof of Corollary 1.
The uniform distribution $\mu$ over $a_1,\ldots,a_k$ is a feasible solution to the convex program with $\|\mathbb{E}_{a \sim \mu} aa^\top \|^2 = (n/k)^2 \cdot 1/k$, by the calculation preceding the corollary. So if $\tilde{\mathbb{E}}$ is the minimizer, we know $\|\tilde{\mathbb{E}} ww^\top \|^2 \leq n^2/k^3$.

We expand the norm:

$\| \tilde{\mathbb{E}} ww^\top - \mathbb{E}_{\mu} aa^\top \|^2 = \| \tilde{\mathbb{E}} ww^\top\|^2 + \|\mathbb{E}_\mu aa^\top\|^2 - 2 \langle \tilde{\mathbb{E}} ww^\top, \mathbb{E} aa^\top \rangle \leq 2 \left ( \frac {n^2}{k^3} - \langle \tilde{\mathbb{E}} ww^\top, \mathbb{E} aa^\top \rangle \right ).$

To bound the last term we use Lemma 2. In the notation of that Lemma,

$\langle \tilde{\mathbb{E}} ww^\top, \mathbb{E} aa^\top \rangle = \frac 1k \tilde{\mathbb{E}} \sum_{i \in [k]} |S_i \cap T|^2 \geq \frac 1k \cdot N^2 \cdot \left ( 1 - \frac{2^{O(t)} t^{t/2} k^2}{\Delta^t} \right ).$

Remember that $N = n/k$. Putting these together, we get

$\| \tilde{\mathbb{E}} ww^\top - \mathbb{E}_{\mu} aa^\top \| \leq \|\mathbb{E} aa^\top\| \cdot \left ( \frac{2^{O(t)} t^{t/2} k^2}{\Delta^t} \right )^{1/2}.$

QED.

We are basically done with the algorithm now. Observe that the matrix $\mathbb{E}_\mu aa^\top$ contains all the information about the clusters $S_1,\ldots,S_k$. In fact the clusters can just be read off of the rows of $\mathbb{E} aa^\top$.

Once one has in hand a matrix $M = \tilde{\mathbb{E}} ww^\top$ which is close in Frobenius norm to $\mathbb{E} aa^\top$, extracting the clusters is still a matter of reading them off of the rows of the matrix (choosing rows at random to avoid hitting one of the small number of rows which could be wildly far from their sisters in $\mathbb{E} aa^\top$).

We will prove the following fact at the end of this post.

Fact: rounding.
Let $S_1,\ldots,S_k$ be a partition of $[n]$ into $k$ parts of size $N = n/k$. Let $A \in \{0,1\}^{n \times n}$ be the $0/1$ indicator matrix for same-cluster membership. That is, $A_{ij} = 1$ if $i$ and $j$ are in the same cluster $S_\ell$. Suppose $M \in \mathbb{R}^{n \times n}$ satisfies $\|M - A\| \leq \epsilon \|A\|$.

There is a polynomial-time algorithm which takes $M$ and with probability at least $1 - O(k^2 \epsilon^2)$ produces a partition $T_1,\ldots,T_k$ of $[n]$ into clusters of size $N$ such that, up to a permutation of $[k]$,

$\frac{|S_i \cap T_i|}{N} \geq 1 - O(\epsilon^2 k).$

## Putting things together for one-dimensional Gaussian mixtures

Now we sketch a proof of Theorem 1 in the case $d=1$. Our algorithm is: given $X_1,\ldots,X_n$, solve

$\arg \min \| \tilde{\mathbb{E}} ww^\top \| \text{ such that } \tilde{\mathbb{E}} \text{ is degree } O(t) \text{ and satisfies } \mathcal{A}$

then apply the rounding algorithm from the rounding fact to $\tilde{\mathbb{E}} ww^\top$ and output the resulting partition.

If the vectors $X_1,\ldots,X_n$ satisfy the hypothesis of Lemma 2, then by Corollary 1, we know

$\|\tilde{\mathbb{E}} ww^\top - \mathbb{E}_\mu aa^\top \| \leq \| \mathbb{E}_\mu aa^\top \| \cdot \left ( \frac{2^{O(t)} t^{t/2} k^2}{\Delta^t} \right )^{1/2}$

where $\mu$ is the uniform distribution over $0/1$ indicators for clusters $S_1,\ldots,S_k$. Hence the rounding algorithm produces a partition $T_1,\ldots,T_k$ of $[n]$ such that

$\frac{|S_i \cap T_i|}{N} \geq 1 - O \left ( \frac{2^{O(t)} t^{t/2} k^3}{\Delta^t} \right ).$

Since a standard Gaussian has $\mathbb{E} |X|^t \leq t^{t/2}$, elementary concentration shows that the vectors $X_1,\ldots,X_n$ satisfy the hypotheses of Lemma 1 with probability $0.99$ so long as $n \geq poly(k)$.

## Rounding algorithm

The last thing to do in this post is prove the rounding algorithm Fact. This has little to do with SoS; the algorithm is elementary and combinatorial. We provide it for completeness.

The setting is: there is a partition $S_1,\ldots,S_k$ of $[n]$ into $k$ parts of size $n/k$. Let $A \in \{0,1\}^{n \times n}$ be the $0/1$ indicator matrix for cluster membership; i.e. $A_{ij} = 1$ if and only if $i$ and $j$ are in the same cluster $S_\epsilon$. Given a matrix $M$ such that $\|M - A \| \leq \epsilon \|A\|$, the goal is to recover a partition $T_1,\ldots,T_k$ of $[n]$ which is close to $S_1,\ldots,S_k$ up to a permutation of $[k]$.

Let $M_i$ be the $i$-th row of $M$ and similarly for $A_i$. Let $\delta > 0$ be a parameter we will set later.

The algorithm is:

(1) Let $\mathcal{I} = [n]$ be the set of active indices.
(2) Pick $i \sim \mathcal{I}$ uniformly.
(3) Let $S \subseteq \mathcal{I}$ be those indices $j$ for which $\|M_j - M_i\| \leq \delta \cdot \sqrt{n/k}$.
(4) Add $S$ to the list of clusters and let $\mathcal{I} := \mathcal{I} \setminus S$.
(5) If $|\mathcal{I}| > n/2k$, go to (2).
(6) (postprocess) Assign remaining indices to clusters arbitrarily, then move indices arbitrarily from larger clusters to smaller ones until all clusters have size $n/k$.

Fact: rounding.
If $\delta = 0.01$ then with probability at least $1 - O(k^2 \epsilon^2)$ the rounding algorithm outputs disjoint clusters $T_1,\ldots,T_k$, each of size $n/k$, such that up to a permutation of $[k]$, $|T_i \cap S_i| \geq (1 - O(\epsilon^2 k)) \cdot \tfrac nk$.

Proof.
Call an index $i \in [n]$ good if $\|M_i - A_i\| \leq \tfrac \delta 2 \cdot \sqrt{n/k} = \tfrac \delta 2 \|A_i\|$. An index is bad if it is not good. By hypothesis $\sum_{i \in [n]} \|M_i - A_i\|^2 = \|M - A\|^2 \leq \epsilon^2 \|A\|^2$. Each bad index contributes at least $\tfrac {\delta^2}{4} \|A_i\|^2$ to the left side and $\|A_i\|^2 = \|A\|^2 /n$, so there are at most $4 \epsilon^2 n / \delta^2$ bad indices.

If $i,j$ are good indices and both are in the same cluster $S_\epsilon$, then if the algorithm chooses $i$, the resulting cluster will contain $j$. If $\delta < 0.1$, then also if $j$ is good but is in some other cluster $S_{\epsilon'}$, the cluster formed upon choosing $i$ will not contain $j$. Thus if the algorithm never chooses a bad index, before postprocessing the clusters $T_1,\ldots,T_k$ it outputs will (up to a global permutation of $[k]$) satisfy that $T_i$ contains all the good indices in $S_i$. Hence in this case only bad indices can be misclassified, so the postprocessing step moves at most $4 \epsilon^2 n / \delta^2$ indices, and in the end the cluster $T_i$ again errs from $S_i$ on at most $8 \epsilon^2 n / \delta^2$ indices.

Consider implementing the algorithm by drawing a list of $k^2$ indices before seeing $M$ (i.e. obliviously), then when the algorithm requires random index $i \in \mathcal{I}$ we give it the next index in our list which is in $\mathcal{I}$ (and halt with no output if no such index exists). It’s not hard to see that this implementation fails only with probability at most $O(1/k)$. Furthermore, by a union bound the list contains a bad index only with probability $O(k^2 \epsilon^2 / \delta^2)$. Choosing $\delta = 0.01$ thus completes the proof. QED.

## 4 thoughts on “Clustering and Sum of Squares Proofs, Part 4”

1. Nice (series of) post(s)!

Two minor details:
* the collision probability is not the $2$-norm of the probability distribution (or rather its pmf), but is the squared $2$-norm