This is part 5 of a continuing series on clustering, Gaussian mixtures, and Sum of Squares (SoS) proofs. If you have not read them yet, I recommend starting with Part 1, Part 2, Part 3, and Part 4. Also, if you find errors please mention them in the comments (or otherwise get in touch with me) and I will fix them ASAP.

Last time we finished our algorithm design and analysis for clustering one-dimensional Gaussian mixtures. Clustering points on $\mathbb{R}$ isn’t much of a challenge. In this post we will finally move to the high-dimensional setting. We will see that most of the ideas and arguments so far carry over nearly unchanged.

In keeping with the method we are advocating throughout the posts, the first thing to do is return to the non-SoS cluster identifiability proof from Part 1 and see how to generalize it to collections of points in dimension $d > 1$. We encourage the reader to review that proof.

## Generalizing the non-SoS Identifiability Proof

Our first step in designing that proof was to correctly choose a property of a collection of samples $X_1,\ldots,X_n$ from a Gaussian mixture which we would rely on for identifiability. The property we chose was that the points $X_1,\ldots,X_n$ break into clusters $S_1,\ldots,S_k$ of equal size so that each cluster has bounded empirical $t$-th moments and the means of the clusters are separated.

Here is our first attempt at a high-dimensional generalization: $X_1,\ldots,X_n$ break into $k$ clusters $S_1,\ldots,S_k$ of equal size $N = n/k$ such that

(1) for each cluster $S_i$ and $u \in \mathbb{R}^d$,

$\mathbb{E}_{j \sim S_i} \langle X_j - \mu_i, u \rangle^t \leq 2 \cdot t^{t/2}$

where $\mu_i = \mathbb{E}_{j \sim S_i} X_j$ is the empirical mean of cluster $S_i$, and

(2) those means are separated: $\|\mu_i - \mu_j\| \geq \Delta$ for $i \neq j$.

The first property says that every one-dimensional projection of every cluster has Gaussian $t$-th moments. The second should be familiar: we just replaced distance on the line with $\ell_2$ distance in $\mathbb{R}^d$.

The main steps in our one-dimensional non-SoS identifiability proofs were Fact 1 and Lemma 1. We will give an informal discussion on their high-dimensional generalizations; for the sake of brevity we will skip a formal non-SoS identifiability proof this time and go right to the SoS proof.

The key idea is: for any pair of sets $S,S' \subseteq [n]$ such that $\{X_j\}_{i \in S}$ and $\{X_j\}_{i \in S'}$ satisfy the empirical $t$-th moment bound (2) with respect to empirical means $\mu$ and $\mu'$ respectively, if $|\mu - \mu'| \geq \Delta$, then by the one-dimensional projections

$\left \{ \left \langle X_j - \mu, \frac{\mu - \mu'}{\| \mu - \mu' \|} \right \rangle \right \}_{j \in S} \text{ and } \left \{ \left \langle X_j - \mu, \frac{\mu - \mu'}{\| \mu - \mu' \|} \right \rangle \right \}_{j \in S'}$

are collections of numbers in $\mathbb{R}$ which satisfy the hypotheses of our one-dimensional identifiability arguments. All we did was choose the right one-dimensional projection of the high-dimensional points $\{ X_j \}$ to capture the separation between $\mu$ and $\mu'$.

(The reader is encouraged to work this out for themselves; it is easiest shift all the points so that without loss  of generality $\mu = 0$.)

## Obstacles to High-dimensional SoS Identifiability

We are going to face two main difficulties in turning the high-dimensional non-SoS identifiability proofs into SoS proofs.

(1) The one-dimensional projections above have $\|\mu - \mu'\|$ in the denominator, which is not a low-degree polynomial. This is easy to handle, and we have seen similar things before: we will just clear denominators of all inequalities in the proofs, and raise both sides to a high-enough power that we get polynomials.

(2) The high-dimensional $t$-th moment bound has a “for all $u$” quantification. That is, if $w = w_1,\ldots,w_n$ are indeterminates as in our one-dimensional proof, to be interpreted as the $0/1$ indicators for membership in a candidate cluster $T \subseteq [n]$, we would like to enforce

$\max_{u \in \mathbb{R}^d} \frac 1 N \sum_{i \in [n]} w_i \langle X_i - \mu, u \rangle^t \leq 2 \cdot t^{t/2} \cdot \|u\|^t$.

Because of the $\max$, this is not a polynomial inequality in $w$. This turns out to be a serious problem, and it will require us to strengthen our assumptions about the points $X_1,\ldots,X_n$.

In order for the SoS algorithm to successfully cluster $X_1,\ldots,X_n$, it needs to certify that each of the clusters it produces satisfies the $t$-th empirical moment property. Exactly why this is so, and whether it would also be true for non-SoS algorithms, is an interesting topic for discussion. But, for the algorithm to succeed, in particular a short certificate of the above inequality must exist! It is probably not true that such a certificate exists for an arbitrary collection of points in $\mathbb{R}^d$ satisfying the $t$-th empirical moment bound. Thus, we will add the existence of such a certificate as an assumption on our clusters.

When $Y_1,\ldots,Y_m$ are sufficiently-many samples from a $d$-dimensional Gaussian, the following matrix inequality is a short certificate of the $t$-th empirical moment property:

$\left \| \frac 1 N \sum_{i \in [m]} \left ( Y_i^{\otimes t/2} \right ) \left ( Y_i^{\otimes t/2} \right)^\top - \mathbb{E}_{Y \sim \mathcal{N}(0, I)} \left ( Y^{\otimes t/2} \right ) \left ( Y^{\otimes t/2} \right)^\top \right \|_F \leq 1$

where the norm $\| \cdot \|_F$ is Frobenious norm (spectral norm would have been sufficient but the inequality is easier to verify with Frobenious norm instead, and this just requires taking a few more samples). This inequality says that the empirical $t$-th moment matrix of $Y_1,\ldots,Y_m$ is close to its expectation in Frobenious norm. It certifies the $t$-th moment bound, because for any $u \in \mathbb{R}^d$, we would have

$\mathbb{E}_{i \sim [m]} \langle Y_i, u \rangle^t \leq \mathbb{E}_{Y \sim \mathcal{N}(0,I)} \langle Y, u \rangle^t + \|u\|^t \leq 2 \cdot t^{t/2} \cdot \|u\|^t$

by analyzing the quadratic forms of the empirical and true $t$-th moment matrices at the vector $u^{\otimes t/2}$.

In our high-dimensional SoS identifiability proof, we will remember the following things about the samples $X_1,\ldots,X_n$ from the underlying Gaussian mixture.

• $X_1,\ldots,X_n$ break into clusters $S_1,\ldots,S_k$, each of size $N = n/k$, so that if $\mu_i = \mathbb{E}_{j \sim S_i} X_j$ is the empirical mean of the $i$-th cluster, $\|\mu_i - \mu_j\| \geq \Delta$ if $i \neq j$, and
• For each cluster $S_i$:

$\left \| \mathbb{E}_{j \sim S_i} \left ( [X_j - \mu_i]^{\otimes t/2} \right ) \left ( [X_j - \mu_i]^{\otimes t/2} \right )^\top - \mathbb{E}_{Y \sim \mathcal{N}(0, I)} \left ( Y^{\otimes t/2} \right ) \left ( Y^{\otimes t/2} \right)^\top \right \|_F^2 \leq 1$.

## Algorithm for High-Dimensional Clustering

Now we are prepared to describe our high-dimensional algorithm for clustering Gaussian mixtures. For variety’s sake, this time we are going to describe the algorithm before the identifiability proof. We will finish up the high-dimensional identifiability proof, and hence the analysis of the following algorithm, in the next post, which will be the last in this series.

Given a collection of points $X_1,\ldots,X_n \in \mathbb{R}^d$, let $\mathcal{B}$ be the following set of polynomial inequalities in indeterminates $w_1,\ldots,w_n$:

$w_i^2 = w_i$ for all $i \in [n]$

$\sum_{i \in [n]} w_i = N$

$\left \| \frac 1 N \sum_{i \in [n]} \! w_i \left ( [X_i - \mu]^{\otimes t/2} \right ) \! \left ( [X_i - \mu]^{\otimes t/2} \right )^\top \! - \! \mathbb{E}_{Y \sim \mathcal{N}(0, I)} \left ( Y^{\otimes t/2} \right ) \! \left ( Y^{\otimes t/2} \right)^\top \right \|_F^2 \leq 1$

where as usual $\mu(w) = \frac 1 N \sum_{i \in [n]} w_i X_i$.

The algorithm is: given $X_1,\ldots,X_n, k, t$, find a degree-$O(t)$ pseudoexpectation $\tilde{\mathbb{E}}$ of minimal $\| \tilde{\mathbb{E}} ww^\top \|_F$ satisfying $\mathcal{B}$. Run the rounding procedure from the one-dimensional algorithm on $\tilde{\mathbb{E}} ww^\top$.