This is the 6th and final part of a series on clustering, Gaussian mixtures, and Sum of Squares (SoS) proofs. If you have not read them yet, I recommend starting with Part 1, Part 2, Part 3, Part 4, and Part 5. Also, if you find errors please mention them in the comments (or otherwise get in touch with me) and I will fix them ASAP.

Did you find this series helpful? Unhelpful? What could be done better? If there were to be another tutorial series on a Sum of Squares-related topic, what would you like to see? Let me know in the comments!

Last time we developed our high-dimensional clustering algorithm for Gaussian mixtures. In this post we will make our SoS identifiability proofs high-dimensional. In what is hopefully a familiar pattern by now, these identifiability proofs will also amount to an analysis of the clustering algorithm from part 5. At the end of this post is a modest discussion of some of the literature on the SoS proofs-to-algorithms method we have developed in this series.

## Setting

We decided to remember the following properties of a collection $X_1,\ldots,X_n$ of samples from a $d$-dimensional Gaussian mixture model.

(1) $X_1,\ldots,X_n$ break up into clusters $S_1,\ldots,S_k \subseteq [n]$ which partition $[n]$, and each $S_i$ has size exactly $|S_i| = N = n/k$.

(2) Each cluster has bounded moments, and this has a small certificate: for some $t \in \mathbb{N}$, if $\mu_i = \mathbb{E}_{j \sim S_i} X_j$ is the mean of the $i$-th cluster,

$\left \| \mathbb{E}_{j \sim S_i} \left ( [X_j - \mu_i]^{\otimes t/2} \right ) \left ( [X_j - \mu_i]^{\otimes t/2} \right )^\top - \mathbb{E}_{Y \sim \mathcal{N}(0, I)} \left ( Y^{\otimes t/2} \right ) \left ( Y^{\otimes t/2} \right)^\top \right \|_F^2 \leq 1$.

(3) The means are separated: if $i \neq j$ then $\|\mu_i - \mu_j\| \geq \Delta$.
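For intuition about condition (2), here is a small numerical sketch (an illustration with hypothetical parameters, not part of any proof): for $t = 2$ the moment matrix of a cluster is just its empirical covariance, the Gaussian term $\mathbb{E}\, Y Y^\top$ is the identity, and the condition reads $\|\hat{\Sigma} - I\|_F^2 \leq 1$, which holds comfortably once $N \gg d^2$.

```python
import numpy as np

# Condition (2) for t = 2: the cluster's moment matrix is its empirical
# covariance, and the Gaussian term E[Y Y^T] is the identity, so the bound
# reads ||Sigma_hat - I||_F^2 <= 1.  (Parameters chosen for illustration.)
rng = np.random.default_rng(0)
d, N = 5, 10_000                       # dimension, cluster size (hypothetical)
X = rng.standard_normal((N, d)) + 3.0  # one cluster: mean (3,...,3), identity covariance
mu = X.mean(axis=0)                    # empirical cluster mean
Sigma = (X - mu).T @ (X - mu) / N      # empirical covariance
err = np.linalg.norm(Sigma - np.eye(d), "fro") ** 2
assert err <= 1.0                      # tiny when N >> d^2
```

The squared Frobenius error scales roughly like $d^2/N$, which is one way to see why this certificate needs polynomially many samples in the dimension.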

Our identifiability proof follows the template we have laid out in the non-SoS proof from part 1 and the one-dimensional SoS proof from later parts. The first thing to do is prove a key fact about a pair of overlapping clusters with bounded moments.

## Fact 3

The next fact is the high-dimensional analogue of Fact 2. (We are not going to prove a high-dimensional analogue of Fact 1; the reader should at this point have all the tools to work it out for themselves.) We remind the reader of the key family of polynomial inequalities.

Given a collection of points $X_1,\ldots,X_n \in \mathbb{R}^d$, let $\mathcal{B}$ be the following set of polynomial inequalities in indeterminates $w_1,\ldots,w_n$:

$w_i^2 = w_i$ for all $i \in [n]$

$\sum_{i \in [n]} w_i = N$

$\left \| \frac 1 N \sum_{i \in [n]} \! w_i \left ( [X_i - \mu]^{\otimes t/2} \right ) \! \left ( [X_i - \mu]^{\otimes t/2} \right )^\top \! - \! \mathbb{E}_{Y \sim \mathcal{N}(0, I)} \left ( Y^{\otimes t/2} \right ) \! \left ( Y^{\otimes t/2} \right)^\top \right \|_F^2 \leq 1$

where as usual $\mu(w) = \frac 1 N \sum_{i \in [n]} w_i X_i$. As before, for a subset $S \subseteq [n]$, we use the notation $|T \cap S| = |T \cap S|(w) = \sum_{i \in S} w_i$.
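As a sanity check that $\mathcal{B}$ is satisfiable, one can verify numerically that the $0/1$ indicator of a true cluster satisfies all three constraints. A sketch for $t = 2$ (where the moment constraint is again a covariance bound; all parameters here are hypothetical choices):

```python
import numpy as np

# Check that w = indicator of a planted cluster satisfies B, for t = 2.
rng = np.random.default_rng(1)
d, N, k = 4, 5_000, 3
n = N * k
means = rng.normal(size=(k, d)) * 20.0  # well-separated (hypothetical) means
X = np.vstack([rng.standard_normal((N, d)) + m for m in means])
w = np.zeros(n)
w[:N] = 1.0                              # indicator of the first cluster

assert np.allclose(w ** 2, w)            # w_i^2 = w_i
assert w.sum() == N                      # sum_i w_i = N
mu = (w[:, None] * X).sum(axis=0) / N    # mu(w)
M = ((w[:, None] * (X - mu)).T @ (X - mu)) / N  # weighted second-moment matrix
assert np.linalg.norm(M - np.eye(d), "fro") ** 2 <= 1  # moment constraint
```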

Fact 3.
Let $X_1,\ldots,X_n \in \mathbb{R}^d$. Let $S \subseteq [n]$ have $|S| = N$; let $\mu_S = \mathbb{E}_{i \sim S} X_i$ be its mean. Let $t$ be a power of $2$. Suppose $S$ satisfies

$\left \| \mathbb{E}_{j \sim S} \left ( [X_j - \mu_S]^{\otimes t/2} \right ) \left ( [X_j - \mu_S]^{\otimes t/2} \right )^\top - \mathbb{E}_{Y \sim \mathcal{N}(0, I)} \left ( Y^{\otimes t/2} \right ) \left ( Y^{\otimes t/2} \right)^\top \right \|_F^2 \leq 1$.

Then

$\mathcal{B} \vdash_{O(t)} \left ( \frac{|T \cap S|}N \right )^{2t} \|\mu - \mu_S\|^{4t} \leq 2^{O(t)} \cdot t^{t} \cdot \left ( \frac{|T \cap S|}N \right ) ^{2(t-1)} \cdot \|\mu - \mu_S\|^{2t}.$

The main difference between the conclusions of Fact 3 and Fact 2 is that, compared to Fact 2, both sides of the inequality carry extra powers of $\|\mu - \mu_S\|$. As we will see, this is because an additional dependence on the vector-valued polynomial $(\mu - \mu_S)(w)$ is introduced by the need to project the high-dimensional vectors $X_i$ onto the line spanned by $\mu - \mu_S$. We have already tackled a situation where an SoS-provable inequality seemed to require cancelling terms on the left and right in order to be useful (i.e. when we used Fact 2 to prove Lemma 2), and similar ideas will work here.
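To see why a conclusion of this shape is useful, here is the informal cancellation it sets up (heuristic real-number algebra, not itself an SoS step): dividing the conclusion of Fact 3 through by $\left ( \frac{|T \cap S|}{N} \right )^{2(t-1)} \|\mu - \mu_S\|^{2t}$ leaves

$\left ( \frac{|T \cap S|}{N} \right )^{2} \|\mu - \mu_S\|^{2t} \leq 2^{O(t)} \cdot t^{t}, \qquad \text{i.e.} \qquad \|\mu - \mu_S\| \leq O(\sqrt t) \cdot \left ( \frac{N}{|T \cap S|} \right )^{1/t}.$

So whenever $T$ captures a constant fraction of $S$, the means $\mu$ and $\mu_S$ must be $O(\sqrt t)$-close.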

The main innovation in proving Fact 3 is the use of the $t$-th moment inequality in $\mathcal{B}$. Other than that, we follow the proof of Fact 2 almost line by line. The main proposition needed to use the $t$-th moment inequality is:

Proposition. $\vdash_t \mathbb{E}_{Y \sim \mathcal{N}(0,I)} \langle Y, u \rangle^t \leq t^{t/2} \cdot \|u\|^t$.

Proof of Proposition.

Expanding the polynomial $\mathbb{E}_{Y \sim \mathcal{N}(0,I)}\langle Y, u \rangle^t$ we get

$\mathbb{E}_{Y \sim \mathcal{N}(0,I)}\langle Y, u \rangle^t = \sum_{|\alpha| = t, \alpha \text{ even}} u^\alpha \cdot \mathbb{E} Y^\alpha$

where $\alpha$ is a multi-index over $[d]$ and “even” means that every index in $\alpha$ occurs with even multiplicity. (Other terms vanish by symmetry of $Y$.) Since $\alpha$ is always even in the sum, the monomial $u^\alpha$ is a square, and $\mathbb{E} Y^\alpha \leq t^{t/2}$ by standard properties of Gaussian moments. Hence,

$\vdash_t \sum_{|\alpha| = t, \alpha \text{ even}} u^\alpha \cdot \mathbb{E} Y^\alpha \leq t^{t/2} \cdot \sum_{|\alpha| = t, \alpha \text{ even}} u^\alpha \leq t^{t/2} \cdot \| u \|^t$.

QED.
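The Proposition can be sanity-checked numerically: $\langle Y, u \rangle$ is distributed as $\mathcal{N}(0, \|u\|^2)$, so $\mathbb{E} \langle Y, u \rangle^t = (t-1)!! \cdot \|u\|^t$ for even $t$, and the claim reduces to the scalar fact $(t-1)!! \leq t^{t/2}$. A quick check (illustrative, not part of the SoS proof):

```python
from math import prod

# <Y,u> ~ N(0, ||u||^2), so E <Y,u>^t = (t-1)!! * ||u||^t for even t.
# The Proposition thus reduces to the scalar inequality (t-1)!! <= t^(t/2).
def double_factorial(m: int) -> int:
    """Product m * (m-2) * (m-4) * ... down to 1 (or 2)."""
    return prod(range(m, 0, -2))

for t in [2, 4, 8, 16]:
    assert double_factorial(t - 1) <= t ** (t / 2)
```

For example, at $t = 8$ the comparison is $7!! = 105$ versus $8^4 = 4096$.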

Proof of Fact 3.

As usual, we write things out in terms of $X_1,\ldots,X_n$, then apply Holder’s inequality and the triangle inequality. First of all,

$\left ( \sum_{i \in S} w_i \right )^t \|\mu - \mu_S\|^{2t} = \left ( \sum_{i \in S} w_i \langle \mu - \mu_S, \mu - \mu_S \rangle \right )^t = \left ( \sum_{i \in S} w_i \langle (\mu - X_i) - (\mu_S - X_i), \mu - \mu_S \rangle \right )^t.$

By SoS Holder’s inequality, we get

$\mathcal{B} \vdash_{O(t)} \left ( \sum_{i \in S} w_i \right )^t \|\mu - \mu_S\|^{2t} \leq \left ( \sum_{i \in S} w_i \right )^{t-1} \cdot \sum_{i \in S} w_i \langle (\mu - X_i) - (\mu_S - X_i), \mu - \mu_S \rangle^t.$

By the same reasoning as in Fact 2, using $\vdash_t (a - b)^t \leq 2^t (a^t + b^t)$ and $w_i = w_i^2 \leq 1$ we get

$\mathcal{B} \vdash_{O(t)} \left ( \sum_{i \in S} w_i \right )^t \|\mu - \mu_S\|^{2t} \leq 2^t \cdot \left ( \sum_{i \in S} w_i \right )^{t-1} \cdot \left [ \sum_{i \in [n]} w_i \langle X_i - \mu, \mu - \mu_S \rangle^t + \sum_{i \in S} \langle X_i - \mu_S, \mu - \mu_S \rangle^t \right ].$

By our usual squaring and use of $(a+b)^2 \leq 2 (a^2 + b^2)$, we also get

$\mathcal{B} \vdash_{O(t)} \left ( \sum_{i \in S} w_i \right )^{2t} \|\mu - \mu_S\|^{4t} \leq 2^{O(t)} \cdot \left ( \sum_{i \in S} w_i \right )^{2(t-1)} \cdot \left [ \left ( \sum_{i \in [n]} w_i \langle X_i - \mu, \mu - \mu_S \rangle^t \right )^2 + \left ( \sum_{i \in S} \langle X_i - \mu_S, \mu - \mu_S \rangle^t \right )^2 \right ].$

(We want both sides to be squared so that we are set up to eventually use SoS Cauchy-Schwarz.) We are left with the last two terms, which are $t$-th moments in the direction $\mu - \mu_S$. If we knew

$\mathcal{B} \vdash_{O(t)} \left ( \sum_{i \in [n]} w_i \langle X_i - \mu, \mu - \mu_S \rangle^t \right )^2 \leq 2^{O(t)} t^{t} \cdot \|\mu - \mu_S\|^{2t} \cdot N^2$

and similarly

$\mathcal{B} \vdash_{O(t)} \left ( \sum_{i \in S} \langle X_i - \mu_S, \mu - \mu_S \rangle^t \right )^2 \leq 2^{O(t)} t^{t} \cdot \|\mu - \mu_S\|^{2t} \cdot N^2$

then we would be done. We start with the second inequality. Up to the square and a factor of $N^2$, we write the polynomial on the LHS as

$\frac 1 N \sum_{i \in S} \langle X_i - \mu_S, \mu - \mu_S \rangle^t$

$= \mathbb{E}_{Y \sim \mathcal{N}(0,I)} \langle Y, \mu - \mu_S \rangle^t + \left ( \frac 1 N \sum_{i \in S} \langle X_i - \mu_S, \mu - \mu_S \rangle^t - \mathbb{E}_{Y \sim \mathcal{N}(0,I)} \langle Y, \mu - \mu_S \rangle^t \right )$

Squaring again as usual, it would be enough to bound both $\left ( \mathbb{E}_{Y \sim \mathcal{N}(0,I)} \langle Y, \mu - \mu_S \rangle^t \right )^2$ and  $\left ( \frac 1 N \sum_{i \in S} \langle X_i - \mu_S, \mu - \mu_S \rangle^t - \mathbb{E}_{Y \sim \mathcal{N}(0,I)} \langle Y, \mu - \mu_S \rangle^t \right )^2$. For the former, using the Proposition above we get

$\vdash_{2t} \left ( \mathbb{E}_{Y \sim \mathcal{N}(0,I)} \langle Y, \mu - \mu_S \rangle^t \right )^2 \leq 2^{O(t)} t^t \|\mu - \mu_S\|^{2t}$.

For the latter, notice that

$\left ( \frac 1 N \sum_{i \in S} \langle X_i - \mu_S, \mu - \mu_S \rangle^t - \mathbb{E}_{Y \sim \mathcal{N}(0,I)} \langle Y, \mu - \mu_S \rangle^t \right )^2 = \left \langle M, \left [ (\mu - \mu_S)^{\otimes t/2} \right ] \left [ (\mu - \mu_S)^{\otimes t/2} \right ]^\top \right \rangle^2$

where $M$ is the matrix

$M = \mathbb{E}_{j \sim S} \left ( [X_j - \mu_S]^{\otimes t/2} \right ) \left ( [X_j - \mu_S]^{\otimes t/2} \right )^\top - \mathbb{E}_{Y \sim \mathcal{N}(0, I)} \left ( Y^{\otimes t/2} \right ) \left ( Y^{\otimes t/2} \right)^\top$.

Hence by SoS Cauchy-Schwarz, we get

$\vdash_{O(t)} \left \langle M, \left [ (\mu - \mu_S)^{\otimes t/2} \right ] \left [ (\mu - \mu_S)^{\otimes t/2} \right ]^\top \right \rangle^2 \leq \|M\|_F^2 \cdot \|\mu - \mu_S\|^{2t} \leq \|\mu - \mu_S\|^{2t}$

Putting these together, we get

$\mathcal{B} \vdash_{O(t)} \left ( \sum_{i \in S} \langle X_i - \mu_S, \mu - \mu_S \rangle^t \right )^2 \leq 2^{O(t)} t^{t} \cdot \|\mu - \mu_S\|^{2t} \cdot N^2$,

the second of the inequalities we wanted. Proving the first one is similar, using the hypothesis

$\left \| \frac 1 N \sum_{i \in [n]} \! w_i \left ( [X_i - \mu]^{\otimes t/2} \right ) \! \left ( [X_i - \mu]^{\otimes t/2} \right )^\top \! - \! \mathbb{E}_{Y \sim \mathcal{N}(0, I)} \left ( Y^{\otimes t/2} \right ) \! \left ( Y^{\otimes t/2} \right)^\top \right \|_F^2 \leq 1$

in place of $\|M\|_F^2 \leq 1$. QED.
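The matrix identity at the heart of this proof is easy to sanity-check numerically for $t = 2$, where $M = \hat{\Sigma} - I$ and the identity reads $\frac 1 N \sum_{i \in S} \langle X_i - \mu_S, v \rangle^2 - \mathbb{E} \langle Y, v \rangle^2 = \langle M, v v^\top \rangle$. A sketch with hypothetical data:

```python
import numpy as np

# Sanity check of the identity behind the Cauchy-Schwarz step, at t = 2:
#   (1/N) sum_i <X_i - mu_S, v>^2 - E<Y, v>^2  =  <M, v v^T>,
# where M = Sigma_hat - I and E<Y, v>^2 = ||v||^2.
rng = np.random.default_rng(2)
d, N = 6, 500
X = rng.standard_normal((N, d))
mu_S = X.mean(axis=0)
Sigma = (X - mu_S).T @ (X - mu_S) / N
M = Sigma - np.eye(d)

v = rng.standard_normal(d)                      # plays the role of mu - mu_S
lhs = np.mean(((X - mu_S) @ v) ** 2) - v @ v    # E<Y,v>^2 = ||v||^2
rhs = np.sum(M * np.outer(v, v))                # <M, v v^T>
assert np.isclose(lhs, rhs)
```

The identity is exact as polynomials in $v$, which is why the check passes to machine precision rather than only approximately.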

## Lemma 3

The last thing is to use Fact 3 to prove Lemma 3, our high-dimensional SoS identifiability lemma. Predictably, it is almost identical to our previous proof of Lemma 2 using Fact 2.

Lemma 3.
Let $X_1,\ldots,X_n \in \mathbb{R}^d$. Let $S_1,\ldots,S_k$ be a partition of $[n]$ into $k$ pieces of size $N = n/k$ such that for each $S_i$, the collection of vectors $\{X_j\}_{j \in S_i}$ obeys the following moment bound:

$\left \| \mathbb{E}_{j \sim S_i} \left ( [X_j - \mu_i]^{\otimes t/2} \right ) \left ( [X_j - \mu_i]^{\otimes t/2} \right )^\top - \mathbb{E}_{Y \sim \mathcal{N}(0, I)} \left ( Y^{\otimes t/2} \right ) \left ( Y^{\otimes t/2} \right)^\top \right \|_F^2 \leq 1$.

where $\mu_i$ is the average $\mathbb{E}_{j \sim S_i} X_j$ and $t$ is some number in $\mathbb{N}$ which is a power of $2$. Let $\Delta > 0$ be such that $\| \mu_i - \mu_j \| \geq \Delta$ for every $i \neq j$.

Let $w_1,\ldots,w_n,\mu$ be indeterminates. Let $\mathcal{B}$ be the set of equations and inequalities defined above. Thinking of the variables $w_i$ as defining a set $T \subseteq [n]$ via its $0/1$ indicator, let $|T \cap S_j|$ be the formal expression $|T \cap S_j| = \sum_{i \in S_j} w_i.$ Let $\tilde{\mathbb{E}}$ be a degree $O(t)$ pseudoexpectation which satisfies $\mathcal{B}$. Then

$\tilde{\mathbb{E}} \left [ \sum_{i \in [k]} \left ( \frac{|T \cap S_i|}{N} \right )^2 \right ] \geq 1 - \frac{2^{O(t)} t^{t/2} k^2}{\Delta^t}.$

Proof of Lemma 3.
Let $\tilde{\mathbb{E}}$ satisfy $\mathcal{B}$. As in Lemmas 1 and 2, we will endeavor to bound $\tilde{\mathbb{E}} |T \cap S_i|^t |T \cap S_j|^t$ for each $i \neq j$.

By $\Delta$-separation,

$\vdash_{2t} \Delta^{2t} \leq \|\mu_i - \mu_j\|^{2t} = \|(\mu_i - \mu) - (\mu_j - \mu)\|^{2t} \leq 2^{2t} \left ( \|\mu_i - \mu\|^{2t} + \|\mu_j - \mu\|^{2t} \right),$

where we implicitly used the SoS triangle inequality $\vdash_t (a-b)^t \leq 2^{t} (a^t + b^t)$ on each coordinate of the vectors $\mu_i - \mu$ and $\mu_j - \mu$.

So,

$\tilde{\mathbb{E}} |T \cap S_i|^t |T \cap S_j|^t \leq \tilde{\mathbb{E}} \left [ \frac{2^{2t} \left ( \|\mu_i - \mu\|^{2t} + \|\mu_j - \mu\|^{2t} \right )}{\Delta^{2t}} |T \cap S_i|^t |T \cap S_j|^t \right ].$

Now we are going to bound $\tilde{\mathbb{E}} \|\mu_i - \mu\|^{2t} |T \cap S_i|^t |T \cap S_j|^t$. By symmetry the same argument will apply to the same expression with $i$ and $j$ exchanged.

We apply Fact 3:

$\tilde{\mathbb{E}} |T \cap S_i|^t |T \cap S_j|^t \|\mu_i - \mu\|^{2t} \leq 2^{O(t)} t^{t} \cdot N^2 \cdot \tilde{\mathbb{E}} |T \cap S_i|^{t-2} |T \cap S_j|^t \|\mu_i - \mu\|^t.$

Then we use pseudoexpectation Cauchy-Schwarz to get

$\tilde{\mathbb{E}} |T \cap S_i|^{t-2} |T \cap S_j|^t \|\mu_i - \mu\|^t \leq \left ( \tilde{\mathbb{E}} |T \cap S_i|^{t-4} |T \cap S_j|^t \right )^{1/2} \left ( \tilde{\mathbb{E}} |T \cap S_i|^t |T \cap S_j|^t \|\mu_i - \mu\|^{2t} \right )^{1/2}$.

Putting these two together and canceling $\left ( \tilde{\mathbb{E}} |T \cap S_i|^t |T \cap S_j|^t \|\mu_i - \mu\|^{2t} \right )^{1/2}$ we get

$\tilde{\mathbb{E}} |T \cap S_i|^t |T \cap S_j|^t \|\mu_i - \mu\|^{2t} \leq 2^{O(t)} t^{2t} \cdot N^4 \tilde{\mathbb{E}} |T \cap S_i|^{t-4} |T \cap S_j|^t$.

Also, clearly $\mathcal{B} \vdash_{O(1)} |T \cap S_j| \leq N$. So we get

$\tilde{\mathbb{E}} |T \cap S_i|^t |T \cap S_j|^t \|\mu_i - \mu\|^{2t} \leq 2^{O(t)} t^{2t} \cdot N^8 \cdot \tilde{\mathbb{E}} |T \cap S_i|^{t-4} |T \cap S_j|^{t-4}.$

We started out with the goal of bounding $\tilde{\mathbb{E}} |T \cap S_i|^t |T \cap S_j|^t$. We have found that

$\tilde{\mathbb{E}} |T \cap S_i|^t |T \cap S_j|^t \leq \frac{2^{O(t)} t^{2t} N^8}{\Delta^{4t}} \tilde{\mathbb{E}} |T \cap S_i|^{t-4} |T \cap S_j|^{t-4}.$

Applying pseudoexpectation Holder’s inequality, we find that

$\tilde{\mathbb{E}} |T \cap S_i|^{t-4} |T \cap S_j|^{t-4} \leq \left ( \tilde{\mathbb{E}} |T \cap S_i|^t |T \cap S_j|^t \right )^{(t-4)/t}$.

Rearranging things, we get

$\left ( \tilde{\mathbb{E}} |T \cap S_i|^t |T \cap S_j|^t \right )^{1/t} \leq \frac{2^{O(t)} t^{t/2} N^2}{\Delta^{t}}.$
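The rearrangement is the standard one: writing $P = \tilde{\mathbb{E}} |T \cap S_i|^t |T \cap S_j|^t$ and using the Holder bound in the form $\tilde{\mathbb{E}} |T \cap S_i|^{t-4} |T \cap S_j|^{t-4} \leq P^{(t-4)/t}$, the chain of inequalities reads

$P \leq \frac{2^{O(t)} t^{2t} N^8}{\Delta^{4t}} \cdot P^{(t-4)/t} \quad \Longrightarrow \quad P^{4/t} \leq \frac{2^{O(t)} t^{2t} N^8}{\Delta^{4t}} \quad \Longrightarrow \quad P^{1/t} \leq \frac{2^{O(t)} t^{t/2} N^2}{\Delta^{t}},$

where the last step takes fourth roots.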

Now proceeding as in the proof of Lemma 2, we know

$\tilde{\mathbb{E}} \left ( \sum_{i \in [k]} \frac{|T \cap S_i|}{N} \right )^2 = 1$

so

$\tilde{\mathbb{E}} \sum_{i \in [k]} \left ( \frac{|T \cap S_i|}{N} \right )^2 \geq 1 - \frac{2^{O(t)} t^{t/2} k^2}{\Delta^{t}}.$

QED.

Remark: We cheated ever so slightly in this proof: we did not state a version of pseudoexpectation Holder’s inequality which allows the exponent $(t-4)/t$, just one which allows $(t-2)/t$. The correct version can be found in Lemma A.4 of this paper. That inequality will work only when $t$ is large enough; I think $t \geq 4$ suffices. To handle smaller $t$ one probably must remove the square from both sides of Fact 3, which will require a hypothesis which does not use the squared Frobenius norm. This is possible; see e.g. my paper with Jerry Li.

## Putting Things Together

The conclusion of Lemma 3 is almost identical to the conclusion of Lemma 2, and so the rest of the analysis of the high-dimensional clustering algorithm proceeds exactly as in the one-dimensional case. At the end, to show that the clustering algorithm works with high probability to cluster samples from a separated Gaussian mixture model, one uses straightforward concentration of measure to show that if $X_1,\ldots,X_n$ are enough samples from a $\Delta$-separated mixture model, then they satisfy the hypotheses of Lemma 3 with high probability. This concludes our “proof” of the main theorem from way back in part 1.
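As a small illustration of this concentration step (a sketch with hypothetical parameters, not a proof), samples from a separated spherical mixture do empirically produce clusters whose means remain nearly $\Delta$-separated and whose covariances are Frobenius-close to the identity, which is all the $t = 2$ version of Lemma 3 asks for:

```python
import numpy as np
from itertools import combinations

# Concentration sketch: draw N samples per component of a Delta-separated
# spherical mixture and check the (t = 2) hypotheses of Lemma 3.
# All parameters here are hypothetical choices for illustration.
rng = np.random.default_rng(3)
d, N, k, Delta = 8, 4_000, 3, 10.0
means = np.zeros((k, d))
means[:, 0] = Delta * np.arange(k)          # collinear means, spacing Delta
clusters = [rng.standard_normal((N, d)) + m for m in means]

# Separation: empirical means stay nearly Delta-separated.
emp = [C.mean(axis=0) for C in clusters]
for i, j in combinations(range(k), 2):
    assert np.linalg.norm(emp[i] - emp[j]) >= 0.9 * Delta

# Moment bound: each cluster's covariance is Frobenius-close to I.
for C, m in zip(clusters, emp):
    Sigma = (C - m).T @ (C - m) / N
    assert np.linalg.norm(Sigma - np.eye(d), "fro") ** 2 <= 1
```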

## Related literature

The reader interested in further applications of the Sum of Squares method to unsupervised learning problems may consult some of the following works.

Though we have not seen it in these posts, often the SoS method overlaps with questions about tensor decomposition. For some examples in this direction, see the [Barak, Kelner, Steurer] dictionary learning paper above, as well as

The SoS method can often be used to design algorithms which have more practical running times than the large SDPs we have discussed here. (This often requires further ideas, to avoid solving large semidefinite programs.) See e.g.:

Another common tool in constructing SoS proofs for unsupervised learning problems, which we did not see here, is the use of concentration bounds for random matrices whose entries are low-degree polynomials in independent random variables. For some examples along these lines, see

as well as several of the previous papers.
