
Clustering and Sum of Squares Proofs, Part 6

December 21, 2017

This is the 6th and final part of a series on clustering, Gaussian mixtures, and Sum of Squares (SoS) proofs. If you have not read them yet, I recommend starting with Part 1 and continuing with Part 2, Part 3, Part 4, and Part 5. Also, if you find errors please mention them in the comments (or otherwise get in touch with me) and I will fix them ASAP.

Did you find this series helpful? Unhelpful? What could be done better? If there were to be another tutorial series on a Sum of Squares-related topic, what would you like to see? Let me know in the comments!

Last time we developed our high-dimensional clustering algorithm for Gaussian mixtures. In this post we will make our SoS identifiability proofs high-dimensional. In what is hopefully a familiar pattern by now, these identifiability proofs will also amount to an analysis of the clustering algorithm from part 5. At the end of this post is a modest discussion of some of the literature on the SoS proofs-to-algorithms method we have developed in this series.

Setting

We decided to remember the following properties of a collection X_1,\ldots,X_n of samples from a d-dimensional Gaussian mixture model.

(1) X_1,\ldots,X_n break up into clusters S_1,\ldots,S_k \subseteq [n] which partition [n], and each S_i has size exactly |S_i| = N = n/k.

(2) Each cluster has bounded moments, and this has a small certificate: for some t \in \mathbb{N}, if \mu_i = \mathbb{E}_{j \sim S_i} X_j is the mean of the i-th cluster,

\left \| \mathbb{E}_{j \sim S_i} \left ( [X_j - \mu_i]^{\otimes t/2} \right ) \left ( [X_j - \mu_i]^{\otimes t/2} \right )^\top - \mathbb{E}_{Y \sim \mathcal{N}(0, I)} \left ( Y^{\otimes t/2} \right ) \left ( Y^{\otimes t/2} \right)^\top \right \|_F^2 \leq 1.

(3) The means are separated: if i \neq j then \|\mu_i - \mu_j\| \geq \Delta.
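To make property (2) concrete, here is a minimal numerical sketch of the certificate quantity in the smallest interesting case t = 2 (Python with numpy; the dimension and sample size are arbitrary choices of mine, not from the post). For t = 2 the tensor power [X_j - \mu_i]^{\otimes t/2} is just the vector X_j - \mu_i, so the matrix in (2) is the empirical covariance and the Gaussian target \mathbb{E}_Y Y Y^\top is the identity.

```python
import numpy as np

rng = np.random.default_rng(0)

d, N = 10, 1000  # dimension and cluster size: arbitrary illustrative choices
mu_true = rng.normal(size=d)
X = mu_true + rng.normal(size=(N, d))  # one cluster: N samples from N(mu_true, I_d)

# For t = 2 the matrix in property (2) is the empirical covariance around the
# empirical mean, and the Gaussian target E[Y Y^T] is the identity.
mu_hat = X.mean(axis=0)
centered = X - mu_hat
emp = centered.T @ centered / N

frob_sq = np.sum((emp - np.eye(d)) ** 2)
print(f"||empirical - target||_F^2 = {frob_sq:.3f}  (property (2) asks for <= 1)")
```

For larger t one would flatten [X_j - \mu_i]^{\otimes t/2} into a vector of dimension d^{t/2} and run the same computation, at a correspondingly larger cost.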

Our identifiability proof follows the template we have laid out in the non-SoS proof from part 1 and the one-dimensional SoS proof from later parts. The first thing to do is prove a key fact about a pair of overlapping clusters with bounded moments.

Fact 3

The next fact is the high-dimensional analogue of Fact 2. (We are not going to prove a high-dimensional analogue of Fact 1; the reader should at this point have all the tools to work it out for themselves.) We remind the reader of the key family of polynomial inequalities.

Given a collection of points X_1,\ldots,X_n \in \mathbb{R}^d, let \mathcal{B} be the following set of polynomial inequalities in indeterminates w_1,\ldots,w_n:

w_i^2 = w_i for all i \in [n]

\sum_{i \in [n]} w_i = N

\left \| \frac 1 N \sum_{i \in [n]}  \! w_i \left ( [X_i - \mu]^{\otimes t/2} \right ) \!  \left ( [X_i - \mu]^{\otimes t/2} \right )^\top  \! - \! \mathbb{E}_{Y \sim \mathcal{N}(0, I)} \left ( Y^{\otimes t/2} \right ) \! \left ( Y^{\otimes t/2} \right)^\top \right \|_F^2 \leq 1

where as usual \mu(w) = \frac 1 N \sum_{i \in [n]} w_i X_i. As before, for a subset S \subseteq [n], we use the notation |T \cap S| = |T \cap S|(w) = \sum_{i \in S} w_i.
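As a sanity check on what \mathcal{B} asks of an integral solution, here is a hedged sketch (numpy again, specializing to t = 2; the function name and parameters are mine, not from the post) that tests whether a given 0/1 vector w satisfies the three constraints. The indicator of a true, well-behaved cluster should pass.

```python
import numpy as np

def satisfies_B(w, X, N, tol=1e-9):
    """Check the constraints of B for a 0/1 vector w, specialized to t = 2.

    w : (n,) 0/1 array, X : (n, d) data matrix, N : intended cluster size.
    """
    n, d = X.shape
    # w_i^2 = w_i for all i (booleanity)
    if not np.allclose(w * w, w, atol=tol):
        return False
    # sum_i w_i = N
    if abs(w.sum() - N) > tol:
        return False
    # moment constraint: for t = 2 the tensor power is trivial, so this
    # compares the w-weighted covariance around mu(w) to the identity
    mu = (w[:, None] * X).sum(axis=0) / N
    C = (w[:, None] * (X - mu)).T @ (X - mu) / N
    return np.sum((C - np.eye(d)) ** 2) <= 1.0

# Example: two well-separated clusters; the indicator of the first should pass.
rng = np.random.default_rng(1)
d, N = 10, 1000
X = np.vstack([rng.normal(size=(N, d)),             # cluster around 0
               10.0 + rng.normal(size=(N, d))])     # cluster around (10, ..., 10)
w = np.concatenate([np.ones(N), np.zeros(N)])
print(satisfies_B(w, X, N))   # expect True
```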

Fact 3.
Let X_1,\ldots,X_n \in \mathbb{R}^d. Let S \subseteq [n] have |S| = N; let \mu_S = \mathbb{E}_{i \sim S} X_i be its mean. Let t be a power of 2. Suppose S satisfies

\left \| \mathbb{E}_{j \sim S} \left ( [X_j - \mu_S]^{\otimes t/2} \right ) \left ( [X_j - \mu_S]^{\otimes t/2} \right )^\top - \mathbb{E}_{Y \sim \mathcal{N}(0, I)} \left ( Y^{\otimes t/2} \right ) \left ( Y^{\otimes t/2} \right)^\top \right \|_F^2 \leq 1.

Then

\mathcal{B} \vdash_{O(t)} \left (  \frac{|T \cap S|}N \right )^{2t} \|\mu - \mu_S\|^{4t} \leq 2^{O(t)} \cdot t^{t} \cdot \left ( \frac{|T \cap S|}N \right ) ^{2(t-1)} \cdot \|\mu - \mu_S\|^{2t}.

The main difference between the conclusions of Fact 3 and Fact 2 is that both sides of the inequality carry extra factors of \|\mu - \mu_S\|, as compared to Fact 2. As we will see, this is because an additional dependence on the vector-valued polynomial (\mu - \mu_S)(w) is introduced by the need to project the high-dimensional vectors X_i onto the line \mu - \mu_S. We have already tackled a situation where an SoS-provable inequality seemed to require cancelling terms on the left and right in order to be useful (i.e. when we used Fact 2 to prove Lemma 2), and similar ideas will work here.

The main innovation in proving Fact 3 is the use of the t-th moment inequality in \mathcal{B}. Other than that, we follow the proof of Fact 2 almost line by line. The main proposition needed to use the t-th moment inequality is:

Proposition. \vdash_t \mathbb{E}_{Y \sim \mathcal{N}(0,I)} \langle Y, u \rangle^t \leq t^{t/2} \cdot \|u\|^t.

Proof of Proposition.

Expanding the polynomial \mathbb{E}_{Y \sim \mathcal{N}(0,I)}\langle Y, u \rangle^t we get

\mathbb{E}_{Y \sim \mathcal{N}(0,I)}\langle Y, u \rangle^t = \sum_{|\alpha| = t, \alpha \text{ even}} u^\alpha \cdot \mathbb{E} Y^\alpha

where \alpha is a multi-index over [d] and “even” means that every index in \alpha occurs with even multiplicity; here we absorb the multinomial coefficients from the expansion into \mathbb{E} Y^\alpha. (Other terms vanish by symmetry of Y.) Since \alpha is always even in the sum, the monomial u^\alpha is a square, and \mathbb{E} Y^\alpha \leq t^{t/2} by standard properties of Gaussian moments. Hence,

\vdash_t \sum_{|\alpha| = t, \alpha \text{ even}} u^\alpha \cdot \mathbb{E} Y^\alpha \leq t^{t/2} \cdot \sum_{|\alpha| = t, \alpha \text{ even}} u^\alpha = t^{t/2} \cdot \| u \|^t.

QED.
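The scalar content of the Proposition is easy to check numerically: \langle Y, u \rangle is a one-dimensional Gaussian with variance \|u\|^2, so \mathbb{E} \langle Y, u \rangle^t = (t-1)!! \cdot \|u\|^t, and (t-1)!! \leq t^{t/2}. A quick Monte Carlo comparison (numpy; the sample count and dimension are arbitrary, and the empirical column carries sampling error):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5
u = rng.normal(size=d)
Y = rng.normal(size=(1_000_000, d))
proj = Y @ u   # <Y, u> ~ N(0, ||u||^2)

for t in (2, 4, 6, 8):
    emp = np.mean(proj ** t)
    double_fact = np.prod(np.arange(t - 1, 0, -2))   # (t-1)!! = exact Gaussian moment
    exact = double_fact * np.linalg.norm(u) ** t
    bound = t ** (t / 2) * np.linalg.norm(u) ** t
    print(f"t={t}: empirical {emp:.4g}, exact {exact:.4g}, "
          f"bound t^(t/2)||u||^t = {bound:.4g}")
```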

Proof of Fact 3.

As usual, we write things out in terms of X_1,\ldots,X_n, then apply Hölder’s inequality and the triangle inequality. First of all,

\left ( \sum_{i \in S} w_i \right )^t \|\mu - \mu_S\|^{2t} = \left ( \sum_{i \in S} w_i \langle \mu - \mu_S, \mu - \mu_S \rangle \right )^t = \left ( \sum_{i \in S} w_i \langle (\mu - X_i) - (\mu_S - X_i), \mu - \mu_S \rangle \right )^t.

By SoS Hölder’s inequality, we get

\mathcal{B} \vdash_{O(t)} \left ( \sum_{i \in S} w_i \right )^t \|\mu - \mu_S\|^{2t} \leq \left ( \sum_{i \in S} w_i \right )^{t-1} \cdot \sum_{i \in S} w_i \langle (\mu - X_i) - (\mu_S - X_i), \mu - \mu_S \rangle^t.

By the same reasoning as in Fact 2, using \vdash_t (a - b)^t \leq 2^t (a^t + b^t) and w_i = w_i^2 \leq 1 we get

\mathcal{B} \vdash_{O(t)} \left ( \sum_{i \in S} w_i \right )^t \|\mu - \mu_S\|^{2t} \leq 2^t \cdot \left ( \sum_{i \in S} w_i \right )^{t-1} \cdot \left [ \sum_{i \in [n]} w_i \langle X_i - \mu, \mu - \mu_S \rangle^t + \sum_{i \in S} \langle X_i - \mu_S, \mu - \mu_S \rangle^t \right ].

By our usual squaring and use of (a+b)^2 \leq 2 (a^2 + b^2), we also get

\mathcal{B} \vdash_{O(t)} \left ( \sum_{i \in S} w_i \right )^{2t} \|\mu - \mu_S\|^{4t} \leq 2^{O(t)} \cdot \left ( \sum_{i \in S} w_i \right )^{2(t-1)} \cdot \left [ \left ( \sum_{i \in [n]} w_i \langle X_i - \mu, \mu - \mu_S \rangle^t \right )^2 + \left ( \sum_{i \in S} \langle X_i - \mu_S, \mu - \mu_S \rangle^t \right )^2 \right ].

(We want both sides to be squared so that we are set up to eventually use SoS Cauchy-Schwarz.) We are left with the last two terms, which are t-th moments in the direction \mu - \mu_S. If we knew

\mathcal{B} \vdash_{O(t)} \left ( \sum_{i \in [n]} w_i \langle X_i - \mu, \mu - \mu_S \rangle^t \right )^2 \leq 2^{O(t)} t^{t} \cdot \|\mu - \mu_S\|^{2t} \cdot N^2

and similarly

\mathcal{B} \vdash_{O(t)} \left ( \sum_{i \in S} \langle X_i - \mu_S, \mu - \mu_S \rangle^t \right )^2 \leq 2^{O(t)} t^{t} \cdot \|\mu - \mu_S\|^{2t} \cdot N^2

then we would be done. We start with the second inequality. Dividing by N, we write the polynomial inside the square on the LHS as

\frac 1 N \sum_{i \in S} \langle X_i - \mu_S, \mu - \mu_S \rangle^t

= \mathbb{E}_{Y \sim \mathcal{N}(0,I)} \langle Y, \mu - \mu_S \rangle^t + \left ( \frac 1 N \sum_{i \in S} \langle X_i - \mu_S, \mu - \mu_S \rangle^t  - \mathbb{E}_{Y \sim \mathcal{N}(0,I)} \langle Y, \mu - \mu_S \rangle^t \right )

Squaring again as usual, it would be enough to bound both \left ( \mathbb{E}_{Y \sim \mathcal{N}(0,I)} \langle Y, \mu - \mu_S \rangle^t \right )^2 and  \left ( \frac 1 N \sum_{i \in S} \langle X_i - \mu_S, \mu - \mu_S \rangle^t  - \mathbb{E}_{Y \sim \mathcal{N}(0,I)} \langle Y, \mu - \mu_S \rangle^t \right )^2. For the former, using the Proposition above we get

\vdash_{2t} \left ( \mathbb{E}_{Y \sim \mathcal{N}(0,I)} \langle Y, \mu - \mu_S \rangle^t \right )^2 \leq 2^{O(t)} t^t \|\mu - \mu_S\|^{2t}.

For the latter, notice that

\left ( \frac 1 N \sum_{i \in S} \langle X_i - \mu_S, \mu - \mu_S \rangle^t  - \mathbb{E}_{Y \sim \mathcal{N}(0,I)} \langle Y, \mu - \mu_S \rangle^t \right )^2 = \left \langle M, \left [  (\mu - \mu_S)^{\otimes t/2} \right ] \left [ (\mu - \mu_S)^{\otimes t/2} \right ]^\top \right \rangle^2

where M is the matrix

M = \mathbb{E}_{j \sim S} \left ( [X_j - \mu_S]^{\otimes t/2} \right ) \left ( [X_j - \mu_S]^{\otimes t/2} \right )^\top - \mathbb{E}_{Y \sim \mathcal{N}(0, I)} \left ( Y^{\otimes t/2} \right ) \left ( Y^{\otimes t/2} \right)^\top.

Hence by SoS Cauchy-Schwarz, we get

\vdash_{O(t)} \left \langle M, \left [  (\mu - \mu_S)^{\otimes t/2} \right ] \left [ (\mu - \mu_S)^{\otimes t/2} \right ]^\top \right \rangle^2 \leq \|M\|_F^2 \cdot \|\mu - \mu_S\|^{2t} \leq \|\mu - \mu_S\|^{2t}.

Putting these together, we get

\mathcal{B} \vdash_{O(t)} \left ( \sum_{i \in S} \langle X_i - \mu_S, \mu - \mu_S \rangle^t \right )^2 \leq 2^{O(t)} t^{t} \cdot \|\mu - \mu_S\|^{2t} \cdot N^2,

the second of the inequalities we wanted. Proving the first one is similar, using the hypothesis

\left \| \frac 1 N \sum_{i \in [n]}  \! w_i \left ( [X_i - \mu]^{\otimes t/2} \right ) \!  \left ( [X_i - \mu]^{\otimes t/2} \right )^\top  \! - \! \mathbb{E}_{Y \sim \mathcal{N}(0, I)} \left ( Y^{\otimes t/2} \right ) \! \left ( Y^{\otimes t/2} \right)^\top \right \|_F^2 \leq 1

in place of \|M\|_F^2 \leq 1. QED.
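The Cauchy-Schwarz step above is just the inner-product inequality \langle M, vv^\top \rangle^2 \leq \|M\|_F^2 \cdot \|vv^\top\|_F^2 = \|M\|_F^2 \cdot \|v\|^4, applied with v = (\mu - \mu_S)^{\otimes t/2}, so that \|v\|^4 = \|\mu - \mu_S\|^{2t}. A direct numerical check of the scalar fact on random instances (numpy; sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
for _ in range(5):
    D = 20
    M = rng.normal(size=(D, D))
    v = rng.normal(size=D)
    lhs = np.sum(M * np.outer(v, v)) ** 2          # <M, v v^T>^2
    rhs = np.sum(M ** 2) * np.linalg.norm(v) ** 4  # ||M||_F^2 * ||v||^4
    assert lhs <= rhs + 1e-9
    print(f"<M, vv^T>^2 = {lhs:.4g}  <=  ||M||_F^2 ||v||^4 = {rhs:.4g}")
```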

Lemma 3

The last thing is to use Fact 3 to prove Lemma 3, our high-dimensional SoS identifiability lemma. Predictably, it is almost identical to our previous proof of Lemma 2 using Fact 2.

Lemma 3.
Let X_1,\ldots,X_n \in \mathbb{R}^d. Let S_1,\ldots,S_k be a partition of [n] into k pieces of size N = n/k such that for each S_i, the collection of vectors \{X_j\}_{j \in S_i} obeys the following moment bound:

\left \| \mathbb{E}_{j \sim S_i} \left ( [X_j - \mu_i]^{\otimes t/2} \right ) \left ( [X_j - \mu_i]^{\otimes t/2} \right )^\top - \mathbb{E}_{Y \sim \mathcal{N}(0, I)} \left ( Y^{\otimes t/2} \right ) \left ( Y^{\otimes t/2} \right)^\top \right \|_F^2 \leq 1,

where \mu_i is the average \mathbb{E}_{j \sim S_i} X_j and t \in \mathbb{N} is a power of 2. Let \Delta > 0 be such that \| \mu_i - \mu_j \| \geq \Delta for every i \neq j.

Let w_1,\ldots,w_n,\mu be indeterminates. Let \mathcal{B} be the set of equations and inequalities defined above. Thinking of the variables w_i as defining a set T \subseteq [n] via its 0/1 indicator, let |T \cap S_j| be the formal expression |T \cap S_j| = \sum_{i \in S_j} w_i. Let \tilde{\mathbb{E}} be a degree O(t) pseudoexpectation which satisfies \mathcal{B}. Then

\tilde{\mathbb{E}} \left [ \sum_{i \in [k]} \left ( \frac{|T \cap S_i|}{N} \right )^2 \right ] \geq 1 - \frac{2^{O(t)} t^{t/2} k^2}{\Delta^t}.

Proof of Lemma 3.
Let \tilde{\mathbb{E}} satisfy \mathcal{B}. As in Lemmas 1 and 2, we will endeavor to bound \tilde{\mathbb{E}} |T \cap S_i|^{2t} |T \cap S_j|^{2t} for each i \neq j; at the end we will convert this into a bound on \tilde{\mathbb{E}} |T \cap S_i| |T \cap S_j|. (We work with 2t-th powers because the conclusion of Fact 3 has both sides squared.)

By \Delta-separation,

\vdash_{O(t)} \Delta^{4t} \leq \|\mu_i - \mu_j\|^{4t} = \|(\mu_i - \mu) - (\mu_j - \mu)\|^{4t} \leq 2^{4t} \left (  \|\mu_i - \mu\|^{4t} + \|\mu_j - \mu\|^{4t} \right),

where we used the SoS triangle inequality \vdash_{2t} (a+b)^{2t} \leq 2^{2t} (a^{2t} + b^{2t}) with a = \|\mu_i - \mu\|^2 and b = \|\mu_j - \mu\|^2, together with \|x - y\|^2 \leq 2\|x\|^2 + 2\|y\|^2 applied to the vectors \mu_i - \mu and \mu_j - \mu.

So,

\tilde{\mathbb{E}} |T \cap S_i|^{2t} |T \cap S_j|^{2t} \leq \tilde{\mathbb{E}} \left [ \frac{2^{4t} \left ( \|\mu_i - \mu\|^{4t} + \|\mu_j - \mu\|^{4t} \right )}{\Delta^{4t}}  |T \cap S_i|^{2t} |T \cap S_j|^{2t} \right ].

Now we are going to bound \tilde{\mathbb{E}} \|\mu_i - \mu\|^{4t} |T \cap S_i|^{2t} |T \cap S_j|^{2t}. By symmetry the same argument will apply to the same expression with i and j exchanged.

We apply Fact 3, multiplying both sides of its conclusion by N^{2t} and by the square |T \cap S_j|^{2t}:

\tilde{\mathbb{E}} |T \cap S_i|^{2t} |T \cap S_j|^{2t} \|\mu_i - \mu\|^{4t} \leq 2^{O(t)} t^{t} \cdot N^2 \cdot \tilde{\mathbb{E}} |T \cap S_i|^{2t-2} |T \cap S_j|^{2t} \|\mu_i - \mu\|^{2t}.

Then we use pseudoexpectation Cauchy-Schwarz to get

\tilde{\mathbb{E}} |T \cap S_i|^{2t-2} |T \cap S_j|^{2t} \|\mu_i - \mu\|^{2t} \leq \left ( \tilde{\mathbb{E}} |T \cap S_i|^{2t-4} |T \cap S_j|^{2t} \right )^{1/2} \left ( \tilde{\mathbb{E}} |T \cap S_i|^{2t} |T \cap S_j|^{2t} \|\mu_i - \mu\|^{4t} \right )^{1/2}.

Putting these two together and canceling \left ( \tilde{\mathbb{E}} |T \cap S_i|^{2t} |T \cap S_j|^{2t} \|\mu_i - \mu\|^{4t} \right )^{1/2} we get

\tilde{\mathbb{E}} |T \cap S_i|^{2t} |T \cap S_j|^{2t} \|\mu_i - \mu\|^{4t} \leq 2^{O(t)} t^{2t} \cdot N^4 \cdot \tilde{\mathbb{E}} |T \cap S_i|^{2t-4} |T \cap S_j|^{2t}.

Also, clearly \mathcal{B} \vdash_{O(1)} |T \cap S_j| \leq N. So we get

\tilde{\mathbb{E}} |T \cap S_i|^{2t} |T \cap S_j|^{2t} \|\mu_i - \mu\|^{4t} \leq 2^{O(t)} t^{2t} \cdot N^8 \cdot \tilde{\mathbb{E}} |T \cap S_i|^{2t-4} |T \cap S_j|^{2t-4}.

We started out with the goal of bounding \tilde{\mathbb{E}} |T \cap S_i|^{2t} |T \cap S_j|^{2t}. Combining the last display (together with its analogue for \|\mu_j - \mu\|) with the separation inequality above, we have found that

\tilde{\mathbb{E}} |T \cap S_i|^{2t} |T \cap S_j|^{2t} \leq \frac{2^{O(t)} t^{2t} N^8}{\Delta^{4t}} \tilde{\mathbb{E}} |T \cap S_i|^{2t-4} |T \cap S_j|^{2t-4}.

Applying pseudoexpectation Hölder’s inequality, we find that

\tilde{\mathbb{E}} |T \cap S_i|^{2t-4} |T \cap S_j|^{2t-4} \leq \left ( \tilde{\mathbb{E}} |T \cap S_i|^{2t} |T \cap S_j|^{2t} \right )^{(t-2)/t}.

Rearranging things (divide both sides by \left ( \tilde{\mathbb{E}} |T \cap S_i|^{2t} |T \cap S_j|^{2t} \right )^{(t-2)/t}, then take fourth roots), we get

\left ( \tilde{\mathbb{E}} |T \cap S_i|^{2t} |T \cap S_j|^{2t} \right )^{1/2t} \leq \frac{2^{O(t)} t^{t/2} N^2}{\Delta^{t}}.

Since 2t is a power of 2, iterated pseudoexpectation Cauchy-Schwarz gives

\tilde{\mathbb{E}} |T \cap S_i| |T \cap S_j| \leq \left ( \tilde{\mathbb{E}} |T \cap S_i|^{2t} |T \cap S_j|^{2t} \right )^{1/2t} \leq \frac{2^{O(t)} t^{t/2} N^2}{\Delta^{t}}.

Now proceeding as in the proof of Lemma 2, we know

\tilde{\mathbb{E}} \left ( \sum_{i \in [k]} \frac{|T \cap S_i|}{N} \right )^2 = 1.

Expanding the square, the at most k^2 cross terms each satisfy \tilde{\mathbb{E}} |T \cap S_i| |T \cap S_j| / N^2 \leq 2^{O(t)} t^{t/2} / \Delta^t, so

\tilde{\mathbb{E}} \sum_{i \in [k]} \left ( \frac{|T \cap S_i|}{N} \right )^2 \geq 1 - \frac{2^{O(t)} t^{t/2} k^2}{\Delta^{t}}.

QED.

Remark: We cheated ever so slightly in this proof. First of all, we did not state a version of pseudoexpectation Hölder’s inequality general enough to give \tilde{\mathbb{E}} \, p^{2t-4} \leq ( \tilde{\mathbb{E}} \, p^{2t} )^{(t-2)/t} for the degree-two monomial p = |T \cap S_i| |T \cap S_j|. The correct version can be found in Lemma A.4 of this paper. That inequality will work only when t is large enough; I think t \geq 4 suffices. To handle smaller t probably one must remove the square from both sides of Fact 3, which will require a hypothesis which does not use the squared Frobenius norm. This is possible; see e.g. my paper with Jerry Li.
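To see why the conclusion of Lemma 3 is useful downstream (this is the same rounding consideration as in the one-dimensional analysis from earlier parts): if the a_i = |T \cap S_i|/N were actual nonnegative numbers summing to 1, then \sum_i a_i^2 \leq \max_i a_i, so \sum_i a_i^2 \geq 1 - \epsilon would force T to put a (1-\epsilon)-fraction of its mass inside a single cluster. A toy check (numpy):

```python
import numpy as np

rng = np.random.default_rng(4)
k = 10

# Points on the simplex: sum_i a_i^2 <= (max_i a_i) * (sum_i a_i) = max_i a_i,
# so whenever sum_i a_i^2 >= 1 - eps, one coordinate carries mass >= 1 - eps.
for a in (rng.dirichlet(np.ones(k)),                       # spread-out mass
          np.array([0.97] + [0.03 / (k - 1)] * (k - 1))):  # nearly an indicator
    assert a.max() >= np.sum(a ** 2) - 1e-12
    print(f"sum a_i^2 = {np.sum(a**2):.3f}  <=  max a_i = {a.max():.3f}")
```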

Putting Things Together

The conclusion of Lemma 3 is almost identical to the conclusion of Lemma 2, and so the rest of the analysis of the high-dimensional clustering algorithm proceeds exactly as in the one-dimensional case. At the end, to show that the clustering algorithm works with high probability to cluster samples from a separated Gaussian mixture model, one uses straightforward concentration of measure to show that if X_1,\ldots,X_n are sufficiently many samples from a \Delta-separated mixture model, then they satisfy the hypotheses of Lemma 3 with high probability. This concludes our “proof” of the main theorem from way back in part 1.
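Here is a hedged illustration of that last concentration step, once more for t = 2 and with arbitrary parameters of my own choosing (numpy): draw samples from a separated mixture and check the two hypotheses of Lemma 3, the per-cluster moment certificate and the pairwise mean separation, directly.

```python
import numpy as np

rng = np.random.default_rng(5)
d, k, N, Delta = 10, 3, 2000, 8.0

# k true means spaced Delta apart along the first coordinate
means = np.zeros((k, d))
means[:, 0] = Delta * np.arange(k)

clusters = [m + rng.normal(size=(N, d)) for m in means]

# Hypothesis 1 of Lemma 3 (t = 2): per-cluster moment certificate
for i, C in enumerate(clusters):
    mu_i = C.mean(axis=0)
    cov = (C - mu_i).T @ (C - mu_i) / N
    print(f"cluster {i}: ||moments - target||_F^2 = "
          f"{np.sum((cov - np.eye(d)) ** 2):.3f}  (want <= 1)")

# Hypothesis 2: pairwise mean separation
mu_hat = [C.mean(axis=0) for C in clusters]
for i in range(k):
    for j in range(i + 1, k):
        sep = np.linalg.norm(mu_hat[i] - mu_hat[j])
        print(f"||mu_{i} - mu_{j}|| = {sep:.2f}  (want >= Delta = {Delta})")
```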

Related literature

The reader interested in further applications of the Sum of Squares method to unsupervised learning problems may consult some of the following works.

Though we have not seen it in these posts, often the SoS method overlaps with questions about tensor decomposition. For some examples in this direction, see the [Barak, Kelner, Steurer] dictionary learning paper, among others.

The SoS method can often be used to design algorithms which have more practical running times than the large SDPs we have discussed here. (This often requires further ideas, to avoid solving large semidefinite programs.)

Another common tool in constructing SoS proofs for unsupervised learning problems, which we did not see here, is concentration bounds for random matrices whose entries are low-degree polynomials in independent random variables; several of the previously mentioned papers contain examples along these lines.
