# Clustering and Sum of Squares Proofs, Part 1

Update (1/5/18): a pdf containing all six posts in this series is now available on my website.

I am excited to (temporarily) join the Windows on Theory family as a guest blogger!

This is the first post in a series which will appear on Windows on Theory in the coming weeks. The aim is to give a self-contained tutorial on using the Sum of Squares algorithm for unsupervised learning problems, and in particular in Gaussian mixture models. This will take several posts: let’s get started.

In an unsupervised learning problem, the goal is generally to recover some parameters $\theta \in \mathbb{R}^d$ given some data $X_1,\ldots,X_n \sim P(X \, | \, \theta)$, where $P$ is a probability distribution on, say, $\mathbb{R}^p$ which is parameterized by $\theta$. The goal is to estimate $\theta$ by some estimator $\hat{\theta}(X_1,\ldots,X_n)$ which (a) does not require too many samples and (b) can be computed from those samples in polynomial time. This basic setup can be instantiated (sometimes with minor adaptations) to capture numerous important problems in statistics and machine learning: a few examples are

• component analysis and its many variants (PCA, ICA, Sparse PCA, etc.)
• Netflix problem / matrix completion / tensor completion
• dictionary learning / blind source separation
• community detection and recovery / stochastic block models
• many clustering problems

Excellent resources on any of these topics are just a Google search away, and our purpose here is not to survey the vast literature on unsupervised learning, or even on provable “TCS-style” algorithms for these problems. Instead, we will try to give a simple exposition of one technique which has now been applied successfully to many unsupervised learning problems: the Sum of Squares method for turning identifiability proofs into algorithms.

Identifiability is a concept from statistics. If one hopes for an algorithm which recovers parameters $\hat{\theta}(X_1,\ldots,X_n) \approx \theta$, it must at least be true that $X_1,\ldots,X_n$ uniquely (or almost uniquely) determine $\theta$, with high probability: when this occurs we say $\theta$ is identifiable from the samples $X_1,\ldots,X_n$.

Classically, identifiability is often proved by analysis of a (typically) inefficient brute-force algorithm. First, one invents some property $Q(X_1,\ldots,X_n)$ which $\theta$ satisfies. For example, in a maximum-likelihood argument, one shows that $\Pr(X_1,\ldots,X_n \, | \, \theta) > p$ for some $p$. Then, often via a concentration-of-measure argument, one shows that no $\theta'$ far from $\theta$ satisfies property $Q$. In the maximum-likelihood example, this would entail showing that if $\theta'$ is far from $\theta$ then $\Pr(X_1,\ldots,X_n \, | \, \theta') < p$.

An identifiability argument like this typically has no implications for computationally-efficient algorithms, since finding $\theta$ which satisfies $Q$ often appears to require brute-force search. Thus, often when the investigation turns to efficient algorithms, the identifiability argument is abandoned and more explicitly algorithmic techniques are brought to bear: convex relaxations, spectral methods, and even heuristic methods.

The method we present here for designing computationally-efficient algorithms begins with a return to identifiability proofs. The main insight is that if both

• the property $Q$ and, more importantly,
• the proof that any $\theta'$ satisfying $Q$ must be close to $\theta$

are sufficiently simple, then identifiability of $\theta$ from $X_1,\ldots,X_n$ does imply the existence of an efficient algorithm to (approximately) recover $\theta$ from $X_1,\ldots,X_n$!

For us, simple has a formal meaning: the proof should be captured in the low-degree Sum of Squares proof system.

The algorithms which result in the end follow a familiar recipe: solve some convex relaxation whose constraints depend on $X_1,\ldots,X_n$ and round it to obtain $\hat{\theta}(X_1,\ldots,X_n)$. But the design and analysis of this relaxation are heavily informed by the simple identifiability proof described above. In particular, the convex programs which result will not be familiar relaxations of maximum likelihood problems!

In this series of blog posts, we are going to carry out this strategy from start to finish for a classic unsupervised learning problem: clustering mixtures of Gaussians. So that we can get down to business as quickly as possible, we defer a short survey of the literature on this “proofs-to-algorithms” method to a later post.

## Mixtures of Gaussians

Gaussian mixtures are classic objects in statistics, dating at least to Pearson in 1894. The basic idea is: suppose you have a data set $X_1,\ldots,X_n \in \mathbb{R}^d$ which was generated in a heterogeneous fashion: some points were sampled from probability distribution $\mathcal{D}_1$, some from $\mathcal{D}_2$, and so on up to $\mathcal{D}_k$, but you do not know which points came from which distributions. Under what settings can you cluster the points by which distribution they came from, and perhaps also recover some properties of those distributions, such as their means $\mu_i = \mathbb{E}_{X \sim \mathcal{D}_i} X$?

Pearson, in 1894, was faced with a collection of body measurements of crabs. The distribution of one such attribute in the data — the ratio of forehead length to body width — curiously deviated from a Gaussian distribution. Pearson concluded that in fact two distinct species of crabs were present in his data set, and that the data should therefore be modeled as a mixture of two Gaussians. He invented the method of moments to discover the means of both the Gaussians in question.¹ In the years since, Gaussian mixtures have become a fundamental statistical modeling tool: algorithms to fit Gaussian mixtures to data sets are included in basic machine learning packages like sklearn.

Here is our mixture of Gaussians model, formally.

Mixtures of separated spherical Gaussians:
Let

• $\Delta > 0$.
• $\mu_1,\ldots,\mu_k \in \mathbb{R}^d$ be such that $\|\mu_i - \mu_j\| \geq \Delta$ for every $i \neq j$.
• $\mathcal{N}(\mu_1,I),\ldots,\mathcal{N}(\mu_k,I)$ be $d$-dimensional spherical Gaussians, centered at the means $\mu_i$.
• $X_1,\ldots,X_n \in \mathbb{R}^d$ be iid samples, each drawn by selecting $j \sim [k]$ uniformly, then drawing $X_i \sim \mathcal{N}(\mu_j, I)$.
• $S_1,\ldots,S_k$ be the induced partition of $[n]$ into $k$ parts, where $i \in S_j$ if sample $X_i$ was drawn from $\mathcal{N}(\mu_j, I)$.

The problem is: given $X_1,\ldots,X_n$, output a partition $T_1,\ldots,T_k$ of $[n]$ into $k$ parts, each of size $n/k$, such that (up to renaming the clusters $1,\ldots,k$),

$|S_i \cap T_i| \geq (1 - \delta) \cdot \frac nk$

for each $i \in [k]$ and some small number $\delta > 0$.

To avoid some minor but notationally annoying matters, we are going to work with a small variant of the model, where we assume that exactly $n/k$ samples $X_i$ came from each Gaussian $\mathcal{N}(\mu_j, I)$. We call a mixture like this “equidistributed”.
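To make the model concrete, here is a short Python sketch of a sampler for the equidistributed model (our own illustration, not from any of the referenced works; the function name and interface are ours):

```python
import numpy as np

def sample_equidistributed_mixture(means, n, rng=None):
    """Draw n samples from an equidistributed mixture of spherical Gaussians.

    Exactly n/k samples come from each component N(mu_j, I), matching the
    "equidistributed" variant of the model described above. Returns the
    samples X (shape (n, d)) and the ground-truth clusters S_1, ..., S_k.
    """
    rng = np.random.default_rng(rng)
    means = np.asarray(means, dtype=float)  # shape (k, d)
    k, d = means.shape
    assert n % k == 0, "n must be divisible by k in the equidistributed model"
    N = n // k
    # Assign exactly N indices to each component, in a uniformly random order.
    assignment = rng.permutation(np.repeat(np.arange(k), N))
    X = means[assignment] + rng.standard_normal((n, d))
    clusters = [np.flatnonzero(assignment == j) for j in range(k)]
    return X, clusters
```

For instance, `sample_equidistributed_mixture([[0, 0], [10, 10]], 100)` returns $100$ points in $\mathbb{R}^2$, exactly $50$ drawn around each mean, together with the ground-truth partition $S_1, S_2$.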

We will work up to a proof of this theorem, which was proved recently (as of Fall 2017) in three independent works.

Main Theorem (Hopkins-Li, Kothari-Steinhardt, Diakonikolas-Kane-Stewart):
For arbitrarily-large $t \in \mathbb{N}$ there is an algorithm requiring $n = d^{O(t)} k^{O(1)}$ samples from the equidistributed mixtures of Gaussians model and running in time $n^{O(t)}$ which outputs a partition $T_1,\ldots,T_k$ of $[n]$ into parts of size $N = n/k$ such that with high probability,

$\displaystyle \frac{|S_i \cap T_i|}{N} \geq 1 - k^{10} \cdot \left ( \frac{C \sqrt t}{\Delta} \right )^{t}$

for some universal constant $C$.³

In particular:

• If $\Delta = k^\epsilon$ for some $\epsilon > 0$, then by choosing $t = 100/\epsilon$ the algorithm recovers the correct clustering up to $1/poly(k)$ errors in $poly(k,d)$ time with $poly(k,d)$-many samples.
• If $\Delta = C \sqrt{\log k}$ for a large-enough universal constant $C$, then choosing $t = O(\log k)$ gives a quasipolynomial-time algorithm (using quasipolynomially-many samples) to recover clusters up to $1/poly(k)$ error.⁴

One (rather weak) consequence of the main theorem is that, for $n = d^{O(t)} k^{O(1)}$ samples, there is enough information in the samples to determine the underlying clustering, up to error $\delta(t,\Delta) = \tfrac{2^{O(t)} \cdot t^{t/2} \cdot k^{10}}{\Delta^t}$. Our strategy to prove the main theorem will start with proving the latter statement independently — as we have discussed, such an argument is often called a proof of identifiability because we say that the clusters are identifiable from the samples (up to $\delta(t,\Delta)$ errors).

While identifiability itself does not carry immediate algorithmic consequences, our proof of identifiability will be somewhat special: it will be simple in a formal sense, namely, that it will be captured by a formal proof system of restricted power. This simplicity of the proof of identifiability will almost immediately imply the main theorem: the construction of a computationally-efficient algorithm from a simple proof of identifiability is the heart of the proofs-to-algorithms method.

## Identifiability proof: 1 dimension

Our eventual goal is to work up to a proof in the low-degree Sum of Squares proof system that clusters $S_1,\ldots,S_k$ are identifiable from samples $X_1,\ldots,X_n$ from a mixture of Gaussians. Since we have not yet defined low-degree Sum of Squares proofs, for now we will focus on constructing an identifiability proof which avoids mathematical facts which we deem “too complicated”. In particular, we will avoid any Chernoff/union bound style arguments.

To get to the heart of the matter it helps to simplify the setting. Our first simplification is to restrict attention to the $d = 1$ case, so that distributions $\mathcal{N}(\mu_i,1)$ are univariate Gaussians with unit variance.

Before stating our first lemma, let’s discuss the key property of a collection $Y_1,\ldots,Y_m$ of samples from a Gaussian $\mathcal{N}(0,1)$ which we will use. Recall that if $Y \sim \mathcal{N}(0,1)$ is a standard Gaussian, then for every $t \in \mathbb{N}$,

$\mathbb{E} |Y|^t \leq t^{t/2}.$
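(This moment bound can be checked directly; we include the quick derivation here for completeness. For even $t$, the absolute moments of a standard Gaussian are double factorials, and the product below has $t/2$ factors, each at most $t$:

$\mathbb{E} |Y|^t = (t-1)!! = (t-1)(t-3)\cdots 3 \cdot 1 \leq t^{t/2} \qquad (t \text{ even}).$

The odd case follows by a similar calculation.)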

If $Y_1,\ldots,Y_m$ are samples from $Y$, then for $m = m(t)$ large enough, the empirical distribution of $Y_1,\ldots,Y_m$ inherits this property, up to some small fluctuations. Namely, with high probability we would have

$\mathbb{E}_{i \sim [m]} |Y_i|^t \leq 1.1 \cdot t^{t/2}.$

(We could have replaced $1.1$ by any small constant greater than $1$.) Here, the notation $i \sim [m]$ means that an index $i$ is chosen uniformly among $\{1,\ldots,m\}$.

If $Y \sim \mathcal{N}(\mu,1)$ for some $\mu \in \mathbb{R}$, then the same discussion applies immediately to $\mathbb{E}|Y - \mu|^t$ and $\mathbb{E}_{i \sim [m]} |Y_i - \mu|^t$. But even more is true: if $\overline{\mu}$ is the empirical mean of $Y_1,\ldots,Y_m$ (that is, $\overline{\mu} = \mathbb{E}_{i \sim [m]} Y_i$), then with high probability the $t$-th centered empirical moment also inherits the same bound:

$\mathbb{E}_{i \sim [m]} |Y_i - \overline{\mu}|^t \leq 1.1 \cdot t^{t/2}.$

In the Gaussian mixture setting, so long as enough samples are drawn from each Gaussian $\mathcal{N}(\mu_i, 1)$, each cluster will have $t$-th empirical moments satisfying the above bound (with high probability).
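A quick numerical sanity check of this discussion (a sketch we add for illustration; the helper name is ours): drawing many samples from $\mathcal{N}(3,1)$, the centered empirical $t$-th moments indeed sit below $1.1 \cdot t^{t/2}$.

```python
import numpy as np

def centered_empirical_moment(Y, t):
    """E_{i ~ [m]} |Y_i - mean(Y)|^t for a one-dimensional sample Y."""
    Y = np.asarray(Y, dtype=float)
    return np.mean(np.abs(Y - Y.mean()) ** t)

rng = np.random.default_rng(0)
m = 200_000
Y = rng.standard_normal(m) + 3.0  # samples from N(3, 1)

for t in (2, 4, 6, 8):
    # The centered empirical t-th moment inherits E|Y - mu|^t <= t^{t/2},
    # up to a small fluctuation (the factor 1.1 from the text).
    assert centered_empirical_moment(Y, t) <= 1.1 * t ** (t / 2)
```

Note that the true moments are far below $t^{t/2}$ (e.g. $\mathbb{E}|Y - \mu|^4 = 3$ versus $4^2 = 16$), so there is plenty of room for the empirical fluctuations.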

In our identifiability proof, we choose to forget that the samples $X_1,\ldots,X_n$ were drawn from Gaussians, and we remember only that they break up into underlying clusters, each of which satisfies that empirical moment bound. We do not even remember the “true” means of each Gaussian; instead we talk only about the empirical means. We will show that no clustering far from that underlying ground-truth clustering results in clusters which satisfy the empirical moment bound.

Lemma 1. Let $X_1,\ldots,X_n \in \mathbb{R}$. Let $S_1,\ldots,S_k$ be a partition of $[n]$ into $k$ pieces of size $N = n/k$ such that for each $S_i$, the collection of numbers $\{X_j\}_{j \in S_i}$ obeys the following moment bound:

$\mathbb{E}_{j \sim S_i} |X_j - \mu_i|^t \leq 2 \cdot t^{t/2}$

where $\mu_i$ is the average $\mathbb{E}_{j \sim S_i} X_j$ and $t$ is some number in $\mathbb{N}$. Let $\Delta > 0$ be such that $|\mu_i - \mu_j| \geq \Delta$ for every $i \neq j$. Suppose $t$ is large enough that $10 \sqrt t k^{1/t} \leq \Delta$.

Let $S \subseteq [n]$ have size $|S| = N = n/k$ and be such that $\{X_i\}_{i \in S}$ obey the same moment-boundedness property:

$\mathbb{E}_{j \sim S} |X_j - \mu_S|^t < 2 \cdot t^{t/2}$

for the same $t \in \mathbb{N}$, where $\mu_S$ is the mean $\mu_S = \mathbb{E}_{j \sim S} X_j$. Then there exists an $S_i$ such that

$\displaystyle \frac{|S \cap S_i|}{N} \geq 1 - k \cdot \left ( \frac{C \sqrt{t}}{\Delta} \right ) ^t.$

for some universal constant $C$.

How do we interpret Lemma 1 as a statement of cluster identifiability? The lemma implies that the clusters are, up to $\delta(t,\Delta)$ errors, the only subsets of $[n]$ of size $n/k$ which satisfy the $t$-th moment bound. This is our property $Q$, like we discussed earlier in this post. The true clustering $S_1,\ldots,S_k$ satisfies $Q$ (i.e. if you group $X_1,\ldots,X_n$ by this ground-truth clustering, the resulting clusters will have bounded empirical $t$-th moments), and every clustering which satisfies this bounded $t$-th moment property must be close to the true clustering. Thus, the correct clusters could be identified by searching for subsets of $[n]$ which satisfy the $t$-th moment bound (never mind that such a search would naively require about $2^n$ time).

We said that to use the sum of squares method to turn our identifiability proof into an algorithm, both the property $Q$ and the proof of identifiability need to be simple.

This $t$-th moment boundedness property will turn out to be simple enough. What about the proof of Lemma 1? By the end of this post we will prove Lemma 1 in a way which uses only Hölder’s inequality for the $t$ vs. $\tfrac t {t-1}$ norms and the triangle inequality for the $t$-norm. Later, we will show that these inequalities are simple in the correct formal sense: they are captured by a proof system of restricted power.

Our proof of Lemma 1 relies on the following key fact.

Fact 1. Let $S,S' \subseteq \mathbb{R}$ have $|S| = |S'| = N$. Let $X$ denote a uniform sample from $S$ and similarly for $X'$. Let $\mu = \mathbb{E} X$ and $\mu' = \mathbb{E} X'$. Suppose $X,X'$ satisfy the $t$-th moment bound

$\mathbb{E} |X - \mu|^t \leq 2 \cdot t^{t/2} \text{ and } \mathbb{E}|X' - \mu'|^t \leq 2 \cdot t^{t/2}.$

Then

$|\mu - \mu'| \leq 4 \sqrt t \cdot \left ( \frac{|S \cap S'|}{N} \right )^{-1/t}.$

A slightly more general interpretation of this fact is that a pair of random variables $X,X'$ on $\mathbb{R}$ which have bounded $t$-th moments and whose total variation distance $TV(X,X') \leq 1-\alpha$ cannot have means which are too far apart: $|\mathbb{E} X - \mathbb{E} X'| \leq 4 \sqrt t / \alpha^{1/t}$.

Proof of Fact 1.
Let $\alpha = |S \cap S'|/N$. Observe that there is a coupling of the random variables $X,X'$ so that $\Pr(X = X') = \alpha$. The coupling chooses with probability $\alpha$ to select a uniform sample $Y \sim S \cap S'$, then lets $X = X' = Y$. With probability $1-\alpha$, the random variable $X$ is a uniform sample from $S \setminus S'$ and similarly for $X'$.

Let $(X,X')$ be a coupled pair of samples. We expand a quantity related to the one we want to bound, and then apply Hölder’s inequality with the $t$ and $\tfrac t {t-1}$ norms. (The underlying inner product space assigns to functions $f, g : (X,X') \to \mathbb{R}$ the inner product $\mathbb{E}_{(X,X')} f(X,X') \cdot g(X,X')$.)

$\alpha \cdot |\mu - \mu'| = \mathbb{E}_{(X,X')} \left [ \mathbf{1}_{X = X'} \cdot |(\mu - X) - (\mu' - X')| \right ]$

$\leq \left ( \mathbb{E} (\mathbf{1}_{X = X'})^{t/(t-1)} \right )^{\tfrac {t-1} t} \cdot \left ( \mathbb{E} |(\mu - X) - (\mu' - X')|^t \right )^{1/t}$

$= \alpha^{1-1/t} \cdot \left ( \mathbb{E} |(\mu - X) - (\mu' - X')|^t \right)^{1/t}.$

Now we can apply the triangle inequality for the $t$-norm to the last term, followed by our $t$-th moment assumptions.

$\left ( \mathbb{E} |(\mu - X) - (\mu' - X')|^t \right )^{1/t} \leq \left ( \mathbb{E} |\mu - X|^t \right )^{1/t} + \left ( \mathbb{E} |\mu' - X'|^t \right )^{1/t} \leq 4 \sqrt t$

Putting everything together, we get $|\mu - \mu'| \leq \tfrac {4 \sqrt t}{\alpha^{1/t}}$. QED.

Keeping in mind our eventual goal of constructing a low-degree Sum of Squares proof, we record the observation that the only inequalities we required to prove Fact 1 were the $t$ vs. $\tfrac t {t-1}$ Hölder’s inequality and the triangle inequality for the $t$-norm.

Armed with Fact 1, we can prove Lemma 1. The main idea in the proof is that if $S_1$ and $S_2$ are the two clusters with greatest intersection with $S$, then $\mu_S$ can only be close to one of $\mu_1, \mu_2$.

Proof of Lemma 1.
Let $S \subseteq [n]$ have size $|S| = n/k = N$, with mean $\mu_S = \mathbb{E}_{i \sim S} X_i$ and $t$-th moment bound $\mathbb{E}_{i \sim S} |X_i - \mu_S|^t \leq 2t^{t/2}$. Without loss of generality, order the clusters so that $S_1$ has largest intersection with $S$, then $S_2$, and so on: that is $|S \cap S_1| \geq |S \cap S_2| \geq \ldots \geq |S \cap S_k|$. If $|S \cap S_1| = (1 -\delta)N$, then $|S \cap S_2| \geq \tfrac \delta k N$, just by counting.

Since $|\mu_1 - \mu_2| \geq \Delta$, either $|\mu_1 - \mu_S| \geq \Delta/2$ or $|\mu_2 - \mu_S| \geq \Delta/2$. We claim it must be the second. Using Fact 1,

$|\mu_1 - \mu_S| \leq \frac{4 \sqrt t}{(1 - \delta)^{1/t}} \leq 4 \sqrt t \cdot k^{1/t} < \Delta/2$

since certainly $(1-\delta) \geq \tfrac 1k$, and we assumed $10 \sqrt t k^{1/t} \leq \Delta$. We conclude that $|\mu_2 - \mu_S| \geq \Delta/2$.

We have obtained $\tfrac{|S \cap S_2|}{N} \geq \tfrac \delta k$ and $|\mu_2 - \mu_S| \geq \Delta/2$. Putting these together with Fact 1, we find

$\frac \Delta 2 \leq |\mu_2 - \mu_S| \leq 4 \sqrt t \cdot \left ( \frac k \delta \right )^{1/t}.$

Rearranging (raise both sides to the $t$-th power and solve for $\delta$), this reads $\delta \leq \frac{(8 \sqrt t)^t \cdot k}{\Delta^t} = \frac{2^{O(t)} t^{t/2} k}{\Delta^t}.$ QED.

This concludes our one-dimensional identifiability proof, which will form the core of our proof of Theorem 1. In our next post we will lift this proof to a Sum of Squares proof (for which we will need to define Sum of Squares proofs). After that, with a Sum of Squares proof in hand, we will finish designing our mixture of Gaussians algorithm for the one-dimensional case. Then we will show that the same ideas, nearly unchanged, imply that the algorithm works in higher dimensions.

### Thanks

Many thanks to Boaz Barak, David Steurer, and especially Daniel Freund for their helpful remarks on early drafts of this post.

#### Footnotes

1. Moritz Hardt on Gaussian mixtures and Pearson’s approach
3. To see how to apply the ideas in this tutorial to a much broader class of clustering problems, see my joint paper with Jerry Li and the recent paper of Pravesh Kothari and Jacob Steinhardt.

4. Before these recent works, the best polynomial-time algorithms for clustering mixtures of Gaussians could not tolerate any $\Delta < k^{1/4}$ (when $\Delta \geq k^{1/4}$ a simple greedy algorithm can be shown to solve the clustering problem to high accuracy). On the other hand, known lower bounds show that when $\Delta = C \sqrt{\log k}$, clustering is impossible (even using exponential time) with $k^{O(1)}$ samples, so one cannot hope to improve the guarantees of this theorem too much further [Regev-Vijayaraghavan]. (That said, reducing the sample complexity and running time to $poly(d,k)$ when $\Delta = C \sqrt{\log k}$ is a fascinating open problem.)

Variants of this theorem, which may be found in all three of the sources listed, offer algorithms which additionally output estimates of the means $\mu_1,\ldots,\mu_k$, work for many distributions besides Gaussians (without even knowing the underlying distribution!), and tolerate some amount of adversarial corruption among the samples $X_1,\ldots,X_n$. We note also that the theorems in those works handle the usual mixtures of Gaussians problem, rather than the equidistributed version, and can tolerate non-uniform mixtures, i.e., those where some clusters are much smaller than others.
