This is part 2 of a series on clustering, Gaussian mixtures, and Sum of Squares (SoS) proofs. If you have not read it yet, I recommend starting with Part 1. Also, if you find errors please mention them in the comments (or otherwise get in touch with me) and I will fix them ASAP.

Welcome back.

In the last post, we introduced Gaussian mixture models and the clustering problem for Gaussian mixtures. We described identifiability proofs for unsupervised learning problems. Then we set ourselves some goals:

1. Design a simple identifiability proof for clustering in Gaussian mixtures, saying that if $X_1,\ldots,X_n$ are (enough) samples from a $d$-dimensional mixture of $k$ Gaussians, the ground-truth clustering of the $X_i$‘s by which Gaussian they were drawn from is identifiable from the samples.
2. Formalize the simplicity of that identifiability proof by showing that it is captured by a formal proof system of restricted power: the Sum of Squares (SoS) proof system.
3. Guided by our SoS identifiability proof, design an algorithm for clustering in Gaussian mixtures. (And hence prove Theorem 1 from last time.)

In the last post, we accomplished task (1) in $1$-dimensional case. In this post we will get started on task (2), again in the $1$-dimensional case.

Recalling from last time, in our identifiability proof we remembered only two things things about our samples $X_1,\ldots,X_n$:

(1) They break up into clusters $S_1,\ldots,S_k$ of equal size, $|S_i| = n/k = N$, such that for some $t \in \mathbb{N}$, each cluster obeys the empirical moment bound,

$\mathbb{E}_{j \sim S_i} |X_j - \mu_i|^t \leq 2 \cdot t^{t/2}$

where $\mu_i = \mathbb{E}_{j \sim S_i} X_j$ is the empirical mean of the cluster $S_i$, and

(2) Those means are separated: $|\mu_i - \mu_j| \geq \Delta$.1

The key tool we used in our identifiability proof was Fact 1, which we restate here for convenience.

Fact 1. Let $S,S' \subseteq \mathbb{R}$ have $|S| = |S'| = N$. Let $X$ denote a uniform sample from $S$ and similarly for $X'$. Let $\mu = \mathbb{E} X$ and $\mu' = \mathbb{E} X'$. Suppose $X,X'$ satisfy the $t$-th moment bound

$\mathbb{E} |X - \mu|^t \leq 2 \cdot t^{t/2} \text{ and } \mathbb{E}|X' - \mu'|^t \leq 2 \cdot t^{t/2}.$

Then

$|\mu - \mu'| \leq 4 \sqrt t \cdot \left ( \frac{|S \cap S'|}{N} \right )^{-1/t}.$

We are going to give a Sum of Squares proof of this fact. Or rather: we are going to state a very similar fact, which concerns inequalities among low-degree polynomials, and give a Sum of Squares proof of that.

We are going to do things in a slightly unusual order, delaying definition of the SoS proof system till we have something concrete in mind to prove in it.

First, because SoS is a proof system to reason about inequalities among low-degree polynomials, we are going to formulate Fact 2, which is like Fact 1 except it will be explicitly about low-degree polynomials. Proving Fact 2 will be our goal.

Second, we will define the SoS proof system.

Finally, we will prove this Fact 2 in the low-degree SoS proof system.

## Fact 2: an SoS version of Fact 1

Fact 1 concerns two subsets $S,S'$ of $\mathbb{R}$. When we used Fact 1 to prove Lemma 1 in part 1, one of those sets $S$ was one of the ground-truth clusters among $S_1,\ldots,S_k$, and one of them was a “candidate” cluster $S' \subseteq X_1,\ldots,X_n$ — Lemma 1 showed that the candidate cluster $S'$ must in fact have been close to one of the true clusters $S_1,\ldots,S_k$.

We will design a system of polynomial equations whose solutions are in correspondence with candidate clusters $S'$. This is probably unavoidably foreign at first. We will offer some attempts at demystifying remarks later on, but for now we will forge ahead.

Let $X_1,\ldots,X_n \in \mathbb{R}$. Let $w_1,\ldots,w_n$ be some indeterminates; they will be the variables in our polynomials. We are going to think of them as $0/1$ indicators of a subset $T \subseteq [n]$ which is a candidate cluster.

First, let’s enforce that $w$ is the indicator vector of a set of size $N = n/k$. Consider the equations

$w_i^2 = w_i \text{ for all } i \in [n] \text{ and } \sum_{i \in [n]} w_i = N$.

Any solution to these equations over $\mathbb{R}$ is a $0/1$ indicator of a subset of $[n]$ of size $N$.

The second hypothesis Fact 1 places on a candidate cluster is the $t$-th moment bound. Let’s enforce that with a polynomial inequality. First, we need some notation for the empirical mean $\mu$ of $T$. Denote by $\mu(w)$ the polynomial

$\mu(w) = \frac 1 N \sum_{i \in [n]} w_i X_i$.

Often we drop the $w$ and just write $\mu$. And now consider the inequality:

$\frac 1 N \sum_{i \in [n]} w_i (X_i - \mu)^t \leq 2 \cdot t^{t/2}$.

Belaboring the point somewhat: any solution over $\mathbb{R}$ to the equations and inequalities we have described would correspond to a subset of $[n]$ of which obeys the $t$-th empirical moment bound. (Here we assume that $t$ is even, so that $|X_i - \mu|^t = (X_i - \mu)^t$.)

Although we don’t have all the definitions we need in place yet, we will go ahead and state Fact 2. We introduce some suggestive notation. If $S \subseteq [n]$, we define the polynomial

$|S \cap T|(w) = \sum_{i \in S} w_i$.

Often we just write $|S \cap T|$.

Fact 2.  Let $X_1,\ldots,X_n \in \mathbb{R}$. Let $S \subseteq [n]$ have $|S| = N$; let $\mu_S = \mathbb{E}_{i \sim S} X_i$ be its mean. Let $t$ be a power of $2$. Suppose $S$ satisfies

$\mathbb{E}_{i \sim S} |X_i - \mu_S|^t \leq 2 \cdot t^{t/2}.$

Let $w_1,\ldots,w_n$ be indeterminates. Let $\mathcal{A}$ be the following set of equations and inequalities.

$w_i^2 = w_i \text{ for } i \in [n]$
$\sum_{i \in [n]} w_i = N$
$\frac 1 N \sum_{i \in [n]} w_i \cdot (X_i - \mu)^t \leq 2 \cdot t^{t/2}.$

Then

$\mathcal{A} \vdash_{O(t)} \left ( \frac{|S \cap T|}{N} \right )^t \cdot (\mu - \mu_S)^t \leq 2^{O(t)} \cdot t^{t/2} \cdot \left( \frac {|S \cap T|} N \right ) ^{t-1}.$

We have not yet defined the notation $\mathcal{A} \vdash_{O(t)} \ldots$. This notation means that that there is a degree $O(t)$ SoS proof of the inequality on the right-hand side using the axioms $\mathcal{A}$ — we will define this momentarily.

The purpose of stating Fact 2 now was just to convince the reader that there is a plausible version of Fact 1 which may be stated entirely as inequalities among polynomials. The reader is encouraged to compare the hypotheses and conclusions of Facts 1 and 2 (ignoring this $\mathcal{A} \vdash \ldots$ for now).2 There are two main differences:

(a) The inequality in the conclusion of Fact 2 is raised to the $t$-th power, as compared to the conclusion of Fact 1.

(b) The inequality in the conclusion of Fact 2 seems to have extra factors of $\frac{|S \cap T|}{N}$ on both sides.

Point (a) is needed just to make both sides of the inequality into polynomials in $w$; otherwise there would be fractional powers. The need for (b) is more subtle, and we will not be able to fully understand it for a while. For now, what we can say is: the inequality

$\left ( \frac{|S \cap T|}{N} \right ) \cdot (\mu - \mu_S)^t \leq 2^{O(t)} \cdot t^{t/2}.$

would be true for any $w$ which solves $\mathcal{A}$, but that might not have an SoS proof.

One last difference is the factor $2^{O(t)}$ on the right-hand side of the conclusion. This is probably not inherent; it arises because we will use an easy-to-prove but lossy version of the triangle inequality in our SoS proof. In any case, it is dwarfed by the term $t^{t/2}$, so we are not too worried about it.

Enough dancing around these SoS proofs — in order to stop with the hand-waiving, we need to set up our proof system.

## Sum of Squares Proofs

We cannot go any further without defining the formal proof system we will work in. Since for now we are sticking with a one-dimensional setting, things can be little simpler than for the proof system we will need when $X_1,\ldots,X_n$ become high-dimensional, but developing the proof system still incurs a little notational burden. That is life.

The Sum of Squares proof system, henceforth “SoS”, is a formal proof system for reasoning about systems of polynomial equations and inequalities over the real numbers. At its heart is the simple fact that if $p(x)$ is a polynomial in indeterminates $x$ with real coefficients, then $p(x)^2 \geq 0$ for all $x \in \mathbb{R}$.

Plenty of elementary expositions on SoS proofs are available (see e.g. SoS on the hypercube, SoS on general domains, and the first several sections of this paper by Ryan O’Donnell and Yuan Zhou.) We will define as much as we need to keep this tutorial self-contained but for expanded discussion we refer the reader to those resources and references therein.

Let $x_1,\ldots,x_n$ be some indeterminates. Let $\mathcal{P} = \{ p_1(x) \geq 0,\ldots,p_m(x) \geq 0\}$ be some polynomial inequalities in those indeterminates. If $r(x)$ is some other polynomial with real coefficients, it may be the case that for any $x$ satisfying $\mathcal{P}$, also $r(x) \geq 0$; we would say that $\mathcal{P}$ implies $r(x) \geq 0$.

The key concept for us will be that $\mathcal{P}$ implies $r(x) \geq 0$ with a sum of squares proof. We say that $\mathcal{P}$ SoS-proves $r(x) \geq 0$ if there exist polynomials $b_S(x)$ for $S \subseteq [m]$ such that

$r(x) = \sum_{S \subseteq [m]} b_S(x) \prod_{i \in S} p_i(x)$

and the polynomials $\{b_S\}$ are sums of squares—i.e. each of them has the form $\sum_{j} u_j(x)^2$ for some polynomials $u_j(x)$. Notice that if this equation holds, for any $x \in \mathbb{R}^n$ which satisfies the inequalities $\mathcal{P}$ the right-hand side must be nonnegative, so $r(x)$ is also nonnegative.

The polynomials $b_S$ form an SoS proof that $\mathcal{P}$ implies $r(x) \geq 0$. If $\deg \left [ b_S(x) \prod_{i \in S} p_i(x) \right ]$ are all at most $d \in \mathbb{N}$, we say that the proof has degree $d$, and we write

$\mathcal{P} \vdash_d r(x) \geq 0$.

When $\mathcal{P} = \emptyset$ we often just write $\vdash_d r(x) \geq 0$.

We commonly use a few shorthand notations.

• We write $\mathcal{P} \vdash r(x) \geq r'(x)$, by which we mean $\mathcal{P} \vdash r(x) - r'(x) \geq 0$.
• We include polynomial equations such as $x_i^2 = x_i$ in $\mathcal{P}$, by which we mean that $\mathcal{P}$ contains both $x_i^2 \leq x_i$ and $x_i^2 \geq x_i$.
• We write $\mathcal{P} \vdash r(x) = r'(x)$, by which we mean that both $\mathcal{P} \vdash r(x) \geq r'(x)$ and $\mathcal{P} \vdash r'(x) \geq r(x)$.

Although the definition suggests something static — there is a fixed collection of polynomials $b_S$ forming an SoS proof — in practice we treat SoS proofs as dynamic objects, building them line by line much as we would any other proof. We are going to see an example of this very soon when we prove Fact 2.

It is time to begin to make good on the promise from the last section that we would get substantial mileage out of proving identifiability of mixtures of Gaussians using only simple inequalities. While it will take us several sections to completely make good on our promise, we can begin by giving SoS versions of the triangle and Holder’s inequalities we used in our identifiability proof. We will prove one of these now to give the reader a sense of how such arguments go; since there is usually not much novelty in SoS proofs of such basic inequalities we will defer others till later.

We would like to emphasize that the following SoS proofs themselves have nothing to do with mixtures of Gaussians; instead they are part of a growing problem-independent toolkit of basic inequalities useful in designing SoS proofs of more interesting mathematical statements. The fact that one can have such a problem-independent toolkit is in part what makes the proofs-to-algorithms method so broadly useful.

### SoS triangle inequality

We will start with a fairly weak SoS triangle inequality, that will suffice for our needs.
Much more sophisticated versions of this inequality are possible which allow various norms and do not lose the factor $2^t$ we do here.

Fact: SoS triangle inequality.
Let $x,y$ be indeterminates. Let $t$ be a power of $2$. Then

$\vdash_t (a+b)^t \leq 2^t (a^t + b^t).$

To prove the SoS triangle inequality we will want the following basic observation about composability of SoS proofs. (This is really a special case of a more general composibility result.)

Proposition: squaring SoS proofs.
Suppose $\vdash_d p(x) \leq q(x)$ and $p,q$ are sums of squares. Then $\vdash_{2d} p(x)^2 \leq q(x)^2$.

Proof of proposition.
By hypothesis, $q(x) - p(x)$ is a sum of squares polynomial. Now,

$q(x)^2 - p(x)^2 = (q(x) - p(x))(q(x) + p(x))$

so it is a product of sum of squares polynomials, and hence itself a sum of squares.
QED.

Proof of SoS triangle inequality.
We start with the case $t=2$. In this case, we have $(a+b)^2 = a^2 + 2ab + b^2$. We claim that $\vdash_2 2ab \leq a^2 + b^2$, since the polynomial $a^2 + b^2 - 2ab = (a-b)^2$ is a square. Hence, we find

$\vdash_t (a+b)^2 = a^2 + 2ab + b^2 \leq 2(a^2 + b^2).$

Now to prove the general case, we proceed by induction. We may suppose

$\vdash_{t/2} (a+b)^{t/2} \leq 2^{t/2} (a^{t/2} + b^{t/2}).$

By the proposition, this implies $\vdash_t (a+b)^t \leq 2^t (a^{t/2} + b^{t/2})^2$. Now we can apply the base case again to $(a'+b')^2$ for $a' = a^{t/2}$ and $b' = b^{t/2}$ to complete the argument.
QED.

### SoS Holder’s inequality

Holder’s inequality poses a quandary for SoS proofs, because of the non-integral exponents in most $p$-norms (hence such norms do not naturally correspond to polynomials). Consequently, there are many SoS versions of Holder’s inequality in the literature, choosing various ways to handle this non-integrality. The version we present here will be most useful for our mixtures of Gaussians proof. We will address the non-integral powers issue by imposing polynomial inequalities requiring that some of the underlying variables be Boolean.

Since the proof of this SoS Holder’s inequality proceeds via a similar induction to the one we used for the SoS triangle inequality we just proved, we defer it to the end of this post.

SoS Holder’s inequality.
Let $w_1,\ldots,w_n$ be indeterminates and let $\mathcal{W}$ be the collection of equations

$\mathcal{W} = \{ w_i^2 - w_i = 0 \, : i \in [n] \}.$

Note that the only solutions to $\mathcal{W}$ are $\{0,1\}^n$.

Let $p_1(w),\ldots,p_n(w)$ be polynomials of degree at most $\ell$. Let $t$ be a power of $2$. Then

$\mathcal{W} \vdash_{O(t\ell)} \left ( \sum_{i \in [n]} w_i \cdot p_i(w) \right )^t \leq \left ( \sum_{i \in [n]} w_i \right )^{t-1} \cdot \sum_{i \in [n]} p_i(w)^t.$

and

$\mathcal{W} \vdash_{O(t\ell)} \left ( \sum_{i \in [n]} w_i \cdot p_i(w) \right )^t \leq \left ( \sum_{i \in [n]} w_i \right )^{t-1} \cdot \sum_{i \in [n]} w_i \cdot p_i(w)^t.$

### SoS Boolean Inequalities

We will also want one more SoS inequality for our proof of Fact 1. In linear programming relaxations of Boolean problems, it is common to replace an integrality constraint $x \in \{0,1\}$ with the linear inequalities $0 \leq x \leq 1$. SoS can derive the latter inequalities.

Fact: SoS Boolean Inequalities

$x^2 = x \vdash_2 0 \leq x \leq 1$.

Proof of Fact.

For the inequality $x \geq 0$, just use the axiom $x \geq x^2$. For the inequality $x \leq 1$, write

$1 - x = x - x^2 + (1-x)^2$.

QED.

## Proof of Fact 2

Without further ado, we will prove Fact 2 by lifting the proof of Fact 1 into the SoS proof system. Though slightly notationally cumbersome, the proof follows that of Fact 1 nearly line by line—the reader is encouraged to compare the two proofs.

Proof of Fact 2.
We write out what we want to bound in terms of $X_1,\ldots,X_n$, then apply Holder’s inequality and the triangle inequality.

$\left ( \sum_{i \in S} w_i \right )^t \cdot (\mu - \mu_S)^t = \left ( \sum_{i \in S} w_i \left [ (\mu - X_i) - (\mu_S - X_i) \right ] \right )^t.$

We deploy our SoS Holder’s inequality to obtain

$\mathcal{A} \vdash_{O(t)} \left ( \sum_{i \in S} w_i \left [ (\mu - X_i) - (\mu_S - X_i) \right ] \right )^t \leq \left ( \sum_{i \in S} w_i \right )^{t-1} \cdot \sum_{i \in S} w_i \left [ (\mu - X_i) - (\mu_S - X_i) \right ]^t.$

Next we can use our equations $w_i^2 - w_i$ to conclude that in fact

$\mathcal{A} \vdash_{O(t)} \left ( \sum_{i \in S} w_i \left [ (\mu - X_i) - (\mu_S - X_i) \right ] \right )^t \leq \left ( \sum_{i \in S} w_i^2 \right )^{t-1} \cdot \sum_{i \in S} w_i^2 \left [ (\mu - X_i) - (\mu_S - X_i) \right ]^t.$

The polynomial $\left ( \sum_{i \in S} w_i^2 \right)^{t-1}$ is a sum of squares, as is $2^t(a^t + b^t) - (a+b)^t$ via our SoS triangle inequality; applying this with $a = (\mu - X_i)$ and $b = -(\mu_S - X_i)$ we obtain

$\mathcal{A} \vdash_{O(t)} \left ( \sum_{i \in S} w_i \left [ (\mu - X_i) - (\mu_S - X_i) \right ] \right ) ^t$

$\leq 2^t \left ( \sum_{i \in S} w_i^2 \right )^{t-1} \cdot \sum_{i \in S} w_i^2 (\mu - X_i)^t + w_i^2 (\mu_S - X_i)^t.$

We can add the sum of squares $2^t \left ( \sum_{i \in S} w_i^2 \right )^{t-1} \cdot \sum_{i \notin S} w_i^2 (\mu - X_i)^t + w_i^2 (\mu_S - X_i)^t$ to obtain

$\mathcal{A} \vdash_{O(t)} \left ( \sum_{i \in S} w_i \left [ (\mu - X_i) - (\mu_S - X_i) \right ] \right )^t$

$\leq 2^t \left ( \sum_{i \in S} w_i^2 \right )^{t-1} \cdot \sum_{i \in [n]} w_i^2 (\mu - X_i)^t + w_i^2 (\mu_S - X_i)^t.$

Using the equations $w_i^2 - w_i$, which, as in the SoS Boolean inequalities fact can be used to prove $w_i^2 \leq 1$, we obtain

$\mathcal{A} \vdash_{O(t)} \left (\sum_{i \in S} w_i \left [ (\mu - X_i) - (\mu_S - X_i) \right ] \right )^t$

$\leq 2^t \left ( \sum_{i \in S} w_i^2 \right )^{t-1} \cdot \sum_{i \in [n]} w_i (\mu - X_i)^t + (\mu_S - X_i)^t.$

Finally using $\mathbb{E}_{i \in S}(\mu_S - X_i)^t \leq 2 \cdot t^{t/2}$ and $\mathcal{A} \vdash_{O(t)} \sum_{i \in [n]} w_i (X_i - \mu)^t \leq 2 \cdot t^{t/2} \cdot N$, we get

$\mathcal{A} \vdash_{O(t)} \left ( \sum_{i \in S} w_i \left [ (\mu - X_i) - (\mu_S - X_i) \right ] \right )^t$

$\leq 2^{O(t)} \cdot \left ( \sum_{i \in S} w_i^2 \right )^{t-1} \cdot t^{t/2} \cdot N.$

Last of all, using $w_i^2 - w_i = 0$ to simplify the term $\sum_{i \in S} w_i^2$,

$\mathcal{A} \vdash_{O(t)} \left ( \sum_{i \in S} w_i \left [ (\mu - X_i) - (\mu_S - X_i) \right ] \right )^t$

$\leq 2^{O(t)} \cdot \left ( \sum_{i \in S} w_i \right )^{t-1} \cdot N \cdot t^{t/2}.$

The fact follows by rearranging. QED.

## SoS proof of Holder’s inequality

The last thing we haven’t proved is our SoS Holder’s inequality. We will need an SoS Cauchy-Schwarz inequality to prove it.

SoS Cauchy-Schwarz.
Let $x_1,\ldots,x_n,y_1,\ldots,y_n$ be indeterminates. Then

$\vdash_2 \left ( \sum_{i \in [n]} x_i y_i \right )^2 \leq \left ( \sum_{i \in [n]} x_i^2 \right ) \left ( \sum_{i \in [n]} y_i^2 \right ).$

Proof of SoS Cauchy-Schwarz.
It is not hard to check that

$\left ( \sum_{i \in [n]} x_i^2 \right ) \left ( \sum_{i \in [n]} y_i^2 \right ) - \left ( \sum_{i \in [n]} x_i y_i \right )^2 = \sum_{i,j \in [n]} (x_i y_j - x_j y_i)^2$

which is a sum of squares. QED.

Proof of SoS Holder’s inequality.
We start with the case $t=2$. Using SoS Cauchy-Schwarz, we obtain

$\vdash_{2\ell + 2} \left ( \sum_{i \in [n]} w_i \cdot p_i(w) \right)^2 \leq \left ( \sum_{i \in [n]} w_i^2 \right) \cdot \left ( \sum_{i \in [n]} p_i(w)^2 \right).$

This follows from our SoS Cauchy-Schwarz inequality by substituting $w_i$ for $x_i$ and $p_i(w)$ for $y_i$; we proved a fact about a sum of squares polynomial in $x,y$ which implies a corresponding fact about a sum of squares in variables $w_i$ and $p_i(w)$. The latter in turn is a sum of squares in $w$.

To finish the first half of the $t=2$ case, we just need to replace $w_i^2$ with $w_i$ on the right-hand side. By adding the polynomials $(w_i^2 - w_i) \cdot \left ( - \sum_{i \in [n]} p_i(w)^2 \right )$ via the equations $\mathcal{W}$, we obtain

$\mathcal{W} \vdash_{2\ell +2} \left ( \sum_{i \in [n]} w_i \cdot p_i(w) \right )^2 \leq \left ( \sum_{i \in [n]} w_i \right ) \cdot \left ( \sum_{i \in [n]} p_i(w)^2 \right ).$

To establish the second inequality for the $t=2$ base case, we start by again adding multiples of $(w_i^2 - w_i)$ to get

$\mathcal{W} \vdash_{2\ell +2} \left ( \sum_{i \in [n]} w_i \cdot p_i(w) \right )^2 = \left ( \sum_{i \in [n]} w_i^2 \cdot p_i(w) \right )^2.$

Then the inequality follows from Cauchy-Schwarz and again adding some multiples of $(w_i^2 - w_i)$.

Now it’s time for the induction step. We can assume

$\mathcal{W} \vdash_{O(t/2 \cdot \ell)} \left ( \sum_{i \in [n]} w_i \cdot p_i(w) \right )^{t/2} \leq \left ( \sum_{i \in [n]} w_i \right )^{t/2-1} \cdot \left ( \sum_{i \in [n]} w_i p_i(w)^{t/2} \right ).$

By again adding multiples of $w_i^2 - w_i$, we obtain

$\mathcal{W} \vdash_{O(t/2 \cdot \ell)} \left ( \sum_{i \in [n]} w_i \cdot p_i(w) \right )^{t/2} \leq \left ( \sum_{i \in [n]} w_i^2 \right )^{t/2-1} \cdot \left ( \sum_{i \in [n]} w_i^2 p_i(w)^{t/2} \right. )$

Now both sides are sums of squares. So, by squaring, we find

$\mathcal{W} \vdash_{O(t \cdot \ell)} \left ( \sum_{i \in [n]} w_i \cdot p_i(w) \right )^{t} \leq \left ( \sum_{i \in [n]} w_i^2 \right )^{t-2} \cdot \left ( \sum_{i \in [n]} w_i^2 \cdot p_i(w)^{t/2} \right )^2.$

The proof is finished by applying Cauchy-Schwarz to the last term and cleaning up by adding multiples of $w_i^2 - w_i$ as necessary. QED.

In the next post, we will use Fact 2 to deduce an SoS version of Lemma 1 (from part 1). Subsequently, we will finish designing an algorithm for one-dimensional clustering, proving Theorem 1 (from part 1) in the one-dimensional case. Then we will get high dimensional.

#### Footnotes

1. Suspicious readers may note that our original Gaussian mixture model assumed that the population means are $\Delta$-separated. Because we will draw enough samples that the empirical mean of each cluster is very close to the true (population) mean, this difference can be ignored for now and is easy to handle when we put everything together to analyze our final algorithm.

2. To make them look even more alike of course we could have introduced notations like $\mathbb{E}_{i \sim T} (X_i - \mu)^t$, but for concreteness we are keeping the $w$ variables at least a little explicit.

I am excited to (temporarily) join the Windows on Theory family as a guest blogger!

This is the first post in a series which will appear on Windows on Theory in the coming weeks. The aim is to give a self-contained tutorial on using the Sum of Squares algorithm for unsupervised learning problems, and in particular in Gaussian mixture models. This will take several posts: let’s get started.

In an unsupervised learning problem, the goal is generally to recover some parameters $\theta \in \mathbb{R}^d$ given some data $X_1,\ldots,X_n \sim P(X \, | \, \theta)$, where $P$ is a probability distribution on, say, $\mathbb{R}^p$ which is parameterized by $\theta$. The goal is to estimate $\theta$ by some estimator $\hat{\theta}(X_1,\ldots,X_n)$ which (a) does not require too many samples and (b) can be computed from those samples in polynomial time. This basic setup can be instantiated (sometimes with minor adaptations) to capture numerous important problems in statistics and machine learning: a few examples are

•  component analysis and its many variants (PCA, ICA, Sparse PCA, etc.)
• Netflix problem / matrix completion / tensor completion
• dictionary learning / blind source separation
• community detection and recovery / stochastic block models
• many clustering problems

Excellent resources on any of these topics are just a Google search away, and our purpose here is not to survey the vast literature on unsupervised learning, or even on provable “TCS-style” algorithms for these problems. Instead, we will try to give a simple exposition of one technique which has now been applied successfully to many unsupervised learning problems: the Sum of Squares method for turning identifiability proofs into algorithms.

Identifiability is a concept from statistics. If one hopes for an algorithm which recovers parameters $\hat{\theta}(X_1,\ldots,X_n) \approx \theta$, it must at least be true that $X_1,\ldots,X_n$ uniquely (or almost uniquely) determine $\theta$, with high probability: when this occurs we say $\theta$ is identifiable from the samples $X_1,\ldots,X_n$.

Classically, identifiability is often proved by analysis of a (typically) inefficient brute-force algorithm. First, one invents some property $Q(X_1,\ldots,X_n)$ of $\theta$. For example, in a maximum-likelihood argument, one shows that $Pr(X_1,\ldots,X_n \, | \, \theta) > p$ for some $p$. Then, often via a concentration-of-measure argument, one shows that no $\theta'$ far from $\theta$ satisfies property $Q$. In the maximum-likelihood example, this would entail showing that if $\theta'$ is far from $\theta$ then $Pr(X_1,\ldots,X_n \, | \, \theta') < p$.

An identifiability argument like this typically has no implications for computationally-efficient algorithms, since finding $\theta$ which satisfies $Q$ often appears to require brute-force search. Thus, often when the investigation turns to efficient algorithms, the identifiability argument is abandoned and more explicitly algorithmic techniques are brought to bear: convex relaxations, spectral methods, and even heuristic methods.

The method we present here for designing computationally-efficient algorithms begins with a return to identifiability proofs. The main insight is that if both

• the property $Q$ and, more importantly,
• the proof that any $\theta'$ satisfying $Q$ must be close to $\theta$

are sufficiently simple, then identifiability of $\theta$ from $X_1,\ldots,X_n$ does imply the existence of an efficient algorithm to (approximately) recover $\theta$ from $X_1,\ldots,X_n$!

For us, simple has a formal meaning: the proof should be captured in the low-degree Sum of Squares proof system.

The algorithms which result in the end follow a familiar recipe: solve some convex relaxation whose constraints depend on $X_1,\ldots,X_n$ and round it to obtain $\hat{\theta}(X_1,\ldots,X_n)$. But the design and analysis of this relaxation are heavily informed by the simple identifiability proof described above. In particular, the convex programs which result will not be familiar relaxations of maximum likelihood problems!

In this series of blog posts, we are going to carry out this strategy from start to finish for a classic unsupervised learning problem: clustering mixtures of Gaussians. So that we can get down to business as quickly as possible, we defer a short survey of the literature on this “proofs-to-algorithms” method to a later post.

## Mixtures of Gaussians

Gaussian mixtures are classic objects in statistics, dating at least to Pearson in 1894. The basic idea is: suppose you have a data set $X_1,\ldots,X_n \in \mathbb{R}^d$ which was generated in a heterogeneous fashion: some points were sampled from probability distribution $\mathcal{D}_1$, some from $\mathcal{D}_2$, and so on up to $\mathcal{D}_k$, but you do not know which points came from which distributions. Under what settings can you cluster the points by which distribution they came from, and perhaps also recover some properties of those distributions, such as their means $\mu_i = \mathbb{E}_{X \sim \mathcal{D}_i} X$?

Pearson, in 1894, was faced with a collection of body measurements of crabs. The distribution of one such attribute in the data — the ratio of forehead length to body width — curiously deviated from a Gaussian distribution. Pearson concluded that in fact two distinct species of crabs were present in his data set, and that the data should therefore be modeled as a mixture of two Gaussians. He invented the method of moments to discover the means of both the Gaussians in question.1 In the years since, Gaussian mixtures have become a fundamental statistical modeling tool: algorithms to fit Gaussian mixtures to data sets are included in basic machine learning packages like sklearn.

A mixture of $2$ Gaussians in $\mathbb{R}^2$.2

Here is our mixture of Gaussians model, formally.

Mixtures of separated spherical Gaussians:
Let

• $\Delta > 0$.
• $\mu_1,\ldots,\mu_k \in \mathbb{R}^d$ be such that $\|\mu_i - \mu_j\| \geq \Delta$ for every $i \neq j$.
• $\mathcal{N}_1(\mu_1,I),\ldots,\mathcal{N}_k(\mu_k,I)$ be $d$-dimensional spherical Gaussians, centered at the means $\mu_i$.
• $X_1,\ldots,X_n \in \mathbb{R}^d$ be iid samples, each drawn by selecting $j \sim [k]$ uniformly, then drawing $X \sim \mathcal{N}(\mu_j, I)$.
• $S_1,\ldots,S_k$ be the induced partition of $[n]$ into $k$ parts, where $i \in S_j$ if samples $X_i$ was drawn from $\mathcal{N}(\mu_j, I)$

The problem is: given $X_1,\ldots,X_n$, output a partition $T_1,\ldots,T_k$ of $[n]$ into $k$ parts, each of size $n/k$, such that (up to renaming the clusters $1,\ldots,k$),

$|S_i \cap T_i| \geq (1 - \delta) \cdot \frac nk$

for each $i \in [k]$ and some small number $\delta > 0$.

Another mixture of 2 Gaussians. The means have Euclidean distance about 3.5.

To avoid some minor but notationally annoying matters, we are going to work with a small variant of the model, where we assume that exactly $n/k$ samples $X_i$ came from each Gaussian $\mathcal{N}(\mu_j, I)$. We call a mixture like this “equidistributed”.

We will work up to a proof of this theorem, which was proved recently (as of Fall 2017) in 3 independent works.

Main Theorem (Hopkins-Li, Kothari-Steinhardt, Diakonikolas-Kane-Stewart):
For arbitrarily-large $t \in \mathbb{N}$ there is an algorithm requiring $n = d^{O(t)} k^{O(1)}$ samples from the equidistributed mixtures of Gaussians model and running in time $n^{O(t)}$ which outputs a partition $T_1,\ldots,T_k$ of $[n]$ into parts of size $N = n/k$ such that with high probability,

$\displaystyle \frac{|S_i \cap T_i|}{N} \geq 1 - k^{10} \cdot \left ( \frac{C \sqrt t}{\Delta} \right )^{t}$

for some universal constant $C$.3

In particular:

• If $\Delta = k^\epsilon$ for some $\epsilon > 0$, then by choosing $t = 100/\epsilon$ the algorithm recovers the correct clustering up to $1/poly(k)$ errors in $poly(k,d)$ time with $poly(k,d)$-many samples.
• If $\Delta = C \sqrt{\log k}$ for a large-enough universal constant $C$, then choosing $t = O(\log k)$ gives a quasipolynomial-time algorithm (using quasipolynomially-many samples) to recover clusters up to $1/poly(k)$ error.4

One (rather weak) consequence of the main theorem is that, for $n = d^{O(t)} k^{O(1)}$ samples, there is enough information in the samples to determine the underlying clustering, up to error $\delta(t,\Delta) = \tfrac{2^{O(t)} \cdot t^{t/2} \cdot k^{10}}{\Delta^t}$. Our strategy to prove the main theorem will start with proving the latter statement independently — as we have discussed, such an argument is often called a proof of identifiability because we say that the clusters are identifiable from the samples (up to $\delta(t,\Delta)$ errors).

While identifiability itself does not carry immediate algorithmic consequences, our proof of identifiability will be somewhat special: it will be simple in a formal sense, namely, that it will be captured by a formal proof system of restricted power. This simplicity of the proof of identifiability will almost immediately imply the main theorem: the construction of a computationally-efficient algorithm from a simple proof of identifiability is the heart of the proofs-to-algorithms method.

## Identifiability proof: 1 dimension

Our eventual goal is to work up to a proof in the low-degree Sum of Squares proof system that clusters $S_1,\ldots,S_k$ are identifiable from samples $X_1,\ldots,X_n$ from a mixture of Gaussians. Since we have not yet defined low-degree Sum of Squares proofs, for now we will focus on constructing an identifiability proof which avoids mathematical facts which we deem “too complicated”. In particular, we will avoid any Chernoff/union bound style arguments.

To get to the heart of the matter it helps to simplify the setting. Our first simplification is to restrict attention to the $d = 1$ case, so that distributions $\mathcal{N}(\mu_i,1)$ are univariate Gaussians with unit variance.

Before stating our first lemma, let’s discuss the key property of a collection $Y_1,\ldots,Y_m$ of samples from a Gaussian $\mathcal{N}(0,1)$ which we will use. Recall that if $Y \sim \mathcal{N}(0,1)$ is a standard Gaussian, then for every $t \in \mathbb{N}$,

$\mathbb{E} |Y|^t \leq t^{t/2}.$

If $Y_1,\ldots,Y_m$ are samples from $Y$, then for $m = m(t)$ large enough, the empirical distribution of $Y_1,\ldots,Y_m$ inherits this property, up to some small fluctuations. Namely, with high probability we would have

$\mathbb{E}_{i \sim [m]} |Y_i|^t \leq 1.1 \cdot t^{t/2}.$

(We could have replaced $1.1$ by any small constant greater than $1$.) Here, the notation $i \sim [m]$ means that an index $i$ is chosen uniformly among $\{1,\ldots,m\}$.

If $Y \sim \mathcal{N}(\mu,1)$ for some $\mu \in \mathbb{R}$, then the same discussion applies immediately to $\mathbb{E}|Y - \mu|^t$ and $\mathbb{E}_{i \sim [m]} |Y_i - \mu|^t$. But even more is true: if $\overline{\mu}$ is the empirical mean of $Y_1,\ldots,Y_m$ (that is, $\overline{\mu} = \mathbb{E}_{i \sim [m]} Y_i$), then with high probability the $t$-th centered empirical moment also inherits the same bound:

$\mathbb{E}_{i \sim [m]} |Y_i - \overline{\mu}|^t \leq 1.1 \cdot t^{t/2}.$

In the Gaussian mixture setting, so long as enough samples are drawn from each Gaussian $\mathcal{N}(\mu_i, 1)$, each cluster will have $t$-th empirical moments satisfying the above bound (with high probability).

In our identifiability proof, we choose to forget that the samples $X_1,\ldots,X_n$ were drawn from Gaussians, and we remember only that they break up into underlying clusters, each of which satisfies that empirical moment bound. We do not even remember the “true” means of each Gaussian; instead we talk only about the empirical means. We will show that no clustering far from that underlying ground-truth clustering results in clusters which satisfy the empirical moment bound.

Lemma 1. Let $X_1,\ldots,X_n \in \mathbb{R}$. Let $S_1,\ldots,S_k$ be a partition of $[n]$ into $k$ pieces of size $N = n/k$ such that for each $S_i$, the collection of numbers $\{X_j\}_{j \in S_i}$ obeys the following moment bound:

$\mathbb{E}_{j \sim S_i} |X_j - \mu_i|^t \leq 2 \cdot t^{t/2}$

where $\mu_i$ is the average $\mathbb{E}_{j \sim S_i} X_j$ and $t$ is some number in $\mathbb{N}$. Let $\Delta > 0$ be such that $|\mu_i - \mu_j| \geq \Delta$ for every $i \neq j$. Suppose $t$ is large enough that $10 \sqrt t k^{1/t} \leq \Delta$.

Let $S \subseteq [n]$ have size $|S| = N = n/k$ and be such that $\{X_i\}_{i \in S}$ obey the same moment-boundedness property:

$\mathbb{E}_{j \sim S} |X_j - \mu_S|^t < 2 \cdot t^{t/2}$

for the same $t \in \mathbb{N}$, where $\mu_S$ is the mean $\mu_S = \mathbb{E}_{j \sim S} X_j$. Then there exists an $S_i$ such that

$\displaystyle \frac{|S \cap S_i|}{N} \geq 1 - k \cdot \left ( \frac{C \sqrt{t}}{\Delta} \right ) ^t.$

for some universal constant $C$.

How do we interpret Lemma 1 as a statement of cluster identifiability? The lemma implies that the clusters are, up to $\delta(t,\Delta)$ errors, the only subsets of $[n]$ of size $n/k$ which satisfy the $t$-th moment bound. This is our property $Q$, like we discussed earlier in this post. The true clustering $S_1,\ldots,S_k$ satisfies $Q$ (i.e. if you group $X_1,\ldots,X_n$ by this ground-truth clustering, the resulting clusters will have bounded empirical $t$-th moments), and every clustering which satisfies this bounded $t$-th moment property must be close to the true clustering. Thus, the correct clusters could be identified by searching for subsets of $[n]$ which satisfy the $t$-th moment bound (never mind that such a search would naively require about $2^n$ time).

We said that to use the sum of squares method to turn our identifiability proof into an algorithm, both the property $Q$ and the proof of identifiability need to be simple.

This $t$-th moment boundedness property will turn out to be simple enough. What about the proof of Lemma 1? By the end of this post we will prove Lemma 1 in a way which uses only Holder’s inequality for the $t$ vs $\tfrac t {t-1}$ norms and the triangle inequality for the $t$-norm. Later, we will show that these inequalities are simple in the correct formal sense: they are captured by a proof system of restricted power.

Our proof of Lemma 1 relies on the following key fact.

Fact 1. Let $S,S' \subseteq \mathbb{R}$ have $|S| = |S'| = N$. Let $X$ denote a uniform sample from $S$ and similarly for $X'$. Let $\mu = \mathbb{E} X$ and $\mu' = \mathbb{E} X'$. Suppose $X,X'$ satisfy the $t$-th moment bound

$\mathbb{E} |X - \mu|^t \leq 2 \cdot t^{t/2} \text{ and } \mathbb{E}|X' - \mu'|^t \leq 2 \cdot t^{t/2}.$

Then

$|\mu - \mu'| \leq 4 \sqrt t \cdot \left ( \frac{|S \cap S'|}{N} \right )^{-1/t}.$

A slightly more general interpretation of this fact is that a pair of random variables $X,X'$ on $\mathbb{R}$ which have bounded $t$-th moments and whose total variation distance $TV(X,X') \leq 1-\alpha$ cannot have means which are too far apart: $|\mathbb{E} X - \mathbb{E} X'| \leq 4 \sqrt t / \alpha^{1/t}$.

Proof of Fact 1.
Let $\alpha = |S \cap S'|/N$. Observe that there is a coupling of the random variables $X,X'$ so that $Pr(X = X') = \alpha$. The coupling chooses with probability $\alpha$ to select a uniform sample $Y \sim S \cap S'$, then lets $X = X' = Y$. With probability $1-\alpha$, the random variable $X$ is a uniform sample from $S \setminus S'$ and similarly for $X'$.

Let $(X,X')$ be a coupled pair of samples. We expand a quantity related to the one we want to bound, and then apply Holder’s inequality with the $t$ and $\tfrac t {t-1}$ norms. (The underlying inner product space assigns functions $f, g : (X,X') \mapsto \mathbb{R}$ the inner product $\mathbb{E}_{(X,X')} f(X,X') \cdot g(X,X')$.)

$\alpha \cdot |\mu - \mu'| = \mathbb{E}_{(X,X')} \left [ \mathbf{1}_{X = X'} \cdot |(\mu - X) - (\mu' - X')| \right ]$

$\leq \left ( \mathbb{E} (\mathbf{1}_{X = X'})^{t/(t-1)} \right )^{\tfrac {t-1} t} \cdot \left ( \mathbb{E} |(\mu - X) - (\mu' - X')|^t \right )^{1/t}$

$= \alpha^{1-1/t} \cdot \left ( \mathbb{E} |(\mu - X) - (\mu' - X')|^t \right)^{1/t}.$

Now we can apply the triangle inequality for the $t$-norm to the last term, followed by our $t$-th moment assumptions.

$\left ( \mathbb{E} |(\mu - X) - (\mu' - X')|^t \right )^{1/t} \leq \left ( \mathbb{E} |\mu - X|^t \right )^{1/t} + \left ( \mathbb{E} |\mu' - X'|^t \right )^{1/t} \leq 4 \sqrt t$

Putting everything together, we get $|\mu - \mu'| \leq \tfrac {4 \sqrt t}{\alpha^{1/t}}$. QED.

Keeping in mind our eventual goal of constructing a low-degree Sum of Squares proof, we record the observation that the only inequalities we required to prove Fact 1 were the $t$ vs. $\tfrac t {t-1}$ Holder’s inequality and the triangle inequality for the $t$-norm.

Armed with Fact 1, we can prove Lemma 1.The main idea in the proof is that if $S_1$ and $S_2$ are the two clusters with greatest intersection with $S$, then $\mu_S$ can only be close to one of $\mu_1, \mu_2$.

Proof of Lemma 1.
Let $S \subseteq [n]$ have size $|S| = n/k = N$, with mean $\mu_S = \mathbb{E}_{i \sim S} X_i$ and $t$-th moment bound $\mathbb{E}_{i \sim S} |X_i - \mu_S|^t \leq 2t^{t/2}$. Without loss of generality, order the clusters so that $S_1$ has largest intersection with $S$, then $S_2$, and so on: that is $|S \cap S_1| \geq |S \cap S_2| \geq \ldots \geq |S \cap S_k|$. If $|S \cap S_1| = (1 -\delta)N$, then $|S \cap S_2| \geq \tfrac \delta k N$, just by counting.

Since $|\mu_1 - \mu_2| \geq \Delta$, either $|\mu_1 - \mu_S| \geq \Delta/2$ or $|\mu_2 - \mu_S| \geq \Delta/2$. We claim it must be the second. Using Fact 1,

$|\mu_1 - \mu_S| \leq \frac{4 \sqrt t}{(1 - \delta)^{1/t}} \leq 4 \sqrt t \cdot k^{1/t} < \Delta/2$

since certainly $(1-\delta) \geq \tfrac 1k$, and we assumed $10 \sqrt t k^{1/t} \leq \Delta$. We conclude that $|\mu_2 - \mu_S| \geq \Delta/2$.

We have obtained $\tfrac{|S \cap S_2|}{N} \geq \tfrac \delta k$ and $|\mu_2 - \mu_S| \geq \Delta/2$. Putting these together with Fact 1, we find

$\frac \Delta 2 \leq |\mu_2 - \mu_S| \leq 4 \sqrt t \cdot \left ( \frac k \delta \right )^{1/t}.$

Rearranging, this reads $\delta \leq \frac{2^{O(t)} t^{t/2} k}{\Delta^t}.$ QED.

This concludes our one-dimensional identifiability proof, which will form the core of our proof of Theorem 1. In our next post we will lift this proof to a Sum of Squares proof (for which we will need to define Sum of Squares proofs). After that, with a Sum of Squares proof in hand, we will finish designing our mixture of Gaussians algorithm for the one-dimensional case. Then we will show that the same ideas, nearly unchanged, imply that the algorithm works in higher dimensions.

### Thanks

Many thanks to Boaz Barak, David Steurer, and especially Daniel Freund for their helpful remarks on early drafts of this post.

#### Footnotes

1. Moritz Hardt on Gaussian mixtures and Pearson’s approach

2. Image credit: Mathematica Stack Exchange

3. To see how to apply the ideas in this tutorial to a much broader class of clustering problems, see my joint paper with Jerry Li and the recent paper of Pravesh Kothari and Jacob Steinhardt.

4. Before these recent works, the best polynomial-time algorithms for the clustering mixtures of Gaussians could not tolerate any $\Delta < k^{1/4}$ (when $\Delta \geq k^{1/4}$ a simple greedy algorithm can be shown to solve the clustering problem to high accuracy). On the other hand, known lower bounds show that when $\Delta = C \sqrt{\log k}$, clustering is impossible (even using exponential time) with $k^{O(1)}$ samples, so one cannot hope to improve the guarantees of this theorem too much further [Regev-Vijayaraghavan]. (That said, reducing the sample complexity and running time to $poly(d,k)$ when $\Delta = C \sqrt{\log k}$ is a fascinating open problem.)

Variants of this theorem, which may be found in all three of the sources listed, offer algorithms which additionally output estimates of the means $\mu_1,\ldots,\mu_k$, work for many distributions besides Gaussians (without even knowing the underlying distribution!), and tolerate some amount of advesarial corruption among the samples $X_1,\ldots,X_n$. We note also that the theorems in those works handle the usual mixtures of Gaussians problem, rather than the equidistributed version, and can tolerate non-uniform mixtures; i.e. those where some clusters are much smaller than others.

(In this post I am following the venerable tradition of bloggers opining about matters on which they don’t really know much about. I hope I learn something from the feedback –Boaz).

Nothing is impossible,
Child, nothing is impossible.
Every bridge is crossable.
Every tooth is flossable.
Every win is lossable.
Every worker’s bossable.
Every yak’s a lhasa bull.
Nothing is impossible,
Child, nothing is impossible.

Okay, teacher, can you name something that ISN’T possible?

No, Child. Nothing is impossible.

So, there IS something that’s impossible. Naming something that’s impossible is impossible.

(From “The Teacher and The Child”by Chris Harris)

In this world where reasoned arguments and respect for facts seem increasigly rare, some people are actually worried about the opposite problem of “intelligence explosion”.
Recently, through a blog post of Scott Aaronson, I came across this essay by François Chollet. Given Scott’s less than raving review, I fully expected not to like it, but actually found myself agreeing with some parts of this essay (though not all, and in particular not with the title).

The basic fear of “intelligence explosion” is that:

1. Once we develop sufficiently advanced artificial intelligence, it will go on to use this intelligence to build better and better AI, quickly leaving us behind.
2. The AI will develop some form of consciousness and rather than using their intelligence to make our lives better, will be about as kind to us as we were to the Neanderthals.

Consciousness is a hard concept to define. Humans used to attribute consciousness to the weather, praying to the sun and sea gods. After all it made perfect sense, the weather is unpredictable and dangerous, and seemed to vary in capricious ways. As we have grown to understand weather better, we no longer think that it is conscious. Yes, there is still an element of unpredictability and randomness to it, but we do understand the basic mechanisms at play, and can place some probabilities or bounds on its behavior. So these days the weather is stochastic rather than adversarial.
Arthur C. Clarke famously said that “Any sufficiently advanced technology is indistinguishable from magic.”. Similarly, one can say that any system that is sufficiently adversarial is indistinguishable from being conscious.

If we follow this definition, then we have already created conscious machines, since we can definitely already create technologies that we understand so poorly that we cannot model it in any way other than adversarial. In some sense this is the underlying lesson of results such as the Halting problem and NP completeness. Moreover, ever since the Atom bomb, mankind’s potential to damage the planet and ourselves has gone far beyond what nature can do (and of course, as witnessed by global warming, nuclear energy is not the only technology that can affect the whole planet). Also, as anyone getting “on the air updates” for their gadgets can attest to, we already have systems that continuously improve over time, often with a feedback loop. With more and more of the world’s economy becoming dependent on the transfer of information as opposed to goods (which of course is somewhat of an artificial distinction), the speed of progress has become much faster. So, if the question is whether we should worry about us developing and deploying technology whose behavior we can’t completely predict, and one that could result in very bad consequences, then I am with the alarmists.

What I find less appealing about the “AI risk” position is the focus on notions such as “intelligence” and “conciousness”. There is already an algorithm outperforming most humans on IQ tests and surely soon there will be an algorithm with an IQ of 300, or whatever the maximum possible value is. However, as Chollet points out, while some individual humans have had profound influence on the course of humanity, it is typically not intelligence alone that helped them (see e.g. the rather sad story of William James Sidis). That said, one can’t dispute that in the 200K years we’ve been around, Homo Sapiens have managed to make some significant progress, and so if you could simulate a population of even average humans, but do it with more people and larger speed, you’d speed up scientific discovery. Indeed (and this is where I part ways with Chollet) scientific progress has been accelerating, precisely because we use scientific progress to make more progress, whether it’s computer simulations helping in doing physics or quantum physics helping us build new computers, and so on and so forth. We are likely to continue to progress at an accelerating pace, not by trying to simulate a population of average or genius humans, but rather by continuously applying our tools and understanding to build better tools and improve our understanding.

But all of the above does not mean that modelling computation and communication systems as a new species and anthropomorphizing them is helpful. Research can and should be done on trying to verify that the behavior of computing systems does not deviate from certain parameters, whether it is Windows 10 or an AI algorithm.
With the progress of time computers are likely to continue to do more and more tasks currently associated with human intelligence, and yes, we do have a nontrivial chance of creating a technology that may eventually destroy us. But I don’t see why thinking of algorithms in anthropomorphic terms is helpful any more than thinking of the weather in terms of deities. If anything, understanding human “intelligence” and “consciousness” in more algorithmic ways  seems like a better path forward.

[Guest post by Robi Krauthgamer;  note that there is no conflict in nominating the same work/person to be highlighted in both HALG and TheoryFest. –Boaz]

Call for Nominations

3rd Highlights of Algorithms conference (HALG 2018)

Amsterdam, June 4-6, 2018

The HALG 2018 conference seeks high-quality nominations for invited talks that will highlight recent advances in algorithmic research. Similarly to previous years, there are two categories of invited talks:

A. survey (60 minutes): a survey of an algorithmic topic that has seen  exciting developments in last couple of years.

B. paper (30 minutes): a significant algorithmic result appearing in a paper in 2017 or later.

To nominate, please email  halg2018.nominations@gmail.com  the following information:

1. Basic details: speaker name + topic (for survey talk) or paper’s title, authors, conference/arxiv + preferable speaker (for paper talk).

2. Brief justification: Focus on the benefits to the audience, e.g., quality of results, importance/relevance of topic, clarity of talk, speaker’s presentation skills. Pay attention to potentially non-obvious information, e.g., the topic might seem out of scope, or the material seems inadequate for one talk.

All nominations will be reviewed by the Program Committee (PC) to select speakers that will be invited to the conference.

Nominations deadline: December 12, 2017 (for full consideration).

Please keep in mind that the conference does not provide financial support for the speakers.

Best regards,

Robert Krauthgamer,

HALG 2018 PC chair.

In his State of the Union address in January 1984, president Ronald Reagan announced that he directed his treasury secretary to simplify and reform the U.S., tax code. Thus began a process of 1.5 years until in June 1985, the house Ways and Means committee began formal discussion on the bill, which it voted on November 1985, and after a yearlong process in Congress, the bill was signed into law by president Reagan on October 1986.

In contrast, the Republican party is currently “desperate for a win” and is trying to move a massive tax reform involving about 8 trillion dollars (cutting about 5 trillion dollars of taxes and partially paying for by increasing about 3.5 trillion dollars in other taxes) in a very short period of time.

When such huge decisions are done in haste, it’s likely to cause “collateral damage”. In particular, universities and academic research, which have been a major engine of growth in the U.S. economy, have been targeted by this ostensibly “pro growth” tax plan as a way to finance its cuts in other places. There are several provisions in it that are particularly harmful to universities, including taxing endowments,  eliminating student loan interest deductions, and considering graduate student tuition waivers as taxable income.

As Luca says, if you are a U.S. citizen, and particularly if you have a Republican congressional representative, please contact him or her and make your voice heard.

(This blog post is in the form of a Jupyter notebook. See here for an arguably better formatted version, and here for the version with the omitted code, this (Beta version) website allows you to see the code “live” without needing to install Jupyter on your machine.)

Quantum computing is one of the most exciting developments of computer science in the last decades. But this concept is not without its critics, often known as “quantum computing skeptics” or “skeptics” for short. The debate on quantum computing can sometimes confuse the physical and mathematical aspects of this question, and so in this essay I try to clarify those. Following Impagliazzo’s classic essay, I will give names to scenarios or “potential worlds” in which certain physical or mathematical conditions apply.

## Potential worlds

Superiorita is the world where it is feasible to build scalable quantum computers, and these computers have exponential advantage over classical computers. That is, in superiorita there is no fundamental physical roadblock to building large quantum computers, and hence the class BQP is a good model of computation that is physically realizable. More precisely, in superioriata the amount of resources (think dollars) that is required in order to simulate a $T$-gate quantum circuit grows at most polynomially or maybe even linearly (with not-too-terrible constants) in $T$.

The other aspect of Superiorita is the mathematical conjecture that quantum computers offer exponential advantage over classical ones. That is, that there are functions computable by the mathematical model of (uniform) quantum circuits that require exponential time to compute by Turing machines. (In complexity jargon, this is the conjecture that $BQP \not\subseteq SUBEXP$ where the latter stands for the class $TIME(2^{n^{o(1)}})$.) Integer factoring is one problem that is conjectured to lie in $BQP \setminus SUBEXP$ (i.e., where quantum computers have an exponential advantage). One can also consider analogous conjectures for sampling problems, and some particular sampling tasks that can be achieved in quantum polynomial time have been conjectured as requiring exponential time for probabilistic Turing machines.

Superiorita is the world in which most quantum computing researchers think we live in, and, judging by the hundreds of millions of dollars of investments, many commercial companies and funding agencies as well. Note that this is a mix of both a physical assumption (that the model of $BQP$ can be physically realized) and a mathematical assumption (that this model offers exponential speedup over classical machines). Without assuming both the physical and mathematical aspects of superiorita there would be no justification for investing huge efforts in building quantum computers.

In Superiorita quantum computers are not a panacea and in particular they can’t solve NP complete problems. Let me not wage into the (hugely important!) question of whether in Superiorita the Lattice Shortest Vector Problem is in $BQP$ or not (see my essay  for more on this topic). Also, even if one  believes we live in Superiorita, whether or not the particular problems on which quantum computing offer exponential speedup are interesting is a matter of taste. As far as I know, factoring large integers is not inherently interesting in its own right, and once the world moves to different encryption standards, the applications to breaking encryption will eventually disappear. However, there are other tasks where quantum computers seem to provide exponential speedups and that can be interesting in their own right in areas such as chemistry and machine learning (though one should read the fine print).

Popscitopia is the “hyper superiorita” world where quantum computers can solve NP complete problems. That is, in popscitopia quantum computers can be built, and $NP \subseteq BQP$. This is the world that is described by some popular accounts of quantum computers as being able to “run exponentially many parallel computations at once”, a belief that is prevalent enough that Scott Aaronson devotes the tagline of his blog to refuting it. Most researchers in the area believe that, regardless of whether quantum computers can be physically be built, they cannot solve $NP$-complete problem (a belief which is essential to so called “post quantum cryptography”), and indeed so far we have no reason to think quantum computers off exponential (or even better than quadratic) speedup for such problems. But, we have no proof that this is the case, and indeed, some TCS researchers, as Richard Lipton, have suggested that even $NP = P$ (which in particular implies $NP \subseteq BQP$) might be true.

Skepticland is the world where it is not possible to build scalable quantum computers, though mathematically they do offer an exponential advantage. That is, in Skepticland, for every function $F$ (and more generally a promise problem or a sampling problem) that can be computed using $T$ amount of physical resources, there is a probabilistic Boolean circuit of size polynomial in $T$ that computes $F$ as well. However, mathematically, like in Superiorita, it is still the case in Skepticland that $BQP$ contains functions (such as integer factoring) that require exponential time to be computed classically.

Skepticland is the world that “quantum computing skeptics” such as Gil Kalai, Leonid Levin and Oded Goldreich think we live in. In this world the extended Church-Turing hypothesis hold sway and there exists some (yet unaccounted for) cost that blows up exponentially in $T$ when trying to physically realize size $T$ quantum circuits.

These skeptics still accept the mathematical conjecture underlying superiorita that $BQP$ contains functions that require exponential time for deterministic or probabilistic Turing machines. Indeed, as far as I can tell, their belief in the inhrent difficulty of problems such as factoring is a large part of the intuition for why quantum computers would not be physically realizable.

Finally, Classicatopia is the world where $BQP \subseteq BPP$ and more generally any function, promise problem, or sampling problem that can be solved by (uniform) quantum circuits can be solved by probabilistic Turing machines with a polynomial overhead. In this world quantum computers can be physically realized, but only because they are no more powerful than classical computers. Hence the Extended Church-Turing holds but for a completely different reason than in Skepticland. In Classicatopia we can simulate the entire physical world using a classical computer. One advocate of this world is Ed Fredkin (who interestingly was the person who motivated Richard Feynmann to propose the possiblity of quantum computers in the first place). Also, several researchers (such as Peter Sarnak) have suggested that the marquee problem of integer factoring can be solved by polynomial-time Turing machines.

## Truth and beauty

At this point I should probably talk about the evidence for the probability of truth of each of these scenarios, and discuss the latest advances in experimental works building quantum computers. But frankly I’d be just parroting stuff I Googled, since I don’t really know much about these works beyond second or third hand reports.

Rather, I’d like to talk about which of these worlds is more beautiful. Beauty is in some ways as important for science as truth. Science is not just a collection of random facts but rather a coherent framework where these facts fit together. If a conjecture is “ugly” in the sense that it does not fit with our framework then this can be evidence that it is false. When such “ugly ducklings” turn out to be true then this means we need to change our standards of beauty and come up with a new framework in which they fit. This is often how progress in science is made.

While I am not a physicst, I believe that quantum mechanics itself followed exactly such a trajectory. (I am probably making some historical, physical, and maybe even mathematical mistakes below, but I hope thebigger picture description is still accurate; however please do correct me in the comments!)

The ancient greek philospher Democritus is often quoted as saying “Nothing exists except atoms and empty space, everything else is opinion.” This saying is usually interpreted as an emprical hypothesis about the world, or to use mathematical jargon, a conjecture. But I think this is really more of a definition. That is, one can interpret Democritus as not really making a concrete physical theory but defining the allowed space for all physical theories: any theory of the world should involve particles that mechnically and deterministically evolve following some specific and local rules.

Over the coming years, scientists such as Newton, Leibniz and Einstein, took this prescription to heart and viewed the role of physics as coming up with every more general and predictive theories within the Democritus model of deterministic particles with no randomness, intent, or magic such as “action at a distance”. In the late 1910’s, Emmy Noether proved some remarkable theorems that derived conservation laws from physical theories based only on the fact that they satisfy certain symmetries (see also my recent post). While the mechanical clockwork theories satisfied such symmetries, they are not the only theories that do so. Thus Noether’s theorems showed that even non-clockwork theories could still satisfy a more general notion of “mathematical beauty”.

At the time Noether’s Theorems were just a very useful mathematical tool, but soon nature gave some indications that she prefers Noether’s notion of beauty to Democritus’. That is, a series of experiments led to the introduction of the distinctly “non clockwork” theory of quantum mechanics. Giving up on the classical notion of beauty was not easy for physicsts, and many (most famously Einstein) initially thought of quantum mechanics as a temporary explanation that eventually will be replaced by a more beautiful “Democritus-approved” theory. But Noether’s results allowed to make quantum mechanics not just predictive but beautiful. As Nima Harkani-Hamed says:

Newton’s laws, even though they were the first way we leaned how to think about classical physics, were not the right way to make the jump to quantum mechanics. … [Rather] because the underlying ideas of the action– and everything just really ports beautifully through, from classical to quantum physics, only the interpretation changes in a fundamental way– all of Noether’s arguments, all of Emmy Noether’s arguments about conservation laws go through completely unscathed. It’s absolutely amazing. All these arguments about conservation laws, many other things change, tons of other things changed when we went from classical to quantum. But our understanding of the conservation laws, even though they’re come up with by this classical physicist a hundred years ago, are equally true in quantum mechanics today.

Moreover, my outsider impression is that with time physicsts have learned to accept and even grow to love quantum mechanics, to the degree that today many would not want to live in a purely classical world. If you wonder how anyone could ever love such a monstrosity, note that, as Scott Aaronson likes to say, there is a sense in which the relation between quantum and classical physics is analogous to the relation between the $\ell_2$ and $\ell_1$ norms. I think most mathematicians would agree that the former norm is “more beautiful” than the latter.

## My personal opinion

So, which is the most beautiful world, Superiorita or Skepticland?

If you’ve asked me that question a decade ago, I would have answered “Skepticland” without hesitation. Part of the reason I got into computer science is that I was never good at physics and didn’t particularly like it. I also thought I could avoid caring about it. I believed that ultimately the world is a Turing machine or cellular automata and whether it has 5 or 12 particles is about as interesting as whether the computer I’m typing this on uses big endian or little endian representation for integers. When I first heard about quantum computing I was hoping very much that there is some inherent reason it can never work so I can avoid dealing with the ugliness of quantum mechanics and its bracket notation.

But as I’ve learned more about quantum mechanics, I’ve grown not just to accept it as a true theory but also beautiful, and with this to also accept quantum information and computation theory as a beautiful generalization of information and computation in its own right. At the moment I don’t see any beautiful alternative theory (to use Aaronson’s terms, a “Sure/Shor separator”) from the skeptics. The closest we have to such a theory comes from Gil Kalai, but as far as I can tell it posits noise as a new fundamental property of nature (the Ka-la-ee constant?). Noise here is not the usual interpretation of quantum probabilities or the uncertainty principle. It seems to be more similar to the engineering form of noise as inaccuracies in measurements or errors in transmissions. These can be serious issues (for example, I believe that friction is a large part why actually building Babbage’s Analytical Engine was so difficult). But as far as I can tell, these engineering difficulties are not fundamental barriers and with sufficient hard work and resources the noise can be driven down to as close to zero as needed.

Moreover some of the predictions involve positing noise that scales with number of qubits in the computer. It seems to require nature to “know” that some physical system in fact corresponds to a logical qubit, and moreover that two distant physical systems are part of the same quantum computer. (I should say that Gil Kalai disagrees with this interpretation of his conjecture.) While one could argue that this is not more counterintuitive than other notions of quantum mechanics such as destructive interference, entanglement, and collapse under measurements, each one of those notions was only accepted following unequivocal experimental results, and moreover they all follow from our modelling of quantum mechanics via unitary evolutions.

The bottom line is that, as far as I can tell, Superiorita is the most beautiful and evidence-supported world that is currently on offer.

## Will we see a mega-qubit quantum computer?

The current experimental efforts are aimed at building a 50 qubit quantum computer. This sounds impressive until I remember that the VIC 20 I played with as a third-grader more than thirty years ago already had 5K (i.e., about 40,000 bits) of memory. So, will we ever see a quantum computer big enough to run Frogger? (not to mention Ultima IV )

The answer to this question depends not just on the science but also on economics and policy as well. Suppose that (with no real justification) that eventually we will able to produce a quantum computer at a cost of 1000 dollars per qubit. Then a million qubit machine will cost a billion dollar to build. The current applications of quantum computers do not seem to justify this cost. As I mentioned, once we transition to different cryptosystems, the motivation for factoring integers will be significantly lessened, and while simulating quantum systems can be important, it’s hard to see it as forming the basis for a billion dollar business. Of course, this can all change with a single theory paper, just as Peter Shor revolutionized the field of quantum computing with a single algorithm.

Moreover I hope that at some point, policy makers and the public at large will stop viewing computer science just through the lens of applications, and start seeing it also as a fundamental science in its own right. The large Hardron Collider apparantly cost about 13 billion dollars to build and operate, and yet the same analysis calls it a “bargain” in terms of the benefit from both technologies invented and scientific discovery. The case can be made that building a large scale quantum computer would be no less important to science, and would offer no less benefit to society. Indeed, a quantum computer offers literally an exponential number of potential experiments one can run on it. Moreover, there is absolutely no reason to think that Shor gave the final word on breakthrough algorithms that could use such a computer for tasks that a priori seem to have nothing to do with physics. In that vein, I hope that whatever bodies that fund experimental quantum computing research realize that at least part of their investment should go into theoretical work in quantum (and also classical, as the two are intertwined) algorithm design.

Acknowledgements: Thanks to Gil Kalai and Scott Aaronson for comments on earlier versions of this post. Needless to say, they are not responsible for anything that I said here.