I’m with her (but 4 years too late)

In May 2016, after Donald Trump was elected as the republican nominee for president, I wrote the following blog post. I ended up not publishing it – this has always been a technical blog (and also more of a group blog, at the time). While the damage of a Donald Trump presidency was hypothetical at the time, it is now very real and in a second term the stakes are only higher. I once again hope American readers of this blog would do what they can to support Joe Biden.

Note: this is not an invitation for a debate on who to vote for in the comments. At this point, if you are educated and following the news (as I imagine all readers of this blog are) then if you are not already convinced that Donald Trump is a danger to this country and the world, nothing I will say will change your mind. Similarly, nothing you will say will change mine. Hence this is just a call for those readers who already support Joe Biden to make sure they vote and think of how they can help with their money or time in other ways.

I’m with Her

Boaz Barak / May 3, 2016

The republican electorate, in their infinite wisdom, have just (essentially) finalized the election of Donald Trump as their nominee for the position of the president of the United States.

While I have my political views, I don’t consider myself a very political person, and have not (as far as I remember) ever written about politics in this blog. However, extreme times call for extreme measures. It is tempting to be fatalist or cynical about this process, and believe that whether Donald Trump or Hillary Clinton is elected doesn’t make much of a difference. Some people believe that the presidency shapes the person more than the other way around, and others feel that all politicians are anyway corrupt or that Hillary is not much better than trump since she voted for the Iraq war and gave paid speeches for Wall Street firms. I think the last two presidencies of George W. Bush and Barack Obama demonstrated that the identity of the president makes a huge difference. All evidence points out that with Trump this difference will be entirely in the negative direction.

While his chances might not be great, this is not a bet I’m comfortable taking. I plan to support Hillary Clinton as much as I can, and hope that other American readers of this blog will do the same

Full-replica-symmetry-breaking based algorithms for dummies

One of the fascinating lines of research in recent years has been a convergence between the statistical physics and theoretical computer science points of view on optimization problems.
This blog post is mainly a note to myself (i.e., I’m the “dummy” ūüėÉ), trying to work out some basic facts in some of this line of work. it was inspired by this excellent talk of Eliran Subag, itself part of a great Simons institute workshop which I am still planning to watch the talks of. I am posting this in case it’s useful for others, but this is quite rough, missing many references, and I imagine I have both math mistakes as well as inaccuracies in how I refer to the literature – would be grateful for comments!

Screen shot from Eliran Subag’s talk demonstrating the difference between “easy” and “hard” instances.

In computer science, optimization is the task of finding an assignment x_1,\ldots,x_n that minimizes some function J(x_1,\ldots,x_n). In statistical physics we think of x_1,\ldots,x_n as the states of n particles, and J(x_1,\ldots,x_n) as the energy of this state. Finding the minimum assignment corresponds to finding the ground state, and another computational problem is sampling from the Gibbs distribution where the probability of x is proportional to \exp(-\beta J(x)) for some \beta>0.

Two prototypical examples of such problems are:

  1. Random 3SAT – in this case x\in { \pm 1 }^n and J(x) is the number of clauses violated by the assignment x for a random formula.
  2. Sherrington-Kirpatrick model – in this case x \in { \pm 1 }^n and J(x)= \sum_{i,j} J_{i,j}x_ix_j where J_{i,j} are independent normal variables with variance 1/n for i\neq j and variance 2/n for i=j. (Another way to say it is that J is the matrix A+A^\top where A‘s entries are chosen i.i.d from N(0,\tfrac{1}{2n})).)

The physics and CS intuition is that these two problems have very different computational properties. For random 3SAT (of the appropriate density), it is believed that the set of solutions is “shattered” in the sense that it is partitioned to exponentially many clusters, separated from one another by large distance. It is conjectured that in this setting the problem will be computationally hard. Similarly from the statistical physics point of view, it is conjectured that if we were to start with the uniform distribution (i.e., a “hot” system) and “lower the temperature” (increase \beta) at a rate that is not exponentially slow then we will get “stuck” at a “metastable” state. This is analogous how when we heat up sand and then cool it quickly then rather than returning to its original state, the sand will get stuck at the metastable state of glass.

In contrast for the Sherrington-Kirpatrick (SK) model, the geometry is more subtle, but interestingly this enables better algorithms. The SK model is extermely widely studied, with hundreds of papers, and was the inspiration for the simulated annealing algorithm. If memory serves me right, Sherrington and Kirpatrick made the wrong conjecture on the energy of the ground state, and then Parisi came up in 1979 with a wonderful and hugely influential way to compute this value. Parisi’s calculation was heuristic, but about 30 years later, first Talagrand and later Panchenko proved rigorously many of Parisi’s conjectures. (See this survey of Panchenko.)

Recently Montanari gave a polynomial time algorithm to find a state with energy that is arbitrarily close to the ground state’s. The algorithm relies on Parisi’s framework and in particular on the fact that the solution space has a property known as “full replica symmetry breaking (RSB)” / “ultrametricity”. Parisi’s derivations (and hence also Montanari’s analysis) are highly elaborate and I admit that I have not yet been able to fully follow it. The nice thing is that (as we’ll see) it is possible to describe at least some of the algorithmic results without going into this theory. In the end of the post I will discuss a bit some of the relation to this theory, which is the underlying inspiration for Subag’s results described here.

Note: These papers and this blog post deal with the search problem of finding a solution that minimizes the objective. The refutation problem of certifying that this minimum is at least -C for some C has often been studied. The computational complexity of these problems need not be identical. In particular there are cases where the search problem has an efficient algorithm achieving value -C^* but the best refutation algorithm can only certify that the value is at most -C' for C' \gg C^*.

Analysis of a simpler setting

Luckily, there is a similar computational problem, for which the analysis of analogous algorithm, which was discovered by Subag and was the partial inspiration for Montanari’s work, is much simpler. Specifically, we consider the case where x is an element of the unit sphere, and J(x) is a degree d polynomial with random Gaussian coefficients. Specifically, for every vector \gamma = (\gamma_2,\ldots,\gamma_d), we let J(x) = \gamma_2 J^2 \cdot x^{\otimes 2} + \gamma_3 J^3 \cdot x^{\otimes 3} + \cdots + \gamma_d J^d \cdot x^{\otimes p} where for every p \in {2,\ldots, d }, J_p is a random tensor of order p whose n^p coefficients are all chosen i.i.d in N(0,1/n). (We assume that polynomial does not have constant or linear components.)

Depending on \gamma, the computational and geometrical properties of this problem can vary considerably. The case that \gamma = (1,0,\ldots,0) (i.e., only J^2 has a non-zero coeffiecent) corresponds to finding the unit vector x minimizing x^\top J x for a random matrix x, which of course corresponds to the efficiently solveable minimum eigenvector problem. In contrast, the case \gamma = (0,1,0,\ldots,0) corresponds to finding a rank one component of a random three-tensor, which is believed to be computationally difficult. The Parisi calculations give a precise condition P on the vector \gamma such that if P(\gamma) holds then the solution space has the “full RSB” property (and hence the problem is computationally easy) and if P(\gamma) does not hold then the solution space does not have this property (and potentially the problem is hard).

These calculations also give rise to the following theorem:

Theorem (Chen and Sen, Proposition 2): If P(\gamma) holds then in the limit n \rightarrow \infty, \min_{x : |x|=1} J(x) = -\int_0^1 \sqrt{\nu''(q)} dq, where \nu(q) = \sum_{p \geq 2} \gamma_p^2 q^p. (That is, \nu'' is the second derivative of this univariate polynomial)

We will not discuss the proof of this theorem, but rather how, taking it as a black box, it leads to an algorithm for minimizing J(x) that achieves a near-optimal value (assuming P(\gamma) holds).

It is a nice exercise to show that for every two vectors x,x'\in\mathbb{R}^n, \mathbb{E}_J [J(x)J(x')] = \nu(\langle x,x' \rangle)/n. Hence for any unit vector x, J(x) is a random variable with mean zero and standard deviation \sqrt{\nu(1)/n}. Since (after some coarsening) the number of unit vectors of dimension n can be thought of as c^n for some c>1, and we expect the probability of deviating t standard deviations to be \exp(-c' t^2), the minimum value of J(x) should be -c'' \sqrt{\nu(1)} for some constant c''. However determining this constant is non trivial and is the result of the Parisi theory.

To get a better sense for the quantity -\int_0^1 \sqrt{\nu''(q)} dq, let’s consider two simple cases:

  • If \gamma = (1,0,\ldots) (i.e., J(x) = \sum_{i,j}M_{i,j}x_ix_j for random matrix M) then \nu(q)= q^2 and \nu''(q) = 2, meaning that -\int_0^1 \sqrt{\nu''(q)} dq = -\sqrt{2}. This turns out to be the actual minimum value. Indeed in this case \min_{|x|^2=1} J(x) = \tfrac{1}{2} \lambda_{min}(M + M^\top). But the matrix A=\tfrac{1}{2}(M+M^\top)‘s non diagonal entries are distributed like N(0,\tfrac{1}{2n}) and the diagonal entries like N(0,\tfrac{1}{n}) which means that A is distributed as \tfrac{1}{\sqrt{2}} times a random matrix B from the Gaussian Orthogonal Ensemble (GOE) where B_{i,j} \sim N(0,1/n) for off diagonal entries and B_{i,i} \sim N(0,2/n) for diagonal entries. The minimum eigenvalue of such matrices is known to be -2\pm o(1) with high probability.
  • If \gamma = (0,\ldots,1) (i.e. J(x) = T \cdot x^{\otimes d} for a random d-tensor T) then P(\gamma) does not hold. Indeed, in this case the value -\int_0^1 \sqrt{\nu''(q)} dq = - \int_0^1 \sqrt{d (d-1)q^{d-2}} dq = - \sqrt{d(d-1)}\tfrac{1}{d/2-1} \approx -2 for large d. However I believe (though didn’t find the reference) that the actual minimum tends to -\infty with d.

While the particular form of the property P(\gamma) is not important for this post, there are several equivalent ways to state it, see Proposition 1 in Subag’s paper. One of them is that the function \nu''(t)^{-1/2} (note the negative exponent) is concave on the interval (0,1].
It can be shown that this condition cannot be satisfied if \gamma_2 = 0, and that for every setting of \gamma_3,\ldots,\gamma_d, if \gamma_2 is large enough then it will be satisfied.

The central result of Subag’s paper is the following:

Theorem: For every \epsilon>0, there is a polynomial-time algorithm A such that on input random J chosen according to the distribution above, with high probability A(J)=x such that J(x) \leq -\int_0^1 \sqrt{\nu''(q)} dq + \epsilon.

The algorithm itself, and the idea behind the analysis are quite simple. In some sense it’s the second algorithm you would think of (or at least the second algorithm according to some ordering).

The first algorithm one would think of is gradient descent. We start at some initial point x^0, and repeat the transformation x^{t+1} \leftarrow x^t - \eta \nabla J(x^t) for some small \eta (and normalizing the norm). Unfortunately, we will generally run into saddle points when we do so, with the gradient being zero. In fact, for simplicity, below we will always make the pessimistic assumption that we are constantly on a saddle point. (This assumption turns out to be true in the actual algorithm, and if it was not then we can always use gradient descent until we hit a saddle.)

The second algorithm one could think of would be to use the Hessian instead of the gradient. That is, repeat the transformation x^{t+1} \leftarrow x^t - \eta u where u is the minimal eigenvector of \nabla^2 J(x^t) (i.e., the Hessian matrix H such that H_{i,j} = \tfrac{\partial J(x^t)}{\partial x_i \partial x_j} ). By the Taylor approximation J(x - \eta u) \approx J(x) - \eta \nabla J(x) \cdot u + \tfrac{1}{2} \eta^2 u^\top \nabla^2 J(x) u (and since we assume the gradient is zero) the change in the objective will be roughly \tfrac{1}{2} \eta^2 \lambda_{min}(\nabla^2 J(x^t)). (Because we assume the gradient vanishes, it will not make a difference whether we update with -\eta u or +\eta u, but we use -\eta for consistency with gradient descent.)

The above approach is promising, but we still need some control over the norm. The way that Subag handles this is that he starts with x^0 = 0, and at each step takes a step in an orthogonal direction, and so within 1/\eta^2 steps he will get to a unit norm vector. That is, the algorithm is as follows:


Input: J:\mathbb{R}^n \rightarrow \mathbb{R}.

Goal: Find unit x approximately minimizing J(x)

  1. Initialize x^0 = 0^n
  2. For t=0,\ldots,1/\eta^2-1: i. Let u^t be a unit vector u such that u is orthogonal to u^0,\ldots,u^{t-1} and u^\top \nabla^2 J(x^t) u \approx \lambda_{min}(\nabla^2 J(x^t)). (Since the bottom eigenspace of \nabla^2 J(x^t) has large dimention, we can find a vector u that is not only nearly minimal eigenvector but also orthogonal to all prior ones. Also, as mentioned, we assume that \nabla \cdot u \approx 0.) ii. Set x^{t+1} \leftarrow x^t - \eta u^t.
  3. Output x^{1/\eta^2} = -\eta\sum_{t=0}^{1/\eta^2-1} u^t

(The fact that the number of steps is 1/\eta^2 and not 1/\eta is absolutely crucial for the algorithm’s success; without it we would not have been able to use the second order contribution that arise from the Hessian.)

If we define \lambda_t to be the minimum eigenvalue at time t, we get that the final objective value achieved by the algorithm satisfies

VAL = \sum_{t=1}^{1/\eta^2} \tfrac{1}{2} \eta^2 \lambda_t

Now due to rotation invariance, the distribution of \nabla^2 J at the point x^t is the same as \nabla^2J at the point (|x^t|,0,\ldots,0). Using concentration of measure arguments, it can be shown that the minimum eigenvalue of \nabla^2 J(x^t) will be close with high probability to the minimum eigenvalue of \nabla^2 J(|x^t|,0,\ldots,0).
Since \|x^t\|^2 = \eta^2 t we can write

VAL = \sum_{t=1}^{1/\eta^2} \tfrac{1}{2} \eta^2 \lambda(\sqrt{\eta^2 t})

where \lambda(q) is the minimum eigenvalue of \nabla^2 J at the point (q,0,\ldots,0).
Taking \eta to zero, we get that (approximately) the value of the solution output by the algorithm will satisfy

VAL = \int_0^1 \tfrac{1}{2} \lambda(\sqrt{q}) dq

and hence the result will be completed by showing that

\lambda(\sqrt{q}) = 2 \sqrt{\nu''(q)}

To do this, let’s recall the definition of J:

J(x) = \gamma_2 J^2 \cdot x^{\otimes 2} + \gamma_3 J^3 \cdot x^{\otimes 3} + \cdots + \gamma_d J^d \cdot x^{\otimes p} where for every p \in {2,\ldots, d }, J_p is a random tensor of order p whose n^p coefficients are all chosen i.i.d in N(0,1/n).

For simplicity, let’s assume that d=3 and hence

J(x) = \gamma_2 J^2 \cdot x^{\otimes 2} + \gamma_3 J^3 \cdot x^{\otimes 3}\;.

(The calculations in the general case are similar)

The i,j entry of \nabla^2 J(\sqrt{q},0,\ldots,0) equals \tfrac{\partial J(q,0,\ldots,0)}{\partial x_i \partial x_j}. The contribution of the J^2 component to this term only arises from the terms corresponding to either x_ix_j or x_jx_i and hence for i\neq j it equals \gamma_2 (J_{i,j}+J_{j,i}) which is distributed like \gamma_2 N(0,\tfrac{2}{n}) = N(0, \tfrac{2\gamma_2^2}{n}). For i=j, since (x_i^2)'' = 2, the contributioon equals 2 \gamma_2 J_{i,i} which is distributed like N(0,\tfrac{4\gamma_2^2}{n}).

The contribution from the J^3 component comes (in the case i\neq j) from all the 6=3! terms involving 1,i,j that is, \gamma_3 (J_{1,i,j}\sqrt{q} + J_{1,j,i}\sqrt{q} + J_{i,1,j}\sqrt{q}+J_{j,1,i}\sqrt{q}+J_{i,j,1}\sqrt{q}+J_{j,i,1})\sqrt{q} which is distributed like \gamma_3 N(0, \tfrac{6}{n}) = N(0, \tfrac{6\gamma_3^2 q}{n}). For i=j the contribution will be from the 3 terms involving 1,i, with each yielding a contribution of 2\sqrt{q}, and hence the result will be distributed like N(0,\tfrac{12 \gamma_3^2 q}{n}).

(More generally, for larger p, the number of terms for distinct i,j is p(p-1), each contributing a Gaussian of standard deviation (\sqrt{q})^{p-2}\gamma_p/\sqrt{n}, while for i=j we have p(p-1)/2 terms, each contributing a Gaussian of standard deviation 2(\sqrt{q})^{p-2}\gamma_p/\sqrt{n}.)

Since the sum of Gaussians is a Gaussian we get that \nabla^2J(q,0,\ldots,0){i,j} is distributed like a Gaussian with variance \sum \gamma_p^2 p(p-1)q^{p-2}/n = \nu''(q)/n for i\neq j, and twice that for i=j. This means that the minimum eigenvalue of \nabla^2J(q,0,\ldots,0) equals \sqrt{\nu''(q)} times the minimum eigenvalue of a random matrix M from the Gaussian Orthogonal Ensemble (i.e. M is sampled via M{i,j} \sim N(0,1/n), M_{i,i} \sim N(0,2/n)). As mentioned above, it is known that this minimum eigenvalue is -2+o(1), and in fact by the semi-circle law, for every \epsilon>0, the number of eigenvalues of value \leq -2+\epsilon is \Omega(n), and so we can also pick one that is orthogonal to the previous directions. QED

Full replica symmetry breaking and ultra-metricity

The point of this blog post is that at least in the “mixed p spin” case considered by Subag, we can understand what the algorithm does and the value that it achieves without needing to go into the theory of the geometry of the solution space, but let me briefly discuss some of this theory. (As I mentioned, I am still reading through this, and so this part should be read with big grains of salt.)

The key object studied in this line of work is the probability distribution \xi of the dot product \langle x,x' \rangle for x and x' sampled independently from the Gibbs distribution induced by J. (This probability distribution will depend on the number of dimensions n, but we consider the case that n \rightarrow \infty.)

Intuitively, there are several ways this probability distribution can behave, depending on how the solution space is “clustered”:

  • If all solutions are inside a single “cluster”, in the sense that they are all of the form x = x_* + e where x_* is the “center” of the cluster and e is some random vector, then \langle x,x' \rangle will be concentrated on the point \| x_*\|^2.
  • If the solutions are inside a finite number k of clusters, with centers x_1,\ldots,x_k, then the support of the distribution will be on the k^2 points \{ \langle x_i,x_j \rangle \}_{i,j \in [k]}.
  • Suppose that the solutions are inside a hierarchy of clusters. That is, suppose we have some rooted tree \mathcal{T} (e.g., think of a depth d full binary tree), and we associate a vector x_v with every vertex v of \mathcal{T}, with the property that x_v is orthogonal to all vectors associated with v‘s ancestors on the tree. Now imagine that the distribution is obtained by taking a random leaf u of \mathcal{T} and outputting \sum x_v for all vertices v on the path from the root to u. In such a case the dot product of x_u and x_{u'} will be \sum_v \|x_v\|^2 taken over all the common ancestors of v. As the dimension and depth of the tree goes to infinity, the distribution over dot product can have continuous support, and it is this setting (specifically when the support is an interval [0,q)) that is known as full replica symmetry breaking. Because the dot product is determined by common ancestor, for every three vectors x_u,x_v,x_w in the support of the distribution \langle x_v,x_w \rangle \geq \min \{ \langle x_u,x_v \rangle, \langle x_u,x_w \rangle \} or \| x_v - x_w \| \leq \max \{ \|x_u - x_v \|, \|x_u -x_w \|\}. It is this condition that known as ultra-metricity.

In Subag’s algorithm, as mentioned above, at any given step we could make an update of either x^{t+1} \leftarrow x^t - \eta u or x^{t+1} \leftarrow x^t + \eta u. If we think of all the possible choices for the signs in the d=1/\eta^2 of the algorithms, we see that the algorithm does not only produce a single vector but actually 2^d such vectors that are arranged in an ultrametric tree just as above. Indeed, this ultrametric structure was the inspiration for the algorithm and is the reason why the algorithm produces the correct result precisely in the full replica symmetry breaking regime.

Acknowledgements: Thanks to Tselil Schramm for helpful comments.

Understanding generalization requires rethinking deep learning?

Yamini Bansal, Gal Kaplun, and Boaz Barak

(See also paper on arxiv, code on gitlab, upcoming talk by Yamini&Boaz, video of past talk)

A central puzzle of deep learning is the question of generalization. In other words, what can we deduce from the training performance of a neural network about its test performance on fresh unseen examples. An influential paper of Zhang, Bengio, Hardt, Recht, and Vinyals showed that the answer could be “nothing at all.”

Zhang et al. gave examples where modern deep neural networks achieve 100% accuracy on classifying their training data, but their performance on unseen data may be no better than chance. Therefore we cannot give meaningful guarantees for deep learning using traditional “generalization bounds” that bound the difference between test and train performance by some quantity that tends to zero as the number of datapoints n increases. This is why (to quote their title), Zhang et al. claimed that “understanding deep learning requires rethinking generalization”.

But what if the issue isn’t that we’ve been doing generalization bounds wrong, but rather that we’ve been doing deep learning (or more accurately, supervised deep learning) wrong?

Self Supervised + Simple fit (SSS) learning

To explain what we mean, let’s take a small detour to contrast “traditional” or “end-to-end” supervised learning with a different approach to supervised learning, which we’ll call here “Self-Supervised + Simple fit” or “SSS algorithms.” (While the name “SSS algorithms” is new, the approach itself has a long history and has recently been used with great success in practice; our work gives no new methods—only new analysis.)

The classical or “end-to-end” approach for supervised learning can be phrased as “ask and you shall receive”. Given labeled data, you ask (i.e., run an optimizer) for a complex classifier (e.g., a deep neural net) that fits the data (i.e., outputs the given labels on the given data points) and hope that it will be successful on future, unseen, data points as well. End-to-end supervised learning achieves state-of-art results for many classification problems, particularly for computer vision datasets ImageNet and CIFAR-10.

However, end-to-end learning does not directly correspond to the way humans learn to recognize objects (see also this talk of LeCun). A baby may see millions of images in the first year of her life, but most of them do not come with explicit labels. After seeing those images, a baby can make future classifications using very few labeled examples. For example, it might be enough to show her once what is a dog and what is a cat for her to correctly classify future dogs and cats, even if they look quite different from these examples.

End-to-end learning vs SSS algorithms.
Figure 1: Cartoon of end-to-end vs SSS learning

In recent years, practitioners have proposed algorithms that are more similar to human learning than supervised learning. Such methods separate the process into two stages. In the first stage, we do representation learning whereby we use unlabeled data to learn a representation: a complex map (e.g., a deep neural net) mapping the inputs into some “representation space.” In the second stage, we fit a simple classifier (e.g., a linear threshold function) to the representation of the datapoints and the given labels. We call such algorithms “Self-Supervision + Simple fit” or SSS algorithms. (Note that, unlike other representation-learning based classifiers, the complex representation is “frozen” and not “fine-tuned” in the second stage, where only a simple classifier is used on top of it.)

While we don’t have a formal definition, a “good representation” should make downstream tasks easier, in the sense of allowing for fewer examples or simpler classifiers. We typically learn a representation via self supervision , whereby one finds a representation minimizing an objective function that intuitively requires some “insight” into the data. Approaches for self-supervision include reconstruction, where the objective involves recovering data points from partial information (e.g., recover missing words or pixels), and contrastive learning, where the objective is to find a representation that make similar points close and dissimilar points far (e.g., in Euclidean space).

SSS algorithms have been traditionally used in natural language processing, where unlabeled data is plentiful, but labeled data for a particular task is often scarce. But recently SSS algorithms were also used with great success even for vision tasks such as ImageNet and CIFAR10 where all data is labeled! While SSS algorithms do not yet beat the state-of-art supervised learning algorithms, they do get pretty close. SSS algorithms also have other practical advantages over “end-to-end supervised learning”: they can make use of unlabeled data, the representation could be useful for non-classification tasks, and may have improved out of distribution performance. There has also been recent theoretical analysis of contrastive and reconstruction learning under certain statistical assumptions (see Arora et al and Lee et al).

The generalization gap of SSS algorithms

In a recent paper, we show that SSS algorithms not only work in practice, but work in theory too.

Specifically, we show that such algorithms have (1) small generalization gap and (2) we can prove (under reasonable assumptions) that their generalization gap tends to zero with the number of samples, with bounds that are meaningful for many modern classifiers on the CIFAR-10 and ImageNet datasets. We consider the setting where all data is labeled, and the same dataset is used for both learning the representation and fitting a simple classifier. The resulting classifier includes the overparameterized representation, and so we cannot simply apply “off the shelf” generalization bounds. Indeed, a priori it’s not at all clear that the generalization gap for SSS algorithms should be small.

To get some intuition for the generalization gap of SSS algorithms, consider the experiment where we inject some label noise into our distribution. That is, we corrupt an \eta fraction of the labels in both the train and test set, replacing them with random labels. Already in the noiseless case (\eta=0), the generalization gap of SSS algorithms is noticeably smaller than that of end-to-end supervised learning. As we increase the noise, the difference becomes starker. End-to-end supervised learning algorithms can always achieve 100% training accuracy, even as the test accuracy deteriorates, since they can “memorize” all the training labels they are given. In contrast, for SSS algorithms, both training and testing accuracy decrease together as we increase the noise, with training accuracy correlating with test performance.

Figure 2: Generalization gap of end-to-end and SSS algorithms on CIFAR 10 as a function of noise (since there are 10 classes, 90% noisy samples corresponds to the Zhang et al experiment). See also interactive version.

Our main theoretical result is a formal proof of the above statement. To do so, we consider training with a small amount of label noise (say \eta=5\%) and define the following quantities:

  • The robustness gap is the amount by which training accuracy degrades between the “clean” (\eta=0) experiment and the noisy one. (In this and all other quantities, the training accuracy is measured with respect to the original uncorrupted labels.)
  • The memorization gap considers the noisy experiment (\eta=5\%) and measures the amount by which performance on the corrupted data samples (where we received the wrong label) is worse than performance on the overall training set. If the algorithm can memorize all given labels, it will be perfectly wrong on the corrupted data samples, leading to a large memorization gap.
  • The rationality gap is the difference between the performance on the corrupted data samples and performance on unseen test examples. For example, if x is an image of a dog, then it measures the difference between the probability that f(x)=\text{"dog"} when (x,\text{"cat"}) is in the training set and the probability that f(x)=\text{"dog"} when x is not in the training set at all. Since intuitively, getting the wrong label should be worse than getting no label at all, we typically expect the rationality gap to be around zero or negative. Formally we define the rationality gap to the maximum between 0 and the difference above, so it is always non-negative. We think of an algorithm with a significant positive rationality gap as “irrational.”

By summing up the quantities above, we get the following inequality, which we call the RRM bound

generalization gap \leq robustness gap + rationality gap + memorization gap

In practice, the robustness and rationality gaps are always small, both for end-to-end supervised algorithms (which have a large generalization gap), and for SSS algorithms (which have a small generalization gap). Thus the main contribution to the generalization gap comes from the memorization gap. Roughly speaking, our main result is the following:

If the complexity of the second-stage classifier of an SSS algorithm is smaller than the number of samples then the generalization gap is small.

See the paper for the precise definition of “complexity,” but it is bounded by the number of bits that it takes to describe the simple classifier (no matter how complex is the representation used in the first stage). Our bound yields non-vacuous results in various practical settings; see the figures below or their interactive version.

Figure 3: Empirical study of the generalization gap of a variety of of SSS algorithms on CIFAR-10. Each vertical line corresponds to one model, sorted by generalization gap. The RRM bound is typically near-tight, and our complexity upper bound is often non vacuous. Use this webpage to interact with figures 3 and 4.
Figure 4: Empirical study of gaps for the ImageNet dataset. Because of limited computational resources, we only evaluated the theoretical bound for two models in this dataset.

What’s next

There are still many open questions. Can we prove rigorous bounds on robustness and rationality? We have some preliminary results in the paper, but there is much room for improvement. Similarly, our complexity-based upper bound is far from tight at the moment, though the RRM bound itself is often surprisingly tight. Our work only applies to SSS algorithms, but people have the intuition that even end-to-end supervised learning algorithms implicitly learn a representation. So perhaps these tools can apply to such algorithms as well. As mentioned, we don’t yet have formal definitions for “good representations,” and the choice of the self-supervision task is still somewhat of a “black art” – can we find a more principled approach?

ITC 2021 (guest post by Benny Applebaum)

Following last year’s successful launch, we are happy to announce the second edition of the conference on Information-Theoretic Cryptography (ITC).

The call for papers for ITC 2021 is out, and, to cheer you up during lockdowns, we prepared a short theme song https://youtu.be/kZT1icVoTp8  

Feel free to add your own verse ūüėČ

The submission deadline is February 1st. Please submit your best work to ITC 2021! We hope to see many of you there!

SIGACT research highlights – call for nominations

TL;DR: Know of a great recent paper that should be highlighted to the theory community and beyond? Email a nomination to sigact.highlights.nominations@outlook.com by October 19th.

The goal of the SIGACT Research Highlights Committee is to help promote
top computer science theory research via identifying results that are of
high quality and broad appeal to the general computer science audience.
These results would then be recommended for consideration for the CACM Research Highlights section as well as other general-audience computer science research outlets.

Nomination and Selection Process:

The committee solicits two types of nominations:

1) Conference nominations. Each year, the committee will ask the PC
chairs of theoretical computer science conferences to send a selection
of up to three top papers from these conferences (selected based on both
their technical merit and breadth of interest to non-theory audience)
and forwarding them to the committee for considerations.

2) Community nominations. The committee will accept nominations from the members of the community. Each such nomination should summarize the contribution of the nominated paper and also argue why this paper is
suitable for broader outreach. The nomination should be no more than a
page in length and can be submitted at any time by emailing it to
sigact.highlights.nominations@outlook.com. Self-nominations are

The nomination deadline is Monday, October 19, 2020 .


The SIGACT Research Highlights Committee currently comprises the
following members:

Boaz Barak, Harvard University
Irit Dinur, Weizmann Institute of Science
Aleksander MńÖdry, Massachusetts Institute of Technology (chair)
Jelani Nelson, University of California, Berkeley

Highlights of Algorithms (HALG) -free – Aug 31- Sep 2

[Guest post by Yossi Azar]

The 5th Highlights of Algorithms conference (HALG 2020) will take place Aug 31- Sep 2, 2020. http://highlightsofalgorithms.org/

The Highlights of Algorithms conference is a forum for presenting the highlights of recent developments in algorithms and for discussing potential further advances in this area. The conference will provide a broad picture of the latest research in algorithms through a series of invited talks, as well as short talks. 
Invited talks includes top algorithmic surveys and papers from FOCS/STOC/SODA/COLT/PODC/SPAA/ITCS/PODS (2019-20)

The conference will take place online on Mon, Aug 31-Wed, Sep 2 from 12:00 noon until 19:30 CEST (Europe time) http://highlightsofalgorithms.org/programme

The short contributed talk are on Sep 1-2, 9:45-11:30 CEST http://highlightsofalgorithms.org/shorttalks

Registration is free but mandatory
Please register on our webpage: https://highlightsofalgorithms2020.ethz.ch/registration

Ryan O’Donnell’s “TCS Toolkit” and other resources

When I was in grad school a common advice for beginning grad students was to leaf through the (paper) STOC or FOCS proceedings to see papers that you are interested in. This is still a decent advice (and requires less physical strength these days ūüôā ) but papers are not always the best source for people starting out. More often than we’d like to, the author of the 10th paper on a topic writes it to the audience of people that read (in fact probably wrote) the previous 9 papers.

Talks often do a better job of giving an overview of the field, and one great resource is the videos from the Simons Institute. If you want to get more in-depth information about a particular topic, it’s hard to beat the extended surveys in Foundations and Trends in TCS, as well as the related areas such as Machine Learning and Information Theory.

But if you are not yet sure what topic you’re interested in, or perhaps not even sure if you want to go to grad school, but you just know that you are interested in theory, there is now a new great resource. As I learned from Twitter, Ryan O’Donnell has just finished his TCS Toolkit course. All 99(!) lectures are on YouTube.

The topics are the following (these links are to the handwritten notes, for the lecture videos see YouTube channel):

1.   Course Overview, and How to TCS

2.   Basic Asymptotics

3.   Factorials and Binomial Coefficients

4.   Central Limit Theorem

5.   Chernoff Bounds

6.   Computational Models

7.   Fast Multiplication with the DFT

8.   Analysis of Boolean Functions

9.   Quantum Computation

10.   Fields and Polynomials

11.   Error-Correcting Codes

12.   Derandomization

13.   Spectral Graph Theory I

14.   Spectral Graph Theory II

15.   Spectral Graph Theory III

15.1.   Cheeger’s Inequality (Spectral Graph Theory bonus)

16.   Expander Graphs

17.   Linear Programming I

18.   Linear Programming II

19.   The Ellipsoid Algorithm

20.   CSPs and Approximation

21.   LP Hierarchies and Proof Systems

22.   Treewidth

23.   Communication Complexity

24.   Information Theory

25.   Cryptography

26.   Hardness Assumptions

27.   The PCP Theorem

p.s. For giving a high level taste of theory to beginning undergraduates, a great resource is Aaronson’s Quantum Computing since Democritus or Wigderson’s Math and Computation if they’re more math inclined.

CFP: Symposium on Simplicity in Algorithms (SOSA)

[Guest post by Valerie King. TL;DR SOSA 21 will take place jointly with SODA 21. To submit register by August 12, paper deadline August 19.]

Call for Papers: Registration deadline August 12, 2020

3rd SIAM Symposium on Simplicity in Algorithms  (SOSA)
January 11-12, 2021
Alexandria, Virginia, U.S.
(Held jointly with SODA 2021)

Symposium on Simplicity in Algorithms (SOSA) is a conference in theoretical computer science dedicated to advancing algorithms research by promoting simplicity and elegance in the design and analysis of algorithms. The benefits of simplicity are manifold: simpler algorithms manifest a better understanding of the problem at hand; they are more likely to be implemented and trusted by practitioners; they can serve as benchmarks, as an initialization step, or as the basis for a “state of the art” algorithm;¬† they are more easily taught and are more likely to be included in algorithms textbooks; and they attract a broader set of researchers to difficult algorithmic problems.

Papers in all areas of algorithms research are sought.  An ideal submission will advance our understanding of an algorithmic problem by, for example, introducing a simpler algorithm, presenting a simpler analysis of an existing algorithm, or offering insights that generally simplify our understanding of important algorithms or computational problems.

We are especially interested in papers that make material more accessible to a wider audience, such as undergraduates, or for more specialized topics, general algorithms researchers.

Submissions should contain novel ideas or attractive insights, but they are not expected to prove novel theorems. That is, the results themselves can be known, but their presentation must be new. Conference website is https://www.siam.org/conferences/cm/conference/sosa21

Program Committee
Aaron Bernstein, Rutgers University, U.S.
Allan Borodin, University of Toronto, Canada
Timothy Chan, University of Illinois at Urbana-Champaign, U.S.
Edith Cohen, Google Research, U.S. and Tel Aviv University, Israel
Vincent Cohen-Addad, Google Research, Z√ľrich, Switzerland
Sanjoy Dasgupta, University of California, San Diego, U.S.
Michael Elkin, Ben-Gurion University of the Negev, Israel
Funda Ergun, NSF and University of Indiana, Bloomington, U.S.
Mike Fellows, University of Bergen, Norway
Arnold Filtser, Columbia University, U.S.
Kasper Green Larsen,  Aarhus University, Denmark
Andrew Goldberg, Amazon, U.S.
John Iacono, Université libre de Bruxelles, Belgium
Russell Impagliazzo, University of California, San Diego, U.S.
Giuseppe Italiano,   LUISS Guido Carli, Rome, Italy
Michael Kapralov,¬† √Čcole Polytechnique F√©d√©rale de Lausanne (EPFL), Switzerland
Anna Karlin, University of Washington, Seattle, U.S.
Valerie King, University of Victoria, Canada (Co-chair)
Hung Le, University of Victoria, Canada and  University of Massachusetts at Amherst, U.S. (Co-chair)
Daniel Lokshtanov, University of California, Santa Barbara, U.S.
S. Cenk Sahinalp, NIH and University of Indiana, Bloomington, U.S.
Jared Saia, University of New Mexico, U.S.
Shay Solomon, Tel Aviv University, Israel
Mario Szegedy, Alibaba and Rutgers University, U.S.
Robert Tarjan, Princeton University, U.S.
Seeun William Umboh, University of Sydney, Australia
Qin Zhang, University of Indiana, Bloomington, U.S.
Uri Zwick, Tel Aviv University, Israel

Steering Committee
Michael A. Bender, Stony Brook University, U.S.
David Karger,  Massachusetts Institute of Technology, U.S.
Tsvi Kopelowitz, Bar-Ilan University, Israel
Seth Pettie,  University of Michigan, U.S.
Robert Tarjan, Princeton University, U.S.
Mikkel Thorup, University of Copenhagen, Denmark

Submissions and Deadlines:
A link to the submission site is available https://easychair.org/my/conference?conf=sosa21.
Short Abstract Submission and Paper Registration Deadline: August 12, 2020, 4:59 PM Eastern Time
Full Paper Submission Deadline: August 19, 2020, 4:59 PM Eastern Time
Acceptance Notification: early October 2020
Proceedings (posted online):  early January 2021

How to Participate
Authors must submit their papers electronically, in PDF format.
Submissions should begin with a title page containing the paper title, each author‚Äôs name, affiliation, and email address, and an abstract summarizing the contributions of the paper. There is no page limit. The paper should begin with a clear description of the algorithmic problem to be solved, a survey of prior work on the problem‚ÄĒincluding a candid assessment of prior work in terms of simplicity and elegance‚ÄĒand a discussion of the contributions of the paper. The body of the paper should be written for a general theoretical computer science audience, and substantiate the main claims of the paper with full proofs. The submission should be typeset using 11 point font, in a single-column format with ample spacing throughout and ample margins all around.¬† The submissions ought to be visually easy to read.
Brevity is a hallmark of simplicity. Authors are specifically encouraged to submit short and simple papers. Final papers may not exceed fifteen (15) pages in double column format.

The program committee may designate one or more papers as SOSA Best Papers. All submissions will be considered.

Simons institute lectures on analysis of Boolean functions

(Does it still make sense to blog such announcements or is these days Twitter the only way to go about this? Asking for a friend ūüôā )

Prasad Raghavendra and Avishay Tal have organized a sequence of 6 lectures on some of the exciting recent advances in analysis of Boolean functions.

Lecture Series: Advances in Boolean Function Analysis

The Simons Institute is organizing a series of lectures on Advances in Boolean Function Analysis, that will highlight a few major developments in the area. The series will feature weekly two-hour lectures from July 15th to Aug 18th.  The lectures aim to address both the broad context of the results and their technical details. Though closely related in theme, each lecture will be self-contained.  The schedule is attached below (more info at link).

Talks take place on Wednesdays at 10am Pacific time (1pm Eastern). If you can’t catch them live, they will be redcorded.

Zoom Link: https://berkeley.zoom.us/j/93086371156

Talk Schedule:

July 15, Wednesday  10:00am PDT (1pm EDT) Dor Minzer (Institute of Advanced Study)On the Fourier-Entropy Influence Conjecture

July 22, Wednesday, 10:00am PDT (1pm EDT) Hao Huang (Emory University) & Avishay Tal (UC Berkeley)Sensitivity Conjecture and Its Applications

August 3rd, Monday, 10:00am PDT (1pm EDT) Shachar Lovett (UC San Diego)Improved Bounds for the Sunflower Lemma

August 5, Wednesday, 10:00am PDT (1pm EDT) Ronen Eldan (Weizmann Institute)Concentration on the Boolean Hypercube via Pathwise Stochastic Analysis

August 12, Wednesday, 10:00am Esty Kelman (Tel Aviv University) KKL via Random Restrictions

August 18, Tuesday, 10:00am Pooya Hatami (Ohio State University)Pseudorandom Generators from Polarizing Random Walks

TCS book: Call for GitHub issues

I originally planned this summer to finish the work on my Introduction to Theoretical Computer Science book, and in particular write the two missing chapters on space complexity and interactive proof systems. Needless to say, this summer did not go as planned and I won’t be able to write these chapters. However, I still intend to go over the existing chapters, fixing typos, adding examples, exercises, and generally making it friendlier to beginning undergraduate students.

Toward this end, I would be grateful for people posting bugs, typos, and suggestions as GitHub issues (I currently have 267 closed and 14 open issues which I hope to get to soon). Of course, if you are technically inclined and there’s a simple local fix, you can also make a pull request.

Aside from these fixes, I am making two more “global” changes to the book. First, I am adding a “non mathy overview” for each chapter. While some students got a lot from reading the book prior to lectures, others were intimidated by the mathematical notation, and so I hope this more gentle introduction will be helpful. I am also adding more examples & solved exercises toward this end.

Another change is that I now follow the more traditional way of presenting deterministic finite automata before Turing machines – DFAs are still optional and can be skipped without missing anything, but some instructors find them as a good introduction to Turing Machines. Thus the order of presentation of materials in the book is roughly as follows:

  1. Introduction, representing objects as strings – Representing numbers, lists, etc. Specifying computational tasks as functions mapping binary strings to binary strings, Cantor’s theorem.
  2. Finite functions and Boolean circuits – Every function can be computed by some circuit, circuits as straightline programs, representing circuits as strings, universal circuit evaluator, counting lower bound.
  3. Computing on unbounded inputs – DFAs (optional), Turing Machines, equivalence between Turing machines, RAM machines and programming languages, őĽ calculus (optional), cellular automata (optional)
  4. Uncomputability – Universal Turing machine, Halting problem, reductions, Rice’s Theorem. Optional: G√∂del’s incompleteness theorem, uncomputability of quantified arithmetic statements, context free grammars.
  5. Efficient computation – Modeling running time, time hierarchy theorem, P and EXP
  6. NP and NP completeness – Polynomial-time reductions, Cook-Levin Theorem (using circuits), definition of NP using “proof system”/”verifying algorithms” (no non-deterministic TMs), NP completeness, consequences of P=NP: search to decision, optimal machine learning, etc..
  7. Randomized computation: Worst-case randomized computation, defining BPP, Sipser-G√°cs, does BPP=P? (a little on derandomization)
  8. Cryptography: One time pad, necessity of long keys for information theoretic crypto, pseudorandom generators and stream ciphers, taste of public key and “magic” (ZKP, FHE, MPC)
  9. Quantum computing: Some quantum mechanics background – double slit experiment, Bell’s inequality. Modeling quantum computation. Bird’s eye view of Shor’s algorithm and quantum Fourier transform.