## A bet for the new decade

I am at the Tel Aviv Theory Fest this week – a fantastic collection of talks and workshops organized by Yuval Filmus, Gil Kalai, Ronen Eldan, and Muli Safra.

It was a good chance to catch up with many friends and colleagues. In particular I met Elchanan Mossel and Subhash Khot, who asked me to serve as a “witness” for their bet on the unique games conjecture. I am recording it here so we can remember it a decade from now.

Specifically, Elchanan bets that the Unique Games conjecture will be proven in the next decade – sometime between January 1, 2020 and December 31, 2029, there will be a paper uploaded to the arXiv with a correct proof of the conjecture. Subhash bets that this won’t happen. They were not sure what to bet on, but eventually agreed to take my offer that the loser will have to collaborate on a problem chosen by the winner, so I think science will win in either case. (For what it’s worth, I think there is a good chance that Subhash will lose the bet because he himself will prove the UGC in this decade, though it’s always possible Subhash can both win the bet and prove the UGC if he manages to do it by tomorrow 🙂)

The conference itself is, as I mentioned, wonderful with an amazing collection of speakers. Let me mention just a couple of talks from this morning. Shafi Goldwasser talked about “Law and Algorithms”. There is a recent area of research studying how to regulate algorithms, but Shafi’s talk focused mostly on the other direction: how algorithms and cryptography can help achieve legal objectives such as the “right to be forgotten” or the ability to monitor secret proceedings such as wiretap requests.

Christos Papadimitriou talked about “Language, Brain, and Computation”. Christos is obviously excited about understanding the language mechanisms in the brain. He said that studying the brain gives him the same feeling that you get when you sit in a coffee shop in Cambridge and hear intellectual discussions all around you: you don’t understand why everyone is not dropping everything they are doing and coming here. (Well, his actual words were “sunsets over the Berkeley hills” but I think the Cambridge coffee shops are a better metaphor 🙂)

*[The post below is by Dorit Aharonov, who co-organized the wonderful school on quantum computing last week, which I attended and greatly enjoyed. –Boaz]*

**TL;DR:** Last week we had a wonderful one-week intro course on the math of quantum computing at Hebrew U; it included a one-day crash course on the basics, and 7 mini-courses on math-oriented research topics (quantum delegation, Hamiltonian complexity, algorithms and more) by top-notch speakers. Most importantly, it is all online, and could be very useful if you want to take a week or two to enter the area and don’t know where to start.

Hi Theory people!

I want to tell you about a 5-day winter school called “The Mathematics of Quantum Computation”, which we (me, Zvika Brakerski, Or Sattath and Amnon Ta-Shma) organized last week at the Institute for Advanced Studies (IIAS) at the Hebrew University in Jerusalem.

There were two reasons I happily agreed to Boaz’s suggestion to write a guest blogpost about this school.

a) The school was really great fun. We enjoyed it so much that I think you might find it interesting to hear about it even if you were not there, or are not even into quantum computation.

And b), it might actually be useful for you or your quantum-curious friends. We put all the material online, with the goal that after the school, this collection of talks and written material will constitute all that is needed for an almost self-contained **very intensive one-week introductory course on the mathematical side of quantum computation**; I think this might be of real use for any theoretical computer scientist or mathematician interested in entering this fascinating but hard-to-penetrate area, and not knowing where to start.

Before telling you a little more about what we actually learned in this school, let’s start with some names and numbers. We had:

- 160 participants (students and faculty) from all over the world.
- 7 terrific speakers: Adam Bouland (UC Berkeley), Sergey Bravyi (IBM), Matthias Christandl (Copenhagen), András Gilyén (Caltech), Sandy Irani (UC Irvine), Avishay Tal (Berkeley), and Thomas Vidick (Caltech);
- 2 great TAs: András Gilyén (Caltech) and Chinmay Nirkhe (UC Berkeley)
- 4 busy organizers: myself (Hebrew U), Zvika Brakerski (Weizmann), Or Sattath (Ben Gurion U), and Amnon Ta-Shma (Tel Aviv U)
- 1 exciting and very intensive program
- 5 challenging and fascinating days of talks, problem sessions and really nice food.
- 1 great Rabin’s lecture by Boaz Barak (Harvard)
- 1 beautiful Quantum perspective lecture by Sergey Bravyi (IBM)
- 8 panelists in the supremacy panel we had on the fifth day: Sandy Irani (UC Irvine), our wise moderator, and 7 panelists on stage and online: myself, Scott Aaronson (Austin, online), Boaz Barak, Adam Bouland, Sergio Boixo (Google, online), Gil Kalai (Hebrew U), and Umesh Vazirani (UC Berkeley, online)
- 8 brave speakers in the gong show, our very last session, each talking for 3 minutes;
- 1 group-tour to 1 UNESCO site (Tel Maresha) and 6 beers tasted by ~80 tour participants
- 3 problem sets with 43 problems and (!) their solutions.

So why did we decide to organize this particular quantum school, given the many quantum schools around? Well, the area of quantum computation is just bursting with excitement and new mathematical challenges, but there seems to be no easy way for theoreticians to learn about all these things unless they are already in the loop… The (admittedly) very ambitious goal of the school was to assume zero background in quantum computation, and quickly bring people up to speed on six or seven of the most interesting mathematical research forefronts in the area.

The first day of the school was intended to put everyone essentially on the same page: it included four talks about the very basics (qubits by Or Sattath, circuits by myself, algorithms by Adam Bouland, and error correction by Sergey Bravyi). By the end of this first day everyone was at least supposed to be familiar with the basic concepts, and capable of listening to the mini-courses to follow. The rest of the school was devoted mainly to those mini-courses, whose topics included what I think are some of the most exciting topics on the more theoretical and mathematical side of quantum computation.

Yes, it was extremely challenging… the good thing was that we had two great TAs, András and Chinmay, who helped prepare problem sets, which people actually seriously tried to solve (!) during the daily one-plus-hour TA problem-solving sessions (with the help of the team strolling around ready to answer questions…). It seems that this indeed helped people follow, despite the fact that we did get into some hard stuff in those mini-courses… The many questions that were asked throughout the school proved that many people were following and interested till the bitter end.

So here is a summary of the mini-courses, by order of appearance.

I added some buzz words of interesting related mathematical notions so that you know where these topics might lead you if you take the paths they suggest.

- Thomas Vidick gave a wonderfully clear three-lecture mini-course providing an intro to the recently very active and exciting area of **quantum verification and delegation**, connecting cryptography and quantum computational complexity. [Thomas didn’t have time to talk about it, but down the road this eventually connects to approximate representation theory, as well as to the Connes embedding conjecture, and more.]
- Sandy Irani gave a beautiful exposition (again, in a three-lecture mini-course) on **quantum Hamiltonian complexity**. Sandy started with Kitaev’s quantum version of the Cook-Levin theorem, showing that the local Hamiltonian problem is quantum NP-complete; she then explained how this can be extended to more physically relevant questions such as translationally invariant 1D systems, questions about the thermodynamic limit, and more. [This topic is related to open questions such as quantum PCP, which was not mentioned in the school, as well as to beautiful recent results about the undecidability of the spectral gap problem, and more.]
- Matthias Christandl gave an exciting two-lecture mini-course on the fascinating connection between **tensor ranks and matrix multiplication**. Starting from what seemed to be childish games with small pictures in his first talk, he cleverly used those as his building blocks in his second talk, enabling him to talk about Strassen’s universal spectral points program for approaching the complexity of matrix multiplication, asymptotic ranks, border ranks and more. That also included some very beautiful pictures of polytopes! Matthias explained the connection that underlies this entire direction, between entanglement properties of three-body systems and these old combinatorial problems.
- Avishay Tal gave a really nice two-lecture exposition of his recent breakthrough result with Ran Raz, proving that **quantum polynomial time computation is not contained in the polynomial hierarchy, in the oracle model**. This included talking about AC0, a problem called forrelation, Fourier expansion, Gaussians and much more.
- András Gilyén gave a wonderful talk about a recent development: the evolution of the **singular value approach to quantum algorithms**. He left us all in awe, showing that essentially almost any quantum algorithm you can think of falls into this beautiful framework… Among other things, he mentioned Chebyshev polynomials, quantum walks, Hamiltonian simulation, and more. What else can be done with this framework remains to be seen.
- Sergey Bravyi gave two talks (on top of his intro to quantum error correction). The first was part of a monthly series at Hebrew University called “quantum perspectives”; in this talk, Sergey gave a really nice exposition of his breakthrough result (with Gosset and König) demonstrating an *information theoretical* separation between quantum and classical constant-depth circuits; this cleverly uses the well-known quantum magic square game, in which quantum correlations win with probability one while classical correlations are always bounded away from one; somehow this result manages to cleverly turn this game into a computational advantage. In Sergey’s last talk, he gave the basics of the beautiful topic of **stoquastic Hamiltonians**, a model in between quantum Hamiltonians and classical constraint satisfaction problems, which poses many fundamental and interesting open questions (and is tightly related to classical Markov chains, and Markov chain Monte Carlo).
- Finally, Adam Bouland gave two superb talks on **quantum supremacy**, explaining the beautiful challenges in this area, including his recent average-case to worst-case hardness results about sampling using quantum circuits, which is related to Google’s supremacy experiment.
- Ah, I also gave a talk – it was about three of the many different equivalent models of quantum computation: **adiabatic computation**, **quantum walks**, and the **Jones polynomial** (I also briefly mentioned a differential geometry model). The talk came out way too disordered in my mind (never give a talk when you are an organizer!), but hopefully it gave some picture of the immense variety of ways to think about quantum computation and quantum algorithms.

In addition to the main lectures, we also had some special events intertwined:

- Boaz Barak gave the distinguished annual Rabin lecture, joint with the CS colloquium. His talk, which was given the intriguing title **“Quantum computing and classical algorithms: The best of frenemies”**, focused on the fickle relationship between quantum and classical algorithms. The main players in this beautiful talk were SDPs and sums of squares, and it left us with many open questions.
- Last but not least, we had an **international panel about the meaning of Google’s recent experiment claiming supremacy**, joined by Sergio Boixo from Google, who explained the experiment, as well as Scott Aaronson and Umesh Vazirani, who woke up very early in the US to join us. I feared we would have some friction and fist fights, but this actually became a deep and interesting discussion! We went into quite some depth on what is in my mind the most important question about the supremacy experiment, which is the issue of noise. Unbelievably, it all went well even from the technological aspect! I really recommend watching this discussion.

So, we had a great time… and as I said, one of the best things is that it is all recorded and saved. You are welcome to follow the program, watch the recorded talks, consult the slides, lecture notes, exercises and solutions, and also read the reading material if you want to extend your knowledge beyond what is covered in the school. In case you know of any math or TCS-oriented person who wants to enter the field and start working on some problem at the forefront of research, just send him or her this post, or the link to the school’s website; it will take a very intensive week (well, maybe two) of following lectures and doing the exercises, but by the end of that time, one is guaranteed to be no longer a complete amateur in the area, as the set of topics covered gives a pretty good picture of what is going on in the field.


Last but not least, I would like to thank the Israeli quantum initiative, Vatat, and the IIAS, for their generous funding which enabled this school and the funding of students; the IIAS team for their immense help in organization; and of course, thanks a lot to all participants who attended the school!

Wishing everyone a very happy year of 2020,

Dorit

## Deep Double Descent (cross-posted on OpenAI blog)

*By Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever*

*This is a lightly edited and expanded version of the following post on the OpenAI blog about the following paper. While I usually don’t advertise my own papers on this blog, I thought this might be of interest to theorists, and a good follow up to my prior post. I promise not to make a habit out of it. –Boaz*

**TL;DR:** Our paper shows that double descent occurs in conventional modern deep learning settings: visual classification in the presence of label noise (CIFAR 10, CIFAR 100) and machine translation (IWSLT’14 and WMT’14). As we increase the number of parameters in a neural network, initially the test error decreases, then increases, and then, just as the model is able to fit the train set, it undergoes a second descent, again decreasing as the number of parameters increases. This behavior also extends over train epochs, where a single model undergoes double descent in test error over the course of training. Surprisingly (at least to us!), we show these phenomena can lead to a regime where “*more data hurts*”: training a deep network on a larger train set actually performs worse.

## Introduction

Open a statistics textbook and you are likely to see warnings against the danger of “overfitting”: If you are trying to find a good classifier or regressor for a given set of labeled examples, you would be well-advised to steer clear of having so many parameters in your model that you are able to completely fit the training data, because you risk not generalizing to new data.

The canonical example for this is polynomial regression. Suppose that we get *n* samples of the form *(x, p(x)+noise)* where *x* is a real number and *p(x)* is a cubic (i.e. degree 3) polynomial. If we try to fit the samples with a degree 1 polynomial (a linear function), then we would get many points wrong. If we try to fit it with just the right degree, we would get a very good predictor. However, as the degree grows beyond that, the fit gets worse, until the degree is large enough to fit all the noisy training points, at which point the regressor is terrible, as shown in this figure:

It seems that the higher the degree, the worse things are, but what happens if we go *even higher*? It seems like a crazy idea: why would we increase the degree beyond the number of samples? But it corresponds to the practice of having many more parameters than training samples in modern deep learning. Just like in deep learning, when the degree is larger than the number of samples, there is more than one polynomial that fits the data, but we choose a specific one: the one found by running gradient descent.

Here is what happens if we do this for degree 1000, fitting a polynomial using gradient descent (see this notebook):

We still fit all the training points, but now we do so in a more controlled way which actually tracks quite closely the ground truth. We see that despite what we learn in statistics textbooks, sometimes overfitting is not that bad, as long as you go “all in” rather than “barely overfitting” the data. That is, overfitting doesn’t hurt us if we take the number of parameters to be much larger than what is needed to just fit the training set; in fact, as we see in deep learning, larger models are often better.

The above is not a novel observation. Belkin et al. called this phenomenon **“double descent”**, and it goes back to even earlier works. In this new paper we (Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever) extend the prior works and report on a variety of experiments showing that “double descent” is widely prevalent across several modern deep neural networks and for several natural tasks such as image recognition (for the CIFAR 10 and CIFAR 100 datasets) and language translation (for the IWSLT’14 and WMT’14 datasets). As we increase the number of parameters in a neural network, initially the test error decreases, then increases, and then, just as the model is able to fit the train set, it undergoes a *second descent*, again decreasing as the number of parameters increases. Moreover, double descent also extends beyond the number of parameters to other measures of “complexity” such as the number of training epochs of the algorithm.

The take-away from our work (and the prior works it builds on) is that neither the classical statisticians’ conventional wisdom that **“too large models are worse”** nor the modern ML paradigm that **“bigger models are always better”** always holds. Rather, it all depends on whether you are on the first or second descent. Furthermore, these insights also allow us to generate natural settings in which even the age-old adage of **“more data is always better”** is violated!

In the rest of this blog post we present a few sample results from this recent paper.

### Model-wise Double Descent

We observed many cases in which, just like in the polynomial interpolation example above, the test error undergoes a “double descent” as we increase the complexity of the model. The figure below demonstrates one such example: we plot the test error as a function of the complexity of the model for ResNet18 networks. The complexity of the model is the width of the layers, and the dataset is CIFAR10 with 15% label noise. Notice that the peak in test error occurs around the “interpolation threshold”: when the models are just barely large enough to fit the train set. In all cases we’ve observed, changes which affect the interpolation threshold (such as changing the optimization algorithm, changing the number of train samples, or varying the amount of label noise) also affect the location of the test error peak correspondingly.

We found the double descent phenomenon is most prominent in settings with added label noise; without it, the peak is much smaller and easy to miss. But adding label noise amplifies this general behavior and allows us to investigate it easily.
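For concreteness, here is a minimal sketch of what “15% label noise” can mean in such experiments: a fraction of the training labels is replaced by uniformly random classes. (This is one common convention, written by me for illustration; the paper’s exact procedure may differ in details.)

```python
import numpy as np

def add_label_noise(labels, frac, num_classes, rng):
    """Return a copy of `labels` where a `frac` fraction of entries are
    replaced by uniformly random class labels (a replaced label may, by
    chance, coincide with the original one)."""
    noisy = labels.copy()
    idx = rng.choice(len(labels), size=int(frac * len(labels)), replace=False)
    noisy[idx] = rng.integers(0, num_classes, size=len(idx))
    return noisy

rng = np.random.default_rng(0)
clean = rng.integers(0, 10, size=1000)   # stand-in for 10-class labels
noisy = add_label_noise(clean, 0.15, 10, rng)
```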

### Sample-Wise Nonmonotonicity

Using the model-wise double descent phenomenon we can obtain examples where training on **more data actually hurts**. To see this, let’s look at the effect of increasing the number of train samples on the test error vs. model size graph. The plot below shows Transformers trained on a language-translation task (with no added label noise):

On the one hand, (as expected) increasing the number of samples generally shifts the curve downwards towards lower test error. On the other hand, it also shifts the curve to the right: since more samples require larger models to fit, the interpolation threshold (and hence, the peak in test error) shifts to the right. For intermediate model sizes, these two effects combine, and we see that **training on 4.5x more samples actually hurts test performance.**

### Epoch-Wise Double Descent

**There is a regime where training longer reverses overfitting.** Let’s look closer at the experiment from the “Model-wise Double Descent” section, and plot Test Error as a function of both model size and number of optimization steps. In the plot below to the right, each column tracks the Test Error of a given model over the course of training. The top horizontal dotted line corresponds to the double descent of the first figure. But we can also see that for a fixed large model, as training proceeds, test error goes down, then up, and down again; we call this phenomenon “epoch-wise double descent.”

Moreover, if we plot the Train error of the same models and the corresponding interpolation contour (dotted line) we see that it exactly matches the ridge of high test error (on the right).

**In general, the peak of test error appears systematically when models are just barely able to fit the train set.**

Our intuition is that for models at the interpolation threshold, there is effectively only one model that fits the train data, and forcing it to fit even slightly noisy or mis-specified labels will destroy its global structure. That is, there are no “good models” which both interpolate the train set and perform well on the test set. However, in the over-parameterized regime, there are many models that fit the train set, and there exist “good models” which both interpolate the train set and perform well on the distribution. Moreover, the implicit bias of SGD leads it to such “good” models, for reasons we don’t yet understand.

The above intuition is theoretically justified for linear models, via a series of recent works including [Hastie et al.] and [Mei-Montanari]. We leave fully understanding the mechanisms behind double descent in deep neural networks as an important open question.

## Commentary: Experiments for Theory

The experiments above are especially interesting (in our opinion) because of how they can inform ML theory: any theory of ML must be consistent with “double descent.” In particular, one ambitious hope for what it means to “theoretically explain ML” is to prove a theorem of the form:

“If the distribution satisfies property X and architecture/initialization satisfies property Y, then SGD trained on ‘n’ samples, for T steps, will have small test error with high probability”

For values of X, Y, n, T, “small” and “high” that are used in practice.

However, these experiments show that these properties are likely more subtle than we may have hoped for, and must be non-monotonic in certain natural parameters.

This rules out even certain natural “conditional conjectures” that we may have hoped for, for example the conjecture that

“If SGD on a width W network works for learning from ‘n’ samples from distribution D, then SGD on a width W+1 network will work at least as well”

Or the conjecture

“If SGD on a certain network and distribution works for learning with ‘n’ samples, then it will work at least as well with n+1 samples”

It also appears to conflict with a “2-phase” view of the trajectory of SGD, as an initial “learning phase” and then an “overfitting phase” — in particular, because the overfitting is sometimes reversed (at least, as measured by test error) by further training.

Finally, the fact that these phenomena are not specific to neural networks, but appear to hold fairly universally for natural learning methods (linear/kernel regression, decision trees, random features) gives us hope that there is a deeper phenomenon at work, and we are yet to find the right abstraction.

*We especially thank Mikhail Belkin and Christopher Olah for helpful discussions throughout this work. The polynomial example is inspired in part by experiments in [Muthukumar et al.].*

## HALG 2020 call for nominations (guest post by Yossi Azar)

*[Guest post by Yossi Azar – I attended HALG once and enjoyed it quite a lot; I highly recommend people make such nominations –Boaz]*

**Call for Invited Talk Nominations**: **5th Highlights of Algorithms conference (HALG 2020)**

ETH Zurich, June 3-5, 2020


http://2020.highlightsofalgorithms.org/

The HALG 2020 conference seeks high-quality nominations for invited talks that will highlight recent advances in algorithmic research. Similarly to previous years, there are two categories of invited talks:

A. survey (60 minutes): a survey of an algorithmic topic that has seen exciting developments in the last couple of years.

B. paper (30 minutes): a significant algorithmic result appearing in a paper in 2019 or later.

To nominate, please email halg2020.nominations@gmail.com the following information:

- Basic details: speaker name + topic (for survey talk) or paper’s title, authors, conference/arxiv + preferable speaker (for paper talk).
- Brief justification: Focus on the benefits to the audience, e.g., quality of results, importance/relevance of topic, clarity of talk, speaker’s presentation skills.

All nominations will be reviewed by the Program Committee (PC) to select speakers that will be invited to the conference.

Nominations deadline: **December 20, 2019** (for full consideration).

## Harvard opportunity: lecturing / advising position

Harvard Computer Science is seeking a Lecturer/Assistant Director of Undergraduate Studies. A great candidate would be someone passionate about teaching and mentoring and excited to build a diverse and inclusive Undergraduate Computer Science community at Harvard. The position requires a Ph.D and is open to all areas of computer science and related fields, but of course personally I would love to have a theorist fill this role.

Key responsibilities are:

* Teach (or co-teach) one undergraduate Computer Science course per semester.

* Join and help lead the Computer Science Undergraduate Advising team (which includes mentoring and advising undergraduate students and developing materials, initiatives, and events to foster a welcoming and inclusive Harvard Computer Science community).

The job posting with all details is at https://tiny.cc/harvardadus

If you have any questions about this position, feel free to contact me or Steve Chong (the co-directors of undergraduate studies for CS at Harvard) at cs-dus at seas.harvard.edu.

## Puzzles of modern machine learning

It is often said that "we don’t understand deep learning," but it is not as often clarified what exactly it is that we don’t understand. In this post I try to list some of the "puzzles" of modern machine learning, from a theoretical perspective. This list is neither comprehensive nor authoritative. Indeed, I only started looking at these issues last year, and am very much in the position of not yet fully understanding the questions, let alone the potential answers. On the other hand, at the rate ML research is going, a calendar year corresponds to about 10 "ML years"…

Machine learning offers many opportunities for theorists; there are many more questions than answers, and it is clear that a better theoretical understanding of what makes certain training procedures work or fail is desperately needed. Moreover, recent advances in software frameworks have made it much easier to test out intuitions and conjectures. While in the past running training procedures might have required a Ph.D. in machine learning, recently the "barrier to entry" was reduced first to undergraduates, then to high school students, and these days it’s so easy that even theoretical computer scientists can do it 🙂

To set the context for this discussion, I focus on the task of supervised learning. In this setting we are given a *training set* of examples of the form $(x_i, y_i)$ where $x_i$ is some vector (think of it as the pixels of an image) and $y_i$ is some label (think of $y_i$ as equaling $1$ if $x_i$ is the image of a dog and $0$ if it is the image of a cat). The goal in supervised learning is to find a *classifier* $f$ such that $f(x)=y$ will hold for many future samples $(x,y)$.

The standard approach is to consider some parameterized family of classifiers, where for every vector $\theta$ of parameters we associate a classifier $f_\theta$. For example, we can fix a certain neural network architecture (depth, connections, activation functions, etc.) and let $\theta$ be the vector of weights that characterizes every network in this architecture. People then run some optimization algorithm such as stochastic gradient descent with the objective of finding the vector $\theta$ that minimizes a *loss function* $L_S(\theta)$. This loss function can be the fraction of labels that $f_\theta$ gets wrong on the training set $S$, or a more continuous loss that takes into account the confidence level or other parameters of $f_\theta$ as well. By now this general approach has been successfully applied to many classification tasks, in many cases achieving near-human to super-human performance. In the rest of this post I want to discuss some of the questions that arise when trying to obtain a theoretical understanding of both the powers and the limitations of the above approach. I focus on deep learning, though there are still some open questions even for over-parameterized linear regression.
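As a toy illustration of this pipeline (a linear classifier rather than a deep network; the data, constants, and names below are purely illustrative), here is stochastic gradient descent minimizing a logistic loss over a parameter vector $\theta$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set: x in R^2 with label y = 1 iff x1 + x2 > 0.
n = 500
X = rng.standard_normal((n, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def loss_and_grad(theta, Xb, yb):
    # Logistic loss: a continuous surrogate for the fraction of wrong labels.
    p = 1.0 / (1.0 + np.exp(-(Xb @ theta)))
    loss = -np.mean(yb * np.log(p + 1e-12) + (1 - yb) * np.log(1 - p + 1e-12))
    grad = Xb.T @ (p - yb) / len(yb)
    return loss, grad

theta = np.zeros(2)
for _ in range(2000):
    batch = rng.integers(0, n, size=32)            # sample a minibatch
    _, grad = loss_and_grad(theta, X[batch], y[batch])
    theta -= 0.1 * grad                            # SGD step

train_err = np.mean((X @ theta > 0) != (y > 0.5)) # fraction of wrong labels
```

Deep learning replaces the linear map `X @ theta` by a multi-layer network, but the loop (sample a minibatch, compute the gradient of the loss, step) is the same.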

## The generalization puzzle

The approach outlined above has been well known and analyzed for many decades in the statistical learning literature. There are many cases where we can *prove* that a classifier obtained in this way has a small *generalization gap*, in the sense that if the training set was obtained by sampling $n$ independent and identically distributed samples from a distribution $D$, then the performance of a classifier on new samples from $D$ will be close to its performance on the training set.

Ultimately, these results all boil down to the Chernoff bound. Think of the random variables $X_1,\ldots,X_n$ where $X_i = 1$ if the classifier makes an error on the $i$-th training example. The Chernoff bound tells us that the probability that $\tfrac{1}{n}\sum_i X_i$ deviates by more than $\epsilon$ from its expectation is something like $\exp(-\epsilon^2 n)$, and so as long as the total number of classifiers is less than $\exp(\epsilon^2 n/2)$, we can use a union bound over all possible classifiers to argue that if we make an $\alpha$ fraction of errors on the training set, the probability we make an error on a new example is at most $\alpha + \epsilon$. We can of course "bunch together" classifiers that behave similarly on our distribution, and so it is enough if there are at most $\exp(\epsilon^2 n/2)$ of these equivalence classes. Another approach is to add a "regularizing term" $R(\theta)$ to the objective function, which amounts to restricting attention to the set of all classifiers $f_\theta$ such that $R(\theta) \leq \lambda$ for some parameter $\lambda$. Again, as long as the number of equivalence classes in this set is less than $\exp(\epsilon^2 n/2)$, we can use this bound.
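To see the numbers this argument produces, here is the union-bound calculation in code. It uses the Hoeffding form of the Chernoff bound and a crude "bits per parameter" count of distinct classifiers; all the constants are illustrative choices of mine:

```python
import math

# For one classifier: P[|train error - true error| > eps] <= 2*exp(-2*eps^2*n)
# (Hoeffding's inequality).  A union bound over K classifiers multiplies the
# failure probability by K.  With p parameters of b bits each, a crude count
# is K = 2**(b*p), so log K = b * p * log 2.
def samples_needed(num_params, eps=0.05, delta=0.01, bits_per_param=32):
    """Smallest n such that K * 2 * exp(-2 * eps**2 * n) <= delta."""
    log_K = bits_per_param * num_params * math.log(2)
    return math.ceil((log_K + math.log(2 / delta)) / (2 * eps**2))
```

Even a 10-parameter model already "needs" tens of thousands of samples by this bound, and the requirement grows linearly with the parameter count, which is exactly why the bound becomes vacuous for networks with millions of parameters.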

To a first approximation, the number of classifiers (even after "bunching together") is roughly exponential in the number of parameters, and so these results tell us that as long as the number $p$ of parameters is smaller than the number $n$ of examples, we can expect to have a small *generalization gap* and can infer future performance (known as "test performance") from the performance on the training set (known as "train performance").
Once the number of parameters $p$ becomes close to, or even bigger than, the number of samples $n$, we are in danger of "overfitting", where we could have excellent train performance but terrible test performance. Thus according to classical statistical learning theory, the ideal number of parameters would be some number below the number of samples $n$, with the precise value governed by the so-called "bias variance tradeoff".

This is a beautiful theory, but unfortunately the classical theorems yield vacuous results in the realm of modern machine learning, where we often train networks with millions of parameters on mere tens of thousands of examples. Moreover, Zhang et al showed that this is not just a question of counting parameters better: modern deep networks can in fact "overfit" and achieve 100% success on the training set even when given random or arbitrary labels.
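The Zhang et al phenomenon can be illustrated in a much simpler setting than theirs (a linear model with more parameters than samples, standing in for an actual deep network; this is my own simplification, not their experiment): the minimum-norm least-squares solution interpolates completely random labels.

```python
import numpy as np

rng = np.random.default_rng(1)

n, p = 100, 2000                      # far more parameters than samples
X = rng.standard_normal((n, p))
y = rng.choice([-1.0, 1.0], size=n)   # completely random labels

# Minimum-norm interpolating solution: theta = X^T (X X^T)^{-1} y.
# Since p > n, the system X theta = y is (almost surely) solvable exactly.
theta = X.T @ np.linalg.solve(X @ X.T, y)

train_acc = np.mean(np.sign(X @ theta) == y)
print("train accuracy on random labels:", train_acc)  # 1.0: a perfect "overfit"
```

With $p > n$ the model fits any labeling of the data perfectly, so the train performance carries no information whatsoever about the test performance.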

The results above in particular show that we can find classifiers that perform great on the training set but terribly on future tests, as well as classifiers that perform terribly on the training set but pretty well on future tests. Specifically, consider an architecture that has the capacity to fit arbitrary labels, and suppose that we train it on a set $S$ of examples $(x,y)$. Then we can find a setting of the parameters $\theta$ that both fits the training set exactly (i.e., satisfies $f_\theta(x) = y$ for all $(x,y) \in S$) but also satisfies the additional constraint that $f_\theta(x) = \bar{y}$ (i.e., the negation of the label $y$) for every $(x,y)$ in some additional set $S'$ of pairs. (The set $S'$ is not part of the actual training set, but rather an "auxiliary set" that we simply use for the sake of constructing this counterexample; note that we can use $S \cup S'$ as a means to generate the initial network, which can then be fed into standard stochastic gradient descent on the set $S$.) If $S'$ is, say, 19 times larger than $S$, then the network fits its training set perfectly, but since it effectively corresponds to training with 95% label noise, it will perform worse than even a coin toss.
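Here is a toy version of this construction, with a linear model standing in for the network and a linear ground-truth concept (both are my own simplifications for the sake of a runnable sketch): one parameter vector fits $S$ perfectly while also agreeing with the negated labels on a 19-times-larger auxiliary set, and consequently does worse than a coin toss on fresh samples.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 1000
w_true = rng.standard_normal(d)          # ground-truth linear concept

def sample(n):
    X = rng.standard_normal((n, d))
    return X, np.sign(X @ w_true)

Xs, ys = sample(20)    # the "real" training set S
Xa, ya = sample(380)   # auxiliary set S', 19x larger; its labels will be NEGATED

# One interpolating parameter vector for S with correct labels AND S' with
# flipped labels -- effectively 95% label noise on the combined set.
Xall = np.vstack([Xs, Xa])
yall = np.concatenate([ys, -ya])
theta = Xall.T @ np.linalg.solve(Xall @ Xall.T, yall)

Xt, yt = sample(2000)  # fresh test samples
train_acc = np.mean(np.sign(Xs @ theta) == ys)
test_acc = np.mean(np.sign(Xt @ theta) == yt)
print("train acc on S:", train_acc)   # perfect fit of the training set
print("test acc:", test_acc)          # well below a coin toss
```

The solution is dominated by the anti-correlated auxiliary set, so the classifier is aligned with the *negation* of the true concept even though its record on $S$ is flawless.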

In an analogous way, we can find parameters that completely fail on the training set $S$, but correctly fit the additional "auxiliary set" $S'$. This corresponds to standard training with 5% label noise, which typically yields about 95% of the performance on the noiseless distribution.

The above insights break the **separation of concerns**, or the separation of **computational problems** from **algorithms**, which we theorists like so much. Ideally, we would like to phrase the "machine learning problem" as a well-defined optimization objective, such as finding, given a training set $S$, the parameter vector $\theta$ that minimizes the empirical loss $L_S(\theta)$. Once the problem is phrased in this way, we can try to design an algorithm that achieves this goal as efficiently as possible.

Unfortunately, modern machine learning does not currently lend itself to such a clean partition. In particular, since not all optima are equally good, we *don't actually want* to solve the task of minimizing the loss function in a "black box" way. In fact, many of the ideas that make optimization faster, such as acceleration, lower learning rates, second-order methods, and others, yield *worse* generalization performance. Thus, while minimizing the objective function is somewhat correlated with generalization performance, it is neither necessary nor sufficient for it. This is a clear sign that we don't really understand what makes machine learning work, and there is still much left to discover. I don't know what machine learning textbooks in the 2030's will contain, but my guess is that they will *not* prescribe running stochastic gradient descent on one of these loss functions. (Moritz Hardt counters that what we teach in ML today is not that far from the 1973 book of Duda and Hart, and that by some measures ML has moved *slower* than other areas of CS.)

The *generalization puzzle* of machine learning can be phrased as the question of understanding which properties of a procedure that maps a training set into a classifier lead to good generalization performance with respect to certain distributions. In particular, we would like to understand which properties of natural distributions and of stochastic gradient descent make the latter into such a map.

## The computational puzzle

Yet another puzzle in modern machine learning arises from the fact that we are able to minimize $L_S(\theta)$ in the first place. A priori this is surprising since, apart from very special cases (e.g., linear regression with a square loss), the function $\theta \mapsto L_S(\theta)$ is in general *non-convex*. Indeed, for almost any natural loss function, the problem of finding the $\theta$ that minimizes $L_S(\theta)$ is NP-hard.
However, if we look at the computational question in the context of the generalization puzzle above, it might not be so mysterious. As we have seen, the fact that the $\theta$ we output is a global minimizer (or close to a minimizer) of $L_S$ is in some sense accidental, and by far not the most important property of $\theta$. There are many minima of the loss function that generalize badly, and many non-minima that generalize well.
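As a minimal illustration of how non-convexity interacts with this point (a one-dimensional toy loss of my own choosing, not an actual training objective): gradient descent started at different points converges to different local minima with different loss values, so which minimum we reach is an artifact of the algorithm and its initialization, not of the objective alone.

```python
import numpy as np

def loss(t):
    # A toy non-convex loss with two local minima of different quality.
    return (t**2 - 1)**2 + 0.3 * t

def grad(t):
    return 4 * t * (t**2 - 1) + 0.3

def gd(t, lr=0.01, steps=2000):
    # Plain gradient descent from initialization t.
    for _ in range(steps):
        t -= lr * grad(t)
    return t

left, right = gd(-0.5), gd(0.5)
print(left, loss(left))    # minimum near -1 (the better one)
print(right, loss(right))  # minimum near +1 (the worse one)
```

Two runs of the *same* procedure on the *same* objective end up at stationary points with different loss values, purely because of where they started.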

So perhaps the right way to phrase the computational puzzle is as

"How come that we are able to use stochastic gradient descent to find the vector that is output by stochastic gradient descent."

which when phrased like that, doesn’t seem like much of a puzzle after all.

## The off-distribution performance puzzle

In the supervised learning problem, the training samples are drawn from the same distribution as the final test samples. But in many applications of machine learning, classifiers are expected to perform on samples that arise from very different settings. The image that the camera of a self-driving car observes is not drawn from ImageNet, and yet it still needs to (and often can) detect whether or not it is seeing a dog or a cat (at which point it will brake or accelerate, depending on whether the programmer was a dog or a cat lover). Another insight into this question comes from a recent work of Recht et al. They generated a new set of test images that is very similar to the original ImageNet test set, but not identical to it. One can think of it as generated from a distribution $D'$ that is close to, but not the same as, the original ImageNet distribution $D$. They then checked how well neural networks that were trained on the original ImageNet distribution perform on $D'$. They saw that while these networks performed significantly worse on $D'$ than they did on $D$, their performance on $D'$ was *highly correlated* with their performance on $D$. Hence doing better on $D$ did correspond to being better in a way that carried over to the (very closely related) $D'$. (However, the networks did perform worse on $D'$, so off-distribution performance is by no means a full success story.)

Coming up with a theory that can supply some predictions for learning in a way that is not as tied to the particular distribution is still very much open. I see it as somewhat akin to finding a theory for the performance of algorithms that is somewhere between average-case complexity (which is highly dependent on the distribution) and worst-case complexity (which does not depend on the distribution at all, but is not always achievable).

## The robustness puzzle

If the previous puzzles were about understanding why deep networks are surprisingly good, the next one is about understanding why they are surprisingly bad. Images of physical objects have the property that if we modify them in certain ways, such as perturbing a small number of pixels, shifting them by a few shades, or rotating them by a small angle, they still correspond to the same object. Deep neural networks do not seem to "pick up" on this property. Indeed, there are many examples of how tiny perturbations can cause a neural net to think that one image is another, and people have even printed a 3D turtle that most modern systems recognize as a rifle. (See this excellent tutorial, though note that an "ML decade" has already passed since it was published.) This "brittleness" of neural networks can be a significant concern when we deploy them in the wild. (Though perhaps mixing up turtles and rifles is not so bad: I can imagine some people who would normally resist regulations to protect the environment but would support them if they confused turtles with guns…) Perhaps one reason for this brittleness is that neural networks can be thought of as a way of embedding a set of examples from dimension $d$ into dimension $m$ (where $m$ is the number of neurons in the penultimate layer) in a way that makes the positive examples linearly separable from the negative examples. Amplifying small differences can help in achieving such a separation, even if it hurts robustness.
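A minimal sketch of this brittleness, using a linear classifier as a stand-in for a network (this is the standard FGSM-style construction in its simplest form, not an experiment from the post): perturbing every coordinate by a small amount in the direction of the weights is enough to flip the prediction.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 1000
w = rng.standard_normal(d)   # weights of a (hypothetical) trained linear classifier
x = rng.standard_normal(d)   # an input with d "pixels"

score = w @ x
# Push every coordinate by eps in the direction that lowers |score|.
# Choosing eps = 2|score| / ||w||_1 exactly negates the score.
eps = 2 * abs(score) / np.sum(np.abs(w))
x_adv = x - eps * np.sign(w) * np.sign(score)

print("per-pixel change:", eps)                    # small relative to pixel scale ~1
print(np.sign(w @ x), np.sign(w @ x_adv))          # the prediction flips
```

Because the perturbation is spread over all $d$ coordinates, each coordinate moves only by $O(1/\sqrt{d})$ of the typical score magnitude, yet the classification changes; this is the "amplifying small differences" effect in miniature.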

Recent works have attempted to rectify this by using variants of the loss function in which the loss of an example corresponds to the maximum error under all possible such perturbations of that example. A priori you would think that while robust training might come at a computational cost, statistically it would be a "win-win", with the resulting classifiers not only being more robust but also overall better at classifying. After all, we are providing the training procedure with the additional information (i.e., "updating its prior") that the label should be unchanged by certain transformations, which should be equivalent to supplying it with more data. Surprisingly, robust classifiers currently perform *worse* than standard trained classifiers on unperturbed data. Ilyas et al argued that this may be because even if humans ignore information encoded in, for example, whether the intensity level of a pixel is odd or even, it does not mean that this information is not predictive of the label. Suppose that (with no basis whatsoever – just as an example) cat owners are wealthier than dog owners, and hence cat pictures tend to be taken with higher-quality lenses. One could imagine that a neural network would pick up on that and use some of the fine-grained information in the pixels to help in classification. When we force such a network to be robust, it would perform worse. The Distill journal published six discussion pieces on the Ilyas et al paper. I like the idea of such "paper discussions" very much and hope it catches on in machine learning and beyond.
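For a linear model with hinge loss, this "maximum over perturbations" loss can even be computed in closed form, which makes for a small self-contained sketch (my own illustration; for real networks the inner maximization has no closed form and must be approximated, e.g. by projected gradient ascent):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 50
w = rng.standard_normal(d)   # weights of a (hypothetical) trained linear classifier
x = rng.standard_normal(d)
y = 1.0                      # the true label, in {-1, +1}
eps = 0.1                    # l_inf perturbation budget

def hinge(margin):
    return max(0.0, 1.0 - margin)

std_loss = hinge(y * (w @ x))

# For a LINEAR model, the worst l_inf perturbation pushes every coordinate
# by eps against the label, costing exactly eps * ||w||_1 of margin:
robust_loss = hinge(y * (w @ x) - eps * np.sum(np.abs(w)))

# Sanity check: the explicit worst-case perturbation achieves this loss.
delta = -eps * y * np.sign(w)
assert np.isclose(robust_loss, hinge(y * (w @ (x + delta))))
print("standard loss:", std_loss, " robust loss:", robust_loss)
```

The robust loss is never smaller than the standard one, which is one way to see why optimizing it can trade off accuracy on unperturbed data.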

## The interpretability puzzle

Deep neural networks are inspired by our brain, and it is tempting to try to understand their internal structure just like we try to understand the brain and see if it has a "grandmother neuron". For example, we could try to see if there is a certain neuron (i.e., gate) in a neural network that "fires" only when it is fed images with certain high-level features (or, more generally, find vectors that have large correlation with the state of a certain layer only when the image has some features). This is also of practical importance, as we increasingly use classifiers to make decisions such as whether to approve or deny bail, whether to prescribe a patient treatment A or treatment B, or whether a car should steer left or right, and we would like to understand the basis for such decisions. There are beautiful visualizations of neural networks' decisions and internal structures, but given the robustness puzzle above, it is unclear if these really capture the decision process. After all, if we can change the classification from a cat to a dog by perturbing a tiny number of pixels, in what sense can we explain *why* the network made one decision or the other?

## The natural distributions puzzle

Yet another puzzle (pointed out to me by Ilya Sutskever) is to understand what it is about "natural" distributions such as images, texts, etc. that makes them so amenable to learning via neural networks, even though such networks can have a very hard time learning even simple concepts such as parities. Perhaps this is related to the "noise robustness" of natural concepts, which is related to being correlated with low-degree polynomials. Another suggestion could be that, at least for text, human languages are implicitly designed to fit neural networks. Perhaps on some other planets there are languages where the meaning of a sentence completely changes depending on whether it has an odd or an even number of letters…
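The parity example is easy to verify directly: the parity of all $n$ bits is exactly orthogonal to every monomial of degree less than $n$, so no low-degree feature carries any signal about it. A short exhaustive check (my own illustration, using the standard $\pm 1$ Fourier conventions) for $n = 8$:

```python
import numpy as np
from itertools import product, combinations

n = 8
X = np.array(list(product([-1, 1], repeat=n)))  # all 2^8 inputs in {-1,+1}^8
parity = np.prod(X, axis=1)                      # parity of all n bits

# Correlation of full parity with every monomial (character) of degree <= 2.
# By orthogonality of characters, each correlation is exactly zero.
for deg in (0, 1, 2):
    for S in combinations(range(n), deg):
        chi = np.prod(X[:, list(S)], axis=1) if S else np.ones(len(X))
        corr = np.mean(parity * chi)
        assert abs(corr) < 1e-12
print("parity has zero correlation with every feature of degree <= 2")
```

A "noise robust" concept, in contrast, concentrates most of its Fourier weight on low degrees, which is one formal sense in which natural concepts may be easier for neural networks.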

## Summary

The above are just a few puzzles that modern machine learning offers us. Not all of these might have answers in the form of mathematical theorems, or even well-stated conjectures, but it is clear that there is still much to be discovered, and plenty of research opportunities for theoretical computer scientists. In this post I focused on supervised learning, where at least the problem is well defined, but there are other areas of machine learning, such as transfer learning and generative modeling, where we don't even yet know how to phrase the computational task, let alone prove that any particular procedure solves it. In several ways, the state of machine learning today seems to me similar to the state of cryptography in the late 1970's. After the discovery of public key cryptography, researchers had highly promising techniques and great intuitions, but still did not really understand even what security means, let alone how to achieve it. In the decades since, cryptography has turned from an art into a science, and I hope and believe the same will happen to machine learning.

**Acknowledgements:** Thanks to Preetum Nakkiran, Aleksander Mądry, Ilya Sutskever and Moritz Hardt for helpful comments. (In particular, I dropped an interpretability experiment suggested in an earlier version of this post since Moritz informed me that several similar experiments have been done.) Needless to say, none of them is responsible for any of the speculations and/or errors above.

## Rabin postdoc fellowship

Hi, once again it is the time of year to advertise the Michael O. Rabin postdoctoral fellowship at Harvard; see https://toc.seas.harvard.edu/rabin-postdoc for more details. The deadline to apply is December 2, 2019. For any questions please email theory-postdoc-apply (at) seas dot harvard dot edu.