
Summer School on Statistical Physics and Machine Learning

January 21, 2020

Gerard Ben Arous, Surya Ganguli, Florent Krzakala and Lenka Zdeborova are organizing a summer school on statistical physics of machine learning on August 2-28, 2020 in Les Houches, France. If you don’t know Les Houches, it apparently looks like this:

They are looking for applications from students, postdocs, and young researchers in physics & math, as well as computer scientists. While I am biased (I will be lecturing there too), I think the combination of lecturers, speakers, and audience members will yield a unique opportunity for interaction across communities, and I strongly encourage theoretical computer scientists to apply (which you can do from the website). Let me also use this opportunity to remind people again of Tselil Schramm’s blog post where she collected some of the lecture notes from the seminar we ran on physics & computation.

More information about the summer school:

The “Les Houches school of physics”, situated close to Chamonix and the Mont Blanc in the French Alps, has a long history of forming generations of young researchers on the frontiers of their fields. Our school is aimed primarily at the growing audience of theoretical physicists and applied mathematicians interested in machine learning and high-dimensional data analysis, as well as to colleagues from other fields interested in this interface. [my emphasis –Boaz] We will cover basics and frontiers of high-dimensional statistics, machine learning, the theory of computing and learning, and probability theory. We will focus in particular on methods of statistical physics and their results in the context of current questions and theories related to machine learning and neural networks. The school will also cover examples of applications of machine learning methods in physics research, as well as other emerging applications of wide interest. Open questions and directions will be presented as well.

Students, postdocs and young researchers interested in participating in the event are invited to apply on the website before March 15, 2020. The capacity of the school is limited; because of this constraint, participants will be selected from among the applicants and will be required to attend the whole event.

Lecturers:

  • Boaz Barak (Harvard): Computational hardness perspectives
  • Giulio Biroli (ENS, Paris): High-dimensional dynamics
  • Michael Jordan (UC Berkeley): Optimization, diffusion & economics
  • Marc Mézard (ENS, Paris): Message-Passing algorithms
  • Yann LeCun (Facebook AI, NYU): Challenges and directions in machine learning
  • Remi Monasson (ENS, Paris): Statistical physics or learning in neural networks
  • Andrea Montanari (Stanford): High-dimensional statistics & neural networks
  • Maria Schuld (Univ. KwaZulu Natal & Xanadu): Quantum machine learning
  • Haim Sompolinsky (Harvard & Hebrew Univ.): Statistical mechanics of deep neural networks
  • Nathan Srebro (TTI-Chicago): Optimization and implicit regularisation
  • Miles Stoudenmire (Flatiron, NYC): Tensor network methods
  • Pierre Vandergheynst (EPFL, Lausanne): Graph signal processing & neural networks

Invited Speakers (to be completed):

  • Christian Borgs (UC Berkeley)
  • Jennifer Chayes (UC Berkeley)
  • Shirley Ho (Flatiron NYC)
  • Levent Sagun (Facebook AI)

Intro TCS recap

January 15, 2020

This semester I taught another iteration of my “Introduction to Theoretical Computer Science” course, based on my textbook in progress. The book was also used in University of Virginia CS 3102 by David Evans and Nathan Brunelle.

The main differences I made in the text and course since its original version were to make it less “idiosyncratic”: while I still think using programming language terminology is the conceptually “right” way to teach this material, there is a lot to be said for sticking with well-established models. So, I used Boolean circuits as the standard model for finite-input non-uniform computation, and Turing Machines as the standard model for unbounded-input uniform computation. (I do talk about the equivalent programming languages view of both models, which can be a more useful perspective for some results, and is also easier to work with in code.)

In any course on intro to theoretical CS, there are always beautiful topics that are left on the “cutting room floor”. To partially compensate for that, we had an entirely optional “advanced section” where guest speakers talked about topics such as error correcting codes, circuit lower bounds, communication complexity, interactive proofs, and more. The TA in charge of this section, an amazing sophomore named Noah Singer, wrote very detailed lecture notes for this section.

This semester, students in CS 121 could also do an optional project. Many chose to do a video about topics related to the course, here are some examples:

There is still much work to do on both the text and the course. Though the text has improved a lot (we do have 267 closed issues after all), some students still justifiably complained about typos, which can throw off people who are just getting introduced to the topic. I also want to add significantly more solved exercises and examples, since students do find them extremely useful. I need to significantly beef up the NP completeness chapter with more examples of reductions, though I do have Python implementations of several reductions and of the Cook-Levin theorem.

This type of course is often known as a “great ideas” course in computer science, and so in the book I also added a “Big Idea” environment to highlight those. Of course some of those ideas are bigger than others, but I think the list below reflects well the contents of the course:

  • If we can represent objects of type T as strings, then we can represent tuples of objects of type T as strings as well.
  • A function is not the same as a program. A program computes a function.
  • Two models are equivalent in power if they can be used to compute the same set of functions.
  • Every finite function can be computed by a large enough Boolean circuit.
  • A program is a piece of text, and so it can be fed as input to other programs.
  • Some functions f:\{0,1\}^n \rightarrow \{0,1\} cannot be computed by a Boolean circuit using fewer than an exponential (in n) number of gates.
  • We can precisely define what it means for a function to be computable by any possible algorithm.
  • Using equivalence results such as those between Turing and RAM machines, we can “have our cake and eat it too”: We can use a simpler model such as Turing machines when we want to prove something can’t be done, and use a feature-rich model such as RAM machines when we want to prove something can be done.
  • There is a “universal” algorithm that can evaluate arbitrary algorithms on arbitrary inputs.
  • There are some functions that can not be computed by any algorithm.
  • If a function F is uncomputable we can show that another function H is uncomputable by giving a way to reduce the task of computing F to computing H.
  • We can use restricted computational models to bypass limitations such as uncomputability of the Halting problem and Rice’s Theorem. Such models can compute only a restricted subclass of functions, but allow to answer at least some semantic questions on programs.
  • A proof is just a string of text whose meaning is given by a verification algorithm.
  • The running time of an algorithm is not a number, it is a function of the length of the input.
  • For a function F:\{0,1\}^* \rightarrow \{0,1\} and T:\mathbb{N} \rightarrow \mathbb{N}, we can formally define what it means for F to be computable in time at most T(n) where n is the size of the input.
  • All “reasonable” computational models are equivalent if we only care about the distinction between polynomial and exponential. (The book immediately notes quantum computers as a possible exception to this.)
  • If we have more time, we can compute more functions.
  • By “unrolling the loop” we can transform an algorithm that takes T(n) steps to compute F into a circuit that uses poly(T(n)) gates to compute the restriction of F to \{0,1\}^n.
  • A reduction F \leq_p G shows that F is “no harder than G” or equivalently that G is “no easier than F“.
  • If a single \mathbf{NP}-complete problem has a polynomial-time algorithm, then there is such an algorithm for every decision problem that corresponds to the existence of an efficiently-verifiable solution.
  • If \mathbf{P}=\mathbf{NP}, we can efficiently solve a fantastic number of decision, search, optimization, counting, and sampling problems from all areas of human endeavors.
  • A randomized algorithm outputs the correct value with good probability on every possible input.
  • We can amplify the success of randomized algorithms to a value that is arbitrarily close to 1.
  • There is no secrecy without randomness.
  • Computational hardness is necessary and sufficient for almost all cryptographic applications.
  • Just as we did with classical computation, we can define mathematical models for quantum computation, and represent quantum algorithms as binary strings.
  • Quantum computers are not a panacea and are unlikely to solve \mathbf{NP}-complete problems, but they can provide exponential speedups for certain structured problems.
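
Several of the ideas above (a program is a piece of text, there is a universal evaluator, every finite function has a circuit) fit in a few lines of Python. The following is only a minimal sketch: the triple format `(target, left, right)` and the names `XOR_PROGRAM` and `evaluate` are assumptions made up for this illustration, not the book's notation.

```python
def nand(a, b):
    """NAND of two bits."""
    return 1 - a * b

# A "program" is just data. Each triple (target, left, right) means
# target = NAND(left, right). This format is a hypothetical illustration.
XOR_PROGRAM = [
    ("u", "x0", "x1"),
    ("v", "x0", "u"),
    ("w", "x1", "u"),
    ("y0", "v", "w"),
]

def evaluate(program, inputs):
    """A tiny 'universal' evaluator: run any straight-line NAND program
    (given as data) on a tuple of input bits and return the bit y0."""
    env = {f"x{i}": bit for i, bit in enumerate(inputs)}
    for target, left, right in program:
        env[target] = nand(env[left], env[right])
    return env["y0"]

if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, evaluate(XOR_PROGRAM, (a, b)))
```

Here `evaluate` plays the role of the universal algorithm, restricted to straight-line NAND programs, and `XOR_PROGRAM` is a four-gate circuit for XOR that is passed to it as an ordinary list.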

These are all ideas that I believe are important for Computer Science undergraduates to be exposed to, but covering all of these does make for a very challenging course, which gets literally mixed reviews from the students, with some loving it and some hating it. (I post all reviews on the course home page.) Running a 200-student class is definitely something that I’m still learning how to do.

MIP*=RE, disproving Connes embedding conjecture.

January 14, 2020

In an exciting manuscript just posted on the arxiv, Zhengfeng Ji, Anand Natarajan, Thomas Vidick, John Wright, and Henry Yuen prove that there is a 2-prover quantum protocol (with shared entanglement) for the halting problem. As a consequence they resolve negatively a host of open problems in quantum information theory and operator algebra, including refuting the longstanding Connes embedding conjecture. See also Scott’s post and this blog post of Thomas Vidick discussing his personal history with these questions, that started with his Masters project under Julia Kempe’s supervision 14 years ago.

I am not an expert in this area, and have yet to read the paper beyond the first few pages, but I find the result astounding. In particular, the common intuition is that since all physical quantities are “nice” functions (continuous, differentiable, etc.), we could never distinguish between the case that the universe is infinite and the case that it is discretized on a fine enough grid. The new work (as far as I understand) provides a finite experiment that can potentially succeed with probability 1 if the two provers use an infinite amount of shared entangled state, but would succeed with probability at most 1/2 if they use only a finite amount. A priori you would expect that if there is a strategy that succeeds with probability 1 with infinite entanglement, then you could succeed with probability at least 1-\epsilon with a finite entangled state whose dimension depends only on \epsilon.

The result was preceded by Ito and Vidick’s 2012 result that \mathbf{NEXP} \subseteq \mathbf{MIP^*} and Natarajan and Wright’s result last year that \mathbf{NEEXP} (nondeterministic double exponential time) is contained in \mathbf{MIP^*}. This brings to mind Edmonds’ classic quote that:

“For practical purposes the difference between algebraic and exponential order is often more crucial than the difference between finite and non-finite”

Sometimes, the difference between double-exponential and infinite turns out to be non-existent.

A bet for the new decade

December 30, 2019

I am at the Tel Aviv Theory Fest this week – a fantastic collection of talks and workshops organized by Yuval Filmus, Gil Kalai, Ronen Eldan, and Muli Safra.

It was a good chance to catch up with many friends and colleagues. In particular I met Elchanan Mossel and Subhash Khot, who asked me to serve as a “witness” for their bet on the unique games conjecture. I am recording it here so we can remember it a decade from now.

Specifically, Elchanan bets that the Unique Games conjecture will be proven in the next decade – sometime between January 1, 2020 and December 31, 2029 there will be a paper uploaded to the arxiv with a correct proof of the conjecture. Subhash bets that this won’t happen. They were not sure what to bet on, but eventually agreed to take my offer that the loser will have to collaborate on a problem chosen by the winner, so I think science will win in either case. (For what it’s worth, I think there is a good chance that Subhash will lose the bet because he himself will prove the UGC in this decade, though it’s always possible Subhash can both win the bet and prove the UGC if he manages to do it by tomorrow 🙂 )

The conference itself is, as I mentioned, wonderful with an amazing collection of speakers. Let me mention just a couple of talks from this morning. Shafi Goldwasser talked about “Law and Algorithms”. There is a recent area of research studying how to regulate algorithms, but Shafi’s talk focused mostly on the other direction: how algorithms and cryptography can help achieve legal objectives such as the “right to be forgotten” or the ability to monitor secret proceedings such as wiretap requests.

Christos Papadimitriou talked about “Language, Brain, and Computation”. Christos is obviously excited about understanding the language mechanisms in the brain. He said that studying the brain gives him the same feeling that you get when you sit in a coffee shop in Cambridge and hear intellectual discussions all around you: you don’t understand why everyone is not dropping everything they are doing and coming here. (Well, his actual words were “sunsets over the Berkeley hills” but I think the Cambridge coffee shops are a better metaphor 🙂 )

A crash course on the math of quantum computing (guest post by Dorit Aharonov)

December 28, 2019

[The post below is by Dorit Aharonov who co-organized the wonderful school on quantum computing last week which I attended and greatly enjoyed. –Boaz]

TL;DR: Last week we had a wonderful one-week intro course on the math of quantum computing at Hebrew U. It included a one-day crash course on the basics, and 7 mini-courses on math-oriented research topics (quantum delegation, Hamiltonian complexity, algorithms and more) by top-notch speakers. Most importantly – it is all online, and could be very useful if you want to take a week or two to enter the area and don’t know where to start.

Hi Theory people!  

I want to tell you about a 5-day winter school called “The Mathematics of Quantum Computation”, which we (me, Zvika Brakerski, Or Sattath and Amnon Ta-Shma) organized last week at the Israel Institute for Advanced Studies (IIAS) at the Hebrew University in Jerusalem.

There were two reasons I happily agreed to Boaz’s suggestion to write a guest blogpost about this school.

a) The school was really great fun. We enjoyed it so much, that I think you might find it interesting to hear about it even if you were not there, or are not even into quantum computation.

And b), it might actually be useful for you or your quantum-curious friends. We put all material online, with the goal in mind that after the school, this collection of talks+written material will constitute all that is needed for an almost self-contained very-intensive-one-week-course of introduction into the mathematical side of quantum computation; I think this might be of real use for any theoretical computer scientist or mathematician interested in entering this fascinating but hard-to-penetrate area, and not knowing where to start.

Before telling you a little more about what we actually learned in this school, let’s start with some names and numbers. We had:

  • 160 participants (students and faculty) from all over the world. 
  • 7 terrific speakers: Adam Bouland (UC Berkeley), Sergey Bravyi (IBM), Matthias Christandl (Copenhagen), András Gilyén (Caltech), Sandy Irani (UC Irvine), Avishay Tal (Berkeley), and Thomas Vidick (Caltech);
  • 2 great TAs:  András Gilyén (Caltech) and Chinmay Nirkhe (UC Berkeley)
  • 4 busy organizers: myself (Hebrew U), Zvika Brakerski (Weizmann), Or Sattath (Ben Gurion U), and Amnon Ta-Shma (Tel Aviv U)
  • 1 exciting and very intensive program
  • 5 challenging and fascinating days of talks, problem sessions and really nice food.
  • 1 great Rabin lecture by Boaz Barak (Harvard)
  • 1 beautiful Quantum perspective lecture by Sergey Bravyi (IBM)
  • 8 panelists in the supremacy panel we had on the fifth day: Sandy Irani (UC Irvine), our wise moderator, and 7 panelists on stage and online: myself, Scott Aaronson (Austin, online), Boaz Barak, Adam Bouland, Sergio Boixo (Google, online), Gil Kalai (Hebrew U), and Umesh Vazirani (UC Berkeley, online) 
  • 8 brave speakers in the gong show, our very last session, each talking for 3 minutes;    
  • 1 group-tour to 1 UNESCO site (Tel Maresha) and 6 beers tasted by ~80 tour participants
  • 3 problem sets with 43 problems and (!) their solutions.   

So why did we decide to organize this particular quantum school, given the many quantum schools around? Well, the area of quantum computation is just bursting now with excitement and new mathematical challenges, but there seems to be no easy way for theoreticians to learn about all these things unless you are already in the loop… The (admittedly) very ambitious goal of the school was to assume zero background in quantum computation, and quickly bring people up to speed on six or seven of the most interesting mathematical research forefronts in the area.

The first day of the school was intended to put everyone essentially on the same page: it included four talks about the very basics (qubits by Or Sattath, circuits by myself, algorithms by Adam Bouland, and error correction by Sergey Bravyi). By the end of this first day everyone was at least supposed to be familiar with the basic concepts, and capable of listening to the mini-courses to follow. The rest of the school was devoted mainly to those mini-courses, whose topics included what I think are some of the most exciting topics on the more theoretical and mathematical side of quantum computation.

Yes, it was extremely challenging… the good thing was that we had two great TAs, András and Chinmay, who helped prepare problem sets, which people actually seriously tried to solve (!) during the daily one+ hour TA problem-solving sessions (with the help of the team strolling around ready to answer questions…). It seems that this indeed helped people follow, despite the fact that we did get into some hard stuff in those mini-courses… The many questions that were asked throughout the school proved that many people were following and interested till the bitter end. 

So here is a summary of the mini-courses, in order of appearance. I added some buzzwords for interesting related mathematical notions, so that you know where these topics might lead you if you take the paths they suggest.

  • Thomas Vidick gave a wonderfully clear three-lecture mini-course providing an intro to the recently very active and exciting area of quantum verification and delegation, connecting cryptography and quantum computational complexity. [Thomas didn’t have time to talk about it, but down the road this eventually connects to approximate representation theory, as well as to the Connes embedding conjecture, and more.]
  • Sandy Irani gave a beautiful exposition (again, in a three-lecture mini-course) on quantum Hamiltonian complexity. Sandy started with Kitaev’s quantum version of the Cook-Levin theorem, showing that the local Hamiltonian problem is quantum NP-complete; she then explained how this can be extended to more physically relevant questions such as translationally invariant 1D systems, questions about the thermodynamic limit, and more. [This topic is related to open questions such as quantum PCP, which was not mentioned in the school, as well as to beautiful recent results about undecidability of the spectral gap problem, and more.]
  • Matthias Christandl gave an exciting two-lecture mini-course on the fascinating connection between tensor ranks and matrix multiplication. Starting from what seemed to be childish games with small pictures in his first talk, he cleverly used those as building blocks in his second talk, enabling him to talk about Strassen’s universal spectral points program for approaching the complexity of matrix multiplication, asymptotic ranks, border ranks and more. That also included very beautiful pictures of polytopes! Matthias explained the connection that underlies this entire direction, between entanglement properties of three-body systems and these old combinatorial problems.
  • Avishay Tal gave a really nice two-lecture exposition on his recent breakthrough result with Ran Raz, proving that quantum polynomial time computation is not contained in the polynomial hierarchy, in the oracle model. This included talking about AC0, a problem called forrelation, Fourier expansion, Gaussians and much more.
  • András Gilyén gave a wonderful talk about a recent development: the evolution of the singular value approach to quantum algorithms. He left us all in awe, showing that almost any quantum algorithm you can think of falls into this beautiful framework… Among other things, he mentioned Chebyshev polynomials, quantum walks, Hamiltonian simulation, and more. What else can be done with this framework remains to be seen.
  • Sergey Bravyi gave two talks (on top of his intro to quantum error correction). The first was part of a monthly series at the Hebrew University called “quantum perspectives”; in this talk, Sergey gave a really nice exposition of his breakthrough result (with Gosset and König) demonstrating an information-theoretic separation between quantum and classical constant-depth circuits. This uses in a clever way the well-known quantum magic square game, which quantum correlations can win with probability one while classical correlations are always bounded away from one; somehow this result manages to turn this game into a computational advantage. In his last talk, Sergey gave the basics of the beautiful topic of stoquastic Hamiltonians – a model in between quantum Hamiltonians and classical constraint satisfaction problems, which poses many fundamental and interesting open questions (and is tightly related to classical Markov chains, and Markov chain Monte Carlo).
  • Finally, Adam Bouland gave two superb talks on quantum supremacy, explaining the beautiful challenges in this area – including his recent average case to worst case hardness results about sampling using quantum circuits, which is related to Google’s supremacy experiment.  
  • Ah, I also gave a talk – it was about three of the many different equivalent models of quantum computation – adiabatic computation, quantum walks, and the Jones polynomial (I also briefly mentioned a differential geometry model). The talk came out way too disordered in my mind (never give a talk when you are an organizer!), but hopefully it gave some picture about the immense variety of ways to think about quantum computation and quantum algorithms.
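
For readers who have not met the magic square game mentioned in the summary of Sergey's first talk, the classical half of the statement can be checked by brute force. The sketch below assumes one common convention for the game (it was not spelled out in the talks summary): Alice is asked for a row of a 3x3 grid and must answer with ±1 entries whose product is +1, Bob is asked for a column and must answer with entries whose product is -1, and they win if they agree on the shared cell. The names `ROW_ANSWERS`, `COL_ANSWERS`, `wins` and `best_classical_wins` are made up for this illustration.

```python
from itertools import product

# Assumed convention: row answers have product +1, column answers have product -1.
ROW_ANSWERS = [t for t in product((1, -1), repeat=3) if t[0] * t[1] * t[2] == 1]
COL_ANSWERS = [t for t in product((1, -1), repeat=3) if t[0] * t[1] * t[2] == -1]

def wins(alice, bob):
    """Number of (row, column) questions on which the shared cell agrees.
    alice[r] is Alice's answer for row r; bob[c] is Bob's answer for column c."""
    return sum(alice[r][c] == bob[c][r] for r in range(3) for c in range(3))

def best_classical_wins():
    """Brute force over all deterministic strategies (4^3 choices per player)."""
    return max(
        wins(alice, bob)
        for alice in product(ROW_ANSWERS, repeat=3)
        for bob in product(COL_ANSWERS, repeat=3)
    )

if __name__ == "__main__":
    print(best_classical_wins(), "out of 9 questions")  # prints: 8 out of 9 questions
```

Shared randomness only mixes deterministic strategies, so under this convention no classical strategy wins with probability above 8/9, whereas entangled players can win with probability one.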

In addition to the main lectures, we also had some special events intertwined: 

  • Boaz Barak gave the distinguished annual Rabin lecture, joint with the CS colloquium; His talk, which was given the intriguing title  “Quantum computing and classical algorithms: The best of frenemies”, focused on the fickle relationships between quantum and classical algorithms. The main players in this beautiful talk were SDPs and sums of squares, and it left us with many open questions.     
  • Last but not least, we had an international panel about the meaning of Google’s recent experiment claiming supremacy, joined by Sergio Boixo from Google explaining the experiment, as well as Scott Aaronson and Umesh Vazirani who woke up very early in the US to join us. I feared we would have some friction and fist fights, but this actually became a deep and interesting discussion! We went in quite some depth into what is, in my mind, the most important question about the supremacy experiment: the issue of noise. Unbelievably, it all went well even from the technological aspect! I really recommend watching this discussion.

So, we had a great time… and as I said, one of the best things is that it is all recorded and saved. You are welcome to follow the program, watch the recorded talks, consult the slides, lecture notes, exercises and solutions and also read the reading material if you want to extend your knowledge beyond what is covered in the school. In case you know of any math or TCS-oriented person who wants to enter the field and start working on some problem at the forefront of research, just send him or her this post, or the link to the school’s website; it will take a very intensive week (well, maybe two) of following lectures and doing the exercises, but by the end of that time, one is guaranteed to be no longer a complete amateur in the area, as the set of topics covered gives a pretty good picture of what is going on in the field.

Last but not least, I would like to thank the Israeli quantum initiative, Vatat, and the IIAS, for their generous funding which enabled this school and the funding of students; the IIAS team for their immense help in organization;   and of course, thanks a lot to all participants who attended the school!

Wishing everyone a very happy year of 2020,  


Deep Double Descent (cross-posted on OpenAI blog)

December 5, 2019

By Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever

This is a lightly edited and expanded version of the following post on the OpenAI blog about the following paper. While I usually don’t advertise my own papers on this blog, I thought this might be of interest to theorists, and a good follow up to my prior post. I promise not to make a habit out of it. –Boaz

TL;DR: Our paper shows that double descent occurs in conventional modern deep learning settings: visual classification in the presence of label noise (CIFAR 10, CIFAR 100) and machine translation (IWSLT’14 and WMT’14). As we increase the number of parameters in a neural network, initially the test error decreases, then increases, and then, just as the model is able to fit the train set, it undergoes a second descent, again decreasing as the number of parameters increases. This behavior also extends over train epochs, where a single model undergoes double-descent in test error over the course of training. Surprisingly (at least to us!), we show these phenomena can lead to a regime where “more data hurts”: training a deep network on a larger train set actually performs worse.


Open a statistics textbook and you are likely to see warnings against the danger of “overfitting”: If you are trying to find a good classifier or regressor for a given set of labeled examples, you would be well-advised to steer clear of having so many parameters in your model that you are able to completely fit the training data, because you risk not generalizing to new data.

The canonical example for this is polynomial regression. Suppose that we get n samples of the form (x, p(x)+noise) where x is a real number and p(x) is a cubic (i.e. degree 3) polynomial. If we try to fit the samples with a degree 1 polynomial (a linear function), then we would get many points wrong. If we try to fit it with just the right degree, we would get a very good predictor. However, as the degree grows, the fit gets worse until the degree is large enough to fit all the noisy training points, at which point the regressor is terrible, as shown in this figure:

It seems that the higher the degree, the worse things are, but what happens if we go even higher? It seems like a crazy idea: why would we increase the degree beyond the number of samples? But it corresponds to the practice of having many more parameters than training samples in modern deep learning. Just like in deep learning, when the degree is larger than the number of samples, there is more than one polynomial that fits the data, but we choose a specific one: the one found by running gradient descent.

Here is what happens if we do this for degree 1000, fitting a polynomial using gradient descent (see this notebook):

We still fit all the training points, but now we do so in a more controlled way which actually tracks quite closely the ground truth. We see that despite what we learn in statistics textbooks, sometimes overfitting is not that bad, as long as you go “all in” rather than “barely overfitting” the data. That is, overfitting doesn’t hurt us if we take the number of parameters to be much larger than what is needed to just fit the training set — and in fact, as we see in deep learning, larger models are often better.
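
For readers who want to play with the polynomial example without opening the notebook, here is a dependency-free sketch of the degree sweep. It is not the code from the notebook and rests on several assumptions: the cubic `ground_truth`, the noise level, the Legendre basis and the sample sizes are all made up for illustration, and instead of running gradient descent it directly computes the minimum-norm least-squares solution (for linear least squares, gradient descent started from zero with a small enough step size converges to that solution).

```python
import random

def legendre_features(x, degree):
    """Values P_0(x), ..., P_degree(x) of the Legendre polynomials at x."""
    feats = [1.0, x]
    for k in range(1, degree):
        feats.append(((2 * k + 1) * x * feats[k] - k * feats[k - 1]) / (k + 1))
    return feats[: degree + 1]

def solve(M, b):
    """Solve the square system M z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(M, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - tail) / aug[r][r]
    return z

def fit(xs, ys, degree):
    """Least-squares coefficients; the minimum-norm interpolant if degree + 1 > len(xs)."""
    A = [legendre_features(x, degree) for x in xs]
    n, d = len(xs), degree + 1
    if d <= n:
        # Classical regime: normal equations (A^T A) w = A^T y.
        AtA = [[sum(A[i][j] * A[i][k] for i in range(n)) for k in range(d)] for j in range(d)]
        Aty = [sum(A[i][j] * ys[i] for i in range(n)) for j in range(d)]
        return solve(AtA, Aty)
    # Over-parameterized regime: w = A^T (A A^T)^{-1} y, the minimum-norm solution.
    gram = [[sum(p * q for p, q in zip(A[i], A[j])) for j in range(n)] for i in range(n)]
    alpha = solve(gram, ys)
    return [sum(alpha[i] * A[i][j] for i in range(n)) for j in range(d)]

def predict(w, x):
    return sum(c * f for c, f in zip(w, legendre_features(x, len(w) - 1)))

def mse(w, xs, ys):
    return sum((predict(w, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def ground_truth(x):
    return 2 * x ** 3 - x  # hypothetical cubic, chosen only for illustration

if __name__ == "__main__":
    rng = random.Random(0)
    train_x = [rng.uniform(-1, 1) for _ in range(10)]
    train_y = [ground_truth(x) + rng.gauss(0, 0.1) for x in train_x]
    test_x = [i / 100 - 1 for i in range(201)]
    test_y = [ground_truth(x) for x in test_x]
    for degree in (1, 3, 6, 9, 30, 200):
        w = fit(train_x, train_y, degree)
        print(degree, mse(w, train_x, train_y), mse(w, test_x, test_y))
```

Each printed row gives the degree, the train error and the test error. How pronounced the peak near degree 9 (the point where the model can first fit all 10 samples) turns out to be depends on the seed, the noise level and the choice of basis, so treat the output as something to experiment with rather than a reproduction of the figures.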

The above is not a novel observation. Belkin et al. called this phenomenon “double descent” and this goes back to even earlier works. In this new paper we (Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever) extend the prior works and report on a variety of experiments showing that “double descent” is widely prevalent across several modern deep neural networks and for several natural tasks such as image recognition (for the CIFAR 10 and CIFAR 100 datasets) and language translation (for IWSLT’14 and WMT’14 datasets). As we increase the number of parameters in a neural network, initially the test error decreases, then increases, and then, just as the model is able to fit the train set, it undergoes a second descent, again decreasing as the number of parameters increases. Moreover, double descent also extends beyond number of parameters to other measures of “complexity” such as the number of training epochs of the algorithm.

The take-away from our work (and the prior works it builds on) is that neither the classical statisticians’ conventional wisdom that “too large models are worse” nor the modern ML paradigm that “bigger models are always better” always holds. Rather, it all depends on whether you are on the first or second descent. Furthermore, these insights also allow us to generate natural settings in which even the age-old adage of “more data is always better” is violated!

In the rest of this blog post we present a few sample results from this recent paper.

Model-wise Double Descent

We observed many cases in which, just like in the polynomial interpolation example above, the test error undergoes a “double descent” as we increase the complexity of the model. The figure below demonstrates one such example: we plot the test error as a function of the complexity of the model for ResNet18 networks. The complexity of the model is the width of the layers, and the dataset is CIFAR10 with 15% label noise. Notice that the peak in test error occurs around the “interpolation threshold”: when the models are just barely large enough to fit the train set. In all cases we’ve observed, changes which affect the interpolation threshold (such as changing the optimization algorithm, changing the number of train samples, or varying the amount of label noise) also affect the location of the test error peak correspondingly.

We found that the double descent phenomenon is most prominent in settings with added label noise; without it, the peak is much smaller and easy to miss. But adding label noise amplifies this general behavior and allows us to investigate it easily.
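
To make “added label noise” concrete, here is a minimal sketch of one way to corrupt a label vector. The convention used (each label is independently replaced, with probability p, by a uniformly random different class, and the noise is sampled once with a fixed seed) is an assumption for illustration, as is the function name `add_label_noise`; consult the paper for the exact setup used in the experiments.

```python
import random

def add_label_noise(labels, num_classes, p, seed=0):
    """With probability p, replace each label by a uniformly random *different* class.
    Assumed convention for illustration; the noise is fixed by the seed."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < p:
            # Draw from the num_classes - 1 other classes by skipping over y.
            wrong = rng.randrange(num_classes - 1)
            noisy.append(wrong if wrong < y else wrong + 1)
        else:
            noisy.append(y)
    return noisy

if __name__ == "__main__":
    clean = [i % 10 for i in range(20)]
    print(add_label_noise(clean, num_classes=10, p=0.15))
```

Sampling the noise once, rather than afresh in every epoch, means that an interpolating model has to memorize the wrong labels, which is the regime in which the peak in test error is easiest to see.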

Sample-Wise Nonmonotonicity

Using the model-wise double descent phenomenon we can obtain examples where training on more data actually hurts. To see this, let’s look at the effect of increasing the number of train samples on the test error vs. model size graph. The plot below shows Transformers trained on a language-translation task (with no added label noise):

On the one hand, as expected, increasing the number of samples generally shifts the curve downwards towards lower test error. On the other hand, it also shifts the curve to the right: since more samples require larger models to fit, the interpolation threshold (and hence the peak in test error) shifts to the right. For intermediate model sizes, these two effects combine, and we see that training on 4.5x more samples actually hurts test performance.
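The same effect can be reproduced in a toy setting. The sketch below fixes the model size (a linear model with 50 parameters, fit by minimum-norm least squares) and varies only the number of train samples. The Gaussian data, the noise level, and the sample sizes are all assumptions chosen for illustration, not the Transformer experiment.

```python
import numpy as np

def min_norm_test_error(n_train, dim=50, noise_std=0.5, n_test=500,
                        n_trials=20, seed=0):
    """Average excess test MSE of min-norm least squares with `dim` parameters."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_trials):
        # Hypothetical ground-truth linear model with unit-norm weights.
        beta = rng.standard_normal(dim)
        beta /= np.linalg.norm(beta)

        x_train = rng.standard_normal((n_train, dim))
        y_train = x_train @ beta + noise_std * rng.standard_normal(n_train)
        x_test = rng.standard_normal((n_test, dim))

        # Minimum-norm solution when n_train < dim, ordinary least
        # squares when n_train > dim.
        beta_hat, *_ = np.linalg.lstsq(x_train, y_train, rcond=None)
        errors.append(np.mean((x_test @ (beta_hat - beta)) ** 2))
    return float(np.mean(errors))

# The model size is fixed at 50 parameters, so n_train == 50 is the
# interpolation threshold for this model.
sample_sizes = [10, 25, 50, 100, 400]
errors_by_n = {n: min_norm_test_error(n) for n in sample_sizes}
for n in sample_sizes:
    print(f"n_train={n:4d}  test MSE={errors_by_n[n]:.3f}")
```

Going from 25 to 50 train samples moves this fixed-size model onto its interpolation threshold, so the test error should get worse before it improves again at 100 and 400 samples.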

Epoch-Wise Double Descent

There is a regime where training longer reverses overfitting. Let’s look closer at the experiment from the “Model-wise Double Descent” section, and plot test error as a function of both model size and number of optimization steps. In the plot below (right), each column tracks the test error of a given model over the course of training. The top horizontal dotted line corresponds to the double descent of the first figure. But we can also see that for a fixed large model, as training proceeds, test error goes down, then up, and then down again. We call this phenomenon “epoch-wise double descent.”

Moreover, if we plot the train error of the same models together with the corresponding interpolation contour (dotted line), we see that the contour exactly matches the ridge of high test error (on the right).

In general, the peak of test error appears systematically when models are just barely able to fit the train set.

Our intuition is that for models at the interpolation threshold, there is effectively only one model that fits the train data, and forcing it to fit even slightly noisy or mis-specified labels will destroy its global structure. That is, there are no “good models” that both interpolate the train set and perform well on the test set. However, in the over-parameterized regime, there are many models that fit the train set, and among them there exist “good models” that both interpolate the train set and perform well on the distribution. Moreover, the implicit bias of SGD leads it to such “good” models, for reasons we don’t yet understand.

The above intuition is theoretically justified for linear models, via a series of recent works including [Hastie et al.] and [Mei-Montanari]. We leave fully understanding the mechanisms behind double descent in deep neural networks as an important open question.
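To give the flavor of such results, here is a sketch, as we understand it, of the asymptotic risk shown in [Hastie et al.] for the simplest case: minimum-norm least squares with isotropic features and a well-specified linear model. The symbols are notation introduced here for illustration: $\gamma = p/n$ is the number of parameters per train sample, $r^2 = \|\beta\|^2$ is the signal strength, and $\sigma^2$ is the noise variance. The precise assumptions should be checked against the paper.

```latex
R(\gamma) \;\to\;
\begin{cases}
\sigma^2 \,\dfrac{\gamma}{1-\gamma}, & \gamma < 1,\\[2ex]
r^2\left(1 - \dfrac{1}{\gamma}\right) + \dfrac{\sigma^2}{\gamma - 1}, & \gamma > 1,
\end{cases}
\qquad \text{as } n, p \to \infty,\ p/n \to \gamma .
```

Both branches blow up as $\gamma \to 1$, which is the interpolation threshold, and both are proportional to the noise level there, matching the observation that noise amplifies the peak. For $\gamma > 1$ the variance term $\sigma^2/(\gamma - 1)$ shrinks as the model grows, so the risk comes back down from the peak.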

Commentary: Experiments for Theory

The experiments above are especially interesting (in our opinion) because of how they can inform ML theory: any theory of ML must be consistent with “double descent.” In particular, one ambitious hope for what it means to “theoretically explain ML” is to prove a theorem of the form:

“If the distribution satisfies property X and architecture/initialization satisfies property Y, then SGD trained on ‘n’ samples, for T steps, will have small test error with high probability”

For values of X, Y, n, T, “small” and “high” that are used in practice.

However, these experiments show that these properties are likely more subtle than we may have hoped for, and must be non-monotonic in certain natural parameters.

This rules out even certain natural “conditional conjectures” that we may have hoped for, for example the conjecture that

“If SGD on a width W network works for learning from ‘n’ samples from distribution D, then SGD on a width W+1 network will work at least as well”

Or the conjecture

“If SGD on a certain network and distribution works for learning with ‘n’ samples, then it will work at least as well with n+1 samples”

It also appears to conflict with a “2-phase” view of the trajectory of SGD, as an initial “learning phase” and then an “overfitting phase” — in particular, because the overfitting is sometimes reversed (at least, as measured by test error) by further training.

Finally, the fact that these phenomena are not specific to neural networks, but appear to hold fairly universally for natural learning methods (linear/kernel regression, decision trees, random features) gives us hope that there is a deeper phenomenon at work, and we are yet to find the right abstraction.

We especially thank Mikhail Belkin and Christopher Olah for helpful discussions throughout this work. The polynomial example is inspired in part by experiments in [Muthukumar et al.].

HALG 2020 call for nominations (guest post by Yossi Azar)

November 27, 2019

[Guest post by Yossi Azar – I attended HALG once and enjoyed it quite a lot; I highly recommend people make such nominations –Boaz]

Call for Invited Talk Nominations: 5th Highlights of Algorithms conference (HALG 2020)

ETH Zurich, June 3-5, 2020

The HALG 2020 conference seeks high-quality nominations for invited talks that will highlight recent advances in algorithmic research. Similarly to previous years, there are two categories of invited talks:

A. survey (60 minutes): a survey of an algorithmic topic that has seen exciting developments in the last couple of years.

B. paper (30 minutes): a significant algorithmic result appearing in a paper in 2019 or later.

To nominate, please email the following information:

  1. Basic details: speaker name + topic (for survey talk) or paper’s title, authors, conference/arxiv + preferred speaker (for paper talk).
  2. Brief justification: Focus on the benefits to the audience, e.g., quality of results, importance/relevance of topic, clarity of talk, speaker’s presentation skills.

All nominations will be reviewed by the Program Committee (PC) to select speakers that will be invited to the conference.

Nominations deadline: December 20, 2019 (for full consideration).