ML Theory with bad drawings

This semester I am teaching a seminar on the theory of machine learning. For the first lecture, I would like to talk about what the theory of machine learning is. I decided to write this (very rough!) blog post mainly to organize my own thoughts.

In any science, ML included, the goals of theory and practice are not disjoint. We study the same general phenomena from different perspectives. We can think of questions in computation at a high level as trying to map the unknown terrain of the computational cost of achieving a certain quality of output, given some conditions on the input.

Practitioners aim to find points on this terrain’s practically relevant regions, while theoreticians are more interested in its broad landscape, including at points far out into infinity. Both theoreticians and practitioners care about discontinuities, when small changes in one aspect correspond to large changes in another. As theoreticians, we care about computational/statistical tradeoffs, particularly about points where a small difference in quality yields an exponential difference in time or sample complexity. In practice, models such as GPT-3 demonstrate that a quantitative increase in computational resources can correspond to a qualitative increase in abilities.

Since theory and practice study the same phenomenon in different ways, their relation can vary. Sometimes theory is forward-looking or prescriptive – giving “proof of concepts” results that can inspire future application or impossibility results that can rule out certain directions. Sometimes it is backward-looking or descriptive – explaining phenomena uncovered by experiments and putting them in a larger context.

What is machine learning?

Before we talk about what machine-learning theory is, we should talk about what machine learning is. It’s common to see claims of the form “machine learning is X”:

  • Machine learning is just statistics
  • Machine learning is just optimization
  • Machine learning is just approximation or “curve fitting”

and probably many more.

I view machine learning as addressing the following setup:

There is a system f that interacts with the world around it in some manner, and we have some sense of whether this interaction is successful or unsuccessful.

For example, a self-driving car is “successful” if it gets passengers from point A to point B safely, quickly, and comfortably. It is “unsuccessful” if it gets into accidents, takes a long time, or drives in a halting or otherwise uncomfortable manner.

Three levels of indirection

The system f is the “machine.” It is “learning” since we adapt f so that it becomes more successful. Ideally, we would set f to be the most successful system. However, what we actually do is at least thrice-removed from this ideal:

  1. The model gap: We do not optimize over all possible systems, but rather over a small subset of such systems (e.g., ones that belong to a certain family of models).
  2. The metric gap: In almost all cases, we do not optimize the actual measure of success we care about, but rather another metric that is at best correlated with it.
  3. The algorithm gap: We don’t even optimize the latter metric since it will almost always be non-convex, and hence the system we end up with depends on our starting point and the particular algorithms we use.

The magic of machine learning is that sometimes (though not always!) we can still get good results despite these gaps. Much of the theory of machine learning is about understanding under what conditions we can bridge some of these gaps.

The above discussion explains the “machine learning is just X” takes. The expressivity of our models falls under approximation theory. The gap between the success we want to achieve and the metric we can measure often corresponds to the difference between population and sample performance, which becomes a question of statistics. The study of our algorithms’ performance falls under optimization.

The metric gap is perhaps the widest of them all. While in some settings (e.g., designing systems to play video games) we can directly measure a system’s success, this is typically the exception rather than the rule. Often:

  • The data we have access to is not the data the system will run on and might not even come from the same distribution.
  • The measure of success that we care about is not necessarily well defined or accessible to us. Even if it were, it would not necessarily be in a form that we can directly optimize.

Nevertheless, the hope is that if we optimize the hell out of the metrics we can measure, the system will perform well in the way we care about.

A priori, it is not clear that this should be the case, and indeed it isn’t always. Part of the magic of auto-differentiation is that it allows optimization of more complex metrics that match more closely the success metric we have in mind. But, except for very special circumstances, the two metrics can never be the same. The mismatch between the goal we have and the metric we optimize can manifest in one of the following general ways:

  • “No free lunch” or Goodhart’s law: If we optimize a metric \mathcal{M} then we will get a system that does very well at \mathcal{M} and not at all well in any other measure. For example, if we optimize accuracy in predicting images, then we may not do well on slightly perturbed versions of these images.
  • “It’s not the destination, it’s the journey” or Anna Karenina principle: All successful systems are similar to one another, and hence if we make our system successful with respect to a metric \mathcal{M} then it is likely to also be successful with respect to related measures. For example, image classifiers trained on ImageNet have been successfully used for very different images, and there is also evidence that success in ImageNet translates into success on a related distribution (i.e., ImageNet v2).

At the moment, we have no good way to predict when a system will behave according to Goodhart’s law versus the Anna Karenina principle. It seems that when it comes to learning representations, machine learning systems follow the Anna Karenina principle: all successful models tend to learn very similar representations of their data. In contrast, when it comes to making decisions, we get manifestations of Goodhart’s law, and optimizing for one metric can give very different results than the other.

The model gap is the gap between the set of all possible systems and the set that we actually optimize over. Even if a system can be captured as a finite function (say a function mapping 32 \times 32 pixel images to 10 classes), a simple counting argument shows that the vast majority of these functions require far too many gates to be efficiently computable by any reasonable computational model. (In case you are curious, the counting argument also works for quantum circuits.) Hence we necessarily have to deal with a subclass of all possible functions.
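To get a feel for the numbers, here is a minimal back-of-the-envelope version of the counting argument. The input size (8-bit grayscale 32 × 32 images), the circuit-description bound, and the gate budget are loose illustrative choices, not tight constants:

```python
import math

# Inputs: 32x32 grayscale images with 8-bit pixels -> n input bits.
n_bits = 32 * 32 * 8

# log2 of the number of functions from n-bit inputs to 10 classes:
#   log2(10^(2^n)) = 2^n * log2(10) > 3 * 2^n   (integer lower bound)
log2_functions = 3 * 2 ** n_bits

# log2 of the number of circuits with at most s gates: a circuit with s
# gates can be written down in roughly s*log2(s) bits (each gate names
# its inputs), so there are at most about 2^(2*s*log2(s)) of them.
s = 10 ** 15  # a quadrillion gates -- far beyond any feasible model
log2_circuits = int(2 * s * math.log2(s))

# The fraction of such functions computable by ANY circuit of this size
# is at most 2^(log2_circuits - log2_functions): unimaginably small.
print(log2_functions > 10 ** 100)   # -> True
print(log2_circuits < 10 ** 18)     # -> True
```

So even with an absurdly generous gate budget, the circuits cover a vanishing fraction of all functions, which is the point of the argument.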

This raises the question of whether we should expect a realizable system to exist, let alone for us to find it. Often machine learning is used in artificial intelligence applications, where we are trying to mimic human performance. In such cases, human performance is an “existence proof” that some reasonably-sized circuit is successful in the goal. But whether this circuit can be embedded in our model family is still an open question.

Remember when I said that sometimes it’s not about the journey but about the destination? I lied. The algorithm gap implies that in modern machine learning, there is a certain sense in which it is always about the journey. In modern machine learning, the system f is typically parameterized by a vector w \in \mathbb{R}^N of real numbers, and the metric we optimize corresponds to minimizing some loss function \mathcal{L}(w). We use local search algorithms, which start off at some vector w_0 chosen from some distribution and then iteratively take small steps w_1,w_2,w_3,\ldots. For each w_i, the next step w_{i+1} is chosen from some distribution of vectors close to w_i such that (hopefully) in expectation \mathcal{L}(w_{i+1}) < \mathcal{L}(w_i).
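That local-search loop can be sketched in a few lines. This is a toy illustration under invented assumptions (a one-dimensional quadratic loss and a gradient-free hill-climbing rule standing in for SGD), not any particular training algorithm:

```python
import random

# Toy loss L(w) = (w - 3)^2 in one dimension; the ideal system is w = 3.
def loss(w):
    return (w - 3.0) ** 2

# Local search: start at a random w_0, repeatedly propose a nearby point,
# and keep it whenever it improves the loss.
random.seed(0)
w = random.uniform(-10, 10)               # w_0 drawn from some distribution
for _ in range(5000):
    candidate = w + random.gauss(0, 0.1)  # a step near w_i
    if loss(candidate) < loss(w):         # hopefully L(w_{i+1}) < L(w_i)
        w = candidate

print(abs(w - 3.0) < 0.5)  # -> True: we end up near the minimizer
```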

Modern machine learning systems are non-convex, which means that the final point w_t we end up at depends on the starting point, the algorithm, and the randomness. Hence, we can’t have a neat “separation of concerns” and decouple the architecture, metric, and algorithm. When we say that a system obeys the “Anna Karenina principle,” we really mean that for natural algorithms optimizing natural metrics on natural architectures, successful outputs (ones that do well on the metric) are likely to be similar to one another. The use of the “natural” qualifier is a warning sign that we don’t fully understand the conditions under which this happens, but it is clear that some conditions are necessary. Due to non-convexity, it is typically possible to find a “bad minimum” that would be very good on some specific metric but terrible on others.
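A one-dimensional caricature of this dependence on the starting point: the loss (w^2 - 1)^2 has two global minima, at w = -1 and w = +1, and plain gradient descent converges to whichever basin w_0 happens to fall into.

```python
# Non-convexity in miniature: L(w) = (w^2 - 1)^2 has two global minima.
def grad(w):                  # derivative of (w^2 - 1)^2
    return 4 * w * (w ** 2 - 1)

def descend(w0, lr=0.01, steps=2000):
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)     # plain gradient descent
    return w

# The final point depends entirely on the starting point:
print(round(descend(+0.5), 3))   # -> 1.0
print(round(descend(-0.5), 3))   # -> -1.0
```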

Types of theory for machine learning (and for CS in general)

Given the above discussion, what kind of theoretical results should we expect in machine learning? Let’s try to classify the type of results we see in general theoretical computer science. In the discussion below, I will focus on the limitations of these theoretical results and use humorous names, but make no mistake: all of these categories correspond to valuable theoretical insights.

End to end analysis

The quicksort algorithm provides a canonical example of algorithm analysis. At this point, we have a reasonably complete understanding of quicksort. We can analyze the distribution of its running time down to the constant.

Hence we have an efficient algorithm, used in practice, with rigorously proven theorems that precisely characterize its performance. This is the “gold standard” of analysis, but also an unrealistic and unachievable goal in almost any other setting.
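For concreteness, here is a minimal functional sketch of randomized quicksort. (This is not the in-place variant usually analyzed, but it shows the random-pivot idea behind the expected O(n log n) bound: a random pivot means no single input is adversarially bad.)

```python
import random

# Randomized quicksort: expected O(n log n) comparisons on EVERY input,
# since the expectation is over the algorithm's own random pivot choices.
def quicksort(xs):
    if len(xs) <= 1:
        return xs
    pivot = random.choice(xs)
    less = [x for x in xs if x < pivot]
    equal = [x for x in xs if x == pivot]
    greater = [x for x in xs if x > pivot]
    return quicksort(less) + equal + quicksort(greater)

print(quicksort([3, 1, 4, 1, 5, 9, 2, 6]))  # -> [1, 1, 2, 3, 4, 5, 6, 9]
```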

Proof of concept

A more common situation is that we have algorithms that work in practice and algorithms with rigorous analysis, but they are not the same algorithms. A canonical example is linear programming. The simplex algorithm often works well in practice but was known to take exponential time on certain instances. For a long time, it was an open question whether there is an algorithm for linear programming that takes polynomial time in the worst case.

In 1979, Khachiyan gave the Ellipsoid algorithm that runs in polynomial time, but with a polynomial so large that it is impractical. Still, the Ellipsoid algorithm was a great breakthrough, giving hope for better algorithms and a new approach for defining progress measures for linear programming instances. Indeed, in 1984, Karmarkar came up with the interior-point algorithm, with much better time dependence.

The interior-point algorithm is practical and often outperforms the simplex method on large enough instances. However, the algorithm as implemented is not identical to the one analyzed: implementations use different step sizes and other heuristics for which we do not have a precise analysis.

Nevertheless, even if (unlike quicksort) the rigorously analyzed algorithm is not identical to the practical implementations, the story of linear programming shows how crucial “proofs of concept” can be to introducing new ideas and techniques.

The flip side of “proof of concept” results is impossibility results that rule out even “proof of concept” algorithms. Impossibility results always come with fine print and caveats (for example, NP-hardness results always refer to worst-case complexity). However, they still teach us about the structure of the problem and help define the contours of what we can expect to achieve.

Character witness

“End to end analysis” is when we can prove the guarantees we need on algorithms we actually want to use. “Proof of concept” is when we prove the guarantees we need on impractical algorithms. A “character witness” result is when we prove something positive about an algorithm people use in practice, even if that positive property falls short of the guarantees we actually want.

While the term “character witness” sounds derogatory, such results can sometimes yield truly profound insights. A canonical example again comes from linear programming. While the simplex algorithm can be exponential in the worst case, in a seminal work, Spielman and Teng showed that it does run in polynomial time if the input is slightly perturbed; this is so-called smoothed analysis. While the actual polynomial and the level of perturbation do not yield practical bounds, this is still an important result. It gives formal meaning to the intuition that the simplex algorithm only fails on “pathological” instances, and it initiated a new mode of analyzing algorithms between worst-case and average-case complexity.

In machine learning, a “character witness” result can take many forms. For example, some “character witness” results are analyses of algorithms under certain assumptions on the data that, even if not literally true, seem like they could be “morally true”. Another type of “character witness” result shows that an algorithm would yield the right results if it were allowed to run for an infinite or exponentially long time. Evaluating the significance of such results can be challenging. The main question is whether the analysis teaches us something we didn’t know before.

A third type of “character witness” results are quantitative bounds that are too weak for practical use but are still non-vacuous. For example, approximation algorithms with too big of an approximation factor, or generalization bounds with too big of a guaranteed gap. In such cases, one would hope that these bounds will be at least correlated with performance: algorithms with better bounds will also have higher quality output.

Toy problems

The name “toy problem” also sounds derogatory, but toy problems or toy models can be extremely important in science, and machine learning is no different. A toy model is a way to abstract the salient issues of a model to enable analysis. Results on “toy models” can teach us about general principles that hold in more complex models.

When choosing a toy model, it is important not to mistake models that share a superficial similarity for models that preserve the salient issues we want to study. For example, consider the following two variants of deep neural networks:

  • Networks that have standard architecture except that we make them extremely wide, with the width tending to infinity independently of all other parameters.
  • Networks where all activation functions are linear.

Since the practical intuition is that bigger models are better, it may seem that such “ultra-wide” models are not toy models at all and should capture state-of-the-art deep networks. However, the neural tangent kernel results show that these models become kernel models that do not learn their representation at all.

Intuitively, making activation functions linear seems pointless, since the composition of linear functions is linear, and hence such linear networks are no more expressive than a one-layer linear net. Thus linear networks seem too much of a “toy model” (maybe a “happy meal model”?). Yet, it turns out that despite their limited expressivity, deep linear networks capture an essential feature of deep networks: the bias induced by the gradient descent algorithm. For example, running gradient descent on a depth-two linear network translates to regularizing by a proxy of rank. In general, gradient descent on a deep linear network induces a very different geometry on the manifold of linear functions.
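A tiny numpy sketch of the expressivity point (with arbitrary illustrative layer sizes): a deep linear network collapses to a single matrix, even though gradient descent on its factors follows a different trajectory than gradient descent on that matrix directly.

```python
import numpy as np

rng = np.random.default_rng(0)

# A depth-3 "linear network": x -> W3 @ W2 @ W1 @ x (no activations).
W1 = rng.normal(size=(5, 4))
W2 = rng.normal(size=(5, 5))
W3 = rng.normal(size=(3, 5))

def deep_linear(x):
    return W3 @ (W2 @ (W1 @ x))

# Expressivity-wise, the network is just one linear map...
M = W3 @ W2 @ W1
x = rng.normal(size=4)
print(np.allclose(deep_linear(x), M @ x))  # -> True

# ...but gradient descent on the factors (W1, W2, W3) moves M along a
# different trajectory than gradient descent on M directly -- which is
# the implicit bias discussed above.
```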

Hence, although the first model seems much more “real” than the second one, there are some deep learning questions where the second model is more relevant.

Theory without theorems

One might think that the whole point of theory is to provide rigorous results: if we are not going to prove theorems, we might as well use benchmarks. Yet, there are important theoretical insights we can get from experiments. A favorite example of mine is the Zhang et al. result that deep neural networks can fit random labels. This simple experiment ruled out in one fell swoop a whole direction for proving generalization bounds on deep nets. A theory work does not have to involve theorems: well-chosen experiments can provide important theoretical insights.
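The flavor of that experiment can be reproduced in miniature (this is a hedged toy analogue, not the original deep-net experiment): any linear model with as many random features as training points interpolates arbitrary labels, so capacity-based bounds alone cannot distinguish memorization from generalization.

```python
import numpy as np

rng = np.random.default_rng(0)

# A linear model with as many features as training points can fit ANY
# labels exactly -- including purely random ones.
n, d = 50, 50                                  # samples and features
X = rng.normal(size=(n, d))                    # random "data"
y = rng.integers(0, 2, size=n).astype(float)   # random labels

# Least squares finds a w with X @ w = y (X is almost surely invertible).
w, *_ = np.linalg.lstsq(X, y, rcond=None)
train_error = np.max(np.abs(X @ w - y))
print(train_error < 1e-8)  # -> True: the model perfectly "fits" noise
```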


The above was a stream-of-consciousness and very rough personal overview of questions and results in ML theory. In the coming seminar we will see results of all the types above, as well as many open questions.

Acknowledgements: Thanks to Yamini Bansal and Preetum Nakkiran for helpful comments (though they are not to blame for any mistakes!)

Obfuscation: The season 4 Finale

For many of the famous open problems of theoretical computer science, most researchers agree on what the answer is, but the challenge is to prove it. Most complexity theorists (with a few notable exceptions) believe that P≠NP, but we don’t know how to prove it. Similarly, most people working on matrix multiplication believe that there is an Õ(n²) algorithm for this problem, but we’re still stuck at an exponent of 2.3728596. We believed that primality checking has a deterministic polynomial-time algorithm long before it was proven, and we still believe the same holds for polynomial identity testing.

The story of cryptographic obfuscation is different. This story deserves a full length blog post (though see my now outdated survey), but the short version is as follows. In 2001 we (in a paper with Goldreich, Impagliazzo, Rudich, Sahai, Vadhan, and Yang) showed that what is arguably the most natural definition of obfuscation is impossible to achieve. That paper explored a number of obfuscation-related questions, and in particular left as an open question the existence of so-called indistinguishability obfuscators, or IO. Since then there were arguably more negative than positive results in obfuscation research, until in 2012, extending some of the ideas behind fully-homomorphic encryption, Garg, Gentry, and Halevi gave a heuristic construction of multilinear maps, which one can think of as “Diffie-Hellman on steroids” (or maybe LSD..). Then in 2013 Garg, Gentry, Halevi, Raykova, Sahai, and Waters (GGHRSW) built on top of these maps to give a heuristic construction of IO.

The GGHRSW paper opened the floodgates to many papers using IO to achieve longstanding cryptographic goals, showing that IO provides a unified approach to solving many classic cryptographic problems. The fact that so many goals were achieved through heuristic constructions was not very comforting to cryptographers. Even less comforting was the fact that several cryptographic attacks were discovered on these heuristic constructions. The years that followed saw a sequence of constructions and breaks, giving cryptographers an “emotional whiplash”. Everyone agreed that IO would be amazing if it existed, but whether or not it actually exists depended on who you asked, and what paper on the ePrint archive they read that morning…

The “holy grail” in this line of work is to base obfuscation on a standard assumption, and ideally Regev’s Learning With Errors (LWE) assumption. Of course, we don’t know that LWE is true (in particular LWE implies P≠NP) but if it’s false it would bring down so much of the field that cryptographers might as well pack their bags and do machine learning (or try to sabotage progress in quantum computing, since the only other standard assumptions for public-key crypto are broken by fully scalable quantum computing).

We have not yet achieved this holy grail (this is only the 4th season), but as described in this Quanta article, there has been remarkable progress in the last few months. In particular, Jain, Lin, and Sahai (JLS) (building on a long sequence of works by many people, including Ananth, Matt, Tessaro, and Vaikuntanathan) obtained IO based on LWE and several standard assumptions in cryptography. This is arguably the first “heuristic free” construction, and it is a fantastic breakthrough. However, there is still work to do: the JLS construction uses not just LWE but also a variant of it that is not as well studied. It is also based on pairing-based cryptography, an area that has thousands of papers, but for which the known instantiations can be broken by quantum computers. However, there is yet more hope: in another sequence of works, by Agrawal; Brakerski, Döttling, Garg, and Malavolta; Wee and Wichs; and Gay and Pass, a construction of IO was achieved that is “almost” heuristic free. It still uses one heuristic assumption (circular security), but has the advantage that apart from this assumption it relies only on LWE.

One can hope that in the next season, these two lines of work will converge to give a construction of IO based on LWE, achieving a “meta theorem” deriving from LWE a huge array of cryptographic primitives.

Want to learn more about these amazing advances? Want to know what’s next in store for IO?

Fortunately there is a virtual Simons symposium on indistinguishability obfuscation coming to your computer screen on December 10-11. Authors of all the papers mentioned will join together in coordinated presentations to give a unified view of the field and the challenges ahead. We will also have a historical opening talk by Yael Kalai, as well as a talk by Benny Applebaum on the computational assumptions used, followed by a panel discussion with Yael, Benny and Chris Peikert. Finally, like every proper crypto event, there will be a rump session, though you will have to supply your own beer.

You can see the schedule of the workshop and register on this page.

Hope to see you there! Bring your favorite programs to obfuscate with you*

*Disclaimer/fine print: Due to large constants and exponents, we do not recommend the compiler be used on programs that are more than one nanobit long.

Image credit: MIT

Making TCS more connected / less insular

[Announcement from Jelani Nelson –Boaz]


A task force has been convened by CATCS to investigate possible approaches to modifying aspects of the TCS community, especially our publishing culture, to enhance connections with other areas of CS and be as welcoming as possible to a broad range of contributions within theory. This committee will collect and synthesize feedback from the community via the questionnaire and will then make suggestions.

Since adjusting conference formats may help with this, we would like to get a general idea of community opinion on format choices. If you have some opinions you would like to share, please use this form. Though the questions below focus primarily on FOCS/STOC, we welcome any and all suggestions that would make the TCS community as broad and as well-connected to other areas of CS as possible (see the last question of the survey).

Task force members:
Ken Clarkson (IBM Research)
Sandy Irani (UC Irvine)
Bobby Kleinberg (Cornell)
Adam Klivans (UT Austin)
Ravi Kumar (Google)
Jelani Nelson (UC Berkeley)
Yuval Rabani (Hebrew University)

On Galileo Galilei and “denialism” from elections to climate to COVID

Galileo Galilei has many self-appointed intellectual heirs these days. Whether it’s a claim that the election has been stolen, that COVID-19 is less fatal than the flu, that climate change or evolution are hoaxes, or that P=NP, we keep hearing from people considering themselves bold truth-tellers railing against conventional wisdom. We are encouraged to “teach the debate” and told that if we only paid attention, we would see that their Tweet, declaration, or arXiv paper contains an irrefutable proof of their assertions.

In the words of Ted Cruz, “They brand you a heretic. Today, the global warming alarmists are the equivalent of the flat-Earthers. It used to be [that] it is accepted scientific wisdom the Earth is flat, and this heretic named Galileo was branded a denier”.

Of course by Galileo’s time it was well known that the earth was spherical, and Magellan circumnavigated the earth more than 40 years before Galileo was born. But putting aside Cruz’s confusion of flat earth and geocentrism, the story of heliocentric theory is not one of an outsider railing against the scientific mainstream. Galileo himself was a chaired professor of mathematics at the University of Padua, and later philosopher and mathematician to the grand duke of Tuscany. He was very much part of the scientific establishment of his time. Moreover, though Galileo did provide important evidence for heliocentrism, he was not the only one doing so. Kepler found a heliocentric model with elliptical orbits that actually made correct predictions, and, though it took a decade or so, Kepler’s book eventually became the standard textbook for astronomy.

My point in this post is not to rehash the history of heliocentrism or Galileo but rather to call out a misconception which, to use Sean Carroll’s phrasing, amounts to valorization of puzzle-solving over wisdom.

It is tempting to think that an argument, regardless of whether it comes from an expert or a random person on Twitter, can be presented in a self-contained way and judged on its merits. However, this is not how things work in any interesting setting. Even in the case of a purported P vs NP proof, there is background knowledge on computational complexity without which it would be hard to spot holes in the argument. This is doubly so for any claim involving empirical facts, whether it’s about elections, infections, climate, etc. It is not possible to evaluate such claims without context, and to get this context you need to turn to the experts that have studied the topic.

I have written before in defense of expertise (see also here), but Carroll puts it very well. Another way to say it is that the operational interpretation of the common refrain

Extraordinary claims require extraordinary evidence

is:

Treat claims conforming to conventional wisdom with charity, and claims disputing it with skepticism.

(There is a question of how to define “conventional wisdom” but interestingly there is usually agreement in practice by both sides. Most “deniers” of various sorts are proud of going against conventional wisdom, but don’t acknowledge that this means they are more likely to be wrong.)

As an example, even if someone has expertise in analytic number theory, and so presumably has plenty of so-called “puzzle-solving intelligence”, that doesn’t mean that they can evaluate a statistical claim on election fraud, nor that their analysis should be considered evidence (apparently at this point the number theorist himself agrees). We can try to read and debunk what they wrote, or we can assume that if there were evidence for large-scale fraud, then the president of the United States and his well-funded campaign would have managed to find actual statisticians and experts on elections to make the case.

There can be a debate about whether Trump’s attempt to overthrow the election should be considered dangerous or merely absurd, but the constant attacks on the very notions of truth, science, and expertise are causing far-reaching harm.

(H/T: Scott Aaronson, who makes a similar point around the election conspiracies.)

Announcing the WiML-T Mentorship Program (guest post)

[Guest post by Claire Vernade, Jessica Sorrell, Kamalika Chaudhuri, Lee Cohen, Mary Anne Smart, Michal Moshkovitz, and Ruth Urner.

I am very happy about this initiative – mentoring and community is so important for success in science, and as I’ve written before, there is much work to do so women will have the same access to these as men. –Boaz]

TL;DR: we are organizing a new mentorship program for women in machine learning. Please consider applying as a mentor, mentee, or both at

What is WiML-T?

Women in machine learning (or WiML for short) was established more than ten years ago, and its main goals are to (1) increase the number of women in machine learning, (2) help women in machine learning succeed professionally, and (3) increase the impact of women in machine learning in the community. Towards these goals, they create different opportunities for women to showcase their work. Chief among them is the annual Women in Machine Learning (WiML) Workshop, typically co-located with NeurIPS, which presents women’s cutting-edge research.

Women in machine learning theory (or WiML-T for short) shares the same goals as WiML but focuses on the smaller learning theory community. The vision of WiML-T is to give visibility and legitimacy to under-represented groups in learning theory, to create stronger bonds within the community and beyond, and to provide support and advice. As part of this vision, we have decided to facilitate a mentoring program that will connect women and non-binary researchers who are newer to learning theory with more experienced mentors.  

Why mentorship?

Mentoring programs have been shown to greatly help underrepresented communities develop [1]. More resources on mentoring can be found on the website of Stanford’s Women’s Leadership Innovation Lab and on our website. Mentoring creates strong bonds of trust within the community, it helps mentees find career advice and connections, and it helps mentors grow their leadership skills while keeping in touch with the newcomers of the community.  We hope that creating a world-wide program will also help reduce inequality between less privileged areas and the most famous institutions. Indeed, it is a well-known fact that the more competitive a career path, the less diverse it is, due in part to network effects from which under-represented groups are excluded. We believe that uniting individuals from these groups (i.e. women, non-binary people, persons of color and other minorities) will help reduce these effects and contribute to finding solutions to the community’s problems. 

Why be a mentor?  

Remember the beginning of your research journey, with all the difficulties, uncertainties, and unanswered questions? This is your chance to give all the advice you wish you got, and make an impact on a future colleague. As a mentor you will be a role model and help the young generation of researchers in learning theory. You will help them develop and enhance their careers by giving them the support they need. Need more reasons to mentor? It will grow your leadership skills, self-confidence, communication skills, and you will feel happier after you help others. 

Who can be a mentor? 

Nearly everyone who has some experience in academia or industry can be a mentor. It can be interesting for an undergrad student to receive advice from senior PhD students or postdocs who have recently had to reflect about career decisions and can share knowledge about their work environment. We indeed expect the most senior researchers to apply as mentors, but we would also like to encourage PhDs and postdocs to consider mentoring (while possibly having a mentor as well!).    

Can men mentor? 

We thank everyone who wants to help the community!  

We will prioritize women mentors, as they can give their unique perspective, but we acknowledge that there might be a limited number of mentors. To mitigate this issue, we will be happy to pair mentees with male mentors, provided the mentee agrees.

Why be a mentee?   

Having a mentor is one of the best ways to get external career advice and to get some feedback from someone with a possibly similar background. Managing to find one’s way into academia or science is not easy. It can be even harder for under-represented groups who may lack role models within their institution, or who may not connect with common advice that implicitly assumes or relies on some class privilege. Having a mentor can help you navigate professional and personal issues that men may not always face. It is also a way to get connected to other members of the community, or to get second opinions on research strategies.


  • The program launched on October 29 2020 and will run on a continuous basis.
  • We will start with pairings of mentors and mentees in December 2020. This process can take a few months.  
  • Frequency of the meetings: totally depends on the mentor and mentee. It can be either weekly meetings or once every two months or anything in between. 
  • Duration of the mentorship: totally depends on the mentor and mentee. It can be a few months, a year, or even more. 

Have questions? You can mail us at:

[1] Ginther, D. K., Currie, J., Blau, F. D., & Croson, R. (2020). Can Mentoring Help Female Assistant Professors in Economics? An Evaluation by Randomized Trial. NBER Working Paper No. 26864.     

Updated Research Masters programs database by Aviad Rubinstein and Matt Weinberg

Guest post by Aviad Rubinstein and Matt Weinberg

As explained in Boaz’s previous posts [1] [2], the PhD admission process can be challenging for students who discover their passion for Theory of Computer Science late in their undergraduate studies. Discovering TCS earlier is especially challenging for students who aren’t exposed to CS in high school, and this bias aggravates the diversity issues we have in our community. Masters programs are one way to mitigate this issue.
On a personal note, one of us (Aviad) worked half-time during undergrad and would not be in academia today (let alone TCS of all subjects) if it weren’t for the awesome TCS masters program at Tel-Aviv University.

But where would you go (or send your students) to do a masters in TCS? A little over a year ago, we were discussing how useful it would be to have a unified resource that can help students choose. We decided that we should create one! Months passed, and right before we ran out of excuses to procrastinate, Boaz made this post crowdsourcing information on TCS masters programs.
After some more procrastination, we eventually did send a lot of emails and tried to clean and organize the information to the best of our ability. Thanks to everyone who contributed! You can find the latest version here. (Short url:

Now you should share it with your brightest students!

This is also meant to be a live project. Please email aviad [at] if you have more new information or find any inaccuracies.


-Aviad and Matt

Election insecurity

Election security has been studied for many years by computer scientists, but it is not often that it attracts so much mainstream attention. I would never have expected to see my former Princeton colleague Andrew Appel on a Sean Hannity segment tweeted by President Trump.

It may seem that even if it has partisan motivations, the recent GOP interest in election security is overall a positive thing. Who wouldn’t want elections to be more secure? Who wouldn’t want less fraud? However, in a very precise sense, the definition of “election security” used by the GOP these days corresponds to election insecurity.

To understand this claim, consider what it means for an election to be secure. (Let’s focus just on the correctness aspect of the count, since it is at the heart of the current issues, and not on the very interesting privacy aspect.) Computer scientists use the technical terms “cast as intended”, “recorded as cast”, and “tallied as recorded”. In other words: if a voter X intends to cast a vote for candidate Y, then this vote should be recorded and tallied, and only such votes should be tallied.

With mail-in voting, there are several potential points of failure on the path between voter intent and the final tally:

  1. Mail can be lost or delayed too much, leading to the vote not counting.
  2. A third party could intercept the ballot and impersonate the voter.
  3. A ballot may not be formatted properly in some way, leading to it being disqualified.
  4. There can be errors or hacks in the tallying process.

Election security is about combating points 1-4 (of which the last three also apply to in-person voting), ideally in a way that is verifiable to the individual voters. Achieving verifiability while maintaining secrecy and not requiring the voter to trust complex technology is a challenging task, but there have been some proposed solutions (see above links).

The Hannity segment and much of the “Dominion” non-story focused on point 4. This is an important point, but as Appel himself notes, paper ballots, which are mostly used in the US, serve as a way to audit counting. Re-counting is important, and is commonly done, but such recounts often change the total counts by relatively little (and the changes mostly cancel out). For example, here is the list of ballots changed and the reasons from the Wisconsin 2016 count (taken from this paper).


In contrast, many of the legal cases by the Trump campaign focused on signature verification and other ballot irregularities. There are two main reasons why a signature would not match between a ballot and driver’s license or other records:

  1. The signature may have been forged by someone trying to impersonate the voter.
  2. The voter’s signature might not be very consistent, or maybe they have more than one signature (for example, I sometimes sign in Hebrew and sometimes in English).

Empirically, reason 2 is much more common than reason 1. If a ballot is tossed out because of the second reason it corresponds to a break between the voter intent and the final tally, and hence it is a case of election insecurity. For this reason, making more stringent signature checks could make elections less secure!

While President Trump might claim on Twitter that the election was stolen by a massive conspiracy involving forging of tens of thousands of ballots, this is not the actual content of the court cases (especially after some recent amendments). For example, at the heart of the PA case is the process of “curing a ballot“. This is when a ballot is disqualified due to some technical issue, and a voter has a chance to fix it. Curing a ballot ensures that the voter’s intent is captured, and hence makes elections more secure.

In PA, the decision of whether to notify voters in such cases was left to the counties, and apparently Democrat-controlled counties were more likely to do so than Republican-controlled counties. This is a shame, and had the Trump campaign asked to extend the deadline for curing ballots, that would have made perfect sense. However, this is not what their lawsuit is about. To quote their complaint: “plaintiffs seek a permanent injunction requiring the County Election Boards to invalidate ballots cast by voters who were notified and given an opportunity to cure their invalidly cast mail-in ballot.” These are ballots where there is no question of the eligibility of the voter, nor of the accuracy of their intent, yet the Trump campaign seeks to prevent them from counting. I call this election insecurity.

p.s. See Adam Klasfeld’s feed for more about the various Trump campaign cases

MoPS and Junior-Senior Meeting (DISC 2020)

(Guest post by Shir Cohen and Mouna Safir)

The 34th International Symposium on Distributed Computing (DISC 2020) was held on October 12-16, 2020, as a virtual conference. As such, the opportunity for community members to get to know each other in an informal environment was lacking. To address this need, we arranged two types of virtual networking events. We hope that these events planted the seeds for many future collaborations and that there will be an opportunity for those involved to meet in person next time.

MoPS (Meet other Postdoc and Students) Sessions

To allow junior members of the community to get to know one another, we arranged MoPS sessions, which we have not seen done before. There were more than 50 participants who took part in the sessions, with representation from a host of countries throughout the world. Sessions were held in 10 time slots before, during, and after DISC. In each session, there would typically be 5 members representing a mixture of Bachelor’s students, Masters students, PhDs, postdocs, and others. Care was taken to include at least one postdoc or Ph.D. in each session so that Bachelors and Masters students might benefit from their experience. Groups were formed with the goal of allowing participants from different countries and institutions to share their experiences and research journeys with one another. Based on the feedback for this event, it would appear that that goal was met and the participants came away with more of a sense of community.

Junior-Senior Meetings

The Junior-Senior meetings were organized to provide an opportunity for junior researchers to meet with senior researchers. Fourteen sessions were conducted, where each one brought together one senior and 3-5 juniors. In these sessions,  the juniors got a chance to interact with seniors in the field and profit from their experience. Discussions covered a variety of topics such as how to approach research, how to deal with the job market, or perhaps more personal concerns like work-life balance. We collected amazing feedback from the participants, who claimed that this was a fruitful and interesting experience.

Yet another backpropagation tutorial

I am teaching deep learning this week in Harvard’s CS 182 (Artificial Intelligence) course. As I’m preparing the back-propagation lecture, Preetum Nakkiran told me about Andrej Karpathy’s awesome micrograd package which implements automatic differentiation for scalar variables in very few lines of code.

I couldn’t resist using this to show how simple back-propagation and stochastic gradient descent are. To make sure we leave nothing “under the hood”, we will not import anything from the package but rather only copy-paste the few things we need. I hope that the text below is generally accessible to anyone familiar with partial derivatives. See this colab notebook for all the code in this tutorial. In particular, aside from libraries for plotting and copy-pasting a few dozen lines from Karpathy, this code uses absolutely no libraries (no numpy, no pytorch, etc.) and can train (slowly..) neural networks using stochastic gradient descent. (This notebook builds the code more incrementally.)

Automatic differentiation is a mechanism that allows you to write a Python function such as

def f(x,y): return (x+y)+x**3

and enables one to automatically obtain the partial derivatives \tfrac{\partial f}{\partial \text{x}} and \tfrac{\partial f}{\partial \text{y}}. Numerically we could do this by choosing some small value \delta and computing both \tfrac{f(x+\delta,y)-f(x,y)}{\delta} and \tfrac{f(x,y+\delta)-f(x,y)}{\delta}.
However, if we generalize this approach to n variables, we get an algorithm that requires roughly n evaluations of f. Back-propagation enables computing all of the partial derivatives at only constant overhead over the cost of a single evaluation of f.
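For concreteness, here is what the naive numerical approach looks like for the function above (the step size \delta = 10^{-6} and the evaluation point are arbitrary choices for illustration):

```python
def f(x, y): return (x + y) + x**3

def numeric_grads(f, x, y, delta=1e-6):
    # Approximate each partial derivative by a forward difference.
    # Note this costs one extra evaluation of f per variable.
    base = f(x, y)
    df_dx = (f(x + delta, y) - base) / delta
    df_dy = (f(x, y + delta) - base) / delta
    return df_dx, df_dy

# df/dx = 1 + 3x^2, so at x=2 it is 13; df/dy = 1 everywhere
print(numeric_grads(f, 2.0, 5.0))
```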

Back propagation and the chain rule

Back-propagation is a direct implication of the multi-variate chain rule. Let’s illustrate this for the case of two variables. Suppose that v,w: \mathbb{R} \rightarrow \mathbb{R} and z:\mathbb{R}^2 \rightarrow \mathbb{R} are differentiable functions, and define

f(u) = z(v(u),w(u)).

That is, we have the following situation:

where f(u) is the value z=z(v(u),w(u))

Then the chain rule states that

\tfrac{\partial f}{\partial u} = ( \tfrac{\partial v}{\partial u} \cdot \tfrac{\partial z}{\partial v} + \tfrac{\partial w}{\partial u} \cdot \tfrac{\partial z}{\partial w} )

You can take this on faith, but it also has a simple proof. To see the intuition, note that for small \delta, v(u+\delta) \approx v(u) + \delta \tfrac{\partial v}{\partial u}(u) and w(u+\delta) \approx w(u) + \delta \tfrac{\partial w}{\partial u}(u). For small \delta_1,\delta_2, z(v+\delta_1,w+\delta_2) \approx z(v,w) + \delta_1 \tfrac{\partial z}{\partial v}(v,w) + \delta_2 \tfrac{\partial z}{\partial w}(v,w). Hence, if we ignore terms of order \delta^2 or higher,

f(u +\delta)= z(v(u+\delta),w(u+\delta)) \approx f(u) + \delta \tfrac{\partial v}{\partial u} \cdot \tfrac{\partial z}{\partial v} + \delta \tfrac{\partial w}{\partial u} \cdot \tfrac{\partial z}{\partial w}

Meaning that \frac{f(u +\delta) - f(u)}{\delta} \approx \tfrac{\partial v}{\partial u} \cdot \tfrac{\partial z}{\partial v} + \tfrac{\partial w}{\partial u} \cdot \tfrac{\partial z}{\partial w} which is what we needed to show.
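To make the chain rule concrete, here is a quick numerical sanity check with some arbitrary choices of v, w, and z (these particular functions are for illustration only):

```python
def v(u): return u**2        # dv/du = 2u
def w(u): return 3*u         # dw/du = 3
def z(v_, w_): return v_*w_  # dz/dv = w, dz/dw = v

def chain_rule(u):
    # (dv/du)*(dz/dv) + (dw/du)*(dz/dw)
    return 2*u * w(u) + 3 * v(u)

def numeric(u, delta=1e-6):
    # direct forward-difference estimate of df/du for f(u) = z(v(u), w(u))
    return (z(v(u + delta), w(u + delta)) - z(v(u), w(u))) / delta

# here f(u) = 3u^3, so both should be close to 9u^2 = 20.25 at u = 1.5
print(chain_rule(1.5), numeric(1.5))
```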

The chain rule generalizes naturally to the case that z is a function of more than two variables. Generally, if the value f(u) is obtained by first computing some intermediate values v_1,\ldots,v_k from u and then computing z in some arbitrary way from v_1,\ldots,v_k, then \tfrac{\partial z}{\partial u} = \sum_{i=1}^k \tfrac{\partial v_i}{\partial u} \cdot \tfrac{\partial z}{\partial v_i}.

As a corollary, if you already managed to compute the values \tfrac{\partial z}{\partial v_1},\ldots, \tfrac{\partial z}{\partial v_k}, and you kept track of the way that v_1,\ldots,v_k were obtained from u, then you can compute \tfrac{\partial z}{\partial u}.

This suggests a simple recursive algorithm by which you compute the derivative of the final value z with respect to an intermediate value w in the computation using recursive calls to compute the values \tfrac{\partial z}{\partial w'} for all the values w' that were directly computed from w. Back propagation is this algorithm.

Computing \tfrac{\partial z}{\partial u},\tfrac{\partial z}{\partial v}, \tfrac{\partial z}{\partial w} for the assignment u=5 using back propagation

Implementing automatic differentiation using back propagation in Python

We now describe how to do this in Python, following Karpathy’s code. The basic class we use is Value. Every member u of Value is a container that holds:

  1. The actual scalar (i.e., floating point) value that u holds. We call this data.
  2. The gradient of u with respect to some future unknown value f that will use it. We call this grad and it is initialized to zero.
  3. Pointers to all the values that were used in the computation of u. We call this _prev
  4. The method that adds the contribution of u to the gradients of all the values v that were used to compute it. We call this function _backward. Specifically, at the time we call _backward we assume that u.grad already contains \tfrac{\partial z}{\partial u}, where z is the final value we are interested in. For every value v that was used to compute u, we add to v.grad the quantity \tfrac{\partial z}{\partial u} \cdot \tfrac{\partial u}{\partial v}. For the latter quantity we need to keep track of how u was computed from v.
  5. If we call the method backward (without an underscore) on a variable u, then this will compute the derivative of u with respect to v for all values v that were used in the computation of u. We do this by applying _backward to u and then recursively going over the “children” (values used to compute u), calling _backward on each one and keeping track of the ones we visited, just like in the Depth First Search (DFS) algorithm.

Let’s now describe this in code. We start off with a simple version that only supports addition and multiplication. The constructor for the class is the following:

class Value:
    """ stores a single scalar value and its gradient """

    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0
        self._backward = lambda: None
        self._prev = set(_children)

which fairly directly matches the description above. This constructor creates a value not using prior ones, which is why the _backward function is empty.
However, we can also create values by adding or multiplying prior ones, by adding the following methods:

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))

        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward

        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward

        return out

That is, if we create w by adding the values u and v, then the _backward function of w works by adding w.grad = \tfrac{\partial z}{\partial w} to both u.grad and v.grad.
If w is obtained by multiplying u and v, then we add w.grad \cdot v = \tfrac{\partial z}{\partial w} \cdot v to u.grad, and similarly add w.grad \cdot u = \tfrac{\partial z}{\partial w} \cdot u to v.grad.

The backward function is obtained by setting the gradient of the current value to 1 and then running _backward on all other values in reverse topological order:

def backward(self, visited=None): # slightly shorter code to fit in the blog
    if visited is None:
        visited = set([self])
        self.grad = 1
    self._backward()
    for child in self._prev:
        if not child in visited:
            visited.add(child)
            child.backward(visited)

For example, if we run the following code

a = Value(5)
def f(x): return (x+2)**2 + x**3
b = f(a)
print(a.grad) # prints 0
b.backward()
print(a.grad) # prints 89

then the values printed will be 0 and 89, since the derivative of (x+2)^2 + x^3 = x^3 + x^2 + 4x + 4 equals 3x^2 + 2x + 4, which at x=5 evaluates to 89.

In the notebook you can see that we implement also the power function, and have some “convenience methods” (division etc..).

Linear regression using back propagation and stochastic gradient descent

In stochastic gradient descent we are given some data (x_1,y_1),\ldots,(x_n,y_n) and want to find a hypothesis h that minimizes the empirical loss L(h) = \tfrac{1}{n}\sum_{i=1}^n L(h(x_i),y_i) where L is a loss function mapping two labels y, y' to a real number. If we let L_i(h) be the i-th term of this sum, then, identifying h with the parameters (i.e., real numbers) that specify it, stochastic gradient descent is the following algorithm:

  1. Set h to be a random vector. Set \eta to be some small number (e.g., \eta = 0.1)
  2. For t \in {1,\ldots, T} (where T is the number of epochs):
  • For i \in {1,\ldots, n}: (in random order)
    • Let h \leftarrow h - \eta \nabla_h L_i(h)

If h is specified by the parameters h_1,\ldots,h_k, then \nabla_h L_i(h) is the vector ( \tfrac{\partial L_i}{\partial h_1}(h), \tfrac{\partial L_i}{\partial h_2}(h),\ldots, \tfrac{\partial L_i}{\partial h_k}(h)). This is exactly the vector we can obtain using back propagation.

For example, if we want a linear model, we can use (a,b) as our parameters and the function will be x \mapsto a\cdot x + b. We can generate random points X,Y as follows:
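The data-generation code from the notebook is not reproduced here; a minimal sketch in the same spirit (the slope 2, intercept 1, noise level, and sample size below are arbitrary choices, not necessarily those in the notebook) is:

```python
import random

n = 100
X = [random.uniform(-1, 1) for _ in range(n)]
# noisy samples around the line y = 2x + 1
Y = [2*x + 1 + random.gauss(0, 0.1) for x in X]
```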

Now we can define a linear model as follows:

class Linear:
  def __init__(self):
    self.a,self.b = Value(random.random()),Value(random.random())
  def __call__(self,x): return self.a*x+self.b

  def zero_grad(self):
    self.a.grad, self.b.grad = 0,0

And train it directly by using SGD:

η, epochs = 0.03, 20
model = Linear()
for t in range(epochs):
  for x,y in zip(X,Y):
    model.zero_grad()
    loss = (model(x)-y)**2
    loss.backward()
    model.a , model.b = (model.a - η*model.a.grad , model.b - η*model.b.grad)

Which as you can see works very well:

From linear classifiers to Neural Networks.

The above was somewhat of an “overkill” for linear models, but the beauty of automatic differentiation is that we can easily handle more complex computations.

We can follow Karpathy’s demo and use the same approach to train a neural network.

We will use a neural network that takes two inputs and has two hidden layers of width 16. A neuron that takes inputs x_1,\ldots,x_k will apply the ReLU function (\max\{0,x\}) to \sum w_i x_i, where w_1,\ldots,w_k are its weight parameters. (It’s easy to add support for relu to our Value class. Also, we won’t have a bias term in this example.)

The code for this Neural Network is as follows: (when Value() is called without a parameter the value is a random number in [-1,1])

def Neuron(weights,inputs, relu =True):
  # Evaluate neuron with given weights on given inputs
  v =  sum(weights[i]*x for i,x in enumerate(inputs))
  return v.relu() if relu else v

class Net:
  # Depth 3 fully connected neural net with two inputs and one output
  def __init__(self,  N=16):
    self.layer_1 = [[Value(),Value()] for i in range(N)]
    self.layer_2 = [ [Value() for j in range(N)] for i in range(N)]
    self.output =  [ Value() for i in range(N)]
    self.parameters = [v for L in [self.layer_1,self.layer_2,[self.output]] for w in L for v in w]

  def __call__(self,x):
    layer_1_vals = [Neuron(w,x) for w in self.layer_1]
    layer_2_vals = [Neuron(w,layer_1_vals) for w in self.layer_2]
    return Neuron(self.output,layer_2_vals,relu=False) 
    # the last output does not have the ReLU on top

  def zero_grad(self):
    for p in self.parameters:
      p.grad = 0

We can train it in the same way as above.
We will follow Karpathy and train it to classify the following points:
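The plot of the training points is not reproduced here. Karpathy’s demo generates a two-class “moons”-style dataset (via scikit-learn, if I recall correctly); as a self-contained stand-in with the same flavor, one could use two noisy concentric rings with labels \pm 1 (an arbitrary substitute dataset, not the one in the demo):

```python
import math, random

def make_rings(n=100, noise=0.05):
    # points near an inner ring get label +1, points near an outer ring -1;
    # this is a hypothetical stand-in for the demo's moons dataset
    X, Y = [], []
    for i in range(n):
        label = 1 if i % 2 == 0 else -1
        r = (0.5 if label == 1 else 1.0) + random.gauss(0, noise)
        theta = random.uniform(0, 2*math.pi)
        X.append([r*math.cos(theta), r*math.sin(theta)])
        Y.append(label)
    return X, Y

X, Y = make_rings()
```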

The training code is very similar, with the following differences:

  • Instead of the square loss, we use the function L(y,y')= \max\{ 1- y\cdot y', 0 \}, which is 0 if y \cdot y' \geq 1. This makes sense since our data labels will be \pm 1, and we say we classify correctly if we get the same sign. We get zero loss if we classify all samples correctly with a margin of at least 1.
  • Instead of stochastic gradient descent we will do standard gradient descent, using all the datapoints before taking a gradient step. The optimum for neural networks is often something in the middle – mini-batch gradient descent, where we take a batch of samples and perform the gradient step over them.

The resulting code is the following:

for t in range(epochs):
  model.zero_grad()
  loss = sum([(1+ -y*model(x)).relu() for (x,y) in zip(X,Y)])/len(X)
  loss.backward()
  for p in model.parameters:
    p.data -= η*p.grad

If we use this, we get a decent approximation for this training set (see image below). As Karpathy shows, by adjusting the learning rate and using regularization, one can in fact get 100% accuracy.

Update 11/30: Thanks to Gollamudi Tarakaram for pointing out a typo in a previous version.

Digging into election models

With election on my mind, and constantly looking at polls and predictions, I thought I would look a little more into how election models are made. (Disclaimer: I am not an expert statistician / pollster and this is based on me trying to read their methodological description as well as looking into results of simulations in Python. However, there is a colab notebook so you can try this on your own!)

If polls were 100% accurate, then we would not need election models – we will know that the person polling at more than 50% in a given state will win, and we can just sum up the electoral votes. However, polls have various sources of errors:

  1. Statistical sample error – this is simply the deviation between the fraction of people that would say “I will vote for X” at time T in the population, and the empirical fraction reported by the poll based on their sample. As battleground states get polled frequently with large samples, this error is likely to be negligible.
  2. Sampling bias – this is the bias incurred by the fact that we cannot actually sample a random subset of the population and get them to answer our questions – the probability that people will pick up their phone may be correlated with their vote. Pollsters hope that these correlations all disappear once you condition on certain demographic variables (race, education, etc..) and so try to ensure the sample is balanced according to these metrics. I believe this was part of the reason that polls were off in 2016, since they didn’t explicitly adjust for levels of education (which were not strongly correlated with party before) and ended up under-representing white voters without college degrees.
  3. Lying responses or “shy” voters – Some people suggest that voters lie to pollsters because their choice is considered “socially undesirable”. There is not much support that this is a statistically significant effect. In particular one study showed there was no statistically significant difference between responders’ responses in online and live calling. Also in 2016 polls equally under-estimated the votes for Trump and Republican senators (which presumably didn’t have the same “social stigma” to them).
  4. Turnout estimates – Estimating the probability that a person supporting candidate X will actually show up to vote (or mail it in) is a bit of a dark art, and accounts for the gap between polls of registered voters (which make no such estimates) and polls of likely voters (which do). Since traditionally the Republican electorate is older and better off, they tend to vote more reliably, and hence likely-voter estimates are typically better for Republicans. The effect seems not to be very strong this year. Turnout might be particularly hard to predict this year, though it seems likely to be historically high.
  5. Voters changing their mind – The poll is done at a given point in time and does not necessarily reflect voters views in election day. For example in 2016 it seems that many undecided voters broke for Trump. In this cycle the effect might be less pronounced since there are few undecided voters and “election day” is smoothed over a 2-4 week period due to early and mail-in voting.

To a first approximation, a poll-based election model does the following:

1. Aggregates polls into national and state-wise predictions

2. Computes a probability distribution over the correlated error vectors (i.e. the vector \vec{e} with coordinate for each jurisdiction containing the deviation from the prediction)

3. Samples from the probability distribution over vectors to obtain probabilities over outcomes.
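The three steps above can be sketched in a toy Monte Carlo simulation. All numbers below (state margins, electoral votes, error magnitudes) are made up for illustration and come from neither model; correlation is introduced simply by adding a shared national error to independent per-state errors:

```python
import random

# hypothetical poll margins (Dem minus Rep, percentage points) and electoral votes
states = {"A": (3.0, 20), "B": (1.0, 16), "C": (-2.0, 29)}

def simulate(runs=10000):
    total_ev = sum(ev for _, ev in states.values())
    wins = 0
    for _ in range(runs):
        national_error = random.gauss(0, 2)   # shared component -> correlated errors
        dem_ev = 0
        for margin, ev in states.values():
            state_error = random.gauss(0, 3)  # independent per-state component
            if margin + national_error + state_error > 0:
                dem_ev += ev
        if dem_ev > total_ev / 2:
            wins += 1
    return wins / runs

print(simulate())
```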

From a skim of 538’s methodology it seems that they do the following:

  1. Aggregate polls (weighing by quality, timeliness, adjusting for house effects, etc..). Earlier in the election cycle they also mix in “fundamentals” such as state of the economy etc.. though their weight decreases with time.
  2. Estimate the magnitude of the national error (i.e., sample a value E \in \mathbb{R}_+ according to some distribution that reflects the amount of national uncertainty).
  3. (This is where I may be understanding wrong.) Sample a vector \vec{e} whose entries sum up to E according to a correlated distribution, where the correlations between states depend on demographic, location, and other factors. For each particular choice of E, because the sum is fixed, if a state has an E+X bias then on average the other states will need to compensate with a -X bias, and hence this can create negative correlations between states. (It is not clear that negative correlations are unreasonable – one could imagine policies that are deeply popular with population A and deeply unpopular with population B.)

From a skim of the Economist’s methodology it seems that they do the following:

  1. They start again with some estimate on the national popular vote, based on polls and fundamentals, and then assume it is distributed according to some probability distribution to account for errors.
  2. They then compute some prior on “partisan lean” (difference between state and national popular vote) for each state. If we knew the popular vote and partisan lean perfectly then we would know the result. Again like good Bayesians they assume that the lean is distributed according to some probability distribution.
  3. They update the prior based on state polls and other information
  4. They sample from an error distribution \vec{e} according to some explicit pairwise correlation matrix that has only non-negative entries (and hence you don’t get negative correlations in their model).

So, given all of the above, how much do these models differ? Perhaps surprisingly, the answer is “not by much”. To understand how they differ, I plotted for both models the following:

  1. The histogram of Biden’s popular vote margin
  2. The probability that Biden wins conditioned on a particular margin

Much of the methodological difference, including the issue of pairwise correlations, should manifest in item 2, but eyeballing it, the models don’t seem to differ that much. It seems that conditioned on a particular margin, both models give Biden a similar probability of winning. (In particular, both models think that a 3% margin is about 50/50, while a 4% margin gives Biden about an 80/20 chance.) The main difference is actually in the first part of estimating the popular vote margin – 538 is more “conservative” and has fatter tails.
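Given raw simulation draws (pairs of popular-vote margin and a win indicator), the conditional probabilities in item 2 can be estimated by bucketing the margins. A minimal sketch (the toy draws below are purely illustrative, not data from either model):

```python
from collections import defaultdict

def win_prob_by_margin(draws, bucket=1.0):
    # draws: list of (popular_vote_margin, won) pairs from simulation runs;
    # round each margin to the nearest bucket and average the win indicator
    counts = defaultdict(lambda: [0, 0])
    for margin, won in draws:
        b = round(margin / bucket) * bucket
        counts[b][0] += won
        counts[b][1] += 1
    return {b: w / n for b, (w, n) in sorted(counts.items())}

# toy input: near a 3-point margin the race is a coin flip, near 4 it tilts
draws = [(3.1, 1), (2.9, 0), (4.2, 1), (3.8, 1), (4.0, 1), (4.1, 0)]
print(win_prob_by_margin(draws))
```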

If you want to check my data, see if I have a bug, or try your own analysis, you can use this colab notebook.

“Make your own needle”

Another application of such models is to help us adjust our priors as new information comes in. For example, it’s possible that Florida, North Carolina and Texas will report results early. If Biden loses one of these states, should we adjust our estimate of his win probability significantly? It turns out that the answer depends on by how much he loses.

The following graphs show the updated win probability conditioned on a particular margin in a state. We see that winning or losing Florida, North Carolina, and Texas on their own doesn’t make much difference to the probability – it’s all about the margin. In contrast, losing Pennsylvania’s 20 electoral votes will make a significant difference to Biden’s chances.

(The non-monotonicity is simply a side effect of having a finite number of simulation runs and would disappear in the limit.)