Obfuscation: The Season 4 Finale

For many of the famous open problems of theoretical computer science, most researchers agree on what the answer is, but the challenge is to prove it. Most complexity theorists (with a few notable exceptions) believe that P≠NP, but we don’t know how to prove it. Similarly, most people working on matrix multiplication believe that there is an Õ(n²) algorithm for this problem, but we’re still stuck at 2.3728596. We believed that primality checking has a deterministic polynomial-time algorithm long before it was proven, and we still believe the same holds for polynomial identity testing.

The story of cryptographic obfuscation is different. This story deserves a full-length blog post (though see my now outdated survey), but the short version is as follows. In 2001 we (in a paper with Goldreich, Impagliazzo, Rudich, Sahai, Vadhan, and Yang) showed that what is arguably the most natural definition of obfuscation is impossible to achieve. That paper explored a number of obfuscation-related questions, and in particular left as an open question the existence of so-called indistinguishability obfuscators, or IO. Since then there were arguably more negative than positive results in obfuscation research, until in 2012, extending some of the ideas behind fully homomorphic encryption, Garg, Gentry, and Halevi gave a heuristic construction of multilinear maps, which one can think of as “Diffie-Hellman on steroids” (or maybe LSD..). Then in 2013 Garg, Gentry, Halevi, Raykova, Sahai and Waters (GGHRSW) built on top of these maps to give a heuristic construction of IO.

The GGHRSW paper opened the floodgates to many papers using IO to achieve longstanding cryptographic goals, as well as showing that IO provides a unified approach to solving many classic cryptographic problems. The fact that so many goals were achieved through heuristic constructions was not very comforting to cryptographers. Even less comforting was the fact that several cryptographic attacks were discovered on these heuristic constructions. The years that followed saw a sequence of constructions and breaks, giving cryptographers an “emotional whiplash”. Everyone agreed that IO would be amazing if it exists, but whether or not it actually exists depended on who you asked, and what paper in the eprint archive they read that morning…

The “holy grail” in this line of work is to base obfuscation on a standard assumption, and ideally Regev’s Learning With Errors (LWE) assumption. Of course, we don’t know that LWE is true (in particular LWE implies P≠NP) but if it’s false it would bring down so much of the field that cryptographers might as well pack their bags and do machine learning (or try to sabotage progress in quantum computing, since the only other standard assumptions for public-key crypto are broken by fully scalable quantum computing).

We have not yet achieved this holy grail (this is only the 4th season) but as described in this Quanta article, there has been remarkable progress in the last few months. In particular, Jain, Lin and Sahai (JLS) (building on a long sequence of works by many people including Ananth, Matt, Tessaro and Vaikuntanathan) obtained IO based on LWE and several standard assumptions in cryptography. This is arguably the first “heuristic-free” construction, and is a fantastic breakthrough. However, there is still work to do: the JLS construction uses not just LWE but also a variant of it that is not as well studied. It also relies on pairing-based cryptography, an area with thousands of papers, but one whose known instantiations can be broken by quantum computers. However, there is yet more hope: in another sequence of works by Agrawal, Brakerski, Döttling, Garg, and Malavolta, Wee and Wichs, and Gay and Pass, a construction of IO was achieved that is “almost” heuristic free. It still uses one heuristic assumption (circular security) but has the advantage that apart from this assumption it only relies on LWE.

One can hope that in the next season, these two lines of work will converge to give a construction of IO based on LWE, achieving a “meta theorem” deriving from LWE a huge array of cryptographic primitives.

Want to learn more about these amazing advances? Want to know what’s next in store for IO?

Fortunately there is a virtual Simons symposium on indistinguishability obfuscation coming to your computer screen on December 10-11. Authors of all the papers mentioned will join together in coordinated presentations to give a unified view of the field and the challenges ahead. We will also have a historical opening talk by Yael Kalai, as well as a talk by Benny Applebaum on the computational assumptions used, followed by a panel discussion with Yael, Benny and Chris Peikert. Finally, like every proper crypto event, there will be a rump session, though you will have to supply your own beer.

See the schedule of the workshop, and register on this page.

Hope to see you there! Bring your favorite programs to obfuscate with you.*

*Disclaimer/fine print: Due to large constants and exponents, we do not recommend the compiler be used on programs that are more than one nanobit long.

Image credit: MIT

Making TCS more connected / less insular

[Announcement from Jelani Nelson –Boaz]

TL;DR: https://tinyurl.com/tcs-connections

A task force has been convened by CATCS to investigate possible approaches to modifying aspects of the TCS community, especially our publishing culture, to enhance connections with other areas of CS and be as welcoming as possible to a broad range of contributions within theory. This committee will collect and synthesize feedback from the community via the questionnaire at https://tinyurl.com/tcs-connections and then make suggestions.

Since adjusting conference formats may help with this, we would like to get a general idea of community opinion on format choices. If you have some opinions you would like to share, please use this form. Though the questions below focus primarily on FOCS/STOC, we welcome any and all suggestions that would make the TCS community as broad and as well-connected to other areas of CS as possible (see the last question of the survey).

Task force members:
Ken Clarkson (IBM Research)
Sandy Irani (UC Irvine)
Bobby Kleinberg (Cornell)
Adam Klivans (UT Austin)
Ravi Kumar (Google)
Jelani Nelson (UC Berkeley)
Yuval Rabani (Hebrew University)

On Galileo Galilei and “denialism” from elections to climate to COVID

Galileo Galilei has many self-appointed intellectual heirs these days. Whether it’s a claim that the election has been stolen, that COVID-19 is less fatal than the flu, that climate change or evolution are hoaxes, or that P=NP, we keep hearing from people who consider themselves bold truth-tellers railing against conventional wisdom. We are encouraged to “teach the debate” and told that if we only paid attention, we would see that their tweet, declaration, or arXiv paper contains an irrefutable proof of their assertions.

In the words of Ted Cruz, “They brand you a heretic. Today, the global warming alarmists are the equivalent of the flat-Earthers. It used to be [that] it is accepted scientific wisdom the Earth is flat, and this heretic named Galileo was branded a denier”.

Of course, by Galileo’s time it was well known that the earth was spherical, and Magellan’s expedition circumnavigated the earth more than 40 years before Galileo was born. But putting aside Cruz’s confusion of flat earth and geocentrism, the story of heliocentric theory is not one of an outsider railing against the scientific mainstream. Galileo himself was a chaired professor of mathematics at the University of Padua, and later the philosopher and mathematician to the Grand Duke of Tuscany. He was very much part of the scientific establishment of his time. Moreover, though Galileo did provide important evidence for heliocentrism, he was not the only one doing so. Kepler found a heliocentric model with elliptical orbits that actually made correct predictions, and, though it took a decade or so, Kepler’s book eventually became the standard textbook for astronomy.

My point in this post is not to rehash the history of heliocentrism or Galileo but rather to call out a misconception which, to use Sean Carroll’s phrasing, amounts to valorization of puzzle-solving over wisdom.

It is tempting to think that an argument, regardless of whether it comes from an expert or a random person on Twitter, can be presented in a self-contained way and judged on its merits. However, this is not how things work in any interesting setting. Even in the case of a purported P vs NP proof, there is background knowledge on computational complexity without which it would be hard to spot holes in the argument. This is doubly so for any claim involving empirical facts, whether it’s about elections, infections, climate, etc. It is not possible to evaluate such claims without context, and to get this context you need to turn to the experts who have studied the topic.

I have written before in defense of expertise (see also here) but Carroll puts it very well. Another way to say it is that the operational interpretation of the common refrain

Extraordinary claims require extraordinary evidence

is

Treat claims conforming to conventional wisdom with charity, and claims disputing it with skepticism.

(There is a question of how to define “conventional wisdom” but interestingly there is usually agreement in practice by both sides. Most “deniers” of various sorts are proud of going against conventional wisdom, but don’t acknowledge that this means they are more likely to be wrong.)

As an example, even if someone has expertise in analytic number theory, and so presumably has plenty of so-called “puzzle-solving intelligence”, that doesn’t mean that they can evaluate a statistical claim about election fraud, or that their analysis should be considered evidence (apparently at this point the number theorist himself agrees). We can try to read and debunk what they wrote, or we can assume that if there was evidence for large-scale fraud, then the president of the United States and his well-funded campaign would have managed to find actual statisticians and experts on elections to make the case.

There can be debate about whether Trump’s attempt to overturn the election should be considered dangerous or merely absurd, but the constant attacks on the very notions of truth, science, and expertise are causing far-reaching harm.

(H/T: Scott Aaronson, who makes a similar point around the election conspiracies.)

Announcing the WiML-T Mentorship Program (guest post)

[Guest post by Claire Vernade, Jessica Sorrell, Kamalika Chaudhuri, Lee Cohen, Mary Anne Smart, Michal Moshkovitz, and Ruth Urner.

I am very happy about this initiative – mentoring and community is so important for success in science, and as I’ve written before, there is much work to do so women will have the same access to these as men. –Boaz]

TL;DR: we are organizing a new mentorship program for women in machine learning. Please consider applying as a mentor, mentee, or both at https://www.wiml-t.org/mentoring-program

What is WIML-T?

Women in machine learning (or WiML for short) was established more than ten years ago, and its main goals are to (1) increase the number of women in machine learning, (2) help women in machine learning succeed professionally, and (3) increase the impact of women in machine learning on the community. Towards these goals, they create different opportunities for women to showcase their work. Chief among them is the annual Women in Machine Learning (WiML) Workshop, typically co-located with NeurIPS, which presents women’s cutting-edge research.

Women in machine learning theory (or WiML-T for short) shares the same goals as WiML but focuses on the smaller learning theory community. The vision of WiML-T is to give visibility and legitimacy to under-represented groups in learning theory, to create stronger bonds within the community and beyond, and to provide support and advice. As part of this vision, we have decided to facilitate a mentoring program that will connect women and non-binary researchers who are newer to learning theory with more experienced mentors.  

Why mentorship?

Mentoring programs have been shown to greatly help underrepresented communities develop [1]. More resources on mentoring can be found on the website of Stanford’s Women’s Leadership Innovation Lab and on our website. Mentoring creates strong bonds of trust within the community, it helps mentees find career advice and connections, and it helps mentors grow their leadership skills while keeping in touch with the newcomers of the community.  We hope that creating a world-wide program will also help reduce inequality between less privileged areas and the most famous institutions. Indeed, it is a well-known fact that the more competitive a career path, the less diverse it is, due in part to network effects from which under-represented groups are excluded. We believe that uniting individuals from these groups (i.e. women, non-binary people, persons of color and other minorities) will help reduce these effects and contribute to finding solutions to the community’s problems. 

Why be a mentor?  

Remember the beginning of your research journey, with all the difficulties, uncertainties, and unanswered questions? This is your chance to give all the advice you wish you had gotten, and to make an impact on a future colleague. As a mentor you will be a role model and help the young generation of researchers in learning theory. You will help them develop and enhance their careers by giving them the support they need. Need more reasons to mentor? It will grow your leadership skills, self-confidence, and communication skills, and you will feel happier after helping others.

Who can be a mentor? 

Nearly everyone who has some experience in academia or industry can be a mentor. It can be interesting for an undergrad student to receive advice from senior PhD students or postdocs who have recently had to reflect about career decisions and can share knowledge about their work environment. We indeed expect the most senior researchers to apply as mentors, but we would also like to encourage PhDs and postdocs to consider mentoring (while possibly having a mentor as well!).    

Can men mentor? 

We thank everyone who wants to help the community!  

We will prioritize women mentors, as they can give their unique perspective, but we acknowledge that there might be a limited number of such mentors. To mitigate this issue, we will be happy to pair mentees with male mentors, provided the mentee agrees.

Why be a mentee?   

Having a mentor is one of the best ways to get external career advice and feedback from someone with a possibly similar background. Managing to find one’s way into academia or science is not easy. It can be even harder for under-represented groups who may lack role models within their institution, or who may not connect with common advice that implicitly assumes or relies on some class privilege. Having a mentor can help you navigate professional and personal issues that men may not always face. It is also a way to get connected to other members of the community, or to get second opinions on research strategies.

Timeline

  • The program launched on October 29 2020 and will run on a continuous basis.
  • We will start with pairings of mentors and mentees in December 2020. This process can take a few months.  
  • Frequency of the meetings: totally depends on the mentor and mentee. It can be either weekly meetings or once every two months or anything in between. 
  • Duration of the mentorship: totally depends on the mentor and mentee. It can be a few months, a year, or even more. 

Have questions? You can mail us at: contact@wiml-t.org

[1] Ginther, D. K., Currie, J., Blau, F. D., & Croson, R. (2020). Can Mentoring Help Female Assistant Professors in Economics? An Evaluation by Randomized Trial. NBER Working Paper No. 26864. https://doi.org/10.3386/w26864     

Updated Research Masters programs database by Aviad Rubinstein and Matt Weinberg

Guest post by Aviad Rubinstein and Matt Weinberg

As explained in Boaz’s previous posts [1] [2], the PhD admission process can be challenging for students who discover their passion for Theory of Computer Science late in their undergraduate studies. Discovering TCS early is especially challenging for students who aren’t exposed to CS in high school, and this bias aggravates the diversity issues we have in our community. Masters programs are one way to mitigate this issue.
On a personal note, one of us (Aviad) worked half-time during his undergraduate studies and would not have been in academia today (let alone TCS, of all subjects) if it weren’t for the awesome TCS masters program at Tel-Aviv University.


But where would you go (or send your students) to do a masters in TCS? A little over a year ago, we were discussing how useful it would be to have a unified resource that can help students choose. We decided that we should create one! Months passed, and right before we ran out of excuses to procrastinate, Boaz made this post crowdsourcing information on TCS masters programs.
After some more procrastination, we eventually did send a lot of emails and tried to clean and organize the information to the best of our ability. Thanks to everyone who contributed! You can find the latest version here. (Short url: http://tiny.cc/tcsmasters)

Now you should share it with your brightest students!


This is also meant to be a live project. Please email aviad [at] cs.stanford.edu if you have new information or find any inaccuracies.

Thanks!

-Aviad and Matt

Election insecurity

Election security has been studied for many years by computer scientists, but it is not often that it attracts so much mainstream attention. I would never have expected to see my former Princeton colleague Andrew Appel on a Sean Hannity segment tweeted by President Trump.

It may seem that even if it has partisan motivations, the recent GOP interest in election security is overall a positive thing. Who wouldn’t want elections to be more secure? Who wouldn’t want less fraud? However, in a very precise sense, the definition of “election security” used by the GOP these days corresponds to election insecurity.

To understand this claim, consider what it means for an election to be secure. (Let’s focus just on the correctness aspect of the count, since it is at the heart of the current issues, and not on the very interesting privacy aspect.) Computer scientists use the technical terms “cast as intended”, “recorded as cast”, and “tallied as recorded”. In other words: if a voter X intends to cast a vote for candidate Y, then this vote should be recorded and tallied, and only such votes should be tallied.

With mail-in voting, there are several potential points of failure on the path between voter intent and the final tally:

  1. Mail can be lost or delayed too much, leading to the vote not counting.
  2. A third party could intercept the ballot and impersonate the voter.
  3. A ballot may not be formatted properly in some way, leading to it being disqualified.
  4. There can be errors or hacks in the tallying process.

Election security is about combating points 1-4 (of which the last three are also applicable to in-person voting), ideally in a way that is verifiable to the individual voters. Achieving verifiability while maintaining secrecy and not requiring the voter to trust complex technology is a challenging task, but there have been some proposed solutions (see above links).

The Hannity segment and much of the “Dominion” non-story focused on point 4. This is an important point, but as Appel himself notes, paper ballots, which are mostly used in the US, serve as a way to audit counting. Re-counting is important, and is commonly done, but such recounts often change the total counts by relatively little (and the changes mostly cancel out). For example, here is the list of ballots changed and the reasons from the Wisconsin 2016 recount (taken from this paper):

[Table: ballots changed in the 2016 Wisconsin recount, by reason]

In contrast, many of the legal cases by the Trump campaign focused on signature verification and other ballot irregularities. There are two main reasons why a signature would not match between a ballot and driver’s license or other records:

  1. The signature may have been forged by someone trying to impersonate the voter.
  2. The voter’s signature might not be very consistent, or maybe they have more than one signature (for example, I sometimes sign in Hebrew and sometimes in English).

Empirically, reason 2 is much more common than reason 1. If a ballot is tossed out for the second reason, this corresponds to a break between the voter’s intent and the final tally, and hence it is a case of election insecurity. For this reason, making signature checks more stringent could make elections less secure!

While President Trump might claim on Twitter that the election was stolen by a massive conspiracy involving the forging of tens of thousands of ballots, this is not the actual content of the court cases (especially after some recent amendments). For example, at the heart of the PA case is the process of “curing a ballot”. This is when a ballot is disqualified due to some technical issue, and the voter has a chance to fix it. Curing a ballot ensures that the voter’s intent is captured, and hence makes elections more secure.

In PA, the decision of whether to notify voters in such cases was left to the counties, and apparently Democrat-controlled counties were more likely to do so than Republican-controlled counties. This is a shame, and had the Trump campaign asked to extend the deadline for curing ballots, I would think that makes perfect sense. However, this is not what their lawsuit is about. To quote their complaint: “plaintiffs seek a permanent injunction requiring the County Election Boards to invalidate ballots cast by voters who were notified and given an opportunity to cure their invalidly cast mail-in ballot.” These are ballots where there is no question of the eligibility of the voter, nor of the accuracy of their intent, yet the Trump campaign seeks to prevent them from counting. I call this election insecurity.

p.s. See Adam Klasfeld’s feed for more about the various Trump campaign cases.

MoPS and Junior-Senior Meeting (DISC 2020)


(Guest post by Shir Cohen and Mouna Safir)


The 34th International Symposium on Distributed Computing (DISC 2020) was held on October 12-16, 2020, as a virtual conference. As such, the opportunity for community members to get to know each other in an informal environment was lacking. To address this need, we arranged two types of virtual networking events. We hope that these events planted the seeds for many future collaborations and that there will be an opportunity for those involved to meet in person next time.

MoPS (Meet other Postdoc and Students) Sessions
Webpage: https://sites.google.com/view/disc2020mops/home

To allow junior members of the community to get to know one another, we arranged MoPS sessions, which we have not seen done before. More than 50 participants took part in the sessions, with representation from a host of countries throughout the world. Sessions were held in 10 time slots before, during, and after DISC. In each session, there would typically be 5 members representing a mixture of Bachelor’s students, Masters students, PhDs, postdocs, and others. Care was taken to include at least one postdoc or PhD student in each session so that Bachelors and Masters students might benefit from their experience. Groups were formed with the goal of allowing participants from different countries and institutions to share their experiences and research journeys with one another. Based on the feedback for this event, it appears that this goal was met and that participants came away with more of a sense of community.

Junior-Senior Meetings
Webpage: https://sites.google.com/view/disc2020junior-seniormeeting/home

The Junior-Senior meetings were organized to provide an opportunity for junior researchers to meet with senior researchers. Fourteen sessions were conducted, each bringing together one senior and 3-5 juniors. In these sessions, the juniors got a chance to interact with seniors in the field and benefit from their experience. Discussions covered a variety of topics such as how to approach research, how to deal with the job market, or more personal concerns like work-life balance. We collected amazing feedback from the participants, who said that this was a fruitful and interesting experience.

Yet another backpropagation tutorial

I am teaching deep learning this week in Harvard’s CS 182 (Artificial Intelligence) course. As I’m preparing the back-propagation lecture, Preetum Nakkiran told me about Andrej Karpathy’s awesome micrograd package which implements automatic differentiation for scalar variables in very few lines of code.

I couldn’t resist using this to show how simple back-propagation and stochastic gradient descent are. To make sure we leave nothing “under the hood”, we will not import anything from the package but rather only copy-paste the few things we need. I hope that the text below is generally accessible to anyone familiar with partial derivatives. See this colab notebook for all the code in this tutorial. In particular, aside from plotting libraries and a few dozen lines copy-pasted from Karpathy, this code uses absolutely no libraries (no numpy, no pytorch, etc..) and can train (slowly..) neural networks using stochastic gradient descent. (This notebook builds the code more incrementally.)

Automatic differentiation is a mechanism that allows you to write a Python function such as

def f(x,y): return (x+y)+x**3

and enables one to automatically obtain the partial derivatives \tfrac{\partial f}{\partial \text{x}} and \tfrac{\partial f}{\partial \text{y}}. Numerically we could do this by choosing some small value \delta and computing both \tfrac{f(x+\delta,y)-f(x,y)}{\delta} and \tfrac{f(x,y+\delta)-f(x,y)}{\delta}.
However, if we generalize this approach to n variables, we get an algorithm that requires roughly n evaluations of f. Back-propagation enables computing all of the partial derivatives at only constant overhead over the cost of a single evaluation of f.
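To make the cost comparison concrete, here is a minimal sketch of this finite-difference approach for the function above (this is just an illustration, not part of the micrograd code):

# Finite-difference approximation of the partial derivatives of f(x, y) = (x + y) + x**3.
# Each partial derivative costs one extra evaluation of f, so with n input variables this
# approach needs roughly n extra evaluations; back-propagation avoids this blow-up.
def f(x, y):
    return (x + y) + x**3

def numeric_grads(f, x, y, delta=1e-6):
    base = f(x, y)
    df_dx = (f(x + delta, y) - base) / delta
    df_dy = (f(x, y + delta) - base) / delta
    return df_dx, df_dy

print(numeric_grads(f, 2.0, 3.0))  # approximately (13.0, 1.0), since df/dx = 1 + 3x**2 and df/dy = 1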

Back propagation and the chain rule

Back-propagation is a direct implication of the multi-variate chain rule. Let’s illustrate this for the case of two variables. Suppose that v,w: \mathbb{R} \rightarrow \mathbb{R} and z:\mathbb{R}^2 \rightarrow \mathbb{R} are differentiable functions, and define

f(u) = z(v(u),w(u)).

That is, we have the following situation:

where f(u) is the value z=z(v(u),w(u))

Then the chain rule states that

\tfrac{\partial f}{\partial u} = ( \tfrac{\partial v}{\partial u} \cdot \tfrac{\partial z}{\partial v} + \tfrac{\partial w}{\partial u} \cdot \tfrac{\partial z}{\partial w} )

You can take this on faith, but it also has a simple proof. To see the intuition, note that for small \delta, v(u+\delta) \approx v(u) + \delta \tfrac{\partial v}{\partial u}(u) and w(u+\delta) \approx w(u) + \delta \tfrac{\partial w}{\partial u}(u). For small \delta_1,\delta_2, z(v+\delta_1,w+\delta_2) \approx z(v,w) + \delta_1 \tfrac{\partial z}{\partial v}(v,w) + \delta_2 \tfrac{\partial z}{\partial w}(v,w). Hence, if we ignore terms with powers of \delta of two or higher,

f(u +\delta)= z(v(u+\delta),w(u+\delta)) \approx f(u) + \delta \tfrac{\partial v}{\partial u} \cdot \tfrac{\partial z}{\partial v} + \delta \tfrac{\partial w}{\partial u} \cdot \tfrac{\partial z}{\partial w}

Meaning that \frac{f(u +\delta) - f(u)}{\delta} \approx \tfrac{\partial v}{\partial u} \cdot \tfrac{\partial z}{\partial v} + \tfrac{\partial w}{\partial u} \cdot \tfrac{\partial z}{\partial w} which is what we needed to show.
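As a quick sanity check, take v(u) = u^2, w(u) = u^3 and z(v,w) = v \cdot w, so that f(u) = z(v(u),w(u)) = u^5. The chain rule gives \tfrac{\partial f}{\partial u} = \tfrac{\partial v}{\partial u} \cdot \tfrac{\partial z}{\partial v} + \tfrac{\partial w}{\partial u} \cdot \tfrac{\partial z}{\partial w} = 2u \cdot w + 3u^2 \cdot v = 2u \cdot u^3 + 3u^2 \cdot u^2 = 5u^4, which matches differentiating u^5 directly.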

The chain rule generalizes naturally to the case where z is a function of more than two intermediate variables. Generally, if the value f(u) is obtained by first computing some intermediate values v_1,\ldots,v_k from u and then computing z in some arbitrary way from v_1,\ldots,v_k, then \tfrac{\partial z}{\partial u} = \sum_{i=1}^k \tfrac{\partial v_i}{\partial u} \cdot \tfrac{\partial z}{\partial v_i}.

As a corollary, if you already managed to compute the values \tfrac{\partial z}{\partial v_1},\ldots, \tfrac{\partial z}{\partial v_k}, and you kept track of the way that v_1,\ldots,v_k were obtained from u, then you can compute \tfrac{\partial z}{\partial u}.

This suggests a simple recursive algorithm by which you compute the derivative of the final value z with respect to an intermediate value w in the computation using recursive calls to compute the values \tfrac{\partial z}{\partial w'} for all the values w' that were directly computed from w. Back propagation is this algorithm.

Computing \tfrac{\partial z}{\partial u},\tfrac{\partial z}{\partial v}, \tfrac{\partial z}{\partial w} for the assignment u=5 using back propagation

Implementing automatic differentiation using back propagation in Python

We now describe how to do this in Python, following Karpathy’s code. The basic class we use is Value. Every member u of Value is a container that holds:

  1. The actual scalar (i.e., floating point) value that u holds. We call this data.
  2. The gradient of u with respect to some future unknown value f that will use it. We call this grad and it is initialized to zero.
  3. Pointers to all the values that were used in the computation of u. We call this _prev.
  4. The method that adds the contribution of u (using the current value of u and other values) to the gradients of all the previous values v that were used to compute it. We call this function _backward. Specifically, at the time we call _backward we assume that u.grad already contains \tfrac{\partial z}{\partial u} where z is the final value we are interested in. For every value v that was used to compute u, we add to v.grad the quantity \tfrac{\partial z}{\partial u} \cdot \tfrac{\partial u}{\partial v}. For the latter quantity we need to keep track of how u was computed from v.
  5. If we call the method backward (without an underscore) on a variable u then this will compute the derivative of u with respect to v for all values v that were used in the computation of u. We do this by applying _backward to u and then recursively going over the “children” (values used to compute u), calling _backward on each one and keeping track of the ones we visited, just like in the Depth First Search (DFS) algorithm.

Let’s now describe this in code. We start off with a simple version that only supports addition and multiplication. The constructor for the class is the following:

class Value:
    """ stores a single scalar value and its gradient """

    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0
        self._backward = lambda: None
        self._prev = set(_children)

which fairly directly matches the description above. This constructor creates a value not using prior ones, which is why the _backward function is empty.
However, we can also create values by adding or multiplying prior ones, by adding the following methods:

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))

        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward

        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward

        return out

That is, if we create w by adding the values u and v, then the _backward function of w works by adding w.grad = \tfrac{\partial z}{\partial w} to both u.grad and v.grad.
If w is obtained by multiplying u and v, then we add w.grad \cdot v.data = \tfrac{\partial z}{\partial w} \cdot v to u.grad and similarly add w.grad \cdot u.data = \tfrac{\partial z}{\partial w} \cdot u to v.grad.

The backward function works by setting the gradient of the current value to 1 and then running _backward on all other values in reverse topological order:

def backward(self, visited= None): # slightly shorter code to fit in the blog 
    if visited is None:
        visited= set([self])
        self.grad = 1
    self._backward()
    for child in self._prev:
        if not child in visited:
            visited.add(child)
            child.backward(visited)

For example, if we run the following code

a = Value(5)
print(a.grad)
def f(x): return (x+2)**2 + x**3
f(a).backward()
print(a.grad)

then the values printed will be 0 and 89, since the derivative of (x+2)^2 + x^3 = x^3 + x^2 + 4x + 4 equals 3x^2 + 2x + 4, which at x=5 is 75 + 10 + 4 = 89.

In the notebook you can see that we also implement the power function, and have some “convenience methods” (negation, subtraction, division, etc..).
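For concreteness, here is one way the power function and a few convenience methods can look, written in the same style as the methods above; this is a sketch, and the notebook’s version may differ slightly. These go inside the Value class.

    def __pow__(self, other):
        # Only int/float exponents are supported in this sketch.
        assert isinstance(other, (int, float)), "only supporting int/float powers"
        out = Value(self.data ** other, (self,))

        def _backward():
            self.grad += (other * self.data ** (other - 1)) * out.grad
        out._backward = _backward

        return out

    # Convenience methods built from the primitives above.
    def __neg__(self): return self * -1
    def __sub__(self, other): return self + (-other)
    def __radd__(self, other): return self + other
    def __rmul__(self, other): return self * other
    def __truediv__(self, other): return self * other ** -1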

Linear regression using back propagation and stochastic gradient descent

In stochastic gradient descent we are given some data (x_1,y_1),\ldots,(x_n,y_n) and want to find a hypothesis h that minimizes the empirical loss L(h) = \tfrac{1}{n}\sum_{i=1}^n L(h(x_i),y_i) where L is a loss function mapping two labels y, y' to a real number. If we let L_i(h) be the i-th term of this sum, then, identifying h with the parameters (i.e., real numbers) that specify it, stochastic gradient descent is the following algorithm:

  1. Set h to be a random vector. Set \eta to be some small number (e.g., \eta = 0.1)
  2. For t \in {1,\ldots, T} (where T is the number of epochs):
  • For i \in {1,\ldots, n}: (in random order)
    • Let h \leftarrow h - \eta \nabla_h L_i(h)

If h is specified by the parameters h_1,\ldots,h_k, then \nabla_h L_i(h) is the vector ( \tfrac{\partial L_i}{\partial h_1}(h), \tfrac{\partial L_i}{\partial h_2}(h),\ldots, \tfrac{\partial L_i}{\partial h_k}(h)). This is exactly the vector we can obtain using back propagation.

For example, if we want a linear model, we can use (a,b) as our parameters and the function will be x \mapsto a\cdot x + b. We can generate random points X,Y as follows:
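(The actual data-generating code is in the notebook; here is a minimal sketch, where the slope 2, intercept 1, and noise level 0.1 are arbitrary choices for illustration.)

import random

# Generate n points near the line y = 2x + 1, with a little Gaussian noise added.
n = 100
X = [random.random() for _ in range(n)]
Y = [2 * x + 1 + 0.1 * random.gauss(0, 1) for x in X]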

Now we can define a linear model as follows:

class Linear:
  def __init__(self):
    self.a,self.b = Value(random.random()),Value(random.random())
  def __call__(self,x): return self.a*x+self.b

  def zero_grad(self):
    self.a.grad, self.b.grad = 0,0

And train it directly by using SGD:

η, epochs = 0.03, 20
model = Linear()   # create the model before training
for t in range(epochs):
  for x,y in zip(X,Y):
    model.zero_grad()
    loss = (model(x)-y)**2
    loss.backward()
    model.a , model.b = (model.a - η*model.a.grad  , model.b - η*model.b.grad)

Which as you can see works very well:

From linear classifiers to Neural Networks.

The above was somewhat of an “overkill” for linear models, but the beauty of automatic differentiation is that we can easily handle more complex computations.

We can follow Karpathy’s demo and use the same approach to train a neural network.

We will use a neural network that takes two inputs and has two hidden layers of width 16. A neuron that takes inputs x_1,\ldots,x_k will apply the ReLU function (\max\{0,x\}) to \sum w_i x_i, where w_1,\ldots,w_k are its weight parameters. (It’s easy to add support for ReLU in our Value class; a sketch is below. Also, we won’t have a bias term in this example.)
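Here is a sketch of what adding ReLU looks like in the style of our Value class, together with a constructor that defaults to a random number in [-1,1], which is how Value() is used below; the notebook’s exact code may differ. Both methods go inside the Value class, and the constructor assumes the random module is imported at the top of the file.

    def __init__(self, data=None, _children=()):
        # Same constructor as before, except that omitting data gives a random number in [-1, 1].
        self.data = data if data is not None else random.uniform(-1, 1)
        self.grad = 0
        self._backward = lambda: None
        self._prev = set(_children)

    def relu(self):
        # ReLU as another differentiable primitive: the gradient flows only where the output is positive.
        out = Value(max(0, self.data), (self,))

        def _backward():
            self.grad += (out.data > 0) * out.grad
        out._backward = _backward

        return out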

The code for this Neural Network is as follows (when Value() is called without a parameter, the value is a random number in [-1,1]):

def Neuron(weights,inputs, relu =True):
  # Evaluate neuron with given weights on given inputs
  v =  sum(weights[i]*x for i,x in enumerate(inputs))
  return v.relu() if relu else v


class Net:
  # Depth-3 fully connected neural net with two inputs and one output
  def __init__(self,  N=16):
    self.layer_1 = [[Value(),Value()] for i in range(N)]
    self.layer_2 = [ [Value() for j in range(N)] for i in range(N)]
    self.output =  [ Value() for i in range(N)]
    self.parameters = [v for L in [self.layer_1,self.layer_2,[self.output]] for w in L for v in w]


  def __call__(self,x):
    layer_1_vals = [Neuron(w,x) for w in self.layer_1]
    layer_2_vals = [Neuron(w,layer_1_vals) for w in self.layer_2]
    return Neuron(self.output,layer_2_vals,relu=False) 
    # the last output does not have the ReLU on top

  def zero_grad(self):
    for p in self.parameters:
      p.grad=0

We can train it in the same way as above.
We will follow Karpathy and train it to classify the following points:

The training code is very similar, with the following differences:

  • Instead of the square loss, we use the function L(y,y') = \max\{1 - y\cdot y', 0\}, which is 0 if y \cdot y' \geq 1. This makes sense since our data labels will be \pm 1 and we say we classify correctly if we get the same sign. We get zero loss if we classify all samples correctly with a margin of at least 1.
  • Instead of stochastic gradient descent we will do standard (full-batch) gradient descent, using all the datapoints before taking a gradient step. The optimum for neural networks is actually often something in the middle: mini-batch gradient descent, where we take a batch of samples and perform the gradient step over them.

The resulting code is the following:

model = Net()   # create the network before training
for t in range(epochs):
  loss = sum([(1+ -y*model(x)).relu() for (x,y) in zip(X,Y)])/len(X)
  model.zero_grad()
  loss.backward()
  for p in model.parameters:
    p.data -= η*p.grad

If we use this, we get a decent approximation for this training set (see image below). As Karpathy shows, by adjusting the learning rate and using regularization, one can in fact get 100% accuracy.

Update 11/30: Thanks to Gollamudi Tarakaram for pointing out a typo in a previous version.

Digging into election models

With the election on my mind, and constantly looking at polls and predictions, I thought I would look a little more into how election models are made. (Disclaimer: I am not an expert statistician / pollster and this is based on me trying to read their methodological descriptions as well as looking at the results of simulations in Python. However, there is a colab notebook so you can try this on your own!)

If polls were 100% accurate, then we would not need election models: we would know that the person polling at more than 50% in a given state will win, and we could just sum up the electoral votes. However, polls have various sources of errors:

  1. Statistical sample error – this is simply the deviation between the fraction of people that would say “I will vote for X” at time T in the population, and the empirical fraction reported by the poll based on their sample. As battleground states get polled frequently with large samples, this error is likely to be negligible.
  2. Sampling bias – this is the bias incurred by the fact that we cannot actually sample a random subset of the population and get them to answer our questions – the probability that people will pick up their phone may be correlated with their vote. Pollsters hope that these correlations all disappear once you condition on certain demographic variables (race, education, etc..) and so try to ensure the sample is balanced according to these metrics. I believe this was part of the reason that polls were off in 2016, since they didn’t explicitly adjust for levels of education (which were not strongly correlated with party before) and ended up under-representing white voters without college degrees.
  3. Lying responses or “shy” voters – Some people suggest that voters lie to pollsters because their choice is considered “socially undesirable”. There is not much evidence that this is a statistically significant effect. In particular, one study showed there was no statistically significant difference between respondents’ answers in online surveys and live phone calls. Also, in 2016 polls equally under-estimated the votes for Trump and Republican senators (who presumably didn’t have the same “social stigma” attached to them).
  4. Turnout estimates – Estimating the probability that a person supporting candidate X will actually show up to vote (or mail in their ballot) is a bit of a dark art, and accounts for the gap between polls of registered voters (which make no such estimates) and polls of likely voters (which do). Since traditionally the Republican electorate is older and better off, they tend to vote more reliably, and hence likely-voter estimates are typically better for Republicans. The effect seems not to be very strong this year. Turnout might be particularly hard to predict this year, though it seems likely to be historically high.
  5. Voters changing their mind – The poll is done at a given point in time and does not necessarily reflect voters views in election day. For example in 2016 it seems that many undecided voters broke for Trump. In this cycle the effect might be less pronounced since there are few undecided voters and “election day” is smoothed over a 2-4 week period due to early and mail-in voting.

To a first approximation, a poll-based election model does the following:

1. Aggregates polls into national and state-wise predictions

2. Computes a probability distribution over the correlated error vectors (i.e., the vector \vec{e} with a coordinate for each jurisdiction containing that jurisdiction’s deviation from the prediction)

3. Samples from the probability distribution over vectors to obtain probabilities over outcomes.
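As a toy illustration of these three steps (this is not the code of either model; the states, margins, covariance, and electoral-vote threshold below are made-up placeholders for a three-state example):

import numpy as np

# Step 1 output: per-state predicted margins for candidate A (made-up numbers).
states = ["PA", "FL", "NC"]
electoral_votes = np.array([20, 29, 15])
predicted_margin = np.array([0.05, 0.02, 0.01])

# Step 2: a correlated error distribution, here a multivariate normal with a
# 3-point standard deviation per state and pairwise correlation 0.5.
std, corr = 0.03, 0.5
cov = std**2 * (corr * np.ones((3, 3)) + (1 - corr) * np.eye(3))

# Step 3: sample error vectors and turn them into outcome probabilities.
rng = np.random.default_rng(0)
errors = rng.multivariate_normal(np.zeros(3), cov, size=100_000)
simulated_margins = predicted_margin + errors
ev_won = (simulated_margins > 0).astype(int) @ electoral_votes
print("P(candidate A wins at least 33 of these 64 EVs):", (ev_won >= 33).mean())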

From a skim of 538’s methodology it seems that they do the following:

  1. Aggregate polls (weighting by quality and timeliness, adjusting for house effects, etc..). Earlier in the election cycle they also mix in “fundamentals” such as the state of the economy, though their weight decreases with time.
  2. Estimate the magnitude of the national error (i.e., sample a value E \in \mathbb{R}_+ according to some distribution that reflects the amount of national uncertainty).
  3. (This is where I may be understanding it wrong.) Sample a vector \vec{e} whose entries sum up to E according to a correlated distribution, where the correlations between states depend on demographics, location, and other factors. For each particular choice of E, because the sum is fixed, if a state has an E+X bias then on average the other states will need to compensate with a -X bias, and hence this can create negative correlations between states (see the toy sketch below). (It is not clear that negative correlations are unreasonable: one could imagine policies that are deeply popular with population A and deeply unpopular with population B.)
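Here is the toy sketch mentioned above (my own illustration, not 538’s actual procedure): if we take independent state errors and then force each simulated error vector to sum to a fixed national error E, the coordinates become negatively correlated.

import numpy as np

# Independent state errors, shifted so that each simulation's errors sum to exactly E.
# This conditioning induces a pairwise correlation of about -1/(n_states - 1).
rng = np.random.default_rng(0)
n_states, n_sims, E = 10, 200_000, 0.0
e = rng.normal(0, 1, size=(n_sims, n_states))
e = e - e.mean(axis=1, keepdims=True) + E / n_states
print(np.corrcoef(e[:, 0], e[:, 1])[0, 1])  # roughly -0.11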

From a skim of the Economist’s methodology it seems that they do the following:

  1. They start again with some estimate on the national popular vote, based on polls and fundamentals, and then assume it is distributed according to some probability distribution to account for errors.
  2. They then compute some prior on “partisan lean” (difference between state and national popular vote) for each state. If we knew the popular vote and partisan lean perfectly then we would know the result. Again like good Bayesians they assume that the lean is distributed according to some probability distribution.
  3. They update the prior based on state polls and other information
  4. They sample an error vector \vec{e} from a distribution with an explicit pairwise correlation matrix that has only non-negative entries (and hence you don’t get negative correlations in their model).

So, given all of the above, how much do these models differ? Perhaps surprisingly, the answer is “not by much”. To understand how they differ, I plotted for both models the following:

  1. The histogram of Biden’s popular vote margin
  2. The probability of Biden winning conditioned on a particular popular-vote margin

Much of the methodological difference, including the issue of pairwise correlations, should manifest in item 2, but eyeballing it, the models don’t seem to differ that much. It seems that conditioned on a particular margin, both models give Biden a similar probability to win. (In particular, both models think that a 3% margin is about 50/50, while a 4% margin gives Biden about an 80/20 chance.) The main difference is actually in the first part of estimating the popular vote margin: 538 is more “conservative” and has fatter tails.
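(For reference, here is roughly how a plot like item 2 can be computed from a model’s simulation output; margins and wins are placeholder names for arrays of simulated popular-vote margins and win indicators taken from the notebook’s data.)

import numpy as np

def win_prob_by_margin(margins, wins, bins):
    # Empirical P(win | popular-vote margin falls in a given bin), one value per bin.
    idx = np.digitize(margins, bins)
    return [wins[idx == i].mean() if np.any(idx == i) else np.nan
            for i in range(1, len(bins))]

bins = np.arange(-0.02, 0.10, 0.01)  # margins from -2% to +10% in 1% steps (arbitrary choice)
# probs = win_prob_by_margin(margins, wins, bins)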

If you want to check my data, see if I have a bug, or try your own analysis, you can use this colab notebook.

“Make your own needle”

Another application of such models is to help us adjust our priors as new information comes in. For example, it’s possible that Florida, North Carolina and Texas will report results early. If Biden loses one of these states, should we adjust our estimate of his win probability significantly? It turns out that the answer depends on by how much he loses.

The following graphs show the updated win probability conditioned on a particular margin in a state. We see that winning or losing Florida, North Carolina, and Texas on their own doesn’t make much difference to the probability – it’s all about the margin. In contrast, losing Pennsylvania’s 20 electoral votes will make a significant difference to Biden’s chances.

(The non-monotonicity is simply a side effect of having a finite number of simulation runs and would disappear in the limit.)

I’m with her (but 4 years too late)

In May 2016, after Donald Trump was elected as the Republican nominee for president, I wrote the following blog post. I ended up not publishing it – this has always been a technical blog (and also more of a group blog, at the time). While the damage of a Donald Trump presidency was hypothetical at the time, it is now very real, and in a second term the stakes are only higher. I once again hope American readers of this blog will do what they can to support Joe Biden.

Note: this is not an invitation for a debate on who to vote for in the comments. At this point, if you are educated and following the news (as I imagine all readers of this blog are) then if you are not already convinced that Donald Trump is a danger to this country and the world, nothing I will say will change your mind. Similarly, nothing you will say will change mine. Hence this is just a call for those readers who already support Joe Biden to make sure they vote and think of how they can help with their money or time in other ways.

I’m with Her

Boaz Barak / May 3, 2016

The Republican electorate, in their infinite wisdom, have just (essentially) finalized the election of Donald Trump as their nominee for the position of president of the United States.

While I have my political views, I don’t consider myself a very political person, and have not (as far as I remember) ever written about politics in this blog. However, extreme times call for extreme measures. It is tempting to be fatalist or cynical about this process, and believe that whether Donald Trump or Hillary Clinton is elected doesn’t make much of a difference. Some people believe that the presidency shapes the person more than the other way around, and others feel that all politicians are corrupt anyway, or that Hillary is not much better than Trump since she voted for the Iraq war and gave paid speeches to Wall Street firms. I think the last two presidencies, of George W. Bush and Barack Obama, demonstrated that the identity of the president makes a huge difference. All evidence indicates that with Trump this difference will be entirely in the negative direction.

While his chances might not be great, this is not a bet I’m comfortable taking. I plan to support Hillary Clinton as much as I can, and hope that other American readers of this blog will do the same.