Making TCS more connected / less insular

[Announcement from Jelani Nelson –Boaz]


A task force has been convened by CATCS to investigate possible
approaches to modifying aspects of the TCS community, especially our
publishing culture, to enhance connections with other areas of CS and
be as welcoming as possible to a broad range of contributions within
theory. This committee will collect and synthesize feedback from the
community via the questionnaire on then make suggestions.

Since adjusting conference formats may help with this, we would like to get
a general idea of community opinion on format choices. If you have
some opinions you would like to share, please use this form. Though
the questions below focus primarily on FOCS/STOC, we are welcome to
receiving any and all suggestions that would make the TCS community as
broad and well-connected to other areas of TCS as possible (see the
last question of the survey).

Task force members:
Ken Clarkson (IBM Research)
Sandy Irani (UC Irvine)
Bobby Kleinberg (Cornell)
Adam Klivans (UT Austin)
Ravi Kumar (Google)
Jelani Nelson (UC Berkeley)
Yuval Rabani (Hebrew University)

On Galileo Galilei and “denialism” from elections to climate to COVID

Galileo Galileo has many self-appointed intellectual heirs these days. Whether it’s a claim that the election has been stolen, that COVID-19 is less fatal than the flu, that climate change or evolution are hoaxes, or that P=NP, we keep hearing from people considering themselves as bold truth-tellers railing against conventional wisdom. We are encouraged to “teach the debate” and that if only paid attention, we will see that their Tweet, declaration, or arxiv paper contains an irrefutable proof of their assertions.

In the words of Ted Cruz, “They brand you a heretic. Today, the global warming alarmists are the equivalent of the flat-Earthers. It used to be [that] it is accepted scientific wisdom the Earth is flat, and this heretic named Galileo was branded a denier”.

Of course by Galileo’s time it was well known that the earth was spherical, and Magellan circumnavigated the earth more than 40 years before Galileo was born. But putting aside Cruz’s confusion of flat earth and geocentrism, the story of heliocentric theory is not one of an outsider railing against the scientific mainstream. Galileo himself was a chaired professor of mathematics at the University of Padua, and later philosopher and mathematician to the grand duke of Tuscany. He was very much part of the scientific establishment of his time. Moreover, though Galileo did provide important evidence for heliocentrism, he was not the only one doing so. Kepler found a heliocentric model with elliptical orbits that actually made correct predictions, and, though it took a decade or so, Kepler’s book eventually became the standard textbook for astronomy.

My point in this post is not to rehash the history of heliocentrism or Galileo but rather to call out a misconception which, to use Sean Carrol’s phrasing, amounts to valorization of puzzle-solving over wisdom.

It is tempting to think that an argument, regardless whether it comes from an expert or a random person on Twitter, can be presented in a self-contained way and judged on its merits. However, this is not how things work in any interesting setting. Even in the case of a purported P vs NP proof, there is background knowledge on computational complexity without which it would be hard to spot holes in the argument. This is doubly so for any claim involving empirical facts, whether it’s about elections, infections, climate etc. It is not possible to evaluate such claims without context, and to get this context you need to turn to the experts that have studied the topic.

I have written before in defense of expertise (see also here) but Carroll puts it very well. Another way to say it is that the operational interpretation of the common refrain

Extraordinary claims require extraordinary evidence


Treat claims conforming to conventional wisdom with charity, and claims disputing it with skepticism.

(There is a question of how to define “conventional wisdom” but interestingly there is usually agreement in practice by both sides. Most “deniers” of various sorts are proud of going against conventional wisdom, but don’t acknowledge that this means they are more likely to be wrong.)

As an example, even if someone has expertise in analytic number theory, and so presumably has plenty of so-called “puzzle-solving intelligence”, that doesn’t mean that they can evaluate a statistical claim on election fraud and their analysis should be considered evidence (apparently at this point the number theorist himself agrees). We can try to read and debunk what they wrote, or we can assume that if there was evidence for large-scale fraud, then the president of the United States and his well-funded campaign would have managed to find actual statisticians and experts on election to make the case.

There can be debate if Trump’s attempt to overthrow the election should be considered as dangerous or merely absurd, but the constant attacks on the very notions of truth, science, and expertise are causing far-reaching harm.

(H/T: Scott Aaronson, who makes a similar point around the election conspiracies.)

Announcing the WiML-T Mentorship Program (guest post)

[Guest post by Claire Vernade, Jessica Sorrell, Kamalika Chaudhuri, Lee Cohen, Mary Anne Smart, Michal Moshkovitz, and Ruth Urner.

I am very happy about this initiative – mentoring and community is so important for success in science, and as I’ve written before, there is much work to do so women will have the same access to these as men. –Boaz]

TL;DR: we are organizing a new mentorship program for women in machine learning. Please consider applying as a mentor, mentee, or both at

What is WIML-T?

Women in machine learning (or WiML for short) was established more than ten years ago, and its main goals are to 1. Increase the number of women in machine learning 2. Help women in machine learning succeed professionally 3. Increase the impact of women in machine learning in the community. Towards this goal, they create different opportunities for women to showcase their work. Chief among them is the annual Women in Machine Learning (WiML) Workshop, typically co-located with NeurIPS, which presents women’s cutting-edge research. 

Women in machine learning theory (or WiML-T for short) shares the same goals as WiML but focuses on the smaller learning theory community. The vision of WiML-T is to give visibility and legitimacy to under-represented groups in learning theory, to create stronger bonds within the community and beyond, and to provide support and advice. As part of this vision, we have decided to facilitate a mentoring program that will connect women and non-binary researchers who are newer to learning theory with more experienced mentors.  

Why mentorship?

Mentoring programs have been shown to greatly help underrepresented communities develop [1]. More resources on mentoring can be found on the website of Stanford’s Women’s Leadership Innovation Lab and on our website. Mentoring creates strong bonds of trust within the community, it helps mentees find career advice and connections, and it helps mentors grow their leadership skills while keeping in touch with the newcomers of the community.  We hope that creating a world-wide program will also help reduce inequality between less privileged areas and the most famous institutions. Indeed, it is a well-known fact that the more competitive a career path, the less diverse it is, due in part to network effects from which under-represented groups are excluded. We believe that uniting individuals from these groups (i.e. women, non-binary people, persons of color and other minorities) will help reduce these effects and contribute to finding solutions to the community’s problems. 

Why be a mentor?  

Remember the beginning of your research journey, with all the difficulties, uncertainties, and unanswered questions? This is your chance to give all the advice you wish you got, and make an impact on a future colleague. As a mentor you will be a role model and help the young generation of researchers in learning theory. You will help them develop and enhance their careers by giving them the support they need. Need more reasons to mentor? It will grow your leadership skills, self-confidence, communication skills, and you will feel happier after you help others. 

Who can be a mentor? 

Nearly everyone who has some experience in academia or industry can be a mentor. It can be interesting for an undergrad student to receive advice from senior PhD students or postdocs who have recently had to reflect about career decisions and can share knowledge about their work environment. We indeed expect the most senior researchers to apply as mentors, but we would also like to encourage PhDs and postdocs to consider mentoring (while possibly having a mentor as well!).    

Can men mentor? 

We thank everyone who wants to help the community!  

We will prioritize women mentors as they can give their unique perspective, BUT, we acknowledge that there might be a limited number of mentors. To mitigate this issue, we will be happy to pair male mentors provided the mentee agrees.

Why be a mentee?   

Having a mentor is one of the best ways to get external career advice, to get some feedback from someone with a possibly similar background. Managing to find one’s way into academia or science is not easy. It can be even harder for under-represented groups who may lack role models within their institution, or who may not connect with common advice that implicitly assumes or relies on some class privilege. Having a mentor can help you navigate professional and personal issues that men may not always have. It is also a way to get connected to other members of the community, or have second opinions on research strategies.   


  • The program launched on October 29 2020 and will run on a continuous basis.
  • We will start with pairings of mentors and mentees in December 2020. This process can take a few months.  
  • Frequency of the meetings: totally depends on the mentor and mentee. It can be either weekly meetings or once every two months or anything in between. 
  • Duration of the mentorship: totally depends on the mentor and mentee. It can be a few months, a year, or even more. 

Have questions? You can mail us at:

[1] Ginther, D. K., Currie, J., Blau, F. D., & Croson, R. (2020). Can Mentoring Help Female Assistant Professors in Economics? An Evaluation by Randomized Trial. NBER Working Paper No. 26864.     

Updated Research Masters programs database by Aviad Rubinstein and Matt Weinberg

Guest post by Aviad Rubinstein and Matt Weinberg

As explained in Boaz’s previous posts [1] [2], the PhD admission process can be challenging for students who discover their passion for Theory of Computer Science late in their undergraduate studies. Discovering TCS earlier is especially challenging for students who aren’t exposed to CS in high school, and this bias aggravates the diversity issues we have in our community. Masters programs are one way to mitigate this issue.
On a personal note, one of us (Aviad), worked half-time during undergraduate and would not have been in academia today -let alone TCS of all subjects- if it weren’t for the awesome TCS masters program at Tel-Aviv University.

But where would you go (or send your students) to do a masters in TCS?A little over a year ago, we were discussing how useful it would be to have a unified resource that can help students choose. We decided that we should create one! Months passed, and right before we ran out of excuses to procrastinate, Boaz made this post crowdsourcing information on TCS masters programs.
After some more procrastination, we eventually did send a lot of emails and tried to clean and organize the information to the best of our ability. Thanks to everyone who contributed! You can find the latest version here. (Short url:

Now you should share it with your brightest students!

This is also meant to be a live project. Please email aviad [at] if you have more new information or find any inaccuracies.


-Aviad and Matt

Election insecurity

Election security has been studied for many years by computer scientists, but it is not as often that it attracts so much mainstream attention. I would never have expected to see my former Princeton colleague Andrew Appel on a Sean Hannity segment tweeted by President Trump.

It may seem that even if it has partisan motivations, the recent GOP interest in election security is overall a positive thing. Who wouldn’t want elections to be more secure? Who wouldn’t want less fraud? However, in a very precise sense, the definition of “election security” used by the GOP these days corresponds to election insecurity.

To understand this claim, consider what it means for an election to be secure. (Let’s focus just on the correctness aspect of the count, since it is at the heart of the current issues, and not on the very interesting privacy aspect.) Computer scientists use the technical termscast as intended“, “recorded as cast“, and “tallied as recorded“. In other words: if a voter X intends to cast vote for candidate Y, then this vote should be recorded and tallied, and only such votes should be tallied.

With mail-in voting, there are several potential points of failure on the path between voter intent and the final tally:

  1. Mail can be lost or delayed too much, leading to the vote not counting.
  2. A third party could intercept the ballot and impersonate the voter.
  3. A ballot may not be formatted properly in some way, leading to it being disqualified.
  4. There can be errors or hacks in the tallying process.

Election security is about combatting points 1-4 (of which the last 3 are also applicable to in-person voting) , ideally in a way that is verifiable to the individual voters. Achieving verifiability while maintaining secrecy and not requiring the voter to trust complex technology is a challenging task, but there have been some proposed solutions (see above links).

The Hannity segment and much of the “Dominion” non story focused on point 4. This is an important point, but as Appel himself notes, paper ballots, which are mostly used in the US, serve as a way to audit counting. Re-counting is important, and is commonly done, but such recounts often change the total counts by relatively little (and the changes mostly cancel out). For example, here is the list of ballots changed and reasons from the Wisconsin 2016 count (taken from this paper)


In contrast, many of the legal cases by the Trump campaign focused on signature verification and other ballot irregularities. There are two main reasons why a signature would not match between a ballot and driver’s license or other records:

  1. The signature may have been forged by someone trying to impersonate the voter.
  2. The voter’s signature might not very consistent, or maybe they have more than one signature (for example, I sometimes sign in Hebrew and sometimes in English) .

Empirically, reason 2 is much more common than reason 1. If a ballot is tossed out because of the second reason it corresponds to a break between the voter intent and the final tally, and hence it is a case of election insecurity. For this reason, making more stringent signature checks could make elections less secure!

While President Trump might claim on Twitter that the election was stolen by a massive conspiracy involving forging of tens of thousands of ballots, this is not the actual content of the court cases (especially after some recent amendments). For example, at the heart of the PA case is the process of “curing a ballot“. This is when a ballot is disqualified due to some technical issue, and a voter has a chance to fix it. Curing a ballot ensures that the voters intent is captured, and hence makes elections more secure.

In PA, the decision of whether to notify voters in such cases was left to the counties, and apparently Democrat-controlled counties were more likely to do so Republican-controlled counties. This is a shame, and had the Trump campaign asked to extend the deadline for curing ballots, then I would think it makes perfect sense. However, this is not what their lawsuit is about. To quote their complaint: “plaintiffs seek a permanent injunction requiring the County Election Boards to invalidate ballots cast by voters who were notified and given an opportunity to cure their invalidly cast mail-in ballot.” This are ballots where there is no question of the eligibility of the voter, nor of the accuracy of their intent, yet the Trump campaign seeks to prevent them from counting. I call this election insecurity.

p.s. See Adam Klasfeld’s feed for more about the various Trump campaign cases

MoPS and Junior-Senior Meeting (DISC 2020)

(Guest post by Shir Cohen and Mouna Safir)

The 34th International Symposium on Distributed Computing (DISC 2020) was held on October 12-16, 2020, as a virtual conference. As such, the opportunity for community members to get to know each other in an informal environment was lacking. To address this need, we arranged two types of virtual networking events. We hope that these events planted the seeds for many future collaborations and that there will be an opportunity for those involved to meet in person next time.

MoPS (Meet other Postdoc and Students) Sessions

To allow junior members of the community to get to know one another, we arranged MoPS sessions, which we have not seen done before. There were more than 50 participants who took part in the sessions, with representation from a host of countries throughout the world. Sessions were held in 10-time slots before, during, and after DISC. In each session, there would typically be 5 members representing a mixture of Bachelor’s students, Masters students, PhDs, postdocs, and others. Care was taken to include at least one postdoc or Ph.D. in each session so that Bachelors and Masters students might benefit from their experience. Groups were formed with the goal of allowing participants from different countries and institutions to share their experiences and research journeys with one another. Based on the feedback for this event, it would appear that that goal was met and the participants came away with more of a sense of community.

Junior-Senior Meetings

The Junior-Senior meetings were organized to provide an opportunity for junior researchers to meet with senior researchers. Fourteen sessions were conducted, where each one brought together one senior and 3-5 juniors. In these sessions,  the juniors got a chance to interact with seniors in the field and profit from their experience. Discussions covered a variety of topics such as how to approach research, how to deal with the job market, or perhaps more personal concerns like work-life balance. We collected amazing feedback from the participants, who claimed that this was a fruitful and interesting experience.

Yet another backpropagation tutorial

I am teaching deep learing this week in Harvard’s CS 182 (Artificial Intelligence) course. As I’m preparing the back-propagation lecture, Preetum Nakkiran told me about Andrej Karpathy’s awesome micrograd package which implements automatic differentiation for scalar variables in very few lines of code.

I couldn’t resist using this to show how simple back-propagation and stochastic gradient descents are. To make sure we leave nothing “under the hood” we will not import anything from the package but rather only copy paste the few things we need. I hope that the text below is generally accessible to anyone familiar with partial derivatives. See this colab notebook for all the code in this tutorial. In particular, aside from libraries for plotting and copy pasting a few dozen lines from Karpathy this code uses absolutely no libraries (no numpy, no pytorch, etc..) and can train (slowly..) neural networks using stochastic gradient descent. (This notebook builds the code more incrementally.)

Automatic differentiation is a mechanism that allows you to write a Python functions such as

def f(x,y): return (x+y)+x**3

and enables one to automatically obtain the partial derivatives \tfrac{\partial f}{\partial \text{x}} and \tfrac{\partial f}{\partial \text{y}}. Numerically we could do this by choosing some small value \delta and computing both \tfrac{f(x+\delta,y)-f(x,y)}{\delta} and \tfrac{f(x,y+\delta)-f(x,y)}{\delta}.
However, if we generalize this approach to n variables, we get an algorithm that requires roughly n evaluations of f. Back-propagation enables computing all of the partial derivatives at only constant overhead over the cost of a single evaluation of f.

Back propagation and the chain rule

Back-propagation is a direct implication of the multi-variate chain rule. Let’s illustrate this for the case of two variables. Suppose that v,w: \mathbb{R} \rightarrow \mathbb{R} and z:\mathbb{R}^2 \rightarrow \mathbb{R} are differentiable functions, and define

f(u) = z(v(u),w(u)).

That is, we have the following situation:

where f(u) is the value z=z(v(u),w(u))

Then the chain rule states that

\tfrac{\partial f}{\partial u} = ( \tfrac{\partial v}{\partial u} \cdot \tfrac{\partial z}{\partial v} + \tfrac{\partial w}{\partial u} \cdot \tfrac{\partial z}{\partial w} )

You can take this on faith, but it also has a simple proof. To see the intuition, note that for small \delta, v(u+\delta) \approx v(u) + \delta \tfrac{\partial v}{\partial u}(u) and w(u+\delta) \approx w(u) + \delta \tfrac{\partial w}{\partial u}(u). For small \delta_1,\delta_2, z(v+\delta_1,w+\delta_2) \approx z(v,w) + \delta_1 \tfrac{\partial z}{\partial v}(v,w) + \delta_2 \tfrac{\partial z}{\partial w}(v,w). Hence, if we ignore terms with powers of delta two or higher,

f(u +\delta)= z(w(u+\delta),v(u+\delta)) \approx f(u) + \delta \tfrac{\partial v}{\partial u} \cdot \tfrac{\partial z}{\partial v} + \delta \tfrac{\partial w}{\partial u} \cdot \tfrac{\partial z}{\partial w}

Meaning that \frac{f(u +\delta) - f(u)}{\delta} \approx \tfrac{\partial v}{\partial u} \cdot \tfrac{\partial z}{\partial v} + \tfrac{\partial w}{\partial u} \cdot \tfrac{\partial z}{\partial w} which is what we needed to show.

The chain rule generalizes naturally to the case that z is a function of more variables than u. Generally, if the value f(u) is obtained by first computing some intermediate values v_1,\ldots,v_k from u and then computing z in some arbitrary way from v_1,\ldots,v_k, then \tfrac{\partial z}{\partial u} \sum_{i=1}^k \tfrac{\partial v_i}{\partial u} \cdot \tfrac{\partial z}{\partial v_i}.

As a corollary, if you already managed to compute the values \tfrac{\partial z}{\partial v_1},\ldots, \tfrac{\partial z}{\partial v_k}, and you kept track of the way that v_1,\ldots,v_k were obtained from u, then you can compute \tfrac{\partial z}{\partial u}.

This suggests a simple recursive algorithm by which you compute the derivative of the final value z with respect to an intermediate value w in the computation using recursive calls to compute the values \tfrac{\partial z}{\partial w'} for all the values w' that were directly computed from w. Back propagation is this algorithm.

Computing \tfrac{\partial z}{\partial u},\tfrac{\partial z}{\partial v}, \tfrac{\partial z}{\partial w} for the assignment u=5 using back propagation

Implementing automatic differentiation using back propagation in Python

We now describe how to do this in Python, following Karpathy’s code. The basic class we use is Value. Every member u of Value is a container that holds:

  1. The actual scalar (i.e., floating point) value that u holds. We call this data.
  2. The gradient of u with respect to some future unknown value f that will use it. We call this grad and it is initialized to zero.
  3. Pointers to all the values that were used in the computation of u. We call this _prev
  4. The method that adds (using the current value of u and other values) the contribution of u to the gradient of all its previous values v to their gradients. We call this function _backward. Specifically, at the time we call _backward we assume that u.grad already contains \tfrac{\partial z}{\partial u} where z is the final value we are interested in. For every value v that was used to compute u, we add to v.grad the quantity \tfrac{\partial z}{\partial u} \cdot \tfrac{\partial u}{\partial v}. For the latter quantity we need to keep track of how u was computed from b.
  5. If we call the method backwards (without an underscore) on a variable u then this will compute the derivative of u with respect to v for all values v that were used in the computation of u. We do this by applying _backward to u and then recursively (just like in DFS) going over the “children” (values used to compute u), calling _backward on each one and keeping track the ones we visited just like the Depth First Search (DFS) algorithm.

Let’s now describe this in code. We start off with a simple version that only supports addition and multiplication. The constructor for the class is the following:

class Value:
    """ stores a single scalar value and its gradient """

    def __init__(self, data, _children=()): = data
        self.grad = 0
        self._backward = lambda: None
        self._prev = set(_children)

which fairly directly matches the description above. This constructor creates a value not using prior ones, which is why the _backward function is empty.
However, we can also create values by adding or multiplying prior ones, by adding the following methods:

  def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value( +, (self, other))

        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward

        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value( *, (self, other))

        def _backward():
            self.grad += * out.grad
            other.grad += * out.grad
        out._backward = _backward

        return out

That is, if we create w by adding the values u and v, then the _backward function of w works by adding w.grad = \tfrac{\partial z}{\partial w} to both u.grad and v.grad.
If we w is obtain by multiplying u and v then we add w.grad \cdot = \tfrac{\partial z}{\partial w} v to u.grad and similarly add w.grad \cdot = \tfrac{\partial z}{\partial w} u to v.grad.

The backward function is obtained by setting the gradient of the current value to 1 and then running _backwards on all other values in reverse topological order:

def backward(self, visited= None): # slightly shorter code to fit in the blog 
    if visited is None:
        visited= set([self])
        self.grad = 1
    for child in self._prev:
        if not child in visited:

For example, if we run the following code

a = Value(5)
def f(x): return (x+2)**2 + x**3

then the values printed will be 0 and 89 since the derivative of (x+2)^2 + x^3 = x^3 + x^2 + 4x + 4 equals 3x^2 + 2x +42.

In the notebook you can see that we implement also the power function, and have some “convenience methods” (division etc..).

Linear regression using back propagation and stochastic gradient descent

In stochastic gradient descent we are given some data (x_1,y_1),\ldots,(x_n,y_n) and want to find an hypothesis h that minimizes the empirical loss L(h) = \tfrac{1}{n}\sum_{i=1}^n L(h(x_i),y_i) where L is a loss function mapping two labels y, y' to a real number. If we let L_i(h) be the i-th term of this sum, then, identifying h with the parameters (i.e., real numbers) that specify it, stochastic gradient descent is the following algorithm:

  1. Set h to be a random vector. Set \eta to be some small number (e.g., \eta = 0.1)
  2. For t \in {1,\ldots, T} (where T is the number of epochs):
  • For i \in {1,\ldots, n}: (in random order)
    • Let h \leftarrow h - \eta \nabla_h L_i(h)

If h is specified by the parameters h_1,\ldots,h_k \nabla_h L_i(h) is the vector ( \tfrac{\partial L_i}{\partial h_1}(h), \tfrac{\partial L_i}{\partial h_2}(h),\ldots, \tfrac{\partial L_i}{\partial h_k}(h)). This is exactly the vector we can obtain using back propagation.

For example, if we want a linear model, we can use (a,b) as our parameters and the function will be x \mapsto a\cdot + b. We can generate random points X,Y as follows:

Now we can define a linear model as follows:

class Linear:
  def __init__(self):
    self.a,self.b = Value(random.random()),Value(random.random())
  def __call__(self,x): return self.a*x+self.b

  def zero_grad(self):
    self.a.grad, self.b.grad = 0,0

And train it directly by using SGD:

η = 0.03, epochs = 20
for t in range(epochs):
  for x,y in zip(X,Y):
    loss = (model(x)-y)**2
    model.a , model.b = (model.a - η*model.a.grad  , model.b - η*model.b.grad)

Which as you can see works very well:

From linear classifiers to Neural Networks.

The above was somewhat of an “overkill” for linear models, but the beautify of automatic differentiation is that we can easily use more complex computation.

We can follow Karpathy’s demo and us the same approach to train a neural network.

We will use a neural network that takes two inputs and has two hidden layers of width 16. A neuron that takes input x_1,\ldots,x_k will apply the ReLU function (max{0,x}) to \sum w_i x_i where w_1,\ldots,w_k are its weight parameters. (It’s easy to add support for relu for our Value class. Also we won’t have a bias term in this example.)

The code for this Neural Network is as follows: (when Value() is called without a parameter the value is random number in [-1,1])

def Neuron(weights,inputs, relu =True):
  # Evaluate neuron with given weights on given inputs
  v =  sum(weights[i]*x for i,x in enumerate(inputs))
  return v.relu() if relu else v

class Net:
  # Depth 3 fully connected neural net with one two inputs and output
  def __init__(self,  N=16):
    self.layer_1 = [[Value(),Value()] for i in range(N)]
    self.layer_2 = [ [Value() for j in range(N)] for i in range(N)]
    self.output =  [ Value() for i in range(N)]
    self.parameters = [v for L in [self.layer_1,self.layer_2,[self.output]] for w in L for v in w]

  def __call__(self,x):
    layer_1_vals = [Neuron(w,x) for w in self.layer_1]
    layer_2_vals = [Neuron(w,layer_1_vals) for w in self.layer_2]
    return Neuron(self.output,layer_2_vals,relu=False) 
    # the last output does not have the ReLU on top

  def zero_grad(self):
    for p in self.parameters:

We can train it in the same way as above.
We will follow Karpathy and train it to classify the following points:

The training code is very similar, with the following differences:

  • Instead of the square loss, we use the function L(y,y')= \max{ 1- y\cdot y', 0 } which is 0 if y \cdot y' \geq 1. This makes sense since our data labels will be \pm 1 and we say we classify correctly if we get the same sign. We get zero loss if we classify correctly all samples with a margin of at least 1.
  • Instead of stochastic gradient descent we will do standard gradient descent, using all the datapoints before taking a gradient step. The optimal for neural networks is actually often something in the middle – batch gradient descent where we take a batch of samples and perform the gradient over them.

The resulting code is the following:

for t in range(epochs):
  loss = sum([(1+ -y*model(x)).relu() for (x,y) in zip(X,Y)])/len(X)
  for p in model.parameters: -= η*p.grad

If we use this, we get a decent approximation for this training set (see image below). As Karpathy shows, by adjusting the learning rate and using regularization, one can in fact get 100% accuracy.

Update 11/30: Thanks to Gollamudi Tarakaram for pointing out a typo in a previous version.

Digging into election models

With election on my mind, and constantly looking at polls and predictions, I thought I would look a little more into how election models are made. (Disclaimer: I am not an expert statistician / pollster and this is based on me trying to read their methodological description as well as looking into results of simulations in Python. However, there is a colab notebook so you can try this on your own!)

If polls were 100% accurate, then we would not need election models – we will know that the person polling at more than 50% in a given state will win, and we can just sum up the electoral votes. However, polls have various sources of errors:

  1. Statistical sample error – this is simply the deviation between the fraction of people that would say “I will vote for X” at time T in the population, and the empirical fraction reported by the poll based on their sample. As battleground states get polled frequently with large samples, this error is likely to be negligible.
  2. Sampling bias – this is the bias incurred by the fact that we cannot actually sample a random subset of the population and get them to answer our questions – the probability that people will pick up their phone may be correlated with their vote. Pollsters hope that these correlations all disappear once you condition on certain demographic variables (race, education, etc..) and so try to ensure the sample is balanced according to these metrics. I believe this was part of the reason that polls were off in 2016, since they didn’t explicitly adjust for levels of education (which were not strongly correlated with party before) and ended up under-representing white voters without college degrees.
  3. Lying responses or “shy” voters – Some people suggest that voters lie to pollsters because their choice is considered “socially undesirable”. There is not much support that this is a statistically significant effect. In particular one study showed there was no statistically significant difference between responders’ responses in online and live calling. Also in 2016 polls equally under-estimated the votes for Trump and Republican senators (which presumably didn’t have the same “social stigma” to them).
  4. Turnout estimates – Estimating the probability that a person supporting candidate X will actually show up to vote (or mail it in) is a bit of a dark art, and account for the gap in polls representing registered voters (which make no such estimates) and polls representing likely voters (which do). Since traditionally the Republican electorate is older and more well off, they tend to vote more reliably and hence likely voter estimates are typically better for republicans. The effect seems not to be very strong this year. Turnout might be particularly hard to predict this year, though it seems likely to be historically high.
  5. Voters changing their mind – The poll is done at a given point in time and does not necessarily reflect voters views in election day. For example in 2016 it seems that many undecided voters broke for Trump. In this cycle the effect might be less pronounced since there are few undecided voters and “election day” is smoothed over a 2-4 week period due to early and mail-in voting.

To a first approximation, a poll-based election model does the following:

1. Aggregates polls into national and state-wise predictions

2. Computes a probability distribution over the correlated error vectors (i.e. the vector \vec{e} with coordinate for each jurisdiction containing the deviation from the prediction)

3. Samples from the probability distribution over vectors to obtain probabilities over outcomes.

From a skim of 538’s methodology it seems that they do the following:

  1. Aggregate polls (weighing by quality, timeliness, adjusting for house effects, etc..). Earlier in the election cycle they also mix in “fundamentals” such as state of the economy etc.. though their weight decreases with time.
  2. Estimate magnitude of national error (i.e., sample a value E \in \mathbb{R}_+ according to some distribution that reflects the amount of national uncertainty.
  3. (This is where I may be understanding wrong.) Sample a vector \vec{e} whose entries sum up to E according to a correlated distribution, where the correlations between states depends on demographic, location, and other factors. For each particular choice of $E$, because the sum is fixed, if a state has $E+X$ bias then on average the other states will need to compensate for this $-X$ bias, and hence this can create negative correlations between states. (It is not clear that negative correlations are unreasonable – one could imagine policies that are deeply popular with population A and deeply unpopular with population B)

From a skim of the Economist’s methodology it seems that they do the following:

  1. They start again with some estimate on the national popular vote, based on polls and fundamentals, and then assume it is distributed according to some probability distribution to account for errors.
  2. They then compute some prior on “partisan lean” (difference between state and national popular vote) for each state. If we knew the popular vote and partisan lean perfectly then we would know the result. Again like good Bayesians they assume that the lean is distributed according to some probability distribution.
  3. They update the prior based on state polls and other information
  4. They sample from an error distribution \vec{e} according to some explicit pairwise correlation matrix that has only non-negative entries (and hence you don’t get negative correlations in their model).

So, given all of the above, how much do these models differ? Perhaps surprisingy, the answer is “not by much”. To understand how they differ, I plotted for both models the following:

  1. The histogram of Biden’s popular vote margin
  2. The probability of Biden to win conditioned on a particular margin

Much of the methodological difference, including the issue of pairwise correlations, should manifest in 2, but eyeballing it, they don’t seem to differ that much. It seems that conditioned on a particular margin, both models give Biden similar probability to win. (In particular both models think that 3% margin is about 50/50, while 4% margin gives Biden about 80/20 chance). The main difference is actually in the first part of estimating the popular vote margin – 538 is more “conservative” and has fatter tails.

If you want to check my data, see if I have a bug, or try your own analysis, you can use this colab notebook.

“Make your own needle”

Another applications for such models is to help us adjust the priors as new information comes in. For example, it’s possible that Florida, North Carolina and Texas will report results early. If Biden loses one of these states, should we adjust our estimate of win probability significantly? It turns out that the answer depends on by how much he loses.

The following graphs show the updated win probability conditioned on a particular margin in a state. We see that winning or losing Florida, North Carolina, and Texas on their own doesn’t make much difference to the probability – it’s all about the margin. In contrast, losing Pennsylvania’s 20 electoral votes will make a significant difference to Biden’s chances.

(The non monotonicity is simply a side effect of having a finite number of simulation runs and would disappear in the limit.)

I’m with her (but 4 years too late)

In May 2016, after Donald Trump was elected as the republican nominee for president, I wrote the following blog post. I ended up not publishing it – this has always been a technical blog (and also more of a group blog, at the time). While the damage of a Donald Trump presidency was hypothetical at the time, it is now very real and in a second term the stakes are only higher. I once again hope American readers of this blog would do what they can to support Joe Biden.

Note: this is not an invitation for a debate on who to vote for in the comments. At this point, if you are educated and following the news (as I imagine all readers of this blog are) then if you are not already convinced that Donald Trump is a danger to this country and the world, nothing I will say will change your mind. Similarly, nothing you will say will change mine. Hence this is just a call for those readers who already support Joe Biden to make sure they vote and think of how they can help with their money or time in other ways.

I’m with Her

Boaz Barak / May 3, 2016

The republican electorate, in their infinite wisdom, have just (essentially) finalized the election of Donald Trump as their nominee for the position of the president of the United States.

While I have my political views, I don’t consider myself a very political person, and have not (as far as I remember) ever written about politics in this blog. However, extreme times call for extreme measures. It is tempting to be fatalist or cynical about this process, and believe that whether Donald Trump or Hillary Clinton is elected doesn’t make much of a difference. Some people believe that the presidency shapes the person more than the other way around, and others feel that all politicians are anyway corrupt or that Hillary is not much better than trump since she voted for the Iraq war and gave paid speeches for Wall Street firms. I think the last two presidencies of George W. Bush and Barack Obama demonstrated that the identity of the president makes a huge difference. All evidence points out that with Trump this difference will be entirely in the negative direction.

While his chances might not be great, this is not a bet I’m comfortable taking. I plan to support Hillary Clinton as much as I can, and hope that other American readers of this blog will do the same

Full-replica-symmetry-breaking based algorithms for dummies

One of the fascinating lines of research in recent years has been a convergence between the statistical physics and theoretical computer science points of view on optimization problems.
This blog post is mainly a note to myself (i.e., I’m the “dummy” 😃), trying to work out some basic facts in some of this line of work. it was inspired by this excellent talk of Eliran Subag, itself part of a great Simons institute workshop which I am still planning to watch the talks of. I am posting this in case it’s useful for others, but this is quite rough, missing many references, and I imagine I have both math mistakes as well as inaccuracies in how I refer to the literature – would be grateful for comments!

Screen shot from Eliran Subag’s talk demonstrating the difference between “easy” and “hard” instances.

In computer science, optimization is the task of finding an assignment x_1,\ldots,x_n that minimizes some function J(x_1,\ldots,x_n). In statistical physics we think of x_1,\ldots,x_n as the states of n particles, and J(x_1,\ldots,x_n) as the energy of this state. Finding the minimum assignment corresponds to finding the ground state, and another computational problem is sampling from the Gibbs distribution where the probability of x is proportional to \exp(-\beta J(x)) for some \beta>0.

Two prototypical examples of such problems are:

  1. Random 3SAT – in this case x\in { \pm 1 }^n and J(x) is the number of clauses violated by the assignment x for a random formula.
  2. Sherrington-Kirpatrick model – in this case x \in { \pm 1 }^n and J(x)= \sum_{i,j} J_{i,j}x_ix_j where J_{i,j} are independent normal variables with variance 1/n for i\neq j and variance 2/n for i=j. (Another way to say it is that J is the matrix A+A^\top where A‘s entries are chosen i.i.d from N(0,\tfrac{1}{2n})).)

The physics and CS intuition is that these two problems have very different computational properties. For random 3SAT (of the appropriate density), it is believed that the set of solutions is “shattered” in the sense that it is partitioned to exponentially many clusters, separated from one another by large distance. It is conjectured that in this setting the problem will be computationally hard. Similarly from the statistical physics point of view, it is conjectured that if we were to start with the uniform distribution (i.e., a “hot” system) and “lower the temperature” (increase \beta) at a rate that is not exponentially slow then we will get “stuck” at a “metastable” state. This is analogous how when we heat up sand and then cool it quickly then rather than returning to its original state, the sand will get stuck at the metastable state of glass.

In contrast for the Sherrington-Kirpatrick (SK) model, the geometry is more subtle, but interestingly this enables better algorithms. The SK model is extermely widely studied, with hundreds of papers, and was the inspiration for the simulated annealing algorithm. If memory serves me right, Sherrington and Kirpatrick made the wrong conjecture on the energy of the ground state, and then Parisi came up in 1979 with a wonderful and hugely influential way to compute this value. Parisi’s calculation was heuristic, but about 30 years later, first Talagrand and later Panchenko proved rigorously many of Parisi’s conjectures. (See this survey of Panchenko.)

Recently Montanari gave a polynomial time algorithm to find a state with energy that is arbitrarily close to the ground state’s. The algorithm relies on Parisi’s framework and in particular on the fact that the solution space has a property known as “full replica symmetry breaking (RSB)” / “ultrametricity”. Parisi’s derivations (and hence also Montanari’s analysis) are highly elaborate and I admit that I have not yet been able to fully follow it. The nice thing is that (as we’ll see) it is possible to describe at least some of the algorithmic results without going into this theory. In the end of the post I will discuss a bit some of the relation to this theory, which is the underlying inspiration for Subag’s results described here.

Note: These papers and this blog post deal with the search problem of finding a solution that minimizes the objective. The refutation problem of certifying that this minimum is at least -C for some C has often been studied. The computational complexity of these problems need not be identical. In particular there are cases where the search problem has an efficient algorithm achieving value -C^* but the best refutation algorithm can only certify that the value is at most -C' for C' \gg C^*.

Analysis of a simpler setting

Luckily, there is a similar computational problem, for which the analysis of analogous algorithm, which was discovered by Subag and was the partial inspiration for Montanari’s work, is much simpler. Specifically, we consider the case where x is an element of the unit sphere, and J(x) is a degree d polynomial with random Gaussian coefficients. Specifically, for every vector \gamma = (\gamma_2,\ldots,\gamma_d), we let J(x) = \gamma_2 J^2 \cdot x^{\otimes 2} + \gamma_3 J^3 \cdot x^{\otimes 3} + \cdots + \gamma_d J^d \cdot x^{\otimes p} where for every p \in {2,\ldots, d }, J_p is a random tensor of order p whose n^p coefficients are all chosen i.i.d in N(0,1/n). (We assume that polynomial does not have constant or linear components.)

Depending on \gamma, the computational and geometrical properties of this problem can vary considerably. The case that \gamma = (1,0,\ldots,0) (i.e., only J^2 has a non-zero coeffiecent) corresponds to finding the unit vector x minimizing x^\top J x for a random matrix x, which of course corresponds to the efficiently solveable minimum eigenvector problem. In contrast, the case \gamma = (0,1,0,\ldots,0) corresponds to finding a rank one component of a random three-tensor, which is believed to be computationally difficult. The Parisi calculations give a precise condition P on the vector \gamma such that if P(\gamma) holds then the solution space has the “full RSB” property (and hence the problem is computationally easy) and if P(\gamma) does not hold then the solution space does not have this property (and potentially the problem is hard).

These calculations also give rise to the following theorem:

Theorem (Chen and Sen, Proposition 2): If P(\gamma) holds then in the limit n \rightarrow \infty, \min_{x : |x|=1} J(x) = -\int_0^1 \sqrt{\nu''(q)} dq, where \nu(q) = \sum_{p \geq 2} \gamma_p^2 q^p. (That is, \nu'' is the second derivative of this univariate polynomial)

We will not discuss the proof of this theorem, but rather how, taking it as a black box, it leads to an algorithm for minimizing J(x) that achieves a near-optimal value (assuming P(\gamma) holds).

It is a nice exercise to show that for every two vectors x,x'\in\mathbb{R}^n, \mathbb{E}_J [J(x)J(x')] = \nu(\langle x,x' \rangle)/n. Hence for any unit vector x, J(x) is a random variable with mean zero and standard deviation \sqrt{\nu(1)/n}. Since (after some coarsening) the number of unit vectors of dimension n can be thought of as c^n for some c>1, and we expect the probability of deviating t standard deviations to be \exp(-c' t^2), the minimum value of J(x) should be -c'' \sqrt{\nu(1)} for some constant c''. However determining this constant is non trivial and is the result of the Parisi theory.

To get a better sense for the quantity -\int_0^1 \sqrt{\nu''(q)} dq, let’s consider two simple cases:

  • If \gamma = (1,0,\ldots) (i.e., J(x) = \sum_{i,j}M_{i,j}x_ix_j for random matrix M) then \nu(q)= q^2 and \nu''(q) = 2, meaning that -\int_0^1 \sqrt{\nu''(q)} dq = -\sqrt{2}. This turns out to be the actual minimum value. Indeed in this case \min_{|x|^2=1} J(x) = \tfrac{1}{2} \lambda_{min}(M + M^\top). But the matrix A=\tfrac{1}{2}(M+M^\top)‘s non diagonal entries are distributed like N(0,\tfrac{1}{2n}) and the diagonal entries like N(0,\tfrac{1}{n}) which means that A is distributed as \tfrac{1}{\sqrt{2}} times a random matrix B from the Gaussian Orthogonal Ensemble (GOE) where B_{i,j} \sim N(0,1/n) for off diagonal entries and B_{i,i} \sim N(0,2/n) for diagonal entries. The minimum eigenvalue of such matrices is known to be -2\pm o(1) with high probability.
  • If \gamma = (0,\ldots,1) (i.e. J(x) = T \cdot x^{\otimes d} for a random d-tensor T) then P(\gamma) does not hold. Indeed, in this case the value -\int_0^1 \sqrt{\nu''(q)} dq = - \int_0^1 \sqrt{d (d-1)q^{d-2}} dq = - \sqrt{d(d-1)}\tfrac{1}{d/2-1} \approx -2 for large d. However I believe (though didn’t find the reference) that the actual minimum tends to -\infty with d.

While the particular form of the property P(\gamma) is not important for this post, there are several equivalent ways to state it, see Proposition 1 in Subag’s paper. One of them is that the function \nu''(t)^{-1/2} (note the negative exponent) is concave on the interval (0,1].
It can be shown that this condition cannot be satisfied if \gamma_2 = 0, and that for every setting of \gamma_3,\ldots,\gamma_d, if \gamma_2 is large enough then it will be satisfied.

The central result of Subag’s paper is the following:

Theorem: For every \epsilon>0, there is a polynomial-time algorithm A such that on input random J chosen according to the distribution above, with high probability A(J)=x such that J(x) \leq -\int_0^1 \sqrt{\nu''(q)} dq + \epsilon.

The algorithm itself, and the idea behind the analysis are quite simple. In some sense it’s the second algorithm you would think of (or at least the second algorithm according to some ordering).

The first algorithm one would think of is gradient descent. We start at some initial point x^0, and repeat the transformation x^{t+1} \leftarrow x^t - \eta \nabla J(x^t) for some small \eta (and normalizing the norm). Unfortunately, we will generally run into saddle points when we do so, with the gradient being zero. In fact, for simplicity, below we will always make the pessimistic assumption that we are constantly on a saddle point. (This assumption turns out to be true in the actual algorithm, and if it was not then we can always use gradient descent until we hit a saddle.)

The second algorithm one could think of would be to use the Hessian instead of the gradient. That is, repeat the transformation x^{t+1} \leftarrow x^t - \eta u where u is the minimal eigenvector of \nabla^2 J(x^t) (i.e., the Hessian matrix H such that H_{i,j} = \tfrac{\partial J(x^t)}{\partial x_i \partial x_j} ). By the Taylor approximation J(x - \eta u) \approx J(x) - \eta \nabla J(x) \cdot u + \tfrac{1}{2} \eta^2 u^\top \nabla^2 J(x) u (and since we assume the gradient is zero) the change in the objective will be roughly \tfrac{1}{2} \eta^2 \lambda_{min}(\nabla^2 J(x^t)). (Because we assume the gradient vanishes, it will not make a difference whether we update with -\eta u or +\eta u, but we use -\eta for consistency with gradient descent.)

The above approach is promising, but we still need some control over the norm. The way that Subag handles this is that he starts with x^0 = 0, and at each step takes a step in an orthogonal direction, and so within 1/\eta^2 steps he will get to a unit norm vector. That is, the algorithm is as follows:


Input: J:\mathbb{R}^n \rightarrow \mathbb{R}.

Goal: Find unit x approximately minimizing J(x)

  1. Initialize x^0 = 0^n
  2. For t=0,\ldots,1/\eta^2-1: i. Let u^t be a unit vector u such that u is orthogonal to u^0,\ldots,u^{t-1} and u^\top \nabla^2 J(x^t) u \approx \lambda_{min}(\nabla^2 J(x^t)). (Since the bottom eigenspace of \nabla^2 J(x^t) has large dimention, we can find a vector u that is not only nearly minimal eigenvector but also orthogonal to all prior ones. Also, as mentioned, we assume that \nabla \cdot u \approx 0.) ii. Set x^{t+1} \leftarrow x^t - \eta u^t.
  3. Output x^{1/\eta^2} = -\eta\sum_{t=0}^{1/\eta^2-1} u^t

(The fact that the number of steps is 1/\eta^2 and not 1/\eta is absolutely crucial for the algorithm’s success; without it we would not have been able to use the second order contribution that arise from the Hessian.)

If we define \lambda_t to be the minimum eigenvalue at time t, we get that the final objective value achieved by the algorithm satisfies

VAL = \sum_{t=1}^{1/\eta^2} \tfrac{1}{2} \eta^2 \lambda_t

Now due to rotation invariance, the distribution of \nabla^2 J at the point x^t is the same as \nabla^2J at the point (|x^t|,0,\ldots,0). Using concentration of measure arguments, it can be shown that the minimum eigenvalue of \nabla^2 J(x^t) will be close with high probability to the minimum eigenvalue of \nabla^2 J(|x^t|,0,\ldots,0).
Since \|x^t\|^2 = \eta^2 t we can write

VAL = \sum_{t=1}^{1/\eta^2} \tfrac{1}{2} \eta^2 \lambda(\sqrt{\eta^2 t})

where \lambda(q) is the minimum eigenvalue of \nabla^2 J at the point (q,0,\ldots,0).
Taking \eta to zero, we get that (approximately) the value of the solution output by the algorithm will satisfy

VAL = \int_0^1 \tfrac{1}{2} \lambda(\sqrt{q}) dq

and hence the result will be completed by showing that

\lambda(\sqrt{q}) = 2 \sqrt{\nu''(q)}

To do this, let’s recall the definition of J:

J(x) = \gamma_2 J^2 \cdot x^{\otimes 2} + \gamma_3 J^3 \cdot x^{\otimes 3} + \cdots + \gamma_d J^d \cdot x^{\otimes p} where for every p \in {2,\ldots, d }, J_p is a random tensor of order p whose n^p coefficients are all chosen i.i.d in N(0,1/n).

For simplicity, let’s assume that d=3 and hence

J(x) = \gamma_2 J^2 \cdot x^{\otimes 2} + \gamma_3 J^3 \cdot x^{\otimes 3}\;.

(The calculations in the general case are similar)

The i,j entry of \nabla^2 J(\sqrt{q},0,\ldots,0) equals \tfrac{\partial J(q,0,\ldots,0)}{\partial x_i \partial x_j}. The contribution of the J^2 component to this term only arises from the terms corresponding to either x_ix_j or x_jx_i and hence for i\neq j it equals \gamma_2 (J_{i,j}+J_{j,i}) which is distributed like \gamma_2 N(0,\tfrac{2}{n}) = N(0, \tfrac{2\gamma_2^2}{n}). For i=j, since (x_i^2)'' = 2, the contributioon equals 2 \gamma_2 J_{i,i} which is distributed like N(0,\tfrac{4\gamma_2^2}{n}).

The contribution from the J^3 component comes (in the case i\neq j) from all the 6=3! terms involving 1,i,j that is, \gamma_3 (J_{1,i,j}\sqrt{q} + J_{1,j,i}\sqrt{q} + J_{i,1,j}\sqrt{q}+J_{j,1,i}\sqrt{q}+J_{i,j,1}\sqrt{q}+J_{j,i,1})\sqrt{q} which is distributed like \gamma_3 N(0, \tfrac{6}{n}) = N(0, \tfrac{6\gamma_3^2 q}{n}). For i=j the contribution will be from the 3 terms involving 1,i, with each yielding a contribution of 2\sqrt{q}, and hence the result will be distributed like N(0,\tfrac{12 \gamma_3^2 q}{n}).

(More generally, for larger p, the number of terms for distinct i,j is p(p-1), each contributing a Gaussian of standard deviation (\sqrt{q})^{p-2}\gamma_p/\sqrt{n}, while for i=j we have p(p-1)/2 terms, each contributing a Gaussian of standard deviation 2(\sqrt{q})^{p-2}\gamma_p/\sqrt{n}.)

Since the sum of Gaussians is a Gaussian we get that \nabla^2J(q,0,\ldots,0){i,j} is distributed like a Gaussian with variance \sum \gamma_p^2 p(p-1)q^{p-2}/n = \nu''(q)/n for i\neq j, and twice that for i=j. This means that the minimum eigenvalue of \nabla^2J(q,0,\ldots,0) equals \sqrt{\nu''(q)} times the minimum eigenvalue of a random matrix M from the Gaussian Orthogonal Ensemble (i.e. M is sampled via M{i,j} \sim N(0,1/n), M_{i,i} \sim N(0,2/n)). As mentioned above, it is known that this minimum eigenvalue is -2+o(1), and in fact by the semi-circle law, for every \epsilon>0, the number of eigenvalues of value \leq -2+\epsilon is \Omega(n), and so we can also pick one that is orthogonal to the previous directions. QED

Full replica symmetry breaking and ultra-metricity

The point of this blog post is that at least in the “mixed p spin” case considered by Subag, we can understand what the algorithm does and the value that it achieves without needing to go into the theory of the geometry of the solution space, but let me briefly discuss some of this theory. (As I mentioned, I am still reading through this, and so this part should be read with big grains of salt.)

The key object studied in this line of work is the probability distribution \xi of the dot product \langle x,x' \rangle for x and x' sampled independently from the Gibbs distribution induced by J. (This probability distribution will depend on the number of dimensions n, but we consider the case that n \rightarrow \infty.)

Intuitively, there are several ways this probability distribution can behave, depending on how the solution space is “clustered”:

  • If all solutions are inside a single “cluster”, in the sense that they are all of the form x = x_* + e where x_* is the “center” of the cluster and e is some random vector, then \langle x,x' \rangle will be concentrated on the point \| x_*\|^2.
  • If the solutions are inside a finite number k of clusters, with centers x_1,\ldots,x_k, then the support of the distribution will be on the k^2 points \{ \langle x_i,x_j \rangle \}_{i,j \in [k]}.
  • Suppose that the solutions are inside a hierarchy of clusters. That is, suppose we have some rooted tree \mathcal{T} (e.g., think of a depth d full binary tree), and we associate a vector x_v with every vertex v of \mathcal{T}, with the property that x_v is orthogonal to all vectors associated with v‘s ancestors on the tree. Now imagine that the distribution is obtained by taking a random leaf u of \mathcal{T} and outputting \sum x_v for all vertices v on the path from the root to u. In such a case the dot product of x_u and x_{u'} will be \sum_v \|x_v\|^2 taken over all the common ancestors of v. As the dimension and depth of the tree goes to infinity, the distribution over dot product can have continuous support, and it is this setting (specifically when the support is an interval [0,q)) that is known as full replica symmetry breaking. Because the dot product is determined by common ancestor, for every three vectors x_u,x_v,x_w in the support of the distribution \langle x_v,x_w \rangle \geq \min \{ \langle x_u,x_v \rangle, \langle x_u,x_w \rangle \} or \| x_v - x_w \| \leq \max \{ \|x_u - x_v \|, \|x_u -x_w \|\}. It is this condition that known as ultra-metricity.

In Subag’s algorithm, as mentioned above, at any given step we could make an update of either x^{t+1} \leftarrow x^t - \eta u or x^{t+1} \leftarrow x^t + \eta u. If we think of all the possible choices for the signs in the d=1/\eta^2 of the algorithms, we see that the algorithm does not only produce a single vector but actually 2^d such vectors that are arranged in an ultrametric tree just as above. Indeed, this ultrametric structure was the inspiration for the algorithm and is the reason why the algorithm produces the correct result precisely in the full replica symmetry breaking regime.

Acknowledgements: Thanks to Tselil Schramm for helpful comments.