## Statistical physics dictionary

I’ve always been curious about the statistical physics approach to problems from computer science. The physics-inspired algorithm survey propagation is the current champion for random 3SAT instances, statistical-physics phase transitions have been suggested as explaining computational difficulty, and statistical physics has even been invoked to explain why deep learning algorithms seem to often converge to useful local minima.

Unfortunately, I have always found the terminology of statistical physics, “spin glasses”, “quenched averages”, “annealing”, “replica symmetry breaking”, “metastable states” etc.. to be rather daunting. I’ve recently decided to start reading up on this, trying to figure out what these terms mean. Some sources I found useful are the fun survey by Cris Moore, the longer one by Zdeborová and Krzakala, the somewhat misnamed survey by Castellani and Cavagna, and the wonderful book by Mezard and Montanari. (So far, I’m still early in the process of looking at these sources. Would appreciate pointers to any others!)

This post is mainly for my own benefit, recording some of my thoughts.

It probably contains many inaccuracies and I’d be grateful if people point them out.

First of all, it seems that statistical physicists are very interested in the following problem: take a random instance of a constraint satisfaction problem, and some number , and sample from the uniform distribution over assignments to the variables that satisfy at least fraction of the constraints. More accurately, they are interested in the distribution where the probability of an assignment is proportional to where is some parameter and is the number of constraints of that violates. However, for an appropriate relation of and , this is more or less the same thing, and so you can pick which one of them is easier to think about or to calculate based on the context.

Physicists call the distribution the Boltzman distribution, they would call the *energy* of . Often they would write as where , known as the Hamiltonian, is the general shape of the constraint satisfaction problem, and are the parameters. The *free energy* is the typical value of , which roughly corresponds to where is the total number of constraints in and (as before) is the fraction of satisfied constraints.

*Why would statistical physicists care about constraint satisfaction problems?*

My understanding is that they typically study a system with a large number of particles, and these particles have various forces acting on them. (These forces are *local* and so each one involves a constant number of particles – just like in constraint satisfaction problems.) These forces push the particles to satisfy the constraints, while *temperature* (which you can think of as ) pushes the particles to have as much entropy as possible. (It is entropy in a Bayesian sense, as far as I understand; it is not that the particles’ state is sampled from this distribution at any given point in time, it is that our (or Maxwell’s demon’s) knowledge about the state of the particles can be captured by this distribution.) When the system is hot, is close to zero, and the distribution has maximal entropy, when the system is cold, gets closer to infinity, and the fraction of satisfied constraints grows closer to the maximum possible number. Sometimes there is a *phase transition* where a change in can lead to an change in . (These and factors are with respect to the number of particles , which we think of as tending to infinity.) That is, a small change in the temperature leads to a qualitatively different distribution. This happens for example when a system changes from solid to liquid.

Ideally (or maybe naively), we would think of the state of the system as being just a function of the temperature and not depend on the history of how we got to this temperature, but that’s not always the case. Just like an algorithm can get stuck in a local minima, so can the system get stuck in a so called “metastable state”. In particular, even if we cool the system to absolute zero, its state might not satisfy the global maximum fraction of constraints.

A canonical example of a system stuck in a metastable state is *glass*. For example, a windowpane at room temperature is in a very different state than a bag of sand in the same temperature, due to the windowpane’s history of first being heated to very high temperatures and then cooled down. In statistical physics lingo the *static* prediction of the state of the system is the calculation based on its temperature, while a *dynamic* analysis takes into account the history and the possibility of getting stuck in metastable states for a long time.

*Why do statistical physicists care about random constraint satisfaction problems?*

So far we can see why statistical physicists would want to understand a constraint satisfaction problem that corresponds to some physical system, but why would they think that is random?

As far as I can tell, there are two reasons for this. First, some physical systems are “disordered” in the sense that when they are produced, the coefficients of the various forces on the particles are random. A canonical example is spin glass. Second, they hope that the insights from studying random CSP’s could potentially be relevant for understanding the actual CSP’s that are not random (e.g., “structured” as opposed to “disordered”) that arise in practice.

In any case, a large body of literature is about computing various quantities of the state of a random CSP at a given temperature, or after following certain dynamics.

For such a quantity , we would want to know the typical value of that is obtained with high probability over . Physicists call this value the “quenched” average (which is obtained by fixing to a typical value). Often it is easier to compute the *expectation* of over (which is called “annealed average” in physicist-lingo), which of course gives the right answer if the quantity is well concentrated – a property physicists call “self averaging”.

Often if a quantity is not concentrated, the quantity would be concentrated, and so we would want to compute the expectation . The “replica trick” is to use the fact that is the derivative at of the function . So, . (Physicists would usually use instead of , but I already used for the number of variables, which they seem to often call .) Now we can hope to exchange the limit and expectation and write this as . For an *integer* , it is often the case that we can compute the quantity . The reason is that typically is itself the expectation of some random variable that is parameterized by , and hence would just correspond to taking the expectation of the product of independent copies (or “replicas” in physicist-speak) from these random variables. So, now the hope is that we get some nice analytical expression for , and then simply plug in and hope this gives us the right answer (which surprisingly enough always seems to work in the contexts they care about).

*What are the qualitative phase transitions that physicists are interested in?*

Let us fix some “typical” and consider the distribution which is uniform over all assignments satisfying an fraction of ‘s constraints (or, essentially equivalently, where the probability of is proportional to ). We will be particularly interested in the *second moment matrix* of , which is the matrix such that where is sampled from . (Let’s think of as a vector in .)

We can often compute the quantity for every .

By opening up the parenthesis this corresponds to computing for independent from (i.e., “replicas”). (Physicists often think of the related “overlap matrix” which is the matrix where .) This means that we can completely determine the typical *spectrum* of as a function of the parameters or .

The typical behavior when is small (i.e., the system is “hot”) is that the solutions form one large “cluster” around a particular vector . This is known as the “replica symmetric” phase, since all copies (“replicas”) chosen independently from would have roughly the same behavior, and the matrix above would have on the diagonal and the same value off diagonal.

This means that is close to the rank one matrix . This is for example the case in the “planted” or “hidden” model (related to “Nishimory condition” in physics) if we plant a solution at a sufficiently “easy” regime in terms of the relation between the number of satisfied constraints and the total number of variables. In such a case, it is easy to “read off” the planted solution from the second moment matrix, and it turns out that many algorithms succeed in recovering the solution.

As we cool down the system (or correspondingly, increase ) the distribution breaks into exponentially many clusters. In this case, even if we planted a solution , the second moment matrix will look like the identity matrix (since the centers of the clusters could very well be in isotropic positions). Since there are exponentially many clusters that dominate the distribution, if we sample from this distribution we are likely to fall in any one of them. For this reason we will not see the cluster behavior in static analysis but we will get stuck in the cluster we started with in any dynamic process. This phase is called “1-dRSB” for “one step dynamic replica symmetry breaking” in the physics literature. (The “one step” corresponds to the distribution being simply clustered, as opposed to some hierarchy of sub-clusters within clusters. The “one step” behavior is the expected one for CSP’s as far as I can tell.)

As far as I can tell, the statistical physics predictions would be that this would be the parameter regime where sampling from the distribution would be hard, as well as the regime where recovering an appropriately planted assignment would be hard. However, perhaps in some cases (e.g., in those arising from deep learning?) we are happy with anyone of those clusters.

If we cool the system more then one or few clusters will dominate the distribution (this is known as *condensation*), in which case we get to the “static replica symmetry breaking” phase where the second moment matrix changes and has rank corresponding to roughly the number of clusters. If we then continue cooling the system more then the static distribution will be the global optimum (the planted assignment if we had one), though dynamically we’ll probably be still stuck in one of those exponentially many clusters.

The figure below from Zdeborová and Krzakala illustrates these transitions for the 5 coloring constraint satisfaction problem (the red dot is the planted solution). This figure is as a function of the degree of the graph $c$, whereas increasing the degree can be thought of as making the system “colder” (as we are adding more constraints the system needs to satisfy).

*What about algorithms?*

I hope to read more to understand belief propagation, survey propagation, various Markov Chain sampling algorithms, and how they might or might not relate to the sum of squares algorithm and other convex programs. Will update when I do.

## On the recent proof of the 2-to-2 conjecture

As I posted before, recently Khot, Minzer and Safra posted a manuscript which is the culmination of a beautiful line of work, initiated by the same authors, and completes the proof of (the imperfect completeness variant of) Khot’s 2 to 2 conjecture.

An immediate corollary is establishing for every , the NP hardness of distinguishing between a unique games instance of value vs an instance of value at most . (The *value* of a constraint satisfaction problem is the maximum fraction of constraints one can satisfy.) This is interesting because in this parameter regime there is a subexponential time algorithm (of Arora, me and Steurer) and hence this is a proof of the intermediate complexity conjecture. It is also very strong evidence that the full unique games conjecture is true.

The proof of the 2 to 2 conjecture is obtained by combining **(i)** a reduction of Dinur, Khot, Kindler, Minzer and Safra (DKKMS), **(ii)** a (soon to be posted) manuscript of me with Kothari and Steurer showing the soundness conjecture for the DKKMS reduction works modulo a conjecture characterizing non-expanding sets on the Grassman graph (or equivalently, the degree two short code graph), and **(iii)** this latest KMS manuscript which proves the latter conjecture. (Most of the “heavy lifting” is done in the works **(i)** and **(iii)** of DKKMS and KMS respectively.)

On Friday I gave the first in a two part series of talks with an exposition of this result in our reading group. I plan to post here the full scribe notes of these talks, but thought I’d start with a short “teaser”. In particular, if I’m not mistaken (and it is possible I am) then one can describe the actual reduction underlying the proof in about 5 lines (see below). Of course the *analysis* takes much longer..

In my exposition I do not follow the approach of the above papers. Rather I present a combined version of the DKKMS paper and our manuscript which bypasses the need to talk about Grassman graphs altogether. Also, while the original DKKMS paper uses a particular reduction from 3XOR to label cover as a starting point, I tried to “abstract away” the properties that are really needed. I should note that the reduction I describe below is “morally related” but not identical to the one used by DKKMS, and I have not written down a full proof of its analysis, and so it is possible that I am making a mistake. Still I find it much easier to describe and understand, and so I prefer this exposition to the one in the papers.

## The reduction

The Khot, Minzer and Safra manuscript completes the proof of the following theorem:

**Theorem:** For every , it is NP hard to distinguish between a unique games instance where at least fraction of the constraints can be satisfied, and an instance where at most an fraction can be satisfied.

Any NP hardness result is composed of three parts:

- A reduction from a previously known NP hard problem.
- A
*completeness analysis*, showing that a “yes instance” of the original problem corresponds (in our case) to an instance with value at least of unique games. - A
*soundness analysis*showing that one can decode an assignent of value at least of the unique games instance to an assignment to the instance of original problem certifying that it was a yes instance.

In this blog post I will show “two thirds” of the analysis by presenting 1 and 2. To get some sense of proportion, Part 1 took me about ten minutes to present in my talk, Part 2 about two minutes, and the rest of the six hours are dedicated to part 3 (though that is an underestimate, since I will probably not get to cover much of KMS’s proof that the combinatorial conjecture is true..).

### The initial problem

The NP hard problem we start from is the label cover problem that has been used time and again for hardness of approximations results. Specifically, we consider the following game:

- Verifier chooses a pair of indices at random from some .
- She sends to first prover and to the second prover, and gets back answers and respectively.
- She
*accepts*the answers if for some particular function .

We will look at the case of an *affine label cover* where for every , is an affine function from to for some constant integers . We can think of an instance to this problem as a graph where an edge is labeled by . It is known that for every , there are large enough such that it is NP hard to distinguish between an instance where the provers can cause the verifier to accept with probability at least and an instance where every strategy would cause it to accept with probability at most .

In our context we will need two extra properties of “smoothness” and “robust soundness” (or “soundness with respect to advice”) which I will not formally define here, but can be achieved by a “noisy variant” of the standard parallel repetition from the 3XOR problem.

### The reduction

Given an instance , we construct a unique game instance over alphabet as follows:

- The verifier chooses and as before.
- She chooses a random affine function , a random
*rank one*linear function and a random invertible affine function . - She sends to the second prover and to the first prover where (we use here product notation for function composition, note that is an affine function from to )
- She gets back the answers and in from the first and second prover respectively, and accepts if and only if . (Note that the constraint is indeed a unique constraint.)

A *rank one linear function* is a function of the form where and . (In matrix notation this is the rank one matrix .)

As promised, completeness is not hard to show:

**Lemma:** If there is a prover strategy convincing the verifier with probability at least for the original label cover game, then there is a strategy convincing the verifier with probability at least in the resulting unique game.

**Proof:** Suppose there was such a strategy for the original label cover game. The provers for the unique games will answer with and with respectively. Now suppose (as will be the case with probability at least ) that . For every fixed , if we choose to be a random rank one matrix then with probability in which case . In such a case under our assumption that , and hence if we apply to the first prover’s answer we get the second prover’s answer.

(To get the full unique games conjecture we would need completeness close to , but unfortunately it’s not easy to construct the field or a rank matrix..)

## Two comments

There is plenty more to talk about this reduction its analysis and all the open questions and research directions that it opens, and I do hope to post the full lecture notes soon. But for now let me make two quick comments.

### Quantity makes quality

One can think of this reduction as following the standard paradigm of taking an original label cover game with alphabet and encoding each symbol using an error correcting code. The most common code for this is the *long code* which has codewords of length , and a more efficient version is the short code which has codewords of length . The reduction of DKKMS can be thought of as using a “tensored Hadamard code” or “unbalanced degree two short code” over alphabet with codewords of length where a string is mapped to its evaluations by all affine functions . (More accurately, DKKMS use a “folded version” of this code where one has a coordinate for every dimensional subspace ; this does not make much difference for the current discussion.) For constant , this means the codewords are of length rather than as in the short code.

While this quasipolynomial difference between the short code and the “tensored Hadamard code” might seem mild, it turns out to be absolutely crucial for the reduction to go through. In fact, I believe that if finding a code that behaves similarly to the degree shortcode but with codewords of length (as opposed to ) would result in a proof of the full unique games conjecture.

### Hardness from easiness

It turns out that the analysis of the reduction rests on characterizing the non-expanding subsets of the graph on affine functions where there is an edge between and if for a rank one . By now researchers have developed an intuition that if we stare at such natural graphs hard enough, we can figure out all the non-expanding small sets, and indeed this intuition was verified by the KMS manuscript for this particular graph. But this intuition might seem somewhat at odds with the competing intuition that the small set expansion problem (a close variant of the unique games problem) should be hard. One way to resolve this conundrum is that while the unique games problem may well be hard on the worst case, it is extremely hard to come up with actual hard instances for it. Like Trump supporters with Ph.D’s, such instances might exist, but are rarely seen in the wild.

## Research masters

In the U.S., we have almost no research masters programs. We only admit students into a Ph.D. Overall it works well, but it requires us to be very conservative in our admissions, since we are committing to have the student come for 5 years or so. This can be a particular issue for students that discover their interest in research late in their studies, at which point their application might not be as strong.

When students like that seek my advice, I often suggest they look at applying for research Masters program in places such as my alma mater The Weizmann Institute, or other universities in Europe, Canada or elsewhere. However, I realized that there may be other places I don’t know about.

If you know of a good research masters program for theoretical computer science, could you post about it in the comments?

**Update:** Some good information in the comments – thank you! If you post, please say whether this a program where students have to pay tuition or is a program where there is a chance that tuition might be waived and/or students could get a stipend.

## #metootcs

In an earlier post I asked if we have TCS “Harvey Weinsteins”. Unfortunately academia is hardly immune from sexual harassment and now a TCS researcher posted about her experiences with sexual harassment and assault in our community. While this is not pleasant reading, it is important, and I urge you to **read the full post** (see also here).

It takes courage to talk about such painful experiences, even anonymously, in a community as small as ours. But this researcher deserves our thanks for bringing up this topic, and hopefully starting a conversation that would make theoretical computer science more welcoming and inclusive. I do not know who this person is, but I urge people not try to guess her identity, but rather focus on asking what we can do, both men and women, to make things better.

The blog post already contains some good suggestions. As the overwhelming majority in our field, we men enjoy many structural advantages, and it is especially up to us to step up to this challenge. Research is not a 9 to 5 job: conferences, workshops, and informal interactions are extremely important. But we should remember that we are there for the sake of scientific collaboration. A woman shouldn’t have to worry about the motivations behind every invitation for a discussion or meeting.

Given our skewed gender ratio, it is enough for a small minority of the men to behave badly to essentially guarantee that almost all women will encounter such behavior at some point. That is *unless *other men step up and call out unprofessional (or worse) behavior when we observe or are made aware of it. I know this is not easy – we are not selected for our ability to handle awkward social situations, and I can’t say I myself have stepped up or been very aware of such issues in the past. But I will try to do better, and hope others do too.

## Looking for car keys under the streetlight

In NIPS 2017, Ali Rahimi and Ben Recht won the test of time award for their paper “Random Features for Large-scale Kernel Machines”. Ali delivered the following acceptance speech (see also addendum) in which he said that Machine Learning has become “alchemy” in the sense that it involves more and more “tricks” or “hacks” that work well in practice, but are very poorly understood. (Apparently alchemists were also successful in making many significant practical discoveries.) Similarly, when I teach cryptography I often compare the state of “pre modern” cryptography (before Shannon and Diffie-Hellman) to alchemy.

Yann LeCun was not impressed with the speech, saying that sticking to using methods for which we have theoretical understanding is *“akin to looking for your lost car keys under the street light knowing you lost them someplace else.”* There is a sense in which LeCun is very right. For example, already in the seminal paper in which Jack Edmonds defined the notion of polynomial time he said that *“it would be unfortunate for any rigid criterion to inhibit the practical development of algorithms which are either not known or known not to conform nicely to the criterion.” *But I do want to say something in defense of “looking under the streetlight”. When we want to understand the terrain, rather than achieve some practical goal, it can make a lot of sense to start in the simplest regime (e.g. “most visible” or “well lit”) and then expand our understanding (e.g., “shine new lights”). Heck, it may well be that when the super intelligent robots are here, then they would look for their keys by first making observations under the light and then extrapolating to the unlit area.

## On double blind reviews in theory conferences

Michael Mitzenmacher points to two posts of Suresh Venkatasubramanian on the issue of so called “double blind reviews” (i.e., anonymous submissions) in theory conferences. In short, both Michael and Suresh think they are a good idea. I agree with much of their motivations, but, based on my experience in both non-blinded (e.g., STOC/FOCS) and blinded (e.g., CRYPTO) conferences as both reviewer and author, I do not think double blind reviewing is a good fit for theoretical computer science.

Let me say right off the bat that I think implicit (and, as Michael says, sometimes explicit) bias is a very real phenomenon. Moreover, such biases are not just a problem in the sense that they are “unfair” to authors, but they cause real harm to science, in suppressing the contributions from certain authors. Nor do I have any principled objection to anonymization: I do for example practice anonymous grading in my courses for exactly this reason. I also don’t buy the suggestion that we must know the author’s identity to evaluate if the proof is correct. Reviewers can (and do) evaluate whether a proof makes sense without needing to trust the author.

However, there is a huge difference between grading a problem set and refereeing a paper. In the latter case, and in particular in theoretical computer science, you often need the expertise of very particular people that have worked on this area. By the time the paper is submitted to a conference, these experts have often already seen it, either because it was posted on the arxiv/eccc/eprint, or because they have seen a talk on it, or perhaps they have already discussed it with the authors by email.

More generally, these days much of theoretical CS is moving to the model where papers are first posted online, and by the time they are submitted to a conference they have circulated quite a bit around the relevant experts. Posting papers online is very good for science and should be encouraged, as it allows fast dissemination of results, but it does make the anonymous submission model obsolete.

One could say that if the author’s identity is revealed then there is no harm, since in such a case we simply revert to the original form of non anonymous submissions. However, the fact that the authors’ identity is known to *some but not all* participants in the process (e.g., maybe some reviewers but not others), makes some conflicts and biases invisible. Moreover, the fact that the author’s identity is not “officially” known, causes a lot of practical headaches.

For example, as a PC member you can’t just shoot a quick email to an expert to ask for a quick opinion on the paper, since they may well be the author themselves (as happened to me several time as a CRYPTO PC member), or someone closely related to them. Second, you often have the case where the reviewer knows who the authors are, and has some history with them, even if it’s not a formal conflict, but the program committee member does not know this information. In particular, using anonymous submissions completely precludes using a *disclosure based* model for conflicts of interest (where reviewers disclose their relations with the authors in their reviews) but rather you have to move to an *exclusion based* model, where reviewers meeting some explicit criteria are ruled out.

If anonymous submissions don’t work well for theory conferences, does it mean we have to just have to accept biases? I don’t think so. I believe there are a number of things we could attempt. First, while completely anonymizing submissions might not work well, we could try to make the author names less prominent, for example by having them in the last page of the submissions instead of the first, and not showing them in the conference software. Also, we could try “fairness through awareness”. As I mentioned in my tips for future FOCS/STOC chairs, one potential approach is to tag papers by authors who never had a prior STOC/FOCS paper (one could possibly also tag papers by authors from under-represented groups). One wouldn’t give such papers *preferential* treatment, but rather just make sure they get extra attention. For example, we could add an extra review for such papers. That review might end up being positive or negative, but would counter the bias of dismissing some works out of hand.

To summarize, I agree with Michael’s and Suresh’s sentiments that biases are harmful and should be combated. I just don’t think anonymous submissions are the way to go about that.

## Unique Games Conjecture – halfway there?

Subhash Khot, Dor Minzer and Muli Safra just posted an exciting manuscript online. In it, they confirm the combinatorial hypothesis I’ve posted about before on the structure of non-expanding set in the degree two short-code graph (or, equivalently, in the Grassman graph). Together with their prior work with Dinur and Kindler, and a small assist of an expansion-to-testing reduction of Kothari, Steurer and I, this establishes (an imperfect completeness variant of) Khot’s two to one conjecture and the intermediate complexity conjecture. This also makes significant progress towards proving, or at least giving evidence for, the more famous unique games conjecture (UGC). In fact, as I explain below, there is a technical sense in which it goes “halfway” towards proving the UGC.

The *Unique Games *() problem with parameters is the following: given a set of linear equations each involving at most two variables over some finite field, to distinguish between the *completeness* case where there exists an assignment to the variables satisfying at least fraction of the equations, and the *soundness* case where every assignment satisfies fewer than a fraction.

Clearly the problem becomes easier the larger the gap between and . The unique games conjecture is that the problem is as hard as it can be, in the sense that it is NP hard for arbitrarily close to one and arbitrarily close to zero (as a function of the field size which we assume tends to infinity in what follows). In other words, the difficulty of as a function of and is conjectured by the UGC to look like this:

###### (Note that the problem does not make sense when . Also, in this figure we ignore regions of measure zero. In particular for or the problem can be solved by a semidefinite program where the precise factors depend on the field size; it is known that if the UGC is true then this dependence is optimal.)

Until today, what was known about unique games could be summarized in this (not to scale) figure:

That is, when and are sufficiently close to each other, was known to be NP hard (see for example this paper) and in fact with a linear-blowup reduction establishing exponential hardness (i.e. under the exponential time hypothesis. On the other hand, when either the completeness parameter is sufficiently close to one or the soundness is sufficiently close to zero, there was a known subexponential time algorithm of Arora, Steurer and I (see also here) for . (That is, an algorithm running in time for some which tends to zero as either completeness tends to one or soundness tends to zero.)

However, that algorithm was of course only an upper bound, and we did not know whether it could be improved further to or even polynomial time. Moreover, its mere existence showed that in some sense the techniques of the previously known NP hardness results for (which used a linear blow up reduction) are *inherently inapplicable* to establishing the UGC which requires hardness in a completely different regime.

The new result of Khot, Minzer and Safra (when combined with the prior ones) shows that is NP hard for arbitrarily close to half and arbitrarily close to zero. (The result is presented as hardness of 2 to 2 games with completeness close to one, but immediately implies hardness of unique games with completeness close to half.) That is, the new picture of unique games’ complexity is as follows:

This establishes for the first time hardness of unique games in the regime for which a sub-exponential time algorithm was known, and hence (necessarily) uses a reduction with some (large) polynomial blowup. While it is theoretically still possible for the unique games conjecture to be false (as I personally believed would be the case until this latest sequence of results) the most likely scenario is now that the UGC is true, and the complexity of the problem looks something like the following:

That is, for every , the best running time is roughly where is a function that is always positive but tends to zero as tends to one or tends to zero (and achieves the value one in a positive measure region of the plane). Of course we are still yet far from proving this, let alone characterizing this function, but this is still very exciting progress nonetheless.

I personally am also deeply interested in the question of whether the algorithm that captures this curve is the sum of squares algorithm. Since SoS does capture the known subexponential algorithms for unique games, the new work provides more evidence that this is the case.