Hi fellow researchers,
I’m writing to share a little tool that I developed with the ambitious goal of boosting research productivity. The tool is a Chrome extension named “Where’s that paper?”. Before I tell you more about what it does, let me touch upon the largest obstacle facing any new tool: the learning curve. Rest assured that “Where’s that paper?” requires zero training – just install it and continue working as usual – it will guide you without further effort on your end.
After building much suspense, I will elaborate. If you’re like me, you find yourself frequently searching the web for papers that you have browsed or opened in the past on your computer – basically your paper browsing history. Whether in the process of writing a paper or searching for existing ideas, you may search for authors or titles of papers that you recall having once glimpsed at. At some point, the paper has either been downloaded or added to the favorites bar. In any case, this process is manual and takes up much time (often with some frustration). As such, I thought we could all use some code to automate it.
The extension is very simple: it identifies when you are reading a scientific paper (according to the domain) and then automatically adds this paper, with the proper author list, year etc., to your favorites bar under a designated folder. Then, when you search in Chrome’s address bar for an author or a title, the relevant results from the favorites pop up. One can also browse the favorites folder directly to see all papers read.
See an example of the results shown when searching for “short” in Chrome’s address bar. The first three links are from the favorites, and the rest are general Google suggestions from the web.
The extension automatically works on a list of specified domains that include:
eprint.iacr.org, arxiv.org, eccc.weizmann.ac.il, epubs.siam.org, research.microsoft.com, citeseerx.ist.psu.edu, ac.elscdn.com, www.sciencedirect.com, download.springer.com, link.springer.com, delivery.acm.org, proceedings.mlr.press and journals.aps.org.
It is not too difficult to customize the list and add additional domains. Reach out to me and I’ll add them upon your request!
A bonus feature (thanks Boaz for the suggestion): click the extension’s icon and you can download a .bib file containing (nicely formatted) DBLP BibTeX records of all papers added to the favorites.
Download “Where’s that paper?” from the Chrome Web Store: https://chrome.google.com/webstore/detail/wheres-that-paper/dkjnkdmoghkbkfkafefhbcnmofdbfdio
Most who work in our domain of expertise would be quite concerned about installing extensions, and with good reason. The extension I wrote does not leak any information – not to me, nor to any third party. I have no ulterior motive in developing the extension: it originally helped me, and I now see the benefit of sharing it with our community. I do not know how to reassure you that this is indeed the case other than giving you my word and publishing the source code online. It is available here:
https://github.com/eylonyogev/Where-s-That-Paper-
I’d be glad to hear any feedback.
Thanks,
Eylon.
For a probability distribution defined up to a constant of proportionality, we have already seen the partition function. To refresh your memory, given an unnormalized probability mass function $latex \tilde{p}(x)$ over all $latex x$, the partition function is simply the normalization constant of the distribution, i.e. $latex Z = \sum_{x} \tilde{p}(x)$.
At first glance, the partition function may appear to be uninteresting. However, upon taking a deeper look, this single quantity holds a wide array of intriguing and complex properties. For one thing, its significance in the world of thermodynamics cannot be overstated. The partition function is at the heart of relating the microscopic quantities of a system – such as the individual energies of each probabilistic state – to macroscopic entities describing the entire system: the total energy, the energy fluctuation, the heat capacity, and the free energy. For explicit formulas, consult this source for reference. In machine learning the partition function also holds much significance, since it’s intimately linked to computing marginals in the model.
Although there exists an explicit formula for the partition function, the challenge lies in computing the quantity in polynomial time. Suppose $latex x$ is a vector of dimension $latex n$ where each $latex x_i \in \{-1, 1\}$. Specifically, we would like a smarter way of finding $latex Z$ than simply adding up over all $latex 2^n$ combinations of $latex x$.
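To see why this is a real obstacle, here is the naive summation in code, for a made-up three-spin chain (the couplings and field are purely illustrative); the loop is over all $latex 2^n$ configurations, which is exactly what we want to avoid:

```python
import itertools
import math

def brute_force_log_Z(J, h):
    """log Z for an Ising model p(x) proportional to
    exp(sum_{i<j} J[i][j]*x_i*x_j + sum_i h[i]*x_i), x_i in {-1, +1},
    computed by summing over all 2^n spin configurations."""
    n = len(h)
    Z = 0.0
    for x in itertools.product([-1, 1], repeat=n):
        theta = sum(J[i][j] * x[i] * x[j] for i in range(n) for j in range(i + 1, n))
        theta += sum(h[i] * x[i] for i in range(n))
        Z += math.exp(theta)
    return math.log(Z)

# Made-up 3-spin chain with couplings J_{01} = J_{12} = 0.5 and no field.
# Here Z factorizes, giving log Z = log(2 * (2*cosh 0.5)^2) as a check.
J = [[0, 0.5, 0], [0, 0, 0.5], [0, 0, 0]]
h = [0.0, 0.0, 0.0]
print(brute_force_log_Z(J, h))
```

The runtime is exponential in the number of spins; everything in the rest of the post is about doing better.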
Andrej Risteski’s lecture focused on using two general techniques – variational methods and Taylor series approximations – to find provably approximate estimates of .
The general setup addresses undirected graphical models, also known as Markov random fields (MRFs), where the probability mass function has the form
for some random, -dimensional vector and some set of parameterized functions . Note that the notation denotes all pairs where and the edge exists in the graph.
The talk considered the specific setup where each , so . Also, we fix
for some set of coefficients , thereby giving us the well-known Ising model. Note, the interaction terms could be more complicated and made “less local” if desired, but that was not discussed in the lecture.
These graphical models are common in machine learning, where there are two common tasks of interest:
We focus on the latter task. The problem of inference is closely related to calculating the partition function. This value is often used as the normalization constant in many methods, and it is classically defined as
Although we can write down the aforementioned closed-form expression for the partition function, it is difficult to calculate this quantity in polynomial time. There are two broad approaches to solving inference problems:
Although more often used in practice, randomized algorithms such as MCMC are notoriously hard to debug, and it is often unclear at what point the chain reaches stationarity.
In contrast, variational methods are often more difficult to rigorously analyze but have the added benefit of turning inference problems into optimization problems. Risteski’s talk considered some provable instantiations of variational methods for calculating the partition function.
Let us start with a basic observation, known as the Gibbs variational principle. It can be stated as follows.
Lemma 1: Let . Then, we can show that the corresponding partition function satisfies
where is defined to be a valid distribution over and is the set of all such distributions.
Note that we use to denote the expectation of the inner argument with respect to the distribution and to denote the Shannon entropy of .
Proof: For any , the KL divergence between and must be greater than or equal to 0.
Equality holds if , leading directly to the lemma above.
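As a quick numerical illustration of Lemma 1 (with a made-up eight-state energy landscape): any distribution $latex q$ gives a lower bound on $latex \log Z$ via the functional $latex \mathbb{E}_q[-E] + H(q)$, with equality at the Boltzmann distribution.

```python
import math
import random

# An unnormalized distribution p(x) = exp(-E(x)) over 8 states, with
# arbitrary made-up energies.
random.seed(0)
energies = [random.uniform(-1, 1) for _ in range(8)]
logZ = math.log(sum(math.exp(-E) for E in energies))

def gibbs_functional(q):
    """E_q[-E(x)] + H(q): the objective of the Gibbs variational principle."""
    avg_neg_energy = sum(qi * (-E) for qi, E in zip(q, energies))
    entropy = -sum(qi * math.log(qi) for qi in q if qi > 0)
    return avg_neg_energy + entropy

uniform = [1 / 8] * 8
boltzmann = [math.exp(-E - logZ) for E in energies]

print(logZ - gibbs_functional(uniform))         # positive: a strict lower bound
print(abs(gibbs_functional(boltzmann) - logZ))  # ~0: equality at q = p
```

The gap for any other $latex q$ is exactly the KL divergence used in the proof.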
Note that the proof would also have held in the more general exponential family with instead of . Also, at most temperatures, one of the two terms (either the energy or the entropy) will dominate.
As a result of this lemma, we have framed the original inference problem of calculating as an optimization problem over . This is the crux of variational inference.
The issue with this approach is that it is difficult to optimize over the polytope of distributions due to the values of each coming from . In general, this is NOT tractable. We can imagine two possible solutions:
Instead of optimizing over all , pick a “nice” subfamily of distributions to constrain . The prototypical example is the mean-field approximation in which we let , where each is a univariate distribution. Thus, we approximate with
This would provide a lower bound on . There are a number of potential issues with this method. For one thing, the functions are typically non-convex. Even ignoring this problem, it is difficult to quantify how good the approximation is.
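A minimal sketch of the mean-field approximation for a small Ising model (the couplings and fields are made up): coordinate ascent on the means of a product distribution, returning a value that can only under-estimate $latex \log Z$.

```python
import itertools
import math

def mean_field_lower_bound(J, h, iters=500):
    """Coordinate-ascent mean field for an Ising model p(x) proportional to
    exp(sum_{i<j} J[i][j]*x_i*x_j + sum_i h[i]*x_i), x_i in {-1, +1}.
    q(x) = prod_i q_i(x_i) is parameterized by means m_i = E_q[x_i];
    the returned value E_q[theta(x)] + H(q) lower-bounds log Z."""
    n = len(h)
    m = [0.0] * n
    for _ in range(iters):
        for i in range(n):
            field = h[i] + sum((J[i][j] if i < j else J[j][i]) * m[j]
                               for j in range(n) if j != i)
            m[i] = math.tanh(field)  # the mean-field fixed-point update
    energy = sum(J[i][j] * m[i] * m[j] for i in range(n) for j in range(i + 1, n))
    energy += sum(h[i] * m[i] for i in range(n))
    entropy = 0.0
    for mi in m:  # binary entropy of each factor q_i
        for t in ((1 + mi) / 2, (1 - mi) / 2):
            if t > 0:
                entropy -= t * math.log(t)
    return energy + entropy

J = [[0, 0.5, 0], [0, 0, 0.5], [0, 0, 0]]
h = [0.2, 0.2, 0.2]
lb = mean_field_lower_bound(J, h)

# Exact log Z by brute force, for comparison on this tiny instance.
logZ = math.log(sum(
    math.exp(sum(J[i][j] * x[i] * x[j] for i in range(3) for j in range(i + 1, 3))
             + sum(h[i] * x[i] for i in range(3)))
    for x in itertools.product([-1, 1], repeat=3)))
print(lb, logZ)  # lb <= logZ
```

Note the two caveats from the text: the updates need not find a global optimum of the (non-convex) mean-field objective, and the size of the gap to $latex \log Z$ is hard to quantify in general.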
Note: Given a distribution , we use to denote the marginal for some set .
In the outer approximation, we relax the polytope we are optimizing over using convex hierarchies. For instance, we can define as the polytope containing valid marginals over subsets of size at most . We can then reformulate our objective as a two-part problem of (1) finding a set of pairwise marginals that optimizes the energy term and (2) finding a distribution with pairwise marginals matching that optimizes the entropy term. For simplicity of notation, we use to denote , where . It follows that we can rewrite the Gibbs equation as
The first term here represents the energy on pairwise marginals, whereas the second term is the maximum entropy subject to a constraint about matching the energy distribution’s pairwise marginals. The goal of this procedure is to enlarge the polytope such that is a tractable set, where we can impose a polynomial number of constraints satisfied by real marginals.
One specific convex hierarchy that is commonly used for relaxation is the Sherali-Adams (SA) hierarchy. Sherali-Adams will allow us to formulate the optimization of the energy term (and an approximation of the entropy term) as a convex program. We introduce the polytope for , which will relax the constraints on to allow for values outside of in order to generate a polynomial-time solution for .
The Sherali-Adams hierarchy will take care of the energy term, but it remains unclear how to rewrite the entropy term in the context of the convex program. In fact, we will need approximations for in order to accomplish this task.
In this talk, we’ll consider two approximations: a classical one – the Bethe entropy – and a more recent one – the augmented mean-field pseudo-entropy .
The Bethe entropy works by pretending that the Markov random field is a tree. In fact, we can show that if the graph is a tree, then using the Bethe entropy approximation in the convex program defined by will yield an exact calculation of .
Specifically, the Bethe entropy is defined as
where is defined to be the degree of the particular vertex . Note that there are no marginals over sets of dimension greater than two in this expression; thus, we have a valid convex program over the polytope .
This lemma is well-known, but it will be a useful warmup:
Lemma 2 Define the output of the convex program
On a tree, and the optimization objective is concave with respect to the variables, so it can be solved in polynomial time.
Proof sketch We will prove 3 claims:
Since the energy term is exact, it suffices to show that for valid marginals , . This can be done by re-writing and using a property of conditional entropy, namely that .
For trees, is concave in the variables.
This proof was only sketched in class but is similar to the usual proof of concavity of entropy.
For trees, we can round a solution to a proper distribution over which attains the same value for the original optimization.
The distribution that we will produce is
where we start at the root and keep sampling down the tree. Based on the tree structure, . The energy is also the same since . Therefore, since both terms of the objective equal the respective terms for this distribution, the optimal value of the program must equal the partition function.
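A small numerical check of the first claim in Lemma 2 (the instance is made up): plugging the exact marginals of a tree Ising model into the Bethe free energy recovers $latex \log Z$ exactly.

```python
import itertools
import math

# Ising model on the path 0-1-2 (a tree): p(x) proportional to
# exp(sum_{(i,j) in E} J*x_i*x_j), with an arbitrary made-up coupling J.
edges = [(0, 1), (1, 2)]
J, n = 0.4, 3

states = list(itertools.product([-1, 1], repeat=n))
weight = {x: math.exp(sum(J * x[i] * x[j] for i, j in edges)) for x in states}
Z = sum(weight.values())

def marg(idxs):
    """Exact marginal over the variables in idxs, by brute force."""
    m = {}
    for x in states:
        key = tuple(x[i] for i in idxs)
        m[key] = m.get(key, 0.0) + weight[x] / Z
    return m

def H(m):
    return -sum(p * math.log(p) for p in m.values() if p > 0)

deg = [sum(i in e for e in edges) for i in range(n)]
# Energy on pairwise marginals, plus the Bethe entropy
#   sum_{edges} H(tau_ij) - sum_i (deg_i - 1) H(tau_i).
energy = sum(sum(tau[(a, b)] * J * a * b for (a, b) in tau)
             for tau in (marg(e) for e in edges))
bethe_entropy = (sum(H(marg(e)) for e in edges)
                 - sum((deg[i] - 1) * H(marg((i,))) for i in range(n)))
print(abs(energy + bethe_entropy - math.log(Z)))  # ~0: exact on a tree
```

On a non-tree graph the same formula can over- or under-shoot, which is why the dense case below needs a different entropy proxy.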
Remarks:
For this approximation, we will focus on dense graphs: namely graphs in which each vertex has degree for some . To simplify things, we focus on distributions of the form
where is some positive constant parameter. The Bethe entropy approximation is undesirable in the dense case, because we cannot bound its error on non-trees. Instead, we proved the following theorem.
Theorem (Risteski ’16)
For a dense graph with parameter , there exists an outer approximation based on and an entropy proxy which achieves an additive approximation to for some . There is an algorithm with runtime .
To parse the theorem statement, consider the potential regimes for :
MCMC methods give a -factor approximation in time poly. (It’s not clear if the methods suggested here can give such a guarantee – this is an interesting problem to explore.)
MCMC mixes slowly, but there is no other way to get a comparable guarantee. This is the interesting regime!
Proof Sketch The proof strategy is as follows: We will formulate a convex program under the SA() relaxation that can return a solution in polynomial time with value . Note that this value of the relaxation will be an upper bound on the true partition function value . The convex program solution may not be a valid probability distribution, so from this, we will construct a “rounded solution” – an actual distribution with value . It follows that
We will then aim to put a bound on the gap between and ; namely, that
or equivalently,
This equivalently places a bound on the optimality gap between and , thereby proving the theorem. Here are the main components of the proof:
1. Entropy Proxy
To approximate , we will use the augmented mean-field pseudo-entropy approximation . This is defined as
Using , we can write a convex program under SA() that provides a relaxation for optimizing . Specifically, we will let . Let be the output solution to this relaxation. We define the rounded solution as
Using the chain rule for entropy, we can show that
2. Rounding Scheme The above distribution is actually the same distribution produced by correlation rounding, as introduced in a seminal paper of Barak, Raghavendra, and Steurer. By using the definition of mutual information and Pinsker’s inequality, which shows that , we can work through some algebra to show that
Then, putting together the entropy and energy effects, we can prove the main theorem.
Remarks:
Another method of calculating the partition function involves Taylor expanding its logarithm around a cleverly selected point. This approach was first developed by Barvinok ’14, for the purpose of computing the partition function of cliques in a graph. The mathematical techniques developed in this paper can be naturally extended to evaluate a variety of partition functions, including that of the ferromagnetic Ising model. We will use this example to illustrate the general idea of the Taylor expansion technique.
The goal here is to devise a deterministic, fully polynomial-time approximation scheme (an FPTAS) for evaluating the partition function of the ferromagnetic Ising model. This means that the algorithm must run in poly() and be correct within a multiplicative factor of . We will work in the general case where the Hamiltonian includes an external magnetic field term, as well as the neighboring spin interaction terms. Using logarithmic identities, we can re-write the partition function slightly differently from last time:
Here, is called the vertex activity (or external field), and characterizes the likelihood of a vertex to be in the + configuration. Additionally, is referred to as the edge activity, and characterizes the propensity of a vertex to agree with its neighbors. The ferromagnetic regime (where agreement of spins is favored) corresponds to . denotes the number of edges cut in the graph, which is equivalent to the number of neighboring pairs with opposite spins.
As one final disclaimer, the approximation scheme presented below is not valid for , which corresponds to the zero-magnetic field case. Randomized algorithms exist which can handle this case.
The general idea here is to hold one parameter fixed ( in our case), and express the logarithm of the partition function as a Taylor series in the other parameter (). From a physics standpoint, the Taylor expansion around tells us how the partition function is changing as a function of the magnetic field. Without loss of generality, we focus on the case where . This simplification can be justified by a simple symmetry argument. Consider as the inverse of where all the 1’s (spin-up) are flipped to -1s (spin-down) and vice-versa. The partition function of this flipped system can be related to the original system by a constant factor:
This holds because the number of cut edges remains constant when the values are flipped. Since these two partition functions are related by a factor that is constant in , for any model with , we can simply consider the flipped graph and use the above relation.
For our purposes, it will be more convenient to approximate the logarithm of the partition function, because a multiplicative approximation of corresponds to an additive approximation of . In fact, if we allow the partition function to be complex, then we can easily show that an additive bound of for guarantees a multiplicative bound of for .
For notational convenience we define:
Thus, the Taylor expansion of around is given by:
The big question now is: Can we get a good approximation from just the lower order terms of the Taylor expansion for ? Equivalently, can we bound the sum of the higher order terms at ? Additionally, we still must address the question of how to actually calculate the lower order terms.
To answer these questions, we make simple observations relating the derivatives of $latex Z$ to the derivatives of $latex \log Z$.
The last equation just comes from repeated application of the product rule. Using these relations, we can solve for the first derivatives of $latex \log Z$ if we have the first derivatives of $latex Z$, using a triangular linear system of equations. As $latex Z(0) \neq 0$, we see that the system is non-degenerate.
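The triangular system can be made concrete with a toy polynomial standing in for the partition function (the roots below are made up); we recover the derivatives of $latex \log Z$ at 0 from the derivatives of $latex Z$.

```python
import math

# A stand-in partition function with known roots: Z(t) = prod_i (1 + a_i*t),
# so Z^(k)(0) = k! * c_k where c_k are the polynomial's coefficients.
a = [1.0, 1.0, 0.5]
coeffs = [1.0]
for ai in a:  # multiply the coefficient list by (1 + ai*t)
    new = [0.0] * (len(coeffs) + 1)
    for k, c in enumerate(coeffs):
        new[k] += c
        new[k + 1] += ai * c
    coeffs = new
Zd = [math.factorial(k) * c for k, c in enumerate(coeffs)]  # Z^(k)(0)

# Leibniz's rule applied to Z' = f'*Z (with f = log Z) gives the triangular
# system  Z^(k) = sum_{j=0}^{k-1} C(k-1, j) * f^(j+1) * Z^(k-1-j),
# which we solve for f^(k)(0); non-degenerate because Z(0) = 1 != 0.
f = [0.0] * len(Zd)
for k in range(1, len(Zd)):
    s = sum(math.comb(k - 1, j) * f[j + 1] * Zd[k - 1 - j] for j in range(k - 1))
    f[k] = (Zd[k] - s) / Zd[0]

# Cross-check: log Z = sum_i log(1 + a_i*t), so
# f^(k)(0) = (-1)^(k-1) * (k-1)! * sum_i a_i^k.
for k in range(1, len(f)):
    expected = (-1) ** (k - 1) * math.factorial(k - 1) * sum(ai ** k for ai in a)
    assert abs(f[k] - expected) < 1e-9
print(f[1:])  # [2.5, -2.25, 4.25]
```

The same back-substitution works with the derivatives of the actual Ising partition function in place of this toy polynomial.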
Note, is an n-degree polynomial with leading coefficient 1 and constant coefficient 1 (corresponding to the configurations where all vertices are positive / negative, respectively). Using this form, we can re-write the partition function in terms of its (possibly complex) roots :
Supposing we keep the first terms, the error is bounded by:
Here we can invoke the Lee-Yang theorem, which tells us that the zeroes of for a system with ferromagnetic interactions lie on the unit circle of the complex plane. So the Lee-Yang theorem guarantees that , and we see that the error due to truncation is ultimately bounded by:
Now by the previous symmetry argument, we can assume that . Thus, to achieve an error bound of , we must have:
Rearranging terms and taking the natural log of both sides (which is justified given that ), we see that the inequality is satisfied if:
Thus, we need only retain the first terms of the Taylor series expansion of . This will involve calculating the first derivatives of , which naively can be done by running over all , where ; this takes quasi-polynomial time.
Recent work by Patel and Regts, as well as Liu, Sinclair and Srivastava, has focused on evaluating these coefficients more efficiently (in polynomial time), but is outside the scope of this lecture. Clever counting arguments aside, to trivially calculate the -th derivative, we need only sum over the vectors with spin-up components.
Most of the heavy lifting in the above approach is done by the Lee-Yang theorem. In this section, we sketch how it is proved.
First, let’s define the Lee-Yang property, which we will refer to as the LYP. Let be some multilinear polynomial. has the Lee-Yang property if for any complex numbers such that for all i, then .
We can then show that the partition function for the ferromagnetic Ising model, which we wrote as
must have this LYP. In the antiferromagnetic case it turns out that all zeroes lie on the negative real axis, but we will focus on the ferromagnetic case.
Proof (sketch):
For the proof that the partition function has the LYP, we will use Asano’s contraction argument. This relies on the fact that certain operations preserve LYP and we can “build” up to the partition function of the full graph by these operations.
Contraction: Suppose we produce a graph from by contracting 2 vertices in , which means merging them into one vertex. It can be shown that if has the LYP, then also has the LYP. We write the partition function for G as
The “contracted” graph amounts to deleting the middle 2 terms so the partition function for can be written as
We want to show that the partition function for has the LYP. Because the partition function of has the LYP, we can consider the case where . By the LYP,
if . By Vieta’s formulas we can find a relation between A and D,
Now, to show that the partition function of G’ has the LYP, we assume there is a such that . For this to be true,
However, this means that , so there is no solution with such that the partition function of G’ is 0, and so it has the LYP.
The final part of this proof is to construct the original graph. We observe that for a single edge, the partition function has the LYP. In this case, we easily write out the partition function for 2 vertices:
Suppose : then would imply . However, this is just the Möbius transform mapping the exterior of the unit disk to the interior, so . Thus, it cannot be the case that both and have absolute value greater than one.
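Assuming the single-edge convention above (vertex activities $latex z_1, z_2$ and edge activity $latex b$, so $latex Z = z_1 z_2 + b(z_1 + z_2) + 1$), the Möbius-transform argument can be checked numerically:

```python
import cmath
import random

# Single-edge ferromagnetic partition function, 0 < b < 1:
#   Z(z1, z2) = z1*z2 + b*(z1 + z2) + 1.
# Setting Z = 0 gives the Mobius map z2 = -(b*z1 + 1)/(z1 + b),
# which sends the exterior of the unit disk to its interior.
random.seed(1)
b = 0.6
for _ in range(1000):
    r = 1.001 + 4 * random.random()  # strictly outside the unit disk
    z1 = r * cmath.exp(2j * cmath.pi * random.random())
    z2 = -(b * z1 + 1) / (z1 + b)
    assert abs(z1 * z2 + b * (z1 + z2) + 1) < 1e-9  # indeed a root of Z
    assert abs(z2) < 1                              # other activity forced inside
print("no root of Z has both |z1| > 1 and |z2| > 1")
```

This is precisely the LYP for a single edge: no root of $latex Z$ has both activities outside the unit disk.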
Since single edges have the LYP, we break the graph into single edges, with copies of vertices. These copies are then contracted, and we build the graph back up to show that the partition function has the LYP.
Knowing that the partition function has the LYP, direct application of the Lee-Yang theorem guarantees that the roots are on the unit circle in the complex plane.
Although there exists an explicit formula for the partition function, the challenge lies in computing the quantity in polynomial time. Andrej Risteski’s lecture focused on two less-explored methods in theoretical computer science – namely variational inference and Taylor series approximations – to find provably approximate estimates. Both of these approaches are relatively new and replete with open problems.
This blog post is a continuation of the CS229R lecture series. Last week, we saw how certain computational problems like 3SAT exhibit a thresholding behavior, similar to a phase transition in a physical system. In this post, we’ll continue to look at this phenomenon by exploring a heuristic method, belief propagation (and the cavity method), which has been used to make hardness conjectures, and also has thresholding properties. In particular, we’ll start by looking at belief propagation for approximate inference on sparse graphs as a purely computational problem. After doing this, we’ll switch perspectives and see belief propagation motivated in terms of Gibbs free energy minimization for physical systems. With these two perspectives in mind, we’ll then try to use belief propagation to do inference on the stochastic block model. We’ll see some heuristic techniques for determining when BP succeeds and fails in inference, as well as some numerical simulation results of belief propagation for this problem. Lastly, we’ll talk about where this all fits into what is currently known about efficient algorithms and information theoretic barriers for the stochastic block model.
Suppose someone gives you a probabilistic model on $latex \chi^n$ (think of $latex \chi$ as a discrete set) which can be decomposed in a special way, say
where each only depends on the variables . Recall from last week that we can express constraint satisfaction problems in these kinds of models, where each is associated with a particular constraint. For example, given a 3SAT formula , we can let if is satisfied, and 0 otherwise. Then each only depends on 3 variables, and only has support on satisfying assignments of .
A central problem in computer science is trying to find satisfying assignments to constraint satisfaction problems, i.e. finding values in the support of . Suppose that we knew that the value of were . Then we would know that there exists some satisfying assignment where . Using this knowledge, we could recursively try to find ‘s in the support of , and iteratively come up with a satisfying assignment to our constraint satisfaction problem. In fact, we could even sample uniformly from the distribution as follows: randomly assign to with probability , and assign it to otherwise. Now iteratively sample from for the model where is fixed to the value we assigned to it, and repeat until we’ve assigned values to all of the . A natural question is therefore the following: when can we efficiently compute the marginals
for each ?
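The marginals-to-sampler reduction just described can be sketched as follows; the constraint, the brute-force marginal “oracle”, and all parameters are illustrative stand-ins (in practice, an efficient procedure such as BP would supply the marginals):

```python
import itertools
import random

N = 4  # number of variables in a toy CSP over {0, 1}^N

def satisfies(x):
    """Made-up constraint: not all variables equal (14 of 16 assignments)."""
    return len(set(x)) > 1

def cond_marginal(prefix):
    """P(x_k = 1 | x_0..x_{k-1} = prefix), under the uniform distribution on
    satisfying assignments. Brute force stands in for an efficient oracle."""
    k = len(prefix)
    counts = [0, 0]
    for v in (0, 1):
        for tail in itertools.product((0, 1), repeat=N - k - 1):
            if satisfies(tuple(prefix) + (v,) + tail):
                counts[v] += 1
    return counts[1] / (counts[0] + counts[1])

def sample():
    """Fix variables one at a time from their conditional marginals."""
    x = []
    for _ in range(N):
        x.append(1 if random.random() < cond_marginal(x) else 0)
    return tuple(x)

random.seed(0)
draws = [sample() for _ in range(2000)]
print(all(satisfies(x) for x in draws), len(set(draws)))
```

Every draw satisfies the constraint, and all 14 satisfying assignments show up with roughly equal frequency, as the uniform-sampling argument predicts.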
A well-known efficient algorithm for this problem exists when the corresponding graphical model of (more on this in the next section) is a tree. Even though belief propagation is only guaranteed to work exactly for trees, we might hope that if our factor graph is “tree like”, then BP will still give a useful answer. We might even go further than this, and try to analyze exactly when BP fails for a random constraint satisfaction problem. For example, you can do this for k-SAT when is large, and then learn something about the solution threshold for k-SAT. It therefore might be natural to try and study when BP succeeds and fails for different kinds of problems.
We will start by making two simplifying assumptions on our model .
First, we will assume that can be written in the form for some functions and some “edge set” (where the edges are undirected). In other words, we will only consider pairwise constraints. We will see later that this naturally corresponds to a physical interpretation, where each of the “particles” interact with each other via pairwise forces. Belief propagation actually still works without this assumption (which is why we can use it to analyze -SAT for ), but the pairwise case is all we need for the stochastic block model.
For the second assumption, notice that there is a natural correspondence between and the graphical model on vertices, where forms an edge in iff . In other words, edges in correspond to factors of the form in , and vertices in correspond to variables in . Our second assumption is that the graphical model is a tree.
Now, suppose we’re given such a tree which represents our probabilistic model. How do we compute the marginals? Generally speaking, when computer scientists see trees, they begin to get very excited [reference]. “I know! Let’s use recursion!” shouts the student in CS124, their heart rate noticeably rising. Imagine that we arbitrarily rooted our tree at vertex . Perhaps, if we could somehow compute the marginals of the children of , we could stitch them together to compute the marginal . In other words, we should think about computing the marginals of roots of subtrees in our graphical model. A quick check shows that the base case is easy: suppose we’re given a graphical model which is a tree consisting of a single node . This corresponds to some PDF . So to compute , all we have to do is compute the marginalizing constant , and then we have . With the base case out of the way, let’s try to solve the induction step: given a graphical model which is a tree rooted at , and where we’re given the marginals of the subtrees rooted at the children of , how do we compute the marginal of the tree rooted at ? Take a look at figure 2 to see what this looks like graphically. To formalize the induction step, we’ll define some notation that will also be useful to us later on. The main pieces of notation are , which is the subtree rooted at with parent , and the “messages” , which can be thought of as information which is passed from the child subtrees of to the vertex in order to compute the marginals correctly.
Phew! That was a lot of notation. Now that we have that out of the way, let’s see how we can express the marginal of the root of a tree as a function of the marginals of its subtrees. Suppose we’re considering the subtree , so that vertex has children . Then we can compute the marginal directly:
The non-obvious step in the above is that we’re able to switch around summations and products: we’re able to do this because each of the trees are functions on disjoint sets of variables. So we’re able to express as a function of the children values . Looking at the update formula we have derived, we can now see why the are called “messages” to vertex : they send information about the child subtrees to their parent .
The above discussion is a purely algebraic way of deriving belief propagation. A more intuitive way to get this result is as follows: imagine fixing the value of in the subtree , and then drawing from each of the marginals of the children of conditioned on the value . We can consider the marginals of each of the children independently, because the children are independent of each other when conditioned on the value of . Converting words to equations, this means that if has children , then the marginal probability of in the subtree is proportional to . We can then write
And we get back what we had before. We’ll call this last equation our “update” or “message passing” equation. The key assumption we used was that if we condition on , then the children of are independent. It’s useful to keep this assumption in mind when thinking about how BP behaves on more general graphs.
A similar calculation yields that we can calculate the marginal of our original probability distribution as the marginal of the subtree with no parent, i.e.
Great! So now we have an algorithm for computing marginals: recursively compute for each in a dynamic programming fashion with the “message passing” equations we have just derived. Then, compute for each . If the diameter of our tree is , then the recursion depth of our algorithm is at most .
However, instead of computing every neatly with recursion, we might try something else: let’s instead randomly initialize each with anything we want. Then, let’s update each in parallel with our update equations. We will keep doing this in successive steps until each has converged to a fixed value. By looking at belief propagation as a recursive algorithm, it’s easy to see that all of the ‘s will have their correct values after at most steps. This is because (after arbitrarily rooting our tree at any vertex) the leaves of our recursion will initialize to the correct value after 1 step. After two steps, the parents of the leaves will be updated correctly as functions of the leaves, and so they will have the correct values as well. Specifically:
Proposition: Suppose we initialize messages arbitrarily, and update them in parallel according to our update equations. If has diameter , then after steps each converges, and we recover the correct marginals.
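Here is a minimal sketch of the parallel message-passing procedure on a small tree; the tree, potentials, and parameters are all made up, and node potentials $latex \phi_i$ are included alongside the pairwise $latex \psi$ so that the marginals are non-trivial. The result is checked against brute force:

```python
import itertools
import math

# Tree-structured pairwise model with node potentials:
#   p(x) proportional to prod_i phi_i(x_i) * prod_{(i,j) in E} psi(x_i, x_j),
# with x_i in {-1, +1}.
vals = (-1, 1)
n = 4
edges = [(0, 1), (1, 2), (1, 3)]

def psi(a, b):
    return math.exp(0.3 * a * b)

phi = {i: {v: 1.0 for v in vals} for i in range(n)}
phi[0] = {v: math.exp(0.4 * v) for v in vals}   # external field on vertex 0
phi[2] = {v: math.exp(-0.2 * v) for v in vals}  # and on vertex 2

nbrs = {i: [] for i in range(n)}
for i, j in edges:
    nbrs[i].append(j)
    nbrs[j].append(i)

# Initialize every message to uniform, then update all of them in parallel.
msg = {(i, j): {v: 1.0 for v in vals} for i in nbrs for j in nbrs[i]}
for _ in range(n):  # (diameter of the tree) rounds suffice
    msg = {(i, j): {v: sum(phi[i][u] * psi(u, v)
                           * math.prod(msg[(k, i)][u] for k in nbrs[i] if k != j)
                           for u in vals)
                    for v in vals}
           for (i, j) in msg}

def bp_marginal(i):
    b = {v: phi[i][v] * math.prod(msg[(k, i)][v] for k in nbrs[i]) for v in vals}
    s = sum(b.values())
    return {v: b[v] / s for v in vals}

# Brute-force check that BP recovers the exact marginals on this tree.
w = {x: math.prod(phi[i][x[i]] for i in range(n))
        * math.prod(psi(x[i], x[j]) for i, j in edges)
     for x in itertools.product(vals, repeat=n)}
Z = sum(w.values())
err = max(abs(bp_marginal(i)[v] - sum(p for x, p in w.items() if x[i] == v) / Z)
          for i in range(n) for v in vals)
print(err)  # ~0
```

On a non-tree graph, the same update loop still runs; the proposition’s convergence guarantee is what no longer applies.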
Why would anyone want to do things in this way? In particular, by computing everything in parallel in steps instead of recursively, we’re computing a lot of “garbage” updates which we never use. However, the advantage of doing things in this way is that this procedure is now well defined for general graphs. In particular, suppose violated assumption (2), so that the corresponding graph were not a tree. Then we could still try to compute the messages with parallel updates. We are also able to do this in a local “message passing” kind of way, which some people may find physically intuitive. Maybe if we’re lucky, the messages will converge after a reasonable number of iterations. Maybe if we’re even luckier, they will converge to something which gives us information about the marginals . In fact, we’ll see that just such a thing happens in the stochastic block model. More on that later. For now, let’s shift gears and look at belief propagation from a physics perspective.
We’ve just seen a statistical/algorithmic view of how to compute marginals in a graphical model. It turns out that there’s also a physical way to think about this, which leads to a qualitatively similar algorithm. Recall from last week that another interpretation of a pairwise-factorable PDF is that of particles interacting with each other via pairwise forces. In particular, we can imagine each particle interacting with via a force of strength
and in addition, interacting with an external field
We imagine that each of our particles take values from a discrete set . When , we recover the Ising model, and in general we have a Potts model. The energy function of this system is then
with probability distribution given by
Now, for , computing the marginals corresponds to the equivalent physical problem of computing the “magnetizations” .
How does this setup relate to the previous section, where we thought about constraint satisfaction problems and probability distributions? If we could set and , we would recover exactly the probability distribution from the previous section. From a constraint satisfaction perspective, if we set if constraint is satisfied and otherwise, then as (our system becomes colder), ‘s probability mass becomes concentrated only on the satisfying assignments of the constraint satisfaction problem.
We’re now going to try a different approach to computing the marginals: let’s define a distribution , which we will hope to be a good approximation to . If you like, you can think about the marginal as being the “belief” about the state of variable . We can measure the “distance” between and by the KL divergence
which equals 0 iff the two distributions are equal. Let’s define the Gibbs free energy as
So the minimum value of is , which is called the Helmholtz free energy; is the “average energy” and is the “entropy”.
Now for the “free energy minimization part”. We want to choose to minimize , so that we can have that is a good approximation of . If this happens, then maybe we can hope to “read out” the marginals of directly. How do we do this in a way which makes it easy to “read out” the marginals? Here’s one idea: let’s try to write as a function of only the marginals and of . If we could do this, then maybe we could try to minimize by only optimizing over values for “variables” and . However, we need to remember that and are actually meant to represent marginals for some real probability distribution . So at the very least, we should add the consistency constraints and to our optimization problem. We can then think of and as “pseudo-marginals” which obey degree-2 Sherali-Adams constraints.
Recall that we’ve written as a sum of both the average energy and the entropy . It turns out that we can actually write as only a function of the pairwise marginals of :
which follows just because the sums marginalize out the variables which don’t form part of the pairwise interactions:
$latex \sum_{\vec{x}}b(\vec{x})\left(-\sum_{i,j}J_{i,j}(x_i,x_j)-\sum_{i} h_{i}(x_i)\right)$
This is good news: since only depends on pairwise interactions, the average energy component of only depends on and . However, it is not so clear how to express the entropy as a function of one-node and two-node beliefs. But maybe we can try to pretend that our model is really a “tree”. In this case, the following is true:
Claim: If our model is a tree, and and are the associated marginals of our probabilistic model , then we have
where is the degree of vertex in the tree.
It’s not too difficult to see why this is the case: imagine a tree rooted at , with children $latex \partial i$. We can think of sampling from this tree as first sampling from via its marginal , and then recursively sampling the children conditioned on . Associate with the subtrees of the children of , i.e. is equal to the probability distribution of the model restricted to the subtree rooted at vertex . Then we have
where the last line follows inductively, since each only sees edges of .
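The claim can also be checked numerically. The sketch below (my own illustration, not from the text) enumerates a small tree-structured model exactly, then recomputes its entropy from only the single-node and pairwise marginals using the degree-weighted formula of the claim.

```python
import numpy as np
from itertools import product

# A path 0-1-2-3 (a tree), binary variables, random positive potentials.
edges = [(0, 1), (1, 2), (2, 3)]
n, q = 4, 2
rng = np.random.default_rng(1)
psi_node = [rng.random(q) + 0.1 for _ in range(n)]
psi_edge = {e: rng.random((q, q)) + 0.1 for e in edges}

# Exact joint distribution by enumeration.
states = list(product(range(q), repeat=n))
w = np.array([
    np.prod([psi_node[i][x[i]] for i in range(n)])
    * np.prod([psi_edge[e][x[e[0]], x[e[1]]] for e in edges])
    for x in states])
p = w / w.sum()

S_exact = -np.sum(p * np.log(p))  # exact Shannon entropy of the model

# Single-node and pairwise marginals of p.
def marg(idxs):
    m = {}
    for x, px in zip(states, p):
        key = tuple(x[i] for i in idxs)
        m[key] = m.get(key, 0.0) + px
    return m

# Entropy from marginals only: sum of pairwise entropies, minus
# (degree - 1) times each node entropy.
deg = {i: sum(i in e for e in edges) for i in range(n)}
S_tree = 0.0
for e in edges:                       # pairwise entropy terms
    for v in marg(e).values():
        S_tree -= v * np.log(v)
for i in range(n):                    # node terms, weighted by (degree - 1)
    for v in marg((i,)).values():
        S_tree += (deg[i] - 1) * v * np.log(v)

assert abs(S_exact - S_tree) < 1e-10
print("entropy from marginals matches exact entropy:", S_exact)
```

On a non-tree graph the same formula would give the Bethe approximation rather than the exact entropy, which is precisely the leap made in the next paragraph.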
If we make the assumption that our model is a tree, then we can write the Bethe approximation entropy as
where the ‘s are the degrees of the variables in the graphical model defined by . We then define the Bethe free energy as . The Bethe free energy is in general not an upper bound on the true free energy. Note that if we make the assignments , , then we can rewrite as
which is similar in form to the Bethe approximation entropy. In general, we have
which is exactly the Gibbs free energy for a probabilistic model whose associated graph is a tree. Since BP gives the correct marginals on trees, we can say that the BP beliefs are the global minima of the Bethe free energy. However, the following is also true:
Proposition: A set of beliefs gives a BP fixed point in any graph (not necessarily a tree) iff they correspond to local stationary points of the Bethe free energy.
(For a proof, see e.g. page 20 of [4])
So trying to minimize the Bethe free energy is in some sense the same thing as doing belief propagation. Apparently, one typically finds that when belief propagation fails to converge on a graph, the optimization program which is trying to minimize also runs into problems in similar parameter regions, and vice versa.
Now that we’ve seen Belief Propagation from two different perspectives, let’s try to apply this technique of computing marginals to analyzing the behavior of the stochastic block model. This section will heavily follow the paper [2].
The stochastic block model is designed to capture a variety of interesting problems, depending on its settings of parameters. The question we’ll be looking at is the following: suppose we generate a random graph, where each vertex of the graph comes from one of groups each with probability . We add an edge between vertices in groups resp. with probability . For sparse graphs, we define , where we think of as . The problem is the following: given such a random graph, can you label the vertices so that, up to permutation, the labels you choose have high correlation to the true hidden labels which were used to generate the graph? Here are some typical settings of parameters which represent different problems:
We’ll concern ourselves with the case where our graph is sparse, and we need to try and come up with an assignment for the vertices such that we have high correlation with the true labeling of vertices. How might we measure how well we solve this task? Ideally, a labeling which is identical to the true labeling (up to permutation) should get a score of 1. Conversely, a labeling which naively guesses that every vertex comes from the largest group should get a score of 0. Here’s one metric which satisfies these properties: if we come up with a labeling , and the true labeling is , then we’ll measure our performance by
where we maximize over all permutations . When we choose a labeling which (up to permutation) agrees with the true labeling, then the numerator of will equal the denominator, and . Likewise, when we trivially guess that every vertex belongs to the largest group, then the numerator of is and .
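As a sanity check of the two required properties, here is one possible implementation of this overlap score (the function name and the brute-force maximization over permutations are mine; the latter is only feasible for a small number of groups). It assumes the trivial largest-group guess does not already achieve perfect agreement, so the denominator is nonzero.

```python
from itertools import permutations

def overlap(true_labels, guess, q):
    """Agreement with the true labeling, maximized over label permutations,
    rescaled so that guessing the largest group scores 0 and a labeling
    that is perfect up to permutation scores 1."""
    n = len(true_labels)
    best = max(
        sum(guess[i] == perm[true_labels[i]] for i in range(n))
        for perm in permutations(range(q)))
    largest = max(true_labels.count(a) for a in range(q)) / n
    return (best / n - largest) / (1 - largest)

truth = [0, 0, 1, 1, 2, 2]
assert overlap(truth, [2, 2, 0, 0, 1, 1], q=3) == 1.0   # perfect up to permutation
assert overlap(truth, [0, 0, 0, 0, 0, 0], q=3) == 0.0   # guess a single group
```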
Given and a set of observed edges , we can write down the probability of a labeling $latex \{q_i\}$ as
How might we try to infer such that we have maximum correlation (up to permutation) with the true labeling? It turns out that the answer is to use the maximum likelihood estimator of the marginal distribution of each , up to a caveat. In particular, we should label with the such that is maximized. The caveat comes in when is invariant under permutations of the labelings , so that each marginal is actually the uniform distribution. For example, this happens in community detection, when all the group sizes are equal. In this case, the correct thing to do is to still use the marginals, but only after we have “broken the symmetry” of the problem by randomly fixing certain values of the vertices to have particular labels. There’s actually a way belief propagation does this implicitly: recall that we start belief propagation by randomly initializing the messages. This random initialization can be interpreted as “symmetry breaking” of the problem, in a way that we’ll see shortly.
We’ve just seen from the previous section that in order to maximize the correlation of the labeling we come up with, we should pick the labelings which maximize the marginals of . So we have some marginals that we want to compute. Let’s proceed by applying BP to this problem in the “sparse” regime where (other algorithms, like approximate message passing, can be used for “dense” graph problems). Suppose we’re given a random graph with edge list . What does the graph associated with our probabilistic model look like? Well, in this case, every variable is actually connected to every other variable because includes a factor for every , so we actually have a complete graph. However, some of the connections between variables are much weaker than others. In full, our BP update equations are
Likewise
What we want to do is approximate these equations so that we only have to pass messages along the edges , instead of over the complete graph. This will make our analysis simpler, and also allow the belief propagation algorithm to run more efficiently. The first observation is the following: suppose we have two nodes such that . Then we see that , since the only difference between these two variables is two factors of order which appear in the first product of the BP equations. Thus, we send essentially the same messages to non-neighbours of in our random graph. In general though, we have:
The first approximation comes from dropping non-edge constraints on the first product, and is reasonable because we expect the number of neighbours of to be constant. We’ve also defined a variable
and we’ve used the approximation for small . We think of the term as defining an “auxiliary external field”. We’ll use this approximate BP equation to find solutions for our problem. This has the advantage that the computation time is instead of , so we can deal with large sparse graphs computationally. It also allows us to see how a large dense graphical model with only sparse strong connections still behaves like a sparse tree-like graphical model from the perspective of Belief Propagation. In particular, we might have reason to hope that the BP equations will actually converge and give us good approximations to the marginals.
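Here is a rough sketch of the resulting sparse message-passing scheme for two equal-sized groups. For brevity it drops the auxiliary-external-field term entirely (at the factored fixed point with equal group sizes that field is uniform), so it illustrates the O(|E|)-per-sweep structure rather than faithfully implementing the algorithm of [2]. All parameter values and names are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(2)

# 2-group SBM with affinity matrix c_ab = n * p_ab.
q, n = 2, 200
c_in, c_out = 8.0, 2.0
c = np.array([[c_in, c_out], [c_out, c_in]])

# Generate a sparse SBM graph with planted labels.
truth = rng.integers(q, size=n)
edges = [(i, j) for i in range(n) for j in range(i + 1, n)
         if rng.random() < c[truth[i], truth[j]] / n]
nbrs = {i: [] for i in range(n)}
for i, j in edges:
    nbrs[i].append(j); nbrs[j].append(i)

# Random init ("symmetry breaking"), then parallel sweeps along edges only:
# each sweep touches every directed edge once, so the cost is O(|E|), not O(n^2).
msg = {(i, j): rng.dirichlet(np.ones(q)) for i in range(n) for j in nbrs[i]}
for _ in range(30):
    new = {}
    for (i, j) in msg:
        m = np.ones(q)
        for k in nbrs[i]:
            if k != j:
                m *= c.T @ msg[(k, i)]   # sum over t_k of c_{t_k, t_i} * psi^{k->i}_{t_k}
        new[(i, j)] = m / m.sum()
    msg = new

# Beliefs from incoming messages, then label each vertex by its argmax.
beliefs = np.zeros((n, q))
for i in range(n):
    m = np.ones(q)
    for k in nbrs[i]:
        m *= c.T @ msg[(k, i)]
    beliefs[i] = m / m.sum()
labels = beliefs.argmax(axis=1)

# Agreement with the planted labels, up to the global label swap.
agree = max((labels == truth).mean(), ((1 - labels) == truth).mean())
print(f"agreement with planted labels: {agree:.2f}")
```

Since c_in = 8, c_out = 2 is well inside the detectable regime for average degree 5, one expects agreement well above one half here, though the exact value depends on the random graph drawn.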
From now on, we’ll only consider factored block models, which in some sense represent a “hard” setting of parameters. These are models which satisfy the condition that each group has the same average degree . In particular, we require
An important observation for this setting of parameters is that
is always a fixed point of our BP equations, known as a factored fixed point (this can be seen by plugging the fixed-point conditions into the belief propagation equations we derived). If BP ever reaches such a fixed point, we get that and the algorithm fails. However, we might hope that if we randomly initialize , then BP might converge to some non-trivial fixed point which gives us some information about the original labeling of the vertices.
Now that we have our BP equations, we can run numerical simulations to try and get a feel for when BP works. Let’s consider the problem of community detection. In particular, we’ll set our parameters with all group sizes being equal, and with for and vary the ratio , and see when BP finds solutions which are correlated “better than guessing” to the original labeling used to generate the graph. When we do this, we get images which look like this:
It should be mentioned that the point at which the dashed red line occurs depends on the parameters of the stochastic block model. We get a few interesting observations from numerical experiments:
How might we analytically try to determine when BP fails for certain settings of and ? One way we might heuristically try to do this, is to calculate the stability of the factored fixed point. If the fixed point is stable, this suggests that BP will converge to a factored point. If however it is unstable, then we might hope that BP converges to something informative. In particular, suppose we run BP, and we converge to a factored fixed point, so we have that for all our messages . Suppose we now add a small amount of noise to some of the ‘s (maybe think of this as injecting a small amount of additional information about the true marginals). We (heuristically) claim that if we now continue to run more steps of BP, either the messages will converge back to the fixed point , or they will diverge to something else, and whether or not this happens depends on the eigenvalue of some matrix of partial derivatives.
Following this idea, here’s a heuristic way of calculating the stability of the factored fixed point. Let’s pretend that our BP equations occur on a tree, which is a reasonable approximation in the sparse graph case. Let our tree be rooted at node and have depth . Let’s try to approximately calculate the influence on of perturbing a leaf from its factored fixed point. In particular, let the path from the leaf to the root be . We’re going to apply a perturbation for each . In vector notation, this looks like , where is a column vector. The next thing we’ll do is define the matrix of partial derivatives
Up to first order (and ignoring normalizing constants), the perturbation effect on is then (by chain rule) . Since does not depend on , we can write this as , where is the largest eigenvalue of . Now, on a random tree, we have approximately leaves. If we assume that the perturbation effect from each leaf is independent, and that has 0 mean, then the net mean perturbation from all the leaves will be 0. The variance will be
if we assume that the cross terms vanish in expectation.
(Aside: You might want to ask: why are we assuming that has mean zero, and that (say) the noise at each of the leaves are independent, so that the cross terms vanish? If we want to maximize the variance, then maybe choosing the ‘s to be correlated or have non-zero mean would give us a better bound. The problem is that we’re neglecting the effects of normalizing constants in this analysis: if we perturbed all the in the same direction (e.g. non-zero mean), our normalization conditions would cancel out our perturbations.)
We therefore end up with the stability condition . When , a small perturbation will be magnified as we move up the tree, leading to the messages moving away from the factored fixed point after successive iterations of BP (the fixed point is unstable). If , the effect of a small perturbation will vanish as we move up the tree, and we expect the factored fixed point to be stable. If we restrict our attention to graphs of the form for , and have all our groups with size , then is known to have eigenvalues with eigenvector , and . The stability threshold then becomes . This condition is known as the Almeida-Thouless local stability condition for spin glasses, and as the Kesten-Stigum bound for reconstruction on trees. It is also observed empirically that BP and MCMC succeed above this threshold, and converge to factored fixed points below it. The eigenvalues of are related to the belief propagation equations and the non-backtracking matrix. For more details, see [3].
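For concreteness, in the assortative case with q equal groups and average degree c, the instability condition works out to |c_in - c_out| > q * sqrt(c) (this specialization follows [2, 3]; the helper name below is mine). A few lines suffice to evaluate it:

```python
import math

def kesten_stigum_gap(q, c_avg):
    """Smallest |c_in - c_out| at which the factored fixed point becomes
    unstable, for q equal-sized groups with average degree c_avg:
    the Kesten-Stigum condition |c_in - c_out| > q * sqrt(c_avg)."""
    return q * math.sqrt(c_avg)

# With q = 2 groups and average degree 5, detection by BP requires
# |c_in - c_out| > 2 * sqrt(5), roughly 4.47.
assert abs(kesten_stigum_gap(2, 5) - 2 * math.sqrt(5)) < 1e-12

# Consistency check: c_in = 8, c_out = 2 gives average degree (8 + 2)/2 = 5
# and gap 6 > 4.47, so this parameter setting lies above the threshold.
c_in, c_out = 8.0, 2.0
assert abs(c_in - c_out) > kesten_stigum_gap(2, (c_in + c_out) / 2)
```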
We’ve just seen a threshold for when BP is able to solve the community detection problem. Specifically, when , BP doesn’t do better than chance. It’s natural to ask whether this is because BP is not powerful enough, or whether there really isn’t enough information in the random graph to recover the true labeling. For example, if is very close to , it might be impossible to distinguish between group boundaries up to random fluctuations in the edges.
It turns out that for , there is not enough information below the threshold to find a labeling which is correlated with the true labeling [3]. However, it can be shown information-theoretically [1] that the threshold at which one can find a correlated labeling is . In particular, when , there exist exponential-time algorithms which recover a correlated labeling below the Kesten-Stigum threshold. This is interesting, because it suggests an information-computation gap: we observe empirically that heuristic belief propagation seems to perform as well as any other efficient inference algorithm at finding a correlated labeling for the stochastic block model. Yet belief propagation fails at a “computational” threshold that lies above the information-theoretic threshold for this problem. We’ll talk more about these kinds of information-computation gaps in the coming weeks.
[1] Jess Banks, Cristopher Moore, Joe Neeman, Praneeth Netrapalli. Information-theoretic thresholds for community detection in sparse networks. JMLR: Workshop and Conference Proceedings, vol. 49:1–34, 2016.
[2] Aurelien Decelle, Florent Krzakala, Cristopher Moore, Lenka Zdeborová. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. 2013.
[3] Cristopher Moore. The Computer Science and Physics of Community Detection: Landscapes, Phase Transitions, and Hardness. 2017.
[4] Jonathan Yedidia, William Freeman, Yair Weiss. Understanding Belief Propagation and its Generalizations.
[5] Afonso Bandeira, Amelia Perry, Alexander Wein. Notes on computational-to-statistical gaps: predictions using statistical physics. 2018.
[6] Stephan Mertens, Marc Mézard, Riccardo Zecchina. Threshold Values of Random K-SAT from the Cavity Method. 2005.
[7] Andrea Montanari, Federico Ricci-Tersenghi, Guilhem Semerjian. Clusters of solutions and replica symmetry breaking in random k-satisfiability. 2008.
A big thanks to Tselil for all the proofreading and recommendations, and to both Boaz and Tselil for their really detailed post-presentation feedback.
One of the best ways to serve the US-based TCS community is to take up a position at the NSF. Beginning as early as 2019, NSF/CCF is seeking at least one program director for the Algorithmic Foundations core program. This is a rotator position, which is generally two or three years in duration. Please consider applying!
Besides service to the community, there are many other benefits from serving:
– It’s an opportunity to meet a lot of people in one’s own field and others, and to become more well-known in research communities. Some institutions place value on the experience. Many rotators are able to use it to enhance career options.
– A rotator can typically spend 20% (NSF-paid) time on research, including visits back to the home institution. The impact on research and advising may be considerable, but does not have to be a complete hiatus.
– There is a wealth of opportunities for cultural and educational experiences for families who relocate to the area for a few years, which some find to offset the very considerable impacts associated with such a move.
The official posting for AF won’t appear until later, but postings for similar positions can be found here: https://www.nsf.gov/careers/openings/. For further information, please reach out to Tracy Kimbrel (tkimbrel@nsf.gov) or Shuchi Chawla (shuchi@cs.wisc.edu).
Statistical physics is the first topic in the seminar course I am co-teaching with Boaz this fall, and one of our primary goals is to explore this theory. This blog post is a re-working of a lecture I gave in class this past Friday. It is meant to serve as an introduction to statistical physics, and is composed of two parts: in the first part, I introduce the basic concepts from statistical physics in a hands-on manner, by demonstrating a phase transition for the Ising model on the complete graph. In the second part, I introduce random k-SAT and the satisfiability conjecture, and give some moment-method based proofs of bounds on the satisfiability threshold.
Update on September 16, 3:48pm: the first version of this post contained an incorrect plot of the energy density of the Ising model on the complete graph, which I have amended below.
In statistical physics, the goal is to understand how materials behave on a macroscopic scale based on a simple model of particle-particle interactions.
For example, consider a block of iron. In a block of iron, we have many iron particles, and each has a net polarization or “spin” which is induced by the quantum spins of its unpaired electrons. On the microscopic scale, nearby iron atoms “want” to have the same spin. From what I was able to gather on Wikipedia, this is because the unpaired electrons in the distinct iron atoms repel each other, and if two nearby iron atoms have the same spins, then this allows them to be in a physical configuration where the atoms are further apart in space, which results in a lower energy state (because of the repulsion between electrons).
When most of the particles in a block of iron have correlated spins, then on a macroscopic scale we observe this correlation as the phenomenon of magnetism (or ferromagnetism if we want to be technically correct).
In the 1890’s, Pierre Curie showed that if you heat up a block of iron (introducing energy into the system), it eventually loses its magnetization. In fact, magnetization exhibits a phase transition: there is a critical temperature, , below which a block of iron will act as a magnet, and above which it will suddenly lose its magnetism. This is called the “Curie temperature”. This phase transition is in contrast to the alternative, in which the iron would gradually lose its magnetization as it is heated.
We’ll now set up a simple model of the microscopic particle-particle interactions, and see how the global phenomenon of the magnetization phase transition emerges. This is called the Ising model, and it is one of the more canonical models in statistical physics.
Suppose that we have iron atoms, and that their interactions are described by the (for simplicity unweighted) graph with adjacency matrix . For example, we may think of the atoms as being arranged in a 3D cubic lattice, and then would be the 3D cubic lattice graph. We give each atom a label in , and we associate with each atom a spin .
For each choice of spins or state we associate the total energy
.
If two interacting particles have the same spin, then they are in a “lower energy” configuration, and then they contribute to the total energy. If two neighboring particles have opposite spins, then they are in a “higher energy” configuration, and they contribute to the total energy.
We also introduce a temperature parameter . At each , we want to describe what a “typical” configuration for our block of iron looks like. When , there is no kinetic energy in the system, so we expect the system to be in the lowest-energy state, i.e. all atoms have the same spin. As the temperature increases, the kinetic energy also increases, and we will begin to see more anomalies.
In statistical physics, the “description” takes the form of a probability distribution over states . To this end we define the Boltzmann distribution, with density function :
As , becomes supported entirely on the states that minimize ; we call these the ground states (for connected these are exactly ). On the other hand, as , all states are weighted equally according to .
Above we have defined the Boltzmann distribution to be proportional to . To spell it out,
The normalizing quantity is referred to as the partition function, and is interesting in its own right. For example, from we can compute the free energy of the system, as well as the internal energy and the entropy :
$latex F(\beta) = -\frac{1}{\beta} \ln Z(\beta), \qquad \qquad U(\beta) = \frac{\partial}{\partial \beta} (\beta F(\beta)), \qquad \qquad S(\beta) = \beta^2 \frac{\partial}{\partial \beta} F(\beta).$
Using some straightforward calculus, we can then see that is the Shannon entropy,
that is the average energy in the system,
and that the free energy is the difference of the internal energy and the product of the temperature and the entropy,
just like the classical thermodynamic definitions!
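These identities are easy to verify exactly on a tiny system. The sketch below (an arbitrary small example of my own choosing) enumerates all states of a 4-spin Ising chain, computes F, U and S directly from the Boltzmann distribution, and checks the classical relation F = U - TS with T = 1/beta.

```python
import math
from itertools import product

# Tiny Ising chain of 4 spins; everything computed by exact enumeration.
n = 4
edges = [(0, 1), (1, 2), (2, 3)]

def energy(x):
    # E(x) = -sum over edges of x_i * x_j  (lower when neighbors agree)
    return -sum(x[i] * x[j] for i, j in edges)

def thermo(beta):
    states = list(product([-1, 1], repeat=n))
    weights = [math.exp(-beta * energy(x)) for x in states]
    Z = sum(weights)                                      # partition function
    p = [w / Z for w in weights]                          # Boltzmann distribution
    F = -math.log(Z) / beta                               # free energy
    U = sum(pi * energy(x) for pi, x in zip(p, states))   # internal (average) energy
    S = -sum(pi * math.log(pi) for pi in p)               # Shannon entropy
    return F, U, S

for beta in [0.3, 1.0, 3.0]:
    F, U, S = thermo(beta)
    # The classical identity F = U - T*S with T = 1/beta.
    assert abs(F - (U - S / beta)) < 1e-10

# As beta grows, the distribution concentrates on the 2 ground states
# (all +1 and all -1), so the entropy approaches ln 2.
_, _, S_cold = thermo(12.0)
assert abs(S_cold - math.log(2)) < 1e-3
print("F = U - TS verified; cold entropy ~ ln 2")
```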
The free energy, internal energy, and entropy encode information about the typical behavior of the system at temperature . We can get some intuition by considering the extremes, and .
In cold systems with , if we let be the energy of the ground state, be the number of ground state configurations, and be the energy gap, then
where the notation hides factors that do not depend on . From this it isn’t hard to work out that
$latex \begin{aligned} F(\beta) &= E_0 -\frac{1}{\beta} \ln(N_0) + O\left(\exp(-\beta \Delta_E)\right),\\ U(\beta) &= E_0 + O\left(\exp(-\beta \Delta_E)\right),\\ S(\beta) &= \ln N_0 + O\left(\exp(-\beta \Delta_E)\right). \end{aligned}$
We can see that the behavior of the system is dominated by the few ground states. As , all of the free energy can be attributed to the internal energy term.
On the other hand, as ,
$latex \begin{aligned} F(\beta) &= \mathbb{E}_{x\sim\{\pm 1\}^n} E(x) - \frac{n}{\beta} + O(\beta),\\ U(\beta) &= \mathbb{E}_{x\sim\{\pm 1\}^n} E(x) + O(\beta),\\ S(\beta) &= n + O(\beta), \end{aligned}$
and the behavior of the system is chaotic, with the free energy dominated by the entropy term.
We say that the system undergoes a phase transition at if the energy density is not analytic at . Often, this comes from a shift in the relative contributions of the internal energy and entropy terms to . Phase transitions are often associated as well with a qualitative change in system behavior.
For example, we’ll now show that for the Ising model with the complete graph with self loops, the system has a phase transition at (the self-loops don’t make much physical sense, but are convenient to work with). Furthermore, we’ll show that this phase transition corresponds to a qualitative change in the system, i.e. the loss of magnetism.
Define the magnetization of the system with spins to be . If , then we say the system is magnetized.
In the complete graph, normalized so that the total interaction of each particle is , there is a direct relationship between the energy and the magnetization:
The magnetization takes values for . So, letting be the number of states with magnetization , we have that
Now, is just the number of strings with Hamming weight , so . By Stirling’s approximation , where is the entropy function, so up to lower-order terms
Now we apply the following simplification: a sum of nonnegative terms is at least its maximum term and at most the number of terms times the maximum, so after taking logarithms and normalizing by the sum only costs us lower-order terms. Treating our summands as the entries of a vector and replacing the sum with a maximum, we have,
By definition of the energy density,
$latex f(\beta) = \lim_{n \to \infty} \left(- \frac{1}{\beta n} \log Z(\beta)\right) = -\lim_{n\to\infty} \max_{k \in \{-n,\ldots, n\}} \left(\frac{k}{n}\right)^2 + \frac{1}{\beta}H\left(\frac{1}{2}(1+\frac{k}{n})\right),$
and since independently of , we also have
$latex f(\beta) = -\left(\max_{\delta \in [-1,1]} \delta^2 + \frac{1}{\beta}H\left(\frac{1+\delta}{2}\right)\right),$
because the error from rounding to the nearest factor of is .
We can see that the first term in the expression for corresponds to the square of the magnetization (and therefore the energy); the more magnetized the system is, the larger the contribution from the first term. The second term corresponds to the entropy, or the number of configurations in the support; the larger the support, the larger the contribution of the second term. As , the contribution of the entropy term overwhelms the contribution of the energy term; this is consistent with our physical intuition.
We’ll now demonstrate that there is indeed a phase transition in . To do so, we solve for this maximum. Taking the derivative with respect to , we have that
so the derivative is whenever . From this, we can check the maxima. When , there are two maxima equidistant from the origin, corresponding to negatively or positively-magnetized states. When , the maximizer is , corresponding to an unmagnetized state.
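One can locate the maximizers numerically. The sketch below grid-searches the variational expression for f(beta) given above; note that the exact location of the critical temperature depends on the normalization convention for the energy, so the two test temperatures are chosen safely on either side of the transition for this normalization.

```python
import math

def H(p):
    """Binary entropy in nats."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def best_delta(beta, grid=20001):
    """Maximizer of delta^2 + H((1 + delta)/2)/beta over [-1, 1], by grid
    search, following the normalization of f(beta) displayed above."""
    best, arg = -float("inf"), 0.0
    for k in range(grid):
        d = -1 + 2 * k / (grid - 1)
        val = d * d + H((1 + d) / 2) / beta
        if val > best:
            best, arg = val, d
    return arg

# At high temperature (small beta) the entropy term wins and the maximizer
# is delta = 0: no magnetization. At low temperature (large beta) the energy
# term wins and two symmetric magnetized maximizers appear.
assert abs(best_delta(0.3)) < 1e-3
assert abs(best_delta(2.0)) > 0.5
print("unmagnetized at beta=0.3, magnetized at beta=2.0")
```

The two maximizers at low temperature are mirror images of each other, matching the statement above that they correspond to negatively or positively magnetized states.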
Given the maximizers, we now have the energy density. When we plot the energy density, the phase transition at is subtle (an earlier version of this post contained a mistaken plot):
But when we plot the derivative, we can see that it is not smooth at :
And with some calculus it is possible to show that the second derivative is indeed not continuous at .
Qualitatively, it is convincing that this phase transition in the energy density is related to a transition in the magnetization (because the maximizing corresponds to the typical magnetization). One can make this formal by performing a similar calculation to show that the internal energy undergoes a phase transition, which in this case is proportional to the expected squared magnetization, .
The Ising model on the complete graph (also called the Curie-Weiss model) is perhaps not a very convincing model for a physical block of iron; we expect that locality should govern the strength of the interactions. But because the energy and the magnetization are related so simply, it is easy to solve.
Solutions are also known for the 1D and 2D grids; solving it on higher-dimensional lattices, as well as in many other interesting settings, remains open. Interestingly, the conformal bootstrap method that Boaz mentioned has been used towards solving the Ising model on higher-dimensional grids.
For those familiar with constraint satisfaction problems (CSPs), it may have already been clear that the Ising model is a CSP. The spins are Boolean variables, and the energy function is an objective function corresponding to the EQUALITY CSP on (a pretty boring CSP, when taken without negations). The Boltzmann distribution gives a probability distribution over assignments to the variables , and the temperature determines the objective value of a typical .
We can similarly define the energy, Boltzmann distribution, and free energy/entropy for any CSP (and even to continuous domains such as ). Especially popular with statistical physicists are:
In some cases, these CSPs are reasonable models for physical systems; in
other cases, they are primarily of theoretical interest.
As theoretical computer scientists, we are used to seeing CSPs in the context of optimization. In statistical physics, the goal is to understand the qualitative behavior of the system as described by the Boltzmann distribution. Physicists ask algorithmic questions such as:
But these tasks are not so different from optimization. For example, if our system is an instance of 3SAT, when , the Boltzmann distribution is the uniform distribution over maximally satisfying assignments, and so estimating is equivalent to deciding the SAT formula, and sampling from is equivalent to solving the SAT formula. As increases, sampling from corresponds to sampling an approximate solution.
Clearly in the worst case, these tasks are NP-Hard (and even #P-hard). But even for random instances these algorithmic questions are interesting.
In random -SAT, the system is controlled not only by the temperature but also by the clause density . For the remainder of the post, we will focus on the zero-temperature regime , and we will see that -SAT exhibits phase transitions in as well.
The most natural “physical” trait to track in a -SAT formula is whether or not it is satisfiable. When , -SAT instances are clearly satisfiable, because they have no constraints. Similarly when , random -SAT instances cannot be satisfiable, because for any set of variables they will contain all possible -SAT constraints (clearly unsatisfiable). It is natural to ask: is there a satisfiability phase transition in ?
For $latex k = 2$, one can show that the answer is yes. For $latex k \geq 3$, numerical evidence strongly points to this being the case; further, the following theorem of Friedgut gives a partial answer:
Theorem:
For there exists a function such that for any , if is a random -SAT formula on variables with clauses, then
However, this theorem allows for the possibility that the threshold depends on . From a statistical physics standpoint, this would be ridiculous, as it suggests that the behavior of the system depends on the number of particles that participate in it. We state the commonly held stronger conjecture:
Conjecture:
For all , there exists a constant depending only on such that if is a random -SAT instance in variables and clauses, then
In 2015, Jian Ding, Allan Sly, and Nike Sun established the $latex k$-SAT conjecture for all $latex k$ larger than some fixed constant $latex k_0$, and we will be hearing about the proof from Nike later on in the course.
Let us move to a simpler model of random -SAT formulas, which is a bit easier to work with and is a close approximation to our original model. Instead of sampling -SAT clauses without replacement, we will sample them with replacement and also allow variables to appear multiple times in the same clause (so each literal is chosen uniformly at random). The independence of the clauses makes computations in this model simpler.
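A sampler for this with-replacement model is only a few lines (the function names are mine). At very low clause density a random formula is satisfiable with overwhelming probability, which we can confirm by brute force on a small instance.

```python
import random
from itertools import product

def random_ksat(n, m, k, rng):
    """Random k-SAT in the with-replacement model described above: each of
    the m clauses picks its k literals independently and uniformly at
    random (so variables may even repeat within a clause). A literal is
    encoded as +v or -v for variable v in 1..n."""
    return [[rng.choice([1, -1]) * rng.randrange(1, n + 1) for _ in range(k)]
            for _ in range(m)]

def satisfies(assignment, formula):
    # assignment maps each variable index 1..n to True/False.
    return all(any((lit > 0) == assignment[abs(lit)] for lit in clause)
               for clause in formula)

rng = random.Random(0)
n, k = 10, 3
# Clause density alpha = m/n = 0.5, far below the 3-SAT threshold, so the
# formula should be satisfiable; verify by checking all 2^n assignments.
formula = random_ksat(n, m=5, k=k, rng=rng)
sat = any(satisfies(dict(enumerate(bits, start=1)), formula)
          for bits in product([False, True], repeat=n))
print("satisfiable:", sat)
```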
We’ll prove the following bounds. The upper bound is a fairly straightforward computation, and the lower bound is given by an elegant argument due to Achlioptas and Peres.
Theorem:
For every $latex k$, $latex 2^k \ln 2 - O(k) \le \alpha_s(k) \le 2^k \ln 2$.
Let $latex \phi$ be a random $latex k$-SAT formula with $latex m = \alpha n$ clauses, $latex \phi = C_1 \wedge \cdots \wedge C_m$. For an assignment $latex x \in \{\pm 1\}^n$, let $latex \mathbf{1}_\phi(x)$ be the indicator that $latex x$ satisfies $latex \phi$. Finally, let $latex Z = \sum_x \mathbf{1}_\phi(x)$ be the number of satisfying assignments of $latex \phi$.
We have by Markov’s inequality that
$latex \Pr[\phi \text{ is satisfiable}] = \Pr[Z \ge 1] \le \mathbb{E}[Z].$
Fix an assignment $latex x$. Then by the independence of the clauses,
$latex \Pr[\mathbf{1}_\phi(x) = 1] = \prod_{j=1}^m \Pr[C_j(x) = 1] = \left(1 - 2^{-k}\right)^m,$
since each clause rules out exactly a $latex 2^{-k}$ fraction of assignments. Summing over all $latex x \in \{\pm 1\}^n$,
$latex \mathbb{E}[Z] = 2^n \left(1 - 2^{-k}\right)^m.$
We can see that if $latex \alpha > 2^k \ln 2$, this quantity will go to $latex 0$ with $latex n$. So we have:
$latex \alpha_s(k) \le 2^k \ln 2.$
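As a sanity check on this first moment computation, here is a small sketch (the parameter values and helper names are illustrative choices of mine, not from any reference) that brute-forces the average number of satisfying assignments in the with-replacement model and compares it to $latex 2^n(1-2^{-k})^m$:

```python
import itertools
import random

def random_ksat(n, m, k, rng):
    # m clauses; each of the k literals is an independent uniform
    # (variable, sign) pair -- the simplified with-replacement model.
    return [[(rng.randrange(n), rng.random() < 0.5) for _ in range(k)]
            for _ in range(m)]

def num_sat(formula, n):
    # Brute-force count of satisfying assignments (feasible only for small n).
    count = 0
    for bits in itertools.product([False, True], repeat=n):
        if all(any(bits[v] != neg for v, neg in clause) for clause in formula):
            count += 1
    return count

rng = random.Random(0)
n, m, k, trials = 8, 10, 3, 400
avg = sum(num_sat(random_ksat(n, m, k, rng), n) for _ in range(trials)) / trials
predicted = 2**n * (1 - 2**-k) ** m
print(avg, predicted)  # Monte Carlo average vs. 2^n (1 - 2^{-k})^m, about 67.3
```

In this model the formula $latex \mathbb{E}[Z] = 2^n(1-2^{-k})^m$ is exact, since each clause is falsified by a fixed assignment with probability exactly $latex 2^{-k}$ regardless of repeated variables.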
To lower bound $latex \alpha_s(k)$, we can use the second moment method. We’ll calculate the second moment of $latex Z$. An easy use of Cauchy-Schwarz (for non-negative $latex Z$, $latex \mathbb{E}[Z]^2 \le \mathbb{E}[Z^2] \cdot \Pr[Z > 0]$) implies that if there is a constant $latex c > 0$ such that
$latex \frac{\mathbb{E}[Z]^2}{\mathbb{E}[Z^2]} \ge c,$
then $latex Z > 0$ with at least constant probability $latex c$, and then Friedgut’s theorem implies that $latex \phi$ is satisfiable with high probability. From above we have an expression for the numerator, so we now set out to bound the second moment of $latex Z$. We have that
$latex \mathbb{E}[Z^2] = \sum_{x,y} \Pr\left[\mathbf{1}_\phi(x) = \mathbf{1}_\phi(y) = 1\right],$
and by the independence of the clauses,
$latex \Pr\left[\mathbf{1}_\phi(x) = \mathbf{1}_\phi(y) = 1\right] = \Pr[C(x) = C(y) = 1]^m$
for a single random $latex k$-SAT clause $latex C$. But, for $latex x \ne y$, the events $latex \{C(x) = 1\}$ and $latex \{C(y) = 1\}$ are not independent. This is easier to see if we apply inclusion-exclusion,
$latex \Pr[C(x) = C(y) = 1] = 1 - 2\Pr[C(x) = 0] + \Pr[C(x) = C(y) = 0] = 1 - 2^{1-k} + \Pr[C(x) = C(y) = 0].$
The event that $latex C(x) = C(y) = 0$ when $latex C$ is a random $latex k$-SAT clause occurs only when $latex x$ and $latex y$ agree exactly on the assignments to the variables of $latex C$, since otherwise at least one of $latex C(x)$ or $latex C(y)$ must be satisfied (because on each coordinate where $latex x$ and $latex y$ disagree, a literal falsified by $latex x$ is satisfied by $latex y$). Thus, this probability depends on the Hamming distance between $latex x$ and $latex y$.
For $latex x, y$ with $latex \frac{1}{2}\|x - y\|_1 = (1-\delta)n$, so that $latex x$ and $latex y$ agree on a $latex \delta$-fraction of coordinates, the probability that $latex C$’s variables are all in the subset on which $latex x, y$ agree is $latex \delta^k$ (up to lower order terms). So, we have
$latex \Pr[C(x) = C(y) = 0] = \delta^k \cdot 2^{-k},$
and then for $latex x, y$ with agreement fraction $latex \delta$,
$latex \Pr\left[\mathbf{1}_\phi(x) = \mathbf{1}_\phi(y) = 1\right] = \left(1 - 2^{1-k} + \delta^k 2^{-k}\right)^{\alpha n}.$
Because there are $latex 2^n \binom{n}{\delta n}$ pairs at agreement fraction $latex \delta$,
$latex \mathbb{E}[Z^2] = 2^n \sum_{\delta n = 0}^{n} \binom{n}{\delta n} \left(1 - 2^{1-k} + \delta^k 2^{-k}\right)^{\alpha n},$
and using the Stirling bound $latex \binom{n}{\delta n} \le e^{n H(\delta)}$, where $latex H$ is the natural-log entropy function,
$latex \mathbb{E}[Z^2] \le 2^n \sum_{\delta n = 0}^{n} \exp\left(n\left[H(\delta) + \alpha \ln\left(1 - 2^{1-k} + \delta^k 2^{-k}\right)\right]\right).$
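The single-clause pair probability can also be checked empirically. This sketch (with illustrative values of $latex n$, $latex k$, $latex \delta$ chosen by me) samples random clauses in the with-replacement model against a fixed pair of assignments agreeing on a $latex \delta$-fraction of coordinates; in this model the agreement probability $latex \delta^k$ is exact, not just up to lower order terms:

```python
import random

rng = random.Random(1)
n, k, delta = 16, 3, 0.5
agree = int(delta * n)
x = [True] * n
y = x[:agree] + [not b for b in x[agree:]]  # y agrees with x on a delta-fraction

def satisfies(assign, clause):
    # clause: list of (variable index, negated?) literals
    return any(assign[v] != neg for v, neg in clause)

T = 200_000
hits = 0
for _ in range(T):
    clause = [(rng.randrange(n), rng.random() < 0.5) for _ in range(k)]
    if satisfies(x, clause) and satisfies(y, clause):
        hits += 1

predicted = 1 - 2 ** (1 - k) + delta**k * 2**-k
print(hits / T, predicted)  # both should be about 0.7656
```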
Using Laplace’s method, we can show that this sum will be dominated by the terms around the maximizing summand; so defining $latex g(\delta) = H(\delta) + \alpha \ln\left(1 - 2^{1-k} + \delta^k 2^{-k}\right)$,
$latex \frac{1}{n}\ln \mathbb{E}[Z^2] \to \ln 2 + \max_{\delta \in [0,1]} g(\delta).$
If we want that $latex \mathbb{E}[Z^2] \le O(1)\cdot\mathbb{E}[Z]^2$, then we require that
$latex \ln 2 + \max_{\delta} g(\delta) \le \frac{1}{n}\ln \mathbb{E}[Z]^2 = 2\ln 2 + 2\alpha \ln\left(1 - 2^{-k}\right),$
and since $latex g(1/2) = \ln 2 + 2\alpha \ln(1 - 2^{-k})$, this is exactly the requirement that the maximum of $latex g$ is attained at $latex \delta = 1/2$.
However, we can see (using calculus) that the maximum is never attained at $latex \delta = 1/2$: the derivative $latex g'(1/2)$ is strictly positive for every $latex \alpha > 0$. In the plot below one can also see this, though the values are very close when $latex \delta$ is close to $latex 1/2$:
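One can also check this numerically rather than by plotting. A minimal sketch (the values of $latex k$ and $latex \alpha$ are illustrative choices of mine): locate the maximizer of the exponent $latex g(\delta) = H(\delta) + \alpha\ln(1 - 2^{1-k} + \delta^k 2^{-k})$ on a fine grid and confirm it sits strictly above $latex \delta = 1/2$:

```python
import math

def H(x):
    # natural-log binary entropy
    return 0.0 if x in (0.0, 1.0) else -x * math.log(x) - (1 - x) * math.log(1 - x)

def g(delta, k, alpha):
    # exponent of the agreement-delta term in the second moment sum
    return H(delta) + alpha * math.log(1 - 2 ** (1 - k) + delta**k * 2**-k)

k, alpha = 3, 1.0
grid = [i / 10_000 for i in range(1, 10_000)]
best = max(grid, key=lambda d: g(d, k, alpha))
print(best)  # strictly above 0.5, even at this small clause density
```

Since $latex g'(1/2)$ has a strictly positive term $latex \alpha k 2^{-k}\delta^{k-1} / (1-2^{1-k}+\delta^k 2^{-k})$ while the entropy derivative vanishes at $latex 1/2$, the maximizer always drifts above $latex 1/2$.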
So the naive second moment method fails to establish any lower bound on $latex \alpha_s(k)$. The problem here is that the second moment is dominated by the large correlations of close strings; the sum is dominated by pairs of strings which are closer than $latex n/2$ in Hamming distance, which is atypical. For typical pairs, at Hamming distance $latex n/2$, what is relevant is
$latex \left(1 - 2^{1-k} + 2^{-2k}\right)^{\alpha n} = \left(1 - 2^{-k}\right)^{2\alpha n},$
which is equal to $latex \left(\mathbb{E}[Z]/2^n\right)^2$. At distance $latex n/2$, the value of $latex \mathbf{1}_\phi$ on pairs is essentially uncorrelated (one can think of drawing a uniformly random $latex y$ given $latex x$), so such pairs are representative of $latex \mathbb{E}[Z]^2$.
To get a good bound, we have to perform a fancier second moment method calculation, due to Achlioptas and Peres. We will re-weight the terms so that the more typical pairs, at distance $latex n/2$, are dominant. Rather than computing $latex \mathbb{E}[Z^2]$, we compute $latex \mathbb{E}[X^2]$, where
$latex X = \sum_{x} \prod_{j=1}^m w_j(x),$
where $latex w_j(x) = 0$ if $latex x$ does not satisfy the $latex j$th clause, and $latex w_j(x) = \eta^t$ if the $latex j$th clause is satisfied by exactly $latex t$ of $latex x$’s literals, for a parameter $latex \eta \in (0,1)$ to be chosen later. Since $latex X > 0$ only if $latex \phi$ has satisfying assignments, the goal is to still bound $latex \Pr[X > 0]$ from below. Again using the independence of the clauses, we have that
$latex \begin{aligned}
\mathbb{E}[X]
= 2^n \cdot \mathbb{E}[w_j(x)]^m
= 2^n \left(2^{-k}\sum_{t = 1}^k \binom{k}{t} \eta^t \right)^m
= 2^n \left(\left(\frac{1+\eta}{2}\right)^k-2^{-k}\right)^m.\end{aligned}$
And calculating again the second moment, by the independence of the clauses,
$latex \mathbb{E}[X^2] = \sum_{x,y} \mathbb{E}\left[\prod_{j=1}^m w_j(x)\, w_j(y)\right] = \sum_{x,y} \mathbb{E}[w_j(x)\, w_j(y)]^m.$
For a fixed clause $latex C_j$, we can partition the variables in its scope into 4 sets, according to whether $latex x, y$ agree on the variable, and whether the variable does or does not satisfy the corresponding literal under $latex x$. Suppose that $latex C_j$ has $latex a$ variables on which $latex x, y$ agree, and that $latex t$ of these are satisfying for $latex x$ and $latex a - t$ are not.
Then, $latex C_j$ has $latex k - a$ variables on which $latex x, y$ disagree, and any such variable which does not satisfy its literal for $latex x$ must satisfy it for $latex y$. For $latex w_j(x) \cdot w_j(y)$ to be nonzero, either there must be at least one satisfied literal among the variables on which $latex x, y$ agree, or otherwise there must be at least one literal among the variables on which $latex x, y$ disagree which is satisfied for each string. Therefore, if $latex x, y$ agree on a $latex \delta$-fraction of variables,
$latex \begin{aligned}
\mathbb{E}[w_j(x) \cdot w_j(y)]
&= \sum_{a = 1}^k \sum_{t=1}^a \eta^{2t + k - a} \cdot \Pr[C_j \text{ has } a \text{ variables in } x \cap y,\ t \text{ of them satisfy } C_j, \ w_j(x),w_j(y) > 0]\\
&= \left(\sum_{a = 0}^k\sum_{t=0}^a \eta^{k - a + 2t} \cdot \binom{k}{a}\delta^{a}\left(1 - \delta\right)^{k-a} 2^{-a} \binom{a}{t}\right) - \left(\sum_{a=0}^k \eta^{k-a} \binom{k}{a} \delta^a(1-\delta)^{k-a} 2^{-k+1}\right) + \delta^k2^{-k},\end{aligned}$
where the first sum ignores the constraint that $latex w_j(x), w_j(y) > 0$, the second sum subtracts the contribution of the terms where all the literals are falsified by either $latex x$ or by $latex y$, and the final term accounts for the fact that the term where all literals are falsified by both is subtracted twice. Simplifying,
$latex \mathbb{E}[w_j(x) \cdot w_j(y)] = \left(\delta\cdot\frac{1+\eta^2}{2} + (1-\delta)\eta\right)^k - 2^{1-k}\left(\delta + (1-\delta)\eta\right)^k + \delta^k 2^{-k}.$
Define
$latex f_\eta(\delta) = \left(\delta\cdot\frac{1+\eta^2}{2} + (1-\delta)\eta\right)^k - 2^{1-k}\left(\delta + (1-\delta)\eta\right)^k + \delta^k 2^{-k}.$
So we have (using Laplace’s method again)
$latex \frac{1}{n}\ln \mathbb{E}[X^2] \to \ln 2 + \max_{\delta \in [0,1]} \left\{H(\delta) + \alpha \ln f_\eta(\delta)\right\}.$
For the second moment method to succeed, this must match $latex \frac{1}{n}\ln \mathbb{E}[X]^2$. When $latex \delta = 1/2$,
$latex f_\eta(1/2) = \left(\left(\frac{1+\eta}{2}\right)^k - 2^{-k}\right)^2,$
so that $latex \ln 2 + H(1/2) + \alpha \ln f_\eta(1/2) = 2\ln 2 + 2\alpha\ln\left(\left(\frac{1+\eta}{2}\right)^k - 2^{-k}\right)$,
which is equal to the log of the square of the expectation; so again, for the second moment method to succeed, $latex \delta = 1/2$ must be the global maximum.
Guided by this consideration, we can set $latex \eta$ so that the derivative of $latex H(\delta) + \alpha \ln f_\eta(\delta)$ vanishes at $latex \delta = 1/2$, so that the exponent achieves a local maximum at $latex \delta = 1/2$. The choice for this (after doing some calculus) turns out to be the positive, real solution to the equation $latex (1-\eta)(1+\eta)^{k-1} = 1$. With this choice, one can show that the global maximum is indeed achieved at $latex \delta = 1/2$ as long as $latex \alpha \le 2^k \ln 2 - O(k)$. Below, we plot $latex H(\delta) + \alpha \ln f_\eta(\delta)$ with this optimal choice of $latex \eta$ for several values of $latex \alpha$:
So we have bounded $latex \alpha_s(k) \ge 2^k \ln 2 - O(k)$.
What is the correct answer for $latex \alpha_s(k)$? We now have it in a window of size $latex O(k)$. Experimental data and heuristic predictions indicate that it is closer to the upper bound of $latex 2^k \ln 2$ (and in fact for large $latex k$, Ding, Sly and Sun showed that $latex \alpha_s(k) - 2^k \ln 2$ converges to a specific constant, $latex -\frac{1+\ln 2}{2}$). So why can’t we push the second moment method further?
It turns out that there is a good reason for this, having to do with another phase transition. In fact, we know that $latex k$-SAT has not only satisfiable and unsatisfiable phases, but also clustering and condensation phases:
In the clustering phase, there are exponentially many clusters of solutions, each containing exponentially many solutions, and each pair of clusters at Hamming distance $latex \Omega(n)$ from each other. In the condensation phase, there are even fewer clusters, but solutions still exist. We can see evidence of this already in the way that the second moment method failed us. When $latex \alpha$ is large, the global maximum of the exponent is actually attained at $latex \delta$ close to $latex 1$. This is because solutions with large overlap come to dominate the set of satisfying assignments.
One can also establish the existence of clusters using the method of moments. The trick is to compute the probability that two assignments, $latex x$ and $latex y$ with agreement fraction $latex \delta$, are both satisfying for $latex \phi$. In fact, we have already done this. From above,
$latex \Pr[\phi(x) \wedge \phi(y)] = \left(1 - 2^{1-k} + \left(\frac{\delta}{2}\right)^k\right)^{\alpha n}.$
Now, by a union bound, an upper bound on the probability that there exists a pair of solutions with agreement fraction $latex \delta$ for any $latex \delta \in [\delta_1, \delta_2]$ is at most
$latex \Pr[\exists x,y \ s.t. \ \phi(x) \wedge \phi(y), \ \frac{1}{2}\|x-y\|_1 = (1-\delta)n \text{ for } \delta \in [\delta_1,\delta_2]]
\le \sum_{\ell = \delta_1 n}^{\delta_2 n} 2^n \binom{n}{\ell} \left(1 - 2^{1-k} + \left(\frac{\ell}{2n}\right)^k\right)^{\alpha n},$
and if the function $latex f(\delta) = \ln 2 + H(\delta) + \alpha \ln\left(1 - 2^{1-k} + \left(\frac{\delta}{2}\right)^k\right)$ is such that $latex f(\delta) < 0$ for all $latex \delta \in [\delta_1, \delta_2]$, then we conclude that the probability that there is a pair of satisfying assignments at distance between $latex (1-\delta_2)n$ and $latex (1-\delta_1)n$ is $latex e^{-\Omega(n)}$.
Achlioptas and Ricci-Tersenghi showed that for $latex k \ge 8$ and $latex \alpha$ a suitable constant fraction of the way to the threshold, the above function is negative on an interval $latex [\delta_1, \delta_2]$. Rather than doing the tedious calculus, we can verify by plotting the function on such a window:
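Here is a numerical stand-in for that plot ($latex k = 8$ and $latex \alpha = 172$ are illustrative values I chose near the first moment threshold, not the constants from their paper). It checks that the exponent $latex f(\delta) = \ln 2 + H(\delta) + \alpha\ln(1 - 2^{1-k} + (\delta/2)^k)$, where the $latex \ln 2$ accounts for the $latex 2^n$ choices of the first string, is negative throughout a window of agreement fractions, so pairs of solutions at those distances are exponentially unlikely:

```python
import math

def H(x):
    # natural-log binary entropy
    return -x * math.log(x) - (1 - x) * math.log(1 - x)

def f(delta, k, alpha):
    # per-variable exponent of the expected number of pairs of satisfying
    # assignments that agree on a delta-fraction of coordinates
    # (the log 2 term counts the 2^n choices of the first assignment)
    return math.log(2) + H(delta) + alpha * math.log(
        1 - 2 ** (1 - k) + (delta / 2) ** k)

k, alpha = 8, 172  # illustrative choices, not the paper's constants
window = [i / 1000 for i in range(720, 941)]  # agreement fractions 0.72..0.94
worst = max(f(d, k, alpha) for d in window)
print(worst, f(0.5, k, alpha))  # negative on the window, positive at 1/2
```

The exponent is positive at $latex \delta = 1/2$ (typical pairs of solutions exist) but negative across the whole window, which is exactly the forbidden band of overlaps separating clusters.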
They also use the second moment method to show that the clusters are non-empty: that there are exponentially many of them, each containing exponentially many solutions. This gives a proof of the existence of a clustering regime.
This study of the space of solutions is referred to as solution geometry, and understanding solution geometry turns out to be essential in proving better bounds on the critical density.
Solution geometry is also intimately related to the success of local search algorithms such as belief propagation, and to heuristics for predicting phase transitions such as replica symmetry breaking.
These topics and more to follow, in the coming weeks.
In preparing this blog post/lecture I leaned heavily on Chapters 2 and 10 of Marc Mézard and Andrea Montanari’s “Information, Physics, and Computation”, on Chapters 12, 13, and 14 of Cris Moore and Stephan Mertens’ “The Nature of Computation”, and on Dimitris Achlioptas and Federico Ricci-Tersenghi’s manuscript “On the Solution-Space Geometry of Random CSPs”. I also consulted Wikipedia for some physics basics.
===================================
FOCS 2018 – Second Call for Participation
===================================
https://www.irif.fr/~focs2018/
The 59th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2018) will take place in Paris, France, on 7-9 October 2018, with workshops and tutorials on October 6.
The program is now available, together with the list of the workshops/tutorials (as well as a list of a number of co-located events), on the conference webpage.
Early registration rate ends September 9, 2018.
All scientific and local information, and a link to the registration page, can be found at https://www.irif.fr/~focs2018/
Looking forward to seeing you in Paris!
One of the interesting features of physics is the prevalence of “thought experiments”, including Maxwell’s demon, Einstein’s Train, Schrödinger’s cat, and many more. One could think that these experiments are merely “verbal fluff” which obscures the “real math” but there is a reason that physicists return time and again to these types of mental exercises. In a nutshell, this is because while physicists use math to model reality, the mathematical model is not equal to reality.
For example, in the early days of quantum mechanics, several calculations of energy shifts seemed to give out infinite numbers. While initially this was viewed as a sign that something is deeply wrong with quantum mechanics, ultimately it turned out that these infinities canceled each other, as long as you only tried to compute observable quantities. One lesson that physicists drew from this is that while such mathematical inconsistencies may (and in this case quite possibly do) indicate some issue with a theory, they are not a reason to discard it. It is OK if a theory involves mathematical steps that do not make sense, as long as this does not lead to an observable paradox: i.e., an actual “thought experiment” with a nonsensical outcome.
A priori, this seems rather weird. An outsider impression of the enterprise of physics is that it is all about explaining the behavior of larger systems in terms of smaller parts. We explain materials by molecules, molecules by atoms, and atoms by elementary particles. Every term in our mathematical model is supposed to correspond to something “real” in the world.
However, with modern physics, and in particular quantum mechanics, this connection breaks down. In quantum mechanics we model the state of the world using a vector (or “wave function”) but the destructiveness of quantum measurements tells us that we can never know all the coordinates of this vector. (This is also related to the so called “uncertainty principle”.) While physicists and philosophers can debate whether these wave functions “really exist”, their existence is not the reason why quantum mechanics is so successful. It is successful because these wave functions yield a mathematically simple model to predict observations. Hence we have moved from trying to explain bigger physical systems in terms of smaller physical systems to trying to explain complicated observations in terms of simpler mathematical models. (Indeed the focus has moved from “things” such as particles to concepts such as forces and symmetries as the most fundamental notions.) These simpler models do not necessarily correspond to any real physical entities that we’d ever be able to observe. Hence such models can in principle contain weird things such as infinite quantities, as long as these don’t mess up our predictions for actual observations.
Nevertheless, there are still real issues in physics that people have not been able to settle. In particular the so called “standard model” uses quantum mechanics to explain the strong force, the weak force, and the electromagnetic force, which dominate over short (i.e., subatomic) distances, but it does not incorporate the force of gravity. Gravity is explained by the theory of general relativity which is inconsistent with quantum mechanics but is predictive for phenomena over larger distances.
By and large physicists believe that quantum mechanics will form the basis for a unified theory, which will incorporate gravity by putting general relativity on quantum mechanical foundations. One of the most promising approaches in this direction is known as the AdS/CFT correspondence of Maldacena, which we describe briefly below.
Alas, in 2012, Almheiri, Marolf, Polchinski, and Sully (AMPS) described a thought experiment, known as the “firewall paradox”, that showed a significant issue with any quantum-mechanical description of gravity, including the AdS/CFT correspondence. Harlow and Hayden (see also chapter 6 of Aaronson’s notes and this overview of Susskind) proposed a way to resolve this paradox using computational complexity.
In this post I will briefly discuss these issues. Hopefully someone in Tselil’s and my upcoming seminar will present this in more detail and also write a blog post about it.
Edwin Abbott’s 1884 novel “Flatland” describes a world in which people live in only two dimensions. At some point a sphere visits this world, and opens the eyes of one of its inhabitants (the narrator, who is a square) to the fact that its two-dimensional world was merely an illusion and the “real world” actually has more dimensions.
However, modern physics suggests that things might be the other way around: we might actually be living in flatland ourselves. That is, it might be that the true description of our world has one less spatial dimension than what we perceive. For example, though we think we live in three dimensions, perhaps we are merely shadows (or a “hologram”) of a two-dimensional description of the world. One can ask: how could this be? After all, if our world is “really” two dimensional, what happens when I climb the stairs in my house? The idea is that the geometry of the two-dimensional world is radically different, but it contains all the information that would allow one to decode the state of our three-dimensional world. You can imagine that when I climb the stairs in my house, my flatland analog goes from the first floor to the second floor in (some encoding of) the two-dimensional blueprint of my house. (Perhaps this lower-dimensional representation is the reason the Wachowskis called their movie “The Matrix” as opposed to “The Tensor”?)
The main idea is that in this “flat” description, gravity does not really exist and physics has a pure quantum mechanical description which is scale free in the sense that the theory is the same independently of distance. Gravity and our spacetime geometry emerge in our world via the projection from this lower dimensional space. (This projection is supposed to give rise to some kind of string theory.) As far as I can tell, at the moment physicists can only perform this projection (and even this at a rather heuristic level) under the assumption that our universe is contracting, or in physics terminology an “anti de-Sitter (AdS) space”. This is the assumption that the geometry of the universe is hyperbolic and hence one can envision spacetime as being bounded in some finite area of space: some kind of a $latex (d+1)$-dimensional cylinder that has a $latex d$-dimensional boundary. The idea is that all the information on what’s going on in the inside or bulk of the cylinder is encoded in this boundary. One caveat is that our physical universe is actually expanding rather than contracting, but as the theory is hard enough to work out for a contracting space, at the moment they sensibly focus on this more tractable setting. Since the quantum mechanical theory on the boundary is scale free (and also rotation invariant) it is known as a Conformal Field Theory (CFT). Thus this one to one mapping of the boundary and the bulk is also known as the “AdS/CFT correspondence”.
If it is possible to carry over this description, in terms of information it would be possible to describe the universe in purely quantum mechanical terms. One can imagine that the universe starts at some quantum state $latex |\psi_0\rangle$, and at each step in time progresses to the state $latex U|\psi_t\rangle$, where $latex U$ is some unitary transformation.
In particular this means that information is never lost. However, black holes pose a conundrum to this view since they seem to swallow all information that enters them. Recall that the “escape velocity” of earth – the speed needed to escape the gravitational field and go to space – is about 25,000 mph or Mach 33. In a black hole the “escape velocity” is the speed of light, which means that nothing, not even light, can escape it. More specifically, there is a certain region in spacetime which corresponds to the event horizon of a black hole. Once you are in this event horizon then you have passed the point of no return, since even if you travel at the speed of light, you will not be able to escape. Though it might take a very long time, eventually you will perish in the black hole’s so called “singularity”.
Entering the event horizon should not feel particularly special (a condition physicists colorfully refer to as “no drama”). Indeed, as far as I know, it is theoretically possible that 10 years from now a black hole would be created in our solar system with a radius larger than 100 light years. If this future event will happen, this means that we are already in a black hole event horizon even though we don’t know it.
The above seems to mean that information that enters the black hole is irrevocably lost, contradicting unitarity. However, physicists now believe that through a phenomenon known as Hawking radiation black holes might actually emit the information that was contained in them. That is, if the qubits that enter the event horizon are in the state $latex |\psi\rangle$ then (up to a unitary transformation) the qubits that are emitted in the radiation would be in the state $latex |\psi\rangle$ as well, and hence no information is lost. Indeed, Hawking himself conceded the bet he made with Preskill on information loss.
Nevertheless, there is one fly in this ointment. If we drop an $latex n$-qubit state into this black hole, then it is eventually radiated out (in the same state, up to an invertible transformation), but the original qubits never come out. (It is a black hole after all.) Since we now have two copies of these qubits (one inside the black hole and one outside it), this seems to violate the famous “no cloning principle” of quantum mechanics, which says that you can’t copy a qubit. Luckily, however, this seemed to be one more of those cases where it is an issue with the math that could never affect an actual observer. The reason is that an observer inside the black hole event horizon can never come out, while an observer outside can never peer inside. Thus, even if the no cloning principle is violated in our mathematical model of the whole universe, no such violation would be seen by either an outside or an inside observer. In fact, even if Alice, a brave observer outside the event horizon, obtained the state of the Hawking radiation and then jumped with it into the event horizon so that she could see a violation of the no-cloning principle, it wouldn’t work. The reason is that by the time all the qubits are radiated, the black hole fully evaporates, and inside the black hole the original qubits have already entered the singularity. Hence Alice would not be able to “catch the black hole in the act” of cloning qubits.
What AMPS noticed is that a more sophisticated (yet equally brave) observer could actually obtain a violation of quantum mechanics. The idea is the following. Alice will wait until almost all (say 99 percent) of the black hole has evaporated, at which point she can observe most of the qubits of the Hawking radiation, which we call $latex R$, while there are still some qubits inside the event horizon that have not yet reached the singularity. So far, this does not seem to be any violation of the no cloning principle, but it turns out that entanglement (which you can think of as the quantum analog of mutual information) plays a subtle role. Specifically, for information to be preserved, the radiation will be in a highly entangled state, which means in particular that if we look at the qubit $latex B$ that has just radiated from the event horizon, then it will be highly entangled with the qubits $latex R$ we observed before.
On the other hand, from the continuity of spacetime, if we look at a qubit $latex A$ that is just adjacent to $latex B$ but inside the event horizon, then it will be highly entangled with $latex B$ as well. For our classical intuition, this seems to be fine: a random variable $latex B$ could have large mutual information with two distinct random variables $latex R$ and $latex A$. But quantum entanglement behaves differently: it satisfies a notion known as monogamy of entanglement, which implies that the sum of the entanglement of a qubit with two disjoint registers can be at most one. (Monogamy of entanglement is actually equivalent to the no cloning principle, see for example slide 14 here.)
Specifically, Alice could use a unitary transformation to “distill” from $latex R$ a qubit that is highly entangled with $latex B$, and then jump with it into the event horizon to observe there a triple of qubits (the distilled qubit, $latex B$, and $latex A$) which violates the monogamy of entanglement.
One potential solution to the AMPS paradox is to drop the assumption that spacetime is continuous at the event horizon. This would mean that there is a huge energy barrier (i.e., a “firewall”) at the event horizon. Alas, a huge wall of fire is as close as one can get to the definition of “drama” without involving Omarosa Manigault Newman.
The “firewall paradox” is a matter of great debate among physicists. (For example, after the AMPS paper came out, a “rapid response workshop” was organized for people to suggest possible solutions.) As mentioned above, Daniel Harlow and Patrick Hayden suggested a fascinating way to resolve this paradox. They observed that to actually run this experiment, Alice would have to apply a certain “entanglement distillation” unitary transformation to the qubits of the Hawking radiation. However, under reasonable complexity assumptions, computing this transformation would require an exponential number of quantum gates! This means that by the time Alice is done with the computation, the black hole is likely to have completely evaporated, and hence there would be nothing left to jump into!
The above is by no means the last word of this story. Other approaches for resolving this paradox have been put forward, as well as ways to poke holes in the Harlow-Hayden resolution. Nor is it the only appearance of complexity in the AdS/CFT correspondence or quantum gravity at large. Indeed, the whole approach places much more emphasis on the information content of the world as opposed to the more traditional view of spacetime as the fundamental “canvas” for our universe. Hence information and computation play a key role in understanding how our spacetime can emerge from the conformal picture.
In the fall seminar, we will learn more about these issues, and will report here as we do so.
Johan will be presented with the award at the upcoming FOCS.
While it’s by no means “great literature”, Factor Man is a fun page-turner. I think it can be a particularly enjoyable read for computer scientists, as it might prompt you to come up with your own scenarios as to how things would play out if someone discovered such an algorithm. Unsurprisingly, the book is not technically perfect. The technical error that annoyed me the most was that the protagonist demonstrates his algorithm by factoring integers of sizes that can in fact be fairly easily factored today (the book refers to factoring 128 or 256 bit numbers as impressive, while 768 bit integers of general form have been factored, see also this page and this paper). If you just imagine that whenever the book says “$latex n$ bit” numbers it actually means $latex n$ byte numbers, then this is fine. Network security researchers might also take issue with other points in the book (such as the ability of the protagonist to use gmail and blogspot without being identified by either the NSA or Google, as well as using a SAT algorithm to provide a “final security patch” for a product).
Regardless of these technical issues, I recommend reading this book if you’re the type of person that enjoys both computer science and spy thrillers, and I do plan to mention it to students taking my introduction to theoretical CS course.