[Guest post by Alex Kelser, Alex Lin, Amil Merchant, and Suproteem Sarkar, scribing for a lecture by Andrej Risteski.]
Andrej Risteski: Approximating Partition Functions via Variational Methods and Taylor Series
For a probability distribution defined up to a constant of proportionality, we have already seen the partition function. To refresh your memory, given a probability mass function over all , the partition function is simply the normalization constant of the distribution, i.e.
At first glance, the partition function may appear to be uninteresting. However, upon taking a deeper look, this single quantity holds a wide array of intriguing and complex properties. For one thing, its significance in the world of thermodynamics cannot be overstated. The partition function is at the heart of relating the microscopic quantities of a system – such as the individual energies of each probabilistic state – to macroscopic entities describing the entire system: the total energy, the energy fluctuation, the heat capacity, and the free energy. For explicit formulas, consult this source for reference. In machine learning the partition function also holds much significance, since it’s intimately linked to computing marginals in the model.
Although there exists an explicit formula for the partition function, the challenge lies in computing the quantity in polynomial time. Suppose is a vector of dimension where each for . Specifically, we would like a smarter way of finding than simply adding up over all combinations of .
Andrej Risteski’s lecture focused on using two general techniques – variational methods and Taylor series approximations – to find provably approximate estimates of .
The general setup addresses undirected graphical models, also known as a Markov random fields (MRF), where the probability mass function has the form
for some random, -dimensional vector and some set of parameterized functions . Note that the notation denotes all pairs where and the edge exists in the graph.
The talk considered the specific setup where each , so . Also, we fix
for some set of coefficients , thereby giving us the well-known Ising model. Note, the interaction terms could be more complicated and made “less local” if desired, but that was not discussed in the lecture.
These graphical models are common in machine learning, where there are two common tasks of interest:
- Learning: given samples, models try to learn or
- Inference: given or the potentials as input, we want to calculate the marginal probabilities, such as
We focus on the latter task. The problem of inference is closely related to calculating the partition function. This value is often used as the normalization constant in many methods, and it is classically defined as
Although we can write down the aforementioned closed-form expression for the partition function, it is difficult to calculate this quantity in polynomial time. There are two broad approaches to solving inference problems:
- Randomized: Methods based on Markov chain Monte Carlo (MCMC), such as Gibbs sampling, that construct a Markov chain whose stationary distribution is the distribution of interest. These have been more extensively studied in theoretical computer science.
- Deterministic: Variational methods, self-avoiding walk trees, Taylor expansion methods
Although more often used in practice, randomized algorithms such as MCMC are notoriously hard to debug, and it is often unclear at what point the chain reaches stationarity.
In contrast, variational methods are often more difficult to rigorously analyze but have the added benefit of turning inference problems into optimization problems. Risteski’s talk considered some provable instantiations of variational methods for calculating the partition function.
Let us start with a basic observation, known as the Gibbs variational principle. This can be stated as the following.
Lemma 1: Let . Then, we can show that the corresponding partition function satisfies
where is defined to be a valid distribution over and is the set of all such distributions.
Note that we use to denote the expectation of the inner argument with respect to the distribution and to denote the Shannon entropy of .
Proof: For any , the KL divergence between and must be greater than or equal to 0.
Equality holds if , leading directly to the lemma above.
Note that the proof would have also held in the more generic exponential family with instead of . Also, at most temperatures, one of the 2 terms, either the energy or the entropy will often dominate.
As a result of this lemma, we have framed the original inference problem of calculating as an optimization problem over . This is the crux of variational inference.
The issue with this approach is that it is difficult to optimize over the polytope of distributions due to the values of each coming from . In general, this is NOT tractable. We can imagine two possible solutions:
- Inner Approximation
Instead of optimizing over all , pick a “nice” subfamily of distributions to constrain . The prototypical example is the mean-field approximation in which we let , where each is a univariate distribution. Thus, we approximate with
This would provide a lower bound on . There are a number of potential issues with this method. For one thing, the functions are typically non-convex. Even ignoring this problem, it is difficult to quantify how good the approximation is.
- Outer Approximation
Note: Given a distribution , we use to denote the marginal for some set .
In the outer approximation, we relax the polytope we are optimizing over using convex hierarchies. For instance, we can define as the polytope containing valid marginals over subsets of size at most . We can then reformulate our objective as a two-part problem of (1) finding a set of pairwise marginals that optimizes the energy term and (2) finding a distribution with pairwise marginals matching that optimizes the entropy term. For simplicity of notation, we use to denote , where . It follows that we can rewrite the Gibbs equation as
The first term here represents the energy on pairwise marginals, whereas the second term is the maximum entropy subject to a constraint about matching the energy distribution’s pairwise marginals. The goal of this procedure is to enlarge the polytope such that is a tractable set, where we can impose a polynomial number of constraints satisfied by real marginals.
One specific convex hierarchy that is commonly used for relaxation is the Sherali-Adams (SA) hierarchy. Sherali-Adams will allow us to formulate the optimization of the energy term (and an approximation of the entropy term) as a convex program. We introduce the polytope for , which will relax the constraints on to allow for values outside of in order to generate a polynomial-time solution for .
The Sherali-Adams hierarchy will take care of the energy term, but it remains unclear how to rewrite the entropy term in the context of the convex program. In fact, we will need approximations for in order to accomplish this task.
In this talk, we’ll consider two approximations: one classical one – the Bethe entropy and one more recent one – the augmented mean-field pseudo-entropy .
Bethe Entropy Approximation
The Bethe entropy works by pretending that the Markov random field is a tree. In fact, we can show that if the graph is a tree, then using the Bethe entropy approximation in the convex program defined by will yield an exact calculation of .
Specifically, the Bethe entropy is defined as
where is defined to be the degree of the particular vertex . Note that there are no marginals over sets of dimension greater than two in this expression; thus, we have a valid convex program over the polytope .
This lemma is well-lnown, but it will be a useful warmup:
Lemma 2 Define the output of the convex program
On a tree, and the optimization objective is concave with respect to the variables, so it can be solved in polynomial time.
Proof sketch We will prove 3 claims:
Since the energy term is exact, it suffices to show that for valid marginals , . This can be done by re-writing and using a property of conditional entropy, namely that .
- For trees, is concave in the variables.
This proof was only sketched in class but is similar to the usual proof of concavity of entropy.
- For trees, we can round a solution to a proper distribution over which attains the same value for the original optimization.
The distribution that we will produce is
where we start at the root and keep sampling down the tree. Based on the tree structure, . The energy is also the same since . Therefore, since both terms in the equation are equal to the respective terms in the partition function, we get that the equation must equal the partition function.
- is commonly used on graphs that are not trees.
- However, for non-trees, the optimization is often no longer concave.
- Therefore, the value obtained by the output equation is neither an upper or lower bound.
- The fixed points of the belief propagation algorithm give a 1-1 correspondence with the stationary points of the Bethe objective. (See a classical paper by Yedidia et al..)
Augmented Mean Field Pseudo-Entropy
For this approximation, we will focus on dense graphs: namely graphs in which each vertex has degree for some . To simplify things, we focus on distributions of the form
where is some positive constant parameter. The Bethe entropy approximation is undesirable in the dense case, because we cannot bound its error on non-trees. Instead, we proved the following theorem.
Theorem (Risteski ’16)
For a dense graph with parameter , there exists an outer approximation based on and an entropy proxy which achieves a additive approximation to for some . There is an algorithm with runtime .
To parse the theorem statement, consider the potential regimes for :
- One can ignore the entropy portion and solve for the energy portion that dominates. (So the guarantee is not that interesting.)
- MCMC methods give a -factor approximation in time poly. (It’s not clear if the methods suggested here can give such a guarantee – this is an interesting problem to explore.)
- MCMC mixes slowly, but there is no other way to get a comparable guarantee. This is the interesting regime!
Proof Sketch The proof strategy is as follows: We will formulate a convex program under the SA() relaxation that can return a solution in polynomial time with value . Note that this value of the relaxation will be an upperbound on the true partition function value . The convex program solution may not be a valid probability distribution, so from this, we will construct a “rounded solution’’ – an actual distribution with value . It follows that
We will then aim to put a bound on the gap between and ; namely, that
This equivalently places a bound on the optimality gap between and , thereby proving the theorem. Here are the main components of the proof:
1. Entropy Proxy
To approximate , we will use the pseudo-augmented mean field entropy approximation . This is defined as
Using , we can write a convex program under SA() that provides a relaxation for optimizing . Specifically, we will let . Let be the outputed solution to this relaxation. We define the rounded solution as
Using the chain rule for entropy, we can show that
2. Rounding Scheme The above distribution is actually the same distribution that correlation rounding, as introduced by a seminal paper of Barak, Raghavendra, and Steurer, produces. By using definition of mutual information and Pinsker’s Theorem showing that , we can work through some algebra to show that
Then, putting together the entropy and energy effects, we can prove the main theorem.
- The above proof easily extends to a more general notion of density, as well as to sparse graphs with “low-rank’’ spectral structure. See the paper for more information.
- In fact, with more work, one can extract guarantees on the the best mean field approximation to the Gibbs distribution in the dense regime. (This intuitively can be done as the distribution produced above is almost mean-field, except for the set ). This is interesting, since as we mentioned, the quality of the mean-field approximation is usually difficult to characterize. (Moreover, the procedure is algorithmic.) See the recent work of Jain, Koehler and Risteski ’18 for more details and results.
Taylor Series Approximation
Another method of calculating the partition function involves Taylor expanding its logarithm around a cleverly selected point. This approach was first developed by Barvinok ’14, for the purpose of computing the partition function of cliques in a graph. The mathematical techniques developed in this paper can be naturally extended to evaluate a variety of partition functions, including that of the ferromagnetic Ising model. We will use this example to illustrate the general idea of the Taylor expansion technique.
The goal here is to devise a deterministic, fully polynomial-time approximation scheme (an FPTAS) for evaluating the partition function of the ferromagnetic Ising model. This means that the the algorithm must run in poly() and be correct within a multiplicative factor of . We will work in the general case where the Hamiltonian includes an external magnetic field term, as well as the neighboring spin interaction terms. Using logarithmic identities, we can re-write the partition function slightly differently from last time:
Here, is called the vertex activity (or external field), and characterizes the likelihood of a vertex to be in the + configuration. Additionally, is referred to as the edge activity, and characterizes the propensity of a vertex to agree with its neighbors. The ferromagnetic regime (where agreement of spins is favored) corresponds to . denotes the number of edge cut in the graph, which is equivalent to the number of neighboring pairs with opposite spins.
As one final disclaimer, the approximation scheme presented below is not valid for , which corresponds to the zero-magnetic field case. Randomized algorithms exist which can handle this case.
The general idea here is to hold one parameter fixed ( in our case), and express the logarithm of the partition function as a Taylor series in the other parameter (). From a physics standpoint, the Taylor expansion around tells us how the partition function is changing as a function of the magnetic field. Without loss of generality, we focus on the case where . This simplification can be justified by a simple symmetry argument. Consider as the inverse of where all the 1’s (spin-up) are flipped to -1s (spin-down) and vice-versa. The partition function of this flipped system can be related to the original system by a constant factor:
This holds because the number of cut edges remains constant when the values are flipped. Since these two partition functions are related by this factor constant in , for any model with , we can simply consider the flipped graph and use the above relation.
For our purposes, it will be more convenient to approximate the logarithm of the partition function, because a multiplicative approximation of corresponds to an additive approximation of . In fact, if we allow the partition function to be complex, then we can easily show that an additive bound of for guarantees a multiplicative bound of for .
For notational convenience we define:
Thus, the Taylor expansion of around if given by:
The big question now is: Can we get a good approximation from just the lower order terms of the Taylor expansion for ? Equivalently, can we can bound the sum of the higher order terms at ? Additionally, we still must address the question of how to actually calculate the lower order terms.
To answer these questions, we make simple observations about the derivatives of in relation to the derivatives of .
The last equation just comes from repeated application of the product rule. Using these relations, we can solve for the first derivatives of if we have the first derivatives of using a triangular linear system of equations. As , we see that the system is non-degenerate.
Note, is a n-degree polynomial with leading coefficient of 1 and constant coefficient of 1 (corresponding to the configurations where all vertices are positive / negative respectively). Using this form, we can re-write the partition function in terms of its (possibly complex) roots :
Supposing we keep the first terms, the error is given bounded by:
Here we can invoke the Lee-Yang theorem, which tells us that the zeroes of for a system with ferromagnetic interactions lie on the unit circle of the complex plane. So the Lee-Yang theorem guarantees that , and we see that the error due to truncation is ultimately bounded by:
Now by the previous symmetry argument, we can assume that . Thus, to achieve an error bound of , we must have:
Rearranging terms and taking the natural log of both sides (which is justified given that ), we see that the inequality is satisfied if:
Thus, we need only retain the first terms of the Taylor series expansion of . This will involve calculating the first derivatives of , which naively can be done in quasi-polynomial time. This is done by running over all , where . This takes time, which is quasi-polynomial.
Recent work by Patel and Regts, as well as Liu, Sinclair and Srivastiva has focused on evaluating these coefficients more efficiently (in polynomial time), but is outside the scope of this lecture. Clever counting arguments aside, to trivially calculate the -th derivative, we need only sum over the vectors with spin-up components.
Most of the heavy lifting in the above approach is done by the Lee-Yang theorem. In this section, we sketch how it is proved.
First, let’s define the Lee-Yang property, which we will refer to as the LYP. Let be some multilinear polynomial. has the Lee-Yang property if for any complex numbers such that for all i, then .
We can then show that the partition function for the ferromagnetic Ising model, which we wrote as
must have this LYP. In the antiferromagnetic case it turns out that all zeroes lie on the negative real axis, but we will focus on the ferromagnetic case.
For the proof that the partition function has the LYP, we will use Asano’s contraction argument. This relies on the fact that certain operations preserve LYP and we can “build” up to the partition function of the full graph by these operations.
- Disjoint union: This is fairly simple to prove since the partition function for the disjoint union of and would just factor into the multiplication of the two partition function of the original graphs . Where and have the LYP, then it is clear that has the same LYP.
- Contraction: Suppose we produce a graph from by contracting 2 vertices in , which means merging them into one vertex. It can be shown that if has the LYP, then also has the LYP. We write the partition function for G as
The “contracted” graph amounts to deleting the middle 2 terms so the partition function for can be written as
We want to show that the partition function for has the LYP. Because the partition function of has the LYP, we can consider the case where . By the LYP,
if . By Vieta’s formulas we can find a relation between A and D,
Now, to show that the partition function of G’ has the LYP, we assume there is a such that . For this to be true,
However, this means that so there is no solution where such that the partition function of G’ is 0 and so it has the LYP.
The final part of this proof is to construct the original graph. We observe is that for a single edge, the partition function has the LYP. In this case, we easily write out the partition function for 2 vertices:
Suppose : then would imply . However, this is just the Möbius transform mapping the exterior of the unit disk to the interior, so . Thus, it cannot be the case that both and have absolute value greater than one.
Since single edges have the LYP. We break the graph into single edges, with copies of vertices. These copies are then contracted and we build up the graph to show that the partition function has the LYP.
Knowing that the partition function has the LYP, direct application of the Lee-Yang theorem guarantees that the roots are on the unit circle in the complex plane.
Although there exists an explicit formula for the partition function, the challenge lies in computing the quantity in polynomial time. Andrej Risteski’s lecture focused on two less-explored methods in theoretical computer science – namely variational inference and Taylor series approximations – to find provably approximate estimates. Both of these approaches are relatively new and replete with open problems.