# Inference and statistical physics

Scribe notes by Franklyn Wang

Previous post: Robustness in train and test time Next post: Natural Language Processing (guest lecture by Sasha Rush). See also all seminar posts and course webpage.

## Digression: Frequentism vs Bayesianism

Before getting started, we’ll discuss the difference between the two dominant schools of thought in probability: Frequentism and Bayesianism.

Frequentism holds that the probability of an event is its long-run average frequency over many repetitions, so a coin has 50% probability of being heads if on average half of the flips are heads. One consequence of this framework is that a one-off event like Biden winning the election doesn’t have any probability, as by definition it can only be observed once.

Bayesians reject this model of the world. For example, a famous Bayesian, Jaynes, even wrote that “probabilities are frequencies only in an imaginary universe.”

While these branches of thought are different, generally the answers to most questions are the same, so the distinction will not matter for this class. However, these branches of thought inspire different types of methods, which we now discuss.

For example, suppose that we have a probability distribution $D$ from which we can draw samples $x$. In statistics, our goal is to calculate a hypothesis $h \in \mathcal{H}$ which minimizes some objective $\mathcal{L}_{D}(h)$ (typically a loss function, but any objective to be minimized will do), where we estimate $\mathcal{L}_D$ with our samples $x$.

A frequentist does this by defining a family $\mathcal{D}$ of potential distributions, and we need to find a transformation $\vec{x} \mapsto h(\vec{x})$ which minimizes the cost for all $D \in \mathcal{D}$, where

$h = \arg\min_{h(\cdot)} \max_{D \in \mathcal{D}} \mathbb{E}_{\vec{x} \sim D}\left[\mathcal{L}_{D}(h(\vec{x}))\right].$

This equation amounts to saying that $h$ minimizes the worst-case loss over all distributions.

By contrast, a Bayesian approaches this task by assuming that $D = D_{\vec{w}}$ where $\vec{w} \sim P$ are latent variables sampled from a prior.

Then, let $Q_{\vec{x}}(\vec{w}) = P(\vec{w} \mid \vec{x})$ be the posterior distribution of $\vec{w}$ conditioned on the observations $\vec{x}$. We now minimize

$h = \arg\min_{h \in \mathcal{H}} \mathbb{E}_{\vec{w} \sim Q_{\vec{x}}}\left[\mathcal{L}_{D_{\vec{w}}}(h)\right].$

In fact, Bayesian approaches extend beyond this, as if you can sample from a posterior you can do more than just minimize a loss function.

In Bayesian inference, choosing the prior $P$ is often very important. One classical approach is a maximum-entropy prior, because it is the least informative (hence, the most entropy) and requires the fewest assumptions.

These two minimization equations can sometimes lead to identical results, but in practice they often work out differently once we introduce computational constraints into the picture. In the frequentist approach, we generally constrain the family of mappings $\vec{x} \mapsto h(\vec{x})$ to be efficiently computable. In the Bayesian approach, we typically approximate the posterior instead, and either use approximate sampling or restrict the hypothesis class for the posterior so that we can efficiently sample from it.

## Part 1. An introduction to Statistical Physics

Compare water and ice. Water is hotter, and its molecules move around more; in ice, by contrast, the molecules are more stationary. When the temperature increases, the particles move more quickly, and when it decreases they have less energy and slow down. There are also phase transitions, where certain temperatures cause qualitative discontinuities in behavior, like solid to liquid or liquid to gas.


An atomic state can be thought of as $x \in \{0, 1\}^{n}$. Now, how do we represent the system’s state? The crucial observation that makes statistical physics different from “vanilla physics” is the insight to represent the system state as a probability distribution $p$ supported on $\{0, 1\}^{n}$, rather than as a single state.

Each atomic state has a negative energy function (which we will think of as a utility function) $W: \{0, 1\}^{n} \rightarrow \mathbb{R}$. In some sense, the system “wants” to have a high value of $\mathbb{E}_{x \sim p}[ W(x)]$. In addition, when the temperature is high, the system “wants” to have a high value of entropy.

Thus, an axiom of thermodynamics states that to find the true distribution, we need only look at the maximizer of

$p = \arg\max_{q} \mathbb{E}_{x \sim q}[W(x)] + \tau \cdot H(q).$

That is, it is the probability distribution which maximizes a linear combination of the expected negative energy and the entropy, with the temperature controlling the coefficient of entropy in this combination (the higher the temperature, the more the system prioritizes having high entropy).

The variational principle, which we prove later, states that the $q$ which maximizes this satisfies

$p(x) \propto \exp(\tau^{-1} \cdot W(x))$

which is known as the Boltzmann Distribution. We often write this with a normalizing constant, so $p(x) = \exp(\tau^{-1} \cdot W(x) - A_{\tau}(W))$, where $A_{\tau}(W) = \log \left(\int \exp(\tau^{-1} \cdot W(x))\right)$.
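To see how the temperature trades off energy against entropy, here is a minimal sketch (with a made-up three-state negative-energy function) that computes the Boltzmann distribution at a low and a high temperature:

```python
import math

def boltzmann(W, tau):
    """Return the Boltzmann distribution p(x) ∝ exp(W(x)/tau) over the keys of W."""
    m = max(W.values())                     # subtract the max for numerical stability
    unnorm = {x: math.exp((v - m) / tau) for x, v in W.items()}
    Z = sum(unnorm.values())
    return {x: u / Z for x, u in unnorm.items()}

# A toy negative-energy function on three states (values made up for illustration).
W = {"a": 1.0, "b": 0.5, "c": 0.0}
cold = boltzmann(W, tau=0.1)    # low temperature: mass concentrates on argmax W
hot = boltzmann(W, tau=100.0)   # high temperature: close to uniform (max entropy)
```

At $\tau = 0.1$ nearly all mass sits on the maximizer of $W$, while at $\tau = 100$ the distribution is nearly uniform.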

Before proving the variational principle, we will go through some examples of statistical physics.

### Example: Ising Model

In the Ising model, we have $n$ magnets that are connected in a square grid. An atomic state is $x \in \{\pm 1\}^{n}$, which represents the “spins”. The value of $W(x)$ is $J\sum_{i \sim j} x_i x_j + J'\sum_{i} x_i$, where $i \sim j$ denotes that $i$ and $j$ are adjacent magnets. This encourages adjacent magnets to have aligned spins. One important concept, which we will return to later, is that the values of $\{x_i\}$ and $\{x_i x_j\}$ are sufficient to calculate the value of $W(x)$ (these are known as sufficient statistics). Furthermore, if we wanted to calculate $\mathbb{E}[W(x)]$, it would be enough to calculate the values of $\mathbb{E}[x_i]$ and $\mathbb{E}[x_i x_j]$ and then apply linearity of expectation.

In a low-temperature Ising model, $W(x)$ is high because almost all adjacent spins are aligned (hence the large, well-defined regions of equal spin).
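To make the energy function concrete, here is a small sketch that evaluates the Ising $W(x)$ on an $n \times n$ grid; the coupling values $J$ and $J'$ are arbitrary choices for illustration:

```python
def ising_W(x, J=1.0, Jp=0.5):
    """Negative energy of spins x (an n-by-n grid of ±1 values):
    W(x) = J * sum over adjacent pairs of x_i x_j  +  J' * sum_i x_i."""
    n = len(x)
    pair_term = 0.0
    field_term = 0.0
    for i in range(n):
        for j in range(n):
            field_term += x[i][j]
            if i + 1 < n:                       # neighbor below
                pair_term += x[i][j] * x[i + 1][j]
            if j + 1 < n:                       # neighbor to the right
                pair_term += x[i][j] * x[i][j + 1]
    return J * pair_term + Jp * field_term

aligned = [[+1, +1], [+1, +1]]    # all spins agree: every edge contributes +1
mixed = [[+1, -1], [-1, +1]]      # checkerboard: every edge contributes -1
```

As expected, the aligned configuration has much higher $W$ than the checkerboard one.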

The Sherrington-Kirkpatrick model is a generalization of the Ising model to a random graph, which represents a disordered mean field model. Here, $x \in \{\pm 1\}^{n}$ still represents spin, and $W(x) = \sum w_{i, j} x_i x_j$ where $w_{i, j} \sim \mathcal{N}(0, 1)$. The SK model is deeply influential in statistical physics. We say that it is disordered because the utility function $W$ is chosen at random, and that it is a mean field model because, unlike the Ising model, there is no geometry: every pair of individual variables $x_i$ and $x_j$ is equally likely to be connected.

Our third example is the posterior distribution, where $x$ is a hidden variable with a uniform prior. We now make, say, $k$ independent observations $O_1, O_2, \ldots O_k$. The probability of $x$ is given by

$p(x) \propto p(x\text{ satisfies } O_1) \cdot p(x\text{ satisfies } O_2) \ldots p(x\text{ satisfies } O_k)$

so $p(x) \propto \exp\left(\sum_i \log \mathbb{P}(x\text{ satisfies } O_i)\right)$. Note that the RHS is easy to calculate, but in practice the normalizing factor (also known as the partition function) can be difficult to calculate and often represents a large barrier: being able to compute the partition function would make many of these questions far easier.
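As a toy illustration of this posterior computation (with made-up likelihood tables for a hidden variable taking three values), one can accumulate log-likelihoods and only normalize at the end:

```python
import math

# Toy posterior: hidden variable x in {0, 1, 2} with a uniform prior, and
# three independent observations with made-up likelihood tables p(O_i | x).
likelihoods = [
    {0: 0.1, 1: 0.6, 2: 0.3},    # p(O_1 | x)
    {0: 0.2, 1: 0.5, 2: 0.3},    # p(O_2 | x)
    {0: 0.1, 1: 0.7, 2: 0.2},    # p(O_3 | x)
]

# Unnormalized log-posterior: a sum of log-likelihoods (the "easy" part)...
logp = {x: sum(math.log(table[x]) for table in likelihoods) for x in range(3)}
# ...and the normalizing constant, i.e. the partition function (easy here
# because there are only 3 states; the hard part in general).
Z = sum(math.exp(v) for v in logp.values())
posterior = {x: math.exp(v) / Z for x, v in logp.items()}
```

With these numbers, the posterior concentrates on $x = 1$, the value most consistent with all three observations.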

### Proof of the Variational Principle

Now, we prove the variational principle, which states that if $p(x) = \exp(\tau^{-1} \cdot W(x) - A_\tau(W))$, where $A_\tau(W) = \log \int \exp(\tau^{-1} \cdot W(x))$ is the normalizing factor, then

$p = \arg\max_{q} \mathbb{E}_{x \sim q} W(x) + \tau \cdot H(q)$

Before proving the variational principle, we make the following claim: if $p$ is defined as $p(x) = \exp(\tau^{-1} \cdot W(x) - A_\tau(W))$, then

$\mathbb{E}_{x \sim p} W(x) + \tau \cdot H(p) = \tau \cdot A_\tau(W).$

Proof of claim:
Write
$H(p) = -\int \log p(x) p(x) dx = -\int (\tau^{-1} \cdot W(x) - A_{\tau}(W)) p(x) dx$
(where we plugged in the definition of $p(x)$ in $\log p(x)$).

We can now rewrite this as

$H(p) = -\tau^{-1} \cdot \mathbb{E}_{x \sim p} W(x) + \mathbb{E}_{x \sim p} A_{\tau}(W)$

which means that

$H(p) = -\tau^{-1} \cdot \mathbb{E}_{x \sim p} W(x) + A_{\tau}(W)$

using the fact that $A_\tau(W)$ is a constant (independent of $x$). Multiplying by $\tau$ and rearranging, we obtain the claim. $\blacksquare$

Proof of variational principle:
Given the claim, we can now prove the variational principle.
Let $q$ be any distribution. We write

$\mathbb{E}_{x \sim q} W(x) + \tau \cdot H(q) = \tau \cdot \mathbb{E}_{x \sim q}[\log p(x) + A_\tau(W)] + \tau \cdot H(q) = \tau \cdot A_\tau(W) - \tau \cdot \mathrm{KL}(q \,\|\, p) \le \tau \cdot A_\tau(W),$

where the first equation uses our likelihood expression $W(x) = \tau \cdot (\log p(x) + A_\tau(W))$, and the inequality uses the nonnegativity of the KL divergence, with equality iff $q = p$. Since the claim shows that $q = p$ attains the value $\tau \cdot A_\tau(W)$, this implies that $\mathbb{E}_{x \sim q} W(x) + \tau \cdot H(q)$ is maximized at $q = p$, as desired. $\blacksquare$
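The variational principle can also be checked numerically by brute force. Below is a small sketch (with an arbitrary three-state $W$ and temperature) comparing the Boltzmann distribution's objective value against many random distributions:

```python
import math
import random

def free_energy(q, W, tau):
    """E_{x~q}[W(x)] + tau * H(q), for q a distribution over range(len(W))."""
    energy = sum(q[i] * W[i] for i in range(len(W)))
    entropy = -sum(q[i] * math.log(q[i]) for i in range(len(W)) if q[i] > 0)
    return energy + tau * entropy

W = [1.0, 0.3, -0.5]                      # arbitrary negative-energy values
tau = 0.7
Z = sum(math.exp(v / tau) for v in W)
p = [math.exp(v / tau) / Z for v in W]    # the Boltzmann distribution

best = free_energy(p, W, tau)             # by the claim, equals tau * A_tau(W)
rng = random.Random(0)
others = []
for _ in range(1000):
    raw = [rng.random() for _ in W]
    total = sum(raw)
    others.append(free_energy([r / total for r in raw], W, tau))
```

No random distribution beats the Boltzmann distribution, and the maximum value equals $\tau \cdot \log Z = \tau \cdot A_\tau(W)$.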

Remarks:

• When $\tau = 1$, we have $A(W) = \max_{q} H(q) + \mathbb{E}_{x \sim q} W(x)$.
• In particular, for every distribution $q$, we have that

$A(W) \ge \mathbb{E}_{x \sim q} W(x) + H(q),$

which we’ve seen before! Note that equality holds when $q = p$, but to approximate $A$ we can simply take more tractable values of $q$.

In many situations, we can compute $W(x)$, but can’t compute $A_{\tau}(W)$, which stifles many applications. One upshot of this, though, is that we can calculate ratios of $p(x)$ and $p(x')$, which is good enough for some applications, like Markov Chain Monte Carlo.

### Markov Chain Monte Carlo (MCMC)

An important task in statistics is to sample from a distribution $p$, even for very complicated $p$. MCMC does this by constructing a Markov chain whose stationary distribution is $p$. The most common instantiation of MCMC is Metropolis-Hastings, which we now describe. First, we assume that there exists an undirected graph on the states $x$, so that $x \sim x'$ iff $x' \sim x$, and that for $x \sim x'$, the probabilities of $x$ and $x'$ are similar in the sense that $p(x)/p(x')$ is neither too small nor too large. Then the Metropolis-Hastings algorithm is as follows.

Metropolis-Hastings Algorithm:

1. Draw $x_0$ at random.
2. For $i = 1, 2, \ldots$, choose a random neighbor $x' \sim x_{i-1}$, and let

$x_i = \begin{cases} x' & \text{with probability } \min\left(1, \ p(x')/p(x_{i-1})\right) \\ x_{i-1} & \text{otherwise.} \end{cases}$
Then, $x_t$ eventually is distributed as $p$! To show that this samples from $p$, we show that the stationary distribution of this Markov chain is $p$. In this case, this turns out to be easy, since we can check the detailed balance condition $p(x' | x)p(x) = \min{(p(x), p(x'))} = p(x | x')p(x')$, so $p$ is the stationary distribution of the Markov Chain.
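The algorithm can be sketched in a few lines; the energy function below is a made-up toy example. Note that only the ratio $p(x')/p(x)$, i.e. energy differences, is ever needed:

```python
import math
import random

def metropolis_hastings(W, n, tau=1.0, steps=200_000, seed=0):
    """Sample from p(x) ∝ exp(W(x)/tau) over {0,1}^n using single-bit-flip
    proposals.  Only the ratio p(x')/p(x) = exp((W(x') - W(x))/tau) is ever
    used, so the partition function never has to be computed."""
    rng = random.Random(seed)
    x = tuple(rng.randint(0, 1) for _ in range(n))
    counts = {}
    for _ in range(steps):
        i = rng.randrange(n)                       # propose flipping one coordinate
        xp = x[:i] + (1 - x[i],) + x[i + 1:]
        if rng.random() < min(1.0, math.exp((W(xp) - W(x)) / tau)):
            x = xp                                 # accept the move
        counts[x] = counts.get(x, 0) + 1
    return counts

# Toy energy W(x) = number of ones (a made-up example), so p(x) ∝ e^{|x|}.
counts = metropolis_hastings(lambda x: sum(x), n=3)
```

Empirically, the all-ones state is visited most often, matching its highest Boltzmann weight.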

The stationary distribution is often unique, so we’ve proven that this eventually samples from $p$. In MCMC algorithms, however, an important question is how fast we converge to the stationary distribution. Often this is rather slow, which is especially disappointing, because if it were faster then many very difficult problems, like generic optimization, could be solved much more easily.
Indeed, there are examples where convergence to the stationary distribution would take exponential time.

Note: One way to make MCMC faster is to propose $x'$ really close to $x$, because then the likelihood ratio is closer to $1$, proposals are rejected less often, and $x_i$ spends less time stuck at its current location. There’s a tradeoff, however. If $x'$ is too close to $x$, then the chain will not mix as quickly, where mixing means the convergence of $x_t$ to the stationary distribution: generally, to mix, the value $x_t$ must be able to get close to every point, and making each step smaller makes that take longer.

### Applications of MCMC: Simulated Annealing

One application of MCMC is in simulated annealing. Suppose that we have $W: \{0, 1\}^{n} \rightarrow \mathbb{R}$, and we want to find $x = \arg \max_{x} W(x)$. The most direct attempt at solving this problem is creating a Markov chain that simply samples from $x \in \arg\max_{x} W(x)$. However, this is impractical, at least directly. It is like we have a domino that is far away from us, and we can’t knock it down. What do we do in this case? We put some dominoes in between us!

To this end, we create a sequence of Markov chains on the state space, whose stationary distributions are $p_{\tau}(x) \propto \exp(\tau^{-1} \cdot W(x))$ for smaller and smaller $\tau$. This corresponds to cooling a system from a high temperature to a low temperature. Essentially, we want to sample from $p_0(x)$, which is uniform over the maximizers of $W$.

Simulated annealing lets us begin by sampling from $p_{\infty}$, which is uniform on the support. Then we can slowly reduce $\tau$ from $\infty$ to $0$. When cooling a system, it will be helpful to think of two stages. First, a stage in which the object cools down. Second, a settling stage in which the system settles into a more stable state. The settling stage is simulated via MCMC on $p_{\tau}(x)$, so the transition probability from $x$ to $x'$ is $\min(1, \text{exp}(\tau^{-1} \cdot (W(x') - W(x))))$.

Simulated Annealing Algorithm:

1. Cooling Begin $\tau$ at $\infty$ (or very large), and lower the value of $\tau$ to zero according to some schedule.
1a. Settling Now, repeatedly move from $x$ to $x'$ with probability $\min(1, \text{exp}(\tau^{-1} \cdot (W(x') - W(x))))$.

Since simulated annealing is inspired by physical intuition, its shortcomings can be interpreted physically too. Namely, when you cool something too quickly it settles into a glassy state instead of a ground state; this is what happens when simulated annealing fails and gets stuck in a local optimum. Note how this is fundamentally a failure mode of MCMC: it can’t mix quickly enough.
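The two-stage loop above can be sketched as follows, on a toy objective with a unique maximum; the cooling schedule and settling length are ad hoc choices:

```python
import math
import random

def simulated_anneal(W, n, schedule, sweeps=50, seed=0):
    """Approximately maximize W over {0,1}^n: run bit-flip Metropolis moves
    ("settling") while lowering the temperature according to `schedule`."""
    rng = random.Random(seed)
    x = tuple(rng.randint(0, 1) for _ in range(n))
    for tau in schedule:                            # step 1: cooling
        for _ in range(sweeps):                     # step 1a: settling at this tau
            i = rng.randrange(n)
            xp = x[:i] + (1 - x[i],) + x[i + 1:]
            if rng.random() < min(1.0, math.exp((W(xp) - W(x)) / tau)):
                x = xp
    return x

# Toy objective with a unique maximum at the all-ones string (made up for
# illustration); the geometric cooling schedule is an arbitrary choice.
target_W = lambda x: sum(x) - 0.5 * (x[0] != x[-1])
schedule = [10.0 * 0.9 ** k for k in range(60)]
best = simulated_anneal(target_W, n=8, schedule=schedule)
```

On this easy, nearly monotone landscape the chain reliably reaches the global maximum; on glassy landscapes it can instead freeze into a local optimum.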


## Part 2: Bayesian Analysis

Note that the posterior of $\vec{w}$ conditioned on $x_1, \ldots x_n$ is

$p(\vec{w} | x_1, \ldots x_n) = \frac{p(\vec{w})}{p(\vec{x})} \cdot p(x_1 | \vec{w}) p(x_2 | \vec{w}) \cdots p(x_n | \vec{w}) \propto \exp\left(\sum_i X_i(\vec{w})\right),$

where $X_i(\vec{w}) = \log p(x_i \mid \vec{w})$ (absorbing the prior into the constant of proportionality when it is uniform).

We now have $p_{W}(x) = \exp(W(x) - A(W))$, an exponential distribution. Now suppose that $W(x) = \langle w, \hat{x} \rangle$, where $\hat{x} \in \mathbb{R}^{m}$ are the sufficient statistics of $x$. For example, the energy function could follow that of a tree, so $W(x) = \sum_{(i, j) \in T} w_{i, j} x_i x_j$. (By following a tree, we mean that the undirected graphical model of the distribution is that of a tree.) If you want to find the expected value $\mathbb{E}_{x \sim p_{W}}[W(x)]$, it is enough to know $\mu = \mathbb{E}_{x \sim p_{W}}[\hat{x}]$.

### Sampling from a Tree-Structured Distribution

Now how do we sample from a tree-structured distribution? The most direct way is by calculating the marginals. While marginalization is generally quite difficult, it is much more tractable on certain types of graphs. For a sense of what this entails, suppose that we have the following tree-structured statistical model:

whose joint PDF is $p(x) \propto \exp(\sum_{(i, j) \in E} -w_{i, j} (x_i - x_j)^2)$, where $x \in \{0, 1\}^{n}$. (In lecture the running example was a tree on six variables $x_1, \ldots, x_6$, with an edge between $x_3$ and $x_6$.)
(The unusual notation is chosen so that the relationships between the marginals are clean. The log-density in question is a Laplacian quadratic form.)

Then one can show that if the marginal of $x_3$ is $\mu_3 = \mathbb{P}(x_3 = 1)$, then we have that

$\mu_6 = \mu_3 \cdot \frac{e^{w_{3,6}}}{1 + e^{w_{3, 6}}} + (1 - \mu_3) \cdot \frac{1}{1 + e^{w_{3, 6}}}$

This gives six linear equations in six unknowns, which we can solve. Once we have marginalized a variable, say $x_6$, we can sample it from its marginal, and then sample the remaining variables from the conditional distribution given $x_6$, thereby sampling from the entire distribution. This method is known as belief propagation.
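The propagation formula above can be checked against brute-force marginalization on a small path graph; the edge weights are arbitrary:

```python
import itertools
import math

def marginals_bruteforce(edges, n):
    """Exact marginals P(x_i = 1) of p(x) ∝ exp(-sum_{(i,j)} w_ij (x_i - x_j)^2)
    over x in {0,1}^n, by summing over all 2^n states."""
    probs = {}
    for x in itertools.product([0, 1], repeat=n):
        score = -sum(w * (x[i] - x[j]) ** 2 for (i, j), w in edges.items())
        probs[x] = math.exp(score)
    Z = sum(probs.values())
    return [sum(pr for x, pr in probs.items() if x[i] == 1) / Z for i in range(n)]

def child_marginal(mu_parent, w):
    """Propagate a marginal across one tree edge with weight w."""
    return mu_parent * math.exp(w) / (1 + math.exp(w)) \
        + (1 - mu_parent) / (1 + math.exp(w))

# A 3-node path 0 - 1 - 2 with arbitrary edge weights.
edges = {(0, 1): 0.7, (1, 2): 1.3}
mu = marginals_bruteforce(edges, 3)
# Each leaf marginal follows from its neighbor's marginal by one propagation step.
```

Because each leaf's conditional distribution given its neighbor involves only the one connecting edge, the propagated value matches the exact marginal.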

This algorithm (with some modifications) also works for general graphs, but what it computes on general graphs is not the exact marginal, but rather the marginal of an extension (an “unrolling”) of that graph to a tree.

### General Exponential Distributions

Now let’s try to develop more theory for exponential distributions. Let the PDF of this distribution be $p_W(x) = \exp(\langle w, \hat{x} \rangle - A(w))$. A few useful facts about these distributions that often appear in calculations: $\nabla A(w) = \mu$ and $\nabla^2 A(w) = \mathrm{Cov}_{p_{W}}(\hat{x}) \succeq 0$.

By the variational principle, we have that among all distributions with the same sufficient statistics as a given Boltzmann distribution, the true one maximizes the entropy, so

$p_w = \arg\max_{\mathbb{E}_{q} \hat{x} = \mu} H(q).$

An analogous idea gives us
$\mu = \arg \max_{H(\mu) \ge H(p_{w})} \langle w, \mu \rangle,$

so $p_w$ is the maximum entropy distribution consistent with observations $\mu$.

The important consequence of these two facts is that in principle, $w$ determines $\mu$ and vice versa. (Up to fine print of using minimal sufficient statistics, which we ignore here.) We will refer to $w$ as the “canonical parameter space” and $\mu$ as the “mean representation space”. Now note that the first equation gives us a way of going from $\mu$ to $w$, and the second equation gives us a way to go from $w$ to $\mu$, at least information-theoretically.

Now, how do we do this algorithmically? Using $w$, if we are able to sample from $p_w$ (which is generally possible if we can evaluate $A(w)$), then we can estimate the mean $\mu$ through samples. We can also obtain $\mu$ from $A(w)$ by estimating $\nabla A$.

On the other hand, if you have $\mu$, to obtain $w$ you first consider $A^{\ast}(\mu) = \sup_{w \in \Omega} \{\langle \mu, w \rangle - A(w)\}$ and then note that the desired value is $w = \arg\max_{w \in \Omega}\{\langle \mu, w \rangle - A(w)\}$, which by conjugate duality equals $\nabla A^{\ast}(\mu)$; so solving this problem boils down to estimating $\nabla A^{\ast}$ efficiently.

Thus going from $w$ to $\mu$ requires estimating $\nabla A$, whereas going from $\mu$ to $w$ requires estimating $\nabla A^{\ast}$.
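In one dimension these two maps can be written in closed form. For the Bernoulli family $p_w(x) = \exp(wx - A(w))$ on $x \in \{0,1\}$ we have $A(w) = \log(1 + e^w)$, and the two gradient maps are the mutually inverse sigmoid and logit functions; a quick numerical sketch:

```python
import math

# One-dimensional Bernoulli family: x in {0, 1}, p_w(x) = exp(w*x - A(w)).
def A(w):
    """Log-partition function A(w) = log(sum_x exp(w*x)) = log(1 + e^w)."""
    return math.log(1 + math.exp(w))

def mean_from_w(w):
    """Canonical -> mean parameters: mu = grad A(w) = e^w / (1 + e^w)."""
    return math.exp(w) / (1 + math.exp(w))

def w_from_mean(mu):
    """Mean -> canonical parameters: w = grad A*(mu) = log(mu / (1 - mu))."""
    return math.log(mu / (1 - mu))

w = 1.7                                  # arbitrary canonical parameter
mu = mean_from_w(w)

# grad A(w), computed by central finite differences, should match mu.
grad_A = (A(w + 1e-6) - A(w - 1e-6)) / 2e-6
```

Here going back and forth between $w$ and $\mu$ is exact; in high dimensions it is precisely these two gradient maps that become hard to compute.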

When $p$ is a posterior distribution, the observed data typically gives us the weights $w$, and hence the inference problem becomes to use that to sample from the posterior.

#### Examples of Exponential Distributions

There are many examples of exponential distributions, which we now give.

• High-Dimensional Normals: $x\in \mathbb{R}^{d}$, $W(x) = -(x - \mu)^{\top} \Sigma^{-1} (x-\mu)$
• Ising Model: $x\in \{0, 1\}^{d}$, $W(x) = \sum w_i x_i + \sum_{i, j \in E} w_{i, j} x_i x_j$. A sufficient statistic for these is $\hat{x} = (x, xx^{\top})$, a fact which we will invoke repeatedly.
• There are many, many more, including Gaussian Markov Random Fields, Latent Dirichlet Allocation, and Mixtures of Gaussians.

#### Going from canonical parameters to mean parameters

Now, we show how to go from $w$ to $\mu$ in a special case, namely under the mean-field approximation. The mean-field approximation approximates a distribution over $(x_1, \ldots, x_d) \in \{0, 1\}^{d}$ by a product distribution, for which

$\mathbb{P}(x_1, \ldots x_d) = \mathbb{P}(x_1)\mathbb{P}(x_2) \cdots \mathbb{P}(x_d).$

Recall that the partition function can be computed as

$A(w) = \max_{q \in \mathcal{Q}} \langle w, \mathbb{E}_{q} \hat{x} \rangle + H(q).$

where $\mathcal{Q}$ is the set of all probability distributions. If we instead write $\mathcal{Q}$ as the set of product distributions (parametrized by $\mu$, or the probability that each variable is $1$), we get

$A(w) \ge \max_{\mu \in \mathcal{Q}} \sum w_i \mu_i + \sum w_{i,j} \mu_i \mu_j + \sum H(\mu_i),$

where $H(\mu_i) = -\mu_i \log \mu_i - (1 - \mu_i) \log(1 - \mu_i)$, and it now suffices to maximize the right-hand function over the cube $K = [0,1]^d$.

We can generally maximize concave functions over a convex set such as $K$. Unfortunately, this function is not concave. However, it is concave in every coordinate. This suggests the following algorithm: fix all but one variable and maximize over that variable, and repeat. This approach is known as Coordinate Ascent Variational Inference (CAVI), and its pseudocode is given below.

CAVI Algorithm

1. Let $\mu = 1/2 \cdot (1, 1, \ldots 1)$
2. Repeatedly choose values of $j$ in $[n]$ (possibly at random, or in a loop).
2a. Update $\mu_{j} = \arg\max_{x_j} f(\mu_{-j}, x_j)$, where $f$ is the right-hand objective above and $\mu_{-j}$ denotes the non-$j$ coordinates of $\mu$.
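A sketch of CAVI for the mean-field objective above: setting the coordinate derivative to zero gives the closed-form update $\mu_j = \sigma(w_j + \sum_i w_{i,j} \mu_i)$, since $\frac{d}{d\mu_j} H(\mu_j) = \log\frac{1-\mu_j}{\mu_j}$. The couplings below are made-up ferromagnetic values:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def cavi_ising(h, J, iters=100):
    """CAVI for the mean-field objective of W(x) = sum_i h_i x_i +
    sum_{i<j} J_ij x_i x_j over {0,1}^d.  Each coordinate update
    mu_j = sigmoid(h_j + sum_i J_ij mu_i) exactly maximizes the objective
    in mu_j with the other coordinates held fixed."""
    d = len(h)
    mu = [0.5] * d                       # step 1: start at the uniform product
    for _ in range(iters):
        for j in range(d):               # step 2: sweep the coordinates in a loop
            field = h[j] + sum(J[i][j] * mu[i] for i in range(d) if i != j)
            mu[j] = sigmoid(field)       # step 2a: exact coordinate maximizer
    return mu

# A tiny ferromagnetic example with a positive external field (made-up numbers);
# with everything positive, the fixed point pushes every mu_j close to 1.
h = [1.0, 1.0, 1.0]
J = [[0.0, 2.0, 2.0], [2.0, 0.0, 2.0], [2.0, 2.0, 0.0]]
mu = cavi_ising(h, J)
```

Each sweep can only increase the (coordinate-wise concave) objective, so the iterates converge to a local maximum, though not necessarily the global one.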

## Part 3. Solution landscapes and the replica method

This part is best explained visually. Suppose that we have a Boltzmann distribution. In the infinite temperature limit this is the uniform distribution, spread over the entire domain. As you decrease the temperature, the support’s size decreases. (Note: The gifs below are just cartoons. They pretend that the probability distribution is always uniform over a subset. Also, in most cases of interest for learning, only higher-order derivatives of the entropy will be discontinuous.)

Sometimes there is a discontinuity, and the entropy “suddenly drops”. Often the entropy function itself is continuous, but its derivatives are not continuous at that point: a higher-order phase transition.

Sometimes the geometry of the solution space undergoes a phase transition as well, with it “shattering” to several distinct clusters.

### The replica method

Note: We will have an extended post on the replica method later on.

If you sampled $x_1, \ldots x_n$ from $p_{w}$, where $p_w$ is a high-dimensional distribution, you’d expect the pairwise distances to all be approximately the same. We therefore approximate the overlap matrix

$\begin{bmatrix} \langle x_1, x_1 \rangle & \cdots & \langle x_1, x_n \rangle \\ \vdots & \ddots & \vdots \\ \langle x_n, x_1 \rangle & \cdots & \langle x_n, x_n \rangle \end{bmatrix}$

as

$\begin{bmatrix} 1 & \cdots & q \\ \vdots & \ddots & \vdots \\ q & \cdots & 1 \end{bmatrix}$

where $q$ is some constant.

Suppose $w$ comes from a probability distribution $W$. Then a very common problem is computing

$\mathbb{E}_{w}[A(w)] = \mathbb{E}_{w}\left[ \log \int \exp(\langle w, \hat{x} \rangle)\right],$

which is the expected free energy. However, it turns out to be much easier to find $\log \left( \mathbb{E}_{w} \int \exp(\langle w, \hat{x} \rangle)\right)$. So here’s what we do. We find

$\mathbb{E}_{w} [A(w)] = \lim_{n \rightarrow 0} \frac{\mathbb{E}_{w} [\exp(n \cdot A(w))] - 1}{n}$

This should already smell weird, because $n$ is going to zero, an unusual notational choice. We can now write this as

$\mathbb{E}_{w} [A(w)] = \lim_{n \rightarrow 0} \frac{\mathbb{E}_{w}\left[\left( \int_{x} \exp(\langle w, \hat{x} \rangle)\right)^{n}\right] - 1}{n}$

which we can write as

$\lim_{n \rightarrow 0} \frac{\int_{x_1, \ldots x_n} \mathbb{E}_{w} \exp(\langle w, \hat{x}_1 + \cdots + \hat{x}_n \rangle) - 1}{n}$

Now these $x_i$‘s represent the replicas! We’d hope that $\psi(n) = \int_{x_1, \ldots x_n} \mathbb{E}_w \exp(\langle w, \hat{x}_1 + \hat{x}_2 + \cdots + \hat{x}_n \rangle)$ is an analytic function of $n$, so that we can compute it for integer $n$, analytically continue, and take the limit of $\frac{\psi(n) - 1}{n}$ as $n$ goes to zero.
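The limit identity $\mathbb{E}_w[A(w)] = \lim_{n \to 0} (\mathbb{E}_w[e^{n \cdot A(w)}] - 1)/n$ can at least be sanity-checked numerically, here with a made-up finite distribution standing in for the random partition function:

```python
import math

# A made-up finite distribution for the random partition function Z(w),
# standing in for Z(w) = integral of exp(<w, x-hat>) with w random.
Zs = [0.5, 2.0, 3.0, 10.0]

exact = sum(math.log(z) for z in Zs) / len(Zs)     # E_w[A(w)] = E_w[log Z(w)]

def replica_estimate(n):
    """(E_w[Z^n] - 1) / n, which tends to E_w[log Z] as n -> 0."""
    moment = sum(z ** n for z in Zs) / len(Zs)     # the "annealed" n-th moment
    return (moment - 1.0) / n

approx = replica_estimate(1e-5)                    # small n: close to `exact`
```

The hard part of the actual replica method is not this limit but evaluating the moment for integer $n$ and continuing it analytically to $n \to 0$.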

Generally speaking, $\mathbb{E}_{w} [\exp(\langle w, \hat{x}_1 + \cdots + \hat{x}_n \rangle)]$ depends only on the overlap matrix $Q$ of the replicas, so we often guess the form of $Q$ and calculate this expectation.

#### Examples:

We’ll now give an example of how this is useful. Consider a spiked matrix/tensor model, where we observe $Y$ so that $Y = \lambda S + N$, where $S$ is the signal and $N$ is the noise. Thus here we have

$p(S' | Y) \propto \exp(-\beta \| Y - \lambda S' \|^2) \cdot p(S'),$

and we want to analyze $\mathbb{E}_{S' \sim p(\cdot | Y)} \langle S, S' \rangle^2$ as a function of $\lambda$, which can be done by the replica method and exhibits a phase transition.

Other examples include teacher-student models, where we observe $(X, Y) = (X, f_S(X) + N)$. A recent line of work uses the replica method to calculate the training losses and classification errors when training on such a dataset, and these calculations closely match empirical values.

### Symmetry breaking

Lastly, we’ll talk about replica symmetry breaking. Sometimes, at a transition the support does not just shrink but actually breaks into several distinct parts. For example, at such a transition a single cluster (the circle in the cartoon) breaks into several clusters (the three green circles).

When this happens, we can no longer make our overlap matrix assumption, as there are now asymmetries between the points: two replicas drawn from the same cluster overlap more than two replicas drawn from different clusters. This leads to the overlap matrix having a block structure, in the fashion one would anticipate.

Sometimes, the support actually breaks into infinitely many parts, in a phenomenon called full replica symmetry breaking.

Finally, some historical notes. This “plugging in zero” trick was introduced by Parisi approximately 30 years ago, receiving a standing ovation at the ICM. Since then, some of those conjectures have been rigorously formalized, but many haven’t. It is still very impressive that Parisi was able to do it.
