*Scribe notes by Franklyn Wang*

**Previous post:** Robustness in train and test time **Next post:** Natural Language Processing (guest lecture by Sasha Rush). See also all seminar posts and course webpage.

lecture slides (pdf) – lecture slides (Powerpoint with animation and annotation) – video

## Digression: Frequentism vs Bayesianism

Before getting started, we’ll discuss the difference between the two dominant schools of thought in probability: Frequentism and Bayesianism.

Frequentism holds that the probability of an event is its long-run frequency across many repeated trials, so a coin has 50% probability of being heads if on average half of the flips are heads. One consequence of this framework is that a one-off event like Biden winning the election doesn’t have a probability, as by definition it can only be observed once.

Bayesians reject this model of the world. For example, a famous Bayesian, Jaynes, even wrote that “probabilities are frequencies only in an imaginary universe.”

While these branches of thought are different, generally the answers to most questions are the same, so the distinction will not matter for this class. However, these branches of thought inspire different types of methods, which we now discuss.

For example, suppose that we have a probability distribution $p$ that we can get samples from, in the form of $x_1, \ldots, x_n \sim p$. In statistics, our goal is to calculate a hypothesis $h$ which minimizes some cost $L_p(h)$ (possibly a loss function, but any minimand will do), where we estimate $L_p$ with our samples $x_1, \ldots, x_n$.

A **frequentist** does this by defining a family of potential distributions $\{p_\theta\}_{\theta \in \Theta}$, and we need to find a transformation $M$ from samples to hypotheses which minimizes the cost for all $\theta$, where the cost of $M$ is

$\max_{\theta \in \Theta} \; \mathbb{E}_{x_1, \ldots, x_n \sim p_\theta}\left[ L_{p_\theta}(M(x_1, \ldots, x_n)) \right].$

This amounts to saying that $M$ is minimizing the worst-case loss over all distributions in the family.

By contrast, a **Bayesian** approaches this task by assuming that $p = p_\theta$, where $\theta$ are latent variables sampled from a prior $\pi$.

Then, let $\pi(\theta \mid x_1, \ldots, x_n)$ be the *posterior* distribution of $\theta$ conditioned on the observations $x_1, \ldots, x_n$. We now minimize

$\mathbb{E}_{\theta \sim \pi(\theta \mid x_1, \ldots, x_n)}\left[ L_{p_\theta}(h) \right]$

over hypotheses $h$.

In fact, Bayesian approaches extend beyond this, as if you can sample from a posterior you can do more than just minimize a loss function.

In Bayesian inference, choosing the prior is often very important. One classical approach is a *maximum entropy prior*, because it is the least informative choice (hence, the maximum entropy) and requires the fewest assumptions.

These two minimization equations can sometimes lead to identical results, but often in practice they work out differently once we introduce **computational constraints** into the picture. In the frequentist approach, we generally constrain the family of mappings to be efficiently computable. In the Bayesian approach, we typically approximate the posterior instead, and either use approximate sampling or a restricted hypothesis class for the posterior to be able to efficiently sample from it.
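As a toy illustration of the two styles (a sketch with made-up numbers, not from the lecture), consider estimating a coin’s bias: a frequentist reports the empirical frequency, while a Bayesian with a uniform prior reports the posterior mean, which shrinks toward $1/2$:

```python
import numpy as np

rng = np.random.default_rng(0)
flips = rng.random(20) < 0.7          # 20 flips of a coin with (assumed) bias 0.7
heads, n = int(flips.sum()), len(flips)

# Frequentist: the empirical frequency (also the maximum-likelihood estimate).
theta_freq = heads / n

# Bayesian: a uniform Beta(1,1) prior gives a Beta(1 + heads, 1 + n - heads)
# posterior; we report its mean, which shrinks the estimate toward 1/2.
theta_bayes = (1 + heads) / (2 + n)

print(theta_freq, theta_bayes)
```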

## Part 1. An introduction to Statistical Physics

Compare water and ice. Water is hotter, and its molecules move around more; in ice, by contrast, the molecules are more stationary. When the temperature increases, the molecules move more quickly, and when it decreases they have less energy and stop moving. There are also phase transitions, where certain temperatures cause qualitative discontinuities in behavior, like solid to liquid or liquid to gas.

See video from here:

An atomic state can be thought of as an element $x$ of a (finite) set $\Omega$. Now, how do we represent the system’s state? The crucial observation that makes statistical physics different from “vanilla physics” is the insight to represent the system state as a probability distribution $p$ supported on $\Omega$, rather than as a single state.

Each atomic state $x$ has a negative energy (which we will think of as a utility function) $W(x)$. In some sense, the system “wants” to have a high value of $W$. In addition, when the temperature is high, the system “wants” to have a high value of entropy.

Thus, an axiom of thermodynamics states that to find the true distribution, we need only look at the maximizer of

$\mathbb{E}_{x \sim p}[W(x)] + T \cdot H(p),$

where $H(p) = \mathbb{E}_{x \sim p}[\log(1/p(x))]$ is the entropy of $p$ and $T \geq 0$ is the temperature.

That is, it is the probability distribution which maximizes a linear combination of the expected negative energy and the entropy, with the temperature controlling the coefficient of entropy in this combination (the higher the temperature, the more the system prioritizes having high entropy).

The **variational principle**, which we prove later, states that the $p$ which maximizes this satisfies

$p(x) \propto e^{W(x)/T},$

which is known as the **Boltzmann Distribution**. We often write this with a normalizing constant, so $p(x) = e^{W(x)/T}/Z$, where $Z = \sum_{x \in \Omega} e^{W(x)/T}$.
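We can sanity-check this variational characterization numerically on a tiny system (a sketch; the 8 states and random energies are made up): the Boltzmann distribution should achieve a value of $\mathbb{E}_p[W] + T \cdot H(p)$ at least as large as that of any other distribution, and that value equals $T \log Z$.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal(8)             # negative energies of 8 atomic states
T = 0.5                                # temperature

# Boltzmann distribution p(x) = exp(W(x)/T) / Z.
logits = W / T
p = np.exp(logits - logits.max())
p /= p.sum()

def objective(q):
    """Expected negative energy plus T times entropy."""
    q = np.clip(q, 1e-12, None)
    return q @ W - T * np.sum(q * np.log(q))

# No randomly chosen distribution should beat the Boltzmann distribution.
for _ in range(1000):
    q = rng.dirichlet(np.ones(8))
    assert objective(p) >= objective(q) - 1e-9
```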

Before proving the variational principle, we will go through some examples of statistical physics.

### Example: Ising Model

In the Ising model, we have magnets that are connected in a square grid. The atomic state is $x \in \{\pm 1\}^n$, where each $x_i$ represents a “spin”. The value of $W$ is $W(x) = \sum_{i \sim j} x_i x_j$, where $i \sim j$ denotes that $i$ and $j$ are adjacent magnets. This encourages adjacent magnets to have aligned spins. One important concept, which we will return to later, is that the values of $\{x_i x_j\}_{i \sim j}$ are sufficient to calculate the value of $W(x)$ (these are known as *sufficient statistics*). Furthermore, if we wanted to calculate $\mathbb{E}[W(x)]$, it would be enough to calculate the values of $\mathbb{E}[x_i x_j]$ for all adjacent $i, j$ and then apply linearity of expectation. An illustration of an Ising model that we cool down slowly can be found here:

and a video can be found here.

This is what a low-temperature Ising model looks like – note that $W(x)$ is high because almost all adjacent spins are aligned (hence the well-defined regions).
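A minimal sketch of the Ising energy on a grid (the grid size and random spins are arbitrary): the sufficient statistics are the products of adjacent spins, and summing them gives $W(x)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
x = rng.choice([-1, 1], size=(n, n))    # one atomic state: spins on an n x n grid

def ising_W(x):
    """W(x) = sum of x_i * x_j over horizontally and vertically adjacent pairs."""
    return np.sum(x[:, :-1] * x[:, 1:]) + np.sum(x[:-1, :] * x[1:, :])

# Fully aligned spins maximize W: every one of the 2n(n-1) edges contributes +1.
assert ising_W(np.ones((n, n), dtype=int)) == 2 * n * (n - 1)
print(ising_W(x))
```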

The Sherrington-Kirkpatrick model is a generalization of the Ising model to a random graph, which represents a disordered mean field model. Here, $x_i$ still represents spin, and $W(x) = \sum_{i < j} J_{ij} x_i x_j$, where $J_{ij} \sim N(0, 1)$. The SK model is deeply influential in statistical physics. We say that it is *disordered* because the utility function is chosen at random, and that it is a *mean field* model because, unlike the Ising model, there is no geometry, in the sense that every pair of individual variables $x_i$ and $x_j$ is equally likely to be connected.

Our third example is the posterior distribution, where $x$ is a hidden variable with a uniform prior. We now make, say, $m$ independent observations $y_1, \ldots, y_m$. The probability of $x$ conditioned on the observations is given by

$p(x \mid y_1, \ldots, y_m) \propto \prod_{i=1}^m p(y_i \mid x),$

so $\log p(x \mid y_1, \ldots, y_m) = \sum_{i=1}^m \log p(y_i \mid x) + \text{const}$. Note that the RHS is easy to calculate, but in practice the normalizing factor (also known as the partition function) can be difficult to calculate and often represents the main barrier, in the sense that being able to compute the partition function would make many of these questions far easier.

### Proof of the Variational Principle

Now, we prove the variational principle, which states that if $p(x) = e^{W(x)/T}/Z$, where $Z = \sum_x e^{W(x)/T}$ is the normalizing factor, then

$p = \arg\max_q \; \mathbb{E}_{x \sim q}[W(x)] + T \cdot H(q).$

Before proving the variational principle, we make the following claim: if $p$ is defined as above, then

$\mathbb{E}_{x \sim p}[W(x)] + T \cdot H(p) = T \log Z.$

**Proof of claim:**

Write

$H(p) = \mathbb{E}_{x \sim p}[\log(1/p(x))] = \mathbb{E}_{x \sim p}[\log Z - W(x)/T]$

(where we plugged in the definition of $p$).

We can now rewrite this as

$H(p) = \log Z - \mathbb{E}_{x \sim p}[W(x)]/T,$

using the fact that $\log Z$ is a constant (independent of $x$). Multiplying by $T$ and rearranging, we obtain the claim.

**Proof of variational principle:**

Given the claim, we can now prove the variational principle.

Let $q$ be any distribution. We write

$\mathbb{E}_{x \sim q}[W(x)] + T \cdot H(q) = T \cdot \mathbb{E}_{x \sim q}[\log p(x) + \log Z] + T \cdot H(q) = T \log Z - T \cdot D_{\mathrm{KL}}(q \,\|\, p),$

where the first equality uses our likelihood expression $W(x) = T(\log p(x) + \log Z)$. Since $D_{\mathrm{KL}}(q \,\|\, p) \geq 0$, with equality iff $q = p$, this implies that the objective is maximized at $q = p$, as desired.

**Remarks:**

- When $T = 1$, we have $\log Z = \max_q \mathbb{E}_{x \sim q}[W(x)] + H(q)$.
- In particular, for every value of $q$, we have that

$\log Z \geq \mathbb{E}_{x \sim q}[W(x)] + H(q),$

which we’ve seen before! Note that equality holds when $q = p$, but to approximate $\log Z$ we can simply take more tractable values of $q$.

In many situations, we can compute $W(x)$, but can’t compute $Z$, which stifles many applications. One upshot, though, is that we can calculate ratios $p(x)/p(y) = e^{(W(x) - W(y))/T}$ without knowing $Z$, which is good enough for some applications, like Markov Chain Monte Carlo.

### Markov Chain Monte Carlo (MCMC)

An important task in statistics is to sample from a distribution $p$, even for very complicated $p$. MCMC does this by constructing a Markov Chain whose stationary distribution is $p$. The most common instantiation of MCMC is Metropolis-Hastings, which we now describe. First, we assume that there exists an undirected graph on the states $\Omega$, and that for neighboring states $x \sim y$, the probabilities of $x$ and $y$ are similar, in the sense that $p(x)/p(y)$ is neither too small nor too large. Then the Metropolis-Hastings algorithm is as follows.

**Metropolis-Hastings Algorithm**:

- Draw $x_0 \in \Omega$ at random.
- For $t = 0, 1, 2, \ldots$, choose a random neighbor $y$ of $x_t$, and let

$x_{t+1} = \begin{cases} y & \text{with probability } \min(1, p(y)/p(x_t)), \\ x_t & \text{otherwise.} \end{cases}$

Then, eventually $x_t$ is distributed as $p$! To show that this samples from $p$, we show that the stationary distribution of this Markov chain is $p$. In this case, this turns out to be easy, since we can check the *detailed balance conditions* $p(x) P(x \to y) = p(y) P(y \to x)$, so $p$ is the stationary distribution of the Markov Chain.
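Here is a sketch of Metropolis-Hastings on a toy state space (the ring graph and random energies are made up for illustration): each state’s neighbors are the two adjacent states on a ring, so the acceptance probability $\min(1, p(y)/p(x)) = \min(1, e^{(W(y)-W(x))/T})$ never requires the partition function.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10
W = rng.standard_normal(K)              # negative energy of each state
T = 1.0
target = np.exp(W / T)
target /= target.sum()                  # Boltzmann distribution (for checking only)

# Metropolis-Hastings on a ring: propose a uniformly random neighbor and
# accept with probability min(1, p(y)/p(x)) = min(1, exp((W(y) - W(x)) / T)).
x = 0
counts = np.zeros(K)
for _ in range(200_000):
    y = (x + rng.choice([-1, 1])) % K
    if rng.random() < np.exp((W[y] - W[x]) / T):
        x = y
    counts[x] += 1

empirical = counts / counts.sum()
print(np.max(np.abs(empirical - target)))   # should be small
```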

The stationary distribution is often unique, so we’ve proven that this samples from $p$ eventually. In MCMC algorithms, however, an important question is how fast we converge to the stationary distribution. Often this is rather slow, which is especially disappointing because if it were faster, many very difficult problems, like generic optimization, could be solved much more easily.

Indeed, there are examples where convergence to the stationary distribution would take exponential time.

**Note:** One way to make MCMC faster is to let the proposal $y$ be really close to $x$, because the likelihood ratio $p(y)/p(x)$ will be closer to $1$, and the chain will spend less time stuck at its original location. There’s a tradeoff, however. If $y$ is too close to $x$, then the chain will not *mix* as quickly, where mixing is the convergence of $x_t$ to the stationary distribution: generally, to mix, the chain must get close to every point, and making each of its steps smaller makes that more difficult.

### Applications of MCMC: Simulated Annealing

One application of MCMC is in simulated annealing. Suppose that we have $W : \Omega \to \mathbb{R}$, and we want to find $\arg\max_x W(x)$. The most direct attempt at solving this problem is creating a Markov Chain that simply samples from $p_T(x) \propto e^{W(x)/T}$ for $T$ close to zero, which concentrates on the maximizers. However, this is impractical, at least directly. It is like we have a domino that is far away from us, and we can’t knock it down. What do we do in this case? We put some dominoes in between us!

To this end, we now create a sequence of Markov chains supported on $\Omega$, whose stationary distributions are $p_T(x) \propto e^{W(x)/T}$, as $T$ gets smaller and smaller. This corresponds to cooling a system from a high temperature to a low temperature. Essentially, we want to sample from $p_T$ in the limit $T \to 0$.

Simulated annealing lets us begin by sampling from $p_\infty$, which is uniform on the support. Then we can slowly reduce $T$ from $\infty$ to $0$. When cooling a system, it will be helpful to think of two stages. First, a stage in which the object cools down. Second, a settling stage in which the system settles into a more stable state. The settling stage is simulated via MCMC on $p_T$, so the transition probability from $x$ to a neighbor $y$ is $\min(1, e^{(W(y) - W(x))/T})$.

**Simulated Annealing Algorithm**:

1. *Cooling:* Begin at $T = \infty$ (or very large), and lower the value of $T$ to zero according to some schedule.

1a. *Settling:* Now, repeatedly move from the current state $x$ to a random neighbor $y$ with probability $\min(1, e^{(W(y) - W(x))/T})$.

Since simulated annealing is inspired by physical intuition, it turns out that its shortcomings can be interpreted physically too. Namely, when you cool something too quickly it settles into a glassy state instead of a ground state; correspondingly, simulated annealing often fails by getting stuck in local optima. Note how this is fundamentally a failure mode of MCMC – it can’t mix quickly enough.

See this video for an illustration of simulated annealing.
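A sketch of simulated annealing on a made-up one-dimensional objective (the function, cooling schedule, and step size are all illustrative assumptions): we run Metropolis steps while geometrically lowering the temperature, and track the best state seen.

```python
import numpy as np

rng = np.random.default_rng(0)

def W(x):
    """A bumpy utility to maximize; the sine term creates local optima."""
    return -((x - 2.0) ** 2) + np.sin(5.0 * x)

x = rng.uniform(-10, 10)                 # start from a random state
best_x, best_W = x, W(x)
T = 10.0
for _ in range(20_000):
    y = x + rng.normal(scale=0.5)        # propose a nearby state
    # Metropolis acceptance probability min(1, exp((W(y) - W(x)) / T)).
    if rng.random() < np.exp(min((W(y) - W(x)) / T, 0.0)):
        x = y
    if W(x) > best_W:
        best_x, best_W = x, W(x)
    T = max(T * 0.9995, 1e-3)            # geometric cooling schedule

print(best_x, best_W)
```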

## Part 2: Bayesian Analysis

Note that the posterior of $x$ conditioned on $y_1, \ldots, y_m$ is

$p(x \mid y_1, \ldots, y_m) \propto e^{\sum_{i=1}^m \log p(y_i \mid x)},$

an exponential distribution. Now suppose that $W(x) = \langle w, \hat{x} \rangle$, where $\hat{x}$ are the sufficient statistics of $x$. For example, the energy function could follow that of a tree, so $W(x) = \sum_{(i,j) \in E} w_{ij} x_i x_j$ for a tree with edge set $E$. If you want to find the expected value $\mathbb{E}_{x \sim p_W}[W(x)]$, it’s enough to know $\mu = \mathbb{E}_{x \sim p_W}[\hat{x}]$. (By following a tree, we mean that the undirected graphical model of the distribution is that of a tree.)

### Sampling from a Tree-Structured Distribution

Now how do we sample from a tree-structured distribution? The most direct way is by calculating the marginals. While marginalization is generally quite difficult, it is much more tractable on certain types of graphs. For a sense of what this entails, suppose that we have the following tree-structured statistical model:

a Gaussian model on three variables connected in a path, $x_1 - x_2 - x_3$,

whose joint PDF is proportional to $e^{-Q(x)}$, where $Q$ is a quadratic form.

(The unusual notation is chosen so that the relationships between the marginals are clean. The log-density in question is a Laplacian quadratic form.)

Then one can show that the means and variances of the marginals of the $x_i$ satisfy linear relations along the edges of the tree.

This will give six linear equations in six unknowns, which we can solve. Once we can marginalize a variable, say $x_1$, we can then simply sample from the marginal, and then sample from the conditional distribution of the remaining variables given $x_1$, to sample from the entire distribution. This method is known as belief propagation.
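The two ingredients (compute a marginal by summing out the leaves, then sample conditionally) can be sketched on a tiny three-variable chain with binary spins instead of Gaussians (the fields and couplings are made-up numbers, and we check the message-passing marginal against brute force):

```python
import itertools
import numpy as np

# A chain x1 - x2 - x3 of +/-1 spins with (made-up) fields and couplings.
h = {1: 0.3, 2: -0.2, 3: 0.5}
J12, J23 = 0.8, -0.4

def W(x1, x2, x3):
    return h[1]*x1 + h[2]*x2 + h[3]*x3 + J12*x1*x2 + J23*x2*x3

# Brute-force joint distribution p(x) proportional to exp(W(x)).
states = list(itertools.product([-1, 1], repeat=3))
weights = np.array([np.exp(W(*s)) for s in states])
joint = weights / weights.sum()

# Belief propagation: each leaf sends a message to x2 by summing itself out.
m1 = {x2: sum(np.exp(h[1]*x1 + J12*x1*x2) for x1 in (-1, 1)) for x2 in (-1, 1)}
m3 = {x2: sum(np.exp(h[3]*x3 + J23*x2*x3) for x3 in (-1, 1)) for x2 in (-1, 1)}
belief = np.array([np.exp(h[2]*x2) * m1[x2] * m3[x2] for x2 in (-1, 1)])
belief /= belief.sum()

brute = np.array([sum(p for s, p in zip(states, joint) if s[1] == x2)
                  for x2 in (-1, 1)])
print(belief, brute)                    # the two marginals of x2 agree
```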

This algorithm (with some modifications) also works for general graphs, but on general graphs what it computes is not the exact marginal, but rather the marginal of an extension of that graph to a tree.

### General Exponential Distributions

Now let’s try to develop more theory for exponential distributions. Let the PDF of this distribution be $p_w(x) = e^{\langle w, \hat{x} \rangle - A(w)}$, where $A(w) = \log \sum_x e^{\langle w, \hat{x} \rangle}$ is the log-partition function. A few useful facts about these distributions often appear in calculations: $\nabla A(w) = \mathbb{E}_{x \sim p_w}[\hat{x}]$ and $\nabla^2 A(w) = \mathrm{Cov}_{x \sim p_w}(\hat{x}) \succeq 0$.
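The identity $\nabla A(w) = \mathbb{E}_{p_w}[\hat{x}]$ can be checked numerically on the simplest example, a categorical distribution whose sufficient statistic is the one-hot encoding of the state (a sketch with assumed dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(4)              # canonical parameters for 4 states

def A(w):
    """Log-partition function A(w) = log sum_x exp(<w, x_hat(x)>),
    where x_hat(x) is the one-hot vector of state x."""
    return np.log(np.sum(np.exp(w)))

p = np.exp(w - A(w))                    # the distribution p_w itself

# Finite-difference gradient of A should equal the mean parameter E[x_hat] = p.
eps = 1e-6
grad = np.array([(A(w + eps * e) - A(w - eps * e)) / (2 * eps)
                 for e in np.eye(4)])
print(np.max(np.abs(grad - p)))         # should be ~0
```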

By the variational principle, we have that among all distributions with the same sufficient statistics as a given Boltzmann distribution, the true one maximizes the entropy, so

$p_w = \arg\max \{ H(q) \,:\, \mathbb{E}_{x \sim q}[\hat{x}] = \mu \}, \quad \text{where } \mu = \mathbb{E}_{x \sim p_w}[\hat{x}],$

so $p_w$ is the *maximum entropy* distribution consistent with observations $\mu$.

The important consequence of these two facts is that, in principle, $w$ determines $\mu$ and vice versa. (Up to fine print about using *minimal* sufficient statistics, which we ignore here.) We will refer to the space of $w$’s as the “canonical parameter space” and the space of $\mu$’s as the “mean representation space”. Now note that the first fact gives us a way of going from $w$ to $\mu$, and the second gives us a way to go from $\mu$ to $w$, at least information-theoretically.

Now, how do we do this algorithmically? Using $\nabla A(w) = \mathbb{E}_{x \sim p_w}[\hat{x}]$: if we were able to sample from $p_w$ (which is generally possible if we can evaluate $\langle w, \hat{x} \rangle$), then we can estimate the mean $\mu$ through samples. In other words, we can obtain $\mu$ from $w$ by estimating $\nabla A(w)$.

On the other hand, if you have $\mu$, to obtain $w$ you first consider the optimization problem $\max_w \langle w, \mu \rangle - A(w)$, and then note that the desired $w$ is its maximizer. Solving this problem boils down to estimating the gradient $\mu - \nabla A(w)$ efficiently, because setting it to zero will give us $w$.

Thus both directions boil down to estimating $\nabla A(w)$: going from $w$ to $\mu$ requires estimating it once, whereas going from $\mu$ to $w$ requires estimating it repeatedly inside an optimization loop.
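For the same categorical family as above, going from $\mu$ back to $w$ by gradient ascent on $\langle w, \mu \rangle - A(w)$ looks like this (a sketch; the target $\mu$, step size, and iteration count are arbitrary):

```python
import numpy as np

mu = np.array([0.1, 0.2, 0.3, 0.4])     # target mean parameters

def grad_A(w):
    """For the categorical family, grad A(w) = E_{p_w}[x_hat] = softmax(w)."""
    e = np.exp(w - w.max())
    return e / e.sum()

# Ascend <w, mu> - A(w); its gradient is mu - grad A(w), which vanishes
# exactly when the model's mean parameters match mu.
w = np.zeros(4)
for _ in range(2000):
    w += 0.5 * (mu - grad_A(w))

print(grad_A(w))                         # should be close to mu
```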

When $p_w$ is a *posterior distribution*, the observed data typically gives us the weights $w$, and hence the inference problem becomes to use $w$ to sample from the posterior.

#### Examples of Exponential Distributions

There are many examples of exponential distributions, which we now give.

- High-Dimensional Normals: here the sufficient statistics are $\hat{x} = (x, x x^\top)$, with canonical parameters determined by the mean and covariance.
- Ising Model: $W(x) = \sum_{i \sim j} w_{ij} x_i x_j$, so a sufficient statistic is $\hat{x} = (x_i x_j)_{i \sim j}$, a fact which we will invoke repeatedly.
- There are many more, including Gaussian Markov Random Fields, Latent Dirichlet Allocation, and Mixtures of Gaussians.

#### Going from canonical parameters to mean parameters

Now, we show how to go from $w$ to $\mu$ in a special case, namely in the mean-field approximation. The mean field approximation is the approximation of distributions by product distributions over $\{0, 1\}^n$, for which each coordinate is independent and the entropy decomposes as a sum over coordinates.

Recall that the partition function can be computed as

$\log Z = \max_{q \in \mathcal{P}} \; \mathbb{E}_{x \sim q}[W(x)] + H(q),$

where $\mathcal{P}$ is the set of all probability distributions. If we instead maximize over the set of product distributions $q_\mu$ (parametrized by $\mu \in [0,1]^n$, where $\mu_i$ is the probability that variable $x_i$ is $1$), we get

$\log Z \geq \max_{\mu \in [0,1]^n} \; \mathbb{E}_{x \sim q_\mu}[W(x)] + \sum_{i=1}^n h(\mu_i),$

where $h$ is the binary entropy function, and it now suffices to maximize the right-hand function over the set $[0,1]^n$.

We can generally maximize concave functions over a convex set such as $[0,1]^n$. Unfortunately, this function is not concave. However, it is concave in each coordinate separately. This suggests the following algorithm: fix all but one coordinate, maximize over that coordinate, and repeat. This approach is known as Coordinate Ascent Variational Inference (CAVI), and its pseudocode is given below.

**CAVI Algorithm**

- Let $\mu \in [0,1]^n$ be arbitrary.
- Repeatedly choose values of $i$ in $[n]$ (possibly at random, or in a loop).

2a. Update $\mu_i \leftarrow \arg\max_{\mu_i \in [0,1]} F(\mu_i, \mu_{-i})$ (where $\mu_{-i}$ represents the non-$i$ coordinates of $\mu$, and $F$ is the objective above)
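For the Ising model (with $\pm 1$ spins and temperature $1$), the coordinate update has a closed form: writing $m_i = \mathbb{E}_q[x_i]$, maximizing over coordinate $i$ gives $m_i = \tanh(\sum_j J_{ij} m_j + h_i)$. A sketch with made-up couplings, following the standard mean-field update rather than the lecture’s exact notation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
J = 0.15 * rng.standard_normal((n, n))
J = np.triu(J, 1) + np.triu(J, 1).T      # symmetric couplings, zero diagonal
h = 0.1 * rng.standard_normal(n)         # external fields

# CAVI for W(x) = sum_{i<j} J_ij x_i x_j + sum_i h_i x_i with a product
# approximation: repeatedly set m_i to its coordinate-wise optimum.
m = np.zeros(n)
for _ in range(200):
    for i in range(n):
        m[i] = np.tanh(J[i] @ m + h[i])

# At convergence m satisfies the mean-field fixed-point equations.
print(np.max(np.abs(m - np.tanh(J @ m + h))))   # should be ~0
```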

## Part 3. Solution landscapes and the replica method

This part is best explained visually. Suppose that we have a Boltzmann distribution. In the infinite temperature limit, this is the uniform distribution, spread over the entire domain. As you decrease the temperature, the support’s size decreases. (*Note:* The gifs below are just cartoons. They pretend that the probability distribution is always uniform over a subset. Also, in most cases of interest for learning, only higher-order derivatives of the entropy will be discontinuous.)

Sometimes there’s a discontinuity, and the entropy “suddenly drops”. Often the entropy function itself is continuous, but its derivatives are not continuous at some point – a higher-order phase transition.

Sometimes the geometry of the solution space undergoes a phase transition as well, with it “shattering” to several distinct clusters.

### The replica method

**Note:** We will have an extended post on the replica method later on.

If you sampled $x^1, \ldots, x^n$ independently from $p$, where $p$ is a high-dimensional distribution, you’d expect the pairwise distances to all be approximately the same. The overlap matrix, or

$Q_{a,b} = \langle x^a, x^b \rangle$

(suitably normalized), we approximate as

$Q_{a,b} \approx \begin{cases} 1 & a = b, \\ q & a \neq b, \end{cases}$

where $q$ is some constant.

Suppose $W$ comes from a probability distribution, so that the partition function $Z = Z_W$ is random. Then a very common problem is computing

$\mathbb{E}_W[\log Z],$

which is the expected free energy. However, it turns out to be much easier to find $\mathbb{E}_W[Z^n]$ for integer $n$. So here’s what we do. We find

$\mathbb{E}[\log Z] = \lim_{n \to 0} \frac{\mathbb{E}[Z^n] - 1}{n}.$

This should already smell weird, because $n$ is going to zero, an unusual notational choice. We can now write $\mathbb{E}[Z^n]$ as

$\mathbb{E}\left[\left(\sum_x e^{W(x)}\right)^n\right],$

which we can write as

$\mathbb{E}\left[\sum_{x^1, \ldots, x^n} e^{W(x^1) + \cdots + W(x^n)}\right].$

Now these $x^a$’s represent the replicas! Now, we’d hope $\mathbb{E}[Z^n]$ is an analytic function of $n$, so we can evaluate it at integer values and extrapolate as $n$ goes to zero.

Generally speaking, the expectation above only depends on the overlaps of $x^1, \ldots, x^n$, so we often guess the value of the overlap matrix $Q$ and calculate this expectation.

#### Examples:

We’ll now give an example of how this is useful. Consider a spiked matrix/tensor model, where we observe $Y = \lambda X + E$, where $X$ is the signal and $E$ is the noise. Here the posterior over the signal is itself a Boltzmann distribution,

and we want to analyze the recovery error as a function of the signal-to-noise ratio $\lambda$, which can be done by the replica method and exhibits a phase transition.

Other examples include teacher-student models, where labels are generated by a “teacher” network and a “student” network is trained on them. A recent work calculates the training losses and classification errors when training on this dataset, which closely match empirical values.

### Symmetry breaking

Lastly, we’ll talk about replica symmetry breaking. Sometimes, discontinuities don’t just make the support smaller, but actually break the support into many different parts. For example, at this transition the circle becomes the three green circles.

When this happens, we can no longer make our overlap matrix assumption, as there are now asymmetries between the points. This leads to the overlap matrix taking a block structure, with larger overlaps within clusters than between them.

Sometimes, the support actually breaks into infinitely many parts, in a phenomenon called full replica symmetry breaking.

Finally, some historical notes. This “plugging in zero” trick was introduced by Parisi approximately 30 years ago, receiving a standing ovation at the ICM. Since then, some of those conjectures have been rigorously formalized, but many haven’t. It is still very impressive that Parisi was able to do it.
