Blake Bordelon, Haozhe Shan, Abdul Canatar, Boaz Barak, Cengiz Pehlevan
[Boaz’s note: Blake and Haozhe were students in the ML theory seminar this spring; in that seminar we touched on the replica method in the lecture on inference and statistical physics but here Blake and Haozhe (with a little help from the rest of us) give a great overview of the method and its relations to ML. See also all seminar posts.]
I. Analysis of Optimization Problems with Statistical Physics
In computer science and machine learning, we are often interested in solving optimization problems of the form
where is an objective function which depends on our decision variables as well as on a set of problem-specific parameters . Frequently, we encounter problems relevant to machine learning, where is a random variable. The replica method is a useful tool to analyze large problems and their typical behavior over the distribution of .
Here are a few examples of problems that fit this form:
- In supervised learning, may be a training loss, a set of neural network weights and the data points and their labels
- We may want to find the most efficient way to visit all nodes on a graph. In this case describes nodes and edges of the graph, is a representation of the set of chosen edges, and can be the cost of if encodes a valid path and (or a very large number ) if it doesn’t encode a valid path.
- Satisfiability: is a collection booleans which must satisfy a collection of constraints. In this case the logical constraints (clauses) are the parameters . can be the number of constraints violated by .
- Recovery of structure in noisy data: is our guess of the structure and are instances of the observed noisy data. For example PCA attempts to identify the directions of maximal variation in the data. With the replica method, we could ask how the accuracy of the estimated top eigenvector degrades with noise.
II. The Goal of the Replica Method
The replica method is a way to calculate the value of some statistic (observable in physics-speak) of where is a “typical” minimizer of and is a “typical” value for the parameters (which are also known in physics-speak as the disorder).
In Example 1 (supervised learning), the observable may be the generalization error of a chosen algorithm (e.g. a linear classifier) on a given dataset. For Example 2 (path), this could be the cost of the best path. For Example 3 (satisfiability), the observable might be whether or not a solution exists at all for the problem. In Example 4 (noisy data), the observable might be the quality of decoded data (distance from ground truth under some measure).
An observable like generalization error obviously depends on , the problem instance. However, can we say something more general about this type of problem? In particular, if obeys some probability distribution, is it possible to characterize the the typical observable over different problem instances ?
For instance, in Example 1, we can draw all of our training data from a distribution. For each random sample of data points , we find the set of which minimize and compute a generalization error. We then repeat this procedure many times and average the results. Sometimes, there are multiple that minimize for a given sample of ; this requires averaging the observable over all global minima for each first, before averaging over different .
To give away a “spolier”, towards the end of this note, we will see how to use the replica method to give accurate predictions of performance for noisy least square fitting and spiked matrix recovery.
Generalization gap in least squares ridge regression, figure taken from Canatar, Bordelon, and Pehlevan
Performance (agreement with planted signal) as function of signal strength for spiked matrix recovery, as the dimension grows, the experiment has stronger agreement with theory. See also Song Mei’s exposition
A. What do we actually do?
Now that we are motivated, let’s see what quantities the replica method attempts to obtain. In general, given some observable , the average of over a minimizer chosen at random from the set , and take the average of this quantity over the choice of the disorder .
In other words, we want to compute the following quantity:
The above equation has two types of expectation- over the disorder , and over the minimizers .
The physics convention is to
- use for the expectation of a function over the disorder
- use for the expectation of a function over chosen according to some measure .
Using this notation, we can write the above as
where denotes an average over the probability measure for problem parameters , and is a uniform distribution over the set of that minimize with zero probability mass placed on sub-optimal points.
The ultimate goal of the replica method is to express
but it will take some time to get there.
B. The concept of “self-averaging” and concentration
Above, we glossed over an important distinction between the “typical” value of and the *average* value of . This is OK only when we have concentration in the sense that with high probability over the choice of , is close to its expected value. We define this as the property that with probability at least , the quantity is within a multiplicative factor of its expectation, whwere is some quantity that goes to zero as the system size grows. A quantity that concentrates in this sense is called self averaging.
For example, suppose that where each equals with probabilty and with probability independently. Standard Chernoff bounds show that with high probability or . Hence is a self averaging quantity.
In contrast the random variable is not self averaging. Since and these random variables are independent, we know that . However, with high probability a typical value of will be of the form . Since we see that the typical value of is exponentially smaller than the expected value of .
The example above is part of a more general pattern. Often even if a variable is not self averaging, the variable will be self-averaging. Hence if we are interested in the typical value of , the quantity is more representative than the quantity .
C. When is using the replica method a good idea?
Suppose that we want to compute a quantity of the form above. When is it a good idea to use the replica method to do so?
Generally, we would want it to satisfy the following conditions:
- The learning problem is high dimensional with a large budget of data. The replica method describes a thermodynamic limit where the system size and data budget are taken to infinity with some fixed ratio between the two quantities. Such a limit is obviously never achieved in reality, but in practice sufficiently large learning problems can be accurately modeled by the method.
- The loss or the constraints are convenient functions of and . Typically will be a low degree polynomial or a sum of local functions (each depending on small number of variables) in .
- Averages over the disorder in are tractable analytically. That is, we can compute marginals of the distribution over .
- The statistic that we are interested in is self-averaging.
If the above conditions aren’t met, it is unlikely that this problem will gain much analytical insight from the replica method.
III. The Main Conceptual Steps Behind Replica Calculations
We now describe the conceptual steps that are involved in calculating a quantity using the replica method.
They are also outlined in this figure:
Step 1:”Softening” Constraints with the Gibbs Measure
The uniform measure on minimizers is often difficult to work with. To aid progress, we can think of it as a special case of what is known as the Gibbs measure, defined as
where is the normalization factor, or partition function. is called the inverse temperature, a name from thermodynamics. It is easy to see that when (i.e., when the temperature tends to the absolute zero), the Gibbs measure converges to a uniform distribution on the minimizers of : .
Hence we can write
Physicists often exchange the order of limits and expectations at will, which generally makes sense in this setting, and so assume
Thus general approach taken in the replica method is to derive an expression for the average observable for any and then take the limit. The quantity is also known as the thermal average of , since it is taken with respect to the Gibbs distribution at some positive temperature.
To compute the thermal average of , we define the following augmented partition function:
One can then check that
Hence our desired quantity can be obtained as
or (assuming we can again exchange limits at will):
We see that ultimately computing the desired quantity reduces to computing quantities of the form
for the original or modified partition function . Hence our focus from now on will be on computing ().
Averaging over is known as “configurational average” or quenched average. All together, we obtain the observable, first thermal averaged to get (averaged over ) and then quenched averaged over .
⚠ What Concentrates?: It is not just an algebraic convenience to average instead of averaging itself. When the system size is large, concentrates around its average. Thus, the typical behavior of the system can be understood by studying the quenched average The partition function itself often does not concentrate and in general the values (known as the “annealed average”) and (known as the “quenched average”) could differ subtantially. For more information, please consult Mezard and Montanari’s excellent book, Chapter 5.
Step 2: The Replica Trick
Hereafter, we use to denote the average and drop the dependence of and . To compute , we use an identity
For the limit to make sense, should be any real number. However, the expression for is only easily computable for natural numbers. ⚠ This step is non-rigorous: we will obtain an expression for for natural number , and then take the limit after the fact.
Recall that under the Gibbs distribution , the probability density on state is equal to . Denote by the probability distribution over a tuple of independent samples (also known as replicas) chosen from .
Since the partition function is an integral (or sum in the discrete case) of the form , we can write as the integral of where are independent variables.
Now since each is weighed with a factor of , this expression can be shown as equal to taking expectation of some exponential function over a tuple of independent samples of replicas all coming from the same Gibbs distribution corresponding to the same instance .
(The discussion on is just for intuition – we will not care about the particular form of this , since soon average it over .)
Step 3: The Order Parameters
The above expression is an expectation of an integral, and so we can switch the order of summation, and write it also as
It turns out that for natural energy functions (for example when is quadratic in such as when it corresponds to Mean Squared Error loss), for any tuple of , the expectation over of only depends on the angles between the ‘s.
That is, rather than depending on all of these -dimensional vectors, it only depends on the coefficients . The matrix Q is known as the overlap matrix or order parameters and one can often find a nice analytical function whose values are bounded (independently of ) such that
Hence we can replace the integral over with an integral over and write
where the measure is the one induced by the overlap distribution of a tuple taken for a random choice of the parameters .
Since only ranges over a small ( dimensional set), at the large limit, the integral is dominated by the maximum of its integrand (“method of steepest descent” / “saddle point method”). Let be the global minimum of (within some space of matrices). We have
Once we arrive at this expression, the configurational average of is simply . These steps constitute the replica method. The ability to compute the configurational average by creating an appropriate is one of the factors determining whether the replica method can be used. For example, in the supervised learning example, is almost always assumed to be quadratic in ; cross-entropy loss, for instance, is generally not amendable.
⚠ Bad Math Warning: there are three limits, , , and . In replica calculations, we assume that we take take these limits in whichever order that is arithmetically convenient.
Coming up: In part two of this blog post, we will explain the replica symmetric assumption (or “Ansatz” in Physics-speak) on the order parameters and then demonstrate how to use the replica method for two simple examples: least squares regression and spiked matrix model.