Peter Shor on Quantum Error Correction
[Guest post by Annie Wei who scribed Peter Shor’s lecture in our physics and computation seminar. See here for all the posts of this seminar. –Boaz]
On October 19, we were lucky enough to have Professor Peter Shor give a guest lecture about quantum error correcting codes. In this blog post, I (Annie Wei) will present a summary of this guest lecture, which builds up quantum error correcting codes starting from classical coding theory. We will start by reviewing an example from classical error correction to motivate the similarities and differences when compared against the quantum case, before moving into quantum error correction and quantum channels. Note that we do assume a very basic familiarity with quantum mechanics, such as that which might be found here or here.
1. Motivation
We are interested in quantum error correction, ultimately, because any real-world computing device needs to be able to tolerate noise. Theoretical work on quantum algorithms has shown that quantum computers have the potential to offer speedups for a variety of problems, but in practice we'd also like to be able to eventually build and operate real quantum computers. We need to be able to protect against any decoherence that occurs when a quantum computer interacts with the environment, and we need to be able to protect against the accumulation of small gate errors, since quantum gates need to be unitary operators.
In error correction the idea is to protect against noise by encoding information in a way that is resistant to noise, usually by adding some redundancy to the message. The redundancy then ensures that enough information remains, even after noise corruption, so that decoding will allow us to recover our original message. This is what is done in classical error correction schemes.
Unfortunately, it’s not obvious that quantum error correction is possible. One obstacle is that errors are continuous, since a continuum of operations can be applied to a qubit, so a priori it might seem like identifying and correcting an error would require infinite resources. In a later section we show how this problem, that of identifying quantum errors, can be overcome. Another obstacle is the fact that, as we’ve stated, classical error correction works by adding redundancy to a message. This might seem impossible to perform in a quantum setting due to the No Cloning Theorem, which states the following:
Theorem (No Cloning Theorem): Performing the mapping
$|\psi\rangle|0\rangle \mapsto |\psi\rangle|\psi\rangle$ for an arbitrary state $|\psi\rangle$
is not a permissible quantum operation.
Proof: We will use unitarity, which says that a quantum operation specified by a unitary matrix $U$ must satisfy
$U^\dagger U = I$.
(This ensures that the normalization of the state is always preserved, i.e. that $\langle\psi|U^\dagger U|\psi\rangle = \langle\psi|\psi\rangle = 1$, which is equivalent to the conservation of probability.)
Now suppose that we can perform the operation
$U|\psi\rangle|0\rangle = |\psi\rangle|\psi\rangle$.
Then, letting
$|\psi\rangle, |\phi\rangle$ be two arbitrary states,
we note that by unitarity
$\big(\langle\psi|\langle 0|\big) U^\dagger U \big(|\phi\rangle|0\rangle\big) = \langle\psi|\phi\rangle\langle 0|0\rangle = \langle\psi|\phi\rangle$.
But
$\big(\langle\psi|\langle 0|\big) U^\dagger U \big(|\phi\rangle|0\rangle\big) = \big(\langle\psi|\langle\psi|\big)\big(|\phi\rangle|\phi\rangle\big) = \langle\psi|\phi\rangle^2$,
and in general $\langle\psi|\phi\rangle^2 \neq \langle\psi|\phi\rangle$, so we have a contradiction.
How do we get around this apparent contradiction? To do so, note that the no-cloning theorem only prohibits the copying of non-orthogonal quantum states. With orthogonal quantum states, either $\langle\psi|\phi\rangle = 0$ or $\langle\psi|\phi\rangle = 1$, so we don't run into a contradiction. This also explains why it is possible to copy classical information, which we can think of as encoded in orthogonal quantum states.
So how do we actually protect quantum information from noise? In the next section we first review classical error correction, as ideas from the classical setting reappear in the quantum setting, and then we move into quantum error correction.
2. Review of Classical Error Correction
First we start by reviewing classical error correction. In classical error correction we generally have a message that we encode, send through a noisy channel, and then decode, in the following schematic process:
In an effective error correction scheme, the decoding process should allow us to identify any errors that occurred when our message passed through the noisy channel, which then tells us how to correct the errors. The formalism that allows us to do so is the following: we first define a $k \times n$ encoding matrix $G$ that takes a message of length $k$ and converts it to a codeword of length $n$, where the codewords make up the span of the rows of $G$. An example of such a matrix is
$G = \begin{pmatrix} 1&0&0&0&0&1&1 \\ 0&1&0&0&1&0&1 \\ 0&0&1&0&1&1&0 \\ 0&0&0&1&1&1&1 \end{pmatrix}$,
corresponding to the 7-bit Hamming code, which encodes a 4-bit message as a 7-bit codeword. Note that this code has distance 3, since any two distinct codewords differ in at least 3 positions, which means that it can correct 1 error (the number of errors that can be corrected is $\lfloor (d-1)/2 \rfloor$ for distance $d$).
We also define the parity check matrix $H$ to be the $(n-k) \times n$ matrix that satisfies
$GH^T = 0$.
For example, to define $H$ corresponding to the $G$ we defined for the 7-bit Hamming code, we could take
$H = \begin{pmatrix} 0&1&1&1&1&0&0 \\ 1&0&1&1&0&1&0 \\ 1&1&0&1&0&0&1 \end{pmatrix}$.
Then we may decode $y$, a 7-bit string, in the following manner. Say that
$y = xG + e$,
where $xG$ is a codeword and $e$ is the 1-bit error we wish to correct. Then
$Hy^T = H(xG)^T + He^T = He^T$.
Here $He^T$ uniquely identifies the error $e$ and is known as the error syndrome. Having it tells us how to correct the error. Thus our error correction scheme consists of the following steps:
 Encode a $k$-bit message $x$ by multiplying by $G$ to obtain codeword $xG$.
 Send the message through the channel, generating error $e$ and resulting in the string $y = xG + e$.
 Compute $Hy^T$ to obtain the error syndrome $He^T$.
 Correct error $e$ to obtain $xG$.
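The encode/corrupt/decode loop above can be sketched end to end in a few lines. This is an added illustration (not from the lecture), using a systematic form of the [7,4] Hamming code; all arithmetic is over GF(2), i.e. mod 2:

```python
import numpy as np

# [7,4] Hamming code in systematic form: G = [I | A], H = [A^T | I].
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0], [1, 1, 1]])
G = np.hstack([np.eye(4, dtype=int), A])        # generator matrix, 4 x 7
H = np.hstack([A.T, np.eye(3, dtype=int)])      # parity check matrix, 3 x 7

x = np.array([1, 0, 1, 1])                      # 4-bit message
codeword = x @ G % 2                            # 7-bit codeword

e = np.zeros(7, dtype=int); e[2] = 1            # channel flips bit 2
y = (codeword + e) % 2

syndrome = H @ y % 2                            # H y^T = H e^T
# The syndrome equals the column of H at the error position,
# so we can locate and flip the corrupted bit.
pos = next(i for i in range(7) if np.array_equal(H[:, i], syndrome))
y[pos] ^= 1
assert np.array_equal(y, codeword)
```

Because every column of $H$ is a distinct nonzero vector, any single-bit error produces a distinct syndrome.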
Having concluded our quick review of classical error correction, we now look at the theory of quantum error correction.
3. Quantum Error Correction
In this section we introduce quantum error correction by directly constructing the 9-qubit code and the 7-qubit code. Then we introduce the more general formalism of CSS codes, which encompasses both the 9-qubit and 7-qubit codes, before introducing the stabilizer formalism, which tells us how we might construct a CSS code.
3.1. Preliminaries
First we introduce some tools that we will need in this section.
3.1.1. Pauli Matrices
The Pauli matrices are a set of 2-by-2 matrices that form an orthogonal basis for the 2-by-2 Hermitian matrices, where a Hermitian matrix satisfies $H^\dagger = H$. Note that we can form larger Hilbert spaces by taking the tensor product of smaller Hilbert spaces, so in particular taking the $n$-fold tensor product of Pauli matrices gives us a basis for the $2^n$-by-$2^n$ Hermitian matrices. Note also that, generally, in quantum mechanics we are interested in Hermitian matrices because they can be used to represent measurements, and because unitary matrices, which can be used to represent probability-preserving quantum operations, can be obtained by exponentiating Hermitian matrices (that is, every unitary matrix can be written in the form $e^{iH}$ for a Hermitian matrix $H$).
The Pauli matrices are given by
$I = \begin{pmatrix} 1&0 \\ 0&1 \end{pmatrix}, \quad \sigma_x = \begin{pmatrix} 0&1 \\ 1&0 \end{pmatrix}, \quad \sigma_y = \begin{pmatrix} 0&-i \\ i&0 \end{pmatrix}, \quad \sigma_z = \begin{pmatrix} 1&0 \\ 0&-1 \end{pmatrix}$.
By direct computation we can show that they satisfy the relations
$\sigma_x^2 = \sigma_y^2 = \sigma_z^2 = I, \qquad \sigma_x\sigma_y = i\sigma_z, \quad \sigma_y\sigma_z = i\sigma_x, \quad \sigma_z\sigma_x = i\sigma_y$,
and that any two distinct Pauli matrices anticommute.
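As a quick sanity check (added here, not part of the lecture), these relations are easy to verify numerically:

```python
import numpy as np

# Numerical check of the Pauli algebra stated above.
I2 = np.eye(2)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.array([[1, 0], [0, -1]], dtype=complex)

for P in (X, Y, Z):
    assert np.allclose(P @ P, I2)          # each Pauli squares to the identity
    assert np.allclose(P.conj().T, P)      # each Pauli is Hermitian

assert np.allclose(X @ Y, 1j * Z)          # sigma_x sigma_y = i sigma_z
assert np.allclose(Z @ X, 1j * Y)          # sigma_z sigma_x = i sigma_y
assert np.allclose(X @ Y, -(Y @ X))        # distinct Paulis anticommute
```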
3.1.2. Von Neumann Measurements
We will also need the concept of a von Neumann measurement. A von Neumann measurement is given by a set of projection matrices $\Pi_1, \ldots, \Pi_k$ satisfying
$\Pi_i \Pi_j = \delta_{ij}\Pi_i, \qquad \sum_i \Pi_i = I$.
That is, the projectors partition a Hilbert space into orthogonal subspaces. Then, given any state $|\psi\rangle$, when we perform a measurement using these projectors we obtain the measurement result corresponding to $\Pi_i$, with corresponding post-measurement state
$\frac{\Pi_i|\psi\rangle}{\|\Pi_i|\psi\rangle\|}$,
with probability $\langle\psi|\Pi_i|\psi\rangle$.
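A minimal numerical illustration of a von Neumann measurement on a single qubit (an added sketch; the projectors and the test state are arbitrary choices):

```python
import numpy as np

# Toy von Neumann measurement: projectors onto |0> and |1>.
ket0 = np.array([1, 0], dtype=complex)
ket1 = np.array([0, 1], dtype=complex)
P0 = np.outer(ket0, ket0.conj())
P1 = np.outer(ket1, ket1.conj())
assert np.allclose(P0 + P1, np.eye(2))          # projectors sum to the identity
assert np.allclose(P0 @ P1, np.zeros((2, 2)))   # and are orthogonal

psi = np.array([0.6, 0.8j])                     # a normalized state
prob0 = np.real(psi.conj() @ P0 @ psi)          # <psi|P0|psi>
post = (P0 @ psi) / np.sqrt(prob0)              # post-measurement state (= |0>)
assert np.isclose(prob0, 0.36)
```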
3.2. First Attempt at a Quantum Code
Now we make a first attempt at coming up with a quantum code, noting that our efforts and adjustments will ultimately culminate in the 9-qubit code. Starting with the simplest possible idea, we take inspiration from the classical repetition code, which maps
$0 \mapsto 000, \qquad 1 \mapsto 111$
and decodes by taking the majority of the 3 bits. We consider the quantum analog of this, which maps
$|0\rangle \mapsto |000\rangle, \qquad |1\rangle \mapsto |111\rangle$.
We will take our quantum errors to be the Pauli matrices $\sigma_x$, $\sigma_y$, and $\sigma_z$. Then the encoding process, where our message is a quantum state $\alpha|0\rangle + \beta|1\rangle$, looks like the following:
$\alpha|0\rangle + \beta|1\rangle \mapsto \alpha|000\rangle + \beta|111\rangle$.
We claim that this code can correct bit errors but not phase errors, which makes it equivalent to the original classical repetition code for error correction. To see this, note that applying a $\sigma_x$ error to, say, the first qubit results in the mapping
$\alpha|000\rangle + \beta|111\rangle \mapsto \alpha|100\rangle + \beta|011\rangle$.
This can be detected by the von Neumann measurement which projects onto the subspaces
$\mathrm{span}\{|000\rangle, |111\rangle\}, \quad \mathrm{span}\{|100\rangle, |011\rangle\}, \quad \mathrm{span}\{|010\rangle, |101\rangle\}, \quad \mathrm{span}\{|001\rangle, |110\rangle\}.$
We could then apply $\sigma_x$ to the first qubit to correct the error. To see that this doesn't work for phase errors, note that applying a $\sigma_z$ error to any qubit results in the mapping
$\alpha|000\rangle + \beta|111\rangle \mapsto \alpha|000\rangle - \beta|111\rangle$.
This is a valid encoding of the state $\alpha|0\rangle - \beta|1\rangle$, so the error is undetectable.
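The bit-flip encoding and syndrome measurement can be simulated directly with small matrices. The sketch below (added for illustration) flips the first qubit and checks that the measurement pinpoints it without disturbing the amplitudes:

```python
import numpy as np

# 3-qubit bit-flip code: encode, apply a sigma_x error on qubit 0,
# detect it with the projective measurement onto the four subspaces.
X = np.array([[0, 1], [1, 0]], dtype=complex)
I2 = np.eye(2, dtype=complex)

def ket(bits):
    v = np.zeros(8, dtype=complex)
    v[int(bits, 2)] = 1.0
    return v

alpha, beta = 0.6, 0.8
state = alpha * ket('000') + beta * ket('111')   # encoded logical qubit

error = np.kron(X, np.kron(I2, I2))              # bit flip on qubit 0
state = error @ state

# Projectors onto the four syndrome subspaces.
subspaces = [('000', '111'), ('100', '011'), ('010', '101'), ('001', '110')]
probs = []
for a, b in subspaces:
    P = np.outer(ket(a), ket(a).conj()) + np.outer(ket(b), ket(b).conj())
    probs.append(np.real(state.conj() @ P @ state))

# The measurement deterministically reports the {|100>, |011>} subspace,
# identifying "qubit 0 was flipped" while leaving alpha and beta intact.
assert np.argmax(probs) == 1 and np.isclose(probs[1], 1.0)
```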
What adjustments can we make so that we're able to also correct $\sigma_z$ errors? For this we will introduce the Hadamard matrix, defined as
$H = \frac{1}{\sqrt 2}\begin{pmatrix} 1&1 \\ 1&-1 \end{pmatrix}$
and satisfying
$H\sigma_x H = \sigma_z, \qquad H\sigma_z H = \sigma_x$.
Note in particular that, because $H^2 = I$, the Hadamard matrix turns bit errors into phase errors, and vice versa. This allows us to come up with a code that corrects phase errors by mapping
$|0\rangle \mapsto |{+}{+}{+}\rangle, \qquad |1\rangle \mapsto |{-}{-}{-}\rangle$, where $|\pm\rangle = \frac{1}{\sqrt 2}(|0\rangle \pm |1\rangle)$,
or equivalently,
$|0\rangle \mapsto H^{\otimes 3}|000\rangle, \qquad |1\rangle \mapsto H^{\otimes 3}|111\rangle$.
Now we can concatenate our bit flip code with our phase flip code to take care of both errors. This gives us the 9qubit code, also known as the Shor code.
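A quick numerical check (added for concreteness) that conjugating by the Hadamard swaps bit-flip and phase-flip errors:

```python
import numpy as np

# H sigma_x H = sigma_z and H sigma_z H = sigma_x, since H is its own inverse.
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
X = np.array([[0, 1], [1, 0]])
Z = np.array([[1, 0], [0, -1]])

assert np.allclose(H @ H, np.eye(2))   # H^2 = I
assert np.allclose(H @ X @ H, Z)       # bit error -> phase error
assert np.allclose(H @ Z @ H, X)       # phase error -> bit error
```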
3.3. 9-Qubit Code
In the previous section, we went through the process of constructing the 9-qubit Shor code by considering how to correct both bit flip errors and phase flip errors. Explicitly, the 9-qubit Shor code is given by the following mapping:
$|0\rangle \mapsto |\bar 0\rangle = \frac{1}{2\sqrt 2}(|000\rangle + |111\rangle)^{\otimes 3}, \qquad |1\rangle \mapsto |\bar 1\rangle = \frac{1}{2\sqrt 2}(|000\rangle - |111\rangle)^{\otimes 3}$.
Here $|\bar 0\rangle$ and $|\bar 1\rangle$ are known as logical qubits; note that our 9-qubit code essentially represents 1 logical qubit with 9 physical qubits.
Note that by construction this code can correct $\sigma_x$, $\sigma_y$, and $\sigma_z$ errors on any one qubit (we've already shown by construction that it can correct $\sigma_x$ and $\sigma_z$ errors, and $\sigma_y$ can be obtained as a product of the two, since $\sigma_y = i\sigma_x\sigma_z$). This is also equivalent to the statement that the states $E|\bar 0\rangle$ and $E|\bar 1\rangle$, as $E$ ranges over the single-qubit Pauli errors (together with the identity), are all orthogonal.
Now we have a 1-error quantum code. We claim that such a code can in fact correct any error operation on a single qubit, and that this is a property of all 1-error quantum codes:
Theorem: Given any possible error on a single qubit, whether a unitary, a measurement, or a more general quantum operation, a 1-error quantum code can correct it.
Proof: $\{I, \sigma_x, \sigma_y, \sigma_z\}$ forms a basis for the 2-by-2 matrices, so any single-qubit error is a linear combination of Pauli errors, and correcting the Pauli errors corrects the combination. More generally, for errors on $t$ qubits, the code can correct these errors if it can individually correct the errors $E_1 \otimes \cdots \otimes E_t$ for $E_i \in \{I, \sigma_x, \sigma_y, \sigma_z\}$, since these tensor products form a basis for the $2^t$-by-$2^t$ matrices.
Example: Phase Error. Next we'll do an example where we consider how we might correct an arbitrary phase error applied to the state. Since quantum states are equivalent up to global phases, the error operator is given by
$E_\theta = \begin{pmatrix} e^{i\theta} & 0 \\ 0 & e^{-i\theta} \end{pmatrix}$.
Note that this can be rewritten in the basis $\{I, \sigma_z\}$ as
$E_\theta = \cos\theta \, I + i \sin\theta \, \sigma_z$.
Now, applying this error to an encoded state $|\bar\psi\rangle$, we get
$E_\theta|\bar\psi\rangle = \cos\theta \, |\bar\psi\rangle + i \sin\theta \, \sigma_z|\bar\psi\rangle$.
After performing a projective measurement, we get state $|\bar\psi\rangle$ with probability $\cos^2\theta$, in which case we do not need to perform any error correction, and we get $\sigma_z|\bar\psi\rangle$ with probability $\sin^2\theta$, in which case we would know to correct the $\sigma_z$ error.
3.4. 7-Qubit Code
Now that we've constructed the 9-qubit code and shown that quantum error correction is possible, we might wonder whether it's possible to do better. For example, we'd like a code that requires fewer qubits. We'll construct a 7-qubit code that corrects 1 error, defining a mapping to $|\bar 0\rangle$ and $|\bar 1\rangle$ by taking inspiration from a classical code, as we did for the 9-qubit case.
For this we will need to go back to the example we used to illustrate classical error correction. Recall that in classical error correction, we have an encoding matrix $G$ and a parity check matrix $H$ satisfying $GH^T = 0$. We encode a message $x$ to obtain codeword $xG$. After error $e$ is applied, this becomes $y = xG + e$, from which we can extract the error syndrome $He^T = Hy^T$. We can then apply the appropriate correction to extract $xG$ from $y$.
Now we will use the encoding matrix $G$ from our classical error correction example, and we will divide our codewords into two sets, $S_0$ and $S_1$, given by
$S_0 = \{\text{codewords of even Hamming weight}\}$
and
$S_1 = \{\text{codewords of odd Hamming weight}\}$.
Similar to how we approached the 9-qubit case, we will start by defining our code as follows:
$|\bar 0\rangle = \frac{1}{\sqrt 8}\sum_{w \in S_0} |w\rangle, \qquad |\bar 1\rangle = \frac{1}{\sqrt 8}\sum_{w \in S_1} |w\rangle$.
Note that this corrects bit flip errors by construction. How can we ensure that we are also able to correct phase errors? For this we again turn to the Hadamard matrix, which allows us to toggle between bit and phase errors. We claim that
$H^{\otimes 7}|\bar 0\rangle = \frac{1}{\sqrt 2}(|\bar 0\rangle + |\bar 1\rangle), \qquad H^{\otimes 7}|\bar 1\rangle = \frac{1}{\sqrt 2}(|\bar 0\rangle - |\bar 1\rangle)$.
Proof: We will show that
$H^{\otimes 7}|\bar 0\rangle = \frac{1}{\sqrt 2}(|\bar 0\rangle + |\bar 1\rangle)$,
noting that the argument for $|\bar 1\rangle$ is similar. First we will need the fact that
$H^{\otimes 7}|w\rangle = \frac{1}{\sqrt{2^7}} \sum_{v \in \{0,1\}^7} (-1)^{v \cdot w} |v\rangle$.
To see that this fact is true, note that
$H|w_i\rangle = \frac{1}{\sqrt 2}\big(|0\rangle + (-1)^{w_i}|1\rangle\big)$
and that $v \cdot w$ is equal to the number of bits in which $v$ and $w$ are both 1. Now we can start by directly calculating
$H^{\otimes 7}|\bar 0\rangle = \frac{1}{\sqrt 8} \sum_{w \in S_0} H^{\otimes 7}|w\rangle = \frac{1}{\sqrt 8} \frac{1}{\sqrt{2^7}} \sum_{v \in \{0,1\}^7} \Big( \sum_{w \in S_0} (-1)^{v \cdot w} \Big) |v\rangle$.
Note that for $v$ fixed, if there is some $u \in S_0$ with $v \cdot u = 1$, then the map $w \mapsto w + u$ pairs each $w \in S_0$ with $v \cdot w = 0$ against one with $v \cdot w = 1$. Thus we can break $S_0$ up into an equal number of codewords satisfying $v \cdot w = 0$ and $v \cdot w = 1$. This means that the sum $\sum_{w \in S_0} (-1)^{v \cdot w}$ vanishes unless we have $v \cdot w = 0$ for every $w \in S_0$. But the $v$ that satisfy this are exactly all the codewords of the Hamming code, by definition, so we must have that
$H^{\otimes 7}|\bar 0\rangle = \frac{1}{\sqrt 2}\Big( \frac{1}{\sqrt 8} \sum_{w \in S_0} |w\rangle + \frac{1}{\sqrt 8} \sum_{w \in S_1} |w\rangle \Big) = \frac{1}{\sqrt 2}(|\bar 0\rangle + |\bar 1\rangle)$,
as the sum in $v$ runs equally over all codewords.
Thus we have constructed a 7-qubit quantum code that corrects 1 error. Moreover, we see that for both the 9-qubit and 7-qubit codes, both of which are 1-error quantum codes, the ability to correct 1 error comes directly from the fact that the original classical codes we used to construct them can themselves correct 1 error. This suggests that we should be able to come up with a more general procedure for constructing quantum codes from classical codes.
3.5. CSS Codes
CSS (Calderbank-Shor-Steane) codes generalize the process by which we constructed the 9-qubit and 7-qubit codes, and they give us a general framework for constructing quantum codes from classical codes. In a CSS code, we require groups (i.e. linear codes) $C_2 \subseteq C_1$ satisfying the property that $C_1$ and the dual code $C_2^\perp$ both have good error correcting properties.
Then we can define codewords to correspond to cosets of $C_2$ in $C_1$, so that the number of codewords is equal to $|C_1|/|C_2|$. Thus by this definition we can say that codewords $x$ and $y$ are in the same coset if $x - y \in C_2$. Explicitly, the codeword for coset $x + C_2$ is given by the state
$|x + C_2\rangle = \frac{1}{\sqrt{|C_2|}} \sum_{w \in C_2} |x + w\rangle$,
and under the Hadamard transformation applied to each qubit this state is in turn mapped to the state
$H^{\otimes n} |x + C_2\rangle = \frac{1}{\sqrt{|C_2^\perp|}} \sum_{v \in C_2^\perp} (-1)^{x \cdot v} |v\rangle$.
That is to say, the Hadamard "dualizes" our original code, toggling bit errors to phase errors and vice versa. (This can be seen by direct calculation, as in the case of the 7-qubit code, where we used the fact that $H^{\otimes n}|w\rangle = \frac{1}{\sqrt{2^n}}\sum_v (-1)^{v \cdot w}|v\rangle$.)
Note also that the number of bit errors this code can correct is governed by the minimum weight of a nonzero codeword of $C_1$, and similarly for phase errors with $C_2^\perp$ in place of $C_1$.
With the CSS construction we have thus reduced the problem of finding a quantum error correcting code to the problem of finding appropriate $C_1$, $C_2$. Note that the special case of $C_2 = C_1^\perp$ corresponds to weakly self-dual codes (codes satisfying $C^\perp \subseteq C$), which are well studied classically. Doubly even, weakly self-dual codes additionally have the requirement that all codewords have Hamming weights that are multiples of 4; that is, they satisfy
$\mathrm{wt}(w) \equiv 0 \pmod 4$ for all codewords $w$,
and they are also well studied classically.
3.6. Gilbert-Varshamov Bound
In the previous section we introduced CSS codes and demonstrated that the problem of constructing a quantum code could be reduced to the problem of finding two groups $C_2 \subseteq C_1$ such that
$C_1$ and $C_2^\perp$ are both good classical codes.
The next natural question is to ask whether such groups can in fact be found.
The Gilbert-Varshamov bound answers this question in the affirmative, ensuring that there do exist good CSS codes (the bound applies to both quantum and classical codes). It can be stated in the following way:
Theorem (Gilbert-Varshamov Bound): There exist CSS codes with rate (number of encoded bits)/(length of code) given by
$\frac{k}{n} \geq 1 - 2H\!\left(\frac{2t}{n}\right)$,
where $d$ is the minimum distance of the code, $t \approx d/2$ is the number of errors it can correct, and $H$ is the binary Shannon entropy, defined as
$H(p) = -p \log_2 p - (1 - p)\log_2(1 - p)$.
Proof: Note that we can always take a code, apply a random linear transformation to it, and get another code. Thus each nonzero vector is equally likely to appear in a random code. Given this fact, we can estimate the probability that a code of dimension $k$ contains a word of weight at most $d$ using the union bound:
$\Pr[\text{code of dimension } k \text{ has a word of weight} \le d] \le (\text{number of words}) \cdot \Pr[\text{a word has weight} \le d] \approx 2^k \cdot 2^{nH(d/n) - n}.$
For this to be a valid (small) probability we need to have
$\frac{k}{n} \le 1 - H\!\left(\frac{d}{n}\right)$.
We can calculate the rate by noting that for a CSS code, given by $C_2 \subseteq C_1$, with $\dim C_1 = k_1$, $\dim C_2 = k_2$, the expression for the rate is given by
$\frac{k_1 - k_2}{n}$.
Combining this with the bound we obtained by considering probabilities, we get that
$\frac{k_1 - k_2}{n} \geq 1 - 2H\!\left(\frac{2t}{n}\right)$.
Thus there exist good CSS codes.
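To get a feel for the bound, one can evaluate the guaranteed rate $1 - 2H(2t/n)$ for a few error fractions (an added illustration; the specific values of t/n are arbitrary):

```python
import math

# Evaluating the CSS Gilbert-Varshamov rate bound 1 - 2*H(2t/n).
def binary_entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def css_gv_rate(t_over_n):
    return 1 - 2 * binary_entropy(2 * t_over_n)

for f in (0.01, 0.03, 0.05):
    print(f"t/n = {f:4.2f}  guaranteed rate >= {css_gv_rate(f):.3f}")
# The guaranteed rate stays positive for t/n up to about 0.055,
# where 2*H(2t/n) reaches 1.
```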
3.7. Stabilizer Codes
Having discussed and constructed some examples of CSS codes, we will now discuss the stabilizer formalism. Note that this formalism allows us to construct codes without having to work directly with the states representing and , as this can quickly get unwieldy. Instead, we will work with stabilizers, operators that leave these states invariant.
To see how working directly with states can get unwieldy, we can consider the 5-qubit code. We can define it the way we defined the 9-qubit and 7-qubit codes, by directly defining the basis vectors $|\bar 0\rangle$ and $|\bar 1\rangle$: here $|\bar 0\rangle$ is a 16-term superposition of computational basis states,
symmetric under cyclic permutations of the five qubits,
with $|\bar 1\rangle$ defined similarly. But we can also define this code more succinctly using the stabilizer formalism. To do so, we start by choosing a commutative subgroup of the Pauli group, with generators $M_i$ satisfying
$M_i^2 = I, \qquad M_i M_j = M_j M_i.$
For example, for the 5-qubit code, the particular choice of generators we would need is given by
$M_1 = \sigma_x \otimes \sigma_z \otimes \sigma_z \otimes \sigma_x \otimes I$
and its cyclic shifts $M_2$, $M_3$, $M_4$.
Now we consider states that are stabilized by the $M_i$. That is, they satisfy
$M_i |\psi\rangle = |\psi\rangle \quad \text{for all } i.$
Note that the eigenvalues of $\sigma_x$, $\sigma_y$, and $\sigma_z$ are $\pm 1$, so in the case of the 5-qubit code, there exists a $2^4 = 16$-dimensional space of $|\psi\rangle$ satisfying $M_1|\psi\rangle = |\psi\rangle$. Recalling that two commuting matrices are simultaneously diagonalizable, there exists a $2^3$-dimensional space of $|\psi\rangle$ satisfying $M_1|\psi\rangle = M_2|\psi\rangle = |\psi\rangle$, and so on, where we cut the dimension of the subspace in half each time we add a generator. Finally, there exists a $2^1 = 2$-dimensional space of $|\psi\rangle$ satisfying $M_i|\psi\rangle = |\psi\rangle$ for all $i$. This 2-dimensional space is exactly the subspace spanned by $|\bar 0\rangle$ and $|\bar 1\rangle$. Thus fixing the stabilizers is enough to give us our code.
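This dimension-halving can be checked numerically. The sketch below (added for illustration) builds the four generators as cyclic shifts of $\sigma_x\sigma_z\sigma_z\sigma_x I$ and verifies that they commute and that their joint +1 eigenspace is 2-dimensional:

```python
import numpy as np
from functools import reduce

# Stabilizer generators of the 5-qubit code as explicit 32x32 matrices.
I2 = np.eye(2)
X = np.array([[0, 1], [1, 0]])
Z = np.array([[1, 0], [0, -1]])

def pauli_string(s):
    ops = {'I': I2, 'X': X, 'Z': Z}
    return reduce(np.kron, [ops[c] for c in s])

gens = [pauli_string(s) for s in ('XZZXI', 'IXZZX', 'XIXZZ', 'ZXIXZ')]

for A in gens:
    for B in gens:
        assert np.allclose(A @ B, B @ A)   # generators commute

# Project onto the simultaneous +1 eigenspace: each (I + M_i)/2 halves
# the dimension, 32 -> 16 -> 8 -> 4 -> 2.
P = np.eye(32)
for M in gens:
    P = P @ (np.eye(32) + M) / 2
print(int(round(np.trace(P))))   # dimension of the code space → 2
```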
Next we will consider all elements in the Pauli group that commute with all elements in our stabilizer group $S$. As we shall see, this will give us our logical operators, where a logical operator performs an operation on a logical qubit (for example, the logical $\sigma_x$ operator, $\bar X$, would act on the logical qubit by mapping $|\bar 0\rangle \mapsto |\bar 1\rangle$, and so on). In the 5-qubit case we end up with a nonabelian group by adding the following two elements to those elements that are in $S$:
$\bar X = \sigma_x^{\otimes 5}, \qquad \bar Z = \sigma_z^{\otimes 5}.$
These will be our logical operators,
so that
$\bar X|\bar 0\rangle = |\bar 1\rangle, \quad \bar X|\bar 1\rangle = |\bar 0\rangle, \quad \bar Z|\bar 0\rangle = |\bar 0\rangle, \quad \bar Z|\bar 1\rangle = -|\bar 1\rangle.$
Note that this code has distance 3 and corrects 1 error because 3 is the minimum Hamming weight (number of non-identity tensor factors) of an element of this group lying outside the stabilizer $S$.
Why is Hamming weight 2 not enough to correct one error? If we had, for example, an element such as $\sigma_z^{(1)}\sigma_x^{(2)}$ (acting on qubits 1 and 2) commuting with all of $S$, then we would have
$\sigma_z^{(1)}|c\rangle = \sigma_x^{(2)}|c'\rangle$
for $|c\rangle$, $|c'\rangle$ both in the code, which means that we wouldn't be able to distinguish a $\sigma_z$ error from a $\sigma_x$ error.
Note that, in general, when $N$ commutes with all elements of $S$ and $|c\rangle$ is in the code, $N|c\rangle$ will be in the code, so such elements map codewords to codewords. We can prove this fact by noting that
$M_i (N|c\rangle) = N (M_i |c\rangle) = N|c\rangle.$
Note also that in the examples we've been dealing with so far, where we have a commuting subgroup of the Pauli group, our codes correspond to classical, additive, weakly self-dual codes over $GF(4)$. Here $GF(4) = \{0, 1, \omega, \bar\omega\}$ (with the nonzero elements corresponding to the third roots of unity) is the finite field on 4 elements, and multiplying Pauli matrices corresponds to addition in $GF(4)$. Specifically, we identify
$I \leftrightarrow 0, \quad \sigma_x \leftrightarrow 1, \quad \sigma_z \leftrightarrow \omega, \quad \sigma_y \leftrightarrow \bar\omega,$
satisfying, for example,
$\sigma_x \sigma_z \propto \sigma_y, \quad \text{matching} \quad 1 + \omega = \bar\omega.$
We have now concluded our discussion of quantum error-correcting codes. In the next section we will shift gears and look at quantum channels and channel capacities.
4. Quantum Channels
In this final section we will look at quantum channels and channel capacities.
4.1. Definition and Examples
4.1.1. Definition
We know that we want to define a quantum channel to take a quantum state as input. What should the output be? As a first attempt we might imagine having the output be a probability distribution $\{(p_i, |\psi_i\rangle)\}$ over states. It turns out that for a more succinct description, we can have both the input and output be a density matrix.
Recall that a density matrix takes the form
$\rho = \sum_i p_i |\psi_i\rangle\langle\psi_i|,$
representing a probability distribution over pure states $|\psi_i\rangle$. $\rho$ must also be Hermitian, and it must satisfy $\mathrm{Tr}\,\rho = 1$ (equivalently, we must have $\sum_i p_i = 1$).
Now we may define a quantum channel as the map that takes
$\rho \mapsto \sum_i A_i \rho A_i^\dagger,$
where
$\sum_i A_i^\dagger A_i = I.$
To see that the output is in fact a density matrix, note that the output expression is clearly Hermitian and can be shown to have unit trace using the cyclical property of traces. Note also that the decomposition into the operators $A_i$ (known as Kraus operators) need not be unique.
4.1.2. Example Quantum Channels
Next we give a few examples of quantum channels. The dephasing channel is given by the map
$\rho \mapsto (1 - p)\rho + p\,\sigma_z \rho \sigma_z.$
It maps
$\begin{pmatrix} \rho_{00} & \rho_{01} \\ \rho_{10} & \rho_{11} \end{pmatrix} \mapsto \begin{pmatrix} \rho_{00} & (1 - 2p)\rho_{01} \\ (1 - 2p)\rho_{10} & \rho_{11} \end{pmatrix},$
so it multiplies off-diagonal elements by a factor that is less than 1. Note that when $p = 1/2$, it maps
$(\alpha|0\rangle + \beta|1\rangle)(\bar\alpha\langle 0| + \bar\beta\langle 1|) \mapsto |\alpha|^2 |0\rangle\langle 0| + |\beta|^2 |1\rangle\langle 1|,$
which means that it turns superpositions into classical mixtures (hence the name "dephasing").
Another example is the amplitude damping channel, which models an excited state decaying to a ground state. It is given by the Kraus operators
$A_0 = \begin{pmatrix} 1 & 0 \\ 0 & \sqrt{1-p} \end{pmatrix}, \qquad A_1 = \begin{pmatrix} 0 & \sqrt{p} \\ 0 & 0 \end{pmatrix}.$
Here we let the vector $|0\rangle$ denote the ground state, and we let the vector $|1\rangle$ denote the excited state. Thus we can see that the channel maps the ground state to itself, $|0\rangle \mapsto |0\rangle$, while the excited state gets mapped to $|0\rangle$ with probability $p$ and stays at $|1\rangle$ with probability $1 - p$.
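Both channels can be sketched in a few lines by applying their Kraus operators to test states (an added illustration; the value p = 0.3 is an arbitrary choice):

```python
import numpy as np

# Apply a channel in Kraus form: rho -> sum_i A_i rho A_i^dagger.
p = 0.3

def apply_channel(kraus, rho):
    return sum(A @ rho @ A.conj().T for A in kraus)

# Dephasing channel: apply sigma_z with probability p.
Z = np.array([[1, 0], [0, -1]], dtype=complex)
dephasing = [np.sqrt(1 - p) * np.eye(2), np.sqrt(p) * Z]

# Amplitude damping: |1> decays to |0> with probability p.
damping = [np.array([[1, 0], [0, np.sqrt(1 - p)]]),
           np.array([[0, np.sqrt(p)], [0, 0]])]

plus = np.full((2, 2), 0.5, dtype=complex)       # |+><+|
excited = np.diag([0.0, 1.0]).astype(complex)    # |1><1|

out1 = apply_channel(dephasing, plus)   # off-diagonals shrink by (1 - 2p)
out2 = apply_channel(damping, excited)  # becomes diag(p, 1 - p)

assert np.isclose(out1[0, 1], 0.5 * (1 - 2 * p))
assert np.allclose(np.diag(out2), [p, 1 - p])
```

Note that each Kraus family satisfies the completeness relation $\sum_i A_i^\dagger A_i = I$, so both maps send density matrices to density matrices.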
4.2. Quantum Channel Capacities
Now we consider the capacity of quantum channels, where the capacity quantifies how much information can make it through the channel. We consider classical channels, classical information sent over quantum channels, and quantum information sent over quantum channels. First we start off with the example of the quantum erasure channel to demonstrate that quantum channels behave differently from classical channels, and then we give the actual expressions for the channel capacities before revisiting the example of the quantum erasure channel.
4.2.1. Example: Quantum Erasure Channel
First we start with the example of the quantum erasure channel, which given a state replaces it by an orthogonal "erasure" state $|e\rangle$ with probability $p$ and returns the same state with probability $1 - p$. We claim that the erasure channel can't transmit quantum information when $p \geq 1/2$, behavior that is markedly different from that of classical information. That is to say, for $p \geq 1/2$, there is no way to encode quantum information to send it through the channel and then decode it so the receiver gets a state close to the state that was sent.
To see why this is the case, assume the contrary: that there do exist encoding and decoding protocols that send quantum information through quantum erasure channels with erasure rate $p = 1/2$. We will show that this violates the no-cloning theorem. Suppose that the sender does the following: for each qubit in the encoded state, she tosses a fair coin. If the coin lands heads, she sends the qubit to receiver $B$ and the erasure state to receiver $C$; if the coin lands tails, she sends the qubit to receiver $C$ and the erasure state to receiver $B$. This implements a $p = 1/2$ erasure channel to both receivers $B$ and $C$, which means that she can use this channel to transmit an encoding of a state $|\psi\rangle$ to both receivers, which in turn means that both receivers will be able to decode $|\psi\rangle$. But this means that she has just used this channel to clone the quantum state $|\psi\rangle$, resulting in a contradiction. Thus no quantum information can be transmitted through a channel with $p \geq 1/2$. Note, however, that we can send classical information over this channel, so the behavior of quantum and classical information is markedly different.
It turns out that the rate of quantum information sent over the erasure channel, as a function of $p$, is given by the line $1 - 2p$ (for $p \leq 1/2$, and zero beyond that), while the rate of classical information sent over the erasure channel, as a function of $p$, is given by the line $1 - p$.
Next we will formally state the definition of channel capacity, and then we will return to the quantum erasure channel example and derive the curve that plots rate against $p$.
4.2.2. Definition of Channel Capacities
Channel capacity is defined as the maximum rate at which information can be communicated over many independent uses of a channel from sender to receiver. Here we list the expressions for channel capacity for classical channels, classical information over a quantum channel, and quantum information over a quantum channel.
Classical Channel Capacity. For a classical channel this expression is just the maximum mutual information over all input-output pairs,
$C = \max_{p(X)} I(X; Y),$
where $X$ is the input information and $Y$ is the output information after having gone through the channel.
Classical Information Over a Quantum Channel. The capacity for classical information sent over a quantum channel $N$ is given by
$\chi = \max_{\{p_i, \rho_i\}} \Big[ S\Big(N\big(\textstyle\sum_i p_i \rho_i\big)\Big) - \sum_i p_i S(N(\rho_i)) \Big]$
up to regularization, where $\sum_i p_i \rho_i$ is the average input state, $S$ is the von Neumann entropy, and $N$ is the channel.
Note that we would regularize this by using $n$ copies of the state (that is to say, we want the output of $N^{\otimes n}$) and then dividing by $n$, to get an expression like the following for the regularized capacity of classical information over a quantum channel:
$C = \lim_{n \to \infty} \frac{1}{n} \chi(N^{\otimes n}).$
Quantum Information. The capacity for quantum information is given by the expression
$Q^{(1)} = \max_\rho \Big[ S(N(\rho)) - S\big((N \otimes I)(|\phi_\rho\rangle\langle\phi_\rho|)\big) \Big],$
also up to regularization. Here $N(\rho)$ is the output when channel $N$ acts on input state $\rho$, while $|\phi_\rho\rangle$ is the purification of $\rho$ (that is, it is a pure state containing $\rho$ as a reduced state, which we can obtain by enlarging the Hilbert space). The regularized capacity for quantum information looks like the following:
$Q = \lim_{n \to \infty} \frac{1}{n} Q^{(1)}(N^{\otimes n}).$
Now that we have the exact expression that allows us to calculate the quantum channel capacity, we will revisit our example of the quantum erasure channel and reproduce the plot of channel rate vs erasure probability.
4.2.3. Example Revisited: Quantum Erasure Channel
Recall that, up to regularization, the capacity of a quantum channel is given by
$Q^{(1)} = \max_\rho \Big[ S(N(\rho)) - S\big((N \otimes I)(|\phi_\rho\rangle\langle\phi_\rho|)\big) \Big].$
We will directly calculate this expression for the example of the quantum erasure channel. Let the input be given by the density matrix for the completely mixed state,
$\rho = \frac{1}{2}\big(|0\rangle\langle 0| + |1\rangle\langle 1|\big),$
so that the purification of $\rho$ is given by the state
$|\phi\rangle = \frac{1}{\sqrt 2}\big(|00\rangle + |11\rangle\big).$
Recall that the erasure channel replaces our state with $|e\rangle$ with probability $p$, while with probability $1 - p$ it leaves the input state unchanged. Then, in the basis $\{|0\rangle, |1\rangle, |e\rangle\}$, the matrix corresponding to $N(\rho)$ is given by
$N(\rho) = \mathrm{diag}\Big(\frac{1-p}{2}, \frac{1-p}{2}, p\Big),$
while the matrix corresponding to $(N \otimes I)(|\phi\rangle\langle\phi|)$ is given by
$(N \otimes I)(|\phi\rangle\langle\phi|) = (1 - p)\,|\phi\rangle\langle\phi| + p\,|e\rangle\langle e| \otimes \frac{I}{2},$
with eigenvalues $1 - p$, $p/2$, and $p/2$. We can directly calculate that
$S(N(\rho)) = (1 - p) + H(p), \qquad S\big((N \otimes I)(|\phi\rangle\langle\phi|)\big) = p + H(p).$
Then, subtracting the two entropies, we can calculate the rate to be
$\big[(1 - p) + H(p)\big] - \big[p + H(p)\big] = 1 - 2p,$
which corresponds exactly to the line we saw on the diagram that plotted rate as a function of $p$ for the quantum erasure channel.
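This calculation is easy to verify numerically from the eigenvalues of the two output states (an added check; the values of p are arbitrary):

```python
import numpy as np

# Reproducing the erasure-channel rate 1 - 2p from the two entropies.
def entropy(eigs):
    eigs = [x for x in eigs if x > 0]
    return -sum(x * np.log2(x) for x in eigs)

for p in (0.1, 0.25, 0.4):
    # Eigenvalues of N(rho) in the basis {|0>, |1>, |e>}.
    out = [(1 - p) / 2, (1 - p) / 2, p]
    # Eigenvalues of (N x I)|phi><phi|: the unerased branch is pure,
    # the erased branch is |e><e| tensor I/2.
    joint = [1 - p, p / 2, p / 2]
    rate = entropy(out) - entropy(joint)
    assert np.isclose(rate, 1 - 2 * p)
```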
References
 Bennett, C. H., DiVincenzo, D. P., and Smolin, J. A. Capacities of quantum erasure channels. Phys. Rev. Lett., 78:3217-3220 (1997). quant-ph/9701015.
 Bennett, C. H., DiVincenzo, D. P., Smolin, J. A., and Wootters, W. K. Mixed state entanglement and quantum error correction. Phys. Rev. A, 54:3824 (1996). quant-ph/9604024.
 Calderbank, A. R. and Shor, P. W. Good quantum error-correcting codes exist. Phys. Rev. A, 54:1098 (1996). quant-ph/9512032.
 Devetak, I. The Private Classical Capacity and Quantum Capacity of a Quantum Channel. IEEE Trans. Inf. Theor., 51:44-55 (2005). quant-ph/0304127.
 Devetak, I. and Winter, A. Classical data compression with quantum side information. Phys. Rev. A, 68(4):042301 (2003).
 Gottesman, D. Class of quantum error-correcting codes saturating the quantum Hamming bound. Phys. Rev. A, 54:1862 (1996).
 Laflamme, R., Miquel, C., Paz, J.P., and Zurek, W. H. Perfect quantum error correction code. Phys. Rev. Lett., 77:198 (1996). quant-ph/9602019.
 Lloyd, S. Capacity of the noisy quantum channel. Phys. Rev. A, 55:3 (1997). quant-ph/9604015.
 Nielsen, M. A. and Chuang, I. L. Quantum Computation and Quantum Information. Cambridge University Press, New York (2011).
 Shor, P. W. Scheme for reducing decoherence in quantum computer memory. Phys. Rev. A, 52:2493 (1995).
 Shor, P. W. The quantum channel capacity and coherent information. MSRI Workshop on Quantum Computation (2002).
 Steane, A. M. Error correcting codes in quantum theory. Phys. Rev. Lett., 77:793 (1996).
 Steane, A. M. Multiple particle interference and quantum error correction. Proc. R. Soc. London A, 452:2551-76 (1996).
Highlights beyond EC: Call for nominations
[Guest post by Moshe Babaioff –Boaz]
“Highlights Beyond EC” Session at EC 2019: Call for Nominations
Committee: Mohammad Akbarpour, Moshe Babaioff, Shengwu Li and Ariel Procaccia
Following a new tradition started last year, the 2019 ACM Conference on Economics and Computation (EC’19) will host a special session highlighting some of the best work in economics and computation that appears in conferences and journals other than EC. The intention of this session is to expose EC attendees to related work just beyond the boundary of their current awareness.
We seek nominations for papers on Economics and Computation that have made breakthrough advances, opened up new questions or areas, made unexpected connections, or had significant impact on practice or other sciences. Examples of conferences and journals that publish papers relevant for the special session include STOC/FOCS/SODA/ITCS, AAAI/IJCAI/AAMAS, NIPS/ICML/COLT, WWW/KDD, AER/Econometrica/JPE/QJE/RESTUD/TE/AEJ Micro/JET/GEB, and Math of OR/Management Science/Operations Research. Please email nominations to HighlightsBeyondEC@gmail.com. Anyone is welcome to contact us, but we especially invite members of PCs or editorial boards in various venues to send us suggestions. Nominations should include:
 Name of paper and authors.
 Publication venue or online working version. Preference will be given to papers that have appeared in a related conference or journal within the past two years, or have a working version circulated within the past two years.
 Short (2-3 paragraph) explanation of the paper and its importance.
 (Optional) Names of 1-3 knowledgeable experts on the area of the paper.
Note that at least one of the authors of a selected paper will be required to present their paper at EC 2019 and so should be available to travel to the conference, which is taking place in Phoenix, AZ on June 25-27, 2019.
To ensure maximum consideration, please send all nominations by December 15, 2018.
HALG 2019 Call for nominations
[Guest post by Piotr Sankowski –Boaz]
Call for Nominations – 4th Highlights of Algorithms conference (HALG 2019)
Copenhagen, June 14-16, 2019
The HALG 2019 conference seeks high-quality nominations for invited talks that will highlight recent advances in algorithmic research. Similarly to previous years, there are two categories of invited talks:
A. survey (60 minutes): a survey of an algorithmic topic that has seen exciting developments in the last couple of years.
B. paper (30 minutes): a significant algorithmic result appearing in a paper in 2018 or later.
To nominate, please email halg2019.nominations@gmail.com the following information:
1. Basic details: speaker name + topic (for survey talk) or paper’s title, authors, conference/arxiv + preferable speaker (for paper talk).
2. Brief justification: Focus on the benefits to the audience, e.g., quality of results, importance/relevance of topic, clarity of talk, speaker’s presentation skills.
All nominations will be reviewed by the Program Committee (PC) to select speakers that will be invited to the conference.
Nominations deadline: December 9, 2018 (for full consideration).
Please keep in mind that the conference does not provide financial support for the speakers.
Best regards,
Piotr Sankowski
HALG 2019 PC Chair
Where’s that paper?
[Guest post by Eylon Yogev about a Chrome extension he wrote, of which I am a happy user –Boaz]
Hi fellow researchers,
I’m writing to share a little tool that I have developed with the ambitious goal of boosting research productivity. The tool is a Chrome extension named “Where’s that paper?”. Before I tell you more about what it does, let me touch upon the largest obstacle of any new tool, the learning curve. Rest assured that “Where’s that paper?” requires zero training – just install it and continue working as usual – it will guide you without further effort on your end.
After building much suspense, I will elaborate. If you're like me, you find yourself frequently searching the web for papers that you have browsed or opened in the past on your computer – basically your paper browsing history. Whether it is in the process of writing a paper or searching for existing ideas, you may search for authors or titles of papers that you recall having once glimpsed at. At some point, the paper has either been downloaded or added to the favorites bar. In any case, this process is manual and takes up much time (often including frustration). As such, I thought we could all use some code to automate it.
Functionality.
The extension is very simple: it identifies when you are reading a scientific paper (according to the domain) and then automatically adds this paper, with the proper author list, year, etc., to your favorites bar under a designated folder. Then, when you type a search in Chrome's search bar for an author or a title, the relevant results from the favorites pop up. One might also explicitly browse the favorites to see all papers read.
See an example of the results shown when searching for "short" in Chrome's address bar. The first three links are from the favorites and the rest are general Google suggestions from the web.
The extension automatically works on a list of specified domains that include:
eprint.iacr.org, arxiv.org, eccc.weizmann.ac.il, epubs.siam.org, research.microsoft.com, citeseerx.ist.psu.edu, ac.els-cdn.com, www.sciencedirect.com, download.springer.com, link.springer.com, delivery.acm.org, proceedings.mlr.press and journals.aps.org.
It is not too difficult to customize the list with additional domains. Reach out to me and I’ll add them upon your request!
A bonus feature (thanks Boaz for the suggestion): Click the extension’s icon and you can download a bib file containing (a nice formatting of) the DBLP bibtex records of all papers added to the favorites.
Download “Where’s that paper?” from the Chrome Web Store: https://chrome.google.com/webstore/detail/wheresthatpaper/dkjnkdmoghkbkfkafefhbcnmofdbfdio
Security.
Most who work in our domain of expertise would be quite concerned about installing extensions, and with good reason. The extension I wrote does not leak any information, neither to me nor to any third party. I have no ulterior motive in developing the extension: it originally helped me, and I now see the benefit of sharing it with our community. I do not know how to reassure you that this is indeed the case other than giving you my word and publishing the source code online. It is available here:
https://github.com/eylonyogev/WheresThatPaper
I’d be glad to hear any feedback.
Thanks,
Eylon.
Approximating Partition Functions
[Guest post by Alex Kelser, Alex Lin, Amil Merchant, and Suproteem Sarkar, scribing for a lecture by Andrej Risteski.]
Andrej Risteski: Approximating Partition Functions via Variational Methods and Taylor Series
For a probability distribution defined up to a constant of proportionality, we have already seen the partition function. To refresh your memory, given a probability mass function over all , the partition function is simply the normalization constant of the distribution, i.e.
At first glance, the partition function may appear to be uninteresting. However, upon taking a deeper look, this single quantity holds a wide array of intriguing and complex properties. For one thing, its significance in the world of thermodynamics cannot be overstated. The partition function is at the heart of relating the microscopic quantities of a system – such as the individual energies of each probabilistic state – to macroscopic entities describing the entire system: the total energy, the energy fluctuation, the heat capacity, and the free energy. For explicit formulas, consult this source for reference. In machine learning the partition function also holds much significance, since it’s intimately linked to computing marginals in the model.
Although there exists an explicit formula for the partition function, the challenge lies in computing the quantity in polynomial time. Suppose is a vector of dimension where each for . Specifically, we would like a smarter way of finding than simply adding up over all combinations of .
Andrej Risteski’s lecture focused on using two general techniques – variational methods and Taylor series approximations – to find provably approximate estimates of .
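As a concrete baseline, here is the naive computation the lecture aims to avoid (a minimal sketch, not from the lecture; the coupling matrix `J`, field `h`, and the triangle example are illustrative assumptions):

```python
import itertools
import math

def ising_partition_function(J, h):
    """Naive Z for an Ising model p(x) proportional to
    exp(sum_{i<j} J[i][j] x_i x_j + sum_i h[i] x_i),
    summing over all 2^n configurations x in {-1,+1}^n.
    J is upper-triangular; this O(2^n) cost is what we want to beat."""
    n = len(h)
    Z = 0.0
    for x in itertools.product([-1, 1], repeat=n):
        f = sum(J[i][j] * x[i] * x[j]
                for i in range(n) for j in range(i + 1, n))
        f += sum(h[i] * x[i] for i in range(n))
        Z += math.exp(f)
    return Z

# Three spins on a triangle, uniform coupling 0.5, no external field.
J = [[0, 0.5, 0.5], [0, 0, 0.5], [0, 0, 0]]
h = [0.0, 0.0, 0.0]
print(ising_partition_function(J, h))  # 2*e^1.5 + 6*e^-0.5 ≈ 12.6026
```

For this triangle the sum has only two aligned configurations (weight e^1.5 each) and six mixed ones (weight e^-0.5 each), which is an easy sanity check; for general n the loop is exponential.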
Motivation
The general setup addresses undirected graphical models, also known as Markov random fields (MRFs), where the probability mass function has the form
for some random, dimensional vector and some set of parameterized functions . Note that the notation denotes all pairs where and the edge exists in the graph.
The talk considered the specific setup where each , so . Also, we fix
for some set of coefficients , thereby giving us the well-known Ising model. Note that the interaction terms could be more complicated and made “less local” if desired, but that was not discussed in the lecture.
These graphical models are common in machine learning, where there are two common tasks of interest:
 Learning: given samples, models try to learn or
 Inference: given or the potentials as input, we want to calculate the marginal probabilities, such as
We focus on the latter task. The problem of inference is closely related to calculating the partition function. This value is often used as the normalization constant in many methods, and it is classically defined as
Although we can write down the aforementioned closed-form expression for the partition function, it is difficult to calculate this quantity in polynomial time. There are two broad approaches to solving inference problems:
 Randomized: Methods based on Markov chain Monte Carlo (MCMC), such as Gibbs sampling, that construct a Markov chain whose stationary distribution is the distribution of interest. These have been more extensively studied in theoretical computer science.
 Deterministic: Variational methods, self-avoiding walk trees, Taylor expansion methods
Although more often used in practice, randomized algorithms such as MCMC are notoriously hard to debug, and it is often unclear at what point the chain reaches stationarity.
In contrast, variational methods are often more difficult to rigorously analyze but have the added benefit of turning inference problems into optimization problems. Risteski’s talk considered some provable instantiations of variational methods for calculating the partition function.
Variational Methods
Let us start with a basic observation, known as the Gibbs variational principle. This can be stated as the following.
Lemma 1: Let . Then, we can show that the corresponding partition function satisfies
where is defined to be a valid distribution over and is the set of all such distributions.
Note that we use to denote the expectation of the inner argument with respect to the distribution and to denote the Shannon entropy of .
Proof: For any , the KL divergence between and must be greater than or equal to 0.
Equality holds if , leading directly to the lemma above.
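Writing p(x) = e^{f(x)}/Z (a convention assumed here, since the lecture’s formulas were not preserved in these notes), the proof is the one-line calculation:

```latex
0 \;\le\; \mathrm{KL}(\mu \,\|\, p)
  \;=\; \sum_x \mu(x)\log\frac{\mu(x)}{p(x)}
  \;=\; \sum_x \mu(x)\log\mu(x) \;-\; \sum_x \mu(x)\bigl(f(x) - \log Z\bigr)
  \;=\; -H(\mu) \;-\; \mathbb{E}_{\mu}[f] \;+\; \log Z .
```

Rearranging gives \(\log Z \ge \mathbb{E}_{\mu}[f] + H(\mu)\) for every distribution \(\mu\), with equality iff \(\mu = p\).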
Note that the proof would also have held in the more generic exponential family with instead of . Also, at most temperatures, one of the two terms, either the energy or the entropy, will dominate.
As a result of this lemma, we have framed the original inference problem of calculating as an optimization problem over . This is the crux of variational inference.
Optimization Difficulties
The issue with this approach is that it is difficult to optimize over the polytope of distributions due to the values of each coming from . In general, this is NOT tractable. We can imagine two possible solutions:
 Inner Approximation
Instead of optimizing over all , pick a “nice” subfamily of distributions to constrain . The prototypical example is the mean-field approximation in which we let , where each is a univariate distribution. Thus, we approximate with
This would provide a lower bound on . There are a number of potential issues with this method. For one thing, the functions are typically non-convex. Even ignoring this problem, it is difficult to quantify how good the approximation is.
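To make the mean-field approximation concrete, here is a minimal coordinate-ascent sketch for the Ising case (not from the lecture; the fixed-point update m_i = tanh(h_i + Σ_j J_ij m_j) is the standard naive mean-field iteration, and the function name and initialization are my own):

```python
import math

def mean_field_ising(J, h, iters=200):
    """Naive mean-field for an Ising model over {-1,+1}^n:
    approximate p by a product distribution q = prod_i q_i, maximizing
    E_q[f] + H(q) by coordinate ascent on the means m_i = E_q[x_i]
    via the fixed-point update m_i = tanh(h_i + sum_j J_ij m_j).
    J is upper-triangular; returns the means and the lower bound on log Z."""
    n = len(h)
    m = [0.1] * n  # arbitrary initialization of the means
    for _ in range(iters):
        for i in range(n):
            field = h[i] + sum(J[min(i, j)][max(i, j)] * m[j]
                               for j in range(n) if j != i)
            m[i] = math.tanh(field)
    # Mean-field lower bound: E_q[f] + H(q)
    energy = sum(J[i][j] * m[i] * m[j]
                 for i in range(n) for j in range(i + 1, n))
    energy += sum(h[i] * m[i] for i in range(n))
    entropy = 0.0
    for mi in m:
        q = (1 + mi) / 2  # q_i = P(x_i = +1)
        for p in (q, 1 - q):
            if p > 0:
                entropy -= p * math.log(p)
    return m, energy + entropy
```

With zero couplings the approximation is exact and the returned bound equals log Z = Σ_i log(2 cosh h_i), an easy sanity check; with nonzero J it is only a lower bound, and the objective is non-convex, as noted above.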
 Outer Approximation
Note: Given a distribution , we use to denote the marginal for some set .
In the outer approximation, we relax the polytope we are optimizing over using convex hierarchies. For instance, we can define as the polytope containing valid marginals over subsets of size at most . We can then reformulate our objective as a two-part problem of (1) finding a set of pairwise marginals that optimizes the energy term and (2) finding a distribution with pairwise marginals matching that optimizes the entropy term. For simplicity of notation, we use to denote , where . It follows that we can rewrite the Gibbs equation as
The first term here represents the energy on pairwise marginals, whereas the second term is the maximum entropy subject to a constraint about matching the energy distribution’s pairwise marginals. The goal of this procedure is to enlarge the polytope such that is a tractable set, where we can impose a polynomial number of constraints satisfied by real marginals.
One specific convex hierarchy that is commonly used for relaxation is the Sherali-Adams (SA) hierarchy. Sherali-Adams will allow us to formulate the optimization of the energy term (and an approximation of the entropy term) as a convex program. We introduce the polytope for , which will relax the constraints on to allow for values outside of in order to generate a polynomial-time solution for .
The Sherali-Adams hierarchy will take care of the energy term, but it remains unclear how to rewrite the entropy term in the context of the convex program. In fact, we will need approximations for in order to accomplish this task.
In this talk, we’ll consider two approximations: a classical one, the Bethe entropy , and a more recent one, the augmented mean-field pseudo-entropy .
Bethe Entropy Approximation
The Bethe entropy works by pretending that the Markov random field is a tree. In fact, we can show that if the graph is a tree, then using the Bethe entropy approximation in the convex program defined by will yield an exact calculation of .
Specifically, the Bethe entropy is defined as
where is defined to be the degree of the particular vertex . Note that there are no marginals over sets of dimension greater than two in this expression; thus, we have a valid convex program over the polytope .
This lemma is well-known, but it will be a useful warmup:
Lemma 2 Define the output of the convex program
On a tree, and the optimization objective is concave with respect to the variables, so it can be solved in polynomial time.
Proof sketch We will prove 3 claims:

Since the energy term is exact, it suffices to show that for valid marginals , . This can be done by rewriting and using a property of conditional entropy, namely that .
 For trees, is concave in the variables.
This proof was only sketched in class but is similar to the usual proof of concavity of entropy.
 For trees, we can round a solution to a proper distribution over which attains the same value for the original optimization.
The distribution that we will produce is
where we start at the root and keep sampling down the tree. Based on the tree structure, . The energy is also the same since . Therefore, since both terms in the equation are equal to the respective terms in the partition function, we get that the equation must equal the partition function.
Remarks:
 is commonly used on graphs that are not trees.
 However, for non-trees, the optimization is often no longer concave.
 Therefore, the value obtained by the output equation is neither an upper nor a lower bound.
 The fixed points of the belief propagation algorithm are in one-to-one correspondence with the stationary points of the Bethe objective. (See a classical paper by Yedidia et al.)
Augmented Mean-Field Pseudo-Entropy
For this approximation, we will focus on dense graphs: namely graphs in which each vertex has degree for some . To simplify things, we focus on distributions of the form
where is some positive constant parameter. The Bethe entropy approximation is undesirable in the dense case, because we cannot bound its error on non-trees. Instead, we proved the following theorem.
Theorem (Risteski ’16)
For a dense graph with parameter , there exists an outer approximation based on and an entropy proxy which achieves a additive approximation to for some . There is an algorithm with runtime .
To parse the theorem statement, consider the potential regimes for :
 One can ignore the entropy portion and solve for the energy portion that dominates. (So the guarantee is not that interesting.)
 MCMC methods give a factor approximation in time poly. (It’s not clear if the methods suggested here can give such a guarantee – this is an interesting problem to explore.)
 MCMC mixes slowly, but there is no other way to get a comparable guarantee. This is the interesting regime!
Proof Sketch The proof strategy is as follows: We will formulate a convex program under the SA() relaxation that can return a solution in polynomial time with value . Note that this value of the relaxation will be an upper bound on the true partition function value . The convex program solution may not be a valid probability distribution, so from this, we will construct a “rounded solution” – an actual distribution with value . It follows that
We will then aim to put a bound on the gap between and ; namely, that
or equivalently,
This equivalently places a bound on the optimality gap between and , thereby proving the theorem. Here are the main components of the proof:
1. Entropy Proxy
To approximate , we will use the augmented mean-field pseudo-entropy approximation . This is defined as
Using , we can write a convex program under SA() that provides a relaxation for optimizing . Specifically, we will let . Let be the output solution to this relaxation. We define the rounded solution as
Using the chain rule for entropy, we can show that
2. Rounding Scheme The above distribution is actually the same distribution produced by correlation rounding, as introduced in a seminal paper of Barak, Raghavendra, and Steurer. By using the definition of mutual information and Pinsker’s inequality, which shows that , we can work through some algebra to show that
Then, putting together the entropy and energy effects, we can prove the main theorem.
Remarks:
 The above proof easily extends to a more general notion of density, as well as to sparse graphs with “low-rank” spectral structure. See the paper for more information.
 In fact, with more work, one can extract guarantees on the best mean-field approximation to the Gibbs distribution in the dense regime. (This intuitively can be done because the distribution produced above is almost mean-field, except for the set .) This is interesting since, as we mentioned, the quality of the mean-field approximation is usually difficult to characterize. (Moreover, the procedure is algorithmic.) See the recent work of Jain, Koehler and Risteski ’18 for more details and results.
Taylor Series Approximation
Another method of calculating the partition function involves Taylor expanding its logarithm around a cleverly selected point. This approach was first developed by Barvinok ’14, for the purpose of computing the partition function of cliques in a graph. The mathematical techniques developed in this paper can be naturally extended to evaluate a variety of partition functions, including that of the ferromagnetic Ising model. We will use this example to illustrate the general idea of the Taylor expansion technique.
The goal here is to devise a deterministic, fully polynomial-time approximation scheme (an FPTAS) for evaluating the partition function of the ferromagnetic Ising model. This means that the algorithm must run in poly() and be correct within a multiplicative factor of . We will work in the general case where the Hamiltonian includes an external magnetic field term as well as the neighboring spin interaction terms. Using logarithmic identities, we can rewrite the partition function slightly differently from last time:
Here, is called the vertex activity (or external field), and characterizes the likelihood of a vertex to be in the + configuration. Additionally, is referred to as the edge activity, and characterizes the propensity of a vertex to agree with its neighbors. The ferromagnetic regime (where agreement of spins is favored) corresponds to . denotes the number of edges cut in the graph, which is equivalent to the number of neighboring pairs with opposite spins.
As one final disclaimer, the approximation scheme presented below is not valid for , which corresponds to the zero-magnetic-field case. Randomized algorithms exist which can handle this case.
The general idea here is to hold one parameter fixed ( in our case) and express the logarithm of the partition function as a Taylor series in the other parameter (). From a physics standpoint, the Taylor expansion around tells us how the partition function changes as a function of the magnetic field. Without loss of generality, we focus on the case where . This simplification can be justified by a simple symmetry argument. Consider as the inverse of where all the +1’s (spin-up) are flipped to −1’s (spin-down) and vice versa. The partition function of this flipped system can be related to that of the original system by a constant factor:
This holds because the number of cut edges remains constant when the values are flipped. Since these two partition functions are related by a factor that is constant in , for any model with , we can simply consider the flipped graph and use the above relation.
For our purposes, it will be more convenient to approximate the logarithm of the partition function, because a multiplicative approximation of corresponds to an additive approximation of . In fact, if we allow the partition function to be complex, then we can easily show that an additive bound of for guarantees a multiplicative bound of for .
For notational convenience we define:
Thus, the Taylor expansion of around is given by:
The big question now is: Can we get a good approximation from just the lower order terms of the Taylor expansion for ? Equivalently, can we bound the sum of the higher order terms at ? Additionally, we still must address the question of how to actually calculate the lower order terms.
To answer these questions, we make simple observations about the derivatives of in relation to the derivatives of .
The last equation just comes from repeated application of the product rule. Using these relations, we can solve for the first derivatives of if we have the first derivatives of , using a triangular linear system of equations. As , we see that the system is non-degenerate.
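Concretely, writing g = ln Z, the product rule gives Z^(k) = Σ_{j=0}^{k-1} C(k-1, j) g^(j+1) Z^(k-1-j), and the triangular system can be solved for the derivatives of g one at a time (a sketch in my own notation; the function name is illustrative):

```python
from math import comb

def log_derivatives(Z_derivs):
    """Given [Z(0), Z'(0), ..., Z^(m)(0)], recover [g'(0), ..., g^(m)(0)]
    for g = ln Z by back-substitution in the triangular system
        Z^(k) = sum_{j=0}^{k-1} C(k-1, j) * g^(j+1) * Z^(k-1-j),
    which follows from Z' = g' * Z and repeated product rule.
    The system is non-degenerate as long as Z(0) != 0."""
    Z0 = Z_derivs[0]
    g = []  # g[j] holds g^(j+1)(0)
    for k in range(1, len(Z_derivs)):
        acc = Z_derivs[k]
        for j in range(k - 1):
            acc -= comb(k - 1, j) * g[j] * Z_derivs[k - 1 - j]
        g.append(acc / Z0)
    return g
```

For example, Z(z) = e^z has all derivatives 1 at 0 and g(z) = z, so the recovered derivatives are 1, 0, 0, …; Z(z) = 1/(1 − z) has Z^(k)(0) = k! and g^(k)(0) = (k − 1)!.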
Note that is a degree-n polynomial with leading coefficient 1 and constant coefficient 1 (corresponding to the configurations where all vertices are positive / negative, respectively). Using this form, we can rewrite the partition function in terms of its (possibly complex) roots :
Supposing we keep the first terms, the error is bounded by:
Here we can invoke the Lee-Yang theorem, which tells us that the zeroes of for a system with ferromagnetic interactions lie on the unit circle of the complex plane. So the Lee-Yang theorem guarantees that , and we see that the error due to truncation is ultimately bounded by:
Now by the previous symmetry argument, we can assume that . Thus, to achieve an error bound of , we must have:
Rearranging terms and taking the natural log of both sides (which is justified given that ), we see that the inequality is satisfied if:
Thus, we need only retain the first terms of the Taylor series expansion of . This will involve calculating the first derivatives of , which naively can be done in quasi-polynomial time. This is done by running over all , where . This takes time, which is quasi-polynomial.
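The naive quasi-polynomial computation can be sketched as follows (illustrative names, not from the lecture; here the coefficient of the k-th power of the vertex activity sums b^cut(S) over all vertex subsets S of size k, i.e. over configurations with k spin-up vertices):

```python
from itertools import combinations

def low_order_coeffs(n, edges, b, m):
    """Coefficients a_0..a_m of Z(lam) = sum_k a_k * lam^k for the Ising
    partition function Z = sum_{S subset of V} lam^{|S|} * b^{cut(S)},
    where S is the set of spin-up vertices and cut(S) is the number of
    edges with exactly one endpoint in S.  Enumerating subsets of size
    <= m costs O(n^m): quasi-polynomial for m = O(log n)."""
    coeffs = []
    for k in range(m + 1):
        a_k = 0.0
        for S in combinations(range(n), k):
            Sset = set(S)
            cut = sum(1 for (u, v) in edges if (u in Sset) != (v in Sset))
            a_k += b ** cut
        coeffs.append(a_k)
    return coeffs
```

As a sanity check, a single edge with b = 0.5 gives Z(lam) = 1 + 2b·lam + lam², i.e. coefficients 1, 1, 1.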
Recent work by Patel and Regts, as well as Liu, Sinclair and Srivastava, has focused on evaluating these coefficients more efficiently (in polynomial time), but it is outside the scope of this lecture. Clever counting arguments aside, to trivially calculate the th derivative, we need only sum over the vectors with spin-up components.
Lee-Yang Theorem
Most of the heavy lifting in the above approach is done by the Lee-Yang theorem. In this section, we sketch how it is proved.
First, let’s define the Lee-Yang property, which we will refer to as the LYP. Let be some multilinear polynomial. Then has the Lee-Yang property if, whenever the complex numbers satisfy for all i, we have .
We can then show that the partition function for the ferromagnetic Ising model, which we wrote as
must have this LYP. In the antiferromagnetic case it turns out that all zeroes lie on the negative real axis, but we will focus on the ferromagnetic case.
Proof (sketch):
For the proof that the partition function has the LYP, we will use Asano’s contraction argument. This relies on the fact that certain operations preserve the LYP, and we can “build up” to the partition function of the full graph via these operations.
 Disjoint union: This is fairly simple to prove, since the partition function for the disjoint union of and factors into the product of the two partition functions of the original graphs . If and have the LYP, then it is clear that has the same LYP.
 Contraction: Suppose we produce a graph from by contracting two vertices in , which means merging them into one vertex. It can be shown that if has the LYP, then also has the LYP. We write the partition function for G as
The “contracted” graph amounts to deleting the middle two terms, so the partition function for can be written as
We want to show that the partition function for has the LYP. Because the partition function of has the LYP, we can consider the case where . By the LYP,
if . By Vieta’s formulas we can find a relation between A and D,
Now, to show that the partition function of G’ has the LYP, we assume for contradiction that there is a with such that . For this to be true,
However, this means that , so there is no with such that the partition function of G’ is 0, and so it has the LYP.
The final part of this proof is to construct the original graph. We observe that for a single edge, the partition function has the LYP. In this case, we can easily write out the partition function for two vertices:
Suppose : then would imply . However, this is just the Möbius transform mapping the exterior of the unit disk to the interior, so . Thus, it cannot be the case that both and have absolute value greater than one.
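In one consistent normalization (assumed here, since the scribe formulas were not preserved), the two-vertex partition function with edge activity 0 < b < 1 is

```latex
Z(z_1, z_2) \;=\; z_1 z_2 \;+\; b\,z_1 \;+\; b\,z_2 \;+\; 1,
\qquad
Z(z_1, z_2) = 0 \;\iff\; z_1 \;=\; -\,\frac{b\,z_2 + 1}{z_2 + b}.
```

On \(|z_2| = 1\) one checks \(|b z_2 + 1| = |z_2 + b|\), so this Möbius transformation maps the unit circle to itself; since \(z_2 = \infty\) maps to \(-b\), which lies inside the disk, it maps the exterior to the interior. Hence \(Z\) cannot vanish with both \(|z_1| > 1\) and \(|z_2| > 1\).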
Since single edges have the LYP, we break the graph into single edges, making copies of vertices as needed. These copies are then contracted, and we build the graph back up to show that its partition function has the LYP.
Knowing that the partition function has the LYP, direct application of the Lee-Yang theorem guarantees that the roots lie on the unit circle in the complex plane.
Conclusion
Although there exists an explicit formula for the partition function, the challenge lies in computing the quantity in polynomial time. Andrej Risteski’s lecture focused on two methods less explored in theoretical computer science – namely variational inference and Taylor series approximations – to find provably approximate estimates. Both of these approaches are relatively new and replete with open problems.
Rabin postdocs ad is out
[Guest blog from Yaron Singer who is heading the selection committee for the Rabin fellowship this year. In addition to the Rabin fellowship and other postdoc opportunities, this year we also have a new postdoc opportunity in quantum computation and information via the Harvard Quantum Initiative. –Boaz.]
Belief Propagation and the Stochastic Block Model
[Guest post by Thomas Orton who presented a lecture on this in our physics and computation seminar –Boaz]
Introduction
This blog post is a continuation of the CS229R lecture series. Last week, we saw how certain computational problems like 3SAT exhibit a thresholding behavior, similar to a phase transition in a physical system. In this post, we’ll continue to look at this phenomenon by exploring a heuristic method, belief propagation (and the cavity method), which has been used to make hardness conjectures, and also has thresholding properties. In particular, we’ll start by looking at belief propagation for approximate inference on sparse graphs as a purely computational problem. After doing this, we’ll switch perspectives and see belief propagation motivated in terms of Gibbs free energy minimization for physical systems. With these two perspectives in mind, we’ll then try to use belief propagation to do inference on the stochastic block model. We’ll see some heuristic techniques for determining when BP succeeds and fails in inference, as well as some numerical simulation results of belief propagation for this problem. Lastly, we’ll talk about where this all fits into what is currently known about efficient algorithms and information theoretic barriers for the stochastic block model.
(1) Graphical Models and Belief Propagation
(1.0) Motivation
Suppose someone gives you a probabilistic model on (think of as a discrete set) which can be decomposed in a special way, say
where each only depends on the variables . Recall from last week that we can express constraint satisfaction problems in these kinds of models, where each is associated with a particular constraint. For example, given a 3SAT formula , we can let if is satisfied, and 0 otherwise. Then each only depends on 3 variables, and only has support on satisfying assignments of .
A central problem in computer science is trying to find satisfying assignments to constraint satisfaction problems, i.e. finding values in the support of . Suppose that we knew that the value of were . Then we would know that there exists some satisfying assignment where . Using this knowledge, we could recursively try to find ‘s in the support of , and iteratively come up with a satisfying assignment to our constraint satisfaction problem. In fact, we could even sample uniformly from the distribution as follows: randomly assign to with probability , and assign it to otherwise. Now iteratively sample from for the model where is fixed to the value we assigned to it, and repeat until we’ve assigned values to all of the . A natural question is therefore the following: When can we efficiently compute the marginals
for each ?
A well-known efficient algorithm for this problem exists when the corresponding graphical model of (more on this in the next section) is a tree. Even though belief propagation is only guaranteed to work exactly on trees, we might hope that if our factor graph is “tree-like”, then BP will still give a useful answer. We might even go further than this, and try to analyze exactly when BP fails for a random constraint satisfaction problem. For example, you can do this for kSAT when is large, and then learn something about the solution threshold for kSAT. It therefore might be natural to try and study when BP succeeds and fails for different kinds of problems.
(1.1) Deriving BP
We will start by making two simplifying assumptions on our model .
First, we will assume that can be written in the form for some functions and some “edge set” (where the edges are undirected). In other words, we will only consider pairwise constraints. We will see later that this naturally corresponds to a physical interpretation, where each of the “particles” interact with each other via pairwise forces. Belief propagation actually still works without this assumption (which is why we can use it to analyze SAT for ), but the pairwise case is all we need for the stochastic block model.
For the second assumption, notice that there is a natural correspondence between and the graphical model on vertices, where forms an edge in iff . In other words, edges in correspond to factors of the form in , and vertices in correspond to variables in . Our second assumption is that the graphical model is a tree.
Now, suppose we’re given such a tree which represents our probabilistic model. How do we compute the marginals? Generally speaking, when computer scientists see trees, they begin to get very excited [reference]. “I know! Let’s use recursion!” shouts the student in CS124, their heart rate noticeably rising. Imagine that we arbitrarily rooted our tree at vertex . Perhaps, if we could somehow compute the marginals of the children of , we could stitch them together to compute the marginal . In other words, we should think about computing the marginals of roots of subtrees in our graphical model. A quick check shows that the base case is easy: suppose we’re given a graphical model which is a tree consisting of a single node . This corresponds to some PDF . So to compute , all we have to do is compute the marginalizing constant , and then we have . With the base case out of the way, let’s try to solve the induction step: given a graphical model which is a tree rooted at , where we’re given the marginals of the subtrees rooted at the children of , how do we compute the marginal of the tree rooted at ? Take a look at figure 2 to see what this looks like graphically. To formalize the induction step, we’ll define some notation that will also be useful to us later on. The main pieces of notation are , which is the subtree rooted at with parent , and the “messages” , which can be thought of as information which is passed from the child subtrees of to the vertex in order to compute the marginals correctly.
 We let denote the neighbours of vertex in . In general, I’ll interchangeably switch between vertex and the variable represented by vertex .
 We define to be the subtree of rooted at , where ‘s parent is . We need to specify ‘s parent in order to give an orientation to our tree (so that we know in which direction we need to do recursion down the tree). Likewise, we let be the set of vertices/variables which occur in subtree .
 We define the function , which is a function of the variables in , to be equal to the product of ‘s edges and vertices. Specifically, , where is the set of edges which occur in subtree . We can think of as being the pdf (up to normalizing constant) of the “model” of the subtree .
 Lastly, we define , where is a normalizing constant chosen such that . In particular, is a function defined for each possible value of . We can interpret in two ways: firstly, it is the marginal of the root of the subtree . Secondly, we can think of it as a “message” from vertex to vertex . We’ll see why this is a valid interpretation shortly.
Phew! That was a lot of notation. Now that we have that out of the way, let’s see how we can express the marginal of the root of a tree as a function of the marginals of its subtrees. Suppose we’re considering the subtree , so that vertex has children . Then we can compute the marginal directly:
The non-obvious step in the above is that we’re able to switch around summations and products: we can do this because the trees are functions on disjoint sets of variables. So we’re able to express as a function of the children values . Looking at the update formula we have derived, we can now see why the are called “messages” to vertex : they send information about the child subtrees to their parent .
The above discussion is a purely algebraic way of deriving belief propagation. A more intuitive way to get this result is as follows: imagine fixing the value of in the subtree , and then drawing from each of the marginals of the children of conditioned on the value . We can consider the marginals of each of the children independently, because the children are independent of each other when conditioned on the value of . Converting words to equations, this means that if has children , then the marginal probability of in the subtree is proportional to . We can then write
And we get back what we had before. We’ll call this last equation our “update” or “message passing” equation. The key assumption we used was that if we condition on , then the children of are independent. It’s useful to keep this assumption in mind when thinking about how BP behaves on more general graphs.
A similar calculation yields that we can calculate the marginal of our original probability distribution as the marginal of the subtree with no parent, i.e.
Great! So now we have an algorithm for computing marginals: recursively compute for each in a dynamic programming fashion with the “message passing” equations we have just derived. Then, compute for each . If the diameter of our tree is , then the recursion depth of our algorithm is at most .
However, instead of computing every neatly with recursion, we might try something else: let’s instead randomly initialize each with anything we want. Then, let’s update each in parallel with our update equations. We will keep doing this in successive steps until each has converged to a fixed value. By looking at belief propagation as a recursive algorithm, it’s easy to see that all of the ‘s will have their correct values after at most steps. This is because (after arbitrarily rooting our tree at any vertex) the leaves of our recursion will initialize to the correct value after 1 step. After two steps, the parents of the leaves will be updated correctly as functions of the leaves, and so they will have the correct values as well. Specifically:
Proposition: Suppose we initialize messages arbitrarily, and update them in parallel according to our update equations. If the graph has diameter d, then after d steps each message converges, and we recover the correct marginals.
Why would anyone want to do things in this way? In particular, by computing everything in parallel instead of recursively, we’re computing a lot of “garbage” updates which we never use. However, the advantage of doing things this way is that the procedure is now well defined for general graphs. In particular, suppose our distribution violated assumption (2), so that the corresponding graph was not a tree. Then we could still try to compute the messages with parallel updates. We are also able to do this in a local “message passing” kind of way, which some people may find physically intuitive. Maybe if we’re lucky, the messages will converge after a reasonable number of iterations. Maybe if we’re even luckier, they will converge to something which gives us information about the marginals. In fact, we’ll see that just such a thing happens in the stochastic block model. More on that later. For now, let’s shift gears and look at belief propagation from a physics perspective.
(2) Free Energy Minimization
(2.0) Potts model (refresher from last week)
We’ve just seen a statistical/algorithmic view of how to compute marginals in a graphical model. It turns out that there’s also a physical way to think about this, which leads to a qualitatively similar algorithm. Recall from last week that another interpretation of a pairwise-factorable PDF is that of particles interacting with each other via pairwise forces. In particular, we can imagine each pair of particles interacting via a force of strength
and in addition, interacting with an external field
We imagine that each of our particles takes values from a discrete set of q possible states. When q = 2, we recover the Ising model, and in general we have a Potts model. The energy function of this system is then
with probability distribution given by
Now, for , computing the marginals corresponds to the equivalent physical problem of computing the “magnetizations” .
How does this setup relate to the previous section, where we thought about constraint satisfaction problems and probability distributions? If we could choose the interaction strengths and external fields appropriately, we would recover exactly the probability distribution from the previous section. From a constraint satisfaction perspective, if we set the pairwise energy to be lower when a constraint is satisfied than when it is violated, then as the inverse temperature β → ∞ (our system becomes colder), the probability mass becomes concentrated only on the satisfying assignments of the constraint satisfaction problem.
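To illustrate the “colder means concentrated” behaviour, here is a hedged Python sketch of a tiny q = 2 Potts (i.e. Ising) system; the triangle instance, couplings and field values are toy choices of mine, not from the lecture:

```python
import itertools
import math

# A tiny q = 2 Potts (Ising) system on a triangle. J[(i, j)] are coupling
# strengths, h[i][s] are external fields; both are toy values of my choosing.
# The energy is E(x) = -sum_{(i,j)} J_ij * [x_i == x_j] - sum_i h_i(x_i).
q = 2
J = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): -0.5}
h = {0: [0.2, 0.0], 1: [0.0, 0.0], 2: [0.0, 0.0]}

def energy(x):
    e = -sum(Jij * (x[i] == x[j]) for (i, j), Jij in J.items())
    e -= sum(h[i][x[i]] for i in h)
    return e

def boltzmann(beta):
    # p(x) is proportional to exp(-beta * E(x)); Z is the partition function.
    w = {x: math.exp(-beta * energy(x))
         for x in itertools.product(range(q), repeat=3)}
    Z = sum(w.values())
    return {x: wx / Z for x, wx in w.items()}

ground = min(energy(x) for x in itertools.product(range(q), repeat=3))

def ground_state_mass(beta):
    # Total probability mass on minimum-energy ("ground state") configurations.
    return sum(p for x, p in boltzmann(beta).items()
               if abs(energy(x) - ground) < 1e-12)

# As beta grows (the system gets colder), mass concentrates on ground states.
mass_warm = ground_state_mass(0.5)
mass_cold = ground_state_mass(5.0)
```

At β = 0.5 the unique ground state holds well under a quarter of the mass; at β = 5 it holds the majority, illustrating concentration on satisfying (minimum-energy) assignments as the system cools.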
(2.1) Free energy minimization
We’re now going to try a different approach to computing the marginals: let’s define a trial distribution, which we hope will be a good approximation to the true one. If you like, you can think of the marginal of this trial distribution as being the “belief” about the state of each variable. We can measure the “distance” between the two distributions by the KL divergence
which equals 0 iff the two distributions are equal. Let’s define the Gibbs free energy as
So the minimum value of the Gibbs free energy is called the Helmholtz free energy; the first term above is the “average energy” and the second is the “entropy”.
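Here is a small numerical sanity check of these definitions (a sketch under the common convention G(b) = U(b) − S(b)/β, with a toy energy function of my own choosing): the Gibbs free energy of a trial distribution exceeds the Helmholtz free energy by exactly (1/β)·KL(b‖p), so it is minimized at the Boltzmann distribution itself.

```python
import itertools
import math

# Toy check that the Gibbs free energy G(b) = U(b) - S(b)/beta is minimized
# exactly at the Boltzmann distribution p, where it equals the Helmholtz free
# energy F = -ln(Z)/beta. The energy function below is an arbitrary choice of
# mine, purely for illustration.
beta = 1.3
states = list(itertools.product(range(2), repeat=3))

def energy(x):
    return -sum(x) + 0.7 * (x[0] == x[1])

Z = sum(math.exp(-beta * energy(x)) for x in states)
p = {x: math.exp(-beta * energy(x)) / Z for x in states}
F = -math.log(Z) / beta  # Helmholtz free energy

def gibbs_free_energy(b):
    U = sum(b[x] * energy(x) for x in states)        # "average energy"
    S = -sum(b[x] * math.log(b[x]) for x in states)  # "entropy"
    return U - S / beta

# beta * (G(b) - F) = KL(b || p) >= 0, with equality iff b = p, so any other
# trial distribution (e.g. uniform) has strictly larger Gibbs free energy.
uniform = {x: 1.0 / len(states) for x in states}
```

Evaluating `gibbs_free_energy(p)` recovers `F`, while `gibbs_free_energy(uniform)` is strictly larger, matching the KL-divergence identity above.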
Now for the “free energy minimization” part. We want to choose our trial distribution to minimize the Gibbs free energy, so that it is a good approximation of the true distribution. If this happens, then maybe we can hope to “read out” the marginals directly. How do we do this in a way which makes it easy to “read out” the marginals? Here’s one idea: let’s try to write the Gibbs free energy as a function of only the one-node and two-node marginals of the trial distribution. If we could do this, then maybe we could try to minimize it by optimizing only over values for these “variables”. However, we need to remember that they are actually meant to represent marginals of some real probability distribution. So at the very least, we should add consistency constraints to our optimization problem. We can then think of the one-node and two-node beliefs as “pseudomarginals” which obey degree-2 Sherali–Adams constraints.
Recall that we’ve written the Gibbs free energy as a sum of both the average energy and the entropy. It turns out that we can actually write the average energy as only a function of the pairwise marginals:
which follows just because the sums marginalize out the variables which don’t form part of the pairwise interactions:
This is good news: since the energy only depends on pairwise interactions, the average energy component of the free energy only depends on the one-node and two-node beliefs. It is less clear, however, how to express the entropy as a function of one-node and two-node beliefs. But maybe we can pretend that our model is really a “tree”. In this case, the following is true:
Claim: If our model is a tree, and and are the associated marginals of our probabilistic model , then we have
where is the degree of vertex in the tree.
It’s not too difficult to see why this is the case: imagine a tree rooted at some vertex, with children below it. We can think of sampling from this tree as first sampling the root from its marginal, and then recursively sampling the children conditioned on the root. Associate with each child the subtree hanging below it, i.e. the probability of an assignment under the probabilistic model of the tree rooted at that vertex. Then we have
where the last line follows inductively, since each only sees edges of .
If we make the assumption that our model is a tree, then we can write the Bethe approximation entropy as
where the coefficients are the degrees of the variables in the graphical model defined by our distribution. We then define the Bethe free energy as the average energy minus the Bethe approximation entropy. The Bethe free energy is in general not an upper bound on the true free energy. Note that if we assign the beliefs to be the true marginals of our distribution, then we can rewrite its entropy as
which is similar in form to the Bethe approximation entropy. In general, we have
which is exactly the Gibbs free energy for a probabilistic model whose associated graph is a tree. Since BP gives the correct marginals on trees, we can say that the BP beliefs are the global minima of the Bethe free energy. However, the following is also true:
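For reference, the Bethe entropy and free energy can be written out in the standard notation of Yedidia–Freeman–Weiss [4]; this is a reconstruction in their notation (one-node beliefs \(b_i\), two-node beliefs \(b_{ij}\), degrees \(d_i\)), not the lecture’s own formulas:

```latex
S_{\mathrm{Bethe}}(b) = -\sum_{(ij)\in E}\sum_{x_i,x_j} b_{ij}(x_i,x_j)\,\ln b_{ij}(x_i,x_j)
  + \sum_{i\in V}(d_i-1)\sum_{x_i} b_i(x_i)\,\ln b_i(x_i),
\qquad
F_{\mathrm{Bethe}}(b) = U(b) - S_{\mathrm{Bethe}}(b),
```

with the average energy \(U(b) = \sum_{(ij)\in E}\sum_{x_i,x_j} b_{ij}(x_i,x_j)\,E_{ij}(x_i,x_j) + \sum_{i\in V}\sum_{x_i} b_i(x_i)\,E_i(x_i)\).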
Proposition: A set of beliefs gives a BP fixed point in any graph (not necessarily a tree) iff they correspond to local stationary points of the Bethe free energy.
(For a proof, see e.g. page 20 of [4])
So trying to minimize the Bethe free energy is in some sense the same thing as doing belief propagation. Apparently, one typically finds that when belief propagation fails to converge on a graph, the optimization program trying to minimize the Bethe free energy also runs into problems in similar parameter regions, and vice versa.
(3) The Block Model
(3.0) Definitions
Now that we’ve seen Belief Propagation from two different perspectives, let’s try to apply this technique of computing marginals to analyzing the behavior of the stochastic block model. This section will heavily follow the paper [2].
The stochastic block model is designed to capture a variety of interesting problems, depending on its settings of parameters. The question we’ll be looking at is the following: suppose we generate a random graph, where each vertex of the graph comes from one of q groups, each with some prior probability. We add an edge between vertices in two given groups with a probability depending on the pair of groups. For sparse graphs, we scale these edge probabilities down by the number of vertices, so that the average degree stays constant. The problem is the following: given such a random graph, can you label the vertices so that, up to permutation, the labels you choose have high correlation with the true hidden labels which were used to generate the graph? Here are some typical settings of parameters which represent different problems:
 Community detection, where we have groups. We set , and if and otherwise, with (assortative structure).
 Planted graph partitioning, where may not necessarily be .
 Planted graph colouring, where , and for all groups. We know how to efficiently find a planted colouring which is strongly correlated with the true colouring when the average degree is greater than .
We’ll concern ourselves with the case where our graph is sparse, and we need to try to come up with an assignment for the vertices such that we have high correlation with the true labeling of the vertices. How might we measure how well we solve this task? Ideally, a labeling which is identical to the true labeling (up to permutation) should get a score of 1. Conversely, a labeling which naively guesses that every vertex comes from the largest group should get a score of 0. Here’s one metric which satisfies these properties: if we come up with a labeling, and compare it against the true labeling, then we’ll measure our performance by
where we maximize over all permutations . When we choose a labeling which (up to permutation) agrees with the true labeling, then the numerator of will equal the denominator, and . Likewise, when we trivially guess that every vertex belongs to the largest group, then the numerator of is and .
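A hedged Python sketch of both the generative model and the overlap score (function and variable names are mine; the permutation maximization is brute force, so it only suits small numbers of groups):

```python
import random
from itertools import permutations

# Sketch of the sparse stochastic block model and the overlap score. We use
# q equal-sized groups, and place edges with probability c_in / N inside a
# group and c_out / N across groups.
def sample_sbm(N, q, c_in, c_out, seed=0):
    rng = random.Random(seed)
    labels = [i % q for i in range(N)]  # equal group sizes
    edges = []
    for i in range(N):
        for j in range(i + 1, N):
            c = c_in if labels[i] == labels[j] else c_out
            if rng.random() < c / N:
                edges.append((i, j))
    return labels, edges

def overlap(guess, truth, q):
    # Maximize agreement over label permutations, then rescale so that a
    # perfect labeling scores 1 and the trivial largest-group guess scores 0.
    N = len(truth)
    best = max(sum(pi[guess[i]] == truth[i] for i in range(N)) / N
               for pi in permutations(range(q)))
    return (best - 1.0 / q) / (1.0 - 1.0 / q)

labels, edges = sample_sbm(N=200, q=2, c_in=5.0, c_out=1.0)
perfect = overlap(labels, labels, q=2)            # scores 1
trivial = overlap([0] * len(labels), labels, q=2) # scores 0
```

Note that relabeling the groups does not change the score: `overlap([1 - l for l in labels], labels, q=2)` is also 1, since the metric maximizes over permutations.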
Given and a set of observed edges , we can write down the probability of a labeling as
How might we try to infer a labeling with maximum correlation (up to permutation) with the true labeling? It turns out that the answer is to use, for each vertex, the label that maximizes its marginal distribution, up to a caveat. The caveat comes in when the distribution is invariant under permutations of the labelings, so that each marginal is actually the uniform distribution. For example, this happens in community detection when all the group sizes are equal. In this case, the correct thing to do is to still use the marginals, but only after we have “broken the symmetry” of the problem by randomly fixing certain vertices to have particular labels. Belief propagation actually does this implicitly: recall that we start belief propagation by randomly initializing the messages. This random initialization can be interpreted as “symmetry breaking” of the problem, in a way that we’ll see shortly.
(3.1) Belief Propagation
We’ve just seen from the previous section that in order to maximize the correlation of the labeling we come up with, we should pick the labelings which maximize the marginals. So we have some marginals that we want to compute. Let’s proceed by applying BP to this problem in the “sparse” regime (other algorithms, like approximate message passing, can be used for “dense” graph problems). Suppose we’re given a random graph with its edge list. What does the graph associated with our probabilistic model look like? Well, in this case, every variable is actually connected to every other variable, because the distribution includes a factor for every pair of vertices, so we actually have a complete graph. However, some of the connections between variables are much weaker than others. In full, our BP update equations are
Likewise
What we want to do is approximate these equations so that we only have to pass messages along the edges of the observed graph, instead of the complete graph. This will make our analysis simpler, and also allow the belief propagation algorithm to run more efficiently. The first observation is the following: suppose we have two nodes that are both non-neighbours of a third. Then the messages the third node sends them are essentially equal, since the only difference between these two messages is two factors of vanishing order which appear in the first product of the BP equations. Thus, a node sends essentially the same message to all of its non-neighbours in our random graph. In general though, we have:
The first approximation comes from dropping non-edge constraints on the first product, and is reasonable because we expect the number of neighbours of each node to be constant. We’ve also defined a variable
and we’ve used the approximation 1 − x ≈ e^(−x) for small x. We think of the extra term as defining an “auxiliary external field”. We’ll use this approximate BP equation to find solutions for our problem. This has the advantage that the computation time is proportional to the number of edges instead of quadratic in the number of vertices, so we can deal with large sparse graphs computationally. It also allows us to see how a large dense graphical model with only sparse strong connections still behaves like a sparse tree-like graphical model from the perspective of belief propagation. In particular, we might have reason to hope that the BP equations will actually converge and give us good approximations to the marginals.
From now on, we’ll only consider factored block models, which in some sense represent a “hard” setting of parameters. These are models which satisfy the condition that each group has the same average degree . In particular, we require
An important observation for this setting of parameters is that
is always a fixed point of our BP equations, known as a factored fixed point (this can be seen by plugging it into the belief propagation equations we derived). If BP ever reaches such a fixed point, the overlap is zero and the algorithm fails. However, we might hope that if we randomly initialize the messages, then BP might converge to some nontrivial fixed point which gives us some information about the original labeling of the vertices.
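Here is a small Python check (notation mine; the external-field and non-edge terms are dropped because they cancel at this point) that uniform messages are indeed fixed by the simplified sparse BP update whenever every row of the affinity matrix has the same sum, i.e. every group has the same average degree:

```python
import numpy as np

# Check that uniform messages form a "factored fixed point" of the simplified
# sparse BP update for the block model. msg vectors live on directed edges,
# and c is the q x q affinity matrix.
q = 3
c_in, c_out = 6.0, 3.0
c = np.full((q, q), c_out) + np.eye(q) * (c_in - c_out)

# "Factored" condition: every row of c has the same sum, i.e. with equal
# group sizes every group has the same average degree.
assert np.allclose(c.sum(axis=1), c.sum(axis=1)[0])

# A node with three incoming messages, all sitting at the uniform point.
incoming = [np.ones(q) / q for _ in range(3)]

def outgoing(incoming):
    # Simplified sparse BP update: multiply, over incoming edges, the vector
    # whose t-th entry is sum_s c[t, s] * msg_s, then normalize.
    m = np.ones(q)
    for msg in incoming:
        m *= c @ msg
    return m / m.sum()

out = outgoing(incoming)  # stays at the uniform (factored) point
```

Because each row of `c` sums to the same value, `c @ msg` is a constant vector at the uniform point, so the product normalizes back to the uniform message: the factored point is self-consistent, exactly as claimed.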
(3.2) Numerical Results
Now that we have our BP equations, we can run numerical simulations to try to get a feel for when BP works. Let’s consider the problem of community detection. In particular, we’ll set our parameters with all group sizes being equal and all cross-group affinities equal, vary the ratio between the in-group and out-group affinities, and see when BP finds solutions which are correlated “better than guessing” with the original labeling used to generate the graph. When we do this, we get images which look like this:
It should be mentioned that the point at which the dashed red line occurs depends on the parameters of the stochastic block model. We get a few interesting observations from numerical experiments:
 The point at which BP fails happens at a certain value of the noise parameter, independent of the size of the graph. In particular, when the noise gets too large, BP converges to the trivial factored fixed point. We’ll see in the next section that this is a “stable” fixed point for certain settings of parameters.
 Using the approximate BP equations gives the same performance as using the exact BP equations. This numerically justifies the approximations we made.
 If we try to estimate the marginals with other tools from our statistical inference toolbox, we find (perhaps surprisingly) that Markov Chain Monte Carlo methods give the same performance as Belief Propagation, even though they take significantly longer to run and have strong asymptotic guarantees on their correctness (if you run MCMC long enough, you will eventually get the correct marginals, but this may take exponential time). This naturally suggests the conjecture that belief propagation is optimal for this task.
 As we get closer to the region where BP fails and converges to the trivial fixed point, the number of iterations required for BP to converge increases significantly, and diverges at the critical point where BP fails and converges to the factored fixed point. The same kind of behavior is seen when using Gibbs sampling.
(3.3) Heuristic Stability Calculations
How might we analytically try to determine when BP fails for certain settings of the parameters? One way we might heuristically do this is to calculate the stability of the factored fixed point. If the fixed point is stable, this suggests that BP will converge to a factored point. If, however, it is unstable, then we might hope that BP converges to something informative. In particular, suppose we run BP and converge to a factored fixed point, so that all our messages are uniform. Suppose we now add a small amount of noise to some of the messages (maybe think of this as injecting a small amount of additional information about the true marginals). We (heuristically) claim that if we now continue to run more steps of BP, either the messages will converge back to the fixed point, or they will diverge to something else, and which of the two happens depends on the largest eigenvalue of some matrix of partial derivatives.
Following this idea, here’s a heuristic way of calculating the stability of the factored fixed point. Let’s pretend that our BP equations occur on a tree, which is a reasonable approximation in the sparse graph case. Let our tree be rooted at node and have depth . Let’s try to approximately calculate the influence on of perturbing a leaf from its factored fixed point. In particular, let the path from the leaf to the root be . We’re going to apply a perturbation for each . In vector notation, this looks like , where is a column vector. The next thing we’ll do is define the matrix of partial derivatives
Up to first order (and ignoring normalizing constants), the perturbation effect on is then (by chain rule) . Since does not depend on , we can write this as , where is the largest eigenvalue of . Now, on a random tree, we have approximately leaves. If we assume that the perturbation effect from each leaf is independent, and that has 0 mean, then the net mean perturbation from all the leaves will be 0. The variance will be
if we assume that the cross terms vanish in expectation.
(Aside: you might ask: why are we assuming that the noise has mean zero, and that (say) the noise at each of the leaves is independent, so that the cross terms vanish? If we wanted to maximize the variance, then maybe choosing the perturbations to be correlated or to have nonzero mean would give us a better bound. The problem is that we’re neglecting the effects of normalizing constants in this analysis: if we perturbed all the messages in the same direction (e.g. nonzero mean), our normalization conditions would cancel out our perturbations.)
We therefore end up with the stability condition cλ² ≤ 1, where c is the average degree and λ the relevant eigenvalue. When cλ² > 1, a small perturbation will be magnified as we move up the tree, leading to the messages moving away from the factored fixed point after successive iterations of BP (the fixed point is unstable). If cλ² < 1, the effect of a small perturbation will vanish as we move up the tree, and we expect the factored fixed point to be stable. If we restrict our attention to graphs where the affinities take only two values (in-group and out-group), with all groups of equal size, then the matrix of partial derivatives is known to have a trivial eigenvalue with the uniform eigenvector, together with a second eigenvalue λ = (c_in − c_out)/(qc). The stability threshold then becomes |c_in − c_out| = q√c. This condition is known as the Almeida–Thouless local stability condition for spin glasses, and the Kesten–Stigum bound on reconstruction on trees. It is also observed empirically that BP and MCMC succeed above this threshold, and converge to factored fixed points below this threshold. The eigenvalues of the matrix are related to the belief propagation equations and the non-backtracking matrix. For more details, see [3].
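The threshold can be computed directly; this Python sketch (for the symmetric case with q equal groups, using the second eigenvalue (c_in − c_out)/(qc)) checks that the two standard formulations, cλ² > 1 and |c_in − c_out| > q√c, agree:

```python
import math

# Kesten-Stigum stability condition for the symmetric block model with q
# equal groups: with average degree c = (c_in + (q - 1) * c_out) / q and
# second eigenvalue lam = (c_in - c_out) / (q * c), the factored fixed point
# is unstable (so BP can succeed) iff c * lam**2 > 1.
def ks_detectable(q, c_in, c_out):
    c = (c_in + (q - 1) * c_out) / q
    lam = (c_in - c_out) / (q * c)
    return c * lam ** 2 > 1

# Equivalent formulation of the same condition: |c_in - c_out| > q * sqrt(c).
def ks_detectable_alt(q, c_in, c_out):
    c = (c_in + (q - 1) * c_out) / q
    return abs(c_in - c_out) > q * math.sqrt(c)
```

For instance, with q = 2, c_in = 5 and c_out = 1 (average degree 3), |c_in − c_out| = 4 > 2√3 ≈ 3.46, so the factored point is unstable and detection should be possible; with q = 2, c_in = 4, c_out = 2 it is stable and BP converges to the trivial point.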
(3.4) Information-Theoretic Results, and what we know algorithmically
We’ve just seen a threshold for when BP is able to solve the community detection problem. Specifically, when , BP doesn’t do better than chance. It’s natural to ask whether this is because BP is not powerful enough, or whether there really isn’t enough information in the random graph to recover the true labeling. For example, if is very close to , it might be impossible to distinguish between group boundaries up to random fluctuations in the edges.
It turns out that for q = 2, there is not enough information below the threshold to find a labeling which is correlated with the true labeling [3]. However, it can be shown information-theoretically [1] that for larger q, the threshold at which one can find a correlated labeling lies strictly below the Kesten–Stigum bound. In particular, there exist exponential-time algorithms which recover a correlated labeling below the Kesten–Stigum threshold. This is interesting, because it suggests an information-computation gap: we observe empirically that heuristic belief propagation seems to perform as well as any other inference algorithm at finding a correlated labeling for the stochastic block model. However, belief propagation appears to fail at a “computational” threshold that lies above the information-theoretic threshold for this problem. We’ll talk more about these kinds of information-computation gaps in the coming weeks.
References
[1] Jess Banks, Cristopher Moore, Joe Neeman, Praneeth Netrapalli,
Informationtheoretic thresholds for community detection in sparse
networks. JMLR: Workshop and Conference Proceedings, vol. 49:1–34, 2016.
Link
[2] Aurelien Decelle, Florent Krzakala, Cristopher Moore, Lenka Zdeborová,
Asymptotic analysis of the stochastic block model for modular networks and its
algorithmic applications. 2013.
Link
[3] Cristopher Moore, The Computer Science and Physics
of Community Detection: Landscapes, Phase Transitions, and Hardness. 2017.
Link
[4] Jonathan Yedidia, William Freeman, Yair Weiss, Understanding Belief Propagation and its Generalizations.
Link
[5] Afonso Bandeira, Amelia Perry, Alexander Wein, Notes on computational-to-statistical gaps: predictions using statistical physics. 2018.
Link
[6] Stephan Mertens, Marc Mézard, Riccardo Zecchina, Threshold Values of Random K-SAT from the Cavity Method. 2005.
Link
[7] Andrea Montanari, Federico Ricci-Tersenghi, Guilhem Semerjian, Clusters of solutions and replica symmetry breaking in random k-satisfiability. 2008.
Link
A big thanks to Tselil for all the proofreading and recommendations, and to both Boaz and Tselil for their really detailed post-presentation feedback.