[Guest post by Annie Wei who scribed Peter Shor’s lecture in our physics and computation seminar. See here for all the posts of this seminar. –Boaz]

On October 19, we were lucky enough to have Professor Peter Shor give a guest lecture about quantum error correcting codes. In this blog post, I (Annie Wei) will present a summary of this guest lecture, which builds up quantum error correcting codes starting from classical coding theory. We will start by reviewing an example from classical error correction to motivate the similarities and differences when compared against the quantum case, before moving into quantum error correction and quantum channels. Note that we do assume a very basic familiarity with quantum mechanics, such as that which might be found here or here.

1. Motivation
We are interested in quantum error correction, ultimately, because any real-world computing device needs to be able to tolerate noise. Theoretical work on quantum algorithms has shown that quantum computers have the potential to offer speedups for a variety of problems, but in practice we’d also like to be able to eventually build and operate real quantum computers. We need to be able to protect against any decoherence that occurs when a quantum computer interacts with the environment, and we need to be able to protect against the accumulation of small gate errors since quantum gates need to be unitary operators.

In error correction the idea is to protect against noise by encoding information in a way that is resistant to noise, usually by adding some redundancy to the message. The redundancy then ensures that enough information remains, even after noise corruption, so that decoding will allow us to recover our original message. This is what is done in classical error correction schemes.

Unfortunately, it’s not obvious that quantum error correction is possible. One obstacle is that errors are continuous, since a continuum of operations can be applied to a qubit, so a priori it might seem like identifying and correcting an error would require infinite resources. In a later section we show how this problem, that of identifying quantum errors, can be overcome. Another obstacle is the fact that, as we’ve stated, classical error correction works by adding redundancy to a message. This might seem impossible to perform in a quantum setting due to the No Cloning Theorem, which states the following:

Theorem (No Cloning Theorem): Performing the mapping

$|\psi\rangle|0\rangle\mapsto|\psi\rangle|\psi\rangle$

is not a permissible quantum operation.

Proof: We will use unitarity, which says that a quantum operation specified by a unitary matrix $U$ must satisfy

$\langle\phi|U^{\dagger}U|\psi\rangle = \langle\phi|\psi\rangle$.

(This ensures that the normalization of the state $|\psi\rangle$ is always preserved, i.e. that $|\langle\psi|\psi\rangle|^2=1$, which is equivalent to the conservation of probability.)

Now suppose that we can perform the operation

$U(|\psi\rangle|0\rangle)=|\psi\rangle|\psi\rangle$.

Then, letting

$(\langle\phi|\langle 0|)(|\psi\rangle|0\rangle)=\alpha$,

we note that by unitarity

$(\langle\phi|\langle 0|)(|\psi\rangle |0\rangle)=\alpha(\langle\phi|\langle 0|)U^{\dagger}U(|\psi\rangle|0\rangle)$.

But

$(\langle\phi|\langle 0|)U^{\dagger}U(|\psi\rangle|0\rangle)=(\langle\phi|\langle\phi|)(|\psi\rangle|\psi\rangle)=\alpha^2$,

and in general $\alpha\neq\alpha^2$, so we have a contradiction.

How do we get around this apparent contradiction? To do so, note that the no-cloning theorem only prohibits the copying of non-orthogonal quantum states. With orthogonal quantum states, either $\alpha=0$ or $\alpha=1$, so we don’t run into a contradiction. This also explains why it is possible to copy classical information, which we can think of as orthogonal quantum states.

So how do we actually protect quantum information from noise? In the next section we first review classical error correction, as ideas from the classical setting re-appear in the quantum setting, and then we move into quantum error correction.

2. Review of Classical Error Correction
First we start by reviewing classical error correction. In classical error correction we generally have a message that we encode, send through a noisy channel, and then decode, in the following schematic process:

In an effective error correction scheme, the decoding process should allow us to identify any errors that occurred when our message passed through the noisy channel, which then tells us how to correct the errors. The formalism that allows us to do so is the following: we first define a $r\times n$ encoding matrix $G$ that takes a message $m$ of length $r$ and converts it to a codeword $c$ of length $n$, where the codewords make up the span of the rows of $G$. An example of such a matrix is

$G=\left(\begin{array}{ccccccc}0&0&0&1&1&1&1\\1&0&1&0&1&0&1\\0&1&1&0&0&1&1\\1&1&1&1&1&1&1\end{array}\right)$,

corresponding to the 7-bit Hamming codes, which encodes a 4-bit message as a 7-bit codeword. Note that this code has distance 3 since each of the rows in $G$ differ in at most 3 spots, which means that it can correct at most 1 error (the number of errors that can be corrected is given by half the code distance).

We also define the parity check matrix $H$ to be the matrix that satisfies

$GH^T=0$.

For example, to define $H$ corresponding to the $G$ we defined for the 7-bit Hamming code, we could take

$H=\left(\begin{array}{ccccccc}0&0&0&1&1&1&1\\1&0&1&0&1&0&1\\0&1&1&0&0&1&1\end{array}\right)$.

Then we may decode $x$, a 7-bit string, in the following manner. Say that

$x=c+e$,

where $c$ is a codeword and $e$ is the 1-bit error we wish to correct. Then

$xH^T=(c+e)H^T=eH^T$.

Here $eH^T$ uniquely identifies the error and is known as the error syndrome. Having it tells us how to correct the error. Thus our error correction scheme consists of the following steps:

1. Encode a $r$-bit message $m$ by multiplying by $G$ to obtain codeword $mG=c$.
2. Send the message through channel generating error $e$, resulting in the string $x=c+e$.
3. Multiply by $H^T$ to obtain the error syndrome $eH^T$.
4. Correct error $e$ to obtain $c$.

Having concluded our quick review of classical error correction, we now look at the theory of quantum error correction.

3. Quantum Error Correction
In this section we introduce quantum error correction by directly constructing the 9-qubit code and the 7-qubit code. Then we introduce the more general formalism of CSS codes, which encompasses both the 9-qubit and 7-qubit codes, before introducing the stabilizer formalism, which tells us how we might construct a CSS code.

3.1. Preliminaries
First we introduce some tools that we will need in this section.

3.1.1. Pauli Matrices
The Pauli matrices are a set of 2-by-2 matrices that form an orthogonal basis for the 2-by-2 Hermitian matrices, where a Hermitian matrix $H$ satisfies $H^{\dagger}=H$. Note that we can form larger Hilbert spaces by taking the tensor product of smaller Hilbert spaces, so in particular taking the $k$-fold tensor product of Pauli matrices gives us a basis for the $2^k$-by-$2^k$ Hermitian matrices. Note also that generally, in quantum mechanics, we are interested in Hermitian matrices because they can be used to represent measurements, and because unitary matrices, which can be used to represent probability-preserving quantum operations, can be obtained by exponentiating Hermitian matrices (that is, every unitary matrix $U$ can be written in the form $U=e^{iH}$ for $H$ a Hermitian matrix).

The Pauli matrices are given by

$\sigma_x\equiv X\equiv\left(\begin{array}{cc}0&1\\1&0\end{array}\right)$

$\sigma_y\equiv Y\equiv\left(\begin{array}{cc}0&-i\\i&0\end{array}\right)$

$\sigma_z\equiv Z\equiv\left(\begin{array}{cc}1&0\\0&-1\end{array}\right)$.

By direction computation we can show that they satisfy the relations

$X^2=Y^2=Z^2=I$

$ZX=-XZ=iY$

$YZ=-ZY=iX$

$XY=-YX=iZ$.

3.1.2. Von Neumann Measurements
We will also need the concept of a Von Neumann measurement. A Von Neumann measurement is given by a set of projection matrices $\{E_1, E_2, ..., E_k\}$ satisfying

$\sum_{i=1}^k E_i=I$.

That is, the projectors partition a Hilbert space ${\cal H}$ into $k$ subspaces. Then, given any state $|\psi\rangle\in{\cal H}$, when we perform a measurement using these projectors we obtain the measurement result corresponding to $E_i$, with corresponding post-measurement state

$\frac{E_i|\psi\rangle}{||E_i|\psi\rangle||}$,

with probability $\langle\psi|E_i|\psi\rangle$.

3.2. First Attempt at a Quantum Code
Now we make a first attempt at coming up with a quantum code, noting that our efforts and adjustments will ultimately culminate in the 9-qubit code. Starting with the simplest possible idea, we take inspiration from the classical repetition code, which maps

$0\mapsto 000$

$1\mapsto 111$

and decodes by taking the majority of the 3 bits. We consider the quantum analog of this, which maps

$|0\rangle\mapsto|000\rangle$

$|1\rangle\mapsto|111\rangle$.

We will take our quantum errors to be the Pauli matrices $X$, $Y$, and $Z$. Then the encoding process, where our message is a quantum state $\alpha|0\rangle+\beta|1\rangle$, looks like the following:

$\alpha|0\rangle+\beta|1\rangle\mapsto\alpha|000\rangle+\beta|111\rangle$.

We claim that this code can correct bit errors but not phase errors, which makes it equivalent to the original classical repetition code for error correction. To see this, note that applying an $X_1$ error results in the mapping

$\alpha|0\rangle+\beta|1\rangle\mapsto\alpha|100\rangle+\beta|011\rangle$.

This can be detected by the von Neumann measurement which projects onto the subspaces

$\{|000\rangle,|111\rangle\}$

$\{|011\rangle,|100\rangle\}$

$\{|010\rangle,|101\rangle\}$

$\{|110\rangle,|001\rangle\}$

We could then apply $\sigma_x$ to the first qubit to correct the error. To see that this doesn’t work for phase errors, note that applying a $Z_2$ error results in the mapping

$\alpha|0\rangle+\beta|1\rangle\mapsto\alpha|000\rangle-\beta|111\rangle$.

This is a valid encoding of the state $\alpha|0\rangle-\beta|1\rangle$, so the error is undetectable.

What adjustments can we make so that we’re able to also correct $Z$ errors? For this we will introduce the Hadamard matrix, defined as

$H=\frac{1}{\sqrt{2}}\left(\begin{array}{cc}1&1\\1&-1\end{array}\right)$

and satisfying

$HX=ZH$.

Note in particular that, because $HX=ZH$, the Hadamard matrix turns bit errors into phase errors, and vice versa. This allows us to come up with a code that corrects phase errors by mapping

$H|0\rangle\mapsto H^{\otimes 3}|000\rangle$

$H|1\rangle\mapsto H^{\otimes 3}|111\rangle$

or equivalently,

$|0\rangle\mapsto\frac{1}{2}(|000\rangle+|011\rangle+|101\rangle+|110\rangle)$

$|1\rangle\mapsto\frac{1}{2}(|111\rangle+|100\rangle+|010\rangle+|001\rangle)$

Now we can concatenate our bit flip code with our phase flip code to take care of both errors. This gives us the 9-qubit code, also known as the Shor code.

3.3. 9-Qubit Code
In the previous section, we went through the process of constructing the 9-qubit Shor code by considering how to correct both bit flip errors and phase flip errors. Explicitly, the 9-qubit Shor code is given by the following mapping:

$|0\rangle\mapsto|0\rangle_L\equiv\frac{1}{2}(|000000000\rangle+|000111111\rangle+|111000111\rangle+|111111000\rangle)$

$|1\rangle\mapsto|1\rangle_L\equiv\frac{1}{2}(|111111111\rangle+|111000000\rangle+|000111000\rangle+|000000111\rangle)$.

Here $|0\rangle_L$ and $|1\rangle_L$ are known as logical qubits; note that our 9-qubit code essentially represents 1 logical qubit with 9 physical qubits.

Note that by construction this code can correct $\sigma_x$, $\sigma_y$, and $\sigma_z$ errors on any one qubit (we’ve already shown by construction that it can correct $\sigma_x$ and $\sigma_z$ errors, and $\sigma_y$ can be obtained as a product of the two). This is also equivalent to the statement that the states $\sigma_x^{(i)}|0\rangle_L$, $\sigma_y^{(i)}|0\rangle_L$, $\sigma_z^{(i)}|0\rangle_L$, $\sigma_x^{(i)}|1\rangle_L$, $\sigma_y^{(i)}|1\rangle_L$, and $\sigma_z^{(i)}|1\rangle_L$ are all orthogonal.

Now we have a 1-error quantum code. We claim that such a code can in fact correct any error operation, and that this is a property of all 1-error quantum codes:

Theorem: Given any possible unitary, measurement, or quantum operation on a one-error quantum code, the code can correct it.

Proof: $\{I, \sigma_x, \sigma_y, \sigma_z\}^{\otimes t}$ forms a basis for the $2\times 2$ matrices. For errors on $t$ qubits, the code can correct these errors if it can individually correct errors $\sigma_{w_i}^{(i)}$ for $w_i\in\{x,y,z\}$, $i\in\{1,...,t\}$, since $\{I, \sigma_x, \sigma_y, \sigma_z\}^{\otimes t}$ forms a basis for $\mathbb{C}^{2t}$.

Example: Phase Error Next we’ll do an example where we consider how we might correct an arbitrary phase error applied to the $|0\rangle_L$ state. Since quantum states are equivalent up to phases, the error operator is given by

$\left(\begin{array}{cc}1&0\\0&e^{i\theta}\end{array}\right)\equiv\left(\begin{array}{cc}e^{-i\theta/2}&0\\0&e^{i\theta/2}\end{array}\right)$.

Note that this can be rewritten in the $\{I, \sigma_x, \sigma_y, \sigma_z\}^{\otimes t}$ basis as

$\left(\begin{array}{cc}e^{-i\theta/2}&0\\0&e^{i\theta/2}\end{array}\right)=\cos\frac{\theta}{2}I-i\sin\frac{\theta}{2}\sigma_z$.

Now, applying this error to $|0\rangle_L$, we get

$\left(\begin{array}{cc}e^{-i\theta/2}&0\\0&e^{i\theta/2}\end{array}\right)\frac{1}{2}(|000\rangle+|011\rangle+|101\rangle+|110\rangle)=\frac{1}{2}\cos\frac{\theta}{2}(|000\rangle+|011\rangle+|101\rangle+|110\rangle)-\frac{i}{2}\sin\frac{\theta}{2}(|000\rangle-|011\rangle+|101\rangle-|110\rangle)=\cos\frac{\theta}{2}|0\rangle_L-i\sin\frac{\theta}{2}\sigma_z|0\rangle_L$.

After performing a projective measurement, we get state $|0\rangle_L$ with probability $\cos^2\frac{\theta}{2}$, in which case we do not need to perform any error correction, and we get $\sigma_z|0\rangle_L$ with probability $\sin^2\frac{\theta}{2}$, in which case we would know to correct the $\sigma_z$ error.

3.4. 7-Qubit Code
Now that we’ve constructed the 9-qubit code and shown that quantum error correction is possible, we might wonder whether it’s possible to do better. For example, we’d like a code that requires fewer qubits. We’ll construct a 7-qubit code that corrects 1 error, defining a mapping to $|0\rangle_L$ and $|1\rangle_L$ by taking inspiration from a classical code, as we did for the 9-qubit case.

For this we will need to go back to the example we used to illustration classical error correction. Recall that in classical error correction, we have an encoding matrix $G$ and a parity check matrix $H$ satisfying $GH^T=0$, with $\text{rank}(G)+\text{rank}(H)=n$. We encode a message $m$ to obtain codeword $mG=c$. After error $e$ is applied, this becomes $c+e$, from which we can extract the error syndrome $(c+e)H^T=eH^T$. We can then apply the appropriate correction to extract $c$ from $c+e$.

Now we will use the encoding matrix from our classical error correction example, and we will divide our codewords into two sets, $C_1$ and $C_1'$, given by

$C_1=\left\{\begin{array}{ccccccc}0&0&0&1&1&1&1\\1&0&1&0&1&0&1\\0&1&1&0&0&1&1\end{array}\right.$

and

$C_1'=C_1+1111111$.

Similar to how we approached the 9-qubit case, we will start by defining our code as follows:

$|0\rangle_L\equiv\frac{1}{\sqrt{8}}\sum_{v\in C_1}|v\rangle$

$|1\rangle_L\equiv\frac{1}{\sqrt{8}}\sum_{w\in C_1'}|w\rangle$.

Note that this corrects bit flip errors by construction. How can we ensure that we are also able to correct phase errors? For this we again turn to the Hadamard matrix, which allows us to toggle between bit and phase errors. We claim that

$H^{\otimes 7}|0\rangle_L=\frac{1}{\sqrt{2}}(|0\rangle_L+|1\rangle_L)$

$H^{\otimes 7}|1\rangle_L=\frac{1}{\sqrt{2}}(|0\rangle_L-|1\rangle_L)$.

Proof: We will show that

$H^{\otimes 7}|0\rangle_L=\frac{1}{\sqrt{2}}(|0\rangle_L+|1\rangle_L)$,

noting that the argument for $|1\rangle_L$ is similar. First we will need the fact that

$H^{\otimes 7}|v\rangle=\frac{1}{2^{7/2}}\sum_{w\in\{0,1\}^7}(-1)^{w\cdot v}|w\rangle$.

To see that this fact is true, note that

$H=\frac{1}{\sqrt{2}}(|0\rangle\langle 0|+|0\rangle\langle 1|+|1\rangle\langle 0|-|1\rangle\langle 1|)$

and that $w\cdot v$ is equal to the number of bits in which $w$ and $v$ are both 1. Now we can start by directly calculating

$H^{\otimes 7}|0\rangle_L=\frac{1}{\sqrt{8}}\frac{1}{\sqrt{128}}\sum_{v\in C_1}\sum_{w\in\{0,1\}^7}(-1)^{v\cdot w}|w\rangle$.

Note that for $x$ and $y$ two codewords, assuming that $w\cdot y=1$, we must have that $x\cdot w=0$ iff $(x+y)\cdot w=1$. Thus we can break the codespace up into an equal number of codewords $x$ satisfying $x\cdot w=0$ and $x\cdot w=1$. This means that we must have that the sum $\sum_{v\in C_1}\sum_{w\in\{0,1\}^7}(-1)^{w\cdot v}|w\rangle=0$ unless we have $w\perp C_1$. But those $w$ that satisfy $w\perp C_1$ are exactly all the codewords by definition, so we must have that

$H^{\otimes 7}|0\rangle_L=\frac{1}{\sqrt{2}}|0\rangle_L+\frac{1}{\sqrt{2}}|1\rangle_L$

as the sum in $|0\rangle_L+|1\rangle_L$ runs equally over all codewords.

Thus we have constructed a 7-qubit quantum code that corrects 1 error, and moreover we see that for both the 9-qubit and 7-qubit codes, both of which are 1-error quantum codes, the fact that they can correct 1-error comes directly from the fact that the original classical codes we used to construct them can themselves correct 1 error. This suggests that we should be able to come up with a more general procedure for constructing quantum codes from classical codes.

3.5. CSS Codes
CSS (Calderbank-Shor-Steane) codes generalize the process by which we constructed the 9-qubit and 7-qubit codes, and they give us a general framework for constructing quantum codes from classical codes. In a CSS code, we require groups $C_1$, $C_2$ satisfying

$C_1\subseteq C_2$

$C_2^{\perp}\subseteq C_1^{\perp}$

Then we can define codewords to correspond to cosets of $C_1$ in $C_2$, so that the number of codewords is equal to $2^{\text{dim}(C_2)-\text{dim}(C_1)}$. Thus by this definition we can say that codewords $w_1, w_2\in C_2$ are in the same coset if $w_1-w_2\in C_1$. Explicitly, the codeword for coset $w$ is given by the state

$\frac{1}{|C_1|^{1/2}}\sum_{x\in C_1}|x+w\rangle$,

and under the Hadamard transformation applied to each qubit this state is in turn mapped to the state

$\frac{1}{|C_1^{\perp}|^{1/2}}\sum_{x\in C_1^{\perp}}|x+w\rangle$.

That is to say, the Hadamard “dualizes” our original code, toggling bit errors to phase errors and vice versa. (This can be seen by direct calculation, as in the case of the 7-qubit code, where we used the fact that $\sum_{v\in C_1}(-1)^{v\cdot w}=0$ for $w\not\in C_1^{\perp}$.)

Note also that this code can correct a number of bit errors equal to the minimum weight of $\{v\in C_2-C_1\}$.

With the CSS construction we have thus reduced the problem of finding a quantum error correcting code to the problem of finding appropriate $C_1$, $C_2$. Note that the special case of $C_2^{\perp}=C_1=C$ corresponds to weakly self-dual codes, which are well studied classically. Doubly even, weakly self-dual codes additionally have the requirement that all codewords have Hamming weights that are multiples of 4; they satisfy the requirement

$1^n\subseteq C^{\perp}\subseteq C\subseteq\mathbb{Z}_2^n$

and are also well studied classically.

3.6. Gilbert-Varshamov Bound
In the previous section we introduced CSS codes and demonstrated that the problem of constructing a quantum code could be reduced to the problem of finding two groups $C_1$, $C_2$ satisfying

$C_1\subseteq C_2$

$C_2^{\perp}\subseteq C_1^{\perp}$.

The next natural question is to ask whether such groups can in fact be found.

The Gilbert-Varshamov bound answers this question in the affirmative, ensuring that there do exist good CSS codes (the bound applies to both quantum and classical codes). It can be stated in the following way:

Theorem (Gilbert-Varshamov Bound): There exist CSS codes with rate $R=$(number of encoded bits)/(length of code) given by

$R\geq 1-2H_2(d/n)$,

where $d$ is the minimum distance of the code, $d/2$ is the number of errors it can correct, and $H_2(x)$ is the Shannon entropy, defined as

$H_2(x)=-x\log_2x-(1-x)\log_2(1-x)$.

Proof: Note that we can always take a code, apply a random linear transformation to it, and get another code. Thus each vector is equally likely to appear in a random code. Given this fact, we can estimate the probability that a code of dimension $k$ contains a word of weight $\leq d$ using the union bound:

$P($code of dimension $k$ has word of weight $\leq d)\leq($number of words$)\times P($word has weight $\leq d)=2^k\cdot\frac{\sum_{i=0}^d \binom{n}{i}}{2^n}\approx \frac{2^k\cdot 2^{nH(d/n)}}{2^n}$

For this to be a valid probability we need to have

$(k/n)+H(d/n)< 1$.

We can calculate rate by noting that for a CSS code, given by $C_1\subseteq C_2$, $C_2^{\perp}\subseteq C_1^{\perp}$, with $\text{dim}(C_1)=n-k$, $\text{dim}(C_2)=k$, the expression for rate is given by

$R=\frac{n-2k}{n}$.

Combining this with the bound we obtained by considering probabilities, we get that

$R\geq 1-2H(d/n)$.

Thus there exist good CSS codes.

3.7. Stabilizer Codes
Having discussed and constructed some examples of CSS codes, we will now discuss the stabilizer formalism. Note that this formalism allows us to construct codes without having to work directly with the states representing $|0\rangle_L$ and $|1\rangle_L$, as this can quickly get unwieldy. Instead, we will work with stabilizers, operators that leave these states invariant.

To see how working directly with states can get unwieldy, we can consider the 5-qubit code. We can define it the way we defined the 9-qubit and 7-qubit codes, by directly defining the basis vectors $|0\rangle_L$ and $|1\rangle_L$,

$|0\rangle_L\equiv\frac{1}{4}(|00000\rangle-|01100\rangle+|00101\rangle+|01010\rangle-|01111\rangle+($symmetric under cyclic permutations$))$,

with $|1\rangle_L$ defined similarly. But we can also define this code more succinctly using the stabilizer formalism. To do so, we start by choosing a commutative subgroup of the Pauli group, with generators $g_i$ satisfying

$g_i^2=I$

$g_ig_j=g_jg_i$

For example, for the 5-qubit code, the particular choice of generators we would need is given by

$g_1\equiv IXZZX$

$g_2\equiv XIXZZ$

$g_3\equiv ZXIXZ$

$g_4\equiv ZZXIX$.

Now we consider states $\{|\psi\rangle\}$ that are stabilized by the $\{g_i\}$. That is, they satisfy

$g_i|\psi\rangle=|\psi\rangle$.

Note that the eigenvalues of $\sigma_x$, $\sigma_y$, and $\sigma_z$ are $\pm 1$, so in the case of the 5-qubit code, there exists a $2^5/2=16$-dimensional space of $\{|\psi\rangle\}$ satisfying $g_1|\psi\rangle=|\psi\rangle$. Recalling that two commuting matrices are simultaneously diagonalizable, there exists a $16/2=8$-dimensional space of $\{|\psi\rangle\}$ satisfying $g_1|\psi\rangle=g_2|\psi\rangle=|\psi\rangle$, and so on, where we cut the dimension of the subspace in half each time we add a generator. Finally, there exists a $2^5/2^4=2$-dimensional space of $\{|\psi\rangle\}$ satisfying $g_i|\psi\rangle=|\psi\rangle$ for all $i=1,...,4$. This 2-dimensional space is exactly the subspace spanned by $|0\rangle_L$ and $|1\rangle_L$. Thus fixing the stabilizers is enough to give us our code.

Next we will consider all elements in the Pauli group that commute with all elements in our stabilizer group $G=\{g_1,...,g_4\}$. As we shall see, this will give us our logical operators, where a logical operator performs an operation on a logical qubit (for example, the logical $X$ operator, $X_L$, would act on the logical qubit $|0\rangle_L$ by mapping $X_L|0\rangle_L=|1\rangle_L$, and so on). In the 5-qubit case we end up with a 6-dimensional nonabelian group $\tilde{G}=\langle g_1,...,g_4, h_1, h_2\rangle$ by adding the following two elements to those elements that are in $G$:

$h_1=XXXXX$

$h_2=ZZZZZ$

These will be our logical operators

$X_L\equiv h_1$

$Z_L\equiv h_2$

so that

$X_L|0\rangle_L=|1\rangle_L$

$X_L|1\rangle_L=|0\rangle_L$

$Z_L|1\rangle_L=-|1\rangle_L$

$Z_L|0\rangle_L=|0\rangle_L$.

Note that this code has distance 3 and corrects 1 error because 3 is the minimum Hamming weight in the group $\tilde{G}$. (To see this, note that $XXXXX\cdot IXZZX=XIYYI$ has Hamming weight 3.)

Why is Hamming weight 2 not enough to correct one error? If we had, for example, $XZIII\in\tilde{G}$, then we would have

$\sigma_x^{(1)}|\psi_1\rangle=\sigma_z^{(2)}|\psi_2\rangle$

for $|\psi_1\rangle$, $|\psi_2\rangle$ both in the code, which means that we wouldn’t be able to distinguish an $X_1$ error from a $Z_2$ error.

Note that, in general, when $x\in\tilde{G}$, $x|\psi\rangle$ will be in the code, so elements of $\tilde{G}$ map codewords to codewords. We can prove this fact by noting that

$xg_i|\psi\rangle=x|\psi\rangle=g_ix|\psi\rangle$.

Note also that in the examples we’ve been dealing with so far, where we have a commuting subgroup of the Pauli group, our codes correspond to classical, additive, weakly self-dual codes over $GF(4)$. Here $GF(4)=\{0,1,\omega,\bar{\omega}\}$ (with group elements $\{\omega, \bar{\omega},1\}$ corresponding to the third roots of unity) is the finite field on 4 elements, and multiplying Pauli matrices corresponds to group addition. Specifically,

$X\equiv 1$

$Y\equiv \omega$

$Z\equiv \bar{\omega}$

$I\equiv 0$

satisfying

$H\omega=\bar{\omega}$

$2X=2Y=2Z=0$.

We have now concluded our discussion of quantum error-correcting codes. In the next section we will shift gears and look at quantum channels and channel capacities.

4. Quantum Channels
In this final section we will look at quantum channels and channel capacities.

4.1. Definition and Examples

4.1.1. Definition
We know that we want to define a quantum channel to take a quantum state as input. What should the output be? As a first attempt we might imagine having the output be a probability distribution $\{p_i\}$ over states $\{|\psi_i\rangle\}$. It turns out that for a more succinct description, we can have both the input and output be a density matrix.

Recall that a density matrix takes the form

$\rho=\sum_i p_i|\psi_i\rangle\langle\psi_i|$

representing a probability distribution over pure states $|\psi_i\rangle$. $\rho$ must also be Hermitian, and it must satisfy $\text{Tr}(\rho)=1$ (equivalently, we must have $\sum_i p_i=1$).

Now we may define a quantum channel as the map $\eta$ that takes

$\eta:\rho\mapsto\sum_i E_i\rho E_i^{\dagger}$,

where

$\sum_i E_i^{\dagger}E_i=I$.

To see that the output is in fact a density matrix, note that the output expression is clearly Hermitian and can be shown to have unit trace using the cyclical property of traces. Note also that the decomposition into $\{E_i\}$ need not be unique.

4.1.2. Example Quantum Channels
Next we give a few examples of quantum channels. The dephasing channel is given by the map

$\rho\mapsto(1-p)\rho+p\sigma_z\rho\sigma_z$.

It maps

$\left(\begin{array}{cc}\alpha&\beta\\\gamma&\delta\end{array}\right)\mapsto \left(\begin{array}{cc}\alpha&(1-2p)\beta\\(1-2p)\gamma&\delta\end{array}\right)$

$\left(\begin{array}{cc}\alpha&\beta\\\gamma&\delta\end{array}\right)\mapsto \left(\begin{array}{cc}\alpha&(1-2p)\beta\\(1-2p)\gamma&\delta\end{array}\right)$,

so it multiplies off-diagonal elements by a factor that is less than 1. Note that when $p=1/2$, it maps

$\alpha|0\rangle+\beta|1\rangle\mapsto|\alpha|^2|0\rangle\langle 0|+|\beta|^2|1\rangle\langle 1|$,

which means that it turns superpositions into classical mixtures (hence the name “dephasing”).

Another example is the amplitude damping channel, which models an excited state decaying to a ground state. It is given by

$E_1=\left(\begin{array}{cc}0&\sqrt{p}\\0&0\end{array}\right)$

$E_2=\left(\begin{array}{cc}1&0\\0&\sqrt{1-p}\end{array}\right)$

Here we let the vector $|0\rangle=(1, 0)$ denote the ground state, and we let the vector $|1\rangle=(0, 1)$ denote the excited state. Thus we can see that the channel maps the ground state to itself, $|0\rangle\mapsto|0\rangle$, while the excited state $|1\rangle$ gets mapped to $|0\rangle$ with probability $p$ and stays at $|1\rangle$ with probability $1-p$.

4.2 Quantum Channel Capacities
Now we consider the capacity of quantum channels, where the capacity quantifies how much information can make it through the channel. We consider classical channels, classical information sent over quantum channels, and quantum information sent over quantum channels. First we start off with the example of the quantum erasure channel to demonstrate that quantum channels behave differently from classical channels, and then we give the actual expressions for the channel capacities before revisiting the example of the quantum erasure channel.

4.2.1 Example: Quantum Erasure Channel
First we start with the example of the quantum erasure channel, which given a state $|\psi\rangle$ replaces it by an orthogonal state $|E\rangle$ with probability $p$ and returns the same state $|\psi\rangle$ with probability $1-p$. We claim that the erasure channel can’t transmit quantum information when $p\geq 0.5$, behavior that is markedly different from that of classical information. That is to say, for $p\geq 0.5$, there is no way to encode quantum information to send it through the channel and then decode it so the receiver gets a state close to the state that was sent.

To see why this is the case, assume the contrary, that there do exist encoding and decoding protocols that send quantum information through quantum erasure channels with erasure rate $p\geq 0.5$. We will show that this violates the no-cloning theorem. Now, suppose that $A$ does the following: For each qubit in the encoded state, she tosses a fair coin. If the coin lands heads, she send $C$ the state $|E\rangle$ and sends $B$ the channel input with probability $2p-1$ and the erasure state $|E\rangle$ otherwise. If the coin lands tails, she sends $B$ the state $|E\rangle$ and sends $C$ the channel input with probability $2p-1$ and the erasure state otherwise. This implements a $p\geq 0.5$ channel to both receivers $B$ and $C$, which means that $A$ can use this channel to transmit an encoding of $|\psi\rangle$ to both receivers, which in turn means that both receivers will be able to decode $|\psi\rangle$. But this means that $A$ has just used this channel to clone the quantum state $|\psi\rangle$, resulting in a contradiction. Thus no quantum information can be transmitted through a channel with $p\geq 0.5$. Note, however, that we can send classical information over this channel, so the behavior of quantum and classical information is markedly different.

It turns out that the rate of quantum information sent over the erasure channel, as a function of $p$, is given by the following graph:

while the rate of classical information sent over the erasure channel, as a function of $p$, is given by the following graph:

Next we will formally state the definition of channel capacity, and then we will return to the quantum erasure channel example and derive the curve that plots rate against $p$.

4.2.2. Definition of Channel Capacities
Channel capacity is defined as the maximum rate at which information can be communicated over many independent uses of a channel from sender to receiver. Here we list the expressions for channel capacity for classical channels, classical information over a quantum channel, and quantum information over a quantum channel.

Classical Channel Capacity For a classical channel this expression is just the maximum mutual information over all input-output pairs,

$\max_X H(\eta(X))-H(\eta(X)|X)$,

where $X$ is the input information and $\eta(X)$ is the output information after having gone through the channel $\eta$.

Classical Information Over a Quantum Channel The capacity for classical information sent over a quantum channel is given by

$\max_{\{p_i,\rho_i\}} H(\eta(\rho))-\sum_i p_iH(\eta(\rho_i))$

up to regularization, where $\rho=\sum_i p_i\rho_i$ is the average input state, and $\eta$ is the channel.

Note that we would regularize this by using $n$ copies of the state (that is to say, we want the output of $\eta^{\otimes n}$) and then dividing by $n$, to get an expression like the following for the regularized capacity of classical information over a quantum channel:

$\lim_{n\rightarrow\infty}\max_{\{p_i,\rho_i\}} [H(\eta(\rho)^{\otimes n})-\sum_i p_i H(\eta(\rho_i)^{\otimes n})]/n$.

Quantum Information The capacity for quantum information is given by the expression

$\max_\rho H(\eta(\rho))-H((\eta\otimes I)\Phi_\rho)$,

also up to regularization. Here $\eta(\rho)$ is the output when channel $\rho$ acts on input state $\rho$, while $\Phi_\rho$ is the purification of $\rho$ (that is, it is a pure state containing $\rho$ that we can obtain by enlarging the Hilbert space). The regularized capacity for quantum information looks like the following:

$\lim_{n\rightarrow\infty}\max_\rho [H(\eta(\rho)^{\otimes n})-H((\eta\otimes I)(\Phi_\rho)^{\otimes n})]/n$.

Now that we have the exact expression that allows us to calculate the quantum channel capacity, we will revisit our example of the quantum erasure channel and reproduce the plot of channel rate vs erasure probability.

4.2.3. Example Revisited: Quantum Erasure Channel
Recall that, up to regularization, the capacity of a quantum channel is given by

$\max_\rho H(\eta(\rho))-H((\eta\otimes I)\Phi_\rho)$.

We will directly calculate this expression for the example of the quantum erasure channel. Let the input $\rho$ be given by the density matrix for the completely mixed state,

$\rho=\left(\begin{array}{cc}\frac{1}{2}&0\\0&\frac{1}{2}\end{array}\right)$,

so that the purification of $\rho$ is given by the state

$\frac{1}{\sqrt{2}}(|00\rangle+|11\rangle)$.

Recall that the erasure channel replaces our state with $|E\rangle$ with probability $p$, while with probability $1-p$ it leaves the input state unchanged. Then, in the basis $\{|0\rangle, |1\rangle, |E\rangle\}$, the matrix corresponding to $\eta(\rho)$ is given by

$\eta(\rho)=\left(\begin{array}{ccc}\frac{1-p}{2}&0&0\\0&\frac{1-p}{2}&0\\0&0&p\end{array}\right)$

while in the basis $\{|00\rangle, |01\rangle, |10\rangle, |11\rangle, |0E\rangle, |1E\rangle\}$, the matrix corresponding to $(\eta\otimes I)\Phi_\rho$ is given by

$(\eta\otimes I)\Phi_\rho=\left(\begin{array}{cccccc}\frac{1-p}{2}&0&0&\frac{1-p}{2}&0&0\\0&0&0&0&0&0\\0&0&0&0&0&0\\\frac{1-p}{2}&0&0&\frac{1-p}{2}&0&0\\0&0&0&0&\frac{p}{2}&0\\0&0&0&0&0&\frac{p}{2}\end{array}\right)$

We can directly calculate that

$H(\eta(\rho))=H_2(p)+(1-p)$

$H((\eta\otimes I)\Phi_\rho)=H_2(p)+p$.

Then, subtracting the two entropies, we can calculate the rate to be

$R=1-2p$,

which corresponds exactly to the line we saw on the diagram that plotted rate as a function of $p$ for the quantum erasure channel.

References

1. Bennett, C. H., DiVencenzo, D. P., and Smolin, J. A. Capacities of quantum erasure channels. Phys. Rev. Lett., 78:3217-3220 (1997). quant-ph/9701015.
2. Bennett, C. H., DiVencenzo, D. P., Smolin, J. A., and Wootters, W. K. Mixed state entanglement and quantum error correction. Phys. Rev. A, 54:3824 (1996). quant-ph/9604024.
3. Calderbank, A. R. and Shor, P. W. Good quantum error-correcting codes exist. Phys. Rev. A, 54:1098 (1996). quant-ph/9512032.
4. Devetak, I. The Private Classical Capacity and Quantum Capacity of a Quantum Channel. IEEE Trans. Inf. Theor., 51:44-45 (2005). quant-ph/0304127
5. Devetak, I. and Winter, A. Classical data compression with quantum side information. Phys. Rev. A, 68(4):042301 (2003).
6. Gottesman, D. Class of quantum error-correcting codes saturating the quantum Hamming bound. Phys. Rev. A, 54:1862 (1996).
7. Laflamme, R., Miquel, C., Paz, J.-P., and Zurek, W. H. Perfect quantum error correction code. Phys. Rev. Lett., 77:198 (1996). quant-ph/9602019.
8. Lloyd, S. Capacity of the noisy quantum channel. Phys. Rev. A., 55:3 (1997). quant-ph/9604015.
9. Nielsen, M. A. and Chuang, I. L. Quantum Computation and Quantum Information., Cambridge University Press, New York (2011).
10. Shor, P. W. Scheme for reducing decoherence in quantum computer memory. Phys. Rev. A., 52:2493 (1995).
11. Shor, P. W. The quantum channel capacity and coherent information. MSRI Workshop on Quantum Computation (2002).
12. Steane, A. M. Error correcting codes in quantum theory. Phys. Rev. Lett., 77:793 (1996).
13. Steane, A. M. Multiple particle interference and quantum error correction. Proc. R. Soc. London A, 452:2551-76 (1996).

[Guest post by Moshe Babaioff –Boaz]

“Highlights Beyond EC” Session at EC 2019: Call for Nominations

Committee: Mohammad Akbarpour, Moshe Babaioff, Shengwu Li and Ariel Procaccia

Following a new tradition started last year, the 2019 ACM Conference on Economics and Computation (EC’19) will host a special session highlighting some of the best work in economics and computation that appears in conferences and journals other than EC. The intention of this session is to expose EC attendees to related work just beyond the boundary of their current awareness.

We seek nominations for papers on Economics and Computation that have made breakthrough advances, opened up new questions or areas, made unexpected connections, or had significant impact on practice or other sciences. Examples of conferences and journals that publish papers relevant for the special session include STOC/FOCS/SODA/ITCS, AAAI/IJCAI/AAMAS, NIPS/ICML/COLT, WWW/KDD, AER/Econometrica/JPE/QJE/RESTUD/TE/AEJ Micro/JET/GEB, and Math of OR/Management Science/Operations Research. Please email nominations toHighlightsBeyondEC@gmail.com. Anyone is welcome to contact us, but we especially invite members of PCs or editorial boards in various venues to send us suggestions. Nominations should include:

1.  Name of paper and authors.
2. Publication venue or online working version. Preference will be given to papers that have appeared in a related conference or journal within the past two years, or have a working version circulated within the past two years.
3. Short (2-3 paragraph) explanation of the paper and its importance.
4. (Optional) Names of 1-3 knowledgeable experts on the area of the paper.

Note that at least one of the authors of a selected paper will be required to present their paper at EC 2019 and so should be available to travel to the conference, which is taking place in Phoenix, AZ on June 25-27, 2019.

To ensure maximum consideration, please send all nominations by December 15, 2018.

[Guest post by Piotr Sankowski –Boaz]

Call for Nominations – 4th Highlights of Algorithms conference (HALG 2019)

Copenhagen, June 14-16, 2019

http://www.halgdiku.dk/

The HALG 2019 conference seeks high-quality nominations for invited talks that will highlight recent advances in algorithmic research. Similarly to previous years, there are two categories of invited talks:

A. survey (60 minutes): a survey of an algorithmic topic that has seen exciting developments in last couple of years.

B. paper (30 minutes): a significant algorithmic result appearing in a paper in 2018 or later.

To nominate, please email  halg2019.nominations@gmail.com  the following information:

1. Basic details: speaker name + topic (for survey talk) or paper’s title, authors, conference/arxiv + preferable speaker (for paper talk).

2. Brief justification: Focus on the benefits to the audience, e.g., quality of results, importance/relevance of topic, clarity of talk, speaker’s presentation skills.

All nominations will be reviewed by the Program Committee (PC) to select speakers that will be invited to the conference.

Nominations deadline: December 9, 2018 (for full consideration).

Please keep in mind that the conference does not provide financial support for the speakers.

Best regards,
Piotr Sankowski
HALG 2019 PC Chair

[Guest post by Eylon Yogev about a Chrome extension he wrote, of which I am a happy user –Boaz]

Hi fellow researchers,

I’m writing to share a little tool that I have developed with the ambitious goal of boosting research productivity. The tool is a Chrome extension named “Where’s that paper?”. Before I tell you more about what it does, let me touch upon the largest obstacle of any new tool, the learning curve. Rest assured that “Where’s that paper?” requires  zero training – just install it and continue working as usual – it will guide you without further effort on your end.

After building much suspense, I will elaborate. If you’re like me, you find yourself frequently searching the web for papers that you have browsed / opened in the past on your computer – basically paper browsing history. Whether it is in the process of writing a paper or searching for existing ideas, you may search for authors of titles of papers that you recall know to have once glimpsed at. At some point, the paper has either been downloaded or added  to the favorites bar. In any case, this process is manual and takes up much time (often including frustration). As such, I thought we could all use some code to automate it.

### Functionality.

The extension is very simple: it identifies when you are reading a scientific paper (according to the domain) and then automatically adds this paper, with the proper author list, year etc., to your favorites bar under a designated folder. Then, when you type a search in the Chrome’s search bar for an author or a title, the relevant results from the favorites pop up. One might also explicitly browse in the favorites to see all papers read.

See an example of the results shown when searching for “short” in Chrome’s address bar. The first three links are from the favorite and the rest are general Google suggestions from the web.

The extension automatically works on a list of specified domains that include:

It is not too difficult to customize the list and additional domains. Reach out to me and I’ll add them upon your request!

A bonus feature (thanks Boaz for the suggestion): Click the extension’s icon and you can download a bib file containing (a nice formatting of) the DBLP bibtex records of all papers added to the favorites.

### Security.

Most who work in our domain of expertise would be quite concerned with installing extensions and with good reason. The extension I wrote does not leak any information, not to me or to another third party. I have no ulterior motive in developing the extension, originally helping myself, I now see the benefit of sharing it with our community.  I do not know how to reassure you that this is indeed the case other than giving you my word and publishing the source code online. It is available here:

https://github.com/eylonyogev/Where-s-That-Paper-

I’d be glad to hear any feedback.

Thanks,

Eylon.

[Guest post by Alex Kelser, Alex Lin, Amil Merchant, and Suproteem Sarkar, scribing for a lecture by Andrej Risteski.]

# Andrej Risteski: Approximating Partition Functions via Variational Methods and Taylor Series

For a probability distribution defined up to a constant of proportionality, we have already seen the partition function. To refresh your memory, given a probability mass function $p(\vec{x}) \propto f(\vec{x})$ over all $\vec{x} \in \mathcal{X}$, the partition function is simply the normalization constant of the distribution, i.e.
$Z = \sum_{\vec{x} \in \mathcal{X}} f(\vec{x})$

At first glance, the partition function may appear to be uninteresting. However, upon taking a deeper look, this single quantity holds a wide array of intriguing and complex properties. For one thing, its significance in the world of thermodynamics cannot be overstated. The partition function is at the heart of relating the microscopic quantities of a system – such as the individual energies of each probabilistic state $\vec{x} \in \mathcal{X}$ – to macroscopic entities describing the entire system: the total energy, the energy fluctuation, the heat capacity, and the free energy. For explicit formulas, consult this source for reference. In machine learning the partition function also holds much significance, since it’s intimately linked to computing marginals in the model.

Although there exists an explicit formula for the partition function, the challenge lies in computing the quantity in polynomial time. Suppose $\vec{x}$ is a vector of dimension $n$ where each $x_i \in \{-1, 1\}$ for $i \in \{1, \ldots, n\}$. Specifically, we would like a smarter way of finding $Z$ than simply adding up $f(x)$ over all $2^n$ combinations of $x$.

Andrej Risteski’s lecture focused on using two general techniques – variational methods and Taylor series approximations – to find provably approximate estimates of $Z$.

# Motivation

The general setup addresses undirected graphical models, also known as a Markov random fields (MRF), where the probability mass function $p : \mathcal{X} \to [0, 1]$ has the form
$p(\vec{x}) \propto \exp \left( \sum_{i \sim j} \phi_{ij}(x_i,x_j) \right)$
for some random, $n$-dimensional vector $\vec{x} \in \mathcal{X}$ and some set of parameterized functions $\{\phi_{ij}: \mathcal{X} \to \mathcal{R}\}$. Note that the notation $i \sim j$ denotes all pairs $(i, j)$ where $i, j \in \{1, 2, \ldots, n\}$ and the edge $e = (i, j)$ exists in the graph.

The talk considered the specific setup where each $x_i \in \{-1, 1\}$, so $\mathcal{X} = \{-1, 1\}^n$. Also, we fix

$\phi_{ij}(x_i, x_j) = J_{ij} x_i x_j$
for some set of coefficients $\{J_{ij}\}$, thereby giving us the well-known Ising model. Note, the interaction terms could be more complicated and made “less local” if desired, but that was not discussed in the lecture.

These graphical models are common in machine learning, where there are two common tasks of interest:

1. Learning: given samples, models try to learn $\phi_{ij}(x_i, x_j)$ or $\{J_{ij}\}$
2. Inference: given $\{\phi_{ij}\}$ or the potentials $\{J_{ij}\}$ as input, we want to calculate the marginal probabilities, such as
$p(x_i = -1) \text{ or } p(x_i = 1, x_j = -1)$

We focus on the latter task. The problem of inference is closely related to calculating the partition function. This value is often used as the normalization constant in many methods, and it is classically defined as
$Z = \sum_{x \in \mathcal{X}} \exp \left( \sum_{i \sim j} \phi_{ij}(x_i, x_j) \right)$

Although we can write down the aforementioned closed-form expression for the partition function, it is difficult to calculate this quantity in polynomial time. There are two broad approaches to solving inference problems:

1. Randomized: Methods based on Markov chain Monte Carlo (MCMC), such as Gibbs sampling, that construct a Markov chain whose stationary distribution is the distribution of interest. These have been more extensively studied in theoretical computer science.
2. Deterministic: Variational methods, self-avoiding walk trees, Taylor expansion methods

Although more often used in practice, randomized algorithms such as MCMC are notoriously hard to debug, and it is often unclear at what point the chain reaches stationarity.

In contrast, variational methods are often more difficult to rigorously analyze but have the added benefit of turning inference problems into optimization problems. Risteski’s talk considered some provable instantiations of variational methods for calculating the partition function.

# Variational Methods

Let us start with a basic observation, known as the Gibbs variational principle. This can be stated as the following.

Lemma 1: Let $p(\vec{x}) \propto \exp{\sum_{i \sim j} J_{ij} x_i x_j}$. Then, we can show that the corresponding partition function satisfies

$\log Z = \max_{q \in \mathcal{Q}} \left \{ \sum_{i \sim j} \underbrace{J_{ij}\mathbb{E}_q[x_i x_j]}_{\text{energy term}} + \underbrace{\mathbb{H}(q)}_{\text{entropy term}} \right \}$

where $q : \mathcal{X} \to [0, 1]$ is defined to be a valid distribution over $\mathcal{X}$ and $\mathcal{Q}$ is the set of all such distributions.

Note that we use $\mathbb{E}_q(\cdot)$ to denote the expectation of the inner argument with respect to the distribution $q$ and $\mathbb{H}(q)$ to denote the Shannon entropy of $q$.

Proof: For any $q \in \mathcal{Q}$, the KL divergence between $q$ and $p$ must be greater than or equal to 0.

$\mathbb{KL} (q || p) \geq 0$

$\underbrace{\mathbb{E}_q [\log q(\vec{x})}_{-\mathbb{H}(q)}] - \mathbb{E}_q \left[ \sum_{i \sim j} J_{ij} x_i x_j \right] + \log Z \geq 0$

$\log Z \geq \sum_{i \sim j} J_{ij} \mathbb{E}_q[x_i x_j] + \mathbb{H}(q)$
Equality holds if $q = p$, leading directly to the lemma above.

Note that the proof would have also held in the more generic exponential family with $\phi_{ij}$ instead of $J_{ij} x_i x_j$. Also, at most temperatures, one of the 2 terms, either the energy or the entropy will often dominate.

As a result of this lemma, we have framed the original inference problem of calculating $Z$ as an optimization problem over $\mathcal{Q}$. This is the crux of variational inference.

## Optimization Difficulties

The issue with this approach is that it is difficult to optimize over the polytope of distributions due to the values of each $x_i$ coming from $\pm 1$. In general, this is NOT tractable. We can imagine two possible solutions:

1. Inner Approximation

Instead of optimizing over all $q$, pick a “nice” subfamily of distributions to constrain $q$. The prototypical example is the mean-field approximation in which we let $q(\vec{x}) = \prod_{i=1}^n q_i(x_i)$, where each $q_i : \{-1, 1\} \to [0, 1]$ is a univariate distribution. Thus, we approximate $\log Z$ with
$\max_{q = \prod_i q_i} \left \{ \sum_{i \sim j} J_{ij} \cdot \mathbb{E}_{q_i}[x_i] \cdot \mathbb{E}_{q_j}[x_j] + \sum_i \mathbb{H}(q_i) \right \}$

This would provide a lower bound on $Z$. There are a number of potential issues with this method. For one thing, the functions are typically non-convex. Even ignoring this problem, it is difficult to quantify how good the approximation is.

2. Outer Approximation

Note: Given a distribution $q$, we use $q_S$ to denote the marginal $q(\vec{x}_S) = q(x_{s_1}, \ldots, x_{s_k})$ for some set $S = \{s_1, \ldots, s_k\} \subseteq \{1, \ldots, n\}$.

In the outer approximation, we relax the polytope we are optimizing over using convex hierarchies. For instance, we can define $\mathcal{M}^k$ as the polytope containing valid marginals $q_S$ over subsets $S$ of size at most $k$. We can then reformulate our objective as a two-part problem of (1) finding a set of pairwise marginals $\widetilde{q} = \{\widetilde{q}_S \text{ s.t. } \lvert S \rvert \leq k\}\in \mathcal{M}^k$ that optimizes the energy term and (2) finding a distribution $q$ with pairwise marginals matching $\widetilde{q}$ that optimizes the entropy term. For simplicity of notation, we use $\widetilde{q}_{ij}$ to denote $\widetilde{q}_{S}$, where $S = \{i, j\}$. It follows that we can rewrite the Gibbs equation as

$\log Z = \max_{\widetilde{q} \in \mathcal{M}^k} \left \{ \sum_{i \sim j} J_{ij} \cdot \mathbb{E}_{\widetilde{q}_{ij}}[x_i x_j] + \max_{q \given q_S = \widetilde{q}_S \forall S, \lvert S \rvert \leq k} \mathbb{H}(q) \right \}$

The first term here represents the energy on pairwise marginals, whereas the second term is the maximum entropy subject to a constraint about matching the energy distribution’s pairwise marginals. The goal of this procedure is to enlarge the polytope such that $\mathcal{M}^k$ is a tractable set, where we can impose a polynomial number of constraints satisfied by real marginals.

One specific convex hierarchy that is commonly used for relaxation is the Sherali-Adams (SA) hierarchy. Sherali-Adams will allow us to formulate the optimization of the energy term (and an approximation of the entropy term) as a convex program. We introduce the polytope $\text{SA}(k)$ for $k \geq 2$, which will relax the constraints on $\vec{x}$ to allow for values outside of $\{-1, 1\}^n$ in order to generate a polynomial-time solution for $\log Z$.

The Sherali-Adams hierarchy will take care of the energy term, but it remains unclear how to rewrite the entropy term in the context of the convex program. In fact, we will need approximations for $\mathbb{H}$ in order to accomplish this task.

In this talk, we’ll consider two approximations: one classical one – the Bethe entropy $\mathbb{H}_\text{BETHE}$ and one more recent one – the augmented mean-field pseudo-entropy $\mathbb{H}_\text{aMF}$.

## Bethe Entropy Approximation

The Bethe entropy works by pretending that the Markov random field is a tree. In fact, we can show that if the graph is a tree, then using the Bethe entropy approximation in the convex program defined by $\text{SA}(2)$ will yield an exact calculation of $\log Z$.

Specifically, the Bethe entropy is defined as
$\mathbb{H}_\text{BETHE} (\widetilde{q}) = \sum_{i \sim j} \mathbb{H}(\widetilde{q}_{ij}) - \sum_i \mathbb{H}(\widetilde{q}_i) (d_i - 1)$
where $d_i$ is defined to be the degree of the particular vertex $i$. Note that there are no marginals over sets of dimension greater than two in this expression; thus, we have a valid convex program over the polytope $\text{SA}(2)$.

This lemma is well-lnown, but it will be a useful warmup:

Lemma 2 Define the output of the convex program
$\log Z_{\text{BETHE}} = \max_{\widetilde{q} \in \text{SA}(2)} \left \{ \sum_{i \sim j} J_{ij} \cdot \mathbb{E}_{\widetilde{q}_{ij}} [x_i x_j] + \mathbb{H}_{\text{BETHE}} (\widetilde{q}) \right \}$
On a tree, $\log Z_{\text{BETHE}} = \log Z$ and the optimization objective is concave with respect to the $\text{SA}(2)$ variables, so it can be solved in polynomial time.

Proof sketch We will prove 3 claims:

1. $\log Z_{\text{BETHE}} \geq \log Z$

Since the energy term is exact, it suffices to show that for valid marginals $\widetilde{q}_S$, $\mathbb{H}_\text{BETHE}(\widetilde{q}) \geq \max_{q \given q_S = \widetilde{q}_S} \mathbb{H}(q)$. This can be done by re-writing $\mathbb{H}_\text{BETHE}$ and using a property of conditional entropy, namely that $\mathbb{H}(X,Y) = \mathbb{H}(X) + \mathbb{H}(Y|X)$.

2. For trees, $\log Z_\text{BETHE}$ is concave in the $\text{SA}(2)$ variables.

This proof was only sketched in class but is similar to the usual proof of concavity of entropy.

3. For trees, we can round a solution to a proper distribution over $\{-1, 1\}^n$ which attains the same value for the original $\log Z$ optimization.

The distribution $q$ that we will produce is
$q(\vec{x}) = \prod_i q(x_i \given x_{\text{pa}(i)})$
where we start at the root and keep sampling down the tree. Based on the tree structure, $\mathbb{H}(q) = \mathbb{H}_\text{BETHE}(q)$. The energy is also the same since $q_{ij}(x_i, x_j) = \widetilde{q}_{ij}(x_i, x_j)$. Therefore, since both terms in the equation are equal to the respective terms in the partition function, we get that the equation must equal the partition function.

Remarks:

1. $\mathbb{H}_\text{BETHE}$ is commonly used on graphs that are not trees.
2. However, for non-trees, the optimization is often no longer concave.
3. Therefore, the value obtained by the output equation is neither an upper or lower bound.
4. The fixed points of the belief propagation algorithm give a 1-1 correspondence with the stationary points of the Bethe objective. (See a classical paper by Yedidia et al..)

## Augmented Mean Field Pseudo-Entropy

For this approximation, we will focus on dense graphs: namely graphs in which each vertex has degree $\geq \Delta \cdot n$ for some $\Delta > 0$. To simplify things, we focus on distributions of the form
$p(\vec{x}) \propto \exp \left(\sum_{i \sim j} \pm J x_i x_j\right)$
where $J$ is some positive constant parameter. The Bethe entropy approximation is undesirable in the dense case, because we cannot bound its error on non-trees. Instead, we proved the following theorem.

Theorem (Risteski ’16)

For a dense graph with parameter $\Delta$, there exists an outer approximation based on $\text{SA}(\frac{1}{\epsilon^2\Delta})$ and an entropy proxy which achieves a $O(J \epsilon n^2 \Delta)$ additive approximation to $\log Z$ for some $\epsilon > 0$. There is an algorithm with runtime $O(\frac{1}{\epsilon^2\Delta})$.

To parse the theorem statement, consider the potential regimes for $J$:

1. $J >> \frac{1}{n} \rightarrow$ One can ignore the entropy portion and solve for the energy portion that dominates. (So the guarantee is not that interesting.)
2. $J << \frac{1}{n} \rightarrow$ MCMC methods give a $(1+\epsilon)$-factor approximation in time poly$(n, \frac{1}{\epsilon})$. (It’s not clear if the methods suggested here can give such a guarantee – this is an interesting problem to explore.)
3. $J = \Theta(\frac{1}{n}) \rightarrow$ MCMC mixes slowly, but there is no other way to get a comparable guarantee. This is the interesting regime!

Proof Sketch The proof strategy is as follows: We will formulate a convex program under the SA($k$) relaxation that can return a solution $\widetilde{q}$ in polynomial time with value $\log \tilde{Z}$. Note that this value of the relaxation will be an upperbound on the true partition function value $\log Z^*$. The convex program solution $\widetilde{q}$ may not be a valid probability distribution, so from this, we will construct a “rounded solution’’ – an actual distribution $q$ with value $\log Z$. It follows that
$\log \tilde{Z} \geq \log Z^* \geq \log Z$

We will then aim to put a bound on the gap between $\log \tilde{Z}$ and $\log Z$; namely, that
$\log Z \geq \log \tilde{Z} - \epsilon \cdot n^2 \cdot J \cdot \Delta$
or equivalently,
$\sum_{i \sim j} \mathbb{E}_q[x_i x_j] \cdot J + \mathbb{H}(q) \geq \sum_{i \sim j} \mathbb{E}_{\widetilde{q}}[x_i x_j] \cdot J + \mathbb{H}_\text{aMF}(\widetilde{q}) - \epsilon n^2 J \Delta$
This equivalently places a bound on the optimality gap between $\log \tilde{Z}$ and $\log Z^*$, thereby proving the theorem. Here are the main components of the proof:

1. Entropy Proxy

To approximate $\mathbb{H}$, we will use the pseudo-augmented mean field entropy approximation $\mathbb{H}_\text{aMF}$. This is defined as

$\mathbb{H}_\text{aMF}(q) = \mathbb{I}n_{|S| \le k} \{ \overbrace{\mathbb{H}(q_S)}^\text{entropy of S} + \overbrace{\sum_{i \notin S} \mathbb{H}(q_i|q_S)}^\text{entropy of everything else} \}$

Using $\mathbb{H}_\text{aMF}$, we can write a convex program under SA($k$) that provides a relaxation for optimizing $\log Z$. Specifically, we will let $k = O(1 / (\Delta \cdot \epsilon^2))$. Let $\widetilde{q}$ be the outputed solution to this relaxation. We define the rounded solution as
$q(\vec{x}) = \widetilde{q}_S(\vec{x}_S) \cdot \prod_{i \notin S} \widetilde{q}_S(x_i \given \vec{x}_S)$
Using the chain rule for entropy, we can show that
$\mathbb{H}(q) \geq \mathbb{H}_\text{aMF}(\widetilde{q})$

2. Rounding Scheme The above distribution $q$ is actually the same distribution that correlation rounding, as introduced by a seminal paper of Barak, Raghavendra, and Steurer, produces. By using definition of mutual information $\mathbb{I}(X, Y)$ and Pinsker’s Theorem showing that $\mathrm{\mathbb{C}ov}(X, Y) = O(\sqrt{\mathbb{I}(X, Y)})$, we can work through some algebra to show that

$\sum_{i \sim j} \mathbb{E}_q[x_i x_j] \cdot J \geq \sum_{i \sim j} \mathbb{E}_{\widetilde{q}}[x_i x_j] \cdot J - \epsilon n^2 J \Delta$

Then, putting together the entropy and energy effects, we can prove the main theorem.

Remarks:

1. The above proof easily extends to a more general notion of density, as well as to sparse graphs with “low-rank’’ spectral structure. See the paper for more information.
2. In fact, with more work, one can extract guarantees on the the best mean field approximation to the Gibbs distribution in the dense regime. (This intuitively can be done as the distribution $q$ produced above is almost mean-field, except for the set $S$). This is interesting, since as we mentioned, the quality of the mean-field approximation is usually difficult to characterize. (Moreover, the procedure is algorithmic.) See the recent work of Jain, Koehler and Risteski ’18 for more details and results.

# Taylor Series Approximation

Another method of calculating the partition function involves Taylor expanding its logarithm around a cleverly selected point. This approach was first developed by Barvinok ’14, for the purpose of computing the partition function of cliques in a graph. The mathematical techniques developed in this paper can be naturally extended to evaluate a variety of partition functions, including that of the ferromagnetic Ising model. We will use this example to illustrate the general idea of the Taylor expansion technique.

The goal here is to devise a deterministic, fully polynomial-time approximation scheme (an FPTAS) for evaluating the partition function of the ferromagnetic Ising model. This means that the the algorithm must run in poly($n, \frac{1}{\epsilon}$) and be correct within a multiplicative factor of $1 + \epsilon$. We will work in the general case where the Hamiltonian includes an external magnetic field term, as well as the neighboring spin interaction terms. Using logarithmic identities, we can re-write the partition function slightly differently from last time:

$Z(\lambda) = \sum_{x \in \{ \pm 1 \}^n} \lambda^{|{1: x_i = 1}|} \beta^{|E(x)|}$
Here, $\lambda$ is called the vertex activity (or external field), and characterizes the likelihood of a vertex to be in the + configuration. Additionally, $\beta \geq 0$ is referred to as the edge activity, and characterizes the propensity of a vertex to agree with its neighbors. The ferromagnetic regime (where agreement of spins is favored) corresponds to $\beta < 1$. $|E(x)|$ denotes the number of edge cut in the graph, which is equivalent to the number of neighboring pairs with opposite spins.

As one final disclaimer, the approximation scheme presented below is not valid for $| \lambda| = 1$, which corresponds to the zero-magnetic field case. Randomized algorithms exist which can handle this case.

The general idea here is to hold one parameter fixed ($\beta$ in our case), and express the logarithm of the partition function as a Taylor series in the other parameter ($\lambda$). From a physics standpoint, the Taylor expansion around $\lambda = 0$ tells us how the partition function is changing as a function of the magnetic field. Without loss of generality, we focus on the case where $| \lambda | < 1$. This simplification can be justified by a simple symmetry argument. Consider $\bar{x}$ as the inverse of $x$ where all the 1’s (spin-up) are flipped to -1s (spin-down) and vice-versa. The partition function of this flipped system can be related to the original system by a constant factor:

\begin{align*} \sum_x \beta^{|E(x)|} \lambda^{|\{1: x_i = 1\}|} &= \sum_x \beta^{|E(\bar{x})|} \lambda^{n - |\{1: \bar{x}_i = 1\}|} \\ &= \lambda^{n} \sum_x \beta^{|E(\bar{x})|} \left(\frac{1}{\lambda}\right)^{|\{1: \bar{x}_i = 1\}|} \end{align*}

This holds because the number of cut edges remains constant when the values are flipped. Since these two partition functions are related by this factor constant in $\lambda$, for any model with $\lambda > 1$, we can simply consider the flipped graph $\bar{x}$ and use the above relation.

For our purposes, it will be more convenient to approximate the logarithm of the partition function, because a multiplicative $(1 + \epsilon)$ approximation of $Z(\lambda)$ corresponds to an $\mathcal{O}(\epsilon)$ additive approximation of $\log Z(\lambda)$. In fact, if we allow the partition function to be complex, then we can easily show that an additive bound of $\frac{\epsilon}{4}$ for $\log Z(\lambda)$ guarantees a multiplicative bound of $(1+\epsilon)$ for $Z(\lambda)$.

For notational convenience we define:
$\log Z(\lambda) = f(\lambda)$
Thus, the Taylor expansion of $f(\lambda)$ around $\lambda = 0$ if given by:
$f(\lambda) = \sum_j f^{(j)} (0) \frac{\lambda^j}{j!}$

The big question now is: Can we get a good approximation from just the lower order terms of the Taylor expansion for $f(\lambda)$? Equivalently, can we can bound the sum of the higher order terms at $\frac{\epsilon}{4}$? Additionally, we still must address the question of how to actually calculate the lower order terms.

To answer these questions, we make simple observations about the derivatives of $f$ in relation to the derivatives of $Z$.
\begin{align*} f'(\lambda) &= \frac{1}{Z(\lambda)} Z'(\lambda) \\ Z'(\lambda) &= f'(\lambda) Z(\lambda) \\ Z^{(m)} (\lambda) &= \sum_{j=0}^{m-1} {m-1 \choose j} Z^{(j)} (\lambda) f^{(m-j)} (\lambda) \end{align*}
The last equation just comes from repeated application of the product rule. Using these relations, we can solve for the first $m$ derivatives of $f$ if we have the first $m$ derivatives of $Z$ using a triangular linear system of equations. As $Z(0) = 1$, we see that the system is non-degenerate.

Note, $Z(\lambda)$ is a n-degree polynomial with leading coefficient of 1 and constant coefficient of 1 (corresponding to the configurations where all vertices are positive / negative respectively). Using this form, we can re-write the partition function in terms of its $n$ (possibly complex) roots $r_i$:

$Z(\lambda) = \prod_{i=1}^n \left(1-\frac{\lambda}{r_i} \right) \text{ for some } r_i$

$f(\lambda) = \log Z(\lambda) = \sum_{i=1}^n \log \left(1-\frac{\lambda}{r_i}\right)$

$= - \sum_{i=1}^n \sum_{j=1}^{\infty} \frac{1}{j} \left(\frac{\lambda}{r_i} \right)^j$

Supposing we keep the first $m$ terms, the error is given bounded by:

$|f(\lambda) + \sum_{i=1}^n \sum_{j=1}^{m} \frac{1}{j} (\frac{\lambda}{r_i})^j| \leq \sum_{i=1}^n \sum_{j=m+1}^{\infty} \frac{1}{|r_i|}\frac{|\lambda|^j}{j} \leq \sum_{i=1}^n \frac{|\lambda|^{m+1}}{(m+1)(1-|\lambda|)}\frac{1}{|r_i|}$

Here we can invoke the Lee-Yang theorem, which tells us that the zeroes of $Z(\lambda)$ for a system with ferromagnetic interactions lie on the unit circle of the complex plane. So the Lee-Yang theorem guarantees that $|r_i| = 1$, and we see that the error due to truncation is ultimately bounded by:

$\frac{n|\lambda|^{m+1}}{(m+1)(1-|\lambda|)}$

Now by the previous symmetry argument, we can assume that $\lambda < 1$. Thus, to achieve an error bound of $\frac{\epsilon}{4}$, we must have:
$\frac{\epsilon}{4} > \frac{n|\lambda|^{m+1}}{(m+1)(1-|\lambda|)}$
Rearranging terms and taking the natural log of both sides (which is justified given that $|\lambda| < 1$), we see that the inequality is satisfied if:

$m \geq \frac{1}{\log(\frac{1}{|\lambda|})}(\log(\frac{4n}{\epsilon}) + \log(\frac{1}{1-|\lambda|}))$

Thus, we need only retain the first $\frac{1}{\log(\frac{1}{|\lambda|})}(\log(\frac{4n}{\epsilon}) + \log(\frac{1}{1-|\lambda|}))$ terms of the Taylor series expansion of $f(\lambda)$. This will involve calculating the first $m = \lceil \frac{1}{\log(\frac{1}{|\lambda|})}(\log(\frac{4n}{\epsilon}) + \log(\frac{1}{1-|\lambda|}))\rceil$ derivatives of $Z(\lambda)$, which naively can be done in quasi-polynomial time. This is done by running over all $S$, where $|S| = m$. This takes $O(n \log n + \log (\frac{1}{\epsilon}))$ time, which is quasi-polynomial.

Recent work by Patel and Regts, as well as Liu, Sinclair and Srivastiva has focused on evaluating these coefficients more efficiently (in polynomial time), but is outside the scope of this lecture. Clever counting arguments aside, to trivially calculate the $k$-th derivative, we need only sum over the vectors with $k$ spin-up components.

## Lee-Yang Theorem

Most of the heavy lifting in the above approach is done by the Lee-Yang theorem. In this section, we sketch how it is proved.

First, let’s define the Lee-Yang property, which we will refer to as the LYP. Let $P(\lambda_1, \dots, \lambda_n)$ be some multilinear polynomial. $P$ has the Lee-Yang property if for any complex numbers $\lambda_1, \dots, \lambda_n$ such that $|\lambda_i| > 1$ for all i, then $P(\lambda_1, \dots, \lambda_n) \neq 0$.

We can then show that the partition function for the ferromagnetic Ising model, which we wrote as
$Z(\lambda_1, \dots, \lambda_n) = \sum_x \beta^{|E(x)|} \prod_{i: x_i = 1} \lambda_i$
must have this LYP. In the antiferromagnetic case it turns out that all zeroes lie on the negative real axis, but we will focus on the ferromagnetic case.

Proof (sketch):

For the proof that the partition function has the LYP, we will use Asano’s contraction argument. This relies on the fact that certain operations preserve LYP and we can “build” up to the partition function of the full graph by these operations.

• Disjoint union: This is fairly simple to prove since the partition function for the disjoint union of $G$ and $H$ would just factor into the multiplication of the two partition function of the original graphs $Z_{G \sqcup H} = Z_{G} Z_{H}$. Where $Z_{G}$ and $Z_{H}$ have the LYP, then it is clear that $Z_{G \sqcup H}$ has the same LYP.
• Contraction: Suppose we produce a graph $G'$ from $G$ by contracting 2 vertices $z_1, z_2$ in $G$, which means merging them into one vertex. It can be shown that if $G$ has the LYP, then $G'$ also has the LYP. We write the partition function for G as
$A\lambda_1 \lambda_2 + B \lambda_1 + C \lambda_2 + D$
The “contracted” graph amounts to deleting the middle 2 terms so the partition function for $G'$ can be written as
$\underbrace{A \lambda'}_{\text{both plus}} + \underbrace{D}_{\text{both minus}}$
We want to show that the partition function for $G'$ has the LYP. Because the partition function of $G$ has the LYP, we can consider the case where $\lambda_1 = \lambda_2 = \lambda$. By the LYP,
$A\lambda^2 + (B+C)\lambda + D \neq 0$
if $|\lambda| > 1$. By Vieta’s formulas we can find a relation between A and D,
$\frac{|D|}{|A|} = \prod |zeroes| \leq 1$
Now, to show that the partition function of G’ has the LYP, we assume there is a $\lambda'$ such that $A\lambda' + D = 0$. For this to be true,
$|\lambda'| = \frac{|D|}{|A|}$
However, this means that $|\lambda'| \leq 1$ so there is no solution where $|\lambda'| > 1$ such that the partition function of G’ is 0 and so it has the LYP.

The final part of this proof is to construct the original graph. We observe is that for a single edge, the partition function has the LYP. In this case, we easily write out the partition function for 2 vertices:
$z_1z_2 + B z_1 + B z_2 + 1$
Suppose $|z_1| > 1$: then $z_1z_2 + Bz_1 + Bz_2 + 1=0$ would imply $|z_2| = \frac{|1+Bz_1|}{|z_1 + B|}$. However, this is just the Möbius transform mapping the exterior of the unit disk to the interior, so $|z_2| \leq 1$. Thus, it cannot be the case that both $z_1$ and $z_2$ have absolute value greater than one.

Since single edges have the LYP. We break the graph into single edges, with copies of vertices. These copies are then contracted and we build up the graph to show that the partition function has the LYP.

Knowing that the partition function has the LYP, direct application of the Lee-Yang theorem guarantees that the roots are on the unit circle in the complex plane.

# Conclusion

Although there exists an explicit formula for the partition function, the challenge lies in computing the quantity in polynomial time. Andrej Risteski’s lecture focused on two less-explored methods in theoretical computer science – namely variational inference and Taylor series approximations – to find provably approximate estimates. Both of these approaches are relatively new and replete with open problems.

[Guest blog from Yaron Singer who is heading the selection committee for the Rabin fellowship this year. In addition to the Rabin fellowship and other postdoc opportunities, this year we also have a new postdoc opportunity in quantum computation and information via the Harvard Quantum Initiative. –Boaz.]

Michael O. Rabin Postdoctoral Fellowship in Theoretical Computer Science

Deadline for full consideration: December 3, 2018.  Applications can be submitted here::

The Harvard John A. Paulson School of Engineering and Applied Sciences at Harvard University seeks applicants for the Michael O. Rabin Postdoctoral Fellowship in Theoretical Computer Science. The standard duration of the Rabin Fellowship is two years. Rabin Fellows will receive a generous salary as well as an annual allocation for research and travel expenses.

Past fellows are Mika Goose and Aviad Rubinstein and the current fellow is Alexander Golovnev.

We are looking for exceptional junior scientists in theoretical computer science, broadly construed. Rabin Fellows will be provided with the opportunity to pursue their research agenda in an intellectually vibrant environment with ample mentorship. While interaction with Harvard faculty, students, and visitors is encouraged, Rabin Fellows are free to pursue their own interests. Candidates are required to have a doctorate or terminal degree in Computer Science or a related area by the expected start date.

Required application documents include a cover letter, research statement, CV (including a list of publications), and names and contact information for three references. We recommend that papers mentioned in the CV be available online on your homepage or on electronic archives. Applicants will apply on-line at the above address. We encourage candidates to apply by December 3, 2018, but applications will be accepted until the position is filled.

Harvard is an equal opportunity employer and all qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability status, protected veteran status, or any other characteristic protected by law.

*The fellowship is named after Michael O. Rabin, pioneer in Computer Science research and winner of numerous awards including the A. M. Turing award in 1976. Michael Rabin has been on the faculty at Harvard since 1981, and currently is the Thomas J. Watson, Sr. Research Professor of Computer Science in the Harvard Paulson School. The fellowship is aimed at researchers in all areas of theoretical computer science, including fellows that, like Rabin, might create new areas that do not yet exist.

[Guest post by Thomas Orton who presented a lecture on this in our  physics and computation seminar –Boaz]

## Introduction

This blog post is a continuation of the CS229R lecture series. Last week, we saw how certain computational problems like 3SAT exhibit a thresholding behavior, similar to a phase transition in a physical system. In this post, we’ll continue to look at this phenomenon by exploring a heuristic method, belief propagation (and the cavity method), which has been used to make hardness conjectures, and also has thresholding properties. In particular, we’ll start by looking at belief propagation for approximate inference on sparse graphs as a purely computational problem. After doing this, we’ll switch perspectives and see belief propagation motivated in terms of Gibbs free energy minimization for physical systems. With these two perspectives in mind, we’ll then try to use belief propagation to do inference on the the stochastic block model. We’ll see some heuristic techniques for determining when BP succeeds and fails in inference, as well as some numerical simulation results of belief propagation for this problem. Lastly, we’ll talk about where this all fits into what is currently known about efficient algorithms and information theoretic barriers for the stochastic block model.

## (1) Graphical Models and Belief Propagation

### (1.0) Motivation

Suppose someone gives you a probabilistic model on $\vec{x}=(x_1,...,x_n) \in \chi^{N}$ (think of $\chi$ as a discrete set) which can be decomposed in a special way, say

$P(\vec{x})\propto \prod_{a \in F} f_{a}(\vec{x})$

where each $f_{a}$ only depends on the variables $V_{a}$. Recall from last week that we can express constraint satisfaction problems in these kinds of models, where each $f_{a}$ is associated with a particular constraint. For example, given a 3SAT formula $\phi=\land_{a} c_{a}$, we can let $f_{a}(\vec{x})=1$ if $c_{a}(\vec{x})$ is satisfied, and 0 otherwise. Then each $f_{a}$ only depends on 3 variables, and $P(\vec{x})$ only has support on satisfying assignments of $\phi$.

A central problem in computer science is trying to find satisfying assignments to constraint satisfaction problems, i.e. finding values $\vec{x}$ in the support of $P(\vec{x})$. Suppose that we knew that the value of $P(x_1=1)$ were $>0$. Then we would know that there exists some satisfying assignment where $x_1=1$. Using this knowledge, we could recursively try to find $\vec{x}$‘s in the support of $P(1,x_2,...,x_n)$, and iteratively come up with a satisfying assignment to our constraint satisfaction problem. In fact, we could even sample uniformally from the distribution as follows: randomly assign $x_1$ to $1$ with probability $P(x_1=1)$, and assign it to $0$ otherwise. Now iteratively sample from $P(x_2|x_1)$ for the model where $x_1$ is fixed to the value we assigned to it, and repeat until we’ve assigned values to all of the $\{x_i\}_{i}$. A natural question is therefore the following: When can we try to efficiently compute the marginals

$P(x_i):=\sum_{\vec{x}-x_i} P(\vec{x})$

for each $i$?

A well known efficient algorithm for this problem exists when the corresponding graphical model of $P(\vec{x})$ (more in this in the next section) is a tree. Even though belief propagation is only guaranteed to work exactly for trees, we might hope that if our factor graph is “tree like”, then BP will still give a useful answer. We might even go further than this, and try to analyze exactly when BP fails for a random constraint satisfaction problem. For example, you can do this for k-SAT when $k$ is large, and then learn something about the solution threshold for k-SAT. It therefore might be natural to try and study when BP succeeds and fails for different kinds of problems.

### (1.1) Deriving BP

We will start by making two simplifying assumptions on our model $P(\vec{x})$.

First, we will assume that $P(\vec{x})$ can be written in the form $P(\vec{x})\propto \prod_{(i,j) \in E} f_{i,j}(x_i,x_j) \prod_{i \in [n]} f_i(x_i)$ for some functions $\{f_{i,j}\}_{i,j}, \{f_i(x_i)\}_{i}$ and some “edge set” $E$ (where the edges are undirected). In other words, we will only consider pairwise constraints. We will see later that this naturally corresponds to a physical interpretation, where each of the “particles” $x_i$ interact with each other via pairwise forces. Belief propagation actually still works without this assumption (which is why we can use it to analyze $k$-SAT for $k>2$), but the pairwise case is all we need for the stochastic block model.

For the second assumption, notice that there is a natural correspondence between $P(\vec{x})$ and the graphical model $T$ on $n$ vertices, where $(i,j)$ forms an edge in $T$ iff $(i,j) \in E$. In order words, edges $(i,j)$ in $T$ correspond to factors of the form $f_{i,j}$ in $P(\vec{x})$, and vertices in $T$ correspond to variables in $\vec{x}$. Our second assumption is that the graphical model $T$ is a tree.

Now, suppose we’re given such a tree $T$ which represents our probabilistic model. How to we compute the marginals? Generally speaking, when computer scientists see trees, they begin to get very excited [reference]. “I know! Let’s use recursion!” shouts the student in CS124, their heart rate noticeably rising. Imagine that we arbitrarily rooted our tree at vertex $x_i$. Perhaps, if we could somehow compute the marginals of the children of $x_i$, we could somehow stitch them together to compute the marginal $x_i$. In order words, we should think about computing the marginals of roots of subtrees in our graphical model. A quick check shows that the base case is easy: suppose we’re given a graphical model which is a tree consisting of a single node $x_i$. This corresponds to some PDF $P(\vec{x})=P(x_i) \propto f_{i}(x_i)$. So to compute $P(x_i)$, all we have to do is compute the marginalizing constant $Z=\sum_{x_i \in \chi} f_{i}(x_i)$, and then we have $P(x_i)=\frac{1}{Z}f(x_i)$. With the base case out of the way, let’s try to solve the induction step: given a graphical model which is a tree rooted at $x_i$, and where we’re given the marginals of the subtrees rooted at the children of $x_i$, how do we compute the marginal of the tree rooted at $x_i$? Take a look at figure 2 to see what this looks like graphically. To formalize the induction step, we’ll define some notation that will also be useful to us later on. The main pieces of notation are $T^{i \rightarrow j}$, which is the subtree rooted at $x_i$ with parent $x_j$, and the “messages” $\psi^{k \rightarrow i}_{x_i}$, which can be thought of as information which is passed from the child subtrees of $x_i$ to the vertex $x_i$ in order to compute the marginals correctly.

1. We let $\partial i$ denote the neighbours of vertex $i$ in $T$. In general, I’ll interchangeably switch between vertex $i$ and the variable $x_i$ represented by vertex $i$.
2. We define $T^{i \rightarrow j}$ to be the subtree of $T$ rooted at $i$, where $i$‘s parent is $j$. We need to specify $i$‘s parent in order to give an orientation to our tree (so that we know in which direction we need to do recursion down the tree). Likewise, we let $V_{i \rightarrow j}$ be the set of vertices/variables which occur in subtree $T^{i \rightarrow j}$.
3. We define the function $T^{i \rightarrow j}(V_{i \rightarrow j})$, which is a function of the variables in $V_{i \rightarrow j}$, to be equal to the product of $T^{i \rightarrow j}$‘s edges and vertices. Specifically, $T^{i \rightarrow j}(V_{i \rightarrow j})=\prod_{(a,b) \in E'} f_{a,b}(x_a,x_b) \prod_{a \in V_{i \rightarrow j}} f_{a}(x_a)$, where $E'$ is the set of edges which occur in subtree $T^{i \rightarrow j}$. We can think of $T^{i \rightarrow j}(V_{i \rightarrow j})$ as being the pdf (up to normalizing constant) of the “model” of the subtree $T^{i \rightarrow j}$.
4. Lastly, we define $\psi^{i \rightarrow j}_{x_i}:=\frac{1}{Z^{i \rightarrow j}} \sum_{V_{i \rightarrow j}-x_i} T(V_{i \rightarrow j})$, where $Z^{i \rightarrow j}$ is a normalizing constant chosen such that $\sum_{x_i \in \chi} \psi^{i \rightarrow j}_{x_i}=1$. In particular, $\psi^{i \rightarrow j}_{x_i}: \chi \rightarrow \mathbb{R}$ is a function defined for each possible value of $x_i \in \chi$. We can interpret $\psi^{i \rightarrow j}_{x_i}$ in two ways: firstly, it is the marginal of the root of the subtree $T^{i \rightarrow j}$. Secondly, we can think of it as a “message” from vertex $i$ to vertex $j$. We’ll see why this is a valid interpretation shortly.

Phew! That was a lot of notation. Now that we have that out of the way, let’s see how we can express the marginal of the root of a tree as a function of the marginals of its subtrees. Suppose we’re considering the subtree $T^{i \rightarrow j}$, so that vertex $i$ has children $(\partial i)-j$. Then we can compute the marginal $\psi^{i \rightarrow j}_{r}$ directly:

The non-obvious step in the above is that we’re able to switch around summations and products: we’re able to do this because each of the trees are functions on disjoint sets of variables. So we’re able to express $\psi^{i \rightarrow j}_{x_i}$ as a function of the children values $\{\psi^{k \rightarrow i}_{x_k}\}_{k \in \partial i-j}$. Looking at the update formula we have derived, we can now see why the $\{\psi^{k \rightarrow i}_{x_k}\}_{k \in \partial i-j}$ are called “messages” to vertex $i$: they send information about the child subtrees to their parent $i$.

The above discussion is a purely algebraic way of deriving belief propagation. A more intuitive way to get this result is as follows: imagine fixing the value of $x_i=a$ in the the subtree $T^{i \rightarrow j}$, and then drawing from each of the marginals of the children of $x_i$ conditioned on the value $x_i=a$. We can consider the marginals of each of the children independently, because the children are independent of each other when conditioned on the value of $x_i$. Converting words to equations, this means that if $x_i$ has children $x_{k_{1}},...,x_{k_{d}}$, then the marginal probability of $(x_i,x_{k_{1}},...,x_{k_{d}})$ in the subtree $T^{i \rightarrow j}$ is proportional to $f_{i}(x_i) \prod_{\substack{k:(i,k) \in E, \\ k \not = j}} \psi^{k \rightarrow i}_{x_k} f_{i,k}(x_i,x_k)$. We can then write

$\psi_{x_i}^{i \rightarrow j} \propto \sum_{(\partial i)-j} f_{i}(x_i) \prod_{\substack{k:(i,k) \in E, \\ k \not = j}} \psi^{k \rightarrow i}_{x_k} f_{i,k}(x_i,x_k)$

$= f_{i}(x_i)\prod_{\substack{k:(i,k) \in E, \\ k \not = j}} \sum_{x_k \in \chi} \psi^{k \rightarrow i}_{x_k} f_{i,k}(x_i,x_k)$

And we get back what we had before. We’ll call this last equation our “update” or “message passing” equation. The key assumption we used was that if we condition on $x_i$, then the children of $x_i$ are independent. It’s useful to keep this assumption in mind when thinking about how BP behaves on more general graphs.

A similar calculation yields that we can calculate the marginal of our original probability distribution $\psi_{x_i}^{i}:=P(x_i)$ as the marginal of the subtree with no parent, i.e.

$\psi_{x_i}^{i} \propto f_{i}(x_i)\prod_{k:(i,k) \in E} \sum_{x_k \in \chi} \psi^{k \rightarrow i}_{x_k} f_{i,k}(x_i,x_k)$

Great! So now we have an algorithm for computing marginals: recursively compute $\psi^{i \rightarrow j}_{x_i}$ for each $(i,j) \in E$ in a dynamic programming fashion with the “message passing” equations we have just derived. Then, compute $\psi_{x_i}^{i}$ for each $i \in [n]$. If the diameter of our tree $T$ is $d$, then the recursion depth of our algorithm is at most $d$.

However, instead of computing every $\psi^{i \rightarrow j}_{x_i}$ neatly with recursion, we might try something else: let’s instead randomly initialize each $\psi^{i \rightarrow j}_{x_i}$ $(x_i \in \chi)$ with anything we want. Then, let’s update each $\psi^{i \rightarrow j}_{x_i}$ in parallel with our update equations. We will keep doing this in successive steps until each $\psi^{i \rightarrow j}_{x_i}$ has converged to a fixed value. By looking at belief propagation as a recursive algorithm, it’s easy to see that all of the $\psi^{i \rightarrow j}_{x_i}$‘s will have their correct values after at most $d$ steps. This is because (after arbitrarily rooting our tree at any vertex) the leaves of our recursion will initialize to the correct value after 1 step. After two steps, the parents of the leaves will be updated correctly as functions of the leaves, and so they will have the correct values as well. Specifically:
Proposition: Suppose we initialize messages $\psi_{x_i}^{i \rightarrow j}$ arbitrarily, and update them in parallel according to our update equations. If $T$ has diameter $d$, then after $d$ steps each $\psi_{r}^{i \rightarrow j}$ converges, and we recover the correct marginals.

Why would anyone want to do things in this way? In particular, by computing everything in parallel in steps instead of recursively, we’re computing a lot of “garbage” updates which we never use. However, the advantage of doing things in this way is that this procedure is now well defined for general graphs. In particular, suppose $P(\vec{x})$ violated assumption (2), so that the corresponding graph were not a tree. Then we could still try to compute the messages $\psi_{x_i}^{i \rightarrow j}$ with parallel updates. We are also able to do this in a local “message passing” kind of way, which some people may find physically intuitive. Maybe if we’re lucky, the messages will converge after a reasonable amount of iterations. Maybe if we’re even luckier, they will converge to something which gives us information about the marginals $\{P(x_i)\}_{i}$. In fact, we’ll see that just such a thing happens in the stochastic block model. More on that later. For now, let’s shift gears and look at belief propagation from a physics perspective.

## (2) Free Energy Minimization

### (2.0) Potts model (Refresh of last week)

We’ve just seen a statistical/algorithmic view of how to compute marginals in a graphical model. It turns out that there’s also a physical way to think about this, which leads to a qualitatively similar algorithm. Recall from last week that another interpretation of a pairwise factor-able PDF is that of particles interacting with each other via pairwise forces. In particular, we can imagine each particle $x_i$ interacting with $x_j$ via a force of strength

$J_{(i,j)}(x_i,x_j)$

and in addition, interacting with an external field

$h_{i}(x_i)$

We imagine that each of our particles take values from a discrete set $\chi$. When $\chi=\{0,1\}$, we recover the Ising model, and in general we have a Potts model. The energy function of this system is then

$E[\vec{x}]=-\sum_{(i,j)}J_{(i,j)}(x_i,x_j)-\sum_{i}h_{i}(x)$

with probability distribution given by

$P(\vec{x})=\frac{1}{Z} e^{-E[\vec{x}]/T}$

Now, for $\chi=\{0,1\}$, computing the marginals $P(x_i)$ corresponds to the equivalent physical problem of computing the “magnetizations” $P(x_i=1)-P(x_i=0)$.

How does this setup relate to the previous section, where we thought about constraint satisfaction problems and probability distributions? If we could set $e^{-J_{(i,j)}(x_i,x_j)/T}=f_{i,j}(x_i,x_j)$ and $e^{-h_{i}(x_i)/T}=f_{i}(x_i)$, we would recover exactly the probability distribution $P(\vec{x})=\prod_{i,j} f_{i,j}(\vec{x}) \prod_{i} f_{i}(x_i)$ from the previous section. From a constraint satisfaction perspective, if we set $J_{(i,j)}(x_i,x_j)=1$ if constraint $(i,j)$ is satisfied and $0$ otherwise, then as $T \rightarrow 0$ (our system becomes colder), $P(\vec{x})$‘s probability mass becomes concentrated only on the satisfying assignments of the constraint satisfaction problem.

### (2.1) Free energy minimization

We’re now going to try a different approach to computing the marginals: let’s define a distribution $b(\vec{x})$, which we will hope to be a good approximation to $P(\vec{x})$. If you like, you can think about the marginal $b(x_i)$ as being the “belief” about the state of variable $x_i$. We can measure the “distance” between $P$ and $b$ by the KL divergence

$KL(b(\vec{x})||p(\vec{x})) =\sum_{\vec{x}}b(\vec{x})\ln\frac{b(\vec{x})}{P(\vec{x})}$

$=\sum_{\vec{x}}b(\vec{x})E(\vec{x})+\sum_{\vec{x}}b(\vec{x})\ln b(\vec{x})+\ln Z$

which equals 0 iff the two distributions are equal. Let’s define the Gibbs free energy as
$G(b(\vec{x}))=\sum_{\vec{x}}b(\vec{x})E(\vec{x})+\sum_{\vec{x}}b(\vec{x})\ln b(\vec{x})=:U(b(\vec{x}))-S(b(\vec{x}))$

So the minimum value of $G$ is $F=-\ln Z$ which is called the Helmholz free energy, $U$ is the “average energy” and $S$ is the “entropy”.

Now for the “free energy minimization part”. We want to choose $b$ to minimize $G$, so that we can have that $b$ is a good approximation of $P$. If this happens, then maybe we can hope to “read out” the marginals of $b$ directly. How do we do this in a way which makes it easy to “read out” the marginals? Here’s one idea: let’s try to write $G$ as a function of only the marginals $b(x_i,x_j)$ and $b(x_i)$ of $b$. If we could do this, then maybe we could try to minimize $G$ by only optimizing over values for “variables” $\{b(x_i,x_j)\}_{i,j}$ and $\{b(x_i)\}_{i}$. However, we need to remember that $\{b(x_i,x_j)\}_{i,j}$ and $\{b(x_i)\}_{i}$ are actually meant to represent marginals for some real probability distribution $b$. So at the very least, we should add the consistency constraints $b(x_i)=\sum_{x_j \in \chi} b(x_i,x_j)$ and $\sum_{x_i \in \chi} b(x_i)=1$ to our optimization problem. We can then think of $b(x_i,x_j)$ and $b(x_i)$ as “pseudo-marginals” which obey degree-2 Sherali-Adams constraints.

Recall that we’ve written $G$ as a sum of both the average energy $U$ and the entropy $S$. It turns out that we can actually write $U$ as only a function of the pariwise marginals of $b$:

$U(b(\vec{x})) =-\sum_{i,j}\sum_{x_i,x_j} b(x_i,x_j)J_{i,j}(x_i,x_j)-\sum_{i}\sum_{x_i}b(x_i) h_{i}(x_i)$

which follows just because the sums marginalize out the variables which don’t form part of the pairwise interactions:
$\sum_{\vec{x}}b(\vec{x})\left(-\sum_{i,j}J_{i,j}(x_i,x_j)-\sum_{i} h_{i}(x_i)\right)$

$=-\sum_{i,j}\sum_{x_i,x_j} b(x_i,x_j)J_{i,j}(x_i,x_j)-\sum_{i}\sum_{x_i}b(x_i) h_{i}(x_i)$

This is good news: since $P$ only depends on pairwise interactions, the average energy component of $G$ only depends on $b(x_i,x_j)$ and $b(x_i)$. However, it is not so clear how to express the entropy as a function of one node and two node beliefs. However, maybe we can try to pretend that our model is really a “tree”. In this case, the following is true:

Claim: If our model is a tree, and $\{b(x_i,x_j)\}_{i,j}$ and $\{b(x_i)\}_{i}$ are the associated marginals of our probabilistic model $b(\vec{x})$, then we have
$b(\vec{x})=\frac{\prod_{(i,j) \in E}b(x_i,x_j)}{\prod_{i \in [n]} b(x_i)^{q_i-1}}$

where $q_i$ is the degree of vertex $x_i$ in the tree.

It’s not too difficult to see why this is the case: imagine a tree rooted at $x_i$, with children $\partial i$. We can think of sampling from this tree as first sampling from $x_i$ via its marginal $b(x_i)$, and then by recursively sampling the children conditioned on $x_i$. Associate with $\{T_k\}_{k \in \partial i}$ the subtrees of the children of $i$, i.e. $T_j(V_{j})$ is equal to the probability of the occurrence $V_{j}$ on the probabilistic model of the tree rooted at vertex $j$. Then we have

$b(\vec{x}) =b(x_i) \prod_{k \in \partial i} b(x_k|x_i) \prod_{k \in \partial i} T_k(V_{k}|x_k)$

$=b(x_i)\frac{1}{b(x_i)^{q_i}} \prod_{k \in \partial i} b(x_i,x_k) \prod_{k \in \partial i}\frac{1}{b(x_k)} \prod_{k \in \partial i}T_k(V_{k})$
$=\frac{\prod_{(i,j) \in E}b(x_i,x_j)}{\prod_{i \in [N]}b(x_i)^{q_i-1}}$

where the last line follows inductively, since each $T_k$ only sees $q_{k}-1$ edges of $x_k$.

If we make the assumption that our model is a tree, then we can write the Bethe approximation entropy as

$S_{Bethe}=-\sum_{i,j}\sum_{x_i,x_j} b(x_i,x_j)\ln(b(x_i,x_j))+\sum_{i}(q_i-1)\sum_{x_i}b(x_i)\ln b(x_i)$

Where the $\{q_i\}_{i}$‘s are the degrees of the variables $\{x_i\}_{i}$ in the graphical model defined by $P(\vec{x})$. We then define the Bethe free energy as $U+S_{Bethe}$. The Bethe free energy is in general not an upper bound on the true free energy. Note that if we make the assignments $E_{i}(x_i)=h_{i}(x_i)$, $E_{i,j}(x_i,x_j)=-J_{i,j}(x_i,x_j)-h_{i}(x_i)-h_{j}(x_j)$, then we can rewrite $U$ as
$U=\sum_{i,j}\sum_{x_i,x_j} b(x_i,x_j)E_{i,j}(x_i,x_j)+\sum_{i}(q_i-1)\sum_{x_i}b(x_i)E_{i}(x_i)$

which is similar in form to the Bethe approximation entropy. In general, we have

$G_{Bethe}(b(x_i),b(x_i,x_j))=$

$\sum_{i,j}\sum_{x_i,x_j} b(x_i,x_j)(E_{i,j}(x_i,x_j)+\ln(b(x_i,x_j)))$

$-\sum_{i}(q_i-1)\sum_{x_i}b(x_i)( E_{i}(x_i)+\ln b(x_i))$

which is exactly the Gibbs free energy for a probabilistic model whose associated graph is a tree. Since BP gives the correct marginals on trees, we can say that the BP beliefs are the global minima of the Bethe free energy. However, the following is also true:

Proposition: A set of beliefs gives a BP fixed point in any graph (not necessarily a tree) iff they correspond to local stationary points of the Bethe free energy.

(For a proof, see e.g. page 20 of [4])

So trying to minimize the Bethe free energy is in some sense the same thing as doing belief propagation. Apparently, one typically finds that when belief propagation fails to converge on a graph, the optimization program which is trying to minimize $G_{Bethe}$ also runs into problems in similar parameter regions, and vice versa.

## (3) The Block Model

### (3.0) Definitions

Now that we’ve seen Belief Propagation from two different perspectives, let’s try to apply this technique of computing marginals to analyzing the behavior of the stochastic block model. This section will heavily follow the paper [2].

The stochastic block model is designed to capture a variety of interesting problems, depending on its settings of parameters. The question we’ll be looking at is the following: suppose we generate a random graph, where each vertex of the graph comes from one of $q$ groups each with probability $n_{1},...,n_{q}$. We add an edge between vertices $(i,j)$ in groups $a,b$ resp. with probability $p_{a,b}$. For sparse graphs, we define $c_{i,j}:=Np_{i,j}$, where we think of $p_{i,j}$ as $\mathcal{O}(\frac{1}{N})$. The problem is the following: given such a random graph, can you label the vertices so that, up to permutation, the labels you choose have high correlation to the true hidden labels which were used to generate the graph? Here are some typical settings of parameters which represent different problems:

1.  Community detection, where we have $q$ groups. We set $n_{i}=\frac{1}{q}$, and $c_{i,j}=c_{in}$ if $i=j$ and $c_{out}$ otherwise, with $c_{in}>c_{out}$ (assortative structure).
2. Planted graph partitioning, where $p_{i,j}$ may not necessarily be $\mathcal{O}(\frac{1}{N})$.
3. Planted graph colouring, where $p_{in}=0$, $p_{a,b}=p_{out}$ and $n_{a}=\frac{1}{q}$ for all groups. We know how to efficiently find a planted coloring which is strongly correlated with the true coloring when the average degree $c=(q-1)p_{out}N/q$ is greater than $\mathcal{O}(q^2)$.

We’ll concern ourselves with the case where our graph is sparse, and we need to try and come up with an assignment for the vertices such that we have high correlation with the true labeling of vertices. How might we measure how well we solve this task? Ideally, a labeling which is identical to the true labeling (up to permutation) should get a score of 1. Conversely, a labeling which naively guesses that every vertex comes from the largest group $\arg \max_{i \in [q]} n_{i}$ should get a score of 0.  Here’s one metric which satisfies these properties: if we come up with a labeling $\{t_{i}\}$, and the true labeling is $\{q_i\}$, then we’ll measure our performance by

$Q(\{t_{i}\},\{q_{i}\}):=\max_{\pi \in S_q} \frac{\frac{1}{N}\sum_{i \in [N]} \delta_{t_i,\pi(q_i)}-max_{a \in [q]} n_a}{1-\max_{a \in [q]} n_a}$

where we maximize over all permutations $\pi$. When we choose a labeling which (up to permutation) agrees with the true labeling, then the numerator of $Q$ will equal the denominator, and $Q=1$. Likewise, when we trivially guess that every vertex belongs to the largest group, then the numerator of $Q$ is $0$ and $Q=0$.

Given $\{c_{a,b}\},\{n_a\}$ and a set of observed edges $E$, we can write down the probability of a labeling $\{q_i\}$ as
$P(\{q_i\}_{i})=\prod_{\substack{(i,j) \not \in E \\ i \not = j}}(1-p_{q_{i},q_{j}})\prod_{\substack{(i,j) \in E }}p_{q_{i},q_{j}} \prod_{i \in [q]}n_{q_i}$

How might we try to infer $\{q_i\}$ such that we have maximum correlation (up to permutation) with the true labeling? It turns out that the answer is to use the maximum likelihood estimator of the marginal distribution of each $q_i$, up to a caveat. In particular, we should label $q_i$ with the $r \in [q]$ such that $P(q_i=r)$ is maximized. The caveat comes in when $P(\{q_i\}_{i})$ is invariant under permutations of the labelings $\{q_i\}_{i}$, so that each marginal $P(q_i)$ is actually the uniform distribution. For example, this happens in community detection, when all the group sizes $n_{1},...,n_{q}$ are equal. In this case, the correct thing to do is to still use the marginals, but only after we have “broken the symmetry” of the problem by randomly fixing certain values of the vertices to have particular labels. There’s actually a way belief propagation does this implicitly: recall that we start belief propagation by randomly initializing the messages. This random initialization can be interpreted as “symmetry breaking” of the problem, in a way that we’ll see shortly.

### (3.1) Belief Propagation

We’ve just seen from the previous section that in order to maximize the correlation of the labeling we come up with, we should pick the labelings which maximize the marginals of $P$. So we have some marginals that we want to compute. Let’s proceed by applying BP to this problem in the “sparse” regime where $c_{a,b}=Np_{a,b}=\mathcal{O}(1)$ (other algorithms, like approximate message passing, can be used for “dense” graph problems). Suppose we’re given a random graph with edge list $E$. What does does graph associated with our probabilistic model look like? Well, in this case, every variable is actually connected to every other variable because $P(\{q_i\}_{i})$ includes a factor $f_{i,j}(x_i,x_j)$ for every $(i,j) \in [n] \times [n]$, so we actually have a complete graph. However, some of the connections between variables are much weaker than others. In full, our BP update equations are

Likewise
$\psi^{i}_{t_{i}}=\frac{1}{Z^{i}} n_{t_{i}} \prod_{\substack{(i,k) \not \in E}}\left[ 1-\frac{1}{N}\sum_{t_k \in [q]}c_{t_1,t_k}\psi^{k \rightarrow i}_{t_k} \right]\prod_{\substack{(i,k) \in E}}\left[\sum_{t_k \in [q]}c_{t_i,t_k} \psi^{k \rightarrow i}_{t_k} \right]$

what we want to do is approximate these equations so that we only have to pass messages along the edges $E$, instead of the complete graph. This will make our analysis simpler, and also allow the belief propagation algorithm to run more efficiently. The first observation is the following: Suppose we have two nodes $j,j' \in [N]$ such that $(i,j),(i,j') \not \in E$. Then we see that $\psi^{i \rightarrow j}_{t_{i}}=\psi^{i \rightarrow j'}_{t_{i}}+\mathcal{O}\left(\frac{1}{N}\right)$, since the only difference between these two variables are two factors of order $(1-\mathcal{O}\left(\frac{1}{N}\right))$ which appear in the first product of the BP equations. Thus, we send essentially the same messages to non-neighbours $j$ of $i$ in our random graph. In general though, we have:

The first approximation comes from dropping non-edge constraints on the first product, and is reasonable because we expect the number of neighbours of $i$ to be constant. We’ve also defined a variable

$h_{t_i}:=\frac{1}{N}\sum_{k \in [N]}\sum_{t_k \in [q]}c_{t_i,t_k} \psi^{k \rightarrow i}_{t_k}$

and we’ve used the approximation $e^{-x} \approx 1-x$ for small $x$. We think of the term $h_{t_i}$ as defining an “auxiliary external field”. We’ll use this approximate BP equation to find solutions for our problem. This has the advantage that the computation time is $\mathcal{O}(|E|)$ instead of $\mathcal{O}(N^2)$, so we can deal with large sparse graphs computationally. It also allows us to see how a large dense graphical model with only sparse strong connections still behaves like a sparse tree-like graphical model from the perspective of Belief Propagation. In particular, we might have reason to hope that the BP equations will actually converge and give us good approximations to the marginals.

From now on, we’ll only consider factored block models, which in some sense represent a “hard” setting of parameters. These are models which satisfy the condition that each group has the same average degree $c$. In particular, we require

$\sum_{d=1}^{q} c_{a,d} n_d = \sum_{d=1}^{q} c_{b,d}n_d =c$

An important observation for this setting of parameters is that

$\psi^{i \rightarrow j}_{t_i}=n_{t_i}$

is always a fixed point of our BP equations, which is known as a factored fixed point (this can be seen by inspection by plugging the fixed point conditions into the belief propagation equations we derived). When BP ever reaches such a fixed point, we get that $Q=0$ and the algorithm fails. However, we might hope that if we randomly initialize $\{\psi^{i \rightarrow j}\}_{(i,j) \in E}$, then BP might converge to some non-trivial fixed point which gives us some information about the original labeling of the vertices.

### (3.2) Numerical Results

Now that we have our BP equations, we can run numerical simulations to try and get a feel of when BP works. Let’s consider the problem of community detection. In particular, we’ll set our parameters with all group sizes $n_{a}$ being equal, and with $c_{a,a}=c_{in}>c_{out}=c_{a,b}$ for $a \not = b$ and vary the ratio $\epsilon:=c_{out}/c_{in}$, and see when BP finds solutions which are correlated “better than guessing” to the original labeling used to generate the graph. When we do this, we get images which look like this:

It should be mentioned that the point at which the dashed red line occurs depends on the parameters of the stochastic block model. We get a few interesting observations from numerical experiments:

1. The point at which BP fails ($Q=0$) happens at a certain value of $\epsilon$ independent of $N$. In particular, when $\epsilon$ gets too large, BP converges to the trivial factored fixed point. We’ll see in the next section that this is a “stable” fixed point for certain settings of parameters.
2. Using the approximate BP equations gives the same performance as using the exact BP equations. This numerically justifies the approximations we made.
3. If we try to estimate the marginals with other tools from our statistical inference toolbox, we find (perhaps surprisingly) that Markov Chain Monte Carlo methods give the same performance as Belief Propagation, even though they take significantly longer to run and have strong asymptotic guarantees on their correctness (if you run MCMC long enough, you will eventually get the correct marginals, but this may take exponential time). This naturally suggests the conjecture that belief propagation is optimal for this task.
4. As we get closer to the region where BP fails and converges to the trivial fixed point, the number of iterations required for BP to converge increases significantly, and diverges at the critical point $\epsilon_{c}$ where BP fails and converges to the factored fixed point. The same kind of behavior is seen when using Gibbs sampling.

### (3.3) Heuristic Stability Calculations

How might we analytically try to determine when BP fails for certain settings of $q$ and $\{c_{a,b}\}$? One way we might heuristically try to do this, is to calculate the stability of the factored fixed point. If the fixed point is stable, this suggests that BP will converge to a factored point. If however it is unstable, then we might hope that BP converges to something informative. In particular, suppose we run BP, and we converge to a factored fixed point, so we have that for all our messages $\psi_{t_i}^{i \rightarrow j}=n_{t_i}$. Suppose we now add a small amount of noise to some of the $\psi_{t_i}^{i \rightarrow j}$‘s (maybe think of this as injecting a small amount of additional information about the true marginals). We (heuristically) claim that if we now continue to run more steps of BP, either the messages will converge back to the fixed point $\psi_{t_i}^{i \rightarrow j}=n_{t_i}$, or they will diverge to something else, and whether or not this happens depends on the eigenvalue of some matrix of partial derivatives.

Following this idea, here’s a heuristic way of calculating the stability of the factored fixed point. Let’s pretend that our BP equations occur on a tree, which is a reasonable approximation in the sparse graph case. Let our tree be rooted at node $k_0$ and have depth $d$. Let’s try to approximately calculate the influence on $k_0$ of perturbing a leaf $k_d$ from its factored fixed point. In particular, let the path from the leaf to the root be $k_d,...,k_0$. We’re going to apply a perturbation $\psi^{k_d}_{t} = n_t+\epsilon_{t}^{k}$ for each $t \in [q]$. In vector notation, this looks like $\psi^{k_d} = \mathbf{1}n_t+\epsilon^{k}$, where $\psi^{k_d} \in \mathbb{R}^{q \times 1}$ is a column vector. The next thing we’ll do is define the matrix of partial derivatives

Up to first order (and ignoring normalizing constants), the perturbation effect on $\epsilon^{k_0}$ is then (by chain rule) $\epsilon^{k_0} = \left[\prod_{i=0}^{d-1} T_{i} \right]\epsilon^{k_d}$. Since $T_{i}$ does not depend on $i$, we can write this as $\epsilon^{k_0}=T^{d} \epsilon^{k_d} \approx \lambda^{d} \epsilon^{k_d}$, where $\lambda$ is the largest eigenvalue of $T$. Now, on a random tree, we have approximately $c^{d}$ leaves. If we assume that the perturbation effect from each leaf is independent, and that $\epsilon^{k_d}$ has 0 mean, then the net mean perturbation from all the leaves will be 0. The variance will be

if we assume that the cross terms vanish in expectation.

(Aside: You might want to ask: why are we assuming that $\epsilon^{k_{d}}$ has mean zero, and that (say) the noise at each of the leaves are independent, so that the cross terms vanish? If we want to maximize the variance, then maybe choosing the $\epsilon^{k_{d}}$‘s to be correlated or have non-zero mean would give us a better bound. The problem is that we’re neglecting the effects of normalizing constants in this analysis: if we perturbed all the $\psi^{k_d}_{t}$ in the same direction (e.g. non-zero mean), our normalization conditions would cancel out our perturbations.)

We Therefore end up with the stability condition $c \lambda^{2} =1$. When $c \lambda^{2}>1$, a small perturbation will be magnified as we move up the tree, leading to the messages moving away from the factored fixed point after successive iterations of BP (the fixed point is unstable). If $c \lambda^{2}<1$, the effect of a small perturbation will vanish as we move up the tree, we expect the factored fixed point to be stable. If we restrict our attention to graphs of the form $c_{a,a}=c_{in}>c_{a,b}=c_{out}$ for $a \not = b$, and have all our groups with size $n_{a}=\frac{1}{q}$, then $T$ is known to have eigenvalues $\lambda_{1}=0$ with eigenvector $(1,...,1)$, and $\lambda_{2}=(c_{in}-c_{out})/qc$. The stability threshold then becomes $|c_{in}-c_{out}|=q\sqrt{c}$. This condition is known as the Almeida-Thouless local stability condition for spin glasses, and the Kesten-Stigum bound on reconstruction on trees. It is also observed empirically that BP and MCMC succeed above this threshold, and converge to factored fixed points below this threshold. The eigenvalues of $T$ are related to the belief propagation equations and the backtracking matrix. For more details, see [3]

### (3.4) Information Theoretic Results, and what we know algorithmically

We’ve just seen a threshold for when BP is able to solve the community detection problem. Specifically, when $c<\frac{1}{\lambda^{2}}$, BP doesn’t do better than chance. It’s natural to ask whether this is because BP is not powerful enough, or whether there really isn’t enough information in the random graph to recover the true labeling. For example, if $c_{out}$ is very close to $c_{in}$, it might be impossible to distinguish between group boundaries up to random fluctuations in the edges.

It turns out that for $q=2$, there is not enough information below the threshold $c<\frac{1}{\lambda^{2}}$ to find a labeling which is correlated with the true labeling [3]. However, it can be shown information-theoretically [1] that the threshold at which one can find a correlated labeling is $c_{c} = \Theta \left(\frac{\log(q)}{q \lambda^2} \right)$. In particular, when $q>11$, there exists exponential time algorithms which recover a correlated labeling below the Kesten-Stigum threshold. This is interesting, because it suggests an information-computational gap: we observe empirically that heuristic belief propagation seems to perform as well as any other inference algorithm at finding a correlated labeling for the stochastic block model. However, belief propagation fails at a “computational” threshold below the information theoretic threshold for this problem. We’ll talk more about these kinds of information-computation gaps in the coming weeks.

## References

[1] Jess Banks, Cristopher Moore, Joe Neeman, Praneeth Netrapalli,
Information-theoretic thresholds for community detection in sparse
networks. AJMLR: Workshop and Conference Proceedings vol 49:1–34, 2016.

[2] Aurelien Decelle, Florent Krzakala, Cristopher Moore, Lenka Zdeborová,
Asymptotic analysis of the stochastic block model for modular networks and its
algorithmic applications. 2013.

[3] Cristopher Moore, The Computer Science and Physics
of Community Detection: Landscapes, Phase Transitions, and Hardness. 2017.

[4] Jonathan Yedidia, William Freeman, Yair Weiss, Understanding Belief Propogation and its Generalizations