Skip to content

Peter Shor on Quantum Error Correction

December 7, 2018

[Guest post by Annie Wei who scribed Peter Shor’s lecture in our physics and computation seminar. See here for all the posts of this seminar. –Boaz]

On October 19, we were lucky enough to have Professor Peter Shor give a guest lecture about quantum error correcting codes. In this blog post, I (Annie Wei) will present a summary of this guest lecture, which builds up quantum error correcting codes starting from classical coding theory. We will start by reviewing an example from classical error correction to motivate the similarities and differences when compared against the quantum case, before moving into quantum error correction and quantum channels. Note that we do assume a very basic familiarity with quantum mechanics, such as that which might be found here or here.

1. Motivation
We are interested in quantum error correction, ultimately, because any real-world computing device needs to be able to tolerate noise. Theoretical work on quantum algorithms has shown that quantum computers have the potential to offer speedups for a variety of problems, but in practice we’d also like to be able to eventually build and operate real quantum computers. We need to be able to protect against any decoherence that occurs when a quantum computer interacts with the environment, and we need to be able to protect against the accumulation of small gate errors since quantum gates need to be unitary operators.

In error correction the idea is to protect against noise by encoding information in a way that is resistant to noise, usually by adding some redundancy to the message. The redundancy then ensures that enough information remains, even after noise corruption, so that decoding will allow us to recover our original message. This is what is done in classical error correction schemes.

Unfortunately, it’s not obvious that quantum error correction is possible. One obstacle is that errors are continuous, since a continuum of operations can be applied to a qubit, so a priori it might seem like identifying and correcting an error would require infinite resources. In a later section we show how this problem, that of identifying quantum errors, can be overcome. Another obstacle is the fact that, as we’ve stated, classical error correction works by adding redundancy to a message. This might seem impossible to perform in a quantum setting due to the No Cloning Theorem, which states the following:

Theorem (No Cloning Theorem): Performing the mapping

|\psi\rangle|0\rangle\mapsto|\psi\rangle|\psi\rangle

is not a permissible quantum operation.

Proof: We will use unitarity, which says that a quantum operation specified by a unitary matrix U must satisfy

\langle\phi|U^{\dagger}U|\psi\rangle = \langle\phi|\psi\rangle.

(This ensures that the normalization of the state |\psi\rangle is always preserved, i.e. that |\langle\psi|\psi\rangle|^2=1, which is equivalent to the conservation of probability.)

Now suppose that we can perform the operation

U(|\psi\rangle|0\rangle)=|\psi\rangle|\psi\rangle.

Then, letting

(\langle\phi|\langle 0|)(|\psi\rangle|0\rangle)=\alpha,

we note that by unitarity

(\langle\phi|\langle 0|)(|\psi\rangle |0\rangle)=\alpha(\langle\phi|\langle 0|)U^{\dagger}U(|\psi\rangle|0\rangle).

But

(\langle\phi|\langle 0|)U^{\dagger}U(|\psi\rangle|0\rangle)=(\langle\phi|\langle\phi|)(|\psi\rangle|\psi\rangle)=\alpha^2,

and in general \alpha\neq\alpha^2, so we have a contradiction.

How do we get around this apparent contradiction? To do so, note that the no-cloning theorem only prohibits the copying of non-orthogonal quantum states. With orthogonal quantum states, either \alpha=0 or \alpha=1, so we don’t run into a contradiction. This also explains why it is possible to copy classical information, which we can think of as orthogonal quantum states.

So how do we actually protect quantum information from noise? In the next section we first review classical error correction, as ideas from the classical setting re-appear in the quantum setting, and then we move into quantum error correction.

2. Review of Classical Error Correction
First we start by reviewing classical error correction. In classical error correction we generally have a message that we encode, send through a noisy channel, and then decode, in the following schematic process:

fig1.png
In an effective error correction scheme, the decoding process should allow us to identify any errors that occurred when our message passed through the noisy channel, which then tells us how to correct the errors. The formalism that allows us to do so is the following: we first define a r\times n encoding matrix G that takes a message m of length r and converts it to a codeword c of length n, where the codewords make up the span of the rows of G. An example of such a matrix is

G=\left(\begin{array}{ccccccc}0&0&0&1&1&1&1\\1&0&1&0&1&0&1\\0&1&1&0&0&1&1\\1&1&1&1&1&1&1\end{array}\right),

corresponding to the 7-bit Hamming codes, which encodes a 4-bit message as a 7-bit codeword. Note that this code has distance 3 since each of the rows in G differ in at most 3 spots, which means that it can correct at most 1 error (the number of errors that can be corrected is given by half the code distance).

We also define the parity check matrix H to be the matrix that satisfies

GH^T=0.

For example, to define H corresponding to the G we defined for the 7-bit Hamming code, we could take

H=\left(\begin{array}{ccccccc}0&0&0&1&1&1&1\\1&0&1&0&1&0&1\\0&1&1&0&0&1&1\end{array}\right).

Then we may decode x, a 7-bit string, in the following manner. Say that

x=c+e,

where c is a codeword and e is the 1-bit error we wish to correct. Then

xH^T=(c+e)H^T=eH^T.

Here eH^T uniquely identifies the error and is known as the error syndrome. Having it tells us how to correct the error. Thus our error correction scheme consists of the following steps:

  1. Encode a r-bit message m by multiplying by G to obtain codeword mG=c.
  2. Send the message through channel generating error e, resulting in the string x=c+e.
  3. Multiply by H^T to obtain the error syndrome eH^T.
  4. Correct error e to obtain c.

Having concluded our quick review of classical error correction, we now look at the theory of quantum error correction.

3. Quantum Error Correction
In this section we introduce quantum error correction by directly constructing the 9-qubit code and the 7-qubit code. Then we introduce the more general formalism of CSS codes, which encompasses both the 9-qubit and 7-qubit codes, before introducing the stabilizer formalism, which tells us how we might construct a CSS code.

3.1. Preliminaries
First we introduce some tools that we will need in this section.

3.1.1. Pauli Matrices
The Pauli matrices are a set of 2-by-2 matrices that form an orthogonal basis for the 2-by-2 Hermitian matrices, where a Hermitian matrix H satisfies H^{\dagger}=H. Note that we can form larger Hilbert spaces by taking the tensor product of smaller Hilbert spaces, so in particular taking the k-fold tensor product of Pauli matrices gives us a basis for the 2^k-by-2^k Hermitian matrices. Note also that generally, in quantum mechanics, we are interested in Hermitian matrices because they can be used to represent measurements, and because unitary matrices, which can be used to represent probability-preserving quantum operations, can be obtained by exponentiating Hermitian matrices (that is, every unitary matrix U can be written in the form U=e^{iH} for H a Hermitian matrix).

The Pauli matrices are given by

\sigma_x\equiv X\equiv\left(\begin{array}{cc}0&1\\1&0\end{array}\right)

\sigma_y\equiv Y\equiv\left(\begin{array}{cc}0&-i\\i&0\end{array}\right)

\sigma_z\equiv Z\equiv\left(\begin{array}{cc}1&0\\0&-1\end{array}\right).

By direction computation we can show that they satisfy the relations

X^2=Y^2=Z^2=I

ZX=-XZ=iY

YZ=-ZY=iX

XY=-YX=iZ.

3.1.2. Von Neumann Measurements
We will also need the concept of a Von Neumann measurement. A Von Neumann measurement is given by a set of projection matrices \{E_1, E_2, ..., E_k\} satisfying

\sum_{i=1}^k E_i=I.

That is, the projectors partition a Hilbert space {\cal H} into k subspaces. Then, given any state |\psi\rangle\in{\cal H}, when we perform a measurement using these projectors we obtain the measurement result corresponding to E_i, with corresponding post-measurement state

\frac{E_i|\psi\rangle}{||E_i|\psi\rangle||},

with probability \langle\psi|E_i|\psi\rangle.

3.2. First Attempt at a Quantum Code
Now we make a first attempt at coming up with a quantum code, noting that our efforts and adjustments will ultimately culminate in the 9-qubit code. Starting with the simplest possible idea, we take inspiration from the classical repetition code, which maps

0\mapsto 000

1\mapsto 111

and decodes by taking the majority of the 3 bits. We consider the quantum analog of this, which maps

|0\rangle\mapsto|000\rangle

|1\rangle\mapsto|111\rangle.

We will take our quantum errors to be the Pauli matrices X, Y, and Z. Then the encoding process, where our message is a quantum state \alpha|0\rangle+\beta|1\rangle, looks like the following:

\alpha|0\rangle+\beta|1\rangle\mapsto\alpha|000\rangle+\beta|111\rangle.

We claim that this code can correct bit errors but not phase errors, which makes it equivalent to the original classical repetition code for error correction. To see this, note that applying an X_1 error results in the mapping

\alpha|0\rangle+\beta|1\rangle\mapsto\alpha|100\rangle+\beta|011\rangle.

This can be detected by the von Neumann measurement which projects onto the subspaces

\{|000\rangle,|111\rangle\}

\{|011\rangle,|100\rangle\}

\{|010\rangle,|101\rangle\}

\{|110\rangle,|001\rangle\}

We could then apply \sigma_x to the first qubit to correct the error. To see that this doesn’t work for phase errors, note that applying a Z_2 error results in the mapping

\alpha|0\rangle+\beta|1\rangle\mapsto\alpha|000\rangle-\beta|111\rangle.

This is a valid encoding of the state \alpha|0\rangle-\beta|1\rangle, so the error is undetectable.

What adjustments can we make so that we’re able to also correct Z errors? For this we will introduce the Hadamard matrix, defined as

H=\frac{1}{\sqrt{2}}\left(\begin{array}{cc}1&1\\1&-1\end{array}\right)

and satisfying

HX=ZH.

Note in particular that, because HX=ZH, the Hadamard matrix turns bit errors into phase errors, and vice versa. This allows us to come up with a code that corrects phase errors by mapping

H|0\rangle\mapsto H^{\otimes 3}|000\rangle

H|1\rangle\mapsto H^{\otimes 3}|111\rangle

or equivalently,

|0\rangle\mapsto\frac{1}{2}(|000\rangle+|011\rangle+|101\rangle+|110\rangle)

|1\rangle\mapsto\frac{1}{2}(|111\rangle+|100\rangle+|010\rangle+|001\rangle)

Now we can concatenate our bit flip code with our phase flip code to take care of both errors. This gives us the 9-qubit code, also known as the Shor code.

3.3. 9-Qubit Code
In the previous section, we went through the process of constructing the 9-qubit Shor code by considering how to correct both bit flip errors and phase flip errors. Explicitly, the 9-qubit Shor code is given by the following mapping:

|0\rangle\mapsto|0\rangle_L\equiv\frac{1}{2}(|000000000\rangle+|000111111\rangle+|111000111\rangle+|111111000\rangle)

|1\rangle\mapsto|1\rangle_L\equiv\frac{1}{2}(|111111111\rangle+|111000000\rangle+|000111000\rangle+|000000111\rangle).

Here |0\rangle_L and |1\rangle_L are known as logical qubits; note that our 9-qubit code essentially represents 1 logical qubit with 9 physical qubits.

Note that by construction this code can correct \sigma_x, \sigma_y, and \sigma_z errors on any one qubit (we’ve already shown by construction that it can correct \sigma_x and \sigma_z errors, and \sigma_y can be obtained as a product of the two). This is also equivalent to the statement that the states \sigma_x^{(i)}|0\rangle_L, \sigma_y^{(i)}|0\rangle_L, \sigma_z^{(i)}|0\rangle_L, \sigma_x^{(i)}|1\rangle_L, \sigma_y^{(i)}|1\rangle_L, and \sigma_z^{(i)}|1\rangle_L are all orthogonal.

Now we have a 1-error quantum code. We claim that such a code can in fact correct any error operation, and that this is a property of all 1-error quantum codes:

Theorem: Given any possible unitary, measurement, or quantum operation on a one-error quantum code, the code can correct it.

Proof: \{I, \sigma_x, \sigma_y, \sigma_z\}^{\otimes t} forms a basis for the 2\times 2 matrices. For errors on t qubits, the code can correct these errors if it can individually correct errors \sigma_{w_i}^{(i)} for w_i\in\{x,y,z\}, i\in\{1,...,t\}, since \{I, \sigma_x, \sigma_y, \sigma_z\}^{\otimes t} forms a basis for \mathbb{C}^{2t}.

Example: Phase Error Next we’ll do an example where we consider how we might correct an arbitrary phase error applied to the |0\rangle_L state. Since quantum states are equivalent up to phases, the error operator is given by

\left(\begin{array}{cc}1&0\\0&e^{i\theta}\end{array}\right)\equiv\left(\begin{array}{cc}e^{-i\theta/2}&0\\0&e^{i\theta/2}\end{array}\right).

Note that this can be rewritten in the \{I, \sigma_x, \sigma_y, \sigma_z\}^{\otimes t} basis as

\left(\begin{array}{cc}e^{-i\theta/2}&0\\0&e^{i\theta/2}\end{array}\right)=\cos\frac{\theta}{2}I-i\sin\frac{\theta}{2}\sigma_z.

Now, applying this error to |0\rangle_L, we get

\left(\begin{array}{cc}e^{-i\theta/2}&0\\0&e^{i\theta/2}\end{array}\right)\frac{1}{2}(|000\rangle+|011\rangle+|101\rangle+|110\rangle)=\frac{1}{2}\cos\frac{\theta}{2}(|000\rangle+|011\rangle+|101\rangle+|110\rangle)-\frac{i}{2}\sin\frac{\theta}{2}(|000\rangle-|011\rangle+|101\rangle-|110\rangle)=\cos\frac{\theta}{2}|0\rangle_L-i\sin\frac{\theta}{2}\sigma_z|0\rangle_L.

After performing a projective measurement, we get state |0\rangle_L with probability \cos^2\frac{\theta}{2}, in which case we do not need to perform any error correction, and we get \sigma_z|0\rangle_L with probability \sin^2\frac{\theta}{2}, in which case we would know to correct the \sigma_z error.

3.4. 7-Qubit Code
Now that we’ve constructed the 9-qubit code and shown that quantum error correction is possible, we might wonder whether it’s possible to do better. For example, we’d like a code that requires fewer qubits. We’ll construct a 7-qubit code that corrects 1 error, defining a mapping to |0\rangle_L and |1\rangle_L by taking inspiration from a classical code, as we did for the 9-qubit case.

For this we will need to go back to the example we used to illustration classical error correction. Recall that in classical error correction, we have an encoding matrix G and a parity check matrix H satisfying GH^T=0, with \text{rank}(G)+\text{rank}(H)=n. We encode a message m to obtain codeword mG=c. After error e is applied, this becomes c+e, from which we can extract the error syndrome (c+e)H^T=eH^T. We can then apply the appropriate correction to extract c from c+e.

Now we will use the encoding matrix from our classical error correction example, and we will divide our codewords into two sets, C_1 and C_1', given by

C_1=\left\{\begin{array}{ccccccc}0&0&0&1&1&1&1\\1&0&1&0&1&0&1\\0&1&1&0&0&1&1\end{array}\right.

and

C_1'=C_1+1111111.

Similar to how we approached the 9-qubit case, we will start by defining our code as follows:

|0\rangle_L\equiv\frac{1}{\sqrt{8}}\sum_{v\in C_1}|v\rangle

|1\rangle_L\equiv\frac{1}{\sqrt{8}}\sum_{w\in C_1'}|w\rangle.

Note that this corrects bit flip errors by construction. How can we ensure that we are also able to correct phase errors? For this we again turn to the Hadamard matrix, which allows us to toggle between bit and phase errors. We claim that

H^{\otimes 7}|0\rangle_L=\frac{1}{\sqrt{2}}(|0\rangle_L+|1\rangle_L)

H^{\otimes 7}|1\rangle_L=\frac{1}{\sqrt{2}}(|0\rangle_L-|1\rangle_L).

Proof: We will show that

H^{\otimes 7}|0\rangle_L=\frac{1}{\sqrt{2}}(|0\rangle_L+|1\rangle_L),

noting that the argument for |1\rangle_L is similar. First we will need the fact that

H^{\otimes 7}|v\rangle=\frac{1}{2^{7/2}}\sum_{w\in\{0,1\}^7}(-1)^{w\cdot v}|w\rangle.

To see that this fact is true, note that

H=\frac{1}{\sqrt{2}}(|0\rangle\langle 0|+|0\rangle\langle 1|+|1\rangle\langle 0|-|1\rangle\langle 1|)

and that w\cdot v is equal to the number of bits in which w and v are both 1. Now we can start by directly calculating

H^{\otimes 7}|0\rangle_L=\frac{1}{\sqrt{8}}\frac{1}{\sqrt{128}}\sum_{v\in C_1}\sum_{w\in\{0,1\}^7}(-1)^{v\cdot w}|w\rangle.

Note that for x and y two codewords, assuming that w\cdot y=1, we must have that x\cdot w=0 iff (x+y)\cdot w=1. Thus we can break the codespace up into an equal number of codewords x satisfying x\cdot w=0 and x\cdot w=1. This means that we must have that the sum \sum_{v\in C_1}\sum_{w\in\{0,1\}^7}(-1)^{w\cdot v}|w\rangle=0 unless we have w\perp C_1. But those w that satisfy w\perp C_1 are exactly all the codewords by definition, so we must have that

H^{\otimes 7}|0\rangle_L=\frac{1}{\sqrt{2}}|0\rangle_L+\frac{1}{\sqrt{2}}|1\rangle_L

as the sum in |0\rangle_L+|1\rangle_L runs equally over all codewords.

Thus we have constructed a 7-qubit quantum code that corrects 1 error, and moreover we see that for both the 9-qubit and 7-qubit codes, both of which are 1-error quantum codes, the fact that they can correct 1-error comes directly from the fact that the original classical codes we used to construct them can themselves correct 1 error. This suggests that we should be able to come up with a more general procedure for constructing quantum codes from classical codes.

3.5. CSS Codes
CSS (Calderbank-Shor-Steane) codes generalize the process by which we constructed the 9-qubit and 7-qubit codes, and they give us a general framework for constructing quantum codes from classical codes. In a CSS code, we require groups C_1, C_2 satisfying

 C_1\subseteq C_2

C_2^{\perp}\subseteq C_1^{\perp}

Then we can define codewords to correspond to cosets of C_1 in C_2, so that the number of codewords is equal to 2^{\text{dim}(C_2)-\text{dim}(C_1)}. Thus by this definition we can say that codewords w_1, w_2\in C_2 are in the same coset if w_1-w_2\in C_1. Explicitly, the codeword for coset w is given by the state

\frac{1}{|C_1|^{1/2}}\sum_{x\in C_1}|x+w\rangle,

and under the Hadamard transformation applied to each qubit this state is in turn mapped to the state

\frac{1}{|C_1^{\perp}|^{1/2}}\sum_{x\in C_1^{\perp}}|x+w\rangle.

That is to say, the Hadamard “dualizes” our original code, toggling bit errors to phase errors and vice versa. (This can be seen by direct calculation, as in the case of the 7-qubit code, where we used the fact that \sum_{v\in C_1}(-1)^{v\cdot w}=0 for w\not\in C_1^{\perp}.)

Note also that this code can correct a number of bit errors equal to the minimum weight of \{v\in C_2-C_1\}.

With the CSS construction we have thus reduced the problem of finding a quantum error correcting code to the problem of finding appropriate C_1, C_2. Note that the special case of C_2^{\perp}=C_1=C corresponds to weakly self-dual codes, which are well studied classically. Doubly even, weakly self-dual codes additionally have the requirement that all codewords have Hamming weights that are multiples of 4; they satisfy the requirement

1^n\subseteq C^{\perp}\subseteq C\subseteq\mathbb{Z}_2^n

and are also well studied classically.

3.6. Gilbert-Varshamov Bound
In the previous section we introduced CSS codes and demonstrated that the problem of constructing a quantum code could be reduced to the problem of finding two groups C_1, C_2 satisfying

C_1\subseteq C_2

C_2^{\perp}\subseteq C_1^{\perp}.

The next natural question is to ask whether such groups can in fact be found.

The Gilbert-Varshamov bound answers this question in the affirmative, ensuring that there do exist good CSS codes (the bound applies to both quantum and classical codes). It can be stated in the following way:

Theorem (Gilbert-Varshamov Bound): There exist CSS codes with rate R=(number of encoded bits)/(length of code) given by

R\geq 1-2H_2(d/n),

where d is the minimum distance of the code, d/2 is the number of errors it can correct, and H_2(x) is the Shannon entropy, defined as

H_2(x)=-x\log_2x-(1-x)\log_2(1-x).

Proof: Note that we can always take a code, apply a random linear transformation to it, and get another code. Thus each vector is equally likely to appear in a random code. Given this fact, we can estimate the probability that a code of dimension k contains a word of weight \leq d using the union bound:

P(code of dimension k has word of weight \leq d)\leq(number of words)\times P(word has weight \leq d)=2^k\cdot\frac{\sum_{i=0}^d \binom{n}{i}}{2^n}\approx \frac{2^k\cdot 2^{nH(d/n)}}{2^n}

For this to be a valid probability we need to have

(k/n)+H(d/n)< 1.

We can calculate rate by noting that for a CSS code, given by C_1\subseteq C_2, C_2^{\perp}\subseteq C_1^{\perp}, with \text{dim}(C_1)=n-k, \text{dim}(C_2)=k, the expression for rate is given by

R=\frac{n-2k}{n}.

Combining this with the bound we obtained by considering probabilities, we get that

R\geq 1-2H(d/n).

Thus there exist good CSS codes.

3.7. Stabilizer Codes
Having discussed and constructed some examples of CSS codes, we will now discuss the stabilizer formalism. Note that this formalism allows us to construct codes without having to work directly with the states representing |0\rangle_L and |1\rangle_L, as this can quickly get unwieldy. Instead, we will work with stabilizers, operators that leave these states invariant.

To see how working directly with states can get unwieldy, we can consider the 5-qubit code. We can define it the way we defined the 9-qubit and 7-qubit codes, by directly defining the basis vectors |0\rangle_L and |1\rangle_L,

|0\rangle_L\equiv\frac{1}{4}(|00000\rangle-|01100\rangle+|00101\rangle+|01010\rangle-|01111\rangle+(symmetric under cyclic permutations)),

with |1\rangle_L defined similarly. But we can also define this code more succinctly using the stabilizer formalism. To do so, we start by choosing a commutative subgroup of the Pauli group, with generators g_i satisfying

g_i^2=I

g_ig_j=g_jg_i

For example, for the 5-qubit code, the particular choice of generators we would need is given by

g_1\equiv IXZZX

g_2\equiv XIXZZ

g_3\equiv ZXIXZ

g_4\equiv ZZXIX.

Now we consider states \{|\psi\rangle\} that are stabilized by the \{g_i\}. That is, they satisfy

g_i|\psi\rangle=|\psi\rangle.

Note that the eigenvalues of \sigma_x, \sigma_y, and \sigma_z are \pm 1, so in the case of the 5-qubit code, there exists a 2^5/2=16-dimensional space of \{|\psi\rangle\} satisfying g_1|\psi\rangle=|\psi\rangle. Recalling that two commuting matrices are simultaneously diagonalizable, there exists a 16/2=8-dimensional space of \{|\psi\rangle\} satisfying g_1|\psi\rangle=g_2|\psi\rangle=|\psi\rangle, and so on, where we cut the dimension of the subspace in half each time we add a generator. Finally, there exists a 2^5/2^4=2-dimensional space of \{|\psi\rangle\} satisfying g_i|\psi\rangle=|\psi\rangle for all i=1,...,4. This 2-dimensional space is exactly the subspace spanned by |0\rangle_L and |1\rangle_L. Thus fixing the stabilizers is enough to give us our code.

Next we will consider all elements in the Pauli group that commute with all elements in our stabilizer group G=\{g_1,...,g_4\}. As we shall see, this will give us our logical operators, where a logical operator performs an operation on a logical qubit (for example, the logical X operator, X_L, would act on the logical qubit |0\rangle_L by mapping X_L|0\rangle_L=|1\rangle_L, and so on). In the 5-qubit case we end up with a 6-dimensional nonabelian group \tilde{G}=\langle g_1,...,g_4, h_1, h_2\rangle by adding the following two elements to those elements that are in G:

h_1=XXXXX

h_2=ZZZZZ

These will be our logical operators

X_L\equiv h_1

Z_L\equiv h_2

so that

X_L|0\rangle_L=|1\rangle_L

X_L|1\rangle_L=|0\rangle_L

Z_L|1\rangle_L=-|1\rangle_L

Z_L|0\rangle_L=|0\rangle_L.

Note that this code has distance 3 and corrects 1 error because 3 is the minimum Hamming weight in the group \tilde{G}. (To see this, note that XXXXX\cdot IXZZX=XIYYI has Hamming weight 3.)

Why is Hamming weight 2 not enough to correct one error? If we had, for example, XZIII\in\tilde{G}, then we would have

\sigma_x^{(1)}|\psi_1\rangle=\sigma_z^{(2)}|\psi_2\rangle

for |\psi_1\rangle, |\psi_2\rangle both in the code, which means that we wouldn’t be able to distinguish an X_1 error from a Z_2 error.

Note that, in general, when x\in\tilde{G}, x|\psi\rangle will be in the code, so elements of \tilde{G} map codewords to codewords. We can prove this fact by noting that

xg_i|\psi\rangle=x|\psi\rangle=g_ix|\psi\rangle.

Note also that in the examples we’ve been dealing with so far, where we have a commuting subgroup of the Pauli group, our codes correspond to classical, additive, weakly self-dual codes over GF(4). Here GF(4)=\{0,1,\omega,\bar{\omega}\} (with group elements \{\omega, \bar{\omega},1\} corresponding to the third roots of unity) is the finite field on 4 elements, and multiplying Pauli matrices corresponds to group addition. Specifically,

X\equiv 1

Y\equiv \omega

Z\equiv \bar{\omega}

I\equiv 0

satisfying

H\omega=\bar{\omega}

2X=2Y=2Z=0.

We have now concluded our discussion of quantum error-correcting codes. In the next section we will shift gears and look at quantum channels and channel capacities.

4. Quantum Channels
In this final section we will look at quantum channels and channel capacities.

4.1. Definition and Examples

4.1.1. Definition
We know that we want to define a quantum channel to take a quantum state as input. What should the output be? As a first attempt we might imagine having the output be a probability distribution \{p_i\} over states \{|\psi_i\rangle\}. It turns out that for a more succinct description, we can have both the input and output be a density matrix.

Recall that a density matrix takes the form

\rho=\sum_i p_i|\psi_i\rangle\langle\psi_i|

representing a probability distribution over pure states |\psi_i\rangle. \rho must also be Hermitian, and it must satisfy \text{Tr}(\rho)=1 (equivalently, we must have \sum_i p_i=1).

Now we may define a quantum channel as the map \eta that takes

\eta:\rho\mapsto\sum_i E_i\rho E_i^{\dagger},

where

\sum_i E_i^{\dagger}E_i=I.

To see that the output is in fact a density matrix, note that the output expression is clearly Hermitian and can be shown to have unit trace using the cyclical property of traces. Note also that the decomposition into \{E_i\} need not be unique.

4.1.2. Example Quantum Channels
Next we give a few examples of quantum channels. The dephasing channel is given by the map

\rho\mapsto(1-p)\rho+p\sigma_z\rho\sigma_z.

It maps

\left(\begin{array}{cc}\alpha&\beta\\\gamma&\delta\end{array}\right)\mapsto \left(\begin{array}{cc}\alpha&(1-2p)\beta\\(1-2p)\gamma&\delta\end{array}\right)

\left(\begin{array}{cc}\alpha&\beta\\\gamma&\delta\end{array}\right)\mapsto \left(\begin{array}{cc}\alpha&(1-2p)\beta\\(1-2p)\gamma&\delta\end{array}\right),

so it multiplies off-diagonal elements by a factor that is less than 1. Note that when p=1/2, it maps

\alpha|0\rangle+\beta|1\rangle\mapsto|\alpha|^2|0\rangle\langle 0|+|\beta|^2|1\rangle\langle 1|,

which means that it turns superpositions into classical mixtures (hence the name “dephasing”).

Another example is the amplitude damping channel, which models an excited state decaying to a ground state. It is given by

E_1=\left(\begin{array}{cc}0&\sqrt{p}\\0&0\end{array}\right)

E_2=\left(\begin{array}{cc}1&0\\0&\sqrt{1-p}\end{array}\right)

Here we let the vector |0\rangle=(1, 0) denote the ground state, and we let the vector |1\rangle=(0, 1) denote the excited state. Thus we can see that the channel maps the ground state to itself, |0\rangle\mapsto|0\rangle, while the excited state |1\rangle gets mapped to |0\rangle with probability p and stays at |1\rangle with probability 1-p.

4.2 Quantum Channel Capacities
Now we consider the capacity of quantum channels, where the capacity quantifies how much information can make it through the channel. We consider classical channels, classical information sent over quantum channels, and quantum information sent over quantum channels. First we start off with the example of the quantum erasure channel to demonstrate that quantum channels behave differently from classical channels, and then we give the actual expressions for the channel capacities before revisiting the example of the quantum erasure channel.

4.2.1 Example: Quantum Erasure Channel
First we start with the example of the quantum erasure channel, which given a state |\psi\rangle replaces it by an orthogonal state |E\rangle with probability p and returns the same state |\psi\rangle with probability 1-p. We claim that the erasure channel can’t transmit quantum information when p\geq 0.5, behavior that is markedly different from that of classical information. That is to say, for p\geq 0.5, there is no way to encode quantum information to send it through the channel and then decode it so the receiver gets a state close to the state that was sent.

To see why this is the case, assume the contrary, that there do exist encoding and decoding protocols that send quantum information through quantum erasure channels with erasure rate p\geq 0.5. We will show that this violates the no-cloning theorem. Now, suppose that A does the following: For each qubit in the encoded state, she tosses a fair coin. If the coin lands heads, she send C the state |E\rangle and sends B the channel input with probability 2p-1 and the erasure state |E\rangle otherwise. If the coin lands tails, she sends B the state |E\rangle and sends C the channel input with probability 2p-1 and the erasure state otherwise. This implements a p\geq 0.5 channel to both receivers B and C, which means that A can use this channel to transmit an encoding of |\psi\rangle to both receivers, which in turn means that both receivers will be able to decode |\psi\rangle. But this means that A has just used this channel to clone the quantum state |\psi\rangle, resulting in a contradiction. Thus no quantum information can be transmitted through a channel with p\geq 0.5. Note, however, that we can send classical information over this channel, so the behavior of quantum and classical information is markedly different.

It turns out that the rate of quantum information sent over the erasure channel, as a function of p, is given by the following graph:

fig4.png
while the rate of classical information sent over the erasure channel, as a function of p, is given by the following graph:

fig5.png

Next we will formally state the definition of channel capacity, and then we will return to the quantum erasure channel example and derive the curve that plots rate against p.

4.2.2. Definition of Channel Capacities
Channel capacity is defined as the maximum rate at which information can be communicated over many independent uses of a channel from sender to receiver. Here we list the expressions for channel capacity for classical channels, classical information over a quantum channel, and quantum information over a quantum channel.

Classical Channel Capacity For a classical channel this expression is just the maximum mutual information over all input-output pairs,

\max_X H(\eta(X))-H(\eta(X)|X),

where X is the input information and \eta(X) is the output information after having gone through the channel \eta.

Classical Information Over a Quantum Channel The capacity for classical information sent over a quantum channel is given by

\max_{\{p_i,\rho_i\}} H(\eta(\rho))-\sum_i p_iH(\eta(\rho_i))

up to regularization, where \rho=\sum_i p_i\rho_i is the average input state, and \eta is the channel.

Note that we would regularize this by using n copies of the state (that is to say, we want the output of \eta^{\otimes n}) and then dividing by n, to get an expression like the following for the regularized capacity of classical information over a quantum channel:

\lim_{n\rightarrow\infty}\max_{\{p_i,\rho_i\}} [H(\eta(\rho)^{\otimes n})-\sum_i p_i H(\eta(\rho_i)^{\otimes n})]/n.

Quantum Information The capacity for quantum information is given by the expression

\max_\rho H(\eta(\rho))-H((\eta\otimes I)\Phi_\rho),

also up to regularization. Here \eta(\rho) is the output when channel \rho acts on input state \rho, while \Phi_\rho is the purification of \rho (that is, it is a pure state containing \rho that we can obtain by enlarging the Hilbert space). The regularized capacity for quantum information looks like the following:

 \lim_{n\rightarrow\infty}\max_\rho [H(\eta(\rho)^{\otimes n})-H((\eta\otimes I)(\Phi_\rho)^{\otimes n})]/n.

Now that we have the exact expression that allows us to calculate the quantum channel capacity, we will revisit our example of the quantum erasure channel and reproduce the plot of channel rate vs erasure probability.

4.2.3. Example Revisited: Quantum Erasure Channel
Recall that, up to regularization, the capacity of a quantum channel is given by

\max_\rho H(\eta(\rho))-H((\eta\otimes I)\Phi_\rho).

We will directly calculate this expression for the example of the quantum erasure channel. Let the input \rho be given by the density matrix for the completely mixed state,

\rho=\left(\begin{array}{cc}\frac{1}{2}&0\\0&\frac{1}{2}\end{array}\right),

so that the purification of \rho is given by the state

\frac{1}{\sqrt{2}}(|00\rangle+|11\rangle).

Recall that the erasure channel replaces our state with |E\rangle with probability p, while with probability 1-p it leaves the input state unchanged. Then, in the basis \{|0\rangle, |1\rangle, |E\rangle\}, the matrix corresponding to \eta(\rho) is given by

\eta(\rho)=\left(\begin{array}{ccc}\frac{1-p}{2}&0&0\\0&\frac{1-p}{2}&0\\0&0&p\end{array}\right)

while in the basis \{|00\rangle, |01\rangle, |10\rangle, |11\rangle, |0E\rangle, |1E\rangle\}, the matrix corresponding to (\eta\otimes I)\Phi_\rho is given by

(\eta\otimes I)\Phi_\rho=\left(\begin{array}{cccccc}\frac{1-p}{2}&0&0&\frac{1-p}{2}&0&0\\0&0&0&0&0&0\\0&0&0&0&0&0\\\frac{1-p}{2}&0&0&\frac{1-p}{2}&0&0\\0&0&0&0&\frac{p}{2}&0\\0&0&0&0&0&\frac{p}{2}\end{array}\right)

We can directly calculate that

H(\eta(\rho))=H_2(p)+(1-p)

H((\eta\otimes I)\Phi_\rho)=H_2(p)+p.

Then, subtracting the two entropies, we can calculate the rate to be

R=1-2p,

which corresponds exactly to the line we saw on the diagram that plotted rate as a function of p for the quantum erasure channel.

References

  1. Bennett, C. H., DiVencenzo, D. P., and Smolin, J. A. Capacities of quantum erasure channels. Phys. Rev. Lett., 78:3217-3220 (1997). quant-ph/9701015.
  2. Bennett, C. H., DiVencenzo, D. P., Smolin, J. A., and Wootters, W. K. Mixed state entanglement and quantum error correction. Phys. Rev. A, 54:3824 (1996). quant-ph/9604024.
  3. Calderbank, A. R. and Shor, P. W. Good quantum error-correcting codes exist. Phys. Rev. A, 54:1098 (1996). quant-ph/9512032.
  4. Devetak, I. The Private Classical Capacity and Quantum Capacity of a Quantum Channel. IEEE Trans. Inf. Theor., 51:44-45 (2005). quant-ph/0304127
  5. Devetak, I. and Winter, A. Classical data compression with quantum side information. Phys. Rev. A, 68(4):042301 (2003).
  6. Gottesman, D. Class of quantum error-correcting codes saturating the quantum Hamming bound. Phys. Rev. A, 54:1862 (1996).
  7. Laflamme, R., Miquel, C., Paz, J.-P., and Zurek, W. H. Perfect quantum error correction code. Phys. Rev. Lett., 77:198 (1996). quant-ph/9602019.
  8. Lloyd, S. Capacity of the noisy quantum channel. Phys. Rev. A., 55:3 (1997). quant-ph/9604015.
  9. Nielsen, M. A. and Chuang, I. L. Quantum Computation and Quantum Information., Cambridge University Press, New York (2011).
  10. Shor, P. W. Scheme for reducing decoherence in quantum computer memory. Phys. Rev. A., 52:2493 (1995).
  11. Shor, P. W. The quantum channel capacity and coherent information. MSRI Workshop on Quantum Computation (2002).
  12. Steane, A. M. Error correcting codes in quantum theory. Phys. Rev. Lett., 77:793 (1996).
  13. Steane, A. M. Multiple particle interference and quantum error correction. Proc. R. Soc. London A, 452:2551-76 (1996).

Highlights beyond EC: Call for nominations

November 21, 2018

[Guest post by Moshe Babaioff –Boaz]

“Highlights Beyond EC” Session at EC 2019: Call for Nominations

Committee: Mohammad Akbarpour, Moshe Babaioff, Shengwu Li and Ariel Procaccia

Following a new tradition started last year, the 2019 ACM Conference on Economics and Computation (EC’19) will host a special session highlighting some of the best work in economics and computation that appears in conferences and journals other than EC. The intention of this session is to expose EC attendees to related work just beyond the boundary of their current awareness.

We seek nominations for papers on Economics and Computation that have made breakthrough advances, opened up new questions or areas, made unexpected connections, or had significant impact on practice or other sciences. Examples of conferences and journals that publish papers relevant for the special session include STOC/FOCS/SODA/ITCS, AAAI/IJCAI/AAMAS, NIPS/ICML/COLT, WWW/KDD, AER/Econometrica/JPE/QJE/RESTUD/TE/AEJ Micro/JET/GEB, and Math of OR/Management Science/Operations Research. Please email nominations toHighlightsBeyondEC@gmail.com. Anyone is welcome to contact us, but we especially invite members of PCs or editorial boards in various venues to send us suggestions. Nominations should include:

  1.  Name of paper and authors.
  2. Publication venue or online working version. Preference will be given to papers that have appeared in a related conference or journal within the past two years, or have a working version circulated within the past two years.
  3. Short (2-3 paragraph) explanation of the paper and its importance.
  4. (Optional) Names of 1-3 knowledgeable experts on the area of the paper.

Note that at least one of the authors of a selected paper will be required to present their paper at EC 2019 and so should be available to travel to the conference, which is taking place in Phoenix, AZ on June 25-27, 2019.

To ensure maximum consideration, please send all nominations by December 15, 2018.

HALG 2019 Call for nominations

November 16, 2018

[Guest post by Piotr Sankowski –Boaz]

Call for Nominations – 4th Highlights of Algorithms conference (HALG 2019)

Copenhagen, June 14-16, 2019

http://www.halgdiku.dk/

The HALG 2019 conference seeks high-quality nominations for invited talks that will highlight recent advances in algorithmic research. Similarly to previous years, there are two categories of invited talks:

A. survey (60 minutes): a survey of an algorithmic topic that has seen exciting developments in last couple of years.

B. paper (30 minutes): a significant algorithmic result appearing in a paper in 2018 or later.

To nominate, please email  halg2019.nominations@gmail.com  the following information:

1. Basic details: speaker name + topic (for survey talk) or paper’s title, authors, conference/arxiv + preferable speaker (for paper talk).

2. Brief justification: Focus on the benefits to the audience, e.g., quality of results, importance/relevance of topic, clarity of talk, speaker’s presentation skills.

All nominations will be reviewed by the Program Committee (PC) to select speakers that will be invited to the conference.

Nominations deadline: December 9, 2018 (for full consideration).

Please keep in mind that the conference does not provide financial support for the speakers.

Best regards,
Piotr Sankowski
HALG 2019 PC Chair

Where’s that paper?

November 12, 2018

[Guest post by Eylon Yogev about a Chrome extension he wrote, of which I am a happy user –Boaz]

Hi fellow researchers,

I’m writing to share a little tool that I have developed with the ambitious goal of boosting research productivity. The tool is a Chrome extension named “Where’s that paper?”. Before I tell you more about what it does, let me touch upon the largest obstacle of any new tool, the learning curve. Rest assured that “Where’s that paper?” requires  zero training – just install it and continue working as usual – it will guide you without further effort on your end.

After building much suspense, I will elaborate. If you’re like me, you find yourself frequently searching the web for papers that you have browsed / opened in the past on your computer – basically paper browsing history. Whether it is in the process of writing a paper or searching for existing ideas, you may search for authors of titles of papers that you recall know to have once glimpsed at. At some point, the paper has either been downloaded or added  to the favorites bar. In any case, this process is manual and takes up much time (often including frustration). As such, I thought we could all use some code to automate it.

Functionality.

The extension is very simple: it identifies when you are reading a scientific paper (according to the domain) and then automatically adds this paper, with the proper author list, year etc., to your favorites bar under a designated folder. Then, when you type a search in the Chrome’s search bar for an author or a title, the relevant results from the favorites pop up. One might also explicitly browse in the favorites to see all papers read.

See an example of the results shown when searching for “short” in Chrome’s address bar. The first three links are from the favorite and the rest are general Google suggestions from the web.

The extension automatically works on a list of specified domains that include:

eprint.iacr.orgarxiv.orgeccc.weizmann.ac.ilepubs.siam.orgresearch.microsoft.comciteseerx.ist.psu.eduac.elscdn.comwww.sciencedirect.comdownload.springer.comlink.springer.comdelivery.acm.org, proceedings.mlr.press and journals.aps.org.

It is not too difficult to customize the list and additional domains. Reach out to me and I’ll add them upon your request!

A bonus feature (thanks Boaz for the suggestion): Click the extension’s icon and you can download a bib file containing (a nice formatting of) the DBLP bibtex records of all papers added to the favorites.

Download “Where’s that paper?” from the Chrome Web Store: https://chrome.google.com/webstore/detail/wheres-that-paper/dkjnkdmoghkbkfkafefhbcnmofdbfdio

Security.

Most who work in our domain of expertise would be quite concerned with installing extensions and with good reason. The extension I wrote does not leak any information, not to me or to another third party. I have no ulterior motive in developing the extension, originally helping myself, I now see the benefit of sharing it with our community.  I do not know how to reassure you that this is indeed the case other than giving you my word and publishing the source code online. It is available here:

https://github.com/eylonyogev/Where-s-That-Paper-

I’d be glad to hear any feedback.

Thanks,

Eylon.

Approximating Partition Functions

November 11, 2018

[Guest post by Alex Kelser, Alex Lin, Amil Merchant, and Suproteem Sarkar, scribing for a lecture by Andrej Risteski.]

Andrej Risteski: Approximating Partition Functions via Variational Methods and Taylor Series

For a probability distribution defined up to a constant of proportionality, we have already seen the partition function. To refresh your memory, given a probability mass function p(\vec{x}) \propto f(\vec{x}) over all \vec{x} \in \mathcal{X}, the partition function is simply the normalization constant of the distribution, i.e.
Z = \sum_{\vec{x} \in \mathcal{X}} f(\vec{x})

At first glance, the partition function may appear to be uninteresting. However, upon taking a deeper look, this single quantity holds a wide array of intriguing and complex properties. For one thing, its significance in the world of thermodynamics cannot be overstated. The partition function is at the heart of relating the microscopic quantities of a system – such as the individual energies of each probabilistic state \vec{x} \in \mathcal{X} – to macroscopic entities describing the entire system: the total energy, the energy fluctuation, the heat capacity, and the free energy. For explicit formulas, consult this source for reference. In machine learning the partition function also holds much significance, since it’s intimately linked to computing marginals in the model.

Although there exists an explicit formula for the partition function, the challenge lies in computing the quantity in polynomial time. Suppose \vec{x} is a vector of dimension n where each x_i \in \{-1, 1\} for i \in \{1, \ldots, n\}. Specifically, we would like a smarter way of finding Z than simply adding up f(x) over all 2^n combinations of x.

Andrej Risteski’s lecture focused on using two general techniques – variational methods and Taylor series approximations – to find provably approximate estimates of Z.

Motivation

The general setup addresses undirected graphical models, also known as a Markov random fields (MRF), where the probability mass function p : \mathcal{X} \to [0, 1] has the form
 p(\vec{x}) \propto \exp \left( \sum_{i \sim j} \phi_{ij}(x_i,x_j) \right)
for some random, n-dimensional vector \vec{x} \in \mathcal{X} and some set of parameterized functions \{\phi_{ij}: \mathcal{X} \to \mathcal{R}\}. Note that the notation i \sim j denotes all pairs (i, j) where i, j \in \{1, 2, \ldots, n\} and the edge e = (i, j) exists in the graph.

The talk considered the specific setup where each x_i \in \{-1, 1\}, so \mathcal{X} = \{-1, 1\}^n. Also, we fix

\phi_{ij}(x_i, x_j) = J_{ij} x_i x_j
for some set of coefficients \{J_{ij}\}, thereby giving us the well-known Ising model. Note, the interaction terms could be more complicated and made “less local” if desired, but that was not discussed in the lecture.

These graphical models are common in machine learning, where there are two common tasks of interest:

  1. Learning: given samples, models try to learn \phi_{ij}(x_i, x_j) or \{J_{ij}\}
  2. Inference: given \{\phi_{ij}\} or the potentials \{J_{ij}\} as input, we want to calculate the marginal probabilities, such as
    p(x_i = -1) \text{ or }  p(x_i = 1, x_j = -1)

We focus on the latter task. The problem of inference is closely related to calculating the partition function. This value is often used as the normalization constant in many methods, and it is classically defined as
 Z = \sum_{x \in \mathcal{X}} \exp \left( \sum_{i \sim j} \phi_{ij}(x_i, x_j) \right)

Although we can write down the aforementioned closed-form expression for the partition function, it is difficult to calculate this quantity in polynomial time. There are two broad approaches to solving inference problems:

  1. Randomized: Methods based on Markov chain Monte Carlo (MCMC), such as Gibbs sampling, that construct a Markov chain whose stationary distribution is the distribution of interest. These have been more extensively studied in theoretical computer science.
  2. Deterministic: Variational methods, self-avoiding walk trees, Taylor expansion methods

Although more often used in practice, randomized algorithms such as MCMC are notoriously hard to debug, and it is often unclear at what point the chain reaches stationarity.

In contrast, variational methods are often more difficult to rigorously analyze but have the added benefit of turning inference problems into optimization problems. Risteski’s talk considered some provable instantiations of variational methods for calculating the partition function.

Variational Methods

Let us start with a basic observation, known as the Gibbs variational principle. This can be stated as the following.

Lemma 1: Let p(\vec{x}) \propto \exp{\sum_{i \sim j} J_{ij} x_i x_j}. Then, we can show that the corresponding partition function satisfies

 \log Z = \max_{q \in \mathcal{Q}} \left \{ \sum_{i \sim j} \underbrace{J_{ij}\mathbb{E}_q[x_i x_j]}_{\text{energy term}} + \underbrace{\mathbb{H}(q)}_{\text{entropy term}} \right \}

where q : \mathcal{X} \to [0, 1] is defined to be a valid distribution over \mathcal{X} and \mathcal{Q} is the set of all such distributions.

Note that we use \mathbb{E}_q(\cdot) to denote the expectation of the inner argument with respect to the distribution q and \mathbb{H}(q) to denote the Shannon entropy of q.

Proof: For any q \in \mathcal{Q}, the KL divergence between q and p must be greater than or equal to 0.

    \mathbb{KL} (q || p) \geq 0

 \underbrace{\mathbb{E}_q [\log q(\vec{x})}_{-\mathbb{H}(q)}] - \mathbb{E}_q \left[ \sum_{i \sim j} J_{ij} x_i x_j \right] + \log Z \geq 0

\log Z \geq \sum_{i \sim j} J_{ij} \mathbb{E}_q[x_i x_j] + \mathbb{H}(q)
Equality holds if q = p, leading directly to the lemma above.

Note that the proof would have also held in the more generic exponential family with \phi_{ij} instead of J_{ij} x_i x_j. Also, at most temperatures, one of the 2 terms, either the energy or the entropy will often dominate.

As a result of this lemma, we have framed the original inference problem of calculating Z as an optimization problem over \mathcal{Q}. This is the crux of variational inference.

Optimization Difficulties

The issue with this approach is that it is difficult to optimize over the polytope of distributions due to the values of each x_i coming from \pm 1. In general, this is NOT tractable. We can imagine two possible solutions:

  1. Inner Approximation

    Instead of optimizing over all q, pick a “nice” subfamily of distributions to constrain q. The prototypical example is the mean-field approximation in which we let q(\vec{x}) = \prod_{i=1}^n q_i(x_i), where each q_i : \{-1, 1\} \to [0, 1] is a univariate distribution. Thus, we approximate \log Z with
    \max_{q = \prod_i q_i} \left \{ \sum_{i \sim j} J_{ij} \cdot \mathbb{E}_{q_i}[x_i] \cdot \mathbb{E}_{q_j}[x_j] + \sum_i \mathbb{H}(q_i) \right \}

    This would provide a lower bound on Z. There are a number of potential issues with this method. For one thing, the functions are typically non-convex. Even ignoring this problem, it is difficult to quantify how good the approximation is.

  2. Outer Approximation

    Note: Given a distribution q, we use q_S to denote the marginal q(\vec{x}_S) = q(x_{s_1}, \ldots, x_{s_k}) for some set S = \{s_1, \ldots, s_k\} \subseteq \{1, \ldots, n\}.

    In the outer approximation, we relax the polytope we are optimizing over using convex hierarchies. For instance, we can define \mathcal{M}^k as the polytope containing valid marginals q_S over subsets S of size at most k. We can then reformulate our objective as a two-part problem of (1) finding a set of pairwise marginals \widetilde{q} = \{\widetilde{q}_S \text{ s.t. } \lvert S \rvert \leq k\}\in \mathcal{M}^k that optimizes the energy term and (2) finding a distribution q with pairwise marginals matching \widetilde{q} that optimizes the entropy term. For simplicity of notation, we use \widetilde{q}_{ij} to denote \widetilde{q}_{S}, where S = \{i, j\}. It follows that we can rewrite the Gibbs equation as

    \log Z = \max_{\widetilde{q} \in \mathcal{M}^k} \left \{ \sum_{i \sim j} J_{ij} \cdot \mathbb{E}_{\widetilde{q}_{ij}}[x_i x_j] + \max_{q \given q_S = \widetilde{q}_S \forall S, \lvert S \rvert \leq k} \mathbb{H}(q) \right \}

    The first term here represents the energy on pairwise marginals, whereas the second term is the maximum entropy subject to a constraint about matching the energy distribution’s pairwise marginals. The goal of this procedure is to enlarge the polytope such that \mathcal{M}^k is a tractable set, where we can impose a polynomial number of constraints satisfied by real marginals.

    One specific convex hierarchy that is commonly used for relaxation is the Sherali-Adams (SA) hierarchy. Sherali-Adams will allow us to formulate the optimization of the energy term (and an approximation of the entropy term) as a convex program. We introduce the polytope \text{SA}(k) for k \geq 2, which will relax the constraints on \vec{x} to allow for values outside of \{-1, 1\}^n in order to generate a polynomial-time solution for \log Z.

    The Sherali-Adams hierarchy will take care of the energy term, but it remains unclear how to rewrite the entropy term in the context of the convex program. In fact, we will need approximations for \mathbb{H} in order to accomplish this task.

    In this talk, we’ll consider two approximations: one classical one – the Bethe entropy \mathbb{H}_\text{BETHE} and one more recent one – the augmented mean-field pseudo-entropy \mathbb{H}_\text{aMF}.

Bethe Entropy Approximation

The Bethe entropy works by pretending that the Markov random field is a tree. In fact, we can show that if the graph is a tree, then using the Bethe entropy approximation in the convex program defined by \text{SA}(2) will yield an exact calculation of \log Z.

Specifically, the Bethe entropy is defined as
\mathbb{H}_\text{BETHE} (\widetilde{q}) = \sum_{i \sim j} \mathbb{H}(\widetilde{q}_{ij}) - \sum_i \mathbb{H}(\widetilde{q}_i) (d_i - 1)
where d_i is defined to be the degree of the particular vertex i. Note that there are no marginals over sets of dimension greater than two in this expression; thus, we have a valid convex program over the polytope \text{SA}(2).

This lemma is well-lnown, but it will be a useful warmup:

Lemma 2 Define the output of the convex program
\log Z_{\text{BETHE}} =      \max_{\widetilde{q} \in \text{SA}(2)} \left \{ \sum_{i \sim j} J_{ij} \cdot \mathbb{E}_{\widetilde{q}_{ij}} [x_i x_j] + \mathbb{H}_{\text{BETHE}} (\widetilde{q}) \right \}
On a tree, \log Z_{\text{BETHE}} = \log Z and the optimization objective is concave with respect to the \text{SA}(2) variables, so it can be solved in polynomial time.

Proof sketch We will prove 3 claims:

  1. \log Z_{\text{BETHE}} \geq \log Z

    Since the energy term is exact, it suffices to show that for valid marginals \widetilde{q}_S, \mathbb{H}_\text{BETHE}(\widetilde{q}) \geq \max_{q \given q_S = \widetilde{q}_S} \mathbb{H}(q). This can be done by re-writing \mathbb{H}_\text{BETHE} and using a property of conditional entropy, namely that \mathbb{H}(X,Y) = \mathbb{H}(X) + \mathbb{H}(Y|X).

  2. For trees, \log Z_\text{BETHE} is concave in the \text{SA}(2) variables.

    This proof was only sketched in class but is similar to the usual proof of concavity of entropy.

  3. For trees, we can round a solution to a proper distribution over \{-1, 1\}^n which attains the same value for the original \log Z optimization.

    The distribution q that we will produce is
    q(\vec{x}) = \prod_i q(x_i \given x_{\text{pa}(i)})
    where we start at the root and keep sampling down the tree. Based on the tree structure, \mathbb{H}(q) = \mathbb{H}_\text{BETHE}(q). The energy is also the same since q_{ij}(x_i, x_j) = \widetilde{q}_{ij}(x_i, x_j). Therefore, since both terms in the equation are equal to the respective terms in the partition function, we get that the equation must equal the partition function.

Remarks:

  1. \mathbb{H}_\text{BETHE} is commonly used on graphs that are not trees.
  2. However, for non-trees, the optimization is often no longer concave.
  3. Therefore, the value obtained by the output equation is neither an upper or lower bound.
  4. The fixed points of the belief propagation algorithm give a 1-1 correspondence with the stationary points of the Bethe objective. (See a classical paper by Yedidia et al..)

Augmented Mean Field Pseudo-Entropy

For this approximation, we will focus on dense graphs: namely graphs in which each vertex has degree \geq \Delta \cdot n for some \Delta > 0. To simplify things, we focus on distributions of the form
 p(\vec{x}) \propto \exp \left(\sum_{i \sim j} \pm J x_i x_j\right)
where J is some positive constant parameter. The Bethe entropy approximation is undesirable in the dense case, because we cannot bound its error on non-trees. Instead, we proved the following theorem.

Theorem (Risteski ’16)

For a dense graph with parameter \Delta, there exists an outer approximation based on \text{SA}(\frac{1}{\epsilon^2\Delta}) and an entropy proxy which achieves a O(J \epsilon n^2 \Delta) additive approximation to \log Z for some \epsilon > 0. There is an algorithm with runtime O(\frac{1}{\epsilon^2\Delta}).

To parse the theorem statement, consider the potential regimes for J:

  1. J >> \frac{1}{n} \rightarrow One can ignore the entropy portion and solve for the energy portion that dominates. (So the guarantee is not that interesting.)
  2. J << \frac{1}{n} \rightarrow MCMC methods give a (1+\epsilon)-factor approximation in time poly(n, \frac{1}{\epsilon}). (It’s not clear if the methods suggested here can give such a guarantee – this is an interesting problem to explore.)
  3. J = \Theta(\frac{1}{n}) \rightarrow MCMC mixes slowly, but there is no other way to get a comparable guarantee. This is the interesting regime!

Proof Sketch The proof strategy is as follows: We will formulate a convex program under the SA(k) relaxation that can return a solution \widetilde{q} in polynomial time with value \log \tilde{Z}. Note that this value of the relaxation will be an upperbound on the true partition function value \log Z^*. The convex program solution \widetilde{q} may not be a valid probability distribution, so from this, we will construct a “rounded solution’’ – an actual distribution q with value \log Z. It follows that
\log \tilde{Z} \geq \log Z^* \geq \log Z

We will then aim to put a bound on the gap between \log \tilde{Z} and \log Z; namely, that
    \log Z \geq \log \tilde{Z} - \epsilon \cdot n^2 \cdot J \cdot \Delta
or equivalently,
    \sum_{i \sim j} \mathbb{E}_q[x_i x_j] \cdot J + \mathbb{H}(q) \geq \sum_{i \sim j} \mathbb{E}_{\widetilde{q}}[x_i x_j] \cdot J + \mathbb{H}_\text{aMF}(\widetilde{q}) - \epsilon n^2 J \Delta
This equivalently places a bound on the optimality gap between \log \tilde{Z} and \log Z^*, thereby proving the theorem. Here are the main components of the proof:

1. Entropy Proxy

To approximate \mathbb{H}, we will use the pseudo-augmented mean field entropy approximation \mathbb{H}_\text{aMF}. This is defined as

 \mathbb{H}_\text{aMF}(q) = \mathbb{I}n_{|S| \le k} \{ \overbrace{\mathbb{H}(q_S)}^\text{entropy of $S$} + \overbrace{\sum_{i \notin S} \mathbb{H}(q_i|q_S)}^\text{entropy of everything else} \}

Using \mathbb{H}_\text{aMF}, we can write a convex program under SA(k) that provides a relaxation for optimizing \log Z. Specifically, we will let k = O(1 / (\Delta \cdot \epsilon^2)). Let \widetilde{q} be the outputed solution to this relaxation. We define the rounded solution as
q(\vec{x}) = \widetilde{q}_S(\vec{x}_S) \cdot \prod_{i \notin S} \widetilde{q}_S(x_i \given \vec{x}_S)
Using the chain rule for entropy, we can show that
\mathbb{H}(q) \geq \mathbb{H}_\text{aMF}(\widetilde{q})

2. Rounding Scheme The above distribution q is actually the same distribution that correlation rounding, as introduced by a seminal paper of Barak, Raghavendra, and Steurer, produces. By using definition of mutual information \mathbb{I}(X, Y) and Pinsker’s Theorem showing that \mathrm{\mathbb{C}ov}(X, Y) = O(\sqrt{\mathbb{I}(X, Y)}), we can work through some algebra to show that

\sum_{i \sim j} \mathbb{E}_q[x_i x_j] \cdot J \geq \sum_{i \sim j} \mathbb{E}_{\widetilde{q}}[x_i x_j] \cdot J - \epsilon n^2 J \Delta

Then, putting together the entropy and energy effects, we can prove the main theorem.

Remarks:

  1. The above proof easily extends to a more general notion of density, as well as to sparse graphs with “low-rank’’ spectral structure. See the paper for more information.
  2. In fact, with more work, one can extract guarantees on the the best mean field approximation to the Gibbs distribution in the dense regime. (This intuitively can be done as the distribution q produced above is almost mean-field, except for the set S). This is interesting, since as we mentioned, the quality of the mean-field approximation is usually difficult to characterize. (Moreover, the procedure is algorithmic.) See the recent work of Jain, Koehler and Risteski ’18 for more details and results.

Taylor Series Approximation

Another method of calculating the partition function involves Taylor expanding its logarithm around a cleverly selected point. This approach was first developed by Barvinok ’14, for the purpose of computing the partition function of cliques in a graph. The mathematical techniques developed in this paper can be naturally extended to evaluate a variety of partition functions, including that of the ferromagnetic Ising model. We will use this example to illustrate the general idea of the Taylor expansion technique.

The goal here is to devise a deterministic, fully polynomial-time approximation scheme (an FPTAS) for evaluating the partition function of the ferromagnetic Ising model. This means that the the algorithm must run in poly(n, \frac{1}{\epsilon}) and be correct within a multiplicative factor of 1 + \epsilon. We will work in the general case where the Hamiltonian includes an external magnetic field term, as well as the neighboring spin interaction terms. Using logarithmic identities, we can re-write the partition function slightly differently from last time:

 Z(\lambda) = \sum_{x \in \{ \pm 1 \}^n} \lambda^{|{1: x_i = 1}|} \beta^{|E(x)|}
Here, \lambda is called the vertex activity (or external field), and characterizes the likelihood of a vertex to be in the + configuration. Additionally, \beta \geq 0 is referred to as the edge activity, and characterizes the propensity of a vertex to agree with its neighbors. The ferromagnetic regime (where agreement of spins is favored) corresponds to \beta < 1. |E(x)| denotes the number of edge cut in the graph, which is equivalent to the number of neighboring pairs with opposite spins.

As one final disclaimer, the approximation scheme presented below is not valid for | \lambda| = 1, which corresponds to the zero-magnetic field case. Randomized algorithms exist which can handle this case.

The general idea here is to hold one parameter fixed (\beta in our case), and express the logarithm of the partition function as a Taylor series in the other parameter (\lambda). From a physics standpoint, the Taylor expansion around \lambda = 0 tells us how the partition function is changing as a function of the magnetic field. Without loss of generality, we focus on the case where | \lambda | < 1. This simplification can be justified by a simple symmetry argument. Consider \bar{x} as the inverse of x where all the 1’s (spin-up) are flipped to -1s (spin-down) and vice-versa. The partition function of this flipped system can be related to the original system by a constant factor:

\begin{align*} \sum_x \beta^{|E(x)|} \lambda^{|\{1: x_i = 1\}|} &= \sum_x \beta^{|E(\bar{x})|} \lambda^{n - |\{1: \bar{x}_i = 1\}|} \\ &= \lambda^{n} \sum_x \beta^{|E(\bar{x})|} \left(\frac{1}{\lambda}\right)^{|\{1: \bar{x}_i = 1\}|} \end{align*}

This holds because the number of cut edges remains constant when the values are flipped. Since these two partition functions are related by this factor constant in \lambda, for any model with \lambda > 1, we can simply consider the flipped graph \bar{x} and use the above relation.

For our purposes, it will be more convenient to approximate the logarithm of the partition function, because a multiplicative (1 + \epsilon) approximation of Z(\lambda) corresponds to an \mathcal{O}(\epsilon) additive approximation of \log Z(\lambda). In fact, if we allow the partition function to be complex, then we can easily show that an additive bound of \frac{\epsilon}{4} for \log Z(\lambda) guarantees a multiplicative bound of (1+\epsilon) for Z(\lambda).

For notational convenience we define:
 \log Z(\lambda) = f(\lambda)
Thus, the Taylor expansion of f(\lambda) around \lambda = 0 if given by:
f(\lambda) = \sum_j f^{(j)} (0) \frac{\lambda^j}{j!}

The big question now is: Can we get a good approximation from just the lower order terms of the Taylor expansion for f(\lambda)? Equivalently, can we can bound the sum of the higher order terms at \frac{\epsilon}{4}? Additionally, we still must address the question of how to actually calculate the lower order terms.

To answer these questions, we make simple observations about the derivatives of f in relation to the derivatives of Z.
\begin{align*}     f'(\lambda) &= \frac{1}{Z(\lambda)} Z'(\lambda) \\     Z'(\lambda) &= f'(\lambda) Z(\lambda) \\     Z^{(m)} (\lambda) &= \sum_{j=0}^{m-1} {m-1 \choose j} Z^{(j)} (\lambda) f^{(m-j)} (\lambda) \end{align*}
The last equation just comes from repeated application of the product rule. Using these relations, we can solve for the first m derivatives of f if we have the first m derivatives of Z using a triangular linear system of equations. As Z(0) = 1, we see that the system is non-degenerate.

Note, Z(\lambda) is a n-degree polynomial with leading coefficient of 1 and constant coefficient of 1 (corresponding to the configurations where all vertices are positive / negative respectively). Using this form, we can re-write the partition function in terms of its n (possibly complex) roots r_i:

Z(\lambda) = \prod_{i=1}^n \left(1-\frac{\lambda}{r_i} \right) \text{ for some } r_i

f(\lambda) = \log Z(\lambda) = \sum_{i=1}^n \log \left(1-\frac{\lambda}{r_i}\right)

= - \sum_{i=1}^n \sum_{j=1}^{\infty} \frac{1}{j} \left(\frac{\lambda}{r_i} \right)^j

Supposing we keep the first m terms, the error is given bounded by:

|f(\lambda) + \sum_{i=1}^n \sum_{j=1}^{m} \frac{1}{j} (\frac{\lambda}{r_i})^j| \leq  \sum_{i=1}^n \sum_{j=m+1}^{\infty} \frac{1}{|r_i|}\frac{|\lambda|^j}{j} \leq \sum_{i=1}^n \frac{|\lambda|^{m+1}}{(m+1)(1-|\lambda|)}\frac{1}{|r_i|}

Here we can invoke the Lee-Yang theorem, which tells us that the zeroes of Z(\lambda) for a system with ferromagnetic interactions lie on the unit circle of the complex plane. So the Lee-Yang theorem guarantees that |r_i| = 1, and we see that the error due to truncation is ultimately bounded by:

\frac{n|\lambda|^{m+1}}{(m+1)(1-|\lambda|)}

Now by the previous symmetry argument, we can assume that \lambda < 1. Thus, to achieve an error bound of \frac{\epsilon}{4}, we must have:
\frac{\epsilon}{4} > \frac{n|\lambda|^{m+1}}{(m+1)(1-|\lambda|)}
Rearranging terms and taking the natural log of both sides (which is justified given that |\lambda| < 1), we see that the inequality is satisfied if:

m \geq \frac{1}{\log(\frac{1}{|\lambda|})}(\log(\frac{4n}{\epsilon}) + \log(\frac{1}{1-|\lambda|}))

Thus, we need only retain the first \frac{1}{\log(\frac{1}{|\lambda|})}(\log(\frac{4n}{\epsilon}) + \log(\frac{1}{1-|\lambda|})) terms of the Taylor series expansion of f(\lambda). This will involve calculating the first m = \lceil \frac{1}{\log(\frac{1}{|\lambda|})}(\log(\frac{4n}{\epsilon}) + \log(\frac{1}{1-|\lambda|}))\rceil derivatives of Z(\lambda), which naively can be done in quasi-polynomial time. This is done by running over all S, where |S| = m. This takes O(n \log n + \log (\frac{1}{\epsilon})) time, which is quasi-polynomial.

Recent work by Patel and Regts, as well as Liu, Sinclair and Srivastiva has focused on evaluating these coefficients more efficiently (in polynomial time), but is outside the scope of this lecture. Clever counting arguments aside, to trivially calculate the k-th derivative, we need only sum over the vectors with k spin-up components.

Lee-Yang Theorem

Most of the heavy lifting in the above approach is done by the Lee-Yang theorem. In this section, we sketch how it is proved.

First, let’s define the Lee-Yang property, which we will refer to as the LYP. Let P(\lambda_1, \dots, \lambda_n) be some multilinear polynomial. P has the Lee-Yang property if for any complex numbers \lambda_1, \dots, \lambda_n such that |\lambda_i| > 1 for all i, then P(\lambda_1, \dots, \lambda_n) \neq 0.

We can then show that the partition function for the ferromagnetic Ising model, which we wrote as
Z(\lambda_1, \dots, \lambda_n) = \sum_x \beta^{|E(x)|} \prod_{i: x_i = 1} \lambda_i
must have this LYP. In the antiferromagnetic case it turns out that all zeroes lie on the negative real axis, but we will focus on the ferromagnetic case.

Proof (sketch):

For the proof that the partition function has the LYP, we will use Asano’s contraction argument. This relies on the fact that certain operations preserve LYP and we can “build” up to the partition function of the full graph by these operations.

  • Disjoint union: This is fairly simple to prove since the partition function for the disjoint union of G and H would just factor into the multiplication of the two partition function of the original graphs Z_{G \sqcup H} = Z_{G} Z_{H}. Where Z_{G} and Z_{H} have the LYP, then it is clear that Z_{G \sqcup H} has the same LYP.
  • Contraction: Suppose we produce a graph G' from G by contracting 2 vertices z_1, z_2 in G, which means merging them into one vertex. It can be shown that if G has the LYP, then G' also has the LYP. We write the partition function for G as
    A\lambda_1 \lambda_2 + B \lambda_1 + C \lambda_2 + D
    The “contracted” graph amounts to deleting the middle 2 terms so the partition function for G' can be written as
    \underbrace{A \lambda'}_{\text{both plus}} + \underbrace{D}_{\text{both minus}}
    We want to show that the partition function for G' has the LYP. Because the partition function of G has the LYP, we can consider the case where \lambda_1 = \lambda_2 = \lambda. By the LYP,
    A\lambda^2 + (B+C)\lambda + D \neq  0
    if |\lambda| > 1. By Vieta’s formulas we can find a relation between A and D,
     \frac{|D|}{|A|} = \prod |zeroes| \leq 1
    Now, to show that the partition function of G’ has the LYP, we assume there is a \lambda' such that A\lambda' + D = 0. For this to be true,
    |\lambda'| = \frac{|D|}{|A|}
    However, this means that |\lambda'| \leq 1 so there is no solution where |\lambda'| > 1 such that the partition function of G’ is 0 and so it has the LYP.

The final part of this proof is to construct the original graph. We observe is that for a single edge, the partition function has the LYP. In this case, we easily write out the partition function for 2 vertices:
z_1z_2 + B z_1 + B z_2 + 1
Suppose |z_1| > 1: then z_1z_2 + Bz_1 + Bz_2 + 1=0 would imply |z_2| = \frac{|1+Bz_1|}{|z_1 + B|}. However, this is just the Möbius transform mapping the exterior of the unit disk to the interior, so |z_2| \leq 1. Thus, it cannot be the case that both z_1 and z_2 have absolute value greater than one.

Since single edges have the LYP. We break the graph into single edges, with copies of vertices. These copies are then contracted and we build up the graph to show that the partition function has the LYP.

Knowing that the partition function has the LYP, direct application of the Lee-Yang theorem guarantees that the roots are on the unit circle in the complex plane.

Conclusion

Although there exists an explicit formula for the partition function, the challenge lies in computing the quantity in polynomial time. Andrej Risteski’s lecture focused on two less-explored methods in theoretical computer science – namely variational inference and Taylor series approximations – to find provably approximate estimates. Both of these approaches are relatively new and replete with open problems.

Rabin postdocs ad is out

October 24, 2018

[Guest blog from Yaron Singer who is heading the selection committee for the Rabin fellowship this year. In addition to the Rabin fellowship and other postdoc opportunities, this year we also have a new postdoc opportunity in quantum computation and information via the Harvard Quantum Initiative. –Boaz.]

 

Michael O. Rabin Postdoctoral Fellowship in Theoretical Computer Science
 
Deadline for full consideration: December 3, 2018.  Applications can be submitted here::
 
 
The Harvard John A. Paulson School of Engineering and Applied Sciences at Harvard University seeks applicants for the Michael O. Rabin Postdoctoral Fellowship in Theoretical Computer Science. The standard duration of the Rabin Fellowship is two years. Rabin Fellows will receive a generous salary as well as an annual allocation for research and travel expenses.
 
Past fellows are Mika Goose and Aviad Rubinstein and the current fellow is Alexander Golovnev.
 
We are looking for exceptional junior scientists in theoretical computer science, broadly construed. Rabin Fellows will be provided with the opportunity to pursue their research agenda in an intellectually vibrant environment with ample mentorship. While interaction with Harvard faculty, students, and visitors is encouraged, Rabin Fellows are free to pursue their own interests. Candidates are required to have a doctorate or terminal degree in Computer Science or a related area by the expected start date.
 
Required application documents include a cover letter, research statement, CV (including a list of publications), and names and contact information for three references. We recommend that papers mentioned in the CV be available online on your homepage or on electronic archives. Applicants will apply on-line at the above address. We encourage candidates to apply by December 3, 2018, but applications will be accepted until the position is filled.
 
Harvard is an equal opportunity employer and all qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability status, protected veteran status, or any other characteristic protected by law.
 
*The fellowship is named after Michael O. Rabin, pioneer in Computer Science research and winner of numerous awards including the A. M. Turing award in 1976. Michael Rabin has been on the faculty at Harvard since 1981, and currently is the Thomas J. Watson, Sr. Research Professor of Computer Science in the Harvard Paulson School. The fellowship is aimed at researchers in all areas of theoretical computer science, including fellows that, like Rabin, might create new areas that do not yet exist.

Belief Propagation and the Stochastic Block Model

October 20, 2018
by

[Guest post by Thomas Orton who presented a lecture on this in our  physics and computation seminar –Boaz]

Introduction

This blog post is a continuation of the CS229R lecture series. Last week, we saw how certain computational problems like 3SAT exhibit a thresholding behavior, similar to a phase transition in a physical system. In this post, we’ll continue to look at this phenomenon by exploring a heuristic method, belief propagation (and the cavity method), which has been used to make hardness conjectures, and also has thresholding properties. In particular, we’ll start by looking at belief propagation for approximate inference on sparse graphs as a purely computational problem. After doing this, we’ll switch perspectives and see belief propagation motivated in terms of Gibbs free energy minimization for physical systems. With these two perspectives in mind, we’ll then try to use belief propagation to do inference on the the stochastic block model. We’ll see some heuristic techniques for determining when BP succeeds and fails in inference, as well as some numerical simulation results of belief propagation for this problem. Lastly, we’ll talk about where this all fits into what is currently known about efficient algorithms and information theoretic barriers for the stochastic block model.

(1) Graphical Models and Belief Propagation

(1.0) Motivation

Suppose someone gives you a probabilistic model on \vec{x}=(x_1,...,x_n) \in \chi^{N} (think of \chi as a discrete set) which can be decomposed in a special way, say

P(\vec{x})\propto \prod_{a \in F} f_{a}(\vec{x})

where each f_{a} only depends on the variables V_{a}. Recall from last week that we can express constraint satisfaction problems in these kinds of models, where each f_{a} is associated with a particular constraint. For example, given a 3SAT formula \phi=\land_{a} c_{a}, we can let f_{a}(\vec{x})=1 if c_{a}(\vec{x}) is satisfied, and 0 otherwise. Then each f_{a} only depends on 3 variables, and P(\vec{x}) only has support on satisfying assignments of \phi.

 

A central problem in computer science is trying to find satisfying assignments to constraint satisfaction problems, i.e. finding values \vec{x} in the support of P(\vec{x}). Suppose that we knew that the value of P(x_1=1) were >0. Then we would know that there exists some satisfying assignment where x_1=1. Using this knowledge, we could recursively try to find \vec{x}‘s in the support of P(1,x_2,...,x_n), and iteratively come up with a satisfying assignment to our constraint satisfaction problem. In fact, we could even sample uniformally from the distribution as follows: randomly assign x_1 to 1 with probability P(x_1=1), and assign it to 0 otherwise. Now iteratively sample from P(x_2|x_1) for the model where x_1 is fixed to the value we assigned to it, and repeat until we’ve assigned values to all of the \{x_i\}_{i}. A natural question is therefore the following: When can we try to efficiently compute the marginals

P(x_i):=\sum_{\vec{x}-x_i} P(\vec{x})

for each i?

 

A well known efficient algorithm for this problem exists when the corresponding graphical model of P(\vec{x}) (more in this in the next section) is a tree. Even though belief propagation is only guaranteed to work exactly for trees, we might hope that if our factor graph is “tree like”, then BP will still give a useful answer. We might even go further than this, and try to analyze exactly when BP fails for a random constraint satisfaction problem. For example, you can do this for k-SAT when k is large, and then learn something about the solution threshold for k-SAT. It therefore might be natural to try and study when BP succeeds and fails for different kinds of problems.

(1.1) Deriving BP

We will start by making two simplifying assumptions on our model P(\vec{x}).

 

First, we will assume that P(\vec{x}) can be written in the form P(\vec{x})\propto \prod_{(i,j) \in E} f_{i,j}(x_i,x_j) \prod_{i \in [n]} f_i(x_i) for some functions \{f_{i,j}\}_{i,j}, \{f_i(x_i)\}_{i} and some “edge set” E (where the edges are undirected). In other words, we will only consider pairwise constraints. We will see later that this naturally corresponds to a physical interpretation, where each of the “particles” x_i interact with each other via pairwise forces. Belief propagation actually still works without this assumption (which is why we can use it to analyze k-SAT for k>2), but the pairwise case is all we need for the stochastic block model.

 

For the second assumption, notice that there is a natural correspondence between P(\vec{x}) and the graphical model T on n vertices, where (i,j) forms an edge in T iff (i,j) \in E. In order words, edges (i,j) in T correspond to factors of the form f_{i,j} in P(\vec{x}), and vertices in T correspond to variables in \vec{x}. Our second assumption is that the graphical model T is a tree.

graphical_model_fig_1

Now, suppose we’re given such a tree T which represents our probabilistic model. How to we compute the marginals? Generally speaking, when computer scientists see trees, they begin to get very excited [reference]. “I know! Let’s use recursion!” shouts the student in CS124, their heart rate noticeably rising. Imagine that we arbitrarily rooted our tree at vertex x_i. Perhaps, if we could somehow compute the marginals of the children of x_i, we could somehow stitch them together to compute the marginal x_i. In order words, we should think about computing the marginals of roots of subtrees in our graphical model. A quick check shows that the base case is easy: suppose we’re given a graphical model which is a tree consisting of a single node x_i. This corresponds to some PDF P(\vec{x})=P(x_i) \propto f_{i}(x_i). So to compute P(x_i), all we have to do is compute the marginalizing constant Z=\sum_{x_i \in \chi} f_{i}(x_i), and then we have P(x_i)=\frac{1}{Z}f(x_i). With the base case out of the way, let’s try to solve the induction step: given a graphical model which is a tree rooted at x_i, and where we’re given the marginals of the subtrees rooted at the children of x_i, how do we compute the marginal of the tree rooted at x_i? Take a look at figure 2 to see what this looks like graphically. To formalize the induction step, we’ll define some notation that will also be useful to us later on. The main pieces of notation are T^{i \rightarrow j}, which is the subtree rooted at x_i with parent x_j, and the “messages” \psi^{k \rightarrow i}_{x_i}, which can be thought of as information which is passed from the child subtrees of x_i to the vertex x_i in order to compute the marginals correctly.

  1. We let \partial i denote the neighbours of vertex i in T. In general, I’ll interchangeably switch between vertex i and the variable x_i represented by vertex i.
  2. We define T^{i \rightarrow j} to be the subtree of T rooted at i, where i‘s parent is j. We need to specify i‘s parent in order to give an orientation to our tree (so that we know in which direction we need to do recursion down the tree). Likewise, we let V_{i \rightarrow j} be the set of vertices/variables which occur in subtree T^{i \rightarrow j}.
  3. We define the function T^{i \rightarrow j}(V_{i \rightarrow j}), which is a function of the variables in V_{i \rightarrow j}, to be equal to the product of T^{i \rightarrow j}‘s edges and vertices. Specifically, T^{i \rightarrow j}(V_{i \rightarrow j})=\prod_{(a,b) \in E'} f_{a,b}(x_a,x_b) \prod_{a \in V_{i \rightarrow j}} f_{a}(x_a), where E' is the set of edges which occur in subtree T^{i \rightarrow j}. We can think of T^{i \rightarrow j}(V_{i \rightarrow j}) as being the pdf (up to normalizing constant) of the “model” of the subtree T^{i \rightarrow j}.
  4. Lastly, we define \psi^{i \rightarrow j}_{x_i}:=\frac{1}{Z^{i \rightarrow j}} \sum_{V_{i \rightarrow j}-x_i} T(V_{i \rightarrow j}), where Z^{i \rightarrow j} is a normalizing constant chosen such that \sum_{x_i \in \chi} \psi^{i \rightarrow j}_{x_i}=1. In particular, \psi^{i \rightarrow j}_{x_i}: \chi \rightarrow \mathbb{R} is a function defined for each possible value of x_i \in \chi. We can interpret \psi^{i \rightarrow j}_{x_i} in two ways: firstly, it is the marginal of the root of the subtree T^{i \rightarrow j}. Secondly, we can think of it as a “message” from vertex i to vertex j. We’ll see why this is a valid interpretation shortly.

graphical_model_recursion

Phew! That was a lot of notation. Now that we have that out of the way, let’s see how we can express the marginal of the root of a tree as a function of the marginals of its subtrees. Suppose we’re considering the subtree T^{i \rightarrow j}, so that vertex i has children (\partial i)-j. Then we can compute the marginal \psi^{i \rightarrow j}_{r} directly:

eqs_1

The non-obvious step in the above is that we’re able to switch around summations and products: we’re able to do this because each of the trees are functions on disjoint sets of variables. So we’re able to express \psi^{i \rightarrow j}_{x_i} as a function of the children values \{\psi^{k \rightarrow i}_{x_k}\}_{k \in \partial i-j}. Looking at the update formula we have derived, we can now see why the \{\psi^{k \rightarrow i}_{x_k}\}_{k \in \partial i-j} are called “messages” to vertex i: they send information about the child subtrees to their parent i.

The above discussion is a purely algebraic way of deriving belief propagation. A more intuitive way to get this result is as follows: imagine fixing the value of x_i=a in the the subtree T^{i \rightarrow j}, and then drawing from each of the marginals of the children of x_i conditioned on the value x_i=a. We can consider the marginals of each of the children independently, because the children are independent of each other when conditioned on the value of x_i. Converting words to equations, this means that if x_i has children x_{k_{1}},...,x_{k_{d}}, then the marginal probability of (x_i,x_{k_{1}},...,x_{k_{d}}) in the subtree T^{i \rightarrow j} is proportional to f_{i}(x_i) \prod_{\substack{k:(i,k) \in E, \\ k \not = j}} \psi^{k \rightarrow i}_{x_k} f_{i,k}(x_i,x_k). We can then write

\psi_{x_i}^{i \rightarrow j} \propto \sum_{(\partial i)-j} f_{i}(x_i) \prod_{\substack{k:(i,k) \in E, \\ k \not = j}} \psi^{k \rightarrow i}_{x_k} f_{i,k}(x_i,x_k)

= f_{i}(x_i)\prod_{\substack{k:(i,k) \in E, \\ k \not = j}} \sum_{x_k \in \chi} \psi^{k \rightarrow i}_{x_k} f_{i,k}(x_i,x_k)

And we get back what we had before. We’ll call this last equation our “update” or “message passing” equation. The key assumption we used was that if we condition on x_i, then the children of x_i are independent. It’s useful to keep this assumption in mind when thinking about how BP behaves on more general graphs.

A similar calculation yields that we can calculate the marginal of our original probability distribution \psi_{x_i}^{i}:=P(x_i) as the marginal of the subtree with no parent, i.e.

\psi_{x_i}^{i} \propto f_{i}(x_i)\prod_{k:(i,k) \in E} \sum_{x_k \in \chi} \psi^{k \rightarrow i}_{x_k} f_{i,k}(x_i,x_k)

 

Great! So now we have an algorithm for computing marginals: recursively compute \psi^{i \rightarrow j}_{x_i} for each (i,j) \in E in a dynamic programming fashion with the “message passing” equations we have just derived. Then, compute \psi_{x_i}^{i} for each i \in [n]. If the diameter of our tree T is d, then the recursion depth of our algorithm is at most d.

However, instead of computing every \psi^{i \rightarrow j}_{x_i} neatly with recursion, we might try something else: let’s instead randomly initialize each \psi^{i \rightarrow j}_{x_i} (x_i \in \chi) with anything we want. Then, let’s update each \psi^{i \rightarrow j}_{x_i} in parallel with our update equations. We will keep doing this in successive steps until each \psi^{i \rightarrow j}_{x_i} has converged to a fixed value. By looking at belief propagation as a recursive algorithm, it’s easy to see that all of the \psi^{i \rightarrow j}_{x_i}‘s will have their correct values after at most d steps. This is because (after arbitrarily rooting our tree at any vertex) the leaves of our recursion will initialize to the correct value after 1 step. After two steps, the parents of the leaves will be updated correctly as functions of the leaves, and so they will have the correct values as well. Specifically:
Proposition: Suppose we initialize messages \psi_{x_i}^{i \rightarrow j} arbitrarily, and update them in parallel according to our update equations. If T has diameter d, then after d steps each \psi_{r}^{i \rightarrow j} converges, and we recover the correct marginals.

Why would anyone want to do things in this way? In particular, by computing everything in parallel in steps instead of recursively, we’re computing a lot of “garbage” updates which we never use. However, the advantage of doing things in this way is that this procedure is now well defined for general graphs. In particular, suppose P(\vec{x}) violated assumption (2), so that the corresponding graph were not a tree. Then we could still try to compute the messages \psi_{x_i}^{i \rightarrow j} with parallel updates. We are also able to do this in a local “message passing” kind of way, which some people may find physically intuitive. Maybe if we’re lucky, the messages will converge after a reasonable amount of iterations. Maybe if we’re even luckier, they will converge to something which gives us information about the marginals \{P(x_i)\}_{i}. In fact, we’ll see that just such a thing happens in the stochastic block model. More on that later. For now, let’s shift gears and look at belief propagation from a physics perspective.

 

(2) Free Energy Minimization

(2.0) Potts model (Refresh of last week)

We’ve just seen a statistical/algorithmic view of how to compute marginals in a graphical model. It turns out that there’s also a physical way to think about this, which leads to a qualitatively similar algorithm. Recall from last week that another interpretation of a pairwise factor-able PDF is that of particles interacting with each other via pairwise forces. In particular, we can imagine each particle x_i interacting with x_j via a force of strength

J_{(i,j)}(x_i,x_j)

and in addition, interacting with an external field

h_{i}(x_i)

We imagine that each of our particles take values from a discrete set \chi. When \chi=\{0,1\}, we recover the Ising model, and in general we have a Potts model. The energy function of this system is then

E[\vec{x}]=-\sum_{(i,j)}J_{(i,j)}(x_i,x_j)-\sum_{i}h_{i}(x)

with probability distribution given by

P(\vec{x})=\frac{1}{Z} e^{-E[\vec{x}]/T}

Now, for \chi=\{0,1\}, computing the marginals P(x_i) corresponds to the equivalent physical problem of computing the “magnetizations” P(x_i=1)-P(x_i=0).

How does this setup relate to the previous section, where we thought about constraint satisfaction problems and probability distributions? If we could set e^{-J_{(i,j)}(x_i,x_j)/T}=f_{i,j}(x_i,x_j) and e^{-h_{i}(x_i)/T}=f_{i}(x_i), we would recover exactly the probability distribution P(\vec{x})=\prod_{i,j} f_{i,j}(\vec{x}) \prod_{i} f_{i}(x_i) from the previous section. From a constraint satisfaction perspective, if we set J_{(i,j)}(x_i,x_j)=1 if constraint (i,j) is satisfied and 0 otherwise, then as T \rightarrow 0 (our system becomes colder), P(\vec{x})‘s probability mass becomes concentrated only on the satisfying assignments of the constraint satisfaction problem.

(2.1) Free energy minimization

We’re now going to try a different approach to computing the marginals: let’s define a distribution b(\vec{x}), which we will hope to be a good approximation to P(\vec{x}). If you like, you can think about the marginal b(x_i) as being the “belief” about the state of variable x_i. We can measure the “distance” between P and b by the KL divergence

 

KL(b(\vec{x})||p(\vec{x})) =\sum_{\vec{x}}b(\vec{x})\ln\frac{b(\vec{x})}{P(\vec{x})}

=\sum_{\vec{x}}b(\vec{x})E(\vec{x})+\sum_{\vec{x}}b(\vec{x})\ln b(\vec{x})+\ln Z

which equals 0 iff the two distributions are equal. Let’s define the Gibbs free energy as
G(b(\vec{x}))=\sum_{\vec{x}}b(\vec{x})E(\vec{x})+\sum_{\vec{x}}b(\vec{x})\ln b(\vec{x})=:U(b(\vec{x}))-S(b(\vec{x}))

 

So the minimum value of G is F=-\ln Z which is called the Helmholz free energy, U is the “average energy” and S is the “entropy”.

 

Now for the “free energy minimization part”. We want to choose b to minimize G, so that we can have that b is a good approximation of P. If this happens, then maybe we can hope to “read out” the marginals of b directly. How do we do this in a way which makes it easy to “read out” the marginals? Here’s one idea: let’s try to write G as a function of only the marginals b(x_i,x_j) and b(x_i) of b. If we could do this, then maybe we could try to minimize G by only optimizing over values for “variables” \{b(x_i,x_j)\}_{i,j} and \{b(x_i)\}_{i}. However, we need to remember that \{b(x_i,x_j)\}_{i,j} and \{b(x_i)\}_{i} are actually meant to represent marginals for some real probability distribution b. So at the very least, we should add the consistency constraints b(x_i)=\sum_{x_j \in \chi} b(x_i,x_j) and \sum_{x_i \in \chi} b(x_i)=1 to our optimization problem. We can then think of b(x_i,x_j) and b(x_i) as “pseudo-marginals” which obey degree-2 Sherali-Adams constraints.

 

Recall that we’ve written G as a sum of both the average energy U and the entropy S. It turns out that we can actually write U as only a function of the pariwise marginals of b:

U(b(\vec{x})) =-\sum_{i,j}\sum_{x_i,x_j} b(x_i,x_j)J_{i,j}(x_i,x_j)-\sum_{i}\sum_{x_i}b(x_i) h_{i}(x_i)

which follows just because the sums marginalize out the variables which don’t form part of the pairwise interactions:
\sum_{\vec{x}}b(\vec{x})\left(-\sum_{i,j}J_{i,j}(x_i,x_j)-\sum_{i} h_{i}(x_i)\right)

=-\sum_{i,j}\sum_{x_i,x_j} b(x_i,x_j)J_{i,j}(x_i,x_j)-\sum_{i}\sum_{x_i}b(x_i) h_{i}(x_i)

This is good news: since P only depends on pairwise interactions, the average energy component of G only depends on b(x_i,x_j) and b(x_i). However, it is not so clear how to express the entropy as a function of one node and two node beliefs. However, maybe we can try to pretend that our model is really a “tree”. In this case, the following is true:

 

Claim: If our model is a tree, and \{b(x_i,x_j)\}_{i,j} and \{b(x_i)\}_{i} are the associated marginals of our probabilistic model b(\vec{x}), then we have
b(\vec{x})=\frac{\prod_{(i,j) \in E}b(x_i,x_j)}{\prod_{i \in [n]} b(x_i)^{q_i-1}}

where q_i is the degree of vertex x_i in the tree.

 

It’s not too difficult to see why this is the case: imagine a tree rooted at x_i, with children \partial i. We can think of sampling from this tree as first sampling from x_i via its marginal b(x_i), and then by recursively sampling the children conditioned on x_i. Associate with \{T_k\}_{k \in \partial i} the subtrees of the children of i, i.e. T_j(V_{j}) is equal to the probability of the occurrence V_{j} on the probabilistic model of the tree rooted at vertex j. Then we have

b(\vec{x}) =b(x_i) \prod_{k \in \partial i} b(x_k|x_i) \prod_{k \in \partial i} T_k(V_{k}|x_k)

=b(x_i)\frac{1}{b(x_i)^{q_i}} \prod_{k \in \partial i} b(x_i,x_k) \prod_{k \in \partial i}\frac{1}{b(x_k)} \prod_{k \in \partial i}T_k(V_{k})
=\frac{\prod_{(i,j) \in E}b(x_i,x_j)}{\prod_{i \in [N]}b(x_i)^{q_i-1}}

where the last line follows inductively, since each T_k only sees q_{k}-1 edges of x_k.

 

If we make the assumption that our model is a tree, then we can write the Bethe approximation entropy as

S_{Bethe}=-\sum_{i,j}\sum_{x_i,x_j} b(x_i,x_j)\ln(b(x_i,x_j))+\sum_{i}(q_i-1)\sum_{x_i}b(x_i)\ln b(x_i)

Where the \{q_i\}_{i}‘s are the degrees of the variables \{x_i\}_{i} in the graphical model defined by P(\vec{x}). We then define the Bethe free energy as U+S_{Bethe}. The Bethe free energy is in general not an upper bound on the true free energy. Note that if we make the assignments E_{i}(x_i)=h_{i}(x_i), E_{i,j}(x_i,x_j)=-J_{i,j}(x_i,x_j)-h_{i}(x_i)-h_{j}(x_j), then we can rewrite U as
U=\sum_{i,j}\sum_{x_i,x_j} b(x_i,x_j)E_{i,j}(x_i,x_j)+\sum_{i}(q_i-1)\sum_{x_i}b(x_i)E_{i}(x_i)

 

which is similar in form to the Bethe approximation entropy. In general, we have

G_{Bethe}(b(x_i),b(x_i,x_j))=

\sum_{i,j}\sum_{x_i,x_j} b(x_i,x_j)(E_{i,j}(x_i,x_j)+\ln(b(x_i,x_j)))

-\sum_{i}(q_i-1)\sum_{x_i}b(x_i)( E_{i}(x_i)+\ln b(x_i))

which is exactly the Gibbs free energy for a probabilistic model whose associated graph is a tree. Since BP gives the correct marginals on trees, we can say that the BP beliefs are the global minima of the Bethe free energy. However, the following is also true:

Proposition: A set of beliefs gives a BP fixed point in any graph (not necessarily a tree) iff they correspond to local stationary points of the Bethe free energy.

(For a proof, see e.g. page 20 of [4])

So trying to minimize the Bethe free energy is in some sense the same thing as doing belief propagation. Apparently, one typically finds that when belief propagation fails to converge on a graph, the optimization program which is trying to minimize G_{Bethe} also runs into problems in similar parameter regions, and vice versa.

 

(3) The Block Model

(3.0) Definitions

Now that we’ve seen Belief Propagation from two different perspectives, let’s try to apply this technique of computing marginals to analyzing the behavior of the stochastic block model. This section will heavily follow the paper [2].

The stochastic block model is designed to capture a variety of interesting problems, depending on its settings of parameters. The question we’ll be looking at is the following: suppose we generate a random graph, where each vertex of the graph comes from one of q groups each with probability n_{1},...,n_{q}. We add an edge between vertices (i,j) in groups a,b resp. with probability p_{a,b}. For sparse graphs, we define c_{i,j}:=Np_{i,j}, where we think of p_{i,j} as \mathcal{O}(\frac{1}{N}). The problem is the following: given such a random graph, can you label the vertices so that, up to permutation, the labels you choose have high correlation to the true hidden labels which were used to generate the graph? Here are some typical settings of parameters which represent different problems:

  1.  Community detection, where we have q groups. We set n_{i}=\frac{1}{q}, and c_{i,j}=c_{in} if i=j and c_{out} otherwise, with c_{in}>c_{out} (assortative structure).
  2. Planted graph partitioning, where p_{i,j} may not necessarily be \mathcal{O}(\frac{1}{N}).
  3. Planted graph colouring, where p_{in}=0, p_{a,b}=p_{out} and n_{a}=\frac{1}{q} for all groups. We know how to efficiently find a planted coloring which is strongly correlated with the true coloring when the average degree c=(q-1)p_{out}N/q is greater than \mathcal{O}(q^2).

We’ll concern ourselves with the case where our graph is sparse, and we need to try and come up with an assignment for the vertices such that we have high correlation with the true labeling of vertices. How might we measure how well we solve this task? Ideally, a labeling which is identical to the true labeling (up to permutation) should get a score of 1. Conversely, a labeling which naively guesses that every vertex comes from the largest group \arg \max_{i \in [q]} n_{i} should get a score of 0.  Here’s one metric which satisfies these properties: if we come up with a labeling \{t_{i}\}, and the true labeling is \{q_i\}, then we’ll measure our performance by

 

Q(\{t_{i}\},\{q_{i}\}):=\max_{\pi \in S_q} \frac{\frac{1}{N}\sum_{i \in [N]} \delta_{t_i,\pi(q_i)}-max_{a \in [q]} n_a}{1-\max_{a \in [q]} n_a}

 

where we maximize over all permutations \pi. When we choose a labeling which (up to permutation) agrees with the true labeling, then the numerator of Q will equal the denominator, and Q=1. Likewise, when we trivially guess that every vertex belongs to the largest group, then the numerator of Q is 0 and Q=0.

Given \{c_{a,b}\},\{n_a\} and a set of observed edges E, we can write down the probability of a labeling \{q_i\} as
P(\{q_i\}_{i})=\prod_{\substack{(i,j) \not \in E \\ i \not = j}}(1-p_{q_{i},q_{j}})\prod_{\substack{(i,j) \in E }}p_{q_{i},q_{j}} \prod_{i \in [q]}n_{q_i}

How might we try to infer \{q_i\} such that we have maximum correlation (up to permutation) with the true labeling? It turns out that the answer is to use the maximum likelihood estimator of the marginal distribution of each q_i, up to a caveat. In particular, we should label q_i with the r \in [q] such that P(q_i=r) is maximized. The caveat comes in when P(\{q_i\}_{i}) is invariant under permutations of the labelings \{q_i\}_{i}, so that each marginal P(q_i) is actually the uniform distribution. For example, this happens in community detection, when all the group sizes n_{1},...,n_{q} are equal. In this case, the correct thing to do is to still use the marginals, but only after we have “broken the symmetry” of the problem by randomly fixing certain values of the vertices to have particular labels. There’s actually a way belief propagation does this implicitly: recall that we start belief propagation by randomly initializing the messages. This random initialization can be interpreted as “symmetry breaking” of the problem, in a way that we’ll see shortly.

(3.1) Belief Propagation

We’ve just seen from the previous section that in order to maximize the correlation of the labeling we come up with, we should pick the labelings which maximize the marginals of P. So we have some marginals that we want to compute. Let’s proceed by applying BP to this problem in the “sparse” regime where c_{a,b}=Np_{a,b}=\mathcal{O}(1) (other algorithms, like approximate message passing, can be used for “dense” graph problems). Suppose we’re given a random graph with edge list E. What does does graph associated with our probabilistic model look like? Well, in this case, every variable is actually connected to every other variable because P(\{q_i\}_{i}) includes a factor f_{i,j}(x_i,x_j) for every (i,j) \in [n] \times [n], so we actually have a complete graph. However, some of the connections between variables are much weaker than others. In full, our BP update equations are

eqs_2

Likewise
\psi^{i}_{t_{i}}=\frac{1}{Z^{i}} n_{t_{i}} \prod_{\substack{(i,k) \not \in E}}\left[ 1-\frac{1}{N}\sum_{t_k \in [q]}c_{t_1,t_k}\psi^{k \rightarrow i}_{t_k} \right]\prod_{\substack{(i,k) \in E}}\left[\sum_{t_k \in [q]}c_{t_i,t_k} \psi^{k \rightarrow i}_{t_k} \right]

 

what we want to do is approximate these equations so that we only have to pass messages along the edges E, instead of the complete graph. This will make our analysis simpler, and also allow the belief propagation algorithm to run more efficiently. The first observation is the following: Suppose we have two nodes j,j' \in [N] such that (i,j),(i,j') \not \in E. Then we see that \psi^{i \rightarrow j}_{t_{i}}=\psi^{i \rightarrow j'}_{t_{i}}+\mathcal{O}\left(\frac{1}{N}\right), since the only difference between these two variables are two factors of order (1-\mathcal{O}\left(\frac{1}{N}\right)) which appear in the first product of the BP equations. Thus, we send essentially the same messages to non-neighbours j of i in our random graph. In general though, we have:

eqs_3.png

The first approximation comes from dropping non-edge constraints on the first product, and is reasonable because we expect the number of neighbours of i to be constant. We’ve also defined a variable

h_{t_i}:=\frac{1}{N}\sum_{k \in [N]}\sum_{t_k \in [q]}c_{t_i,t_k} \psi^{k \rightarrow i}_{t_k}

and we’ve used the approximation e^{-x} \approx 1-x for small x. We think of the term h_{t_i} as defining an “auxiliary external field”. We’ll use this approximate BP equation to find solutions for our problem. This has the advantage that the computation time is \mathcal{O}(|E|) instead of \mathcal{O}(N^2), so we can deal with large sparse graphs computationally. It also allows us to see how a large dense graphical model with only sparse strong connections still behaves like a sparse tree-like graphical model from the perspective of Belief Propagation. In particular, we might have reason to hope that the BP equations will actually converge and give us good approximations to the marginals.

From now on, we’ll only consider factored block models, which in some sense represent a “hard” setting of parameters. These are models which satisfy the condition that each group has the same average degree c. In particular, we require

\sum_{d=1}^{q} c_{a,d} n_d = \sum_{d=1}^{q} c_{b,d}n_d =c

An important observation for this setting of parameters is that

\psi^{i \rightarrow j}_{t_i}=n_{t_i}

is always a fixed point of our BP equations, which is known as a factored fixed point (this can be seen by inspection by plugging the fixed point conditions into the belief propagation equations we derived). When BP ever reaches such a fixed point, we get that Q=0 and the algorithm fails. However, we might hope that if we randomly initialize \{\psi^{i \rightarrow j}\}_{(i,j) \in E}, then BP might converge to some non-trivial fixed point which gives us some information about the original labeling of the vertices.

(3.2) Numerical Results

Now that we have our BP equations, we can run numerical simulations to try and get a feel of when BP works. Let’s consider the problem of community detection. In particular, we’ll set our parameters with all group sizes n_{a} being equal, and with c_{a,a}=c_{in}>c_{out}=c_{a,b} for a \not = b and vary the ratio \epsilon:=c_{out}/c_{in}, and see when BP finds solutions which are correlated “better than guessing” to the original labeling used to generate the graph. When we do this, we get images which look like this:

convergence_and_overlap.png

It should be mentioned that the point at which the dashed red line occurs depends on the parameters of the stochastic block model. We get a few interesting observations from numerical experiments:

  1. The point at which BP fails (Q=0) happens at a certain value of \epsilon independent of N. In particular, when \epsilon gets too large, BP converges to the trivial factored fixed point. We’ll see in the next section that this is a “stable” fixed point for certain settings of parameters.
  2. Using the approximate BP equations gives the same performance as using the exact BP equations. This numerically justifies the approximations we made.
  3. If we try to estimate the marginals with other tools from our statistical inference toolbox, we find (perhaps surprisingly) that Markov Chain Monte Carlo methods give the same performance as Belief Propagation, even though they take significantly longer to run and have strong asymptotic guarantees on their correctness (if you run MCMC long enough, you will eventually get the correct marginals, but this may take exponential time). This naturally suggests the conjecture that belief propagation is optimal for this task.
  4. As we get closer to the region where BP fails and converges to the trivial fixed point, the number of iterations required for BP to converge increases significantly, and diverges at the critical point \epsilon_{c} where BP fails and converges to the factored fixed point. The same kind of behavior is seen when using Gibbs sampling.

(3.3) Heuristic Stability Calculations

How might we analytically try to determine when BP fails for certain settings of q and \{c_{a,b}\}? One way we might heuristically try to do this, is to calculate the stability of the factored fixed point. If the fixed point is stable, this suggests that BP will converge to a factored point. If however it is unstable, then we might hope that BP converges to something informative. In particular, suppose we run BP, and we converge to a factored fixed point, so we have that for all our messages \psi_{t_i}^{i \rightarrow j}=n_{t_i}. Suppose we now add a small amount of noise to some of the \psi_{t_i}^{i \rightarrow j}‘s (maybe think of this as injecting a small amount of additional information about the true marginals). We (heuristically) claim that if we now continue to run more steps of BP, either the messages will converge back to the fixed point \psi_{t_i}^{i \rightarrow j}=n_{t_i}, or they will diverge to something else, and whether or not this happens depends on the eigenvalue of some matrix of partial derivatives.

Following this idea, here’s a heuristic way of calculating the stability of the factored fixed point. Let’s pretend that our BP equations occur on a tree, which is a reasonable approximation in the sparse graph case. Let our tree be rooted at node k_0 and have depth d. Let’s try to approximately calculate the influence on k_0 of perturbing a leaf k_d from its factored fixed point. In particular, let the path from the leaf to the root be k_d,...,k_0. We’re going to apply a perturbation \psi^{k_d}_{t} = n_t+\epsilon_{t}^{k} for each t \in [q]. In vector notation, this looks like \psi^{k_d} = \mathbf{1}n_t+\epsilon^{k}, where \psi^{k_d} \in \mathbb{R}^{q \times 1} is a column vector. The next thing we’ll do is define the matrix of partial derivatives

eqs_4.png

Up to first order (and ignoring normalizing constants), the perturbation effect on \epsilon^{k_0} is then (by chain rule) \epsilon^{k_0} = \left[\prod_{i=0}^{d-1} T_{i} \right]\epsilon^{k_d}. Since T_{i} does not depend on i, we can write this as \epsilon^{k_0}=T^{d} \epsilon^{k_d} \approx \lambda^{d} \epsilon^{k_d}, where \lambda is the largest eigenvalue of T. Now, on a random tree, we have approximately c^{d} leaves. If we assume that the perturbation effect from each leaf is independent, and that \epsilon^{k_d} has 0 mean, then the net mean perturbation from all the leaves will be 0. The variance will be

eqs_5.png

if we assume that the cross terms vanish in expectation.

(Aside: You might want to ask: why are we assuming that \epsilon^{k_{d}} has mean zero, and that (say) the noise at each of the leaves are independent, so that the cross terms vanish? If we want to maximize the variance, then maybe choosing the \epsilon^{k_{d}}‘s to be correlated or have non-zero mean would give us a better bound. The problem is that we’re neglecting the effects of normalizing constants in this analysis: if we perturbed all the \psi^{k_d}_{t} in the same direction (e.g. non-zero mean), our normalization conditions would cancel out our perturbations.)

We Therefore end up with the stability condition c \lambda^{2} =1. When c \lambda^{2}>1, a small perturbation will be magnified as we move up the tree, leading to the messages moving away from the factored fixed point after successive iterations of BP (the fixed point is unstable). If c \lambda^{2}<1, the effect of a small perturbation will vanish as we move up the tree, we expect the factored fixed point to be stable. If we restrict our attention to graphs of the form c_{a,a}=c_{in}>c_{a,b}=c_{out} for a \not = b, and have all our groups with size n_{a}=\frac{1}{q}, then T is known to have eigenvalues \lambda_{1}=0 with eigenvector (1,...,1), and \lambda_{2}=(c_{in}-c_{out})/qc. The stability threshold then becomes |c_{in}-c_{out}|=q\sqrt{c}. This condition is known as the Almeida-Thouless local stability condition for spin glasses, and the Kesten-Stigum bound on reconstruction on trees. It is also observed empirically that BP and MCMC succeed above this threshold, and converge to factored fixed points below this threshold. The eigenvalues of T are related to the belief propagation equations and the backtracking matrix. For more details, see [3]

(3.4) Information Theoretic Results, and what we know algorithmically

We’ve just seen a threshold for when BP is able to solve the community detection problem. Specifically, when c<\frac{1}{\lambda^{2}}, BP doesn’t do better than chance. It’s natural to ask whether this is because BP is not powerful enough, or whether there really isn’t enough information in the random graph to recover the true labeling. For example, if c_{out} is very close to c_{in}, it might be impossible to distinguish between group boundaries up to random fluctuations in the edges.

It turns out that for q=2, there is not enough information below the threshold c<\frac{1}{\lambda^{2}} to find a labeling which is correlated with the true labeling [3]. However, it can be shown information-theoretically [1] that the threshold at which one can find a correlated labeling is c_{c} = \Theta \left(\frac{\log(q)}{q \lambda^2} \right). In particular, when q>11, there exists exponential time algorithms which recover a correlated labeling below the Kesten-Stigum threshold. This is interesting, because it suggests an information-computational gap: we observe empirically that heuristic belief propagation seems to perform as well as any other inference algorithm at finding a correlated labeling for the stochastic block model. However, belief propagation fails at a “computational” threshold below the information theoretic threshold for this problem. We’ll talk more about these kinds of information-computation gaps in the coming weeks.

 

 

References

[1] Jess Banks, Cristopher Moore, Joe Neeman, Praneeth Netrapalli,
Information-theoretic thresholds for community detection in sparse
networks. AJMLR: Workshop and Conference Proceedings vol 49:1–34, 2016.
Link

[2] Aurelien Decelle, Florent Krzakala, Cristopher Moore, Lenka Zdeborová,
Asymptotic analysis of the stochastic block model for modular networks and its
algorithmic applications. 2013.
Link

[3] Cristopher Moore, The Computer Science and Physics
of Community Detection: Landscapes, Phase Transitions, and Hardness. 2017.
Link

[4] Jonathan Yedidia, William Freeman, Yair Weiss, Understanding Belief Propogation and its Generalizations
Link

[5] Afonso Banderia, Amelia Perry, Alexander Wein, Notes on computational-to-statistical gaps: predictions using statistical physics. 2018.
Link

[6] Stephan Mertens, Marc Mézard, Riccardo Zecchina, Threshold Values of Random K-SAT from the Cavity Method. 2005.
Link

[7] Andrea Montanari, Federico Ricci-Tersenghi, Guilhem Semerjian, Clusters of solutions and replica symmetry breaking in random k-satisfiability. 2008.
Link

A big thanks to Tselil for all the proof reading and recommendations, and to both Boaz and Tselil for their really detailed post-presentation feedback.