
Theory Blog Aggregator Up!

December 20, 2018

The Theory of Computing Blog Aggregator is now back online at a new website: http://cstheory-feed.org/ . There is also a twitter feed at https://twitter.com/cstheory .

See this blog post by Suresh Venkatasubramanian (who, together with Arnab Bhattacharyya, is responsible for the aggregator’s revival – thank you!!) for more details. This is a good opportunity to thank Arvind Narayanan who created the software to run it and maintained it all these years.

If you don’t want to rely on the aggregator to follow Windows on Theory, you can use the “Follow Blog by email” button on our sidebar, and join the 590 other happy customers who don’t need to wait for the feed to get the latest lecture notes from our physics and computation seminar.

What is Quantum Hamiltonian Complexity?

December 20, 2018

by Ben Edelman

This is the first installment of a three-part series of posts on quantum Hamiltonian complexity based on lectures given by the authors in Boaz and Tselil’s seminar. The second installment is here, and the third installment is here.

Quantum Hamiltonian complexity is a growing area of study that has important ramifications for both physics and computation. Our hope is that these three posts will provide an accessible (and incomplete) preview of the subject for readers who know the basics of theoretical computer science and quantum information. Much of the material is adapted from an excellent survey by Gharibian et al.

In a nutshell, quantum Hamiltonian complexity is the study of the local Hamiltonian problem. Why is this problem important enough to justify the existence of an entire subfield? To illustrate why, here are two informal characterizations of it:

  1. To a physicist, the local Hamiltonian problem is a formalization of the difficulty of simulating and understanding many-particle quantum systems. There are deep connections between the complexity of this problem and the amount of quantum entanglement in a system. In practical terms, physicists would love to be able to solve this problem on a regular basis, and they’ve developed a rich theory of heuristics to that end.
  2. To a computer scientist, the local Hamiltonian problem is the quantum version of constraint satisfaction problems. Any CSP can be written as a local Hamiltonian problem; and just as constraint satisfaction is the prototypical NP-complete problem by the Cook-Levin theorem, the local Hamiltonian problem plays the equivalent role for QMA (a quantum analogue of NP) by the “quantum Cook-Levin theorem.” The connections to classical complexity go on… there is even a quantum PCP conjecture!

But let’s take a step back and start at the beginning. To make sure we understand what a quantum Hamiltonian is and why it is important, it will be instructive to briefly rehash some of the fundamentals of classical statistical mechanics.

Classical energy and ground states

In the classical world, a physical system can be in any one of various states x \in \mathcal{X}, each of which is a vector, with different coordinates representing different particles. Every state of the system has an energy, given by an energy function E: \mathcal{X} \to \mathbb{R}. For example, in the classic Ising model of ferromagnetism, \mathcal{X} = \{\pm 1\}^n. Each coordinate x_i represents the spin of atom i, and atoms i and j interact with each other whenever (i,j) is an edge in a graph G, which is usually a low-dimensional lattice. Energy for this system is defined as E(x) = \sum_{(i,j) \in G}-x_i x_j.

Suppose we ignore our system for a long time, letting it interact with its external environment until, in the limit, it reaches thermal equilibrium at temperature T. Then the probability the system is in state x is given by Boltzmann’s distribution: \Pr[x] = \frac{e^{-\beta E(x)}}{Z}, where \beta \propto 1/T and Z is the partition function required to normalize the probabilities. As the temperature tends to infinity, this distribution will approach the uniform distribution over \mathcal{X}, and as the temperature tends to absolute zero, the distribution will approach the uniform distribution over the states with minimum energy. We call these minimum energy states ground states, and we call their energy the ground state energy. If we want to calculate something about a system, then it is often crucial to know the ground states and ground state energy of the system. Going back to our example, the Ising model has two ground states whenever G is connected. These are the states (+1,+1,\ldots,+1) and (-1,-1,\ldots,-1) in which all atoms have the same spin. The ground state energy is -|\{i,j:(i,j) \in G\}|.
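
As a concrete illustration of the Ising example above, here is a minimal Python sketch, assuming a small 4-cycle as the interaction graph (our own toy choice) so that brute-force enumeration over all 2^n states is feasible:

    import itertools
    import numpy as np

    # Ising model on a 4-cycle: E(x) = sum over edges (i,j) of -x_i x_j
    edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
    n = 4

    def energy(x):
        return sum(-x[i] * x[j] for i, j in edges)

    states = list(itertools.product([-1, +1], repeat=n))
    energies = np.array([energy(x) for x in states])

    # Boltzmann distribution at inverse temperature beta; Z normalizes the weights
    beta = 1.0
    weights = np.exp(-beta * energies)
    probs = weights / weights.sum()

    # Ground states: the all -1 and all +1 configurations, with energy -|edges|
    ground_energy = energies.min()
    ground_states = [s for s, e in zip(states, energies) if e == ground_energy]
    print(ground_energy, ground_states)   # -4, [(-1,-1,-1,-1), (1,1,1,1)]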

Quantum Hamiltonians

A quantum Hamiltonian is essentially the quantum analogue of the classical energy function. Unlike with classical systems, when a quantum system is in a given n-qubit state \left|\psi\right\rangle, it doesn’t have a determinate energy. Instead, when we measure the energy, the value we obtain may be probabilistic and will correspond to one of the eigenvalues of the observable matrix for energy. This Hermitian matrix, denoted H \in (\mathbb{C}^2)^{\otimes n} \times (\mathbb{C}^2)^{\otimes n}, is the quantum Hamiltonian, and just as the energy function characterizes a classical system, the Hamiltonian characterizes a quantum system. For a given eigenvector |\lambda_i\rangle of H with eigenvalue \lambda_i, when we measure the energy of |\psi\rangle we obtain the result \lambda_i with probability |\langle\lambda_i|\psi\rangle|^2, and the system collapses to the state |\lambda_i\rangle (assuming the eigenvalue \lambda_i has multiplicity 1). Thus, the ground state energy and ground state of a quantum system with Hamiltonian H are the minimum eigenvalue \lambda_0 of H and the corresponding eigenvector |\lambda_0\rangle.

The Boltzmann distribution also has a quantum analogue. A quantum system at thermal equilibrium will be in the following mixed state: \rho_{\text{eq}} = \frac{e^{-\beta H}}{Z}. As the temperature approaches absolute zero, \rho_{\text{eq}} will approach the uniform mixture over the ground states.

Not only does the Hamiltonian tell us the energy of a system, it also describes the time evolution of the system (as long as it is closed). Schrödinger’s equation states that i \hbar \frac{d|\psi\rangle}{dt} = H|\psi\rangle, where \hbar is the reduced Planck constant and t is time. Thus, if a closed system is in the state |\psi(0)\rangle at time 0, its state at time t will be |\psi(t)\rangle = e^{-itH/\hbar}|\psi(0)\rangle. Since H is Hermitian, e^{-itH/\hbar} is unitary, which is another way of saying that quantum mechanical states are subject to unitary evolution.
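
The following Python sketch illustrates these points numerically on a toy two-qubit Hamiltonian of our own choosing (not one from the post): the ground state energy is the smallest eigenvalue of H, measurement probabilities are squared overlaps with eigenvectors, and the time evolution e^{-itH/\hbar} is unitary (we set \hbar = 1).

    import numpy as np
    from scipy.linalg import expm

    # Pauli matrices
    I2 = np.eye(2)
    Z = np.diag([1.0, -1.0])
    X = np.array([[0.0, 1.0], [1.0, 0.0]])

    # Toy 2-qubit Hamiltonian (an arbitrary Hermitian choice for illustration)
    H = -np.kron(Z, Z) - 0.5 * (np.kron(X, I2) + np.kron(I2, X))

    # Ground state energy / ground state: smallest eigenvalue and its eigenvector
    evals, evecs = np.linalg.eigh(H)
    print("ground state energy:", evals[0])

    # Measuring the energy of |psi> yields lambda_i with probability |<lambda_i|psi>|^2
    psi = np.zeros(4); psi[0] = 1.0                 # the state |00>
    probs = np.abs(evecs.conj().T @ psi) ** 2
    print("outcome probabilities:", probs, "sum =", probs.sum())

    # Closed-system time evolution |psi(t)> = exp(-i t H) |psi(0)>, with hbar = 1
    t = 1.3
    U = expm(-1j * t * H)
    print("U unitary:", np.allclose(U.conj().T @ U, np.eye(4)))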

The Local Hamiltonian problem

As we have seen, understanding the Hamiltonian of a quantum system is crucial for understanding both the system’s equilibrium behavior and its time evolution. There are a huge variety of questions physicists are interested in asking about systems, all of which boil down to questions about equilibrium behavior, time evolution, or both. There is a single problem that captures the complexity of many of these questions, in the sense that most of the questions can’t be answered without solving it. This is the problem of estimating the ground state energy of the Hamiltonian. Especially in condensed matter physics, this problem is ubiquitous.

Formally, we will study the following promise problem: (note: this will not be our final formulation)


The “Hamiltonian Problem”

Given a Hermitian matrix H \in (\mathbb{C}^2)^{\otimes n} \times (\mathbb{C}^2)^{\otimes n} and non-negative reals a, b with b \geq a+1,

  • If \lambda_0(H) \leq a, output YES
  • If \lambda_0(H) \geq b, output NO


One issue with this definition is that the input includes an enormous 2^n \times 2^n matrix. For a reasonable-sized system, there’d be no use in even trying to solve this problem through classical computation, and how to deal with it in the quantum computing setting is far from obvious. Luckily, physicists have found that in real-life systems, interactions tend to be local, and if we consider the special case of local Hamiltonians, the input for the problem is of reasonable size.

[figure]

Hamiltonians are too big to work with. What if we restrict our focus to local Hamiltonians?

A k-local Hamiltonian is a Hamiltonian that is decomposed into a sum of terms, each of which represents a Hamiltonian acting on a subset of k of the n qubits in the system. In other words, H = \sum_i (H_i)_{S_i} \otimes I_{[n]\backslash S_i}, where each S_i is a subset of [n] of size k. For brevity’s sake, we abuse notation and write H as \sum_i H_i. We can think of the H_i’s as local constraints, and the ground state as the state that simultaneously satisfies the constraints to the maximal possible extent. Here, then, is the new-and-improved problem definition:


k-Local Hamiltonian Problem

Given a Hamiltonian H \in (\mathbb{C}^2)^{\otimes n} \times (\mathbb{C}^2)^{\otimes n} specified as a collection of r local interactions \{H_i\}, and non-negative reals a, b with b \geq a+1,

  • If \lambda_0(H) \leq a, output YES
  • If \lambda_0(H) \geq b, output NO

Assuming the matrices \{H_i\} and the reals a, b are specified to polynomial precision, the input size is polynomial in n, since k is a constant and each of the matrices H_i has 2^k \cdot 2^k entries. Thus, not only is our new problem physically realistic, it is also a problem we might hope to attack with classical computation. However, we will later see that in fact this problem is likely hard even for quantum computers. The remaining installments in this series of notes will deal with further restrictions of the class of Hamiltonians for which the local Hamiltonian problem may be tractable.
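
To make the decomposition H = \sum_i H_i concrete, here is a small Python sketch (a toy 2-local example of our own choosing, not from the post) that embeds each 2^k x 2^k term into the full 2^n-dimensional space by tensoring with identities, and then reads off \lambda_0(H):

    import numpy as np

    def embed(term, first_site, k, n):
        """Tensor a 2^k x 2^k local term acting on qubits
        first_site, ..., first_site + k - 1 with identities on the rest."""
        left = np.eye(2 ** first_site)
        right = np.eye(2 ** (n - first_site - k))
        return np.kron(np.kron(left, term), right)

    n, k = 4, 2
    Z = np.diag([1.0, -1.0])
    # Toy 2-local term penalizing disagreement of neighboring qubits
    local_term = 0.5 * (np.eye(4) - np.kron(Z, Z))

    H = sum(embed(local_term, i, k, n) for i in range(n - 1))
    print("ground state energy:", np.linalg.eigvalsh(H)[0])   # 0: frustration free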

Computer science motivation

As we mentioned in the intro, the k-local Hamiltonian problem (henceforth denoted k-LH) doesn’t just have myriad applications in physics—it is also important from a computer science perspective because it is a quantum generalization of constraint satisfiability (you may have noticed that quantum analogues of classical concepts are a running theme). Specifically, k-CSP is a special case of k-LH.

Suppose we have a k-CSP instance \varphi, and we want to turn it into a k-LH instance. A clause C with constituent variables x_1, \ldots, x_k becomes a 2^k \times 2^k diagonal \{0,1\} matrix H_C acting on the qubits |x_1\rangle,\ldots,|x_k\rangle. Note that the rows and columns of this matrix are indexed by the assignment vectors x \in \{0,1\}^k. Formally, H_C encodes the truth table of C in the following manner: (H_C)_{x,x} = 1 - C(x). Another way of stating this is H_C = \sum_{x \in \{0,1\}^k\text{ s.t. }C(x)=0}|x\rangle\langle{x}|.

Informally, H_C takes the clauses of \varphi and turns them into local quantum interactions. We’ve constructed H_C so that it has two eigenvalues: 0 and 1. The eigenspace corresponding to 0 is spanned by the set of computational basis vectors |x\rangle that satisfy C, and the eigenspace corresponding to 1 is spanned by the computational basis vectors that don’t satisfy C. In effect, when we consider H_C as a term of H, we are giving an energy penalty to any variable assignment that doesn’t satisfy C. H = \sum_{C}H_C will have the eigenvalue 0 (in other words, a ground state energy of 0) if and only if there is some assignment of the variables x_1,\ldots,x_n that satisfies all of the clauses (in other words, iff \varphi is satisfiable). Otherwise, the ground state energy of H will be at least 1, so determining whether \varphi is satisfiable is equivalent to solving k-LH with inputs a = 0, and b = 1. (In fact, k-LH generalizes MAX-k-CSP, since the ground state energy of H is exactly the number of clauses minus the maximum number of satisfiable clauses.)
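
Here is a minimal Python sketch of this reduction on a hypothetical two-variable CSP: each clause becomes a diagonal matrix with (H_C)_{x,x} = 1 - C(x), and the ground state energy of the sum is 0 exactly when the instance is satisfiable.

    import itertools
    import numpy as np

    def clause_hamiltonian(clause, k):
        """Diagonal 2^k x 2^k matrix with (H_C)_{x,x} = 1 - C(x): it penalizes
        exactly the assignments that violate the clause C."""
        diag = [0.0 if clause(x) else 1.0
                for x in itertools.product([0, 1], repeat=k)]
        return np.diag(diag)

    # Hypothetical 2-variable CSP with clauses (x0 OR x1) and ((NOT x0) OR x1);
    # both clauses act on the same two qubits, so no embedding is needed here.
    c1 = lambda x: bool(x[0] or x[1])
    c2 = lambda x: bool((not x[0]) or x[1])
    H = clause_hamiltonian(c1, 2) + clause_hamiltonian(c2, 2)

    print("ground energy:", np.linalg.eigvalsh(H)[0])   # 0, since x1 = 1 satisfies both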


A 1D Area Law for Gapped Local Hamiltonians

December 18, 2018

(This post is based on part of a lecture delivered by Boriana Gjura and Prayaag Venkat. See also posts by Ben Edelman and Fred Zhang for more context on Quantum Hamiltonian Complexity.)

Introduction

In this post we present the Area Law conjecture and rigorously prove its one-dimensional version,
emphasizing the emergence of approximate ground state projectors as a
useful and promising tool.

The difficulty of understanding many-body physics lies in the dichotomy
between the power of quantum many-body systems on the one hand, and the fact that even
describing a general quantum state requires exponential resources on the other.
A natural way to proceed is to ask if the special class of physically
relevant states admits succinct descriptions. As seen in a previous
post, the QMA-completeness of the k-local Hamiltonian problem suggests
that even ground states of local Hamiltonians do not admit such
descriptions. Looking at these ground states, the following questions
arise naturally.

  • When do the ground states of local Hamiltonians have a special structure?
  • When does that structure allow for a meaningful short description?
  • When does that structure allow us to compute their properties?

Informally, the answer to these questions is yes for (gapped)
1D systems, and it is an open problem to answer them in higher
dimensions.

The Area Law

The Area Law is inspired by the Holographic principle in Cosmology,
which informally states that the total information in a black hole
resides on the boundary. That is, the complexity of the system depends
on the size of its boundary, not its volume.

Preliminaries

For the sake of clarity, we assume that H = \sum_{i=1}^m H_i is a
2-local Hamiltonian on qubits, each H_i is a
projection (H_i^2 = H_i, so 0 \leq \| H_i \| \leq 1), H is frustration free (meaning that
\lambda_0(H), the lowest eigenvalue, is zero), and there is a unique
ground state | \psi_0 \rangle.

These conditions simplify our proof. It is easy to generalize to
arbitrary terms with 0 \leq \| H_i \| \leq 1 and to relax to q-local Hamiltonians on qudits (dimension-d
particles). The most significant assumption is that H is
frustration-free: more work is necessary to make the proof work
otherwise.

To formalize our conjecture, we define a notion of von Neumann entropy
below.

Definition: Consider any state | \psi \rangle lying in a bipartite Hilbert space
\mathcal{H}_A \otimes \mathcal{H}_B, and perform a Schmidt
decomposition to obtain
| \psi \rangle = \sum_{i =1}^{d} \lambda_i | u_i \rangle_A| v_i \rangle_B.
The Von Neumann entanglement entropy of | \psi \rangle across region A
is defined as
\displaystyle S(A)_{| \psi \rangle} = \sum_i \lambda_i^2 \ln \frac{1}{\lambda_i^2}.
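
As a quick numerical sketch of this definition (our own toy check, not from the original notes): the Schmidt coefficients across a bipartition are the singular values of the state reshaped into a \dim\mathcal{H}_A \times \dim\mathcal{H}_B matrix, so the entropy can be computed as follows.

    import numpy as np

    def entanglement_entropy(psi, dim_A, dim_B):
        """Von Neumann entanglement entropy across the A|B cut.
        The Schmidt coefficients are the singular values of psi reshaped
        to a dim_A x dim_B matrix."""
        M = psi.reshape(dim_A, dim_B)
        s = np.linalg.svd(M, compute_uv=False)
        p = s ** 2
        p = p[p > 1e-12]
        return float(-(p * np.log(p)).sum())

    # Product state |0>|0>: zero entropy
    prod = np.kron([1.0, 0.0], [1.0, 0.0])
    # Maximally entangled pair (|00> + |11>)/sqrt(2): entropy ln 2
    bell = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)

    print(entanglement_entropy(prod, 2, 2))   # ~0
    print(entanglement_entropy(bell, 2, 2))   # ~0.693 = ln 2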

Why is this a good notion of entropy? If
| \psi \rangle = | L \rangle \otimes | R \rangle, then
S_{| \psi \rangle} = 0, i.e. the product state has no entanglement. On
the other hand, the maximally entangled state
| \psi \rangle = \sum_{i = 1}^D \frac{1}{\sqrt{D}} | i \rangle_A \otimes | i \rangle_B
has entropy S_{| \psi \rangle} = \ln(D). In general, we have that

\displaystyle S_{| \psi \rangle} \leq \ln(\min(\dim \mathcal{H}_A, \dim \mathcal{H}_B)).
This is a direct quantum generalization of the fact that a classical probability
distribution on D elements has entropy at most \ln D, and this is
achieved by the uniform distribution.
We can now state the Area Law.

Conjecture: Let | \psi_0 \rangle be the ground state of H, as defined above.
Then for any region A (i.e. a subset of qubits),

\displaystyle S(A)_{| \psi_0 \rangle} \leq \ln (\dim \mathcal{H}_{\partial A}),
where \mathcal{H}_{\partial A} is the Hilbert space of the boundary \partial A of the region A,
i.e. the set of qubits in A that interact
through H with at least one qubit outside of region A.

This conjecture is illustrated by the following figure.

[figure]
We prove the 1-D version of this conjecture. Note that for a 1D system
(i.e. qubits on a line with nearest neighbor interactions) it suffices
to consider bipartitions of the Hilbert space defined by a “cut” (see
the figure below). By a cut, we mean a partition of the qubits into two
sets consisting of all the qubits to the left and right, respectively,
of a fixed qubit i^*.

[figure]

Theorem: Let | \psi_0 \rangle be as above. Then for any cut C = (i^*, i^*+1),
\displaystyle S(C)_{| \psi_0 \rangle} \leq O(1).

Approximate Ground State Projectors

In the case where all H_i commute, consider the operator
A = \prod_i (I - H_i). This operator will project any product state
onto the ground state. Therefore, we can get a representation of the
ground state as a matrix product state (as we saw in a previous post on
tensor networks). Can we generalize this idea when the H_i do not
necessarily commute?
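
As a sanity check of the commuting case just described (before generalizing), here is a toy Python computation. Note that this particular example, chosen by us, has a two-dimensional ground space rather than the unique ground state assumed above; it only illustrates how \prod_i (I - H_i) projects onto the ground space.

    import numpy as np

    # Toy frustration-free, commuting example on 3 qubits: H_i penalizes
    # disagreement of qubits (i, i+1); each term is a projection.
    Z = np.diag([1.0, -1.0])
    def disagree(i, n=3):
        P = 0.5 * (np.eye(4) - np.kron(Z, Z))
        return np.kron(np.kron(np.eye(2 ** i), P), np.eye(2 ** (n - i - 2)))

    terms = [disagree(0), disagree(1)]
    A = np.eye(8)
    for Hi in terms:
        A = A @ (np.eye(8) - Hi)       # A = prod_i (I - H_i)

    print("A is a projector:", np.allclose(A @ A, A))
    print("ground space dimension:", int(round(np.trace(A))))      # 2: span{|000>, |111>}
    print("lambda_0 of H:", np.linalg.eigvalsh(sum(terms))[0])     # 0: frustration free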

A natural idea is to define an operator that approximately projects a
state onto the vector we want (the ground state), while shrinking it on
the other directions. We also want this operator to not increase
entanglement too much. The proof of the area law will involve
constructing a low-rank tensor approximate factorization of the ground
state by starting with a factorized state, which has a good overlap with
the ground state, and then repeatedly applying such an operator.

[figure]

We first give a formal definition of an approximate ground state
projector. In the following, under the assumption of the existence of
such an object, we will prove the area law. At the end, we will show how
to construct it.

Definition: K is a (D,\Delta)-approximate ground state projector (AGSP) if:
1. K | \psi_0 \rangle = | \psi_0 \rangle
2. For every | \Gamma \rangle such that
\langle \Gamma | \psi_0 \rangle = 0, we have
\| K | \Gamma \rangle \|^2 \leq \Delta \| | \Gamma \rangle \|^2.
3. K has entanglement rank at most D. That is, K admits a Schmidt
decomposition of the form K = \sum_{i=1}^{D} L_i \otimes R_i,
where L_i and R_i act only on the particles to the left and
right of the cut, respectively.

The first step of the proof of the Area Law is to find a product state
that has a good overlap with the ground state. That is, it should have a
constant overlap that is not dependent on n, the number of particles
in the system.

Lemma 1: Suppose there exists a (D, \Delta)-AGSP such that
D \Delta \leq \frac{1}{2}. Fix a partition (A, \bar{A}) of the space
on which the Hamiltonian acts. Then there exists a product state
| \phi \rangle = | L \rangle_A \otimes | R \rangle_{\bar{A}} such that
\displaystyle \langle \phi | \psi_0 \rangle = \mu \geq \frac{1}{\sqrt{2D}}

Proof:
Let | \phi \rangle be a product state with the largest overlap with
| \psi_0 \rangle, i.e. it maximizes
\langle \phi | \psi_0 \rangle = \mu, and write
\displaystyle | \phi \rangle = \mu | \psi_0 \rangle + \sqrt{1 - \mu^2} | \psi^{\perp} \rangle,
where | \psi^{\perp} \rangle is some state orthogonal to the ground state. We apply
K to get
\displaystyle K | \phi \rangle = \mu | \psi_0 \rangle + \delta | \psi' \rangle
where |\delta|^2 \leq \Delta and | \psi' \rangle is normalized. By
the properties of the operator K, the Schmidt decomposition
K | \phi \rangle = \sum_{i=1}^{D} \lambda_i | L \rangle_i | R \rangle_i has at most D terms. Using the Cauchy-Schwarz
inequality:
\displaystyle \begin{aligned} \mu = \big| \langle \psi_0 | K | \phi \rangle \big| = \big| \sum_i \lambda_i \langle \psi_0 | \big( | L \rangle_i | R \rangle_i \big) \big| &\leq \sqrt{\sum_i \lambda_i ^2} \sqrt{D} \max_i \big| \langle \psi_0 | \big( | L \rangle_i | R \rangle_i \big) \big| \\ &\leq \sqrt{\mu^2 + \Delta} \sqrt{D} \max_i \big| \langle \psi_0 | \big( | L \rangle_i | R \rangle_i \big) \big|\end{aligned}

Therefore there exists a product state | L \rangle_i | R \rangle_i such that
\mu \geq | \langle \psi_0 | \big( | L \rangle_i | R \rangle_i \big) | \geq \frac{\mu}{\sqrt{D(\mu^2 +\Delta)}},
which implies D(\mu^2 + \Delta) \geq 1; since D \Delta \leq \frac{1}{2}, this gives \mu^2 \geq \frac{1}{2D}, as desired. \Box

Starting with a product state with a big enough overlap with the ground
state, the proof of the next lemma shows how we can repeatedly apply
AGSPs to bound the entropy across the cut.

Lemma 2: Suppose there exists a (D, \Delta)-AGSP such that
D \Delta \leq \frac{1}{2}, and a product state
| \phi \rangle = | L \rangle_A \otimes | R \rangle_{\bar{A}} such that
\langle \phi | \psi_0 \rangle = \mu. Then
\displaystyle S_{| \psi_0 \rangle} \leq O(1) \frac{\log \mu}{\log \Delta} \log{D}

But before we prove this, let’s state an intuitive result of Eckart and
Young, which provides an upper bound for the overlap of two states given
the Schmidt rank of one.

Lemma (Eckart-Young): Let | \psi \rangle \in L \otimes R be a normalized
vector with Schmidt decomposition
| \psi \rangle = \sum_i \lambda_i | u_i \rangle| v_i \rangle, where
\lambda_1 \geq \lambda_2 \geq \dots. Then for any normalized
| \phi \rangle \in L \otimes R with Schmidt rank at most D, it holds that
\displaystyle | \langle \psi | \phi \rangle| \leq \sqrt{\sum_{i=1}^{D} \lambda_i^2}

Using this result, we continue the proof.

Proof of Lemma 2:
Write the (D, \Delta)-AGSP as K. Given a state | \phi \rangle, we
denote by | \phi_l \rangle the normalized state after l applications of the
AGSP, namely
\displaystyle | \phi_l \rangle = \frac{K^l | \phi \rangle}{\|K^l | \phi \rangle\|}.
By the way we’ve defined K, the Schmidt rank of | \phi_l \rangle
grows by at most a factor of D per application, and this state has an improved overlap with
the ground state, that is:
\displaystyle SR(| \phi_l \rangle) \leq D^l, \quad \langle \psi_0 | \phi_l \rangle \geq \frac{\mu}{\sqrt{\mu^2 + \Delta^l(1 - \mu^2)}}.

Let | \psi_0 \rangle = \sum_i \lambda_i | L_i \rangle| R_i \rangle be
the Schmidt decomposition of the ground state relative to the cut
(i^*, i^*+1), where \lambda_1 \geq \lambda_2 \geq \dots. The
Eckart-Young theorem gives:
\displaystyle \sum_{i=1}^{D^l} \lambda_i^2 \geq | \langle \psi_0 | \phi_l \rangle|^2 \geq \frac{\mu^2}{\mu^2 + \Delta^l (1 - \mu^2)}
from which
\displaystyle \sum_{i> D^l} \lambda_i^2 \leq 1 - \frac{\mu^2}{\mu^2 + \Delta^l (1 - \mu^2)} \leq \frac{\Delta^l}{\mu^2}

We choose
l_0 = 2 \frac{\log \mu}{\log \Delta} - \frac{\log 2}{\log \Delta} such
that \frac{\Delta^{l_0}}{\mu^2} \leq \frac{1}{2}. Let’s now bound the
worst case entropy across the cut. The first D^{l_0} Schmidt
coefficients account for an entropy of at most l_0 \log D. For the
remaining coefficients, we group them into
intervals [D^{kl_0} + 1, D^{(k+1)l_0}] indexed by k \geq 1. For each of
these intervals, the corresponding entropy can be upper bounded by
\displaystyle \frac{\Delta^{kl_0}}{\mu^2} \log D^{(k+1)l_0} \leq l_0 \frac{1}{2^k} (k+1) \log D
where \frac{\Delta^{kl_0}}{\mu^2} \leq \frac{1}{2^k} is an upper bound on the total
probability mass in the interval, and a set of coefficients of total mass at most 1
supported on at most D^{(k+1)l_0} terms contributes entropy at most its mass times
\log D^{(k+1)l_0}, up to lower-order terms absorbed in the O(1) below.

Therefore, the total entropy is
\displaystyle \begin{aligned} S((i^* ,i^* +1)) &\leq l_0 \log D + \sum_{k \geq 1} l_0\frac{1}{2^k}(k+1) \log D \\ &\leq l_0 \log D + l_0 \log D \sum_{k \geq 1} \frac{1}{2^k}(k+1) \\ &\leq O(1) l_0 \log D \\ &\leq O(1) \frac{\log \mu}{\log \Delta}\log D\end{aligned}

This bound depends on D and \Delta. This is sufficient to imply the
Area Law because our AGSP constructions will have constant D and
\Delta. \Box

AGSP Construction

In this section, our goal is to construct an AGSP with the correct
tradeoff in parameters. We briefly recall the intuition behind the
properties that an AGSP must satisfy. First, the AGSP should leave the
ground state untouched. Second, it shrinks states which are orthogonal
to the ground state. Third, it should not increase
across the cut by too much (see the figure below).

A depiction of the third property in the AGSP definition for a 1D
system: K is a sum of tensor products of operators, each acting only on one
“side” of the system.

[figure]
We are interested in understanding a construction of an AGSP for three
reasons:

  1. As we saw before, we can prove a 1D area law assuming the existence
    of an AGSP with good parameters.
  2. AGSPs have been used as a building block in a provably polynomial-time
    algorithm for finding ground states of gapped 1D local Hamiltonians.
  3. They appear to be a general purpose tool with potential applications
    in other contexts.

Details of the construction

One way to think about the first two properties in the AGSP
definition is in terms of the relationship between the eigenvalues of K and
the eigenvalues of H (see the figure below). The first property says
that the ground state of H should be an eigenvector of K with
eigenvalue 1. The second property says that all other eigenvectors of
H should be mapped to eigenvectors of K with eigenvalue at most
\sqrt{\Delta}.

In this plot, the horizontal axis represents the eigenvalues of H
and the vertical axis represents the eigenvalues of K. The ground
state of H, which corresponds to eigenvalue 0, should remain fixed
upon applying K to it.

[figure]
In the following, we will aim to construct an AGSP whose tradeoff in
parameters allows us to achieve the following result.

Theorem: S_{| \psi_0 \rangle}(A) = O(\frac{\log^2 d}{\epsilon}).

Attempt 1

As a first attempt, we will evaluate the quality of the following
operator:
\displaystyle K = (1-\frac{H}{\| H \|})^l,
where l is some natural
number that we will decide upon later. We now check the three properties
for this particular choice of K.

First, it is clear that K does not change the ground state, since the
ground state is an eigenvector of H with eigenvalue 0
(frustration-free case). Second, because any other eigenvalue of H is
at least \epsilon (the energy gap), we have that
\Delta \leq (1-\frac{\epsilon}{\| H \|})^l \approx e^{-\frac{\epsilon}{\| H \|} l}.
Third, for the time being, let us think of l as roughly corresponding
to D (we will formalize this later).

Unfortunately, these values \Delta and D will not allow us to
achieve the desired tradeoff of D \Delta \leq \frac{1}{2}. To see why
this is, observe that because we have n-1 local Hamiltonians, the term
\| H \| could be on the order of n (recall that
0 \leq \| H_i \| \leq 1), so we would need to choose l proportional
to n to reduce \Delta to a constant (see figure below). Unfortunately, this will result in a
corresponding increase in D, violating D \Delta \leq \frac{1}{2}.

The polynomial corresponding to the first attempt at an AGSP is depicted in
blue. It does not decay at a fast enough rate, so the
second attempt will replace it with another polynomial, depicted in
red, which decays much faster.

[figure]

Attempt 2

In summary, the two weaknesses of the previous approach we need to
address are:

  1. The large \| H \| term in the exponent substantially reduces the
    shrinking effect of applying the AGSP. We will address this using the
    technique of Hamiltonian truncation.
  2. For a fixed l (roughly corresponding to the entanglement rank of
    K), the function f(x) = (1 - \frac{x}{\| H \|})^l does not decay
    fast enough after the point x = 0 (which is the ground state
    energy). We will address this using Chebyshev polynomials.

As discussed before, the \| H \| term is large because it involves
contribution from all n-1 local Hamiltonians. Intuitively, we might
believe that only the local Hamiltonians near the site of the cut itself
should play a large role in the entanglement of the ground state across
the cut. Specifically, if we write
H = H_1 + \ldots + H_{n-1} = H_L + H_{i_1} + \ldots + H_{i_s} + H_R, where
H_{i_j} are the s local Hamiltonians neighboring the cut and H_L
and H_R are the “aggregated” Hamiltonians of all the remaining local
Hamiltonians to the left and right, respectively, of H_{i_1} and
H_{i_s}, then we are asserting that H_L and H_R contribute
significantly to \| H \| but are not important when trying to
characterize the ground state (see figure below).

We keep track of the local Hamiltonians corresponding to the s
qudits directly surrounding the cut. The remaining local Hamiltonians
are aggregated into the Hamiltonians H_L and H_R, whose eigenvalues
will be truncated to define H'.

[figure]
Mathematically, if we let
H' = H^{\leq t}_L + H_{i_1} + \ldots + H_{i_s} + H^{\leq t}_R, where
H^{\leq t} denotes the operator obtained by truncating all eigenvalues
of the corresponding Hamiltonian to be at most t, we claim the following:

Claim:
1. \| H' \| \leq s + 2t
2. | \psi_0 \rangle is a ground state of H'.
3. H' has gap \epsilon' = \Omega (\epsilon)

It is straightforward to verify the first two claims; we omit the
proof of the third claim. Intuitively, we have chosen H' to be an
approximation of H which has much smaller norm.

Next, we come to the task of designing a degree-l polynomial p which
decays faster than our first attempt f(x) = (1 - \frac{x}{\| H \|})^l
after the point x=0. In other words, we would like p to satisfy
p(0) = 1 (fixing the ground energy) and map the interval
[\epsilon', \| H' \|] to the interval [0,\Delta] for as small a
\Delta as possible. It turns out that the degree-l Chebyshev
polynomial (of the first kind) is precisely the polynomial which has
this behavior (see the appendix for a review of Chebyshev polynomials).
Letting T_l denote the degree-l Chebyshev polynomial, we take our
polynomial to be a scaled and shifted version of it:
\displaystyle p(x) = T_l\left(\frac{\| H' \| + \epsilon' - 2x}{\| H' \| - \epsilon'}\right)/T_l\left(\frac{\| H' \| + \epsilon'}{\| H' \| - \epsilon'}\right)
Given this definition of p, we take our new candidate AGSP to be
K = p(H'), which has the following guarantees:

Claim:
1. K | \psi_0 \rangle = | \psi_0 \rangle
2. K achieves
\Delta = O\left(\exp\left(-\sqrt{\frac{\epsilon'}{\| H' \|}}\, l\right)\right)
3. K achieves D = O(d^{\frac{l}{s} + s})

The first claim is straightforward to verify. The proof of the second
claim is omitted; it involves calculations making use of standard
properties of Chebyshev polynomials. We remark that the dependence of
the exponent of \Delta on \| H' \| has been improved by a quadratic
amount, compared with the first attempt at an AGSP. We now sketch the
main ideas behind the proof of the third claim. After that, we will
choose values of s and t so that the desired tradeoff D \Delta is
achieved.

We have defined K as a degree-l polynomial in H', so we may write
it as: \displaystyle K = \sum_{j=0}^{l} c_j (H')^j, where the c_j are coefficients
whose particular values are not relevant to the analysis of the Schmidt
rank of K across the given cut. It is straightforward to check that
the Schmidt rank of K is bounded by the sum of the ranks of the
monomials (H')^j (subadditivity). So, it suffices to analyze the
Schmidt rank of the monomial of the largest degree, (H')^l.

From our expression for H', we can expand this power to get:
\displaystyle (H')^l = (H^{\leq t}_L + H_{i_1} + \ldots + H_{i_s} + H^{\leq t}_R)^l = \sum_{j_1,\ldots,j_l} H_{j_1} \cdots H_{j_l},
where the summation is over all products of l Hamiltonians from the
set \{H^{\leq t}_L, H_{i_1}, \ldots H_{i_s}, H^{\leq t}_R\}. Now, one
natural approach to proceed is to analyze the Schmidt rank of each term
in the summation, and then sum over all the terms. Unfortunately, this
approach will not work because there are exponentially many terms in the
summation, so our estimate of D will be too large. To address this
issue, it is possible to collect terms cleverly so that the total number
of terms is small. However, we will not provide the details here; we
will just sketch how to analyze the Schmidt rank of an individual term.

The term H_{j_1} \cdots H_{j_l} consists of a product of l
Hamiltonians. Since there are s+1 possible locations of a cut that is
crossed by one of the H_{j_k}, an average cut will have
\frac{l}{s+1} of the H_{j_k} crossing it. For each Hamiltonian which
crosses this average cut, the Schmidt rank will be multiplied by a
factor of d^2. Hence, the Schmidt rank of the term
H_{j_1} \cdots H_{j_l} is O(d^\frac{2l}{s}). This bound on the
Schmidt rank is for an average cut C', not necessarily the cut C we
fixed in the beginning. The two cuts are separated by at most O(s)
qudits because C' is one of the s+1 cuts surrounding C (see figure
below), so one can relate the Schmidt rank across these two cuts. For
each Hamiltonian between the two cuts, the high-level idea is that the
entanglement rank is multiplied by at most a factor of d. Because the
entanglement rank across C' is O(d^\frac{2l}{s}), the entanglement
rank across C is O(d^{\frac{2l}{s} + s}).

The entanglement across cut C is analyzed by bounding the entanglement
across a nearby cut C' and then relating the two.

[figure]
Putting all the pieces together, we have that:

  1. By Hamiltonian truncation and the Chebyshev polynomial construction,
    \Delta = O(e^{-l/ \sqrt{s}}).
  2. By the entanglement rank analysis, D = O(d^{\frac{l}{s} + s}).

One can now check that if we set l = O(s^2) and
s = O(\frac{\log^2 d}{\epsilon}) (where large enough constants are
hidden inside the O(\cdot)), then we achieve the desired tradeoff of
D \Delta \leq \frac{1}{2}. Note that l and s are constants because
d and \epsilon are constants.

Towards an area law in high dimensions

One of the major open problems in this line of work is to prove an area
law for 2D systems. We now discuss one potential line of attack on this
problem, based on a reduction to the 1D area law we have proven using
AGSPs.

Consider a 2D system, as depicted in the figure below. We can form a
sequence of concentric shells of qudits of increasing radii and
“contract” each shell into a single qudit of higher dimension. That is,
if each of the original qudits has dimension d and we consider the
shell at radius r, the dimension of the contracted qudit is roughly
d^r because there are roughly r qudits in the shell. The system of
contracted qudits is now a 1D system in the sense that the shell at
radius r only interacts with the shells at radii r-1 and r+1.

A 2D system can be reduced to a 1D system by decomposing it into a
sequence of concentric “shells”.

[figure]
Suppose that we were able to show an area law as discussed previously,
but with the following dependence on d:

\displaystyle S_{| \psi_0 \rangle}(A) = O(\frac{\log d}{\epsilon}).
If we replace
d with d^r, the dimensionality in our new 1D system, we get that

\displaystyle S_{| \psi_0 \rangle}(A) = O(\frac{r}{\epsilon}),
which is exactly an
area law in 2D (recalling that the boundary of a disk of radius r has length
proportional to r). In fact, even the weaker result of proving that
S_{| \psi_0 \rangle}(A) = O(\frac{\log^{2-\alpha} d}{\epsilon}) for
some \alpha > 0 would imply a “sub-volume law”, already a major
breakthrough. This suggests that constructing AGSPs with better
parameters may be a plausible approach to proving an area law.

Future Directions

In addition to proving an area law in 2D, there are several interesting
questions raised by what we discussed here. See the survey of Gharibian
et al. for more details.

  1. Do area laws hold for gapless (i.e. non-constant gap, or “critical”)
    systems? No, there are known counterexamples.
  2. Do area laws hold for degenerate systems (where the ground state is
    not unique)? Yes, the AGSP approach still works for systems with
    groundspaces of constant dimension. It is unknown whether area laws
    hold when the groundspace is of dimension polynomial in n.
  3. Can one design practical algorithms for finding ground states of
    gapped 1D local Hamiltonians? Yes, there is one approach based on
    AGSPs.
  4. Do area laws hold in general for systems where the interactions are
    not necessarily described by a lattice? No, there are known
    counterexamples to such a “generalized area law”.

Acknowledgements

We would like to thank Boaz Barak and Tselil Schramm for their guidance
in preparing the talk on which this post is based.

Most of this post (including some figures) draws heavily upon the
following resources: lecture notes from courses taught by Thomas Vidick and Umesh Vazirani,
respectively, and a survey by Gharibian et al.
In these notes, we have opted to illustrate the main ideas, while
glossing over technical details. Our goal with this approach is to
prepare the reader to more easily understand the original sources and
highlight some of the key concepts.

Appendix: Background on Chebyshev Polynomials

The Chebyshev polynomials \{T_l\}_{l \in \mathbb{N}} are a family of
polynomials with many special properties. They are defined as
T_0(x) = 1, T_1(x) = x and \displaystyle T_l(x) = 2xT_{l-1}(x) - T_{l-2}(x).
The first few polynomials are plotted below (Source:
https://en.wikipedia.org/wiki/Chebyshev_polynomials).

[figure]
Below, we state a property (proof omitted) which provides some
intuition as to why Chebyshev polynomials are the “correct” choice for
our construction. Intuitively, it says that among all degree-l
polynomials p which are trapped inside [-1,1] on the interval [-1,1],
T_l dominates p outside of [-1,1].

Claim: Let p: \mathbb{R} \rightarrow \mathbb{R} be a degree-l polynomial
such that |p(x)| \leq 1 for all x \in [-1,1]. Then for all
|x| \geq 1, |p(x)| \leq |T_l(x)|.
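
A small Python sketch (our own check, not part of the original notes) evaluating T_l via the recurrence above; it verifies numerically that T_l stays bounded by 1 on [-1,1] while growing rapidly outside it, which is exactly the behavior that makes the scaled, shifted Chebyshev polynomial in Attempt 2 decay quickly on [\epsilon', \|H'\|].

    import numpy as np

    def chebyshev(l, x):
        """Evaluate T_l(x) via the recurrence T_l = 2x T_{l-1} - T_{l-2}."""
        t_prev = np.ones_like(x, dtype=float)
        if l == 0:
            return t_prev
        t_cur = np.array(x, dtype=float)
        for _ in range(l - 1):
            t_prev, t_cur = t_cur, 2 * x * t_cur - t_prev
        return t_cur

    # Inside [-1, 1] the Chebyshev polynomial stays bounded by 1 ...
    xs = np.linspace(-1, 1, 5)
    print(np.max(np.abs(chebyshev(8, xs))) <= 1 + 1e-9)   # True

    # ... while outside it grows very fast, which is what the AGSP
    # construction exploits after rescaling and shifting.
    print(chebyshev(8, np.array(1.5)))                     # roughly 1.1e3, huge compared to 1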

References

  1. Sevag Gharibian, Yichen Huang, Zeph Landau, Seung Woo Shin, et al. “Quantum Hamiltonian Complexity”. Foundations and Trends in Theoretical Computer Science, 10(3):159–282, 2015.
  2. Thomas Vidick. Lecture notes for CS286 seminar in computer science: “Around the quantum PCP conjecture”. http://users.cms.caltech.edu/~vidick/teaching/286_qPCP/index.html, Fall 2014.
  3. Umesh Vazirani. Lecture notes for CS294-4 “Quantum computation”. https://people.eecs.berkeley.edu/~vazirani/f16quantum.html, Fall 2016.

Algorithmic and Information-Theoretic Decoding Thresholds for Low-Density Parity-Check Codes

December 16, 2018

by Jeremy Dohmann, Vanessa Wong, Venkat Arun

Abstract

We will discuss error-correcting codes: specifically, low-density parity-check (LDPC) codes. We first describe their construction and information-theoretic decoding threshold, p_{c}.

Belief propagation (BP) (see Tom’s notes) can be used to decode these. We analyze BP to find the maximum error rate up to which BP succeeds.

After this point, BP will reach a suboptimal fixed point with high probability. This is lower than the information-theoretic bound, illustrating a gap between algorithmic and information-theoretic thresholds for decoding.

Then we outline a proof of a theorem suggesting that any efficient algorithm, not just BP, will fail after the algorithmic threshold. This is because there is a phase transition at the algorithmic threshold, after which there exists an exponentially large number of suboptimal “metastable” states near the optimal solution. Local search algorithms will tend to get stuck at these suboptimal points.

Introduction to Error Correcting Codes

Motivation

Alice wants to send a message to Bob, but their channel of communication is such that Bob receives a corrupted version of what Alice sent. Most practical communication devices are imperfect and introduce errors in the messages they are transmitting. For instance, if Alice sends 3V, Bob will really receive three volts plus some noise (we have chosen to ignore some inconvenient practical details here). In many cases, this noise is quite small, e.g. it could be less than 0.5V in 99.99% of cases. So, in principle, Alice could have had very reliable delivery by just choosing to always send 0V for a logical 0 and 5V for logical 1, using checksums to detect the occasional error. But this is wasteful. Alice could have squeezed more levels between 0 and 5V to get a higher bitrate. This causes errors, but Alice can introduce redundancy in the bits she is transmitting which can enable Bob to decode the correct message with high probability. Since it is much easier to control redundancy in encoding than in physical quantities, practical communication devices often choose to pack enough bits into their physical signal that errors are relatively quite likely, relying instead on redundancy in their encoding to recover from the errors. Redundancy is also used in storage, where we don’t have the option of detecting an error and retransmitting the message.

Some errors in communication are caused by thermal noise. These are unpredictable and unavoidable, but the errors they cause can be easily modeled; they cause bit-flips in random positions in the message. There are other sources of error. The clocks on the two devices may not be correctly synchronized, causing systematic bit-flips in a somewhat predictable pattern. A sudden surge of current (e.g. because someone turned on the lights, and electricity sparked between the contacts) can corrupt a large contiguous segment of bits. Or the cable could simply be cut in which case no information gets through. These other kinds of errors are often harder to model (and easier to detect/mitigate), so we have to remain content with merely detecting them. Thus for the remainder of this blog post, we shall restrict ourselves to an error model where each bit is corrupted with some fixed (but potentially unknown) probability, independent of the other bits. For simplicity, we shall primarily consider the Binary Erasure Channel (BEC), where a bit either goes through successfully, or the receiver knows that there has been an error (though we will introduce some related channels along the way).

Claude Shannon found that given any channel, there is a bitrate below which it is possible to communicate reliably with vanishing error rate. Reliable communication cannot be achieved above this bitrate. Hence this threshold bitrate is called the channel capacity. He showed that random linear codes are an optimal encoding scheme that achieves channel capacity. We will only briefly discuss random linear codes, but essentially they work by choosing random vectors in the input space and mapping them randomly to vectors in the encoded space. Unfortunately we do not have efficient algorithms for decoding these codes (mostly due to the randomness in their construction), and it is conjectured that one doesn’t exist. Recently Low-Density Parity Check (LDPC) codes have gained in popularity. They are simple to construct, and can be efficiently decoded at error levels quite close to the theoretical limits.

With LDPC codes, there are three limits of interest for any given channel and design bitrate (M/N): 1) the error level up to which an algorithm can efficiently decode them, \epsilon_d; 2) the error level up to which they can be decoded by a computationally unbounded decoder, \epsilon_c; and 3) the error level beyond which no encoding scheme can achieve reliable communication, \epsilon_s. Obviously, \epsilon_d \le \epsilon_c \le \epsilon_s, and in general these inequalities can be strict. Our goal here is to study the gap between \epsilon_d and \epsilon_c. We will sometimes refer to these three quantities as p_{d}, p_{c}, and p_{shannon} when discussing channels besides the BEC. This follows the notation of Mezard and Montanari (2009).

More formally, information theory concerns reliable communication via an unreliable channel. To mitigate the errors in message transmission, error correcting codes introduce some type of systematic redundancy in the transmitted message. Encoding maps are applied to the information sequence to get the encoded message that is transmitted through the channel. The decoding map, on the other hand, is applied to the noisy channel output (see figure below). In an error correcting code, each encoded message consists of N > M bits that redundantly encode M information bits. The 2^M possible codewords form a “codebook” \mathbb{C} in the binary space \{0,1\}^N.

Claude Shannon’s code ensembles showed that it is often easier to construct codes stochastically (a random code has good properties with high probability) than via deterministic designs. These stochastic ensembles achieve optimal error correcting performance compared to more rigidly constructed codes, proving that it is possible to communicate with a vanishing error probability as long as the rate of transmission R=M/N is smaller than the channel capacity, a measure of the maximum mutual information between channel input and output.

Fig 1: Schematic of the communication model.

Thus, in order to construct an optimal error correcting code, one must first define a subset of the space of encoding maps, endow that set with a probability distribution, and subsequently define the associated decoding map for each of the encoding maps in the ensemble. We have included a section in Appendix A that gives a thorough discussion of random code ensembles, which are known to achieve optimal decoding, whether decoding success is scored by bit error rate or by decoded word error rate. We will also show a perspective which uses principles from statistical physics to unify the two (often called finite-temperature decoding). From here on out we will discuss LDPC codes and explore how the values of p_{d}, p_{c} and p_{shannon} for various channels reveal deep things about the structure of the decoding solution space.

Low-density Parity Check Code

LDPC codes are linear error correcting codes that are theoretically excellent and communicate at a rate close to the Shannon capacity. The LDPC codebook is a linear subspace of \{0,1\}^N. For an M \times N sparse matrix \mathbb{H}, the codebook is defined as the kernel:

\mathbb{C} = \{ \underline{x} \in \{0,1\}^N : \mathbb{H}\underline{x}=\underline{0} \}


where all the multiplications and sums involved in
\mathbb{H} \underline{x} are computed modulo 2.

The matrix \mathbb{H} is called the parity check matrix and the size of the codebook is 2^{N-rank(\mathbb{H})}. Given this code, encoding is a linear operation: there is an N \times L binary generating matrix \mathbb{G} (with L = N - rank(\mathbb{H})) whose image is the codebook, \mathbb{C}=\{\underline{x} = \mathbb{G}\underline{z} : \underline{z} \in \{0,1\}^L\}, and a message \underline{z} is encoded as \underline{x} =\mathbb{G}\underline{z}.
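
A minimal Python sketch of these definitions, assuming a tiny toy parity check matrix of our own (not a realistic LDPC code): the codebook is the GF(2) kernel of \mathbb{H}, found here by brute force, and its size is 2^{N - rank(\mathbb{H})}.

    import itertools
    import numpy as np

    # Tiny toy parity check matrix H: M = 2 checks on N = 4 bits
    H = np.array([[1, 1, 1, 0],
                  [0, 1, 1, 1]])
    M, N = H.shape

    # Codebook = kernel of H over GF(2), found here by brute force
    codebook = [x for x in itertools.product([0, 1], repeat=N)
                if not np.any(H.dot(x) % 2)]
    print(codebook)
    print(len(codebook))   # 2^(N - rank(H)) = 4 here, since H has rank 2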

Every coding scheme has three essential properties that determine its utility: the geometry of its codebook and the way it sparsely distributes proper codewords within the encoding space \{0,1\}^{N} (cf. our mention of sphere packing as a geometric analogy for RLCs), the ease with which one can construct a code which sparsely distributes codewords within the encoding space, and the existence of fast algorithms to perform effective decoding.

A coding scheme over a given channel (whether it be BSC, BEC, AWGN, etc.) also has three parameters of interest: p_{d}, the error rate above which some chosen algorithm cannot perform error-free decoding; p_{c}, above which even exhaustively enumerating over all 2^{M} codewords in the codebook and calculating the MAP probability does not successfully decode; and p_{shannon}, the error rate corresponding to the capacity of the channel, above which no coding scheme could perform error-free decoding.

LDPC codebook geometry and p_{c}

On the subject of codebook geometry, it is well known that LDPC ensembles, in expectation, produce sparsely distributed codewords. This means that valid codewords (i.e. those that pass all parity checks) are far apart from one another in Hamming space and thus require a relatively large number of bits to be lost in order for one codeword to degenerate into another one. There is an important property called the distance enumerator which determines the expected number of codewords in a tight neighborhood of any given codeword. If for a given distance the expected number is exponentially small, then the coding scheme is robust up to error rates causing that degree of distortion. We discuss a proof of LDPC distance properties in Appendix B and simply state here that LDPCs are good at sparsely distributing valid codewords within the encoding space. The parameter p_{c}, introduced above, is intimately related to the geometry of the codebook and the properties of the noisy channel being decoded over.

The information theoretic threshold p_{c} is the noise level above which MAP decoding no longer succeeds. p_{c} is important because below it we could theoretically always perform (slow) decoding by enumerating all 2^{M} codewords in the codebook and selecting the one with the highest MAP probability.

Every LDPC ensemble has a different p_{c} for a given channel. For now we will simply remark that LDPC ensembles are effective because they can be chosen such that p_{c} closely approaches the p_{shannon} limit for many channels. Even more importantly, we will show that it is likely that there is no fast algorithm for which p_{d} = p_{c} for general LDPCs over a general noisy channel.

We will not derive the p_{c} for any LDPC ensembles over any channels (I recommend chapter 15 of this book for details), but we will, in section 3, present results derived by other researchers.

Ease of construction

On the subject of ease of construction and ease of decoding, there is a much simpler graphical representation of LDPC codes which can be used to demonstrate LDPC tractability.

LDPCs can be thought of as bipartite regular graphs, where there are N variable nodes which are connected to M parity check nodes according to randomly chosen edges based on the degree distribution of the LDPC. Though the appendix discusses general degree distributions, we will discuss here only (d,k)-regular bipartite graphs, in which all variable nodes have d edges and all parity check nodes have k edges, and how to generate them under the configuration model.

The configuration model can be used to generate a bipartite graph with (d,k) degree distribution by initially assigning all variable nodes d half-edges and all parity check nodes k half-edges, then randomly linking up half-edges between the two sets, deleting parallel edges between a pair of nodes when they occur an even number of times, and collapsing an odd number of parallel edges into a single edge. This system doesn’t work perfectly, but for large N the configuration model will generate a graph for which most nodes have the proper degree. Thus it is relatively easy to generate random graphs which represent LDPC codes of any desired uniform degree distribution. An example of this graphical representation is in figure 2.

Fig 2: Factor graph of a (2,3) regular LDPC code, with factor nodes as black squares and variable nodes as white circles, and notation for BP messages.
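
Here is a minimal Python sketch of the configuration model just described (a toy illustration of our own, not a production code generator), using the mod-2 cleanup of parallel edges discussed above.

    import random
    from collections import Counter

    def configuration_model(n_var, d, k, seed=0):
        """Sample a (d, k)-regular bipartite factor graph by randomly matching
        variable half-edges to check half-edges. Requires n_var * d = n_chk * k."""
        assert (n_var * d) % k == 0
        n_chk = n_var * d // k
        var_stubs = [v for v in range(n_var) for _ in range(d)]
        chk_stubs = [c for c in range(n_chk) for _ in range(k)]
        rng = random.Random(seed)
        rng.shuffle(chk_stubs)
        multi = Counter(zip(var_stubs, chk_stubs))
        # Parity checks are mod 2, so an even number of parallel edges cancels
        # and an odd number collapses to a single edge.
        edges = [e for e, c in multi.items() if c % 2 == 1]
        return n_chk, edges

    n_chk, edges = configuration_model(n_var=12, d=3, k=6)
    print(n_chk, len(edges))   # 6 parity checks; most of the 36 edges survive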

How the graphical model relates to fast decoding

The graphical model of LDPCs is useful because it is both easy to construct and presents a natural way to perform fast decoding. In fact, the fast graph-based decoding algorithm we use, belief propagation, has a threshold p_{d} which likely represents the upper limit on fast decoding for LDPCs.

We have seen recently that bipartite graphical models
which represent a factorized probability distribution can be used to calculate marginal probabilities of individual variable nodes (what a mouthful!).

Basically, if the structure of the graph reflects some underlying probability distribution (e.g. the probability that noisy bit y_i was originally sent as bit x_i) then we can use an iterative algorithm called Belief Propagation (see blog post above) to actually converge to the exact marginal distribution P(x_i~|~\underline{y}) for each bit.

This is important because when we perform decoding, we would like to estimate the marginal probability of each individual variable node (bit in our received vector), and simply set the variable to be the most likely value (this is known as bit-MAP decoding, discussed earlier). As mentioned above, under certain conditions the Belief Propagation algorithm correctly calculates those marginal probabilities for noise rates up to an algorithmic threshold p_{d}.

The Belief Propagation algorithm is an iterative message passing algorithm in which messages are passed between variable nodes and parity check/factor nodes such that, if the messages converge to a fixed point, the messages encode the marginal probabilities of each variable node. Thus BP, if it succeeds, can perform bit-MAP decoding and thus successfully decode.

We will show in the next section how the configuration model graphs map to a factorized probability distribution and mention the p_{d} for BP. In the following section we will show an example of decoding over the binary erasure channel, then finally we will show motivation to suggest that the p_{d} for BP over LDPCs represents a hard upper limit above which no fast decoding algorithms exist.

Decoding Errors via Belief Propagation

As mentioned above (again, please see Tom’s excellent blog post for details), the belief propagation algorithm is a useful inference algorithm for stochastic models and sparse graphs derived from computational problems exhibiting thresholding behavior. As discussed, symbol/bit MAP decoding of error correcting codes can be regarded as a statistical inference problem. In this section, we will explore BP decoding to determine the threshold for reliable communication, and the corresponding optimization of LDPC code ensembles, for communication over a binary-input output-symmetric memoryless channel (BMS), such as the BSC or BEC.

Algorithm Overview

Recall that the conditional distribution of the channel input \underline{x} given the output \underline{y} is given by the expression below; we wish to find the \underline{x} that maximizes this probability given \underline{y}:

p(\underline{x}|\underline{y}) = \frac{1}{Z(\underline{y})}\prod_{i=1}^{N}Q(y_i|x_i) \prod_{a=1}^{M} \mathbb{I}(x_{i_1^a} \oplus \cdots \oplus x_{i_{k(a)}^a} = 0) (1)


where Q(y_i~|~x_{i}) is the conditional probability of observing noisy bit y_{i} given that x_{i} was sent. For the BSC we have Q(y_{i} = 1 | x_{i} = 1) = Q(y_{i} = 0 | x_{i} = 0) = 1-p and Q(y_{i} = 1 | x_{i} = 0) = Q(y_{i} = 0 | x_{i} = 1) = p.

Furthermore

\mathbb{I} (x_{i_1^a} \oplus \cdots \oplus x_{i_{k(a)}^a} = 0)

is an indicator variable which takes value 1 if \underline{x} satisfies parity check a and 0 otherwise. In particular, the product of these indicators takes into account the fact that 0 probability should be assigned to hypotheses \underline{x} which aren’t in the code book (indicated by at least one parity check failing).

We would like to design a message passing scheme such that the incoming messages for a given variable node encode its marginal probability p(x_i~|~\underline{y}).

Note, first and foremost, that this probability can be factorized a la BP factor graphs such that there is a factor node for each parity check node a which contributes probability

\mathbb{I} (x_{i_1^a} \oplus \cdots \oplus x_{i_{k(a)}^a} = 0)

and a factor node for each channel probability term Q(y_i~|~x_{i}). Since each channel probability term is only connected to a single variable, its message never gets updated during BP and so we omit it from the factor graphs (e.g. note that figure 2 only has parity check nodes and variable nodes).

The message passing scheme ends up taking the form

v_{i\rightarrow a}^{(t+1)}(x_i) \propto Q(y_{i} | x_{i}) \prod_{b \in \partial i \setminus a} \hat{v}_{b \rightarrow i}^{(t)}(x_i) (2)

\hat{v}_{a \rightarrow i}^{(t)}(x_{i}) \propto \sum_{\{x_{j}\}} \mathbb{I} (x_{i} \oplus x_{j_{1}} \oplus \cdots \oplus x_{j_{k-1}} = 0) \prod_{j \in \partial a \setminus i} v_{j\rightarrow a}^{(t)}(x_j) (3)

where \partial a denotes the neighborhood of factor node a and the sum in (3) is over all possible configurations of the neighbors of a, not including i.

Messages are passed along the edges as distributions over binary valued variables described by the log-likelihoods

h_{i\rightarrow a} = \frac{1}{2}\log \frac{v_{i\rightarrow a}(0)}{v_{i\rightarrow a}(1)} (4)

u_{a\rightarrow i} = \frac{1}{2}\log \frac{\hat{v}_{a\rightarrow i}(0)}{\hat{v}_{a\rightarrow i}(1)} (5)


We also introduce the a priori log likelihood for bit i given the received message y_i:

B_{i} = \frac{1}{2} \log \frac{Q(y_{i} | 0)}{Q(y_{i} | 1)}

Once we parametrize the messages as log-likelihoods, it turns out we can rewrite our update rules in terms of the parametrized values h and u, making updates much simpler:

h_{i \rightarrow a}^{(t+1)} = B_i + \sum_{b \in \partial i \setminus a} u_{b \rightarrow i}^{(t)} (6)

u_{a\rightarrow i}^{(t)} = \mathrm{atanh}\left( \prod_{j \in \partial a \setminus i} \tanh(h_{j \rightarrow a}^{(t)}) \right) (7)

Given a set of messages, we would perform decoding via the overall log likelihood h_{i}^{(t+1)} = B_i + \sum_{b \in \partial i} u_{b \rightarrow i}^{(t)}, where x_{i} gets decoded to 0 if h_{i} > 0 and to 1 if h_{i} < 0.

Typically BP is run until it converges to a set of messages that decode to a word in the codebook, or until a max number of iterations have occurred. Other stopping criteria exist such as the messages between time step t and t+1 being all within some small \epsilon of one another.
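
The sketch below is a minimal Python implementation of update rules (6) and (7) for a BSC, using the tiny toy parity check matrix from the earlier sketch as the factor graph. It is only meant to make the message passing concrete, not to be an optimized decoder.

    import numpy as np

    def bp_decode_bsc(H, y, p, max_iter=50):
        """Minimal log-likelihood BP decoder for a BSC with flip probability p.
        Messages are stored edge-wise: h[a, i] holds h_{i->a} and u[a, i] holds
        u_{a->i}, following update rules (6) and (7)."""
        M, N = H.shape
        B = 0.5 * np.log((1 - p) / p) * (1 - 2 * y)   # a priori log-likelihoods B_i
        h = np.tile(B, (M, 1)) * H                    # initialize h_{i->a} = B_i on edges
        for _ in range(max_iter):
            # Check-to-variable messages, eq. (7)
            t = np.tanh(h)
            u = np.zeros_like(h)
            for a in range(M):
                idx = np.where(H[a] == 1)[0]
                for i in idx:
                    prod = np.prod(t[a, idx[idx != i]])
                    u[a, i] = np.arctanh(np.clip(prod, -0.999999, 0.999999))
            # Variable-to-check messages, eq. (6)
            for a in range(M):
                for i in np.where(H[a] == 1)[0]:
                    h[a, i] = B[i] + u[:, i].sum() - u[a, i]
            # Tentative decoding from the overall log-likelihoods h_i
            x_hat = (B + u.sum(axis=0) < 0).astype(int)
            if not np.any(H.dot(x_hat) % 2):          # all parity checks satisfied
                return x_hat
        return x_hat

    H = np.array([[1, 1, 1, 0],
                  [0, 1, 1, 1]])
    x = np.array([1, 1, 0, 1])                        # a codeword: H x = 0 (mod 2)
    y = x.copy(); y[0] ^= 1                           # channel flips one bit
    print(bp_decode_bsc(H, y, p=0.1))                 # recovers [1 1 0 1]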

It is important to note some properties of BP:

  1. BP always terminates in d steps if the factor graph is a tree of depth d
  2. It is not known under what circumstances so-called “loopy” BP will converge for non-tree graphs

Because factor graphs of LDPC codes are relatively sparse, they appear “locally tree-like”, a property which is believed to play a crucial role in BP convergence over the factorized probability distribution used in LDPC MAP decoding (eqn 1). As mentioned above, BP manages to converge on many sorts of non-tree-like graphs given that they have “nice” probability distributions. For example, BP on the SK model is known to converge even though the underlying factor graph is a complete graph!

It turns out that BP converges for LDPC decoding at sufficiently low noise levels, and that the threshold p_{d} at which it stops converging marks a phase transition to a genuinely different regime in the solution space of the codebook. It has been noted elsewhere that the BP threshold is often the threshold for fast solvability of many interesting problems, e.g. k-SAT, because BP is often thought to represent, generically, the “best” possible local (and hence “fast”) algorithm for such problems.

In appendix C we will show some important properties of BP. The following tables summarize important results for several ensembles and channels. Note how close the information-theoretic threshold for LDPC codes is to the Shannon limit p_{Shannon} for the channels below.

Table 1: Thresholds for BSC
Various thresholds for BP over LDPC codes in a Binary Symmetric Channel

d   k   p_{d}    p_{c}    Shannon limit
3   4   .1669    .2101    .2145
3   5   .1138    .1384    .1461
3   6   .084     .101     .11
4   6   .1169    .1726    .174

See Mezard and Montanari, 2009 Chapt 15. for this table

Table 2: Thresholds for BEC
Various thresholds for BP over LDPC codes in a Binary Erasure Channel

d   k   \epsilon_{d}   \epsilon_{c}   Shannon limit
3   4   .65        .746       .75
3   5   .52        .59        .6
3   6   .429       .4882      .5
4   6   .506       .66566     .6667

See Mezard and Montanari, 2009 Chapt 15. for this table

We will now show the exact behavior of the (3,6) LDPC ensemble over the binary erasure channel.

Algorithmic Thresholds for Belief Propagation (BP)

Definitions and notation

Definition 1. In a Binary Erasure Channel (BEC), when the transmitter sends a bit \in {0, 1}, the receiver receives the correct bit with probability 1 - \epsilon or an error symbol * with probability \epsilon.

For BECs, the Shannon capacity—the maximum number of data bits that can be transmitted per encoded bit—is given by 1 - \epsilon.

Definition 2. An N-bit Error Correcting Code (ECC) is defined by a codebook \mathcal{C} \subset {0, 1}^N. The transmitter encodes information as an element of \mathcal{C}. The receiver receives a corrupted version y of the transmitted codeword. To decode, it picks an x \in \mathcal{C} that is most likely given y and the channel characteristics.

For ease of discourse, we have refrained from defining ECC in full generality.

Definition 3. An N-bit (\lambda, \rho) Low Density Parity Check Code (LDPC) is an ECC with \mathcal{C} = {x | Hx = 0}. Here H is a sparse M \times N parity-check matrix and arithmetic is over \mathbb{Z}_2. \lambda(x) = \sum_i\lambda_ix^{i-1} and \rho(x) = \sum_i\rho_ix^{i-1} are finite polynomials that characterize H from the edge perspective: \lambda_i is the fraction of edges incident to columns (variable nodes) with i 1s, and \rho_i is the fraction of edges incident to rows (check nodes) with i 1s. Since these are fractions/probabilities, they must be normalized; hence \lambda(1) = \rho(1) = 1. H is a random matrix, and therefore has full rank with high probability.

In an LDPC code, |\mathcal{C}| = 2^{N - M}. Hence every N bits contain N-M bits of information, making the rate 1 - M / N. Over binary erasure channels (BECs), on receiving y, the decoder must choose an x such that x_i = y_i \forall i such that y_i \neq *, and Hx = 0 (x_i, y_i denote the i^{th} bit of x and y respectively). That is, the bits that were successfully transmitted should be preserved; other bits should be chosen to satisfy the parity check equations. If multiple correct choices of x are possible, then y cannot be unambiguously decoded.
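As a concrete illustration of this definition, the following sketch builds a random (l, k)-regular parity-check matrix by randomly matching variable and check “sockets” (a configuration-model construction of our own choosing; the function name and parameters are illustrative, not part of the lecture).

import numpy as np

def random_regular_H(N, l, k, seed=0):
    """Random (l, k)-regular parity-check matrix: l ones per column, k per row
    (repeated edges are folded mod 2, so a few rows/columns may end up lighter)."""
    assert (N * l) % k == 0, "need N*l divisible by k"
    M = N * l // k                                  # number of parity checks
    rng = np.random.default_rng(seed)
    var_sockets = np.repeat(np.arange(N), l)        # l sockets per variable node
    chk_sockets = np.repeat(np.arange(M), k)        # k sockets per check node
    rng.shuffle(chk_sockets)                        # uniformly random matching
    H = np.zeros((M, N), dtype=int)
    for i, a in zip(var_sockets, chk_sockets):
        H[a, i] ^= 1                                # fold multi-edges mod 2
    return H

H = random_regular_H(N=1200, l=3, k=6)
print("design rate 1 - M/N =", 1 - H.shape[0] / H.shape[1])   # 0.5 for the (3,6) ensemble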

BP/Peeling algorithm

In general, decoding can be computationally hard. But there exists an error rate \epsilon_d, a function of \lambda and \rho, below which belief propagation succeeds in decoding. Let \epsilon_c be the maximum error rate up to which successful decoding is possible (i.e. we can unambiguously determine the transmitted codeword) and \epsilon_s be the Shannon limit; then \epsilon_d \leq \epsilon_c \leq \epsilon_s. In general, these inequalities can be strict, illustrating the gap between what is information-theoretically possible and what is computationally feasible.

Belief propagation (BP) for decoding LDPC codes is equivalent to a simple peeling algorithm. Let us first describe the factor-graph representation for decoding, shown in figure 3. Variable nodes on the left hold the received symbols \in {0, 1, *}. Factor nodes on the right denote the parity-check constraints (rows of H): the XOR of the variables connected to each factor node must be 0.

BP/the peeling algorithm works as follows. For simplicity of exposition, assume that the all-zeros codeword has been transmitted; since this is a linear code, there is no loss of generality. At first, only the roughly (1 - \epsilon)N variables that were successfully transmitted are determined. In each subsequent round, any factor node that has exactly one undetermined variable can determine that variable using its parity-check constraint, and this is repeated until no such factor node remains.
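A minimal sketch of this peeling procedure (our own representation: H as a dense 0/1 matrix and erasures marked by -1):

import numpy as np

def peel_decode(H, y):
    """Peeling/BP decoder for the BEC. y has entries in {0, 1, -1}, with -1 = '*'.
    Repeatedly find a check with exactly one erased neighbor and set that bit so
    the check's parity is 0. Returns (word, True) on success, (partial word, False)
    if the decoder stalls."""
    x = np.array(y, dtype=int)
    progress = True
    while progress:
        progress = False
        for a in range(H.shape[0]):
            nbrs = np.flatnonzero(H[a])             # variables in check a
            erased = nbrs[x[nbrs] == -1]
            if len(erased) == 1:                    # exactly one unknown: determine it
                i = erased[0]
                x[i] = int(x[nbrs[nbrs != i]].sum()) % 2
                progress = True
    return x, bool(np.all(x != -1))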

Fig 3 An example of a received code-word and corresponding parity-check constraints that is information-theoretically determined (only the all zeros codeword satisfies all constraints), but cannot be decoded by the belief propagation algorithm because no factor node has exactly one unknown variable. Here, only V0 is correctly received. Constraints F1, F2, F3 and F4 imply that V1=V2=V3=V4=V5. If V0=0, then all the rest must be 0s to satisfy F0 (note, the only other valid codeword is all ones).

BP isn’t perfect

This algorithm is not perfect. Figure 3 is an example of a received codeword which can be unambiguously decoded — only the all zeros codeword satisfies all the constraints — but on which BP fails, because at every point all factor nodes have more than one unknown variable. It seems that the only way to solve instances like this is to exhaustively work out the implications of the parity-check equations. If this example seems contrived, that is because it is: decoding becomes harder as the degree and the number of constraints increase, and we had to add many constraints to make this example work. Fortunately, if the graph is sparse, BP succeeds below a threshold noise level, which we characterize in the following theorem:

Phase transitions for BP

Theorem 1. A (\lambda, \rho) LDPC code can be decoded by BP as N \rightarrow \infty when the error rate is less than \epsilon_d:

\epsilon_d = \mathrm{inf}_{z \in (0, 1)}\left[\frac{z}{\lambda(1 - \rho(1 - z))}\right]

Proof. To prove this, let us analyze the density evolution. For BECs this is particularly simple, as we only need to keep track of the fractions of undetermined variable and factor nodes at timestep t: z_t and \hat{z}_t respectively. As N \rightarrow \infty, these fractions are probabilities. A factor node is determined when all of its variables are determined (note: if all but one are determined, the last one can be immediately determined). The following recursion relations hold:

z_{t+1} = \epsilon\lambda(\hat{z}_t) \:\:\:\mathrm{and}\:\:\: \hat{z}_t = 1 - \rho(1 - z_t) (8)

The first holds because a variable node is undetermined at timestep t+1 if it was originally undetermined (which happens with probability \epsilon) and it is not determined in the current step, which happens with some probability p. Now,

p = \mathbf{P}(degree=2)\hat{z}_t + \mathbf{P}(degree=3)\hat{z}_t^2 + \cdots = \lambda(\hat{z}_t)

A similar reasoning holds for the second relation: 1 - z_t is the probability that a given neighboring variable node is determined, so \rho(1 - z_t) is the probability that all the other neighboring variable nodes are determined (in which case this function node can determine the remaining one), and 1 - \rho(1 - z_t) is the probability that this function node is undetermined.

Composing the two relations in equation 8, we get the recursion:

z_{t+1} = F_{\epsilon}(z_t) = \epsilon \lambda(1 - \rho(1 - z_t))

An example of F_\epsilon(z) is shown in Figure 4 for \lambda(x) = x^2, \rho(x) = x^5, i.e. a (3,6) regular graph in which variable nodes and function nodes have 3 and 6 neighbors respectively. On the left, F_\epsilon(z) is always below z, so the recursion started from z_0 at the far right converges to the fixed point F_\epsilon(z) = z = 0. On the right, \epsilon is large enough that F_\epsilon(z) intersects z at a non-zero point, so the recursion converges to this higher fixed point instead, without ever reaching the `correct' fixed point at zero. BP therefore gets stuck at a suboptimal solution, even though information-theoretically a correct solution exists. This can be interpreted as a glassy state, where many deep local minima are present and BP converges to the wrong minimum.

The condition for BP to converge is F_\epsilon(z) \le z \:\: \forall z \in (0, 1). Hence the threshold error rate, \epsilon_d, below which this condition holds is:

\epsilon_d = \mathrm{inf}_{z \in (0, 1)}\left[\frac{z}{\lambda(1 - \rho(1 - z))}\right]

For (3, 6) regular graphs, \epsilon_d \approx 0.429 ∎


Fig 4 The recursion relation for a (3,6) regular graph, where F_\epsilon(z) is plotted against z. The identity function z is also shown (in blue). Here \epsilon_d \approx 0.429. The two panels show the cases where the error rate is below and above \epsilon_d, respectively.
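As a quick numerical check of the threshold formula from the theorem above (a sketch; the helper names are ours), one can evaluate the infimum on a fine grid and also iterate the recursion directly:

import numpy as np

def eps_d(lam, rho, num=200000):
    """BP threshold eps_d = inf_{z in (0,1)} z / lam(1 - rho(1 - z))."""
    z = np.linspace(1e-9, 1 - 1e-9, num)
    return float(np.min(z / lam(1 - rho(1 - z))))

def density_evolution(eps, lam, rho, z0=1.0, iters=2000):
    """Iterate z <- eps * lam(1 - rho(1 - z)); returns the fixed point reached."""
    z = z0
    for _ in range(iters):
        z = eps * lam(1 - rho(1 - z))
    return z

lam36 = lambda x: x**2        # (3,6) regular: lambda(x) = x^{l-1}
rho36 = lambda x: x**5        #                rho(x)    = x^{k-1}
print(eps_d(lam36, rho36))                     # approx 0.4294
print(density_evolution(0.40, lam36, rho36))   # below threshold: fixed point near 0
print(density_evolution(0.45, lam36, rho36))   # above threshold: non-zero fixed point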

Another interesting phase transition can be observed. As \epsilon increases, for some choices of (\lambda, \rho) the first intersection of F_\epsilon(z) with the line z appears at a non-zero point; for others, it starts off at z=0 and moves up continuously. In the former case, the decoding error rate jumps discontinuously from 0 to a non-zero value as \epsilon increases; in the latter, it increases continuously.

To see the gap between \epsilon_d and what is achievable information-theoretically, we look at what happens when the degrees of the LDPC code are increased while keeping the rate constant. Specifically, consider the (l, k) regular graph (i.e. \lambda(x) = x^{l-1} and \rho(x) = x^{k-1}) as k \rightarrow \infty while l / k = 0.5 is fixed; note that the rate of the code is 1 - l / k. This is shown in Figure 5: \epsilon_d decreases toward 0. But as k \rightarrow \infty it should only become information-theoretically easier to decode; in fact, as k \rightarrow \infty the code approaches a random linear code, which is known to achieve Shannon capacity. Hence the information-theoretically achievable decoding threshold is non-decreasing, and there is a gap between what is information-theoretically possible to decode and what is computationally feasible using belief propagation.

Fig 5 \epsilon_d decreases as k increases, while the rate = 1 - l/k = 0.5 is fixed. In fact \mathrm{lim}_{k \rightarrow \infty}\epsilon_d = 0.
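The density evolution sketch above can be used to reproduce this trend: running it with \lambda(x) = x^{l-1} and \rho(x) = x^{2l-1} for increasing l (so that the rate stays at 1/2) gives the decreasing thresholds \epsilon_d described in the text, while the Shannon limit for rate 1/2 stays at \epsilon = 1/2.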

Finally, we would like to mention that it is possible to choose a sequence of degree polynomials (\lambda, \rho) such that \epsilon_d approaches the Shannon limit. While it is non-trivial to construct such sequences exactly, good approximations exist, and LDPC codes can achieve rates close to channel capacity over binary erasure channels.

The solution space in p_{d}\leq p \leq p_{c}

The energy landscape of LDPC decoding

We have already shown the exact location of the p_{MAP} = p_{c} threshold above which decoding is not possible for the LDPC ensemble and have also investigated the point at which the BP algorithm fails, p_{d}.

It should not be surprising that any given algorithm we throw at the problem fails at a certain point below p_{c}. In fact there are many simple, randomized algorithms from the class of Markov chain Monte Carlo methods which give fast run times but which fail at noise values far below even p_{d}. The failure point of a particular algorithm is not, per se, necessarily very significant: we should not expect any given algorithm, short of explicitly computing the symbol MAP by traversing the entire codebook, to achieve p_{c}.

What is of interest to us here is that p_{d} marks a provable threshold in the solution space of LDPC decoding, above which it is likely that no local methods, and therefore no fast algorithms, can decode with nonzero probability. We will show later precisely the number and energy levels of these metastable states for the BEC; proofs of this transition for other channel types are outside the scope of this lecture.

In this section we will rephrase decoding as an energy minimization problem and use three techniques to explore the existence of metastable states and their effect on local search algorithms.

In particular, we will first use a generic local search algorithm that attempts to approximately solve the energy minimization formulation of decoding.

We will next use a more sophisticated Markov chain Monte Carlo method called simulated annealing. Simulated annealing is useful because it more closely models real physical processes, and because its convergence behavior closely mirrors the structure of the metastable configurations.

Energy minimization problem

To begin, we will reframe our problem in terms of constraint satisfaction.

The codewords of an LDPC code are solutions of a CSP: the variables are the bits of the word and the constraints are the parity check equations. Though this means our constraints are a system of linear equations, our problem is made more complicated by the fact that we are searching not for just ANY solution to the system but for a particular solution, namely the transmitted codeword.

The received message \underline{y} tells us where we should look for the solution.

Assume we are using a binary-input, memoryless, output-symmetric channel with transition probability Q(y | x).

The probability that \underline{x} was the transmitted codeword, given \underline{y}, is \mathbb{P}(\underline{x} | \underline{y}) = \mu_{\underline{y}}(\underline{x})

Where

\mu_{\underline{y}}(\underline{x}) = \frac{1}{Z(\underline{y})}\prod_{i=1}^NQ(y_i|x_i)\prod_{a=1}^M\mathbb{I}(x_{i_1^a}\oplus \cdots \oplus x_{i_{k(a)}^a} = 0) (10)

We can associate an optimization problem with this code. In particular, define E(\underline{x}) to be twice the number of parity check equations violated by \underline{x}.

We have already discussed how symbol MAP computes the marginals of the distribution \mu_{\underline{y}}(\underline{x}) and how word MAP finds its argmax.

We shall here discuss two related problems

  • optimizing the energy function within a subset of the configuration space defined by the received word
  • sampling from a ’tilted’ Boltzmann distribution associated with the energy

Define the log-likelihood of \underline{x} being the input given the received \underline{y} to be

L_{\underline{y}}(\underline{x}) = \sum_{i=1}^N \log Q(y_i|x_i) (11)

If we assume WLOG that the all-zero codeword was transmitted, then by the law of large numbers, for large N the log-likelihood L_{\underline{y}}(\underline{0}) of this codeword is close to -Nh, where h is the channel entropy. The probability of an order-N deviation away from this value is exponentially small.

This suggests that we should look for the transmitted codeword amongst those \underline{x} \in \mathbb{C} such that L_{\underline{y}}(\underline{x}) is close to -Nh.

The constraint version of our decoding strategy – known as typical-pairs decoding – is thus: find \underline{x} such that L_{\underline{y}}(\underline{x}) > -N(h+\delta). This constraint will be referred to as the `distance constraint'; if exactly one codeword satisfies it, we return that codeword.

Since codewords are global energy minima (E(\underline{x}) = 0 for all \underline{x} \in \mathbb{C}), we can phrase typical-pairs decoding as an optimization problem:

Minimize E(\underline{x}) subject to L_{\underline{y}}(\underline{x}) > -N(h+\delta).

This decoding succeeds iff the minimum is non-degenerate. This happens with high probability for p < p_{c} and with vanishing probability for p > p_{c}; in particular, there are exponentially many degenerate energy minima above p_{c}.

Similar to what we have seen elsewhere in the course, there exists an intermediate regime p_{d}\leq p \leq p_{c} in which the global energy minimum is still the correct codeword, but there is an exponentially large number of local energy minima obscuring it (see figure 6).

What is so special about BP is that the threshold at which these exponentially many metastable states proliferate is exactly the algorithmic threshold p_{d} for BP which we proved earlier.

Fig 6 A cartoon landscape for the energy function defined above (number of violated checks). Left: the energy has a unique global minimum with E(\underline{x}) = 0 (the transmitted codeword) and no local minima. Center: many deep local minima appear, although the global minimum is non-degenerate. Right: more than one codeword is compatible with the likelihood constraint, and the global minimum E(\underline{x}) = 0 is degenerate. Adapted from Mezard and Montanari, 2009 Chapt 21.

While finding solutions of E(\underline{x}) = 0 amounts to Gaussian elimination, the constraint L_{\underline{y}}(\underline{x}) > -N(h+\delta) is not a linear constraint. Thus one needs to use some more sophisticated search procedure to find satisfying vectors \underline{x}.

We will show that if one resorts to local-search-based algorithms, the metastable states above p_{d} block the algorithm. Furthermore, we suggest that the behavior of the local algorithms discussed below is typical of all local search algorithms (including BP), and that it is very likely that no fast algorithm is capable of finding global energy minima without getting caught in the metastable states which proliferate above p_{d}.

Below is the simplest of local search algorithms, \Delta-local search.

Fig 7 Excerpted from Mezard and Montanari, 2009 Chapt 21

Delta search typifies local search algorithms. It walks semi-randomly through the landscape searching for low energy configurations. Its parameter \Delta is chosen such that, when stuck in a metastable state, the algorithm can climb out of it in polynomial time provided the surrounding energy barrier has height \leq \Delta. Thus its failure in the p \geq p_{d} region suggests that the relevant barriers are not of constant size, and that barriers of order N are the norm.
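The precise routine is given in Fig 7; purely as an illustration of the kind of search being discussed (a hypothetical sketch of our own, not the algorithm from the book), one can picture single-bit-flip moves that are accepted whenever they raise the number of violated checks by at most \Delta:

import numpy as np

def delta_local_search(H, x0, delta, steps=100000, seed=0):
    """Hypothetical Delta-local-search sketch: flip a random bit whenever the increase
    in E(x) = 2 * (number of violated parity checks) is at most delta. Recomputes the
    energy from scratch each step for clarity; intended only to illustrate the idea."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=int)
    E = 2 * int(np.sum((H @ x) % 2))
    for _ in range(steps):
        i = rng.integers(len(x))
        x[i] ^= 1                                   # propose flipping bit i
        E_new = 2 * int(np.sum((H @ x) % 2))
        if E_new - E <= delta:
            E = E_new                               # accept: barrier of size <= delta
        else:
            x[i] ^= 1                               # reject: undo the flip
        if E == 0:
            break                                   # found a codeword
    return x, E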

MCMC and the relaxation time of a random walk

We can understand the geometry of the metastable states in greater detail by reframing our MAP problem as follows:

\mu_{y,\beta}(\underline{x}) = \frac{1}{Z(\beta)}\exp\{-\beta E(\underline{x})\}\prod_{i=1}^N Q(y_i|x_i) (12)

This form is referred to as a `tilted' Boltzmann distribution because it is a Boltzmann distribution biased by the likelihood function.

In the low temperature limit (\beta \rightarrow \infty) this reduces to eqn 10, because the measure is then supported only on words in the codebook.

This distribution more closely mimics physical systems. For nonzero temperature it also puts weight on vectors which are not actually in our codebook but are still close to our received message and have low energy – this allows us to probe the metastable states which trap our local algorithms. This is referred to as a code with `soft' parity check constraints, since the distribution permits decodings which fail some checks.

We will use the following algorithm excerpted from Mezard and Montanari Chapt 21:

Fig 8 Excerpted from Mezard and Montanari, 2009 Chapt 21

Where a Glauber update consists of scanning through the bits of the current proposed configuration and flipping the value of bit i with probability

w_{i}(\underline{x}) = \frac{\mu_{y,\beta}(\underline{x}^{(i)})}{\mu_{y,\beta}(\underline{x}^{(i)}) + \mu_{y,\beta}(\underline{x})} (13)

where \underline{x} is the current configuration and \underline{x}^{(i)} is the configuration obtained by flipping the i^{th} bit of \underline{x}.

This method is a variant of traditional Markov chain Monte Carlo algorithms in which we lower the temperature according to an annealing schedule. Initially, the schedule assigns probability to states essentially in proportion to the likelihood component of equation 12, allowing the chain to sample the configuration space in the neighborhood of the received noisy word; in the low temperature limit the measure becomes concentrated near configurations which are proper codewords.
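Here is a compact sketch of this procedure for a BSC, with an illustrative linear schedule in \beta (the book's Fig 8 specifies the actual algorithm; the schedule, starting configuration, and parameter names are our own choices). It implements the Glauber rule (13) for the tilted measure (12):

import numpy as np

def anneal_decode(H, y, p, betas, sweeps_per_beta=10, seed=0):
    """Simulated annealing for mu_{y,beta}(x) propto exp(-beta*E(x)) * prod_i Q(y_i|x_i)
    on a BSC(p), using Glauber single-bit updates (eqn 13). E(x) is twice the number of
    violated checks. Returns the final configuration and its energy."""
    rng = np.random.default_rng(seed)
    x = np.array(y, dtype=int)                               # start near the received word
    logQ_ratio = np.log((1 - p) / p) * (1 - 2 * np.asarray(y))  # log Q(y_i|0) - log Q(y_i|1)
    for beta in betas:                                       # increasing beta = cooling
        for _ in range(sweeps_per_beta):
            for i in range(H.shape[1]):
                checks = np.flatnonzero(H[:, i])
                parities = (H[checks] @ x) % 2
                dE = 2 * int(np.sum(1 - 2 * parities))       # energy change if bit i flips
                dL = -logQ_ratio[i] if x[i] == 0 else logQ_ratio[i]
                w = 1.0 / (1.0 + np.exp(np.clip(beta * dE - dL, -50, 50)))  # Glauber rule
                if rng.random() < w:
                    x[i] ^= 1
    return x, 2 * int(np.sum((H @ x) % 2))

# e.g. anneal_decode(H, y, p=0.05, betas=np.linspace(0.1, 3.0, 60))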

This method is useful to us because MCMC is a good model of how randomized local search for optimal configurations occurs in physical systems. Furthermore, whether an MCMC converges, and how long it takes to do so, tells us about both the energy wells it terminates in and the barriers between minima in the energy landscape.

Let’s now show a property relating convergence times of MCMCs and energy barriers known as the Arrhenius law.

If we take the example of using a simple MCMC random walk with the update rule below over the following landscape

w(x\rightarrow x') = \min \{e^{-\beta [E(x')-E(x)]},~1\}

Fig 9 This represents a random walk along a line in which there are two ground states separated by an energy barrier of height \Delta E. Excerpted from Mezard and Montanari, 2009 Chapt 13.

We find that the expected number of time steps to cross from one well to another is governed by the Arrhenius law \tau \approx \exp\{\beta \Delta E\}.

In general, if there exists a largest energy barrier \Delta E between two components of the configuration space (also known as the bottleneck), the time it takes to sample both components, also known as the relaxation time of the MCMC, satisfies \tau_{exp} = \Omega(e^{\beta \Delta E}).
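As a toy illustration of the Arrhenius law (our own example, not from the lecture): a Metropolis walk with the update rule above on a one-dimensional double-well profile takes on the order of \exp(\beta \Delta E) steps to cross the barrier.

import numpy as np

def crossing_time(beta, barrier=4.0, width=10, seed=0):
    """Metropolis walk on sites 0..2*width with a triangular energy barrier of height
    `barrier` at the midpoint; counts the steps needed to travel from one well bottom
    (site 0) to the other (site 2*width)."""
    rng = np.random.default_rng(seed)
    E = barrier * (1 - np.abs(np.arange(2 * width + 1) - width) / width)
    x, steps = 0, 0
    while x != 2 * width:
        steps += 1
        xp = x + rng.choice([-1, 1])                 # propose a step left or right
        if 0 <= xp <= 2 * width and rng.random() < min(1.0, np.exp(-beta * (E[xp] - E[x]))):
            x = xp                                   # Metropolis acceptance
    return steps

for beta in [0.5, 1.0, 1.5]:
    print(beta, crossing_time(beta))                 # grows roughly like exp(beta * barrier)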

With this in mind, we can apply our simulated annealing MCMC to LDPC decoding and examine the properties of the bottlenecks, or metastable states, in our configuration landscape.

Exact values of the metastable energy states for the BEC

It is a well-known consequence of the 1RSB cavity method that the number of metastable states of energy Ne grows like \exp(N\Sigma^{e}(e)), where \Sigma^{e}(e) is known as the energetic complexity function, a function whose form is implied by the 1RSB cavity equations. This computation can be carried out using a method called Survey Propagation, which constructs a factor graph of the messages passed in the original BP factor graph and estimates the marginals of those messages via another round of BP (hence the name 1-step RSB).

Omitting the actual form of the calculations, we show the following approximate results for the BEC.

Fig 10 Metastable states for random elements of the (3,6) regular ensemble used over the BEC (\epsilon_d = .4294 and \epsilon_c = .4882). Left: the complexity function as a function of energy density above \epsilon_d. Right: the maximum and minimum energy densities of metastable states as a function of the erasure probability. Note that at \epsilon = .45, with \epsilon_d \leq .45 \leq \epsilon_c, the curve is positive only for non-zero metastable energy densities; this indicates exponentially many metastable states. At erasure rates above \epsilon_c there are exponentially many degenerate ground states. Excerpted from Mezard and Montanari, 2009 Chapt 21.

In the regime p \geq p_{d} there exists a zero-energy word corresponding to the correct solution. On top of this, there exist non-trivial solutions to the 1RSB equations yielding a complexity curve that is positive on an interval (e_{c}, e_{d}) of energy densities. The positive complexity means that there are exponentially many such states, and their finite energy density means they violate a finite fraction of the parity checks, making their energy wells relatively deep.

As the error rate of the channel increases above p_{c}, the minimum energy of the metastable states reaches zero continuously. This means that at noise levels above p_{c} there are exponentially many zero-energy states, i.e. codewords compatible with the received message other than the transmitted one. These states are separated by energy barriers of height O(N), making the relaxation time of local algorithms, by the Arrhenius law, \exp(\Theta(N)) in this regime.

Fig 11 Decoding random codes from the (3,6) ensemble over the BEC. In both cases N = 10^4, and the annealing schedule consists of t_max = 1000 equidistant temperatures in T = 1 / \beta \in [0,1]. Left: convergence to the correct ground state at \epsilon = .4 < \epsilon_d. Right: convergence to approximately the energy density e_d of the highest metastable states (as calculated via the complexity function from 1RSB) for \epsilon = .6 > \epsilon_{c}. Excerpted from Mezard and Montanari, 2009 Chapt 21.

Here you can see a rough sketch of convergence of the simulated annealing algorithm. As the temperature decreases in the p \leq p_{d} regime the algorithm converges to a 0 energy ground state. In the figure on the right we can see that simulated annealing converges to the horizontal line here which corresponds to the energy of the highest metastable state e_{d} for the BEC at p = .6.

Thus we see that our local search algorithms end up being attracted to the metastable states of highest energy.

Fig 12 Decoding random codes as in figure 11. Here we plot the minimum energy density achieved through simulated annealing plotted as a function of the erasure probability of the BEC. The continuous line indicates the highest lying metastable states as calculated from the complexity function via 1RSB.
 Excerpted from Mezard and Montanari, 2009 Chapt 21

Though there is not necessarily an exact correspondence between the residual energy at T=0 for simulated annealing and the highest metastable energy e_{d}, we see that across all values of p, at T=0, e_{ann} \approx e_{d}, suggesting that local search tends to get caught in the deepest metastable energy wells.

This discussion shows that the algorithmic threshold p_{d} of BP indicates the onset of a genuinely different regime within the energy landscape of the codebook. Metastable states with barriers of height O(N) proliferate and become exponentially difficult to escape from via local search methods. Thus the failure of BP likely indicates a regime in which no fast algorithm can perform decoding, even though decoding is still information-theoretically possible below p_c, e.g. via exhaustive search of the codebook.

Appendix A: Random Code Ensembles

In an RCE, the encoding maps applied to the information sequence are chosen uniformly at random. Two decoding schemes are often applied to the noisy output – word MAP and symbol MAP decoding. MAP, otherwise known as “maximum a posteriori probability” decoding, works by maximizing the posterior probability distribution to output the most probable transmission. Word MAP decoding outputs the codeword with the highest probability, thereby minimizing the block error probability – the probability, with respect to the channel distribution, that the decoded word differs from the true transmitted word. Symbol MAP decoding, on the other hand, minimizes the fraction of incorrect bits averaged over the transmitted code word (the bit error rate).

An RCE code is defined by its codebook in Hamming space, the set of all binary strings of length N. In a Hamming space with uniform probability, the number of codewords at a given Hamming distance is given by the distance enumerator. Distance enumerators take as parameters different weights, given that the probabilities of codewords are independent of each other. The distance enumerator shows that, for small enough fractional distance from the true message x_0, the growth rate is negative and the average number of codewords at small distance from x_0 vanishes exponentially with N. The Gilbert-Varshamov distance, a lower bound threshold, shows that the average number of codewords is exponentially large at points where the weight enumerator is concentrated.

We look at the performance of RCE code in communication over the Binary Symmetric Channel (BSC), where it is assumed that there is a probability p that transmitted bits will be “flipped” (i.e. with probability p, 1 becomes 0 and 0 becomes 1). With BSCs, channel input and output are the same length N sequences of bits. At larger noise levels, there are an exponential number of codewords closer to y and decoding is unsuccessful. However, decoding via the symbol MAP decoding scheme shows that the i-th bit is decoded by maximizing the marginal probability and amasses contributions from all codewords in the set. Above a threshold, the bit error rate is the same as if the message was transmitted without encoding and decoding, but below this, the RCE seems to work quite well in transmission.

Finite temperature decoding has also been studied as an interpolation between the two MAP decoding schemes. At low noise, a completely ordered phase can be observed, as compared to a glassy phase at higher noise levels. As in a statistical physics model, we can also note an entropy-dominated paramagnetic phase at higher temperatures.

Each decoding scheme can be analogized to “sphere packing”, where each probability in the Hamming space distribution represents a sphere of radius r. Decoding schemes have partitions in the Hamming space, so these spheres must be disjoint. If not, intersecting spheres must be eliminated. The lower bound of the remaining spheres is then given by Gilbert-Varshamov bound, whereas the upper bound is dictated by the Hamming distance.

Another random code besides the RCE is the RLC, or random linear code. Encoding in an RLC is a linear map, under which all points are equiprobable. The code is specified by an N x M binary matrix, otherwise known as the generating matrix, and it is expected to be error-free at rates below the Shannon capacity.

There are several sources of randomness in codes. Codes are chosen randomly from an ensemble and the codeword to be transmitted is chosen with uniform probability from the code, according to the theorem of source-channel separation. The channel output is then distributed according to a probabilistic process accounting for channel noise and decoding is done by constructing another probability distribution over possible channel inputs and by estimating its signal bit marginal. The decision on the i-th bit is dependent on the distribution. Thus, complications may arise in distinguishing between the two levels of randomness: code, channel input, and noise (“quenched” disorder) versus Boltzmann probability distributions.

Appendix B: Weight enumerators and code performance

The geometric properties of LDPC codebooks are studied via the distance enumerator \mathbb{N}_{\underline{x}_0}(d), which gives the number of codewords at Hamming distance d from \underline{x}_0. Taking the all-zeros codeword as the reference, this reduces to the weight enumerator \mathbb{N}(w)=\mathbb{N}_{\underline{x}_0=\underline{0}}(d=w), the number of codewords having weight equal to w. To estimate the expected weight enumerator \mathcal{N}(w)=\mathbb{E}\mathbb{N}(w) for a random code in the LDPC_N(\Lambda,P) ensemble, note that \mathcal{N}(w) grows exponentially in the block-length N when the weight grows linearly with N, w=N\omega. The exponential growth rate \phi(\omega) is defined by

\mathcal{N}(w=N\omega) = e^{N \phi (\omega)} (14)

denoting an ‘annealed average’, which, as in disordered systems, can be dominated by rare instances in the ensemble. The expected enumerator is given by the number of ‘colored factor graphs’ in which every function node has an even number of weighted incident edges, divided by the total number of factor graphs in the ensemble.

On the other hand, for graphs of fixed degrees, with N variable nodes of degree l and M function nodes of degree k, the total number of edges F is given by F=Mk=Nl. A valid colored graph has E=wl weighted edges: the w weight-one variable nodes can be chosen in {N\choose w} ways, each contributing l weighted sockets, while variable nodes outside this set contribute l unweighted sockets each. If we take m_r to denote the number of function nodes with r weighted sockets, under the constraints \Sigma_{r=0}^km_r=M and \Sigma_{r=0}^krm_r=lw, we find the number of ways to color the function node sockets to be

\mathbb{C}(k,M,w) = \sum_{m_{0},...,m_{k}}^{\mathrm{even}}{M\choose m_{0},...,m_{k}}\prod_{r}{k\choose r}^{m_{r}}\,\mathbb{I}\Big(\sum_{r=0}^km_r=M\Big)\,\mathbb{I}\Big(\sum_{r=0}^krm_r=lw\Big) (15)

Joining variable and check node sockets so that the colorings match can be done in (lw)!(F-lw)! ways out of the F! total matchings in each ensemble element, which yields the following formula:

\mathcal{N}(w)=\frac{(lw)!(F-lw)!}{F!}{N\choose w}\mathbb{C}(k,M,w) (17)

In the low noise limit, code performance depends on the existence of codewords at distances close to the transmitted codeword. Starting with minimum variable degree 1 and using the parametric representation for the reduced weight,

\omega = \sum_{l=1}^{l_{max}}\Lambda_l\frac{xy^l}{1+xy^l} (18)

one can derive that

\phi(\omega) = -\frac{1}{2}\omega\log(\omega/\Lambda_1^2)+O(\omega) (19)

when x,y,z scale as \sqrt{\omega}. This shows that the exponential growth rate \phi(\omega) is strictly positive when \omega is sufficiently small, and that the expected number of codewords within a small Hamming distance of a given codeword is exponential in N. If we take the logarithm of the expected weight enumerator and plot it versus the reduced weight \omega=w/N for an irregular code with l_{min}=1, we see that \phi(\omega) is positive near the origin, but that its derivative diverges as \omega\rightarrow 0. Since this means that each codeword is surrounded by a large number of very close other codewords, the code is a very bad ECC: it is hard to discriminate between codewords at Hamming distance O(1) given noisy observations. Applying this same logic to l_{min} = 2, we observe that \phi(\omega) tends to 0 more quickly as \omega\rightarrow 0. If we assume that this behavior holds beyond the asymptotic regime, we get

\bar{\mathcal{N}}(w) = e^{Aw} (20)

or that the number of codewords around a particular codeword is o(N) until a Hamming distance d_* \simeq \log N/A, otherwise known as the “effective minimum distance”. For l_{min} \geq 3, we find:


\phi(\omega) \simeq \Big(\frac{l_{min}-2}{2}\Big)\omega\log\Big(\frac{\omega}{\Lambda_{l_{min}}}\Big) (21)

suggesting that LDPC codes with l_{min} \geq 3 have good short-distance behavior: letting \omega_{*} denote the first non-trivial zero of \phi(\omega), there are with high probability no codewords within an extensive distance N\omega_{*} of the transmitted one, so any error that changes a fraction of the bits smaller than \omega_{*}/2 can be corrected.

Let us now focus on the capacity of LDPC codes to correct typical errors in a probabilistic channel. Consider a binary symmetric channel that flips each transmitted bit independently with probability p<\frac{1}{2}. If the all-zero codeword \underline{x}^{(0)} =\underline{0} has been transmitted and \underline{y} received, the components of \underline{y} are iid random variables taking value 0 with probability 1-p and value 1 with probability p, and we use the MAP decoding strategy to minimize the block error rate, outputting the codeword closest to the channel output \underline{y}. The ensemble average P_B = \mathbb{E}P_B(\mathbb{C}) of the block error probability is an indicator of code ensemble performance. We will show that, as N\rightarrow \infty, codes with l_{min}\geq 3 undergo a phase transition separating a low noise phase from a high noise phase. To derive a lower bound on the capacity of LDPC codes over a BSC, we take \mathbb{N}=2^{NR} as the size of the codebook \mathbb{C} and apply the union bound:

P_{B}(\mathbb{C})= \mathbb{P}\Big\{\exists \alpha \neq 0 \text{s.t. } d(\underline{x}^{(\alpha)},\underline{y})\leq d(\underline{0},\underline{y})\Big\} (22)

\leq \sum_{\alpha=1}^{\textit{N}-1}\mathbb{P}\Big\{d(\underline{x}^{(\alpha)},\underline{y}) \leq d(\underline{0},\underline{y})\Big\} (23)

\leq \sum_{w=1}^N \textit{N}(w)e^{-\gamma w} (24)
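The factor e^{-\gamma w} can be made explicit (a standard Chernoff/Bhattacharyya step, which we add here for completeness): for a codeword \underline{x}^{(\alpha)} of weight w over a BSC(p), the difference d(\underline{x}^{(\alpha)},\underline{y}) - d(\underline{0},\underline{y}) depends only on the w positions where \underline{x}^{(\alpha)} is 1, and

\mathbb{P}\Big\{d(\underline{x}^{(\alpha)},\underline{y}) \leq d(\underline{0},\underline{y})\Big\} \leq \big[4p(1-p)\big]^{w/2} = e^{-\gamma w}, \qquad \gamma = -\tfrac{1}{2}\log\big[4p(1-p)\big] > 0 \text{ for } p < \tfrac{1}{2}.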

This derivation shows that the block error probability depends on the weight enumerator and on the factor \exp(-\gamma w). The second term means that an increase in the weight of a codeword scales down its contribution by an exponential factor, because it is less likely that the received message \underline{y} will be closer to a codeword of large weight than to the all-zero codeword. For large enough p, Shannon’s theorem implies that P_B is bounded away from 0 for any non-vanishing rate R > 0; conversely, at any p less than the ML threshold p_{ML} (defined by \lim_{N\rightarrow \infty}P_B=0), one can communicate with an arbitrarily small error probability. Near the threshold, the upper bound on P_B is dominated by codewords of weight w \approx N\tilde{w}, suggesting that each time an error occurs, a finite fraction of the bits are decoded incorrectly and that this fraction does not change much from transmission to transmission. The construction also illustrates that this fraction of incorrectly decoded bits jumps discontinuously from 0 to a finite value when p crosses the critical value p_{ML}, constituting a “gap.” This gap is close to a factor of 2.

Appendix C: BP performance

Recall that for LDPC code ensembles in the large block-length limit, the degree distributions of variable nodes and check nodes are given by \Lambda = {\Lambda_l} and P = {P_k} respectively (see figure 2 for an illustration of the corresponding factor graph). We assume for simplicity that messages are initialized to u_{a\rightarrow i}^{(0)} = 0. With this symmetric initialization, the bit error probability is independent of the transmitted codeword, and we therefore have the freedom to assume transmission of the all-zero codeword. In analyzing the recursion at the basis of the BP algorithm, we can show that decoding performance improves over time on the basis of symmetry and physical degradation.

Symmetry

Symmetry of the channel log-likelihood and of the variables appearing in density evolution is an attribute of BMS channels, and this symmetry is preserved by the BP operations during the evolution. If we assume that the factor graph associated with an LDPC code is “tree-like”, we can apply BP decoding with a symmetric random initial condition and note that the messages u_{a\rightarrow i}^{(t)} and h_{i\rightarrow a}^{(t)} are symmetric variables at all t\geq 0. This preservation of symmetry is analogous to the Nishimori condition in spin glasses and holds for the MAP log-likelihood of a bit as well.

Physical degradation

Let’s first define physical degradation for BMS channels. Take two channels BMS(1) and BMS(2) with transition matrices

\{Q_{1}(y|x)\}, \{Q_{2}(y|x)\} and output alphabets \mathbb{Y}_1,\mathbb{Y}_2. We say BMS(2) is physically degraded with respect to BMS(1) if there exists a third channel C with input alphabet \mathbb{Y}_1 and output alphabet \mathbb{Y}_2 such that BMS(2) is the concatenation of BMS(1) and C. If we represent the transition matrix of C as \{R(y_2|y_1)\}, we can express this physical degradation as

Q_2(y_2|x) = \sum_{y_1 \in \mathbb{Y}_1}R(y_2|y_1)Q_1(y_1|x) (25)

This is analogous to a Markov chain X \rightarrow Y_1 \rightarrow Y_2, and it defines a partial ordering on channels. Channel reliability is then ordered by measures such as conditional entropy and bit error rate. This extends to symmetric random variables, which are associated with BMS channels.

Thresholds

We then fix a particular LDPC code and view the BP messages as random variables, the randomness coming from the received vector \underline{y}. The proposition below shows that the bit error rate decreases monotonically with time:

Proposition: If B_{i,r}(F) is a tree, then h_i^{(0)}\preceq h_i^{(1)} \preceq ... \preceq h_i^{(t-1)} \preceq h_i^{(t)} for any t\leq r-1. Analogously, if B_{i\rightarrow a,r}(F) is a tree, then h_{i\rightarrow a}^{(0)}\preceq h_{i\rightarrow a}^{(1)} \preceq ... \preceq h_{i\rightarrow a}^{(t-1)} \preceq h_{i\rightarrow a}^{(t)} for any t\leq r-1.

Density evolution in this manner tracks the distributions of the variables {h^{(t)},u^{(t)}} and {h_{*}^{(t)}}. Since the bit error rate P_{b}^{(t)} and the conditional entropy H^{(t)} are both monotonically decreasing functions of the number of iterations (and, conversely, monotonically increasing functions of p), we can define the finite limit P_b^{BP} \equiv \lim_{t\rightarrow\infty}P_b^{(t)}. The corresponding BP threshold can then be defined as

p_{d} \equiv \sup \Big\{ p \in [0,1/2] : P_b^{BP}(p)=0 \Big\}

For p > p_d, however, increasing the number of iterations does not help: the bit error rate is asymptotically lower bounded by P_{b}^{BP}(p)>0 for any fixed number of iterations. Good LDPC codes are thus designed with a large BP threshold, with design rate R_{des}=1-P'(1)/\Lambda '(1), so as to maximize the threshold noise level for a given degree distribution pair. Such optimized ensembles have a finite fraction of variable nodes of degree 2 and a large number of codewords with small weight, which ultimately prevent the block error probability from vanishing as N\rightarrow \infty.

References

[1] Marc Mezard and Andrea Montanari.
Information, Physics, and Computation.
Oxford Graduate Texts, 2009.

Ising Perceptron under Gaussian Disorder, and k-NAE-SAT

December 13, 2018

Blog Post By: Patrick Guo, Vinh-Kha Le, Shyam Narayanan, and David Stoner

Methods in statistical physics are known to be extremely useful for understanding certain problems in theoretical computer science. Physical observations can motivate the underlying theoretical models, which in turn explain some of the physical phenomena.

This post is based on Professor Nike Sun’s guest lecture on the Ising Perceptron model and regular NAE-SAT for CS 229R: Physics and Computation. These are both examples of random constraint satisfaction problems, where we have n variables x_1, ..., x_n \in \{-1, 1\} and certain relations, or constraints, between the x_i, and we wish to approximate the number of solutions or visualize the geometry of the solutions. For both problems, the problem instance can be random: for example, the linear constraints in the Ising perceptron model are random, and the clauses in the NAE-SAT instance are chosen at random. As in the previous blog posts, to understand the geometry of solutions, statistical physicists think of sampling random solutions from \{-1, 1\}^n, which introduces a second type of randomness.

This post is meant to be expository. For interested readers, we point to useful references at the end for more rigorous treatments of these topics.

1. Perceptron Model

The Ising Perceptron under Gaussian disorder asks how many points in \{\pm 1\}^n survive M half-plane bisections (i.e. are satisfying assignments for M constraints), where the planes’ coefficients are drawn from standard Gaussians. Like many problems in statistical physics, there is likely a critical capacity of constraints where having more constraints yields no survivors with high probability, and having fewer constraints yields survivors with high probability. This lecture gives an overview of a proof for one side of this physics prediction, i.e. the existence of a lower bound critical capacity, where having fewer constraints yields survivors with positive probability. Briefly, we use the Cavity method to approximate the distribution of the number of satisfying assignments, then attempt to use the second moment method on that distribution to get our lower bound. A direct application fails, however, due to the variance of the Gaussian constraints. The solution is to carefully choose exactly what to condition on without destroying the model. Specifically, we iteratively compute values related to the Gaussian disorder, after which we are able to remove enough variance for the second moment method to work and thus establish the lower bound for the Ising Perceptron’s critical capacity. The result holds subject to an analytical condition which is detailed in the paper (Condition 1.2 in [6]) and which remains to be rigorously verified.

1.1. Problem

We pick a random direction in \mathbb{R}^n, and delete all vertices in the hypercube \{\pm 1\}^n which are in the half-space negatively correlated with that direction. We repeat this process of picking a random half space and deleting points M times, and see if any points in the hypercube survive. Formally, we define the perceptron model as follows:

Definition 1: Let G = (g_{\mu{}i})_{\mu\ge 1, i\ge 1} be an array of i.i.d. standard Gaussians. Let M be the largest integer such that:

\left\{J\in \{\pm 1\}^N:\sum_{i=1}^N\frac{g_{\mu{}i}J_i}{\sqrt{N}}\ge 0 \hspace{0.1cm} \forall \mu\le M\right\}\neq \emptyset.

More compactly, M is the largest integer such that

\left\{J\in \{\pm 1\}^N:\frac{GJ}{\sqrt{N}}\ge 0\right\}

is nonempty, where the inequality is taken pointwise, G \in \mathbb{R}^{M \times N} is an array of i.i.d. std. Gaussians, and J is treated as a vector in \{\pm 1\}^N \subset \mathbb{R}^N.

CS229 Pic 1

In the late 80’s, physicists conjectured that there is a critical capacity \alpha_{*} such that \frac{M}{N} \overset{P}{\to} \alpha_{*} where M is a function of N. The predicted critical capacity has been studied, for example in [7,9]. Our goal is to establish a lower bound on \alpha_* for the Ising Perceptron under Gaussian disorder. To do this, let Z(G) denote the random variable which measures how many choices of n variables survive M Gaussian constraints in G (this is the partition function). We want to show Z > 0 with high probability when there are M < \alpha_{*}N constraints. To this end, the second moment method seems promising:

Second Moment Method: If Z is a nonnegative random variable with finite variance, then

P(Z > 0) \ge \frac{\left(\mathbb{E}[Z]\right)^2}{\mathbb{E}[Z^2]}
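This follows in one line from Cauchy–Schwarz applied to Z = Z \cdot \mathbb{I}(Z>0):

\left(\mathbb{E}[Z]\right)^2 = \left(\mathbb{E}[Z\,\mathbb{I}(Z>0)]\right)^2 \leq \mathbb{E}[Z^2]\,\mathbb{E}[\mathbb{I}(Z>0)] = \mathbb{E}[Z^2]\,P(Z>0).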

However, this method actually fails due to various sources of variance in the perceptron model. We will briefly sketch the fix as given in [6].

Before we can start, however, what does the distribution of Z even look like? This is actually quite computationally intensive; we will use the cavity equations, a technique developed in [12,13], to approximate Z‘s distribution.

1.2. Cavity Method

The goal of the cavity method is to see how the solution space changes as we remove a row or a column of the matrix G with i.i.d. standard Gaussian entries, assuming that G is fixed. Since our matrix G is big and difficult to deal with, we try to see how the solution space changes as we add one row or one column at a time. Why is this valuable? We can think of the system of variables and constraints as an interaction between the rows (constraints) and columns (variables), so the number of solutions should behave proportionally to the product of the solutions attributed to each variable and each constraint. With this as motivation, define G_{-\mu} as the matrix obtained by removing row \mu and G^{-i} as the matrix obtained by removing column i. We can approximate

Z(G) \approx \prod\limits_{\mu = 1}^{M} \frac{Z(G)}{Z(G_{-\mu})} \cdot \prod\limits_{i = 1}^{N} \frac{Z(G)}{Z(G^{-i})}

since we can think of the partition function as receiving a multiplicative factor from each addition of a row and each addition of a column. Thus, the cavity method seeks to compute Z(G)/Z(G_{-\mu}) and Z(G)/Z(G^{-i}).

1.2.1. Removing a constraint

Our goal in computing Z(G)/Z(G_{-\mu}) is to understand how the solution space SOL(G_{-\mu}) changes when we add in the constraint \mu. Recalling that J \in \{\pm 1\}^n is our vector that is potentially in the solution space, if we define

\Delta_\mu := \sum_{i = 1}^{N}\frac{g_{\mu{}i}J_i}{\sqrt{N}},

then adding the constraint \mu is equivalent to forcing \Delta_\mu \ge 0. We will try to understand the distribution of \Delta_\mu under the Gibbs measure \nu_{-\mu}, which is the uniform measure on the solution space without the constraint \mu. This way, we can determine the probability of \Delta_\mu \ge 0 under the Gibbs measure, which will equal Z(G)/Z(G_{-\mu}). To do so, we will use the Replica Symmetric Cavity Assumption (which we will abbreviate as RS cavity assumption). The RS cavity assumption lets us assume that the J_i‘s for 1 \le i \le N are independent under \nu_{-\mu}. As the J_i‘s are some discrete sample space that depend on our constraints, we do not actually have full independence, but the RS cavity assumption tells us there is very little dependence between the J_i‘s, so we pretend they are independent.

Note that \Delta_\mu should be approximately normally distributed, since it is the sum of many “independent” terms g_{\mu i} J_i by the RS Cavity assumption. Now, define h_\mu as the expectation of \Delta_\mu under the Gibbs measure without the constraint \mu, and v_\mu as the variance of \Delta_\mu under the Gibbs measure without the constraint \mu. It will also turn out that the variance v_\mu will concentrate around a constant \sigma^2. Thus,

\frac{Z(G)}{Z(G_{-\mu})} \approx \nu_{-\mu}(\Delta_{\mu} \ge 0) = \overline{\Phi} (\frac{-h_\mu}{\sigma}).

Here, \overline{\Phi}(\frac{-h_\mu}{\sigma}) equals the probability that a random \mathcal{N}(h_\mu, \sigma^2) Gaussian distribution is positive, or equivalently, the probability that a standard Gaussian \mathcal{N}(0, 1) distribution is greater than \frac{-h_\mu}{\sigma}.

1.2.2. Removing a spin

To calculate the cavity equation for removing one column, we think of removing a column as deleting one spin from J. Define J^{-i} as the vector resulting from removing spin i from J. We want to calculate Z(G)/Z(G^{-i}). Note that it is possible for J^{-i} \not\in SOL(G^{-i}) but J \in SOL(G) now, which complicates this calculation. It is possible to overcome this difficulty by passing to positive temperature, though this makes the calculations incredibly difficult. We do not worry about these issues here, and just briefly sketch how Z(G)/Z(G^{-i}) is computed.

To compute Z(G)/Z(G^{-i}), we will split the numerator into a sum of two terms based on the sign of J_i, which will allow us to compute not only Z(G) but also \langle J_i \rangle, which represents the magnetization at spin i. A series of complicated calculations (see the lecture notes for the details) will give us

\frac{Z(G)}{Z(G^{-i})} = \frac{\exp(H_i)+\exp(-H_i)}{exp(c)}

for some constant c, where H_i is a quantity that compares how much more correlated spin i is to the constraints than all other spins. The \exp(+H_i) will come from the solutions for G with J_i = 1 and the \exp(-H_i) will come from the solutions with J_i = -1.

The above equations allow us to deduce that

Z(G) \approx \prod\limits_{\mu = 1}^{M} \overline{\Phi}\left(-\frac{h_\mu}{\sigma}\right) \cdot \prod\limits_{i = 1}^{N} \frac{\exp(H_i) + \exp(-H_i)}{\exp(c)}.

1.3. The Randomness of G

In the previous section, we regarded G as fixed. We now use these results but allow G to be random again. Recall that G‘s entries were i.i.d. Gaussians. We thus get that h_\mu, as a linear combination of the g_{\mu{}i}‘s, is a Gaussian. We thus can write h_\mu \sim \mathcal{N}(0, q). Similarly, H_i is a linear combination of the g_{\mu{}i}‘s and must be Gaussian. We thus write H_i \sim \mathcal{N}(0, \psi).

With some calculations detailed in the lecture notes and [12], we get Gardner’s Formula:

\frac{\ln Z}{N} \rightarrow \alpha \int \ln \overline{\Phi}\left(-\frac{\sqrt{q} z}{\sigma}\right) \varphi(z) dz + \int \ln \left(2 \cosh(\sqrt{\psi}\, z)\right)\varphi(z) dz

where \alpha is our capacity, \frac{M}{N}. It turns out that \sigma^2 = 1-q and c = \psi(1-q)/2, so the above equation only depends on two parameters, q and \psi. It will turn out that q, \psi have relations dependent on each other based on our definitions of h_\mu and H_i, and these will give us a fixed point equation for (q, \psi), which has two solutions: one near \alpha \approx 1 and one near \alpha \approx 0.83. It is believed that the second point is correct, meaning that the critical capacity should equal \alpha_* = 0.83.
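Purely as an illustration, Gardner's formula is easy to evaluate numerically for given (q, \psi); the quadrature grid and the placeholder values of q and \psi below are our own choices and are not the actual fixed-point solutions.

import numpy as np
from scipy.stats import norm

def gardner(alpha, q, psi, zmax=8.0, num=20001):
    """Numerically evaluate G = alpha * E_z[log Phibar(-sqrt(q) z / sigma)]
    + E_z[log(2 cosh(sqrt(psi) z))], with sigma^2 = 1 - q and z standard Gaussian."""
    z = np.linspace(-zmax, zmax, num)
    phi = norm.pdf(z)                                      # Gaussian density
    sigma = np.sqrt(1.0 - q)
    term1 = norm.logsf(-np.sqrt(q) * z / sigma)            # log Phibar, numerically stable
    a = np.sqrt(psi) * z
    term2 = np.abs(a) + np.log1p(np.exp(-2 * np.abs(a)))   # log(2 cosh a), stable form
    return alpha * np.trapz(term1 * phi, z) + np.trapz(term2 * phi, z)

# Placeholder (q, psi) purely for illustration; the real values solve the
# fixed-point equations alluded to above.
print(gardner(alpha=0.83, q=0.5, psi=1.0))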

1.4. Second Moment Method

Now that we have Gardner’s formula, solving for Z gives us an approximation for its distribution as

Z \sim \exp\{N\mathcal{G}(\alpha) + \mathcal{N}(0, N\varepsilon^2)\}

where \mathcal{G}(\alpha) is Gardner’s formula, and the Gaussian noise \mathcal{N}(0,N\varepsilon^2) arises from the Gaussian distribution of the h_\mu‘s and H_i‘s. Again, we are interested in the probability that Z > 0, but due to the approximate nature of the above equation, we cannot work with this distribution directly, and instead can try the second moment method. However, the exponentiated Gaussian makes \mathbb{E}[Z^2] \gg \mathbb{E}[Z]^2, so the moment method gives only a lower bound on P(Z>0) that vanishes with N, whereas we need P(Z>0) to be bounded away from zero.

The fault here lies with Gaussian noise due to the h_\mu‘s and H_i‘s, so it is natural to consider what happens when we condition on them, getting rid of the noise. We denote

\underline{h} = (h_\mu)_{\mu = 1}^M \text{ and } \underline{H} = (H_i)_{i=1}^N.

Then, we can compute that

\frac{1}{N}\log \mathbb{E}[Z | \underline{H},\underline{h}] \to \mathcal{G}(\alpha)

in probability as N\to \infty. (This is not an exact equality because of the approximations made in the derivation of Gardner’s formula.) Moreover, we will have \mathbb{E}[Z^2|\underline{H},\underline{h}] \approx \mathbb{E}[Z|\underline{H},\underline{h}]^2. The second moment method then gives us the desired lower bound on P(Z>0). However, note that these \underline{H},\underline{h} must satisfy the equations that defined them. In vector form, we have

\frac{1}{\sqrt{N}}G\tanh \underline{H} = \underline{h} + b_*F(\underline{h})
\frac{1}{\sqrt{N}}G^T F(\underline{h}) = \underline{H} + d_*\tanh\underline{H}

where b_*, F, and d_* are constants and functions that appear in the rigorous definitions of h_{\mu} and H_i.
Here arises a problem, however. We cannot solve these equations easily, and furthermore when we condition on \underline{H} and \underline{h}, the G‘s that satisfy the above are not necessarily representative of matrices of standard Gaussians. Hence, simply conditioning on knowing the values of \underline{h},\underline{H} destroys the model and will not prove the lower bound on P(Z>0) for Gaussian disorder.

To solve this problem, we introduce iteration: first, initialize \underline{h}^{(0)} and \underline{H}^{(1)} (to specific values detailed in [6]). Then, a simplified version of each time step’s update looks like

\underline{h}^{(t)} = \frac{G\tanh{\underline{H}^{(t)}}}{\sqrt{N}} - b_*F(\underline{h}^{(t - 1)}),
\underline{H}^{(t + 1)} = \frac{G^TF(\underline{h}^{(t)})}{\sqrt{N}} - d_*\tanh{\underline{H}^{(t)}}.

It has been proven (see [2,4]) that \underline{h}^{(t)} and \underline{H}^{(t)} converge as t\to\infty, and moreover the convergent values are distributed by what looks like a Gaussian at each time step. Since this sequence of \underline{h},\underline{H} converge (at a rate that is independent of n) and are representative of Gaussian G, for a special kind of truncated partition function \tilde Z, conditioning on these iteration values allows the second moment method to work and gives

\mathbb{E}[\tilde Z | \underline{h}^{(0)},\underline{H}^{(0)},\dots,\underline{h}^{(t)},\underline{H}^{(t)}]^2 \approx \mathbb{E}[\tilde Z^2 | \underline{h}^{(0)},\underline{H}^{(0)},\dots,\underline{h}^{(t)},\underline{H}^{(t)}]

which is then enough to establish the lower bound P(Z>0) for the non-truncated partition function Z (see [6] for details, see also [3] for closely related computations).
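Schematically, the conditioning data are produced by a loop like the following. This is only a cartoon: F, b_* and d_* are unspecified placeholders passed in as arguments (their real definitions are in [6]), and tanh acts componentwise.

import numpy as np

def iterate_h_H(G, F, b_star, d_star, T=50, seed=0):
    """Cartoon of the iteration: alternately update h and H using G and G^T and
    subtract the correction terms involving the previous iterates. Placeholder F,
    b_star, d_star; not the actual construction of Ding-Sun [6]."""
    rng = np.random.default_rng(seed)
    M, N = G.shape
    h = rng.standard_normal(M)            # placeholder initialization of h^{(0)}
    H = rng.standard_normal(N)            # placeholder initialization of H^{(1)}
    trajectory = [(h.copy(), H.copy())]
    for _ in range(T):
        h_new = G @ np.tanh(H) / np.sqrt(N) - b_star * F(h)
        H = G.T @ F(h_new) / np.sqrt(N) - d_star * np.tanh(H)
        h = h_new
        trajectory.append((h.copy(), H.copy()))
    return trajectory                     # the values one would condition on

# Purely illustrative call with made-up choices (F = tanh, small constants):
G = np.random.default_rng(1).standard_normal((80, 100))
traj = iterate_h_H(G, F=np.tanh, b_star=0.5, d_star=0.5)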

2. k-NAE-SAT

This lecture also gave a brief overview of results in k-NAESAT. In the k-SAT problem, we ask whether a boolean formula in CNF form, with each clause containing exactly k literals, has a satisfying solution. For k-NAESAT, we instead require that the satisfying solution is not uniform on any clause; equivalently, each clause must contain at least one true and at least one false value. Finally, we will restrict our set of possible boolean formulae to those for which every variable is contained in exactly d clauses; we call this model d-regular k-NAESAT. We briefly outline the critical capacities of clauses where the solution space changes. For more background on the k-NAESAT problem and its variants, see [10].
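To make the definition concrete, here is a small sketch (our own representation: a clause is a list of signed integers, +i for x_i and -i for its negation) that checks the not-all-equal condition and d-regularity.

from collections import Counter

def is_nae_satisfied(clauses, assignment):
    """assignment maps variable index -> bool. A clause is NAE-satisfied iff its
    literal values are not all equal, i.e. it has at least one True and one False."""
    def lit_value(lit):
        v = assignment[abs(lit)]
        return v if lit > 0 else not v
    return all(len({lit_value(l) for l in clause}) == 2 for clause in clauses)

def is_d_regular(clauses, d):
    """Every variable appears in exactly d clauses."""
    counts = Counter(abs(l) for clause in clauses for l in clause)
    return all(c == d for c in counts.values())

# Toy 2-regular 3-NAE-SAT instance on variables 1..3:
clauses = [[1, 2, 3], [-1, -2, 3]]
print(is_d_regular(clauses, 2), is_nae_satisfied(clauses, {1: True, 2: False, 3: True}))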

As with the perceptron model, we are concerned with the expected number of solutions Z of this system where the d-regular formula is chosen uniformly at random. It turns out that for any fixed \alpha=\frac{d}{k}, the expected proportion of satisfiable solutions as n\to\infty converges to some f(\alpha). More on this, the model in general and the critical \alpha constants mentioned below can be found in [15]. Via an application of Markov’s inequality and Jensen’s Inequality for the convex function x^n, f(\alpha) may be bounded above by f^{RS}(\alpha)=(\mathbb{E} Z)^{\frac{1}{n}}, shown graphically below.

CS 229 Pic 2
The diagram also shows the nature of the expected assortment of the solutions of a random d-regular k-NAESAT for ranges of values of \alpha. Namely, physics analysis suggests the following:

  • For 0<\alpha<\alpha_d, w.h.p. the solutions are concentrated in a single large cluster.
  • For \alpha_d<\alpha<\alpha_{\text{cond}}, w.h.p. the solutions are distributed among a large number of clusters.
  • For \alpha_{\text{cond}}<\alpha<\alpha_{\text{sat}}, the function f(\alpha) breaks away from f^{RS}(\alpha), and w.h.p. the solutions are concentrated in a small number of clusters.
  • For \alpha>\alpha_{\text{sat}}, w.h.p. there are no satisfying assignments.

One final note for this model: the solutions to satisfiability problems tend to be more clumpy than in the perceptron model, so correlation decay methods won’t immediately work. See [15] for how this is handled.

References

  1. N. Bansal. Constructive algorithms for discrepancy minimization. In Proc. FOCS 2010, pages 3–10.
  2. M. Bayati and A. Montanari. The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Trans. Inform. Theory, 57(2):764-785, 2011.
  3. E. Bolthausen. A Morita type proof of the replica-symmetric formula for SK. arXiv preprint arXiv:1809.07972, 2018.
  4. E. Bolthausen. An iterative construction of solutions of the TAP equations for the Sherrington-Kirkpatrick model. Commun. Math. Phys., 325(1):333-366, 2014.
  5. T. Cover. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electron. Comput. 14(3):326-334.
  6. J. Ding and N. Sun. Capacity lower bound for the Ising perceptron. https://arxiv.org/pdf/1809.07742.pdf
  7. E. Gardner and B. Derrida. Optimal storage properties of neural network models. J. Phys. A., 21(1): 271–284, 1988.
  8. J. H. Kim and J. R. Roche. Covering cubes by random half cubes, with applications to binary neural networks. J. Comput. Syst. Sci., 56(2):223–252, 1998
  9. W. Krauth and M. Mézard. Storage capacity of memory networks with binary couplings. J. Phys. 50(20): 3057-3066, 1989.
  10. F. Krzakala et al. Gibbs states and the set of solutions of random constraint satisfaction problems. Proc. Natl. Acad. Sci., 104(25):10318-10323, 2007.
  11. S. Lovett and R. Meka. Constructive discrepancy minimization by walking on the edges. SIAM J. Comput., 44(5):1573-1582.
  12. M. Mézard. The space of interactions in neural networks: Gardner’s computation with the cavity method. J. Phys. A., 22(12):2181, 1989
  13. M. Mézard and G. Parisi and M. Virasoro. SK Model: The Replica Solution without Replicas. Europhys. Lett., 1(2): 77-82, 1986.
  14. M. Shcherbina and B. Tirozzi. Rigorous solution of the Gardner problem. Commun. Math. Phys., 234(3):383-422, 2003.
  15. A. Sly and N. Sun and Y. Zhang. The Number of Solutions for Random Regular NAE-SAT. In Proc. FOCS 2016, pages 724-731.
  16. J. Spencer. Six standard deviations suffice. Trans. Amer. Math. Soc. 289 (1985), 679-706
  17. M. Talagrand. Intersecting random half cubes. Random Struct. Algor., 15(3-4):436–449, 1999.

Peter Shor on Quantum Error Correction

December 7, 2018

[Guest post by Annie Wei who scribed Peter Shor’s lecture in our physics and computation seminar. See here for all the posts of this seminar. –Boaz]

On October 19, we were lucky enough to have Professor Peter Shor give a guest lecture about quantum error correcting codes. In this blog post, I (Annie Wei) will present a summary of this guest lecture, which builds up quantum error correcting codes starting from classical coding theory. We will start by reviewing an example from classical error correction to motivate the similarities and differences when compared against the quantum case, before moving into quantum error correction and quantum channels. Note that we do assume a very basic familiarity with quantum mechanics, such as that which might be found here or here.

1. Motivation
We are interested in quantum error correction, ultimately, because any real-world computing device needs to be able to tolerate noise. Theoretical work on quantum algorithms has shown that quantum computers have the potential to offer speedups for a variety of problems, but in practice we’d also like to be able to eventually build and operate real quantum computers. We need to be able to protect against any decoherence that occurs when a quantum computer interacts with the environment, and we need to be able to protect against the accumulation of small gate errors, since quantum gates are continuous unitary operators and can therefore only ever be implemented approximately.

In error correction the idea is to protect against noise by encoding information in a way that is resistant to noise, usually by adding some redundancy to the message. The redundancy then ensures that enough information remains, even after noise corruption, so that decoding will allow us to recover our original message. This is what is done in classical error correction schemes.

Unfortunately, it’s not obvious that quantum error correction is possible. One obstacle is that errors are continuous, since a continuum of operations can be applied to a qubit, so a priori it might seem like identifying and correcting an error would require infinite resources. In a later section we show how this problem, that of identifying quantum errors, can be overcome. Another obstacle is the fact that, as we’ve stated, classical error correction works by adding redundancy to a message. This might seem impossible to perform in a quantum setting due to the No Cloning Theorem, which states the following:

Theorem (No Cloning Theorem): Performing the mapping

|\psi\rangle|0\rangle\mapsto|\psi\rangle|\psi\rangle

is not a permissible quantum operation.

Proof: We will use unitarity, which says that a quantum operation specified by a unitary matrix U must satisfy

\langle\phi|U^{\dagger}U|\psi\rangle = \langle\phi|\psi\rangle.

(This ensures that the normalization of the state |\psi\rangle is always preserved, i.e. that \langle\psi|\psi\rangle=1, which is equivalent to the conservation of probability.)

Now suppose that we can perform the operation

U(|\psi\rangle|0\rangle)=|\psi\rangle|\psi\rangle.

Then, letting

(\langle\phi|\langle 0|)(|\psi\rangle|0\rangle)=\alpha,

we note that by unitarity

(\langle\phi|\langle 0|)U^{\dagger}U(|\psi\rangle|0\rangle)=(\langle\phi|\langle 0|)(|\psi\rangle |0\rangle)=\alpha.

But

(\langle\phi|\langle 0|)U^{\dagger}U(|\psi\rangle|0\rangle)=(\langle\phi|\langle\phi|)(|\psi\rangle|\psi\rangle)=\alpha^2,

and in general \alpha\neq\alpha^2, so we have a contradiction.

How do we get around this apparent contradiction? To do so, note that the no-cloning theorem only prohibits the copying of non-orthogonal quantum states. With orthogonal quantum states, either \alpha=0 or \alpha=1, so we don’t run into a contradiction. This also explains why it is possible to copy classical information, which we can think of as orthogonal quantum states.

So how do we actually protect quantum information from noise? In the next section we first review classical error correction, as ideas from the classical setting re-appear in the quantum setting, and then we move into quantum error correction.

2. Review of Classical Error Correction
First we start by reviewing classical error correction. In classical error correction we generally have a message that we encode, send through a noisy channel, and then decode, in the following schematic process:

[Figure: schematic of classical error correction: message, encode, noisy channel, decode]
In an effective error correction scheme, the decoding process should allow us to identify any errors that occurred when our message passed through the noisy channel, which then tells us how to correct the errors. The formalism that allows us to do so is the following: we first define an r\times n encoding matrix G that takes a message m of length r and converts it to a codeword c of length n, where the codewords make up the span of the rows of G. An example of such a matrix is

G=\left(\begin{array}{ccccccc}0&0&0&1&1&1&1\\1&0&1&0&1&0&1\\0&1&1&0&0&1&1\\1&1&1&1&1&1&1\end{array}\right),

corresponding to the [7,4] Hamming code, which encodes a 4-bit message as a 7-bit codeword. Note that this code has distance 3: any two distinct codewords differ in at least 3 positions (equivalently, every nonzero codeword has Hamming weight at least 3). This means it can correct 1 error (a code of distance d can correct \lfloor(d-1)/2\rfloor errors).

We also define the parity check matrix H to be the matrix that satisfies

GH^T=0.

For example, to define H corresponding to the G we defined for the 7-bit Hamming code, we could take

H=\left(\begin{array}{ccccccc}0&0&0&1&1&1&1\\1&0&1&0&1&0&1\\0&1&1&0&0&1&1\end{array}\right).

Then we may decode x, a 7-bit string, in the following manner. Say that

x=c+e,

where c is a codeword and e is an error of Hamming weight at most 1 that we wish to correct. Then

xH^T=(c+e)H^T=eH^T.

Here eH^T uniquely identifies the error and is known as the error syndrome. Knowing it tells us how to correct the error. Thus our error correction scheme consists of the following steps (a small worked example in code follows the list):

  1. Encode an r-bit message m by multiplying by G to obtain the codeword mG=c.
  2. Send the message through channel generating error e, resulting in the string x=c+e.
  3. Multiply by H^T to obtain the error syndrome eH^T.
  4. Correct error e to obtain c.
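
As a small worked illustration of these four steps (a sketch assuming numpy, not taken from the lecture), the following encodes a message with the G above, flips one bit, and locates the flip from its syndrome, which equals the corresponding column of H:

```python
import numpy as np

# Worked example of the four-step scheme with the [7,4] Hamming code.
# G and H are exactly the matrices given earlier; all arithmetic is mod 2.
G = np.array([[0,0,0,1,1,1,1],
              [1,0,1,0,1,0,1],
              [0,1,1,0,0,1,1],
              [1,1,1,1,1,1,1]])
H = G[:3]                                # the parity check matrix

m = np.array([1, 0, 1, 1])               # a 4-bit message
c = m @ G % 2                            # step 1: encode
e = np.zeros(7, dtype=int); e[4] = 1     # step 2: the channel flips bit 5
x = (c + e) % 2
syndrome = x @ H.T % 2                   # step 3: error syndrome eH^T

# step 4: the syndrome matches the column of H at the flipped position,
# so we can locate and correct the single-bit error
pos = next(i for i in range(7) if np.array_equal(H[:, i], syndrome))
corrected = x.copy(); corrected[pos] ^= 1
assert np.array_equal(corrected, c)
```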

Having concluded our quick review of classical error correction, we now look at the theory of quantum error correction.

3. Quantum Error Correction
In this section we introduce quantum error correction by directly constructing the 9-qubit code and the 7-qubit code. Then we introduce the more general formalism of CSS codes, which encompasses both the 9-qubit and 7-qubit codes, before introducing the stabilizer formalism, which tells us how we might construct a CSS code.

3.1. Preliminaries
First we introduce some tools that we will need in this section.

3.1.1. Pauli Matrices
The Pauli matrices are a set of 2-by-2 Hermitian matrices that, together with the identity, form an orthogonal basis for the 2-by-2 Hermitian matrices, where a Hermitian matrix H satisfies H^{\dagger}=H. Note that we can form larger Hilbert spaces by taking the tensor product of smaller Hilbert spaces; in particular, taking k-fold tensor products of the Pauli matrices and the identity gives us a basis for the 2^k-by-2^k Hermitian matrices. Note also that, in quantum mechanics, we are generally interested in Hermitian matrices because they can be used to represent measurements, and because unitary matrices, which represent probability-preserving quantum operations, can be obtained by exponentiating Hermitian matrices (that is, every unitary matrix U can be written in the form U=e^{iH} for H a Hermitian matrix).

The Pauli matrices are given by

\sigma_x\equiv X\equiv\left(\begin{array}{cc}0&1\\1&0\end{array}\right)

\sigma_y\equiv Y\equiv\left(\begin{array}{cc}0&-i\\i&0\end{array}\right)

\sigma_z\equiv Z\equiv\left(\begin{array}{cc}1&0\\0&-1\end{array}\right).

By direct computation we can show that they satisfy the relations

X^2=Y^2=Z^2=I

ZX=-XZ=iY

YZ=-ZY=iX

XY=-YX=iZ.
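
These relations are easy to check numerically; here is a minimal verification (assuming numpy):

```python
import numpy as np

# Quick numerical check of the Pauli relations listed above.
I = np.eye(2)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.array([[1, 0], [0, -1]], dtype=complex)

assert np.allclose(X @ X, I) and np.allclose(Y @ Y, I) and np.allclose(Z @ Z, I)
assert np.allclose(Z @ X, 1j * Y) and np.allclose(X @ Z, -1j * Y)
assert np.allclose(Y @ Z, 1j * X) and np.allclose(Z @ Y, -1j * X)
assert np.allclose(X @ Y, 1j * Z) and np.allclose(Y @ X, -1j * Z)
```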

3.1.2. Von Neumann Measurements
We will also need the concept of a Von Neumann measurement. A Von Neumann measurement is given by a set of projection matrices \{E_1, E_2, ..., E_k\} satisfying

\sum_{i=1}^k E_i=I.

That is, the projectors partition a Hilbert space {\cal H} into k subspaces. Then, given any state |\psi\rangle\in{\cal H}, when we perform a measurement using these projectors we obtain the measurement result corresponding to E_i, with corresponding post-measurement state

\frac{E_i|\psi\rangle}{||E_i|\psi\rangle||},

with probability \langle\psi|E_i|\psi\rangle.
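
As a small sketch (assuming numpy; the function name is purely illustrative), a von Neumann measurement can be simulated by sampling an outcome i with probability \langle\psi|E_i|\psi\rangle and renormalizing the projected state:

```python
import numpy as np

# Minimal simulation of a von Neumann measurement given projectors summing to I.
def von_neumann_measure(projectors, psi, rng=np.random.default_rng()):
    probs = np.array([np.real(psi.conj() @ E @ psi) for E in projectors])
    probs = probs / probs.sum()                 # guard against rounding error
    i = rng.choice(len(projectors), p=probs)
    post = projectors[i] @ psi
    return i, post / np.linalg.norm(post)

# Example: measuring |+> in the computational basis gives 0 or 1 with probability 1/2.
E0, E1 = np.diag([1.0, 0.0]), np.diag([0.0, 1.0])
plus = np.array([1.0, 1.0]) / np.sqrt(2)
outcome, post_state = von_neumann_measure([E0, E1], plus)
```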

3.2. First Attempt at a Quantum Code
Now we make a first attempt at coming up with a quantum code, noting that our efforts and adjustments will ultimately culminate in the 9-qubit code. Starting with the simplest possible idea, we take inspiration from the classical repetition code, which maps

0\mapsto 000

1\mapsto 111

and decodes by taking the majority of the 3 bits. We consider the quantum analog of this, which maps

|0\rangle\mapsto|000\rangle

|1\rangle\mapsto|111\rangle.

We will take our quantum errors to be the Pauli matrices X, Y, and Z. Then the encoding process, where our message is a quantum state \alpha|0\rangle+\beta|1\rangle, looks like the following:

\alpha|0\rangle+\beta|1\rangle\mapsto\alpha|000\rangle+\beta|111\rangle.

We claim that this code can correct bit errors but not phase errors, which makes it equivalent to the original classical repetition code for error correction. To see this, note that applying an X_1 error results in the mapping

\alpha|0\rangle+\beta|1\rangle\mapsto\alpha|100\rangle+\beta|011\rangle.

This can be detected by the von Neumann measurement whose projectors project onto the four subspaces spanned by the pairs

\{|000\rangle,|111\rangle\}

\{|011\rangle,|100\rangle\}

\{|010\rangle,|101\rangle\}

\{|110\rangle,|001\rangle\}

We could then apply \sigma_x to the first qubit to correct the error. To see that this doesn’t work for phase errors, note that applying a Z_2 error results in the mapping

\alpha|0\rangle+\beta|1\rangle\mapsto\alpha|000\rangle-\beta|111\rangle.

This is a valid encoding of the state \alpha|0\rangle-\beta|1\rangle, so the error is undetectable.
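
A small state-vector simulation (a sketch assuming numpy) makes both facts concrete: a bit flip on the first qubit is picked up by the syndrome measurement, while a phase flip leaves the state inside the code space and is invisible to it.

```python
import numpy as np
from functools import reduce

I = np.eye(2); X = np.array([[0., 1.], [1., 0.]]); Z = np.diag([1., -1.])
kron = lambda *ops: reduce(np.kron, ops)

def basis(bits):                       # |b1 b2 b3> as a length-8 vector
    v = np.zeros(8); v[int(bits, 2)] = 1.0; return v

alpha, beta = 0.6, 0.8                 # real amplitudes, so <v|P|v> is v @ P @ v
encoded = alpha * basis('000') + beta * basis('111')

# projectors onto the four syndrome subspaces listed above
pairs = [('000', '111'), ('011', '100'), ('010', '101'), ('110', '001')]
projectors = [np.outer(basis(a), basis(a)) + np.outer(basis(b), basis(b))
              for a, b in pairs]

bit_flipped = kron(X, I, I) @ encoded          # X error on qubit 1
print(np.round([bit_flipped @ P @ bit_flipped for P in projectors], 3))
# -> [0, 1, 0, 0]: the error is detected and located

phase_flipped = kron(I, Z, I) @ encoded        # Z error on qubit 2
print(np.round([phase_flipped @ P @ phase_flipped for P in projectors], 3))
# -> [1, 0, 0, 0]: the corrupted state still looks like a valid codeword
```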

What adjustments can we make so that we’re able to also correct Z errors? For this we will introduce the Hadamard matrix, defined as

H=\frac{1}{\sqrt{2}}\left(\begin{array}{cc}1&1\\1&-1\end{array}\right)

and satisfying

HX=ZH.

Note in particular that, because HX=ZH, the Hadamard matrix turns bit errors into phase errors, and vice versa. This allows us to come up with a code that corrects phase errors by encoding in the Hadamard basis, i.e. by mapping

H|0\rangle\mapsto H^{\otimes 3}|000\rangle

H|1\rangle\mapsto H^{\otimes 3}|111\rangle

or equivalently,

|0\rangle\mapsto\frac{1}{2}(|000\rangle+|011\rangle+|101\rangle+|110\rangle)

|1\rangle\mapsto\frac{1}{2}(|111\rangle+|100\rangle+|010\rangle+|001\rangle)

Now we can concatenate our bit flip code with our phase flip code to take care of both errors. This gives us the 9-qubit code, also known as the Shor code.

3.3. 9-Qubit Code
In the previous section, we went through the process of constructing the 9-qubit Shor code by considering how to correct both bit flip errors and phase flip errors. Explicitly, the 9-qubit Shor code is given by the following mapping:

|0\rangle\mapsto|0\rangle_L\equiv\frac{1}{2}(|000000000\rangle+|000111111\rangle+|111000111\rangle+|111111000\rangle)

|1\rangle\mapsto|1\rangle_L\equiv\frac{1}{2}(|111111111\rangle+|111000000\rangle+|000111000\rangle+|000000111\rangle).

Here |0\rangle_L and |1\rangle_L are known as logical qubits; note that our 9-qubit code essentially represents 1 logical qubit with 9 physical qubits.

Note that by construction this code can correct \sigma_x, \sigma_y, and \sigma_z errors on any one qubit (we’ve already shown by construction that it can correct \sigma_x and \sigma_z errors, and \sigma_y can be obtained as a product of the two). This corresponds to the statement that the corrupted states \sigma_x^{(i)}|0\rangle_L, \sigma_y^{(i)}|0\rangle_L, \sigma_z^{(i)}|0\rangle_L, \sigma_x^{(i)}|1\rangle_L, \sigma_y^{(i)}|1\rangle_L, and \sigma_z^{(i)}|1\rangle_L are pairwise either orthogonal or identical, so a measurement can identify a correction that works. (The Shor code is degenerate: \sigma_z errors on different qubits within the same block of three act identically on the code space.)

Now we have a 1-error quantum code. We claim that such a code can in fact correct any error operation, and that this is a property of all 1-error quantum codes:

Theorem: If a quantum code can correct \sigma_x, \sigma_y, and \sigma_z errors on any single qubit (a one-error code), then it can correct an arbitrary error (any unitary, measurement, or other quantum operation) acting on a single qubit.

Proof: \{I, \sigma_x, \sigma_y, \sigma_z\} forms a basis for the 2\times 2 matrices, so \{I, \sigma_x, \sigma_y, \sigma_z\}^{\otimes t} forms a basis for the 2^t\times 2^t matrices. Hence any error acting on t qubits can be written as a linear combination of tensor products of Pauli matrices, and a code that can individually correct the errors \sigma_{w_i}^{(i)} for w_i\in\{x,y,z\}, i\in\{1,...,t\} can correct any error supported on those t qubits.

Example: Phase Error. Next we’ll do an example where we consider how we might correct an arbitrary phase error applied to a logical state; for brevity we work with the 3-qubit phase-code state |0\rangle_L from above. Since quantum states are equivalent up to a global phase, the error operator is given by

\left(\begin{array}{cc}1&0\\0&e^{i\theta}\end{array}\right)\equiv\left(\begin{array}{cc}e^{-i\theta/2}&0\\0&e^{i\theta/2}\end{array}\right).

Note that this can be rewritten in the \{I, \sigma_x, \sigma_y, \sigma_z\} basis as

\left(\begin{array}{cc}e^{-i\theta/2}&0\\0&e^{i\theta/2}\end{array}\right)=\cos\frac{\theta}{2}I-i\sin\frac{\theta}{2}\sigma_z.

Now, applying this error to the second qubit of |0\rangle_L, we get

\left(\begin{array}{cc}e^{-i\theta/2}&0\\0&e^{i\theta/2}\end{array}\right)\frac{1}{2}(|000\rangle+|011\rangle+|101\rangle+|110\rangle)=\frac{1}{2}\cos\frac{\theta}{2}(|000\rangle+|011\rangle+|101\rangle+|110\rangle)-\frac{i}{2}\sin\frac{\theta}{2}(|000\rangle-|011\rangle+|101\rangle-|110\rangle)=\cos\frac{\theta}{2}|0\rangle_L-i\sin\frac{\theta}{2}\sigma_z|0\rangle_L.

After performing a projective measurement, we get state |0\rangle_L with probability \cos^2\frac{\theta}{2}, in which case we do not need to perform any error correction, and we get \sigma_z|0\rangle_L with probability \sin^2\frac{\theta}{2}, in which case we would know to correct the \sigma_z error.
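
The decomposition and the two outcome probabilities are easy to check numerically (a sketch assuming numpy):

```python
import numpy as np

# The operator diag(e^{-i theta/2}, e^{i theta/2}) equals
# cos(theta/2) I - i sin(theta/2) Z exactly.
theta = 0.7
I = np.eye(2); Z = np.diag([1.0, -1.0])
error = np.diag([np.exp(-1j * theta / 2), np.exp(1j * theta / 2)])
assert np.allclose(error, np.cos(theta / 2) * I - 1j * np.sin(theta / 2) * Z)

# The syndrome measurement then returns "no error" or "sigma_z error" with:
print(np.cos(theta / 2) ** 2, np.sin(theta / 2) ** 2)   # probabilities summing to 1
```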

3.4. 7-Qubit Code
Now that we’ve constructed the 9-qubit code and shown that quantum error correction is possible, we might wonder whether it’s possible to do better. For example, we’d like a code that requires fewer qubits. We’ll construct a 7-qubit code that corrects 1 error, defining a mapping to |0\rangle_L and |1\rangle_L by taking inspiration from a classical code, as we did for the 9-qubit case.

For this we will need to go back to the example we used to illustrate classical error correction. Recall that in classical error correction, we have an encoding matrix G and a parity check matrix H satisfying GH^T=0, with \text{rank}(G)+\text{rank}(H)=n. We encode a message m to obtain codeword mG=c. After error e is applied, this becomes c+e, from which we can extract the error syndrome (c+e)H^T=eH^T. We can then apply the appropriate correction to extract c from c+e.

Now we will use the [7,4] Hamming code from our classical error correction example, and we will divide its 16 codewords into two sets of 8, C_1 and C_1'. Here C_1 is the [7,3] code spanned by the rows of the parity check matrix,

C_1=\text{span}\left(\begin{array}{ccccccc}0&0&0&1&1&1&1\\1&0&1&0&1&0&1\\0&1&1&0&0&1&1\end{array}\right),

and

C_1'=C_1+1111111.

Similar to how we approached the 9-qubit case, we will start by defining our code as follows:

|0\rangle_L\equiv\frac{1}{\sqrt{8}}\sum_{v\in C_1}|v\rangle

|1\rangle_L\equiv\frac{1}{\sqrt{8}}\sum_{w\in C_1'}|w\rangle.

Note that this corrects bit flip errors by construction. How can we ensure that we are also able to correct phase errors? For this we again turn to the Hadamard matrix, which allows us to toggle between bit and phase errors. We claim that

H^{\otimes 7}|0\rangle_L=\frac{1}{\sqrt{2}}(|0\rangle_L+|1\rangle_L)

H^{\otimes 7}|1\rangle_L=\frac{1}{\sqrt{2}}(|0\rangle_L-|1\rangle_L).

Proof: We will show that

H^{\otimes 7}|0\rangle_L=\frac{1}{\sqrt{2}}(|0\rangle_L+|1\rangle_L),

noting that the argument for |1\rangle_L is similar. First we will need the fact that

H^{\otimes 7}|v\rangle=\frac{1}{2^{7/2}}\sum_{w\in\{0,1\}^7}(-1)^{w\cdot v}|w\rangle.

To see that this fact is true, note that

H=\frac{1}{\sqrt{2}}(|0\rangle\langle 0|+|0\rangle\langle 1|+|1\rangle\langle 0|-|1\rangle\langle 1|)

and that w\cdot v is the number of positions in which w and v are both 1, taken mod 2. Now we can start by directly calculating

H^{\otimes 7}|0\rangle_L=\frac{1}{\sqrt{8}}\frac{1}{\sqrt{128}}\sum_{v\in C_1}\sum_{w\in\{0,1\}^7}(-1)^{v\cdot w}|w\rangle.

Note that if x and y are two codewords of C_1 with w\cdot y=1, then x\cdot w=0 iff (x+y)\cdot w=1. Thus, whenever such a y exists (i.e. whenever w is not orthogonal to C_1), the codewords of C_1 split evenly between x\cdot w=0 and x\cdot w=1, and so the coefficient \sum_{v\in C_1}(-1)^{w\cdot v} of |w\rangle in the sum above vanishes. But the w satisfying w\perp C_1 are exactly the elements of C_1^{\perp}, which is the [7,4] Hamming code C_1\cup C_1', so we must have that

H^{\otimes 7}|0\rangle_L=\frac{1}{\sqrt{2}}|0\rangle_L+\frac{1}{\sqrt{2}}|1\rangle_L

as the sum in |0\rangle_L+|1\rangle_L runs equally over all codewords.
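
This claim is also easy to verify numerically; here is a sketch (assuming numpy) that builds |0\rangle_L and |1\rangle_L from C_1 and C_1' and applies the Hadamard to each of the 7 qubits:

```python
import numpy as np
from itertools import product
from functools import reduce

H1 = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
H7 = reduce(np.kron, [H1] * 7)                     # Hadamard on each of 7 qubits

gens = np.array([[0,0,0,1,1,1,1],                  # generators of C_1 (rows of H)
                 [1,0,1,0,1,0,1],
                 [0,1,1,0,0,1,1]])
C1 = {tuple(int(b) for b in np.dot(c, gens) % 2) for c in product([0, 1], repeat=3)}
C1_prime = {tuple((b + 1) % 2 for b in v) for v in C1}   # C_1 + 1111111

def uniform_superposition(codewords):
    state = np.zeros(2 ** 7)
    for v in codewords:
        state[int(''.join(map(str, v)), 2)] = 1.0
    return state / np.sqrt(len(codewords))

zero_L = uniform_superposition(C1)
one_L = uniform_superposition(C1_prime)
assert np.allclose(H7 @ zero_L, (zero_L + one_L) / np.sqrt(2))
assert np.allclose(H7 @ one_L, (zero_L - one_L) / np.sqrt(2))
```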

Thus we have constructed a 7-qubit quantum code that corrects 1 error, and moreover we see that for both the 9-qubit and 7-qubit codes, both of which are 1-error quantum codes, the fact that they can correct 1-error comes directly from the fact that the original classical codes we used to construct them can themselves correct 1 error. This suggests that we should be able to come up with a more general procedure for constructing quantum codes from classical codes.

3.5. CSS Codes
CSS (Calderbank-Shor-Steane) codes generalize the process by which we constructed the 9-qubit and 7-qubit codes, and they give us a general framework for constructing quantum codes from classical codes. In a CSS code, we require groups C_1, C_2 satisfying

 C_1\subseteq C_2

C_2^{\perp}\subseteq C_1^{\perp}

Then we can define codewords to correspond to cosets of C_1 in C_2, so that the number of codewords is equal to 2^{\text{dim}(C_2)-\text{dim}(C_1)}. Thus by this definition we can say that codewords w_1, w_2\in C_2 are in the same coset if w_1-w_2\in C_1. Explicitly, the codeword for coset w is given by the state

\frac{1}{|C_1|^{1/2}}\sum_{x\in C_1}|x+w\rangle,

and under the Hadamard transformation applied to each qubit this state is in turn mapped to the state

\frac{1}{|C_1^{\perp}|^{1/2}}\sum_{y\in C_1^{\perp}}(-1)^{w\cdot y}|y\rangle.

That is to say, the Hadamard “dualizes” our original code, toggling bit errors to phase errors and vice versa. (This can be seen by direct calculation, as in the case of the 7-qubit code, where we used the fact that \sum_{v\in C_1}(-1)^{v\cdot w}=0 for w\not\in C_1^{\perp}.)

Note also that the number of bit errors this code can correct is governed by the minimum weight of \{v\in C_2\setminus C_1\}: as in the classical case, if that minimum weight is d, then up to \lfloor(d-1)/2\rfloor bit errors can be corrected.

With the CSS construction we have thus reduced the problem of finding a quantum error correcting code to the problem of finding appropriate C_1, C_2. Note that the special case of C_2^{\perp}=C_1=C corresponds to weakly self-dual codes, which are well studied classically. Doubly even, weakly self-dual codes additionally have the requirement that all codewords have Hamming weights that are multiples of 4; they satisfy the requirement

1^n\subseteq C^{\perp}\subseteq C\subseteq\mathbb{Z}_2^n

and are also well studied classically.

3.6. Gilbert-Varshamov Bound
In the previous section we introduced CSS codes and demonstrated that the problem of constructing a quantum code could be reduced to the problem of finding two groups C_1, C_2 satisfying

C_1\subseteq C_2

C_2^{\perp}\subseteq C_1^{\perp}.

The next natural question is to ask whether such groups can in fact be found.

The Gilbert-Varshamov bound answers this question in the affirmative, ensuring that there do exist good CSS codes (the bound applies to both quantum and classical codes). It can be stated in the following way:

Theorem (Gilbert-Varshamov Bound): There exist CSS codes with rate R=(number of encoded bits)/(length of code) given by

R\geq 1-2H_2(d/n),

where d is the minimum distance of the code, roughly d/2 is the number of errors it can correct, and H_2(x) is the binary entropy function, defined as

H_2(x)=-x\log_2x-(1-x)\log_2(1-x).

Proof: Note that we can always take a code, apply a random invertible linear transformation to it, and get another code of the same dimension. Thus each nonzero vector is equally likely to appear in a random code. Given this fact, we can estimate the probability that a random code of dimension k contains a nonzero word of weight \leq d using the union bound:

\Pr[\text{code of dimension }k\text{ has a nonzero word of weight}\leq d]\leq(\text{number of words})\times\Pr[\text{a uniformly random word has weight}\leq d]=2^k\cdot\frac{\sum_{i=0}^d \binom{n}{i}}{2^n}\approx \frac{2^k\cdot 2^{nH_2(d/n)}}{2^n}

For this probability to be less than 1, so that a code of dimension k with no nonzero word of weight \leq d exists, we need

(k/n)+H_2(d/n)< 1.

We can calculate the rate by noting that for a CSS code, given by C_1\subseteq C_2, C_2^{\perp}\subseteq C_1^{\perp}, with \text{dim}(C_1)=k and \text{dim}(C_2)=n-k, the number of encoded qubits is (n-k)-k=n-2k, so the rate is given by

R=\frac{n-2k}{n}.

Combining this with the bound we obtained by considering probabilities, we get that

R\geq 1-2H_2(d/n).

Thus there exist good CSS codes.

3.7. Stabilizer Codes
Having discussed and constructed some examples of CSS codes, we will now discuss the stabilizer formalism. Note that this formalism allows us to construct codes without having to work directly with the states representing |0\rangle_L and |1\rangle_L, as this can quickly get unwieldy. Instead, we will work with stabilizers, operators that leave these states invariant.

To see how working directly with states can get unwieldy, we can consider the 5-qubit code. We can define it the way we defined the 9-qubit and 7-qubit codes, by directly defining the basis vectors |0\rangle_L and |1\rangle_L,

|0\rangle_L\equiv\frac{1}{4}(|00000\rangle-|01100\rangle+|00101\rangle+|01010\rangle-|01111\rangle+(symmetric under cyclic permutations)),

with |1\rangle_L defined similarly. But we can also define this code more succinctly using the stabilizer formalism. To do so, we start by choosing a commutative subgroup of the Pauli group, with generators g_i satisfying

g_i^2=I

g_ig_j=g_jg_i

For example, for the 5-qubit code, the particular choice of generators we would need is given by

g_1\equiv IXZZX

g_2\equiv XIXZZ

g_3\equiv ZXIXZ

g_4\equiv ZZXIX.

Now we consider states \{|\psi\rangle\} that are stabilized by the \{g_i\}. That is, they satisfy

g_i|\psi\rangle=|\psi\rangle.

Note that the eigenvalues of \sigma_x, \sigma_y, and \sigma_z are \pm 1, so in the case of the 5-qubit code, there exists a 2^5/2=16-dimensional space of \{|\psi\rangle\} satisfying g_1|\psi\rangle=|\psi\rangle. Recalling that two commuting matrices are simultaneously diagonalizable, there exists a 16/2=8-dimensional space of \{|\psi\rangle\} satisfying g_1|\psi\rangle=g_2|\psi\rangle=|\psi\rangle, and so on, where we cut the dimension of the subspace in half each time we add a generator. Finally, there exists a 2^5/2^4=2-dimensional space of \{|\psi\rangle\} satisfying g_i|\psi\rangle=|\psi\rangle for all i=1,...,4. This 2-dimensional space is exactly the subspace spanned by |0\rangle_L and |1\rangle_L. Thus fixing the stabilizers is enough to give us our code.
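
This dimension count can be checked directly (a sketch assuming numpy): multiplying the projectors (I+g_i)/2 for the four generators projects onto their joint +1 eigenspace, and its trace gives the dimension of the code space.

```python
import numpy as np
from functools import reduce

I = np.eye(2)
X = np.array([[0., 1.], [1., 0.]])
Z = np.diag([1., -1.])
paulis = {'I': I, 'X': X, 'Z': Z}

def op(word):                                # e.g. 'IXZZX' -> a 32x32 matrix
    return reduce(np.kron, [paulis[c] for c in word])

generators = ['IXZZX', 'XIXZZ', 'ZXIXZ', 'ZZXIX']
proj = np.eye(32)
for g in generators:
    proj = proj @ (np.eye(32) + op(g)) / 2   # project onto the +1 eigenspace of g

print(int(round(np.trace(proj))))            # prints 2: the code space is 2-dimensional
```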

Next we will consider all elements in the Pauli group that commute with every element of our stabilizer group G=\langle g_1,...,g_4\rangle. As we shall see, this will give us our logical operators, where a logical operator performs an operation on a logical qubit (for example, the logical X operator, X_L, acts on the logical qubit |0\rangle_L by mapping X_L|0\rangle_L=|1\rangle_L, and so on). In the 5-qubit case we end up with a nonabelian group \tilde{G}=\langle g_1,...,g_4, h_1, h_2\rangle on six generators, obtained by adding the following two elements to the generators of G:

h_1=XXXXX

h_2=ZZZZZ

These will be our logical operators

X_L\equiv h_1

Z_L\equiv h_2

so that

X_L|0\rangle_L=|1\rangle_L

X_L|1\rangle_L=|0\rangle_L

Z_L|1\rangle_L=-|1\rangle_L

Z_L|0\rangle_L=|0\rangle_L.

Note that this code has distance 3 and corrects 1 error, because 3 is the minimum Hamming weight of an element of \tilde{G} that lies outside the stabilizer group G (that is, of a nontrivial logical operator). (To see that weight 3 is achieved, note that XXXXX\cdot IXZZX=XIYYI, which has Hamming weight 3.)

Why is Hamming weight 2 not enough to correct one error? If we had, for example, XZIII\in\tilde{G}, then we would have

\sigma_x^{(1)}|\psi_1\rangle=\sigma_z^{(2)}|\psi_2\rangle

for |\psi_1\rangle, |\psi_2\rangle both in the code, which means that we wouldn’t be able to distinguish an X_1 error from a Z_2 error.

Note that, in general, when x\in\tilde{G}, x|\psi\rangle will be in the code, so elements of \tilde{G} map codewords to codewords. We can prove this fact by noting that

xg_i|\psi\rangle=x|\psi\rangle=g_ix|\psi\rangle.

Note also that in the examples we’ve been dealing with so far, where we have a commuting subgroup of the Pauli group, our codes correspond to classical, additive, weakly self-dual codes over GF(4). Here GF(4)=\{0,1,\omega,\bar{\omega}\} (with group elements \{\omega, \bar{\omega},1\} corresponding to the third roots of unity) is the finite field on 4 elements, and multiplying Pauli matrices corresponds to group addition. Specifically,

X\equiv 1

Y\equiv \omega

Z\equiv \bar{\omega}

I\equiv 0

satisfying

H\omega=\bar{\omega}

2X=2Y=2Z=0.

We have now concluded our discussion of quantum error-correcting codes. In the next section we will shift gears and look at quantum channels and channel capacities.

4. Quantum Channels
In this final section we will look at quantum channels and channel capacities.

4.1. Definition and Examples

4.1.1. Definition
We know that we want to define a quantum channel to take a quantum state as input. What should the output be? As a first attempt we might imagine having the output be a probability distribution \{p_i\} over states \{|\psi_i\rangle\}. It turns out that for a more succinct description, we can have both the input and output be a density matrix.

Recall that a density matrix takes the form

\rho=\sum_i p_i|\psi_i\rangle\langle\psi_i|

representing a probability distribution over pure states |\psi_i\rangle. \rho is Hermitian and positive semidefinite, and it must satisfy \text{Tr}(\rho)=1 (equivalently, for normalized states |\psi_i\rangle, we must have \sum_i p_i=1).

Now we may define a quantum channel as the map \eta that takes

\eta:\rho\mapsto\sum_i E_i\rho E_i^{\dagger},

where

\sum_i E_i^{\dagger}E_i=I.

To see that the output is in fact a density matrix, note that the output expression is clearly Hermitian and can be shown to have unit trace using the cyclical property of traces. Note also that the decomposition into \{E_i\} need not be unique.

4.1.2. Example Quantum Channels
Next we give a few examples of quantum channels. The dephasing channel is given by the map

\rho\mapsto(1-p)\rho+p\sigma_z\rho\sigma_z.

It maps

\left(\begin{array}{cc}\alpha&\beta\\\gamma&\delta\end{array}\right)\mapsto \left(\begin{array}{cc}\alpha&(1-2p)\beta\\(1-2p)\gamma&\delta\end{array}\right),

so it multiplies off-diagonal elements by a factor that is less than 1. Note that when p=1/2, it maps

\alpha|0\rangle+\beta|1\rangle\mapsto|\alpha|^2|0\rangle\langle 0|+|\beta|^2|1\rangle\langle 1|,

which means that it turns superpositions into classical mixtures (hence the name “dephasing”).

Another example is the amplitude damping channel, which models an excited state decaying to a ground state. It is given by

E_1=\left(\begin{array}{cc}0&\sqrt{p}\\0&0\end{array}\right)

E_2=\left(\begin{array}{cc}1&0\\0&\sqrt{1-p}\end{array}\right)

Here we let the vector |0\rangle=(1, 0) denote the ground state, and we let the vector |1\rangle=(0, 1) denote the excited state. Thus we can see that the channel maps the ground state to itself, |0\rangle\mapsto|0\rangle, while the excited state |1\rangle gets mapped to |0\rangle with probability p and stays at |1\rangle with probability 1-p.
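
Both examples are instances of the Kraus form \rho\mapsto\sum_i E_i\rho E_i^{\dagger}; here is a short sketch (assuming numpy) that applies each channel to the pure state |+\rangle\langle+|:

```python
import numpy as np

def apply_channel(kraus_ops, rho):
    # completeness: sum_i E_i^dagger E_i = I (here for a single qubit)
    assert np.allclose(sum(E.conj().T @ E for E in kraus_ops), np.eye(2))
    return sum(E @ rho @ E.conj().T for E in kraus_ops)

p = 0.25
Z = np.diag([1.0, -1.0])
dephasing = [np.sqrt(1 - p) * np.eye(2), np.sqrt(p) * Z]
amplitude_damping = [np.array([[0, np.sqrt(p)], [0, 0]]),
                     np.array([[1, 0], [0, np.sqrt(1 - p)]])]

rho = 0.5 * np.array([[1, 1], [1, 1]])         # the pure state |+><+|
print(apply_channel(dephasing, rho))            # off-diagonals shrink by (1 - 2p)
print(apply_channel(amplitude_damping, rho))    # population flows toward |0><0|
```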

4.2 Quantum Channel Capacities
Now we consider the capacity of quantum channels, where the capacity quantifies how much information can make it through the channel. We consider classical channels, classical information sent over quantum channels, and quantum information sent over quantum channels. First we start off with the example of the quantum erasure channel to demonstrate that quantum channels behave differently from classical channels, and then we give the actual expressions for the channel capacities before revisiting the example of the quantum erasure channel.

4.2.1 Example: Quantum Erasure Channel
First we start with the example of the quantum erasure channel, which given a state |\psi\rangle replaces it by an orthogonal state |E\rangle with probability p and returns the same state |\psi\rangle with probability 1-p. We claim that the erasure channel can’t transmit quantum information when p\geq 0.5, behavior that is markedly different from that of classical information. That is to say, for p\geq 0.5, there is no way to encode quantum information to send it through the channel and then decode it so the receiver gets a state close to the state that was sent.

To see why this is the case, assume the contrary: that there exist encoding and decoding protocols that send quantum information through a quantum erasure channel with erasure probability p\geq 0.5. We will show that this violates the no-cloning theorem. Suppose that A does the following: for each qubit of the encoded state, she tosses a fair coin. If the coin lands heads, she sends C the state |E\rangle, and she sends B the channel input with probability 2(1-p) and the erasure state |E\rangle otherwise. If the coin lands tails, she sends B the state |E\rangle, and she sends C the channel input with probability 2(1-p) and the erasure state otherwise. Each of B and C then sees each qubit erased with probability exactly p, so A has implemented an erasure channel with erasure probability p to both receivers, which means that A can use the assumed protocol to transmit an encoding of |\psi\rangle to both of them, and both will be able to decode |\psi\rangle. But this means that A has just cloned the quantum state |\psi\rangle, a contradiction. Thus no quantum information can be transmitted through a channel with p\geq 0.5. Note, however, that we can send classical information over this channel, so the behavior of quantum and classical information is markedly different.

It turns out that the rate of quantum information sent over the erasure channel, as a function of p, is given by the following graph:

[Figure: rate of quantum information over the erasure channel as a function of p]
while the rate of classical information sent over the erasure channel, as a function of p, is given by the following graph:

[Figure: rate of classical information over the erasure channel as a function of p]

Next we will formally state the definition of channel capacity, and then we will return to the quantum erasure channel example and derive the curve that plots rate against p.

4.2.2. Definition of Channel Capacities
Channel capacity is defined as the maximum rate at which information can be communicated over many independent uses of a channel from sender to receiver. Here we list the expressions for channel capacity for classical channels, classical information over a quantum channel, and quantum information over a quantum channel.

Classical Channel Capacity For a classical channel this expression is just the maximum mutual information over all input-output pairs,

\max_X H(\eta(X))-H(\eta(X)|X),

where X is the input information and \eta(X) is the output information after having gone through the channel \eta.

Classical Information Over a Quantum Channel The capacity for classical information sent over a quantum channel is given by

\max_{\{p_i,\rho_i\}} H(\eta(\rho))-\sum_i p_iH(\eta(\rho_i))

up to regularization, where \rho=\sum_i p_i\rho_i is the average input state, and \eta is the channel.

Note that we would regularize this by using n copies of the state (that is to say, we want the output of \eta^{\otimes n}) and then dividing by n, to get an expression like the following for the regularized capacity of classical information over a quantum channel:

\lim_{n\rightarrow\infty}\max_{\{p_i,\rho_i\}} [H(\eta^{\otimes n}(\rho))-\sum_i p_i H(\eta^{\otimes n}(\rho_i))]/n,

where the maximum is now taken over ensembles \{p_i,\rho_i\} of input states on n uses of the channel.

Quantum Information The capacity for quantum information is given by the expression

\max_\rho H(\eta(\rho))-H((\eta\otimes I)\Phi_\rho),

also up to regularization. Here \eta(\rho) is the output when the channel \eta acts on the input state \rho, while \Phi_\rho is a purification of \rho (that is, a pure state on an enlarged Hilbert space whose reduced state is \rho). The regularized capacity for quantum information looks like the following:

\lim_{n\rightarrow\infty}\max_\rho [H(\eta^{\otimes n}(\rho))-H((\eta^{\otimes n}\otimes I)\Phi_\rho)]/n.

Now that we have the exact expression that allows us to calculate the quantum channel capacity, we will revisit our example of the quantum erasure channel and reproduce the plot of channel rate vs erasure probability.

4.2.3. Example Revisited: Quantum Erasure Channel
Recall that, up to regularization, the capacity of a quantum channel is given by

\max_\rho H(\eta(\rho))-H((\eta\otimes I)\Phi_\rho).

We will directly calculate this expression for the example of the quantum erasure channel. Let the input \rho be given by the density matrix for the completely mixed state,

\rho=\left(\begin{array}{cc}\frac{1}{2}&0\\0&\frac{1}{2}\end{array}\right),

so that the purification of \rho is given by the state

\frac{1}{\sqrt{2}}(|00\rangle+|11\rangle).

Recall that the erasure channel replaces our state with |E\rangle with probability p, while with probability 1-p it leaves the input state unchanged. Then, in the basis \{|0\rangle, |1\rangle, |E\rangle\}, the matrix corresponding to \eta(\rho) is given by

\eta(\rho)=\left(\begin{array}{ccc}\frac{1-p}{2}&0&0\\0&\frac{1-p}{2}&0\\0&0&p\end{array}\right)

while in the basis \{|00\rangle, |01\rangle, |10\rangle, |11\rangle, |E0\rangle, |E1\rangle\} (the channel acts on the first qubit of the purification), the matrix corresponding to (\eta\otimes I)\Phi_\rho is given by

(\eta\otimes I)\Phi_\rho=\left(\begin{array}{cccccc}\frac{1-p}{2}&0&0&\frac{1-p}{2}&0&0\\0&0&0&0&0&0\\0&0&0&0&0&0\\\frac{1-p}{2}&0&0&\frac{1-p}{2}&0&0\\0&0&0&0&\frac{p}{2}&0\\0&0&0&0&0&\frac{p}{2}\end{array}\right)

We can directly calculate that

H(\eta(\rho))=H_2(p)+(1-p)

H((\eta\otimes I)\Phi_\rho)=H_2(p)+p.

Then, subtracting the two entropies, we can calculate the rate to be

R=1-2p,

which, for p\leq 1/2, corresponds exactly to the line we saw on the diagram that plotted rate as a function of p for the quantum erasure channel (for p>1/2 the capacity is 0, as the no-cloning argument above shows).
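
The two entropies and the resulting rate can be checked numerically (a sketch assuming numpy):

```python
import numpy as np

def von_neumann_entropy(rho):
    eigs = np.linalg.eigvalsh(rho)
    eigs = eigs[eigs > 1e-12]
    return float(-np.sum(eigs * np.log2(eigs)))

p = 0.3
eta_rho = np.diag([(1 - p) / 2, (1 - p) / 2, p])      # eta(rho) in the basis above

joint = np.zeros((6, 6))                               # (eta x I) applied to Phi_rho
joint[np.ix_([0, 3], [0, 3])] = (1 - p) / 2            # the (1-p)|Phi><Phi| block
joint[4, 4] = joint[5, 5] = p / 2                      # the erased part

H2 = lambda x: -x * np.log2(x) - (1 - x) * np.log2(1 - x)
assert np.isclose(von_neumann_entropy(eta_rho), H2(p) + (1 - p))
assert np.isclose(von_neumann_entropy(joint), H2(p) + p)
print(von_neumann_entropy(eta_rho) - von_neumann_entropy(joint))   # 1 - 2p = 0.4
```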

References

  1. Bennett, C. H., DiVincenzo, D. P., and Smolin, J. A. Capacities of quantum erasure channels. Phys. Rev. Lett., 78:3217-3220 (1997). quant-ph/9701015.
  2. Bennett, C. H., DiVincenzo, D. P., Smolin, J. A., and Wootters, W. K. Mixed state entanglement and quantum error correction. Phys. Rev. A, 54:3824 (1996). quant-ph/9604024.
  3. Calderbank, A. R. and Shor, P. W. Good quantum error-correcting codes exist. Phys. Rev. A, 54:1098 (1996). quant-ph/9512032.
  4. Devetak, I. The Private Classical Capacity and Quantum Capacity of a Quantum Channel. IEEE Trans. Inf. Theor., 51:44-45 (2005). quant-ph/0304127
  5. Devetak, I. and Winter, A. Classical data compression with quantum side information. Phys. Rev. A, 68(4):042301 (2003).
  6. Gottesman, D. Class of quantum error-correcting codes saturating the quantum Hamming bound. Phys. Rev. A, 54:1862 (1996).
  7. Laflamme, R., Miquel, C., Paz, J.-P., and Zurek, W. H. Perfect quantum error correction code. Phys. Rev. Lett., 77:198 (1996). quant-ph/9602019.
  8. Lloyd, S. Capacity of the noisy quantum channel. Phys. Rev. A., 55:3 (1997). quant-ph/9604015.
  9. Nielsen, M. A. and Chuang, I. L. Quantum Computation and Quantum Information., Cambridge University Press, New York (2011).
  10. Shor, P. W. Scheme for reducing decoherence in quantum computer memory. Phys. Rev. A., 52:2493 (1995).
  11. Shor, P. W. The quantum channel capacity and coherent information. MSRI Workshop on Quantum Computation (2002).
  12. Steane, A. M. Error correcting codes in quantum theory. Phys. Rev. Lett., 77:793 (1996).
  13. Steane, A. M. Multiple particle interference and quantum error correction. Proc. R. Soc. London A, 452:2551-76 (1996).

Highlights beyond EC: Call for nominations

November 21, 2018

[Guest post by Moshe Babaioff –Boaz]

“Highlights Beyond EC” Session at EC 2019: Call for Nominations

Committee: Mohammad Akbarpour, Moshe Babaioff, Shengwu Li and Ariel Procaccia

Following a new tradition started last year, the 2019 ACM Conference on Economics and Computation (EC’19) will host a special session highlighting some of the best work in economics and computation that appears in conferences and journals other than EC. The intention of this session is to expose EC attendees to related work just beyond the boundary of their current awareness.

We seek nominations for papers on Economics and Computation that have made breakthrough advances, opened up new questions or areas, made unexpected connections, or had significant impact on practice or other sciences. Examples of conferences and journals that publish papers relevant for the special session include STOC/FOCS/SODA/ITCS, AAAI/IJCAI/AAMAS, NIPS/ICML/COLT, WWW/KDD, AER/Econometrica/JPE/QJE/RESTUD/TE/AEJ Micro/JET/GEB, and Math of OR/Management Science/Operations Research. Please email nominations to HighlightsBeyondEC@gmail.com. Anyone is welcome to contact us, but we especially invite members of PCs or editorial boards in various venues to send us suggestions. Nominations should include:

  1.  Name of paper and authors.
  2. Publication venue or online working version. Preference will be given to papers that have appeared in a related conference or journal within the past two years, or have a working version circulated within the past two years.
  3. Short (2-3 paragraph) explanation of the paper and its importance.
  4. (Optional) Names of 1-3 knowledgeable experts on the area of the paper.

Note that at least one of the authors of a selected paper will be required to present their paper at EC 2019 and so should be available to travel to the conference, which is taking place in Phoenix, AZ on June 25-27, 2019.

To ensure maximum consideration, please send all nominations by December 15, 2018.