Theory Blog Aggregator Up!
The Theory of Computing Blog Aggregator is now back online at a new website: http://cstheoryfeed.org/ . There is also a twitter feed at https://twitter.com/cstheory .
See this blog post by Suresh Venkatasubramanian (who, together with Arnab Bhattacharyya, is responsible for the aggregator’s revival – thank you!!) for more details. This is a good opportunity to thank Arvind Narayanan who created the software to run it and maintained it all these years.
If you don’t want to rely on the aggregator to follow Windows On Theory, you can use the “Follow Blog by email” button on our sidebar, and join the 590 other happy customers who don’t need to wait for the feed to get the latest lecture notes from our physics and computation seminar.
What is Quantum Hamiltonian Complexity?
by Ben Edelman
This is the first installment of a three-part series of posts on quantum Hamiltonian complexity based on lectures given by the authors in Boaz and Tselil’s seminar. The second installment is here, and the third installment is here.
Quantum Hamiltonian complexity is a growing area of study that has important ramifications for both physics and computation. Our hope is that these three posts will provide an accessible (and incomplete) preview of the subject for readers who know the basics of theoretical computer science and quantum information. Much of the material is adapted from an excellent survey by Gharibian et al.
In a nutshell, quantum Hamiltonian complexity is the study of the local Hamiltonian problem. Why is this problem important enough to justify the existence of an entire subfield? To illustrate why, here are two informal characterizations of it:
- To a physicist, the local Hamiltonian problem is a formalization of the difficulty of simulating and understanding many-particle quantum systems. There are deep connections between the complexity of this problem and the amount of quantum entanglement in a system. In practical terms, physicists would love to be able to solve this problem on a regular basis, and they’ve developed a rich theory of heuristics to that end.
- To a computer scientist, the local Hamiltonian problem is the quantum version of constraint satisfaction problems. Any CSP can be written as a local Hamiltonian problem; and just as constraint satisfaction is the prototypical NP-complete problem by the Cook-Levin theorem, the local Hamiltonian problem plays the equivalent role for QMA (a quantum analogue of NP) by the “quantum Cook-Levin theorem.” The connections to classical complexity go on… there is even a quantum PCP conjecture!
But let’s take a step back and start at the beginning. To make sure we understand what a quantum Hamiltonian is and why it is important, it will be instructive to briefly rehash some of the fundamentals of classical statistical mechanics.
Classical energy and ground states
In the classical world, a physical system can be in any one of various states $\sigma$, each of which is a vector, with different coordinates representing different particles. Every state of the system has an energy, given by an energy function $E$. For example, in the classic Ising model of ferromagnetism, the states are $\sigma \in \{-1,+1\}^n$. Each coordinate $\sigma_i$ represents the spin of atom $i$, and atoms $i$ and $j$ interact with each other whenever $\{i,j\}$ is an edge in a graph $G$, which is usually a low-dimensional lattice. Energy for this system is defined as $E(\sigma) = -\sum_{\{i,j\} \in E(G)} \sigma_i \sigma_j$.
Suppose we ignore our system for a long time, letting it interact with its external environment until, in the limit, it reaches thermal equilibrium at temperature $T$. Then the probability the system is in state $\sigma$ is given by Boltzmann’s distribution: $p(\sigma) = e^{-\beta E(\sigma)}/Z$, where $\beta = 1/(k_B T)$ and $Z = \sum_{\sigma} e^{-\beta E(\sigma)}$ is the partition function required to normalize the probabilities. As the temperature tends to infinity, this distribution will approach the uniform distribution over all states, and as the temperature tends to absolute zero, the distribution will approach the uniform distribution over the states with minimum energy. We call these minimum energy states ground states, and we call their energy the ground state energy. If we want to calculate something about a system, then it is often crucial to know the ground states and ground state energy of the system. Going back to our example, the Ising model has two ground states whenever $G$ is connected: the states $(+1, \ldots, +1)$ and $(-1, \ldots, -1)$ in which all atoms have the same spin. The ground state energy is $-|E(G)|$.
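The ground-state claims above can be checked directly on a small example. The following sketch is ours, not from the post (the 4-cycle graph is an arbitrary choice), using the convention $E(\sigma) = -\sum_{\{i,j\} \in E(G)} \sigma_i \sigma_j$:

```python
# Illustrative sketch (not from the post): brute-force the Ising model on a
# 4-cycle with energy E(sigma) = -sum over edges of sigma_i * sigma_j.
import itertools

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]  # the graph G: a 4-cycle
n = 4

def energy(sigma):
    return -sum(sigma[i] * sigma[j] for i, j in edges)

states = list(itertools.product([-1, +1], repeat=n))
ground_energy = min(energy(s) for s in states)
ground_states = [s for s in states if energy(s) == ground_energy]

# A connected graph has exactly two ground states (all spins equal),
# with ground state energy -|E(G)|.
assert ground_energy == -len(edges)
assert sorted(ground_states) == [(-1, -1, -1, -1), (1, 1, 1, 1)]
```

Enumerating all $2^n$ states is only feasible for tiny $n$, which is precisely the point of the complexity-theoretic discussion that follows.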
Quantum Hamiltonians
A quantum Hamiltonian is essentially the quantum analogue of the classical energy function. Unlike with classical systems, when a quantum system is in a given $n$-qubit state $|\psi\rangle$, it doesn’t have a determinate energy. Instead, when we measure the energy, the value we obtain may be probabilistic and will correspond to one of the eigenvalues of the observable matrix for energy. This Hermitian matrix, denoted $H$, is the quantum Hamiltonian, and just as the energy function characterizes a classical system, the Hamiltonian characterizes a quantum system. For a given eigenvector $|v_i\rangle$ of $H$ with eigenvalue $\lambda_i$, when we measure the energy of $|\psi\rangle$ we obtain the result $\lambda_i$ with probability $|\langle v_i|\psi\rangle|^2$, and the system collapses to the state $|v_i\rangle$ (assuming the eigenvalue has multiplicity 1). Thus, the ground state energy of a quantum system is the minimum eigenvalue $\lambda_{\min}$ of $H$, and the ground state is the corresponding eigenvector.
The Boltzmann distribution also has a quantum analogue. A quantum system at thermal equilibrium will be in the mixed state $\rho = e^{-\beta H}/\mathrm{tr}(e^{-\beta H})$. As the temperature approaches absolute zero, $\rho$ will approach a uniform mixture over the ground states.
Not only does the Hamiltonian tell us the energy of a system, it also describes the time evolution of the system (as long as it is closed). Schrödinger’s equation states that $i\hbar \frac{d}{dt}|\psi(t)\rangle = H|\psi(t)\rangle$, where $\hbar$ is the reduced Planck constant and $t$ is time. Thus, if a closed system is in the state $|\psi(0)\rangle$ at time 0, its state at time $t$ will be $|\psi(t)\rangle = e^{-iHt/\hbar}|\psi(0)\rangle$. Since $H$ is Hermitian, $e^{-iHt/\hbar}$ is unitary, which is another way of saying that quantum mechanical states are subject to unitary evolution.
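These facts are easy to verify numerically on a toy example. The sketch below is ours (the 2-qubit Hamiltonian is a hypothetical example, and we set $\hbar = 1$): the ground state energy is the minimum eigenvalue, measurement probabilities sum to 1, and $e^{-iHt}$ is unitary.

```python
# Illustrative sketch (hypothetical 2-qubit Hamiltonian, hbar = 1).
import numpy as np

Z = np.diag([1.0, -1.0])
X = np.array([[0.0, 1.0], [1.0, 0.0]])
H = np.kron(Z, Z) + 0.5 * np.kron(X, np.eye(2))  # a toy Hermitian H

evals, evecs = np.linalg.eigh(H)                 # eigh: H is Hermitian
ground_energy, ground_state = evals[0], evecs[:, 0]
assert np.isclose(ground_state @ H @ ground_state, ground_energy)

# Measuring the energy of |psi> yields eigenvalue lambda_i with
# probability |<v_i|psi>|^2; the probabilities sum to 1.
psi = np.ones(4) / 2.0                           # uniform superposition
probs = np.abs(evecs.conj().T @ psi) ** 2
assert np.isclose(probs.sum(), 1.0)

# Time evolution U = e^{-iHt}, built from the eigendecomposition, is unitary.
t = 1.3
U = evecs @ np.diag(np.exp(-1j * evals * t)) @ evecs.conj().T
assert np.allclose(U.conj().T @ U, np.eye(4))
```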
The Local Hamiltonian problem
As we have seen, understanding the Hamiltonian of a quantum system is crucial for understanding both the system’s equilibrium behavior and its time evolution. There are a huge variety of questions physicists are interested in asking about systems, all of which boil down to questions about equilibrium behavior, time evolution, or both. There is a single problem that captures the complexity of many of these questions, in the sense that most of the questions can’t be answered without solving it. This is the problem of estimating the ground state energy of the Hamiltonian. Especially in condensed matter physics, this problem is ubiquitous.
Formally, we will study the following promise problem (note: this will not be our final formulation):

The “Hamiltonian Problem”

Given a Hermitian matrix $H \in \mathbb{C}^{2^n \times 2^n}$ and nonnegative reals $a, b$ with $b - a \ge 1/\mathrm{poly}(n)$,

- If $\lambda_{\min}(H) \le a$, output YES
- If $\lambda_{\min}(H) \ge b$, output NO
One issue with this definition is that the input includes an enormous matrix. For a reasonable-sized system, there’d be no use in even trying to solve this problem through classical computation, and how to deal with it in the quantum computing setting is far from obvious. Luckily, physicists have found that in real-life systems, interactions tend to be local, and if we consider the special case of local Hamiltonians, the input for the problem is of reasonable size.
A $k$-local Hamiltonian is a Hamiltonian that is decomposed into a sum of terms, each of which represents a Hamiltonian acting on a constant-size subset of the qubits in the system. In other words, $H = \sum_{i=1}^{m} H_i$, where each $H_i$ acts nontrivially only on a subset $S_i \subseteq [n]$ of size $k$. For brevity’s sake, we abuse notation and write $H_i \otimes I_{[n] \setminus S_i}$ as $H_i$. We can think of the $H_i$’s as local constraints, and the ground state as the state that simultaneously satisfies the constraints to the maximal possible extent. Here, then, is the new-and-improved problem definition:
Local Hamiltonian Problem
Given a Hamiltonian $H = \sum_{i=1}^{m} H_i$ specified as a collection of $k$-local interactions $\{H_i\}$, and nonnegative reals $a, b$ with $b - a \ge 1/\mathrm{poly}(n)$,

- If $\lambda_{\min}(H) \le a$, output YES
- If $\lambda_{\min}(H) \ge b$, output NO
Presuming the matrices $H_i$ and the reals $a, b$ are specified to polynomial precision, the input size is polynomial in $n$, since $k$ is a constant and each of the matrices has $2^k \times 2^k = 4^k$ entries. Thus, not only is our new problem physically realistic, it is also a problem we might hope to attack with classical computation. However, we will later see that this problem is in fact likely hard even for quantum computers. The remaining installments in this series of notes will deal with further restrictions of the class of Hamiltonians for which the local Hamiltonian problem may be tractable.
Computer science motivation
As we mentioned in the intro, the local Hamiltonian problem (henceforth denoted LH) doesn’t just have myriad applications in physics—it is also important from a computer science perspective because it is a quantum generalization of constraint satisfiability (you may have noticed that quantum analogues of classical concepts are a running theme). Specifically, CSP is a special case of LH.
Suppose we have a CSP instance $\varphi$ over boolean variables $x_1, \ldots, x_n$, and we want to turn it into an LH instance. A clause $C$ with constituent variables $x_{i_1}, \ldots, x_{i_k}$ becomes a diagonal $2^k \times 2^k$ matrix $H_C$ acting on the qubits $i_1, \ldots, i_k$. Note that the rows and columns of this matrix are indexed by the assignment vectors $y \in \{0,1\}^k$. Formally, $H_C$ encodes the truth table of $C$ in the following manner: $H_C[y, y] = 1 - C(y)$, i.e. the diagonal entry is 0 if $y$ satisfies $C$ and 1 otherwise. Another way of stating this is $H_C = \sum_{y : C(y) = 0} |y\rangle\langle y|$.
Informally, this construction takes the clauses of $\varphi$ and turns them into local quantum interactions. We’ve constructed $H_C$ so that it has two eigenvalues: 0 and 1. The eigenspace corresponding to 0 is spanned by the set of computational basis vectors that satisfy $C$, and the eigenspace corresponding to 1 is spanned by the computational basis vectors that don’t satisfy $C$. In effect, when we consider $H_C$ as a term of $H = \sum_C H_C$, we are giving an energy penalty to any variable assignment that doesn’t satisfy $C$. $H$ will have the eigenvalue 0 (in other words, a ground state energy of 0) if and only if there is some assignment of the variables that satisfies all of the clauses (in other words, iff $\varphi$ is satisfiable). Otherwise, the ground state energy of $H$ will be at least 1, so determining whether $\varphi$ is satisfiable is equivalent to solving LH with inputs $\{H_C\}$, $a = 0$, and $b = 1$. (In fact, LH generalizes MAX-CSP, since the ground state energy of $H$ is exactly the number of clauses minus the maximum number of simultaneously satisfiable clauses.)
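The reduction can be sketched in a few lines. The clause set below is a hypothetical toy instance of ours; the diagonal penalty matrices follow the recipe above, and the ground energy comes out to (number of clauses) minus (maximum number of simultaneously satisfiable clauses):

```python
# Sketch of the clause-to-Hamiltonian reduction on a tiny instance
# (the instance itself is an illustrative example, not from the post).
import itertools
import numpy as np

n = 3
# Clauses as functions of the full assignment; each depends on <= 2 variables.
clauses = [
    lambda x: x[0] or x[1],          # x0 OR x1
    lambda x: (not x[1]) or x[2],    # (NOT x1) OR x2
    lambda x: not x[0],              # NOT x0
]

# H_C is diagonal: it adds an energy penalty of 1 to every computational
# basis state whose assignment violates C, so H counts violated clauses.
dim = 2 ** n
H = np.zeros((dim, dim))
for C in clauses:
    for idx, x in enumerate(itertools.product([0, 1], repeat=n)):
        if not C(x):
            H[idx, idx] += 1.0

ground_energy = np.linalg.eigvalsh(H)[0]
max_satisfied = max(sum(bool(C(x)) for C in clauses)
                    for x in itertools.product([0, 1], repeat=n))

# Ground energy = (#clauses) - (max # simultaneously satisfiable clauses),
# so the formula is satisfiable iff the ground energy is 0.
assert np.isclose(ground_energy, len(clauses) - max_satisfied)
assert np.isclose(ground_energy, 0.0)  # this instance is satisfied by x = (0,1,1)
```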
A 1D Area Law for Gapped Local Hamiltonians
(This post is based on part of a lecture delivered by Boriana Gjura and Prayaag Venkat. See also posts by Ben Edelman and Fred Zhang for more context on Quantum Hamiltonian Complexity.)
Introduction
In this post we present the Area Law conjecture and prove its one-dimensional version rigorously, emphasizing the emergence of approximate ground state projectors (AGSPs) as a useful and promising tool.
The difficulty of understanding many-body physics lies in a dichotomy: quantum systems are extremely powerful, but even describing a general quantum state requires exponential resources. A natural way to proceed is to ask whether the special class of physically relevant states admits succinct descriptions. As seen in a previous post, the QMA-completeness of the local Hamiltonian problem suggests that even ground states of local Hamiltonians do not, in general, admit such descriptions. Looking at these ground states, the following questions arise naturally.
When do the ground states of local Hamiltonians have a special structure? When does that structure allow for a meaningful short description? When does that structure allow us to compute their properties?
Informally, the answer to these questions is yes for (gapped) 1D systems, and it is an open problem to answer them in higher dimensions.
The Area Law
The Area Law is inspired by the Holographic principle in Cosmology,
which informally states that the total information in a black hole
resides on the boundary. That is, the complexity of the system depends
on the size of its boundary, not its volume.
Preliminaries
For the sake of clarity, we assume that $H = \sum_{i=1}^{n-1} H_i$ is a 2-local Hamiltonian on $n$ qubits arranged on a line, each $H_i$ is a projection ($H_i^2 = H_i$) acting on qubits $i$ and $i+1$, $H$ is frustration free (meaning that $\epsilon_0$, the lowest eigenvalue of $H$, is zero), and there is a unique ground state $|\Omega\rangle$.

These conditions simplify our proof. It will be easy to generalize to $k$-local Hamiltonians on qudits (dimension-$d$ particles) and to relax the requirement that each term be a projection. The most significant assumption is that $H$ is frustration-free: more work is necessary to make the proof work otherwise.
To formalize our conjecture, we define the notion of von Neumann entanglement entropy below.
Definition: Consider any state $|\psi\rangle$ lying in a bipartite Hilbert space $\mathcal{H}_A \otimes \mathcal{H}_B$, and perform a Schmidt decomposition to obtain $|\psi\rangle = \sum_i \lambda_i |u_i\rangle_A |v_i\rangle_B$, where the Schmidt coefficients $\lambda_i$ are nonnegative and satisfy $\sum_i \lambda_i^2 = 1$. The von Neumann entanglement entropy of $|\psi\rangle$ across region $A$ is defined as $S_A(|\psi\rangle) = -\sum_i \lambda_i^2 \log \lambda_i^2$.

Why is this a good notion of entropy? If $|\psi\rangle = |u\rangle_A \otimes |v\rangle_B$, then $S_A(|\psi\rangle) = 0$, i.e. the product state has no entanglement. On the other hand, the maximally entangled state $\frac{1}{\sqrt{d}} \sum_{i=1}^{d} |i\rangle_A |i\rangle_B$ has entropy $\log d$. In general, we have that $S_A(|\psi\rangle) \le \log \min(\dim \mathcal{H}_A, \dim \mathcal{H}_B)$. This is a direct quantum generalization of the fact that a classical probability distribution on $d$ elements has entropy at most $\log d$, and this is achieved by the uniform distribution.
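As a sanity check on the definition, the entropy can be computed numerically via the SVD, whose singular values are exactly the Schmidt coefficients. This is an illustrative sketch of ours (using base-2 logarithms):

```python
# Sketch (ours): reshape a bipartite state into a dim_A x dim_B matrix;
# the squared singular values are the Schmidt coefficients lambda_i^2.
import numpy as np

def entanglement_entropy(psi, dim_A, dim_B):
    """Von Neumann entropy S = -sum lambda_i^2 log2 lambda_i^2 across the cut."""
    svals = np.linalg.svd(psi.reshape(dim_A, dim_B), compute_uv=False)
    p = svals ** 2
    p = p[p > 1e-12]          # drop numerically-zero coefficients
    return float(-np.sum(p * np.log2(p)))

# A product state has zero entanglement entropy.
product = np.kron(np.array([1.0, 0.0]), np.array([1.0, 1.0]) / np.sqrt(2))
assert np.isclose(entanglement_entropy(product, 2, 2), 0.0)

# The maximally entangled state on 2 x 2 has entropy log2(2) = 1.
bell = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)
assert np.isclose(entanglement_entropy(bell, 2, 2), 1.0)
```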
We can now state the Area Law.
Conjecture: Let $|\Omega\rangle$ be the ground state of $H$, as defined above. Then for any region $A$ (i.e. a subset of qubits), $S_A(|\Omega\rangle) \le O(|\partial A|)$, where $\partial A$ denotes the boundary of the region $A$: the set of qubits that interact through $H$ with at least one qubit outside of region $A$.
This conjecture is illustrated by the following figure.
We prove the 1D version of this conjecture. Note that for a 1D system (i.e. qubits on a line with nearest neighbor interactions) it suffices to consider bipartitions of the Hilbert space defined by a “cut” (see the figure below). By a cut, we mean a partition of the qubits into two sets consisting of all the qubits to the left and right, respectively, of a fixed position on the line.
Theorem: Let $H$ and $|\Omega\rangle$ be as above. Then for any cut $C$, the entanglement entropy of $|\Omega\rangle$ across $C$ is $O(1)$, where the constant depends only on the local dimension and the spectral gap $\epsilon$ of $H$.
Approximate Ground State Projectors
In the case where all the $H_i$ commute, consider the operator $A = \prod_i (I - H_i)$. This operator will project any product state (with nonzero overlap with the ground state) onto the ground state. Therefore, we can get a representation of the ground state as a matrix product state (as we saw in a previous post on tensor networks). Can we generalize this idea when the $H_i$ do not necessarily commute?
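For intuition, the commuting case is easy to simulate with diagonal (hence commuting) projectors; the two terms below are hypothetical examples of ours:

```python
# Toy demo (ours) of the commuting case: for diagonal projectors H_i, the
# product of (I - H_i) projects any state with nonzero ground-space overlap
# onto the frustration-free ground space (the zero-energy states).
import numpy as np

dim = 8  # 3 qubits
H1 = np.diag([0, 1, 0, 1, 0, 1, 0, 1.0])   # penalizes basis states with last bit 1
H2 = np.diag([0, 0, 1, 1, 0, 0, 1, 1.0])   # penalizes basis states with middle bit 1

A = (np.eye(dim) - H1) @ (np.eye(dim) - H2)

# Start from a product state with ground-space overlap and project.
psi = np.ones(dim) / np.sqrt(dim)           # uniform superposition
ground = A @ psi
ground /= np.linalg.norm(ground)

# The result has zero energy: it is annihilated by every term.
assert np.isclose(ground @ (H1 + H2) @ ground, 0.0)
```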
A natural idea is to define an operator that approximately projects a state onto the vector we want (the ground state), while shrinking it in the other directions. We also want this operator to not increase entanglement too much. The proof of the area law will involve constructing a low-rank approximate factorization of the ground state by starting with a factorized state which has a good overlap with the ground state, and then repeatedly applying such an operator.
We first give a formal definition of an approximate ground state
projector. In the following, under the assumption of the existence of
such an object, we will prove the area law. At the end, we will show how
to construct it.
Definition: $K$ is a $(D, \Delta)$-approximate ground state projector (AGSP) if:

1. $K|\Omega\rangle = |\Omega\rangle$.
2. For every $|\psi\rangle$ such that $\langle \psi | \Omega \rangle = 0$, we have $\langle K\psi | \Omega \rangle = 0$ and $\|K|\psi\rangle\|^2 \le \Delta$.
3. $K$ has entanglement rank at most $D$. That is, $K$ admits a Schmidt decomposition of the form $K = \sum_{i=1}^{D} L_i \otimes R_i$, where the $L_i$ and $R_i$ act only on the particles to the left and right of the cut, respectively.
The first step of the proof of the Area Law is to find a product state that has a good overlap with the ground state. That is, it should have a constant overlap that does not depend on $n$, the number of particles in the system.
Lemma 1: Suppose there exists a $(D, \Delta)$-AGSP $K$ such that $D \cdot \Delta \le \frac{1}{2}$. Fix a partition of the space on which the Hamiltonian acts. Then there exists a product state $|\phi\rangle$ such that $|\langle \phi | \Omega \rangle|^2 \ge \frac{1}{2D}$.
Proof:
Let $|\phi\rangle$ be a product state with the largest overlap with the ground state, meaning that it maximizes $\mu = |\langle \phi | \Omega \rangle|$, and can be written as
$|\phi\rangle = \mu |\Omega\rangle + \sqrt{1 - \mu^2}\,|\Omega^{\perp}\rangle,$
where the latter is some state orthogonal to the ground state. We apply $K$ to get
$K|\phi\rangle = \mu |\Omega\rangle + \delta |\tilde{\Omega}^{\perp}\rangle,$
where $\delta \le \sqrt{\Delta}$ and $|\tilde{\Omega}^{\perp}\rangle$ is normalized and orthogonal to $|\Omega\rangle$. By the properties of the operator $K$, the Schmidt decomposition of $K|\phi\rangle$ has at most $D$ terms: $K|\phi\rangle = \sum_{i=1}^{D} c_i |l_i\rangle \otimes |r_i\rangle$. Using the Cauchy–Schwarz inequality:
$\mu^2 = |\langle \Omega | K | \phi \rangle|^2 = \Big|\sum_{i=1}^{D} c_i \langle \Omega | l_i, r_i \rangle\Big|^2 \le \Big(\sum_i c_i^2\Big)\Big(\sum_i |\langle \Omega | l_i, r_i \rangle|^2\Big) \le (\mu^2 + \Delta) \cdot D \max_i |\langle \Omega | l_i, r_i \rangle|^2.$
Therefore there exists a product state $|l_i\rangle \otimes |r_i\rangle$ such that
$|\langle \Omega | l_i, r_i \rangle|^2 \ge \frac{\mu^2}{D(\mu^2 + \Delta)}.$
By the maximality of $\mu$, this quantity is at most $\mu^2$, so $1 \le D(\mu^2 + \Delta)$, which implies $\mu^2 \ge \frac{1}{D} - \Delta \ge \frac{1}{2D}$ (using $D\Delta \le \frac{1}{2}$), as desired.
Starting with a product state with a big enough overlap with the ground
state, the proof of the next lemma shows how we can repeatedly apply
AGSPs to bound the entropy across the cut.
Lemma 2: Suppose there exists a $(D, \Delta)$-AGSP $K$ such that $D \cdot \Delta \le \frac{1}{2}$, and a product state $|\phi\rangle$ such that $|\langle \phi | \Omega \rangle|^2 \ge \frac{1}{2D}$. Then the entanglement entropy of $|\Omega\rangle$ across the cut is $O(\log D)$.
But before we prove this, let’s state an intuitive result of Eckart and
Young, which provides an upper bound for the overlap of two states given
the Schmidt rank of one.
Lemma (Eckart–Young): Let $|\psi\rangle$ be a normalized vector with Schmidt decomposition $|\psi\rangle = \sum_i \lambda_i |u_i\rangle |v_i\rangle$, where $\lambda_1 \ge \lambda_2 \ge \cdots \ge 0$. Then for any normalized $|\phi\rangle$ of Schmidt rank at most $D$, it holds that $|\langle \phi | \psi \rangle|^2 \le \sum_{i=1}^{D} \lambda_i^2$.
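A quick numerical experiment (ours, not from the post) makes the lemma concrete: the bound is attained by the renormalized truncated SVD, and random low-rank states never beat it.

```python
# Sketch (ours): for a random normalized "state" M, the best overlap of any
# rank-D state equals the sum of the top D squared Schmidt coefficients.
import numpy as np

rng = np.random.default_rng(0)
dA = dB = 6
M = rng.normal(size=(dA, dB))            # |psi> viewed as a dA x dB matrix
M /= np.linalg.norm(M)
U, s, Vt = np.linalg.svd(M)              # s: Schmidt coefficients, descending

D = 2
top = np.sum(s[:D] ** 2)

# The truncated SVD, renormalized, achieves overlap^2 exactly equal to `top`.
phi = U[:, :D] @ np.diag(s[:D]) @ Vt[:D, :]
phi /= np.linalg.norm(phi)
overlap_sq = np.sum(phi * M) ** 2        # Frobenius inner product, squared
assert np.isclose(overlap_sq, top)

# Random rank-D states obey the Eckart-Young bound.
for _ in range(20):
    R = rng.normal(size=(dA, D)) @ rng.normal(size=(D, dB))
    R /= np.linalg.norm(R)
    assert np.sum(R * M) ** 2 <= top + 1e-9
```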
Using this result, we continue the proof.
Proof of Lemma 2:
Write the AGSP as $K$. Given the product state $|\phi\rangle$, we denote by $|\phi_m\rangle$ the normalized state after applying $m$ projections to it, namely
$|\phi_m\rangle = \frac{K^m |\phi\rangle}{\|K^m |\phi\rangle\|}.$
By the way we’ve defined $K$, the Schmidt rank of $|\phi_m\rangle$ is at most $D^m$, and this state has an improved overlap with the ground state, that is:
$|\langle \phi_m | \Omega \rangle|^2 \ge \frac{\mu^2}{\mu^2 + \Delta^m} \ge 1 - \frac{\Delta^m}{\mu^2} \ge 1 - 2D\Delta^m,$
where $\mu^2 = |\langle \phi | \Omega \rangle|^2 \ge \frac{1}{2D}$. Let $|\Omega\rangle = \sum_i \lambda_i |u_i\rangle |v_i\rangle$ be the Schmidt decomposition of the ground state relative to the cut, where $\lambda_1 \ge \lambda_2 \ge \cdots$. The Eckart–Young theorem gives:
$\sum_{i \le D^m} \lambda_i^2 \ge |\langle \phi_m | \Omega \rangle|^2 \ge 1 - 2D\Delta^m,$
from which
$\sum_{i > D^m} \lambda_i^2 \le 2D\Delta^m.$
We choose $m_0 = O\!\left(\frac{\log D}{\log(1/\Delta)}\right)$ such that $2D\Delta^{m_0} \le \frac{1}{2}$. Let’s now bound the worst case entropy across the cut. The first $D^{m_0}$ Schmidt coefficients account for an entropy of at most $\log D^{m_0} = m_0 \log D$. For the remaining coefficients, we group them in chunks lying between ranks $D^m$ and $D^{m+1}$, indexed by $m \ge m_0$. For each of these intervals, the corresponding entropy can be upper bounded by
$2D\Delta^m \log \frac{D^{m+1}}{2D\Delta^m},$
where $2D\Delta^m$ is an upper bound on the total probability mass in the interval (this follows from the fact that the coefficients are organized in descending order and must sum to 1), and the entropy of a chunk of mass $p$ spread over at most $N = D^{m+1}$ coefficients is at most $p \log(N/p)$, achieved by the uniform distribution.
Therefore, the total entropy is
$S(|\Omega\rangle) \le m_0 \log D + \sum_{m \ge m_0} 2D\Delta^m \log \frac{D^{m+1}}{2D\Delta^m} = O(\log D),$
since the series converges geometrically. This bound depends only on $D$ and $\Delta$. This is sufficient to imply the Area Law because our AGSP construction will have constant $D$ and $\Delta$.
AGSP Construction
In this section, our goal is to construct an AGSP with the correct tradeoff in parameters. We briefly recall the intuition behind the properties that an AGSP must satisfy. First, the AGSP should leave the ground state untouched. Second, it shrinks states which are orthogonal to the ground state. Third, it should not increase the entanglement across the cut by too much (see the figure below).
A depiction of the third property in the AGSP definition for a 1D system: $K$ is a sum of tensor products of operators which only act on one “side” of the system.
We are interested in understanding a construction of an AGSP for three reasons:

- As we saw before, we can prove a 1D area law assuming the existence of an AGSP with good parameters.
- AGSPs have been used as a building block in a provably polynomial-time algorithm for finding ground states of gapped 1D local Hamiltonians.
- They appear to be a general purpose tool with potential applications in other contexts.
Details of the construction
One way to think about the first two properties in the AGSP definition is in terms of the relationship between the eigenvalues of $H$ and the eigenvalues of $K$ (see the figure below). The first property says that the ground state of $H$ should be an eigenvector of $K$ with eigenvalue 1. The second property says that all other eigenvectors of $H$ should be mapped to eigenvectors of $K$ with eigenvalue at most $\sqrt{\Delta}$.
In this plot, the horizontal axis represents the eigenvalues of $H$ and the vertical axis represents the eigenvalues of $K$. The ground state of $H$, which corresponds to the eigenvalue 0, should remain fixed upon applying $K$ to it.
In the following, we will aim to construct an AGSP whose tradeoff in parameters allows us to achieve the following result.

Theorem: In the gapped, frustration-free 1D setting above, there exists a $(D, \Delta)$-AGSP with $D \cdot \Delta \le \frac{1}{2}$ and $D = O(1)$.
Attempt 1
As a first attempt, we will evaluate the quality of the following
operator:
where is some natural
number that we will decide upon later. We now check the three properties
for this particular choice of .
First, it is clear that $K$ does not change the ground state, since the ground state is an eigenvector of $H$ with eigenvalue 0 (frustration-free case). Second, because any other eigenvalue of $H$ is at least $\epsilon$ (the energy gap), we have that
$\Delta \le \left(1 - \frac{\epsilon}{\|H\|}\right)^{2\ell}.$
Third, for the time being, let us think of $D$ as growing exponentially in $\ell$ (we will formalize this later).
Unfortunately, these values of $D$ and $\Delta$ will not allow us to achieve the desired tradeoff of $D \cdot \Delta \le \frac{1}{2}$. To see why this is, observe that because we have $n$ local Hamiltonians, the term $\|H\|$ could be on the order of $n$ (recall that each term has norm at most 1), so we would need to choose $\ell$ proportional to $\frac{n}{\epsilon}$ to reduce $\Delta$ to a constant (see figure below). Unfortunately, this will result in a corresponding exponential increase in $D$, violating $D \cdot \Delta \le \frac{1}{2}$.
The first attempt at an AGSP corresponds to the polynomial depicted in blue. This polynomial does not decay at a fast enough rate, so the second attempt will replace it with another polynomial, depicted in red, which decays much faster.
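The blue-versus-red picture can be reproduced numerically. In this sketch (ours; the interval $[\epsilon, 1]$ stands in for the excited spectrum, rescaled so $\|H\| = 1$), the naive polynomial $(1-x)^\ell$ barely decays at degree 30, while the faster-decaying replacement used in Attempt 2 (a rescaled Chebyshev polynomial) is already tiny:

```python
# Numeric comparison (ours): how small does each degree-l polynomial get on
# the "excited" interval [eps, 1], subject to the normalization p(0) = 1?
import numpy as np

def naive_max(eps, l):
    xs = np.linspace(eps, 1.0, 2001)
    return float(np.max(np.abs((1.0 - xs) ** l)))

def cheb_max(eps, l):
    # Map [eps, 1] onto [-1, 1], then normalize so the polynomial is 1 at x = 0.
    xs = np.linspace(eps, 1.0, 2001)
    y = (2 * xs - (1 + eps)) / (1 - eps)
    y0 = -(1 + eps) / (1 - eps)          # image of x = 0, outside [-1, 1]
    Tl = np.polynomial.chebyshev.Chebyshev.basis(l)
    return float(np.max(np.abs(Tl(y))) / abs(Tl(y0)))

eps, l = 0.01, 30
assert cheb_max(eps, l) < 0.05   # Chebyshev: tiny at degree ~ 3/sqrt(eps)
assert naive_max(eps, l) > 0.5   # naive (1-x)^l: needs degree ~ 1/eps
```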
Attempt 2
In summary, the two weaknesses of the previous approach we need to address are:

- The large $\|H\|$ term in the exponent substantially reduces the shrinking effect of applying the AGSP. We will address this using the technique of Hamiltonian truncation.
- For a fixed degree $\ell$ (roughly corresponding to the entanglement rank of $K$), the function $\left(1 - x/\|H\|\right)^{\ell}$ does not decay fast enough after the point $x = 0$ (which is the ground state energy). We will address this using Chebyshev polynomials.
As discussed before, the term $\|H\|$ is large because it involves contributions from all $n$ local Hamiltonians. Intuitively, we might believe that only the local Hamiltonians near the site of the cut itself should play a large role in the entanglement of the ground state across the cut. Specifically, if we write
$H = H_L + H_1 + \cdots + H_s + H_R,$
where $H_1, \ldots, H_s$ are the $s$ local Hamiltonians neighboring the cut and $H_L$ and $H_R$ are the “aggregated” Hamiltonians of all the remaining local Hamiltonians to the left and right, respectively, of $H_1$ and $H_s$, then we are asserting that $H_L$ and $H_R$ contribute significantly to $\|H\|$ but are not important when trying to characterize the ground state (see figure below).
We keep track of the $s$ local Hamiltonians corresponding to the qudits directly surrounding the cut. The remaining local Hamiltonians are aggregated into the Hamiltonians $H_L$ and $H_R$, whose eigenvalues will be truncated to define $H'$.
Mathematically, if we let
$H' = H_L^{\le t} + H_1 + \cdots + H_s + H_R^{\le t},$
where $X^{\le t}$ denotes the operator obtained by truncating all eigenvalues of $X$ to be at most $t$, we claim the following:

Claim:

1. $\|H'\| \le 2t + s$.
2. $|\Omega\rangle$ is a ground state of $H'$.
3. $H'$ has gap $\epsilon' = \Omega(\epsilon)$ for a suitable constant choice of $t$.

It is straightforward to verify the first two claims; we omit the proof of the third claim. Intuitively, we have chosen $H'$ to be an approximation of $H$ which has much smaller norm.
Next, we come to the task of designing a degree-$\ell$ polynomial which decays faster than our first attempt $\left(1 - x/\|H\|\right)^{\ell}$ after the point $x = \epsilon'$. In other words, we would like a polynomial $p$ satisfying $p(0) = 1$ (fixing the ground energy) that maps the interval $[\epsilon', \|H'\|]$ to the interval $[-\sqrt{\Delta}, \sqrt{\Delta}]$ for as small a $\Delta$ as possible. It turns out that the degree-$\ell$ Chebyshev polynomial (of the first kind) is precisely the polynomial which has this behavior (see the appendix for a review of Chebyshev polynomials). Letting $T_\ell$ denote the degree-$\ell$ Chebyshev polynomial, we take our polynomial to be a scaled and shifted version of it:
$p(x) = T_\ell\!\left(\frac{\|H'\| + \epsilon' - 2x}{\|H'\| - \epsilon'}\right) \Big/\, T_\ell\!\left(\frac{\|H'\| + \epsilon'}{\|H'\| - \epsilon'}\right).$
Given this definition of $p$, we take our new candidate AGSP to be $K = p(H')$, which has the following guarantees:
Claim:

1. $K|\Omega\rangle = |\Omega\rangle$.
2. $K$ achieves $\Delta \le e^{-\Omega(\ell \sqrt{\epsilon'/\|H'\|})}$.
3. $K$ achieves entanglement rank $D \le d^{O(s + \ell/s)}$.
The first claim is straightforward to verify. The proof of the second claim is omitted; it involves calculations making use of standard properties of Chebyshev polynomials. We remark that the dependence of the exponent on $\|H'\|$ has been improved quadratically (from $\ell \epsilon'/\|H'\|$ to $\ell \sqrt{\epsilon'/\|H'\|}$), compared with the first attempt at an AGSP. We now sketch the main ideas behind the proof of the third claim. After that, we will choose values of $s$ and $\ell$ so that the desired tradeoff is achieved.
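The first two claims can be spot-checked numerically. In this sketch (ours: a random Hermitian matrix with ground energy 0 and gap $\epsilon$ stands in for $H'$), applying the rescaled Chebyshev polynomial fixes the ground state and shrinks every excited eigenvector:

```python
# Spot check (ours, not the post's construction verbatim): apply the
# rescaled Chebyshev polynomial to a toy "H'" and verify AGSP properties 1-2.
import numpy as np

rng = np.random.default_rng(1)
dim, eps = 12, 0.5
# Hypothetical spectrum: ground energy 0, all excited energies in [eps, 10].
evals = np.concatenate(([0.0], rng.uniform(eps, 10.0, size=dim - 1)))
Q = np.linalg.qr(rng.normal(size=(dim, dim)))[0]   # random orthogonal basis
Hp = Q @ np.diag(evals) @ Q.T
norm = np.max(np.abs(np.linalg.eigvalsh(Hp)))

l = 10
Tl = np.polynomial.chebyshev.Chebyshev.basis(l)
arg = lambda x: (norm + eps - 2 * x) / (norm - eps)  # maps [eps, norm] -> [-1, 1]
K = Q @ np.diag(Tl(arg(evals)) / Tl(arg(0.0))) @ Q.T

ground = Q[:, 0]
assert np.allclose(K @ ground, ground)               # property 1: ground state fixed
# Property 2: every excited eigenvector is shrunk by a factor sqrt(Delta) << 1,
# since |T_l| <= 1 on [-1, 1] while T_l(arg(0)) is large.
shrink = np.abs(Tl(arg(evals[1:])) / Tl(arg(0.0)))
assert np.max(shrink) < 0.05
```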
We have defined $K$ as a degree-$\ell$ polynomial in $H'$, so we may write it as
$K = \sum_{j=0}^{\ell} c_j (H')^j,$
where the $c_j$ are coefficients whose particular values are not relevant to the analysis of the Schmidt rank of $K$ across the given cut. It is straightforward to check that the Schmidt rank of $K$ is bounded by the sum of the ranks of the monomials $(H')^j$ (subadditivity). So, it suffices to analyze the Schmidt rank of the monomial of the largest degree, $(H')^{\ell}$.
From our expression for $H'$, we can expand this power to get:
$(H')^{\ell} = \sum_{(i_1, \ldots, i_\ell)} H_{i_1} H_{i_2} \cdots H_{i_\ell},$
where the summation is over all products of $\ell$ Hamiltonians from the set $\{H_L^{\le t}, H_1, \ldots, H_s, H_R^{\le t}\}$. Now, one natural approach is to analyze the Schmidt rank of each term in the summation, and then sum over all the terms. Unfortunately, this approach will not work because there are exponentially many terms in the summation, so our estimate of $D$ will be too large. To address this issue, it is possible to collect terms cleverly so that the total number of terms is small. However, we will not provide the details here; we will just sketch how to analyze the Schmidt rank of an individual term.
The term $H_{i_1} H_{i_2} \cdots H_{i_\ell}$ consists of a product of $\ell$ Hamiltonians. Since there are $s$ possible locations of a cut that is crossed by one of the $H_i$, an average cut $C'$ will have $O(\ell/s)$ of the $H_i$ crossing it. For each Hamiltonian which crosses this average cut, the Schmidt rank will be multiplied by a factor of at most $d^2$. Hence, the Schmidt rank of the term across $C'$ is $d^{O(\ell/s)}$. This bound on the Schmidt rank is for an average cut $C'$, not necessarily the cut $C$ we fixed in the beginning. The two cuts are separated by at most $s$ qudits because $C'$ is one of the cuts surrounding $C$ (see figure below), so one can relate the Schmidt rank across these two cuts. For each Hamiltonian between the two cuts, the high-level idea is that the entanglement rank is multiplied by at most a factor of $d$. Because the entanglement rank across $C'$ is $d^{O(\ell/s)}$, the entanglement rank across $C$ is $d^{O(s + \ell/s)}$.

The entanglement across cut $C$ is analyzed by bounding the entanglement across a nearby cut $C'$ and then relating the two.
Putting all the pieces together, we have that:

- By Hamiltonian truncation and the Chebyshev polynomial construction, $\Delta \le e^{-\Omega(\ell/\sqrt{s})}$.
- By the entanglement rank analysis, $D \le d^{O(s + \ell/s)}$.

One can now check that if we set $s = O(\log^2 d)$ and $\ell = O(s^{3/2})$ (where large enough constants are hidden inside the $O(\cdot)$'s), then we achieve the desired tradeoff of $D \cdot \Delta \le \frac{1}{2}$. Note that $D$ and $\Delta$ are constants because $s$ and $\ell$ are constants.
Towards an area law in high dimensions
One of the major open problems in this line of work is to prove an area
law for 2D systems. We now discuss one potential line of attack on this
problem, based on a reduction to the 1D area law we have proven using
AGSPs.
Consider a 2D system, as depicted in the figure below. We can form a sequence of concentric shells of qudits of increasing radii and “contract” each shell into a single qudit of higher dimension. That is, if each of the original qudits has dimension $d$ and we consider the shell at radius $r$, the dimension of the contracted qudit is roughly $d^{O(r)}$ because there are roughly $O(r)$ qudits in the shell. The system of contracted qudits is now a 1D system in the sense that the shell at radius $r$ only interacts with the shells at radii $r - 1$ and $r + 1$.
A 2D system can be reduced to a 1D system by decomposing it into a
sequence of concentric
“shells”.
Suppose that we were able to show a 1D area law as discussed previously, but with the following dependence on the qudit dimension: $S \le O(\log d)$. If we replace $d$ with $d^{O(r)}$, the dimensionality in our new 1D system, we get that $S \le O(r \log d)$, which is exactly an area law in 2D (recalling that the surface area of a circle is proportional to its radius). In fact, even the weaker result of proving that $S \le O(\log^{2-\alpha} d)$ for some $\alpha > 0$ would imply a “subvolume law”, already a major breakthrough. This suggests that constructing AGSPs with better parameters may be a plausible approach to proving an area law in higher dimensions.
Future Directions
In addition to proving an area law in 2D, there are several interesting
questions raised by what we discussed here. See the survey of Gharibian
et al. for more details.
- Do area laws hold for gapless (i.e. non-constant gap, or “critical”) systems? No, there are known counterexamples.
- Do area laws hold for degenerate systems (where the ground state is not unique)? Yes, the AGSP approach still works for systems with groundspaces of constant dimension. It is unknown whether area laws hold when the groundspace is of dimension polynomial in $n$.
- Can one design practical algorithms for finding ground states of gapped 1D local Hamiltonians? Yes, there is one approach based on AGSPs.
- Do area laws hold in general for systems where the interactions are not necessarily described by a lattice? No, there are known counterexamples to such a “generalized area law”.
Acknowledgements
We would like to thank Boaz Barak and Tselil Schramm for their guidance
in preparing the talk on which this post is based.
Most of this post (including some figures) draws heavily upon the following resources: lecture notes from courses taught by Thomas Vidick and Umesh Vazirani, respectively, and a survey by Gharibian et al.

In these notes, we have opted to illustrate the main ideas, while glossing over technical details. Our goal with this approach is to prepare the reader to more easily understand the original sources and highlight some of the key concepts.
Appendix: Background on Chebyshev Polynomials
The Chebsyhev polynomials are a family of
polynomials with many special properties. They are defined as
, and
The first few polynomials are plotted below (Source:
<https://en.wikipedia.org/wiki/Chebyshev_polynomials>).
Below, we state a property (proof omitted) which provides some intuition as to why Chebyshev polynomials are the “correct” choice for our construction. Intuitively, it says that among all degree-$\ell$ polynomials which are trapped inside $[-1, 1]$ on the interval $[-1, 1]$, $T_\ell$ dominates outside of $[-1, 1]$.

Claim: Let $p$ be a degree-$\ell$ polynomial such that $|p(x)| \le 1$ for all $x \in [-1, 1]$. Then for all $|x| \ge 1$, $|p(x)| \le |T_\ell(x)|$.
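This claim is easy to probe numerically. The candidate polynomials below are arbitrary examples of ours that are bounded by 1 on $[-1, 1]$:

```python
# Spot check (ours) of the extremal claim: polynomials bounded by 1 on
# [-1, 1] are dominated outside that interval by the Chebyshev polynomial.
import numpy as np
from numpy.polynomial import chebyshev as C
from numpy.polynomial import polynomial as P

l, x0 = 5, 1.5
Tl = C.Chebyshev.basis(l)

candidates = [
    P.Polynomial([0, 0, 0, 0, 0, 1]),    # x^5
    P.Polynomial([0, -1.25, 0, 1.25]),   # 1.25*x^3 - 1.25*x, degree <= 5
]
grid = np.linspace(-1, 1, 2001)
for p in candidates:
    assert np.max(np.abs(p(grid))) <= 1.0 + 1e-9   # trapped inside [-1, 1]
    assert abs(p(x0)) <= abs(Tl(x0))               # dominated outside [-1, 1]
```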
References
- Sevag Gharibian, Yichen Huang, Zeph Landau, and Seung Woo Shin. “Quantum Hamiltonian Complexity”. Foundations and Trends in Theoretical Computer Science, 10(3):159–282, 2015.
- Thomas Vidick. Lecture notes for CS286 seminar in computer science: “Around the quantum PCP conjecture”. http://users.cms.caltech.edu/~vidick/teaching/286_qPCP/index.html, Fall 2014.
- Umesh Vazirani. Lecture notes for CS294 “Quantum computation”. https://people.eecs.berkeley.edu/~vazirani/f16quantum.html, Fall 2016.
Algorithmic and Information Theoretic Decoding Thresholds for Low-Density Parity-Check Codes
by Jeremy Dohmann, Vanessa Wong, Venkat Arun
Abstract
We will discuss error-correcting codes: specifically, low-density parity-check (LDPC) codes. We first describe their construction and their information-theoretic decoding threshold.
Belief propagation (BP) (see Tom’s notes) can be used to decode these codes. We analyze BP to find the maximum error rate up to which BP succeeds.
After this point, BP will reach a suboptimal fixed point with high probability. This threshold is lower than the information-theoretic bound, illustrating a gap between algorithmic and information-theoretic thresholds for decoding.
Then we outline a proof of a theorem suggesting that any efficient algorithm, not just BP, will fail above the algorithmic threshold. This is because there is a phase transition at the algorithmic threshold, after which there exists an exponentially large number of suboptimal ‘metastable’ states near the optimal solution. Local search algorithms will tend to get stuck at these suboptimal points.
Introduction to Error Correcting Codes
Motivation
Alice wants to send a message to Bob, but their channel of communication is such that Bob receives a corrupted version of what Alice sent. Most practical communication devices are imperfect and introduce errors in the messages they are transmitting. For instance, if Alice sends 3V, Bob will really receive three volts plus some noise (we have chosen to ignore some inconvenient practical details here). In many cases, this noise is quite small, e.g. it could be less than 0.5V in 99.99% of cases. So, in principle, Alice could have very reliable delivery by just choosing to always send 0V for a logical 0 and 5V for a logical 1, using checksums to detect the occasional error. But this is wasteful. Alice could have squeezed more levels between 0 and 5V to get a higher bitrate. This causes errors, but Alice can introduce redundancy in the bits she is transmitting, which can enable Bob to decode the correct message with high probability. Since it is much easier to control redundancy in encoding than in physical quantities, practical communication devices often choose to pack enough bits into their physical signal that errors are relatively likely, relying instead on redundancy in their encoding to recover from the errors. Redundancy is also used in storage, where we don’t have the option of detecting an error and retransmitting the message.
Some errors in communication are caused by thermal noise. These are unpredictable and unavoidable, but the errors they cause can be easily modeled; they cause bit-flips in random positions in the message. There are other sources of error. The clocks on the two devices may not be correctly synchronized, causing systematic bit-flips in a somewhat predictable pattern. A sudden surge of current (e.g. because someone turned on the lights, and electricity sparked between the contacts) can corrupt a large contiguous segment of bits. Or the cable could simply be cut, in which case no information gets through. These other kinds of errors are often harder to model (and easier to detect/mitigate), so we have to remain content with merely detecting them. Thus for the remainder of this blog post, we shall restrict ourselves to an error model where each bit is corrupted with some fixed (but potentially unknown) probability, independent of the other bits. For simplicity, we shall primarily consider the Binary Erasure Channel (BEC), where a bit either goes through successfully, or the receiver knows that there has been an error (though we will introduce some related channels along the way).
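As a toy illustration of this error model, here is a minimal sketch of the BEC: each bit is independently erased with probability ε, and the receiver knows which positions were lost. The function name and parameters here are our own, not from the lecture.

```python
import random

def transmit_bec(bits, eps, rng):
    """Binary erasure channel: each bit is independently erased (the
    receiver sees None) with probability eps, else received intact."""
    return [None if rng.random() < eps else b for b in bits]

rng = random.Random(0)
sent = [0] * 100_000
received = transmit_bec(sent, 0.3, rng)
erasure_rate = sum(r is None for r in received) / len(received)
```

By the law of large numbers, the empirical erasure rate concentrates tightly around ε for long messages.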
Claude Shannon found that for any given channel, there is a bitrate below which it is possible to communicate reliably with vanishing error rate; reliable communication cannot be achieved above this bitrate. Hence this threshold bitrate is called the channel capacity. He showed that random linear codes are an optimal encoding scheme that achieves channel capacity. We will only briefly discuss random linear codes, but essentially they work by mapping vectors in the input space to randomly chosen vectors in the encoded space via a random linear map. Unfortunately we do not have efficient algorithms for decoding these codes (mostly due to the randomness in their construction), and it is conjectured that none exists. More recently, Low-Density Parity-Check (LDPC) codes have gained in popularity. They are simple to construct, and can be efficiently decoded at error levels quite close to the theoretical limits.
With LDPC codes, there are three limits of interest for any given channel and design bitrate (M/N): 1) the error level up to which an algorithm can efficiently decode them, ε_d, 2) the error level up to which they can be decoded by a computationally unbounded decoder, ε_c, and 3) the error level beyond which no encoding scheme can achieve reliable communication, ε_Sh. Obviously, ε_d ≤ ε_c ≤ ε_Sh, and in general these inequalities can be strict. Our goal here is to study the gap between ε_d and ε_c. We will sometimes refer to these three quantities as p_d, p_c, and p_Sh when discussing channels besides the BEC. This is following the notation of Mezard and Montanari (2009).
More formally, information theory concerns reliable communication via an unreliable channel. To mitigate the errors in message transmission, error correcting codes introduce systematic redundancy into the transmitted message. An encoding map is applied to the information sequence to get the encoded message that is transmitted through the channel. The decoding map, on the other hand, is applied to the noisy channel output (see Figure below). In an error correcting code, each encoded message consists of N bits carrying M bits of information plus redundancy; the 2^M possible codewords form a “codebook” in the binary space {0,1}^N.
Claude Shannon's random code ensembles showed that it is easier to construct good codes stochastically (codes whose good properties hold with high probability) than via rigid deterministic designs. Stochastic ensembles achieve optimal error correcting performance, proving that it is possible to communicate with a vanishing error probability as long as the rate of transmission is smaller than the channel capacity, a measure of the maximum mutual information between channel input and output.
Thus, in order to construct an optimal error correcting code, one must first define a subset of the space of encoding maps, endow that set with a probability distribution, and subsequently define the associated decoding map for each of the encoding maps in the ensemble. We have included a section in Appendix A that gives a thorough discussion of random code ensembles, which are known to achieve optimal decoding, whether decoding success is scored by bit error rate or by decoded word error rate. We will also show a perspective which uses principles from statistical physics to unify the two (often called finite-temperature decoding). From here on out we will discuss LDPC codes and explore how the values of p_d, p_c and p_Sh for various channels reveal deep things about the structure of the decoding solution space.
Low-Density Parity-Check Codes
LDPC codes are linear and theoretically excellent error correcting codes that communicate at a rate close to the Shannon capacity. The LDPC codebook is a linear subspace of {0,1}^N. For an M×N sparse matrix H, the codebook is defined as the kernel:

C = { x : Hx = 0 }

where all the multiplications and sums involved in Hx are computed modulo 2.
Matrix H is called the parity check matrix, and the size of the codebook is 2^{N−M}. Given this code, encoding is a linear operation x = Gz using an N×(N−M) binary generating matrix G (the codebook is the image of G), chosen such that HG = 0.
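The kernel/image relationship above can be checked on a tiny made-up example. The matrices below (a 2×3 parity check H and its matching generator G) are our own illustration, not an actual LDPC code:

```python
import numpy as np

# Toy parity-check matrix H (M = 2, N = 3); the codebook is ker(H) mod 2.
H = np.array([[1, 1, 0],
              [0, 1, 1]])

# A generating matrix G with H @ G = 0 (mod 2): its image is the codebook.
# Here N - M = 1 and the single generator is the codeword (1, 1, 1).
G = np.array([[1],
              [1],
              [1]])

def encode(z):
    """Map an information vector z of length N - M to an N-bit codeword."""
    return (G @ z) % 2

for z in ([0], [1]):
    x = encode(np.array(z))
    assert ((H @ x) % 2 == 0).all()   # every codeword passes all checks
```

The design choice is the one from the text: encoding is just a mod-2 matrix multiplication, so it is always fast; all the difficulty lives on the decoding side.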
Every coding scheme has three essential properties that determine its utility: the geometry of its codebook and the way it sparsely distributes proper codewords within the encoding space (cf. our mention of sphere packing as a geometric analogy for RLCs), the ease with which one can construct a code that distributes codewords in this way, and the existence of fast algorithms to perform effective decoding.
A coding scheme over a given channel (whether it be BSC, BEC, AWGN, etc.) also has three parameters of interest: p_d, the error rate above which some chosen algorithm cannot perform error-free decoding; p_c, the error rate above which even exhaustively enumerating all codewords in the codebook and calculating the MAP probability does not successfully decode; and p_Sh, the capacity of the channel, an error rate above which no decoding scheme could perform error-free decoding.
LDPC codebook geometry and p_c
On the subject of codebook geometry, it is well known that LDPC ensembles, in expectation, produce sparsely distributed codewords. This means that valid codewords (i.e. those that pass all parity checks) are far apart from one another in Hamming space, and thus a relatively large number of bits must be lost in order for one codeword to degenerate into another. There is an important property called the distance enumerator which determines the expected number of codewords in a tight neighborhood of any given codeword. If for a given distance the expected number is exponentially small, then the coding scheme is robust up to error rates causing that degree of distortion. We discuss a proof of LDPC distance properties in Appendix B and simply state here that LDPCs are good at sparsely distributing valid codewords within the encoding space. The property p_c, introduced above, is intimately related to the geometry of the codebook and the properties of the noisy channel being decoded over.
The information-theoretic threshold p_c is the noise level above which MAP decoding no longer succeeds. p_c is important because below it we could always, in principle, perform (slow) decoding by enumerating all codewords in the codebook and returning the one with the highest MAP probability.
Every LDPC ensemble has a different p_c for a given channel. For now we will simply remark that LDPC ensembles are effective because they can be chosen such that p_c closely approaches the limit p_Sh for many channels. Even more importantly, we will show that it is likely that there is no fast algorithm for which p_d = p_c for general LDPCs over a general noisy channel.
We will not derive p_c for any LDPC ensembles over any channels (I recommend chapter 15 of this book for details), but we will, in section 3, present results derived by other researchers.
Ease of construction
On the subject of ease of construction and ease of decoding, there is a much simpler graphical representation of LDPC codes which can be used to demonstrate LDPC tractability.
LDPCs can be thought of as bipartite graphs, where N variable nodes are connected to M parity check nodes according to randomly chosen edges based on the degree distribution of the LDPC. Though the appendix discusses general degree distributions, we will discuss here only (d,k) regular bipartite graphs, in which all variable nodes have d edges and all parity check nodes have k edges, and how to generate them under the configuration model.
The configuration model generates a bipartite graph with (d,k) degree distribution by initially assigning all variable nodes d half-edges and all parity check nodes k half-edges, then randomly pairing up half-edges between the two sets, deleting edges that end up duplicated an even number of times, and collapsing edges duplicated an odd number of times into a single edge. This procedure doesn't work perfectly, but for large N the configuration model will generate a graph in which most nodes have the proper degree. Thus it is relatively easy to generate random graphs which represent LDPC codes of any desired uniform degree distribution. An example of this graphical representation is in figure 2.
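The half-edge pairing step above can be sketched in a few lines. This is a simplified illustration (names are our own); for brevity it keeps possible multi-edges rather than applying the even/odd clean-up step described above:

```python
import random

def configuration_model(N, d, k, rng):
    """Sketch of the configuration model for a (d,k)-regular bipartite
    graph: N variable nodes get d half-edges each, M = N*d/k check nodes
    get k half-edges each, and half-edges are paired uniformly at random.
    (Possible multi-edges are kept instead of being cleaned up.)"""
    assert (N * d) % k == 0
    M = N * d // k
    var_sockets = [v for v in range(N) for _ in range(d)]
    chk_sockets = [c for c in range(M) for _ in range(k)]
    rng.shuffle(chk_sockets)
    return list(zip(var_sockets, chk_sockets))   # (variable, check) edges

edges = configuration_model(N=12, d=3, k=6, rng=random.Random(1))
```

By construction every variable node touches exactly d edge endpoints and every check node exactly k, which is why the degrees come out (mostly) right for large N.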
How the graphical model relates to fast decoding
The graphical model of LDPCs is useful because it is both easy to construct and presents a natural way to perform fast decoding. In fact, the fast graph-based decoding algorithm we use, Belief Propagation, has a threshold p_d which likely represents the upper limit on fast decoding for LDPCs.
We have seen recently that bipartite graphical models which represent a factorized probability distribution can be used to calculate marginal probabilities of individual variable nodes (what a mouthful!).
Basically, if the structure of the graph reflects some underlying probability distribution (e.g. the probability that noisy bit y_i was originally sent as bit x_i), then we can use an iterative algorithm called Belief Propagation (see blog post above) to converge to the exact marginal distribution for each x_i.
This is important because when we perform decoding, we would like to estimate the marginal probability of each individual variable node (bit in our received vector), and simply set the variable to be the most likely value (this is known as bit-MAP decoding, discussed earlier). As mentioned above, under certain conditions the Belief Propagation algorithm correctly calculates those marginal probabilities for noise rates up to an algorithmic threshold p_d.
The Belief Propagation algorithm is an iterative message passing algorithm in which messages are passed between variable nodes and parity check/factor nodes such that, if the messages converge to a fixed point, they encode the marginal probabilities of each variable node. Thus BP, if it succeeds, can perform bit-MAP decoding and thus successfully decode.
We will show in the next section how the configuration model graphs map to a factorized probability distribution and mention the threshold p_d for BP. In the following section we will show an example of decoding over the binary erasure channel; then finally we will give motivation to suggest that p_d for BP over LDPCs represents a hard upper limit above which no fast decoding algorithms exist.
Decoding Errors via Belief Propagation
As mentioned above (again, please see Tom's excellent blog post for details), the belief propagation algorithm is a useful inference algorithm for stochastic models on sparse graphs derived from computational problems exhibiting thresholding behavior. As discussed, symbol/bit MAP decoding of error correcting codes can be regarded as a statistical inference problem. In this section, we will explore BP decoding to determine the threshold for reliable communication, and the corresponding optimization of LDPC code ensembles, for communication over a binary-input memoryless symmetric channel (BMS), such as the BSC.
Algorithm Overview
Recall that the conditional distribution of the channel input x given the output y is

P(x | y) = (1/Z) ∏_{i=1}^{N} Q(y_i | x_i) ∏_{a=1}^{M} I_a(x)    (1)

and that we wish to find the x that maximizes this probability given y. Here Q(y_i | x_i) is the conditional probability of observing noisy bit y_i given that x_i was sent. For the BSC with flip probability p we have Q(y | x) = 1 − p if y = x and Q(y | x) = p otherwise.
Furthermore, I_a(x) is an indicator variable which takes value 1 if x satisfies parity check a and 0 otherwise. In particular, the product of these indicators assigns 0 probability to hypotheses which aren't in the codebook (indicated by at least one parity check failing).
We would like to design a message passing scheme such that the incoming messages for a given variable node i encode its marginal probability P(x_i | y).
Note, first and foremost, that this probability can be factorized à la BP factor graphs such that there is a factor node for each parity check a, contributing the factor I_a(x), and a factor node for each channel probability term Q(y_i | x_i). Since each channel probability term is only connected to a single variable, its message never gets updated during BP, and so we omit it from the factor graphs (e.g. note that figure 2 only has parity check nodes and variable nodes).
The message passing scheme ends up taking the form

ν_{i→a}(x_i) ∝ Q(y_i | x_i) ∏_{b ∈ ∂i \ a} ν̂_{b→i}(x_i)    (2)

ν̂_{a→i}(x_i) ∝ Σ_{x_{∂a \ i}} I_a(x) ∏_{j ∈ ∂a \ i} ν_{j→a}(x_j)    (3)

where ∂a denotes the neighborhood of factor node a, and the sum in (3) is over all possible configurations of the neighbors of a not including i.
Messages are passed along the edges as distributions over binary valued variables, described by the log-likelihoods

h_{i→a} = (1/2) log [ ν_{i→a}(0) / ν_{i→a}(1) ]    (4)

u_{a→i} = (1/2) log [ ν̂_{a→i}(0) / ν̂_{a→i}(1) ]    (5)

We also introduce the a priori log-likelihood for bit i given the received symbol y_i:

B_i = (1/2) log [ Q(y_i | 0) / Q(y_i | 1) ]

Once we parametrize the messages as log-likelihoods, it turns out we can rewrite our update rules in terms of the parametrized values h and u, making updates much simpler:

h_{i→a} = B_i + Σ_{b ∈ ∂i \ a} u_{b→i}    (6)

u_{a→i} = atanh [ ∏_{j ∈ ∂a \ i} tanh h_{j→a} ]    (7)
Given a set of messages, we would perform decoding via the overall log-likelihood H_i = B_i + Σ_{a ∈ ∂i} u_{a→i}, where bit i gets decoded to 0 for H_i > 0 and to 1 for H_i < 0.
Typically BP is run until it converges to a set of messages that decode to a word in the codebook, or until a maximum number of iterations has occurred. Other stopping criteria exist, such as the messages at time steps t and t+1 all being within some small δ of one another.
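The update loop above can be sketched directly from updates (6) and (7). This is a minimal illustration on a made-up instance (the priors and checks are invented; we use the half-log-likelihood convention, where a positive value favors bit 0):

```python
import math

# Toy instance: priors B[i] (half log-likelihoods, positive favours 0)
# and each parity check given as the list of variable indices it touches.
B = [1.2, 0.9, 1.5, 0.7]
checks = [[0, 1, 2], [1, 2, 3]]

# Messages h (variable -> check) and u (check -> variable), initialised to 0.
h = {(i, a): 0.0 for a, vs in enumerate(checks) for i in vs}
u = {(a, i): 0.0 for a, vs in enumerate(checks) for i in vs}

for _ in range(20):                       # iterate updates (6) and (7)
    for a, vs in enumerate(checks):       # (6): h = B + sum of other u
        for i in vs:
            h[(i, a)] = B[i] + sum(u[(b, i)] for b, ws in enumerate(checks)
                                   if i in ws and b != a)
    for a, vs in enumerate(checks):       # (7): u = atanh(prod of tanh h)
        for i in vs:
            prod = 1.0
            for j in vs:
                if j != i:
                    prod *= math.tanh(h[(j, a)])
            u[(a, i)] = math.atanh(prod)

# Overall log-likelihood H_i: decode bit i to 0 when positive.
H_i = [B[i] + sum(u[(a, i)] for a, vs in enumerate(checks) if i in vs)
       for i in range(len(B))]
decoded = [0 if v > 0 else 1 for v in H_i]
```

With all priors favoring 0, the check-to-variable messages only reinforce that evidence, so the decoder returns the all-zeros word.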
It is important to note some properties of BP:
 BP always terminates in t steps if the factor graph is a tree of depth t
 It is not known under what circumstances so called “loopy” BP will converge for nontree graphs
Because factor graphs of LDPC codes are relatively sparse, they appear “locally tree-like”, a property which is believed to play a crucial role in BP convergence over the factorized probability distribution used in LDPC MAP decoding (eqn 1). As mentioned above, BP manages to converge on many sorts of non-tree-like graphs provided they have “nice” probability distributions. For example, BP on the SK model is known to converge even though the underlying factor graph is a complete graph!
It turns out that BP converges under some noise levels for LDPC decoding, and that the threshold p_d at which it fails to converge represents a phase transition to a generically different regime in the solution space of the codebook. It's been noted elsewhere that the BP threshold is often the threshold of fast solving for many cool problems, e.g. k-SAT. This is because it is often thought to generically represent the “best” possible local (ergo “fast”) algorithm for those problems.
In appendix C we will show some important properties of BP. The following tables summarize important results for several ensembles and channels. Note how close the information-theoretic threshold for LDPCs is to the actual Shannon limit for the channels below.
Table 1: Thresholds for BSC
Various thresholds for BP over LDPC codes in a Binary Symmetric Channel
d | k | p_d | p_c | Shannon limit
3 | 4 | .1669 | .2101 | .2145
3 | 5 | .1138 | .1384 | .1461
3 | 6 | .084 | .101 | .11
4 | 6 | .1169 | .1726 | .174
See Mezard and Montanari, 2009 Chapt 15. for this table
Table 2: Thresholds for BEC
Various thresholds for BP over LDPC codes in a Binary Erasure Channel
d | k | ε_d | ε_c | Shannon limit
3 | 4 | .65 | .746 | .75
3 | 5 | .52 | .59 | .6
3 | 6 | .429 | .4882 | .5
4 | 6 | .506 | .66566 | .6667
See Mezard and Montanari, 2009 Chapt 15. for this table
We will now show the exact behavior of the (3,6) LDPC ensemble over the binary erasure channel.
Algorithmic Thresholds for Belief Propagation (BP)
Definitions and notation
Definition 1. In a Binary Erasure Channel (BEC) with erasure probability ε, when the transmitter sends a bit x_i, the receiver receives the correct bit with probability 1 − ε, or an erasure symbol ∗ with probability ε.
For BECs, the Shannon capacity—the maximum number of data bits that can be transmitted per encoded bit—is given by C = 1 − ε.
Definition 2. An N-bit Error Correcting Code (ECC) is defined by a codebook C ⊆ {0,1}^N. The transmitter encodes information as an element x ∈ C. The receiver receives a corrupted version y of the transmitted codeword. To decode, it picks the x̂ ∈ C that is most likely given y and the channel characteristics.
For ease of discourse, we have refrained from defining ECC in full generality.
Definition 3. An N-bit Low-Density Parity-Check (LDPC) code is an ECC with C = { x : Hx = 0 }. Here H is an M×N matrix and arithmetic is over GF(2). H is a sparse parity-check matrix. Λ and P are finite polynomials that characterize H: Λ_d is the fraction of columns with d 1s and P_k is the fraction of rows with k 1s. Since these are fractions/probabilities, they must be normalized. Hence Λ(1) = P(1) = 1. H is a random matrix, and therefore has full rank with high probability.
In an LDPC code, |C| = 2^{N−M}. Hence every N bits contain N − M bits of information, making the rate 1 − M/N. Over binary erasure channels (BECs), on receiving y, the decoder must choose an x̂ such that Hx̂ = 0 and x̂_i = y_i wherever y_i ≠ ∗ (x̂_i, y_i denote the i-th bit of x̂ and y respectively). That is, the bits that were successfully transmitted should be preserved; other bits should be chosen to satisfy the parity check equations. If multiple correct choices of x̂ are possible, then y cannot be unambiguously decoded.
BP/Peeling algorithm
In general, decoding can be computationally hard. But there exists an error rate ε_d, a function of Λ and P, below which belief propagation succeeds in decoding. Let ε_c be the maximum error rate up to which successful decoding is possible (i.e. we can unambiguously determine the transmitted codeword) and ε_Sh be the Shannon limit; then ε_d ≤ ε_c ≤ ε_Sh. In general, these inequalities can be strict, illustrating the gap between what is information-theoretically possible and what is computationally feasible.
Belief propagation (BP) for decoding LDPC codes is equivalent to a simple peeling algorithm. Let us first describe the factor-graph representation for decoding, denoted in figure 3. Variable nodes on the left are the received symbols y_i. Factor nodes on the right denote the parity-check constraints (rows of H). The XOR of the variables connected to each factor node must be 0.
BP/the peeling algorithm works as follows. For simplicity of exposition, consider that the all zeros codeword has been transmitted. Since this is a linear code, there is no loss of generality. At first, only the variables that were successfully transmitted are fully determined. In the next round, the factor nodes that have exactly one undetermined variable can determine that variable using their paritycheck constraint.
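The peeling step described above fits in a few lines of code. Here is a minimal sketch (the function name and the tiny example code are our own):

```python
def peel_decode(y, checks):
    """BEC peeling decoder (equivalent to BP on the erasure channel):
    y is the received word with None marking erasures; checks lists, for
    each parity check, the variable indices it touches.  Repeatedly find
    a check with exactly one erased variable and fix it by parity."""
    x = list(y)
    progress = True
    while progress:
        progress = False
        for vs in checks:
            unknown = [i for i in vs if x[i] is None]
            if len(unknown) == 1:
                i = unknown[0]
                x[i] = sum(x[j] for j in vs if j != i) % 2
                progress = True
    return x

# The all-zeros codeword with bit 0 erased: check {0,1,2} has a single
# unknown, so one peeling step recovers it.
decoded = peel_decode([None, 0, 0, 0], checks=[[0, 1, 2], [1, 2, 3]])
```

If the loop ends while erasures remain (no check has exactly one unknown), the decoder is stuck, which is precisely the failure mode discussed next.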
BP isn’t perfect
This algorithm is not perfect. Figure 3 is an example of a received codeword which can be unambiguously decoded — only the all-zeros codeword satisfies all the constraints — but on which the BP algorithm fails, because at every point, all factor nodes have more than one unknown variable. It seems that the only way to solve problems like this is to exhaustively work out the implications of the parity-check equations. If this example seems contrived, that is because it is. Decoding becomes harder as the degree and number of constraints increase; we had to add a lot of constraints to make this example work. Fortunately, if the graph is sparse and the error rate is low enough, BP succeeds. We prove this in the following theorem:
Phase transitions for BP
Theorem 1. A (d,k)-regular LDPC code can be decoded by BP as N → ∞ when the error rate ε is less than the threshold ε_d derived below.
Proof. To prove this, let us analyze the density evolution. For BECs this is particularly simple, as we only need to keep track of z_t and ẑ_t, the fractions of undetermined variable and factor nodes at timestep t. As N → ∞, these fractions are probabilities. A factor node is determined when all of its variables are determined (note: if all but one is determined, the last one can be immediately determined). The following recursion relations hold:

z_{t+1} = ε ẑ_t^{d−1},    ẑ_t = 1 − (1 − z_t)^{k−1}    (8)

The first holds because a variable node is undetermined at timestep t+1 if it was originally undetermined (which happens with probability ε) and if it isn't determined in the last step, which happens with probability ẑ_t^{d−1}, since each of its d−1 other neighboring factor nodes must be undetermined.
A similar reasoning holds for the second relation. 1 − z_t is the probability that a given neighboring variable node is determined, so (1 − z_t)^{k−1} is the probability that all k−1 other variable nodes are determined, and hence this function node is determined. 1 − (1 − z_t)^{k−1} is the probability that this function node is undetermined.
Composing the two relations in equation 8, we get the recursion:

z_{t+1} = F_ε(z_t) = ε [1 − (1 − z_t)^{k−1}]^{d−1}

An example of F_ε(z) is shown in Figure 4 for (d,k) = (3,6), i.e. a (3,6) regular graph where variable nodes and function nodes have 3 and 6 neighbors respectively. On the left, F_ε(z) is always below the line F = z. Hence the recursion, starting from z_0 = ε at the far right, will converge to the fixed point z = 0. But on the right, ε is large enough that F_ε(z) intersects F = z at a nonzero point. Hence the recursion will converge to that higher fixed point instead, without ever reaching the `correct' fixed point z = 0. BP therefore gets stuck at a suboptimal solution, though information-theoretically a correct solution exists. This can be interpreted as a glassy state, where many deep local minima are present, and BP will converge to the wrong minimum.
The condition for BP to converge is F_ε(z) < z for all z ∈ (0, ε]. Hence the threshold error rate ε_d, below which this condition holds, is:

ε_d = inf_{z ∈ (0, 1]}  z / [1 − (1 − z)^{k−1}]^{d−1}

For (3, 6) regular graphs, ε_d ≈ 0.4294. ∎
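The threshold can also be located numerically by simply running the density-evolution recursion and bisecting on ε. The code below is a sketch under that approach (function name and cutoffs are our own):

```python
def bp_threshold(d, k, tol=1e-3):
    """Numerically locate eps_d for a (d,k)-regular LDPC ensemble over the
    BEC by bisecting on whether the density-evolution recursion
    z -> eps * (1 - (1 - z)**(k-1))**(d-1) converges to zero."""
    def converges(eps):
        z = eps
        for _ in range(10_000):
            z = eps * (1 - (1 - z) ** (k - 1)) ** (d - 1)
            if z < 1e-9:
                return True
        return False

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if converges(mid):
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

eps_d = bp_threshold(3, 6)   # the known value for (3,6) is about 0.429
```

Near the threshold the recursion slows down dramatically (the map becomes tangent to the line F = z), which is the numerical signature of the phase transition.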
Another interesting phase transition can be observed. As ε increases, for some values of (d, k), the first intersection of F_ε(z) and the line F = z happens at a nonzero point. For others, it starts off at z = 0 and goes up continuously. In the former case, the decoding error rate jumps discontinuously as ε crosses the threshold from 0 to a nonzero value. In the latter, it increases continuously.
To see the gap between ε_d and what can be decoded information-theoretically, we look at what happens when the degrees of the LDPC code are increased while keeping the rate constant. Specifically, consider the (d, k) regular graph as d, k → ∞ while the ratio d/k is fixed. Note that the rate of the code is 1 − d/k. This is shown in Figure 5: ε_d decreases toward 0. But as d → ∞, it should become information-theoretically easier to decode. In fact, as d → ∞, the code approaches a random linear code, which is known to achieve Shannon capacity. Hence we can believe that the information-theoretically achievable decoding rate is non-decreasing in d. Thus there is a gap between what is information-theoretically possible to decode, and what is computationally feasible using Belief Propagation.
Finally, we would like to mention that it is possible to choose a sequence of degree polynomials (Λ, P) such that ε_d approaches the Shannon limit. While it is nontrivial to sample exactly from this distribution, good approximations exist, and LDPC codes can achieve close to channel capacity over binary erasure channels.
The solution space between ε_d and ε_c
The energy landscape of LDPC decoding
We have already shown the exact location of the threshold ε_c above which decoding is not possible for the LDPC ensemble, and have also investigated the point at which the BP algorithm fails, ε_d.
It should not be surprising to us that any given algorithm we attempt to throw at the problem fails at a certain point below ε_c. In fact there are many simple, random algorithms from the class of Markov-chain Monte Carlo methods which give fast run times but which fail at values far below even ε_d. The failing point of a particular algorithm, per se, is not necessarily very significant. We shouldn't expect that any given algorithm, besides explicitly calculating the symbol MAP by traversing the entire codebook, would be able to achieve ε_c.
What is of interest to us here is that ε_d marks a provable threshold in the solution space of LDPC decoding, above which it is likely that no locally-based methods, and therefore no fast algorithms, can successfully decode. We will show later precisely the number and energy levels of these metastable states for the BEC. Proof of this transition for other channel types is outside the scope of this lecture.
In this section we will rephrase decoding as an energy minimization problem and use three techniques to explore the existence of metastable states and their effect on local search algorithms.
In particular, we will first use a generic local search algorithm that attempts to approximately solve the energy minimization formulation of decoding.
We will next use a more sophisticated Markov chain Monte Carlo method called simulated annealing. Simulated annealing is useful because it offers a perspective that more closely models real physical processes and that has the property that its convergence behavior closely mimics the structure of the metastable configurations.
Energy minimization problem
To begin, we will reframe our problem in terms of constraint satisfaction.
The codewords of an LDPC code are solutions of a CSP. The variables are the bits of the word and the constraints are the parity check equations. Though this means our constraints are a system of linear equations, our problem here is made more complicated by the fact that we are searching for not just ANY solution to the system but for a particular solution, namely the transmitted codeword.
The received message tells us where we should look for the solution.
Assume we are using a binary-input, memoryless, output-symmetric channel with transition probability Q(y | x).
The probability that x was the transmitted codeword, given y, is

P(x | y) = (1/Z) ∏_{i=1}^{N} Q(y_i | x_i) ∏_{a=1}^{M} I_a(x)    (10)

where Z is a normalization constant and I_a(x) indicates that x satisfies parity check a.
We can associate an optimization problem with this code. In particular, define the energy E(x) to be twice the number of parity check equations violated by x.
We have already discussed how symbol MAP decoding computes the marginals of this distribution and how word MAP decoding finds its argmax.
We shall here discuss two related problems
 optimizing the energy function within a subset of the configuration space defined by the received word
 sampling from a ’tilted’ Boltzmann distribution associated with the energy
Define the log-likelihood of x being the input, given the received y, to be

L_y(x) = Σ_{i=1}^{N} log Q(y_i | x_i)    (11)
If we assume WLOG that the all-zero codeword was transmitted, then by the law of large numbers, for large N the log-likelihood of this codeword is close to −N h, where h is the entropy of the channel noise. The probability of an order-N deviation away from this value is exponentially small.
This suggests that we should look for the transmitted codeword amongst those x such that L_y(x)/N is close to −h.
The constraint version of our decoding strategy – known as typical-pairs decoding – is thus: find x such that L_y(x)/N is within some small δ of −h. This constraint will be referred to as the `distance constraint', and if exactly one codeword satisfies the distance constraint, we return it.
Since codewords are global energy minima (E(x) = 0 for all x in the codebook), we can phrase typical-pairs decoding as an optimization problem:
Minimize E(x) subject to the distance constraint on L_y(x).
This decoding succeeds iff the minimum is non-degenerate. This happens with high probability for ε < ε_c and with zero probability for ε > ε_c. In particular, there are exponentially many degenerate energy minima above ε_c.
Similar to what we have seen elsewhere in the course, there exists an intermediate regime ε_d < ε < ε_c in which the global energy minimum is still the correct codeword, but there is an exponentially large number of local energy minima obscuring it (see figure 6).
What is so special about BP is that the threshold at which these exponentially many metastable states proliferate is exactly the algorithmic threshold ε_d for BP which we derived earlier.
While finding solutions of Hx = 0 amounts to Gaussian elimination, the distance constraint is not a linear constraint. Thus one needs some more advanced search procedure to find satisfying vectors x.
We will show that if one resorts to local-search-based algorithms, the metastable states above ε_d block the algorithm. Furthermore, we suggest that the behavior of the local algorithms discussed below is typical of all local search algorithms (including BP), and that it is very likely the case that no fast algorithm exists capable of finding global energy minima without getting caught in the metastable states which proliferate above ε_d.
Below is the simplest of local search algorithms, Δ-local search.
Δ-search typifies local search algorithms. It walks semi-randomly through the landscape searching for low-energy configurations. Its parameter Δ is defined such that, when stuck in a metastable state, it can climb out in polynomial time if the height of the energy barrier is at most Δ. Thus its failure in the region ε > ε_d suggests that there are no barriers of constant size there, and that barriers of order N are the norm.
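A hypothetical sketch of such a Δ-style search is below: flip a random bit and accept the move whenever the energy rises by at most Δ, so barriers of height ≤ Δ can be climbed. The instance and all names here are invented for illustration:

```python
import random

def energy(x, checks):
    """E(x): twice the number of violated parity checks."""
    return 2 * sum(sum(x[i] for i in vs) % 2 for vs in checks)

def delta_search(x0, checks, delta, steps, rng):
    """Hypothetical Delta-style local search: flip a random bit and
    accept the move whenever it raises the energy by at most delta."""
    x, e = list(x0), energy(x0, checks)
    best_e = e
    for _ in range(steps):
        i = rng.randrange(len(x))
        x[i] ^= 1
        e_new = energy(x, checks)
        if e_new - e <= delta:
            e = e_new              # accept the move
        else:
            x[i] ^= 1              # reject: revert the flip
        best_e = min(best_e, e)
    return best_e

checks = [[0, 1, 2], [1, 2, 3], [0, 2, 3]]
best = delta_search([1, 0, 1, 1], checks, delta=2, steps=500,
                    rng=random.Random(0))
```

On a landscape whose barriers grow with N, no constant Δ suffices, which is the failure mode the text describes for ε > ε_d.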
MCMC and the relaxation time of a random walk
We can understand the geometry of the metastable states in greater detail by reframing our MAP problem as follows:

P_β(x) = (1/Z(β)) exp(−β E(x)) ∏_{i=1}^{N} Q(y_i | x_i)    (12)

This form is referred to as the `tilted' Boltzmann distribution because it is a Boltzmann distribution biased by the likelihood function.
In the low temperature limit (β → ∞) this reduces to eqn 10, because it finds support only over words in the codebook.
This distribution more closely mimics physical systems. For nonzero temperature it allows support over vectors which are not actually in our codebook but still have low distance to our received message and have low energy – this allows us to probe the metastable states which trap our local algorithms. This is referred to as a code with `soft’ parity check constraints as our distribution permits decodings which fail some checks.
We will use the following algorithm excerpted from Mezard and Montanari Chapt 21:
Here a Glauber update consists of scanning through the bits of the current proposed configuration and flipping the value of bit i with probability

w_i(x) = P_β(x^{(i)}) / ( P_β(x) + P_β(x^{(i)}) )    (13)

where x is the current configuration and x^{(i)} is the configuration obtained by flipping x's i-th bit.
This method is a spin-off of traditional Markov chain Monte Carlo algorithms, with the variation that we lower the temperature according to an annealing schedule. Initially the schedule assigns probability to states roughly in proportion to the likelihood component of equation 12, allowing the chain to randomly sample the configuration space in the neighborhood of the received noisy word; in the low temperature limit the distribution becomes concentrated near configurations which are proper codewords.
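The schedule above can be sketched as follows. This is a hedged illustration of annealing on the tilted distribution (12) with Glauber updates (13); the instance, function names, and schedule are all our own inventions, not from the lecture:

```python
import math, random

def glauber_anneal(y_llr, checks, betas, sweeps, rng):
    """Simulated-annealing sketch for eq. (12): E(x) is twice the number
    of violated parity checks, and y_llr[i] = log Q(y_i|0)/Q(y_i|1) is
    the channel evidence for bit i."""
    def energy(x):
        return 2 * sum(sum(x[j] for j in vs) % 2 for vs in checks)

    def logw(x, beta):
        # Unnormalised log-weight of eq. (12): -beta*E(x) + log-likelihood.
        return -beta * energy(x) - sum(xi * li for xi, li in zip(x, y_llr))

    x = [rng.randrange(2) for _ in y_llr]
    for beta in betas:                    # increasing beta = cooling
        for _ in range(sweeps):
            for i in range(len(x)):      # one Glauber sweep, eq. (13)
                w_old = logw(x, beta)
                x[i] ^= 1
                w_new = logw(x, beta)
                keep = 1.0 / (1.0 + math.exp(min(w_old - w_new, 700.0)))
                if rng.random() >= keep:
                    x[i] ^= 1            # rejected: revert the flip
    return x, energy(x)

checks = [[0, 1, 2], [1, 2, 3]]
llr = [2.0, 2.0, 2.0, 2.0]               # channel mildly favours bit 0
x, e = glauber_anneal(llr, checks, betas=[0.5, 1.0, 2.0, 4.0],
                      sweeps=20, rng=random.Random(0))
```

At small β the likelihood term dominates and the chain wanders near the received word; as β grows, violated checks become costly and the chain settles toward (near-)codewords, exactly the behavior the annealing schedule is meant to produce.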
This method is useful to us because MCMCs are good models of how randomized methods of local searching for optimal configurations occurs in physical systems. Furthermore, the convergence of MCMCs and the time it takes them to converge tells us both the properties of the energy wells they terminate in and the barriers between minima in the energy landscape.
Let’s now show a property relating convergence times of MCMCs and energy barriers known as the Arrhenius law.
If we take the example of a simple MCMC random walk over a double-well energy landscape, we find that the expected number of time steps to cross from one well to the other is governed by the Arrhenius law τ ≈ exp(ΔE/T), where ΔE is the height of the barrier separating the wells.
In general, if there exists a largest energy barrier ΔE between any two components of the configuration space (also known as the bottleneck), the time it takes to sample both components, also known as the relaxation time of the MCMC, is τ ≈ exp(ΔE/T).
With this in mind, we can apply our simulated annealing MCMC to LDPC decoding and examine the properties of the bottlenecks, or metastable states, in our configuration landscape.
Exact values of the metastable energy states for the BEC
It is a well-known consequence of the 1RSB cavity method that the number of metastable states of energy density e grows like exp(N Σ(e)), where Σ(e) is known as the energetic complexity function, a function whose form is implied by the 1RSB cavity equations. This computation can be carried out using a method called Survey Propagation, which constructs a factor graph of the messages passed in the original BP factor graph and estimates the values of the marginals of the messages via another round of BP (hence the name 1-step RSB).
Omitting the details of the calculation, here are approximate results for the BEC.
In the low-noise regime there exists a zero-energy word corresponding to the correct solution. On top of this, there exist nontrivial solutions to the 1RSB method yielding a complexity curve that is positive over a range of energies. The positive complexity means that there are exponentially many such states, and their finite energy means they violate a finite fraction of the parity checks, making their energy wells relatively deep.
As the error rate of the channel increases past a dynamical threshold, the minimum energy of the metastable states reaches zero continuously. This means that at noise levels above this threshold there are exponentially many zero-energy states corresponding to configurations which aren't codewords. These states are separated by energy barriers, making the relaxation time of local algorithms exponential, by the Arrhenius law, in this regime.
Here you can see a rough sketch of the convergence of the simulated annealing algorithm. As the temperature decreases in the low-noise regime, the algorithm converges to a zero-energy ground state. In the figure on the right, we can see that above the threshold simulated annealing instead converges to a horizontal line corresponding to the energy of the highest metastable state for the BEC.
Thus our local search algorithm ends up attracted to the highest-energy metastable states.
Though there is not necessarily an exact correspondence between the residual energy of simulated annealing at $T=0$ and the highest metastable state, we see across all noise levels that the residual energy at $T=0$ remains close to it, suggesting that local search tends to get caught in the deepest metastable energy wells.
This discussion shows that the algorithmic threshold of BP marks the onset of a truly different regime within the energy landscape of the codebook: metastable states proliferate and become exponentially difficult to escape from via local search methods. Thus the failure of BP likely indicates a regime in which no fast algorithm can perform decoding, even though decoding is still information-theoretically possible below the MAP threshold, e.g. via exhaustive search of the codebook.
Appendix A: Random Code Ensembles
In an RCE, encoding maps applied to the information sequence are chosen with uniform probability over a solution space. Two decoding schemes are commonly applied to the noisy output: word MAP and symbol MAP decoding. MAP, short for "maximum a posteriori probability," works by maximizing the posterior distribution to output the most probable transmission. Word MAP decoding outputs the codeword with the highest probability, thereby minimizing the block error probability: the probability, with respect to the channel distribution, that the decoded word differs from the true transmitted word. Symbol MAP decoding, on the other hand, minimizes the fraction of incorrect bits averaged over the transmitted codeword (the bit error rate).
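The distinction between the two decoding rules can be sketched with a toy example. The codebook, flip probability, and received word below are illustrative assumptions of ours, not taken from the text.

```python
import itertools

# A tiny toy codebook (hypothetical, for illustration): 4 codewords of length 3.
CODE = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

def bsc_likelihood(x, y, p):
    """P(y | x) for a binary symmetric channel with flip probability p."""
    flips = sum(a != b for a, b in zip(x, y))
    return (p ** flips) * ((1 - p) ** (len(x) - flips))

def word_map(y, p):
    """Word MAP: output the single most likely transmitted codeword
    (minimizes block error probability; uniform prior over the code)."""
    return max(CODE, key=lambda x: bsc_likelihood(x, y, p))

def symbol_map(y, p):
    """Symbol MAP: decode each bit by maximizing its marginal
    (minimizes bit error rate); sums contributions from all codewords."""
    decoded = []
    for i in range(len(y)):
        marg = {0: 0.0, 1: 0.0}
        for x in CODE:
            marg[x[i]] += bsc_likelihood(x, y, p)
        decoded.append(max(marg, key=marg.get))
    return tuple(decoded)

y = (0, 0, 1)   # received word, not itself a codeword
print(word_map(y, p=0.1), symbol_map(y, p=0.1))
```

Note that the symbol MAP output need not itself be a codeword (here it returns the received word, which is outside the codebook), while word MAP always returns a codeword: the two rules really do optimize different objectives.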
An RCE code is defined by its codebook in Hamming space, the set of all binary strings of length $N$. In a Hamming space characterized by uniform probability, the number of codewords at a given Hamming distance is governed by the distance enumerator. Distance enumerators take different weights as parameters, given that the probabilities of codewords are independent of one another. The distance enumerator shows that, for a small enough fractional distance from the true message, the growth rate is negative, so the average number of codewords at small distance from the true message vanishes exponentially with $N$. The Gilbert-Varshamov distance, a lower-bound threshold, marks the point beyond which the average number of codewords is exponentially large, where the weight enumerator is concentrated.
We now look at the performance of the RCE in communication over the binary symmetric channel (BSC), where each transmitted bit is flipped independently with probability $p$ (i.e. with probability $p$, 1 becomes 0 and 0 becomes 1). With a BSC, channel input and output are bit sequences of the same length $N$. At larger noise levels, there are exponentially many codewords closer to the channel output than the transmitted word, and decoding is unsuccessful. Under symbol MAP decoding, the $i$th bit is decoded by maximizing its marginal probability, which amasses contributions from all codewords in the set. Above a threshold, the bit error rate is the same as if the message were transmitted without any encoding and decoding; below it, the RCE works quite well in transmission.
Finite temperature decoding has also been studied as an interpolation between the two MAP decoding schemes. At low noise a completely ordered phase is observed, compared to a glassy phase at higher noise. As in statistical physics models, an entropy-dominated paramagnetic phase appears at higher temperatures.
Each decoding scheme can be analogized to "sphere packing," where each codeword in Hamming space is surrounded by a sphere of radius $r$. Since decoding partitions the Hamming space, these spheres must be disjoint; intersecting spheres must be eliminated. A lower bound on the number of remaining spheres is then given by the Gilbert-Varshamov bound, whereas the upper bound is dictated by the Hamming (sphere-packing) bound.
Another random code besides the RCE is the RLC, or random linear code. Encoding in an RLC is a linear map, under which all codewords are equiprobable. The code is specified by a binary matrix, otherwise known as the generating matrix, and it is predicted to be error-free below the Shannon capacity.
There are several sources of randomness in codes. The code itself is chosen randomly from an ensemble, and the codeword to be transmitted is chosen with uniform probability from the code, per the source-channel separation theorem. The channel output is then distributed according to a probabilistic process accounting for channel noise, and decoding is done by constructing another probability distribution over possible channel inputs and estimating its single-bit marginals. The decision on the $i$th bit depends on this distribution. Thus, care is needed to distinguish two levels of randomness: the code, channel input, and noise ("quenched" disorder) on one hand, versus the Boltzmann probability distribution on the other.
Appendix B: Weight enumerators and code performance
The geometric properties of the LDPC codebook are studied via the distance enumerator, which gives the number of codewords at Hamming distance $d$ from a reference codeword. Taking the all-zeros codeword as the reference, this is the weight enumerator $\mathcal{N}(w)$, the number of codewords having weight equal to $w$. To estimate the expected weight enumerator for a random code in the ensemble, note that it grows exponentially in the blocklength $N$, with the weight $w = N\omega$ growing linearly in $N$. The exponential growth rate is defined by

$\phi(\omega) \equiv \lim_{N\to\infty} \frac{1}{N}\,\log \mathbb{E}\,\mathcal{N}(N\omega)$ (14)

with $\mathbb{E}$ denoting an "annealed average," i.e. one that could be dominated by rare instances in the ensemble. This gives an upper bound on the number of "colored factor graphs" that have an even number of weighted incident edges, divided by the total number of factor graphs in the ensemble.
On the other hand, for graphs of fixed degrees with N variable nodes of degree l and M function nodes of degree k, the total number of edges F is given by . A valid colored graph would have edges, with the number of variable nodes given in ways, l assignments of weighted sockets to nodes, and l assignments of unweighted sockets to nodes outside the set. If we take to denote the number of function nodes with weighted sockets under the constraints of and , we find the number of ways to color the function node sockets by
(15)
(16)
If we aim to join variable and check nodes so that colorings are matched, knowing that there are possible matchings in each ensemble element, this yields the following formula:
(17)
At low noise limits, code performance depends on the existence of codewords at distances close to the transmitted codeword. Starting with degree 1 and knowing that the parametric representation for weights is given by
(18)
derive that
(19)
in the small-$\omega$ limit. This shows that the exponential growth rate is strictly positive when $\omega$ is sufficiently small, and that the expected number of codewords within a small Hamming distance of a given codeword is exponential in $N$. If we take the logarithm of the expected weight enumerator and plot it versus the reduced weight $\omega$ for an irregular code with minimum variable degree 1, we see that the growth rate is positive near the origin and that its derivative diverges as $\omega \to 0$. Since this means that each codeword is surrounded by a large number of very close other codewords, the code is a very bad ECC: it is hard to discriminate between codewords at small Hamming distances from noisy observations. Applying this same logic to codes with minimum variable degree 2, the growth rate still tends to 0 as $\omega \to 0$, though more quickly than in the degree-1 case. If we assume that this holds beyond the asymptotic regime, we get
(20)
or that the number of codewords around a particular codeword is until a Hamming distance , otherwise known as the “effective minimum distance”. For , we find:
(21)
suggesting that LDPC codes with this property have good short distance behavior. Thus, any error that changes a fraction of the bits smaller than can be corrected in the absence of codewords within an extensive distance .
Let us now focus on the capacity of LDPC codes to correct typical errors in a probabilistic channel. Consider a binary symmetric channel that flips each transmitted bit independently with probability $p$. Suppose the all-zero codeword has been transmitted and received as a vector whose components are i.i.d. random variables taking value 0 with probability $1-p$ and value 1 with probability $p$; we use the MAP decoding strategy, which minimizes the block error rate, and output the codeword closest to the channel output. The expected block error probability over the code ensemble is an indicator of ensemble performance. We will show that, as $N \to \infty$, these codes undergo a phase transition separating a low-noise from a high-noise phase. To derive a lower bound on the capacity of LDPC codes over the BSC, we bound the block error probability via the union bound over the codebook:
(22)
(23)
(24)
This derivation shows that the block error probability depends on the weight enumerator. The second term shows that as the weight of a codeword increases, its contribution is scaled down by an exponential factor: it is less likely that the received message will be closer to a codeword of large weight than to the all-zero codeword. For a large enough noise level, Shannon's theorem implies that the block error probability is bounded away from 0 for any non-vanishing rate, so at any $p$ below the ML threshold one can communicate with an arbitrarily small error probability. At the threshold itself, the upper bound is dominated by codewords of extensive weight, suggesting that each time an error occurs, a finite fraction of the bits are decoded incorrectly, and that this fraction doesn't change very much per transmission. The geometric construction also illustrates that this fraction of incorrectly decoded bits jumps discontinuously from 0 to a finite value when $p$ crosses the critical value, constituting a "gap." This gap is close to a factor of 2.
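The union-bound logic above can be sketched numerically. The example below bounds the block error probability of the small [7,4] Hamming code over a BSC, using the standard Bhattacharyya pairwise bound $\gamma^{w}$ with $\gamma = 2\sqrt{p(1-p)}$; the choice of code and of this particular pairwise bound are ours, for illustration.

```python
import math

# Weight enumerator of the [7,4] Hamming code: A[w] = number of codewords
# of Hamming weight w (a standard, known enumerator).
A = {0: 1, 3: 7, 4: 7, 7: 1}

def union_bound_block_error(p):
    """Union/Bhattacharyya bound on block error probability over BSC(p):
    P_B <= sum_w A_w * gamma^w with gamma = 2*sqrt(p*(1-p)).
    Codewords of larger weight contribute exponentially less."""
    gamma = 2.0 * math.sqrt(p * (1.0 - p))
    return sum(a * gamma ** w for w, a in A.items() if w > 0)

for p in [0.001, 0.01, 0.05]:
    print(f"p={p}: P_B <= {union_bound_block_error(p):.3e}")
```

As in the derivation, the low-weight codewords (here the seven codewords of weight 3) dominate the bound at small $p$, and the bound degrades smoothly as the noise grows.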
Appendix C: BP performance
See figure 2 for a factor graph illustrating this relationship. Again, recall that for LDPC code ensembles in the large-blocklength limit, variable and check nodes have fixed degree distributions, and we assume for simplicity that messages are initialized symmetrically. By channel symmetry, the bit error probability is independent of the transmitted codeword, so we have the freedom to assume transmission of the all-zero codeword. In analyzing the recursion at the basis of the BP algorithm, we can show that decoding performance improves over the iterations on the basis of symmetry and physical degradation.
Symmetry
Symmetry of the channel log-likelihood and of the variables appearing in density evolution is an attribute of the desired BMS channel, and symmetry is preserved by the BP operations during evolution. If we assume that the factor graph associated with an LDPC code is locally tree-like, we can apply BP decoding with a symmetric random initial condition and note that the messages remain symmetric variables at all iterations. This symmetry is analogous to the Nishimori condition in spin glasses, and it holds for the MAP log-likelihood of a bit as well.
Physical degradation
Let’s first define physical degradation with BMS channels. If we take two channels BMS(1) and BMS(2) denoted by transition matrices
and output alphabets , then BMS(2) is physically degraded with respect to BMS(1) if there exists a third channel C with input alphabet such that BMS(2) is the concatenation of BMS(1) and C. If we represent transition matrix C as we can represent the above physical degradation as
(25)
This is analogous to a Markov chain following partial ordering. Channel reliability is then ordered by measures of conditional entropy and bit error rate. This extends to symmetric random variables, which are associated with BMS channels.
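For BSCs, physical degradation can be checked directly: concatenating two symmetric channels is just a product of their transition matrices. A small sketch (the particular flip probabilities below are arbitrary):

```python
import numpy as np

def bsc(p):
    """Transition matrix of a binary symmetric channel: rows = input bit,
    columns = output bit, entries = conditional probabilities."""
    return np.array([[1 - p, p],
                     [p, 1 - p]])

# Physical degradation: BSC(p2) is degraded w.r.t. BSC(p1) if
# BSC(p2) = BSC(p1) followed by some channel C. For BSCs, concatenating
# BSC(p1) with BSC(q) yields BSC(p1*(1-q) + q*(1-p1)).
p1, q = 0.1, 0.05
p2 = p1 * (1 - q) + q * (1 - p1)   # = 0.14
concatenated = bsc(p1) @ bsc(q)    # matrix product = channel concatenation
assert np.allclose(concatenated, bsc(p2))
print(f"BSC({p1}) concatenated with BSC({q}) = BSC({p2})")
```

Since $p_2 > p_1$, the concatenated channel is strictly noisier, consistent with the partial ordering by conditional entropy and bit error rate described above.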
Thresholds
We then fix a particular LDPC code and view the BP messages as random variables, due to the randomness in the received vector, and use the proposition below, which shows that the bit error rate decreases monotonically with time:
Proposition: If is a tree, then for any . Analogously, if , then for any .
Density evolution in this manner is a useful way to estimate the distributions of the density evolution variables. Since the bit error rate and the conditional entropy are both monotonically decreasing functions of the number of iterations (and, conversely, monotonically increasing functions of the noise level), they approach a finite limit as the number of iterations grows. The corresponding BP threshold can then be defined as
For noise levels above the threshold, however, increasing the number of iterations does not help: the bit error rate is asymptotically lower bounded by a positive constant for any fixed number of iterations. Good LDPC codes are thus designed with a BP threshold as large as possible for a given design rate, maximizing the tolerable noise level over degree distribution pairs. Such an optimized ensemble will have a finite fraction of variable nodes of degree 2, and hence a large number of codewords with small weight, which ultimately prevent the block error probability from vanishing as $N \to \infty$.
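Density evolution itself is easy to run numerically. The sketch below computes the BP threshold of a regular $(l,k)$ ensemble over the erasure channel, the simplest case, where the recursion closes on a single scalar; for the $(3,6)$ ensemble the threshold should come out near $0.43$.

```python
def converges_to_zero(eps, l, k, iters=2000, tol=1e-10):
    """Density evolution for a regular (l, k) LDPC code over the BEC.
    z_t = probability a variable-to-check message is still an erasure:
    z_{t+1} = eps * (1 - (1 - z_t)^(k-1))^(l-1)."""
    z = eps
    for _ in range(iters):
        z = eps * (1.0 - (1.0 - z) ** (k - 1)) ** (l - 1)
        if z < tol:
            return True
    return False

def bp_threshold(l, k, lo=0.0, hi=1.0):
    """Bisect for the largest erasure probability at which BP succeeds."""
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        if converges_to_zero(mid, l, k):
            lo = mid
        else:
            hi = mid
    return lo

print(f"BP threshold for the (3,6) ensemble: {bp_threshold(3, 6):.4f}")
```

Note the gap between this BP threshold and the Shannon limit ($1 - \text{rate} = 0.5$ for the rate-$1/2$ ensemble): this is exactly the algorithmic-versus-information-theoretic gap discussed in the main text.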
References
[1] Marc Mézard and Andrea Montanari.
Information, Physics, and Computation.
Oxford Graduate Texts, Oxford University Press, 2009.
Ising Perceptron under Gaussian Disorder, and kNAESAT
Blog Post By: Patrick Guo, VinhKha Le, Shyam Narayanan, and David Stoner
Methods in statistical physics are known to be extremely useful for understanding certain problems in theoretical computer science. Physical observations can motivate the underlying theoretical models, which in turn explain some of the physical phenomena.
This post is based on Professor Nike Sun’s guest lecture on the Ising Perceptron model and regular NAESAT for CS 229R: Physics and Computation. These are both examples of random constraint satisfaction problems, where we have variables and certain relations, or constraints, between the , and we wish to approximate the number of solutions or visualize the geometry of the solutions. For both problems, the problem instance can be random: for example, the linear constraints in the Ising perceptron model are random, and the clauses in the NAESAT instance are chosen at random. As in the previous blog posts, to understand the geometry of solutions, statistical physicists think of sampling random solutions from , which introduces a second type of randomness.
This post is meant to be expository. For interested readers, we point to useful references at the end for more rigorous treatments of these topics.
1. Perceptron Model
The Ising perceptron under Gaussian disorder asks how many points in $\{-1,+1\}^n$ survive half-space bisections (i.e. are satisfying assignments for the constraints), where the planes' coefficients are drawn from standard Gaussians. As for many problems in statistical physics, there is conjecturally a critical capacity of constraints: with more constraints there are no survivors with high probability, and with fewer constraints there are survivors with high probability. This lecture gives an overview of a proof of one side of this physics prediction, i.e. a lower bound on the critical capacity, below which survivors exist with positive probability. Briefly, we use the cavity method to approximate the distribution of the number of satisfying assignments, then attempt to use the second moment method on that distribution to get our lower bound. A direct application fails, however, due to the variance of the Gaussian constraints. The solution is to carefully choose exactly what to condition on without destroying the model. Specifically, we iteratively compute values related to the Gaussian disorder, after which we are able to remove enough variance for the second moment method to work, establishing the lower bound on the Ising perceptron's critical capacity. The result holds subject to an analytical condition which is detailed in the paper (Condition 1.2 in [6]) and which remains to be rigorously verified.
1.1. Problem
We pick a random direction in , and delete all vertices in the hypercube which are in the halfspace negatively correlated with that direction. We repeat this process of picking a random half space and deleting points times, and see if any points in the hypercube survive. Formally, we define the perceptron model as follows:
Definition 1: Let $G$ be an $m \times n$ array of i.i.d. standard Gaussians. Then $M_n$ is the largest integer $m$ such that the set
$\{\sigma \in \{-1,+1\}^n : G\sigma \ge 0\}$
is nonempty, where the inequality is taken pointwise and $\sigma$ is treated as a vector in $\mathbb{R}^n$.
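For small $n$ the partition function can be computed by brute force, which makes the definition concrete. The sizes and random seed below are arbitrary choices of ours.

```python
import itertools
import numpy as np

def partition_function(G):
    """Count sign vectors sigma in {-1,+1}^n satisfying G @ sigma >= 0
    pointwise, i.e. surviving every random half-space constraint."""
    m, n = G.shape
    count = 0
    for sigma in itertools.product([-1, 1], repeat=n):
        if np.all(G @ np.array(sigma) >= 0):
            count += 1
    return count

rng = np.random.default_rng(0)
n = 10
for m in [0, 5, 10, 15]:
    G = rng.standard_normal((m, n))   # m constraints, i.i.d. Gaussian rows
    Z = partition_function(G)
    print(f"m={m} constraints: Z = {Z} of {2**n} points survive")
```

Each additional Gaussian half-space roughly halves the surviving set, and $M_n$ is the number of constraints at which the count first hits zero; the whole difficulty of the problem is that this brute-force count is exponential in $n$.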
In the late 1980s, physicists conjectured that there is a critical capacity $\alpha_\star$ such that $M_n/n \to \alpha_\star$ in probability. The predicted critical capacity has been studied, for example, in [7,9]. Our goal is to establish a lower bound on $\alpha_\star$ for the Ising perceptron under Gaussian disorder. To do this, let $Z$ denote the random variable which measures how many choices of the $n$ spins survive the Gaussian constraints (this is the partition function). We want to show that $Z > 0$ with high probability when there are $\alpha n$ constraints for $\alpha$ below the conjectured capacity. To this end, the second moment method seems promising:
Second Moment Method: If $X$ is a nonnegative random variable with finite variance, then $\Pr[X > 0] \ge (\mathbb{E}[X])^2 / \mathbb{E}[X^2]$.
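A quick numerical sanity check of the inequality on a toy distribution (chosen arbitrarily):

```python
from fractions import Fraction

def second_moment_bound(pmf):
    """Given a pmf {value: prob} for a nonnegative r.v. X, return
    (P(X > 0), (E X)^2 / E[X^2]); the second moment method says the
    first quantity is at least the second."""
    EX = sum(v * p for v, p in pmf.items())
    EX2 = sum(v * v * p for v, p in pmf.items())
    p_pos = sum(p for v, p in pmf.items() if v > 0)
    return p_pos, EX * EX / EX2

# Example: X = 0 w.p. 1/2, X = 1 w.p. 1/4, X = 3 w.p. 1/4.
pmf = {Fraction(0): Fraction(1, 2),
       Fraction(1): Fraction(1, 4),
       Fraction(3): Fraction(1, 4)}
p_pos, bound = second_moment_bound(pmf)
print(p_pos, bound)   # exact fractions: P(X>0) = 1/2, bound = 2/5
assert p_pos >= bound
```

The bound is tight exactly when $X$ is concentrated on a single positive value; a heavy upper tail (large $\mathbb{E}[X^2]$) weakens it, which is precisely the failure mode described next for the perceptron partition function.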
However, this method actually fails due to various sources of variance in the perceptron model. We will briefly sketch the fix as given in [6].
Before we can start, however, what does the distribution of even look like? This is actually quite computationally intensive; we will use the cavity equations, a technique developed in [12,13], to approximate ‘s distribution.
1.2. Cavity Method
The goal of the cavity method is to see how the solution space changes as we remove a row or a column of the matrix with i.i.d. standard Gaussian entries, assuming that is fixed. Since our matrix is big and difficult to deal with, we try to see how the solution space changes as we add one row or one column at a time. Why is this valuable? We can think of the system of variables and constraints as an interaction between the rows (constraints) and columns (variables), so the number of solutions should behave proportionally to the product of the solutions attributed to each variable and each constraint. With this as motivation, define as the matrix obtained by removing row and as the matrix obtained by removing column . We can approximate
since we can think of the partition function as receiving a multiplicative factor from each addition of a row and each addition of a column. Thus, the cavity method seeks to compute and
1.2.1. Removing a constraint
Our goal in computing is to understand how the solution space changes when we add in the constraint Recalling that is our vector that is potentially in the solution space, if we define
then adding the constraint is equivalent to forcing We will try to understand the distribution of under the Gibbs measure , which is the uniform measure on the solution space without the constraint . This way, we can determine the probability of under the Gibbs measure, which will equal To do so, we will use the Replica Symmetric Cavity Assumption (which we will abbreviate as RS cavity assumption). The RS cavity assumption lets us assume that the ‘s for are independent under . As the ‘s are some discrete sample space that depend on our constraints, we do not actually have full independence, but the RS cavity assumption tells us there is very little dependence between the ‘s, so we pretend they are independent.
Note that should be approximately normally distributed, since it is the sum of many “independent” terms by the RS Cavity assumption. Now, define as the expectation of under the Gibbs measure without the constraint , and as the variance of under the Gibbs measure without the constraint . It will also turn out that the variance will concentrate around a constant Thus,
Here, equals the probability that a random Gaussian distribution is positive, or equivalently, the probability that a standard Gaussian distribution is greater than
1.2.2. Removing a spin
To calculate the cavity equation for removing one column, we think of removing a column as deleting one spin from . Define as the vector resulting from removing spin from . We want to calculate Note that it is possible for but now, which complicates this calculation. It is possible to overcome this difficulty by passing to positive temperature, though this makes the calculations incredibly difficult. We do not worry about these issues here, and just briefly sketch how is computed.
To compute , we will split the numerator into a sum of two terms based on the sign of which will allow us to compute not only but also , which represents the magnetization at spin . A series of complicated calculations (see the lecture notes for the details) will give us
for some constant , where is a quantity that compares how much more correlated spin is to the constraints than all other spins. The will come from the solutions for with and the will come from the solutions with .
The above equations allow us to deduce that
1.3. The Randomness of G
In the previous section, we regarded $G$ as fixed. We now use these results but allow $G$ to be random again. Recall that $G$'s entries were i.i.d. Gaussians. We thus get that each cavity quantity, as a linear combination of the entries of $G$, is itself Gaussian, and we can write it accordingly with the appropriate mean and variance.
With some calculations detailed in the lecture notes and [12], we get Gardner’s Formula:
where is our capacity, . It turns out that and so the above equation only depends on two parameters, and It will turn out that have relations dependent on each other based on our definitions of and , and these will give us a fixed point equation for which has two solutions: one near and one near It is believed that the second point is correct, meaning that the critical capacity should equal
1.4. Second Moment Method
Now that we have Gardner’s formula, solving for gives us an approximation for its distribution as
where is Gardner’s formula, and the Gaussian noise arises from the Gaussian distribution of the ‘s and ‘s. Again, we are interested in the probability that , but due to the approximate nature of the above equation, we cannot work with this distribution directly, and instead can try the second moment method. However, the exponentiated Gaussian makes and the moment method just gives , whereas we need with high probability.
The fault here lies with Gaussian noise due to the ‘s and ‘s, so it is natural to consider what happens when we condition on them, getting rid of the noise. We denote
Then, we can compute that
in probability as . (This is not an exact equality because of the approximations made in the derivation of Gardner’s formula.) Moreover, we will have . The second moment method gives us the desired lower bound on . However, note that these must satisfy the equations that defined them. In vector form, we have
where , , and are constants and functions that appear in the rigorous definitions of and .
Here arises a problem, however. We cannot solve these equations easily, and furthermore when we condition on and , the ‘s that satisfy the above are not necessarily representative of matrices of standard Gaussians. Hence, simply conditioning on knowing the values of destroys the model and will not prove the lower bound on for Gaussian disorder.
To solve this problem, we introduce an iteration: we first initialize the relevant quantities to specific values detailed in [6]. Then, a simplified version of each time step’s update looks like
It has been proven (see [2,4]) that and converge as , and moreover the convergent values are distributed by what looks like a Gaussian at each time step. Since this sequence of converge (at a rate that is independent of ) and are representative of Gaussian , for a special kind of truncated partition function , conditioning on these iteration values allows the second moment method to work and gives
which is then enough to establish the lower bound for the nontruncated partition function (see [6] for details, see also [3] for closely related computations).
2. NAESAT
This lecture also gave a brief overview of results on NAESAT. In the $k$-SAT problem, we ask whether a boolean formula in conjunctive normal form, with each clause containing exactly $k$ literals, has a satisfying assignment. For NAESAT ("not-all-equal SAT"), we instead require that the satisfying assignment is not uniform on any clause; equivalently, each clause must contain at least one true and at least one false literal. Finally, we restrict our set of possible boolean formulae to those for which every variable is contained in exactly $d$ clauses; we call this model regular NAESAT. We briefly outline the critical clause densities at which the solution space changes. For more background on the NAESAT problem and its variants, see [10].
As with the perceptron model, we are concerned with the expected number of solutions of this system when the regular formula is chosen uniformly at random. It turns out that for any fixed $k$, the expected proportion of satisfying assignments converges, as $n \to \infty$, to a limiting function of the density. More on this, on the model in general, and on the critical constants mentioned below can be found in [15]. Via an application of Markov's inequality and Jensen's inequality, this limit may be bounded above by the first-moment (annealed) curve, shown graphically below.
The diagram also shows the nature of the expected assortment of the solutions of a random regular NAESAT for ranges of values of . Namely, physics analysis suggests the following:
 For , w.h.p. the solutions are concentrated in a single large cluster.
 For , w.h.p. the solutions are distributed among a large number of clusters.
 For , the function breaks away from , and w.h.p. the solutions are concentrated in a small number of clusters.
 For , w.h.p. there are no satisfiable solutions.
One final note for this model: the solutions to satisfiability problems tend to be more clumpy than in the perceptron model, so correlation decay methods won’t immediately work. See [15] for how this is handled.
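The first-moment upper bound mentioned above can be reproduced with a back-of-the-envelope computation. Assuming clause signs are i.i.d. uniform, a fixed assignment violates a given clause (all $k$ literals equal) with probability $2^{1-k}$, independently across the $m = nd/k$ clauses, so $\frac{1}{n}\log_2 \mathbb{E}[Z] = 1 + \frac{d}{k}\log_2(1-2^{1-k})$. The degree at which this exponent turns negative is a crude upper bound in the spirit of the diagram; this simplified calculation is ours, not the rigorous treatment in [15].

```python
import math

def annealed_exponent(d, k):
    """(1/n) log2 E[Z] for d-regular k-NAESAT with i.i.d. clause signs:
    n variables, m = n*d/k clauses, each NAE-violated with probability
    2^(1-k) by any fixed assignment, independently across clauses."""
    return 1.0 + (d / k) * math.log2(1.0 - 2.0 ** (1 - k))

k = 4
for d in range(1, 30):
    if annealed_exponent(d, k) < 0:
        print(f"first-moment bound: no solutions w.h.p. once d >= {d} (k={k})")
        break
```

By Markov's inequality, once the exponent is negative, $\Pr[Z \ge 1] \le \mathbb{E}[Z] \to 0$, giving the rightmost (unsatisfiable) regime in the bullet list above.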
References
[1] N. Bansal. Constructive algorithms for discrepancy minimization. In Proc. FOCS 2010, pages 3-10.
[2] M. Bayati and A. Montanari. The dynamics of message passing on dense graphs, with applications to compressed sensing. IEEE Trans. Inform. Theory, 57(2):764-785, 2011.
[3] E. Bolthausen. A Morita type proof of the replica-symmetric formula for SK. arXiv preprint arXiv:1809.07972, 2018.
[4] E. Bolthausen. An iterative construction of solutions of the TAP equations for the Sherrington-Kirkpatrick model. Commun. Math. Phys., 325(1):333-366, 2014.
[5] T. Cover. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electron. Comput., 14(3):326-334, 1965.
[6] J. Ding and N. Sun. Capacity lower bound for the Ising perceptron. https://arxiv.org/pdf/1809.07742.pdf
[7] E. Gardner and B. Derrida. Optimal storage properties of neural network models. J. Phys. A, 21(1):271-284, 1988.
[8] J. H. Kim and J. R. Roche. Covering cubes by random half cubes, with applications to binary neural networks. J. Comput. Syst. Sci., 56(2):223-252, 1998.
[9] W. Krauth and M. Mézard. Storage capacity of memory networks with binary couplings. J. Phys., 50(20):3057-3066, 1989.
[10] F. Krzakala et al. Gibbs states and the set of solutions of random constraint satisfaction problems. Proc. Natl. Acad. Sci., 104(25):10318-10323, 2007.
[11] S. Lovett and R. Meka. Constructive discrepancy minimization by walking on the edges. SIAM J. Comput., 44(5):1573-1582, 2015.
[12] M. Mézard. The space of interactions in neural networks: Gardner's computation with the cavity method. J. Phys. A, 22(12):2181, 1989.
[13] M. Mézard, G. Parisi, and M. Virasoro. SK model: the replica solution without replicas. Europhys. Lett., 1(2):77-82, 1986.
[14] M. Shcherbina and B. Tirozzi. Rigorous solution of the Gardner problem. Commun. Math. Phys., 234(3):383-422, 2003.
[15] A. Sly, N. Sun, and Y. Zhang. The number of solutions for random regular NAE-SAT. In Proc. FOCS 2016, pages 724-731.
[16] J. Spencer. Six standard deviations suffice. Trans. Amer. Math. Soc., 289:679-706, 1985.
[17] M. Talagrand. Intersecting random half cubes. Random Struct. Algor., 15(3-4):436-449, 1999.
Peter Shor on Quantum Error Correction
[Guest post by Annie Wei who scribed Peter Shor’s lecture in our physics and computation seminar. See here for all the posts of this seminar. –Boaz]
On October 19, we were lucky enough to have Professor Peter Shor give a guest lecture about quantum error correcting codes. In this blog post, I (Annie Wei) will present a summary of this guest lecture, which builds up quantum error correcting codes starting from classical coding theory. We will start by reviewing an example from classical error correction to motivate the similarities and differences when compared against the quantum case, before moving into quantum error correction and quantum channels. Note that we do assume a very basic familiarity with quantum mechanics, such as that which might be found here or here.
1. Motivation
We are interested in quantum error correction, ultimately, because any realworld computing device needs to be able to tolerate noise. Theoretical work on quantum algorithms has shown that quantum computers have the potential to offer speedups for a variety of problems, but in practice we’d also like to be able to eventually build and operate real quantum computers. We need to be able to protect against any decoherence that occurs when a quantum computer interacts with the environment, and we need to be able to protect against the accumulation of small gate errors since quantum gates need to be unitary operators.
In error correction the idea is to protect against noise by encoding information in a way that is resistant to noise, usually by adding some redundancy to the message. The redundancy then ensures that enough information remains, even after noise corruption, so that decoding will allow us to recover our original message. This is what is done in classical error correction schemes.
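The simplest classical example of this idea is the 3-fold repetition code, where majority vote recovers the message as long as each block suffers at most one flip (a toy sketch):

```python
def encode(bits, r=3):
    """Encode by repeating each bit r times -- the simplest redundancy."""
    return [b for b in bits for _ in range(r)]

def decode(received, r=3):
    """Majority vote within each block of r copies."""
    out = []
    for i in range(0, len(received), r):
        block = received[i:i + r]
        out.append(1 if sum(block) > r // 2 else 0)
    return out

msg = [1, 0, 1]
sent = encode(msg)            # [1,1,1, 0,0,0, 1,1,1]
corrupted = sent[:]
corrupted[1] = 0              # one flipped bit per block is tolerated
assert decode(corrupted) == msg
```

The quantum obstacles below are exactly about why this naive strategy does not carry over directly: errors are continuous, and copying a state is forbidden.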
Unfortunately, it’s not obvious that quantum error correction is possible. One obstacle is that errors are continuous, since a continuum of operations can be applied to a qubit, so a priori it might seem like identifying and correcting an error would require infinite resources. In a later section we show how this problem, that of identifying quantum errors, can be overcome. Another obstacle is the fact that, as we’ve stated, classical error correction works by adding redundancy to a message. This might seem impossible to perform in a quantum setting due to the No Cloning Theorem, which states the following:
Theorem (No-Cloning Theorem): Performing the mapping
$|\psi\rangle \mapsto |\psi\rangle|\psi\rangle$ for an arbitrary state $|\psi\rangle$
is not a permissible quantum operation.
Proof: We will use unitarity, which says that a quantum operation specified by a unitary matrix $U$ must satisfy
$\langle\phi|U^\dagger U|\psi\rangle = \langle\phi|\psi\rangle$.
(This ensures that the normalization of the state is always preserved, i.e. that $\langle\psi|U^\dagger U|\psi\rangle = \langle\psi|\psi\rangle = 1$, which is equivalent to the conservation of probability.)
Now suppose that we can perform the operation
$U|\psi\rangle|0\rangle = |\psi\rangle|\psi\rangle$.
Then, letting $|\psi\rangle$ and $|\phi\rangle$ be two arbitrary states,
we note that by unitarity
$\langle\phi|\langle 0| U^\dagger U |\psi\rangle|0\rangle = \langle\phi|\psi\rangle$.
But
$\langle\phi|\langle 0| U^\dagger U |\psi\rangle|0\rangle = \big(\langle\phi|\langle\phi|\big)\big(|\psi\rangle|\psi\rangle\big) = \langle\phi|\psi\rangle^2$,
and in general $\langle\phi|\psi\rangle^2 \neq \langle\phi|\psi\rangle$, so we have a contradiction.
How do we get around this apparent contradiction? To do so, note that the no-cloning theorem only prohibits the copying of nonorthogonal quantum states. With orthogonal quantum states, either $\langle\phi|\psi\rangle = 0$ or $\langle\phi|\psi\rangle = 1$, so that $\langle\phi|\psi\rangle^2 = \langle\phi|\psi\rangle$ and we don't run into a contradiction. This also explains why it is possible to copy classical information, which we can think of as a collection of orthogonal quantum states.
So how do we actually protect quantum information from noise? In the next section we first review classical error correction, as ideas from the classical setting reappear in the quantum setting, and then we move into quantum error correction.
2. Review of Classical Error Correction
First we start by reviewing classical error correction. In classical error correction we generally have a message that we encode, send through a noisy channel, and then decode, in the following schematic process: message $\to$ encode $\to$ noisy channel $\to$ decode $\to$ recovered message.
In an effective error correction scheme, the decoding process should allow us to identify any errors that occurred when our message passed through the noisy channel, which then tells us how to correct the errors. The formalism that allows us to do so is the following: we first define a $k \times n$ encoding matrix $G$ that takes a message of length $k$ and converts it to a codeword of length $n$, where the codewords make up the span of the rows of $G$. An example of such a matrix is
$G = \begin{pmatrix} 1&0&0&0&1&1&0 \\ 0&1&0&0&1&0&1 \\ 0&0&1&0&0&1&1 \\ 0&0&0&1&1&1&1 \end{pmatrix}$,
corresponding to the 7-bit Hamming code, which encodes a 4-bit message as a 7-bit codeword. Note that this code has distance 3, since any two distinct codewords differ in at least 3 spots, which means that it can correct 1 error (the number of errors that can be corrected is $\lfloor (d-1)/2 \rfloor$, where $d$ is the code distance).
We also define the $(n-k) \times n$ parity check matrix $H$ to be the matrix that satisfies
$Hc^T = 0$ if and only if $c$ is a codeword.
For example, to define $H$ corresponding to the $G$ we defined for the 7-bit Hamming code, we could take
$H = \begin{pmatrix} 1&1&0&1&1&0&0 \\ 1&0&1&1&0&1&0 \\ 0&1&1&1&0&0&1 \end{pmatrix}$.
Then we may decode $y$, a 7-bit string, in the following manner. Say that
$y = c + e$,
where $c$ is a codeword and $e$ is the 1-bit error we wish to correct. Then
$Hy^T = Hc^T + He^T = He^T$.
Here $He^T$ uniquely identifies the error $e$ and is known as the error syndrome. Having it tells us how to correct the error. Thus our error correction scheme consists of the following steps:
 Encode a $k$-bit message $x$ by multiplying by $G$ to obtain codeword $c = xG$.
 Send the message through the channel, generating error $e$ and resulting in the string $y = c + e$.
 Multiply $y$ by $H$ to obtain the error syndrome $He^T$.
 Correct the error to obtain $c = y + e$ (over $\mathbb{F}_2$), and hence the message $x$.
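The scheme above is easy to try out in code. Here is a minimal NumPy sketch, using one standard choice of $G$ and $H$ for the [7,4] Hamming code (column-ordering conventions vary, so this is one of several equivalent presentations):

```python
import numpy as np

# Generator matrix G = [I_4 | P] and parity-check matrix H = [P^T | I_3]
# for the [7,4] Hamming code.
P = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1],
              [1, 1, 1]])
G = np.hstack([np.eye(4, dtype=int), P])
H = np.hstack([P.T, np.eye(3, dtype=int)])

assert np.all(H @ G.T % 2 == 0)  # every codeword has zero syndrome

def encode(x):
    return x @ G % 2

def decode(y):
    """Correct at most one bit flip, then read off the message bits."""
    syndrome = H @ y % 2
    if syndrome.any():                        # nonzero syndrome: locate the error
        cols = [tuple(c) for c in H.T]        # syndrome equals the column of H
        y = y.copy()                          # at the flipped position
        y[cols.index(tuple(syndrome))] ^= 1
    return y[:4]                              # message sits in the first 4 bits

x = np.array([1, 0, 1, 1])
y = encode(x)
y[5] ^= 1                  # the channel flips bit 5
assert np.all(decode(y) == x)
```

Since the syndrome of a single-bit error is exactly the corresponding column of $H$, and all columns are distinct and nonzero, the decoder always pinpoints a single flip.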
Having concluded our quick review of classical error correction, we now look at the theory of quantum error correction.
3. Quantum Error Correction
In this section we introduce quantum error correction by directly constructing the 9-qubit code and the 7-qubit code. Then we introduce the more general formalism of CSS codes, which encompasses both the 9-qubit and 7-qubit codes, before introducing the stabilizer formalism, which tells us how we might construct a CSS code.
3.1. Preliminaries
First we introduce some tools that we will need in this section.
3.1.1. Pauli Matrices
The Pauli matrices are a set of 2-by-2 matrices that form an orthogonal basis for the 2-by-2 Hermitian matrices, where a Hermitian matrix satisfies $M = M^\dagger$. Note that we can form larger Hilbert spaces by taking the tensor product of smaller Hilbert spaces, so in particular taking the $n$-fold tensor product of Pauli matrices gives us a basis for the $2^n$-by-$2^n$ Hermitian matrices. Note also that, generally, in quantum mechanics we are interested in Hermitian matrices because they can be used to represent measurements, and because unitary matrices, which can be used to represent probability-preserving quantum operations, can be obtained by exponentiating Hermitian matrices (that is, every unitary matrix can be written in the form $U = e^{iM}$ for a Hermitian matrix $M$).
The Pauli matrices are given by
$I = \begin{pmatrix}1&0\\0&1\end{pmatrix}, \quad \sigma_x = \begin{pmatrix}0&1\\1&0\end{pmatrix}, \quad \sigma_y = \begin{pmatrix}0&-i\\i&0\end{pmatrix}, \quad \sigma_z = \begin{pmatrix}1&0\\0&-1\end{pmatrix}$.
By direct computation we can show that they satisfy the relations
$\sigma_x^2 = \sigma_y^2 = \sigma_z^2 = I, \quad \sigma_x\sigma_y = i\sigma_z, \quad \sigma_y\sigma_z = i\sigma_x, \quad \sigma_z\sigma_x = i\sigma_y$,
and that distinct Pauli matrices anticommute (e.g. $\sigma_x\sigma_y = -\sigma_y\sigma_x$).
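These relations are straightforward to check numerically; a quick NumPy sanity check:

```python
import numpy as np

# Pauli matrices as numpy arrays.
I = np.eye(2)
sx = np.array([[0, 1], [1, 0]])
sy = np.array([[0, -1j], [1j, 0]])
sz = np.array([[1, 0], [0, -1]])

# Each Pauli squares to the identity...
for s in (sx, sy, sz):
    assert np.allclose(s @ s, I)

# ...the products cycle: sx sy = i sz, sy sz = i sx, sz sx = i sy...
assert np.allclose(sx @ sy, 1j * sz)
assert np.allclose(sy @ sz, 1j * sx)
assert np.allclose(sz @ sx, 1j * sy)

# ...and distinct Paulis anticommute.
assert np.allclose(sx @ sy, -(sy @ sx))
```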
3.1.2. Von Neumann Measurements
We will also need the concept of a von Neumann measurement. A von Neumann measurement is given by a set of projection matrices $\{\Pi_i\}$ satisfying
$\Pi_i\Pi_j = \delta_{ij}\Pi_i, \quad \sum_i \Pi_i = I$.
That is, the projectors partition a Hilbert space into orthogonal subspaces. Then, given any state $|\psi\rangle$, when we perform a measurement using these projectors we obtain the measurement result corresponding to $\Pi_i$, with corresponding post-measurement state
$\frac{\Pi_i|\psi\rangle}{\|\Pi_i|\psi\rangle\|}$,
with probability $\langle\psi|\Pi_i|\psi\rangle$.
3.2. First Attempt at a Quantum Code
Now we make a first attempt at coming up with a quantum code, noting that our efforts and adjustments will ultimately culminate in the 9-qubit code. Starting with the simplest possible idea, we take inspiration from the classical repetition code, which maps
$0 \mapsto 000, \quad 1 \mapsto 111$
and decodes by taking the majority of the 3 bits. We consider the quantum analog of this, which maps
$|0\rangle \mapsto |000\rangle, \quad |1\rangle \mapsto |111\rangle$.
We will take our quantum errors to be the Pauli matrices $\sigma_x$, $\sigma_y$, and $\sigma_z$. Then the encoding process, where our message is a quantum state $\alpha|0\rangle + \beta|1\rangle$, looks like the following:
$\alpha|0\rangle + \beta|1\rangle \mapsto \alpha|000\rangle + \beta|111\rangle$.
We claim that this code can correct bit errors but not phase errors, which makes it equivalent to the original classical repetition code for error correction. To see this, note that applying a $\sigma_x$ error to the first qubit results in the mapping
$\alpha|000\rangle + \beta|111\rangle \mapsto \alpha|100\rangle + \beta|011\rangle$.
This can be detected by the von Neumann measurement which projects onto the subspaces
$\mathrm{span}\{|000\rangle, |111\rangle\}, \quad \mathrm{span}\{|100\rangle, |011\rangle\}, \quad \mathrm{span}\{|010\rangle, |101\rangle\}, \quad \mathrm{span}\{|001\rangle, |110\rangle\}.$
We could then apply $\sigma_x$ to the first qubit to correct the error. To see that this doesn't work for phase errors, note that applying a $\sigma_z$ error to the first qubit results in the mapping
$\alpha|000\rangle + \beta|111\rangle \mapsto \alpha|000\rangle - \beta|111\rangle$.
This is a valid encoding of the state $\alpha|0\rangle - \beta|1\rangle$, so the error is undetectable.
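A small NumPy sketch makes this asymmetry concrete: the syndrome measurement pinpoints a bit flip, while a phase flip leaves the state inside the codespace. (The amplitudes 0.6 and 0.8 below are arbitrary choices for illustration; the first tensor factor is the most significant bit.)

```python
import numpy as np
from functools import reduce

def kron(*ops):
    return reduce(np.kron, ops)

I = np.eye(2)
X = np.array([[0., 1.], [1., 0.]])
Z = np.array([[1., 0.], [0., -1.]])

# Encode a|0> + b|1> as a|000> + b|111>.
a, b = 0.6, 0.8
psi = np.zeros(8)
psi[0b000], psi[0b111] = a, b

# Projectors onto the four syndrome subspaces: span{|000>,|111>},
# span{|100>,|011>}, span{|010>,|101>}, span{|001>,|110>}.
def proj(i, j):
    P = np.zeros((8, 8))
    P[i, i] = P[j, j] = 1.
    return P
projectors = [proj(0b000, 0b111), proj(0b100, 0b011),
              proj(0b010, 0b101), proj(0b001, 0b110)]

# A bit flip on qubit 1 moves the state into the second subspace...
err = kron(X, I, I) @ psi
probs = [err @ P @ err for P in projectors]
assert np.allclose(probs, [0, 1, 0, 0])       # syndrome is deterministic
assert np.allclose(kron(X, I, I) @ err, psi)  # flipping back recovers psi

# ...but a phase flip stays entirely inside the codespace: undetectable.
perr = kron(Z, I, I) @ psi
assert np.isclose(perr @ projectors[0] @ perr, 1)
```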
What adjustments can we make so that we're able to also correct $\sigma_z$ errors? For this we will introduce the Hadamard matrix, defined as
$H = \frac{1}{\sqrt2}\begin{pmatrix}1&1\\1&-1\end{pmatrix}$
and satisfying
$H\sigma_x H = \sigma_z, \quad H\sigma_z H = \sigma_x$.
Note in particular that, because of these relations, the Hadamard matrix turns bit errors into phase errors, and vice versa. This allows us to come up with a code that corrects phase errors by mapping
$|0\rangle \mapsto |{+}{+}{+}\rangle, \quad |1\rangle \mapsto |{-}{-}{-}\rangle$,
or equivalently,
$|{\pm}\rangle \mapsto |{\pm}{\pm}{\pm}\rangle, \quad \text{where } |{\pm}\rangle = \frac{1}{\sqrt2}(|0\rangle \pm |1\rangle)$.
Now we can concatenate our bit flip code with our phase flip code to take care of both errors. This gives us the 9-qubit code, also known as the Shor code.
3.3. 9-Qubit Code
In the previous section, we went through the process of constructing the 9-qubit Shor code by considering how to correct both bit flip errors and phase flip errors. Explicitly, the 9-qubit Shor code is given by the following mapping:
$|0\rangle \mapsto |\bar0\rangle = \frac{1}{2\sqrt2}(|000\rangle + |111\rangle)(|000\rangle + |111\rangle)(|000\rangle + |111\rangle), \quad |1\rangle \mapsto |\bar1\rangle = \frac{1}{2\sqrt2}(|000\rangle - |111\rangle)(|000\rangle - |111\rangle)(|000\rangle - |111\rangle)$.
Here $|\bar0\rangle$ and $|\bar1\rangle$ are known as logical qubits; note that our 9-qubit code essentially represents 1 logical qubit with 9 physical qubits.
Note that by construction this code can correct $\sigma_x$, $\sigma_z$, and $\sigma_y$ errors on any one qubit (we've already shown by construction that it can correct $\sigma_x$ and $\sigma_z$ errors, and $\sigma_y$ can be obtained as a product of the two). This is also equivalent to the statement that the states $|\bar0\rangle$, $|\bar1\rangle$, $\sigma_x^{(i)}|\bar0\rangle$, $\sigma_x^{(i)}|\bar1\rangle$, $\sigma_z^{(i)}|\bar0\rangle$, and $\sigma_z^{(i)}|\bar1\rangle$, where the superscript $(i)$ denotes an error acting on qubit $i$, are all orthogonal (up to the degeneracy that phase errors within the same block of three act identically).
Now we have a 1-error quantum code. We claim that such a code can in fact correct any error operation on a single qubit, and that this is a property of all 1-error quantum codes:
Theorem: Given any possible unitary, measurement, or quantum operation acting on a single qubit of a 1-error quantum code, the code can correct it.
Proof: $\{I, \sigma_x, \sigma_y, \sigma_z\}$ forms a basis for the 2-by-2 matrices, so an arbitrary error on a single qubit can be expanded in this basis. If the code can individually correct the errors $I$, $\sigma_x$, $\sigma_y$, and $\sigma_z$, then measuring the error syndrome collapses the state onto one of these correctable errors, which can then be corrected. (The same reasoning applies to errors on $n$ qubits, since the $n$-fold tensor products of Pauli matrices form a basis for the $2^n$-by-$2^n$ matrices.)
Example: Phase Error. Next we'll do an example where we consider how we might correct an arbitrary phase error applied to the state $|\bar0\rangle$. Since quantum states are equivalent up to global phases, the error operator is given by
$E = \begin{pmatrix}1 & 0\\ 0 & e^{i\theta}\end{pmatrix}$.
Note that this can be rewritten in the Pauli basis, up to a global phase, as
$E = e^{i\theta/2}\left(\cos\tfrac{\theta}{2}\, I - i\sin\tfrac{\theta}{2}\, \sigma_z\right)$.
Now, applying this error to $|\bar0\rangle$, we get
$E|\bar0\rangle = e^{i\theta/2}\left(\cos\tfrac{\theta}{2}\, |\bar0\rangle - i\sin\tfrac{\theta}{2}\, \sigma_z|\bar0\rangle\right)$.
After performing a projective measurement, we get state $|\bar0\rangle$ with probability $\cos^2\tfrac{\theta}{2}$, in which case we do not need to perform any error correction, and we get $\sigma_z|\bar0\rangle$ with probability $\sin^2\tfrac{\theta}{2}$, in which case we would know to correct the $\sigma_z$ error.
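We can sanity-check this decomposition numerically (the angle $\theta = 1.2$ is an arbitrary choice for illustration):

```python
import numpy as np

theta = 1.2  # an arbitrary phase error angle
E = np.diag([1, np.exp(1j * theta)])

# Up to the global phase e^{i theta/2},
# E = cos(theta/2) I - i sin(theta/2) sigma_z.
I = np.eye(2)
sz = np.diag([1., -1.])
reconstructed = np.exp(1j * theta / 2) * (
    np.cos(theta / 2) * I - 1j * np.sin(theta / 2) * sz)
assert np.allclose(E, reconstructed)

# The syndrome measurement therefore yields "no error" with probability
# cos^2(theta/2) and "sigma_z error" with probability sin^2(theta/2);
# the two outcomes exhaust all possibilities.
p_no_err = np.cos(theta / 2) ** 2
p_z_err = np.sin(theta / 2) ** 2
assert np.isclose(p_no_err + p_z_err, 1)
```

This is the discretization of errors in miniature: a continuous phase error collapses, under the syndrome measurement, onto one of two discrete outcomes.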
3.4. 7-Qubit Code
Now that we've constructed the 9-qubit code and shown that quantum error correction is possible, we might wonder whether it's possible to do better. For example, we'd like a code that requires fewer qubits. We'll construct a 7-qubit code that corrects 1 error, defining a mapping to $|\bar0\rangle$ and $|\bar1\rangle$ by taking inspiration from a classical code, as we did for the 9-qubit case.
For this we will need to go back to the example we used to illustrate classical error correction. Recall that in classical error correction, we have an encoding matrix $G$ and a parity check matrix $H$ satisfying $Hc^T = 0$ for every codeword $c$. We encode a message $x$ to obtain codeword $c = xG$. After error $e$ is applied, this becomes $y = c + e$, from which we can extract the error syndrome $Hy^T = He^T$. We can then apply the appropriate correction to extract $c$ (and hence $x$) from $y$.
Now we will use the encoding matrix $G$ from our classical error correction example, and we will divide our codewords into two sets, $C_0$ and $C_1$, given by
$C_0 = \{c : c \text{ is a codeword of even Hamming weight}\}$
and
$C_1 = \{c : c \text{ is a codeword of odd Hamming weight}\}$.
Similar to how we approached the 9-qubit case, we will start by defining our code as follows:
$|0\rangle \mapsto |\bar0\rangle = \frac{1}{\sqrt8}\sum_{c\in C_0}|c\rangle, \quad |1\rangle \mapsto |\bar1\rangle = \frac{1}{\sqrt8}\sum_{c\in C_1}|c\rangle$.
Note that this corrects bit flip errors by construction, since the parity check matrix $H$ lets us locate any single bit flip. How can we ensure that we are also able to correct phase errors? For this we again turn to the Hadamard matrix, which allows us to toggle between bit and phase errors. We claim that
$H^{\otimes7}|\bar0\rangle = \frac{1}{\sqrt2}(|\bar0\rangle + |\bar1\rangle), \quad H^{\otimes7}|\bar1\rangle = \frac{1}{\sqrt2}(|\bar0\rangle - |\bar1\rangle)$.
Proof: We will show that
$H^{\otimes7}|\bar0\rangle = \frac{1}{\sqrt2}(|\bar0\rangle + |\bar1\rangle)$,
noting that the argument for $|\bar1\rangle$ is similar. First we will need the fact that
$H^{\otimes n}|x\rangle = \frac{1}{2^{n/2}}\sum_{y\in\{0,1\}^n}(-1)^{x\cdot y}|y\rangle$.
To see that this fact is true, note that
$H|x_i\rangle = \frac{1}{\sqrt2}\big(|0\rangle + (-1)^{x_i}|1\rangle\big)$
and that $x\cdot y$ is equal (mod 2) to the number of bits in which $x$ and $y$ are both 1. Now we can start by directly calculating
$H^{\otimes7}|\bar0\rangle = \frac{1}{\sqrt8}\sum_{c\in C_0}\frac{1}{2^{7/2}}\sum_{y\in\{0,1\}^7}(-1)^{c\cdot y}|y\rangle = \frac{1}{32}\sum_{y}\Big(\sum_{c\in C_0}(-1)^{c\cdot y}\Big)|y\rangle$.
Note that for $c$ and $c'$ two elements of $C_0$, assuming that $c'\cdot y = 1$, we must have that $(c + c')\cdot y \neq c\cdot y$. Thus, for such a $y$, the map $c \mapsto c + c'$ breaks $C_0$ up into an equal number of codewords satisfying $c\cdot y = 0$ and $c\cdot y = 1$. This means that we must have that the sum $\sum_{c\in C_0}(-1)^{c\cdot y} = 0$ unless $c\cdot y = 0$ for all $c \in C_0$, i.e. unless $y \in C_0^\perp$. But those $y$ that satisfy this are exactly all the Hamming codewords, so we must have that
$H^{\otimes7}|\bar0\rangle = \frac{8}{32}\sum_{y\in C_0\cup C_1}|y\rangle = \frac{1}{\sqrt2}(|\bar0\rangle + |\bar1\rangle)$,
as the sum in $y$ runs equally over all codewords.
Thus we have constructed a 7-qubit quantum code that corrects 1 error. Moreover, we see that for both the 9-qubit and 7-qubit codes, both of which are 1-error quantum codes, the ability to correct 1 error comes directly from the fact that the original classical codes we used to construct them can themselves correct 1 error. This suggests that we should be able to come up with a more general procedure for constructing quantum codes from classical codes.
3.5. CSS Codes
CSS (Calderbank-Shor-Steane) codes generalize the process by which we constructed the 9-qubit and 7-qubit codes, and they give us a general framework for constructing quantum codes from classical codes. In a CSS code, we require classical linear codes (i.e. groups under addition) $C_2$, $C_1$ satisfying
$C_2 \subset C_1 \subseteq \{0,1\}^n$.
Then we can define codewords to correspond to cosets of $C_2$ in $C_1$, so that the number of codewords is equal to $|C_1|/|C_2|$. Thus by this definition we can say that codewords $x$ and $x'$ are in the same coset if $x - x' \in C_2$. Explicitly, the codeword for coset $x + C_2$ is given by the state
$|x + C_2\rangle = \frac{1}{\sqrt{|C_2|}}\sum_{w\in C_2}|x + w\rangle$,
and under the Hadamard transformation applied to each qubit this state is in turn mapped to the state
$\frac{1}{\sqrt{|C_2^\perp|}}\sum_{y\in C_2^\perp}(-1)^{x\cdot y}|y\rangle$.
That is to say, the Hadamard "dualizes" our original code, toggling bit errors to phase errors and vice versa. (This can be seen by direct calculation, as in the case of the 7-qubit code, where we used the fact that $\sum_{w\in C_2}(-1)^{w\cdot y} = 0$ for $y \notin C_2^\perp$.)
Note also that this code can correct a number of bit errors equal to $\lfloor(d-1)/2\rfloor$, where $d$ is the minimum weight of a nonzero codeword of $C_1$ (and similarly for phase errors, with $C_2^\perp$ in place of $C_1$).
With the CSS construction we have thus reduced the problem of finding a quantum error correcting code to the problem of finding appropriate $C_1$, $C_2$. Note that the special case of $C_2 = C_1^\perp$ corresponds to weakly self-dual codes ($C^\perp \subseteq C$), which are well studied classically. Doubly even, weakly self-dual codes additionally have the requirement that all codewords of $C^\perp$ have Hamming weights that are multiples of 4; they satisfy the requirement
$C^\perp \subseteq C, \quad 4 \mid \mathrm{wt}(c) \text{ for all } c \in C^\perp$,
and are also well studied classically.
3.6. Gilbert-Varshamov Bound
In the previous section we introduced CSS codes and demonstrated that the problem of constructing a quantum code could be reduced to the problem of finding two groups satisfying
$C_2 \subset C_1 \subseteq \{0,1\}^n$,
with $C_1$ and $C_2^\perp$ both having good minimum distance.
The next natural question is to ask whether such groups can in fact be found.
The Gilbert-Varshamov bound answers this question in the affirmative, ensuring that there do exist good CSS codes (the bound applies to both quantum and classical codes). It can be stated in the following way:
Theorem (Gilbert-Varshamov Bound): There exist CSS codes with rate (number of encoded bits)/(length of code) given by
$R = \frac{k}{n} \geq 1 - 2H\!\left(\frac{d}{n}\right)$,
where $d$ is the minimum distance of the code, $t = \lfloor(d-1)/2\rfloor$ is the number of errors it can correct, and $H$ is the binary Shannon entropy, defined as
$H(p) = -p\log_2 p - (1-p)\log_2(1-p)$.
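The bound is easy to evaluate numerically; a small sketch using the $1 - 2H(d/n)$ form of the rate:

```python
import numpy as np

def H(p):
    """Binary Shannon entropy in bits."""
    if p in (0, 1):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def css_gv_rate(delta):
    """Guaranteed achievable CSS rate at relative distance delta = d/n."""
    return max(1 - 2 * H(delta), 0.0)

# The guarantee stays positive as long as H(delta) < 1/2,
# i.e. up to relative distance delta ~ 0.11.
assert css_gv_rate(0.05) > 0.4
assert css_gv_rate(0.2) == 0.0
```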
Proof: Note that we can always take a code, apply a random invertible linear transformation to it, and get another code. Thus each nonzero vector is equally likely to appear in a random code. Given this fact, we can estimate the probability that a code of dimension $k$ contains a word of weight less than $d$ using the union bound:
$\Pr[\text{code of dimension } k \text{ has a word of weight} < d] \leq (\text{number of words}) \cdot \Pr[\text{word has weight} < d] \approx 2^k \cdot \frac{2^{nH(d/n)}}{2^n}$.
For this bound to be nontrivial (less than 1) we need to have
$\frac{k}{n} < 1 - H\!\left(\frac{d}{n}\right)$.
We can calculate the rate by noting that for a CSS code given by $C_1$, $C_2$, with $\dim C_1 = k_1$, $\dim C_2 = k_2$, and $C_2 = C_1^\perp$ (so that $k_2 = n - k_1$), the expression for the rate is given by
$R = \frac{k_1 - k_2}{n} = \frac{2k_1 - n}{n}$.
Combining this with the bound we obtained by considering probabilities, taking $k_1/n$ close to $1 - H(d/n)$, we get that
$R \geq 2\left(1 - H\!\left(\frac{d}{n}\right)\right) - 1 = 1 - 2H\!\left(\frac{d}{n}\right)$.
Thus there exist good CSS codes.
3.7. Stabilizer Codes
Having discussed and constructed some examples of CSS codes, we will now discuss the stabilizer formalism. Note that this formalism allows us to construct codes without having to work directly with the states representing $|\bar0\rangle$ and $|\bar1\rangle$, as this can quickly get unwieldy. Instead, we will work with stabilizers, operators that leave these states invariant.
To see how working directly with states can get unwieldy, we can consider the 5-qubit code. We can define it the way we defined the 9-qubit and 7-qubit codes, by directly defining the basis vectors $|\bar0\rangle$ and $|\bar1\rangle$:
$|\bar0\rangle = \frac14\big(|00000\rangle + |10010\rangle + |01001\rangle + |10100\rangle + |01010\rangle + |00101\rangle - |11011\rangle - |00110\rangle - |11000\rangle - |11101\rangle - |00011\rangle - |11110\rangle - |01111\rangle - |10001\rangle - |01100\rangle - |10111\rangle\big)$,
symmetric under cyclic permutations of the five qubits,
with $|\bar1\rangle$ defined similarly (by complementing every bit). But we can also define this code more succinctly using the stabilizer formalism. To do so, we start by choosing a commutative subgroup of the Pauli group, with generators $M_i$ satisfying
$M_i^2 = I, \quad M_iM_j = M_jM_i$.
For example, for the 5-qubit code, the particular choice of generators we would need is given by
$M_1 = \sigma_x\otimes\sigma_z\otimes\sigma_z\otimes\sigma_x\otimes I$
and its cyclic shifts
$M_2 = I\otimes\sigma_x\otimes\sigma_z\otimes\sigma_z\otimes\sigma_x, \quad M_3 = \sigma_x\otimes I\otimes\sigma_x\otimes\sigma_z\otimes\sigma_z, \quad M_4 = \sigma_z\otimes\sigma_x\otimes I\otimes\sigma_x\otimes\sigma_z$.
Now we consider states that are stabilized by the $M_i$. That is, they satisfy
$M_i|\psi\rangle = |\psi\rangle \quad \text{for all } i$.
Note that the eigenvalues of $\sigma_x$, $\sigma_y$, and $\sigma_z$ are $\pm1$, so in the case of the 5-qubit code, there exists a 16-dimensional space of $|\psi\rangle$ satisfying $M_1|\psi\rangle = |\psi\rangle$. Recalling that two commuting matrices are simultaneously diagonalizable, there exists an 8-dimensional space of $|\psi\rangle$ satisfying both $M_1|\psi\rangle = |\psi\rangle$ and $M_2|\psi\rangle = |\psi\rangle$, and so on, where we cut the dimension of the subspace in half each time we add a generator. Finally, there exists a 2-dimensional space of $|\psi\rangle$ satisfying $M_i|\psi\rangle = |\psi\rangle$ for all $i$. This 2-dimensional space is exactly the subspace spanned by $|\bar0\rangle$ and $|\bar1\rangle$. Thus fixing the stabilizers is enough to give us our code.
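This halving of dimensions can be verified directly by building the generators as 32-by-32 matrices. A NumPy sketch, using the generator convention $XZZXI$ and its cyclic shifts (one standard choice):

```python
import numpy as np
from functools import reduce

I = np.eye(2)
X = np.array([[0., 1.], [1., 0.]])
Z = np.array([[1., 0.], [0., -1.]])

def kron(*ops):
    return reduce(np.kron, ops)

# The four stabilizer generators of the 5-qubit code: XZZXI and its
# cyclic shifts.
gens = [kron(X, Z, Z, X, I),
        kron(I, X, Z, Z, X),
        kron(X, I, X, Z, Z),
        kron(Z, X, I, X, Z)]

# The generators square to the identity and pairwise commute.
for A in gens:
    assert np.allclose(A @ A, np.eye(32))
    for B in gens:
        assert np.allclose(A @ B, B @ A)

# Each added generator halves the joint +1 eigenspace: 32 -> 16 -> 8 -> 4 -> 2.
P, dims = np.eye(32), []
for A in gens:
    P = P @ (np.eye(32) + A) / 2   # project onto the +1 eigenspace of A
    dims.append(int(round(np.trace(P))))
assert dims == [16, 8, 4, 2]       # the code space is 2-dimensional
```

The trace of a projector is the dimension of its image, which is why the final assertion certifies a 2-dimensional code space.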
Next we will consider all elements in the Pauli group that commute with all elements in our stabilizer group $S$; this set is called the normalizer $N(S)$. As we shall see, this will give us our logical operators, where a logical operator performs an operation on a logical qubit (for example, the logical $\sigma_x$ operator, $\bar\sigma_x$, would act on the logical qubit by mapping $|\bar0\rangle \mapsto |\bar1\rangle$, $|\bar1\rangle \mapsto |\bar0\rangle$, and so on). In the 5-qubit case we end up with a nonabelian group on 6 generators by adding the following two elements to those elements that are in $S$:
$\bar\sigma_x = \sigma_x\otimes\sigma_x\otimes\sigma_x\otimes\sigma_x\otimes\sigma_x, \quad \bar\sigma_z = \sigma_z\otimes\sigma_z\otimes\sigma_z\otimes\sigma_z\otimes\sigma_z.$
These will be our logical operators,
so that
$\bar\sigma_x|\bar0\rangle = |\bar1\rangle, \quad \bar\sigma_x|\bar1\rangle = |\bar0\rangle, \quad \bar\sigma_z|\bar0\rangle = |\bar0\rangle, \quad \bar\sigma_z|\bar1\rangle = -|\bar1\rangle$.
Note that this code has distance 3 and corrects 1 error because 3 is the minimum Hamming weight (number of non-identity tensor factors) of an element of the group $N(S)\setminus S$. (To see that weight 3 is achieved, note that $\bar\sigma_x M_1$ has Hamming weight 3.)
Why is Hamming weight 2 not enough to correct one error? If we had, for example, a weight-2 element $E_1^{(1)}E_2^{(2)} \in N(S)$ acting on qubits 1 and 2, then we would have
$E_1^{(1)}|\psi\rangle = \pm E_2^{(2)\dagger}|\psi\rangle$
for $|\psi\rangle$ in the code, which means that we wouldn't be able to distinguish an $E_1$ error on qubit 1 from an $E_2^\dagger$ error on qubit 2.
Note that, in general, when $M \in S$ and $N$ commutes with every element of $S$, $N|\psi\rangle$ will be in the code, so elements of $N(S)$ map codewords to codewords. We can prove this fact by noting that
$M(N|\psi\rangle) = N(M|\psi\rangle) = N|\psi\rangle$,
so $N|\psi\rangle$ is fixed by every stabilizer $M$.
Note also that in the examples we've been dealing with so far, where we have a commuting subgroup of the Pauli group, our codes correspond to classical, additive, weakly self-dual codes over $GF(4)$. Here $GF(4) = \{0, 1, \omega, \bar\omega\}$ (with the nonzero group elements corresponding to the third roots of unity) is the finite field on 4 elements, and multiplying Pauli matrices corresponds to group addition. Specifically, we identify
$I \leftrightarrow 0, \quad \sigma_x \leftrightarrow 1, \quad \sigma_z \leftrightarrow \omega, \quad \sigma_y \leftrightarrow \bar\omega$,
satisfying
$\bar\omega = \omega^2 = \omega + 1, \quad \omega^3 = 1$.
We have now concluded our discussion of quantum error-correcting codes. In the next section we will shift gears and look at quantum channels and channel capacities.
4. Quantum Channels
In this final section we will look at quantum channels and channel capacities.
4.1. Definition and Examples
4.1.1. Definition
We know that we want to define a quantum channel to take a quantum state as input. What should the output be? As a first attempt we might imagine having the output be a probability distribution over states $|\psi_i\rangle$. It turns out that for a more succinct description, we can have both the input and output be a density matrix.
Recall that a density matrix takes the form
$\rho = \sum_i p_i |\psi_i\rangle\langle\psi_i|$,
representing a probability distribution over pure states $|\psi_i\rangle$. $\rho$ must also be Hermitian and positive semidefinite, and it must satisfy $\mathrm{Tr}\,\rho = 1$ (equivalently, we must have $\sum_i p_i = 1$).
Now we may define a quantum channel as the map that takes
$\rho \mapsto \sum_i A_i \rho A_i^\dagger$,
where the $A_i$ (known as Kraus operators) satisfy
$\sum_i A_i^\dagger A_i = I$.
To see that the output is in fact a density matrix, note that the output expression is clearly Hermitian and can be shown to have unit trace using the cyclic property of the trace: $\mathrm{Tr}\sum_i A_i\rho A_i^\dagger = \mathrm{Tr}\big(\sum_i A_i^\dagger A_i\,\rho\big) = \mathrm{Tr}\,\rho = 1$. Note also that the decomposition into Kraus operators $A_i$ need not be unique.
4.1.2. Example Quantum Channels
Next we give a few examples of quantum channels. The dephasing channel is given by the map
$\rho \mapsto (1-p)\rho + p\,\sigma_z\rho\sigma_z$.
It maps
$\begin{pmatrix} a & b\\ c & d\end{pmatrix} \mapsto \begin{pmatrix} a & (1-2p)b\\ (1-2p)c & d\end{pmatrix}$,
so it multiplies off-diagonal elements by a factor that is less than 1. Note that when $p = \frac12$, it maps
$\begin{pmatrix} a & b\\ c & d\end{pmatrix} \mapsto \begin{pmatrix} a & 0\\ 0 & d\end{pmatrix}$,
which means that it turns superpositions into classical mixtures (hence the name "dephasing").
Another example is the amplitude damping channel, which models an excited state decaying to a ground state. It is given by the Kraus operators
$A_0 = \begin{pmatrix}1 & 0\\ 0 & \sqrt{1-p}\end{pmatrix}, \quad A_1 = \begin{pmatrix}0 & \sqrt p\\ 0 & 0\end{pmatrix}.$
Here we let the vector $|0\rangle$ denote the ground state, and we let the vector $|1\rangle$ denote the excited state. Thus we can see that the channel maps the ground state to itself, $|0\rangle \mapsto |0\rangle$, while the excited state gets mapped to $|0\rangle$ with probability $p$ and stays at $|1\rangle$ with probability $1-p$.
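Both channels are small enough to simulate directly. Below is a NumPy sketch; the dephasing map here (apply $\sigma_z$ with probability $p$) is one standard Kraus presentation, and it reproduces the $(1-2p)$ damping of off-diagonal elements:

```python
import numpy as np

def apply_channel(kraus, rho):
    """Apply a channel given by Kraus operators: rho -> sum_i A_i rho A_i^+."""
    return sum(A @ rho @ A.conj().T for A in kraus)

# Dephasing channel: keep rho with probability 1-p, apply sigma_z with
# probability p.
def dephasing(p):
    sz = np.diag([1., -1.])
    return [np.sqrt(1 - p) * np.eye(2), np.sqrt(p) * sz]

# Amplitude damping channel: |1> decays to |0> with probability p.
def amp_damp(p):
    A0 = np.array([[1., 0.], [0., np.sqrt(1 - p)]])
    A1 = np.array([[0., np.sqrt(p)], [0., 0.]])
    return [A0, A1]

rho = np.array([[0.5, 0.5], [0.5, 0.5]])  # the pure state |+><+|

for kraus in (dephasing(0.3), amp_damp(0.3)):
    # Kraus operators satisfy sum A_i^+ A_i = I, so trace is preserved.
    assert np.allclose(sum(A.conj().T @ A for A in kraus), np.eye(2))
    assert np.isclose(np.trace(apply_channel(kraus, rho)), 1.0)

# At p = 1/2 the dephasing channel kills the off-diagonal terms entirely,
# turning the superposition |+> into the classical mixture I/2.
out = apply_channel(dephasing(0.5), rho)
assert np.allclose(out, np.eye(2) / 2)
```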
4.2. Quantum Channel Capacities
Now we consider the capacity of quantum channels, where the capacity quantifies how much information can make it through the channel. We consider classical channels, classical information sent over quantum channels, and quantum information sent over quantum channels. First we start off with the example of the quantum erasure channel to demonstrate that quantum channels behave differently from classical channels, and then we give the actual expressions for the channel capacities before revisiting the example of the quantum erasure channel.
4.2.1. Example: Quantum Erasure Channel
First we start with the example of the quantum erasure channel, which, given a state, replaces it by an orthogonal erasure state $|e\rangle$ with probability $p$ and returns the same state with probability $1-p$. We claim that the erasure channel can't transmit quantum information when $p \geq \frac12$, behavior that is markedly different from that of classical information. That is to say, for $p \geq \frac12$, there is no way to encode quantum information to send it through the channel and then decode it so the receiver gets a state close to the state that was sent.
To see why this is the case, assume the contrary: that there do exist encoding and decoding protocols that send quantum information through quantum erasure channels with erasure rate $p = \frac12$. We will show that this violates the no-cloning theorem. Suppose that the sender does the following: for each qubit in the encoded state, she tosses a fair coin. If the coin lands heads, she sends the channel input to receiver $B_1$ and the erasure state $|e\rangle$ to receiver $B_2$; if the coin lands tails, she sends the channel input to $B_2$ and the erasure state to $B_1$. This implements an erasure channel with $p = \frac12$ to both receivers $B_1$ and $B_2$, which means she can use this channel to transmit an encoding of $|\psi\rangle$ to both receivers, which in turn means that both receivers will be able to decode $|\psi\rangle$. But this means that she has just used this channel to clone the quantum state $|\psi\rangle$, resulting in a contradiction. Thus no quantum information can be transmitted through a channel with $p \geq \frac12$. Note, however, that we can send classical information over this channel, so the behavior of quantum and classical information is markedly different.
It turns out that the rate of quantum information sent over the erasure channel, as a function of $p$, is $\max(1-2p,\,0)$: a line dropping from 1 at $p = 0$ to 0 at $p = \frac12$, and zero thereafter. The rate of classical information sent over the erasure channel, by contrast, is $1-p$: a line dropping from 1 at $p = 0$ to 0 at $p = 1$.
Next we will formally state the definition of channel capacity, and then we will return to the quantum erasure channel example and derive the curve that plots rate against $p$.
4.2.2. Definition of Channel Capacities
Channel capacity is defined as the maximum rate at which information can be communicated over many independent uses of a channel from sender to receiver. Here we list the expressions for channel capacity for classical channels, classical information over a quantum channel, and quantum information over a quantum channel.
Classical Channel Capacity. For a classical channel this expression is just the maximum mutual information over all input distributions,
$C = \max_{p(X)} I(X; Y)$,
where $X$ is the input information and $Y$ is the output information after having gone through the channel.
Classical Information Over a Quantum Channel. The capacity for classical information sent over a quantum channel is given by
$C_1 = \max_{\{p_i, \rho_i\}}\left[S\big(\mathcal N(\rho)\big) - \sum_i p_i S\big(\mathcal N(\rho_i)\big)\right]$
up to regularization, where $\rho = \sum_i p_i\rho_i$ is the average input state, $\mathcal N$ is the channel, and $S$ is the von Neumann entropy.
Note that we would regularize this by using $n$ copies of the state (that is to say, we want the output of $\mathcal N^{\otimes n}$) and then dividing by $n$, to get an expression like the following for the regularized capacity of classical information over a quantum channel:
$C = \lim_{n\to\infty}\frac1n \max_{\{p_i, \rho_i\}}\left[S\big(\mathcal N^{\otimes n}(\rho)\big) - \sum_i p_i S\big(\mathcal N^{\otimes n}(\rho_i)\big)\right]$.
Quantum Information. The capacity for quantum information is given by the expression
$Q_1 = \max_\rho\left[S\big(\mathcal N(\rho)\big) - S\big((\mathcal N\otimes I)(|\phi\rangle\langle\phi|)\big)\right]$,
also up to regularization. Here $\mathcal N(\rho)$ is the output when channel $\mathcal N$ acts on input state $\rho$, while $|\phi\rangle$ is the purification of $\rho$ (that is, it is a pure state containing $\rho$ as a subsystem, which we can obtain by enlarging the Hilbert space). The regularized capacity for quantum information looks like the following:
$Q = \lim_{n\to\infty}\frac1n \max_\rho\left[S\big(\mathcal N^{\otimes n}(\rho)\big) - S\big((\mathcal N^{\otimes n}\otimes I)(|\phi\rangle\langle\phi|)\big)\right]$.
Now that we have the exact expression that allows us to calculate the quantum channel capacity, we will revisit our example of the quantum erasure channel and reproduce the plot of channel rate vs erasure probability.
4.2.3. Example Revisited: Quantum Erasure Channel
Recall that, up to regularization, the capacity of a quantum channel is given by
$Q_1 = \max_\rho\left[S\big(\mathcal N(\rho)\big) - S\big((\mathcal N\otimes I)(|\phi\rangle\langle\phi|)\big)\right]$.
We will directly calculate this expression for the example of the quantum erasure channel. Let the input be given by the density matrix for the completely mixed state,
$\rho = \frac12|0\rangle\langle0| + \frac12|1\rangle\langle1| = \frac{I}{2}$,
so that the purification of $\rho$ is given by the state
$|\phi\rangle = \frac{1}{\sqrt2}(|00\rangle + |11\rangle)$.
Recall that the erasure channel replaces our state with $|e\rangle$ with probability $p$, while with probability $1-p$ it leaves the input state unchanged. Then, in the basis $\{|0\rangle, |1\rangle, |e\rangle\}$, the matrix corresponding to $\mathcal N(\rho)$ is given by
$\mathcal N(\rho) = \mathrm{diag}\left(\frac{1-p}{2},\ \frac{1-p}{2},\ p\right)$,
while the matrix corresponding to $(\mathcal N\otimes I)(|\phi\rangle\langle\phi|)$ is given by
$(\mathcal N\otimes I)(|\phi\rangle\langle\phi|) = (1-p)\,|\phi\rangle\langle\phi| + p\,|e\rangle\langle e|\otimes\frac{I}{2}$,
with eigenvalues $1-p$, $\frac p2$, and $\frac p2$. We can directly calculate that
$S\big(\mathcal N(\rho)\big) = (1-p) + H(p), \quad S\big((\mathcal N\otimes I)(|\phi\rangle\langle\phi|)\big) = p + H(p)$,
where $H$ is the binary entropy. Then, subtracting the two entropies, we can calculate the rate to be
$\big[(1-p) + H(p)\big] - \big[p + H(p)\big] = 1 - 2p$,
which corresponds exactly to the line we saw earlier when plotting rate as a function of $p$ for the quantum erasure channel.
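The whole calculation fits in a few lines of Python; a sketch that checks the $1 - 2p$ formula numerically against the eigenvalue lists above:

```python
import numpy as np

def entropy(eigs):
    """Von Neumann entropy (in bits) from a list of eigenvalues."""
    eigs = [x for x in eigs if x > 0]
    return -sum(x * np.log2(x) for x in eigs)

def erasure_rate(p):
    # Channel output N(rho) for the maximally mixed input, in the basis
    # {|0>, |1>, |e>}: diag((1-p)/2, (1-p)/2, p).
    s_out = entropy([(1 - p) / 2, (1 - p) / 2, p])
    # Joint output (N x I)|phi><phi| has eigenvalues 1-p (the surviving
    # entangled pure state) and p/2, p/2 (the erasure tensored with I/2).
    s_joint = entropy([1 - p, p / 2, p / 2])
    return s_out - s_joint

# The coherent information is exactly 1 - 2p across the range.
for p in np.linspace(0.01, 0.49, 25):
    assert np.isclose(erasure_rate(p), 1 - 2 * p)
```

At $p = \frac12$ the two entropies coincide and the rate hits zero, matching the no-cloning argument from earlier in the section.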
References
 Bennett, C. H., DiVincenzo, D. P., and Smolin, J. A. Capacities of quantum erasure channels. Phys. Rev. Lett., 78:3217-3220 (1997). quant-ph/9701015.
 Bennett, C. H., DiVincenzo, D. P., Smolin, J. A., and Wootters, W. K. Mixed state entanglement and quantum error correction. Phys. Rev. A, 54:3824 (1996). quant-ph/9604024.
 Calderbank, A. R. and Shor, P. W. Good quantum error-correcting codes exist. Phys. Rev. A, 54:1098 (1996). quant-ph/9512032.
 Devetak, I. The Private Classical Capacity and Quantum Capacity of a Quantum Channel. IEEE Trans. Inf. Theor., 51:4445 (2005). quant-ph/0304127.
 Devetak, I. and Winter, A. Classical data compression with quantum side information. Phys. Rev. A, 68(4):042301 (2003).
 Gottesman, D. Class of quantum error-correcting codes saturating the quantum Hamming bound. Phys. Rev. A, 54:1862 (1996).
 Laflamme, R., Miquel, C., Paz, J.P., and Zurek, W. H. Perfect quantum error correction code. Phys. Rev. Lett., 77:198 (1996). quant-ph/9602019.
 Lloyd, S. Capacity of the noisy quantum channel. Phys. Rev. A, 55:3 (1997). quant-ph/9604015.
 Nielsen, M. A. and Chuang, I. L. Quantum Computation and Quantum Information. Cambridge University Press, New York (2011).
 Shor, P. W. Scheme for reducing decoherence in quantum computer memory. Phys. Rev. A, 52:2493 (1995).
 Shor, P. W. The quantum channel capacity and coherent information. MSRI Workshop on Quantum Computation (2002).
 Steane, A. M. Error correcting codes in quantum theory. Phys. Rev. Lett., 77:793 (1996).
 Steane, A. M. Multiple particle interference and quantum error correction. Proc. R. Soc. London A, 452:2551-76 (1996).
Highlights beyond EC: Call for nominations
[Guest post by Moshe Babaioff –Boaz]
“Highlights Beyond EC” Session at EC 2019: Call for Nominations
Committee: Mohammad Akbarpour, Moshe Babaioff, Shengwu Li and Ariel Procaccia
Following a new tradition started last year, the 2019 ACM Conference on Economics and Computation (EC’19) will host a special session highlighting some of the best work in economics and computation that appears in conferences and journals other than EC. The intention of this session is to expose EC attendees to related work just beyond the boundary of their current awareness.
We seek nominations for papers on Economics and Computation that have made breakthrough advances, opened up new questions or areas, made unexpected connections, or had significant impact on practice or other sciences. Examples of conferences and journals that publish papers relevant for the special session include STOC/FOCS/SODA/ITCS, AAAI/IJCAI/AAMAS, NIPS/ICML/COLT, WWW/KDD, AER/Econometrica/JPE/QJE/RESTUD/TE/AEJ Micro/JET/GEB, and Math of OR/Management Science/Operations Research. Please email nominations to HighlightsBeyondEC@gmail.com. Anyone is welcome to contact us, but we especially invite members of PCs or editorial boards in various venues to send us suggestions. Nominations should include:
 Name of paper and authors.
 Publication venue or online working version. Preference will be given to papers that have appeared in a related conference or journal within the past two years, or have a working version circulated within the past two years.
 Short (2-3 paragraph) explanation of the paper and its importance.
 (Optional) Names of 1-3 knowledgeable experts on the area of the paper.
Note that at least one of the authors of a selected paper will be required to present their paper at EC 2019 and so should be available to travel to the conference, which is taking place in Phoenix, AZ on June 25-27, 2019.
To ensure maximum consideration, please send all nominations by December 15, 2018.