Black hole paradoxes: A conservative yet radical journey

Guest post by Abhishek Anand and Noah Miller from the physics and computation seminar.

In 2013, Harlow and Hayden drew an unexpected connection between theoretical computer science and theoretical physics as they proposed a potential resolution to the famous black hole Firewall paradox using computational complexity arguments. This blog post attempts to lay out the Firewall paradox and other peculiar (at first) properties associated with black holes that make them such intriguing objects to study. This post is inspired by Scott Aaronson’s [1] and Daniel Harlow’s [2] excellent notes on the same topic. The notes accompanying this post provides a thorough and self-contained introduction to theoretical physics from a CS perspective. Furthermore, for a quick and intuitive summary of the Firewall paradox and it’s link to computational complexity, refer to this blog post by Professor Barak last summer.

Black holes and conservative radicalism

Black holes are fascinating objects. Very briefly, they are regions of spacetime where the matter-energy density is so high and hence, where the gravitational effects are so strong that no particle (not even light!) can escape from it. More specifically, we define a particular distance called the “Schwarzschild radius” and anything that enters within the Schwarzschild radius, (also known as the “event horizon,”) cannot ever escape from the black hole. General relativity predicts that this particle is bound to hit the “singularity,” where spacetime curvature becomes infinite. In the truest sense of the word, they represent the “edge cases” of our Universe. Hence, perhaps, it is fitting that physicists believe that through thought experiments at these edges cases, they can investigate the true behavior of the laws that govern our Universe.

Once you know that such an object exists, many questions arise: what would it look it from the outside? Could we already be within the event horizon of a future black hole? How much information does it store? Would something special be happening at the Schwarzschild radius? How would the singularity manifest physically?

The journey of trying to answer these questions can aptly be described by the term “radical conservatism.” This is a phrase that has become quite popular in the physics community. A “radical conservative” would be someone that tries to modify as few laws of physics as possible (that’s the conservative part) and through their dogmatic refusal to modify these laws and go wherever their reasoning leads (that’s the radical part) is able to derive amazing things. We radically use the given system of beliefs to lead to certain conclusions (sometimes paradoxes!) and then conservatively update the system of beliefs to resolve the created paradox and iterate. We shall go through a few such cycles and end at the Firewall paradox. Let’s begin with the first problem: how much information does a black hole store?

Entropy of a black hole

A black hole is a physical object. Hence, it could be able to store some information. But how much? In other words, what should the entropy of a black hole be? There are two simple ways of looking at this problem:

  • 0: The no-hair theorem postulates that an outside observer can measure a small number of quantities which completely characterize the black hole. There’s the mass of the black hole, which is its most important quantity. Interestingly, if the star was spinning before it collapsed, the black hole will also have some angular momentum, and its equator will bulge out a bit. Hence, the black hole is also characterized by an angular momentum vector. Also, if the object had some net charge, the black hole would also have that net charge. This means that if two black holes were created due to a book and a pizza, respectively, with the same mass, charge and angular momentum, there would settle down to the “same” black hole with no observable difference. If an outside observer knows these quantities, they will now know everything about the black hole. So, in this view, we should expect for the entropy of a black hole to be 0.
  • Unbounded: But maybe that’s not entirely fair. After all, the contents of the star should somehow be contained in the singularity, hidden behind the horizon. As we saw above, all of the specific details of the star from before the collapse do not have any effect on the properties of the resulting black hole. The only stuff that matters it the total mass, total angular momentum, and the total charge. That leaves an infinite number of possible objects that could all have produced the same black hole: a pizza or a book or a PlayStation and so on. So actually, perhaps, we should expect the entropy of a black hole to be unbounded.

The first answer troubled Jacob Bekenstein. He was a firm believer in the Second Law of Thermodynamics: the total entropy of an isolated system can never decrease over time. However, if the entropy of a black hole is 0, it provides with a way to reduce the entropy of any system: just dump objects with non-zero entropy into the black hole.

Bekenstein drew connections between the area of the black hole and its entropy. For example, the way in which a black hole’s area could only increase (according to classical general relativity) seemed reminiscent of entropy. Moreover, when two black holes merge, the area of the final black hole will always exceed the sum of the areas of the two original black holes This is surprising as for two spheres, the area/radius of the merged sphere, is always less than the sum of the areas/radii of two individual spheres:

(r_1^3 + r_2^3)^{\frac{1}{3}} < r_1 + r_2

Most things we’re used to, like a box of gas, have an entropy that scales linearly with its volume. However, black holes are not like most things. He predicted that entropy of a black hole should be proportional to its area, A and not its volume. We now believe that Bekenstein was right and it turns out that the entropy of the black hole can be written as:


where k is Boltzmann constant and l_p is the Planck-length, a length scale where physicists believe quantum mechanics breaks down and a quantum theory of gravity will be required. Interestingly, it seems as though the entropy of the black hole is (one-fourth times) the number of Planck-length-sized squares it would take to tile the horizon area. (Perhaps, the microstates of the black hole are “stored” on the horizon?) Using “natural units” where we set all constants to 1, we can write this as


which is very pretty. Even though this number of not infinite, it is very large. Here are some numerical estimates from [2]. The entropy of the universe (minus all the black holes) mostly comes from cosmic microwave background radiation and is about 10^{87} in some units. Meanwhile, in the same units, the entropy of a solar mass black hole is 10^{78}. The entropy of our sun, as it is now, is a much smaller 10^{60}. The entropy of the supermassive black hole in the center of our galaxy is 10^{88}, larger than the rest of the universe combined (minus black holes). The entropy of any of the largest known supermassive black holes would be 10^{96}. Hence, there is a simple “argument” which suggests that black holes are the most efficient information storage devices in the universe: if you wanted to store a lot of information in a region smaller than a black hole horizon, it would probably have to be so dense that it would just be a black hole anyway.

However, this resolution to “maintain” the second law of thermodynamics leads to a radical conclusion: if a black hole has non-zero entropy, it must have a non-zero temperature and hence, must emit thermal radiation. This troubled Hawking.

Hawking radiation and black hole evaporation

Hawking did a semi-classical computation looking at energy fluctuations near the horizon and actually found that black holes do radiate! They emit energy in the form of very low-energy particles. This is a unique feature of what happens to black holes when you take quantum field theory into account and is very surprising. However, the Hawking radiation from any actually existing black hole is far too weak to have been detected experimentally.

One simplified way to understand the Hawking radiation is by thinking about highly coupled modes (think “particles”) being formed continuously near the horizon. As this formation must conserve the conservation of energy, one of these particles has negative energy and one of the particles has the same energy but with a positive sign and hence, they are maximally entangled (if you know the energy of one of the particles, you know the energy of the other one): we will be referring to this as short-range entanglement. The one with negative energy falls into the black hole while the one with positive energy comes out as Hawking radiation. The maximally-entangled state of the modes looks like:

\sum_{\mathbf{k}} f(\mathbf{k}) |\mathbf{k}\rangle_{\rm in} |\mathbf{k}\rangle_{\rm out}

Here is a cartoon that represents the process:

Because energetic particles are leaving the black hole and negative energy particles are adding to it, the black hole itself will actually shrink, which would never happen classically! And, eventually a black-hole will disappear. In fact, the time of evaporation of the black hole scales polynomially in the radius of the black hole, as R^3. The black holes that we know about are simply too big and would be shrinking too slowly. A stellar-mass black hole would take 10^{67} years to disappear from Hawking radiation.

However, the fact that black holes disappear does not play nicely with another core belief in physics: reversibility.

Unitary evolution and thermal radiation

A core tenet of quantum mechanics is unitary evolution: every operation that happens to a quantum state must be reversible (invertible). That is: if we know the final state and the set and order of operations performed, we should be able to invert the operations and get back the initial state. No information is lost. However, something weird happens with an evaporating black hole. First, let us quickly review pure and mixed quantum states. A pure state is a quantum state that can be described by a single ket vector while a mixed state represents a classical (probabilistic) mixture of pure states and can be expressed using density matrices. For example, in both, the pure state |\psi\langle = |1\rangle + |0\rangle and mixed state \rho = |1\rangle\langle1| +  |0\rangle\langle0| would one measure 1 half the time and 0 50% half the time. However, in the later one would not observe any quantum effects (think interference patterns of the double-slit experiment).

People outside of the black hole will not be able to measure the objects (quantum degrees of freedom) that are inside the black hole. They will only be able to perform measurements on a subset of the information: the one available outside of the event horizon. So, the state they would measure would be a mixed state. A simple example to explain what this means is that if the state of the particles near the horizon is:

|\Psi\rangle_{init} = \frac{|0\rangle_A|0\rangle_B +|1\rangle_A|1\rangle_B} {\sqrt{2}}

tracing over the qubit A leaves us with the state and density matrix:

|\Psi\rangle_{obs} = \frac{|0\rangle_{B}\langle0| + |1\rangle_{B}\langle1|}{2},

\rho_{obs} = \frac{1}{2} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}

which is a classical mixed state (50% of times results in 1 and 50% of times results in 0). The non-diagonal entries of the density matrix encode the “quantum inference” of the quantum state. Here, are they are, in some sense we have lost the “quantum” aspect of the information.

In fact, Hawking went and traced over the field degrees of freedom that were hidden behind the event horizon, and found something surprising: the mixed state was thermal! It acted “as if” it is being emitted by some object with temperature “T” which does not depend on what formed the black hole and solely depends on the mass of the black hole. Now, we have the information paradox:

  • Physics perspective: Now, once the black hole evaporates, we are left with this mixed thermal there is no way to precisely reconstruct the initial state that formed the black hole: the black hole has taken away information! Once the black hole is gone, the information of what went into the black hole is gone for good. Nobody living in the post-black-hole universe could figure out exactly what went into the black hole, even if they had full knowledge of the radiation. Another way to derive a contradiction is that the process of black hole evaporation when combined with the disappearance of the black hole, imply that a pure state has evolved into a mixed state, something which is impossible via unitary time evolution! Pure states only become mixed states whenever we decide to perform a partial trace; they never become mixed because of Schrodinger’s equation which governs the evolution of quantum states.
  • CS perspective: We live in a world where only invertible functions are allowed. However, we are given this exotic function – the black hole – which seems to be a genuine random one-to-many function. There is no way to determine the input deterministically given the output of the function.

What gives? If the process of black hole evaporation is truly “non-unitary,” it would be a first for physics. We have no way to make sense of quantum mechanics without the assumption of unitary operations and reversibility; hence, it does not seem very conservative to get ride of it.

Physicists don’t know exactly how information is conserved, but they think that if they assume that it does, it will help them figure out something about quantum gravity. Most physicists believe that the process of black hole evaporation should indeed be unitary. The information of what went into the black hole is being released via the radiation in way too subtle for us to currently understand. What does this mean?

  • Physics perspective: Somehow, after the black hole is gone, the final state we observe, after tracing over the degrees of freedom taken away by the black hole, is pure and encodes all information about what went inside the black hole. That is: Tr_{inside} (|\Psi\rangle_{init}\langle\Psi|) = pure
  • CS perspective: Somehow, the exotic black hole function seems random but actually is pseudo-random as well as injective and given the output and enough time, we can decode it and determine the input (think hash functions!).

However, this causes yet another unwanted consequence: the violation of the no-cloning theorem!

Xeroxing problem and black hole complementarity

The no-cloning theorem simply states that an arbitrary quantum state cannot be copied. In other words, if you have one qubit representing some initial state, no matter what operations you do, you cannot end up with two qubits with the same state you started with. How do our assumptions violate this?

Say you are outside the black hole and send in a qubit with some information (input to the function). You collect the radiation corresponding to the qubit (output of the function) that came out. Now you decode this radiation (output) to determine the state of infalling matter (input). Aha! You have violated the no-cloning theorem as you have two copies of the same state: one inside and one outside the black hole.

So wait, again, what gives?

One possible resolution is to postulate that the inside of the black hole just does not exist. However, that doesn’t seem very conservative. According to Einstein’s theory of relativity, locally speaking, there is nothing particularly special about the horizon: hence, one should be able to cross the horizon and move towards the singularity peacefully.

The crucial observation is that for the person who jumped into the black hole, the outside universe may as well not exist; they can not escape. Extending this further, perhaps, somebody on the outside does not believe the interior of the black hole exists and somebody on the inside does not believe the exterior exists and they are both right. This hypothesis, formulated in the early 1990s, has been given the name of Black Hole Complementarity. The word “complementarity” comes from the fact that two observers give different yet complementary views of the world.

In this view, according to someone on the outside, instead of entering the black hole at some finite time, the infalling observer will instead be stopped at some region very close to the horizon, which is quite hot when you get up close. Then, the Hawking radiation coming off of the horizon will hit the observer on its way out, carrying the information about them which has been plastered on the horizon. So the outside observer, who is free to collect this radiation, should be able to reconstruct all the information about the person who went in. Of course, that person will have burned up near the horizon and will be dead.

And from the infalling observer’s perspective, however, they were able to pass peacefully through the black hole and sail on to the singularity. So from their perspective, they live, while from the outside it looks like they died. However, no contradiction can be reached, because nobody has access to both realities.

But why is that? Couldn’t the outside observer see the infalling observer die and then rocket themselves straight into the black hole themselves to meet the alive person once again before they hit the singularity, thus producing a contradiction?

The core idea is that it must take some time for the infalling observer to “thermalize” (equilibriate) on the horizon: enough time for the infalling observer to reach the singularity and hence become completely inaccessible. Calculations do show this to be true. In fact, we can already sense a taste of complexity theory even in this argument: we are assuming that some process is slower than some other process.

In summary, according to the BHC worldview, the information outside the horizon is redundant with the information inside the horizon.

But, in 2012, a new paradox, the Firewall paradox, was introduced by AMPS [3]. This paradox seems to be immune to BHC: the paradox exists even if we assume everything we have discussed till now. The physics principle we violate, in this case, is the monogamy of entanglement.

Monogamy of entanglement and Page time

Before we state the Firewall paradox, we must introduce two key concepts.

Monogamy of entanglement

Monogamy of entanglement is a statement about the maximum entanglement a particle can share with other particles. More precisely, if two particles A and B are maximally entangled with each other, they cannot be at all entanglement with a third particle C. Two maximally entangled particles have saturated both of their “entanglement quotas\”. In order for them to have correlations with other particles, they must decrease their entanglement with each other.

Monogamy of entanglement can be understood as a static version of the no-cloning theorem. Here is a short proof sketch of why polygamy of entanglement implies the violation of no-cloning theorem.

Let’s take a short detour to explain quantum teleportation:

Say you have three particles A, B, and C with A and B maximally entangled (Bell pair), and C is an arbitrary quantum state:

|\Phi^+\rangle_{AB} = \frac{1}{\sqrt{2}} (|0\rangle_A \otimes |0\rangle_{B} + |1\rangle_A \otimes |1\rangle_{B})

|\psi\rangle_C = \alpha |0\rangle_C + \beta|1\rangle_C

We can write their total state as:

|\psi \rangle_{C}\otimes |\Phi ^{+}\rangle_{AB}=(\alpha |0\rangle_{C}+\beta |1\rangle_{C})\otimes {\frac {1}{\sqrt {2}}}(|0\rangle_{A}\otimes |0\rangle_{B}+|1\rangle_{A}\otimes |1\rangle_{B})

Re-arranging and pairing A and C, the state simplifies to:

|\psi \rangle {C}\otimes \ |\Phi ^{+}\rangle {AB}\ = \frac {1}{2}{\Big \lbrack }\ |\Phi ^{+}\rangle {AC}\otimes (\alpha |0\rangle {B}+\beta |1\rangle {B})\ +\ |\Phi ^{-}\rangle {AC}\otimes (\alpha |0\rangle {B}-\beta |1\rangle {B})\ +\ |\Psi ^{+}\rangle {AC}\otimes (\beta |0\rangle {B}+\alpha |1\rangle {B})\ +\ |\Psi ^{-}\rangle {AC}\otimes (\beta |0\rangle {B}-\alpha |1\rangle {B}){\Big \rbrack }

which means that if one does a Bell pair measurement on A and C, based on the measurement outcome, we know exactly which state B is projected to and by using rotations can make the state of B equal to the initial state of C. Hence, we teleported quantum information from C to B.

Now, assume that A was maximally entangled to both B and D. Then by doing the same procedure, we could teleport quantum information from C to both B and D and hence, violate the no-cloning theorem!

Page time

Named after Don Page, the “Page time” refers to the time when the black hole has emitted enough of its energy in the form of Hawking radiation that its entropy has (approximately) halved. Now the question is, what’s so special about the Page time?

First note that the rank of the density matrix is closely related to its purity (or mixedness). For example, a completely mixed state is the diagonal matrix:

\rho_{obs} = \frac{1}{4} \begin{pmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}

which has maximal rank (2^{2}). Furthermore, a completely pure state |\Psi\rangle \langle\Psi| can always be represented as (if we just change the basis and make the first column/row represent |\Psi\rangle):

\rho_{obs} =  \begin{pmatrix} 1 & 0 & 0 & 0\\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0  \end{pmatrix}

which has rank 1.

Imagine we have watched a black hole form and begin emitting Hawking radiation. Say we start collecting this radiation. The density matrix of the radiation will have the form:

\rho_{obs} = \sum_{i=1}^{2^{n-k}} p_i |\Psi\rangle_{i}\langle\Psi|

where n is the total number of qubits in our initial state, k is the number of qubits outside (in form of radiation), and p_i is the probability of each state. We are simply tracing over the degrees of freedom inside the black hole (as there are n-k degrees inside the black hole, dimensionality of this space is 2^{n-k}).

Don Page proposed the following graph of what he thought entanglement entropy of this density matrix should look like. It is fittingly called the “Page curve.”

The Page Curve
  • If k<\frac{n}{2}, rank(\rho) = 2^{k}, as there are enough terms in the sum to get a maximally ranked matrix. And hence, we get maximally mixed states. In the beginning, the radiation we collect at early times will still remain heavily entangled with the degrees of freedom near the black hole, and as such the state will look mixed to us because we can not yet observe all the complicated entanglement. As more and more information leaves the black hole in the form of Hawking radiation, we are “tracing out” fewer and fewer of the near-horizon degrees of freedom. The dimension of our density matrix grows bigger and bigger, and because the outgoing radiation is still so entangled with the near-horizon degrees of freedom, the density matrix will still have off-diagonal terms which are essentially zero. Hence, the state entropy increases linearly.
  • But if k>\frac{n}{2}, by the same argument, rank(\rho) = 2^{n-k} < 2^{k}. Hence, the density matrix becomes more and more pure. Once the black hole’s entropy has reduced by half, the dimension of the Hilbert space we are tracing out finally becomes smaller than the dimension of the Hilbert space we are not tracing out. The off-diagonal terms spring into our density matrix, growing in size and number as the black hole continues to shrink. Finally, once the black hole is gone, we can easily see that all the resulting radiation is in a pure state.

The entanglement entropy of the outgoing radiation finally starts decreasing, as we are finally able to start seeing entanglements between all this seemingly random radiation we have painstakingly collected. Some people like to say that if one could calculate the Page curve from first principles, the information paradox would be solved. Now we are ready to state the firewall paradox.

The Firewall Paradox

Say Alice collects all the Hawking radiation coming out of a black hole. At maybe, about 1.5 times the Page time, Alice is now able to see significant entanglement in all the radiation she has collected. Alice then dives into the black hole and sees an outgoing Hawking mode escaping. Given the Page curve, we know that knowing this outgoing mode must decrease the entropy of our observed mixed state. In other words, it must make our observed density matrix purer. And hence, be entangled with the particles we have already collected.

(Another way to think about this: let’s say that a random quantum circuit at the horizon scrambles the information in a non-trivial yet injective way in order for radiation particles to encode the information regarding what went inside the black hole. The output qubits of the circuit must be highly entangled due to the random circuit.)

However, given our discussion on Hawking radiation about short-range entanglement, the outgoing mode must be maximally entangled with an in-falling partner mode. This contradicts monogamy of entanglement! The outgoing mode cannot be entangled both with the radiation Alice has already collected and also maximally entangled with the nearby infalling mode!

So, to summarize, what did we do? We started with the existence of black holes and through our game of conservative radicalism, modified how physics works around them in order to make sure the following dear Physics principles are not violated by these special objects:

  • Second Law of Thermodynamics
  • Objects with entropy emit thermal radiation
  • Unitary evolution and reversibility
  • No-cloning theorem
  • Monogamy of entanglement

And finally, ended with the Firewall paradox.

So, for the last time in this blog post, what gives?

  • Firewall solution: The first solution to the paradox is the existence of a firewall at the horizon. The only way to not have the short-range entanglement discussed is if there is very high energy density at the horizon. However, this violates the “no-drama” theorem and Einstein’s equivalence principle of general relativity which states that locally there should be nothing special about the horizon. If firewalls did exist, an actual wall of fire could randomly appear out of nowhere in front of us right now if a future black hole would have its horizon near us. Hence, this solution is not very popular.
  • Remnant solution: One possible resolution would be that the black hole never “poofs” but some quantum gravity effect we do not yet understand stabilizes it instead, allowing for some Planck-sized object to stick around? Such an object would be called a “remnant.” The so-called “remnant solution” to the information paradox is not a very popular one. People don’t like the idea of a very tiny, low-mass object holding an absurdly large amount of information.
  • No unitary evolution: Perhaps, black holes are special objects which actually lose information! This would mean that black hole physics (the quantum theory of gravity) would be considerably different compared to quantum field theory.
  • Computational complexity solution?: Can anyone ever observe this violation? And if not, does that resolve the paradox? This will be covered in our next blog post by Parth and Chi-Ning.


  1. Scott Aaronson. The complexity of quantum states and transformations: from quantum money to black holes.arXiv preprintarXiv:1607.05256, 2016.
  2. Daniel Harlow. Jerusalem lectures on black holes and quantum information. Reviews of Modern Physics, 88(1):015002, 2016.
  3. Ahmed Almheiri, Donald Marolf, Joseph Polchinski, and JamesSully. Black holes: complementarity or firewalls? Journal of HighEnergy Physics, 2013(2):62, 2013.

Introduction to AMP and the Replica Trick

(This post from the lecture by Yueqi Sheng)

In this post, we will talk about detecting phase transitions using
Approximate-Message-Passing (AMP), which is an extension of
Belief-Propagation to “dense” models. We will also discuss the Replica
Symmetric trick, which is a heuristic method of analyzing phase
transitions. We focus on the Rademacher spiked Wigner model (defined
below), and show how both these methods yield the same phrase transition
in this setting.

The Rademacher spiked Wigner model (RSW) is the following. We are given
observations Y = \frac{\lambda}{n}xx^T + \frac{1}{\sqrt{n}}W where
x \in \{\pm 1\}^n (sampled uniformly) is the true signal and W is a
Gaussian-Orthogonal-Ensemble (GOE) matrix:
W_{i, j} \sim \mathbb{N}(0, 1) for i \neq j and
W_{i, i} \sim \mathbb{N}(0, 2). Here \lambda is the signal to noise
ratio. The goal is to approximately recover x.

The question here is: how small can \lambda be such that it is
impossible to recover anything reasonably correlated with the
ground-truth x? And what do the approximate-message-passing algorithm
(or the replica method) have to say about this?

To answer the first question, one can think of the task here is to
distinguish Y \sim \frac{\lambda}{n}xx^T + \frac{1}{\sqrt{n}}W vs
Y \sim W. One approach to distinguishing these distributions is to
look at the spectrum of the observation matrix Y. (In fact, it turns
out that this is an asymptotically optimal distinguisher [1]). The spectrum of Y behaves as ([2]):

  • When \lambda \leq 1, the empirical distribution of eigenvalues in
    spiked model still follows the semicircle law, with the top
    eigenvalues \approx 2

  • When \lambda > 1, we start to see an eigenvalue > 2 in the
    planted model.

Approximate message passing

This section approximately follows the exposition in [3].

First, note that in the Rademacher spiked Wigner model, the posterior
distribution of the signal \sigma conditioned on the observation Y
is: \Pr[\sigma | Y] \propto \Pr[Y | \sigma] \propto \prod_{i \neq j} \exp(\lambda Y_{i, j} \sigma_i \sigma_j /2 ) This
defines a graphical-model (or “factor-graph”), over which we can perform
Belief-Propogation to infer the posterior distribution of \sigma.
However, in this case the factor-graph is dense (the distribution is a
product of potentials \exp(\lambda Y_{i, j} \sigma_i\sigma_j) for all
pairs of i, j).

In the previous blog post, we saw belief propagation works great when the underlying interaction
graph is sparse. Intuitively, this is because G is locally tree like,
which allows us to assume each messages are independent random
variables. In dense model, this no longer holds. One can think of dense
model as each node receive a weak signal from all its neighbors.

In the dense model setting, a class of algorithms called Approximate
message passing (AMP) is proposed as an alternative of BP. We will
define AMP for RWM in terms of its state evolution.

State evolution of AMP for Rademacher spiked Wigner model

Recall that in BP, we wish to infer the posterior distributon of
\sigma, and the messages we pass between nodes correspond to marginal
probability distribution over values on nodes. In our setting, since the
distributions are over \{\pm 1\}, we can represent distributions by
their expected values. Let m^t_{u \to v} \in [-1, 1] denote the
message from u to v at time t. That is, m_{u \to v} corresponds
to the expected value {{\mathbb{E}}}[\sigma_u].

To derive the BP update rules, we want to compute the expectation
{{\mathbb{E}}}[\sigma_v] of a node v, given the
messages {{\mathbb{E}}}[\sigma_u] for u \neq v. We can
do this using the posterior distribution of the RWM, \Pr[\sigma | Y],
which we computed above.
\displaystyle \Pr[\sigma_v = 1 | Y, \{\sigma_u\}_{u \neq v}] = \frac{ \prod_u \exp(\lambda Y_{u, v} \sigma_u) - \prod_u \exp(-\lambda Y_{u, v} \sigma_u) }{ \prod_u \exp(\lambda Y_{u, v} \sigma_u) + \prod_u \exp(-\lambda Y_{u, v} \sigma_u) }

And similarly for \Pr[\sigma_v = -1 | Y, \{\sigma_u\}_{u \neq v}].
From the above, we can take expectations over \sigma_u, and express
{{\mathbb{E}}}[\sigma_v] in terms of
\{{{\mathbb{E}}}[\sigma_u]\}_{u \neq v}. Doing this (and
using the heuristic assumption that the distribution of \sigma is a
product distribution), we find that the BP state update can be written
m^{t}_{u \to v} = f(\sum_{w \neq v}f^{-1}(A_{w, u} m^{t - 1}_{w \to u}))
where the interaction matrix A_{w, u} = \lambda Y_{w, u}, and
f(x) = tanh(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(x)}.

Now, Taylor expanding f^{-1} around 0, we find
m^{t}_{u \to v} = f\left( (\sum_{w \neq v} A_{w, u} m^{t - 1}_{w \to u}) + O(1/\sqrt{n}) \right)
since the terms A_{w, u} are of order O(1/\sqrt{n}).

At this point, we could try dropping the “non-backtracking” condition
w \neq v from the above sum (since the node v contributes at most
O(1/\sqrt{n}) to the sum anyway), to get the state update:
m^{t}_{u} = f\left( \sum_{w} A_{w, u} m^{t - 1}_{w}) \right) (note the messages no longer
depend on receiver – so we write m_u in place of m_{u \to v}).
However, this simplification turns out not to work for estimating the
signal. The problem is that the “backtracking” terms which we added
amplify over two iterations.

In AMP, we simply perform the above procedure, except we add a
correction term to account for the backtracking issue above. Given u,
for all v, the AMP update is:
m^{t}_{u \to v} = m^{t}_u = f(\sum_{w}A_{w, u} m^{t - 1}_{w}) + [\text{some correction term}]

The correction term corresponds to error introduced by the backtracking
terms. Suppose everything is good until step t - 2. We will examine
the influence of backtracking term to a node v through length 2 loops.
At time t - 1, v exert Y_{v, u}m^{t - 2}_v additional influence to
each of it’s neighbor u. At time t, v receive roughly
Y_{u, v}^2m^{t - 2}_v. Since Y_{u, v}^2 has magnitude
\approx \frac{1}{n} and we need to sum over all of v’s neighbors,
this error term is to large to ignore. To characterize the exact form of
correction, we simply do a taylor expansion

m^{t}_v = \sum_{u}f(Y_{u, v}m^{t - 1}_u) = \sum_{u}f(Y_{u, v} \left(\sum_{w}f(Y_{w, u}m^{t - 2}_w) - f(Y_{u, v}m^{t - 2}_w)\right) )\\ \approx \sum_u f(Y_{u, v} m^{t - 1}_u) - Y_{u, v}f'(m^{t - 1}_u)m^{t - 2}_v\\ \approx \sum_u f(Y_{u, v} m^{t - 1}_u) - \frac{1}{n}\sum_{u}f'(m^{t - 1}_u)m^{t - 2}_v

State evolution of AMP

In this section we attempt to obtain the phase transition of Rademacher
spiked Wigner model via looking at m^{\infty}.

We assume that each message could be written as a sum of signal term and
noise term. m^t = \mu_t x + \sigma_t g where
g \sim \mathbb{N}(0, I). To the dynamics of AMP (and find its phase
transition), we need to look at how the signal \mu_t and noise
\sigma_t evolves with t.

We do the following simplification: ignore the correction term and
assume each time we obtain an independent noise g.

m^{t} = Yf(m^{t - 1}) = (\frac{\lambda}{n}x^Tx + \frac{1}{\sqrt{n}}W)f(m^{t - 1}) = \frac{\lambda}{n} < f(m^{t - 1}), x > x + \frac{1}{\sqrt{n}} Wf(m^{t - 1})

Here, we see that \mu_t = \frac{\lambda}{n}< f(m^{t - 1}), x>
and \sigma_t = \frac{1}{\sqrt{n}}Wf(m^{t - 1}).

Note that \mu_{t} is essentially proportional to overlap between
ground truth and current belief
, since the function f keeps the
magnitude of the current beliefs bounded.

\frac{\lambda}{n} <f(m^{t - 1}), x>= \frac{\lambda}{n} <f(\mu_{t - 1}x + \sigma_{t - 1}g), x> \approx\lambda {{\mathbb{E}}}_{X \sim unif(\pm 1), G\sim \mathbb{N}(0, 1)}[X f(\mu_{t - 1}X + \sigma_{t - 1}G)] = \lambda {{\mathbb{E}}}_G[f(\mu_{t - 1} + \sigma_{t - 1}G)]

For the noise term, each coordinate of \sigma_t is a gaussian random
variable with 0 mean and variance

\frac{1}{n} \sum_v f(m^{t - 1})_v^2 \approx {{\mathbb{E}}}_{X, G}[f(\mu_{t - 1}X + \sigma_{t - 1}G)^2] = {{\mathbb{E}}}_{G}[f(\mu_{t - 1} + \sigma_{t - 1}G)^2]

It was shown in [4] that we can introduce a new
parameter \gamma_t s.t.
\gamma_t = \lambda^2 {{\mathbb{E}}}[f(\gamma_{t - 1} + \sqrt{\gamma_{t - 1}}G)]
As t \to \infty, turns out \mu_t = \frac{\gamma_t}{\lambda} and
\sigma_t^2 = \frac{\sigma_t}{\lambda^2}. To study the behavior of
m^t as t \to \infty, it is enough to track the evolution of

This heuristic analysis of AMP actually gives a phase transition at
\lambda = 1 (in fact, the analysis of AMP can be done rigorously as in [5]):

  • For \lambda < 1: If \gamma_t \approx 0, |\gamma_t + \sqrt{\gamma_t}G| < 1 w.h.p., thus we have \gamma_{t + 1} \approx \lambda^2 (\gamma_t) < \gamma_t. Taking t \to \infty, we have \gamma_{\infty} = 0, which means there AMP solution has no overlap with the ground truth.

  • For \lambda > 1: In this case, AMP’s solution has some correlation with the ground truth.

screenshot 2019-01-26 13.49.39

(Figure from [6])

Replica symmetry trick

Another way of obtaining the phase transition is via a non-rigorous
analytic method called the replica method. Although non-rigorous, this
method from statistical physics has been used to predict the fixed point
of many message passing algorithms and has the advantage of being easy
to simulate. In our case, we will see that we obtain the same phase
transition temperature as AMP above. The method is non-rigorous due to
several assumptions made during the computation.

Outline of replica method

Recall that we are interested in minizing the free energy of a given
system f(\beta, Y) = \frac{1}{\beta n} \log Z(\beta, Y) where Z is
the partition function as before:
Z(\beta, Y) = \sum_{x \in \{\pm 1\}^n} exp(-\beta H(Y, x)) and
H(Y, x) = -<Y, x^Tx> = -xYx^T = -\sum_{i, j} Y_{i, j}x_ix_j.

In replica method, Y is not fixed but a random variable. The
assumption is that as n \to \infty, free energy doesn’t vary with Y
too much, so we will look at the mean of f_Y to approximate free
energy of the system.

f(\beta) = \lim_{n \to \infty}\frac{1}{\beta n}{{\mathbb{E}}}_{Y}[\log Z(\beta, Y)]

f(\beta) is called the free energy density and the goal now is to
compute the free energy density as a function of only \beta , the
temperature of the system.

The replica method is first proposed as a simplification of the
computation of f(\beta)

It is a generally hard problem to compute f(\beta) in a clear way. A
naive attempt of approximate f(\beta) is to simply pull the log out
g(\beta) = \frac{1}{\beta n}\log {{\mathbb{E}}}_Y[Z(\beta, Y)]
Unfortunately g(\beta) and f(\beta) are quite different quantities,
at least when temperature is low. Intuitively, f(\beta) is looking at
system with a fixed Y while in g(\beta), x and Y are allowed to
fluctuate together. When the temperature is high, Y doesn’t play a big
roll in system thus they could be close. However, when temperature is
low, there could be a problems. Let \beta \to \infty,
f(\beta) \approx \int_Y (\beta x_Y Y x_Y)\mu(Y) dY,
g(\beta) \approx \log \int_Y exp(\beta x_J Y x_Y)\mu(Y)dY \approx \beta x^* Yx^*.

While {{\mathbb{E}}}_X[\log(f(X))] is hard to compute,
{{\mathbb{E}}}[f(X)^r] is a much easier quantity. The
replica trick starts from rewriting f(\beta) with moments of Z:
Recall that x^r \approx 1 + r \log x for r \approx 0 and
\ln(1 + x)\approx x, using this we can rewrite f(x) in the following

Claim 1. Let f_r(\beta) = \frac{1}{r \beta n}\ln[{{\mathbb{E}}}_Y[Z(\beta, Y)^r]]
Then, f(\beta) = \lim_{r \to 0}f_r(\beta)

The idea of replica method is quite simple

  • Define a function f(r, \beta) for r \in \mathbb{Z}_+ s.t. f(r, \beta) = f_r(\beta) for all such r.

  • Extend f(r, \beta) analytically to all r \in {{\mathbb{R}}}_+ and take the limit of r \to 0.

The second step may sound crazy, but for some unexplained reason, it has
been surprisingly effective at making correct predictions.

The term replica comes from the way used to compute
{{\mathbb{E}}}[Z^r] in Claim 1. We expand the r-th moment
in terms of r replicas of the system

Z(\beta, Y)^r = (\sum_x exp(-\beta H(Y, x)))^r = \sum_{x^1, \cdots, x^r} \Pi_{k = 1}^r exp(-\beta H(Y, x^i))

For Rademacher spiked Wigner model

In this section, we will see how one can apply the replica trick to
obtain phase transition in the Rademacher spiked Wigner model. Recall
that given a hidden a \in \{\pm 1\}^n, the observable
Y = \frac{\lambda}{n}a^Ta + \frac{1}{\sqrt n} W where
W_{i, j} \sim \mathcal{N}(0, 1) and W_{i, i} \sim \mathcal{N}(0, 2).
We are interested in finding the smallest \lambda where we can still
recover a solution with some correlation to the ground truth a. Note
that \{W_{i, i}\} is not so important here as x_i^2 doesn’t carry
any information in this case.

Given by the posterior {{\mathbb{P}}}[x|Y], the system we
set up corresponding to Rademacher spiked Wigner model is the following:

  • the system consists of n particles and the interactions between
    each particle are give by Y

  • the signal to noise ratio \lambda as the inverse temperature

Following the steps above, we begin by computing
f(r, \beta) = \frac{1}{r\beta n}\ln{{\mathbb{E}}}_Y[Z^r]
for r \in \mathbb{Z}_+: Denote X^k = (x^k)^Tx^k where x^k is the
kth replica of the system.

{{\mathbb{E}}}_Y[Z^r] = \int_Y \sum_{x^1, \cdots, x^r} exp(\beta \sum_k <Y, X^k> \mu(Y) dY\\ = \int_Y \sum_{x^1, \cdots, x^r} exp(\beta <Y, \sum_k X^k>) \mu(Y) dY

We then simplify the above expression with a technical claim.

Claim 2. Let Y = A + \frac{1}{\sqrt{n}}W where A is a fixed matrix and
W is the GOE matrix defined as above. Then,
\int_Y exp(\beta<Y, X>) \mu(Y) dY = exp(\frac{\beta^2}{n}{{\|{X}\|_{F}}}^2 + \frac{\beta}{2} <A, X>)
for some constant C depending on distribution of Y.

Denote X = \sum_k X^k. Apply Claim 2 with
A = \frac{\beta}{n}a^Ta, we have
{{\mathbb{E}}}_Y[Z^r] = \sum_{x^1, \cdots, x^r} exp(\frac{\beta^2}{n}{{\|{X}\|_{F}}}^2 + \frac{\beta^2}{2n} <a^Ta, X>)
To understand the term inside exponent better, we can rewrite the inner
sum in terms of overlap between replicas:

{{\|{X}\|_{F}}}^2 = \sum_{i, j}X_{i, j}^2 = \sum_{i, j}(\sum_{k = 1}^r x^k_ix^k_j)^2 =\sum_{i, j}(\sum_{k = 1}^r x^k_ix^k_j)(\sum_{l = 1}^r x^l_ix^l_j)\\ = \sum_{k, l} (\sum_{i = 1}^n x^k_ix^{l}_i)^2 = \sum_{k, l} <x^k, x^l>^2

where the last equality follows from rearranging and switch the inner
and outer summations.

Using a similar trick, we can view the other term as

<a^Ta, X> = \sum_{i, j}\sum_{k = 1}^rx^k_ix^k_ja_ia_j = \sum_{k = 1}^r (\sum_{i = 1}^n a_ix^k_i)^2 = \sum_{k}<a, x^k>^2

Note that Q_{k, l} = <x^k, x^l> represents overlaps between the
k and lth replicas and Q_k = <a, x^k> represents the
overlaps between the kth replica and the ground truth vector.

In the end, we get for any integer r, (Equation 1):

\displaystyle f(r, \beta) = \frac{1}{r\beta n}\ln(\sum_{x^1, \cdots, x^r} exp(\frac{\beta^2}{n}\sum_{k, l}Q_{k, l}^2 + \frac{\beta^2}{2n}\sum_k Q_k^2)) \label{e:1}\\ = \frac{1}{r\beta n} \ln(\sum_{Q}\nu_{x^k}(Q)exp(\frac{\beta^2}{n}\sum_{k, l}Q_{k, l}^2 + \frac{\beta^2}{2n}\sum_k Q_k^2))

Our goal becomes to approximate this quantity. Intuitively, if we think
of Q_{k, l} as indices on a (r + 1) \times (r + 1) matrices, Q,
with Q(i,i) = 1, then Q is the average of n i.i.d matrices. So we
expect Q_{j, k} \in [\pm \frac{1}{n}] for j \neq k w.h.p. In the
remaining part, We find the correct Q via rewriting Equation 1.

Observe that by introducing a new variable Z_{k, l} for k \neq l and
using the property of gaussian intergal (Equation 4):

\label{e:4} exp(\frac{\beta^2}{n}Q_{k, l}^2) = \sqrt{\frac{n}{4\pi}}\int_{Z_{k, l}} exp(-\frac{n}{4}Z_{k, l}^2 + \beta Q_{k, l}Z_{k, l})dZ_k

\exp(\frac{\beta^2}{2n}Q_k^2) = \sqrt{\frac{1}{8\pi n}}\int_{Z_k}exp(-(2n)Z_k^2 + 2\beta Q_{k}Z_k)dZ_k
Replace each exp(\frac{\beta^2}{n}Q_{k, l}^2) by a such integral, we
have (Equation 2):

\begin{gathered} {{\mathbb{E}}}[Z^r] = \sum_{x^1, \cdots, x^r} exp(\frac{\beta^2}{n}\sum_{k, l}Q_{k, l}^2 + \frac{\beta^2}{2n}\sum_k Q_k^2) \label{e:2}\\ = C\sum_{x^1, \cdots, x^r} \exp(\beta^2 n)\int_{Z_{k, l}}exp(-\frac{n}{4}\sum_{k \neq l}Z_{k, l}^2 - \frac{n}{2}\sum_k Z_k^2 + \beta \sum_{k \neq l}Y_{k, l}Q_{k, l} + 2\beta\sum_kZ_k Q_k) dZ \\ =C\exp(\beta^n) \int_{Y_{k, l}}exp(-\frac{n}{4}\sum_{k \neq l}Y_{k, l}^2 - \frac{n}{2}\sum_k Z_k^2 + \ln(\sum_{x_1,\cdots, x_r}exp(\beta \sum_{k\neq l}Y_{k, l}Q_{k, l} + 2\beta\sum_kY_k Q_k)) dY \label{e:2}\end{gathered}

where C is the constant given by introducing gaussian intergals.

To compute the integral in (Equation 2), we need to cheat a little bit and take
n \to \infty before letting r \to 0. Note that free energy density
is defined as
f(\beta) = \lim_{n \to \infty}\lim_{r \to 0}\frac{1}{r\beta n}\ln {{\mathbb{E}}}_Y[Z(\beta, Y)^r]
This is the second assumption made in the replica method and it is
commonly believed that switching the order is okay here. Physically,
this is plausible because we believe intrinsic physical quantities
should not depend on the system size.

Now the Laplace method tells us when n \to \infty, the integral in (Equation 2) is dominated by the max of the exponent.

Theorem 1 (Laplace Method). Let h(x): {{\mathbb{R}}}^n \to {{\mathbb{R}}}then

\int e^{nh(x)} \approx e^{nh(x^*)}(\frac{2\pi}{n})^{\frac{d}{2}}\frac{1}{\sqrt{det(H)}}

where x^* = argmax_x \{h(x)\} and H is the Hessian of h evaluated at the point x^*.

Fix a pair of k, l and apply Laplace method with
h(Z_{k, l}) = -\frac{1}{2}\sum_{0 \leq k < l \leq r}Z_{k, l}^2 + \frac{1}{n}\ln(\sum_{x_1,\cdots, x_r}exp(\beta \sum_{k \neq l}Z_{k, l}Q_{k, l} + 2\beta\sum_kZ_k Q_k))
what’s left to do is to find the critical point of h. Taking the
derivatives gives
-Y_{k, l} + \frac{A(Z_{k, l})\beta Q_{k, l}}{n A(Z_{k, l})} = 0
A(Z_{k, l}) = \sum_{x_1,\cdots, x_r}exp(\beta \sum_{k \neq l}Z_{k, l}Q_{k, l} + \beta\sum_kY_k Q_k).

We now need to find a saddle point of h where the hessian is PSD. To
do that, we choose to assume the order of the replicas does not matter,
which is refer to as the replica symmetry case. 1 One simplest form
of Y is the following: \forall k, l > 0, Z_{k, l} = y and
Z_{k} = y for some y. This also implies that Q_{k, l} = q for some
q and y =\frac{\beta}{n} q

Plug this back in to Equation 2 gives: (Equation 3)

\label{e:3} {{\mathbb{E}}}[Z^r] = C\exp(\beta n)\exp(-\frac{n}{2}(\frac{r^2 - r}{2})y^2 - \frac{n^2}{2} + \ln(\sum_{x^i}\exp(y\beta\sum_{k \neq l}Q_{k, l} + 2y\beta \sum_k Q_k))

To obtain f(r, \beta), we only need to deal with the last term in
(Equation 3) as r \to 0. Using the fact that Q_{k, l} = y for all
k, l and using the same trick of introducing new gaussain integral as
in (Equation 4) we have
\lim_{r \to 0}\frac{1}{r}\ln(\sum_{x^i}\exp(y\beta\sum_{k \neq l}Q_{k, l} + n\beta \sum_k Q_k)) = -\beta + {{\mathbb{E}}}_{z \sim \mathcal{N}(0, 1)}[\log(2cosh(y\beta + \sqrt{y\beta}z))]

Using the fact that we want the solution to minimizes free energy,
taking the derivative of the current f w.r.t. y gives
\frac{y}{\beta} = n{{\mathbb{E}}}_z[tanh(y\beta + \sqrt{y\beta}z)]
which matches the fixed point of AMP. Plug in q and y will give us
f(\beta). The curve of f(\beta) looks like the Figure below, where
the solid line is the curve of f(\beta) with the given q and the
dotted line is the curve given by setting all variables 0.

screenshot 2019-01-26 13.54.49



[1] Amelia Perry, Alexander S Wein, Afonso S Bandeira, and Ankur Moitra. Optimality and sub-optimality of pca for spiked random matrices and synchronization.
arXiv preprint arXiv:1609.05573, 2016.
[2] D. Feral and S. Pech e. The Largest Eigenvalue of Rank One Deformation of Large Wigner Matrices. Communications in Mathematical Physics, 272:185–228, May 2007.
[3] Afonso S Bandeira, Amelia Perry, and Alexander S Wein. Notes on computational-to-statistical gaps: predictions using statistical physics. arXiv preprint arXiv:1803.11132, 2018.
[4] Yash Deshpande, Emmanuel Abbe, and Andrea Montanari. Asymptotic mutual information for thebinary stochastic block model. In
Information Theory (ISIT), 2016 IEEE International Symposium on, pages 185–189. IEEE, 2016.
[5] Adel Javanmard and Andrea Montanari. State evolution for general approximate message passing algorithms, with applications to spatial coupling. Information and Inference: A Journal of the IMA, 2(2):115–144, 2013.
[6] A. Perry, A. S. Wein, and A. S. Bandeira. Statistical limits of spiked tensor models.
ArXiv e-prints, December 2016.

  1. Turns out for this problem, replica symmetry is the only case. We
    will not talk about replica symmetry breaking here, which
    intuitively means we partition replicas into groups and re-curse. 

Why physicists care about the Firewall Paradox

[Guest post by Noah Miller – a Harvard Physics Ph.D student that took our seminar. Noah’s webpage contains wonderful and extensive notes that can be of interest to computer scientists. –Boaz]

(The following blog post serves as an introduction to the following notes:)

Black Holes, Hawking Radiation, and the Firewall

There are many different types of “theoretical physicists.” There are theoretical astrophysicists, theoretical condensed matter physicists, and even theoretical biophysicists. However, the general public seems to be most interested in the exploits of what you might call “theoretical high energy theorists.” (Think Stephen Hawking.)

The holy grail for theoretical high energy physicists (who represent only a small fraction of all physicists) would be to find a theory of quantum gravity. As it stands now, physicists have two theories of nature: quantum field theory (or, more specifically, the “Standard Model”) and Einstein’s theory of general relativity. Quantum field theory describes elementary particles, like electrons, photons, quarks, gluons, etc. General relativity describes the force of gravity, which is really just a consequence of the curvature of spacetimes.

Sometimes people like to say that quantum field theory describes “small stuff” like particles, while general relativity describes “big stuff” like planets and stars. This is maybe not the best way to think about it, though, because planets and stars are ultimately just made out of a bunch of quantum particles.

Theoretical physicists are unhappy with having two theories of nature. In order to describe phenomena that depend on both quantum field theory and general relativity, the two theories must be combined in an “ad hoc” way. A so-called “theory of everything,” another name for the currently unknown theory of “quantum gravity,” would hypothetically be able to describe all the phenomena we know about. Just so we’re all on the same page, they don’t even have a fuly worked out hypothesis. (“String theory,” a popular candidate, is still not even a complete “theory” in the normal sense of the word, although it could become one eventually.)

So what should these high energy theoretical physicists do if they want to discover what this theory of quantum gravity is? For the time being, nobody can think up an experiment that would be sensitive to quantum gravitation effects which is feasible with current technology. We are limited to so-called “thought experiments.”

This brings us to Hawking radiation. In the 1970’s, Stephen Hawking considered what would happen to a black hole once quantum field theory was properly taken into account. (Of course, this involved a bit of “ad hoc” reasoning, as mentioned previously.) Hawking found that, much to everybody’s surprise, the black hole evaporated, realeasing energy in the form of “Hawking radiation” (mostly low energy photons). More strangely, this radiation comes out exactly in the spectrum you would expect from something “hot.” For example, imagine heating a piece of metal. At low temperatures, it emits low energy photons invisible to the human eye. Once it gets hotter, it glows red, then yellow, then perhaps eventually blue. The spectrum of light emitted follows a very specific pattern. Amazingly, Hawking found that the radiation which black holes emit follow the exact same pattern. By analogy, they have a temperature too!

This is more profound than you might realize. This is because things which have an temperature should also have an “entropy.” You see, there are two notions of “states” in physics: “microstates” and “macrostates.” A microstate gives you the complete physical information of what comprises a physical system. For example, imagine you have a box of gas, which contains many particles moving in a seemingly random manner. A “microstate” of this box would be a list of all the positions and momenta of every last particle in that box. This would be impossible to measure in pratice. A “macrostate,” on the other hand, is a set of microstates. You may not know what the exact microstate your box of gas is in, but you can measure macroscopic quantities (like the total internal energy, volume, particle number) and consider the set of all possible microstates with those measured quantities.

The “entropy” of a macrostate is the logarithm of the number of possible microstates. If black holes truly are thermodynamic systems with some entropy, that means there should be some “hidden variables” or microstates to the black hole that we currently don’t understand. Perhaps if we understood the microstates of the black hole, we would be much closer to understanding quantum gravity!

However, Hawking also discovered something else. Because the black hole is radiating out energy, its mass will actually decrease as time goes on. Eventually, it should disappear entirely. This means that the information of what went into the black hole will be lost forever.

Physicists did not like this, however, because it seemed to them that the information of what went into the black hole should never be lost. Many physicists believe that the information of what went into the black hole should somehow be contained in the outgoing Hawking radiation, although they do not currently understand how. According to Hawking’s original calculation, the Hawking radiation only depends on a parameters of the black hole (like its mass) and have nothing to do with the many finer details on what went in, the exact “microstate” of what went in.

However, physicists eventually realized a problem with the idea that the black hole releases its information in the form of outgoing Hawking radiation. The problem has to do with quantum mechanics. In quantum mechanics, it is impossible to clone a qubit. That means that if you threw a qubit in a black hole and then waited for it to eventually come out in the form of Hawking radiation, then the qubit could no longer be “inside” the black hole. However, if Einstein is to be believed, you should also be able to jump into the black hole and see the qubit on the inside. This seems to imply that the qubit is cloned, as it is present on both the inside and outside of the black hole.

Physicists eventually came up with a strange fix called “Black Hole Complementarity” (BHC). According to BHC, according to people outside the black hole, the interior does not exist. Also, according to people who have entered the black hole, the outside ceases to exist. Both descriptions of the world are “correct” because once someone has entered the black hole, they will be unable to escape and compare notes with the person on the outside.

Of course, it must be emphasized that BHC remains highly hypothetical. People have been trying to poke holes in it for a long time. The largest hole is the so called “Firewall Paradox,” first proposed in 2012. Essentially, the Firewall paradox tries to show that the paradigm of BHC is self-contradictory. In fact, it was able to use basic quantum mechanics to show that, under some reasonable assumptions, the interior of the black hole truly doesn’t exist, and that anyone who tries to enter would be fried at an extremely hot “firewall!” Now, I don’t think most physicists actually believe that black holes really have a firewall (although this might depend on what day of the week you ask them). The interesting thing about the Firewall paradox is that it derives a seemingly crazy result from seemingly harmless starting suppositions. So these suppositions would have to be tweaked in the theory of quantum gravity order to get rid of the firewall.

This is all to say that all this thinking about black holes really might help physicists figure out something about quantum gravity. (Then again, who can really say for sure.)

If you would like to know more about the Firewall paradox, I suggest you read my notes, pasted at the top of this post!

The goal of the notes was to write an introduction to the Black Hole Information Paradox and Firewall Paradox that could be read by computer scientists with no physics background.

The structure of the notes goes as follows:

  1. Special relativity
  2. General relativity
  3. Quantum Field Theory (in which I ambitiously tell you what QFT actually is!)
  4. Statistical Mechanics (this is the best part)
  5. Hawking radiation
  6. The information paradox

Because the information paradox touches on all areas of physics, I thought it was necessary to take “zero background, infinite intelligence” approach, introducing the all the necessary branches of physics (GR, QFT, Stat Mech) in order to understand what the deal with Hawking radiation really is, and why physicists think it is so important. I think it is safe to say that if you read these notes, you’ll learn a non-trivial amount of physics.

Quantum Games

Nilin Abrahamsen

Daniel Alabi

Mitali Bafna

Emil Khabiboulline

Juspreet Sandhu

Two-prover one-round (2P-1R) games have been the subject of intensive study in classical complexity theory and quantum information theory. In a 2P-1R game, a verifier sends questions privately to each of two collaborating provers , who then aim to respond with a compatible pair of answers without communicating with each other. Sharing quantum entanglement allows the provers to improve their strategy without any communication, illustrating an apparent paradox of the quantum postulates. These notes aim to give an introduction to the role of entanglement in nonlocal games, as they are called in the quantum literature. We see how nonlocal games have rich connections within computer science and quantum physics, giving rise to theorems ranging from hardness of approximation to the resource theory of entanglement.


In these notes we discuss 2-prover 1-round games and the classical complexity of approximating the value of such games in the setting where the provers can share entanglement. That is, given the description of a game, we ask how hard it is to estimate the winning probability of the best winning strategy of the entangled provers. Let us first formally define games and its relation to the label cover problem. We write { [n]=\{1,\ldots,n\}}.

Definition (Label cover)

A label cover instance { I=(S,T,\Sigma,\Pi)} consists of variable sets { S=\{s_i\}_{i\in[n]},T=\{t_j\}_{j\in[n]}}, alphabet set { \Sigma}, and a collection { \Pi} of constraints of the form { t_j=f_{i,j}(s_i)}. Given an assignment (or coloring) { c : S\cup T \rightarrow \Sigma} we define its value to be { \omega(c)=\mathbb P_{f_{ij}\sim\Pi}\big(c(t_j)=f_{ij}(c(s_i))\big)}. Define the value of { I} to be the maximum over all possible assignments, i.e. { \omega(I) = \max_{c} \omega(c)}. Many familiar computational problems can be formulated as a label cover, such as { \textsc{3SAT}}, { \textsc{3Lin}}, and { \textsc{MaxCut}}.
Label cover graph

Definition (2-prover 1-round game)

Let { I = (S,T,\Sigma,\Pi)} be a label cover instance. We can then associate the following two-prover one-round game { G(I)} with { I}. Let { P_1,P_2} be two provers who cannot communicate, and let { V} be the verifier. Given the label cover instance { I}, the verifier uniformly samples a constraint { f_{i,j} \in \Pi} and sends { s_i} to { P_1} and { t_j} to { P_2}. The provers then reply with { a \in \Sigma} and { b\in\Sigma} respectively to { V}. Finally, { V} outputs { 1} if and only if { b = f_{i,j}(a)}.
The game view of label cover
Any coloring of the label cover instance corresponds to a deterministic strategy for the corresponding game. Therefore, with an optimal strategy the provers win the game associated to label cover instance { I} with probability { \omega(I)}. That is, the value of the game equals that of the label cover instance. However, this is with the assumption that provers can only use deterministic strategies or convex combinations of these (that is, using shared randomness). If the provers share an entangled quantum state, then the provers (who still cannot communicate) can enjoy correlations that allow them to win with a higher probability than classically. In the quantum literature, these 2P-1R games are known as nonlocal games referring to the fact that the correlations arise without signaling between the provers. We are concerned with the complexity of approximating the winning probability of this strategy. We refer to the optimal winning probability within some class (classical or entangled) of strategies as the classical and quantum value of the game, respectively, and we use the terms quantum strategy and entangled strategy interchangeably. Fixing different constraint families in the label cover game changes the complexity of finding the (classical and entangled) values of the game. We will show that approximating the entangled value of XOR games, or more generally unique games (to be defined later on), is possible in polynomial time. This is remarkable because a famous conjecture known as the unique games conjecture says that approximating the classical value of unique games is NP-hard. In contrast, we will see that for unrestricted edge constraints, it is NP-hard to approximate the entangled value of a nonlocal game. Thus, hardness of approximation of the game’s value, established by the celebrated PCP theorem , still applies in the presence of entanglement. In the quantum world, we have new complexity classes such as QMA (which can be regarded as “quantum NP”), so one may conjecture whether approximating the entangled value of a general game is QMA-hard (the games formulation of the quantum PCP conjecture ). We will indicate progress in this direction but will explicitly demonstrate the NP-hardness result. Entanglement is often regarded as an expensive resource in quantum information because it is difficult to produce and maintain. Hence, even if sharing entanglement can improve the success probability of winning a game, the resource consumption may be costly. We will conclude by discussing lower bounds on the number of shared entangled bits required to achieve the optimal value of a game.

Notation and quantum postulates

Let us first establish notation and define what is meant by an entangled strategy. In keeping with physics notation we write a column vector as { |{v}\rangle \in\mathbb C^d} and its conjugate-transpose (a row vector) as { \langle{v}| } . More generally the conjugate-transpose (Hermitian conjugate) of a matrix { A} is written { A^\dag}. Then { \langle{v}| w\rangle} is the inner product of two vectors (a scalar) and { |{v}\rangle \langle{w}| } the outer product (a rank-1 matrix). A matrix { A} is said to be Hermitian if { A^\dag=A}. A matrix { A} is positive semidefinite , written { A\succeq 0}, if { A=B^\dag B} for some matrix { B}. We write the identity matrix as { {\mathbb{I}}}, denote by { Herm(\mathbb{C}^d)} the set of { d}-by-{ d} Hermitian matrices.

Observables, states, and entanglement

In a quantum theory the observables are Hermitian operators { A\in Herm(\mathbb C^d)} on a vector space { \mathbb C^d}. It then makes sense to say that a state is a functional { \varphi} on the set of observables. That is, to specify the state of a physical system means giving a (expected) value for each observable. It turns out states are linear functionals of the observables, and such functionals can be written { \varphi(A)=\langle A,\rho\rangle} for some { d}-by-{ d} matrix { \rho}. We call { \rho} the density matrix and require moreover that { \rho} is positive semidefinite and has trace { 1}. Every density matrix is a convex combination of rank-one projections { |\psi\rangle\langle\psi|} known as pure states. The unit vectors { |\psi\rangle\in\mathbb C^d} are also themselves known as pure states. If the state of one particle is described by a vector in { \mathbb C^d} (referring here to pure states), then two particles are described by a vector in the tensor product { \mathbb C^d\otimes\mathbb C^d}. The two particles are entangled if their state is not in the form of a pure tensor { |\phi\rangle\otimes|\psi\rangle}. We also write product states as { |\phi\rangle|\psi\rangle}, omitting the tensor symbol.

Quantum measurements

A quantum measurement can be described in terms of a projection-valued measure (PVM). Definition 1 (PVM) A projection-valued measure on vector space { \mathbb C^d} (where the quantum states live) is a list of projection matrices { A_1,\ldots,A_k} on { \mathbb C^d} such that { A_iA_j=0} for { i\neq j} and { \sum_i A_i={\mathbb{I}}}. The PVM describes a measurement which, on state { |\psi\rangle\in\mathbb C^d} outputs measurement outcome { i} with probability { \langle\psi| A_i|\psi\rangle}. The quantum state after obtaining outcome { i} is { \frac1{\|A_i|\psi\rangle\|}A_i|\psi\rangle}. When the projections are rank-one projections { A_i= |{\beta_i}\rangle \langle{\beta_i}| } we say that we measure in the basis { \{ |{\beta_i}\rangle \}_i}. In this case the probability of outcome { i} in state { |\psi\rangle} is { |\langle\beta_i |{\psi}\rangle |^2}, and the post-measurement state is simply { |{\beta_i}\rangle} . Applying the measurement { (A_i)_i} on the left half of a two-particle state { |{\Psi}\rangle \in\mathbb C^d\otimes\mathbb C^d} means applying the PVM { (A_i\otimes {\mathbb{I}})_{i}} on the two-particle state.

Quantum strategies for nonlocal games

We now introduce the notion of a quantum strategy for a nonlocal game. Each prover holds a particle, say with state space { \mathbb C^d}, and Alices particle may be entangled with Bob’s. The global state is { |{\phi}\rangle _{AB}\in\mathbb C^d\otimes\mathbb C^d}. Each player receives a question from the verifier and then chooses a measurement (a PVM) depending on the question. The player applies the measurement to their own particle and responds to the verifier with their measurement outcome. Hence for Alice we specify { n} PVM’s { A^s} where { s\in S} is a question, and each PVM is a list { A^s_{a=1},\ldots,A^s_k}. By definition 1, given questions { s,t} the probability that Alice outputs { a} and Bob outputs { b} is

\displaystyle  P(a,b|s,t)=\|({\mathbb{I}}\otimes B^t_b)(A^s_a\otimes {\mathbb{I}}) |{\phi}\rangle _{AB}\|^2= \langle{\phi}| A^s_a\otimes B^t_b|\phi\rangle, \ \ \ \ \ (1)

where we have used that squaring a projection leaves it unchanged.

Quantum strategies beat classical ones

For any 2P-1R game { G}, let { \omega(G)} be the maximum probability — over the players’ classical strategies — that the verifier accepts and { \omega^*(G)} the maximum probability that the verifier accepts when the provers use qubits such that player 1’s qubits are entangled with those of player 2. The game of Clauser, Horne, Shimony, and Holt (CHSH) has the property that the provers can increase their chance of winning by sharing an entangled pair of qubits, even when no messages are exchanged between the players. We show that there’s a characterization of the CHSH game’s value { \omega^*(G)=\cos^2(\frac{\pi}{8}) = \frac{1}{2}+\frac{\sqrt{2}}{4}\approx 0.85} which is better than the classical value { \omega(G)\leq \frac{3}{4}}. Let us first define XOR games, of which the CHSH game is a special case.

Definition (XOR game)

An XOR game is a 2-player classical game (the questions and answers are classical bits) where:
  1. Questions { (s, t)\in\{0, 1, \ldots, n-1\}^2} are asked according to some distribution { \Pi} (e.g. uniform).
  2. Answers { a, b\in\{0, 1\}} are provided by players (call them Alice and Bob).
  3. The verifier computes a predicate { V(a, b|s, t) = f_{s, t}(a\oplus b)} used to decide acceptance/rejection.

Definition (CHSH Game)

An XOR game with { n=2} where { s, t\in\{0, 1\}} are independent random bits and { V(a, b | s, t)=1 \Longleftrightarrow a\oplus b = s\wedge t}. To win the CHSH game, Alice and Bob need to output bits { a} (from Alice) and { b} (from Bob) that disagree if { s=t=1} and agree otherwise. If Alice and Bob are classical then they can do no better than by always outputting { 0}, say, in which case they win in the three out of four cases when one of the questions is { 0}. Equivalently, { \omega(G)=\frac34} where { G} is the CHSH game. This is the content of Bell’s inequality :

Lemma (Bell’s Inequality)

For any two functions { g, h: \{0, 1\}\rightarrow\{0, 1\}}, we have { {\mathop{\mathbb{P}}}_{x, y\in\{0, 1\}}\left[g(x)\oplus h(y) = x\wedge y\right] \leq \frac{3}{4} } where { x} and { y} are independent uniformly random bits.
The probability of any event is a multiple of { 1/4} so it suffices to show that { {\mathop{\mathbb{P}}}_{x, y\in\{0, 1\}}\left[g(x)\oplus h(y) = x\wedge y\right]\neq1}. So assume for contradiction that { g(x)\oplus h(y) = x\wedge y} for all pairs { x,y}. Then we have that { g(0)\oplus h(0) = 0} and { g(0)\oplus h(1) = 0} which implies that { h(0)=h(1)}. But then { 0=g(1)\oplus h(0) =g(1)\oplus h(1) = 1} which is a contraction.

The strategy

The entangled strategy for the CHSH game requires that Alice and Bob each hold a qubit, so that their two qubits together are described by a vector in { \mathbb C^2\otimes\mathbb C^2}. The two qubits together are in the state { |{\phi}\rangle = \frac{1}{\sqrt{2}}\left( |{0}\rangle _A |{0}\rangle _B + |{1}\rangle _A |{1}\rangle _B\right), } forming what is known as an EPR (Einstein-Podolsky-Rosen) pair. Now for { \theta\in[-\pi, \pi]} define { |{\beta_0(\theta)}\rangle = \cos\theta |{0}\rangle + \sin\theta |{1}\rangle } and { |{\beta_1(\theta)}\rangle = -\sin\theta |{0}\rangle + \cos\theta |{1}\rangle } . Now we describe a (quantum) strategy with winning probability { \cos^2(\frac{\pi}{8})}. In each case Alice and Bob respond with their measurement outcome, where subscripts of the measurement basis vectors correspond to the answer to be sent back to the verifier.
  • If { s=0}, Alice measures in basis { \{ |{\beta_0(0)}\rangle , |{\beta_1(0)}\rangle \} = \{ |{0}\rangle , |{1}\rangle \}}. If { s=1}, Alice measures in { \{ |{\beta_0(\frac{\pi}{4})}\rangle , |{\beta_1(\frac{\pi}{4})}\rangle \}}. Alice answers bit { a = 0} if outcome is { \beta_0(\cdot)} and answers { a = 1} otherwise.
  • If { t=0}, Bob measures in basis { \{ |{\beta_0(\frac{\pi}{8})}\rangle , |{\beta_1(\frac{\pi}{8})}\rangle\}}. If { t=1}, Bob measures in { \{ |{\beta_0(-\frac{\pi}{8})}\rangle , |{\beta_1(-\frac{\pi}{8})}\rangle \}}.
  • Each player responds with their respective measurement outcome.
Lemma 2 Alice and Bob win the CHSH game with probability { \cos^2(\frac{\pi}{8})}.
We will show that for each pair of questions { s,t} the pair of answers { a,b} is correct with probability { \cos^2(\pi/8)}. We can split the pairs of questions into the two cases { s\wedge t=0} and { s\wedge t=1}:
  • ({ s\wedge t=0}) The three cases { (s,t)=(0,0),(0,1),(1,0)} are all analogous: in each case Alice an Bob must output the same answer, and in each case Bob’s measurement basis is almost the same as Alice’s except rotated by a small angle { \pm\pi/8}. Of the three above cases we consider the one where { s,t=0} and check that indeed the two measurement outcomes agree with probability { \cos^2(\pi/8)}: When Alice measures her qubit and obtains some bit { a}, the shared pair { |{\beta}\rangle} collapses to { |{a}\rangle _A |{a}\rangle _B}. Indeed, since the question was { s=0}, Alice measures her qubit in the basis { \{|0\rangle,|1\rangle\}}. This means that Alice applies the measurement { \{|0\rangle\langle 0|\otimes {\mathbb{I}},|1\rangle\langle 1|\otimes {\mathbb{I}}\}} on the global state. The post-measurement state is the normalization of { ( |{a}\rangle \langle{a}| \otimes {\mathbb{I}}) |{\phi}\rangle _{AB}=\frac{1}{\sqrt2} |{a}\rangle \overbrace{\langle a |{0}\rangle }^{\delta_{a,0}}\otimes | 0\rangle + |{a}\rangle \overbrace{\langle a |{1}\rangle }^{\delta_{a,1}}\otimes |{1}\rangle =\frac {1}{\sqrt2} |{a}\rangle _A |{a}\rangle _B } because { \langle a |{a'}\rangle} can be viewed as a Kronecker delta of { a} and { a'}. In particular, Bob is now in the pure state { |{a}\rangle} . Because Bob received question { t=0} he measures in the basis { \{ |{\beta_b(\pi/8)}\rangle \}_b} Therefore his probability of correctly outputting { b=a} is { {\mathop{\mathbb{P}}}[\text{Bob gets outcome }a] = |\langle\beta_a(\tfrac{\pi}{8}) |{a}\rangle |^2 = \cos^2\left(\frac{\pi}{8}\right) }
  • ({ s\wedge t=1}) Now consider the case { s=1,t=1} where Alice and Bob are supposed to give different answers. Alice measures in basis consisting of { |{\beta_0(\pi/4)}\rangle =\frac{|0\rangle+|1\rangle}{\sqrt2}} and { |{\beta_1(\pi/4)}\rangle =\frac{-|0\rangle+|1\rangle}{\sqrt2}}. If Alice gets outcome { a} then the post-measurement global state is { |{\beta_a(\frac\pi4)}\rangle |{\beta_a(\frac\pi4)}\rangle} . Therefore when Bob applies the measurement in basis { \{ |{\beta_a(-\frac\pi8)}\rangle \}_a} he mistakenly outputs { a} only with probability { |\langle\beta_a(\frac\pi4) |{\beta_a(-\frac\pi8)}\rangle |^2=\sin^2(\frac\pi8)=1-\cos^2(\frac\pi8)}.
Lemma 2 implies a lower bound on the value of the CHSH game.


{ \omega^*(G)\ge\cos^2(\frac\pi8)}
It turns out that this lower bound is sharp, that is, the strategy just described is optimal.


The value of the CHSH game using a quantum strategy is at most { \frac{1}{2} + \frac{\sqrt{2}}{4} = \cos^2\frac{\pi}{8}\approx 0.85}. Proof. We can describe the quantum strategy of Alice and Bob in an XOR game by (i) a shared quantum state { |{\phi}\rangle _{AB}\in\mathbb{C}^d\otimes\mathbb{C}^d} (note that for the CHSH game, { d=2}); (ii) measurements { \{A^0_s, A^1_s\}} for every question { s} sent to Alice; (iii) measurements { \{B^0_t, B^1_t\}} for every question { t} sent to Bob. The probability of answering { (a, b)} given questions { (s, t)} is { \langle{\phi}| A^a_s\otimes B^b_t |{\phi}\rangle} . Now let us write { A_s=A^0_s - A^1_s} and { B_t=B^0_t - B^1_t} so that for any { a, b\in\{0, 1\}}, we can write { A_s^a = \frac{{\mathbb{I}} + (-1)^aA_s}{2}, B_t^b = \frac{{\mathbb{I}} + (-1)^bB_t}{2} } Note that since the possible outcomes here are finite, { A_s, B_t} are Hermitian and we may assume have bounded norm of 1. Furthermore, we assume that { A_s, B_t} are observables so that { A_s^2 = B_t^2 = {\mathbb{I}}} . Now denoting { f_{s, t}(a\oplus b)} as the XOR predicate to be computed, we can write the quantum game value as \omega^*(G) = \sup_{ |{\phi}\rangle , \{A_s^a\}, \{B_t^b\}} {\mathop{\mathbb{E}}}_{(s, t)\sim\Pi}\sum_{a, b}f_{s, t}(a\oplus b) \langle{\phi}| A^a_s\otimes B^b_t |{\phi}\rangle  \\ = \sup_{ |{\phi}\rangle , \{A_s\}, \{B_t\}} {\mathop{\mathbb{E}}}_{(s, t)\sim\Pi}\sum_{a, b}f_{s, t}(a\oplus b) \langle{\phi}| \frac{{\mathbb{I}} + (-1)^aA_s}{2}\otimes  \frac{{\mathbb{I}} + (-1)^bB_t}{2} |{\phi}\rangle  \\ = {\mathop{\mathbb{E}}}_{(s, t)\sim\Pi}\sum_{a, b}\frac{f_{s, t}(a\oplus b)}{4} + \sup_{ |{\phi}\rangle , \{A_s\}, \{B_t\}} {\mathop{\mathbb{E}}}_{(s, t)\sim\Pi}\sum_{a, b}f_{s, t}(a\oplus b) \langle{\phi}| \frac{{\mathbb{I}} + (-1)^aA_s}{2}\otimes  \frac{{\mathbb{I}} + (-1)^bB_t}{2} |{\phi}\rangle  \\ = {\mathop{\mathbb{E}}}_{(s, t)\sim\Pi}\sum_{a, b}\frac{f_{s, t}(a\oplus b)}{4} + \sup_{ |{\phi}\rangle , \{A_s\}, \{B_t\}} {\mathop{\mathbb{E}}}_{(s, t)\sim\Pi}\sum_{a, b}\frac{f_{s, t}(a\oplus b)(-1)^{ab}}{4} \langle{\phi}| A_s\otimes B_t |{\phi}\rangle  \\ = {\mathop{\mathbb{E}}}_{(s, t)\sim\Pi}\sum_{a, b}\frac{f_{s, t}(a\oplus b)}{4} + \sup_{ |{\phi}\rangle , \{A_s\}, \{B_t\}} {\mathop{\mathbb{E}}}_{(s, t)\sim\Pi}\frac{f_{s, t}(0) - f_{s,t}(1)}{2} \langle{\phi}| A_s\otimes B_t |{\phi}\rangle  \\ where the summation { \sum_{a, b\in\{0, 1\}}\left(\cdot\right)} has been evaluated in the last line. Now note that the first term is independent of the quantum strategy and as a result equals the value of the uniformly random strategy which is 1/2. So we proceed to focus on the second term. Note that for CHSH { f_{s, t}(0) - f_{s,t}(1) = (-1)^{st}} simplifying the second term to { \frac{1}{8}( \langle{\phi}| A_0\otimes B_0 |{\phi}\rangle + \langle{\phi}| A_1\otimes B_0 |{\phi}\rangle + \langle{\phi}| A_0\otimes B_1 |{\phi}\rangle - \langle{\phi}| A_1\otimes B_1 |{\phi}\rangle ). } Next, we invoke Tsirelson’s Theorem (See Theorem 3) to bound this second term as = \sup_{{\textbf{x}}_s, {\textbf{y}}_t}\frac{1}{8}({\textbf{x}}_0\cdot {\textbf{y}}_0 + {\textbf{x}}_0\cdot {\textbf{y}}_1 + {\textbf{x}}_1\cdot {\textbf{y}}_0 - {\textbf{x}}_1\cdot {\textbf{y}}_1) \\ \leq \sup_{{\textbf{x}}_s, {\textbf{y}}_t}\frac{1}{8}(\|{\textbf{x}}_0\|\|{\textbf{y}}_0 + {\textbf{y}}_1\| + \|{\textbf{x}}_1\|\|{\textbf{y}}_0 - {\textbf{y}}_1\|) \\ \leq \sup_{{\textbf{x}}_s, {\textbf{y}}_t}\frac{\sqrt{2}}{8}\sqrt{2\|{\textbf{y}}_0\|^2 + 2\|{\textbf{y}}_1\|^2} \leq \frac{\sqrt{2}}{4} where we have used Cauchy-Schwartz and the concavity of the { \sqrt{\cdot}} function. This completes our proof showing the exact characterization of the value ({ =\cos^2(\frac{\pi}{8})=\frac{1}{2}+\frac{\sqrt{2}}{4}}) of the CHSH game using a quantum strategy. This proof is an adaptation of the one in [12].
Theorem 3 (Tsirelson’s Theorem [1]) For any { n\times n} matrix { C = (C_{s, t})}, the following are equivalent:
  1. There exist { d\in\mathbb{N}, |{\phi}\rangle \in\mathbb{C}^{d}\otimes\mathbb{C}^{d}, A_s, B_t\in\text{Herm}(\mathbb{C}^d), A_s^2 = B_t^2 = I} such that for any { s, t\in[n]} { C_{s, t} = \langle{\phi}| A_s\otimes B_t |{\phi}\rangle} . Further this would imply that { d\leq 2^{\lceil\frac{n+2}{2}\rceil}};
  2. There exist real unit vectors { {	{\textbf{x}}}_s, {	{\textbf{y}}}_t\in{\mathbb{R}}^{n+2}} for { s, t\in [n]} such that { C_{s, t} = {	{\textbf{x}}}_s\cdot {	{\textbf{y}}}_t};

Entangled unique games are easy

The CHSH game provides the first example that the entangled value { \omega^*(G)} of a nonlocal game can exceed the classical value { \omega(G)}. XOR-games like the CHSH game are the special case corresponding to alphabet size { k=2} of the class of unique games :

Definition (Unique Games)

A 2-prover 1-round game is called a unique game if its constraints are of the form { b=\pi_{ij}(a)} where { \pi_{ij}} is a permutation of the alphabet { \Sigma} for each edge { i\sim j}. The famous unique games conjecture (UGC) by Khot says that { \omega(G)} is NP-hard to approximate for unique games. Surprisingly, Kempe et al. showed that a natural semidefinite relaxation for unique games yields an approximation to the entangled value { \omega^*(G)} which can be computed in polynomial time. In other words the UGC is false for entangled provers, in contrast to the classical case where the conjecture is open.
Theorem 4 There is an efficient (classical) algorithm which takes a description of a nonlocal game { G} as its input and outputs { \hat\omega(G)} such that { 1-6(1-\hat\omega(G))\le\omega^*(G)\le\hat\omega(G) } Put differently, if { \omega^*(G)=1-\varepsilon^*} and { \hat\omega(G)=1-\hat\varepsilon}, then { \varepsilon^*\in[\hat\varepsilon,6\hat\varepsilon] }
The algorithm of theorem 4 proceeds by relaxing the set of quantum strategies to a larger convex set { \mathcal S} of pseudo-strategies and maximizing over { \mathcal S} instead of actual strategies, a much easier task. In approximation theory one often encounters a collection of hypothetical moments not arising from a distribution, known as a pseudo-distribution. In contrast, our pseudo-strategies are actual conditional probability distributions on answers (conditional on the questions). What makes { \mathcal S} a set of “pseudo”-strategies rather than actual strategies is that they may enjoy correlations which cannot be achieved without communication.

Convex relaxation of quantum strategies

We will define { \mathcal S} to be a class of conditional probability distributions { \tilde{P}(a,b|s,t)} on answers given questions. We will require that the pseudo-strategies satisfy a positive semidefinite constraint when arranged in matrix form. In particular this matrix has to be symmetric, so we symmetrize the conditional probability { \tilde{P}(a,b|s,t)} by allowing each of { s} and { t} to be either a question for Alice or for Bob. That is, we extend the domain of definition for { \tilde{P}(a,b|s,t)} from { \Sigma^2\times S\times T} to { \Sigma^2\times (S\cup T)^2} where { S} and { T} are the question sets. So each question { s} and { t} and answer { a} and { b} can be either for Alice or Bob — we indicate this by changing notation from { (s,t)\in S\times T} to { q,q'\in S\cup T} and for the answers replacing { a,b} by { c,c'}.
Definition 5 (Block-matrix form) Given a function { \tilde{P}=\tilde{P}(\cdot\cdot|\cdot\cdot)} defined on { (S\cup T)^2\times \Sigma^2} with { |S|=|T|=n} and { |\Sigma|=k}, define a { 2nk}-by-{ 2nk} matrix { M^{\tilde{P}}} whose rows are indexed by pairs { (q,c)\in(S\cup T)\times\Sigma} and columns by pairs { (q',c')\in (S\cup T)\times\Sigma}, and whose entries are { M^{\tilde{P}}_{(q,c),(q',c')}=\tilde{P}(c,c'|q,q') } In other words { M^{\tilde{P}}} consists of { k}-by-{ k} blocks where the block { M^{\tilde{P}}_{q,q'}} at position { q,q'} contains the entries { \tilde{P}(\cdot\cdot|q,q')}.
Definition 5 is simply a convenient change of notation and we identify { \tilde{P}} with { M^{\tilde{P}}}, using either notation depending on the context.
Definition 6 (Pseudo-strategies) Let { S} and { T} be the question sets for Alice and Bob, respectively. We say that { \tilde{P}=\tilde{P}(\cdot\cdot|\cdot\cdot):\Sigma^2\times (S\cup T)^2\rightarrow[0,1]} (or its matrix form { M^{\tilde{P}}}) is a pseudo-strategy if:
  1. { M^{\tilde{P}}} is positive semidefinite.
  2. For any pair of questions { q,q'\in S\cup T}, { \sum_{c,c'=1}^k \tilde{P}(c,c'|q,q')=1}.
  3. The blocks { M^{\tilde{P}}_{q,q'}} on the diagonal are themselves diagonal { k}-by-{ k} matrices.
Define the winning probability or value of a pseudo-strategy as: { \omega^{\tilde{P}}(G)=\mathbb E_{(s,t)\sim \Pi} \sum_{a,b} \tilde{P}(a,b|s,t)V(a,b|s,t) } The algorithm outputs the maximum winning probability: { \hat\omega(G)=\max_{\tilde{P}\in\mathcal S}\omega^{\tilde{P}}(G) } over pseudo-strategies { \tilde{P}\in \mathcal S}. This maximum is efficiently computable using standard semidefinite programming algorithms. As we will see, actual quantum strategies are in { \mathcal S} which immediately implies { \hat\omega(G)\ge\omega^*(G)}. It then remains to show that the optimal pseudo-strategy can be approximated by an actual entangled strategy, thus bounding the gap from { \omega^*(G)} to { \hat\omega(G)}.

Quantum strategies are in { \mathcal S}

Let us establish that { \mathcal S} is indeed a relaxation of the class of quantum strategies, that is, it contains the quantum strategies. So suppose we are given a quantum strategy. By equation 1 the probability of answers { a} and { b} given questions { s} and { t} is of the form { \langle\phi|A^s_a\otimes B^t_b |\phi\rangle=\langle\phi|(A^s_a\otimes {\mathbb{I}})({\mathbb{I}}\otimes B^t_b )|\phi\rangle } for some PVM’s { A^s} and { B^t} and some { |\phi\rangle\in\mathbb C^d\otimes\mathbb C^d}. This conditional probability distibution is not immediately in the form of a pseudo-strategy because we cannot evaluate it on pairs of Alice-questions or pairs of Bob-questions. We therefore have to extend it, as follows: Place all the column vectors { A^s_a\otimes {\mathbb{I}}|\phi\rangle}, { (s,a)\in S\times \Sigma} side by side, and then append the vectors { I\otimes B^t_b|\phi\rangle}, resulting in a { d^2}-by-{ 2nk} matrix { R}. We then define { {P}(\cdot\cdot|\cdot\cdot):\Sigma^2\times(S\cup T)^2\rightarrow[0,1]} through its matrix form (see the comment below definition 5):

\displaystyle  M^{P}:=R^\dag R=\begin{pmatrix} (\langle\phi|A^s_a A^{s'}_{a'}\otimes {\mathbb{I}} |\phi\rangle)_{(s,a),(s',a')} (\langle\phi|A^s_a\otimes B^t_b |\phi\rangle)_{(s,a),(t,b)} \\\\ (\langle\phi|B^t_b\otimes A^s_a |\phi\rangle)_{(t,b),(s,a)} (\langle\phi|{\mathbb{I}}\otimes B^t_bB^{t'}_{b'} |\phi\rangle)_{(t,b),(t',b')} \end{pmatrix} \ \ \ \ \ (2)

Lemma 7 { M^{P}} defined in 2 is a pseudo-strategy, that is, { M^{P}\in\mathcal S}.
We verify the conditions in definition 6. Condition 1 ({ M^{P}\succeq0}) holds because it is of the form { R^\dag R}. Condition 2 (Each block { M^{P}_{q,q'}} sums to { 1}) holds because PVM’s sum to the identity. Condition 3 (Diagonal blocks { \tilde M=M^{P}_{qq}} are diagonal) holds because the projections in the PVM { A^q} are mutually orthogonal, hence { \langle{\phi}| A^q_cA^q_{c'} |{\phi}\rangle =0} if { c\neq c'}.
Lemma 7 means { \hat\omega(G)=\max_{{P}\in \mathcal S}\omega^{P}(G)} is an (efficiently computable) upper bound for { \omega^*(G)}:


{ \hat\omega\ge \omega^*(G)}.
To finish the proof of theorem 4 we need to show that any pseudo-strategy { \tilde{P}} can be rounded to an actual quantum strategy with answer probabilities { {P}} such that

\displaystyle  1-\omega^{P}\le 6(1-\omega^{\tilde P}) \ \ \ \ \ (3)

Applying this rounding to the optimal pseudo-strategy implies that { 1-\omega^*\le 6(1-\hat\omega) } or { \omega^*\ge 1-6(1-\hat\omega)}, which will finish the proof of theorem 4.
[Proof of theorem 4] Let { \tilde{P}\in\mathcal S} be a pseudo-strategy. We construct a quantum strategy { {P}} approximating { \tilde{P}}. Since { M^{\tilde{P}}} is positive semidefinite we can write { \tilde M=R^\dag R } for some matrix { R\in \mathbb C^{r\times 2nk}} and { r\leq 2nk}. Now let us define { 2nk} vectors { |\tilde u^s_a\rangle} and { |\tilde v^t_b\rangle} in { \mathbb C^{r}}, and let { |{u^s_a}\rangle} and { |{v^t_b}\rangle} be the same vectors normalized. The strategy is constructed as follows. Alice and Bob share the maximally entangled state { |\phi\rangle=\frac1{\sqrt r}\sum_{i=1}^r|i\rangle|i\rangle\in\mathbb C^r\otimes\mathbb C^r } Before deciding on Alice and Bob’s PVM’s { A^s} and { B^t} let us see what this choice of shared state means for the conditional distribution on answers (see equation (1)).

\langle\phi| A^s_a\otimes B^t_b|\phi\rangle =\frac1r\sum_{i,j=1}^r\langle i | A^s_a| j\rangle\langle i | B^t_b| j\rangle =\frac1r A^s_a\cdot B^t_b=\frac1r\langle A^s_a,\overline{B^t_b}\rangle, (*)

where the bar represents entrywise complex conjugation, { \cdot} is the entrywise dot product of matrices, and { \langle\:,\rangle} the entrywise complex inner product (Hilbert-Schmidt inner product). We now choose the measurements. Given question { s}, Alice measures in the PVM { A^s=(A^s_a)_{a=0}^k} with { A^s_a= |{u^s_a}\rangle \langle{u^s_a}| \text{ for }a=1,\ldots,k\text{, and }A^s_0={\mathbb{I}}-\sum_{i=1}^k |{u^s_a}\rangle \langle{u^s_a}| } Similarly, Bob on question { t} applies the PVM { B^t} with { B^t_b= |{v^t_b}\rangle \langle{v^t_b}| \text{ for }b=1,\ldots,k\text{, and }B^t_0={\mathbb{I}}-\sum_{i=1}^k |{v^t_b}\rangle \langle{v^t_b}| } The condition 3 in definition 6 ensures that for any question { s}, the vectors { |{u^s_1}\rangle ,\ldots, |{u^s_1}\rangle} are orthogonal so that this is a valid PVM. The measurement outcome “0” is interpreted as “fail”, and upon getting this outcome the player attempts the measurement again on their share of a fresh copy of { |\phi\rangle_{AB}}. This means that the strategy requires many copies of the entangled state to be shared before the game starts. It also leads to the complication of ensuring that with high probability the players measure the same number of times before outputting their measurement, so that the outputs come from measuring the same entangled state. By (*), at a given round of measurements the conditional distribution of answers is given by { \langle\phi|A^s_a\otimes B^t_b|\phi\rangle=\frac1r\Big\langle |{u^s_a}\rangle \langle{ u^s_a}| \:,\: |{{v^t_b}\rangle } \langle{ {v^t_b}| }\Big\rangle=\frac1r|\langle {u^s_a}|v^t_b\rangle|^2=\frac1{r|\tilde u^s_a|^2|\tilde v^t_b|^2}\Big(M^{\tilde{P}}_{(s,a),(t,b)}\Big)^2, } We wish to relate the LHS to { M^{\tilde{P}}_{(s,a),(t,b)}}, so to handle the factor { \frac1r} each prover performs repeated measurements, each time on a fresh copy of { |\phi\rangle_{AB}}, until getting an outcome { \neq0}. Moreover, to handle the factor { \frac1{|u^s_a|^2|v^t_b|^2}}, each prover consults public randomness and accepts the answer { a\in[k]} with probability { |u^a_s|^2} and { |v^b_t|^2} respectively, or rejects and start over depending on the public randomness. Under a few simplifying conditions (more precisely, assuming that the game is uniform meaning that an optimal strategy exists where the marginal distribution on each prover’s answers is uniform), we can let { M^{\tilde{P}}_{(s,a),(t,b)}\le 1/k} for all { s,a,t,b}, and one can ensure that the conditional probabilities { {P}} of the final answers satisfy

\displaystyle 1/k-P(a,b|s,t)\le 3\big(1/k-k(M^{\tilde P}_{(s,a),(t,b)})^2\big),\ \ \ \ \ (4) At this stage it is important that we are dealing with a unique game . Indeed, by (4) we have for every { s} and { t}, 1-\sum_{a=1}^k P(a,\pi_{st}(a)|s,t)=\sum_a \Big(\frac{1}{k}-P(a,\pi_{st}(a)|s,t) \Big) \\ \leq 3\sum_{b=\pi_{st}(a)} \Big(\frac{1}{k}-k(M^{\tilde{P}}_{(s,a),(t,b)})^2 \Big) \\ \leq 6\sum_{b=\pi_{st}(a)}\big(\frac{1}{k}-M^{\tilde{P}}_{(s,a),(t,b)}\big)=6\Big(1-\sum_{b=\pi_{st}(a)} M^{\tilde{P}}_{(s,a),(t,b)}\Big) where the last inequality follows from concavity. Taking the expectation over { s} and { t} implies the bound (3), thus concluding the proof of theorem 4.

General games are hard

We just saw that a specific class of games becomes easy in the presence of shared entanglement, in that semidefinite programming allows the entangled value { \omega^*} to be approximated to within exponential precision in polynomial time. Does this phenomenon hold more generally, so that the value of entangled games can always be efficiently approximated? We answer in the negative, by constructing a game where { \omega^*} is NP-hard to approximate to within inverse-polynomial factors. The complexity for 2P-1R entangled games can be strengthened to constant-factor NP-hardness, putting it on par with the classical PCP theorem. This result is used to prove (with some conditions) the games formulation of the quantum PCP theorem, which states that the entangled value of general games is QMA-hard to approximate within a constant factor.

Formulation of game

Given any instance { \phi} of a { k}-CSP (constraint satisfaction problem, where { k} is the number of literals), we can define a clause-vs-variable game { G_\phi} (see clause-vs-variable figure):
  1. The referee (verifier) randomly sends a clause to Alice (first prover) and a variable to Bob (second prover).
  2. Alice and Bob reply with assignments.
  3. The referee accepts if Alice’s assignment satisfies the clause and Bob’s answer is consistent with Alice’s.
To show hardness of approximation, we need to go beyond the usual 2-player construction. In particular, in our game [3] one of the players receives an extra dummy question (see subfigure (a)). Mathematically, the result is very similar to introducing another player and having the referee play the 2-player game with two players chosen randomly [4] (see subfigure (b)). In either variation, the quantum phenomenon of monogamy of entanglement , imposing that only two parties can be maximally entangled to one another, is key to establishing hardness. The players do not know where to use their entanglement, which prevents them from coordinating as well as they could in the standard game.
Two variations of a 2-player clause-vs-variable game; new features are in red and shared entanglement is denoted in blue. In the standard game { G_\phi}, given { k}-CSP { \phi=(C_1,\ldots,C_m)} on { n} variables { x_i}, (1) the referee R randomly sends a clause { C_j} to Alice A and a literal index { t\in[k]} to Bob B, (2) A replies with an assignment { (a_1,\ldots,a_k)} and B replies with assignment { b}, (3) R accepts iff { (a_1,\ldots,a_k)} satisfies { C_j} and { a_t=b}. In variation (a), R sends an additional dummy index { l\in[k]}, so that B replies with an additional assignment { b'}, but he does not know which is the right variable. Equivalently, in (b) a third player Charlie C is introduced, but { G_\phi} is played with two randomly chosen players. Since only parties can be maximally entangled and the players do not know who is playing the game, they cannot coordinate perfectly.

NP-hardness of approximating the entangled value

To prove hardness, we rely on several results from classical complexity theory.
Theorem 8 ([6]) Given an instance of 1-in-3 3SAT (a CSP), it is NP-hard to distinguish whether it is satisfiable or no assignments satisfy more than a constant fraction of clauses.
Theorem 9 ([2]) For a PCP game {G} (emulating the CSP) and its oracularization {G'} (transformation to a 2P-1R game),

\displaystyle  \omega(G)\leq \omega(G')\leq 1-\frac{1-\omega(G)}{3}\,. \ \ \ \ \ (5)

Theorem 8 establishes the CSP variant of the classical PCP theorem: distinguishing between { \omega(\phi)=1} and { \omega(\phi)\leq 1/2} is NP-hard for some { k}-CSP. Here, { \omega(\phi)} denotes the maximum fraction of clauses that are simultaneously satisfiable. Theorem 9 relates the general game { G} obtained from the CSP to a two-player one-round game { G'}, in terms of the value (probability of winning) the game. The first inequality, equivalently saying { \omega(\phi)\leq \omega(G_\phi)}, is achieved since the players can answer the questions in the game { G_\phi} to satisfy the clauses in { \phi}. These theorems together imply that { \omega(G_\phi)} is NP-hard to approximate to within constant factors. Allowing the two players to share entanglement can increase the game value to { \omega^*(G_\phi)\geq \omega(G_\phi)}. Classical results do not necessarily carry over, but exploiting monogamy of entanglement allows us to limit the power of entangled strategies. One can show the following lemma, which is weaker than what we have classically.
Lemma 10 ([3]) There exists a constant {c>0} such that for a CSP {\phi},

\displaystyle  \omega^*(G_\phi)\leq 1 - \frac{c(1-\omega(\phi))^2}{n^2}\,, \ \ \ \ \ (6)

where {n} is the number of variables. Combining Theorem 9 and Lemma 10, we have { \omega(\phi)\leq \omega^*(G_\phi)\leq 1 - \frac{c(1-\omega(\phi))^2}{n^2}\,. } Using Theorem 8, approximating { \omega^*(G_\phi)} is NP-hard to within inverse polynomial factors. Proving Lemma 10 takes some work in keeping track of approximations. For simplicity, we will show a less quantitative statement and indicate where the approximations come in.
Proposition 11 (adapted from [13]) {\phi} is satisfiable iff {\omega^*(G_\phi)=1}.

Proof of Proposition 11

The forward direction is straightforward: If { \phi} is satisfiable, then there exists a perfect winning strategy where the questions are answered according to the satisfying assignment. For the reverse direction, suppose there exists a strategy that succeeds with probability 1, specified by a shared entangled state { |{\psi}\rangle \in\mathbb{C}^d\otimes\mathbb{C}^d} and measurements { (A^j_{a_1,\ldots,a_k})_{a_1,\ldots,a_k}} for Alice and { (B^{t,l}_{b,b'})_{b,b'}} for Bob, where the questions { j\in[m]}, { t,l\in[k]} and the answers { a,b} are from the CSP’s alphabet. Since one of the questions/answers for Bob corresponds to a dummy variable that is irrelevant to the game, trace over the dummy variable to define a new measurement operator { \tilde{B}^t_b=\frac{1}{n}\sum_{l,b'} B^{t,l}_{b,b'}}. We can introduce a distribution on assignments to the { n} relevant variables,

\displaystyle  p(a_1,\ldots,a_n)=\lVert \mathbb{I} \otimes \tilde{B}^1_{a_1}\cdots\tilde{B}^n_{a_n} |\psi\rangle \rVert^2\,.  \ \ \ \ \ (7)

If we show that the distribution for assignments {a_{i_1},\ldots,a_{i_k}} on variables {x_{i_1},\ldots,x_{i_k}} in any clause {C_j} is

\displaystyle  p(a_{i_1},\ldots,a_{i_k})= \lVert A^j_{a_{i_1},\ldots,a_{i_k}} \otimes \mathbb{I} |\psi\rangle \rVert^2\,,  \ \ \ \ \ (8)

then, since the players win with certainty, {\phi} has a satisfying assignment. To transform Eq. 7 to Eq. 8, we need a relation between the {A} and {\tilde{B}} measurement operators and a way to commute the {\tilde{B}} operators. The success probability of the players’ strategy is expressed as { P=\frac{1}{m}\sum_{j\in[m]}\frac{1}{k}\sum_{i\in C_j} \sum_{(a_1,\ldots,a_k)\vdash C_j} \langle{\psi}| A^j_{a_1,\ldots,a_k}\otimes \tilde{B}^i_{a_i} |{\psi}\rangle \,, } where { i} is the index of one of the { k} variables on which { C_j} acts, and { (a_1,\ldots,a_k)\vdash C_j} indicates that the assignment satisfies the clause. By positivity and summation to identity of the measurement operators { A} and { \tilde{B}}, each term is at most 1; for our hypothesis { P=1}, each has to be 1. Hence, using orthogonality of the vectors { {\mathbb{I}}\otimes\tilde{B}^i_{a_i} |{\psi}\rangle} for different { a_i}, we have

\displaystyle  \sum_{\substack{(a_{i_1},\ldots,a_{i_k})\vdash C_j \\ a_i=b}} A^j_{a_{i_1},\ldots,a_{i_k}}\otimes \mathbb{I} |\psi\rangle = \mathbb{I} \otimes \tilde{B}^i_{a_i}|\psi\rangle\,,  \ \ \ \ \ (9)

for any { j\in[m]} and { i\in C_j}. We now demonstrate that two different { \tilde{B}^{t_1}_{b_1}}, { \tilde{B}^{t_2}_{b_2}} commute, so that Bob can match any satisfied clause/variable. {\mathbb{I}} \otimes \tilde{B}^{t_1}_{b_1} \tilde{B}^{t_2}_{b_2} |{\psi}\rangle =  {\mathbb{I}} \otimes (\frac{1}{n}\sum_{l_1,b_1'} B^{t_1,l_1}_{b_1,b_1'}) (\frac{1}{n}\sum_{l_2,b_2'} B^{t_2,l_2}_{b_2,b_2'}) |{\psi}\rangle \\ = {\mathbb{I}} \otimes (\frac{1}{n}\sum_{t_2,b_1'} B^{t_1,t_2}_{b_1,b_1'}) (\frac{1}{n}\sum_{t_1,b_2'} B^{t_2,t_1}_{b_2,b_2'}) |{\psi}\rangle \\ = {\mathbb{I}} \otimes \frac{1}{n^2}\sum_{t_1,t_2} B^{t_1,t_2}_{b_1,b_2} |{\psi}\rangle \\ = {\mathbb{I}} \otimes \frac{1}{n^2}\sum_{t_1,t_2} B^{t_2,t_1}_{b_2,b_1} |{\psi}\rangle \\ = {\mathbb{I}} \otimes \tilde{B}^{t_2}_{b_2} \tilde{B}^{t_1}_{b_1} |{\psi}\rangle In the second line, we used (9) to relate the measurements. The third line follows by the orthogonality of { B^{t_1,t_2}_{b_1,b_2}} for different { b_1,b_2}. For the fourth equation, we simply swap { t_1,t_2} since the questions are indistinguishable to Bob. Thus, we can see how the dummy variable comes into play. If we had assumed { P=1-\epsilon} and kept track of approximations, we would find

\displaystyle  \frac{1}{n^2}\sum_{t_1,b_1}\sum_{t_2,b_2}\lVert \mathbb{I} \otimes (\tilde{B}^{t_1}_{b_1} \tilde{B}^{t_2}_{b_2} - \tilde{B}^{t_2}_{b_2} \tilde{B}^{t_1}_{b_1}) |\psi\rangle \rVert^2 = O(\epsilon)\,.  \ \ \ \ \ (10)

This approximate commutativity results in the hardness of approximation holding only to within inverse poly{ (n)} factors. Now we are ready to transform Eq. 7 to Eq. 8 to conclude the proof. p(a_1,\ldots,a_n)=\lVert {\mathbb{I}} \otimes \tilde{B}^1_{a_1} \ldots \tilde{B}^n_{a_n} |{\psi}\rangle \rVert^2 \\ =\lVert {\mathbb{I}} \otimes \tilde{B}^{i_1}_{a_{i_1}} \ldots \tilde{B}^{i_k}_{a_{i_k}} |{\psi}\rangle \rVert^2 \\ = \lVert A^j_{a_{i_1},\ldots,a_{i_k}} \otimes {\mathbb{I}} |{\psi}\rangle \rVert^2 \\ =p(a_{i_1},\ldots,a_{i_k})\,. In the second line, we used (10) to commute the measurement operators, along with their properties of orthogonality and summation to identity. For the third equality, we used (9) to relate Bob’s measurements to Alice’s, along with orthogonality of { A^j_{a_{i_1,\ldots,i_k}}} for different { a_{i_1,\ldots,i_k}}.

Constant-factor NP-hardness

The weakness in the above two-player game carries over from the original three-player variant. Thus, to achieve constant-factor NP-hardness of approximation, we could start with a different multiplayer game. Vidick [11] establishes the soundness of the “plane-vs-point” low-degree test (checking that the restriction of a low-degree polynomial to a plane matches its value at some point) in the presence of shared entanglement. Soundness , in the eponymous probabilistically checkable proof (PCP) formulation of the PCP theorem, refers to the verifier accepting a wrong proof with some bounded probability; bounding with a constant maps to constant-factor hardness of approximation. Here, soundness comes from a strong bound on error accumulation, similar to our approximate commutativity, but relies on the players’ Hilbert space being decomposable into three parts (i.e., there being three players). The particular game is constructed by combining the low-degree test with the 3-SAT test (encoding satisfying assignments in a low-degree polynomial), which can be reduced to the three-player QUADEQ test (testing satisfiability of a system of quadratic equations in binary variables, which is NP-complete). By the strong soundness result, the entangled value is NP-hard to approximate to within constant factors. Natarajan et al. [7] show that soundness holds even for two players, using a semidefinite program. They then construct a two-player game in a way similar to what we demonstrated.

Constant-factor QMA-hardness

The above can be thought of as the games formulation of the classical PCP theorem holding under shared entanglement. A true quantum PCP theorem states that the entangled value of general games is QMA-hard to approximate to within constant factors. Natarajan et al. [8] establish such a theorem, but under randomized reductions. This requirement stems from the lack of a sufficiently strong QMA-hardness result for local Hamiltonians (the quantum analog of CSPs). The soundness of the two-player low-degree test above is one instrumental component in the proof.

How much entanglement is needed?

We now focus on the question of quantifying exactly how much entanglement is needed to play XOR games optimally. As we shall see, the answer depends on the size of the question sets posed to Alice & Bob in the game. The previous bound given by Tsirelson [10] (see table below) is tight for certain families of games, but is not tight for other families of games (such as a generalization of the CHSH game). The reason for this discrepancy is closely tied in with the the properties of the representation of the Observables that form the Optimal Strategy ({ |{\psi}\rangle , \{A_{i}\}_{i \in S}, \{B_{j}\}_{j \in T}}). Slofstra [9] shows that if the Observables constitute a Clifford Algebra (that is, the solutions are pair-wise anti-commutative), then the strategy is minimally entangled (uses the least number of entangled bits) iff the strategy is a unique solution to the SDP rounding problem. As a trivial corollary, if the SDP rounding problem does not have a unique solution (and a correspondingly unique strategy), then there exists a Non-Clifford optimal strategy that uses (atleast) { |T|} bits of entanglement less than the Clifford strategy. Slofstra further states that minimally entangled { \epsilon}-optimal strategies may be constructed for XOR games where the optimal strategies have ‘stable’ representations. For the purposes of this post, we will analyze the exact result and merely state the approximate result.

Main Results


For the exact realm, the table below summarizes Slofstra and Tsirelson’s main results.
Person Strategy Bound(entangled bits)
Slofstra (Possibly) Non-Clifford { \log_{2}(N)}
Tsirelson Clifford { \left \lfloor{\frac{r}{2}}\right \rfloor}
Here, { r} is the largest integer such that { \binom{r + 1}{2} < |S| + |T|} and corresponds to the maximum-rank of an extremal point in the quantum correlation matrix corresponding to an optimal strategy. { N} is the minimum dimension of the representations of the Operators ({ \{B_{j}\}}).


In the approximate realm, the minimum entanglement dimension of the representation of the Operators from an { \epsilon}-Optimal Strategy is: min({ \mathcal{O}(\epsilon^{\frac{-1}{12}}), 2^{\left \lfloor{\frac{r}{2}}\right \rfloor}}). As we shall see, Slofstra’s theorem allows us to recover Tsirelson’s bound easily by using a fact from Representation Theory about the irreducible representations of Clifford Algebras, but stands as a more general lower bound for solutions that aren’t Clifford.

Marginals and Solution Algebras

We’ll begin by introducing 3 key ideas: i) Degeneracy { \leftrightarrow} Non-Degeneracy ii) Existence of Marginals iii) Solution Algebra Once these ideas are defined and their notions made clear, we will be in a position to state the main result and sketch a proof for it.

Definition (Marginal Strategy)

Given an Optimal Quantum Strategy ({ |{\psi}\rangle , \{A_{i}\}_{i \in S}, \{B_{j}\}_{j \in T}}), a marginal constitutes { \{B_{j}\}_{j \in T}}, and the partial trace of { \psi} with respect to { H_{A}} ({ \rho_{B}}). It is also possible to dualize the definition for obtaining { \{A_{i}\}_{i \in S}} and { \rho_{A}}. We now define the notion of degeneracy, which is critical when proving the main theorem. The main point to drive home is that a degenerate optimal quantum strategy can be reduced to a unique, non-degenerate optimal quantum strategy.

Definition (Degenerate Quantum Strategy)

A quantum strategy ({ |{\psi}\rangle , \{A_{i}\}_{i \in S}, \{B_{j}\}_{j \in T}}) is said to be degenerate if { \exists (P \in H_A)}, { (Q \in H_B)} such that: i) { P} commutes with all { A_i} and { (P \otimes {\mathbb{I}}) |{\psi}\rangle = |{\psi}\rangle} ii) { Q} commutes with all { B_j} and { ({\mathbb{I}} \otimes Q) |{\psi}\rangle = |{\psi}\rangle} Since we can efficiently construct for any degenerate Optimal Quantum Strategy a unique, non-degenerate Optimal Quantum Strategy, we will now assume WLOG that every Optimal Quantum Strategy is non-degenerate (and unique). We now define the (unique) existence of marginal biases, which correspond to constants for the rows of the quantum correlation matrix (which is a generalization of the classical pay-off). An equivalent statement can be made for columns ({ d_{j}}) by dualizing the existence of row marginals. These constants can be thought of as representing the (expected) optimum-payoff possible for a set of operator choices by one player, given that the other player’s choice is fixed. Intuitively, this can be seen as “collapsing” the quantum correlation matrix into a column, by summing over the rows (or collapsing into a row, by summing over the columns).
Lemma 12 (Existence of Marginals) For all {m \times n} XOR games G, {\exists \{c_{i} \geq 0 \hspace{1mm} | \hspace{1mm} i \in |S|\}}, such that, if {\{u_{i}\}_{i \in |S|}, \{v_{j}\}_{j \in |T|}} form an {\epsilon}-optimal vector strategy where {\epsilon \leq \frac{1}{4(m+n)}},

\displaystyle  \|\sum_{j\in|T|}G_{ij}v_{j} - c_{i}u_{i}\| \leq \sqrt{10}(m + n)^{\frac{1}{4}}\epsilon^{\frac{1}{4}}, \forall i  \ \ \ \ \ (11)

If { \epsilon = 0} and our strategy is perfectly optimal, we recover an exact estimation of the marginal biases:

\displaystyle  \sum_{j\in|T|}G_{ij}v_{j} = c_{i}u_{i}, \forall i  \ \ \ \ \ (12)

The proof for the above lemma provided by Slofstra relies on using techniques to analyze the structure of the SDP program that pertains to quantum marginals. In particular, conducting trace analysis on SDP matrices that correspond to using the game matrix as off-diagonal elements leads us to the construction of the desired marginal biases. It is also critical to note that a dual statement allows us to recover the column biases { d_{j} \geq 0}:

\displaystyle  \sum_{i\in[|S|]}G_{ij}u_{i} = d_{j}v_{j}, \forall j  \ \ \ \ \ (13)

We now move on to defining the notion of a solution algebra.
Definition 13 (Solution Algebra) A solution algebra {\mathcal{A}} consists of self-adjoint (Hermitian) operators {X_{j}} that satisfy the following predicates:

\displaystyle  X_{j}^{2} = \mathbb{I}, \forall 1 \leq j \leq n  \ \ \ \ \ (14)

\displaystyle  (\sum_{j\in[|T|]}G_{ij}X_{j})^{2} = (c_{i})^{2}\cdot\mathbb{I}, \forall i  \ \ \ \ \ (15)

The definition above merely enforces the property that our unknown marginal operators be Hermitian (14) and that they respect the optimal marginal biases (or payoffs) (15) we saw in (11), so that they correspond to being constructed from an optimal vector strategy. These unknown operators will be mapped to operators that are the marginal strategy corresponding to the optimal quantum strategy. This is at the heart of the main theorem we will now state: Theorem 14 (Slofstra, 2010) Given a {m \times n} XOR game G (with no zero rows or columns) and a solution algebra {\mathcal{A}}, a collection of Linear Operators {\{B_{j}\}} and density matrix {\rho} are the marginal of an optimal strategy iff the map {X_{j} \rightarrow B_{j}} induces a density-matrix representation of {\mathcal{A}} and {\rho} commutes with {im(\mathcal{A})}. Put simply, the theorem states that our unknown self-adjoint operators map to an optimal marginal strategy iff the density matrix (traced from the joint Hilbert-Space) commutes with all the mapped operators ({ \{B_{j}\}}). The result we desire on the lower bound for the number of entangled bits, given a mapping from these indeterminate operators to the marginal of an optimal strategy, comes from a corollary to (14).
Corollary 15 Given a {m \times n} XOR game G (with no zero rows or columns) and a solution algebra {\mathcal{A}} with minimum dimension {N} among non-zero representations, the strategy for minimum entanglement uses {\log_{2}(N)} entangled bits. The proof for this corollary follows from the eigenspace decomposition of the joint Hilbert Space { H} in terms of { \rho}, which is preserved by the action of { im(\mathcal{A})}. As a result, each eigenspace decomposes into a finite sum of irreducible representations of { \mathcal{A}}. The minimum entanglement is realized when there is exactly one invariant subspace (with one irreducible representation). The entanglement used by such a representation is { \log_{2}(\text{dim}\,H)}.

Proof of Theorem 20

The rest of the section is dedicated to sketching a brief (but formal) proof for Theorem (14), and then using a simple fact about the representations of a Clifford Algebra to show how Slofstra’s result subsumes Tsirelson’s bound. For this section, { |{\psi}\rangle} refers to an arbitrary state in { H = H_{A} \otimes H_{B}} (the joint Hilbert space). We can write { |{\psi}\rangle = \sum_{i} |{i}\rangle \lambda |{i}\rangle} over some basis { \{{i}\}}, where { \lambda} is a linear map. Then, the partial trace of { \psi} over { H_{A}} is given by { \rho = \lambda\lambda^{*}}. Let { \mathcal{B}_{A}, \mathcal{B}_{B}} denote the algebra generated by { A_{1},..,A_{m}} and { B_{1},..,B_{n}}. Here, the generating elements are the observables of an optimal quantum strategy. To arrive at a proof for the theorem, we will rely on 2 additional lemmas which we will not prove but state.
Lemma 16 Given Hermitian operators {A}, {B \in H_{A}, H_{B}},

\displaystyle  \|(A \otimes \mathbb{I} - \mathbb{I} \otimes B)|\psi\rangle\| \leq \|\lambda\overline{A} - B\lambda\|_{F}  \ \ \ \ \ (16)

This allows us to conclude that,

\displaystyle  \|(A \otimes \mathbb{I} - \mathbb{I} \otimes B)|\psi\rangle\| \leq \epsilon \implies \|\rho(B) - B\rho\|_{F} \leq 2\epsilon  \ \ \ \ \ (17)

Lemma 17 The optimal strategy in question is non-degenerate iff

\displaystyle  closure(\mathcal{B}_{B}\lambda H_{A}) = closure(\mathcal{B}_{A}\lambda^{*}H_{B}). \\  \ \ \ \ \ (18)

As a special case:

\displaystyle  closure(\mathcal{B}_{B}\lambda H_{A}) = H_{B} \leftrightarrow closure(\rho H_{B}) = H_{B}  \ \ \ \ \ (19)

Forward direction : We use the first lemma to prove commutativity of \rho with all B_{j}, and we use the second lemma to show that the closure of \rho is { H_{B}}. We first show the forward direction: Suppose we are given an optimal quantum strategy (|{\psi}\rangle , \{A_{i}\}_{i \in S}, \{B_{j}\}_{j \in T}) for a game { G}. Then, we fix our optimal vector strategy as:

\displaystyle  u_{i} = (A_{i} \otimes \mathbb{I})|\psi\rangle  \ \ \ \ \ (20)

\displaystyle  v_{j} = (\mathbb{I} \otimes B_{j})|\psi\rangle  \ \ \ \ \ (21)

We can now use Equations (11) and (13) to establish our optimal marginal biases to write a relationship between them and { |{\psi}\rangle} , and apply Lemma (16) to show commutativity and Lemma (17) to show that { im(\mathcal{A})} = cyclic({ B_{j}, \rho}). Using (13) with (20) and (21), we have:

\displaystyle  d_{j}(\mathbb{I}\otimes B_{j})|\psi\rangle = \sum_{i}G_{ij}(A_{i}\otimes\mathbb{I})|\psi\rangle  \ \ \ \ \ (22)

We can now use (17) with { \epsilon = 0} on (22) to see that { \rho} commutes with every { B_{j}}. Additionally, as the terms in (22) constitute linear combinations of { A_{i}} and { B_{j}}, we can compute the closure of their actions on { \lambda H_{A}} and { \lambda^{*} H_{B}}, which will be equivalent. Therefore, { im(\rho)} = { H_{B}}, which follows from the special case of (19). For the dual case, we substitute (20) and (21) into (12):

\displaystyle  c_{i}(A_{i}\otimes\mathbb{I})|\psi\rangle = \sum_{j}G_{ij}(\mathbb{I}\otimes B_{j})|\psi\rangle

On taking the norm of the above on both sides and using a little algebra, we finally obtain the fact that { \{B_{j}\}} satisfy predicate (15) making them the representations of { X_{j}}: { (\sum_{j}G_{ij}B_{j})^{2} = c_{i}^{2}{\mathbb{I}} } This shows that the map from { X_{j} \rightarrow B_{j}} computes a density matrix representation of { \mathcal{A}}, where { \rho} commutes with all { B_{j}}.
Backward Direction : The proof for the backward direction is much less involved: If we knew that { \{B_{j}\}, \rho} constituted the cyclic representation of { \mathcal{A}} with commutativity (with { B_{j}}), then we can use Lemma (19) to conclude that the image of { \rho} on { H} would form a subspace of { \lambda H}. We define: { \overline{A}_{i} = \frac{\sum_{j}G_{ij}B_{j}}{c_{i}} } allowing us to recover our original marginal biases { c_{i}} that satisfy (23) and therefore correspond to the optimal strategy. This shows us that { \{A_{i}\}}, { \{B_{j}\}} would constitute an optimal quantum strategy.
Having proved this theorem, we now obtain Corollary 15, which is the main desired result. To see how it subsumes Tsirelson’s result as a special case, we use a simple fact from Representation Theory:
Lemma 18 For a Clifford Algebra generated by {X_{1},..,X_{r}}, there exist one or two irreducible representations of dimension {2^{\left \lfloor{\frac{r}{2}}\right \rfloor}} Plugging Lemma 18 into Corollary 15, we simply recover the fact that the number of entangled bits of a solution algebra that is Clifford is { \left \lfloor{\frac{r}{2}}\right \rfloor}. However, note that being Clifford means an extra constraint: { X_{i}X_{j} = -X_{j}X_{i}}, { \forall i, j} The constraints on the Solution Algebra (14), (15) given by Slofstra do \textit{not} necessarily mean that the solution is Clifford. In fact, when an optimal quantum strategy with minimal entanglement is Clifford, { \{A_{i}\}}, { \{B_{j}\}} are constructed from a unique set of { \{u_{i}\}}, { \{v_{j}\}}. To end, we write down a lemma that shows there exist XOR games where the optimal strategy is not unique and for minimal entanglement, a solution generated by a Non-Clifford algebra must be used:

Lemma (Existence of XOR games with Non-Clifford optimal strategies)

There exist a family of { m \times n} XOR games { \{G\}} that correspond to generalizations of the CHSH games ({CL_{n}}), such that, the optimal strategy of minimal entanglement is Non-Clifford.


[1] David Avis, Sonoko Moriyama, and Masaki Owari. From bell inequalities to tsirelson’s theorem. IEICE Transactions, 92-A(5):1254–1267, 2009.

[2] Lance Fortnow, John Rompel, and Michael Sipser. On the power of multi-prover interactive protocols. Theoretical Computer Science, 134(2):545 – 557, 1994.

[3] T. Ito, H. Kobayashi, and K. Matsumoto. Oracularization and Two-Prover One-Round Interactive Proofs against Nonlocal Strategies. ArXiv e-prints, October 2008.

[4] J. Kempe, H. Kobayashi, K. Matsumoto, B. Toner, and T. Vidick. Entangled games are hard to approximate. ArXiv e-prints, April 2007.

[5] Julia Kempe, Oded Regev, and Ben Toner. Unique games with entangled provers are easy. SIAM Journal on Computing, 39(7):3207– 3229, 2010.

[6] S. Khanna, M. Sudan, L. Trevisan, and D. Williamson. The approximability of constraint satisfaction problems. SIAM Journal on Computing, 30(6):1863–1920, 2001.

[7] Anand Natarajan and Thomas Vidick. Two-player entangled games are NP-hard. arXiv e-prints, page arXiv:1710.03062, October 2017.

[8] Anand Natarajan and Thomas Vidick. Low-degree testing for quantum states, and a quantum entangled games PCP for QMA. arXiv e-prints, page arXiv:1801.03821, January 2018.

[9] William Slofstra. Lower bounds on the entanglement needed to play xor non-local games. CoRR, abs/1007.2248, 2010.

[10] B.S. Tsirelson. Quantum analogues of the bell inequalities. the case of two spatially separated domains. Journal of Soviet Mathematics, 36(4):557–570, 1987.

[11] Thomas Vidick. Three-player entangled XOR games are NP-hard to approximate. arXiv e-prints, page arXiv:1302.1242, February 2013.

[12] Thomas Vidick. Cs286.2 lecture 15: Tsirelson’s characterization of xor games. Online, December 2014. Lecture Notes.

[13] Thomas Vidick. Cs286.2 lecture 17: Np-hardness of computing \omega^*(G). Online, December 2014. Lecture Notes.

Towards Quantum PCP: A Proof of the NLETS Theorem

By Abhijit Mudigonda, Richard Wang, and Lisa Yang

This is part of a series of blog posts for CS 229r: Physics and Computation. In this post, we will talk about progress made towards resolving the quantum PCP conjecture. We’ll briefly talk about the progression from the quantum PCP conjecture to the NLTS conjecture to the NLETS theorem, and then settle on providing a proof of the NLETS theorem. This new proof, due to Nirkhe, Vazirani, and Yuen, makes it clear that the Hamiltonian family used to resolve the NLETS theorem cannot help us in resolving the NLTS conjecture.


We are all too familiar with NP problems. Consider now an upgrade to NP problems, where an omniscient prover (we’ll call this prover Merlin) can send a polynomial-sized proof to a BPP (bounded-error probabilistic polynomial-time) verifier (and we’ll call this verifier Arthur). Now, we have more decision problems in another complexity class, MA (Merlin-Arthur). Consider again, the analogue in the quantum realm where now the prover sends over qubits instead and the verifier is in BQP (bounded-error quantum polynomial-time). And now we have QMA (quantum Merlin-Arthur).

We can show that there is a hierarchy to these classes, where NP \subseteq MA \subseteq QMA.

Our goal is to talk about progress towards a quantum PCP theorem (and since nobody has proved it in the positive or negative, we’ll refer to it as a quantum PCP conjecture for now), so it might be a good idea to first talk about the PCP theorem. Suppose we take a Boolean formula, and we want to verify that it is satisfiable. Then someone comes along and presents us with a certificate — in this case, a satisfying assignment — and we can check in polynomial time that either this is indeed a satisfying assignment to the formula (a correct certificate) or it is not (an incorrect certificate).

But this requires that we check the entire certificate that is presented to us. Now, in comes the PCP Theorem (for probabilistically checkable proofs), which tells us that a certificate can be presented to us such that we can read a constant number of bits from the certificate, and have two things guaranteed: one, if this certificate is correct, then we will never think that it is incorrect even if we are not reading the entire certificate, and two, if we are presented with an incorrect certificate, we will reject it with high probability [1].

In short, one formulation of the PCP theorem tells us that, puzzingly, we might not need to read the entirety of a proof in order to be convinced with high probability that it is a good proof or a bad proof. But a natural question arises, which is to ask: is there a quantum analogue of the PCP theorem?


The answer is, we’re still not sure. But to make progress towards resolving this question, we will present the work of Nirkhe, Vazirani, and Yuen in providing an alternate proof of an earlier result of Eldar and Harrow on the NLETS theorem.

Before we state the quantum PCP conjecture, it would be helpful to review information about local Hamiltonians and the k-local Hamiltonian problem. A previous blog post by Ben Edelman covers these topics. Now, let’s state the quantum PCP conjecture:

(Quantum PCP Conjecture): It is QMA-hard to decide whether a given local Hamiltonian H = H_{1} + ... + H_{m} (where each ||H_{i}|| \leq 1) has ground state energy at most a or at least b when b-a \geq c||H|| for some universal constant c > 0.

Recall that MAX-k-SAT being NP-hard corresponds to the k-local Hamiltonian problem being QMA-hard when b-a \geq 1/poly(n). (We can refer to Theorem 4.1 in these scribed notes of Ryan O’Donnell’s lecture, and more specifically to Kempe-Kitaev-Regev’s original paper for proof of this fact.) The quantum PCP conjecture asks if this is still the case when the gap is c||H||.

Going back to the PCP theorem, an implication of the PCP theorem is that it is NP-hard to approximate certain problems to within some factor. Just like its classical analogue, the qPCP conjecture can be seen as stating that it is QMA-hard to approximate the ground state energy to a factor better than c||H||.

Reformulation: NLTS conjecture

Let’s make the observation that, taking a to be the ground state energy, the qPCP conjecture sort of says that there exists a family of Hamiltonians for which there is no trivial state (a state generated by a low depth circuit) such that the energy is at most c||H|| above the ground state energy.

Freedman and Hastings came up with an easier goal called the No Low-Energy Trivial States conjecture, or NLTS conjecture. We expect that ground states of local Hamiltonians are sufficiently hard to describe (if NP \neq QMA). So low-energy states might not be generated by a quantum circuit of constant depth. More formally:

(NLTS Conjecture): There exists a universal constant \epsilon > 0 and a family of local Hamiltonians \{H^{(n)}\}_{n=1}^{\infty} where H^{(n)} acts on n particles and consists of m_{n} local terms, s.t. any family of states \{|\psi_{n}\rangle\} satisfying \langle \psi_{n} | H^{(n)} | \psi_{n}\rangle \leq \epsilon||H^{(n)}|| + \lambda_{min}(H^{(n)}) requires circuit depth that grows faster than any constant.

To reiterate, if we did have such a family of NLTS Hamiltonians, then it we wouldn’t be able to give “easy proofs” for the minimal energy of a Hamiltonian, because we couldn’t just give a small circuit which produced a low energy state.

Progress: NLETS theorem

\epsilon -error states are states that differ from the ground state in at most \epsilon n qubits. Now, consider \epsilon-error states (which “agree” with the ground state on most qubits). Then for bounded-degree local Hamiltonians (analogously in the classical case, those where each variable participates in a bounded number of clauses), these states are also low energy. So any theorem which applies to low energy states (such as the NLTS conjecture), should also apply to states with \epsilon-error (as in the NLETS theorem).

To define low-error states more formally:

Definition 2.1 (\epsilon -error states): Let \rho, \sigma \in D((\mathbb{C}^{d})^{\otimes n}) (the space of positive semidefinite operators of trace norm equal to 1 on (\mathbb{C}^{d})^{\otimes n}). Let H be a local Hamiltonian acting on (\mathbb{C}^{d})^{\otimes n}. Then:

  • \sigma is an \epsilon-error state of \rho if \exists S \subseteq [n] of size at most \epsilon n s.t. \text{Tr}_{S}(\rho) = \text{Tr}_{S}(\sigma).
  • \sigma is an \epsilon-error state for H if \exists \rho s.t. \text{Tr}(H\rho) = \lambda_{min}(H) and \sigma is an \epsilon-error state for \rho.

Here, see that \text{Tr}_{S} is just the partial trace on some subset of integers S , like we’re tracing out or “disregarding” some subset of n qubits.

In 2017, Eldar and Harrow showed the following result which is the NLETS theorem.

Theorem 1 (NLETS Theorem): There exists a family of 16-local Hamiltonians \{H^{(n)}\} s.t. any family of \epsilon-error states \{|\Phi_{n}\rangle\} for \{H^{(n)}\} requires circuit depth \Omega(\log n) where \epsilon = 10^{-9}.

In the next two sections, we will provide background for an alternate proof of the NLETS theorem due to Nirkhe, Vazirani, and Yuen. After this, we will explain why the proof of NLETS cannot be used to prove NLTS, since the local Hamiltonian family we construct for NLETS can be linearized. Nirkhe, Vazirani, and Yuen’s proof of NLETS makes use of the Feynman-Kitaev clock Hamiltonian corresponding to the circuit generating the cat state (Eldar and Harrow make use of the Tillich-Zemor hypergraph product construction; refer to section 8 of their paper). What is this circuit? It is this one:

Image from [2]

First, we apply the Hadamard gate (drawn as \boxed{H}) which maps the first qubit |0\rangle \rightarrow \frac{|0\rangle + |1\rangle}{\sqrt{2}}. Then we can think of the CNOT gates (drawn as \bullet-\oplus) as propagating whatever happens to the first qubit to the rest of the qubits. If we had the first qubit mapping to 0, then the rest of the qubits map to 0, and likewise for 1. This generates the cat state |\textsf{CAT}_{n}\rangle = \frac{|0\rangle^{\otimes n} + |1\rangle^{\otimes n}}{\sqrt{2}}, which is highly entangled.

Why do we want a highly entangled state? Roughly our intuition for using the cat state is this: if the ground state of a Hamiltonian is highly entangled, then any quantum circuit which generates it has non-trivial depth. So if our goal is to show the existence of local Hamiltonians which have low energy or low error states that need deep circuits to generate, it makes sense to use a highly entangled state like the cat state.

Quantum circuits

Image from [2]

(We’ll write that the state of a qudit – a generalization of a qubit to more than two dimensions, and in this case q dimensions – is a vector in \mathbb{C}^{q}. In our diagram above, we’ll see 4 qudits, labelled appropriately.)

Let’s briefly cover the definitions for the quantum circuits we’ll be using.

Let U be a unitary operator acting on a system of n qudits (in other words, acting on (\mathbb{C}^{q})^{\otimes n}), where U = U_{m} \hdots U_{1}. Here, each U_{i} is a unitary operator (a gate) acting on at most two qudits, and U is a product of m such operators.

If there exists a partition U into products of non-overlapping two-qudit unitaries (we call these layers and denote them as L_{i} = \bigotimes_{j}U_{ij}, where each U_{j} here is in layer L_{i}) such that U = L_{d} \hdots L_{1} then we say U has d layers.

In other words, U has size m and circuit depth d.

Lightcones, effect zones, shadow zones

Consider U = L_{d} \hdots L_{1} and A an operator.

For j < d define K^{(j)} as the gates in layer j whose supports overlap that of any gate in K^{(j+1)}, …, K^{(d)} or with A.

Definition 3.1 (lightcone): The lightcone of A with respect to U is the union of K^{(j)}: K_{U} \triangleq \bigcup_{j} K^{(j)}.

So we can think of the lightcone as the set of gates spreading out of A all the way to the first layer of the circuit. In our diagram, the lightcone of A is the dash-dotted region. We have K^{(3)} = \varnothing, K^{(2)} = \{U_{21}\}, and K^{(1)} = \{U_{11}, U_{12}\}.

We also want a definition for what comes back from the lightcone: the set of gates from the first layer (the widest part of the cone) back to the last layer.

Define E^{(1)} = K^{(1)}. For j \geq 2, let E^{(j)} be the set of gates whose supports overlap with any gate in E^{(j-1)}.

Definition 3.2 (effect zone): The effect zone of A with respect to U is the union E_{U}(A) \triangleq \bigcup_{j} E^{(j)}.

In our diagram, see that E^{(1)} = \{U_{11}, U_{12}\}, E^{(2)} = \{U_{21}\}, and E^{(3)} = \{U_{31}\}. The effect zone of A is the dotted region.

Definition 3.3 (shadow of the effect zone): The shadow of the effect zone W_{U}(A) of A with respect to U is the set of qudits acted on by the gates in the effect zone.

In our diagram, the first three qudits are effected by gates in the effect zone. So W_{U}(A) = \{1, 2, 3\}.

Given all of these definitions, we make the following claim which will be important later, in a proof of a generalization of NLETS.

Claim 3.1 (Disjoint lightcones): Let U be a circuit and A, B operators. If the qudits B acts on are disjoint from W_{U}(A), then the lightcones of A and B in U are disjoint.

Toward the Feynman-Kitaev clock

Now we’ll give some definitions that will become necessary when we make use of the Feynman-Kitaev Hamiltonian in our later proofs.

Let’s define a unary clock. It will basically help us determine whatever happened at any time little t along the total time big T. Let |\textsf{unary}(t, T)\rangle = |0\rangle^{\otimes(T-t)} \otimes |1\rangle^{\otimes t}. For our purposes today, we won’t worry about higher dimensional clocks. So we’ll write |\textsf{clock}_{k}(t, T)\rangle, but we’ll really only consider the case where k = 1, which corresponds to |\textsf{unary}(t, T)\rangle. For simplicity’s sake, we will henceforth just write |\textsf{unary}(t)\rangle.

Our goal is to construct something a little similar to the tableaux in the Cook-Levin theorem, so we also want to define a history state:

Definition 4.1 (History state): Let C be a quantum circuit that acts on a witness register and an ancilla register. Let C_{1}, ..., C_{T} denote the sequence of two-local gates in C. Then for all k \in \mathbb{N}, a state |\Psi\rangle is a k-dimensional history state of C if:

\begin{aligned}|\Psi\rangle = \frac{1}{\sqrt{T+1}}\sum_{t=0}^{T}|\textsf{clock}_{k}(t, T)\rangle \otimes |\psi_{t}\rangle\end{aligned}

where we have the clock state to keep track of time and \psi_{t} is some state such that |\psi_{t}\rangle = C_{t}|\psi_{t-1}\rangle and |\psi_{0}\rangle = |\xi\rangle_{witness} \otimes |0\rangle_{ancilla}. With this construction, we should be able to make a measurement to get back the state at time t.

Proof of NLETS

We provide a proof of (a simplified case of) the NLETS theorem proved by Nirkhe, Vazirani, and Yuen in [2].

Theorem 2 (NLETS): There exists a family of 3-local Hamiltonians \{H^{(n)}\} on a line (Each Hamiltonian H^{(n)} can be defined on n particles arranged on a line such that each local Hamiltonian acts on a particle and its two neighbors) such that for all n \in \mathbb{N}, the circuit depth of any \epsilon-error ground state for H^{(n)} is at least logarithmic in n .

First, we’ll show the circuit lower bound. Then we’ll explain why these Hamiltonians can act on particles on a line and what this implies about the potential of these techniques for proving NLTS.

Proof: We will use the Feynman-Kitaev clock construction to construct a 5-local Hamiltonian \mathcal{H}^{(n)} for the circuit C_n: |0^n\rangle \to |\textsf{CAT}_n\rangle .

Fix n \in \mathbb{N} and let C_n have size T. The Hamiltonian \mathcal{H} acts on T+n qubits and consists of several local terms depending on C_n:

\begin{aligned}\mathcal{H} = H_{in} + \sum_{t=1}^T H_t + H_{out} + H_{stab}\end{aligned}

We can think of a T+n qubit state as representing a T step computation on n qubits (i.e. for each time t \in [0,T], we have a n bit computation state \textsf{state}_t of C_n). Intuitively, a T+n qubit state has energy 0 with respect to \mathcal{H} iff it is the history state of C_n. This is because H_{in} checks that at time t=0, \textsf{state}_0 consists of the input |0\rangle^n to C_n. Each H_t checks that \textsf{state}_{t} proceed correctly from \textsf{state}_{t-1} (i.e. that the tth gate of C_n is applied correctly). Then H_{out} checks that at time t=T, the output is 1. Finally, H_{stab} checks that the T+n qubit state is a superposition only over states where the first T qubits represent “correct times” (i.e. a unary clock state where time t is represented by T-t zeros followed by t ones).

Therefore, \mathcal{H} has a unique ground state, the history state of C_n|0^n\rangle, with energy 0:

\begin{aligned}|\Psi\rangle = \frac{1}{\sqrt{n+1}}\sum_{t=0}^n |\textsf{unary}(t)\rangle\otimes |\psi_t\rangle = \frac{1}{\sqrt{n+1}}\sum_{t=0}^n |\textsf{unary}(t)\rangle\otimes|\textsf{CAT}_{t}\rangle\otimes |0\rangle^{\otimes (n-t)}\end{aligned}

Later we will show how to transform \mathcal{H} into a Hamiltonian H on n qutrits on a line. Intuitively, the structure of C_n allows us to fuse the T=n time qubits and n state qubits and represent unused state qubits by 2. For the Hamiltonian H, the ground state becomes

\begin{aligned}|\Psi\rangle = \frac{1}{\sqrt{n+1}}\sum_{t=0}^n |\psi_t\rangle = \sum_{t=0}^n |\textsf{CAT}_{t}\rangle\otimes|2\rangle^{\otimes(n-t)}\end{aligned}

For the rest of this proof, we work with respect to H.

Let \sigma be an \epsilon-error state and let S \subseteq [n] be the subset of qutrits such that \text{Tr}_S(\sigma) = \text{Tr}_S(|\Psi\rangle\langle\Psi|). We define two projection operators which, when applied to \sigma alone, produce nontrivial measurements, but when applied to \sigma together, produce trivial measurements.

Definition 5.1: For any i\in[n], the projection operator

\begin{aligned}A_i = |0\rangle\langle 0|_i \end{aligned}

projects onto the subspace spanned by 0 on the ith qutrit.

For any j\in[n], the projection operator

\begin{aligned} B_j = |1\rangle\langle 1|_i\end{aligned}

projects onto the subspace spanned by 1 on the jth qutrit.

Claim 5.1: For i\not\in S, \text{Tr}(A_i\sigma) = \frac{1}{2} + \frac{-i}{2(n+1)}. For j\not\in S, \text{Tr}(B_j\sigma) = \frac{1}{2} + \frac{-j}{2(n+1)}. Note that these values are positive for any i,j\in [n].

Proof: If i \not\in S, then measurements on the ith qutrit are the same for \sigma and |\Psi\rangle\langle\Psi|.

\begin{aligned}     \text{Tr}(A_i\sigma) &= \text{Tr}(A_i|\Psi\rangle\langle\Psi|)\\     &= \text{Tr}\left(A_i \frac{1}{n+1}\sum_{t,t'}|\psi_t\rangle\langle\psi_{t'}|\right)\\     &= \frac{1}{n+1}\sum_{t,t'} \text{Tr}(A_i|\psi_t\rangle\langle\psi_{t'}|)   \end{aligned}

If t=t', then any n qutrit pure state cannot have nonzero weight in both \psi_t and \psi_{t'} (every pure state ends in some number of 2s which tells which \psi_t (if any) it can be a part of). Therefore,

\begin{aligned}     \text{Tr}(A_i\sigma) = \frac{1}{n+1}\sum_{t} \text{Tr}(A_i|\psi_t\rangle\langle\psi_{t}|) = \frac{1}{n+1}\sum_t \langle \psi_t|A_i|\psi_t\rangle \enspace. \end{aligned}

If i \le t, then projecting onto the ith qutrit gives 0 with probability 1/2. Therefore, \text{Tr}(A_i\sigma) = \frac{1}{n+1}\left(\frac{n-i+1}{2}\right) = \frac{1}{2} + \frac{-i}{2(n+1)}.

Similarly, \text{Tr}(B_j\sigma) = \frac{1}{2} + \frac{-j}{2(n+1)}. \square

Claim 5.2: For i,j \not\in S such that i < j, \text{Tr}(A_i \otimes B_j \sigma) = 0.

Proof: As before, we can calculate

\begin{aligned} \text{Tr}(A_i \otimes B_j \sigma) &= \text{Tr}(A_i \otimes B_j |\Psi\rangle \langle\Psi|) = \frac{1}{n+1}\sum_t \langle\psi_t|A_i\otimes B_j|\psi_t\rangle \end{aligned}

If j > t, then the jth qutrit of \psi_t is 2 so B_j|\psi_t\rangle = 0. If j \le t, then A_i \otimes B_j|\psi_t\rangle = 0 because the first t qutrits of |\psi_t\rangle contain the |\textsf{CAT}_{t}\rangle state so under any measurement, the i and jth qutrits must be the same. \square

Now we use these claims to prove a circuit lower bound. Let U be a circuit generating (a state with density matrix) \sigma. Let d be the depth of U.

Consider some i\not\in S. For any operator acting on the ith qutrit, its lightcone consists of at most 2^d gates so its effect zone consists of at most 2^{2d} gates which act on at most 2^{2d+1} qudits (called the shadow of the effect zone).

Assume towards contradiction that 2^{2d+1} < n-\epsilon n. Then the shadow of any operator acting only on the ith qutrit has size at most 2^{2d+1} \le n - |S| since |S| \le \epsilon n. So there is some j outside of the shadow which is in the complement of S. By Claim 3.1, we have found two indices i,j such that any pair of operators acting on i and j have disjoint lightcones in U. WLOG let i < j. The lightcones of A_i,B_j are disjoint which implies \begin{aligned}\text{Tr}(A_i \otimes B_j \sigma) = \text{Tr}(A_i \sigma)\cdot\text{Tr}(B_j \sigma).\end{aligned}

By the two claims above, we get a contradiction.

Therefore, 2^{2d+1} \ge n-\epsilon n. We can take any constant epsilon: letting \epsilon = 1/2, we get

\begin{aligned}d \ge \frac{1}{2}\left(\log \frac{n}{2} - 1\right)\end{aligned}


This analysis relies crucially on the fact that any \epsilon-error state matches the groundstate on most qudits. However, NLTS is concerned with states which may differ from the groundstate on many qudits, as long as they have low energy.

Remark 2.1: The paper of Nirkhe, Vazirani, and Yuen [2] actually proves more:

  • A more general lower bound: logarithmic lower bound on the circuit depth of any \delta-approximate (\delta far in L1 norm) \epsilon-noisy state (probability distribution over \epsilon-error states).
  • Assuming QCMA \neq QMA (QCMA takes a m bit witness string instead of a m qubit state as witness), they show a superpolynomial lower bound (on the circuit depth of any \delta-approximate \epsilon-noisy state).
  • “Approximate qLWC codes”, using techniques from their superpolynomial lower bound.

Back to NLTS – Tempering our Optimism

So far, we’ve shown a local Hamiltonian family for which all low-error (in “Hamming distance”) states require logarithmic quantum circuit depth to compute, thus resolving the NLETS conjecture. Now, let’s try to tie this back into the NLTS conjecture. Since it’s been a while, let’s recall the statement of the conjecture:

Conjecture (NLTS): There exists a universal constant \epsilon > 0 and a family of local Hamiltonians \{H^{(n)}\}_{n=1}^{\infty} where H^{(n)} acts on n particles and consists of m_{n} local terms, s.t. any family of states \{|\psi_{n}\rangle\} satisfying \langle \psi_{n} | H^{(n)} | \psi_{n}\rangle \leq \epsilon||H^{(n)}|| + \lambda_{min}(H^{(n)}) requires circuit depth that grows faster than any constant.

In order to resolve the NLTS conjecture, it thus suffices to exhibit a local Hamiltonian family for which all low-energy states require logarithmic quantum circuit depth to compute. We might wonder if the local Hamiltonian family we used to resolve NLETS, which has “hard ground states”, might also have hard low-energy states. Unfortunately, as we shall show, this cannot be the case. We will start by showing that Hamiltonian families that lie on constant-dimensional lattices (in a sense that we will make precise momentarily) cannot possibly be used to resolve NLTS, and then show that the Hamiltonian family we used to prove NLTS can be linearized (made to lie on a one-dimensional lattice!).

The Woes of Constant-Dimensional Lattices

Definition 6.1: A local Hamiltonian H acting on n qubits is said to lie on a graph G if there is an injection of qubits into vertices of the graph such that the set of qubits in any interaction term correspond to a connected component in the graph.

Theorem 2: If (H^{(n)}) is a local Hamiltonian family that lies on an O(1)-dimensional lattice, then (H^{(n)}) has a family of low-energy states with low circuit complexity. In particular, if H is a local Hamiltonian on a d-dimensional lattice acting on n qubits for large enough n, then for any \epsilon, there exists a state |\psi\rangle that can be generated by a circuit of constant depth and such that \langle \psi | H | \psi \rangle \leq H_0 + \epsilon ||H|| where H_0 is the ground-state energy.

Proof: In what follows, we’ll omit some of the more annoying computational details in the interest of communicating the high-level idea.

Start by partitioning the d-dimensional lattice (the one that H^(n) lives on) into hypercubes of side length L. We can “restrict” H^{(n)} to a given hypercube (let’s call it \rho) by throwing away all local terms containing a qubit not in \rho. This gives us a well-defined Hamiltonian H_{\rho} on the qubits in \rho. Define |\phi_{\rho}\rangle to be the L^d-qubit ground state of H_{\rho}, and define

\begin{aligned}|\phi\rangle := \bigotimes_{\text{hypercubes } \rho} |\phi_{\rho}\rangle\end{aligned}

where |\phi\rangle is an n-qubit state. Each |\phi_{\rho}\rangle can be generated by a circuit with at most 2^{L^d} gates, hence at most 2^{L^d} depth. Then, |\phi\rangle can be generated by putting all of these individual circuits in parallel – this doesn’t violate any sort of no-cloning condition because the individual circuits act on disjoint sets of qubits. Therefore, |\phi\rangle can be generated by a circuit of depth at most 2^{L^d}. L and d are both constants, so |\phi\rangle can be generated by a constant-depth circuit.

We claim that, for the right choice of L, |\phi\rangle is also a low-energy state. Intuitively, this is true because \phi can only be “worse” than a true ground state of H^{(n)} on local Hamiltonian terms that do not lie entirely within a single hypercube (i.e. the boundary terms), and by choosing L appropriately we can make this a vanishingly small fraction of the local terms of H^{(n)}. Let’s work this out explicitly.

Each hypercube has surface area 2dL^{d-1}, and there are n/L^d hypercubes in the lattice. Thus, the total number of qubits on boundaries is at most 2d\frac{n}{L}. The number of size k-connected components containing a given point in a d-dimensional lattice is a function of k and d. Both of these are constants. Therefore, the number of size k-connected components containing a given vertex, and hence the number of local Hamiltonian terms containing a given qubit, is constant. Thus, the total number of violated local Hamiltonian terms is at most O(\frac{n}{L}). Taking L to be \frac{1}{\epsilon}, we get the desired bound. Note that to be fully rigorous, we need to justify that the boundary terms don’t blow up the energy, but this is left as an exercise for the reader. \square

Linearizing the Hamiltonian

Now that we have shown that Hamiltonians that live on constant-dimensional lattices cannot be used to prove NLTS, we will put the final nail in the coffin by showing that our NLETS Hamiltonian (the Feynman-Kitaev clock Hamiltonian \mathcal{H} on the circuit C) can be made to lie on a line (a 1-dimensional lattice). To do so, we will need to understand the details of \mathcal{H} a bit better.

Proposition 6.1: \mathcal{H} for the circuit C is 5-local.

Proof: Recall that we defined

\begin{aligned}\mathcal{H} := H_{in} + \sum_{t=1}^T H_t+  H_{stab}\end{aligned}

Let’s go through the right-hand-side term-by-term. We will use |\mathsf{time}(t)\rangle to denote the t^{\text{th}} qubit of the time register and |\mathsf{state}(s)\rangle to denote the s^{\text{th}} qubit of the state register.

  • H_{in} needs to serially access the qubit pairs \begin{aligned}|\mathsf{time}(0)\rangle\otimes\textsf{state}(s) \end{aligned} for all s and ensure that they are all set to |0\rangle. Thus, H_{in} is 2-local.
  • Each H_t term needs to access the states |\textsf{time}(t-1)\rangle, |\textsf{time}(t)\rangle, |\textsf{time}(t+1)\rangle, |\textsf{state}(s)\rangle, and |\textsf{state}(t)\rangle and ensure that the state transitions are correct. Thus, H_{t} is 5-local.
  • H_{stab} needs to access the states \begin{aligned}|\textsf{time}(t)\rangle \otimes |\textsf{time}(t+1)\rangle \end{aligned} and ensure that the progression of the time register is correct. Thus, H_{stab} is 2-local. \square

Now, we follow an approach of [3] to embed \mathcal{H} into a line.

Theorem 3: The Feynman-Kitaev clock Hamiltonian H can be manipulated into a 3-local Hamiltonian acting on qutrits on a line.

Proof: Rather than having \mathcal{H} act on 2n total qubits (n time qubits and n state qubits), let’s fuse each |\textsf{time}(i)\rangle and |\textsf{state}(i)\rangle pair into a single qudit of dimension 4. If we view \mathcal{H} as acting on the space of particles |\textsf{time}(i)\rangle \otimes |\textsf{state}(i)\rangle, we observe that, following Proposition 6.1, each local term needs to check at most the particles corresponding to times t-1, t, and t+1. Therefore, \mathcal{H} is 3-local and on a line, as desired.

Image from [2]

To see that we can have \mathcal{H} act on particles of dimension 3 (qutrits) rather than particles of dimension 4, note that the degree of freedom corresponding to |\textsf{time}(t)\rangle \otimes |\textsf{state}(t)\rangle = |0\rangle \otimes |1\rangle is unused, as the t^{\text{th}} qubit of the state is never nonzero until timestamp t. Thus, we can take the vectors

\begin{aligned} |0\rangle := |1\rangle\otimes|0\rangle, |1\rangle := |1\rangle\otimes|1\rangle, |2\rangle := |0\rangle\otimes|0\rangle \end{aligned}

as a basis for each qutrit.

Even though we’ve shown that the clock Hamiltonian for our original circuit cannot be used to prove NLTS (which is still weaker than the original Quantum PCP conjecture) this does not necessarily rule out the use of this approach for other “hard” circuits which might then allow us to prove NLTS. Furthermore, NLETS is independently interesting, as the notion of being low “Hamming distance” away from vectors is exactly what is used in error-correcting codes.


  • [1] Sanjeev Arora and Boaz Barak. Computational complexity: a modern approach. Cambridge University Press, 2009.
  • [2] Chinmay Nirkhe, Umesh Vazirani, and Henry Yuen. Approximate low-weight check codes and circuit lower bounds for noisy ground states. arXiv preprint arXiv:1802.07419, 2018.
  • [3] Dorit Aharonov, Wim van Dam, Julia Kempe, Zeph Landau, Seth Lloyd, and Oded Regev. Adiabatic quantum computation is equivalent to standard quantum computation. SIAM J. Comput., 2007.

Quantum Approximate Optimization Algorithm and Applications



Quantum computers have demonstrated great potential for solving certain problems more efficiently than their classical counterpart. Algorithms based on the quantum Fourier transform (QFT) such as Shor’s algorithm offer an exponential speed-up, while amplitude-amplification algorithms such as Grover’s search algorithm provide us with a polynomial speedup. The concept of “quantum supremacy” (quantum computers outperforming classical computers) has been explored for three general groups of problems:

  1. Structured problems, such as factoring and discrete logarithm. Out quantum computer takes advantage of the structure of these classes of problems to offer an exponential speedup compared to the best known classical alternative. While these speedups are the most promising, they require a large number of resources and are cannot be feasibly implemented in the near future.
  2. Quantum Simulations, originally proposed by Richard Feynman in the late 80s was thought to be the first motivation behind exploring quantum computation. Due to the fact that the space of all possible states of the system scales exponentially with the addition of a new element (eg. an atom), complex systems are very difficult to simulate classically. It has been shown that we can use a quantum computer to tackle interesting problems in quantum chemistry and chemical engineering. Furthermore, there are results on sampling the output of random quantum circuits which have been used for “quantum supremacy experiments”.
  3. General constraint satisfaction and optimization problems. Since these problems are NP-hard it is widely believed that we cannot gain an exponential speedup using a quantum computer, however, we can obtain quadratic speedup but utilizing a variation of Grover’s algorithm.

While these quantum algorithms are very exciting, they are beyond the capabilities of our near-term quantum computers; for example, any useful application of Shor’s factoring algorithm requires anywhere between tens of thousands to millions of qubits with error correction compared to quantum devices with hundreds of qubits that we might have available in the next few years.

Recently there has been increasing interest in hybrid classical-quantum algorithms among the community. The general idea behind this approach is to supplement the noisy intermediate-scale quantum (NISQ) devices with classical computers. In this blog post, we discuss the Quantum Approximate Optimization Algorithm (QAOA), which is a hybrid algorithm, alongside some of its applications.


QAOA is used for optimizing combinatorial problems. Let’s assume a problem with n bits and m clauses. Each clause is a constraint on a subset of the bits which satisfies a certain assignment. We can define a cost function as follows:

C(z)=\sum_{\alpha=1}^m C_\alpha (z)

where z=z_1z_2...z_n is the bit string. In this article we consider a minimization problem, therefore we want C_\alpha(z)=0 if z satisfies clause \alpha and 1 otherwise. Note that in the case of a maximization problem we only need to switch the value assigned to a satisfactory clause to 1. Our objective is to find a (qu)bit string that minimizes (or maximizes) our cost function.

At a higher level, we start with a quantum state in a uniform superposition of all possible inputs z. This can be accomplished with n qubits which span a space of size 2^n. Our goal is to come up with a series of operations that would evolve our initial quantum state into a superposition of states in which the valid solutions would have a significantly higher probability than other states. In manner, upon sampling the quantum state we are likely to get the correct solution with high probability. QAOA uses the cost function to construct a set of operations that would be able to efficiently map the unifrom superposition state into the desired quantum state. These operators involve single qubits rotations around the x-axis, and multiqubit rotations around the z-axis of our qubits.

Now let’s discuss the details of QAOA. For this algorithm we assume that our quantum computer works in the computation basis of \left |0 \right > , \left | 1 \right > . We start by setting our initial state to a uniform superposition over computational basis states:

\left |s \right > = \frac{1}{\sqrt{2^n}}\sum_{z \in \{0,1\}^n} \left |z \right >

Next, we define a unitary operator using the cost function as follows:

U(\hat{C},\gamma) = e^{i\gamma \hat{C}}= \prod_{\alpha = 1}^m e^{-i\gamma \hat{C}_\alpha} 

Here we convert every clause C_\alpha to a Hamiltonian \hat{C_\alpha} consisting of Pauli Z ($\sigma^z$) operators. Just as a review, the two Pauli operators (X and Z) used in this blog post are representated as follows:

\sigma^x = \begin{pmatrix}    0 & 1 \\    1 & 0 \\\end{pmatrix} \: \: \: \:  \sigma^z = \begin{pmatrix}    1 & 0 \\    0 & -1 \\\end{pmatrix}

For example if C_\alpha=x \oplus y we can map the clause to \hat{C_\alpha}=\frac{1}{2}(1+\sigma^z_x \sigma^z_y) for a minimization problem. If x=\left |0 \right >   , then \sigma^z_x will return a value of 1, and if x=\left |1 \right > the operator will return -1. The same applies to qubit y as well. Therefore it is not hard to see that if x and y have the same value, then the operator \hat{C_\alpha} as defined above will result in a 1, and it’ll result in 0 otherwise. Furthermore, since \hat{C} has integer eigenvalues we can restrict the angle \gamma to lie in [0,2\pi].

Next, we define the admixing Hamiltonian:

B=\sum_{j=1}^n \sigma^x_j

and use it to define a unitary operator which consists of a product of commuting one qubit operations:

U(B,\beta) = e^{-i\beta B}= \prod_{j=1}^n e^{-i \beta \sigma_j^x}  

where \beta \in [0,\pi]. It’s easy to see that U(\Hat{C},\gamma) couples 2 or more qubits, while U(B,\beta) performs a single qubit rotation on the qubits in our system. Using these unitaries and our initial state we define a QAOA angle-dependent “ansatz” state as follows:

\left |  \boldsymbol{\gamma},\boldsymbol{\beta} \right >= U(B,\beta_p)U(\Hat{C},\gamma_p)...U(B,\beta_1)U(\Hat{C},\gamma_1) \left |s \right >

Here p\geq 1 is the “depth” of our QAOA circuit, and \boldsymbol{\gamma}=(\gamma_p,...,\gamma_1), \boldsymbol{\beta}=(\beta_p,...,\beta_1) are each a vector of length p controlling the angles for each layer. In the worst case scenario this state can be produce by a quantum circuit of depth mp+p, however by taking advantage of the structure of the instance we can further reduce the number of layers required. Let F_p be the expectation of \hat{C} in our ansatz:

F_p(\boldsymbol{\gamma},\boldsymbol{\beta})=\left < \boldsymbol{\gamma},\boldsymbol{\beta} \right | \hat{C} \left | \boldsymbol{\gamma},\boldsymbol{\beta} \right >

and let M_p be the minimum of F_p over angles,

M_p=\min_{\boldsymbol{\gamma},\boldsymbol{\beta}} F_p(\boldsymbol{\gamma},\boldsymbol{\beta}).

Note that minimization at p-1 layers can be viewed as a constrained minimization at p layers, therefore

M_p \leq M_{p-1}

Using an adiabatic approach [1] We can show that

\lim_{p \rightarrow \infty} M_p = \min_z C(z)

Based on these results our QAOA algorithm will look like the following:

  •  c: pick a p
  • c: choose a set of angles (\Vec{\gamma}_0,\Vec{\beta}_0)
  • q: prepare \left | \Vec{\gamma},\Vec{\beta} \right >  
  • q: compute F_p
  • c: perform gradient descend/ascend on F_p and get a new set of angles (\Vec{\gamma},\Vec{\beta})
  • repeat from step 3 till convergence
  • report the measurement result of \left | \Vec{\gamma},\Vec{\beta} \right >   in computational basis

If p does not asymptotically grow with n F_p(\Vec{\gamma},\Vec{\beta}) can be efficiently computed in O(m^2+mn)

Application: MaxCut

In this section we apply the QAOA algorithm to the MaxCut problem with bounded degree. MaxCut is an NP-hard problem that asks for a subset S of the vertex set such that the number of edges between S and the complementary subset is as large as possible. While QAOA does not offer a theoretical guarantee to solve MaxCut in polynomial time, it offers a path to utilizing NISQ devices for tackling such optimization problems and discuss patterns in such problems that can be used for reducing the number of steps required.

For this section, let’s assume p=O(1), and we have a graph with n vertices and an edge set \{<jk>\} of size m. We can construct a cost function to be maximized as follows:

C = \sum_{<jk>} C_{<jk>}

C_{<jk>} = \frac{1}{2} (1-\sigma^z_j \sigma^z_k)

We can the compute the angle dependent cost of our ansatz as follows:

F_p(\Vec{\gamma},\Vec{\beta})=\sum_{<jk>}\left <{s} \right | U^\dagger(C,\gamma_1)...U^\dagger(B,\beta_p) C_{<jk>}U(B,\beta_p) ... U(C,\gamma_1) \left |s \right >

Let’s consider the operation associated with some edge <jk>:

U ^\dagger(C,\gamma_1)...U^\dagger(B,\beta_p) C_{<jk>}U(B,\beta_p) ... U(C,\gamma_1)

Since QAOA consists of local operations, we may take advantage by thinking about the problem in terms of subproblems (or subgraphs) involving certain nodes. This property will allow us to simplify our clauses even further depending on the desired depth p of our quantum circuit, therefore decreasing the amount of resources necessary to implement the algorithm.

The operator C_{<jk>} includes qubits (nodes) j and k, therefore the sequence of operators above will only involve qubits that are at most distance p away from qubits j and k. Let’s consider the example of p=1:

\rightarrow U^\dagger(C,\gamma_1)e^{i\beta_1(\sigma^x_j + \sigma^x_k)} C_{<jk>} e^{-i\beta_1(\sigma^x_j + \sigma^x_k)} U(C,\gamma_1).

It’s easy to see that any factor of U(C,\gamma_1) that does not depend on j or k will commute through and cancel out. Since the degree is bounded, each subgraph contains a number of qubits that is independent of n, which allows for the evaluation of F_p in terms of subsystems of size independent of n.

For an subgraph G define:

C_G=\sum_{<l l^\prime>} C_{<l l^\prime>}  \: \: \: \: U(C_G,\gamma)=e^{-i \gamma C_G}

B_G = \sum_{j \in G} \sigma^x_j  \: \: \: \: U(B_G,\beta) = e^{-i \beta B_G}

\left | s,G \right >   = \prod_{l \in G} \left |+ \right > _l

We can define our total cost as a sum over the cost of each subgraph:

f_g(\Vec{\gamma},\Vec{\beta})=\left < s,g(j,k) \right |  U ^\dagger(C_{g(j,k)},\gamma_1)...U^\dagger(B_{g(j,k)},\beta_p) C_{<jk>}U(B_{g(j,k)},\beta_p) ... U(C_{g(j,k)},\gamma_1) \left |s,g(j,k) \right >

where g(j,k) is a subgraph of type g and “…” is used to omit the sequence of angle depending unitaries constructed using the elements of \Vec{\gamma} and \Vec{\beta}. F_p is then

F_p(\Vec{\gamma},\Vec{\beta})=\sum_g w_g f_g(\Vec{\gamma},\Vec{\beta})

where w_g is the number of occurrence of the subgraph g in the original edge sum. The function f_g does not depend on n and m, and the only dependence on these variables comes through the weights w_g from the original graph. The maximum number of qubits that can appear in our sequence of operators comes when the subgraph is a tree. For a graph with maximum degree v, the number of qubits in this tree is


(or 2p+2 if v=2), which is independent of n and m. Therefore we can see that for constant p F_p can be efficiently computed.

Next, let’s consider the spread of C measured in the state \left | \Vec{\gamma},\Vec{\beta} \right >  .

\left <\Vec{\gamma},\Vec{\beta} \right | C^2\left |\Vec{\gamma},\Vec{\beta}\right >  -\left < \Vec{\gamma},\Vec{\beta} \right | C \left | \Vec{\gamma},\Vec{\beta} \right > ^2 \leq 2[\frac{(v-1)^{2p+2}-1}{(v-1)-1}].m

For fixed v and p we see that the standard deviation of C(z) is upper-bounded by O(\sqrt{m}). Using this fact and the appropriate probability bounds we can see that the result of measuring the cost function of the state \left | \vec{\gamma_{opt}},vec{\beta_{opt}} \right >   will be very close to the intended value of F_p(\vec{\gamma_{opt}},\vec{\beta_{opt}}) which bounds the uncertainty present in quantum measurement.


[1] E. Farhi, J. Goldstone, and S. Gutmann, “A Quantum Approximate Optimization Algorithm,” 2014.

[2] J. S. Otterbach, et. al, “Unsupervised Machine Learning on a Hybrid Quantum Computer,” 2017.

Tensor Networks, Matrix Product States and Density Matrix Renormalization Group

by Fred Zhang

This is the second installment of a three-part series of posts on quantum Hamiltonian complexity based on lectures given by the authors in Boaz and Tselil’s seminar. For the basic definitions of local Hamiltonians, see Ben’s first post. Also check out Boriana and Prayaag’s followup note on area laws.

This post introduces tensor networks and matrix product states (MPS). These are useful linear-algebraic objects for describing quantum states of low entanglement.

We then discuss how to efficiently compute the ground states of the Hamiltonians of {1}D quantum systems (using classical computers). The density matrix renormalization group (DMRG), due to White (1992, 1993), is arguably the most successful heuristic for this problem. We describe it in the language of tensor networks and MPS.

1. Introduction

We are interested in computing the ground state—the minimum eigenvector—of a quantum Hamiltonian, a 2^n \times 2^n complex matrix that governs the evolution of a quantum system of n qubits. We restrict our attention to the local Hamiltonian, where the matrix is a sum of Hamiltonians each acting only on k qubits.  In the previous article, we discussed some hardness results. Namely, a local Hamiltonian can be used to encode SAT instances, and we further gave a proof that computing the ground state is QMA-Complete.

Despite the hardness results, physicists have come up with a variety of heuristics for solving this problem. If quantum interactions occur locally, we would hope that its ground state has low entanglement and thus admits a succinct classical representation. Further, we hope to find such a representation efficiently, using classical computers.

In this note, we will see tensor networks and matrix product states that formalize the idea of succinctly representing quantum states of low entanglement. As a side remark for the theoretical computer scientists here, one motivation to study tensor network is that it provides a powerful visual tool for thinking about linear algebra. It turns indices into edges in a graph and summations over indices into contractions of edges. In particular, we will soon see that the most useful inequality in TCS and mathematics can be drawn as a cute tensor network.

Guess what this is?

In the end, we will discuss the density matrix renormalization group (DMRG), which has established itself as “the most powerful tool for treating 1D quantum systems” over the last decade [FSW07]. For many 1D systems that arise from practice, the heuristic efficiently finds an (approximate) ground state in its matrix product state, specified only by a small number of parameters.

2. Tensor Networks

Now let us discuss our first subject, tensor networks. If you have not seen tensors before, it is a generalization of matrices. In computer scientists’ language, a matrix is a two-dimensional array, and a tensor is a multi-dimensional array. In other words, if we think of a matrix as a square, then a 3 dimensional tensor looks like a cube. Formally, a (complex) n dimensional tensor {T} maps n indices to complex values, namely, to its entries:

\displaystyle T : [d_1] \times [d_2] \times \cdots \times [d_n] \rightarrow \mathbb{C}.

The simplest tensor network is a graphical notation for a tensor. For an {n}-dimensional tensor {T}, we draw a star graph and label the center as {T} and the edges as the indices. To evaluate this tensor network, we put values on the edges, i.e., indices, and then the tensor network would spit out its entry specified by the indices.

A simple tensor network of 4 dimensions 

Evaluating a simple tensor network, {T(1,5,3,1)=1/\sqrt{2}}. The numbers are chosen arbitrarily.

Notice that the degree of the center is the number of indices. Hence, a tensor network of degree {1} is a vector, and that of degree {2} is a matrix, and so forth.

A vector

A matrix

A 3d tensor

How is this related to quantum information? For the sake of genearlity we will deal with qudits in {\mathbb{C}^d}, instead of qubits in {\mathbb{C}^2}. Now recall that a quantum state {|\psi_n\rangle} of {n} qudits can be encoded as an {n} dimensional tensor. It can be written as

|\psi_n\rangle = \displaystyle\sum_{i_1,\cdots,i_n = 0}^{d-1} T(i_1,\cdots, i_n) |i_1,\cdots, i_n \rangle.

It is easy to see that all the information, namely, the amplitudes, is just the tensor T. In the later sections, we will see more powerful examples of using tensor networks to represent a quantum state.

So far our discussion is focused merely on these little pictures. The power of tensor networks come from its composition rules, which allow us to join two simple tensor networks together and impose rich internal structures.

2.1. Composition Rules

We introduce two ways of joining two simple tensor networks. Roughly speaking, they correspond to multiplication and summation, and I will give the definitions by showing examples, instead of stating them in the full formalism

Rule #1: Tensor Product. The product rule allows us to put two tensor networks together and view them as a whole. The resulting tensor is the tensor product of the two if we think of them as vectors. More concretely, consider the following picture.

This is viewed as a single tensor network {T} of 7 edges .

The definition of this joint tensor {T} is

T(i_1,i_2,\cdots, i_7) = T_1(i_1,i_2,i_3,i_4) T_2(i_5,i_6,i_7).

Rule #2: Edge Contractions. At this moment, we can only make up disconnected tensor networks. Edge contractions allow us to link two tensor networks. Suppose we have two {3} dimensional tensor networks. Contracting two edges, one from each, gives us a tensor network of {4} free edges. This now corresponds a tensor of {4} dimensions.

Two 3d tensors

Join two tensor networks by contracting an edge

We name the contracted edge as {k}. The definition of {T} is

\displaystyle T(i_1,i_2,j_1,j_2) =\sum_k T_1(i_1,i_2, k) T_2(j_1,j_2, k).

2.2. Useful Examples

Before we move on, let’s take some examples. Keep in mind that the degree of the vertex determines the number of indices (dimensions of this tensor).

vector inner product {\langle u,v \rangle}

Matrix inner product

Here, one needs to remember that an edge between two tensor nodes is a summation over the index corresponding to the edge. For example, in the vector inner product picture, {\langle u,v\rangle = \sum_i u_i \cdot v_i}, where edge is labeled as {i}. Now you would realize that this picture

is the famous

\langle u,v \rangle^2 \leq \|u\|^2 \|v\|^2. \quad\quad (\text{Cauchy-Schwarz inequality})

For us, the most important building block is matrix multiplication. Let {H=MN}. By definition

H(i,j) = \sum_k M(i,k) N(k, j).

This is precisely encoded in the picture below.

Matrix multiplication, {MN}.  

We are ready to talk about matrix product states. In the language of tensor network, a matrix product state is the following picture.

A matrix product state.

As the degrees indicate, the two boundary vertices {A_1,A_n} represent matrices and the internal vertices represent {3}-dimensional tensors. We can view each matrix as a set of (column) vectors and each {3}-dimensional tensor as a stack of matrices. Then each one of the free edges picks out a vector or a matrix, and the contracted edges multiply them together which gives out a scalar. If this confused you, move on to the next section. I will introduce the formal definition of matrix product states, and you will see that it is just the picture above.

3. Matrix Product States

Before giving the definition, let’s talk about how matrix product state (MPS) naturally arises from the study of quantum states with low entanglement. Matrix product state can be viewed as a generalization of product state—(pure) quantum state with no entanglement. Let’s consider a simple product state {|\psi\rangle} of {2} qubits. It can be factorized:

\displaystyle |\psi\rangle = \left(\sum_{i=0}^1 \alpha_1^i\ |i\rangle \right)\left(\sum_{j=0}^1 \alpha_2^j \ |j\rangle\right)\nonumber = \sum_{i,j=0}^1 \alpha_1^i \alpha_2^j\ |ij\rangle \ \ \ \ \ (1)

This state is described by {4} complex scalars {\left\{\alpha_1^0,\alpha_1^1,\alpha_2^0,\alpha_2^1\right\}}, and there is nothing quantum about it. However, if the state has entanglement among its qubits, then we know that it is impossible to be factorized and thereby written as (1). MPS generalizes the form of (1) by replacing the scalars with matrices and vectors.

More formally, a matrix product state starts with the following setup. For an {n}-qudit system, we associate

  • a qudit in {\{1,n\}} with {d} vectors {\left\{A_1^{j_1}\right\}, \left\{A_n^{j_n}\right\} \in \mathbb{R}^D}; and
  • a qudit {i} in {\{2,3,\cdots, n-1\}} with {d} matrices {\left\{A_i^{j_i}\right\}\in \mathbb{R}^{D\times D}}.

Here, {j_i} range from {0} to {d-1}, and {D} is called bond dimension. One can think of the set of vectors as a {D} by {d} matrix and the set of matrices as a {d} by {D} by {D} three-dimensional tensor. Then let them correspond to the vertices in MPS picture. With this setup, a quantum state is in matrix product state if it can be written as

|\psi\rangle = \sum_{j_1,\cdots,j_n=1}^n A_1^{j_1} A_2^{j_2} \cdots A_n^{j_n} |j_1 j_2\cdots j_n\rangle.

It is important to keep in mind that {A_1^{j_1},A_n^{j_n}} are two vectors, and the other inner terms are matrices, and we get a scalar from the product. Thus, this represents the tensor {T(j_1,j_2,\cdots, j_n) = A_1^{j_1} A_2^{j_2} \cdots A_n^{j_n}}.

Now back to the picture,


notice that each amplitude { A_1^{j_1} A_2^{j_2} \cdots A_n^{j_n}} from the equation above is an output of the tensor in the picture, where the free edges take values {j_1, j_2 ,\cdots, j_n}. Also, as discussed earlier, the contracted edges in MPS tensor network correspond to matrix and vector multiplications, so the tensor {T} is precisely represented by the picture.

The complexity of the MPS is closely related to the bond dimension {D}. In particular, the number of parameters in this model is {O(ndD^2)}. We would expect that with higher {D}, we may describe quantum states of more entanglement. In other words, the representation power of an MPS increases with {D}. In principle, one can represent any quantum state as an MPS; however, {D} can be exponentially large. See, e.g., Section 4.1.3 of~\cite{schollwock2011density} for a proof. On the other extreme, the product state example shows that if {D=1}, one can represent and only represent unentangled states. To summarize, here is the picture you should keep in mind.

Representation power of MPS increases with bond dimension D.


4. Density Matrix Renormalization Group

We are now ready to describe Density Matrix Renormalization Group, proposed originally in [Whi92Whi93]. As mentioned earlier, it does not come with provable guarantees. In fact, one can construct artificial hard instances such that the algorithm get stuck at certain local minima [SCV08]. However, it has remained one of the most successful heuristics for {1}D systems. We refer the readers to [Sch11] for a complete survey.

DMRG is a simple alternating minimization scheme for computing the ground state of a {1}D Hamiltonian. We start with an arbitrary MPS. Then each step we optimize over the set of matrices {\left\{A_i^{j_i}\right\}_{j_i=0}^d} associated with site {i}, while fixing everything else, and iterate until convergence. (You may wonder if one can simultaneously optimize over multiple sites. It turns out that it is an NP-hard problem [Eis06].)

Formally, the Hamiltonian problem can be phrased as a eigenvalue problem given a Hermitian matrix {H}, and thus we want to optimize over all {|\psi\rangle} in MPS of a fixed bond dimension {D}

\min_{|\psi\rangle}\frac{\langle \psi| H | \psi \rangle}{\langle \psi ||\psi \rangle}.

Here, we assume that the input Hamiltonian is in the product form. In particular, it means that it can be written as a tensor network as

Input {1}D Hamiltonian is of the particular product form.

so the numerator of the optimization objective looks like


The DMRG works with the Langrangian of the objective. For some {\lambda>0}, we will consider

\displaystyle \min_{|\psi\rangle}\,\, {\langle \psi| H | \psi \rangle} - \lambda {\langle \psi ||\psi \rangle}. \ \ \ \ \ (2)

DMRG optimizes over the set of matrices associated with one qudit. Both terms in (2) are quadratic in this set of matrices.

The Langrangian {{\langle \psi| H | \psi \rangle} - \lambda {\langle \psi ||\psi \rangle}} as a tensor network.

Now to optimize over the set of parameters associated with one site, calculus tells you to set the (partial) derivative to {0}, and the derivative of a quadratic thing is linear. Without going through any algebra, we can guess that the derivative of  with respect to a particular site, say the second one, is the same picture except removing the second site on one side.

The derivative that we set to {0} and solve.

Notice that the unknown is still there, on the bottom side of each term. The trick of DMRG is to view the rest of the network as a linear map applied to the unknown.


Given {H'} and {B}, we now have a clean numerical linear algebra problem of solving

\displaystyle H'x = \lambda Bx. \ \ \ \ \ (3)

This is called a generalized eigenvalue problem, and it is well studied. Importantly, for {1}D systems, {H'} is typically very sparse, which enables very fast solvers in practice. Finally, DMRG sweeps over the sites one after another and stops until convergence is achieved.

5. Concluding Remarks

Our presentation of tensor networks and MPS roughly follows [GHLS15], a nice introductory survey on quantum Hamiltonian complexity.

The notion of tensor networks extends well beyond 1D systems, and a generalization of MPS is called tensor product state. It leads to algorithms for higher dimensional quantum systems. One may read [CV09] for a comprehensive survey.

Tensor network has been interacting with other concepts. Within physics, it has been used in quantum error correction [FP14PYHP15], conformal field theory [Orú14], and statistical mechanics [EV15]. In TCS , we have found its connections with Holographic algorithms [Val08CGW16], arithmetic complexity [BH07CDM16AKK19], and spectral algorithms [MW18]. In machine learning, it has been applied to probabilistic graphical models [RS18], tensor decomposition [CLO16], and quantum machine learning [HPM18].

For DMRG, we have only given a rough outline, with many details omitted, such as how to set {D} and {\lambda} and how to obtain the Hamiltonian in the matrix product form, and how to compute the linear maps {H'} and {B} for each iteration. An interested reader may read [Sch05Sch11].


[AKK19]    Per Austrin, Peeri Kaski, and Kaie Kubjas. Tensor network complexity of multilinear maps. In Proceedings of the 2019 Conference on Innovations in Theoretical Computer Science. ACM, 2019.

[BH07]    Martin Beaudry and Markus Holzer. The complexity of tensor circuit evaluation. Computational Complexity, 16(1):60, 2007.

[CDM16]    Florent Capelli, Arnaud Durand, and Stefan Mengel. e arithmetic complexity of tensor contraction. eory of Computing Systems, 58(4):506{527, 2016.

[CGW16]    Jin-Yi Cai, Heng Guo, and Tyson Williams. A complete dichotomy rises from the capture of vanishing signatures. SIAM Journal on Computing, 45(5):1671{1728, 2016.

[CLO16]    Andrzej Cichocki, Namgil Lee, Ivan V Oseledets, A-H Phan, Qibin Zhao, and D Mandic. Low-rank tensor networks for dimensionality reduction and large-scale optimization problems: Perspectives and challenges part 1. arXiv preprint arXiv:1609.00893, 2016.

[CV09]    J Ignacio Cirac and Frank Verstraete. Renormalization and tensor product states in spin chains and laices. Journal of Physics A: Mathematical and Theoretical, 42(50):504004, 2009.

[Eis06]    Jens Eisert. Computational difficulty of global variations in the density matrix renormalization group. Phys. Rev. Le., 97:260501, Dec 2006.

[EV15]    G. Evenbly and G. Vidal. Tensor network renormalization. Phys. Rev. Le., 115:180405, Oct 2015.

[FP14]    Andrew J Ferris and David Poulin. Tensor networks and quantum error correction. Phys. Rev. Le., 113(3):030501, 2014.

[FSW07]    Holger Fehske, Ralf Schneider, and Alexander Weie. Computational Many-Particle Physics. Springer, 2007.

[GHLS15]    Sevag Gharibian, Yichen Huang, Zeph Landau, and Seung Woo Shin. Quantum Hamiltonian complexity. Foundations and Trends in Theoretical Computer Science, 10(3):159, 2015.

[HPM18]   William James Huggins, Piyush Patil, Bradley Mitchell, K Birgia Whaley, and Miles Stoudenmire. Towards quantum machine learning with tensor networks. Quantum Science and Technology, 2018.

[MW18]    Ankur Moitra and Alexander S Wein. Spectral methods from tensor networks. arXiv preprint arXiv:1811.00944, 2018.

[Orú14]    Román Orús. Advances on tensor network theory: symmetries, fermions, entanglement, and holography. e European Physical Journal B, 87(11):280, 2014.

[PYHP15]    Fernando Pastawski, Beni Yoshida, Daniel Harlow, and John Preskill. Holographic quantum error-correcting codes: Toy models for the bulk/boundary correspondence. Journal of High Energy Physics, 2015(6):149, 2015.

[RS18]    Elina Robeva and Anna Seigal. Duality of graphical models and tensor networks. Information and Inference: A Journal of the IMA, 2018.

[Sch05]    Ulrich Schollwöck. The density-matrix renormalization group. Rev. Mod. Phys., 77(1):259, 2005.

[Sch11]    Ulrich Schollwöck. The density-matrix renormalization group in the age of matrix product states. Annals of Physics, 326(1):96, 2011.

[SCV08]    Norbert Schuch, Ignacio Cirac, and Frank Verstraete. Computational difficulty of finding matrix product ground states. Phys. Rev. Le., 100(25):250501, 2008.

[Val08]    Leslie G Valiant. Holographic algorithms. SIAM Journal on Computing, 37(5):1565, 2008.

[Whi92]    Steven R. White. Density matrix formulation for quantum renormalization groups. Phys. Rev. Le., 69:2863, Nov 1992.

[Whi93]    Steven R. White. Density-matrix algorithms for quantum renormalization groups. Phys. Rev. B, 48:10345, Oct 1993.

What is Quantum Hamiltonian Complexity?

by Ben Edelman

This is the first installment of a three-part series of posts on quantum Hamiltonian complexity based on lectures given by the authors in Boaz and Tselil’s seminar. The second installment is here, and the third installment is here.

Quantum Hamiltonian complexity is a growing area of study that has important ramifications for both physics and computation. Our hope is that these three posts will provide an accessible (and incomplete) preview of the subject for readers who know the basics of theoretical computer science and quantum information. Much of the material is adapted from an excellent survey by Gharibian et al..

In a nutshell, quantum Hamiltonian complexity is the study of the local Hamiltonian problem. Why is this problem important enough to justify the existence of an entire subfield? To illustrate why, here are two informal characterizations of it:

  1. To a physicist, the local Hamiltonian problem is a formalization of the difficulty of simulating and understanding many-particle quantum systems. There are deep connections between the complexity of this problem and the amount of quantum entanglement in a system. In practical terms, physicists would love to be able to solve this problem on a regular basis, and they’ve developed a rich theory of heuristics to that end.
  2. To a computer scientist, the local Hamiltonian problem is the quantum version of constraint satisfaction problems. Any CSP can be written as a local Hamiltonian problem; and just as constraint satisfaction is the prototypical NP-complete problem by the Cook-Levin theorem, the local Hamiltonian problem plays the equivalent role for QMA (a quantum analogue of NP) by the “quantum Cook-Levin theorem.” The connections to classical complexity go on… there is even a quantum PCP conjecture!

But let’s take a step back and start at the beginning. To make sure we understand what a quantum Hamiltonian is and why it is important, it will be instructive to briefly rehash some of the fundamentals of classical statistical mechanics.

Classical energy and ground states

In the classical world, a physical system can be in any one of various states x \in \mathcal{X}, each of which is a vector, with different coordinates representing different particles. Every state of the system has an energy, given by an energy function E: \mathcal{X} \to \mathbb{R}. For example, in the classic Ising model of ferromagnetism, \mathcal{X} = \{\pm 1\}^n. Each coordinate x_i represents the spin of atom i, and atoms i and j interact with each other whenever (i,j) is an edge in a graph G, which is usually a low-dimensional lattice. Energy for this system is defined as E(x) = \sum_{(i,j) \in G}-x_i x_j.

Suppose we ignore our system for a long time, letting it interact with its external environment until, in the limit, it reaches thermal equilibrium at temperature T. Then the probability the system is in state x is given by Boltzmann’s distribution: \Pr[x] = \frac{e^{-\beta E(x)}}{Z}, where \beta \propto 1/T and Z is the partition function required to normalize the probabilities. As the temperature tends to infinity, this distribution will approach the uniform distribution over \mathcal{X}, and as the temperature tends to absolute zero, the distribution will approach the uniform distribution over the states with minimum energy. We call these minimum energy states ground states, and we call their energy the ground state energy. If we want to calculate something about a system, then it is often crucial to know the ground states and ground state energy of the system. Going back to our example, the Ising model has two ground states whenever G is connected. These are the states (+1,+1,\ldots,+1) and (-1,-1,\ldots,-1) in which all atoms have the same spin. The ground state energy is -|\{i,j:(i,j) \in G\}|.

Quantum Hamiltonians

A quantum Hamiltonian is essentially the quantum analogue of the classical energy function. Unlike with classical systems, when a quantum system is in a given n-qubit state \left|\psi\right\rangle, it doesn’t have a determinate energy. Instead, when we measure the energy, the value we obtain may be probabilistic and will correspond to one of the eigenvalues of the observable matrix for energy. This Hermitian matrix, denoted H \in (\mathbb{C}^2)^{\otimes n} \times (\mathbb{C}^2)^{\otimes n}, is the quantum Hamiltonian, and just as the energy function characterizes a classical system, the Hamiltonian characterizes a quantum system. For a given eigenvector |\lambda_i\rangle of H with eigenvalue \lambda_i, when we measure the energy of |\psi\rangle we obtain the result \lambda_i with probability \langle\psi|\lambda_i\rangle, and the system collapses to the state |\lambda_i\rangle (assuming the eigenvalue \lambda_i has multiplicity 1). Thus, the ground state and ground state energy of a quantum system with eigenvalue H are the minimum eigenvalue \lambda_0 of H and the corresponding eigenvector |\lambda_0\rangle.

The Boltzmann distribution also has a quantum analogue. A quantum system at thermal equilibrium will be in the following mixed state: \rho_{\text{eq}} = \frac{e^{-\beta H}}{Z}. As the temperature approaches absolute zero, \rho_{\text{eq}} will approach a superposition over the ground states.

Not only does the Hamiltonian tell us the energy of a system, it also describes the time evolution of the system (as long as it is closed). Schrödinger’s equation states that -i \hbar \frac{d|\psi\rangle}{dt} = H|\psi\rangle, where \hbar is Planck’s constant and t is time. Thus, if a closed system is in the state |\psi\rangle_0 at time 0, its state at time t will be |\psi\rangle_t = e^{-itH/\hbar}|\psi\rangle_0. Since H is Hermitian, e^{-itH/\hbar} is unitary, which is another way of saying that quantum mechanical states are subject to unitary evolution.

The Local Hamiltonian problem

As we have seen, understanding the Hamiltonian of a quantum system is crucial for understanding both the system’s equilibrium behavior and its time evolution. There are a huge variety of questions physicists are interested in asking about systems, all of which boil down to questions about equilibrium behavior, time evolution, or both. There is a single problem that captures the complexity of many of these questions, in the sense that most of the questions can’t be answered without solving it. This is the problem of estimating the ground state energy of the Hamiltonian. Especially in condensed matter physics, this problem is ubiquitous.

Formally, we will study the following promise problem: (note: this will not be our final formulation)

The “Hamiltonian Problem”

Given a Hermitian matrix H \in (\mathbb{C}^2)^{\otimes n} \times (\mathbb{C}^2)^{\otimes n} and non-negative reals a, b with b \geq a+1,

  • If \lambda_0(H) \leq a, output YES
  • If \lambda_0(H) \geq b, output NO

One issue with this definition is that the input includes an enormous 2^n \times 2^n matrix. For a reasonable-sized system, there’d be no use in even trying to solve this problem through classical computation, and how to deal with it in the quantum computing setting is far from obvious. Luckily, physicists have found that in real-life systems, interactions tend to be local, and if we consider the special case of local Hamiltonians, the input for the problem is of reasonable size.

Hamiltonians are too big to work with. What if we restrict our focus to local Hamiltonians?

A k-local Hamiltonian is a Hamiltonian that is decomposed into a sum of terms, each of which represents a Hamiltonian acting on a k-unit subset of the n qubits in the system. In other words, H = \sum_i (H_i)_{S_i} \otimes I_{[n]\backslash S_i}, where each S_i is a subset of [n] of size k. For brevity’s sake, we abuse notation and write H as \sum_i H_i. We can think of the H_i’s as local constraints, and the ground state as the state that simultaneously satisfies the constraints to the maximal possible extent. Here, then, is the new-and-improved problem definition:

k-Local Hamiltonian Problem

Given a Hamiltonian H \in (\mathbb{C}^2)^{\otimes n} \times (\mathbb{C}^2)^{\otimes n} specified as a collection of r local interactions \{H_i\}, and non-negative reals a, b with b \geq a+1,

  • If \lambda_0(H) \leq a, output YES
  • If \lambda_0(H) \geq b, output NO

Presuming the matrices \{H_i\} and the reals a, b are specified to polynomial precision, then the input size is polynomial in n, since k is a constant and each of the matrices H_i has 2^k \cdot 2^k entries. Thus, not only is our new problem physically realistic, it is also a problem we might hope to attack with classical computation. However, we will later see that in fact this problem is likely hard even for quantum computers. The remaining installments in this series of notes will deal with further restrictions of the class of Hamiltonians for which the local Hamiltonian problem may be tractable.

Computer science motivation

As we mentioned in the intro, the k-local Hamiltonian problem (henceforth denoted k-LH) doesn’t just have myriad applications in physics—it is also important from a computer science perspective because it is a quantum generalization of constraint satisfiability (you may have noticed that quantum analogues of classical concepts are a running theme). Specifically, k-CSP is a special case of k-LH.

Suppose we have a k-CSP instance \varphi, and we want to turn it into a k-LH instance. A clause C with constituent variables x_1, \ldots, x_k becomes a 2^k \times 2^k diagonal \{0,1\} matrix H_C acting on the qubits |x_1\rangle,\ldots,|x_k\rangle. Note that the rows and columns of this matrix are indexed by the assignment vectors x \in \{0,1\}^k. Formally, H_C encodes the truth table of C in the following manner: (H_C)_{x,x} = 1 - C(x). Another way of stating this is H_C = \sum_{x \in \{0,1\}^k\text{ s.t. }C(x)=0}|x\rangle\langle{x}|.

Informally, H_C takes the clauses of \varphi and turns them into local quantum interactions. We’ve constructed H_C so that it has two eigenvalues: 0 and 1. The eigenspace corresponding to 0 is spanned by the set of computational basis vectors |x\rangle that satisfy C, and the eigenspace corresponding to 1 is spanned by the computational basis vectors that don’t satisfy C. In effect, when we consider H_C as a term of H, we are giving an energy penalty to any variable assignment that doesn’t satisfy C. H = \sum_{C}H_C will have the eigenvalue 0 (in other words, a ground state energy of 0) if and only if there is some assignment of the variables x_1,\ldots,x_n that satisfies all of the clauses (in other words, iff \varphi is satisfiable). Otherwise, the ground state energy of H will be at least 1, so determining whether \varphi is satisfiable is equivalent to solving k-LH with inputs a = 0, and b = 1. (In fact, k-LH generalizes MAX-k-CSP, since the ground state energy of H is exactly the number of clauses minus the maximum number of satisfiable clauses.)

Continue reading “What is Quantum Hamiltonian Complexity?”

A 1D Area Law for Gapped Local Hamiltonians

(This post is based on part of a lecture delivered by Boriana Gjura and Prayaag Venkat. See also posts by Ben Edelman and Fred Zhang for more context on Quantum Hamiltonian Complexity.)


In this post we present the Area Law conjecture and prove it rigorously,
emphasizing the emergence of approximate ground state projectors as a
useful and promising tool.

The difficulty of understanding many-body physics lies in this dichotomy
between the power of quantum on the one hand, and the fact that even
describing a quantum state requires exponential resources on the other.
A natural way to proceed is to ask if the special class of physically
relevant states admits succinct descriptions. As seen in a previous
post, the QMA-completenes of the k-local Hamiltonian problem suggests
that even ground states of local Hamiltonians do not admit such
descriptions. Looking at these ground states, the following questions
arise naturally.

When do the ground states of local Hamiltonians have a special
When does that structure allow for a meaningful short
When does that structure allow us to compute properties
of them?

As stated informally, the answer to these questions is yes for (gapped)
1D systems, and it is an open problem to answer these in higher

The Area Law

The Area Law is inspired by the Holographic principle in Cosmology,
which informally states that the total information in a black hole
resides on the boundary. That is, the complexity of the system depends
on the size of its boundary, not its volume.


For the sake of clarity, we assume that H = \sum_{i=1}^m H_i is a
2-local Hamiltonian on qubits, each 0 \leq \| H_i \| \leq 1 is a
projection (H_i^2 = H_i), H is frustration free, (meaning that
\lambda_0(H), the lowest eigenvalue, is zero) and there is a unique
ground state | \psi_0 \rangle.

These conditions simplify our proof. It will be easy to generalize to
0 \leq \| H_i \| \leq 1 and relaxing to q-local on qudits (dimension
d particles). The most significant assumption is that H is
frustration-free: more work is necessary to make the proof work

To formalize our conjecture, we define a notion of von Neumman entropy

Definition: Consider any state | \psi \rangle lying in a bipartite Hilbert space
\mathcal{H}_A \otimes \mathcal{H}_B, and perform a Schmidt
decomposition to obtain
| \psi \rangle = \sum_{i =1}^{d} \lambda_i | u_i \rangle_A| v_i \rangle_B.
The Von Neumann entanglement entropy of | \psi \rangle across region A
is defined as
\displaystyle S(A)_{| \psi \rangle} = \sum_i \lambda_i^2 \ln \frac{1}{\lambda_i^2}.

Why is this a good notion of entropy? If
| \psi \rangle = | L \rangle \otimes | R \rangle, then
S_{| \psi \rangle} = 0, i.e. the product state has no entanglement. On
the other hand, the maximally entangled state
| \psi \rangle = \sum_{i = 1}^D \frac{1}{\sqrt{D}} | i \rangle_A \otimes | i \rangle_B
has entropy S_{| \psi \rangle} = \ln(D). In general, we have that

\displaystyle S_{| \psi \rangle} \leq \ln(\dim \text{Hilbert space}).
This is a
direct quantum generalization of the fact that a classical probability
distribution on D elements has entropy at most \ln D, and this is
achieved by the uniform distribution.\
We can now state the Area Law.

Conjecture: Let | \psi_0 \rangle be the ground state of H, as defined above.
Then for any region A (i.e. a subset of qubits),

\displaystyle S_{| \psi_0 \rangle} \leq \ln (\dim \partial A),
where \partial A
denotes the boundary of the region A, the set of qubits that interact
through H with at least a qubit outside of region A.

This conjecture is illustrated by the following figure.

We prove the 1-D version of this conjecture. Note that for a 1D system
(i.e. qubits on a line with nearest neighbor interactions) it suffices
to consider bipartitions of the Hilbert space defined by a “cut” (see
the figure below). By a cut, we mean a partition of the qubits into two
sets consisting of all the qubits to the left and right, respectively,
of a fixed qubit i^*.


Theorem: Let | \psi_0 \rangle be as above. Then for any cut C = (i^*, i^*+1),
\displaystyle S(C)_{| \psi_0 \rangle} \leq O(1).

Approximate Ground State Projectors

In the case where all H_i commute, consider the operator
A = \prod_i (I - H_i). This operator will project any product state
onto the ground state. Therefore, we can get a representation of the
ground state as a matrix product state (as we saw in a previous post on
tensor networks). Can we generalize this idea when the H_i do not
necessarily commute?

A natural idea is to define an operator that approximately projects a
state onto the vector we want (the ground state), while shrinking it on
the other directions. We also want this operator to not increase
entanglement too much. The proof of the area law will involve
constructing a low-rank tensor approximate factorization of the ground
state by starting with a factorized state, which has a good overlap with
the ground state, and then repeatedly applying such an operator.


We first give a formal definition of an approximate ground state
projector. In the following, under the assumption of the existence of
such an object, we will prove the area law. At the end, we will show how
to construct it.

Definition: K is a (D,\Delta)approximate ground state projection (AGSP) if:
1. K | \psi_0 \rangle = | \psi_0 \rangle
2. For every | \Gamma \rangle such that
\langle \Gamma | \psi_0 \rangle = 0, then
\| K | \Gamma \rangle \|^2 \leq \Delta \| | \Gamma \rangle \|^2.
3. K has entanglement rank at most D. That is, K admits a Schmidt
decomposition of the form K = \sum_{i=1}^{D} L_i \otimes R_i,
where L_i and R_i act only on the particles to the left and
right of the cut, respectively.

The first step of the proof of the Area Law is to find a product state
that has a good overlap with the ground state. That is, it should have a
constant overlap that is not dependent on n, the number of particles
in the system.

Lemma 1: Suppose there exists a (D, \Delta)-AGSP such that
D \Delta \leq \frac{1}{2}. Fix a partition (A, \bar{A}) of the space
on which the Hamiltonian acts. Then there exists a product state
| \phi \rangle = | L \rangle_A \otimes | R \rangle_{\bar{A}} such that
\displaystyle \langle \phi | \psi_0 \rangle = \mu \geq \frac{1}{\sqrt{2D}}

Let | \phi \rangle be a product state with the largest overlap in
| \psi_0 \rangle, meaning that it maximizes.
\langle \phi | \psi_0 \rangle = \mu and can be written as
\displaystyle | \phi \rangle = \mu | \psi_0 \rangle + \sqrt{1 - \mu^2} | \psi^{\perp} \rangle,
where the latter is some state orthogonal to the ground state. We apply
K to get
\displaystyle K | \phi \rangle = \mu | \psi_0 \rangle + \delta | \psi' \rangle
where |\delta|^2 \leq \Delta and | \psi' \rangle is normalized. By
the properties of the operator K, the decomposition of
K | \psi_0 \rangle has at most D terms. Using the Cauchy-Schwarz
\displaystyle \begin{aligned} \mu = \big| \sum_i \lambda_i \langle \psi_0 | | L \rangle_i | R \rangle_i\big| &\leq \sqrt{\sum_i \lambda_i ^2} \sqrt{\langle \psi_0 | | L \rangle_i | R \rangle_i^2} \\ &\leq \sqrt{\mu^2 + \Delta} \sqrt{D} \sqrt{\max_i \langle \psi_0 | | L \rangle_i | R \rangle_i}\end{aligned}

Therefore there exists a product state such that
\mu \geq | \langle \psi_0 | | L \rangle_i | R \rangle_i| \geq \frac{\mu}{\sqrt{D(\mu^2 +D)}},
which implies that \mu^2 \geq \frac{1}{2D}, as desired. \Box

Starting with a product state with a big enough overlap with the ground
state, the proof of the next lemma shows how we can repeatedly apply
AGSPs to bound the entropy across the cut.

Lemma 2: Suppose there exists a (D, \Delta)-AGSP such that
D \Delta \leq \frac{1}{2}, and a product state
| \phi \rangle = | L \rangle_A \otimes | R \rangle_{\bar{A}} such that
\langle \phi | \psi_o \rangle = \mu. Then
\displaystyle S_{| \psi_0 \rangle} \leq O(1) \frac{\log \mu}{\log \Delta} \log{D}

But before we prove this, let’s state an intuitive result of Eckart and
Young, which provides an upper bound for the overlap of two states given
the Schmidt rank of one.

Lemma (Eckart-Young): Let | \psi \rangle \in L \otimes R be a normalized
vector with Schmidt decomposition
| \psi \rangle = \sum_i \lambda_i | u_i \rangle| v_i \rangle, where
\lambda_1 \geq \lambda_2 \geq \dots. Then for any normalized
| \phi \rangle \in L \otimes R it holds that
\displaystyle | \langle \psi | \phi \rangle| \leq \sqrt{\sum_{i} \lambda_i^2}

Using this result, we continue the proof.

Proof of Lemma 2:
Write the (D, \Delta)-AGSP as K. Given a state | \phi \rangle, we
denote by | \phi_l \rangle the normalized stated after applying l
projections on the state, namely
\displaystyle | \phi_l \rangle = \frac{K^l | \phi \rangle}{\|K^l | \phi \rangle\|}.
By the way we’ve defined K, the Schmidt rank of | \phi_l \rangle
increases by a power of l, and this state has an improved overlap with
the ground state, that is:
\displaystyle SR(| \phi_l \rangle) \leq D^l, \quad \langle \psi_0 | \phi_l \rangle \geq \frac{\mu}{\sqrt{\mu^2 + \Delta^l(1 - \mu^2)}}.

Let | \psi_0 \rangle = \sum_i \lambda_i | L_i \rangle| R_i \rangle be
the Schmidt decomposition of the ground state relative to the cut
(i^*, i^*+1), where \lambda_1 \geq \lambda_2 \geq \dots. The
Eckart-Young theorem gives:
\displaystyle \sum_{i=1}^{D^l} \lambda_i^2 \geq | \langle \psi_0 | \phi_l \rangle|^2 \geq \frac{\mu^2}{\mu^2 + \Delta^l (1 - \mu^2)}
from which
\displaystyle \sum_{i> D^l} \lambda_i^2 \leq 1 - \frac{\mu^2}{\mu^2 + \Delta^l (1 - \mu^2)} \leq \frac{\Delta^l}{\mu^2}

We choose
l_0 = 2 \frac{\log \mu}{\log \Delta} - \frac{\log 2}{\log \Delta} such
that \frac{\Delta^{l_0}}{\mu^2} \leq \frac{1}{2}. Let’s now bound the
worst case entropy accross the AGSP cut. The first D^{l_0} Schmidt
coefficients account for an entropy of at most l_0 \log D. For the
remaining coefficients, we group them in chunks of size D^{l_0} in
intervals [D^{kl_0} + 1, D^{(k+1)l_0}] indexed by k. For each of
these intervals, the corresponding entropy can be upper bounded by
\displaystyle \frac{\delta^{kl_0}}{\mu^2} \log D^{(k+1)l_0} = l_0 \frac{1}{2^k} (k+1) \log D
where here \frac{\Delta^{kl_0}}{\mu^2} is an upper bound on the total
probability mass in the interval, and D^{-(k+1)l_0} a lower bound on
the size of any Schmidt coefficient (squared) in the interval. This
follows from the fact that they are organized in descending order and
must sum to 1.

Therefore, the total entropy is
\displaystyle \begin{aligned} S((i^* ,i^* +1)) &\leq l_0 \log D + \sum_{k \geq 1} l_0\frac{1}{2^k}(k+1) \log D \\ &\leq l_0 \log D + l_0 \log D \sum_{k \geq 1} \frac{1}{2^k}(k+1) \\ &\leq O(1) l_0 \log D \\ &\leq O(1) \frac{\log \mu}{\log \Delta}\log D\end{aligned}

This bound depends on D and \Delta. This is sufficient to imply the
Area Law because our AGSP constructions will have constant D and
\Delta. \Box

AGSP Construction

In this section, our goal is to construct an AGSP with the correct
tradeoff in parameters. We briefly recall the intuition behind the
properties that an AGSP must satisfy. First, the AGSP should leave the
ground state untouched. Second, it shrinks states which are orthogonal
to the fround state. Third, it should not increase the entanglement
across the cut by too much (see the figure below).

A depiction of the third property in the AGSP definition for a 1D
system. The property that K is a sum of tensor products of operators
which only act on one “side” of the

We are interested in understanding a construction of an AGSP for three

  1. As we saw before, we can prove a 1D area law assuming the existence
    of an AGSP with good parameters.
  2. AGSPs have been used a building block in an provably polynomial-time
    algorithm for finding ground states of gapped 1D local Hamiltonians.
  3. They appear to be a general purpose tool with potential applications
    in other contexts.

Details of the construction

One way to think about the the first two properties in the AGSP
definition is in terms of the relationship of the eigenvalues of K and
the eigenvalues of H (see the figure below). The first property says
that the ground state of H should be an eigenvector of K with
eigenvalue 1. The second property says that all other eigenvectors of
H should be mapped to eigenvectors of K with eigenvalue at most

In this plot, the horizontal axis represents the eigenvalues of H
and the vertical axis represents the eigenvalues of K. The ground
state of H, which corresponds to the eigenvalue 0, should remain fixed
upon applying K to

In the following, we will aim to construct and AGSP whose tradeoff in
parameters allows us to achieve the following result.

Theorem: S_{| \psi_0 \rangle}(A) = O(\frac{\log^2 d}{\epsilon}).

Attempt 1

As a first attempt, we will evaluate the quality of the following
\displaystyle K = (1-\frac{H}{\| H \|})^l,
where l is some natural
number that we will decide upon later. We now check the three properties
for this particular choice of K.

First, it is clear that K does not change the ground state, since the
ground state is an eigenvector of H with eigenvalue 0
(frustration-free case). Second, because any other eigenvalue of H is
at least \epsilon (the energy gap), we have that
\Delta \leq (1-\frac{\epsilon}{\| H \|})^l \approx e^{-\frac{\epsilon}{\| H \|} l}.
Third, for the time being, let us think of l as roughly corresponding
to D (we will formalize this later).

Unfortunately, these values \Delta and D will not allow us to
achieve the desired tradeoff of D \Delta \leq \frac{1}{2}. To see why
this is, observe that because we have n-1 local Hamiltonians, the term
\| H \| could be on the order of n (recall that
0 \leq \| H_i \| \leq 1), so we would need to choose l proportional
to n to reduce \Delta to a constant (see figure below). Unfortunately, this will result in a
corresponding increase in D, violating D \Delta \leq \frac{1}{2}.

The first attempt an AGSP corresponds to the polynomial is depicted in
blue. This polynomial does not decay at a fast enough rate, so the
second attempt will replace it with another polynomials, depicted in
red, which decays much


Attempt 2

In summary, the two weaknesses of the previous approach we need to
address are:

  1. The large \| H \| term in the exponent substantially reduces the
    shinking effect of applying the AGSP. We will address this using the
    technique of Hamiltonian truncation.
  2. For a fixed l (roughly corresponding to the entanglement rank of
    K), the function f(x) = (1 - \frac{x}{\| H \|})^l does not decay
    fast enough after the point x = 0 (which is the ground state
    energy). We will address this using Chebyshev polynomials.

As discussed before, the \| H \| term is large because it involves
contribution from all n-1 local Hamiltonians. Intuitively, we might
believe that only the local Hamiltonians near the site of the cut itself
should play a large role in the entanglement of the ground state across
the cut. Specifically, if we write
H = H_1 + \ldots H_{n-1} = H_L + H_{i_1} + \ldots H_{i_s} + H_R, where
H_{i_j} are the s local Hamiltonians neighboring the cut and H_L
and H_R are the “aggregrated” Hamiltonians of all the remaining local
Hamiltonians to the left and right, respectively, of H_{i_1} and
H_{i_s}, then we are asserting that H_L and H_R contribute
significantly to \| H \| but are not important when trying to
characterize the ground state (see figure below).

We keep track of the local Hamiltonians corresponding to the s
qudits directly surrounding the cut. The remaining local Hamiltonians
are aggregated into the Hamiltonians H_L and H_R, whose eigenvalues
will be truncated to define

Mathematically, if we let
H' = H^{\leq t}_L + H_{i_1} + \ldots H_{i_s} + H^{\leq t}_R, where
H^{\leq t}_i denotes operator obtained by truncating all eigenvalues
of H to be at most t, we claim the following:

1. \| H' \| \leq s + 2t
2. | \psi_0 \rangle is a ground state of H'.
3. H' has gap \epsilon' = \Omega (\epsilon)

It is straightforward to verify the two first two claims, We omit the
proof of the third claim. Intuitively, we have chosen H' to be an
approximation of H which has much smaller norm.

Next, we come to the task of designing a degree-l polynomial p which
decays faster than our first attempt f(x) = (1 - \frac{x}{\| H \|})^l
after the point x=0. In other words, we would like p to satisfy
p(0) = 1 (fixing the ground energy) and map the interval
[\epsilon, \| H' \|] to the interval [0,\Delta] for as small a
\Delta as possible. It turns out that the degree-l Chebyshev
polynomial (of the first kind) is precisely the polynomial which has
this behavior (see the appendix for a review of Chebyshev polynomials).
Letting T_l denote the degree-l Chebyshev polynomial, we take our
polynomial to be a scaled and shifted version of it:
\displaystyle p(x) = T_l\left(\frac{\| H' \| + \epsilon' - 2x}{\| H' \| - \epsilon'}\right)/T_l\left(\frac{\| H' \| + \epsilon'}{\| H' \| - \epsilon'}\right)
Given this definition of p, we take our new candidate AGSP to be
K = p(H), which has the following guarantees:

1. K | \psi_0 \rangle = | \psi_0 \rangle
2. K achieves
\Delta = O(\exp{-\sqrt{\frac{\epsilon'}{\| H' \|}} l})
3. K achieves D = O(d^{\frac{l}{s} + s})

The first claim is straightforward to verify. The proof of the second
claim is ommitted; it involves calculations making use of standard
properties of Chebyshev polynomials. We remark that the dependence of
the exponent \Delta on \| H' \| has been reduced by a quadratic
amount, compared with the first attempt at an AGSP. We now sketch the
main ideas behind the proof of the third claim. After that, we will
choose values of s and t so that the desired tradeoff D \Delta is

We have defined K as a degree-l polynomial in H', so we may write
it as: \displaystyle K = \sum_{j=0}^{l} c_i (H')^j, where c_i are coefficients
whose particular values are not relevant to the analysis of the Schmidt
rank of K across the given cut. It is straightforward to check that
the Schmidt rank of K is bounded by the sum of the ranks of the
monomials (H')^j (subadditivity). So, it suffices to analyze the
Schmidt rank of the monomial of the largest degree, (H')^l.

From our expression for H', we can expand this power to get:
\displaystyle (H')^l = (H^{\leq t}_L + H_{j_1} + \ldots H_{j_s} + H^{\leq t}_R)^l = \sum_{j_1,\ldots,j_l} H_{j_1} \cdots H_{j_l},
where the summation is over all products of l Hamiltonians from the
set \{H^{\leq t}_L, H_{i_1}, \ldots H_{i_s}, H^{\leq t}_R\}. Now, one
natural approach to proceed is to analyze the Schmidt rank of each term
in the summation, and then sum over all the terms. Unfortunately, this
approach will not work because there are exponentially many terms in the
summation, so our estimate of D will be too large. To address this
issue, it is possible to collect terms cleverly so that the total number
of terms is small. However, we will not provide the details here; we
will just sketch how analyze the Schmidt rank of an individual term.

The term H_{j_1} \cdots H_{j_l} consists of a product of l
Hamiltonians. Since there are s+1 possible locations of a cut that is
crossed by one of the H_{j_k}, an average cut will have
\frac{l}{s+1} of the H_{j_k} crossing it. For each Hamiltonian which
crosses this average cut, the Schmidt rank will be multiplied by a
factor of d^2. Hence, the Schmidt rank of the term
H_{j_1} \cdots H_{j_l} is O(d^\frac{2l}{s}). This bound on the
Schmidt rank is for an average cut C', not necessarily the cut C we
fixed in the beginning. The two cuts are separated by at most O(s)
qudits because C' is one of the s+1 cuts surrouding C (see figure
below), so one can relate the Schmidt rank across these two cuts. For
each Hamiltonian between the two cuts, the high-level idea is that the
entanglement rank is multiplied by at most a factor of d. Because the
entanglement rank across C' is O(d^\frac{2l}{s}), the entanglement
rank across C is O(d^{\frac{2l}{s} + s}).

The entanglement across cut C is analyzed bounding the entanglement
across a nearby cut C' and then relating the

Putting all the pieces together, we have that:

  1. By Hamiltonian truncation and the Chebyshev polynomial construction,
    \Delta = O(e^{-l/ \sqrt{s}}).
  2. By the entanglement rank analysis, D = O(d^{\frac{l}{s} + s}).

One can now check that if we set l = O(s^2) and
s = O(\frac{\log^2 d}{\epsilon}) (where large enough constants are
hidden inside the O(\cdot)), then we achieve the desired tradeoff of
D \Delta \leq \frac{1}{2}. Note that l and s are constants because
d and \epsilon are constants.

Towards an area law in high dimensions

One of the major open problems in this line of work is to prove an area
law for 2D systems. We now discuss one potential line of attack on this
problem, based on a reduction to the 1D area law we have proven using

Consider a 2D system, as depicted in the figure below. We can form a
sequence of concentric shells of qudits of increasing radii and
“contract” each shell into a single qudit of higher dimension. That is,
if each of the original qudits have dimension d and we consider the
shell at radius r, the dimension of the contracted qudit is roughly
d^r because there are roughly r qudits in the shell. The system of
contracted qudits is now a 1D system in the sense that the shell at
radius r only interacts with the shells at radii r-1 and r+1.

A 2D system can be reduced to a 1D system by decomposing it into a
sequence of concentric

Suppose that we were able to show an area law as discussed previously,
but with the following dependence on d:

\displaystyle S_{| \psi_0 \rangle}(A) = O(\frac{\log d}{\epsilon}).
If we replace
d with d^r, the dimensionality in our new 1D system, we get that

\displaystyle S_{| \psi_0 \rangle}(A) = O(\frac{r}{\epsilon}),
which is exactly an
area law in 2D (recalling that surface area of a circle is proportional
to its radius). In fact, even the weaker result of proving that
S_{| \psi_0 \rangle}(A) = O(\frac{\log^{2-\alpha} d}{\epsilon}) for
some \alpha > 0 would imply a “sub-volume law”, already a major
breakthrough. This suggests that constructing AGSPs with better
parameters may be a plausible approach to proving an area law.

Future Directions

In addition to proving an area law in 2D, there are several interesting
questions raised by what we discussed here. See the survey of Gharibian
et al. for more details.

  1. Do area laws hold for gapless (i.e. non-constant gap, or “critical”)
    systems? No, there are known counterexamples.
  2. Do area laws hold for degenerate systems (where the ground state is
    not unique)? Yes, the AGSP approach still works for systems with
    groundspaces of constant dimension. It is unknown whether area laws
    hold when the groundspace is of dimension polynomial in n.
  3. Can one design practical algorithms for finding ground states of
    gapped 1D local Hamiltonians? Yes, there is one approach based on
  4. Do area laws hold in general for systems where the interactions are
    not necessarily described by a lattice? No, there are known
    counterexamples to such a “generalized area law”?


We would like to thank Boaz Barak and Tselil Schramm for their guidance
in preparing the talk on which this post is based.

Most of this post (including some figures) draws heavily upon the
following resources: lecture notes from courses taught by Thomas Vidick and Umesh Vazirani,
repectively, and a survey by Gharibian et al.
In these notes, we have opted to illustrate the main ideas, while
glossing over technical details. Our goal with this approach is to
prepare the reader to more easily understand the original sources and
highlight the some of the key concepts.

Appendix: Background on Chebyshev Polynomials

The Chebsyhev polynomials \{T_l\}_{l \in \mathbb{N}} are a family of
polynomials with many special properties. They are defined as
T_0(x) = 1, T_1(x) = x and \displaystyle T_l(x) = 2xT_{l-1}(x) - T_{l-2}(x).
The first few polynomials are plotted below (Source:

Below, we state a property (proof ommitted) which provides some
intuition as to why Chebyshev polynomials are the “correct” choice for
our construction. Intuitively, it says that among all degree-l
polynomials which are trapped inside [-1,1] on the interval [-1,1],
T_l dominates p outside of [-1,1].

Claim: Let p: \mathbb{R} \rightarrow \mathbb{R} be a degree-l polynomial
such that |p(x)| \leq 1 for all x \in [-1,1]. Then for all
|x| \geq 1, |p(x)| \leq |T_l(x)|.


  1. Sevag Gharibian, Yichen Huang, Zeph Landau, Seung Woo Shin, et al. “Quantum Hamiltonian Complexity”. Foundations and Trends in Theoretical Computer Science, 10(3):159–282, 2015.
  2. Thomas Vidick. Lecture notes for CS286 seminar in computer science: “Around the quantum PCP conjecture”., Fall 2014.
  3. Umesh Vazirani. Lecture notes for CS294-4 “Quantum computation”., Fall 2016.

Algorithmic and Information Theoretic Decoding Thresholds for Low density Parity-Check Code

by Jeremy Dohmann, Vanessa Wong, Venkat Arun


We will discuss error-correcting codes: specifically, low-density parity-check (LDPC) codes. We first describe their construction and information-theoretical decoding thresholds, p_{c}

Belief propagation (BP) (see Tom’s notes) can be used to decode these. We analyze BP to find the maximum error-rate upto which BP succeeds.

After this point, BP will reach a suboptimal fixed point with high probability. This is lower than the information-theoretic bound, illustrating a gap between algorithmic and information-theoretic thresholds for decoding.

Then we outline a proof of a theorem suggesting that any efficient algorithm, not just BP, will fail after the algorithmic threshold. This is because there is a phase transition at the algorithmic threshold, after which there exist an exponentially large number of suboptimal `metastable’ states near the optimal solution. Local search algorithms will tend to get stuck at these suboptimal points.

Introduction to Error Correcting Codes


Alice wants to send a message to Bob, but their channel of communication is such that Bob receives a corrupted version of what Alice sent. Most practical communication devices are imperfect and introduce errors in the messages they are transmitting. For instance, if Alice sends 3V, Bob will really receive three volts plus some noise (we have chosen to ignore some inconvenient practical details here). In many cases, this noise is quite small, e.g. it could be less than 0.5V in 99.99% of cases. So, in principle, Alice could have had very reliable delivery by just choosing to always send 0V for a logical 0 and 5V for logical 1, using checksums to detect the occasional error. But this is wasteful. Alice could have squeezed more levels between 0 and 5V to get a higher bitrate. This causes errors, but Alice can introduce redundancy in the bits she is transmitting which can enable Bob to decode the correct message with high probability. Since it is much easier to control redundancy in encoding than in physical quantities, practical communication devices often choose choose to pack enough bits into their physical signal that errors are relatively quite likely, relying instead on redundancy in their encoding to recover from the errors. Redundancy is also used in storage, where we don’t have the option of detecting an error and retransmitting the message.

Some errors in communication are caused by thermal noise. These are unpredictable and unavoidable, but the errors they cause can be easily modeled; they cause bit-flips in random positions in the message. There are other sources of error. The clocks on the two devices may not be correctly synchronized, causing systematic bit-flips in a somewhat predictable pattern. A sudden surge of current (e.g. because someone turned on the lights, and electricity sparked between the contacts) can corrupt a large contiguous segment of bits. Or the cable could simply be cut in which case no information gets through. These other kinds of errors are often harder to model (and easier to detect/mitigate), so we have to remain content with merely detecting them. Thus for the remainder of this blog post, we shall restrict ourselves to an error model where each bit is corrupted with some fixed (but potentially unknown) probability, independent of the other bits. For simplicity, we shall primarily consider the Binary Erasure Channel (BEC), where a bit either goes through successfully, or the receiver knows that there has been an error (though we will introduce some related channels along the way).

Claude Shannon found that given any channel, there is a bitrate below which it is possible to communicate reliably with vanishing error rate. Reliable communication cannot be achieved above this bitrate. Hence this threshold bitrate is called the channel capacity. He showed that random linear codes are an optimal encoding scheme that achieves channel capacity. We will only briefly discuss random linear codes, but essentially they work by choosing random vectors in the input space and mapping them randomly to vectors in the encoded space. Unfortunately we do not have efficient algorithms for decoding these codes (mostly due to the randomness in their construction), and it is conjectured that one doesn’t exist. Recently Low-Density Parity Check (LDPC) codes have gained in popularity. They are simple to construct, and can be efficiently decoded at error levels quite close to the theoretical limits.

With LDPC codes, there are three limits of interest for any given channel and design bitrate (M/N): 1) the error level upto which an algorithm can efficiently decode them, \epsilon_d, 2) the error level level upto which they can be decoded by a computationally unbounded decoder, \epsilon_c and, 3) the error level beyond which no encoding scheme can achieve reliable communication, \epsilon_s. Obviously, \epsilon_d \le \epsilon_c \le \epsilon_s, and in general these inequalities can be strict. Our goal here is to study the gap between \epsilon_d and \epsilon_c. We will sometimes refer to these three quantities as p_{d}, p_{c}, and p_{shannon} when discussing channels besides the BEC. This is following the notation of Mezard and Montanari (2009).

More formally, information theory concerns reliable communication via an unreliable channel. To mitigate the errors in message transmission, error correcting codes introduce some type of systematic redundancy in the transmitted message. Encoding maps are applied to the information sequence to get the encoded message that is transmitted through the channel. The decoding map, on the other hand, is applied to the noisy channel bit (see Figure below). Each message encoded is comprised of M bits and N>M redundant sequences of bits in an error correcting code. 2^M possible codewords form a “codebook” |\mathbb{C}| in binary space \{0,1\}.

Claude Shannon’s code ensembles proved that it is easier to construct stochastic (characterized by good properties and high probability) models vs. deterministic code designs. Stochastic models were able to achieve optimal error correcting code performance in comparison to a more rigidly constructed model, proving that it was possible to communicate with a vanishing error probability as long as the rate of transmission R=M/N is smaller than the channel capacity, a measure of the maximum mutual information between channel input and output.

Fig 1 Schematic of the communication model of information communication.

Thus, in order to construct an optimal error correcting code, one must first define the subset of the space of encoding maps, endow the set with probability distributions, and subsequently define the associated decoding map for each of the encoding maps in the codes. We have included a section in the A that gives a thorough discussion of random code ensembles which are known to achieve optimal decoding, whether via scoring decoding success by bit error rate or decoded word error rate. We will also show a perspective which uses principles from statistical physics to unify the two (often called finite-temperature decoding). From hereon out we will discuss LDPC and explore how the values of p_{d}, p_{c} and p_{shannon} for various channels reveal deep things about the structure of the decoding solution space.

Low-density Parity Check Code

LDPC codes are linear and theoretically excellent error correcting codes that communicate at a rate close to the Shannon capacity. The LDPC codebook is a linear subspace of \{0,1\}^N . For an MxN sparse matrix \mathbb{H}, the codebook is defined as the kernel:

\mathbb{C} = { \underline{x} \in \{0,1\}^N:\mathbb{H}\underline{x}=\underline{0}}

where all the multiplications and sums involved in
\mathbb{H} \underline{x} are computed modulo 2.

Matrix \mathbb{H}is called the parity check matrix and the size of the codebook is 2^{N-rank(\mathbb{H})}. Given this code, encoding is a linear operation when mapping an N x L binary generating matrix \mathbb{G} (the codebook is the image of \mathbb{G}:\mathbb{C}={x=\mathbb{G}\underline{z}, \text{where } \underline{z} \in \{0,1\}^L} ) such that \underline{z}\rightarrow \underline{x} =\mathbb{G}z).

Every coding scheme has three essential properties that determine its utility: the geometry of its codebook and the way it sparsely distributes proper codewords within the encoding space \{0,1\}^{N} (c.f. our mention of sphere packing as a geometric analogy for RLCs), the ease with which one can construct a code which sparsely distributes codes within the encoding space, and the existence of fast algorithms to perform effective decoding.

A coding scheme over a given channel (whether it be BSC, BEC, AWGN, etc.) also has three parameters of interest, p_{d} which is the error rate above which some chosen algorithm cannot perform error-free decoding, p_{c} above which even exhaustively enumerating over all 2^{M} codewords in the codebook and calculating the MAP probability does not successfully decode, p_{shannon} which is the capacity of the channel, an error rate above which no decoding scheme could perform error-free decoding.

LDPC codebook geometry and p_{c}

On the subject of codebook geometry, it is well known that LDPC ensembles, in expectation, produce sparsely distributed codewords. This means that valid codewords (i.e. those that pass all parity checks) are far apart from one another in Hamming space and thus require a relatively large number of bits to be lost in order for one codeword to degenerate into another one. There is an important property called the distance enumerator which determines the expected number of codewords in a tight neighborhood of any given codeword. If for a given distance the expected number is exponentially small, then the coding scheme is robust up to error rates causing that degree of distortion. We discuss a proof of LDPC distance properties in Appendix B and simply state here that LDPCs are good at sparsely distributing valid codewords within the encoding space. The property p_{c}, introduced above is intimately related to the geometry of the codebook and the properties of the noisy channel being decoded over.

The information theoretic threshold, p_{c} is the noise level above which MAP decoding no longer successfully performs decoding. p_{c} is important because it is the error value above which we could theoretically always perform (slow) decoding below p_{c} by enumerating all 2^{M} codewords in the codebook and calculating the one with the highest MAP probability.

Every LDPC ensemble has some different p_{c} for a given channel. WFor now we will simply remark that LDPC ensembles are effective because they can be chosen such that p_{c} closely approaches the p_{shannon} limit for many channels. Even more importantly, we will show that it is likely that there is no fast algorithm for which p_{d} = p_{c} for general LDPCs over a general noisy channel.

We will not derive the p_{c} for any LDPC ensembles over any channels (I recommend chapter 15 of this book for details), but we will, in section 3, present results derived by other researchers.

Ease of construction

On the subject of ease of construction and ease of decoding, there is a much simpler graphical representation of LDPC codes which can be used to demonstrate LDPC tractability.

LDPCs can be thought of as bipartite regular graphs, where there are N variable nodes which are connected to M parity check nodes according to randomly chosen edges based on the degree distribution of the LDPC. Though the appendix discusses general degree distributions we will discuss here only (d,k) regular bipartite graphs, in which all variables have d edges and all parity check nodes have k edges, and how to generate them under the configuration model.

The configuration model can be used to generate a bipartite graph with (d,k) degree distribution by initially assigning all variable nodes d half-edges, all parity check nodes k half-edges, and then randomly linking up half-edges between the two sets, deleting all nodes which end up being paired an even number of times, and collapsing all odd numbered multi-edges into a single edge. This system doesn’t work perfectly but for large N, the configuration model will generate a graph for which most nodes have the proper degree. Thus it is relatively easy to generate random graphs which represent LDPC codes of any desired uniform degree distribution. An example of this graphical representation is in figure 2

Fig 2 Factor graph of a (2,3) regular LDPC code with factor nodes as black squares and variable nodes as white circles, and notation for BP messages.

How the graphical model relates to fast decoding

The graphical model of LDPCs is useful because it is both easy to construct and presents a natural way to perform fast decoding. In fact, the fast graph-based decoding algorithm, Belief Propagation, we use has a p_{d} which likely represents the upper limit on fast decoding for LDPCs.

We have seen recently that bipartite graphical models
which represent a factorized probability distribution can be used to calculate marginal probabilities of individual variable nodes (what a mouthful!).

Basically, if the structure of the graph reflects some underlying probability distribution (e.g. the probability that noisy bit y_i was originally sent as bit x_i) then we can use an iterative algorithm called Belief Propagation (see blog post above) to actually converge to the exact probability distribution for each P(y_i~|~x_{i}).

This is important because when we perform decoding, we would like to estimate the marginal probability of each individual variable node (bit in our received vector), and simply set the variable to be the most likely value (this is known as bit-MAP decoding, discussed earlier). As mentioned above, under certain conditions the Belief Propagation algorithm correctly calculates those marginal probabilities for noise rates up to an algorithmic threshold p_{d}.

The Belief Propagation algorithm is an iterative message passing algorithm in which messages are passed between variable nodes and parity check/factor nodes such that, if the messages converge to a fixed point, the messages encode the marginal probabilities of each variable node. Thus BP, if it succeeds can perform bit-MAP decoding and thus successfully decode.

We will show in the next section how the configuration model graphs map to a factorized probability distribution and mention the p_{d} for BP. In the following section we will show an example of decoding over the binary erasure channel, then finally we will show motivation to suggest that the p_{d} for BP over LDPCs represents a hard upper limit above which no fast decoding algorithms exist.

Decoding Errors via Belief Propagation

As mentioned above (again, please see Tom’s excellent blog post for details), the belief propagation algorithm is a useful inference algorithm for stochastic models and sparse graphs derived from computational problems exhibiting thresholding behavior. As discussed, symbol/bit MAP decoding of error correcting codes can be regarded as a statistical inference problem. In this section, we will explore BP decoding to determine the threshold for reliable communication and according optimization for LDPC code ensembles in communication over a binary input output symmetric memoryless channel (BSC or BMS).

Algorithm Overview

Recall that the conditional distribution of the channel input \underline{x} given the output \underline{y} is given by and that we wish to find the \underline{x} that maximizes the below probability given \underline{y}

p(\underline{x}|\underline{y}) = \frac{1}{Z(y)}\prod_{i=1}^{N}Q(y_i|x_i) \prod_{a=1}^{M} \mathbb{I}(x_{i_{1^{a}}} \otimes ... ~x_{k(a)^a} = 0) (1)

Where Q(y_i~|~x_{i}) is the conditional probability of y_{i} of observing noisy bit y_{i} given that x_{i} was sent. For the BSC we have Q(y_{i} = 1 | x_{i} = 1) = Q(y_{i} = 0 | x_{i} = 0) = 1-p and Q(y_{i} = 1 | x_{i} = 0) = Q(y_{i} = 0 | x_{i} = 1) = p.


\mathbb{I} (x_{i_{1^{a}}} \otimes  ... ~x_{k(a)^a} = 0)

is an indicator variable which takes value 1 if \underline{x} satisfies parity check a and 0 otherwise. In particular, the product of these indicators takes into account the fact that 0 probability should be assigned to hypotheses \underline{x} which aren’t in the code book (indicated by at least one parity check failing).

We would like to design a message passing scheme such that the incoming messages for a given variable node encode their marginal probabilities p(x_i~|~y_i).

Note, first and foremost that this probability can be factorized a la BP factor graphs such that there is a factor node for each parity check node a which contributes probability

\mathbb{I} (x_{i_{1^{a}}} \otimes ... ~x_{k(a)^a} = 0)

and a factor node for each channel probability term Q(y_i~|~x_{i}). Since each channel probability term is only connected to a single variable, its message never gets updated during BP and so we omit it from the factor graphs (e.g. note that figure 2 only has parity check nodes and variable nodes)

The message passing scheme ends up taking the form

v_{i\rightarrow a}^{(t+1)}(x_i) \propto Q(y_{i} | x_{i}) \prod_{b \in \partial i \setminus a} \hat{v}{b \rightarrow i}^{(t)} (2)

\hat{v}_{a \rightarrow i}(x_{i}) \propto \sum_{{x_{j}}} \mathbb{I} (x_{i} \otimes x_{j_{1}} ... x_{j_{k-1}} = 0) \prod_{j \in \partial a \setminus i} v_{j\rightarrow a}^{(t)}(x_j) (3)

Where \partial a denotes the neighborhood of factor node a and the sum in (3) is over all possible configurations of the neighbors of a not including i.

Messages are passed along the edges as distributions over binary valued variables described by the log-likelihoods

h_{i\rightarrow a} = \frac{1}{2}\log \frac{v_{i\rightarrow a(0)}}{v_{i\rightarrow a (1)}} (4)

u_{i\rightarrow a} = \frac{1}{2}\log \frac{\hat{v}{i\rightarrow a(0)}}{\hat{v}{i\rightarrow a (1)}} (5)

We also introduce the a priori log likelihood for bit i given the received message y_i:

B_{i} = \frac{1}{2} log \frac{Q(y_{i} | 0)}{Q(y_{i} | 1)}

Once we parametrize the messages as log-likelihoods, it turns out we can rewrite our update rules in terms of the parametrized values h and u, making updates much simpler:

h_{i \rightarrow a}^{(t+1)} = B_i + \sum_{b \in \partial i \setminus a} u_{b \rightarrow i}^{(t)} (6)

u_{b\rightarrow i}^{(t)} = atanh{ \prod_{j \in \partial a \setminus i} tanh(h_{j \rightarrow a}^{(t)})} (7)

Given a set of messages, we would perform decoding via the overall log likelihood h_{i}^{(t+1)} = B_i + \sum_{b \in \partial i} u_{b \rightarrow i}^{(t)}. Where x_{i} gets decoded to 0 for h_{i} > 0 and 1 for h_{i} < 0.

Typically BP is run until it converges to a set of messages that decode to a word in the codebook, or until a max number of iterations have occurred. Other stopping criteria exist such as the messages between time step t and t+1 being all within some small \epsilon of one another.

It is important to note some properties of BP:

  1. BP always terminates in d steps if the factor graph is a tree of depth d
  2. It is not known under what circumstances so called “loopy” BP will converge for non-tree graphs

Because factor graphs of LDPC codes are relatively sparse, they appear “locally tree-like”, a property which is believed to play a crucial role in BP convergence over the factorized probability distribution used in LDPC MAP decoding (eqn 1). As mentioned above BP manages to converge on many sorts of non tree-like graphs given that they have “nice” probability distributions. For example the SK model is known to converge even though the underlying factor graph is a complete graph!

It turns out that BP converges under some noise levels for LDPC decoding, and that the threshold at which it fails to converge, p_{d}, represents a phase transition to a generically different regime in the solution space of the codebook. It’s been noted elsewhere that the BP threshold is often the threshold of fast solving for many cool problems; e.g. k-SAT. This is because it is often thought to generically represent the “best” possible local (ergo “fast”) algorithm for those problems

In appendix C we will show some important properties of BP. The following tables summarize important results for several ensembles and channels. Note how close the information theoretic threshold for LDPCs is to the actual shannon limit p_{shannon} for the channels below.

Table 1: Thresholds for BSC
Various thresholds for BP over LDPC codes in a Binary Symmetric Channel

dkp_{d}p_{c}Shannon limit

See Mezard and Montanari, 2009 Chapt 15. for this table

Table 2: Thresholds for BEC
Various thresholds for BP over LDPC codes in a Binary Erasure Channel

dk\epsilon_{d}\epsilon_{c}Shannon limit

See Mezard and Montanari, 2009 Chapt 15. for this table

We will now show exact behavior of the (3,6) LDPC ensemble over the binary erasure channel.

Algorithmic Thresholds for Belief Propagation (BP)

Definitions and notation

Definition 1. In a Binary Erasure Channel (BEC), when the transmitter sends a bit \in {0, 1}, the receiver receives the correct bit with probability 1 - \epsilon or an error symbol * with probability \epsilon.

For BECs, the Shannon capacity—the maximum number of data bits that can be transmitted per encoded bit—is given by 1 - \epsilon.

Definition 2. An N-bit Error Correcting Code (ECC) is defined by a codebook \mathcal{C} \subset {0, 1}^N. The transmitter encodes information as an element of \mathcal{C}. The receiver receives a corrupted version y of the transmitted codeword. To decode, it picks an x \in \mathcal{C} that is most likely given y and the channel characteristics.

For ease of discourse, we have refrained from defining ECC in full generality.

Definition 3. An N-bit (\lambda, \rho) Low Density Parity Check Code (LDPC) is an ECC with \mathcal{C} = {x | Hx = 0}. Here H is an M \times N matrix and arithmetic is over \mathbb{Z}_2. H is a sparse parity-check matrix. \lambda(x) = \sum_i\lambda_ix^{i-1} and \rho(x) = \sum_i\rho_ix^{i-1} are finite polynomials that characterize H; \lambda_i is the fraction of columns with i 1s and \rho_i is the fraction of rows with i 1s. Since these are fractions/probabilities, they must be normalized. Hence \lambda(1) = \rho(1) = 1. H is a random matrix, and therefore has full rank with high probability.

In an LDPC code, |\mathcal{C}| = 2^{N - M}. Hence every N bits contain N-M bits of information, making the rate 1 - M / N. Over binary erasure channels (BECs), on receiving y, the decoder must choose an x such that x_i = y_i \forall i such that y_i \neq *, and Hx = 0 (x_i, y_i denote the i^{th} bit of x and y respectively). That is, the bits that were successfully transmitted should be preserved; other bits should be chosen to satisfy the parity check equations. If multiple correct choices of x are possible, then y cannot be unambiguously decoded.

BP/Peeling algorithm

In general, decoding can be computationally hard. But there exists an error rate \epsilon_d, a function of \lambda and \rho, below which belief propagation succeeds in decoding. Let \epsilon_c be the maximum error rate upto which successful decoding is possible (i.e. we can unambiguously determine the transmitted codeword) and \epsilon_s be the Shannon limit, then \epsilon_d \leq \epsilon_c \leq \epsilon_s. In general, these inequalities can be strict, illustrating the gap between what is information theoretically possible, and what is computationally feasible.

Belief propagation (BP) for decoding LDPC-codes is equivalent to a simple peeling algorithm. Let us first describe the factor-graph representation for decoding. This is denoted in figure 3. Variables on the left are the received symbol \in {0, 1, *}. Factor nodes on the right denote the parity-check constraint (rows of H). The XOR of variables connected to each factor node must be 0.

BP/the peeling algorithm works as follows. For simplicity of exposition, consider that the all zeros code-word has been transmitted. Since this is a linear code, there is no loss of generality. At first, only the 1 - \epsilon variables that were successfully transmitted are fully determined. In the next round, the factor nodes that have exactly one undetermined variable can determine that variable using their parity-check constraint.

Fig 3 An example of a received code-word and corresponding parity-check constraints that is information-theoretically determined (only the all zeros codeword satisfies all constraints), but cannot be decoded by the belief propagation algorithm because no factor node has exactly one unknown variable. Here, only the V0 is correctly received. Constraints F1, F2, F3 and F4 imply that V1=V2=V3=V4=V5. If V0=0, then all the rest must be 0s to satisfy F0 (note, the only other valid codeword is all ones).

BP isn’t perfect

This algorithm is not perfect. Figure 3 is an example of a received codeword which can be unambiguously decoded — only the all zeros codeword satisfies all the constraints— but the BP algorithm fails, because at any point, all factor nodes have more than one unknown variable. It seems that the only way to solve problems like that is to exhaustively understand the implications of the parity-check equations. If this examples seems contrived, that is because it is. Decoding becomes harder as the degree and number of constraints increases; we had to add a lot of constraints to make this example work. Fortunately, if the graph is sparse, BP succeeds. We prove this in the following theorem:

Phase transitions for BP

Theorem 1. A (\lambda, \rho) LDPC code can be decoded by BP as N \rightarrow \infty when the error rate is less than \epsilon_d:

\epsilon_d = \mathrm{inf}_{z \in (0, 1)}\left[\frac{z}{\lambda(1 - \rho(1 - z))}\right]

Proof. To prove this, let us analyze the density evolution. For BECs, this is particularly simple as we only need to keep track of the fraction of undetermined variables and factor nodes at timestep t:~latex z_t and \hat{z}_t respectively. As N \rightarrow \infty, these fractions are probabilities. A factor node is determined when all of its variables are determined (note: if all but one is determined, the last one can be immediately determined). The following recursion relations hold:

z_{t+1} = \epsilon\lambda(\hat{z}_t) \:\:\:\mathrm{and}\:\:\: \hat{z}_t = 1 - \rho(1 - z_t) (8)

The first holds because a variable node is undetermined at timestep t+1 if it was originally undetermined (which happens with probability \epsilon) and if it isn’t determined in the last step, which happens with probability say p. Now,

p = \mathbf{P}(degree=2)\hat{z}_t + \mathbf{P}(degree=3)\hat{z}_t^2 + \cdots = \lambda(\hat{z}_t)

A similar reasoning holds for the second relation. 1 - z_t is the probability that a given neighboring variable node is determined. \rho(1 - z_t) is the probability that at-most one is undetermined, and hence this function node is determined. 1 - \rho(1 - z_t) is the probability that this function node is undetermined.

Composing the two relations in equation 8, we get the recursion:

z_{t+1} = F_{\epsilon}(z) = \epsilon \lambda(1 - \rho(1 - z_t))

An example of F(z) is shown in Figure 4 for \lambda(x) = x^2, \rho(x) = x^5. That is a (3,6) regular graph where variable nodes and function nodes have 3 and 6 neighbors respectively. On the left, F(z) is always below z. Hence the recursion with z_0 starting from the far right will converge to the fixed point F(z) = z = 0. But on the right, \epsilon is large enough that F(z) intersects z at a non-zero point. Hence the recursion will converge to the higher fixed point instead, without ever reaching the `correct’ fixed point. BP therefore gets stuck at a suboptimal solution, though information-theoretically a correct solution exists. This can be interpreted as a glassy state, where many deep local minima are present, and BP will converge to the wrong minimum.

The condition for BP to converge is F_\epsilon(z) \le z \:\: \forall z \in (0, 1). Hence the threshold error rate, \epsilon_d, below which this condition holds is:

\epsilon_d = \mathrm{inf}_{z \in (0, 1)}\left[\frac{z}{\lambda(1 - \rho(1 - z))}\right]

For (3, 6) regular graphs, \epsilon_d \approx 0.429 ∎

Fig 4 The recursion relation for a (3,6) regular graph, where F_\epsilon(z) is plotted against z. The identity function z is also shown (in blue). Here \epsilon_d \approx0.429. The graph above and below show the case the error rate is below and above \epsilon_d respectively.

Another interesting phase transition can be observed. As \epsilon increases, for some values of (\lambda, \rho), the first intersection of F(z) and F(z) happens at a non-zero point. For others, it starts of at z=0 and goes up continuously. In the former case, the decoding error rate jumps discontinuously as \epsilon increases from 0 to a non-zero values. For the latter, it increases continuously.

To see the gap between \epsilon_d and what can be information theoretically, we look at what happens when the degrees of the LDPC code is increased while keeping the rate constant. Specifically consider the (l, k) regular graph (i.e. \lambda(x) = x^{l-1} and \rho(x) = x^{k-1}) as k \rightarrow \infty while l / k = 0.5 is fixed. Note that the rate of the code is 1 - l / k. This is shown in Figure 5. \epsilon_d decreases toward 0. But as k \rightarrow \infty, it should become information-theoretically easier to decode. In fact, as k \rightarrow \infty, the code approaches a random linear code, which is known to achieve Shannon capacity. Hence we can believe that the information-theoretically achievable decoding rate is non-decreasing. Thus there is a gap between what is information theoretically possible to decode, and what is computationally feasible using Belief Propagation.

Fig 5 \epsilon_d decreases as k increases, while the rate = 1 - l/k = 0.5 is fixed. In fact \mathrm{lim}_{k \rightarrow \infty}\epsilon_d = 0.

Finally we would like to mention that it is possible to choose a sequence of polynomials (\lambda, \rho) such that \epsilon_d approaches the Shannon limit. While it is non-trivial to sample exactly from this distribution, good approximations exist and LDPC codes can achieve close to channel capacity over binary erasure channels.

The solution space in p_{d}\leq p \leq p_{c}

The energy landscape of LDPC decoding

We have already shown the exact location of the p_{MAP} = p_{c} threshold above which decoding is not possible for the LDPC ensemble and have also investigated the point at which the BP algorithm fails, p_{d}.

It should not be surprising to us that any given algorithm we attempt to throw at the problem fails at a certain point below p_{c}. In fact there are many simple, random algorithms from the class of Markov-chain Monte Carlos which give fast run times but which fail at values far below even p_{d}. The failing point of a particular algorithm, per se, is not necessarily very significant. We shouldn’t expect that any given algorithm, besides explicitly calculating the symbol MAP by traversing the entire codebook, would be able to achieve p_{c}.

What is of interest to us here, is that p_{d} marks a provable threshold in the solution space of LDPC decoding during which it is likely no locally-based methods, and therefore no fast algorithms can decode with nonzero probability. We will show later precisely the number and energy levels of these metastable states for the BEC. Proof of this transition for other channel types is outside the scope of this lecture.

In this section we will rephrase decoding as an energy minimization problem and use three techniques to explore the existence of metastable states and their effect on local search algorithms.

In particular we will first use a generic local search algorithm that attempts to approximately solve energy minimization expression of decoding.

We will next use a more sophisticated Markov chain Monte Carlo method called simulated annealing. Simulated annealing is useful because it offers a perspective that more closely models real physical processes and that has the property that its convergence behavior closely mimics the structure of the metastable configurations.

Energy minimization problem

To begin, we will reframe our problem in terms of constraint satisfaction.

The codewords of an LDPC code are solutions of a CSP. The variables are the bits of the word and the constraints are the parity check equations. Though this means our constraints are a system of linear equations, our problem here is made more complicated by the fact that we are searching for not just ANY solution to the system but for a particular solution, namely the transmitted codeword.

The received message \underbar{\textit{y}} tells us where we should look for the solution.

Assume we are using the binary-input, memoryless, output-symmetric channel with transition probability \mathbf{Q}(y | x).

The probability that \underbar{\textit{x}} was the transmitted codeword, given \underbar{\textit{y}} is \mathbb{P}(\underbar{\textit{x}} | \underbar{\textit{y}}) = \mu_{y}(\underbar{\textit{x}})


\mu_{y}(\underline{x}|\underline{y}) = \frac{1}{Z(y)}\prod_{i=1}^NQ(y_i|x_i)\prod_{a=1}^M\mathbb{I}(x_{i_1^a}\otimes ... \otimes x_{k(a)^a} = 0) (10)

We can associate an optimization problem with this code. In particular, define E(\underbar{\textit{x}}) to be twice the number of parity check equations violated by \underbar{\textit{y}}.

We have already discussed how symbol MAP computes the marginals of the distribution \mu_{y}(\underbar{\textit{x}}) and how word MAP finds its argmax.

We shall here discuss two related problems

  • optimizing the energy function within a subset of the configuration space defined by the received word
  • sampling from a ’tilted’ Boltzmann distribution associated with the energy

Define the log-likelihood of x being the input given the received y to be

L_{\underline{y}}(\underline{x}) = \sum_{i=1}^N Q(y_i|x_i) (11)

If we assume WLOG that the all zero codeword was transmitted, by the law of large numbers, for large N the log-likelihood L_{\underbar{\textit{y}}}(\underbar{\textit{x}}) of this codeword is close to -Nh where h is the channel entropy. The probability of an order-N deviation away from this value is exponentially small.

This suggests that we should look for the transmitted codeword amongst those \underbar{\textit{x}} \in \mathbb{C} such that L_{\underbar{\textit{y}}}(\underbar{\textit{x}}) is close to h.

The constraint version of our decoding strategy – known as typical-pairs decoding – is thus, find \underbar{\textit{x}} such that L_{\underbar{\textit{y}}}(\underbar{\textit{x}}) > -N(h+\delta). This constraint will be referred to as the `distance constraint’ and we should consider the situation where if exactly one codeword satisfies the distance constraint, return it.

Since codewords are global energy minima (E(\underbar{\textit{x}}) = 0 for all \underbar{\textit{x}} \in \mathbb{C}), we can phrase typical-pairs decoding as an optimization problem

Minimize E(\underbar{\textit{x}}) subject to L_{\underbar{\textit{y}}}(\underbar{\textit{x}}) > -N(h+\delta).

This decoding succeeds iff the minimum is non-degenerate. This happens with high probability for p < p_{c} and with zero probability for p > p_{c}. In particular, there are exponentially many degenerate energy minima above p_{c}.

Similar to what we have seen elsewhere in the course, there exists a generically intermediate regime p_{d}\leq p \leq p_{c} in which the global energy minimum is still the correct codeword bu there is an exponentially large number of local energy minima obscuring it (see figure 6).

What is so special about BP is that the threshold at which these exponentially many metastable states proliferate is exactly the algorithmic threshold p_{d} for BP which we proved earlier.

Fig 6 A cartoon landscape for the Energy function defined above (number of violated checks). Left: The energy has a unique global minimum with E(\underline{x}) = 0 (the transmitted codeword) and no local minima. Center: many deep local minima appear, although the global minimum is non degenerate. Right: more than one codeword is compatible with the likelihood constraint, the global minimum E(\underline{x}) = 0 is degenerate adapted from Mezard and Montanari, 2009 Chapt 21

While finding solutions E(\underbar{\textit{x}}) = 0 amounts to Gaussian elimination, the constraint L_{\underbar{\textit{y}}}(\underbar{\textit{x}}) > -N(h+\delta) is not a linear constraint. Thus one needs to use some sort of more advanced search procedure to find satisfying vectors \underline{x}.

We will show that if one resorts to local-search-based algorithms, the metastable states above p_{d} block the algorithm. Furthermore, we suggest that the behavior of the local algorithms discussed below are typical of all local search algorithms (including BP) and that it is very likely the case that no fast algorithm exists capable of finding global energy minima without getting caught in the metastable states which proliferate above p_{d}.

Below is the simplest of local search algorithms, \Delta-local search.

Fig 7 Excerpted from Mezard and Montanari, 2009 Chapt 21

Delta search typefies local search algorithms. It walks semi-randomly through the landscape searching for low energy configurations. Its parameter is defined such that, when stuck in a metastable state it can climb out of it in polynomial time if the steepness of its energy barrier is \leq \Delta. Thus its failure in the p \geq p_{d} region suggests that there are no barriers of constant size and that barriers of order N are the norm.

MCMC and the relaxation time of a random walk

We can understand the geometry of the metastable states in greater detail by reframing our MAP problem as follows:

\mu_{y,\beta}(\underline{x}) = \frac{1}{Z(\beta)}exp{-\beta \cdot E(\underline{x})}\prod_{i=1}^N Q(y_i|x_i) (12)

This form is referred to as the `tilted’ Boltzmann because it is a Boltzmann distribution biased by the likelihood function.

In the low temperature limit this reduces to eqn 10 because it finds support only over words in the codebook.

This distribution more closely mimics physical systems. For nonzero temperature it allows support over vectors which are not actually in our codebook but still have low distance to our received message and have low energy – this allows us to probe the metastable states which trap our local algorithms. This is referred to as a code with `soft’ parity check constraints as our distribution permits decodings which fail some checks.

We will use the following algorithm excerpted from Mezard and Montanari Chapt 21:

Fig 8 Excerpted from Mezard and Montanari, 2009 Chapt 21

Where a Glauber update consists of scanning through the bits of the current proposed configuration and flipping the value of bit i with probability

w_{i}(\underbar{\textit{x}}) = \frac{\mu_{y,\beta}(\underline{x}^{(i)})}{\mu_{y,\beta}(\underline{x}^{(i)}) + \mu_{y,\beta}(\underline{x})} (13)

Where \underline{x} is the current configuration and \underline{x}^{(i)} is the configuration obtained by flipping \underline{x}‘s i^{th} bit

This method is a spin-off of traditional Markov chain Monte-Carlo algorithms with the variation that we lower the temperature according to an annealing schedule that initially assigns probability to all states proportional to the likelihood component of equation 12, allowing the chain to randomly sample the configuration space in the neighborhood of the received noisy word, until in the low temperature limit it becomes concentrated near to configurations which are proper codewords.

This method is useful to us because MCMCs are good models of how randomized methods of local searching for optimal configurations occurs in physical systems. Furthermore, the convergence of MCMCs and the time it takes them to converge tells us both the properties of the energy wells they terminate in and the barriers between minima in the energy landscape.

Let’s now show a property relating convergence times of MCMCs and energy barriers known as the Arrhenius law.

If we take the example of using a simple MCMC random walk with the update rule below over the following landscape

w(x\rightarrow x') = min \{e^{-\beta [E(x')-E(x)]},~1\}

Fig 9 This represents a random walk along a line in which there are two ground states separated by an energy barrier of height \Delta EExcerpted from Mezard and Montanari, 2009 Chapt 13

We find that the expected number of time steps to cross from one well to another is governed by the Arrhenius law \tau \approx exp{\beta \Delta E}.

In general, if there exists a largest energy barrier between any two components of the configuration space (also known as the bottleneck) the time it takes to sample both components, also known as the relaxation time of the MCMC is \tau_{exp} \geq O(e^{\beta \Delta E})

With this in mind, we can apply our simulated annealing MCMC to LDPC decoding and examine the properties of the bottlenecks, or metastable states, in our configuration landscape.

Exact values of the metastable energy states for the BEC

It is a well-known consequence of the 1RSB cavity method that the number of metastable states of energy Ne grows like exp(N\Sigma^{e}(e)) where \Sigma^{e}(e) is known as the energetic complexity function, a function whose form is implied by the 1RSB cavity equations. This computation can be carried out using a method called Survey Propagation which constructs a factor graph of the messages passed in the original BP factor graph and estimates the values of the marginals of the messages via another round of BP (hence the name 1-step RSB).

Neglecting the actual form of the calculations I will show the following approximate results for the BEC.

Fig 10 Metastable states for random elements of the (3,6) regular ensemble used over the BEC (\epsilon_d = .4294 and \epsilon_c = .4882. Left: the complexity function as a function of energy density above \epsilon_d. Right: the maximum and minimum energy densities of metastable states as a function of the erasure probability. Note that at \epsilon_d \leq .45 \leq \epsilon_c latex the curve is positive only for non zero metastable energy densities. This indicates exponentially many metastable states. At erasure rates above \epsilon_c there are exponentially many degenerate ground states. Excerpted from Mezard and Montanari, 2009 Chapt 21

In the regime p \geq p_{d} there exists a zero-energy word corresponding to the correct solution. On top of this, there exist non-trivial solutions to the 1RSB method yielding a complexity curve positive in the regime (e_{c}, e_{d}). The positive complexity means that there are exponentially many such states and their finite energy means they violate a finite fraction of the parity checks, making their energy wells relatively deep.

As the error rate of the channel increases above p_{c} the minimum energy of the metastable state reaches zero continuously. This means at noise levels above p_{c} there are an exponential number of zero-energy states corresponding to configurations which aren’t code words. These codewords are separated by energy barriers O(N) thus making the relaxation-time of local algorithms, by the Arrhenius law exp(N) in this regime.

Fig 11 Decoding random codes from the (3,6) ensemble over the BEC. In both cases N = 10^4, and the annealing schedule consists of t_max = 1000 equidistant temperatures in T = 1 / \beta \in [0,1]. Left: indicates convergence to the correct ground state at \epsilon = .4 < \epsilon_d. Right: indicates convergence to approximately the energy density of the highest metastable states e_d (as calculated by the complexity function via 1RSB) for \epsilon = .6 > \epsilon_{c}Excerpted from Mezard and Montanari, 2009 Chapt 21

Here you can see a rough sketch of convergence of the simulated annealing algorithm. As the temperature decreases in the p \leq p_{d} regime the algorithm converges to a 0 energy ground state. In the figure on the right we can see that simulated annealing converges to the horizontal line here which corresponds to the energy of the highest metastable state e_{d} for the BEC at p = .6.

Thus we see our local search algorithms end up being attracted to the highest energy of the metastable state.

Fig 12 Decoding random codes as in figure 11. Here we plot the minimum energy density achieved through simulated annealing plotted as a function of the erasure probability of the BEC. The continuous line indicates the highest lying metastable states as calculated from the complexity function via 1RSB.
 Excerpted from Mezard and Montanari, 2009 Chapt 21

Though there is not necessarily an exact correspondence between the residual energy at T=0 for simulated annealing and the highest metastable state e_{d} we see that across all values of p that at T=0, e_{ann} \approx e_{d} suggesting local search tends to get caught in the deepest metastable energy wells.

This discussion shows that the algorithmic threshold of BP, p_{d} indicates the onset of a truly different regime within the energy landscape of the codebook. Metastable states of O(N) hight proliferate and become exponentially difficult to escape from via local search methods. Thus the failure of BP likely indicates a regime in which no fast algorithms can perform decoding, even though decoding is still theoretically possible when below p_c, e.g. via exhaustive search of the codebook.

Appendix A: Random Code Ensembles

In an RCE, encoding maps applied to the information sequence are chosen with uniform probability over a solution space. Two decoding schemes are often used and applied to the noise – word MAP and symbol MAP decoding. MAP, otherwise known as “maximum a priori probability” works by maximizing the probability distribution to output the most probable transmission. Word MAP decoding schemes output the codeword with the highest probability by minimizing the block error probability, which is otherwise known as the probability with respect to the channel distribution that the decoded word is different than the true transmitted word. Symbol MAP decoding, on the other hand, minimizes the fraction of incorrect bits averaged over the transmitted code word (bit error rate).

RCE code is defined by the codebook in Hamming space, or the set of all binary strings of length N. In a Hamming space characterized by uniform probability, the number of codewords at a given Hamming distance are a function of the distance enumerator. Distance enumerators take as parameters different weights, given that probabilities of codewords are independent of each other. The distance enumerator shows that, for small enough fractional distance from the true message x_0, the growth rate is negative and the average number of codewords at small distance from x_0 vanishes exponentially with N. The Gilbert-Varshamov distance, a lower bound threshold, shows that the the average number of codewords is exponentially large at points where the weight numerator is concentrated.

We look at the performance of RCE code in communication over the Binary Symmetric Channel (BSC), where it is assumed that there is a probability p that transmitted bits will be “flipped” (i.e. with probability p, 1 becomes 0 and 0 becomes 1). With BSCs, channel input and output are the same length N sequences of bits. At larger noise levels, there are an exponential number of codewords closer to y and decoding is unsuccessful. However, decoding via the symbol MAP decoding scheme shows that the i-th bit is decoded by maximizing the marginal probability and amasses contributions from all codewords in the set. Above a threshold, the bit error rate is the same as if the message was transmitted without encoding and decoding, but below this, the RCE seems to work quite well in transmission.

Finite temperature decoding has also been looked at as an interpolation between the two MAP decoding schemes. At low noise, a completely ordered phase can be observed as compared to a glassy phase at higher noise channels. Similar to the a statistical physics model, we can also note an entropy dominated paramagnetic phase at higher temperatures.

Each decoding scheme can be analogized to “sphere packing”, where each probability in the Hamming space distribution represents a sphere of radius r. Decoding schemes have partitions in the Hamming space, so these spheres must be disjoint. If not, intersecting spheres must be eliminated. The lower bound of the remaining spheres is then given by Gilbert-Varshamov bound, whereas the upper bound is dictated by the Hamming distance.

Another random code beside the RCE is the RLC, or random linear code. Encoding in an RLC forms a scheme similar to a linear map, of which all points are equiprobable. The code is specified by an N x M binary matrix, otherwise known as the generating matrix, and it is projected to be error-free below the Shannon capacity.

There are several sources of randomness in codes. Codes are chosen randomly from an ensemble and the codeword to be transmitted is chosen with uniform probability from the code, according to the theorem of source-channel separation. The channel output is then distributed according to a probabilistic process accounting for channel noise and decoding is done by constructing another probability distribution over possible channel inputs and by estimating its signal bit marginal. The decision on the i-th bit is dependent on the distribution. Thus, complications may arise in distinguishing between the two levels of randomness: code, channel input, and noise (“quenched” disorder) versus Boltzmann probability distributions.

Appendix B: Weight enumerators and code performance

The geometric properties of the LDPC codebooks is given by studying the distance enumerator N_{\underline{x}0}(d) to give the number of codewords at Hamming distance d from \underline{x}_0. This takes all-zeros codewords as the reference and uses the weight enumerator, \mathbb{N}(w)=\mathbb{N}{\underline{x_0}}(d=w) as the denomination (number of codewords having weight equal to w). To estimate the expected weight enumerator \mathcal{N}(w)=\mathbb{E}\mathbb{N}(w) for a random code in the LDPC_N(\Lambda,P) ensemble, we know that \mathbb{N}(w) grows exponentially in block-length N, and that each codeword has a weight w=Nw that grows linearly with N. The exponential growth rate \phi(w) is defined by

\mathcal{N}(w=Nw) = e^{N \phi (w)} (14)

denoting an ‘annealed average’, or a disordered system that could be dominated by rare instances in the ensemble. This gives an upper bound on the number of ‘colored factor graphs’ that have an even number of weighted incident edges divided by the total number of factor graphs in the ensemble.

On the other hand, for graphs of fixed degrees with N variable nodes of degree l and M function nodes of degree k, the total number of edges F is given by F=Mk=Nl. A valid colored graph would have E=wl edges, with the number of variable nodes given in {N\choose w} ways, l assignments of weighted sockets to nodes, and l assignments of unweighted sockets to nodes outside the set. If we take m_r to denote the number of function nodes with weighted sockets under the constraints of \Sigma_{r=0}^km_r=M and \Sigma_{r=0}^krm_r=lw, we find the number of ways to color the function node sockets by

\mathbb{C}(k,M,w) = \sum_{m_{0},...m_{k}}^{even}{M\choose m_{0},...,m_{k}}\prod_{r}{k\choose r}^{m_{r}} (15)

\mathbb{I}\Big(\sum_{r=0}^km_r=M\Big)\mathbb{I}\Big(\sum_{r=0}^krm_r=lw\Big) (16)

If we aim to join variable and check nodes so that colorings are matched, knowing that there are (lw)!(F-lw)! possible matchings in each ensemble element, this yields the following formula:

\mathcal{N}(w)=\frac{(lw)!(F-lw)!}{F!}{N\choose w}\mathbb{C}(k,M,w) (17)

At low noise limits, code performance depends on the existence of codewords at distances close to the transmitted codeword. Starting with degree 1 and knowing that the parametric representation for weights is given by 

w = \sum_{l=1}^{l_{max}}\Lambda_l\frac{xy^l}{1+xy^l} (18)

derive that 

\phi(w) = -\frac{1}{2}w\log(w/\Lambda_1^2)+O(w) (19)

when x,y,z scale to \sqrt{w}. This shows that the exponential growth rate \phi(w) is strictly positive when w is sufficiently small, and that the expected number of codewords within a small Hamming distance from a given codeword is exponential in N. If we take the logarithm of the expected weight enumerator and plot this versus the reduced weight w=w/N for an irregular code with l_{min}=1, we see that \phi(w) is positive near the origin, but that its dervative diverges as w\rightarrow 0. Since this means that each codeword is surrounded by a large number of very close other codewords, this makes the code a very bad ECC and thus, makes it hard to discriminate between codewords at Hamming distances O(1) with noisy observations. Applying this same logic to l of min 2, we still observe that \phi(w) tends to 0 more quickly as w\rightarrow 0 in the present case. If we assume that this holds beyond the asymptotic regime, we get

\bar{\mathcal{N}}(w) = e^{Aw} (20)

or that the number of codewords around a particular codeword is o(N) until a Hamming distance d_* \simeq \log N/A, otherwise known as the “effective minimum distance”. For l_{min} \geq 3, we find:

\phi(w) \simeq \Big(\frac{l_{min}-2}{2}\Big)w\log\Big(\frac{w}{\Lambda_{l_min}}\Big) (21)

suggesting that LDPC codes with this property have good short distance behavior. Thus, any error that changes a fraction of the bits smaller than w_{*}/2 can be corrected in the absence of codewords within an extensive distance Nw_{*}.

Let us now focus on the capacity of LDPC codes to correct typical errors in a probabilistic channel. For binary symmetric channels that flip each transmitted bit independently with probability p<\frac{1}{2}. If the all-zero codeword \underline{x}^{(0)} =\underline{0} has been transmitted as \underline{y}, whose components are iid random variables that take value 0 with probability 1-p and value 1 with probability p, then we use the MAP decoding strategy to minimize the block error rate and output the codeword closest to the channel output \underline{y}. The expectation value of the code ensemble P_B = \mathbb{E}P_B(\mathbb{C}) is an indicator of code ensemble performances. We will show that, as N\rightarrow \infty, codes with l_{min}\geq 3 will undergo a phase transition separating a low noise from a high noise phase. To derive a lower bound for the capacity of LDPC codes in a BSC channel, we take \mathbb{N}=2^{NR} as the size of the codebook \mathbb{C} and, by union bound:

P_{B}(\mathbb{C})= \mathbb{P}\Big\{\exists \alpha \neq 0 \text{s.t. } d(\underline{x}^{(\alpha)},\underline{y})\leq d(\underline{0},\underline{y})\Big\} (22)

\leq \sum_{\alpha=1}^{\textit{N}-1}\mathbb{P}\Big\{d(\underline{x}^{(\alpha)},\underline{y}) \leq d(\underline{0},\underline{y})\Big\} (23)

\leq \sum_{w=1}^N \textit{N}(w)e^{-\gamma w} (24)

This derivation proves that the block error probability depends on the weight enumerator and the exp(-\gamma w). This second term shows that an increase in the weight of the codeword corresponds to their contribution being scaled down by an exponential factor. This is because it is less likely that the received message \underline{y} will be closer to a codeword of large weight than to the all-zero codeword. A geometric construction of this phenomena implies that for large enough p, Shannon’s Theorem implies that P_B is bounded away from 0 for any non-vanishing rate R > 0 so that at any p less than the ML threshold for which the lim_{N\rightarrow \infty}P_B=0, one can communicate with an arbitrarily small error probability. At a probability equal to the lower bound, the upper bound on P_B is dominated by codewords of weight w \approx N\Tilde{w}, suggesting that each time an error occurs, a finite fraction of the its are decoded incorrectly and that this fraction doesn’t change very much per transmission. The construction also illustrates that this fraction of incorrectly decoded bits jumps discontinuously from 0 to a finite value when p crosses the critical value p_{ML}, constituting a “gap.” This gap is close to a factor of 2.

Appendix C: BP performance

See figure 2 for an illustration of a factor graph illustrating this relationship. Again, recall that for LDPC code ensembles in large block-length limits, the degree distributions of variable nodes and check nodes are given by \Lambda = {\Lambda_t} and P = {P_k} respectively, where we assume that messages are initialized to u_{a\rightarrow i}^{(0)} = 0 for simplicity. This implies that the bit error probability is independent of the transmitted codeword and that therefore, we have the freedom to assume transmission of the all-zero codeword. In analyzing the recursion at the basis of the BP algorithm, we can show that decoding performance improves over time on the basis of symmetry and physical degradation.


Symmetry of channel log-likelihood and the variables appearing in density evolution are attributes of a desired BMS channel, suggesting that symmetry is preserved by BP operations in evolution. If we assume that the factor graph associated with an LDPC code is “tree-like”, we can apply BP decoding with a symmetric random initial condition and note that the messages u_{a\rightarrow i}^{(t)} and h_{i\rightarrow a}^{(t)} are symmetric variables at all t\geq 0. This observance of symmetry is analogous to the Nishimori condition in spin glasses and holds for the MAP log-likelihood of a bit as well.

Physical degradation

Let’s first define physical degradation with BMS channels. If we take two channels BMS(1) and BMS(2) denoted by transition matrices

\{Q_{1}(y|x)\}, \{Q_{2}(y|x)\} and output alphabets \mathbb{Y}_1,\mathbb{Y}_2, then BMS(2) is physically degraded with respect to BMS(1) if there exists a third channel C with input alphabet \mathbb{Y}_1,\mathbb{Y}_2 such that BMS(2) is the concatenation of BMS(1) and C. If we represent transition matrix C as \{R(y_2|y_1)\} we can represent the above physical degradation as

Q_2(y_2|x) = \sum{y_1 \in \textit{Y}_1}R(y_2|y_1)Q_1(y_1|x) (25)

 This is analogous to a Markov chain X \rightarrow Y_1 \rightarrow Y_2 following partial ordering. Channel reliability is then ordered by measures of conditional entropy and bit error rate. This extends to symmetric random variables, which are associated with BMS channels.


We then fix a particular LDPC code and look at BP messages as random variables due to randomness in the vector \underline{y} with regards to the proposition down below, showing that the bit error rate decreases monotonously with time:

Proposition: If B_{i,r}(F) is a tree, then h_i^{(0)}\preceq h_i^{(1)} \preceq ... \preceq h_i^{(t-1)} \preceq h_i^{(t)} for any t\leq r-1. Analogously, if B_{i\rightarrow a,r}(F), then h_{i\rightarrow a}^{(0)}\preceq h_{i\rightarrow a}^{(1)} \preceq ... \preceq h_{i\rightarrow a}^{(t-1)} \preceq h_{i\rightarrow a}^{(t)} for any t\leq r-1.

Density evolution in this manner is a useful estimate of the number of distributions of density evolution variables {h^{(t)},u^{(t)}} and {h_{*}^{(t)}}. By looking again at the bit error rate P_{b}^{(t)} and the conditional entropy H^{(t)} as both monotonically decreasing functions of the number of iterations and conversely, monotonically increasing functions of p, we can derive a finite limit P_b^{BP} \equiv \lim_{t\rightarrow\infty}P_b^{(t)}. The corresponding BP threshold can then be defined as

p_{d} \equiv \sup \Big\{ p \in [0,1/2] : P_b^{BP}(p)=0 \Big\}

For p \leq p_d, however, increasing the number of iterations does not help as the bit error rate is asymptotically lower bounded by P_{b}^{BP}(p)>0 for a fixed number of iterations. Good LDPC codes are thus designed with a large BP threshold with design rate R_{des}=1-P'(1)/\Lambda '(1) to maximize the threshold noise level for a given degree distribution pair. This ensemble will have a finite fraction of variable nodes of degree 2 and a large number of codewords with small weight, which ultimately prevent the block error probability from vanishing as N\rightarrow \infty.


[1] Marc Mezard and Andrea Montanari.
Information, Physics, and Computation.
Oxford Graduate Texts, 2009.