(The following blog post serves as an introduction to the following notes:)

Black Holes, Hawking Radiation, and the Firewall

There are many different types of “theoretical physicists.” There are theoretical astrophysicists, theoretical condensed matter physicists, and even theoretical biophysicists. However, the general public seems to be most interested in the exploits of what you might call “theoretical high energy theorists.” (Think Stephen Hawking.)

The holy grail for theoretical high energy physicists (who represent only a small fraction of all physicists) would be to find a theory of quantum gravity. As it stands now, physicists have two theories of nature: quantum field theory (or, more specifically, the “Standard Model”) and Einstein’s theory of general relativity. Quantum field theory describes elementary particles, like electrons, photons, quarks, gluons, etc. General relativity describes the force of gravity, which is really just a consequence of the curvature of spacetimes.

Sometimes people like to say that quantum field theory describes “small stuff” like particles, while general relativity describes “big stuff” like planets and stars. This is maybe not the best way to think about it, though, because planets and stars are ultimately just made out of a bunch of quantum particles.

Theoretical physicists are unhappy with having two theories of nature. In order to describe phenomena that depend on both quantum field theory and general relativity, the two theories must be combined in an “ad hoc” way. A so-called “theory of everything,” another name for the currently unknown theory of “quantum gravity,” would hypothetically be able to describe all the phenomena we know about. Just so we’re all on the same page, they don’t even have a fuly worked out hypothesis. (“String theory,” a popular candidate, is still not even a complete “theory” in the normal sense of the word, although it could become one eventually.)

So what should these high energy theoretical physicists do if they want to discover what this theory of quantum gravity is? For the time being, nobody can think up an experiment that would be sensitive to quantum gravitation effects which is feasible with current technology. We are limited to so-called “thought experiments.”

This brings us to Hawking radiation. In the 1970’s, Stephen Hawking considered what would happen to a black hole once quantum field theory was properly taken into account. (Of course, this involved a bit of “ad hoc” reasoning, as mentioned previously.) Hawking found that, much to everybody’s surprise, the black hole evaporated, realeasing energy in the form of “Hawking radiation” (mostly low energy photons). More strangely, this radiation comes out exactly in the spectrum you would expect from something “hot.” For example, imagine heating a piece of metal. At low temperatures, it emits low energy photons invisible to the human eye. Once it gets hotter, it glows red, then yellow, then perhaps eventually blue. The spectrum of light emitted follows a very specific pattern. Amazingly, Hawking found that the radiation which black holes emit follow the exact same pattern. By analogy, they have a temperature too!

This is more profound than you might realize. This is because things which have an temperature should also have an “entropy.” You see, there are two notions of “states” in physics: “microstates” and “macrostates.” A microstate gives you the complete physical information of what comprises a physical system. For example, imagine you have a box of gas, which contains many particles moving in a seemingly random manner. A “microstate” of this box would be a list of all the positions and momenta of every last particle in that box. This would be impossible to measure in pratice. A “macrostate,” on the other hand, is a set of microstates. You may not know what the exact microstate your box of gas is in, but you can measure macroscopic quantities (like the total internal energy, volume, particle number) and consider the set of all possible microstates with those measured quantities.

The “entropy” of a macrostate is the logarithm of the number of possible microstates. If black holes truly are thermodynamic systems with some entropy, that means there should be some “hidden variables” or microstates to the black hole that we currently don’t understand. Perhaps if we understood the microstates of the black hole, we would be much closer to understanding quantum gravity!

However, Hawking also discovered something else. Because the black hole is radiating out energy, its mass will actually decrease as time goes on. Eventually, it should disappear entirely. This means that the information of what went into the black hole will be lost forever.

Physicists did not like this, however, because it seemed to them that the information of what went into the black hole should never be lost. Many physicists believe that the information of what went into the black hole should somehow be contained in the outgoing Hawking radiation, although they do not currently understand how. According to Hawking’s original calculation, the Hawking radiation only depends on a parameters of the black hole (like its mass) and have nothing to do with the many finer details on what went in, the exact “microstate” of what went in.

However, physicists eventually realized a problem with the idea that the black hole releases its information in the form of outgoing Hawking radiation. The problem has to do with quantum mechanics. In quantum mechanics, it is impossible to clone a qubit. That means that if you threw a qubit in a black hole and then waited for it to eventually come out in the form of Hawking radiation, then the qubit could no longer be “inside” the black hole. However, if Einstein is to be believed, you should also be able to jump into the black hole and see the qubit on the inside. This seems to imply that the qubit is cloned, as it is present on both the inside and outside of the black hole.

Physicists eventually came up with a strange fix called “Black Hole Complementarity” (BHC). According to BHC, according to people outside the black hole, the interior does not exist. Also, according to people who have entered the black hole, the outside ceases to exist. Both descriptions of the world are “correct” because once someone has entered the black hole, they will be unable to escape and compare notes with the person on the outside.

Of course, it must be emphasized that BHC remains highly hypothetical. People have been trying to poke holes in it for a long time. The largest hole is the so called “Firewall Paradox,” first proposed in 2012. Essentially, the Firewall paradox tries to show that the paradigm of BHC is self-contradictory. In fact, it was able to use basic quantum mechanics to show that, under some reasonable assumptions, the interior of the black hole truly doesn’t exist, and that anyone who tries to enter would be fried at an extremely hot “firewall!” Now, I don’t think most physicists actually believe that black holes really have a firewall (although this might depend on what day of the week you ask them). The interesting thing about the Firewall paradox is that it derives a seemingly crazy result from seemingly harmless starting suppositions. So these suppositions would have to be tweaked in the theory of quantum gravity order to get rid of the firewall.

This is all to say that all this thinking about black holes really might help physicists figure out something about quantum gravity. (Then again, who can really say for sure.)

If you would like to know more about the Firewall paradox, I suggest you read my notes, pasted at the top of this post!

The goal of the notes was to write an introduction to the Black Hole Information Paradox and Firewall Paradox that could be read by computer scientists with no physics background.

The structure of the notes goes as follows:

1. Special relativity
2. General relativity
3. Quantum Field Theory (in which I ambitiously tell you what QFT actually is!)
4. Statistical Mechanics (this is the best part)

Because the information paradox touches on all areas of physics, I thought it was necessary to take “zero background, infinite intelligence” approach, introducing the all the necessary branches of physics (GR, QFT, Stat Mech) in order to understand what the deal with Hawking radiation really is, and why physicists think it is so important. I think it is safe to say that if you read these notes, you’ll learn a non-trivial amount of physics.

Nilin Abrahamsen nilin@mit.edu

Daniel Alabi alabid@g.harvard.edu

Mitali Bafna mitalibafna@g.harvard.edu

Emil Khabiboulline ekhabiboulline@g.harvard.edu

Juspreet Sandhu jus065@g.harvard.edu

Two-prover one-round (2P-1R) games have been the subject of intensive study in classical complexity theory and quantum information theory. In a 2P-1R game, a verifier sends questions privately to each of two collaborating provers , who then aim to respond with a compatible pair of answers without communicating with each other. Sharing quantum entanglement allows the provers to improve their strategy without any communication, illustrating an apparent paradox of the quantum postulates. These notes aim to give an introduction to the role of entanglement in nonlocal games, as they are called in the quantum literature. We see how nonlocal games have rich connections within computer science and quantum physics, giving rise to theorems ranging from hardness of approximation to the resource theory of entanglement.

## Introduction

In these notes we discuss 2-prover 1-round games and the classical complexity of approximating the value of such games in the setting where the provers can share entanglement. That is, given the description of a game, we ask how hard it is to estimate the winning probability of the best winning strategy of the entangled provers. Let us first formally define games and its relation to the label cover problem. We write ${ [n]=\{1,\ldots,n\}}$.

#### Definition (Label cover)

A label cover instance ${ I=(S,T,\Sigma,\Pi)}$ consists of variable sets ${ S=\{s_i\}_{i\in[n]},T=\{t_j\}_{j\in[n]}}$, alphabet set ${ \Sigma}$, and a collection ${ \Pi}$ of constraints of the form ${ t_j=f_{i,j}(s_i)}$. Given an assignment (or coloring) ${ c : S\cup T \rightarrow \Sigma}$ we define its value to be ${ \omega(c)=\mathbb P_{f_{ij}\sim\Pi}\big(c(t_j)=f_{ij}(c(s_i))\big)}$. Define the value of ${ I}$ to be the maximum over all possible assignments, i.e. ${ \omega(I) = \max_{c} \omega(c)}$. Many familiar computational problems can be formulated as a label cover, such as ${ \textsc{3SAT}}$, ${ \textsc{3Lin}}$, and ${ \textsc{MaxCut}}$.

#### Definition (2-prover 1-round game)

Let ${ I = (S,T,\Sigma,\Pi)}$ be a label cover instance. We can then associate the following two-prover one-round game ${ G(I)}$ with ${ I}$. Let ${ P_1,P_2}$ be two provers who cannot communicate, and let ${ V}$ be the verifier. Given the label cover instance ${ I}$, the verifier uniformly samples a constraint ${ f_{i,j} \in \Pi}$ and sends ${ s_i}$ to ${ P_1}$ and ${ t_j}$ to ${ P_2}$. The provers then reply with ${ a \in \Sigma}$ and ${ b\in\Sigma}$ respectively to ${ V}$. Finally, ${ V}$ outputs ${ 1}$ if and only if ${ b = f_{i,j}(a)}$. Any coloring of the label cover instance corresponds to a deterministic strategy for the corresponding game. Therefore, with an optimal strategy the provers win the game associated to label cover instance ${ I}$ with probability ${ \omega(I)}$. That is, the value of the game equals that of the label cover instance. However, this is with the assumption that provers can only use deterministic strategies or convex combinations of these (that is, using shared randomness). If the provers share an entangled quantum state, then the provers (who still cannot communicate) can enjoy correlations that allow them to win with a higher probability than classically. In the quantum literature, these 2P-1R games are known as nonlocal games referring to the fact that the correlations arise without signaling between the provers. We are concerned with the complexity of approximating the winning probability of this strategy. We refer to the optimal winning probability within some class (classical or entangled) of strategies as the classical and quantum value of the game, respectively, and we use the terms quantum strategy and entangled strategy interchangeably. Fixing different constraint families in the label cover game changes the complexity of finding the (classical and entangled) values of the game. We will show that approximating the entangled value of XOR games, or more generally unique games (to be defined later on), is possible in polynomial time. This is remarkable because a famous conjecture known as the unique games conjecture says that approximating the classical value of unique games is NP-hard. In contrast, we will see that for unrestricted edge constraints, it is NP-hard to approximate the entangled value of a nonlocal game. Thus, hardness of approximation of the game’s value, established by the celebrated PCP theorem , still applies in the presence of entanglement. In the quantum world, we have new complexity classes such as QMA (which can be regarded as “quantum NP”), so one may conjecture whether approximating the entangled value of a general game is QMA-hard (the games formulation of the quantum PCP conjecture ). We will indicate progress in this direction but will explicitly demonstrate the NP-hardness result. Entanglement is often regarded as an expensive resource in quantum information because it is difficult to produce and maintain. Hence, even if sharing entanglement can improve the success probability of winning a game, the resource consumption may be costly. We will conclude by discussing lower bounds on the number of shared entangled bits required to achieve the optimal value of a game.

## Notation and quantum postulates

Let us first establish notation and define what is meant by an entangled strategy. In keeping with physics notation we write a column vector as ${ |{v}\rangle \in\mathbb C^d}$ and its conjugate-transpose (a row vector) as ${ \langle{v}| }$ . More generally the conjugate-transpose (Hermitian conjugate) of a matrix ${ A}$ is written ${ A^\dag}$. Then ${ \langle{v}| w\rangle}$ is the inner product of two vectors (a scalar) and ${ |{v}\rangle \langle{w}| }$ the outer product (a rank-1 matrix). A matrix ${ A}$ is said to be Hermitian if ${ A^\dag=A}$. A matrix ${ A}$ is positive semidefinite , written ${ A\succeq 0}$, if ${ A=B^\dag B}$ for some matrix ${ B}$. We write the identity matrix as ${ {\mathbb{I}}}$, denote by ${ Herm(\mathbb{C}^d)}$ the set of ${ d}$-by-${ d}$ Hermitian matrices.

### Observables, states, and entanglement

In a quantum theory the observables are Hermitian operators ${ A\in Herm(\mathbb C^d)}$ on a vector space ${ \mathbb C^d}$. It then makes sense to say that a state is a functional ${ \varphi}$ on the set of observables. That is, to specify the state of a physical system means giving a (expected) value for each observable. It turns out states are linear functionals of the observables, and such functionals can be written ${ \varphi(A)=\langle A,\rho\rangle}$ for some ${ d}$-by-${ d}$ matrix ${ \rho}$. We call ${ \rho}$ the density matrix and require moreover that ${ \rho}$ is positive semidefinite and has trace ${ 1}$. Every density matrix is a convex combination of rank-one projections ${ |\psi\rangle\langle\psi|}$ known as pure states. The unit vectors ${ |\psi\rangle\in\mathbb C^d}$ are also themselves known as pure states. If the state of one particle is described by a vector in ${ \mathbb C^d}$ (referring here to pure states), then two particles are described by a vector in the tensor product ${ \mathbb C^d\otimes\mathbb C^d}$. The two particles are entangled if their state is not in the form of a pure tensor ${ |\phi\rangle\otimes|\psi\rangle}$. We also write product states as ${ |\phi\rangle|\psi\rangle}$, omitting the tensor symbol.

### Quantum measurements

A quantum measurement can be described in terms of a projection-valued measure (PVM). Definition 1 (PVM) A projection-valued measure on vector space ${ \mathbb C^d}$ (where the quantum states live) is a list of projection matrices ${ A_1,\ldots,A_k}$ on ${ \mathbb C^d}$ such that ${ A_iA_j=0}$ for ${ i\neq j}$ and ${ \sum_i A_i={\mathbb{I}}}$. The PVM describes a measurement which, on state ${ |\psi\rangle\in\mathbb C^d}$ outputs measurement outcome ${ i}$ with probability ${ \langle\psi| A_i|\psi\rangle}$. The quantum state after obtaining outcome ${ i}$ is ${ \frac1{\|A_i|\psi\rangle\|}A_i|\psi\rangle}$. When the projections are rank-one projections ${ A_i= |{\beta_i}\rangle \langle{\beta_i}| }$ we say that we measure in the basis ${ \{ |{\beta_i}\rangle \}_i}$. In this case the probability of outcome ${ i}$ in state ${ |\psi\rangle}$ is ${ |\langle\beta_i |{\psi}\rangle |^2}$, and the post-measurement state is simply ${ |{\beta_i}\rangle}$ . Applying the measurement ${ (A_i)_i}$ on the left half of a two-particle state ${ |{\Psi}\rangle \in\mathbb C^d\otimes\mathbb C^d}$ means applying the PVM ${ (A_i\otimes {\mathbb{I}})_{i}}$ on the two-particle state.

### Quantum strategies for nonlocal games

We now introduce the notion of a quantum strategy for a nonlocal game. Each prover holds a particle, say with state space ${ \mathbb C^d}$, and Alices particle may be entangled with Bob’s. The global state is ${ |{\phi}\rangle _{AB}\in\mathbb C^d\otimes\mathbb C^d}$. Each player receives a question from the verifier and then chooses a measurement (a PVM) depending on the question. The player applies the measurement to their own particle and responds to the verifier with their measurement outcome. Hence for Alice we specify ${ n}$ PVM’s ${ A^s}$ where ${ s\in S}$ is a question, and each PVM is a list ${ A^s_{a=1},\ldots,A^s_k}$. By definition 1, given questions ${ s,t}$ the probability that Alice outputs ${ a}$ and Bob outputs ${ b}$ is

$\displaystyle P(a,b|s,t)=\|({\mathbb{I}}\otimes B^t_b)(A^s_a\otimes {\mathbb{I}}) |{\phi}\rangle _{AB}\|^2= \langle{\phi}| A^s_a\otimes B^t_b|\phi\rangle, \ \ \ \ \ (1)$

where we have used that squaring a projection leaves it unchanged.

## Quantum strategies beat classical ones

For any 2P-1R game ${ G}$, let ${ \omega(G)}$ be the maximum probability — over the players’ classical strategies — that the verifier accepts and ${ \omega^*(G)}$ the maximum probability that the verifier accepts when the provers use qubits such that player 1’s qubits are entangled with those of player 2. The game of Clauser, Horne, Shimony, and Holt (CHSH) has the property that the provers can increase their chance of winning by sharing an entangled pair of qubits, even when no messages are exchanged between the players. We show that there’s a characterization of the CHSH game’s value ${ \omega^*(G)=\cos^2(\frac{\pi}{8}) = \frac{1}{2}+\frac{\sqrt{2}}{4}\approx 0.85}$ which is better than the classical value ${ \omega(G)\leq \frac{3}{4}}$. Let us first define XOR games, of which the CHSH game is a special case.

#### Definition (XOR game)

An XOR game is a 2-player classical game (the questions and answers are classical bits) where:
1. Questions ${ (s, t)\in\{0, 1, \ldots, n-1\}^2}$ are asked according to some distribution ${ \Pi}$ (e.g. uniform).
2. Answers ${ a, b\in\{0, 1\}}$ are provided by players (call them Alice and Bob).
3. The verifier computes a predicate ${ V(a, b|s, t) = f_{s, t}(a\oplus b)}$ used to decide acceptance/rejection.

#### Definition (CHSH Game)

An XOR game with ${ n=2}$ where ${ s, t\in\{0, 1\}}$ are independent random bits and ${ V(a, b | s, t)=1 \Longleftrightarrow a\oplus b = s\wedge t}$. To win the CHSH game, Alice and Bob need to output bits ${ a}$ (from Alice) and ${ b}$ (from Bob) that disagree if ${ s=t=1}$ and agree otherwise. If Alice and Bob are classical then they can do no better than by always outputting ${ 0}$, say, in which case they win in the three out of four cases when one of the questions is ${ 0}$. Equivalently, ${ \omega(G)=\frac34}$ where ${ G}$ is the CHSH game. This is the content of Bell’s inequality :

#### Lemma (Bell’s Inequality)

For any two functions ${ g, h: \{0, 1\}\rightarrow\{0, 1\}}$, we have ${ {\mathop{\mathbb{P}}}_{x, y\in\{0, 1\}}\left[g(x)\oplus h(y) = x\wedge y\right] \leq \frac{3}{4} }$ where ${ x}$ and ${ y}$ are independent uniformly random bits.
Proof.
The probability of any event is a multiple of ${ 1/4}$ so it suffices to show that ${ {\mathop{\mathbb{P}}}_{x, y\in\{0, 1\}}\left[g(x)\oplus h(y) = x\wedge y\right]\neq1}$. So assume for contradiction that ${ g(x)\oplus h(y) = x\wedge y}$ for all pairs ${ x,y}$. Then we have that ${ g(0)\oplus h(0) = 0}$ and ${ g(0)\oplus h(1) = 0}$ which implies that ${ h(0)=h(1)}$. But then ${ 0=g(1)\oplus h(0) =g(1)\oplus h(1) = 1}$ which is a contraction.

### The strategy

The entangled strategy for the CHSH game requires that Alice and Bob each hold a qubit, so that their two qubits together are described by a vector in ${ \mathbb C^2\otimes\mathbb C^2}$. The two qubits together are in the state ${ |{\phi}\rangle = \frac{1}{\sqrt{2}}\left( |{0}\rangle _A |{0}\rangle _B + |{1}\rangle _A |{1}\rangle _B\right), }$ forming what is known as an EPR (Einstein-Podolsky-Rosen) pair. Now for ${ \theta\in[-\pi, \pi]}$ define ${ |{\beta_0(\theta)}\rangle = \cos\theta |{0}\rangle + \sin\theta |{1}\rangle }$ and ${ |{\beta_1(\theta)}\rangle = -\sin\theta |{0}\rangle + \cos\theta |{1}\rangle }$ . Now we describe a (quantum) strategy with winning probability ${ \cos^2(\frac{\pi}{8})}$. In each case Alice and Bob respond with their measurement outcome, where subscripts of the measurement basis vectors correspond to the answer to be sent back to the verifier.
• If ${ s=0}$, Alice measures in basis ${ \{ |{\beta_0(0)}\rangle , |{\beta_1(0)}\rangle \} = \{ |{0}\rangle , |{1}\rangle \}}$. If ${ s=1}$, Alice measures in ${ \{ |{\beta_0(\frac{\pi}{4})}\rangle , |{\beta_1(\frac{\pi}{4})}\rangle \}}$. Alice answers bit ${ a = 0}$ if outcome is ${ \beta_0(\cdot)}$ and answers ${ a = 1}$ otherwise.
• If ${ t=0}$, Bob measures in basis ${ \{ |{\beta_0(\frac{\pi}{8})}\rangle , |{\beta_1(\frac{\pi}{8})}\rangle\}}$. If ${ t=1}$, Bob measures in ${ \{ |{\beta_0(-\frac{\pi}{8})}\rangle , |{\beta_1(-\frac{\pi}{8})}\rangle \}}$.
• Each player responds with their respective measurement outcome.
Lemma 2 Alice and Bob win the CHSH game with probability ${ \cos^2(\frac{\pi}{8})}$.
Proof.
We will show that for each pair of questions ${ s,t}$ the pair of answers ${ a,b}$ is correct with probability ${ \cos^2(\pi/8)}$. We can split the pairs of questions into the two cases ${ s\wedge t=0}$ and ${ s\wedge t=1}$:
• (${ s\wedge t=0}$) The three cases ${ (s,t)=(0,0),(0,1),(1,0)}$ are all analogous: in each case Alice an Bob must output the same answer, and in each case Bob’s measurement basis is almost the same as Alice’s except rotated by a small angle ${ \pm\pi/8}$. Of the three above cases we consider the one where ${ s,t=0}$ and check that indeed the two measurement outcomes agree with probability ${ \cos^2(\pi/8)}$: When Alice measures her qubit and obtains some bit ${ a}$, the shared pair ${ |{\beta}\rangle}$ collapses to ${ |{a}\rangle _A |{a}\rangle _B}$. Indeed, since the question was ${ s=0}$, Alice measures her qubit in the basis ${ \{|0\rangle,|1\rangle\}}$. This means that Alice applies the measurement ${ \{|0\rangle\langle 0|\otimes {\mathbb{I}},|1\rangle\langle 1|\otimes {\mathbb{I}}\}}$ on the global state. The post-measurement state is the normalization of ${ ( |{a}\rangle \langle{a}| \otimes {\mathbb{I}}) |{\phi}\rangle _{AB}=\frac{1}{\sqrt2} |{a}\rangle \overbrace{\langle a |{0}\rangle }^{\delta_{a,0}}\otimes | 0\rangle + |{a}\rangle \overbrace{\langle a |{1}\rangle }^{\delta_{a,1}}\otimes |{1}\rangle =\frac {1}{\sqrt2} |{a}\rangle _A |{a}\rangle _B }$ because ${ \langle a |{a'}\rangle}$ can be viewed as a Kronecker delta of ${ a}$ and ${ a'}$. In particular, Bob is now in the pure state ${ |{a}\rangle}$ . Because Bob received question ${ t=0}$ he measures in the basis ${ \{ |{\beta_b(\pi/8)}\rangle \}_b}$ Therefore his probability of correctly outputting ${ b=a}$ is ${ {\mathop{\mathbb{P}}}[\text{Bob gets outcome }a] = |\langle\beta_a(\tfrac{\pi}{8}) |{a}\rangle |^2 = \cos^2\left(\frac{\pi}{8}\right) }$
• (${ s\wedge t=1}$) Now consider the case ${ s=1,t=1}$ where Alice and Bob are supposed to give different answers. Alice measures in basis consisting of ${ |{\beta_0(\pi/4)}\rangle =\frac{|0\rangle+|1\rangle}{\sqrt2}}$ and ${ |{\beta_1(\pi/4)}\rangle =\frac{-|0\rangle+|1\rangle}{\sqrt2}}$. If Alice gets outcome ${ a}$ then the post-measurement global state is ${ |{\beta_a(\frac\pi4)}\rangle |{\beta_a(\frac\pi4)}\rangle}$ . Therefore when Bob applies the measurement in basis ${ \{ |{\beta_a(-\frac\pi8)}\rangle \}_a}$ he mistakenly outputs ${ a}$ only with probability ${ |\langle\beta_a(\frac\pi4) |{\beta_a(-\frac\pi8)}\rangle |^2=\sin^2(\frac\pi8)=1-\cos^2(\frac\pi8)}$.
Lemma 2 implies a lower bound on the value of the CHSH game.

#### Corollary

${ \omega^*(G)\ge\cos^2(\frac\pi8)}$
It turns out that this lower bound is sharp, that is, the strategy just described is optimal.

#### Lemma

The value of the CHSH game using a quantum strategy is at most ${ \frac{1}{2} + \frac{\sqrt{2}}{4} = \cos^2\frac{\pi}{8}\approx 0.85}$. Proof. We can describe the quantum strategy of Alice and Bob in an XOR game by (i) a shared quantum state ${ |{\phi}\rangle _{AB}\in\mathbb{C}^d\otimes\mathbb{C}^d}$ (note that for the CHSH game, ${ d=2}$); (ii) measurements ${ \{A^0_s, A^1_s\}}$ for every question ${ s}$ sent to Alice; (iii) measurements ${ \{B^0_t, B^1_t\}}$ for every question ${ t}$ sent to Bob. The probability of answering ${ (a, b)}$ given questions ${ (s, t)}$ is ${ \langle{\phi}| A^a_s\otimes B^b_t |{\phi}\rangle}$ . Now let us write ${ A_s=A^0_s - A^1_s}$ and ${ B_t=B^0_t - B^1_t}$ so that for any ${ a, b\in\{0, 1\}}$, we can write ${ A_s^a = \frac{{\mathbb{I}} + (-1)^aA_s}{2}, B_t^b = \frac{{\mathbb{I}} + (-1)^bB_t}{2} }$ Note that since the possible outcomes here are finite, ${ A_s, B_t}$ are Hermitian and we may assume have bounded norm of 1. Furthermore, we assume that ${ A_s, B_t}$ are observables so that ${ A_s^2 = B_t^2 = {\mathbb{I}}}$ . Now denoting ${ f_{s, t}(a\oplus b)}$ as the XOR predicate to be computed, we can write the quantum game value as $\omega^*(G) = \sup_{ |{\phi}\rangle , \{A_s^a\}, \{B_t^b\}} {\mathop{\mathbb{E}}}_{(s, t)\sim\Pi}\sum_{a, b}f_{s, t}(a\oplus b) \langle{\phi}| A^a_s\otimes B^b_t |{\phi}\rangle \\ = \sup_{ |{\phi}\rangle , \{A_s\}, \{B_t\}} {\mathop{\mathbb{E}}}_{(s, t)\sim\Pi}\sum_{a, b}f_{s, t}(a\oplus b) \langle{\phi}| \frac{{\mathbb{I}} + (-1)^aA_s}{2}\otimes \frac{{\mathbb{I}} + (-1)^bB_t}{2} |{\phi}\rangle \\ = {\mathop{\mathbb{E}}}_{(s, t)\sim\Pi}\sum_{a, b}\frac{f_{s, t}(a\oplus b)}{4} + \sup_{ |{\phi}\rangle , \{A_s\}, \{B_t\}} {\mathop{\mathbb{E}}}_{(s, t)\sim\Pi}\sum_{a, b}f_{s, t}(a\oplus b) \langle{\phi}| \frac{{\mathbb{I}} + (-1)^aA_s}{2}\otimes \frac{{\mathbb{I}} + (-1)^bB_t}{2} |{\phi}\rangle \\ = {\mathop{\mathbb{E}}}_{(s, t)\sim\Pi}\sum_{a, b}\frac{f_{s, t}(a\oplus b)}{4} + \sup_{ |{\phi}\rangle , \{A_s\}, \{B_t\}} {\mathop{\mathbb{E}}}_{(s, t)\sim\Pi}\sum_{a, b}\frac{f_{s, t}(a\oplus b)(-1)^{ab}}{4} \langle{\phi}| A_s\otimes B_t |{\phi}\rangle \\ = {\mathop{\mathbb{E}}}_{(s, t)\sim\Pi}\sum_{a, b}\frac{f_{s, t}(a\oplus b)}{4} + \sup_{ |{\phi}\rangle , \{A_s\}, \{B_t\}} {\mathop{\mathbb{E}}}_{(s, t)\sim\Pi}\frac{f_{s, t}(0) - f_{s,t}(1)}{2} \langle{\phi}| A_s\otimes B_t |{\phi}\rangle \\$ where the summation ${ \sum_{a, b\in\{0, 1\}}\left(\cdot\right)}$ has been evaluated in the last line. Now note that the first term is independent of the quantum strategy and as a result equals the value of the uniformly random strategy which is 1/2. So we proceed to focus on the second term. Note that for CHSH ${ f_{s, t}(0) - f_{s,t}(1) = (-1)^{st}}$ simplifying the second term to ${ \frac{1}{8}( \langle{\phi}| A_0\otimes B_0 |{\phi}\rangle + \langle{\phi}| A_1\otimes B_0 |{\phi}\rangle + \langle{\phi}| A_0\otimes B_1 |{\phi}\rangle - \langle{\phi}| A_1\otimes B_1 |{\phi}\rangle ). }$ Next, we invoke Tsirelson’s Theorem (See Theorem 3) to bound this second term as $= \sup_{{\textbf{x}}_s, {\textbf{y}}_t}\frac{1}{8}({\textbf{x}}_0\cdot {\textbf{y}}_0 + {\textbf{x}}_0\cdot {\textbf{y}}_1 + {\textbf{x}}_1\cdot {\textbf{y}}_0 - {\textbf{x}}_1\cdot {\textbf{y}}_1) \\ \leq \sup_{{\textbf{x}}_s, {\textbf{y}}_t}\frac{1}{8}(\|{\textbf{x}}_0\|\|{\textbf{y}}_0 + {\textbf{y}}_1\| + \|{\textbf{x}}_1\|\|{\textbf{y}}_0 - {\textbf{y}}_1\|) \\ \leq \sup_{{\textbf{x}}_s, {\textbf{y}}_t}\frac{\sqrt{2}}{8}\sqrt{2\|{\textbf{y}}_0\|^2 + 2\|{\textbf{y}}_1\|^2} \leq \frac{\sqrt{2}}{4}$ where we have used Cauchy-Schwartz and the concavity of the ${ \sqrt{\cdot}}$ function. This completes our proof showing the exact characterization of the value (${ =\cos^2(\frac{\pi}{8})=\frac{1}{2}+\frac{\sqrt{2}}{4}}$) of the CHSH game using a quantum strategy. This proof is an adaptation of the one in [12].
Theorem 3 (Tsirelson’s Theorem [1]) For any ${ n\times n}$ matrix ${ C = (C_{s, t})}$, the following are equivalent:
1. There exist ${ d\in\mathbb{N}, |{\phi}\rangle \in\mathbb{C}^{d}\otimes\mathbb{C}^{d}, A_s, B_t\in\text{Herm}(\mathbb{C}^d), A_s^2 = B_t^2 = I}$ such that for any ${ s, t\in[n]}$ ${ C_{s, t} = \langle{\phi}| A_s\otimes B_t |{\phi}\rangle}$ . Further this would imply that ${ d\leq 2^{\lceil\frac{n+2}{2}\rceil}}$;
2. There exist real unit vectors ${ { {\textbf{x}}}_s, { {\textbf{y}}}_t\in{\mathbb{R}}^{n+2}}$ for ${ s, t\in [n]}$ such that ${ C_{s, t} = { {\textbf{x}}}_s\cdot { {\textbf{y}}}_t}$;

## Entangled unique games are easy

The CHSH game provides the first example that the entangled value ${ \omega^*(G)}$ of a nonlocal game can exceed the classical value ${ \omega(G)}$. XOR-games like the CHSH game are the special case corresponding to alphabet size ${ k=2}$ of the class of unique games :

#### Definition (Unique Games)

A 2-prover 1-round game is called a unique game if its constraints are of the form ${ b=\pi_{ij}(a)}$ where ${ \pi_{ij}}$ is a permutation of the alphabet ${ \Sigma}$ for each edge ${ i\sim j}$. The famous unique games conjecture (UGC) by Khot says that ${ \omega(G)}$ is NP-hard to approximate for unique games. Surprisingly, Kempe et al. showed that a natural semidefinite relaxation for unique games yields an approximation to the entangled value ${ \omega^*(G)}$ which can be computed in polynomial time. In other words the UGC is false for entangled provers, in contrast to the classical case where the conjecture is open.
Theorem 4 There is an efficient (classical) algorithm which takes a description of a nonlocal game ${ G}$ as its input and outputs ${ \hat\omega(G)}$ such that ${ 1-6(1-\hat\omega(G))\le\omega^*(G)\le\hat\omega(G) }$ Put differently, if ${ \omega^*(G)=1-\varepsilon^*}$ and ${ \hat\omega(G)=1-\hat\varepsilon}$, then ${ \varepsilon^*\in[\hat\varepsilon,6\hat\varepsilon] }$
The algorithm of theorem 4 proceeds by relaxing the set of quantum strategies to a larger convex set ${ \mathcal S}$ of pseudo-strategies and maximizing over ${ \mathcal S}$ instead of actual strategies, a much easier task. In approximation theory one often encounters a collection of hypothetical moments not arising from a distribution, known as a pseudo-distribution. In contrast, our pseudo-strategies are actual conditional probability distributions on answers (conditional on the questions). What makes ${ \mathcal S}$ a set of “pseudo”-strategies rather than actual strategies is that they may enjoy correlations which cannot be achieved without communication.

### Convex relaxation of quantum strategies

We will define ${ \mathcal S}$ to be a class of conditional probability distributions ${ \tilde{P}(a,b|s,t)}$ on answers given questions. We will require that the pseudo-strategies satisfy a positive semidefinite constraint when arranged in matrix form. In particular this matrix has to be symmetric, so we symmetrize the conditional probability ${ \tilde{P}(a,b|s,t)}$ by allowing each of ${ s}$ and ${ t}$ to be either a question for Alice or for Bob. That is, we extend the domain of definition for ${ \tilde{P}(a,b|s,t)}$ from ${ \Sigma^2\times S\times T}$ to ${ \Sigma^2\times (S\cup T)^2}$ where ${ S}$ and ${ T}$ are the question sets. So each question ${ s}$ and ${ t}$ and answer ${ a}$ and ${ b}$ can be either for Alice or Bob — we indicate this by changing notation from ${ (s,t)\in S\times T}$ to ${ q,q'\in S\cup T}$ and for the answers replacing ${ a,b}$ by ${ c,c'}$.
Definition 5 (Block-matrix form) Given a function ${ \tilde{P}=\tilde{P}(\cdot\cdot|\cdot\cdot)}$ defined on ${ (S\cup T)^2\times \Sigma^2}$ with ${ |S|=|T|=n}$ and ${ |\Sigma|=k}$, define a ${ 2nk}$-by-${ 2nk}$ matrix ${ M^{\tilde{P}}}$ whose rows are indexed by pairs ${ (q,c)\in(S\cup T)\times\Sigma}$ and columns by pairs ${ (q',c')\in (S\cup T)\times\Sigma}$, and whose entries are ${ M^{\tilde{P}}_{(q,c),(q',c')}=\tilde{P}(c,c'|q,q') }$ In other words ${ M^{\tilde{P}}}$ consists of ${ k}$-by-${ k}$ blocks where the block ${ M^{\tilde{P}}_{q,q'}}$ at position ${ q,q'}$ contains the entries ${ \tilde{P}(\cdot\cdot|q,q')}$.
Definition 5 is simply a convenient change of notation and we identify ${ \tilde{P}}$ with ${ M^{\tilde{P}}}$, using either notation depending on the context.
Definition 6 (Pseudo-strategies) Let ${ S}$ and ${ T}$ be the question sets for Alice and Bob, respectively. We say that ${ \tilde{P}=\tilde{P}(\cdot\cdot|\cdot\cdot):\Sigma^2\times (S\cup T)^2\rightarrow[0,1]}$ (or its matrix form ${ M^{\tilde{P}}}$) is a pseudo-strategy if:
1. ${ M^{\tilde{P}}}$ is positive semidefinite.
2. For any pair of questions ${ q,q'\in S\cup T}$, ${ \sum_{c,c'=1}^k \tilde{P}(c,c'|q,q')=1}$.
3. The blocks ${ M^{\tilde{P}}_{q,q'}}$ on the diagonal are themselves diagonal ${ k}$-by-${ k}$ matrices.
Define the winning probability or value of a pseudo-strategy as: ${ \omega^{\tilde{P}}(G)=\mathbb E_{(s,t)\sim \Pi} \sum_{a,b} \tilde{P}(a,b|s,t)V(a,b|s,t) }$ The algorithm outputs the maximum winning probability: ${ \hat\omega(G)=\max_{\tilde{P}\in\mathcal S}\omega^{\tilde{P}}(G) }$ over pseudo-strategies ${ \tilde{P}\in \mathcal S}$. This maximum is efficiently computable using standard semidefinite programming algorithms. As we will see, actual quantum strategies are in ${ \mathcal S}$ which immediately implies ${ \hat\omega(G)\ge\omega^*(G)}$. It then remains to show that the optimal pseudo-strategy can be approximated by an actual entangled strategy, thus bounding the gap from ${ \omega^*(G)}$ to ${ \hat\omega(G)}$.

### Quantum strategies are in ${ \mathcal S}$

Let us establish that ${ \mathcal S}$ is indeed a relaxation of the class of quantum strategies, that is, it contains the quantum strategies. So suppose we are given a quantum strategy. By equation 1 the probability of answers ${ a}$ and ${ b}$ given questions ${ s}$ and ${ t}$ is of the form ${ \langle\phi|A^s_a\otimes B^t_b |\phi\rangle=\langle\phi|(A^s_a\otimes {\mathbb{I}})({\mathbb{I}}\otimes B^t_b )|\phi\rangle }$ for some PVM’s ${ A^s}$ and ${ B^t}$ and some ${ |\phi\rangle\in\mathbb C^d\otimes\mathbb C^d}$. This conditional probability distibution is not immediately in the form of a pseudo-strategy because we cannot evaluate it on pairs of Alice-questions or pairs of Bob-questions. We therefore have to extend it, as follows: Place all the column vectors ${ A^s_a\otimes {\mathbb{I}}|\phi\rangle}$, ${ (s,a)\in S\times \Sigma}$ side by side, and then append the vectors ${ I\otimes B^t_b|\phi\rangle}$, resulting in a ${ d^2}$-by-${ 2nk}$ matrix ${ R}$. We then define ${ {P}(\cdot\cdot|\cdot\cdot):\Sigma^2\times(S\cup T)^2\rightarrow[0,1]}$ through its matrix form (see the comment below definition 5):

$\displaystyle M^{P}:=R^\dag R=\begin{pmatrix} (\langle\phi|A^s_a A^{s'}_{a'}\otimes {\mathbb{I}} |\phi\rangle)_{(s,a),(s',a')} (\langle\phi|A^s_a\otimes B^t_b |\phi\rangle)_{(s,a),(t,b)} \\\\ (\langle\phi|B^t_b\otimes A^s_a |\phi\rangle)_{(t,b),(s,a)} (\langle\phi|{\mathbb{I}}\otimes B^t_bB^{t'}_{b'} |\phi\rangle)_{(t,b),(t',b')} \end{pmatrix} \ \ \ \ \ (2)$

Lemma 7 ${ M^{P}}$ defined in 2 is a pseudo-strategy, that is, ${ M^{P}\in\mathcal S}$.
Proof.
We verify the conditions in definition 6. Condition 1 (${ M^{P}\succeq0}$) holds because it is of the form ${ R^\dag R}$. Condition 2 (Each block ${ M^{P}_{q,q'}}$ sums to ${ 1}$) holds because PVM’s sum to the identity. Condition 3 (Diagonal blocks ${ \tilde M=M^{P}_{qq}}$ are diagonal) holds because the projections in the PVM ${ A^q}$ are mutually orthogonal, hence ${ \langle{\phi}| A^q_cA^q_{c'} |{\phi}\rangle =0}$ if ${ c\neq c'}$.
Lemma 7 means ${ \hat\omega(G)=\max_{{P}\in \mathcal S}\omega^{P}(G)}$ is an (efficiently computable) upper bound for ${ \omega^*(G)}$:

#### Corollary

${ \hat\omega\ge \omega^*(G)}$.
To finish the proof of theorem 4 we need to show that any pseudo-strategy ${ \tilde{P}}$ can be rounded to an actual quantum strategy with answer probabilities ${ {P}}$ such that

$\displaystyle 1-\omega^{P}\le 6(1-\omega^{\tilde P}) \ \ \ \ \ (3)$

Applying this rounding to the optimal pseudo-strategy implies that ${ 1-\omega^*\le 6(1-\hat\omega) }$ or ${ \omega^*\ge 1-6(1-\hat\omega)}$, which will finish the proof of theorem 4.
Proof.
[Proof of theorem 4] Let ${ \tilde{P}\in\mathcal S}$ be a pseudo-strategy. We construct a quantum strategy ${ {P}}$ approximating ${ \tilde{P}}$. Since ${ M^{\tilde{P}}}$ is positive semidefinite we can write ${ \tilde M=R^\dag R }$ for some matrix ${ R\in \mathbb C^{r\times 2nk}}$ and ${ r\leq 2nk}$. Now let us define ${ 2nk}$ vectors ${ |\tilde u^s_a\rangle}$ and ${ |\tilde v^t_b\rangle}$ in ${ \mathbb C^{r}}$, and let ${ |{u^s_a}\rangle}$ and ${ |{v^t_b}\rangle}$ be the same vectors normalized. The strategy is constructed as follows. Alice and Bob share the maximally entangled state ${ |\phi\rangle=\frac1{\sqrt r}\sum_{i=1}^r|i\rangle|i\rangle\in\mathbb C^r\otimes\mathbb C^r }$ Before deciding on Alice and Bob’s PVM’s ${ A^s}$ and ${ B^t}$ let us see what this choice of shared state means for the conditional distribution on answers (see equation (1)).

$\langle\phi| A^s_a\otimes B^t_b|\phi\rangle =\frac1r\sum_{i,j=1}^r\langle i | A^s_a| j\rangle\langle i | B^t_b| j\rangle =\frac1r A^s_a\cdot B^t_b=\frac1r\langle A^s_a,\overline{B^t_b}\rangle$, (*)

where the bar represents entrywise complex conjugation, ${ \cdot}$ is the entrywise dot product of matrices, and ${ \langle\:,\rangle}$ the entrywise complex inner product (Hilbert-Schmidt inner product). We now choose the measurements. Given question ${ s}$, Alice measures in the PVM ${ A^s=(A^s_a)_{a=0}^k}$ with ${ A^s_a= |{u^s_a}\rangle \langle{u^s_a}| \text{ for }a=1,\ldots,k\text{, and }A^s_0={\mathbb{I}}-\sum_{i=1}^k |{u^s_a}\rangle \langle{u^s_a}| }$ Similarly, Bob on question ${ t}$ applies the PVM ${ B^t}$ with ${ B^t_b= |{v^t_b}\rangle \langle{v^t_b}| \text{ for }b=1,\ldots,k\text{, and }B^t_0={\mathbb{I}}-\sum_{i=1}^k |{v^t_b}\rangle \langle{v^t_b}| }$ The condition 3 in definition 6 ensures that for any question ${ s}$, the vectors ${ |{u^s_1}\rangle ,\ldots, |{u^s_1}\rangle}$ are orthogonal so that this is a valid PVM. The measurement outcome “$0$” is interpreted as “fail”, and upon getting this outcome the player attempts the measurement again on their share of a fresh copy of ${ |\phi\rangle_{AB}}$. This means that the strategy requires many copies of the entangled state to be shared before the game starts. It also leads to the complication of ensuring that with high probability the players measure the same number of times before outputting their measurement, so that the outputs come from measuring the same entangled state. By (*), at a given round of measurements the conditional distribution of answers is given by ${ \langle\phi|A^s_a\otimes B^t_b|\phi\rangle=\frac1r\Big\langle |{u^s_a}\rangle \langle{ u^s_a}| \:,\: |{{v^t_b}\rangle } \langle{ {v^t_b}| }\Big\rangle=\frac1r|\langle {u^s_a}|v^t_b\rangle|^2=\frac1{r|\tilde u^s_a|^2|\tilde v^t_b|^2}\Big(M^{\tilde{P}}_{(s,a),(t,b)}\Big)^2, }$ We wish to relate the LHS to ${ M^{\tilde{P}}_{(s,a),(t,b)}}$, so to handle the factor ${ \frac1r}$ each prover performs repeated measurements, each time on a fresh copy of ${ |\phi\rangle_{AB}}$, until getting an outcome ${ \neq0}$. Moreover, to handle the factor ${ \frac1{|u^s_a|^2|v^t_b|^2}}$, each prover consults public randomness and accepts the answer ${ a\in[k]}$ with probability ${ |u^a_s|^2}$ and ${ |v^b_t|^2}$ respectively, or rejects and start over depending on the public randomness. Under a few simplifying conditions (more precisely, assuming that the game is uniform meaning that an optimal strategy exists where the marginal distribution on each prover’s answers is uniform), we can let ${ M^{\tilde{P}}_{(s,a),(t,b)}\le 1/k}$ for all ${ s,a,t,b}$, and one can ensure that the conditional probabilities ${ {P}}$ of the final answers satisfy

$\displaystyle 1/k-P(a,b|s,t)\le 3\big(1/k-k(M^{\tilde P}_{(s,a),(t,b)})^2\big),\ \ \ \ \ (4)$ At this stage it is important that we are dealing with a unique game . Indeed, by (4) we have for every ${ s}$ and ${ t}$, $1-\sum_{a=1}^k P(a,\pi_{st}(a)|s,t)=\sum_a \Big(\frac{1}{k}-P(a,\pi_{st}(a)|s,t) \Big) \\ \leq 3\sum_{b=\pi_{st}(a)} \Big(\frac{1}{k}-k(M^{\tilde{P}}_{(s,a),(t,b)})^2 \Big) \\ \leq 6\sum_{b=\pi_{st}(a)}\big(\frac{1}{k}-M^{\tilde{P}}_{(s,a),(t,b)}\big)=6\Big(1-\sum_{b=\pi_{st}(a)} M^{\tilde{P}}_{(s,a),(t,b)}\Big)$ where the last inequality follows from concavity. Taking the expectation over ${ s}$ and ${ t}$ implies the bound (3), thus concluding the proof of theorem 4.

## General games are hard

We just saw that a specific class of games becomes easy in the presence of shared entanglement, in that semidefinite programming allows the entangled value ${ \omega^*}$ to be approximated to within exponential precision in polynomial time. Does this phenomenon hold more generally, so that the value of entangled games can always be efficiently approximated? We answer in the negative, by constructing a game where ${ \omega^*}$ is NP-hard to approximate to within inverse-polynomial factors. The complexity for 2P-1R entangled games can be strengthened to constant-factor NP-hardness, putting it on par with the classical PCP theorem. This result is used to prove (with some conditions) the games formulation of the quantum PCP theorem, which states that the entangled value of general games is QMA-hard to approximate within a constant factor.

### Formulation of game

Given any instance ${ \phi}$ of a ${ k}$-CSP (constraint satisfaction problem, where ${ k}$ is the number of literals), we can define a clause-vs-variable game ${ G_\phi}$ (see clause-vs-variable figure):
1. The referee (verifier) randomly sends a clause to Alice (first prover) and a variable to Bob (second prover).
2. Alice and Bob reply with assignments.
3. The referee accepts if Alice’s assignment satisfies the clause and Bob’s answer is consistent with Alice’s.
To show hardness of approximation, we need to go beyond the usual 2-player construction. In particular, in our game [3] one of the players receives an extra dummy question (see subfigure (a)). Mathematically, the result is very similar to introducing another player and having the referee play the 2-player game with two players chosen randomly [4] (see subfigure (b)). In either variation, the quantum phenomenon of monogamy of entanglement , imposing that only two parties can be maximally entangled to one another, is key to establishing hardness. The players do not know where to use their entanglement, which prevents them from coordinating as well as they could in the standard game.

### NP-hardness of approximating the entangled value

To prove hardness, we rely on several results from classical complexity theory.
Theorem 8 ([6]) Given an instance of 1-in-3 3SAT (a CSP), it is NP-hard to distinguish whether it is satisfiable or no assignments satisfy more than a constant fraction of clauses.
Theorem 9 ([2]) For a PCP game ${G}$ (emulating the CSP) and its oracularization ${G'}$ (transformation to a 2P-1R game),

$\displaystyle \omega(G)\leq \omega(G')\leq 1-\frac{1-\omega(G)}{3}\,. \ \ \ \ \ (5)$

Theorem 8 establishes the CSP variant of the classical PCP theorem: distinguishing between ${ \omega(\phi)=1}$ and ${ \omega(\phi)\leq 1/2}$ is NP-hard for some ${ k}$-CSP. Here, ${ \omega(\phi)}$ denotes the maximum fraction of clauses that are simultaneously satisfiable. Theorem 9 relates the general game ${ G}$ obtained from the CSP to a two-player one-round game ${ G'}$, in terms of the value (probability of winning) the game. The first inequality, equivalently saying ${ \omega(\phi)\leq \omega(G_\phi)}$, is achieved since the players can answer the questions in the game ${ G_\phi}$ to satisfy the clauses in ${ \phi}$. These theorems together imply that ${ \omega(G_\phi)}$ is NP-hard to approximate to within constant factors. Allowing the two players to share entanglement can increase the game value to ${ \omega^*(G_\phi)\geq \omega(G_\phi)}$. Classical results do not necessarily carry over, but exploiting monogamy of entanglement allows us to limit the power of entangled strategies. One can show the following lemma, which is weaker than what we have classically.
Lemma 10 ([3]) There exists a constant ${c>0}$ such that for a CSP ${\phi}$,

$\displaystyle \omega^*(G_\phi)\leq 1 - \frac{c(1-\omega(\phi))^2}{n^2}\,, \ \ \ \ \ (6)$

where ${n}$ is the number of variables. Combining Theorem 9 and Lemma 10, we have ${ \omega(\phi)\leq \omega^*(G_\phi)\leq 1 - \frac{c(1-\omega(\phi))^2}{n^2}\,. }$ Using Theorem 8, approximating ${ \omega^*(G_\phi)}$ is NP-hard to within inverse polynomial factors. Proving Lemma 10 takes some work in keeping track of approximations. For simplicity, we will show a less quantitative statement and indicate where the approximations come in.
Proposition 11 (adapted from [13]) ${\phi}$ is satisfiable iff ${\omega^*(G_\phi)=1}$.

### Proof of Proposition 11

The forward direction is straightforward: If ${ \phi}$ is satisfiable, then there exists a perfect winning strategy where the questions are answered according to the satisfying assignment. For the reverse direction, suppose there exists a strategy that succeeds with probability 1, specified by a shared entangled state ${ |{\psi}\rangle \in\mathbb{C}^d\otimes\mathbb{C}^d}$ and measurements ${ (A^j_{a_1,\ldots,a_k})_{a_1,\ldots,a_k}}$ for Alice and ${ (B^{t,l}_{b,b'})_{b,b'}}$ for Bob, where the questions ${ j\in[m]}$, ${ t,l\in[k]}$ and the answers ${ a,b}$ are from the CSP’s alphabet. Since one of the questions/answers for Bob corresponds to a dummy variable that is irrelevant to the game, trace over the dummy variable to define a new measurement operator ${ \tilde{B}^t_b=\frac{1}{n}\sum_{l,b'} B^{t,l}_{b,b'}}$. We can introduce a distribution on assignments to the ${ n}$ relevant variables,

$\displaystyle p(a_1,\ldots,a_n)=\lVert \mathbb{I} \otimes \tilde{B}^1_{a_1}\cdots\tilde{B}^n_{a_n} |\psi\rangle \rVert^2\,. \ \ \ \ \ (7)$

If we show that the distribution for assignments ${a_{i_1},\ldots,a_{i_k}}$ on variables ${x_{i_1},\ldots,x_{i_k}}$ in any clause ${C_j}$ is

$\displaystyle p(a_{i_1},\ldots,a_{i_k})= \lVert A^j_{a_{i_1},\ldots,a_{i_k}} \otimes \mathbb{I} |\psi\rangle \rVert^2\,, \ \ \ \ \ (8)$

then, since the players win with certainty, ${\phi}$ has a satisfying assignment. To transform Eq. 7 to Eq. 8, we need a relation between the ${A}$ and ${\tilde{B}}$ measurement operators and a way to commute the ${\tilde{B}}$ operators. The success probability of the players’ strategy is expressed as ${ P=\frac{1}{m}\sum_{j\in[m]}\frac{1}{k}\sum_{i\in C_j} \sum_{(a_1,\ldots,a_k)\vdash C_j} \langle{\psi}| A^j_{a_1,\ldots,a_k}\otimes \tilde{B}^i_{a_i} |{\psi}\rangle \,, }$ where ${ i}$ is the index of one of the ${ k}$ variables on which ${ C_j}$ acts, and ${ (a_1,\ldots,a_k)\vdash C_j}$ indicates that the assignment satisfies the clause. By positivity and summation to identity of the measurement operators ${ A}$ and ${ \tilde{B}}$, each term is at most 1; for our hypothesis ${ P=1}$, each has to be 1. Hence, using orthogonality of the vectors ${ {\mathbb{I}}\otimes\tilde{B}^i_{a_i} |{\psi}\rangle}$ for different ${ a_i}$, we have

$\displaystyle \sum_{\substack{(a_{i_1},\ldots,a_{i_k})\vdash C_j \\ a_i=b}} A^j_{a_{i_1},\ldots,a_{i_k}}\otimes \mathbb{I} |\psi\rangle = \mathbb{I} \otimes \tilde{B}^i_{a_i}|\psi\rangle\,, \ \ \ \ \ (9)$

for any ${ j\in[m]}$ and ${ i\in C_j}$. We now demonstrate that two different ${ \tilde{B}^{t_1}_{b_1}}$, ${ \tilde{B}^{t_2}_{b_2}}$ commute, so that Bob can match any satisfied clause/variable. ${\mathbb{I}} \otimes \tilde{B}^{t_1}_{b_1} \tilde{B}^{t_2}_{b_2} |{\psi}\rangle = {\mathbb{I}} \otimes (\frac{1}{n}\sum_{l_1,b_1'} B^{t_1,l_1}_{b_1,b_1'}) (\frac{1}{n}\sum_{l_2,b_2'} B^{t_2,l_2}_{b_2,b_2'}) |{\psi}\rangle \\ = {\mathbb{I}} \otimes (\frac{1}{n}\sum_{t_2,b_1'} B^{t_1,t_2}_{b_1,b_1'}) (\frac{1}{n}\sum_{t_1,b_2'} B^{t_2,t_1}_{b_2,b_2'}) |{\psi}\rangle \\ = {\mathbb{I}} \otimes \frac{1}{n^2}\sum_{t_1,t_2} B^{t_1,t_2}_{b_1,b_2} |{\psi}\rangle \\ = {\mathbb{I}} \otimes \frac{1}{n^2}\sum_{t_1,t_2} B^{t_2,t_1}_{b_2,b_1} |{\psi}\rangle \\ = {\mathbb{I}} \otimes \tilde{B}^{t_2}_{b_2} \tilde{B}^{t_1}_{b_1} |{\psi}\rangle$ In the second line, we used (9) to relate the measurements. The third line follows by the orthogonality of ${ B^{t_1,t_2}_{b_1,b_2}}$ for different ${ b_1,b_2}$. For the fourth equation, we simply swap ${ t_1,t_2}$ since the questions are indistinguishable to Bob. Thus, we can see how the dummy variable comes into play. If we had assumed ${ P=1-\epsilon}$ and kept track of approximations, we would find

$\displaystyle \frac{1}{n^2}\sum_{t_1,b_1}\sum_{t_2,b_2}\lVert \mathbb{I} \otimes (\tilde{B}^{t_1}_{b_1} \tilde{B}^{t_2}_{b_2} - \tilde{B}^{t_2}_{b_2} \tilde{B}^{t_1}_{b_1}) |\psi\rangle \rVert^2 = O(\epsilon)\,. \ \ \ \ \ (10)$

This approximate commutativity results in the hardness of approximation holding only to within inverse poly${ (n)}$ factors. Now we are ready to transform Eq. 7 to Eq. 8 to conclude the proof. $p(a_1,\ldots,a_n)=\lVert {\mathbb{I}} \otimes \tilde{B}^1_{a_1} \ldots \tilde{B}^n_{a_n} |{\psi}\rangle \rVert^2 \\ =\lVert {\mathbb{I}} \otimes \tilde{B}^{i_1}_{a_{i_1}} \ldots \tilde{B}^{i_k}_{a_{i_k}} |{\psi}\rangle \rVert^2 \\ = \lVert A^j_{a_{i_1},\ldots,a_{i_k}} \otimes {\mathbb{I}} |{\psi}\rangle \rVert^2 \\ =p(a_{i_1},\ldots,a_{i_k})\,.$ In the second line, we used (10) to commute the measurement operators, along with their properties of orthogonality and summation to identity. For the third equality, we used (9) to relate Bob’s measurements to Alice’s, along with orthogonality of ${ A^j_{a_{i_1,\ldots,i_k}}}$ for different ${ a_{i_1,\ldots,i_k}}$.

### Constant-factor NP-hardness

The weakness in the above two-player game carries over from the original three-player variant. Thus, to achieve constant-factor NP-hardness of approximation, we could start with a different multiplayer game. Vidick [11] establishes the soundness of the “plane-vs-point” low-degree test (checking that the restriction of a low-degree polynomial to a plane matches its value at some point) in the presence of shared entanglement. Soundness , in the eponymous probabilistically checkable proof (PCP) formulation of the PCP theorem, refers to the verifier accepting a wrong proof with some bounded probability; bounding with a constant maps to constant-factor hardness of approximation. Here, soundness comes from a strong bound on error accumulation, similar to our approximate commutativity, but relies on the players’ Hilbert space being decomposable into three parts (i.e., there being three players). The particular game is constructed by combining the low-degree test with the 3-SAT test (encoding satisfying assignments in a low-degree polynomial), which can be reduced to the three-player QUADEQ test (testing satisfiability of a system of quadratic equations in binary variables, which is NP-complete). By the strong soundness result, the entangled value is NP-hard to approximate to within constant factors. Natarajan et al. [7] show that soundness holds even for two players, using a semidefinite program. They then construct a two-player game in a way similar to what we demonstrated.

### Constant-factor QMA-hardness

The above can be thought of as the games formulation of the classical PCP theorem holding under shared entanglement. A true quantum PCP theorem states that the entangled value of general games is QMA-hard to approximate to within constant factors. Natarajan et al. [8] establish such a theorem, but under randomized reductions. This requirement stems from the lack of a sufficiently strong QMA-hardness result for local Hamiltonians (the quantum analog of CSPs). The soundness of the two-player low-degree test above is one instrumental component in the proof.

## How much entanglement is needed?

We now focus on the question of quantifying exactly how much entanglement is needed to play XOR games optimally. As we shall see, the answer depends on the size of the question sets posed to Alice & Bob in the game. The previous bound given by Tsirelson [10] (see table below) is tight for certain families of games, but is not tight for other families of games (such as a generalization of the CHSH game). The reason for this discrepancy is closely tied in with the the properties of the representation of the Observables that form the Optimal Strategy (${ |{\psi}\rangle , \{A_{i}\}_{i \in S}, \{B_{j}\}_{j \in T}}$). Slofstra [9] shows that if the Observables constitute a Clifford Algebra (that is, the solutions are pair-wise anti-commutative), then the strategy is minimally entangled (uses the least number of entangled bits) iff the strategy is a unique solution to the SDP rounding problem. As a trivial corollary, if the SDP rounding problem does not have a unique solution (and a correspondingly unique strategy), then there exists a Non-Clifford optimal strategy that uses (atleast) ${ |T|}$ bits of entanglement less than the Clifford strategy. Slofstra further states that minimally entangled ${ \epsilon}$-optimal strategies may be constructed for XOR games where the optimal strategies have ‘stable’ representations. For the purposes of this post, we will analyze the exact result and merely state the approximate result.

### Main Results

#### EXACT

For the exact realm, the table below summarizes Slofstra and Tsirelson’s main results.
Person Strategy Bound(entangled bits)
Slofstra (Possibly) Non-Clifford ${ \log_{2}(N)}$
Tsirelson Clifford ${ \left \lfloor{\frac{r}{2}}\right \rfloor}$
Here, ${ r}$ is the largest integer such that ${ \binom{r + 1}{2} < |S| + |T|}$ and corresponds to the maximum-rank of an extremal point in the quantum correlation matrix corresponding to an optimal strategy. ${ N}$ is the minimum dimension of the representations of the Operators (${ \{B_{j}\}}$).

#### APPROXIMATE

In the approximate realm, the minimum entanglement dimension of the representation of the Operators from an ${ \epsilon}$-Optimal Strategy is: min(${ \mathcal{O}(\epsilon^{\frac{-1}{12}}), 2^{\left \lfloor{\frac{r}{2}}\right \rfloor}}$). As we shall see, Slofstra’s theorem allows us to recover Tsirelson’s bound easily by using a fact from Representation Theory about the irreducible representations of Clifford Algebras, but stands as a more general lower bound for solutions that aren’t Clifford.

### Marginals and Solution Algebras

We’ll begin by introducing 3 key ideas: i) Degeneracy ${ \leftrightarrow}$ Non-Degeneracy ii) Existence of Marginals iii) Solution Algebra Once these ideas are defined and their notions made clear, we will be in a position to state the main result and sketch a proof for it.

#### Definition (Marginal Strategy)

Given an Optimal Quantum Strategy (${ |{\psi}\rangle , \{A_{i}\}_{i \in S}, \{B_{j}\}_{j \in T}}$), a marginal constitutes ${ \{B_{j}\}_{j \in T}}$, and the partial trace of ${ \psi}$ with respect to ${ H_{A}}$ (${ \rho_{B}}$). It is also possible to dualize the definition for obtaining ${ \{A_{i}\}_{i \in S}}$ and ${ \rho_{A}}$. We now define the notion of degeneracy, which is critical when proving the main theorem. The main point to drive home is that a degenerate optimal quantum strategy can be reduced to a unique, non-degenerate optimal quantum strategy.

#### Definition (Degenerate Quantum Strategy)

A quantum strategy (${ |{\psi}\rangle , \{A_{i}\}_{i \in S}, \{B_{j}\}_{j \in T}}$) is said to be degenerate if ${ \exists (P \in H_A)}$, ${ (Q \in H_B)}$ such that: i) ${ P}$ commutes with all ${ A_i}$ and ${ (P \otimes {\mathbb{I}}) |{\psi}\rangle = |{\psi}\rangle}$ ii) ${ Q}$ commutes with all ${ B_j}$ and ${ ({\mathbb{I}} \otimes Q) |{\psi}\rangle = |{\psi}\rangle}$ Since we can efficiently construct for any degenerate Optimal Quantum Strategy a unique, non-degenerate Optimal Quantum Strategy, we will now assume WLOG that every Optimal Quantum Strategy is non-degenerate (and unique). We now define the (unique) existence of marginal biases, which correspond to constants for the rows of the quantum correlation matrix (which is a generalization of the classical pay-off). An equivalent statement can be made for columns (${ d_{j}}$) by dualizing the existence of row marginals. These constants can be thought of as representing the (expected) optimum-payoff possible for a set of operator choices by one player, given that the other player’s choice is fixed. Intuitively, this can be seen as “collapsing” the quantum correlation matrix into a column, by summing over the rows (or collapsing into a row, by summing over the columns).
Lemma 12 (Existence of Marginals) For all ${m \times n}$ XOR games G, ${\exists \{c_{i} \geq 0 \hspace{1mm} | \hspace{1mm} i \in |S|\}}$, such that, if ${\{u_{i}\}_{i \in |S|}, \{v_{j}\}_{j \in |T|}}$ form an ${\epsilon}$-optimal vector strategy where ${\epsilon \leq \frac{1}{4(m+n)}}$,

$\displaystyle \|\sum_{j\in|T|}G_{ij}v_{j} - c_{i}u_{i}\| \leq \sqrt{10}(m + n)^{\frac{1}{4}}\epsilon^{\frac{1}{4}}, \forall i \ \ \ \ \ (11)$

If ${ \epsilon = 0}$ and our strategy is perfectly optimal, we recover an exact estimation of the marginal biases:

$\displaystyle \sum_{j\in|T|}G_{ij}v_{j} = c_{i}u_{i}, \forall i \ \ \ \ \ (12)$

The proof for the above lemma provided by Slofstra relies on using techniques to analyze the structure of the SDP program that pertains to quantum marginals. In particular, conducting trace analysis on SDP matrices that correspond to using the game matrix as off-diagonal elements leads us to the construction of the desired marginal biases. It is also critical to note that a dual statement allows us to recover the column biases ${ d_{j} \geq 0}$:

$\displaystyle \sum_{i\in[|S|]}G_{ij}u_{i} = d_{j}v_{j}, \forall j \ \ \ \ \ (13)$

We now move on to defining the notion of a solution algebra.
Definition 13 (Solution Algebra) A solution algebra ${\mathcal{A}}$ consists of self-adjoint (Hermitian) operators ${X_{j}}$ that satisfy the following predicates:

$\displaystyle X_{j}^{2} = \mathbb{I}, \forall 1 \leq j \leq n \ \ \ \ \ (14)$

$\displaystyle (\sum_{j\in[|T|]}G_{ij}X_{j})^{2} = (c_{i})^{2}\cdot\mathbb{I}, \forall i \ \ \ \ \ (15)$

The definition above merely enforces the property that our unknown marginal operators be Hermitian (14) and that they respect the optimal marginal biases (or payoffs) (15) we saw in (11), so that they correspond to being constructed from an optimal vector strategy. These unknown operators will be mapped to operators that are the marginal strategy corresponding to the optimal quantum strategy. This is at the heart of the main theorem we will now state: Theorem 14 (Slofstra, 2010) Given a ${m \times n}$ XOR game G (with no zero rows or columns) and a solution algebra ${\mathcal{A}}$, a collection of Linear Operators ${\{B_{j}\}}$ and density matrix ${\rho}$ are the marginal of an optimal strategy iff the map ${X_{j} \rightarrow B_{j}}$ induces a density-matrix representation of ${\mathcal{A}}$ and ${\rho}$ commutes with ${im(\mathcal{A})}$. Put simply, the theorem states that our unknown self-adjoint operators map to an optimal marginal strategy iff the density matrix (traced from the joint Hilbert-Space) commutes with all the mapped operators (${ \{B_{j}\}}$). The result we desire on the lower bound for the number of entangled bits, given a mapping from these indeterminate operators to the marginal of an optimal strategy, comes from a corollary to (14).
Corollary 15 Given a ${m \times n}$ XOR game G (with no zero rows or columns) and a solution algebra ${\mathcal{A}}$ with minimum dimension ${N}$ among non-zero representations, the strategy for minimum entanglement uses ${\log_{2}(N)}$ entangled bits. The proof for this corollary follows from the eigenspace decomposition of the joint Hilbert Space ${ H}$ in terms of ${ \rho}$, which is preserved by the action of ${ im(\mathcal{A})}$. As a result, each eigenspace decomposes into a finite sum of irreducible representations of ${ \mathcal{A}}$. The minimum entanglement is realized when there is exactly one invariant subspace (with one irreducible representation). The entanglement used by such a representation is ${ \log_{2}(\text{dim}\,H)}$.

### Proof of Theorem 20

The rest of the section is dedicated to sketching a brief (but formal) proof for Theorem (14), and then using a simple fact about the representations of a Clifford Algebra to show how Slofstra’s result subsumes Tsirelson’s bound. For this section, ${ |{\psi}\rangle}$ refers to an arbitrary state in ${ H = H_{A} \otimes H_{B}}$ (the joint Hilbert space). We can write ${ |{\psi}\rangle = \sum_{i} |{i}\rangle \lambda |{i}\rangle}$ over some basis ${ \{{i}\}}$, where ${ \lambda}$ is a linear map. Then, the partial trace of ${ \psi}$ over ${ H_{A}}$ is given by ${ \rho = \lambda\lambda^{*}}$. Let ${ \mathcal{B}_{A}, \mathcal{B}_{B}}$ denote the algebra generated by ${ A_{1},..,A_{m}}$ and ${ B_{1},..,B_{n}}$. Here, the generating elements are the observables of an optimal quantum strategy. To arrive at a proof for the theorem, we will rely on 2 additional lemmas which we will not prove but state.
Lemma 16 Given Hermitian operators ${A}$, ${B \in H_{A}, H_{B}}$,

$\displaystyle \|(A \otimes \mathbb{I} - \mathbb{I} \otimes B)|\psi\rangle\| \leq \|\lambda\overline{A} - B\lambda\|_{F} \ \ \ \ \ (16)$

This allows us to conclude that,

$\displaystyle \|(A \otimes \mathbb{I} - \mathbb{I} \otimes B)|\psi\rangle\| \leq \epsilon \implies \|\rho(B) - B\rho\|_{F} \leq 2\epsilon \ \ \ \ \ (17)$

Lemma 17 The optimal strategy in question is non-degenerate iff

$\displaystyle closure(\mathcal{B}_{B}\lambda H_{A}) = closure(\mathcal{B}_{A}\lambda^{*}H_{B}). \\ \ \ \ \ \ (18)$

As a special case:

$\displaystyle closure(\mathcal{B}_{B}\lambda H_{A}) = H_{B} \leftrightarrow closure(\rho H_{B}) = H_{B} \ \ \ \ \ (19)$

Forward direction : We use the first lemma to prove commutativity of $\rho$ with all $B_{j}$, and we use the second lemma to show that the closure of $\rho$ is ${ H_{B}}$. We first show the forward direction: Suppose we are given an optimal quantum strategy ($|{\psi}\rangle , \{A_{i}\}_{i \in S}, \{B_{j}\}_{j \in T}$) for a game ${ G}$. Then, we fix our optimal vector strategy as:

$\displaystyle u_{i} = (A_{i} \otimes \mathbb{I})|\psi\rangle \ \ \ \ \ (20)$

$\displaystyle v_{j} = (\mathbb{I} \otimes B_{j})|\psi\rangle \ \ \ \ \ (21)$

We can now use Equations (11) and (13) to establish our optimal marginal biases to write a relationship between them and ${ |{\psi}\rangle}$ , and apply Lemma (16) to show commutativity and Lemma (17) to show that ${ im(\mathcal{A})}$ = cyclic(${ B_{j}, \rho}$). Using (13) with (20) and (21), we have:

$\displaystyle d_{j}(\mathbb{I}\otimes B_{j})|\psi\rangle = \sum_{i}G_{ij}(A_{i}\otimes\mathbb{I})|\psi\rangle \ \ \ \ \ (22)$

We can now use (17) with ${ \epsilon = 0}$ on (22) to see that ${ \rho}$ commutes with every ${ B_{j}}$. Additionally, as the terms in (22) constitute linear combinations of ${ A_{i}}$ and ${ B_{j}}$, we can compute the closure of their actions on ${ \lambda H_{A}}$ and ${ \lambda^{*} H_{B}}$, which will be equivalent. Therefore, ${ im(\rho)}$ = ${ H_{B}}$, which follows from the special case of (19). For the dual case, we substitute (20) and (21) into (12):

$\displaystyle c_{i}(A_{i}\otimes\mathbb{I})|\psi\rangle = \sum_{j}G_{ij}(\mathbb{I}\otimes B_{j})|\psi\rangle$

On taking the norm of the above on both sides and using a little algebra, we finally obtain the fact that ${ \{B_{j}\}}$ satisfy predicate (15) making them the representations of ${ X_{j}}$: ${ (\sum_{j}G_{ij}B_{j})^{2} = c_{i}^{2}{\mathbb{I}} }$ This shows that the map from ${ X_{j} \rightarrow B_{j}}$ computes a density matrix representation of ${ \mathcal{A}}$, where ${ \rho}$ commutes with all ${ B_{j}}$.
Backward Direction : The proof for the backward direction is much less involved: If we knew that ${ \{B_{j}\}, \rho}$ constituted the cyclic representation of ${ \mathcal{A}}$ with commutativity (with ${ B_{j}}$), then we can use Lemma (19) to conclude that the image of ${ \rho}$ on ${ H}$ would form a subspace of ${ \lambda H}$. We define: ${ \overline{A}_{i} = \frac{\sum_{j}G_{ij}B_{j}}{c_{i}} }$ allowing us to recover our original marginal biases ${ c_{i}}$ that satisfy (23) and therefore correspond to the optimal strategy. This shows us that ${ \{A_{i}\}}$, ${ \{B_{j}\}}$ would constitute an optimal quantum strategy.
Having proved this theorem, we now obtain Corollary 15, which is the main desired result. To see how it subsumes Tsirelson’s result as a special case, we use a simple fact from Representation Theory:
Lemma 18 For a Clifford Algebra generated by ${X_{1},..,X_{r}}$, there exist one or two irreducible representations of dimension ${2^{\left \lfloor{\frac{r}{2}}\right \rfloor}}$ Plugging Lemma 18 into Corollary 15, we simply recover the fact that the number of entangled bits of a solution algebra that is Clifford is ${ \left \lfloor{\frac{r}{2}}\right \rfloor}$. However, note that being Clifford means an extra constraint: ${ X_{i}X_{j} = -X_{j}X_{i}}$, ${ \forall i, j}$ The constraints on the Solution Algebra (14), (15) given by Slofstra do \textit{not} necessarily mean that the solution is Clifford. In fact, when an optimal quantum strategy with minimal entanglement is Clifford, ${ \{A_{i}\}}$, ${ \{B_{j}\}}$ are constructed from a unique set of ${ \{u_{i}\}}$, ${ \{v_{j}\}}$. To end, we write down a lemma that shows there exist XOR games where the optimal strategy is not unique and for minimal entanglement, a solution generated by a Non-Clifford algebra must be used:

#### Lemma (Existence of XOR games with Non-Clifford optimal strategies)

There exist a family of ${ m \times n}$ XOR games ${ \{G\}}$ that correspond to generalizations of the CHSH games (${CL_{n}}$), such that, the optimal strategy of minimal entanglement is Non-Clifford.

## References

[1] David Avis, Sonoko Moriyama, and Masaki Owari. From bell inequalities to tsirelson’s theorem. IEICE Transactions, 92-A(5):1254–1267, 2009.

[2] Lance Fortnow, John Rompel, and Michael Sipser. On the power of multi-prover interactive protocols. Theoretical Computer Science, 134(2):545 – 557, 1994.

[3] T. Ito, H. Kobayashi, and K. Matsumoto. Oracularization and Two-Prover One-Round Interactive Proofs against Nonlocal Strategies. ArXiv e-prints, October 2008.

[4] J. Kempe, H. Kobayashi, K. Matsumoto, B. Toner, and T. Vidick. Entangled games are hard to approximate. ArXiv e-prints, April 2007.

[5] Julia Kempe, Oded Regev, and Ben Toner. Unique games with entangled provers are easy. SIAM Journal on Computing, 39(7):3207– 3229, 2010.

[6] S. Khanna, M. Sudan, L. Trevisan, and D. Williamson. The approximability of constraint satisfaction problems. SIAM Journal on Computing, 30(6):1863–1920, 2001.

[7] Anand Natarajan and Thomas Vidick. Two-player entangled games are NP-hard. arXiv e-prints, page arXiv:1710.03062, October 2017.

[8] Anand Natarajan and Thomas Vidick. Low-degree testing for quantum states, and a quantum entangled games PCP for QMA. arXiv e-prints, page arXiv:1801.03821, January 2018.

[9] William Slofstra. Lower bounds on the entanglement needed to play xor non-local games. CoRR, abs/1007.2248, 2010.

[10] B.S. Tsirelson. Quantum analogues of the bell inequalities. the case of two spatially separated domains. Journal of Soviet Mathematics, 36(4):557–570, 1987.

[11] Thomas Vidick. Three-player entangled XOR games are NP-hard to approximate. arXiv e-prints, page arXiv:1302.1242, February 2013.

[12] Thomas Vidick. Cs286.2 lecture 15: Tsirelson’s characterization of xor games. Online, December 2014. Lecture Notes.

[13] Thomas Vidick. Cs286.2 lecture 17: Np-hardness of computing $\omega^*(G)$. Online, December 2014. Lecture Notes.

author: Beatrice Nash

Abstract

In this blog post, we give a broad overview of quantum walks and some quantum walks-based algorithms, including traversal of the glued trees graph, search, and element distinctness [3; 7; 1]. Quantum walks can be viewed as a model for quantum computation, providing an advantage over classical and other non-quantum walks based algorithms for certain applications.

# Continuous time quantum walks

We begin our discussion of quantum walks by introducing the quantum analog of the continuous random walk. First, we review the behavior of the classical continuous random walk in order to develop the definition of the continuous quantum walk.

Take a graph $G$ with vertices $V$ and edges $E$. The adjacency matrix $A$ of $G$ is defined as follows:

$A_{i,j} = \begin{cases} 1 \quad &\text{if } (i,j) \in E \\ 0 \quad &\text{otherwise}. \end{cases}$

And the Laplacian $L$ is given by:

$L_{i,j} = \begin{cases} -\text{degree}(i) \quad &\text{if } i = j \\ 1 \quad &\text{if } (i,j) \in E \\ 0 \quad &\text{otherwise}. \end{cases}$

The Laplacian determines the behavior of the classical continuous random walk, which is described by a length $|V|$ vector of probabilities, p(t). The $i$th entry of p(t) represents the probability of being at vertex $i$ at time $t$. p(t) is given by the following differential equation:

\begin{aligned} \frac{\text{d}}{\text{dt}} \text{p}_{i}(\text{t}) = \underset{(i,j) \in E}{\sum} L_{i,j} \text{p}_{j}(\text{t}),\end{aligned}

which gives the solution $\textbf{p}(t) = e^{Lt}\textbf{p}(0)$.

Recalling the Schrödinger equation $i \frac{\text{d}}{\text{dt}} \left|\psi\right>= H \left|\psi\right>$, one can see that by inserting a factor of $i$ on the left hand side of the equation for p(t) above, the Laplacian can be treated as a Hamiltonian. One can see that the Laplacian preserves the normalization of the state of the system. Then, the solution to the differential equation:

\begin{aligned} i \frac{\text{d}}{\text{dt}} \psi_{i}(\text{t}) = \underset{(i,j) \in E}{\sum} L_{i,j} \psi_{j}(\text{t})\end{aligned},

which is $\left|\psi(t)\right> = e^{-iLt} \left|\psi(0)\right>$, determines the behavior of the quantum analog of the continuous random walk defined previously. A general quantum walk does not necessarily have to be defined by the Laplacian; it can be defined by any operator which “respects the structure of the graph,” that is, only allows transitions to between neighboring vertices in the graph or remain stationary [7]. To get a sense of how the behavior of the quantum walk differs from the classical one, we first discuss the example of the continuous time quantum walk on the line, before moving on to the discrete case.

## Continuous time quantum walk on the line

An important example of the continuous time quantum walk is that defined on the infinite line. The eigenstates of the Laplacian operator for the graph representing the infinite line are the momentum states with eigenvalues $2(\text{cos}(p) - 1)$, for $p$ in range $[-\pi,\pi]$. This can be seen by representing the momentum states in terms of the position states and applying the Laplacian operator:

\begin{aligned} \left|p\right> &= \underset{x}{\sum} e^{ipx} \left|x\right> \\ L\left|p\right> &= \underset{x}{\sum} e^{ipx} \left|x+1\right>+ e^{ipx} \left|x-1\right> - 2e^{ipx} \left|x\right> \\ &= \underset{x}{\sum} (e^{ip} + e^{-ip} - 2) e^{ipx} \left|x\right> \\ &= 2(\text{cos}(p) - 1) \left|p\right>.\end{aligned}

Hence the probability distribution at time $t$, $p(x,t) = |\left< x\right| e^{-iLt} \left|\psi(0)\right> | ^{2}$, with initial position $\left|\psi(0)\right> = \left|0\right>$ is given by:

\begin{aligned} |\left< x\right| e^{-iLt} \left|0\right> | ^{2} &= \bigg| \frac{1}{2\pi} \int_{-\pi}^{\pi} e^{-2it(\text{cos}p - 1)} \left< x|p\right> \text{d}p \bigg|^{2} \\ &= \bigg| \frac{1}{2\pi} \int_{-\pi}^{\pi} e^{-2it(\text{cos}p - 1)} e^{ipx} \text{d}p \bigg|^{2} \\ &= | J_{x}(2t) |^{2}.\end{aligned}

While the probability distribution for the classical continuous time
random walk on the same graph approaches, for large $t$, $\frac{1}{\sqrt{4\pi t}} e^{\frac{-x^{2}}{4t}}$, or a Gaussian of width $2\sqrt{t}$. One can see that the quantum walk has its largest peaks at the extrema, with oscillations in between that decrease in amplitude as one approaches the starting position at $x=0$. This is due to the destructive interference between states of different phases that does not occur in the classical case. The probability distribution of the classical walk, on the other hand, has no oscillations and instead a single peak centered at $x=0$, which widens and flattens as $t$ increases.

## Walk on the glued trees graph

A glued tree is a graph obtained by taking two binary trees of equal height and connecting each of the leaves of one of the trees to exactly two leaves of the other tree so that each node that was a leaf in one of the original trees now has degree exactly $3$. An example of such a graph is shown in Figure 2.

The time for the quantum walk on this graph to reach the right root from the left one is exponentially faster than in the classical case. Consider the classical random walk on this graph. While in the left tree, the probability of transitioning to a node in the level one to the right is twice that of transitioning to a node in the level one to the left. However, while in the right tree, the opposite is true. Therefore, one can see that in the middle of the graph, the walk will get lost, as, locally, there is no way to determine which node is part of which tree. It will instead get stuck in the cycles of identical nodes and will have exponentially small probability of reaching the right node.

To construct a continuous time quantum walk on this graph, we consider the graph in terms of columns. One can visualize the columns of Figure 2 as consisting of all the nodes equidistant from the entrance and exit nodes. If each tree is height $n$, then we label the columns $0,1,\text{...},2n,2n+1$, where column $i$ contains the nodes with shortest path of length $i$ from the leftmost root node. We describe the state of each column as a superposition of the states of each node in that column. The number of nodes in column $i$, $N_{i}$, will be $2^{i}$ for $i \in [0,n]$ and $2^{2n+1-i}$ for $i \in [n+1,2n+1]$. Then, we can define the state $\left|\text{col} \; i\right>$ as:

\begin{aligned} \left|\text{col} \; i\right> = \frac{1}{\sqrt{N_{i}}} \underset{j \in \text{col} \; i}{\sum} \left|j\right>.\end{aligned}

The factor of $\frac{1}{\sqrt{N_{i}}}$latex ensures that the state is normalized. Since the adjacency matrix $A$ of the glued tree is Hermitian, then we can treat $A$ as the Hamiltonian of the system determining the behavior of the quantum walk. By acting on this state with the adjacency matrix operator $A$, we get the result (for $i \in [1,n-1]$):

\begin{aligned} A\left|\text{col} \; i\right> &= 2\frac{\sqrt{N_{i-1}}}{\sqrt{N_{i}}} \left|\text{col} \; i-1\right> + \frac{\sqrt{N_{i+1}}}{\sqrt{N_{i}}} \left|\text{col} \; i+1\right> \\ &= \sqrt{2} \left|\text{col} \; i-1\right> + \sqrt{2} \left|\text{col} \; i+1\right>.\end{aligned}

Then for $i \in [n+2,2n]$, we get the same result, because of symmetry.

For $i = n$:

\begin{aligned} A\left|\text{col} \; n\right> = \sqrt{2} \left|\text{col} \;n-1\right> + 2 \left|\text{col} \; n\right>.\end{aligned}

The case of $i = n+1$ is symmetric. One can see that the walk on this graph is equivalent to the quantum walk on the finite line with nodes corresponding to the columns. All of the edges, excluding that between columns $n$ and $n+1$, have weight $\sqrt{2}$. The edge between column $n$ and $n+1$ has weight $2$.

The probability distribution of the quantum walk on this line can be roughly approximated using the infinite line. In the case of the infinite line, the probability distribution can be seen as a wave propagating with speed linear in the time $t$. Thus, in time linear in $n$, the probability that the state is measured at distance $2n+1$ from the starting state is $\frac{1}{\text{poly} \; n}$. In [3] it is shown that the fact that the line is finite and has a single differently weighted edge from the others (that between $n$ and $n+1$) does not change the fact that in polynomial time, the quantum walk will travel from the left root node to the right one, although in this case there is no limiting distribution as the peaks oscillate. This was the first result that gives an exponential speed up over the classical case using quantum walks.

# Discrete time quantum walks

In this section, we will first give an introduction to the discrete quantum walk, including the discrete quantum walk on the line and the Markov chain quantum walk, as defined in [7]. Next, we discuss how Grover search can be viewed as a quantum walk algorithm, which leads us into Ambainis’s quantum-walks based algorithm from [1] for the element distinctness problem, which gives a speed up over classical and other quantum non-walks based algorithms.

The discrete time quantum walk is defined by two operators: the coin flip operator, and the shift operator. The coin flip operator $C$ determines the direction of the walk, while the shift operator $S$ makes the transition to the new state conditioned on the result of the coin flip. The Hilbert space governing the walk is $\mathcal{H} = \mathcal{H}{C} \otimes \mathcal{H}{S}$, where $\mathcal{H}{C}$ corresponds to the space associated with the result of the coin flip operator, and $\mathcal{H}{S}$ corresponds to the locations in the graph on which the walk is defined.

For example, consider the discrete time walk on the infinite line. Since there are two possible directions (left or right), then the Hilbert space associated with the coin flip operator is two dimensional. In the unbiased case, the coin flip is the Hadamard operator,

\begin{aligned} H = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix},\end{aligned}

and shift operator that produces the transition from state $\left|j\right>$ to $\left|j+1\right>$ or $\left|j-1\right>$,
conditioned on the result of the coin flip, is $S = \left|0\right>\left< 0\right| \otimes \underset{j}{\sum} \left|j+1\right> \left< j\right| + \left|1\right>\left< 1\right| \otimes \underset{j}{\sum} \left|j - 1\right> \left< j\right|$.

Each step of the walk is determined by an application of the unitary
operator $U = S \cdot (H \otimes I)$. If the walk starts at position
$\left|x\right>$, then measuring the state after one application of $U$ gives $\left|x+1\right>$ with probability $\frac{1}{2}$ and $\left|x-1\right>$ with probability $\frac{1}{2}$. This is exactly the same as the case of the classical random walk on the infinite line; the difference between the two walks becomes apparent after a few steps.

For example, the result of the walk starting at state $\left|\psi(0)\right> = \left|0\right>\left|0\right>$ after 4 steps gives:

\begin{aligned} \left|\psi(1)\right> &= \frac{1}{\sqrt{2}}\left( \left|0\right>\left|1\right> + \left|1\right>\left|-1\right> \right) \\ \left|\psi(2)\right> &= \frac{1}{2}\left( \left|0\right>\left|2\right> + \left|1\right>\left|0\right> + \left|0\right>\left|0\right> - \left|1\right>\left|-2\right> \right) \\ \left|\psi(3)\right> &= \frac{1}{2\sqrt{2}}\left(\left|0\right>\left|3\right> + \left|1\right>\left|1\right> + 2\left|0\right>\left|1\right> -\left|0\right>\left|-1\right> + \left|1\right>\left|-3\right> \right) \\ \left|\psi(4)\right> &= \frac{1}{4} (\left|0\right>\left|4\right> + \left|1\right>\left|2\right> + 3\left|0\right>\left|2\right> + \left|1\right>\left|0\right> -\left|0\right>\left|0\right> -\left|1\right>\left|-2\right> +\left|0\right>\left|-2\right>-\left|1\right>\left|-4\right>).\end{aligned}

One can see that the distribution is becoming increasingly skewed
towards the right, while in the classical case the distribution will be
symmetric around the starting position. This is due to the destructive
interference discussed earlier. The distribution after $t = 20$ time
steps is shown in Figure 4.

Now, consider the walk starting at state $\left|\psi(0)\right> = -\left|1\right>\left|0\right>$:

\begin{aligned}\left|\psi(1)\right> &= \frac{1}{\sqrt{2}}\left( -\left|0\right>\left|1\right> + \left|1\right>\left|-1\right> \right) \\ \left|\psi(2)\right> &= \frac{1}{2}\left( -\left|0\right>\left|2\right> - \left|1\right>\left|0\right> + \left|0\right>\left|0\right> - \left|1\right>\left|-2\right> \right) \\ \left|\psi(3)\right> &= \frac{1}{2\sqrt{2}}\left(-\left|0\right>\left|3\right> - \left|1\right>\left|1\right> + 2\left|1\right>\left|-1\right> - \left|0\right>\left|-1\right> + \left|1\right>\left|-3\right> \right) \\ \left|\psi(4)\right> &= \frac{1}{4} (-\left|0\right>\left|4\right> -\left|1\right>\left|2\right> - \left|0\right>\left|2\right> + \left|1\right>\left|0\right> + \left|0\right>\left|0\right> -3\left|1\right>\left|-2\right> + \left|0\right>\left|-2\right> - \left|1\right>\left|-4\right>).\end{aligned}

This distribution given by this walk is the mirror image of the first.
To generate a symmetric distribution, consider the start state $\left|\psi(0)\right> = \frac{1}{\sqrt{2}}(\left|0\right> -i\left|1\right>)\left|0\right>$. The resulting distribution after $t$ steps will be $p(x,t) = \frac{1}{2} p_{0}(x,t) + \frac{1}{2} p_{1}(x,t)$, where $p_{0}(x,t)$ is the probability distribution after $t$ steps resulting from the start state $\psi(0) = \left|0\right>\left|0\right>$ and $p_{1}(x,t)$ is the probability distribution after $t$ steps resulting from the start state $\psi(0) = -\left|1\right>\left|0\right>$. The result will be symmetric, with peaks near the extrema, as we saw in the continuous case.

## Markov chain quantum walk

A reversible, ergodic Markov chain with $n$ states can be represented by a $n \times n$ transition matrix $P$ with $P_{j,i}$ equal to the probability of transitioning from state $i$ to state $j$ and $P = P^{*}$. Then, $p_{0}P$, where $p_{0}$ is an initial probability distribution over the states, gives the distribution after one step.
Since $\sum_{j} P_{i,j} = 1$ for all $i$, $P$ is stochastic and thus preserves normalization.

There are multiple ways to define a discrete quantum walk, depending on the properties of the transition matrix and the graph on which it is defined (overview provided in [4]). Here we look at the quantum walk on a Markov chain as given in [2]. For the quantum walk on this graph, we define state $\left|i\right>\left|j\right>$ as the state that represents currently being at position $j$ and facing in the direction of $i$. Then, we define the state $\left|\psi_{j}\right>$ as a superposition of the states associated with position $j$:

\begin{aligned} \left|\psi_{i}\right> = \underset{j}{\sum} \sqrt{P_{j,i}} \left|i\right>\left|j\right>.\end{aligned}

The unitary operator,

$D = 2 \underset{i}{\sum} \left|\psi_{i}\right>\left< \psi_{i}\right| - I$,

acts as a coin flip for the walk on this graph. Since $P$ is reversible, we can let the shift operator be the unitary $SWAP$ operator:

$SWAP = \underset{i,j}{\sum} \left|i,j\right>\left< j,i\right|$.

A quantum walk can also be defined for a non-reversible Markov chain using a pair of reflection operators (the coin flip operator is an example of a reflection operator). This corresponds to the construction given in [7].

## Search as a quantum walk algorithm

Given a black box function $f$ and a set of inputs $S$ with $|S| = N$, say we want to find whether an input $x \in S$ exists for which $f(x)$ equals some output value. We refer to the set of inputs $M$ for which this is true as marked. Classically, this requires $O(N/|M|)$ queries, for nonempty $M$. Using the Grover search algorithm, this problem requires $O(\sqrt{N/|M|})$ quantum queries. In this section, we give a quantum walks based algorithm that also solves this problem in $O(\sqrt{N/|M|})$ time. If we define a doubly stochastic matrix $P$ with uniform transitions, then we can construct a new transition matrix $P'$ from $P$ as:

$P_{i,j}' = \begin{cases} \frac{1}{N-1} \quad &\text{if } i \neq j \text{ and } i \notin M \\0 \quad &\text{if } i = j \text{ and } i \notin M \\ 1 \quad &\text{if } i = j \text{ and } i \in M \\ 0 \quad &\text{if } i \neq j \text{ and } i \in M. \end{cases}$

Then, when the state of the first register is unmarked, the operator $D$ defined in the previous section acts as a diffusion over its neighbors. When the state in the first register is marked, then $D$ will act as the operator $-I$, and the walk stops, as a marked state has been reached. This requires two queries to the black box function: one to check whether the input is marked, and then another to uncompute. By rearranging the order of the columns in $P$ so that the columns corresponding to the non-marked elements come before the columns corresponding to the marked elements, we get:

\begin{aligned} P' = \begin{pmatrix} P_{0} & 0 \\ P_{1} & I \end{pmatrix},\end{aligned}

where $P_{0}$ gives the transitions between non-marked elements and $P_{1}$ gives the transitions from non-marked to marked elements.

We now look at the hitting time of the classical random walk. Assume
that there is zero probability of starting at a marked vertex. Then, we
can write the starting distribution $p$, where the last $|M|$ elements of $p$, corresponding to the marked elements, are zero, as
$p = \underset{\lambda}{\sum} \alpha_{\lambda} \left|\lambda\right>$, where $\lambda$ are the eigenvalues of $P_{0}$, and $\left|\lambda\right>$ are the corresponding eigenvectors, with the last $|M|$ entries zero. Let $\lambda^{}$ be the principal (largest) eigenvalue. Then, the probability that, after $t$ steps, a marked element has not yet been reached will be $\sum (P_{0}^{t}p)_{i} \leq \lambda^{*t}$. Then, the
probability that a marked element has been reached in that time will be
$\geq 1 - \lambda^{t} \geq 1 - t \lambda^{*}$. Setting
$t = \frac{1}{1-\lambda^{*}}$ gives probability $\Omega(1)$ that a marked element will be reached in that time.

The eigenvalues of $P_{0}$ will be $\frac{N-|M|-1}{N-1}$ and
$\frac{-1}{N-|M|-1}$. Then, the classical hitting time will be:

\begin{aligned} t &= \frac{1}{1-\lambda^{*}} \\ &= \frac{1}{1-\frac{N-|M|-1}{N-1}} \\ &= O\left(\frac{N}{|M|}\right).\end{aligned}

It can be showed that for a walk defined by a Markov chain, the
classical hitting time will be $O(\frac{1}{\delta \epsilon})$, where $\delta = 1 - \lambda^{*}$, the spectral gap, and $\epsilon \leq \frac{|M|}{N}$ [2].

Magniez et al proved in [6] that for a reversible, ergodic
Markov chain, the quantum hitting time for a walk on this chain is
within a factor of the square root of the classical hitting time. Since
the walk on this input acts as a walk on a reversible Markov chain until
a marked element is reached, then this is also true for a walk defined
by our transition matrix $P'$. This arises from the fact that the
spectral gap of the matrix describing the quantum walk corresponding to
stochastic matrix $P$ is quadratically larger than the spectral gap of
the matrix describing the classical random walk corresponding to $P$, the proof of which is given in [2]. Thus, the quantum hitting time
is $O(\sqrt{N/|M|})$, which exactly matches the quantum query complexity of Grover search.

## Element distinctness problem

Now, we describe Ambainis’s algorithm given in [1] for solving
the element distinctness problem in $O(N^{\frac{2}{3}})$ time, which
produces a speed up over the classical algorithm, which requires $O(N)$ queries, and also over other known quantum algorithms that do not make use of quantum walks, which require $O(N^{\frac{3}{4}})$ queries. The element distinctness problem is defined as follows: given a function $f$ on a size $N$ set of inputs

$S=\{x_{1}$,…,$x_{N}\}$,

determine whether there exists a pair $x_{1},\; x_{2} \in S$ for which $f(x_{1}) = f(x_{2})$. As in the search problem defined in the previous section, this is a decision problem; we are not concerned with finding the values of these pairs, only whether at least one exists.

The algorithm is similar to the search algorithm described in the previous section, except we define the walk on a Hamming graph. A Hamming graph $H(N,m)$ is defined as follows: each vertex $i$ corresponds to an $m$-tuple, ($i_{1}$,…,$i_{m}$), where $i_{k} \in S$ for all $k$ and repetition is allowed (that is, $i_{k}$ may equal $i_{j}$ for $k \neq j$), and $m$ is a parameter we will choose. Edges will exist between vertices that differ in exactly one coordinate (order matters in this graph). We describe the state of each vertex as:

$\left|i \right>=| i_{1},i_{2}$,…,$i_{m},f(i_{1})$,…,$f(i_{m})>$

Then, moving along each edge that replaces the $j$th coordinate with $x_{k}$ such that $i_{j} \neq x_{k}$ requires two queries to the black box function to erase $f(i_{j})$ and compute $f(x_{k})$. In the case, the marked vertices will be those that contain some $f(i_{k}) = f(i_{j})$ for $i_{j} \neq i_{k}$. Since the function values are stored in the description of the state, then no additional queries to the black box are required to check if in a marked state.

The transition matrix is given by $P = \frac{1}{m(n-1)} \underset{i \in [1,m]}{\sum} (J - I)^{(i)}$. $J$ is the $n \times n$ all one matrix, and the superscript $i$ denotes the operator acting on the $i$th coordinate. The factor of $\frac{1}{m(n-1)}$ normalizes the degree, since the graph is regular. We can compute the spectral gap of this graph to be $\frac{n}{m(n-1)}$ (for details of this computation, see [2]). Then, noting that that the fraction of marked vertices, $\epsilon$, is
$\geq \frac{m(m-1)(n-2)^{m-2}}{n^{m}}$, classically, the query complexity is $m + O(\frac{1}{\delta \epsilon}) = m + O(\frac{n^{2}}{m})$, where $m$ is the queries required to construct the initial state. Setting the parameters equal to minimize with respect to $m$ gives classical query complexity $O(N)$, as expected.

Then in the quantum case, $m$ queries are still required to set up the state. $O(\frac{n}{\sqrt{m}})$ queries are required to perform the walk until a marked state is reached, by [6]. Setting parameters equal gives $O(N^{\frac{2}{3}})$ queries, as desired.

[1] Ambainis, A. Quantum walk algorithm for element distinctness, SIAM Journal on Computing 37(1):210-239 (2007). arXiv:quant-ph/0311001

[2] Childs, A. Lecture Notes on Quantum Algorithms (2017). https://www.cs.umd.edu/ amchilds/qa/qa.pdf

[3] Childs, A., Farhi, E. Gutmann, S. An example of the difference between
quantum and classical random walks. Journal of Quantum Information
Processing, 1:35, 2002. Also quant-ph/0103020.

[4] Godsil, C., Hanmeng, Z. Discrete-Time Quantum Walks and Graph Structures
(2018). arXiv:1701.04474

[5] Kempe, J. Quantum random walks: an introductory overview, Contemporary
Physics, Vol. 44 (4) (2003) 307:327. arXiv:quant-ph/0303081

[6] Magniez, F., Nayak, A., Richter, P.C. et al. On the hitting times of
quantum versus random walks, Algorithmica (2012) 63:91.
https://doi.org/10.1007/s00453-011-9521-6

[7] Szegedy, M. Quantum Speed-up of Markov Chain Based Algorithms, 45th
Annual IEEE Symposium on Foundations of Computer Science (2004).
https://ieeexplore.ieee.org/abstract/document/1366222

[8] Portugal, R. Quantum Walks and Search Algorithms. Springer, New York, NY (2013).

By Abhijit Mudigonda, Richard Wang, and Lisa Yang

This is part of a series of blog posts for CS 229r: Physics and Computation. In this post, we will talk about progress made towards resolving the quantum PCP conjecture. We’ll briefly talk about the progression from the quantum PCP conjecture to the NLTS conjecture to the NLETS theorem, and then settle on providing a proof of the NLETS theorem. This new proof, due to Nirkhe, Vazirani, and Yuen, makes it clear that the Hamiltonian family used to resolve the NLETS theorem cannot help us in resolving the NLTS conjecture.

## Introduction

We are all too familiar with NP problems. Consider now an upgrade to NP problems, where an omniscient prover (we’ll call this prover Merlin) can send a polynomial-sized proof to a BPP (bounded-error probabilistic polynomial-time) verifier (and we’ll call this verifier Arthur). Now, we have more decision problems in another complexity class, MA (Merlin-Arthur). Consider again, the analogue in the quantum realm where now the prover sends over qubits instead and the verifier is in BQP (bounded-error quantum polynomial-time). And now we have QMA (quantum Merlin-Arthur).

We can show that there is a hierarchy to these classes, where NP $\subseteq$ MA $\subseteq$ QMA.

Our goal is to talk about progress towards a quantum PCP theorem (and since nobody has proved it in the positive or negative, we’ll refer to it as a quantum PCP conjecture for now), so it might be a good idea to first talk about the PCP theorem. Suppose we take a Boolean formula, and we want to verify that it is satisfiable. Then someone comes along and presents us with a certificate — in this case, a satisfying assignment — and we can check in polynomial time that either this is indeed a satisfying assignment to the formula (a correct certificate) or it is not (an incorrect certificate).

But this requires that we check the entire certificate that is presented to us. Now, in comes the PCP Theorem (for probabilistically checkable proofs), which tells us that a certificate can be presented to us such that we can read a constant number of bits from the certificate, and have two things guaranteed: one, if this certificate is correct, then we will never think that it is incorrect even if we are not reading the entire certificate, and two, if we are presented with an incorrect certificate, we will reject it with high probability [1].

In short, one formulation of the PCP theorem tells us that, puzzingly, we might not need to read the entirety of a proof in order to be convinced with high probability that it is a good proof or a bad proof. But a natural question arises, which is to ask: is there a quantum analogue of the PCP theorem?

## Progress

The answer is, we’re still not sure. But to make progress towards resolving this question, we will present the work of Nirkhe, Vazirani, and Yuen in providing an alternate proof of an earlier result of Eldar and Harrow on the NLETS theorem.

Before we state the quantum PCP conjecture, it would be helpful to review information about local Hamiltonians and the $k$-local Hamiltonian problem. A previous blog post by Ben Edelman covers these topics. Now, let’s state the quantum PCP conjecture:

(Quantum PCP Conjecture): It is QMA-hard to decide whether a given local Hamiltonian $H = H_{1} + ... + H_{m}$ (where each $||H_{i}|| \leq 1$) has ground state energy at most $a$ or at least $b$ when $b-a \geq c||H||$ for some universal constant $c > 0$.

Recall that MAX-$k$-SAT being NP-hard corresponds to the $k$-local Hamiltonian problem being QMA-hard when $b-a \geq 1/poly(n)$. (We can refer to Theorem 4.1 in these scribed notes of Ryan O’Donnell’s lecture, and more specifically to Kempe-Kitaev-Regev’s original paper for proof of this fact.) The quantum PCP conjecture asks if this is still the case when the gap is $c||H||$.

Going back to the PCP theorem, an implication of the PCP theorem is that it is NP-hard to approximate certain problems to within some factor. Just like its classical analogue, the qPCP conjecture can be seen as stating that it is QMA-hard to approximate the ground state energy to a factor better than $c||H||$.

### Reformulation: NLTS conjecture

Let’s make the observation that, taking $a$ to be the ground state energy, the qPCP conjecture sort of says that there exists a family of Hamiltonians for which there is no trivial state (a state generated by a low depth circuit) such that the energy is at most $c||H||$ above the ground state energy.

Freedman and Hastings came up with an easier goal called the No Low-Energy Trivial States conjecture, or NLTS conjecture. We expect that ground states of local Hamiltonians are sufficiently hard to describe (if NP $\neq$ QMA). So low-energy states might not be generated by a quantum circuit of constant depth. More formally:

(NLTS Conjecture): There exists a universal constant $\epsilon > 0$ and a family of local Hamiltonians $\{H^{(n)}\}_{n=1}^{\infty}$ where $H^{(n)}$ acts on $n$ particles and consists of $m_{n}$ local terms, s.t. any family of states $\{|\psi_{n}\rangle\}$ satisfying $\langle \psi_{n} | H^{(n)} | \psi_{n}\rangle \leq \epsilon||H^{(n)}|| + \lambda_{min}(H^{(n)})$ requires circuit depth that grows faster than any constant.

To reiterate, if we did have such a family of NLTS Hamiltonians, then it we wouldn’t be able to give “easy proofs” for the minimal energy of a Hamiltonian, because we couldn’t just give a small circuit which produced a low energy state.

## Progress: NLETS theorem

$\epsilon$-error states are states that differ from the ground state in at most $\epsilon n$ qubits. Now, consider $\epsilon$-error states (which “agree” with the ground state on most qubits). Then for bounded-degree local Hamiltonians (analogously in the classical case, those where each variable participates in a bounded number of clauses), these states are also low energy. So any theorem which applies to low energy states (such as the NLTS conjecture), should also apply to states with $\epsilon$-error (as in the NLETS theorem).

To define low-error states more formally:

Definition 2.1 ($\epsilon$-error states): Let $\rho, \sigma \in D((\mathbb{C}^{d})^{\otimes n})$ (the space of positive semidefinite operators of trace norm equal to 1 on $(\mathbb{C}^{d})^{\otimes n}$). Let $H$ be a local Hamiltonian acting on $(\mathbb{C}^{d})^{\otimes n}$. Then:

• $\sigma$ is an $\epsilon$-error state of $\rho$ if $\exists S \subseteq [n]$ of size at most $\epsilon n$ s.t. $\text{Tr}_{S}(\rho) = \text{Tr}_{S}(\sigma)$.
• $\sigma$ is an $\epsilon$-error state for $H$ if $\exists \rho$ s.t. $\text{Tr}(H\rho) = \lambda_{min}(H)$ and $\sigma$ is an $\epsilon$-error state for $\rho$.

Here, see that $\text{Tr}_{S}$ is just the partial trace on some subset of integers $S$, like we’re tracing out or “disregarding” some subset of $n$ qubits.

In 2017, Eldar and Harrow showed the following result which is the NLETS theorem.

Theorem 1 (NLETS Theorem): There exists a family of 16-local Hamiltonians $\{H^{(n)}\}$ s.t. any family of $\epsilon$-error states $\{|\Phi_{n}\rangle\}$ for $\{H^{(n)}\}$ requires circuit depth $\Omega(\log n)$ where $\epsilon = 10^{-9}$.

In the next two sections, we will provide background for an alternate proof of the NLETS theorem due to Nirkhe, Vazirani, and Yuen. After this, we will explain why the proof of NLETS cannot be used to prove NLTS, since the local Hamiltonian family we construct for NLETS can be linearized. Nirkhe, Vazirani, and Yuen’s proof of NLETS makes use of the Feynman-Kitaev clock Hamiltonian corresponding to the circuit generating the cat state (Eldar and Harrow make use of the Tillich-Zemor hypergraph product construction; refer to section 8 of their paper). What is this circuit? It is this one:

First, we apply the Hadamard gate (drawn as $\boxed{H}$) which maps the first qubit $|0\rangle \rightarrow \frac{|0\rangle + |1\rangle}{\sqrt{2}}$. Then we can think of the CNOT gates (drawn as $\bullet-\oplus$) as propagating whatever happens to the first qubit to the rest of the qubits. If we had the first qubit mapping to 0, then the rest of the qubits map to 0, and likewise for 1. This generates the cat state $|\textsf{CAT}_{n}\rangle = \frac{|0\rangle^{\otimes n} + |1\rangle^{\otimes n}}{\sqrt{2}}$, which is highly entangled.

Why do we want a highly entangled state? Roughly our intuition for using the cat state is this: if the ground state of a Hamiltonian is highly entangled, then any quantum circuit which generates it has non-trivial depth. So if our goal is to show the existence of local Hamiltonians which have low energy or low error states that need deep circuits to generate, it makes sense to use a highly entangled state like the cat state.

## Quantum circuits

(We’ll write that the state of a qudit – a generalization of a qubit to more than two dimensions, and in this case $q$ dimensions – is a vector in $\mathbb{C}^{q}$. In our diagram above, we’ll see 4 qudits, labelled appropriately.)

Let’s briefly cover the definitions for the quantum circuits we’ll be using.

Let $U$ be a unitary operator acting on a system of $n$ qudits (in other words, acting on $(\mathbb{C}^{q})^{\otimes n}$), where $U = U_{m} \hdots U_{1}$. Here, each $U_{i}$ is a unitary operator (a gate) acting on at most two qudits, and $U$ is a product of $m$ such operators.

If there exists a partition $U$ into products of non-overlapping two-qudit unitaries (we call these layers and denote them as $L_{i} = \bigotimes_{j}U_{ij}$, where each $U_{j}$ here is in layer $L_{i}$) such that $U = L_{d} \hdots L_{1}$ then we say $U$ has $d$ layers.

In other words, $U$ has size $m$ and circuit depth $d$.

### Lightcones, effect zones, shadow zones

Consider $U = L_{d} \hdots L_{1}$ and $A$ an operator.

For $j < d$ define $K^{(j)}$ as the gates in layer $j$ whose supports overlap that of any gate in $K^{(j+1)}$, …, $K^{(d)}$ or with $A$.

Definition 3.1 (lightcone): The lightcone of $A$ with respect to $U$ is the union of $K^{(j)}$: $K_{U} \triangleq \bigcup_{j} K^{(j)}$.

So we can think of the lightcone as the set of gates spreading out of $A$ all the way to the first layer of the circuit. In our diagram, the lightcone of $A$ is the dash-dotted region. We have $K^{(3)} = \varnothing$, $K^{(2)} = \{U_{21}\}$, and $K^{(1)} = \{U_{11}, U_{12}\}$.

We also want a definition for what comes back from the lightcone: the set of gates from the first layer (the widest part of the cone) back to the last layer.

Define $E^{(1)} = K^{(1)}$. For $j \geq 2$, let $E^{(j)}$ be the set of gates whose supports overlap with any gate in $E^{(j-1)}$.

Definition 3.2 (effect zone): The effect zone of $A$ with respect to $U$ is the union $E_{U}(A) \triangleq \bigcup_{j} E^{(j)}$.

In our diagram, see that $E^{(1)} = \{U_{11}, U_{12}\}$, $E^{(2)} = \{U_{21}\}$, and $E^{(3)} = \{U_{31}\}$. The effect zone of $A$ is the dotted region.

Definition 3.3 (shadow of the effect zone): The shadow of the effect zone $W_{U}(A)$ of $A$ with respect to $U$ is the set of qudits acted on by the gates in the effect zone.

In our diagram, the first three qudits are effected by gates in the effect zone. So $W_{U}(A) = \{1, 2, 3\}$.

Given all of these definitions, we make the following claim which will be important later, in a proof of a generalization of NLETS.

Claim 3.1 (Disjoint lightcones): Let $U$ be a circuit and $A, B$ operators. If the qudits $B$ acts on are disjoint from $W_{U}(A)$, then the lightcones of $A$ and $B$ in $U$ are disjoint.

## Toward the Feynman-Kitaev clock

Now we’ll give some definitions that will become necessary when we make use of the Feynman-Kitaev Hamiltonian in our later proofs.

Let’s define a unary clock. It will basically help us determine whatever happened at any time little $t$ along the total time big $T$. Let $|\textsf{unary}(t, T)\rangle = |0\rangle^{\otimes(T-t)} \otimes |1\rangle^{\otimes t}$. For our purposes today, we won’t worry about higher dimensional clocks. So we’ll write $|\textsf{clock}_{k}(t, T)\rangle$, but we’ll really only consider the case where $k = 1$, which corresponds to $|\textsf{unary}(t, T)\rangle$. For simplicity’s sake, we will henceforth just write $|\textsf{unary}(t)\rangle$.

Our goal is to construct something a little similar to the tableaux in the Cook-Levin theorem, so we also want to define a history state:

Definition 4.1 (History state): Let $C$ be a quantum circuit that acts on a witness register and an ancilla register. Let $C_{1}, ..., C_{T}$ denote the sequence of two-local gates in $C$. Then for all $k \in \mathbb{N}$, a state $|\Psi\rangle$ is a $k$-dimensional history state of $C$ if:

\begin{aligned}|\Psi\rangle = \frac{1}{\sqrt{T+1}}\sum_{t=0}^{T}|\textsf{clock}_{k}(t, T)\rangle \otimes |\psi_{t}\rangle\end{aligned}

where we have the clock state to keep track of time and $\psi_{t}$ is some state such that $|\psi_{t}\rangle = C_{t}|\psi_{t-1}\rangle$ and $|\psi_{0}\rangle = |\xi\rangle_{witness} \otimes |0\rangle_{ancilla}$. With this construction, we should be able to make a measurement to get back the state at time $t$.

## Proof of NLETS

We provide a proof of (a simplified case of) the NLETS theorem proved by Nirkhe, Vazirani, and Yuen in [2].

Theorem 2 (NLETS): There exists a family of $3$-local Hamiltonians $\{H^{(n)}\}$ on a line (Each Hamiltonian $H^{(n)}$ can be defined on $n$ particles arranged on a line such that each local Hamiltonian acts on a particle and its two neighbors) such that for all $n \in \mathbb{N}$, the circuit depth of any $\epsilon$-error ground state for $H^{(n)}$ is at least logarithmic in $n$.

First, we’ll show the circuit lower bound. Then we’ll explain why these Hamiltonians can act on particles on a line and what this implies about the potential of these techniques for proving NLTS.

Proof: We will use the Feynman-Kitaev clock construction to construct a $5$-local Hamiltonian $\mathcal{H}^{(n)}$ for the circuit $C_n$: $|0^n\rangle \to |\textsf{CAT}_n\rangle$.

Fix $n \in \mathbb{N}$ and let $C_n$ have size $T$. The Hamiltonian $\mathcal{H}$ acts on $T+n$ qubits and consists of several local terms depending on $C_n$:

\begin{aligned}\mathcal{H} = H_{in} + \sum_{t=1}^T H_t + H_{out} + H_{stab}\end{aligned}

We can think of a $T+n$ qubit state as representing a $T$ step computation on $n$ qubits (i.e. for each time $t \in [0,T]$, we have a $n$ bit computation state $\textsf{state}_t$ of $C_n$). Intuitively, a $T+n$ qubit state has energy $0$ with respect to $\mathcal{H}$ iff it is the history state of $C_n$. This is because $H_{in}$ checks that at time $t=0$, $\textsf{state}_0$ consists of the input $|0\rangle^n$ to $C_n$. Each $H_t$ checks that $\textsf{state}_{t}$ proceed correctly from $\textsf{state}_{t-1}$ (i.e. that the $t$th gate of $C_n$ is applied correctly). Then $H_{out}$ checks that at time $t=T$, the output is $1$. Finally, $H_{stab}$ checks that the $T+n$ qubit state is a superposition only over states where the first $T$ qubits represent “correct times” (i.e. a unary clock state where time $t$ is represented by $T-t$ zeros followed by $t$ ones).

Therefore, $\mathcal{H}$ has a unique ground state, the history state of $C_n|0^n\rangle$, with energy $0$:

\begin{aligned}|\Psi\rangle = \frac{1}{\sqrt{n+1}}\sum_{t=0}^n |\textsf{unary}(t)\rangle\otimes |\psi_t\rangle = \frac{1}{\sqrt{n+1}}\sum_{t=0}^n |\textsf{unary}(t)\rangle\otimes|\textsf{CAT}_{t}\rangle\otimes |0\rangle^{\otimes (n-t)}\end{aligned}

Later we will show how to transform $\mathcal{H}$ into a Hamiltonian $H$ on $n$ qutrits on a line. Intuitively, the structure of $C_n$ allows us to fuse the $T=n$ time qubits and $n$ state qubits and represent unused state qubits by $2$. For the Hamiltonian $H$, the ground state becomes

\begin{aligned}|\Psi\rangle = \frac{1}{\sqrt{n+1}}\sum_{t=0}^n |\psi_t\rangle = \sum_{t=0}^n |\textsf{CAT}_{t}\rangle\otimes|2\rangle^{\otimes(n-t)}\end{aligned}

For the rest of this proof, we work with respect to $H$.

Let $\sigma$ be an $\epsilon$-error state and let $S \subseteq [n]$ be the subset of qutrits such that $\text{Tr}_S(\sigma) = \text{Tr}_S(|\Psi\rangle\langle\Psi|)$. We define two projection operators which, when applied to $\sigma$ alone, produce nontrivial measurements, but when applied to $\sigma$ together, produce trivial measurements.

Definition 5.1: For any $i\in[n]$, the projection operator

\begin{aligned}A_i = |0\rangle\langle 0|_i \end{aligned}

projects onto the subspace spanned by $0$ on the $i$th qutrit.

For any $j\in[n]$, the projection operator

\begin{aligned} B_j = |1\rangle\langle 1|_i\end{aligned}

projects onto the subspace spanned by $1$ on the $j$th qutrit.

Claim 5.1: For $i\not\in S$, $\text{Tr}(A_i\sigma) = \frac{1}{2} + \frac{-i}{2(n+1)}$. For $j\not\in S$, $\text{Tr}(B_j\sigma) = \frac{1}{2} + \frac{-j}{2(n+1)}$. Note that these values are positive for any $i,j\in [n]$.

Proof: If $i \not\in S$, then measurements on the $i$th qutrit are the same for $\sigma$ and $|\Psi\rangle\langle\Psi|$.

\begin{aligned} \text{Tr}(A_i\sigma) &= \text{Tr}(A_i|\Psi\rangle\langle\Psi|)\\ &= \text{Tr}\left(A_i \frac{1}{n+1}\sum_{t,t'}|\psi_t\rangle\langle\psi_{t'}|\right)\\ &= \frac{1}{n+1}\sum_{t,t'} \text{Tr}(A_i|\psi_t\rangle\langle\psi_{t'}|) \end{aligned}

If $t=t'$, then any $n$ qutrit pure state cannot have nonzero weight in both $\psi_t$ and $\psi_{t'}$ (every pure state ends in some number of $2$s which tells which $\psi_t$ (if any) it can be a part of). Therefore,

\begin{aligned} \text{Tr}(A_i\sigma) = \frac{1}{n+1}\sum_{t} \text{Tr}(A_i|\psi_t\rangle\langle\psi_{t}|) = \frac{1}{n+1}\sum_t \langle \psi_t|A_i|\psi_t\rangle \enspace. \end{aligned}

If $i \le t$, then projecting onto the $i$th qutrit gives $0$ with probability $1/2$. Therefore, $\text{Tr}(A_i\sigma) = \frac{1}{n+1}\left(\frac{n-i+1}{2}\right) = \frac{1}{2} + \frac{-i}{2(n+1)}$.

Similarly, $\text{Tr}(B_j\sigma) = \frac{1}{2} + \frac{-j}{2(n+1)}$. $\square$

Claim 5.2: For $i,j \not\in S$ such that $i < j$, $\text{Tr}(A_i \otimes B_j \sigma) = 0$.

Proof: As before, we can calculate

\begin{aligned} \text{Tr}(A_i \otimes B_j \sigma) &= \text{Tr}(A_i \otimes B_j |\Psi\rangle \langle\Psi|) = \frac{1}{n+1}\sum_t \langle\psi_t|A_i\otimes B_j|\psi_t\rangle \end{aligned}

If $j > t$, then the $j$th qutrit of $\psi_t$ is $2$ so $B_j|\psi_t\rangle = 0$. If $j \le t$, then $A_i \otimes B_j|\psi_t\rangle = 0$ because the first $t$ qutrits of $|\psi_t\rangle$ contain the $|\textsf{CAT}_{t}\rangle$ state so under any measurement, the $i$ and $j$th qutrits must be the same. $\square$

Now we use these claims to prove a circuit lower bound. Let $U$ be a circuit generating (a state with density matrix) $\sigma$. Let $d$ be the depth of $U$.

Consider some $i\not\in S$. For any operator acting on the $i$th qutrit, its lightcone consists of at most $2^d$ gates so its effect zone consists of at most $2^{2d}$ gates which act on at most $2^{2d+1}$ qudits (called the shadow of the effect zone).

Assume towards contradiction that $2^{2d+1} < n-\epsilon n$. Then the shadow of any operator acting only on the $i$th qutrit has size at most $2^{2d+1} \le n - |S|$ since $|S| \le \epsilon n$. So there is some $j$ outside of the shadow which is in the complement of $S$. By Claim 3.1, we have found two indices $i,j$ such that any pair of operators acting on $i$ and $j$ have disjoint lightcones in $U$. WLOG let $i < j$. The lightcones of $A_i,B_j$ are disjoint which implies \begin{aligned}\text{Tr}(A_i \otimes B_j \sigma) = \text{Tr}(A_i \sigma)\cdot\text{Tr}(B_j \sigma).\end{aligned}

By the two claims above, we get a contradiction.

Therefore, $2^{2d+1} \ge n-\epsilon n$. We can take any constant epsilon: letting $\epsilon = 1/2$, we get

\begin{aligned}d \ge \frac{1}{2}\left(\log \frac{n}{2} - 1\right)\end{aligned}

$\square$

This analysis relies crucially on the fact that any $\epsilon$-error state matches the groundstate on most qudits. However, NLTS is concerned with states which may differ from the groundstate on many qudits, as long as they have low energy.

Remark 2.1: The paper of Nirkhe, Vazirani, and Yuen [2] actually proves more:

• A more general lower bound: logarithmic lower bound on the circuit depth of any $\delta$-approximate ($\delta$ far in L1 norm) $\epsilon$-noisy state (probability distribution over $\epsilon$-error states).
• Assuming QCMA $\neq$ QMA (QCMA takes a $m$ bit witness string instead of a $m$ qubit state as witness), they show a superpolynomial lower bound (on the circuit depth of any $\delta$-approximate $\epsilon$-noisy state).
• “Approximate qLWC codes”, using techniques from their superpolynomial lower bound.

## Back to NLTS – Tempering our Optimism

So far, we’ve shown a local Hamiltonian family for which all low-error (in “Hamming distance”) states require logarithmic quantum circuit depth to compute, thus resolving the NLETS conjecture. Now, let’s try to tie this back into the NLTS conjecture. Since it’s been a while, let’s recall the statement of the conjecture:

Conjecture (NLTS): There exists a universal constant $\epsilon > 0$ and a family of local Hamiltonians $\{H^{(n)}\}_{n=1}^{\infty}$ where $H^{(n)}$ acts on $n$ particles and consists of $m_{n}$ local terms, s.t. any family of states $\{|\psi_{n}\rangle\}$ satisfying $\langle \psi_{n} | H^{(n)} | \psi_{n}\rangle \leq \epsilon||H^{(n)}|| + \lambda_{min}(H^{(n)})$ requires circuit depth that grows faster than any constant.

In order to resolve the NLTS conjecture, it thus suffices to exhibit a local Hamiltonian family for which all low-energy states require logarithmic quantum circuit depth to compute. We might wonder if the local Hamiltonian family we used to resolve NLETS, which has “hard ground states”, might also have hard low-energy states. Unfortunately, as we shall show, this cannot be the case. We will start by showing that Hamiltonian families that lie on constant-dimensional lattices (in a sense that we will make precise momentarily) cannot possibly be used to resolve NLTS, and then show that the Hamiltonian family we used to prove NLTS can be linearized (made to lie on a one-dimensional lattice!).

### The Woes of Constant-Dimensional Lattices

Definition 6.1: A local Hamiltonian $H$ acting on $n$ qubits is said to lie on a graph $G$ if there is an injection of qubits into vertices of the graph such that the set of qubits in any interaction term correspond to a connected component in the graph.

Theorem 2: If $(H^{(n)})$ is a local Hamiltonian family that lies on an $O(1)$-dimensional lattice, then $(H^{(n)})$ has a family of low-energy states with low circuit complexity. In particular, if $H$ is a local Hamiltonian on a $d$-dimensional lattice acting on $n$ qubits for large enough $n$, then for any $\epsilon$, there exists a state $|\psi\rangle$ that can be generated by a circuit of constant depth and such that $\langle \psi | H | \psi \rangle \leq H_0 + \epsilon ||H||$ where $H_0$ is the ground-state energy.

Proof: In what follows, we’ll omit some of the more annoying computational details in the interest of communicating the high-level idea.

Start by partitioning the $d$-dimensional lattice (the one that $H^(n)$ lives on) into hypercubes of side length $L$. We can “restrict” $H^{(n)}$ to a given hypercube (let’s call it $\rho$) by throwing away all local terms containing a qubit not in $\rho$. This gives us a well-defined Hamiltonian $H_{\rho}$ on the qubits in $\rho$. Define $|\phi_{\rho}\rangle$ to be the $L^d$-qubit ground state of $H_{\rho}$, and define

\begin{aligned}|\phi\rangle := \bigotimes_{\text{hypercubes } \rho} |\phi_{\rho}\rangle\end{aligned}

where $|\phi\rangle$ is an $n$-qubit state. Each $|\phi_{\rho}\rangle$ can be generated by a circuit with at most $2^{L^d}$ gates, hence at most $2^{L^d}$ depth. Then, $|\phi\rangle$ can be generated by putting all of these individual circuits in parallel – this doesn’t violate any sort of no-cloning condition because the individual circuits act on disjoint sets of qubits. Therefore, $|\phi\rangle$ can be generated by a circuit of depth at most $2^{L^d}$. $L$ and $d$ are both constants, so $|\phi\rangle$ can be generated by a constant-depth circuit.

We claim that, for the right choice of $L$, $|\phi\rangle$ is also a low-energy state. Intuitively, this is true because $\phi$ can only be “worse” than a true ground state of $H^{(n)}$ on local Hamiltonian terms that do not lie entirely within a single hypercube (i.e. the boundary terms), and by choosing $L$ appropriately we can make this a vanishingly small fraction of the local terms of $H^{(n)}$. Let’s work this out explicitly.

Each hypercube has surface area $2dL^{d-1}$, and there are $n/L^d$ hypercubes in the lattice. Thus, the total number of qubits on boundaries is at most $2d\frac{n}{L}$. The number of size $k$-connected components containing a given point in a $d$-dimensional lattice is a function of $k$ and $d$. Both of these are constants. Therefore, the number of size $k$-connected components containing a given vertex, and hence the number of local Hamiltonian terms containing a given qubit, is constant. Thus, the total number of violated local Hamiltonian terms is at most $O(\frac{n}{L})$. Taking $L$ to be $\frac{1}{\epsilon}$, we get the desired bound. Note that to be fully rigorous, we need to justify that the boundary terms don’t blow up the energy, but this is left as an exercise for the reader. $\square$

### Linearizing the Hamiltonian

Now that we have shown that Hamiltonians that live on constant-dimensional lattices cannot be used to prove NLTS, we will put the final nail in the coffin by showing that our NLETS Hamiltonian (the Feynman-Kitaev clock Hamiltonian $\mathcal{H}$ on the circuit $C$) can be made to lie on a line (a $1$-dimensional lattice). To do so, we will need to understand the details of $\mathcal{H}$ a bit better.

Proposition 6.1: $\mathcal{H}$ for the circuit $C$ is $5$-local.

Proof: Recall that we defined

\begin{aligned}\mathcal{H} := H_{in} + \sum_{t=1}^T H_t+ H_{stab}\end{aligned}

Let’s go through the right-hand-side term-by-term. We will use $|\mathsf{time}(t)\rangle$ to denote the $t^{\text{th}}$ qubit of the time register and $|\mathsf{state}(s)\rangle$ to denote the $s^{\text{th}}$ qubit of the state register.

• $H_{in}$ needs to serially access the qubit pairs \begin{aligned}|\mathsf{time}(0)\rangle\otimes\textsf{state}(s) \end{aligned} for all $s$ and ensure that they are all set to $|0\rangle$. Thus, $H_{in}$ is $2$-local.
• Each $H_t$ term needs to access the states $|\textsf{time}(t-1)\rangle, |\textsf{time}(t)\rangle, |\textsf{time}(t+1)\rangle, |\textsf{state}(s)\rangle$, and $|\textsf{state}(t)\rangle$ and ensure that the state transitions are correct. Thus, $H_{t}$ is $5$-local.
• $H_{stab}$ needs to access the states \begin{aligned}|\textsf{time}(t)\rangle \otimes |\textsf{time}(t+1)\rangle \end{aligned} and ensure that the progression of the time register is correct. Thus, $H_{stab}$ is $2$-local. $\square$

Now, we follow an approach of [3] to embed $\mathcal{H}$ into a line.

Theorem 3: The Feynman-Kitaev clock Hamiltonian $H$ can be manipulated into a $3$-local Hamiltonian acting on qutrits on a line.

Proof: Rather than having $\mathcal{H}$ act on $2n$ total qubits ($n$ time qubits and $n$ state qubits), let’s fuse each $|\textsf{time}(i)\rangle$ and $|\textsf{state}(i)\rangle$ pair into a single qudit of dimension $4$. If we view $\mathcal{H}$ as acting on the space of particles $|\textsf{time}(i)\rangle \otimes |\textsf{state}(i)\rangle$, we observe that, following Proposition 6.1, each local term needs to check at most the particles corresponding to times $t-1$, $t$, and $t+1$. Therefore, $\mathcal{H}$ is $3$-local and on a line, as desired.

To see that we can have $\mathcal{H}$ act on particles of dimension $3$ (qutrits) rather than particles of dimension $4$, note that the degree of freedom corresponding to $|\textsf{time}(t)\rangle \otimes |\textsf{state}(t)\rangle = |0\rangle \otimes |1\rangle$ is unused, as the $t^{\text{th}}$ qubit of the state is never nonzero until timestamp $t$. Thus, we can take the vectors

\begin{aligned} |0\rangle := |1\rangle\otimes|0\rangle, |1\rangle := |1\rangle\otimes|1\rangle, |2\rangle := |0\rangle\otimes|0\rangle \end{aligned}

as a basis for each qutrit.

Even though we’ve shown that the clock Hamiltonian for our original circuit cannot be used to prove NLTS (which is still weaker than the original Quantum PCP conjecture) this does not necessarily rule out the use of this approach for other “hard” circuits which might then allow us to prove NLTS. Furthermore, NLETS is independently interesting, as the notion of being low “Hamming distance” away from vectors is exactly what is used in error-correcting codes.

# References

• [1] Sanjeev Arora and Boaz Barak. Computational complexity: a modern approach. Cambridge University Press, 2009.
• [2] Chinmay Nirkhe, Umesh Vazirani, and Henry Yuen. Approximate low-weight check codes and circuit lower bounds for noisy ground states. arXiv preprint arXiv:1802.07419, 2018.
• [3] Dorit Aharonov, Wim van Dam, Julia Kempe, Zeph Landau, Seth Lloyd, and Oded Regev. Adiabatic quantum computation is equivalent to standard quantum computation. SIAM J. Comput., 2007.

## Motivation

Quantum computers have demonstrated great potential for solving certain problems more efficiently than their classical counterpart. Algorithms based on the quantum Fourier transform (QFT) such as Shor’s algorithm offer an exponential speed-up, while amplitude-amplification algorithms such as Grover’s search algorithm provide us with a polynomial speedup. The concept of “quantum supremacy” (quantum computers outperforming classical computers) has been explored for three general groups of problems:

1. Structured problems, such as factoring and discrete logarithm. Out quantum computer takes advantage of the structure of these classes of problems to offer an exponential speedup compared to the best known classical alternative. While these speedups are the most promising, they require a large number of resources and are cannot be feasibly implemented in the near future.
2. Quantum Simulations, originally proposed by Richard Feynman in the late 80s was thought to be the first motivation behind exploring quantum computation. Due to the fact that the space of all possible states of the system scales exponentially with the addition of a new element (eg. an atom), complex systems are very difficult to simulate classically. It has been shown that we can use a quantum computer to tackle interesting problems in quantum chemistry and chemical engineering. Furthermore, there are results on sampling the output of random quantum circuits which have been used for “quantum supremacy experiments”.
3. General constraint satisfaction and optimization problems. Since these problems are NP-hard it is widely believed that we cannot gain an exponential speedup using a quantum computer, however, we can obtain quadratic speedup but utilizing a variation of Grover’s algorithm.

While these quantum algorithms are very exciting, they are beyond the capabilities of our near-term quantum computers; for example, any useful application of Shor’s factoring algorithm requires anywhere between tens of thousands to millions of qubits with error correction compared to quantum devices with hundreds of qubits that we might have available in the next few years.

Recently there has been increasing interest in hybrid classical-quantum algorithms among the community. The general idea behind this approach is to supplement the noisy intermediate-scale quantum (NISQ) devices with classical computers. In this blog post, we discuss the Quantum Approximate Optimization Algorithm (QAOA), which is a hybrid algorithm, alongside some of its applications.

## Introduction

QAOA is used for optimizing combinatorial problems. Let’s assume a problem with $n$ bits and $m$ clauses. Each clause is a constraint on a subset of the bits which satisfies a certain assignment. We can define a cost function as follows:

$C(z)=\sum_{\alpha=1}^m C_\alpha (z)$

where $z=z_1z_2...z_n$ is the bit string. In this article we consider a minimization problem, therefore we want $C_\alpha(z)=0$ if $z$ satisfies clause $\alpha$ and 1 otherwise. Note that in the case of a maximization problem we only need to switch the value assigned to a satisfactory clause to 1. Our objective is to find a (qu)bit string that minimizes (or maximizes) our cost function.

At a higher level, we start with a quantum state in a uniform superposition of all possible inputs $z$. This can be accomplished with $n$ qubits which span a space of size $2^n$. Our goal is to come up with a series of operations that would evolve our initial quantum state into a superposition of states in which the valid solutions would have a significantly higher probability than other states. In manner, upon sampling the quantum state we are likely to get the correct solution with high probability. QAOA uses the cost function to construct a set of operations that would be able to efficiently map the unifrom superposition state into the desired quantum state. These operators involve single qubits rotations around the x-axis, and multiqubit rotations around the z-axis of our qubits.

Now let’s discuss the details of QAOA. For this algorithm we assume that our quantum computer works in the computation basis of $\left |0 \right > , \left | 1 \right >$. We start by setting our initial state to a uniform superposition over computational basis states:

$\left |s \right > = \frac{1}{\sqrt{2^n}}\sum_{z \in \{0,1\}^n} \left |z \right >$

Next, we define a unitary operator using the cost function as follows:

$U(\hat{C},\gamma) = e^{i\gamma \hat{C}}= \prod_{\alpha = 1}^m e^{-i\gamma \hat{C}_\alpha}$

Here we convert every clause $C_\alpha$ to a Hamiltonian $\hat{C_\alpha}$ consisting of Pauli Z ($\sigma^z$) operators. Just as a review, the two Pauli operators (X and Z) used in this blog post are representated as follows:

$\sigma^x = \begin{pmatrix} 0 & 1 \\ 1 & 0 \\\end{pmatrix} \: \: \: \: \sigma^z = \begin{pmatrix} 1 & 0 \\ 0 & -1 \\\end{pmatrix}$

For example if $C_\alpha=x \oplus y$ we can map the clause to $\hat{C_\alpha}=\frac{1}{2}(1+\sigma^z_x \sigma^z_y)$ for a minimization problem. If $x=\left |0 \right >$ , then $\sigma^z_x$ will return a value of 1, and if $x=\left |1 \right >$ the operator will return -1. The same applies to qubit $y$ as well. Therefore it is not hard to see that if $x$ and $y$ have the same value, then the operator $\hat{C_\alpha}$ as defined above will result in a 1, and it’ll result in 0 otherwise. Furthermore, since $\hat{C}$ has integer eigenvalues we can restrict the angle $\gamma$ to lie in $[0,2\pi]$.

Next, we define the admixing Hamiltonian:

$B=\sum_{j=1}^n \sigma^x_j$

and use it to define a unitary operator which consists of a product of commuting one qubit operations:

$U(B,\beta) = e^{-i\beta B}= \prod_{j=1}^n e^{-i \beta \sigma_j^x}$

where $\beta \in [0,\pi]$. It’s easy to see that $U(\Hat{C},\gamma)$ couples 2 or more qubits, while $U(B,\beta)$ performs a single qubit rotation on the qubits in our system. Using these unitaries and our initial state we define a QAOA angle-dependent “ansatz” state as follows:

$\left | \boldsymbol{\gamma},\boldsymbol{\beta} \right >= U(B,\beta_p)U(\Hat{C},\gamma_p)...U(B,\beta_1)U(\Hat{C},\gamma_1) \left |s \right >$

Here $p\geq 1$ is the “depth” of our QAOA circuit, and $\boldsymbol{\gamma}=(\gamma_p,...,\gamma_1)$, $\boldsymbol{\beta}=(\beta_p,...,\beta_1)$ are each a vector of length $p$ controlling the angles for each layer. In the worst case scenario this state can be produce by a quantum circuit of depth $mp+p$, however by taking advantage of the structure of the instance we can further reduce the number of layers required. Let $F_p$ be the expectation of $\hat{C}$ in our ansatz:

$F_p(\boldsymbol{\gamma},\boldsymbol{\beta})=\left < \boldsymbol{\gamma},\boldsymbol{\beta} \right | \hat{C} \left | \boldsymbol{\gamma},\boldsymbol{\beta} \right >$

and let $M_p$ be the minimum of $F_p$ over angles,

$M_p=\min_{\boldsymbol{\gamma},\boldsymbol{\beta}} F_p(\boldsymbol{\gamma},\boldsymbol{\beta}).$

Note that minimization at $p-1$ layers can be viewed as a constrained minimization at $p$ layers, therefore

$M_p \leq M_{p-1}$

Using an adiabatic approach [1] We can show that

$\lim_{p \rightarrow \infty} M_p = \min_z C(z)$

Based on these results our QAOA algorithm will look like the following:

•  c: pick a $p$
• c: choose a set of angles $(\Vec{\gamma}_0,\Vec{\beta}_0)$
• q: prepare $\left | \Vec{\gamma},\Vec{\beta} \right >$
• q: compute $F_p$
• c: perform gradient descend/ascend on $F_p$ and get a new set of angles $(\Vec{\gamma},\Vec{\beta})$
• repeat from step 3 till convergence
• report the measurement result of $\left | \Vec{\gamma},\Vec{\beta} \right >$ in computational basis

If $p$ does not asymptotically grow with $n$ $F_p(\Vec{\gamma},\Vec{\beta})$ can be efficiently computed in $O(m^2+mn)$

## Application: MaxCut

In this section we apply the QAOA algorithm to the MaxCut problem with bounded degree. MaxCut is an NP-hard problem that asks for a subset $S$ of the vertex set such that the number of edges between $S$ and the complementary subset is as large as possible. While QAOA does not offer a theoretical guarantee to solve MaxCut in polynomial time, it offers a path to utilizing NISQ devices for tackling such optimization problems and discuss patterns in such problems that can be used for reducing the number of steps required.

For this section, let’s assume $p=O(1)$, and we have a graph with $n$ vertices and an edge set $\{\}$ of size $m$. We can construct a cost function to be maximized as follows:

$C = \sum_{} C_{}$

$C_{} = \frac{1}{2} (1-\sigma^z_j \sigma^z_k)$

We can the compute the angle dependent cost of our ansatz as follows:

$F_p(\Vec{\gamma},\Vec{\beta})=\sum_{}\left <{s} \right | U^\dagger(C,\gamma_1)...U^\dagger(B,\beta_p) C_{}U(B,\beta_p) ... U(C,\gamma_1) \left |s \right >$

Let’s consider the operation associated with some edge $$:

$U ^\dagger(C,\gamma_1)...U^\dagger(B,\beta_p) C_{}U(B,\beta_p) ... U(C,\gamma_1)$

Since QAOA consists of local operations, we may take advantage by thinking about the problem in terms of subproblems (or subgraphs) involving certain nodes. This property will allow us to simplify our clauses even further depending on the desired depth $p$ of our quantum circuit, therefore decreasing the amount of resources necessary to implement the algorithm.

The operator $C_{}$ includes qubits (nodes) $j$ and $k$, therefore the sequence of operators above will only involve qubits that are at most distance $p$ away from qubits $j$ and $k$. Let’s consider the example of $p=1$:

$\rightarrow U^\dagger(C,\gamma_1)e^{i\beta_1(\sigma^x_j + \sigma^x_k)} C_{} e^{-i\beta_1(\sigma^x_j + \sigma^x_k)} U(C,\gamma_1).$

It’s easy to see that any factor of $U(C,\gamma_1)$ that does not depend on $j$ or $k$ will commute through and cancel out. Since the degree is bounded, each subgraph contains a number of qubits that is independent of $n$, which allows for the evaluation of $F_p$ in terms of subsystems of size independent of $n$.

For an subgraph $G$ define:

$C_G=\sum_{} C_{} \: \: \: \: U(C_G,\gamma)=e^{-i \gamma C_G}$

$B_G = \sum_{j \in G} \sigma^x_j \: \: \: \: U(B_G,\beta) = e^{-i \beta B_G}$

$\left | s,G \right > = \prod_{l \in G} \left |+ \right > _l$

We can define our total cost as a sum over the cost of each subgraph:

$f_g(\Vec{\gamma},\Vec{\beta})=\left < s,g(j,k) \right | U ^\dagger(C_{g(j,k)},\gamma_1)...U^\dagger(B_{g(j,k)},\beta_p) C_{}U(B_{g(j,k)},\beta_p) ... U(C_{g(j,k)},\gamma_1) \left |s,g(j,k) \right >$

where $g(j,k)$ is a subgraph of type $g$ and “…” is used to omit the sequence of angle depending unitaries constructed using the elements of $\Vec{\gamma}$ and $\Vec{\beta}$. $F_p$ is then

$F_p(\Vec{\gamma},\Vec{\beta})=\sum_g w_g f_g(\Vec{\gamma},\Vec{\beta})$

where $w_g$ is the number of occurrence of the subgraph $g$ in the original edge sum. The function $f_g$ does not depend on $n$ and $m$, and the only dependence on these variables comes through the weights $w_g$ from the original graph. The maximum number of qubits that can appear in our sequence of operators comes when the subgraph is a tree. For a graph with maximum degree $v$, the number of qubits in this tree is

$q_{tree}=2[\frac{(v-1)^{p+1}-1}{(v-1)-1}]$

(or $2p+2$ if $v=2$), which is independent of $n$ and $m$. Therefore we can see that for constant $p$ $F_p$ can be efficiently computed.

Next, let’s consider the spread of C measured in the state $\left | \Vec{\gamma},\Vec{\beta} \right >$.

$\left <\Vec{\gamma},\Vec{\beta} \right | C^2\left |\Vec{\gamma},\Vec{\beta}\right > -\left < \Vec{\gamma},\Vec{\beta} \right | C \left | \Vec{\gamma},\Vec{\beta} \right > ^2 \leq 2[\frac{(v-1)^{2p+2}-1}{(v-1)-1}].m$

For fixed $v$ and $p$ we see that the standard deviation of $C(z)$ is upper-bounded by $O(\sqrt{m})$. Using this fact and the appropriate probability bounds we can see that the result of measuring the cost function of the state $\left | \vec{\gamma_{opt}},vec{\beta_{opt}} \right >$ will be very close to the intended value of $F_p(\vec{\gamma_{opt}},\vec{\beta_{opt}})$ which bounds the uncertainty present in quantum measurement.

## Bibliography

[1] E. Farhi, J. Goldstone, and S. Gutmann, “A Quantum Approximate Optimization Algorithm,” 2014.

[2] J. S. Otterbach, et. al, “Unsupervised Machine Learning on a Hybrid Quantum Computer,” 2017.

tags:

by Fred Zhang

This is the second installment of a three-part series of posts on quantum Hamiltonian complexity based on lectures given by the authors in Boaz and Tselil’s seminar. For the basic definitions of local Hamiltonians, see Ben’s first post. Also check out Boriana and Prayaag’s followup note on area laws.

This post introduces tensor networks and matrix product states (MPS). These are useful linear-algebraic objects for describing quantum states of low entanglement.

We then discuss how to efficiently compute the ground states of the Hamiltonians of ${1}$D quantum systems (using classical computers). The density matrix renormalization group (DMRG), due to White (1992, 1993), is arguably the most successful heuristic for this problem. We describe it in the language of tensor networks and MPS.

1. Introduction

We are interested in computing the ground state—the minimum eigenvector—of a quantum Hamiltonian, a $2^n \times 2^n$ complex matrix that governs the evolution of a quantum system of $n$ qubits. We restrict our attention to the local Hamiltonian, where the matrix is a sum of Hamiltonians each acting only on $k$ qubits.  In the previous article, we discussed some hardness results. Namely, a local Hamiltonian can be used to encode SAT instances, and we further gave a proof that computing the ground state is QMA-Complete.

Despite the hardness results, physicists have come up with a variety of heuristics for solving this problem. If quantum interactions occur locally, we would hope that its ground state has low entanglement and thus admits a succinct classical representation. Further, we hope to find such a representation efficiently, using classical computers.

In this note, we will see tensor networks and matrix product states that formalize the idea of succinctly representing quantum states of low entanglement. As a side remark for the theoretical computer scientists here, one motivation to study tensor network is that it provides a powerful visual tool for thinking about linear algebra. It turns indices into edges in a graph and summations over indices into contractions of edges. In particular, we will soon see that the most useful inequality in TCS and mathematics can be drawn as a cute tensor network.

Guess what this is?

In the end, we will discuss the density matrix renormalization group (DMRG), which has established itself as “the most powerful tool for treating 1D quantum systems” over the last decade [FSW07]. For many 1D systems that arise from practice, the heuristic efficiently finds an (approximate) ground state in its matrix product state, specified only by a small number of parameters.

2. Tensor Networks

Now let us discuss our first subject, tensor networks. If you have not seen tensors before, it is a generalization of matrices. In computer scientists’ language, a matrix is a two-dimensional array, and a tensor is a multi-dimensional array. In other words, if we think of a matrix as a square, then a 3 dimensional tensor looks like a cube. Formally, a (complex) n dimensional tensor ${T}$ maps $n$ indices to complex values, namely, to its entries:

$\displaystyle T : [d_1] \times [d_2] \times \cdots \times [d_n] \rightarrow \mathbb{C}.$

The simplest tensor network is a graphical notation for a tensor. For an ${n}$-dimensional tensor ${T}$, we draw a star graph and label the center as ${T}$ and the edges as the indices. To evaluate this tensor network, we put values on the edges, i.e., indices, and then the tensor network would spit out its entry specified by the indices.

A simple tensor network of 4 dimensions

Evaluating a simple tensor network, ${T(1,5,3,1)=1/\sqrt{2}}$. The numbers are chosen arbitrarily.

Notice that the degree of the center is the number of indices. Hence, a tensor network of degree ${1}$ is a vector, and that of degree ${2}$ is a matrix, and so forth.

A vector

A matrix

A 3d tensor

How is this related to quantum information? For the sake of genearlity we will deal with qudits in ${\mathbb{C}^d}$, instead of qubits in ${\mathbb{C}^2}$. Now recall that a quantum state ${|\psi_n\rangle}$ of ${n}$ qudits can be encoded as an ${n}$ dimensional tensor. It can be written as

$|\psi_n\rangle = \displaystyle\sum_{i_1,\cdots,i_n = 0}^{d-1} T(i_1,\cdots, i_n) |i_1,\cdots, i_n \rangle.$

It is easy to see that all the information, namely, the amplitudes, is just the tensor $T$. In the later sections, we will see more powerful examples of using tensor networks to represent a quantum state.

So far our discussion is focused merely on these little pictures. The power of tensor networks come from its composition rules, which allow us to join two simple tensor networks together and impose rich internal structures.

2.1. Composition Rules

We introduce two ways of joining two simple tensor networks. Roughly speaking, they correspond to multiplication and summation, and I will give the definitions by showing examples, instead of stating them in the full formalism

Rule #1: Tensor Product. The product rule allows us to put two tensor networks together and view them as a whole. The resulting tensor is the tensor product of the two if we think of them as vectors. More concretely, consider the following picture.

This is viewed as a single tensor network ${T}$ of 7 edges .

The definition of this joint tensor ${T}$ is

$T(i_1,i_2,\cdots, i_7) = T_1(i_1,i_2,i_3,i_4) T_2(i_5,i_6,i_7).$

Rule #2: Edge Contractions. At this moment, we can only make up disconnected tensor networks. Edge contractions allow us to link two tensor networks. Suppose we have two ${3}$ dimensional tensor networks. Contracting two edges, one from each, gives us a tensor network of ${4}$ free edges. This now corresponds a tensor of ${4}$ dimensions.

Two 3d tensors

Join two tensor networks by contracting an edge

We name the contracted edge as ${k}$. The definition of ${T}$ is

$\displaystyle T(i_1,i_2,j_1,j_2) =\sum_k T_1(i_1,i_2, k) T_2(j_1,j_2, k).$

2.2. Useful Examples

Before we move on, let’s take some examples. Keep in mind that the degree of the vertex determines the number of indices (dimensions of this tensor).

vector inner product ${\langle u,v \rangle}$

Matrix inner product

Here, one needs to remember that an edge between two tensor nodes is a summation over the index corresponding to the edge. For example, in the vector inner product picture, ${\langle u,v\rangle = \sum_i u_i \cdot v_i}$, where edge is labeled as ${i}$. Now you would realize that this picture

is the famous

$\langle u,v \rangle^2 \leq \|u\|^2 \|v\|^2. \quad\quad (\text{Cauchy-Schwarz inequality})$

For us, the most important building block is matrix multiplication. Let ${H=MN}$. By definition

$H(i,j) = \sum_k M(i,k) N(k, j).$

This is precisely encoded in the picture below.

Matrix multiplication, ${MN}$.

We are ready to talk about matrix product states. In the language of tensor network, a matrix product state is the following picture.

A matrix product state.

As the degrees indicate, the two boundary vertices ${A_1,A_n}$ represent matrices and the internal vertices represent ${3}$-dimensional tensors. We can view each matrix as a set of (column) vectors and each ${3}$-dimensional tensor as a stack of matrices. Then each one of the free edges picks out a vector or a matrix, and the contracted edges multiply them together which gives out a scalar. If this confused you, move on to the next section. I will introduce the formal definition of matrix product states, and you will see that it is just the picture above.

3. Matrix Product States

Before giving the definition, let’s talk about how matrix product state (MPS) naturally arises from the study of quantum states with low entanglement. Matrix product state can be viewed as a generalization of product state—(pure) quantum state with no entanglement. Let’s consider a simple product state ${|\psi\rangle}$ of ${2}$ qubits. It can be factorized:

$\displaystyle |\psi\rangle = \left(\sum_{i=0}^1 \alpha_1^i\ |i\rangle \right)\left(\sum_{j=0}^1 \alpha_2^j \ |j\rangle\right)\nonumber = \sum_{i,j=0}^1 \alpha_1^i \alpha_2^j\ |ij\rangle \ \ \ \ \ (1)$

This state is described by ${4}$ complex scalars ${\left\{\alpha_1^0,\alpha_1^1,\alpha_2^0,\alpha_2^1\right\}}$, and there is nothing quantum about it. However, if the state has entanglement among its qubits, then we know that it is impossible to be factorized and thereby written as (1). MPS generalizes the form of (1) by replacing the scalars with matrices and vectors.

More formally, a matrix product state starts with the following setup. For an ${n}$-qudit system, we associate

• a qudit in ${\{1,n\}}$ with ${d}$ vectors ${\left\{A_1^{j_1}\right\}, \left\{A_n^{j_n}\right\} \in \mathbb{R}^D}$; and
• a qudit ${i}$ in ${\{2,3,\cdots, n-1\}}$ with ${d}$ matrices ${\left\{A_i^{j_i}\right\}\in \mathbb{R}^{D\times D}}$.

Here, ${j_i}$ range from ${0}$ to ${d-1}$, and ${D}$ is called bond dimension. One can think of the set of vectors as a ${D}$ by ${d}$ matrix and the set of matrices as a ${d}$ by ${D}$ by ${D}$ three-dimensional tensor. Then let them correspond to the vertices in MPS picture. With this setup, a quantum state is in matrix product state if it can be written as

$|\psi\rangle = \sum_{j_1,\cdots,j_n=1}^n A_1^{j_1} A_2^{j_2} \cdots A_n^{j_n} |j_1 j_2\cdots j_n\rangle.$

It is important to keep in mind that ${A_1^{j_1},A_n^{j_n}}$ are two vectors, and the other inner terms are matrices, and we get a scalar from the product. Thus, this represents the tensor ${T(j_1,j_2,\cdots, j_n) = A_1^{j_1} A_2^{j_2} \cdots A_n^{j_n}}$.

Now back to the picture,

MPS

notice that each amplitude ${ A_1^{j_1} A_2^{j_2} \cdots A_n^{j_n}}$ from the equation above is an output of the tensor in the picture, where the free edges take values ${j_1, j_2 ,\cdots, j_n}$. Also, as discussed earlier, the contracted edges in MPS tensor network correspond to matrix and vector multiplications, so the tensor ${T}$ is precisely represented by the picture.

The complexity of the MPS is closely related to the bond dimension ${D}$. In particular, the number of parameters in this model is ${O(ndD^2)}$. We would expect that with higher ${D}$, we may describe quantum states of more entanglement. In other words, the representation power of an MPS increases with ${D}$. In principle, one can represent any quantum state as an MPS; however, ${D}$ can be exponentially large. See, e.g., Section 4.1.3 of~\cite{schollwock2011density} for a proof. On the other extreme, the product state example shows that if ${D=1}$, one can represent and only represent unentangled states. To summarize, here is the picture you should keep in mind.

Representation power of MPS increases with bond dimension D.

4. Density Matrix Renormalization Group

We are now ready to describe Density Matrix Renormalization Group, proposed originally in [Whi92Whi93]. As mentioned earlier, it does not come with provable guarantees. In fact, one can construct artificial hard instances such that the algorithm get stuck at certain local minima [SCV08]. However, it has remained one of the most successful heuristics for ${1}$D systems. We refer the readers to [Sch11] for a complete survey.

DMRG is a simple alternating minimization scheme for computing the ground state of a ${1}$D Hamiltonian. We start with an arbitrary MPS. Then each step we optimize over the set of matrices ${\left\{A_i^{j_i}\right\}_{j_i=0}^d}$ associated with site ${i}$, while fixing everything else, and iterate until convergence. (You may wonder if one can simultaneously optimize over multiple sites. It turns out that it is an NP-hard problem [Eis06].)

Formally, the Hamiltonian problem can be phrased as a eigenvalue problem given a Hermitian matrix ${H}$, and thus we want to optimize over all ${|\psi\rangle}$ in MPS of a fixed bond dimension ${D}$

$\min_{|\psi\rangle}\frac{\langle \psi| H | \psi \rangle}{\langle \psi ||\psi \rangle}.$

Here, we assume that the input Hamiltonian is in the product form. In particular, it means that it can be written as a tensor network as

Input ${1}$D Hamiltonian is of the particular product form.

so the numerator of the optimization objective looks like

The DMRG works with the Langrangian of the objective. For some ${\lambda>0}$, we will consider

$\displaystyle \min_{|\psi\rangle}\,\, {\langle \psi| H | \psi \rangle} - \lambda {\langle \psi ||\psi \rangle}. \ \ \ \ \ (2)$

DMRG optimizes over the set of matrices associated with one qudit. Both terms in (2) are quadratic in this set of matrices.

The Langrangian ${{\langle \psi| H | \psi \rangle} - \lambda {\langle \psi ||\psi \rangle}}$ as a tensor network.

Now to optimize over the set of parameters associated with one site, calculus tells you to set the (partial) derivative to ${0}$, and the derivative of a quadratic thing is linear. Without going through any algebra, we can guess that the derivative of  with respect to a particular site, say the second one, is the same picture except removing the second site on one side.

The derivative that we set to ${0}$ and solve.

Notice that the unknown is still there, on the bottom side of each term. The trick of DMRG is to view the rest of the network as a linear map applied to the unknown.

Given ${H'}$ and ${B}$, we now have a clean numerical linear algebra problem of solving

$\displaystyle H'x = \lambda Bx. \ \ \ \ \ (3)$

This is called a generalized eigenvalue problem, and it is well studied. Importantly, for ${1}$D systems, ${H'}$ is typically very sparse, which enables very fast solvers in practice. Finally, DMRG sweeps over the sites one after another and stops until convergence is achieved.

5. Concluding Remarks

Our presentation of tensor networks and MPS roughly follows [GHLS15], a nice introductory survey on quantum Hamiltonian complexity.

The notion of tensor networks extends well beyond 1D systems, and a generalization of MPS is called tensor product state. It leads to algorithms for higher dimensional quantum systems. One may read [CV09] for a comprehensive survey.

Tensor network has been interacting with other concepts. Within physics, it has been used in quantum error correction [FP14PYHP15], conformal field theory [Orú14], and statistical mechanics [EV15]. In TCS , we have found its connections with Holographic algorithms [Val08CGW16], arithmetic complexity [BH07CDM16AKK19], and spectral algorithms [MW18]. In machine learning, it has been applied to probabilistic graphical models [RS18], tensor decomposition [CLO16], and quantum machine learning [HPM18].

For DMRG, we have only given a rough outline, with many details omitted, such as how to set ${D}$ and ${\lambda}$ and how to obtain the Hamiltonian in the matrix product form, and how to compute the linear maps ${H'}$ and ${B}$ for each iteration. An interested reader may read [Sch05Sch11].

References

[AKK19]    Per Austrin, Peeri Kaski, and Kaie Kubjas. Tensor network complexity of multilinear maps. In Proceedings of the 2019 Conference on Innovations in Theoretical Computer Science. ACM, 2019.

[BH07]    Martin Beaudry and Markus Holzer. The complexity of tensor circuit evaluation. Computational Complexity, 16(1):60, 2007.

[CDM16]    Florent Capelli, Arnaud Durand, and Stefan Mengel. e arithmetic complexity of tensor contraction. eory of Computing Systems, 58(4):506{527, 2016.

[CGW16]    Jin-Yi Cai, Heng Guo, and Tyson Williams. A complete dichotomy rises from the capture of vanishing signatures. SIAM Journal on Computing, 45(5):1671{1728, 2016.

[CLO16]    Andrzej Cichocki, Namgil Lee, Ivan V Oseledets, A-H Phan, Qibin Zhao, and D Mandic. Low-rank tensor networks for dimensionality reduction and large-scale optimization problems: Perspectives and challenges part 1. arXiv preprint arXiv:1609.00893, 2016.

[CV09]    J Ignacio Cirac and Frank Verstraete. Renormalization and tensor product states in spin chains and laices. Journal of Physics A: Mathematical and Theoretical, 42(50):504004, 2009.

[Eis06]    Jens Eisert. Computational difficulty of global variations in the density matrix renormalization group. Phys. Rev. Le., 97:260501, Dec 2006.

[EV15]    G. Evenbly and G. Vidal. Tensor network renormalization. Phys. Rev. Le., 115:180405, Oct 2015.

[FP14]    Andrew J Ferris and David Poulin. Tensor networks and quantum error correction. Phys. Rev. Le., 113(3):030501, 2014.

[FSW07]    Holger Fehske, Ralf Schneider, and Alexander Weie. Computational Many-Particle Physics. Springer, 2007.

[GHLS15]    Sevag Gharibian, Yichen Huang, Zeph Landau, and Seung Woo Shin. Quantum Hamiltonian complexity. Foundations and Trends in Theoretical Computer Science, 10(3):159, 2015.

[HPM18]   William James Huggins, Piyush Patil, Bradley Mitchell, K Birgia Whaley, and Miles Stoudenmire. Towards quantum machine learning with tensor networks. Quantum Science and Technology, 2018.

[MW18]    Ankur Moitra and Alexander S Wein. Spectral methods from tensor networks. arXiv preprint arXiv:1811.00944, 2018.

[Orú14]    Román Orús. Advances on tensor network theory: symmetries, fermions, entanglement, and holography. e European Physical Journal B, 87(11):280, 2014.

[PYHP15]    Fernando Pastawski, Beni Yoshida, Daniel Harlow, and John Preskill. Holographic quantum error-correcting codes: Toy models for the bulk/boundary correspondence. Journal of High Energy Physics, 2015(6):149, 2015.

[RS18]    Elina Robeva and Anna Seigal. Duality of graphical models and tensor networks. Information and Inference: A Journal of the IMA, 2018.

[Sch05]    Ulrich Schollwöck. The density-matrix renormalization group. Rev. Mod. Phys., 77(1):259, 2005.

[Sch11]    Ulrich Schollwöck. The density-matrix renormalization group in the age of matrix product states. Annals of Physics, 326(1):96, 2011.

[SCV08]    Norbert Schuch, Ignacio Cirac, and Frank Verstraete. Computational difficulty of finding matrix product ground states. Phys. Rev. Le., 100(25):250501, 2008.

[Val08]    Leslie G Valiant. Holographic algorithms. SIAM Journal on Computing, 37(5):1565, 2008.

[Whi92]    Steven R. White. Density matrix formulation for quantum renormalization groups. Phys. Rev. Le., 69:2863, Nov 1992.

[Whi93]    Steven R. White. Density-matrix algorithms for quantum renormalization groups. Phys. Rev. B, 48:10345, Oct 1993.

tags:

### Outline

Sampling from thermal states was one of the first and (initially) most important uses of computers. In this blog post, we will discuss both classical and quantum Gibbs distributions, also known as thermal equilibrium states. We will then discuss Markov chains that have Gibbs distributions as stationary distributions. This leads into a discussion of the equivalence of mixing in time (i.e. the Markov chain quickly equilibrates over time) and mixing in space (i.e. sites that are far apart have small correlation). For the classical case, this equivalence is known. After discussing what is known classically, we will discuss difficulties that arise in the quantum case, including (approximate) Quantum Markov states and the equivalence of mixing in the quantum case.

# Gibbs distributions

We have already learned about phase transitions in a previous blog post, but they are important, so we will review them again. The Gibbs or thermal distribution is defined as follows: Suppose that we have an energy function $E : {\{0,1\}}^n \to {\mathbb R}$ , which takes $n$ -bit strings to real numbers. Usually, $E = \sum_{i=1}^m E_i$ , where each $E_i$ term depends only on a few bits. For example, the energy might be the number of unsatisfied clauses in a 3-SAT formula, or it may arise from the Ising model. The Gibbs distribution is

$p(x) = \frac{\exp\{-E(x)/T\}}{Z}$

where the normalization factor in the denominator, also called the partition function, is $Z = \sum_{x \in {\{0,1\}}^n} \exp\{-E(x)/T\}$ . Another, perhaps more operational, way to define the Gibbs distribution is:

$p = \;\mathrm{arg\,max}_{q \in {\mathcal{P}}({\{0,1\}}^n)} H(q)~\text{subject to the constraint}~ \langle{q,E}\rangle = \bar{E}.$

In this expression, ${\mathcal{P}}({\{0,1\}}^n)$ is the set of probability distributions on ${\{0,1\}}^n$ , $H$ is the Shannon entropy, and $\bar E$ is a constant representing the average energy. We are thinking of probability distributions and $E$ as vectors of size $2^n$ . It turns out that if we solve this optimization problem, then the Gibbs distribution is the unique solution.

## Uses of Gibbs distributions

Why is it useful to work with Gibbs distributions?

• Gibbs distributions arise naturally in statistical physics systems, such as constraint satisfaction problems (CSPs), the Ising model, and spin glasses. One approach to deal with Gibbs distributions is through belief propagation (BP), which yields exact inference on tree graphical models and sometimes phase transition predictions on loopy graphs. Instead, we will focus on a different approach, namely, sampling from the Gibbs distribution.

• If we want to minimize $E$ (say, to find a 3-SAT solution), we can use simulated annealing. The idea of annealing is that we want to produce a crystal; a crystal is the lowest energy configuration of molecules. If we heat up the substance to a liquid and then cool it quickly, we will not get a nice crystal, because little bits of the material will point in different directions. In order to form a crystal, we need to cool the system slowly.

In computer science terms, we take a sample from a high temperature because sampling is generally easier at a higher temperature than at a lower temperature. We then use that sample as the starting point for an equilibration process at a slightly lower temperature, and repeat this procedure. If we reach zero temperature, then we are sampling from the minimizers of $E$ . In practice, the system will usually stop mixing before we get to zero temperature, but this is a good heuristic. You can think of this process as gradient descent, with some additional randomness.

• Gibbs distributions are used to simulate physical systems.

• Gibbs distributions are used in Bayesian inference due to the Hammersley-Clifford theorem, which will be discussed next.

• Gibbs distributions are also connected to multiplicative weights for linear programming (not discussed in this blog post).

## Bayesian inference & the Hammersley-Clifford theorem

In order to present the Hammersley-Clifford theorem, we must first discuss Markov networks. For this part, we will generalize our setup to a finite alphabet $\Sigma$ , so the energy function is now a function $\Sigma^n \to \mathbb R$ .

### Markov chains

First, let us recall the idea of a Markov chain with variables $X_1$ , $X_2$ , $X_3$ .

The random variables $X_1$ , $X_2$ , $X_3$ form a Markov chain if their joint distribution can be written in a factored way: $p(x_1,x_2,x_3) = p_{1,2}(x_1,x_2)p_{3 \mid 2}(x_3 \mid x_2)$ . For example, imagine that $X_1$ , $X_2$ , $X_3$ represent the weather on Monday, Tuesday, and Wednesday respectively. These random variables form a Markov chain if, conditioned on the weather on Tuesday, we have all of the information we need to forecast the weather on Wednesday. Another way to say this is that conditioned on the weather on Tuesday, then the weather on Monday and the weather on Wednesday are conditionally independent. Note that the weather on Monday and the weather on Wednesday are not independent; there can be correlations, but these correlations are mediated through the weather on Tuesday. It is important to note that the definition of a Markov chain is symmetric with respect to going forwards or backwards in time, so we can also write the conditional independence condition as $p(x_1,x_2,x_3) = p_{2,3}(x_2,x_3) p_{1 \mid 2}(x_1 \mid x_2)$ .

The conditional independence condition can also be written as $p_{1,3 \mid 2}(x_1, x_3 \mid x_2) = p_{1 \mid 2}(x_1 \mid x_2) p_{3 \mid 2}(x_3 \mid x_2).$ Recall that for two random variables $X_1$ and $X_2$ with joint distribution $p$ , they are independent, i.e., $p(x_1,x_2) = p_1(x_1) p_2(x_2)$ , if and only if $I(X_1; X_2) = 0$ , where $I$ here denotes the mutual information. Similarly, conditional independence is equivalent to the conditional mutual information $I(X_1; X_3 \mid X_2)$ equaling zero. This quantity is defined as $I(X_1;X_3 \mid X_2) = H(X_1 \mid X_2) + H(X_3 \mid X_2) - H(X_1, X_3 \mid X_2)$ .

Keep in mind that conditional independence is characterized in two equivalent ways: via an algebraic condition on the distributions, and via mutual information.

### Markov networks

A Markov network is like a Markov chain, but with more random variables and a more interesting structure. Imagine that we have a graph, where each node is associated with a random variable and the edges encode possible correlations. A Markov network has the property that if we take any disjoint collection of nodes $A$ , $B$ , and $C$ such that $A$ and $C$ are fully separated by $B$ (that is, any path from $A$ to $C$ must go through $B$ , or alternatively, removing $B$ leaves $A$ and $C$ disconnected), then $I(X_A; X_C \mid X_B) = 0$ . The notation $X_A$ here means the collection of random variables associated with the nodes in $A$ .

For example:

Here, if $A=\{1,5,6\}$ , $B=\{2,7\}$ , and $C=\{3,4\}$ , then $B$ separates $A$ and $C$ .

A Markov network is also called a graphical model or a Markov random field; and yet another name for them is Gibbs distribution, which is the content of the following theorem:

Theorem 1 (Hammersley-Clifford Theorem): Let $p$ be a strictly positive distribution on $\Sigma^n$ . Then, $p$ can be represented as a Markov network with respect to a graph $G$ if and only if $p$ can be expressed as a Gibbs distribution $p(x) \propto \exp\{-\sum_{C \in {\mathcal{C}}(G)} E_C(x_C)\}$ , where ${\mathcal{C}}(G)$ is the set of cliques (fully connected subsets) of $G$ .

This theorem says that Markov networks are the same as Gibbs states, with the same notion of locality.

The Hammersley-Clifford theorem implies an area law for mutual information; we will explain what this is and sketch why this is true. Divide a system into two disjoint pieces $A$ and $B$ . We want to know about the mutual information between $A$ and $B$ , $I(A;B)$ . The Hammersley-Clifford theorem gives us a bound which depends only on the size of the boundary $\partial$ between these sets. For simplicity, assume $\partial \subseteq B$ . Also, assume that the interactions have bounded range; then, the Hammersley-Clifford theorem tells us that $I(A; B \mid \partial) = 0$ .

Now, we will use the fact $I(A; B \mid \partial) = I(A; B,\partial) - I(A; \partial)$ . We can see this by writing out the expressions, but the intuition is that the term on the left asks about how much $A$ knows about $B$ , having already known about $\partial$ . This equals how much $A$ knows about $B$ and $\partial$ combined, minus how much $A$ knows about $\partial$ alone. In this case, since we said $\partial \subseteq B$ , then $I(A; B)$ is the same as $I(A; B, \partial)$ . In general, however, we have an upper bound:

$I(A;B) \le I(A; B, \partial) = I(A; \partial) + I(A;B \mid \partial) \le H(\partial) \le |\partial| \log |\Sigma|$

In this calculation, we have used $I(A; \partial) = H(\partial) - H(\partial \mid A)$ (the information between $A$ and $\partial$ is the amount by which the entropy of $\partial$ gets reduced once we know $A$ ) and $H(\partial \mid A) \ge 0$ (which is true classically).

Since the mutual information only scales with the surface area of the boundary and not with the area of the two regions $A$ and $B$ , this is known as an area law [1].

### Relationship to Bayesian inference

In Bayesian inference, we have a model for a system which can be very complicated. The model represents our assumptions on how parts of the system are causally related to the rest of the system. We have some observations, and we want to sample from a distribution conditionally on the fixed observations. Sampling from a conditional distribution is not the same as sampling from the original distribution, but we can still formally represent the conditional distribution as a Markov network. Therefore, sampling from Markov networks is a broadly useful task.

As an example of a complicated Bayesian model, consider a hierarchical Bayesian model [2]. Bayesian statistics requires choosing a prior distribution, and when there is a natural parameterized family of priors that a statistician can use, it may make sense to introduce a distribution over the priors; this is known as introducing a hyperparameter, and inference in the resulting hierarchical model (including computation of the posterior distribution) is frequently intractable. However, it is still desirable to work with these models because they are often more accurate than models in which the prior is handpicked by a statistician.

# Sampling from Gibbs distributions

The task of sampling from an arbitrary Gibbs distribution is MA-complete [3], and it is not hard to see that at low enough temperatures this problem is at least NP-hard. So, how do we sample from these distributions?

This section will discuss Monte Carlo Markov chain (MCMC) methods, namely the Metropolis-Hastings algorithm and Glauber dynamics. Readers familiar with these methods may wish to skip to the discussion of mixing in time. For readers who wish to build more intuition about Markov chains before proceeding, see the Appendix, where the simple example of the random walk on a cycle is treated in detail.

## Monte Carlo Markov chain (MCMC) methods

The general approach is to use a Markov chain. Let $\Omega=\Sigma^n$ be the possible states of the system. Effectively, a Markov chain is a way of doing a random walk over $\Omega$ .

The transition probabilities of the Markov chain are1 ${\mathbb P}\{X(t+1) = y \mid X(t) = x\} = T_{y,x}.$ Here, $T$ is the transition probability matrix. The column at index $x$ of $T$ is the probability distribution of the next state of the Markov chain, if the current state is $x$ . The row at index $y$ is a row of probability values which give the probabilities of jumping into state $y$ from every other state. It has the properties that its entries are non-negative and for every $x$ , $\sum_{y \in \Omega} T_{y,x} = 1$ . These properties say that $T$ is a (column) stochastic matrix.

Suppose we start at a state $x(0)$ ; or, more generally, we will start with a distribution $p$ over $\Omega$ . If we move according to the chain once, the distribution will be $Tp$ . If we move agian, the distribution will be $T^2 p$ . In general, after $t$ movements, the distribution is $T^t p$ . So, we can express the dynamics of the chain as matrix-vector multiplication.

It is worth mentioning that if we are simulating the chain on a computer and we are manipulating $n$ -bit numbers, then these probability vectors are of size $2^n$ so it becomes impractical to store the entire probability distributions.

The justification for our algorithms is the following theorem.

Theorem 2 (Perron-Frobenius Theorem): If $T$ is a stochastic aperiodic matrix, then one of the eigenvalues is $1$ , and all other eigenvalues have magnitude strictly less than $1$ . There is a unique probability distribution $\pi$ such that $T\pi = \pi$ .

The theorem implies that $T^t p$ will converge to the stationary distribution $\pi$ as $t\to\infty$ . So, if we want to sample from a distribution, this provides a method of doing so: cook up a Markov chain that equilibrates to the desired distribution, and then run the Markov chain until convergence. A priori, it is not obvious how we can design the Markov chain. At first, our problem was to sample from a probability distribution (a vector), and now we have changed the problem to designing an entire matrix, which does not appear to make our task easier.

Now, the question becomes: how does one come up with Markov chains that give you the desired stationary distribution?

## Metropolis-Hastings algorithm

The first algorithm we will introduce is the Metropolis-Hastings algorithm. One more desirable feature of a Markov chain is that it satisfies detailed balance, which says $\pi_x T_{y,x} = \pi_y T_{x,y}$ for all $x$ and $y$ . This condition says that if we pick a point with probability according to the stationary distribution and transition, the probability of picking $x$ and then moving to $y$ should be the same as picking $y$ and then moving to $x$ .

For a Markov chain in equilibrium, the total amount of probability flowing out of $x$ must equal the total amount of probability flowing into $x$ . For example, the United States might export products to Europe and import from China. Detailed balance says that the flow along each edge must balance, which is a more demanding condition. In the example with country trade deficits, we are requiring that all bilateral trade deficits must be zero.

Mathematically, detailed balance implies that $T$ can be transformed, via similarity transformations, into a symmetric matrix. The Metropolis-Hastings algorithm says that we should choose $T$ with the property $\frac{T_{x,y}}{T_{y,x}} = \frac{\pi_x}{\pi_y}.$ Suppose that we have an underlying graph on our state space, and suppose that we are at a state $x$ . The algorithm chooses a random neighbor, say $y$ , and then accepts or rejects this move with some probability. If the move is accepted, then we move to $y$ and continue the algorithm from there. Otherwise, if the move is rejected, then we stay at $x$ . We are free to choose any underlying graph (as long as it is connected and has a self-loop), and then we will tune the acceptance probability so that detailed balance holds.

Look at the trial move $x\to y$ . One way we can accomplish detailed balance is by looking at the ratio $\pi_y/\pi_x$ . If $\pi_y/\pi_x \ge 1$ , then always accept the move. If $\pi_y/\pi_x < 1$ , then accept the move with probability $\pi_x/\pi_y$ .

To get an idea for how the algorithm works, suppose that our underlying graph is $d$ -regular. Then, for neighbors $x$ and $y$ ,

\begin{aligned}T_{y,x} &= \min\Bigl\{1, \frac{\pi_y}{\pi_x}\Bigr\} \frac{1}{d}, \\ T_{x,y} &= \min\Bigl\{1, \frac{\pi_x}{\pi_y}\Bigr\} \frac{1}{d}\;\end{aligned}

Claim: $T_{y,x} \pi_x = \frac{1}{d} \min\{\pi_x,\pi_y\},$ which is manifestly symmetric in $x$ and $y$ ; thus, we have reversibility. This is the basic idea of the Metropolis-Hastings algorithm.

How does it work for a Gibbs distribution $\pi_x = \exp\{-E(x)/T\}/Z$ , where the energy function might, for example, count the number of violated clauses in a 3-SAT formula? In this case, we might be a little worried. The numerator of $\pi_x$ is pretty easy to compute (we can count how many violated constraints there are), but the denominator is hard to compute. In general, it is #P-hard to compute the denominator, because as $T$ drops to $0$ , the partition function in this case approaches the number of 3-SAT solutions. So, how do we calculate the ratios $\pi_y/\pi_x$ that the algorithm requires? We’re able to do this because the ratio does not depend on $Z$ :

$\frac{\pi_y}{\pi_x} = \exp \frac{E(x)-E(y)}{T}.$

Suppose that the energy is a sum of local terms, and the underlying graph corresponds to modifying one site at at a time. What this means is that the graph is $\Omega = {\{0,1\}}^n$ and the edges in the graph correspond to flipping exactly one bit. In this case, it becomes very easy to evaluate the computations needed for the algorithm; in fact, we can even do them in parallel.

How do we choose the underlying graph? The key idea is that we do not want the majority of our moves to be rejected. A good example to keep in mind is the Ising model, where the configurations are $x \in {\{0,1\}}^n$ and the energy is $E(x) = -\sum_{i,j=1}^n J_{i,j} x_i x_j$ . If $J_{i,j} \ge 0$ for all $i$ , $j$ , then we say that the model is ferromagnetic (we obtain lower energy by making the sites agree with each other). Of course, an antiferromagnetic model is just the opposite of this.

Assume that the bits are laid out in a square and $J_{i,j} = J$ if $i$ and $j$ are neighbors on the square, and $J_{i,j} = 0$ if they are not. As we vary the quantity $J/T$ , we observe a phase transition. If $J/T$ is small, then the coupling between the random variables is weak and the different parts of the system are almost independent; we call this the disordered phase. If $J/T$ is large, then the spins want to align in the same direction and the Gibbs distribution will look almost like the following: with probability $1/2$ , all spins are $+$ , and with probability $1/2$ , all spins are $-$ ; we call this the ordered phase.

In the disordered phase, when the spins do not need to align so closely, the Metropolis-Hastings algorithm will work well. In the ordered phase, the algorithm is doomed. Indeed, suppose that most of the spins are $+$ . As time proceeds, any $-$ s will switch to $+$ . There may be islands of $-$ spins initially, but it will be energetically favorable for these islands to shrink over time. Therefore, there will be an exponentially small chance for the system to switch to a configuration with mostly $-$ ’s, and thus the chain takes exponentially long to mix. Here, people are interested in understanding the autocorrelation time, because the goal is to run the chain for some time, get one sample, run the chain for some more time, get another sample, etc.

## Glauber dynamics

This next method (Glauber dynamics) is essentially the same as Metropolis-Hastings, but this is not immediately obvious. We are at a state $x = (x_1,\dotsc,x_n) \in \Sigma^n$ . (For the Metropolis-Hastings algorithm, we could be walking on a state space without a product structure. However, Glauber dynamics requires a product structure.) Then, we update $x$ to $(x_1,\dotsc,x_{i-1},x_i',x_{i+1},\dotsc,x_n)$ with chance $\pi_{i\mid -i}(x_i' \mid x_1,\dotsc,x_{i-1},x_{i+1},\dotsc,x_n)$ . In other words, we hold all other bits fixed, and conditioned on those other bits, we resample the $i$ th bit. Like Metropolis-Hastings, $\pi$ is stationary for this chain.

It is not obvious that these conditional distributions can be computed efficiently, but it is possible since normalizing the conditional distribution only requires summing over the possible configurations for a single random variable. On a Markov network, the conditional probability is $\pi_{i \mid N(i)}(x_i' \mid x_{N(i)})$ , where $N(i)$ denotes the set of neighbors of $i$ . This makes the computation a constant-sized calculation (i.e., does not depend on the size of the system).

For example, in the Ising model, suppose we are at state $x \in {\{\pm 1\}}^n$ . In Glauber dynamics, we pick a vertex $i \in [n]$ u.a.r. and update it to $+$ with probability $p_{i \mid N(i)}(+ \mid x_{N(i)}) = \frac{\exp(T^{-1}\sum_{j\in N(i)} x_j)}{\exp(-T^{-1} \sum_{j\in N(i)} x_j) + \exp(T^{-1} \sum_{j\in N(i)} x_j)}.$

# Mixing in time

Mixing in time means that the dynamics will equilibrate rapidly. It turns out that this is equivalent to mixing in space, which means that $\pi$ itself has decaying correlations. For example, the Ising model at low temperature has a lot of long-range correlations, but at high temperature it does not. For the high temperature regime, we can prove that mixing in time occurs. We will prove this for the ferromagnetic Ising model. The result is known more generally, but the proofs are much easier for the Ising model.

People have known about the Metropolis-Hastings algorithm since the 1950s, but only recently have researchers been able to prove convergence guarantees for the 2D Ising model. There is a large gap between theory and practice, but in some situations we can prove that the algorithm works.

Sampling from the distribution is roughly equivalent to estimating the partition function (sampling-counting equivalence). There have been many papers addressing tasks such as estimating the non-negative permanent, the number of colorings of a graph, etc.2 A dominant way of accomplishing these tasks is proving that the Metropolis-Hastings algorithm converges for these problems. It is easy to find algorithms for these problems that converge to Gibbs distributions, but the convergence may take exponential time.

We will look at the situation when the energy function looks like the Ising model, in the sense that the interactions are local and reflect the structure of some underlying space. Also, assume that the interactions are of size $O(1)$ and that the scaling comes from the size of the system. When can we expect that our algorithms work? There are two main cases when we can argue that there should be rapid mixing.

• High temperature regime: The system is very disordered, and in the limit as the temperature approaches infinity, we get the uniform distribution.
• One-dimension: In 1D, we can exactly compute the partition function using dynamic programming. Before, we mentioned that if there are a sea of $+$ s and an island of $-$ s, then it is energetically favorable for the island to shrink; note that this is no longer true in 1D. In a way, 1D systems are more “boring” because they cannot exhibit arbitrarily long-range correlations.

In this part of the blog post, we will try to be more proof-oriented. We will start by explaining why it is plausible that high temperature means that the chain will mix rapidly in time.

## Coupling method

One method of proving rates of convergence for Markov chains is by analzying the spectral gap. Another method is the coupling method.

The idea behind the coupling method is to start with two configurations $X(0),Y(0) \in \Omega$ . We want each one to evolve under the Markov chain.

The key part is that there is still some freedom with respect to what the dynamics looks like. In particular, we are allowed to correlate the $X$ and $Y$ processes. Thus, we are defining a joint transition probability ${\mathbb P}\{X(1)=x(1),Y(1)=y(1) \mid X(0),Y(0)\}$ . We want to design the process such that $X(1)$ and $Y(1)$ are closer together than $X(0)$ and $Y(0)$ . Imagine that we have two particles bouncing around. Each particle follows the dynamics of $T$ , but they are correlated so that they drift together, and once they meet, they stick together. It turns out that the mixing time can be upper bounded by the time it takes for the particles to meet each other.

Assume we have some sort of distance function $\;\mathrm{dist}$ on the underlying space and we can prove that $\;\mathrm{\mathbb E}\;\mathrm{dist}(X(1),Y(1)) \le \exp(-\alpha) \;\mathrm{dist}(X(0),Y(0))$ . Then, it turns out that the mixing time $t_{\rm mix}(\epsilon)$ , i.e. the time required to get within $\epsilon$ of the stationary distribution, is upper bounded as

$t_{\rm mix}(\epsilon) \le \frac{\log\{(\;\mathrm{diam}\Omega)/\epsilon\}}{\alpha}$

Initially, the two particles can be $\;\mathrm{diam}\Omega$ apart, but the expected distance is exponentially shrinking as we run the coupling, so the mixing time is logarithmic in the diameter.

The distance between probability distributions is defined as follows. Let $p$ and $q$ be two probability distributions on $\Omega$ . Then, the metric is:3

$\frac{1}{2} \|p-q\|_1 = \frac{1}{2}\sum_{x \in \Omega}|p(x)-q(x)| = \min_{\substack{(X,Y) \sim r \in {\mathcal{P}}(\Omega \times \Omega) \\ r_1 = p \\ r_2 = q}} {\mathbb P}_r\{X \ne Y\}.$

In this expression, $r_1$ and $r_2$ denote the first and second marginals of $r$ respectively. The minimum is taken over all couplings of $p$ and $q$ . This is the correct way to measure the distance between distributions. To give some intuition for this quantity, the quantity on the right represents the best test to distinguish the two distributions. If $p$ and $q$ are the same, we can take a coupling in which $X \sim p$ and $Y \sim q$ are always identical. If $p$ and $q$ have disjoint supports, then no matter what coupling we use, $X$ and $Y$ will never be equal.

It suffices to consider when $X(0)$ and $Y(0)$ are neighbors, i.e. at distance $1$ apart. This is because if we have $X(0)$ and $Y(0)$ far apart, then we could look at the path between them and reduce to the case when they are neighbors. Formally, this is known as path coupling. The formal statement is in Theorem 12.3 of [4]:

Theorem 3: Let $\Gamma$ be a connected weighted graph on the state space, where no edge has weight less than $d_{\min}$ . Let $d(C,C')$ be the length of the shortest path from $C$ to $C'$ in $\Gamma$ and let $d_{\max} = \max_{C,C' \in \Omega} d(C,C')$ be the diameter of $\Gamma$ . Suppose there is a coupling such that for some $\delta > 0$ ,

$\;\mathrm{\mathbb E}\bigl[d\bigl(X(1),Y(1)\bigr) \bigm\vert \bigl(X(0),Y(0)\bigr) = (C,C')\bigr] \le (1-\delta)d(C,C')$

for all neighboring pairs $C$ , $C'$ , i.e., those pairs connected by an edge in $\Gamma$ . Then, the mixing time is bounded by

$t_{\rm mix}(\epsilon) \le \frac{\log(\epsilon^{-1}d_{\max}/d_{\min})}{\delta}.$

## Glauber dynamics at high temperature

Recall that in Glauber dynamics, we pick a site $i$ randomly and then update the site conditioned on its neighbors. The first way we will couple together $X(1)$ and $Y(1)$ is by picking the same site for both of them.

1. Pick a random $i \in [n]$ .
2. If ${X(0)}_{N(i)} = {Y(0)}_{N(i)}$ , then set ${X(1)}_i = {Y(1)}_i$ (if the neighborhoods of the two points agree, then update them the same way). Otherwise, update them using the best possible coupling, i.e., pick a coupling for $({X(1)}_i, {Y(1)}_i)$ which minimizes ${\mathbb P}\{ {X(1)}_i \ne {Y(1)}_i \}$ .

So if $X(0) = Y(0)$ , then the points will never drift apart. The reason why analyzing this coupling is non-trivial is because there is a chance that the distance between the two points can increase.

Assume that the degree of the graph is $\Delta$ . Suppose that $\;\mathrm{dist}(X(0),Y(0)) = 1$ , that is, there is a single $a$ such that ${X(0)}_a \ne {Y(0)}_a$ . What will happen to $X(1)$ and $Y(1)$ ? We start by picking a random $i \in [n]$ . There are three cases:

1. $i \notin (\{a\} \cup N(a))$ (with probability $1 - (\Delta + 1)/n$ ): Nothing changes; $X(0)$ and $Y(0)$ agree at $i$ , and $X(1)$ and $Y(1)$ will also agree at $i$ . The distance remains at $1$ .
2. $i = a$ (with probability $1/n$ ): We picked the one spot in which the two configurations differ. The neighborhoods of $a$ are the same for $X(0)$ and $Y(0)$ , so we update in the same way for both processes, and the distance drops to $0$ .
3. $i \in N(a)$ (with probability $\Delta/n$ ): We could have different updates. Here, we have to use the high temperature assumption, which says that if we change one bit, the probability of a configuration cannot change too much.In the Ising model, $E(x) = \sum_{i,j=1}^n J_{i,j} x_i x_j$ . Changing $a$ can bias the energy by at most $\Delta\max_i J_{i,a}$ , so the expected distance afterwards is $1 + O(\max_{i,j=1}^n J_{i,j}/T)$ .

Adding these cases up to get the overall expected distance gives

$\;\mathrm{\mathbb E}\;\mathrm{dist}\bigl(X(1), Y(1)\bigr) = 1-\frac{1}{n} + \underbrace{O\Bigl(\frac{\Delta J_{\max}}{T}\Bigr)}_{\le 1}\frac{1}{n} = 1 - \frac{c}{n}$

for $T$ large enough, so the expected distance will shrink. This argument also tells us how large the temperature must be, which is important for applications. This gives us $t_{\rm mix}(\epsilon) = O\Bigl(n\log\frac{n}{\epsilon}\Bigr).$ Notice that this is the same dependence as the coupon collector problem. Therefore, in the high temperature regime, the system behaves qualitatively as if there are no correlations.

## Temporal and spatial mixing equivalence

The analysis of Glauber dynamics at high temperature is already a version of the equivalence between mixing in time and mixing in space. It says that if the correlations even with the immediate neighbors of a node are weak, then Glauber dynamics rapidly mixes.

Now, we want to consider the situation in which there can be strong correlations between immediate neighbors, but weak correlation with far away sites. We want to show that spatial mixing implies temporal mixing.

We will give a few definitions of correlation decay. (Note: The definitions of correlation decay below are not exactly the ones from Aram’s lecture. These definitions are from [5] and [6].)

For non-empty $W \subseteq V$ and $\tau \in \Sigma^{V\setminus W}$ , let $\mu_W^\tau$ be the distribution of the spins in $W$ conditional on the spins in $V \setminus W$ being fixed to $\tau$ . For $\Delta \subseteq W$ , let $\mu_{W,\Delta}^\tau$ be the marginal of $\mu_W^\tau$ on the spins in $\Delta$ . We will assume that the interactions between the spins have finite range $r > 0$ , and $\partial_r W$ denotes the $r$ -boundary of $W$ , i.e., $\{v \in V \setminus W : \;\mathrm{dist}(v,W) \le r\}$ .

• (Weak decay of correlations) Weak spatial mixing holds for $W$ if there exist constants $C, \xi > 0$ such that for any subset $\Delta \subseteq W$ , $\sup_{\tau,\tau' \in \Sigma^{V\setminus W}}\|\mu_{W,\Delta}^\tau - \mu_{W,\Delta}^{\tau'}\|_1 \le C\sum_{x\in\Delta, \; y \in \partial_r W} \exp\Bigl(- \frac{\;\mathrm{dist}(x,y)}{\xi}\Bigr).$
• (Strong decay of correlations) Strong spatial mixing holds for $W$ if there exist constants $C,\xi > 0$ such that for every $\Delta \subseteq W$ and every $\tau,\tau' \in \Sigma^{V\setminus W}$ differing only at site $y \in V\setminus W$ , $\|\mu_{W,\Delta}^\tau - \mu_{W,\Delta}^{\tau'}\|_1 \le C\exp\Bigl(-\frac{\;\mathrm{dist}(y,\Delta)}{\xi}\Bigr).$
• (Strong decay of correlations) Strong spatial mixing in the truncated sense holds for $V$ if there exist $n, \xi > 0$ such that for all functions $f, g : \Omega \to {\mathbb R}$ which depend only on the sites at $\Lambda_f$ and $\Lambda_g$ respectively and such that $\;\mathrm{dist}(\Lambda_f,\Lambda_g) \ge n$ , $\sup_{\tau \in \Sigma^{V\setminus W}} \;\mathrm{cov}_{\mu_W^\tau}(f, g) \le |\Lambda_f||\Lambda_g|\|f\|_\infty \|g\|_\infty \exp\Bigl(-\frac{d(\Lambda_f,\Lambda_g)}{\xi}\Bigr).$

Here, $\xi$ is the correlation length (in physics, it is the characteristic length scale of a system). In the disordered phase, the correlation length is a constant independent of system size. For our purposes, the main consequence of these definitions is that the effective interaction range of each spin is $O(\xi)$ . For the Ising model, there is a key simplification due to monotonicity. Namely, the ferromagnetic Ising model has the nice property (which is not true for other models) that if we flip a sign from $-$ to $+$ , this only makes $+$ more likely everywhere. This is because the spins want to agree. There are a lot of boundary conditions to consider, but here, due to monotonicity, we only need to consider two: all of the spins are $-$ , and all of the spins are $+$ . All $+$ spins will give the highest probability of a $+$ spin, and all $-$ spin will give the lowest probability of a $+$ spin. This monotonicity property is generally not required for time-space mixing equivalence to hold, but it greatly simplifies proofs.

It is a very non-obvious fact that all of these notions of spatial mixing are equivalent. We will sketch a proof that strong correlation decay implies that $t_{\rm mix} = O(n\log n)$ .

The idea is to use another coupling argument. Let $X(0), Y(0) \in {\{\pm 1\}}^V$ differ in one coordinate, i.e., ${X(0)}_a \ne {Y(0)}_a$ and ${X(0)}_i = {Y(0)}_i$ for $i \ne a$ . We want to argue that the expected distance between the processes will decrease. The proof uses a generalization of Glauber dynamics called block Glauber dynamics. In Glauber dynamics, we take a single spin and resample it conditioned on its neighbors. In block Glauber dynamics, we take an $L\times L$ box and resample it conditioned on its neighbors. There is an argument, called canonical paths, which can be used to show that if block Glauber dynamics mixes, then regular Glauber dynamics also mixes (slightly more slowly; we lose a $\;\mathrm{poly}(L)$ factor, but anyway $L$ will be a large constant) so analyzing block Glauber dynamics is fine.

If $a$ lies in the box, then the expected change in distance is $-L^2/n$ . If $a$ is far away from the box, then there is no change. If $a$ is in the boundary of the box, then it is possible for the distance to increase. However, strong spatial mixing allows us to control the influence of a single site, so the expected change in distance is bounded by $O(L\xi^2/n)$ . Now, since $\xi$ is a constant, if we choose $L$ sufficiently large, then we will have the same situation as in the high temperature case: the expected distance will exponentially shrink over time.

# Quantum systems

The quantum version of Markov chains has many more difficulties. The first difficulty is that the Hammersley-Clifford theorem (which we have been relying on throughout this blog post) fails.

## Notation

To properly discuss what we mean, let’s set up some notation. Readers already familiar with density matrices, quantum entropy, and quantum mutual information may wish to skip to the next subsection. Most of the time we discuss quantum objects here, we’ll be using density matricies, often denoted $\rho$ . A density matrix can be thought of as an extension to regular quantum states $|{\psi}\rangle$ , where there is some classical source of uncertainty.

A density matrix is a positive semidefinite matrix with trace $1$ . This extends the notion of a classical probability distribution; in the quantum setting, a classical probability distribution $p$ (thought of as a vector whose entries sum to $1$ ) is represented as the density matrix $\;\mathrm{diag}(p)$ .

For example, we can consider a situation in which there is a $1/2$ probability that we started with the quantum state $|{\psi}\rangle$ and a $1/2$ probability that we started with the quantum state $|{\phi}\rangle$ . This would be denoted as follows:

$\rho = \frac{1}{2} |{\psi}\rangle \langle{\psi}| + \frac{1}{2} |{\phi}\rangle \langle{\phi}|$

Density matricies are generally useful for a lot of tasks, but for our purposes a density matrix will be used to discuss both the classical and quantum “uncertainty” we have about what state we have.

Now let’s also talk about a second important piece of notation: the tensor product. Often when discussing quantum states, it is important to discuss multiple quantum states simultaneously. For example, Alice has one system $A$ and Bob has another system $B$ . However, these systems might be entangled, meaning that the results of the two systems are correlated.

For instance, let us consider the following state:

$|{\psi}\rangle = \frac{1}{\sqrt{2}}\left( |{+}\rangle _A |{+}\rangle _B + |{-}\rangle _A |{-}\rangle _B \right)$

This particular state has the property that Alice and Bob will always both measure $+$ or they will both measure $-$ . The notation for tensors is often ambiguous in the literature as there are many ways of specifying tensors. For instance, above we used subscripts to explicitly denote which particle was in system $A$ and which was in system $B$ . One may also choose to simply use the index of the system as below. The symbol $\otimes$ is used to denote a tensor between states (where it is assumed that the first state is system $A$ and the second, system $B$ ). Gradually folks may shorten the notation as follows:

\begin{aligned} |{\psi}\rangle &= \frac{1}{\sqrt{2}}\left( |{+}\rangle |{+}\rangle + |{-}\rangle |{-}\rangle \right)\\ |{\psi}\rangle &= \frac{1}{\sqrt{2}}\left( |{++}\rangle + |{--}\rangle \right)\\ |{\psi}\rangle &= \frac{1}{\sqrt{2}} \begin{pmatrix} 1\\0 \end{pmatrix} \otimes \begin{pmatrix} 0\\1 \end{pmatrix} \end{aligned}

These are all notations for the same state. Let’s now talk about this state in the context of a density matrix. The density matrix of this state is as follows:

\begin{aligned} \rho_{A,B} &= \frac{1}{2} \left( |{++}\rangle + |{--}\rangle \right) \left( \langle{++}| + \langle{--}| \right)\\ \rho_{A,B} &= \frac{1}{2} \left( |{++}\rangle \langle{++}| + |{--}\rangle \langle{++}| + |{++}\rangle \langle{--}| + |{--}\rangle \langle{--}| \right) \\ \rho_{A,B} &= \frac{1}{2} \begin{pmatrix} 1&0&0&1\\0&0&0&0\\0&0&0&0\\1&0&0&1 \end{pmatrix}\end{aligned}

Writing the density matrix $\rho$ as $\rho_{A,B}$ makes explicit that this is the density matrix over systems $A$ and $B$ .

A crucial operation that one will often perform using density matricies is the partial trace. The partial trace is a way of allowing us to consider only a smaller part of the larger part of the system, while taking into account the influence of the larger system around it.

Here’s an example: Suppose Bob wants to know what his state is. However, Bob really doesn’t care about Alice’s system and just wants to know what the density matrix for his system is. Bob’s density matrix is simply the following density matrix (a 50% chance of being in $|{+}\rangle$ and a 50% chance of being in $|{-}\rangle$ ).

\begin{aligned} \rho_{B} &= \frac{1}{2} \left( |{+}\rangle \langle{+}| + |{-}\rangle \langle{-}| \right) \end{aligned}

More explicitly, we could write the following:

\begin{aligned} \rho_{B} &= \frac{1}{2} \left( |{+}\rangle _B \langle{+}| _B + |{-}\rangle _B \langle{-}| _B \right) \end{aligned}

The partial trace is an operation that will let us take our original density matrix $\rho_{A,B}$ and generates a new density matrix $\rho_B$ that ignores system $A$ . This is specifically called the partial trace over $A$ , or $\;\mathrm{tr}_A$ .

So how do we do this? We simply sum over the state $A$ (effectively taking a trace, but only along one axis):

\begin{aligned} \;\mathrm{tr}_A \rho_{A,B} &= \sum_i \langle{i}| _A \frac{1}{2} \left( |{++}\rangle \langle{++}| + |{--}\rangle \langle{++}| + |{++}\rangle \langle{--}| + |{--}\rangle \langle{--}| \right) |{i}\rangle _A\end{aligned}

This is easier to evaluate using certain choices of notation:

\begin{aligned} \;\mathrm{tr}_A \rho_{A,B} &= \langle{+}| _A \frac{1}{2} \left( |{++}\rangle \langle{++}| + |{--}\rangle \langle{++}| + |{++}\rangle \langle{--}| + |{--}\rangle \langle{--}| \right) |{+}\rangle _A \\&\qquad {}+ \langle{-}| _A \frac{1}{2} \left( |{++}\rangle \langle{++}| + |{--}\rangle \langle{++}| + |{++}\rangle \langle{--}| + |{--}\rangle \langle{--}| \right) |{-}\rangle _A\\ &= \frac{1}{2} \left( |{+}\rangle _B \langle{++}| + |{+}\rangle _B \langle{--}| \right) |{+}\rangle _A + \frac{1}{2} \left( |{-}\rangle _B \langle{++}| + |{-}\rangle _B \langle{--}| \right) |{-}\rangle _A\\ &= \frac{1}{2} \left( |{+}\rangle _B \langle{+}| _B \right) + \frac{1}{2} \left( |{-}\rangle _B \langle{-}| _B \right) = \frac{1}{2} \left( |{+}\rangle _B \langle{+}| _B + |{-}\rangle _B \langle{-}| _B \right) = \rho_B\end{aligned}

We now have all of the tools we need to talk about quantum entropy. Intuitively, entropy can be thought of as the amount of uncertainty we have for our system, or equivalently the amount of information it takes to define our system. The entropy for a quantum system $\rho$ is defined as follows:

\begin{aligned} H(\rho) &= -\;\mathrm{tr}(\rho \log_2 \rho)\end{aligned}

Note that here we use the shorthand $\rho_B$ to denote $\;\mathrm{tr}_A \rho_{A,B}$ . Here, writing $\;\mathrm{tr}$ without the subscript indicates that this is the full or normal trace that one might expect (or equivalently performing the partial trace over all systems). We can now define the conditional entropy of a system as follows:

\begin{aligned} {H(A \mid B)}_\rho &= H(\rho_{A,B}) - H(\rho_B)\end{aligned}

This definition intuitively makes sense since we can think of conditional entropy as the amount of information it takes to describe our joint system $(A,B)$ , given that we already know what $B$ is.

We can now discuss quantum mutual information, the amount of information that measuring system $A$ will provide you about system $B$ . Like the classical case, this is defined as follows:

\begin{aligned} {I(A;B)}_\rho &= {H(A,B)}_\rho - {H(A\mid B)}_\rho - {H(B\mid A)}_\rho\end{aligned}

We can now finally discuss quantum mutual information (QCMI), defined as follows: ${I(A;B \mid C)}_\rho = {I(A;B,C)}_\rho - {I(A;C)}_\rho$ . With some algebraic simplifications, one can arrive at the expression:

\begin{aligned} {I(A;B \mid C)}_\rho &= {H(A,C)}_\rho + {H(B,C)}_\rho - {H(A,B,C)}_\rho - {H(C)}_\rho.\end{aligned}

The QCMI equals $0$ if and only if $\rho$ is a quantum Markov state. Classically, the entropic characterization of conditional independence corresponds to an algebraic characterization.

## Recovery Maps

Here, the algebraic characterization is more grueling. We have

$\rho_{ABC} = \exp(\log \rho_{AB} + \log \rho_{BC} - \log \rho_B)$

Equivalently,

$\rho_{ABC} = \rho_{AB}^{1/2} \rho_B^{-1/2} \rho_{BC}\rho_B^{-1/2} \rho_{AB}^{1/2} = R_{B\to AB}(\rho_{BC})$

Here, $R_{B\to AB}$ is called the Petz recovery map,4 $\rho_{B\to AB}(X) = \rho_{AB}^{1/2} \rho_B^{-1/2} X\rho_B^{-1/2} \rho_{AB}^{1/2}$ . One can think of a recovery may as a way that we can reconstruct the entire system $A, B$ using just system $B$ . It is not obvious that this is a quantum channel, but it is.

Suppose $\rho$ is a probability distribution, so $\rho =\;\mathrm{diag}(p)$ for some vector $p$ . Then, all of the density matrices are diagonal and commuting. Then, the recovery map means that we divide by $p_B$ and multiply by $p_{AB}$ , i.e., multiply by $p_{A \mid B}$ . This is the natural thing to do if we lost our information about $A$ and were trying to figure out what $A$ was based on our knowledge of $B$ . This is why $R_{B\to A,B}$ is known as a recovery map, and it is used to discuss conditional distributions in the quantum setting. In the classical case, if we start with $B, C$ , look only at $B$ , and use this to reconstruct $A$ , then we would have the whole state in a Markov chain. That is why this is a plausible quantum version of being a Markov chain.

However, quantum Gibbs states are not, in general, quantum Markov chains. The failure of this statement to hold is related to topological order, which is similar to the degrees of freedom that show up in error correcting codes.

## Quantum Markov Networks

Here, we will formally define a quantum Markov network. The reference for this is [7].

Let $G = (V, E)$ be a finite graph. We associate with each vertex $v \in V$ a Hilbert space ${\mathcal{H}}_v$ and we consider a density matrix $\rho_V$ acting on $\bigotimes_{v\in V} {\mathcal{H}}_v$ . Then, $(G, \rho_V)$ is a quantum Markov network if for all $U\subseteq V$ , $U$ is conditionally independent of $V \setminus (U \cup \partial U)$ given $\partial U$ , where the conditional independence statement is w.r.t. $\rho_V$ and means that the corresponding QCMI satisfies ${I(U; V\setminus (U \cup \partial U) \mid \partial U)}_{\rho_V} = 0$ .

A quantum Markov network is called positive if $\rho_V$ has full rank. (Recall that in the statement of the Hammersley-Clifford Theorem, , it is assumed that the distribution is strictly positive.)

Now, consider the following example. First, we introduce the Pauli matrices

\begin{aligned} \sigma^x := \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}, \qquad \sigma^z := \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}, \qquad \sigma^y := \begin{bmatrix} 0 & -i \\ i & 0 \end{bmatrix}.\end{aligned}

We define a Hamiltonian on three qubits $A$ , $B$ , $C$ by

$H := (\sigma_A^x \sigma_B^x + \sigma_A^y \sigma_B^y + \sigma_A^z \sigma_B^z) I_C + I_A (\sigma_B^x \sigma_C^x + \sigma_B^y \sigma_C^y + \sigma_B^z \sigma_C^z)$

(Juxtaposition in the above expression signifies the tensor product as discussed before.) Finally, for $\beta > 0$ , we define the Gibbs state

$\rho_{A,B,C}(\beta) := \frac{1}{Z(\beta)} \exp(-\beta H)$

The Hamiltonian here has local terms which correspond to interactions $(A,B)$ , $(B, C)$ . However, it can be shown that the QCMI between $A$ and $C$ conditioned on $B$ w.r.t. $\rho_{A,B,C}$ is non-zero, which means that this is not a quantum Markov network w.r.t. the line graph $A \leftrightarrow B \leftrightarrow C$ . This demonstrates the failure of the Hammersley-Clifford Theorem in the quantum setting.

## Important Results

We will briefly discuss the results of two papers.

1. [8] This paper shows that mixing in space implies mixing in time in the quantum case. However, the result of the paper only applies to commuting Hamiltonians. For commuting Hamiltonians, it turns out that quantum Gibbs states are quantum Markov networks. They use a version of Glauber dynamics, which can be simulated on a quantum computer but are also plausible dynamics for a physical system in nature. This is a difficult paper to read, but it is worth digesting if you want to work in the field.
2. [9] This second paper is much easier and more general, covering non-commuting Hamiltonians, but it requires more conditions. They give a method of preparing the Gibbs state which can run on a quantum computer, but the dynamics are not plausible as a physical system because they are too complicated. The more complicated dynamics allows them to make the proof work. The paper also uses QCMI.They have two assumptions. The first assumption looks like mixing in space (weak correlation decay). The second assumption is that the state looks approximately like a quantum Markov network (this is definitely not met in general). A very important paper in this space is a recent breakthrough ([10]) which characterizes quantum Markov chains. They show that if the QCMI is bounded by $\epsilon$ , then the recovery map $R_{B\to A,B}(\rho_{BC})$ is $\epsilon'$ -close to $\rho_{ABC}$ , i.e., low QCMI implies that the recovery map works well. This is trivial to prove classically, but very difficult in the quantum world.The algorithm in [9] is very elegant. Essentially, we take the entire system and punch out constant-sized boxes. If we can reconstruct the region outside of the boxes, then we can use the recovery maps to reconstruct the regions inside of the boxes, and the boxes are far apart enough so they are almost independent. For this argument, we must assume that the QCMI decays exponentially. Whenever we have exponential decay, we get a correlation decay that sets the size of the boxes. It is very difficult to condition on quantum states, but recovery maps provide a sense in which it is meaningful to do so. The paper gives an efficient method of preparing Gibbs states and simulating quantum systems on quantum computers.

The standard treatment of information theory is [11]. This book contains definitions and properties of entropy, conditional entropy, mutual information, and conditional mutual information.

To see a treatment of the subject of Markov chains from the perspective of probability theory, see [12] or the mathematically more sophisticated counterpart [13]. An introduction to coupling can be found in [14], as well as [4] (the latter also contains an exposition to spatial mixing). The connection between Markov chain mixing and the so-called logarithmic Sobolev inequality is described in [15].

# Appendix: Intuition for Markov chains

## Random walk on the cycle

We have $n$ points on the cycle, $0,1,\dotsc,n-1$ . At each step, we move left or right with probability $1/2$ . We can write the transition matrix as

$T = \frac{S + S^{-1}}{2}$

where $S$ is the shift operator $S |{x}\rangle = |{x+1 \bmod n}\rangle$ . The matrix $S$ is diagonalized by the Fourier transform. Define, for $k=0,1,\dotsc,n-1$ ,

$|{\tilde k}\rangle = \frac{1}{\sqrt n} \sum_{x=0}^{n-1} \exp\Bigl( \frac{2\pi i k x}{n} \Bigr) |{x}\rangle$

We have the same amount of amplitude at every point, but there is a varying phase which depends on $k$ . If $k = 0$ , we get the all-ones vector. If $k$ is small, then the phase is slowly varying. If $k$ is large, then the phase is rapidly varying. Look at what happens after we apply the shift operator:

\begin{aligned} S |{\tilde k}\rangle &= \frac{1}{\sqrt n} \sum_{x=0}^{n-1} \exp\Bigl( \frac{2\pi i k x}{n} \Bigr) |{x+1 \bmod n}\rangle \\ &= \frac{1}{\sqrt n} \sum_{x=1}^n \exp\Bigl( \frac{2\pi i k (x-1)}{n} \Bigr) |{x \bmod n}\rangle = \exp\Bigl(- \frac{2\pi i k}{n} \Bigr) |{\tilde k}\rangle .\end{aligned}

After the shift, we pick up an additional phase based on how rapidly the phase is varying. From this, we get:

\begin{aligned} T |{\tilde{k}}\rangle &= \frac{\exp(2\pi i k / n) + \exp(-2\pi i k / n)}{2} |{\tilde{k}}\rangle = \cos\Bigl(\frac{2\pi k}{n}\Bigr) |{\tilde{k}}\rangle .\end{aligned}

The eigenvalues are

$\lambda_k = \cos \frac{2\pi k}{n}, \qquad k=0,1,\dotsc,n-1.$

Only $k = 0$ will give me an eigenvalue of $1$ .

How do we analyze $T^t |{p}\rangle$ ? We should Fourier transform the distribution.

\begin{aligned} T^t |{p}\rangle = T^t \sum_{k=0}^{n-1} p_k |{\tilde{k}}\rangle = \sum_{k=0}^{n-1} p_k \lambda_k^t |{\tilde k}\rangle .\end{aligned}

If $n$ is odd, then as $t\rightarrow\infty$ , $\lambda_k^t \to 0$ for all $k=1,\dotsc,n-1$ , so $T^t \to |{\pi}\rangle \langle{1_n}|$ . Whatever you put into this operator, you get $\pi$ out.

## Spectral gap

The example of the random walk on the cycle shows that there is generally a unique stationary distribution and suggests that the speed of convergence is determined by how close the other eigenvalues are to $1$ . Specifically, suppose for simplicity that the eigenvalues of $T$ are $1 = \lambda_0 \ge \lambda_1\ge\cdots \ge 0$ (real and positive). Then, the convergence time is on the order of $\sim 1/(1-\lambda_1)$ .

Typically, the distance of the eigenvalues from $1$ reflects the size of the physical system. Even from the simple example, we can get some physical intuition from this. If $k$ is small, then the spectral gap is $\cos(2\pi k/n) = 1-O(k^2/n^2)$ . Thus, the convergence time is $\sim 1/(1-\lambda_1) \sim n^2$ , which is indeed the convergence time for a random walk on a cycle.

## References

1. S. Gharibian, Y. Huang, Z. Landau, and S. W. Shin, “Quantum Hamiltonian complexity,” Found. Trends Theor. Comput. Sci., vol. 10, no. 3, pp. front matter, 159–282, 2014.
2. R. W. Keener, Theoretical statistics. Springer, New York, 2010, p. xviii+538.
3. E. Crosson, D. Bacon, and K. R. Brown, “Making Classical Ground State Spin Computing Fault-Tolerant,” Physical Review E, vol. 82, no. 3, Sep. 2010.
4. C. Moore and S. Mertens, The nature of computation. Oxford University Press, Oxford, 2011, p. xviii+985.
5. F. Martinelli, “Lectures on Glauber dynamics for discrete spin models,” in Lectures on probability theory and statistics (Saint-Flour, 1997), vol. 1717, Springer, Berlin, 1999, pp. 93–191.
6. F. Martinelli and E. Olivieri, “Finite volume mixing conditions for lattice spin systems and exponential approach to equilibrium of Glauber dynamics,” in Cellular automata and cooperative systems (Les Houches, 1992), vol. 396, Kluwer Acad. Publ., Dordrecht, 1993, pp. 473–490.
7. M. S. Leifer and D. Poulin, “Quantum graphical models and belief propagation,” Ann. Physics, vol. 323, no. 8, pp. 1899–1946, 2008.
8. M. J. Kastoryano and F. G. S. L. Brandão, “Quantum Gibbs samplers: the commuting case,” Comm. Math. Phys., vol. 344, no. 3, pp. 915–957, 2016.
9. F. G. S. L. Brandão and M. J. Kastoryano, “Finite correlation length implies efficient preparation of quantum thermal states,” ArXiv e-prints, Sep. 2016.
10. O. Fawzi and R. Renner, “Quantum conditional mutual information and approximate Markov chains,” Comm. Math. Phys., vol. 340, no. 2, pp. 575–611, 2015.
11. T. M. Cover and J. A. Thomas, Elements of information theory, Second. Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, 2006, p. xxiv+748.
12. R. Durrett, Essentials of stochastic processes. Springer, Cham, 2016, p. ix+275.
13. R. Durrett, Probability: theory and examples, Fourth., vol. 31. Cambridge University Press, Cambridge, 2010, p. x+428.
14. M. Mitzenmacher and E. Upfal, Probability and computing, Second. Cambridge University Press, Cambridge, 2017, p. xx+467.
15. F. Cesi, “Quasi-factorization of the entropy and logarithmic Sobolev inequalities for Gibbs random fields,” Probab. Theory Related Fields, vol. 120, no. 4, pp. 569–584, 2001.

1. This is the opposite of the probabilists’ convention, i.e., the transition probability matrix that we define here is the transpose of the one usually found in most probability theory textbooks.
2. As a side note, it may be a good research question to investigate to what extent quantum algorithms can be used to compute summations whose terms are possibly negative. In quantum Monte Carlo, the quantum Hamiltonian is converted to a classical energy function; this conversion always works, but sometimes you end up with complex energies, which is terrible for estimating the partition function because terms can cancel each other out.
3. You may recognize this as the total variation norm.
4. Petz wrote about quantum relative entropy in 1991, way before it was cool.