In recent years there has been increasing interest in using machine learning to improve the performance of classical algorithms in computer science, by fine-tuning their behavior to adapt to the properties of the input distribution. This “data-driven” or “learning-based” approach to algorithm design has the potential to significantly improve the efficiency of some of the most widely used algorithms. For example, it has been used to design better data structures, online algorithms, streaming and sketching algorithms, market mechanisms and algorithms for combinatorial optimization, similarity search and inverse problems. This virtual workshop will feature talks from experts at the forefront of this exciting area.

The workshop will take place virtually on **July 13-14, 2021**.

Registration is **free but mandatory**. Link to register: https://fodsi.us/ml4a.html

**Confirmed Speakers:**

- Alex Dimakis (UT Austin)
- Yonina Eldar (Weizmann)
- Anna Goldie (Google Brain, Stanford)
- Reinhard Heckel (Technical University of Munich)
- Stefanie Jegelka (MIT)
- Tim Kraska (MIT)
- Benjamin Moseley (CMU)
- David Parkes (Harvard)
- Ola Svensson (EPFL)
- Tuomas Sandholm (CMU, Optimized Markets, Strategy Robot, Strategic Machine)
- Sergei Vassilvitskii (Google)
- Ellen Vitercik (CMU/UC Berkeley)
- David Woodruff (CMU)

**Organizers:**

- Costis Daskalakis (MIT)
- Paul Hand (Northeastern)
- Piotr Indyk (MIT)
- Michael Mitzenmacher (Harvard)
- Ronitt Rubinfeld (MIT)
- Jelani Nelson (UC Berkeley)

(h/t Salil Vadhan)

The Journal of the ACM is looking for a new editor in chief: see the call for nominations. The (soft) deadline to submit nominations (including self-nominations) is **July 19th** and you can do so by emailing Chris Hankin at c.hankin@imperial.ac.uk.

**Previous post:** Toward a theory of generalization learning **Next post:** TBD.

See also all seminar posts and course webpage.

lecture slides (pdf) – lecture slides (Powerpoint with animation and annotation) – video

Much of the material on causality is taken from the wonderful book by Hardt and Recht.

For fairness, a central source was the book in preparation by Barocas, Hardt, and Narayanan as well as the related NeurIPS 2017 tutorial, and other papers mentioned below.

We may have heard that **“correlation does not imply causation”**. How can we mathematically represent this statement, and furthermore differentiate the two rigorously?

Roughly speaking, $A$ and $B$ are correlated if $\Pr[B=b \mid A=a]$ is different for different values of $a$. To represent causation, we change the second part of the formula: $A$ causes $B$ if *intervening* to change $A$ to some value $a$ changes the probability of $B$. That is,

$\Pr[B=b \mid \mathrm{do}(A \leftarrow a)]$

depends on $a$.

Suppose we have the random variables $X, W, H$ (taken over the choice of a random person), which represent e**X**ercising, being over**W**eight and having **H**eart disease, respectively. We put forth the following (hypothetical! this is not medical advice!) scenarios for their relationships:

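**Scenario 1.** $X \sim \mathrm{Bern}(1/2)$. Now $W$, the overweight indicator, follows the causal relation:

$W \sim \begin{cases} 0 & \text{if } X = 1 \\ \mathrm{Bern}(1/2) & \text{if } X = 0 \end{cases}$

and the heart disease indicator $H$ follows the same rule

$H \sim \begin{cases} 0 & \text{if } X = 1 \\ \mathrm{Bern}(1/2) & \text{if } X = 0 \end{cases}$

So, in this scenario, exercise prevents heart disease and being overweight, while if we don’t exercise, we may be overweight or suffer from heart disease with probability 1/2 independently.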

**Scenario 2.** $W \sim \mathrm{Bern}(1/4)$, and

$X \sim \begin{cases} 0 & \text{if } W = 1 \\ \mathrm{Bern}(2/3) & \text{if } W = 0 \end{cases}$

and $H$ still depends on $X$ by the same rule as in the previous scenario. So, in this scenario, people are naturally prone to being overweight with probability 1/4, and being overweight makes you less likely to exercise, rather than the causal relation going the other way around. As before, exercise prevents heart disease, and someone who did not exercise will get heart disease with probability 1/2.

We find that in scenario 1, $\Pr[W=1] = \Pr[X=0] \cdot 1/2 = 1/4$. In scenario 2, $\Pr[X=1] = \Pr[W=0] \cdot 2/3 = 1/2$.

In fact, as this table shows, the probabilities for all combinations of $X, W, H$ are identical in the two scenarios!

Now, consider the intervention of setting $X = 0$, i.e. stop exercising. That is, we change the generating model for $X$ to be $X = 0$. In scenario 1, $\Pr[W=1]$ is still $1/2$, the same as the conditional probability $\Pr[W=1 \mid X=0]$. In scenario 2, $X$ tells us nothing about $W$ now, so we get $\Pr[W=1] = 1/4$. Now that we added in an intervention, the two scenarios are different!

This is an example of why **correlations are not causations**: while the conditional probabilities are identical in the two scenarios, the causal probabilities are different.

**NOTE:** Working out this example, and understanding **(a)** why the two scenarios induce identical probabilities, and in particular all conditional probabilities are identical and **(b)** why the causal probabilities differ from the conditional probabilities in Scenario 2, is a great way to get intuition for causality and its pitfalls.
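One way to work through this example is to enumerate both joint distributions exactly. The sketch below assumes the parameters $X \sim \mathrm{Bern}(1/2)$ in Scenario 1 and $W \sim \mathrm{Bern}(1/4)$, $X \mid (W=0) \sim \mathrm{Bern}(2/3)$ in Scenario 2, which are chosen to be consistent with the probabilities quoted in the text; the function names are illustrative only.

```python
from fractions import Fraction as F
from itertools import product

def bern(p, v):
    """Probability that a Bernoulli(p) variable equals v (0 or 1)."""
    return p if v == 1 else 1 - p

def scenario1(x, w, h, do_x=None):
    # X ~ Bern(1/2) unless we intervene; W and H are 0 if X=1, else Bern(1/2).
    px = bern(F(1, 2), x) if do_x is None else F(int(x == do_x))
    pw = bern(F(0) if x == 1 else F(1, 2), w)
    ph = bern(F(0) if x == 1 else F(1, 2), h)
    return px * pw * ph

def scenario2(x, w, h, do_x=None):
    # W ~ Bern(1/4); X is 0 if W=1, else Bern(2/3), unless we intervene on X.
    pw = bern(F(1, 4), w)
    px = bern(F(0) if w == 1 else F(2, 3), x) if do_x is None else F(int(x == do_x))
    ph = bern(F(0) if x == 1 else F(1, 2), h)
    return pw * px * ph

def prob_w1(scenario, do_x=None):
    """Pr[W=1], optionally under the intervention do(X <- do_x)."""
    return sum(scenario(x, 1, h, do_x) for x, h in product((0, 1), repeat=2))

def prob_w1_given_x0(scenario):
    """Observational conditional probability Pr[W=1 | X=0]."""
    num = sum(scenario(0, 1, h) for h in (0, 1))
    den = sum(scenario(0, w, h) for w, h in product((0, 1), repeat=2))
    return num / den

# (a) The two scenarios induce the same joint distribution over (X, W, H).
same_joint = all(scenario1(*o) == scenario2(*o) for o in product((0, 1), repeat=3))

# (b) The interventional probabilities differ between the scenarios.
causal_1 = prob_w1(scenario1, do_x=0)
causal_2 = prob_w1(scenario2, do_x=0)
print(same_joint, causal_1, causal_2, prob_w1_given_x0(scenario2))
```

Because exact fractions are used, equality of the joint distributions is checked exactly rather than up to sampling noise.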

Consider Scenario 1, where the causal structure is as follows:

Looking at the table above, we see that the unconditional probability $\Pr[H=1]$ equals $1/4$. Since in this scenario there is no causal relation between being overweight and suffering from heart disease, the causal probability $\Pr[H=1 \mid \mathrm{do}(W \leftarrow 0)]$ is also equal to $1/4$.

However, we can calculate the conditional probability from the table and see that $\Pr[H=1 \mid W=0] = \frac{1/8}{3/4} = 1/6$.

That means that even though in this scenario, there is no causal relation between being overweight and getting heart disease, conditioning on not being overweight reduces the probability of getting heart disease.

Once again we see here a **gap** between the **conditional** and **causal** probabilities.

The reason for this gap is that there is a **confounding variable**, namely $X$, that is a common cause of both $W$ and $H$.

**Definition:** $A, B$ are *confounded* if there are values $a, b$ such that

$\Pr[B=b \mid A=a] \neq \Pr[B=b \mid \mathrm{do}(A \leftarrow a)]$.

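To fix the effect of a confounder $Z$, we condition on $Z$. It also allows us to find the probability of an intervention. The general **deconfounding formula** is

$\Pr[B=b \mid \mathrm{do}(A \leftarrow a)] = \sum_z \Pr[B=b \mid A=a, Z=z] \, \Pr[Z=z]$ (★),

where $Z$ ranges over all the immediate causes of $A$.

Contrast this with the formula for computing the conditional probability which is

$\Pr[B=b \mid A=a] = \sum_z \Pr[B=b \mid A=a, Z=z] \, \Pr[Z=z \mid A=a]$.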

Using the deconfounding formula (★) requires **(a)** knowing the causal graph, and **(b)** observing the confounders. If we get this wrong and control for the wrong confounders we can get the causal probabilities wrong, as demonstrated by the following example.

One way to describe causality theory is that it aims to clarify the situations under which **correlation does in fact equal causation** (i.e., the **conditional probabilities are equal to the causal probabilities**), and how (by appropriately controlling for confounders) we can get to such a situation.
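As a concrete instance of the deconfounding formula (★), the sketch below applies it to Scenario 1, adjusting for exercise $X$ when asking about the effect of not being overweight ($W=0$) on heart disease ($H=1$). It assumes the Scenario 1 joint distribution in which $X \sim \mathrm{Bern}(1/2)$ and, given $X=0$, $W$ and $H$ are independent $\mathrm{Bern}(1/2)$; the helper `pr` is illustrative.

```python
from fractions import Fraction as F
from itertools import product

# Joint distribution of (X, W, H) in Scenario 1 (assumed parameters, see lead-in).
joint = {}
for x, w, h in product((0, 1), repeat=3):
    if x == 1:
        # Exercising rules out both being overweight and heart disease.
        joint[(x, w, h)] = F(1, 2) if (w, h) == (0, 0) else F(0)
    else:
        # Without exercise, W and H are independent fair coins.
        joint[(x, w, h)] = F(1, 8)

def pr(event):
    """Probability of the set of outcomes (x, w, h) for which event(...) is true."""
    return sum(p for outcome, p in joint.items() if event(*outcome))

# Plain conditional probability Pr[H=1 | W=0].
conditional = pr(lambda x, w, h: h == 1 and w == 0) / pr(lambda x, w, h: w == 0)

# Deconfounding formula: sum over x of Pr[H=1 | W=0, X=x] * Pr[X=x].
adjusted = sum(
    pr(lambda x, w, h, v=v: h == 1 and w == 0 and x == v)
    / pr(lambda x, w, h, v=v: w == 0 and x == v)
    * pr(lambda x, w, h, v=v: x == v)
    for v in (0, 1)
)

print(conditional, adjusted)
```

The adjusted value matches the unconditional $\Pr[H=1]$, as expected when $W$ has no causal effect on $H$, while the plain conditional probability is smaller.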

**Example (two diseases)** Consider the diagram below where there are two diseases $X$ and $Y$ such that each occurs independently with probability $p$. We assume each will send you to the hospital (variable $Z$) and those are the only reasons to arrive at the hospital.

If you control for $Z$ (i.e. look only at people who went to the hospital), we find that the two diseases are now correlated: a priori (among hospital patients) the probability of $X$ is $p/(2p - p^2) = 1/(2-p)$, and conditioned on $Y$, the probability of $X$ is $p$.

This relates to the joke “the probability of having 2 bombs on a plane is very low, so if I bring a bomb then it is very unlikely that there will be another bomb.”
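The effect of conditioning on the hospital variable can be checked by exact enumeration. This is a minimal sketch with a hypothetical disease probability of 1/10 (the value is an assumption for illustration, as are the variable names).

```python
from fractions import Fraction as F
from itertools import product

p = F(1, 10)  # hypothetical probability of each disease

# Joint distribution of two independent diseases (x, y).
joint = {
    (x, y): (p if x else 1 - p) * (p if y else 1 - p)
    for x, y in product((0, 1), repeat=2)
}

def in_hospital(x, y):
    # Either disease sends you to the hospital, and nothing else does.
    return x == 1 or y == 1

pr_hospital = sum(q for (x, y), q in joint.items() if in_hospital(x, y))

# Pr[X=1 | in hospital]
pr_x_given_hospital = (
    sum(q for (x, y), q in joint.items() if x == 1 and in_hospital(x, y)) / pr_hospital
)

# Pr[X=1 | in hospital, Y=1]; having Y already implies being in the hospital.
pr_x_given_hospital_and_y = joint[(1, 1)] / sum(
    q for (x, y), q in joint.items() if y == 1
)

print(pr_x_given_hospital, pr_x_given_hospital_and_y)
```

Among hospital patients, learning that a patient has one disease makes the other much less likely, even though the diseases are independent in the general population.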

In general, the causal graph can look as one of the following shapes:

If $Z$ is a fork then controlling for $Z$ can tease out the causal relation. If $Z$ is a mediator or collider then controlling for $Z$ can actually make things worse.

**Backdoor paths:** If $A$ and $B$ are two random variables, we say that there is a “backdoor path” from $A$ to $B$ if there is a direct ancestor $C$ of $A$ that is connected to $B$ in the undirected version of the causal graph by a path not going through $A$.

We can show the following theorem:

**Theorem:** If there is no backdoor path from $A$ to $B$ then

$\Pr[B=b \mid A=a] = \Pr[B=b \mid \mathrm{do}(A \leftarrow a)]$.

Here is a “proof by picture”:

If there isn’t a backdoor path, we sort the graph in topological order, so that all the events that happen before $A$ are not connected to $B$ except through $A$. So we can first generate all the variables that result in $A$. Then the probability distribution of the events between $A$ and $B$ only depends on the value $a$ of $A$, and so similarly $B$ is generated from some probability distribution that only depends on $a$.
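The backdoor-path condition, as defined above, is a simple graph reachability question. The following sketch checks it with a breadth-first search; the function name and the edge-list representation are assumptions made for illustration, and it implements only the simplified definition given in the text (it does not account for colliders blocking a path).

```python
from collections import defaultdict, deque

def has_backdoor_path(edges, a, b):
    """Return True if some parent of `a` is connected to `b` in the
    undirected version of the graph by a path that avoids `a`.

    edges: iterable of (cause, effect) pairs describing a causal DAG.
    """
    edges = list(edges)
    parents_of_a = {u for u, v in edges if v == a}

    # Undirected adjacency with every edge touching `a` removed,
    # so that no path can pass through `a`.
    neighbors = defaultdict(set)
    for u, v in edges:
        if a in (u, v):
            continue
        neighbors[u].add(v)
        neighbors[v].add(u)

    # Breadth-first search starting from all parents of `a`.
    seen = set(parents_of_a)
    queue = deque(parents_of_a)
    while queue:
        node = queue.popleft()
        if node == b:
            return True
        for nxt in neighbors[node] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return False

# Scenario 1 from earlier: exercise X causes both W and H.
scenario1_edges = [("X", "W"), ("X", "H")]
print(has_backdoor_path(scenario1_edges, "W", "H"))  # W <- X -> H
print(has_backdoor_path(scenario1_edges, "X", "H"))  # X has no parents
```

In Scenario 1 there is a backdoor path from $W$ to $H$ through $X$, which matches the gap between conditional and causal probabilities seen there, and none from $X$ to $H$.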

When we design experiments, we often want to estimate *causal effects*, and to do so we try to make sure we eliminate backdoor paths.

Consider the example of a COVID vaccine trial.

We let $V$ be the event that a trial participant obtained a vaccine, and $C$ be the event that the participant was infected with COVID.

We want to figure out $\Pr[C \mid \mathrm{do}(V)]$.

However, there is a “backdoor path”.

You will not get the vaccine if you don’t participate in the trial (which we denote by $P$), but participating in the trial could change your behavior and hence have a causal effect on $C$.

To fix this we can cut the backdoor path using a placebo: it removes the confounding variable of participation, since it ensures that (conditioning on $P$) $V$ is now independent of any behavioral changes that might impact $C$.

In general, how does conditioning on some variable $Z$ affect correlations? It may introduce correlations in events that occur before $Z$, but cuts any path that depends on $Z$.

Suppose we have some treatment variable $T$ that we don’t get to control (e.g. in a natural experiment). Let $Y_t = Y \mid \mathrm{do}(T \leftarrow t)$, and we hope to estimate $\mathbb{E}[Y_1] - \mathbb{E}[Y_0]$, which is known as the **treatment effect**.

However, we worry that some underlying variable $Z$ (e.g. healthy lifestyle) can affect both $T$ and $Y$.

The *propensity score*, defined as $e(z) = \mathbb{E}[T \mid Z=z]$, allows us to calculate $\mathbb{E}[Y_1]$. We claim that as long as $Z$ is a valid confounder (for which the formula (★) holds)

$\mathbb{E}[Y_1] = \mathbb{E}\left[\frac{Y \, T}{e(Z)}\right]$.

The proof is obtained by expanding out the claim, see below
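One way to carry out the expansion is sketched here, under the assumptions that the treatment $T$ is binary, the confounder $Z$ is discrete, the propensity score $e(z) = \Pr[T=1 \mid Z=z]$ is strictly positive, and the formula (★) holds for $Z$ (the symbol names are chosen for this sketch):

```latex
\begin{align*}
\mathbb{E}\left[\frac{Y\,T}{e(Z)}\right]
&= \sum_z \Pr[Z=z]\,\frac{1}{e(z)}\,\mathbb{E}[Y\,T \mid Z=z] \\
&= \sum_z \Pr[Z=z]\,\frac{1}{e(z)}\,\Pr[T=1 \mid Z=z]\,\mathbb{E}[Y \mid T=1, Z=z] \\
&= \sum_z \Pr[Z=z]\,\mathbb{E}[Y \mid T=1, Z=z] \\
&= \mathbb{E}[Y \mid \mathrm{do}(T \leftarrow 1)].
\end{align*}
```

The second line uses that $Y\,T$ vanishes when $T=0$, the third cancels $e(z)$ against $\Pr[T=1 \mid Z=z]$, and the last line is the formula (★) applied to expectations.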

Intuitively, knowing the probability that different groups of people get treatment allows us to make $T$ independent from $Z$ and calculate the treatment effect.

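**Calculating treatment effect using ML.** Suppose that the treatment effect is $\tau$ and $Y = \psi(Z) + \tau T + \text{noise}$. Now, if we learn a model $f(z) \approx \mathbb{E}[Y \mid Z=z]$, then

$Y - f(Z) \approx \tau \, (T - e(Z))$.

Since both $Y - f(Z)$ and $T - e(Z)$ are calculable, we only need to do a linear regression.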
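The residual-on-residual regression can be illustrated on synthetic data. This is a sketch under assumptions that are not from the lecture: a discrete confounder `z`, an outcome of the form baseline(z) + tau * t + Gaussian noise, and group averages standing in for the learned models of the outcome and of the propensity score. All constants are hypothetical.

```python
import random
from statistics import mean

rng = random.Random(0)
TAU = 2.0                                   # true treatment effect (hypothetical)
PROPENSITY = {0: 0.2, 1: 0.5, 2: 0.8}       # Pr[T=1 | Z=z]
BASELINE = {0: 1.0, 1: 3.0, 2: 5.0}         # effect of the confounder on the outcome

rows = []
for _ in range(20000):
    z = rng.choice((0, 1, 2))
    t = 1 if rng.random() < PROPENSITY[z] else 0
    y = BASELINE[z] + TAU * t + rng.gauss(0.0, 1.0)
    rows.append((z, t, y))

# Naive comparison of treated vs. untreated outcomes is biased by the confounder.
naive = mean(y for _, t, y in rows if t == 1) - mean(y for _, t, y in rows if t == 0)

# "Learned" models: per-group averages estimate E[Y | Z=z] and E[T | Z=z].
f_hat = {z: mean(y for zz, _, y in rows if zz == z) for z in PROPENSITY}
e_hat = {z: mean(t for zz, t, _ in rows if zz == z) for z in PROPENSITY}

# Regress the outcome residual on the treatment residual (no intercept).
res_y = [y - f_hat[z] for z, _, y in rows]
res_t = [t - e_hat[z] for z, t, _ in rows]
tau_hat = sum(a * b for a, b in zip(res_y, res_t)) / sum(b * b for b in res_t)

print(round(naive, 2), round(tau_hat, 2))
```

With a richer confounder, the group averages would be replaced by any regression or classification model; the final step stays a one-dimensional least-squares fit.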

When we cannot observe the confounding variable, we can still sometimes use *instrumental variables* to estimate a causal effect.

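Assume a linear model $Y = \tau T + f(U)$, where $U$ is the stuff we don’t observe. If $Z$ is some variable that satisfies $\mathrm{Cov}(Z, f(U)) = 0$ then

$\tau = \frac{\mathrm{Cov}(Z, Y)}{\mathrm{Cov}(Z, T)}$,

which is the ratio between two observable quantities.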
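A small simulation shows the instrumental-variable ratio recovering a treatment effect that ordinary regression misses. The data-generating process below is an assumption made for illustration: an unobserved confounder `u` drives both treatment and outcome, the instrument `z` affects the outcome only through the treatment, and independent Gaussian noise is added to both.

```python
import random
from statistics import mean

rng = random.Random(1)
TAU = 2.0  # true treatment effect (hypothetical)

z_vals, t_vals, y_vals = [], [], []
for _ in range(20000):
    u = rng.gauss(0.0, 1.0)                 # unobserved confounder
    z = rng.gauss(0.0, 1.0)                 # instrument, independent of u
    t = z + u + rng.gauss(0.0, 1.0)         # treatment depends on both
    y = TAU * t + 3.0 * u + rng.gauss(0.0, 1.0)
    z_vals.append(z)
    t_vals.append(t)
    y_vals.append(y)

def cov(a, b):
    """Empirical covariance of two equally long sequences."""
    ma, mb = mean(a), mean(b)
    return mean((x - ma) * (y - mb) for x, y in zip(a, b))

# Ordinary least-squares slope of Y on T absorbs the confounder's effect.
ols_slope = cov(t_vals, y_vals) / cov(t_vals, t_vals)

# Instrumental-variable estimate: ratio of two observable covariances.
iv_estimate = cov(z_vals, y_vals) / cov(z_vals, t_vals)

print(round(ols_slope, 2), round(iv_estimate, 2))
```

The estimate is only meaningful when the instrument is actually correlated with the treatment; a covariance near zero in the denominator makes the ratio unstable.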

We focus on fairness in classification problems, rather than fairness in learning generative models or representation (which also have their own issues, see in particular this paper by Bender, Gebru, McMillan-Major, and “Shmitchell”).

In the public image, AI has been perceived to be very successful for some tasks, and some people might hope that it is more “objective” or “impartial” than human decisions (which are known to be fraught with bias). However, there are some works suggesting this might not be the case:

- Usage in predicting recidivism for bail decisions. For example, in ProPublica, Angwin, Larson, Mattu, and Kirchner showed that 44.9% of African Americans who didn’t reoffend were labeled higher risk, whereas only 23.5% of white defendants who didn’t reoffend were labeled as such.
- Machine vision can sometimes work better on some segments of the population than others. For example, Buolamwini and Gebru showed that some “gender classifiers” achieve 99.7% accuracy on white men but only 65.3% accuracy (not much better than a coin toss) on black women.
- Lum and Isaac gave an example of a “positive feedback loop” in predictive policing in Oakland, CA. While drug use is fairly uniform across the city, the arrests are centered on particular neighborhoods (that have more minority residents). In predictive policing, more police would be sent out to the places where arrests occurred, hence only exacerbating this disparate treatment.

Drug use in Oakland:

Drug arrests in Oakland:

While algorithms can sometimes also help, the populations they help might not be distributed equally. For example, see this table from Gates, Perry and Zorn. A more accurate underwriting model (that can better predict the default probability) enables a lender to use a more aggressive risk cutoff and so end up lending to more people.

However, this is true within each subpopulation too, so it may be that if the model is less accurate in a certain subpopulation, then a profit-maximizing lender will unfairly offer fewer loans to this subpopulation.

In the case of employment discrimination in the U.S., we have the following components:

- Protected class
  - categories such as race, sex, nationality, citizenship, veteran status, etc.
- An unfairness metric, measuring either:
  - disparate treatment
  - disparate impact.

Employers are not allowed to discriminate across protected classes when hiring. The unfairness metric gives us a way to measure if there is discrimination with respect to a protected class. In particular, disparate impact across different protected classes is often *necessary* but *not sufficient* evidence of discrimination.

To see why algorithms, which at first glance seem agnostic to group membership, may exhibit disparate treatment or impact, we consider the following Google visualization by Wattenberg, Viégas, and Hardt.

Consider a blue population and an orange population for which there is no difference in the probability of a member of either population paying back the loan, but for which our model has different accuracies; in particular, the model is more accurate on the orange population. This is described by the plot below, in which the scores correspond to the model’s prediction of the probability of paying back the loan and hollow circles correspond to those who actually do not pay back the loan, whereas filled-in circles correspond to those who do.

Suppose we are in charge of making a lending decision given the model prediction.

A scenario in which we give everyone a loan would be fair, but would be bad for us: we would go bankrupt!

Profit when giving everyone a loan:

If we wanted to **maximize profit**, we would, however, give more loans to the orange population (since we’re more sure about which members of the orange population would actually pay back their loans) by setting a lower threshold (in terms of the score given by our algorithm) above which we give out loans.

This maximizes profit but is blatantly unfair. We are treating the identical blue and orange groups differently, just because our model is more accurate on one than the other, and we also have disparate impact on the two groups. A non-defaulting applicant would be 78% likely to get a loan if they are a member of the orange group, but only 60% likely to get a loan if they are a member of the blue group.

This “profit maximization” is likely the end result of any sufficiently complex lending algorithm in the absence of a fairness intervention. Even if the algorithm does not explicitly rely on the group membership attribute, by simply optimizing it to maximize profit, it may well pick up on attributes that are correlated with group membership.

Suppose on the other hand that we wanted to mandate “equal treatment” in the sense of keeping the same thresholds for the blue and orange group. The result would be the following:

In this case, since the thresholds are identical, the algorithm will be **calibrated**. 79% of the decisions we make will be the correct ones, for both the blue and orange population. So, from our point of view, the algorithm is fair and treats the blue and orange populations identically. However, from the point of view of the applicants, this is not the case. If you are a blue applicant who will pay back your loan, you have an 81% chance of getting a loan, but if you are an orange applicant you only have a 60% chance of getting it. This demonstrates that defining fairness is quite delicate. In particular, the above “color blind” algorithm is still arguably unfair.

This difference between the point of view of the lender and lendee also arose in the recidivism case mentioned above. From the point of view of the defendant that would not recidivate, the algorithm was more likely to label them as “high risk” if they were Black than if they were white. From the point of view of the decision maker, the algorithm was calibrated, and if anything it was a bit more likely that a white defendant labeled high risk would not recidivate than a Black defendant. See the (slightly rounded and simplified) data below.

If we wanted to achieve demographic parity (both populations get same total number of loans) or equal opportunity (true positive rate same for both) then we can do so, but again using different thresholds for each group:

While the above was a hypothetical scenario, a real life example was shown by Hardt, Price and Srebro using credit (also known as FICO) scores, as described by the plot below:

For a single threshold, around 75% of Asian candidates will get loans, whereas only around 20% of Black candidates will get loans. To ensure that all groups get loans at the same rate, we would need to set the thresholds differently. In order to equalize opportunity, we’d also need to set the thresholds differently.

We see that we have different notions of what it means to be *fair* and that each of these different notions results in different algorithms.
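The interplay between thresholds and these fairness notions can be made concrete on toy data. The applicant lists below are entirely hypothetical (they are not the data behind the plots), and the helper names are illustrative; each applicant is a pair of a model score and whether they would actually repay.

```python
def rates(applicants, threshold):
    """Return (approval rate, true positive rate) when loans go to
    every applicant whose score is at least `threshold`.

    applicants: list of (score, repays) pairs.
    """
    approval_rate = sum(1 for s, _ in applicants if s >= threshold) / len(applicants)
    repayer_scores = [s for s, repays in applicants if repays]
    tpr = sum(1 for s in repayer_scores if s >= threshold) / len(repayer_scores)
    return approval_rate, tpr

def threshold_for_tpr(applicants, target):
    """Highest observed score usable as a threshold with TPR >= target."""
    candidates = sorted({s for s, _ in applicants}, reverse=True)
    for th in candidates:
        if rates(applicants, th)[1] >= target:
            return th
    return candidates[-1]

# Hypothetical groups with the same repayment rate (5 of 8), but the model's
# scores separate repayers from defaulters more cleanly in the orange group.
blue = [(30, False), (40, True), (45, False), (50, True),
        (55, False), (60, True), (70, True), (80, True)]
orange = [(20, False), (30, False), (40, False), (55, True),
          (60, True), (65, True), (75, True), (85, True)]

# One shared threshold: different true positive rates across groups.
print(rates(blue, 55), rates(orange, 55))

# Equal opportunity at a target TPR of 1.0 needs group-specific thresholds.
print(threshold_for_tpr(blue, 1.0), threshold_for_tpr(orange, 1.0))
```

The same functions can be used to search for thresholds that equalize approval rates instead, which corresponds to demographic parity.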

Berkeley graduate admissions in 1973 had the following statistics:

- 44% male applicants admitted, 35% female applicants admitted;
- However, female acceptance rate was higher at the *department level*, for most departments.

This paradox is commonly referred to as Simpson’s Paradox.
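A small numeric illustration shows how such a reversal can arise. The counts below are hypothetical and are not the Berkeley data; they describe two departments, one easy to get into and one hard, with women applying mostly to the hard one.

```python
from fractions import Fraction as F

# Hypothetical (applicants, admitted) counts per department and gender.
data = {
    "A": {"men": (800, 480), "women": (100, 70)},   # high admission rate
    "B": {"men": (200, 20), "women": (900, 135)},   # low admission rate
}

def rate(applied, admitted):
    return F(admitted, applied)

# Admission rate of each gender within each department.
per_dept = {
    dept: {gender: rate(*counts) for gender, counts in groups.items()}
    for dept, groups in data.items()
}

def overall(gender):
    """Admission rate of a gender aggregated over all departments."""
    applied = sum(data[dept][gender][0] for dept in data)
    admitted = sum(data[dept][gender][1] for dept in data)
    return rate(applied, admitted)

for dept, rates_by_gender in per_dept.items():
    print(dept, rates_by_gender["men"], rates_by_gender["women"])
print("overall", overall("men"), overall("women"))
```

Women have the higher admission rate in every department, yet the lower rate overall, because the choice of department is correlated with gender.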

A “fair” causal model for this scenario might be as follows:

In the above, perhaps gender has a causal impact on the choice of department to which the applicant applies. However, a fair application process would, conditional on the department, be independent of gender of the applicant.

However, not all models that follow this causal structure are necessarily *fair*. In the case Griggs v. Duke Power Co., 1971, the court ruled that decision-making under the following causal model was *unfair*:

While the model appears to be fair, since the job offer is conditionally independent of race, given the diploma, the court ruled that the job did not actually require a high school diploma. Hence, using the diploma as a factor in hiring decisions was really just a proxy for race, resulting in essentially purposeful unfair discrimination based on race. This creation of proxies is referred to as redlining.

We cannot come up with universal fairness criteria. The notion of fairness itself is based on assumptions about:

- representation of data
- relationships to unmeasured inputs and outcomes
- causal relation of inputs, predictions, outcomes.

Fairness depends on what we choose to measure and observe, in both inputs and outputs, and how we choose to act upon them. In particular, we have the following causal structure, wherein measured inputs, decision-making, and measured outcomes all play a role in affecting the real world and function together in a feedback cycle:

A more comprehensive illustration is given in this paper of Friedler, Scheidegger, and Venkatasubramanian:

- June 16th 10am-11am PDT (1pm-2pm EDT). Virginia Vassilevska Williams on a Refined Laser Method and Faster Matrix Multiplication
- August 5 10am-11am PDT (1pm-2pm EDT). Yuansi Chen on An Almost Constant Lower Bound of the Isoperimetric Coefficient in the KLS Conjecture

*Local algorithms — that is, algorithms that compute and make decisions on parts of the output considering only a portion of the input — have been studied in a number of areas in theoretical computer science and mathematics. Some of the related areas include sublinear-time algorithms, distributed algorithms, streaming algorithms, (massively) parallel algorithms, inference in large networks, and graphical models. These communities have similar goals but a variety of approaches, techniques, and methods. This workshop is aimed at fostering dialogue and cross-pollination of ideas between the various communities.*

This year, the workshop will include:

- **A poster session**: Please submit your poster proposal (title and abstract) by **May 26th**. Everyone is invited to contribute. This session will take place on gather.town.
- **Invited long talks**: the tentative schedule is available, and features talks by James Aspnes, Jelani Nelson, Elaine Shi, Christian Sohler, Uri Stemmer, and Mary Wootters.
- **Junior-Senior social meetings**
- **An AMA (Ask Me Anything) session**, moderated by Merav Parter
- **A Slack channel**
- **An Open Problems session**

The Program Committee of WOLA 2021 is comprised of:

- Venkatesan Guruswami (CMU)
- Elchanan Mossel (MIT)
- Merav Parter (Weizmann Institute of Science)
- Sofya Raskhodnikova **(chair)** (Boston University)
- Gregory Valiant (Stanford)

and the organizing committee:

- Sebastian Brandt (ETH)
- Yannic Maus (Technion)
- Slobodan Mitrović (MIT)

For more detail, see the website.

- 10 year award – STOC 2007-2011
- 20 year award – STOC 1997-2001
- 30 year award – STOC 1987-1991

The award website ( https://sigact.org/prizes/stoc_tot.html ) helpfully contains links to the papers published in all these conferences.

Please nominate the papers you think have most influenced our field!

Welcome to ALT Highlights, a series of blog posts spotlighting various happenings at the recent conference ALT 2021, including plenary talks, tutorials, trends in learning theory, and more! To reach a broad audience, the series will be disseminated as guest posts on different blogs in machine learning and theoretical computer science. Boaz has kindly agreed to host a post in this series. This initiative is organized by the Learning Theory Alliance, and overseen by Gautam Kamath. All posts in ALT Highlights are indexed on the official Learning Theory Alliance blog.

This is the third post in the series, an interview with Constantinos Daskalakis and coverage of his ALT 2021 keynote talk, written by Kush Bhatia and Cyrus Rashtchian.

To make a decision for ourselves, we need to think about the impact of our actions on our objectives. But, when our actions affect other people and their actions affect our objectives, we also need to consider their incentives, and choose our actions in anticipation of theirs. This increase in complexity also occurs in situations involving multiple decision-making machines (e.g., self-driving cars), automated systems (e.g., algorithmic stock trading), or living organisms (e.g., groups of cells).

Studying decision-making in so-called multi-agent environments has been a question of interest to mathematicians and economists for centuries. In the 1830s, Antoine Augustin Cournot developed a theory of competition to model oligopolies, inspired by observing competition in a spring water duopoly. Jumping forward to the 20th century, researchers converged on the view that a fruitful approach to analyzing multi-agent systems is studying them at “equilibrium”, that is, in situations where the system is stable in the sense that all parties are satisfied with their actions. A fundamental concept of equilibrium, studied by John von Neumann, and later by von Neumann and Oskar Morgenstern, is a collection of actions, one per agent, such that no agent has an incentive to deviate from their choice given the actions of the other agents. While this was a nice proposal, they could only show that such an equilibrium is guaranteed to exist in situations conforming to what is called a “two-player zero-sum game”. In such games, two agents are in exact competition with each other; whatever one player wins the other loses.^{1} In 1950, John F. Nash showed that this notion of equilibrium, named Nash equilibrium in his honor, indeed exists for most naturally occurring multi-agent problems.^{2}

While Nash established the existence of equilibrium, one question bothered economists and computer scientists alike: “*Is it possible to efficiently find the Nash equilibrium of a game*?”

Throughout the rest of the 20th century, many people proposed algorithms for computing Nash equilibrium, but none of them succeeded in showing that it can be done efficiently (i.e., in polynomial time in the size of the game). During his early graduate school days at UC Berkeley, Constantinos (a.k.a. Costis) Daskalakis obsessed over this question. Then, in 2006, Costis, along with co-authors Paul Goldberg and Christos Papadimitriou, who was also his PhD advisor, showed a surprising result: finding a Nash equilibrium is computationally intractable!

A central concept in analyzing multi-agent systems is the *utility function* of each interacting agent. This is a function that captures the value that the agent derives as a function of their own action as well as those of the other agents. A common assumption is that the utility function of each agent is a concave function of their own action for any collection of actions committed by the others. Concavity often arises when agents trade off benefits and costs from their actions taking into account properties like diminishing returns and risk-aversion (see Figure 1 for an illustration). It is also crucial in guaranteeing that equilibria exist.

Recently, Costis has shifted his attention to the more general setting where the underlying utilities can be arbitrary non-concave functions of the agents’ actions. “Earlier, I was interested in the problem of equilibrium computation for its fundamental applications in Game Theory and Economics and its intimate connections to duality theory, topology and complexity theory. As Machine Learning is now moving towards multi-agent learning, studying more general setups arising from nonconcave agent utilities becomes increasingly relevant,” says Costis. He elaborates that the recent success of deep learning methods has largely been in single-agent setups and that the next frontier is to replicate this success in multi-agent settings. It is at this intersection of deep learning and multi-agent learning that non-concave utility functions arise — when actions correspond to setting the parameters of deep neural networks, agent utilities quickly become non-concave in the space of these parameters (see Figure 1).

The focus of Costis’ keynote talk, on joint work with Stratis Skoulakis and Manolis Zampetakis [DSZ21], was on the simplest multi-agent problem: two-player zero-sum games. In these games, two players, the *min* and the *max* player, choose actions $x$ and $y$ respectively, which are constrained to lie in some compact and convex set $S$, i.e., $(x, y) \in S$. The agents share some objective function $f(x, y)$ that *min* wants to minimize and *max* wants to maximize. Classical studies, going back to von Neumann’s celebrated work [vN28], focus on when this objective is a convex function of the *min* player’s action and a concave function of the *max* player’s action, the so-called “convex-concave setting”. For any function $f$ which is convex-concave, there exists a Nash equilibrium, i.e., a point $(x^*, y^*) \in S$ satisfying

$f(x^*, y^*) \le f(x, y^*)$ for all $x$ with $(x, y^*) \in S$, and $f(x^*, y^*) \ge f(x^*, y)$ for all $y$ with $(x^*, y) \in S$. (1)

Thus the *min* (*max*) player has no incentive to deviate from $x^*$ ($y^*$) as long as the other player remains fixed. The existence of such $(x^*, y^*)$ follows from von Neumann’s minimax theorem and Rosen’s generalization of this theorem to the case that agents’ actions are jointly constrained [Ros65]. However, when $f$ is not convex-concave, the minimax theorem fails, and we lose the existence of Nash equilibrium. For a simple example, consider $f(x, y) = (x - y)^2$ over the space $[0, 1]^2$. Given any decision choice $(x, y)$, if $x \neq y$, the *min* player should move towards $y$. Otherwise, if $x = y$, the *max* player wants to move away from $x$. Thus, no pair $(x, y)$ satisfies equation (1).
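The non-existence argument can be sanity-checked numerically. The sketch below assumes the example objective is $f(x, y) = (x - y)^2$ on the unit square (this choice of $f$ and the grid resolution are assumptions of the sketch) and searches a grid for points where neither player can improve by deviating to another grid point. It is a numerical check, not a proof.

```python
def f(i, j, n):
    """Objective (x - y)^2 at the grid point x = i/n, y = j/n, scaled by n^2
    so that all values are exact integers."""
    return (i - j) ** 2

def grid_nash_points(n):
    """All grid points (i/n, j/n) where min cannot decrease f by changing x
    and max cannot increase f by changing y (deviations restricted to the grid)."""
    points = []
    for i in range(n + 1):
        for j in range(n + 1):
            value = f(i, j, n)
            min_is_happy = all(value <= f(k, j, n) for k in range(n + 1))
            max_is_happy = all(value >= f(i, k, n) for k in range(n + 1))
            if min_is_happy and max_is_happy:
                points.append((i / n, j / n))
    return points

nash_points = grid_nash_points(10)
print(nash_points)
```

The min player is only content on the diagonal $x = y$, where the objective is zero, and there the max player can always gain by moving away, so the search comes back empty.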

Nonconvex-nonconcave utility functions naturally arise in adversarial training applications, such as Generative Adversarial Network (GAN) training, where the goal is to learn how to generate new data, such as images, from the same distribution that generated a collection of given data. Specifically, GANs are trained by trying to identify the equilibrium of a two-player zero-sum game between a generator model (the *min* player) and a discriminator model (the *max* player). Each of these models is viewed as an agent, choosing parameters in a deep neural network, and the objective, capturing how close the generated distribution is to the target distribution, amounts to a nonconvex-nonconcave function of the underlying network parameters, which the *min* player aims to minimize and the *max* player aims to maximize.

Given that Nash equilibria may not exist when the objective is not convex-concave, what type of solutions should we target when studying two-player zero-sum games with such objectives? “One property that we would like our target solutions to possess is that they are universal, i.e. they are guaranteed to exist for any objective function. We can take them to a practitioner and tell them that they are always plausible targets for their computations,” says Costis. With this in mind, Costis and his co-authors consider a relaxed equilibrium concept, called $(\varepsilon, \delta)$-Nash equilibrium. This is a pair $(x^*, y^*) \in S$ satisfying

$f(x^*, y^*) < f(x, y^*) + \varepsilon$ for all $x$ with $\|x - x^*\| \le \delta$ and $(x, y^*) \in S$, and $f(x^*, y^*) > f(x^*, y) - \varepsilon$ for all $y$ with $\|y - y^*\| \le \delta$ and $(x^*, y) \in S$.

This relaxes Nash equilibrium: given strategy $y^*$ for the *max* player, the *min* player can improve by at most $\varepsilon$ by changing their action in a ball of radius $\delta$ around $x^*$, and a symmetric condition holds for the *max* player, given strategy $x^*$ for the *min* player. One of the main insights of their paper is that such local Nash equilibria *are guaranteed to* exist as long as $\delta$ is a small enough function of $\varepsilon$ and $f$’s smoothness, namely whenever $\delta < \sqrt{2\varepsilon/L}$ where $L$ is $f$’s smoothness. This non-trivial result is established via an application of Brouwer’s fixed point theorem.
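To make the relaxed condition concrete, here is a minimal sketch of a checker for one-dimensional actions on $[0, 1]$. It assumes the definition in which the *min* player may gain less than `eps` by moving within distance `delta`, and symmetrically for the *max* player; it tests only finitely many sampled deviations, and the example objective $(x - y)^2$ and the function name are assumptions of the sketch.

```python
def is_local_nash(f, x_star, y_star, eps, delta, samples=201):
    """Check the (eps, delta)-local Nash condition at (x_star, y_star)
    against `samples` evenly spaced deviations within distance delta,
    clipped to the interval [0, 1]."""
    def nearby(center):
        points = [center - delta + 2 * delta * k / (samples - 1)
                  for k in range(samples)]
        return [p for p in points if 0.0 <= p <= 1.0]

    value = f(x_star, y_star)
    # min player: no nearby x lowers the objective by eps or more.
    min_ok = all(value < f(x, y_star) + eps for x in nearby(x_star))
    # max player: no nearby y raises the objective by eps or more.
    max_ok = all(value > f(x_star, y) - eps for y in nearby(y_star))
    return min_ok and max_ok

def objective(x, y):
    return (x - y) ** 2

# With a small radius the diagonal point (0.5, 0.5) passes the check,
# even though this objective has no exact Nash equilibrium.
small_radius = is_local_nash(objective, 0.5, 0.5, eps=0.01, delta=0.05)
# With a larger radius the max player can gain more than eps by moving away.
large_radius = is_local_nash(objective, 0.5, 0.5, eps=0.01, delta=0.2)
print(small_radius, large_radius)
```

The contrast between the two radii mirrors the statement above: existence is only promised when the radius is small relative to the allowed slack.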

The next question, pertaining to computational complexity, is to determine whether finding an $(\varepsilon, \delta)$-Nash equilibrium is algorithmically tractable in the regime of parameters (small enough $\delta$) where it is guaranteed to exist. As a first step, Costis and his co-authors focus on first-order algorithms, which have access to the gradient of the objective function. Examples include gradient descent and variants thereof, which have been the main computational engine behind the success of deep learning in single-agent problems. A classical result known for these methods in minimization settings is that they are efficient in computing $(\varepsilon, \delta)$-minima of non-convex, smooth objectives $f$. These are points $x^*$ such that $f(x^*) \le f(x) + \varepsilon$ for all feasible $x$ such that $\|x - x^*\| \le \delta$. Namely, given query access to the gradient $\nabla f$ of some $L$-smooth objective $f$ with values normalized to $[0, 1]$, it is possible to compute $(\varepsilon, \delta)$-minima in polynomially many, in $1/\varepsilon$ and $L$, steps and queries to $\nabla f$, as long as $\delta < \sqrt{2\varepsilon/L}$. In contrast to minimization, a main contribution of Costis’ work is to establish an intractability result for min-maximization, showing that the number of gradient queries for any first-order algorithm to compute $(\varepsilon, \delta)$-Nash equilibria must be exponential in at least one of $1/\varepsilon$, the dimension $d$, or the smoothness $L$ of the objective.

Theorem 1 (informal). First-order methods need a number of queries to $\nabla f$ that is exponential in at least one of $1/\varepsilon$, $L$, or the dimension $d$ to find $(\varepsilon, \delta)$-Nash equilibria, even when $\delta < \sqrt{2\varepsilon/L}$, i.e. in the regime in which they are guaranteed to exist.

This theorem tells us that there exist objective functions for which the min-maximization problem can be computationally intractable for any first-order algorithm. Requiring many queries is one way to say that the problem is hard in practice. Indeed, practitioners have found it notoriously hard to get the discriminator-generator neural networks to converge to good solutions for generative modelling problems using gradient-based methods.

From a technical perspective, this work represents a new approach for proving lower bounds in optimization. Classical lower bounds in the optimization literature, going back to Nemirovsky and Yudin [NY83] target the black-box setting: an algorithm is given access to an oracle which outputs some desired information about a function when presented with a query. For example, a first-order oracle outputs the gradient of a function at a given input. Costis shared that he and his co-authors first tried to construct a black-box lower bound for local Nash equilibria directly. However, they were unsuccessful. Any direct construction they tried ended up introducing spurious local Nash equilibria, which first-order algorithms might find in polynomial time. Their direct attempts at a lower bound failed to capture the computational hardness of the problem. They quickly realized that they needed a deeper understanding of the problem at hand, better insight on what made it harder than minimization.

That insight came when they switched to studying the complexity of the white-box version of the problem, wherein the optimization algorithm can look inside the oracle that computes $f$ and possibly $\nabla f$ as well. One might wonder why one would want to consider such white-box models over black-box ones if their goal is to prove intractability results for methods that have limited access to $f$. Indeed, proving an intractability result in the white-box model is much harder because the set of algorithms that use white-box access to the objective is strictly bigger than those using only black-box access. However, the key difference is that we are not looking for the same kind of hardness in the two models. In the black-box model, we are looking for unconditional computational hardness, that is, showing that any algorithm will require exponentially many queries to the gradient oracle. On the other hand, in the white-box model, we would like to show complexity-theoretic hardness, i.e., show that solving the problem at hand is at least as hard (or exactly as hard) as solving the hardest problems in some complexity class. Such a complexity-theoretic hardness result is conditional; it says that solving this problem will be computationally intractable as long as some computational complexity conjecture, such as P $\neq$ NP, holds. Importantly, showing hardness (or completeness) of a problem in some complexity class typically entails a fine-grained understanding of the nature of the problem and how that enables it to encode other problems in the target complexity class.

In the white-box model, the authors show that the problem of computing local Nash equilibria is PPAD-complete. In other words, computing this local equilibrium concept in zero-sum games with nonconvex-nonconcave objectives is exactly as hard as computing Nash equilibria in general-sum games with concave agent utilities.^{3} This result is established by exhibiting a reduction from a variant of the Sperner coloring problem,^{4} which is a PPAD-complete problem, to a discrete nonconvex-nonconcave min-maximization problem, where the two agents choose points on a hypergrid.

Having established this result, Costis and his coauthors presumed that the hardest part of the problem was behind them. However, another challenge awaited them. They still had to construct a continuous interpolation of their discrete function to satisfy the desired Lipschitz and smoothness properties in a computationally efficient manner. To understand the challenge with this, consider a simple two-dimensional example with two actions per agent. Suppose we are given prescribed values for $f$ on all four vertices of $[0,1]^2$ and our goal is to construct a continuous and smooth function on $[0,1]^2$, which matches the prescribed function values at the corners. A simple approach is to define $f$ at any point $(x, y)$ using a smooth interpolation of all four corners of $[0,1]^2$. This works, but does not scale to high dimensions. An approach that would scale computationally in high dimensions is to first triangulate $[0,1]^2$ by chopping it along its main diagonal and then interpolate the function on each triangle separately. However, this simple approach fails since the gradient of the interpolated function can be discontinuous when crossing the diagonal. “This part turned out to be more technically challenging than we had thought in high dimensions,” says Costis. He and his coauthors overcame the issue by proposing a new *smooth and computationally efficient interpolation* scheme, which they expect will have more applications in transferring hardness results from discrete problems to continuous problems.
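The two naive options in the two-dimensional example can be sketched as follows. This is an illustration under assumptions, not the interpolation scheme of the paper: bilinear interpolation stands in for "a smooth interpolation of all four corners", the corner values are made up, and the function names are hypothetical.

```python
def bilinear(corners, x, y):
    """Interpolate from all four corners; corners[i, j] is the value at vertex (i, j)."""
    return (corners[0, 0] * (1 - x) * (1 - y) + corners[1, 0] * x * (1 - y)
            + corners[0, 1] * (1 - x) * y + corners[1, 1] * x * y)

def triangulated_gradient(corners, x, y):
    """Gradient of the piecewise-linear interpolant after cutting the square along y = x."""
    if y <= x:  # lower triangle with vertices (0,0), (1,0), (1,1)
        return (corners[1, 0] - corners[0, 0], corners[1, 1] - corners[1, 0])
    # upper triangle with vertices (0,0), (0,1), (1,1)
    return (corners[1, 1] - corners[0, 1], corners[0, 1] - corners[0, 0])

# Hypothetical prescribed values at the four vertices of the unit square.
corners = {(0, 0): 0.0, (1, 0): 1.0, (0, 1): 0.0, (1, 1): 0.0}

center_value = bilinear(corners, 0.5, 0.5)
grad_below = triangulated_gradient(corners, 0.50, 0.49)  # just below the diagonal
grad_above = triangulated_gradient(corners, 0.49, 0.50)  # just above the diagonal
print(center_value, grad_below, grad_above)
```

The gradient jumps from `grad_below` to `grad_above` across the diagonal, which is the failure mode described above. The bilinear formula has no such jump inside the square, but its $d$-dimensional analogue sums over all $2^d$ corners of a cell, which is why it does not scale.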

To obtain Theorem 1, the authors show that one can translate their complexity-theoretic hardness in the white-box model to an unconditional intractability result in the black-box model. This follows immediately from their reduction from Sperner to local Nash equilibrium. Indeed, finding local Nash equilibria in the min-max instance at the output of their reduction provides solutions to the Sperner instance at its input. Moreover, a single query (function value or gradient value) to the min-max objective requires queries to the Sperner coloring circuit in order to be computed. Finally, it is known that, with black-box queries to the Sperner coloring circuit, exponentially many queries are necessary to compute a solution [HPV89, Pap94]. An exponential black-box lower bound for local Nash equilibrium thus follows.

This proof architecture can be used more generally to prove intractability results for optimization problems involving smooth objectives. First, ignore the black-box model and focus on identifying the complexity class that captures the complexity of the problem in the white-box model. Then, focus on obtaining a hardness result for a discrete version of the problem. Once this is established, one can use the techniques presented in this work to lift this intractability from the discrete to the continuous problem. If there are black-box lower bounds for any problem residing in the complexity class for which the white-box version of the problem is hard, then these lower bounds can be composed with hardness reductions to establish lower bounds for the black-box version of the problem. As an aside, Costis mentions that it would be interesting if one could establish the lower bound of Theorem 1 in the black-box model directly, i.e., without going through the PPAD machinery.

Costis ends on an optimistic note: “While this might appear as a negative result, it really is a positive one.” Explaining further, he says that a philosophical consequence of his intractability results is that the multi-agent future of deep learning is going to have a lot of interesting “texture” — it will involve a large breadth of communities and motivate a plethora of problems at the interface of theoretical computer science, game theory, economics and machine learning. Costis envisions a change in balance in the multi-agent world: while recent successes of deep learning in single-agent problems capitalize on access to large data, unprecedented computational power, and effective inductive biases, multi-agent problems will demand much stronger inductive biases, invoking domain expertise in order to develop effective models and useful learning targets, as well as to discover algorithms that attain those targets.

Indeed, domain expertise has been crucial in some recent high-profile machine learning achievements in multi-agent settings: the AlphaGo agent for playing the game of Go and the Libratus agent for playing Texas Hold’em. For these, a game-theoretic understanding has been infused into the structure and training of the learning algorithm. In addition to using deep neural networks, AlphaGo uses a Monte Carlo tree search procedure to determine the best next move as well as to collect data for the training of the neural networks during self play, while Libratus uses counterfactual regret minimization to approximate the equilibrium of the game. The success of both algorithms required combining machine learning expertise with game-theoretic expertise about how to solve the games at hand.

More broadly, Costis urges young researchers to move beyond the classical statistical paradigm which assumes independent and identically distributed observations, and embrace learning challenges that are motivated from learning problems with state, incomplete or biased data, data with dependencies, and multi-agent learning applications. In particular, he would like to see more activity in obtaining new models and algorithms for reinforcement and multi-agent learning, better tools for high-dimensional learning problems with data bias and dependencies, as well as deeper connections to causal inference and econometrics. *“There is a lot of beautiful mathematics to be done and new continents to explore motivated by these challenges.”*

We are thankful to Margalit Glasgow, Gautam Kamath, Praneeth Netrapalli, Arun Sai Suggala, and Manolis Zampetakis for providing valuable feedback on the blog. We would like to especially thank Costis Daskalakis for helpful conversations related to the technical and philosophical aspects of this work, and valuable comments throughout the writing of this article.

**Notes**

^{1} As we discuss later, this existence only holds when the gains of each player are a concave function of their own actions.

^{2} This led to Nash winning the Nobel prize in Economics in 1994 with John Harsanyi and Reinhard Selten.

^{3} PPAD is the complexity class that exactly captures the complexity of computing Nash equilibria in general-sum games, computing fixed points of Lipschitz functions in convex and compact domains, and many other equilibrium and fixed point computation problems.

^{4} In Sperner, we are given white-box access to a circuit that computes colors for the vertices of some canonical simplicization of the $d$-dimensional simplex. Each vertex receives one of the colors in $\{1, \ldots, d+1\}$ and each color $i$ does not appear on any vertex of the triangulation lying on facet $i$ of the simplex. The goal is to find a simplex of the triangulation with all vertices colored differently.

**Bibliography**

[DSZ21] Constantinos Daskalakis, Stratis Skoulakis, and Manolis Zampetakis. The complexity of constrained min-max optimization. Symposium on Theory of Computing, 2021.

[HPV89] Michael D. Hirsch, Christos H. Papadimitriou, and Stephen A. Vavasis. Exponential lower bounds for finding Brouwer fixed points. Journal of Complexity, 1989.

[NY83] Arkadi S. Nemirovsky and David B. Yudin. Problem complexity and method efficiency in optimization. Wiley, 1983.

[Pap94] Christos H. Papadimitriou. On the complexity of the parity argument and other inefficient proofs of existence. Journal of Computer and System Sciences, 1994.

[Ros65] J. Ben Rosen. Existence and uniqueness of equilibrium points for concave n-person games. Econometrica, 1965.

[vN28] John von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 1928.

Please join us for a virtual Google workshop on “Conceptual Understanding of Deep Learning”.

When: May 17th 9am-4pm. Where: Live over YouTube.

**Goal:** How does the Brain/Mind (perhaps even an artificial one) work at an algorithmic level? While deep learning has produced tremendous technological strides in recent decades, there is an unsettling feeling of a lack of “conceptual” understanding of why it works and to what extent it will work in the current form. The goal of the workshop is to bring together theorists and practitioners to develop an understanding of the right algorithmic view of deep learning, characterizing the class of functions that can be learned, coming up with the right learning architecture that may (provably) learn multiple functions, concepts and remember them over time as humans do, theoretical understanding of language, logic, RL, meta learning and lifelong learning.

The speakers and panelists include Turing award winners Geoffrey Hinton and Leslie Valiant, and Gödel Prize winner Christos Papadimitriou (full-details).

**Panel Discussion:** There will also be a panel discussion on the fundamental question of “Is there a mathematical model for the Mind?”. We will explore basic questions such as “Is there a provable algorithm that captures the essential capabilities of the mind?”, “How do we remember complex phenomena?”, “How is a knowledge graph created automatically?”, “How do we learn new concepts, function and action hierarchies over time?” and “Why do human decisions seem so interpretable?”

Twitter: #ConceptualDLWorkshop. Please help advertise on mailing-lists/blog-posts and Retweet.

Hope to see you there!

Rina Panigrahy

(http://theory.stanford.edu/~rinap)

**Previous post:** Natural Language Processing – guest lecture by Sasha Rush **Next post:** TBD. See also all seminar posts and course webpage.

See also video of lecture. **Lecture slides:** Original form: main / bandit analysis. Annotated: main / bandit analysis.

Sham Kakade is a professor in the Department of Computer Science and the Department of Statistics at the University of Washington, as well as a senior principal researcher at Microsoft Research New York City. He works on the mathematical foundations of machine learning and AI. He is the recipient of several awards, including the ICML Test of Time Award (2020), the IBM Pat Goldberg best paper award (2007), and the INFORMS Revenue Management and Pricing Prize (2014).

Sham is writing a book on the theory of reinforcement learning with Agarwal, Jiang and Sun.

Reinforcement learning has found success in a great number of fields because it is a very “natural framework” for interactive learning. It is based around the notion of experimenting with different behaviors in one’s environment and learning from mistakes to identify the optimal strategy. However, we still lack an understanding of how best to design reinforcement learning algorithms when the agent’s environment and rewards are uncertain. It is therefore important to develop theoretical foundations for generalization in reinforcement learning. The primary question these notes will address is as follows:

**What are necessary representational and distributional conditions that enable provably sample-efficient reinforcement learning?**

We will answer this question in the following parts.

- **Part I: Bandits & Linear Bandits.** “Bandit problems” correspond to RL where the environment is reset in each step (horizon H=1). This captures the unknown reward function of RL, but not the way the environment changes in response to the agent’s actions. This part will be based on the papers Dani-Hayes-Kakade 08 and Srinivas-Krause-Kakade-Seeger 10.
- **Part II: Lower Bounds.** RL is very much *not* a solved problem in either theory or practice. Even the RL analog of linear regression, when the expected reward is a linear function of the actions, is not solved. We will see that this is for a good reason: there is an exponential lower bound on the number of steps it takes to find a nearly-optimal policy in this case. This part is based on the recent paper Weisz-Amortila-Szepesvári 20 and the follow-up Wang-Wang-Kakade 21.
- **Interlude:** Do these lower bounds matter in practice?
- **Part III: Upper Bounds.** Given the lower bound, we see that to get positive results (aka *upper bounds* on the number of steps) we need to make strong assumptions on the structure of rewards. A number of incomparable such assumptions have been used, and we will see that there is a way to unify them. This part is based on the recent paper Du-Kakade-Lee-Lovett-Mahajan-Sun-Wang 21.

Before all of these parts, we will start by introducing the general framework of Markov Decision Processes (MDPs) and do a quick tour of generalization for static learning and RL.

We have an agent in an environment at state $s$ that takes some action $a$, observes some reward $r$, and causes the environment to move to a new state $s'$.

The following are some key terms that we will need throughout the rest of the notes:

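- *State Space, Action Space, Policy*: We denote the state space as $\mathcal{S}$, and the action space as $\mathcal{A}$. A policy is a mapping from states to actions: $\pi: \mathcal{S} \to \mathcal{A}$.
- *Trajectory*: The sequence of states, actions, rewards $(s_0, a_0, r_0, \ldots, s_{H-1}, a_{H-1}, r_{H-1})$ an agent sees for a horizon of $H$ timesteps.
- *State Value at time $h$*: $V^\pi_h(s)$, the expected cumulative reward starting from state $s$ at time $h$ and using policy $\pi$ afterwards.
- *State Action Value at time $h$*: $Q^\pi_h(s, a)$, the expected cumulative reward given a state-action tuple $(s, a)$ starting from time $h$ and using policy $\pi$ afterwards.
- *Optimal value and state-value function*: we define an optimal policy by $\pi^*$, and the associated optimal $Q$-function and value function by $Q^*_h$ and $V^*_h$ respectively (or equivalently, $Q^*_h = Q^{\pi^*}_h$, $V^*_h = V^{\pi^*}_h$). Note that $Q^*_h$ and $V^*_h$ can be defined via the Bellman optimality equation as follows:

$$Q^*_h(s, a) = \mathbb{E}\left[ r_h + V^*_{h+1}(s_{h+1}) \mid s_h = s, a_h = a \right], \qquad V^*_h(s) = \max_{a \in \mathcal{A}} Q^*_h(s, a),$$

where we additionally define $V^*_H(s) = 0$ for all $s \in \mathcal{S}$.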

**Goal:** To find a policy $\pi$ that maximizes the cumulative reward over a horizon of $H$ steps, starting from an initial state $s_0$. In the episodic setting, one starts at state $s_0$, acts for $H$ steps, and then repeats.
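If the transition and reward model were known, an optimal policy could be computed by backward induction on the Bellman optimality equation. The sketch below assumes a small tabular MDP with steps indexed $h = 0, \ldots, H-1$ and terminal value zero; the function name and the two-state example are illustrative assumptions, not part of the lecture. The learning problem studied in these notes is harder precisely because the model is not given.

```python
def backward_induction(P, R, H):
    """Finite-horizon dynamic programming for a known tabular MDP.

    P[s][a][s2] is the probability of moving from s to s2 under action a,
    R[s][a] is the expected reward. Returns (V, policy) with V[h][s] the
    optimal value at time h and policy[h][s] a greedy action.
    """
    num_states, num_actions = len(R), len(R[0])
    V = [[0.0] * num_states for _ in range(H + 1)]  # V[H][s] = 0 by convention
    policy = [[0] * num_states for _ in range(H)]
    for h in range(H - 1, -1, -1):
        for s in range(num_states):
            q = [R[s][a] + sum(P[s][a][s2] * V[h + 1][s2] for s2 in range(num_states))
                 for a in range(num_actions)]
            policy[h][s] = max(range(num_actions), key=q.__getitem__)
            V[h][s] = q[policy[h][s]]
    return V, policy

# Hypothetical MDP: state 1 pays reward 1 per step and is absorbing;
# from state 0, action 1 moves to state 1 and action 0 stays put.
P = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [0.0, 1.0]]]
R = [[0.0, 0.0], [1.0, 1.0]]
V, policy = backward_induction(P, R, H=3)
print(V[0], policy[0])
```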

There are three main challenges that we face in reinforcement learning:

- *Exploration:* The total size and states of the environment may be unknown.
- *Credit Assignment:* We need to assign rewards to actions even if the rewards are delayed.
- *Large State/Action Spaces:* We face the curse of dimensionality.

**We will deal with these problems by framing them in terms of generalization.**

As we have seen in the first lecture of this course, generalization is possible in the supervised learning setting, when the data are drawn i.i.d. from a fixed distribution.

Specifically, we have the following bound:

Occam’s Razor Bound (Finite Hypothesis Class): To learn a policy that is $\varepsilon$-close to the best policy in a hypothesis class $\mathcal{F}$, we need a number of samples that is $O(\log|\mathcal{F}| / \varepsilon^2)$.

This means we can try lots of things on our data to see which hypotheses are $\varepsilon$-best. To handle infinite hypothesis classes, we can replace $\log|\mathcal{F}|$ with various other “complexity measures” to obtain generalization bounds such as:

- VC Dimension:
- Classification (Margin Bounds):
- Linear Regression:
- Deep Learning: Algorithm also determines the complexity control

Another way to say this is that in all of these cases, we can bound the generalization gap by a quantity of the form $O(\sqrt{C(\mathcal{F})/n})$ where $C(\mathcal{F})$ is some “complexity measure” of the class and $n$ is the number of samples.
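As a concrete instance of the finite-class bound, the following sketch computes a sufficient sample size from Hoeffding's inequality and a union bound, assuming losses in $[0, 1]$: with $n \ge \log(2|\mathcal{F}|/\delta) / (2\varepsilon^2)$ samples, every hypothesis has empirical loss within $\varepsilon$ of its true loss with probability at least $1 - \delta$. The function name and the example numbers are illustrative assumptions.

```python
import math

def occam_sample_size(num_hypotheses, eps, delta):
    """Samples sufficient for uniform eps-accuracy over a finite class.

    Hoeffding gives P(|empirical - true| > eps) <= 2 * exp(-2 * n * eps**2) per
    hypothesis; a union bound over the class is at most delta once
    n >= log(2 * num_hypotheses / delta) / (2 * eps**2).
    """
    return math.ceil(math.log(2 * num_hypotheses / delta) / (2 * eps ** 2))

n_small = occam_sample_size(1_000, eps=0.1, delta=0.05)
n_large = occam_sample_size(1_000_000, eps=0.1, delta=0.05)
print(n_small, n_large)
```

Making the class a thousand times larger increases the requirement by less than a factor of two, which is the logarithmic dependence that lets us "try lots of things" on the same data.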

One reference for these generalization results in the supervised learning setting is the following book by Sanjeev Arora and collaborators.

The key enabler of generalization in supervised learning is *data reuse*. For a given training set, we can in principle simultaneously evaluate the loss of all hypotheses in our class. For example, given the fixed ImageNet dataset, we can evaluate performance on any classifier. As we will see, this is not a property that will always hold in RL (when it does hold, sample-efficient generalization is likely to follow).

Consider a tabular MDP setting where $S$, $A$ and $H$ denote the number of states, number of actions and length of the horizon respectively. Suppose we are operating in a setup where $S$ and $A$ are both small. Suppose also that the MDP is **unknown**.

Our goal in such a setting is to find an $\varepsilon$-optimal policy $\pi$ such that $V^{\pi}_0(s_0) \ge V^{\pi^*}_0(s_0) - \varepsilon$, where $s_0$ is the initial state (for concreteness let’s assume it is deterministic), and $\pi^*$ is **truly** an optimal policy for this MDP. Since we assume the number of states and actions to be small, it is possible to explore the entire world, and finding such an $\varepsilon$-optimal policy is in principle possible. We thus do not have to consider any hypothesis class here (so no generalization involved), and can instead seek to be optimal under all possible mappings from states to actions.

Think for example of the following maze MDP, where the state of the world is the cell the agent is in and the action it can take at each state is a move to each of say 4 neighboring cells. Then, if we are able to get to every state and try every action there, we would have learned the world.

In this particular scenario, randomly exploring the world will allow us to learn the world. However, if we consider a modified random exploration strategy, where the probability of going left is significantly larger (say 5 times larger) than the probability of going right, then it will take exponential time to hit the goal state. In general, even for MDPs with small state and action spaces, a purely random exploration approach may be insufficient, as we may not be exploring the world enough. What alternative approach might we then adopt in order to achieve a sample-efficient learning algorithm?
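To make the exponential blow-up concrete, here is a sketch on a simplified one-dimensional corridor (an assumption made for illustration; the maze in the lecture is two-dimensional). The walker starts at cell 0, the goal is cell `n`, and stepping left at cell 0 keeps it in place. Writing $D_i$ for the expected number of steps to advance from cell $i$ to cell $i+1$, we have $D_i = (1 + p_{\text{left}} D_{i-1}) / p_{\text{right}}$, so the expected time to reach the goal can be computed exactly.

```python
def expected_steps_to_goal(n, p_right):
    """Expected steps for a random walk on cells 0..n to get from cell 0 to cell n.

    The walker moves right with probability p_right and left otherwise;
    a left move at cell 0 leaves it at cell 0.
    """
    p_left = 1.0 - p_right
    total = 0.0
    advance = 0.0  # expected steps to advance one cell from the previous cell
    for _ in range(n):
        advance = (1.0 + p_left * advance) / p_right
        total += advance
    return total

unbiased = expected_steps_to_goal(10, p_right=0.5)
biased = expected_steps_to_goal(10, p_right=1 / 6)  # left is 5 times likelier than right
print(unbiased, biased)
```

With a fair walk the expected time grows quadratically in the corridor length, while with the 5-to-1 bias it grows roughly like $5^n$.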

Theorem: (Kearns & Singh ’98). In the episodic setting, $\mathrm{poly}(S, A, H, 1/\varepsilon)$ samples suffice to find an $\varepsilon$-opt policy, where $S$ is the number of states, $A$ is the number of actions, and $H$ is the length of the horizon.

The above breakthrough result was the first to demonstrate that learning an $\varepsilon$-opt policy is possible using just polynomially many samples. The key ideas behind this are optimism and dynamic programming. In proving the result, the authors designed an algorithm called the $E^3$ algorithm (Explicit Explore or Exploit). The $E^3$ algorithm adopts a model-based approach, and relies on a “plan-to-explore” mechanism. As we act randomly, we will learn some part of the state space, and having learned this region well, we can accurately plan to escape it. This is where optimism comes in, since we give ourselves a bonus for escaping a region we know well.

Building on the Kearns and Singh result, there have been a number of follow-up works on the tabular MDP setting. One line of work seeks to improve on the precise factors in the sample complexity.

**Improvements on the sample complexity:**

- A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning – Brafman, Tennenholtz 2002
- On the Sample Complexity of Reinforcement Learning – Kakade 2003 (PhD thesis)
- Near-Optimal Regret Bounds for Reinforcement Learning – Jaksch, Ortner, Auer 2010
- Posterior sampling for reinforcement learning: worst-case regret bounds – Agrawal, Jia 2017

Another line of work seeks to show that Q-learning, a model-free approach, can also achieve similar polynomial complexity, if an appropriate optimism bonus is incorporated.

**Provable Q-Learning (+Bonus)**

- PAC Model-Free Reinforcement Learning – Strehl-Li-Wiewiora-Langford-Littman 2006
- Algorithms for Reinforcement Learning – lecture by Szepesvári 2009
- Is Q-learning Provably Efficient? – Jin, Allen-Zhu, Bubeck, Jordan 2018

As the range and technical depth of the above results demonstrate, even in the relatively simple tabular case, the problem is already challenging, and a precise sharp characterization of sample complexity is even more difficult. The chief source of difficulty is the unknown nature of the world (if the world were known, we could just run dynamic programming).

Ultimately, we want to move beyond small tabular MDPs, where a polynomial dependence of the sample complexity on $S$ is acceptable, and achieve sample-efficient learning in big problems where the state space could be massive. Think for instance of the game of Go.

In such a setting, requiring polynomially (in $S$) many samples is clearly unacceptable. This gives rise to the following question.

**Question 1: Can we find an $\varepsilon$-opt policy with no $S$ dependence?**

In order to do so, it is necessary to reuse data in some way since we will not be able to see all the possible states in the world. How then might we reuse data to estimate the value of all policies in a policy class $\Pi$? A naive approach is the following:

- Idea: Trajectory tree algorithm
- Dataset Collection: Choose actions uniformly at random for all $H$ steps in an episode.
- Estimation: Use importance sampling to evaluate every $\pi \in \Pi$.

Theorem: (Kearns, Mansour, & Ng ’00) To find an $\varepsilon$-best in class policy, the trajectory tree algorithm uses $O(A^H \log|\Pi| / \varepsilon^2)$ samples.

Observe that when $H = 1$ (i.e. a contextual bandit) this is exactly the kind of generalization bound we saw in the Occam’s Razor bound for supervised learning. Since there may be stochasticity in the MDP, such that $S$ could be infinite or even uncountable, this dependence on $\log|\Pi|$ is a genuine improvement on the results for the tabular MDP setting which depended polynomially on $S$. In this sense, this really is a generalization result, since we are learning an $\varepsilon$-best in class policy without having seen the entire world (i.e. all the states in the world).
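The data reuse behind this result can be sketched as follows, assuming deterministic candidate policies and episodes collected with uniformly random actions. A trajectory of length $H$ agrees with a given policy with probability $A^{-H}$, so weighting the agreeing trajectories by $A^H$ gives an unbiased value estimate, and the same dataset evaluates every policy in the class. The function name and the toy environment are illustrative assumptions, not the exact trajectory tree construction.

```python
from itertools import product

def importance_sampling_value(trajectories, policy, num_actions):
    """Estimate a deterministic policy's H-step value from uniformly random rollouts.

    trajectories: list of (states, actions, rewards) tuples, one per episode.
    policy: function (h, state) -> action.
    """
    total = 0.0
    for states, actions, rewards in trajectories:
        horizon = len(actions)
        agrees = all(policy(h, states[h]) == actions[h] for h in range(horizon))
        if agrees:
            # Importance weight: 1 / P(uniform rollout picks these actions) = A^H.
            total += (num_actions ** horizon) * sum(rewards)
    return total / len(trajectories)

# Toy environment (hypothetical): a single state, two actions, H = 2, and
# reward 1 whenever action 1 is played. Each action sequence appears once,
# which mimics an exactly uniform dataset.
data = [([0, 0], list(acts), [float(a == 1) for a in acts])
        for acts in product([0, 1], repeat=2)]

value_always_one = importance_sampling_value(data, lambda h, s: 1, num_actions=2)
value_always_zero = importance_sampling_value(data, lambda h, s: 0, num_actions=2)
print(value_always_one, value_always_zero)
```

Only a $A^{-H}$ fraction of the data is informative about any single policy, which is one way to see where the $A^H$ factor in the sample complexity comes from.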

We note that the result only has $\log|\Pi|$ dependence on the hypothesis class size and, similar to the supervised learning setting, there are VC analogues as well. However, we cannot avoid the $A^H$ dependence to find an $\varepsilon$-best-in-class policy agnostically (without assumptions on the MDP). To see why, consider a binary tree with $2^H$ policies and a sparse reward at a leaf node.

This $A^H$ dependence, while unavoidable without further assumptions, is clearly undesirable. This brings us to the following question.

**Question 2: Can we find an $\varepsilon$-opt policy with no $S, A$ dependence and $\mathrm{poly}(H, 1/\varepsilon, \text{“complexity measure”})$ samples?**

As we just saw, agnostically we cannot learn an $\varepsilon$-best-in-class policy without an $A^H$ dependence. However, as we will see, it is possible when appropriate assumptions are made. But what is the nature of the assumptions under which this kind of sample-efficient RL generalization is possible (when there is no (or mild) dependence on $S$ and $A$)? What assumptions are necessary? What assumptions are sufficient? We will seek to address these questions.

To do so, we start simple, and first look at the bandits and linear bandits problem, where the horizon is just 1. Note that this is still an interactive learning problem, just that we reset the episode after one time-step, and that it is an example of a problem with a potentially large action space.

The multi-armed bandit problem is intimately interwoven with the theory of reinforcement learning. It is based around the question of how to allocate T tokens to A “arms” to maximize one’s return:

- Some Aspects of the Sequential Design of Experiments – Robbins 1952
- Bandit Processes and Dynamic Allocations Indices – Gittins 1979
- Asymptotically Efficient Adaptive Allocation Rules – Lai and Robbins 1985

Algorithms for this problem are very successful when $A$ is small. What can we do when $A$ is large?

In each round, the learner has to decide which arm to pull. There is a widely used linear formulation of this problem that will assist us in understanding generalization. The linear bandit model is successful in many applications (scheduling, ads, etc.).

**Linear (RKHS) Bandits:**

- Decision: $x_t$; Reward: $r_t$; Reward model: $r_t = f(x_t) + \text{noise}$
- The hypothesis class $\mathcal{F}$ is a set of linear/RKHS functions (an overview of RKHS, which stands for Reproducing Kernel Hilbert Space, can be found here).

The principle underlying the Linear Bandits algorithm is **optimism in the face of uncertainty**:

Pick an input that maximizes the upper confidence bound:

$$x_t = \arg\max_{x} \; \widehat{\mu}_{t-1}(x) + \beta_t \, \widehat{\sigma}_{t-1}(x)$$

Note that $\widehat{\mu}_{t-1}$ is the best estimate of the ground-truth $f$ and $\widehat{\sigma}_{t-1}$ is a standard deviation that we have to estimate. In choosing the term $\beta_t$, we have to navigate a trade-off between exploration and exploitation. As we can see, this algorithm will only pick plausible maximizers.

Theorem: (Dani, Hayes, & K. ’08), (Srinivas, Krause, K., & Seeger ’10). Assuming $\mathcal{F}$ is an RKHS (with bounded norm), if we choose $\beta_t$ “correctly”, then the regret satisfies

where $\widetilde{O}$ hides logarithmic terms, and

The key complexity concept here is “Maximum Information Gain”: $\gamma_T$, which one can think of as the “effective dimension,” determines the regret because $\gamma_T \approx d \log T$ for features in $d$ dimensions. Here are some relevant papers for further understanding regret, which is the difference between the reward of the best possible action and the reward of the action that was actually taken.

- Finite-time Analysis of the Multiarmed Bandit Problem – Auer Cesa-Bianchi Fischer 2002
- Improved Algorithms for Linear Stochastic Bandits – Abbasi-yadkori, Pál, Szepesvári 2011

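On each round $t$, we must choose a decision $x_t \in \mathcal{D} \subset \mathbb{R}^d$. This yields a reward $r_t \in [-1, 1]$, where

$$\mathbb{E}[r_t \mid x_t = x] = \mu^* \cdot x \in [-1, 1].$$

Above, $\mu^* \in \mathbb{R}^d$ is an unknown weight vector and $x$ may be replaced by $\phi(x)$ if we have access to such a representation. Note that this tells us that the conditional expectation of $r_t$ given $x_t$ is linear. We have a corresponding i.i.d. noise sequence $\eta_t = r_t - \mu^* \cdot x_t$. If $x_0, \ldots, x_{T-1}$ are our decisions, then our *cumulative regret* in expectation is

$$R_T = T \, \mu^* \cdot x^* - \sum_{t=0}^{T-1} \mu^* \cdot x_t,$$

where $x^* \in \mathcal{D}$ is an optimal decision for $\mu^*$, i.e.

$$x^* \in \arg\max_{x \in \mathcal{D}} \mu^* \cdot x.$$

After $t$ rounds, we can define our uncertainty region $\mathrm{BALL}_t$ with center $\widehat{\mu}_t$ and shape $\Sigma_t$ using the regularized least squares solution:

$$\widehat{\mu}_t = \arg\min_{\mu} \sum_{\tau=0}^{t-1} (\mu \cdot x_\tau - r_\tau)^2 + \lambda \|\mu\|_2^2 = \Sigma_t^{-1} \sum_{\tau=0}^{t-1} r_\tau x_\tau, \qquad \Sigma_t = \lambda I + \sum_{\tau=0}^{t-1} x_\tau x_\tau^\top,$$

$$\mathrm{BALL}_t = \left\{ \mu : (\widehat{\mu}_t - \mu)^\top \Sigma_t (\widehat{\mu}_t - \mu) \le \beta_t \right\}.$$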

- $\beta_t$ is a parameter of the algorithm and $\Sigma_t$ determines how accurately we know $\mu^*$

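The LinUCB Algorithm can be understood as follows: For $t = 0, 1, \ldots$

- Execute $x_t = \arg\max_{x \in \mathcal{D}} \max_{\mu \in \mathrm{BALL}_t} \mu \cdot x$
- Observe the reward $r_t$ and update $\mathrm{BALL}_{t+1}$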
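The loop can be sketched in a few lines of NumPy for a finite decision set. For an ellipsoidal confidence region with center $\widehat{\mu}_t$, shape $\Sigma_t$ and radius $\beta$, the inner maximization has the closed form $\widehat{\mu}_t \cdot x + \sqrt{\beta \, x^\top \Sigma_t^{-1} x}$, which is what the code uses. The fixed value of `beta`, the uniform noise, the circle of decisions and the function name are simplifying assumptions for illustration; the analysis in these notes uses a slowly growing $\beta_t$.

```python
import numpy as np

def lin_ucb(decisions, mu_star, T, lam=1.0, beta=1.0, noise=0.1, seed=0):
    """Run LinUCB on a finite decision set (rows of `decisions`); return per-round regret."""
    rng = np.random.default_rng(seed)
    d = decisions.shape[1]
    Sigma = lam * np.eye(d)  # regularized design matrix Sigma_t
    b = np.zeros(d)          # running sum of r_tau * x_tau
    best = np.max(decisions @ mu_star)
    regrets = []
    for _ in range(T):
        Sigma_inv = np.linalg.inv(Sigma)
        mu_hat = Sigma_inv @ b  # regularized least squares estimate
        # x^T Sigma^{-1} x for every candidate decision x
        widths_sq = np.einsum("ij,jk,ik->i", decisions, Sigma_inv, decisions)
        ucb = decisions @ mu_hat + np.sqrt(beta * widths_sq)
        x = decisions[np.argmax(ucb)]  # optimistic choice
        r = x @ mu_star + rng.uniform(-noise, noise)  # bounded noise
        Sigma += np.outer(x, x)
        b += r * x
        regrets.append(best - x @ mu_star)
    return np.array(regrets)

# Hypothetical instance: 16 unit-norm decisions on a circle, unknown mu_star.
angles = np.linspace(0.0, 2.0 * np.pi, 16, endpoint=False)
decisions = np.stack([np.cos(angles), np.sin(angles)], axis=1)
regrets = lin_ucb(decisions, mu_star=np.array([0.6, 0.8]), T=200)
print(regrets.sum())
```

Note that the per-round cost depends on the number of candidate decisions only through the `argmax`; the statistical state is just the $d \times d$ matrix `Sigma` and the vector `b`.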

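As the following theorem shows, the regret is sublinear with polynomial dependence on the dimension $d$ and no dependence on the cardinality of the decision space $|\mathcal{D}|$.

Theorem (regret): (Dani, Hayes, Kakade 2008). Suppose we have bounded noise $|\eta_t| \le \sigma$; $\|\mu^*\| \le W$; $\|x\| \le B$, for $x \in \mathcal{D}$. Set $\lambda = \sigma^2 / W^2$ and $\beta_t := c_1 \sigma^2 \left( d \log\left(1 + \frac{t B^2 W^2}{d \sigma^2}\right) + \log(1/\delta) \right)$. Then, with probability greater than $1 - \delta$, for all $T \ge 0$,

$$R_T \le c_2 \, \sigma \sqrt{T} \left( d \log\left(1 + \frac{T B^2 W^2}{d \sigma^2}\right) + \log(1/\delta) \right),$$

where $c_1, c_2$ are absolute constants.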

To prove the regret theorem above, we will require the following two lemmas.

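Lemma 1 (Confidence): Let $\delta > 0$. We have that

$$\Pr\left( \forall t, \; \mu^* \in \mathrm{BALL}_t \right) \ge 1 - \delta.$$

Lemma 2 (Sum of Squares Regret Bound): Define $\mathrm{regret}_t = \mu^* \cdot x^* - \mu^* \cdot x_t$. Suppose $\beta_t$ is increasing and that for all $t$, we have $\mu^* \in \mathrm{BALL}_t$. Then,

$$\sum_{t=0}^{T-1} \mathrm{regret}_t^2 \le 8 \beta_T \, d \log\left(1 + \frac{T B^2}{d \lambda}\right).$$

We note that Lemma 2 actually depends on Lemma 1, since it assumes that $\mu^* \in \mathrm{BALL}_t$ for each $t$, a property that Lemma 1 tells us happens with probability at least $1 - \delta$. We defer the proofs of the two lemmas to later, and first show why they can be used to prove the regret theorem.

**Proof of regret theorem:** Using the two lemmas above along with the Cauchy-Schwarz inequality, we have with probability at least $1 - \delta$ that

$$R_T = \sum_{t=0}^{T-1} \mathrm{regret}_t \le \sqrt{T \sum_{t=0}^{T-1} \mathrm{regret}_t^2} \le \sqrt{8 T \beta_T \, d \log\left(1 + \frac{T B^2}{d \lambda}\right)}.$$

The rest of the proof follows from our chosen value of $\beta_T$.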

We now proceed to sketch out the proofs of Lemma 1 (confidence bound) and Lemma 2 (sum of squares regret bound). We begin with showing why Lemma 2 holds.

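Our first auxiliary result bounds the pointwise width of the confidence ball.

Lemma (pointwise width of confidence ball). Let $x \in \mathcal{D}$. Consider any $\mu \in \mathrm{BALL}_t$. Then,

$$|(\mu - \widehat{\mu}_t)^\top x| \le \sqrt{\beta_t \, x^\top \Sigma_t^{-1} x}.$$

*Proof*. We have

$$|(\mu - \widehat{\mu}_t)^\top x| = |(\mu - \widehat{\mu}_t)^\top \Sigma_t^{1/2} \Sigma_t^{-1/2} x| \le \|\Sigma_t^{1/2} (\mu - \widehat{\mu}_t)\| \, \|\Sigma_t^{-1/2} x\| \le \sqrt{\beta_t \, x^\top \Sigma_t^{-1} x},$$

where the first inequality follows from Cauchy-Schwarz and the second (i.e. last) inequality holds by the definition of $\mathrm{BALL}_t$ and our assumption that $\mu \in \mathrm{BALL}_t$.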

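Let us now define

$$w_t := \sqrt{x_t^\top \Sigma_t^{-1} x_t},$$

which we can think of as the “normalized width” at time $t$ in the direction of our decision. We have the following bound on the instantaneous regret $\mathrm{regret}_t = \mu^* \cdot x^* - \mu^* \cdot x_t$.

Lemma (instantaneous regret lemma). Fix $t \le T$. If $\mu^* \in \mathrm{BALL}_t$, then

$$\mathrm{regret}_t \le 2 \min\left(\sqrt{\beta_t} \, w_t, 1\right) \le 2 \sqrt{\beta_T} \min(w_t, 1).$$

*Proof*. The basic idea is to use “optimism”. Let $\tilde{\mu} \in \mathrm{BALL}_t$ denote the vector maximizing the dot product $\tilde{\mu} \cdot x_t$. By choice of $x_t$, $\tilde{\mu} \cdot x_t = \max_{\mu \in \mathrm{BALL}_t} \max_{x \in \mathcal{D}} \mu \cdot x \ge \mu^* \cdot x^*$, where the inequality used the hypothesis that $\mu^* \in \mathrm{BALL}_t$. This manifestation of “optimism” is crucial, since it tells us that the “ideal” reward we think we can get at time $t$ exceeds the optimal expected reward $\mu^* \cdot x^*$. Hence,

$$\mathrm{regret}_t = \mu^* \cdot x^* - \mu^* \cdot x_t \le (\tilde{\mu} - \mu^*) \cdot x_t = (\tilde{\mu} - \widehat{\mu}_t) \cdot x_t + (\widehat{\mu}_t - \mu^*) \cdot x_t \le 2 \sqrt{\beta_t} \, w_t,$$

where the last step follows from the pointwise width lemma (note $\tilde{\mu}$ is in $\mathrm{BALL}_t$ and $\mu^*$ is assumed to be by the hypothesis in Lemma 2). Since in the linear bandits setup we assumed that $r_t \in [-1, 1]$ for all $t$, the simple bound $\mathrm{regret}_t \le 2$ holds as well. We may also assume for simplicity that $\beta_t \ge 1$. This then yields the bound in the result.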

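In the next two lemmas, we use a geometric potential function argument to bound the sum of widths independently of the choices made by the algorithm (e.g. choice of $\beta_t$ and the sequence $x_0, \ldots, x_{T-1}$).

Geometric Lemma 1. We have

$$\det \Sigma_T = \det \Sigma_0 \prod_{t=0}^{T-1} (1 + w_t^2).$$

*Proof*. By definition of $\Sigma_{t+1}$, we have

$$\det \Sigma_{t+1} = \det\left(\Sigma_t + x_t x_t^\top\right) = \det\left(\Sigma_t^{1/2} \left(I + v_t v_t^\top\right) \Sigma_t^{1/2}\right) = \det(\Sigma_t) \det\left(I + v_t v_t^\top\right), \qquad v_t := \Sigma_t^{-1/2} x_t.$$

We complete the proof by noting that $\det(I + v_t v_t^\top) = 1 + \|v_t\|^2 = 1 + w_t^2$.

Geometric Lemma 2. For any sequence $x_0, \ldots, x_{T-1}$ such that for $t < T$, $\|x_t\|_2 \le B$, we have

$$\log\left(\det \Sigma_T / \det \Sigma_0\right) = \log \det\left(I + \frac{1}{\lambda} \sum_{t=0}^{T-1} x_t x_t^\top\right) \le d \log\left(1 + \frac{T B^2}{d \lambda}\right).$$

*Proof*. Denote the eigenvalues of $\sum_{t=0}^{T-1} x_t x_t^\top$ as $\sigma_1, \ldots, \sigma_d$, and note

$$\sum_{i=1}^{d} \sigma_i = \mathrm{Trace}\left(\sum_{t=0}^{T-1} x_t x_t^\top\right) = \sum_{t=0}^{T-1} \|x_t\|^2 \le T B^2.$$

Using the AM-GM inequality,

$$\log \det\left(I + \frac{1}{\lambda} \sum_{t=0}^{T-1} x_t x_t^\top\right) = \log \prod_{i=1}^{d} \left(1 + \frac{\sigma_i}{\lambda}\right) \le d \log\left(\frac{1}{d} \sum_{i=1}^{d} \left(1 + \frac{\sigma_i}{\lambda}\right)\right) \le d \log\left(1 + \frac{T B^2}{d \lambda}\right).$$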
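Both geometric facts are easy to sanity-check numerically. The sketch below assumes the standard forms of the two lemmas, namely $\det \Sigma_T = \det \Sigma_0 \prod_t (1 + w_t^2)$ with $w_t^2 = x_t^\top \Sigma_t^{-1} x_t$, and $\log(\det \Sigma_T / \det \Sigma_0) \le d \log(1 + T B^2 / (d \lambda))$ whenever $\|x_t\| \le B$; the random decisions and the constants are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, lam, B = 3, 50, 1.0, 1.0

# Random decisions, rescaled so that every row has norm at most B.
xs = rng.normal(size=(T, d))
xs *= B / np.maximum(np.linalg.norm(xs, axis=1, keepdims=True), B)

Sigma = lam * np.eye(d)  # Sigma_0 = lam * I
log_det_ratio_via_widths = 0.0
for x in xs:
    w_sq = x @ np.linalg.solve(Sigma, x)  # w_t^2 = x_t^T Sigma_t^{-1} x_t
    log_det_ratio_via_widths += np.log1p(w_sq)
    Sigma += np.outer(x, x)  # Sigma_{t+1} = Sigma_t + x_t x_t^T

# log(det Sigma_T / det Sigma_0), using det Sigma_0 = lam^d.
log_det_ratio = np.linalg.slogdet(Sigma)[1] - d * np.log(lam)
bound = d * np.log(1.0 + T * B**2 / (d * lam))
print(log_det_ratio, log_det_ratio_via_widths, bound)
```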

We are now finally ready to prove Lemma 2 (sum of squares regret bound).

**Proof of Lemma 2 (sum of squares regret bound)**.

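Assume $\mu^* \in \mathrm{BALL}_t$ for all $t$. We have

$$\sum_{t=0}^{T-1} \mathrm{regret}_t^2 \le \sum_{t=0}^{T-1} 4 \beta_t \min(w_t^2, 1) \le 4 \beta_T \sum_{t=0}^{T-1} \min(w_t^2, 1) \le 8 \beta_T \sum_{t=0}^{T-1} \ln(1 + w_t^2) = 8 \beta_T \log\left(\det \Sigma_T / \det \Sigma_0\right) \le 8 \beta_T \, d \log\left(1 + \frac{T B^2}{d \lambda}\right),$$

where the first inequality follows from the instantaneous regret lemma, the second from the fact that $\beta_t$ is an increasing function of $t$, the third uses the fact that $\ln(1 + y) \ge y/2$ for $0 \le y \le 1$, the final equality holds by Geometric Lemma 1, and the final inequality follows from Geometric Lemma 2.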

This wraps up our discussion of Lemma 2.

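Recall that our goal here is to show that $\mu^* \in \mathrm{BALL}_t$ with high probability. We begin with the following result, which is a general version of the self-normalized sum argument in Dani, Hayes, Kakade 2008.

Lemma (Self-normalized bound for vector-valued martingales, Abbasi-Yadkori et al. 2011). Suppose $\varepsilon_i$ are mean zero random variables (can be generalized to martingales), and $\varepsilon_i$ is bounded by $\sigma$. Let $X_i$ be a stochastic process. Define $\Sigma_t = \Sigma_0 + \sum_{i=1}^{t} X_i X_i^\top$. With probability at least $1 - \delta$, we have for all $t \ge 1$,

$$\left\| \sum_{i=1}^{t} X_i \varepsilon_i \right\|_{\Sigma_t^{-1}}^2 \le \sigma^2 \log\left( \frac{\det(\Sigma_t) \det(\Sigma_0)^{-1}}{\delta^2} \right).$$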

Equipped with the lemma above, we are now ready to prove Lemma 1, which we will restate here again.

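Lemma 1 (Confidence): Let $\delta > 0$. We have that

$$\Pr\left( \forall t, \; \mu^* \in \mathrm{BALL}_t \right) \ge 1 - \delta.$$

**Proof of Lemma 1**. Since $r_\tau = \mu^* \cdot x_\tau + \eta_\tau$, we have

$$\widehat{\mu}_t - \mu^* = \Sigma_t^{-1} \sum_{\tau=0}^{t-1} r_\tau x_\tau - \mu^* = \Sigma_t^{-1} \left( \sum_{\tau=0}^{t-1} x_\tau x_\tau^\top \right) \mu^* - \mu^* + \Sigma_t^{-1} \sum_{\tau=0}^{t-1} \eta_\tau x_\tau = -\lambda \Sigma_t^{-1} \mu^* + \Sigma_t^{-1} \sum_{\tau=0}^{t-1} \eta_\tau x_\tau.$$

To get the last equality, we recall that $\Sigma_t = \lambda I + \sum_{\tau=0}^{t-1} x_\tau x_\tau^\top$. By the triangle inequality, it follows that we have

$$\left\| \Sigma_t^{1/2} (\widehat{\mu}_t - \mu^*) \right\| \le \lambda \left\| \Sigma_t^{-1/2} \mu^* \right\| + \left\| \Sigma_t^{-1/2} \sum_{\tau=0}^{t-1} \eta_\tau x_\tau \right\| \le \sqrt{\lambda} \, \|\mu^*\| + \sqrt{\sigma^2 \log\left( \det(\Sigma_t) \det(\Sigma_0)^{-1} / \delta^2 \right)},$$

where the last inequality holds with probability at least $1 - \delta$ for every $t$, using the self-normalized bound above, as well as the fact that $\|\Sigma_t^{-1}\| \le 1/\lambda$. Since $\|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2$ for any vectors $a, b$, it follows that with probability at least $1 - \delta$, for every $t$,

$$(\widehat{\mu}_t - \mu^*)^\top \Sigma_t (\widehat{\mu}_t - \mu^*) \le 2 \lambda \|\mu^*\|^2 + 2 \sigma^2 \log\left( \det(\Sigma_t) \det(\Sigma_0)^{-1} / \delta^2 \right) \le 2 \sigma^2 + 2 \sigma^2 \left( d \log\left(1 + \frac{t B^2 W^2}{d \sigma^2}\right) + 2 \log(1/\delta) \right),$$

where the final inequality is a consequence of the choice $\lambda = \sigma^2 / W^2$ in the algorithm (where we recall the upper bound $\|\mu^*\| \le W$), as well as Geometric Lemma 2. The result then follows by our choice of $\beta_t$, which is

$$\beta_t = c \, \sigma^2 \left( d \log\left(1 + \frac{t B^2 W^2}{d \sigma^2}\right) + \log(1/\delta) \right),$$

where $c$ is an absolute constant (note in doing so we also subsumed the $2\sigma^2$ term, simplifying the exposition).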

We move on now to more challenging RL problems where the horizon is larger than 1, and explore lower bounds in this regime.

We begin by considering generalization with a very natural assumption: suppose that the value function can be approximated by linear basis functions

We assume that the dimension of the representation, , is low compared to the state and action dimensions. The idea of using a linear function approximation in RL and dynamic programming is not new, and was explored in early works by Shannon (“Programming a digital computer for playing chess.”, Philosophical Magazine, 1950) as well as Bellman and Dreyfus (“Functional approximations and dynamic programming”, 1959). There has since been significant work on this approach, see e.g. Tesauro 1995, de Farias and Van Roy 2003, Wen and Van Roy 2013.

One natural question that arises is this: **what conditions must the representation satisfy in order for this approach to work?**

We proceed by studying the simplest possible case: assuming that the optimal -function is linearly realizable.

Suppose we have access to a feature map . Concretely, the assumption we consider is the following:

Assumption 1 (Linearly realizable ): Assume for all that there exists such that

As an aside, with Assumption 1, we can consider the problem from a linear programming viewpoint. Note that:

- We have an underlying LP with variables and constraints.
- The LP is specific to the dynamic programming problem at hand (and hence not general) because it encodes the Bellman optimality constraints.
- We have sampling access (in the episodic setting).

It may be tempting to think that Assumption 1 is sufficient to enable a sample-efficient algorithm for RL (if we assume we already know the representation ). However, that is **not** true, as the following theorem from a very recent work demonstrates:

Theorem 1 (Weisz, Amortila, Szepesvári 2021): There exists an MDP and a representation satisfying Assumption 1, such that any online RL algorithm (with knowledge of ) requires samples to output the value up to constant additive error, with probability at least 0.9.

While linear realizability alone is insufficient for sample efficiency in online RL, one might consider imposing further assumptions that could suffice for sample-efficient RL. One candidate assumption is to assume that at each state, the optimal action yields significantly more value than the next-best action:

Assumption 2 (Large suboptimality gap): Assume for all , we have
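To make the two assumptions concrete, the following minimal sketch builds optimal Q-values that are linear in a feature map (Assumption 1) and then measures the smallest gap between the best and second-best action across states (the quantity Assumption 2 asks to be large). The feature vectors, the parameter `theta` and the state and action names are all hypothetical toy values.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linear_q_values(theta, features):
    """Q(s, a) = theta . phi(s, a) for every state-action pair.

    features maps each state to a dict {action: feature vector}.
    """
    return {
        s: {a: dot(theta, phi) for a, phi in by_action.items()}
        for s, by_action in features.items()
    }

def suboptimality_gap(q_values):
    """Smallest difference between the best and second-best Q-value over states."""
    gaps = []
    for by_action in q_values.values():
        ranked = sorted(by_action.values(), reverse=True)
        if len(ranked) > 1:
            gaps.append(ranked[0] - ranked[1])
    return min(gaps)

# Hypothetical toy instance with 2-dimensional features.
theta = (1.0, 0.5)
features = {
    "s0": {"a0": (1.0, 0.0), "a1": (0.0, 1.0)},
    "s1": {"a0": (0.5, 0.5), "a1": (0.0, 0.5)},
}
q_values = linear_q_values(theta, features)
gap = suboptimality_gap(q_values)
print(q_values, gap)
```

Note that a tie between two optimal actions yields a gap of zero under this helper, which is one simple convention; the lecture's definition may treat ties differently.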

Perhaps surprisingly, the following theorem shows that an exponential lower bound for online RL remains under **both** Assumption 1 and Assumption 2.

Theorem 2 (Wang, Wang, Kakade 2021): There exists an MDP and a representation satisfying both Assumption 1 and Assumption 2, such that any online RL algorithm (with knowledge of ) requires samples to output the value up to constant additive error, with probability at least 0.9.

*Remark*: We note a subtle distinction between the online RL setting and the simulator access setting. In the online RL setting, during each episode, we start at some state , and the subsequent states we see are entirely dependent on the policy we choose and the environment dynamics. Meanwhile, in the simulator access setting, at each time-step, we are free to input

We next introduce the counterexample used to prove Theorem 2 in detail.

Above, we have a pictorial representation of the MDP family in the counterexample. We first describe its state and action spaces.

- The state space is . We use to denote an integer, which we set to be approximately .
- State is a special state, which we can think of as a “terminal state”.
- At state , the feasible action set is . At state , the feasible action set is . Hence, there are feasible actions at each state.
- Each MDP in this “hard” family is specified by an index and denoted by .

Before we proceed, we first recall the Johnson-Lindenstrauss lemma, which states that a set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that the distances between the points are nearly preserved.

Johnson-Lindenstrauss Lemma: Suppose we are given , a set of points in , and a number . Then, there is a linear map such that

Consider a collection of orthogonal unit vectors in the high-dimensional space . For any two vectors , after applying the linear embedding , we observe by Johnson-Lindenstrauss that

where we used the fact that holds for any unit by linearity of . Hence, we can apply Johnson-Lindenstrauss to derive the following lemma, which will be useful in our construction.

Lemma 1 (Johnson-Lindenstrauss): For any , there exist unit vectors in such that and , .
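The phenomenon behind this lemma (many nearly orthogonal unit vectors fit in a modest dimension) can be observed numerically. The sketch below does not apply a Johnson-Lindenstrauss map to orthogonal vectors as in the argument above; it swaps in a simpler stand-in, drawing independent Gaussian vectors and normalizing them, which exhibits the same behavior. The dimension `d=200` and count `n=50` are arbitrary illustrative choices.

```python
import math
import random

def random_unit_vectors(n, d, seed=0):
    """Draw n Gaussian vectors in R^d and normalize each to unit length."""
    rng = random.Random(seed)
    vectors = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        vectors.append([x / norm for x in v])
    return vectors

def max_abs_inner_product(vectors):
    """Largest |<v_i, v_j>| over all pairs of distinct vectors."""
    worst = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            ip = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            worst = max(worst, abs(ip))
    return worst

vectors = random_unit_vectors(n=50, d=200)
worst = max_abs_inner_product(vectors)
print(f"max |<v_i, v_j>| over distinct pairs: {worst:.3f}")
```

Each pairwise inner product has standard deviation about `1 / sqrt(d)`, so the number of vectors can grow exponentially in `d` before the largest inner product exceeds a fixed threshold.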

Throughout our discussion, we will set .

Equipped with the lemma above, we can now describe the transitions, features and rewards of the constructed MDP family. In the sequel, and represent integers associated with the state and action respectively.

*Transitions*: The initial state follows the uniform distribution . The transition probabilities are set as follows:

After taking action , the next state is either or . We might observe then that this MDP resembles a “leaking complete graph”. It is possible to visit any other state (except for ). However, importantly, there is at least probability of going to the terminal state . Also, observe that the transition probabilities are indeed valid, since by Lemma 1 above,

*Features*: The feature map, which maps state-action pairs to -dimensional vectors, is defined as

Note that the feature map is independent of and is shared across the MDP family.

*Rewards*: For , the rewards are defined as

For , we set for every state-action pair.

We now verify that our construction satisfies both the linear realizability and large suboptimality gap assumptions (Assumption 1 and Assumption 2).

Lemma (Linear realizability). For all , we have .

*Proof*: Throughout, we assume that . We first verify the statement for the terminal state . At the state , regardless of the action taken, the next state is always and the reward is always 0. Hence, for all . Thus, . We next verify realizability for other states via backwards induction on . The inductive hypothesis is ,

and that ,

When , (1) holds by the definition of the rewards at that level. Next, note that , (2) follows from (1). This is because for ,

while (recall )

This means that proving (1) suffices to show that is always the optimal action. A simple verification via Bellman’s optimality equation suffices to prove the inductive hypothesis for (1) for every , since the base case holds. Thus, both (1) and (2) hold for all , concluding our proof.
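The backward induction used in this proof is the standard finite-horizon Bellman optimality recursion. A minimal tabular sketch follows; the two-state toy MDP (a "live" state that leaks to a terminal state `f`) and all of its numbers are hypothetical and much simpler than the construction in the lecture.

```python
def backward_induction(states, actions, horizon, reward, transition):
    """Compute Q*_h(s, a) for h = horizon - 1, ..., 0 via Bellman optimality.

    reward(h, s, a) returns a float; transition(h, s, a) returns a dict
    {next_state: probability}.
    """
    v_next = {s: 0.0 for s in states}  # value after the final step is zero
    q = [None] * horizon
    for h in reversed(range(horizon)):
        q[h] = {}
        for s in states:
            for a in actions:
                expected = sum(p * v_next[s2] for s2, p in transition(h, s, a).items())
                q[h][(s, a)] = reward(h, s, a) + expected
        # The optimal value at step h is the best Q-value at each state.
        v_next = {s: max(q[h][(s, a)] for a in actions) for s in states}
    return q

# Hypothetical toy MDP: "f" is absorbing with zero reward.
states = ["live", "f"]
actions = ["stay", "quit"]

def reward(h, s, a):
    if s == "f":
        return 0.0
    return 1.0 if a == "stay" else 0.25

def transition(h, s, a):
    if s == "live" and a == "stay":
        return {"live": 0.5, "f": 0.5}
    return {"f": 1.0}

q = backward_induction(states, actions, 3, reward, transition)
print(q[0][("live", "stay")], q[0][("live", "quit")])
```

Steps are 0-indexed here, so `q[horizon - 1]` plays the role of the base case and each earlier level is obtained from the one after it.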

We next show that the constant suboptimality gap (Assumption 2) is also (approximately) satisfied by our constructed MDP family.

Lemma (Suboptimality gap). For all state , and , the suboptimality gap is

Hence, in this MDP, Assumption 2 is satisfied with .

*Remark*. Note that here we ignored the terminal state and the essentially unreachable state for simplicity. This seems reasonable intuitively, since reaching is effectively the end of the episode, and the state can only be reached with negligible probability (recall that is exponentially large). For a more rigorous treatment of this issue, refer to Appendix B in Wang, Wang, Kakade 2021.

We can now state and prove the following key technical lemma, which directly implies Theorem 2 in Wang, Wang, Kakade 2021.

Lemma. For any algorithm, there exists such that in order to output with

with probability at least 0.1 for , the number of samples required is .

**Proof sketch**. We take an information-theoretic perspective. Observe that the feature map of does not depend on , and that for and , the reward also carries no information about . The transition probabilities are also independent of , unless the action is taken, and the reward at state is always 0. Thus, to receive information about the optimal action , the agent either needs to take the action , or be at a non-game-over state at the final time step (i.e. ).

However, by the design of the transition probabilities, the probability of remaining at a non-game-over state at the next time step is at most

Hence, for any algorithm, , which is exponentially small.

Summarizing, any algorithm that does not know either needs to “get lucky” so that , or take the optimal action . For each episode, the first event happens with probability less than , and the second event happens with probability less than . Since the number of actions is , it follows that neither event can happen with constant probability unless the number of episodes is exponential in . This wraps up the sketch.
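The counting argument in this sketch can be illustrated with a short union-bound calculation. The code below assumes a per-step survival probability `survival_prob`, a horizon, and a per-step chance of `1 / num_actions` of hitting the hidden action; these parameters, the `horizon / num_actions` bound and the numbers used are hypothetical placeholders rather than the exact constants of the construction.

```python
import math

def min_episodes(survival_prob, horizon, num_actions, target=0.5):
    """Smallest N with N * p_episode >= target, where p_episode is a union
    bound on the chance that a single episode reveals any information.

    Two informative events are considered per episode (both modeled under
    the assumptions stated above):
      * surviving all `horizon` steps: at most survival_prob ** horizon
      * hitting the hidden action in some step: at most horizon / num_actions
    """
    p_survive = survival_prob ** horizon
    p_guess = horizon / num_actions
    p_episode = p_survive + p_guess
    return math.ceil(target / p_episode)

# Hypothetical numbers: survival probability 3/4 per step, 40 steps,
# and 2**20 actions.
needed = min_episodes(survival_prob=0.75, horizon=40, num_actions=2 ** 20)
print(needed)
```

Both terms of `p_episode` shrink exponentially when the horizon grows and the number of actions is exponentially large, so the required number of episodes grows exponentially as well.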

We note that our construction is not quite rigorous due to the remark earlier that the suboptimality gap assumption does not hold for the states and . A more rigorous construction can be found in Appendix B of Wang, Wang, Kakade 2021. Note that this lower bound is silent on the dependence of the sample complexity on the size of the action space, giving rise to the following question, which appears to be still unsolved.

Open problem: Could we get a lower bound that also depends on the action space dimension , such that the number of samples required to obtain an approximately optimal policy scales with

A natural question to ask is this: are these exponential lower bounds in Part 2 (when we only assume linear realizability) actually relevant for practice?

To answer this, we take a brief detour into *offline RL*. In offline RL (see Levine et al. 2020 for a survey), we assume that the agent has no direct access to the MDP, and is instead provided with a static dataset of transitions, ( denotes number of independent episodes in the offline data). The goal here could be to learn a policy (based on the static dataset ) that attains the largest possible cumulative reward when applied to the MDP, or to evaluate the performance of some target policy based on the offline data. We use to denote the distribution over states and actions in , such that we assume the state-action tuples are sampled according to , and the actions are sampled according to the behaviour policy, such that .

Analogous to the online RL lower bound, the following theorem shows that linear realizability is also insufficient for sample-efficient evaluation of a target policy using offline data.

Theorem (informal, from Wang, Foster, Kakade 2020). In the offline RL setting, suppose the data distributions have (polynomially) lower bounded eigenvalues, and the -functions of every policy are linear with respect to a given feature mapping. Then, any algorithm requires an exponential number of samples in the horizon to output a non-trivially accurate estimate of the value of any given policy , with constant probability.

Some remarks are in order. First, note that the above hardness result for policy evaluation also holds for finding near-optimal policies using offline data. For a simple reduction, consider an example where at the initial state, one action leads to a fixed reward and another action leads to an instance which is hard to evaluate using offline data. Then, in order to find a good policy, it is necessary for the agent to approximately evaluate the value of the optimal policy in the hard instance. Second, an appropriate eigenvalue lower bound on the offline data distribution ensures that there is sufficient feature coverage in the dataset, without which linear realizability alone is clearly insufficient for sample-efficient estimation. Third, note that the representation condition in the theorem is significantly stronger than assuming realizability with regard to only a single target policy, and so the result carries over to the latter setting as well. Fourth, the key idea to prove the result is the error amplification (exponential in the horizon ) induced by the *distribution shift* from the offline policy to the target policy we wish to evaluate.

Empirical work performed in Wang et al. 2021 shows that these negative results do manifest themselves in experimental examples. The methodology considered by Wang et al. 2021 is as follows:

- Decide on a target policy to be evaluated, along with a good feature mapping for this policy (could be the last layer of a deep neural network trained to evaluate the policy).
- Collect offline data using trajectories that are a mixture of the target policy and another distribution (perhaps generated by a random policy).
- Run offline RL methods to evaluate the target policy using the feature mapping found in the first step and the offline data obtained in the second step.

We note that features extracted from pre-trained deep neural networks should be able to satisfy the linear realizability assumption approximately (for the target policy). Moreover, the offline dataset is relatively favorable for evaluation of the target policy, since we would not expect realistic offline datasets to have a large number of trajectories from the target policy itself.

However, numerical results show substantial degradation in the accuracy of policy evaluation, even for a relatively mild distribution shift (e.g. where there is a 50/50 split in target policy and random policy in the offline data). As an example, consider the following plot.

The figure above depicts the performance of Fitted Q-Iteration (FQI) on Walker2d-v2, an environment from the OpenAI Gym benchmark suite which has a continuous action space. Here, the -axis is the number of rounds of FQI used, and the -axis is the square root of the mean squared error of the predicted values (smaller is better). The blue line corresponds to performance when the dataset is generated by the target policy itself with 1 million samples, and other lines correspond to the performance when adding more offline data induced by random trajectories. As we can see, adding more random trajectories leads to significant degradation of FQI. See Wang et al. 2021 for more such experiments.

These empirical results seem to affirm the hardness results in Wang et al. 2021 (offline RL) and Wang, Wang, Kakade 2021 (online RL), in that the definition of a good representation in RL is more subtle than in supervised learning, and certainly goes beyond just linear realizability.

We have seen from Part 1 and Part 2 that finding an -optimal policy with mild (e.g. logarithmic) dependence on and samples is **NOT** possible agnostically, or even with linearly realizable . This leads us to the following question.

**Q: What kind of assumptions enable provable generalization in RL?**

In fact, under various stronger assumptions, sample-efficient generalization **is** possible in many special cases. Amongst others, these include

- Linear Bellman Completion [Munos 2005, Zanette et al. 2020]
- Linear MDPs (low-rank transition matrix) [Wang and Yang 2018; Jin et al. 2019]
- Linear Quadratic Regulators (LQR): standard control theory model (see e.g. Wikipedia page for LQR)

- FLAMBE/Feature Selection: Agarwal, Kakade, Krishnamurthy, Sun 2020
- Linear Mixture MDPs: [Modi et al. 2020, Ayoub et al. 2020]
- Block MDPs Du et al. 2019
- Factored MDPs Sun et al 2019
- Kernelized Nonlinear Regulator Kakade et al. 2020

What structural commonalities are shared between these underlying assumptions and models? To answer this question, we go back to the start, and revisit the case of linear bandits ( RL problem) for intuition.

We consider linear contextual bandits, where the context is , the action is and denote the state (or context) and action space respectively. We assume as before that associated with each state-action pair is a representation . The observed reward is , where is a mean-zero stochastic noise term. The hypothesis class is

where is a subset of . We let denote the greedy policy for , i.e.

An important structural property satisfied by linear contextual bandits is the following: *data reuse*. Indeed, the difference between any and the observed reward is estimable when we had in fact played for some hypothesis . Via direct calculation, we see that

Intuitively, assuming that induces “sufficient exploration” of the -dimensional representation space, this implies that we can evaluate the quality of any policy/hypothesis using just the one set of data collected by . This is **precisely** the kind of data reuse property we saw for supervised learning, which enables sample-efficient generalization there. This suggests that to ensure sample-efficient generalization in general RL, it may be fruitful to look for assumptions that enable data reuse. One special case where such data reuse is possible is the class of linear Bellman complete models.
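The data reuse property can be checked on a toy linear bandit. In the sketch below, a single dataset is collected by the greedy policy of one hypothesis, and the average residual of every other hypothesis on that dataset is compared with the bilinear form `(theta - theta_star) . mean_phi`. The 2-dimensional features, the three actions per context, and the parameter values are hypothetical; noise is set to zero so the identity holds exactly rather than in expectation.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def greedy_action(theta, features_by_action):
    """Greedy policy: the action whose feature vector maximizes theta . phi."""
    return max(features_by_action, key=lambda a: dot(theta, features_by_action[a]))

def collect(theta_star, theta_behavior, contexts, noise_std, rng):
    """Play the greedy policy of theta_behavior and log (phi, reward) pairs."""
    data = []
    for features_by_action in contexts:
        action = greedy_action(theta_behavior, features_by_action)
        phi = features_by_action[action]
        reward = dot(theta_star, phi) + rng.gauss(0.0, noise_std)
        data.append((phi, reward))
    return data

def mean_residual(theta, data):
    """Average of (theta . phi - reward) over the logged data."""
    return sum(dot(theta, phi) - r for phi, r in data) / len(data)

rng = random.Random(0)
# Hypothetical toy problem: 2-dimensional features, 3 actions per context.
contexts = [
    {a: (rng.uniform(-1, 1), rng.uniform(-1, 1)) for a in range(3)}
    for _ in range(1000)
]
theta_star = (1.0, -0.5)      # true parameter, unknown to the learner
theta_behavior = (0.2, 0.8)   # hypothesis whose greedy policy collects the data
data = collect(theta_star, theta_behavior, contexts, noise_std=0.0, rng=rng)
mean_phi = tuple(sum(phi[i] for phi, _ in data) / len(data) for i in range(2))

# One dataset, many hypotheses: each residual matches the bilinear form.
for theta in [(1.0, -0.5), (0.0, 0.0), (0.5, 0.5)]:
    diff = tuple(t - s for t, s in zip(theta, theta_star))
    print(theta, mean_residual(theta, data), dot(diff, mean_phi))
```

With nonzero `noise_std` the two printed columns would agree only up to sampling error, which shrinks as the dataset grows.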

Let be the length of each episode as before. We recall that a hypothesis class is **realizable** for an MDP if there exists a hypothesis such that

where is the optimal state-action value at time step . Having defined realizability, we are now ready to define the notion of linear Bellman completeness.

Definition (linear Bellman complete). A hypothesis class , with respect to some known feature , is linear Bellman complete for an MDP if is realizable and there exists such that for all and ,

For any , let be the greedy policy associated with . By the definition of a linear Bellman complete class, it follows that given some fixed , for any , we have

This shows that data reuse is possible for any linear Bellman complete class , since any can be evaluated using offline data collected by some fixed policy .

As an aside, note that linear Bellman completeness is a very strong condition that can break when new features are added. This is because adding new features expands the hypothesis space (of linear functions), and there is no guarantee that the new hypothesis class will again satisfy linear Bellman completeness.

It turns out that linear Bellman complete classes are just one example of Bilinear Classes (Du et al. 2021), which encompass many RL models in which sample-efficient generalization has been shown to be possible.

We assume access to a hypothesis class , which can be an abstract set that permits both model-based and value-based hypotheses. We assume that for all , there is an associated state-action value function and a value function for each . As before, let denote the greedy policy with respect to , and let denote . We can now introduce the Bilinear Class.

**Definition (Bilinear Class)**. Consider an MDP , a hypothesis class , and a discrepancy function (defined for each ). Suppose is realizable in and that there exist functions and for some . Then, forms a Bilinear Class for if the following two conditions hold.

- Bilinear regret: the on-policy difference between the claimed reward and the true reward satisfies the following upper bound,
- Data reuse:

As an example to demonstrate what the choices of and might look like, for a linear Bellman complete class , we can choose

Above, note that for all for linear Bellman complete classes, and that the discrepancy function in this case does not depend on . As demonstrated in Du et al. 2021, the following models (in which sample-efficient generalization is known to be possible) can all be shown to be Bilinear Classes for some discrepancy function :

- Linear Bellman Completion [Munos 2005, Zanette et al. 2020]
- Linear MDPs (low-rank transition matrix) [Wang and Yang 2018; Jin et al. 2019]
- Linear Quadratic Regulators (LQR): standard control theory model (see e.g. Wikipedia page for LQR)

- FLAMBE/Feature Selection: Agarwal, Kakade, Krishnamurthy, Sun 2020
- Linear Mixture MDPs: [Modi et al. 2020, Ayoub et al. 2020]
- Block MDPs Du et al. 2019
- Factored MDPs Sun et al 2019
- Kernelized Nonlinear Regulator Kakade et al. 2020
- and more (see Du et al. 2021 for details.)

Bilinear Classes can be seen as a generalization of Bellman rank (Jiang et al. 2017) and Witness rank (Sun et al. 2019), which were previous works that sought to identify structural commonalities between different RL models that enable sample-efficient generalization. That being said, there are still models (with known provable generalization) which Bilinear Classes do not cover. Two such exceptions are the deterministic linear (Wen and Van Roy 2013) model and the -state aggregation model (Dong et al. 2020). On a heuristic level, the structural commonalities identified by the Bilinear Classes show that to a large extent, most RL models known to enable sample-efficient generalization resemble linear bandits, in that data reuse is possible. In this sense, understanding why generalization is possible in the linear bandit case gives one intuition for why generalization is possible in these other cases as well. On some level, this may be disappointing since we might hope to capture richer phenomena than just linear bandits, but promisingly, there is a rich class of RL models which share these structural commonalities that enable generalization in RL (as the examples encompassed by the Bilinear Classes demonstrate).

From the discussion above, we see that a generalization theory for RL, while significantly distinct from that for supervised learning, is still possible. However, natural assumptions that might seem adequate, such as linear realizability, are in fact insufficient, and much stronger assumptions are required. One such example of sufficient assumptions is the Bilinear Class, which covers a rich set of models. Moreover, as the empirical results we saw in the interlude show, these representational issues identified by theory are relevant for practice. For more on the theory of RL, see the following forthcoming book.

Dear colleagues,

We invite you to nominate speakers for our TCS Women Rising Star talks at the TCS Women Spotlight Workshop at STOC 2021. To be eligible, your nominee has to be a theoretical computer science researcher (all topics represented at STOC are welcome) who is female or an underrepresented minority, and is a graduating PhD student or a postdoc. You can make your nomination by filling this form by May 15th: https://forms.gle/g4mTS2MJzkenKrry6

The TCS Women Spotlight workshop at STOC 2021 will take place virtually between June 21st and June 25th (most likely on Tuesday, June 22nd, to be confirmed later on).

You can see the list of speakers from last year here: https://sigact.org/tcswomen/3rd-tcs-women-meeting/tcs-women-2020/

Looking forward to your nominations and to seeing you at the TCS Women Spotlight Workshop!

Virginia Vassilevska Williams, Barna Saha, Sofya Raskhodnikova, Mary Wootters and Elena Grigorescu

Are you passionate about teaching? Or about increasing diversity within TCS? If so, we need your help!

The Committee for the Advancement of Theoretical Computer Science (CATCS) is organizing an online summer course that will take place from May 31 to June 4, 2021. New Horizons in Theoretical Computer Science is a week-long online summer school which will expose undergraduates to exciting research areas in theoretical computer science and its applications. The school will contain several mini-courses from top researchers in the field. We particularly encourage participants from groups that are currently under-represented in TCS. See https://boazbk.github.io/tcs-summerschool/ for more details.

We are looking for TAs to help run the school.

TAs will have the following responsibilities:

• Plan team building and ice breaking activities and social events for the summer school

• Lead small groups during the week

• Monitor questions in chat during lectures

• Work with one of the instructors to prepare one homework

• Grade homework

• Provide mentorship to students

• Possibly assist with reviewing applications and other technical/admin aspects of running the school

The time commitment will be ~20 hours during the week of May 31-June 4; ~5-10 hours prior to that week; and ~2-3 hours following that week. We are hoping to pay an amount of $500 to each TA (please note that international students will need a CPT for this).

To apply for a TA position, please fill in the application form at https://forms.gle/QCxLn8R81Ga4JQLH8 by April 15, 2021. Please also have a faculty advisor send a short recommendation to summer-school-admin@boazbarak.org. Please ask them to use the subject “TA recommendation for <<Your Name>>”.

Course organizers: Boaz Barak (Harvard), Shuchi Chawla (UT Austin), Madhur Tulsiani (TTI-Chicago)

Current list of confirmed instructors: Antonio Blanca (Penn State University), Ashia Wilson (MIT), Jelani Nelson (UC Berkeley), Nicole Immorlica (Microsoft Research), Yael Kalai (Microsoft Research).

Please email summer-school-admin@boazbarak.org with any questions.
