Are you passionate about teaching? Or about increasing diversity within TCS? If so, we need your help!

The Committee for the Advancement of Theoretical Computer Science (CATCS) is organizing an online summer course that will take place May 31 to June 4, 2021. New Horizons in Theoretical Computer Science is a week-long online summer school that will expose undergraduates to exciting research areas in theoretical computer science and its applications. The school will contain several mini-courses from top researchers in the field. We particularly encourage participants from groups that are currently under-represented in TCS. See https://boazbk.github.io/tcs-summerschool/ for more details.

We are looking for TAs to help run the school.

TAs will have the following responsibilities:

• Plan team building and ice breaking activities and social events for the summer school

• Lead small groups during the week

• Monitor questions in chat during lectures

• Work with one of the instructors to prepare one homework

• Grade homework

• Provide mentorship to students

• Possibly assist with reviewing applications and other technical/admin aspects of running the school

The time commitment will be ~20 hours during the week of May 31-June 4; ~5-10 hours prior to that week; and ~2-3 hours following that week. We are hoping to pay an amount of $500 to each TA (please note that international students will need a CPT for this).

To apply for a TA position, please fill in the application form at https://forms.gle/QCxLn8R81Ga4JQLH8 by April 15, 2021. Please also have a faculty advisor send a short recommendation to summer-school-admin@boazbarak.org. Please ask them to use the subject “TA recommendation for <<Your Name>>”.

Course organizers: Boaz Barak (Harvard), Shuchi Chawla (UT Austin), Madhur Tulsiani (TTI-Chicago)

Current list of confirmed instructors: Antonio Blanca (Penn State University), Ashia Wilson (MIT), Jelani Nelson (UC Berkeley), Nicole Immorlica (Microsoft Research), Yael Kalai (Microsoft Research).

Please email summer-school-admin@boazbarak.org with any questions.

**Previous post:** Inference and statistical physics **Next post:** TBD. See also all seminar posts and course webpage.

Alexander (Sasha) Rush is a professor at Cornell working in deep learning / NLP. He applies machine learning to problems of text generation, summarization of long documents, and interactions between character- and word-based models. Sasha was previously at Harvard, where he taught an awesome NLP class, and we are excited to have him as our guest! (Note: some of the figures in this lecture are taken from other papers or presentations.)

The first half of the talk will focus on how NLP works and what makes it interesting: a bird’s-eye view of the field. The second half of the talk will focus on current research.

Textual data poses challenges that differ from those of computer vision (CV), since language is a human phenomenon. There are methods that work in computer vision and other ML settings that simply don't work for NLP (e.g., GANs). As effective methods were found for computer vision around 2009-2014, we expected that these methods would also work well for NLP. While this was sometimes the case, it has not been true in general.

What are the difficulties of working with natural language? Language works at different scales:

```
word < phrase < sentence < document < ...
```

Here are examples of structure at each level:

- Zipf's Law: the frequency of any word is inversely proportional to its rank in the frequency table.
- Given the last symbols, it is often possible to predict the next one (the Shannon Game).
- Linguists have found many rules about syntax and semantics of a language.
- In a document, we have lots of discourse between different sentences. For example, “it” or other pronouns are context dependent.

In NLP, we will talk about the *syntax* and *semantics* of a document. The syntax refers to how words can fit together, and semantics refers to the meaning of these words.

There are many different NLP tasks such as sentiment analysis, question answering, named entity recognition, and translation. However, recent research shows that these tasks are often related to language modeling.

Language modeling, as explained in Shannon 1948, aims to answer the following question: *Think of language as a stochastic process producing symbols. Given the last symbols, can we predict the next one?*

This question is challenging as is. Consider the following example:

```
A reading lamp on the desk shed glow on polished ___
```

There are many options (marble, desk, stone, engraving, etc.), and it is already difficult to assign probabilities here. In general, language is hard to model because the next word can depend on words from much earlier in the text.

Shannon proposed variants of Markov models to perform this prediction, based on the last few characters or the context more generally.
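
A minimal sketch of a Shannon-style first-order (bigram) character model, estimated by counting; the training string here is a toy illustration:

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Estimate p(next char | current char) by counting adjacent pairs,
    i.e., a first-order Markov (bigram) character model."""
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    # normalize each row of counts into a conditional distribution
    return {a: {b: n / sum(c.values()) for b, n in c.items()}
            for a, c in counts.items()}

model = train_bigram("the theory of the thing")
print(model['h'])   # {'e': 0.75, 'i': 0.25}
```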

Since local context matters most, we assume that only the most recent $n-1$ words matter. Then we get the $n$-gram model $p(x_t \mid x_1, \dots, x_{t-1}) \approx p(x_t \mid x_{t-n+1}, \dots, x_{t-1})$.

As we have seen in the generative models lecture, we can use cross entropy as a loss function for density estimation models. Given a model distribution $q$ and the true distribution $p$, the cross entropy (which equals the expected negative log-likelihood) is defined as $H(p, q) = -\mathbb{E}_{x \sim p}[\log q(x)]$.

In NLP we tend to use the metric "perplexity", which is the exponentiated cross entropy $\mathrm{PPL} = e^{H(p, q)}$.

This corresponds to the equivalent vocabulary size of a uniformly distributed model. Lower perplexity means our model was closer to the underlying text distribution. As an example, the perplexity of the perfect dice-roll model would be 6.
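
The dice example can be checked directly; a minimal sketch, assuming the model assigns probability 1/6 to each observed roll:

```python
import math

def perplexity(probs):
    """Perplexity = exp(cross-entropy), where cross-entropy is the
    average negative log-likelihood per observed symbol."""
    h = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(h)

# A perfect model of a fair die assigns 1/6 to every observed roll:
rolls = [1 / 6] * 100        # model probabilities for 100 observed rolls
print(perplexity(rolls))     # 6.0 (up to floating point)
```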

Why do we care about perplexity anyway?

- With a good model we can determine the natural perplexity of a language, which is interesting.
- Many NLP questions are language modeling with conditioning. Speech recognition is language modeling conditioned on some sound signal, and translation is language modeling conditioned on text from another language.
- More importantly, we have found recent applications in *transfer learning*: a pretrained language model can be adapted using some (small) amount of data for a specific task, and then such a model becomes effective at the given task!

A few years ago, the best perplexity on WSJ text was 150. Nowadays, it is about 20! To understand how we got here, we look at modern language modeling.

We start with the model

$p(x_t \mid x_1, \dots, x_{t-1}) = \mathrm{softmax}(\mathbf{W}\,\phi(x_1, \dots, x_{t-1})).$

(The softmax function maps a vector $z \in \mathbb{R}^n$ into a probability distribution such that $\mathrm{softmax}(z)_i \propto e^{z_i}$. That is, $\mathrm{softmax}(z)_i = e^{z_i} / \sum_{j=1}^n e^{z_j}$. Note that this is a Boltzmann distribution of the type we saw in the statistical physics and variational algorithms lecture.)

We call **W** the output word embeddings and $\phi$ some neural basis (e.g., $\phi$ is all but the final layer of a neural net, and **W** is the final layer). However, this means that using softmax to predict the next word requires computing the softmax over every word in the vocabulary (tens or hundreds of thousands of words). This was often infeasible until GPU computing became available.
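
To make the cost concrete, here is a minimal sketch of such an output layer over a made-up vocabulary; note that every single prediction touches all `vocab_size` rows of **W** (the sizes here are illustrative):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: subtract the max before exponentiating."""
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
vocab_size, hidden = 10_000, 128                        # illustrative sizes
W = 0.01 * rng.standard_normal((vocab_size, hidden))    # output word embeddings
phi = rng.standard_normal(hidden)                       # neural "basis" of the context
p = softmax(W @ phi)          # one prediction = one full pass over the vocabulary
print(p.shape)                # (10000,)
```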

As an aside, why not predict characters instead of words? The advantage is that there are far fewer characters than words. However, computation with characters is slower, and empirically, character-based models tend to perform worse than word-based ones. On the other hand, character-based models can handle words outside the vocabulary.

Byte-pair encoding (BPE) offers a bridge between character and word models. It greedily builds up new tokens as repetitive patterns are found in the original text.
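
A minimal sketch of the greedy merging idea (not any particular library's tokenizer):

```python
from collections import Counter

def bpe_merges(word, num_merges):
    """Greedy byte-pair encoding sketch: repeatedly merge the most
    frequent adjacent pair of tokens into a single new token."""
    tokens = list(word)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append(a + b)
        # replace every occurrence of the pair with the merged token
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens, merges

tokens, merges = bpe_merges("banana", 2)
print(tokens)   # ['ban', 'an', 'a'] after merging 'an', then 'b'+'an'
```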

In the last decade NLP has seen a few dominant architectures, all trained with SGD but with varying bases. First, we cast the words as one-hot vectors, then embed them into vector space:

$v = E\,\delta_x,$

where $\delta_x$ is the one-hot encoded vector and $E$ is the learned embedding transformation.

The NNLM (Neural Network Language Model) is like a CNN. The model predicts from possibly multiple NN transformations:

$p(x_t \mid x_{t-n}, \dots, x_{t-1}) = \mathrm{softmax}(\mathbf{W}\,\sigma(C\,[v_{t-n}; \dots; v_{t-1}])),$

where $[\cdot\,;\cdot]$ denotes concatenation, $C$ is some convolutional filter, and $\sigma$ is the activation function. This has the benefit of learning fast. The matrices it learns also transfer well.

As an example, GloVe is an NNLM-inspired model. It stores words as vectors in 300-dimensional space. When we project some words to two dimensions using PCA, we find semantic information in the language model.

A feed-forward model uses a fixed window of previous words to predict the next word. A recurrent network uses all previous words:

$h_t = \sigma(W^{h} h_{t-1} + W^{v} v_t), \qquad p(x_{t+1} \mid x_1, \dots, x_t) = \mathrm{softmax}(\mathbf{W} h_t).$

Previous information is 'summarized' in the hidden state $h_t$ on the right, so this model uses finite memory. Below is an illustration of the recurrent neural network.

Since the recurrent model uses the full context, it is a more plausible model for how we really process language. Furthermore, the introduction of RNN saw drastically improved performance. In the graph below, the items in the chart are performances from previous NNLMs. The recurrent network performance is “off the chart”.

However, the model grows with sequence length. This requires gradients to backpropagate over arbitrarily long sequences, and it often required baroque network designs to facilitate longer sequences while avoiding exploding or vanishing gradients.

To understand modern NLP we must look at attention. For a set of value vectors $v_1, \dots, v_T$, keys $k_1, \dots, k_T$, and a query $q$, we define attention as

$\mathrm{Attn}(v, k, q) = \sum_{t=1}^{T} \alpha_t v_t,$

where $\alpha = \mathrm{softmax}(\langle k_1, q\rangle, \dots, \langle k_T, q\rangle)$.

Here, $\alpha$ can be considered a probability distribution over the model's 'memory' of the sequence. The model decides which words are important by combining the keys and the query.
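
The definition above can be sketched in a few lines of NumPy; the toy keys and query here are illustrative choices, made so that the concentration of attention mass is easy to see:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())           # stable softmax
    return e / e.sum()

def attention(values, keys, query):
    """alpha_t ∝ exp(<k_t, q>); output = sum_t alpha_t * v_t."""
    alpha = softmax(keys @ query)     # one dot product per position
    return alpha @ values, alpha

rng = np.random.default_rng(0)
T, d = 5, 8
values = rng.standard_normal((T, d))
keys = np.eye(T, d)                   # toy orthogonal keys, one per position
query = 10 * keys[2]                  # a query strongly aligned with key 2
out, alpha = attention(values, keys, query)
print(alpha.argmax())                 # 2: the attention mass concentrates there
```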

The *attentional model* can be fully autoregressive (using all previously seen words), and the query can be learned or be specific to an input.

Since we condense all previous information into the attention mechanism, it is simpler to backpropagate.

In particular, attention enables looking at information from a large context without paying for this in the depth of the network, and hence in the depth of back-propagation you need to cover. (Boaz's note: with my crypto background, attention reminds me of the design of block ciphers, which use linear operations to mix between far-away parts of the inputs, and then apply non-linearity locally to each small part.)

Note that attention is defined with respect to a set of vectors: there is no notion of positional information in the attentional model. How do we encode position for the model? One way is to use a *sinusoidal encoding* in the keys. We store the position $t$ as $\cos(t/\omega)$ for some period $\omega$. Notice that if we choose many different periods, the cosine waves will almost never meet at the same point. As a result, only nearby positions will have high dot products between their encodings.
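
A minimal sketch of such an encoding, following the common sin/cos construction with geometrically spaced periods (the dimension and base period are illustrative choices):

```python
import numpy as np

def positional_encoding(t, dim, base=10_000):
    """Sinusoidal position code: (sin, cos) pairs at geometrically
    spaced frequencies, one pair per pair of dimensions."""
    i = np.arange(dim // 2)
    freq = 1.0 / base ** (2 * i / dim)
    return np.concatenate([np.sin(t * freq), np.cos(t * freq)])

# Nearby positions have higher dot products than distant ones:
p = {t: positional_encoding(t, 64) for t in (0, 1, 50)}
print(p[0] @ p[1] > p[0] @ p[50])   # True
```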

A transformer is a stacked attention model: the computation in one layer becomes the queries, keys, and values for the next layer. It uses multi-headed attention: we learn several projections for each query/key (generally between 8 and 32), then apply softmax across each of these projections separately.

These heads can be computed in parallel and can be implemented with batch matrix multiplication. As a result, transformers can be massively scaled and are extremely efficient in modern hardware. This has led these models to be very dominant in the field. Here are some example models:

- GPT-1,2,3 are able to generate text that is quite convincing to a human. They also handle the syntax and semantics of a language quite well.
- BERT is a transformer-based model that examines text both forwards and backwards in making its predictions. It works well with transfer learning via fine-tuning: pretrain on a large data set, then take the feature representations and train a task-specific model on top of the learned representation.

In recent years we have had larger and larger models, from GPT-1's 110 million parameters to GPT-3's 175 billion.

On these massive scales, scaling has become a very interesting issue: how do we process more samples? How do we run distributed computation? How much autoregressive input should each model look at? (Boaz's note: to get a sense of how big these models are, GPT-3 was trained on about 1000B tokens. The total number of people who ever lived is about 100B, and only about half of them since the invention of the printing press, so this is arguably a non-negligible fraction of all text produced in human history.)

For a model like BERT, most of the cost still comes from the feed-forward network, i.e., mostly matrix multiplications. These are tasks we are familiar with and can scale up.

One question is how to perform long-range attentional lookup, which is important for language modeling. For now, models typically look at most 512 tokens back. Can we do longer-range lookups? One approach is kernel feature attention.

Recall that we have $\alpha = \mathrm{softmax}(\langle k_1, q\rangle, \dots, \langle k_T, q\rangle)$. Can we approximate this with some kernel? The main observation is that the softmax involves terms $e^{\langle k, q\rangle}$, which we approximate with a kernel $K(k, q)$. There is a rich literature on approximating kernels by $K(k, q) \approx \phi(k)^\top \phi(q)$ for some transformation $\phi$. Then we can try to approximate attention with linear features.

Practically, transformers do well but are slower; for longer texts, we have faster models that do slightly worse. A recent model called Performer is one such example.

Ultimately, we want models to run on "non-Google-scale" hardware once they have been trained for a specific task. This often requires scaling down.

One approach is to prune weights according to their magnitude. However, since models are often overparameterized and weights do not move much during training, the weights pruned by this method are usually those that happened to be initialized closest to 0. In the diagram below, we keep only the orange weights and cut out the gray ones.
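
A minimal sketch of magnitude pruning on a plain NumPy weight matrix (the 90% sparsity level is an illustrative choice):

```python
import numpy as np

def prune_by_magnitude(weights, sparsity=0.9):
    """Zero out the smallest-magnitude fraction of weights, keeping
    only the largest 1 - sparsity fraction."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

rng = np.random.default_rng(0)
W = rng.standard_normal((100, 100))
W_pruned, mask = prune_by_magnitude(W, sparsity=0.9)
print(mask.mean())   # ~0.1: about 10% of weights survive
```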

Another approach, if you're trying to ship a model for a specific task, is to mask out the weights that are unnecessary for that task.

Two major lines of research dominate NLP today:

- Attention / Transformer architecture,
- Pretraining with language modeling as a deep learning paradigm.

We are also in a space race to produce efficient models with more parameters, given how much scaling has been effective.

The paper *Which BERT?* classifies modern questions in NLP into the following categories:

- **Tasks:** Is (masked) language modeling the best task for pretraining? Language modeling emphasizes local information; we could imagine doing other types of denoising. See also DALL-E.
- **Efficiency:** We see that far fewer parameters are needed in practice after pruning. Does pruning lose something? Pruned models tend to do well on in-sample data, but worse on out-of-sample data.
- **Data:** How does the data used in pretraining impact task accuracy? How does it impact task bias?
- **Interpretability:** What does BERT know, and how can we determine this? Does interpretability need to come at a cost to performance?
- **Multilinguality:** Many languages don't have the same amount of data as English. What methods apply when we have less data?

Many questions were asked during and after the lecture. Here are some of them.

**Q:** Should we say GANs fail at NLP, or that other generative models are more advanced in NLP than in CV?
**A:** One argument is that since language is a human-generated system, there are some inherent structures that help with generation. We can generate language left-to-right, but in CV this would be a lot more difficult. At the same time, this can change in the future!

**Q:** Why are computer vision and NLP somewhat close to each other?
**A:** Classically, they are both perception-style tasks under AI. Also, around 2014 many ideas were ported from CV into NLP, and recently we have seen NLP ideas ported to CV.

**Q:** Since languages have unique grammars, is NLP better at some languages? Do we have to translate a language to an "NLP-effective" language and back?
**A:** In the past, some languages were easier: for example, we used to struggle with translating Japanese to other languages but did well with English to other languages. However, modern models are *extremely* data driven, so much less hardcoding has been needed.

**Q:** Have we made any scatter plot of the form (data available for language X, performance on X) to see if performance is just a function of available data?
**A:** Not right now, but such plots could be really cool! Multilinguality is a broad area of research in general.

**Q:** What are some NLP techniques for low-resource languages?
**A:** Bridging is commonly used. Iterative models (translate and translate back with some consistency) are also used to augment the data.

**Q:** Do you think old-school parsers will make a comeback?
**A:** We are unlikely to deploy parsers, but the techniques of parsing are interesting.

**Q:** Given the large number of possible "correct" answers, has there been work on which "contexts" are most informative?
**A:** The best context is the closest context, which is expected. The other words matter, but a lot less.

**Q:** Is there any model that captures the global structure first (e.g., an outline) before writing sequentially, like humans do when they write longer texts?
**A:** Currently no. It should be possible, but we do not have data about the global structure of writing.

**Q:** Why is our goal density estimation?
**A:** It is useful because it tells us how "surprising" the next word is. This is also related to why a language feels "fast" when you first learn it: because you are not familiar with the words, you cannot anticipate them.

**Q:** Why is lower perplexity better?
**A:** Recall from a past talk that lower cross-entropy means less distance between the model and the true distribution; intuitively, you have more "certainty".

**Q:** Is the reason language modeling is so important that evaluations are syntax-focused?
**A:** Evaluations are actually more semantically focused, but syntax and semantics are quite connected.

**Q:** Do we see attentional models in CV?
**A:** Yes, we have seen more use of transformers in CV. In a standard model we use only recent data, and here we get to work with data across space and time. As such, we will need to encode time positionally.

**Q:** Why is attention a generalized convolution?
**A:** If you have attention with all its mass in a local area, that is essentially a convolution.

**Q:** How do we compare heads with depth? E.g., is having 5 heads better than 5x depth?
**A:** When we use heads we add far fewer parameters. We can also parallelize heads and increase performance.

**Q:** Do you think transformers are the be-all end-all of NLP models?
**A:** Maybe. To dethrone transformers, you have to both show similar performance at small scale and show that it can be scaled easily.

**Q:** How does simplicity bias affect these transferable models?
**A:** It is surprising and we are not sure. In CV we found that the models quickly notice peculiarities in the data (e.g., how mechanical turkers are grouped), but the models do work.

**Q:** We get bigger models every year and better performance. Will this end soon?
**A:** Probably not, as it seems that having more parameters helps the model recognize additional features.

**Q:** If we prune models to the same size, will they have the same performance?
**A:** For small models we seem to be able to prune them, but for the bigger models it is hard to run these experiments in academia given the computational resource constraints.

**Q:** When we try to remember something from a long time ago, we look it up in a textbook, etc. Have there been similar approaches in practice?
**A:** Transformer training is static at first, and tasks come later. So we have to decide how to throw away information before we train on the tasks.

**Q:** Are better evaluation metrics an important direction for future research?
**A:** Yes; this has been the case for the past few years in academia.

**Q:** What is a benchmark/task where you think current models show a deep lack of capability?
**A:** During generation, models don't seem to distinguish between information that makes the text "sound good" and factually correct information.

**Previous post:** Robustness in train and test time **Next post:** Natural Language Processing (guest lecture by Sasha Rush). See also all seminar posts and course webpage.

lecture slides (pdf) – lecture slides (Powerpoint with animation and annotation) – video

Before getting started, we’ll discuss the difference between the two dominant schools of thought in probability: Frequentism and Bayesianism.

Frequentism holds that the probability of an event is the long-run average proportion of times it happens over many, many trials, so a coin has 50% probability of landing heads if on average half of the flips are heads. One consequence of this framework is that a one-off event like Biden winning the election doesn't have a probability, as by definition it can only be observed once.

Bayesians reject this model of the world. For example, a famous Bayesian, Jaynes, even wrote that “probabilities are frequencies only in an imaginary universe.”

While these branches of thought are different, generally the answers to most questions are the same, so the distinction will not matter for this class. However, these branches of thought inspire different types of methods, which we now discuss.

For example, suppose that we have a probability distribution $p$ that we can get samples from, in the form of $x_1, \dots, x_n \sim p$. In statistics, our goal is to calculate a hypothesis $h$ which minimizes some quantity $\mathcal{L}(p, h)$ (possibly a loss function, but any minimand will do), where we estimate $p$ with our samples $x_1, \dots, x_n$.

A **frequentist** does this by defining a family of potential distributions $\{p_\theta\}_{\theta \in \Theta}$, and finding an estimator $T$, mapping samples to hypotheses, which minimizes the cost for all $\theta$:

$\min_T \max_{\theta \in \Theta} \; \mathbb{E}_{x_1, \dots, x_n \sim p_\theta}\left[\mathcal{L}\big(p_\theta, T(x_1, \dots, x_n)\big)\right].$

This equation amounts to saying that $T$ minimizes the worst-case loss over all distributions in the family.

By contrast, a **Bayesian** approaches this task by assuming that $x_1, \dots, x_n \sim p_\theta$, where $\theta$ is a latent variable sampled from a prior $\pi$.

Then, let $\pi(\theta \mid x_1, \dots, x_n)$ be the *posterior* distribution of $\theta$ conditioned on the observations $x_1, \dots, x_n$. We now minimize

$\mathbb{E}_{\theta \sim \pi(\theta \mid x_1, \dots, x_n)}\left[\mathcal{L}(p_\theta, h)\right]$ over hypotheses $h$.

In fact, Bayesian approaches extend beyond this, as if you can sample from a posterior you can do more than just minimize a loss function.

In Bayesian inference, choosing the prior is often very important. One classical approach is a *maximum entropy prior*, because it is the least informative (hence, the maximum entropy) and requires the fewest assumptions.

These two minimization problems can sometimes lead to identical results, but in practice they often work out differently once we introduce **computational constraints** into the picture. In the frequentist approach, we generally constrain the family of mappings $T$ to be efficiently computable. In the Bayesian approach, we typically approximate the posterior instead, and either use approximate sampling or a restricted hypothesis class for the posterior so that we can efficiently sample from it.

Compare water and ice. Water is hotter, and its molecules move around more; in ice, by contrast, the molecules are more stationary. When the temperature increases, the particles move more quickly, and when it decreases they have less energy and slow down. There are also phase transitions, where certain temperatures cause qualitative discontinuities in behavior, like solid to liquid or liquid to gas.

See video from here:

An atomic state can be thought of as a complete microscopic description $x$ of the system. Now, how do we represent the system's state? The crucial observation that makes statistical physics different from "vanilla physics" is the insight to represent the system state as a probability distribution supported on the atomic states, rather than as a single state.

Each atomic state $x$ has a negative energy (which we will think of as a utility function) $W(x)$. In some sense, the system "wants" to have a high value of $W$. In addition, when the temperature is high, the system "wants" to have a high value of entropy.

Thus, an axiom of thermodynamics states that to find the true distribution $p$, we need only look at the maximizer of

$\mathbb{E}_{x \sim p}[W(x)] + T \cdot H(p),$

where $T$ is the temperature and $H(p) = -\mathbb{E}_{x \sim p}[\log p(x)]$ is the entropy. That is, the true distribution is the one which maximizes a linear combination of the expected negative energy and the entropy, with the temperature controlling the coefficient of entropy in this combination (the higher the temperature, the more the system prioritizes having high entropy).

The **variational principle**, which we prove later, states that the distribution $p$ which maximizes this satisfies

$p(x) \propto e^{W(x)/T},$

which is known as the **Boltzmann distribution**. We often write this with a normalizing constant, so $p(x) = e^{W(x)/T}/Z$, where $Z = \sum_x e^{W(x)/T}$.

Before proving the variational principle, we will go through some examples of statistical physics.

In the Ising model, we have magnets connected in a square grid. The atomic state is $x \in \{\pm 1\}^n$, where each $x_i$ represents a "spin". The value of $W$ is $W(x) = \sum_{i \sim j} x_i x_j$, where $i \sim j$ denotes that $i$ and $j$ are adjacent magnets. This encourages adjacent magnets to have aligned spins. One important concept, which we will return to later, is that the values of $\{x_i x_j\}_{i \sim j}$ are sufficient to calculate the value of $W(x)$ (these are known as *sufficient statistics*). Furthermore, if we wanted to calculate $\mathbb{E}[W(x)]$, it would be enough to calculate the values of $\mathbb{E}[x_i x_j]$ for each edge and then apply linearity of expectation. An illustration of an Ising model that we cool down slowly can be found here:

and a video can be found here.

This is what a low-temperature Ising model looks like: note that $W(x)$ is high because almost all adjacent spins are aligned (hence the well-defined regions).
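
As a sanity check, the Boltzmann distribution of a tiny Ising model can be computed exactly by brute-force enumeration; a minimal sketch (the 2x2 grid and temperature are illustrative choices):

```python
import itertools
import numpy as np

def ising_W(x, edges):
    """Negative energy: W(x) = sum over edges {i, j} of x_i * x_j."""
    return sum(x[i] * x[j] for i, j in edges)

def boltzmann(T, edges, n):
    """Exact Boltzmann distribution p(x) ∝ exp(W(x)/T), by enumerating
    all 2^n spin configurations (only feasible for tiny n)."""
    states = list(itertools.product([-1, 1], repeat=n))
    w = np.array([ising_W(x, edges) / T for x in states])
    p = np.exp(w - w.max())            # shift for numerical stability
    return states, p / p.sum()

# A 2x2 grid of spins: vertices 0..3, four nearest-neighbor edges
grid_edges = [(0, 1), (2, 3), (0, 2), (1, 3)]
states, p = boltzmann(T=0.5, edges=grid_edges, n=4)
# At low temperature, the two fully aligned states dominate the mass:
print(states[int(np.argmax(p))])
```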

The Sherrington-Kirkpatrick (SK) model is a generalization of the Ising model to a random graph, representing a disordered mean-field model. Here, $x_i$ still represents spin, and $W(x) = \sum_{i < j} J_{ij} x_i x_j$, where the $J_{ij}$ are independent Gaussians. The SK model is deeply influential in statistical physics. We say that it is *disordered* because the utility function is chosen at random, and that it is a *mean-field* model because, unlike the Ising model, there is no geometry, in the sense that every pair of individual variables $x_i$ and $x_j$ is equally likely to be connected.

Our third example is the posterior distribution, where $x$ is a hidden variable with a uniform prior. We now make, say, $n$ independent observations $y_1, \dots, y_n$. The probability of $x$ given the observations is

$p(x \mid y_1, \dots, y_n) \propto \prod_{i=1}^n p(y_i \mid x),$

so the posterior is a Boltzmann distribution with $W(x) = \sum_i \log p(y_i \mid x)$ and $T = 1$. Note that the right-hand side is easy to calculate, but in practice the normalizing factor (also known as the partition function) can be difficult to compute and often represents a large barrier, in the sense that being able to compute the partition function would make many of these questions far easier.

Now, we prove the variational principle, which states that if $p(x) = e^{W(x)/T}/Z$, where $Z = \sum_x e^{W(x)/T}$ is the normalizing factor, then

$p = \arg\max_q \left\{ \mathbb{E}_{x \sim q}[W(x)] + T \cdot H(q) \right\}.$

Before proving the variational principle, we make the following claim: if $p$ is defined as above, then

$\mathbb{E}_{x \sim p}[W(x)] + T \cdot H(p) = T \log Z.$

**Proof of claim:**

Write

$H(p) = -\mathbb{E}_{x \sim p}[\log p(x)] = -\mathbb{E}_{x \sim p}\left[\tfrac{W(x)}{T} - \log Z\right]$

(where we plugged in the definition of $p$ via $\log p(x) = W(x)/T - \log Z$).

We can now rewrite this as

$H(p) = -\tfrac{1}{T}\,\mathbb{E}_{x \sim p}[W(x)] + \log Z,$

using the fact that $\log Z$ is a constant (independent of $x$). Multiplying by $T$ and rearranging, we obtain the claim.

**Proof of variational principle:**

Given the claim, we can now prove the variational principle.

Let $q$ be any distribution. We write

$\mathbb{E}_{x \sim q}[W(x)] + T \cdot H(q) = T\,\mathbb{E}_{x \sim q}[\log p(x) + \log Z] + T \cdot H(q) = T \log Z - T \cdot \mathrm{KL}(q \,\|\, p),$

where the first equation uses our expression $\log p(x) = W(x)/T - \log Z$. Since $\mathrm{KL}(q \,\|\, p) \ge 0$, with equality if and only if $q = p$, this quantity is maximized at $q = p$, as desired.

**Remarks:**

- When $T = 1$, we have $\log Z = \max_q \left\{\mathbb{E}_{x \sim q}[W(x)] + H(q)\right\}$.
- In particular, for every value of $q$, we have that

$\log Z \ge \mathbb{E}_{x \sim q}[W(x)] + H(q),$

which we've seen before! Note that equality holds when $q = p$, but to approximate $\log Z$ we can simply take more tractable values of $q$.

In many situations, we can compute $W(x)$ (and hence $e^{W(x)/T}$), but can't compute the partition function $Z$, which stifles many applications. One upshot, though, is that we can calculate ratios $p(x)/p(x') = e^{(W(x) - W(x'))/T}$, which is good enough for some applications, like Markov Chain Monte Carlo (MCMC).

An important task in statistics is to sample from a distribution $p$, even for very complicated $p$. MCMC does this by constructing a Markov chain whose stationary distribution is $p$. The most common instantiation of MCMC is Metropolis-Hastings, which we now describe. First, we assume that there exists an undirected graph on the states such that neighboring states $x \sim x'$ have similar probabilities, in the sense that the ratio $p(x)/p(x')$ is neither too small nor too large. Then the Metropolis-Hastings algorithm is as follows.

**Metropolis-Hastings Algorithm**:

- Draw $x_0$ at random.
- For $t = 0, 1, 2, \dots$, choose a random neighbor $x'$ of $x_t$, and let $x_{t+1} = x'$ with probability $\min\left(1, \frac{p(x')}{p(x_t)}\right)$; otherwise let $x_{t+1} = x_t$.

Then, eventually $x_t$ is distributed as $p$! To show that this samples from $p$, we show that $p$ is the stationary distribution of this Markov chain. In this case, this turns out to be easy, since we can check the *detailed balance conditions* $p(x)\,P(x \to x') = p(x')\,P(x' \to x)$, so $p$ is the stationary distribution of the Markov chain.
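
A minimal sketch of Metropolis-Hastings in Python, for a toy target on a cycle graph (the target $W$ and the neighborhood structure are illustrative choices, not from the lecture):

```python
import math
import random
from collections import Counter

def metropolis_hastings(W, n, steps, T=1.0, seed=0):
    """Sample from p(x) ∝ exp(W(x)/T) on the cycle 0..n-1, where the
    neighbors of x are x-1 and x+1 (mod n). Only *ratios*
    p(x')/p(x) = exp((W(x') - W(x))/T) are needed, never Z itself."""
    rng = random.Random(seed)
    x = rng.randrange(n)
    samples = []
    for _ in range(steps):
        x_new = (x + rng.choice([-1, 1])) % n
        dW = W(x_new) - W(x)
        # accept with probability min(1, p(x_new)/p(x))
        if dW >= 0 or rng.random() < math.exp(dW / T):
            x = x_new
        samples.append(x)
    return samples

# Target: p(x) ∝ exp(-(x - 5)^2) on {0, ..., 9}, sharply peaked at 5
samples = metropolis_hastings(lambda x: -(x - 5) ** 2, n=10, steps=50_000)
print(Counter(samples).most_common(1)[0][0])   # 5: the mode of p
```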

The stationary distribution is often unique, so we've proven that this eventually samples from $p$. In MCMC algorithms, however, an important question is how fast we converge to the stationary distribution. Often this is rather slow, which is especially disappointing because if it were faster, many very difficult problems, like generic optimization, could be solved much more easily.

Indeed, there are examples where convergence to the stationary distribution would take exponential time.

**Note:** One way to make MCMC accept more moves is to let $x'$ be really close to $x_t$, because the likelihood ratio will be closer to $1$ and the chain will spend less time stuck at its current location. There's a tradeoff, however: if $x'$ is too close to $x_t$, then the chain will not *mix* as quickly, where mixing is the convergence of $x_t$ to the stationary distribution, because generally to mix the chain must get close to every point, and making each of its steps smaller makes that more difficult.

One application of MCMC is simulated annealing. Suppose that we have a function $W$, and we want to find $\arg\max_x W(x)$. The most direct attempt at solving this problem is creating a Markov chain that simply samples from the zero-temperature distribution, which is concentrated on the maximizers. However, this is impractical, at least directly. It is like having a domino far away from us that we can't knock down. What do we do in this case? We put some dominoes in between!

To this end, we now create a sequence of Markov chains whose stationary distributions are $p_T(x) \propto e^{W(x)/T}$, as $T$ gets smaller and smaller. This corresponds to cooling a system from a high temperature to a low temperature. Essentially, we want to sample from $p_T$ as $T \to 0$.

Simulated annealing lets us begin by sampling at $T = \infty$, which is uniform on the support. Then we can slowly reduce $T$ from $\infty$ to $0$. When cooling a system, it will be helpful to think of two stages: first, a stage in which the object cools down; second, a settling stage in which the system settles into a more stable state. The settling stage is simulated via MCMC on $p_T$, so the transition probability from $x$ to a neighbor $x'$ is $\min\left(1, e^{(W(x') - W(x))/T}\right)$.

**Simulated Annealing Algorithm**:

1. *Cooling:* Begin at $T = \infty$ (or very large), and lower the value of $T$ to zero according to some schedule.
2. *Settling:* At each temperature, repeatedly move from $x$ to a random neighbor $x'$ with probability $\min\left(1, e^{(W(x') - W(x))/T}\right)$.
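
A minimal sketch of this scheme on a toy one-dimensional problem (the objective, neighbor function, and cooling schedule are all illustrative choices):

```python
import math
import random

def simulated_annealing(W, neighbor, x0, schedule, settle_steps, seed=0):
    """Maximize W: lower T according to `schedule`, settling at each
    temperature via Metropolis moves with acceptance
    min(1, exp((W(x') - W(x)) / T))."""
    rng = random.Random(seed)
    x = x0
    for T in schedule:
        for _ in range(settle_steps):
            x_new = neighbor(x, rng)
            dW = W(x_new) - W(x)
            # accept uphill moves always; downhill with Boltzmann probability
            if dW >= 0 or rng.random() < math.exp(dW / T):
                x = x_new
    return x

# Toy problem: maximize W(x) = -(x - 7)^2 over the integers 0..20
best = simulated_annealing(
    W=lambda x: -(x - 7) ** 2,
    neighbor=lambda x, rng: min(20, max(0, x + rng.choice([-1, 1]))),
    x0=0,
    schedule=[10, 5, 2, 1, 0.5, 0.1, 0.01],
    settle_steps=2000,
)
print(best)   # 7
```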

Since simulated annealing is inspired by physical intuition, its shortcomings can be interpreted physically too. Namely, when you cool something too quickly it settles into a glassy state instead of a ground state; this is what happens when simulated annealing fails and the algorithm gets stuck in local optima. Note that this is fundamentally a failure mode of MCMC: the chain can't mix quickly enough.

See this video for an illustration of simulated annealing.

Note that the posterior of a hidden variable conditioned on observations is a Boltzmann distribution. We now have $p_W(x) \propto e^{W(x)}$, an exponential distribution. Now suppose that $W(x) = \langle w, \hat{x} \rangle$, where $\hat{x}$ denotes the sufficient statistics of $x$. For example, the energy function could follow that of a tree, so $\hat{x}$ collects the pairwise terms along the tree's edges. If you want to find the expected value $\mathbb{E}_{x \sim p_W}[W(x)]$, it is enough to know $\mu = \mathbb{E}_{x \sim p_W}[\hat{x}]$. (By following a tree, we mean that the undirected graphical model of the distribution is that of a tree.)

Now how do we sample from a tree-structured distribution? The most direct way is by calculating the marginals. While marginalization is generally quite difficult, it is much more tractable on certain types of graphs. For a sense of what this entails, suppose that we have a tree-structured Gaussian model whose joint log-density is a Laplacian quadratic form over the edges of the tree.

(This somewhat unusual parametrization is chosen so that the relationships between the marginals are clean.)

Then one can show that the marginal means and variances satisfy a small linear system; in the example above this gives six linear equations in six unknowns, which we can solve. Once we can marginalize a variable, say $x_1$, we can then simply sample from its marginal, then sample from the conditional distribution given $x_1$ to sample from the entire distribution. This method is known as belief propagation.

This algorithm (with some modifications) also works for general graphs, but what it represents on general graphs is not the exact marginal, but rather an extension of that graph to a tree.

Now let's try to develop more theory for exponential distributions. Let the PDF of this distribution be $p_W(x) = e^{\langle w, \hat{x} \rangle - A(w)}$. A few useful facts about these distributions that often appear in calculations: $\nabla A(w) = \mathbb{E}_{p_W}[\hat{x}] = \mu$ and $\nabla^2 A(w) = \text{Cov}_{p_W}(\hat{x}) \succeq 0$.

By the variational principle, we have that among all distributions with the same sufficient statistics as a given Boltzmann distribution, the true one maximizes the entropy, so

An analogous idea gives us

so is the *maximum entropy* distribution consistent with observations .

The important consequence of these two facts is that in principle, determines and vice versa. (Up to fine print of using *minimal* sufficient statistics, which we ignore here.) We will refer to as the “canonical parameter space” and as the “mean representation space”. Now note that the first equation gives us a way of going from to , and the second equation gives us a way to go from to , at least information-theoretically.

Now, how do we do this algorithmically? Using the first fact, if we were able to sample from the distribution (which is generally possible if we can evaluate the density up to normalization), then we can estimate the mean through samples. We can also obtain from by estimating .

On the other hand, if you have , to obtain you first consider and then note that the desired value is , so solving this problem boils down to estimating efficiently, because setting it to zero will give us .

Thus going from to requires estimating , whereas going from to requires estimating .
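As a sanity check of the two directions above, here is a sketch on the simplest exponential family, the Bernoulli distribution; the parameter value is an arbitrary illustration.

```python
import math

# Bernoulli as an exponential family: p_w(x) = exp(w*x - A(w)), x in {0, 1},
# with log-partition function A(w) = log(1 + e^w).
def A(w): return math.log(1 + math.exp(w))

w = 0.7
# Mean parameter mu = E_{p_w}[x] = sigmoid(w).
mu = 1 / (1 + math.exp(-w))

# Going w -> mu: the gradient of A equals the mean (finite-difference check).
eps = 1e-6
grad_A = (A(w + eps) - A(w - eps)) / (2 * eps)
assert abs(grad_A - mu) < 1e-6

# Going mu -> w: invert the mean map, w = log(mu / (1 - mu)) (the logit).
w_back = math.log(mu / (1 - mu))
assert abs(w_back - w) < 1e-9
print(mu, w_back)
```

In this one-dimensional case both directions have closed forms; the point of the lecture is that in high dimensions each direction requires estimating an expectation or solving an optimization problem.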

When is a *posterior distribution*, the observed data typically gives us the weights , and hence the inference problem becomes to use that to sample from the posterior.

There are many examples of exponential distributions, which we now give.

- High-Dimensional Normals: ,
- Ising Model: , . A sufficient statistic for these is , a fact which we will invoke repeatedly.
- There are many more, including Gaussian Markov Random Fields, Latent Dirichlet Allocation, and Mixtures of Gaussians.

Now, we show how to go from to in a special case, namely the mean-field approximation, which approximates a distribution by a product distribution over , for which

Recall that the partition function can be computed as

where is the set of all probability distributions. If we instead write as the set of product distributions (parametrized by , or the probability that each variable is ), we get

where and it now suffices to maximize the right hand function over the set .

We can generally maximize concave functions over a convex set such as . Unfortunately, this function is not concave. However, it is concave in every coordinate. This suggests the following algorithm: fix all but one variable and maximize over that variable, and repeat. This approach is known as Coordinate Ascent Variational Inference (CAVI), and its pseudocode is given below.

**CAVI Algorithm**

- Let
- Repeatedly choose values of in (possibly at random, or in a loop).

2a. Update (where represents the non- coordinates of )
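A minimal sketch of CAVI for the naive mean-field approximation of an Ising model; the coordinate update $m_i \leftarrow \tanh(h_i + \sum_j J_{ij} m_j)$ is the standard naive mean-field fixed point, and the random couplings and fields here are illustrative.

```python
import math, random

# Mean-field (CAVI) sketch for an Ising model p(x) ~ exp(sum_{i<j} J_ij x_i x_j
# + sum_i h_i x_i) over x in {-1,+1}^n, approximated by a product distribution
# with per-coordinate means m_i = E[x_i].
random.seed(0)
n = 8
J = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        J[i][j] = J[j][i] = random.gauss(0, 0.2 / math.sqrt(n))  # weak couplings
h = [random.gauss(0, 0.3) for _ in range(n)]

m = [0.0] * n
for sweep in range(200):
    for i in range(n):  # fix all coordinates but one and optimize over it
        m[i] = math.tanh(h[i] + sum(J[i][j] * m[j] for j in range(n)))

# At a CAVI fixed point, each coordinate satisfies its own update equation.
residual = max(abs(m[i] - math.tanh(h[i] + sum(J[i][j] * m[j] for j in range(n))))
               for i in range(n))
print(residual)
```

With weak couplings as here the coordinate updates are a contraction and CAVI converges quickly; with strong couplings it can converge to a poor local optimum, consistent with the non-concavity discussed above.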

This part is best explained visually. Suppose that we have a Boltzmann distribution. In the infinite-temperature limit this is the uniform distribution, spread over the entire domain. As you decrease the temperature, the support's size decreases. (*Note:* The gifs below are just cartoons. They pretend that the probability distribution is always uniform over a subset. Also, in most cases of interest for learning, only higher-order derivatives of the entropy will be discontinuous.)

Sometimes there's a discontinuity in the entropy, where it "suddenly drops". Often the entropy function itself is continuous but its derivatives are not continuous at that point – a higher-order phase transition.

Sometimes the geometry of the solution space undergoes a phase transition as well, with it “shattering” to several distinct clusters.

**Note:** We will have an extended post on the replica method later on.

If you sampled from , where is a high-dimensional distribution, you’d expect the distances to all be approximately the same. The overlap matrix, or

we approximate as

where is some constant.

Suppose comes from a probability distribution . Then a very common problem is computing

which is the expected free energy. However, it turns out to be much easier to find . So here’s what we do. We find

This should already smell weird, because $n$ is going to zero, an unusual notational choice. We can now write this as

which we can write as

Now these $n$ copies represent the replicas! We'd hope $\mathbb{E}[Z^n]$ is an analytic function of $n$, and we can then evaluate this expression as $n$ goes to zero.
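In symbols, the replica trick above can be summarized (assuming the analytic continuation in $n$ is valid) as:

```latex
\mathbb{E}\log Z \;=\; \lim_{n\to 0}\frac{\mathbb{E}[Z^{n}]-1}{n}
\;=\; \left.\frac{\partial}{\partial n}\log \mathbb{E}[Z^{n}]\right|_{n=0},
```

where for integer $n$ the quantity $\mathbb{E}[Z^n]$ is an expectation over $n$ independent copies ("replicas") of the system, and the resulting formula is then analytically continued to $n \to 0$.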

Generally speaking, only depends on overlaps of , so we often guess the value of and calculate this expectation.

We’ll now give an example of how this is useful. Consider a spiked matrix/tensor model, where we observe so that , where is the signal and is the noise. Thus here we have

and we want to analyze as a function of , which can be done by the replica method and exhibits a phase transition.

Other examples include Teacher-student models, where . Then, a recent work calculates the training losses and classification errors when training on this dataset, which closely match empirical values.

Lastly, we’ll talk about replica symmetry breaking. Sometimes, discontinuities don’t just make the support smaller, but actually break the support into many different parts. For example, at this transition the circle becomes the three green circles.

When this happens, we can no longer make our overlap matrix assumption, as there are now asymmetries between the points. This leads to the overlap matrix being striped, in the fashion one would anticipate.

Sometimes, the support actually breaks into infinitely many parts, in a phenomenon called full replica symmetry breaking.

Finally, some historical notes. This “plugging in zero” trick was introduced by Parisi approximately 30 years ago, receiving a standing ovation at the ICM. Since then, some of those conjectures have been rigorously formalized, but many haven’t. It is still very impressive that Parisi was able to do it.

**Previous post:** Unsupervised learning and generative models **Next post:** Inference and statistical physics. See also all seminar posts and course webpage.

lecture slides (pdf) – lecture slides (Powerpoint with animation and annotation) – video

In this blog post, we will focus on the topic of robustness – how well (or not so well) do machine learning algorithms perform when either their training or testing/deployment data differs from our expectations.

We will cover the following areas:

- Math/stat refresher
- Multiplicative weights algorithm
- Robustness
- train-time
- robust statistics
- robust mean estimation
- data poisoning

- test-time
- distribution shifts
- adversarial perturbations

- train-time

The KL-divergence between the probability distributions is the expectation of the log of probability ratios:

KL-divergence is always non-negative and can be decomposed as the difference between the cross-entropy and the entropy. The non-negativity property thereby implies that for any two distributions, the cross-entropy is at least as large as the entropy.
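A quick numerical sketch of the decomposition and the non-negativity; the two discrete distributions are arbitrary examples.

```python
import math

# KL(p || q) = sum_x p(x) log(p(x)/q(x)) for discrete distributions.
def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

# Decomposition: KL(p||q) = cross-entropy(p, q) - entropy(p).
assert abs(kl(p, q) - (cross_entropy(p, q) - entropy(p))) < 1e-12
# Non-negativity (hence cross-entropy >= entropy), with equality iff p == q.
assert kl(p, q) > 0 and abs(kl(p, p)) < 1e-12
print(kl(p, q))
```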

Consider a Gaussian random variable . Concentration is the phenomenon that if you have i.i.d random variables with expectation , then the empirical average is distributed approximately like

Therefore the standard deviation of the empirical average is smaller than the standard deviation of each of the individual variables. The central limit theorem ensures this asymptotically in $n$, while non-asymptotic versions of this phenomenon can be obtained via popular inequalities such as the Chernoff/Hoeffding/Bernstein-style inequalities. These roughly have the form
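A small Monte Carlo sketch of this $1/\sqrt{n}$ shrinkage; the sample size and number of trials are illustrative.

```python
import random, statistics

# Empirical check that the average of n i.i.d. variables has standard
# deviation about sigma / sqrt(n).
random.seed(0)
sigma, n, trials = 1.0, 100, 2000
avgs = [statistics.fmean(random.gauss(0, sigma) for _ in range(n))
        for _ in range(trials)]
sd = statistics.stdev(avgs)
print(sd)  # close to sigma / sqrt(n) = 0.1
```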

We denote the spectral norm of a matrix by and the Frobenius norm by .

We use the matrix ordering if for all vectors . We similarly refer to if .

**Vector valued normals:** If is a vector, and is a psd covariance matrix, with normal over with

For every psd matrix V, there exists a Normal distribution where

**Standard vector-valued normal:** This refers to the case when (or and

For scalar random variables, we saw that if are i.i.d over bounded with expectation then

If the random variables were vectors instead, we can easily generalize this to samples of . Generalizing to matrices is far more interesting, especially when the norm under consideration is the spectral norm instead of the Frobenius norm.

If i.i.d symmetric matrices in with (i.e bounded in spectral norm), then

Note that is a matrix in this case, instead. The norm under consideration is the spectral norm. Note that the difference in the inequality w.r.t the scalar case is the additional multiplicative factor of d or an additive factor of in the exponent as in the equations above. Please refer to Tropp, 2015 Chapter 6 for formally precise statements.

Another property that follows is that the expected norm of the difference satisfies

There are some long-standing conjectures and results on cases when one can get rid of this additional factor of .

**Trivia on log factors:** As a tangent (and some trivia!), one result of this flavor, where a logarithmic factor was replaced by a constant, is Spencer's 1985 paper 'Six standard deviations suffice', which showed that a constant of 6 suffices in certain discrepancy bounds where a naive argument would instead give a $\sqrt{\log n}$ dependency. The Kadison–Singer problem (resolved by Marcus–Spielman–Srivastava) and the Paulsen problem are two other examples of works in this flavor.

For a random symmetric matrix with , the spectrum of eigenvalues is distributed according to the **Wigner semi-circle law** on a support between as shown in the figure below. Note that most mass is observed close to .

For

we have is the empirical estimate for covariance of

the eigenvalues are distributed according to the Marchenko–Pastur distribution. When $n \gg d$, the eigenvalues will be bounded away from zero. When $n \approx d$, there is a lot more mass close to $0$ (way more than under the semi-circle law). When $n < d$, there is a spike of eigenvalues at $0$ and the rest are bounded away from $0$.
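A quick numerical sketch of the Marchenko–Pastur picture in the $n \gg d$ regime (the dimensions are illustrative): the eigenvalues of the empirical covariance land near the predicted support $[(1-\sqrt{d/n})^2, (1+\sqrt{d/n})^2]$, bounded away from zero.

```python
import numpy as np

# Eigenvalues of the empirical covariance (1/n) X^T X for X with i.i.d.
# N(0, 1) entries.
rng = np.random.default_rng(0)
n, d = 4000, 100  # n >> d: eigenvalues bounded away from zero
X = rng.standard_normal((n, d))
eigs = np.linalg.eigvalsh(X.T @ X / n)

lo = (1 - np.sqrt(d / n)) ** 2  # Marchenko-Pastur lower edge, ~0.71
hi = (1 + np.sqrt(d / n)) ** 2  # Marchenko-Pastur upper edge, ~1.34
print(eigs.min(), eigs.max())   # fall near [lo, hi]
```

Rerunning with $n \approx d$ or $n < d$ shows the mass piling up near $0$ and the spike of exact zeros, matching the regimes described above.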

**Connections to the stability of linear regression**

In the regime when $n \approx d$, linear regression is most unstable, which is the case when most of the eigenvalues of the empirical covariance are close to $0$. When $d \gg n$, although linear regression is over-parametrized, the approximate solution is still pretty stable, as in the case of $n \gg d$. When $d > n$, the condition number is infinite, and therefore we use a pseudo-inverse as opposed to the usual inverse in computing the solution for linear regression. This ignores the subspace on which the matrix has zero eigenvalues when the inverse is performed. The condition number restricted to the subspace of the non-zero eigenvectors is finite.

**Note:** Its variants and connections include Follow The Regularized Leader, regret minimization, and mirror descent. Elad Hazan's lecture notes on online convex optimization and Arora, Hazan, and Kale's article on multiplicative and matrix multiplicative weights are great reading sources for these topics.

The following is a general setup in online optimization and online learning.

**Setup:** possible actions

At time , we incur loss for action at time . After this action, we also learn the loss for actions we did not take. (This is referred to as the *experts* model in online learning, in contrast to the *bandit* model where we only learn the loss for the action taken; the experts model is an easier setup than the bandit model.)

The following is the overall approach for learning to optimize in this online setup.

**Overall Approach:**

- Initialize distribution over action space . We then take a step based on this distribution and observe the incurred loss.
- Then the distribution is updated by letting . I.e., if an action gave a bad outcome in terms of loss, we downweight its probability for the next draw of the action to be taken. Here is the penalty parameter that governs the aggressiveness of the downweighting.

The *hope* is that this approach converges to a “good aggregation or strategy”. We measure the quality of the strategy using **regret**.

The difference between the cost we incur and the optimal action (or probability distribution over actions) in hindsight is known as the (average) *regret*.

Note that we compare the loss we incur with the best loss over a *fixed* (i.e., nonadaptive) probability distribution over the set of possible actions. It can be shown that the best such loss can always be achieved by a distribution that puts all the weight on a single action.
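The update above can be sketched in a few lines; here the opponent is easy (i.i.d. uniform losses rather than an adversary), and the step size follows the standard $\sqrt{\log k / T}$ choice.

```python
import math, random

# Multiplicative weights sketch: maintain a distribution over k actions and
# downweight action i by exp(-eta * loss_i) each round. Losses lie in [0, 1].
random.seed(0)
k, T = 5, 2000
eta = math.sqrt(math.log(k) / T)  # standard step-size choice

w = [1.0] * k
total_loss = 0.0
cum = [0.0] * k  # cumulative loss of each fixed action
for t in range(T):
    Z = sum(w)
    p = [wi / Z for wi in w]                       # normalize to a distribution
    losses = [random.random() for _ in range(k)]   # an adversary could pick these
    total_loss += sum(pi * li for pi, li in zip(p, losses))
    for i in range(k):
        cum[i] += losses[i]
        w[i] *= math.exp(-eta * losses[i])         # penalize lossy actions

regret = (total_loss - min(cum)) / T
print(regret)  # average regret is O(sqrt(log k / T)), here well under 0.1
```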

The following theorem bounds the regret.

The ‘prior ignorance’ term captures how good the initialization is. The ‘sensitivity per step’ term captures how aggressive the exploration is (i.e this term governs the potential difference between loss incurred at versus ). As also occurs in the denominator of the prior ignorance term, it needs to be carefully chosen to balance the two terms properly.

The first inequality of this theorem is always true, whether or not is the optimal strategy. The second (right-most) inequality is true when the optimal is a delta function (which it will be for the optimal strategy), is initialized to the uniform distribution, and is set to be . Note that in this case, the divergence .

To state the ‘overall approach’, more precisely, there needs to be a normalization at every step as below

The above can be rearranged as follows upon taking log

By substituting this in below

we get the following expanded version of what the regret amounts to be:

The last equation above is due to the telescoping sum where adjacent terms cancel out except the first and last terms.

The last inequality here is because the cross-entropy is always at least as large as the entropy.

So now we have this so far

PF: Regret

Now there is this property that if s.t. for then

Therefore upon substituting this, we prove the upper bound stated over the regret.

This subclaim that was used can be proved as follows

When the set K of actions is convex (set of probability distributions on discrete actions is convex),

At time , it makes a choice and learns cost function (so the setup is again like the experts model unlike the bandit model).

Now the new action in FTRL is based on the following optimization which is a regularized (with ) loss:

FTRL:

In this case, the regret bound is given by the following theorem.

refers to the optimal choice. This indeed has a similar flavor to the theorem we saw for multiplicative weights. To be precise when

we have that multiplicative weights becomes FTRL. Similarly, there is a view that connects multiplicative weights to mirror descent. (Nisheeth Vishnoi has notes on this.)

We now look at some robustness issues specific to the *training* phase in machine learning.

For example, during training, there could be adversarially poisoned examples in the training dataset that damage the trained model’s performance.

**Setup:** Suppose samples are generated from a genuine data distribution and samples are maliciously chosen from an arbitrary distribution. i.e, in formal notation: are arbitrary. (While for convenience we assume here that the last items are maliciously chosen, the learning algorithm does not get the data in order.)

Assume for

Let us start with the task of estimating the mean under this setup.

**Mean estimation:** Estimate

**Noiseless case:** In the noiseless case, the best estimate for the mean under many measures is the *empirical mean*

Standard concentration results show that for and for a general that .

**Adversarial case:** If there is even a single malicious sample (which can be arbitrarily large or small), then it is well known that the empirical mean can give an arbitrarily bad approximation of the true population mean. Instead, we can use the *empirical median* . This is guaranteed to lie in the quantile of the real data. This is optimal because of the property that .
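A one-dimensional sketch of this contrast between the empirical mean and the empirical median; the data values are illustrative.

```python
import statistics

# One corrupted sample can move the empirical mean arbitrarily far,
# while the empirical median barely moves.
clean = [0.9, 1.1, 1.0, 0.95, 1.05, 1.0, 1.02, 0.98, 1.03]  # true mean ~1
poisoned = clean + [10 ** 9]  # a single malicious sample

print(statistics.fmean(poisoned))   # ruined: ~1e8
print(statistics.median(poisoned))  # still ~1
```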

In a higher-dimension, this story instead goes as follows:

**Setup:** arbitrary.

**Median of coordinates (a starter solution):** When , we can take the median per coordinate as an initial naive solution (i.e., a median of coordinates). In this case, say if we have , then an adversary can perturb the data to make a fraction of the points be , where is some large number. This will shift the distribution in each coordinate to have median instead of median , and hence make , which implies that for some constant .

**The obvious next question is: Can we do better?** i.e, can we avoid paying this dimension-dependent price?

Yes we can! We can use the 'Tukey median'!

**What is a Tukey median?**

Informally, via the picture below: if you have a Tukey median (red), then no matter which direction you take from it, the corresponding half-space contains about half of the points.

**Formal definition of Tukey median** is given as follows (fixing some parameters):

A Tukey median of is a vector s.t. for every nonzero

A Tukey median need not always exist for a given dataset. However, we will show that if the data is generated at random according to a nice distribution, and then at most a fraction of it is perturbed, then a Tukey median will exist with high probability. In fact, the true population mean will be such a median.

**Existence of Tukey median**

**THM:** If and then

1. A Tukey median exists, and
2. for *every* Tukey median , .

Together 1. and 2. mean that if we search over all vectors and output the first Tukey median that we find, then (a) this process will terminate and (b) its output will be a good approximation for the population mean. In particular, we do not have to pay the extra cost needed in the median of coordinates!

**Proof of 1 (existence):**

The mean itself is a Tukey median in this case because, for every direction, if we define then these are i.i.d. ±1 variables of mean zero, and thus:

for and . Through discretization, we can pretend (losing some constants) that there are only unit vectors in . Hence we can use the union bound to conclude that for *every* unit , if we restrict attention to the "good" (i.e., unperturbed) vectors , then the fraction of them satisfying will be in . Since the adversary can perturb at most vectors, the overall fraction of 's such that will be in . QED(1)

**Proof of 2:**

Let be the population mean (i.e., the “good” ‘s are distributed according to ). Suppose for simplicity, and toward a contradiction, that .

Let . Then,

.

Note that is distributed as , and so we get that is distributed as .

Hence, if then .

This implies that via a similar concentration argument as above, for every , there will be with high probability at most fraction of ‘s such that , contradicting our assumption that was a Tukey median. QED(2)

Exactly computing the Tukey median is NP-hard, but efficient algorithms for robust mean estimation of normals and other distributions exist as referred to in Jerry Li’s lecture notes. In particular, we can use the following approach using *spectral signatures* and *filtering*.

**Spectral Signatures** can efficiently *certify* that a given vector is in fact a robust mean estimator.

Let and arbitrary

Let be the empirical mean and be the empirical co-variance. Then we can bound the error of as follows:

**Claim:**

In other words, if the spectral norm of the empirical covariance matrix is small, then the empirical mean is a good estimator for the population mean.

**Note:** If all points are from then .

The proof of this claim is given below.

**Explanation:** Here, we assume for simplicity and without loss of generality that . The norm can be split into additive terms on the good points (green text above) and the malicious points (red text above). The first term of this inequality follows from standard concentration. For the second (red) term, we can modify it by adding and subtracting , and then apply the Cauchy–Schwarz inequality. Upon rearranging the terms and dividing by the norm of , we get the desired result.

**Filtering** is an approach to turn the certificate into an algorithm that actually finds the estimator. The idea is that the same claim holds for non-uniform reweightings of the data points used to estimate the empirical mean and covariance. Hence we can use a violation of the spectral norm condition (the existence of a large eigenvalue) to assign "blame" to some data points and down-weight their contributions, until we reach a probability distribution over points that on the one hand is spread over roughly a fraction of the points, and on the other hand leads to a weighted empirical covariance matrix with small spectral norm. This motivates the following robust mean estimation algorithm in the online case as below:

**Explanation:** We first compute the mean and covariance based on uniform weighting, and the certificate is checked via the spectral norm of the covariance. If the quality isn't good enough, then the blame is assigned via the largest eigenvector of the covariance, which contributes the most error. The weighting is then improved from the uniform initialization via a multiplicative-weights-style update as given in step 3. Jerry Li and Jacob Steinhardt have wonderful lecture notes on these topics.
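A rough sketch of the filtering idea on synthetic data; the contamination model, the spectral threshold, and the down-weighting rule are illustrative simplifications, not the precise algorithm with its tight theoretical constants.

```python
import numpy as np

# Filtering sketch: while the weighted empirical covariance has a large
# eigenvalue, down-weight points with large projection on its top eigenvector.
rng = np.random.default_rng(0)
n, d, eps = 1000, 20, 0.1
good = rng.standard_normal((int((1 - eps) * n), d))  # N(0, I), true mean 0
bad = rng.standard_normal((int(eps * n), d)) + 5.0   # shifted outliers
X = np.vstack([good, bad])

w = np.ones(len(X)) / len(X)  # start from uniform weights
for _ in range(50):
    mu = w @ X
    C = (w[:, None] * (X - mu)).T @ (X - mu)  # weighted covariance
    vals, vecs = np.linalg.eigh(C)
    if vals[-1] < 1.5:          # spectral certificate: covariance looks Gaussian
        break
    v = vecs[:, -1]             # direction of largest variance gets the "blame"
    score = ((X - mu) @ v) ** 2
    w *= 1 - score / score.max()  # multiplicative down-weighting by score
    w /= w.sum()

robust_mean = w @ X
naive_mean = X.mean(axis=0)
print(np.linalg.norm(robust_mean), np.linalg.norm(naive_mean))
```

On this instance the naive mean is dragged far from the origin by the outliers, while the filtered estimate stays close to the true mean once the spectral certificate is satisfied.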

The algorithm above is computationally efficient, though its bounds are not that tight.

**SoS algorithms:** Another approach is via sum-of-squares algorithms, where the guarantees for statistical robustness are much tighter, but they are computationally less efficient, although still polynomial time. A hybrid approach might give a balance of the two, based on the problem at hand, to bridge this gap between computational and statistical efficiency.

A list of relevant references is given below.

We now cover robustness issues with distribution shift and adversarial data poisoning during the testing phase in machine learning.

As shown in Steinhardt, Koh, Liang, 2017 the images below illustrate how poisoning samples can drastically alter the performance of a classifier from good to bad.

Shafahi, Huang, Najibi, Suciu, Studer, Dumitras, Goldstein, 2018 showed that poisoning images can look perfectly fine to human perception while flipping the model's predictions, so that a fish is classified as a dog and vice versa, as shown below.

Another problem with regards to test-time robustness is the issue of domain shift where the distribution of test data is different from the distribution of training data.

vs. . If the loss is bounded and its Lipschitz constant is known, then distances like the earth mover's distance or the total variation (TV) distance can be used to bound this. But one needs to be quite careful, or these bounds become too large or close to vacuous.

For example, images of dogs taken in a forest vs. images of dogs taken on roads have a huge distance in measures like the ones mentioned above, since many pixels (although not the important ones) differ across the images, and there could be a classifier that performs terribly on the test set while doing well on the train set. But magically, there appears to be a linear relationship between the accuracy on vs. the accuracy on . Typically one would expect the line to be below the $x=y$ (45-degree) line.

I.e., say a model was trained on CIFAR-10 () and tested on CIFAR-10.2 (); finding this line to be a bit lower than the $x=y$ line is not surprising, but the linear correlation is quite surprising. It is even more surprising when the datasets are more different – say, when comparing performance on photos vs. illustrations or cartoons of those photos.

What are the potential reasons (intuitively)?

i) Overfitting on cifar test,

ii) Fewer human annotators per image introduces skew towards hardness of dataset,

iii) Human annotators running out of images, as they might end up choosing images that are easier to annotate.

If we achieve better accuracy on than we achieved on , it is a strong indicator of a drop in hardness. If the errors are in the other direction, then this reasoning isn’t as clear.

Here is a toy model that can demonstrate this surprising linear relationship between accuracies under domain shift.

Let us do a thought experiment where things are linear. Assume there is a true "cat direction" vector in representation (feature) space, as shown in the cartoon below. Let there also be some idiosyncratic correlations. An example idiosyncrasy: cats tend to be photographed indoors more than outdoors.

Consider to be a point that needs to be labeled.

In some dataset , the probability that is labeled as a cat is proportional to the following exponential:

labeled cat

where denotes the idiosyncratic correlation factor. This is the exponent of the dot product of the image to be labeled with the CAT direction and the idiosyncratic direction. The same can be done for a second dataset as

Dataset labeled cat

Intuitively, is like the signal-to-noise ratio. That is, if then is a harder dataset to classify than .

So in this toy model, the best accuracy that can be reached for any linear classifier is given by the following, where the softmax of the RHS is the classification probability.

The first term of this accuracy is the universal and transferable part and the second term is the idiosyncratic part.

For the analysis, we assume that if a model is trained on one dataset, then the idiosyncratic directions of the other dataset are orthogonal to it, so the accuracy on the other dataset is given by just the universal term.

If the model is learnt by gradient descent then the gradient direction will always be proportional to as the gradient is of the form

So, if is trained on , then Noise Noise. Here, is given by .

Therefore the accuracies will be as follows

Therefore, we see this form of a linear relationship:

Note that:

- iff harder than
- grows with idiosyncratic component of

Although this is a toy theoretical model, it explains a linear relationship. Finding a model that explains this linear relationship in real life would be an interesting project to think about.

We now move to the last (yet another active) topic of our blog: adversarial perturbations. As an introductory example (taken from this tutorial), the hog image was originally classified as a hog with high probability. A small amount of noise was added to obtain an image that looks perceptually indistinguishable to the human eye. Yet the model now misclassifies the noised hog as something that is not a hog.

Should we be surprised by the efficacy of such adversarial perturbations? Originally, certainly yes – but not as much now, in hindsight!

In this example it turns out that the magnitude of each element satisfies and the 2-norm of this vector is roughly . Note that the ResNet50 model's output at the penultimate layer has dimension 2048. We scale such that . There is a Lipschitz constant w.r.t. the preservation of the following norms: . The final classification decision is made by looking at , where is a unit vector such that is . We assume some randomness in the decision for simplicity. So we have

where . We now have the probability that is not a hog as

is not hog .

As we know that the observed probability of being not a hog is 0.0996, we can calculate that .

Upon normalizing to be , we can expect the following square of this dot product w.r.t. the representation:

Therefore the norm of the projection of to the HOG direction is given by .

So say the 2048-dimensional vector had one larger element accounting for the hog direction, though still a small proportion of its total norm. Then it wouldn't take too much noise to flip one class to a wrong class label, as shown in the cartoons below.

So if the Lipschitz constant is greater than around 2.5 or 3, then a fraction of 1/25 is enough to zero out the hog direction (as ).

What are some strategies for training neural networks that are robust to perturbations?

- A set of transformations that do not change the true nature of the data, such as, e.g.:

where the set is

i.e., they only perturb the image or data sample by at most upon applying that transformation.

Now, given a loss function and a classifier , a **robust loss** function of at point is defined as

Given and a robust loss function, robust training involves minimizing this loss, which gives us:

Now, for the subgoal of finding the inner maximizer for the sake of optimization, invoking Danskin's theorem greatly helps. The theorem basically says that if is nice (differentiable, continuous) and is compact, then we have that:

I.e., for any function that depends on both arguments, as long as the function is nice (differentiable, continuous) and the set over which the inner maximum is taken is compact, then to find the gradient of the maximum of the function one can, by the theorem, find a maximizer at any particular point, and the gradient there gives the required gradient (after some fine print given in the note below).

Note: The paper below extends this to the case when the maximizer is non-unique, though there is other fine print. See Appendix A of (Madry, Makelov, Schmidt, Tsipras, Vladu 2017).
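A minimal sketch of the inner maximization for the special case of a linear classifier with logistic loss, where the worst-case $\ell_\infty$ perturbation has a closed form (a single signed step is exact for linear models); all parameters here are illustrative.

```python
import numpy as np

# Inner maximization: find the worst-case perturbation delta with
# ||delta||_inf <= eps for a linear classifier with logistic loss.
# By Danskin's theorem, the gradient of the robust loss at x is the ordinary
# loss gradient evaluated at the maximizer x + delta*.
rng = np.random.default_rng(0)
d, eps = 10, 0.1
w = rng.standard_normal(d)  # model weights (illustrative)
x = rng.standard_normal(d)
y = 1.0                     # label in {-1, +1}

def loss(z):
    return np.log1p(np.exp(-y * (w @ z)))

# The loss is monotone decreasing in the margin y * <w, z>, so the worst
# feasible delta is -eps * y * sign(w) (exact for linear models).
delta = -eps * y * np.sign(w)
assert np.abs(delta).max() <= eps + 1e-12
print(loss(x), loss(x + delta))  # robust loss exceeds the clean loss
```

For deep networks this inner maximum has no closed form, which is where iterative methods such as projected gradient descent come in.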

On the empirical side, there seems to be a trade-off: adversarial robustness can be achieved by giving up some model accuracy. See the discussion here and the papers cited below.

This course is aimed at undergraduate students, and in particular students from groups that are currently under-represented in our field. If you are a faculty member in a university, we would be grateful if you can spread the announcement below to both your colleagues, and any mailing lists for undergraduate students, and in particular chapters of affinity groups such as the National Society of Black Engineers, Society of Hispanic Professional Engineers, National Center for Women & Information Technology, etc.

The committee for advancement of theoretical computer science (CATCS) is organizing an online summer course that will take place on May 31 till June 4, 2021. New horizons in theoretical computer science is a week-long online summer school which will expose undergraduates to exciting research areas in the area of theoretical computer science and its applications. The school will contain several mini-courses from top researchers in the field. The course is free of charge, and we welcome applications from undergraduates majoring in computer science or related fields. We particularly encourage applications from students that are members of groups that are currently under-represented in theoretical computer science.

The course is intended for currently enrolled undergraduate students that are majoring in computer science or related fields. Students will be expected to be familiar with the material typically taught in introductory algorithms and discrete mathematics / mathematics for computer science courses. If you are unsure if you are prepared for the course, please write to us at summer-school-admin@boazbarak.org.

To apply for the course, please visit https://boazbk.github.io/tcs-summerschool/ and fill in the application form by **April 15, 2021**. **Course organizers:** Boaz Barak (Harvard), Shuchi Chawla (UT Austin), Madhur Tulsiani (TTI-Chicago)

**Current list of confirmed instructors:** Antonio Blanca (Penn State University), Ashia Wilson (MIT), Jelani Nelson (UC Berkeley), Nicole Immorlica (Microsoft Research), Yael Kalai (Microsoft Research).

Please email summer-school-admin@boazbarak.org with any questions.

**Previous post:** What do neural networks learn and when do they learn it **Next post:** TBD. See also all seminar posts and course webpage.

lecture slides (pdf) – lecture slides (Powerpoint with animation and annotation) – video

In this lecture, we move from the world of supervised learning to unsupervised learning, with a focus on generative models. We will

- Introduce unsupervised learning and the relevant notations.
- Discuss various approaches for generative models, such as PCA, VAE, Flow Models, and GAN.
- Discuss theoretical and practical results we currently have for these approaches.

In *unsupervised learning*, we are given samples x drawn from an unknown distribution p, and we want to understand the distribution p. For example,

- *Probability estimation:* Given x, can we compute/approximate p(x) (the probability that x is output under p)?
- *Generation:* Can we sample from p, or from a "nearby" distribution?
- *Encoding:* Can we find a representation E(x) such that, for x sampled from p, E(x) makes it easy to answer semantic questions about x? And such that similarity of E(x) and E(x') corresponds to "semantic similarity" of x and x'?
- *Prediction:* We would like to be able to predict (for example) the second half of x from the first half. More generally, we want to solve the *conditional generation* task, where given some function f (e.g., the projection to the first half) and some value y, we can sample from the distribution of p conditioned on f(x) = y.

Our “dream” is to solve all of those by the following setup:

There is an "encoder" E that maps an input x into a representation z = E(x) in the latent space, and then a "decoder" D that can transform such a representation back into x. We would like it to be the case that:

- *Generation:* For z sampled from the latent space, the induced distribution is "nice" and efficiently sampleable (e.g., the standard normal over R^k), such that we can (approximately) sample from p by sampling z and outputting D(z).
- *Density estimation:* We would like to be able to evaluate the probability p(x). For example, if E is the inverse of D, we could hope to do so by computing the density of E(x) under the latent distribution.
- *Semantic representation:* We would like the encoder E to map inputs into a meaningful latent space. Ideally, linear directions in this space will correspond to semantic attributes.
- *Conditional sampling:* We would like to be able to do conditional generation, and in particular, for some functions f and values y, be able to sample from the set of x's such that f(x) = y.

Ideally, if we could map images to the latent variables used to generate them and vice versa (as in the cartoon from the last lecture), then we could achieve these goals:

At the moment, we do not have a single system that can solve all these problems for a natural domain such as images or language, but we have several approaches that achieve part of the dream.

**Digressions.** Before discussing concrete models, we make several digressions: one non-technical, and three technical. The three technical digressions are the following:

- If we have multiple objectives, we want a way to interpolate between them.
- To measure how good our models are, we have to measure distances between statistical distributions.
- Once we come up with generative models, we need *metrics* for measuring how good they are.

In an influential essay, Richard Feynman coined the term "cargo cult science" for activities that have superficial similarities to science but do not follow the scientific method. Some of the tools we use in machine learning look suspiciously close to "cargo cult science." We use the tools of classical machine learning, but in settings in which they were not designed to work and in which we have no guarantees that they will work. For example, we run (stochastic) gradient descent, an algorithm designed to minimize a convex function, to minimize a non-convex loss. We also use *empirical risk minimization*, minimizing loss on our training set, in a setting where we have no guarantee that it will not lead to "overfitting."

And yet, unlike the original cargo cults, in deep learning, “the planes do land”, or at least they often do. When we use a tool in a situation that it was not designed to work in, it can play out in one (or mixture) of the following scenarios:

**Murphy's Law:** "Anything that can go wrong will go wrong." As computer scientists, we are used to this scenario. The natural state of our systems is that they have bugs and errors. There is a reason why software engineering talks about "contracts", "invariants", "preconditions" and "postconditions": typically, if we try to use a component in a situation that it wasn't designed for, it will not turn out well. This is doubly the case in security and cryptography, where people have learned the hard way, time and again, that Murphy's law holds sway.

**"Marley's Law":** "Every little thing gonna be alright." In machine learning, we sometimes see the opposite phenomenon: we use algorithms outside the conditions under which they have been analyzed or designed to work, but they still produce good results. Part of it could be because ML algorithms are already robust to certain errors in their inputs, and their output was only guaranteed to be approximately correct in the first place.

Murphy’s law does occasionally pop up, even in machine learning. We will see examples of both phenomena in this lecture.

In machine learning, we often have multiple objectives to optimize. For example, we may want both an efficient encoder and an effective decoder, but there is a tradeoff between them.

Suppose we have two loss functions L1 and L2, with a tradeoff between them. The *Pareto curve* is the set of feasible pairs (L1, L2) for which neither loss can be decreased without increasing the other.

If a model is above the curve, it is not optimal. If it is below the curve, the model is infeasible.

When the feasible set is convex, we can reach any point on the curve by minimizing L1 + λ·L2 for an appropriate λ ≥ 0. The proof is by the picture above: for any point p on the curve, there is a tangent line at p such that the feasible set lies entirely on one side of it. If the normal vector of this line is proportional to (1, λ), then the global minimum of L1 + λ·L2 on the feasible set will be p.

This motivates the common practice of introducing a hyperparameter λ to aggregate the two objectives into the single objective L1 + λ·L2.
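As a toy illustration of this scalarization (the loss functions below are made up for the example), sweeping λ moves the minimizer along the Pareto curve between the two individual optima:

```python
# Toy scalarization of two objectives: minimize L1(w) + lam * L2(w).
# L1 has its minimum at w = 1, L2 at w = -1; lam trades off between them.

def L1(w):
    return (w - 1.0) ** 2

def L2(w):
    return (w + 1.0) ** 2

def minimize(lam, lr=0.1, steps=500):
    """Plain gradient descent on the aggregated objective."""
    w = 0.5  # arbitrary starting point
    for _ in range(steps):
        grad = 2.0 * (w - 1.0) + lam * 2.0 * (w + 1.0)
        w -= lr * grad
    return w

print(round(minimize(0.0), 4))  # 1.0 (lam = 0 recovers the minimizer of L1)
print(round(minimize(1.0), 4))  # 0.0 (equal weights balance the two losses)
```

Since both losses here are convex, every choice of λ lands on the Pareto curve; the non-convex pitfalls discussed below do not arise in this toy case.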

When the feasible set is not convex, it may well be that:

- Some points on the Pareto curve are not minima of L1 + λ·L2 for any λ.
- L1 + λ·L2 might have multiple minima.
- Depending on the path one takes, it is possible to get "stuck" in a point that is *not* a global minimum.

The following figure demonstrates all three possibilities.

Par for the course, this does not stop people in machine learning from using this approach to combine different objectives, and often "Marley's Law" holds and this works fine. But this is not always the case. A nice blog post by Degrave and Korshunova discusses this issue and why we sometimes do, in fact, see "Murphy's law" when we combine objectives. They also detail some other approaches for combining objectives, but there is no single way that will work in all cases.

Figure from Degrave-Korshunova demonstrating which points the algorithm can reach in the non-convex case, depending on the initialization and λ:

Suppose we have two distributions p and q over some domain X. There are two common ways of measuring the distance between them.

First, the *Total Variation (TV) distance* (also known as statistical distance) between p and q is equal to

TV(p, q) = max_A | Pr_p[A] − Pr_q[A] | = (1/2) Σ_x | p(x) − q(x) |.

The second equality can be proved by considering the distinguisher A that outputs 1 on the x's where p(x) > q(x) and 0 otherwise. The definition has a crypto-flavored interpretation: for any adversary A, the TV distance bounds the advantage over one-half that A can have in determining whether a sample came from p or from q.

Second, the *Kullback–Leibler (KL) divergence* between p and q is equal to

KL(p || q) = E_{x~p} [ log( p(x) / q(x) ) ].

(The total variation distance is symmetric, in the sense that TV(p, q) = TV(q, p), but the KL divergence is not: in general KL(p || q) ≠ KL(q || p). Both have the property that they are non-negative and equal to zero if and only if p = q.)
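These properties are easy to verify numerically for discrete distributions; a minimal sketch (the dictionaries below are made-up toy distributions):

```python
import math

# TV distance and KL divergence between discrete distributions given as dicts
# mapping outcomes to probabilities.
def tv(p, q):
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

def kl(p, q):
    # KL(p || q); infinite if p puts mass where q has none.
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.75, "b": 0.25}
print(tv(p, q))            # 0.25, and tv(q, p) gives the same value
print(kl(p, q), kl(q, p))  # two different positive numbers: KL is asymmetric
```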

Unlike the total variation distance, which is bounded between 0 and 1, the KL divergence can be arbitrarily large and even infinite (though it can be shown, using the concavity of log, that it is always non-negative). To interpret the KL divergence, it is helpful to separate the case that it is close to zero from the case that it is a large number. If KL(p || q) = ε for some small ε, then we would need about 1/ε samples to distinguish between samples of p and samples of q. In particular, suppose that we get x_1, …, x_n and we want to distinguish between the case that they were independently sampled from p and the case that they were independently sampled from q. A natural (and, as it turns out, optimal) approach is to use a *likelihood ratio test*, where we decide the samples came from p if p(x_1)⋯p(x_n) > T · q(x_1)⋯q(x_n) for some threshold T. For example, if we set T = 2^k, then this approach will guarantee that our "false positive rate" (announcing that samples came from p when they really came from q) will be at most 2^(−k). Taking logarithms (base 2) and using the fact that the probability of these independent samples is the product of probabilities, this amounts to testing whether Σ_i log( p(x_i)/q(x_i) ) ≥ k. When the samples come from p, the expectation of the left-hand side is n · KL(p || q), so we see that to ensure it is larger than k, we need the number of samples n to be at least about k / KL(p || q) (and, as it turns out, this will do).

When the divergence KL(q || p) is a large number k, we can think of k as the number of bits of "surprise" in q as opposed to p. For example, in the common case where q is obtained by conditioning p on some event of probability 2^(−k), KL(q || p) will typically be about k (some fine print applies). In general, if q is obtained from p by revealing k bits of information (i.e., by conditioning on a random variable whose mutual information with the sample is k), then the expected divergence is at most about k.

**Generalizations:** The total variation distance is a special case of metrics of the form d(p, q) = max_{f in F} | E_{x~p} f(x) − E_{x~q} f(x) | for some class F of functions. These are known as integral probability metrics and include examples such as the Wasserstein distance, Dudley metric, and Maximum Mean Discrepancy. KL divergence is a special case of divergence measures known as *f-divergences*, which are measures of the form D_f(p || q) = E_{x~q} [ f( p(x)/q(x) ) ] for a convex function f with f(1) = 0. The KL divergence is obtained by setting f(t) = t·log t. (In fact, even the TV distance is a special case of f-divergence, obtained by setting f(t) = |t − 1| / 2.)

**Normal distributions:** It is a useful exercise to calculate the TV and KL distances for normal random variables. If p = N(0, 1) and q = N(ε, 1) for small ε, then, since most of the probability mass lies in the regime where the two densities differ by a multiplicative factor of 1 ± Θ(ε), we get TV(p, q) = Θ(ε). For the KL divergence, the log density ratio is log( p(x)/q(x) ) = ε²/2 − ε·x, whose expectation under x ~ p is ε²/2. Hence KL(p || q) = ε²/2. The above generalizes to higher dimensions: if p is the standard d-variate normal N(0, I) and q = N(μ, I) for a mean vector μ of norm ε, then (for small ε) TV(p, q) = Θ(ε) while KL(p || q) = ε²/2.

If p = N(0, 1) and q is a "narrow normal" of the form N(0, σ²) for small σ, then their TV distance is close to 1 while KL(q || p) ≈ log(1/σ). In the d-dimensional case, if p = N(0, I) and q = N(μ, Σ) for some covariance matrix Σ, then KL(q || p) = (1/2) [ tr(Σ) − d + |μ|² − log det(Σ) ]. The two last terms are often less significant; for example, if Σ = I then KL(q || p) = |μ|² / 2.
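These asymptotics can be checked against the textbook closed form for the KL divergence between one-dimensional normals (a sketch; the formula is the standard one, not taken from the lecture):

```python
import math

# Closed-form KL( N(mu1, s1^2) || N(mu2, s2^2) ) for 1-d normals.
def kl_normal(mu1, s1, mu2, s2):
    return math.log(s2 / s1) + (s1 ** 2 + (mu1 - mu2) ** 2) / (2 * s2 ** 2) - 0.5

print(kl_normal(0.0, 1.0, 0.0, 1.0))   # 0.0: identical distributions
eps = 0.01
print(kl_normal(eps, 1.0, 0.0, 1.0))   # eps^2 / 2 = 5e-05, matching the text
# Asymmetry: wide-vs-narrow is very different from narrow-vs-wide.
print(kl_normal(0.0, 1.0, 0.0, 0.1), kl_normal(0.0, 0.1, 0.0, 1.0))
```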

Given a distribution p of natural data and a purported generative model q, how do we measure the quality of q?

A natural measure is the KL divergence KL(p || q), but it can be hard to evaluate, since it involves the term E_{x~p} log p(x), which we cannot compute. However, we can rewrite the KL divergence as KL(p || q) = E_{x~p} log p(x) − E_{x~p} log q(x). The term E_{x~p} log p(x) is equal to −H(p), where H(p) is the *entropy* of p. The term −E_{x~p} log q(x) is known as the *cross entropy* of p and q. Note that the cross-entropy of p and q is simply the expectation of the negative log likelihood of x under q, for x sampled from p.

When p is fixed, minimizing KL(p || q) corresponds to minimizing the cross entropy or, equivalently, maximizing the log likelihood. This is useful since it is often the case that we can sample elements from p (e.g., natural images) but can only evaluate the probability function of q. Hence a common metric in such cases is the cross-entropy / negative log likelihood −E_{x~p} log q(x). For images, a common metric is "bits per pixel", which simply equals −log₂ q(x) divided by the number n of pixels of x. Another metric (often used in natural language processing) is perplexity, which (roughly speaking) interchanges the expectation and the logarithm. The logarithm of the perplexity of x is −(1/n) log q(x), where n is the length of x (e.g., in tokens). Another way to write this is that the log of the perplexity is the average over i of −log q(x_i | x_1, …, x_{i−1}), where q(x_i | x_1, …, x_{i−1}) is the probability of the i-th part of x under q, conditioned on the first i−1 parts.
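A small sketch of these metrics on a made-up model and sample set:

```python
import math

# Negative log likelihood (cross-entropy) and perplexity for a toy model q.
samples = ["a", "b", "a", "a", "c"]   # pretend these were drawn from p
q = {"a": 0.5, "b": 0.3, "c": 0.2}    # our (made-up) model

nll = -sum(math.log(q[x]) for x in samples) / len(samples)  # avg NLL in nats
perplexity = math.exp(nll)

print(nll)         # ~0.979 nats per symbol
print(perplexity)  # ~2.66: the model is about as "confused" as a fair
                   # choice among ~2.66 symbols
```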

**Memorization for log-likelihood.** The issue of "overfitting" is even more problematic for generative models than for classifiers. Given samples x_1, …, x_n and enough parameters, we can easily come up with a model q corresponding to the uniform distribution over {x_1, …, x_n}. This is obviously a useless model that will never generate new examples. However, this model will not only get a large log likelihood value on the training set; in fact, it will get an *even better log likelihood* than the true distribution! For example, any reasonable natural distribution on images would have at least tens of millions, if not billions or trillions, of potential images. In contrast, a typical training set might have fewer than 1M samples. Hence, unlike in the classification setting, for generation the "overfitting" model will not only match but can, in fact, beat the ground truth. (This is reminiscent of the following quote from Peter and Wendy: *"Not a sound is to be heard, save when they give vent to a wonderful imitation of the lonely call of the coyote. The cry is answered by other braves; and some of them do it even better than the coyotes, who are not very good at it."*)

If we cannot compute the density function, then benchmarking becomes more difficult. What often happens in practice is an “I know it when I see it” approach. The paper includes a few pictures generated by the model, and if the pictures look realistic, we think it is a good model. However, this can be deceiving. After all, we are feeding in good pictures into the model, so generating a good photo may not be particularly hard (e.g. the model might memorize some good pictures and use those as outputs).

There is another metric called the *inception score*, which loosely corresponds to how similar the "Inception" neural network finds the GAN model's outputs to ImageNet (in the sense that Inception thinks they cover many of the ImageNet classes, and that the model produces images on which Inception has high confidence), but it too has its problems. Ravuri-Vinyals 2019 took a GAN model with a good inception score and used its outputs to train a different model on ImageNet. Despite the high inception score (which should have indicated that the GAN's outputs are as good as ImageNet's), accuracy dropped dramatically when training on the GAN outputs instead of the original data; even in the best case, accuracy dropped by at least 30 points. Compare this with the 11-14% drop when we train on ImageNet and test on ImageNet v2.

This figure from Goodfellow’s tutorial describes generative models where we know and don’t know how to compute the density function:

We now shift our attention to the encoder/decoder architecture mentioned above.

Recall that we want to understand p, generate new elements x ~ p, and find a good representation of the elements. Our dream is to solve all of these with an encoder/decoder pair, whose setup is as follows:

That is, we want an encoder E and a decoder D such that

- x ≈ D(E(x)) (the decoder approximately inverts the encoder), and
- the representation E(x) enables us to solve tasks such as generation, classification, etc.

To achieve the first point, we can aim to minimize the reconstruction loss E_{x~p} |x − D(E(x))|². However, we can, of course, make this loss zero by letting E and D be the identity function. Much of the framework of generative models can be considered as placing some restrictions on the "communication channel" between encoder and decoder that rule out this trivial approach, with the hope that meeting these restrictions would require the encoder and decoder to "intelligently" capture the structure of the natural data.

A natural idea is to simply restrict the dimension k of the latent space to be small (much smaller than the input dimension). In principle, the optimal compression scheme for a probability distribution will require knowing the distribution. Moreover, the optimal compression will maximize the entropy of the latent data E(x). Since the maximum-entropy distribution is uniform (in the discrete case), we could easily sample from it. (In the continuous setting, the standard normal distribution plays the role of the uniform distribution.)

For starters, consider the case of picking k to be small and minimizing the reconstruction error for *linear* E and D. Since the composition D·E is a rank-k matrix, we can write this as finding a rank-k matrix M that minimizes Σ_i |x_i − M·x_i|², where x_1, …, x_m is our input data. It can be shown that the M minimizing this is the projection onto the top k eigenvectors of the covariance matrix Σ_i x_i x_i^T, which exactly corresponds to Principal Component Analysis (PCA).
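This equivalence between the optimal rank-k linear autoencoder and PCA is easy to check numerically (a sketch using SVD on made-up data):

```python
import numpy as np

# Rank-k linear autoencoder = PCA: projecting onto the top-k right singular
# vectors of the (centered) data minimizes the reconstruction error, and the
# error equals the energy in the discarded singular values.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # toy data, n x d
Xc = X - X.mean(axis=0)

k = 2
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
E = Vt[:k]              # encoder: R^5 -> R^2 (top-k principal directions)
X_hat = (Xc @ E.T) @ E  # decode with D = E^T: projection onto the subspace

err = np.sum((Xc - X_hat) ** 2)
print(np.isclose(err, np.sum(S[k:] ** 2)))  # True
```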

In the nonlinear case, we can obtain better compression. However, we do not achieve our other goals:

- It is not the case that we can generate realistic data by sampling z from the uniform/normal distribution and outputting D(z).
- It is not the case that semantic similarity between x and x' corresponds to a large dot product between E(x) and E(x').

It seems the model just rediscovers a compression algorithm like JPEG. We do not expect the JPEG encoding of an image to be semantically informative, and JPEG decoding of a random file will not be a good way to generate realistic images. It turns out that sometimes "Murphy's law" does hold: if it is possible to minimize the loss in a not-very-useful way, then that will indeed be the case.

We now discuss *variational auto-encoders* (VAEs). We can think of these as generalizing auto-encoders to the case where the channel has some Gaussian noise. We will describe VAEs in two nearly equivalent ways:

- We can think of VAEs as trying to optimize two objectives: the auto-encoder objective of minimizing the reconstruction error E_{x~p} |x − D(E(x))|², and another objective of minimizing the KL divergence between the distribution of E(x) and the standard normal distribution N(0, I).
- We can think of VAEs as trying to maximize a proxy for the log-likelihood. This proxy is a quantity known as the "Evidence Lower Bound (ELBO)", which we can evaluate using E and D and which is always smaller than or equal to the log-likelihood.

We start with the first description. One view of VAEs is that we search for a pair of encoder and decoder that are aimed at minimizing the following two objectives:

- E_{x~p} |x − D(E(x))|² (standard AE objective)
- KL( E(x) || N(0, I) ) (distance of the latent from the standard normal)

To make the second term well-defined, we consider E(x) as a probability distribution for a *fixed* x. To ensure this makes sense, we need to make E *randomized*. A randomized neural network has "sampling neurons" that take no input, have parameters μ and σ, and produce an element z distributed as N(μ, σ²). We can train such a network by fixing a random ε ~ N(0, 1) and defining the neuron to simply output μ + σ·ε.
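A minimal numerical sketch of such a sampling neuron (the reparameterization z = μ + σ·ε):

```python
import numpy as np

# A "sampling neuron": instead of sampling z ~ N(mu, sigma^2) directly (which
# would block gradients), fix eps ~ N(0, 1) and output z = mu + sigma * eps,
# an ordinary differentiable function of the parameters mu and sigma.
rng = np.random.default_rng(0)

def sample_z(mu, log_sigma, eps):
    return mu + np.exp(log_sigma) * eps

eps = rng.normal(size=100_000)
z = sample_z(mu=2.0, log_sigma=0.0, eps=eps)  # draws from N(2, 1)
print(z.mean(), z.std())                      # approximately 2.0 and 1.0
```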

**ELBO derivation:** Another view of VAEs is that they aim at maximizing a term known as the evidence lower bound, or ELBO. We start by deriving this bound. Let n denote the standard normal distribution over the latent space, so that the generative model samples z ~ n and outputs x with probability p(x | z) (the noisy decoder). Define p(z | x) to be the distribution of z conditioned on decoding to x, and define q to be the distribution E(x) of the randomized encoder on input x. Since KL(q || p(z | x)) ≥ 0, we know that

E_{z~q} [ log q(z) − log p(z | x) ] ≥ 0.

By the definition of conditional probability, p(z | x) = n(z) · p(x | z) / p(x). Hence we can derive that

E_{z~q} [ log q(z) − log n(z) − log p(x | z) ] + log p(x) ≥ 0

(since p(x) depends only on x, it comes out of the expectation over z).

Rearranging, we see that

log p(x) ≥ E_{z~q} [ log p(x | z) ] − KL( q || n ),

or in other words, we have the following theorem:

**Theorem (ELBO):** For every (possibly randomized) maps E and D, distribution n over the latent space, and input x,

log p(x) ≥ E_{z~E(x)} [ log p(x | z) ] − KL( E(x) || n ).

The left-hand side of this inequality is simply the log-likelihood of x. The right-hand side (which, as the inequality shows, is always smaller than or equal to it) is known as the *evidence lower bound*, or ELBO. We can think of VAEs as trying to maximize the ELBO.

The reason that the two views are roughly equivalent is as follows:

- The first term of the ELBO, known as the *reconstruction term*, is E_{z~E(x)} [ log p(x | z) ]. If we assume some normal noise in the decoder, then the probability that D(z) decodes to x will be proportional to exp( −|x − D(z)|² / 2 ), and hence maximizing this term corresponds to minimizing the square distance |x − D(E(x))|².
- The second term of the ELBO, known as the *divergence term*, is −KL( E(x) || n ), where n is the standard normal distribution over the latent space (for a Gaussian encoder this KL has a simple closed form in terms of μ, σ, and the latent dimension). Hence maximizing this term corresponds to minimizing the KL divergence between E(x) and the standard normal distribution.
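For a Gaussian encoder, the divergence term can be computed per latent coordinate; a sketch using the standard closed form for KL( N(μ, σ²) || N(0, 1) ):

```python
import math

# Per-coordinate divergence term of the ELBO for a Gaussian encoder:
# KL( N(mu, sigma^2) || N(0, 1) ) = 0.5 * (sigma^2 + mu^2 - 1 - log(sigma^2)).
def kl_to_standard_normal(mu, sigma):
    return 0.5 * (sigma ** 2 + mu ** 2 - 1.0 - math.log(sigma ** 2))

print(kl_to_standard_normal(0.0, 1.0))  # 0.0: encoder output matches the prior
print(kl_to_standard_normal(1.0, 1.0))  # 0.5: penalty for shifting the mean
```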

How well do VAEs work? First of all, we can actually generate images using them. We also find that similar inputs will have similar encodings, which is good. However, sometimes VAEs can still "cheat" (as in auto-encoders). There is a risk that the learned encoding will split into two parts: one part that is pure noise and is there to minimize the divergence term, and a second part that is an arbitrary compression of the input and is there for reconstruction. Such a model is similarly uninformative.

However, VAEs have found practical success. For example, Hou et al. 2016 used a VAE to create an encoding where two dimensions seem to correspond to "sunglasses" and "blondness", as illustrated below. We do note that "sunglasses" and "blondness" are somewhere between "semantic" and "syntactic" attributes: they correspond to relatively local changes in "pixel space".

The pictures can be blurry because of the noise we injected to make E randomized. However, recent models have used new techniques (e.g., vector-quantized VAEs and hierarchical VAEs) to resolve the blurriness and significantly improve on the state of the art.

In a flow model, we flip the order of E and D and set E = D^(−1) (so D must be invertible). The input to D will come from the standard normal distribution N(0, I). The idea is that we obtain D by composing simple invertible functions. We use the fact that if we can compute the density function of a distribution q over R^d, and F is invertible and differentiable, then we can compute the density function of F(q) (i.e., the distribution obtained by sampling z ~ q and outputting F(z)). To see why this is the case, consider the setting d = 2 and a small rectangle R around a point z. If R is small enough, F will be roughly linear on R and hence will map R into a parallelogram F(R). Shifting the first coordinate of z by a small δ corresponds to shifting the output of F by roughly δ times the first column of the Jacobian of F, and shifting the second coordinate by δ corresponds to shifting the output by roughly δ times the second column. For every x = F(z), the density of x under F(q) will be proportional to the density of z under q, with the proportionality factor being the inverse of the area-scaling factor |det J_F(z)|.

Overall, the density of x = F(z) under F(q) will equal the density of z under q times the inverse determinant of the *Jacobian* of F at the point z:

density of x = q(z) / |det J_F(z)|.
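For a one-dimensional affine "flow" this change-of-variables formula can be written out directly (a sketch with made-up parameters):

```python
import math

# Change of variables in 1-d: if x = F(z) = a*z + b with z ~ N(0, 1), then
# p_X(x) = p_Z(F^{-1}(x)) / |F'(z)| = p_Z((x - b) / a) / |a|.
def normal_pdf(z):
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

def flow_density(x, a, b):
    z = (x - b) / a                # invert the flow
    return normal_pdf(z) / abs(a)  # divide by the (1-d) Jacobian determinant

# With a = 2, b = 1, the point x = 1 maps back to z = 0 (the mode of N(0,1)).
print(flow_density(1.0, 2.0, 1.0))  # normal_pdf(0) / 2 ≈ 0.1995
```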

There are different ways to compose simple invertible functions to compute a complex one. Indeed, this issue also arises in cryptography and quantum computing (e.g., the Feistel cipher). Using similar ideas, it is not hard to show that any probability distribution can be approximated by a (sufficiently big) combination of simple invertible functions.

In practice, we have some recent successful flow models. A few examples of these models are in the lecture slides.

In section 2, we had a dream of doing both representation and generation at once. So far, we have not been able to find success with these models. What if we do each goal separately?

The task of representation becomes self-supervised learning, with approaches such as SimCLR. The task of generation can be solved by GANs. Both areas have had recent success.

OpenAI's CLIP and DALL-E are a pair of models that perform each of these tasks well, and they suggest an approach to merging them.

CLIP learns representations for both texts and images, where the two encoders are aligned: the inner product between the encoding of an image and the encoding of a text describing it is large. DALL-E, given some text, generates an image corresponding to the text. Below are images generated by DALL-E when asked for an armchair in the shape of an avocado.

The general approach used in CLIP is called contrastive learning.

Suppose we have some representation function R and pairs of inputs x_1, …, x_n and x'_1, …, x'_n, where x_i and x'_i represent similar objects. Let s_{i,j} = ⟨R(x_i), R(x'_j)⟩; then we want s_{i,j} to be large when i = j, but small when i ≠ j. So, we let the loss function be the cross-entropy of the softmax of each row of similarities with respect to the "correct" match on the diagonal. How do we create similar pairs? In SimCLR, x_i and x'_i are augmentations of the same image. In CLIP, x_i is an image and x'_i a text that describes it.
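A minimal sketch of such a contrastive loss (a softmax cross-entropy over a similarity matrix, with made-up toy representations):

```python
import numpy as np

# Contrastive loss: softmax cross-entropy over each row of the similarity
# matrix s[i, j] = <R(x_i), R(x'_j)>, with the "correct" match on the diagonal.
def contrastive_loss(U, V):
    sims = U @ V.T
    sims = sims - sims.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))

U = np.eye(4)                                # 4 toy representations
print(contrastive_loss(U, np.eye(4)))        # aligned pairs: low loss
print(contrastive_loss(U, np.ones((4, 4))))  # uninformative pairs: log(4)
```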

CLIP's representation space does seem to have nice properties, such as correspondence between semantic attributes and linear directions, which enables doing some "semantic linear algebra" on representations (see this demo based on Vladimir Hatlakov's code; in the snippet below, `tenc` maps text to its encoding/representation and `get_img` finds the nearest image to a representation in the Unsplash dataset):

The theory of GANs is currently not well-developed. As an objective, we want images that “look real” (which is not well defined), and we have no posterior distribution. If we just define the distribution based on real images, our GAN might memorize the photos to beat us.

However, we know that neural networks are good at discriminating real vs. fake images. So, we add in a discriminator D and define the loss via the minimax game

min_G max_D E_{x~p} [ log D(x) ] + E_z [ log( 1 − D(G(z)) ) ].
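A per-sample sketch of the two sides of this game (assuming the discriminator outputs a probability; the function names are illustrative):

```python
import math

# Per-sample GAN losses. The discriminator wants D(real) near 1 and D(fake)
# near 0; the generator wants to push D(fake) toward 1.
def discriminator_loss(d_real, d_fake):
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def generator_loss(d_fake):
    return math.log(1.0 - d_fake)  # the generator minimizes this

# A discriminator at chance (0.5 everywhere) yields the equilibrium values.
print(discriminator_loss(0.5, 0.5))  # 2 * log(2) ≈ 1.386
print(generator_loss(0.5))           # log(1/2) ≈ -0.693
```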

The generator model and discriminator model form a 2-player game. Such games are often harder to train and very delicate. We typically train by changing a player's action to its best response. However, we need to be careful if the two players have very different skill levels: they may get stuck in a setting where no change of strategies makes much difference, since the stronger player always dominates the weaker one. In particular, in GANs we need to ensure that the generator is not cheating by using a degenerate distribution that still succeeds with respect to the discriminator.

If a 2-player model makes training more difficult, why do we use it? If we fix the discriminator, then the generator can find a picture that the discriminator thinks is real and only output that one, obtaining low loss. As a result, the discriminator needs to update along with the generator. This example also highlights that the discriminator’s job is often harder. To fix this, we have to somehow require the generator to give us good entropy.

Finally, how good are GANs in practice? Recently, we have had GANs that produce great images as well as audio. For example, modern deepfake techniques often use GANs in their architecture. However, it is still unclear how rich the distribution of generated images is.


Lecture video – Slides (pdf) – Slides (powerpoint with ink and animation)

In this lecture, we talk about *what* neural networks end up learning (in terms of their weights) and *when*, during training, they learn it.

In particular, we’re going to discuss

- **Simplicity bias**: how networks favor "simple" features first.
- **Learning dynamics**: what is learned early in training.
- **Different layers**: do the different layers learn the same features?

The type of results we will discuss are:

- Gradient-based deep learning algorithms have a bias toward learning simple classifiers. In particular, this often holds when the optimization problem they are trying to solve is "underconstrained/overparameterized," in the sense that there are exponentially many different models that fit the data.
- Simplicity also affects the *timing* of learning. Deep learning algorithms tend to learn simple (but still predictive!) features first.
- Such "simple predictive features" tend to be in lower (closer to input) levels of the network. Hence deep learning also tends to learn lower levels earlier.
- On the other side, the above means that distributions that do not have "simple predictive features" pose significant challenges for deep learning. Even if there is a small neural network that works very well for the distribution, gradient-based algorithms will not "get off the ground" in such cases. We will see a lower bound for *learning parities* that makes this intuition formal.

As a first example to showcase what is learned by neural networks, we'll consider the following data distribution: we sample labeled points in the plane, with one label corresponding to the orange points and the other to the blue points in the figure below.

If we train a neural network to fit this distribution, we can see below that the neurons that are closest to the input data end up learning features that are highly correlated with the input (mostly linear subspaces at 45-degree angle, which correspond to one of the stripes). In the subsequent layers, the features learned are more sophisticated and have increased complexity.

Some people have spent a lot of time trying to understand what is learned by different layers. In a recent work, Olah et al. dig deep into a particular architecture for computer vision, trying to interpret the features learned by neurons at different layers.

They found that earlier layers learn features that resemble edge detectors.

However, as we go deeper, the neurons at those layers start learning more convoluted features (for example, some features at a deeper layer resemble heads).

There is evidence that SGD learns simpler classifiers first. The following figure tracks how much of a learned classifier's performance can be accounted for by a linear classifier. We see that up to a certain point in training, *all* of the performance of the neural network learned by SGD (measured as mutual information with the label or as accuracy) can be ascribed to the linear classifier. They diverge only very near the point where the linear classifier "saturates," in the sense that it reaches the best possible accuracy for linear models. (To measure how much of the network's performance *cannot* be accounted for by the linear classifier, we use the mutual information between the network's prediction and the label, conditioned on the prediction of the linear classifier.)

In general, simplicity bias is a very good thing. For example, the most "complex" function is a random function. If, given some observed data, SGD were to find a random function that perfectly fits it, then it would never generalize (since for every fresh input, the value of the function would be random).

At the same time, simplicity bias means that our algorithms might focus too much on simple solutions and miss more complex ones. Sometimes the complex solutions actually do perform better. In the following cartoon a person could go to the low-hanging fruit tree on the right-hand side and miss the bigger rewards on the left-hand side.

This can actually happen in neural networks. We also saw a simple example in class:

The two datasets are equally easy to represent, but on the righthand side, there is a very strong “simple classifier” (the 45-degree halfspace) that SGD will “latch onto.” Once it gets stuck with that classifier, it is hard for SGD to get “unstuck.” As a result, SGD has a much harder time learning the righthand dataset than the lefthand dataset.

So, what can we prove about the dynamics of gradient descent? Often we can gain insights by studying *linear regression*.

Formally, given data points x_1, …, x_n in R^d with labels y_1, …, y_n, we would like to find a vector w such that ⟨w, x_i⟩ ≈ y_i for every i.

In this setting, we can prove that running SGD (from zero or tiny initialization) on the loss |Xw − y|² will converge to the solution of minimum norm. To see why, note that SGD performs updates of the form

w_{t+1} = w_t − η ( ⟨w_t, x_i⟩ − y_i ) x_i.

However, note that ⟨w_t, x_i⟩ − y_i is a scalar. Therefore, all of the updates keep the updated vector within span{x_1, …, x_n}. This implies that the solution we converge to will also lie in span{x_1, …, x_n}.

Geometrically, this translates into the solution being the projection onto the subspace span{x_1, …, x_n}, which results in the least-norm solution.
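A sketch with a concrete underdetermined system (made-up numbers), checking that gradient descent from zero matches the pseudoinverse (minimum-norm) solution:

```python
import numpy as np

# Underdetermined least squares: 2 equations, 4 unknowns. Gradient descent
# from w = 0 keeps w in the row span of X, so it converges to the
# minimum-norm interpolating solution, i.e. the pseudoinverse solution.
X = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, -1.0]])
y = np.array([1.0, 2.0])

w = np.zeros(4)
for _ in range(5000):
    w -= 0.1 * X.T @ (X @ w - y)  # gradient of 0.5 * ||Xw - y||^2

w_min_norm = np.linalg.pinv(X) @ y
print(np.allclose(w, w_min_norm))  # True: same minimum-norm solution
print(np.allclose(X @ w, y))       # True: w interpolates the data
```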

Analyzing the dynamics of gradient descent, we can write the distance between consecutive weight updates and the converging solution w* as

w_{t+1} − w* = ( I − η X^T X ) ( w_t − w* ).

We see that we are applying the linear operator I − η X^T X at every step we take. As long as this operator is contractive (on the span of the data), we will continue to make progress and converge to w*. Formally, to make progress, we require

| I − η X^T X | < 1 on the span of the data.

This directly translates into 0 < η < 2 / λ_max(X^T X), and then the progress we make per step is approximately a factor of 1 − 1/κ, where κ = λ_max / λ_min is the *condition number* of X^T X.

What happens now if the matrix X is random? Then results from random matrix theory (specifically, the Marchenko-Pastur distribution) state that:

- if n ≫ d, then the matrix X^T X has full rank and its eigenvalues are bounded away from 0. This means that the matrix is well conditioned.
- if n ≈ d, then the spectrum of X^T X starts shifting towards 0, with some eigenvalues being close to zero, resulting in an ill-conditioned matrix.
- if n ≪ d, then the spectrum has some zero eigenvalues, but it is otherwise bounded away from zero. If we restrict attention to the subspace of positive eigenvalues, we again achieve a good condition number.

We now want to go beyond linear regression and talk about deep networks. As deep networks are very hard to understand, we will start by analyzing a depth-2 network. We will also consider a *linear* network, omitting the nonlinearity. This might seem strange, as we could consider the corresponding linear model, which has exactly the same expressiveness. However, note that these two models have different parameter spaces. This means that gradient-based algorithms will travel on different paths when optimizing these two models.

Specifically, we can see that the minimum loss attained by the two models will coincide, but the SGD path, and the solution reached, will be different.

We will analyze gradient flow on these two networks (which is gradient descent with the learning rate tending to zero). We will make the simplifying assumption that the weight matrix W is square and symmetric, so that the depth-2 linear network computes W·W·x = W²x on input x. We will compare the gradient flow of two different loss functions: L(W) (doing gradient flow on a linear model) and L(W²) (doing gradient flow on a depth-2 linear model).

Gradient flow on the linear model simply gives , whereas for the deep linear network we have (using the chain rule)

since is symmetric.

For simplicity, let’s denote and . We then have

.

Another way to view the comparison between the models of interest, and is as follows: let , then .

We can view this as follows: when we multiply the gradient with we end up making the “big bigger and the small smaller”. Basically, this accentuates the differences between the eigenvalues and biases the solution toward becoming a low-rank matrix.

To see why, you can think of a low-rank matrix as one that has a few large eigenvalues, with the rest small. If is already close to low rank, then replacing a gradient by encourages the gradient steps to mostly happen in the top eigenspace of . This result generalizes to networks of greater depth, and the gradient evolves as , with .
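A toy simulation of this “big gets bigger” effect (my own illustration, in the simplest scalar/diagonal case): for a depth-2 factorization, the end-to-end parameter evolves with its gradient rescaled by a factor proportional to the parameter itself, so coordinates that are already large move much faster, producing an approximately low-rank (here, sparse) trajectory.

```python
import numpy as np

# Target values for the end-to-end parameter; one large, one small.
y = np.array([1.0, 0.1])
eta, steps = 1e-3, 1000

w_plain = np.full(2, 1e-3)   # gradient flow on the linear model
w_deep = np.full(2, 1e-3)    # depth-2 dynamics: gradient rescaled by 4*w
for _ in range(steps):
    w_plain -= eta * 2 * (w_plain - y)
    w_deep -= eta * 4 * w_deep * 2 * (w_deep - y)

# The deep dynamics drive the large coordinate toward its target while the
# small coordinate barely moves -- an (approximately) low-rank solution.
sparsity_deep = w_deep[1] / w_deep[0]
sparsity_plain = w_plain[1] / w_plain[0]
```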

This means that we end up doing gradient flow on a *Riemannian manifold*. An interesting result is that the flow induced by the operator is provably not equivalent to a regularized minimization problem for any .

Finally, let’s discuss what is learned by the different layers in a neural network. Some intuition people have is that learning proceeds roughly like the following cartoon:

We can think of our data as being “built up” as a sequence of choices from higher level to lower level features. For example, the data is generated by first deciding that it would be a photo of a dog, then that it would be on the beach, and finally low-level details such as the type of fur and light. This is also how a human would describe this photo. In contrast, a neural network builds up the features in the opposite direction. It starts from the simplest (lowest-level) features in the image (edges, textures, etc.) and gradually builds up complexity until it finally classifies the image.

To build a bit of intuition, consider an example of combining different simple features. We can see that if we try to combine two good edge detectors with different orientations, the end result will hardly be an edge detector.

So the intuition is that there is competitive/evolutionary pressure on neurons to “specialize” and recognize useful features. Initially, all the neurons are random features, which can be thought of as random linear combinations of the various detectors. However, after training, the symmetry will break between the neurons, and they will specialize (in this simple example, they will either become vertical or horizontal edge detectors).

Raghu, Gilmer, Yosinski, and Sohl-Dickstein tracked the speed at which features learned by different layers reach their final learned state. In the figure below the diagonal elements denote the similarity of the current state of a layer to its final one, where lighter color means that the state is more similar. We can see that earlier layers (more to the left) reach their final state earlier (with the exception of the 2 layers closest to the output, which also converge very early).

The “symmetry breaking” intuition is explored by a recent work of Frankle, Dziugaite, Roy, and Carbin. Intuitively, because the average of two good features is generally *not* a good feature, averaging the weights of two neural networks with small loss will likely result in a network with large loss. That is, if we start from two random initializations , and train two networks until we reach weights and with small loss, then we expect the average of and to result in a network with poor loss:

In contrast, Frankle et al. showed that sometimes, when we start from the same initialization (especially after pruning) and use random SGD noise (obtained by randomly shuffling the training set), then we reach a “linear plateau” of the loss function in which averaging two networks yields a network with similar loss:

If we believe that networks learn simple features first, and learn them in the early layers, then this has an interesting consequence. If the data is such that simple features (e.g. linear or low degree) are completely uninformative (have no correlation with the label), then we may expect that learning cannot “get off the ground”. That is, even if there exists a small neural network that can learn the class, gradient-based algorithms such as SGD will never find it. (In fact, it is possible that *no* efficient algorithm could find it.) There are some settings where we can prove such conjectures. (For gradient-based algorithms, that is; proving this for all efficient algorithms would require settling the P vs NP question.)

We discuss one of the canonical “hard” examples for neural networks: parities. Formally, for , the distribution is the distribution over defined as follows: and . The “learning parity” problem is as follows: given samples drawn from , either recover or do the weaker task of finding a predictor such that with high probability over future samples .

It turns out that if we don’t restrict ourselves to deep learning, given samples we can recover . Consider the transformations and . If we let if and otherwise, we can write . Basically, we transformed the problem of parity to the problem of determining whether we have an odd or an even number of . In this setting, we can think of every sample as providing a *linear equation* modulo 2 over the unknown variables . When , these linear equations will be very likely to be of full rank, and hence we can use Gaussian elimination to find and hence .
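This recovery is simple to carry out in code (a sketch of mine; variable names and parameters are illustrative): we generate samples from a random parity and recover the hidden set by Gaussian elimination over GF(2).

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 20, 100                      # m >> n makes the system full rank w.h.p.
secret = rng.integers(0, 2, n)      # indicator vector of the hidden set S
X = rng.integers(0, 2, (m, n))      # samples in {0,1}^n
y = (X @ secret) % 2                # parity labels

def solve_gf2(A, b):
    """Gaussian elimination modulo 2; returns a solution of A z = b (mod 2)."""
    A, b = A.copy() % 2, b.copy() % 2
    rows, cols = A.shape
    pivots, row = [], 0
    for col in range(cols):
        hits = np.nonzero(A[row:, col])[0]
        if hits.size == 0:
            continue
        p = row + hits[0]
        A[[row, p]] = A[[p, row]]          # swap a pivot row into place
        b[[row, p]] = b[[p, row]]
        for r in range(rows):              # eliminate this column elsewhere
            if r != row and A[r, col]:
                A[r] = (A[r] + A[row]) % 2
                b[r] = (b[r] + b[row]) % 2
        pivots.append((row, col))
        row += 1
    z = np.zeros(cols, dtype=int)
    for r, c in pivots:
        z[c] = b[r]
    return z

recovered = solve_gf2(X, y)
print(np.array_equal(recovered, secret))   # True: elimination finds S
```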

Switching to the learning setting, we can express parities by using few ReLUs. In particular, we’ve shown that we can create a step function using ReLUs. Therefore for every , there is a combination of four ReLUs that computes the function such that outputs for , and outputs if . We can then write the parity function (for example for ) as . This will be a linear combination of at most ReLUs.
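As an illustration (the lecture’s construction uses four ReLUs per bump; the sketch below uses a triangular bump made of three, which suffices on integer inputs), parity can be written as a linear combination of O(n) ReLUs applied to the coordinate sum:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def bump(s, k):
    # Triangular bump from three ReLUs: 1 at s = k, 0 at every other integer.
    return relu(s - k + 1) - 2 * relu(s - k) + relu(s - k - 1)

def parity_net(x):
    # Parity of x in {0,1}^n: sum the bumps centered at the odd integers.
    s = x.sum()
    return sum(bump(s, k) for k in range(1, len(x) + 1, 2))

rng = np.random.default_rng(0)
xs = rng.integers(0, 2, (50, 8))
print(all(parity_net(x) == x.sum() % 2 for x in xs))   # True
```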

Parities are an example of a case where simple features are uninformative. For example, if then for every linear function ,

in other words, there is no correlation between the linear function and the label.

To see why this is true, write . By linearity of expectation, it suffices to show that $\mathbb{E}_{(x,y) \sim D_I}[L_i x_i y] = L_i \mathbb{E}_{(x,y) \sim D_I}[x_i y] = 0$. Both and are just values in . To evaluate the expectation we simply need to know the marginal distribution that induces on when we restrict it to these two coordinates. This distribution is just the uniform distribution. To see why this is the case, consider a coordinate and let’s condition on the values of all coordinates other than and . After conditioning on these values, for some and are chosen uniformly and independently from . For every choice of , if we flip then that would flip the value of , and hence the marginal distribution on and will be uniform.
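This can also be verified by brute-force enumeration (a small check I added, using the ±1 encoding and an arbitrary set I of size at least two):

```python
import itertools
import numpy as np

n, I = 6, (1, 3, 5)    # |I| >= 2; x uniform in {-1,1}^n, y = prod_{i in I} x_i
corr = np.zeros(n)
for x in itertools.product([-1, 1], repeat=n):
    y = 1
    for i in I:
        y *= x[i]
    corr += y * np.array(x)
corr /= 2 ** n          # the exact expectation E[x_i * y] for each coordinate i
print(corr)             # all zeros: no coordinate correlates with the label
```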

This lack of correlation turns out to be a real obstacle for gradient-based algorithms. While small neural networks for parities exist, and Gaussian elimination can find them, it turns out that gradient-based algorithms such as SGD will *fail* to do so. Parities are hard to learn, and even if the capacity of the network is such that it can memorize the input, it will still perform poorly on a test set. Indeed, we can prove that for *every* neural network architecture , running SGD on will require steps. (Note that if we add *noise* to parities, then Gaussian elimination will fail, and it is believed that *no efficient algorithm* can learn the distribution in this case. This is known as the learning parity with noise problem, which is also related to the learning with errors problem that is the foundation of modern lattice-based cryptography.)

We now sketch the proof that gradient-based algorithms require exponentially many steps to learn parities, following Theorem 1 of Shalev-Shwartz, Shamir, and Shammah. We think of an idealized setting where we have an unlimited number of samples and use each sample only once (this should only make learning easier). We will show that we make very little progress in learning , by showing that for any given , the expected gradient over will be exponentially small, and hence we make very little progress toward learning . Specifically, using the notation , for any ,

The term is independent of and so does not contribute toward learning . Hence intuitively to show we make exponentially small progress, it suffices to show that typically for every , will be exponentially small. (That is, even if for a fixed we make a large step, these all cancel out and give us exponentially small progress toward actually learning .)

Formally, we will prove the following lemma:

**Lemma:** For every ,

**Proof:** Let us fix and define . The quantity can be written as with respect to the inner product . However, is an orthonormal basis with respect to this inner product. To see this note that since , for every , and for , where is the symmetric difference of and . The reason is that for all and so elements that appear in both and “cancel out”. Since the coordinates of are distributed independently and uniformly, the expectation of the product is the product of expectations. This means that as long as is not empty (i.e., ) this will be a product of one or more terms of the form . Since is uniform over , and so we get that if , .

Given the above

which means that (since there are subsets of ) on average . In other words, is typically exponentially small which is what we wanted to prove.
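The Parseval step in the proof can be checked numerically (my own illustration): for any bounded function on the hypercube, standing in for one entry of the gradient, the average squared correlation with the 2^n parities is exactly its mean square divided by 2^n, hence exponentially small.

```python
import itertools
import numpy as np

n = 10
xs = np.array(list(itertools.product([-1, 1], repeat=n)))
# An arbitrary bounded function standing in for one entry of the gradient.
g = np.tanh(xs @ (np.arange(1, n + 1) / n))

sq_corrs = []
for S in itertools.product([0, 1], repeat=n):
    chi = np.prod(xs ** np.array(S), axis=1)     # the parity chi_S(x)
    sq_corrs.append(np.mean(g * chi) ** 2)       # Fourier coefficient squared

# Parseval: the squared correlations sum to E[g^2] <= 1, so their average
# over all 2^n parities is at most 2^{-n}.
avg = float(np.mean(sq_corrs))
```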

Lecture video (starts at slide 2 since I hit the record button 30 seconds too late – sorry!) – slides (pdf) – slides (Powerpoint with ink and animation)

These are rough notes for the first lecture in my advanced topics in machine learning seminar. See the previous post for the introduction.

This lecture’s focus was on **“classical” learning theory**. The distinction between “classical learning” and “deep learning” is semantic/philosophical, and doesn’t matter much for this seminar. I personally view this difference as follows:

That is, deep learning is a framework that allows you to translate more resources (data and computation) into better performance. “Classical” methods often have a “threshold effect” where a certain amount of data and computation is needed, and more would not really help. For example, in parametric methods there will typically be a sharp threshold for the amount of data required for saturating the potential performance. Even in non-parametric models such as nearest neighbors or kernel methods, the computational cost is fixed for a fixed amount of data, and there is no way to profitably trade more computation for better performance.

In contrast, for deep learning, we often can get better performance using the same data by using bigger models or more computation. For example, I doubt this story of Andrej Karpathy could have happened with a non deep-learning method:

*“One time I accidentally left a model training during the winter break and when I got back in January it was SOTA (“state of the art”).”*

We can view machine learning (deep or not) as a series of “leaky pipelines”:

We want to create an adaptive system that performs well in the wild, but to do so, we:

- Set up a benchmark of a test distribution, so we have some way to compare different systems.
- We typically can’t optimize directly on the benchmark, both because losses like accuracy are not differentiable and because we don’t have access to an unbounded number of samples from the distribution. (Though there are exceptions, such as when optimizing for playing video games.) Hence we set up the task of optimizing some proxy loss function on some finite samples of training data.
- We then run an optimization algorithm whose ostensible goal is to find the that minimizes the loss function over the training data. ( is a set of models, sometimes known as the *architecture*, and sometimes we also add other restrictions, such as norms of weights, which is known as *regularization*.)

All these steps are typically “leaky.” Test performance on benchmarks is not the same as real-world performance. Minimizing the loss over the training set is not the same as test performance. Moreover, we typically can’t solve the loss minimization task optimally, and there isn’t a unique minimizer, so the choice of depends on the algorithm.

Much of machine learning theory is about obtaining guarantees bounding the “leakiness” of the various steps. These are often easier to do in “classical” contexts of statistical learning theory than for deep learning. In this lecture, we will make a short blitz through classical learning theory. This material is covered in several sources, including the excellent book understanding machine learning and the upcoming Hardt-Recht text (update: the Hardt-Recht book is now out).

We will be very rough, using proofs by picture and making some simplifications (e.g., working in one dimension, assuming functions are always differentiable, etc.)

A (nice) function is (strongly) *convex* if it satisfies one of the following three equivalent conditions:

- For every two points , the line between and is above the curve of .
- For every point , the tangent line at with slope is below the curve of .
- For every , .

To see that for example, 2 implies 3, we can use the contrapositive. If 3 does not hold and is such that (should really assume but we’re being rough) then by Taylor, around we get

For small enough, is negligible and so we see that the curve of near equals the tangent line plus a negative term, and hence it is below the line, contradicting 2.

To show that 2 implies 1, we can again use the contrapositive and show by a “proof by picture” that if there is some point in which is above the line between and , then there must be a point in which the tangent line at is above .

Some tips on convexity:

- The function is convex (proof: Google)
- If is convex and is linear then is convex (lines are still lines).
- If is convex and is convex then is convex for every positive .

The gradient descent algorithm minimizes a function by starting at some point and repeating the following operation:

for some small .

By Taylor, , and so setting , we can see that

Since , we see that as long as we make progress. If we set then we reduce in each step the value of the function by roughly .
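The update rule is a few lines of code (a minimal sketch; the objective and step size below are my own illustrative choices):

```python
def gradient_descent(grad, x0, eta=0.1, steps=100):
    x = x0
    for _ in range(steps):
        x -= eta * grad(x)   # move against the gradient
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3.0), x0=0.0)
print(x_min)   # converges to 3
```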

In the high dimensional case, we replace with the gradient and with the Hessian which is the matrix . The progress we can make is controlled by the ratio of the smallest to largest eigenvalues of the Hessian, which is one over its _condition number_.

In **stochastic gradient descent**, instead of performing the step we use , where is a random variable satisfying:

- for some .

Let’s define . Then is a mean zero and variance random variable, and let’s heuristically imagine that are independent. If we plug this into the Taylor approximation, then since , only the terms with survive.

So by plugging to the Taylor approximation, we get that in expectation

We see that now to make progress, we need to ensure that is sufficiently smaller than . We note that in the beginning, when is large, we can use a larger learning rate , while when we get closer to the optimum, then we need to use a smaller learning rate.
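A scalar simulation of this tradeoff (illustration only; the objective and noise level are mine): with a learning rate decaying like 1/t, SGD converges to the optimum despite the gradient noise, whereas a fixed large rate would keep bouncing around at the noise scale.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd(grad, x0, sigma, steps, eta_fn):
    x = x0
    for t in range(1, steps + 1):
        noisy = grad(x) + sigma * rng.standard_normal()  # unbiased estimate
        x -= eta_fn(t) * noisy
    return x

grad = lambda x: 2 * (x - 3.0)   # f(x) = (x - 3)^2, minimized at x = 3

# Decaying learning rate: converges close to the optimum.
x_decay = sgd(grad, 0.0, sigma=1.0, steps=5000, eta_fn=lambda t: 1 / (2 * t))
print(x_decay)
```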

The *supervised learning problem* is the task, given labeled training inputs, of obtaining a classifier/regressor that will satisfy for future samples from the same distribution.

Let’s assume that our goal is to minimize some quantity where is a *loss function* (that we will normalize to for convenience). We call the quantity the population loss (and abuse notation by denoting it as ) and the corresponding quantity over the training set the empirical loss.

The **generalization gap** is the difference between the population and empirical losses. (We could add an absolute value though we expect that the loss over the training set would be smaller than the population loss; the population loss can be approximated by the “test loss” and so these terms are sometimes used interchangeably.)

**Why care about the generalization gap?** You might argue that we only care about the population loss and not the gap between population and empirical loss. However, as mentioned before, we don’t even care about the population loss but about a more nebulous notion of “real-world performance.” We want the relations between our different abstractions to be as minimally “leaky” as possible, and so we bound the difference between train and test performance.

Suppose that our algorithm performs *empirical risk minimization (ERM)* which means that on input , we output . Let’s assume that we have a collection of classifiers and define . For every , is an estimator for and so we can write where is a random variable with mean zero and variance roughly (because we have samples).

The ERM algorithm outputs the which minimizes . As grows, the quantity (which is known as the **bias** term) shrinks. The quantity (which is known as the **variance** term) grows. When the variance term dominates the bias term, we could potentially start outputting classifiers that don’t perform better on the population. This is known as the “bias-variance tradeoff.”

The most basic generalization gap is the following:

**Thm (counting gap):** With high probability over , .

**Proof:** By standard bounds such as Chernoff etc., the random variable behaves like a Normal/Gaussian of mean zero and standard deviation at most , which means that the probability that is at most . If we set then for every , . Hence by the union bound, the probability that there *exists* such that is at most . QED
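The counting bound is easy to see in simulation (a sketch of mine, using a Hoeffding-style tail in place of the Chernoff bound above): draw many classifiers with known population losses, estimate each from n samples, and check that even the worst-case gap stays below the union bound.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, delta = 1000, 2000, 1e-4   # N classifiers, n samples, failure probability

true_loss = rng.uniform(0.1, 0.9, N)         # population loss of each classifier
emp_loss = rng.binomial(n, true_loss) / n    # empirical loss on n i.i.d. samples

# Hoeffding + union bound: with probability >= 1 - delta, every classifier's
# gap is below sqrt(ln(2N/delta) / (2n)).
max_gap = np.abs(emp_loss - true_loss).max()
bound = np.sqrt(np.log(2 * N / delta) / (2 * n))
print(max_gap, bound)
```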

One way to count the number of classifiers in a family is by the number of bits needed to represent a member of the family – there are at most functions that can be represented using bits. But this bound can be quite loose – for example, it can make a big difference if we use or bits to specify numbers, and some natural families (e.g., linear functions) are *infinite*. There are many bounds in the literature of the form

with values of other than .

Intuitively corresponds to the “capacity” of the classifier family/algorithm – the number of samples it can fit/memorize. Some examples (very roughly stated) include:

- **VC dimension:** is the maximum number such that for every set of points and labels, there is a classifier in the family that fits the points to the labels. That is, for every and there is with .
- **Rademacher complexity:** is the maximum number such that for random from and uniform (assume say over ), with high probability there exists with for most .
- **PAC-Bayes:** is the mutual information between the training set that the learning algorithm is given as input and the classifier that it outputs. This requires some conditions on the learning algorithm and some prior distribution on the classifier. To get bounds on this quantity when the weights are continuous, we can add *noise* to them.
- **Margin bounds:** is the “effective dimensionality” as measured by some margin. For example, for random unit vectors in , . For linear classifiers, the margin bound is the minimum such that correct labels over the training set are classified with at least margin.

A recent empirical study of generalization bounds is “Fantastic Generalization Measures and Where to Find Them” by Jiang, Neyshabur, Mobahi, Krishnan, and Bengio, and “In Search of Robust Measures of Generalization” by Dziugaite, Drouin, Neal, Rajkumar, Caballero, Wang, Mitliagkas, and Roy.

The generalization gap depends on several quantities:

- The family of functions.
- The algorithm used to map the training set to .
- The distribution of datapoints.
- The distribution of labels.

A **generalization bound** is an upper bound on the gap that only depends on some of these quantities. In an influential paper, Zhang, Bengio, Hardt, Recht, and Vinyals showed significant barriers to obtaining such results that are meaningful for practical deep networks. They showed that in many natural settings, we cannot get such bounds even if we allow them to depend arbitrarily on the first three factors. That is, they showed that for natural families of functions (modern deep nets), natural algorithms (gradient descent on the empirical loss), and natural distributions (CIFAR-10 and ImageNet), if we replace the labels by uniformly random ones, then we can get an arbitrarily large generalization gap.

We can also interpolate between the Zhang et al. experiment and the plain CIFAR-10 distribution. If we consider a distribution where we take samples from CIFAR-10 and with some probability replace the label with a random one (one of the 10 CIFAR-10 classes), then the test/population performance (fraction of correct classifications) will be at most (not surprising), but the training/empirical accuracy will remain at roughly 100%. The left-hand side of the gif below demonstrates this (this comes from this paper with Bansal and Kaplun which shows that, as the right side demonstrates, certain self-supervised learning algorithms do not suffer from this phenomenon; here the noise level is the fraction of wrong labels so is perfect noise):

While classical learning theory predicts a “bias-variance tradeoff” whereby as we increase the model class size, we get worse and worse performance, this is not what happens in modern deep learning systems. Belkin, Hsu, Ma, and Mandal posited that such systems undergo a “double descent” whereby performance behaves according to the classical bias/variance curve up to the point at which we achieve zero training error, and then starts improving again. This actually happens in real deep networks.

To get some intuition for the double descent phenomenon, consider the case of fitting a univariate polynomial of degree to samples of the form where is a degree polynomial. When we are “under-fitting” and will not get good performance. As trends between and , we fit more and more of the noise, until for we have a perfect interpolating polynomial that will have perfect train but very poor test performance. When grows beyond , more than one polynomial can fit the data, and (under certain conditions) SGD will select the minimal norm one, which will make the interpolation smoother and smoother and actually result in better performance.

Consider the task of distinguishing between the speech of an adult and a child. In the time domain, this may be hard, but by switching to representation in the Fourier domain, the task becomes much easier. (See this cartoon)

The Fourier transform is based on the following theorem: for every continuous , we can arbitrarily well approximate as a linear combination of functions of the form . Another way to say it is that if we use the embedding which maps into (sufficiently large) coordinates of the form then becomes linear.

The wave functions are not the only ones that can approximate an arbitrary function. A *ReLU* is a function of the form . We can approximate every continuous function arbitrarily well as a combination of ReLUs:

**Theorem:** For every continuous and there is a function such that is a linear combination of ReLUs and .

In one dimension, this follows from the facts that:

- ReLUs can give an arbitrarily good approximation to bump functions of the form
- Every continuous function on a bounded domain can be arbitrarily well approximated by the sum of bump functions.

The second fact is well known, and here is a “proof by picture” for the first one:
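In code, the four-ReLU bump looks as follows (a sketch of mine; the width parameter `delta` is my choice): two ReLUs make a ramp up at the left endpoint, and two more make a ramp down at the right one.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def bump(x, a, b, delta=0.01):
    # Four ReLUs approximating the indicator of [a, b]: a ramp up over
    # [a, a + delta] and a ramp down over [b, b + delta].
    ramp = lambda t: relu(t) - relu(t - 1)
    return ramp((x - a) / delta) - ramp((x - b) / delta)

inside = bump(0.5, 0.0, 1.0)    # well inside [0, 1]
left = bump(-0.5, 0.0, 1.0)     # left of the interval
right = bump(1.5, 0.0, 1.0)     # right of the interval
```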

For higher dimensions, we need to create higher-dimensional bump functions. For example, in two dimensions, we can create a “noisy circle” by summing over all rotations of our bump. We can then add many such circles to create a two-dimensional bump. The same construction extends to an arbitrary number of dimensions.

**How many ReLUs?** The above shows that a linear combination of ReLUs can approximate every function on variables, but how many ReLUs are needed? Every ReLU is specified by numbers for the weights and bias. Intuitively, we could discretize each coordinate to a constant number of choices, and so there would be choices for such ReLUs. Indeed, it can be shown that every continuous function can be approximated by a linear combination of ReLUs. It turns out that some functions *require* an exponential number of ReLUs.

The above discussion doesn’t apply just to ReLUs but to virtually any non-linear function.

By embedding our input as a vector , we can often make many “interesting” functions become much simpler to compute (e.g., linear). In learning, we typically search for an *embedding* or *representation* that is “good” in one or more of the following senses:

- The dimension of embedding is not too large for many “interesting” functions.
- Two inputs are “semantically similar” if and only if and are correlated (e.g., is large).
- We can efficiently compute and (sometimes) can compute without needing to explicitly compute .
- For “interesting” functions , can be approximated by a linear function in the embedding with “structured” coefficients (for example, sparse combination, or combination of coefficients of certain types, such as low frequency coefficients in Fourier domain)
- …

Suppose that we have some notion of “similarity” between inputs, where being large means that is “close” to and being small means that is “far” from .

This suggests that we can use one of the following methods for approximating a function given inputs of the form . On input , any of the following can be reasonable approximations to depending on context:

- where is the closest to in . (This is known as the *nearest neighbor* algorithm.)
- The mean (or other combining function) of where are the nearest inputs to . (This is known as the *nearest neighbor* algorithm.)
- Some linear combination of where the coefficients depend on . (This is known as the *kernel* algorithm.)

All of these algorithms are _non-parametric methods_ in the sense that the final regressor/classifier is specified by the full training set .

**Kernel algorithms** can also be described as follows. Given some embedding , where is our input space, kernel regression approximates a function by a linear function in .

The key observation is that to solve linear equations or least-square minimization in of the form , we don’t need to know the vectors . Rather, it is enough to know the inner products . In kernel methods we are often not given the embedding explicitly (indeed might even be infinite) but rather the function such that . The only thing to verify is that actually defines an inner product by checking that the matrix is positive semi-definite.
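A minimal kernel ridge regression sketch (details mine) illustrating this: both training and prediction touch only the kernel function, never an explicit embedding, even though the RBF kernel corresponds to an infinite-dimensional feature space.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(X1, X2, gamma=10.0):
    # k(x, x') = exp(-gamma * ||x - x'||^2): an inner product in an
    # infinite-dimensional feature space we never compute explicitly.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Kernel ridge regression: only the Gram matrix K is needed.
X = rng.uniform(-1, 1, (100, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(100)
lam = 1e-3
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

X_test = rng.uniform(-1, 1, (50, 1))
pred = rbf_kernel(X_test, X) @ alpha
mse = float(np.mean((pred - np.sin(3 * X_test[:, 0])) ** 2))
```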

In general, kernels and neural networks look quite similar – both ultimately involve composing a linear function on top of a non-linear embedding . It is not always clear-cut whether an algorithm is a kernel or deep neural net method. Some characteristics of kernels are:

- The embedding is not learned from the data. However, if was learned from some other data, or was inspired by representations that were learned from data, then it becomes a fuzzier distinction.
- There is a “shortcut” to compute the inner product using significantly smaller than steps.

Generally, the distinction between a kernel and deep nets depends on the application (Is it to apply some analysis such as generalization bounds for kernels? Is it to use kernel methods with shortcuts for the inner product?) and is more a spectrum than a binary partition.

The above was a very condensed and rough survey of generalization, representation, approximation, and kernel methods. All of these are covered much better in the understanding machine learning book and the upcoming Hardt and Recht book.

In the next lecture, we will discuss the algorithmic bias of gradient descent, including the cases of linear regression and deep linear networks. We will discuss the “simplicity bias” of SGD and what can we say about what is learned at different layers of a deep network.

**Acknowledgements:** Thanks to Manos Theodosis and Preetum Nakkiran for pointing out several typos in a previous version.

This semester I am teaching a seminar on the theory of machine learning. For the first lecture, I would like to talk about what is the theory of machine learning. I decided to write this (very rough!) blog post mainly to organize my own thoughts.

In any science, ML included, the goals of theory and practice are not disjoint. We study the same general phenomena from different perspectives. We can think of questions in computation at a high level as trying to map the unknown terrain of the computational cost of achieving certain quality of output, given some conditions on the input.

Practitioners aim to find points in this terrain’s practically relevant regions, while theoreticians are more interested in its broad landscape, including points far out toward infinity. Both theoreticians and practitioners care about *discontinuities*, when small changes in one aspect correspond to large changes in another. As theoreticians, we care about *computational/statistical tradeoffs*, particularly about points where a small difference in quality yields an exponential difference in time or sample complexity. In practice, models such as GPT-3 demonstrate that a quantitative increase in computational resources can correspond to a qualitative increase in abilities.

Since theory and practice study the same phenomenon in different ways, their relation can vary. Sometimes theory is forward-looking or *prescriptive* – giving “proof of concepts” results that can inspire future application or impossibility results that can rule out certain directions. Sometimes it is backward-looking or *descriptive* – explaining phenomena uncovered by experiments and putting them in a larger context.

Before we talk about what is machine-learning theory, we should talk about what is machine learning. It’s common to see claims of the form “machine learning is X”:

- Machine learning is just statistics
- Machine learning is just optimization
- Machine learning is just approximation or “curve fitting”

and probably many more.

I view machine learning as addressing the following setup:

There is a system that interacts with the world around it in some manner, and we have some sense of whether this interaction is successful or unsuccessful.

For example, a self-driving car is “successful” if it gets passengers from point A to point B safely, quickly, and comfortably. It is “unsuccessful” if it gets into accidents, takes a long time, or drives in a halting or otherwise uncomfortable manner.

The system is the “machine.” It is “learning” since we *adapt* the system so that it becomes more successful. Ideally, we would set to be the most successful system. However, what we actually do is at least *thrice-removed* from this ideal:

- **The model gap:** We do not optimize over all possible systems, but rather over a small subset of such systems (e.g., ones that belong to a certain family of models).
- **The metric gap:** In almost all cases, we do not optimize the actual measure of success we care about, but rather another metric that is at best correlated with it.
- **The algorithm gap:** We don’t even optimize the latter metric, since it will almost always be non-convex, and hence the system we end up with depends on our starting point and the particular algorithms we use.

The magic of machine learning is that sometimes (though not always!) we can still get good results despite these gaps. Much of the theory of machine learning is about understanding under what conditions we can bridge some of these gaps.

The above discussion explains the “machine learning is just X” takes. The expressivity of our models falls under *approximation theory*. The gap between the success we want to achieve and the metric we can measure often corresponds to the difference between *population* and *sample* performance, which becomes a question of *statistics*. The study of our algorithms’ performance falls under *optimization*.

The **metric gap** is perhaps the widest of them all. While in some settings (e.g., designing systems to play video games) we can directly measure a system’s success, this is typically the exception rather than the rule. Often:

- The data we have access to is not the data the system will run on and might not even come from the same distribution.
- The measure of success that we care about is not necessarily well defined or accessible to us. Even if it was, it is not necessarily in a form that we can directly optimize.

Nevertheless, the hope is that if we optimize the hell out of the metrics we can measure, the system will perform well in the way we care about.

A priori, it is not clear that this should be the case, and indeed it isn’t always. Part of the magic of auto-differentiation is that it allows optimization of more complex metrics that match more closely with the success metric we have in mind. But, except for very special circumstances, the two metrics can never be the same. The mismatch between the goal we have and the metric we optimize can manifest in one of the following general ways:

- **“No free lunch”** or Goodhart’s law: If we optimize a metric, then we will get a system that does very well on that metric and not at all well on any other measure. For example, if we optimize accuracy in predicting images, then we may not do well on slightly perturbed versions of these images.
- **“It’s not the destination, it’s the journey”** or the Anna Karenina principle: All successful systems are similar to one another, and hence if we make our system successful with respect to one metric, then it is likely to also be successful with respect to related measures. For example, image classifiers trained on ImageNet have been successfully used for very different images, and there is also evidence that success on ImageNet translates into success on a related distribution (i.e., ImageNet v2).

At the moment, we have no good way to predict when a system will behave according to Goodhart’s law versus the Anna Karenina principle. It seems that when it comes to learning *representations*, machine learning systems follow the Anna Karenina principle: all successful models tend to learn very similar representations of their data. In contrast, when it comes to making *decisions*, we get manifestations of Goodhart’s law, and optimizing for one metric can give very different results than the other.

The **model gap** is the gap between the set of all possible systems and the set that we actually optimize over. Even if a system can be captured as a finite function (say a function mapping pixel images to classes), a simple counting argument shows that the vast majority of these functions require far too many gates to be efficiently computable by any reasonable computational model. (In case you are curious, the counting argument also works for *quantum* circuits.) Hence we necessarily have to deal with a subclass of all possible functions.
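The counting argument can be sanity-checked numerically. The sketch below is my own illustration (the constants are crude, not from the text): it compares the number of Boolean functions on 64-bit inputs against an upper bound on the number of trillion-gate circuits.

```python
from math import log2

n = 64  # input bits; there are 2**(2**n) functions from {0,1}^n to {0,1}
log2_num_functions = 2.0 ** n  # log2 of the number of such functions

s = 10 ** 12  # a trillion gates: far beyond any practical model
# A circuit with s two-input gates can be described in roughly
# s * (2*log2(s) + c) bits, which bounds log2 of the number of circuits.
log2_num_circuits = s * (2 * log2(s) + 4)

# The functions outnumber the circuits astronomically, so the vast
# majority of functions on 64-bit inputs have no trillion-gate circuit.
print(log2_num_circuits < log2_num_functions)  # True
```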

This raises the question of whether we should expect a realizable system to *exist*, let alone for us to find it. Often machine learning is used in *artificial intelligence* applications, where we are trying to mimic human performance. In such cases, human performance is an “existence proof” that some reasonably-sized circuit is successful at the goal. But whether this circuit can be embedded in our model family is still an open question.

Remember when I said that sometimes it’s not about the journey but about the destination? I lied. The **algorithm gap** implies that in modern machine learning, there is a certain sense in which it is always about the journey. In modern machine learning, the system is typically parameterized by a vector w of real numbers. Hence the *metric* we optimize over is to minimize some loss function L(w). We use *local search algorithms*, which start off at some vector w_0 chosen via some distribution, and then iteratively take small steps w_0, w_1, w_2, …. For each t, the next step w_{t+1} is chosen from some distribution of vectors close to w_t such that (hopefully) in expectation L(w_{t+1}) < L(w_t).
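As a toy instance of this template (my own sketch, not the algorithms actually used in practice), here is a random local search that proposes a nearby vector at each step and keeps it when the loss decreases:

```python
import random

def local_search(loss, dim=2, steps=2000, sigma=0.1, seed=0):
    """Toy random local search: start at a random w_0, propose a nearby
    vector at each step, and keep it when the loss decreases."""
    rng = random.Random(seed)
    w = [rng.gauss(0, 1) for _ in range(dim)]          # w_0 from some distribution
    for _ in range(steps):
        cand = [wi + rng.gauss(0, sigma) for wi in w]  # small step near w_t
        if loss(cand) < loss(w):                       # hopefully the loss decreases
            w = cand
    return w

# Convex toy loss L(w) = ||w - (1, -2)||^2, minimized at (1, -2).
target = (1.0, -2.0)
loss = lambda w: sum((wi - ti) ** 2 for wi, ti in zip(w, target))
w = local_search(loss)  # ends up near the target
```

On a convex loss like this one, the starting point doesn’t matter much; the next paragraph is about why it matters a great deal in the non-convex case.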

Modern machine learning systems are *non-convex*, which means that the final point we end up in depends on the starting point, the algorithms, and the randomness. Hence, we can’t have a neat “separation of concerns” and decouple the **architecture**, **metric**, and **algorithm**. When we say that a system obeys the “Anna Karenina principle,” we really mean that for *natural algorithms* optimizing *natural metrics* on *natural architectures*, successful outputs (ones that do well on the metric) are likely to be similar to one another. The use of the “natural” qualifier is a warning sign that we don’t fully understand the conditions under which this happens, but it is clear that some conditions are necessary. Due to non-convexity, it is typically possible to find a “bad minimum” that would be very good on some specific metric but terrible on other ones.
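A one-dimensional toy example (my own, not from the post) of how the endpoint of gradient descent on a non-convex loss depends entirely on the initialization:

```python
def grad_descent(x0, steps=200, lr=0.05):
    """Gradient descent on the non-convex f(x) = (x^2 - 1)^2,
    which has two global minima, at x = -1 and x = +1."""
    x = x0
    for _ in range(steps):
        x -= lr * 4 * x * (x * x - 1)  # f'(x) = 4x(x^2 - 1)
    return x

# Same metric, same algorithm, different starting points:
a = grad_descent(-0.5)  # converges near x = -1
b = grad_descent(+0.5)  # converges near x = +1
```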

Given the above discussion, what kind of theoretical results should we expect in machine learning? Let’s try to classify the type of results we see in general theoretical computer science. In the discussion below, I will focus on the *limitations* of these theoretical results and use humorous names, but make no mistake: all of these categories correspond to valuable theoretical insights.

The *quicksort* algorithm provides a canonical example of algorithm analysis. At this point, we have a reasonably complete understanding of quicksort. We can analyze the distribution of its running time down to the constant.

Hence we have an efficient algorithm, used in practice, with rigorously proven theorems that precisely characterize its performance. This is the “gold standard” of analysis, but also an unrealistic and unachievable goal in almost any other setting.
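For concreteness, here is a minimal randomized quicksort of the kind this analysis covers (a sketch; library sorts use tuned in-place variants):

```python
import random

def quicksort(xs):
    """Textbook randomized quicksort: expected O(n log n) comparisons,
    and the classic analysis even pins down the constant (~2 n ln n)."""
    if len(xs) <= 1:
        return list(xs)
    pivot = random.choice(xs)  # random pivot gives the expected-time guarantee
    return (quicksort([x for x in xs if x < pivot])
            + [x for x in xs if x == pivot]
            + quicksort([x for x in xs if x > pivot]))

print(quicksort([5, 3, 8, 1, 9, 2, 7]))  # [1, 2, 3, 5, 7, 8, 9]
```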

A more common situation is that we have algorithms that work in practice and algorithms with rigorous analysis, but they are not the same algorithms. A canonical example is linear programming. The simplex algorithm often works well in practice but was known to take exponential time on certain instances. For a long time, it was an open question whether or not there is an algorithm for linear programming that takes polynomial time in the worst case.

In 1979, Khachiyan gave the *Ellipsoid* algorithm that runs in polynomial time, but with a polynomial so large that it is impractical. Still, the Ellipsoid algorithm was a great breakthrough, giving hope for better algorithms and a new approach for defining progress measures for linear programming instances. Indeed in 1984, Karmarkar came up with the *interior points* algorithm, with much better time dependence.

The interior-point algorithm is practical and often outperforms the simplex method on large enough instances. However, the algorithm as implemented is not identical to the one analyzed: implementations use different step sizes and other heuristics for which we do not have a precise analysis.

Nevertheless, even if (unlike quicksort) the rigorously analyzed algorithm is not identical to the practical implementations, the story of linear programming shows how crucial “proofs of concept” can be to introducing new ideas and techniques.

The flip side of “proof of concept” results are *impossibility results* that rule out even “proof of concept” algorithms. Impossibility results always come with fine print and caveats (for example, NP-hardness results always refer to *worst case* complexity). However, they still teach us about the structure of the problem and help define the contours of what we can expect to achieve.

“End to end analysis” is when we can prove the guarantees we need on algorithms we actually want to use. “Proof of concept” is when we prove the guarantees we need on impractical algorithms. A “character witness” result is when we prove something positive about an algorithm people use in practice, even if that positive property falls short of the guarantees we actually want.

While the term “character witness” sounds derogatory, such results can sometimes yield truly profound insights. A canonical example again comes from linear programming. While the simplex algorithm can be exponential in the worst case, in a seminal work, Spielman and Teng showed that it does run in polynomial time if the input is slightly perturbed: this is the so-called *smoothed analysis*. While the actual polynomial and the level of perturbation do not yield practical bounds, this is still an important result. It gives formal meaning to the intuition that the simplex method only fails on “pathological” instances, and it initiated a new mode of analyzing algorithms between worst-case and average-case complexity.

In machine learning, a “character witness” result can take many forms. For example, some “character witness” results are analyses of algorithms under certain assumptions on the data, that even if not literally true, seem like they could be “morally true”. Another type of “character witness” result shows that an algorithm would yield the right results if it is allowed to run for an infinite or exponentially long time. Evaluating the significance of such results can be challenging. The main question is whether the analysis teaches us something we didn’t know before.

A third type of “character witness” results are quantitative bounds that are too weak for practical use but are still non-vacuous. For example, approximation algorithms with too big of an approximation factor, or generalization bounds with too big of a guaranteed gap. In such cases, one would hope that these bounds will be at least correlated with performance: algorithms with better bounds will also have higher quality output.

The name “toy problem” also sounds derogatory, but toy problems or toy models can be extremely important in science, and machine learning is no different. A toy model is a way to abstract the salient issues of a model to enable analysis. Results on “toy models” can teach us about general principles that hold in more complex models.

When choosing a toy model, it is important not to mistake models that share superficial similarity with models that keep the salient issues we want to study. For example, consider the following two variants of deep neural networks:

- Networks that have standard architecture except that we make them extremely wide, with the width tending to infinity independently of all other parameters.
- Networks where all activation functions are linear.

Since the practical intuition is that bigger models are better, it may seem that such “ultra-wide” models are not toy models at all and should capture state-of-the-art deep networks. However, the neural tangent kernel results show that these models become kernel models that do not learn their representation at all.

Intuitively, making activation functions linear seems pointless since the composition of linear functions is linear, and hence such linear networks are no more expressive than a one-layer linear net. Thus such linear networks seem too much of a “toy model” (maybe a “happy meal model”?). Yet, it turns out that despite their limited expressivity, deep linear networks capture an essential feature of deep networks: the bias induced by the gradient descent algorithm. For example, running gradient descent on a depth-two linear network translates to regularizing by a proxy of *rank*. In general, gradient descent on a deep linear network induces a very different geometry on the manifold of linear functions.

Hence, although the first model seems much more “real” than the second one, there are some deep learning questions where the second model is more relevant.
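The expressivity point is easy to check directly. The sketch below is my own illustration (arbitrary dimensions): a depth-two linear network computes exactly the same function as a single matrix, even though gradient descent over the two factors follows a different trajectory, with a different implicit bias, than gradient descent over their product.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 8))  # first layer of a depth-two linear network
W2 = rng.normal(size=(3, 4))  # second layer
x = rng.normal(size=8)

# Expressivity: the two-layer network computes the same function
# as the single matrix W2 @ W1 ...
assert np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x)
# ... yet gradient descent over (W1, W2) induces a different implicit
# bias (e.g., a proxy of low rank) than gradient descent over the product.
```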

One might think that the whole point of theory is to provide rigorous results: if we are not going to prove theorems, we might as well use benchmarks. Yet, there are important theoretical insights we can get from experiments. A favorite example of mine is the result of Zhang, Bengio, Hardt, Recht, and Vinyals that deep neural networks can fit random labels. This simple experiment ruled out in one fell swoop a whole direction for proving generalization bounds on deep nets. A theory work does not have to involve theorems: well-chosen experiments can provide important theoretical insights.

The above was a stream-of-consciousness and very rough personal overview of questions and results in ML theory. In the coming seminar we will see results of all the types above, as well as many open questions.

**Acknowledgements:** Thanks to Yamini Bansal and Preetum Nakkiran for helpful comments (though they are not to blame for any mistakes!)

For many of the famous open problems of theoretical computer science, most researchers agree on what the answer is, but the challenge is to *prove* it. Most complexity theorists (with a few notable exceptions) believe that P≠NP, but we don’t know how to prove it. Similarly, most people working on matrix multiplication believe that there is an Õ(n²) algorithm for this problem, but we’re still stuck at 2.3728596. We believed that primality checking has a deterministic polynomial-time algorithm long before it was proven, and we still believe the same holds for polynomial identity testing.

The story of *cryptographic obfuscation* is different. This story deserves a full-length blog post (though see my now outdated survey), but the short version is as follows. In 2001 we (in a paper with Goldreich, Impagliazzo, Rudich, Sahai, Vadhan, and Yang) showed that what is arguably the most natural definition of obfuscation is impossible to achieve. That paper explored a number of obfuscation-related questions, and in particular left as an open question the existence of so-called *indistinguishability obfuscators* or *IO*. Since then there were arguably more negative than positive results in obfuscation research until, in 2012, extending some of the ideas behind fully-homomorphic encryption, Garg, Gentry, and Halevi gave a heuristic construction of multilinear maps, which one can think of as “Diffie-Hellman on steroids” (or maybe LSD…). Then in 2013 Garg, Gentry, Halevi, Raykova, Sahai and Waters (GGHRSW) built on top of these maps to give a heuristic construction of IO.

The GGHRSW paper opened the floodgates to many papers using IO to achieve many longstanding cryptographic goals as well as show that IO provides a unified approach to solve many classic cryptographic problems. The fact that so many goals were achieved through heuristic constructions was not very comforting to cryptographers. Even less comforting was the fact that several cryptographic attacks were discovered on these heuristic constructions. The years that followed saw a sequence of constructions and breaks, giving cryptographers an “emotional whiplash”. Everyone agreed that IO would be amazing if it exists, but whether or not it actually exists depended on who you asked, and what paper in the eprint archive they read that morning…

The “holy grail” in this line of work is to base obfuscation on a standard assumption, and ideally Regev’s Learning With Errors (LWE) assumption. Of course, we don’t know that LWE is true (in particular LWE implies P≠NP) but if it’s false it would bring down so much of the field that cryptographers might as well pack their bags and do machine learning (or try to sabotage progress in quantum computing, since the only other standard assumptions for public-key crypto are broken by fully scalable quantum computing).

We have not yet achieved this holy grail (this is only the 4th season) but as described in this quanta article, there has been remarkable progress in the last few months. In particular, Jain, Lin and Sahai (JLS) (building on a long sequence of works by many people including Ananth, Matt, Tessaro and Vaikuntanathan) obtained IO based on LWE and several standard assumptions in cryptography. This is arguably the first “heuristic free” construction, and is a fantastic breakthrough. However, there is still work to do – the JLS construction uses not just LWE but also a variant of it that is not as well studied. It is also based on pairing-based cryptography. This is an area that has thousands of papers, but for which known instantiations can be broken by quantum computers. However, there is yet more hope – in another sequence of works by Agrawal, Brakerski, Döttling, Garg, and Malavolta, Wee and Wichs, and Gay and Pass, a construction of IO was achieved that is “almost” heuristic free. It still uses one heuristic assumption (circular security) but has the advantage that apart from this assumption it only relies on LWE.

One can hope that in the next season, these two lines of work will converge to give a construction of IO based on LWE, achieving a “meta theorem” deriving from LWE a huge array of cryptographic primitives.

Want to learn more about these amazing advances? Want to know what’s next in store for IO?

Fortunately there is a virtual Simons symposium on indistinguishability obfuscation **coming to your computer screen on December 10-11**. Authors of all the papers mentioned will join together in coordinated presentations to give a unified view of the field and the challenges ahead. We will also have a historical opening talk by Yael Kalai, as well as a talk by Benny Applebaum on the computational assumptions used, followed by a panel discussion with Yael, Benny and Chris Peikert. Finally, like every proper crypto event, there will be a rump session, though you will have to supply your own beer.

See the schedule of the workshop; you can register on this page.

Hope to see you there! Bring your favorite programs to obfuscate with you*

* Disclaimer/fine print: Due to large constants and exponents, we do not recommend the compiler be used on programs that are more than one nanobit long.

Image credit: MIT
