The process and market for both graduate studies and faculty positions (at least in the US) are fairly standard, with a more or less common timeline and general ideas of where to look for positions (universities’ websites are always a good start, as are the websites of the ACM and CRA). Even so, it’s not always clear which areas a university is searching for in any given year, and these resources are also very US-centric, while many great places are located outside the US.

The **postdoc market** is much more “ad hoc”. Some places, such as the Simons Institute and the IAS, search for postdocs yearly and have several positions. (Our own Kempner Institute will also hold regular searches after it launches this year.) But in many other cases, postdoc positions are with an individual researcher who might have an opening only every few years, which makes it harder for candidates to find out about them. For such positions, the **Theoretical Computer Science jobs** page is a great way both to advertise any position you have to offer and to find out about opportunities. Please post any postdoc or faculty positions relevant to TCS at your institution, and point your students to it as a place to look for jobs.

Finding information about **research-oriented Masters programs** is also sometimes challenging. In the US it’s common for students to apply to a Ph.D. straight from undergraduate studies, and Masters programs are often intended more for professional development. But, as I wrote in the past, *research-oriented* Masters programs can actually be a great fit for many students. A Ph.D. is a huge commitment on both the student’s and the advisor’s side. If you have not had a chance to do research during your undergraduate studies, it may be better to start with a Masters before making such a commitment. Some research Masters programs do not charge any tuition, and several offer a stipend. To post and look for such opportunities, see the **crowdsourced TCS research masters website**, managed by Aviad Rubinstein and Matt Weinberg.

If there are other great resources or opportunities, please post them in the comments!

In particular, the resources above are geared for theoretical CS. If you have suggestions of analogous resources for other fields, please post them as well.


This is a photo of my book shelf at the office. Ever since joining Harvard, I have been ordering copies of Quantum Computing Since Democritus on a regular basis. I often hand them out to bright students, curious about science, whom I want to expose to the beautiful connections between computer science, math, physics, and even philosophy. Scott has been one of the great popularizers of our field even before he started blogging in 2005. His surveys and blog posts provide some of the best introductions to our field. For example, when investigating P vs NP and physical reality, Scott actually went out and verified that nature indeed cannot solve an NP-complete problem via finding the globally minimal energy configuration of soap bubbles. Through his blog, popular writing, and research, Scott has done more than anyone else to introduce new people of all backgrounds to theoretical computer science.

One of Scott’s endearing qualities is his openness to all people. While many of us would ignore a random email or anonymous blog comment, Scott would patiently explain for the millionth time why quantum computers can’t solve NP-hard problems by “trying all solutions in parallel” or why Bell’s Inequality does indeed rule out hidden-variable theories of nature. Alas, the same openness also results in him sometimes giving too much attention and caring far too much about the opinions of Internet “trolls” that are not worthy of his time.

While Scott has always attracted some vitriol, recently this has risen to a new level, with commenters attacking his integrity, his speech mannerisms, even his T-shirt choice/frequency, and worst of all, his family, with misogynistic attacks on his wife and xenophobic and ableist attacks on neurodivergent researchers.

None of these people have made a fraction of Scott’s contributions, not just to science but also to broadening the diversity of computer science, and to other causes, including assisting women dealing with Texas’ restrictive abortion laws. (As full disclosure, one of the causes Scott helped raise money for is AddisCoder and JamCoders, of which I am a board member. I just came back from a week of teaching in Jamaica; the students were amazing and so thankful for the chance to participate in this program; they couldn’t care less how often Scott changes his shirt.)

I am grateful that Scott is a member of our scientific community and proud to call him my friend. Does this mean that I agree with all his positions? Absolutely not. I tend to be to his left on many issues (though I am probably more conservative when it comes to oracle-based complexity..). Are there people he’s friendly with whom I disagree with even more strongly, and whose views I might even find repugnant? Probably. But it doesn’t matter: all of us are connected via six degrees of separation. If we start to “recursively cancel” everyone who is somehow connected to someone we find odious, then we would not be able to talk to anyone.

I hope that Scott is not disheartened by these attacks, and continues to contribute for many years to CS research and education, outreach, and humanity at large.

Recently there have been many debates on “artificial general intelligence” (AGI) and whether or not we are close to achieving it by scaling up our current AI systems. In this post, I’d like to make this debate a bit more quantitative by trying to understand what “scaling” would entail. The calculations are very rough – think of a post-it that is stuck on the back of an envelope. But I hope that this can be at least a starting point for making these questions more concrete.

The first problem is that there is no agreement on what “artificial general intelligence” means. People use this term to mean anything between the following possibilities:

- Existence of a system that can meet benchmarks such as getting a perfect score on the SAT and IQ tests and passing a “Turing test.” This is more or less the definition used by Metaculus (though they recently updated it to a stricter version).
- Existence of a system that can replace many humans in terms of economic productivity. For concreteness, say that it can function as an above-average worker in many industries. (To sidestep the issue of robotics, we can restrict our attention to remote-only jobs.)
- Large-scale deployment of AI, replacing or radically changing the nature of work of a large fraction of people.
- More extreme scenarios such as consciousness, malice, and super-intelligence. For example, a system that is conscious/sentient enough to be awarded human rights and its own attorney, or malicious enough to order DNA off the Internet and build a nanofactory to construct diamondoid bacteria riding on miniature rockets, so that they enter the bloodstream of all humans and kill everyone instantly, while not being detected.

I consider the first scenario– passing IQ tests or even a Turing test– more of a “parlor trick” than actual intelligence. The history of artificial intelligence is one of *underestimating* future achievements on specific benchmarks, but also one of *overestimating* the broader implications of those benchmarks. Early AI researchers were not only wrong about how long it would take for a computer program to become the world chess champion, but they also wrongly assumed that such a program would have to be generally intelligent as well. In a 1970 interview, Minsky was quoted as saying that by the end of the 1970s, *“we will have a machine with the general intelligence of an average human being … able to read Shakespeare, grease a car, play office politics, tell a joke, have a fight. At that point, the machine will begin to educate itself with fantastic speed. In a few months, it will be at genius level, and a few months after, its powers will be incalculable… In the interests of efficiency, cost-cutting, and speed of reaction, the Department of Defense may well be forced more and more to surrender human direction of military policies to machines.”*

Brooks explains that early AI researchers thought intelligence was “best characterized as the things that highly educated male scientists found challenging.” Since playing championship-level chess was hard for them, they couldn’t imagine a machine doing it without doing all other tasks that they considered more trivial. Getting a high SAT or IQ exam score is no more meaningful (for machines or humans) than doing well in chess.

The fourth scenario is, at the moment, too speculative for quantitative discussion and hence less appropriate for this post (though see the addendum below). We will focus on the second and third scenarios, which are necessary stepping stones for the more extreme fourth option. For the sake of concreteness, I will make the optimistic assumption that “scale is all you need” to achieve either of these scenarios. I will then try to see our best estimates on **how much scale** and **at what cost**.

The point of this post is not to argue that we will never achieve scenarios 2 or 3. Instead, it is to try to get quantitative estimates on challenges we would need to overcome to do so. I believe it is possible, but it would be more than just getting better hardware.

This is a long post. If I had to **TL;DR** it, I would say that we have significant uncertainty about how much scale we need for AGI. Scaling to 10-100 Trillion parameters may well get us to Scenario 2 or something near it. Still, training and (potentially) inference costs may be prohibitive to achieving the reliability and generality needed for actual deployment. Some challenges we face include:

**(1)** Maintaining long-term context without model size exploding.

**(2)** Making training more efficient, particularly finding ways to train N-sized models at a near-linear cost instead of quadratic in N.

**(3)** Avoiding running out of data, perhaps by using video/images and programmatically generated interactive simulations.

**(4)** Handling multi-step interactions without the interaction going “off the rails” and without needing to scale policy/value gradients to an unfeasible number of interactions (perhaps by using smaller RL-trained “verifier models” to choose between options produced by a large statically-trained model).

This post is focused on quantitative issues rather than questions such as “consciousness” or the risks of AI. Those deserve a post of their own. See the addendum at the end for some more philosophical/speculative discussion.

Since artificial intelligence exists in the virtual space, people often assume that we can clone it an arbitrary number of times. But modern AI systems have a highly non-trivial physical footprint. Current models require dozens of GPUs to store their parameters, and future ones would be even larger. Creating many copies of such systems is going to be challenging.

Another common assumption is that by Moore’s law, if we manage to build a system at the level of (say) a sixth-grader, then in a year, we would have a virtual Einstein. However, performance on a metric often scales with the **logarithm** of the number of parameters (e.g., see the BIG-bench and Parti figures). So, perhaps a better assumption is that if we manage to build a virtual sixth-grader, then the following year, we would have a virtual seventh-grader.

Let’s assume that we could reach the second scenario (proof of concept general AI system) by simply scaling up our current auto-regressive language models by a factor of X. What would be X?

**Adaptivity.** One crucial difference between the tasks we currently test language models on and general intelligence is *adaptivity*. A model that answers a question correctly with 95% probability is excellent. But with a 5% chance of error per step, such a model may go “off the rails” in a back-and-forth conversation of more than 20 steps. Adaptivity is one reason why robotics performance (even in simulated virtual worlds) still lags far behind humans. It is not so much that the physical environment is higher dimensional than the inputs to language models, but that robots’ actions impact that environment (and, unlike in the case of Chess, Go, or Atari, we don’t have an unlimited number of restarts and simulations). Navigating the social and technical environment of (even a virtual) workplace is no less challenging. Squeezing out the final increment of performance (e.g., from 95% to 99%) is usually when power laws kick in: reducing error by a factor of k requires a multiplicative overhead of k^a for some power a>1.
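To make the compounding-error point concrete, here is a quick sanity check, using the 95% per-step accuracy and 20-step conversation figures from the paragraph above:

```python
def survival_probability(p, n):
    """Probability that a model with per-step accuracy p completes an
    n-step interaction without ever going "off the rails" (assuming
    errors are independent across steps)."""
    return p ** n

# A 95%-accurate model survives a 20-step conversation only ~36% of the time.
print(survival_probability(0.95, 20))  # ~0.358
# Pushing per-step accuracy to 99% raises this to ~82%.
print(survival_probability(0.99, 20))  # ~0.818
```

The independence assumption is of course a simplification, but it illustrates why per-step reliability matters so much more for interactive tasks than for one-shot question answering.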

**Context length.** Another way to think of scaling is the length of the context window models keep. To be useful in replacing a human worker, we don’t want to continuously simulate their first day at work. We want an employee that remembers what they did yesterday, last week, and last year. GPT-3 maintains a window of 2048 tokens, corresponding to roughly 1500 words or three pages of text. However, if you had to write a letter to your future self detailing everything they should remember from your interactions so far, it would likely be much longer. (Claude Fredericks, who may have been the most prolific diarist in history, wrote a diary of approximately 65,000 pages.)

Unfortunately, in standard transformer models, computation and memory scale *quadratically* with the context (though due to weight-sharing between different tokens, the number of learned parameters doesn’t have to increase), which means that increasing the context by (say) a factor of 100 will increase computational cost by a factor of 10,000. However, several alternative transformer architectures aim to achieve linear or near-linear scaling of cost with the context.
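The quadratic blowup can be seen with a one-line model of attention cost (constants ignored; every token attends to every other token):

```python
def attention_cost(context_len):
    """Cost of standard self-attention, up to a constant factor:
    each of the context_len tokens attends to all context_len tokens."""
    return context_len ** 2

base = attention_cost(2048)        # GPT-3's context window
big = attention_cost(2048 * 100)   # a 100x longer context
print(big // base)  # 10000: a 100x longer context costs 10,000x more
```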

**Empirical scaling of performance.** The recent BIG-bench paper is perhaps the most comprehensive study of how large language models’ performance improves with scale. They assembled an extensive collection of tasks, each with a score normalized to (mostly) stay in the interval 0-100, with a score of zero corresponding to trivial (e.g., random) performance and a score of 100 corresponding to near-perfect performance (e.g., expert human). In many of these tasks, current models score lower than 20. We still don’t have enough data to know how truly large models behave. On the one hand, naive extrapolation suggests that we need many orders of magnitude for high performance (e.g., a score of 80 points or above). On the other hand, larger models such as Google PaLM show evidence of “breakthrough capabilities”: performance growing super-linearly in the logarithm of size. It seems that to solve this benchmark fully, we would need at least a factor of 10 increase over the ½ Trillion parameter PaLM model.

**Comparing with the brain.** Another point of comparison could be the human brain. However, human brains and artificial neural networks have very different architectures, and we don’t know how many parameters correspond to a single neuron or synapse. Scott Alexander quotes this estimate of 100 trillion parameters on the brain’s size, which would correspond to a factor of 100-1,000 larger than current models. However, the estimate is rather hand-wavy, and even if it wasn’t, there is no reason to expect that artificial neural networks would have the same “ability per parameter” ratio as human brains. In particular, artificial neural networks appear to compensate for relatively weaker reasoning skills by ingesting a massive amount of data, and with data, the model size grows as well.

**Bottom line.** Overall, it seems that X will need to be at least 10-100, though this is an extremely rough estimate. Also, while an X Trillion model might be the “core” of a system replacing a human worker, it will not be the whole of it, and we are likely to need new ideas beyond scale for these other components. In particular, a model designed for back-and-forth interaction is unlikely to simply use an auto-regressive language model as a black box, not even with chain-of-thought reasoning.

Suppose that scaling our current models by a factor of X can achieve “AGI” in the sense of yielding a system that can be as productive as humans in a wide variety of professions (say all the top remote jobs in the list from FlexJobs). How much do we expect it to cost to (1) build a single system of this magnitude and (2) widely deploy it?

The gap between a “proof of concept” and actual deployment can be pretty significant. For example, in 2007, CMU won the urban DARPA grand challenge, while in 2012, a Google autonomous car passed a driving test. Yet a decade later, we still don’t have a significant deployment of self-driving cars. Also, as described in the book (and film) Hidden Figures, despite decades of exponential progress in automatic computing, NASA still employed human computers into the 1960s.

For this post, I will make the optimistic (and unrealistic) assumption that the difference between proof-of-concept and deployments corresponds to the difference between the cost of **training** a system vs. the cost of doing **inference** on it. If a system costs more than $100B to train, then it may never get built (for comparison, the Large Hadron Collider cost less than $5B to build). Similarly, a system costing $1000/hour to use is unlikely to replace human workers at scale.

The costs below are calculated with today’s dollars and today’s hardware. Of course, improvements in hardware will translate to cheaper training and inference. However, we have neglected costs that can scale super-linearly with model size, including communication between nodes, managing massive clusters and more. I consider only *amortized* costs since those are what matter when training and serving models at scale.

There is no point in training a large model if you don’t train it for enough time. The performance advantages of larger models are realized by allowing them to train on more data without “saturating”. In the words of the Chinchilla paper, “for every doubling of model size the number of training tokens should also be doubled.” (See also the deep bootstrap paper.) Hence the number of inferences applied during training also scales with the model size. This means that if a model grows by a factor of X, then both the cost of a single inference and the total number of inferences grow by about X, meaning that the cost of training grows by about X^{2}. In particular, we can expect training a model that is 100 times as large to cost 10,000 times more!

There are differing estimates of how much the ~100B parameter GPT-3 model cost to train, but they range from $5M to $20M; let’s say $5M in pure training costs for simplicity. (This would correspond to a cost of $5M/500B = 10^{-5} dollars per inference, which roughly matches the estimates below.) An X Trillion model (which, like Chinchilla, but unlike PaLM, would be fully trained to max out its advantages) might cost a factor of 100X^{2} more. For X=10, this would be a cost of $50B. For X=100, this would be 5 Trillion dollars!
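Spelling out the arithmetic (the $5M baseline and the 100X² scaling factor are the assumptions from the paragraph above):

```python
GPT3_TRAINING_COST = 5e6  # assumed $5M baseline for the ~0.1T-parameter GPT-3

def training_cost(X):
    """Estimated cost of fully training an X-Trillion-parameter model,
    assuming (per Chinchilla) training tokens scale with parameters,
    so total cost scales as 100 * X^2 relative to GPT-3."""
    return GPT3_TRAINING_COST * 100 * X**2

print(f"${training_cost(10):,.0f}")   # $50,000,000,000  -> $50B
print(f"${training_cost(100):,.0f}")  # $5,000,000,000,000 -> $5T
```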

Clearly, finding a way to train N-sized models on N tokens using less than O(N^{2}) steps (e.g., O(N log N) ) can be crucial for scaling larger models. Training larger language models also runs into the problem that we have nearly “maxed out” the available textual data. Modern models are already trained on hundreds of billions of words, but there is a limit to how much novel text can be produced by a planet of 8 billion people. (Though multimodal models that are also trained on video would have access to much more data.)

Suppose that we have managed to train a large model of X Trillion parameters. How much do we expect inference to cost in dollars? In transformer architectures, the number of floating-point operations required to make a single model evaluation (i.e., inference) is roughly the same as the number of parameters.

Hence an X Trillion parameter model requires about X TeraFLOPs (TFLOPs) per inference. Nvidia claims a peak performance of about 300 TFLOP/s for its A100 GPU. (The effects of using 16-bit precision and not achieving 100% utilization roughly cancel out.) Renting such a machine costs about $1/hour, so we can get about 300*3600 ~ 1M TFLOPs per dollar. (This is up to an order of magnitude and cares just about total FLOPs rather than wall-clock time; for careful calculations of inference *time*, see Carol Chen’s and Jacob Steinhardt’s blogs.)

So far, this sounds great – we could make 10^{6}/X inferences per dollar for an X trillion parameter model. However, in the real world, costs are much higher. The same Nvidia blog shows that the A100 can handle 6,000 inferences/sec of the 340M parameter BERT-Large. Since an X Trillion model is 3000X larger, that would correspond to 2/X inferences per second, or 7200/X inferences per hour, i.e., per dollar. The calculations above predict that the 0.2 Trillion parameter GPT-3 would be able to perform 7200×5 ~ 35K inferences per dollar. However, OpenAI charges 6 cents per 1K tokens (including input tokens!), which, depending on the length of the input, can come to as few as 10 inferences per dollar. Bhavsar estimates GPT-3 can handle about 18K inferences per GPU hour. Overall, it seems that 10^{4}/X inferences per dollar is an optimistic estimate.

However, the question is how many inferences we need to make per hour to simulate a human. The average person apparently speaks about 150 words (say 200 tokens) per minute. This would suggest we need about 200*60 ~ 10K inferences per hour to simulate a person. For an X Trillion sized model, that would cost $X per hour, which is not too bad if X is between 10 to 100.
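Putting the numbers in this paragraph together (150 words/minute, taken as ~200 tokens/minute, and the 10^4/X inferences-per-dollar estimate from above):

```python
TOKENS_PER_MINUTE = 200      # ~150 spoken words per minute
INFERENCES_PER_DOLLAR = 1e4  # optimistic estimate for a 1-Trillion model; divide by X

def hourly_cost(X):
    """Dollar cost per hour to 'speak' at human speed using an
    X-Trillion-parameter model, one inference per token."""
    inferences_per_hour = TOKENS_PER_MINUTE * 60  # ~12K, i.e. ~10^4
    return inferences_per_hour / (INFERENCES_PER_DOLLAR / X)

print(hourly_cost(10))   # ~$12/hour
print(hourly_cost(100))  # ~$120/hour
```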

The above price point sounds pretty good but will likely be an underestimate. First, to actually simulate a human, we need not just to simulate what they say but also what they *think*. So, to perform “chain of thought” reasoning, we would need to run an inference per word that is thought rather than a word that is uttered. We don’t know the speed of thought, but it will increase the number of inferences needed. Generally, to simulate a chain of reasoning of depth K, the number of inferences scales with K, even if the end result is just a single token. Second, to reach high reliability, it is likely that we will need to make Y inferences and use some mechanism to choose the best one out of these Y options. For example, AlphaCode generates millions of possible solutions to programming challenges and filters them into 10 candidate ones. It is hard to estimate what Y would be in a workplace environment, but it seems that Y would be somewhere between 10 and 100.
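A minimal sketch of the “generate Y candidates, then select” pattern described above; `generate` and `score` here are hypothetical stand-ins for a large generator model and a smaller verifier model:

```python
import random

def generate(prompt):
    """Hypothetical stand-in for sampling one candidate from a large model."""
    return f"{prompt} -> candidate {random.randint(0, 999)}"

def score(candidate):
    """Hypothetical stand-in for a smaller 'verifier' model's quality score."""
    return random.random()

def best_of_y(prompt, Y):
    """Sample Y candidates and keep the one the verifier scores highest.
    Note that total inference cost scales linearly with Y."""
    candidates = [generate(prompt) for _ in range(Y)]
    return max(candidates, key=score)

print(best_of_y("write a sorting function", Y=10))
```

The point of the sketch is the cost structure: whatever the real models look like, reliability bought this way multiplies the per-token inference count by Y.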

**Bottom line for inference cost.** The estimates above suggest that an X Trillion parameter model would require about 10^{5} to 10^{6} inferences per hour to simulate a person, with a cost that ranges from $10X to $100X per hour. This is already tight for X=10 and would be too much for X=100. However, it is not clear how many words/thoughts we need to simulate per given profession, so these estimates are very rough.

While the estimates above should be taken with huge grains of salt, I believe that generally useful artificial intelligence can likely be achieved, but it will require more than sheer scale. While in principle a perfect next-word predictor is also a perfect reinforcement learner, we may not be able to get close enough to perfection by scale alone. More than any particular conclusions, I hope that the debate can move from general philosophical arguments to quantitative questions that have numerical answers.

I tried to keep this post within the realms of calculations and away from philosophy. But given recent discussions and hype on whether AI systems can achieve “consciousness” or “sentience” and whether they pose a unique existential risk, I feel that I must address this at least briefly. Readers allergic to philosophy and unjustified speculations can stop here.

Consciousness is a tricky concept: the Stanford Encyclopedia of Philosophy entry lists nine different specific theories of consciousness, and there are more theories still. It is also intertwined with ethics: if we consider a creature to be conscious or sentient, then the boundaries of how we can treat it become an issue of ethics. I don’t think it’s the job of computer scientists (or any other scientists) to come up with a moral philosophy, and similarly, I don’t think defining consciousness falls in our domain.

Historically, there seem to be two kinds of non-human entities which we considered conscious or sentient. One is animals, to which people have felt superior. The other is gods, to which people have felt inferior. Before we understood the causes of planetary movements, weather events, and other natural phenomena, we ascribed them to conscious actions by various gods. Since we couldn’t predict or explain these phenomena, our only attempt at controlling them was through prayer and sacrifice to the presumably conscious entity that controls them.

Some discussions of potential future AI are reminiscent of those past gods. According to some, AI would be not just conscious but capricious and could (according to some, would) ensure that “everybody on the face of the Earth suddenly falls over dead within the same second.” It is of course, possible to construct hypothetical scenarios in which an AI system managed to start a nuclear war or design a lethal virus. We’ve all read such books and seen such movies. It is also possible to construct scenarios where *humans* start a nuclear war or design a lethal virus. There are also many books and movies of the latter type. In fact, many AI “doomsday scenarios” don’t seem to require super-human levels of intelligence.

The truth is that the reason that our world hasn’t been destroyed so far is not that humans were not intelligent enough nor because we haven’t been malicious enough. First, throughout most of human history, we did not have technologies such as nuclear weapons and others with the potential to cause planet-scale destruction. Second, while imperfect, we have developed some institutions, including international bodies, the non-proliferation treaty, standards for biolabs, pandemic preparations, and more to keep some of these capabilities in check. Third, we were lucky. From climate change through pandemic preparation to nuclear disarmament, humanity should be doing much more to confront the risks and harms of our own making. But this is true independently of artificial intelligence. Just as with humans, my inclination with AI would be not to try to make systems inherently moral or good (“aligned” in AI-speak) but rather to use the “trust, but verify” approach. One moral of computational complexity theory is that computationally weak agents can verify the computation of more powerful processes.

Many of the calculations above show how “scaling up” is going to be non-trivial, and we are unlikely to see AI making restaurant reservations one day and secretly ordering material over the net to build a world-destroying nano-technology lab the next. Even if it’s possible for a large-scale model to “train itself” to improve performance without needing additional outside data, that model would still incur the considerable computational costs of training that we computed above.

In my previous post, I explained why I am not a “longtermist”. The above is why I don’t view an “AGI run amok” as a short-term existential risk. That doesn’t mean AI doesn’t have safety issues. AI is a new technology, and with any new technology come new risks. We don’t need science fiction to see real risks in both unintentional consequences of AI deployment such as discrimination and bias, as well as intentional consequences of deploying AI for weapons, surveillance, and social manipulation. I don’t think that debating the notion of consciousness and inventing doomsday scenarios is helpful for combatting any one of those.

**Acknowledgments:** Thanks to Jascha Sohl-Dickstein for many useful comments on a draft of this blog post, and on how to interpret the results of the BIG-bench paper.

One of my main goals in revising the theoretical CS course is to give students both rigorous foundations as well as a taste of modern topics. Some of these modern topics:

- **Cryptography**: a topic that combines mathematical beauty, practical importance, and a demonstration that sometimes computational hardness can be a resource rather than a hindrance.
- **Quantum computing:** a topic that shows the interaction between TCS and physics, the fundamental nature of the “Church-Turing hypothesis”, and how we can (as in crypto) take a “lemon” (the inability of classical computers to simulate certain quantum processes) and use it to make “lemonade” (a computer with stronger power than classical computers).
- **Randomized computation and derandomization:** Randomization is now essential to so many areas of CS, and so it is important to both demonstrate its power and show how we might use complexity to remove it.
- **Machine learning and average-case complexity:** Traditionally in an intro TCS course the focus is purely on worst-case complexity. This leads to a disconnect with modern applications of CS, and in particular machine learning.

So, I ended up writing my own text – **Introduction to Theoretical Computer Science**. While at some point I hope to make it into a printed book, it will always be available freely online on https://introtcs.org/. The markdown source for it is available on the repository https://github.com/boazbk/tcs . I’ve benefitted greatly from feedback from both students and readers around the globe: at the time of writing, the project has 330 issues and 385 pull requests.

A central difference between the approach I take and that of previous courses is that I start with **Boolean circuits** as the first model of computation. Boolean circuits are crucial for teaching the topics above:

- Cryptography is much more natural with circuits rather than Turing machines as the model of computation. Statements such as “128 bits of security” make no sense in the asymptotic Turing machine formalism, but can be made precise with circuits.
- The standard model for quantum computing is quantum circuits.
- Derandomization is best described using the circuit model, and of course many results, such as BPP being contained in the polynomial hierarchy, are best shown using circuits and the class P/poly as an intermediate concept.
- Circuits are a very natural fit for machine learning, and in particular Neural Networks are just a special type of circuit.

Finally, while circuits are often considered an “advanced” topic, they have some advantages over automata as the initial model of computation:

- **Finite is always easier than infinite:** Starting with circuits enables us to begin the course with arguably the simplest object: finite functions. Writing down the truth table of a finite function, and showing that there is more than one circuit to compute the same function, also helps clarify the difference between the **specification** of a function and its **implementation** by some algorithm, a distinction that many students grapple with.
- **Circuits are connected to actual hardware.** An intro to TCS course is not a pure math course – we want to convey to students that our models are motivated by actual computing. Circuits make this connection much closer, and less artificial, than automata or even Turing machines.
- **Can show cool theorems early.** If we start with automata, the first theorems we show can often seem poorly motivated to students. It takes some time to build the machinery to show the main theorem – the equivalence of automata and regular expressions – and the proof of that theorem is rather technical. In contrast, with circuits we can show three important theorems rather quickly: **(1)** *every* finite function on n bits can be computed by some circuit of size at most exponential in n, **(2)** *every* circuit of size s can be represented as a labeled graph and hence (using adjacency lists) by a string of roughly s log s bits, and **(3)** using (2) and the fact that there are 2^{2^n} functions mapping {0,1}^n to {0,1}, there *exist* functions that *require* circuits of exponentially many gates.
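Theorem (3) is a counting argument, and for small n one can check the gap numerically. The sketch below uses a deliberately crude upper bound on the number of size-s circuits (each gate picks one of 3 types and two inputs from among the n variables and s gates); the exact bound doesn't matter, only that it is dwarfed by the doubly-exponential number of functions:

```python
def num_functions(n):
    """Number of Boolean functions from {0,1}^n to {0,1}."""
    return 2 ** (2 ** n)

def circuit_descriptions(s, n):
    """Crude upper bound on the number of circuits with s gates on n
    inputs: each gate picks a type (AND/OR/NOT) and two wires, each
    from among the n inputs and s gates."""
    return (3 * (s + n) ** 2) ** s

# There are far more 8-bit functions than circuits with 20 gates,
# so some 8-bit function requires more than 20 gates.
n, s = 8, 20
print(num_functions(n) > circuit_descriptions(s, n))  # True
```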

While the course is a theory course, and not about programming, one of my goals in the book and course was to connect it to programming. This is not just to motivate students and make them feel that the material is “practical,” but also to better understand the theory itself. Notions such as NP-completeness reductions can often be confusing to students (which is why they get the direction wrong half the time). Implementing a reduction from 3SAT to Independent Set and seeing the end result makes it much more concrete.
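For illustration (this is my sketch, not code from the book or course), here is the classic reduction from 3SAT to Independent Set: one vertex per literal occurrence, with edges forming a triangle within each clause and connecting contradictory literals.

```python
def sat_to_is(clauses):
    """Reduce a 3CNF formula to Independent Set.

    `clauses` is a list of clauses; each clause is a list of nonzero
    integers, where -v denotes the negation of variable v (literals within
    a clause are assumed distinct). Returns (vertices, edges, k) such that
    the formula is satisfiable iff the graph has an independent set of
    size k = number of clauses.
    """
    vertices = [(i, lit) for i, clause in enumerate(clauses) for lit in clause]
    edges = set()
    for u in vertices:
        for v in vertices:
            if u >= v:
                continue  # consider each unordered pair once
            same_clause = (u[0] == v[0])     # triangle within a clause
            contradictory = (u[1] == -v[1])  # literal vs. its negation
            if same_clause or contradictory:
                edges.add((u, v))
    return vertices, edges, len(clauses)
```

A brute-force independent-set search on the output then agrees with a brute-force satisfiability search on the input, which is exactly the kind of end-to-end sanity check students can run.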

One way in which I wanted to use programming is to demonstrate to students how we can take a piece of Python code such as the code for adding two numbers given in their binary representation:

```
def add(A,B):
    """Add two binary numbers, given as lists of bits"""
    Y = []
    carry = zero(A[0])               # initialize carry to 0
    for i in range(len(A)):          # compute i-th digit of output
        y = xor(A[i],B[i],carry)     # xor function
        carry = maj(A[i],B[i],carry) # majority function
        Y.append(y)
    Y.append(carry)
    return Y
```

And obtain the corresponding circuit:

The code also uses the following one-line helper functions:

```
def maj(a,b,c): return (a & b) | (b & c) | (a & c)  # majority of three bits
def zero(a):    return a & ~a                       # always 0
def xor2(a,b):  return (a & ~b) | (~a & b)          # XOR of two bits
def xor(*L):    return xor2(*L) if len(L)==2 else xor2(xor(*L[:-1]), L[-1])  # XOR of many bits
```
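Even before extracting circuits, these can be sanity-checked on plain 0/1 integers, for which `&`, `|`, `~` behave as the bit operations (the two blocks above are repeated here so the snippet runs standalone):

```python
# Helper functions from above, repeated so this snippet is self-contained.
def maj(a, b, c): return (a & b) | (b & c) | (a & c)
def zero(a):      return a & ~a
def xor2(a, b):   return (a & ~b) | (~a & b)
def xor(*L):      return xor2(*L) if len(L) == 2 else xor2(xor(*L[:-1]), L[-1])

def add(A, B):
    """Add two binary numbers, given as little-endian lists of bits."""
    Y = []
    carry = zero(A[0])
    for i in range(len(A)):
        y = xor(A[i], B[i], carry)
        carry = maj(A[i], B[i], carry)
        Y.append(y)
    Y.append(carry)
    return Y

# 1 + 1 = 2: [1,0] is little-endian for 1, and [0,1,0] is 2.
print(add([1, 0], [1, 0]))  # → [0, 1, 0]
```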

If you think about it, the task corresponds to extracting the **computational graph** of a piece of Python code. This is precisely the task that *auto-differentiation* packages such as Pytorch need to do, and hence it can be solved in a similar way. Thus, inspired by Karpathy’s micrograd package (see my back-propagation tutorial) and using the awesome SchemDraw package, I wrote a short **colab notebook** that does precisely that.

Specifically, the notebook defines a `Bit` class that (as its name suggests) stores a single bit. The class defines the logical AND, OR, and NOT operations (`&`, `|`, `~` in Python). If `a` and `b` are two bits, then `c = a & b` not only contains the value that is the AND of the values of `a` and `b`, but also pointers to `a` and `b`, and remembers how it was computed from them. This allows us to obtain from `c` a formula/circuit expressing it in terms of `a` and `b`.

```
class Bit:
    counter = 0
    def __init__(self, val=0, label="-"):
        self.label = label
        self.data = val
        self.children = []
    def op(self, f, label, *others):
        inputs = [self.data] + [o.data for o in others]
        out = Bit(f(*inputs), label)
        out.children = [self] + list(others)
        return out
    def __and__(self, other): return self.op(lambda a,b: a & b, "\\wedge", other)
    def __or__(self, other):  return self.op(lambda a,b: a | b, "\\vee", other)
    def __invert__(self):     return self.op(lambda a: ~a, "\\neg")
```

Now we can write a simple recursive function `formula` (see the notebook) that outputs the LaTeX of the formula corresponding to how a particular bit was computed. So if we write

```
from IPython.display import Markdown, display, Math
Y = xor(Bit(0,"X_0"), Bit(1,"X_1"))
Math(formula(Y))
```

Then we will get the rendered LaTeX of the XOR formula, $(X_0 \wedge \neg X_1) \vee (\neg X_0 \wedge X_1)$.
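The `formula` function itself lives in the notebook; a minimal stand-in (written here for illustration, not the notebook’s exact code) just recurses on the `children` pointers:

```python
class Bit:
    """Minimal stand-in for the Bit class above (value, label, children)."""
    def __init__(self, val=0, label="-"):
        self.label, self.data, self.children = label, val, []

def formula(bit):
    """Return a LaTeX string describing how `bit` was computed."""
    if not bit.children:              # an input: just its label
        return bit.label
    args = [formula(c) for c in bit.children]
    if len(args) == 1:                # unary gate (negation)
        return f"{bit.label}({args[0]})"
    return "(" + f" {bit.label} ".join(args) + ")"
```

For example, building `c = a AND b` by hand (setting `c.children = [a, b]` with label `"\\wedge"`) makes `formula(c)` return `(X_0 \wedge X_1)`.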

A similar recursive function transforms the computation into a circuit drawing:

```
A = [Bit(0,f"A_{i}") for i in range(2)]
B = [Bit(0,f"B_{i}") for i in range(2)]
draw_circ(*add(A,B))
```

See the colab notebook for more.

Something about this table resonates with me. In fact, as anyone using Pytorch knows, since Tibshirani posted this table, many of the terms on the right have found broader use in the machine learning community. (And I do hope that statisticians’ grants and conferences have improved as well…)

But thinking of deep learning purely in terms of statistics misses crucial aspects of its success. A better critique of deep learning is that **it uses statistical terms to describe radically different concepts**. In meme form, it is the “Princess Bride” meme on the right that is a better critique of deep learning than sandserif’s meme on the left.

**This blog post: organization.** In this post, I explain this point of view and why some of the most fundamental aspects of deep learning deviate radically from statistics and even from classical machine learning. In this somewhat long post, I’ll start by talking about the difference between **explanation** and **prediction** when fitting models to data. I’ll then discuss two “cartoons” of a learning process: **fitting a statistical model** using empirical risk minimization and **teaching a math skill to a (human) student**. I then discuss which one of those processes is a closer match to deep learning. Spoiler: while the math and code of deep learning are nearly identical to the first scenario (fitting a statistical model), I claim that at a deeper level, some of deep learning’s most important aspects are better captured by the “teaching a skill to a student” scenario. I do not claim to have a full theory for deep learning. In fact, I strongly suspect such a theory doesn’t exist. Rather, I believe different aspects of deep learning are best understood through different lenses, and the statistical lens cannot provide the complete picture.

*Caveat:* While I contrast deep learning with statistics in this post, I refer to “classical statistics” as it was studied in the past and explained in textbooks. Many statisticians are studying deep learning and going beyond classical methods, analogously to how physicists in the 20th century needed to expand the framework of classical physics. Indeed, the blurring of the lines between computer scientists and statisticians is a modern (and very welcome!) phenomenon that benefits us all.

Scientists have fitted models to observations for thousands of years. For example, as mentioned in my philosophy of science book review post, the Egyptian astronomer Ptolemy came up with an ingenious model for the movement of the planets. Ptolemy’s model was geocentric (with planets rotating around the earth) but had a sequence of “knobs” (concretely, epicycles) that gave it excellent predictive accuracy. In contrast, Copernicus’ initial *heliocentric* model posited a circular orbit of planets around the sun. It was a simpler model than Ptolemy’s (with fewer “adjustable knobs”) and got the big picture right, but was *less accurate* in predicting observations. (Copernicus later added his own epicycles so he could match Ptolemy’s performance.)

Ptolemy’s and Copernicus’ models were incomparable. If you needed a “black box” for **predictions**, then Ptolemy’s geocentric model was superior. If you wanted a simple model into which you can “peer inside” and that could be the starting point for a theory to **explain** the movements of the stars, then Copernicus’ model was better. Indeed, eventually, Kepler refined Copernicus’ models to elliptical orbits and came up with his three laws of planetary movements, which enabled Newton to explain them using the same laws of gravity that apply here on earth. For that, it was crucial that the heliocentric model wasn’t simply a “black box” that provides predictions, but rather was given by simple mathematical equations with few “moving parts.” Over the years, astronomy continued to be an inspiration for developing statistical techniques. Gauss and Legendre (independently) invented least-squares regression around 1800 to predict the orbits of asteroids and other celestial bodies. Cauchy’s 1847 invention of gradient descent was also motivated by astronomical predictions.

In physics, you can (at least sometimes) “have it all” – find the “right” theory that achieves the best predictive accuracy and the best explanation for the data. This is captured by sentiments such as Occam’s Razor, which can be thought of as positing that simplicity, predictive power, and explanatory insight are all aligned with one another. However, in many other fields, there is a tension between the twin goals of **explanation** (or, more generally, **insight**) and **prediction**. If you simply want to predict observations, then a “black box” could very well be best. On the other hand, if you want to extract insights such as a causal model, general principles, or significant features, then a simpler model that you can understand and interpret might be better. The right choice of model depends on its usage. Consider, for example, a dataset containing genetic expressions and a phenotype (say some disease) for many individuals. If your goal is to predict the chances of an individual getting sick, you want to use the best model for that task, regardless of how complex it is or how many genes it depends on. In contrast, if your goal is to identify a few genes for further investigation in a wet lab, a complicated black box would be of limited use, even if it’s highly accurate.

This point was forcefully made in Leo Breiman’s famous 2001 essay on the two cultures of statistical modeling. The “data modeling culture” focuses on simple generative models that **explain** the data. In contrast, the “algorithmic modeling culture” is agnostic about how the data is generated and focuses on finding models that **predict** the data, no matter how complex. Breiman argued that statistics was too dominated by the first culture, and that this focus has *“led to irrelevant theory and questionable scientific conclusions”* and *“prevented statisticians from working on exciting new problems.”*

Breiman’s paper was controversial, to say the least. Brad Efron responded to it by saying that, while he agreed with some points, *“at first glance, Leo Breiman’s stimulating paper looks like an argument against parsimony and scientific insight, and in favor of black boxes with lots of knobs to twiddle. At second glance, it still looks that way”* (see also Kass). In a more recent piece, Efron graciously concedes that *“Breiman turned out to be more prescient than me: pure prediction algorithms have seized the statistical limelight in the twenty-first century, developing much along the lines Leo suggested.”*

Machine learning, deep or not, stands firmly in Breiman’s second culture, with a focus on **prediction**. This culture has a long history. For example, the following snippets from Duda and Hart’s 1973 textbook and Highleyman’s 1962 paper would be very recognizable to deep learning practitioners today:

Similarly, Highleyman’s handwritten characters dataset and the architecture Chow (1962) used to fit it (with ~58% accuracy) would also strike a chord with modern readers, see the Hardt-Recht book and their blog post.

In 1992, Geman, Bienenstock, and Doursat wrote a pessimistic article about neural networks, arguing that *“current-generation feed-forward neural networks are largely inadequate for difficult problems in machine perception and machine learning”*. Specifically, they believed that general-purpose neural networks would not be successful in tackling difficult tasks, and the only way for them to succeed would be via hand-designed features. In their words: *“important properties must be built-in or “hard-wired” … not learned in any statistically meaningful way.”* In hindsight (which is always 20/20), Geman et al. were completely wrong (if anything, modern architectures such as transformers are even *more general* than the convolutional networks that existed at the time), but it is interesting to understand *why* they were wrong.

I believe that the reason is that deep learning is genuinely different from other learning methods. A priori, it seems that deep learning is just one more predictive model, like nearest neighbors or random forests. It may have more “knobs,” but that seems to be a quantitative rather than qualitative difference. However, in the words of P.W. Anderson, **“more is different.”** In physics, we often need a completely different theory once scale changes by several orders of magnitude, and the same holds in deep learning. The processes that underlie deep learning vs. classical models (parametric or not) are radically different, even if the equations (and Python code) look identical at a high level.

To clarify this point, let’s consider two very different learning processes: **fitting a statistical model** and **teaching math to a student**.

Classically, fitting a statistical model to data corresponds to the following:

- We observe some data $x$ and $y$. (Think of $x$ as an $n \times d$ matrix and $y$ as an $n$-dimensional vector; think of the data as coming from a **structure and noise** model: each coordinate is obtained as $y_i = f(x_i) + \epsilon_i$, where $\epsilon_i$ is the corresponding noise term, using additive noise for simplicity, with $f$ as the “ground truth.”)
- We fit a model $g$ to the data by running some **optimization algorithm** to minimize an **empirical risk** of $g$. That is, we use optimization to (try to) find $g$ minimizing a quantity of the form $\sum_i \ell(y_i, g(x_i)) + R(g)$, where $\ell$ is a loss term (capturing how close $g(x_i)$ is to $y_i$) and $R$ is an optional regularization term (attempting to bias toward simpler models under some measure).
- Our hope is that our model will have good **population loss**, in the sense that the **generalization error/loss** $\mathbb{E}\,[\ell(y, g(x))]$ is small (where this expectation is taken over the total population from which our data was drawn).

This very general paradigm captures many settings, including least-squares linear regression, nearest neighbors, neural network training, and more. In the classical statistical setup, we expect to observe the following:

**Bias/variance tradeoff:** Let $\mathcal{F}$ be the set of models that we optimize over. (If we are in the non-convex setting and/or have a regularizer term, we can let $\mathcal{F}$ be the set of models that can be reached by the algorithm with non-negligible probability, taking the effects of the algorithm choice and the regularizer into account.) The **bias** of $\mathcal{F}$ is the best approximation to the ground truth $f$ that can be achieved by an element of $\mathcal{F}$. The larger the class $\mathcal{F}$, the smaller the bias, and the bias can be zero if $f \in \mathcal{F}$. However, the larger the class $\mathcal{F}$, the more samples we need to narrow down its members, and hence the more **variance** in the model that the algorithm outputs. The overall **generalization error** is the sum of the bias term and the contribution from the variance. Hence statistical learning typically displays a **bias/variance tradeoff,** with a “goldilocks choice” of the right model complexity that minimizes the overall error. Indeed, Geman et al. justified their pessimism about neural networks by saying that *“the fundamental limitations resulting from the bias-variance dilemma apply to all nonparametric inference models, including neural networks.”*
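The bias half of this picture (enlarging the class can only shrink the training error) is easy to see numerically. The toy below, with all specifics invented for illustration, fits least-squares polynomials of growing degree to structure-plus-noise data:

```python
import numpy as np

# Structure + noise data (all specifics invented for illustration).
rng = np.random.default_rng(1)
n = 40
x = np.linspace(0, 1, n)
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=n)

def train_error(degree):
    """Mean squared error on the training set of the least-squares
    polynomial fit of the given degree."""
    coeffs = np.polyfit(x, y, degree)
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# Larger model class (higher degree) => training error can only go down.
errors = {d: train_error(d) for d in (1, 3, 9)}
```

The variance half is what the tradeoff adds on top: the degree-9 fit hugs the noise, so its population error would typically be worse despite its smaller training error.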

**More is not always better.** In statistical learning, getting more features or data does not necessarily improve performance. For example, learning from data that contains many irrelevant features is more challenging. Similarly, learning from a mixture model, in which each sample comes from one of two distributions (say, $\mathcal{D}_1$ or $\mathcal{D}_2$), is harder than learning each distribution independently.

**Diminishing returns.** In many settings, the number of data points needed to reduce the prediction error to a level $\epsilon$ scales as $n \sim 1/\epsilon^k$ for some parameter $k$. In such cases, it takes a certain number of samples to “get off the ground,” but once we do so we face a regime of diminishing returns: under this scaling, if it took $n$ points to achieve (say) 90% accuracy, it will take on the order of $2^k \cdot n$ points to increase the accuracy to 95%. In general, as we increase our resources (whether data, model complexity, or computation), we expect to capture finer and finer distinctions rather than unlocking qualitatively new capabilities.

**Strong dependence on loss, data.** When fitting a model to high-dimensional data, small details can make a big difference. Statisticians know that choices such as an L1 or L2 regularizer matter, not to mention using completely different datasets (e.g., Wikipedia vs. Reddit). High-dimensional optimizers of different quantities will be very different from one another.

**No natural “difficulty” of data points (at least in some settings).** Traditionally, we think of data points as sampled independently from some distribution. Though points closer to the decision boundary could be harder to classify, given the concentration-of-measure phenomenon in high dimensions, we expect most points to be at a similar distance from the boundary. Thus, at least for classical data distributions, we don’t expect points to vary greatly in their difficulty level. However, mixture models can display such variance in difficulty, and hence, unlike the other issues above, such variance would not be terribly surprising in the statistical setting.

In contrast to the above, consider the setting of teaching a student some particular topic in mathematics (e.g., computing derivatives), by giving them general instructions, as well as exercises to work through. This is not a formally defined setting, but let’s consider some of its qualitative features:

**Learning a skill, rather than approximating a distribution.** In this setting, the student learns a *skill* rather than an estimator/predictor for some quantity. While defining “skill” is not a trivial task (and not one we’ll undertake in this blog post), it is a qualitatively different object. In particular, even if the function mapping exercises to solutions cannot be used as a “black box” to solve some related task X, we believe that the **internal representations** that the student develops while working through these problems can still be useful for X.

**More is better.** Generally, students that do more problems and problems of different types achieve better performance. A “mixture model” – doing some calculus problems and some algebra problems – does not hurt the student in their calculus performance and in fact, could only help.

**“Grokking” or unlocking capabilities, moving to automatic representations.** While at some point there are diminishing returns in problem-solving as well, students do seem to undergo several phases. There is a stage in which doing some problems helps a concept “click” and unlocks new capabilities. Also, as students repeat problems of a specific type, they seem to move their representations of these problems to a lower level, enabling an automaticity with them that they didn’t have before.

**Performance is partially independent of the loss and data.** There is more than one way to teach mathematical concepts. Students who study with different books, educational approaches, or grading systems can eventually learn the same material and (as far as we can tell) similar internal representations of it.

**Some problems are harder than others.** In math exercises, we often see a strong correlation between how different students solve the same problem. There does seem to be an inherent difficulty level for a problem and a natural progression of difficulty that is optimal for learning. Indeed this is precisely what is being done by platforms such as IXL.

So, which of the above two metaphors more appropriately captures modern deep learning, and specifically the reasons why it is so successful? Statistical model fitting seems to correspond well to the math and the code. Indeed the canonical Pytorch training loop trains deep networks through empirical risk minimization as described above:
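In framework-free form (numpy standing in for Pytorch, and a toy linear model invented for illustration), that loop is just repeated gradient steps on the empirical loss:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 5))            # toy data (invented)
w_true = rng.normal(size=5)
y = X @ w_true
w = np.zeros(5)                          # model parameters
lr = 0.1                                 # learning rate

for epoch in range(200):                 # the canonical loop:
    pred = X @ w                         #   forward pass
    grad = 2 * X.T @ (pred - y) / len(X) #   backward pass (here, by hand)
    w -= lr * grad                       #   optimizer step

final_loss = float(np.mean((X @ w - y) ** 2))
```

In Pytorch, the middle two lines become `loss.backward()` and `optimizer.step()` (plus `optimizer.zero_grad()`), but the structure is the same: minimize an empirical risk by gradient descent.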

However, on a deeper level, the relation between the two settings is not as clear. For concreteness, let us fix a particular learning task. Consider a classification algorithm that is trained using the method of “self-supervised learning + a linear probe” (what we called Self-Supervised + Simple or SSS in our paper with Bansal and Kaplun). Concretely, the algorithm is trained as follows:

1. Suppose that the data is a sequence $(x_1, y_1), \ldots, (x_n, y_n)$, where each $x_i$ is some datapoint (say an image, for concreteness) and $y_i$ is a label.
2. We first find a deep neural network implementing a *representation function* $r : \mathbb{R}^d \to \mathbb{R}^k$. This function is trained using only the datapoints $x_1, \ldots, x_n$ and not the labels, by minimizing some type of *self-supervised* loss function. Examples of such loss functions are *reconstruction* or in-painting (recovering some part of the input from another part) or *contrastive learning* (finding $r$ such that $\|r(x) - r(x')\|$ is significantly smaller when $x, x'$ are augmentations of the same datapoint than when they are two random points).
3. We then use the full labeled data to fit a linear classifier $W \in \mathbb{R}^{C \times k}$ (where $C$ is the number of classes) that minimizes the cross-entropy loss. Our final classifier is the map $x \mapsto W\, r(x)$.

Step 3 merely fits a linear classifier and so the “magic” happens in step 2 (self-supervised learning of a deep network). Some of the properties we see in self-supervised learning include:

**Learning a skill rather than approximating a function.** Self-supervised learning is not about approximating a function but rather about learning representations that can be used in a variety of downstream tasks. This is, for example, the dominant paradigm in natural language processing. Whether the downstream task is handled through a linear probe, fine-tuning, or prompting is of secondary importance.

**More is better.** In self-supervised learning, representation quality improves with data quantity. We don’t suffer from mixing in several sources: in fact, the more diverse the data is, the better.

**Unlocking capabilities.** We have seen time and again discontinuous improvements in deep learning models as we scale resources (data, compute, model size). This has also been demonstrated in some synthetic settings.

**Performance is largely independent of loss or data.** There is more than one self-supervised loss. Several contrastive and reconstruction losses have been used for images. For language models, we sometimes use one-sided reconstruction (predicting the next token) and sometimes masked models, whose goal is to predict a masked token from the context on both its left and its right. We can also use slightly different datasets. These choices can make a difference in efficiency, but as long as we make “reasonable” choices, raw resources are typically a more significant predictor of performance than the particular loss or dataset used.

**Some instances are harder than others.** This point is not specific to self-supervised learning. It does seem that data points have some inherent “difficulty level”. Indeed, we have several pieces of empirical evidence for the notion that different learning algorithms have a different “skill level” and different points have a different “difficulty level” (with the probability that classifier $f$ classifies point $x$ correctly being monotonically increasing with $f$’s skill and monotonically decreasing with $x$’s difficulty). The “skill vs. difficulty” paradigm is the cleanest explanation for the “accuracy on the line” phenomenon uncovered by Recht et al. and Miller et al. Our paper with Kaplun, Ghosh, Garg, and Nakkiran also shows how different inputs in datasets have an inherent “difficulty profile” that seems to be generally robust with respect to different model families.

**Training as teaching.** Training modern large models seems much more like teaching a student than fitting a model to data, complete with “taking breaks” or trying different approaches when the student doesn’t get it or seems tired (training diverges). The training logbook of Meta’s large model is instructive: aside from issues with hardware, we can see interventions such as switching optimization algorithms in the middle of training and even considering “hot swapping” the activation functions (GELU to RELU). The latter doesn’t make much sense if you think of model training as fitting data, as opposed to learning representations.

Up to this point, we have only discussed self-supervised learning, but the canonical example of deep learning – the one you teach first in a course – is still supervised learning. After all, deep learning’s “ImageNet moment” came with, well, ImageNet. Does anything we said above still apply to this setting?

First, the emergence of supervised large-scale deep learning was to some extent a historical accident, aided by the availability of large, high-quality labeled datasets (i.e., ImageNet). One could imagine an alternative history in which deep learning first showed breakthrough advances in natural language processing via unsupervised learning, and only later transferred to vision and supervised learning.

Second, we have some evidence that, even though they use radically different loss functions, supervised and self-supervised learning behave similarly “under the hood.” Both often achieve the same performance, and in work with Bansal and Nakkiran, we showed that they also learn similar internal representations. Concretely, for every $k$, one can “stitch together” the first $k$ layers of a depth-$d$ model that was trained via self-supervision with the last $d - k$ layers of a supervised model with little loss in performance.

The advantage of self-supervised + simple models is that they can separate out the aspects of feature learning or “deep learning magic” (done by the deep representation function) from the statistical model fitting (done by the linear or other “simple” classifier on top of this representation).

Finally, while this is more speculative, the fact that “meta learning” often seems to amount to learning representations (see Raghu et al. and Mandi et al.) can be considered another piece of evidence that representation learning is much of what’s going on, regardless of the objective the model ostensibly optimizes.

The reader may have noticed that I skipped over what is considered the canonical example of the disparity between the model of statistical learning and deep learning in practice: the absence of a “bias-variance tradeoff” (see Belkin et al.’s double descent) and the ability of over-parameterized models to generalize well.

There are two reasons I do not focus on this aspect. First, if supervised learning really does correspond to self-supervised + simple learning “under the hood,” then that may explain its generalization ability. Second, I think that over-parameterization is *not* crucial to deep learning’s success. Deep networks are special not because they are big compared to the number of samples but because they are big in absolute terms. Indeed, models in unsupervised / self-supervised learning are typically *not* over-parameterized. Even for the very large language models, their datasets are larger still. This does not make their performance any less mysterious.

Statistical learning certainly plays a role in deep learning. However, despite using similar terms and code, thinking of deep learning as simply fitting a model with more knobs than classical models misses a lot of what is essential to its success. The human-student metaphor is hardly perfect either. Like biological evolution, deep learning consists of many repeated applications of the same rule – gradient descent on an empirical loss – yet it gives rise to highly complex outcomes. It seems that at different times, different components of networks learn different things, including representation learning, prediction fitting, implicit regularization, and pure noise. We are still searching for the right lens through which to ask questions about deep learning, let alone answer them.

**Acknowledgments:** Thanks to Lucas Janson and Preetum Nakkiran for comments on early versions of this blog post.

**UPDATE**: Don’t miss the TCS Women Spotlight Workshop, **Monday, June 20th, 8:45 am – 11:45 am** Rome Time. Irit Dinur will give a talk on “How I re-proved the PCP theorem and how I hope to do it again,” and there will be six rising-star talks by Sami Davies (Northwestern), Tamalika Mukherjee (Purdue), Aditi Dudeja (Rutgers), Charlie Carlson (Colorado Boulder), Yasamin Nazari (Johns Hopkins), and Jessica Sorrell (UCSD). Register at https://form.jotform.com/221605322816045

The 54th Annual ACM Symposium on Theory of Computing (STOC’22) is starting next week in Rome, as part of the broader TheoryFest. Now, while this probably does not come as a surprise to you, did you know about the *social and mentoring events* at STOC, which, not to miss a good portmanteau when one sees one,* we shall henceforth refer to as *STOCial’22*?

Organised by Federico Fusco, Tegan Wilson, Mary Wootters, and myself, STOCial’22 includes a bonanza of activities, games, and fun, including (but not limited to):

- a student lunch!
- two senior/junior lunches!
- cartoon caption contests!
- a scavenger hunt!
- a STOC-themed crossword!
- a game of socc… football!
- PRIZES!

To learn more about those, and sign up to the student or senior/junior lunches: https://sites.google.com/view/stocial-2022

See you next week!

Clément Canonne

* If someone finds a palindrome instead, let me know. I would love a good pal in Rome.

“Longtermism” is a moral philosophy that places much more weight on the well-being of all future generations than on the current one. It holds that “positively influencing the long-term future is a key moral priority of our time,” where “long term” can be *really* long term, e.g., “many thousands of years in the future, or much further still.” At its core is the belief that each one of the potential quadrillion or more people that may exist in the future is as important as any single person today.

Longtermism has recently attracted attention, some of it in alarming tones. The reasoning behind longtermism is natural: if we assume that human society will continue to exist for at least a few millennia, many more people will be born in the future than are alive today. However, since predictions are famously hard to make, especially about the future, longtermism invariably gets wrapped up with probabilities. Once you do these calculations, preventing an infinitely bad outcome, even one that would happen only with tiny probability, has infinite expected utility. Hence longtermism tends to focus on so-called “existential risk”: the risk that humanity will go through an extinction event, like the one suffered by the Neanderthals or the dinosaurs, or another type of irreversible, humanity-wide calamity.

This post explains why I do not subscribe to this philosophy. Let me clarify that I am not saying that all longtermists are bad people. Many “longtermists” have given generously to improve people’s lives worldwide, particularly in developing countries. For example, none of the top charities of Givewell (an organization associated with the effective altruism movement, in which many prominent longtermists are members) focus on hypothetical future risks. Instead, they all deal with current pressing issues, including Malaria, childhood vaccinations, and extreme poverty. Overall, the effective altruism movement has done much to benefit currently living people. Some of its members have donated their kidneys to strangers: these are good people – morally better than me. It is hardly fair to fault people who are already contributing more than most others for caring about issues that I think are less significant.

This post critiques the philosophy of longtermism rather than the particular actions or beliefs of “longtermists.” In particular, the following are often highly correlated with one another:

- Belief in the philosophy of longtermism.
- A belief that existential risk is not just a concern for the far-off future and a low-probability event, but there is a very significant chance of it happening in the near future (next few decades or at most a century).
- A belief that the most significant existential risk could arise from artificial intelligence and that this is a real risk in the near future.

Here I focus on (1) and explain why I disagree with this philosophy. While I might disagree on specific calculations of (2) and (3), I fully agree with the need to think and act regarding near-term risks. Society tends to err on the side of being too myopic. We prepare too little even for risks that are not just predictable but are also predicted, including climate change, pandemics, nuclear conflict, and even software hacks. It is hard to motivate people to spend resources for safety when the outcome (bad event not happening) is invisible. It is also true that over the last decades, humanity’s technological capacities have grown so much that for the first time in history, we are capable of doing irreversible damage to our planet.

In addition to the above, I agree that we need to think carefully about the risks of any new technology, particularly one that, like artificial intelligence, can be very powerful but not fully understood. Some AI risks are relevant to the shorter term: they are likely over the next decade or are already happening. There are several books on these challenges. None of my critiques apply to such issues. At some point, I might write a separate blog post about artificial intelligence and its short and long-term risks.

My reasons for not personally being a “longtermist” are the following:

**The probabilities are too small to reason about.**

Physicists know that there is no point in reporting a measurement to three significant digits if your instrument is accurate to only one. Our ability to reason about events decades or more into the future is severely limited. At best, we can estimate probabilities to within an order of magnitude, and even that may be optimistic. Thus, claims such as Nick Bostrom’s, that *“the expected value of reducing existential risk by a mere one billionth of one billionth of one percentage point is worth a hundred billion times as much as a billion human lives,”* make no sense to me. This is especially so since these “probabilities” are Bayesian, i.e., they correspond to degrees of belief. If, for example, you evaluate the existential-risk probability by aggregating the responses of 1000 experts, then what one of these experts had for breakfast is likely to shift the aggregate by more than 0.001 percent (which, by Bostrom’s accounting, corresponds to far more than 10²⁰ human lives). To the extent we can quantify existential risks in the far future, we can only say something like “extremely likely,” “possible,” or “can’t be ruled out.” Assigning precise numbers to such qualitative assessments is an exercise in futility.
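To make the scale mismatch concrete, here is a back-of-the-envelope check, using only the figures quoted above (the 1000-expert aggregation setup is my illustrative assumption, not Bostrom’s):

```python
# Bostrom's claim: a probability reduction of one billionth of one
# billionth of one percentage point is worth 1e11 * 1e9 human lives.
delta_p = 1e-9 * 1e-9 * 1e-2    # = 1e-20, the probability reduction
lives_for_delta = 100e9 * 1e9   # "a hundred billion times ... a billion" = 1e20

# Implied exchange rate: lives per unit of probability.
lives_per_unit_prob = lives_for_delta / delta_p
print(f"{lives_per_unit_prob:.0e} lives per unit probability")  # 1e+40

# If the risk estimate averages 1000 expert opinions, a 0.001-percent
# (1e-5) mood swing in a single expert moves the aggregate by 1e-8,
# which under this accounting is "worth" about 1e32 lives.
aggregate_shift = 1e-5 / 1000
print(f"{aggregate_shift * lives_per_unit_prob:.0e} lives")     # 1e+32
```

The point is not the exact exponents but that tiny, unmeasurable shifts in subjective belief swamp the quantities being “computed.”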

**I cannot justify sacrificing current living humans for abstract probabilities.**

Related to the above, rather than focusing on specific, measurable risks (e.g., earthquakes, climate change), longtermism is often concerned with risks that are extremely hard to quantify. In truth, we cannot know what will happen 100 years into the future, nor what the impact of any particular technology will be. Even if our actions have drastic consequences for future generations, the dependence of that impact on our choices is likely to be chaotic and unpredictable. To put things in perspective, many of the risks we worry about today, including nuclear war, climate change, and AI safety, emerged only in the last century or even the last few decades. It is hard to overstate how limited our ability is to predict even a decade into the future, let alone a century or more.

Given that there is so much suffering and need in the world right now, I cannot accept a philosophy that prioritizes abstract armchair calculations over actual living humans. (This concern is not entirely hypothetical: Greaves and MacAskill estimate that $100 spent on AI safety would, in expectation, correspond to saving a *trillion lives* and hence would be “far more than the near-future benefits of bednet distribution [for preventing Malaria],” and recommend that it is better that individuals “fund AI safety rather than developing world poverty reduction.”)
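To spell out the scale of that estimate, here is a one-line check using only the two numbers quoted above:

```python
# Greaves and MacAskill's figure: $100 of AI-safety funding saves,
# in expectation, a trillion (1e12) lives.
cost_per_life = 100 / 1e12
# i.e., ten billion expected lives per dollar -- a rate no measurable
# intervention comes remotely close to.
print(f"${cost_per_life:.0e} per expected life")  # $1e-10 per expected life
```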

Moorhouse compares future humans to ones living far away from us. He says that just as “something happening far away from us in space isn’t less intrinsically bad just because it’s far away,” we should care about humans in the far-off future as much as we care about present ones. But I think that we *should* care less about very distant events, especially when they are so far away that we cannot observe them. After all, as far as we know, there may well be trillions of sentient beings in the universe right now whose welfare could somehow be impacted by our actions.

**We cannot improve what we cannot measure.**

An inherent disadvantage of low-probability events is that they are invisible until they occur. We have no direct way to measure whether the probability of an event X has increased or decreased, so we cannot tell whether our efforts are working. The scientific revolution consisted of moving from armchair philosophizing to making measurable predictions. I do not believe we can make meaningful progress without concrete goals. For some risks, we do have quantifiable proxies (carbon emissions, the number of nuclear warheads). Still, there are significant challenges in finding a measurable proxy for very low-probability and far-off events. Hence, even if we accept that the risks are real and vital, I do not think we can do anything directly about them before finding such proxies.

Proxies do not have to be perfect: theoretical computer science made much progress using the imperfect measure of worst-case asymptotic complexity, and the same holds for machine learning and its artificial benchmarks. It is enough that proxies encourage the generation of new ideas or technologies and enable gradual improvement. One lesson from modern machine learning is that the objective (a.k.a. the loss function) doesn’t have to perfectly match the task to be useful.

**Long-term risk mitigation can only succeed through short-term progress.**

Related to the above, I believe that addressing long-term risks can only be successful if it’s tied to shorter-term advances that have clear utility. For example, consider the following two extinction scenarios:

1. The actual Neanderthal extinction.

2. A potential human extinction 50 years from now due to total nuclear war.

I argue that in both cases, the only realistic way to avoid extinction is a sequence of actions that improve some measurable outcome. While extinction could sometimes, in theory, be avoided by a society making a huge sacrifice to eliminate a hypothetical scenario, in practice this never happens.

While the reasons for the Neanderthal extinction are not fully known, most researchers believe that Neanderthals were out-competed by our ancestors, modern humans, who had better tools and ways of organizing society. The crucial point is that the approaches that could have prevented the Neanderthals’ extinction were the same ones that would have improved their lives in their environment at the time. They may not have been capable of pursuing them, but it wasn’t because they were working on the wrong problems.

Contrast this with the scenario of human extinction through total nuclear war. In such a case, our conventional approaches for keeping nuclear arms in check, such as international treaties and sanctions, will have failed. Perhaps in hindsight, humanity’s optimal course of action would have been a permanent extension of the Middle Ages, preventing the scientific revolution by restricting education, through religious oppression, and by vigorously burning scientists at the stake. Or perhaps humanity could even now make a collective decision to go back and delete all traces of post-17th-century science and technology.

I cannot rule out the possibility that, in hindsight, one of those outcomes would have had more aggregate utility than our current trajectory. But even if so, such an outcome is simply not achievable. Humanity cannot and will not halt its progress, and solutions to significant long-term problems have to arise as a sequence of solutions to shorter-range, measurable ones, each showing positive progress. Our only hope of avoiding a total nuclear war is through piecemeal, quantifiable progress. We need to use diplomacy, international cooperation, and monitoring technologies to reduce the world’s nuclear arsenal one warhead at a time. This incremental approach may or may not work, but it’s the only one we have.

**Summary: think of the long term, but act and measure in the short term.**

It is appropriate for philosophers to speculate on hypothetical scenarios centuries into the future and wonder whether actions we take today could influence them. However, I do not believe such an approach will, in practice, lead to a positive impact on humanity and, if taken to the extreme, may even have negative repercussions. We should maintain epistemic humility. Statements about probabilities involving fractions of percentage points, or human lives in the trillions, should raise alarm bells. Such calculations can be particularly problematic since they can lead to a “the end justifies the means” attitude, which can accept any harm to currently living people in the name of the practically infinite multitudes of future hypothetical beings.

We need to maintain the invariant that, even if motivated by the far-off future, our actions “first do no harm” to living, breathing humans. Indeed, as I mentioned, even longtermists don’t wake up every morning thinking about how to reduce the chance that something terrible happens in the year 1,000,000 AD by 0.001%. Instead, many longtermists care about particular risks because they believe these risks are likely in the near-term future. If you manage to make a convincing case that humanity faces a real chance of near-term total destruction, then most people would agree that this is very very bad, and we should act to prevent it. It doesn’t matter whether humanity’s extinction is two times or a zillion times worse than the death of half the world’s population. Talking about trillions of hypothetical beings thousands of years into the future only turns people off. There is a reason that Pascal’s Wager is not such a winning argument, and I have yet to meet someone who converted to a particular religion because it had the grisliest version of hell.

This does not mean that thinking about and preparing for longer-term risks is pointless. Maintaining seed banks, monitoring asteroids, researching pathogens, designing vaccine platforms, and working toward nuclear disarmament are all essential activities that society should undertake. Whenever a new technology emerges, artificial intelligence included, it is crucial to consider how it can be misused or lead to unintended consequences. By no means do I argue that humanity should spend its resources only on actions with direct economic benefit. Indeed, the whole enterprise of basic science is built on pursuing directions that, in the short term, increase our knowledge without practical utility. Progress is not measured only in dollars, but it should be measured somehow. Epistemic humility also means that we should be content with working on direct, measurable proxies, even if they are not perfect matches for the risk at hand. For example, the probability of extinction via total nuclear war might not be a direct function of the number of deployed nuclear warheads. However, the latter is still a pretty good proxy for it.

Similarly, even if you are genuinely worried about long-term risk, I suggest you spend most of your time in the present. Try to think of short-term problems whose solutions can be verified, which might advance the long-term goal. A “problem” does not have to be practical: it can be a mathematical question, a computational challenge, or an empirically verifiable prediction. The advantage is that even if the long-term risk stays hypothetical or the short-term problem turns out to be irrelevant to it, you have still made measurable progress. As has happened before, to make actual progress on solving existential risk, the topic needs to move from philosophy books and blog discussions into empirical experiments and concrete measures.

**Acknowledgments:** Thanks to Scott Aaronson and Ben Edelman for commenting on an earlier version of this post.

The third Information-Theoretic Cryptography conference (ITC 2022) will be held in person at MIT, July 5–7, 2022. We have an exciting program with 17 cool papers and 6 plenary talks by Yuval Ishai, Rafael Pass, David Zuckerman, Omri Ben-Eliezer, Dakshita Khurana, and Yevgeniy Dodis, on topics that span the breadth of information-theoretic cryptography (and beyond).

More information here: itcrypto.github.io/2022/index.html

We hope to see many of you at the conference.

The organizers

The workshop will feature talks from a range of exciting speakers, social events, poster sessions, and local (!) outings. To stay up to date on the schedule, and to register, please head to https://ideas-ncbr.pl/en/wola/

Looking forward to seeing you at WOLA!

PS: online attendance is also possible, for those who cannot attend in person. Streaming, after all, is on-topic.

https://www.lse.ac.uk/HALG-2022

The Highlights of Algorithms conference is a forum for presenting the highlights of recent developments in algorithms and for discussing potential further advances in this area. The conference will provide a broad picture of the latest research in algorithms through a series of invited talks, as well as the possibility for all researchers and students to present their recent results through a series of short talks and poster presentations. Attending the Highlights of Algorithms conference will also be an opportunity for networking and meeting leading researchers in algorithms.

For local information, visa information, or information about registration, please contact Tugkan Batu (t.batu@lse.ac.uk).

PROGRAM:

A detailed schedule and a list of all accepted short contributions are available at: https://www.lse.ac.uk/HALG-2022/programme/Programme

**REGISTRATION**

https://www.lse.ac.uk/HALG-2022/registration/Registration

**Early registration (by 20th May 2022)**

Students: £100

Non-students: £150

**Late registration (from 21st May 2022)**

Students: £175

Non-students: £225

Registration includes the lunches provided, coffee breaks, and the conference reception.

There are some funds from conference sponsors to subsidise student registration fees. Students can apply for a fee waiver by sending an email to Enfale Farooq (e.farooq@lse.ac.uk) by **15th May 2022**. Those students presenting a contributed talk will be given priority in the allocation of these funds. Applicants will be notified of the outcome by 17th May 2022.

**INVITED SPEAKERS**

Survey speakers:

Amir Abboud (Weizmann Institute of Science)

Julia Chuzhoy (Toyota Technological Institute at Chicago)

Martin Grohe (RWTH Aachen University)

Anna Karlin (University of Washington)

Richard Peng (Georgia Institute of Technology)

Thatchaphol Saranurak (University of Michigan)

Invited talks:

Peyman Afshani (Aarhus University)

Soheil Behnezhad (Stanford University)

Sayan Bhattacharya (University of Warwick)

Guy Blelloch (Carnegie Mellon University)

Greg Bodwin (University of Michigan)

Mahsa Eftekhari (University of California, Davis)

John Kallaugher (Sandia National Laboratories)

William Kuszmaul (Massachusetts Institute of Technology)

Jason Li (Carnegie Mellon University)

Joseph Mitchell (SUNY, Stony Brook)

Shay Moran (Technion)

Merav Parter (Weizmann Institute of Science)

Aviad Rubinstein (Stanford University)

Rahul Savani (University of Liverpool)

Mehtaab Sawhney (Massachusetts Institute of Technology)

Jakub Tetek (University of Copenhagen)

Vera Traub (ETH Zurich)

Jan Vondrak (Stanford University)

Yelena Yuditsky (Université libre de Bruxelles)

Best regards,

Keren Censor-Hillel

PC chair for HALG 2022
