(See also the pdf version, and these lecture notes.)

The divide between frequentists and Bayesians in statistics is one of those interesting cases where questions of philosophical outlook have actual practical implications. At the heart of the debate is Bayes’ theorem:

$$\Pr[H \mid E] = \frac{\Pr[E \mid H]\cdot \Pr[H]}{\Pr[E]}$$

where $H$ is a hypothesis and $E$ is the observed evidence.
Both sides agree that it is correct, but they disagree on what the symbols mean. For *frequentists*, probabilities refer to the fraction of times an event occurs over repeated samples. They think of probability as counting, or an extension of *combinatorics*. For *Bayesians*, probabilities refer to degrees of belief, or, if you want, the odds that you would place on a bet. They see probability as an extension of logic.^{1}
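To make the two readings concrete, here is a minimal Python sketch (all numbers made up) that applies Bayes’ theorem to update a prior belief after seeing evidence:

```python
# Hypothetical numbers: a prior belief Pr[H], and likelihoods Pr[E|H], Pr[E|not H].
p_h = 0.01              # prior probability of the hypothesis H
p_e_given_h = 0.95      # probability of seeing the evidence E if H is true
p_e_given_not_h = 0.05  # probability of seeing E if H is false

# Total probability of the evidence.
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

# Bayes' theorem: Pr[H|E] = Pr[E|H] * Pr[H] / Pr[E].
p_h_given_e = p_e_given_h * p_h / p_e

print(p_h_given_e)  # posterior degree of belief in H after seeing E (about 0.16)
```

A Bayesian reads the output as a degree of belief; a frequentist reads the same number as the long-run fraction of repeated trials in which $H$ holds among those trials where $E$ was observed.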

A Bayesian is like Sherlock Holmes, trying to use all available evidence to guess who committed the crime. A frequentist is like the lawmaker who comes up with the rules for how to assign guilt or innocence in future cases, fully accepting that in some small fraction of the cases the answer will be wrong (e.g., that a published result will be false despite having statistical significance). The canonical question for a Bayesian is “what can I infer from the data about the given question?” while for a frequentist it is “what experiment can I set up to answer the question?”. Indeed, for a Bayesian probability is about the degrees of belief in various answers, while for a frequentist probability comes from the random choices in the experiment design.

If we think of algorithmic analogies, then given the task of finding a large clique in graphs, a frequentist would want to design a general procedure that has some assurances of performance on all graphs. A Bayesian would only be interested in the particular graph he’s given. Indeed, Bayesian procedures are often exponential in the worst case, since they want to use all available information, which more often than not will turn out to be computationally costly. Frequentists, on the other hand, have more “creative freedom” in the choice of which procedure to use, and would often go for simple efficient ones that still have decent guarantees (think of a general procedure that’s meant to adjudicate many cases, as opposed to deploying Sherlock Holmes for each one).

Given all that discussion, it seems fair to place theoretical computer scientists squarely in the *frequentist* camp of statistics. But today (continuing a previous post) I want to discuss what a *Bayesian* theory of computation could look like. As an example, I will use my recent paper with Hopkins, Kelner, Kothari, Moitra and Potechin, though my co-authors are in no way responsible for my ramblings here.

## Peering into the minds of algorithms.

What is wrong with our current theory of algorithms? One issue that bothers me as a cryptographer is that we don’t have many ways to give evidence that an average-case problem is hard beyond saying that “we tried to solve it and we couldn’t”. We don’t have a web of reductions from one central assumption to (almost) everything else as we do in worst-case complexity. But this is just a symptom of a broader lack of understanding.

My hope is to obtain general heuristic methods that, like random models for the primes in number theory or the replica method in statistical physics, would allow us to predict the right answer to many questions in complexity, even if we can’t rigorously prove it. To me, such a theory would need to focus not just on questions such as “how much time does it take to compute X”, but also on *computational knowledge*: how can we model the inferences that computationally bounded observers can make about the world, even if their beliefs are incorrect (or at least incomplete)? Once you start talking about observers and beliefs, you find yourself deep in Bayesian territory.

What do I mean by “computational knowledge”? Well, while generally if you stop an arbitrary `C` program before it finishes its execution then you get (to use a technical term) bubkas, there are some algorithms, such as Markov Chain Monte Carlo, belief propagation, gradient descent, cutting plane methods, as well as linear and semidefinite programming hierarchies, that have a certain “knob” to tune their running time. The longer they run, the higher the quality of their solution, and one could try to interpret their intermediate state as saying something about the knowledge they have accumulated about the solution up to that point.
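As a toy illustration of such a knob (a sketch, not one of the algorithms listed above), consider gradient descent on a one-dimensional quadratic: each extra iteration refines the intermediate state, so stopping early yields a cruder but still meaningful answer:

```python
def gradient_descent_snapshots(steps, lr=0.2, x0=10.0):
    """Minimize f(x) = (x - 3)^2, recording the iterate after every step."""
    x = x0
    snapshots = []
    for _ in range(steps):
        grad = 2 * (x - 3)  # f'(x)
        x -= lr * grad
        snapshots.append(x)
    return snapshots

snaps = gradient_descent_snapshots(steps=30)
errors = [abs(x - 3) for x in snaps]
print(errors[0], errors[-1])  # the error shrinks as the "knob" (steps) increases
```

Each prefix of the run is a legitimate (if cruder) state of knowledge about the minimizer, unlike the useless intermediate state of an arbitrary program.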

Even more ambitiously, one could hope that in some cases one of those algorithms is the *best*, and hence its intermediate state can be interpreted as saying something about the knowledge of *every* computationally bounded observer that has access to the same information and roughly similar computational resources.

## Modeling our knowledge of an unknown clique

To be more concrete, suppose that we are given a graph $G$ on $n$ vertices that contains some (unknown to us) clique $S$ of size $\omega$, and for a given vertex $i$ we want to estimate the probability that $i \in S$. From a classical standpoint this question makes little sense: once $G$ and $S$ are fixed, whether $i$ is in the clique is fully determined, and the probability is either zero or one.

But if you consider probabilities as encoding beliefs, then it’s quite likely that a computationally bounded observer is not *certain* whether $i \in S$, and hence would assign this event a probability strictly between zero and one.

Here is one approach. Since we are given no information that distinguishes one vertex from another a priori, symmetry suggests estimating the probability that $i \in S$ as $\omega/n$. Note that a bounded observer may have to produce such estimates even about objects that do not exist at all. (Think of estimating the probability that a given variable is `true` in a satisfying assignment, even if the formula is unsatisfiable.) This is analogous to trying to compute the probability that a unicorn has blue eyes, but indeed computationally bounded observers are in the uncomfortable position of having to talk about their beliefs even about objects that mathematically cannot exist.
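For instance, if the unknown clique $S$ were a uniformly random $\omega$-subset of the $n$ vertices, symmetry would give every vertex the same prior probability $\omega/n$ of membership. This tiny sketch (with made-up $n$ and $\omega$) checks the counting:

```python
from math import comb

n, omega = 20, 5  # hypothetical graph size and clique size

# Fraction of omega-subsets of [n] that contain a fixed vertex i.
prior = comb(n - 1, omega - 1) / comb(n, omega)

print(prior)  # equals omega / n = 0.25
```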

## Computational Bayesian probabilities

So, what would a consistent theory of “computational Bayesian probabilities” look like? Let’s try to stick as closely as possible to standard Bayesian inference. We think that there are some (potentially unknown) parameters $x$, and we observe some evidence $e$ that is sampled from a distribution depending on $x$. For various functions $f$, we would like to produce an estimate, denoted $\tilde{\mathbb{E}}[f|e]$, of the expected value of $f(x)$ given the evidence $e$.

A crucial property we require is *calibration*: if the pair $(x,e)$ is sampled from the joint distribution, then on average the estimates $\tilde{\mathbb{E}}[f|e]$ should agree with the truth:

$$\mathbb{E}_{(x,e)}\; \tilde{\mathbb{E}}[f|e] \;=\; \mathbb{E}_{(x,e)}\; f(x) \qquad (*)$$

Indeed, the most simple-minded computationally bounded observer might ignore the evidence $e$ altogether and simply output the a priori expectation of $f(x)$; this estimate is perfectly calibrated, even though it is not very useful.
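Calibration is easy to check by exact enumeration in a toy model (all numbers hypothetical): a uniformly random bit $x$, a noisy observation $e$, and two estimators, the full Bayesian posterior and a lazy one that ignores the evidence. Both average out to the truth over the joint distribution:

```python
from itertools import product

p_flip = 0.1  # hypothetical channel noise: e equals x with probability 0.9

def pr_e_given_x(e, x):
    return 1 - p_flip if e == x else p_flip

def posterior(e):
    """Bayesian estimate of Pr[x = 1 | e] under the uniform prior on x."""
    num = pr_e_given_x(e, 1) * 0.5
    den = num + pr_e_given_x(e, 0) * 0.5
    return num / den

def lazy(e):
    """Ignores the evidence entirely: always outputs the prior mean."""
    return 0.5

def average_over_joint(estimator):
    """Expectation of estimator(e) over the joint distribution of (x, e)."""
    return sum(0.5 * pr_e_given_x(e, x) * estimator(e)
               for x, e in product([0, 1], repeat=2))

truth = 0.5  # E[x] under the prior
print(average_over_joint(posterior), average_over_joint(lazy))  # both 0.5
```

The lazy estimator shows that calibration alone is too weak a requirement: it is satisfied even by an observer who learns nothing from the data.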

But of course we also want to talk about the beliefs of observers with intermediate powers. To do that, we want to require that the estimates respect certain *rules of inference*, which would in particular rule out things like assigning a positive probability for an isolated vertex to be contained in a clique of size larger than one. For example, if we can infer using this system from the evidence $e$ that $f(x) = 0$, then our estimate $\tilde{\mathbb{E}}[f|e]$ should equal zero.

Finally, we would also want to ensure that the map from the evidence $e$ to the estimates $\tilde{\mathbb{E}}[\cdot|e]$ can itself be computed efficiently.

Different rules of inference or *proof systems* lead to different ways of assigning these probabilities. The Sum of Squares algorithm / proof system is one choice I find particularly attractive. Its main advantages are:

- It encapsulates many algorithmic techniques, and for many problems it captures the *best known* algorithm. That makes it a better candidate for capturing (for some restricted subset of all computational problems) the beliefs of *all* computationally bounded observers.
- It corresponds to a very natural proof system that contains many of the types of arguments, such as Cauchy–Schwarz and its generalizations, that we use in theoretical computer science. For this reason it has been used to find constructive versions of important results such as the invariance principle.
- It is particularly relevant when the functions we want to estimate are *low degree polynomials*. If we think of the data that we observe as inherently *noisy* (e.g., the data is a vector of numbers, each of which corresponds to some physical measurement that might have some noise in it), then it is natural to restrict ourselves to that case, since high degree polynomials are often very sensitive to noise.
- It is a tractable enough model that we can prove lower bounds for it, and these lower bounds have nice interpretations as to what the “estimates” are, in the sense that they correspond to a distribution-like object that “logically approximates” the Bayesian posterior distribution of $x$ given $e$.
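To give a flavor of this distribution-like object, here is a cartoon (with hypothetical parameters, and emphatically not the actual construction from the paper) of the degree-2 part of such a pseudo-expectation for cliques: a linear functional on monomials with $\tilde{\mathbb{E}}[x_i] = \omega/n$, and $\tilde{\mathbb{E}}[x_i x_j]$ doubled or zeroed according to whether the edge $\{i,j\}$ is present. Whether the resulting moment matrix is actually positive semidefinite is exactly the hard part of such an analysis:

```python
import numpy as np

n, omega = 12, 3  # hypothetical sizes
rng = np.random.default_rng(0)

# Random graph G(n, 1/2) as a symmetric 0/1 adjacency matrix.
A = np.triu(rng.integers(0, 2, size=(n, n)), k=1)
A = A + A.T

p = omega / n
# Degree-2 moment matrix indexed by the monomials {1, x_1, ..., x_n}.
M = np.zeros((n + 1, n + 1))
M[0, 0] = 1.0                 # Etilde[1] = 1
M[0, 1:] = M[1:, 0] = p       # Etilde[x_i] = omega / n
for i in range(n):
    M[i + 1, i + 1] = p       # x_i^2 = x_i for 0/1 variables
for i in range(n):
    for j in range(i + 1, n):
        # Bayes-style update: doubled if the edge is present, zero otherwise.
        M[i + 1, j + 1] = M[j + 1, i + 1] = 2 * p * p * A[i, j]

expected_clique_size = M[0, 1:].sum()  # Etilde[sum_i x_i], equals omega
print(expected_clique_size)
print(np.linalg.eigvalsh(M).min())     # positivity would mean M is PSD
```

The functional is linear and calibrated in the sense that the expected clique size comes out to exactly $\omega$; making the moment matrix genuinely PSD is where all the work lies.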

## SoS lower bounds for the planted clique.

Interestingly, a lower bound showing that SoS fails on some instance amounts to talking about “unicorns”. That is, we need to take an instance $G$ that did *not* arise from a model of the form “random graph plus a planted clique” (for example, a sample from $G(n,1/2)$, which with high probability contains no clique of size much larger than $2\log n$), and still come up with estimates that behave as if such a clique were present.

We need to come up with reasonable “pseudo Bayesian” estimates for certain quantities even though in reality these quantities are either completely determined (if the clique exists, each of them is simply a fixed number) or undefined (if it does not). The quantities we estimate are *low degree polynomials in the indicator variables of the clique*. For every low degree polynomial of this form, we need to decide how the observed graph should update our estimate of its value.

For example, if $i$ and $j$ are two vertices with an edge between them, then Bayes’ theorem suggests that we should roughly *double* our estimate of the probability that both $i$ and $j$ are in the clique (from about $(\omega/n)^2$ to about $2(\omega/n)^2$), since the presence of the edge is twice as likely when both endpoints are in the clique.
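Under the model of a uniformly random $\omega$-subset plus random edges, this update can be computed exactly (made-up $n$ and $\omega$ below): conditioning on the presence of the edge $\{i,j\}$ roughly doubles the belief that both endpoints are in the clique, since the edge is twice as likely in that case:

```python
n, omega = 1000, 50  # hypothetical parameters

# Prior probability that two fixed vertices i, j both lie in a
# uniformly random omega-subset S of the n vertices.
prior_both = omega * (omega - 1) / (n * (n - 1))

# The edge {i,j} is present with probability 1 if both are in S,
# and 1/2 otherwise, so by Bayes' theorem:
pr_edge = prior_both * 1.0 + (1 - prior_both) * 0.5
posterior_both = prior_both * 1.0 / pr_edge

print(posterior_both / prior_both)  # close to 2: the edge doubles the belief
```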

We can try to use similar ideas to come up with how we should update our estimate for every such term, for instance the probability that all the vertices in some small set belong to the clique, based on the edges we observe inside that set.

In our new paper we take an alternative route. Rather than trying to work out the updates for each such term individually, we simply declare by fiat that our estimates should:

- Be simple functions of the graph itself. That is, each estimate will be a low degree function of $G$.
- Respect the calibration condition (*) for all functions that can depend on the graph only in a low degree way.
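As a sanity check of this calibration-by-fiat (a sketch with hypothetical parameters, not the paper’s proof), one can compare, for the test function $f = x_i x_j$ and the declared estimate $\tilde{\mathbb{E}}[x_i x_j] = 2(\omega/n)^2 A_{ij}$, the average of the estimate over a purely random graph with the true average over the planted distribution:

```python
n, omega = 1000, 50  # hypothetical parameters
p = omega / n

# Left side: average over a random graph G(n, 1/2) of the declared estimate
# 2 p^2 A_ij for a fixed pair {i, j}; each edge appears with probability 1/2.
lhs = 0.5 * (2 * p * p)

# Right side: in the planted model the edge {i,j} is always present when both
# endpoints are in the clique, so E[A_ij * x_i * x_j] = Pr[i, j in S].
rhs = omega * (omega - 1) / (n * (n - 1))

print(lhs, rhs)  # agree up to roughly a 1/omega relative error
```

The two sides match to leading order, which is the sense in which the declared estimates “automatically respect all the low degree statistics.”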

This condition turns out to imply that our estimates automatically respect all the low degree statistics. The “only” work that is then left is to show that they satisfy the positivity constraints, namely that the estimate of the square of any low degree polynomial is non-negative.

More generally, one can think of such arguments as decomposing an object into a *structured* part, which we can infer in some computationally bounded way, and a remaining part which, since it contains no discernible structure, can be treated as if it were *random* even if it is fully deterministic, since a Bayesian observer will have uncertainty about it and hence assign it probabilities strictly between zero and one. Thus for example, in the regularity lemma, we can think of a bounded observer that cannot store a full matrix, and so retains only its coarse block structure, treating the deviations inside each block as noise.

I think that a fuller theory of computational Bayesian probabilities, which would be the dual to our standard “frequentist” theory of pseudorandomness, is still waiting to be discovered. Such a theory would go far beyond just looking at sums of squares.

- As discussed in my previous post, this is somewhat of a caricature of the two camps, and most practicing statisticians are pragmatic about this and happy to take ideas from either side as it applies to their particular situation.↩