Theoretical computer scientists love inequalities. After all, a great many of our papers involve showing either an upper bound or a lower bound on some quantity. So I thought I’d share some cool stuff I’ve learned this Friday in our reading group’s talk by Zeev Dvir. This is based on my partial and vague recollection of Zeev’s talk, so please see the references in his paper with Hu (and a previous paper of Hardt and Moitra) for the complete (and correct) information. See also this blog post of James Lee.

The Brascamp–Lieb inequality is actually a shorthand for a broad family of inequalities generalizing Hölder’s inequality, Young’s inequality, the Loomis–Whitney inequality and many more. As Zeev put it, it probably generalizes any inequality you (or at least I) know about. Barthe gave an (alternative) proof of these inequalities, in the course of which he produced a magical lemma that turns out to be useful in many varying theoretical CS scenarios, from communication complexity to bounds for locally decodable codes, as well as for obtaining generalizations of the Sylvester–Gallai theorem in geometry.

One way to think about the Brascamp–Lieb inequality (as shown by Carlen and Cordero-Erausquin) is as generalizing the sub-additivity of entropy. We all know that given a random variable $X$ over $\mathbb{R}^d$, $H(X) \leq H(X_1) + \cdots + H(X_d)$,

where $X_i$ is the $i$-th coordinate of $X$. Now suppose that we let $\hat{X}_j$ denote the $j$-th coordinate of $X$ in some different basis. Then by the same reasoning we know that  $\displaystyle H(X) \leq \tfrac{1}{2} H(X_1) + \cdots + \tfrac{1}{2}H(X_d) + \tfrac{1}{2}H(\hat{X}_1) + \cdots + \tfrac{1}{2}H(\hat{X}_d) .$
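As a quick sanity check, here is a minimal computation of sub-additivity for a small correlated distribution (the particular joint distribution below is just an illustrative choice):

```python
import math

def H(p):
    """Shannon entropy (in bits) of a probability vector."""
    return -sum(q * math.log2(q) for q in p if q > 0)

# A correlated joint distribution of (X1, X2) over {0,1}^2.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
h_joint = H(list(joint.values()))

# Marginal distributions of each coordinate.
m1 = [joint[(0, 0)] + joint[(0, 1)], joint[(1, 0)] + joint[(1, 1)]]
m2 = [joint[(0, 0)] + joint[(1, 0)], joint[(0, 1)] + joint[(1, 1)]]

# Sub-additivity: H(X) <= H(X1) + H(X2), with slack exactly when correlated.
assert h_joint <= H(m1) + H(m2) + 1e-12
print(h_joint, H(m1) + H(m2))
```

Here both marginals are uniform, so the right-hand side is 2 bits, while the correlation makes the joint entropy strictly smaller.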

The Brascamp–Lieb inequality can be thought of as asking for the most general form of such inequalities. Specifically, given an arbitrary linear map $F : \mathbb{R}^d \rightarrow \mathbb{R}^n$, we can ask for what vector of numbers $\gamma = (\gamma_1, \ldots, \gamma_n)$ it holds that

$\displaystyle H(X) \leq \gamma_1 H(F(X)_1) + \cdots + \gamma_n H(F(X)_n) + C$

for all random variables $X$, where $C$ is a constant independent of $X$ (though it can depend on $F$ and $\gamma$).

The answer of Brascamp–Lieb is that this is the case if (and essentially only if) the vector $\gamma$ can be written as a convex combination of vectors of the form $\mathbb{1}_S$, where $S \subseteq [n]$ is a $d$-sized set such that the functionals $\{F_i\}_{i \in S}$ are linearly independent. You can see that the two cases above are special cases of this condition (where it also turns out that $C = 0$). (There is an even more general form where each $F_i$ can map into a subspace of dimension higher than one; see the papers for more.)
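One small sanity check of this condition in the "generic position" case: when every $d$ of the $n$ functionals are linearly independent, averaging the indicator vectors $\mathbb{1}_S$ over all $d$-sized subsets gives every coordinate weight exactly $d/n$, so the uniform vector $\gamma = (d/n, \ldots, d/n)$ is always achievable. A minimal sketch (the values of $n$ and $d$ are arbitrary):

```python
from itertools import combinations

# Each coordinate i appears in C(n-1, d-1) of the C(n, d) subsets,
# and C(n-1, d-1)/C(n, d) = d/n, so the uniform average of the
# indicator vectors 1_S is exactly (d/n, ..., d/n).
n, d = 5, 3
subsets = list(combinations(range(n), d))
gamma = [sum(1 for S in subsets if i in S) / len(subsets) for i in range(n)]

assert all(abs(g - d / n) < 1e-12 for g in gamma)
print(gamma)  # every entry is d/n = 0.6
```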

The heart of the proof (if I remember Zeev’s explanations correctly) is to show that the worst case for such an inequality is always when the variables $F(X)_1, \ldots, F(X)_n$ form a Gaussian process, that is, when $X$ is some kind of multivariate Gaussian distribution. Once this is shown, one can phrase computing the supremum of $C$ as a convex optimization problem over the parameters of this Gaussian distribution and then prove some bound on it.

For proving Brascamp–Lieb it is sufficient to simply bound $C$, but it turns out that the parameters achieving the optimal $C$, as a function of $F$ and $\gamma$, have a very interesting property. Since the derivative of the objective function at this point must be zero, some algebraic manipulations show that one can obtain from them an invertible linear transformation $G$ such that
$\displaystyle \sum_{i=1}^n \gamma_i \frac{G(F_i)G(F_i)^\top}{\|G(F_i)\|^2} = I \qquad (*)$
where $I$ is the identity map on $\mathbb{R}^d$ and $F_i$ is the $d$-dimensional vector corresponding to the $i$-th coordinate of $F$. The condition (*) looks somewhat mysterious, but one way to think about it is that after a change of basis and rescaling so that each $F_i$ is a unit vector, we can make the vectors $F_1, \ldots, F_n$ be “evenly spread out” in $\mathbb{R}^d$, in the sense that no direction is favored over any other.
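To make (*) concrete, here is a toy family that is already in this “evenly spread out” (isotropic) position with $\gamma_i = d/n$: the $n$-th roots of unity viewed as unit vectors in $\mathbb{R}^2$. (This just illustrates the condition itself, not Barthe’s scaling procedure.)

```python
import numpy as np

# n unit vectors in R^2 (d = 2): the n-th roots of unity.
n, d = 7, 2
angles = 2 * np.pi * np.arange(n) / n
U = np.stack([np.cos(angles), np.sin(angles)], axis=1)

# Condition (*) with gamma_i = d/n: (d/n) * sum_i u_i u_i^T = I.
M = (d / n) * sum(np.outer(u, u) for u in U)
assert np.allclose(M, np.eye(d))
print(M)
```

No direction in the plane is favored: projecting onto any unit direction and summing the squared projections (weighted by $d/n$) always gives 1.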

This turns out to be useful in some surprising contexts. For example, the notion of sign rank of a matrix is very important in several communication complexity applications. The sign rank of a matrix $A \in \{\pm 1\}^{n \times n}$ is the minimum rank of a matrix $B$ such that $\mathrm{sign}(B_{i,j}) = A_{i,j}$ for every $i,j$. Clearly the sign rank might be much smaller than the rank, and proving that a matrix has large sign rank is a non-trivial matter. Nevertheless, Forster managed to prove the following theorem:

Forster’s Theorem: The sign rank of $A$ is at least $n/\|A\|$, where $\|A\|$ is the spectral norm of $A$.

In particular, this gives a highly non-trivial lower bound of $\sqrt{n}$ on the sign rank of the $n \times n$ Hadamard matrix.
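One can verify the numbers in this example directly: for the $n \times n$ Hadamard matrix $H$ we have $HH^\top = nI$, so every singular value equals $\sqrt{n}$ and Forster’s bound gives $n/\sqrt{n} = \sqrt{n}$. A quick numerical check (the Sylvester construction below is one standard way to build Hadamard matrices):

```python
import numpy as np

def hadamard(k):
    """2^k x 2^k +/-1 Hadamard matrix via the Sylvester construction."""
    H = np.array([[1.0]])
    for _ in range(k):
        H = np.block([[H, H], [H, -H]])
    return H

# Forster's bound: sign-rank(A) >= n / ||A|| (spectral norm).
n = 16
H = hadamard(4)
spectral_norm = np.linalg.norm(H, 2)   # largest singular value
bound = n / spectral_norm

assert np.isclose(spectral_norm, np.sqrt(n))  # ||H|| = sqrt(n)
assert np.isclose(bound, np.sqrt(n))          # sign-rank(H) >= sqrt(n) = 4
```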

The proof is easy given the corollary (*) above. (Forster wasn’t aware of Barthe’s work; as far as I know, the connection between the two was first discovered by Moritz Hardt.) Suppose that there exists $B$ of rank $d$ such that $\mathrm{sign}(B_{i,j}) = A_{i,j}$ for every $i,j$. Then we can write $B_{i,j} = \langle u_i, v_j\rangle$ for some vectors $u_1, \ldots, u_n, v_1, \ldots, v_n \in \mathbb{R}^d$. By making a tiny perturbation, we can assume that for every $d$-sized subset $S \subseteq [n]$ the vectors $\{u_i\}_{i \in S}$ are linearly independent. In particular, the vector $\gamma = (d/n, \ldots, d/n)$ can then be expressed as a convex combination of vectors of the form $\mathbb{1}_S$, and so after applying some change of basis and rescaling (which does not affect the sign of $\langle u_i, v_j\rangle$) we can assume without loss of generality that every $u_i, v_j$ is a unit vector and moreover $(d/n)\sum u_i u_i^\top = I$.

Now note that under our assumption $B_{i,j}A_{i,j} = |B_{i,j}|$ and hence

$\displaystyle \sum |B_{i,j}| = A\cdot B \leq \|A\|\sqrt{d}\sqrt{Tr(BB^\top)}$

The last inequality follows from the fact that for every two matrices $A, B$, $A \cdot B \leq \sum \lambda_i\sigma_i \leq \max_i\{\lambda_i\}\sum \sigma_i$, where the $\lambda_i$’s and $\sigma_i$’s are the singular values of $A$ and $B$ respectively. We then use the fact that $\|A\| = \max_i\{\lambda_i\}$ and that $B$ has rank $d$, so at most $d$ of the $\sigma_i$’s are nonzero; hence by Cauchy–Schwarz $\sum \sigma_i \leq \sqrt{d}\sqrt{\sum \sigma_i^2} = \sqrt{d}\sqrt{\mathrm{Tr}(BB^\top)}$. However, by the condition (*), $\mathrm{Tr}(BB^\top) = \sum_{i,j}\langle u_i, v_j\rangle^2 = n^2/d$, and on the other hand (since every entry of $B$ is an inner product of unit vectors and hence at most one in absolute value) $\sum |B_{i,j}| = \sum_{i,j}|\langle u_i, v_j\rangle| \geq \sum_{i,j}|\langle u_i, v_j\rangle|^2 = n^2/d$. So we get

$\displaystyle n^2/d \leq \|A\|\sqrt{d}\sqrt{n^2/d}$

or $n/\|A\| \leq d$, as we wanted.

[I don’t have much to add to Luca, Scott, and Lipton/Regan on the exciting announcement by Babai of a quasipolynomial time graph isomorphism algorithm. I can’t wait to hear the details! –Boaz]

Aided by some very generous gifts, Computer Science is on a growth streak at Harvard, and in particular there are some new opportunities in Theoretical Computer Science. We have positions at all levels, including graduate students, postdocs, faculty, and visitors/sabbaticals; see http://toc.seas.harvard.edu/positions

As in every year, we encourage students interested in theoretical computer science to apply for graduate studies at Harvard. Starting this year, we also have new postdoc positions. Both students and postdocs have the opportunity both to work with the strong (and growing) theory group at Harvard and to take advantage of the unique intellectual environment of the university as a whole and the larger Boston/Cambridge area.

We have several postdoc opportunities at Harvard Computer Science and affiliated institutions. In addition to those we are happy to announce the inaugural Michael O. Rabin postdoctoral fellowship in Theoretical Computer Science. Rabin fellows will receive a generous salary as well as an annual allocation for research and travel expenses, and are free to pursue their own research agenda with no strings attached.

Candidates can apply to all postdoc positions via a unified application process at https://academicpositions.harvard.edu/postings/6477
For full consideration, the complete application, including reference letters, should be submitted by December 1, 2015. Email theory-postdoc-apply (at) seas.harvard.edu for any questions.

Boaz Barak, Yiling Chen, Harry Lewis, Michael Mitzenmacher, Jelani Nelson, David Parkes, Yaron Singer, Madhu Sudan, Salil Vadhan and Leslie Valiant.

p.s. Harvard is an equal opportunity employer and all qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability status, protected veteran status, or any other characteristic protected by law.

As a Ph.D student, I searched for advice on succeeding in grad school, and often thought that eventually I would repay the favor when I became a professor. I never got around to doing that, but now decided to give it a shot. So, here goes:

Tip #1: Don’t get advice from the Internet.

Grad school, and more generally research life, can differ so widely based on the field, the university, and the personalities involved, that what is a sound strategy for one person can be a terrible idea for another. It’s always better to get advice from a wise and experienced person who knows and cares about you.

Given this advice, I wouldn’t blame you if you stopped reading this blog post at this point. In fact, you probably should. But I will offer some more tips just in case, though keep in mind that they may not be applicable to your situation.

Tip #2: Remember you have other options.

Much of the advice you see online (and maybe even get in person) will include a long list of do’s and don’ts and hoops to jump through on the way to the final goal of becoming a tenured professor. Now, from my personal experience, being a tenured professor is pretty nice, but if you are even considering a career in theoretical computer science, chances are you have the skills and talent for many alternative career paths that have a number of significant advantages over the academic life. The main benefit of the latter is that you get to set your own goals and not jump through other people’s hoops. Don’t lose track of that.

Tip #3: Research is hard.

Research is very hard, or at least it’s very hard for me. Doing research is often not so much about solving a fixed problem given to us, but about setting out goals and revising them as we learn more. As we do so, the pendulum often swings between the states where the current problem we’re trying to solve is trivial, impossible, already known, or not well formed. Most of the time, though, is spent staring at a brick wall, trying to think of some way to bypass it.  (Metaphorically speaking, of course; what you’ll actually be staring at is an empty notepad and a cup of coffee.) An added difficulty is that all this hard work is often invisible, so you mostly get to observe other people’s successes and feel that you are the only one having such a hard time.

One of the common fallacies of beginning students is thinking that it’s all about innate talent. When you see people solve mathematical problems in seemingly no time, or hear stories about those geniuses that solved the main question of someone’s Ph.D thesis in a day, you may feel that success in research is out of your control. I have met many highly successful researchers over the years, and while some are insanely quick, others can take time to do even the simplest calculations. Success in research comes in a great many forms and people can have very different styles of work, personalities, types of questions they are interested in, and more. The only common denominator is that, as a rule, successful researchers are passionate about what they do and even those for which success seems to come “easily” actually work very hard at it.

Tip #4: You are your own boss.

As an undergraduate, you are used to getting feedback on how well you’re doing in the form of grades. Many corporations have periodic reviews and evaluations for employees. In the academic world, feedback is quite rare, and can come in varying forms. Part of this is because professors don’t like to have awkward conversations. But fundamentally it is because you are truly the measure of your own success. As a theoretical computer scientist, I don’t need my students to run experiments for me, write code, or even prove theorems. I view my role as truly an advisor to the student as they find out what they are passionate about and great at. The best ones eventually find their own “research compass” and set and pursue their own goals in a unique way.

Tip #5: Be a good boss.

Since you are managing yourself, you should try to do a good job at it. Here I cannot really give any general advice. Some students should be tougher on themselves, and push harder. Others are too hard on themselves, and anxiety and/or depression are quite common among graduate students (and beyond). As I said, being a researcher, at any career stage, is a hard job. In theoretical studies, maintaining motivation is especially challenging, since one can spend days, months or more with no visible progress – there are no lines of code being written, no data being collected. I would also caution against measuring progress by publications. My philosophy is that any day in which you learned something is a good day. Such learning can take many forms, including thinking about a problem, learning some new or not so new cool ideas by attending a seminar or talking to a colleague, reinventing old ideas, reading a paper, figuring out why some approach doesn’t work, and more. I’ve heard a talk by Sanjay Gupta in which he gave the following advice — “try to do every day one thing that scares you” – I think this applies to research as well.

Just try to spend a good amount of your time on the things that truly matter to you, and less time worrying about the job market, who got published where, or reading advice blogs from some random professors on the Internet.

There is a longstanding feud in statistics between “frequentists” and “Bayesians”. One way to understand these two camps is to consider the following statements:

• The probability that my paternal great great grandfather had blue eyes is 5%

• The probability that the $10^{10^{10}}$-th digit of $\pi$ is 7 is 10%

A frequentist would say that these statements are meaningless— even if I don’t know his eye color, my great great grandfather either had blue eyes or didn’t, and there is no sense in which it can be blue with probability that is neither 0 nor 1. Similarly, the $10^{10^{10}}$-th digit of $\pi$ is either 7 or isn’t. A Bayesian would say that such probabilities make perfect sense, and in particular one would be completely rational to take, say, an 11 to 1 bet that the $10^{10^{10}}$-th digit of $\pi$ is 7. Moreover, almost any real-life application of probability theory involves probabilities over events that are in principle pre-determined. After all, even the results of a toss of a die are completely determined by the initial conditions. (This is of course a caricature of both positions, which are subtler than that; also many statisticians are pragmatists who can use either a frequentist or Bayesian perspective in different situations; you can do far worse than watching Michael Jordan’s talk on this issue.)
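For the record, the arithmetic behind that bet: at a believed probability of 0.1, an 11-to-1 payout has positive expected value.

```python
# Expected profit per $1 staked on an 11-to-1 bet that you believe
# wins with probability 0.1 (the Bayesian's subjective probability).
p = 0.1
expected = p * 11 - (1 - p) * 1
print(expected)  # positive (0.2), so the bet is rational under these beliefs
```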

As theoretical computer scientists, we are naturally frequentists. For us a probability is a measure over some sample space. Moreover, our canonical activity is worst case analysis of algorithms, which meshes well with the frequentist point of view that doesn’t assume any prior distribution over the inputs one might encounter in practice. The canonical notion of pseudo randomness corresponds to a computational analog of frequentist probability. In this blog post I will discuss a different notion of “pseudo randomness”, arising from the study of computational proof systems and convex programming, that can be viewed as a computational analog of Bayesian probability. Borrowing a page from quantum mechanics, this notion will involve objects that behave like probability measures but where some of the probabilities can actually be negative.

## “Classical” pseudo-randomness theory

Before describing this new notion, it might be useful to discuss the “classical” notion of pseudo-randomness as was developed in the 1980s works of Shamir, Blum, Micali, Yao, Goldwasser and others. Recall that a probability measure is a function $\mu:\Omega \rightarrow \mathbb{R}$ where $\Omega$ is the sample space (which you can think of as the set of binary strings of length $n$ for some $n$), and $\mu$ satisfies that $\mu(\omega) \in [0,1]$ for all $\omega\in\Omega$ and $\sum_{\omega\in \Omega}\mu(\omega)=1$. For a function $T:\Omega\rightarrow\mathbb{R}$, we define the expectation of $T$ under $\mu$ to be $\mathbb{E}_\mu T := \langle \mu,T\rangle := \sum_{\omega\in \Omega} \mu(\omega)T(\omega)$.

Note that two distributions $\mu$ and $\mu'$ are identical if and only if $\mathbb{E}_{\mu} T = \mathbb{E}_{\mu'} T$ for every function $T$. Pseudorandomness arises from relaxing this notion to consider only functions that are efficiently computable. That is, loosely speaking, two measures $\mu$ and $\mu'$ are computationally indistinguishable if for every computationally bounded test $T$ (i.e., function $T:\Omega\rightarrow \{0,1\}$ that can be efficiently computed), $\mathbb{E}_\mu T$ and $\mathbb{E}_{\mu'} T$ are very close to one another. We say that $\mu$ is pseudorandom if it is computationally indistinguishable from the uniform distribution (i.e., the measure $\mu'(\omega)=1/|\Omega|$).

I deliberately omitted the definition of “computationally bounded” and it can change in different settings. Sometimes we consider the set of all tests that can be computed by an efficient Turing machine or Boolean circuit (and then need to rely on computational assumptions), and in some applications one can consider more restricted families of tests (such as in Braverman’s result where he showed that distributions with polylogarithmic independence fool all small constant-depth circuits, or the result of Green and Tao that a certain distribution on “almost primes” fools a particular class of tests related to arithmetic progressions).
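A classical concrete instance of a restricted family of tests being fooled: the uniform distribution over even-parity strings agrees with the truly uniform distribution on every test that depends on fewer than $n$ coordinates, yet the (global) parity test distinguishes them perfectly. A minimal sketch:

```python
from itertools import product

n = 4
omega = list(product([0, 1], repeat=n))
uniform = {x: 1 / len(omega) for x in omega}

# mu: uniform over even-parity strings (an (n-1)-wise independent distribution).
even = {x for x in omega if sum(x) % 2 == 0}
mu = {x: (1 / len(even) if x in even else 0.0) for x in omega}

def expect(dist, T):
    """Expectation of test T under distribution dist."""
    return sum(p * T(x) for x, p in dist.items())

# A test depending on at most n-1 coordinates cannot tell mu from uniform...
T_local = lambda x: x[0] * x[1] * x[2]  # looks at only 3 < n coordinates
assert abs(expect(mu, T_local) - expect(uniform, T_local)) < 1e-12

# ...but the global parity test distinguishes them completely.
T_parity = lambda x: sum(x) % 2
print(expect(uniform, T_parity), expect(mu, T_parity))  # 0.5 0.0
```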

Pseudorandomness and computational indistinguishability have found a great number of applications in many areas of computer science and mathematics, see Salil Vadhan’s monograph for more on this beautiful area.

## Bayesian pseudo-randomness

Bayesian pseudo-distributions arise from taking a computational analog of Bayesian probability. Consider the following scenario: suppose that we are given a SAT formula $\varphi$ that has a unique assignment $x$. A frequentist would say that there is no sense to a statement such as “the probability that $x_7=1$ is 60%”. Even a (non computational) Bayesian might have some qualms about defending such a statement. After all, the observation $\varphi$ completely determines $x$, and so according to standard Bayesian analysis, the only posterior distribution that makes sense is the one that puts all weight on the single string $x$. But yet to a computer scientist it’s clear that simply because $x$ is completely determined by the formula $\varphi$, it doesn’t mean that we can actually recover it. Hence we have some uncertainty about it, and encoding this uncertainty via probabilities seems rather natural.

But how do we actually go about it? Rather than considering efficient tests, we will focus our attention on efficient deduction or proof systems. That is, we say that $\mu$ is a pseudo distribution consistent with $\varphi$ if it satisfies that $\mathbb{E}_{\mu} T \geq \alpha$ for every function $T$ and number $\alpha$ such that a computationally bounded observer can deduce the conclusion that $T(x) \geq \alpha$ from the assumption that $x$ satisfies $\varphi$. As above we are deliberately vague on how we model the deductive power of computationally bounded observers, but in fact I also glossed over an even more basic point— what kind of an object is $\mu$? The initial guess is that $\mu$ is simply another distribution — it might not be the true distribution $\{ x \}$ that is supported on the unique satisfying assignment, but simply has a different support. However one can quickly see that this is not possible: under any reasonable model of computation, a bounded observer would be able to deduce that $\mathbb{E}_{\mu} C(x) = 1$ for every clause $C$ of $\varphi$. For example, if $\varphi$ contains the clause $\overline{x_7} \vee x_{15} \vee x_{18}$ then clearly it is trivial to deduce that every distribution $\mu$ over satisfying assignments would satisfy $\mathbb{E}_{\mu} \overline{x_7} \vee x_{15} \vee x_{18} = 1$. But note that if $\mu$ was supported on even a single string that failed to satisfy this clause then it would have been the case that $\mathbb{E}_{\mu} \overline{x_7} \vee x_{15} \vee x_{18} < 1$. Since the same reasoning applies to all clauses, we can conclude that the only distribution $\mu$ that is consistent with $\varphi$ is the distribution $\{ x \}$ over the unique satisfying assignment.

The above seems to trivialize the definition, and again rule out the possibility of statements such as “$\mathbb{E}_{\mu} x_7 = 0.6$“. However, this only happened because we insisted on $\mu$ being an actual distribution. The definition didn’t require that, but merely required the existence of the operator mapping $T$ to the number $\mathbb{E}_{\mu} T$ where $T$ is in some class that computationally bounded observers can argue about. Such an operator need not correspond to an actual distribution, though if it is linear (which makes sense when we think of a class $T$ closed under taking linear combinations), then it must have the form $\mathbb{E}_{\mu} T = \langle f,T \rangle$ for some function $f:\Omega\rightarrow\mathbb{R}$. In fact, we will identify this function $f$ with $\mu$. Surely a bounded observer can deduce that $\mathbb{E} 1 = 1$ and hence we can assume $\sum_{\omega\in\Omega} \mu(\omega)=1$. The only constraint we might not be able to enforce is that $\mu(\omega) \geq 0$ for all $\omega$. Hence we can think of $\mu$ as a “distribution” that can take negative probabilities, as long as for “simple” functions $T$, it is still the case that $\mathbb{E}_{\mu} T \geq 0$ whenever a computationally bounded observer can prove that $T$ is non-negative.

The sum of squares proof system (also known as Positivstellensatz) is one concrete candidate for a bounded deduction system. There the simple functions $T$ correspond to low degree polynomials, and (roughly speaking) a computationally bounded prover can deduce from $P\geq 0$ and $Q\geq 0$ that $PQ \geq 0$ and $P+Q \geq 0$ and can use the axiom that $P^2 \geq 0$ for every low degree $P$. This gives rise to a rather clean notion of pseudo-distributions where we can make precise statements such as “$\mathbb{E}_{\mu} x_7 = 0.6$” where $\mu$ is a distribution over satisfying assignments of $\varphi$, even when $\varphi$ has a unique satisfying assignment. In fact, we can even make such statements when $\varphi$ has no satisfying assignments, as long as a computationally bounded prover cannot certify that this is the case. Perhaps even more exciting is that we have some reasons to believe (or at least hope) that in some contexts this proof system is optimal in the sense that it captures the capabilities of every computationally bounded observer. There are still many more questions in this research area. See my lecture notes and survey for more on this topic.
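To see how negative probabilities can hide from low-degree reasoning, here is a toy signed “distribution” on $\{0,1\}^2$ (the weights are an arbitrary illustrative choice, not derived from any formula): one of its probabilities is negative, yet its degree-2 moment matrix is positive semidefinite, so every pseudo-expectation of a square of a linear polynomial is still non-negative.

```python
import numpy as np

# A signed "distribution" on {0,1}^2: note the negative weight on (0, 0).
# These weights are an illustrative choice, not tied to any SAT formula.
omega = [(0, 0), (0, 1), (1, 0), (1, 1)]
mu = {(0, 0): -0.1, (0, 1): 0.4, (1, 0): 0.4, (1, 1): 0.3}
assert abs(sum(mu.values()) - 1.0) < 1e-12  # E[1] = 1

def pE(f):
    """Pseudo-expectation of f under the signed measure mu."""
    return sum(mu[x] * f(x) for x in omega)

# Degree-2 moment matrix in the monomial basis (1, x1, x2); on the Boolean
# cube x_i^2 = x_i, so E[p^2] = c^T M c for p = c0 + c1*x1 + c2*x2.
M = np.array([
    [pE(lambda x: 1),    pE(lambda x: x[0]),        pE(lambda x: x[1])],
    [pE(lambda x: x[0]), pE(lambda x: x[0]),        pE(lambda x: x[0] * x[1])],
    [pE(lambda x: x[1]), pE(lambda x: x[0] * x[1]), pE(lambda x: x[1])],
])

# M is PSD, so E[p^2] >= 0 for every linear p, despite the negative weight.
assert np.linalg.eigvalsh(M).min() > -1e-12
print(M)
```

A degree-2 sum-of-squares observer, which can only evaluate pseudo-expectations of such squares, has no way to “see” the negative probability.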

I do believe that this notion of “pseudo distribution” can be applicable beyond the sum-of-squares algorithm. For example, in number theory there are many heuristics that treat deterministic objects (such as the primes) as if they were chosen at random, and I wonder if these ideas can help formalize such intuitions. A similar situation arises in the study of random constraint satisfaction problems using local algorithms. For a random SAT formula with $n$ variables and $cn$ clauses, the Belief Propagation algorithm correctly predicts the entropy of the distribution over satisfying assignments for $c$ up to roughly 3.86, but then diverges from the truth and in particular predicts that this entropy is positive (and hence the formula is satisfiable) when $c$ can be as large as 4.667, whereas the true threshold is roughly 4.267 (see Chapter 14 of this excellent book). One can imagine a notion of “belief propagation pseudo-distribution” which coincides with the actual distribution up to some density but then diverges from it as the problem becomes more computationally difficult. I am also wondering if it can be related to cryptographic notions such as “inaccessible entropy” (e.g., see here) or to questions in computational economics where we try to model the beliefs of computationally bounded agents.

As a cryptographer, I am used to the viewpoint that computational problems are presumed hard unless proven otherwise. We cryptographers constantly come up with new hardness assumptions, and generally use the heuristic that if a problem doesn’t have an obvious algorithm then it must be hard. We’ve had some interesting failures (see here for a recent example), but this heuristic seems to be successful more often than not.

In most other fields of study, people have the opposite intuition. Algorithms researchers are generally in the business of solving problems, and not giving up any time the obvious approach doesn’t work, and so take the viewpoint that a problem is presumed easy unless it is proven hard. In other fields people seem to take it a step further and claim that a problem is easy even if it is proven to be hard. People working on SAT solving and model checking solve SAT instances on a regular basis, leading some of them to claim that SAT is easy in practice. Similarly, many problems in machine learning are NP hard, but are yet solved all the time by current heuristics (see here for more).

While finding a Nash equilibrium is considered a hard computational problem, economists like to point out (as Eva Tardos told me recently) that there is milk on the shelves in stores, which means that producers and consumers managed somehow to come to an equilibrium where the market clears. (Interestingly, it seems that the process in which people do so is not far from the same heuristics used in machine learning.)

One could say that cryptography is the outlier because the instances there are designed to be hard. Perhaps hardness is this rare phenomenon that needs to be searched for, and P equals NP for “natural” instances. On the other hand, by counting arguments a priori we would expect a “random” or “generic” problem to be hard. Moreover, perhaps in those other fields the problems are designed to be easy. After all, when we use SAT to do model checking, we do so to verify the correctness of a system that was in fact designed by some human to be correct, and would perhaps have been verifiable (with a lot of effort) by a human as well. So perhaps it’s not so surprising that algorithms are able to mimic this reasoning. Similarly, machine learning is mostly about trying to do tasks that humans already succeed in. One could also argue that as a society we design rules that would make achieving equilibrium tractable, and perhaps for this reason we use a fixed price for milk as opposed to some complicated contracts.

So is hardness the rule or the exception? I don’t really know.  Perhaps the physicists should decide, as the problems they study come from nature, who presumably doesn’t have any bias toward hardness or easiness. Or does she? After all, nature needs to compute as well (at least in the forward direction). Random constraint satisfaction problems, such as random 3SAT, seem to be a meeting point for physicists, computational complexity folks, cryptographers, and more, and a recent paper makes some algorithmic progress on 3SAT, though it does seem to still be bounded away from the threshold, and it is possible the algorithmic picture clarifies for kSAT and kXOR as k grows.

What do you think?

[Forwarding an announcement by Prasad Raghavendra –Boaz]
FOCS 2015 will be held in Berkeley, California on October 18–20, 2015. Registration is open at:
http://focs15.simons.berkeley.edu/registration.html
The deadline for early registration is Sept 25th.
KARPfest80

On Saturday October 17, the day immediately before FOCS 2015, the Simons Institute for the Theory of Computing will host a celebration of the work of Dick Karp on the occasion of his 80th birthday. This event, which features a dozen speakers from across the spectrum of theoretical computer science, will also be held at the DoubleTree Marina Hotel, the same venue as FOCS itself. All participants are asked to register to attend; registration is free, and is done in conjunction with FOCS registration. Visit KARPfest80 for more information.

1. A lovely interview with Cynthia Dwork in the New York Times on bias in computations. In particular, discussing our work (joint with Moritz Hardt, Toni Pitassi and Rich Zemel) on Fairness through Awareness.
2. Our Science article, The reusable holdout: Preserving validity in adaptive data analysis (joint with Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi and Aaron Roth), is out.

---

update (8/16/2015): On Privacy and Anonymisation (with Cynthia Dwork and Salil Vadhan) in The Economist