On the importance of the alphabet

In my last post, we saw that the problem of learning juntas, hard as it is over Boolean inputs, seems even worse over other alphabets. Coding theory happens to have an inexhaustible supply of such problems. Some are long-standing open problems, others are of more recent vintage, and new ones seem to crop up all the time. And as will be clear from the list below, this issue does seem to follow me around.

Here are four of my favorite problems, completely solved for q =2 but wide open for any other alphabet (which we take to be some finite field).

Optimal codes of distance d: Arguably the mother of all such problems. How many parity checks must a q-ary code of length n and distance d have? Here we think of d as a constant and n going to infinity. For q = 2, the Hamming bound tells us that d \log n /2 parity checks are necessary, and the BCH construction tells us that this is also sufficient. For larger alphabets, the best lower bound is still the Hamming bound, which still gives d \log n/2. Alas, the BCH upper bound is now (1 - \frac{1}{q})d \log n. For more on this problem and its connections to additive combinatorics, see here.
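To see where the redundancy figure comes from, one can compute the sphere-packing bound directly: r parity checks can suffice only if q^r is at least the volume of a Hamming ball of radius (d-1)/2. A quick sketch in Python (the helper name is mine):

```python
from math import comb

def hamming_bound_redundancy(n, d, q=2):
    """Smallest r with q^r >= |Ball(t)|, t = (d-1)//2: a q-ary code of
    length n and distance d needs at least this many parity checks."""
    t = (d - 1) // 2
    ball = sum(comb(n, i) * (q - 1) ** i for i in range(t + 1))
    r = 0
    while q ** r < ball:
        r += 1
    return r
```

For the binary Hamming codes (e.g. the [7,4,3] and [15,11,3] codes) the bound is exactly tight, which is the d = 2t + 1 = 3 case of the story above.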

List-decoding Reed-Muller codes: Given a function f : \mathbb{F}_q^n \rightarrow \mathbb{F}_q, our goal is to recover all degree-d polynomials that are close to f. Of course, we can hope to solve this in reasonable time only if the list is not too large (say, a constant independent of the dimension n). The question is what bound on the distance from f (equivalently, the error-rate) is enough to ensure that the list is small; the conjecture is that it is the minimum distance of the degree-d Reed-Muller code. The degree 1 case is the problem of Hadamard decoding, solved by Goldreich and Levin. In joint work with Klivans and Zuckerman, we solved the q = 2 case for all d, and in a subsequent paper, the d = 2 case for all q. The case of larger q and d is open.

The bottleneck in working with larger alphabets is the following. Fix a polynomial P which has good agreement with f. The Goldreich-Levin and GKZ algorithms “guess” the values of P on a c-dimensional subspace S, where c is a constant. They then use this to recover P on every (c+1)-dimensional subspace containing S. Why is this easier? Knowing the right values on a subspace of co-dimension 1 effectively reduces the error-rate by a factor of 1 - 1/q. For q = 2, this halves the error and puts us in the unique-decoding regime, where things are easy. But for larger q, this error-reduction does not always suffice to specify P unambiguously.

Decoding from errors and erasures: Assume that we have an [n,k,d]_q linear code. Such a code can correct (d-1)/2 errors and d-1 erasures. Assume that we have efficient algorithms for both (the latter always exists for a linear code). Can we combine these in a black-box fashion to get an efficient algorithm that corrects e errors and f erasures as long as 2e + f \leq d-1? When q = 2, the answer is yes (take a guess!). I don’t know the answer for larger q. I came upon this problem in conversations with Atri Rudra; perhaps there is a reference for it? Or better still, a solution?
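For q = 2, the guess is literal: fill the erased coordinates once with all 0s and once with all 1s, and run the errors-only decoder on both fills. One of the two fills introduces at most f/2 extra errors, so e + f/2 \leq (d-1)/2 lands within the unique-decoding radius. Here is a toy sketch using the binary [n,1,n] repetition code (my example, not from any reference):

```python
def majority_decode(word, n):
    """Unique decoder for the binary [n,1,n] repetition code:
    majority vote corrects up to (n-1)/2 errors."""
    return 1 if 2 * sum(word) > n else 0

def errors_and_erasures_decode(received, n):
    """received: list of 0/1/None (None = an erasure).
    Guess the erasures: fill them once with 0s and once with 1s, run
    the errors-only decoder on both fills, and keep the candidate that
    agrees with more of the unerased positions."""
    candidates = []
    for fill in (0, 1):
        filled = [fill if c is None else c for c in received]
        candidates.append(majority_decode(filled, n))
    # for the repetition code the codeword for bit b is (b, b, ..., b)
    return max(candidates, key=lambda b: sum(1 for c in received if c == b))
```

With n = 9 the code has d = 9, so e = 2 errors plus f = 4 erasures satisfies 2e + f \leq d - 1 and is decodable, even though 2 errors and 4 erasures together would defeat either decoder run on a bad fill.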

Learning linear functions with noise: Learning parity with noise is a central problem in learning that is widely believed to be hard. We are given random examples of points in \mathbb{F}_2^n labeled by some unknown parity function L; the catch is that each label is flipped independently with probability 0.1. A seemingly even harder problem is when the noise is adversarial, and the adversary is allowed to flip an arbitrary 0.1 fraction of the labels over all points in \mathbb{F}_2^n. In joint work with Feldman, Khot and Ponnuswami, we show that the two problems are in fact equivalent: there is a reduction from adversarial noise to random noise.
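For concreteness, here is a sketch of the random-noise example oracle (hypothetical helper; the parity L is represented by its coefficient vector s, and the label is \langle x, s\rangle over \mathbb{F}_2):

```python
import random

def lpn_samples(s, m, eta=0.1, seed=0):
    """m random examples (x, <x,s> mod 2), with each label flipped
    independently with probability eta."""
    rng = random.Random(seed)
    out = []
    for _ in range(m):
        x = [rng.randrange(2) for _ in range(len(s))]
        label = sum(a * b for a, b in zip(x, s)) % 2
        if rng.random() < eta:
            label ^= 1  # random classification noise
        out.append((x, label))
    return out
```

The adversarial variant replaces the independent coin flips by a fixed 0.1 fraction of corrupted labels chosen by an adversary who sees everything; the reduction above says the learner's task is no harder in that model.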

In retrospect, perhaps this is not too surprising: learning parity with noise is already so very hard that making the noise adversarial does not make it worse. Well, the same philosophy ought to apply to other alphabets. But I don’t know of a similar reduction for any other alphabet. This paper makes some progress on this problem, reducing adversarial noise to a more benign noise model, but does not solve the problem fully.

If you have a problem of this flavor, I’d love to hear about it. I am unlikely to have anything insightful to say, but I’d be happy to commiserate with you about how annoying q = 3 can be.

Learning Juntas

Computational learning is full of problems that are deceptively simple to state but fiendishly hard to solve. Perhaps none more so than the problem of learning Juntas, posed by Avrim Blum and Pat Langley. It’s the kind of problem that seems well suited for a polymath endeavor; Dick Lipton likes to say that you could think about it while running for the bus. Alas, I haven’t ridden many buses here in the Bay Area (and coincidentally haven’t had any good ideas about this problem).

But what I’d like to talk about in this post is a variant of the problem which I think is kind of cute. The question is from a paper by Mossel, O’Donnell and Servedio; this particular formulation came up in discussions with Ryan O’Donnell, Amir Shpilka and Yi Wu (sorry guys, I know you’d rather not be associated with this one).

Let us begin with the problem of learning Juntas. The setup is that of learning an unknown Boolean function under the uniform distribution: we are given random examples from \{0,1\}^n labelled by some function f:\{0,1\}^n \rightarrow \{0,1\}. Our goal is to learn f. This is only possible if we assume that f has some nice structure. In the Junta learning problem, we are told that f depends on only k of the n input variables, where we think of k \leq \log n (or just anything \omega(1)). In fact, just find one relevant variable and you’re done.

It is trivial to solve the problem in time O(n^k). The first non-trivial algorithm for this problem was given by Mossel, O’Donnell and Servedio; it runs in time n^{\omega k/(\omega +1)}, where \omega is the matrix multiplication exponent. So it will never do better than n^{2k/3}. But (as one of the authors was kind enough to point out), it gives you something interesting even for \omega = 3. Or 4. Or more. An algorithm with a better running time than MOS (and better than n^{2k/3}) was recently found by (Greg) Valiant. Better bounds are known for symmetric functions, due to Lipton, Markakis, Mehta and Vishnoi and to Shpilka and Tal.
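For reference, the trivial O(n^k)-type algorithm simply tries every k-subset of the coordinates and checks consistency with the sample; a sketch (the names are mine):

```python
from itertools import combinations, product

def find_junta(samples, n, k):
    """Try every k-subset of coordinates and return one whose induced
    truth table is consistent with all labeled samples (~ n^k subsets)."""
    for S in combinations(range(n), k):
        table, ok = {}, True
        for x, y in samples:
            key = tuple(x[i] for i in S)
            if table.setdefault(key, y) != y:
                ok = False
                break
        if ok:
            return S, table
    return None

# toy demo: a 2-junta on 8 variables depending on coordinates 2 and 5
samples = [(x, x[2] ^ x[5]) for x in product((0, 1), repeat=8)]
```

All the work in the junta problem is in beating this exhaustive search.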

Here is a bare-bones sketch of the MOS algorithm. A natural approach to finding a relevant variable is to compute correlations between the n input variables and f; if we find a correlation, we have a relevant variable. If that fails, we could try pairs of variables, triples of variables and so on, essentially running the low-degree algorithm of Linial, Mansour and Nisan. A cheeky way to defeat this algorithm (at least for a while) is to take some function, say Majority on 2k/3 bits, and XOR it with Parity on k/3 bits. This has the effect of making the resulting function uncorrelated with every subset of fewer than k/3 variables. However, note that the resulting function is not very complex when viewed as a polynomial over GF[2]. Even if Majority had full degree over GF[2] (it does not), the resulting function would only have degree 2k/3, since the Parity part does not increase the degree at all.
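The vanishing correlations are easy to verify by brute force. A sketch with Majority on 5 bits XORed with Parity on 3 bits (small stand-ins for 2k/3 and k/3): every Fourier coefficient on a set that misses even one of the parity variables is zero, so in particular f is uncorrelated with every subset of fewer than 3 variables.

```python
from itertools import product, combinations

def maj_xor_parity(x):
    """Majority on bits 0..4 XOR Parity on bits 5..7 (toy stand-ins for
    Majority on 2k/3 bits and Parity on k/3 bits)."""
    maj = 1 if sum(x[:5]) >= 3 else 0
    return maj ^ (sum(x[5:]) % 2)

def correlation(S, n=8):
    """Fourier coefficient E[(-1)^{f(x)} * (-1)^{sum_{i in S} x_i}]."""
    tot = sum((-1) ** (maj_xor_parity(x) + sum(x[i] for i in S))
              for x in product((0, 1), repeat=n))
    return tot / 2 ** n

# every subset of fewer than 3 variables
small_sets = [S for r in range(3) for S in combinations(range(8), r)]
```

The first nonzero coefficients appear on sets containing all three parity variables plus an odd subset of the majority variables, e.g. \{0,5,6,7\}.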

The MOS algorithm relies on the insight that these are (essentially) the only functions that evade the low-degree algorithm: any such function must have low degree as a polynomial over GF[2], and is then easy to learn using polynomial interpolation over GF[2]. The precise result they use is due to Siegenthaler, from 1984.

Much as I love polynomials over GF[2], one can’t help wondering how robust this approach is. Suppose we were trying to learn a Boolean function but over a ternary alphabet. Would the right analogue be polynomials over GF[3]? Over an alphabet of size 4 (A,T,G,C), should the polynomials be over GF[4] or Z_4? And I’d rather not think about alphabets of size 6.

The larger alphabet is mentioned as an open problem in MOS, and as far as I know, no one has managed to do anything with it (it is quite likely that no one has tried too hard). But I think this might turn out to be a worthwhile digression that could shed some light on the original problem. Here is a particular formulation of the alphabet 4 question that I like:

Learning k \times 2 juntas:  There are n pairs of variables, (X_1,Y_1),\ldots,(X_n,Y_n). We
are given uniform examples of an unknown function f which depends only on k pairs out of n. Our goal is to learn f.

It is trivial to do it in time O(n^k). One could view it as a 2k-junta, but one would then need an algorithm for the k-junta problem that runs faster than n^{k/2}. Is it possible to do better? More ambitiously, can one match the running time for the k-junta problem, perhaps by a better reduction to learning k-juntas?

Finally, I am told there is money on offer (not from me) for solving the following seemingly innocuous problem: suppose you know that the function is Majority on 2k/3 bits XORed with Parity on k/3 bits. All you need to do is find out which bits are relevant. It is possible to do this in time O(n^{k/3}); in fact, there are a couple of ways. Can you do any better?

The Switching Lemma

Hastad‘s Switching Lemma is one of the gems of computational complexity. The switching lemma analyzes the effect of random restrictions on AC^0 circuits, which are constant-depth circuits with AND, OR and NOT gates of unbounded fanin. It says that such restrictions tend to simplify AC^0 circuits in surprising ways. An immediate consequence of the switching lemma is strong lower bounds for AC^0 circuits. It also has important applications in Fourier analysis, computational learning, pseudorandomness and more. Hastad’s lemma builds on earlier work due to Ajtai, Furst-Saxe-Sipser and Yao.

The crux of the matter is to analyze the effect of a random restriction on small-width DNF formulae. For this we need some notation. Let f: \{0,1\}^n \rightarrow \{0,1\} be a Boolean function on n input variables.

  • A random restriction with parameter p is one where we leave each variable alive with probability p and set it uniformly at random to 0 or 1 with probability 1 - p. We use f|_\rho to denote the restricted function.
  • We use DT(f) to denote the depth of the shallowest decision tree that computes f and w(f) to denote the width of the smallest width DNF formula that computes it.

It is easy to see that w(f)\leq DT(f), or rather that having a small-depth decision tree implies having a small-width DNF representation. There is no converse to this statement: it is not hard to construct examples of functions where w(f) is exponentially smaller than DT(f); indeed, the Tribes function is such an example.

Hastad’s switching lemma tells us that although a small-width DNF f might not have a small decision tree, a random restriction f|_\rho of f almost certainly does. The precise statement is the following:

Hastad’s Switching Lemma: Let f:\{0,1\}^n \rightarrow \{0,1\} be a width-w DNF. Let f|_\rho denote the function obtained by applying a random restriction with parameter p to it. Then

           Pr_\rho[DT(f|_\rho) \geq k] \leq (5pw)^k.

Let’s try and digest this statement. Think of w as a parameter between 2 and \log n. Let us pick the restriction parameter p = 1/(10w), so that the RHS becomes 2^{-k}.

We are keeping variables alive with probability only 1/(10w), but we can still reasonably expect about \Theta(n/w) variables to remain alive. Nevertheless, the switching lemma tells us that the resulting function has decision tree depth at most 10 except with probability 2^{-10}. Now a decision tree of depth 10 is a 2^{10}-junta. So only a minuscule number of the live variables continue to matter.
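One can watch this in action on a small example. The sketch below computes decision-tree depth exactly by recursion and applies random restrictions to a width-3 Tribes DNF; the parameters are illustrative, and the exact recursion is only feasible for tiny n:

```python
import random
from functools import lru_cache

@lru_cache(maxsize=None)
def dt_depth(table):
    """Exact decision-tree depth of a Boolean function given as a truth
    table (tuple of length 2^m; entry j is f at the m-bit expansion of j)."""
    if len(set(table)) == 1:
        return 0
    m = len(table).bit_length() - 1
    best = m
    for i in range(m):
        # split on variable i (bit i of the index); index order is preserved
        lo = tuple(v for j, v in enumerate(table) if not (j >> i) & 1)
        hi = tuple(v for j, v in enumerate(table) if (j >> i) & 1)
        best = min(best, 1 + max(dt_depth(lo), dt_depth(hi)))
    return best

def eval_dnf(terms, x):
    # terms: list of {variable: required bit}; the DNF is an OR of ANDs
    return int(any(all(x[v] == b for v, b in t.items()) for t in terms))

def restricted_depth(terms, n, p, rng):
    """Apply a random restriction with parameter p and return the
    decision-tree depth of the restricted function."""
    live = [v for v in range(n) if rng.random() < p]
    fixed = {v: rng.randrange(2) for v in range(n) if v not in live}
    table = []
    for j in range(2 ** len(live)):
        x = dict(fixed)
        for pos, v in enumerate(live):
            x[v] = (j >> pos) & 1
        table.append(eval_dnf(terms, x))
    return dt_depth(tuple(table))

# width-3 Tribes DNF on 12 variables: 4 disjoint terms of 3 variables each
tribes = [{3 * t: 1, 3 * t + 1: 1, 3 * t + 2: 1} for t in range(4)]
```

Running `restricted_depth(tribes, 12, 1/15, rng)` repeatedly, one sees the restricted function collapse to depth 0 or 1 the vast majority of the time, exactly as the lemma predicts.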

One could do this all day, nearly every setting of relevant parameters p, w, k gives something fairly surprising. But perhaps the most surprising of all is the irrelevant parameter: the number of terms m (which is so irrelevant that we had not given it a symbol so far), or equivalently the number of variables. Surely m and/or n should matter in the statement somewhere? Any reasonable person trying to prove this statement would probably first try a union bound over all terms and run smack into m.  As a colleague once said, the realization that m is irrelevant is mind-expanding.

By now there have been several alternative proofs of the switching lemma; I particularly like the exposition here, due to Neil Thapen.

As scientists, when faced with magic, we try to find rationalizations for it. One possible rationalization for the absence of the number of variables/terms in the switching lemma could be that small-width DNFs are not too complex to start with. Sure, they are not as simple as juntas, but they are not far from being juntas either. A statement of this form in fact follows from Friedgut‘s junta theorem, which tells us that any function with average sensitivity s is \epsilon-close to a 2^{O(s/\epsilon)}-junta. We can apply Friedgut’s theorem in this context since it is not hard to show that a width-w DNF formula has average sensitivity O(w). A similar but weaker statement can be derived from the work of Ajtai-Wigderson and Trevisan, who show how to approximate small-width DNFs by decision trees.

Some recent work with Raghu Meka and Omer Reingold takes a step towards converting this intuition into a proof of the switching lemma. We show that any width-w DNF is \epsilon-close to a width-w DNF with no more than (w \log(1/\epsilon))^{O(w)} terms. Moreover, we prove a stronger approximation guarantee called sandwiching approximation, which is often useful in pseudorandomness (see the paper for details).

This statement is incomparable to the earlier consequence of Friedgut’s theorem: the dependence on w is worse, but the dependence on \epsilon is better, and the approximators are themselves width-w DNF formulas. This strongly suggests a single best-of-both-worlds statement, which we pose as a conjecture in our paper:

Conjecture: Any width w DNF is \epsilon -close to a width w DNF that has no more than (\log(1/\epsilon))^{O(w)} terms.

If true, this says that any \log n width DNF is well-approximated by a (\log n width) DNF with only n^{O_\epsilon(1)} terms. If one can actually show sandwiching approximations, this would lead to a “union bound over terms” style proof of the switching lemma. The best lower bound we know comes from the Majority function on 2w -1 variables, which shows that one needs at least 4^w terms.


When did Majority become the stablest? (Part 2)

The first question we’d like to answer is this: which monotone, balanced, transitive Boolean function is the least sensitive to bit-flips? We know that Majority is the worst possible, with NS(Maj) = \Theta(1/\sqrt{n}).

The new champion turns out to be the Tribes function, first studied by Ben-Or and Linial. Tribes_{s,w} is a read-once DNF which is the OR of s disjoint terms, each of which is the AND of w variables:

Tribes_{s,w}(x^1_1,...,x^1_w,...,x^s_1,...,x^s_w) = \vee_{i=1}^s\left( \wedge_{j=1}^w x^i_j \right)

By choosing s = \Theta(2^w) carefully, we get a (near-)balanced, transitive, monotone function with NS(Tribes) = \Theta(\log n/n). So it beats Majority quite handily. But is this the best possible? Or can we hope to achieve sensitivity O(1/n)? That would be optimal, since it is easy to argue an \Omega(1/n) lower bound.
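The comparison is already visible at tiny sizes. Computing the bit-flip sensitivity exactly by brute force for n = 9, Tribes_{3,3} comes in at 49/256 versus 70/256 for Majority (a sketch; at this size Tribes is of course far from balanced):

```python
from itertools import product

def bitflip_ns(f, n):
    """Pr over a uniform x and a uniform coordinate i that flipping
    bit i changes f(x); this equals AS(f)/n."""
    flips = 0
    for x in product((0, 1), repeat=n):
        fx = f(x)
        for i in range(n):
            y = tuple(b ^ (j == i) for j, b in enumerate(x))
            flips += fx != f(y)
    return flips / (n * 2 ** n)

def maj(x):
    return int(2 * sum(x) > len(x))

def tribes(x, w=3):
    # OR of disjoint ANDs of width w
    return int(any(all(x[t:t + w]) for t in range(0, len(x), w)))
```

The exact values match the hand computation: each Tribes variable has influence (1/4)(7/8)^2 = 49/256, while each Majority variable has influence \binom{8}{4}/2^8 = 70/256.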

It turns out that \Theta(\log n/n) is in fact the best possible. This follows from the celebrated result of Kahn, Kalai and Linial (KKL), which gives a matching lower bound on the sensitivity of any balanced Boolean function in which all variables have small influence. The theorem is usually stated in terms of average sensitivity, which is AS(f) = n\cdot NS(f). In our setting, the monotone and transitive conditions imply that no variable has influence more than O(1/\sqrt{n}), hence their result applies.

At first, Tribes might not seem as natural a function as Majority. So why does it turn out to be the least sensitive to bit-flips? For KKL to be tight, the Fourier expansion of NS(f) tells us to look for a function concentrated on the first O(\log n) levels of the Fourier spectrum. This is equivalent to saying that f is well approximated by a polynomial of degree O(\log n). We know DNFs have such concentration. But why not something even simpler, like a decision tree of (exact) degree O(\log n)? It turns out that such functions invariably have some “special” co-ordinates, and hence cannot be transitive. This is a relatively recent result of O’Donnell, Saks, Schramm and Servedio, which says that every degree-d Boolean function has a variable with influence \Omega(1/d). So we are back to CNFs/DNFs, and Tribes is about as simple as DNFs get.

Another natural example of a function with low sensitivity (which I learnt from Yuval Peres): consider n points on a circle, and assign a random bit to each. Define g(x) to be 1 if the longest run of 1s has length at least c\log n. For a suitable choice of c, this function is near balanced. It is monotone, and furthermore NS(g) = \Theta(\log n/n).

So given that Tribes beats Majority so handily for bit-flips, how does it fare under \epsilon-noise for constant \epsilon? Not very well: its sensitivity approaches 0.5. Indeed, if you are looking for a function which is maximally sensitive, Tribes is not a bad choice.

So life is yet again complicated: there is no single function that seems optimal for all settings of \epsilon-noise. Indeed, this is why results on hardness amplification within NP which rely on composition with monotone functions need to switch from one function to another at various points. In contrast, if we don’t care about staying within NP, we can just use the XOR function.

But at the same time, perhaps a unified explanation is possible to explain the entire spectrum?

  1. Is there an intuitive explanation for why the least sensitive function should look like a Hamming ball when \epsilon is large, and look “Tribes-like” (Tribish? Tribal) when \epsilon is small? In some sense, we got into this mess by disallowing co-ordinate functions, which are essentially sub-cubes. Are these the functions that “look” most like sub-cubes (from some strange angle)?
  2. Is there an inverse result which interpolates smoothly between KKL and Majority is Stablest? Given that these are two of the most powerful theorems in all of Boolean function analysis, it would have to be quite a theorem.


When did Majority become the stablest?

The fireworks happening over at Ryan O’Donnell’s blog reminded me of something that always puzzles me: Is Majority the Stablest? Or the Unstablest?

Consider the class of all monotone, balanced Boolean functions f:\{-1,1\}^n \rightarrow \{-1,1\}. We define the noise sensitivity NS(f) of a function f as NS(f) = P_{x,e}[f(x) \neq f( x \cdot e)], where x \in \{-1,1\}^n is chosen uniformly whereas the noise e \in \{-1,1\}^n is sampled from a noise distribution N of our choice. Think of e as telling us which bits of x to flip. Two natural choices are:

  1. The \epsilon-noise distribution N(\epsilon) where each bit is flipped independently with probability \epsilon.
  2. The “bit-flip” distribution N' where we flip a single random bit. This behaves a lot like  N(\epsilon) where \epsilon = 1/n.

Say we’d like to find the monotone, balanced Boolean function f which is least sensitive to noise. A bit of thought shows that the answer is any co-ordinate/dictator function x_i. So to rule out trivial answers, let us allow only functions where all variables are equally important; more precisely, functions that are invariant under a transitive group of permutations of the variables. This is a bit of overkill; the right way to do this is to insist that no variable has too much influence.

The Majority function Maj(x_1,\ldots,x_n) = sgn(\sum_{i=1}^n x_i) is pretty insensitive/stable under \epsilon-noise: one can show that NS(Maj) \leq O(\sqrt{\epsilon}). In contrast, for constant \epsilon, many other functions have sensitivity approaching 0.5. The Majority is Stablest theorem, due to Mossel, O’Donnell and Oleszkiewicz, shows that Majority is indeed about as good as it gets: functions which are more stable need to have a distinguished variable (which we disallow).
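The \sqrt{\epsilon} scaling is easy to see empirically; here is a quick Monte Carlo sketch (the constants and parameters are illustrative):

```python
import random
from math import sqrt

def ns_eps(f, n, eps, trials=20000, seed=1):
    """Monte Carlo estimate of Pr[f(x) != f(y)], where y is x with each
    bit flipped independently with probability eps."""
    rng = random.Random(seed)
    diff = 0
    for _ in range(trials):
        x = [rng.randrange(2) for _ in range(n)]
        y = [b ^ (rng.random() < eps) for b in x]
        diff += f(x) != f(y)
    return diff / trials

maj = lambda x: 2 * sum(x) > len(x)
```

For large n the estimate tracks the asymptotic value \frac{1}{\pi}\arccos(1-2\epsilon) \approx \frac{2}{\pi}\sqrt{\epsilon}, comfortably below \sqrt{\epsilon}.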

Now consider the “bit-flip” distribution N'. Using the \epsilon = 1/n rule of thumb, one would expect the noise sensitivity of Majority to be around 1/\sqrt{n}. This is indeed the case, no surprises there. What is surprising is that this is as unstable as it gets for monotone functions! The proof is simple: we start with the observation that since f is monotone, E_x[f(x)x_i] equals the probability that flipping x_i changes f for a random x. To see why this is so, fix all the other bits except x_i. Now flipping x_i from -1 to +1 can result in two possible outcomes:

  1. It leaves f unchanged, in which case the contribution to E_x[f(x)x_i] is 0.
  2. It causes f to change from -1 to +1, in which case the contribution to E_x[f(x)x_i] is 2. By monotonicity, the reverse flip from +1 to -1 is not possible.

So now we have

NS(f) = \frac{1}{n}\sum_{i=1}^nE_x[f(x)x_i] = \frac{1}{n}E_x\left[f(x)\left(\sum_{i=1}^nx_i\right)\right]

Since f(x) \in \{-1,1\}, we have

f(x)\left(\sum_{i=1}^nx_i\right) \leq \left|\sum_{i=1}^nx_i\right|

with equality iff

f(x) = sgn\left(\sum_{i=1}^nx_i\right) = Maj(x_1,...,x_n)

So among balanced, transitive, monotone functions, Majority is maximally unstable under N' and maximally stable under N(\epsilon). And in an intriguing twist, one way to prove the stability under N(\epsilon) is to reduce it to bounding the stability under N', as is done in the proof of Peres’ Theorem, which tells us that the O(\sqrt{\epsilon}) bound holds for all LTFs.
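For tiny n, both the identity NS(f) = \frac{1}{n}E_x[f(x)\sum_i x_i] and Majority's extremality can be checked by brute force over all monotone functions; a sketch for n = 3 (helper names are mine, with 0/1 mapped to \pm 1 where needed):

```python
from itertools import product

n = 3
pts = list(product((0, 1), repeat=n))

def is_monotone(f):
    # f: dict point -> 0/1; f(x) <= f(y) whenever x <= y coordinatewise
    return all(f[x] <= f[y] for x in pts for y in pts
               if all(a <= b for a, b in zip(x, y)))

def ns_under_flip(f):
    # Pr over a uniform x and a uniform coordinate i that flipping x_i changes f
    flips = sum(f[x] != f[tuple(b ^ (j == i) for j, b in enumerate(x))]
                for x in pts for i in range(n))
    return flips / (n * len(pts))

def corr_with_sum(f):
    # (1/n) E[ f(x) * sum_i x_i ]  in the +/-1 convention (0/1 -> -1/+1);
    # for monotone f this equals ns_under_flip(f), as in the derivation above
    tot = sum((2 * f[x] - 1) * sum(2 * b - 1 for b in x) for x in pts)
    return tot / (n * len(pts))
```

Enumerating all 2^8 truth tables on 3 variables, the identity holds for every monotone function, and among the balanced monotone ones the maximum of NS is 1/2, attained only by Maj_3 (the three dictators give 1/3).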

Coming soon:

  1. If not Majority, then who?
  2. What lies behind Majority’s rapid descent into instability?