Learning Juntas

Computational learning is full of problems that are deceptively simple to state but fiendishly hard to solve. Perhaps none more so than the problem of learning Juntas, posed by Avrim Blum and Pat Langley. It's the kind of problem that seems well suited for a polymath endeavor; Dick Lipton likes to say that you could think about it while running for the bus. Alas, I haven't ridden many buses here in the Bay Area (and coincidentally haven't had any good ideas about this problem).

But what I'd like to talk about in this post is a variant of the problem which I think is kind of cute. The question is from a paper by Mossel, O'Donnell and Servedio; this particular formulation came up in discussions with Ryan O'Donnell, Amir Shpilka and Yi Wu (sorry guys, I know you'd rather not be associated with this one).

Let us begin with the problem of learning Juntas. The setup is that of learning an unknown Boolean function under the uniform distribution: we are given uniformly random examples from \{0,1\}^n labelled by some function f:\{0,1\}^n \rightarrow \{0,1\}. Our goal is to learn f. This is only possible if we assume that f has some nice structure. In the Junta learning problem, we are told that f depends on only k of the n input variables, where we think of k \leq \log n (or just anything \omega(1)). In fact, it suffices to find a single relevant variable: once you have one, you can fix it to 0 and to 1 and recurse on the restrictions.
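To fix notation, here is a minimal Python sketch of the setup: uniformly random examples labelled by a hidden k-junta. The particular target below (a Majority over three arbitrarily chosen coordinates), and all the names and sizes, are purely illustrative, not part of the problem.

```python
import random

n, k = 20, 3                  # illustrative sizes
RELEVANT = [2, 5, 11]         # hidden relevant coordinates (unknown to the learner)

def f(x):
    # a hypothetical k-junta: Majority of the three relevant bits
    return int(sum(x[i] for i in RELEVANT) >= 2)

def draw_example():
    """One uniformly random labelled example (x, f(x))."""
    x = tuple(random.randint(0, 1) for _ in range(n))
    return x, f(x)

examples = [draw_example() for _ in range(5000)]
```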

It is trivial to solve the problem in time O(n^k). The first non-trivial algorithm for this problem was given by Mossel, O'Donnell and Servedio; it runs in time n^{\omega k/(\omega +1)}, where \omega is the matrix multiplication exponent. So it will never do better than n^{2k/3}. But (as one of the authors was kind enough to point out), it gives you something interesting even for \omega = 3. Or 4. Or more. An algorithm with a better running time than MOS (and better than n^{2k/3}) was recently found by (Greg) Valiant. Better bounds are known for symmetric functions, due to Lipton, Markakis, Mehta and Vishnoi, and to Shpilka and Tal.
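For concreteness, here is what the trivial exhaustive search looks like (this is the O(n^k) bound above, ignoring poly(n, 2^k) factors): try every set of k coordinates and check whether the observed labels are consistent with some function of just those coordinates. This is only a sketch, written against examples drawn as in the previous snippet.

```python
from itertools import combinations

def exhaustive_junta_search(examples, n, k):
    """Trivial search over all n-choose-k candidate sets: return a set S of k
    coordinates such that the sample labels are consistent with some function
    of x restricted to S, or None if no such set exists."""
    for S in combinations(range(n), k):
        table = {}  # partial truth table of the candidate junta on S
        if all(table.setdefault(tuple(x[i] for i in S), y) == y for x, y in examples):
            return S
    return None
```

With enough examples, only sets that contain all of the relevant coordinates survive the consistency check.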

Here is a bare-bones sketch of the MOS algorithm. A natural approach to finding a relevant variable is to compute correlations between the n input variables and f; if we find a correlation, we have found a relevant variable. If that fails, we could try pairs of variables, triples of variables and so on, essentially running the low-degree algorithm of Linial, Mansour and Nisan. A cheeky way to defeat this algorithm (at least for a while) is to take some function, say Majority on 2k/3 bits, and XOR it with Parity on the remaining k/3 bits. This has the effect of making the resulting function uncorrelated with every subset of variables of size less than k/3. However, note that the resulting function is not very complex when viewed as a polynomial over GF[2]. Even if Majority had full degree over GF[2] (it does not), the resulting function would only have degree 2k/3, since the Parity part does not increase the degree at all.
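In code, the correlation-based step looks roughly as follows; the threshold below is an arbitrary illustrative choice, not anything from the actual analysis. Any set of variables whose parity is noticeably correlated with the labels consists entirely of relevant variables; against the Majority-XOR-Parity construction above, every set of size less than k/3 comes back empty, which is exactly the evasion just described.

```python
from itertools import combinations

def parity_correlation(examples, S):
    """Empirical correlation of the labels with the parity of the bits in S,
    i.e. an estimate of the Fourier coefficient of f on S (in +/-1 notation)."""
    total = sum((-1) ** (sum(x[i] for i in S) + y) for x, y in examples)
    return total / len(examples)

def find_correlated_set(examples, n, max_size, threshold=0.1):
    """Low-degree search: singletons first, then pairs, triples, and so on,
    returning the first set whose parity is noticeably correlated with f.
    The threshold is an illustrative choice only."""
    for size in range(1, max_size + 1):
        for S in combinations(range(n), size):
            if abs(parity_correlation(examples, S)) > threshold:
                return S
    return None
```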

The MOS algorithm relies on the insight that these are (essentially) the only functions that evade the low-degree algorithm: any such function must have low degree as a polynomial over GF[2], and is then easy to learn using polynomial interpolation over GF[2]. The precise result they use is due to Siegenthaler from 1984.
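To make the interpolation step concrete, here is a sketch (not the MOS algorithm itself) of learning a function promised to have degree at most d over GF[2]: expand f in multilinear monomials of degree at most d and solve for the coefficients by Gaussian elimination over GF(2) on the labelled examples.

```python
from itertools import combinations

def monomials(n, d):
    """All multilinear monomials of degree at most d in variables x_0,...,x_{n-1}."""
    mons = [()]
    for size in range(1, d + 1):
        mons.extend(combinations(range(n), size))
    return mons

def solve_gf2(A, b):
    """Gaussian elimination over GF(2): return one solution c of A c = b, or None."""
    rows, cols = len(A), len(A[0])
    M = [row[:] + [bit] for row, bit in zip(A, b)]   # augmented matrix
    pivots, r = [], 0
    for c in range(cols):
        pivot = next((i for i in range(r, rows) if M[i][c]), None)
        if pivot is None:
            continue
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(rows):
            if i != r and M[i][c]:
                M[i] = [a ^ p for a, p in zip(M[i], M[r])]
        pivots.append(c)
        r += 1
    if any(M[i][-1] for i in range(r, rows)):
        return None            # inconsistent: no degree-<=d polynomial fits
    sol = [0] * cols
    for i, c in enumerate(pivots):
        sol[c] = M[i][-1]
    return sol

def interpolate_gf2(examples, n, d):
    """Fit a polynomial of degree at most d over GF(2) to the labelled examples,
    returning its monomial coefficients."""
    mons = monomials(n, d)
    A = [[int(all(x[i] for i in mon)) for mon in mons] for x, _ in examples]
    b = [y for _, y in examples]
    coeffs = solve_gf2(A, b)
    return None if coeffs is None else dict(zip(mons, coeffs))
```

The relevant variables can then be read off the monomials that appear with nonzero coefficient.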

Much as I love polynomials over GF[2], one can't help wondering how robust this approach is. Suppose we were trying to learn a Boolean function, but over a ternary input alphabet. Would the right analogue be polynomials over GF[3]? Over an alphabet of size 4 (A,T,G,C), should the polynomials be over GF[4] or Z_4? And I'd rather not think about alphabets of size 6.

The larger alphabet is mentioned as an open problem in MOS, and as far as I know, no one has managed to do anything with it (it is quite likely that no one has tried too hard). But I think this might turn out to be a worthwhile digression that could shed some light on the original problem. Here is a particular formulation of the alphabet 4 question that I like:

Learning k \times 2 juntas: There are n pairs of variables, (X_1,Y_1),\ldots,(X_n,Y_n). We are given uniform examples of an unknown function f which depends only on k pairs out of n. Our goal is to learn f.

It is trivial to do it in time O(n^k). One could view it as an ordinary 2k-junta on the 2n underlying variables, but to beat the trivial bound this way one would need an algorithm for the k-junta problem that runs faster than n^{k/2} (since the k relevant pairs become 2k relevant variables). Is it possible to do better? More ambitiously, can one match the running time for the k-junta problem, perhaps by a better reduction to learning k-juntas?

Finally, I am told there is money on offer (not from me) for solving the following seemingly innocuous problem: suppose you know that the function is Majority on 2k/3 bits XORed with Parity on k/3 bits. All you need to do is find out which bits are the relevant ones. It is possible to do this in time O(n^{k/3}); in fact there are a couple of ways. Can you do any better?
