# The Switching Lemma

Hastad's Switching Lemma is one of the gems of computational complexity. The switching lemma analyzes the effect of random restrictions on $AC^0$ circuits, which are constant-depth circuits with AND, OR and NOT gates of unbounded fan-in. It says that such restrictions tend to simplify $AC^0$ circuits in surprising ways. An immediate consequence of the switching lemma is strong lower bounds for $AC^0$ circuits. It also has important applications in Fourier analysis, computational learning, pseudorandomness and more. Hastad's lemma builds on earlier work due to Ajtai, Furst-Saxe-Sipser and Yao.

The crux of the matter is to analyze the effect of a random restriction on small-width DNF formulae. For this we need some notation. Let $f: \{0,1\}^n \rightarrow \{0,1\}$ be a Boolean function on $n$ input variables.

• A random restriction with parameter $p$ is one where we leave each variable alive with probability $p$ and, with probability $1 - p$, set it uniformly at random to $0$ or $1$ (independently across variables). We use $f|_\rho$ to denote the restricted function.
• We use $DT(f)$ to denote the depth of the shallowest decision tree that computes $f$, and $w(f)$ to denote the minimum width of a DNF formula that computes it.
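To make the definition concrete, here is a minimal Python sketch of a restriction acting on a DNF. The representation (a term as a dict from variable index to required bit, `'*'` for alive) is ad hoc, chosen just for illustration:

```python
import random

def random_restriction(n, p, rng=random):
    """Each variable stays alive ('*') with probability p,
    else is fixed to a uniformly random bit."""
    return ['*' if rng.random() < p else rng.randrange(2) for _ in range(n)]

def restrict_dnf(terms, rho):
    """Apply restriction rho to a DNF given as a list of terms.
    A term is a dict {var_index: required_bit}. Returns the simplified
    term list, or True if some term is satisfied outright."""
    new_terms = []
    for term in terms:
        killed = False
        survivors = {}
        for v, b in term.items():
            if rho[v] == '*':
                survivors[v] = b      # literal stays alive
            elif rho[v] != b:
                killed = True         # literal falsified: term drops out
                break
        if killed:
            continue
        if not survivors:
            return True               # term fully satisfied: DNF is the constant 1
        new_terms.append(survivors)
    return new_terms
```

Each term is either falsified (it disappears), satisfied outright (the whole DNF collapses to $1$), or shrinks; this term-by-term picture is the naive starting point that the lemma's proof has to go well beyond.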

It is easy to see that $w(f)\leq DT(f)$: having a small-depth decision tree implies having a small-width DNF representation. There is no converse to this statement: it is not hard to construct examples of functions where $w(f)$ is exponentially smaller than $DT(f)$; indeed, the Tribes function is such an example.
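The inequality $w(f)\leq DT(f)$ is just a syntactic translation: each accepting leaf of a depth-$d$ tree contributes the conjunction of literals along its path, a term of width at most $d$. A small sketch (the nested-tuple tree encoding is my own, purely illustrative):

```python
def tree_to_dnf(tree, path=()):
    """Collect one DNF term per accepting leaf of a decision tree.
    tree is a leaf 0/1 or a tuple (var, zero_subtree, one_subtree),
    where the branches correspond to x_var = 0 and x_var = 1."""
    if tree == 0:
        return []                      # rejecting leaf: no term
    if tree == 1:
        return [dict(path)]            # term = literals queried on the way here
    var, lo, hi = tree
    return (tree_to_dnf(lo, path + ((var, 0),)) +
            tree_to_dnf(hi, path + ((var, 1),)))
```

Every term's width equals its leaf's depth, so a depth-$d$ tree yields a DNF of width at most $d$.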

Hastad’s switching lemma tells us that although a small-width DNF $f$ might not have a small decision tree, a random restriction $f|_\rho$ of $f$ almost certainly does. The precise statement is the following:

Hastad's Switching Lemma: Let $f:\{0,1\}^n \rightarrow \{0,1\}$ be a width-$w$ DNF. Let $f|_\rho$ denote the function obtained by applying a random restriction with parameter $p$ to it. Then

$\Pr_\rho[DT(f|_\rho) \geq k] \leq (5pw)^k$.

Let's try to digest this statement. Think of $w$ as a parameter between $2$ and $\log n$. Let us pick the restriction parameter $p = 1/(10w)$, so that the RHS becomes $2^{-k}$.

We are keeping variables alive with probability $1/(10w)$, which is not too high, but we can still reasonably expect about $\Theta(n/w)$ variables to remain alive. Nevertheless, the switching lemma tells us that, except with probability $2^{-10}$, the resulting function has decision tree depth less than $10$. Now a decision tree of depth $10$ depends on at most $2^{10}$ variables, so it is a $2^{10}$-junta. So only a minuscule number of live variables continue to matter.
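If you want to play with these parameters on toy examples, $DT(f)$ can be computed exactly by brute-force recursion; this is feasible only for a handful of variables, and is of course an experimental aid, not anything resembling the lemma's proof:

```python
import itertools

def dt_depth(f, n, fixed=None):
    """Minimum depth of a decision tree computing f, by brute force.
    f maps an n-bit tuple to 0/1; fixed is a partial assignment dict.
    Only practical for small n."""
    fixed = fixed or {}
    live = [i for i in range(n) if i not in fixed]
    # Check whether f is already constant on all completions of `fixed`.
    vals = set()
    for bits in itertools.product((0, 1), repeat=len(live)):
        a = dict(zip(live, bits))
        a.update(fixed)
        vals.add(f(tuple(a[i] for i in range(n))))
    if len(vals) == 1:
        return 0
    # Otherwise query the best variable and recurse on both branches.
    return min(1 + max(dt_depth(f, n, {**fixed, i: 0}),
                       dt_depth(f, n, {**fixed, i: 1}))
               for i in live)
```

Combining this with a sampler of random restrictions gives a direct (if tiny-scale) empirical check of the $(5pw)^k$ bound.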

One could do this all day, nearly every setting of relevant parameters $p, w, k$ gives something fairly surprising. But perhaps the most surprising of all is the irrelevant parameter: the number of terms $m$ (which is so irrelevant that we had not given it a symbol so far), or equivalently the number of variables. Surely $m$ and/or $n$ should matter in the statement somewhere? Any reasonable person trying to prove this statement would probably first try a union bound over all terms and run smack into $m$. As a colleague once said, the realization that $m$ is irrelevant is mind-expanding.

By now there have been several alternative proofs of the switching lemma; I particularly like the exposition here due to Neil Thapen.

As scientists, when faced with magic, we try to find rationalizations for it. One possible rationalization for the absence of the number of variables/terms in the switching lemma could be that small-width DNFs are not too complex to start with. Sure, they are not as simple as juntas, but they are not far from being juntas either. A statement of this form in fact follows from Friedgut's junta theorem. Friedgut's theorem tells us that any function with average sensitivity $s$ is $\epsilon$-close to a $2^{O(s/\epsilon)}$-junta. We can apply Friedgut's theorem in this context since it is not hard to show that a width-$w$ DNF formula has average sensitivity bounded by $O(w)$. A similar but weaker statement can be derived from the work of Ajtai-Wigderson and Trevisan, who show how to approximate small-width DNFs by decision trees.
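The average-sensitivity bound is easy to check exactly on a small example. The sketch below computes the average sensitivity (total influence) by exhaustive enumeration, and tries it on the Tribes DNF with three terms of width $3$, a width-$3$ DNF on $9$ variables:

```python
import itertools

def avg_sensitivity(f, n):
    """Average sensitivity of f: E_x over uniform x of the number of
    coordinates whose flip changes f(x). Exhaustive; small n only."""
    total = 0
    for x in itertools.product((0, 1), repeat=n):
        fx = f(x)
        total += sum(fx != f(x[:i] + (1 - x[i],) + x[i + 1:])
                     for i in range(n))
    return total / 2 ** n

# Tribes with 3 disjoint terms of width 3: a width-3 DNF on 9 variables.
tribes = lambda x: int(any(all(x[3 * j:3 * j + 3]) for j in range(3)))
```

The value comes out below $2$, comfortably consistent with the $O(w)$ bound for $w = 3$.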

Some recent work with Raghu Meka and Omer Reingold takes a step towards converting this intuition into a proof of the switching lemma. We show that any width-$w$ DNF is $\epsilon$-close to a width-$w$ DNF that has no more than $(w \log(1/\epsilon))^{O(w)}$ terms. Moreover, we prove a stronger approximation guarantee called sandwiching approximation, which is often useful in pseudorandomness (see paper for details).

This statement is incomparable with the earlier consequence derived from Friedgut's theorem: the dependence on $w$ is worse, but the dependence on $\epsilon$ is better, and the approximators are themselves width-$w$ DNF formulas. This strongly suggests a single best-of-both-worlds kind of statement, which we pose as a conjecture in our paper:

Conjecture: Any width-$w$ DNF is $\epsilon$-close to a width-$w$ DNF that has no more than $(\log(1/\epsilon))^{O(w)}$ terms.

If true, this says that any $\log n$-width DNF is well-approximated by a ($\log n$-width) DNF with only $n^{O_\epsilon(1)}$ terms. If one can actually show sandwiching approximations, this would lead to a "union bound over terms" style proof of the switching lemma. The best lower bound we know comes from the Majority function on $2w - 1$ variables, which shows that one needs at least $4^w$ terms.

Parikshit