Chaining methods continued (guest post by Jelani Nelson)

[This is the sequel to Jelani’s previous post on the chaining method; for more, see the STOC workshop on this topic –Boaz]

1. A case study: (sub)gaussian processes

To give an introduction to chaining, I will focus our attention on a concrete scenario. Suppose we have a bounded (but possibly infinite) collection of vectors {T\subset \mathbb{R}^n}. Furthermore, let {g\in\mathbb{R}^n} be a random vector whose entries are independent, mean-zero, unit-variance gaussians. We will consider the collection of random variables {(X_t)_{t\in T}} with {X_t} defined as {\left\langle g, t \right\rangle}. In what follows, we will only ever use one property of these {X_t}:

\displaystyle  \forall s,t\in T,\ \mathbb{P}(|X_s - X_t| > \lambda) \lesssim e^{-\lambda^2/(2\|s-t\|_2^2)} . \ \ \ \ \ (1)
This provides us with some understanding of the dependency structure of the {X_t}. In particular, if {s, t} are close in {\ell_2}, then it’s very likely that the random variables {X_s} and {X_t} are also close.

Why does this property hold? Well,

\displaystyle  X_s - X_t = \left\langle g, s - t \right\rangle = \sum_{i=1}^n g_i\cdot (s - t)_i .
We then use the property that adding independent gaussians yields a gaussian in which the variances add. If you haven’t seen that fact before, it follows easily from looking at the Fourier transform of the gaussian pdf. Adding independent random variables convolves their pdfs, which pointwise multiplies their Fourier transforms. Since the Fourier transform of a gaussian pdf is a gaussian whose variance is inverted, it then follows that summing independent gaussians gives a gaussian with summed variances. Thus {X_s - X_t} is a gaussian with variance {\|s - t\|_2^2}, and (1) then follows by tail behavior of gaussians. Note (1) would hold for subgaussian distributions too, such as for example {g} being a vector of independent uniform {\pm 1} random variables.
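
If you would like a quick numerical sanity check of this fact, here is a minimal numpy sketch (my own illustration, not needed for anything that follows; the vectors {s,t} below are arbitrary placeholders):

    import numpy as np

    rng = np.random.default_rng(0)
    n, trials = 50, 200_000
    s, t = rng.standard_normal(n), rng.standard_normal(n)  # two fixed, arbitrary vectors

    g = rng.standard_normal((trials, n))  # many independent draws of g
    diffs = g @ (s - t)                   # samples of X_s - X_t = <g, s - t>

    print(diffs.std())            # empirical standard deviation of X_s - X_t ...
    print(np.linalg.norm(s - t))  # ... should match ||s - t||_2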

Now I will present four approaches to bounding {g(T) := \mathbb{E}_g \sup_{t\in T} X_t}, each sharper than the last. For simplicity I will assume {|T|<\infty}, although it is easy to get around this for methods 2, 3, and 4.

1.1. Method 1: union bound

Remember that, in general for a scalar random variable {Z},

\displaystyle  \mathbb{E} |Z| = \int_0^\infty \mathbb{P}(|Z| > u) du ,\ \text{ and thus also }\ \mathbb{E} Z \le \int_0^\infty \mathbb{P}(Z > u) du .
Let {\rho_X(T)} denote the diameter of {T} under norm {X}. Since translating {T} changes neither {\mathbb{E} \sup_{t\in T} X_t} nor any pairwise distance, we may assume {\|t\|_2 \le \rho_{\ell_2}(T)} for all {t\in T}, so that each {X_t} is gaussian with variance at most {\rho_{\ell_2}(T)^2}. Then \displaystyle  \begin{array}{rcl}  \mathbb{E} \sup_{t\in T} X_t &\le& \int_0^\infty \mathbb{P}(\sup_{t\in T} X_t > u) du\\ {}&\le& \int_0^{C\rho_{\ell_2}(T)\sqrt{\log |T|}} \overbrace{\mathbb{P}(\sup_{t\in T} X_t > u)}^{\le 1} du + \int_{C\rho_{\ell_2}(T)\sqrt{\log |T|}}^\infty \mathbb{P}(\sup_{t\in T} X_t > u) du\\ {}&\le& C\rho_{\ell_2}(T) \sqrt{\log |T|} + \int_{C\rho_{\ell_2}(T)\sqrt{\log |T|}}^\infty \sum_{t\in T}\mathbb{P}(X_t > u) du \text{ (union bound)}\\ {}&\le& C\rho_{\ell_2}(T) \sqrt{\log |T|} + |T|\cdot \int_{C\rho_{\ell_2}(T)\sqrt{\log |T|}}^\infty e^{-u^2/(2 \rho_{\ell_2}(T)^2)} du , \end{array} and for {C} a large enough constant the last integral is at most {\rho_{\ell_2}(T)/|T|},
giving \displaystyle  \mathbb{E} \sup_{t\in T} X_t {} \lesssim \rho_{\ell_2}(T) \cdot\sqrt{\log |T|}  \ \ \ \ \ (2)
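
As a sanity check on (2), the following sketch (again just my illustration, with a random finite {T}) compares the empirical {\mathbb{E} \sup_{t\in T} X_t} against {\rho_{\ell_2}(T)\sqrt{\log|T|}}:

    import numpy as np

    rng = np.random.default_rng(0)
    n, N, trials = 20, 500, 2000
    T = rng.standard_normal((N, n))  # N points in R^n

    # diameter of T under l_2, via squared distances from the Gram matrix
    sq = (T ** 2).sum(axis=1)
    rho = np.sqrt((sq[:, None] + sq[None, :] - 2 * T @ T.T).max())

    g = rng.standard_normal((trials, n))
    emp = (g @ T.T).max(axis=1).mean()    # empirical E sup_t X_t
    print(emp, rho * np.sqrt(np.log(N)))  # (2): emp is at most a constant times the latter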

1.2. Method 2: {\varepsilon}-nets

Let {T'\subseteq T} be an {\varepsilon}-net of {T} under {\ell_2}. That is, for all {t\in T} there exists {t'\in T'} such that {\|t - t'\|_2 \le \varepsilon}. Now note {\left\langle g,t \right\rangle = \left\langle g, t' + (t-t') \right\rangle} so that

\displaystyle  X_t = X_{t'} + X_{t-t'} .
Therefore \displaystyle  g(T) \le g(T') + \mathbb{E} \sup_{t\in T} \left\langle g, t-t' \right\rangle .

We already know {g(T') \lesssim \rho_{\ell_2}(T')\cdot \sqrt{\log |T'|} \le \rho_{\ell_2}(T) \cdot \sqrt{\log|T'|}} by (2). Also, {\left\langle g, t-t' \right\rangle \le \|g\|_2\cdot \|t-t'\|_2 \le \varepsilon\|g\|_2} by Cauchy–Schwarz, and

\displaystyle  \mathbb{E} \|g\|_2 \le (\mathbb{E} \|g\|_2^2)^{1/2} = \sqrt{n} .
Therefore \displaystyle  g(T) \lesssim \rho_{\ell_2}(T) \cdot \sqrt{\log|T'|} + \varepsilon\sqrt{n} = \rho_{\ell_2}(T) \cdot \log^{1/2} \mathcal{N}(T, \ell_2, \varepsilon) + \varepsilon\sqrt{n}  \ \ \ \ \ (3)
where {\mathcal{N}(T, d, u)} denotes the entropy number or covering number, defined as the minimum number of radius-{u} balls under metric {d} centered at points in {T} required to cover {T} (i.e. the size of the smallest {u}-net). Of course {\varepsilon} can be chosen to minimize (3). Note the case {\varepsilon = 0} just reduces back to method 1.
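
For a finite {T}, one can compute a crude version of the bound (3) directly; the sketch below uses a greedy net construction in place of a minimal one (a greedy net is a valid {\varepsilon}-net, though not necessarily of minimum size):

    import numpy as np

    def greedy_net(T, eps):
        # Greedy eps-net: every row of T ends up within eps of some chosen point.
        net = []
        for t in T:
            if all(np.linalg.norm(t - c) > eps for c in net):
                net.append(t)
        return net

    rng = np.random.default_rng(0)
    n, N = 20, 500
    T = rng.standard_normal((N, n))
    sq = (T ** 2).sum(axis=1)
    rho = np.sqrt((sq[:, None] + sq[None, :] - 2 * T @ T.T).max())  # diameter of T

    for eps in [2.0, 4.0, 6.0]:
        net = greedy_net(T, eps)
        bound = rho * np.sqrt(np.log(len(net))) + eps * np.sqrt(n)  # RHS of (3), constants dropped
        print(eps, len(net), bound)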

1.3. Method 3: Dudley’s inequality (chaining)

The idea of Dudley’s inequality [Dud67] is to use not one net but a countably infinite sequence of nets. That is, let {S_r\subset T} denote an {\varepsilon_r}-net of {T} under {\ell_2}, where {\varepsilon_r = 2^{-r}\cdot \rho_{\ell_2}(T)}, and let {t_r} denote the closest point in {S_r} to a given {t\in T}. Note {S_0 = \{0\}} is a valid {\varepsilon_0}-net (recall from method 1 that we may translate {T} so that {\|t\|_2 \le \rho_{\ell_2}(T)} for all {t\in T}). Then

\displaystyle  \left\langle g, t \right\rangle = \left\langle g, t_0 \right\rangle + \sum_{r=1}^\infty \left\langle g, t_r - t_{r-1} \right\rangle ,
so then \displaystyle  \begin{array}{rcl}  g(T) &\le& \sum_{r=1}^\infty \mathbb{E} \sup_{t\in T} \left\langle g, t_r - t_{r-1} \right\rangle \\ {}&\lesssim& \sum_{r=1}^\infty \frac{\rho_{\ell_2}(T)}{2^r} \cdot \log^{1/2} (\mathcal{N}(T, \ell_2, \frac{\rho_{\ell_2}(T)}{2^r})^2) , \end{array}
with the last inequality using (2), since the triangle inequality yields \displaystyle  \|t_r - t_{r-1}\|_2 \le \|t - t_r\|_2 + \|t - t_{r-1}\|_2 \le \frac 3{2^r}\cdot \rho_{\ell_2}(T) .
Thus \displaystyle  g(T)\lesssim \sum_{r=1}^\infty \frac{\rho_{\ell_2}(T)}{2^r} \cdot \log^{1/2} \mathcal{N}(T, \ell_2, \frac{\rho_{\ell_2}(T)}{2^r})  \ \ \ \ \ (4)

The sum (4) is perfectly fine as is, though the typical formulation of Dudley’s inequality bounds the sum by an integral: since {\mathcal{N}(T, \ell_2, u)} is non-increasing in {u}, each summand is at most twice the integral of {\log^{1/2}\mathcal{N}(T, \ell_2, u)} over the dyadic interval {u\in[\rho_{\ell_2}(T)/2^{r+1}, \rho_{\ell_2}(T)/2^r]}. This yields the usual formulation of Dudley’s inequality:

\displaystyle  g(T) \lesssim \int_0^\infty \log^{1/2}\mathcal{N}(T, \ell_2, u)du  \ \ \ \ \ (5)
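
For a finite {T}, the dyadic sum (4) is also directly computable from any net construction; here is a sketch reusing the greedy nets from the previous snippet (so the covering numbers, and hence the bound, are only approximate):

    import numpy as np

    def greedy_net(T, eps):
        net = []
        for t in T:
            if all(np.linalg.norm(t - c) > eps for c in net):
                net.append(t)
        return net

    def dudley_sum(T, r_max=30):
        # Evaluate sum_r (rho / 2^r) * sqrt(log N(T, l_2, rho / 2^r)) as in (4).
        sq = (T ** 2).sum(axis=1)
        rho = np.sqrt((sq[:, None] + sq[None, :] - 2 * T @ T.T).max())
        total = 0.0
        for r in range(1, r_max + 1):
            eps = rho / 2 ** r
            N_eps = len(greedy_net(T, eps))
            total += eps * np.sqrt(np.log(N_eps))
            if N_eps == len(T):
                # Nets can't get finer than T itself; the remaining terms form a
                # geometric series contributing at most this last term once more.
                total += eps * np.sqrt(np.log(N_eps))
                break
        return total

    rng = np.random.default_rng(0)
    T = rng.standard_normal((300, 10))
    print(dudley_sum(T))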

It is worth pointing out that Dudley’s inequality is equivalent to the following bound. We say {T_0\subset T_1\subset \ldots \subset T} is an admissible sequence if {|T_0| = 1} and {|T_r| \le 2^{2^r}}. Then Dudley’s inequality is equivalent to the bound

\displaystyle  g(T) \lesssim \sum_{r=0}^\infty 2^{r/2} \cdot \sup_{t\in T} d_{\ell_2}(t, T_r) . \ \ \ \ \ (6)

To see this most easily, compare with the bound (4). Note that to minimize {\sup_{t\in T} d_{\ell_2}(t, T_r)}, we should pick the best net we can with at most {2^{2^r}} points. From {r=0} until some {r_1}, the quality of the best such net will be, up to a factor of {2}, equal to {\rho_{\ell_2}(T)}, and for the {r} in this range the summands of (6) form a geometric series summing to {O(2^{r_1/2} \cdot \rho_{\ell_2}(T))}. Then from {r=r_1} to some {r_2}, the quality of the best net will be, up to a factor of {2}, equal to {\rho_{\ell_2}(T)/2}, and these summands form a geometric series summing to {O(2^{r_2/2}\cdot \rho_{\ell_2}(T)/2)}, etc. In this way, the bounds of (4) and (6) are equivalent up to a constant factor.

Note, this is the primary reason we chose the {T_r} to have doubly exponential size in {r}: so that the sum of {\log^{1/2}|T_r|} in any contiguous range of {r} is a geometric series dominated by the last term.

1.4. Method 4: generic chaining

Here we will show the generic chaining method, which yields the bound of [Fer75], though we will present an equivalent bound that was later given by Talagrand (see his book [Tal14]):

\displaystyle  g(T) \lesssim \inf_{{\{T_r\}}_{r=0}^\infty} \sup_{t\in T}\sum_{r=0}^\infty 2^{r/2} \cdot d_{\ell_2}(t, T_r) . \ \ \ \ \ (7)
where the infimum is taken over admissible sequences.

Note the similarity between (6) and (7): the latter bound moved the supremum outside the sum. Thus clearly the bound (7) can only be a tighter bound. For a metric {d}, Talagrand defined

\displaystyle  \gamma_p(T, d) := \inf_{\{T_r\}_r} \sup_{t\in T}\sum_{r=0}^\infty 2^{r/p} \cdot d(t, T_r) ,
so now we wish to prove the statement \displaystyle  g(T) \lesssim \gamma_2(T, \ell_2) .
(You are probably guessing at this point that had we not been working with subgaussians, but rather random variables that have decay bounded by {e^{-|x|^p}}, we would get a bound in terms of the {\gamma_p}-functional — your guess is right. I leave it to you as an exercise to modify arguments appropriately!)
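
The quantity inside the infimum of (7) is easy to evaluate for any concrete admissible sequence, which makes it pleasant to experiment with. Here is a small sketch (my own illustration: the admissible sequence is just nested prefixes of a random {T}, nothing close to optimal):

    import numpy as np

    def gamma2_value(T, seq):
        # sup_t sum_r 2^{r/2} d_{l_2}(t, T_r) for a given sequence of sets T_r.
        total = np.zeros(len(T))
        for r, Tr in enumerate(seq):
            d = np.sqrt(((T[:, None, :] - Tr[None, :, :]) ** 2).sum(axis=-1).min(axis=1))
            total += 2 ** (r / 2) * d
        return total.max()

    rng = np.random.default_rng(0)
    T = rng.standard_normal((256, 8))
    # Admissible: |T_0| = 1 and |T_r| <= 2^(2^r); nested prefixes satisfy this.
    seq = [T[:1]] + [T[: min(2 ** 2 ** r, len(T))] for r in range(1, 5)]
    print(gamma2_value(T, seq))

Any admissible sequence plugged into (7) gives a valid upper bound; the hard part, of course, is exhibiting a sequence that achieves the infimum up to constants.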

Now, writing {t_r} for the closest point to {t} in {T_r} (so {t_0} is the single element of {T_0}), we have

\displaystyle  \mathbb{E} \sup_{t\in T}\left\langle g, t \right\rangle = \mathbb{E} \sup_{t\in T} \sum_{r=1}^\infty \underbrace{\left\langle g, t_r - t_{r-1} \right\rangle}_{Y_r(t)} .

Note for fixed {t}, by gaussian decay

\displaystyle  \mathbb{P}(|Y_r(t)| > C u2^{r/2} d_{\ell_2}(t, T_{r-1})) \le 2 e^{-u^2 2^r}
for some constant {C}, since {\|t_r - t_{r-1}\|_2 \le d_{\ell_2}(t, T_r) + d_{\ell_2}(t, T_{r-1}) \le 2 d_{\ell_2}(t, T_{r-1})} by the triangle inequality (the sets are nested, so {d_{\ell_2}(t, T_r) \le d_{\ell_2}(t, T_{r-1})}). Working with {d_{\ell_2}(t, T_{r-1})} rather than {d_{\ell_2}(t, T_r)} will only cost us a constant factor below.

Therefore

\displaystyle  \mathbb{P}(\exists t\in T, r>0\ s.t.\ |Y_r(t)| > C u2^{r/2} d_{\ell_2}(t, T_{r-1})) \lesssim \sum_{r=1}^\infty |T_r| \cdot |T_{r-1}| \cdot e^{-u^2 2^r} \le \sum_{r=1}^\infty 4^{2^r} \cdot e^{-u^2 2^r} \ \ \ \ \ (8)
since {|T_r|, |T_{r-1}| \le 2^{2^r}}. The above sum is convergent for {u\ge 2}.
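
As a quick numeric check of this convergence claim (a sketch; the sum must be computed in log-space, since {4^{2^r}} overflows immediately):

    import numpy as np

    r = np.arange(1, 60)
    for u in [2.0, 2.5, 3.0]:
        # log of the r-th term: 4^(2^r) * exp(-u^2 * 2^r) = exp(2^r * (log 4 - u^2))
        log_terms = (2.0 ** r) * (np.log(4.0) - u * u)
        print(u, np.exp(log_terms).sum())  # finite, small, and decreasing in u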

Now, again using that {\mathbb{E} Z \le \int_0^\infty \mathbb{P}(Z > w) dw}, we have

\displaystyle  \begin{array}{rcl}  g(T) &\le& \int_0^\infty \mathbb{P}(\sup_{t\in T} \sum_{r=1}^\infty Y_r(t) > w) dw\\ {}&\lesssim& \gamma_2(T, \ell_2)\cdot \int_0^\infty \mathbb{P}(\sup_{t\in T}\sum_{r=1}^\infty Y_r(t) > u\cdot C \sup_{t\in T}\sum_{r=1}^\infty 2^{r/2}d_{\ell_2}(t, T_{r-1})) du\text{ (change of variables)}\\ {}&\lesssim& \gamma_2(T, \ell_2)\cdot [2 + \int_2^\infty \mathbb{P}(\sup_{t\in T}\sum_{r=1}^\infty Y_r(t) > u\cdot C \sup_{t\in T}\sum_{r=1}^\infty 2^{r/2}d_{\ell_2}(t, T_{r-1})) du]\\ {}&\le& \gamma_2(T, \ell_2)\cdot [2 + \int_2^\infty \mathbb{P}(\exists t\in T, r>0\ s.t.\ |Y_r(t)| > C u2^{r/2} d_{\ell_2}(t, T_{r-1})) du]\\ {}&\lesssim& \gamma_2(T, \ell_2) \end{array}
as desired. In the change of variables we took {\{T_r\}_r} to be a near-optimal admissible sequence, so that {\sup_{t\in T}\sum_{r=1}^\infty 2^{r/2} d_{\ell_2}(t, T_{r-1}) = \sqrt{2}\cdot \sup_{t\in T}\sum_{r=0}^\infty 2^{r/2} d_{\ell_2}(t, T_r) \lesssim \gamma_2(T, \ell_2)}, and the final inequality uses that (8) decays fast enough in {u} for its integral over {u\ge 2} to be {O(1)}.

Surprisingly, Talagrand showed that not only is {\gamma_2(T, \ell_2)} an asymptotic upper bound for {g(T)}, but it is also an asymptotic lower bound (at least when the entries of {g} are actually gaussian — the lower bound does not hold for general subgaussian entries). That is, {g(T) \simeq \gamma_2(T, \ell_2)} for any {T}. This is known as the “majorizing measures theorem” for reasons we will not get into. In brief: the formulation of [Fer75] did not talk about admissible sequences, or discrete sets at all, but rather worked with measures, providing an upper bound in terms of an infimum over probability measures of a certain integral — that formulation is equivalent to the one above in terms of admissible sequences, and a proof of the equivalence appears in [Tal14].

1.5. A concrete example

Consider the example {T = B_{\ell_1^n} = \{t \in \mathbb{R}^n : \|t\|_1 \le 1\}}, i.e. the unit {\ell_1} ball in {\mathbb{R}^n}. I picked this example because it is easy to compute {g(T)} using other methods. Why? Well, {\sup_{t\in B_{\ell_1^n}}\left\langle g, t \right\rangle = \|g\|_\infty}, since the dual norm of {\ell_1} is {\ell_\infty}! Thus {g(B_{\ell_1^n}) = \mathbb{E} \|g\|_\infty}, which one can check is {\Theta(\sqrt{\log n})}.
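
Here is a quick empirical check (a sketch) that {\mathbb{E}\|g\|_\infty} indeed grows like {\sqrt{\log n}}; the familiar {\sqrt{2\log n}} is the right scale:

    import numpy as np

    rng = np.random.default_rng(0)
    trials = 1000
    for n in [10, 100, 1000, 10_000]:
        g = rng.standard_normal((trials, n))
        emp = np.abs(g).max(axis=1).mean()     # empirical E ||g||_inf
        print(n, emp, np.sqrt(2 * np.log(n)))  # the two columns track each other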

So now the question: what do the four methods above give?

Method 1 (union bound): This method gives nothing, since {T} is an infinite set.

Method 2 ({\varepsilon}-net): To apply this method, we need to understand the size of an {\varepsilon}-net of the {\ell_1} unit ball under {\ell_2}. One bound comes from Maurey’s empirical method, which we will not cover here; it implies {\mathcal{N}(B_{\ell_1^n}, \ell_2, \varepsilon) \le (2n)^{O(1/\varepsilon^2)}}. Another bound comes from a volume argument, which we also will not cover here; it implies {\mathcal{N}(B_{\ell_1^n}, \ell_2, \varepsilon) \le O(2 + 1/(\varepsilon\sqrt{n}))^n}. In short,

\displaystyle  \forall \varepsilon\in(0, \frac 12),\ \log^{1/2} \mathcal{N}(B_{\ell_1^n}, \ell_2, \varepsilon) \lesssim \min\{\varepsilon^{-1}\sqrt{\log n}, \sqrt{n}\cdot \log(1/\varepsilon)\} .  \ \ \ \ \ (9)
By picking {\varepsilon = ((\log n)/n)^{1/4}}, (3) gives us {g(T) \lesssim (n\log n)^{1/4}}. This is exponentially worse than the true bound of {g(T) = \Theta(\sqrt{\log n})}.
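
To see where the choice {\varepsilon = ((\log n)/n)^{1/4}} comes from: since {\rho_{\ell_2}(B_{\ell_1^n})\le 2}, plugging the first branch of (9) into (3) gives, up to constants, {\varepsilon^{-1}\sqrt{\log n} + \varepsilon\sqrt{n}}, and that choice of {\varepsilon} balances the two terms. A tiny numeric check (a sketch):

    import numpy as np

    n = 10_000
    eps = np.logspace(-3, -0.3, 400)
    bound = np.sqrt(np.log(n)) / eps + eps * np.sqrt(n)  # the two terms of (3), constants dropped

    i = bound.argmin()
    print(eps[i], (np.log(n) / n) ** 0.25)        # minimizer vs the predicted (log n / n)^{1/4}
    print(bound[i], 2 * (n * np.log(n)) ** 0.25)  # minimum vs the predicted 2 (n log n)^{1/4}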

Method 3 (Dudley’s inequality):

Combining (9) with (5),

\displaystyle  \begin{array}{rcl}  g(T) &\lesssim& \int_0^{1/\sqrt{n}} \sqrt{n}\cdot \log(1/u)du + \int_{1/\sqrt{n}}^1 u^{-1}\sqrt{\log n}\, du \lesssim \log^{3/2} n . \end{array}

(The first integral evaluates to {\frac 12 \log n + 1} and the second to {\frac 12 \log^{3/2} n}, so the second term dominates.)
This is exponentially better than method 2, but still off from the truth. One might wonder: perhaps the issue is not Dudley’s inequality, but rather that the entropy bounds of (9) are simply loose? Unfortunately this is not the case. To see this, take a set {R} of vectors in {\mathbb{R}^n} that are each {1/\varepsilon^2}-sparse, with {\varepsilon^2} in each non-zero coordinate, and such that all pairwise {\ell_2} distances are {\Theta(\varepsilon)}. A random collection {R} satisfies this distance property with high probability for {|R| = n^{\Theta(1/\varepsilon^2)}} and {\varepsilon \gg 1/\sqrt{n}}. Then note {R\subset B_{\ell_1^n}}, and furthermore one needs at least {|R|} balls of {\ell_2} radius a small constant times {\varepsilon} just to cover {R}. (Thanks to Oded Regev for this simple demonstration.)

It is also worth pointing out that this is the worst case for Dudley’s inequality: it can never be off by more than a factor of {\log n}. I’ll leave it to you as an exercise to figure out why (you should assume the majorizing measures theorem, i.e. that (7) is tight)! Hint: compare (6) with (7) and show that nothing interesting happens beyond {r > \lg n + c\lg\lg n}.

Method 4 (generic chaining):

By the majorizing measures theorem, we know there must exist an admissible sequence giving the correct {g(T)\lesssim \sqrt{\log n}}, thus being superior to Dudley’s inequality. I will leave constructing it to you as homework! Note: I know for certain that this is a humanly doable task. Eric Price, Mary Wootters, and I tried at some point in the past as an exercise. Eric and I managed to get the bound down to {O(\log n)}, and Mary then figured out an admissible sequence achieving the optimal {O(\sqrt{\log n})}. So, it’s doable!

Bibliography

[Dud67] Richard M. Dudley. The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. J. Functional Analysis, 1:290–330, 1967.

[Fer75] Xavier Fernique. Regularité des trajectoires des fonctions aléatoires gaussiennes. Lecture Notes in Math., 480:1–96, 1975.

[Tal14] Michel Talagrand. Upper and lower bounds for stochastic processes: modern methods and classical problems. Springer, 2014.

10 thoughts on “Chaining methods continued (guest post by Jelani Nelson)”

  1. Hi,

    In the derivation for the union bound (1.1), I don’t understand where, in the third inequality, the $\rho_{l_2}(T)$ term in the integrand comes from. Could someone please clarify this?

    1. $X_t = \langle t,g \rangle$. Thus $X_t$ is a gaussian with variance $\|t\|_2^2 \le \rho_{l_2}(T)^2$. A gaussian with variance $\sigma^2$ decays like $\exp(-x^2 / (2\sigma^2))$. Does that clarify things?

      1. Sorry, I think the question was why it also shows up as a product instead of just in the denominator of the exponent.

      2. I hope I’m addressing the right concern here (typing from a phone with really bad internet and the original post isn’t loading, so I’m guessing what I wrote before :)).

        I think the question is about the following (?). Suppose there are N (possibly dependent) gaussians each with standard deviation at most s. Then the expected max of them is O(s sqrt(lg N)). Are you wondering why? (Here s is rho_{l_2}(T).) The reason is E max(g_1,…,g_N) <= int_0^infty Pr(max(g_1,…,g_N) > u) du.

        Now you break the integral into two regions: below u* and above u*. For below u*, you just say the integrand is never bigger than 1 (it’s a probability), so you get a contribution of at most u* from small u. For large u, you use a union bound and say Pr(max(..) > u) <= N exp(- C u^2 / s^2). Integrating this for u >= u*, you get something like N exp(- C u*^2 / s^2). Now set u* = C’ s sqrt(lg N) to balance the two integrals, and your overall bound becomes O(s sqrt(lg N)).

      3. If I’m understanding your comment right, instead of bounding by N exp(- C u*^2 / s^2), in your post, you’ve bounded it by s * N exp(- C u*^2 / s^2) (note extra factor of s, which is what I was asking about)

      4. You’re right, that’s a typo. That extra factor of $s$ shouldn’t be there. The next line is correct though and seems to have correctly pretended the extra $s$ wasn’t there.

  2. Typo fix: for method 4, in (8) and in a few places before it, the $\exp(-u^2 2^{2^r})$ should be $\exp(-u^2 2^r)$ (thanks to David Woodruff for catching that).

  3. In the example “1.1. Method 1: union bound”, don’t you think that the correct threshold in the integral is $\rho(T) \sqrt{2 \log |T|}$ instead of $C \rho(T) \sqrt{ \log n}$?

    With the original threshold, I’m unable to remove the dependence on $|T|$ appearing from the union bound in the last integral.

    1. Yes, you’re absolutely right. It seems I wrote those lines while thinking that n denoted |T|, whereas I earlier defined n as the dimension. The integral should really go up to some $C \rho(T) \sqrt{\log |T|}$.
