Chaining methods continued (guest post by Jelani Nelson)
1. A case study: (sub)gaussian processes
To give an introduction to chaining, I will focus our attention on a concrete scenario. Suppose we have a bounded (but possibly infinite) collection of vectors . Furthermore, let be a random vector with its entries being independent, mean zero, and unit variance gaussians. We will consider the collection of variables with defined as . In what follows, we will only ever use one property of these :
This provides us with some understanding of the dependency structure of the . In particular, if are close in , then it’s very likely that the random variables and are also close.
Why does this property hold? Well,
We then use the property that adding independent gaussians yields a gaussian in which the variances add. If you haven’t seen that fact before, it follows easily from looking at the Fourier transform of the gaussian pdf. Adding independent random variables convolves their pdfs, which pointwise multiplies their Fourier transforms. Since the Fourier transform of a gaussian pdf is a gaussian whose variance is inverted, it then follows that summing independent gaussians gives a gaussian with summed variances. Thus is a gaussian with variance , and (1) then follows by tail behavior of gaussians. Note (1) would hold for subgaussian distributions too, such as for example being a vector of independent uniform random variables.
Now I will present four approaches to bounding . These approaches will be gradually sharper. For simplicity I will assume , although it is easy to get around this for methods 2, 3, and 4.
1.1. Method 1: union bound
Remember that, in general for a scalar random variable ,
1.2. Method 2: -nets
Let be an -net of under . That is, for all there exists such that . Now note so that
We already know by (2). Also, , and
where denotes the entropy number or covering number, defined as the minimum number of radius- balls under metric centered at points in required to cover (i.e. the size of the smallest -net). Of course can be chosen to minimize (3). Note the case just reduces back to method 1.
1.3. Method 3: Dudley’s inequality (chaining)
The idea of Dudley’s inequality [Dud67] is to, rather than use one net, use a countably infinite sequence of nets. That is, let denote an -net of under , where . Let denote the closest point in to some . Note is a valid -net. Then
with the last inequality using (2), since the triangle inequality yields
The sum (4) is perfectly fine as is, though the typical formulation of Dudley’s inequality then bounds the sum by an integral over (representing ) then performs the change of variable . This yields the usual formulation of Dudley’s inequality:
To see this most easily, compare with the bound (4). Note that to minimize , we should pick the best quality net we can using points. From until some , the quality of the net will be, up to a factor of , equal to , and for the in this range the summands of (6) will be a geometric series that sum to . Then from to some , the quality of the best net will be, up to a factor of , equal to , and these summands then are a geometric series that sum to , etc. In this way, the bounds of (4) and (6) are equivalent up to a constant factor.
Note, this is the primary reason we chose the to have doubly exponential size in : so that the sum of in any contiguous range of is a geometric series dominated by the last term.
1.4. Method 4: generic chaining
where the infimum is taken over admissible sequences.
so now we wish to prove the statement
(You are probably guessing at this point that had we not been working with subgaussians, but rather random variables that have decay bounded by , we would get a bound in terms of the -functional — your guess is right. I leave it to you as an exercise to modify arguments appropriately!)
Now, we have
Note for fixed , by gaussian decay
for some constant , since by the triangle inequality.
since . The above sum is convergent for .
Now, again using that , we have
Surprisingly, Talagrand showed that not only is an asymptotic upper bound for , but it is also an asymptotic lower bound (at least when the entries of are actually gaussians — the lower bound does not hold for subgaussian entries). That is, for any . This is known as the “majorizing measures theorem” for reasons we will not get into. In brief: the formulation of [Fer75] did not talk about admissible sequences, or discrete sets at all, but rather worked with measures and provided an upper bound in terms of an infimum over a set of probability measures of a certain integral — this formulation is equivalent to the formulation discussed above in terms of admissible sets, and a proof of the equivalence appears in [Tal14].
1.5. A concrete example
Consider the example , i.e. the unit ball in . I picked this example because it is easy to already know using other methods. Why? Well, , since the dual norm of is ! Thus , which one can check is . Thus we know the answer is .
So now the question: what do the four methods above give?
Method 1 (union bound): This method gives nothing, since is an infinite set.
Method 2 (-net): To apply this method, we need to understand the size of an -net of the unit ball under . One bound comes from Maurey’s empirical method, which we will not cover here. It implies . Another bound is via a volume argument, which we also will not cover here; it implies . In short,
By picking , (3) gives us . This is exponentially worse than true bound of .
Method 3 (Dudley’s inequality):
This is exponentially better than method 2, but still off from the truth. We can though wonder: perhaps the issue is not Dudley’s inequality, but perhaps the entropy bounds of (9) are simply loose? Unfortunately this is not the case. To see this, take a set of vectors in that are each -sparse, with in each non-zero coordinate, and so that all pairwise distances are . A random collection satisfies this distance property with high probability for and . Then note and furthermore one needs at least radius- balls in just to cover . (Thanks to Oded Regev for this simple demonstration.)
It is also worth pointing out that this is the worst case for Dudley’s inequality: it can never be off by more than a factor of . I’ll leave it to you as an exercise to figure out why (you should assume the majorizing measures theorem, i.e. that (7) is tight)! Hint: compare (6) with (7) and show that nothing interesting happens beyond .
Method 4 (generic chaining):
By the majorizing measures theorem, we know there must exist an admissible sequence giving the correct , thus being superior to Dudley’s inequality. I will leave constructing it to you as homework! Note: I know for certain that this is a humanly doable task. Eric Price, Mary Wootters, and I tried at some point in the past as exercise. Eric and I managed to get it down to , and Mary then figured out an admissible sequence achieving the optimal . So, it’s doable!
[Dud67] Richard M. Dudley. The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. J. Functional Analysis, 1:290–330, 1967.
[Fer75] Xavier Fernique. Regularité des trajectoires des fonctions aléatoires gaussiennes. Lecture Notes in Math., 480:1–96, 1975.
[Tal14] Michel Talagrand. Upper and lower bounds for stochastic processes: modern methods and classical problems. Springer, 2014.