[This is the sequel to Jelani's previous post on the chaining method; for more, see also the post on the STOC workshop on this topic. –Boaz]
1. A case study: (sub)gaussian processes
To give an introduction to chaining, I will focus our attention on a concrete scenario. Suppose we have a bounded (but possibly infinite) collection of vectors $T \subset \mathbb{R}^n$. Furthermore, let $g = (g_1, \ldots, g_n)$ be a random vector with its entries being independent, mean zero, and unit variance gaussians. We will consider the collection of random variables $\{X_t\}_{t\in T}$, with $X_t$ defined as $X_t := \langle g, t \rangle$. In what follows, we will only ever use one property of these $X_t$:

$$\forall s, t \in T,\ \forall \lambda > 0:\quad \Pr\big(|X_s - X_t| > \lambda\big) \lesssim e^{-\lambda^2/(2\|s - t\|_2^2)}. \qquad (1)$$
This provides us with some understanding of the dependency structure of the $X_t$. In particular, if $s$ and $t$ are close in $\ell_2$, then it's very likely that the random variables $X_s$ and $X_t$ are also close.
Why does this property hold? Well,

$$X_s - X_t = \langle g, s - t\rangle = \sum_{i=1}^n g_i (s_i - t_i).$$

We then use the property that adding independent gaussians yields a gaussian in which the variances add. If you haven't seen that fact before, it follows easily from looking at the Fourier transform of the gaussian pdf. Adding independent random variables convolves their pdfs, which pointwise multiplies their Fourier transforms. Since the Fourier transform of a gaussian pdf is a gaussian whose variance is inverted, it then follows that summing independent gaussians gives a gaussian with summed variances. Thus $X_s - X_t$ is a gaussian with variance $\|s - t\|_2^2$, and (1) then follows from the tail behavior of gaussians. Note (1) would hold for subgaussian distributions too, such as for example $g$ being a vector of independent uniform $\pm 1$ random variables.
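If you want to see property (1) in action, here is a quick numerical sanity check, a sketch assuming numpy is available; the dimension, the pair $s, t$, and the threshold $3\|s-t\|_2$ are arbitrary illustrative choices. The increment $X_s - X_t$ should have standard deviation $\|s-t\|_2$ and a rapidly decaying tail, whether $g$ has gaussian or uniform $\pm 1$ entries.

```python
# Numerical sanity check of property (1), assuming numpy is available.
import numpy as np

rng = np.random.default_rng(0)
n, trials = 50, 100_000
s, t = rng.standard_normal(n), rng.standard_normal(n)
d = np.linalg.norm(s - t)

samplers = {
    "gaussian": lambda: rng.standard_normal((trials, n)),
    "uniform +/-1": lambda: rng.choice([-1.0, 1.0], size=(trials, n)),
}
for name, sampler in samplers.items():
    increments = sampler() @ (s - t)              # samples of X_s - X_t
    tail = (np.abs(increments) > 3 * d).mean()    # roughly exp(-9/2) ~ 0.011 or less
    print(f"{name:13s} empirical std = {increments.std():.3f},"
          f" ||s-t||_2 = {d:.3f}, P(|X_s - X_t| > 3||s-t||_2) = {tail:.4f}")
```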
Now I will present four approaches to bounding $\mathbb{E}\sup_{t\in T} X_t$. These approaches will be gradually sharper. For simplicity I will assume $0 \in T$, although it is easy to get around this for methods 2, 3, and 4.
1.1. Method 1: union bound
Remember that, in general for a scalar random variable $Z$,

$$\mathbb{E} Z \le \mathbb{E}|Z| = \int_0^\infty \Pr(|Z| > u)\, du.$$

Let $\rho_{\ell_2}(T)$ denote the diameter of $T$ under the norm $\ell_2$. Since $0 \in T$, each $X_t$ is a gaussian with variance $\|t\|_2^2 \le \rho_{\ell_2}(T)^2$. Then for any threshold $u^* > 0$,

$$\mathbb{E}\sup_{t\in T} X_t \le \int_0^\infty \Pr\Big(\sup_{t\in T}|X_t| > u\Big)\, du \le u^* + \int_{u^*}^\infty 2|T|\, e^{-u^2/(2\rho_{\ell_2}(T)^2)}\, du,$$

and choosing $u^* = \rho_{\ell_2}(T)\sqrt{2\log|T|}$ makes the second integral $O(\rho_{\ell_2}(T))$, giving

$$\mathbb{E}\sup_{t\in T} X_t \lesssim \rho_{\ell_2}(T)\sqrt{\log|T|}. \qquad (2)$$
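Here is a small simulation of (2), assuming numpy is available; the value of $\sigma$ and the trial counts are arbitrary. For simplicity the gaussians here are independent, though (2) itself does not require independence.

```python
# Illustration of (2): the max of N gaussians with std at most sigma grows like sigma*sqrt(log N).
import numpy as np

rng = np.random.default_rng(1)
sigma, trials = 2.0, 1_000
for N in [10, 100, 1_000, 10_000]:
    samples = sigma * rng.standard_normal((trials, N))
    emp = samples.max(axis=1).mean()              # Monte Carlo estimate of E max
    print(f"N = {N:6d}   E max ~ {emp:5.2f}   sigma*sqrt(2 ln N) = "
          f"{sigma * np.sqrt(2 * np.log(N)):5.2f}")
```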
1.2. Method 2: $\varepsilon$-nets
Let $T'$ be an $\varepsilon$-net of $T$ under $\ell_2$. That is, for all $t \in T$ there exists some $t' \in T'$ such that $\|t - t'\|_2 \le \varepsilon$. Now note

$$X_t = X_{t'} + (X_t - X_{t'}),$$

so that

$$\sup_{t\in T} X_t \le \sup_{t'\in T'} X_{t'} + \sup_{t\in T}\big(X_t - X_{t'}\big).$$

Therefore

$$\mathbb{E}\sup_{t\in T} X_t \le \mathbb{E}\sup_{t'\in T'} X_{t'} + \mathbb{E}\sup_{t\in T}\big(X_t - X_{t'}\big).$$

We already know $\mathbb{E}\sup_{t'\in T'} X_{t'} \lesssim \rho_{\ell_2}(T)\sqrt{\log|T'|}$ by (2). Also, $X_t - X_{t'} = \langle g, t - t'\rangle \le \|g\|_2\cdot\|t - t'\|_2 \le \varepsilon\|g\|_2$ by Cauchy–Schwarz, and $\mathbb{E}\|g\|_2 \le (\mathbb{E}\|g\|_2^2)^{1/2} = \sqrt{n}$. Therefore

$$\mathbb{E}\sup_{t\in T} X_t \lesssim \varepsilon\sqrt{n} + \rho_{\ell_2}(T)\sqrt{\log\mathcal{N}(T, d_{\ell_2}, \varepsilon)}, \qquad (3)$$
where $\mathcal{N}(T, d, \varepsilon)$ denotes the entropy number or covering number, defined as the minimum number of radius-$\varepsilon$ balls under the metric $d$, centered at points in $T$, required to cover $T$ (i.e. the size of the smallest $\varepsilon$-net). Of course $\varepsilon$ can be chosen to minimize (3). Note the case $\varepsilon = 0$ just reduces back to method 1.
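The only new ingredient beyond (2) is the crude bound $\mathbb{E}\|g\|_2 \le \sqrt{n}$ used in the Cauchy–Schwarz term; here is a two-line check, assuming numpy is available.

```python
# Checking E||g||_2 <= sqrt(n); recall X_t - X_{t'} <= ||g||_2 * ||t - t'||_2 <= eps*||g||_2.
import numpy as np

rng = np.random.default_rng(2)
for n in [10, 100, 1_000, 10_000]:
    g = rng.standard_normal((1_000, n))
    print(f"n = {n:6d}   E||g||_2 ~ {np.linalg.norm(g, axis=1).mean():7.2f}"
          f"   sqrt(n) = {np.sqrt(n):7.2f}")
```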
1.3. Method 3: Dudley’s inequality (chaining)
The idea of Dudley's inequality [Dud67] is to, rather than use one net, use a countably infinite sequence of nets. That is, let $T_r$ denote an $\varepsilon_r$-net of $T$ under $d_{\ell_2}$, where $\varepsilon_r := \rho_{\ell_2}(T)/2^r$. Let $t_r$ denote the closest point in $T_r$ to a given $t \in T$. Note $T_0 = \{0\}$ is a valid $\varepsilon_0$-net. Then, telescoping,

$$X_t = X_{t_0} + \sum_{r=1}^\infty \big(X_{t_r} - X_{t_{r-1}}\big) = \sum_{r=1}^\infty \big(X_{t_r} - X_{t_{r-1}}\big),$$
so then

$$\mathbb{E}\sup_{t\in T} X_t \le \sum_{r=1}^\infty \mathbb{E}\sup_{t\in T}\big(X_{t_r} - X_{t_{r-1}}\big) \lesssim \sum_{r=1}^\infty \varepsilon_{r-1}\sqrt{\log\big(|T_r|\cdot|T_{r-1}|\big)},$$

with the last inequality using (2), since the triangle inequality yields

$$\|t_r - t_{r-1}\|_2 \le \|t_r - t\|_2 + \|t - t_{r-1}\|_2 \le \varepsilon_r + \varepsilon_{r-1} \le 2\varepsilon_{r-1},$$

and as $t$ ranges over $T$ the pair $(t_r, t_{r-1})$ takes at most $|T_r|\cdot|T_{r-1}|$ values. Thus

$$\mathbb{E}\sup_{t\in T} X_t \lesssim \sum_{r=1}^\infty \frac{\rho_{\ell_2}(T)}{2^r}\sqrt{\log\mathcal{N}\big(T, d_{\ell_2}, \rho_{\ell_2}(T)/2^r\big)}. \qquad (4)$$

The sum (4) is perfectly fine as is, though the typical formulation of Dudley's inequality then bounds the sum by an integral over a continuous variable $u$ (representing $\rho_{\ell_2}(T)/2^r$), i.e. performs the change of variable $u = \rho_{\ell_2}(T)/2^r$. This yields the usual formulation of Dudley's inequality:

$$\mathbb{E}\sup_{t\in T} X_t \lesssim \int_0^\infty \sqrt{\log\mathcal{N}\big(T, d_{\ell_2}, u\big)}\, du. \qquad (5)$$
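To see the sum (4) in action, here is a toy computation, a sketch assuming numpy is available; the finite random set $T$, its size, and the greedy net construction are arbitrary choices made for illustration, and no attempt is made to track the constant hidden in (4). We build $\varepsilon_r$-nets greedily, evaluate the sum in (4), and compare against a Monte Carlo estimate of $\mathbb{E}\sup_{t\in T} X_t$.

```python
# Toy illustration of the Dudley-type sum (4) for a small random finite set T.
import numpy as np

rng = np.random.default_rng(3)
n, m = 20, 300
T = rng.standard_normal((m, n))
T[0] = 0.0                                   # ensure 0 is in T, as assumed above

D = np.linalg.norm(T[:, None, :] - T[None, :, :], axis=2)   # pairwise distances
rho = D.max()                                                # diameter of T

def greedy_net(eps):
    """Greedily pick centers from T so that every point is within eps of one."""
    centers, covered = [], np.zeros(m, dtype=bool)
    while not covered.all():
        i = int(np.argmax(~covered))         # first uncovered point becomes a center
        centers.append(i)
        covered |= D[i] <= eps
    return centers

dudley_sum, r = 0.0, 1
while True:
    eps = rho / 2 ** r
    net = greedy_net(eps)
    dudley_sum += eps * np.sqrt(np.log(max(len(net), 2)))    # guard against log(1) = 0
    if len(net) == m:                        # nets are now exact; later terms are negligible
        break
    r += 1

g = rng.standard_normal((20_000, n))
emp = (g @ T.T).max(axis=1).mean()           # Monte Carlo estimate of E sup_t X_t
print(f"Monte Carlo E sup_t X_t ~ {emp:.2f}    Dudley-type sum (4) ~ {dudley_sum:.2f}")
```

On a typical run the two numbers come out within a small constant factor of each other, which is all one should expect.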
It is worth pointing out that Dudley's inequality is equivalent to the following bound. We say a sequence of subsets $\{T_r\}_{r=0}^\infty$ of $T$ is an admissible sequence if $|T_0| = 1$ and $|T_r| \le 2^{2^r}$ for all $r$. Then Dudley's inequality is equivalent to the bound

$$\mathbb{E}\sup_{t\in T} X_t \lesssim \inf_{\{T_r\}}\ \sum_{r=0}^\infty 2^{r/2}\,\sup_{t\in T} d_{\ell_2}(t, T_r), \qquad (6)$$

where the infimum is over admissible sequences and $d_{\ell_2}(t, T_r) := \min_{s\in T_r}\|t - s\|_2$.
To see this most easily, compare with the bound (4). Note that to minimize (6), we should pick the best quality net we can using $2^{2^r}$ points. From $r = 0$ until some $r_1$, the quality of the best net will be, up to a factor of $2$, equal to some $\varepsilon_1$, and for the $r$ in this range the summands of (6) form a geometric series that sums to $O(2^{r_1/2}\varepsilon_1)$. Then from $r_1$ to some $r_2$, the quality of the best net will be, up to a factor of $2$, equal to some $\varepsilon_2 \le \varepsilon_1/2$, and these summands then are a geometric series that sums to $O(2^{r_2/2}\varepsilon_2)$, etc. Since $2^{2^{r_k}}$ is, by the choice of $r_k$, essentially the covering number $\mathcal{N}(T, d_{\ell_2}, \varepsilon_k)$, we have $2^{r_k/2} \approx \sqrt{\log\mathcal{N}(T, d_{\ell_2}, \varepsilon_k)}$, so each of these grouped sums is comparable to a term of (4). In this way, the bounds of (4) and (6) are equivalent up to a constant factor.
Note, this is the primary reason we chose the $T_r$ to have doubly exponential size in $r$: so that the sum of $2^{r/2}$ over any contiguous range of $r$ is a geometric series dominated by its last term.
1.4. Method 4: generic chaining
Here we will show the generic chaining method, which yields the bound of [Fer75], though we will present an equivalent bound that was later given by Talagrand (see his book [Tal14]):

$$\mathbb{E}\sup_{t\in T} X_t \lesssim \inf_{\{T_r\}}\ \sup_{t\in T}\ \sum_{r=0}^\infty 2^{r/2}\, d_{\ell_2}(t, T_r), \qquad (7)$$

where the infimum is taken over admissible sequences.
Note the similarity between (6) and (7): the latter bound moves the supremum outside the sum. Thus clearly the bound (7) can only be tighter. For a metric $d$, Talagrand defined

$$\gamma_2(T, d) := \inf_{\{T_r\}}\ \sup_{t\in T}\ \sum_{r=0}^\infty 2^{r/2}\, d(t, T_r),$$

so now we wish to prove the statement

$$\mathbb{E}\sup_{t\in T} X_t \lesssim \gamma_2(T, d_{\ell_2}).$$
(You are probably guessing at this point that had we not been working with subgaussians, but rather random variables whose increments satisfy a tail bound of the form $\Pr(|X_s - X_t| > \lambda\, d(s,t)) \lesssim e^{-\lambda^\alpha}$, we would get a bound in terms of the $\gamma_\alpha$-functional, defined as above but with $2^{r/\alpha}$ in place of $2^{r/2}$ — your guess is right. I leave it to you as an exercise to modify the arguments appropriately!)
Now, fix an admissible sequence $\{T_r\}$, and for $t\in T$ let $t_r$ denote the closest point to $t$ in $T_r$. Writing $T_0 = \{t_0\}$ and using $\mathbb{E} X_{t_0} = 0$, we have

$$\mathbb{E}\sup_{t\in T} X_t = \mathbb{E}\sup_{t\in T}\big(X_t - X_{t_0}\big) = \mathbb{E}\sup_{t\in T}\ \sum_{r=1}^\infty \big(X_{t_r} - X_{t_{r-1}}\big).$$

Note for fixed $t$, by gaussian decay

$$\Pr\Big(|X_{t_r} - X_{t_{r-1}}| > C\, u\, 2^{r/2}\,\|t_r - t_{r-1}\|_2\Big) \lesssim e^{-c u^2 2^r}$$

for some constant $c$, since

$$\|t_r - t_{r-1}\|_2 \le d_{\ell_2}(t, T_r) + d_{\ell_2}(t, T_{r-1})$$

by the triangle inequality.
Therefore

$$\Pr\Big(\exists t\in T,\ \exists r\ge 1:\ |X_{t_r} - X_{t_{r-1}}| > C\, u\, 2^{r/2}\,\|t_r - t_{r-1}\|_2\Big) \lesssim \sum_{r=1}^\infty 2^{2^{r+1}}\, e^{-c u^2 2^r}, \qquad (8)$$

since $|T_r|\cdot|T_{r-1}| \le 2^{2^r}\cdot 2^{2^{r-1}} \le 2^{2^{r+1}}$. The above sum is convergent for $u$ larger than some fixed constant $C'$, and for such $u$ it is at most $e^{-c' u^2}$ (say). On the complementary event, every $t\in T$ satisfies

$$X_t - X_{t_0} \le C\, u \sum_{r=1}^\infty 2^{r/2}\big(d_{\ell_2}(t, T_r) + d_{\ell_2}(t, T_{r-1})\big) \lesssim u\cdot\gamma_2(T, d_{\ell_2}).$$

Now, again using that $\mathbb{E} Z = \int_0^\infty \Pr(Z > u)\, du$ for a nonnegative random variable $Z$, we have

$$\mathbb{E}\sup_{t\in T} X_t \lesssim \gamma_2(T, d_{\ell_2})\cdot\Big(C' + \int_{C'}^\infty e^{-c' u^2}\, du\Big) \lesssim \gamma_2(T, d_{\ell_2}),$$

as desired.
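For the convergence claim above, here is a quick check, assuming numpy is available, that $\sum_{r\ge 1} 2^{2^{r+1}} e^{-u^2 2^r}$ is finite, and in fact tiny, once $u$ exceeds a modest constant (the point being that $u^2\, 2^r$ eventually dwarfs $2^{r+1}\ln 2$).

```python
# Checking convergence of the union-bound sum from the generic chaining proof.
import numpy as np

r = np.arange(1, 30)
for u in [1.5, 2.0, 3.0, 5.0]:
    log_terms = (2.0 ** (r + 1)) * np.log(2.0) - u**2 * (2.0 ** r)
    print(f"u = {u:3.1f}   sum ~ {np.exp(log_terms).sum():.3e}")
```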
Surprisingly, Talagrand showed that $\gamma_2(T, d_{\ell_2})$ is not only an asymptotic upper bound for $\mathbb{E}\sup_{t\in T} X_t$, but it is also an asymptotic lower bound (at least when the entries of $g$ are actually gaussians — the lower bound does not hold for general subgaussian entries). That is,

$$\mathbb{E}\sup_{t\in T} X_t \gtrsim \gamma_2(T, d_{\ell_2})$$

for any $T$. This is known as the "majorizing measures theorem" for reasons we will not get into. In brief: the formulation of [Fer75] did not talk about admissible sequences, or discrete sets at all, but rather worked with measures and provided an upper bound in terms of an infimum, over a set of probability measures, of a certain integral — this formulation is equivalent to the formulation discussed above in terms of admissible sequences, and a proof of the equivalence appears in [Tal14].
1.5. A concrete example
Consider the example $T = B_{\ell_1^n}$, i.e. the unit $\ell_1$ ball in $\mathbb{R}^n$. I picked this example because it is easy to already know $\mathbb{E}\sup_{t\in T} X_t$ using other methods. Why? Well, $\sup_{t\in T}\langle g, t\rangle = \|g\|_\infty$, since the dual norm of $\ell_1$ is $\ell_\infty$! Thus $\mathbb{E}\sup_{t\in T} X_t = \mathbb{E}\|g\|_\infty$, which one can check is $\Theta(\sqrt{\log n})$. Thus we know the answer is $\Theta(\sqrt{\log n})$.
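Here is a quick numerical check of this known answer, assuming numpy is available; the trial counts are arbitrary.

```python
# sup over the l1 ball of <g, t> equals ||g||_inf, whose mean grows like sqrt(2 ln n).
import numpy as np

rng = np.random.default_rng(4)
for n in [10, 100, 1_000, 10_000]:
    g = rng.standard_normal((1_000, n))
    print(f"n = {n:6d}   E||g||_inf ~ {np.abs(g).max(axis=1).mean():5.2f}"
          f"   sqrt(2 ln n) = {np.sqrt(2 * np.log(n)):5.2f}")
```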
So now the question: what do the four methods above give?
Method 1 (union bound): This method gives nothing, since $T$ is an infinite set.
Method 2 ($\varepsilon$-net): To apply this method, we need to understand the size of an $\varepsilon$-net of the $\ell_1$ unit ball under $\ell_2$. One bound comes from Maurey's empirical method, which we will not cover here. It implies $\log\mathcal{N}(B_{\ell_1^n}, d_{\ell_2}, \varepsilon) \lesssim \log(n)/\varepsilon^2$. Another bound is via a volume argument, which we also will not cover here; it implies $\log\mathcal{N}(B_{\ell_1^n}, d_{\ell_2}, \varepsilon) \lesssim n\log(2/\varepsilon)$. In short,

$$\log\mathcal{N}\big(B_{\ell_1^n}, d_{\ell_2}, \varepsilon\big) \lesssim \min\Big\{\frac{\log n}{\varepsilon^2},\ n\log(2/\varepsilon)\Big\}. \qquad (9)$$

By picking $\varepsilon = (\log(n)/n)^{1/4}$ to balance the two terms of (3), (3) gives us $\mathbb{E}\sup_{t\in T} X_t \lesssim (n\log n)^{1/4}$. This is exponentially worse than the true bound of $\Theta(\sqrt{\log n})$.
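To see where the $(n\log n)^{1/4}$ comes from, here is a rough numerical optimization of (3) using the Maurey estimate from (9), assuming numpy is available; constants (including the diameter, which is $2$) are dropped, and the grid of $\varepsilon$ values is arbitrary.

```python
# Rough optimization of the net bound (3) for the l1 ball, constants dropped.
import numpy as np

for n in [10**2, 10**4, 10**6]:
    eps = np.logspace(-6, 0, 2_000)
    bound = eps * np.sqrt(n) + np.sqrt(np.log(n)) / eps     # the two terms of (3)
    print(f"n = {n:8d}   best net bound ~ {bound.min():8.1f}"
          f"   (n log n)^(1/4) = {(n * np.log(n)) ** 0.25:8.1f}"
          f"   sqrt(log n) = {np.sqrt(np.log(n)):5.2f}")
```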
Method 3 (Dudley's inequality): Plugging the entropy bounds (9) into Dudley's inequality (5),

$$\mathbb{E}\sup_{t\in T} X_t \lesssim \int_0^2 \sqrt{\log\mathcal{N}\big(B_{\ell_1^n}, d_{\ell_2}, u\big)}\, du \lesssim \int_0^{1/\sqrt{n}} \sqrt{n\log(2/u)}\, du + \int_{1/\sqrt{n}}^{2} \frac{\sqrt{\log n}}{u}\, du \lesssim \log^{3/2} n.$$
This is exponentially better than method 2, but still off from the truth. We can though wonder: perhaps the issue is not Dudley's inequality, but rather that the entropy bounds of (9) are simply loose? Unfortunately this is not the case. To see this, take a set of vectors in $B_{\ell_1^n}$ that are each $k$-sparse, with $\pm 1/k$ in each non-zero coordinate, and such that all pairwise $\ell_2$ distances are $\gtrsim 1/\sqrt{k}$. A random collection of $2^{\Theta(k\log(n/k))}$ such vectors satisfies this distance property with high probability, for any $k \le \sqrt{n}$ (say). Then note that for $\varepsilon \approx 1/\sqrt{k}$ we have $k\log(n/k) = \Theta(\log(n)/\varepsilon^2)$, and furthermore one needs at least $2^{\Omega(k\log(n/k))}$ radius-$(c/\sqrt{k})$ balls in $\ell_2$ just to cover this collection, so the first bound of (9) is essentially tight at such scales. (Thanks to Oded Regev for this simple demonstration.)
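For comparison, one can numerically evaluate Dudley's integral (5) with the entropy estimate (9) plugged in; this is a sketch assuming numpy is available, with constants ignored throughout, so only the growth in $n$ is meaningful. The integral indeed grows like $\log^{3/2} n$, while the truth is $\Theta(\sqrt{\log n})$.

```python
# Numerically evaluating Dudley's integral (5) for T = B_{l1^n} via estimate (9).
import numpy as np

def dudley_integral(n, steps=200_000):
    # crude Riemann sum of sqrt(min(log(n)/u^2, n*log(2/u))) over u in (0, 2]
    du = 2.0 / steps
    u = np.arange(1, steps + 1) * du
    log_cov = np.minimum(np.log(n) / u**2, n * np.log(2.0 / u))
    return float(np.sum(np.sqrt(log_cov) * du))

for n in [10**2, 10**4, 10**6]:
    print(f"n = {n:8d}   Dudley integral ~ {dudley_integral(n):7.1f}"
          f"   (log n)^(3/2) = {np.log(n) ** 1.5:7.1f}"
          f"   sqrt(log n) = {np.sqrt(np.log(n)):5.2f}")
```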
It is also worth pointing out that this is the worst case for Dudley's inequality: it can never be off by more than a factor of $\log n$. I'll leave it to you as an exercise to figure out why (you should assume the majorizing measures theorem, i.e. that (7) is tight)! Hint: compare (6) with (7), and show that nothing interesting happens in these sums beyond $r \approx \log n$.
Method 4 (generic chaining): By the majorizing measures theorem, we know there must exist an admissible sequence giving the correct $\Theta(\sqrt{\log n})$, thus being superior to Dudley's inequality. I will leave constructing it to you as homework! Note: I know for certain that this is a humanly doable task. Eric Price, Mary Wootters, and I tried at some point in the past as an exercise. Eric and I managed to get it down to a bound just shy of optimal, and Mary then figured out an admissible sequence achieving the optimal $\Theta(\sqrt{\log n})$. So, it's doable!
Bibliography
[Dud67] Richard M. Dudley. The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. J. Functional Analysis, 1:290–330, 1967.
[Fer75] Xavier Fernique. Regularité des trajectoires des fonctions aléatoires gaussiennes. Lecture Notes in Math., 480:1–96, 1975.
[Tal14] Michel Talagrand. Upper and lower bounds for stochastic processes: modern methods and classical problems. Springer, 2014.
Comments

Hi,
In the derivation for the union bound (1.1), I don’t understand where, in the third inequality, the $\rho_{l_2}(T)$ term in the integrand comes from. Could someone please clarify this?
$X_t = \langle t,g \rangle$. Thus $X_t$ is a gaussian with variance $\|t\|_2^2 \le \rho_{l_2}(T)^2$. A gaussian with variance $\sigma^2$ decays like $\exp(-x^2 / (2\sigma^2))$. Does that clarify things?
Sorry, I think the question was why it also shows up as a product instead of just in the denominator of the exponent.
I hope I’m addressing the right concern here (typing from a phone with really bad internet and the original post isn’t loading, so I’m guessing what I wrote before :)).
I think the question is about the following (?). Suppose there are N (possibly dependent) gaussians each with standard deviation at most s. Then the expected max of them is O(s sqrt(lg N)). Are you wondering why? (Here s is rho_{l_2}(T).) The reason is E max(g_1,…,g_N) <= E max(|g_1|,…,|g_N|), which equals the integral over u from 0 to infinity of Pr(max_i |g_i| > u) du.
Now you break the integral into two regions: below u* and above u*. For below u*, you just say the integrand is never bigger than 1 (it's a probability), so you get a contribution of at most u* from small u. For large u, you use a union bound and say Pr(max(..) > u) <= N exp(- C u^2 / s^2). Integrating this over u >= u*, you get something like N exp(- C u*^2 / s^2). Now set u* = C' s sqrt(lg N) to balance the two integrals, and your overall bound becomes O(s sqrt(lg N)).
If I’m understanding your comment right, instead of bounding by N exp(- C u*^2 / s^2), in your post, you’ve bounded it by s * N exp(- C u*^2 / s^2) (note extra factor of s, which is what I was asking about)
You’re right, that’s a typo. That extra factor of $s$ shouldn’t be there. The next line is correct though and seems to have correctly pretended the extra $s$ wasn’t there.
Typo fix: for method 4, in (8) and in a few places before it, the $\exp(-u^2 2^{2^r})$ should be $\exp(-u^2 2^r)$ (thanks to David Woodruff for catching that).
In the example “1.1. Method 1: union bound”, don’t you think that the correct threshold in the integral is $\rho(T) \sqrt{2 \log |T|}$ instead of $C \rho(T) \sqrt{ \log n}$?
With the original threshold, I'm unable to remove the dependence on $|T|$ appearing from the union bound in the last integral.
Yes, you’re absolutely right. It seems I wrote those lines while thinking that n denoted |T|, whereas I earlier defined n as the dimension. The integral should really go up to some $C \rho(T) \sqrt{\log |T|}$.
Perfect! Thanks for your super fast answer!