In this post, I am going to give you a simple, self-contained, and fruitful demonstration of a recently introduced proof technique called the method of interlacing families of polynomials, which was also mentioned in an earlier post. This method, which may be seen as an incarnation of the probabilistic method, is relevant in situations when you want to show that at least one matrix from some collection of matrices must have eigenvalues in a certain range. The basic methodology is to write down the characteristic polynomial of each matrix you care about, compute the average of all these polynomials (literally by averaging the coefficients), and then reason about the eigenvalues of the individual matrices by studying the zeros of the average polynomial. The relationship between the average polynomial and the individuals hinges on a phenomenon known as interlacing.
We will use the method to prove the following sharp variant of Bourgain and Tzafriri’s restricted invertibility theorem, which may be seen as a robust, quantitative version of the fact that every rank $k$ matrix contains a set of $k$ linearly independent columns.
Theorem 1 Suppose $v_1,\ldots,v_m \in \mathbb{R}^n$ are vectors with $\sum_{i\le m} v_iv_i^T = I_n$. Then for every $k < n$ there is a subset $\sigma \subseteq [m]$ of size $k$ with
$$\lambda_k\left(\sum_{i\in\sigma} v_iv_i^T\right) \ge \left(1-\sqrt{\frac{k}{n}}\right)^2\frac{n}{m}.$$
That is, any set of vectors with variance $\sum_{i\le m}\langle v_i, u\rangle^2$ equal to one in every direction $\|u\|=1$ must contain a large subset which is far from being linearly degenerate, in the sense of having $k$ large eigenvalues (compared to $n/m$, which is the average squared length of the vectors). Such variance one sets go by many other names in different contexts: they are also called isotropic sets, decompositions of the identity, and tight frames. This type of theorem was first proved by Bourgain and Tzafriri in 1987, and later generalized and sharpened to include the form stated here.
The original applications of Theorem 1 and its variants were mainly in Banach space theory and harmonic analysis. More recently, it was used in theoretical CS by Nikolov, Talwar, and Zhang in the contexts of differential privacy and discrepancy minimization. Another connection with TCS was discovered by Joel Tropp, who showed that the set $\sigma$ can be found algorithmically using a semidefinite program whose dual is related to the Goemans-Williamson SDP for Max-Cut.
In more concrete notation, the theorem says that every $n\times m$ rectangular matrix $B$ with $BB^T = I_n$ contains an $n\times k$ column submatrix $B_\sigma$ with $\sigma_k(B_\sigma)^2 \ge \left(1-\sqrt{k/n}\right)^2\frac{n}{m}$, where $\sigma_k$ denotes the $k$th largest singular value. Written this way, we see some similarity with the column subset selection problem in data mining, which seeks to extract a maximally nondegenerate set of `representative’ columns from a data matrix. There are also useful generalizations of Theorem 1 for arbitrary rectangular $B$.
As I said earlier, the technique is based on studying the roots of averages of polynomials. In general, averaging polynomials coefficient-wise can do unpredictable things to the roots. For instance, the average of $(x-1)^2$ and $(x+1)^2$, which are both real-rooted quadratics, is $x^2+1$, which has complex roots $\pm i$. Even when the roots of the average are real, there is in general no simple relationship between the roots of two polynomials and the roots of their average.
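If you want to check this example numerically, here is a minimal numpy sketch (the polynomials are represented by their coefficient vectors):

```python
import numpy as np

# Coefficients of (x-1)^2 = x^2 - 2x + 1 and (x+1)^2 = x^2 + 2x + 1.
p = np.array([1.0, -2.0, 1.0])
q = np.array([1.0, 2.0, 1.0])
avg = (p + q) / 2                      # coefficient-wise average: x^2 + 1

print(np.roots(p), np.roots(q))        # both real: [1, 1] and [-1, -1]
print(np.roots(avg))                   # complex: [1j, -1j]
```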
The main insight is that there are nonetheless many situations where averaging the coefficients of polynomials also has the effect of averaging each of the roots individually, and that it is possible to identify and exploit these situations. To speak about this concretely, we will need to give the roots names. There is no canonical way to do this for arbitrary polynomials, whose roots are just sets of points in the complex plane. However, when all the roots are real there is a natural ordering given by the real line; we will use this ordering to label the roots of a real-rooted polynomial in descending order $\lambda_1(f) \ge \lambda_2(f) \ge \cdots \ge \lambda_n(f)$.
Interlacing
We will use the following classical notion to characterize precisely the good situations mentioned above.
Definition 2 (Interlacing) Let $f$ be a degree $n$ polynomial with all real roots $\{\alpha_i\}$, and let $g$ be degree $n$ or $n-1$ with all real roots $\{\beta_i\}$ (ignoring $\beta_n$ in the degree $n-1$ case). We say that $g$ interlaces $f$ if their roots alternate, i.e.,
$$\beta_n \le \alpha_n \le \beta_{n-1} \le \alpha_{n-1} \le \cdots \le \beta_1 \le \alpha_1.$$
Following Fisk, we denote this as $g \longrightarrow f$, to indicate that the largest root belongs to $f$.
If there is a single $g$ which interlaces a family of polynomials $f_1,\ldots,f_m$, we say that they have a common interlacing.
It is an easy exercise that $f_1,\ldots,f_m$ of degree $n$ have a common interlacing iff there are closed intervals $I_n \le I_{n-1} \le \cdots \le I_1$ (where $\le$ means to the left of) such that the $i$th roots of all the $f_j$ are contained in $I_i$. It is also easy to see that a set of polynomials has a common interlacing iff every pair of them has a common interlacing (this may be viewed as Helly’s theorem on the real line).
We now state the main theorem about averaging polynomials with common interlacings.
Theorem 3 Suppose $f_1,\ldots,f_m$ are real-rooted of degree $n$ with positive leading coefficients. Let $\lambda_k(f_j)$ denote the $k$th largest root of $f_j$ and let $\mu$ be any distribution on $[m]$. If $f_1,\ldots,f_m$ have a common interlacing, then the average polynomial $\mathbb{E}_{j\sim\mu} f_j$ is real-rooted and for all $k = 1,\ldots,n$:
$$\min_{j\le m} \lambda_k(f_j) \le \lambda_k\left(\mathbb{E}_{j\sim\mu} f_j\right) \le \max_{j\le m}\lambda_k(f_j).$$
The proof of this theorem is a three line exercise. Since it is the crucial fact upon which the entire technique relies, I encourage you to find this proof for yourself (Hint: Apply the intermediate value theorem inside each interval $I_k$.) You can also look at the picture below, which shows what happens for two cubic polynomials with a quadratic common interlacing.
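If you prefer to experiment rather than draw pictures, here is a small numpy sketch of Theorem 3: it builds a few real-rooted cubics whose roots lie in three disjoint intervals (so they have a common interlacing, by the exercise above) and checks that each root of the coefficient-wise average is sandwiched between the corresponding roots of the individual polynomials.

```python
import numpy as np

rng = np.random.default_rng(0)

# Real-rooted cubics whose i-th roots all lie in disjoint intervals,
# so they have a common interlacing.
intervals = [(0.0, 1.0), (2.0, 3.0), (4.0, 5.0)]
polys = [np.poly([rng.uniform(a, b) for (a, b) in intervals]) for _ in range(5)]

avg = np.mean(polys, axis=0)                 # coefficient-wise average
avg_roots = np.sort(np.roots(avg)).real      # real, by Theorem 3

for k in range(3):
    kth = [np.sort(np.roots(p))[k] for p in polys]
    assert min(kth) - 1e-9 <= avg_roots[k] <= max(kth) + 1e-9
print("each root of the average lies between the extreme individual roots")
```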
One of the nicest features of common interlacings is that their existence is equivalent to certain real-rootedness statements. Often, this characterization gives us a systematic way to argue that common interlacings exist, rather than having to rely on cleverness and pull them out of thin air. The following seems to have been discovered a number of times (for instance, by Fell or by Chudnovsky & Seymour); the proof included below assumes that the roots of a polynomial are continuous functions of its coefficients (which may be shown using elementary complex analysis).
Theorem 4 If $f_1,\ldots,f_m$ are degree $n$ polynomials and all of their convex combinations $\sum_{j\le m}\mu_j f_j$ have real roots, then they have a common interlacing.
Proof: Since common interlacing is a pairwise condition, it suffices to handle the case of two polynomials $f_0$ and $f_1$. Let $f_t := (1-t)f_0 + tf_1$ with $t \in [0,1]$. Assume without loss of generality that $f_0$ and $f_1$ have no common roots (if they do, divide them out and put them back in at the end). As $t$ varies from $0$ to $1$, the roots of $f_t$ define $n$ continuous curves in the complex plane $C_1,\ldots,C_n$, each beginning at a root of $f_0$ and ending at a root of $f_1$. By our assumption the curves must all lie in the real line. Observe that no curve can cross a root of either $f_0$ or $f_1$ in the middle: if $f_t(r) = 0$ for some $t\in(0,1)$ and $f_0(r) = 0$, then immediately we also have $f_1(r) = 0$, contradicting the no common roots assumption. Thus, each curve defines a closed interval containing exactly one root of $f_0$ and one root of $f_1$, and these intervals do not overlap except possibly at their endpoints, establishing the existence of a common interlacing.
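To see these root curves concretely, here is a small numpy sketch: it takes $f_0 = \chi(A)$ and $f_1 = \chi(A + vv^T)$ for a random symmetric $A$ and a random $v$ (whose convex combinations turn out to be real-rooted, as the next section explains), tracks the roots of $(1-t)f_0 + tf_1$, and confirms that they stay on the real line and sweep out non-overlapping intervals.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
B = rng.standard_normal((n, n))
A = (B + B.T) / 2                       # a random symmetric matrix
v = rng.standard_normal(n)

f0 = np.poly(A)                         # chi(A)
f1 = np.poly(A + np.outer(v, v))        # chi(A + vv^T)

ts = np.linspace(0.0, 1.0, 201)
curves = np.array([np.sort(np.roots((1 - t) * f0 + t * f1)) for t in ts])

print(np.max(np.abs(curves.imag)) < 1e-6)   # the root curves stay on the real line
# Each column of `curves` is one root curve: it starts at a root of f0, ends at
# a root of f1, and the interval it sweeps out is one interval of the common
# interlacing constructed in the proof.
print(np.column_stack([curves.real.min(axis=0), curves.real.max(axis=0)]))
```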
Characteristic Polynomials and Rank One Updates
A very natural and relevant example of interlacing polynomials comes from matrices. Recall that the characteristic polynomial of a matrix $A$ is given by
$$\chi(A)(x) := \det(xI - A),$$
and that its roots are the eigenvalues of $A$. The following classical fact tells us that rank one updates create interlacing.
Lemma 5 (Cauchy’s Interlacing Theorem) If $A$ is a symmetric matrix and $v$ is a vector, then
$$\chi(A) \longrightarrow \chi(A + vv^T).$$
Proof: There are many ways to prove this, and it is a nice exercise. One way which I particularly like, and which will be relevant for the rest of this post, is to observe that
$$\chi(A+vv^T)(x) = \det(xI-A)\det\left(I - (xI-A)^{-1}vv^T\right) = \chi(A)(x)\left(1 - \sum_{i\le n}\frac{\langle v, u_i\rangle^2}{x - \lambda_i}\right), \qquad (*)$$
where $u_1,\ldots,u_n$ and $\lambda_1,\ldots,\lambda_n$ are the eigenvectors and eigenvalues of $A$. Interlacing then follows by inspecting the poles and zeros of the rational function on the right hand side.
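Here is a quick numerical check of the identity $(*)$ and of the interlacing it implies, as a minimal numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
B = rng.standard_normal((n, n))
A = (B + B.T) / 2                       # a random symmetric matrix
v = rng.standard_normal(n)

lam, U = np.linalg.eigh(A)              # eigenvalues (ascending) and eigenvectors of A
x = 2.7                                 # an arbitrary test point

lhs = np.linalg.det(x * np.eye(n) - A - np.outer(v, v))
rhs = np.linalg.det(x * np.eye(n) - A) * (
    1 - sum((U[:, i] @ v) ** 2 / (x - lam[i]) for i in range(n)))
print(abs(lhs - rhs) <= 1e-8 * max(1.0, abs(lhs)))   # the identity (*) holds

# Cauchy interlacing: in ascending order, lam[i] <= mu[i] <= lam[i+1].
mu = np.linalg.eigvalsh(A + np.outer(v, v))
print(np.all(lam <= mu + 1e-12) and np.all(mu[:-1] <= lam[1:] + 1e-12))
```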
We are now in a position to do something nontrivial. Suppose $A$ is a symmetric $n\times n$ real matrix and $v_1,\ldots,v_m$ are some vectors in $\mathbb{R}^n$. Cauchy’s theorem tells us that the polynomials
$$\chi(A + v_1v_1^T),\ \ldots,\ \chi(A + v_mv_m^T)$$
have a common interlacing, namely $\chi(A)$. Thus, Theorem 3 implies that for every $k$, there exists a $j$ so that the $k$th largest eigenvalue of $A + v_jv_j^T$ is at least the $k$th largest root of the average polynomial
$$\mathbb{E}_{j\le m}\,\chi(A + v_jv_j^T)(x).$$
We can compute this polynomial using the calculation $(*)$ as follows:
$$\mathbb{E}_{j\le m}\,\chi(A + v_jv_j^T)(x) = \chi(A)(x)\left(1 - \frac{1}{m}\sum_{j\le m}\sum_{i\le n}\frac{\langle v_j, u_i\rangle^2}{x - \lambda_i}\right).$$
In general, this polynomial depends on the squared inner products $\langle v_j, u_i\rangle^2$. When $\sum_{j\le m} v_jv_j^T = I$, however, we have $\sum_{j\le m}\langle v_j, u_i\rangle^2 = 1$ for all $i\le n$, and:
$$\mathbb{E}_{j\le m}\,\chi(A + v_jv_j^T)(x) = \chi(A)(x) - \frac{1}{m}\sum_{i\le n}\prod_{l\ne i}(x - \lambda_l) = \left(1 - \frac{1}{m}\partial_x\right)\chi(A)(x).$$
That is, adding a random rank one matrix in the isotropic case corresponds to subtracting off a multiple of the derivative from the characteristic polynomial. Note that there is no dependence on the vectors $v_j$ in this expression, and it has `forgotten’ all of the eigenvectors $u_i$. This is where the gain is: we have reduced a high-dimensional linear algebra problem (of finding a $j$ for which $A + v_jv_j^T$ has certain eigenvalues, which may be difficult when the matrices involved do not commute) to a univariate calculus / analysis problem (given a polynomial, figure out what subtracting the derivative does to its roots). Moreover, the latter problem is amenable to a completely different set of tools than the original eigenvalue problem.
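Here is a numerical sanity check of this identity, as a minimal numpy sketch: it builds a random isotropic set by rescaling the columns of a random matrix and compares the averaged characteristic polynomial with $\left(1 - \frac{1}{m}\partial_x\right)\chi(A)$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 4, 7
V = rng.standard_normal((n, m))

# Rescale so that sum_j v_j v_j^T = I_n (an isotropic set).
S, Q = np.linalg.eigh(V @ V.T)
V = Q @ np.diag(S ** -0.5) @ Q.T @ V

B = rng.standard_normal((n, n))
A = (B + B.T) / 2

# Average of the characteristic polynomials chi(A + v_j v_j^T).
avg = np.mean([np.poly(A + np.outer(V[:, j], V[:, j])) for j in range(m)], axis=0)

# (1 - (1/m) d/dx) applied to chi(A).
chi = np.poly(A)
target = chi.copy()
target[1:] -= np.polyder(chi) / m

print(np.allclose(avg, target))          # True: the two polynomials coincide
```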
As a sanity check, if we apply the above deduction to $A = 0$, we find that any isotropic set $v_1,\ldots,v_m$ must contain a $j$ such that $\|v_j\|^2$ is at least the largest root of
$$\left(1 - \frac{1}{m}\partial_x\right)x^n = x^{n-1}\left(x - \frac{n}{m}\right),$$
which is just $n/m$. This makes sense since $\lambda_1(v_jv_j^T) = \|v_j\|^2$, and the average squared length of the vectors is indeed $n/m$ since $\sum_{j\le m}\|v_j\|^2 = \mathrm{tr}\left(\sum_{j\le m} v_jv_j^T\right) = \mathrm{tr}(I_n) = n$.
Differential Operators and Induction
The real power of the method comes from being able to apply it inductively to a sum of many independent random $v_jv_j^T$'s at once, rather than just once. In this case, establishing the necessary common interlacings requires a combination of Theorem 4 and Cauchy’s theorem. A central role is played by the differential operators $\left(1 - \frac{1}{m}\partial_x\right)$ seen above, which I will henceforth denote as $U_m$. The proof relies on the following key properties of these operators:
Lemma 6 (Properties of Differential Operators)
1. If $v$ is a random vector with $\mathbb{E}\, vv^T = \frac{1}{m}I$, then $\mathbb{E}\,\chi(A + vv^T)(x) = U_m\,\chi(A)(x)$.
2. If $f$ has real roots then so does $U_m f$.
3. If $f_1,\ldots,f_m$ have a common interlacing, then so do $U_m f_1,\ldots,U_m f_m$.
Proof: Part (1) was essentially shown in the calculation following $(*)$. Part (2) follows by applying $(*)$ to the diagonal matrix $A$ with diagonal entries equal to the roots of $f$, and plugging in $v = \frac{1}{\sqrt{m}}(1,\ldots,1)$, so that $\langle v, e_i\rangle^2 = \frac{1}{m}$ for all $i$ and $U_m f = \chi(A + vv^T)$, which is real-rooted since $A + vv^T$ is symmetric.
For part (3), Theorem 3 tells us that all convex combinations $\sum_j \mu_j f_j$ have real roots. By part (2) it follows that all $U_m\left(\sum_j \mu_j f_j\right) = \sum_j \mu_j\, U_m f_j$ also have real roots. By Theorem 4, this means that the $U_m f_j$ must have a common interlacing.
We are now ready to perform the main induction which will give us the proof of Theorem 1.
Lemma 7 Let $v$ be uniformly chosen from $v_1,\ldots,v_m$ so that $\mathbb{E}\, vv^T = \frac{1}{m}I$, and let $v^{(1)},\ldots,v^{(k)}$ be i.i.d. copies of $v$. Then there exists a choice of indices $j_1,\ldots,j_k \in [m]$ satisfying
$$\lambda_k\left(\sum_{i\le k} v_{j_i}v_{j_i}^T\right) \ \ge\ \lambda_k\left(\mathbb{E}\,\chi\left(\sum_{i\le k} v^{(i)}(v^{(i)})^T\right)\right).$$
Proof: For any partial assignment $j_1,\ldots,j_l$ of the indices, consider the `conditional expectation’ polynomial:
$$q_{j_1,\ldots,j_l}(x) := \mathbb{E}\,\chi\left(\sum_{i\le l} v_{j_i}v_{j_i}^T + \sum_{i=l+1}^{k} v^{(i)}(v^{(i)})^T\right)(x).$$
We will show that there exists a $j_{l+1}$ such that
$$\lambda_k\left(q_{j_1,\ldots,j_l,j_{l+1}}\right) \ge \lambda_k\left(q_{j_1,\ldots,j_l}\right), \qquad (1)$$
which by induction will complete the proof. Consider the matrix
$$A_l := \sum_{i\le l} v_{j_i}v_{j_i}^T.$$
By Cauchy’s interlacing theorem $\chi(A_l)$ interlaces $\chi(A_l + v_jv_j^T)$ for every $j\le m$. Lemma 6 tells us that the $U_m$ operators preserve common interlacing, so the polynomials
$$q_{j_1,\ldots,j_l,j}(x) = U_m^{\,k-l-1}\,\chi\left(A_l + v_jv_j^T\right)(x), \qquad j = 1,\ldots,m$$
(by applying Lemma 6(1) $k-l-1$ times to compute the conditional expectation, and Lemma 6(3) to preserve the common interlacing) must also have a common interlacing. Since $q_{j_1,\ldots,j_l} = \mathbb{E}_j\, q_{j_1,\ldots,j_l,j}$, Theorem 3 implies that some $j_{l+1}$ must satisfy (1), as desired.
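The proof of Lemma 7 is effectively a greedy procedure: at each step, pick an index whose conditional expectation polynomial has the largest $k$th root. Here is a minimal numpy sketch of this procedure (the function names and the brute-force search over columns are mine, chosen for illustration rather than efficiency):

```python
import numpy as np

def apply_U(poly, m):
    """Apply U_m = 1 - (1/m) d/dx to a polynomial given by descending coefficients."""
    out = poly.copy()
    out[1:] -= np.polyder(poly) / m
    return out

def kth_largest_root(poly, k):
    return np.sort(np.roots(poly).real)[::-1][k - 1]

def greedy_subset(V, k):
    """Greedy column selection following the induction in Lemma 7.

    V is n x m with V V^T = I (isotropic). At step l we pick the column whose
    conditional expectation polynomial U_m^(k-l-1) chi(A_l + v_j v_j^T) has the
    largest k-th root; by Theorem 3 this never decreases it."""
    n, m = V.shape
    A = np.zeros((n, n))
    chosen = []
    for l in range(k):
        best_j, best_val = None, -np.inf
        for j in range(m):
            p = np.poly(A + np.outer(V[:, j], V[:, j]))
            for _ in range(k - l - 1):          # remaining random rank-one updates
                p = apply_U(p, m)
            val = kth_largest_root(p, k)
            if val > best_val:
                best_j, best_val = j, val
        chosen.append(best_j)
        A = A + np.outer(V[:, best_j], V[:, best_j])
    return chosen, A

# Example on a random isotropic set of m vectors in R^n.
rng = np.random.default_rng(4)
n, m, k = 6, 30, 3
V = rng.standard_normal((n, m))
S, Q = np.linalg.eigh(V @ V.T)
V = Q @ np.diag(S ** -0.5) @ Q.T @ V            # now V V^T = I_n

sigma, A = greedy_subset(V, k)
print(sorted(sigma), np.linalg.eigvalsh(A)[-k], ">=", (1 - np.sqrt(k / n)) ** 2 * n / m)
```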
Bounding the Roots: Laguerre Polynomials
To finish the proof of Theorem 1, it suffices by Lemma 7 to prove a lower bound on the $k$th largest root of the expected polynomial $\mathbb{E}\,\chi\left(\sum_{i\le k} v^{(i)}(v^{(i)})^T\right)$. By applying Lemma 6(1) $k$ times to $A = 0$, we find that
$$\mathbb{E}\,\chi\left(\sum_{i\le k} v^{(i)}(v^{(i)})^T\right)(x) = \left(1 - \frac{\partial_x}{m}\right)^k x^n. \qquad (2)$$
This looks like a nice polynomial, and we are free to use any method we like to bound its roots.
The easiest way is to observe that
$$\left(1 - \frac{\partial_x}{m}\right)^k x^n = \frac{(-1)^k k!}{m^k}\, x^{n-k}\, L_k^{(n-k)}(mx),$$
where $L_k^{(n-k)}$ is a degree $k$ associated Laguerre polynomial. These are a classical family of orthogonal polynomials and a lot is known about the locations of their roots; in particular, there is the following estimate due to Krasikov.
Lemma 8 (Roots of Laguerre Polynomials) The roots of the associated Laguerre polynomial $L_k^{(n-k)}(x)$ are all at least
$$\left(\sqrt{n} - \sqrt{k}\right)^2. \qquad (3)$$
It follows by Lemma 8 that
$$\lambda_k\left(\left(1 - \frac{\partial_x}{m}\right)^k x^n\right) \ge \frac{\left(\sqrt{n} - \sqrt{k}\right)^2}{m} = \left(1 - \sqrt{k/n}\right)^2\frac{n}{m},$$
which immediately yields Theorem 1 by Lemma 7 and (2).
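Here is a small numerical check of the Laguerre identity and of the resulting bound, as a sketch using scipy’s generalized Laguerre polynomials (scipy.special.genlaguerre):

```python
import numpy as np
from scipy.special import genlaguerre

n, m, k = 10, 25, 4

# Compute (1 - (1/m) d/dx)^k x^n directly, in descending coefficients.
p = np.zeros(n + 1)
p[0] = 1.0                                   # x^n
for _ in range(k):
    q = p.copy()
    q[1:] -= np.polyder(p) / m
    p = q

# p(x) = const * x^(n-k) * L_k^(n-k)(m x): strip the factor x^(n-k).
r = p[:k + 1]
print(np.allclose(p[k + 1:], 0))             # the trailing coefficients vanish
roots = np.sort(np.roots(r).real)
lag_roots = np.sort(genlaguerre(k, n - k).roots.real) / m
print(np.allclose(roots, lag_roots))         # same nonzero roots
print(roots.min(), ">=", (np.sqrt(n) - np.sqrt(k)) ** 2 / m)
```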
If you think that appealing to Laguerre polynomials was magical, it is also possible to prove the bound (3) from scratch in less than a page using the `barrier function’ argument from this paper, which is also intimately related to the formulas $(*)$ and (2).
Conclusion
The argument given here is a special case of a more general principle: that expected characteristic polynomials of certain random matrices can be expressed in terms of differential operators, which can then be used to establish the existence of the necessary common interlacings as well as to analyze the roots of the expected polynomials themselves. In the isotropic case of Bourgain-Tzafriri presented here, all of these objects can be chosen to be univariate polynomials. Morally, this is because the covariance matrices of all of the random vectors involved are multiples of the identity (which trivially commute with each other), and all of the characteristic polynomials involved are simple univariate linear transformations of each other (such as $f\mapsto U_m f$). The above argument can also be made to work in the non-isotropic case, yielding improvements over previously known bounds. This is the subject of a paper in preparation, Interlacing Families III, with Adam Marcus and Dan Spielman.
On the other hand, the proofs of Kadison-Singer and the existence of Ramanujan graphs involve analyzing sums of independent rank one matrices which are not identically distributed and whose covariance matrices do not commute. At a high level, this is what creates the need to consider multivariate characteristic polynomials and differential operators, which are then analyzed using techniques from the theory of real stable polynomials.
Acknowledgments
Everything in this post is joint work with Adam Marcus and Dan Spielman. Thanks to Raghu Meka for helpful comments in the preparation of this post.
UPDATE: In response to Olaf’s comment below, here is how to see that the bound in the theorem is sharp.
The tight example is provided by random matrices. Let $G$ be an $n\times m$ random matrix with i.i.d. $N(0, 1/m)$ Gaussian random entries, so that $\mathbb{E}\, GG^T = I_n$. Then $GG^T$ is called a Wishart matrix and its spectrum is very well-understood. In particular, we will use the following two facts:
(1) If $n, m\to\infty$ with $m/n\to\infty$ then the ratio $\lambda_1(GG^T)/\lambda_n(GG^T)\to 1$ almost surely. Thus, if we take the $v_i$ to be the columns of such a $G$ then $\sum_{i\le m} v_iv_i^T \to I_n$ almost surely.
(2) If $n, k\to\infty$ with $k/n$ fixed, the spectrum of $\frac{m}{n}\,G_\sigma^T G_\sigma$, where $G_\sigma$ is an $n\times k$ matrix of i.i.d. $N(0,1/m)$ Gaussian entries, converges to a known distribution called the Marchenko-Pastur law, which is supported on the interval $\left[\left(1-\sqrt{k/n}\right)^2,\ \left(1+\sqrt{k/n}\right)^2\right]$. The eigenvalues are extremely (better than exponentially) unlikely to be supported on any interval smaller than this; in particular for every $\epsilon>0$ we have
$$\Pr\left[\lambda_{\min}\left(\tfrac{m}{n}\,G_\sigma^T G_\sigma\right) \ge \left(1-\sqrt{k/n}\right)^2 + \epsilon\right] \le \exp(-ck^2) \qquad (+)$$
for sufficiently large $k$. This is established in Hiai-Petz (Theorem 8; thanks to S. Szarek for this crucial reference.)
Fact (2) implies that every $n\times k$ column submatrix $G_\sigma$ of $G$, $|\sigma| = k$, which is also Gaussian, has
$$\lambda_k\left(G_\sigma G_\sigma^T\right) \le \left(\left(1-\sqrt{k/n}\right)^2 + \epsilon\right)\frac{n}{m}$$
with probability $1 - \exp(-ck^2)$ for sufficiently large $n$ (keeping $k/n$ fixed). There are $\binom{m}{k} \le \exp(c'k\log(m/k))$ such submatrices $G_\sigma$, so if we set $m = n\log n$ (say) then $\exp(c'k\log(m/k)) \ll \exp(ck^2)$ for any fixed ratio $k/n$, and by a union bound the above inequality holds simultaneously for all $n\times k$ submatrices of G, showing that the guarantee of the theorem cannot be improved for sufficiently large $n$.
It would be nice to have an argument for finite $n$, which would require a non-asymptotic version of the Hiai-Petz bound.
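Here is a quick Monte Carlo sketch of this example (with modest sizes chosen only for illustration): it samples a Gaussian $G$ as above and checks that the $k$th largest eigenvalue of random $k$-column submatrices hovers just around the bound of Theorem 1, as the Marchenko-Pastur law predicts.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, k, trials = 200, 2000, 50, 20
G = rng.normal(0.0, 1.0 / np.sqrt(m), size=(n, m))    # E[G G^T] = I_n

bound = (1 - np.sqrt(k / n)) ** 2 * n / m
vals = []
for _ in range(trials):
    sigma = rng.choice(m, size=k, replace=False)
    Gs = G[:, sigma]
    vals.append(np.linalg.eigvalsh(Gs.T @ Gs).min())   # = lambda_k(G_sigma G_sigma^T)

print(f"theorem bound        : {bound:.4f}")
print(f"random k-subsets give: {min(vals):.4f} .. {max(vals):.4f}")
```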
Dear Mr Srivastava,
You’ve claimed that your version of the restricted invertibility theorem is sharp. Could you tell me for which k this result is sharp and how to see this?
Best regards,
Olaf Mordhorst
Dear Olaf,
That is an excellent question! I have added a section to the post, describing the sense in which the bound is sharp.
Best wishes,
Nikhil
Dear Nikhil,
Thank you for this post. It was never clear to me whether the $\left(1-\sqrt{k/n}\right)^2$ dependence is optimal in the restricted invertibility principle. The argument you present here is very clear. By the way: A=GG^T without expectation. After equation (+), I think it should be with probability bigger than 1-\exp(-ck^2). Also the line just after should be \exp(c’k\log(m/k)).
Also did you try to get simultaneously a lower and upper bound for the singular values of the restricted matrix, i.e. to get a well conditioned submatrix in the same sense as what is done here: http://arxiv.org/pdf/1212.0976v2.pdf
Best wishes,
Pierre
Thanks, Pierre! Fixed.
Your paper looks very interesting. I have not tried to get both bounds at once using this method for this regime of k, although in some sense that is what is done for k>n in the case of Kadison-Singer. The basic idea is to embed an upper and lower bound into a single matrix by using a direct sum. It would be interesting to see if your result can be obtained using polynomials.
Dear Nikhil,
Thank you very much for your comment. I’d failed to construct a counterexample (by considering the Hadamard basis), so your post helped me a lot. Maybe one can construct a finite counterexample by quantizing the Gaussian measure in your example (since the counterexample only depends on the distribution of the column vectors in G, which might be predictable for m>>n).
Best wishes,
Olaf
This is a very nice blog post! It really helped me understand what’s going on with the method. I don’t think anyone will be confused by it for very long, but I think there is a small typo in equation (3), by the way.
Beautiful exposition, Nikhil! I’ve seen this explained a few times and this writeup nails it.
Dear Nikhil,
Thank you for the very nice exposition, it made the proof of Kadison-Singer much more intuitive for me. I have a question, if you have time to answer it: when you prove Theorem 1 from Lemma 7, the indices j_1,…, j_k might not be mutually distinct. Right? So in Theorem 1, we allow a vector v_i to appear several times in the last inequality?
Many thanks and best wishes,
Dorin
Hi Nikhil,
Nevermind the previous question, I read the notation wrongly. All clear now. Thanks again for the nice post!
Dorin