[Jelani Nelson is organizing a post-STOC workshop on Chaining Methods and their applications, and agreed to write a 2-post series about these methods here. –Boaz]
1. Workshop details
Assaf Naor and I (Jelani Nelson) are organizing a two-day workshop “Chaining Methods and their Applications to Computer Science (CMACS)” June 22–23, 2016, immediately after STOC 2016. It will be held on the Harvard campus. Registration for CMACS is free, and the link can be found on the workshop’s website. Thanks to the NSF, there is some funding to support postdoc and student attendance, so if you fall into one of these two categories then please check off the appropriate box during registration. Also see the instructions on the website for how to apply for this support.
2. What is chaining?
Consider the problem of bounding the maximum of a collection of random variables. That is, we have some collection $(X_t)_{t \in T}$ and want to bound $\mathbb{E} \sup_{t \in T} X_t$, or perhaps we want to say this supremum is small with high probability (which can be achieved by bounding $\mathbb{E} \sup_{t \in T} |X_t|^p$ for large $p$ and applying Markov’s inequality, for example).
Such problems show up all the time in probabilistic analyses, including in computer science, and the most common approach is to combine tail bounds with union bounds. For example, to show that the maximum load when throwing $n$ balls into $n$ bins is $O(\log n / \log\log n)$, one defines $X_t$ as the load in bin $t$, proves $\mathbb{P}(X_t > C \log n / \log\log n) \ll 1/n$, then performs a union bound to bound $\sup_t X_t$. Or when analyzing the update time of a randomized data structure on some sequence of operations, one argues that no operation takes too much time by understanding the tail behavior of $X_t$, the time to perform operation $t$, then again performs a union bound to control $\sup_t X_t$.
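The tail-bound-plus-union-bound recipe for balls and bins can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function names are made up for illustration, and the per-bin tail bound used is the elementary estimate that a fixed bin receives at least $k$ of the $n$ balls with probability at most $\binom{n}{k} / n^k$.

```python
import math
import random


def max_load(n, rng):
    """Throw n balls into n bins uniformly at random; return the fullest bin's load."""
    loads = [0] * n
    for _ in range(n):
        loads[rng.randrange(n)] += 1
    return max(loads)


def union_bound_threshold(n):
    """Smallest k for which the union bound gives P(max load >= k) <= 1/n.

    A fixed bin has load >= k with probability at most C(n, k) / n**k
    (choose which k balls land there), so a union bound over the n bins
    gives P(max load >= k) <= n * C(n, k) / n**k.
    """
    k = 1
    # Integer form of the condition n * C(n, k) / n**k > 1 / n.
    while n * n * math.comb(n, k) > n ** k:
        k += 1
    return k


if __name__ == "__main__":
    rng = random.Random(0)
    n = 1000
    print("union-bound threshold:", union_bound_threshold(n))
    print("observed max loads:", [max_load(n, rng) for _ in range(5)])
```

The simulated maximum load typically sits well below the threshold, which is expected: the union bound ignores all dependence between the bins.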
Most succinctly, chaining methods leverage statistical dependencies between a (possibly infinite) collection of random variables in order to beat this naive union bound.
Chaining originated with Kolmogorov’s continuity theorem from the 1930s (see Section 2.2, Theorem 2.8 of [KS91]). The point of this theorem was to understand conditions under which a stochastic process is continuous. That is, consider a random function $f: \mathbb{R} \to X$ where $(X, d)$ is a metric space. Assume the distribution $\mathcal{D}$ over $f$ satisfies the property that for some $\alpha, \beta > 0$, $\mathbb{E}\, d(f(x), f(y))^\alpha = O(|x - y|^{1+\beta})$ for all $x, y \in \mathbb{R}$. Kolmogorov proved that for any such distribution, one can couple $\mathcal{D}$ with another distribution $\mathcal{D}'$ over functions $\tilde{f}$ such that $\mathbb{P}(f(x) = \tilde{f}(x)) = 1$ for every $x \in \mathbb{R}$, and furthermore $\tilde{f}$ is continuous. For the reader interested in seeing proof details, see for example Section A.2 of [Tal14] (or the proof of the Kolmogorov continuity theorem in essentially any book on stochastic processes).
Since Kolmogorov’s work, the scope of applications of the chaining methodology has widened tremendously, due to contributions of many mathematicians, including Dudley, Fernique, and very notably Talagrand. See Talagrand’s treatise [Tal14] for a description of many impressive applications of chaining in mathematics. See also Talagrand’s STOC 2010 paper [Tal10]. Note that [Tal14] is not exhaustive, and additional applications are posted on the arXiv on a regular basis.
3. Applications in computer science
Several applications are given in Section 1.2.2 of [vH14]. I will repeat some of those here, as well as some other ones.
Random matrices and compressed sensing: Consider a random matrix $M \in \mathbb{R}^{m \times n}$ from some distribution. A common task is to understand the behavior of the largest singular value of $M$. Note $\|M\| = \sup_{\|x\|_2 = \|y\|_2 = 1} x^T M y$, so the goal is to understand the supremum of the random variables $X_{x,y} = x^T M y$ for $(x, y)$ ranging over pairs of unit vectors. Indeed, for many distributions one can obtain asymptotically sharp results via chaining.
Understanding singular values of random matrices has been important in several areas of computer science. Close to my own heart are compressed sensing and randomized linear algebra algorithms. For the latter, a relevant object is a subspace embedding $\Pi$ for the column space of a matrix $U$ with orthonormal columns; these are objects used in algorithms for fast regression, low-rank approximation, and a dozen other applications (see [Woo14]). Analyses then boil down to understanding the largest singular value of $(\Pi U)^T (\Pi U) - I$. In compressed sensing, where the goal is to approximately recover a nearly sparse signal $x$ from few linear measurements $\Pi x$ (the measurements are put as rows of the matrix $\Pi$), analyses again boil down to bounding the operator norm of the same $(\Pi U)^T (\Pi U) - I$, but for all $U$ simultaneously that can be formed from choosing $k$ columns from some basis that $x$ is sparse in.
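To see the largest singular value as a supremum of the bilinear forms $x^T M y$ over unit vectors, here is a small pure-Python sketch. It assumes a matrix with i.i.d. standard gaussian entries (one possible distribution, chosen only for illustration), and every function name is hypothetical. It compares a power-iteration estimate of the largest singular value against the best value of $x^T M y$ found by sampling random unit vectors.

```python
import math
import random


def gaussian_matrix(m, n, rng):
    """m x n matrix with i.i.d. N(0, 1) entries (an illustrative choice)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]


def unit_vector(dim, rng):
    """Uniformly random unit vector, obtained by normalizing a gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def bilinear(x, M, y):
    """The random variable x^T M y for fixed x and y."""
    return sum(x[i] * sum(M[i][j] * y[j] for j in range(len(y)))
               for i in range(len(x)))


def top_singular_value(M, rng, iters=300):
    """Estimate the largest singular value by power iteration on M^T M."""
    m, n = len(M), len(M[0])
    v = unit_vector(n, rng)
    for _ in range(iters):
        u = [sum(M[i][j] * v[j] for j in range(n)) for i in range(m)]  # u = M v
        w = [sum(M[i][j] * u[i] for i in range(m)) for j in range(n)]  # w = M^T u
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    u = [sum(M[i][j] * v[j] for j in range(n)) for i in range(m)]
    return math.sqrt(sum(c * c for c in u))  # ||M v|| for the final unit v


if __name__ == "__main__":
    rng = random.Random(1)
    M = gaussian_matrix(20, 20, rng)
    sampled = max(bilinear(unit_vector(20, rng), M, unit_vector(20, rng))
                  for _ in range(500))
    print("power iteration:", top_singular_value(M, rng))
    print("best of 500 random pairs:", sampled)
```

Random sampling of pairs falls far short of the true supremum, since the maximizing pair is a very specific direction. This is one reason the supremum has to be analyzed rather than guessed from a few points.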
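Empirical risk minimization: (Example taken from [vH14].) In machine learning one often is given some data $X$, drawn from some unknown distribution, and a loss function $\mathcal{L}$. Given some family of distributions parametrized by some $\theta \in \Theta$, the goal is to find some $\theta^*$ which explains the data the best, i.e.
$$\theta^* = \mathrm{argmin}_{\theta \in \Theta} \, \mathbb{E}\, \mathcal{L}(\theta, X). \qquad (1)$$
The expectation is taken over the distribution of $X$. We do not know this distribution, however, and only have i.i.d. samples $X_1, \ldots, X_n$. Thus a common proxy is to calculate
$$\hat{\theta} = \mathrm{argmin}_{\theta \in \Theta} \, \frac{1}{n} \sum_{k=1}^n \mathcal{L}(\theta, X_k).$$
We would like to argue that $\hat{\theta}$ is a nearly optimal minimizer for the actual problem (1). For this to be true, it is sufficient that $\sup_{\theta} |Z_\theta|$ is small, where $\theta$ ranges over all of $\Theta$, with
$$Z_\theta = \frac{1}{n} \sum_{k=1}^n \left( \mathcal{L}(\theta, X_k) - \mathbb{E}\, \mathcal{L}(\theta, X) \right).$$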
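A toy instance makes the role of the uniform deviation concrete. The sketch below makes several assumptions that are not in the text: the data are standard gaussian, the loss is the squared loss $(\theta - x)^2$ (so the true risk is $\theta^2 + 1$), and the parameter set is a finite grid. It computes the empirical minimizer, the largest gap between empirical and true risk over the grid, and the excess risk of the empirical minimizer, which is always at most twice that gap.

```python
import random


def loss(theta, x):
    """Squared loss, an illustrative choice of loss function."""
    return (theta - x) ** 2


def true_risk(theta):
    """E (theta - X)^2 = theta^2 + 1 when X ~ N(0, 1)."""
    return theta ** 2 + 1.0


def empirical_risk(theta, samples):
    return sum(loss(theta, x) for x in samples) / len(samples)


def erm_report(n, thetas, rng):
    """Return (empirical minimizer, sup deviation over thetas, excess true risk)."""
    samples = [rng.gauss(0.0, 1.0) for _ in range(n)]
    emp = {t: empirical_risk(t, samples) for t in thetas}
    theta_hat = min(thetas, key=emp.get)
    sup_dev = max(abs(emp[t] - true_risk(t)) for t in thetas)
    theta_star = min(thetas, key=true_risk)
    excess = true_risk(theta_hat) - true_risk(theta_star)
    return theta_hat, sup_dev, excess


if __name__ == "__main__":
    grid = [i / 10 for i in range(-10, 11)]
    for n in (10, 100, 1000):
        print(n, erm_report(n, grid, random.Random(n)))
```

The guarantee comes from chaining three inequalities: the true risk of the empirical minimizer is within the sup deviation of its empirical risk, which is at most the empirical risk of the true minimizer, which is within the sup deviation of the optimal true risk.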
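Dimensionality reduction: In Euclidean dimensionality reduction, such as in the Johnson-Lindenstrauss lemma, one is given a set of vectors $T \subset \mathbb{R}^n$, and wants that a (usually random) matrix $\Pi \in \mathbb{R}^{m \times n}$ satisfies
$$(1 - \varepsilon) \|x - y\|_2^2 \le \|\Pi x - \Pi y\|_2^2 \le (1 + \varepsilon) \|x - y\|_2^2 \quad \text{for all } x, y \in T.$$
This is satisfied as long as $\sup_{x,y} |Z_{x,y}| \le \varepsilon$, where
$$Z_{x,y} = \left\| \Pi \frac{x - y}{\|x - y\|_2} \right\|_2^2 - 1,$$
where $(x, y)$ ranges over all pairs of distinct vectors in $T$. Gordon’s theorem [Gor88] states that a $\Pi$ with i.i.d. gaussian entries ensures this with good probability as long as it has $m \gtrsim (g^2(\tilde{T}) + 1) / \varepsilon^2$ rows, where $g(\tilde{T}) = \mathbb{E} \sup_{z \in \tilde{T}} \langle \gamma, z \rangle$ (with $\gamma$ a standard gaussian vector) is the gaussian mean width of $\tilde{T}$ and $\tilde{T}$ is the set of normalized differences of vectors in $T$. Later works gave sharper analysis, and also extended to other types of $\Pi$, all using chaining [KM05,MPTJ07,AL13,BDN15,Dir15,ORS15].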
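The quantity being controlled here, the worst relative distortion of squared distances over all pairs, is easy to compute for a small point set. The sketch below assumes a matrix with i.i.d. gaussian entries of variance $1/m$ (so squared norms are preserved in expectation) and a handful of random points; all names and sizes are illustrative choices.

```python
import math
import random


def gaussian_projection(m, n, rng):
    """m x n matrix with i.i.d. N(0, 1/m) entries, so E ||Pi z||^2 = ||z||^2."""
    s = 1.0 / math.sqrt(m)
    return [[rng.gauss(0.0, s) for _ in range(n)] for _ in range(m)]


def apply(Pi, z):
    """Matrix-vector product Pi z."""
    return [sum(row[j] * z[j] for j in range(len(z))) for row in Pi]


def sq_norm(z):
    return sum(c * c for c in z)


def sup_distortion(Pi, points):
    """Max over distinct pairs of | ||Pi(x - y)||^2 / ||x - y||^2 - 1 |."""
    images = [apply(Pi, p) for p in points]
    worst = 0.0
    for a in range(len(points)):
        for b in range(a + 1, len(points)):
            diff = [u - v for u, v in zip(points[a], points[b])]
            # By linearity, Pi(x - y) = Pi x - Pi y.
            diff_img = [u - v for u, v in zip(images[a], images[b])]
            worst = max(worst, abs(sq_norm(diff_img) / sq_norm(diff) - 1.0))
    return worst


if __name__ == "__main__":
    rng = random.Random(3)
    points = [[rng.gauss(0.0, 1.0) for _ in range(50)] for _ in range(10)]
    for m in (25, 100, 400):
        Pi = gaussian_projection(m, 50, rng)
        print(m, sup_distortion(Pi, points))
```

For a finite set a union bound over pairs already works. The chaining-based results matter when the set is large or infinite but has small gaussian mean width.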
Another application of chaining in the context of dimensionality reduction was in regard to nearest neighbor (NN) preserving embeddings [IN07]. In this problem, one is given a database $X \subset \mathbb{R}^n$ of points and must create a data structure such that for any query point $q \in \mathbb{R}^n$, one can quickly find a point $x \in X$ such that $\|q - x\|_2$ is nearly minimized. Of course, if all distances are preserved between $q$ and points in $X$, this suffices to accomplish our goal, but it is more powerful than what is needed. It is only needed that the distance from $q$ to its nearest neighbor does not increase too much, and that the distances from $q$ to much farther points do not shrink too much (to fool us into thinking that they are approximate nearest neighbors). An embedding satisfying such criteria is known as a NN-preserving embedding, and [IN07] used chaining methods to show that certain “nice” sets have such embeddings into low dimension. Specifically, the target dimension can be bounded in terms of $\Delta$ and $\gamma_2(X)$, where $\Delta$ is the aspect ratio of the data and $\gamma_2$ is a functional defined by Talagrand (more on that later). All we will say now is that $\gamma_2(X)$ is always bounded in terms of $\lambda_X$, where $\lambda_X$ is the doubling constant of $X$ (the maximum number of balls of radius $r/2$ required to cover any radius-$r$ ball, over all $r$).
Data structures and streaming algorithms: The potential application to data structures was already mentioned in the previous section. To make it more concrete, consider the following streaming data structural problem in which one sees a sequence $i_1, i_2, \ldots, i_m$ with each $i_j \in \{1, \ldots, n\}$. For example, when monitoring a search query stream, $i_j$ may be a word in a dictionary of size $n$. The goal of the heavy hitters problem is to identify words that occur frequently in the stream. Specifically, if we let $f_i$ be the number of occurrences of $i$ in the stream, in the $\ell_2$ heavy hitters problem the goal is to find all $i$ such that $f_i^2 \ge \varepsilon \sum_j f_j^2$ (think of $\varepsilon$ as some given constant). The CountSketch of Charikar, Chen, and Farach-Colton solves this problem using $O(\log n)$ machine words of memory.
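For readers who have not seen it, the idea behind CountSketch fits in a few lines: each row hashes every item to a bucket with a random sign, and an item's frequency is estimated by the median over rows of its signed bucket counter. The following is a toy sketch of that idea only. The affine hash functions modulo a Mersenne prime and the class interface are illustrative assumptions, not the construction or parameters from the original paper.

```python
import random
import statistics

PRIME = (1 << 61) - 1  # modulus for the toy hash functions (an assumption)


class CountSketch:
    """Toy CountSketch: rows x width signed counters with median estimation."""

    def __init__(self, rows, width, rng):
        self.rows, self.width = rows, width
        self.table = [[0] * width for _ in range(rows)]
        # Per row: (a, b) selects the bucket, (c, d) selects the sign.
        self.params = [tuple(rng.randrange(1, PRIME) for _ in range(4))
                       for _ in range(rows)]

    def _bucket_and_sign(self, r, item):
        a, b, c, d = self.params[r]
        bucket = ((a * item + b) % PRIME) % self.width
        sign = 1 if ((c * item + d) % PRIME) % 2 == 0 else -1
        return bucket, sign

    def update(self, item, delta=1):
        """Process one stream element (item is a nonnegative integer id)."""
        for r in range(self.rows):
            bucket, sign = self._bucket_and_sign(r, item)
            self.table[r][bucket] += sign * delta

    def estimate(self, item):
        """Median over rows of the signed counter the item hashes to."""
        ests = []
        for r in range(self.rows):
            bucket, sign = self._bucket_and_sign(r, item)
            ests.append(sign * self.table[r][bucket])
        return statistics.median(ests)


if __name__ == "__main__":
    rng = random.Random(7)
    sketch = CountSketch(rows=5, width=64, rng=rng)
    stream = [7] * 500 + [rng.randrange(1000) for _ in range(2000)]
    rng.shuffle(stream)
    for item in stream:
        sketch.update(item)
    print("estimate for item 7:", sketch.estimate(7), "true:", stream.count(7))
```

The random signs make collisions cancel in expectation, so the error in each row scales with the $\ell_2$ mass of the other items divided by the square root of the width.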
A recent work of [BCIW16] provides a new algorithm that solves the same problem using only $O(\log \log n)$ words of memory. An upcoming manuscript by the same authors, myself, and Zhengyu Wang gives an improved algorithm using the optimal $O(1)$ words of memory. The key insight of the original work is that as the stream gets long enough, the fractional weight of an item with respect to the $\ell_2$ mass of the frequency vector $f$ changes very slowly. That is to say, once a stream is long, it will take many more updates for a light item to become heavy. Without getting into technical details, this fact becomes important in understanding the behavior throughout the stream of certain random variables stored by their algorithm. More concretely, if two items both have small frequencies, then seeing one of them again can drastically change its fraction of the $\ell_2$ mass. But if both frequencies are large, one has to see the lighter item quite a number of times for it to be much heavier than the other one.
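The slow-change phenomenon is plain arithmetic, and a two-item example shows it. The sketch below measures an item's weight as its share of the squared $\ell_2$ mass, which is one way to make "fractional weight" precise and is an assumption here; the specific frequencies are illustrative and not taken from the paper.

```python
from fractions import Fraction


def squared_share(f_item, f_other):
    """Share of the squared l2 mass f_item^2 + f_other^2 held by the first item."""
    return Fraction(f_item ** 2, f_item ** 2 + f_other ** 2)


def updates_until_share(f, target):
    """Two items start at frequency f. Count how many extra occurrences of one
    of them are needed before its squared share reaches `target` (target < 1)."""
    extra = 0
    while squared_share(f + extra, f) < target:
        extra += 1
    return extra


if __name__ == "__main__":
    target = Fraction(4, 5)
    for f in (1, 10, 100, 1000):
        print(f, updates_until_share(f, target))
```

Starting from frequencies $(1, 1)$, a single update moves one item's share from $1/2$ to $4/5$. Starting from $(1000, 1000)$, reaching the same share takes $1000$ further updates, since the share hits $4/5$ exactly when the item's frequency has doubled.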
Random walks on graphs: Ding, Lee, and Peres [DLP12] a few years ago gave the first deterministic constant-factor approximation algorithm for the cover time of a random walk on a graph (see James Lee’s blog post). Their work showed that the cover time of any connected graph is determined, up to a constant factor, by the expected supremum of a certain collection of random variables depending on that graph, the gaussian free field. Essentially this is a collection of gaussian random variables whose covariance structure is given by the effective resistances between vertices in the graph. Work of Talagrand (the “majorizing measures theory”) and Fernique has provided us with tight, up to a constant factor, upper and lower bounds for the expected supremum of a collection of gaussian random variables. Furthermore, these bounds are constructive and efficient. See also the works [Mek12,Din14,Zha14] for more on this topic.
Dictionary learning: In dictionary learning one assumes that some data of $p$ samples, the columns of some matrix $Y \in \mathbb{R}^{n \times p}$, is (approximately) sparse in some unknown “dictionary”. That is, $Y = AX + E$ where $A$ is unknown, $X$ is sparse in each column, and $E$ is an error matrix. If $E = 0$, $A$ is square, and $X$ has i.i.d. entries with $s$ expected non-zeroes per column, with the non-zeroes being subgaussian, then Spielman, Wang, and Wright [SWW12] gave the first polynomial-time algorithm which provably recovers $A$ (up to permutation and scaling of its columns) using polynomially many samples. Their proof required $p \gtrsim n^2 \log^2 n$ samples, but they conjectured $p \gtrsim n \log n$ should suffice.
It was recently shown that their precise algorithm needs roughly $n^2$ samples, but $p \gtrsim n \log n$ does suffice for a slight variant of their algorithm. As in [SWW12], the analysis of the latter result boiled down to bounding the supremum of a collection of random variables. See [LV15,Ada16,BN16].
Error-correcting codes: A $q$-ary linear error-correcting code $\mathcal{C}$ is such that the codewords are all vectors of the form $xG$ for some row vector $x \in \mathbb{F}_q^k$ and $G \in \mathbb{F}_q^{k \times n}$. $G$ is called the “generator matrix”. Such a code is list-decodable up to some radius $R$ if, informally, whenever one arbitrarily corrupts any codeword in at most an $R$-fraction of coordinates to obtain some $z$, the list of candidate codewords in $\mathcal{C}$ which could have arisen in this way (i.e. are within radius $R$ of $z$) is small.
Recent work of Rudra and Wootters [RW14] showed, to quote them, that
“any $q$-ary code with sufficiently good distance can be randomly punctured to obtain, with high probability, a code that is list decodable up to radius $1 - 1/q - \varepsilon$ with near-optimal rate and list sizes”. A “random puncturing” means simply to randomly sample some number of columns of $G$ to form a random matrix $G'$, which is the generator matrix for the new “punctured” code. Their proof relies on chaining.
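Random puncturing itself is a simple operation on the generator matrix. The sketch below is only an illustration of that operation, under assumptions of my choosing: the alphabet is binary, the starting code is the small simplex code (whose generator matrix has all nonzero vectors of $\mathbb{F}_2^k$ as columns), and columns are sampled without replacement. It computes minimum distance by brute force and says nothing about list decodability, which is what the cited result actually concerns.

```python
import itertools
import random


def simplex_generator(k):
    """k x (2^k - 1) binary matrix whose columns are all nonzero vectors of F_2^k."""
    cols = [c for c in itertools.product((0, 1), repeat=k) if any(c)]
    return [[col[i] for col in cols] for i in range(k)]


def min_distance(G):
    """Minimum Hamming weight of xG over all nonzero messages x in F_2^k.

    For a linear code this is the minimum distance; it is 0 if puncturing
    made two distinct messages collide.
    """
    k, n = len(G), len(G[0])
    best = n
    for x in itertools.product((0, 1), repeat=k):
        if any(x):
            weight = sum(sum(x[i] * G[i][j] for i in range(k)) % 2
                         for j in range(n))
            best = min(best, weight)
    return best


def puncture(G, m, rng):
    """Keep m columns of G chosen uniformly at random (without replacement here)."""
    keep = sorted(rng.sample(range(len(G[0])), m))
    return [[row[j] for j in keep] for row in G]


if __name__ == "__main__":
    G = simplex_generator(4)
    print("original: length", len(G[0]), "distance", min_distance(G))
    P = puncture(G, 8, random.Random(0))
    print("punctured: length", len(P[0]), "distance", min_distance(P))
```

Puncturing shortens the code and so raises its rate, at the cost of possibly lowering the distance. The interesting question is how much of the original code's good behavior survives.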
In the next post we’ll get more into how chaining works and also play with a toy example. In the meantime, you might also want to see these three blog posts by James Lee back in 2010 (or in pdf form).
[Ada16] Radosław Adamczak. A note on the sample complexity of the Er-SpUD algorithm by Spielman, Wang and Wright for exact recovery of sparsely used dictionaries. CoRR, abs/1601.02049, 2016.
[AL13] Nir Ailon and Edo Liberty. An almost optimal unrestricted fast Johnson-Lindenstrauss transform. ACM Transactions on Algorithms, 9(3):21, 2013.
[BCIW16] Vladimir Braverman, Stephen R. Chestnut, Nikita Ivkin, and David P. Woodruff. Beating CountSketch for Heavy Hitters in Insertion Streams. In Proceedings of the 48th Annual ACM Symposium on Theory of Computing (STOC), to appear, 2016. Full version at arXiv abs/1511.00661.
[BDN15] Jean Bourgain, Sjoerd Dirksen, and Jelani Nelson. Toward a unified theory of sparse dimensionality reduction in Euclidean space. Geometric and Functional Analysis (GAFA), 25(4):1009–1088, July 2015. Preliminary version in STOC 2015.
[BN16] Jarosław Błasiok and Jelani Nelson. An improved analysis of the ER-SpUD dictionary learning algorithm. CoRR, abs/1602.05719, 2016.
[Din14] Jian Ding. Asymptotics of cover times via Gaussian free fields: Bounded-degree graphs and general trees. Annals of Probability, 42(2):464–496, 2014.
[Dir15] Sjoerd Dirksen. Dimensionality reduction with subgaussian matrices: a unified theory. Found. Comp. Math., pages 1–30, 2015.
[DLP12] Jian Ding, James R. Lee, and Yuval Peres. Cover times, blanket times, and majorizing measures. Annals of Mathematics, 175:1409–1471, 2012.
[Gor88] Yehoram Gordon. On Milman’s inequality and random subspaces which escape through a mesh in $\mathbb{R}^n$. Geometric Aspects of Functional Analysis, pages 84–106, 1988.
[IN07] Piotr Indyk and Assaf Naor. Nearest-neighbor-preserving embeddings. ACM Transactions on Algorithms, 3(3), 2007.
[KM05] Bo’az Klartag and Shahar Mendelson. Empirical processes and random projections. J. Funct. Anal., 225(1):229–245, 2005.
[LV15] Kyle Luh and Van Vu. Random matrices: concentration and dictionary learning with few samples. In Proceedings of the 56th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 1409–1425, 2015.
[Mek12] Raghu Meka. A PTAS for computing the supremum of gaussian processes. In 53rd Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 217–222, 2012.
[MPTJ07] Shahar Mendelson, Alain Pajor, and Nicole Tomczak-Jaegermann. Reconstruction and subgaussian operators in asymptotic geometric analysis. Geometric and Functional Analysis, 17:1248–1282, 2007.
[ORS15] Samet Oymak, Benjamin Recht, and Mahdi Soltanolkotabi. Isometric sketching of any set via the restricted isometry property. CoRR, abs/1506.03521, 2015.
[RW14] Atri Rudra and Mary Wootters. Every list-decodable code for high noise has abundant near-optimal rate puncturings. In Proceedings of the 46th ACM Symposium on Theory of Computing (STOC), pages 764–773, 2014.
[SWW12] Daniel A. Spielman, Huan Wang, and John Wright. Exact recovery of sparsely-used dictionaries. In Proceedings of the 25th Annual Conference on Learning Theory (COLT), pages 37.1–37.18, 2012.
[Tal10] Michel Talagrand. Are many small sets explicitly small? In Proceedings of the 42nd ACM Symposium on Theory of Computing (STOC), pages 13–36, 2010.
[Tal14] Michel Talagrand. Upper and lower bounds for stochastic processes: modern methods and classical problems. Springer, 2014.
[vH14] Ramon van Handel. Probability in high dimensions. Manuscript, version from June 30, 2014.
[Woo14] David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1-2):1–157, 2014.
[Zha14] Alex Zhai. Exponential concentration of cover times. CoRR, abs/1407.7617, 2014.