[Jelani Nelson is organizing a post-STOC workshop on Chaining Methods and their applications, and agreed to write a 2-post series about these methods here. –Boaz]

1. Workshop details

Assaf Naor and I (Jelani Nelson) are organizing a two-day workshop “Chaining Methods and their Applications to Computer Science (CMACS)” June 22–23, 2016, immediately after STOC 2016. It will be held on the Harvard campus. Registration for CMACS is free, and the link can be found on the workshop’s website. Thanks to the NSF, there is some funding to support postdoc and student attendance, so if you fall into one of these two categories then please check off the appropriate box during registration. Also see the instructions on the website for how to apply for this support.

2. What is chaining?

Consider the problem of bounding the maximum of a collection of random variables. That is, we have some collection ${(X_t)_{t\in T}}$ and want to bound ${\mathbb{E} \sup_{t\in T} X_t}$, or perhaps we want to say this ${\sup}$ is small with high probability (which can be achieved by bounding ${\mathbb{E} \sup_{t\in T} |X_t|^p}$ for large ${p}$ and applying Markov’s inequality, for example).
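
Concretely, this last reduction is just Markov’s inequality applied to ${\sup_{t\in T} |X_t|^p}$:

$\displaystyle \mathbb{P}\Big(\sup_{t\in T} |X_t| > \lambda\Big) = \mathbb{P}\Big(\sup_{t\in T} |X_t|^p > \lambda^p\Big) \le \frac{\mathbb{E} \sup_{t\in T} |X_t|^p}{\lambda^p} ,$

so a good bound on the ${p}$-th moment of the supremum yields a tail probability decaying like ${\lambda^{-p}}$.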

Such problems show up all the time in probabilistic analyses, including in computer science, and the most common approach is to combine tail bounds with union bounds. For example, to show that the maximum load when throwing ${n}$ balls into ${n}$ bins is ${O(\log n/\log\log n)}$, one defines ${X_t}$ as the load in bin ${t}$, proves ${\mathbb{P}(X_t > C\log n/\log\log n) \ll 1/n}$, then performs a union bound to bound ${\sup_t X_t}$. Or when analyzing the update time of a randomized data structure on some sequence of operations, one argues that no operation takes too much time by understanding the tail behavior of ${X_t}$ being the time to perform operation ${t}$, then again performs a union bound to control ${\sup_t X_t}$.
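
As a quick sanity check of the balls-and-bins claim, here is a minimal simulation (the helper name `max_load` is hypothetical, purely for illustration):

```python
import random
from collections import Counter

def max_load(n, seed=0):
    """Throw n balls into n bins uniformly at random; return the maximum bin load."""
    rng = random.Random(seed)
    counts = Counter(rng.randrange(n) for _ in range(n))
    return max(counts.values())

# log(10^4)/loglog(10^4) is about 4.1; the realized maximum load is of that order.
print(max_load(10_000))
```

The union-bound analysis in the text predicts a maximum load of ${O(\log n/\log\log n)}$, and the simulated value is indeed of that order rather than, say, ${\log n}$ or ${\sqrt{n}}$.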

Most succinctly, chaining methods leverage statistical dependencies between a (possibly infinite) collection of random variables in order to beat this naive union bound.

Chaining originated with Kolmogorov’s continuity theorem from the 1930s (see Section 2.2, Theorem 2.8 of [KS91]). The point of this theorem is to give conditions under which a stochastic process is continuous. That is, consider a random function ${f:\mathbb{R}\rightarrow X}$ where ${(X, d)}$ is a metric space. Assume the distribution over ${f}$ satisfies the property that for some ${\alpha,\beta > 0}$, ${\mathbb{E}\, d(f(x), f(y))^\alpha = O(|x-y|^{1+\beta})}$ for all ${x,y\in \mathbb{R}}$. Kolmogorov proved that for any such distribution, one can couple it with another distribution over functions ${\tilde{f}}$ such that ${\forall x\in\mathbb{R},\ \mathbb{P}(f(x) = \tilde{f}(x)) = 1}$, and furthermore ${\tilde{f}}$ is continuous. For the reader interested in seeing proof details, see for example Section A.2 of [Tal14] (or the proof of the Kolmogorov continuity theorem in essentially any book on stochastic processes).

Since Kolmogorov’s work, the scope of applications of the chaining methodology has widened tremendously, due to contributions of many mathematicians, including Dudley, Fernique, and, very notably, Talagrand. See Talagrand’s treatise [Tal14] for a description of many impressive applications of chaining in mathematics, as well as his STOC 2010 paper [Tal10]. Note that [Tal14] is not exhaustive; additional applications are posted to the arXiv on a regular basis.

3. Applications in computer science

Several applications are given in Section 1.2.2 of [vH14]. I will repeat some of those here, as well as some other ones.

Random matrices and compressed sensing: Consider a random matrix ${M\in\mathbb{R}^{m\times n}}$ from some distribution. A common task is to understand the behavior of the largest singular value of ${M}$. Note ${\|M\| = \sup_{\|x\|_2=\|y\|_2 = 1} x^T M y}$, so the goal is to understand the supremum of the random variables ${X_t = t_1^T M t_2}$ for ${t\in T = B_{\ell_2^m}\times B_{\ell_2^n}}$. Indeed, for many distributions one can obtain asymptotically sharp results via chaining.

Understanding singular values of random matrices has been important in several areas of computer science. Two close to my own heart are compressed sensing and randomized linear algebra algorithms. For the latter, a relevant object is a subspace embedding: a matrix ${\Pi}$ that approximately preserves the norms of all vectors in a fixed subspace, with ${U}$ an orthonormal basis for that subspace. Subspace embeddings are used in algorithms for fast regression, low-rank approximation, and a dozen other applications (see [Woo14]), and their analyses boil down to understanding the largest singular value of ${M = (\Pi U)^T(\Pi U) - I}$. In compressed sensing, where the goal is to approximately recover a nearly sparse signal ${x}$ from few linear measurements ${Sx}$ (the measurements are the rows of the matrix ${S}$), analyses again boil down to bounding the operator norm of the same ${M}$, but simultaneously for all ${U}$ that can be formed by choosing ${k}$ columns from some basis in which ${x}$ is sparse.

Empirical risk minimization: (Example taken from [vH14]). In machine learning one often is given some data, drawn from some unknown distribution, and a loss function ${\mathcal{L}}$. Given some family of distributions parametrized by some ${\theta\in\Theta}$, the goal is to find some ${\theta^*}$ which explains the data the best, i.e.

$\displaystyle \theta^* = \mathop{argmin}_{\theta\in\Theta} \mathbb{E} \mathcal{L}(\theta, X) . \ \ \ \ \ (1)$

The expectation is taken over the distribution of ${X}$. We do not know ${X}$, however, and only have i.i.d. samples ${X_1,\ldots,X_n}$. Thus a common proxy is to calculate

$\displaystyle \hat{\theta} = \mathop{argmin}_{\theta\in\Theta} \frac 1n \sum_{k=1}^n \mathcal{L}(\theta, X_k) .$

We would like to argue that ${\hat{\theta}}$ is a nearly optimal minimizer for the actual problem (1). For this to be true, it is sufficient that ${\sup_\theta X_\theta}$ is small, where one ranges over all ${\theta\in \Theta}$ with

$\displaystyle X_\theta = \left| \frac 1n \sum_{k=1}^n \mathcal{L}(\theta, X_k) - \mathbb{E} \mathcal{L}(\theta, X) \right| .$
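
To make this concrete, here is a toy sketch assuming a simple squared loss and uniform data (both chosen purely for illustration, since then the true risk has a closed form), estimating ${\sup_\theta X_\theta}$ over a grid of parameters:

```python
import random

# Toy ERM setup (illustrative choices): loss L(theta, x) = (theta - x)^2 with
# X ~ Uniform(0, 1), so E L(theta, X) = theta^2 - theta + 1/3 in closed form.
def sup_deviation(n, thetas, seed=0):
    """sup over theta of |empirical risk - true risk| on n i.i.d. samples."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    dev = 0.0
    for th in thetas:
        emp = sum((th - x) ** 2 for x in xs) / n
        true = th * th - th + 1.0 / 3.0
        dev = max(dev, abs(emp - true))
    return dev

thetas = [k / 100 for k in range(101)]
for n in (100, 10_000):
    print(n, sup_deviation(n, thetas))  # shrinks roughly like 1/sqrt(n)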

Dimensionality reduction: In Euclidean dimensionality reduction, such as in the Johnson-Lindenstrauss lemma, one is given a set of vectors ${P\subset\ell_2^n}$, and wants that a (usually random) matrix ${\Pi}$ satisfies

$\displaystyle \forall y,z \in P,\ (1-\varepsilon)\|y-z\|_2^2 \le \|\Pi y - \Pi z\|_2^2 \le (1+\varepsilon)\|y-z\|_2^2 .$

This is satisfied as long as ${\sup_{y,z} X_{y,z} \le \varepsilon}$, where

$\displaystyle X_{y,z} = |\frac{1}{\|y-z\|_2^2}\|\Pi y - \Pi z\|_2^2 - 1| ,$

where ${y,z}$ range over all pairs of distinct vectors in ${P}$. Gordon’s theorem [Gor88] states that a ${\Pi}$ with i.i.d. gaussian entries ensures this with good probability as long as it has ${\gtrsim (g^2(T)+1)/\varepsilon^2}$ rows, where ${g(T)}$ is the gaussian mean width of ${T}$ and ${T}$ is the set of normalized differences of vectors in ${P}$. Later works gave sharper analyses, and also extended to other types of ${\Pi}$, all using chaining [KM05,MPTJ07,AL13,BDN15,Dir15,ORS15].
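
For intuition, here is a minimal simulation of the Johnson-Lindenstrauss setup (an unoptimized dense gaussian ${\Pi}$; the helper names and parameter choices are arbitrary):

```python
import math, random

def jl_project(points, m, seed=0):
    """Project points to m dimensions via a matrix of i.i.d. N(0, 1/m) entries."""
    rng = random.Random(seed)
    d = len(points[0])
    Pi = [[rng.gauss(0, 1) / math.sqrt(m) for _ in range(d)] for _ in range(m)]
    return [[sum(Pi[i][j] * p[j] for j in range(d)) for i in range(m)] for p in points]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

random.seed(2)
d, m = 200, 64
P = [[random.gauss(0, 1) for _ in range(d)] for _ in range(10)]
Q = jl_project(P, m)
# Each ratio below is the X_{y,z} quantity shifted by 1; all should be near 1.
ratios = [sq_dist(Q[i], Q[j]) / sq_dist(P[i], P[j])
          for i in range(len(P)) for j in range(i)]
print(min(ratios), max(ratios))
```

Each ratio concentrates around ${1}$ individually; the chaining question is how large ${m}$ must be for all of them to be within ${1\pm\varepsilon}$ simultaneously.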

Another application of chaining in the context of dimensionality reduction was to nearest neighbor (NN) preserving embeddings [IN07]. In this problem, one is given a database ${X\subset \ell_2^d}$ of ${n}$ points and must create a data structure such that for any query point ${q\in\mathbb{R}^d}$, one can quickly find a point ${x\in X}$ such that ${\|q-x\|_2}$ is nearly minimized. Of course, if all distances between ${q}$ and points in ${X}$ are preserved, this suffices to accomplish our goal, but it is stronger than what is needed. It is only needed that the distance from ${q}$ to its nearest neighbor does not increase too much, and that the distances from ${q}$ to much farther points do not shrink too much (to fool us into thinking that they are approximate nearest neighbors). An embedding satisfying such criteria is known as an NN-preserving embedding, and [IN07] used chaining methods to show that certain “nice” sets ${X}$ have such embeddings into low dimension. Specifically, the target dimension can be ${O(\Delta^2\varepsilon^{-2}(\gamma_2(X)/\mathop{diam}(X))^2)}$, where ${\Delta}$ is the aspect ratio of the data and ${\gamma_2}$ is a functional defined by Talagrand (more on that later). All we will say now is that ${\gamma_2(X)}$ is always ${O(\sqrt{\log \lambda_X})}$, where ${\lambda_X}$ is the doubling constant of ${X}$ (the maximum number of balls of radius ${r/2}$ required to cover any radius-${r}$ ball, over all ${r}$).

Data structures and streaming algorithms: The potential application to data structures was already mentioned in the previous section. To make it more concrete, consider the following streaming data structure problem in which one sees a sequence ${p_1,\ldots,p_m}$ with each ${p_k\in\{1,\ldots,n\}}$. For example, when monitoring a search query stream, ${p_k}$ may be a word in a dictionary of size ${n}$. The goal of the heavy hitters problem is to identify words that occur frequently in the stream. Specifically, if we let ${f_i}$ be the number of occurrences of ${i\in[n]}$ in the stream, in the ${\ell_2}$ heavy hitters problem the goal is to find all ${i}$ such that ${f_i^2 \ge \varepsilon \sum_j f_j^2}$ (think of ${\varepsilon}$ as some given constant). The CountSketch of Charikar, Chen, and Farach-Colton solves this problem using ${O(\log n)}$ machine words of memory.
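
To make the discussion concrete, here is a minimal CountSketch in Python (the hash families and parameters are chosen for illustration; a production implementation would differ):

```python
import random
import statistics

P = 2**31 - 1  # prime modulus for the 2-universal hash families below

class CountSketch:
    """Minimal CountSketch sketch: t rows of B counters; each row has a bucket
    hash and a random sign, and f_i is estimated by a median over the rows."""
    def __init__(self, t, B, seed=0):
        rng = random.Random(seed)
        self.t, self.B = t, B
        # per-row coefficients: (a, b) for the bucket hash, (c, d) for the sign
        self.coef = [tuple(rng.randrange(1, P) for _ in range(4)) for _ in range(t)]
        self.C = [[0] * B for _ in range(t)]

    def _h(self, j, i):
        a, b, _, _ = self.coef[j]
        return ((a * i + b) % P) % self.B

    def _s(self, j, i):
        _, _, c, d = self.coef[j]
        return 1 if ((c * i + d) % P) % 2 else -1

    def update(self, i, delta=1):
        for j in range(self.t):
            self.C[j][self._h(j, i)] += self._s(j, i) * delta

    def estimate(self, i):
        return statistics.median(self._s(j, i) * self.C[j][self._h(j, i)]
                                 for j in range(self.t))

cs = CountSketch(t=9, B=256)
for p in [1] * 500 + [2] * 300 + list(range(3, 1000)):  # items 1 and 2 are heavy
    cs.update(p)
print(cs.estimate(1), cs.estimate(2))  # close to the true counts 500 and 300
```

With ${t = O(\log n)}$ rows each of ${O(1/\varepsilon)}$ counters, the median estimate identifies all ${\ell_2}$ heavy hitters with high probability, which is where the ${O(\log n)}$ words of memory come from.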

A recent work of [BCIW16] provides a new algorithm that solves the same problem using only ${O(\log\log n)}$ words of memory. An upcoming manuscript by the same authors, myself, and Zhengyu Wang gives an improved algorithm using the optimal ${O(1)}$ words of memory. The key insight of the original work is that as the stream gets long enough, the fractional weight of an item with respect to the ${\ell_2^2}$ mass of the ${f}$ vector changes very slowly. That is to say, once a stream is long, it takes many more updates for a light item to become heavy. More concretely, if the frequencies are ${f_1 = 1}$ and ${f_2 = 1}$, then seeing item ${1}$ again drastically changes its fraction of the ${\ell_2^2}$ mass, from one half to ${4/5}$. But if ${f_1 + f_2 = 1000}$, one has to see the lighter item quite a number of times before it becomes much heavier than the other one. Without getting into technical details, this fact is important in understanding the behavior throughout the stream of certain random variables stored by their algorithm.
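
The arithmetic in this example is easy to check directly:

```python
def frac(i, freqs):
    """Fraction of the l_2^2 mass of the frequency vector held by item i."""
    return freqs[i] ** 2 / sum(f * f for f in freqs.values())

print(frac(1, {1: 1, 2: 1}))      # 0.5
print(frac(1, {1: 2, 2: 1}))      # 0.8: one update to a short stream moves 1/2 -> 4/5
print(frac(1, {1: 501, 2: 500}))  # about 0.501: the same single update barely moves it
```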

Random walks on graphs: Ding, Lee, and Peres [DLP12] a few years ago gave the first deterministic constant-factor approximation algorithm for the cover time of a random walk on a graph (see James Lee’s blog post). Their work showed that the cover time of any connected graph is, up to a constant, determined by the supremum of a certain collection of random variables depending on that graph, the Gaussian free field. Essentially this is a collection of gaussian random variables whose covariance structure is given by the effective resistances between vertices in the graph. Work of Fernique and Talagrand (the “majorizing measures theorem”) provides upper and lower bounds, tight up to a constant factor, on the expected supremum of a collection of gaussian random variables; furthermore, these bounds are constructive and efficient. See also the works [Mek12,Din14,Zha14] for more on this topic.

Dictionary learning: In dictionary learning one assumes that some data of ${p}$ samples, the columns of a matrix ${Y\in\mathbb{R}^{n\times p}}$, is (approximately) sparse in some unknown “dictionary”. That is, ${Y = AX + E}$ where ${A}$ is unknown, each column of ${X}$ is sparse, and ${E}$ is an error matrix. If ${E = 0}$, ${A}$ is square, and ${X}$ has i.i.d. entries with ${s}$ expected non-zeroes per column, the non-zeroes being subgaussian, then Spielman, Wang, and Wright [SWW12] gave the first polynomial-time algorithm which provably recovers ${A}$ (up to permutation and scaling of its columns) using polynomially many samples. Their proof required ${O(n^2\log^2 n)}$ samples, but they conjectured that ${O(n\log n)}$ should suffice.

It was recently shown that their precise algorithm does need roughly ${n^2}$ samples, but that ${O(n\log n)}$ samples suffice for a slight variant of their algorithm; see [LV15,Ada16,BN16]. As in [SWW12], the analysis of the latter result boiled down to bounding the supremum of a collection of random variables.

Error-correcting codes: A ${q}$-ary linear error-correcting code ${\mathcal{C}}$ is one whose codewords are all vectors of the form ${xM}$ for a row vector ${x\in\mathbb{F}_q^m}$, where ${M\in\mathbb{F}_q^{m\times n}}$ is called the “generator matrix”. Such a code is list-decodable up to some radius ${R}$ if, informally, whenever one arbitrarily corrupts a codeword ${C}$ in at most an ${R}$-fraction of coordinates to obtain some ${C'}$, the list of candidate codewords in ${\mathcal{C}}$ which could have arisen in this way (i.e. are within radius ${R}$ of ${C'}$) is small.

Recent work of Rudra and Wootters [RW14] showed, to quote them, that “any ${q}$-ary code with sufficiently good distance can be randomly punctured to obtain, with high probability, a code that is list decodable up to radius ${1 - 1/q -\varepsilon}$ with near-optimal rate and list sizes”. A “random puncturing” means simply to randomly sample some number of columns of ${M}$ to form a random matrix ${M'}$, which is the generator matrix for the new “punctured” code. Their proof relies on chaining.
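
To illustrate just the puncturing operation (not the list-decoding analysis), here is a toy example with arbitrary parameters:

```python
import itertools, random

# A toy q-ary linear code: all codewords xM over F_q, and a random puncturing of M.
q, m, n = 2, 3, 7
random.seed(3)
M = [[random.randrange(q) for _ in range(n)] for _ in range(m)]

def codewords(M, q):
    """Enumerate all q^m codewords xM of the linear code generated by M."""
    mm, nn = len(M), len(M[0])
    for x in itertools.product(range(q), repeat=mm):
        yield tuple(sum(x[i] * M[i][j] for i in range(mm)) % q for j in range(nn))

# Random puncturing: keep a random subset of columns of M; the resulting M'
# is the generator matrix of the punctured code.
keep = sorted(random.sample(range(n), 4))
Mp = [[row[j] for j in keep] for row in M]
print(len(list(codewords(M, q))), len(list(codewords(Mp, q))))  # 8 codewords each
```

The chaining in [RW14] enters when bounding, over all messages simultaneously, how the distances between punctured codewords deviate from their expectations.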

In the next post we’ll get more into how chaining works and also play with a toy example. In the meantime, you might also want to see these three blog posts by James Lee back in 2010 (or in pdf form).

Bibliography

[Ada16] Radosław Adamczak. A note on the sample complexity of the Er-SpUD algorithm by Spielman, Wang and Wright for exact recovery of sparsely used dictionaries. CoRR, abs/1601.02049, 2016.

[AL13] Nir Ailon and Edo Liberty. An almost optimal unrestricted fast Johnson-Lindenstrauss transform. ACM Transactions on Algorithms, 9(3):21, 2013.

[BCIW16] Vladimir Braverman, Stephen R. Chestnut, Nikita Ivkin, and David P. Woodruff. Beating CountSketch for heavy hitters in insertion streams. In Proceedings of the 48th Annual ACM Symposium on Theory of Computing (STOC), to appear, 2016. Full version at arXiv abs/1511.00661.

[BDN15] Jean Bourgain, Sjoerd Dirksen, and Jelani Nelson. Toward a unified theory of sparse dimensionality reduction in Euclidean space. Geometric and Functional Analysis (GAFA), 25(4):1009–1088, July 2015. Preliminary version in STOC 2015.

[BN16] Jarosław Błasiok and Jelani Nelson. An improved analysis of the ER-SpUD dictionary learning algorithm. CoRR, abs/1602.05719, 2016.

[Din14] Jian Ding. Asymptotics of cover times via Gaussian free fields: Bounded-degree graphs and general trees. Annals of Probability, 42(2):464–496, 2014.

[Dir15] Sjoerd Dirksen. Dimensionality reduction with subgaussian matrices: a unified theory. Found. Comp. Math., pages 1–30, 2015.

[DLP12] Jian Ding, James R. Lee, and Yuval Peres. Cover times, blanket times, and majorizing measures. Annals of Mathematics, 175:1409–1471, 2012.

[Gor88] Yehoram Gordon. On Milman’s inequality and random subspaces which escape through a mesh in ${\mathbb{R}^n}$. Geometric Aspects of Functional Analysis, pages 84–106, 1988.

[IN07] Piotr Indyk and Assaf Naor. Nearest-neighbor-preserving embeddings. ACM Transactions on Algorithms, 3(3), 2007.

[KM05] Bo’az Klartag and Shahar Mendelson. Empirical processes and random projections. J. Funct. Anal., 225(1):229–245, 2005.

[KS91] Ioannis Karatzas and Steven E. Shreve. Brownian motion and stochastic calculus, volume 113 of Graduate Texts in Mathematics. Springer, 2nd edition, 1991.

[LV15] Kyle Luh and Van Vu. Random matrices: ${l_1}$ concentration and dictionary learning with few samples. In Proceedings of the 56th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 1409–1425, 2015.

[Mek12] Raghu Meka. A PTAS for computing the supremum of gaussian processes. In 53rd Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 217–222, 2012.

[MPTJ07] Shahar Mendelson, Alain Pajor, and Nicole Tomczak-Jaegermann. Reconstruction and subgaussian operators in asymptotic geometric analysis. Geometric and Functional Analysis, 17:1248–1282, 2007.

[ORS15] Samet Oymak, Benjamin Recht, and Mahdi Soltanolkotabi. Isometric sketching of any set via the restricted isometry property. CoRR, abs/1506.03521, 2015.

[RW14] Atri Rudra and Mary Wootters. Every list-decodable code for high noise has abundant near-optimal rate puncturings. In Proceedings of the 46th ACM Symposium on Theory of Computing (STOC), pages 764–773, 2014.

[SWW12] Daniel A. Spielman, Huan Wang, and John Wright. Exact recovery of sparsely-used dictionaries. In Proceedings of the 25th Annual Conference on Learning Theory (COLT), pages 37.1–37.18, 2012.

[Tal10] Michel Talagrand. Are many small sets explicitly small? In Proceedings of the 42nd ACM Symposium on Theory of Computing (STOC), pages 13–36, 2010.

[Tal14] Michel Talagrand. Upper and lower bounds for stochastic processes: modern methods and classical problems. Springer, 2014.

[vH14] Ramon van Handel. Probability in high dimensions. Manuscript, version of June 30, 2014. Available online.

[Woo14] David P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1-2):1–157, 2014.

[Zha14] Alex Zhai. Exponential concentration of cover times. CoRR, abs/1407.7617, 2014.

1. February 25, 2016 3:34 pm

Note that your definition of “dictionary learning” involves a restricted version of the problem, whereas the one usually of interest in machine learning and neuroscience involves overcomplete bases. The number of vectors is more than the dimension. (You may be aware of this but your readers won’t be.)

This line of work can’t be generalized to that case.

• February 25, 2016 3:56 pm

“Note that your definition of …”

I think you misread what I wrote. In my definition I made no such assumption. In the next sentence I stated that under two restrictions (i.e. no noise and square A), these algorithms I cited get blah.

I agree the problem isn’t solved. There are several axes to measure progress on: handling noise or not, overcomplete A or not, sample complexity, running time, general A or not, how restrictive is the random model for X, etc. The line of work I cited does the best on sample complexity, but some of the other parameters are not so great. Your work and others handle noise and overcomplete A, but the sample complexity isn’t as good. The holy grail is to get everything to be awesome, which as I see it is an open problem. Anyway, we go into this in more detail in the intro in the paper I cited with Jarek.

• March 4, 2016 9:28 pm

Actually even for the overcomplete dictionaries it was already shown how to analyze the alternating minimization heuristic (under some probabilistic assumptions on the data) where the sample complexity is near-optimal. http://arxiv.org/pdf/1503.00778v1.pdf (COLT’14)

• March 4, 2016 9:29 pm

Oops; it was COLT’15, not COLT’14.

• May 5, 2016 1:08 am

Thanks for the reference. But why do you label this as nearly optimal sample complexity? The sample complexity in the paper you linked is O~(mk), where k is the column sparsity of X (that is, you observe Y = AX + noise, and each column of X is k-sparse). Note in the complete case (m = n) with no noise, for k = sqrt(n) your bound is n^{3/2} samples. Meanwhile the right answer for noiseless/complete for k <= sqrt(n) is that you only need O~(n) samples. It seems to me that the sample complexity in the paper you linked is only nearly optimal for the very low sparsity regime.