Call for proposals for FOCS’16 workshop/tutorial (half-)day

Following the proud tradition of previous STOC/FOCS conferences, FOCS’16 will also have a (half-)day of workshops/tutorials on Saturday, October 8th, right before the conference starts.

You are invited to submit your proposal of workshop or tutorial by August 31st; see details here.

In short: you just need to propose an exciting theme and arrange the speakers. We will take care of the logistical details such as rooms, AV, coffee breaks, etc.

Note that this is only a half-day event (2:30pm-6pm), since in the morning there will be another not-to-be-missed event: A Celebration of Mathematics and Computer Science: Celebrating Avi Wigderson’s 60th birthday (which in fact starts earlier, on Thursday, October 5th). See Boaz’s announcement here.

If you have any questions about the FOCS workshops, feel free to get in touch with the coordinators: Aleksander Madry and Alexandr Andoni.

STOC/SoCG workshop day & ITCS graduating bits

A couple of announcements about happenings in Cambridge:

  • STOC/SoCG joint workshop day: this year, STOC and SoCG are co-located in Cambridge, MA, and will hold a joint workshop day on June 18th, 2016. The goal of this day is to have events of common interest to both communities and thus foster further collaboration. The schedule already includes plenary talks by two invited speakers: Timothy Chan and Santosh Vempala. You are invited to submit your workshop proposal by February 22nd, to contribute to this Nobel-Peace-Prize-worthy event. You can find more details on the workshop day page.
  • Graduating bits @ ITCS: if you are a student or postdoc and would like to participate in the Graduating Bits at ITCS’16, remember to sign up using these instructions. You don’t need to be graduating this year to present.

Congrats to ACM awards winners!

ACM has just announced its awards. In particular, the Paris Kanellakis Theory and Practice award was given to Andrei Broder, Moses Charikar, and Piotr Indyk, for their work on Locality-Sensitive Hashing (LSH)! LSH has already been featured in our blog, and will likely be again 🙂 The citation says: “For their groundbreaking work on Locality-Sensitive Hashing that has had great impact in many fields of computer science including computer vision, databases, information retrieval, machine learning, and signal processing.”

Also relevant (to TCS) is the ACM/AAAI Allen Newell Award, given to Yoav Shoham and Moshe Tennenholtz for “fundamental contributions at the intersection of computer science, game theory, and economics, most particularly in multi-agent systems and social coordination (broadly construed), which have yielded major contributions to all three disciplines,” as the citation goes.

Congratulations!

Lists of Open Problems

Open problems are of course very valuable for setting research directions and pointing out current and grand challenges. I would like to point out two more recent lists of open problems that are “closer to home”, at least for me.

High frequency moments via max-stability

In this post, I will present what I think is a simple(r) algorithm for estimating the k-th frequency moment for k>2, or simply the \ell_k norm of a vector, in the sketching/streaming model. In fact, the algorithm is just a weak embedding of n-dimensional \ell_k into \ell_\infty of dimension m=O(n^{1-2/k}\log n) (this viewpoint spares me from having to describe the streaming model precisely).

What do I mean by a weak embedding? We will construct a randomized linear map f:R^n\to R^m such that, for any fixed x,x'\in R^n, the quantity \|f(x)-f(x')\|_\infty=\max_i |f(x)_i-f(x')_i| is a constant-factor approximation to \|x-x'\|_k=(\sum_i |x_i-x'_i|^k)^{1/k} with probability at least 90%. Since m\ll n, the embedding is actually dimensionality-reducing.

Before I jump into the algorithm, let me mention that it is essentially based on the approach from a joint paper with Robi Krauthgamer and Krzysztof Onak, and also on the paper by Hossein Jowhari, Mert Saglam, and Gabor Tardos. Our paper, at least, was itself inspired by the first (near-)optimal algorithm for frequency moment estimation, due to Piotr Indyk and David Woodruff (later improved by Lakshminath Bhuvanagiri, Sumit Ganguly, Deepanjan Kesh, and Chandan Saha).

The algorithm. Let x be the input vector of dimension n. The algorithm just multiplies x entry-wise by some scalars, and then folds the vector into a smaller dimension m using standard hashing. Formally, in step one, we compute y as

y_i= x_i/u_i^{1/k}

where the random variables u_i are drawn independently from the exponential distribution (with density e^{-u}). In step two, we compute z from y using a random hash function h:[n]\to [m] as follows:

z_j = \sum_{i:h(i)=j} \sigma_i \cdot y_i

where the \sigma_i are independent random \pm 1 signs.

f(x)=z is the output; that’s it. In matrix notation, f(x)=PDx, where D is the diagonal matrix with entries 1/u_i^{1/k} and P is a sparse 0/\pm1 “projection” matrix encoding the hash function h and the signs \sigma_i.
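To make the two steps concrete, here is a minimal NumPy sketch of the map f; the function name, the specific constant inside m, and the median over a few repetitions are my own illustrative choices, not part of the construction above.

```python
import numpy as np

def max_stability_sketch(x, k, m, rng):
    """One draw of the linear map f(x) = P D x described above."""
    n = len(x)
    u = rng.exponential(size=n)              # u_i ~ Exp(1), density e^{-u}
    y = x / u ** (1.0 / k)                   # step one: y_i = x_i / u_i^{1/k}
    h = rng.integers(0, m, size=n)           # random hash h: [n] -> [m]
    sigma = rng.choice([-1.0, 1.0], size=n)  # random signs sigma_i
    z = np.zeros(m)
    np.add.at(z, h, sigma * y)               # step two: z_j = sum_{i: h(i)=j} sigma_i * y_i
    return z

# Illustration: ||f(x)||_inf should estimate ||x||_k up to a constant factor.
rng = np.random.default_rng(0)
n, k = 20_000, 4
m = int(4 * n ** (1 - 2.0 / k) * np.log(n))  # m = O(n^{1-2/k} log n); the constant 4 is arbitrary
x = rng.standard_normal(n)
estimates = [np.abs(max_stability_sketch(x, k, m, rng)).max() for _ in range(15)]
print(np.median(estimates), np.linalg.norm(x, ord=k))
```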

Of course, the fun part is showing that this works (a simple algorithm does not necessarily mean a simple analysis). I’ll give essentially the entire analysis below, and it shouldn’t be too bad.

Analysis. We’ll first see that the infinity norm of y is correct, namely that it approximates \|x\|_k, and then that the infinity norm of z is correct as well (both up to a constant factor, with constant probability of success). In other words, step one is an embedding into \ell_\infty, and step two is dimensionality reduction.

The claim about the infinity norm of y follows from the “stability” property of the exponential distribution: if u_1,\ldots, u_n are exponentially distributed, and \lambda_i>0 are scalars, then \min\{u_1/\lambda_1,\ldots, u_n/\lambda_n\} is distributed as u/\lambda, where u is also an exponential and \lambda=\sum_i \lambda_i.

Now, applying this stability property for \lambda_i=|x_i|^k we get that \|y\|_\infty^k=\max_i |x_i|^k/u_i is distributed as \|x\|_k^k/u.
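As a quick numerical sanity check of this stability property (not part of the algorithm; the specific numbers below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
lam = rng.random(50) + 0.1                      # arbitrary positive scalars lambda_i
u = rng.exponential(size=(100_000, 50))         # u_i ~ Exp(1), many independent trials
samples = np.min(u / lam, axis=1) * lam.sum()   # lambda * min_i (u_i / lambda_i)
print(samples.mean(), samples.var())            # both should be close to 1 (standard exponential)
```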

Note that we already obtain a weak embedding of \ell_k^n into \ell_\infty^n (i.e., no dimensionality reduction). We proceed to show that the dimension-reducing projection does not hurt.

We will now analyze the max-norm of z. The main idea is that the biggest entries of y will go into separate buckets, while the rest of the “stuff” (the small entries) will contribute only a little to each bucket. Hence, the biggest entry of y will stick out in z as well (and nothing bigger will stick out), preserving the max-norm. For simplicity of notation, let M=\|x\|_k, and note that the largest entry of y is about M (as we argued above).

What is big? Let’s say that an entry of y is “big” if |y_i|\ge M/\log n. With good enough probability, there are at most O(\log^k n)\ll \sqrt{m} big entries (thanks to the exponential distribution), so they will all land in different buckets.
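To see where the O(\log^k n) bound comes from (a standard calculation, spelled out for completeness), use \Pr[u\le t]\le t for an exponential u:

\Pr[|y_i|\ge M/\log n]=\Pr[u_i\le |x_i|^k\log^k n/M^k]\le \frac{|x_i|^k}{M^k}\log^k n, \qquad E[\#\{\text{big entries}\}]\le \sum_i \frac{|x_i|^k}{M^k}\log^k n=\log^k n,

so by Markov’s inequality the number of big entries is O(\log^k n) with constant probability.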

Now let’s look at the extra “stuff”, the contribution of the small entries. Let S be the set of small entries. Fix some bucket index j. We would like to show that the contribution of entries from S that fall into bucket j is small, namely, less than \approx M (the max entry of y).

Let’s look at z'_j=\sum_{i\in S:h(i)=j} \sigma_i y_i. A straightforward calculation shows that the expectation of z'_j is zero, and its variance is E_{h,\sigma}[{z'}_j^2]=E_h[\sum_{i\in S:h(i)=j} y_i^2]\le \|y\|_2^2/m. If only we could now show that \|y\|_2^2 \ll M^2m, we would be done by Chebyshev’s inequality…

How do we relate \|y\|_2 to M=\|x\|_k? Here the exponential distribution comes to the rescue again. Note that E[\|y\|_2^2]=\sum_i x_i^2\cdot E[1/u_i^{2/k}]=\sum_i x_i^2\cdot O(1)=O(\|x\|_2^2), since the expectation E[1/u^{2/k}] is a constant for an exponentially distributed u (and k>2). Together with the standard inter-norm inequality \|x\|_2^2\le n^{1-2/k}\|x\|_k^2, this gives E[\|y\|_2^2]\le O(n^{1-2/k} M^2).
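For completeness, the two facts used here are a standard Gamma-function integral and Hölder’s inequality:

E[u^{-2/k}]=\int_0^\infty u^{-2/k}e^{-u}\,du=\Gamma(1-2/k)<\infty \text{ for } k>2, \qquad \|x\|_2^2\le n^{1-2/k}\|x\|_k^2.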

Combining, we have E[{z'}_j^2] \le O(n^{1-2/k} M^2/m)\le O(M^2/\log n), which is enough to apply Chebyshev’s inequality and conclude that the “extra stuff” in bucket j satisfies |z'_j|\le o(M) with constant probability.

We are almost done, except for one issue: we need the above for all buckets j, and, in particular, we would like |z'_j|\ll M for a fixed j with high probability (not just with constant probability). To achieve this, we use a stronger concentration inequality, for example Bernstein’s inequality, which lets us exploit the fact that |y_i|\le M/\log n for i\in S to obtain a high-probability statement.

Discussion. The use of the “stability” of the exponential distribution is similar to Piotr Indyk’s use of p-stable distributions for sketching/streaming the \ell_p norms, p\in(0,2]. Note that p-stable distributions do not exist for p>2, so here the notion of “stability” is slightly different. In the former, one uses the fact that if v_1,\ldots, v_n are p-stable random variables, then \sum_i \lambda_i v_i is again p-stable, with well-controlled scale. In the latter case, we use the property that the max of several “stable” random variables is again of the same type: \max_i \lambda_i/u_i is distributed as \lambda/u for \lambda=\sum_i \lambda_i (i.e., 1/u is “max-stable”). Note that this is exactly what is useful for embedding into \ell_\infty. As it turns out, such a transformation also does not increase the \ell_2 norm of the vector by much, which is what allows us to do the dimensionality reduction in \ell_\infty.

For streaming applications, the important aspects are the small final dimension m=O(n^{1-2/k} \log n) and the linearity of f. The dimension m is optimal for this particular algorithm.

UPDATE: a preliminary write-up with formal details is available here.

Higher Lower Bounds: Just Ask for More!

In memory of Mihai Pătraşcu.

Continuing the spirit of the previous post, in this post I will describe a specific technique for proving lower bounds for (static) data structures when the space is (near-)linear. This technique was introduced in the paper by Mihai and Mikkel Thorup, called “Higher Lower Bounds for Near-Neighbor and Further Rich Problems”.

Let’s start from the basics: how do we prove any data structure lower bound in the first place? First of all, let’s fix the model: the cell-probe model, which is essentially the strongest of all, introduced by Yao in ’81. Here, the memory consists of S cells of w bits each, and the processor receives the query and then probes cells of the memory. The cell-probe complexity is the number of cells that must be probed to answer the query correctly for a given problem P.

To be specific, here is a sample problem P, called partial match. The dataset is a set of n points in \{0,1\}^d, a query is a pattern in \{0,1,\star\}^d, and the data structure is required to report whether the query matches any of the n points in the set, where \star can match either 0 or 1. We think of the dimension d as d=\log^c n for some c\ge 1. For example, there is a simple solution with 3^d space using only one probe: precompute an index table with the answer to every possible query. (The parameter w is usually thought of as polynomial in the dimension d.)
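As a toy illustration of the problem and of the trivial 3^d-space, one-probe solution (the helper names below are mine, and d=3 is of course far too small to be interesting):

```python
from itertools import product

def matches(query, point):
    """Does a pattern over {0,1,*} match a binary point?"""
    return all(q == '*' or q == p for q, p in zip(query, point))

def build_index(points, d):
    """The 3^d-space solution: precompute the answer for every possible
    query pattern, so answering a query is a single table lookup ('probe')."""
    return {q: any(matches(q, p) for p in points) for q in product('01*', repeat=d)}

points = [('0', '1', '1'), ('1', '0', '0')]
table = build_index(points, d=3)
print(table[('0', '*', '1')])   # True: the pattern matches the first point
print(table[('1', '1', '*')])   # False: no point matches
```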

A very common approach to proving a lower bound in the cell-probe model is via asymmetric communication complexity, as suggested by Peter Bro Miltersen, Noam Nisan, Shmuel Safra, and Avi Wigderson in ’95. The idea is to formulate the problem as a communication game. Alice represents the processor, and thus gets the query; Bob represents the memory, and hence gets the data structure. Their “communication game” is as one would expect: Alice and Bob communicate until they solve the problem. The asymmetry comes from the fact that the input sizes are unequal, and hence we keep separate track of how much information is sent from Alice to Bob, termed a, and from Bob to Alice, termed b (in contrast to the total communication).

[MNSW’95] proved the following. Suppose there exists a solution to the data structure problem with space S and t probes. Then there is a solution to the corresponding communication game with a=t\log_2 S and b=tw (obtained by a straightforward simulation of the data structure). Hence, if we prove a lower bound for the communication game, we get a data structure lower bound. For instance, for partial match one can prove that any protocol must have either a\ge \Omega_\epsilon(d) or b\ge n^{1-\epsilon}, for any \epsilon>0. Putting these together, we conclude that t\ge \Omega(\frac{d}{\log_2 S}), or, equivalently, that S\ge 2^{\Omega(d/t)}. For any polynomial space, the lower bound is thus t=\Omega(d/\log n). Note that, here, the binding constraint is the one on a: the bound b\ge n^{1-\epsilon} clearly cannot hold for a small number of probes.

Such a lower bound is nice, but it cannot quite differentiate between, say, cubic space and linear space: just triple the query time and, as far as the bound is concerned, we have reduced the space to linear! This is of course not what happens in real data structures (I wish!). Part of the trouble comes from the fact that the communication game is more powerful than the data structure. In particular, Bob can remember what Alice has previously asked, which is clearly not the case for a data structure.

How can we prove a (higher) lower bound for a linear space data structure? Here come Mihai and Mikkel.

The idea is to ask for more! In particular, they consider several queries at the same time, by taking the k-sum of the problem. One intuition goes as follows: partition the dataset into a bunch of small problems, namely k=n/d^c of them, each of size d^c for some constant c>1, with each of the k queries going to its own mini-problem. Then we prove a space lower bound of d^{\omega(1)} for each of these mini-problems and show that the total space has to be the sum, k\cdot d^{\omega(1)}, which is still n\cdot d^{\omega(1)}. The intuition is good, but how do we make it formal? In particular, why does the space of the data structure have to add up?

Here is a different, more formal intuition. Again take the k-sum. Define the “data structure” problem as: given k instances, each of size n/k, what is the trade-off between the space S and the cell-probe complexity kt? The “communication game” variant is the natural one: Alice gets the k queries, Bob gets the k data structures, etc. As before, if there is a solution in the data structure setting, then there is a solution for the communication game.

Now comes the most interesting part (of this post at least): there is actually a communication protocol that does a more efficient simulation than one which would just multiply all the communication by k! Namely, consider one probe (for all k queries at the same time). By the previous argument, Alice would send k\cdot \log_2 S bits (k probes into a space of S cells). One can do better: sending \log_2 {S \choose k}=\Theta(k\log \frac{S}{k}) bits is sufficient, since Alice only has to specify which k cells out of the S she wants to read. Note that, when k and S are both close to n, this log factor is substantially smaller than before. All in all, Alice and Bob can achieve communication a=O(tk\log \frac{S}{k}) and b=O(tkw).
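Plugging in the parameters from the intuition above (near-linear space and k=n/d^c mini-problems, with d=\log^{\Theta(1)}n) shows the savings per probe:

a=O(t\log_2 {S \choose k})=O(tk\log_2 \tfrac{S}{k}), \qquad S=n\log^{O(1)}n,\ k=n/d^c \ \Rightarrow\ \log_2 \tfrac{S}{k}=\Theta(\log d),

compared to \log_2 S=\Theta(\log n) bits per probe in the naive simulation.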

To obtain a higher lower bound, it only remains to prove a communication complexity lower bound for the k-sum communication game. In particular, one hopes to prove a lower bound for the k-sum of the problem in which the required communication really is multiplied by k. Essentially, we want a statement of the form: either Alice has to send a\ge\Omega(k\cdot d) bits or Bob has to send k\cdot d^{\Omega(1)} bits. Putting two such statements together, [PT’06] obtain a data structure lower bound for partial match of t\ge \Omega(d/\log\frac{Sd}{n}). Or, for near-linear space S=n\log^{O(1)}n, we have a lower bound of t=\Omega(d/\log d) whenever d=\Theta(\log n) (contrast this with the earlier bound of t=\Omega(d/\log n) for any polynomial space).

Final remarks. The original paper used a slightly weaker communication complexity lower bound for partial match, which yielded a slightly weaker data structure lower bound. It has been strengthened to the bound above by Mihai, using the LSD reduction (featured in the previous post), as well as by Rina Panigrahy, Kunal Talwar, and Udi Wieder in 2008, who introduced a different method for proving lower bounds for near-linear-space data structures. I should also mention that Mihai and Mikkel actually first obtained a separation between polynomial and (near-)linear space in their earlier paper on predecessor search.

Good luck to all who have to deliver on promises made on June 26th.

Exact Algorithms from Approximation Algorithms? (part 2)

As promised in the previous post, I will explain how an algorithm designed for the approximate near neighbor problem, the LSH algorithm, leads to a solution for the exact near neighbor problem (henceforth NNS). While the algorithm for the exact problem will not have full guarantees, we will be able to give some partial guarantees.

Usually we want two guarantees from our algorithms:

1) the algorithm is correct, i.e., returns what we want. It will be OK for this to happen with 90% success probability (randomized algorithms).

2) the algorithm has good performance: time, space, etc. (again, in expectation, or with 90% probability). Here we focus on bounded query time. E.g., LSH achieves ~n^{1/c} query time for approximate NNS, where c is the approximation factor (again, think c=2).

The approximation algorithm has both guarantees. Our exact algorithm will have only (1) but no hard guarantees on (2), the query time.

Let us first expand a bit on the algorithm for the approximate NNS. The LSH algorithm can be seen as a filtering algorithm: it returns a list T of points, which are potential solutions, a subset of the dataset S. The list T will likely include the actual near neighbor(s), and not many of the far points. Specifically, we can classify the points into three types:
– “good solution”: a near neighbor, i.e., a point p at distance at most R from q. Each such point appears in T with probability at least 90%;
– “bad solution”: a far point, i.e., a point at distance more than cR. Each appears in T with probability at most P_f=n^{1/c}/n;
– “grey solution”: a point in the “grey area”, at distance between R and cR. Each may appear in T, with probability between P_f and 90%.

While I’m not explaining how LSH generates T, there are two aspects worth mentioning. One is that the list T can be generated on the fly. Two is that we can essentially treat T as a set (although, strictly speaking, it is not: dataset points may sometimes appear multiple times during the on-the-fly generation of T). (Please see this wiki page for the skipped details.)

Now, to obtain the approximate NNS algorithm, we just iterate through the list T and stop whenever we encounter either a “good” or a “grey” solution, i.e., an approximate near neighbor. It’s easy to see that both guarantees are satisfied. For (1), if there is a near neighbor p^*, then with probability at least 90% the algorithm reports p^* or an approximate near neighbor (in case one appears before p^* in T). Guarantee (2) is satisfied since the (expected) runtime is bounded by the number of bad solutions in T, namely n\cdot P_f=n^{1/c}.

How do we obtain an exact NNS algorithm from this? Simply by requiring the algorithm to stop only when it finds a good solution, i.e., an actual near neighbor. Note that we immediately inherit the correctness guarantee (1): we output exactly what we want (with 90% success probability). But guarantee (2) may no longer hold: while the list T contains few “bad solutions” (far points) in expectation, it may contain as many as \Theta(n) “grey solutions”, which are no longer acceptable answers for the exact NNS.
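To make the contrast between the two stopping rules concrete, here is a minimal sketch; the generator candidates(q), which yields the list T on the fly, and the distance function dist stand in for the LSH machinery and are hypothetical placeholders.

```python
def approx_near_neighbor(q, candidates, dist, R, c):
    """Approximate NNS: stop at the first good or grey solution,
    i.e., any point within distance c*R of the query."""
    for p in candidates(q):          # iterate over the list T, generated on the fly
        if dist(q, p) <= c * R:
            return p
    return None

def exact_near_neighbor(q, candidates, dist, R):
    """Exact NNS: same candidate stream, but only an actual R-near neighbor
    is accepted, so grey solutions no longer cut the scan short."""
    for p in candidates(q):
        if dist(q, p) <= R:
            return p
    return None
```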

Discussion. So now we have an exact algorithm that is at least correct. But can we say more about the runtime? After all, a linear scan would also be correct, and that would not be much progress.

As a start, we note that the (expected) runtime is bounded by the number of bad and grey solutions in T, a total of at most n^{1/c}+G, where G is the total number of grey solutions in S. In the case of the LSH algorithm, we can say even more: not all grey solutions are included in T with the same probability. Naturally, points at distance just above R will be present in T with probability close to 90%, while those at distance nearly cR will appear with probability closer to P_f=n^{1/c}/n. Hence, the expected size of T may be substantially less than n^{1/c}+G.

Have we made progress? For an adversarial dataset, the runtime can indeed degrade to \Theta(n) if most points are grey solutions, specifically points at distance just above R (and indeed this is what the “hard instances” for NNS look like). But are real datasets really so adversarial? Perhaps if the dataset is not so adversarial, the size of T, and hence the runtime, is much smaller.

Indeed, “real datasets” do not seem to be too adversarial. Here are some actual numbers for the MNIST dataset. It consists of 28×28-pixel images of handwritten digits, represented as 784-dimensional Euclidean vectors, which we normalize to have unit norm. (The actual LSH algorithm meant here is one for Euclidean space, based on this paper.) The queries are 10 random points from the dataset itself, for which we run the LSH algorithm parametrized with c=2 and threshold R=0.6 (the threshold is chosen so that all query points have a few R-near neighbors).

I computed several quantities: the number of near neighbors (good solutions), the number of 2-approximate near neighbors (grey solutions), and the expected size of the list T. The number of near neighbors is between 4 and 49. The number of grey solutions is between 42632 and 59564 (i.e., over 70% of the entire dataset are 2-approximate near neighbors!). The expected size of T is 2468\pm 1032 (the range is 1159-4534). This is substantially less than the size of the dataset, or even than the number of “grey solutions”.

In conclusion, we have obtained an algorithm for the exact NNS from an algorithm for the approximate NNS. The algorithm is always correct. It may have bad runtime in the worst case, but for reasonable datasets it will likely run considerably faster.

I will wrap up with a couple of questions. 1) Do you know of other algorithms with similar principles, especially algorithms for non-data-structure problems (regular approximation algorithms)? For example, I can imagine a gradient-descent-like algorithm that: i) is correct, ii) is designed to achieve some approximation guarantee, and iii) if forced to give an exact solution, is reasonably fast on “good datasets”. 2) Is there a good “criterion for algorithm design” to extract here? For example, one route may be to define a parametrization of the dataset as a function of the quality/structure of the grey and good solutions for the instance at hand, and then to measure the runtime of algorithms as a function of this quantity. That said, I do not have answers to these questions 🙂

Exact Algorithms from Approximation Algorithms? (part 1)

One great “soft” challenge in (T)CS, I find, is how to go on to find useful algorithms for problems that we believe (or have even proven!) to be hard in general. Let me explain with an all-too-common example:

Practitioner: I need to solve problem X.

Theoretician: Nice question, let me think… Hm, it seems hard. I can even prove that, under certain assumptions, we cannot do anything non-trivial.

Practitioner: Slick… But I still need to solve it. Can you do anything?

Theoretician: If I think more, I can design an algorithm that gives a solution within a factor of 2 of the actual solution!

Practitioner: Interesting, but that’s too much error nonetheless… How about 1.05?

Theoretician: Nope, essentially as hard as exact solution.

Practitioner: Thanks. But I still need to deliver the code in a month. I guess I will go and hack something by myself — anyways my instances are probably not degenerately hard.

What can we give to the practitioner in this case?

Over the years, the community has come up with a number of approaches to this challenge. In fact, last fall, Tim Roughgarden and Luca Trevisan organized a great workshop on “Beyond Worst-Case Analysis”. Besides designing approximation algorithms (mentioned above), another common approach is to design algorithms for certain subsets of instances of problem X (planar graphs, doubling metrics, or, abusing the notion a bit, average/semi-random instances).

Here, I would like to discuss a scenario where an approximation algorithm leads to an *exact* algorithm (with some guarantees). In particular, I will talk about the case of the Locality-Sensitive Hashing (LSH) algorithm for the near(est) neighbor search (NNS) problem in high-dimensional spaces.

While I hope to discuss the NNS problem in detail in future posts, let me quickly recap some definitions. The nearest neighbor search problem is the following data structure question. Given a set S of n points in, say, d-dimensional Hamming space, construct a data structure that answers the following queries: given a query point q, output the point p\in S nearest to q. It will be more convenient, though, to talk about the “threshold version” of the problem, the near neighbor problem. In the R-near neighbor problem, we are also given a threshold R at preprocessing time, and an R-near neighbor query asks to report any point p within distance R from q.

This problem is a classical example of the curse of dimensionality: while the problem has nice, efficient solutions for small d (say, for d=2, a solution via Voronoi diagrams plus planar point location), these solutions degrade rapidly with increasing dimension. In fact, it is believed that for high dimensions, d\gg \log n, nothing “non-trivial” is possible: either the query time is linear (corresponding to a linear scan of the points), or the space is exponential in d (corresponding to the generalization of the above solution).

What is the approximate version of the R-near neighbor problem? Let c denote the approximation factor (think c=2). The relaxed desideratum is: if there exists a point p at distance at most R from q, the data structure has to report some point p' within distance cR; otherwise, there is no guarantee. (Intuitively, the “approximately near” points, at distance between R and cR, may be considered either “near” or “not near”, whichever is convenient for the data structure.)

The LSH framework, introduced by Piotr Indyk and Rajeev Motwani in 1998, yields an algorithm for the c-approximate R-near neighbor problem with O(n^{1/c}\cdot d\log n) query time and O(n^{1+1/c}+nd) space. For example, for approximation c=2, this is ~\sqrt{n} query time with ~n^{1.5} space (we think of the dimension d as being much less than n). There are indications that this may be near-optimal, at least in some settings.

Now we are in the case at the end of the above dialogue. What’s next?

It turns out that the LSH algorithm can be used to solve the exact near neighbor search, with some relaxed guarantees, as I will explain in the next post.