FOCS 2014: Call for Workshops and Tutorials

As Boaz discussed, there is an excellent collection of papers to be presented at the upcoming FOCS in Philadelphia. These will be spread over three days of the conference.

Before this, though, there will be an exciting day of workshops and tutorials. It is your chance to reach hundreds of people from across the community and tell them about the latest developments in your exciting area. Organizing a workshop at FOCS removes the bulk of the administrative work involved, and lets you concentrate on the more enjoyable scientific part. Below is the call for proposals for workshops and tutorials. Send in your proposals.
http://www.cis.upenn.edu/~sanjeev/focs2014_workshops_call.html

STOC 2014: Call for Workshop and Tutorial Proposals

(also available on the conference website)

Workshop and Tutorial Day: Saturday, May 31, 2014

Workshop and Tutorial Co-Chairs: Kunal Talwar and Chris Umans

On Saturday, May 31, immediately preceding the main conference, STOC 2014 will hold a workshop-and-tutorials day. We invite groups of interested researchers to submit workshop or tutorial proposals. The goal of a workshop is to provide an informal forum for researchers to discuss important research questions, directions, and challenges. Connections between theoretical computer science and other areas, topics that are not well represented at STOC, and open problems are encouraged as workshop topics. Organizers are completely free to choose their workshop formats (invited speakers, panel discussions, etc.). The program for May 31 may also involve tutorials, each consisting of 1-2 survey talks on a particular area, and we welcome tutorial proposals as well.

STOC does not have funds to pay travel expenses or honoraria to invited workshop and tutorial speakers, but we do anticipate funds for breaks with snacks and drinks. Workshop and tutorial attendance will be free for all STOC registrants, and will also be available separately, for a reduced registration fee, to those not attending STOC itself.

Proposal submission: Workshop and tutorial proposals should be no longer than 2 pages. Please include a list of names and email addresses of the organizers, a description of the topic and the goals of the workshop or tutorial, the proposed workshop format (invited talks, contributed talks, panel, etc.), and proposed or tentatively confirmed speakers if known. Please also indicate the preferred length of time for your workshop or tutorial, along with the minimum acceptable time. We anticipate a 4-5 hour block for each workshop and a 1-3 hour block for each tutorial. Please feel free to contact the Workshop and Tutorial Co-Chairs at the email addresses below if you have questions about workshop or tutorial proposals.

Submission deadline: Proposals should be submitted by March 28, 2014, via email to kunal@microsoft.com and umans@cs.caltech.edu. Proposers will be notified by April 10, 2014, about whether their proposals have been accepted.

Differential Privacy for Measure Concentration

Today, we have a guest post from Frank McSherry talking about a clever approach to using Differential Privacy for handling pesky dependencies that get in the way of proving measure concentration results.

———————

In this post I’ll explain a cute use of differential privacy as a tool in probabilistic analysis. This is a great example of differential privacy being useful for something other than privacy itself, although there are other good examples too. The main problem we’ll be looking at is the analysis of mostly independent random variables, through the lens of a clustering problem I worked on many years ago.

In the problem of clustering data drawn from a Gaussian mixture, you assume that you are provided access to a large volume of data, each point of which is a sample from one of a few high-dimensional Gaussian distributions. Each of the Gaussian distributions is determined by a mean vector and a co-variance matrix, and each Gaussian has a mixing weight describing its relative fraction of the overall population of samples. There are a few goals you might have for such a collection of data, but the one we are going to look at is the task of clustering the data into parts corresponding to the Gaussian distributions from which they were drawn. Under what conditions on the Gaussians is such a thing possible? [please note: this work is a bit old, and does not reflect state-of-the-art results in the area; rather, we are using it to highlight the super-cool use of differential privacy].

The main problem is that while each coordinate of a Gaussian distribution is concentrated, there is some large number {d} of coordinates, and so the proximity of a sample to its mean vector {\mu} is not particularly great. You end up with bounds that look a bit like

\displaystyle Pr [ | x - \mu |_2^2 \gg d \cdot \sigma^2 ] \le \delta \; ,

where {x} is the sample, {\mu} is the mean vector, {d} the ambient dimensionality, and {\sigma^2} the norm of the covariance matrix. The probability {\delta} gets determined by thinking really hard and its specific value won’t be particularly important for us here.

Dimitris Achlioptas and I had an algorithm for doing this based on spectral projections: we would find the optimal low-rank subspace determined by the singular value decomposition (the rank taken to be {k}, the number of Gaussians in the mixture) and argue that under some separation conditions involving the means and co-variance matrices, the space spanned by these projections was basically the same as the space spanned by the mean vectors of the Gaussian distributions. This is great because when you project a Gaussian sample, you are projecting its mean plus some noise. As the true mean lies in the target space, it stays where it is. When you project Gaussian noise onto a fixed subspace, it stays Gaussian, but with far fewer dimensions. The particular form of these results looks something like this, with a projection matrix {P} applied to the sample {x} before subtracting from {\mu}.

\displaystyle Pr [ |P x - \mu |_2^2 \gg k \cdot \tau^2 ] \le \delta \; ,

where {\tau} is close to {\sigma}. This means that while the projected sample stays centered on {\mu}, the contribution of the noise more-or-less vanishes. At least, the {d} is reduced to {k}, and that can be quite a lot. Hooray!
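
To get a feel for the numbers, here is a quick numerical sketch (mine, not from the paper); note that the subspace here is drawn independently of the sample, which is exactly the independence assumption discussed next:

```python
import numpy as np

# Project d-dimensional Gaussian noise onto a k-dimensional subspace
# chosen INDEPENDENTLY of the sample; the squared norm drops from
# about d * sigma^2 to about k * sigma^2.
d, k, sigma = 1000, 5, 1.0
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(d, k)))   # orthonormal basis, independent of x
x = rng.normal(scale=sigma, size=d)            # pure noise around the mean
print(np.sum(x ** 2))                          # concentrates near d * sigma^2 = 1000
print(np.sum((Q.T @ x) ** 2))                  # concentrates near k * sigma^2 = 5
```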

The problem is that this “more-or-less vanishes” thing is really only true when the target space and the random noise are independent. However, since the optimal low-rank subspace was determined from the data, it isn’t independent of any of the noise we are projecting. It’s slightly dependent on the noise, and in the wrong way (it attempts to accommodate the noise, which isn’t what we want if we want to project *out* the noise). In particular, you don’t immediately get access to the sorts of bounds above.

You could do things the way Dimitris and I did, which was a fairly complicated mess of cross-training (randomly partition the data and use each half to determine the subspace for the other), and you end up with a paper that spends most of its time developing algorithms and notation to enforce the independence (the cross-training needs to be done recursively, but we can’t afford to cut the samples in half at each level, blah blah, brain explodes). You can read all about it here. We’re going to do things a bit simpler now.

Enter differential privacy. Recall, for a moment, the informal statement of differential privacy: a randomized computation has differential privacy if the probability of any output occurrence is not substantially adjusted when a single input element is added or removed. What a nice privacy definition!

Now, let’s think of it slightly differently, in terms of dependence and independence. If {P} is the result of a differentially private computation on a dataset {X = \{ x_0, x_1, \ldots \}}, then {P} is not substantially differently distributed than the same computation run on {X \setminus \{ x_i \}}. If this second distribution enjoys some nice property with high probability, for example due to its independence from {x_i}, then it remains very likely that {P} has the property as well. The probability that the property no longer holds can only increase by a factor of {\exp(\epsilon)} when we add {x_i} back in to the input.

For example, let’s consider the probability of the property: “the squared length of the projection of {x_i-\mu} onto the optimal subspace is much larger than {k \cdot \tau^2}”. When the input data are {X \setminus \{ x_i \}}, resulting in a projection-valued random variable we’ll name {P_i}, this probability is small because {P_i} is independent of {x_i}.

\displaystyle Pr [ | P_i x_i - \mu |_2^2 \gg k \cdot \tau^2 ] < \delta \; .

When the input data are {X}, resulting in a projection-valued random variable {P}, this probability is not easily bounded by independence, but can be bounded by differential privacy: if the computation producing {P} is {\epsilon}-differentially private, then the probability can increase by at most {\exp(\epsilon)}:

\displaystyle Pr [ | P x_i - \mu |_2^2 \gg k \cdot \tau^2 ] < \exp(\epsilon) \times \delta \; .

We can even live with fairly beefy values of {\epsilon} and still get a result here. Let’s take {\epsilon = 1} for concreteness.

Now let’s discuss differentially private optimal subspace computation. One standard way to compute an optimal low-dimensional subspace is to take the covariance matrix of the data and compute its top singular vectors using the singular value decomposition. One standard way to release the covariance matrix while preserving differential privacy is to add Laplace noise, scaled to the largest magnitude permitted of any data point, to each entry of the covariance matrix. Since our Gaussian data are nicely concentrated, they aren’t likely to be terribly enormous, and a data-independent upper bound will work great here.

What we get is a noisy covariance matrix, which we then subject to the singular value decomposition. There are some nice theorems about how the SVD is robust in the presence of noise, which is actually the same reason it is used as a tool to filter out all of that noise that the Gaussian distributions added in the first place. So, even though we added a bunch of noise to the covariance matrix, we still get a fairly decent approximation to the space spanned by the means of the Gaussian distributions (as long as the number of samples is larger than the dimensionality of the samples). At the same time, because the resulting subspace is differentially private with respect to the samples, we can still use the concentration bounds typically reserved for the projection of random noise on independent subspaces, as long as we can absorb a factor of {\exp(\epsilon)}.
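
Putting the pieces together, the pipeline might look like the following sketch (the function name, the bound {B}, and the per-entry noise scale are illustrative assumptions on my part; a rigorous calibration needs the full sensitivity analysis):

```python
import numpy as np

def private_projection(X, k, eps, B):
    """Sketch of a differentially private top-k subspace.

    X is an (n, d) data matrix whose entries are assumed bounded by B in
    magnitude. Adding or removing one row changes X.T @ X by a rank-one
    term x x^T with entries at most B**2, so we add Laplace noise at
    that scale to every entry. (A rigorous calibration must account for
    the total L1 sensitivity over all d*d entries; this is a sketch.)
    """
    d = X.shape[1]
    C = (X.T @ X).astype(float)                        # (unnormalized) covariance
    C += np.random.laplace(scale=B ** 2 / eps, size=(d, d))
    C = (C + C.T) / 2                                  # re-symmetrize after noising
    _, _, Vt = np.linalg.svd(C)                        # SVD of the noisy covariance
    V = Vt[:k].T                                       # top-k right singular vectors
    return V @ V.T                                     # the projection matrix P
```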

At its heart, this problem was about recovering a tenuous independence which was very important for the simplicity (and arguably the tractability) of the analysis. It shows up in lots of learning problems, especially in validating models: we typically split data into test and training sets, to permit an evaluation of learned results without the issue of overfitting. Here differential privacy makes things simpler: if your learning process is differentially private, you did not overfit your data (much).

On BibTeX

Only a few decades ago, I hear, CS papers were typewritten, with hand-drawn figures and hand-written Greek letters. With the advent of computers, people began to use tools like troff, until we got tex, and then latex, making life much easier. I have no clue how people not at the same physical location co-wrote papers before the Internet. Maybe make a long-distance call to read out and discuss the introduction? Or maybe write weeks in advance to allow shipping/faxing the paper draft back and forth! Email makes it possible to do near real-time co-writing, and today version control tools make this task significantly easier. I think it is safe to say that with these tools and a little bit of experience, nearly all the time spent on writing is spent on figuring out and polishing the content: if we had a genie to lay out the text, so that all we had to do was conjure up the words and the preferred layout, it would not save us much time.

There is one aspect, however, that I think still lags behind and adds overhead: the creation of bibliographies. Needless to say, bibtex makes it a lot easier than it would be otherwise. But the engineer in us is greedy, and wants to remove inefficiencies wherever it sees them. It wants to get even closer to the genie ideal. While I understand this is a “first-world-problems” post, I am hoping some of you will suggest solutions that we will all find useful.

Let me elaborate on these inefficiencies. First is the question of creating the bibtex entries. This, especially a few hours before the deadline, is a tedious process, involving translation of the mental pointer I have to a particular paper into a list of its distinguishing features that would allow the reader to locate the paper. Today that list is usually in the form of a bibtex entry. For historical reasons, this list contains journal/conference name, year, volume, edition, and page numbers! Some of these, especially the last few, have become fairly irrelevant for recent publications. Any time spent on filling these in could easily be saved by the hypothetical genie. Even worse, doing this in the middle of writing the paper is a context switch that adds its own overhead. Of course technology and amortization help; more on that in a minute.

Once I have the bibtex entry, I have effectively converted the mental pointer into a short bibtex tag that I can use while writing. So when I am writing and want to cite a paper, I can simply use that. Not so fast: I still need to translate the mental pointer to this bibtex tag. So unless I remember the tag, I have to go to the bibtex file and search for the paper (usually by the authors) to remind myself of it. Another context switch, another break in the train of thought.

Okay, so now on to some of the “solutions” I know of.

1. Obtaining the bibtex entries: Typically, to get the volume, edition and page numbers, I need to use a search engine. In most cases, I can find a pre-packaged bibtex entry, almost solving the problem. There are issues of quality and consistency, however. Citeseer bibtex entries often have missing or wrong fields. DBLP usually splits up the entry, sometimes creating separate citations for the proceedings and the paper. ACM I find has decent bibtex entries (often even for non-ACM papers) but smaller coverage. Many publisher websites work too. Google Scholar requires too many clicks. Microsoft Academic Search is another option that I have only recently started using, and I like it so far. I am sure there are others. Is there one that you particularly like?

2. Amortization: Very often there is a large intersection between the papers you cite today and the ones you cited last year. So starting with bibtex files from last year is a good idea. I find this appealing in theory, but haven’t really been very successful in implementing it; it helps a little, but not greatly so.
Related is the idea of sharing mega bib files that cover the whole field. Maybe I just haven’t found the right big bibtex file. Where do you get yours?
Also, when you take a bunch of these mega-bib files, there is still the problem of duplicate entries, which can translate to papers appearing twice in the references. Maybe there is a simple tool to de-dup (see the sketch after this list)! Additionally, these need to be kept current, as papers often go from @unpublished to @inproceedings to @article.


3. Mnemonics: To help with the map-mental-pointer-to-bibtex-tag problem, we often use bibtex tags that are close to our mental pointers: {LSBook} for the Lovasz Schrijver book, {LLR} for a paper with those initials. This works well if you always use your own bib file, but shared bibs may fail if we have different mental pointers, or different conventions. There is also a collision problem: BLR in complexity means something very different from BLR in privacy; adding a year often makes the tags harder to remember. What naming convention do you use?

4. Reference Management Software: apparently, there are software packages that will manage your bib files, and help share bibs. I briefly tried one, but did not find it sufficiently useful. Have you used one that actually makes you faster?

5. Autocomplete: I imagine it would help if we just wrote long descriptive tags, and used the autocomplete features of tex editors to find what we are looking for. It would be even nicer if the tex editor could do a smart autocomplete: I write \cite{ and start typing Lovasz, and it gives a drop-down of the bibtex entries in my bib file that contain the string Lovasz in either the author or title fields. If this happens, then the tags will be essentially irrelevant and using large shared megabibs would be easy. Is there an emacs package or Winedt script to do that?
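
On the de-duplication problem from item 2, here is a minimal sketch of such a tool (mine, and crude: it only drops entries repeated under the same key; the same paper under two different keys, the more common failure mode of merged mega-bibs, still needs a real parser or a human):

```python
import re
import sys

def dedup_bib(text):
    """Keep only the first entry for each citation key in a .bib file.

    Assumes each entry starts with '@type{key,' at the beginning of a
    line; @string/@preamble blocks and papers duplicated under distinct
    keys are not handled.
    """
    entries = re.split(r'\n(?=@)', text)
    seen, kept = set(), []
    for entry in entries:
        m = re.match(r'@\w+\s*\{\s*([^,\s]+)', entry)
        key = m.group(1) if m else None
        if key is None or key not in seen:
            kept.append(entry)
        if key is not None:
            seen.add(key)
    return '\n'.join(kept)

if __name__ == '__main__':
    sys.stdout.write(dedup_bib(open(sys.argv[1]).read()))
```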

I am sure there are other better solutions. How do you manage your *.bib files?

STOC Poster Session: Deadline approaching

Just a gentle reminder that the STOC 2013 poster submission deadline is a few days away. The STOC poster session is a great way to share your work with the TCS community, be it work appearing at other venues, your STOC papers that you want to talk more about, or even your FOCS submissions that you can’t wait to tell people about.

I was going to add some excellent reasons to submit, but then realized that I cannot do a better job than what Suresh already did.

So click on the link, get inspired, and follow the instructions here to submit. All you need by April 15th is a 1-2 paragraph abstract, and information on who will present. The poster itself is NOT needed until later. Hope to see you in June.

From Discrepancy to Privacy, and back Part 2: Approximating Hereditary Discrepancy

In a previous blog post, we saw how ideas from differential privacy can be used to prove a lower bound on the hereditary discrepancy for the Erdös Discrepancy problem. This lower bound happened to be nearly tight.

It turns out that this tightness is no accident. The connection between hereditary discrepancy and privacy is in fact much deeper, and can be used to get an approximation algorithm for hereditary discrepancy of any set system. In this post, I will try to summarize this connection.

First, let us recall the definition of hereditary discrepancy. For a matrix {A \in \{0,1\}^{m\times n}}, the discrepancy of {A} is defined as {\mathsf{Disc}(A) = \min_{x \in \{-1,1\}^n} |Ax|_{\infty}}. Thus we want to multiply each column of {A} by either {+1} or {-1} so that each row sum is at most some {D} in absolute value; the smallest attainable {D} is the discrepancy of {A}.

It may be worth mentioning that a {0}-{1} matrix {A} corresponds to a set system, with elements corresponding to columns and sets corresponding to rows. The vector {x} then corresponds to a coloring of the elements, and the discrepancy measures how balanced all of the sets are. Nevertheless, the definition above, and everything I will say about discrepancy, extends to arbitrary real matrices {A}.

Hereditary discrepancy is a more robust variant of this definition: the hereditary discrepancy {\mathsf{HerDisc}(A)} is simply the maximum of {\mathsf{Disc}(B)}, taken over all sub-matrices {B} of {A}.
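
To make the definitions concrete, here is a brute-force sketch (the function names are mine; exponential time, so toy instances only):

```python
import itertools
import numpy as np

def disc(A):
    """Brute-force Disc(A): minimum over all +-1 colorings x of |Ax|_inf."""
    n = A.shape[1]
    return min(np.max(np.abs(A @ np.array(x)))
               for x in itertools.product([-1, 1], repeat=n))

def herdisc(A):
    """Brute-force HerDisc(A). It suffices to maximize over column
    restrictions, since deleting rows can only decrease |Ax|_inf."""
    n = A.shape[1]
    return max(disc(A[:, list(cols)])
               for r in range(1, n + 1)
               for cols in itertools.combinations(range(n), r))
```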

To start with, note that one can prove that {\mathsf{Disc}(A) \leq D} by exhibiting a vector {x} attaining such a bound, and this witness can be efficiently verified. Computing the discrepancy itself is NP-hard, even to approximate. In fact, Charikar, Newman and Nikolov show that it is hard to distinguish between discrepancy zero and discrepancy {\sqrt{n}}. Thus one does not expect to find efficiently verifiable proofs of lower bounds on discrepancy.

Proving a good upper bound on hereditary discrepancy seems harder: how would I convince you that every sub-matrix {B} has a good coloring? It turns out that {\mathsf{HerDisc}} being {1} is equivalent to {A} being totally unimodular, and this can be efficiently checked. For higher {\mathsf{HerDisc}} no such characterization is known, and in fact deciding whether the {\mathsf{HerDisc}} is {2} or {3} can be shown to be NP-hard. One would think (and some of us tried to prove) that herdisc would be at least as hard to approximate as discrepancy. However, one would be wrong. Herdisc is in fact approximable to a polylogarithmic factor. This result fell out of our attempts to understand how much noise differentially private mechanisms must add to a given set of linear queries. In this post, I will try to sketch how we get an approximation to {\mathsf{HerDisc}}. See the paper for more details. (Since the primary focus of the paper is on questions in differential privacy, large parts of it do not pertain to the approximation of {\mathsf{HerDisc}}; if you are mainly interested in that approximation, I hope this post will help figure out which parts are relevant.)

The Partial Coloring technique

A very useful technique in discrepancy theory is that of partial coloring. Suppose that we can find a {\{0,+1,-1\}} vector {\chi} such that {|A\chi|_{\infty}} is some value {M} and {\chi} has many non-zeroes, say {|\chi|_{1} \geq \frac{n}{100}}. Let’s for convenience call the least attainable such {M} the partial discrepancy of {A}, and let {\mathsf{PHerDisc}(A)} denote the analogous hereditary version, i.e. the maximum partial discrepancy over all sub-matrices of {A}.

The partial coloring trick basically says that {\mathsf{HerDisc}(A)\leq O(\log n)\cdot \mathsf{PHerDisc}(A)} for any {A}: we can find a good coloring by repeatedly finding good partial colorings. Each step reduces the number of uncolored columns by at least a constant factor, so that in {O(\log n)} steps, we get a complete coloring.
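
In a little more detail: if {\chi_1, \chi_2, \ldots, \chi_T} are the successive partial colorings, each found on the columns left uncolored by the previous rounds, then the final coloring is {\chi = \sum_t \chi_t}, and by the triangle inequality

\displaystyle |A\chi|_{\infty} \leq \sum_{t=1}^{T} |A\chi_t|_{\infty} \leq T \cdot \mathsf{PHerDisc}(A) \; ,

while the number of uncolored columns after {T} rounds is at most {(\frac{99}{100})^T n}, so {T = O(\log n)} rounds suffice.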

{\mathsf{PHerDisc}} and Privacy

{\mathsf{PHerDisc}}, it turns out, is closely related to differential privacy. Muthukrishnan and Nikolov, building on Dinur-Nissim, showed that any minimally private algorithm to compute an approximation to {Ax}, for a private database {x}, must incur additive error at least as large as the hereditary discrepancy of {A}. Let me now explain this statement in a little more detail.

Imagine we have a database {x} which contains private data of individuals. For a start, suppose that {x} is a {0}-{1} vector, where the {i}th coordinate is a private bit of the {i}th individual (the results actually apply to a more general setting, but this version should suffice for this blog post). An analyst is interested in learning some set of {d} aggregates, represented by a {d\times n} matrix {A}; i.e., she wants to learn {Ax}. The analyst is also potentially an attacker, and may know something about the database, possibly from other auxiliary sources. Suppose that she knows some coordinates of {x} already, and our privacy policy requires that we do not enable her to learn the unknown bits when we answer the query.

As you might be thinking, this would be impossible to guarantee if we respond with the correct answer {Ax}. So we will in fact give a noisy answer in such a way that privacy is guaranteed in a formal sense, and we try to give as accurate an answer as possible. Let us say we want all of our {d} queries to be accurate within an expected additive error of {E}. How small can we make {E} while avoiding a serious privacy violation?

Let us say that a serious privacy violation occurs if an adversary who knows some {n-k} bits of {x}, after learning the noisy answer to {Ax}, can produce a guess {x'} which differs from {x} in at most {k/50} bits. In other words, the adversary learns pretty much all of {x}. Suppose {\mathsf{PHerDisc}(A)} is {H}. Then we will argue that {E} must be at least {H/4} to avoid a serious privacy violation.

Let {B} be a sub-matrix of {A} achieving partial discrepancy {H}, and let {S} denote the indices of the columns of {B}. Consider an analyst who knows {x} on {S^c} and is trying to learn {x} on {S}, and without loss of generality, suppose that {x} is zero on {S^c}. Let {y=Ax}, and let {y'} be the answer we give in an attempt to avoid a serious privacy violation. By the definition of partial discrepancy, for any {\chi \in \{0,+1,-1\}^n} supported on {S} with at least {|S|/100} non-zero entries, {|A\chi|_\infty} is at least {H}. Suppose that {E < H/4}, and let {x'} be any {0}-{1} vector such that {|Ax' - y'|_\infty \leq 2E}. By Markov's inequality, {x} itself satisfies this bound with probability at least half, in which case such an {x'} exists. By the triangle inequality, {|Ax - Ax'|_{\infty} \leq 4E < H}. Thus {\chi=x-x'} is a vector in {\{0,+1,-1\}^n} satisfying {|A\chi|_\infty < H}. It must therefore be the case that {\chi} has fewer than {|S|/100} non-zeroes, so that {x'} agrees with {x} on all but {|S|/100} coordinates. This of course is a serious privacy violation, establishing that any minimally private mechanism for {A} must incur expected error at least {\mathsf{PHerDisc}(A)/4}.

Approximating {\mathsf{PHerDisc}}

I will next try to argue how to get from this a polylogarithmic approximation algorithm for {\mathsf{PHerDisc}}, and hence for {\mathsf{HerDisc}}. Note that an {\alpha}-approximation algorithm, given a matrix {A}, must for some {H} produce two certificates: one certifying that {\mathsf{PHerDisc}} is at least {H/\alpha}, and another certifying that it is at most {H}. For example, the first certificate may consist of a sub-matrix {B} and a dual solution to an SDP relaxation of the discrepancy of {B}. A simpler witness can be constructed based on a result of Lovász, Spencer and Vesztergombi, who gave a beautiful geometric argument showing that the determinant of any square sub-matrix of {A}, appropriately normalized, gives a lower bound on {\mathsf{HerDisc}(A)}. Thus a square sub-matrix {B} such that the resulting lower bound {\mathsf{detLB}(B)} is large could form a good witness. Matousek recently showed that there always exists such a witness that is nearly optimal, but left open the question of finding it.
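
Concretely, for a {k\times k} sub-matrix {B}, the Lovász-Spencer-Vesztergombi bound takes the form (up to the exact constant)

\displaystyle \mathsf{detLB}(B) = |\det(B)|^{1/k} \; , \qquad \mathsf{HerDisc}(A) \geq \frac{1}{2} \max_{B} \mathsf{detLB}(B) \; ,

where the maximum is over all square sub-matrices {B} of {A}.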

The second certificate seems less obvious: how would I convince you that for every sub-matrix {B}, there is a good coloring? The above privacy argument comes to the rescue here. Suppose my witness consisted of the description of a random variable {E} with the following properties:

  • The mechanism that outputs {Ax+ E} satisfies a reasonable privacy definition, that rules out a serious privacy violation.
  • The expectation of the {\ell_{\infty}} norm of {E} is at most some value {P}.

Since a serious privacy violation is ruled out, the expected error must be at least {\mathsf{PHerDisc}(A)} (up to a constant factor), and thus we can conclude that {\mathsf{PHerDisc}} is at most {P}, up to constants.

In our case, the reasonable privacy definition will be {(\epsilon,\delta)}-differential privacy. Regular readers of this blog are no doubt familiar with the definition, and for this post, it will suffice that this definition guards against serious privacy violation (and much more). We show that for any {A}, we can find a sub-matrix {B} and a random variable {E} which satisfies the above two properties, with {P} bounded by polylog(d) times {\mathsf{detLB}(B)}. Thus we get a mechanism that is within {polylog(d)} of the optimal mechanism, and additionally certifies a near-tight upper bound on the hereditary discrepancy.

How do we construct such a mechanism? That is a slightly longer story, and I will only sketch the main ideas. First, we move from {\ell_{\infty}} to {\ell_2^2} at a polylogarithmic loss: intuitively, {\ell_2^2} measures an average and {\ell_{\infty}} measures the max, and the minimax theorem (or Boosting, or the Multiplicative Weights technique) ensures that an algorithm for minimizing the average can be translated into one that minimizes the maximum. This kind of reduction is relatively standard and I will skip it here.

How do we find an {(\epsilon,\delta)}-differentially private mechanism minimizing the {\ell_2^2} error? It turns out that if the columns of {A} (i.e. the vectors {Ae_i}) all lie in a ball of radius {R} around the origin, then it is enough to add Gaussian noise of magnitude {cR} in each coordinate of {Ax}, where {c = c(\epsilon,\delta)} is a constant depending on {\epsilon} and {\delta} alone. This result in fact predates the definition of differential privacy: intuitively, one person entering or leaving the database can cause a perturbation of size {Ae_i} in the value of {Ax}, and thus adding Gaussian noise to hide such a perturbation is sufficient to guarantee privacy.
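
In code, this simple Gaussian mechanism might look as follows (a sketch; the calibration constant {c(\epsilon,\delta)} used here is the standard one from the differential privacy literature, stated as an assumption rather than taken from the paper):

```python
import numpy as np

def gaussian_mechanism(A, x, eps, delta):
    """Sketch: release Ax when the columns A e_i lie in a ball of radius R.

    One person entering or leaving the database perturbs Ax by some
    column A e_i, so the l2 sensitivity is at most R, the radius of the
    column ball; Gaussian noise at scale c * R hides such a perturbation.
    """
    R = np.max(np.linalg.norm(A, axis=0))            # radius of the column ball
    c = np.sqrt(2.0 * np.log(1.25 / delta)) / eps    # standard (eps, delta) calibration
    return A @ x + np.random.normal(scale=c * R, size=A.shape[0])
```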

There is of course no reason for this mechanism to be anywhere near optimal, and in general it is not. However, it turns out that if this ball of radius {R} happens to be the minimum-volume ellipsoid that contains all the points {\pm Ae_i} (also known as the John-Loewner ellipsoid of this set of points), then we have a lot more structure to work with. A result of John, combined with the restricted invertibility theorem of Bourgain and Tzafriri, guarantees that in this case one can find {\frac{d}{2}} of the columns of {A} that have norm {R} and are nearly orthogonal. Armed with these, and using the rotational invariance of the {\ell_2} version of discrepancy, we can easily come up with a sub-matrix {B} certifying that the simple Gaussian mechanism is nearly optimal. Intuitively, the nearly orthogonal columns define a simplex with large volume, which gives the determinant lower bound.

Of course, in general we will not be lucky enough to get a ball as the John-Loewner ellipsoid. Suppose then that we get an ellipsoid {E}. Let {U} be the subspace spanned by the {\frac{d}{2}} longest axes of {E}, let {V} be the complementary subspace, and let {R} be the length of the {\frac{d}{2}}th longest axis. It turns out that a generalization of Bourgain-Tzafriri due to Vershynin can still be used to find a good {B} in {P_UA}, where {P_U} is the projection matrix for {U}. Moreover, since {P_VA} is contained in a ball of radius {R}, we can define a mechanism for the projection {P_VAx} that has error within a constant factor of the determinant lower bound we get from {B}. This gives us some progress towards outputting {Ax}. We next recursively find a mechanism for {P_UAx}. Since we halve the dimension in each iteration, we have at most {O(\log d)} iterations. We can use standard privacy machinery to ensure that the overall mechanism satisfies differential privacy, and since each step of the recursion has error within a constant factor of a lower bound output in that step, our overall {\ell_2^2} error is within a polylog factor of one of the lower bounds. This completes the rough sketch of the mechanism. See the paper for full details.

Thus we now know that hereditary discrepancy is approximable to a polylog factor, and not approximable better than a factor of 1.5. Closing, or even narrowing this gap is an interesting open problem.

Many thanks to Alex Nikolov and Li Zhang for their contributions to the content and the writing of this post.

Edit 2/13: Edited to fix the missing link to the paper.

From Discrepancy to Privacy, and back

Different areas of mathematics, and theoretical computer science, freely borrow tools and ideas from each other and these interactions, like barters, can make both parties richer. In fact it’s even better than barter: the objects being ideas, you don’t have to actually give them up in an exchange.

And so it is no surprise that differential privacy has used tools from several other fields, including complexity, cryptography, learning theory and high-dimensional geometry. Today I want to talk about a little giving back. A small interest payment, if you will. Below I will describe how differential privacy tools helped us resolve a question of Alon and Kalai.

Discrepancy

This particular instance deals with discrepancy theory, more specifically with the Erdös Discrepancy Problem (EDP) (polymath 5, being discussed here).

In 1932, Erdös conjectured:

Conjecture [Problem 9 here]: For any constant {C}, there is an {n} such that the following holds. For any function {f:[n] \rightarrow \{-1, +1\}}, there exist an {a} and a {k \leq n/a} such that

\displaystyle |\sum_{i = 1}^{k}{f(ia)}| > C.

For any {a,k}, the set {S_{a,k} = \{ia: 0\leq i \leq k\}} is an arithmetic progression containing {0}; we call such a set a Homogeneous Arithmetic Progression (HAP). The conjecture above says that for any red-blue coloring of {[n]}, there is some HAP which has a lot more red than blue (or vice versa).

In modern language, this is a question about the discrepancy of HAPs. So let me define discrepancy first. Let {U} be a universe and let {\mathcal{S}} denote a family of subsets of {U}. A coloring {f} of {U} is an assignment {f: U \rightarrow \{+1,-1\}}. The discrepancy of a set {S_i} under the coloring is simply {| \sum_{j \in S_i} f(j) |}: the imbalance of the set. The discrepancy of a set system under a coloring is the maximum discrepancy of any of the sets in the family, and the minimum of this quantity over all colorings is defined to be the discrepancy of the set system. We want a coloring in which all sets in {\mathcal{S}} are as close to balanced as possible. In other words, if {A} denotes the set-element incidence matrix of the set system, then {\mathsf{disc}(A) = \min_{x \in \{-1,1\}^n} |Ax|_{\infty}}.

Thus the conjecture says that the discrepancy of HAPs over {[n]} grows with {n}.
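
For very small {n}, this discrepancy can be computed by brute force; here is a sketch (function names mine):

```python
import itertools
import numpy as np

def hap_matrix(n):
    """Incidence matrix of the HAPs {a, 2a, ..., ka} inside [n]:
    one row per pair (a, k) with a * k <= n, one column per element."""
    rows = []
    for a in range(1, n + 1):
        for k in range(1, n // a + 1):
            row = np.zeros(n, dtype=int)
            row[[a * i - 1 for i in range(1, k + 1)]] = 1
            rows.append(row)
    return np.array(rows)

def disc(A):
    """Brute-force discrepancy, exactly as defined above (toy n only)."""
    n = A.shape[1]
    return min(np.max(np.abs(A @ np.array(x)))
               for x in itertools.product([-1, 1], repeat=n))

print([disc(hap_matrix(n)) for n in range(1, 12)])
```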

This post deals with the related concept of Hereditary Discrepancy. The discrepancy can often be small by accident: even though the set system is complex enough to contain a high-discrepancy set system in it, it can have discrepancy zero. (For example, duplicate every element: coloring each element {+1} and its copy {-1} balances every set perfectly, no matter how complex the system.) The hereditary discrepancy measures the maximum discrepancy of any set system “contained” in {(U,\mathcal{S})}.

Formally, given a set system {(U,\mathcal{S})}, and a subset {V \subseteq U}, the restriction of {(U,\mathcal{S})} to {V} is another set system {(V,\mathcal{S}_{|V})}, where {\mathcal{S}_{|V} = \{ S_i \cap V : S_i \in \mathcal{S}\}}. If we think of the set system as a hypergraph on {U}, then the restriction is the induced hypergraph on {V}. The hereditary discrepancy {\mathsf{herdisc}(U,\mathcal{S})} is the maximum discrepancy of any of its restrictions. In matrix language, {\mathsf{herdisc}(A)} is simply {\max_{A'} \mathsf{disc}(A')} where the maximum is taken over all submatrices of {A}.

Some examples. Let {n} denote {|U|}. A totally unimodular matrix {A} gives us a set system with hereditary discrepancy at most {1}. An arbitrary collection of {n} sets has discrepancy at most {O(\sqrt{n})}; note that a random coloring will give discrepancy about {O(\sqrt{n \log n})}. In a famous paper, Spencer showed the {O(\sqrt{n})} upper bound, and Bansal, and recently Lovett and Meka, gave constructive versions of this result.

Privacy

Given a vector {y \in \Re^n} and a {0}-{1} matrix {A}, consider the problem of outputting a vector {z} such that {|z - Ay|_{\infty}} is small, and yet the distribution of {z} is differentially private, i.e. for {y} and {y'} that are close, the distributions of the corresponding {z}‘s are close. If you are a regular reader of this blog, you are no stranger to differential privacy. For the purposes of this post, a mechanism {M : \Re^n \rightarrow \Re^m} satisfies differential privacy if for any {y,y' \in \Re^n} such that {|y-y'|_1 \leq 1}, and any (measurable) {S \subseteq \Re^m}:

\displaystyle \Pr[M(y) \in S] \leq 2 \Pr[M(y') \in S] .

Thus if {y,y'} are close in {\ell_1}, the distributions {M(y),M(y')} are close in {L_{\infty}}.

Researchers have studied the question of designing good mechanisms for specific matrices {A}. Here by good we mean that the expected error, say {|z-Ay|_\infty} or {|z-Ay|_2}, is as small as possible. It is natural to also prove lower bounds on the error needed to answer specific queries of interest.

A particular set of queries of interest is the following. The coordinates of the vector {y} are associated with the hypercube {\{0,1\}^d}. Think of {d} binary attributes people may have; for {v \in \{0,1\}^d}, {y_{v}} denotes the number of people with {v} as their attribute vector. For each subcube, defined by fixing {k} of the {d} bits, we look at the counting query corresponding to that subcube, i.e. {\sum_{v \in \mbox{subcube}} y_v}. This corresponds to a dot product with a vector {a}, with {a_v} being one on the subcube and zero elsewhere. Thus we may want to ask how many people have a {0} in the first and second attributes, and a {1} in the fourth. Consider the matrix {A} defined by all the possible subcube queries.
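
For small {d}, this query matrix is easy to write down explicitly; here is a sketch (the function name is mine):

```python
import itertools
import numpy as np

def subcube_matrix(d):
    """Rows indexed by patterns q in {0,1,*}^d, columns by points v in
    {0,1}^d; the entry is 1 iff v lies in the subcube that q fixes."""
    points = list(itertools.product([0, 1], repeat=d))
    patterns = list(itertools.product([0, 1, '*'], repeat=d))
    A = np.zeros((len(patterns), len(points)), dtype=int)
    for i, q in enumerate(patterns):
        for j, v in enumerate(points):
            A[i, j] = all(qi == '*' or qi == vi for qi, vi in zip(q, v))
    return A
```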

Subcubes defined by fixing {k} bits are {k}-juntas to some people, contingency table queries to others. These queries being important from a statistical point of view, Kasiviswanathan, Rudelson, Smith and Ullman showed lower bounds on the amount of error any differentially private mechanism must add, for any constant {k}. When {k} is {\Omega(d)}, their work suggests that one should get a lower bound of {2^{\Omega(d)}}.

The connection

So what has discrepancy got to do with privacy? Muthukrishnan and Nikolov showed that if {A} has large hereditary discrepancy, then any differentially private mechanism must incur large expected squared error. In fact, one can go back and check that nearly all known lower bounds for differentially private mechanisms are really hereditary discrepancy lower bounds in disguise. Thus there is a deep connection between {\mathsf{herdisc}(A)} and the minimum achievable error for {A}.

For the EDP, it is natural to ask how large the hereditary discrepancy is. Alon and Kalai showed that it is {\tilde{\Omega}(\sqrt{\log n})} and at most {n^{O(\frac{1}{\log\log n})}}. They also showed that for constant {\epsilon}, it is possible to delete an {\epsilon} fraction of the integers in {[n]} so that the remaining set system has hereditary discrepancy at most polylogarithmic in {n}. Gil guessed that the truth is closer to the lower bound.

Alex Nikolov and I managed to show that this is not the case. Since hereditary discrepancy is, well, hereditary, a lower bound on the hereditary discrepancy of a submatrix is also a lower bound on the hereditary discrepancy of the whole matrix. In the EDP matrix on {[n]}, we will first find our subcubes-juntas-contingency-tables matrix {A} above as a submatrix, one for which {d} is {\Theta(\frac{\log n}{\log \log n})}. Having done that, it will remain to prove a lower bound for {\mathsf{herdisc}(A)} itself.

The first step is done as follows: associate each dimension {i \leq d} with the {2i}th and the {(2i+1)}th primes. A point {v = (v_1,v_2,\ldots,v_d)} in the hypercube is naturally associated with the integer {f(v)=\Pi_{i} p_{2i+v_i}}. A subcube query can be specified by a vector {q \in \{0,1,\star\}^d}: {q} is set to the appropriate {0}-{1} value for the coordinates that we fix, and {\star} for the unconstrained coordinates. We can associate a subcube {q} with the integer {f(q)=\Pi_{i: q_i \neq \star} p_{2i+q_i}}. It is easy to see that {v} is in the subcube corresponding to {q} if and only if {f(q)} divides {f(v)}. Thus if we restrict ourselves to the integers {\{f(v): v \in \{0,1\}^d\}} and the HAPs corresponding to {\{f(q): q \in \{0,1,\star\}^d\}}, we have found a submatrix of the EDP matrix that looks exactly like our contingency tables matrix {A}. Thus the hereditary discrepancy of the EDP matrix is at least as large as that of this submatrix.
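
The divisibility criterion is easy to verify mechanically; here is a sketch (names mine) that checks it exhaustively for a small {d}:

```python
import itertools
from math import prod

def first_primes(n):
    """The first n primes, by trial division (fine for small n)."""
    ps, c = [], 2
    while len(ps) < n:
        if all(c % p for p in ps):
            ps.append(c)
        c += 1
    return ps

def check_embedding(d):
    """Verify: v lies in subcube q  <=>  f(q) divides f(v), where
    dimension i is assigned the pair of primes p[2i], p[2i+1]."""
    p = first_primes(2 * d)
    f_point = lambda v: prod(p[2 * i + vi] for i, vi in enumerate(v))
    f_query = lambda q: prod(p[2 * i + qi] for i, qi in enumerate(q) if qi != '*')
    for v in itertools.product([0, 1], repeat=d):
        for q in itertools.product([0, 1, '*'], repeat=d):
            in_cube = all(qi == '*' or qi == vi for qi, vi in zip(q, v))
            assert in_cube == (f_point(v) % f_query(q) == 0)

check_embedding(3)  # the two matrices coincide
```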

Lower bounds for private mechanisms for {A} can be derived in many ways. For constant {k}, the results of Kasiviswanathan et al. referred to above suffice, and it is likely that they can be pushed to get the {2^{\Omega(d)}} lower bounds we are shooting for. A very different approach, of extracting from weak random sources, also implies such a lower bound. It is likely that from these, one could get a lower bound on {\mathsf{herdisc}} of the kind we need.

However, given what we know about {A}, we can in fact remove the privacy scaffolding and give a simpler, direct proof of the {2^{\Omega(d)}} lower bound on {\mathsf{herdisc}(A)}, without any mention of privacy. This implies that the hereditary discrepancy for the EDP is at least {2^{\Omega(d)} = n^{\Omega(\frac{1}{\log\log n})}}, which matches the upper bound up to a constant in the exponent. A brief note with the proof is here.

Of course the EDP itself is wide open. Head on here to help settle the conjecture.

Many thanks to Alex Nikolov for his contribution to the content and the writing of this post.

Interesting new development in Math publishing

Tim Gowers reports on a new set of open access math journals. There will be a Forum of Mathematics: Pi, which is supposed to be a generalist math journal, and a Forum of Mathematics: Sigma, which will have “clusters” for different areas. They will be paid for by Cambridge University Press for the first three years, and will then be author-pays open access. Mainly online (though one will be able to buy a print version). No fixed number of issues a year, so not rate-controlled.

The author-pays model is economically a more sensible one than the reader-pays one, I think. In the current model, the agent deciding where to send the paper has no incentive to take into account the cost of the product, except in very indirect ways, such as the concern that fewer libraries would subscribe to expensive journals, leading to fewer people reading the paper. As a result, journals have little incentive to keep prices down, and can essentially do monopolistic pricing. In the author-pays model, which in many (most? almost all?) cases will be an institution-pays model, at least the author will see what they are paying, so there will be some downward competitive pressure on prices. The currently announced price per paper for this new journal (USD 750; waived for the first three years) does seem a lot more than the cost at some of our CS open access journals (e.g. JMLR), and one hopes it will come down with time once they figure out the actual cost of running such an open access journal.