# Privacy-Preserving Data Analysis and Computational Learning: A Match Made in Heaven

Is it “safe” to release aggregate statistics from a database of sensitive information on individuals? Evidence suggests that even seemingly innocuous statistical releases can fatally compromise an individual’s privacy, especially in the presence of auxiliary information about the individual (see this paper by Homer et al. for a recent example). How might we then get statistical *utility* from datasets while providing rigorous *privacy* protection for individuals? This has been a long-standing challenge for data analysts. Recently, Differential Privacy, introduced by Dwork, McSherry, Nissim and Smith ’06, has provided a new approach and definition for tackling this problem (see also Dinur and Nissim; Dwork and Nissim; Blum, Dwork, McSherry and Nissim; and Dwork and Naor).

In a differentially private data analysis, it is guaranteed that each individual only has a small effect on the (probabilistic) output of the analysis (for more on this, see Omer’s post). Differential privacy is unique in providing rigorous protection for individuals, even in the presence of auxiliary information and under composition of multiple analyses. It has attracted large and growing interest in many different communities (e.g. CS theory, statistics, databases and systems).

Differentially private data analysis algorithms provide strong privacy protection. The natural next question is finding such algorithms that also provide strong utility guarantees. In this post, I hope to give a (fairly) non-technical overview of a promising approach for designing privacy-preserving algorithms that has emerged in a line of recent works, and to highlight an intriguing connection between learning theory and privacy-preserving data analysis.

First, a disclaimer: there are many connections between privacy-preserving data release and machine learning, and there are many beautiful works on privacy-preserving data release (with or without connections to learning) that I will not mention. Most glaringly, I will not talk about learning algorithms that are themselves differentially private or about a breakthrough paper by Blum, Ligett and Roth that brings learning techniques to bear on data release problems. Rather, I will focus on a particular setting and connection that I hope will demonstrate the fruitful interaction between these fields. I apologize in advance for the omissions.

**The Setting.** We’ll work in the following data analysis setting. A (trusted and trustworthy) curator wants to extract statistics from a database *DB* that contains information about *n* individual participants. Each individual’s data, a “data item”, is a “row” in the database drawn from a “data universe” *U*. The statistics are specified by queries $q_1, \dots, q_k$, where each query is a function mapping the database to a query answer. I will focus here on real-valued queries with range *[0,1]*. The queries may all be specified in advance (the “non-interactive” setting), or they might be specified “on the fly” (the “interactive” setting). I will focus here on “counting queries”: a counting query *q* is specified by a predicate *p: U → {0,1}*, i.e. each individual can satisfy or falsify the predicate. The query *q* counts what fraction of individuals in the database satisfy the predicate *p*. For example, the data universe could contain a field for individuals’ heights, the predicate could check whether an individual’s height is above *6’*, and the counting query associated with that predicate would count the fraction of individuals in the database whose height is greater than *6’*. One useful property of counting queries is that they have *low sensitivity*: each individual can change the answer by at most an additive *1/n* (recall that *n* is the number of individuals whose records are in the database).
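To make the definition concrete, here is a minimal sketch of a counting query and its sensitivity. The row layout (a dict with a `height_in` field, in inches) and all function names are assumptions made up for this illustration; nothing here is private yet.

```python
from typing import Callable, Sequence

Row = dict  # hypothetical row type: one individual's record


def counting_query(db: Sequence[Row], predicate: Callable[[Row], bool]) -> float:
    """Fraction of rows in the database that satisfy the predicate."""
    return sum(1 for row in db if predicate(row)) / len(db)


def taller_than_six_feet(row: Row) -> bool:
    """Example predicate: is this individual's height above 6' (72 inches)?"""
    return row["height_in"] > 72


db = [{"height_in": 70}, {"height_in": 75}, {"height_in": 73}, {"height_in": 64}]
answer = counting_query(db, taller_than_six_feet)

# Low sensitivity: replacing a single individual's row moves the answer
# by at most 1/n.
neighbor = db[:-1] + [{"height_in": 80}]
shift = abs(counting_query(neighbor, taller_than_six_feet) - answer)
```

Here two of the four individuals are taller than 6’, so `answer` is 0.5, and swapping one row shifts the answer by exactly 1/4.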

**A Foundation: Adding Independent Noise.** Dwork, McSherry, Nissim and Smith showed that counting queries (and, more generally, low-sensitivity queries) can be answered in a differentially private way by adding appropriately scaled symmetric noise of small magnitude. In fact, to answer a single counting query, it is enough to add noise with expected magnitude *O(1/n)* (the noise is drawn from a Laplace or Gaussian distribution). More generally, *k* counting queries can be answered in a differentially private way by adding independent noise with expected magnitude $O(\sqrt{k}/n)$ to each query answer. Notice that as *k* (the number of queries) grows, the scale of noise needed to guarantee privacy also grows. In particular, if we want to answer on the order of $n^2$ queries, the noise magnitude reaches the order of *1*, which is no longer interesting (the constant answer *0* always has error at most *1*). Still, this technique allows us to run many interesting analyses (especially when *n* is large), and it is a basic building block for much of the subsequent work.
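As a rough illustration of the single-query case, the sketch below adds Laplace noise scaled to the sensitivity *1/n* of a counting query. The privacy parameter `epsilon`, the function names, and the example numbers are assumptions introduced here (the post does not name them), and the sketch does not attempt the multi-query bound discussed above.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponentials
    with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_counting_query(true_answer: float, n: int, epsilon: float,
                         rng: random.Random) -> float:
    """Answer one counting query with Laplace noise of scale sensitivity/epsilon.

    A counting query over n rows has sensitivity 1/n, so the noise scale is
    1/(epsilon * n). `epsilon` is an assumed privacy parameter.
    """
    sensitivity = 1.0 / n
    return true_answer + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)          # fixed seed so the sketch is reproducible
n, epsilon = 10_000, 0.5        # hypothetical database size and privacy level
true_answer = 0.3               # hypothetical exact answer to the query

samples = [noisy_counting_query(true_answer, n, epsilon, rng)
           for _ in range(20_000)]
mean_abs_error = sum(abs(s - true_answer) for s in samples) / len(samples)
expected_scale = 1.0 / (epsilon * n)
```

The expected absolute value of a Laplace sample equals its scale, so `mean_abs_error` should land close to `expected_scale`, which shrinks like *1/n* as the database grows.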

**Computational Learning and Private Data Release.** There is an intuitive connection between data release and learning as follows. In the data release setting, one can view the database *DB* as a function: it maps each query *q* to an answer in *[0,1]*. The goal in data analysis is approximating this function on a collection of queries (or “examples”). For *privacy-preserving* data analysis, we want to approximate the database/function without leaking too much of the information it contains. For example, if we access the database using the “independent noise” technique, we want to approximate the database while only computing a small number of noisy answers on queries/examples.

With this view of the database as a function in mind, a striking similarity to machine learning emerges. There, a standard goal is efficiently learning/approximating a function on a collection of examples given limited access to it, e.g., given only a bounded number of labeled examples or oracle accesses. A learning algorithm’s goal is to use its bounded access to the function to generate a *hypothesis* that can be used to label future examples. Similarly, we view the goal of a data analysis algorithm as generating a *“data synopsis”* (or synopsis for short) that can be used to provide answers for queries (which play an analogous role to the examples in the machine learning setting). A natural approach to privacy-preserving data release, then, is viewing the database as a function, and “learning”/approximating this function using a learning algorithm. This connection is used (implicitly or explicitly) in several works (e.g. DNRRV09, DRV10, GHRU11).

While the approach outlined above is intuitively appealing at a high level, we are not quite done yet. There are immediate obstacles because of apparent incompatibilities between the requirements of learning algorithms and the types of “limited” access to data that are imposed by private data release. In particular, using the “independent noise” technique, we are restricted to getting **noisy answers** for a **bounded number of queries**. If we view the database as a function (on queries) that we want to learn, then we need to learn this unknown function using **noisy oracle accesses**/answers. Moreover, because of the way error scales up when we use the “independent noise” technique, we are allowed only a **small number of oracle accesses** (sub-linear or certainly sub-quadratic in the database/function size *n*).

Nonetheless, I hope to demonstrate the appeal of this approach with a (brief) overview of its role (implicit or explicit) in two recent works. I believe there is great potential for further research and algorithms that accomplish privacy-preserving data release using machine learning tools.

**“Query Boosting” for privacy-preserving data analysis.** Boosting is a powerful and successful technique in machine learning, pioneered by Schapire. A boosting algorithm provides an automatic way to convert a “weak” learner into a “strong” one. Here, by “weak” learner we mean a learning algorithm that, for any fixed distribution on a collection of examples, outputs a hypothesis that is only guaranteed to correctly label slightly more than half of the examples, weighted by the distribution (notice this may not be as “weak” as it first seems, since the guarantee is for **any** distribution on examples). A “strong” learner is a learning algorithm that, in the same setting, labels **almost all** of the examples correctly. One method for achieving the miracle of boosting is iteratively re-weighting the examples: run the weak learner in iterations, where each iteration produces a hypothesis. After the weak learner produces a hypothesis, the boosting algorithm re-weights the examples, giving higher weight to “difficult” examples where the hypothesis was wrong, and continues to the next iteration with this re-weighted distribution. After relatively few iterations, it can be shown that each example is labeled correctly by the majority of hypotheses produced (see Freund and Schapire).
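The re-weighting loop can be sketched on a toy problem. The code below is an AdaBoost-style variant (a weighted majority vote rather than a plain majority), run on made-up one-dimensional data with threshold “stumps” as the weak learner; the data, the function names, and the choice of three rounds are all assumptions for illustration.

```python
import math


def stump(threshold, polarity):
    """Weak hypothesis: `polarity` below the threshold, `-polarity` above it."""
    return lambda x: polarity if x < threshold else -polarity


def weak_learner(xs, ys, weights):
    """Return the stump with the smallest weighted error, and that error."""
    best, best_err = None, float("inf")
    for thr in [x + 0.5 for x in xs[:-1]]:
        for pol in (1, -1):
            h = stump(thr, pol)
            err = sum(w for x, y, w in zip(xs, ys, weights) if h(x) != y)
            if err < best_err:
                best, best_err = h, err
    return best, best_err


def boost(xs, ys, rounds):
    """Run the weak learner repeatedly, up-weighting misclassified examples."""
    weights = [1.0 / len(xs)] * len(xs)
    ensemble = []
    for _ in range(rounds):
        h, err = weak_learner(xs, ys, weights)
        # Vote weight of this hypothesis (assumes 0 < err < 1/2, true here).
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, h))
        # Examples the hypothesis got wrong (y * h(x) == -1) gain weight.
        weights = [w * math.exp(-alpha * y * h(x))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted majority vote over all hypotheses produced so far."""
    return 1 if sum(alpha * h(x) for alpha, h in ensemble) > 0 else -1


# Toy data: no single stump labels more than 7 of these 10 points correctly.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

ensemble = boost(xs, ys, rounds=3)
first_h = ensemble[0][1]
single_accuracy = sum(first_h(x) == y for x, y in zip(xs, ys)) / len(xs)
boosted_accuracy = sum(predict(ensemble, x) == y for x, y in zip(xs, ys)) / len(xs)
```

On this data the first stump alone labels 7 of the 10 points correctly, while the weighted vote over three rounds labels all of them correctly.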

Work with Cynthia Dwork and Salil Vadhan presents a boosting framework for differentially private data analysis. “Query boosting” converts a “weak” **privacy-preserving algorithm** into a “strong” one. As in the above analogy with machine learning, here queries play the roles of examples, and synopses (which provide answers to queries) play the roles of hypotheses. A “weak” privacy-preserving algorithm is one that, for any distribution on queries, outputs synopses that only give accurate answers to half of the queries (weighted by the distribution). A “strong” privacy-preserving algorithm is one that answers *all* of the queries correctly. Note that weak and strong here refer to the utility guarantees (strong differential privacy is required in both cases). The hope is that query boosting will prove to be a useful algorithmic tool in the design of privacy-preserving algorithms (as was the case for boosting in the machine learning setting). In fact, in that work query boosting was used to design “strong” differentially private algorithms for answering general low-sensitivity (i.e. not necessarily counting) queries.

The natural idea is to proceed similarly to boosting in the machine learning setting: run the weak algorithm iteratively, where each iteration produces a synopsis. The query booster can then re-weight *the queries*, giving higher weight to “difficult” queries on which the synopsis was inaccurate, and continue to the next iteration with this re-weighted distribution. For utility, this idea already guarantees that after relatively few iterations, for each query, the median answer (over the synopses generated in all iterations) will have high accuracy. We are not yet done, however: the main difficulty is coming up with a query booster that preserves the weak algorithm’s privacy guarantee. There are significant obstacles here: for each query and each iteration we need to test whether the synopsis was “accurate”. This requires computing the database’s answer on this query, information which might well compromise privacy! This difficulty is compounded by:

- We need to examine many (all?) queries in order to compute the re-weighted distribution.
- Even for a fixed synopsis, small changes in the sensitive database’s answer on a query might affect that query’s re-weighting dramatically and cause a “cascading privacy violation” in the booster’s operation.

For a resolution of these difficulties, see [DRV10]. A disclaimer (that doesn’t give away the ending): our query booster requires that the weak algorithm only accesses the query distribution by sampling a bounded number of samples (this helps overcome the first difficulty above).

**Private data release via learning thresholds.** The connection between privacy-preserving data release and machine learning can also be made rigorous, and formalized as a reduction from differentially private data release to machine learning problems. Work with Moritz Hardt and Rocco Servedio shows how to reduce privacy-preserving release of counting queries to a machine learning problem: learning threshold functions. In the learning setting, given a set of examples and predicates on those examples, a threshold function is defined by *m* predicates and a threshold *t*. Given an input example, the threshold function outputs 1 iff the example satisfies *t* or more of the *m* predicates. An algorithm for learning threshold functions gets (limited) access to a threshold function (e.g. via oracle queries), and outputs a hypothesis that agrees with the threshold function (say on average). Learning threshold functions is the focus of a rich body of work in the learning community.

Building on the analogy between privacy-preserving data analysis and machine learning, let us re-examine the task of privately releasing counting queries. Recall that each query is defined by a predicate, and the predicate accepts or rejects each data item (a database “row”/an individual’s information). We can also “flip” this view, and consider each *data item* as a predicate: it takes a query description and accepts or rejects it. The database is now a *sum of n predicates* (one per row). Viewing the database as a function, and data release as learning this function, releasing counting queries is analogous to learning sums of predicates. Learning sums of predicates easily reduces to learning threshold functions on predicates (thresholds of sums of predicates) by partitioning the real range *[0,1]* into segments and running binary search.
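The “flipped” view and the binary search can be sketched as follows. This is a noiseless, non-private toy and not the reduction from [HRS12]: it assumes exact access to threshold functions and searches over integer thresholds rather than segments of *[0,1]*. The row values and all function names are made up for the illustration.

```python
def row_as_predicate(row):
    """Flipped view: a data item is a predicate that accepts or rejects a query."""
    return lambda query: query(row)


def make_threshold_oracle(db):
    """Return f(t, query) = 1 iff at least t of the row-predicates accept the query."""
    row_predicates = [row_as_predicate(row) for row in db]

    def threshold_function(t, query):
        return sum(1 for accepts in row_predicates if accepts(query)) >= t

    return threshold_function


def count_via_thresholds(threshold_function, query, n):
    """Recover the sum of the n row-predicates on `query` by binary search
    for the largest threshold t at which the threshold function outputs 1."""
    lo, hi = 0, n          # the threshold function at t = 0 always outputs 1
    calls = 0
    while lo < hi:
        mid = (lo + hi + 1) // 2
        calls += 1
        if threshold_function(mid, query):
            lo = mid       # at least `mid` rows accept the query
        else:
            hi = mid - 1   # fewer than `mid` rows accept the query
    return lo, calls


heights = [70, 75, 73, 64, 78]           # hypothetical rows (heights in inches)
taller_than_six_feet = lambda h: h > 72  # a counting query's predicate

oracle = make_threshold_oracle(heights)
count, calls = count_via_thresholds(oracle, taller_than_six_feet, len(heights))
fraction = count / len(heights)
```

Three of the five rows accept the query, so the search returns a count of 3 (a fraction of 0.6) after only a logarithmic number of threshold evaluations.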

Thus, at an intuitive level, there is a strong connection between privately releasing counting queries and learning thresholds. This connection is made rigorous in [HRS12]. It is shown that each counting query data release problem, defined by a set of queries and data items, reduces to a related threshold learning problem, defined by related sets of examples (corresponding to the queries) and predicates (corresponding to the data items). Moreover, interesting and natural data release problems induce interesting and well-studied learning problems. For example, this connection yields a new privacy-preserving algorithm for releasing conjunctions (a central data release problem).

It’s important to point out that while the high-level intuition behind the reduction is compelling, the obstacles to instantiating the intuitive connection between privacy-preserving data release and machine learning remain. In particular, we can only give the learning algorithm a **bounded** number of **noisy** accesses to the database/function being “learned”. These obstacles are overcome in [HRS12], and the final reduction does not assume any noise-resilience from the learning algorithm; noise is handled in the reduction, which provides “noiseless” answers to the learning algorithm.

**Conclusion.** A strong connection is emerging between privacy-preserving data release and machine learning. I hope that the current body of work is only scratching the surface of an even deeper and more fruitful connection between these two fields. Of course, this blog post itself was meant only as a teaser: as I confessed in the disclaimer above, I failed to mention many beautiful related works. Rather than ending with an apology, I will end with three directions for future work that I find especially promising:

- **Efficient data release using efficient machine learning.** There is a large gap between the rich utility that differential privacy permits information-theoretically (i.e. in exponential time), and what is known using efficient (polynomial time) algorithms. While some lower bounds are known (under computational assumptions), the gap is far from well understood. One especially intriguing (and practical) problem is privacy-preserving release for conjunction queries. Recently, Thaler, Ullman and Vadhan have made some progress on this question (their work hints that we might not want to restrict database access to noisy queries).
- **Heuristic utility, provable privacy.** In machine learning, algorithms often provide exceptional utility even without proven worst-case guarantees. It would be interesting to see if relaxing the worst-case utility requirement can give similarly useful algorithms. It is important to emphasize, though, that I’m only advocating relaxing *utility*. Far more care should be taken when relaxing the privacy requirement, as security guarantees should take adversarial behavior into account.
- **Learning and Privacy: a mutually beneficial “match”.** This post’s title paraphrases Shafi Goldwasser’s description of the connection between cryptography and complexity theory. As was the case there, I am hopeful that the connection between privacy and learning will prove beneficial in both directions. One appealing direction is using privacy techniques for robust or noise-tolerant learning.