Occupy Database – Privacy is a Social Choice
About a month ago, in our theory seminar, we had a talk by Paul Ohm. Paul is a Professor of Law at the University of Colorado and has done important work at the interface between Computer Science and the Law (which was also the title of his talk). There is much to report about this talk, but here I’d like to discuss a side comment that caught my attention. The comment was (as far as my memory serves) that in various law and policy forums, whenever privacy concerns are raised, someone invokes the notion of Differential Privacy (see also here). Differential privacy is an extremely important notion of privacy for data analysis, but what Paul seemed to be pointing out is that differential privacy is sometimes raised as a way to shut down the discussion on privacy rather than to start it. I’ll try to explain here how differential privacy and its generalizations can provide a framework for the social quest for privacy. Nevertheless, and here I am stating the obvious, no mathematical notion can replace the need for social debate. The right definition of privacy is context dependent and a matter for social choice. The approach I’ll discuss is based on work-in-progress with Salil Vadhan and discussions with Cynthia Dwork and Guy Rothblum.
What is Differential Privacy?
First, let me give some context on differential privacy. Consider a database x consisting of records, each containing data that is associated with one individual. A data analyzer would like to run some study on x. Granting the data analyzer direct access to x may be a gross violation of individuals’ privacy, and so instead a “sanitized” version San(x) is made public. Two quick comments: (1) The sanitization procedure is allowed (and needs) to be randomized, (2) There is a more general setting which allows for interaction between the data curator and the data analyzer, but the setting of static sanitization is good enough for our discussion. What do we want out of the sanitized database? (1) Utility: access to San(x) should allow a rich class of studies to be run with tolerable levels of error (ideally comparable with typical sampling deviations and with the errors made in creating x in the first place), and (2) Privacy: individuals’ privacy should be maintained. But what does preserving privacy mean?
Consider two databases x and x′ which are identical in all but a single record. Differential privacy requires that for every such pair, their sanitizations San(x) and San(x′) are distributed similarly, up to some small error parameter ε. (The “right” measure of distance between the distributions is important and somewhat subtle, but it’s less important for our discussion and I’ll therefore ignore it here.) Intuitively, we now have that every individual Alice loses very little in terms of privacy by joining the database (after all, the sanitization is distributed more or less the same whether or not Alice joins). This is a very desirable property, as it may increase the chances of individuals opting into the database and revealing personal information truthfully.
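To make the definition concrete, here is a minimal sketch (in Python, with hypothetical function names of my own choosing) of the classic Laplace mechanism for a single counting query. The released answer is randomized, and its distribution shifts only slightly between databases that differ in a single record, which is exactly the ε-differential-privacy guarantee for this query.

```python
import math
import random

def laplace_noise(scale):
    # Sample from a Laplace(0, scale) distribution via inverse-CDF sampling.
    u = random.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def sanitized_count(database, predicate, epsilon):
    # A counting query has "sensitivity" 1: changing one record moves the
    # true count by at most 1.  Adding Laplace noise of scale 1/epsilon
    # therefore makes the released answer epsilon-differentially private.
    true_count = sum(1 for record in database if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon)
```

Note that the noise scale depends only on ε and the query’s sensitivity, not on the database size, so for large databases the relative error can be quite small.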
What is the Privacy in Differential Privacy?
While the opt-in opt-out characterization of differential privacy is important, it is not the whole story as far as privacy is concerned. The question for Alice may not be whether it’s better if x does not contain her record, but rather whether Alice’s privacy would be better served if x did not exist at all. In other words, to determine whether we are happy with a notion of privacy, it may be useful to take the point of view of a privacy-conscious administrator or regulator Carol who has to decide whether to approve differentially private studies on x.
Let’s back up a little. How could Alice’s privacy be violated by releasing a differentially private sanitization of x? After all, the sanitization is distributed similarly to a sanitization of a database with Alice’s record removed, and such a database would surely not reveal anything about Alice. Right? Wrong! What this argument ignores is the possibility of correlations between the data of different individuals. If such correlations exist, then sensitive information about Alice may leak even without access to her own record. Furthermore, such correlations may be exactly what the study is trying to figure out.
Let us consider a few examples. Consider a medical database that consists of Alice’s record as well as the records of her immediate family. Obviously, sensitive information about Alice may be deduced from her family’s records. Fortunately, the property of differential privacy composes nicely, and therefore the sanitization of the database would still be close (up to a slightly larger error parameter) to a sanitization of a database with the entire family of Alice removed. But what if the residents of Alice’s village are exposed to a pollutant which significantly increases several health risks? In such a case, Alice’s record is correlated with those of a large number of other individuals in a potentially very revealing manner. We may no longer be able to claim that a differentially private study of the database protects Alice’s privacy. Is this a failure of the definition of differential privacy? Not necessarily. After all, perhaps determining environmental health hazards is exactly why the database was created. Perhaps (depending on various factors), in such a case, sacrificing Alice’s privacy (indeed, the community’s privacy) for the greater good (and potentially also for Alice’s good) is acceptable. Another example is one where Alice is a smoker (and this information is public). We would probably not like to quash health studies on the risks of smoking just because they predict some health condition of Alice (who may or may not be a part of this study).
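The “composes nicely” claim can be made concrete for the Laplace mechanism on a counting query (a sketch under assumed, illustrative parameters): if k records change, the true count moves by at most k, so the output densities on the two databases differ by a factor of at most e^(kε) — privacy degrades gracefully from ε to kε for a family of k records.

```python
import math

def laplace_pdf(z, mu, scale):
    # Density of a Laplace(mu, scale) distribution at the point z.
    return math.exp(-abs(z - mu) / scale) / (2.0 * scale)

epsilon, k = 0.5, 3       # per-record privacy loss; number of changed records
scale = 1.0 / epsilon     # noise scale for a sensitivity-1 counting query
for z in [-2.0, 0.0, 1.5, 10.0]:
    # Compare output densities when the true count is 0 versus k.
    ratio = laplace_pdf(z, 0.0, scale) / laplace_pdf(z, float(k), scale)
    assert ratio <= math.exp(k * epsilon) + 1e-9   # group-privacy bound e^(k*epsilon)
```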
Let us sum up: (1) Differential privacy has a very important opt-in opt-out interpretation. (2) In some cases (when there are very limited correlations between records), it protects the information of individuals. (3) In other cases (when there are more extensive correlations), differential privacy may not protect the information of individuals; it was never designed to protect such information, and it may not even be desirable that such information be protected. One should note, though, that even if most of Alice’s data is highly correlated with the rest of the database, differential privacy will still protect information that is “unique” to Alice, meaning information that cannot be deduced from the rest of the population.
A framework for interdisciplinary collaboration
Let us return to our poor regulator Carol, who now realizes that the privacy offered by differential privacy is subtle. Carol is not a mathematician, so how can she determine whether differential privacy is the right notion in the specific context she is deciding about? And if differential privacy is insufficient, how can she specify a notion that is sufficient? I will try to briefly sketch a framework that can help computer scientists and mathematicians work together with legal scholars and policy makers to identify the right notions of privacy for various settings (and in fact, most of the presentation so far was guided by this perspective).
The idea could be thought of as “individual-oriented sanitization.” Assume for now that Alice is the only individual Carol cares about. In that case, there is no need for any sophisticated sanitization. Instead, let us allow Alice to make some changes to the database, from a predefined family of changes, in order to remove traces of her sensitive information. Now Carol needs to decide how far she should go in order to protect Alice’s privacy. What is a legitimate expectation of privacy on Alice’s part? Every family of allowed individual-oriented changes would translate to a formal notion of privacy.
For example, is it reasonable to allow Alice to remove her own record and the records of a few others? In many cases the answer is yes, and if these are the only changes we would like to allow, then (the usual notion of) differential privacy is the right notion to use. The magic of differential privacy is that it simultaneously guarantees this level of privacy for all individuals while preserving some utility (whereas if we actually allowed every individual to remove her own record, we would be left with an empty and thus useless database). As a side comment, I would like to point out that one can show that asking for this level of protection for every individual is not only implied by differential privacy but in fact equivalent to it.
Is differential privacy always the right notion? Let’s look at a couple of other examples which indicate it is not. Consider a database containing the Facebook data of different individuals. Perhaps now we would like to allow Alice to remove her entire record as well as her messages and pictures from the records of her neighbors. If the neighborhood of an individual can be large, this implies a notion different from differential privacy, where a specific kind of data can be erased from many records. (Note that notions of “node-level” privacy were discussed in previous works on differential privacy for graphs, e.g. here.) Another example is a database that contains many different kinds of attributes for each individual. In such a case, Alice may still want to remove medical attributes about her family members, but she may also want to remove the salary attributes of some of her coworkers and the “my acquaintances” attributes of members of her Alcoholics Anonymous meetings, and so on. So now we may allow Alice to change a few records for every attribute (but different records for different attributes).
While the above discussion is non-technical, it translates in each case to a formal definition of privacy that generalizes differential privacy. In the second example, for instance, we can now define two databases x and x′ to be neighbors if for each attribute they differ in only a few records. We can then require that the sanitizations San(x) and San(x′) be distributed similarly (in the same way it is defined for differential privacy). Note that this new definition of privacy, while very natural with the perspective we laid out, does not have a natural opt-in opt-out interpretation. Given this new definition of privacy, we can ask which studies could be run with sufficient accuracy while preserving such privacy.
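As a sketch of how that generalized neighbor relation could be written down (illustrative Python with hypothetical names; the threshold “a few” is a parameter I chose for the example):

```python
def attribute_neighbors(db1, db2, max_changes_per_attribute=2):
    # db1 and db2 are equal-length lists of records (dicts with the same
    # keys).  They are "neighbors" under the generalized definition if,
    # for each attribute separately, they differ in only a few records.
    assert len(db1) == len(db2)
    for attr in db1[0].keys():
        changed = sum(1 for r1, r2 in zip(db1, db2) if r1[attr] != r2[attr])
        if changed > max_changes_per_attribute:
            return False
    return True
```

Requiring the sanitizations of every such pair of neighbors to be distributed similarly then yields a definition that is strictly more demanding than ordinary differential privacy (which corresponds to changing whole records rather than per-attribute slices).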
Summing up the framework we suggest: we could imagine that legal scholars and policy makers would concentrate on defining the reasonable expectation of privacy for an individual. This would translate to a definition of privacy, which should allow computer scientists to study what kind of utility can be preserved under such a privacy constraint (indeed, several of the positive results for differential privacy naturally extend to a setting where we define neighboring databases more generally). The results of this study could feed back into additional refinements of the privacy definitions.
Extensions on Fairness and Privacy
The generalizations of differential privacy discussed here are related to a recent study of fairness in classification. Beyond the technical connection, a conceptual connection is that one of the conclusions of that work is that fairness is also a social choice. I hope that in future posts I will discuss fairness and also privacy in more challenging settings (such as in on-line behavioral targeting). There are many more details to consider, for example how machine learning and statistics could assist in identifying correlations which could influence the definition of privacy.