This resource aims to support practitioners deploying differential privacy in practice. It endeavors to leave the reader with intuition, responsible guidance and case studies of privacy budgets used by organizations today.
Over the last five years, the use of differential privacy as an output disclosure control for sensitive data releases and queries has grown substantially. This is in part due to the elegant and theoretically robust underpinnings of the differential privacy literature, in part due to the prevalence of attacks on traditional disclosure control techniques, and in part due to the adoption of differential privacy by those perceived to set the "gold standard", such as the US Census Bureau.
As a reference, one way to classify the maturity and readiness of a technology in industry is to consider its technology readiness level (TRL).
The purpose of this document is to support the responsible adoption of differential privacy in industry. Differential privacy, as introduced in an upcoming section, is simply a measure of information loss about data subjects or entities. However, there are few guidelines or recommendations for choosing thresholds that strike a reasonable balance between privacy and query accuracy. Furthermore, in many scenarios these thresholds are context-specific, and so any organization endeavoring to adopt differential privacy in practice will find their selection extremely important.
In this document, we describe some dimensions along which applications of differential privacy can be characterized, and we label many real-world case studies based on the setting they are deployed in and the privacy budgets chosen. While this is not intended to act as an endorsement of any application, we hope that the document will act as a baseline from which informed debate, precedent and, eventually, best practices can emerge.
Core to this document is a registry of case studies, presented at the end.
Much of the work of identifying these initial case studies is owed to great prior work from personal blogs.
If, on the other hand, the reader is more interested in an introduction to differential privacy, there are excellent resources available, such as books and papers.
Finally, and importantly, this document is not intended to be static in nature. One core purpose behind it is to periodically add new case studies, to keep up with the ever-evolving practices of industry and government applications, and to align with guidance from regulators, which is expected to become more prevalent in the coming years. If you would like to join the authors of this document and support the registry, please head over to the Contribute page.
Before diving into the main document, it is important to note that the two prominent standardization bodies, NIST and ISO/IEC, have been active in providing guidance and standardization in the space of data anonymization, and in particular differential privacy.
ISO/IEC 20889:2018
NIST SP 800-226 (Initial Public Draft)
While the aforementioned resources are useful, neither explicitly provides guidelines on how to choose a reasonable parameterization of differential privacy models in terms of privacy budgets, nor do they point to public benchmarks to help the community arrive at industry norms over the medium to long term. In the case of ISO/IEC 20889:2018, the definitions are also limited to the most standard cases, which are often an oversimplification of real-world applications. Throughout this document, where applicable, we will link to the terminology of the standard to provide a level of consistency for the reader.
Before the age of big data and data science, traditional data collection faced a challenge known as evasive answer bias: people not answering survey questions honestly for fear that their answers might be used against them. Randomized response was proposed as a remedy.
Randomized response is a technique to protect the privacy of individuals in surveys. It involves adding noise locally, for example by flipping a coin several times and recording the individual's answer based on the coin-flip sequence. In doing so, aggregate results remain correct in expectation, but any given response is uncertain. This uncertainty over an individual's response makes randomized response one of the first applications of differential privacy, although it was not called that at the time, and the quantification of privacy was simply the weighting of probabilities determined by the mechanism.
This approach of randomizing the output of the answer to a question by a mechanism, a stochastic intervention such as coin flipping, is still the very backbone of differential privacy today.
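As a minimal sketch (not tied to any particular survey or deployment), the classic two-coin variant of randomized response can be written as follows; the `randomized_response` helper and the debiasing step are purely illustrative.

```python
import random

def randomized_response(true_answer: bool) -> bool:
    """Two-coin randomized response.

    First coin: with probability 1/2, answer truthfully.
    Second coin: otherwise, answer with a uniformly random bit.
    The truthful answer is therefore reported with probability 3/4,
    which corresponds to epsilon = ln(3) under the binary mechanism.
    """
    if random.random() < 0.5:          # first coin: tell the truth
        return true_answer
    return random.random() < 0.5       # second coin: answer at random

# The analyst can debias the aggregate: if p_yes is the observed fraction
# of "yes" responses, the true rate is estimated as 2 * p_yes - 0.5.
responses = [randomized_response(True) for _ in range(10_000)]
p_yes = sum(responses) / len(responses)
print("estimated true rate:", 2 * p_yes - 0.5)
```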
Pure epsilon-differential privacy (\(\epsilon\)-DP) is a mathematical guarantee that allows sharing aggregated statistics about a dataset while protecting the privacy of individuals by adding random noise. In simpler words, it ensures that the outcome of any analysis is nearly the same regardless of whether any individual's data is included in or removed from the dataset.
Formally, the privacy guarantee is quantified using the privacy parameter \(\epsilon\) (epsilon). A randomized algorithm \(M\) is \(\epsilon\)-differentially private if, for all neighboring datasets \(D_1\) and \(D_2\) (differing in at most one element) and for all subsets of outputs \(S \subseteq \text{Range}(M)\),
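the following inequality holds:

\[
\Pr[M(D_1) \in S] \;\le\; e^{\epsilon}\,\Pr[M(D_2) \in S].
\]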
The mechanism \(M\) adds a controlled amount of noise, quantified by \(\epsilon\), producing outputs with a certain error relative to the true value, which can be explored using the following interactive widget.
Although randomized response surveys predate the formal definition of differential privacy by over 40 years, the technique translates directly to the binary mechanism in modern differential privacy.
Assume you wish to set up the spinner proposed in the original randomized response survey design.
This mechanism is incredibly useful for building intuition among a non-technical audience. The most direct question we can pose about a data subject in a dataset is simply "Is Alice in this dataset?". Answering this question with different levels of privacy \(\epsilon\) yields different probabilities of telling the truth, which we display as follows.
| \(\epsilon\) | Probability of Truth | Odds of Truth |
|---|---|---|
| 0.1 | 0.525 | 1.11 : 1 |
| 0.5 | 0.622 | 1.65 : 1 |
| 1 | 0.731 | 2.72 : 1 |
| 2 | 0.881 | 7.39 : 1 |
| 5 | 0.993 | 148 : 1 |
| 10 | 0.99995 | 22026 : 1 |
While the above odds are just an illustrative example, they bring home what epsilon actually means in terms of the more intuitive original randomized response. As a reference, theorists often advocate for \(\epsilon \approx 1\) for differential privacy guarantees to provide a meaningful privacy assurance.
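For the standard binary randomized-response mechanism, these quantities follow directly from \(\epsilon\):

\[
\Pr[\text{truth}] = \frac{e^{\epsilon}}{1 + e^{\epsilon}}, \qquad \text{odds of truth} = e^{\epsilon} : 1.
\]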
One of the most ubiquitous mechanisms in \(\epsilon\)-differential privacy is the Laplace mechanism. It is used when adding bounded values together, such as counts or summations of private values, provided the extreme values (usually referred to as bounds) of the private values are known and hence the maximum contribution of any data subject is bounded.
Essentially, the sum is calculated, then a draw from the Laplace distribution is made and added to the true result. Assuming you are counting, so that all values lie in \((0, 1)\), the widget below shows how the distribution of noise and the expected error change with varying \(\epsilon\).
Note that the error is additive and so we can make claims about the absolute error, but not the relative error of the final stochastic result.
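As a minimal sketch of the mechanism (the function name and bounds below are illustrative, not taken from any particular library), a differentially private bounded sum can be implemented as follows:

```python
import numpy as np

def laplace_sum(values, epsilon, lower=0.0, upper=1.0):
    """Release a differentially private sum of bounded values.

    Each value is clamped to [lower, upper], so one data subject can
    change the sum by at most (upper - lower); for a count of 0/1
    values this sensitivity is 1.
    """
    clipped = np.clip(values, lower, upper)
    sensitivity = upper - lower
    scale = sensitivity / epsilon          # Laplace scale b = sensitivity / epsilon
    noise = np.random.laplace(loc=0.0, scale=scale)
    return clipped.sum() + noise

# Example: a private count of 1,000 binary flags at epsilon = 1.
data = np.random.randint(0, 2, size=1000)
print(laplace_sum(data, epsilon=1.0))
```

The expected absolute error of Laplace noise with scale \(b = \text{sensitivity}/\epsilon\) is exactly \(b\), so halving \(\epsilon\) doubles the expected error.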
\((\epsilon, \delta)\)-differential privacy is a mathematical guarantee that extends the concept of pure epsilon-differential privacy by allowing a small probability of failure, governed by a second privacy parameter \(\delta\). As with pure DP in the previous section, it ensures that the outcome of any analysis is nearly the same regardless of whether any individual's data is present, but it additionally allows a cryptographically small chance of failure.
Formally, the privacy guarantee is now quantified using both \(\epsilon\) (epsilon) and \(\delta\) (delta). A randomized algorithm \(M\) is \((\epsilon, \delta)\)-differentially private if, for all neighboring datasets \(D_1\) and \(D_2\) (differing in at most one element) and for all subsets of outputs \(S \subseteq \text{Range}(M)\),
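the following inequality holds:

\[
\Pr[M(D_1) \in S] \;\le\; e^{\epsilon}\,\Pr[M(D_2) \in S] + \delta.
\]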
The following widget describes the expected error for noise added under \((\epsilon, \delta)\)-DP.
Zero-Concentrated Differential Privacy (zCDP) introduces a parameter \(\rho\) (rho) to measure the concentration of the privacy loss around its expected value, allowing for more accurate control of privacy degradation in repeated analyses. As such, zCDP is beneficial in applications requiring multiple queries or iterative data use, some of which we describe in later sections on interactive and periodic releases.
Formally, a randomized algorithm \(M\) satisfies \(\rho\)-zCDP if, for all neighboring datasets \(D_1\) and \(D_2\) (differing in at most one element) and for all \(\alpha \in (1, \infty)\), the following holds:
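\[
D_{\alpha}\!\left(M(D_1) \,\big\|\, M(D_2)\right) \;\le\; \rho\,\alpha,
\]

where \(D_{\alpha}\) denotes the Rényi divergence of order \(\alpha\).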
The local model of differential privacy, as defined in the ISO/IEC 20889:2018 terminology, applies the randomization to each individual's data before it is collected by a central curator.
Since the noise is added very early in the pipeline, local differential
privacy trades off usability and accuracy for stronger individual privacy
guarantees. This means that while each user's data is protected even
before it reaches the central server, the aggregated results might be less
accurate compared to global differential privacy where noise is added
after data aggregation.
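As a minimal sketch of the local model (assuming a single binary attribute and reusing the binary randomized-response mechanism from earlier; the function names are illustrative), each user perturbs their own value before sending it, and the untrusted server can only debias the aggregate:

```python
import numpy as np

def local_perturb(bit: int, epsilon: float) -> int:
    """Run on each user's device: report the true bit with probability
    e^eps / (1 + e^eps), otherwise report the flipped bit."""
    p_truth = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    return bit if np.random.random() < p_truth else 1 - bit

def server_estimate(reports, epsilon: float) -> float:
    """The server never sees raw data; it debiases the mean of the noisy bits."""
    p = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    observed = np.mean(reports)
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)

true_bits = np.random.binomial(1, 0.3, size=10_000)   # hypothetical population
reports = [local_perturb(int(b), epsilon=1.0) for b in true_bits]
print("estimate of the true rate:", server_estimate(reports, epsilon=1.0))
```

Because every report is noisy, the variance of the estimate is much larger than adding a single draw of Laplace noise to a centrally computed count at the same \(\epsilon\), which is the utility cost described above.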
In contrast to the previous section, the central model refers to the setting where privacy mechanisms are applied centrally, after data collection. In this model, individuals provide raw data and place their trust in the curator, who is expected to add privacy protections downstream. This is often referred to as the global model or the server model, as defined in the ISO/IEC 20889:2018 terminology.
When we define a threat model, we mainly focus on how much trust we place in the curator. A trusted curator is assumed to apply DP correctly, while an adversarial curator may (and, for the purposes of the threat model, is assumed to) attempt to breach privacy.
As such, these concepts are strongly related to the locality of our DP model, which we previously defined as local and global DP. In the local DP protocol, we place no trust in the central curator, and so we can accept a model where the curator is adversarial, since the privacy guarantees are put in place by each user locally. In global DP, on the other hand, we expect the curator to put these privacy guarantees in place, and as such we place all of our trust in them.
These two concepts refer to how often we publish DP statistics. A static release involves publishing a single release with no further interactions, while an interactive release repeats the process, for example by allowing multiple queries on the dataset. Static releases are simpler; interactive releases can offer additional utility but require more careful accounting of the privacy spent by each query due to composition.
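As a minimal sketch of what interactive budget accounting can look like (the class below is illustrative and uses only basic sequential composition, where the epsilons of successive queries simply add up):

```python
class BasicAccountant:
    """Track cumulative privacy loss across interactive queries using
    basic sequential composition: the total epsilon is the sum of the
    epsilons of all answered queries."""

    def __init__(self, total_epsilon: float):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon: float) -> None:
        if self.spent + epsilon > self.total_epsilon:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon

# Three queries at epsilon = 0.4 each exhaust a total budget of 1.0.
accountant = BasicAccountant(total_epsilon=1.0)
accountant.spend(0.4)
accountant.spend(0.4)
try:
    accountant.spend(0.4)      # 1.2 > 1.0: the third query is refused
except RuntimeError as err:
    print(err)
```

More advanced accountants, for example those based on zCDP introduced earlier, give tighter bounds than simply adding epsilons, which is one reason interactive systems often track budgets in \(\rho\) rather than \(\epsilon\).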
An important aspect of differential privacy is defining what it is we are endeavoring to protect. Ultimately, we are usually trying to protect the atomic data subjects of a dataset: people, businesses, entities. However, depending on the dataset itself, rows of the data table may refer to different things, and an individual subject may have a causal effect on more than one record.
Event-level privacy, as described in the literature, refers to protecting the presence or absence of a single record (an event) rather than every record belonging to a data subject.
Group privacy refers to settings where we have multiple data subjects who are linked in some manner such that we care about hiding the contribution of the group. An example of this might be a household in the setting of a census. Finally, there is entity-level privacy. Similar to group-level privacy, this is when multiple records can be linked to a single entity. An example of this would be credit card transactions. One data subject may have zero or multiple transactions associated with them, thus in order to protect the privacy of the entity we need to limit the effect of all records associated with each entity.
From a technical perspective, the mechanics of the tooling to deal with groups and entities are the same, so the two terms are often used interchangeably.
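As a minimal sketch of entity-level contribution bounding (the table, column names and the per-entity cap below are hypothetical), each entity's total contribution is capped before noise is added so that the sensitivity is known:

```python
import numpy as np
import pandas as pd

def bounded_entity_sum(df, entity_col, value_col, epsilon, max_per_entity=100.0):
    """Entity-level private sum: cap each entity's total contribution so
    that any single entity (e.g. one cardholder) can shift the result by
    at most max_per_entity, then add Laplace noise scaled accordingly."""
    per_entity = df.groupby(entity_col)[value_col].sum()
    clipped = per_entity.clip(lower=-max_per_entity, upper=max_per_entity)
    noise = np.random.laplace(scale=max_per_entity / epsilon)
    return clipped.sum() + noise

# Hypothetical transactions: several rows may belong to one cardholder.
df = pd.DataFrame({
    "cardholder": ["a", "a", "b", "c", "c", "c"],
    "amount": [10.0, 250.0, 40.0, 5.0, 5.0, 5.0],
})
print(bounded_entity_sum(df, "cardholder", "amount", epsilon=1.0))
```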
Involving multiple parties in DP releases requires additional accounting of the privacy budget. Similarly to how we described an adversarial curator, we now consider a group of analysts who could collude adversarially against the DP release.
As a more practical example, collusion typically refers to an environment where each of several analysts is allocated a set privacy budget, but the analysts collaborate to leverage composition and derive information about the dataset that is only protected by a larger (worse) epsilon, thereby breaking the privacy budget intended for each of them. For example, two analysts who each spend \(\epsilon = 1\) can, by pooling their results, learn information protected only at \(\epsilon = 2\) under basic sequential composition.
This concept, often also related to the "continual observation" area of study, involves producing multiple differentially private releases of a dataset that changes periodically. Achieving this can be challenging, as each release must be carefully accounted for in the privacy budget; organizations that allow DP analyses of continually updated datasets, such as some of those listed in our registry, are mindful of setting budgets at both the user level and the time level.
The following table presents multiple systems with publicly advertised differential privacy parameters. We also generated estimates of their equivalent parameters in other DP variants and collected their respective sources. You can click on each entry to display a modal with more detailed information.
Note: values marked with (*) were generated by us using this formula to convert from pure DP to zCDP and this formula to convert from zCDP to approximate DP, with ε = 1:
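For reference, one commonly used pair of conversions (which we assume, but cannot confirm, corresponds to the formulas referenced above) is that pure \(\epsilon\)-DP implies \(\rho\)-zCDP with \(\rho = \epsilon^2 / 2\), and that \(\rho\)-zCDP implies \((\epsilon, \delta)\)-DP for any \(\delta > 0\) with

\[
\epsilon(\delta) = \rho + 2\sqrt{\rho \ln(1/\delta)}.
\]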
The table below lists private applications of differential privacy where parameters are not publicly disclosed. Click on each entry to view a modal with more detailed information.