This resource aims to support practitioners deploying differential privacy
in practice. It endeavors to leave the reader with a clearer intuition,
responsible guidance, and case studies of privacy budgets used
by organizations today.
Differential Privacy Deployment Registry
Public Registry
The following table presents multiple systems with publicly advertised differential privacy parameters. We also generated estimates for their equivalent parameters in other DP variants and collected their respective sources. You can click on each entry to view more detailed information.
Note: Values marked with (*) were generated by us using
this formula
to convert from pure DP to zCDP and
this formula to convert
from zCDP to approximate DP, with ε = 1:
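For reference, these two conversions can be sketched in a few lines of Python. The bounds used (ρ = ε²/2 for pure DP to zCDP, and ε = ρ + 2√(ρ ln(1/δ)) for zCDP to approximate DP) are the standard ones from the zCDP literature; the function names and the example δ are our own illustration rather than the registry's tooling.

```python
import math

def pure_dp_to_zcdp(epsilon: float) -> float:
    """Pure epsilon-DP implies rho-zCDP with rho = epsilon^2 / 2."""
    return epsilon ** 2 / 2

def zcdp_to_approx_dp(rho: float, delta: float) -> float:
    """rho-zCDP implies (epsilon, delta)-DP with epsilon = rho + 2*sqrt(rho * ln(1/delta))."""
    return rho + 2 * math.sqrt(rho * math.log(1 / delta))

# Example (delta chosen by us for illustration): a mechanism stated as eps = 1 pure DP.
rho = pure_dp_to_zcdp(1.0)                 # 0.5
print(zcdp_to_approx_dp(rho, delta=1e-6))  # approximate-DP epsilon at delta = 1e-6
```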
Private Use Cases
The table below lists private applications of differential privacy where
parameters are not publicly disclosed. You can click on each entry to view more detailed information.
Overview
Over the last five years, the use of differential privacy as an output
disclosure control for sensitive data releases and queries has
grown substantially. This is due in part to the elegant and theoretically
robust underpinning of the differential privacy literature, in part to
the prevalence of attacks on traditional disclosure techniques, and in part
to the adoption of differential privacy by those perceived to set the
"gold standard", such as the US Census, which acts as a form of social
proof, giving greater confidence to other early adopters.
As a reference, one way to classify the maturity and readiness of a
technology in industry is to consider its technology readiness level (TRL).
Systems built with differential privacy guarantees can be found anywhere
between TRL 6 and TRL 9. In other words, some industry applications of differential privacy have only been demonstrated in relevant domains, while others have been deployed and tested in operational environments.
Given this spread in maturity, finding common ground on privacy
deployments appears to be an urgent challenge for the DP industry.
The purpose of this document is to support the responsible adoption of
differential privacy in industry. Differential privacy, as will be
introduced in an upcoming section, is simply a measure of information loss
about data subjects or entities. However, there are few guidelines or
recommendations for choosing thresholds that strike a reasonable
balance between privacy and query accuracy. Furthermore, in many
scenarios these thresholds are context-specific, and thus any
organization endeavoring to adopt differential privacy in practice will
find their selection extremely important.
In this document, we describe several dimensions along which applications
of differential privacy can be characterized, and we label many real-world
case studies based on the setting they are deployed in and the privacy
budgets chosen. While this is not intended to act as an endorsement of any
application, we hope that the document will act as a baseline from which
informed debate, precedent and, eventually, best practices can emerge.
Core to this document is a registry of case studies presented at the end.
Much of the work of identifying these initial case studies is due to
great prior work from personal blogs, government and NGO guides.
Despite this pre-existing work, the motivation of this document lies in
expanding the number and classification of these case studies in an
open-source fashion, such that the community as a whole can contribute and
shape a shared understanding.
If the reader is instead interested in an introduction to differential
privacy itself, there are excellent resources available, such as
books and papers, online lecture notes and websites.
While this document introduces some of the nomenclature of differential
privacy, it is not intended to be a standalone resource; it refers to
common techniques and mechanisms only briefly, with references where the
reader can learn more.
Finally, and importantly, this document is not intended to be static in
nature. One core purpose behind the document is to periodically add new
case studies, to keep up with the ever-evolving practices of industry and
government applications and to align with guidance from regulators, which
is expected to become more prevalent in coming years. If you would like to
join the authors of this document and support the registry, please head
over to the Contribute page.
Official Guidance and Standardization
Before diving into the main document, it is important to note that the two
prominent standardization bodies, NIST and ISO/IEC, have been active in
providing guidance and standardization in the space of data anonymization,
and in particular differential privacy.
ISO/IEC 20889:2018: This
standard by ISO/IEC focuses broadly on de-identification techniques,
including synthetic data and randomization techniques. Although the
standard is partly normative, differential privacy is introduced as a
formal privacy measure in the style of an informative standard. Only
\(\epsilon\)-differential privacy is considered, together with the Laplace,
Gaussian and Exponential mechanisms and the concept of cumulative privacy
loss. Interestingly, despite Gaussian noise typically being associated with
\((\epsilon, \delta)\)-differential privacy and zero-concentrated
differential privacy, as will be introduced in section (ε, δ)-Differential Privacy, these more
nuanced privacy models are not defined.
NIST SP 800-226 ipd: This
guidance paper extends far beyond ISO/IEC 20889:2018, covering multiple
privacy models, considerations regarding conversion between privacy
models, basic mechanisms, threat models in terms of local and central
models, and more. It is an excellent resource for understanding the
nomenclature, security model and goals of applying differential privacy in
practice. Throughout this document we endeavor to align the terminology
with the NIST guidance paper, leaving formal definitions to the original
source.
While the aforementioned resources are useful, neither explicitly provides
guidelines on how to choose a reasonable parameterization of
differential privacy models in terms of privacy budgets, nor do they point
to public benchmarks to help the community arrive at industry norms over
the medium to long term. In the case of ISO/IEC 20889:2018, the
definitions are also limited to the most standard case, which is often an
oversimplification for real-world applications. In the course of this
document, and where applicable, we will link to the terminology of the
standard to provide a level of consistency for the reader.
Introduction to Differential Privacy
Randomized Response Surveys
Before the age of big data and data science, traditional data collection
faced a challenge known as evasive answer bias: people not answering
survey questions honestly for fear that their answers may be used against
them. Randomized response emerged in the mid-twentieth century to address
this.
Randomized response is a technique to protect the privacy of individuals
in surveys. It involves adding local noise, for example by flipping a coin
multiple times and assigning an individual's recorded response based on
the coin-flip sequence. In doing so, the responses are correct in
expectation, but any given response is uncertain. This uncertainty over
the response of an individual makes it one of the first applications of
differential privacy, although it was not called that at the time; the
quantification of privacy was simply the weighting of probabilities
determined by the mechanism.
An example of using a conditional coin-flip to achieve plausible
deniability with a calibrated bias.
This approach of randomizing the answer to a question with a mechanism (a
stochastic intervention such as coin flipping) remains the very backbone
of differential privacy today.
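As an illustrative sketch of the conditional coin-flip described above (the fair-coin bias and the survey question are only assumptions for this example):

```python
import random

def randomized_response(true_answer: bool) -> bool:
    """Classic two-coin randomized response.

    First coin: heads -> report the true answer.
    Tails -> flip again and report "yes" on heads, "no" on tails,
    regardless of the true answer.
    """
    if random.random() < 0.5:         # first coin came up heads
        return true_answer
    return random.random() < 0.5      # second coin decides the reported answer

# Any single report is plausibly deniable (this parameterization satisfies
# pure DP with epsilon = ln 3), yet the population proportion of "yes"
# answers can still be recovered in expectation:
reports = [randomized_response(True) for _ in range(10_000)]
p_obs = sum(reports) / len(reports)
p_true_estimate = 2 * (p_obs - 0.25)  # invert E[p_obs] = 0.25 + 0.5 * p_true
print(p_true_estimate)
```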
ε-Differential Privacy
Pure epsilon-differential privacy (\(\epsilon\)-DP) is a mathematical
guarantee that allows sharing aggregated statistics about a dataset while
protecting the privacy of individuals by adding random noise. In simpler
words, it ensures that the outcome of any analysis is nearly the same,
regardless of whether any individual's data is included in or removed
from the dataset.
Formally, the privacy guarantee is quantified using the privacy parameter
\(\epsilon\) (epsilon). A randomized algorithm \(M\) is
\(\epsilon\)-differentially private if for all neighboring datasets
\(D_1\) and \(D_2\) (differing in at most one element), and for all
subsets of outputs \(S \subseteq \text{Range}(M)\),
\[ \Pr[M(D_1) \in S] \leq e^{\epsilon} \, \Pr[M(D_2) \in S]. \]
The mechanism \(M\) adds a calibrated amount of noise, quantified by
\(\epsilon\), which produces outputs with some error relative to the true
value; this error can be explored with the following interactive widget.
Randomized Response was ε-Differential Privacy
Despite randomized response surveys predating the formal definition of
differential privacy by over 40 years, the technique translates directly
to the binary mechanism in modern differential privacy.
Suppose you wish to set up the spinner from the original randomized
response proposal so that it achieves \(\epsilon\)-differential privacy.
We can do so by asking participants to tell the truth with probability
\(\frac{e^{\frac{\epsilon}{2}}}{1 + e^{\frac{\epsilon}{2}}}\). This is
called the binary mechanism in the literature.
This mechanism is incredibly useful for building intuition among a
non-technical audience. The most direct question we can pose about a data
subject in a dataset is simply "Is Alice in this dataset?".
Answering the question with different levels of privacy \(\epsilon\)
yields different probabilities of telling the truth, which we display as
follows.
Interactive table: \(\epsilon\), probability of truth, and odds of truth.
While the above odds are just an illustrative example, they bring home
what epsilon actually means in terms of the more intuitive original
randomized response. As a reference, theorists often advocate for
\(\epsilon \approx 1\) for differential privacy guarantees to provide a
meaningful privacy assurance.
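As a rough sketch, the probabilities and odds of truth above can be reproduced directly from the stated truth probability \(\frac{e^{\epsilon/2}}{1 + e^{\epsilon/2}}\); the specific \(\epsilon\) values below are only illustrative:

```python
import math

def prob_of_truth(epsilon: float) -> float:
    """Probability of answering truthfully under the binary mechanism above."""
    return math.exp(epsilon / 2) / (1 + math.exp(epsilon / 2))

for eps in (0.1, 0.5, 1.0, 2.0, 5.0):   # illustrative epsilon values
    p = prob_of_truth(eps)
    odds = p / (1 - p)                   # odds of truth, equal to e^(eps/2)
    print(f"eps = {eps:>3}: P(truth) = {p:.3f}, odds = {odds:.2f} : 1")
```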
Intuition of the Laplace Mechanism
One of the most ubiquitous mechanisms in ε-differential privacy is the
Laplace mechanism. It is used when we are adding bounded values
together, such as counts or summations of private values, provided the
extreme values (usually referred to as bounds) of the private values are
known and hence the maximum contribution of any data subject is bounded.
Essentially, the sum is calculated, a draw from the Laplace distribution
is made, and the resulting random variable is added to the original
result. Assuming you are counting, such that all values are in
\(\{0, 1\}\), the widget below shows how the distribution of noise and the
expected error change with varying \(\epsilon\).
Note that the error is additive and so we can make claims about the
absolute error, but not the relative error of the final stochastic
result.
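A minimal sketch of this counting setup, assuming each data subject contributes at most one to the count so the sensitivity is 1 (the dataset and \(\epsilon\) below are illustrative):

```python
import numpy as np

rng = np.random.default_rng()

def dp_count(values, epsilon: float) -> float:
    """Laplace mechanism for a count: each value is 0 or 1, so the sensitivity is 1."""
    sensitivity = 1.0
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return float(np.sum(values)) + noise

# The expected absolute error equals sensitivity / epsilon and does not depend
# on the size of the true count, hence the note on absolute vs. relative error.
values = rng.binomial(1, 0.3, size=1000)   # illustrative 0/1 data
print(dp_count(values, epsilon=1.0))
```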
(ε, δ)-Differential Privacy
(ε, δ)-differential privacy is a mathematical guarantee that extends pure
epsilon-differential privacy by allowing for a small probability of
failure, governed by a second privacy parameter \(\delta\). Just as with
pure DP in the previous section, it ensures that the outcome of any
analysis is nearly the same regardless of whether any individual's data is
present, but it further includes an allowance for a cryptographically
small chance of failure.
Formally, the privacy guarantee is now quantified using both \(\epsilon\)
(epsilon) and \(\delta\) (delta). A randomized algorithm \(M\) is
\((\epsilon, \delta)\)-differentially private if for all neighboring
datasets \(D_1\) and \(D_2\) (differing in at most one element), and for
all subsets of outputs \(S \subseteq \text{Range}(M)\),
\[ \Pr[M(D_1) \in S] \leq e^{\epsilon} \, \Pr[M(D_2) \in S] + \delta. \]
The following widget describes the expected error for noise added under
\((\epsilon, \delta)\)-DP.
Intuition of (ε, δ)-Differential Privacy
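One common way to build intuition for the role of \(\delta\) is the classical Gaussian mechanism, whose noise scale depends on both parameters. The sketch below uses the textbook calibration \(\sigma = \Delta\sqrt{2\ln(1.25/\delta)}/\epsilon\), which is valid for \(\epsilon \le 1\); the parameter values are only illustrative:

```python
import math
import numpy as np

rng = np.random.default_rng()

def gaussian_mechanism(true_value: float, sensitivity: float,
                       epsilon: float, delta: float) -> float:
    """Classical Gaussian mechanism for (epsilon, delta)-DP (requires epsilon <= 1)."""
    sigma = sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon
    return true_value + rng.normal(loc=0.0, scale=sigma)

# Smaller delta demands more noise for the same epsilon.
print(gaussian_mechanism(true_value=100.0, sensitivity=1.0,
                         epsilon=0.5, delta=1e-6))
```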
Zero Concentrated Differential Privacy
Zero Concentrated Differential Privacy (also known as zCDP) introduces a
parameter \(\rho\) (rho) to measure the concentration of privacy loss
around its expected value, allowing for more accurate control of privacy
degradation in repeated analyses. As such, we would benefit from using
zCDP in applications requiring multiple queries or iterative data use,
some of which we will describe in future sections related to interactive
and periodic releases.
Formally, a randomized algorithm \(M\) satisfies \(\rho\)-zCDP if for all
neighboring datasets \(D_1\) and \(D_2\) (differing in at most one
element) and for all \(\alpha \in (1, \infty)\), the following holds:
\[ D_{\alpha}\big(M(D_1) \,\|\, M(D_2)\big) \leq \rho \alpha, \]
where \(D_{\alpha}\) denotes the Rényi divergence of order \(\alpha\).
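To see why zCDP is convenient for repeated analyses, note that a Gaussian mechanism with sensitivity \(\Delta\) and noise scale \(\sigma\) satisfies \(\rho\)-zCDP with \(\rho = \Delta^2 / (2\sigma^2)\), and the \(\rho\) values of composed queries simply add. A minimal sketch, with illustrative parameters and reusing the zCDP-to-approximate-DP conversion from the registry note:

```python
import math

def gaussian_zcdp_rho(sensitivity: float, sigma: float) -> float:
    """A Gaussian mechanism with noise scale sigma satisfies rho-zCDP, rho = sens^2 / (2 sigma^2)."""
    return sensitivity ** 2 / (2 * sigma ** 2)

# Under zCDP, the rho values of repeated analyses simply add up.
per_query_rho = gaussian_zcdp_rho(sensitivity=1.0, sigma=10.0)  # 0.005 per query
total_rho = 100 * per_query_rho                                 # 100 queries -> 0.5

# Convert the accumulated rho to an (epsilon, delta) statement if needed.
delta = 1e-6
epsilon = total_rho + 2 * math.sqrt(total_rho * math.log(1 / delta))
print(total_rho, epsilon)
```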
The Local Model
The local model in differential privacy, as defined in the ISO/IEC
standard, is a threat model that provides strong privacy guarantees before
data is collected by a central entity. In this model, each user adds noise
to their own data locally (for example, on their own phone or laptop)
before it is sent to a processing server. This ensures their privacy is
protected even if the data is intercepted in transit or if they do not
trust the central curator.
Since the noise is added very early in the pipeline, local differential
privacy trades off usability and accuracy for stronger individual privacy
guarantees. This means that while each user's data is protected even
before it reaches the central server, the aggregated results might be less
accurate compared to global differential privacy where noise is added
after data aggregation.
In local differential privacy, each data subject applies randomization
as a disclosure control locally before sharing their outputs with the
central aggregator.
The Central Model
In contrast to the local model, the central model refers to the setting
where the privacy mechanisms are applied centrally, after data collection.
In this model, individuals provide raw data and place their trust in the
curator, who is expected to add privacy protections in downstream tasks.
This is often referred to as the global model or the server model, as
defined in the ISO/IEC standard.
In global differential privacy, each data subject shares their private
information with the trusted aggregator. Randomization is applied as a
disclosure control prior to broader dissemination.
Trusted vs Adversarial Curator
When we define a threat model, we mainly focus on how much trust we place
in the curator. A trusted curator is assumed to apply DP correctly, while
an adversarial curator may (and we assume it always does) attempt to
breach privacy.
These concepts are strongly related to the locality of our DP model, which
we previously defined as local and global DP. In the local DP protocol, we
place no trust in the central curator, so we can accept a model where the
curator is adversarial, since the privacy guarantees are put in place by
each user locally. For global DP, on the other hand, we expect the curator
to put these privacy guarantees in place, and as such we place all of our
trust in them.
Static vs Interactive Releases
These two concepts refer to how often we publish DP statistics. A static
release involves publishing a single release with no further
interactions, while interactive releases repeat the process, for example
by allowing multiple queries on the dataset. Static releases are simpler;
interactive releases can offer additional utility, but they require more
careful privacy accounting for each query due to composition.
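As an illustrative sketch of such accounting under basic sequential composition (the class and its interface are hypothetical, not a specific library's API):

```python
class PrivacyBudget:
    """Toy accountant for an interactive release under basic sequential composition."""

    def __init__(self, total_epsilon: float):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def can_answer(self, query_epsilon: float) -> bool:
        return self.spent + query_epsilon <= self.total_epsilon

    def charge(self, query_epsilon: float) -> None:
        if not self.can_answer(query_epsilon):
            raise RuntimeError("Privacy budget exhausted; the query must be refused.")
        self.spent += query_epsilon

budget = PrivacyBudget(total_epsilon=1.0)
budget.charge(0.25)   # first query
budget.charge(0.25)   # second query; half of the budget remains
```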
Event, Group and Entity Privacy
An important aspect of differential privacy is defining what it is we are
endeavoring to protect. Ultimately, we usually are trying to protect the
atomic data subjects of a dataset: people, businesses, entities. However,
depending on the dataset itself, rows of the data table may refer to
different things and individual subjects may have a causal effect on more
than one record.
Event-level privacy, as described in the literature, refers to the setting
where we are protecting the rows of a dataset. Each row might pertain to a
single data subject in its entirety, or to a single event such as a credit
card transaction.
Group privacy refers to settings where we have multiple data subjects who
are linked in some manner such that we care about hiding the contribution
of the group. An example of this might be a household in the setting of a
census. Finally, there is entity-level privacy. Similar to group-level
privacy, this is when multiple records can be linked to a single entity.
An example of this would be credit card transactions: one data subject may
have zero or many transactions associated with them, so in order to
protect the privacy of the entity we need to limit the combined effect of
all records associated with it.
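A minimal sketch of one common way to achieve this in the transaction example: cap the number of records kept per entity before aggregation, so each entity's influence on a downstream DP statistic is bounded (the cap and the sample records are illustrative):

```python
from collections import defaultdict

def bound_contributions(transactions, max_per_entity: int = 5):
    """Keep at most `max_per_entity` records per entity so that each entity's
    influence on any downstream DP aggregate is bounded."""
    kept, counts = [], defaultdict(int)
    for entity_id, amount in transactions:
        if counts[entity_id] < max_per_entity:
            kept.append((entity_id, amount))
            counts[entity_id] += 1
    return kept

# After bounding, a sum over `amount` has entity-level sensitivity at most
# max_per_entity * (the maximum clipped amount).
sample = [("alice", 12.0), ("alice", 3.5), ("bob", 7.0)]
print(bound_contributions(sample, max_per_entity=1))
```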
From a technical perspective, the mechanics of the tooling to deal with
groups and entities are the same so their terminology is often used
interchangeably.
Multiple Parties and Collusions
Involving multiple parties in DP releases requires additional accounting
of the privacy budget. Similarly to how we described an adversarial
curator, we now consider a group of analysts who could adversarially
collude against the DP release.
As a more practical example, collusion typically refers to an environment
where multiple analysts are each allowed a set privacy budget but
collaborate with each other to leverage composition, producing information
about the dataset that is only protected by a worse (larger) effective
epsilon, thus breaking the intended privacy budget allocated to each of
them. For example, if three analysts are each given \(\epsilon = 1\) and
they pool their query results, basic composition means the combined
information is only protected at \(\epsilon = 3\).
Periodic Releases
This concept, often also related to the "continual observation" area of
study, involves producing multiple differentially private releases for a
dataset that changes periodically. Achieving this can be challenging, as
each release must be carefully accounted for in the privacy budget;
organizations that allow DP analysis of continually updated datasets, such
as some of those present in our table, are mindful of setting budgets at
both the user level and the time level.