Differential Privacy: A Primer for a Non-technical Audience

Dated May 7, 2017; last modified on Mon, 05 Sep 2022

Differential Privacy: A Primer for a Non-technical Audience. Kobbi Nissim; Thomas Steinke; Alexandra Wood; Micah Altman; Aaron Bembenek; Mark Bun; Marco Gaboardi; David R. O'Brien; Salil Vadhan. www.ftc.gov. May 7, 2017.

What Does DP Guarantee?

Privacy here is a property of a particular computation, not of its output: the question is whether the computation itself preserves privacy.

DP only guarantees that no information specific to an individual is revealed by the computation. It doesn’t protect against information that could be learned even if the individual had opted out of the dataset, e.g. a study showing that smoking increases cancer risk allows us to infer the cancer risk of smokers who opted out of the study.

Suppose Alice reports that 202 out of 3,005 families at State University earn $1m+. Later, Bob reports that 201 out of 3,004 families earn $1m+. Eve can infer that the family missing from Bob’s report earns $1m+.

Alice and Bob may each claim that their report preserved privacy, but the same can’t be said for the two reports in combination. A DP release would have added some noise, e.g. Alice reports 204 and Bob reports 199, so Eve wouldn’t have gained information specific to any family.
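A minimal sketch of how such noise might be added, assuming the standard Laplace mechanism (the primer does not prescribe a specific mechanism); the \(\epsilon\) value and function name below are illustrative:

```python
import numpy as np

def noisy_count(true_count: int, epsilon: float) -> float:
    """Release a count with Laplace noise calibrated to sensitivity 1.

    Adding or removing one family changes the true count by at most 1,
    so Laplace noise with scale 1/epsilon gives epsilon-DP.
    """
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

epsilon = 0.1  # illustrative privacy-loss parameter

# Alice's and Bob's data differ by a single family.
alice_release = noisy_count(202, epsilon)  # e.g. ~204
bob_release = noisy_count(201, epsilon)    # e.g. ~199

# The difference between the noisy releases fluctuates by far more than 1,
# so it no longer reveals the missing family's income bracket.
print(alice_release, bob_release, alice_release - bob_release)
```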

The Privacy Loss Parameter

If we were to mandate that the analysis give exactly the same result without John’s data, and we had to guarantee the same for every individual, then the output could not depend on the data at all and the dataset would be useless. Instead, DP mandates that the outcome be approximately the same if an individual opts out.

\( \epsilon \) measures the effect of each individual’s information on the output of the analysis. Smaller \(\epsilon\) results in smaller deviation between the real-world analysis and the opt-out scenario.

\( \epsilon = 0 \) makes the analysis mimic the opt-out scenario of every individual perfectly, but such an analysis wouldn’t provide meaningful output. As a rule of thumb, \( 0.001 \le \epsilon \le 1 \).

Say the non-DP analysis gives \(0.013\). A DP analysis might give \(0.012\) the first time, \(0.0138\) the next time, etc.
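A rough illustration of this run-to-run variation, assuming (hypothetically) that the statistic is a proportion over \(n\) records released with the Laplace mechanism:

```python
import numpy as np

n = 3_000               # hypothetical number of records
true_proportion = 0.013
epsilon = 0.1           # illustrative privacy-loss parameter

# Changing one record moves a proportion by at most 1/n, so Laplace noise
# with scale 1/(n * epsilon) suffices for epsilon-DP.
for _ in range(3):
    release = true_proportion + np.random.laplace(scale=1.0 / (n * epsilon))
    print(round(release, 4))  # e.g. 0.012, 0.0138, ...
```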

Define event \( A \): the outcome of the analysis is between \(0.1\) and \(0.2\).

Consider an analysis on some input data for which \( \mathbb{P}\{A\} = p \). Running the same analysis without John’s data would give \( \mathbb{P}\{A\} = p' \). DP guarantees that \( p \le (1 + \epsilon) \cdot p' \) (more precisely, \( p \le e^{\epsilon} \cdot p' \); for small \(\epsilon\), \( e^{\epsilon} \approx 1 + \epsilon \)).

Say an insurer will deny John coverage if event \(A\) happens. If opting out of the study puts \(\mathbb{P}\{A\} \le 0.05\), then participating increases \(\mathbb{P}\{A\}\) to at most \(0.05 \cdot (1 + \epsilon)\).
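Plugging in a hypothetical \(\epsilon = 0.1\):

\[ \mathbb{P}\{A\} \le 0.05 \cdot (1 + 0.1) = 0.055, \]

i.e. participation raises the insurer’s chance of denying coverage from at most \(5\%\) to at most \(5.5\%\).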

DP uses the opt-out scenario as the baseline. Some privacy will be lost, but not too much. Subjects can then ask themselves, “Can I tolerate the probability of the dreaded consequence rising by at most \(p' \cdot \epsilon\)?”

Because DP computations add enough noise to hide the contribution of any subset of roughly \(1/\epsilon\) individuals, the resulting statistics are less accurate. DP increases the minimum sample size \(n_{min}\) required to produce accurate statistics. If \(n \le 1/\epsilon\), then DP almost certainly doesn’t produce a meaningful result.
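To see why, take the Laplace mechanism as a concrete (not mandated) example: the noise added to a count has standard deviation \( \sqrt{2}/\epsilon \), so the noise in the corresponding proportion is on the order of

\[ \frac{\sqrt{2}}{n \cdot \epsilon}, \]

which swamps the signal once \( n \) is not much larger than \( 1/\epsilon \).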

A simplified analysis shows that if multiple DP analyses are performed on the same data, the overall privacy loss is bounded by \(\epsilon \le \sum_i \epsilon_{i}\). DP is currently the only framework that guarantees how privacy risk accumulates over multiple analyses.
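For example, running one analysis with \(\epsilon_1 = 0.1\) and another with \(\epsilon_2 = 0.25\) on the same data incurs an overall privacy loss of at most \(\epsilon_1 + \epsilon_2 = 0.35\).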

The difference between the real-world and opt-out scenarios for a group of \(k\) individuals grows to at most \(k \cdot \epsilon\). Effectively, a meaningful privacy guarantee can only be provided to groups of up to roughly \(1/\epsilon\) individuals.
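For example, with \(\epsilon = 0.01\), a family of \(k = 5\) still enjoys a privacy loss of at most \(5 \cdot 0.01 = 0.05\), whereas for groups approaching \(1/\epsilon = 100\) individuals the guarantee becomes vacuous.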

Relationship to Information Privacy Laws

Personally Identifiable Information (PII) doesn’t have a precise technical meaning. A combination of attributes that are harmless by themselves may identify an individual, e.g. ZIP code + gender + birth date.

De-identification involves turning PII into non-PII. Techniques include suppressing cells that represent small groups, adding noise, swapping values, and generating synthetic data. But these techniques don’t reliably protect individuals against linkage attacks.

Privacy laws often forbid the disclosure of PII, which rules out analyses such as finding the relationship between first names and lifetime earnings. DP makes such analyses possible: the input may contain PII, but the output doesn’t reveal any PII.

Even under a linkage attack, the attacker cannot learn much more about an individual in a database than she could if that individual’s information were not in the database.

Privacy laws often include consent and opt-out provisions. DP enables more informed consent. And by definition, DP grants every individual something close to an opt-out guarantee.

With DP, privacy requirements can be well-defined and quantifiable, can account for composition, can generalize to unknown attacks, and can apply universally.