Motivation for Differential Privacy in Datasets

Dated Mar 25, 2019; last modified on Sun, 14 Mar 2021

The goal is to make large datasets safe, i.e. no harm comes to you from having your data included.

Sanitize and Transfer Model

Remove all PII, e.g. name, SSN, mobile number, etc., before releasing the dataset.

However, the list of PII is not necessarily complete. Furthermore, combinations of seemingly non-PII data can be jointly identifying.

Using an auxiliary dataset is a common re-identification method, e.g. Narayanan and Shmatikov [2010] combined IMDB ratings and comments to de-anonymize a Netflix dataset.

The Query Model (Differential Privacy)

Keep the dataset, but provide a gateway through which analysts can query the data and get a response.

In general, it’s impossible to prevent anything from being learned about you, e.g. if a dataset shows that smoking causes cancer, then anyone who knows you smoke can infer that you have increased cancer risk, whether or not your data is in the dataset.

If we demanded that anything learnable from the dataset with your data also be learnable from the dataset without your data (and likewise for everyone else), the dataset could reveal nothing at all, and would be useless.

However, we can aim for the property that the result of any series of queries is very close to the result that would be obtained if your data were not present. The differential privacy parameter \(\epsilon\) lets us specify how close.
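Concretely, "very close" is formalized as a bound on how much any one person's data can shift the distribution of query answers: a randomized mechanism \(M\) is \(\epsilon\)-differentially private if, for every pair of datasets \(D\) and \(D'\) differing in one person's data and for every set of possible outputs \(S\),

\[ \Pr[M(D) \in S] \le e^{\epsilon} \, \Pr[M(D') \in S]. \]

Smaller \(\epsilon\) means the two output distributions are closer, i.e. your presence or absence changes what the analyst can observe by less.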

Two consequences of differential privacy:

  • Privacy can’t be ruined during post-processing. No amount of computation on the query results, and no auxiliary dataset, can weaken the \(\epsilon\) guarantee.

  • We can have a privacy budget, because composing an \(\epsilon_1\)-differentially-private query \(Q_1(\cdot)\) with an \(\epsilon_2\)-differentially-private query \(Q_2(\cdot)\) is \((\epsilon_1 + \epsilon_2)\)-differentially private, as sketched below.
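As an illustrative sketch (not from the original notes), the query gateway could answer counting queries with the Laplace mechanism and track the cumulative \(\epsilon\) spent under sequential composition; the names `laplace_count` and `PrivacyBudget` are hypothetical.

```python
import numpy as np

def laplace_count(data, predicate, epsilon):
    """Answer a counting query with the Laplace mechanism.

    A count changes by at most 1 when one person's record is added or
    removed (sensitivity 1), so adding noise drawn from Laplace(1/epsilon)
    makes the answer epsilon-differentially private.
    """
    true_count = sum(1 for row in data if predicate(row))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

class PrivacyBudget:
    """Track cumulative epsilon spent under sequential composition."""

    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def query(self, data, predicate, epsilon):
        # Sequential composition: an eps1-DP query followed by an
        # eps2-DP query is (eps1 + eps2)-DP, so refuse to answer once
        # the total budget would be exceeded.
        if self.spent + epsilon > self.total_epsilon:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon
        return laplace_count(data, predicate, epsilon)

# Example: two queries, each 0.5-DP, together 1.0-DP.
records = [{"age": 34, "smoker": True}, {"age": 51, "smoker": False},
           {"age": 29, "smoker": True}]
budget = PrivacyBudget(total_epsilon=1.0)
print(budget.query(records, lambda r: r["smoker"], epsilon=0.5))
print(budget.query(records, lambda r: r["age"] > 30, epsilon=0.5))
```

Once the budget is exhausted the gateway stops answering, which is what makes the \( \epsilon_1 + \epsilon_2 \) composition bound usable in practice.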