Myths and Fallacies of 'Personally Identifiable Information'

Dated Jun 1, 2010; last modified on Mon, 05 Sep 2022

Myths and Fallacies of 'Personally Identifiable Information'. Narayanan, Arvind; Shmatikov, Vitaly. dl.acm.org . Jun 1, 2010.

What is PII?

From Breach Notification Laws:

For example, California Senate Bill 1386: SSNs, driver’s license numbers, financial accounts.

The list can never be exhaustive, e.g. email addresses and telephone numbers are not mentioned in Bill 1386.

Focuses on data that are commonly used for authenticating an individual, and ignores data that reveal sensitive information about an individual.

From Privacy Laws:

Data Protection Directive: Any information relating to […] natural person […] who can be identified, directly or indirectly, in particular by reference […] to one or more factors specific to his physical, physiological, mental, economic, cultural, or social identity.

HIPAA Privacy Rule: Information that identifies the individual, or with respect to which there is a “reasonable” basis to believe the information can be used to identify the individual. (What counts as a “reasonable” basis is left vague.)

The “safe harbor” provision of the Privacy Rule enumerates 18 specific identifiers that must be removed prior to data release, but the list is not intended to be comprehensive.

PII and Privacy Protection Technologies

Many companies assume that PII is a fixed set of attributes, such as names and contact information, and that once the PII is removed, the data become safe to release.

\(k\)-anonymity

De-identification involves modifying quasi-identifiers to satisfy syntactic properties, e.g. every combination of quasi-identifier values occurring in the dataset must occur at least \(k\) times.
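A minimal sketch of checking this property, assuming the dataset is a list of dicts and the quasi-identifier columns are given (the function and field names are illustrative, not from the paper):

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True iff every combination of quasi-identifier values that
    occurs in the dataset occurs at least k times."""
    combos = Counter(
        tuple(record[attr] for attr in quasi_identifiers)
        for record in records
    )
    return all(count >= k for count in combos.values())

# Toy dataset: a generalized ZIP code and birth year are the
# quasi-identifiers; 'diagnosis' is the sensitive attribute.
records = [
    {"zip": "537**", "birth_year": 1980, "diagnosis": "flu"},
    {"zip": "537**", "birth_year": 1980, "diagnosis": "cold"},
    {"zip": "537**", "birth_year": 1975, "diagnosis": "flu"},
]
print(is_k_anonymous(records, ["zip", "birth_year"], k=2))  # False: (537**, 1975) occurs once
```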

But it relies on the fallacious distinction between “identifying” and “non-identifying” attributes.

Furthermore, the amount and variety of publicly available information about individuals grow exponentially.

Because joining two datasets on common attributes can lead to re-identification, anonymizing a predefined subset of attributes is not sufficient to prevent it.
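To illustrate, here is a sketch of such a join (a linkage attack) with hypothetical field names; a record is re-identified when its shared attribute values match exactly one person in the auxiliary dataset:

```python
def link(anonymized, auxiliary, join_attrs):
    """Join an 'anonymized' dataset with a public auxiliary dataset
    on shared attributes; a unique match re-identifies a record."""
    index = {}
    for person in auxiliary:
        key = tuple(person[a] for a in join_attrs)
        index.setdefault(key, []).append(person["name"])

    matches = {}
    for i, record in enumerate(anonymized):
        key = tuple(record[a] for a in join_attrs)
        candidates = index.get(key, [])
        if len(candidates) == 1:  # a unique match => re-identified
            matches[i] = candidates[0]
    return matches

# Hypothetical data in the spirit of the classic ZIP/birth-date/sex linkage attack.
anonymized = [{"zip": "02138", "dob": "1945-07-31", "sex": "F", "diagnosis": "..."}]
auxiliary = [{"name": "Jane Doe", "zip": "02138", "dob": "1945-07-31", "sex": "F"}]
print(link(anonymized, auxiliary, ["zip", "dob", "sex"]))  # {0: 'Jane Doe'}
```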

Re-identification Without PII

Any information that distinguishes one person from another can be used for re-identifying anonymous data.

  • AOL fiasco: the content of search queries was used to re-identify a user.
  • Narayanan & Shmatikov: large-scale re-identification of Netflix users using IMDb data.
  • Stylometry and location information: writing style alone resolved the authorship of the 12 disputed Federalist Papers, and location traces can be similarly identifying.

While some attributes may be uniquely identifying on their own, any attribute can be identifying in combination with others, e.g. no single book I’ve read is enough to identify me, but a large collection of my reading history is, as the sketch below illustrates.
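A sketch of why combinations identify, assuming each person’s reading history is a set of book titles (the names and data are hypothetical): every additional observed attribute can only shrink the set of candidates.

```python
def anonymity_set(population, observed_books):
    """Return the people whose reading history contains every
    observed book; each added observation shrinks the candidate set."""
    candidates = population
    for book in observed_books:
        candidates = [p for p in candidates if book in p["books"]]
        print(f"after {book!r}: {len(candidates)} candidate(s) remain")
    return candidates

population = [
    {"name": "A", "books": {"Dune", "Emma", "Ulysses"}},
    {"name": "B", "books": {"Dune", "Emma", "Walden"}},
    {"name": "C", "books": {"Dune", "Ilium", "Walden"}},
]
# One book leaves all 3 candidates; three books pin down one person.
anonymity_set(population, ["Dune", "Emma", "Ulysses"])
```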

Re-identification based on behavioral attributes must tolerate fuzziness and is computationally expensive, but the expense is amortized over the many individuals in the dataset, and designing the algorithm is a one-time effort.

Another example of re-identification through auxiliary datasets: private planes, e.g. Elon Musk’s, can be added to the FAA’s LADD block list, which withholds their identities from FAA data feeds. However, most aircraft carry ADS-B transponders that broadcast a plane’s in-air position in real time, and sites like ADS-B Exchange display those broadcasts. Combining ADS-B data with anonymized FAA flight plans can re-identify most private aircraft.

Lessons for Privacy Practitioners

Even after the enumerated identifiers are removed, any remaining attributes can be used for re-identification, as long as they differ from individual to individual. Therefore, PII has no meaning even in the context of the HIPAA Privacy Rule.

Differential privacy is a major step in the right direction.

  • It formally defines what it means for a computation to be privacy-preserving (see the sketch after this list).
  • Crucially, it makes no assumptions about the external information available to the adversary.
  • However, it does not offer a universal methodology; we still need case-by-case reasoning.
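A minimal sketch of the Laplace mechanism for a counting query, the canonical differentially private computation (the helper names, data, and \(\epsilon\) value are illustrative):

```python
import math
import random

def dp_count(records, predicate, epsilon):
    """Differentially private count: the true count plus Laplace noise.
    A counting query has sensitivity 1 (adding or removing one person
    changes it by at most 1), so Laplace(1/epsilon) noise suffices."""
    true_count = sum(1 for r in records if predicate(r))
    u = random.random() - 0.5  # inverse-CDF sample from Laplace(0, 1/epsilon)
    noise = -(1 / epsilon) * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return true_count + noise

records = [{"diagnosis": d} for d in ["flu", "flu", "cold"]]
print(dp_count(records, lambda r: r["diagnosis"] == "flu", epsilon=0.1))  # ~2, plus noise
```

Note that the noise depends only on the query’s sensitivity and \(\epsilon\), not on any assumption about what auxiliary information the adversary holds, which is exactly what the syntactic definitions above lack.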

An interactive, query-based approach is generally superior to the “release-and-forget” approach. The extra effort of designing an API for queries, budgeting for server resources, performing regular audits, and so forth is worth it; a sketch follows.
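One hypothetical shape for such a server, reusing dp_count and records from the sketch above: it answers noisy counting queries and refuses once a total privacy budget is spent (the class and its API are my illustration, not a prescribed design).

```python
class PrivateQueryServer:
    """Interactive alternative to release-and-forget: answer noisy
    counting queries until the total privacy budget is exhausted."""

    def __init__(self, records, total_epsilon):
        self.records = records
        self.remaining = total_epsilon

    def count(self, predicate, epsilon):
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted; no more queries")
        self.remaining -= epsilon  # sequential composition: spent budgets add up
        return dp_count(self.records, predicate, epsilon)

server = PrivateQueryServer(records, total_epsilon=1.0)
server.count(lambda r: r["diagnosis"] == "flu", epsilon=0.5)   # answered
server.count(lambda r: r["diagnosis"] == "cold", epsilon=0.5)  # answered; budget now 0
# server.count(lambda r: r["diagnosis"] == "flu", epsilon=0.1)  # would raise
```

Because the data never leave the server, mistakes are recoverable: the curator can tighten the budget or revoke access, which is impossible once a “de-identified” dataset has been published.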

Any system for privacy-preserving computation on sensitive data must be accompanied by strong access control mechanisms and non-technological protection methods such as informed consent and contracts specifying acceptable uses of data.

References

  1. Elon Musk offered $5k to remove a bot tracking his flights. Veronica Irwin. www.protocol.com . twitter.com . Jan 26, 2022. Accessed Jan 27, 2022.