Tracking and Data Collection

Dated Jan 22, 2019; last modified on Sun, 12 Feb 2023

Tracking on the Internet

Screenshot from a [2012 study of third-party online tracking](#mayerMitchellThirdPartyWebTracking). The red boxes show content (at least the visible ones) served by third parties.

When the browser requests a third-party resource embedded on a web page, it sends HTTP headers to the third party. The snippet below shows a subset of them.

GET         http://youtube.com/watch?v=gHStnhGx1P
Cookie:     id=35c192bcfe0000b1...
Referer:    http://www.nytimes.com/

The combination of the cookie and the referrer makes third-party tracking possible: the cookie identifies the browser, and the Referer header reveals the page it is visiting. Incidentally, the HTTP specification misspells ‘referrer’ as ‘Referer’.
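As a rough illustration of why these two headers suffice, here is a minimal sketch of what a third party could do with them. The in-memory `profiles` store, the `record_request` function, and the sample header values are hypothetical and not taken from any real tracker.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical tracker-side store: cookie value -> first-party sites seen.
profiles = defaultdict(list)

def record_request(headers):
    """Log the embedding site (from Referer) under the visitor's cookie."""
    cookie = headers.get("Cookie", "")
    referer = headers.get("Referer", "")
    if not cookie or not referer:
        return  # without both headers the visit cannot be linked
    site = urlsplit(referer).hostname
    if site and site not in profiles[cookie]:
        profiles[cookie].append(site)

# Three requests for the same embedded resource, from two browsers.
record_request({"Cookie": "id=abc123", "Referer": "http://www.nytimes.com/"})
record_request({"Cookie": "id=abc123", "Referer": "http://www.example.org/article"})
record_request({"Cookie": "id=zzz999", "Referer": "http://www.nytimes.com/"})
```

After these calls, `profiles["id=abc123"]` lists both sites that browser visited, even though the user never navigated to the third party directly.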

The Industry

Display Advertising Technology Landscape. Source: [LUMA Partners LLC](#displayLUMAscape)

Third-party online tracking: sites other than the one you’re visiting (typically invisible) compile profiles of your browsing history.

Behavioral targeting: profiles based on a user’s past activity help ad exchanges serve targeted ads through real-time ad auctions.

There’s still debate on whether the cost of behavioral targeting is worth it. Sometimes the markup over untargeted ads is 500% for a 4% improvement on ROI. Research is limited because behavioral ad giants, e.g. Google and Facebook, are reluctant to open their data for analysis.

Web Tracking Methods

Placing data in the browser

Options: HTTP cookies, HTTP auth, HTTP etags, content cache, IE userdata, HTML5 protocol & content handlers, HTML5 storage, Flash cookies, Silverlight storage, TLS session ID & resume, Browsing history, window.name, HTTP STS, DNS cache, …

Numerous web APIs allow placing data in the browser (directly or indirectly) and all of these can be used for uniquely identifying the user and hence tracking their browsing.
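To make one of the listed options concrete, the sketch below shows how an HTTP ETag could double as an identifier. It assumes a hypothetical server function `handle_request` that receives the request headers as a dict; the browser's cache stores the ETag and echoes it back in `If-None-Match` on later requests, which is standard cache-validation behavior.

```python
import uuid

def handle_request(request_headers):
    """Return (status, response_headers) for a cacheable tracking resource.

    Hypothetical server logic: the ETag value doubles as a per-browser ID.
    """
    etag = request_headers.get("If-None-Match")
    if etag:
        # Returning visitor: the browser echoed back the ID planted earlier.
        return 304, {"ETag": etag}
    # First visit: mint a fresh ID and hand it out as the ETag.
    new_id = '"' + uuid.uuid4().hex + '"'
    return 200, {"ETag": new_id}

# First request carries no validator; the second replays the cached ETag.
status1, headers1 = handle_request({})
status2, headers2 = handle_request({"If-None-Match": headers1["ETag"]})
```

No cookie is involved, so clearing cookies alone would not reset this identifier; clearing the content cache would.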

Fingerprinting

Fingerprinting involves observing the browser’s attributes and behavior, e.g. user-agent, browser plugins, clock skew, list of installed fonts, whether cookies are enabled, browser add-ons, screen resolution, …

When these attributes are combined, different devices/browsers will have different fingerprints. Fingerprinting leaves no trace that the user is being tracked. Unlike cookies, users can’t see or control fingerprinting.
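A minimal sketch of the combining step, assuming the attributes have already been collected into a dict. The attribute names, sample values, and the choice of a truncated SHA-256 hash are illustrative assumptions; real fingerprinting scripts use many more signals.

```python
import hashlib
import json

def fingerprint(attributes):
    """Hash a dict of observed browser attributes into a short identifier."""
    # Sort keys so the same attributes always serialize identically.
    canonical = json.dumps(attributes, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

# Hypothetical observations from two devices.
laptop = {
    "user_agent": "ExampleBrowser/1.0",
    "screen": "1920x1080",
    "fonts": ["Arial", "Helvetica"],
    "cookies_enabled": True,
}
phone = dict(laptop, screen="390x844")  # differs in one attribute only
```

Here `fingerprint(laptop)` and `fingerprint(phone)` differ even though only the screen resolution changed, and nothing is stored on either device.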

Example (canvas fingerprinting): the page draws invisible text, then reads the rendered result back as a sequence of bits. Because of tiny rendering differences between devices, the bit string acts as a device identifier.

Scheme flooding: testing custom URL schemes like skype:// to discover the apps installed on the device.
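The probing itself happens in the browser, but the way the results become an identifier can be sketched separately. The example below assumes the detection results are packed into one bit per probed scheme; the probe list and the function name `apps_to_identifier` are hypothetical.

```python
# Hypothetical probe list; a real attack would test many more schemes.
PROBED_SCHEMES = ["skype", "spotify", "zoommtg", "slack"]

def apps_to_identifier(detected):
    """Pack per-scheme detection results into a bit string, one bit per app."""
    return "".join("1" if s in detected else "0" for s in PROBED_SCHEMES)

device_a = apps_to_identifier({"skype", "slack"})
device_b = apps_to_identifier({"spotify"})
```

Since the installed apps are the same whichever browser does the probing, an identifier of this kind could link sessions across browsers on one device.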

Cross-device tracking

Two devices can be linked to the same user if:

  • User logs in with the same credentials from both devices
  • User visits the same/similar set of websites on both devices
  • User travels with two portable devices

Ultrasound beacons

  1. TV ad emits ultrasound (inaudible) binary signal that encodes a unique ID
  2. Viewer’s smartphone app listens in the background
  3. When the ultrasound ID is detected, the app reports that the ad has been watched.

Beacon-based tracking is not as widespread as fingerprinting and cross-device tracking. It was implemented by SilverPush and incorporated in a small number of apps.

In recent versions of iOS/Android, apps can’t record audio in the background without user awareness/consent.

Merging online and offline databases

Scenario: a retailer wants to target shoppers with ads when they browse online.

  1. Consumer shops at retail store, provides email address
  2. Store uploads list of consumer IDs to online advertiser
  3. Consumer logs in to (say) news website using email address
  4. Third-party tracker links the user to retail DB
  5. Ads are served via a cookie that follows the user around
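The steps above can be sketched as a join on a shared key. This assumes the consumer ID is a hash of the normalized email address, which the scenario does not specify; the names `email_key`, `retail_db`, `cookie_to_key`, and the sample data are all hypothetical.

```python
import hashlib

def email_key(email):
    """Normalize and hash an email so both sides can match on the same key."""
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()

# Step 2: hypothetical list uploaded by the store (hashed email -> segment).
retail_db = {email_key("Alice@Example.com"): "bought baby products"}

# Steps 3-4: the tracker ties its cookie to the hash of the login email.
cookie_to_key = {"id=35c1": email_key("alice@example.com ")}

def segment_for_cookie(cookie_id):
    """Step 5: look up the retail segment for a cookie, if any."""
    return retail_db.get(cookie_to_key.get(cookie_id))
```

Normalizing before hashing matters: the store and the news site may record the same address with different capitalization or stray whitespace, and the hashes only match if both sides clean it the same way.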

Machine Learning and Inference

Target infamously predicted that a girl was pregnant before her father was aware.

The piece itself is even more interesting:

During major life events, e.g. graduating from college, moving to a new town, or the arrival of a baby, consumers change their shopping habits. Bring in a consumer to buy diapers, and they’ll probably buy other stuff too. Once birth records are updated, every retailer knows about the baby, so there’s no edge.

So Target built a model to assign the likelihood of pregnancy, and it was so good that there was public outcry. Target now mixes in other coupons to seem less creepy.

Using Facebook likes, one can meaningfully predict sexual orientation, ethnicity, religious and political views, personality traits, intelligence, happiness, use of addictive substances, parental separation, age and gender.

Deep neural networks do better than humans at detecting sexual orientation from facial images.

References

  1. Third-Party Web Tracking: Policy and Technology. Mayer, Jonathan; Mitchell, John. cyberlaw.stanford.edu. 2012.
  2. Display LUMAscape. LUMA Partners LLC. lumapartners.com.
  3. Pixel perfect: Fingerprinting canvas in HTML5. Mowery, Keaton; Shacham, Hovav. hovav.net. 2012.
  4. Cross-Device Tracking: Measurement and Disclosures. Brookman, Justin; Rouge, Phoebe; Alva, Aaron; Yeung, Christina. petsymposium.org. 2017.
  5. How Companies Learn Your Secrets. Charles Duhigg. www.nytimes.com. www.forbes.com. Feb 16, 2012.
  6. Private traits and attributes are predictable from digital records of human behavior. Michal Kosinski; David Stillwell; Thore Graepel. www.pnas.org. Feb 13, 2013.
  7. Deep neural networks are more accurate than humans at detecting sexual orientation from facial images. Yilun Wang; Michal Kosinski. www.semanticscholar.org. 2017.
  8. Tor users, beware: 'Scheme flooding' technique may be used to deanonymize you. Thomas Claburn. www.theregister.com. it.slashdot.org. May 14, 2021.
  9. The case against behavioral advertising is stacking up. Natasha Lomas. techcrunch.com. Jan 20, 2019. Accessed Jun 20, 2021.