[ToDo] Mining of Massive Datasets

Dated Mar 1, 2014; last modified on Mon, 05 Sep 2022

Data Mining

  • What is Data Mining?
  • Statistical Limits of Data Mining
  • Things Useful to Know
  • Outline of the Book

MapReduce and the New Software Stack

  • Distributed File Systems
  • MapReduce
  • Algorithms Using MapReduce
  • Extensions to MapReduce
  • The Communication Cost Model
  • Complexity Theory for MapReduce

Finding Similar Items

  • Applications of Near-Neighbor Search
  • Shingling of Documents
  • Similarity-Preserving Summaries of Sets
  • Locality-Sensitive Hashing for Documents
  • Distance Measures
  • The Theory of Locality-Sensitive Functions
  • LSH Families for Other Distance Measures
  • Applications of Locality-Sensitive Hashing
  • Methods for High Degrees of Similarity

Mining Data Streams

  • The Stream Data Model
  • Sampling Data in a Stream
  • Filtering Streams
  • Counting Distinct Elements in a Stream
  • Estimating Moments
  • Counting Ones in a Window
  • Decaying Windows

Link Analysis

  • PageRank
  • Efficient Computation of PageRank
  • Topic-Sensitive PageRank
  • Link Spam
  • Hubs and Authorities

Frequent Itemsets

  • The Market-Basket Model
  • Market Baskets and the A-Priori Algorithm
  • Handling Larger Datasets in Main Memory
  • Limited-Pass Algorithms
  • Counting Frequent Items in a Stream

Clustering

  • Introduction to Clustering Techniques
  • Hierarchical Clustering
  • K-means Algorithms
  • The CURE Algorithm
  • Clustering in Non-Euclidean Spaces
  • Clustering for Streams and Parallelism

Advertising on the Web

  • Issues in On-Line Advertising
  • On-Line Algorithms
  • The Matching Problem
  • The Adwords Problem
  • Adwords Implementation

Recommendation Systems

  • A Model for Recommendation Systems
  • Content-Based Recommendations
  • Collaborative Filtering
  • Dimensionality Reduction

Mining Social-Network Graphs

  • Social Networks as Graphs
  • Clustering of Social-Network Graphs
  • Direct Discovery of Communities
  • Partitioning of Graphs
  • Finding Overlapping Communities
  • Simrank
  • Counting Triangles
  • Neighborhood Properties of Graphs

Dimensionality Reduction

  • Eigenvalues and Eigenvectors of Symmetric Matrices
  • Principal-Component Analysis
  • Singular-Value Decomposition
  • CUR Decomposition

Large-Scale Machine Learning

  • The Machine-Learning Model
  • Perceptrons
  • Support-Vector Machines
  • Learning from Nearest Neighbors
  1. Mining of Massive Datasets. Jure Leskovec; Anand Rajaraman; Jeffrey D. Ullman. Stanford University; Milliway Labs. infolab.stanford.edu.