[ToDo] Mining of Massive Datasets

Dated Mar 1, 2014; last modified on Mon, 05 Sep 2022

Data Mining

What is Data Mining?
Statistical Limits of Data Mining
Things Useful to Know
Outline of the book

MapReduce and the New Software Stack

Distributed File Systems
MapReduce
Algorithms Using MapReduce
Extensions to MapReduce
The Communication Cost Model
Complexity Theory for MapReduce

Finding Similar Items

Applications of Near-Neighbor Search
Shingling of Documents
Similarity-Preserving Summaries of Sets
Locality-Sensitive Hashing for Documents
Distance Measures
The Theory of Locality-Sensitive Functions
LSH Families for Other Distance Measures
Applications of Locality-Sensitive Hashing
Methods for High Degrees of Similarity

Mining Data Streams

The Stream Data Model
Sampling Data in a Stream
Filtering Streams
Counting Distinct Elements in a Stream
Estimating Moments
Counting Ones in a Window
Decaying Windows

Link Analysis

PageRank
Efficient Computation of PageRank
Topic-Sensitive PageRank
Link Spam
Hubs and Authorities

Frequent Itemsets

The Market-Basket Model
Market Baskets and the A-Priori Algorithm
Handling Larger Datasets in Main Memory
Limited-Pass Algorithms
Counting Frequent Items in a Stream

Clustering

Introduction to Clustering Techniques
Hierarchical Clustering
K-means Algorithms
The CURE Algorithm
Clustering in Non-Euclidean Spaces
Clustering for Streams and Parallelism

Advertising on the Web

Issues in On-Line Advertising
On-Line Algorithms
The Matching Problem
The Adwords Problem
Adwords Implementation

Recommendation Systems

A Model for Recommendation Systems
Content-Based Recommendations
Collaborative Filtering
Dimensionality Reduction

Mining Social-Network Graphs

Social Networks as Graphs
Clustering of Social-Network Graphs
Direct Discovery of Communities
Partitioning of Graphs
Finding Overlapping Communities
Simrank
Counting Triangles
Neighborhood Properties of Graphs

Dimensionality Reduction

Eigenvalues and Eigenvectors of Symmetric Matrices
Principal-Component Analysis
Singular-Value Decomposition
CUR Decomposition

Large-Scale Machine Learning

The Machine-Learning Model
Perceptrons
Support-Vector Machines
Learning from Nearest Neighbors

Mining of Massive Datasets. Jure Leskovec; Anand Rajaraman; Jeffrey D. Ullman. Stanford University; Milliway Labs. infolab.stanford.edu.