[ToDo] CS 168: The Modern Algorithmic Toolbox

Dated Sep 10, 2022; last modified on Sun, 11 Sep 2022

The Modern Algorithmic Toolbox (CS168). Gregory Valiant. web.stanford.edu. 2022. Accessed Sep 10, 2022.

Modern Hashing

  • Consistent hashing.
  • Property-preserving lossy compression. From majority elements to approximate heavy hitters. From bloom filters to the count-min sketch.
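
As a concrete taste of the count-min sketch: the snippet below is a minimal illustration, with the width, depth, and MD5-based salted hashing chosen for demonstration (not taken from the course).

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counts in space independent of the number of
    distinct items. Estimates never undercount; collisions only inflate."""

    def __init__(self, width=1024, depth=4):
        # Illustrative sizes; in practice width/depth are derived from a
        # target error epsilon and failure probability delta.
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # One salted hash per row stands in for independent hash functions.
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Take the least-inflated row's count.
        return min(self.table[row][col] for row, col in self._buckets(item))
```

Taking the minimum over rows is what makes the estimate a one-sided overestimate, which is exactly the guarantee the approximate-heavy-hitters application needs.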

Data with Distances

  • Similarity Search. (Dis)similarity metrics: Jaccard, Euclidean, Lp. An efficient algorithm for finding similar elements in small/medium (i.e. \(< 20\)) dimensions using k-d trees.
  • Curse of Dimensionality, kissing number. Distance-preserving compression. Estimating Jaccard Similarity using MinHash. Euclidean-distance-preserving dimensionality reduction (aka the Johnson-Lindenstrauss Transform).
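
Estimating Jaccard similarity with MinHash fits in a few lines; using Python's built-in `hash` with random salts as a stand-in for random permutations is an assumption made here for illustration.

```python
import random

def minhash_signature(items, num_hashes=128, seed=0):
    """signature[i] = min over the set of the i-th salted hash value.
    Two sets agree at position i with probability equal to their
    Jaccard similarity."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(num_hashes)]
    return [min(hash((salt, x)) for x in items) for salt in salts]

def estimate_jaccard(sig_a, sig_b):
    # Fraction of agreeing signature entries estimates |A ∩ B| / |A ∪ B|.
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

With sets \(\{0, \dots, 99\}\) and \(\{50, \dots, 149\}\) (true Jaccard similarity \(1/3\)), a 128-hash signature typically lands close to the truth, and longer signatures shrink the variance.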

Generalization and Regularization

  • Generalization (or, how much data is enough?). Learning an unknown function from samples from an unknown distribution. Training error vs. test error. PAC guarantees for linear classifiers. Empirical risk minimization.
  • Regularization. The polynomial embedding and random projection, L2 regularization, and L1 regularization as a computationally tractable surrogate for L0 regularization.
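
As a toy illustration of L2 regularization (a sketch, not from the course notes): with a single feature, ridge regression has a closed form, and the penalty visibly shrinks the fitted weight.

```python
def ridge_1d(xs, ys, lam):
    """Minimize sum_i (y_i - w * x_i)^2 + lam * w^2 over w.
    Setting the derivative to zero gives w = <x, y> / (<x, x> + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)
```

With `lam = 0` this is ordinary least squares; as `lam` grows, the weight is pulled toward 0, trading a little bias for lower variance on fresh data.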

Linear-Algebraic Techniques: Understanding Principal Components Analysis (PCA)

  • Understanding PCA. Minimizing squared distances equals maximizing variance. Use cases for data visualization and data compression. Failure modes for PCA.
  • How PCA works. Maximizing variance as finding the “direction of maximum stretch” of the covariance matrix. The simple geometry of “diagonals in disguise.” The power iteration algorithm.
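
The power iteration algorithm can be sketched directly in pure Python: repeatedly apply the matrix and renormalize, and the iterate rotates toward the direction of maximum stretch (the 2×2 test matrix below is an illustrative assumption).

```python
def power_iteration(matrix, num_iters=100):
    """Return (top eigenvalue, top eigenvector) of a square matrix,
    assuming the top eigenvalue strictly dominates in magnitude."""
    n = len(matrix)
    v = [1.0] * n  # any start with a component along the top eigenvector
    for _ in range(num_iters):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Rayleigh quotient v^T M v (v has unit norm) estimates the eigenvalue.
    eigenvalue = sum(
        v[i] * sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)
    )
    return eigenvalue, v
```

Convergence is geometric in the ratio of the top two eigenvalues, which is why nearly tied directions make power iteration slow.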

Linear-Algebraic Techniques: Understanding the Singular Value Decomposition (SVD)

  • Low-rank matrix approximations. SVD; applications to matrix compression, de-noising, and matrix completion (i.e. recovering missing entries).
  • Tensor methods. Differences between matrices and tensors, the uniqueness of low-rank tensor factorizations, and Jennrich’s algorithm.

Spectral Graph Theory

  • Graphs as matrices and the Laplacian of a graph. Interpretations of the largest and smallest eigenvectors/eigenvalues of the Laplacian. Spectral embeddings, and an overview of applications (e.g. graph coloring, spectral clustering).
  • Interpretations of the second eigenvalue (via conductance and isoperimetric number), and the connections with the speed at which random walks/diffusions converge.
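
To make “graphs as matrices” concrete, here is the unnormalized Laplacian \(L = D - A\) together with its quadratic form \(x^\top L x = \sum_{(u,v) \in E} (x_u - x_v)^2\), the quantity that spectral embeddings keep small; the helper names are illustrative.

```python
def laplacian(n, edges):
    """Unnormalized Laplacian L = D - A of an undirected graph on
    vertices 0..n-1 (D = degree matrix, A = adjacency matrix)."""
    L = [[0] * n for _ in range(n)]
    for u, v in edges:
        L[u][u] += 1   # degree of u
        L[v][v] += 1   # degree of v
        L[u][v] -= 1   # minus adjacency, both directions
        L[v][u] -= 1
    return L

def quadratic_form(L, x):
    """x^T L x, which equals the sum of (x_u - x_v)^2 over edges."""
    n = len(L)
    return sum(L[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
```

Every row of \(L\) sums to zero, so the all-ones vector is an eigenvector with eigenvalue 0, matching the smallest-eigenvalue interpretation above.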

Sampling and Estimation

  • Reservoir sampling (how to select a random sample from a data stream). Basic probability tools: Markov’s inequality and Chebyshev’s inequality. Importance Sampling (how to make inferences about one distribution based on samples from a different distribution). Description of the Good-Turing estimate of missing/unseen mass.
  • Markov chains and stationary distributions. Markov Chain Monte Carlo (MCMC) as an approach to solving hard problems by sampling from carefully crafted distributions.
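
Reservoir sampling is short enough to sketch in full; this is the classic one-pass variant (Algorithm R), written here for illustration.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Uniform sample of k items from a stream of unknown length, in one
    pass and O(k) memory. After seeing i+1 items, each item is in the
    reservoir with probability k / (i + 1)."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)  # fill the reservoir first
        else:
            j = rng.randrange(i + 1)
            if j < k:  # item i is kept with probability k / (i + 1)
                reservoir[j] = item
    return reservoir
```

The key point is that the stream's length never appears in the code: the sample stays uniform no matter when the stream ends.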

The Fourier Perspective (and other bases)

  • Fourier methods, part 1.
  • Fourier methods, part 2 (emphasis on convolutions).

Sparse Vector/Matrix Recovery (Compressive Sensing) and Mathematical Programming

  • Compressive sensing.
  • Linear and convex programming. Matrix completion.

Privacy Preserving Computation

  • Differential Privacy in data analysis and machine learning.
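
As a first concrete instance (a sketch, not the course's code): the Laplace mechanism for a counting query, which has sensitivity 1 because adding or removing one person changes the count by at most 1. The function name and parameters are illustrative.

```python
import random

def private_count(true_count, epsilon, rng=None):
    """Release true_count + Laplace(1/epsilon) noise; for a
    sensitivity-1 query this satisfies epsilon-differential privacy."""
    rng = rng or random.Random()
    scale = 1.0 / epsilon
    # The difference of two iid Exponential(rate = 1/scale) draws is
    # Laplace-distributed with the desired scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise
```

Smaller epsilon means a stronger privacy guarantee and a noisier answer; the noise scale \(1/\epsilon\) makes that trade-off explicit.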