The Modern Algorithmic Toolbox (CS168).
Gregory Valiant.
web.stanford.edu.
2022.
Accessed Sep 10, 2022.
Modern Hashing
- Consistent hashing.
- Property-preserving lossy compression. From majority elements to approximate heavy hitters. From bloom filters to the count-min sketch.
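To make the consistent-hashing bullet concrete, here is a minimal Python sketch (the class and function names, the use of MD5, and the virtual-node count of 100 are my own illustrative choices, not the course's code):

```python
# Minimal consistent hashing: servers and keys are hashed onto a ring
# [0, 2^32); each key is served by the first server point clockwise from it.
import bisect
import hashlib

def _ring_hash(key: str) -> int:
    # Map a string to a point on the ring (MD5 used only for determinism).
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

class ConsistentHashRing:
    def __init__(self, servers, replicas=100):
        # Each server gets `replicas` virtual nodes to smooth out the load.
        self._ring = sorted(
            (_ring_hash(f"{s}#{i}"), s) for s in servers for i in range(replicas)
        )
        self._points = [p for p, _ in self._ring]

    def lookup(self, key: str) -> str:
        # First virtual node clockwise from the key's hash (wrapping around).
        i = bisect.bisect(self._points, _ring_hash(key)) % len(self._ring)
        return self._ring[i][1]
```

The defining property: removing a server only remaps the keys that were assigned to it; all other keys keep their server.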
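And a small count-min sketch for the heavy-hitters bullet (a hedged sketch; the width/depth defaults are arbitrary illustrative choices):

```python
# Count-min sketch: a lossy frequency table. Each of `depth` rows hashes the
# item into `width` counters; collisions only ever inflate counts, so the
# minimum over rows is an overestimate of the true frequency.
import hashlib

class CountMinSketch:
    def __init__(self, width=272, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cols(self, item: str):
        # One (deterministic, illustrative) hash per row.
        for row in range(self.depth):
            h = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.width

    def add(self, item: str, count: int = 1):
        for row, col in enumerate(self._cols(item)):
            self.table[row][col] += count

    def estimate(self, item: str) -> int:
        # Each row overestimates; take the minimum across rows.
        return min(self.table[row][col] for row, col in enumerate(self._cols(item)))
```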
Data with Distances
- Similarity Search. (Dis)similarity metrics: Jaccard, Euclidean, Lp. Efficient algorithms for finding similar elements in small/medium dimensions (i.e. \(< 20\)) using k-d trees.
- Curse of Dimensionality and the kissing number. Distance-preserving compression. Estimating Jaccard similarity using MinHash. Euclidean-distance-preserving dimensionality reduction (a.k.a. the Johnson-Lindenstrauss Transform).
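The MinHash idea above can be sketched in a few lines: for a random hash function, the probability that two sets share the same minimum hash equals their Jaccard similarity (function names and the 200-hash default are my own; MD5 stands in for a random hash):

```python
# MinHash: estimate Jaccard similarity from short signatures.
import hashlib

def _h(seed: int, x: str) -> int:
    # Deterministic stand-in for a random hash function, one per seed.
    return int(hashlib.md5(f"{seed}:{x}".encode()).hexdigest(), 16)

def minhash_signature(items, num_hashes=200):
    # Keep, for each hash function, the minimum hash value over the set.
    return [min(_h(seed, x) for x in items) for seed in range(num_hashes)]

def estimate_jaccard(sig_a, sig_b):
    # Fraction of coordinates on which the signatures agree.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```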
Generalization and Regularization
- Generalization (or, how much data is enough?). Learning an unknown function from samples drawn from an unknown distribution. Training error vs. test error. PAC guarantees for linear classifiers. Empirical risk minimization.
- Regularization. The polynomial embedding and random projection, L2 regularization, and L1 regularization as a computationally tractable surrogate for L0 regularization.
Linear-Algebraic Techniques: Understanding Principal Components Analysis (PCA)
- Understanding PCA. Minimizing squared distances equals minimizing variance. Use case for data visualization and data compression. Failure modes for PCA.
- How PCA works. Maximizing variance as finding the “direction of maximum stretch” of the covariance matrix. The simple geometry of “diagonals in disguise.” The power iteration algorithm.
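The power iteration algorithm mentioned above is short enough to show in full: repeatedly multiplying by the covariance matrix stretches any starting vector toward the top principal component (a hedged numpy sketch; iteration count chosen for illustration):

```python
# Power iteration on the covariance matrix: repeated multiplication pulls the
# vector toward the "direction of maximum stretch", i.e. the top PC.
import numpy as np

def top_principal_component(X, iters=200):
    Xc = X - X.mean(axis=0)          # center the data
    cov = Xc.T @ Xc / len(Xc)        # empirical covariance matrix
    v = np.ones(X.shape[1])          # any start not orthogonal to the answer
    for _ in range(iters):
        v = cov @ v
        v /= np.linalg.norm(v)       # renormalize each step
    return v
```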
Linear-Algebraic Techniques: Understanding the Singular Value Decomposition (SVD)
- Low-rank matrix approximations. SVD; applications to matrix compression, de-noising, and matrix completion (i.e. recovering missing entries).
- Tensor methods. Differences between matrices and tensors, the uniqueness of low-rank tensor factorizations, and Jennrich’s algorithm.
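For the low-rank approximation bullet: truncating the SVD to the top k singular values gives the best rank-k approximation (Eckart-Young), and the code is one line around numpy's SVD (an illustrative sketch, not the course's code):

```python
# Rank-k approximation via the SVD: keep only the top k singular
# values/vectors and multiply back together.
import numpy as np

def low_rank_approx(A, k):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Scale the first k left singular vectors by the singular values,
    # then multiply by the first k right singular vectors.
    return (U[:, :k] * s[:k]) @ Vt[:k]
```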
Spectral Graph Theory
- Graphs as matrices and the Laplacian of a graph. Interpretations of the largest and smallest eigenvectors/eigenvalues of the Laplacian. Spectral embeddings, and an overview of applications (e.g. graph coloring, spectral clustering).
- Interpretations of the second eigenvalue (via conductance and isoperimetric number), and the connections with the speed at which random walks/diffusions converge.
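A tiny example of the graph-as-matrix view: building the Laplacian L = D - A, whose number of zero eigenvalues equals the number of connected components (a minimal numpy sketch):

```python
# Graph Laplacian L = D - A: degree matrix minus adjacency matrix.
# The multiplicity of eigenvalue 0 counts the connected components,
# and the all-ones vector is always in the kernel.
import numpy as np

def laplacian(adj):
    A = np.asarray(adj, dtype=float)
    return np.diag(A.sum(axis=1)) - A
```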
Sampling and Estimation
- Reservoir sampling (how to select a random sample from a data stream). Basic probability tools: Markov’s inequality and Chebyshev’s inequality. Importance Sampling (how to make inferences about one distribution based on samples from a different distribution). Description of the Good-Turing estimate of missing/unseen mass.
- Markov chains and stationary distributions. Markov Chain Monte Carlo (MCMC) as an approach to solving hard problems by sampling from carefully crafted distributions.
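Reservoir sampling, from the first bullet above, fits in a few lines: keep the first k elements, then accept element i with probability k/(i+1), evicting a uniformly random reservoir slot (a standard sketch; names are my own):

```python
# Reservoir sampling: a uniform size-k sample from a stream whose length
# is unknown in advance, using O(k) memory and one pass.
import random

def reservoir_sample(stream, k, rng=random):
    reservoir = []
    for i, x in enumerate(stream):
        if i < k:
            reservoir.append(x)       # fill the reservoir first
        else:
            # Accept element i with probability k/(i+1), evicting a
            # uniformly random current member.
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = x
    return reservoir
```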
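And for stationary distributions: for an irreducible, aperiodic chain, repeatedly multiplying any starting distribution by the transition matrix converges to the unique distribution with pi P = pi (an illustrative numpy sketch):

```python
# Stationary distribution of a Markov chain by repeated multiplication.
# Assumes the chain is irreducible and aperiodic, so iteration converges
# to the unique pi satisfying pi @ P = pi.
import numpy as np

def stationary(P, iters=1000):
    pi = np.ones(P.shape[0]) / P.shape[0]   # start from the uniform distribution
    for _ in range(iters):
        pi = pi @ P
    return pi
```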
The Fourier Perspective (and other bases)
- Fourier methods, part 1.
- Fourier methods, part 2 (emphasis on convolutions).
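The convolution emphasis above comes down to the convolution theorem: convolution in the time domain is pointwise multiplication in the frequency domain, which turns an O(n^2) operation into O(n log n) (a minimal numpy sketch):

```python
# Convolution via the FFT: pad both signals to the full output length,
# multiply their transforms pointwise, and invert.
import numpy as np

def fft_convolve(a, b):
    n = len(a) + len(b) - 1            # length of the full linear convolution
    return np.fft.irfft(np.fft.rfft(a, n) * np.fft.rfft(b, n), n)
```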
Sparse Vector/Matrix Recovery (Compressive Sensing) and Mathematical Programming
- Compressive sensing.
- Linear and convex programming. Matrix completion.
Privacy Preserving Computation
- Differential Privacy in data analysis and machine learning.
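The canonical building block here is the Laplace mechanism: adding Laplace noise with scale sensitivity/epsilon to a query answer gives epsilon-differential privacy (a hedged numpy sketch with my own function name, not the course's code):

```python
# The Laplace mechanism: release a query answer plus Laplace noise.
# For a query whose answer changes by at most `sensitivity` when one
# person's data changes (e.g. sensitivity 1 for a counting query),
# noise of scale sensitivity/epsilon gives epsilon-differential privacy.
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    rng = rng or np.random.default_rng()
    return true_answer + rng.laplace(scale=sensitivity / epsilon)
```

The noise is unbiased, so averages of many noisy releases concentrate around the true answer; smaller epsilon (stronger privacy) means larger noise.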