Similarity Measures

Dated Oct 10, 2020; last modified on Mon, 05 Sep 2022

To classify an instance, find similar instances and label it with the same class as the most similar one.

The feature space is \(N\)-dimensional, where \(N\) is the number of descriptive features. Each instance is mapped to a point, and the descriptive features become the axes.
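A minimal sketch of this nearest-neighbor idea in Python, using Euclidean distance as the similarity measure (the helper name `nearest_neighbor_class` and the sample data are illustrative, not from the source):

```python
from math import dist  # Euclidean distance, available in Python >= 3.8

def nearest_neighbor_class(query, instances):
    """Label `query` with the class of the closest (most similar) instance.

    `instances` is a list of (point, label) pairs; each point is a tuple of
    descriptive-feature values, i.e. a point in the feature space.
    """
    _, label = min(instances, key=lambda pair: dist(query, pair[0]))
    return label

# Two descriptive features => a 2-dimensional feature space.
training = [((1.0, 1.0), "cat"), ((5.0, 5.0), "dog")]
print(nearest_neighbor_class((1.5, 0.5), training))  # -> cat
```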

The Similarity Metric

Mathematically, it must conform to these four criteria:

  • Non-negative: \(f(a, b) \ge 0\)
  • Identity: \( f(a, b) = 0 \iff a = b \)
  • Symmetry: \( f(a, b) = f(b, a) \)
  • Triangle inequality: \( f(a, b) \le f(a, c) + f(c, b) \)

Why are non-negativity and the triangle inequality important? Think of the similarity metric as a measure of “distance”: the distance between two instances naturally obeys both the non-negativity and the triangle inequality conditions.
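As a sanity check, one can spot-check a candidate distance function against the four criteria on a handful of sample points. This is only an empirical check over sampled triples, not a proof; the helper name `check_metric_axioms` is mine:

```python
from itertools import product

def check_metric_axioms(f, points, tol=1e-9):
    """Spot-check the four metric criteria on every triple of sample points."""
    for a, b, c in product(points, repeat=3):
        assert f(a, b) >= -tol                        # non-negativity
        assert (abs(f(a, b)) <= tol) == (a == b)      # identity
        assert abs(f(a, b) - f(b, a)) <= tol          # symmetry
        assert f(a, b) <= f(a, c) + f(c, b) + tol     # triangle inequality

manhattan = lambda a, b: sum(abs(x - y) for x, y in zip(a, b))
check_metric_axioms(manhattan, [(0, 0), (1, 2), (3, 1)])  # passes silently
```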

The Minkowski Distance

where \(a\) and \(b\) are the instances being compared, and \(m\) is the number of descriptive features:

$$ f(a, b) = \left( \sum_{i=1}^{m} |a[i] - b[i]|^{p} \right)^{1/p} $$

Because the absolute differences are raised to the \(p\)-th power, a higher \(p\) places more emphasis on features with large differences.

As \( p \to \infty \), we get the biggest absolute difference, since it dominates the other terms in the sum. \(p = \infty \) is referred to as the Chebyshev (or Chessboard) Distance.
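A sketch of the Minkowski distance for a few values of \(p\) (the `minkowski` and `chebyshev` helpers are illustrative), showing how large \(p\) approaches the Chebyshev distance:

```python
def minkowski(a, b, p):
    """Minkowski distance between instances a and b over their m features."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def chebyshev(a, b):
    """The p -> infinity limit: the biggest absolute per-feature difference."""
    return max(abs(x - y) for x, y in zip(a, b))

a, b = (0, 0), (3, 4)
print(minkowski(a, b, 1))   # 7.0 (Manhattan)
print(minkowski(a, b, 2))   # 5.0 (Euclidean)
print(minkowski(a, b, 50))  # ~4.00000005, nearly Chebyshev
print(chebyshev(a, b))      # 4
```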

The Manhattan (or Taxi-Cab) Distance (\( p = 1\))

$$ f(a, b) = \sum_{i=1}^{m} |a[i] - b[i]| $$

The Euclidean Distance (\( p = 2\))

Interestingly, “Euclidean Distance” is somewhat of a misnomer. Given that Minkowski distances satisfy the metric criteria, they all seem to assume a Euclidean (flat) geometry, not just \(p = 2\). In that sense, the Manhattan Distance is not, strictly speaking, a non-Euclidean distance.

Euclidean space is the de facto geometry in ML out of convenience and tradition rather than by design, and that limits us. For example, hyperbolic space offers a better model for embedding a tree in a way that preserves the distances between nodes: the number of nodes grows exponentially with tree depth, and volume in hyperbolic space likewise grows exponentially with radius.

$$ f(a, b) = \sqrt{ \sum_{i=1}^{m} (a[i] - b[i])^2 } $$

Euclidean distance is the default choice, unless one is constrained by computational resources; the Manhattan distance is cheaper to compute since it needs no squares or square roots.

Note that the Euclidean distance is more influenced by a single large difference in one feature than by many small differences across a set of features. The opposite is true of the Manhattan Distance.
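A small worked example of that asymmetry (the feature vectors are made up for illustration): a single difference of 4 versus four differences of 1.

```python
def minkowski(a, b, p):
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

base = (0, 0, 0, 0)
one_big = (4, 0, 0, 0)      # one large difference in a single feature
many_small = (1, 1, 1, 1)   # small differences in every feature

print(minkowski(base, one_big, 1), minkowski(base, many_small, 1))  # 4.0 4.0
print(minkowski(base, one_big, 2), minkowski(base, many_small, 2))  # 4.0 2.0
```

Manhattan rates the two neighbors as equally far, while Euclidean rates the single large difference as twice as far.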

References

  1. Into the Wild: Machine Learning In Non-Euclidean Spaces. Fred Sala; Ines Chami; Adva Wolf; Albert Gu; Beliz Gunel; Chris Ré. dawn.cs.stanford.edu. Oct 10, 2019. Accessed Dec 18, 2020.