Similarity Measures

To classify something, find things that are similar and label it with the same class as the most similar thing.

The feature space is $N-d$, where $N$ is the number of features. Each instance is mapped to a point. The descriptive features become the axes.

The Similarity Metric

Mathematically, it must conform to these 4 criteria:

Non-negativity: $f(a, b) \ge 0$
Identity of Indiscernables: $ f(a, b) = 0 \iff a = b $
Symmetry: $ f(a, b) = f(b, a) $
Subaddivity (Triangular inequality): $ f(a, b) \le f(a, c) + f(c, b) $

Why are non-negativity and triangular inequality important? It seems that we should think of the similarity metric as a measure of “distance”. Distance between two instances obeys the non-negativity & triangular inequality conditions.

The Minkowski Distance

Where $a$ and $b$ are the instances being compared. $m$ is the number of descriptive features:

$$ f(a, b) = \left( \sum_{i=1}^{m} abs(a[i] - b[i])^{p} \right)^{1/p} $$

Because the differences are raised to the $p$‘th power, higher $p$ emphasizes features with big differences.

As $ p \to \infty $, we get the biggest absolute difference since it dominates the other sums. $p = \infty $ is referred to as the Chebyshev (or Chessboard) Distance.

The Manhattan (or Taxi-Cab) Distance ($ p = 1$)

$$ f(a, b) = \sum_{i=1}^{m} |a[i] - b[i]| $$

The Euclidean Distance ($ p = 2$)

Interesting to note that “Euclidean Distance” is somewhat a misnomer. Given that Minkowski distances satisfy metrics criteria, they seem to assume a Euclidean (flat) geometry. In that sense, the Manhattan Distance is not, strictly speaking, a non-euclidean distance.

Euclidean space is the de-facto geometry in ML for convenience/tradition rather than by design, which limits us. For example, hyperbolic space offers a better model for embedding a tree in a way that maintains the distance of nodes, i.e. the volume grows exponentially in the tree depth.

$$ f(a, b) = \sqrt{ \sum_{i=1}^{m} (a[i] - b[i])^2 } $$

Euclidean is the default, unless one is constrained by computational resources.

Note that the Euclidean distance is more influenced by a single large difference in one feature than a lot of small differences across a set of features. The opposite is true of the Manhattan Distance.

References

Into the Wild: Machine Learning In Non-Euclidean Spaces. Fred Sala; Ines Chami; Adva Wolf; Albert Gu; Beliz Gunel; Chris Ré. dawn.cs.stanford.edu . Oct 10, 2019. Accessed Dec 18, 2020.

Similarity Measures

The Similarity Metric

The Minkowski Distance

The Manhattan (or Taxi-Cab) Distance (\( p = 1\))

The Euclidean Distance (\( p = 2\))

References