# Similarity Measures

To classify something, find things that are similar and label it with the same class as the most similar thing.

The feature space is $$N-d$$, where $$N$$ is the number of features. Each instance is mapped to a point. The descriptive features become the axes.
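The nearest-neighbor idea above can be sketched in a few lines. This is a minimal illustration with hypothetical toy data (the `training` points and labels are invented for the example), using Euclidean distance as the similarity measure:

```python
import math

# Each instance is a point in N-d feature space; toy data for illustration.
training = [
    ((1.0, 2.0), "A"),
    ((5.0, 5.0), "B"),
    ((1.5, 1.8), "A"),
]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(query):
    # Label the query with the class of its nearest training instance.
    _, label = min(training, key=lambda tl: euclidean(tl[0], query))
    return label

print(classify((1.2, 1.9)))  # nearest training point is (1.0, 2.0) -> "A"
```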

## The Similarity Metric

Mathematically, it must conform to these 4 criteria:

• Non-negativity: $$f(a, b) \ge 0$$
• Identity of Indiscernibles: $$f(a, b) = 0 \iff a = b$$
• Symmetry: $$f(a, b) = f(b, a)$$
• Subadditivity (Triangle inequality): $$f(a, b) \le f(a, c) + f(c, b)$$

Why are non-negativity and the triangle inequality important? Because a similarity metric is best thought of as a measure of “distance”: the distance between two instances naturally obeys both conditions.
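The four criteria can be spot-checked numerically (a sanity check on a few points, not a proof). Here, the Euclidean distance defined below is used as the candidate metric:

```python
import math
import itertools

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

points = [(0.0, 0.0), (3.0, 4.0), (6.0, 0.0)]

# Non-negativity and identity of indiscernibles
for p in points:
    assert euclidean(p, p) == 0.0
# Symmetry and the triangle inequality, for every ordered triple
for a, b, c in itertools.permutations(points, 3):
    assert euclidean(a, b) == euclidean(b, a) >= 0
    assert euclidean(a, b) <= euclidean(a, c) + euclidean(c, b) + 1e-12
```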

## The Minkowski Distance

Where $$a$$ and $$b$$ are the instances being compared. $$m$$ is the number of descriptive features:

$$f(a, b) = \left( \sum_{i=1}^{m} |a[i] - b[i]|^{p} \right)^{1/p}$$

Because the differences are raised to the $$p$$-th power, higher $$p$$ emphasizes features with big differences.

As $$p \to \infty$$, the largest absolute difference dominates the sum, so the distance converges to that single biggest difference. The $$p = \infty$$ case is referred to as the Chebyshev (or Chessboard) Distance.
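A direct translation of the formula, evaluated for several values of $$p$$, shows the convergence toward the Chebyshev Distance as $$p$$ grows (a sketch; the point pair is made up for illustration):

```python
def minkowski(a, b, p):
    # Sum the absolute feature differences raised to the p-th power,
    # then take the p-th root of the sum.
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

a, b = (0.0, 0.0), (3.0, 4.0)
print(minkowski(a, b, 1))   # 7.0 (Manhattan)
print(minkowski(a, b, 2))   # 5.0 (Euclidean)
print(minkowski(a, b, 10))  # already close to the max difference of 4
print(max(abs(x - y) for x, y in zip(a, b)))  # 4.0 (Chebyshev, p -> infinity)
```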

### The Manhattan (or Taxi-Cab) Distance ($$p = 1$$)

$$f(a, b) = \sum_{i=1}^{m} |a[i] - b[i]|$$

### The Euclidean Distance ($$p = 2$$)

It is interesting to note that “Euclidean Distance” is somewhat of a misnomer: all Minkowski distances satisfy the metric criteria, and they all implicitly assume a Euclidean (flat) geometry. In that sense, the Manhattan Distance is not, strictly speaking, a non-Euclidean distance.

Euclidean space is the de facto geometry in ML by convenience/tradition rather than by design, which limits us. For example, hyperbolic space offers a better model for embedding a tree while preserving the distances between nodes, because its volume grows exponentially with radius, matching the exponential growth of a tree with depth. [1]

$$f(a, b) = \sqrt{ \sum_{i=1}^{m} (a[i] - b[i])^2 }$$

Euclidean is the default choice, unless one is constrained by computational resources (Manhattan is cheaper, since it avoids the squaring and the square root).

Note that the Euclidean distance is more influenced by a single large difference in one feature than a lot of small differences across a set of features. The opposite is true of the Manhattan Distance.
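This difference in sensitivity can be demonstrated with a small sketch (the two hypothetical points are chosen so both have the same total feature difference of 4):

```python
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

origin = (0.0, 0.0, 0.0, 0.0)
spread = (1.0, 1.0, 1.0, 1.0)   # many small differences
spike  = (0.0, 0.0, 0.0, 4.0)   # one large difference

# Manhattan treats the two cases identically (both totals are 4) ...
print(manhattan(origin, spread), manhattan(origin, spike))  # 4.0 4.0
# ... but Euclidean penalizes the single large difference more heavily.
print(euclidean(origin, spread), euclidean(origin, spike))  # 2.0 4.0
```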

## References

1. Fred Sala, Ines Chami, Adva Wolf, Albert Gu, Beliz Gunel, Chris Ré. “Into the Wild: Machine Learning In Non-Euclidean Spaces.” dawn.cs.stanford.edu, Oct 10, 2019. Accessed Dec 18, 2020.