Oct 10, 2020 | » | Similarity Measures
2 min; updated Sep 5, 2022
To classify something, find things that are similar and label it with the same class as the most similar thing. The feature space is \(N\)-d, where \(N\) is the number of features. Each instance is mapped to a point, and the descriptive features become the axes. **The Similarity Metric**: mathematically, it must conform to these 4 criteria: non-negativity: \(f(a, b) \ge 0\); identity: \( f(a, b) = 0 \iff a = b \); symmetry: \( f(a, b) = f(b, a) \); triangle inequality: \( f(a, b) \le f(a, c) + f(c, b) \). Why are non-negativity and the triangle inequality important?...
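The 4 criteria can be checked directly for a concrete metric. A minimal sketch (the function name and sample points are mine, not from the post) verifying that Euclidean distance satisfies them:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points in the N-d feature space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a, b, c = (0.0, 0.0), (3.0, 4.0), (6.0, 8.0)

assert euclidean(a, b) >= 0                                   # non-negativity
assert euclidean(a, a) == 0                                   # identity
assert euclidean(a, b) == euclidean(b, a)                     # symmetry
assert euclidean(a, b) <= euclidean(a, c) + euclidean(c, b)   # triangle inequality
```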
Oct 10, 2020 | » | The Nearest Neighbor Algorithm
2 min; updated Feb 12, 2023
**The Algorithm**: iterate across the instances in memory and find the instance that has the shortest distance from the query. Doing this naively is computationally expensive, which is why there are faster lookup methods, e.g. k-d trees. Make a prediction for the query equal to the value of the target feature of the nearest neighbor. **The Voronoi Tessellation** is a partition of the feature space such that each partition is the 'adoptive-radius' of the instance that 'owns' that partition...
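The naive scan described above can be sketched in a few lines (function name and toy data are mine; a k-d tree would replace the linear `min` for faster lookup):

```python
def nearest_neighbor(query, instances):
    """Naive O(n) scan: return the (features, target) pair closest to the query."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(instances, key=lambda inst: dist(query, inst[0]))

# toy training set of (features, class label) pairs — illustrative data only
train = [((1.0, 1.0), "a"), ((5.0, 5.0), "b"), ((1.5, 0.5), "a")]
features, label = nearest_neighbor((1.2, 0.9), train)  # prediction = label of nearest instance
```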
Oct 10, 2017 | » | Caveats on Similarity Learning
1 min; updated Mar 14, 2021
Similarity-based learning is intuitive and gives people confidence in the model. There is an inductive bias that instances with similar descriptive features belong to the same class, and it holds remarkably often. When I think of classifying things, my mind immediately goes to NN. Similarity learning has a stationarity assumption, i.e. the joint PDF of the data doesn’t change (new classifications do not come up). This assumption is shared by supervised ML...
Oct 17, 2017 | » | Handling Noisy Data in Nearest Neighbors
1 min; updated Mar 14, 2021
**Majority Voting**: the \(k\) nearest neighbors model predicts the target level from the majority vote of the set of the \(k\) nearest neighbors to the query \(q\). Where \(\delta\) is an indicator function such that \(\delta(t_i, l) = 1 \iff t_i = l\): $$ \mathbb{M}_{k} (q) = \operatorname{argmax}_{l \in levels(t)} \left( \sum_{i=1}^{k} \delta(t_i, l) \right) $$ For categorical features, \(k\) should be odd to avoid ties. This doesn’t hold in general: if there are 3 possible categories, \(k = 3\) can still result in a tie...
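A minimal sketch of \(\mathbb{M}_k(q)\) as majority voting (function name and toy data are mine; `Counter.most_common` breaks ties by insertion order, which is one arbitrary way to resolve the tie problem noted above):

```python
from collections import Counter

def knn_classify(query, instances, k):
    """Majority vote over the k nearest neighbors to the query."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    neighbors = sorted(instances, key=lambda inst: dist(query, inst[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# toy instances: (features, target level) — illustrative data only
train = [((0.0, 1.0), "x"), ((0.0, 2.0), "x"), ((5.0, 5.0), "y"), ((6.0, 5.0), "y")]
```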
Oct 17, 2017 | » | The Case for Range Normalization
1 min; updated Mar 14, 2021
When you have features taking different ranges of values, you may get odd predictions. For example, if \(f_1 \in [0, 100]\) and \(f_2 \in [0, 1]\), \(f_1\) will always dominate \(f_2\) when computing the distance. To mitigate this, normalize the features’ ranges to \([r_{low}, r_{high}]\): $$ a'_i = \frac{a_i - a_{min}}{ a_{max} - a_{min}} \cdot (r_{high} - r_{low}) + r_{low} $$ Typically, the range is normalized to \([r_{low}, r_{high}] = [0, 1]\), so range normalization simplifies to:...
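The normalization formula translates directly to code. A sketch (the function name is mine, not from the post):

```python
def range_normalize(values, r_low=0.0, r_high=1.0):
    """Rescale a feature's observed values from [a_min, a_max] to [r_low, r_high]."""
    a_min, a_max = min(values), max(values)
    # assumes a_max > a_min; a constant feature would divide by zero here
    return [(a - a_min) / (a_max - a_min) * (r_high - r_low) + r_low for a in values]
```

With the default \([0, 1]\) target range, `range_normalize([0, 50, 100])` yields `[0.0, 0.5, 1.0]`.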
Oct 17, 2017 | » | Predicting Continuous Targets Using NN
1 min; updated Mar 14, 2021
**Return the Average Value**: one possible solution is to return the average value in the neighborhood, i.e. $$ \mathbb{M}_{k}(q) = \frac{1}{k} \sum_{i=1}^{k} t_i $$ We can improve this by using weighted \(k\)-NN: $$ \mathbb{M}_{k}(q) = \frac{ \sum_{i=1}^{k} \left( \frac{1}{dist(q, d_i)^2} \cdot t_i \right) }{ \sum_{i=1}^{k} \frac{1}{dist(q, d_i)^2} } $$ The formula looks new. However, if \(x_1\) is weighted by \(w_1\) and \(x_2\) by \(w_2\), then the weighted average is: $$ \frac{w_1 x_1 + w_2 x_2 }{w_1 + w_2} $$...
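The weighted \(k\)-NN formula above can be sketched as follows (function name and toy data are mine; the sketch assumes the query coincides with no stored instance, since \(dist = 0\) would divide by zero):

```python
def weighted_knn(query, instances, k):
    """Inverse-squared-distance weighted average of the k nearest targets."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    neighbors = sorted(instances, key=lambda inst: dist(query, inst[0]))[:k]
    # weight each neighbor's target by 1 / dist(q, d_i)^2
    weights = [1.0 / dist(query, f) ** 2 for f, _ in neighbors]
    return sum(w * t for w, (_, t) in zip(weights, neighbors)) / sum(weights)

# toy instances: (features, continuous target) — illustrative data only
train = [((0.0,), 10.0), ((2.0,), 20.0)]
```

A query at `(1.0,)` sits equidistant from both instances, so both weights are equal and the prediction reduces to the plain average, 15.0.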
Oct 17, 2017 | » | Other Measures of Similarity in NN
2 min; updated Feb 12, 2023
Note: similarity indexes may not satisfy the 4 NIST criteria of a similarity metric. Consider these two instances: | | \(f_1\) | \(f_2\) | \(f_3\) | \(f_4\) | \(f_5\) | d | 0 | 1 | 1 | 0 | 1 | q | 0 | 0 | 1 | 1 | 1 | \(CP(q, d)\), the number of co-presences of binary features, is \(2\) because of \(f_3\) and \(f_5\). \(CA(q, d)\), the number of co-absences, is \(1\) because of \(f_1\)...
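Co-presence and co-absence are simple counts over the binary feature vectors. A sketch (function names are mine, not from the post):

```python
def cp(q, d):
    """Co-presences: count of features set to 1 in both instances."""
    return sum(1 for x, y in zip(q, d) if x == 1 and y == 1)

def ca(q, d):
    """Co-absences: count of features set to 0 in both instances."""
    return sum(1 for x, y in zip(q, d) if x == 0 and y == 0)

d = [0, 1, 1, 0, 1]
q = [0, 0, 1, 1, 1]
# CP(q, d) = 2 (f3 and f5); CA(q, d) = 1 (f1)
```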
Aug 9, 2018 | » | The Curse of Dimensionality and Feature Selection
3 min; updated Mar 14, 2021
**The Curse of Dimensionality**: the predictive power of an induced model is based on either: (1) partitioning the feature space into regions based on clusters of training instances and assigning a query located in region \(X\) the target value of the training instances in that cluster; or (2) interpolating a target value from the target values of individual training instances that are near the query in the feature space...