Handling Noisy Data in Nearest Neighbors

Dated Oct 17, 2017; last modified on Sun, 14 Mar 2021

Majority Voting

The \(k\) nearest neighbors model predicts the target level of a query \(q\) by taking a majority vote among the target levels of the \(k\) training instances nearest to \(q\).

Where \(\delta\) is an indicator function such that \(\delta(t_i, l) = 1 \iff t_i = l\):

$$ \mathbb{M}_{k}(q) = \operatorname*{arg\,max}_{l \in \text{levels}(t)} \left( \sum_{i=1}^{k} \delta(t_i, l) \right) $$

For a categorical target feature, \(k\) should be odd to avoid ties.

This doesn’t read right. If there are 3 possible categories, \(k = 3\) can result in a tie. “\(k \bmod |\text{categories}| \ne 0\)” seems like an alternative choice, but \(k = 4\) could result in \(\{2, 2, 0\}\) votes for 3 possible categories. Odd \(k\) only rules out ties for binary targets; with more levels, an explicit tie-breaking rule is needed, e.g. deferring to the single nearest neighbor.
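A minimal sketch of majority-vote \(k\)-NN in Python. The Euclidean distance, the `euclidean` and `knn_majority_vote` names, and the toy spam/ham data are illustrative assumptions, not from the source:

```python
from collections import Counter
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_majority_vote(query, dataset, k):
    """Predict the target level of `query` by majority vote among the
    k training instances nearest to it.

    `dataset` is a list of (features, target_level) pairs.
    """
    # Keep the k training instances closest to the query.
    neighbors = sorted(dataset, key=lambda d: euclidean(query, d[0]))[:k]
    # Count votes per target level; a tie falls to whichever level was
    # encountered first, illustrating the tie-breaking concern above.
    votes = Counter(level for _, level in neighbors)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    data = [
        ((1.0, 1.0), "spam"), ((1.2, 0.8), "spam"),
        ((4.0, 4.2), "ham"), ((4.1, 3.9), "ham"), ((3.8, 4.0), "ham"),
    ]
    print(knn_majority_vote((1.1, 0.9), data, k=3))  # -> "spam"
```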

Distance Weighting

An imbalanced dataset is one that contains significantly more instances of one target level than of the others. As \(k\) increases, the prediction becomes dominated by the majority target level.

Distance-weighted \(k\)-NN mitigates the effect of imbalanced datasets by weighting each neighbor’s vote by the inverse square of its distance to the query:

$$ \mathbb{M}_{k}(q) = \operatorname*{arg\,max}_{l \in \text{levels}(t)} \left( \sum_{i=1}^{k} \frac{1}{\text{dist}(q, d_i)^2} \cdot \delta(t_i, l) \right) $$

That said, weighted \(k\)-NN is not a silver bullet for imbalanced datasets. Also, computing \( \frac{1}{\text{dist}(q, d_i)^2} \) for all \(N\) training instances may be expensive.

I don’t follow. Why are we doing work for \(N\)? Aren’t we interested in \(k\) instances? Perhaps this refers to the variant where \(k\) is set to \(N\), so that every training instance votes but distant ones contribute negligible weight; in that case, the weights do need to be computed for all \(N\) instances.
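A sketch of distance-weighted voting under the same illustrative assumptions (Euclidean distance, made-up function names and data). The toy dataset is imbalanced; with \(k = N\), a plain majority vote returns the majority level, while the distance-weighted vote returns the level of the instances nearest the query:

```python
from collections import Counter, defaultdict
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_distance_weighted(query, dataset, k):
    """Predict the target level of `query`, where each of the k nearest
    neighbors votes with weight 1 / dist(q, d_i)^2.

    `dataset` is a list of (features, target_level) pairs.
    """
    neighbors = sorted(dataset, key=lambda d: euclidean(query, d[0]))[:k]
    weights = defaultdict(float)
    for features, level in neighbors:
        dist = euclidean(query, features)
        # Guard against a query that coincides with a training instance.
        weights[level] += 1.0 / (dist ** 2) if dist > 0 else float("inf")
    return max(weights, key=weights.get)


if __name__ == "__main__":
    # Imbalanced toy data: 5 "majority" instances far from the query,
    # 2 "minority" instances right next to it.
    data = [
        ((1.0, 1.0), "minority"), ((1.1, 0.9), "minority"),
        ((3.0, 3.0), "majority"), ((3.1, 2.9), "majority"),
        ((2.9, 3.1), "majority"), ((3.2, 3.1), "majority"),
        ((2.8, 2.9), "majority"),
    ]
    query = (1.05, 0.95)
    k = len(data)  # k = N: every training instance gets a (weighted) vote

    unweighted = Counter(level for _, level in data).most_common(1)[0][0]
    print(unweighted)                             # -> "majority"
    print(knn_distance_weighted(query, data, k))  # -> "minority"
```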