Handling Noisy Data in Nearest Neighbors

Dated Oct 17, 2017; last modified on Sun, 14 Mar 2021

Majority Voting

The \(k\) nearest neighbors model predicts the target level of a query \(q\) by taking a majority vote among the target levels of the \(k\) training instances nearest to \(q\).

Where \(\delta\) is an indicator function such that \(\delta(t_i, l) = 1 \iff t_i = l\):

$$ \mathbb{M}_{k}(q) = \operatorname*{arg\,max}_{l \in \text{levels}(t)} \left( \sum_{i=1}^{k} \delta(t_i, l) \right) $$

For a categorical target feature, \(k\) should be odd to avoid ties.

This doesn’t read right. If there are 3 possible categories, \(k = 3\) can result in a tie. “\(k \bmod |\text{categories}| \ne 0\)” seems like an alternative choice, but \(k = 4\) could result in \(\{2, 2, 0\}\) votes for 3 possible categories. Odd \(k\) only rules out ties for binary targets; with more levels, an explicit tie-breaking rule is needed, e.g. deferring to the single nearest neighbor.
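A minimal sketch of majority-vote \(k\)-NN in Python. The Euclidean distance, the `euclidean` and `knn_majority_vote` names, and the toy spam/ham data are illustrative assumptions, not from the source:

```python
from collections import Counter
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_majority_vote(query, dataset, k):
    """Predict the target level of `query` by majority vote among the
    k training instances nearest to it.

    `dataset` is a list of (features, target_level) pairs.
    """
    # Keep the k training instances closest to the query.
    neighbors = sorted(dataset, key=lambda d: euclidean(query, d[0]))[:k]
    # Count votes per target level; a tie falls to whichever level was
    # encountered first, illustrating the tie-breaking concern above.
    votes = Counter(level for _, level in neighbors)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    data = [
        ((1.0, 1.0), "spam"), ((1.2, 0.8), "spam"),
        ((4.0, 4.2), "ham"), ((4.1, 3.9), "ham"), ((3.8, 4.0), "ham"),
    ]
    print(knn_majority_vote((1.1, 0.9), data, k=3))  # -> "spam"
```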

Distance Weighting

An imbalanced dataset is one that contains significantly more instances of one target level than of the others. As \(k\) increases, the prediction becomes dominated by the majority target level.

Distance-weighted \(k\)-NN mitigates the effect of imbalanced datasets by weighting each neighbor’s vote by the inverse square of its distance to the query:

$$ \mathbb{M}_{k}(q) = \operatorname*{arg\,max}_{l \in \text{levels}(t)} \left( \sum_{i=1}^{k} \frac{1}{\text{dist}(q, d_i)^2} \cdot \delta(t_i, l) \right) $$

That said, weighted \(k\)-NN is not a silver bullet for imbalanced datasets. Also, computing \( \frac{1}{\text{dist}(q, d_i)^2} \) for all \(N\) training instances may be expensive.

I don’t follow. Why are we doing work for \(N\)? Aren’t we interested in \(k\) instances? Perhaps this refers to the variant where \(k\) is set to \(N\), so that every training instance votes but distant ones contribute negligible weight; in that case, the weights do need to be computed for all \(N\) instances.
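A sketch of distance-weighted voting under the same illustrative assumptions (Euclidean distance, made-up function names and data). The toy dataset is imbalanced; with \(k = N\), a plain majority vote returns the majority level, while the distance-weighted vote returns the level of the instances nearest the query:

```python
from collections import Counter, defaultdict
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_distance_weighted(query, dataset, k):
    """Predict the target level of `query`, where each of the k nearest
    neighbors votes with weight 1 / dist(q, d_i)^2.

    `dataset` is a list of (features, target_level) pairs.
    """
    neighbors = sorted(dataset, key=lambda d: euclidean(query, d[0]))[:k]
    weights = defaultdict(float)
    for features, level in neighbors:
        dist = euclidean(query, features)
        # Guard against a query that coincides with a training instance.
        weights[level] += 1.0 / (dist ** 2) if dist > 0 else float("inf")
    return max(weights, key=weights.get)


if __name__ == "__main__":
    # Imbalanced toy data: 5 "majority" instances far from the query,
    # 2 "minority" instances right next to it.
    data = [
        ((1.0, 1.0), "minority"), ((1.1, 0.9), "minority"),
        ((3.0, 3.0), "majority"), ((3.1, 2.9), "majority"),
        ((2.9, 3.1), "majority"), ((3.2, 3.1), "majority"),
        ((2.8, 2.9), "majority"),
    ]
    query = (1.05, 0.95)
    k = len(data)  # k = N: every training instance gets a (weighted) vote

    unweighted = Counter(level for _, level in data).most_common(1)[0][0]
    print(unweighted)                             # -> "majority"
    print(knn_distance_weighted(query, data, k))  # -> "minority"
```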