Majority Voting
The \(k\) nearest neighbors model predicts the target level of a query \(q\) by majority vote among its \(k\) nearest neighbors.
Where \(\delta\) is an indicator function such that \(\delta(t_i, l) = 1 \iff t_i = l\):
$$ \mathbb{M}_{k} (q) = \arg\max_{l \in levels(t)} \left( \sum_{i=1}^{k} \delta(t_i, l) \right) $$
For binary classification, \(k\) should be odd to avoid ties.
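The majority-vote model above can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance and a dataset of (features, target) pairs; `knn_predict` is a hypothetical helper, not a standard library function.

```python
from collections import Counter
import math

def knn_predict(query, dataset, k):
    """Predict the target level of `query` by majority vote
    among its k nearest neighbors (Euclidean distance)."""
    # Sort training instances by distance to the query, keep the k nearest.
    neighbors = sorted(dataset, key=lambda inst: math.dist(query, inst[0]))[:k]
    # Tally the target levels of those k neighbors.
    votes = Counter(target for _, target in neighbors)
    # Return the level with the most votes (ties broken arbitrarily).
    return votes.most_common(1)[0][0]

data = [((1.0, 1.0), "a"), ((1.5, 1.2), "a"),
        ((5.0, 5.0), "b"), ((5.2, 4.8), "b")]
knn_predict((1.2, 1.1), data, k=3)  # "a": two of the three nearest are "a"
```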
Distance Weighting
An imbalanced dataset is one that contains significantly more instances of one target level than another. As \(k\) increases, predictions become dominated by the majority target level.
Distance-weighted \(k\)-NN mitigates this by weighting each neighbor's vote by the inverse square of its distance to the query:
$$ \mathbb{M}_{k} (q) = \arg\max_{l \in levels(t)} \left( \sum_{i=1}^{k} \frac{1}{dist(q, d_i)^2} \cdot \delta(t_i, l) \right) $$
That said, weighted \( k \)-NN is not a silver bullet for imbalanced datasets. Note also that with weighting, \(k\) can be set as large as \(N\), so that every training instance votes (distant instances contribute very little); in that case, computing \( \frac{1}{dist(q, d_i)^2} \) for all \(N\) instances may be expensive.
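The weighted variant changes only the tallying step: instead of counting votes, it accumulates \( 1/dist^2 \) per target level. A minimal sketch under the same assumptions as before (Euclidean distance, (features, target) pairs; `weighted_knn_predict` is a hypothetical name), defaulting \(k\) to \(N\):

```python
from collections import defaultdict
import math

def weighted_knn_predict(query, dataset, k=None):
    """Distance-weighted k-NN: each neighbor's vote counts 1/dist^2.
    With weighting, k may default to N = len(dataset)."""
    k = len(dataset) if k is None else k
    neighbors = sorted(dataset, key=lambda inst: math.dist(query, inst[0]))[:k]
    scores = defaultdict(float)
    for features, target in neighbors:
        d = math.dist(query, features)
        if d == 0:
            return target  # exact match with a training instance
        scores[target] += 1.0 / d**2
    return max(scores, key=scores.get)

data = [((1.0, 1.0), "a"), ((1.5, 1.2), "a"),
        ((5.0, 5.0), "b"), ((5.2, 4.8), "b")]
weighted_knn_predict((1.2, 1.1), data)  # "a": the far "b" instances barely count
```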
Note that odd \(k\) only avoids ties in the binary case: with 3 possible categories, \(k = 3\) can still result in a \(\{1, 1, 1\}\) tie. “\(k \bmod |categories| \ne 0\)” seems like an alternative choice, but \(k = 4\) could still result in \(\{2, 2, 0\}\) votes for 3 possible categories. In practice, ties must be broken some other way, e.g. at random or by the level of the single nearest neighbor.
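The tie cases above can be checked directly. A small sketch (`has_majority_tie` is a hypothetical helper) that flags when the top vote count is shared:

```python
from collections import Counter

def has_majority_tie(votes):
    """True if more than one target level shares the top vote count."""
    counts = Counter(votes).most_common()
    return len(counts) > 1 and counts[0][1] == counts[1][1]

has_majority_tie(["a", "a", "b", "b"])  # True: k=4 over 3 categories, {2, 2, 0}
has_majority_tie(["a", "b", "c"])       # True: odd k=3 still ties, {1, 1, 1}
has_majority_tie(["a", "a", "b"])       # False: clear majority for "a"
```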