The Curse of Dimensionality and Feature Selection

Dated Aug 9, 2018; last modified on Sun, 14 Mar 2021

The Curse of Dimensionality

The predictive power of an induced model is based either on:

  • Partitioning the feature space into regions based on clusters of training instances and assigning a query located in region \(X\) the target value of the training instances in that cluster.

  • Interpolating a target value from the target values of individual training instances that are near the query in the feature space.

Therefore, the sampling density (the average density of training instances across the feature space) is an important factor.

Where \(k\) is the # of instances in an \(n\)-dimensional hypercube, the sampling density is \(k^{\frac{1}{n}}\).

If we wish to maintain the sampling density of \(k\) instances at \(n = 1\) (assuming a random distribution), we need \(k^n\) instances for \(n\) columns.
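As a rough, self-contained illustration (my own sketch; the values of \(k\) and \(n\) below are arbitrary), the snippet computes the sampling density and the number of instances needed to keep the 1-D density as \(n\) grows:

```python
# Minimal sketch: how many instances are needed to keep the sampling density
# of k instances in 1 dimension once we move to n dimensions.
# The names k and n follow the text; the specific values are illustrative.

def sampling_density(k: int, n: int) -> float:
    """Sampling density of k instances spread over an n-dimensional hypercube."""
    return k ** (1 / n)

def instances_needed(k: int, n: int) -> int:
    """Instances required in n dimensions to match the density of k instances in 1-D."""
    return k ** n

k = 100
for n in (1, 2, 5, 10):
    needed = instances_needed(k, n)
    print(n, needed, round(sampling_density(needed, n)))
# The density stays at 100 per axis, but the instance count explodes:
# 100, 10^4, 10^10, 10^20.
```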

The curse of dimensionality refers to this tradeoff between the # of descriptive features and the density of instances.

A few caveats though:

  • Real data tends to cluster, so the distribution tends to have a lower effective dimensionality.

  • Small changes in the descriptive features tend to produce small changes in the target feature, so interpolation still works reasonably well even when instances are sparse.

Feature Types

  • Predictive - provides information that is useful for estimating the correct value of a target feature.

  • Interacting - only informative when used in conjunction with one or more other features.

  • Redundant - has a strong correlation with another descriptive feature.

  • Irrelevant - does not provide information that is useful in estimating the value of the target feature.

Algorithms based on decision trees prune features as they induce the model, and hence naturally reduce dimensionality. However, they may miss interacting features.
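As a quick illustration (my own example, assuming scikit-learn and a synthetic dataset from make_classification), a decision tree trained on data where most columns are irrelevant ends up splitting on only a few of them:

```python
# Minimal sketch: a decision tree effectively prunes irrelevant features,
# i.e. it assigns them zero importance in the induced model.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# 15 features, only 3 of which carry information about the target.
X, y = make_classification(n_samples=500, n_features=15, n_informative=3,
                           n_redundant=0, random_state=0)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
print((tree.feature_importances_ > 0).sum(), "of", X.shape[1], "features used")
```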

Algorithms that use all the features, e.g. kNN, are particularly sensitive to the curse of dimensionality.
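A small demonstration of why (a sketch assuming NumPy and uniformly random data, not something from the text): as the dimensionality grows, the nearest and farthest training instances from a query become almost equally distant, so "nearest neighbour" carries less and less information.

```python
# Minimal sketch: distance concentration in high dimensions.
import numpy as np

rng = np.random.default_rng(0)

for n in (2, 10, 100, 1000):
    X = rng.random((1000, n))   # 1000 training instances in the unit hypercube
    q = rng.random(n)           # a random query point
    d = np.linalg.norm(X - q, axis=1)
    print(f"n={n:4d}  nearest/farthest distance ratio = {d.min() / d.max():.2f}")
# The ratio creeps towards 1.0 as n grows.
```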

Methods for Reducing Dimensionality

Ranking and Pruning

  1. Use a filter such as Information Gain to rank the predictiveness of features.
  2. Prune the features that rank outside the top \(X\%\).

Although it’s computationally efficient, this method can miss interacting features and retain redundant features.
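A minimal sketch of rank-and-prune, assuming scikit-learn (the text does not prescribe a library): mutual_info_classif stands in for the Information Gain filter, and only the top 25% of features (an arbitrary cut-off) are kept.

```python
# Minimal sketch: rank features with a mutual-information filter, then prune
# everything outside the top 25%.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, mutual_info_classif

X, y = make_classification(n_samples=500, n_features=40, n_informative=5,
                           random_state=0)

selector = SelectPercentile(score_func=mutual_info_classif, percentile=25)
X_pruned = selector.fit_transform(X, y)

print(X.shape, "->", X_pruned.shape)   # (500, 40) -> (500, 10)
```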

Greedy Local Search

\(d\) features provide \(2^d\) possible feature subsets, and trying a model for each of them is computationally expensive. Instead, perform a greedy local search over a graph of the possible feature subsets. At each node:

  1. Generate the successors of the current node \(x\): each successor is a feature subset that differs from \(x\) by exactly one feature.
  2. Perform an evaluation experiment for each successor, e.g. building a model from only those features (expensive, but a better estimate of performance) or using a filter like Information Gain (cheaper, but a weaker estimate).
  3. If any successor performs better than \(x\), move to the best-performing successor and go back to step 1. Otherwise, terminate the search.

In forward sequential selection, you start with no features and generate successors by adding exactly one feature (see the sketch below). It typically generates fewer and smaller feature subsets, and therefore needs less computation, so prefer it if you expect lots of irrelevant features. However, it might miss interacting features.

In backward sequential selection, you start with all the features and generate successors by removing exactly one feature. This allows interacting features to be retained, at the extra computational cost of evaluating larger feature subsets.
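A minimal sketch of forward sequential selection (the helper, dataset, and kNN evaluator below are my own choices, not from the text): each successor is scored by cross-validating a model built from only those features, and the search terminates when no added feature improves the score.

```python
# Minimal sketch: greedy forward sequential selection.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def forward_sequential_selection(X, y, estimator):
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    while remaining:
        # Score every successor obtained by adding exactly one feature.
        scores = {f: cross_val_score(estimator, X[:, selected + [f]], y, cv=5).mean()
                  for f in remaining}
        f_best, score = max(scores.items(), key=lambda kv: kv[1])
        if score <= best_score:          # no successor beats the current node
            break
        selected.append(f_best)
        remaining.remove(f_best)
        best_score = score
    return selected, best_score

X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           random_state=0)
features, score = forward_sequential_selection(X, y, KNeighborsClassifier())
print("selected features:", features, "cross-validated score:", round(score, 3))
```

Backward sequential selection would be the mirror image: start with all the columns, generate successors by dropping exactly one, and stop when no removal improves the score.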