Feature Selection

Table of contents
  1. Downside of Having Too Many Features
  2. Overall Significance Test of Regression
    1. F-Test for Linear Regression Analysis
  3. Feature Selection Methods
    1. Best Subset Selection
    2. Stepwise Regression
      1. Backward elimination
      2. Forward selection
      3. Bidirectional elimination

Downside of Having Too Many Features

  1. It is difficult (or impossible) to visualize the resulting model due to the high dimensionality of the feature space. To be fair, this is often unavoidable.

  2. Model interpretability is decreased as more features are added to the model.

  3. Sometimes not all features are relevant to the dependent variable, and including them in the model can only result in high variance and overfitting. This is the most concerning point.

So, it would be better to select only the relevant features for the model.

Modern ML libraries often ship regression models that can perform feature selection for you automatically.


Overall Significance Test of Regression

One important question to ask before feature selection is:

Were any of the features useful in predicting the response at all?

If none of the features used in the model are relevant, then there is no point in using this model at all, let alone selecting features.

In the case of linear regression, suppose we have three estimated coefficients $\hat{\beta}_1$, $\hat{\beta}_2$, and $\hat{\beta}_3$, each the weight of one feature.

The null hypothesis we want to test then would be:

$$ H_0: \beta_1 = \beta_2 = \beta_3 = 0 $$

This says that none of the features are useful in predicting the response, i.e. the model is no better than the intercept-only model (also called the mean model, baseline model, etc.)

F-Test for Linear Regression Analysis

This can be tested with the F-test for linear regression.

The F-statistic for linear regression compares the RSS of two nested models (model 1 nested within model 2):

$$ F = \frac{(RSS_1 - RSS_2)/(p_2-p_1)}{RSS_2/(n-p_2-1)} $$

where $p_1$ and $p_2$ are the numbers of features (parameters excluding the intercept) in the two models, respectively.

Under this null hypothesis, the restricted model (model 1) is the intercept-only model, so $RSS_1$ equals $TSS$ (the total sum of squares) and $p_1 = 0$.

Hence, the F-statistic becomes:

$$ F = \frac{(TSS - RSS)/p}{RSS/(n-p-1)} \sim F_{p, n-p-1} $$

When the null hypothesis is true, the F-statistic takes a value close to 1. Otherwise, it tends to be larger than 1.

We want this F-statistic to be large, in order to confirm that at least one of the features is useful.

How large? This depends on $n$, $p$, and the significance level $\alpha$. Generally, if $n$ is small, we need a larger F-statistic to reject the null hypothesis.

Some intuition

Recall from the definition of $R^2$ that $TSS - RSS$ is the variance explained by the model.

If the model explains the variance well, then $TSS - RSS$ is large, and the F-statistic is large as well.
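
As a quick illustration, here is a minimal sketch that computes this F-statistic, the critical value, and the $p$-value on a small synthetic dataset (the data and variable names are purely illustrative):

```python
# Minimal sketch: overall F-test for linear regression on synthetic data.
# Everything here (data, names) is illustrative, not a fixed recipe.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)   # third feature is irrelevant

# Ordinary least squares with an intercept column.
X1 = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
rss = np.sum((y - X1 @ beta) ** 2)       # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)        # total sum of squares (intercept-only RSS)

F = ((tss - rss) / p) / (rss / (n - p - 1))
p_value = stats.f.sf(F, p, n - p - 1)       # P(F_{p, n-p-1} > F)
critical = stats.f.ppf(0.95, p, n - p - 1)  # "how large is large" at alpha = 0.05
print(f"F = {F:.2f}, critical value = {critical:.2f}, p-value = {p_value:.3g}")
```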


Feature Selection Methods

First, the obvious ones:

  • Just use all the features
    • Might be reasonable if you already have some prior domain knowledge about the problem (e.g. via laws of physics) and you’re sure that all the features are relevant and beneficial to the model.

Best Subset Selection

Brute-force all the possible combinations of features.

Decide on a metric to evaluate the model, and choose the combination of features that gives the best value of that metric.

This is not feasible in most cases ($2^p - 1$ combinations for $p$ features). But if you have a small number of features, this might be okay.

The search for the best subset may lead to overfitting/high variance.
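
For small $p$, the brute-force search is easy to write. Below is a minimal sketch; the scoring function (adjusted $R^2$) is just one possible choice, and `X` (an $n \times p$ array) and `y` are assumed, illustrative inputs:

```python
# Minimal sketch of best subset selection; `X` (n x p array) and `y` are
# assumed to be the design matrix and response (illustrative names).
from itertools import combinations
import numpy as np

def adjusted_r2(X_sub, y):
    n, k = X_sub.shape
    X1 = np.column_stack([np.ones(n), X_sub])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    rss = np.sum((y - X1 @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - (rss / (n - k - 1)) / (tss / (n - 1))

def best_subset(X, y):
    p = X.shape[1]
    best_score, best_features = -np.inf, None
    for k in range(1, p + 1):
        for subset in combinations(range(p), k):   # 2^p - 1 non-empty subsets
            score = adjusted_r2(X[:, list(subset)], y)
            if score > best_score:
                best_score, best_features = score, subset
    return best_features, best_score
```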

Stepwise Regression

Stepwise regression methods are more sophisticated: they select features step by step based on the results of hypothesis tests.

They do not guarantee the best subset of features, but they are computationally more efficient.

Backward elimination

Backward elimination cannot be used if the number of features is greater than the number of observations.

Start with all the features and remove the least significant feature, one at a time (see the code sketch after the steps below).

  1. Pick a significance level $\alpha$ (e.g. 0.05).
  2. Fit the model with all the remaining features.
  3. Find the feature with the highest coefficient $p$-value.
  4. If that $p > \alpha$, remove the feature.

    Otherwise, stop.

  5. Refit the model with the remaining features and repeat from step 2.
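
A minimal sketch of this procedure using statsmodels, assuming a hypothetical pandas DataFrame `X` of candidate features and a Series `y` for the response:

```python
# Minimal sketch of backward elimination driven by coefficient p-values.
# `X` (pandas DataFrame) and `y` (Series) are assumed, hypothetical inputs.
import statsmodels.api as sm

def backward_elimination(X, y, alpha=0.05):
    features = list(X.columns)
    while features:
        model = sm.OLS(y, sm.add_constant(X[features])).fit()
        p_values = model.pvalues.drop("const")   # one p-value per feature
        worst = p_values.idxmax()                # least significant feature
        if p_values[worst] > alpha:
            features.remove(worst)
        else:
            break                                # everything left is significant
    return features
```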

Forward selection

Forward selection can always be used, but tends to be too greedy.

Start with no features and add the most significant feature, one at a time (see the code sketch after the steps below).

  1. Pick a significance level $\alpha$ (e.g. 0.05).
  2. For each feature $X_i$, fit a simple regression model on $X_i$ alone, and find the feature with the lowest $p$-value.
  3. If the lowest $p < \alpha$, add that feature to the model.

    Otherwise, stop.

  4. For each feature $X_j$ that has not been selected yet, fit a regression model with $X_j$ together with all the previously selected features, and find the $X_j$ with the lowest $p$-value.
  5. Repeat from step 3 until the procedure stops.
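
A minimal sketch of forward selection, with the same hypothetical `X` and `y` as in the backward-elimination sketch:

```python
# Minimal sketch of forward selection driven by coefficient p-values.
import statsmodels.api as sm

def forward_selection(X, y, alpha=0.05):
    selected = []
    remaining = list(X.columns)
    while remaining:
        # p-value of each candidate when added to the currently selected features
        p_values = {}
        for candidate in remaining:
            model = sm.OLS(y, sm.add_constant(X[selected + [candidate]])).fit()
            p_values[candidate] = model.pvalues[candidate]
        best = min(p_values, key=p_values.get)
        if p_values[best] < alpha:
            selected.append(best)
            remaining.remove(best)
        else:
            break                                # no remaining candidate is significant
    return selected
```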

Bidirectional elimination

A combination of backward elimination and forward selection (see the code sketch after the steps below).

  1. Pick two significance levels $\alpha_{enter}$ and $\alpha_{stay}$.
  2. Perform one step of forward selection (selecting if $p < \alpha_{enter}$).
  3. Perform the entire backward elimination (removing a feature at each step if its $p > \alpha_{stay}$).

    In other words, features "stay" if $p < \alpha_{stay}$.

  4. If no feature was added or removed in the previous two steps, stop.

    Otherwise, repeat.
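
A minimal sketch combining both directions, again with hypothetical `X` and `y`; it assumes $\alpha_{enter} \le \alpha_{stay}$ so that a feature that just entered is not immediately removed:

```python
# Minimal sketch of bidirectional (stepwise) elimination; `X` is a pandas
# DataFrame of candidate features, `y` the response (hypothetical inputs).
import statsmodels.api as sm

def bidirectional_elimination(X, y, alpha_enter=0.05, alpha_stay=0.05):
    selected = []
    while True:
        changed = False

        # Forward step: add the most significant remaining feature if p < alpha_enter.
        remaining = [c for c in X.columns if c not in selected]
        if remaining:
            p_values = {
                c: sm.OLS(y, sm.add_constant(X[selected + [c]])).fit().pvalues[c]
                for c in remaining
            }
            best = min(p_values, key=p_values.get)
            if p_values[best] < alpha_enter:
                selected.append(best)
                changed = True

        # Backward steps: repeatedly drop the feature with the highest p-value
        # as long as it exceeds alpha_stay.
        while selected:
            model = sm.OLS(y, sm.add_constant(X[selected])).fit()
            p_values = model.pvalues.drop("const")
            worst = p_values.idxmax()
            if p_values[worst] > alpha_stay:
                selected.remove(worst)
                changed = True
            else:
                break

        if not changed:
            return selected
```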