R-Squared

Table of contents
  1. Coefficient of Determination $R^2$
    1. Explained Variation
    2. Interpretation
    3. Relationship with Correlation
  2. Adjusted $R^2$
    1. The issue with regular $R^2$
    2. Adjustment

Coefficient of Determination $R^2$

The coefficient of determination ($R^2$) is a statistical measure of how well a model captures the variation in the dependent variable.

So essentially, goodness of fit.

To understand $R^2$, we need to refer back to sum of squares.

Explained Variation

For models like linear regression fit with OLS, we have seen that the following decomposition holds:

\[SS_{tot} = SS_{exp} + SS_{res}\]

Since we want to know how much of the variation is explained by the model, we take the ratio of the explained variation to the total variation.

\[\frac{SS_{exp}}{SS_{tot}} = \frac{SS_{tot} - SS_{res}}{SS_{tot}} = 1 - \frac{SS_{res}}{SS_{tot}}\]

This is the definition of $R^2$.

$$ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} $$
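As a sanity check, here's a minimal numpy sketch (the toy data is made up for illustration) that fits an OLS line, verifies the decomposition, and computes $R^2$:

```python
import numpy as np

# Toy data (made up for illustration): a noisy linear relationship.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 2, size=x.size)

# Fit a line with ordinary least squares.
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

ss_tot = np.sum((y - y.mean()) ** 2)      # total variation
ss_exp = np.sum((y_hat - y.mean()) ** 2)  # variation explained by the model
ss_res = np.sum((y - y_hat) ** 2)         # residual (unexplained) variation

# For OLS with an intercept, the decomposition holds.
assert np.isclose(ss_tot, ss_exp + ss_res)

r_squared = 1 - ss_res / ss_tot
print(f"R^2 = {r_squared:.4f}")
```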

Interpretation

If the model captures all the variation in the dependent variable, the variation caused by error ($SS_{res}$) is zero, which would make $R^2$ equal to 1. This indicates that the model perfectly fits the data.

The baseline model, which simply predicts the mean of the dependent variable for every observation, has $R^2$ equal to 0.

Any model that performs worse than the baseline model will have a negative $R^2$.

$$ \begin{align*} R^2 = 1 &\Rightarrow \text{perfect fit} \\[0.5em] R^2 = 0 &\Rightarrow \text{baseline model} \\[0.5em] R^2 < 0 &\Rightarrow \text{worse than baseline model} \end{align*} $$

So generally, the higher the $R^2$, the better the model fits the data.

Anything below 0 means you should reconsider your model or check for a mistake, because you're doing worse than the bare minimum of always predicting the mean.
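A quick sketch of all three cases, using a small hypothetical `r_squared` helper:

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

print(r_squared(y, y))                          # perfect fit         -> 1.0
print(r_squared(y, np.full_like(y, y.mean())))  # mean baseline       -> 0.0
print(r_squared(y, y[::-1]))                    # worse than baseline -> -3.0
```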

Just because a model has a high $R^2$ doesn’t mean it’s a good model.

$R^2$ is not a good measure for non-linear models: the sum-of-squares decomposition doesn't hold for them, so $R^2$ loses its interpretation as the proportion of variation explained.

Relationship with Correlation

Review correlation from here.

Pearson's correlation coefficient $r$ is essentially the covariance of $X$ and $Y$, rescaled by their standard deviations to fit in $[-1, 1]$.

For simple linear regression models, $R^2 = r^2$.
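A small numpy check of this identity on made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(size=100)

# Simple (one-predictor) linear regression via least squares.
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
r = np.corrcoef(x, y)[0, 1]  # Pearson's r

assert np.isclose(r2, r ** 2)  # R^2 equals the squared correlation
```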


Adjusted $R^2$

The issue with regular $R^2$

When you add more predictors/features to your model, $R^2$ will never decrease (and in practice it almost always increases).

This is because as your model gets more complex, $SS_{tot}$ stays the same, but $SS_{res}$ can only shrink (to be more precise, it does not increase).

Remember, $SS_{tot}$ has nothing to do with the model; it depends only on the data.

Intuition

Think of what it means to increase the complexity of the model.

Before, you had just a rigid line to fit the data; now you've added some features, so the model is flexible enough to fit a more complex curve.

That extra flexibility lets the fit hug the data more closely, so $SS_{res}$ should have decreased.

This results in multiple issues:

  • $R^2$ is a positively biased estimator of the population $R^2$ (it overshoots on average)
  • A bias towards more complex models
  • Overfitting
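To see this behavior concretely, here's a minimal sketch (toy data and a hypothetical `ols_r2` helper): tacking a pure-noise predictor onto the model still raises $R^2$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

def ols_r2(X, y):
    """Fit OLS with an intercept and return R^2."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

X = x.reshape(-1, 1)
print(ols_r2(X, y))  # R^2 with the one real predictor

# Add a predictor that is pure noise: R^2 still creeps up.
X_noise = np.column_stack([X, rng.normal(size=n)])
print(ols_r2(X_noise, y))
```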

Adjustment

To account for the bias, we penalize the $R^2$ by the number of predictors $k$ used in the model.

The penalty is defined as:

\[\frac{n - 1}{n - k - 1}\]

where $n$ is the sample size.

Notice that the penalty is 1 when $k = 0$, and it increases as $k$ increases.

Remembering that $R^2$ is defined as:

\[R^2 = 1 - \frac{SS_{res}}{SS_{tot}}\]

We can penalize $R^2$ by bumping up the subtracted term with the penalty:

\[\frac{SS_{res}}{SS_{tot}} \times \frac{n-1}{n-k-1}\]

Substituting $\frac{SS_{res}}{SS_{tot}} = 1 - R^2$, we arrive at the definition of adjusted $R^2$:

$$ R^2_{adj} = 1 - (1 - R^2) \cdot \frac{n-1}{n-k-1} $$
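A tiny sketch with hypothetical numbers: a noise feature that barely raises $R^2$ still lowers $R^2_{adj}$ once the penalty is applied:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for a model with k predictors fit on n samples."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical numbers: a noise feature nudges R^2 up a hair...
print(adjusted_r2(r2=0.8012, n=100, k=1))  # ~0.7992
print(adjusted_r2(r2=0.8015, n=100, k=2))  # ~0.7974 -- adjusted R^2 went DOWN
```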