R-Squared
Coefficient of Determination $R^2$
The coefficient of determination ($R^2$) is a statistical measure of how well the model captures the variation in the dependent variable.
So essentially, goodness of fit.
To understand $R^2$, we need to refer back to sum of squares.
Explained Variation
In cases like linear regression with OLS, we have seen that the following decomposition is true:
\[SS_{tot} = SS_{exp} + SS_{res}\]

Since we want to know how much of the variation is explained by the model, we solve for the proportion of the explained variation among the total variation.
\[\frac{SS_{exp}}{SS_{tot}} = \frac{SS_{tot} - SS_{res}}{SS_{tot}} = 1 - \frac{SS_{res}}{SS_{tot}}\]

This is the definition of $R^2$.
$$ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} $$
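To make the definition concrete, here is a minimal sketch in Python with NumPy (the helper name `r_squared` is just for illustration) that computes $R^2$ directly from the two sums of squares:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, computed directly from the definition."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1 - ss_res / ss_tot
```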
Interpretation
If the model captures all the variation in the dependent variable, the variation caused by error ($SS_{res}$) is zero, which would make $R^2$ equal to 1. This indicates that the model perfectly fits the data.
The baseline model, which always predicts the mean of the dependent variable, results in $R^2$ equal to 0.
Any model that performs worse than the baseline model will have a negative $R^2$.
$$ \begin{align*} R^2 = 1 &\Rightarrow \text{perfect fit} \\[0.5em] R^2 = 0 &\Rightarrow \text{baseline model} \\[0.5em] R^2 < 0 &\Rightarrow \text{worse than baseline model} \end{align*} $$
So generally, the higher the $R^2$ the better the model fits the data.
Anything below 0 means you should really reconsider your model or check for a mistake, because you’re doing worse than the bare minimum, which is just always predicting the mean.
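As a quick illustration (a Python/NumPy sketch with made-up numbers), the baseline model lands exactly at $R^2 = 0$, and predictions worse than the mean go negative:

```python
import numpy as np

def r_squared(y_true, y_pred):
    return 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - np.mean(y_true)) ** 2)

y = np.array([2.0, 4.0, 6.0, 8.0])

baseline = np.full_like(y, y.mean())    # always predict the mean
worse = np.array([8.0, 2.0, 9.0, 1.0])  # predictions worse than the mean

print(r_squared(y, baseline))  # 0.0  -> baseline model
print(r_squared(y, worse))     # -3.9 -> worse than the baseline
```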
Just because a model has a high $R^2$ doesn’t mean it’s a good model.
$R^2$ is not a good measure for non-linear models, because the sum of squares decomposition doesn’t hold for them.
Relationship with Correlation
Review correlation from here.
Pearson’s correlation coefficient $r$ is essentially the covariance of $X$ and $Y$ rescaled to lie in $[-1, 1]$.
For simple linear regression models, $R^2 = r^2$.
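A quick numerical check of this identity (a sketch in Python/NumPy on simulated data, with the OLS fit done via `np.polyfit`):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + 1.0 + rng.normal(scale=0.5, size=100)

# Fit simple linear regression (OLS) and compute fitted values
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

# R^2 from the sum-of-squares definition
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

# Pearson correlation between x and y
r = np.corrcoef(x, y)[0, 1]

print(np.isclose(r2, r ** 2))  # True
```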
Adjusted $R^2$
The issue with regular $R^2$
When you add more predictors/features to your model, $R^2$ will never decrease; in practice it almost always increases.
This is because as your model gets more complex, $SS_{tot}$ stays the same (remember, $SS_{tot}$ depends only on the data, not on the model), but $SS_{res}$ can only decrease, or more precisely, it cannot increase.
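Here is a small sketch (Python/NumPy on simulated data; the helper `ols_r2` is hypothetical) illustrating this: adding a feature made of pure noise does not lower $R^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

def ols_r2(features, y):
    """Fit OLS with an intercept and return R^2 of the fitted values."""
    X = np.column_stack([np.ones(len(y)), features])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

noise = rng.normal(size=n)  # a feature with no relation to y

print(ols_r2(x, y))                            # R^2 with one real feature
print(ols_r2(np.column_stack([x, noise]), y))  # not lower, despite the junk feature
```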
Intuition
Think of what it means to increase the complexity of the model.
Before, you had just a rigid line to fit your data, but now you’ve added some features so the model is flexible enough to fit a more complex curve.
That extra flexibility lets the model get at least as close to the data as before, so $SS_{res}$ should have decreased (or at worst stayed the same).
This results in multiple issues:
- $R^2$ is a positively biased estimator (it tends to overestimate how well the model explains the data)
- Bias towards complex models
- Overfitting
Adjustment
To account for the bias, we penalize the $R^2$ by the number of predictors $k$ used in the model.
The penalty is defined as:
\[\frac{n - 1}{n - k - 1}\]

where $n$ is the sample size.
Notice that the penalty is 1 when $k = 0$, and it increases as $k$ increases.
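For a concrete sense of the scale, here is a quick check of the penalty factor (a Python sketch, assuming a sample size of $n = 50$):

```python
def penalty(n, k):
    """Adjustment factor (n - 1) / (n - k - 1)."""
    return (n - 1) / (n - k - 1)

for k in (0, 5, 10):
    print(k, round(penalty(50, k), 3))  # 0 -> 1.0, 5 -> 1.114, 10 -> 1.256
```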
Remembering that $R^2$ is defined as:
\[R^2 = 1 - \frac{SS_{res}}{SS_{tot}}\]

We can penalize $R^2$ by bumping up the subtracted term with the penalty:
\[\frac{SS_{res}}{SS_{tot}} \times \frac{n-1}{n-k-1}\]

Since $\frac{SS_{res}}{SS_{tot}} = 1 - R^2$, substituting gives the definition for adjusted $R^2$:
$$ R^2_{adj} = 1 - (1 - R^2) \cdot \frac{n-1}{n-k-1} $$
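A minimal sketch of the formula in Python (the function name `adjusted_r_squared` and the numbers are just illustrative); note how the same raw $R^2$ is penalized harder as $k$ grows:

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Same raw R^2, but the model using more predictors is penalized harder.
print(adjusted_r_squared(0.80, n=50, k=2))   # ~0.791
print(adjusted_r_squared(0.80, n=50, k=10))  # ~0.749
```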