Bias-Variance Tradeoff
Explanation
The bias-variance tradeoff describes a dilemma in model training.
A more complex model is more flexible, which lowers its bias.
Remember that bias is the difference between the expected value of the estimator and the true value, $\E[\hat{\theta}] - \theta$. In other words, it is the systematic error that remains even after averaging over many training sets.
However, as the model becomes more complex, sensitivity to different training data increases, which increases the variance of the model.
High variance of a model can be a sign of overfitting.
On the other hand, if you constrain the model to keep its variance low, you tend to end up with a high bias.
Summary:
- Flexible: Low bias, high variance
- Rigid: High bias, low variance
As the number of observations increases, the chance of overfitting, and thus variance, decreases. So if we have a lot of data, complex and flexible models generally perform better.
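A small simulation can make the effect of sample size concrete. The sketch below is illustrative only: the true function $\sin(2\pi x)$, the noise level, the degree-9 polynomial, and the helper name `prediction_variance` are all assumptions, not something taken from the text above. It refits the same flexible model on many simulated training sets and measures how much its prediction at a single point varies.

```python
import numpy as np
from numpy.polynomial import Polynomial


def prediction_variance(n_train, degree=9, n_repeats=1000,
                        noise_sd=0.3, x0=0.25, seed=0):
    """Variance of a polynomial fit's prediction at x0 across training sets."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_train)       # fixed inputs (assumed design)
    signal = np.sin(2 * np.pi * x)            # assumed true function
    preds = np.empty(n_repeats)
    for i in range(n_repeats):
        # New noise draw = new training set from the same process.
        y = signal + rng.normal(0.0, noise_sd, size=n_train)
        preds[i] = Polynomial.fit(x, y, degree)(x0)
    return preds.var()


variances = {n: prediction_variance(n) for n in (20, 50, 200)}
for n, v in variances.items():
    print(f"n_train={n:4d}  variance of prediction={v:.5f}")
```

Under these assumptions the estimated variance should shrink as `n_train` grows, which is why a flexible model becomes less risky when more data is available.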
Mean Squared Error (MSE) Decomposition
- See also MSE of an estimator.
We know that the difference between an estimate and the true value is often measured by the mean squared error (MSE).
As explained in the above link, the MSE decomposes into the squared bias and the variance of the estimator:
$$ \text{MSE}(\hat{y}) = \text{bias}^2(\hat{y}) + \Var(\hat{y}) $$
Remember that $\text{bias}(\hat{y}) = \E[\hat{y}] - y$.
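The decomposition can be checked numerically. The following is a minimal sketch, assuming a deliberately biased estimator (a sample mean multiplied by a hypothetical shrinkage factor of 0.8) and normally distributed data; none of these choices come from the text above. It estimates the MSE, the bias, and the variance from repeated samples and compares both sides of the identity.

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 2.0         # true value being estimated (assumed)
sigma = 1.0         # standard deviation of each observation (assumed)
n = 10              # observations per sample
shrink = 0.8        # hypothetical shrinkage factor that makes the estimator biased
n_repeats = 100_000

# Each row is one independent sample; each estimate is a shrunken sample mean.
samples = rng.normal(theta, sigma, size=(n_repeats, n))
estimates = shrink * samples.mean(axis=1)

mse = np.mean((estimates - theta) ** 2)
bias = estimates.mean() - theta
variance = estimates.var()   # ddof=0, so the identity holds exactly for the sample

print(f"MSE               = {mse:.4f}")
print(f"bias^2 + variance = {bias**2 + variance:.4f}")
```

For this setup the theoretical values are a bias of $(0.8 - 1) \cdot 2 = -0.4$ and a variance of $0.8^2 \cdot 1 / 10 = 0.064$, so the simulated numbers should land close to those.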
Irreducible Error
In model training, there is an additional error term called the irreducible error, which is the variance of the noise term $\epsilon$ in the model:
$$ \text{MSE}(\hat{y}) = \text{bias}^2(\hat{y}) + \Var(\hat{y}) + \boldsymbol{\Var(\epsilon)} $$
As mentioned in regression function, $\text{bias}^2(\hat{y}) + \Var(\hat{y})$ is the reducible error, and we try to minimize this part.
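A short derivation shows where the extra $\Var(\epsilon)$ term comes from. This is a sketch under assumptions that are not spelled out above: the data follow $y = f(x) + \epsilon$ with $\E[\epsilon] = 0$, the noise $\epsilon$ at the new point is independent of $\hat{y}$, $f(x)$ is fixed, and the bias is measured against $f(x)$.

$$
\begin{aligned}
\E[(y - \hat{y})^2] &= \E[(f(x) - \hat{y} + \epsilon)^2] \\
&= \E[(f(x) - \hat{y})^2] + 2\,\E[\epsilon]\,\E[f(x) - \hat{y}] + \E[\epsilon^2] \\
&= \underbrace{(\E[\hat{y}] - f(x))^2}_{\text{bias}^2(\hat{y})} + \Var(\hat{y}) + \Var(\epsilon)
\end{aligned}
$$

The cross term vanishes because $\E[\epsilon] = 0$, and $\E[\epsilon^2] = \Var(\epsilon)$ for the same reason. No choice of model affects $\Var(\epsilon)$, which is why it is called irreducible.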
Ideally, we'd like to minimize MSE by decreasing both bias and variance, but the bias-variance tradeoff tells us that we cannot have the best of both worlds.
An attempt to lower the bias of an estimator $\hat{y}$ (fitting the training data more closely, at the risk of overfitting) increases its variance, and vice versa.
Therefore, MSE does not necessarily get any better by simply reducing one of them.
Instead, MSE is minimized at the level of complexity where the sum $\text{bias}^2(\hat{y}) + \Var(\hat{y})$ is smallest, which is a balance between the two rather than the extreme of either.
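The tradeoff can be illustrated by sweeping model complexity and tracking both terms. The sketch below is only an illustration under assumed settings: polynomial regression of increasing degree, a true function $\sin(2\pi x)$, Gaussian noise, and a hypothetical helper `bias_variance_by_degree`. It fits each degree to many simulated training sets and averages the squared bias and the variance over a grid of test points.

```python
import numpy as np
from numpy.polynomial import legendre


def bias_variance_by_degree(degrees, n_train=30, n_repeats=500,
                            noise_sd=0.3, seed=0):
    """Return (degree, bias^2, variance, sum) for each polynomial degree."""
    rng = np.random.default_rng(seed)
    true_f = lambda x: np.sin(2 * np.pi * x)   # assumed true function
    x_train = np.linspace(0.0, 1.0, n_train)
    x_test = np.linspace(0.0, 1.0, 101)
    # One row per simulated training set (same inputs, fresh noise).
    Y = true_f(x_train) + rng.normal(0.0, noise_sd, size=(n_repeats, n_train))
    rows = []
    for d in degrees:
        # Legendre basis on [-1, 1] spans the same polynomials as powers of x
        # but keeps the least-squares problem well conditioned.
        A_train = legendre.legvander(2 * x_train - 1, d)
        A_test = legendre.legvander(2 * x_test - 1, d)
        coefs, *_ = np.linalg.lstsq(A_train, Y.T, rcond=None)
        preds = (A_test @ coefs).T             # shape (n_repeats, n_test)
        bias_sq = np.mean((preds.mean(axis=0) - true_f(x_test)) ** 2)
        variance = np.mean(preds.var(axis=0))
        rows.append((d, bias_sq, variance, bias_sq + variance))
    return rows


rows = bias_variance_by_degree(range(1, 11))
for d, b2, var, total in rows:
    print(f"degree={d:2d}  bias^2={b2:.4f}  variance={var:.4f}  sum={total:.4f}")

best_degree = min(rows, key=lambda r: r[3])[0]
print("degree with the smallest bias^2 + variance:", best_degree)
```

In this setup the squared bias is largest for the most rigid model and the variance is largest for the most flexible one, so the smallest sum should appear at an intermediate degree. The exact degree depends on the assumed noise level and sample size.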