Bias-Variance Tradeoff

Table of contents
  1. Explanation
  2. Mean Squared Error (MSE) Decomposition

Explanation

The bias-variance tradeoff describes a dilemma in model training.

When a model is more complex, it becomes more flexible, which lowers its bias.

Remember that bias is the difference between the expected value of the estimator and the true value, $\E[\hat{\theta}] - \theta$. It is the systematic part of the error: the part that does not average out over different training sets.

However, as the model becomes more complex, its sensitivity to the particular training data it sees increases, which increases the variance of the model.

High variance of a model can be a sign of overfitting.

On the other hand, if you aim for low variance in training, you end up with a high bias.

Summary:

  • Flexible: Low bias, high variance
  • Rigid: High bias, low variance

As the number of observations increases, the chance of overfitting, and thus variance, decreases. So if we have a lot of data, complex and flexible models generally perform better.
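To make the summary concrete, here is a minimal simulation sketch. The NumPy setup, the toy true function $\sin(2\pi x)$, the noise level, and the choice of polynomial degrees are all illustrative assumptions, not a prescribed method. It refits a rigid model (degree 1) and a flexible model (degree 10) on many random training sets and estimates the bias and variance of each model's prediction at a fixed test point:

```python
import numpy as np

rng = np.random.default_rng(0)
NOISE_SD = 0.3  # assumed noise level in the toy data

def true_f(x):
    # Assumed "true" regression function for this toy example
    return np.sin(2 * np.pi * x)

def fit_and_predict(degree, x_test, n=30):
    """Draw one random training set, fit a polynomial of the given
    degree, and return its prediction at x_test."""
    x = rng.uniform(0, 1, n)
    y = true_f(x) + rng.normal(0, NOISE_SD, n)
    coefs = np.polyfit(x, y, degree)
    return np.polyval(coefs, x_test)

x_test = 0.25
for degree in (1, 10):  # rigid vs flexible
    preds = np.array([fit_and_predict(degree, x_test) for _ in range(2000)])
    bias = preds.mean() - true_f(x_test)  # E[y_hat] - y
    print(f"degree {degree:2d}: bias^2 = {bias**2:.4f}, "
          f"variance = {preds.var():.4f}")
```

The rigid line shows large squared bias and small variance; the degree-10 fit flips that. Increasing `n` shrinks the variance of both, which is the large-data effect mentioned above.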


Mean Squared Error (MSE) Decomposition

We know that the difference between an estimate and the true value is often measured by the mean squared error (MSE).

The MSE decomposes into the squared bias and the variance of the estimator:

$$ \text{MSE}(\hat{y}) = \text{bias}^2(\hat{y}) + \Var(\hat{y}) $$

Remember that $\text{bias}(\hat{y}) = \E[\hat{y}] - y$.
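The decomposition follows by adding and subtracting $\E[\hat{y}]$ inside the square (treating the true value $y$ as fixed):

$$
\begin{aligned}
\text{MSE}(\hat{y}) &= \E\big[(\hat{y} - y)^2\big] \\
&= \E\big[(\hat{y} - \E[\hat{y}] + \E[\hat{y}] - y)^2\big] \\
&= \E\big[(\hat{y} - \E[\hat{y}])^2\big] + 2\big(\E[\hat{y}] - y\big)\,\E\big[\hat{y} - \E[\hat{y}]\big] + \big(\E[\hat{y}] - y\big)^2 \\
&= \Var(\hat{y}) + \text{bias}^2(\hat{y})
\end{aligned}
$$

The cross term vanishes because $\E\big[\hat{y} - \E[\hat{y}]\big] = 0$.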

Irreducible Error

In model training there is an additional error called irreducible error, which is the variance of the noise term $\epsilon$ in the model:

$$ \text{MSE}(\hat{y}) = \text{bias}^2(\hat{y}) + \Var(\hat{y}) + \boldsymbol{\Var(\epsilon)} $$

The first two terms, $\text{bias}^2(\hat{y}) + \Var(\hat{y})$, make up the reducible error, which is the part we try to minimize; the noise variance $\Var(\epsilon)$ sets a floor that no model can go below.
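This three-term identity can be checked numerically. The sketch below reuses the same toy setup and assumed noise level as before: it scores a degree-3 polynomial against fresh noisy observations and compares the average squared error with $\text{bias}^2(\hat{y}) + \Var(\hat{y}) + \Var(\epsilon)$:

```python
import numpy as np

rng = np.random.default_rng(1)
NOISE_SD = 0.3  # assumed noise level, so Var(eps) = NOISE_SD**2

def true_f(x):
    # Assumed "true" regression function for this toy example
    return np.sin(2 * np.pi * x)

x_test, trials = 0.25, 20000
preds = np.empty(trials)
sq_err = np.empty(trials)
for i in range(trials):
    # Fresh training set -> fresh estimator
    x = rng.uniform(0, 1, 30)
    y = true_f(x) + rng.normal(0, NOISE_SD, 30)
    preds[i] = np.polyval(np.polyfit(x, y, 3), x_test)
    # Fresh noisy observation at the test point
    y_new = true_f(x_test) + rng.normal(0, NOISE_SD)
    sq_err[i] = (preds[i] - y_new) ** 2

bias2 = (preds.mean() - true_f(x_test)) ** 2
var = preds.var()
print(f"empirical MSE        = {sq_err.mean():.4f}")
print(f"bias^2 + var + sd^2  = {bias2 + var + NOISE_SD**2:.4f}")
# The two printed values should agree up to simulation noise.
```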

Ideally, we’d like to minimize MSE by decreasing both bias and variance, but the bias-variance tradeoff tells us that we cannot have the best of both worlds.

Attempting to lower the bias of an estimator $\hat{y}$ (fitting the training data more closely, at the risk of overfitting) increases the variance of the estimator, and vice versa.

Therefore, the MSE cannot be improved by simply driving down one of the two.

Instead, MSE is minimized when the bias and variance are balanced.
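One way to see the balance point is to sweep the model complexity. Continuing the toy polynomial example (all settings are again illustrative assumptions), the reducible error $\text{bias}^2 + \Var$ typically falls and then rises as the degree grows, with its minimum at an intermediate degree:

```python
import numpy as np

rng = np.random.default_rng(2)
NOISE_SD = 0.3  # assumed noise level in the toy data

def true_f(x):
    # Assumed "true" regression function for this toy example
    return np.sin(2 * np.pi * x)

def bias2_var(degree, x_test=0.25, trials=3000, n=30):
    """Estimate squared bias and variance of the prediction at x_test
    for a polynomial fit of the given degree."""
    preds = np.empty(trials)
    for i in range(trials):
        x = rng.uniform(0, 1, n)
        y = true_f(x) + rng.normal(0, NOISE_SD, n)
        preds[i] = np.polyval(np.polyfit(x, y, degree), x_test)
    b2 = (preds.mean() - true_f(x_test)) ** 2
    return b2, preds.var()

for d in range(1, 11):
    b2, v = bias2_var(d)
    print(f"degree {d:2d}: bias^2={b2:.4f}  var={v:.4f}  "
          f"reducible={b2 + v:.4f}")
```

In this toy setting, the squared bias shrinks with degree while the variance grows, and the reducible error bottoms out somewhere in between: the balanced point the tradeoff describes.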
