Error vs Residual

In the context of regression, the terms error and residual are often used interchangeably.

However, there is a subtle difference between the two.

Table of contents
  1. Recap of Model and Error
  2. Residual Between Data and Estimated Model
  3. Difference Between Error and Residual

Recap of Model and Error

The true model $f$ captures the relationship between the input features $X$ and the output $Y$:

$$ Y = f(X) + \epsilon $$

where $\epsilon$ is the error term.

This error is the irreducible uncertainty that arises because $X$ may not be the only relevant variable for predicting $Y$.

All other unobserved factors that affect $Y$ are captured in this error term.

So even if we have the true model $f$, we can’t predict the output $Y$ perfectly.
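To see this concretely, here is a minimal simulation sketch (the linear $f$ and Gaussian noise are assumptions made only for illustration): even predictions made with the true $f$ are left with the variance of $\epsilon$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true model: f(X) = 1 + 2X (an assumption for illustration only)
def f(x):
    return 1.0 + 2.0 * x

x = rng.uniform(0, 10, size=10_000)
epsilon = rng.normal(0.0, 1.0, size=10_000)  # unobserved factors
y = f(x) + epsilon                           # Y = f(X) + epsilon

# Even with the true f, the prediction error does not vanish:
mse_true_model = np.mean((y - f(x)) ** 2)
print(f"MSE using the true f: {mse_true_model:.3f}")  # ~1.0 = Var(epsilon)
```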


Residual Between Data and Estimated Model

The goal of regression is to estimate the true model $f$.

For example, we assume a true model:

\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i\]

Then we estimate this structure using the data we have by learning the coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$:

\[\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\]
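As a sketch of what "learning the coefficients" means here, the snippet below fits $\hat{\beta}_0$ and $\hat{\beta}_1$ with the closed-form least-squares formulas on synthetic data (the data-generating values are assumptions, not something from this post):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=100)  # assumed beta_0=1, beta_1=2

# Closed-form ordinary least squares estimates
x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

y_hat = beta0_hat + beta1_hat * x  # fitted values
print(beta0_hat, beta1_hat)        # close to, but not exactly, 1 and 2
```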

The residual is the difference between the observed value $y_i$ and the estimated value $\hat{y}_i$:

$$ y_i - \hat{y}_i $$

RSS (Residual Sum of Squares) is the sum of the squared residuals:

\[\text{RSS} = \sum_{i=1}^n (y_i - \hat{y}_i)^2\]
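Continuing in the same spirit, here is a short self-contained sketch (again on made-up data) that computes the residuals and the RSS from a least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=100)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=100)

# np.polyfit with deg=1 returns [slope, intercept]
beta1_hat, beta0_hat = np.polyfit(x, y, deg=1)
y_hat = beta0_hat + beta1_hat * x

residuals = y - y_hat          # y_i - y_hat_i
rss = np.sum(residuals ** 2)   # residual sum of squares
print(f"RSS = {rss:.2f}")
```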

Difference Between Error and Residual

The error is the part of $Y$ that the input features cannot explain, and it is unobservable.

The residual is computed from the observed data and the estimated model, and can therefore be reduced by adjusting the model.

Therefore, you could say that the residual is an estimate of the error.

That is why the terms are often used interchangeably.
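In a simulation we know $\epsilon$ exactly, which is never possible with real data, so we can check how closely the residuals track the true errors. A minimal sketch under the same assumed linear setup:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=500)
epsilon = rng.normal(0.0, 1.0, size=500)  # true errors (unobservable in practice)
y = 1.0 + 2.0 * x + epsilon

beta1_hat, beta0_hat = np.polyfit(x, y, deg=1)
residuals = y - (beta0_hat + beta1_hat * x)

# Residuals closely track the simulated errors
print(np.corrcoef(epsilon, residuals)[0, 1])  # close to 1
```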