Regression
Regression Function
We define models as functions that map inputs to outputs:
\[Y = f(X) + \epsilon\]

An observation is often a pair consisting of a predictor and a response:

\[(X, Y)\]

For a given $X = x$, the realization of $Y$ may not be unique (people with the same weight can have different heights).
Then what should our model $f$ spit out? We’ll say ideally:
$$ f(x) = \E[Y | X=x] $$
The regression function $f(x) = \E[Y | X=x]$ is the function that minimizes the mean squared error $\E[(Y - g(X))^2 | X=x]$ over all choices of $g$.
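A quick numerical check of this fact: the sketch below, assuming NumPy is available, draws many realizations of $Y$ for one fixed $x$ (the height/weight numbers are purely illustrative) and confirms that the candidate prediction with the smallest empirical MSE is the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Many realizations of Y for one fixed X = x, e.g. heights (cm) of
# people who all share the same weight. The loc/scale values are
# made up for illustration.
y = rng.normal(loc=170.0, scale=8.0, size=10_000)

# Evaluate the empirical MSE for a grid of candidate predictions c.
candidates = np.linspace(150.0, 190.0, 401)
mse = [np.mean((y - c) ** 2) for c in candidates]

best = candidates[np.argmin(mse)]
print(f"MSE-minimizing prediction: {best:.2f}")
print(f"Sample mean of Y:          {y.mean():.2f}")  # agrees, up to grid resolution
```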
Why do we use mean squared error?
There are a few reasons why squared error is beneficial (illustrated by the sketch below):
- It is differentiable everywhere
- It penalizes large errors heavily, since the penalty grows quadratically
- It treats negative and positive errors symmetrically
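As a minimal illustration of these points (again assuming NumPy; the error values are arbitrary):

```python
import numpy as np

errors = np.array([-4.0, -1.0, -0.1, 0.1, 1.0, 4.0])

squared = errors ** 2      # grows quadratically with the error size
absolute = np.abs(errors)  # grows only linearly, for comparison

for e, sq, ab in zip(errors, squared, absolute):
    print(f"error {e:+5.1f}  squared {sq:6.2f}  absolute {ab:5.2f}")

# Both losses are symmetric in the sign of the error, but squared error
# punishes the +/-4.0 errors 16x as much as the +/-1.0 errors.
# Its derivative, 2e, is also defined everywhere, including at e = 0,
# where |e| has no derivative.
```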
We do not know the true model $f$. Therefore, we use an estimate $\hat{f}$:
$$ \begin{align*} \E[(Y - \hat{f}(X))^2 | X=x] &= \E[(f(X) + \epsilon - \hat{f}(X))^2 | X=x] \\[1em] &= [f(x) - \hat{f}(x)]^2 + \Var(\epsilon) \end{align*} $$
Derivation
First, recall that $\E[\epsilon] = 0$ and that $\epsilon$ is independent of $X$.
\[\begin{align*} &\E[(f(X) + \epsilon - \hat{f}(X))^2 | X=x] \\[0.5em] =\; &\E[(f(X) - \hat{f}(X))^2 + 2\epsilon(f(X) - \hat{f}(X)) + \epsilon^2 | X=x] \\[0.5em] =\; &(f(x) - \hat{f}(x))^2 + 2(f(x) - \hat{f}(x))\E[\epsilon | X=x] + \E[\epsilon^2 | X=x] \end{align*}\]

We know that $\E[\epsilon | X = x] = \E[\epsilon] = 0$ and:

\[\E[\epsilon^2 | X = x] = \E[\epsilon^2] = \E[\epsilon^2] - \E[\epsilon]^2 = \Var(\epsilon)\]

The first part:

\[[f(x) - \hat{f}(x)]^2\]

is called the reducible error, and the second part:

\[\Var(\epsilon)\]

is, of course, the irreducible error.
So the goal of regression is to estimate $f$ while minimizing the reducible error.
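The decomposition above can be verified by simulation. The sketch below is a toy example, not a fitting procedure: the true $f$, the noise level, and the deliberately imperfect $\hat{f}$ are all made-up choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    """The (normally unknown) true regression function."""
    return 2.0 * x + 1.0

def f_hat(x):
    """A deliberately imperfect estimate of f."""
    return 2.1 * x + 0.8

sigma = 0.5   # noise standard deviation, so Var(eps) = sigma**2
x0 = 3.0      # the point X = x at which we evaluate the error

# Monte Carlo estimate of E[(Y - f_hat(X))^2 | X = x0].
eps = rng.normal(0.0, sigma, size=1_000_000)
y = f(x0) + eps
mse = np.mean((y - f_hat(x0)) ** 2)

reducible = (f(x0) - f_hat(x0)) ** 2
irreducible = sigma ** 2
print(f"simulated MSE:            {mse:.4f}")
print(f"reducible + irreducible:  {reducible + irreducible:.4f}")
```

No matter how good $\hat{f}$ gets, the simulated MSE never drops below $\Var(\epsilon)$; only the $[f(x_0) - \hat{f}(x_0)]^2$ term can be driven to zero.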