Ordinary Least Squares
Ordinary Least Squares (OLS) Estimation
Least squares is a common estimation method for linear regression models.
The idea is to fit a model that minimizes some sum of squares (hence "least squares").
Ordinary least squares (OLS) minimizes the residual sum of squares.
$$ \hat{\beta} = \underset{\beta}{\operatorname{argmin}} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$
Since we are minimizing the sum of squares with respect to the parameters, we set the partial derivatives to zero and solve.
If $\varepsilon_i | X \sim N(0, \sigma^2)$, then OLS is the same as MLE.
Closed-Form for Simple Linear Regression
In simple linear regression:
\[\begin{align*} &\frac{\partial}{\partial \hat{\beta_0}} \sum_{i=1}^{n} (y_i - \hat{\beta_0} - \hat{\beta_1} x_i)^2 = 0 \\[1em] &\frac{\partial}{\partial \hat{\beta_1}} \sum_{i=1}^{n} (y_i - \hat{\beta_0} - \hat{\beta_1} x_i)^2 = 0 \end{align*}\]
Provided the $x_i$ are not all identical (so that $\sum_{i=1}^{n} (x_i - \bar{x})^2 \neq 0$), there is a closed-form solution for $\beta_0$ and $\beta_1$:
$$ \begin{gather*} \hat{\beta_1} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \\[1em] \hat{\beta_0} = \bar{y} - \hat{\beta_1} \bar{x} \end{gather*} $$
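As a quick sanity check, here is a minimal NumPy sketch that applies these closed-form formulas to simulated data (the data and variable names are my own, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate data from a known line: y = 2 + 3x + noise
x = rng.uniform(0, 10, size=100)
y = 2 + 3 * x + rng.normal(0, 1, size=100)

# Closed-form OLS estimates for simple linear regression
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()

print(beta0_hat, beta1_hat)  # should be close to (2, 3)
```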
Derivation
\[\begin{align*} &\frac{\partial}{\partial \hat{\beta_0}} \sum_{i=1}^{n} (y_i - \hat{\beta_0} - \hat{\beta_1} x_i)^2 = -2 \sum_{i=1}^{n} (y_i - \hat{\beta_0} - \hat{\beta_1} x_i) = 0 \\[0.5em] \iff& \sum_{i=1}^{n} y_i - n \hat{\beta_0} - \hat{\beta_1} \sum_{i=1}^{n} x_i = 0 \\[0.5em] \iff& \hat{\beta_0} = \frac{1}{n} \sum_{i=1}^{n} y_i - \hat{\beta_1} \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{y} - \hat{\beta_1} \bar{x} \end{align*}\]
Substituting $\hat{\beta_0} = \bar{y} - \hat{\beta_1} \bar{x}$ into the second equation:
\[\begin{align*} &\frac{\partial}{\partial \hat{\beta_1}} \sum_{i=1}^{n} (y_i - \hat{\beta_0} - \hat{\beta_1} x_i)^2 = -2 \sum_{i=1}^{n} x_i (y_i - \hat{\beta_0} - \hat{\beta_1} x_i) = 0 \\[0.5em] \iff& \sum_{i=1}^{n} x_i (y_i - \bar{y} + \hat{\beta_1} \bar{x} - \hat{\beta_1} x_i) = 0 \\[0.5em] \iff& \sum_{i=1}^{n} x_i (y_i - \bar{y}) - \hat{\beta_1} \sum_{i=1}^{n} x_i (x_i - \bar{x}) = 0 \\[0.5em] \iff& \hat{\beta_1} = \frac{\sum_{i=1}^{n} x_i(y_i - \bar{y})}{\sum_{i=1}^{n} x_i(x_i - \bar{x})} \end{align*}\]
Now note that $\sum_{i=1}^{n} (z_i - \bar{z}) = 0$ for any variable $z$, where $\bar{z} = \frac{1}{n} \sum_{i=1}^{n} z_i$ (the sample mean) is a constant.
Therefore, $\sum_{i=1}^{n} \bar{x} (y_i - \bar{y}) = 0$ and $\sum_{i=1}^{n} \bar{x} (x_i - \bar{x}) = 0$ since $\bar{x}$ is a constant.
Subtract each from the numerator and the denominator respectively to get the final form of $\hat{\beta_1}$.
Closed-Form Standard Error for SLR
The standard errors of the OLS estimators $\hat{\beta_0}$ and $\hat{\beta_1}$ are:
\[\begin{gather*} \hat{\text{SE}}(\hat{\beta_1}) = \sqrt{\frac{\hat{\sigma}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}} \\[1em] \hat{\text{SE}}(\hat{\beta_0}) = \sqrt{\hat{\sigma}^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right)} \end{gather*}\]
Here $\sigma^2 = \operatorname{Var}(\varepsilon)$, but this is usually unknown, so we use an estimate $\hat{\sigma}$ such as the Residual Standard Error (RSE):
$$ \text{RSE} = \sqrt{\frac{\text{RSS}}{n-p-1}} $$
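A short sketch of how these formulas might be applied, again on simulated data (variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 2 + 3 * x + rng.normal(0, 1, size=100)
n, p = len(x), 1

# Closed-form fit
sxx = np.sum((x - x.mean()) ** 2)
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / sxx
beta0_hat = y.mean() - beta1_hat * x.mean()

# Residual Standard Error as the estimate of sigma
rss = np.sum((y - beta0_hat - beta1_hat * x) ** 2)
rse = np.sqrt(rss / (n - p - 1))

# Plug the RSE into the closed-form standard errors
se_beta1 = np.sqrt(rse ** 2 / sxx)
se_beta0 = np.sqrt(rse ** 2 * (1 / n + x.mean() ** 2 / sxx))
print(se_beta0, se_beta1)
```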
Some geometrical intuition
Notice that the standard errors shrink as the dispersion of the $x_i$'s around the mean $\bar{x}$ grows: both formulas have $\sum_{i=1}^{n} (x_i - \bar{x})^2$ in the denominator.
A good estimator should have a small standard error, which means that OLS estimators are more stable when the $x_i$'s are spread out.
Imagine a 2D scatter plot of the data points. If all the points are clustered together like a ball, it would be hard to draw a definitive line that fits the data.
The slightest change in the observations would then result in a large change in the fitted line, hence high variance in the estimators.
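A small simulation (a rough illustration with made-up settings) shows this effect: the slope estimate varies far more across samples when the $x_i$'s are tightly clustered than when they are spread out.

```python
import numpy as np

rng = np.random.default_rng(2)

def slope_sd(x_spread, n=50, n_sims=2000):
    """Empirical standard deviation of the OLS slope when x is drawn
    uniformly from [-x_spread, x_spread]."""
    slopes = []
    for _ in range(n_sims):
        x = rng.uniform(-x_spread, x_spread, size=n)
        y = 2 + 3 * x + rng.normal(0, 1, size=n)
        slopes.append(np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2))
    return np.std(slopes)

print(slope_sd(0.5))  # x tightly clustered -> slope varies a lot across samples
print(slope_sd(5.0))  # x spread out        -> slope is much more stable
```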
Confidence Intervals for OLS Estimators
Knowing the standard errors, you can construct confidence intervals for the estimators:
$$ \hat{\beta}_j \pm t_{\alpha/2, n-2} \cdot \hat{\text{SE}}(\hat{\beta}_j) $$
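As a sketch on simulated data (with $n - 2$ degrees of freedom since this is simple linear regression; names are mine), the interval for the slope could be computed as:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=100)
y = 2 + 3 * x + rng.normal(0, 1, size=100)
n = len(x)

# Closed-form fit and standard error of the slope
sxx = np.sum((x - x.mean()) ** 2)
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / sxx
beta0_hat = y.mean() - beta1_hat * x.mean()
rse = np.sqrt(np.sum((y - beta0_hat - beta1_hat * x) ** 2) / (n - 2))
se_beta1 = rse / np.sqrt(sxx)

# 95% confidence interval for the slope
t_crit = stats.t.ppf(0.975, df=n - 2)
print(beta1_hat - t_crit * se_beta1, beta1_hat + t_crit * se_beta1)
```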
Hypothesis Testing for OLS Estimators
The null hypothesis:
$$ H_0: \beta_j = 0 $$
Can be interpreted as:
- In the presence of other predictors and the intercept, there is no significant relationship between $X_j$ and $Y$.
For the intercept $\beta_0$, however:
- When all predictors are zero, the expected value of $Y$ is not significantly different from zero.
t-Test
The null hypothesis is tested using the t-statistic:
$$ t = \frac{\hat{\beta}_1 - \beta_1}{\hat{\text{SE}}(\hat{\beta}_1)} $$
Assuming that the null hypothesis is true, we substitute $\beta_1 = 0$ and use the t-distribution to find the p-value.
Under the null hypothesis, the t-statistic is simply the ratio of the estimate to its standard error:
$$ t = \frac{\hat{\beta}_1}{\hat{\text{SE}}(\hat{\beta}_1)} $$
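A minimal sketch of the test on simulated data; `scipy.stats.linregress` is used here to avoid repeating the manual fit, and its reported p-value can be compared with the one computed from the ratio:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=100)
y = 2 + 3 * x + rng.normal(0, 1, size=100)
n = len(x)

# linregress returns the slope, its standard error, and the p-value of
# the two-sided t-test of H0: beta_1 = 0
fit = stats.linregress(x, y)

# The t-statistic is the ratio of the estimate to its standard error
t_stat = fit.slope / fit.stderr
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print(t_stat, p_value)
print(fit.pvalue)  # matches the p-value computed by hand
```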
Ratio of the estimate to the standard error
The estimate $\hat{\beta}$ itself is heavily influenced by the unit of measurement (e.g. whether you measure height in cm or in m matters).
Therefore, the strength of the relationship cannot be judged from the estimate alone, but from the ratio of the estimate to its standard error.
Closed-Form for Multiple Linear Regression
For multiple linear regression:
\[\boldsymbol{y} = \boldsymbol{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}\]
We want to minimize the RSS, which is:
\[\text{RSS} = (\boldsymbol{y} - \boldsymbol{X} \boldsymbol{\hat{\beta}})^\top (\boldsymbol{y} - \boldsymbol{X} \boldsymbol{\hat{\beta}})\]
Take the derivative w.r.t. $\boldsymbol{\hat{\beta}}$ and set it to zero:
\[\frac{\partial}{\partial \boldsymbol{\hat{\beta}}} \text{RSS} = -2 \boldsymbol{X}^\top \boldsymbol{y} + 2 \boldsymbol{X}^\top \boldsymbol{X} \boldsymbol{\hat{\beta}} = 0\]
Derivation
\[\begin{align*} \text{RSS} &= (\boldsymbol{y} - \boldsymbol{X} \boldsymbol{\hat{\beta}})^\top (\boldsymbol{y} - \boldsymbol{X} \boldsymbol{\hat{\beta}}) \\[0.5em] &= (\boldsymbol{y}^\top - \boldsymbol{\hat{\beta}}^\top \boldsymbol{X}^\top) \boldsymbol{y} - (\boldsymbol{y}^\top - \boldsymbol{\hat{\beta}}^\top \boldsymbol{X}^\top) \boldsymbol{X} \boldsymbol{\hat{\beta}} \\[0.5em] &= \boldsymbol{y}^\top \boldsymbol{y} - \boldsymbol{\hat{\beta}}^\top \boldsymbol{X}^\top \boldsymbol{y} - \boldsymbol{y}^\top \boldsymbol{X} \boldsymbol{\hat{\beta}} + \boldsymbol{\hat{\beta}}^\top \boldsymbol{X}^\top \boldsymbol{X} \boldsymbol{\hat{\beta}} \end{align*}\]
Now take the derivative w.r.t. $\boldsymbol{\hat{\beta}}$ (see vector calculus):
\[\begin{align*} \frac{\partial}{\partial \boldsymbol{\hat{\beta}}} \text{RSS} &= 0 - \boldsymbol{y}^\top \boldsymbol{X} - \boldsymbol{y}^\top \boldsymbol{X} + 2 \boldsymbol{\hat{\beta}}^\top \boldsymbol{X}^\top \boldsymbol{X} \\[0.5em] &= -2 \boldsymbol{y}^\top \boldsymbol{X} + 2 \boldsymbol{\hat{\beta}}^\top \boldsymbol{X}^\top \boldsymbol{X} = 0 \end{align*}\]
(Numerator layout was used here.)
\[\begin{align*} &\boldsymbol{\hat{\beta}}^\top \boldsymbol{X}^\top \boldsymbol{X} = \boldsymbol{y}^\top \boldsymbol{X} \\[0.5em] \iff &\boldsymbol{\hat{\beta}}^\top = \boldsymbol{y}^\top \boldsymbol{X} (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \\[0.5em] \iff &\boldsymbol{\hat{\beta}} = (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{y} \end{align*}\]
Note that $\boldsymbol{X}^\top \boldsymbol{X}$ is symmetric, so its inverse is symmetric as well, which justifies transposing both sides in the last step.
Solving for $\boldsymbol{\hat{\beta}}$:
$$ \boldsymbol{\hat{\beta}} = (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{y} $$
As discussed in linear regression, there must be no perfect collinearity between the features, i.e. $\boldsymbol{X}$ must have full column rank.
Otherwise, $\boldsymbol{X}^\top \boldsymbol{X}$ is also rank-deficient, and thus the inverse does not exist.
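Here is a short NumPy sketch of the matrix form on simulated data (the design and names are my own); in practice, solving the linear system or using `np.linalg.lstsq` is preferred over forming $(\boldsymbol{X}^\top \boldsymbol{X})^{-1}$ explicitly:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 200, 3

# Design matrix with an intercept column and p simulated features
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -3.0, 0.5])
y = X @ beta_true + rng.normal(0, 1, size=n)

# Normal equations: beta_hat = (X'X)^{-1} X'y
# (solving the linear system is preferable to inverting X'X explicitly)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # should be close to beta_true

# np.linalg.lstsq solves the same least-squares problem with a more
# numerically stable factorization
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_lstsq)
```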
Properties of OLS
Consistent
OLS estimators are consistent:
\[\hat{\beta} \xrightarrow{p} \beta\]
Asymptotically Normal
OLS estimators are asymptotically normal:
\[\frac{\hat{\beta} - \beta}{\hat{\text{SE}}(\hat{\beta})} \leadsto N(0, 1)\]
Hence you can construct approximate normal confidence intervals for $\beta$.
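A quick simulation of both properties for the SLR slope (illustrative settings only, not from the text):

```python
import numpy as np

rng = np.random.default_rng(6)

def slope_and_se(n):
    """Closed-form OLS slope and its standard error for a fresh simulated sample."""
    x = rng.uniform(0, 10, size=n)
    y = 2 + 3 * x + rng.normal(0, 1, size=n)
    sxx = np.sum((x - x.mean()) ** 2)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    b0 = y.mean() - b1 * x.mean()
    rse = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))
    return b1, rse / np.sqrt(sxx)

# Consistency: the estimate concentrates around the true slope (3) as n grows
for n in (10, 100, 1000, 10000):
    print(n, slope_and_se(n)[0])

# Asymptotic normality: (b1_hat - beta_1) / SE(b1_hat) behaves like N(0, 1),
# so roughly 95% of replications should land within +/- 1.96
z = np.array([(b - 3) / se for b, se in (slope_and_se(200) for _ in range(2000))])
print(np.mean(np.abs(z) < 1.96))
```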
Relation to MLE
As briefly mentioned in the beginning, if $\varepsilon_i \sim N(0, \sigma^2)$ (or $\varepsilon \sim N(\boldsymbol{0}, \sigma^2 \boldsymbol{I})$), then OLS estimation is equivalent to MLE.
In the model:
\[Y = X \beta + \varepsilon\]
The source of uncertainty is the error term $\varepsilon$, and thus the likelihood function should come from the distribution of $\varepsilon$.
If the $\varepsilon_i$ are IID $N(0, \sigma^2)$, then the likelihood function is:
\[\mathcal{L}(\beta) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2} (y_i - \boldsymbol{x}_i^\top \beta)^2\right)\]
The log-likelihood function is:
\[\ell(\beta) = \sum_{i=1}^n \left[ -\frac{1}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} (y_i - \boldsymbol{x}_i^\top \beta)^2 \right]\]
Take the derivative of $\ell(\beta)$ w.r.t. $\beta$ and set it to zero.
Dropping the constant terms, maximizing $\ell(\beta)$ is equivalent to minimizing the residual sum of squares, so the MLE coincides with OLS:
\[\hat{\beta}_{\text{MLE}} = \underset{\beta}{\operatorname{argmax}} \; \ell(\beta) = \underset{\beta}{\operatorname{argmin}} \sum_{i=1}^n (y_i - \boldsymbol{x}_i^\top \beta)^2 = \hat{\beta}_{\text{OLS}}\]
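To see the equivalence numerically, here is a sketch (simulated data and names of my own choosing) that maximizes the Gaussian log-likelihood directly and compares the result with the normal-equations solution:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
n = 200

# Simulated design matrix (intercept column + 2 features) and response
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(0, 1, size=n)

# OLS via the normal equations
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# MLE: minimize the negative Gaussian log-likelihood
# (sigma^2 is fixed at 1 here; its value does not affect the argmax over beta)
def neg_log_lik(beta, sigma2=1.0):
    resid = y - X @ beta
    return 0.5 * n * np.log(2 * np.pi * sigma2) + 0.5 * np.sum(resid ** 2) / sigma2

beta_mle = minimize(neg_log_lik, x0=np.zeros(X.shape[1])).x

print(beta_ols)
print(beta_mle)
print(np.allclose(beta_ols, beta_mle, atol=1e-4))  # the two estimates coincide
```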