Ordinary Least Squares

Table of contents
  1. Ordinary Least Squares (OLS) Estimation
  2. Closed-Form for Simple Linear Regression
    1. Closed-Form Standard Error for SLR
  3. Confidence Intervals for OLS Estimators
  4. Hypothesis Testing for OLS Estimators
    1. t-Test
  5. Closed-Form for Multiple Linear Regression
  6. Properties of OLS
    1. Consistent
    2. Asymptotically Normal
  7. Relation to MLE

Ordinary Least Squares (OLS) Estimation

Least squares is a common estimation method for linear regression models.

The idea is to fit a model that minimizes some sum of squares (i.e. creates the least squares).

Ordinary least squares (OLS) minimizes the residual sum of squares.

$$ \hat{\beta} = \underset{\beta}{\operatorname{argmin}} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

Since we are minimizing the sum of squares with respect to the parameters, we set the partial derivative with respect to each parameter to zero and solve.

If $\varepsilon_i | X \sim N(0, \sigma^2)$, then OLS is the same as MLE.
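Below is a minimal numerical sketch of this minimization on made-up data, using a generic optimizer instead of the closed forms derived in the next section (the dataset, seed, and true coefficients are all assumptions for illustration):

```python
# Minimize the residual sum of squares numerically on a toy dataset.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=50)  # true beta0 = 2, beta1 = 3

def rss(beta):
    """Residual sum of squares for a candidate (beta0, beta1)."""
    beta0, beta1 = beta
    return np.sum((y - beta0 - beta1 * x) ** 2)

result = minimize(rss, x0=[0.0, 0.0])
print(result.x)  # should be close to [2, 3]
```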


Closed-Form for Simple Linear Regression

In simple linear regression:

\[\begin{align*} &\frac{\partial}{\partial \hat{\beta_0}} \sum_{i=1}^{n} (y_i - \hat{\beta_0} - \hat{\beta_1} x_i)^2 = 0 \\[1em] &\frac{\partial}{\partial \hat{\beta_1}} \sum_{i=1}^{n} (y_i - \hat{\beta_0} - \hat{\beta_1} x_i)^2 = 0 \end{align*}\]

Provided the $x_i$'s are not all identical (so that $\sum_{i=1}^{n} (x_i - \bar{x})^2 \neq 0$), there is a closed-form solution for $\hat{\beta_0}$ and $\hat{\beta_1}$:

$$ \begin{gather*} \hat{\beta_1} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \\[1em] \hat{\beta_0} = \bar{y} - \hat{\beta_1} \bar{x} \end{gather*} $$

Derivation \[\begin{align*} &\frac{\partial}{\partial \hat{\beta_0}} \sum_{i=1}^{n} (y_i - \hat{\beta_0} - \hat{\beta_1} x_i)^2 = -2 \sum_{i=1}^{n} (y_i - \hat{\beta_0} - \hat{\beta_1} x_i) = 0 \\[0.5em] \iff& \sum_{i=1}^{n} y_i - n \hat{\beta_0} - \hat{\beta_1} \sum_{i=1}^{n} x_i = 0 \\[0.5em] \iff& \hat{\beta_0} = \frac{1}{n} \sum_{i=1}^{n} y_i - \hat{\beta_1} \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{y} - \hat{\beta_1} \bar{x} \end{align*}\] \[\begin{align*} &\frac{\partial}{\partial \hat{\beta_1}} \sum_{i=1}^{n} (y_i - \hat{\beta_0} - \hat{\beta_1} x_i)^2 = -2 \sum_{i=1}^{n} x_i (y_i - \hat{\beta_0} - \hat{\beta_1} x_i) = 0 \\[0.5em] \iff& \sum_{i=1}^{n} x_i (y_i - \bar{y} + \hat{\beta_1} \bar{x} - \hat{\beta_1} x_i) = 0 \\[0.5em] \iff& \sum_{i=1}^{n} x_i (y_i - \bar{y}) - \hat{\beta_1} \sum_{i=1}^{n} x_i (x_i - \bar{x}) = 0 \\[0.5em] \iff& \hat{\beta_1} = \frac{\sum_{i=1}^{n} x_i(y_i - \bar{y})}{\sum_{i=1}^{n} x_i(x_i - \bar{x})} \end{align*}\]

Now note that $\sum_{i=1}^{n} (z_i - \bar{z}) = 0$ for any variable $z$, where $\bar{z} = \frac{1}{n} \sum_{i=1}^{n} z_i$ (sample mean) is a constant.

Therefore, $\sum_{i=1}^{n} \bar{x} (y_i - \bar{y}) = 0$ and $\sum_{i=1}^{n} \bar{x} (x_i - \bar{x}) = 0$ since $\bar{x}$ is a constant.

Subtract each from the numerator and the denominator respectively to get the final form of $\hat{\beta_1}$.
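As a sanity check, here is a short sketch of these closed forms in code, compared against np.polyfit on the same made-up data (the dataset and true coefficients are assumptions for illustration):

```python
# Closed-form SLR estimates on a toy dataset.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=50)  # true beta0 = 2, beta1 = 3

x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

print(beta0_hat, beta1_hat)
print(np.polyfit(x, y, deg=1))  # same least-squares line: [slope, intercept]
```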

Closed-Form Standard Error for SLR

The standard errors of the OLS estimators $\hat{\beta_0}$ and $\hat{\beta_1}$ are:

\[\begin{gather*} \hat{\text{SE}}(\hat{\beta_1}) = \sqrt{\frac{\hat{\sigma}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}} \\[1em] \hat{\text{SE}}(\hat{\beta_0}) = \sqrt{\hat{\sigma}^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right)} \end{gather*}\]

Here $\sigma^2 = \operatorname{Var}(\varepsilon)$, which is usually unknown, so we use an estimate $\hat{\sigma}$ such as the Residual Standard Error (RSE), where $p$ is the number of predictors ($p = 1$ in SLR):

$$ \text{RSE} = \sqrt{\frac{\text{RSS}}{n-p-1}} $$
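Continuing the same sketch, the RSE and the standard errors follow directly from the formulas above; here $p = 1$, so the RSE denominator is $n - 2$ (the data are again made up):

```python
# RSE and closed-form standard errors for the SLR fit above.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=50)
n = len(x)

x_bar, y_bar = x.mean(), y.mean()
sxx = np.sum((x - x_bar) ** 2)
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / sxx
beta0_hat = y_bar - beta1_hat * x_bar

residuals = y - beta0_hat - beta1_hat * x
rse = np.sqrt(np.sum(residuals ** 2) / (n - 2))  # estimate of sigma

se_beta1 = np.sqrt(rse ** 2 / sxx)
se_beta0 = np.sqrt(rse ** 2 * (1 / n + x_bar ** 2 / sxx))
print(rse, se_beta0, se_beta1)
```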

Some geometrical intuition

Notice that the standard errors shrink as $\sum_{i=1}^{n} (x_i - \bar{x})^2$ grows, i.e. as the $x_i$'s become more dispersed around the mean $\bar{x}$.

A good estimator should have a small standard error, which means that OLS estimators are more stable when the $x_i$'s are spread out.

Imagine a 2D scatter plot of the data points. If all the points are clustered together like a ball, it would be hard to draw a definitive line that fits the data.

The slightest change in the observations would result in a large change in the fitted line, hence high variance in the estimators.


Confidence Intervals for OLS Estimators

Knowing the standard errors, you can construct a $(1 - \alpha)$ confidence interval for each estimator (with $n - 2$ degrees of freedom in SLR, $n - p - 1$ in general):

$$ \hat{\beta}_j \pm t_{\alpha/2, n-2} \cdot \hat{\text{SE}}(\hat{\beta}_j) $$
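For example, a 95% interval for $\hat{\beta}_1$ in SLR could be computed as follows (the estimate and standard error below are placeholder values, not real results):

```python
# t-based confidence interval for beta1 in SLR.
from scipy import stats

alpha = 0.05
n = 50                              # sample size (assumed)
beta1_hat, se_beta1 = 3.02, 0.05    # placeholder estimate and standard error

t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
ci = (beta1_hat - t_crit * se_beta1, beta1_hat + t_crit * se_beta1)
print(ci)
```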


Hypothesis Testing for OLS Estimators

The null hypothesis:

$$ H_0: \beta_j = 0 $$

Can be interpreted as:

  • Holding the other predictors and the intercept fixed, there is no linear relationship between $X_j$ and $Y$.

For the intercept $\beta_0$, however:

  • When all the predictors are zero, the expected value of $Y$ is zero.

t-Test

The null hypothesis is tested using the t-statistic:

$$ t = \frac{\hat{\beta}_1 - \beta_1}{\hat{\text{SE}}(\hat{\beta}_1)} $$

Assuming that the null hypothesis is true, we substitute $\beta_1 = 0$ and use the t-distribution (with $n - 2$ degrees of freedom in SLR) to find the p-value.

Under the null hypothesis, the t-statistic is simply the ratio of the estimate to its standard error:

$$ t = \frac{\hat{\beta}_1}{\hat{\text{SE}}(\hat{\beta}_1)} $$

Ratio of the estimate to its standard error

The estimate $\hat{\beta}$ itself depends heavily on the unit of measurement (e.g. whether height is measured in cm or m changes its magnitude).

Therefore, the strength of the linear relationship cannot be judged from the estimate alone, but from the ratio of the estimate to its standard error, which is unit-free.
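A quick sketch of the test in code, again with placeholder values for the estimate and standard error:

```python
# t-test for H0: beta1 = 0 in SLR.
from scipy import stats

n = 50                              # sample size (assumed)
beta1_hat, se_beta1 = 3.02, 0.05    # placeholder estimate and standard error

t_stat = beta1_hat / se_beta1                     # ratio of estimate to SE
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-sided p-value
print(t_stat, p_value)
```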


Closed-Form for Multiple Linear Regression

For multiple linear regression:

\[\boldsymbol{y} = \boldsymbol{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}\]

We want to minimize the RSS, which is:

\[\text{RSS} = (\boldsymbol{y} - \boldsymbol{X} \boldsymbol{\hat{\beta}})^T (\boldsymbol{y} - \boldsymbol{X} \boldsymbol{\hat{\beta}})\]

Take the derivative w.r.t. $\boldsymbol{\hat{\beta}}$ and set it to zero:

\[\frac{\partial}{\partial \boldsymbol{\hat{\beta}}} \text{RSS} = -2 \boldsymbol{X}^\top \boldsymbol{y} + 2 \boldsymbol{X}^\top \boldsymbol{X} \boldsymbol{\hat{\beta}} = 0\]
Derivation \[\begin{align*} \text{RSS} &= (\boldsymbol{y} - \boldsymbol{X} \boldsymbol{\hat{\beta}})^\top (\boldsymbol{y} - \boldsymbol{X} \boldsymbol{\hat{\beta}}) \\[0.5em] &= (\boldsymbol{y}^\top - \boldsymbol{\hat{\beta}}^\top \boldsymbol{X}^\top) \boldsymbol{y} - (\boldsymbol{y}^\top - \boldsymbol{\hat{\beta}}^\top \boldsymbol{X}^\top) \boldsymbol{X} \boldsymbol{\hat{\beta}} \\[0.5em] &= \boldsymbol{y}^\top \boldsymbol{y} - \boldsymbol{\hat{\beta}}^\top \boldsymbol{X}^\top \boldsymbol{y} - \boldsymbol{y}^\top \boldsymbol{X} \boldsymbol{\hat{\beta}} + \boldsymbol{\hat{\beta}}^\top \boldsymbol{X}^\top \boldsymbol{X} \boldsymbol{\hat{\beta}} \end{align*}\]

Now take the derivative w.r.t. $\boldsymbol{\hat{\beta}}$ (see vector calculus):

\[\begin{align*} \frac{\partial}{\partial \boldsymbol{\hat{\beta}}} \text{RSS} &= 0 - \boldsymbol{y}^\top \boldsymbol{X} - \boldsymbol{y}^\top \boldsymbol{X} + 2 \boldsymbol{\hat{\beta}}^\top \boldsymbol{X}^\top \boldsymbol{X} \\[0.5em] &= -2 \boldsymbol{y}^\top \boldsymbol{X} + 2 \boldsymbol{\hat{\beta}}^\top \boldsymbol{X}^\top \boldsymbol{X} = 0 \end{align*}\]

(Numerator layout was used here)

\[\begin{align*} &\boldsymbol{\hat{\beta}}^\top \boldsymbol{X}^\top \boldsymbol{X} = \boldsymbol{y}^\top \boldsymbol{X} \\[0.5em] \iff &\boldsymbol{\hat{\beta}}^\top = \boldsymbol{y}^\top \boldsymbol{X} (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \\[0.5em] \iff &\boldsymbol{\hat{\beta}} = (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{y} \end{align*}\]

Note that $\boldsymbol{X}^\top \boldsymbol{X}$ is a symmetric matrix, so its inverse is also symmetric (which is what allows transposing both sides in the last step).

Solving for $\boldsymbol{\hat{\beta}}$:

$$ \boldsymbol{\hat{\beta}} = (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{y} $$

As discussed in linear regression, there must be no perfect collinearity between the features, i.e. $\boldsymbol{X}$ must have full column rank.
Otherwise, $\boldsymbol{X}^\top \boldsymbol{X}$ is also rank-deficient, and thus the inverse does not exist.
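A sketch of the matrix closed form on simulated data; in practice, solving the normal equations (e.g. with np.linalg.solve or a QR-based routine like np.linalg.lstsq) is preferred over explicitly inverting $\boldsymbol{X}^\top \boldsymbol{X}$. The design, coefficients, and noise level below are all assumptions:

```python
# Closed-form MLR estimate on simulated data.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept column
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(0, 1, size=n)

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y          # textbook formula
beta_hat_stable = np.linalg.solve(X.T @ X, X.T @ y)  # numerically preferable
print(beta_hat)
print(beta_hat_stable)
```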


Properties of OLS

Consistent

OLS estimators are consistent:

\[\hat{\beta} \xrightarrow{p} \beta\]

Asymptotically Normal

OLS estimators are asymptotically normal:

\[\frac{\hat{\beta} - \beta}{\hat{\text{SE}}(\hat{\beta})} \xrightarrow{d} N(0, 1)\]

Hence, for large samples, you can construct approximate normal confidence intervals for $\beta$.
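A tiny simulation sketch of consistency: as $n$ grows, the slope estimate concentrates around the true value (the model settings below are arbitrary):

```python
# Consistency of the OLS slope: beta1_hat approaches beta1 as n grows.
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1 = 2.0, 3.0

for n in (10, 100, 10_000):
    x = rng.uniform(0, 10, size=n)
    y = beta0 + beta1 * x + rng.normal(0, 1, size=n)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    print(n, b1)  # b1 should get closer to 3 as n increases
```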


Relation to MLE

As briefly mentioned in the beginning, if $\varepsilon_i \sim N(0, \sigma^2)$ (or $\varepsilon \sim N(\boldsymbol{0}, \sigma^2 \boldsymbol{I})$), then OLS estimation is equivalent to MLE.

In the model:

\[Y = X \beta + \varepsilon\]

The source of uncertainty is the error term $\varepsilon$, and thus the likelihood function should come from the distribution of $\varepsilon$.

If the $\varepsilon_i$ are IID $N(0, \sigma^2)$, then the likelihood function is:

\[\mathcal{L}(\beta) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2} (y_i - \boldsymbol{x}_i^\top \beta)^2\right)\]

The log-likelihood function is:

\[\ell(\beta) = \sum_{i=1}^n \left[ -\frac{1}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} (y_i - \boldsymbol{x}_i^\top \beta)^2 \right]\]

Taking the derivative of $\ell(\beta)$ w.r.t. $\beta$ and setting it to zero gives the same normal equations as OLS.

More directly: the first term does not depend on $\beta$, so maximizing $\ell(\beta)$ is equivalent to minimizing the residual sum of squares, and the MLE coincides with OLS:

\[\hat{\beta}_{\text{MLE}} = \underset{\beta}{\operatorname{argmax}}\ \ell(\beta) = \underset{\beta}{\operatorname{argmin}} \sum_{i=1}^n (y_i - \boldsymbol{x}_i^\top \beta)^2 = \hat{\beta}_{\text{OLS}}\]
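A brief numerical check of this equivalence, assuming $\sigma$ is known and using a made-up design: minimizing the negative Gaussian log-likelihood over $\beta$ recovers the OLS solution.

```python
# OLS vs. Gaussian MLE: both should give (numerically) the same coefficients.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -1.0])
sigma = 1.0
y = X @ beta_true + rng.normal(0, sigma, size=n)

def neg_log_likelihood(beta):
    """Negative Gaussian log-likelihood with sigma treated as known."""
    resid = y - X @ beta
    return 0.5 * n * np.log(2 * np.pi * sigma ** 2) + np.sum(resid ** 2) / (2 * sigma ** 2)

beta_mle = minimize(neg_log_likelihood, x0=np.zeros(p + 1)).x
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_mle)
print(beta_ols)  # the two should match up to optimizer tolerance
```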