Intro to Statistical Machine Learning

Table of contents
  1. Response, Predictors, and Model
    1. True Model
    2. Goals of Model Learning
      1. Prediction
      2. Inference
    3. Noise / Error
  2. Parametric / Non-parametric Models
    1. Parametric and Structured Models
    2. Non-parametric Models
      1. Important Trade-Offs
  3. Regression / Classification
  4. Supervised / Unsupervised Learning
  5. Measuring Goodness of Fit
    1. Training and Testing
    2. Goodness of Fit in Classification

Response, Predictors, and Model

We often denote the response or target as:

$$ Y $$

and features, input, or predictors as:

$$ X_i $$

and these predictors are often collected into an input vector:

$$ X = (X_1, X_2, \ldots, X_p) $$
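
To make the notation concrete, here is a minimal sketch (assuming NumPy; the numbers are made up) of how the predictors and the response are usually stored, with each row of $X$ holding one observation's input vector:

```python
import numpy as np

# n = 3 observations, p = 2 predictors (made-up numbers for illustration)
X = np.array([[1.0, 2.0],    # observation 1: (X_1, X_2)
              [0.5, 1.5],    # observation 2
              [2.0, 0.3]])   # observation 3
Y = np.array([3.1, 2.2, 2.5])  # response for each observation

print(X.shape, Y.shape)  # (3, 2) (3,)
```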

True Model

The true model is a function $f$ that maps the input to the response:

$$ Y = f(X) + \epsilon $$

  • True model $f$ is deterministic and unknown

Goals of Model Learning

Our goal is to learn $\hat{f}$, an estimate of $f$. In regression, we usually do this by minimizing the mean squared error (MSE), i.e. the expectation of the squared difference between the response and the prediction:

\[\begin{equation} \label{eq:mse-dec} \overbrace{\E[(Y - \hat{Y})^2]}^{\text{MSE}} = \underbrace{(f(X) - \hat{f}(X))^2}_\text{Reducible Error} + \underbrace{\Var(\epsilon)}_\text{Irreducible Error} \end{equation}\]

Do not worry about how we get this decomposition for now (it requires a few assumptions); more details will come in the regression notes.

Just some blabbering about notation

Sometimes people write the expectation as:

\[\E[Y - \hat{Y}]^2\]

to mean the same thing as above. I personally find this notation confusing because it is not clear whether we are taking the expectation of the squared difference or squaring the expectation of the difference.

As the name suggests, irreducible error is indeed irreducible, so we try to minimize the reducible error to minimize the MSE.

That is the process of learning $\hat{f}$.
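
To see the decomposition in action, here is a small simulation (a sketch with an assumed true model $f(x) = 2x + 1$ and Gaussian noise): even predicting with the true $f$ cannot push the MSE below $\Var(\epsilon)$, while a misspecified $\hat{f}$ pays reducible error on top of it.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                        # assumed true model for this simulation
    return 2.0 * x + 1.0

n = 100_000
x = rng.uniform(0, 1, n)
eps = rng.normal(0, 0.5, n)      # noise: mean 0, Var(eps) = 0.25
y = f(x) + eps

# Even the true model is stuck with the irreducible error:
print(np.mean((y - f(x)) ** 2))        # ~0.25, i.e. Var(eps)

# A misspecified estimate adds reducible error on top:
f_hat = lambda x: 2.5 * x              # deliberately wrong
print(np.mean((y - f_hat(x)) ** 2))    # noticeably larger than 0.25
```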

Prediction

The prediction of the response given the observations is:

$$ \hat{Y} = \hat{f}(X) $$

Specifically, we are interested in getting $\hat{Y}$ upon learning $\hat{f}$. So in prediction, our end goal is not the model itself, but rather the prediction of the response.

Inference

In inference, we are interested in the relationship between the predictors and the response.

Same process of learning $\hat{f}$, but compared to prediction, we are more interested in the model itself (e.g., which predictors matter and how) than in the predicted values.

Noise / Error

$\epsilon$ or noise is a random variable that captures the variability in the response not explained by the true model:

\[Y = f(X) + \epsilon\]

Note the emphasis on true: even if we correctly identify the true model, there will still be some uncertainty in the response.

Take a look at the decomposition \eqref{eq:mse-dec} above again. Because the noise is the source of the irreducible uncertainty, it is also called the irreducible error.

Important assumptions about $\epsilon$:

Random variable $\epsilon$ is independent of $X$ and has a mean of 0.

These assumptions become critical in certain models and methods; the MSE decomposition \eqref{eq:mse-dec} above, for instance, relies on them.


Parametric / Non-parametric Models

Parametric and Structured Models

One example of a parametric, structured model is the linear model.

The parameters are the coefficients of the predictors and the structure is the linear relationship between the predictors and the response.
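
For instance, the standard linear model (the textbook form, stated here for concreteness) is

\[f(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p\]

so learning $\hat{f}$ reduces to estimating the $p + 1$ coefficients $\beta_0, \beta_1, \ldots, \beta_p$.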

Parametric models are effective when the true $f$ approximately follows the structure assumed by the model.
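
As a quick sketch (assuming NumPy and simulated data where the true $f$ really is linear), fitting the parametric model boils down to estimating a fixed, small set of numbers:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate data whose true model is linear (an assumption of this sketch)
n, p = 200, 2
X = rng.normal(size=(n, p))
y = 0.5 + X @ np.array([3.0, -1.5]) + rng.normal(0, 0.3, n)

# Least squares: the parameters are the intercept plus p coefficients
X1 = np.column_stack([np.ones(n), X])          # prepend intercept column
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(beta_hat)                                # ~ [0.5, 3.0, -1.5]
```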

Non-parametric Models

Non-parametric models, on the other hand, do not make strong assumptions about the structure of $f$.

Unlike parametric models, which carry the risk of completely missing the mark when the assumed structure is wrong, non-parametric models are generally more flexible and can capture more complex relationships.

However, non-parametric models require considerably more observations to obtain an accurate estimate of $f$.
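
As one common non-parametric example (k-nearest neighbors regression, chosen here purely for illustration), the "model" is essentially the training data itself, which is part of why more observations are needed:

```python
import numpy as np

def knn_predict(x0, X_train, y_train, k=5):
    """Predict f(x0) as the mean response of the k nearest training points."""
    nearest = np.argsort(np.abs(X_train - x0))[:k]   # 1-D predictor for simplicity
    return y_train[nearest].mean()

rng = np.random.default_rng(2)
X_train = rng.uniform(0, 10, 100)
y_train = np.sin(X_train) + rng.normal(0, 0.2, 100)  # nonlinear true f

print(knn_predict(3.0, X_train, y_train))  # ~ sin(3.0) ≈ 0.14
```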

Important Trade-Offs

Parametric models and non-parametric models have different trade-offs and are prone to different types of errors:

  • Interpretability vs Accuracy/Flexibility
  • Underfitting vs Overfitting
  • Efficiency (in terms of data/model size) vs Complexity

Regression / Classification

When the response is quantitative, we call it regression.

When the response is qualitative, we call it classification.

Not exactly, but let’s keep it simple for now.


Supervised / Unsupervised Learning

When there is a given response variable, we call it supervised learning.

  • Regression and classification

When there is no given response variable, we call it unsupervised learning.

  • Clustering

Measuring Goodness of Fit

As discussed above, in regression, the most common way to measure the goodness of fit is the mean squared error (MSE):

\[\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2\]
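
Written as code, the formula is just the average of the squared residuals (a minimal sketch assuming NumPy):

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: average of the squared residuals."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.mean((y - y_hat) ** 2)

print(mse([3.0, -0.5, 2.0], [2.5, 0.0, 2.0]))  # (0.5^2 + 0.5^2 + 0) / 3 ≈ 0.167
```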

Training and Testing

We estimate the model by minimizing the MSE over the training data, i.e. the training MSE.

However, this measurement is obviously biased towards overfit models: a sufficiently flexible model can drive the training MSE close to zero simply by memorizing the training data.

Therefore, in order to evaluate the model’s performance, we need to introduce another set of observations called testing data, i.e. ones that were not involved in the training process, and calculate the testing MSE.
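
A small simulation (a sketch using polynomial fits of increasing degree as stand-ins for increasingly flexible models) shows why the two MSEs must be kept apart:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 60)
y = np.sin(3 * x) + rng.normal(0, 0.3, 60)

x_tr, y_tr = x[:30], y[:30]        # training data
x_te, y_te = x[30:], y[30:]        # testing data (never used for fitting)

for degree in (1, 3, 12):
    coefs = np.polyfit(x_tr, y_tr, degree)   # fit on training data only
    tr = np.mean((y_tr - np.polyval(coefs, x_tr)) ** 2)
    te = np.mean((y_te - np.polyval(coefs, x_te)) ** 2)
    print(degree, round(tr, 3), round(te, 3))
# Training MSE keeps shrinking as the degree grows;
# testing MSE typically stops improving and then gets worse.
```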

Goodness of Fit in Classification

While MSE is a common measure for regression, the error rate or the misclassification rate is a common measure for classification.

\[\text{Error Rate} = \frac{1}{n} \sum_{i=1}^{n} I(y_i \neq \hat{f}(x_i))\]

where $I(\cdot)$ is the indicator function that returns 1 if the condition is true and 0 otherwise.

So out of $n$ total observations, what fraction were misclassified? That is the error rate.
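
In code, the error rate is just the fraction of mismatched labels (a minimal sketch assuming NumPy):

```python
import numpy as np

def error_rate(y, y_hat):
    """Fraction of observations whose predicted class is wrong."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.mean(y != y_hat)

print(error_rate(["cat", "dog", "dog", "cat"],
                 ["cat", "dog", "cat", "cat"]))  # 1 mistake out of 4 -> 0.25
```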

Just like in regression, we estimate the model using the training error rate and evaluate it using the testing error rate.