Intro to Statistical Machine Learning
Response, Predictors, and Model
We often denote the response or target as:
$$ Y $$
and the features, inputs, or predictors as:
$$ X_i $$
and these predictors are often collected into an input vector:
$$ X = (X_1, X_2, \ldots, X_p) $$
True Model
The true model is a function $f$ that maps the input to the response:
$$ Y = f(X) + \epsilon $$
- True model $f$ is deterministic and unknown
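As a concrete illustration, here is a minimal simulation of this setup in Python. The particular $f$ (a sine curve) and the noise level are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # The "true" model: deterministic, but in practice unknown to us.
    return np.sin(2 * x)

n = 100
X = rng.uniform(0, 3, size=n)          # predictors
epsilon = rng.normal(0, 0.3, size=n)   # noise: mean 0, independent of X
Y = f(X) + epsilon                     # response = true model + noise
```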
Goals of Model Learning
Our goal is to learn $\hat{f}$, an estimate of $f$. In regression, we usually do this by minimizing the mean squared error (MSE), the expectation of the squared difference:
\[\begin{equation} \label{eq:mse-dec} \overbrace{\E[(Y - \hat{Y})^2]}^{\text{MSE}} = \underbrace{(f(X) - \hat{f}(X))^2}_\text{Reducible Error} + \underbrace{\Var(\epsilon)}_\text{Irreducible Error} \end{equation}\]

Do not worry about how we get this decomposition for now (there are some assumptions that need to be made); more details will come in the regression section.
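If you want to see it in action before the derivation, the sketch below checks the decomposition numerically at a single fixed input, reusing the made-up $f$ and noise level from the simulation above together with a deliberately biased estimate $\hat{f}$:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * x)             # made-up "true" model
f_hat = lambda x: np.sin(2 * x) + 0.2   # deliberately biased estimate

x0 = 1.0                                # fix a single input
eps = rng.normal(0, 0.3, size=200_000)  # noise: mean 0, sd 0.3
Y0 = f(x0) + eps

mse = np.mean((Y0 - f_hat(x0)) ** 2)    # Monte Carlo estimate of E[(Y - Y_hat)^2]
reducible = (f(x0) - f_hat(x0)) ** 2    # (f(X) - f_hat(X))^2
irreducible = 0.3 ** 2                  # Var(epsilon)
print(mse, reducible + irreducible)     # the two numbers should nearly agree
```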
Just some blabbering about notation
Sometimes people write the expectation as:
\[\E[Y - \hat{Y}]^2\]

to mean the same thing as above. I personally find this notation confusing because it is not clear whether we are taking the expectation of the squared difference or squaring the expectation of the difference.
As the name suggests, the irreducible error is indeed irreducible, so we minimize the MSE by minimizing the reducible error.
That is the process of learning $\hat{f}$.
Prediction
The prediction of the response given an observation is:
$$ \hat{Y} = \hat{f}(X) $$
Specifically, we are interested in obtaining $\hat{Y}$ once we have learned $\hat{f}$. In prediction, the end goal is not the model itself, but the predicted response.
Inference
In inference, we are interested in the relationship between the predictors and the response.
We still learn $\hat{f}$ as before, but compared to prediction, we are more interested in the model itself than in the predicted values.
Noise / Error
The noise $\epsilon$ is a random variable that captures the variability in the response that is not explained by the true model:

\[Y = f(X) + \epsilon\]

Note the emphasis on true: even if we correctly identify the true model, there will still be some uncertainty in the response.
Take another look at the decomposition \eqref{eq:mse-dec} above. Because the noise is the source of the irreducible uncertainty, it is also called the irreducible error.
Important assumptions about $\epsilon$:
The random variable $\epsilon$ is independent of $X$ and has a mean of 0.
This becomes critical in certain models and methods.
Parametric / Non-parametric Models
Parametric and Structured Models
One example of a parametric, structured model is the linear model.
The parameters are the coefficients of the predictors and the structure is the linear relationship between the predictors and the response.
Parametric models are effective when the true $f$ approximately follows the structure assumed by the model.
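As a rough sketch, here is what fitting a parametric linear model by least squares looks like in NumPy; the data and coefficients are fabricated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 2
X = rng.normal(size=(n, p))
# Simulated data whose true structure really is linear.
Y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.1, size=n)

# Assume the structure Y = b0 + b1*X1 + b2*X2 and estimate
# the parameters (the coefficients) by least squares.
design = np.column_stack([np.ones(n), X])            # prepend intercept column
beta_hat, *_ = np.linalg.lstsq(design, Y, rcond=None)
print(beta_hat)  # should land near [1.0, 2.0, -0.5]
```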
Non-parametric Models
Non-parametric models, on the other hand, do not make strong assumptions about the structure of $f$.
Unlike parametric models, which carry the risk of completely missing the mark, non-parametric models are generally more flexible and can capture more complex relationships.
However, non-parametric models generally require more data to estimate $f$ accurately.
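As one non-parametric example, here is a bare-bones k-nearest-neighbors regressor. It assumes no functional form for $f$, only that nearby inputs have similar responses; the value of k and the data are arbitrary choices for the sketch:

```python
import numpy as np

def knn_predict(x0, X_train, Y_train, k=5):
    # Predict at x0 by averaging the responses of the
    # k training points whose inputs are closest to x0.
    nearest = np.argsort(np.abs(X_train - x0))[:k]
    return Y_train[nearest].mean()

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 3, size=200)
Y_train = np.sin(2 * X_train) + rng.normal(0, 0.3, size=200)
print(knn_predict(1.0, X_train, Y_train))  # roughly sin(2) = 0.91
```

Notice that the prediction leans entirely on the training data: with only a handful of points near $x_0$, the average is noisy, which is one reason non-parametric methods need more data.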
Important Trade-Offs
Parametric models and non-parametric models have different trade-offs and are prone to different types of errors:
- Interpretability vs Accuracy/Flexibility
- Underfitting vs Overfitting
- Efficiency (in terms of data/model size) vs Complexity
Regression / Classification
When the response is quantitative, we call it regression.
When the response is qualitative, we call it classification.
This distinction is not exact, but let’s keep it simple for now.
Supervised / Unsupervised Learning
When there is a given response variable, we call it supervised learning.
- Regression and classification
When there is no given response variable, we call it unsupervised learning.
- Clustering
Measuring Goodness of Fit
As discussed above, in regression, the most common way to measure the goodness of fit is the mean squared error (MSE):
\[\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2\]

Training and Testing
We estimate the model by minimizing the MSE over the training data, known as the training MSE.
However, this measurement is obviously biased toward models that overfit the training data.
Therefore, in order to evaluate the model’s performance, we need to introduce another set of observations called testing data, i.e. ones that were not involved in the training process, and calculate the testing MSE.
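A small sketch of that workflow, with simulated data and a deliberately flexible model (a high-degree polynomial) so the gap between the two MSEs is visible:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n):
    X = rng.uniform(0, 3, size=n)
    return X, np.sin(2 * X) + rng.normal(0, 0.3, size=n)

X_train, Y_train = simulate(50)
X_test, Y_test = simulate(50)    # held out: never touched during training

# Fit a flexible model (degree-10 polynomial) on the training data only.
coefs = np.polyfit(X_train, Y_train, deg=10)

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

print("training MSE:", mse(Y_train, np.polyval(coefs, X_train)))
print("testing MSE: ", mse(Y_test, np.polyval(coefs, X_test)))
# The training MSE is typically the smaller of the two.
```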
Goodness of Fit in Classification
While MSE is a common measure for regression, the error rate or the misclassification rate is a common measure for classification.
\[\text{Error Rate} = \frac{1}{n} \sum_{i=1}^{n} I(y_i \neq \hat{f}(x_i))\]

where $I(\cdot)$ is the indicator function, which returns 1 if the condition is true and 0 otherwise.
In other words: out of $n$ total observations, what fraction were misclassified? That is the error rate.
Just like regression, we estimate the model with training error rate, and evaluate the model with testing error rate.
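A minimal sketch of computing the error rate, with arbitrary made-up labels standing in for the true and predicted classes:

```python
import numpy as np

y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])   # observed classes (made up)
y_pred = np.array([0, 1, 0, 0, 1, 1, 1, 1])   # model's predictions (made up)

# Average of the indicators I(y_i != y_hat_i) over all n observations.
error_rate = np.mean(y_true != y_pred)
print(error_rate)  # 2 misclassified out of 8 -> 0.25
```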