Logistic Regression
Quick Recap of Logistic Function and Logit
Binary Logistic Regression
Let $Y$ be a binary response variable.
We want to model $p(X)$, the probability that $Y = 1$ given $X$:
$$ p(X) = P(Y = 1 \mid X) $$
In this case, our baseline class is $Y = 0$.
In logistic regression, we model $p(X)$ using the logistic function:
$$ p(X; \beta) = \frac{e^{f(X;\, \beta)}}{1 + e^{f(X;\, \beta)}} $$
Where $f(X; \beta)$ is a linear function of $X$ (it is also the logit, or log-odds, of the probability $p(X)$):
$$ f(X; \beta) = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p $$
So logistic regression is still a linear model: it is linear in the log-odds.
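To make the mapping from the linear predictor to a probability concrete, here is a minimal sketch in plain NumPy; the coefficients and the observation below are made up purely for illustration:

```python
import numpy as np

def logistic(f):
    """Map the linear predictor f(X; beta) to a probability in (0, 1)."""
    return np.exp(f) / (1.0 + np.exp(f))

# Hypothetical fitted coefficients: beta_0 = -1.5, beta_1 = 0.8, beta_2 = 0.3
beta = np.array([-1.5, 0.8, 0.3])

x = np.array([1.0, 2.0, -0.5])   # one observation, with a leading 1 for the intercept
f = beta @ x                     # f(x; beta) = beta_0 + beta_1 x_1 + beta_2 x_2
print(logistic(f))               # P(Y = 1 | X = x), about 0.49 here
```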
Interpretation of the Coefficients
- A positive $\hat{\beta}_j$ means that $X_j$ is associated with a higher probability of $Y = 1$.
- A one-unit increase in $X_j$ increases the log-odds of $Y = 1$ versus $Y = 0$ by $\hat{\beta}_j$, i.e. it multiplies the odds by $e^{\hat{\beta}_j}$ (see the short example after this list).
- This is because $f(X; \beta)$ is the logit (log-odds) of $p(X)$, the probability of $Y = 1$.
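As a quick hypothetical example (the numbers are made up): with $\hat{\beta}_j = 0.8$, adding one unit to $X_j$ adds $0.8$ to the log-odds, which multiplies the odds $p/(1-p)$ by $e^{0.8} \approx 2.23$:

```python
import numpy as np

def logistic(f):
    return np.exp(f) / (1.0 + np.exp(f))

beta_j = 0.8                              # hypothetical coefficient on X_j
f = 0.2                                   # log-odds at the current value of X_j
p_before, p_after = logistic(f), logistic(f + beta_j)

odds = lambda p: p / (1 - p)
print(odds(p_after) / odds(p_before))     # exactly exp(beta_j), about 2.23
```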
MLE for Logistic Regression
To estimate $\beta$ we use maximum likelihood estimation (MLE).
The likelihood function is:
$$ \mathcal{L}(\beta) = \prod_{i=1}^n p(x_i; \beta)^{y_i} (1 - p(x_i; \beta))^{1 - y_i} $$
The estimate of $\beta$ should make $p(x_i; \beta)$ large when $y_i = 1$ and $1 - p(x_i; \beta)$ large when $y_i = 0$, and thus maximize the likelihood.
The log-likelihood function is:
\[\begin{align*} \ell(\beta) &= \sum_{i=1}^n \left[ y_i \log p(x_i; \beta) + (1 - y_i) \log (1 - p(x_i; \beta)) \right] \\[0.5em] &= \sum_{i=1}^n \left[ y_i f(x_i; \beta) - \log(1 + e^{f(x_i; \beta)}) \right] \end{align*}\]
Derivation
Remember that the logit is:
\[f(x_i; \beta) = \log\left(\frac{p(x_i; \beta)}{1 - p(x_i; \beta)}\right) = \log(p(x_i; \beta)) - \log(1 - p(x_i; \beta))\]
And
\[\log(1 - p(x_i; \beta)) = \log\left(\frac{1}{1 + e^{f(x_i; \beta)}}\right) = -\log(1 + e^{f(x_i; \beta)})\]
Then:
\[\begin{align*} &y_i \log p(x_i; \beta) + (1 - y_i) \log (1 - p(x_i; \beta)) \\[0.5em] =\; &y_i \log p(x_i; \beta) - y_i \log(1 - p(x_i; \beta)) + \log(1 - p(x_i; \beta)) \\[0.5em] =\; &y_i f(x_i; \beta) - \log(1 + e^{f(x_i; \beta)}) \end{align*}\]
We take the derivative of $\ell(\beta)$ with respect to $\beta$, set it to zero, and solve for $\beta$:
\[\frac{\partial \ell(\beta)}{\partial \beta} = 0\]
Then we use iterative methods (such as Newton-Raphson) to get $\hat{\beta}$.
There is no closed-form solution for $\hat{\beta}$.
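As a sketch of such an iterative method, here is a minimal Newton-Raphson loop for maximizing $\ell(\beta)$. It assumes a design matrix `X` whose first column is all ones (for the intercept) and a 0/1 response vector `y`; it is an illustrative implementation, not a reference one:

```python
import numpy as np

def fit_logistic_mle(X, y, n_iter=25, tol=1e-8):
    """Maximize the log-likelihood l(beta) with Newton-Raphson.

    X : (n, p+1) design matrix, first column all ones (intercept).
    y : (n,) array of 0/1 responses.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        f = X @ beta                      # linear predictor f(x_i; beta)
        p = 1.0 / (1.0 + np.exp(-f))      # p(x_i; beta)
        grad = X.T @ (y - p)              # gradient of l(beta)
        W = p * (1.0 - p)                 # weights p_i (1 - p_i)
        hess = -(X.T * W) @ X             # Hessian of l(beta)
        step = np.linalg.solve(hess, grad)
        beta -= step                      # Newton update: beta - H^{-1} grad
        if np.max(np.abs(step)) < tol:
            break
    return beta
```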
Testing for Estimator Significance
Remember that MLE estimators are asymptotically normal.
Therefore, you can test for the significance of the estimators using the z-statistic:
$$ z = \frac{\hat{\beta}_j}{\text{SE}(\hat{\beta}_j)} $$
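Continuing the sketch above, the standard errors can be taken from the diagonal of the inverse observed information (the negative Hessian at $\hat{\beta}$), and the two-sided p-value from the standard normal. Again, this is an illustrative snippet rather than a reference implementation:

```python
import numpy as np
from scipy import stats

def z_tests(X, y, beta_hat):
    """z-statistics and two-sided p-values for each coefficient."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))
    W = p * (1.0 - p)
    info = (X.T * W) @ X                         # observed information = -Hessian
    se = np.sqrt(np.diag(np.linalg.inv(info)))   # SE(beta_hat_j)
    z = beta_hat / se
    p_values = 2 * stats.norm.sf(np.abs(z))      # asymptotic normality under H0: beta_j = 0
    return z, p_values
```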
Multinomial Logistic Regression
When the response variable has more than two classes, let’s say $K$ classes, we use multinomial logistic regression.
The probability model is:
$$ P(Y = k \mid X; \beta) = \frac{e^{\beta_{k0} + \beta_k^T X}} {1 + \sum_{l=1}^{K-1} e^{\beta_{l0} + \beta_l^T X}} $$
You will see that we actually have $K-1$ equations (one for each $k = 1, \ldots, K-1$), because we usually set class $K$ as the baseline where $\beta_{K0} = 0$ and $\beta_{K} = \mathbf{0}$, so that $P(Y = K \mid X; \beta) = \frac{1}{1 + \sum_{l=1}^{K-1} e^{\beta_{l0} + \beta_l^T X}}$.
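A minimal sketch of this probability model with class $K$ as the baseline; the coefficient matrix `B` below is hypothetical and simply stacks the $K-1$ rows $(\beta_{l0}, \beta_l^T)$:

```python
import numpy as np

def multinomial_probs_baseline(B, x):
    """Class probabilities with class K as the baseline (beta_K = 0).

    B : (K-1, p+1) coefficients for classes 1..K-1, first column is the intercept.
    x : (p,) feature vector.
    """
    x1 = np.concatenate(([1.0], x))    # prepend 1 for the intercept
    exps = np.exp(B @ x1)              # e^{beta_l0 + beta_l^T x}, l = 1..K-1
    denom = 1.0 + exps.sum()
    probs = exps / denom               # P(Y = k | x) for k = 1..K-1
    prob_K = 1.0 / denom               # baseline class K
    return np.concatenate((probs, [prob_K]))

# Hypothetical example with K = 3 classes and p = 2 features
B = np.array([[0.2, 1.0, -0.5],
              [-0.3, 0.4, 0.8]])
print(multinomial_probs_baseline(B, np.array([1.5, 0.5])))   # sums to 1
```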
Softmax Coding
In multinomial logistic regression, we usually set the last class $K$ as the baseline, where all $\beta_{K0} = 0$ and $\beta_K = \mathbf{0}$.
But with softmax coding, we do not set any class as the baseline: we estimate a full set of coefficients for every class, and
$$ P(Y = k \mid X; \beta) = \frac{e^{\beta_{k0} + \beta_k^T X}}{\sum_{l=1}^{K} e^{\beta_{l0} + \beta_l^T X}} $$
In softmax coding, our log-odds (or logit function) of class $k$ versus any class $k'$ is:
$$ f(x; \beta) = (\beta_{k0}-\beta_{k'0}) + (\beta_{k1} - \beta_{k'1})x_1 + \ldots + (\beta_{kp} - \beta_{k'p})x_p $$
In multinomial logistic regression with a baseline, on the other hand, the log-odds (or logit function) is always the class $k$ in question versus the baseline class $K$, whose coefficients are all zero:
$$ f(x; \beta) = \beta_{k0} + \beta_{k1}x_1 + \ldots + \beta_{kp}x_p $$
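The two codings give the same fitted probabilities, and hence the same log-odds between any two classes: subtracting the baseline class's coefficient vector from every class's coefficients converts softmax coding into baseline coding without changing the model. A small sketch with made-up coefficients:

```python
import numpy as np

def softmax_probs(B, x):
    """Softmax coding: every class k has its own (beta_k0, beta_k)."""
    x1 = np.concatenate(([1.0], x))
    scores = B @ x1                          # beta_k0 + beta_k^T x for k = 1..K
    exps = np.exp(scores - scores.max())     # subtract max for numerical stability
    return exps / exps.sum()

# Hypothetical softmax-coded coefficients for K = 3 classes, p = 2 features
B_softmax = np.array([[0.5, 1.2, -0.3],
                      [0.1, 0.2, 0.9],
                      [-0.4, 0.8, 0.4]])
x = np.array([1.0, 2.0])

# Baseline coding: subtract class K's row so that beta_K = 0
B_baseline = B_softmax - B_softmax[-1]

print(softmax_probs(B_softmax, x))
print(softmax_probs(B_baseline, x))          # identical probabilities
```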