Encoding Categorical Variables

Categorical (qualitative) variables are often non-numeric and must be encoded as numbers before most learning algorithms can use them.

Table of contents
  1. One-Hot Encoding
    1. Dummy Variable Trap
  2. Dummy Encoding
    1. Baseline
    2. Interpretation of Dummy Coefficients
    3. Testing for Significance

One-Hot Encoding

For a qualitative variable with $k$ groups, one-hot encoding introduces $k$ new binary variables.

For example, consider a variable $X$ with three groups: A, B, and C.

[Figure: One-Hot Encoding]
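As a minimal sketch of the idea, the snippet below builds the $k = 3$ indicator columns for A, B, and C by hand. The helper name `one_hot` and the sample observations are made up for illustration.

```python
def one_hot(values, groups):
    """Return one row of 0/1 indicators per value, one column per group."""
    return [[1 if v == g else 0 for g in groups] for v in values]

groups = ["A", "B", "C"]          # the k = 3 groups of X
x = ["A", "B", "C", "A"]          # made-up observations
encoded = one_hot(x, groups)

for value, row in zip(x, encoded):
    print(value, row)
```

Each row contains exactly one 1, in the column of the group the observation belongs to.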

Dummy Variable Trap

There is one issue with one-hot encoding: the dummy variable trap.

If you look at the encoding, you can see that the third variable is redundant:

[Figure: Dummy Variable Trap]

C can be inferred from the indicators for A and B alone: if both are 0 (00), the observation must belong to C.

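This is a problem because the features are now perfectly multicollinear: the $k$ dummy variables always sum to 1, so in a model with an intercept any one of them is a linear combination of the others and the intercept.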
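The redundancy can be checked numerically. The sketch below assumes a model with an intercept column and uses made-up observations; it compares the rank of the design matrix with all $k$ dummies against the one with $k-1$.

```python
import numpy as np

# One-hot columns for the observations A, B, C, A, B, C (made-up data).
one_hot = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
])

intercept = np.ones((one_hot.shape[0], 1), dtype=int)
design_full = np.hstack([intercept, one_hot])            # intercept + k dummies
design_dropped = np.hstack([intercept, one_hot[:, :2]])  # intercept + k-1 dummies

# The k dummies always sum to the intercept column.
sums_to_intercept = np.array_equal(one_hot.sum(axis=1), intercept.ravel())

rank_full = np.linalg.matrix_rank(design_full)        # fewer than its 4 columns
rank_dropped = np.linalg.matrix_rank(design_dropped)  # equals its 3 columns
print(sums_to_intercept, rank_full, rank_dropped)
```

The full matrix has four columns but rank three, so its least-squares coefficients are not uniquely determined. Dropping one dummy restores full column rank.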

Most one-hot encoding implementations provide a parameter to drop one of the dummy variables automatically.
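As one example of such a parameter, pandas' `get_dummies` accepts `drop_first=True`. The sketch below uses made-up observations and assumes we want C dropped, so C is listed first in the category order.

```python
import pandas as pd

# Listing C first in the category order makes it the level that drop_first removes.
x = pd.Series(
    pd.Categorical(["A", "B", "C", "A"], categories=["C", "A", "B"]),
    name="X",
)

full = pd.get_dummies(x, dtype=int)                      # k columns: C, A, B
reduced = pd.get_dummies(x, drop_first=True, dtype=int)  # k - 1 columns: A, B
print(reduced)
```

Without the explicit category order, `drop_first=True` would drop A, the first level alphabetically, rather than C.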


Dummy Encoding

Dummy encoding introduces one fewer binary variable than one-hot encoding, i.e., $k-1$ new binary variables.

[Figure: Dummy Encoding]

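Each dummy indicates whether an observation belongs to one group:

  • A or not A ($A \lor \neg A$)
  • B or not B ($B \lor \neg B$),

and an observation that is neither A nor B must be C ($\neg A \land \neg B \implies C$).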

This avoids the dummy variable trap.
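To make the logic concrete, here is a small sketch of encoding with $k-1$ indicators and recovering the group from them. The function names `dummy_encode` and `dummy_decode` are hypothetical, and C is assumed to be the group with no indicator of its own.

```python
def dummy_encode(value, non_baseline=("A", "B")):
    """Return k - 1 indicators; the remaining group maps to all zeros."""
    return tuple(1 if value == g else 0 for g in non_baseline)

def dummy_decode(code, non_baseline=("A", "B"), baseline="C"):
    """Recover the group: if no indicator is set, it must be the remaining group."""
    for group, bit in zip(non_baseline, code):
        if bit == 1:
            return group
    return baseline

codes = {g: dummy_encode(g) for g in ["A", "B", "C"]}
print(codes)
```

Every group still has a distinct code, so no information is lost by dropping the third column.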

Baseline

With dummy encoding, one group is chosen as the baseline (in our example, C).

Let $X_1$ be the dummy for A ($X_1 = 1$ if the group is A, $0$ otherwise) and $X_2$ the dummy for B.

In a linear regression on these two dummies:

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon\]

When the categorical variable is $C$ ($X_1 = 0$ and $X_2 = 0$),

\[Y = \beta_0 + \epsilon\]

We are left with the baseline model.
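For comparison, substituting the other two groups into the same equation gives the following (a short worked step using the same symbols).

When the categorical variable is A ($X_1 = 1$ and $X_2 = 0$),

\[Y = \beta_0 + \beta_1 + \epsilon\]

and when it is B ($X_1 = 0$ and $X_2 = 1$),

\[Y = \beta_0 + \beta_2 + \epsilon\]

Each non-baseline group shifts the baseline model by its own coefficient.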

Interpretation of Dummy Coefficients

In linear regression, the coefficient $\beta_i$ of a quantitative variable is interpreted as follows:

The average effect on $Y$ of a one-unit increase in $X_i$, holding the other variables fixed.

But how do we interpret the coefficients of dummy variables?

The effect is in comparison to the baseline.

So in our example above:

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon\]

where $X_1$ and $X_2$ are binary dummy variables for A and B, respectively:

  • $\beta_0$ is the expected $Y$ when $X_1 = 0$ and $X_2 = 0$ (C)
  • $\beta_1$ is the average difference in $Y$ between A ($X_1 = 1$) and the baseline C
  • $\beta_2$ is the average difference in $Y$ between B ($X_2 = 1$) and the baseline C

No single coefficient directly compares A with B; that difference is $\beta_1 - \beta_2$.
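The interpretation can be checked on a toy example. The sketch below fits ordinary least squares with NumPy on made-up responses (the numbers are purely illustrative) and compares the fitted coefficients with differences of group means.

```python
import numpy as np

# Made-up responses for each group (illustrative only).
y_by_group = {
    "A": [5.0, 7.0, 6.0],
    "B": [9.0, 11.0, 10.0],
    "C": [2.0, 4.0, 3.0],
}

rows, y = [], []
for group, values in y_by_group.items():
    for value in values:
        # Columns: intercept, X1 (dummy for A), X2 (dummy for B); C is the baseline.
        rows.append([1.0, float(group == "A"), float(group == "B")])
        y.append(value)

X = np.array(rows)
y = np.array(y)

# Ordinary least squares fit.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = beta

means = {g: np.mean(v) for g, v in y_by_group.items()}
print(b0, means["C"])                     # intercept vs. mean of C
print(b1, means["A"] - means["C"])        # beta_1 vs. mean of A minus mean of C
print(b2, means["B"] - means["C"])        # beta_2 vs. mean of B minus mean of C
print(b1 - b2, means["A"] - means["B"])   # A vs. B via the difference of coefficients
```

With a single categorical predictor, the intercept matches the baseline group's mean and each dummy coefficient matches that group's mean minus the baseline mean.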

Testing for Significance

After estimating $\beta_i$, we want to test whether each group is significantly different from the baseline.

Calculate the $t$-statistic for each $\hat{\beta}_i$ as follows:

\[t = \frac{\hat{\beta}_i}{\text{SE}(\hat{\beta}_i)}\]

Remember that each test compares one group to the baseline (is there a difference between A and C? between B and C?).
These tests do not tell you whether two non-baseline groups differ from each other, so you cannot compare the significance of arbitrary $\hat{\beta}_i$ and $\hat{\beta}_j$. For that, you must do a pairwise comparison for each pair of groups.
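A minimal sketch of these calculations with NumPy follows, again on made-up data. The standard errors come from the usual OLS covariance estimate $\hat{\sigma}^2 (X^\top X)^{-1}$. The A-versus-B comparison is done here by testing the contrast $\hat{\beta}_1 - \hat{\beta}_2$, which is one possible approach and an assumption of this example; refitting with a different baseline is another.

```python
import numpy as np

# Made-up data. Columns: intercept, X1 (dummy for A), X2 (dummy for B); C is the baseline.
y_by_group = {"A": [5.0, 7.0, 6.0], "B": [9.0, 11.0, 10.0], "C": [2.0, 4.0, 3.0]}
X = np.array([[1.0, float(g == "A"), float(g == "B")]
              for g, vals in y_by_group.items() for _ in vals])
y = np.array([v for vals in y_by_group.values() for v in vals])

n, p = X.shape
beta = np.linalg.solve(X.T @ X, X.T @ y)      # OLS estimates
residuals = y - X @ beta
sigma2 = residuals @ residuals / (n - p)      # residual variance with n - p degrees of freedom
cov_beta = sigma2 * np.linalg.inv(X.T @ X)    # estimated covariance of the estimates

se = np.sqrt(np.diag(cov_beta))               # standard errors
t_stats = beta / se                           # t-statistics for H0: beta_i = 0

# A versus B: test the contrast beta_1 - beta_2, including the covariance term.
contrast = np.array([0.0, 1.0, -1.0])
t_a_vs_b = (contrast @ beta) / np.sqrt(contrast @ cov_beta @ contrast)
print(t_stats, t_a_vs_b)
```

Each statistic would then be compared against a $t$ distribution with $n - p$ degrees of freedom. The contrast's standard error needs the covariance between $\hat{\beta}_1$ and $\hat{\beta}_2$, which is why the two individual $t$-statistics alone are not enough.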