Encoding Categorical Variables

Categorical or qualitative variables are often non-numeric and must be encoded into numbers before they can be used by most learning algorithms.

Table of contents
  1. One-Hot Encoding
    1. Dummy Variable Trap
  2. Dummy Encoding
    1. Baseline
    2. Interpretation of Dummy Coefficients
    3. Testing for Significance

One-Hot Encoding

For a qualitative variable with k groups, one-hot encoding introduces k new binary variables.

For example, consider a variable X with three groups: A, B, and C.

One-hot encoding of X:

  X    A  B  C
  A    1  0  0
  B    0  1  0
  C    0  0  1
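As a minimal sketch (assuming pandas; the toy data simply mirrors the example above), pandas.get_dummies produces the k binary columns:

```python
import pandas as pd

# Toy version of the example above: one categorical column X with groups A, B, C.
x = pd.Series(["A", "B", "C", "A"], name="X")

# One-hot encoding: one binary column per group (k = 3 columns).
onehot = pd.get_dummies(x, dtype=int)
print(onehot)
#    A  B  C
# 0  1  0  0
# 1  0  1  0
# 2  0  0  1
# 3  1  0  0
```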

Dummy Variable Trap

There is one issue with one-hot encoding: the dummy variable trap.

If you look at the encoding above, you can see that the third variable is redundant: C can be inferred from A and B alone, since a row of 0 0 in those two columns can only mean C.

This is a problem because it creates perfect multicollinearity among the features: together with an intercept, the three dummy columns are linearly dependent (A + B + C = 1 in every row).
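A quick way to see this (a minimal sketch assuming NumPy, with made-up toy rows): an intercept column plus all three dummies gives a rank-deficient design matrix, while dropping one dummy restores full column rank.

```python
import numpy as np

# Design matrix with an intercept column plus all three dummies (A, B, C).
X_full = np.array([
    [1, 1, 0, 0],   # group A
    [1, 0, 1, 0],   # group B
    [1, 0, 0, 1],   # group C
    [1, 1, 0, 0],   # group A
])
print(np.linalg.matrix_rank(X_full))          # 3, not 4: the columns are collinear

# Dropping one dummy (here C) restores full column rank.
print(np.linalg.matrix_rank(X_full[:, :3]))   # 3 = number of remaining columns
```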

Most one-hot encoding implementations have a parameter to automatically drop one of the dummy variables.
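For instance (assuming a recent scikit-learn, where OneHotEncoder takes a sparse_output argument), the drop parameter removes one dummy per feature:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([["A"], ["B"], ["C"], ["A"]])

# drop="first" discards one dummy per feature (here the A column),
# leaving k - 1 binary columns and avoiding the trap.
encoder = OneHotEncoder(drop="first", sparse_output=False)
encoded = encoder.fit_transform(X)
print(encoder.get_feature_names_out())   # ['x0_B' 'x0_C']
print(encoded)
```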


Dummy Encoding

Dummy encoding introduces one fewer binary variable than one-hot encoding, i.e., k − 1 new binary variables.

Dummy encoding of X:

  X    A  B
  A    1  0
  B    0  1
  C    0  0

Each dummy represents one group against its absence:

  • A vs. ¬A
  • B vs. ¬B

and the remaining combination, ¬A ∧ ¬B, identifies C.

This avoids the dummy variable trap.
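A small sketch of the same idea in pandas (dropping the C column by hand is just one way to choose which dummy to remove):

```python
import pandas as pd

x = pd.Series(["A", "B", "C"], name="X")

# Dummy encoding: keep only k - 1 = 2 columns; C is represented as A=0, B=0.
dummies = pd.get_dummies(x, dtype=int).drop(columns="C")
print(dummies)
#    A  B
# 0  1  0
# 1  0  1
# 2  0  0
```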

Baseline

With dummy encoding, one group is chosen as the baseline (in our example, C).

Say X1 represents A vs. ¬A and X2 represents B vs. ¬B.

In a linear regression:

Y=β0+β1X1+β2X2+ϵ

When the categorical variable is C (X1=0 and X2=0),

Y=β0+ϵ

We are left with the baseline model.
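As an illustration (a sketch assuming statsmodels and simulated data; the group means 5, 7, and 2 are made up), fitting this regression shows that the intercept recovers the mean of Y for the baseline group C:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
groups = rng.choice(["A", "B", "C"], size=300)
true_means = {"A": 5.0, "B": 7.0, "C": 2.0}            # made-up group means
y = np.array([true_means[g] for g in groups]) + rng.normal(0, 1, 300)

# Dummy-encode X with C as the baseline and add the intercept β0.
X = pd.get_dummies(pd.Series(groups, name="X"), dtype=float).drop(columns="C")
X = sm.add_constant(X)
fit = sm.OLS(y, X).fit()

print(fit.params["const"])        # ≈ 2.0: the estimated mean of Y for group C
print(y[groups == "C"].mean())    # sample mean of the baseline group, for comparison
```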

Interpretation of Dummy Coefficients

In linear regression, the coefficients βi of quantitative variables are interpreted as follows:

The average effect on Y of a one-unit increase in Xi.

But how do we interpret the coefficients of dummy variables?

The effect is measured relative to the baseline group.

So in our example above:

Y=β0+β1X1+β2X2+ϵ

where X1 and X2 are binary dummy variables for A and B, respectively:

  • β0 is the expected Y when X1=0 and X2=0 (C)
  • β1 is the average effect on Y when X1=1 (A) compared to C
  • β2 is the average effect on Y when X2=1 (B) compared to C

It does not give you a direct comparison between the effects of A and B.
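Continuing with simulated data (still a sketch assuming statsmodels; its formula API's Treatment(reference=...) coding is one way to set C as the baseline), the fitted dummy coefficients match the differences in group means relative to C:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"X": rng.choice(["A", "B", "C"], size=300)})
true_means = {"A": 5.0, "B": 7.0, "C": 2.0}            # made-up group means
df["Y"] = df["X"].map(true_means) + rng.normal(0, 1, len(df))

# Treatment coding with C as the reference (baseline) level.
fit = smf.ols("Y ~ C(X, Treatment(reference='C'))", data=df).fit()
print(fit.params)
# The A coefficient ≈ mean(Y | A) - mean(Y | C), and likewise for B.
# An A-vs-B comparison is the difference between the two coefficients,
# which neither coefficient reports on its own.
```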

Testing for Significance

After estimating βi, we want to test whether each group is significantly different from the baseline.

Calculate the t-statistic for each βi as follows:

t = β̂i / SE(β̂i)

Remember that we are comparing each group to the baseline (is there a difference between A and C? B and C?).
These t-tests do not let you compare arbitrary pairs β̂i and β̂j (e.g., A against B). For that, you must carry out pairwise comparisons across all groups.
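A final sketch (statsmodels again, with the same simulated setup as above): the reported t-values are exactly each coefficient divided by its standard error, and each one tests a group against the baseline C.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"X": rng.choice(["A", "B", "C"], size=300)})
true_means = {"A": 5.0, "B": 7.0, "C": 2.0}            # made-up group means
df["Y"] = df["X"].map(true_means) + rng.normal(0, 1, len(df))

fit = smf.ols("Y ~ C(X, Treatment(reference='C'))", data=df).fit()

# t = coefficient / standard error; each one tests a group against the baseline C.
print(fit.params / fit.bse)
print(fit.tvalues)     # identical to the manual ratio above
print(fit.pvalues)     # p-values for "is this group different from C?"
```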