Handling Categorical Variables in Machine Learning

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Categorical variables are very common in machine learning projects. At a high level, there are two ways to handle a categorical variable.

Drop the categorical variable if it won't help the model, especially when it has a large cardinality. User ID is such an example when you build a user-level model. Alternatively, you can use feature hashing to reduce the dimension/cardinality of the variable and let the training process decide whether it should be included in the model.
Encode the categorical variable. Below are some popular ways of encoding a categorical variable.
- One-Hot Encoding
- Label Encoding
- Target Encoding
- Feature Hashing
- Weight of Evidence
- Light G-Boost Encoding
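
To make the first three encodings in the list concrete, here is a minimal pure-Python sketch. The toy data and the helper names (`one_hot_encode`, `label_encode`, `target_encode`) are made up for illustration and are not taken from any library; the target encoding shown is the naive per-category mean without smoothing.

```python
from collections import defaultdict

# Toy data (made up for illustration): one categorical column and a binary target.
colors = ["red", "green", "blue", "green", "red", "red"]
target = [1, 0, 1, 1, 0, 1]

# Fix a category order so that every encoding below is reproducible.
categories = sorted(set(colors))


def one_hot_encode(values, categories):
    """Map each value to a 0/1 vector with a single 1 at its category's position."""
    return [[int(v == c) for c in categories] for v in values]


def label_encode(values, categories):
    """Map each value to the integer index of its category."""
    index = {c: i for i, c in enumerate(categories)}
    return [index[v] for v in values]


def target_encode(values, target):
    """Map each value to the mean target over the rows sharing its category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, t in zip(values, target):
        sums[v] += t
        counts[v] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return [means[v] for v in values]


one_hot = one_hot_encode(colors, categories)
labels = label_encode(colors, categories)
target_means = target_encode(colors, target)
```

In practice you would more likely reach for `pandas.get_dummies` or scikit-learn's `OneHotEncoder` and `OrdinalEncoder`. Also note that the naive target encoding above uses the label of each row to encode that same row, so it is usually computed on training folds only to avoid leaking the target.
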
Please refer to "Know about Categorical Encoding, even New Ones!" and "Dealing with Categorical Variables in Machine Learning" for more detailed discussions. Notice that LightGBM has its own way (Light G-Boost Encoding) of handling categorical variables. Please refer to "Handle Categorical Variables in LightGBM" for more discussions.
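
Feature hashing, mentioned above as a way to cap the cardinality of something like a user ID, can be sketched in a few lines. This is only an illustration under assumptions: the helper names (`hash_bucket`, `hash_encode`), the choice of MD5, and the bucket count of 8 are all hypothetical, and this is the simple unsigned variant rather than the signed hashing used by scikit-learn's `FeatureHasher`.

```python
import hashlib


def hash_bucket(value, n_buckets):
    """Map a category string to a stable bucket index in [0, n_buckets)."""
    # hashlib is used instead of the built-in hash() because string hashes
    # from hash() are randomized per Python process by default.
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def hash_encode(values, n_buckets):
    """One-hot encode the hashed bucket of each value (unsigned hashing trick)."""
    rows = []
    for v in values:
        row = [0] * n_buckets
        row[hash_bucket(v, n_buckets)] = 1
        rows.append(row)
    return rows


# Hypothetical user IDs; the encoded width stays at n_buckets however many IDs exist.
user_ids = ["user_1", "user_2", "user_3", "user_1"]
encoded = hash_encode(user_ids, n_buckets=8)
```

The trade-off is that distinct categories can collide in the same bucket, so the number of buckets controls how much information is lost in exchange for a fixed feature width.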

Ben Chuanlong Du's Blog