Categorical Encoding

5 minute read

In practice, real datasets often contain categorical values. So what is the difference between an ordinary string value and a categorical value? Well, a categorical value can sometimes be numeric, and a categorical variable is limited to a fixed set of values rather than a continuous range. Some machine learning algorithms can handle categorical values without further manipulation, but many more cannot. This means that categorical data must be encoded to numbers before we can use it to fit and evaluate a model. There are many ways to encode categorical variables for modeling, although the three most common are as follows:

  1. Label Encoder: Where each unique target label is mapped to an integer.
  2. One Hot Encoder: Where each category is mapped to a binary vector.
  3. Ordinal Encoder: Where each unique category of an input feature is mapped to an integer.

Before we start, let's practice with the exercise dataset from Seaborn.

import pandas as pd
import numpy as np
import seaborn as sns

# Seaborn's exercise dataset; load_dataset already returns a pandas DataFrame
data = sns.load_dataset("exercise")
data.dtypes
Unnamed: 0       int64
id               int64
diet          category
pulse            int64
time          category
kind          category
dtype: object
data['diet'].value_counts()
low fat    45
no fat     45
Name: diet, dtype: int64
data['time'].value_counts()
30 min    30
15 min    30
1 min     30
Name: time, dtype: int64
data['kind'].value_counts()
running    30
walking    30
rest       30
Name: kind, dtype: int64

Label Encoder

Label encoding maps each unique label to an integer value, encoding target labels with values between 0 and $n-1$. This transformer is meant for encoding target values y, not the input X. Because an integer encoding imposes an order on the categories, using it on input features is really only appropriate when a known ordered relationship exists between them (for example, categories taken from a continuous scale).

from sklearn.preprocessing import LabelEncoder

lencoder = LabelEncoder()
data['dietencode'] = lencoder.fit_transform(data['diet'])
data[['dietencode', 'diet']]
    dietencode     diet
0            0  low fat
1            0  low fat
2            0  low fat
3            0  low fat
4            0  low fat
..         ...      ...
85           1   no fat
86           1   no fat
87           1   no fat
88           1   no fat
89           1   no fat

90 rows × 2 columns
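
As a quick check, the fitted encoder exposes the mapping it learned and can reverse it. A minimal sketch, reusing the lencoder and data objects from above:

# classes_ holds the learned labels; their position is the encoded integer,
# so here 'low fat' -> 0 and 'no fat' -> 1
lencoder.classes_

# recover the original strings from the encoded column
lencoder.inverse_transform(data['dietencode'])[:5]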

One Hot Encoder

A one hot encoding is appropriate for categorical data where no ordinal relationship exists between the categories (such as country or name). It represents each categorical variable with a binary vector that has one element per unique label, marking the element for the observed label with a 1 and all other elements with 0. The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The encoder creates a binary column for each category and returns a sparse matrix or dense array (depending on the sparse parameter, renamed sparse_output in newer scikit-learn versions). By default, the encoder derives the categories from the unique values in each feature; alternatively, you can specify the categories manually, as shown after the output below.

from sklearn.preprocessing import OneHotEncoder

hotencoder = OneHotEncoder()

# fit_transform returns a sparse matrix by default, so convert it to a dense
# array before wrapping it in a DataFrame; newer scikit-learn versions use
# get_feature_names_out() instead of get_feature_names()
encoded = hotencoder.fit_transform(data[['diet', 'time']]).toarray()
data.join(pd.DataFrame(data=encoded, columns=hotencoder.get_feature_names()))
Unnamed: 0 id diet pulse time kind dietencode x0_low fat x0_no fat x1_1 min x1_15 min x1_30 min
0 0 1 low fat 85 1 min rest 0 1.0 0.0 1.0 0.0 0.0
1 1 1 low fat 85 15 min rest 0 1.0 0.0 0.0 1.0 0.0
2 2 1 low fat 88 30 min rest 0 1.0 0.0 0.0 0.0 1.0
3 3 2 low fat 90 1 min rest 0 1.0 0.0 1.0 0.0 0.0
4 4 2 low fat 92 15 min rest 0 1.0 0.0 0.0 1.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ...
85 85 29 no fat 135 15 min running 1 0.0 1.0 0.0 1.0 0.0
86 86 29 no fat 130 30 min running 1 0.0 1.0 0.0 0.0 1.0
87 87 30 no fat 99 1 min running 1 0.0 1.0 1.0 0.0 0.0
88 88 30 no fat 111 15 min running 1 0.0 1.0 0.0 1.0 0.0
89 89 30 no fat 150 30 min running 1 0.0 1.0 0.0 0.0 1.0

90 rows × 12 columns
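
As mentioned above, the categories can also be specified manually. A minimal sketch reusing the same data frame; the category lists match the value counts printed earlier, and handle_unknown='ignore' simply produces an all-zero row for any value outside those lists rather than raising an error:

# fix the column order by listing the expected categories per feature,
# and ignore (all-zero encode) anything outside these lists
hotencoder2 = OneHotEncoder(categories=[['low fat', 'no fat'],
                                        ['1 min', '15 min', '30 min']],
                            handle_unknown='ignore')
encoded2 = hotencoder2.fit_transform(data[['diet', 'time']]).toarray()
encoded2.shape    # (90, 5): two diet columns plus three time columns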

Ordinal Encoder

An ordinal encoding maps each unique category to an integer value, so it is sometimes referred to simply as an integer encoding. This type of encoding is really only appropriate when there is a known ordered relationship between the categories. Such a relationship does exist for some of the variables in this dataset (the time column, for example), and ideally it should be harnessed when preparing the data. Here, however, we will ignore any possible ordinal relationship and treat all variables as plain categories. An ordinal encoding can still be helpful, at least as a point of reference for other encoding schemes; a sketch showing how to impose an explicit category order follows the output below.

from sklearn.preprocessing import OrdinalEncoder

oencoder = OrdinalEncoder()

# by default the categories in each column are sorted, so the integer codes
# follow the lexicographic order of the category strings
data.join(pd.DataFrame(data=oencoder.fit_transform(data[['diet', 'time']])))
Unnamed: 0 id diet pulse time kind dietencode 0 1
0 0 1 low fat 85 1 min rest 0 0.0 0.0
1 1 1 low fat 85 15 min rest 0 0.0 1.0
2 2 1 low fat 88 30 min rest 0 0.0 2.0
3 3 2 low fat 90 1 min rest 0 0.0 0.0
4 4 2 low fat 92 15 min rest 0 0.0 1.0
... ... ... ... ... ... ... ... ... ...
85 85 29 no fat 135 15 min running 1 1.0 1.0
86 86 29 no fat 130 30 min running 1 1.0 2.0
87 87 30 no fat 99 1 min running 1 1.0 0.0
88 88 30 no fat 111 15 min running 1 1.0 1.0
89 89 30 no fat 150 30 min running 1 1.0 2.0

90 rows × 9 columns
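
If a meaningful order is known, it can be imposed explicitly through the categories parameter rather than relying on the default sorted order. A minimal sketch for the time column; here the sorted order happens to coincide with the natural one, but being explicit guards against cases where it would not (a hypothetical '5 min' level, for instance, would sort after '30 min'):

# encode time with the natural order 1 min < 15 min < 30 min -> 0, 1, 2
oencoder2 = OrdinalEncoder(categories=[['1 min', '15 min', '30 min']])
oencoder2.fit_transform(data[['time']])[:5]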

Common Questions

This section lists some common questions and answers when encoding categorical data.

Q. What if I have a mixture of numeric and categorical data?

Or, what if I have a mixture of categorical and ordinal data?

You will need to prepare or encode each variable (column) in your dataset separately, then concatenate all of the prepared variables back together into a single array for fitting or evaluating the model.
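
In scikit-learn, ColumnTransformer does this bookkeeping for you: it applies a different transformer to each group of columns and concatenates the results. A minimal sketch against the same exercise data frame, treating pulse as numeric and diet, time and kind as categorical (the scaler choice is just illustrative):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# scale the numeric column, one hot encode the categorical ones,
# and drop everything else (the id columns)
ct = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['pulse']),
        ('cat', OneHotEncoder(), ['diet', 'time', 'kind']),
    ],
    remainder='drop')

X = ct.fit_transform(data)
X.shape    # (90, 9): 1 scaled column + 2 + 3 + 3 one hot columns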

Q. What if I have hundreds of categories?

Or, what if I concatenate many one hot encoded vectors into an input vector with many thousands of elements?

A one hot encoding can be used even with thousands or tens of thousands of categories. Having such large input vectors sounds intimidating, but models can generally handle it.

Try an embedding; it offers the benefit of a smaller vector space (a projection) and the representation can have more meaning.
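A minimal sketch of the idea, assuming TensorFlow/Keras is available; the 10,000-category feature is hypothetical and already label encoded to integers:

import numpy as np
import tensorflow as tf

n_categories = 10_000                                   # hypothetical cardinality
codes = np.random.randint(0, n_categories, size=(32, 1))

# project each integer code into an 8-dimensional dense vector instead of a
# 10,000-element one hot vector; the weights are learned during training
embed = tf.keras.layers.Embedding(input_dim=n_categories, output_dim=8)
embed(codes).shape    # (32, 1, 8)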

Q. What encoding technique is the best?

Unknown. There is no single best technique; evaluate each encoding with your model and data, and use whichever gives the best performance.
