Categorical Encoding
In practice, real datasets almost always contain categorical values. So what is the difference between an ordinary string value and a categorical value? A categorical value may well look numeric, but it is restricted to a limited set of discrete values rather than a continuous range. Some machine learning algorithms can work with categorical values without further manipulation, but many more cannot. This means that categorical data must be encoded to numbers before we can use it to fit and evaluate a model. There are many ways to encode categorical variables for modeling, although the three most common are as follows:
- Label Encoder: Where each unique label is mapped to an integer.
- One Hot Encoder: Where each label is mapped to a binary vector.
- Ordinal Encoder: Where each unique label is mapped to an integer that respects a natural ordering of the categories.
Before we start, let's practice with the exercise dataset from Seaborn.
import pandas as pd
import numpy as np
import seaborn as sns
data = sns.load_dataset("exercise")  # load_dataset already returns a pandas DataFrame
data.dtypes
Unnamed: 0 int64
id int64
diet category
pulse int64
time category
kind category
dtype: object
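As a quick check, we can also list the categorical columns programmatically; a small sketch using pandas' select_dtypes (expected output shown as a comment):
data.select_dtypes(include="category").columns
# Index(['diet', 'time', 'kind'], dtype='object')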
data['diet'].value_counts()
low fat 45
no fat 45
Name: diet, dtype: int64
data['time'].value_counts()
30 min 30
15 min 30
1 min 30
Name: time, dtype: int64
data['kind'].value_counts()
running 30
walking 30
rest 30
Name: kind, dtype: int64
Label Encoder
Label encoding involves mapping each unique label to an integer value, encoding target labels with values between 0 and $n-1$, where $n$ is the number of classes. This transformer should be used to encode target values $y$, not the input $X$. This type of encoding is really only appropriate if there is a known ordinal relationship between the categories (such as an ordered scale of values).
from sklearn.preprocessing import LabelEncoder
lencoder = LabelEncoder()
data['dietencode'] = lencoder.fit_transform(data['diet'])  # map each diet label to an integer
data[['diet', 'dietencode']]
|     | diet    | dietencode |
|-----|---------|------------|
| 0   | low fat | 0          |
| 1   | low fat | 0          |
| 2   | low fat | 0          |
| 3   | low fat | 0          |
| 4   | low fat | 0          |
| ... | ...     | ...        |
| 85  | no fat  | 1          |
| 86  | no fat  | 1          |
| 87  | no fat  | 1          |
| 88  | no fat  | 1          |
| 89  | no fat  | 1          |
90 rows × 2 columns
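The fitted encoder also exposes the learned mapping, and inverse_transform recovers the original labels; a quick sketch (expected outputs shown as comments):
lencoder.classes_  # array(['low fat', 'no fat'], dtype=object)
lencoder.inverse_transform([0, 1])  # array(['low fat', 'no fat'], dtype=object)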
One Hot Encoder
A one hot encoding is appropriate for categorical data where no ordinal relationship exists between categories (like country, name types, etc.). It involves representing each categorical variable with a binary vector that has one element for each unique label, marking the element for the class label with a 1 and all other elements with a 0. The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. This creates a binary column for each category and returns a sparse matrix or dense array, depending on the sparse parameter (renamed sparse_output in newer scikit-learn versions). By default, the encoder derives the categories from the unique values in each feature. Alternatively, you can also specify the categories manually.
from sklearn.preprocessing import OneHotEncoder
hotencoder = OneHotEncoder()
encoded = hotencoder.fit_transform(data[['diet', 'time']])  # sparse matrix of one-hot columns
data.join(pd.DataFrame(encoded.toarray(), columns=hotencoder.get_feature_names_out()))
|     | Unnamed: 0 | id | diet    | pulse | time   | kind    | dietencode | diet_low fat | diet_no fat | time_1 min | time_15 min | time_30 min |
|-----|------------|----|---------|-------|--------|---------|------------|--------------|-------------|------------|-------------|-------------|
| 0   | 0          | 1  | low fat | 85    | 1 min  | rest    | 0          | 1.0          | 0.0         | 1.0        | 0.0         | 0.0         |
| 1   | 1          | 1  | low fat | 85    | 15 min | rest    | 0          | 1.0          | 0.0         | 0.0        | 1.0         | 0.0         |
| 2   | 2          | 1  | low fat | 88    | 30 min | rest    | 0          | 1.0          | 0.0         | 0.0        | 0.0         | 1.0         |
| 3   | 3          | 2  | low fat | 90    | 1 min  | rest    | 0          | 1.0          | 0.0         | 1.0        | 0.0         | 0.0         |
| 4   | 4          | 2  | low fat | 92    | 15 min | rest    | 0          | 1.0          | 0.0         | 0.0        | 1.0         | 0.0         |
| ... | ...        | ...| ...     | ...   | ...    | ...     | ...        | ...          | ...         | ...        | ...         | ...         |
| 85  | 85         | 29 | no fat  | 135   | 15 min | running | 1          | 0.0          | 1.0         | 0.0        | 1.0         | 0.0         |
| 86  | 86         | 29 | no fat  | 130   | 30 min | running | 1          | 0.0          | 1.0         | 0.0        | 0.0         | 1.0         |
| 87  | 87         | 30 | no fat  | 99    | 1 min  | running | 1          | 0.0          | 1.0         | 1.0        | 0.0         | 0.0         |
| 88  | 88         | 30 | no fat  | 111   | 15 min | running | 1          | 0.0          | 1.0         | 0.0        | 1.0         | 0.0         |
| 89  | 89         | 30 | no fat  | 150   | 30 min | running | 1          | 0.0          | 1.0         | 0.0        | 0.0         | 1.0         |
90 rows × 12 columns
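One practical detail: by default the encoder raises an error when it meets a category it did not see during fit. Passing handle_unknown="ignore" encodes unseen labels as all zeros instead; a small sketch (the "keto" label is a made-up example):
safe_encoder = OneHotEncoder(handle_unknown="ignore")
safe_encoder.fit(data[['diet', 'time']])
unseen = pd.DataFrame({'diet': ['keto'], 'time': ['15 min']})
safe_encoder.transform(unseen).toarray()  # [[0., 0., 0., 1., 0.]] -- both diet columns stay 0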
Ordinal Encoder
An ordinal encoding involves mapping each unique label to an integer value. As such, it is sometimes referred to simply as an integer encoding. This type of encoding is really only appropriate if there is a known relationship between the categories. This relationship does exist for some of the variables in the dataset, and ideally, this should be harnessed when preparing the data. In this case, we will ignore any possible existing ordinal relationship and assume all variables are categorical. It can still be helpful to use an ordinal encoding, at least as a point of reference with other encoding schemes.
from sklearn.preprocessing import OrdinalEncoder
oencoder = OrdinalEncoder()
oe_values = oencoder.fit_transform(data[['diet', 'time']])  # integers assigned per feature in sorted category order
data.join(pd.DataFrame(oe_values))  # the new columns default to the names 0 and 1
|     | Unnamed: 0 | id | diet    | pulse | time   | kind    | dietencode | 0   | 1   |
|-----|------------|----|---------|-------|--------|---------|------------|-----|-----|
| 0   | 0          | 1  | low fat | 85    | 1 min  | rest    | 0          | 0.0 | 0.0 |
| 1   | 1          | 1  | low fat | 85    | 15 min | rest    | 0          | 0.0 | 1.0 |
| 2   | 2          | 1  | low fat | 88    | 30 min | rest    | 0          | 0.0 | 2.0 |
| 3   | 3          | 2  | low fat | 90    | 1 min  | rest    | 0          | 0.0 | 0.0 |
| 4   | 4          | 2  | low fat | 92    | 15 min | rest    | 0          | 0.0 | 1.0 |
| ... | ...        | ...| ...     | ...   | ...    | ...     | ...        | ... | ... |
| 85  | 85         | 29 | no fat  | 135   | 15 min | running | 1          | 1.0 | 1.0 |
| 86  | 86         | 29 | no fat  | 130   | 30 min | running | 1          | 1.0 | 2.0 |
| 87  | 87         | 30 | no fat  | 99    | 1 min  | running | 1          | 1.0 | 0.0 |
| 88  | 88         | 30 | no fat  | 111   | 15 min | running | 1          | 1.0 | 1.0 |
| 89  | 89         | 30 | no fat  | 150   | 30 min | running | 1          | 1.0 | 2.0 |
90 rows × 9 columns
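By default, OrdinalEncoder assigns integers in sorted (lexicographic) order, which here happens to coincide with the natural order of the time values. When it does not, the categories parameter lets you fix the ordering explicitly; a small sketch:
ordered_encoder = OrdinalEncoder(categories=[['1 min', '15 min', '30 min']])
ordered_encoder.fit_transform(data[['time']])  # 1 min -> 0.0, 15 min -> 1.0, 30 min -> 2.0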
Common Questions
This section lists some common questions and answers when encoding categorical data.
Q. What if I have a mixture of numeric and categorical data?
Or, what if I have a mixture of categorical and ordinal data?
You will need to prepare or encode each variable (column) in your dataset separately, then concatenate all of the prepared variables back together into a single array for fitting or evaluating the model.
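scikit-learn's ColumnTransformer automates exactly this per-column preparation and concatenation; a minimal sketch using the columns of the exercise dataset (the choice of scaler and encoder here is just for illustration):
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['pulse']),                 # scale the numeric column
    ('cat', OneHotEncoder(), ['diet', 'time', 'kind']),   # one hot encode the categorical columns
])
X = preprocess.fit_transform(data)  # a single array ready for fitting a model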
Q. What if I have hundreds of categories?
Or, what if I concatenate many one hot encoded vectors to create a many thousand element input vector?
You can use a one hot encoding with up to thousands or tens of thousands of categories. Having large vectors as input may sound intimidating, but models can generally handle it.
Alternatively, try an embedding; it offers the benefit of a much smaller vector space (a projection), and the representation can carry more meaning.
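As an illustration of what an embedding looks like in code, here is a minimal sketch using PyTorch's nn.Embedding (the sizes are arbitrary):
import torch
embedding = torch.nn.Embedding(num_embeddings=10000, embedding_dim=16)  # 10,000 categories -> 16-dim vectors
category_ids = torch.tensor([3, 42, 9999])  # integer-encoded category labels
vectors = embedding(category_ids)  # shape (3, 16); the vectors are learned during model training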
Q. What encoding technique is the best?
This is unknowable in advance. The best approach is to empirically evaluate a model with each candidate encoding and compare the results.