Categorical Encoding

5 minute read

In practice, real datasets often contain categorical values. So what is the difference between an ordinary string value and a categorical value? Well, a categorical value can sometimes be numeric, and a categorical variable is limited to a fixed set of values rather than a continuous range. Some machine learning algorithms can handle categorical values without further manipulation, but many more cannot. This means that categorical data must be encoded to numbers before we can use it to fit and evaluate a model. There are many ways to encode categorical variables for modeling, although the three most common are as follows:

  1. Label Encoder: Where each unique target label is mapped to an integer.
  2. One Hot Encoder: Where each category is mapped to a binary vector.
  3. Ordinal Encoder: Where each unique category of an input feature is mapped to an integer.

Before we start, let's practice with the exercise dataset from Seaborn.

import pandas as pd
import numpy as np
import seaborn as sns

# Seaborn's exercise dataset; load_dataset already returns a pandas DataFrame
data = sns.load_dataset("exercise")
data.dtypes
Unnamed: 0       int64
id               int64
diet          category
pulse            int64
time          category
kind          category
dtype: object
data['diet'].value_counts()
low fat    45
no fat     45
Name: diet, dtype: int64
data['time'].value_counts()
30 min    30
15 min    30
1 min     30
Name: time, dtype: int64
data['kind'].value_counts()
running    30
walking    30
rest       30
Name: kind, dtype: int64

Label Encoder

Label encoding maps each unique label to an integer value, encoding target labels with values between 0 and $n-1$. This transformer is meant for encoding target values y, not the input X. Because an integer encoding imposes an order on the categories, using it on input features is really only appropriate when a known ordered relationship exists between them (for example, categories taken from a continuous scale).

from sklearn.preprocessing import LabelEncoder

lencoder = LabelEncoder()
data['dietencode'] = lencoder.fit_transform(data['diet'])
data[['dietencode', 'diet']]
    dietencode     diet
0            0  low fat
1            0  low fat
2            0  low fat
3            0  low fat
4            0  low fat
..         ...      ...
85           1   no fat
86           1   no fat
87           1   no fat
88           1   no fat
89           1   no fat

90 rows × 2 columns
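
As a quick check, the fitted encoder exposes the mapping it learned and can reverse it. A minimal sketch, reusing the lencoder and data objects from above:

# classes_ holds the learned labels; their position is the encoded integer,
# so here 'low fat' -> 0 and 'no fat' -> 1
lencoder.classes_

# recover the original strings from the encoded column
lencoder.inverse_transform(data['dietencode'])[:5]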

One Hot Encoder

A one hot encoding is appropriate for categorical data where no ordinal relationship exists between the categories (such as country or name). It represents each categorical variable with a binary vector that has one element per unique label, marking the element for the observed label with a 1 and all other elements with 0. The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The encoder creates a binary column for each category and returns a sparse matrix or dense array (depending on the sparse parameter, renamed sparse_output in newer scikit-learn versions). By default, the encoder derives the categories from the unique values in each feature; alternatively, you can specify the categories manually, as shown after the output below.

from sklearn.preprocessing import OneHotEncoder

hotencoder = OneHotEncoder()

# fit_transform returns a sparse matrix by default, so convert it to a dense
# array before wrapping it in a DataFrame; newer scikit-learn versions use
# get_feature_names_out() instead of get_feature_names()
encoded = hotencoder.fit_transform(data[['diet', 'time']]).toarray()
data.join(pd.DataFrame(data=encoded, columns=hotencoder.get_feature_names()))
Unnamed: 0 id diet pulse time kind dietencode x0_low fat x0_no fat x1_1 min x1_15 min x1_30 min
0 0 1 low fat 85 1 min rest 0 1.0 0.0 1.0 0.0 0.0
1 1 1 low fat 85 15 min rest 0 1.0 0.0 0.0 1.0 0.0
2 2 1 low fat 88 30 min rest 0 1.0 0.0 0.0 0.0 1.0
3 3 2 low fat 90 1 min rest 0 1.0 0.0 1.0 0.0 0.0
4 4 2 low fat 92 15 min rest 0 1.0 0.0 0.0 1.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ...
85 85 29 no fat 135 15 min running 1 0.0 1.0 0.0 1.0 0.0
86 86 29 no fat 130 30 min running 1 0.0 1.0 0.0 0.0 1.0
87 87 30 no fat 99 1 min running 1 0.0 1.0 1.0 0.0 0.0
88 88 30 no fat 111 15 min running 1 0.0 1.0 0.0 1.0 0.0
89 89 30 no fat 150 30 min running 1 0.0 1.0 0.0 0.0 1.0

90 rows × 12 columns
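
As mentioned above, the categories can also be specified manually. A minimal sketch reusing the same data frame; the category lists match the value counts printed earlier, and handle_unknown='ignore' simply produces an all-zero row for any value outside those lists rather than raising an error:

# fix the column order by listing the expected categories per feature,
# and ignore (all-zero encode) anything outside these lists
hotencoder2 = OneHotEncoder(categories=[['low fat', 'no fat'],
                                        ['1 min', '15 min', '30 min']],
                            handle_unknown='ignore')
encoded2 = hotencoder2.fit_transform(data[['diet', 'time']]).toarray()
encoded2.shape    # (90, 5): two diet columns plus three time columns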

Ordinal Encoder

An ordinal encoding maps each unique category to an integer value, so it is sometimes referred to simply as an integer encoding. This type of encoding is really only appropriate when there is a known ordered relationship between the categories. Such a relationship does exist for some of the variables in this dataset (the time column, for example), and ideally it should be harnessed when preparing the data. Here, however, we will ignore any possible ordinal relationship and treat all variables as plain categories. An ordinal encoding can still be helpful, at least as a point of reference for other encoding schemes; a sketch showing how to impose an explicit category order follows the output below.

from sklearn.preprocessing import OrdinalEncoder

oencoder = OrdinalEncoder()

# by default the categories in each column are sorted, so the integer codes
# follow the lexicographic order of the category strings
data.join(pd.DataFrame(data=oencoder.fit_transform(data[['diet', 'time']])))
Unnamed: 0 id diet pulse time kind dietencode 0 1
0 0 1 low fat 85 1 min rest 0 0.0 0.0
1 1 1 low fat 85 15 min rest 0 0.0 1.0
2 2 1 low fat 88 30 min rest 0 0.0 2.0
3 3 2 low fat 90 1 min rest 0 0.0 0.0
4 4 2 low fat 92 15 min rest 0 0.0 1.0
... ... ... ... ... ... ... ... ... ...
85 85 29 no fat 135 15 min running 1 1.0 1.0
86 86 29 no fat 130 30 min running 1 1.0 2.0
87 87 30 no fat 99 1 min running 1 1.0 0.0
88 88 30 no fat 111 15 min running 1 1.0 1.0
89 89 30 no fat 150 30 min running 1 1.0 2.0

90 rows × 9 columns
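
If a meaningful order is known, it can be imposed explicitly through the categories parameter rather than relying on the default sorted order. A minimal sketch for the time column; here the sorted order happens to coincide with the natural one, but being explicit guards against cases where it would not (a hypothetical '5 min' level, for instance, would sort after '30 min'):

# encode time with the natural order 1 min < 15 min < 30 min -> 0, 1, 2
oencoder2 = OrdinalEncoder(categories=[['1 min', '15 min', '30 min']])
oencoder2.fit_transform(data[['time']])[:5]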

Common Questions

This section lists some common questions and answers when encoding categorical data.

Q. What if I have a mixture of numeric and categorical data?

Or, what if I have a mixture of categorical and ordinal data?

You will need to prepare or encode each variable (column) in your dataset separately, then concatenate all of the prepared variables back together into a single array for fitting or evaluating the model.
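
In scikit-learn, ColumnTransformer does this bookkeeping for you: it applies a different transformer to each group of columns and concatenates the results. A minimal sketch against the same exercise data frame, treating pulse as numeric and diet, time and kind as categorical (the scaler choice is just illustrative):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# scale the numeric column, one hot encode the categorical ones,
# and drop everything else (the id columns)
ct = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['pulse']),
        ('cat', OneHotEncoder(), ['diet', 'time', 'kind']),
    ],
    remainder='drop')

X = ct.fit_transform(data)
X.shape    # (90, 9): 1 scaled column + 2 + 3 + 3 one hot columns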

Q. What if I have hundreds of categories?

Or, what if I concatenate many one hot encoded vectors into an input vector with many thousands of elements?

A one hot encoding can be used even with thousands or tens of thousands of categories. Having such large input vectors sounds intimidating, but models can generally handle it.

Try an embedding; it offers the benefit of a smaller vector space (a projection) and the representation can have more meaning.
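A minimal sketch of the idea, assuming TensorFlow/Keras is available; the 10,000-category feature is hypothetical and already label encoded to integers:

import numpy as np
import tensorflow as tf

n_categories = 10_000                                   # hypothetical cardinality
codes = np.random.randint(0, n_categories, size=(32, 1))

# project each integer code into an 8-dimensional dense vector instead of a
# 10,000-element one hot vector; the weights are learned during training
embed = tf.keras.layers.Embedding(input_dim=n_categories, output_dim=8)
embed(codes).shape    # (32, 1, 8)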

Q. What encoding technique is the best?

Unknown. There is no single best technique; evaluate each encoding with your model and data, and use whichever gives the best performance.
