EDA Regression Exercise

5 minute read

Today, I will try exploratory data analysis and regression with the insurance dataset from Kaggle. Let’s take a look.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 

%matplotlib inline
data = pd.read_csv("insurance.csv")
data.head()
   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520

Let’s see the structure and the context of the data.

Context

Machine Learning with R by Brett Lantz is a book that provides an introduction to machine learning using R. As far as I can tell, Packt Publishing does not make its datasets available online unless you buy the book and create a user account, which can be a problem if you are checking the book out from the library or borrowing it from a friend. All of these datasets are in the public domain but simply needed some cleaning up and recoding to match the format in the book.

Content

Columns

age: age of primary beneficiary

sex: insurance contractor gender, female, male

bmi: body mass index, an objective measure of body weight relative to height (kg/m²); values that are relatively high or low indicate over- or underweight, with the ideal range being 18.5 to 24.9

children: Number of children covered by health insurance / Number of dependents

smoker: whether the beneficiary smokes (yes/no)

region: the beneficiary's residential area in the US: northeast, southeast, southwest, or northwest

charges: Individual medical costs billed by health insurance
data.isna().sum()
age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

Great! No missing values. Let’s check the info and description.

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
data.describe()
               age          bmi     children       charges
count  1338.000000  1338.000000  1338.000000   1338.000000
mean     39.207025    30.663397     1.094918  13270.422265
std      14.049960     6.098187     1.205493  12110.011237
min      18.000000    15.960000     0.000000   1121.873900
25%      27.000000    26.296250     0.000000   4740.287150
50%      39.000000    30.400000     1.000000   9382.033000
75%      51.000000    34.693750     2.000000  16639.912515
max      64.000000    53.130000     5.000000  63770.428010
sns.pairplot(data)

png

Based on the pairplot, charges versus age seems to split into three bands. BMI seems to lack correlation with charges. I am still not sure whether the number of children really correlates with insurance charges. Let’s check it out.

sns.heatmap(data.corr(numeric_only=True), annot=True, cmap="cool")

png

sns.boxplot(data['charges'])

png

We can see that there are many outliers in the charges data. I expect these outliers to show up again when we compare charges against the categorical features.
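The outlier talk above can be made concrete with Tukey's IQR rule, which is the same rule the boxplot whiskers use. A minimal sketch on a made-up series of charges (the values here are hypothetical, not from the dataset):

```python
import pandas as pd

def iqr_outliers(s: pd.Series) -> pd.Series:
    """Flag values outside the Tukey fences (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# Toy example: one extreme charge among ordinary ones
charges = pd.Series([1000, 2000, 3000, 4000, 5000, 60000])
print(iqr_outliers(charges).sum())  # → 1
```

On the real data, `iqr_outliers(data['charges'])` would flag the same points the boxplot draws beyond its whiskers.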

Now let’s check the categorical features one by one.

sns.boxplot(x='sex', y='charges', data=data)

png

It seems the amount of charges is quite fair between males and females. But notice that we have a lot of outliers here. Let’s check the others.

sns.boxplot(x='smoker', y='charges', data=data)

png

Smoking clearly affects the amount of insurance charges, by the looks of it. There are also no outliers within the smoker group.

data['region'].value_counts()
southeast    364
northwest    325
southwest    325
northeast    324
Name: region, dtype: int64

Seems balanced. Let’s check the boxplot.

sns.boxplot(x='region', y='charges', data=data)

png

The eastern regions have a higher Q3 value. It seems the big charges happen more often in the east.

sns.boxplot(x='children', y='charges', data=data)

png

The number of children probably does not really matter.

from sklearn.preprocessing import LabelEncoder
#sex
le = LabelEncoder()
le.fit(data.sex.drop_duplicates()) 
data.sex = le.transform(data.sex)
# smoker or not
le.fit(data.smoker.drop_duplicates()) 
data.smoker = le.transform(data.smoker)
#region
le.fit(data.region.drop_duplicates()) 
data.region = le.transform(data.region)
sns.heatmap(data.corr(), annot=True, cmap="cool")

png

Wow. After encoding the text features, we can see that the correlation between smoker and insurance charges is really strong. Let’s focus on smoker for a while.
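One caveat with the encoding above: LabelEncoder assigns region an arbitrary numeric order (0–3), which can mislead both the correlation heatmap and a linear model. A hedged alternative is one-hot encoding with pd.get_dummies; a minimal sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"region": ["southwest", "southeast", "northwest", "northeast"]})

# One indicator column per category, dropping one to avoid redundancy;
# no artificial ordering is imposed on the regions.
dummies = pd.get_dummies(df, columns=["region"], drop_first=True)
print(list(dummies.columns))  # → ['region_northwest', 'region_southeast', 'region_southwest']
```

For binary columns like sex and smoker the two encodings coincide, so LabelEncoder is harmless there.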

sns.relplot(x="age", y="charges", data=data, hue='smoker');

png

With this visualization, we can see that smokers tend to have higher insurance charges, while smokers themselves are spread across all ages. Charges also increase with age.

sns.boxplot(x="sex", y="charges", data=data, hue='smoker');

png

Well, we can be sure that the insurance charges are not sexist after all.

sns.relplot(x="bmi", y="charges", data=data, hue='smoker');

png

Surprisingly, BMI does not correlate that much with smoking. I thought smoking had a strong influence on your weight.

sns.boxplot(x="children", y="charges", data=data, hue='smoker');

png

Smokers hardly have more than 3 children, according to that. Smoking is not good for your children after all.

sns.boxplot(x="region", y="charges", data=data, hue='smoker');

png

Nothing to see here; smokers are spread fairly evenly across the regions. I almost forgot the continuous values. Let’s analyze them one by one.

sns.jointplot(x='age', y='charges', data=data[data['age'] <=20])

png

sns.jointplot(x='age', y='charges', data=data[data['age'] > 40])

png

The older you are, the higher your insurance charges, by the looks of it. The age distribution seems balanced to me.

sns.jointplot(x='bmi', y='charges', data=data[data['bmi'] <30])

png

sns.jointplot(x='bmi', y='charges', data=data[data['bmi'] >=30])

png

Even though the BMI values are normally distributed around 30 (the line between overweight or not), it seems we can’t rely that much on BMI to explain charges. What is sure is that there are more overweight people here.
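That split at BMI 30 can be made explicit with pd.cut, binning BMI into the usual WHO-style categories. A small sketch on hypothetical values (the bin edges 25 and 30 are the standard overweight/obese thresholds):

```python
import pandas as pd

bmi = pd.Series([22.0, 27.5, 31.0, 36.0])

# Bin into normal (<25), overweight (25-30), obese (>=30)
labels = pd.cut(bmi, bins=[0, 25, 30, 100],
                labels=["normal", "overweight", "obese"])
print(list(labels))  # → ['normal', 'overweight', 'obese', 'obese']
```

A categorical BMI column like this could then be compared against charges with the same boxplots used above.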

Regression

x = data.drop(['charges'], axis = 1)
y = data['charges']
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=17)
from sklearn.linear_model import LinearRegression

lm1 = LinearRegression()
lm2 = LinearRegression(normalize=True)

I’m using two different linear regression models: the first without normalization and the second with the built-in normalize=True option.
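A note of caution: the normalize= parameter was deprecated and later removed from LinearRegression in recent scikit-learn releases. The recommended replacement is a Pipeline with StandardScaler; a sketch on synthetic data (the data below is made up purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales, with a mostly linear target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * np.array([1.0, 10.0, 100.0])
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=100)

# Scale the features, then fit ordinary least squares
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X, y)
print(round(model.score(X, y), 3))  # R^2 close to 1 on this nearly-linear data
```

The pipeline can be passed to cross_val_score exactly like a plain estimator, so it slots into the comparison below unchanged.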

from sklearn.model_selection import cross_val_score

print("Linear Regression score without normalization = ", cross_val_score(lm1, x_train, y_train, scoring='neg_mean_squared_error').mean())
print("Linear Regression score with normalization = ", cross_val_score(lm2, x_train, y_train, scoring='neg_mean_squared_error').mean())
Linear Regression score without normalization =  -38555503.39899726
Linear Regression score with normalization =  -38555503.39899726

Wow, pretty much the same. That makes sense: ordinary least squares is insensitive to feature scaling, so normalization does not change the predictions. Now let’s try to predict.

lm1.fit(x_train, y_train)
prediction = lm1.predict(x_test)
from sklearn.metrics import mean_squared_error

print("The mean squared error is ", mean_squared_error(y_test, prediction))
The mean squared error is  30084782.121418368
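An MSE of ~30 million is hard to read because it is in squared dollars. Taking the square root (RMSE) and adding R² gives more interpretable numbers; a small sketch on hypothetical predictions (the values below are made up):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical targets vs. predictions, just to illustrate the metrics
y_true = np.array([1000.0, 2000.0, 3000.0, 4000.0])
y_pred = np.array([1100.0, 1900.0, 3200.0, 3800.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # same units as charges
r2 = r2_score(y_true, y_pred)                       # fraction of variance explained
print(round(rmse, 2), round(r2, 3))  # → 158.11 0.98
```

On the real test set, `np.sqrt(30084782.12)` is roughly 5485, i.e. the model is off by about $5,500 per person on average.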
sns.distplot((prediction ** 2) - (y_test ** 2))  # note: distplot is deprecated in newer seaborn; histplot is the replacement

png

It does not look great with plain linear regression. We will try to improve it later with different methods.
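As a sketch of one such method: a tree ensemble like RandomForestRegressor can capture non-linear effects and interactions (for example smoker × BMI) that a linear model misses. The data below is synthetic, just to show the API under the same cross-validation setup as above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic non-linear target: an interaction a linear model cannot fit well
rng = np.random.default_rng(17)
X = rng.uniform(size=(200, 3))
y = 100 * X[:, 0] * X[:, 1] + rng.normal(scale=1.0, size=200)

rf = RandomForestRegressor(n_estimators=100, random_state=17)
scores = cross_val_score(rf, X, y, scoring="neg_mean_squared_error")
print(scores.mean())
```

Swapping `rf` in for `lm1` on the insurance features would give a directly comparable cross-validated MSE.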
