EDA and Regression Exercise
Today, I will try exploratory data analysis and regression with insurance data from Kaggle. Let’s take a look.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
data = pd.read_csv("insurance.csv")
data.head()
| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.900 | 0 | yes | southwest | 16884.92400 |
| 1 | 18 | male | 33.770 | 1 | no | southeast | 1725.55230 |
| 2 | 28 | male | 33.000 | 3 | no | southeast | 4449.46200 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.880 | 0 | no | northwest | 3866.85520 |
Let’s see the structure and the context of the data.
Context
Machine Learning with R by Brett Lantz is a book that provides an introduction to machine learning using R. As far as I can tell, Packt Publishing does not make its datasets available online unless you buy the book and create a user account, which can be a problem if you are checking the book out from the library or borrowing it from a friend. All of these datasets are in the public domain but simply needed some cleaning up and recoding to match the format in the book.
Content
Columns
age: age of primary beneficiary
sex: gender of the insurance contractor (female, male)
bmi: body mass index, an objective measure of body weight relative to height (kg/m²); the ideal range is 18.5 to 24.9
children: number of children covered by health insurance / number of dependents
smoker: whether the beneficiary smokes (yes, no)
region: the beneficiary's residential area in the US, northeast, southeast, southwest, northwest.
charges: Individual medical costs billed by health insurance
data.isna().sum()
age 0
sex 0
bmi 0
children 0
smoker 0
region 0
charges 0
dtype: int64
Great! No missing values. Let’s check the info and description.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 1338 non-null int64
1 sex 1338 non-null object
2 bmi 1338 non-null float64
3 children 1338 non-null int64
4 smoker 1338 non-null object
5 region 1338 non-null object
6 charges 1338 non-null float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
data.describe()
| | age | bmi | children | charges |
|---|---|---|---|---|
| count | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 |
| mean | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std | 14.049960 | 6.098187 | 1.205493 | 12110.011237 |
| min | 18.000000 | 15.960000 | 0.000000 | 1121.873900 |
| 25% | 27.000000 | 26.296250 | 0.000000 | 4740.287150 |
| 50% | 39.000000 | 30.400000 | 1.000000 | 9382.033000 |
| 75% | 51.000000 | 34.693750 | 2.000000 | 16639.912515 |
| max | 64.000000 | 53.130000 | 5.000000 | 63770.428010 |
sns.pairplot(data)
Based on the pairplot, the age–charges scatter seems to split into three bands. BMI seems to lack correlation with charges, and I am still not sure whether the number of children correlates with the insurance charges. Let’s check it out.
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap="cool")
sns.boxplot(x=data['charges'])
We can see that there are many outliers in the charges data. I expect these outliers will show up again when we compare charges against the categorical features.
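To put a number on those outliers, here is a small sketch of the standard 1.5×IQR whisker rule that the boxplot uses. The Series below is a toy stand-in; substitute `data['charges']` to get the real count.

```python
import pandas as pd

# Toy stand-in for data['charges']; replace with the real column.
charges = pd.Series([1100, 4700, 9400, 16600, 40000, 63000, 8000, 5000])

q1, q3 = charges.quantile([0.25, 0.75])
iqr = q3 - q1
upper = q3 + 1.5 * iqr            # upper whisker limit of the boxplot
outliers = charges[charges > upper]
print(len(outliers))
```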
Now let’s check categorical feature one by one
sns.boxplot(x=data['sex'], y=data['charges'])
It seems the charges are quite fair between males and females. But notice that we have a lot of outliers here. Let’s check the others.
sns.boxplot(x=data['smoker'], y=data['charges'])
Smoking clearly affects the amount of the insurance charges, by the looks of it. Also, there are hardly any outliers within the smoker group.
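A quick way to quantify that gap is a group-wise median, which is robust to the outliers seen in the boxplots. The rows below are made up for illustration; in practice you would group the real `data`.

```python
import pandas as pd

# Toy rows mimicking the insurance data; use the real DataFrame in practice.
df = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes", "no"],
    "charges": [32000.0, 4000.0, 6000.0, 28000.0, 5000.0],
})

# Median charges per smoker group.
summary = df.groupby("smoker")["charges"].median()
print(summary)
```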
data['region'].value_counts()
southeast 364
northwest 325
southwest 325
northeast 324
Name: region, dtype: int64
Seems balanced; let’s check the boxplot.
sns.boxplot(x=data['region'], y=data['charges'])
The eastern regions have a higher Q3 value. It seems the big charges happen more often in the east.
sns.boxplot(x=data['children'], y=data['charges'])
The number of children probably does not really matter.
from sklearn.preprocessing import LabelEncoder
#sex
le = LabelEncoder()
le.fit(data.sex.drop_duplicates())
data.sex = le.transform(data.sex)
# smoker or not
le.fit(data.smoker.drop_duplicates())
data.smoker = le.transform(data.smoker)
#region
le.fit(data.region.drop_duplicates())
data.region = le.transform(data.region)
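One caveat with LabelEncoder is that the integer codes are assigned in alphabetical order of the labels, so it is worth recording the mapping before the original strings are gone. A small sketch, using the four region labels from above:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["southwest", "southeast", "northwest", "northeast"])

# classes_ is sorted alphabetically; each label's position is its integer code.
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)
```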
sns.heatmap(data.corr(), annot=True, cmap="cool")
WOW. After encoding the text features, we can see that the correlation between smoker and the insurance charges is really strong. Let’s focus on smoker for a while.
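To read the same information without a plot, the correlations against charges can be sorted directly. The frame below is a toy encoded example; with the real data this would be `data.corr()['charges']`.

```python
import pandas as pd

# Toy encoded frame; with the real data use data.corr()['charges'].
df = pd.DataFrame({
    "age": [19, 30, 45, 60],
    "smoker": [1, 0, 1, 0],
    "charges": [17000.0, 4000.0, 40000.0, 13000.0],
})

# Strongest correlates of charges first.
corr = df.corr()["charges"].drop("charges").sort_values(ascending=False)
print(corr)
```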
sns.relplot(x="age", y="charges", data=data, hue='smoker');
With this visualization, we can see that smokers tend to have higher insurance charges, while smokers themselves are spread fairly evenly across all ages. The charges also increase as you get older.
sns.boxplot(x="sex", y="charges", data=data, hue='smoker');
Well, we can be sure the insurance charges are not sexist after all.
sns.relplot(x="bmi", y="charges", data=data, hue='smoker');
Surprisingly, BMI does not differ that much between smokers and non-smokers. I thought smoking had a strong influence on weight.
sns.boxplot(x="children", y="charges", data=data, hue='smoker');
Smokers hardly have more than 3 children, according to this plot. Smoking is not good for your children after all.
sns.boxplot(x="region", y="charges", data=data, hue='smoker');
Nothing much to see here; smokers are spread fairly evenly across the regions. I almost forgot the continuous values. Let’s analyze them one by one.
sns.jointplot(x='age', y='charges', data=data[data['age'] <=20])
sns.jointplot(x='age', y='charges', data=data[data['age'] > 40])
The older you are, the higher your insurance charges, by the looks of it. The age distribution seems balanced to me.
sns.jointplot(x='bmi', y='charges', data=data[data['bmi'] <30])
sns.jointplot(x='bmi', y='charges', data=data[data['bmi'] >=30])
Even though the BMI values are roughly normally distributed around 30 (the line between overweight and not), it seems we cannot rely much on BMI alone to explain the charges. What is sure is that there are more overweight people here.
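One way to test this impression is to split on the BMI = 30 cut-off and cross it with smoking status, since the scatter plots suggest smoking dominates. The rows below are made up for illustration; replace `df` with the real `data`.

```python
import pandas as pd

# Toy rows; replace df with the real data to run this for real.
df = pd.DataFrame({
    "bmi": [22.0, 35.0, 28.0, 41.0, 31.0, 24.0],
    "smoker": ["no", "yes", "no", "yes", "no", "yes"],
    "charges": [3000.0, 42000.0, 5000.0, 45000.0, 6000.0, 20000.0],
})

df["obese"] = df["bmi"] >= 30  # BMI 30 is the usual obesity cut-off
pivot = df.pivot_table(values="charges", index="obese",
                       columns="smoker", aggfunc="median")
print(pivot)
```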
Regression
x = data.drop(['charges'], axis = 1)
y = data['charges']
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=17)
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
lm1 = LinearRegression()
lm2 = make_pipeline(StandardScaler(), LinearRegression())
I’m using two different linear regression setups: the first without scaling, and the second with StandardScaler standardization (the old `normalize=True` argument has been removed from scikit-learn).
from sklearn.model_selection import cross_val_score
print("Linear Regression score without normalization = ", cross_val_score(lm1, x_train, y_train, scoring='neg_mean_squared_error').mean())
print("Linear Regression score with normalization = ", cross_val_score(lm2, x_train, y_train, scoring='neg_mean_squared_error').mean())
Linear Regression score without normalization = -38555503.39899726
Linear Regression score with normalization = -38555503.39899726
Wow, pretty much the same. That makes sense: scaling the features does not change the predictions of ordinary least squares. Now let’s try to predict.
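Since `scoring='neg_mean_squared_error'` returns a negated MSE, converting it to an RMSE puts the error back into the units of charges (dollars), which is easier to read. Using the score from above:

```python
import numpy as np

# Mean score from cross_val_score with scoring='neg_mean_squared_error'.
neg_mse = -38555503.39899726
rmse = np.sqrt(-neg_mse)  # back to the units of charges (dollars)
print(rmse)
```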
lm1.fit(x_train, y_train)
prediction = lm1.predict(x_test)
from sklearn.metrics import mean_squared_error
print("The mean squared error is ", mean_squared_error(y_test, prediction))
The mean squared error is 30084782.121418368
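MSE alone is hard to interpret, so a complementary check is the R² score, which measures the fraction of variance explained. The arrays below are toy values; with the real split it would be `r2_score(y_test, prediction)`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy targets/predictions; use r2_score(y_test, prediction) on the real split.
y_true = np.array([3000.0, 12000.0, 30000.0])
y_pred = np.array([4000.0, 10000.0, 28000.0])

r2 = r2_score(y_true, y_pred)  # 1.0 is perfect, 0.0 is a mean-only baseline
print(round(r2, 3))
```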
sns.distplot(y_test - prediction)
It does not look that great with linear regression. We will try to improve it later with different methods.
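As a sketch of one possible next step, a tree ensemble such as RandomForestRegressor can capture interactions (e.g. smoker × BMI) that a plain linear model misses. The features below are synthetic stand-ins; in practice you would reuse `x_train`/`x_test` from above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded insurance features.
rng = np.random.default_rng(17)
X = rng.uniform(0, 1, size=(200, 3))
y = 30000 * X[:, 0] + 5000 * X[:, 1] + rng.normal(0, 500, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=17)
rf = RandomForestRegressor(n_estimators=200, random_state=17)
rf.fit(X_tr, y_tr)
mse = mean_squared_error(y_te, rf.predict(X_te))
print(mse)
```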