EDA and Regression Exercise
Today, I will try exploratory data analysis and regression with insurance data from Kaggle. Let’s take a look.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
data = pd.read_csv("insurance.csv")
data.head()
| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.900 | 0 | yes | southwest | 16884.92400 |
| 1 | 18 | male | 33.770 | 1 | no | southeast | 1725.55230 |
| 2 | 28 | male | 33.000 | 3 | no | southeast | 4449.46200 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.880 | 0 | no | northwest | 3866.85520 |
Let’s see the structure and the context of the data.
Context
Machine Learning with R by Brett Lantz is a book that provides an introduction to machine learning using R. As far as I can tell, Packt Publishing does not make its datasets available online unless you buy the book and create a user account, which can be a problem if you are checking the book out from the library or borrowing it from a friend. All of these datasets are in the public domain but simply needed some cleaning up and recoding to match the format in the book.
Content
Columns
age: age of primary beneficiary
sex: gender of the insurance contractor (female, male)
bmi: body mass index, an objective measure of body weight relative to height (kg/m²); the ideal range is 18.5 to 24.9
children: number of children covered by health insurance / number of dependents
smoker: whether the beneficiary smokes (yes, no)
region: the beneficiary's residential area in the US, northeast, southeast, southwest, northwest.
charges: Individual medical costs billed by health insurance
data.isna().sum()
age 0
sex 0
bmi 0
children 0
smoker 0
region 0
charges 0
dtype: int64
Great! No missing values. Let’s check the info and description.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 1338 non-null int64
1 sex 1338 non-null object
2 bmi 1338 non-null float64
3 children 1338 non-null int64
4 smoker 1338 non-null object
5 region 1338 non-null object
6 charges 1338 non-null float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
data.describe()
| | age | bmi | children | charges |
|---|---|---|---|---|
| count | 1338.000000 | 1338.000000 | 1338.000000 | 1338.000000 |
| mean | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std | 14.049960 | 6.098187 | 1.205493 | 12110.011237 |
| min | 18.000000 | 15.960000 | 0.000000 | 1121.873900 |
| 25% | 27.000000 | 26.296250 | 0.000000 | 4740.287150 |
| 50% | 39.000000 | 30.400000 | 1.000000 | 9382.033000 |
| 75% | 51.000000 | 34.693750 | 2.000000 | 16639.912515 |
| max | 64.000000 | 53.130000 | 5.000000 | 63770.428010 |
sns.pairplot(data)
Based on the pairplot, the age–charges scatter seems to split into three bands. BMI seems to lack correlation with charges, and I am still not sure whether the number of children correlates with the insurance charges. Let’s check it out.
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap="cool")
sns.boxplot(x=data['charges'])
We can see that there are many outliers in the charges data. I expect these outliers will show up again when we compare charges against the categorical features.
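To put a number on those outliers, here is a small sketch of the standard 1.5×IQR whisker rule that the boxplot uses. The Series below is a toy stand-in; substitute `data['charges']` to get the real count.

```python
import pandas as pd

# Toy stand-in for data['charges']; replace with the real column.
charges = pd.Series([1100, 4700, 9400, 16600, 40000, 63000, 8000, 5000])

q1, q3 = charges.quantile([0.25, 0.75])
iqr = q3 - q1
upper = q3 + 1.5 * iqr            # upper whisker limit of the boxplot
outliers = charges[charges > upper]
print(len(outliers))
```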
Now let’s check categorical feature one by one
sns.boxplot(x=data['sex'], y=data['charges'])
It seems the charges are quite fair between males and females. But notice that we have a lot of outliers here. Let’s check the others.
sns.boxplot(x=data['smoker'], y=data['charges'])
Smoking clearly affects the amount of the insurance charges, by the looks of it. Also, there are hardly any outliers within the smoker group.
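A quick way to quantify that gap is a group-wise median, which is robust to the outliers seen in the boxplots. The rows below are made up for illustration; in practice you would group the real `data`.

```python
import pandas as pd

# Toy rows mimicking the insurance data; use the real DataFrame in practice.
df = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes", "no"],
    "charges": [32000.0, 4000.0, 6000.0, 28000.0, 5000.0],
})

# Median charges per smoker group.
summary = df.groupby("smoker")["charges"].median()
print(summary)
```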
data['region'].value_counts()
southeast 364
northwest 325
southwest 325
northeast 324
Name: region, dtype: int64
Seems balanced; let’s check the boxplot.
sns.boxplot(x=data['region'], y=data['charges'])
The eastern regions have a higher Q3 value. It seems the big charges happen more often in the east.
sns.boxplot(x=data['children'], y=data['charges'])
The number of children probably does not really matter.
from sklearn.preprocessing import LabelEncoder
#sex
le = LabelEncoder()
le.fit(data.sex.drop_duplicates())
data.sex = le.transform(data.sex)
# smoker or not
le.fit(data.smoker.drop_duplicates())
data.smoker = le.transform(data.smoker)
#region
le.fit(data.region.drop_duplicates())
data.region = le.transform(data.region)
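One caveat with LabelEncoder is that the integer codes are assigned in alphabetical order of the labels, so it is worth recording the mapping before the original strings are gone. A small sketch, using the four region labels from above:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["southwest", "southeast", "northwest", "northeast"])

# classes_ is sorted alphabetically; each label's position is its integer code.
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)
```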
sns.heatmap(data.corr(), annot=True, cmap="cool")
WOW. After encoding the text features, we can see that the correlation between smoker and the insurance charges is really strong. Let’s focus on smoker for a while.
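To read the same information without a plot, the correlations against charges can be sorted directly. The frame below is a toy encoded example; with the real data this would be `data.corr()['charges']`.

```python
import pandas as pd

# Toy encoded frame; with the real data use data.corr()['charges'].
df = pd.DataFrame({
    "age": [19, 30, 45, 60],
    "smoker": [1, 0, 1, 0],
    "charges": [17000.0, 4000.0, 40000.0, 13000.0],
})

# Strongest correlates of charges first.
corr = df.corr()["charges"].drop("charges").sort_values(ascending=False)
print(corr)
```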
sns.relplot(x="age", y="charges", data=data, hue='smoker');
With this visualization, we can see that smokers tend to have higher insurance charges, while smokers themselves are spread fairly evenly across all ages. The charges also increase as you get older.
sns.boxplot(x="sex", y="charges", data=data, hue='smoker');
Well, we can be sure the insurance charges are not sexist after all.
sns.relplot(x="bmi", y="charges", data=data, hue='smoker');
Surprisingly, BMI does not differ that much between smokers and non-smokers. I thought smoking had a strong influence on weight.
sns.boxplot(x="children", y="charges", data=data, hue='smoker');
Smokers hardly have more than 3 children, according to this plot. Smoking is not good for your children after all.
sns.boxplot(x="region", y="charges", data=data, hue='smoker');
Nothing much to see here; smokers are spread fairly evenly across the regions. I almost forgot the continuous values. Let’s analyze them one by one.
sns.jointplot(x='age', y='charges', data=data[data['age'] <=20])
sns.jointplot(x='age', y='charges', data=data[data['age'] > 40])
The older you are, the higher your insurance charges, by the looks of it. The age distribution seems balanced to me.
sns.jointplot(x='bmi', y='charges', data=data[data['bmi'] <30])
sns.jointplot(x='bmi', y='charges', data=data[data['bmi'] >=30])
Even though the BMI values are roughly normally distributed around 30 (the line between overweight and not), it seems we cannot rely much on BMI alone to explain the charges. What is sure is that there are more overweight people here.
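One way to test this impression is to split on the BMI = 30 cut-off and cross it with smoking status, since the scatter plots suggest smoking dominates. The rows below are made up for illustration; replace `df` with the real `data`.

```python
import pandas as pd

# Toy rows; replace df with the real data to run this for real.
df = pd.DataFrame({
    "bmi": [22.0, 35.0, 28.0, 41.0, 31.0, 24.0],
    "smoker": ["no", "yes", "no", "yes", "no", "yes"],
    "charges": [3000.0, 42000.0, 5000.0, 45000.0, 6000.0, 20000.0],
})

df["obese"] = df["bmi"] >= 30  # BMI 30 is the usual obesity cut-off
pivot = df.pivot_table(values="charges", index="obese",
                       columns="smoker", aggfunc="median")
print(pivot)
```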
Regression
x = data.drop(['charges'], axis = 1)
y = data['charges']
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=17)
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
lm1 = LinearRegression()
lm2 = make_pipeline(StandardScaler(), LinearRegression())
I’m using two different linear regression setups: the first without scaling, and the second with StandardScaler standardization (the old `normalize=True` argument has been removed from scikit-learn).
from sklearn.model_selection import cross_val_score
print("Linear Regression score without normalization = ", cross_val_score(lm1, x_train, y_train, scoring='neg_mean_squared_error').mean())
print("Linear Regression score with normalization = ", cross_val_score(lm2, x_train, y_train, scoring='neg_mean_squared_error').mean())
Linear Regression score without normalization = -38555503.39899726
Linear Regression score with normalization = -38555503.39899726
Wow, pretty much the same. That makes sense: scaling the features does not change the predictions of ordinary least squares. Now let’s try to predict.
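Since `scoring='neg_mean_squared_error'` returns a negated MSE, converting it to an RMSE puts the error back into the units of charges (dollars), which is easier to read. Using the score from above:

```python
import numpy as np

# Mean score from cross_val_score with scoring='neg_mean_squared_error'.
neg_mse = -38555503.39899726
rmse = np.sqrt(-neg_mse)  # back to the units of charges (dollars)
print(rmse)
```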
lm1.fit(x_train, y_train)
prediction = lm1.predict(x_test)
from sklearn.metrics import mean_squared_error
print("The mean squared error is ", mean_squared_error(y_test, prediction))
The mean squared error is 30084782.121418368
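MSE alone is hard to interpret, so a complementary check is the R² score, which measures the fraction of variance explained. The arrays below are toy values; with the real split it would be `r2_score(y_test, prediction)`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy targets/predictions; use r2_score(y_test, prediction) on the real split.
y_true = np.array([3000.0, 12000.0, 30000.0])
y_pred = np.array([4000.0, 10000.0, 28000.0])

r2 = r2_score(y_true, y_pred)  # 1.0 is perfect, 0.0 is a mean-only baseline
print(round(r2, 3))
```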
sns.distplot(y_test - prediction)
It does not look that great with linear regression. We will try to improve it later with different methods.
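As a sketch of one possible next step, a tree ensemble such as RandomForestRegressor can capture interactions (e.g. smoker × BMI) that a plain linear model misses. The features below are synthetic stand-ins; in practice you would reuse `x_train`/`x_test` from above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded insurance features.
rng = np.random.default_rng(17)
X = rng.uniform(0, 1, size=(200, 3))
y = 30000 * X[:, 0] + 5000 * X[:, 1] + rng.normal(0, 500, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=17)
rf = RandomForestRegressor(n_estimators=200, random_state=17)
rf.fit(X_tr, y_tr)
mse = mean_squared_error(y_te, rf.predict(X_te))
print(mse)
```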