Logistic Regression Exercise
Exercise from Jose Portilla's Python for Data Science Bootcamp.
Now let's get started.
Logistic Regression Project
In this project we will work with a fake advertising data set that indicates whether or not a particular internet user clicked on an advertisement. We will try to build a model that predicts whether a user will click on an ad, based on that user's features.
This data set contains the following features:
- ‘Daily Time Spent on Site’: consumer time on site in minutes
- ‘Age’: customer age in years
- ‘Area Income’: Avg. Income of geographical area of consumer
- ‘Daily Internet Usage’: Avg. minutes a day consumer is on the internet
- ‘Ad Topic Line’: Headline of the advertisement
- ‘City’: City of consumer
- ‘Male’: Whether or not consumer was male
- ‘Country’: Country of consumer
- ‘Timestamp’: Time at which consumer clicked on Ad or closed window
- ‘Clicked on Ad’: 0 or 1, indicating whether the consumer clicked on the ad (1) or not (0)
Import Libraries
Import a few libraries you think you’ll need (Or just import them as you go along!)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Get the Data
Read in the advertising.csv file and set it to a data frame called ad_data.
ad_data = pd.read_csv('advertising.csv')
Check the head of ad_data
ad_data.head()
Index | Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Ad Topic Line | City | Male | Country | Timestamp | Clicked on Ad |
---|---|---|---|---|---|---|---|---|---|---|
0 | 68.95 | 35 | 61833.90 | 256.09 | Cloned 5thgeneration orchestration | Wrightburgh | 0 | Tunisia | 2016-03-27 00:53:11 | 0 |
1 | 80.23 | 31 | 68441.85 | 193.77 | Monitored national standardization | West Jodi | 1 | Nauru | 2016-04-04 01:39:02 | 0 |
2 | 69.47 | 26 | 59785.94 | 236.50 | Organic bottom-line service-desk | Davidton | 0 | San Marino | 2016-03-13 20:35:42 | 0 |
3 | 74.15 | 29 | 54806.18 | 245.89 | Triple-buffered reciprocal time-frame | West Terrifurt | 1 | Italy | 2016-01-10 02:31:19 | 0 |
4 | 68.37 | 35 | 73889.99 | 225.58 | Robust logistical utilization | South Manuel | 0 | Iceland | 2016-06-03 03:36:18 | 0 |
**Use info() and describe() on ad_data.**
ad_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Daily Time Spent on Site 1000 non-null float64
1 Age 1000 non-null int64
2 Area Income 1000 non-null float64
3 Daily Internet Usage 1000 non-null float64
4 Ad Topic Line 1000 non-null object
5 City 1000 non-null object
6 Male 1000 non-null int64
7 Country 1000 non-null object
8 Timestamp 1000 non-null object
9 Clicked on Ad 1000 non-null int64
dtypes: float64(3), int64(3), object(4)
memory usage: 78.2+ KB
ad_data.describe()
Statistic | Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Male | Clicked on Ad |
---|---|---|---|---|---|---|
count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 | 1000.00000 |
mean | 65.000200 | 36.009000 | 55000.000080 | 180.000100 | 0.481000 | 0.50000 |
std | 15.853615 | 8.785562 | 13414.634022 | 43.902339 | 0.499889 | 0.50025 |
min | 32.600000 | 19.000000 | 13996.500000 | 104.780000 | 0.000000 | 0.00000 |
25% | 51.360000 | 29.000000 | 47031.802500 | 138.830000 | 0.000000 | 0.00000 |
50% | 68.215000 | 35.000000 | 57012.300000 | 183.130000 | 0.000000 | 0.50000 |
75% | 78.547500 | 42.000000 | 65470.635000 | 218.792500 | 1.000000 | 1.00000 |
max | 91.430000 | 61.000000 | 79484.800000 | 269.960000 | 1.000000 | 1.00000 |
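The info() output lists Timestamp with dtype object, meaning pandas is holding it as plain strings. If you want to use it as a feature, one option is to parse it and derive numeric columns. The following is a minimal sketch using the five timestamps from the head() output; the derived column names (Hour, Month, DayOfWeek) are illustrative assumptions and are not part of the original data set.

```python
import pandas as pd

# First five Timestamp values, copied from the head() output.
sample = pd.DataFrame({
    "Timestamp": [
        "2016-03-27 00:53:11",
        "2016-04-04 01:39:02",
        "2016-03-13 20:35:42",
        "2016-01-10 02:31:19",
        "2016-06-03 03:36:18",
    ]
})

# Convert the strings to a datetime dtype so the .dt accessor is available.
sample["Timestamp"] = pd.to_datetime(sample["Timestamp"])

# Hypothetical derived columns (names chosen for illustration only).
sample["Hour"] = sample["Timestamp"].dt.hour
sample["Month"] = sample["Timestamp"].dt.month
sample["DayOfWeek"] = sample["Timestamp"].dt.dayofweek  # Monday=0, Sunday=6

print(sample)
```

The same three lines should work on ad_data itself once the CSV is loaded, although whether these columns help the model is something you would have to check.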
Exploratory Data Analysis
Let’s use seaborn to explore the data!
Try recreating the plots shown below!
**Create a histogram of Age.**
sns.histplot(ad_data['Age'], kde=True)
Create a jointplot showing Area Income versus Age.
sns.jointplot(x='Age', y='Area Income', data=ad_data)
Create a jointplot showing the KDE distributions of Daily Time Spent on Site vs. Age.
sns.jointplot(x='Age', y='Daily Time Spent on Site', data=ad_data, kind='kde')
**Create a jointplot of ‘Daily Time Spent on Site’ vs. ‘Daily Internet Usage’.**
sns.jointplot(x='Daily Time Spent on Site', y='Daily Internet Usage', data=ad_data)
**Finally, create a pairplot with the hue defined by the ‘Clicked on Ad’ column.**
sns.pairplot(ad_data, hue='Clicked on Ad')
Logistic Regression
Now it’s time to do a train test split, and train our model!
You’ll have the freedom here to choose columns that you want to train on!
**Split the data into a training set and a testing set using train_test_split.**
from sklearn.model_selection import train_test_split
x = ad_data[['Daily Time Spent on Site', 'Age', 'Area Income', 'Daily Internet Usage', 'Male']]
y = ad_data['Clicked on Ad']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=17)
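An optional refinement, not used in this exercise: passing stratify=y to train_test_split keeps the class proportions the same in both splits. The describe() output shows a mean of 0.5 for ‘Clicked on Ad’, so the classes are balanced and a plain random split is usually fine here. The sketch below uses synthetic data (x_demo and y_demo are made-up stand-ins for the real columns) to show the resulting sizes with test_size=0.2.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 rows, 5 features, balanced 0/1 target.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(1000, 5))
y_demo = np.array([0, 1] * 500)

# stratify=y_demo preserves the 50/50 class ratio in both splits.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=17, stratify=y_demo
)

print(x_tr.shape, x_te.shape)    # sizes of the training and testing sets
print(y_tr.mean(), y_te.mean())  # fraction of class 1 in each split
```

With 1000 rows, test_size=0.2 holds out 200 rows for testing and leaves 800 for training.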
**Train and fit a logistic regression model on the training set.**
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression()
LR.fit(x_train, y_train)
LogisticRegression()
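For intuition about what the fitted model computes at prediction time, here is a minimal pure-Python sketch: a weighted sum of the features plus an intercept is passed through the sigmoid function to get a probability. The weights, intercept, and rows below are made up for illustration; after fitting, the real values would be in LR.coef_ and LR.intercept_.

```python
import math

def sigmoid(z):
    """Map a real-valued score (the log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba_one(features, weights, intercept):
    """Probability of class 1 for one row: sigmoid(intercept + w . x)."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical coefficients for two features (not fitted values).
weights = [0.5, -0.25]
intercept = 1.0

row_a = [4.0, 4.0]    # score = 1.0 + 2.0 - 1.0 = 2.0
row_b = [-4.0, 4.0]   # score = 1.0 - 2.0 - 1.0 = -2.0

p_a = predict_proba_one(row_a, weights, intercept)
p_b = predict_proba_one(row_b, weights, intercept)

# A probability of at least 0.5 is labelled 1, otherwise 0.
label_a = int(p_a >= 0.5)
label_b = int(p_b >= 0.5)
print(p_a, label_a)
print(p_b, label_b)
```

This only covers prediction. How scikit-learn finds the coefficients (by default with regularization) is a separate step that the sketch does not attempt to reproduce.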
Predictions and Evaluations
**Now predict values for the testing data.**
prediction = LR.predict(x_test)
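predict() returns hard 0/1 labels. If you also want the estimated probabilities, scikit-learn's LogisticRegression offers predict_proba(). The sketch below uses synthetic data (x_demo, y_demo, and model are illustrative names, not from the exercise) and checks that, for a binary model, the labels from predict() agree with thresholding the class-1 probability at 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data, standing in for x_train / y_train.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 2))
y_demo = (x_demo[:, 0] + x_demo[:, 1] > 0).astype(int)

model = LogisticRegression()
model.fit(x_demo, y_demo)

labels = model.predict(x_demo)
# One row per sample, one column per class, ordered as in model.classes_.
proba = model.predict_proba(x_demo)

# Thresholding P(class 1) at 0.5 reproduces the hard labels.
manual_labels = (proba[:, 1] > 0.5).astype(int)
print((manual_labels == labels).all())
```

Applied to the exercise, LR.predict_proba(x_test) would give the per-user click probabilities, which is useful if you ever want a cutoff other than 0.5.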
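**Create a classification report for the model.**
from sklearn.metrics import classification_report
print(classification_report(y_test, prediction))
Class | precision | recall | f1-score | support |
---|---|---|---|---|
0 | 0.87 | 0.96 | 0.91 | 162 |
1 | 0.96 | 0.86 | 0.91 | 168 |
avg / total | 0.91 | 0.91 | 0.91 | 330 |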
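To see where the numbers in a classification report come from, the sketch below computes precision, recall, and F1 by hand from confusion-matrix counts. The counts are hypothetical: the source does not show the confusion matrix, so they were chosen only because they are consistent with the rounded values and supports in the report.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) for one class from its confusion counts."""
    precision = tp / (tp + fp)   # of everything predicted as this class, how much was right
    recall = tp / (tp + fn)      # of everything truly in this class, how much was found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical confusion counts (an assumption, not taken from the source).
# Rows are the actual class, columns are the predicted class:
#             pred 0   pred 1
# actual 0      156        6
# actual 1       24      144
tn, fp, fn, tp = 156, 6, 24, 144

# Class 1 ("clicked"): the usual tp / fp / fn roles.
p1, r1, f1_1 = precision_recall_f1(tp=tp, fp=fp, fn=fn)
# Class 0: the roles swap, so its correct predictions are the tn count.
p0, r0, f1_0 = precision_recall_f1(tp=tn, fp=fn, fn=fp)

print(f"class 0: precision={p0:.2f} recall={r0:.2f} f1={f1_0:.2f}")
print(f"class 1: precision={p1:.2f} recall={r1:.2f} f1={f1_1:.2f}")
```

The support column is simply the number of actual samples per class, here 156 + 6 = 162 for class 0 and 24 + 144 = 168 for class 1.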