K Nearest Neighbour Exercise
Exercise from Jose Portilla's Python for Data Science Bootcamp.
Now let's get started.
K Nearest Neighbors Project
Welcome to the KNN Project! This will be a simple project very similar to the lecture, except you’ll be given another data set. Go ahead and just follow the directions below.
Import Libraries
Import pandas, seaborn, and the usual libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Get the Data
**Read the ‘KNN_Project_Data’ csv file into a dataframe.**
data = pd.read_csv("KNN_Project_Data")
Check the head of the dataframe.
data.head()
| | XVPM | GWYH | TRAT | TLLZ | IGGA | HYKR | EDFS | GUUB | MGJM | JHZC | TARGET CLASS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1636.670614 | 817.988525 | 2565.995189 | 358.347163 | 550.417491 | 1618.870897 | 2147.641254 | 330.727893 | 1494.878631 | 845.136088 | 0 |
| 1 | 1013.402760 | 577.587332 | 2644.141273 | 280.428203 | 1161.873391 | 2084.107872 | 853.404981 | 447.157619 | 1193.032521 | 861.081809 | 1 |
| 2 | 1300.035501 | 820.518697 | 2025.854469 | 525.562292 | 922.206261 | 2552.355407 | 818.676686 | 845.491492 | 1968.367513 | 1647.186291 | 1 |
| 3 | 1059.347542 | 1066.866418 | 612.000041 | 480.827789 | 419.467495 | 685.666983 | 852.867810 | 341.664784 | 1154.391368 | 1450.935357 | 0 |
| 4 | 1018.340526 | 1313.679056 | 950.622661 | 724.742174 | 843.065903 | 1370.554164 | 905.469453 | 658.118202 | 539.459350 | 1899.850792 | 0 |
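As an optional sanity check (an addition beyond the exercise), it's worth confirming that there are no missing values and that every feature column is numeric before scaling:

# Optional sanity check: dtypes, non-null counts, and per-column
# missing-value totals before we standardize anything.
data.info()
data.isnull().sum()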
EDA
Since this data is artificial, we’ll just do a large pairplot with seaborn.
Use seaborn on the dataframe to create a pairplot with the hue indicated by the TARGET CLASS column.
sns.pairplot(data, hue='TARGET CLASS')
Standardize the Variables
Time to standardize the variables.
**Import StandardScaler from scikit-learn.**
from sklearn.preprocessing import StandardScaler
**Create a StandardScaler() object called scaler.**
scaler = StandardScaler()
**Fit scaler to the features.**
scaler.fit(data.drop(['TARGET CLASS'], axis=1))
StandardScaler()
Use the .transform() method to transform the features to a scaled version.
scaler.transform(data.drop(['TARGET CLASS'], axis=1))
array([[ 1.56852168, -0.44343461, 1.61980773, ..., -0.93279392,
1.00831307, -1.06962723],
[-0.11237594, -1.05657361, 1.7419175 , ..., -0.46186435,
0.25832069, -1.04154625],
[ 0.66064691, -0.43698145, 0.77579285, ..., 1.14929806,
2.1847836 , 0.34281129],
...,
[-0.35889496, -0.97901454, 0.83771499, ..., -1.51472604,
-0.27512225, 0.86428656],
[ 0.27507999, -0.99239881, 0.0303711 , ..., -0.03623294,
0.43668516, -0.21245586],
[ 0.62589594, 0.79510909, 1.12180047, ..., -1.25156478,
-0.60352946, -0.87985868]])
Convert the scaled features to a dataframe and check the head of this dataframe to make sure the scaling worked.
features = data.drop('TARGET CLASS', axis=1)
df_scaled = pd.DataFrame(scaler.transform(features), columns=features.columns)
df_scaled.head()
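As an aside (not part of the exercise), scikit-learn's fit_transform() collapses the fit and transform steps into a single call; a minimal equivalent sketch:

# fit_transform() fits the scaler to the features and returns the
# scaled values in one call, equivalent to fit() then transform().
df_scaled_alt = pd.DataFrame(StandardScaler().fit_transform(features),
                             columns=features.columns)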
Train Test Split
Use train_test_split to split your data into a training set and a testing set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
    df_scaled, data['TARGET CLASS'], test_size=0.2, random_state=17)
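For reference only (the results below use the plain split above), passing stratify keeps the 0/1 class proportions the same in both splits, which can matter for small test sets. The _s-suffixed names here are hypothetical and unused later:

# Stratified variant (illustrative only; not used in the runs below).
xs_train, xs_test, ys_train, ys_test = train_test_split(
    df_scaled, data['TARGET CLASS'], test_size=0.2,
    random_state=17, stratify=data['TARGET CLASS'])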
Using KNN
Import KNeighborsClassifier from scikit learn.
from sklearn.neighbors import KNeighborsClassifier
Create a KNN model instance with n_neighbors=1
knn = KNeighborsClassifier(n_neighbors=1)
Fit this KNN model to the training data.
knn.fit(x_train, y_train)
KNeighborsClassifier(n_neighbors=1)
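To peek at what the fitted model actually does at predict time (a small addition beyond the exercise), kneighbors() returns the distance to, and the positional x_train index of, the nearest training point for a query row:

# Nearest training neighbour of the first test row: kneighbors()
# returns (distances, positional indices into x_train).
dist, ind = knn.kneighbors(x_test.iloc[[0]], n_neighbors=1)
print(dist, ind)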
Predictions and Evaluations
Let’s evaluate our KNN model!
Use the predict method to predict values using your KNN model and x_test.
pred = knn.predict(x_test)
**Create a confusion matrix and classification report.**
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, pred))
[[59 26]
[29 86]]
print(classification_report(y_test, pred))
              precision    recall  f1-score   support

           0       0.67      0.69      0.68        85
           1       0.77      0.75      0.76       115

    accuracy                           0.73       200
   macro avg       0.72      0.72      0.72       200
weighted avg       0.73      0.72      0.73       200
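As a quick cross-check (added here, not part of the original exercise), the reported accuracy follows directly from the confusion matrix: correct predictions sit on the diagonal.

# Accuracy = diagonal (correct) over total: (59 + 86) / 200 = 0.725,
# which matches the 0.73 (rounded) in the report above.
cm = confusion_matrix(y_test, pred)
print(cm.trace() / cm.sum())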
Choosing a K Value
Let’s go ahead and use the elbow method to pick a good K Value!
**Create a for loop that trains various KNN models with different k values, then keep track of the error_rate for each of these models with a list. Refer to the lecture if you are confused on this step.**
K = []
error_rate = []

for i in range(1, 40):
    K.append(i)
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(x_train, y_train)
    pred_i = knn.predict(x_test)
    error_rate.append(np.mean(pred_i != y_test))
Now create the following plot using the information from your for loop.
sns.lineplot(x=K, y=error_rate)
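Judging K on a single test split can be noisy. An alternative worth sketching (beyond the lecture's approach) is to score each K with cross-validation on the training set and pick the K with the lowest mean CV error:

from sklearn.model_selection import cross_val_score

# 5-fold CV error for each K, computed on the training set only,
# so the test set stays untouched until the final evaluation.
cv_error = []
for i in range(1, 40):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=i),
                             x_train, y_train, cv=5)
    cv_error.append(1 - scores.mean())
print(np.argmin(cv_error) + 1)  # K with the lowest CV error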
Retrain with new K Value
Retrain your model with the best K value (it's up to you to decide which) and re-do the classification report and the confusion matrix.
knn = KNeighborsClassifier(n_neighbors=31)
knn.fit(x_train, y_train)
pred = knn.predict(x_test)
print("WITH K=31")
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
WITH K=31
[[70 15]
[21 94]]
              precision    recall  f1-score   support

           0       0.77      0.82      0.80        85
           1       0.86      0.82      0.84       115

    accuracy                           0.82       200
   macro avg       0.82      0.82      0.82       200
weighted avg       0.82      0.82      0.82       200
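Going beyond the exercise, the scaling and the K search can be bundled so the scaler is re-fit inside every cross-validation fold, avoiding leakage of test-fold statistics into the scaling. A minimal sketch with a scikit-learn Pipeline and GridSearchCV (the 1 to 39 range mirrors the loop above; for a strict evaluation you would grid-search on a raw training split rather than the full data used here):

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Bundle scaling and KNN; GridSearchCV re-fits the scaler per fold
# while searching K over the same range as the elbow plot.
pipe = Pipeline([('scaler', StandardScaler()),
                 ('knn', KNeighborsClassifier())])
grid = GridSearchCV(pipe, {'knn__n_neighbors': list(range(1, 40))}, cv=5)
grid.fit(data.drop('TARGET CLASS', axis=1), data['TARGET CLASS'])
print(grid.best_params_)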