
Exercise from Jose Portilla Python for Data Science Bootcamp.


Now let's get started.

K Nearest Neighbors Project

Welcome to the KNN Project! This will be a simple project very similar to the lecture, except you’ll be given another data set. Go ahead and just follow the directions below.

Import Libraries

Import pandas, seaborn, and the usual libraries.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

Get the Data

**Read the ‘KNN_Project_Data’ csv file into a dataframe.**

data = pd.read_csv("KNN_Project_Data")

Check the head of the dataframe.

data.head()
XVPM GWYH TRAT TLLZ IGGA HYKR EDFS GUUB MGJM JHZC TARGET CLASS
0 1636.670614 817.988525 2565.995189 358.347163 550.417491 1618.870897 2147.641254 330.727893 1494.878631 845.136088 0
1 1013.402760 577.587332 2644.141273 280.428203 1161.873391 2084.107872 853.404981 447.157619 1193.032521 861.081809 1
2 1300.035501 820.518697 2025.854469 525.562292 922.206261 2552.355407 818.676686 845.491492 1968.367513 1647.186291 1
3 1059.347542 1066.866418 612.000041 480.827789 419.467495 685.666983 852.867810 341.664784 1154.391368 1450.935357 0
4 1018.340526 1313.679056 950.622661 724.742174 843.065903 1370.554164 905.469453 658.118202 539.459350 1899.850792 0
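
Beyond .head(), two standard pandas calls make quick sanity checks here (not part of the original exercise):

data.info()       # dtypes, row count, and missing values at a glance
data.describe()   # note how differently scaled the feature columns are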

EDA

Since this data is artificial, we’ll just do a large pairplot with seaborn.

Use seaborn on the dataframe to create a pairplot with the hue indicated by the TARGET CLASS column.

sns.pairplot(data, hue='TARGET CLASS')

(pairplot: every pair of features plotted against each other, points colored by TARGET CLASS)

Standardize the Variables

Time to standardize the variables.
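
KNN classifies a point by the labels of its nearest neighbors under (by default) Euclidean distance, so features on large scales dominate the distance while features on small scales barely register. A toy illustration with made-up numbers:

# Two made-up points: the first feature spans thousands, the second single digits
a = np.array([1600.0, 2.0])
b = np.array([1500.0, 9.0])

# Distance is ~100.2, driven almost entirely by the first feature,
# even though the second feature differs by a proportionally huge amount
print(np.linalg.norm(a - b))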

**Import StandardScaler from scikit-learn.**

from sklearn.preprocessing import StandardScaler

**Create a StandardScaler() object called scaler.**

scaler = StandardScaler()

**Fit scaler to the features.**

scaler.fit(data.drop(['TARGET CLASS'], axis=1))
StandardScaler()

Use the .transform() method to transform the features to a scaled version.

scaler.transform(data.drop(['TARGET CLASS'], axis=1))
array([[ 1.56852168, -0.44343461,  1.61980773, ..., -0.93279392,
         1.00831307, -1.06962723],
       [-0.11237594, -1.05657361,  1.7419175 , ..., -0.46186435,
         0.25832069, -1.04154625],
       [ 0.66064691, -0.43698145,  0.77579285, ...,  1.14929806,
         2.1847836 ,  0.34281129],
       ...,
       [-0.35889496, -0.97901454,  0.83771499, ..., -1.51472604,
        -0.27512225,  0.86428656],
       [ 0.27507999, -0.99239881,  0.0303711 , ..., -0.03623294,
         0.43668516, -0.21245586],
       [ 0.62589594,  0.79510909,  1.12180047, ..., -1.25156478,
        -0.60352946, -0.87985868]])
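
Minor aside: fit_transform() does the fit and the transform in one call, which is the more common idiom when you don't need the two steps separately:

scaled_features = scaler.fit_transform(data.drop(['TARGET CLASS'], axis=1))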

Convert the scaled features to a dataframe and check the head of this dataframe to make sure the scaling worked.

features = data.drop('TARGET CLASS', axis=1)
df_scaled = pd.DataFrame(scaler.transform(features), columns=features.columns)
df_scaled.head()

Train Test Split

Use train_test_split to split your data into a training set and a testing set.

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    df_scaled, data["TARGET CLASS"], test_size=0.2, random_state=17
)
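
An optional refinement beyond the original exercise: passing stratify keeps the 0/1 class proportions identical in both splits, which matters more as data gets smaller or more imbalanced:

x_train, x_test, y_train, y_test = train_test_split(
    df_scaled, data["TARGET CLASS"], test_size=0.2, random_state=17,
    stratify=data["TARGET CLASS"]  # preserve the class ratio in both splits
)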

Using KNN

Import KNeighborsClassifier from scikit learn.

from sklearn.neighbors import KNeighborsClassifier

Create a KNN model instance with n_neighbors=1

knn = KNeighborsClassifier(n_neighbors=1)

Fit this KNN model to the training data.

knn.fit(x_train, y_train)
KNeighborsClassifier(n_neighbors=1)

Predictions and Evaluations

Let’s evaluate our KNN model!

Use the predict method to predict values using your KNN model and x_test.

pred = knn.predict(x_test)

**Create a confusion matrix and classification report.**

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, pred))
[[59 26]
 [29 86]]
print(classification_report(y_test, pred))
              precision    recall  f1-score   support

           0       0.67      0.69      0.68        85
           1       0.77      0.75      0.76       115

    accuracy                           0.73       200
   macro avg       0.72      0.72      0.72       200
weighted avg       0.73      0.72      0.73       200
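
As a sanity check, the accuracy in the report can be read straight off the confusion matrix, whose diagonal holds the correct predictions (scikit-learn puts true labels on the rows and predicted labels on the columns):

cm = confusion_matrix(y_test, pred)
print(cm.trace() / cm.sum())  # (59 + 86) / 200 = 0.725, the ~0.73 accuracy above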

Choosing a K Value

Let’s go ahead and use the elbow method to pick a good K value!

**Create a for loop that trains various KNN models with different k values, then keep track of the error_rate for each of these models with a list. Refer to the lecture if you are confused on this step.**

K = []
error_rate = []

# Train a KNN model for each k from 1 to 39 and record its test-set error rate
for i in range(1, 40):
    K.append(i)
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(x_train, y_train)
    pred_i = knn.predict(x_test)
    error_rate.append(np.mean(pred_i != y_test))

Now create the following plot using the information from your for loop.

sns.lineplot(x=K, y=error_rate)

(line plot: test-set error rate vs. K)
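
An aside that goes beyond the original exercise: choosing K from the test-set error means the test set is no longer a truly held-out measure. A more robust sketch cross-validates on the training set only, using scikit-learn's cross_val_score:

from sklearn.model_selection import cross_val_score

cv_scores = []
for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    # Mean 5-fold cross-validated accuracy, computed on the training set only
    cv_scores.append(cross_val_score(knn, x_train, y_train, cv=5).mean())

best_k = int(np.argmax(cv_scores)) + 1  # +1 because k started at 1
print(best_k)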

Retrain with new K Value

Retrain your model with the best K value (up to you to decide what you want) and re-do the classification report and the confusion matrix.

knn = KNeighborsClassifier(n_neighbors=31)
knn.fit(x_train, y_train)
pred = knn.predict(x_test)

print("WITH K=31")
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
WITH K=31
[[70 15]
 [21 94]]
              precision    recall  f1-score   support

           0       0.77      0.82      0.80        85
           1       0.86      0.82      0.84       115

    accuracy                           0.82       200
   macro avg       0.82      0.82      0.82       200
weighted avg       0.82      0.82      0.82       200
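
Moving from K=1 to K=31 lifted test accuracy from 0.73 to 0.82 here, a good reminder that K=1 tends to overfit noise and that a little tuning goes a long way with KNN.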

Great Job!
