Day 6 Algorit.ma : Unsupervised Learning
Day 6: here I share my notes from the in-class notebook. For further examples, see https://github.com/Saltfarmer/Algoritma-BFLP-DS-Audit/tree/main
Inclass: Unsupervised Learning
- Duration: 7 hours
- Last Updated: December 2023
- Compiled and curated by the product team and instructors of Algoritma Data Science School.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from numpy.linalg import eig
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
from pyod.models.lof import LOF
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import plotly.offline as pyo
# Set notebook mode to work in offline
pyo.init_notebook_mode()
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from helper import *
Introduction
Machine learning focuses on making predictions based on properties/features learned from training data. Some types of machine learning are:
Supervised Learning:
- has a target variable.
- used to build predictive models $(y \sim x)$
- has a ground truth (actual labels), so the model can be evaluated
Unsupervised Learning:
- has no target variable.
- used to find patterns in the data that yield useful information or can be processed further; commonly used during exploratory data analysis (EDA) or data pre-processing.
- has no ground truth, so the model is hard to evaluate
Dimensionality Reduction
The goal of dimensionality reduction is to reduce the number of variables (dimensions/features) in the data while retaining as much information as possible. Dimensionality reduction helps with the problems of high-dimensional data. Difficulties with high-dimensional data:
- Modelling requires a lot of time and computation
- Data with more than three dimensions is hard to visualize
- Data processing (feature selection) becomes harder
Note:
- Dimension: a column; the more columns, the higher the dimension.
- Information: variance; the higher the variance, the more information.
Refresher on Variance
Below is salary data for companies A and B, in millions of rupiah.
Question: Without computing the variance, which company has more varied salaries?
# compare the variance of these two data sets:
A = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
B = [4, 5, 5, 6, 6, 4, 6, 5, 4, 4]
print(np.var(A))
print(np.var(B))
8.25
0.6900000000000001
There is also salary data for company C, in dollars. To keep things simple, assume 1 dollar = 10,000 rupiah.
C = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
np.var(C)
82500.0
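Can we say that salaries at company C vary more than at company A?
Ans: No. With 1 dollar = 10,000 rupiah, C holds exactly the same salaries as A (100 dollars = 1 million rupiah). Its variance is larger only because the numbers are 100 times bigger in that unit, which multiplies the variance by 100² = 10,000 (8.25 becomes 82,500).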
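The effect of units on variance can be checked directly. The sketch below reuses the A and C lists from this section; the z-score step is an added illustration (not part of the original notebook at this point) showing that the two series coincide once the unit is removed.

```python
import numpy as np

A = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]                        # million rupiah
C = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]    # dollars

var_a = np.var(A)
var_c = np.var(C)

# Every value in C is 100 times the matching value in A,
# so the variance grows by a factor of 100**2.
ratio = var_c / var_a

# Z-score standardization removes the unit: both series become identical.
z_a = (np.array(A) - np.mean(A)) / np.std(A)
z_c = (np.array(C) - np.mean(C)) / np.std(C)
same_after_scaling = np.allclose(z_a, z_c)

print(ratio, same_after_scaling)
```

This is why variables are usually put on a common scale before their variances are compared.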
Motivation Example: Image Compression
In image data, every pixel becomes one column. A 40x40 pixel photo has 1600 columns (dimensions). Now think about it: what is the resolution of your phone camera? How many dimensions does the data it produces have?
Image compression is a real-world example of dimensionality reduction: the image data is reduced while still producing a similar image (the core information is not lost), which makes the image data easier to process. One algorithm that can be used for dimensionality reduction is Principal Component Analysis (PCA).
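A minimal sketch of how an image turns into columns, assuming a hypothetical 40x40 image filled with random pixel values (no real photo is loaded here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 40x40 grayscale image with intensities 0-255
image = rng.integers(0, 256, size=(40, 40))

# Flatten: every pixel becomes one column of a single-row data set
row = image.reshape(1, -1)
print(row.shape)  # (1, 1600)

# A colour image with 3 channels has three times as many columns
rgb = rng.integers(0, 256, size=(40, 40, 3))
rgb_row = rgb.reshape(1, -1)
print(rgb_row.shape)  # (1, 4800)
```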
✅ Knowledge Check:
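In an image, what is meant by dimension and information?
- dimension: the number of pixel columns (times the number of channels for a colour image)
- information: the variance of the grayscale values
Is the value of variance affected by the scale of the values themselves? Explain!
Ans: Yes. Variance is computed from squared deviations from the mean, so multiplying every value by a constant k multiplies the variance by k². Variables measured on a larger scale therefore show a larger variance.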
Principal Component Analysis
Concept
The basic idea of PCA is to create new axes that capture as much information as possible. These new axes are called Principal Components (PCs). To perform dimensionality reduction, we select a few PCs that summarize the information we need.
Figure A (before PCA):
- Axes/dimensions: X1 and X2
- The variance of the data is explained by X1 and X2
- New axes, named PC1 and PC2, are created to capture the information in X1 and X2
Figure B (after PCA):
- New axes: PC1 and PC2
- PC1 captures more variance than PC2
- Suppose PC1 captures 90% of the variance; the remaining 10% is captured by PC2
💡 Notes:
- PCA creates new axes, called PCs, that aim to summarize as much of the data's information as possible
- The number of PCs equals the number of dimensions in the data
- PC1 always captures the most variance, followed by PC2, and so on
- PC1 and PC2 are perpendicular to each other, meaning they are uncorrelated
- PCA suits numeric data whose variables are correlated with each other
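The notes above can be checked on a small example. The sketch below uses synthetic two-variable data (an assumption made for illustration, not the course data) and scikit-learn's PCA with its default settings, which keep every component.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Two correlated variables: x2 is mostly a rescaled copy of x1 plus noise
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + rng.normal(scale=0.3, size=500)
X = np.column_stack([x1, x2])

pca = PCA()                       # keep all components
scores = pca.fit_transform(X)     # the data expressed on the new axes

n_components = pca.components_.shape[0]             # one PC per original dimension
ratios = pca.explained_variance_ratio_              # share of variance per PC
pc_corr = np.corrcoef(scores[:, 0], scores[:, 1])[0, 1]
axis_dot = pca.components_[0] @ pca.components_[1]  # 0 when the axes are perpendicular

print(n_components, ratios, pc_corr, axis_dot)
```

With strongly correlated inputs, PC1 takes most of the variance, which is what makes dropping the later PCs cheap in terms of lost information.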
✅ Knowledge Check:
<img src="assets/knowledge check.png" width="500">
- From the image above, which data is suitable for PCA?
  - Sale Price of Vehicles
  - Blind Tasting
  - Logistic Machinery
- With 3 PCs, which PC summarizes the most variance (information)?
  - PC1
  - PC2
  - PC3
- In PCA, the number of PCs produced equals…
  - The number of variables used
  - Half the number of variables used
  - A number set by the user
- PC1 is formed from the first variable and PC4 from the fourth variable
  - False
  - True
Math Behind PCA [optional]
To form the PCs we need eigenvalues and eigenvectors. Done by hand, eigenvalues and eigenvectors are obtained through matrix operations.
Matrix theory:
- scalar: a value that has a magnitude
- vector: a value that has a magnitude and a direction (usually drawn in a coordinate system)
- matrix: a collection of values arranged in rows and columns
Eigen- of a Matrix
For a square matrix $A$, there are special vectors (eigenvectors) which, when multiplied by the matrix, give the same result as multiplying the vector by a scalar (the eigenvalue). This gives the formula:
\[Ax = \lambda x\]
where $x$ is an eigenvector and $\lambda$ is the corresponding eigenvalue of matrix $A$.
Example:
One of the eigenvectors of the matrix
$\begin{bmatrix}
2 & 3 \\
2 & 1
\end{bmatrix}$
is
$\begin{bmatrix}
3 \\
2
\end{bmatrix}$
with an eigenvalue of 4.
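The example can be verified numerically: multiplying the matrix by the vector should give the same result as multiplying the vector by 4. A short check with NumPy:

```python
import numpy as np

A = np.array([[2, 3],
              [2, 1]])
x = np.array([3, 2])

left = A @ x      # matrix times vector
right = 4 * x     # eigenvalue times vector
print(left, right)

# np.linalg.eig returns every eigenvalue of A; 4 is one of them
eig_vals, eig_vecs = np.linalg.eig(A)
print(eig_vals)
```

The other eigenvalue of this matrix is -1, which follows from the characteristic equation $\lambda^2 - 3\lambda - 4 = 0$.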
Eigen theory is used to determine the PCs and the values on each PC.
Applying Eigen to PCA:
The covariance matrix summarizes the information (variance) in the data. We compute the eigenvectors and eigenvalues of the covariance matrix, where:
- eigenvector: the direction of each PC axis, which serves as the formula for transforming the original data onto the new PC.
- eigenvalue: the variance captured by each PC.
- each PC has 1 eigenvalue and 1 eigenvector.
- flow: covariance matrix $\rightarrow$ eigenvalue $\rightarrow$ eigenvector $\rightarrow$ values on each PC
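The flow in the last bullet can be sketched end to end and compared with scikit-learn. The data here is synthetic, and `np.linalg.eigh` is swapped in for `eig` because it is designed for symmetric matrices such as a covariance matrix; both choices are assumptions made for the illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
mix = np.array([[2.0, 0.5, 0.0],
                [0.0, 1.0, 0.3],
                [0.0, 0.0, 0.2]])
X = rng.normal(size=(200, 3)) @ mix       # 200 rows, 3 correlated variables

# 1. covariance matrix (columns are variables)
cov = np.cov(X, rowvar=False)

# 2-3. eigenvalues and eigenvectors, reordered from largest to smallest
vals, vecs = np.linalg.eigh(cov)
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

# 4. values on each PC: centre the data, then project onto the eigenvectors
scores = (X - X.mean(axis=0)) @ vecs

# Reference result from scikit-learn
pca = PCA().fit(X)
sk_scores = pca.transform(X)
print(vals, pca.explained_variance_)
```

The eigenvalues match `explained_variance_`, and the scores match up to the sign of each column, since an eigenvector pointing the opposite way describes the same axis.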
The eigenvector becomes the formula for calculating the values on each PC. For example, for data with 2 variables, if the eigenvector of PC1 is:
\[x_{PC1}= \left[\begin{array}{c}a_1\\a_2\end{array}\right]\]
then the formula for the value on PC1 (for each row) is:
\[PC1= a_1X_1 + a_2X_2\]
where:
- $x_{PC1}$ : eigenvector of PC1 from the covariance matrix
- $a_1$, $a_2$ : constants from the eigenvector
- $PC1$ : value on PC1
- $X_1$, $X_2$ : values of variables X1 and X2 in the original data
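A small sketch of the PC1 formula on a hypothetical four-row data set. One assumption is added that the formula above leaves implicit: the variables are centred (mean subtracted) first, as PCA implementations usually do.

```python
import numpy as np

# Hypothetical data with two variables (columns X1 and X2)
X = np.array([[2.0, 1.0],
              [3.0, 4.0],
              [5.0, 4.5],
              [7.0, 8.0]])
Xc = X - X.mean(axis=0)                  # centred copy of the data

vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
top = np.argmax(vals)                    # position of the largest eigenvalue
a1, a2 = vecs[:, top]                    # constants of the PC1 eigenvector

# PC1 = a1*X1 + a2*X2, evaluated for every row
pc1 = a1 * Xc[:, 0] + a2 * Xc[:, 1]

# The variance of the PC1 values equals the PC1 eigenvalue
pc1_var = np.var(pc1, ddof=1)
print(pc1, pc1_var, vals[top])
```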
Example: computing eigenvalues and eigenvectors from a data set
# create dummy data
dummy = pd.DataFrame(np.random.rand(4, 2),  # random values with 4 rows and 2 columns
                     columns=list('XY'))    # name of each column
dummy
| | X | Y |
---|---|---|
0 | 0.170823 | 0.897426 |
1 | 0.577542 | 0.981163 |
2 | 0.685710 | 0.688500 |
3 | 0.735711 | 0.914354 |
Compute the covariance matrix of the dummy dataframe:
matrix_cov = dummy.cov()
matrix_cov
| | X | Y |
---|---|---|
X | 0.085857 | -0.008870 |
Y | -0.008870 | 0.033434 |
Compute the eigenvalues and eigenvectors with the eig function from numpy:
eig_vals,eig_vecs = eig(matrix_cov.T)
print('E-value: \n', eig_vals) # \n starts a new line
print('E-vector: \n', eig_vecs)
E-value:
[0.08731741 0.03197376]
E-vector:
[[ 0.98672189 0.16241896]
[-0.16241896 0.98672189]]
Note: the output of eig() is not necessarily ordered by value. The eigenvalue of PC1 is the largest, PC2 has the second largest, and so on.
- E-value: the eigenvalue of each PC, i.e. the amount of variance each PC captures. The highest eigenvalue belongs to PC1, the second highest to PC2, and so on.
- E-vector: the eigenvector of each PC. Column eig_vecs[:,i] is the eigenvector that corresponds to eigenvalue eig_vals[i].
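Because the output of `eig` is not guaranteed to be sorted, a common pattern is to reorder the eigenvalues and the matching eigenvector columns together. A minimal sketch on a hypothetical 3x3 covariance matrix (the numbers are made up for illustration):

```python
import numpy as np
from numpy.linalg import eig

# Hypothetical symmetric covariance matrix
cov = np.array([[2.0, 0.3, 0.1],
                [0.3, 5.0, 0.2],
                [0.1, 0.2, 1.0]])

eig_vals, eig_vecs = eig(cov)

# Indices that sort the eigenvalues from largest to smallest
order = np.argsort(eig_vals)[::-1]
eig_vals_sorted = eig_vals[order]
eig_vecs_sorted = eig_vecs[:, order]     # reorder columns, not rows

# Share of the total variance captured by PC1, PC2, PC3
explained = eig_vals_sorted / eig_vals_sorted.sum()
print(eig_vals_sorted, explained)
```

Reordering the columns of the eigenvector matrix with the same index keeps each eigenvector paired with its own eigenvalue.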
PCA Workflow
Business Question: Dimensionality Reduction for Fraud Bank Account dataset
We will again use the fraud_dataset.csv data from the previous lesson. The difference is that this time we use all of its columns and drop only the column that served as the target last time.
fraud = pd.read_csv('data_input/fraud_dataset.csv')
fraud.drop(columns=['fraud_bool'], inplace=True)
fraud.head()
| | income | name_email_similarity | current_address_months_count | customer_age | days_since_request | intended_balcon_amount | payment_type | zip_count_4w | velocity_6h | velocity_24h | ... | phone_mobile_valid | has_other_cards | proposed_credit_limit | foreign_request | source | session_length_in_minutes | device_os | keep_alive_session | device_distinct_emails_8w | month |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.1 | 0.069598 | 48.0 | 30 | 0.006760 | -1.074674 | AB | 3483 | 5316.092932 | 4527.956243 | ... | 1 | 0 | 200.0 | 0 | INTERNET | 5.191773 | windows | 1 | 1.0 | 4 |
1 | 0.9 | 0.891741 | 61.0 | 20 | 0.020642 | -1.043444 | AD | 2849 | 8153.671429 | 7524.130278 | ... | 1 | 1 | 200.0 | 0 | INTERNET | 3.901673 | windows | 0 | 1.0 | 1 |
2 | 0.6 | 0.370933 | 70.0 | 30 | 6.400793 | 48.520199 | AA | 406 | 7648.434993 | 6366.061338 | ... | 1 | 0 | 200.0 | 0 | INTERNET | 3.777191 | linux | 0 | 1.0 | 1 |
3 | 0.9 | 0.401137 | 64.0 | 30 | 0.004651 | -0.394588 | AC | 780 | 6459.224179 | 3394.524379 | ... | 1 | 0 | 200.0 | 0 | INTERNET | 3.176269 | linux | 1 | 1.0 | 5 |
4 | 0.6 | 0.720006 | 11.0 | 20 | 0.032629 | -0.487785 | AC | 4527 | 7852.258962 | 5177.826213 | ... | 1 | 0 | 200.0 | 0 | INTERNET | 14.626874 | linux | 0 | 1.0 | 0 |
5 rows × 28 columns
Dataset Description
Each column in the dataset is described below:
- income (numeric): Annual income of the applicant (in decile form). Ranges between [0.1, 0.9].
- name_email_similarity (numeric): Metric of similarity between email and applicant’s name. Higher values represent higher similarity. Ranges between [0, 1].
- current_address_months_count (numeric): Months in currently registered address of the applicant. Ranges between [−1, 429] months (-1 is a missing value).
- customer_age (numeric): Applicant’s age in years, rounded to the decade. Ranges between [10, 90] years.
- days_since_request (numeric): Number of days passed since application was done. Ranges between [0, 79] days.
- intended_balcon_amount (numeric): Initial transferred amount for application. Ranges between [−16, 114] (negatives are missing values).
- payment_type (categorical): Credit payment plan type. 5 possible (anonymized) values.
- zip_count_4w (numeric): Number of applications within same zip code in last 4 weeks. Ranges between [1, 6830].
- velocity_6h (numeric): Velocity of total applications made in last 6 hours, i.e., average number of applications per hour in the last 6 hours. Ranges between [−175, 16818].
- velocity_24h (numeric): Velocity of total applications made in last 24 hours, i.e., average number of applications per hour in the last 24 hours. Ranges between [1297, 9586].
- velocity_4w (numeric): Velocity of total applications made in last 4 weeks, i.e., average number of applications per hour in the last 4 weeks. Ranges between [2825, 7020].
- bank_branch_count_8w (numeric): Number of total applications in the selected bank branch in last 8 weeks. Ranges between [0, 2404].
- date_of_birth_distinct_emails_4w (numeric): Number of emails for applicants with same date of birth in last 4 weeks. Ranges between [0, 39].
- employment_status (categorical): Employment status of the applicant. 7 possible (anonymized) values.
- credit_risk_score (numeric): Internal score of application risk. Ranges between [−191, 389].
- email_is_free (binary): Domain of application email (either free or paid).
- housing_status (categorical): Current residential status for applicant. 7 possible (anonymized) values.
- phone_home_valid (binary): Validity of provided home phone.
- phone_mobile_valid (binary): Validity of provided mobile phone.
- has_other_cards (binary): If applicant has other cards from the same banking company.
- proposed_credit_limit (numeric): Applicant’s proposed credit limit. Ranges between [200, 2000].
- foreign_request (binary): If origin country of request is different from bank’s country.
- source (categorical): Online source of application. Either browser (INTERNET) or app (TELEAPP).
- session_length_in_minutes (numeric): Length of user session in banking website in minutes. Ranges between [−1, 107] minutes (-1 is a missing value).
- device_os (categorical): Operating system of device that made request. Possible values are: Windows, macOS, Linux, X11, or other.
- keep_alive_session (binary): User option on session logout.
- device_distinct_emails_8w (numeric): Number of distinct emails in banking website from the used device in last 8 weeks. Ranges between [−1, 2] emails (-1 is a missing value).
- month (numeric): Month where the application was made. Ranges between [0, 7].
- fraud_bool (binary): If the application is fraudulent or not.
fraud.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14905 entries, 0 to 14904
Data columns (total 28 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 income 14905 non-null float64
1 name_email_similarity 14905 non-null float64
2 current_address_months_count 14905 non-null float64
3 customer_age 14905 non-null int64
4 days_since_request 14905 non-null float64
5 intended_balcon_amount 14905 non-null float64
6 payment_type 14905 non-null object
7 zip_count_4w 14905 non-null int64
8 velocity_6h 14905 non-null float64
9 velocity_24h 14905 non-null float64
10 velocity_4w 14905 non-null float64
11 bank_branch_count_8w 14905 non-null int64
12 date_of_birth_distinct_emails_4w 14905 non-null int64
13 employment_status 14905 non-null object
14 credit_risk_score 14905 non-null float64
15 email_is_free 14905 non-null int64
16 housing_status 14905 non-null object
17 phone_home_valid 14905 non-null int64
18 phone_mobile_valid 14905 non-null int64
19 has_other_cards 14905 non-null int64
20 proposed_credit_limit 14905 non-null float64
21 foreign_request 14905 non-null int64
22 source 14905 non-null object
23 session_length_in_minutes 14905 non-null float64
24 device_os 14905 non-null object
25 keep_alive_session 14905 non-null int64
26 device_distinct_emails_8w 14905 non-null float64
27 month 14905 non-null int64
dtypes: float64(12), int64(11), object(5)
memory usage: 3.2+ MB
fraud.describe()
| | income | name_email_similarity | current_address_months_count | customer_age | days_since_request | intended_balcon_amount | zip_count_4w | velocity_6h | velocity_24h | velocity_4w | ... | email_is_free | phone_home_valid | phone_mobile_valid | has_other_cards | proposed_credit_limit | foreign_request | session_length_in_minutes | keep_alive_session | device_distinct_emails_8w | month |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 14905.000000 | 14905.000000 | 14905.000000 | 14905.000000 | 1.490500e+04 | 14905.000000 | 14905.000000 | 14905.000000 | 14905.000000 | 14905.000000 | ... | 14905.000000 | 14905.000000 | 14905.000000 | 14905.000000 | 14905.000000 | 14905.000000 | 14905.000000 | 14905.000000 | 14905.000000 | 14905.000000 |
mean | 0.571110 | 0.481305 | 88.974975 | 34.377055 | 1.044408e+00 | 7.986892 | 1571.105736 | 5644.961419 | 4764.243186 | 4851.361779 | ... | 0.541899 | 0.400671 | 0.883126 | 0.213485 | 551.910768 | 0.028782 | 7.701999 | 0.559074 | 1.029520 | 3.293660 |
std | 0.291264 | 0.292755 | 88.451892 | 12.375090 | 5.654084e+00 | 19.702913 | 998.577819 | 3015.663715 | 1486.594023 | 923.966514 | ... | 0.498258 | 0.490051 | 0.321280 | 0.409781 | 516.560244 | 0.167200 | 8.329340 | 0.496515 | 0.197443 | 2.213049 |
min | 0.100000 | 0.000093 | 0.000000 | 10.000000 | 9.352969e-07 | -12.537085 | 36.000000 | 45.106142 | 1328.410255 | 2995.300345 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 190.000000 | 0.000000 | 0.039414 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.300000 | 0.206239 | 23.000000 | 20.000000 | 7.179709e-03 | -1.173150 | 893.000000 | 3402.021768 | 3574.620499 | 4261.751108 | ... | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 200.000000 | 0.000000 | 3.164021 | 0.000000 | 1.000000 | 1.000000 |
50% | 0.600000 | 0.472416 | 55.000000 | 30.000000 | 1.498915e-02 | -0.834826 | 1267.000000 | 5329.868693 | 4743.172402 | 4908.851274 | ... | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 200.000000 | 0.000000 | 5.144863 | 1.000000 | 1.000000 | 3.000000 |
75% | 0.800000 | 0.748003 | 132.000000 | 40.000000 | 2.610747e-02 | -0.204896 | 1941.000000 | 7678.860181 | 5751.489671 | 5485.543277 | ... | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 1000.000000 | 0.000000 | 8.902307 | 1.000000 | 1.000000 | 5.000000 |
max | 0.900000 | 0.999997 | 406.000000 | 90.000000 | 7.582081e+01 | 111.697355 | 6349.000000 | 16264.947756 | 9341.329938 | 6940.302005 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 2100.000000 | 1.000000 | 73.909623 | 1.000000 | 2.000000 | 7.000000 |
8 rows × 23 columns
fraud_clean = fraud.drop(columns='intended_balcon_amount')
fraud_clean = fraud_clean[(fraud_clean['proposed_credit_limit'] >= 200) & (fraud_clean['proposed_credit_limit'] <= 2000)]
Select only the numeric columns:
cols = fraud_clean.select_dtypes("number").columns
fraud_num = fraud_clean[cols]
fraud_num.sample(3)
| | income | name_email_similarity | current_address_months_count | customer_age | days_since_request | zip_count_4w | velocity_6h | velocity_24h | velocity_4w | bank_branch_count_8w | ... | email_is_free | phone_home_valid | phone_mobile_valid | has_other_cards | proposed_credit_limit | foreign_request | session_length_in_minutes | keep_alive_session | device_distinct_emails_8w | month |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9044 | 0.4 | 0.211194 | 93.0 | 40 | 0.000542 | 1900 | 11286.768265 | 7114.259835 | 6790.848957 | 2 | ... | 0 | 1 | 1 | 0 | 200.0 | 0 | 2.550744 | 1 | 1.0 | 0 |
3438 | 0.9 | 0.177036 | 61.0 | 50 | 0.006538 | 1440 | 4795.810553 | 6281.811432 | 5072.699566 | 3 | ... | 1 | 0 | 1 | 0 | 500.0 | 0 | 13.379567 | 0 | 1.0 | 2 |
10703 | 0.7 | 0.091951 | 11.0 | 20 | 0.003008 | 1970 | 9637.374296 | 5737.460384 | 5114.294296 | 10 | ... | 0 | 0 | 1 | 0 | 200.0 | 0 | 2.772705 | 0 | 1.0 | 3 |
3 rows × 22 columns
Inspect the covariance of the fraud_num dataframe:
# covariance
fraud_num.cov()
| | income | name_email_similarity | current_address_months_count | customer_age | days_since_request | zip_count_4w | velocity_6h | velocity_24h | velocity_4w | bank_branch_count_8w | ... | email_is_free | phone_home_valid | phone_mobile_valid | has_other_cards | proposed_credit_limit | foreign_request | session_length_in_minutes | keep_alive_session | device_distinct_emails_8w | month |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
income | 0.084830 | -0.004591 | -0.612779 | 0.510647 | -0.003152 | -22.873746 | -9.007355e+01 | -5.062088e+01 | -3.105886e+01 | 1.463368 | ... | -0.001812 | 0.000148 | 0.000961 | 0.007691 | 18.224996 | 0.000759 | -0.116759 | -0.009207 | -0.000441 | 0.079395 |
name_email_similarity | -0.004591 | 0.085705 | 0.593454 | -0.276976 | -0.008924 | 5.845287 | 3.198269e+01 | 1.878802e+01 | 1.015023e+01 | 0.444386 | ... | -0.012035 | 0.001591 | 0.001704 | 0.001124 | 7.589780 | -0.001033 | 0.016699 | 0.006064 | -0.002295 | -0.028691 |
current_address_months_count | -0.612779 | 0.593454 | 7824.656266 | 154.588556 | -30.908551 | 3918.276204 | 7.973437e+03 | 2.808161e+03 | 2.017757e+03 | 2419.540599 | ... | -2.937186 | 5.020150 | -2.875765 | 1.630877 | 6921.295187 | -0.187393 | -12.739258 | -2.733750 | 0.235010 | -4.112885 |
customer_age | 0.510647 | -0.276976 | 154.588556 | 153.145865 | -2.615292 | -165.109768 | -9.137572e+02 | -1.304647e+02 | -7.711674e+00 | 258.152863 | ... | 0.065014 | 1.077889 | -0.610440 | 0.380307 | 1089.333498 | 0.000818 | 4.356485 | -0.317894 | 0.148603 | 0.016601 |
days_since_request | -0.003152 | -0.008924 | -30.908551 | -2.615292 | 31.966929 | 52.986525 | 4.321133e+02 | 1.814850e+02 | 1.387599e+02 | -53.041654 | ... | 0.014440 | -0.143337 | 0.051130 | -0.117430 | -189.537392 | 0.002647 | 2.033355 | 0.057482 | 0.024413 | -0.216861 |
zip_count_4w | -22.873746 | 5.845287 | 3918.276204 | -165.109768 | 52.986525 | 997047.382815 | 3.959082e+05 | 2.858011e+05 | 2.694707e+05 | 6094.917630 | ... | 20.263025 | -26.639721 | 4.915758 | -9.881681 | -8083.186676 | 2.453612 | 359.426268 | 3.917571 | 7.000640 | -600.642652 |
velocity_6h | -90.073551 | 31.982689 | 7973.437476 | -913.757229 | 432.113320 | 395908.153015 | 9.096535e+06 | 2.094797e+06 | 1.140525e+06 | 26029.215640 | ... | 37.300862 | -37.627801 | -14.536191 | 15.605739 | -40783.431792 | -5.237982 | 1323.806011 | 25.707221 | 15.017745 | -2811.792603 |
velocity_24h | -50.620879 | 18.788020 | 2808.161069 | -130.464654 | 181.484971 | 285801.133198 | 2.094797e+06 | 2.209447e+06 | 7.375738e+05 | 26400.173477 | ... | 14.965224 | -35.386047 | -12.725914 | -7.266532 | 5069.770364 | 4.123817 | 864.137810 | 5.405526 | 8.775432 | -1818.947260 |
velocity_4w | -31.058864 | 10.150233 | 2017.756981 | -7.711674 | 138.759911 | 269470.666041 | 1.140525e+06 | 7.375738e+05 | 8.534964e+05 | 16891.001751 | ... | 25.337173 | -23.035219 | -9.038828 | -13.015738 | 15760.288993 | 2.106544 | 663.541738 | 16.049966 | 9.659638 | -1716.809159 |
bank_branch_count_8w | 1.463368 | 0.444386 | 2419.540599 | 258.152863 | -53.041654 | 6094.917630 | 2.602922e+04 | 2.640017e+04 | 1.689100e+04 | 216745.571966 | ... | -2.142896 | 13.835538 | -2.383647 | 7.016620 | -392.652565 | -0.142042 | 37.278437 | 1.434909 | 0.331983 | -37.211440 |
date_of_birth_distinct_emails_4w | -0.117235 | 0.056748 | -78.551471 | -27.086848 | 0.297179 | 598.673730 | 1.813792e+03 | 1.165160e+03 | 1.087195e+03 | -70.498481 | ... | 0.068809 | -0.360390 | 0.182675 | -0.058403 | -182.632372 | 0.018900 | -1.417666 | 0.159437 | -0.040432 | -2.643468 |
credit_risk_score | 4.080329 | 0.600302 | 728.922657 | 173.912361 | -33.283855 | -6990.585254 | -3.193542e+04 | -1.706834e+04 | -1.131608e+04 | -856.456390 | ... | -0.521268 | -0.377010 | -0.310680 | 3.365784 | 23870.161509 | 0.489783 | -23.950210 | -1.874476 | -0.554730 | 27.454425 |
email_is_free | -0.001812 | -0.012035 | -2.937186 | 0.065014 | 0.014440 | 20.263025 | 3.730086e+01 | 1.496522e+01 | 2.533717e+01 | -2.142896 | ... | 0.248263 | -0.005578 | 0.005232 | -0.005552 | 0.929047 | 0.001177 | 0.146011 | -0.008685 | 0.000374 | -0.076217 |
phone_home_valid | 0.000148 | 0.001591 | 5.020150 | 1.077889 | -0.143337 | -26.639721 | -3.762780e+01 | -3.538605e+01 | -2.303522e+01 | 13.835538 | ... | -0.005578 | 0.240177 | -0.044352 | 0.026112 | -3.560998 | -0.000802 | -0.166718 | 0.009658 | -0.000225 | 0.060949 |
phone_mobile_valid | 0.000961 | 0.001704 | -2.875765 | -0.610440 | 0.051130 | 4.915758 | -1.453619e+01 | -1.272591e+01 | -9.038828e+00 | -2.383647 | ... | 0.005232 | -0.044352 | 0.103251 | 0.000530 | -6.855258 | 0.001017 | -0.027691 | 0.005772 | -0.002789 | 0.024970 |
has_other_cards | 0.007691 | 0.001124 | 1.630877 | 0.380307 | -0.117430 | -9.881681 | 1.560574e+01 | -7.266532e+00 | -1.301574e+01 | 7.016620 | ... | -0.005552 | 0.026112 | 0.000530 | 0.167923 | 16.362243 | -0.000643 | -0.268723 | -0.017893 | -0.002412 | 0.029655 |
proposed_credit_limit | 18.224996 | 7.589780 | 6921.295187 | 1089.333498 | -189.537392 | -8083.186676 | -4.078343e+04 | 5.069770e+03 | 1.576029e+04 | -392.652565 | ... | 0.929047 | -3.560998 | -6.855258 | 16.362243 | 266575.932029 | 3.077936 | -2.286707 | -12.425298 | 0.026084 | -51.424404 |
foreign_request | 0.000759 | -0.001033 | -0.187393 | 0.000818 | 0.002647 | 2.453612 | -5.237982e+00 | 4.123817e+00 | 2.106544e+00 | -0.142042 | ... | 0.001177 | -0.000802 | 0.001017 | -0.000643 | 3.077936 | 0.027965 | 0.029076 | -0.001266 | 0.000089 | -0.003614 |
session_length_in_minutes | -0.116759 | 0.016699 | -12.739258 | 4.356485 | 2.033355 | 359.426268 | 1.323806e+03 | 8.641378e+02 | 6.635417e+02 | 37.278437 | ... | 0.146011 | -0.166718 | -0.027691 | -0.268723 | -2.286707 | 0.029076 | 69.376263 | -0.245622 | 0.134616 | -1.542432 |
keep_alive_session | -0.009207 | 0.006064 | -2.733750 | -0.317894 | 0.057482 | 3.917571 | 2.570722e+01 | 5.405526e+00 | 1.604997e+01 | 1.434909 | ... | -0.008685 | 0.009658 | 0.005772 | -0.017893 | -12.425298 | -0.001266 | -0.245622 | 0.246520 | -0.007854 | -0.029529 |
device_distinct_emails_8w | -0.000441 | -0.002295 | 0.235010 | 0.148603 | 0.024413 | 7.000640 | 1.501775e+01 | 8.775432e+00 | 9.659638e+00 | 0.331983 | ... | 0.000374 | -0.000225 | -0.002789 | -0.002412 | 0.026084 | 0.000089 | 0.134616 | -0.007854 | 0.038996 | -0.022020 |
month | 0.079395 | -0.028691 | -4.112885 | 0.016601 | -0.216861 | -600.642652 | -2.811793e+03 | -1.818947e+03 | -1.716809e+03 | -37.211440 | ... | -0.076217 | 0.060949 | 0.024970 | 0.029655 | -51.424404 | -0.003614 | -1.542432 | -0.029529 | -0.022020 | 4.896620 |
22 rows × 22 columns
Above are the covariance values of the data before it is standardized (scaled). The variances of the variables differ widely because each variable has a different range/scale, and the same goes for the covariances. Variance and covariance depend on the scale of the data: the larger the scale, the larger the variance or covariance.
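The dependence of covariance on scale can be seen on a toy example. In the sketch below (synthetic data, not the fraud data set) one column is multiplied by 1000, as if it were re-expressed in a smaller unit: the covariance grows by the same factor, while the correlation does not move.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
df = pd.DataFrame({"x": rng.normal(size=300)})
df["y"] = 0.5 * df["x"] + rng.normal(scale=0.5, size=300)

# Same data, but y is now stored in a unit 1000 times smaller
df_big = df.assign(y=df["y"] * 1000)

cov_small = df.cov().loc["x", "y"]
cov_big = df_big.cov().loc["x", "y"]
corr_small = df.corr().loc["x", "y"]
corr_big = df_big.corr().loc["x", "y"]

print(cov_small, cov_big)
print(corr_small, corr_big)
```

Since PCA looks for directions of largest variance, an unscaled variable with large numbers would dominate the first PCs regardless of how informative it is.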
Data Pre-processing: Scaling
Scale the fraud_num dataframe so that every predictor is on the same scale.
fraud_num.head()
| | income | name_email_similarity | current_address_months_count | customer_age | days_since_request | intended_balcon_amount | zip_count_4w | velocity_6h | velocity_24h | velocity_4w | ... | email_is_free | phone_home_valid | phone_mobile_valid | has_other_cards | proposed_credit_limit | foreign_request | session_length_in_minutes | keep_alive_session | device_distinct_emails_8w | month |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.1 | 0.069598 | 48.0 | 30 | 0.006760 | -1.074674 | 3483 | 5316.092932 | 4527.956243 | 4730.638776 | ... | 0 | 0 | 1 | 0 | 200.0 | 0 | 5.191773 | 1 | 1.0 | 4 |
1 | 0.9 | 0.891741 | 61.0 | 20 | 0.020642 | -1.043444 | 2849 | 8153.671429 | 7524.130278 | 5341.758190 | ... | 1 | 0 | 1 | 1 | 200.0 | 0 | 3.901673 | 0 | 1.0 | 1 |
2 | 0.6 | 0.370933 | 70.0 | 30 | 6.400793 | 48.520199 | 406 | 7648.434993 | 6366.061338 | 5431.786246 | ... | 1 | 0 | 1 | 0 | 200.0 | 0 | 3.777191 | 0 | 1.0 | 1 |
3 | 0.9 | 0.401137 | 64.0 | 30 | 0.004651 | -0.394588 | 780 | 6459.224179 | 3394.524379 | 4248.230609 | ... | 1 | 0 | 1 | 0 | 200.0 | 0 | 3.176269 | 1 | 1.0 | 5 |
4 | 0.6 | 0.720006 | 11.0 | 20 | 0.032629 | -0.487785 | 4527 | 7852.258962 | 5177.826213 | 5942.104901 | ... | 0 | 0 | 1 | 0 | 200.0 | 0 | 14.626874 | 0 | 1.0 | 0 |
5 rows × 23 columns
Use z-score standardization to scale the numeric data set with StandardScaler() from the sklearn library:
scaler = StandardScaler()
fraud_scaled = scaler.fit_transform(fraud_num.values)
fraud_scaled = pd.DataFrame(fraud_scaled, columns=cols)  # pass the Index itself; [cols] would build a MultiIndex
# check the covariance after scaling
fraud_scaled.cov()
| | income | name_email_similarity | current_address_months_count | customer_age | days_since_request | zip_count_4w | velocity_6h | velocity_24h | velocity_4w | bank_branch_count_8w | ... | email_is_free | phone_home_valid | phone_mobile_valid | has_other_cards | proposed_credit_limit | foreign_request | session_length_in_minutes | keep_alive_session | device_distinct_emails_8w | month |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
income | 1.000067 | -0.053847 | -0.023786 | 0.141684 | -0.001914 | -0.078656 | -0.102545 | -0.116934 | -0.115435 | 0.010793 | ... | -0.012490 | 0.001038 | 0.010272 | 0.064446 | 0.121202 | 0.015582 | -0.048132 | -0.063670 | -0.007674 | 0.123197 |
name_email_similarity | -0.053847 | 1.000067 | 0.022918 | -0.076457 | -0.005392 | 0.019997 | 0.036224 | 0.043178 | 0.037532 | 0.003261 | ... | -0.082515 | 0.011090 | 0.018110 | 0.009366 | 0.050216 | -0.021103 | 0.006849 | 0.041724 | -0.039709 | -0.044292 |
current_address_months_count | -0.023786 | 0.022918 | 1.000067 | 0.141228 | -0.061805 | 0.044364 | 0.029888 | 0.021359 | 0.024692 | 0.058756 | ... | -0.066646 | 0.115811 | -0.101182 | 0.044995 | 0.151556 | -0.012669 | -0.017292 | -0.062248 | 0.013455 | -0.021013 |
customer_age | 0.141684 | -0.076457 | 0.141228 | 1.000067 | -0.037381 | -0.013363 | -0.024483 | -0.007093 | -0.000675 | 0.044810 | ... | 0.010544 | 0.177740 | -0.153523 | 0.074999 | 0.170501 | 0.000395 | 0.042268 | -0.051741 | 0.060813 | 0.000606 |
days_since_request | -0.001914 | -0.005392 | -0.061805 | -0.037381 | 1.000067 | 0.009386 | 0.025342 | 0.021596 | 0.026567 | -0.020152 | ... | 0.005126 | -0.051733 | 0.028145 | -0.050688 | -0.064933 | 0.002800 | 0.043180 | 0.020478 | 0.021867 | -0.017335 |
zip_count_4w | -0.078656 | 0.019997 | 0.044364 | -0.013363 | 0.009386 | 1.000067 | 0.131470 | 0.192572 | 0.292134 | 0.013112 | ... | 0.040730 | -0.054442 | 0.015322 | -0.024152 | -0.015680 | 0.014695 | 0.043219 | 0.007902 | 0.035506 | -0.271856 |
velocity_6h | -0.102545 | 0.036224 | 0.029888 | -0.024483 | 0.025342 | 0.131470 | 1.000067 | 0.467295 | 0.409350 | 0.018539 | ... | 0.024823 | -0.025459 | -0.015000 | 0.012628 | -0.026192 | -0.010386 | 0.052700 | 0.017168 | 0.025216 | -0.421334 |
velocity_24h | -0.116934 | 0.043178 | 0.021359 | -0.007093 | 0.021596 | 0.192572 | 0.467295 | 1.000067 | 0.537146 | 0.038152 | ... | 0.020208 | -0.048580 | -0.026646 | -0.011931 | 0.006606 | 0.016591 | 0.069802 | 0.007325 | 0.029898 | -0.553043 |
velocity_4w | -0.115435 | 0.037532 | 0.024692 | -0.000675 | 0.026567 | 0.292134 | 0.409350 | 0.537146 | 1.000067 | 0.039274 | ... | 0.055047 | -0.050881 | -0.030450 | -0.034383 | 0.033043 | 0.013636 | 0.086236 | 0.034993 | 0.052951 | -0.839851 |
bank_branch_count_8w | 0.010793 | 0.003261 | 0.058756 | 0.044810 | -0.020152 | 0.013112 | 0.018539 | 0.038152 | 0.039274 | 1.000067 | ... | -0.009238 | 0.060644 | -0.015935 | 0.036781 | -0.001634 | -0.001825 | 0.009614 | 0.006208 | 0.003611 | -0.036123 |
date_of_birth_distinct_emails_4w | -0.079916 | 0.038485 | -0.176308 | -0.434568 | 0.010436 | 0.119037 | 0.119399 | 0.155631 | 0.233646 | -0.030065 | ... | 0.027418 | -0.146002 | 0.112871 | -0.028296 | -0.070229 | 0.022439 | -0.033792 | 0.063755 | -0.040650 | -0.237180 |
credit_risk_score | 0.191773 | 0.028069 | 0.112802 | 0.192374 | -0.080584 | -0.095835 | -0.144945 | -0.157187 | -0.167673 | -0.025182 | ... | -0.014321 | -0.010531 | -0.013235 | 0.112434 | 0.632868 | 0.040093 | -0.039362 | -0.051680 | -0.038454 | 0.169837 |
email_is_free | -0.012490 | -0.082515 | -0.066646 | 0.010544 | 0.005126 | 0.040730 | 0.024823 | 0.020208 | 0.055047 | -0.009238 | ... | 1.000067 | -0.022844 | 0.032682 | -0.027193 | 0.003612 | 0.014125 | 0.035185 | -0.035108 | 0.003802 | -0.069131 |
phone_home_valid | 0.001038 | 0.011090 | 0.115811 | 0.177740 | -0.051733 | -0.054442 | -0.025459 | -0.048580 | -0.050881 | 0.060644 | ... | -0.022844 | 1.000067 | -0.281662 | 0.130030 | -0.014074 | -0.009784 | -0.040845 | 0.039693 | -0.002327 | 0.056206 |
phone_mobile_valid | 0.010272 | 0.018110 | -0.101182 | -0.153523 | 0.028145 | 0.015322 | -0.015000 | -0.026646 | -0.030450 | -0.015935 | ... | 0.032682 | -0.281662 | 1.000067 | 0.004027 | -0.041323 | 0.018932 | -0.010347 | 0.036182 | -0.043961 | 0.035120 |
has_other_cards | 0.064446 | 0.009366 | 0.044995 | 0.074999 | -0.050688 | -0.024152 | 0.012628 | -0.011931 | -0.034383 | 0.036781 | ... | -0.027193 | 0.130030 | 0.004027 | 1.000067 | 0.077340 | -0.009391 | -0.078736 | -0.087948 | -0.029808 | 0.032706 |
proposed_credit_limit | 0.121202 | 0.050216 | 0.151556 | 0.170501 | -0.064933 | -0.015680 | -0.026192 | 0.006606 | 0.033043 | -0.001634 | ... | 0.003612 | -0.014074 | -0.041323 | 0.077340 | 1.000067 | 0.035651 | -0.000532 | -0.048473 | 0.000256 | -0.045013 |
foreign_request | 0.015582 | -0.021103 | -0.012669 | 0.000395 | 0.002800 | 0.014695 | -0.010386 | 0.016591 | 0.013636 | -0.001825 | ... | 0.014125 | -0.009784 | 0.018932 | -0.009391 | 0.035651 | 1.000067 | 0.020876 | -0.015251 | 0.002707 | -0.009768 |
session_length_in_minutes | -0.048132 | 0.006849 | -0.017292 | 0.042268 | 0.043180 | 0.043219 | 0.052700 | 0.069802 | 0.086236 | 0.009614 | ... | 0.035185 | -0.040845 | -0.010347 | -0.078736 | -0.000532 | 0.020876 | 1.000067 | -0.059397 | 0.081848 | -0.083692 |
keep_alive_session | -0.063670 | 0.041724 | -0.062248 | -0.051741 | 0.020478 | 0.007902 | 0.017168 | 0.007325 | 0.034993 | 0.006208 | ... | -0.035108 | 0.039693 | 0.036182 | -0.087948 | -0.048473 | -0.015251 | -0.059397 | 1.000067 | -0.080109 | -0.026878 |
device_distinct_emails_8w | -0.007674 | -0.039709 | 0.013455 | 0.060813 | 0.021867 | 0.035506 | 0.025216 | 0.029898 | 0.052951 | 0.003611 | ... | 0.003802 | -0.002327 | -0.043961 | -0.029808 | 0.000256 | 0.002707 | 0.081848 | -0.080109 | 1.000067 | -0.050395 |
month | 0.123197 | -0.044292 | -0.021013 | 0.000606 | -0.017335 | -0.271856 | -0.421334 | -0.553043 | -0.839851 | -0.036123 | ... | -0.069131 | 0.056206 | 0.035120 | 0.032706 | -0.045013 | -0.009768 | -0.083692 | -0.026878 | -0.050395 | 1.000067 |
22 rows × 22 columns
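Answer: Because StandardScaler rescales every feature to mean 0 and standard deviation 1, so all features contribute on a comparable scale. It changes the scale of the data, not the shape of each distribution.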
fraud_minmax = MinMaxScaler().fit_transform(fraud_num.values)
fraud_minmax = pd.DataFrame(fraud_minmax, columns=cols) # pass cols directly; [cols] would build a MultiIndex
plt.figure(figsize=(7, 9))
plt.subplot(3,1,1)
sns.kdeplot(data=fraud.iloc[:,2:7], legend=None)
plt.ylabel("Base data")
plt.subplot(3,1,2)
sns.kdeplot(data=fraud_minmax.iloc[:,2:7], legend=None)
plt.ylabel("MinMaxScaler")
plt.subplot(3,1,3)
sns.kdeplot(data=fraud_scaled.iloc[:,2:7], legend=None)
plt.ylabel("StandardScaler");
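The comparison above can also be checked numerically. The following is a minimal sketch on hypothetical toy data (not the fraud dataset): it confirms that StandardScaler yields mean 0 and standard deviation 1, that MinMaxScaler yields the range [0, 1], and that neither changes the shape (skewness) of a feature. The `skewness` helper is defined here for illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy data: two features on very different scales
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.exponential(scale=1000.0, size=500),   # skewed, large values
    rng.normal(loc=5.0, scale=0.1, size=500),  # symmetric, small values
])

X_std = StandardScaler().fit_transform(X)
X_mm = MinMaxScaler().fit_transform(X)

# StandardScaler: every column ends up with mean 0 and standard deviation 1
print(X_std.mean(axis=0).round(6), X_std.std(axis=0).round(6))
# MinMaxScaler: every column ends up in the range [0, 1]
print(X_mm.min(axis=0), X_mm.max(axis=0))


def skewness(a):
    """Sample skewness per column (third standardized moment)."""
    z = (a - a.mean(axis=0)) / a.std(axis=0)
    return (z ** 3).mean(axis=0)


# Both scalers are linear maps, so the skewness of each feature is unchanged
print(skewness(X).round(4), skewness(X_std).round(4), skewness(X_mm).round(4))
```

Because both scalers only shift and stretch each column, the KDE curves keep their shape and only the axis changes.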
Principal Component Analysis with the sklearn library
# initialise the PCA object
pca = PCA(n_components = fraud_scaled.shape[1], # number of PCs to compute (here: all of them)
svd_solver='full') # full SVD, so that every PC is computed
pca.fit(fraud_scaled) # fit the PCA
# alternatively, pca.fit_transform(fraud_scaled) fits and returns the transformed data in one call
[additional] Note: according to the scikit-learn documentation, PCA uses Singular Value Decomposition (SVD) for its linear dimensionality reduction. The result matches an eigen decomposition of the covariance matrix (eigenvectors and eigenvalues), up to the sign of each component, but the numerical computation is more stable and efficient.
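The equivalence in the note can be checked on a small example. This is a sketch on hypothetical random data, assuming the covariance matrix uses the n-1 denominator (the `np.cov` default). It uses `numpy.linalg.eigh`, the variant of `eig` for symmetric matrices, rather than the `eig` imported at the top of the notebook.

```python
import numpy as np
from numpy.linalg import eigh
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Hypothetical correlated data, standardized column by column
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Route 1: eigen decomposition of the covariance matrix
cov = np.cov(X, rowvar=False)        # shape (4, 4), n-1 in the denominator
eigvals, eigvecs = eigh(cov)         # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]    # reorder from largest to smallest eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: scikit-learn PCA (SVD under the hood)
pca_demo = PCA(svd_solver="full").fit(X)

# Eigenvalues correspond to explained_variance_
same_variance = np.allclose(eigvals, pca_demo.explained_variance_)
# Eigenvectors are only defined up to sign, so compare absolute values
same_axes = np.allclose(np.abs(eigvecs.T), np.abs(pca_demo.components_))
print(same_variance, same_axes)
```

The eigenvectors appear as columns in `eigvecs` but as rows in `components_`, hence the transpose in the comparison.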
# show the principal axes (eigenvectors), one row per PC, with components_
pca.components_
array([[-1.58193111e-01, 4.17231004e-02, -3.12993736e-02,
-1.12354402e-01, 4.29029060e-02, 2.35194605e-01,
3.53913770e-01, 4.15223133e-01, 4.79699339e-01,
2.19143346e-02, 2.40314729e-01, -2.30827619e-01,
4.97732176e-02, -8.09563599e-02, 1.64710215e-02,
-6.01541713e-02, -9.13280517e-02, 5.58279894e-03,
7.71753656e-02, 4.25029615e-02, 3.75431377e-02,
-4.82409794e-01],
[ 1.35352636e-01, 1.03067433e-02, 3.09083422e-01,
4.30778257e-01, -1.08075190e-01, 5.94721124e-02,
1.25566640e-01, 1.62480037e-01, 1.87053520e-01,
8.44234569e-02, -2.81525038e-01, 3.82221586e-01,
-9.67381455e-03, 2.00084404e-01, -2.35702837e-01,
1.63752218e-01, 4.42144735e-01, 2.38701854e-02,
4.36090971e-02, -1.18017066e-01, 7.11848597e-02,
-1.90168955e-01],
[-1.62305983e-01, -1.08967801e-01, 1.23175526e-01,
2.28386319e-01, 3.22335412e-02, -3.00593250e-02,
3.34595202e-02, 4.89731460e-05, -4.00993479e-02,
1.30764843e-01, -2.96241363e-01, -4.46874904e-01,
-5.13456494e-02, 4.40015716e-01, -3.79541008e-01,
-2.12403957e-02, -4.57573328e-01, -1.08373972e-01,
6.26680483e-02, 1.32022415e-02, 1.31232176e-01,
4.50572823e-02],
[ 1.34508186e-01, -3.58469858e-01, -1.46451454e-01,
2.04504937e-01, 2.24077803e-01, 5.44249899e-02,
-4.79294349e-02, -2.08365909e-02, 5.70973387e-03,
-9.79628667e-02, -1.88221723e-01, -3.22019706e-02,
3.13814042e-01, -2.91441741e-01, 1.12788730e-01,
-2.53440640e-01, -3.64148140e-02, 1.22686183e-01,
4.15692426e-01, -3.09373980e-01, 3.77976682e-01,
-3.28759839e-03],
[-3.12684289e-01, 4.44157306e-01, 2.09772604e-01,
-1.79625729e-02, 1.70119360e-01, 3.20546380e-02,
-5.35459698e-02, -3.39864159e-02, -3.72535154e-02,
-1.19157099e-01, -1.49142409e-01, 8.34566932e-02,
-3.38789799e-01, -1.26254300e-01, -4.45022130e-02,
-4.88724839e-01, 1.28578546e-01, -7.10273043e-02,
3.48378324e-01, 2.26440765e-01, 1.19418217e-01,
4.25272151e-02],
[-1.08824383e-01, 2.16070520e-01, 2.40439603e-01,
-1.94369028e-01, -1.82808479e-01, 6.38008587e-02,
1.45228927e-02, -7.58285566e-03, -7.24652491e-02,
1.45266831e-01, 7.04409015e-02, -8.42914107e-02,
-2.85944784e-01, -1.31275244e-01, 1.40387346e-01,
3.48785217e-01, -7.93340219e-02, -1.10816075e-01,
1.23957247e-01, -6.22821532e-01, 3.20216503e-01,
7.04246420e-02],
[ 3.67065762e-01, 8.04707653e-02, -9.07382431e-02,
1.95796919e-01, 4.96506806e-01, -1.41648918e-01,
1.56071646e-01, 1.07319850e-01, 2.31409554e-02,
2.79655954e-01, -2.04781879e-01, -5.69564101e-02,
-3.14285915e-01, -1.68211710e-01, 3.06921405e-01,
1.57856324e-01, -9.56346018e-02, -3.28832405e-01,
-5.52134765e-02, 1.07508285e-01, -8.67428891e-02,
-1.54136302e-02],
[-1.15805662e-01, -7.33062900e-02, 2.30573762e-01,
6.42000152e-02, -2.53605294e-01, 2.16516131e-01,
-1.06512723e-01, -5.43655035e-02, -1.79649751e-02,
7.16482696e-01, -1.12244394e-01, -4.20518498e-02,
1.40133592e-01, -1.30770788e-01, 3.12158954e-01,
-1.43633521e-01, -4.42468610e-02, 2.23095926e-01,
7.35231946e-02, 1.71116159e-01, -1.68418692e-01,
2.83456276e-02],
[ 1.51541768e-01, 1.21769995e-01, -1.48988957e-01,
-6.92414249e-02, 2.75000584e-01, -9.87472132e-02,
-1.18844066e-02, 2.79491526e-02, 2.05338325e-03,
1.59174994e-01, 1.15892525e-01, -7.92084014e-03,
-2.90031265e-01, 1.67325462e-01, -1.29272462e-01,
1.06466567e-01, -1.18147760e-02, 8.07207126e-01,
1.08115068e-01, -1.10014710e-02, 7.95638058e-02,
7.48062815e-03],
[-9.12279852e-02, 3.79810452e-01, -3.08530867e-01,
-8.12691133e-02, 8.67875714e-02, -2.51218045e-01,
1.62871773e-02, -1.08677696e-02, -2.79250744e-02,
2.83103686e-01, 8.31392742e-02, 7.24579090e-02,
4.53977699e-01, 1.98475045e-01, -1.08828973e-01,
1.70852135e-01, 8.95164745e-02, -1.96460489e-01,
4.70066746e-01, -6.67284824e-02, -1.51309341e-01,
8.22066516e-03],
[-2.77691643e-01, 1.10080572e-01, 2.35972110e-01,
1.24801179e-01, 2.02527364e-01, 5.06805911e-02,
1.30073404e-01, 6.06991491e-02, -5.96452018e-02,
-3.71233550e-01, -2.73684628e-01, -7.52253905e-02,
1.72102920e-01, -5.22902566e-02, 2.44814305e-01,
2.30680758e-01, -1.21208342e-01, 2.59758262e-01,
6.16472898e-02, -1.84046857e-01, -5.33951258e-01,
5.42525824e-02],
[-1.22833113e-01, 1.21150639e-02, 2.20380307e-01,
-1.39574260e-01, 6.15788778e-01, 4.68633099e-01,
-2.16613102e-01, -1.65395818e-01, -2.93554615e-03,
1.21611818e-01, 1.45565133e-01, 6.44150540e-02,
2.47935015e-01, 1.20974012e-01, -1.08183119e-01,
1.04277816e-01, 1.08665576e-01, -9.09041993e-02,
-2.50648491e-01, -2.09160068e-02, 1.44860391e-01,
2.20914880e-02],
[-1.69886520e-01, -3.89315218e-01, 2.81141606e-01,
-2.45782147e-01, 2.26162133e-01, -4.77376136e-01,
1.49677507e-01, 9.80140349e-02, -2.67335092e-02,
2.18070980e-01, 1.18137494e-01, 7.02643808e-02,
-6.28351349e-02, -9.93899966e-02, -2.34496512e-01,
-2.67238960e-01, 1.25842852e-01, -4.78164587e-02,
-5.98350819e-02, -2.76158939e-01, -2.52263914e-01,
8.73903898e-03],
[-4.62554577e-01, -7.38856713e-02, 8.03258864e-03,
1.54395218e-02, 5.78997858e-02, -3.86764056e-01,
1.90870487e-01, 7.01717977e-02, -5.11082716e-02,
8.72037089e-03, -1.03451528e-01, 7.33498025e-02,
1.56781729e-01, 2.69270420e-03, 2.76182943e-01,
2.49822572e-01, 8.75973706e-02, 9.73352235e-02,
-1.96610365e-01, 3.10162796e-01, 5.00743925e-01,
3.45209659e-02],
[-2.23369741e-01, -5.11654222e-01, -7.11377123e-02,
-4.11392465e-02, 3.10148302e-02, 1.87383812e-01,
-8.25468268e-02, -6.15836928e-02, -3.64176545e-04,
-4.99733010e-02, 1.30027523e-01, 9.56396009e-02,
-3.39921338e-01, 7.19387442e-02, 2.24269039e-02,
3.70681432e-01, 9.49660633e-02, -1.23975097e-01,
4.96672054e-01, 2.45413315e-01, -1.26601059e-01,
1.93336014e-02],
[ 4.68175311e-01, -3.58136845e-03, 5.95978025e-01,
-2.26902472e-01, -9.51154429e-03, -1.51026960e-01,
1.10373307e-01, -3.43227745e-02, -1.95178467e-02,
-1.60614443e-01, 1.78403784e-01, -9.11328824e-02,
1.98417618e-01, 1.83311927e-01, 1.67375637e-01,
4.14503644e-02, -7.80939245e-02, 3.78094653e-03,
2.84024629e-01, 2.62667492e-01, 1.11468129e-01,
2.54100762e-02],
[ 7.83152887e-02, -1.78698436e-02, -9.34491208e-02,
-1.60375705e-01, -3.45020156e-02, 3.48067112e-01,
6.65149958e-01, 1.42804850e-01, -3.73979288e-01,
6.25167165e-02, -8.13174294e-02, 9.93821133e-02,
5.65196194e-02, -1.17645493e-01, -2.28132673e-01,
-6.64586079e-03, 1.82542603e-02, 1.84236665e-02,
1.94829982e-02, 9.10335222e-02, 3.61519016e-02,
3.65933593e-01],
[ 4.13062126e-02, 8.41766056e-02, 1.45998381e-01,
2.36021873e-02, -2.51677178e-02, -1.05888884e-01,
-1.75816374e-01, -4.12023636e-02, 5.27897387e-02,
1.59086168e-02, -5.72585508e-02, -9.37564701e-02,
1.06868916e-01, -6.64014141e-01, -5.11552027e-01,
3.43631489e-01, -1.12252488e-01, 5.42352175e-02,
2.54448818e-02, 2.40443492e-01, 3.57171505e-03,
-6.04381753e-02],
[-9.47546453e-02, 6.00121165e-02, 7.37934774e-02,
5.73853220e-01, 2.40695500e-02, -4.47642299e-02,
3.28016057e-01, -4.63147151e-01, -1.97498950e-02,
1.49432325e-02, 5.58471895e-01, -3.82748771e-02,
-2.39235278e-02, -8.65953424e-02, -3.15663255e-03,
-4.22733532e-02, -4.60323340e-02, 9.72243254e-03,
3.37032967e-03, -3.47604114e-02, -1.78917569e-02,
1.98387954e-02],
[-2.77150334e-02, 2.68659749e-02, 7.66241497e-02,
3.33450675e-01, 2.05342298e-02, 3.97969333e-03,
-3.06046627e-01, 7.03402269e-01, -2.81011375e-01,
-1.44045229e-02, 3.80397261e-01, -2.91231619e-02,
3.47869602e-02, -1.38799841e-02, 1.21340707e-02,
-1.95163457e-02, -1.59872716e-02, -2.83961148e-02,
1.20106838e-02, 1.96084657e-02, 1.57240818e-02,
2.53137800e-01],
[ 3.79367168e-02, -4.62552726e-03, -3.57563737e-02,
1.38222829e-02, -1.49299588e-02, 2.37238812e-03,
7.90096240e-03, -4.34618863e-02, -9.79391139e-02,
-1.51067967e-02, -4.16612302e-02, -7.18593621e-01,
-6.41527048e-03, -1.92828140e-02, 1.32692228e-02,
3.11551006e-02, 6.79839585e-01, 1.09951125e-02,
-1.29286747e-02, 7.32565966e-03, -2.65712906e-02,
4.65608250e-02],
[-6.54458124e-03, 6.35716254e-03, -3.97031990e-03,
4.33967158e-03, -7.91809576e-03, -2.01861765e-02,
6.67420142e-03, 1.92196060e-02, 6.97213662e-01,
-2.56640743e-03, 4.14850592e-03, -2.36899745e-02,
1.29786933e-02, -6.99891203e-03, -5.28914441e-03,
1.35884842e-03, 2.77141918e-02, -3.04669177e-03,
-3.49762045e-03, -6.35362842e-03, -2.33000686e-03,
7.14998950e-01]])
pca.components_: contains the eigenvectors, which serve as the formula (weights) for computing each new PC
# optional
pd.DataFrame(pca.components_.T, # transpose so that each PC is a column instead of a row
columns=pca.get_feature_names_out()) # use the PC names as column labels
pca0 | pca1 | pca2 | pca3 | pca4 | pca5 | pca6 | pca7 | pca8 | pca9 | ... | pca12 | pca13 | pca14 | pca15 | pca16 | pca17 | pca18 | pca19 | pca20 | pca21 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -0.158193 | 0.135353 | -0.162306 | 0.134508 | -0.312684 | -0.108824 | 0.367066 | -0.115806 | 0.151542 | -0.091228 | ... | -0.169887 | -0.462555 | -0.223370 | 0.468175 | 0.078315 | 0.041306 | -0.094755 | -0.027715 | 0.037937 | -0.006545 |
1 | 0.041723 | 0.010307 | -0.108968 | -0.358470 | 0.444157 | 0.216071 | 0.080471 | -0.073306 | 0.121770 | 0.379810 | ... | -0.389315 | -0.073886 | -0.511654 | -0.003581 | -0.017870 | 0.084177 | 0.060012 | 0.026866 | -0.004626 | 0.006357 |
2 | -0.031299 | 0.309083 | 0.123176 | -0.146451 | 0.209773 | 0.240440 | -0.090738 | 0.230574 | -0.148989 | -0.308531 | ... | 0.281142 | 0.008033 | -0.071138 | 0.595978 | -0.093449 | 0.145998 | 0.073793 | 0.076624 | -0.035756 | -0.003970 |
3 | -0.112354 | 0.430778 | 0.228386 | 0.204505 | -0.017963 | -0.194369 | 0.195797 | 0.064200 | -0.069241 | -0.081269 | ... | -0.245782 | 0.015440 | -0.041139 | -0.226902 | -0.160376 | 0.023602 | 0.573853 | 0.333451 | 0.013822 | 0.004340 |
4 | 0.042903 | -0.108075 | 0.032234 | 0.224078 | 0.170119 | -0.182808 | 0.496507 | -0.253605 | 0.275001 | 0.086788 | ... | 0.226162 | 0.057900 | 0.031015 | -0.009512 | -0.034502 | -0.025168 | 0.024070 | 0.020534 | -0.014930 | -0.007918 |
5 | 0.235195 | 0.059472 | -0.030059 | 0.054425 | 0.032055 | 0.063801 | -0.141649 | 0.216516 | -0.098747 | -0.251218 | ... | -0.477376 | -0.386764 | 0.187384 | -0.151027 | 0.348067 | -0.105889 | -0.044764 | 0.003980 | 0.002372 | -0.020186 |
6 | 0.353914 | 0.125567 | 0.033460 | -0.047929 | -0.053546 | 0.014523 | 0.156072 | -0.106513 | -0.011884 | 0.016287 | ... | 0.149678 | 0.190870 | -0.082547 | 0.110373 | 0.665150 | -0.175816 | 0.328016 | -0.306047 | 0.007901 | 0.006674 |
7 | 0.415223 | 0.162480 | 0.000049 | -0.020837 | -0.033986 | -0.007583 | 0.107320 | -0.054366 | 0.027949 | -0.010868 | ... | 0.098014 | 0.070172 | -0.061584 | -0.034323 | 0.142805 | -0.041202 | -0.463147 | 0.703402 | -0.043462 | 0.019220 |
8 | 0.479699 | 0.187054 | -0.040099 | 0.005710 | -0.037254 | -0.072465 | 0.023141 | -0.017965 | 0.002053 | -0.027925 | ... | -0.026734 | -0.051108 | -0.000364 | -0.019518 | -0.373979 | 0.052790 | -0.019750 | -0.281011 | -0.097939 | 0.697214 |
9 | 0.021914 | 0.084423 | 0.130765 | -0.097963 | -0.119157 | 0.145267 | 0.279656 | 0.716483 | 0.159175 | 0.283104 | ... | 0.218071 | 0.008720 | -0.049973 | -0.160614 | 0.062517 | 0.015909 | 0.014943 | -0.014405 | -0.015107 | -0.002566 |
10 | 0.240315 | -0.281525 | -0.296241 | -0.188222 | -0.149142 | 0.070441 | -0.204782 | -0.112244 | 0.115893 | 0.083139 | ... | 0.118137 | -0.103452 | 0.130028 | 0.178404 | -0.081317 | -0.057259 | 0.558472 | 0.380397 | -0.041661 | 0.004149 |
11 | -0.230828 | 0.382222 | -0.446875 | -0.032202 | 0.083457 | -0.084291 | -0.056956 | -0.042052 | -0.007921 | 0.072458 | ... | 0.070264 | 0.073350 | 0.095640 | -0.091133 | 0.099382 | -0.093756 | -0.038275 | -0.029123 | -0.718594 | -0.023690 |
12 | 0.049773 | -0.009674 | -0.051346 | 0.313814 | -0.338790 | -0.285945 | -0.314286 | 0.140134 | -0.290031 | 0.453978 | ... | -0.062835 | 0.156782 | -0.339921 | 0.198418 | 0.056520 | 0.106869 | -0.023924 | 0.034787 | -0.006415 | 0.012979 |
13 | -0.080956 | 0.200084 | 0.440016 | -0.291442 | -0.126254 | -0.131275 | -0.168212 | -0.130771 | 0.167325 | 0.198475 | ... | -0.099390 | 0.002693 | 0.071939 | 0.183312 | -0.117645 | -0.664014 | -0.086595 | -0.013880 | -0.019283 | -0.006999 |
14 | 0.016471 | -0.235703 | -0.379541 | 0.112789 | -0.044502 | 0.140387 | 0.306921 | 0.312159 | -0.129272 | -0.108829 | ... | -0.234497 | 0.276183 | 0.022427 | 0.167376 | -0.228133 | -0.511552 | -0.003157 | 0.012134 | 0.013269 | -0.005289 |
15 | -0.060154 | 0.163752 | -0.021240 | -0.253441 | -0.488725 | 0.348785 | 0.157856 | -0.143634 | 0.106467 | 0.170852 | ... | -0.267239 | 0.249823 | 0.370681 | 0.041450 | -0.006646 | 0.343631 | -0.042273 | -0.019516 | 0.031155 | 0.001359 |
16 | -0.091328 | 0.442145 | -0.457573 | -0.036415 | 0.128579 | -0.079334 | -0.095635 | -0.044247 | -0.011815 | 0.089516 | ... | 0.125843 | 0.087597 | 0.094966 | -0.078094 | 0.018254 | -0.112252 | -0.046032 | -0.015987 | 0.679840 | 0.027714 |
17 | 0.005583 | 0.023870 | -0.108374 | 0.122686 | -0.071027 | -0.110816 | -0.328832 | 0.223096 | 0.807207 | -0.196460 | ... | -0.047816 | 0.097335 | -0.123975 | 0.003781 | 0.018424 | 0.054235 | 0.009722 | -0.028396 | 0.010995 | -0.003047 |
18 | 0.077175 | 0.043609 | 0.062668 | 0.415692 | 0.348378 | 0.123957 | -0.055213 | 0.073523 | 0.108115 | 0.470067 | ... | -0.059835 | -0.196610 | 0.496672 | 0.284025 | 0.019483 | 0.025445 | 0.003370 | 0.012011 | -0.012929 | -0.003498 |
19 | 0.042503 | -0.118017 | 0.013202 | -0.309374 | 0.226441 | -0.622822 | 0.107508 | 0.171116 | -0.011001 | -0.066728 | ... | -0.276159 | 0.310163 | 0.245413 | 0.262667 | 0.091034 | 0.240443 | -0.034760 | 0.019608 | 0.007326 | -0.006354 |
20 | 0.037543 | 0.071185 | 0.131232 | 0.377977 | 0.119418 | 0.320217 | -0.086743 | -0.168419 | 0.079564 | -0.151309 | ... | -0.252264 | 0.500744 | -0.126601 | 0.111468 | 0.036152 | 0.003572 | -0.017892 | 0.015724 | -0.026571 | -0.002330 |
21 | -0.482410 | -0.190169 | 0.045057 | -0.003288 | 0.042527 | 0.070425 | -0.015414 | 0.028346 | 0.007481 | 0.008221 | ... | 0.008739 | 0.034521 | 0.019334 | 0.025410 | 0.365934 | -0.060438 | 0.019839 | 0.253138 | 0.046561 | 0.714999 |
22 rows × 22 columns
The proportion of information (variance) captured by each PC is available in the explained_variance_ratio_ attribute:
# proportion of variance explained by each PC
pca.explained_variance_ratio_
array([0.1376811 , 0.09104272, 0.06854404, 0.05763219, 0.05041625,
0.04800417, 0.04643708, 0.04563752, 0.04525991, 0.04340626,
0.0428812 , 0.0418975 , 0.04056786, 0.03994442, 0.03862271,
0.0349525 , 0.03072321, 0.02931932, 0.02269744, 0.02195474,
0.01515074, 0.00722711])
The cumulative proportion of information captured as each PC is added:
np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4)*100)
array([13.77, 22.87, 29.72, 35.48, 40.52, 45.32, 49.96, 54.52, 59.05,
63.39, 67.68, 71.87, 75.93, 79.92, 83.78, 87.28, 90.35, 93.28,
95.55, 97.75, 99.27, 99.99])
Note:
- Proportion of Variance: the information captured by each individual PC
- Cumulative Proportion: the total information captured from PC0 up to and including that PC
To make this clearer, we can plot the proportions above using the code below.
# Proportion of variance explained by each principal component
explained_var_ratio = pca.explained_variance_ratio_
# Build the scree plot with plotly
fig = go.Figure()
# Per-component explained variance (in percent)
fig.add_trace(go.Scatter(x=list(range(1, len(explained_var_ratio) + 1)),
y=explained_var_ratio*100, mode='lines+markers',
name='Explained Variance Ratio'))
# Cumulative explained variance (in percent)
fig.add_trace(go.Scatter(x=list(range(1, len(explained_var_ratio) + 1)),
y=np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4)*100), mode='lines+markers',
name='Cumulative summary'))
# Set the layout and display the figure
fig.update_layout(title='Scree Plot',
xaxis_title='Principal Component (PC)',
yaxis_title='Explained Variance (%)',
showlegend=True,
width=800, height=620)
pyo.iplot(fig, filename='Scree')
Transform PCA
Show the value of each observation on every PC in the new coordinate system
transform_ = pd.DataFrame(pca.transform(fraud_scaled),
columns=pca.get_feature_names_out())
transform_.head()
pca0 | pca1 | pca2 | pca3 | pca4 | pca5 | pca6 | pca7 | pca8 | pca9 | ... | pca12 | pca13 | pca14 | pca15 | pca16 | pca17 | pca18 | pca19 | pca20 | pca21 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.985811 | 1.769162 | -0.450567 | -1.204763 | -0.117952 | 0.913919 | -0.725541 | 1.175108 | 1.759004 | -0.537196 | ... | -0.810912 | 0.935328 | 0.249804 | -0.067922 | -0.263015 | -0.442615 | 0.653300 | -0.105641 | 0.060576 | -0.063466 |
1 | 2.436214 | -0.073470 | 0.229321 | -0.830563 | -0.100201 | 2.323481 | 0.927359 | -0.936612 | -0.827647 | -0.064329 | ... | -0.698201 | -1.541975 | -1.348272 | -0.053672 | 0.511195 | 0.031784 | -0.039432 | -0.396871 | -0.477543 | -0.203613 |
2 | 2.584369 | 2.080913 | -0.033197 | 0.575900 | -0.870996 | 0.344936 | 0.538653 | -0.454442 | -1.623096 | -1.124090 | ... | 2.110493 | 0.275672 | -1.110919 | 0.196495 | 0.736336 | 0.427606 | 0.865952 | -0.305482 | -0.585496 | -0.219957 |
3 | -1.245341 | 0.854789 | -0.846922 | -0.526426 | -1.109341 | -0.393849 | -0.343671 | -1.335815 | 0.108946 | 0.625671 | ... | -0.262356 | -1.130105 | 0.032108 | -0.351500 | 0.061635 | -0.500656 | 0.066543 | 0.198758 | -0.302610 | -0.307206 |
4 | 2.589848 | -0.145288 | 1.091001 | 0.032348 | 0.029693 | 1.672321 | -1.662417 | -0.280590 | 0.326410 | -0.920124 | ... | -0.696854 | -0.764018 | 0.425946 | 0.366670 | -0.204808 | -0.079668 | -1.138763 | -0.004642 | 0.955276 | 0.085033 |
5 rows × 22 columns
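Answer: Not yet. The number of dimensions is still the same (22 columns); only the axes have changed, so the information is now distributed differently across the PCs.
To reduce the dimensions while keeping at least 90% of the information, we keep the PCs up to pca16, i.e. the first 17 PCs (cumulative proportion 90.35%).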
fraud_pca = transform_.iloc[:,:17]
fraud_pca.head()
pca0 | pca1 | pca2 | pca3 | pca4 | pca5 | pca6 | pca7 | pca8 | pca9 | pca10 | pca11 | pca12 | pca13 | pca14 | pca15 | pca16 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.985811 | 1.769162 | -0.450567 | -1.204763 | -0.117952 | 0.913919 | -0.725541 | 1.175108 | 1.759004 | -0.537196 | 0.074733 | 1.481429 | -0.810912 | 0.935328 | 0.249804 | -0.067922 | -0.263015 |
1 | 2.436214 | -0.073470 | 0.229321 | -0.830563 | -0.100201 | 2.323481 | 0.927359 | -0.936612 | -0.827647 | -0.064329 | 0.187772 | -1.227025 | -0.698201 | -1.541975 | -1.348272 | -0.053672 | 0.511195 |
2 | 2.584369 | 2.080913 | -0.033197 | 0.575900 | -0.870996 | 0.344936 | 0.538653 | -0.454442 | -1.623096 | -1.124090 | 0.150369 | 0.294092 | 2.110493 | 0.275672 | -1.110919 | 0.196495 | 0.736336 |
3 | -1.245341 | 0.854789 | -0.846922 | -0.526426 | -1.109341 | -0.393849 | -0.343671 | -1.335815 | 0.108946 | 0.625671 | -1.456979 | 0.262265 | -0.262356 | -1.130105 | 0.032108 | -0.351500 | 0.061635 |
4 | 2.589848 | -0.145288 | 1.091001 | 0.032348 | 0.029693 | 1.672321 | -1.662417 | -0.280590 | 0.326410 | -0.920124 | 1.185936 | 0.362399 | -0.696854 | -0.764018 | 0.425946 | 0.366670 | -0.204808 |
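The cutoff of 17 PCs used above was read off the cumulative proportion by eye. As a minimal sketch on hypothetical toy data (8 features, not the fraud dataset), the same cutoff can be computed from `explained_variance_ratio_` and compared with what scikit-learn selects when `n_components` is given as a proportion. The names `threshold` and `n_keep` are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Hypothetical standardized data with 8 correlated features
X = rng.normal(size=(400, 8)) @ rng.normal(size=(8, 8))
X = (X - X.mean(axis=0)) / X.std(axis=0)

full = PCA(svd_solver="full").fit(X)
cum_ratio = np.cumsum(full.explained_variance_ratio_)

threshold = 0.90
# Index of the first PC whose cumulative ratio reaches the threshold,
# plus 1 to turn the index into a count of PCs
n_keep = int(np.searchsorted(cum_ratio, threshold) + 1)

# With a float in (0, 1), scikit-learn picks the number of PCs itself
auto = PCA(n_components=threshold, svd_solver="full").fit(X)
print(n_keep, auto.n_components_)
```

On the fraud data the same rule would give 17, since the cumulative proportion first passes 90% at the 17th PC (pca16).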
Notes: once the PCs that summarise the required information have been selected, they can be combined with the original data and used for further analysis (e.g. supervised learning).
The approach above is the manual way. We can also reduce the dimensions directly when creating the PCA object, by passing the proportion of information to keep to the n_components parameter.
The drawback of this approach is that the inverse transform can no longer recover the original data exactly, because the information in the discarded PCs is lost.
pca2 = PCA(n_components = 0.9, # keep enough PCs to explain at least 90% of the variance
svd_solver='full')
# fit_transform fits the model and projects the data in one step
fraud_pca90 = pd.DataFrame(pca2.fit_transform(fraud_scaled),
columns=pca2.get_feature_names_out())
fraud_pca90.head()
pca0 | pca1 | pca2 | pca3 | pca4 | pca5 | pca6 | pca7 | pca8 | pca9 | pca10 | pca11 | pca12 | pca13 | pca14 | pca15 | pca16 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.156565 | -2.117836 | 0.617836 | -0.297799 | 0.133543 | 0.108565 | -0.227080 | 1.645738 | -0.358602 | -1.415333 | -0.624414 | 0.677698 | -0.010943 | -0.166878 | 1.618416 | -1.197033 | 0.566608 |
1 | 2.747659 | -0.773444 | -0.553069 | -0.479412 | -1.855533 | 1.732876 | 0.525852 | -0.714205 | -0.032657 | 0.495306 | 0.829078 | 0.269277 | -1.283505 | -0.849027 | -1.081769 | 0.575271 | 0.610764 |
2 | 2.591602 | -1.943262 | -1.016471 | 0.348008 | -1.371603 | 0.261071 | -0.084554 | -1.175479 | 0.130690 | 0.395665 | -0.344746 | 0.218384 | 1.778190 | -0.175649 | -0.805700 | 0.712865 | -0.890682 |
3 | -1.170206 | -1.354921 | 0.214172 | 0.544212 | -0.430262 | -1.025455 | 0.577854 | 0.157622 | -0.686719 | -0.321691 | 0.114822 | -0.708565 | -0.210811 | 0.218148 | -1.013061 | 0.797674 | 0.574951 |
4 | 2.863529 | -0.685988 | -1.073506 | 0.277979 | 0.633735 | 1.345577 | -0.185201 | -0.179568 | 0.056330 | -0.486255 | -0.124916 | 0.362843 | -1.208754 | -2.206490 | 0.492851 | -0.771684 | 0.585018 |
[optional] Detransform PCA
Converts the PCA result back into the original feature space. Exact recovery is only possible when all PCs have been kept.
pd.DataFrame(pca.inverse_transform(transform_)).head()
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -1.617461 | -1.406485 | -0.463121 | -0.353717 | -0.183422 | 1.914726 | -0.109015 | -0.159213 | -0.130894 | 1.064934 | ... | -1.087580 | -0.817867 | 0.363856 | -0.520999 | -0.68135 | -0.172179 | -0.301415 | 0.887976 | -0.149544 | 0.319350 |
1 | 1.129349 | 1.401910 | -0.316152 | -1.161812 | -0.180966 | 1.279767 | 0.831844 | 1.856551 | 0.530621 | -0.367800 | ... | 0.919473 | -0.817867 | 0.363856 | 1.919391 | -0.68135 | -0.172179 | -0.456308 | -1.126157 | -0.149544 | -1.036425 |
2 | 0.099295 | -0.377142 | -0.214405 | -0.353717 | 0.947517 | -1.166930 | 0.664322 | 1.077426 | 0.628074 | -0.391428 | ... | 0.919473 | -0.817867 | 0.363856 | -0.520999 | -0.68135 | -0.172179 | -0.471254 | -1.126157 | -0.149544 | -1.036425 |
3 | 1.129349 | -0.273965 | -0.282236 | -0.353717 | -0.183795 | -0.792364 | 0.270014 | -0.921763 | -0.653084 | -0.391428 | ... | 0.919473 | -0.817867 | 0.363856 | -0.520999 | -0.68135 | -0.172179 | -0.543402 | 0.887976 | -0.149544 | 0.771275 |
4 | 0.099295 | 0.815273 | -0.881417 | -1.161812 | -0.178846 | 2.960306 | 0.731904 | 0.278006 | 1.180475 | -0.395724 | ... | -1.087580 | -0.817867 | 0.363856 | -0.520999 | -0.68135 | -0.172179 | 0.831392 | -1.126157 | -0.149544 | -1.488350 |
5 rows × 22 columns
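To see what "exact recovery" means in practice, here is a small sketch on hypothetical toy data (6 features). It compares the reconstruction error of a full PCA with that of a PCA reduced to 3 components; the `mse` helper and the choice of 3 components are assumptions made for the illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
# Hypothetical standardized data with 6 correlated features
X = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0)

pca_full = PCA(svd_solver="full").fit(X)                  # keeps all 6 PCs
pca_part = PCA(n_components=3, svd_solver="full").fit(X)  # keeps only 3 PCs

X_back_full = pca_full.inverse_transform(pca_full.transform(X))
X_back_part = pca_part.inverse_transform(pca_part.transform(X))


def mse(a, b):
    """Mean squared difference over all cells."""
    return float(np.mean((a - b) ** 2))


err_full = mse(X, X_back_full)  # practically zero: nothing was discarded
err_part = mse(X, X_back_part)  # positive: 3 PCs were thrown away
print(err_full, err_part)

# The error of the reduced PCA equals the variance held by the discarded PCs
n, p = X.shape
lost = pca_full.explained_variance_[3:].sum() * (n - 1) / (n * p)
print(lost)
```

So the inverse transform still runs on a reduced PCA, but it returns an approximation whose error is exactly the discarded variance.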
Examples of PCA applications (in the R programming language):
- as a method to reduce multicollinearity: rpubs
- as input for a classification model: rpubs
Let's compare the covariance of our data before scaling, after scaling, and after the PCA transform. Run the following code.
# seaborn heatmap of the covariance matrix before scaling
plt.figure(figsize=(8, 6), dpi=100)
sns.heatmap(fraud_num.cov().round(2), vmin=-1, vmax=1, annot=True, cmap='YlGnBu',
annot_kws={"size": 5, "color":'white', "alpha":0.7, "ha": 'center', "va": 'center'});
plt.figure(figsize=(8, 6), dpi=100)
sns.heatmap(fraud_scaled.cov().round(2), vmin=-1, vmax=1, annot=True, cmap='YlGnBu',
annot_kws={"size": 5, "color":'white', "alpha":0.7, "ha": 'center', "va": 'center'});
plt.figure(figsize=(8, 6), dpi=100)
sns.heatmap(fraud_pca90.cov().round(2), vmin=-1, vmax=1, annot=True, cmap='YlGnBu',
annot_kws={"size": 5, "color":'white', "alpha":0.7, "ha": 'center', "va": 'center'});
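What the third heatmap is expected to show can be verified numerically: after PCA the covariance matrix is diagonal. The sketch below uses hypothetical toy data rather than the fraud dataset, and `max_off_diagonal` is a helper defined only for this illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# Hypothetical standardized data with 5 correlated features
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)

pca_demo = PCA(svd_solver="full").fit(X)
scores = pca_demo.transform(X)

cov_before = np.cov(X, rowvar=False)
cov_after = np.cov(scores, rowvar=False)


def max_off_diagonal(m):
    """Largest absolute value outside the main diagonal."""
    return float(np.abs(m - np.diag(np.diag(m))).max())


print(max_off_diagonal(cov_before))  # clearly non-zero: features are correlated
print(max_off_diagonal(cov_after))   # numerically zero: PCs are uncorrelated
# The diagonal holds the variance of each PC; the total variance is preserved
print(np.trace(cov_before), np.trace(cov_after))
```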
Visualizing PCA
PCA is useful not only for dimensionality reduction but also for visualising high-dimensional data. The visualisation can use a biplot, which shows:
- Individual factor map: the overall spread of the observations on 2 PCs. It helps to identify:
- similar observations
- outliers in the data as a whole
- Variables factor map: a plot showing the correlation between variables and their contribution to the PCs.
Biplot Visualization
We will use a custom function from the helper module, biplot_pca.
# function from helper.py
biplot_pca(fraud_scaled.head(50))
Legend:
- Observation points:
- the number is the index of the observation.
- The closer two points are, the more similar their characteristics; points far from the main group are considered outliers.
- Vector lines (arrows):
- loading score: shows how much the variable contributes to the PC, i.e. how much of that variable's information the PC summarises.
- The longer the arrow, the more information is summarised.
Biplot (loadings) visualisation using the plotly library. This is a custom function that can be found in the helper.py file.
biplot_plotly(fraud_scaled, pca)
Individual
- Outlier detection: observations far from the rest indicate outliers in the data as a whole. They can be flagged and inspected later, for business purposes or to check whether they affect model performance.
- An observation lying in the direction of an arrow has a high value on that variable; one lying in the opposite direction has a low value on that variable.
- Nearby observations: observations close to each other have similar characteristics.
Variable
The correlation between variables can be read from the angle between arrows:
- Arrows pointing in similar directions (angle < 90°): positive correlation
- Perpendicular arrows (angle = 90°): no correlation
- Arrows pointing in opposite directions (angle close to 180°): negative correlation
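The angle rule above has an exact counterpart when all PCs are kept: if each arrow is the loading vector scaled by the standard deviation of its PC, the cosine of the angle between two arrows equals the correlation of the two variables. A biplot shows only the first two PCs, so there the rule is an approximation. The sketch below illustrates this on hypothetical toy variables; the scaling of the arrows is an assumption and may differ from what the helper functions draw.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
base = rng.normal(size=500)
# Hypothetical variables: columns 0 and 1 move together,
# column 2 moves against them, column 3 is unrelated
X = np.column_stack([
    base + 0.3 * rng.normal(size=500),
    base + 0.3 * rng.normal(size=500),
    -base + 0.3 * rng.normal(size=500),
    rng.normal(size=500),
])
X = (X - X.mean(axis=0)) / X.std(axis=0)

pca_demo = PCA(svd_solver="full").fit(X)
# One row per variable, one column per PC, scaled by each PC's standard deviation
arrows = pca_demo.components_.T * np.sqrt(pca_demo.explained_variance_)


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


corr = np.corrcoef(X, rowvar=False)
# All PCs kept: cosine between arrows equals the correlation
cos_full = cosine(arrows[0], arrows[1])
print(cos_full, corr[0, 1])
# Only the first two PCs (what a biplot draws): the sign still matches
cos_2d_same = cosine(arrows[0, :2], arrows[1, :2])
cos_2d_opposite = cosine(arrows[0, :2], arrows[2, :2])
print(cos_2d_same, cos_2d_opposite)
```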
Variable Importance
Besides reading the variables factor map, we can also tabulate the loadings to see which variables contribute most to a PC.
# Get the loadings from the PCA
loadings = pca.components_
# Build a dataframe of the loadings
loadings_df = pd.DataFrame(data=loadings.T,
columns=pca.get_feature_names_out())
# Add a column with the variable names
loadings_df['Variable'] = fraud_scaled.columns
# Show the significant loadings on pca0 (here: absolute loading > 0.2)
significant_loadings = loadings_df[abs(loadings_df['pca0']) > 0.2]
significant_loadings
pca0 | pca1 | pca2 | pca3 | pca4 | pca5 | pca6 | pca7 | pca8 | pca9 | ... | pca13 | pca14 | pca15 | pca16 | pca17 | pca18 | pca19 | pca20 | pca21 | Variable | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 0.254536 | -0.184049 | -0.052353 | 0.048724 | 0.181880 | 0.553324 | -0.406133 | 0.319775 | 0.330375 | 0.055199 | ... | -0.171046 | -0.024820 | 0.163752 | 0.043083 | -0.031187 | -0.092707 | 0.033165 | 0.092057 | -0.016102 | (zip_count_4w,) |
6 | 0.278197 | -0.167477 | -0.065016 | -0.001018 | 0.019048 | -0.250079 | 0.144120 | -0.034537 | -0.181518 | -0.292714 | ... | -0.173220 | 0.020450 | -0.039390 | 0.539321 | -0.439740 | -0.138476 | 0.152361 | 0.048419 | -0.144075 | (velocity_6h,) |
7 | 0.418412 | -0.175674 | 0.032164 | -0.015764 | 0.094908 | -0.016151 | 0.174177 | 0.001748 | -0.106125 | 0.004612 | ... | 0.065990 | 0.036640 | -0.163368 | 0.214036 | 0.616769 | 0.294909 | -0.002323 | -0.278603 | 0.064113 | (velocity_24h,) |
8 | 0.422678 | -0.161588 | -0.076163 | -0.011424 | 0.126181 | -0.093132 | 0.015535 | -0.075685 | -0.067948 | -0.070653 | ... | 0.148771 | 0.146961 | 0.088360 | -0.470691 | 0.017705 | -0.203896 | 0.094607 | 0.063541 | -0.619253 | (velocity_4w,) |
10 | 0.328022 | 0.285483 | 0.169881 | -0.030157 | -0.058191 | 0.270602 | -0.033609 | 0.131475 | -0.202543 | -0.016625 | ... | 0.330536 | -0.206888 | 0.101545 | 0.303949 | 0.032425 | 0.101370 | -0.003186 | 0.017381 | -0.069864 | (date_of_birth_distinct_emails_4w,) |
11 | -0.264609 | -0.239203 | 0.203804 | 0.202757 | 0.252504 | -0.128731 | -0.084717 | 0.187801 | -0.378717 | 0.071309 | ... | 0.024587 | 0.228687 | 0.114639 | 0.105978 | 0.391262 | -0.294474 | 0.002183 | 0.367030 | 0.034964 | (credit_risk_score,) |
21 | -0.437990 | 0.118262 | 0.077484 | -0.032829 | -0.126963 | 0.036287 | -0.031599 | 0.127081 | 0.100576 | 0.135952 | ... | 0.024086 | -0.055217 | -0.087107 | 0.224116 | 0.141136 | 0.149134 | 0.211955 | -0.169378 | -0.702388 | (month,) |
7 rows × 23 columns
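A fixed cutoff such as 0.2 is one way to read the loadings. Another option, sketched here on hypothetical toy variables, is to rank the variables by their squared loading: each row of `components_` has unit length, so the squared loadings of one PC sum to 1 and can be read as contribution shares. The variable names and data are made up for the illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(11)
base = rng.normal(size=400)
names = np.array(["a", "b", "c", "d"])  # hypothetical variable names
X = np.column_stack([
    base + 0.3 * rng.normal(size=400),
    base + 0.3 * rng.normal(size=400),
    -base + 0.3 * rng.normal(size=400),
    rng.normal(size=400),               # unrelated to the other three
])
X = (X - X.mean(axis=0)) / X.std(axis=0)

pca_demo = PCA(svd_solver="full").fit(X)

pc0 = pca_demo.components_[0]      # loadings of the first PC
share = pc0 ** 2                   # squared loadings sum to 1
ranking = np.argsort(share)[::-1]  # largest contribution first

for name, loading, s in zip(names[ranking], pc0[ranking], share[ranking]):
    print(f"{name}: loading={loading:+.3f}, share of pca0={s:.1%}")
```

The sign of a loading still matters for interpretation (direction of the relationship); the squared value only measures how much the variable contributes.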
Pros and Cons PCA
Advantages of PCA:
- Lower computational cost when the data is used for modelling
- Can be one technique for improving a model (for example in cases of overfitting), although it does not always help
- Reduces the risk of multicollinearity, because the PCs are uncorrelated with each other
Disadvantages of PCA (before modelling):
- The model becomes hard to interpret, because each PC is a mixture of several variables
Anomaly Detection
Local Outlier Factor with PyOD
Local Outlier Factor (LOF) is a commonly used algorithm for anomaly detection. It scores each data point by the density of the data around it, measured from the distances to its nearest neighbors (very similar in concept to k-NN).
LOF can be a good choice for flagging anomalous data in fraud detection. Below are some strengths and weaknesses of the method.
Pros
- Effective at finding local outliers: LOF can identify outliers that global methods miss, such as outliers that sit right next to a dense cluster.
- Not sensitive to the data distribution: LOF works well on data that is not normally distributed.
- Easy to implement: LOF can be implemented easily with Python libraries such as PyOD.
Cons
- Can be slow on large data: LOF requires fairly heavy computation on large datasets.
- Requires careful parameter selection: the parameter k (the number of nearest neighbors) used in LOF can affect the outlier detection results.
Put simply, LOF computes the distances between data points, and points that are isolated relative to their local group are labeled as outliers. Below is a simple illustration of local groups of data in 2-dimensional space.
In the illustration above, C1 and C2 are local groups of data. The points of interest are O1, O2, O3, and O4.
In this case, O1 and O2 can be considered local outliers with respect to group C1. O4 is probably not an outlier for group C2, because the points in C2 are fairly spread out and not as dense as C1. O3, meanwhile, can be called a global outlier.
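The C1/C2 idea can be reproduced on a toy example. The sketch below is an illustration only: the cluster positions, spreads, and the two test points are assumptions chosen to mimic a dense group, a sparse group, a point just outside the dense group, and a point inside the sparse group. It uses scikit-learn's LocalOutlierFactor and reads the raw LOF values from negative_outlier_factor_.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)

# Two local groups with very different densities (assumed toy data)
dense = rng.normal(loc=[0, 0], scale=0.1, size=(200, 2))    # plays the role of C1
sparse = rng.normal(loc=[5, 5], scale=1.0, size=(200, 2))   # plays the role of C2

# Two points of interest
near_dense = np.array([[0.8, 0.8]])  # just outside the dense group (like O1/O2)
in_sparse = np.array([[5.5, 4.5]])   # inside the sparse group (like O4)

X = np.vstack([dense, sparse, near_dense, in_sparse])

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X)

# scikit-learn stores the negated LOF; flip the sign so larger = more anomalous
scores = -lof.negative_outlier_factor_
score_near_dense = scores[-2]
score_in_sparse = scores[-1]

print(f"LOF of point near the dense group:   {score_near_dense:.2f}")
print(f"LOF of point inside the sparse group: {score_in_sparse:.2f}")
```

The point near the dense group gets an LOF well above 1 because its neighbors are far more tightly packed than it is, while the point inside the sparse group stays close to 1 even though its absolute distance to its neighbors is not small.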
We will use the PCA result, fraud_pca90, to try this method.
fraud_pca90.head(3)
pca0 | pca1 | pca2 | pca3 | pca4 | pca5 | pca6 | pca7 | pca8 | pca9 | pca10 | pca11 | pca12 | pca13 | pca14 | pca15 | pca16 | color | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.156565 | -2.117836 | 0.617836 | -0.297799 | 0.133543 | 0.108565 | -0.227080 | 1.645738 | -0.358602 | -1.415333 | -0.624414 | 0.677698 | -0.010943 | -0.166878 | 1.618416 | -1.197033 | 0.566608 | 0 |
1 | 2.747659 | -0.773444 | -0.553069 | -0.479412 | -1.855533 | 1.732876 | 0.525852 | -0.714205 | -0.032657 | 0.495306 | 0.829078 | 0.269277 | -1.283505 | -0.849027 | -1.081769 | 0.575271 | 0.610764 | 0 |
2 | 2.591602 | -1.943262 | -1.016471 | 0.348008 | -1.371603 | 0.261071 | -0.084554 | -1.175479 | 0.130690 | 0.395665 | -0.344746 | 0.218384 | 1.778190 | -0.175649 | -0.805700 | 0.712865 | -0.890682 | 0 |
The LOF() class can be used after importing it from the models.lof module of the pyod library.
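from sklearn.neighbors import LocalOutlierFactor
# PyOD model with its default parameters
lof_model = LOF()
# scikit-learn model with matching settings, for comparison
lof_model2 = LocalOutlierFactor(contamination=0.1, n_jobs=1, novelty=True)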
fraud_pca90
pca0 | pca1 | pca2 | pca3 | pca4 | pca5 | pca6 | pca7 | pca8 | pca9 | pca10 | pca11 | pca12 | pca13 | pca14 | pca15 | pca16 | color | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.156565 | -2.117836 | 0.617836 | -0.297799 | 0.133543 | 0.108565 | -0.227080 | 1.645738 | -0.358602 | -1.415333 | -0.624414 | 0.677698 | -0.010943 | -0.166878 | 1.618416 | -1.197033 | 0.566608 | 0 |
1 | 2.747659 | -0.773444 | -0.553069 | -0.479412 | -1.855533 | 1.732876 | 0.525852 | -0.714205 | -0.032657 | 0.495306 | 0.829078 | 0.269277 | -1.283505 | -0.849027 | -1.081769 | 0.575271 | 0.610764 | 0 |
2 | 2.591602 | -1.943262 | -1.016471 | 0.348008 | -1.371603 | 0.261071 | -0.084554 | -1.175479 | 0.130690 | 0.395665 | -0.344746 | 0.218384 | 1.778190 | -0.175649 | -0.805700 | 0.712865 | -0.890682 | 0 |
3 | -1.170206 | -1.354921 | 0.214172 | 0.544212 | -0.430262 | -1.025455 | 0.577854 | 0.157622 | -0.686719 | -0.321691 | 0.114822 | -0.708565 | -0.210811 | 0.218148 | -1.013061 | 0.797674 | 0.574951 | 0 |
4 | 2.863529 | -0.685988 | -1.073506 | 0.277979 | 0.633735 | 1.345577 | -0.185201 | -0.179568 | 0.056330 | -0.486255 | -0.124916 | 0.362843 | -1.208754 | -2.206490 | 0.492851 | -0.771684 | 0.585018 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
14895 | -3.006352 | -2.208084 | -0.244070 | 1.140636 | -1.137031 | 0.261961 | -0.182638 | -0.192580 | -0.584322 | -0.187782 | -0.009381 | -0.214395 | 0.692088 | -0.435719 | -1.041518 | 0.198371 | 0.314340 | 0 |
14896 | -0.500000 | -1.009571 | -0.074096 | -0.329787 | 0.704215 | 1.900702 | 1.150522 | 2.356263 | 0.672031 | 1.162215 | -1.075699 | 0.109144 | 0.877142 | -0.504097 | -0.403844 | -1.671751 | 0.931399 | 0 |
14897 | 1.321106 | -0.978755 | -0.035949 | -0.577802 | 0.821491 | -0.492155 | 0.222857 | -0.138077 | -0.258987 | -0.652570 | -0.021578 | -0.929279 | 0.012119 | 0.839235 | 0.426793 | -1.340530 | -0.576510 | 0 |
14898 | -0.064242 | 0.870179 | 1.870764 | 3.899863 | 0.642990 | -2.117461 | 3.541267 | 1.985717 | 0.629173 | 1.760697 | 0.402671 | 0.632329 | 0.249250 | -0.006183 | 1.404148 | -0.428410 | -0.790182 | 1 |
14899 | -1.084422 | 0.650284 | 2.226155 | -0.978600 | -0.422600 | 2.157075 | 1.396626 | 2.711047 | 0.905026 | 1.918226 | -0.180522 | 0.340268 | -0.696036 | 0.352001 | 0.614355 | -2.447711 | -1.459189 | 0 |
14900 rows × 18 columns
The LOF object above can be applied directly to the data we processed earlier using the fit_predict() method.
lof_label = lof_model.fit_predict(fraud_pca90)
lof_model2.fit(fraud_pca90)
lof_label2 = lof_model2.predict(fraud_pca90)
Because this is an unsupervised process, fit_predict() returns labels directly. Behind those labels, however, there is an anomaly score for every data point passed to the model. These scores can be viewed with the decision_function() method.
# Compute the LOF scores (PyOD model)
lof_scores = lof_model.decision_function(fraud_pca90)
lof_scores
array([1.18030495, 1.11967908, 1.19635871, ..., 0.98605162, 1.54397575,
1.12362949])
# Compute the LOF scores (scikit-learn model)
lof_scores2 = lof_model2.decision_function(fraud_pca90)
lof_scores2
C:\Users\SaltFarmer\miniconda3\envs\algoritma\lib\site-packages\sklearn\base.py:465: UserWarning:
X does not have valid feature names, but LocalOutlierFactor was fitted with feature names
array([ 0.00680207, 0.06742794, -0.00925168, ..., 0.2010554 ,
-0.35686873, 0.06347754])
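In the two outputs above, each pair of scores sums to roughly 1.187 (for example 1.18030495 + 0.00680207). This suggests the two libraries report the same LOF values on different scales: PyOD returns a score where higher means more anomalous, while scikit-learn's decision_function returns the negated LOF shifted by its threshold, so negative values are outliers. The sketch below, on synthetic data (an assumption, not fraud_pca90), uses only scikit-learn to show how the labels follow from thresholding the scores at the contamination percentile.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))  # synthetic stand-in for the PCA scores

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
labels = lof.fit_predict(X)  # scikit-learn convention: -1 = outlier, 1 = inlier

# Raw LOF values (larger = more anomalous) and the fitted threshold
lof_values = -lof.negative_outlier_factor_
threshold = -lof.offset_

# Rebuild the labels by hand: a point is an outlier when its LOF exceeds the threshold
manual_labels = np.where(lof_values > threshold, -1, 1)

outlier_share = float(np.mean(labels == -1))
print("labels match manual thresholding:", np.array_equal(labels, manual_labels))
print(f"share flagged as outliers: {outlier_share:.3f}")
```

Because the threshold is taken from a percentile of the scores, the share of flagged points is set by contamination rather than discovered from the data.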
Because there is one score per data point, we can get a clearer picture by looking at the distribution with a histogram or a boxplot.
sns.histplot(lof_scores)
<Axes: ylabel='Count'>
sns.histplot(lof_scores2)
<Axes: ylabel='Count'>
sns.boxplot(lof_scores, orient='h',)
<Axes: >
sns.boxplot(lof_scores2, orient='h',)
<Axes: >
As for the labels, we can easily count each label with value_counts().
pd.Series(lof_label).value_counts()
0 13410
1 1490
Name: count, dtype: int64
pd.Series(lof_label2).apply(lambda x: 1 if x==-1 else 0).value_counts()
0 13410
1 1490
Name: count, dtype: int64
Parameters of the LOF Model
The LOF model object has several parameters we can set. The most commonly used are:
- contamination: sets the estimated proportion of anomalies in the data (default = 0.1)
- n_neighbors: the number of neighbors that make up a point's local neighborhood (default = 20)
- metric: the distance metric used
The contamination value can be set to fit the case at hand, for example:
If we know that 1% of BRI bank accounts are used for fraud, we can set contamination = 0.01.
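To make the role of contamination concrete, here is a minimal sketch on synthetic data (an assumption, not the fraud data) using scikit-learn's LocalOutlierFactor. It fits the same data with three contamination values and records both the share of points flagged and the underlying LOF values.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 4))  # synthetic stand-in data

flagged_share = {}
lof_values = {}
for contamination in (0.01, 0.05, 0.10):
    lof = LocalOutlierFactor(n_neighbors=20, contamination=contamination)
    labels = lof.fit_predict(X)  # -1 = outlier, 1 = inlier
    flagged_share[contamination] = float(np.mean(labels == -1))
    lof_values[contamination] = -lof.negative_outlier_factor_

for contamination, share in flagged_share.items():
    print(f"contamination={contamination:.2f} -> flagged share={share:.3f}")
```

The LOF values themselves are identical across the three runs; only the cut-off moves. Choosing contamination is therefore a statement about how many anomalies we expect, so it is best taken from domain knowledge as in the example above.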
fraud.columns
Index(['income', 'name_email_similarity', 'current_address_months_count',
'customer_age', 'days_since_request', 'intended_balcon_amount',
'payment_type', 'zip_count_4w', 'velocity_6h', 'velocity_24h',
'velocity_4w', 'bank_branch_count_8w',
'date_of_birth_distinct_emails_4w', 'employment_status',
'credit_risk_score', 'email_is_free', 'housing_status',
'phone_home_valid', 'phone_mobile_valid', 'has_other_cards',
'proposed_credit_limit', 'foreign_request', 'source',
'session_length_in_minutes', 'device_os', 'keep_alive_session',
'device_distinct_emails_8w', 'month'],
dtype='object')
lof_tune = LOF(
    contamination=0.005,
    n_neighbors=15
)
lof_label_tune = lof_tune.fit_predict(fraud_pca90)
Let's look at how the contamination parameter affects the number of anomalies detected by our model.
pd.Series(lof_label_tune).value_counts(normalize=True)
0 0.994966
1 0.005034
Name: proportion, dtype: float64
Besides looking at the distribution plots, we can show how the outliers are spread on the 2-dimensional plane produced by PCA. Here is the code:
# Plot the anomalies on the first two principal components of fraud_pca90
plt.figure(figsize=(10, 6))
sns.scatterplot(x=fraud_pca90['pca0'],
                y=fraud_pca90['pca1'],
                hue=lof_label_tune,
                palette='coolwarm')
plt.title('Local Outlier Factor Results')
plt.xlabel(f'PC 1 ({pca.explained_variance_ratio_[0]*100:.2f}%)')
plt.ylabel(f'PC 2 ({pca.explained_variance_ratio_[1]*100:.2f}%)');
Alternatively, for a clearer view, we can use the scatter function from plotly.express, which lets us control which legend entries we want to see.
# Store the LOF labels as strings so Plotly treats them as categories
fraud_pca90["color"] = lof_label_tune.astype(str)
# Plot the LOF results with Plotly Express
fig = px.scatter(fraud_pca90.sort_values("color"),
                 x='pca0', y='pca1', color="color",
                 color_discrete_map={'0': '#a6c4ff', '1': '#ffa07a'},
                 title='LOF Results',
                 labels={'pca0': f'PC 1 ({pca.explained_variance_ratio_[0]*100:.2f}%)',
                         'pca1': f'PC 2 ({pca.explained_variance_ratio_[1]*100:.2f}%)'})
# Show the plot
fig.update_layout(width=800, height=600)
fig.show()
To see the indices of the data points detected as anomalies, we can do the following.
anomaly_indices = np.where(lof_label_tune == 1)[0]
anomaly_indices
array([ 405, 536, 587, 826, 1033, 1412, 1447, 1785, 2367,
2379, 2829, 2857, 2873, 3223, 3374, 3719, 3925, 3949,
4405, 4666, 4980, 5030, 5185, 5222, 5322, 5513, 6059,
6377, 6720, 6864, 7267, 7291, 7502, 7752, 7828, 8031,
8088, 8183, 8327, 8506, 8902, 9090, 9161, 9198, 9332,
9425, 9475, 9508, 9609, 9984, 10171, 10201, 10241, 10373,
10535, 10787, 10933, 11305, 11497, 11938, 12196, 12218, 12502,
12639, 12741, 13209, 13264, 13587, 13884, 14113, 14150, 14209,
14393, 14438, 14898], dtype=int64)
We can also retrieve the anomalous rows using the indices found above. From there we can transform the data back to its original form.
Recall that we previously built two PCA objects: one that keeps all of the information and one that keeps 90% of it. We use the one that keeps all of the information, and then undo the scaling to return to the pre-scaling values.
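# Select the PCA scores of the rows flagged as anomalies
anomaly = fraud_pca.iloc[anomaly_indices]
# Undo the PCA projection, then undo the scaling
temp = pd.DataFrame(pca2.inverse_transform(anomaly))
anomaly_df = pd.DataFrame(scaler.inverse_transform(temp),
                          columns=fraud_scaled.columns)
anomaly_df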
income | name_email_similarity | current_address_months_count | customer_age | days_since_request | zip_count_4w | velocity_6h | velocity_24h | velocity_4w | bank_branch_count_8w | ... | email_is_free | phone_home_valid | phone_mobile_valid | has_other_cards | proposed_credit_limit | foreign_request | session_length_in_minutes | keep_alive_session | device_distinct_emails_8w | month | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.227118 | 0.496827 | 5.207973 | 39.428203 | -0.595193 | 1355.366606 | 9738.815777 | 5409.589451 | 4532.630855 | -427.834634 | ... | 0.769808 | 0.286610 | 0.708494 | 0.096594 | 1233.055235 | 0.004499 | 16.930254 | 0.959731 | 0.919350 | 3.971300 |
1 | 0.567352 | 0.389553 | 318.545739 | 58.054107 | 9.087221 | 713.405175 | -3051.520087 | 1790.808796 | 4080.489742 | 290.771032 | ... | 0.997683 | 1.122376 | 1.002801 | 0.317107 | 472.335355 | -0.000517 | -5.373364 | 0.452061 | 1.299960 | 5.299492 |
2 | 1.204856 | 0.381540 | 44.546778 | 31.717762 | 12.733772 | 2104.858405 | 7525.384632 | 4939.393571 | 4647.240329 | 363.218876 | ... | 0.779109 | 0.262245 | 0.676326 | 0.273754 | 1343.382089 | 0.028783 | 14.859937 | 0.506785 | 0.787268 | 3.825052 |
3 | 0.842459 | 0.556384 | -4.611537 | 42.025048 | -5.133614 | 1841.598725 | 8943.189559 | 6585.863030 | 6137.794412 | -202.542230 | ... | 0.557888 | 0.061073 | 0.933595 | -0.601063 | 741.572467 | 0.147760 | 13.753201 | 1.537824 | 1.163752 | 0.195591 |
4 | 0.829767 | 0.534507 | -8.808409 | 39.105415 | -0.306587 | 1117.997200 | 2622.806566 | 3424.057574 | 3975.124879 | 728.646151 | ... | 0.973009 | 1.225065 | 0.132678 | 0.555465 | 761.448547 | -0.016640 | 9.729460 | -0.306144 | 0.653444 | 5.315429 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
70 | 0.635328 | 0.633775 | -29.095527 | 32.810781 | 2.269996 | 3043.896905 | 7195.093457 | 3206.089170 | 2746.709324 | -480.366615 | ... | 0.468304 | 0.671451 | 0.268859 | -0.471170 | 449.157502 | -0.000090 | 4.690149 | 1.792444 | 1.276307 | 8.425203 |
71 | 0.285200 | 0.413444 | -12.380822 | 40.090587 | 0.944023 | 1847.482068 | 7699.610647 | 6895.225435 | 6637.646140 | 104.947407 | ... | 1.577818 | 0.076020 | 1.293933 | 0.341120 | -99.359943 | 0.106956 | -2.916104 | 0.206275 | 0.662265 | -1.119125 |
72 | 0.779166 | 0.522992 | -87.332137 | 23.503952 | 11.581668 | 904.927500 | 4080.786202 | 3435.538935 | 3374.914866 | 687.769748 | ... | -0.531899 | 0.134679 | 0.780280 | 0.252662 | 426.643105 | 0.378293 | 7.564526 | 0.234643 | 1.578899 | 6.939092 |
73 | 0.182698 | 0.567770 | 267.628313 | 25.144437 | 4.212987 | 479.193597 | 16201.343393 | 6869.295127 | 4059.754066 | 345.825120 | ... | 0.458724 | 0.570946 | 0.522043 | 0.825184 | 770.750326 | -0.292251 | 6.648122 | -0.015046 | 0.918855 | 4.878472 |
74 | 1.404294 | 0.324542 | -85.030334 | 46.039431 | 20.012943 | 326.208995 | 5002.865516 | 4963.315858 | 5522.380653 | 182.337278 | ... | 1.141355 | 0.525966 | 0.863478 | -0.071098 | 888.615907 | 0.431545 | 22.633691 | 1.015346 | 1.626410 | 1.779975 |
75 rows × 22 columns
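Why the PCA that keeps all of the information is needed for this step can be seen in a small round-trip check. The sketch below uses synthetic data and hypothetical row positions (both are assumptions, not the fraud data): it scales, projects, and then reverses both steps, once with every component kept and once with only three of five.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 5))  # synthetic original data

scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

pca_full = PCA().fit(X_scaled)                # keeps all 5 components
pca_part = PCA(n_components=3).fit(X_scaled)  # drops 2 components

rows = np.array([3, 17, 42])  # hypothetical positions of flagged rows


def restore(pca_model):
    """Project the selected rows, then undo the PCA and the scaling."""
    scores = pca_model.transform(X_scaled)[rows]
    return scaler.inverse_transform(pca_model.inverse_transform(scores))


err_full = np.abs(restore(pca_full) - X[rows]).max()
err_part = np.abs(restore(pca_part) - X[rows]).max()
print(f"max abs error, all components: {err_full:.2e}")
print(f"max abs error, 3 components:   {err_part:.2e}")
```

With all components the original values come back up to floating-point error; with fewer components the reconstruction is only an approximation. The round trip is exact only when the scores, the PCA object, and the scaler all come from the same fit and the same columns.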
fraud.iloc[anomaly_indices]
income | name_email_similarity | current_address_months_count | customer_age | days_since_request | intended_balcon_amount | payment_type | zip_count_4w | velocity_6h | velocity_24h | ... | phone_mobile_valid | has_other_cards | proposed_credit_limit | foreign_request | source | session_length_in_minutes | device_os | keep_alive_session | device_distinct_emails_8w | month | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
405 | 0.7 | 0.742771 | 16.0 | 40 | 12.952886 | 20.941790 | AA | 168 | 9011.020696 | 5194.170896 | ... | 1 | 0 | 200.0 | 0 | INTERNET | 1.336989 | x11 | 1 | 1.0 | 4 |
536 | 0.1 | 0.014002 | 4.0 | 30 | 0.015723 | -0.761342 | AB | 491 | 1078.512781 | 2425.681429 | ... | 1 | 0 | 1500.0 | 1 | INTERNET | 7.259926 | windows | 0 | 2.0 | 7 |
587 | 0.9 | 0.202449 | 189.0 | 30 | 12.543677 | 7.465069 | AA | 701 | 9885.425316 | 3574.190293 | ... | 1 | 0 | 200.0 | 0 | INTERNET | 7.902304 | other | 1 | 1.0 | 5 |
826 | 0.1 | 0.896931 | 31.0 | 50 | 11.577133 | -1.395665 | AA | 1661 | 10580.653626 | 6697.931015 | ... | 1 | 0 | 200.0 | 0 | INTERNET | 5.373218 | linux | 1 | 1.0 | 0 |
1033 | 0.9 | 0.841805 | 187.0 | 10 | 0.023406 | 51.886755 | AA | 992 | 1270.459414 | 2789.514959 | ... | 1 | 0 | 200.0 | 0 | INTERNET | 15.635054 | other | 1 | 1.0 | 7 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
14150 | 0.8 | 0.487961 | 25.0 | 20 | 0.013398 | -0.979439 | AC | 714 | 3138.336067 | 3612.627112 | ... | 1 | 0 | 200.0 | 0 | INTERNET | 5.787567 | windows | 1 | 1.0 | 2 |
14209 | 0.6 | 0.221974 | 7.0 | 20 | 0.005390 | -1.696078 | AC | 829 | 8297.226759 | 6033.152656 | ... | 1 | 0 | 200.0 | 0 | INTERNET | 3.335838 | other | 0 | 1.0 | 4 |
14393 | 0.9 | 0.051105 | 173.0 | 50 | 0.014303 | -0.663256 | AD | 932 | 4355.509437 | 5518.657925 | ... | 1 | 0 | 1500.0 | 0 | INTERNET | 1.055325 | windows | 1 | 1.0 | 4 |
14438 | 0.5 | 0.035640 | 19.0 | 30 | 0.001540 | 8.864131 | AA | 2012 | 5808.134611 | 5909.788325 | ... | 1 | 0 | 500.0 | 0 | INTERNET | 4.375013 | other | 0 | 1.0 | 4 |
14898 | 0.4 | 0.709280 | 186.0 | 50 | 0.043043 | 15.297551 | AA | 964 | 7648.922582 | 3132.050823 | ... | 1 | 0 | 1500.0 | 0 | INTERNET | 2.084672 | linux | 0 | 1.0 | 6 |
75 rows × 28 columns
from joblib import dump
dump(pca2, "Pca Uhuy")
['Pca Uhuy.exe']
from joblib import load
halo = load("Pca Uhuy")
halo