Day 1 Algorit.ma : Python For Data Analysis
Day 1, here I will share my notes of Inclass notebook. For further example you can check out on https://github.com/Saltfarmer/Algoritma-BFLP-DS-Audit/tree/main
Coursebook: Python for Data Analysts
This notebook was made based on main materials Python for Data Analysts.ipynb
Version: BRI Audit Analytics - January 2024
Training Objectives
- Introduction to the
pandas
library. - Introduction to
DataFrame
- Data Types
- Exploratory Data Analysis I
- Indexing and Subsetting
1. Perkenalan Jupyter Notebook
1.1 Tipe Cell
Tipe cell dalam notebook:
1. Markdown
Cell markdown diperuntuhkan untuk menuliskan narasi.
Halo 1
Halo 2
Pada bagian markdown terdapat beberapa hal yang dapat dilakukan, seperti membuat beberapa hal berikut ini:
- Heading
Pada bagian ini dapat ditambahkan heading dengan menambahkan hashtag #
.
- `#` -> Heading 1
- `##` -> Heading 2
- `###` -> Heading 3
- Emphasis
Ketika ingin mengatur jenis tulisan dengan memberikan karakter yang lebih tegas kita bisa memanfaatkan *
.
- *kata* -> Untuk mengatur tulisan menjadi Italic
- **kata** -> Untuk mengatur tulisan menjadi Bold
- ***kata*** -> Untuk mengatur tulisan menjadi Italic & Bold
- Bullets
Untuk membuat beberapa point, terdapat beberapa metode yang bisa digunakan.
- Untuk membuat point dalam bentuk angka, bisa menggunakan angka 1.
- Untuk membuat point dalam bentuk bullets, bisa menggunakan - atau *.
2. Code
Cell code diperuntuhkan untuk menuliskan kode.
# Ini merupakan cell untuk code
print("hello world")
hello world
Mini Quiz
Q: Apakah perbedaan paling mencolok antara cell mardown dan code?
A:
Mode cell dalam notebook:
1. Command mode (cell berwarna BIRU)
- B: menambahkan cell baru di Bawah (Below)
- A: menambahkan cell baru di Atas (Above)
- DD: Delete cell
- C: Copy cell
- V: Paste cell
- Y: Mengubah ke code cell
- M: Mengubah ke markdown cell
- Enter: Mengubah command mode menjadi edit mode
2. Edit mode (cell berwarna HIJAU)
- Ctrl + Enter: eksekusi satu cell
- Esc: Mengubah edit mode menjadi command mode
1.2 Shortcut
Kumpulan shortcut: CTRL + SHIFT + P
2. Basic Python Programming
2.1 Variables
Variable adalah sebuah nama yang dipakai untuk menunjukkan sebuah nilai. Tanda =
dipakai untuk membuat variable baru. Proses ini sering disebut sebagai assignment.
# melakukan assignment
perusahaan = "Algoritma"
# print isi objek
perusahaan
'Algoritma'
Mari kita coba buat sebuah objek, yang berisikan nama kita!
# code here
nama = 'Gama Candra Tri Kartika'
print(nama)
Gama Candra Tri Kartika
2.2 Case Sensitive
Python adalah bahasa pemrograman yang case-sensitive sehingga penamaan variable menjadi hal yang perlu diperhatikan.
Misal kita ingin memanggil objek perusahaan tapi dengan huru p kapital.
# Memanggil objek sebelumnya
perusahaan
# Case sensitive
'Algoritma'
Berikut beberapa anjuran dalam memberikan nama variable pada Python:
- Special character
!, $ , &, dll
tidak dapat digunakan dalam penamaan variabel. - Tidak boleh menggunakan angka di awal.
- Bersifat case-sensitive sehingga penamaan variable
algoritma
,ALGORITMA
, danAlgoritma
adalah 3 variable yang berbeda - Tidak boleh menggunakan keyword pada Python
2.3 Keywords
Keywords adalah kata kunci yang sudah ditetapkan oleh Python sebagai nama yang tidak bisa dipakai baik untuk penamaan fungsi, variabel, dan lainnya. Keyword ditulis dalam lower-case (huruf kecil semua) kecuali keyword True
, False
, dan None
. Sejauh ini (Python 3.10) keyword yang ada pada Python adalah sebagai berikut:
#Cek daftar keyword
import keyword
keyword.kwlist
['False',
'None',
'True',
'and',
'as',
'assert',
'async',
'await',
'break',
'class',
'continue',
'def',
'del',
'elif',
'else',
'except',
'finally',
'for',
'from',
'global',
'if',
'import',
'in',
'is',
'lambda',
'nonlocal',
'not',
'or',
'pass',
'raise',
'return',
'try',
'while',
'with',
'yield']
Untuk membuktikan bahwa keyword tidak dapat digunakan sebagai nama variabel, mari kita coba untuk menyimpan nilai 1 pada variabel True
.
# Contoh ketika menyimpan kedalam sebuah keywords
True = 1
Cell In[7], line 2
True = 1
^
SyntaxError: cannot assign to True
3. Introduction to Pandas Library
3.1 Import Library
pandas
adalah library yang powerful sebagai tools analisis data dan struktur pada Python. Dengan pandas
, mengolah data menjadi mudah karena disediakan salah satu objek bernama DataFrame. Dengan dataframe kita dapat membaca sebuah file, mengolah suatu data dengan menggunakan operasi seperti join, distinct, group by, agregasi, dan teknik lainnya.
Lebih lengkapnya silahkan kunjungi official documentation
Untuk menggunakan pandas
, kita perlu import terlebih dahulu library dengan cara berikut ini:
#code here
import pandas
Mari kita coba cek versi library dengan cara memanggil nama library lalu ditambahkan dengan syntax .__version__
#code here
pandas.__version__
'2.0.3'
Kita bisa menggunakan teknik aliasing agar pengetikan nama library tidak terlalu panjang, yaitu dengan as
.
#code here
import pandas as pd
Setelah menggunakan teknik aliasing kita hanya perlu memanggil nama asli library, melainkan kita bisa memanggil nama aliasing yang sudah kita tentukan sebelumnya.
Untuk lebih memahaminya mari kita coba cek versi pandas kita kembali.
#code here
pd.__version__
'2.0.3'
3.2 Membaca Data
Dalam membaca data kita akan menggunakan salah satu fungsi dari pandas dan dalam memanfaatkan fungsi dari pandas kita harus mengikuti syntax yang sudah disediakan.
Semua method pada pandas
dapat dipanggil dengan syntax seperti: pandas.function_name()
. Langkah pertama yang akan kita lakukan adalah membaca data. Kita dapat menggunakan method .read_csv()
untuk membaca sebuah file dengan format .csv
.
# code here
rice = pd.read_csv("data_input/rice.csv", index_col=0)
rice.head()
receipt_id | receipts_item_id | purchase_time | category | sub_category | format | unit_price | discount | quantity | yearmonth | |
---|---|---|---|---|---|---|---|---|---|---|
1 | 9622257 | 32369294 | 7/22/2018 21:19 | Rice | Rice | supermarket | 128000.0 | 0 | 1 | 2018-07 |
2 | 9446359 | 31885876 | 7/15/2018 16:17 | Rice | Rice | minimarket | 102750.0 | 0 | 1 | 2018-07 |
3 | 9470290 | 31930241 | 7/15/2018 12:12 | Rice | Rice | supermarket | 64000.0 | 0 | 3 | 2018-07 |
4 | 9643416 | 32418582 | 7/24/2018 8:27 | Rice | Rice | minimarket | 65000.0 | 0 | 1 | 2018-07 |
5 | 9692093 | 32561236 | 7/26/2018 11:28 | Rice | Rice | supermarket | 124500.0 | 0 | 1 | 2018-07 |
Seperti yang kita lihat, pada data kita terdapat kolom yang bernama RowNumber
. Hal tersebut sangatlah lumrah terjadi, karena biasanya pada data dengan format csv, memiliki sebuah kolom yang berisikan index atau urutan dari datanya.
Untuk menghilangkan kolom tersebut, pada fungsi read_csv()
terdapat sebuah parameter yang bernama index_col =
Intuisi dari parameter index_col
pada read_csv()
adalah menjadikan kolom pada Dataframe sebagai index pada baris. Berikut beberapa nilai yang dapat ditampung oleh parameter index_col
.
- Angka 0, 1, 2, dst : Menunjukkan index atau urutan kolom yang akan dijadikan sebagai index baris.
'nama_kolom'
: Selain menggunakan nilai index nya, kita juga dapat langsung mengetikkan nama kolomnya.
Mari kita coba implementasikan parameter tersebut.
# code here
turnover = pd.read_csv("data_input/turnover.csv", index_col='RowNumber')
Additional Information:
Python menggunakan sistem zero based indexing yang berarti, urutan pada python dimulai dari angka 0.
3.3 Head & Tail
Daripada melihat keseluruhan data, lebih baik kita “mengintip” sebagian baris yang dapat merepresentasikan bentuk keseluruhan data.
Fungsi head()
untuk melihat beberapa baris teratas pada data (default 5)
# code here
turnover.head(5)
CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RowNumber | |||||||||||||
1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 |
2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 |
4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 |
5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 |
Fungsi tail()
untuk melihat beberapa data terakhir.
# code here
turnover.tail(5)
CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RowNumber | |||||||||||||
9996 | 15606229 | Obijiaku | 771 | France | Male | 39 | 5 | 0.00 | 2 | 1 | 0 | 96270.64 | 0 |
9997 | 15569892 | Johnstone | 516 | France | Male | 35 | 10 | 57369.61 | 1 | 1 | 1 | 101699.77 | 0 |
9998 | 15584532 | Liu | 709 | France | Female | 36 | 7 | 0.00 | 1 | 0 | 1 | 42085.58 | 1 |
9999 | 15682355 | Sabbatini | 772 | Germany | Male | 42 | 3 | 75075.31 | 2 | 1 | 0 | 92888.52 | 1 |
10000 | 15628319 | Walker | 792 | France | Female | 28 | 4 | 130142.79 | 1 | 1 | 0 | 38190.78 | 0 |
turnover.sample(5)
CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RowNumber | |||||||||||||
6190 | 15572408 | Chambers | 714 | Germany | Male | 39 | 3 | 149887.49 | 2 | 1 | 0 | 63846.36 | 0 |
4780 | 15711843 | Pisani | 613 | Germany | Male | 40 | 1 | 147856.82 | 3 | 0 | 0 | 107961.11 | 1 |
2074 | 15573309 | Ward | 626 | Spain | Female | 48 | 2 | 0.00 | 2 | 1 | 1 | 95794.98 | 0 |
5404 | 15733169 | Craig | 590 | Spain | Male | 22 | 7 | 125265.61 | 1 | 1 | 1 | 161253.08 | 0 |
4363 | 15618695 | Ts'ui | 571 | Spain | Female | 22 | 3 | 108117.10 | 1 | 0 | 1 | 53328.70 | 0 |
3.4 Tipe Data (Data Types)
Dataframe terdiri dari beberapa Series (mengacu pada satu kolom). Dalam satu series harus memiliki satu tipe data yang sama. pandas
akan mencoba untuk menetapkan tipe data dari masing-masing Series, tapi tidak selalu benar.
3.4.1 Cara cek tipe data:
Kita dapat memanfaakan kedua hal berikut ini untuk mengecek tipe data, dtypes
atau .info()
untuk lebih lengkapnya
🔎 Method vs. Atribut
- Secara fisik/terlihat mata, method selalu diikuti dengan tanda kurung ()
- Contoh : head(). tail(), read_csv()
- Secara fisik/terlihat mata, atribut tidak diikuti oleh tanda kurung
- Contoh : dtypes
- Didalam sebuah method, nilai parameter itu bisa diganti-ganti
- Contoh : head(n=) -> parameter n bisa diganti/disesuaikan dengan jumlah baris yang mau ditampilkan
- Pada sebuah atribut, penggunaaan apa adanya/tidak ada nilai yang bisa diganti ganti
# check tipe data dengan dtypes
print("Datatypes dari data Rice")
print(rice.dtypes)
print("\nDatatypes dari data Turnover")
print(turnover.dtypes)
Datatypes dari data Rice
receipt_id int64
receipts_item_id int64
purchase_time object
category object
sub_category object
format object
unit_price float64
discount int64
quantity int64
yearmonth object
dtype: object
Datatypes dari data Turnover
CustomerId int64
Surname object
CreditScore int64
Geography object
Gender object
Age int64
Tenure int64
Balance float64
NumOfProducts int64
HasCrCard int64
IsActiveMember int64
EstimatedSalary float64
Exited int64
dtype: object
# check tipe data dengan .info()
turnover.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10000 entries, 1 to 10000
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CustomerId 10000 non-null int64
1 Surname 10000 non-null object
2 CreditScore 10000 non-null int64
3 Geography 10000 non-null object
4 Gender 10000 non-null object
5 Age 10000 non-null int64
6 Tenure 10000 non-null int64
7 Balance 10000 non-null float64
8 NumOfProducts 10000 non-null int64
9 HasCrCard 10000 non-null int64
10 IsActiveMember 10000 non-null int64
11 EstimatedSalary 10000 non-null float64
12 Exited 10000 non-null int64
dtypes: float64(2), int64(8), object(3)
memory usage: 1.1+ MB
Gunain dtypes jika ingin mengambil value dtype
nya, sedangkan info
untuk melihat ringkasan null values, kolom, dan tipe datanya
3.4.2 Categorical and Numerical Variables
Karakteristik tipe data category
:
- Dapat dikelompokkan menjadi beberapa kelompok (category)
- Berulang
Dua alasan mengapa kita perlu menggunakan tipe data categorical:
-
Dari sisi “business perspective”, hal ini dapat menginformasikan dan memandu seorang Analyst pada pertanyaan seperti metode statistik atau tipe plot mana yang digunakan untuk mengolah data.
-
Dari sisi teknikal, ketika kita bekerja dengan tipe data categorical pada pandas, hal ini akan jauh menghemat memori dan menambah kecepatan komputasional.
Mari kita cek kembali tipe data pada object turnover
. Manakah yang seharusnya memiliki tipe data category?
Kita bisa menggunakan method berikut untuk mengidentifikasi kolom mana yang cocok untuk disimpan ke tipe data category
.unique()
to see unique values of a Series.nunique()
to see number of unique values of a Series or DataFrame
Berikut contoh syntax untuk mengecek nilai unik pada sebuah kolom
df['nama_kolom'].unique()
# mengecek kolom format dengan fungsi unique()
for kolom in turnover.columns:
if turnover[kolom].dtypes == 'object':
print("Kolom : " + kolom)
print(turnover[kolom].unique())
Kolom : Surname
['Hargrave' 'Hill' 'Onio' ... 'Kashiwagi' 'Aldridge' 'Burbidge']
Kolom : Geography
['France' 'Spain' 'Germany']
Kolom : Gender
['Female' 'Male']
# mengecek kolom format dengan fungsi nunique()
for kolom in turnover.columns:
if turnover[kolom].dtypes == 'object':
print("Kolom : " + kolom)
print(turnover[kolom].nunique())
Kolom : Surname
2932
Kolom : Geography
3
Kolom : Gender
2
for kolom in turnover.columns:
if turnover[kolom].dtypes == 'object':
print("Kolom : " + kolom)
print(turnover[kolom].value_counts())
Kolom : Surname
Surname
Smith 32
Scott 29
Martin 29
Walker 28
Brown 26
..
Izmailov 1
Bold 1
Bonham 1
Poninski 1
Burbidge 1
Name: count, Length: 2932, dtype: int64
Kolom : Geography
Geography
France 5014
Germany 2509
Spain 2477
Name: count, dtype: int64
Kolom : Gender
Gender
Male 5457
Female 4543
Name: count, dtype: int64
Dari hasil pengecekan, kolom mana saja yang harus diubah menjadi tipe data kategori?
Untuk mengubah tipe data ke categorical pada pandas, Anda dapat melakukannya dengan method astype()
berikut:
Formula
df['column_name'] = df['column_name'].astype('new_data_types')
Contoh
employees['marital_status'] = employees['marital_status'].astype('category')
# turnover['Tenure'].astype('object').value_counts().plot(kind ='bar')
# code here
# Mengubah tipe data integer (angka) atau object ke category (text atau string)
turnover[['Geography', 'Gender', 'HasCrCard', 'IsActiveMember', 'Exited']] = \
turnover[['Geography', 'Gender', 'HasCrCard', 'IsActiveMember', 'Exited']].astype('category')
# Buat manggil dataframe ke tipe data tertentu
turnover.select_dtypes(include='category')
Geography | Gender | HasCrCard | IsActiveMember | Exited | |
---|---|---|---|---|---|
RowNumber | |||||
1 | France | Female | 1 | 1 | 1 |
2 | Spain | Female | 0 | 1 | 0 |
3 | France | Female | 1 | 0 | 1 |
4 | France | Female | 0 | 0 | 0 |
5 | Spain | Female | 1 | 1 | 0 |
... | ... | ... | ... | ... | ... |
9996 | France | Male | 1 | 0 | 0 |
9997 | France | Male | 1 | 1 | 0 |
9998 | France | Female | 0 | 1 | 1 |
9999 | Germany | Male | 1 | 0 | 1 |
10000 | France | Female | 1 | 0 | 0 |
10000 rows × 5 columns
# cek tipe data
turnover.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10000 entries, 1 to 10000
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CustomerId 10000 non-null int64
1 Surname 10000 non-null object
2 CreditScore 10000 non-null int64
3 Geography 10000 non-null category
4 Gender 10000 non-null category
5 Age 10000 non-null int64
6 Tenure 10000 non-null int64
7 Balance 10000 non-null float64
8 NumOfProducts 10000 non-null int64
9 HasCrCard 10000 non-null category
10 IsActiveMember 10000 non-null category
11 EstimatedSalary 10000 non-null float64
12 Exited 10000 non-null category
dtypes: category(5), float64(2), int64(5), object(1)
memory usage: 752.6+ KB
Additional Information:
- Jika mengubah tipe data untuk angka tanpa koma kita bisa mengisi fungsi
.astype()
dengan int64 - Jika mengubah tipe data untuk angka dengan koma kita bisa mengisi fungsi
.astype()
dengan float64
3.5 Exploratory Data Analysis I
Exploratory Data Analysis (EDA) mengacu pada proses melakukan investigasi awal pada data, seringkali dengan tujuan untuk lebih mengenal dengan karakteristik data tertentu. EDA dilakukan dengan bantuan ringkasan statistik dan teknik grafis sederhana untuk melihat struktur data yang kita miliki.
Beberapa tools sederhana pada pandas
yang dapat digunakan untuk melakukan EDA adalah sebagai berikut:
.describe()
3.5.1 describe()
Method describe()
menampilkan 8 ringkasan statistika deskriptif. Secara default menampilkan ringkasan untuk kolom numerik.
Ringkasan statistika yang dimaksud adalah sebagai berikut:
- Count: banyaknya baris pada dataframe
- Mean: rata-rata nilai
- Standard Deviation: jarak rata-rata antara data ke mean (titik pusat data)
- Minimum Value: nilai terkecil dari keseluruhan data
- 25th Percentile (Q1)
- 50th Percentile (Q2/Median)
- 75th Percentile (Q3)
- Maximum Value: nilai terbesar dari keseluruhan data
turnover['CustomerId'] = turnover['CustomerId'].astype('object')
#code here
turnover.describe()
CreditScore | Age | Tenure | Balance | NumOfProducts | EstimatedSalary | |
---|---|---|---|---|---|---|
count | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 | 10000.000000 |
mean | 650.528800 | 38.921800 | 5.012800 | 76485.889288 | 1.530200 | 100090.239881 |
std | 96.653299 | 10.487806 | 2.892174 | 62397.405202 | 0.581654 | 57510.492818 |
min | 350.000000 | 18.000000 | 0.000000 | 0.000000 | 1.000000 | 11.580000 |
25% | 584.000000 | 32.000000 | 3.000000 | 0.000000 | 1.000000 | 51002.110000 |
50% | 652.000000 | 37.000000 | 5.000000 | 97198.540000 | 1.000000 | 100193.915000 |
75% | 718.000000 | 44.000000 | 7.000000 | 127644.240000 | 2.000000 | 149388.247500 |
max | 850.000000 | 92.000000 | 10.000000 | 250898.090000 | 4.000000 | 199992.480000 |
Additional Information
Kita bisa menambahkan parameter include
ataupun exclude
pada describe()
untuk melihat statistika deskriptif dari variable non-numeric:
Task 1: Melihat statistika deskriptif untuk kolom bertipe data selain angka menggunakan exclude = 'number'
#code here
turnover.describe(exclude ='number')
CustomerId | Surname | Geography | Gender | HasCrCard | IsActiveMember | Exited | |
---|---|---|---|---|---|---|---|
count | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 | 10000 |
unique | 10000 | 2932 | 3 | 2 | 2 | 2 | 2 |
top | 15634602 | Smith | France | Male | 1 | 1 | 0 |
freq | 1 | 32 | 5014 | 5457 | 7055 | 5151 | 7963 |
Task 2: Melihat statistika deskriptif untuk kolom bertipe data object menggunakan include = 'object'
#code here
turnover.describe(include =['object'])
CustomerId | Surname | |
---|---|---|
count | 10000 | 10000 |
unique | 10000 | 2932 |
top | 15634602 | Smith |
freq | 1 | 32 |
3.6 Indexing and Subsetting with Pandas
Indexing digunakan untuk memilih dan mengambil sebagian data yang hanya diperlukan dalam proses analisa data yang sedang dikerjakan. Contohnya:
- Compare sales pada tahun 2018 vs 2019
- Identifikasi peluang penjualan pada segment pasar (ex : Wholesale vs Retail)
- Melihat quarter terbaik untuk setiap tahun yang dapat digunakan untuk tujuan promosi
- dan sebagainya
3.6.1 select_dtypes()
Method select_dtypes()
digunakan untuk memilih kolom sesuai dengan tipe datanya. Ada 2 parameter yang dapat digunakan di dalam method select_dtypes()
yaitu parameter include
dan exclude
(seperti pada describe()
.
Misal:
- parameter
include = 'category'
artinya kita memilih semua kolom dengan tipe data ‘category’ - sebaliknya, ketika menggunakan parameter
exclude = 'category'
maka kolom-kolom dengan tipe data selain ‘category’ akan ditampilkan.
Task 1: Mengambil kolom dengan tipe data category
# code here
turnover.select_dtypes(include=['category'])
Geography | Gender | HasCrCard | IsActiveMember | Exited | |
---|---|---|---|---|---|
RowNumber | |||||
1 | France | Female | 1 | 1 | 1 |
2 | Spain | Female | 0 | 1 | 0 |
3 | France | Female | 1 | 0 | 1 |
4 | France | Female | 0 | 0 | 0 |
5 | Spain | Female | 1 | 1 | 0 |
... | ... | ... | ... | ... | ... |
9996 | France | Male | 1 | 0 | 0 |
9997 | France | Male | 1 | 1 | 0 |
9998 | France | Female | 0 | 1 | 1 |
9999 | Germany | Male | 1 | 0 | 1 |
10000 | France | Female | 1 | 0 | 0 |
10000 rows × 5 columns
Task 2: Mengambil kolom dengan tipe data number
# code here
turnover.select_dtypes(include=['number'])
CreditScore | Age | Tenure | Balance | NumOfProducts | EstimatedSalary | |
---|---|---|---|---|---|---|
RowNumber | ||||||
1 | 619 | 42 | 2 | 0.00 | 1 | 101348.88 |
2 | 608 | 41 | 1 | 83807.86 | 1 | 112542.58 |
3 | 502 | 42 | 8 | 159660.80 | 3 | 113931.57 |
4 | 699 | 39 | 1 | 0.00 | 2 | 93826.63 |
5 | 850 | 43 | 2 | 125510.82 | 1 | 79084.10 |
... | ... | ... | ... | ... | ... | ... |
9996 | 771 | 39 | 5 | 0.00 | 2 | 96270.64 |
9997 | 516 | 35 | 10 | 57369.61 | 1 | 101699.77 |
9998 | 709 | 36 | 7 | 0.00 | 1 | 42085.58 |
9999 | 772 | 42 | 3 | 75075.31 | 2 | 92888.52 |
10000 | 792 | 28 | 4 | 130142.79 | 1 | 38190.78 |
10000 rows × 6 columns
3.6.2 drop()
Method drop()
digunakan untuk membuang baris atau kolom yang tidak ingin digunakan untuk tujuan analisis.
- Untuk menghapus baris pada fungsi tersebut dapat kita isi dengan parameter
index =
- Untuk menghapus baris pada fungsi tersebut dapat kita isi dengan parameter
columns =
Task 1: Hapus baris pertama pada DataFrame!
kopi = turnover.copy()
kopi.head(5)
CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RowNumber | |||||||||||||
1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 |
2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 |
4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 |
5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 |
# code here
turnover.drop(index=[1]).head()
CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RowNumber | |||||||||||||
2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 |
4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 |
5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 |
6 | 15574012 | Chu | 645 | Spain | Male | 44 | 8 | 113755.78 | 2 | 1 | 0 | 149756.71 | 1 |
Task 2: Hapus baris ke-2 dan ke-3 pada data!
# code here
turnover.drop(index=[2,3]).head()
CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RowNumber | |||||||||||||
1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 |
4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 |
5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 |
6 | 15574012 | Chu | 645 | Spain | Male | 44 | 8 | 113755.78 | 2 | 1 | 0 | 149756.71 | 1 |
7 | 15592531 | Bartlett | 822 | France | Male | 50 | 7 | 0.00 | 2 | 1 | 1 | 10062.80 | 0 |
Task 3: Hapus kolom Age
dan Tenure
!
# code here
turnover.drop(columns=['Age','Tenure']).head()
CustomerId | Surname | CreditScore | Geography | Gender | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | |
---|---|---|---|---|---|---|---|---|---|---|---|
RowNumber | |||||||||||
1 | 15634602 | Hargrave | 619 | France | Female | 0.00 | 1 | 1 | 1 | 101348.88 | 1 |
2 | 15647311 | Hill | 608 | Spain | Female | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
3 | 15619304 | Onio | 502 | France | Female | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 |
4 | 15701354 | Boni | 699 | France | Female | 0.00 | 2 | 0 | 0 | 93826.63 | 0 |
5 | 15737888 | Mitchell | 850 | Spain | Female | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 |
turnover.drop(['Age', 'Tenure'], axis=1).head()
CustomerId | Surname | CreditScore | Geography | Gender | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | |
---|---|---|---|---|---|---|---|---|---|---|---|
RowNumber | |||||||||||
1 | 15634602 | Hargrave | 619 | France | Female | 0.00 | 1 | 1 | 1 | 101348.88 | 1 |
2 | 15647311 | Hill | 608 | Spain | Female | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
3 | 15619304 | Onio | 502 | France | Female | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 |
4 | 15701354 | Boni | 699 | France | Female | 0.00 | 2 | 0 | 0 | 93826.63 | 0 |
5 | 15737888 | Mitchell | 850 | Spain | Female | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 |
turnover.columns[0]
'CustomerId'
Task 4: Hapus baris ke 2 dan kolom CustomerId
!
# code here
turnover.drop(index=[2] ,columns=['CustomerId']).head()
Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
RowNumber | ||||||||||||
1 | Hargrave | 619 | France | Female | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 |
3 | Onio | 502 | France | Female | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 |
4 | Boni | 699 | France | Female | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 |
5 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 |
6 | Chu | 645 | Spain | Male | 44 | 8 | 113755.78 | 2 | 1 | 0 | 149756.71 | 1 |
NOTE: Method drop()
tidak mengubah objek dataframenya. Apabila ingin mengubah objek semulanya:
- Melakukan assignment kembali dengan nama objek yang sama:
turnover = turnover.drop(...)
, atau - Menambahkan parameter inplace:
turnover.drop(..., inplace=True)
3.6.3 Slicing: []
operator
Digunakan untuk melakukan subsetting dengan cara mengiris (slicing) index pada dataframe. Formula penulisannya adalah [start:end]
dengan mengikuti aturan indexing pada python (dimulai dari 0) dimana start
inclusive dan end
exclusive.
Task 1: Dengan menggunakan metode slicing, silahkan tampilkan baris ke 1 dan ke 2
- baris ke 1 memiliki urutan index 0
- baris ke 2 memiliki urutan index 1
# code here
turnover[0:2]
CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RowNumber | |||||||||||||
1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 |
2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
Task 2: Dengan menggunakan metode slicing, silahkan tampilkan baris ke 10 sampai ke 15
# code here
turnover[9:15]
CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RowNumber | |||||||||||||
10 | 15592389 | H? | 684 | France | Male | 27 | 2 | 134603.88 | 1 | 1 | 1 | 71725.73 | 0 |
11 | 15767821 | Bearce | 528 | France | Male | 31 | 6 | 102016.72 | 2 | 0 | 0 | 80181.12 | 0 |
12 | 15737173 | Andrews | 497 | Spain | Male | 24 | 3 | 0.00 | 2 | 1 | 0 | 76390.01 | 0 |
13 | 15632264 | Kay | 476 | France | Female | 34 | 10 | 0.00 | 2 | 1 | 0 | 26260.98 | 0 |
14 | 15691483 | Chin | 549 | France | Female | 25 | 5 | 0.00 | 2 | 0 | 0 | 190857.79 | 0 |
15 | 15600882 | Scott | 635 | Spain | Female | 35 | 7 | 0.00 | 2 | 1 | 1 | 65951.65 | 0 |
Task 3: Dengan menggunakan metode slicing, silahkan tampilkan baris ke 17 sampai ke 21
# code here
turnover[16:21]
CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RowNumber | |||||||||||||
17 | 15737452 | Romeo | 653 | Germany | Male | 58 | 1 | 132602.88 | 1 | 1 | 0 | 5097.67 | 1 |
18 | 15788218 | Henderson | 549 | Spain | Female | 24 | 9 | 0.00 | 2 | 1 | 1 | 14406.41 | 0 |
19 | 15661507 | Muldrow | 587 | Spain | Male | 45 | 6 | 0.00 | 1 | 0 | 0 | 158684.81 | 0 |
20 | 15568982 | Hao | 726 | France | Female | 24 | 6 | 0.00 | 2 | 1 | 1 | 54724.03 | 0 |
21 | 15577657 | McDonald | 732 | France | Male | 41 | 8 | 0.00 | 2 | 1 | 1 | 170886.17 | 0 |
3.7 .loc
dan .iloc
Dengan menggunakan .loc
dan iloc
kita dapat melakukan pengirisan pada index baris dan kolom.
Perbedaan yang mendasar dari kedua operator ini adalah:
.iloc
merujuk pada lokasi index baris atau kolomnya sehingga harus integer, sedangkan.loc
merujuk pada lokasi nama baris atau kolomnya
Mari berfokus pada .iloc terlebih dahulu
Syntax:
df.iloc[baris, kolom]
Task 1: Tampilkan baris dengan index 3 dan kolom dengan index 1, artinya adalah data pada baris ke empat dan kolom Surname
.
# code here
turnover.iloc[3, 1]
'Boni'
# Buat ngecek lokasi indeks kolom
turnover.columns.get_loc('Age')
5
Task 2: Mengambil baris ke 2 sampai 5 dan kolom dengan CustomerId
sampai Age
# code here
turnover.iloc[1:5, turnover.columns.get_loc('CustomerId'):turnover.columns.get_loc('Age')+1]
CustomerId | Surname | CreditScore | Geography | Gender | Age | |
---|---|---|---|---|---|---|
RowNumber | ||||||
2 | 15647311 | Hill | 608 | Spain | Female | 41 |
3 | 15619304 | Onio | 502 | France | Female | 42 |
4 | 15701354 | Boni | 699 | France | Female | 39 |
5 | 15737888 | Mitchell | 850 | Spain | Female | 43 |
Task 3: Menampilkan semua data pada kolom Surname
dan Exited
saja
# code here
turnover.iloc[::, [turnover.columns.get_loc('Surname'),turnover.columns.get_loc('Exited')]]
Surname | Exited | |
---|---|---|
RowNumber | ||
1 | Hargrave | 1 |
2 | Hill | 0 |
3 | Onio | 1 |
4 | Boni | 0 |
5 | Mitchell | 0 |
... | ... | ... |
9996 | Obijiaku | 0 |
9997 | Johnstone | 0 |
9998 | Liu | 1 |
9999 | Sabbatini | 1 |
10000 | Walker | 0 |
10000 rows × 2 columns
Mari kita pergi ke .loc
Syntax:
df.loc[baris, kolom]
Menggunakan .loc
, kita bisa mengambil baris dan kolom berdasarkan namanya.
Task 1: Mengambil baris ke 2 sampai 5 dan kolom dengan CustomerId
sampai Age
# code here
turnover.loc[2:5, 'CustomerId':'Age']
CustomerId | Surname | CreditScore | Geography | Gender | Age | |
---|---|---|---|---|---|---|
RowNumber | ||||||
2 | 15647311 | Hill | 608 | Spain | Female | 41 |
3 | 15619304 | Onio | 502 | France | Female | 42 |
4 | 15701354 | Boni | 699 | France | Female | 39 |
5 | 15737888 | Mitchell | 850 | Spain | Female | 43 |
Task 2: Menampilkan semua data pada kolom Surname
dan Exited
saja
%%time
# code here
turnover.loc[::, ['Surname','Exited']]
CPU times: total: 0 ns
Wall time: 998 µs
Surname | Exited | |
---|---|---|
RowNumber | ||
1 | Hargrave | 1 |
2 | Hill | 0 |
3 | Onio | 1 |
4 | Boni | 0 |
5 | Mitchell | 0 |
... | ... | ... |
9996 | Obijiaku | 0 |
9997 | Johnstone | 0 |
9998 | Liu | 1 |
9999 | Sabbatini | 1 |
10000 | Walker | 0 |
10000 rows × 2 columns
%%time
turnover[['Surname','Exited']][::]
CPU times: total: 0 ns
Wall time: 1.01 ms
Surname | Exited | |
---|---|---|
RowNumber | ||
1 | Hargrave | 1 |
2 | Hill | 0 |
3 | Onio | 1 |
4 | Boni | 0 |
5 | Mitchell | 0 |
... | ... | ... |
9996 | Obijiaku | 0 |
9997 | Johnstone | 0 |
9998 | Liu | 1 |
9999 | Sabbatini | 1 |
10000 | Walker | 0 |
10000 rows × 2 columns
3.8 Conditional Subsetting
Selain menggunakan .loc
dan .iloc
, kita dapat melakukan subsetting berdasarkan kondisi tertentu. Misal pada dataframe turnover
, kita ingin mengambil beberapa data dengan kondisi sebagai berikut:
- Customer yang memutuskan untuk churn:
.Exited == 1
- Customer yang memiliki balance di atas 200000:
.Balance >= 200000
- Customer dengan credit score sebesar 850:
.CreditScore != 0
Syntax penulisan untuk conditional subsetting adalah:
df[df['column_name'] <comparison_operator> <value>]
Contoh comparison_operator adalah seperti ==
, !=
, >
, >=
, <
, <=
.
Task 1: Tampilkan data customer yang memutuskan untuk churn
# code here
# turnover.loc[turnover['Exited'] == 1].loc[::,'Surname']
# atau
turnover[turnover['Exited'] == 1]
CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RowNumber | |||||||||||||
1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 |
3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 |
6 | 15574012 | Chu | 645 | Spain | Male | 44 | 8 | 113755.78 | 2 | 1 | 0 | 149756.71 | 1 |
8 | 15656148 | Obinna | 376 | Germany | Female | 29 | 4 | 115046.74 | 4 | 1 | 0 | 119346.88 | 1 |
17 | 15737452 | Romeo | 653 | Germany | Male | 58 | 1 | 132602.88 | 1 | 1 | 0 | 5097.67 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9982 | 15672754 | Burbidge | 498 | Germany | Male | 42 | 3 | 152039.70 | 1 | 1 | 1 | 53445.17 | 1 |
9983 | 15768163 | Griffin | 655 | Germany | Female | 46 | 7 | 137145.12 | 1 | 1 | 0 | 115146.40 | 1 |
9992 | 15769959 | Ajuluchukwu | 597 | France | Female | 53 | 4 | 88381.21 | 1 | 1 | 0 | 69384.71 | 1 |
9998 | 15584532 | Liu | 709 | France | Female | 36 | 7 | 0.00 | 1 | 0 | 1 | 42085.58 | 1 |
9999 | 15682355 | Sabbatini | 772 | Germany | Male | 42 | 3 | 75075.31 | 2 | 1 | 0 | 92888.52 | 1 |
2037 rows × 13 columns
Task 2: Tampilkan data customer yang memiliki balance di atas 200000
# code here
turnover[turnover['Balance'] > 200000]
CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RowNumber | |||||||||||||
139 | 15594408 | Chia | 584 | Spain | Female | 48 | 2 | 213146.20 | 1 | 1 | 0 | 75161.25 | 1 |
521 | 15671256 | Macartney | 850 | France | Female | 35 | 1 | 211774.31 | 1 | 1 | 0 | 188574.12 | 1 |
721 | 15721658 | Fleming | 672 | Spain | Female | 56 | 2 | 209767.31 | 2 | 1 | 1 | 150694.42 | 1 |
1068 | 15599131 | Dilke | 650 | Germany | Male | 26 | 4 | 214346.96 | 2 | 1 | 0 | 128815.33 | 0 |
1175 | 15588670 | Despeissis | 705 | Spain | Female | 40 | 5 | 203715.15 | 1 | 1 | 0 | 179978.68 | 1 |
1366 | 15689514 | Kang | 625 | France | Male | 43 | 8 | 201696.07 | 1 | 1 | 0 | 133020.90 | 1 |
1534 | 15769818 | Moore | 850 | France | Female | 37 | 3 | 212778.20 | 1 | 0 | 1 | 69372.88 | 0 |
2093 | 15757408 | Lo | 655 | Spain | Male | 38 | 3 | 250898.09 | 3 | 0 | 1 | 81054.00 | 1 |
2598 | 15668818 | Chidubem | 592 | Spain | Female | 40 | 2 | 200322.45 | 1 | 1 | 1 | 113244.73 | 0 |
2710 | 15780212 | Mao | 592 | France | Male | 37 | 4 | 212692.97 | 1 | 0 | 0 | 176395.02 | 0 |
3151 | 15795298 | Olisaemeka | 573 | Germany | Female | 35 | 9 | 206868.78 | 2 | 0 | 1 | 102986.15 | 0 |
3281 | 15715622 | To Rot | 583 | France | Female | 57 | 3 | 238387.56 | 1 | 0 | 1 | 147964.99 | 1 |
3589 | 15571958 | McIntosh | 489 | Spain | Male | 40 | 3 | 221532.80 | 1 | 1 | 0 | 171867.08 | 0 |
3921 | 15620268 | Thomson | 634 | Germany | Male | 43 | 3 | 212696.32 | 1 | 1 | 0 | 115268.86 | 0 |
4437 | 15664498 | Golovanov | 508 | France | Male | 26 | 7 | 205962.00 | 1 | 1 | 0 | 156424.40 | 0 |
4534 | 15607275 | Ch'ang | 850 | Spain | Male | 39 | 6 | 206014.94 | 2 | 0 | 1 | 42774.84 | 1 |
5255 | 15746664 | Ts'ui | 463 | Spain | Male | 20 | 8 | 204223.03 | 1 | 1 | 0 | 128268.39 | 0 |
5872 | 15709920 | Burke | 479 | France | Female | 33 | 2 | 208165.53 | 1 | 0 | 0 | 50774.81 | 1 |
6272 | 15620756 | Stokes | 747 | France | Male | 49 | 6 | 202904.64 | 1 | 1 | 1 | 17298.72 | 1 |
6498 | 15793688 | Bancks | 669 | France | Male | 50 | 9 | 201009.64 | 1 | 1 | 0 | 158032.50 | 1 |
6718 | 15586674 | Shaw | 663 | Spain | Female | 58 | 5 | 216109.88 | 1 | 0 | 1 | 74176.71 | 1 |
6914 | 15784180 | Ku | 564 | France | Female | 36 | 7 | 206329.65 | 1 | 1 | 1 | 46632.87 | 1 |
7354 | 15736420 | Macdonald | 596 | France | Male | 21 | 4 | 210433.08 | 2 | 0 | 1 | 197297.77 | 1 |
7493 | 15776545 | Napolitani | 682 | France | Male | 28 | 10 | 200724.96 | 1 | 0 | 1 | 82872.64 | 1 |
7633 | 15620570 | Sinnett | 736 | France | Male | 43 | 4 | 202443.47 | 1 | 1 | 0 | 72375.03 | 0 |
7888 | 15745433 | Conti | 716 | Germany | Female | 30 | 2 | 205770.78 | 2 | 0 | 0 | 65464.66 | 0 |
8028 | 15769412 | Atkinson | 684 | Spain | Male | 39 | 4 | 207034.96 | 2 | 0 | 0 | 157694.76 | 1 |
8064 | 15663888 | Connor | 549 | Germany | Male | 34 | 6 | 204017.40 | 2 | 1 | 0 | 109538.35 | 0 |
8703 | 15690589 | Udinesi | 541 | France | Male | 37 | 9 | 212314.03 | 1 | 0 | 1 | 148814.54 | 0 |
8734 | 15714241 | Haddon | 749 | Spain | Male | 42 | 9 | 222267.63 | 1 | 0 | 0 | 101108.85 | 1 |
8795 | 15578671 | Webb | 706 | Spain | Female | 29 | 1 | 209490.21 | 1 | 1 | 1 | 133267.69 | 1 |
8983 | 15627971 | Coates | 504 | France | Female | 32 | 8 | 206663.75 | 1 | 0 | 0 | 16281.94 | 0 |
9834 | 15807245 | McKay | 699 | Germany | Female | 41 | 1 | 200117.76 | 2 | 1 | 0 | 94142.35 | 0 |
9921 | 15673020 | Smith | 678 | France | Female | 49 | 3 | 204510.94 | 1 | 0 | 1 | 738.88 | 1 |
Task 3: Tampilkan data customer dengan credit score sebesar 850
# code here
turnover[turnover['CreditScore'] == 850]
CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RowNumber | |||||||||||||
5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 |
39 | 15717426 | Armstrong | 850 | France | Male | 36 | 7 | 0.00 | 1 | 1 | 1 | 40812.90 | 0 |
181 | 15716334 | Rozier | 850 | Spain | Female | 45 | 2 | 122311.21 | 1 | 1 | 1 | 19482.50 | 0 |
201 | 15604482 | Chiemezie | 850 | Spain | Male | 30 | 2 | 141040.01 | 1 | 1 | 1 | 5978.20 | 0 |
224 | 15733247 | Stevenson | 850 | France | Male | 33 | 10 | 0.00 | 1 | 1 | 0 | 4861.72 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9582 | 15709256 | Glover | 850 | France | Female | 28 | 9 | 0.00 | 2 | 1 | 1 | 164864.67 | 0 |
9647 | 15603111 | Muir | 850 | Spain | Male | 71 | 10 | 69608.14 | 1 | 1 | 0 | 97893.40 | 1 |
9689 | 15730579 | Ward | 850 | France | Male | 68 | 5 | 169445.40 | 1 | 1 | 1 | 186335.07 | 0 |
9932 | 15647800 | Greco | 850 | France | Female | 34 | 6 | 101266.51 | 1 | 1 | 0 | 33501.98 | 0 |
9950 | 15798615 | Wan | 850 | France | Female | 47 | 9 | 137301.87 | 1 | 1 | 0 | 44351.77 | 0 |
233 rows × 13 columns
Additional Information:
Kita juga dapat menggunakan operator &
(AND) dan |
(OR) untuk melakukan subsetting lebih dari 1 kondisi. Misalnya kita ingin melihat data penjualan dari seorang pegawai bernama Moana yang jumlahnya lebih dari 5000, maka kita dapat menggunakan syntax:
sales[(sales.salesperson == 'Moana') & (sales.amount > 5000)]
Untuk subsetting dengan kondisi lebih dari 1, setiap kondisi diletakkan di dalam tanda kurung ()
atau bisa ditulis dengan syntax berikut:
df[(kondisi pertama) operator (kondisi kedua) operator (kondisi ketiga) dan seterusnya...]
Poin:
- Operator AND: harus semua kondisi terpenuhi dalam satu baris agar muncul
- Operator OR: salah satu kondisi saja sudah terpenuhi maka baris tsb muncul
Task 1: Lakukan subsetting untuk mengambil semua informasi customer yang memiliki credit score di atas 700 dan memutuskan untuk churn
#code here
turnover[(turnover['CreditScore'] > 700) & (turnover['Exited'] == 1)]
CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RowNumber | |||||||||||||
44 | 15755196 | Lavine | 834 | France | Female | 49 | 2 | 131394.56 | 1 | 0 | 0 | 194365.76 | 1 |
47 | 15602280 | Martin | 829 | Germany | Female | 27 | 9 | 112045.67 | 1 | 1 | 1 | 119708.21 | 1 |
71 | 15703793 | Konovalova | 738 | Germany | Male | 58 | 2 | 133745.44 | 4 | 1 | 0 | 28373.86 | 1 |
82 | 15663706 | Leonard | 777 | France | Female | 32 | 2 | 0.00 | 1 | 1 | 0 | 136458.19 | 1 |
87 | 15762418 | Gant | 750 | Spain | Male | 22 | 3 | 121681.82 | 1 | 1 | 0 | 128643.35 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9885 | 15686974 | Sergeyeva | 751 | France | Female | 48 | 4 | 0.00 | 1 | 0 | 1 | 30165.06 | 1 |
9961 | 15681026 | Lucciano | 795 | Germany | Female | 33 | 9 | 104552.72 | 1 | 1 | 1 | 120853.83 | 1 |
9963 | 15594612 | Flynn | 702 | Spain | Male | 44 | 9 | 0.00 | 1 | 0 | 0 | 59207.41 | 1 |
9998 | 15584532 | Liu | 709 | France | Female | 36 | 7 | 0.00 | 1 | 0 | 1 | 42085.58 | 1 |
9999 | 15682355 | Sabbatini | 772 | Germany | Male | 42 | 3 | 75075.31 | 2 | 1 | 0 | 92888.52 | 1 |
619 rows × 13 columns
Task 2: Tampilkan informasi customer perempuan yang berdomisili di Jerman dan Spanyol
%%time
# code here
# Buat ngecek list dalam list
turnover[(turnover['Geography'].isin(['Germany', 'Spain'])) & (turnover['Gender'] == 'Female')]
CPU times: total: 0 ns
Wall time: 2.98 ms
CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RowNumber | |||||||||||||
2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 |
8 | 15656148 | Obinna | 376 | Germany | Female | 29 | 4 | 115046.74 | 4 | 1 | 0 | 119346.88 | 1 |
15 | 15600882 | Scott | 635 | Spain | Female | 35 | 7 | 0.00 | 2 | 1 | 1 | 65951.65 | 0 |
18 | 15788218 | Henderson | 549 | Spain | Female | 24 | 9 | 0.00 | 2 | 1 | 1 | 14406.41 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9951 | 15638494 | Salinas | 625 | Germany | Female | 39 | 10 | 129845.26 | 1 | 1 | 1 | 96444.88 | 0 |
9961 | 15681026 | Lucciano | 795 | Germany | Female | 33 | 9 | 104552.72 | 1 | 1 | 1 | 120853.83 | 1 |
9966 | 15690164 | Shao | 627 | Germany | Female | 33 | 4 | 83199.05 | 1 | 0 | 0 | 159334.93 | 0 |
9969 | 15733491 | McGregor | 512 | Germany | Female | 40 | 8 | 153537.57 | 2 | 0 | 0 | 23101.13 | 0 |
9983 | 15768163 | Griffin | 655 | Germany | Female | 46 | 7 | 137145.12 | 1 | 1 | 0 | 115146.40 | 1 |
2282 rows × 13 columns
# Kunci Jawaban
turnover[(turnover['CreditScore'] > turnover['CreditScore'].quantile(0.75)) \
& (turnover['Balance'] <= turnover['Balance'].quantile(0.25)) \
& (turnover['HasCrCard'] == 0) \
& (turnover['EstimatedSalary'] > turnover['EstimatedSalary'].mean()) \
& (turnover['Geography'] == turnover['Geography'].value_counts().tail(1).index[0])]
CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RowNumber | |||||||||||||
218 | 15786308 | Millar | 730 | Spain | Female | 33 | 9 | 0.0 | 2 | 0 | 0 | 176576.62 | 0 |
403 | 15781589 | Carpenter | 751 | Spain | Male | 52 | 8 | 0.0 | 2 | 0 | 1 | 179291.85 | 0 |
1493 | 15744517 | Esposito | 735 | Spain | Male | 50 | 9 | 0.0 | 1 | 0 | 0 | 166677.35 | 1 |
1552 | 15749177 | Maslow | 730 | Spain | Female | 52 | 7 | 0.0 | 2 | 0 | 1 | 122398.84 | 0 |
1782 | 15771636 | Marshall | 793 | Spain | Female | 36 | 0 | 0.0 | 1 | 0 | 0 | 148993.47 | 0 |
1897 | 15604588 | Li Fonti | 850 | Spain | Female | 38 | 3 | 0.0 | 2 | 0 | 1 | 179360.76 | 0 |
1991 | 15775803 | Cawker | 841 | Spain | Male | 41 | 1 | 0.0 | 2 | 0 | 1 | 193093.77 | 0 |
2196 | 15735246 | Norman | 798 | Spain | Female | 58 | 9 | 0.0 | 2 | 0 | 0 | 119071.56 | 1 |
2388 | 15595588 | Chukwunonso | 773 | Spain | Female | 39 | 4 | 0.0 | 2 | 0 | 1 | 182081.45 | 0 |
2630 | 15711789 | Davey | 768 | Spain | Female | 42 | 3 | 0.0 | 1 | 0 | 0 | 161242.99 | 1 |
2657 | 15713267 | Zimmer | 779 | Spain | Female | 34 | 5 | 0.0 | 2 | 0 | 1 | 111676.63 | 0 |
2672 | 15611105 | Castella | 799 | Spain | Male | 35 | 7 | 0.0 | 2 | 0 | 1 | 140780.80 | 0 |
4082 | 15762821 | Udinese | 721 | Spain | Male | 33 | 5 | 0.0 | 2 | 0 | 1 | 117626.90 | 0 |
4106 | 15701392 | Lucciano | 815 | Spain | Male | 28 | 6 | 0.0 | 2 | 0 | 1 | 185547.71 | 0 |
4701 | 15773709 | Hung | 838 | Spain | Male | 35 | 0 | 0.0 | 2 | 0 | 1 | 197305.91 | 0 |
4779 | 15582246 | Rowe | 737 | Spain | Female | 45 | 2 | 0.0 | 2 | 0 | 1 | 177695.67 | 0 |
4907 | 15571244 | Tung | 809 | Spain | Female | 33 | 3 | 0.0 | 2 | 0 | 1 | 141426.78 | 0 |
5015 | 15773731 | John | 758 | Spain | Female | 35 | 5 | 0.0 | 2 | 0 | 0 | 100365.51 | 0 |
5041 | 15749727 | Chukwufumnanya | 829 | Spain | Male | 50 | 7 | 0.0 | 2 | 0 | 1 | 178458.86 | 0 |
5128 | 15644796 | Dyer | 821 | Spain | Female | 38 | 8 | 0.0 | 2 | 0 | 1 | 126241.40 | 1 |
5754 | 15608328 | Sutherland | 760 | Spain | Female | 41 | 6 | 0.0 | 2 | 0 | 0 | 101491.23 | 0 |
6020 | 15697045 | Pisani | 726 | Spain | Female | 35 | 9 | 0.0 | 2 | 0 | 1 | 100556.98 | 0 |
6419 | 15801924 | Browne | 754 | Spain | Female | 27 | 8 | 0.0 | 2 | 0 | 0 | 121821.16 | 0 |
6673 | 15660403 | Fleming | 827 | Spain | Female | 35 | 0 | 0.0 | 2 | 0 | 1 | 184514.01 | 0 |
7055 | 15682860 | Lo | 769 | Spain | Male | 38 | 6 | 0.0 | 2 | 0 | 0 | 104393.78 | 0 |
7524 | 15733602 | Rubin | 814 | Spain | Female | 72 | 2 | 0.0 | 2 | 0 | 1 | 130853.03 | 0 |
7740 | 15689952 | Zuyeva | 724 | Spain | Male | 41 | 5 | 0.0 | 1 | 0 | 1 | 115753.94 | 0 |
8374 | 15785167 | Padovano | 795 | Spain | Male | 29 | 4 | 0.0 | 2 | 0 | 0 | 155711.64 | 0 |
8605 | 15646942 | Meng | 786 | Spain | Female | 39 | 7 | 0.0 | 2 | 0 | 0 | 100929.59 | 0 |
8707 | 15717770 | Marcelo | 850 | Spain | Female | 55 | 7 | 0.0 | 1 | 0 | 0 | 171762.87 | 1 |
8859 | 15668009 | Hendley | 747 | Spain | Male | 37 | 1 | 0.0 | 2 | 0 | 1 | 180551.76 | 0 |
9126 | 15604138 | Iheanacho | 749 | Spain | Male | 34 | 2 | 0.0 | 1 | 0 | 0 | 174189.04 | 1 |
9204 | 15774401 | Chambers | 773 | Spain | Male | 51 | 4 | 0.0 | 2 | 0 | 0 | 123587.83 | 1 |
9706 | 15572374 | Hopetoun | 733 | Spain | Male | 36 | 1 | 0.0 | 2 | 0 | 1 | 108377.82 | 0 |
# Jawaban Kelompok 1
turnover[(turnover['Surname'].str.lower().apply(lambda x:x[0]).isin(['a', 'i', 'u', 'e', 'o'])) \
& (turnover['CreditScore'] < turnover['CreditScore'].mean()) \
& (turnover['Balance'] > turnover['Balance'].quantile(0.75)) \
& (turnover['Exited'] == 1)].sort_values(by=['Age'], ascending=False)
CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RowNumber | |||||||||||||
400 | 15646372 | Outhwaite | 616 | France | Female | 66 | 1 | 135842.41 | 1 | 1 | 0 | 183840.51 | 1 |
1390 | 15684196 | Aitken | 627 | France | Female | 55 | 2 | 159441.27 | 1 | 1 | 0 | 100686.11 | 1 |
7809 | 15649033 | Echezonachukwu | 603 | Germany | Female | 55 | 7 | 127723.25 | 2 | 1 | 0 | 139469.11 | 1 |
1632 | 15685372 | Azubuike | 350 | Spain | Male | 54 | 1 | 152677.48 | 1 | 1 | 1 | 191973.49 | 1 |
1952 | 15589793 | Onwuamaeze | 604 | France | Male | 53 | 8 | 144453.75 | 1 | 1 | 0 | 190998.96 | 1 |
1963 | 15692416 | Aikenhead | 358 | Spain | Female | 52 | 8 | 143542.36 | 3 | 1 | 0 | 141959.11 | 1 |
6602 | 15655213 | Udinese | 591 | Germany | Female | 51 | 8 | 132508.30 | 1 | 1 | 1 | 161304.68 | 1 |
857 | 15693864 | Iheanacho | 567 | Germany | Female | 49 | 5 | 134956.02 | 1 | 1 | 0 | 93953.84 | 1 |
1021 | 15654673 | Onyinyechukwuka | 625 | France | Male | 49 | 6 | 173434.90 | 1 | 1 | 0 | 165580.93 | 1 |
868 | 15756804 | O'Loghlen | 636 | France | Female | 48 | 1 | 170833.46 | 1 | 1 | 0 | 110510.28 | 1 |
3616 | 15639357 | Allan | 415 | France | Male | 46 | 9 | 134950.19 | 3 | 0 | 0 | 178587.36 | 1 |
4625 | 15710543 | Okwuoma | 629 | France | Male | 46 | 1 | 130666.20 | 1 | 1 | 1 | 161125.67 | 1 |
4672 | 15808674 | Ejikemeifeuwa | 616 | Germany | Female | 45 | 6 | 128352.59 | 3 | 1 | 1 | 144000.59 | 1 |
3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 |
8505 | 15743245 | Agafonova | 624 | France | Male | 42 | 3 | 145155.37 | 1 | 1 | 0 | 72169.95 | 1 |
1750 | 15703820 | Endrizzi | 552 | France | Male | 42 | 9 | 133701.07 | 2 | 1 | 0 | 101069.71 | 1 |
2382 | 15774151 | Iadanza | 614 | Spain | Female | 41 | 7 | 179915.85 | 1 | 0 | 0 | 14666.35 | 1 |
2478 | 15760294 | Endrizzi | 512 | France | Female | 41 | 8 | 145150.28 | 1 | 1 | 0 | 64869.32 | 1 |
9637 | 15613048 | Anderson | 648 | Germany | Female | 40 | 5 | 139973.65 | 1 | 1 | 1 | 667.66 | 1 |
48 | 15771573 | Okagbue | 637 | Germany | Female | 39 | 9 | 137843.80 | 1 | 1 | 1 | 117622.80 | 1 |
1043 | 15593969 | Abramovich | 630 | Spain | Female | 39 | 7 | 135483.17 | 1 | 1 | 0 | 140881.20 | 1 |
8831 | 15810444 | Aksenov | 562 | Germany | Female | 39 | 6 | 130565.02 | 1 | 1 | 0 | 9854.72 | 1 |
6440 | 15583371 | Artemiev | 632 | Spain | Male | 37 | 1 | 138207.08 | 1 | 1 | 0 | 60778.11 | 1 |
3400 | 15633352 | Okwukwe | 628 | France | Female | 31 | 6 | 175443.75 | 1 | 1 | 0 | 113167.17 | 1 |
1451 | 15676242 | Artemova | 632 | Spain | Male | 31 | 3 | 136556.44 | 1 | 1 | 0 | 82152.83 | 1 |
2580 | 15597896 | Ozoemena | 365 | Germany | Male | 30 | 0 | 127760.07 | 1 | 1 | 0 | 81537.85 | 1 |
5904 | 15677317 | Ankudinova | 570 | France | Female | 29 | 4 | 153040.03 | 1 | 1 | 1 | 131363.57 | 1 |
turnover[(turnover['CreditScore'] > turnover['CreditScore'].mean()) & (turnover['Balance'] < turnover['Balance'].mean())]
CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RowNumber | |||||||||||||
4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 |
7 | 15592531 | Bartlett | 822 | France | Male | 50 | 7 | 0.00 | 2 | 1 | 1 | 10062.80 | 0 |
20 | 15568982 | Hao | 726 | France | Female | 24 | 6 | 0.00 | 2 | 1 | 1 | 54724.03 | 0 |
21 | 15577657 | McDonald | 732 | France | Male | 41 | 8 | 0.00 | 2 | 1 | 1 | 170886.17 | 0 |
24 | 15725737 | Mosman | 669 | France | Male | 46 | 3 | 0.00 | 2 | 0 | 1 | 8487.75 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9993 | 15657105 | Chukwualuka | 726 | Spain | Male | 36 | 2 | 0.00 | 1 | 1 | 0 | 195192.40 | 0 |
9995 | 15719294 | Wood | 800 | France | Female | 29 | 2 | 0.00 | 2 | 0 | 0 | 167773.55 | 0 |
9996 | 15606229 | Obijiaku | 771 | France | Male | 39 | 5 | 0.00 | 2 | 1 | 0 | 96270.64 | 0 |
9998 | 15584532 | Liu | 709 | France | Female | 36 | 7 | 0.00 | 1 | 0 | 1 | 42085.58 | 1 |
9999 | 15682355 | Sabbatini | 772 | Germany | Male | 42 | 3 | 75075.31 | 2 | 1 | 0 | 92888.52 | 1 |
2052 rows × 13 columns
%%time
# code here
turnover[((turnover['Geography'] == 'Germany') | (turnover['Geography'] == 'Spain')) & (turnover['Gender'] == 'Female')]
CPU times: total: 0 ns
Wall time: 2.99 ms
CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RowNumber | |||||||||||||
2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 |
8 | 15656148 | Obinna | 376 | Germany | Female | 29 | 4 | 115046.74 | 4 | 1 | 0 | 119346.88 | 1 |
15 | 15600882 | Scott | 635 | Spain | Female | 35 | 7 | 0.00 | 2 | 1 | 1 | 65951.65 | 0 |
18 | 15788218 | Henderson | 549 | Spain | Female | 24 | 9 | 0.00 | 2 | 1 | 1 | 14406.41 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
9951 | 15638494 | Salinas | 625 | Germany | Female | 39 | 10 | 129845.26 | 1 | 1 | 1 | 96444.88 | 0 |
9961 | 15681026 | Lucciano | 795 | Germany | Female | 33 | 9 | 104552.72 | 1 | 1 | 1 | 120853.83 | 1 |
9966 | 15690164 | Shao | 627 | Germany | Female | 33 | 4 | 83199.05 | 1 | 0 | 0 | 159334.93 | 0 |
9969 | 15733491 | McGregor | 512 | Germany | Female | 40 | 8 | 153537.57 | 2 | 0 | 0 | 23101.13 | 0 |
9983 | 15768163 | Griffin | 655 | Germany | Female | 46 | 7 | 137145.12 | 1 | 1 | 0 | 115146.40 | 1 |
2282 rows × 13 columns
Leave a comment