Pandas Exercise 5 : Merge
The continuity of my practice on Pandas exercise from guisapmora.
Fictitious Names Dataset
Introduction:
This time you will create a data again
Special thanks to Chris Albon for sharing the dataset and materials. All the credits to this exercise belongs to him.
In order to understand about it go here.
Step 1. Import the necessary libraries
import pandas as pd
Step 2. Create the 3 DataFrames based on the following raw data
raw_data_1 = {
'subject_id': ['1', '2', '3', '4', '5'],
'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}
raw_data_2 = {
'subject_id': ['4', '5', '6', '7', '8'],
'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}
raw_data_3 = {
'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}
Step 3. Assign each to a variable called data1, data2, data3
data1 = pd.DataFrame(raw_data_1)
data2 = pd.DataFrame(raw_data_2)
data3 = pd.DataFrame(raw_data_3)
Step 4. Join the two dataframes along rows and assign all_data
all_data = pd.concat((data1, data2), axis=0)
Step 5. Join the two dataframes along columns and assing to all_data_col
all_data_col = pd.concat((data1, data2), axis=1)
Step 6. Print data3
data3
subject_id | test_id | |
---|---|---|
0 | 1 | 51 |
1 | 2 | 15 |
2 | 3 | 15 |
3 | 4 | 61 |
4 | 5 | 16 |
5 | 7 | 14 |
6 | 8 | 15 |
7 | 9 | 1 |
8 | 10 | 61 |
9 | 11 | 16 |
Step 7. Merge all_data and data3 along the subject_id value
all_data.merge(data3, on=['subject_id'])
subject_id | first_name | last_name | test_id | |
---|---|---|---|---|
0 | 1 | Alex | Anderson | 51 |
1 | 2 | Amy | Ackerman | 15 |
2 | 3 | Allen | Ali | 15 |
3 | 4 | Alice | Aoni | 61 |
4 | 4 | Billy | Bonder | 61 |
5 | 5 | Ayoung | Atiches | 16 |
6 | 5 | Brian | Black | 16 |
7 | 7 | Bryce | Brice | 14 |
8 | 8 | Betty | Btisan | 15 |
Step 8. Merge only the data that has the same ‘subject_id’ on both data1 and data2
data1.merge(data2, on=['subject_id'], how='inner')
subject_id | first_name_x | last_name_x | first_name_y | last_name_y | |
---|---|---|---|---|---|
0 | 4 | Alice | Aoni | Billy | Bonder |
1 | 5 | Ayoung | Atiches | Brian | Black |
Step 9. Merge all values in data1 and data2, with matching records from both sides where available.
data1.merge(data2, on=['subject_id'], how='outer')
subject_id | first_name_x | last_name_x | first_name_y | last_name_y | |
---|---|---|---|---|---|
0 | 1 | Alex | Anderson | NaN | NaN |
1 | 2 | Amy | Ackerman | NaN | NaN |
2 | 3 | Allen | Ali | NaN | NaN |
3 | 4 | Alice | Aoni | Billy | Bonder |
4 | 5 | Ayoung | Atiches | Brian | Black |
5 | 6 | NaN | NaN | Bran | Balwner |
6 | 7 | NaN | NaN | Bryce | Brice |
7 | 8 | NaN | NaN | Betty | Btisan |
MPG Dataset
Introduction:
The following exercise utilizes data from UC Irvine Machine Learning Repository
Step 1. Import the necessary libraries
import pandas as pd
Step 2. Import the first dataset cars1 and cars2.
### Step 3. Assign each to a variable called cars1 and cars2
url1 = 'https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/05_Merge/Auto_MPG/cars1.csv'
url2 = 'https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/05_Merge/Auto_MPG/cars2.csv'
cars1 = pd.read_csv(url1)
cars2 = pd.read_csv(url2)
Step 4. Oops, it seems our first dataset has some unnamed blank columns, fix cars1
cars1 = cars1.dropna(how ='all', axis=1)
Step 5. What is the number of observations in each dataset?
print(cars1.shape[0])
print(cars2.shape[0])
198
200
Step 6. Join cars1 and cars2 into a single DataFrame called cars
cars = pd.concat((cars1, cars1), axis=0)
cars.columns
Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
'acceleration', 'model', 'origin', 'car'],
dtype='object')
Step 7. Oops, there is a column missing, called owners. Create a random number Series from 15,000 to 73,000.
import numpy as np
owners = np.random.randint(low=15000, high=73000, size= cars.shape[0])
Step 8. Add the column owners to cars
cars['owner'] = owners
Leave a comment