Pandas Exercise 5 : Merge

3 minute read

The continuity of my practice on Pandas exercise from guisapmora.

Fictitious Names Dataset

Introduction:

This time you will create a data again

Special thanks to Chris Albon for sharing the dataset and materials. All the credits to this exercise belongs to him.

In order to understand about it go here.

Step 1. Import the necessary libraries

import pandas as pd

Step 2. Create the 3 DataFrames based on the following raw data

raw_data_1 = {
        'subject_id': ['1', '2', '3', '4', '5'],
        'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'], 
        'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}

raw_data_2 = {
        'subject_id': ['4', '5', '6', '7', '8'],
        'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'], 
        'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}

raw_data_3 = {
        'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
        'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}

Step 3. Assign each to a variable called data1, data2, data3

data1 = pd.DataFrame(raw_data_1)
data2 = pd.DataFrame(raw_data_2)
data3 = pd.DataFrame(raw_data_3)

Step 4. Join the two dataframes along rows and assign all_data

all_data = pd.concat((data1, data2), axis=0)

Step 5. Join the two dataframes along columns and assing to all_data_col

all_data_col = pd.concat((data1, data2), axis=1)

Step 6. Print data3

data3
subject_id test_id
0 1 51
1 2 15
2 3 15
3 4 61
4 5 16
5 7 14
6 8 15
7 9 1
8 10 61
9 11 16

Step 7. Merge all_data and data3 along the subject_id value

all_data.merge(data3, on=['subject_id'])
subject_id first_name last_name test_id
0 1 Alex Anderson 51
1 2 Amy Ackerman 15
2 3 Allen Ali 15
3 4 Alice Aoni 61
4 4 Billy Bonder 61
5 5 Ayoung Atiches 16
6 5 Brian Black 16
7 7 Bryce Brice 14
8 8 Betty Btisan 15

Step 8. Merge only the data that has the same ‘subject_id’ on both data1 and data2

data1.merge(data2, on=['subject_id'], how='inner')
subject_id first_name_x last_name_x first_name_y last_name_y
0 4 Alice Aoni Billy Bonder
1 5 Ayoung Atiches Brian Black

Step 9. Merge all values in data1 and data2, with matching records from both sides where available.

data1.merge(data2, on=['subject_id'], how='outer')
subject_id first_name_x last_name_x first_name_y last_name_y
0 1 Alex Anderson NaN NaN
1 2 Amy Ackerman NaN NaN
2 3 Allen Ali NaN NaN
3 4 Alice Aoni Billy Bonder
4 5 Ayoung Atiches Brian Black
5 6 NaN NaN Bran Balwner
6 7 NaN NaN Bryce Brice
7 8 NaN NaN Betty Btisan

MPG Dataset

Introduction:

The following exercise utilizes data from UC Irvine Machine Learning Repository

Step 1. Import the necessary libraries

import pandas as pd

Step 2. Import the first dataset cars1 and cars2.

### Step 3. Assign each to a variable called cars1 and cars2

url1 = 'https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/05_Merge/Auto_MPG/cars1.csv'
url2 = 'https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/05_Merge/Auto_MPG/cars2.csv'

cars1 = pd.read_csv(url1)
cars2 = pd.read_csv(url2)

Step 4. Oops, it seems our first dataset has some unnamed blank columns, fix cars1

cars1 = cars1.dropna(how ='all', axis=1)

Step 5. What is the number of observations in each dataset?

print(cars1.shape[0])
print(cars2.shape[0])
198
200

Step 6. Join cars1 and cars2 into a single DataFrame called cars

cars = pd.concat((cars1, cars1), axis=0)
cars.columns
Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'model', 'origin', 'car'],
      dtype='object')

Step 7. Oops, there is a column missing, called owners. Create a random number Series from 15,000 to 73,000.

import numpy as np

owners = np.random.randint(low=15000, high=73000, size= cars.shape[0])

Step 8. Add the column owners to cars

cars['owner'] = owners

Leave a comment