Pandas Exercise 2 : Filtering and Sorting

14 minute read

The continuity of my practice on Pandas exercise from guisapmora.

Chipotle dataset

This time we are going to pull data directly from the internet. Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

Step 1. Import the necessary libraries

import pandas as pd

Step 2. Import the dataset from this address.

Step 3. Assign it to a variable called chipo.

url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv'
chipo = pd.read_csv(url, sep='\t')

Step 4. How many products cost more than $10.00?

chipo['item_price'] = chipo['item_price'].apply(lambda x : x[1:]).astype('float').copy()
chipo[chipo['item_price'] > 10]['item_name'].value_counts().count()
31

Step 5. What is the price of each item?

chipo.groupby(['item_name']).mean().reset_index()[['item_name', 'item_price']]
item_name item_price
0 6 Pack Soft Drink 6.610185
1 Barbacoa Bowl 10.187273
2 Barbacoa Burrito 9.832418
3 Barbacoa Crispy Tacos 10.928182
4 Barbacoa Salad Bowl 10.640000
5 Barbacoa Soft Tacos 10.018400
6 Bottled Water 1.867654
7 Bowl 14.800000
8 Burrito 7.400000
9 Canned Soda 1.320577
10 Canned Soft Drink 1.457641
11 Carnitas Bowl 10.833971
12 Carnitas Burrito 10.132712
13 Carnitas Crispy Tacos 11.137143
14 Carnitas Salad 8.990000
15 Carnitas Salad Bowl 11.056667
16 Carnitas Soft Tacos 9.398500
17 Chicken Bowl 10.113953
18 Chicken Burrito 10.082857
19 Chicken Crispy Tacos 10.045319
20 Chicken Salad 9.010000
21 Chicken Salad Bowl 11.170455
22 Chicken Soft Tacos 9.635565
23 Chips 2.342844
24 Chips and Fresh Tomato Salsa 3.285091
25 Chips and Guacamole 4.595073
26 Chips and Mild Fresh Tomato Salsa 3.000000
27 Chips and Roasted Chili Corn Salsa 3.084091
28 Chips and Roasted Chili-Corn Salsa 2.390000
29 Chips and Tomatillo Green Chili Salsa 3.087209
30 Chips and Tomatillo Red Chili Salsa 3.072917
31 Chips and Tomatillo-Green Chili Salsa 2.544194
32 Chips and Tomatillo-Red Chili Salsa 2.987500
33 Crispy Tacos 7.400000
34 Izze 3.390000
35 Nantucket Nectar 3.641111
36 Salad 7.400000
37 Side of Chips 1.840594
38 Steak Bowl 10.711801
39 Steak Burrito 10.465842
40 Steak Crispy Tacos 10.209714
41 Steak Salad 8.915000
42 Steak Salad Bowl 11.847931
43 Steak Soft Tacos 9.746364
44 Veggie Bowl 10.211647
45 Veggie Burrito 9.839684
46 Veggie Crispy Tacos 8.490000
47 Veggie Salad 8.490000
48 Veggie Salad Bowl 10.138889
49 Veggie Soft Tacos 10.565714

Step 6. Sort by the name of the item

chipo.groupby(['item_name']).mean().sort_index().reset_index()[['item_name', 'item_price']]
item_name item_price
0 6 Pack Soft Drink 6.610185
1 Barbacoa Bowl 10.187273
2 Barbacoa Burrito 9.832418
3 Barbacoa Crispy Tacos 10.928182
4 Barbacoa Salad Bowl 10.640000
5 Barbacoa Soft Tacos 10.018400
6 Bottled Water 1.867654
7 Bowl 14.800000
8 Burrito 7.400000
9 Canned Soda 1.320577
10 Canned Soft Drink 1.457641
11 Carnitas Bowl 10.833971
12 Carnitas Burrito 10.132712
13 Carnitas Crispy Tacos 11.137143
14 Carnitas Salad 8.990000
15 Carnitas Salad Bowl 11.056667
16 Carnitas Soft Tacos 9.398500
17 Chicken Bowl 10.113953
18 Chicken Burrito 10.082857
19 Chicken Crispy Tacos 10.045319
20 Chicken Salad 9.010000
21 Chicken Salad Bowl 11.170455
22 Chicken Soft Tacos 9.635565
23 Chips 2.342844
24 Chips and Fresh Tomato Salsa 3.285091
25 Chips and Guacamole 4.595073
26 Chips and Mild Fresh Tomato Salsa 3.000000
27 Chips and Roasted Chili Corn Salsa 3.084091
28 Chips and Roasted Chili-Corn Salsa 2.390000
29 Chips and Tomatillo Green Chili Salsa 3.087209
30 Chips and Tomatillo Red Chili Salsa 3.072917
31 Chips and Tomatillo-Green Chili Salsa 2.544194
32 Chips and Tomatillo-Red Chili Salsa 2.987500
33 Crispy Tacos 7.400000
34 Izze 3.390000
35 Nantucket Nectar 3.641111
36 Salad 7.400000
37 Side of Chips 1.840594
38 Steak Bowl 10.711801
39 Steak Burrito 10.465842
40 Steak Crispy Tacos 10.209714
41 Steak Salad 8.915000
42 Steak Salad Bowl 11.847931
43 Steak Soft Tacos 9.746364
44 Veggie Bowl 10.211647
45 Veggie Burrito 9.839684
46 Veggie Crispy Tacos 8.490000
47 Veggie Salad 8.490000
48 Veggie Salad Bowl 10.138889
49 Veggie Soft Tacos 10.565714

Step 7. What was the quantity of the most expensive item ordered?

chipo[chipo['item_price'] >= chipo['item_price'].max()]['quantity']
3598    15
Name: quantity, dtype: int64

Step 8. How many times was a Veggie Salad Bowl ordered?

chipo.groupby(['item_name']).sum().filter(['Veggie Salad Bowl'], axis=0)['quantity']
item_name
Veggie Salad Bowl    18
Name: quantity, dtype: int64

Step 9. How many times did someone order more than one Canned Soda?

chipo[chipo['quantity'] > 1].groupby(['item_name']).count().filter(['Canned Soda'], axis=0)['quantity']
item_name
Canned Soda    20
Name: quantity, dtype: int64

Euro12 Dataset

This time we are going to pull data directly from the internet.

Step 1. Import the necessary libraries

import pandas as pd

Step 2. Import the dataset from this address.

Step 3. Assign it to a variable called euro12.

url = 'https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/02_Filtering_%26_Sorting/Euro12/Euro_2012_stats_TEAM.csv'
euro12 = pd.read_csv(url)

Step 4. Select only the Goal column.

euro12['Goals']
0      4
1      4
2      4
3      5
4      3
5     10
6      5
7      6
8      2
9      2
10     6
11     1
12     5
13    12
14     5
15     2
Name: Goals, dtype: int64

Step 5. How many team participated in the Euro2012?

euro12['Team'].value_counts().count()
16

Step 6. What is the number of columns in the dataset?

len(euro12.columns)
35

Step 7. View only the columns Team, Yellow Cards and Red Cards and assign them to a dataframe called discipline

discipline = euro12[['Team', 'Yellow Cards', 'Red Cards']]
discipline
Team Yellow Cards Red Cards
0 Croatia 9 0
1 Czech Republic 7 0
2 Denmark 4 0
3 England 5 0
4 France 6 0
5 Germany 4 0
6 Greece 9 1
7 Italy 16 0
8 Netherlands 5 0
9 Poland 7 1
10 Portugal 12 0
11 Republic of Ireland 6 1
12 Russia 6 0
13 Spain 11 0
14 Sweden 7 0
15 Ukraine 5 0

Step 8. Sort the teams by Red Cards, then to Yellow Cards

discipline.sort_values(['Red Cards', 'Yellow Cards'], ascending=False)
Team Yellow Cards Red Cards
6 Greece 9 1
9 Poland 7 1
11 Republic of Ireland 6 1
7 Italy 16 0
10 Portugal 12 0
13 Spain 11 0
0 Croatia 9 0
1 Czech Republic 7 0
14 Sweden 7 0
4 France 6 0
12 Russia 6 0
3 England 5 0
8 Netherlands 5 0
15 Ukraine 5 0
2 Denmark 4 0
5 Germany 4 0

Step 9. Calculate the mean Yellow Cards given per Team

discipline['Yellow Cards'].mean()
7.4375

Step 10. Filter teams that scored more than 6 goals

euro12[euro12['Goals'] > 6]['Team']
5     Germany
13      Spain
Name: Team, dtype: object

Step 11. Select the teams that start with G

euro12[euro12['Team'].str[0] == 'G']['Team']
5    Germany
6     Greece
Name: Team, dtype: object

Step 12. Select the first 7 columns

euro12.iloc[:, :7]
Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked)
0 Croatia 4 13 12 51.9% 16.0% 32
1 Czech Republic 4 13 18 41.9% 12.9% 39
2 Denmark 4 10 10 50.0% 20.0% 27
3 England 5 11 18 50.0% 17.2% 40
4 France 3 22 24 37.9% 6.5% 65
5 Germany 10 32 32 47.8% 15.6% 80
6 Greece 5 8 18 30.7% 19.2% 32
7 Italy 6 34 45 43.0% 7.5% 110
8 Netherlands 2 12 36 25.0% 4.1% 60
9 Poland 2 15 23 39.4% 5.2% 48
10 Portugal 6 22 42 34.3% 9.3% 82
11 Republic of Ireland 1 7 12 36.8% 5.2% 28
12 Russia 5 9 31 22.5% 12.5% 59
13 Spain 12 42 33 55.9% 16.0% 100
14 Sweden 5 17 19 47.2% 13.8% 39
15 Ukraine 2 7 26 21.2% 6.0% 38

Step 13. Select all columns except the last 3.

euro12.iloc[:, :-3]
Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked) Hit Woodwork Penalty goals Penalties not scored ... Clean Sheets Blocks Goals conceded Saves made Saves-to-shots ratio Fouls Won Fouls Conceded Offsides Yellow Cards Red Cards
0 Croatia 4 13 12 51.9% 16.0% 32 0 0 0 ... 0 10 3 13 81.3% 41 62 2 9 0
1 Czech Republic 4 13 18 41.9% 12.9% 39 0 0 0 ... 1 10 6 9 60.1% 53 73 8 7 0
2 Denmark 4 10 10 50.0% 20.0% 27 1 0 0 ... 1 10 5 10 66.7% 25 38 8 4 0
3 England 5 11 18 50.0% 17.2% 40 0 0 0 ... 2 29 3 22 88.1% 43 45 6 5 0
4 France 3 22 24 37.9% 6.5% 65 1 0 0 ... 1 7 5 6 54.6% 36 51 5 6 0
5 Germany 10 32 32 47.8% 15.6% 80 2 1 0 ... 1 11 6 10 62.6% 63 49 12 4 0
6 Greece 5 8 18 30.7% 19.2% 32 1 1 1 ... 1 23 7 13 65.1% 67 48 12 9 1
7 Italy 6 34 45 43.0% 7.5% 110 2 0 0 ... 2 18 7 20 74.1% 101 89 16 16 0
8 Netherlands 2 12 36 25.0% 4.1% 60 2 0 0 ... 0 9 5 12 70.6% 35 30 3 5 0
9 Poland 2 15 23 39.4% 5.2% 48 0 0 0 ... 0 8 3 6 66.7% 48 56 3 7 1
10 Portugal 6 22 42 34.3% 9.3% 82 6 0 0 ... 2 11 4 10 71.5% 73 90 10 12 0
11 Republic of Ireland 1 7 12 36.8% 5.2% 28 0 0 0 ... 0 23 9 17 65.4% 43 51 11 6 1
12 Russia 5 9 31 22.5% 12.5% 59 2 0 0 ... 0 8 3 10 77.0% 34 43 4 6 0
13 Spain 12 42 33 55.9% 16.0% 100 0 1 0 ... 5 8 1 15 93.8% 102 83 19 11 0
14 Sweden 5 17 19 47.2% 13.8% 39 3 0 0 ... 1 12 5 8 61.6% 35 51 7 7 0
15 Ukraine 2 7 26 21.2% 6.0% 38 0 0 0 ... 0 4 4 13 76.5% 48 31 4 5 0

16 rows × 32 columns

Step 14. Present only the Shooting Accuracy from England, Italy and Russia

euro12[euro12['Team'].isin(['England', 'Italy', 'Russia'])][['Team', 'Shooting Accuracy']]
Team Shooting Accuracy
3 England 50.0%
7 Italy 43.0%
12 Russia 22.5%

Fictional Army dataset

Introduction:

This exercise was inspired by this page

Special thanks to: https://github.com/chrisalbon for sharing the dataset and materials.

Step 1. Import the necessary libraries

import pandas as pd

Step 2. This is the data given as a dictionary

# Create an example dataframe about a fictional army
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
            'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'],
            'deaths': [523, 52, 25, 616, 43, 234, 523, 62, 62, 73, 37, 35],
            'battles': [5, 42, 2, 2, 4, 7, 8, 3, 4, 7, 8, 9],
            'size': [1045, 957, 1099, 1400, 1592, 1006, 987, 849, 973, 1005, 1099, 1523],
            'veterans': [1, 5, 62, 26, 73, 37, 949, 48, 48, 435, 63, 345],
            'readiness': [1, 2, 3, 3, 2, 1, 2, 3, 2, 1, 2, 3],
            'armored': [1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1],
            'deserters': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
            'origin': ['Arizona', 'California', 'Texas', 'Florida', 'Maine', 'Iowa', 'Alaska', 'Washington', 'Oregon', 'Wyoming', 'Louisana', 'Georgia']}

Step 3. Create a dataframe and assign it to a variable called army.

Don’t forget to include the columns names in the order presented in the dictionary (‘regiment’, ‘company’, ‘deaths’…) so that the column index order is consistent with the solutions. If omitted, pandas will order the columns alphabetically.

army = pd.DataFrame(raw_data)

Step 4. Set the ‘origin’ colum as the index of the dataframe

army = army.set_index(['origin'])

Step 5. Print only the column veterans

army['veterans']
origin
Arizona         1
California      5
Texas          62
Florida        26
Maine          73
Iowa           37
Alaska        949
Washington     48
Oregon         48
Wyoming       435
Louisana       63
Georgia       345
Name: veterans, dtype: int64

Step 6. Print the columns ‘veterans’ and ‘deaths’

army[['veterans', 'deaths']]
veterans deaths
origin
Arizona 1 523
California 5 52
Texas 62 25
Florida 26 616
Maine 73 43
Iowa 37 234
Alaska 949 523
Washington 48 62
Oregon 48 62
Wyoming 435 73
Louisana 63 37
Georgia 345 35

Step 7. Print the name of all the columns.

army.columns
Index(['regiment', 'company', 'deaths', 'battles', 'size', 'veterans',
       'readiness', 'armored', 'deserters'],
      dtype='object')

Step 8. Select the ‘deaths’, ‘size’ and ‘deserters’ columns from Maine and Alaska

army[['deaths', 'size', 'deserters']].loc[['Maine', 'Alaska']]
deaths size deserters
origin
Maine 43 1592 3
Alaska 523 987 24

Step 9. Select the rows 3 to 7 and the columns 3 to 6

army.iloc[2:7,2:6]
deaths battles size veterans
origin
Texas 25 2 1099 62
Florida 616 2 1400 26
Maine 43 4 1592 73
Iowa 234 7 1006 37
Alaska 523 8 987 949

Step 10. Select every row after the fourth row and all columns

army.iloc[4:,::]
regiment company deaths battles size veterans readiness armored deserters
origin
Maine Dragoons 1st 43 4 1592 73 2 0 3
Iowa Dragoons 1st 234 7 1006 37 1 1 4
Alaska Dragoons 2nd 523 8 987 949 2 0 24
Washington Dragoons 2nd 62 3 849 48 3 1 31
Oregon Scouts 1st 62 4 973 48 2 0 2
Wyoming Scouts 1st 73 7 1005 435 1 0 3
Louisana Scouts 2nd 37 8 1099 63 2 1 2
Georgia Scouts 2nd 35 9 1523 345 3 1 3

Step 11. Select every row up to the 4th row and all columns

army.iloc[:4,::]
regiment company deaths battles size veterans readiness armored deserters
origin
Arizona Nighthawks 1st 523 5 1045 1 1 1 4
California Nighthawks 1st 52 42 957 5 2 0 24
Texas Nighthawks 2nd 25 2 1099 62 3 1 31
Florida Nighthawks 2nd 616 2 1400 26 3 1 2

Step 12. Select the 3rd column up to the 7th column

army.iloc[:,2:7]
deaths battles size veterans readiness
origin
Arizona 523 5 1045 1 1
California 52 42 957 5 2
Texas 25 2 1099 62 3
Florida 616 2 1400 26 3
Maine 43 4 1592 73 2
Iowa 234 7 1006 37 1
Alaska 523 8 987 949 2
Washington 62 3 849 48 3
Oregon 62 4 973 48 2
Wyoming 73 7 1005 435 1
Louisana 37 8 1099 63 2
Georgia 35 9 1523 345 3

Step 13. Select rows where df.deaths is greater than 50

army[army['deaths'] >50]['deaths']
origin
Arizona       523
California     52
Florida       616
Iowa          234
Alaska        523
Washington     62
Oregon         62
Wyoming        73
Name: deaths, dtype: int64

Step 14. Select rows where df.deaths is greater than 500 or less than 50

army[(army['deaths'] < 50) | (army['deaths'] > 500)]['deaths']
origin
Arizona     523
Texas        25
Florida     616
Maine        43
Alaska      523
Louisana     37
Georgia      35
Name: deaths, dtype: int64

Step 15. Select all the regiments not named “Dragoons”

army[~army['regiment'].isin(['Dragoons'])]
regiment company deaths battles size veterans readiness armored deserters
origin
Arizona Nighthawks 1st 523 5 1045 1 1 1 4
California Nighthawks 1st 52 42 957 5 2 0 24
Texas Nighthawks 2nd 25 2 1099 62 3 1 31
Florida Nighthawks 2nd 616 2 1400 26 3 1 2
Oregon Scouts 1st 62 4 973 48 2 0 2
Wyoming Scouts 1st 73 7 1005 435 1 0 3
Louisana Scouts 2nd 37 8 1099 63 2 1 2
Georgia Scouts 2nd 35 9 1523 345 3 1 3

Step 16. Select the rows called Texas and Arizona

army.loc[['Texas', 'Arizona']]
regiment company deaths battles size veterans readiness armored deserters
origin
Texas Nighthawks 2nd 25 2 1099 62 3 1 31
Arizona Nighthawks 1st 523 5 1045 1 1 1 4

Step 17. Select the third cell in the row named Arizona

army.loc[['Arizona']].iloc[:, 2]
origin
Arizona    523
Name: deaths, dtype: int64

Step 18. Select the third cell down in the column named deaths

army[['deaths']].iloc[2:, :]
deaths
origin
Texas 25
Florida 616
Maine 43
Iowa 234
Alaska 523
Washington 62
Oregon 62
Wyoming 73
Louisana 37
Georgia 35

Leave a comment