Pandas Exercise 2 : Filtering and Sorting

14 minute read

The continuity of my practice on Pandas exercise from guisapmora.

Chipotle dataset

This time we are going to pull data directly from the internet. Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

Step 1. Import the necessary libraries

import pandas as pd

Step 2. Import the dataset from this address.

Step 3. Assign it to a variable called chipo.

url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv'
chipo = pd.read_csv(url, sep='\t')

Step 4. How many products cost more than $10.00?

chipo['item_price'] = chipo['item_price'].apply(lambda x : x[1:]).astype('float').copy()

chipo[chipo['item_price'] > 10]['item_name'].value_counts().count()

Step 5. What is the price of each item?

print a data frame with only two columns item_name and item_price

chipo.groupby(['item_name']).mean().reset_index()[['item_name', 'item_price']]

	item_name	item_price
0	6 Pack Soft Drink	6.610185
1	Barbacoa Bowl	10.187273
2	Barbacoa Burrito	9.832418
3	Barbacoa Crispy Tacos	10.928182
4	Barbacoa Salad Bowl	10.640000
5	Barbacoa Soft Tacos	10.018400
6	Bottled Water	1.867654
7	Bowl	14.800000
8	Burrito	7.400000
9	Canned Soda	1.320577
10	Canned Soft Drink	1.457641
11	Carnitas Bowl	10.833971
12	Carnitas Burrito	10.132712
13	Carnitas Crispy Tacos	11.137143
14	Carnitas Salad	8.990000
15	Carnitas Salad Bowl	11.056667
16	Carnitas Soft Tacos	9.398500
17	Chicken Bowl	10.113953
18	Chicken Burrito	10.082857
19	Chicken Crispy Tacos	10.045319
20	Chicken Salad	9.010000
21	Chicken Salad Bowl	11.170455
22	Chicken Soft Tacos	9.635565
23	Chips	2.342844
24	Chips and Fresh Tomato Salsa	3.285091
25	Chips and Guacamole	4.595073
26	Chips and Mild Fresh Tomato Salsa	3.000000
27	Chips and Roasted Chili Corn Salsa	3.084091
28	Chips and Roasted Chili-Corn Salsa	2.390000
29	Chips and Tomatillo Green Chili Salsa	3.087209
30	Chips and Tomatillo Red Chili Salsa	3.072917
31	Chips and Tomatillo-Green Chili Salsa	2.544194
32	Chips and Tomatillo-Red Chili Salsa	2.987500
33	Crispy Tacos	7.400000
34	Izze	3.390000
35	Nantucket Nectar	3.641111
36	Salad	7.400000
37	Side of Chips	1.840594
38	Steak Bowl	10.711801
39	Steak Burrito	10.465842
40	Steak Crispy Tacos	10.209714
41	Steak Salad	8.915000
42	Steak Salad Bowl	11.847931
43	Steak Soft Tacos	9.746364
44	Veggie Bowl	10.211647
45	Veggie Burrito	9.839684
46	Veggie Crispy Tacos	8.490000
47	Veggie Salad	8.490000
48	Veggie Salad Bowl	10.138889
49	Veggie Soft Tacos	10.565714

Step 6. Sort by the name of the item

chipo.groupby(['item_name']).mean().sort_index().reset_index()[['item_name', 'item_price']]

	item_name	item_price
0	6 Pack Soft Drink	6.610185
1	Barbacoa Bowl	10.187273
2	Barbacoa Burrito	9.832418
3	Barbacoa Crispy Tacos	10.928182
4	Barbacoa Salad Bowl	10.640000
5	Barbacoa Soft Tacos	10.018400
6	Bottled Water	1.867654
7	Bowl	14.800000
8	Burrito	7.400000
9	Canned Soda	1.320577
10	Canned Soft Drink	1.457641
11	Carnitas Bowl	10.833971
12	Carnitas Burrito	10.132712
13	Carnitas Crispy Tacos	11.137143
14	Carnitas Salad	8.990000
15	Carnitas Salad Bowl	11.056667
16	Carnitas Soft Tacos	9.398500
17	Chicken Bowl	10.113953
18	Chicken Burrito	10.082857
19	Chicken Crispy Tacos	10.045319
20	Chicken Salad	9.010000
21	Chicken Salad Bowl	11.170455
22	Chicken Soft Tacos	9.635565
23	Chips	2.342844
24	Chips and Fresh Tomato Salsa	3.285091
25	Chips and Guacamole	4.595073
26	Chips and Mild Fresh Tomato Salsa	3.000000
27	Chips and Roasted Chili Corn Salsa	3.084091
28	Chips and Roasted Chili-Corn Salsa	2.390000
29	Chips and Tomatillo Green Chili Salsa	3.087209
30	Chips and Tomatillo Red Chili Salsa	3.072917
31	Chips and Tomatillo-Green Chili Salsa	2.544194
32	Chips and Tomatillo-Red Chili Salsa	2.987500
33	Crispy Tacos	7.400000
34	Izze	3.390000
35	Nantucket Nectar	3.641111
36	Salad	7.400000
37	Side of Chips	1.840594
38	Steak Bowl	10.711801
39	Steak Burrito	10.465842
40	Steak Crispy Tacos	10.209714
41	Steak Salad	8.915000
42	Steak Salad Bowl	11.847931
43	Steak Soft Tacos	9.746364
44	Veggie Bowl	10.211647
45	Veggie Burrito	9.839684
46	Veggie Crispy Tacos	8.490000
47	Veggie Salad	8.490000
48	Veggie Salad Bowl	10.138889
49	Veggie Soft Tacos	10.565714

Step 7. What was the quantity of the most expensive item ordered?

chipo[chipo['item_price'] >= chipo['item_price'].max()]['quantity']

3598    15
Name: quantity, dtype: int64

Step 8. How many times was a Veggie Salad Bowl ordered?

chipo.groupby(['item_name']).sum().filter(['Veggie Salad Bowl'], axis=0)['quantity']

item_name
Veggie Salad Bowl    18
Name: quantity, dtype: int64

Step 9. How many times did someone order more than one Canned Soda?

chipo[chipo['quantity'] > 1].groupby(['item_name']).count().filter(['Canned Soda'], axis=0)['quantity']

item_name
Canned Soda    20
Name: quantity, dtype: int64

Euro12 Dataset

This time we are going to pull data directly from the internet.

Step 1. Import the necessary libraries

import pandas as pd

Step 2. Import the dataset from this address.

Step 3. Assign it to a variable called euro12.

url = 'https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/02_Filtering_%26_Sorting/Euro12/Euro_2012_stats_TEAM.csv'
euro12 = pd.read_csv(url)

Step 4. Select only the Goal column.

euro12['Goals']

    4
    4
    4
    5
    3
   10
    5
    6
    2
    2
   6
   1
   5
  12
   5
   2
Name: Goals, dtype: int64

Step 5. How many team participated in the Euro2012?

euro12['Team'].value_counts().count()

Step 6. What is the number of columns in the dataset?

len(euro12.columns)

Step 7. View only the columns Team, Yellow Cards and Red Cards and assign them to a dataframe called discipline

discipline = euro12[['Team', 'Yellow Cards', 'Red Cards']]
discipline

	Team	Yellow Cards	Red Cards
0	Croatia	9	0
1	Czech Republic	7	0
2	Denmark	4	0
3	England	5	0
4	France	6	0
5	Germany	4	0
6	Greece	9	1
7	Italy	16	0
8	Netherlands	5	0
9	Poland	7	1
10	Portugal	12	0
11	Republic of Ireland	6	1
12	Russia	6	0
13	Spain	11	0
14	Sweden	7	0
15	Ukraine	5	0

Step 8. Sort the teams by Red Cards, then to Yellow Cards

discipline.sort_values(['Red Cards', 'Yellow Cards'], ascending=False)

	Team	Yellow Cards	Red Cards
6	Greece	9	1
9	Poland	7	1
11	Republic of Ireland	6	1
7	Italy	16	0
10	Portugal	12	0
13	Spain	11	0
0	Croatia	9	0
1	Czech Republic	7	0
14	Sweden	7	0
4	France	6	0
12	Russia	6	0
3	England	5	0
8	Netherlands	5	0
15	Ukraine	5	0
2	Denmark	4	0
5	Germany	4	0

Step 9. Calculate the mean Yellow Cards given per Team

discipline['Yellow Cards'].mean()

7.4375

Step 10. Filter teams that scored more than 6 goals

euro12[euro12['Goals'] > 6]['Team']

5     Germany
13      Spain
Name: Team, dtype: object

Step 11. Select the teams that start with G

euro12[euro12['Team'].str[0] == 'G']['Team']

5    Germany
6     Greece
Name: Team, dtype: object

Step 12. Select the first 7 columns

euro12.iloc[:, :7]

	Team	Goals	Shots on target	Shots off target	Shooting Accuracy	% Goals-to-shots	Total shots (inc. Blocked)
0	Croatia	4	13	12	51.9%	16.0%	32
1	Czech Republic	4	13	18	41.9%	12.9%	39
2	Denmark	4	10	10	50.0%	20.0%	27
3	England	5	11	18	50.0%	17.2%	40
4	France	3	22	24	37.9%	6.5%	65
5	Germany	10	32	32	47.8%	15.6%	80
6	Greece	5	8	18	30.7%	19.2%	32
7	Italy	6	34	45	43.0%	7.5%	110
8	Netherlands	2	12	36	25.0%	4.1%	60
9	Poland	2	15	23	39.4%	5.2%	48
10	Portugal	6	22	42	34.3%	9.3%	82
11	Republic of Ireland	1	7	12	36.8%	5.2%	28
12	Russia	5	9	31	22.5%	12.5%	59
13	Spain	12	42	33	55.9%	16.0%	100
14	Sweden	5	17	19	47.2%	13.8%	39
15	Ukraine	2	7	26	21.2%	6.0%	38

Step 13. Select all columns except the last 3.

euro12.iloc[:, :-3]

	Team	Goals	Shots on target	Shots off target	Shooting Accuracy	% Goals-to-shots	Total shots (inc. Blocked)	Hit Woodwork	Penalty goals	Penalties not scored	...	Clean Sheets	Blocks	Goals conceded	Saves made	Saves-to-shots ratio	Fouls Won	Fouls Conceded	Offsides	Yellow Cards	Red Cards
0	Croatia	4	13	12	51.9%	16.0%	32	0	0	0	...	0	10	3	13	81.3%	41	62	2	9	0
1	Czech Republic	4	13	18	41.9%	12.9%	39	0	0	0	...	1	10	6	9	60.1%	53	73	8	7	0
2	Denmark	4	10	10	50.0%	20.0%	27	1	0	0	...	1	10	5	10	66.7%	25	38	8	4	0
3	England	5	11	18	50.0%	17.2%	40	0	0	0	...	2	29	3	22	88.1%	43	45	6	5	0
4	France	3	22	24	37.9%	6.5%	65	1	0	0	...	1	7	5	6	54.6%	36	51	5	6	0
5	Germany	10	32	32	47.8%	15.6%	80	2	1	0	...	1	11	6	10	62.6%	63	49	12	4	0
6	Greece	5	8	18	30.7%	19.2%	32	1	1	1	...	1	23	7	13	65.1%	67	48	12	9	1
7	Italy	6	34	45	43.0%	7.5%	110	2	0	0	...	2	18	7	20	74.1%	101	89	16	16	0
8	Netherlands	2	12	36	25.0%	4.1%	60	2	0	0	...	0	9	5	12	70.6%	35	30	3	5	0
9	Poland	2	15	23	39.4%	5.2%	48	0	0	0	...	0	8	3	6	66.7%	48	56	3	7	1
10	Portugal	6	22	42	34.3%	9.3%	82	6	0	0	...	2	11	4	10	71.5%	73	90	10	12	0
11	Republic of Ireland	1	7	12	36.8%	5.2%	28	0	0	0	...	0	23	9	17	65.4%	43	51	11	6	1
12	Russia	5	9	31	22.5%	12.5%	59	2	0	0	...	0	8	3	10	77.0%	34	43	4	6	0
13	Spain	12	42	33	55.9%	16.0%	100	0	1	0	...	5	8	1	15	93.8%	102	83	19	11	0
14	Sweden	5	17	19	47.2%	13.8%	39	3	0	0	...	1	12	5	8	61.6%	35	51	7	7	0
15	Ukraine	2	7	26	21.2%	6.0%	38	0	0	0	...	0	4	4	13	76.5%	48	31	4	5	0

16 rows × 32 columns

Step 14. Present only the Shooting Accuracy from England, Italy and Russia

euro12[euro12['Team'].isin(['England', 'Italy', 'Russia'])][['Team', 'Shooting Accuracy']]

	Team	Shooting Accuracy
3	England	50.0%
7	Italy	43.0%
12	Russia	22.5%

Fictional Army dataset

Introduction:

This exercise was inspired by this page

Special thanks to: https://github.com/chrisalbon for sharing the dataset and materials.

Step 1. Import the necessary libraries

import pandas as pd

Step 2. This is the data given as a dictionary

# Create an example dataframe about a fictional army
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
            'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'],
            'deaths': [523, 52, 25, 616, 43, 234, 523, 62, 62, 73, 37, 35],
            'battles': [5, 42, 2, 2, 4, 7, 8, 3, 4, 7, 8, 9],
            'size': [1045, 957, 1099, 1400, 1592, 1006, 987, 849, 973, 1005, 1099, 1523],
            'veterans': [1, 5, 62, 26, 73, 37, 949, 48, 48, 435, 63, 345],
            'readiness': [1, 2, 3, 3, 2, 1, 2, 3, 2, 1, 2, 3],
            'armored': [1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1],
            'deserters': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
            'origin': ['Arizona', 'California', 'Texas', 'Florida', 'Maine', 'Iowa', 'Alaska', 'Washington', 'Oregon', 'Wyoming', 'Louisana', 'Georgia']}

Step 3. Create a dataframe and assign it to a variable called army.

Don’t forget to include the columns names in the order presented in the dictionary (‘regiment’, ‘company’, ‘deaths’…) so that the column index order is consistent with the solutions. If omitted, pandas will order the columns alphabetically.

army = pd.DataFrame(raw_data)

Step 4. Set the ‘origin’ colum as the index of the dataframe

army = army.set_index(['origin'])

Step 5. Print only the column veterans

army['veterans']

origin
Arizona         1
California      5
Texas          62
Florida        26
Maine          73
Iowa           37
Alaska        949
Washington     48
Oregon         48
Wyoming       435
Louisana       63
Georgia       345
Name: veterans, dtype: int64

Step 6. Print the columns ‘veterans’ and ‘deaths’

army[['veterans', 'deaths']]

	veterans	deaths
origin
Arizona	1	523
California	5	52
Texas	62	25
Florida	26	616
Maine	73	43
Iowa	37	234
Alaska	949	523
Washington	48	62
Oregon	48	62
Wyoming	435	73
Louisana	63	37
Georgia	345	35

Step 7. Print the name of all the columns.

army.columns

Index(['regiment', 'company', 'deaths', 'battles', 'size', 'veterans',
       'readiness', 'armored', 'deserters'],
      dtype='object')

Step 8. Select the ‘deaths’, ‘size’ and ‘deserters’ columns from Maine and Alaska

army[['deaths', 'size', 'deserters']].loc[['Maine', 'Alaska']]

	deaths	size	deserters
origin
Maine	43	1592	3
Alaska	523	987	24

Step 9. Select the rows 3 to 7 and the columns 3 to 6

army.iloc[2:7,2:6]

	deaths	battles	size	veterans
origin
Texas	25	2	1099	62
Florida	616	2	1400	26
Maine	43	4	1592	73
Iowa	234	7	1006	37
Alaska	523	8	987	949

Step 10. Select every row after the fourth row and all columns

army.iloc[4:,::]

	regiment	company	deaths	battles	size	veterans	readiness	armored	deserters
origin
Maine	Dragoons	1st	43	4	1592	73	2	0	3
Iowa	Dragoons	1st	234	7	1006	37	1	1	4
Alaska	Dragoons	2nd	523	8	987	949	2	0	24
Washington	Dragoons	2nd	62	3	849	48	3	1	31
Oregon	Scouts	1st	62	4	973	48	2	0	2
Wyoming	Scouts	1st	73	7	1005	435	1	0	3
Louisana	Scouts	2nd	37	8	1099	63	2	1	2
Georgia	Scouts	2nd	35	9	1523	345	3	1	3

Step 11. Select every row up to the 4th row and all columns

army.iloc[:4,::]

	regiment	company	deaths	battles	size	veterans	readiness	armored	deserters
origin
Arizona	Nighthawks	1st	523	5	1045	1	1	1	4
California	Nighthawks	1st	52	42	957	5	2	0	24
Texas	Nighthawks	2nd	25	2	1099	62	3	1	31
Florida	Nighthawks	2nd	616	2	1400	26	3	1	2

Step 12. Select the 3rd column up to the 7th column

army.iloc[:,2:7]

	deaths	battles	size	veterans	readiness
origin
Arizona	523	5	1045	1	1
California	52	42	957	5	2
Texas	25	2	1099	62	3
Florida	616	2	1400	26	3
Maine	43	4	1592	73	2
Iowa	234	7	1006	37	1
Alaska	523	8	987	949	2
Washington	62	3	849	48	3
Oregon	62	4	973	48	2
Wyoming	73	7	1005	435	1
Louisana	37	8	1099	63	2
Georgia	35	9	1523	345	3

Step 13. Select rows where df.deaths is greater than 50

army[army['deaths'] >50]['deaths']

origin
Arizona       523
California     52
Florida       616
Iowa          234
Alaska        523
Washington     62
Oregon         62
Wyoming        73
Name: deaths, dtype: int64

Step 14. Select rows where df.deaths is greater than 500 or less than 50

army[(army['deaths'] < 50) | (army['deaths'] > 500)]['deaths']

origin
Arizona     523
Texas        25
Florida     616
Maine        43
Alaska      523
Louisana     37
Georgia      35
Name: deaths, dtype: int64

Step 15. Select all the regiments not named “Dragoons”

army[~army['regiment'].isin(['Dragoons'])]

	regiment	company	deaths	battles	size	veterans	readiness	armored	deserters
origin
Arizona	Nighthawks	1st	523	5	1045	1	1	1	4
California	Nighthawks	1st	52	42	957	5	2	0	24
Texas	Nighthawks	2nd	25	2	1099	62	3	1	31
Florida	Nighthawks	2nd	616	2	1400	26	3	1	2
Oregon	Scouts	1st	62	4	973	48	2	0	2
Wyoming	Scouts	1st	73	7	1005	435	1	0	3
Louisana	Scouts	2nd	37	8	1099	63	2	1	2
Georgia	Scouts	2nd	35	9	1523	345	3	1	3

Step 16. Select the rows called Texas and Arizona

army.loc[['Texas', 'Arizona']]

	regiment	company	deaths	battles	size	veterans	readiness	armored	deserters
origin
Texas	Nighthawks	2nd	25	2	1099	62	3	1	31
Arizona	Nighthawks	1st	523	5	1045	1	1	1	4

Step 17. Select the third cell in the row named Arizona

army.loc[['Arizona']].iloc[:, 2]

origin
Arizona    523
Name: deaths, dtype: int64

Step 18. Select the third cell down in the column named deaths

army[['deaths']].iloc[2:, :]

	deaths
origin
Texas	25
Florida	616
Maine	43
Iowa	234
Alaska	523
Washington	62
Oregon	62
Wyoming	73
Louisana	37
Georgia	35

Share on

Twitter Facebook LinkedIn

Gama Candra Tri Kartika

Chipotle dataset

Step 1. Import the necessary libraries

Step 2. Import the dataset from this address.

Step 3. Assign it to a variable called chipo.

Step 4. How many products cost more than $10.00?

Step 5. What is the price of each item?

print a data frame with only two columns item_name and item_price

Step 6. Sort by the name of the item

Step 7. What was the quantity of the most expensive item ordered?

Step 8. How many times was a Veggie Salad Bowl ordered?

Step 9. How many times did someone order more than one Canned Soda?

Euro12 Dataset

Step 1. Import the necessary libraries

Step 2. Import the dataset from this address.

Step 3. Assign it to a variable called euro12.

Step 4. Select only the Goal column.

Step 5. How many team participated in the Euro2012?

Step 6. What is the number of columns in the dataset?

Step 7. View only the columns Team, Yellow Cards and Red Cards and assign them to a dataframe called discipline

Step 8. Sort the teams by Red Cards, then to Yellow Cards

Step 9. Calculate the mean Yellow Cards given per Team

Step 10. Filter teams that scored more than 6 goals

Step 11. Select the teams that start with G

Step 12. Select the first 7 columns

Step 13. Select all columns except the last 3.

Step 14. Present only the Shooting Accuracy from England, Italy and Russia

Fictional Army dataset

Introduction:

Step 1. Import the necessary libraries

Step 2. This is the data given as a dictionary

Step 3. Create a dataframe and assign it to a variable called army.

Don’t forget to include the columns names in the order presented in the dictionary (‘regiment’, ‘company’, ‘deaths’…) so that the column index order is consistent with the solutions. If omitted, pandas will order the columns alphabetically.

Step 4. Set the ‘origin’ colum as the index of the dataframe

Step 5. Print only the column veterans

Step 6. Print the columns ‘veterans’ and ‘deaths’

Step 7. Print the name of all the columns.

Step 8. Select the ‘deaths’, ‘size’ and ‘deserters’ columns from Maine and Alaska

Step 9. Select the rows 3 to 7 and the columns 3 to 6

Step 10. Select every row after the fourth row and all columns

Step 11. Select every row up to the 4th row and all columns

Step 12. Select the 3rd column up to the 7th column

Step 13. Select rows where df.deaths is greater than 50

Step 14. Select rows where df.deaths is greater than 500 or less than 50

Step 15. Select all the regiments not named “Dragoons”

Step 16. Select the rows called Texas and Arizona

Step 17. Select the third cell in the row named Arizona

Step 18. Select the third cell down in the column named deaths

Share on

Leave a comment

You may also enjoy

Day 7 Algorit.ma : Capstone Project

Day 6 Algorit.ma : Unsupervised Learning

Day 5 Algorit.ma : Classification Model

Day 4 Algorit.ma : Regression Model