Python Pandas
Next, let’s discussing Pandas. Preparing the data and munging the same was the initial outcomes of python before the introduction of Panda libraries. after the introduction of panda libraries python began to flourish a lot in the analytics sector. The major outcomes of panda are analysis of data, preparation of data, data manipulation, data modeling, and data analysis.
Series
First of all, Lets discuss about pd.series
. Series is a one-dimensional ndarray with axis labels (including time series).
Labels need not be unique but must be a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index. Statistical methods from ndarray have been overridden to automatically exclude missing data (currently represented as NaN).
lets check this example
import numpy as np
import pandas as pd
arr = np.random.randint(1,10,5)
d = {'a': 2, 'b': 3, 'c' : 7, 'd' : 9, 'e' :1}
list1 = pd.Series(data=arr)
list2 = pd.Series(data=d)
print(list1)
print(list2)
0 3
1 6
2 1
3 2
4 8
dtype: int64
a 2
b 3
c 7
d 9
e 1
dtype: int64
So we see the difference between series from list and dictionary is that series from list automatically indexing from 0-n while index series from dictionary follow from dictionary. You can also put a label from unlabelled data like this example
import numpy as np
import pandas as pd
arr = np.random.randint(1,10,5)
label = ['ran1', 'ran2', 'ran3', 'ran4', 'ran5']
list1 = pd.Series(data=arr,index=label)
print(list1)
ran1 4
ran2 1
ran3 3
ran4 4
ran5 6
dtype: int64
You can also combine between two or more series. For Example
import numpy as np
import pandas as pd
arr = np.random.randint(1,10,5)
d = {'a': 2, 'b': 3, 'c' : 7, 'd' : 9, 'e' :1}
list1 = pd.Series(data=arr, index=['a', 'b', 'c', 'd', 'e'])
list2 = pd.Series(data=d)
list3 = list1 + list2
print(list3)
a 3
b 5
c 8
d 15
e 2
dtype: int64
Why I put index on my randomized series ? Because if i dont it would ends up as index [0,1,2,3,4,a,b,c,d,e] and NaN values.
Dataframes
pd.DataFrame
is Two-dimensional, size-mutable, potentially heterogeneous tabular data. Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dictionary-like container for Series objects. The primary pandas data structure.
Let Start creating dataframe with random numbers
import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.random.randn(5,4), index=[1,2,3,4,5], columns=['a', 'b', 'c', 'd'])
df
a b c d
1 -2.826205 -0.020080 -1.188318 -1.239230
2 -1.503837 0.028629 -1.210348 -0.132313
3 1.368030 -1.905128 -1.719268 -0.090431
4 1.263813 -0.870573 -0.685124 2.135516
5 0.061481 0.884779 -0.836704 -0.574821
You can show the column(s) like this
df[['a', 'b']]
a b
1 0.086870 -0.639393
2 -0.955657 0.258113
3 -0.669046 0.417024
4 -1.212063 0.090159
5 -1.022556 2.135724
And selecting row(s) like this
df.loc[1]
# Based on the index name
df.iloc[0]
# Based on the index location
a 0.086870
b -0.639393
c -0.762429
d -0.306316
Name: 1, dtype: float64
a 0.086870
b -0.639393
c -0.762429
d -0.306316
Name: 1, dtype: float64
It is the same because index “1” located first (python numbering start with zero). And then you can call the Dataframes based on row and column like this
df.loc[[1,3], ['a','b','d']]
a b d
1 0.086870 -0.639393 -0.306316
3 -0.669046 0.417024 0.870600
And finally, you can create a new column and delete column with this
df['new'] = df['a'] + df['c']
print(df)
df_drop = df.drop(columns=['new'])
print(df_drop)
a b c d new
1 0.086870 -0.639393 -0.762429 -0.306316 -0.675559
2 -0.955657 0.258113 -0.027353 1.256810 -0.983010
3 -0.669046 0.417024 1.412859 0.870600 0.743814
4 -1.212063 0.090159 1.125037 -0.265073 -0.087026
5 -1.022556 2.135724 1.208373 0.219263 0.185817
a b c d
1 0.086870 -0.639393 -0.762429 -0.306316
2 -0.955657 0.258113 -0.027353 1.256810
3 -0.669046 0.417024 1.412859 0.870600
4 -1.212063 0.090159 1.125037 -0.265073
5 -1.022556 2.135724 1.208373 0.219263
You can use conditional operator inside Dataframes. It will help you to select particular number or data. Here is the example
print(df)
print(df>1)
print(df[df['a']>0])
a b c d new
1 0.086870 -0.639393 -0.762429 -0.306316 -0.675559
2 -0.955657 0.258113 -0.027353 1.256810 -0.983010
3 -0.669046 0.417024 1.412859 0.870600 0.743814
4 -1.212063 0.090159 1.125037 -0.265073 -0.087026
5 -1.022556 2.135724 1.208373 0.219263 0.185817
a b c d new
1 False False False False False
2 False False False True False
3 False False True False False
4 False False True False False
5 False True True False False
a b c d new
1 0.08687 -0.639393 -0.762429 -0.306316 -0.675559
For two conditions you can use |
and &
with parenthesis like this
df[(df['a']>0) | (df['c'] > 1)]
a b c d
1 0.426039 -0.310240 -0.842375 -0.376677
3 -0.696941 -0.368143 2.026986 -0.486401
lets take a look about more details in indexing dataframes. You can reset your index by usinf df.reset_index()
. Letes take a look at the example
print(df)
print(df.reset_index())
a b c d
1 0.426039 -0.310240 -0.842375 -0.376677
2 -2.555266 1.060464 0.697077 -0.611722
3 -0.696941 -0.368143 2.026986 -0.486401
4 -0.410270 0.059790 -0.804746 -0.293890
5 -0.554131 -0.375219 -0.831811 -0.059533
index a b c d
0 1 0.426039 -0.310240 -0.842375 -0.376677
1 2 -2.555266 1.060464 0.697077 -0.611722
2 3 -0.696941 -0.368143 2.026986 -0.486401
3 4 -0.410270 0.059790 -0.804746 -0.293890
4 5 -0.554131 -0.375219 -0.831811 -0.059533
You can see, pandas resetting the index from 0-n and make the old index names to columns. Then you can set the index with df.set_index
.
df = df.reset_index()
newind = 'CA NY WY OR CO'.split()
df['States'] = newind
df.set_index('States')
index a b c d
States
CA 1 0.486785 0.907007 -0.176515 0.136101
NY 2 0.071172 1.313467 0.507755 -1.628941
WY 3 -1.492063 -0.929157 -0.394949 0.706727
OR 4 -1.512844 -0.058844 0.029634 0.887493
CO 5 -1.445499 0.715998 0.997913 -1.257716
Handling missing data with pandas
Basically, there are 2 option dealing with missing value in dataset. Obviously, not all dataset is ready. Sometimes the dataset have a lot or some missing values. Let’s take a look how to handle missing value by dropping with df.dropna()
import numpy as np
import pandas as pd
df = pd.DataFrame({'A':[1,2,np.nan],
'B':[5,np.nan,np.nan],
'C':[1,2,3]})
print(df)
print(df.dropna(axis=0))
print(df.dropna(axis=1))
A B C
0 1.0 5.0 1
1 2.0 NaN 2
2 NaN NaN 3
A B C
0 1.0 5.0 1
C
0 1
1 2
2 3
You see the difference between dropping values from axis=0 and axis=1. Dropping with axis=0 means that you delete the rows based on missing value. In the other hand you use axis=1 to drop the columns. You can also put a threshold of the minimal count of missing values.
print(df.dropna(thresh=2))
A B C
0 1.0 5.0 1
1 2.0 NaN 2
Now, you can try filling missing values with either mode, mean, and median. You can use df.fillna()
to do that
print(df['A'].fillna(value=df['A'].mean()))
0 1.0
1 2.0
2 1.5
Name: A, dtype: float64
Leave a comment