Handling Missing Values
Missing values in your data are pretty common in real life. In fact, the chance that at least one data point is missing increases as the data set size increases. Missing data can occur in any number of ways, some of which include the following.
- Merging of source datasets: A simple example commonly occurs when two data sets are merged by a sample identifier (ID). If an ID is present in only the first data set, then the merged data will contain missing values for that ID for all of the predictors in the second data set.
- Equipment Errors: Any measurement process is vulnerable to random events that prevent data collection. Consider the setting where data are collected in a medical diagnostic lab. Accidental misplacement or damage of a biological sample would prevent measurements from being made on the sample, thus inducing missing values.
- Human Errors: Measurements collected by humans tend to contain the occasional error, especially in surveys. For example, no surveyor performs perfectly without ever making a mistake, so small, unnoticed errors creep into the data.
Moreover, missing values in the original predictors, regardless of any feature engineering, are intolerable in many kinds of predictive models. Therefore, to utilize predictors or feature engineering techniques, we must first address the missingness in the data. Also, the missingness itself may be an important predictor of the response.
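When missingness itself may carry signal, one common approach is to encode it explicitly before imputing. Below is a minimal sketch, assuming a hypothetical DataFrame df with a partially missing age column (the names and values are made up for illustration):

import numpy as np
import pandas as pd

# Hypothetical data with a partially missing 'age' column
df = pd.DataFrame({"age": [25, np.nan, 47, np.nan, 33],
                   "income": [40, 52, 61, 48, 55]})

# Add a 0/1 indicator so a model can use the missingness itself as a predictor
df["age_missing"] = df["age"].isna().astype(int)
print(df)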
Types of missing data
Why are these values missing?
Sometimes the answer might already be known or could be easily inferred from studying the data. If the data stem from a scientific experiment or clinical study, information from laboratory notebooks or clinical study logs may provide a direct connection to the samples collected or to the patients studied that reveals why measurements are missing. But for many other data sets, the cause of missing data cannot be determined. In cases like this, we need a framework for understanding missing data. This framework will, in turn, lead to appropriate techniques for handling the missing information.
One framework to view missing values is through the lens of the mechanisms of missing data. Three common mechanisms are:
- Structural deficiencies in the data
- Random occurrences, or
- Specific causes.
A structural deficiency can be defined as a missing component of a predictor that was omitted from the data. This type of missingness is often the easiest to resolve once the necessary component is identified. It may be tempting to simply remove such a predictor because most of its values are missing. However, doing this would throw away valuable predictive information.
A second reason for missing values is random occurrences. These can be split into two categories:
- Missing completely at random (MCAR): The fact that a certain value is missing has nothing to do with its hypothetical value or with the values of other variables. This is the best-case situation.
- Missing at random (MAR): Missing at random means that the propensity for a data point to be missing is not related to the missing data, but it is related to some of the observed data. In this scenario, the probability of a missing result depends on the observed data but not on the unobserved data.
A third mechanism of missing data is missingness due to a specific cause, or missing not at random (MNAR). This is the worst case: the value is missing because of its own hypothetical value or because of the value of some other variable. Therefore, we must make a good effort to understand the nature of the missing data before implementing any of the techniques below.
Impute or Remove?
Under MCAR and MAR it can be safe to remove the data with missing values, depending on how often they occur, while under MNAR removing observations with missing values can introduce bias into the model. So we have to be really careful before removing observations. Note that imputation does not necessarily give better results.
Removing
Listwise deletion (complete-case analysis) removes all data for an observation that has one or more missing values. In particular, if the missing data are limited to a small number of observations, you may simply eliminate those cases from the analysis. In most cases, however, listwise deletion is disadvantageous, because the MCAR (missing completely at random) assumption rarely holds in practice. As a result, listwise deletion can produce biased parameters and estimates.
# Drop every row that contains at least one missing value
df.dropna(inplace=True)
Sometimes you can drop a variable if more than 60% of its observations are missing, but only if that variable is insignificant. Having said that, imputation is generally preferred over dropping variables.
# Either of these deletes a column
del df['column_name']
df.drop('column_name', axis=1, inplace=True)
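To make the 60% guideline above concrete, here is a minimal sketch that drops every column whose share of missing values exceeds a chosen threshold; the DataFrame df and the 0.6 cutoff are assumptions for illustration:

# Fraction of missing values in each column
missing_ratio = df.isna().mean()

# Keep only the columns with at most 60% missing values
df = df.loc[:, missing_ratio <= 0.6]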
Imputing with Mean, Median, and Mode
Computing the overall mean, median, or mode is a very basic imputation method; it takes no advantage of time series characteristics or of the relationships between variables. It is very fast, but it has clear disadvantages. One disadvantage is that mean imputation reduces the variance of the dataset. It is pretty clear that for categorical data we can impute with the mode, but what about continuous values?
Overall, the mean is the most commonly used choice. The median, on the other hand, has its own advantage: outliers do not have as much of an effect on it, so the median often gives a more realistic picture of the data. Let's look at an example.
import numpy as np
import pandas as pd

# 45 random integers plus 5 missing values, combined into one numeric sample
num = np.random.randint(1, 100, 45)
null = []
for i in range(5):
    null.append(np.nan)
sample = np.append(null, num)
sample_num = pd.Series(sample)
sample_cat = pd.Series(["a", "b", np.nan, "c", "a", "b", np.nan, "a"])
Then, we check whether there are any missing values:
print(sample_num.isna().sum())
print(sample_cat.isna().sum())
5
2
Then let's try to fill the missing values:
# Fill it with mean
fill_mean = sample_num.fillna(sample_num.mean())
# Fill it with median
fill_median = sample_num.fillna(sample_num.median())
# Fill it with mode
fill_mode = sample_cat.fillna(sample_cat.mode()[0])
print(fill_mean)
print(fill_median)
print(fill_mode)
0 55.911111
1 55.911111
2 55.911111
3 55.911111
4 55.911111
5 29.000000
6 88.000000
7 99.000000
8 75.000000
9 91.000000
10 10.000000
11 23.000000
12 23.000000
13 95.000000
14 74.000000
15 79.000000
16 3.000000
17 5.000000
18 64.000000
19 34.000000
20 44.000000
21 10.000000
22 82.000000
23 51.000000
24 80.000000
25 13.000000
26 73.000000
27 92.000000
28 69.000000
29 25.000000
30 93.000000
31 85.000000
32 47.000000
33 99.000000
34 97.000000
35 29.000000
36 17.000000
37 62.000000
38 56.000000
39 83.000000
40 57.000000
41 60.000000
42 45.000000
43 49.000000
44 92.000000
45 8.000000
46 97.000000
47 56.000000
48 13.000000
49 40.000000
dtype: float64
0 57.0
1 57.0
2 57.0
3 57.0
4 57.0
5 29.0
6 88.0
7 99.0
8 75.0
9 91.0
10 10.0
11 23.0
12 23.0
13 95.0
14 74.0
15 79.0
16 3.0
17 5.0
18 64.0
19 34.0
20 44.0
21 10.0
22 82.0
23 51.0
24 80.0
25 13.0
26 73.0
27 92.0
28 69.0
29 25.0
30 93.0
31 85.0
32 47.0
33 99.0
34 97.0
35 29.0
36 17.0
37 62.0
38 56.0
39 83.0
40 57.0
41 60.0
42 45.0
43 49.0
44 92.0
45 8.0
46 97.0
47 56.0
48 13.0
49 40.0
dtype: float64
0 a
1 b
2 a
3 c
4 a
5 b
6 a
7 a
dtype: object
You can also use an imputer from scikit-learn. Note that the scikit-learn imputer only works on DataFrames or 2-D arrays.
data_sample = pd.DataFrame(data=sample_num)
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
# strategy can be changed to "median" and "most_frequent"
imp.fit_transform(data_sample)
array([[55.91111111],
[55.91111111],
[55.91111111],
[55.91111111],
[55.91111111],
[29. ],
[88. ],
[99. ],
[75. ],
[91. ],
[10. ],
[23. ],
[23. ],
[95. ],
[74. ],
[79. ],
[ 3. ],
[ 5. ],
[64. ],
[34. ],
[44. ],
[10. ],
[82. ],
[51. ],
[80. ],
[13. ],
[73. ],
[92. ],
[69. ],
[25. ],
[93. ],
[85. ],
[47. ],
[99. ],
[97. ],
[29. ],
[17. ],
[62. ],
[56. ],
[83. ],
[57. ],
[60. ],
[45. ],
[49. ],
[92. ],
[ 8. ],
[97. ],
[56. ],
[13. ],
[40. ]])
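The same scikit-learn imputer also handles categorical data via the 'most_frequent' strategy. A minimal sketch, reusing the sample_cat series defined above (converted to a 2-D column, since SimpleImputer expects 2-D input):

from sklearn.impute import SimpleImputer

cat_imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
# Fills both NaN entries with the most frequent category, "a"
cat_imp.fit_transform(sample_cat.to_frame())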
Bonus
Fancy Imputer using KNN
In this method, the k nearest neighbors are chosen based on some distance measure, and their average is used as the imputation estimate. The method requires selecting the number of nearest neighbors and a distance metric. KNN can predict both discrete attributes (the most frequent value among the k nearest neighbors) and continuous attributes (the mean of the k nearest neighbors). The distance metric varies according to the type of data:
- Continuous Data: The commonly used distance metrics for continuous data are Euclidean, Manhattan and Cosine
- Categorical Data: The Hamming distance is generally used in this case. It takes all the categorical attributes and, for each one, counts one if the value is not the same between the two points. The Hamming distance is then equal to the number of attributes for which the values differ (see the sketch after this list).
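As a small illustration of the Hamming distance described above, here is a minimal sketch with two made-up rows of categorical values:

# Hamming distance: count the attributes on which two rows disagree
def hamming_distance(row_a, row_b):
    return sum(a != b for a, b in zip(row_a, row_b))

print(hamming_distance(["a", "b", "c"], ["a", "x", "c"]))  # prints 1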
One of the most attractive features of the KNN algorithm is that it is simple to understand and easy to implement. The non-parametric nature of KNN gives it an edge in certain settings where the data may be highly “unusual”.
One of the obvious drawbacks of the KNN algorithm is that it becomes time-consuming when analyzing large datasets because it searches for similar instances through the entire dataset. Furthermore, the accuracy of KNN can be severely degraded with high-dimensional data because there is little difference between the nearest and farthest neighbor.
!pip install fancyimpute
from fancyimpute import KNN
# Use the 3 nearest rows which have a feature to fill in each row's missing features
X_filled_knn = KNN(k=3).fit_transform(data_sample)
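If you prefer to stay within scikit-learn, its KNNImputer provides a similar nearest-neighbor fill. A minimal sketch, assuming the same data_sample DataFrame as above (with a single column this degenerates to a mean-like fill, so in practice you would apply it to a multi-column DataFrame):

from sklearn.impute import KNNImputer

# Fill each missing value using the 3 nearest rows
knn_imp = KNNImputer(n_neighbors=3)
X_filled_sklearn = knn_imp.fit_transform(data_sample)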
fancyimpute also has other methods:
- SimpleFill: Replaces missing entries with the mean or median of each column.
- SoftImpute: Matrix completion by iterative soft thresholding of SVD decompositions. Inspired by the softImpute package for R, which is based on Spectral Regularization Algorithms for Learning Large Incomplete Matrices by Mazumder et al.
- IterativeImputer: A strategy for imputing missing values by modeling each feature with missing values as a function of other features in a round-robin fashion. A stub that links to scikit-learn's IterativeImputer.
- IterativeSVD: Matrix completion by iterative low-rank SVD decomposition. Should be similar to SVDimpute from Missing value estimation methods for DNA microarrays by Troyanskaya et al.
- MatrixFactorization: Direct factorization of the incomplete matrix into low-rank U and V, with an L1 sparsity penalty on the elements of U and an L2 penalty on the elements of V. Solved by gradient descent.
- NuclearNormMinimization: Simple implementation of Exact Matrix Completion via Convex Optimization by Emmanuel Candes and Benjamin Recht using cvxpy. Too slow for large matrices.
- BiScaler: Iterative estimation of row/column means and standard deviations to get a doubly normalized matrix. Not guaranteed to converge but works well in practice. Taken from Matrix Completion and Low-Rank SVD via Fast Alternating Least Squares.
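As a concrete example of the round-robin strategy mentioned above, here is a minimal sketch using scikit-learn's IterativeImputer (the estimator that fancyimpute's stub points to); note that it is still experimental in scikit-learn and must be enabled explicitly, and data_sample is the single-column DataFrame from earlier:

from sklearn.experimental import enable_iterative_imputer  # required before importing IterativeImputer
from sklearn.impute import IterativeImputer

# Model each column with missing values as a function of the other columns
it_imp = IterativeImputer(max_iter=10, random_state=0)
X_filled_iterative = it_imp.fit_transform(data_sample)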
Random Forest Imputation using missingpy
MissForest imputes missing values using Random Forests in an iterative fashion [1]. By default, the imputer begins imputing missing values of the column (which is expected to be a variable) with the smallest number of missing values – let's call this the candidate column. The first step involves filling any missing values of the remaining, non-candidate, columns with an initial guess, which is the column mean for columns representing numerical variables and the column mode for columns representing categorical variables. Note that the categorical variables need to be explicitly identified during the imputer's fit() method call (see the API for more information).
After that, the imputer fits a random forest model with the candidate column as the outcome variable and the remaining columns as the predictors over all rows where the candidate column values are not missing. After the fit, the missing rows of the candidate column are imputed using the prediction from the fitted Random Forest. The rows of the non-candidate columns act as the input data for the fitted model. Following this, the imputer moves on to the next candidate column with the second smallest number of missing values from among the non-candidate columns in the first round. The process repeats itself for each column with a missing value, possibly over multiple iterations or epochs for each column, until the stopping criterion is met. The stopping criterion is governed by the “difference” between the imputed arrays over successive iterations.
!pip install missingpy
from missingpy import MissForest
# Iteratively impute with random forests until the stopping criterion is met
imputer = MissForest()
imputer.fit_transform(data_sample)
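As noted above, categorical columns have to be flagged when fitting. A hedged sketch based on missingpy's documented cat_vars argument (verify against your installed version); X is a hypothetical mixed-type array in which column index 2 holds integer-encoded categories:

imputer = MissForest()
# cat_vars marks column 2 as categorical so it is treated as a classification target
X_imputed = imputer.fit_transform(X, cat_vars=[2])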