Different Types of Missing Values and Dealing with Them in Tanzanian Water Pump Data

Gustavo Alejandro Chavez
6 min read · Feb 23, 2021

So, data. We love to see it, and hate to see it missing. When working with data, you will inevitably come across missing values. When I was first beginning my data science journey (and arguably still a little today), I would fret over exactly how I wanted to deal with my missing data, constantly going back, changing my approach, and seeing what different effects it had on my models. As I've become more accustomed to seeing missing data, I've learned that there is no single sure-fire way to deal with it: there are many possible approaches, and many different kinds of missing data. First, I will delve into the different kinds of missing data and give examples of how data goes missing. Then, I will show some of the ways I dealt with missing data in my most recent project, where I looked at water pump data from Tanzania.

MCAR — Missing Completely At Random

Where did it come from? Where did it go? Missing Completely at Random means exactly what it sounds like: COMPLETELY random. More precisely, it means that the probability of a value being missing is the same for all cases, which effectively means that the cause of the missing data is entirely unrelated to the data itself. This is typically due to some unrelated circumstance: let's say you're recording weather phenomena with a battery-powered device and it runs out of juice. The measurements missing from that data would be MCAR. When missing data is MCAR, we don't have to worry about any complexities in what it means for a certain value to be missing; we can just chill.

MCAR is the easiest to deal with because, if you can afford to lose the affected rows, you can simply drop them. Since the values are missing completely at random, dropping them should not affect the rest of your dataset or introduce bias into your models, which is why dropping is okay here. MCAR is probably the easiest and most convenient kind of missing data to have; however, it is also often unrealistic.
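Dropping MCAR rows is a one-liner in pandas. Here is a minimal sketch with an invented sensor-reading frame, where the NaNs are assumed to come from random battery failures:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for sensor readings; the NaNs are assumed MCAR
# (e.g. the battery died at random moments).
df = pd.DataFrame({
    "temperature": [21.3, np.nan, 19.8, 22.1],
    "humidity":    [0.41, 0.39, np.nan, 0.44],
})

# Listwise deletion: drop every row that has any missing value.
complete = df.dropna()
print(len(complete))  # 2 of the 4 rows survive
```

`dropna()` returns a new frame, so the original stays intact if you want to compare before and after.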

MAR — Missing At Random

Missing at Random (MAR) means that the probability of a value being missing is the same only when viewed through the observed data we are given. In other words, we can account for the missing data using information we do have, in columns that are complete. While in MCAR the missing and observed data have similar distributions, in MAR the missingness is related to data that we know. For example, say you have a survey asking a variety of people about their dietary restrictions and food likes/dislikes. One question asks about the person's favorite type of cheese, and you notice that respondents who state they have a food allergy (without specifying which one) tend to leave the favorite-cheese question blank. Here, the data is MAR because we can explain the missingness with the other observed data. I know that example was a little contrived, but I hope it made sense!

A good thing about this kind of missing data is that we can determine which variables are related to its missingness and possibly deal with it that way. This is why MAR is the baseline assumption we make when dealing with missing data. There are plenty of ways to handle MAR data, such as mean substitution, regression imputation, and maximum likelihood estimation.
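Mean substitution, the simplest of those, is one line with scikit-learn's SimpleImputer. A minimal sketch on an invented two-column matrix:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix with one missing entry.
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [3.0, 30.0]])

# Mean substitution: replace each NaN with its column's observed mean.
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
print(X_filled[1, 1])  # 20.0, the mean of 10 and 30
```

Swapping `strategy` to `"median"` or `"most_frequent"` gives the other simple substitution flavors; regression imputation needs the IterativeImputer discussed later.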

MNAR — Missing Not At Random

MNAR is the mysterious, elusive missing value that we really do not like to see. When data is MNAR, its missingness depends on information we do not observe, often on the missing values themselves. We recognize that there is a cause for the missing data, something distorting our dataset, but we have no idea what it is.

These are the most complicated cases. Most methods for dealing with them involve gathering more data to find causes that explain the missingness, or performing what-if analyses to measure the sensitivity of your results under different scenarios.
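A what-if analysis can be as simple as re-running your summary statistic under different fill scenarios and seeing how much it moves. A minimal sketch with invented income data, where high earners are suspected of declining to answer:

```python
import numpy as np
import pandas as pd

# Hypothetical incomes with suspected MNAR gaps (high earners may
# decline to answer, so the missing values could be systematically high).
income = pd.Series([30_000.0, 45_000.0, np.nan, 52_000.0, np.nan])

# Fill the gaps under two scenarios and compare the resulting mean.
scenarios = {
    "mar_like":    income.fillna(income.mean()),  # gaps look like the observed data
    "pessimistic": income.fillna(income.max()),   # gaps are all high earners
}
for name, filled in scenarios.items():
    print(name, filled.mean())
```

If the answer barely changes across plausible scenarios, the MNAR problem may not matter much for your conclusion; if it swings wildly, you know the missingness is doing real damage.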

Thankfully, I did not have to deal with MNAR in my dataset.

THE PROJECT

In my most recent project, I looked at data pertaining to water pumps in Tanzania as part of a competition held by DrivenData.org. Here is the link if you, whoever you are, also wish to participate.

Within the dataset, there were many columns that contained missing data.

Funder, installer, subvillage, scheme_management, and scheme_name were all categorical variables with missing values. I decided not to use the latter three in my models because they were redundant with other features that had no missing data, so I simply dropped them from my dataset.
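Dropping those columns is straightforward in pandas. A minimal sketch on an invented stand-in for the training frame (the real dataset has many more columns and rows):

```python
import pandas as pd

# Toy stand-in for the competition's training features.
df = pd.DataFrame({
    "funder":            ["Roman", None],
    "subvillage":        ["A", "B"],
    "scheme_management": ["VWC", None],
    "scheme_name":       [None, "K"],
    "gps_height":        [1390, 1399],
})

# Drop the redundant categorical columns entirely.
df = df.drop(columns=["subvillage", "scheme_management", "scheme_name"])
print(list(df.columns))  # ['funder', 'gps_height']
```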

Since I was not planning to use those columns in my model, dropping them does not affect my future model.

What I most want to talk about is how I dealt with the missing values in permit, public_meeting, and construction_year. As I said earlier, when missing data is encountered, the standard assumption is that it is Missing At Random, meaning its missingness can be explained by observed features within our set. With this in mind, I decided to use the IterativeImputer class from the scikit-learn package.

IterativeImputer can only take in numeric features, and it imputes all missing values within the features you give it. Since both permit and public_meeting were True/False values, I converted them to 1/0 so that I could impute them with IterativeImputer.
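The True/False-to-1/0 conversion can be done with a simple mapping that leaves the NaNs in place for the imputer to fill. A minimal sketch on an invented slice of the permit column:

```python
import numpy as np
import pandas as pd

# Hypothetical slice of the permit column: booleans with a gap.
s = pd.Series([True, False, np.nan, True], name="permit")

# Map True/False to 1/0; values not in the dict (the NaN) stay NaN,
# so the imputer can still find and fill them.
s_num = s.map({True: 1, False: 0})
print(s_num.tolist())  # [1.0, 0.0, nan, 1.0]
```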

After this, I chose all of the continuous variables within my dataset, with the exception of construction_year, and imputed the missing values for permit and public_meeting. Because IterativeImputer predicts each missing value with a regression on the other features, I needed to constrain it with min/max bounds of 0/1 to ensure that the imputed values could be rounded to either 1 or 0 after imputation.

After permit and public_meeting had been imputed and I had saved their new values, I added construction_year to the list of columns and changed the min/max bounds to the min/max found within the construction_year column.
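Putting both passes together, here is a minimal sketch with invented toy values (the real dataset has tens of thousands of rows and more helper columns). Note that IterativeImputer is still experimental in scikit-learn, so it needs the explicit enabling import:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy stand-in for the pump data: two 0/1 flags, a year column, and two
# continuous helper features, all values invented.
df = pd.DataFrame({
    "permit":            [1.0, 0.0, np.nan, 1.0, 0.0, 1.0],
    "public_meeting":    [1.0, np.nan, 1.0, 1.0, 0.0, 1.0],
    "construction_year": [1995.0, np.nan, 2008.0, 1987.0, 2010.0, 2000.0],
    "gps_height":        [1390.0, 1399.0, 686.0, 263.0, 0.0, 1200.0],
    "population":        [109.0, 280.0, 250.0, 58.0, 0.0, 150.0],
})

# Pass 1: impute the binary flags (construction_year left out), with
# imputed values clamped to [0, 1] so they can be rounded back to 0/1.
pass1_cols = ["permit", "public_meeting", "gps_height", "population"]
imp1 = IterativeImputer(min_value=0, max_value=1, random_state=0)
pass1 = pd.DataFrame(imp1.fit_transform(df[pass1_cols]), columns=pass1_cols)
df[["permit", "public_meeting"]] = pass1[["permit", "public_meeting"]].round()

# Pass 2: add construction_year, bounding the imputed values by the
# observed min/max of that column. The flags are already complete, so
# only construction_year gets filled here.
years = df["construction_year"]
imp2 = IterativeImputer(min_value=years.min(), max_value=years.max(),
                        random_state=0)
cols = ["permit", "public_meeting", "construction_year",
        "gps_height", "population"]
df[cols] = imp2.fit_transform(df[cols])
```

The scalar `min_value`/`max_value` bounds apply to every imputed value in a pass, which is why the two passes are kept separate: the flags need [0, 1] while construction_year needs its own range.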

Here are the first 5 values of the imputed features.

Afterwards, I replaced the values within the original dataset with the imputed values.

And voila! All missing values have been dealt with, and my data is clean.

If you have numeric data with some MAR features, IterativeImputer is an easy way to deal with those missing values!

