DVR TIG Week: Tony Fujs on Can You See What Is Not There? Visualizing Missing Data

Hello! I’m Tony Fujs, Data Scientist at the World Bank. As many fellow data nerds, I love to visualize the data I have in my hands, but I also like to spend time visualizing data that I don’t have: Missing data.

Why would I do that? Because bad handling of missing data can seriously bias the results of a statistical analysis.

Hot Tip:

Removing missing records from a dataset is generally not the right way to handle missing data!
For more information about missing data, and techniques to address it, see:

Baraldi, Amanda N., and Craig K. Enders. “An introduction to modern missing data analyses.” Journal of School Psychology 48.1 (2010): 5-37.

In this post, I will present a common heatmap variation technique used to visualize missing data, and identify potentially harmful missingness patterns.

STEP 1: The data table

Here is a dataset containing fake education records. In this dataset, each row represents a unique student, and contains information about the following variables:

Gender
IQ scores
Reading grades
Math grades

It’s relatively easy to spot missing data in this table, but I want to make the missing cells impossible to ignore. Let’s color each missing data cell in a bright, easy to spot color!

STEP 2: Choose a catchy color to highlight empty cells

Now, I really can’t ignore these empty cells, but I still can’t see – at least not easily – if there is any interesting pattern in the missing data.

Next, I will use some heat map technique to colorize the non-missing cells of this table.

STEP 3: Color the non-empty cells according to their values

I will use a monochrome color scale. Since I am interested in comparing empty cells with non-empty cells, I want to keep the color scheme as simple as possible to facilitate this comparison. The shades of grey highlight the change in values of the non-empty cells, while maintaining a strong contrast with the red empty cells.

Low values are colored in light grey, while high values are colored in dark grey. Since we are only interested in checking missing data patterns, the actual values of the non-empty cells can be hidden.

STEP 4: Remove the values

Some pattern now seems to emerge. But it is stillnot as obvious as it could be…

Let’s reorder the rows according to the values of the IQ scores columns.

STEP 4: Sort columns values

It is now hard to miss the pattern of the missing data. Reading grades are missing for the lower half of the IQ score distribution… Assuming that a correlation exists between IQ scores and reading grades, removing the empty cells from the table would overestimate the average reading grade for this group of students.

Rad Resources:

This visualization can be easily produced using the R package VIM:

http://www.statistik.tuwien.ac.at/forschung/CS/CS-2008-1complete.pdf

It can also be done in Excel using a couple of hacks:

http://policyviz.com/create-a-heatmap-in-excel/

The American Evaluation Association is celebrating Data Visualization and Reporting (DVR) Week with our colleagues in the DVR Topical Interest Group. The contributions all this week to aea365 come from DVR TIG members. Do you have questions, concerns, kudos, or content to extend this aea365 contribution? Please add them in the comments section for this post on the aea365 webpage so that we may enrich our community of practice. Would you like to submit an aea365 Tip? Please send a note of interest to aea365@eval.org. aea365 is sponsored by the American Evaluation Association and provides a Tip-a-Day by and for evaluators.

2 thoughts on “DVR TIG Week: Tony Fujs on Can You See What Is Not There? Visualizing Missing Data”

Herbert Baum
November 28, 2016 at 2:18 pm

Great post. We do not spend enough time discussing data issues like these. I liked the step by step approach and did not see the bias in the original set. Your tools truly allowed me to visualize the bias.

Alicia Moag-Stahlberg
November 22, 2016 at 9:12 pm

Great post – thank you. The article you linked to seems very comprehensive too.
Alicia

2 thoughts on “DVR TIG Week: Tony Fujs on Can You See What Is Not There? Visualizing Missing Data”

Leave a Comment Cancel Reply