Missing Data

The Where’s Waldo of Causal Inference

Categories: Missing Data, MCAR, MAR, MNAR

Author: Ryan Batten

Published: November 26, 2023

Missing Data

Missing data are unavoidable when working with data, especially real-world data, so it’s important to have a plan for handling them! To make valid causal inferences we need a solution, but what options do we have? A BUNCH! This post will focus on introducing five common methods:

  • Complete Case

  • Last Observation Carried Forward

  • Mean Value Imputation

  • Conditional Mean Imputation (aka using regression)

  • Multiple Imputation

Before we dive into these, first we need to go over the different types of missing data mechanisms. Arguably the most important aspect of dealing with missing data is figuring this part out.

Missing Data Mechanisms

“How did this data end up becoming missing?” is basically what we are trying to figure out. Was it just random? Or is there a reason? Data can go missing for a variety of reasons: a patient refusing to respond to a certain question, being lost to follow-up, investigator error or certain tests not being ordered, to name just a few.

Typically missing data are grouped into three types of missing data mechanisms: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR) (Austin et al. 2021). Let’s break down each one a little further.

Missing Completely at Random (MCAR)

Data are “missing completely at random” if the probability of a variable being missing is independent of both the observed and unobserved variables for that person (Austin et al. 2021). The key word here is “completely”, implying that the missingness has nothing to do with any of the other variables that we are interested in.

Let’s use an example. Imagine that we are taking a survey of a class of kindergarten children on what type of ice cream they ate. One day a mischievous puppy runs into the classroom, knocks over the box and eats the surveys that fall out. The surveys that are missing are MCAR because the dog didn’t choose which ones to eat! It was just an accident.

Missing at Random (MAR)

Data are “missing at random” if the probability of a variable being missing, after accounting for all the observed variables, is independent of the unobserved data (Austin et al. 2021). So basically the missingness can depend on the observed variables but has nothing to do with the unobserved ones.

Using our same example, imagine that on hot days the kids who prefer mint or vanilla decide to skip school and go to the beach. On the days when the teacher does the survey, the missing responses are more likely to be from kids who prefer those flavors. This would be MAR because the missingness depends on observed information: these kids are more likely to be absent when it’s hot.

Missing Not at Random (MNAR)

Data are considered to be “missing not at random” if they are neither MCAR nor MAR. But that’s not super helpful, so let’s try again. Data are MNAR if the probability of being missing, even after accounting for all the observed variables, still depends on the value of the missing variable itself (Austin et al. 2021).

Using our example again, imagine that some kids like a rare flavor like blueberry-cucumber. These kids are afraid that others might find it strange, so whenever the survey is handed out they don’t submit theirs. This would be MNAR because the missingness is directly related to the unobserved value itself.

MAR vs MNAR

Unfortunately, there is no way to test for MAR vs MNAR, so this needs to be based on expert knowledge (Austin et al. 2021).

Pirate Treasure?

In order to illustrate how these data may look, we need to simulate some data! Imagine that we have a collection of pirate treasure maps. The missing data mechanisms could be thought of like this:

MCAR: Some maps are missing because they were randomly lost during a storm at sea

MAR: Maps are missing for treasures buried on haunted islands because superstitious pirates (whose superstitions are recorded) avoid those places.

MNAR: Maps are missing for the most valuable treasures because the pirates who buried them never shared the locations, fearing theft, and their level of paranoia isn’t recorded. (So the most lucrative treasures are missing)

Let’s simulate some data for our pirate treasure example. We can then use this to show how we might solve these different ways.

Code
library(tidyverse) # Loading tidyverse package

set.seed(123) # setting seed for reproducibility

n = 322 # arbitrarily picked using runif(1, min = 100, max = 500)

# Pirate Data
df <- data.frame(
  pirate_id = 1:n, 
  treasure_value = runif(n = n, min = 1000, max = 5000), 
  haunted_island = rbinom(n = n, size = 1, prob = 0.5),
  pirate_superstition = rbinom(n = n, size = 1, prob = 0.5)
) 

# MCAR: Randomly select 50 observations
mcar_indices <- sample(1:nrow(df), 50)
df$treasure_value_mcar <- df$treasure_value
df$treasure_value_mcar[mcar_indices] <- NA

# MAR: Depends on the value of another variable
df$treasure_value_mar <- df$treasure_value
is_mar <- df$haunted_island & df$pirate_superstition
df$treasure_value_mar[is_mar] <- NA

# MNAR: Higher value, higher chance of being NA
df$treasure_value_mnar <- df$treasure_value
mnar_prob <- df$treasure_value / max(df$treasure_value)
df$treasure_value_mnar[runif(nrow(df)) < mnar_prob] <- NA

head(df) # Viewing the data 
  pirate_id treasure_value haunted_island pirate_superstition
1         1       2150.310              0                   0
2         2       4153.221              1                   1
3         3       2635.908              0                   1
4         4       4532.070              0                   0
5         5       4761.869              1                   0
6         6       1182.226              0                   1
  treasure_value_mcar treasure_value_mar treasure_value_mnar
1            2150.310           2150.310                  NA
2            4153.221                 NA                  NA
3            2635.908           2635.908                  NA
4                  NA           4532.070                  NA
5            4761.869           4761.869            4761.869
6            1182.226           1182.226            1182.226
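Before moving on, a quick check of how much missingness each mechanism actually produced (a small sketch using the columns created above):

Code
# Number of missing treasure values under each mechanism
colSums(is.na(df[, c("treasure_value_mcar", "treasure_value_mar", "treasure_value_mnar")]))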
Code
# Function for selecting the pirate data with the chosen missing-data version
df_missing <- function(type){
  if(!type %in% c("mcar", "mar", "mnar")){
    stop("Error: Please select one of the types of missing data mechanisms (MCAR, MAR, MNAR)")
  }
  df |> 
    dplyr::select(
      pirate_id, 
      treasure_value, 
      haunted_island, 
      pirate_superstition,
      dplyr::all_of(paste0("treasure_value_", type))
    )
}
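
For example, pulling the MAR version of the data looks like this (the same call shows up again in the examples below):

Code
df.mar <- df_missing(type = "mar") # selecting the MAR version of the data
head(df.mar)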

So…what do we do?

Once we know, or make an assumption about, what type of missing data mechanism we have, we need to figure out how to deal with it! This post will go over five methods, although there are a LOT more than that. These are just some common ones that I have come across in my field (clinical epidemiology/biostatistics). Each section will cover what the method is, when to use it, an example and some considerations. Let’s get started!

Covariates vs Outcomes

This post focuses on imputation for covariates, rather than outcomes. The methods are the same, but the plausibility may be different for an outcome variable. Furthermore, it depends on the audience. For example, certain regulatory bodies may prefer several scenarios such as best-case and worst-case imputation.

An Introduction to Methods for Missingness

This is meant to be an introduction to some of these methods. A few of them don’t have much additional detail to learn about (e.g., complete case analysis); however, others, such as multiple imputation, have a litany of literature written about them. For example, there are entire textbooks on multiple imputation.

The goal of this post is to introduce missingness and help identify some potential scenarios for using the different methods while avoiding potential pitfalls.

Complete Cases (Exclude All Missing)

What is it?

“Let’s just get rid of the missing data! Then our problem will be solved!”…not quite. While getting rid of the missing data certainly is a way to deal with missing values, you are also losing good information. However, like any method, there is a time and place for it.

When to Use

Complete case analysis can be valid if we assume that the data are MCAR (Austin et al. 2021). It also depends on how much missing data we have, and the corresponding reduction in sample size. If the amount of missing data is small, for example less than 5%, that is a different situation than if there is 30% missingness.

Example

Code
df.mar <- df_missing(type = "mar") # selecting mar data

cc <- df.mar |> 
  drop_na()

mean(df.mar$treasure_value)
[1] 2995.293
Code
mean(cc$treasure_value_mar)
[1] 3017.684

Considerations

Complete case analysis can be convenient from a data quality point of view; however, it can cause some issues. By not imputing any values, we are only using the “observed” values. This might seem beneficial, but it can cause problems. First, the estimated statistics and regression coefficients may be biased unless the data are MCAR (Austin et al. 2021). Second, there will be reduced precision (i.e., wider confidence intervals) because of the reduction in sample size.

It’s also important to consider which patients are being removed and how many. Is this now the same target population that it was originally? Imagine if we ended up removing 25% of our sample! How would that affect our generalizability? This might change our target population.
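
To make that concrete, here is a quick sketch of checking how many pirates a complete case analysis would drop from the df.mar data created above:

Code
# How many pirates would be excluded by a complete case analysis?
n_total <- nrow(df.mar)
n_missing <- sum(is.na(df.mar$treasure_value_mar))

n_missing # pirates with a missing treasure value
round(100 * n_missing / n_total, 1) # percentage of the sample that would be excluded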

Last Observation Carried Forward

What is it?

Last observation carried forward (LOCF) is very much what it sounds like. We take the last value that was observed and use it.

When to Use

LOCF is used when we have longitudinal data and it is plausible to assume the value has not changed. For example, suppose we need to carry forward the value of sex (male/female). If it is missing at one encounter but was recorded at a previous encounter, then we can reasonably assume it is the same.

Example

Code
# Here we need to have multiple measurements per pirate

n_pirates <- 10   # Number of unique pirates
n_measurements <- 5  # Number of measurements per pirate

# Create the data frame
df.locf <- expand.grid(
  pirate_id = 1:n_pirates,
  measurement_id = 1:n_measurements
) %>%
  mutate(
    treasure_value = runif(n = n_pirates * n_measurements, min = 1000, max = 5000),
    haunted_island = rbinom(n = n_pirates * n_measurements, size = 1, prob = 0.5),
    pirate_superstition = rbinom(n = n_pirates * n_measurements, size = 1, prob = 0.5)
  )



mcar_indices <- sample(1:nrow(df.locf), 15) # randomly select 15 indices
df.locf$treasure_value_mcar <- df.locf$treasure_value
df.locf$treasure_value_mcar[mcar_indices] <- NA # set these values to NA

# View the dataframe

df.locf %>% select(pirate_id, measurement_id, treasure_value_mcar) %>% 
  arrange(pirate_id, measurement_id)
   pirate_id measurement_id treasure_value_mcar
1          1              1                  NA
2          1              2            4811.364
3          1              3            4478.129
4          1              4            2864.015
5          1              5            4792.268
6          2              1                  NA
7          2              2                  NA
8          2              3            1243.352
9          2              4                  NA
10         2              5            1344.689
11         3              1            1716.238
12         3              2            3449.500
13         3              3            1939.598
14         3              4            1948.792
15         3              5                  NA
16         4              1            2356.886
17         4              2            2854.295
18         4              3            4972.062
19         4              4            1932.959
20         4              5            3889.595
21         5              1                  NA
22         5              2            2418.918
23         5              3            3712.791
24         5              4            1922.614
25         5              5            4688.018
26         6              1            2622.166
27         6              2            4529.692
28         6              3                  NA
29         6              4            1246.905
30         6              5            3455.364
31         7              1            3454.238
32         7              2                  NA
33         7              3            2509.240
34         7              4            2988.474
35         7              5            1643.866
36         8              1                  NA
37         8              2                  NA
38         8              3            2844.736
39         8              4                  NA
40         8              5            1731.763
41         9              1            4294.102
42         9              2                  NA
43         9              3                  NA
44         9              4            4032.742
45         9              5            3920.734
46        10              1            2495.928
47        10              2            1860.267
48        10              3                  NA
49        10              4                  NA
50        10              5            3358.103
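
The chunk above only creates the missingness; actually carrying the last observation forward can be done with tidyr’s fill() within each pirate. A minimal sketch, assuming the tidyverse is already loaded:

Code
# Carry the last observed treasure value forward within each pirate (LOCF)
df.locf.filled <- df.locf %>%
  arrange(pirate_id, measurement_id) %>%
  group_by(pirate_id) %>%
  fill(treasure_value_mcar, .direction = "down") %>%
  ungroup()

# Note: a pirate whose very first measurement is missing has nothing to carry
# forward, so that value stays NA after this step
head(df.locf.filled)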

Considerations

The plausibility of LOCF needs to be considered. For example, if we are studying monsters and humans, we can safely assume that a monster will still be a monster at the next visit. Would this be true for the number of planets that a monster has been to? What about a lab value? Is a lab value stable at 30 days? What about 60, 90 or 400? This becomes a balance between what’s clinically plausible and the reduction in sample size (a shorter time window might be more plausible, but the last observed value might fall outside it).

Mean Value Imputation

What is it?

Mean value imputation takes the mean of the values that we do have and uses this wherever there are missing values. As a side note, the mean is sometimes called the expected value in statistics.

When to Use

You probably guessed this already, but we can use it when we have a mean! So this can be used for any variable that has a mean (typically a continuous variable). If the data are MCAR then the effect should be unbiased (Dziura et al. 2013).

Example

Code
df.mcar <- df_missing(type = "mcar")

df.mean.imputation <- df.mcar %>% 
  dplyr::mutate(
    new_treasure_value = dplyr::case_when(
      is.na(treasure_value_mcar) ~ mean(df.mcar$treasure_value_mcar, na.rm = TRUE),
      !is.na(treasure_value_mcar) ~ treasure_value_mcar
    )
  )

Considerations

This method is easy to implement and quick, but there are some things we need to consider. First, the effect estimate should be unbiased, however the standard errors aren’t necessarily, because this method artificially reduces the variation in the data set (Austin et al. 2021). It also ignores relationships with other variables, which might be reasonable, but the plausibility of that needs to be considered (Austin et al. 2021). Finally, the imputed values end up being treated as equal to the values that aren’t missing (Austin et al. 2021), which may be incorrect since we know the observed values were recorded and not imputed.
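
A quick way to see that variance shrinkage with the simulated data (a small sketch reusing the df.mean.imputation object from the example above):

Code
# The mean-imputed variable is typically less spread out than the fully
# observed one, because every missing value was replaced with the same number
sd(df.mean.imputation$treasure_value)
sd(df.mean.imputation$new_treasure_value)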

Conditional Mean Imputation (aka using Regression)

What is it?

Conditional mean imputation is very similar to mean imputation but instead of using the mean, we’ll use the conditional mean. How do we know the conditional mean? By predicting the result based on a model that we’ve fit to the data that we do have.

When to Use

We can use this in similar scenarios to mean value imputation. When the data are MCAR or MAR, the effects are expected to be unbiased (Dziura et al. 2013). This method can incorporate relationships with other variables, which is an advantage over mean value imputation.

Example

Code
df.mar <- df_missing(type = "mar")

mar.regress <- glm(
  treasure_value_mar ~ haunted_island + pirate_superstition, 
  family = gaussian(),
  data = df.mar
)

regress.imputed <- df.mar %>%
  mutate(
    treasure_value_imputed = case_when(
      is.na(treasure_value_mar) ~ predict(mar.regress, newdata = df.mar, type = "response"),
      TRUE ~ treasure_value_mar
    )
  )

mean(regress.imputed$treasure_value)
[1] 2995.293
Code
sd.og <- sd(regress.imputed$treasure_value) # SD of the fully observed treasure values

mean(regress.imputed$treasure_value_imputed)
[1] 3046.162
Code
sd.imputed <- sd(regress.imputed$treasure_value_imputed) # SD after conditional mean imputation

mean(regress.imputed$treasure_value_mar, na.rm = TRUE)
[1] 3017.684
Code
sd(regress.imputed$treasure_value_mar, na.rm = TRUE)
[1] 1111.26
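
The chunks above store the two standard deviations but never print them; putting them side by side makes the comparison concrete:

Code
# Spread of the fully observed variable vs. the conditional-mean-imputed one
c(original = sd.og, imputed = sd.imputed)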

Considerations

This method can artificially amplify the multivariate relationships in the data (Austin et al. 2021). For example, if we assume that age is related to our covariate, then we end up imputing based on the values of age (which may not reflect reality). The imputed values also end up being treated as equal to the values that aren’t missing (Austin et al. 2021), which may be incorrect since we know the observed values were recorded and not imputed.

Multiple Imputation

What is it?

Multiple imputation (MI) imputes a value for each missing value. It draws a plausible value from a distribution of candidate values and imputes it, which creates one complete data set. This is then done again and again until there are multiple complete data sets where the missing values have been filled in using plausible values (Austin et al. 2021). Each of these data sets is used to conduct the analysis, for example fitting a generalized linear model, and then the results are pooled. For our case, we’ll focus on the method known as multivariate imputation by chained equations (MICE). For more detail about it, I highly recommend Austin et al. (2021).

When to Use

MI, based on Austin et al. (2021), can be used when the data are assumed to be MCAR or MAR. The methods can be modified if the data are thought to be MNAR (Van Buuren 2018). This method can be used when it’s assumed that all the plausible values have been captured for the variable of interest.

Example

Code
library(mice) # package used for multiple imputation using chained equations

df.mcar <- df_missing(type = "mcar")

# Performing the imputation. Note: in a real analysis you would usually exclude
# ID variables (like pirate_id) and the fully observed "true" treasure_value
# from the imputation model; they are left in here only because the data are simulated.

mice_mod <- mice(df.mcar, m = 5, maxit = 50, meth = 'pmm', seed = 500)

# Fitting a GLM to each imputed dataset. For this case, imagine that haunted_island 
# is the outcome 

glm_models <- with(mice_mod, glm(haunted_island ~ treasure_value_mcar + pirate_superstition, family = binomial()))

# Pooling the results together 

pooled_results <- pool(glm_models)

# Print the pooled results
broom::tidy(pooled_results) |> 
  select(term, estimate, std.error)

Considerations

There are some considerations for using this. We need to decide if we think all the plausible values have been captured: with predictive mean matching (the method used above), imputed values are only drawn from the observed values. So, for example, if we have red, blue, yellow and green marbles in our population but only blue and green in our sample, then red and yellow won’t be imputed.

We also need to decide how many imputed data sets to create. There are many different ways to ascertain this. White, Royston, and Wood (2011) suggested, as a rule of thumb, that the number should be at least as large as the percentage of subjects with missing data. So if missingness is 22%, then we need at least 22 data sets.
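
Applying that rule of thumb to the simulated MCAR data from the example might look like this (a small sketch, rounding up to be safe):

Code
# Percentage of pirates with a missing treasure value, used as a starting
# point for the number of imputations (m)
pct_missing <- 100 * mean(is.na(df.mcar$treasure_value_mcar))
m_suggested <- ceiling(pct_missing)
m_suggested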

MI is broad, this is an introduction

Multiple imputation is one of those methods that has an extensive amount of literature on it. This was meant to be an introduction but there are full textbooks that focus on the topic.

Summary of Methods

These are a few of the missing data methods that you may come across; however, there is a VAST number of methods for missing data. The key takeaway should be to consider these questions, regardless of the method:

  • What kind of assumptions are required? (i.e., MCAR, MNAR, MAR)

  • What are the tradeoffs? For example, an increase in data quality but a decrease in precision and generalizability? Or an increase in precision but a decrease in replicability?

  • Practically speaking: is this the right use case for the audience? For example, a regulatory body may have different wants than an eighth grader who needs help analyzing her missing pets data.

This was just an introduction, but there are a ton of great resources on each of these methods! There are also a bunch of other methods that are great to learn too (I’m currently still learning some of them, so perhaps they will be a source of future posts!). These include K-Nearest Neighbour Matching, Predictive Mean Matching, Random Forest imputation and much more!

Hopefully this blog post was useful as an introduction to five methods to start, but this is only the beginning! Happy missing data exploring!

References

Austin, Peter C, Ian R White, Douglas S Lee, and Stef van Buuren. 2021. “Missing Data in Clinical Research: A Tutorial on Multiple Imputation.” Canadian Journal of Cardiology 37 (9): 1322–31.
Dziura, James D, Lori A Post, Qing Zhao, Zhixuan Fu, and Peter Peduzzi. 2013. “Strategies for Dealing with Missing Data in Clinical Trials: From Design to Analysis.” The Yale Journal of Biology and Medicine 86 (3): 343.
Van Buuren, Stef. 2018. Flexible Imputation of Missing Data. CRC Press.
White, Ian R, Patrick Royston, and Angela M Wood. 2011. “Multiple Imputation Using Chained Equations: Issues and Guidance for Practice.” Statistics in Medicine 30 (4): 377–99.