Imagine that a teacher wants to know whether giving a lesson about growing plants causes people to grow longer plants. They decide to use two groups of people: kids (group A) and their parents (group O). The kids (group A) received a lesson about growing plants, while the parents (group O) did not. Afterwards, each person got a plant and grew it. After 12 weeks, each plant's length was measured.
When the teacher looks at the grown plants, they conclude that group O is better at growing plants. That's when a kid runs up, huffing and puffing: “You can't compare those! That's like comparing apples and oranges!” Then another kid chimes in: “If we didn't water them as much, we'd have done way better than them!”
Now they have a point: watering a plant will certainly have an impact on how it grows, as will sunlight, soil, and other important factors. “So what do we do? Throw out this experiment?” the teacher asks. “Not so fast! We can standardize these and then compare the groups!” you exclaim. So how exactly do we standardize to compare groups A and O?
How do we standardize?
Alright, so we have our question: “Does getting a lesson in plant growing cause your plants to grow longer?” But how exactly do we apply that in this case? Well, first we need to have a look at the data and see what we have.
```r
library(tidyverse)

set.seed(1513) # picked based on runif(1, min = 0, max = 2023)

grp_a <- data.frame(
  age = rnorm(n = 116, mean = 10, sd = 3),
  indoors = rbinom(n = 116, size = 1, prob = 0.3),
  hrs_sunlight = runif(n = 116, min = 2, max = 6),
  watered = rbinom(n = 116, size = 1, prob = 0.4),
  group = 1
)

grp_o <- data.frame(
  age = rnorm(n = 116, mean = 30, sd = 7),
  indoors = rbinom(n = 116, size = 1, prob = 0.8),
  hrs_sunlight = runif(n = 116, min = 8, max = 12),
  watered = rbinom(n = 116, size = 1, prob = 0.6),
  group = 0
)

df <- rbind(grp_a, grp_o) |>
  dplyr::mutate(
    plant_length = rnorm(
      n = 232,
      mean = 1.5 * group + 0.2 * age + 0.3 * indoors + 0.8 * hrs_sunlight - 0.5 * watered,
      sd = 1
    )
  )
```
| Characteristic | Group A (n = 116) | Group O (n = 116) |
|---|---|---|
| Age, Mean (SD) | 10 (2.74) | 30.04 (6.12) |
| Indoors, n (%) | 34 (30.2%) | 99 (85.3%) |
| Hours in Sunlight, Mean (SD) | 3.77 (1.23) | 10.02 (1.18) |
| Adequately watered, n (%) | 46 (39.7%) | 59 (50.9%) |
| Length of plant, Mean (SD) | 6.52 (1.58) | 14.2 (1.89) |
Alright, so now we have our data: the age of each person, whether each plant was grown indoors, the hours of sunlight each plant received, whether it was adequately watered, and its final length. So what? Well, we need to standardize these groups.
You may be thinking of standardization in terms of other methods that are not necessarily statistical per se. For example, say you weighed three rocks: one is 110 kilograms, one is 167 pounds, and one is 25 stone. How could you possibly compare these three? They are in different units! You may be thinking to yourself, “well, we just convert them all to the same units, duh”… exactly! You are standardizing the units. In this case we are doing something similar.
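Just to belabor the point with plain arithmetic (nothing statistical going on here):

```r
# Standardizing units, the everyday version: convert all three rocks to kilograms
# (1 pound = 0.453592 kg, 1 stone = 14 pounds)
rocks_kg <- c(110, 167 * 0.453592, 25 * 14 * 0.453592)
rocks_kg # 110.0, 75.7, 158.8 -- now directly comparable
```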
Standardization for Causal Inference
Standardization models the outcome, whereas inverse probability weighting (IPW) models the treatment (Hernan and Robins 2021). More formally, assuming that no individuals are censored (C = 0), the standardized mean can be written as (Hernan and Robins 2021, 162):
\[
\sum_{l} E[Y \mid A = a, C = 0, L = l] \times \Pr[L = l]
\]
Now, in an ideal world we'd be able to calculate this nonparametrically. To do that, we could calculate the mean outcome in each stratum \(l\) of the confounders \(L\). In our case, one stratum would be the individuals who are 5 years old, grew their plant indoors, had their plant in the sunlight for 2 hours, and adequately watered it. We could do this for every stratum, then take the weighted sum using the formula above.
What if L is continuous?
If \(L\) is continuous in the above formula, then we need to replace \(\Pr[L=l]\) with the probability density function \(f_{L}(l)\) and sum via an integral instead, i.e. \(\int E[Y \mid A=a, C=0, L=l] \, f_L(l) \, dl\) (Hernan and Robins 2021, 162).
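To make the mechanics concrete, here is a minimal sketch of the nonparametric version using only our two binary confounders, indoors and watered. This is illustrative only: age and hrs_sunlight are ignored, so the number it produces is still confounded, not a valid causal estimate.

```r
# Nonparametric standardization over the two binary confounders only
strata_means <- df |>
  dplyr::group_by(indoors, watered) |>
  dplyr::summarise(
    mean_y1 = mean(plant_length[group == 1]), # E[Y | A = 1, L = l]
    mean_y0 = mean(plant_length[group == 0]), # E[Y | A = 0, L = l]
    pr_l = dplyr::n() / nrow(df),             # Pr[L = l]
    .groups = "drop"
  )

# Weighted sums over the strata, per the formula above
with(strata_means, sum(mean_y1 * pr_l) - sum(mean_y0 * pr_l))
```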
Now, as you can imagine, that is a lot of work, and it isn't always possible, especially with real-world data or many covariates: as the number of covariates increases, so does the number of strata, and continuous covariates make matters worse. Luckily for us, we have handy dandy modelling in our statistical toolbox!
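A quick look at our own simulated data shows how fast stratification falls apart once continuous confounders enter:

```r
# Four strata from the two binary confounders alone...
dplyr::count(df, indoors, watered)

# ...but once the continuous confounders are included, essentially every
# person is their own stratum (232 rows, ~232 unique confounder patterns)
nrow(dplyr::distinct(df, age, indoors, hrs_sunlight, watered))
```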
Standardizing the Mean Outcome to the Confounder Distribution
Before we get started, we need to go over the four steps. I'll keep it brief here, but the steps are as follows (Hernan and Robins 2021, 164):

1. Expansion of the data set
2. Modelling
3. Prediction
4. Standardization by averaging
Expansion of data set
Before starting any analysis, we first need data. Here we are going to create three data sets. The first will be the observed data, aka the original data set. We will then create another data set that is very similar to the first, except the group will be set to 0 (no lesson) and the outcome will be treated as missing. You can probably guess what the third data set will be… bingo! The same as the second, but with group set to 1 (lesson).
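As a quick sketch (the bootstrap function further below builds the same three copies internally):

```r
# 1st copy: the observed data
d <- df

# 2nd copy: same as the original, but everyone set to group 0 (no lesson)
# and the outcome set to missing
d0 <- d
d0$group <- 0
d0$plant_length <- NA

# 3rd copy: same again, but everyone set to group 1 (lesson)
d1 <- d
d1$group <- 1
d1$plant_length <- NA
```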
Modelling

Now we can get to the fun stuff! Since our outcome is continuous, we decide to use a generalized linear model with the Gaussian distribution. We can only fit the model to the original data set, since that is the one with the measured outcomes.
Model Assumptions
As a side note, it is important to check the assumptions of the model you are using. For the sake of this post I won't, since I don't want to distract from the goal of explaining standardization, but in practice you should always check the assumptions of your model (in this case, a GLM).
```r
fit <- glm(
  plant_length ~ group + age + indoors + hrs_sunlight + watered,
  family = gaussian(link = "identity"),
  data = df
)
```
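If you do want a quick first look at those assumptions, base R's plot method for fitted models gives the standard residual diagnostics (just a starting point, not a full check):

```r
# Residual diagnostics: residuals vs fitted, Q-Q, scale-location, leverage
par(mfrow = c(2, 2))
plot(fit)
```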
Prediction
Using our handy dandy model, we will now predict the outcome for both data sets with missing outcome data. First we predict the outcome for the data set if everyone were in group O. Next, we predict the outcome for the data set if everyone were in group A.
Standardization by averaging

Now that we have predicted the outcomes, we can calculate the average outcome in each counterfactual data set. Before we do, you might be wondering about the uncertainty of this estimate. I'm glad you asked! To get 95% confidence intervals, we can use bootstrapping.
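Concretely, using the fitted model and the d0 and d1 copies from the expansion step, the point estimate is just two predictions and two means (the bootstrap code below repeats the same calculation inside each resample):

```r
# Step 3: predict the outcome for everyone under each assignment
pred0 <- predict(fit, newdata = d0) # everyone set to group 0 (no lesson)
pred1 <- predict(fit, newdata = d1) # everyone set to group 1 (lesson)

# Step 4: standardize by averaging over the whole confounder distribution
mean(pred1) - mean(pred0) # standardized mean difference
```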
```r
# A special thanks to Joy Shi, Sean McGrath and Tom Palmer for providing this code for free.
# Link here: https://remlapmot.github.io/cibookex-r/standardization-and-the-parametric-g-formula.html#program-13.4
library(boot)

# Function, altered from the above link for our use
standardization <- function(data, indices) {
  # 1st copy: our original data, flagged with interv = -1
  d <- data[indices, ]
  d$interv <- -1
  # 2nd copy: same as original, but with group = 0 and the outcome missing
  d0 <- d
  d0$interv <- 0
  d0$group <- 0
  d0$plant_length <- NA
  # 3rd copy: same as original, but with group = 1 and the outcome missing
  d1 <- d
  d1$interv <- 1
  d1$group <- 1
  d1$plant_length <- NA
  # Combine the three copies into one sample
  d.onesample <- rbind(d, d0, d1)

  # Fit the outcome model on each bootstrap sample
  # (only rows with an observed outcome contribute to the fit)
  fit <- glm(
    plant_length ~ group + age + indoors + hrs_sunlight + watered,
    data = d.onesample
  )

  # Predict the outcome for every row, including the expanded copies
  d.onesample$predicted_meanY <- predict(fit, d.onesample)

  # Standardized means in the two expanded copies, plus their difference.
  # We average within interv == 0 and interv == 1 (not group), so the
  # original rows are not mixed into the standardized means.
  return(c(
    mean(d.onesample$predicted_meanY[d.onesample$interv == 0]),
    mean(d.onesample$predicted_meanY[d.onesample$interv == 1]),
    # Treatment - No Treatment
    mean(d.onesample$predicted_meanY[d.onesample$interv == 1]) -
      mean(d.onesample$predicted_meanY[d.onesample$interv == 0])
  ))
}

# Bootstrap this statistic using our data set
# (R = 5 keeps the post fast to render; in practice use many more replicates, e.g. 1000)
results <- boot(data = df, statistic = standardization, R = 5)

# The standard deviation of the bootstrap distribution gives the standard error
se <- c(sd(results$t[, 1]), sd(results$t[, 2]), sd(results$t[, 3]))
mean <- results$t0 # point estimates from the original sample
ll <- mean - qnorm(0.975) * se # lower limits
ul <- mean + qnorm(0.975) * se # upper limits

finalresults <- data.frame(
  result_title = c("Group O", "Group A", "Difference"),
  mean = round(mean, 3),
  ci = paste0("95% CI: ", round(ll, 3), " - ", round(ul, 3))
)
```
Finally we have our results! The standardized mean difference comes out close to the +1.5 lesson effect we built into the simulation, with a 95% bootstrap confidence interval well away from zero (your exact numbers will vary with the seed and the number of bootstrap replicates). So the original comparison really was apples and oranges: once we standardize over age, indoor growing, sunlight, and watering, the people who received the lesson (the kids) grew longer plants than those who didn't. Feeling smug, the kids start to gloat before one of the parents casually asks, “Are these really valid? What kind of assumptions are you making?”
Assumptions for Standardization
The parent brings up an excellent point, one that we should always keep in mind when trying to make a causal inference. So when are these estimates valid? We can group the assumptions into three categories: the identifiability conditions, correct measurement of the variables, and correct specification of the model (Hernan and Robins 2021, 168). The identifiability conditions are exchangeability, positivity, and consistency. Positivity can be checked empirically, but the other two are matters of judgement. For a more detailed discussion, check out my other blog post, “Intro to Causal Inference”.
The second condition is that the variables used in the analysis are correctly measured. Measurement error in the treatment, the outcome, or the confounders will generally result in bias (see chapter 9 of Hernan and Robins (2021) for more specifics).
Finally, all models need to be correctly specified (see chapter 11 of Hernan and Robins (2021) for more specifics). Of course, these conditions will never hold perfectly; in practice there is almost certainly some form of model misspecification, and several of the assumptions remain matters of judgement, which leaves them open to criticism. It's important to keep these assumptions, and their plausibility, in mind. For example, one critique of our example could be that the intervention is not well defined (a critique I'd agree with; we could define the lesson more precisely).
Positivity Assumption
The positivity assumption is a key assumption for causal inference. Simply put, no individual can have a probability of 0 of receiving a given treatment level. Now, for standardization, we can still use this method when the assumption isn't met. However, we need to be willing to rely on extrapolation (parametric extrapolation, to be exact) that smooths over the strata in which the probability of receiving treatment is 0. If we do this, we introduce bias; specifically, the 95% CIs will cover the true effect less than 95% of the time. For more details, please see Hernan and Robins (2021, 162).
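One way to check positivity empirically is to model the treatment and inspect the fitted probabilities. This propensity model is my own addition here, not part of the analysis above:

```r
# Fitted probabilities at or near 0 or 1 flag positivity problems.
# In our simulated data, age almost completely separates kids from parents,
# so expect extreme fitted values (and likely a glm separation warning):
# a textbook positivity problem.
ps_fit <- glm(group ~ age + indoors + hrs_sunlight + watered,
              family = binomial(link = "logit"), data = df)
summary(fitted(ps_fit))
```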
What about other measures?
Our example used a continuous outcome, so we used the mean difference as our causal estimand. Of course, there are other causal estimands of interest as well. I won't bore you with another example here (maybe in the future). I'd highly recommend checking out Lee and Lee (2022) if you are interested in other estimands, including relative risks and odds ratios; the process is very similar.
IP Weighting or Standardization?
If we were to do both without using any models (i.e., nonparametrically), we would expect the two methods to give exactly the same result. Once models enter the picture, though, they generally won't: the methods model different things (the treatment for IPW, the outcome for standardization), and in practice we can always expect some level of model misspecification. So if they don't give the same answer, how do we pick? The short answer is: we don't.
Large differences between them tell us there is serious misspecification in at least one of the estimates. Small differences may still indicate a problem, just a less serious one.
Basically, what I'm trying to say is: when both methods can be used, just use both. For a taste of the IPW side, see the sketch below.
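Here is a rough sketch of the IPW estimate, reusing the same propensity model as in the positivity check above (again my own addition; in this simulated data the weights will be extreme precisely because positivity is violated, so treat the output as a demonstration only):

```r
# Inverse probability weights: 1/Pr(treated) for the treated,
# 1/Pr(untreated) for the untreated
ps <- fitted(glm(group ~ age + indoors + hrs_sunlight + watered,
                 family = binomial(), data = df))
w <- ifelse(df$group == 1, 1 / ps, 1 / (1 - ps))

# Weighted means in each group; compare with the standardized difference
ipw_diff <- weighted.mean(df$plant_length[df$group == 1], w[df$group == 1]) -
  weighted.mean(df$plant_length[df$group == 0], w[df$group == 0])
ipw_diff
```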
The G-Formula
Enough teasing already! “What does the parametric g-formula have to do with this?!” you may be screaming at your screen right now. If you are, I don't blame you. Alright, so first, for two time points the g-formula is formally expressed as (Naimi, Cole, and Kennedy 2017):

\[
E[Y^{a_0, a_1}] = \sum_{z} E[Y \mid A_0 = a_0, Z_1 = z, A_1 = a_1] \times \Pr[Z_1 = z \mid A_0 = a_0]
\]
At this point you may be wondering, “What in the world does this formula have to do with standardization?” If you are thinking that, props to you! (Side note: I personally always like to ask these types of questions.) The answer is that this is how we calculate our expected values \(E\). The g-formula has broad uses, particularly for time-varying treatments. In our case, we can simply ignore \(A_1\), the treatment at the second time point, since our example has only one time point; doing so collapses the formula into the standardization formula from earlier.
Since we already have our values from standardization, we plug those values into this formula (technically we already did this, but you get the point). We say we are using a plug-in g-formula estimator since we just… you guessed it… plug in the values! Standardization is a plug-in g-formula estimator, and because our estimates came from a parametric model, we call it the parametric g-formula.
What Next?
Now you are free to go standardize away out there in the real world!
References
Hernan, M. A., and J. M. Robins. 2021. Causal Inference: What If. Boca Raton: Chapman & Hall/CRC.
Lee, Sangwon, and Woojoo Lee. 2022. “Application of Standardization for Causal Inference in Observational Studies: A Step-by-Step Tutorial for Analysis Using R Software.” Journal of Preventive Medicine and Public Health 55 (2): 116.
Naimi, Ashley I., Stephen R. Cole, and Edward H. Kennedy. 2017. “An Introduction to g Methods.” International Journal of Epidemiology 46 (2): 756–62.