In many data sets, analysts encounter multilevel data where observations are clustered in non-random ways. One of the most common examples of multilevel data is in education, where students are clustered within classrooms or schools. Furthermore, those schools may be clustered within districts or states. This clustering introduces a correlational structure into the data that, if ignored, can lead to biased standard error estimates (Holmes Finch et al., 2019; Leite et al., 2015; Li et al., 2013) and biased treatment effect estimates (Leite et al., 2015; Li et al., 2013). Additionally, as Holmes Finch and colleagues (2019) share, “another problem when ignoring the multilevel structure of data is that we may miss important relationships involving each level in the data” (p. 29).
In multilevel quasi-experimental and observational studies, researchers want to ensure that their groups of interest (e.g., a treatment and control group) are appropriately comparable. Propensity score weighting (PSW) is one method to better “balance the covariate distribution between groups” prior to some analysis (Li et al., 2013, p. 3). The multilevel data structure can be incorporated when estimating the propensity score weights, when applying the weights to the analysis (e.g., to estimate a treatment effect), or both. Recent work (e.g., Kim & Steiner, 2015; Leite et al., 2015; Li et al., 2013; Schuler et al., 2016) illustrates a variety of methods for applying PSW to multilevel data. By accounting for the data’s multilevel structure in the propensity score (PS) model, the propensity score weighting appropriately “accounts for all sources of selection bias” that may influence the treatment or grouping variable (Leite et al., 2015, p. 266).
This vignette provides a basic introduction to the multilevel PSW approach in R (R Core Team, 2021). The goal is to introduce the reader to these methods, rather than provide an exhaustive review of all possible approaches. A wide range of estimators, scaling mechanisms, and other analytic techniques are available. For simplicity, this vignette follows the methods proposed by Leite and colleagues (2015), who applied multilevel PSW to data from both simulations and the Early Childhood Longitudinal Study - Kindergarten Cohort. The approach presented here is primarily for “estimating individual-level treatment effects with multilevel data” (Leite et al., 2015, p. 266). Specifically, they demonstrated how to estimate the average treatment effect on the treated (ATT).
SGPdata
This example uses simulated data available in the SGPdata package (Betebenner et al., 2021). I extend the analysis from the previous vignette, “Applying Propensity Score Weighting to Compare Cross-Sectional Assessment Data,” where I compared the mean mathematics scale scores between fifth graders in 2019 and fifth graders in 2021. This research aim was selected to match Dadey’s (2021) tutorial on the Fair Trend metric (Ho, 2021).
I begin by loading the sgpData_LONG_COVID data set from the SGPdata package and subsetting observations from grade 5 mathematics within the two years of interest.
# Load libraries
require(pacman)
pacman::p_load(data.table, dplyr, ggplot2, Hmisc, lme4, SGPdata, twang, WeMix)
# Set seed
set.seed(61121)
# Grade and content area for subsetting data
gradeval = 5
contentarea = "MATHEMATICS"
# Subset data
psw.data = sgpData_LONG_COVID[GRADE == gradeval & CONTENT_AREA == contentarea &
(YEAR == "2019" | YEAR == "2021"), ]
With these data, students are nested within schools, which are subsequently nested within districts. For demonstration purposes, I will consider a two-level model, with students at Level 1 and schools at Level 2.
# Vector of school numbers
schoolnum = unique(psw.data$SCHOOL_NUMBER)
# Summary of cluster sizes
nbyschool = psw.data[, .(.N), keyby = .(SCHOOL_NUMBER)]
summary(nbyschool$N)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.00 47.00 75.00 84.84 112.50 248.00
There are a total of 187 unique schools in the simulated data. Aggregating across both 2019 and 2021, the cluster size (where a cluster represents a school) ranges from 10 to 248.
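The school count reported above can be confirmed directly from the vector of unique school numbers created earlier.
# Number of unique schools (clusters)
length(schoolnum)
## [1] 187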
Leite and colleagues (2015) discuss multiple methods to estimate and apply propensity scores while accounting for data clustering. First, one may either pool cluster-specific propensity score (PS) models or use one model for the full sample that accounts for clustering (Leite et al., 2015, p. 267). The current analysis takes the second approach, which “is amenable to more general use with multilevel observational data with a variety of cluster sizes, including small clusters” (p. 267). There are then two proposed models for estimating the propensity scores: a logistic regression model treating cluster effects as fixed, or a multilevel logistic regression model treating cluster effects as random. Leite and colleagues (2015) present extensive simulation results comparing these two models, with some evidence of negligible differences in the estimates of interest (p. 277). For illustration, the random intercept model will be used here to estimate the propensity scores.
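Concretely, the random intercept model specifies the log-odds of membership in the 2019 group as a function of the individual covariates plus a school-specific deviation. A generic form, consistent in spirit with Leite and colleagues’ (2015) Equation 1, is
\[ \text{logit}\left[P(G_{ij} = 1 \mid \mathbf{x}_{ij})\right] = \beta_0 + \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + u_j, \qquad u_j \sim N(0, \tau^2) \]
where \(G_{ij}\) and \(\mathbf{x}_{ij}\) are the grouping indicator and covariate vector for student \(i\) in school \(j\), and \(u_j\) is the random intercept for school \(j\).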
To fit this model, I use the lme4::glmer function (Bates et al., 2015). The grouping variable is defined such that a 1 indicates an observation from 2019, and a 0 indicates an observation from 2021. Eight dummy-coded covariates are included, all of which are measured at the individual level. Cluster-level covariates can also be included in the model if available. Before running the glmer model, I clean the data to create a dummy-coded data frame using model.matrix.
# Create model matrix
# [,-1] removes the first column, which is a unit vector for the intercept
psw.data.mm = data.frame(model.matrix(
~ YEAR + SCALE_SCORE + ETHNICITY + FREE_REDUCED_LUNCH_STATUS + ELL_STATUS +
IEP_STATUS + GENDER + SCHOOL_NUMBER,
data = psw.data
))[,-1]
# Create a YEAR2019 indicator (1 = 2019, 0 = 2021)
psw.data.mm$YEAR2019 = 1 - psw.data.mm$YEAR2021
# Rename variables
names(psw.data.mm) = c("YEAR2021", "SCALE_SCORE", "ETH_ASIAN", "ETH_HISP", "ETH_OTH",
"ETH_WHITE", "FRL", "ELL", "IEP", "MALE", "SCHOOL_NUMBER", "YEAR2019")
Both random intercepts and slopes can be included in the model for estimating propensity scores. Here, I only include a random intercept by school number, allowing the intercepts to vary based on school. Random slopes are not included due to the small number of levels for each covariate. This process follows the analysis proposed by Leite and colleagues (2015).
# Random intercept model
# Leite et al., 2015, Equation 1
ps.mod.randomint = glmer(
# Grouping variable
YEAR2019 ~
# Fixed effects for individual covariates
ETH_ASIAN + ETH_HISP + ETH_OTH + ETH_WHITE + FRL + ELL + IEP + MALE +
# Random intercept by school
(1 | SCHOOL_NUMBER),
# Data
data = psw.data.mm,
# Link function
family = "binomial"
)
I next extract the predicted probabilities, which are the propensity scores, using the predict function.
# Extract predicted probability
ps.prob = predict(ps.mod.randomint, newdata = psw.data.mm, type = "response")
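As a quick sanity check (not part of the original workflow), one can verify that the predicted probabilities fall strictly between 0 and 1.
# Summarize the estimated propensity scores
summary(ps.prob)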
Leite and colleagues (2015) explain that one should next evaluate the common support of the propensity scores by “using histograms to compare the distribution of the treated and the untreated” (p. 278). In the plot below, a 1 indicates that the observation is from the 2019 sample.
# Overlapping density plots of the propensity scores by group
data.frame(ps = ps.prob, year2019 = factor(psw.data.mm$YEAR2019)) %>%
  ggplot(aes(x = ps, fill = year2019)) +
  geom_density(alpha = 0.5) +
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "Propensity Score", y = "Density", fill = "2019")
The authors do not provide much guidance on how to determine whether the common support assumption is met. The figure above suggests a large degree of overlap in the densities between observations in 2019 (1) and 2021 (0).
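As a rough numeric complement to the plot, one can compare the range of propensity scores within each group; clearly non-overlapping ranges would signal a common support violation. This check is a simple heuristic, not a formal test.
# Compare propensity score ranges by group (0 = 2021, 1 = 2019)
tapply(ps.prob, psw.data.mm$YEAR2019, range)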
I next estimate the (unscaled) PSW values as (Leite et al., 2015, Equation 5)
\[ w_i = G_i + (1 - G_i)\frac{P(G_i = 1 | X)}{1 - P(G_i = 1 | X)} \]
where \(G_i\) is the grouping variable value for observation \(i\) such that \(G_i = 1\) if observation \(i\) is in the 2019 sample (\(i = 1,\ldots,N\)) and \(P(G_i = 1 | X)\) is the probability of observation \(i\) being in the 2019 sample given covariates \(X\). This probability, the propensity score, is calculated using the random intercept model above.
# Estimate unscaled weights from random intercept model
psw.unscaled = psw.data.mm$YEAR2019 + (1 - psw.data.mm$YEAR2019)*(ps.prob / (1 - ps.prob))
One can examine covariate balance with the estimated weights, such as by comparing weighted standardized mean differences between the two samples (Leite et al., 2015, p. 279). To do so, I use the weighted.mean function in base R as well as the wtd.var function from the Hmisc package (Harrell Jr. et al., 2021).
# Compute weighted standardized mean differences for each covariate
# (the eight covariates occupy columns 3-10 of psw.data.mm, hence i + 2)
dweight = numeric(8)
for(i in 1:8) {
wm.2019 = weighted.mean(psw.data.mm[psw.data.mm$YEAR2019==1, i+2],
psw.unscaled[psw.data.mm$YEAR2019==1])
wm.2021 = weighted.mean(psw.data.mm[psw.data.mm$YEAR2019==0, i+2],
psw.unscaled[psw.data.mm$YEAR2019==0])
wsd = sqrt(wtd.var(psw.data.mm[, i+2], psw.unscaled))
dweight[i] = (wm.2021 - wm.2019) / wsd
}
round(dweight, 3)
## [1] 0.023 0.004 0.012 -0.020 0.000 -0.003 0.004 -0.005
Leite and colleagues (2015, p. 279) suggest that the weighted standardized difference should be less than 0.01 in magnitude. Note here that three covariates (the first, third, and fourth) exceed this threshold. One should consider respecifying the propensity score model to obtain better balance, as sketched below; for this vignette, I will continue with these weights.
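As an illustration only, a respecified propensity score model might add interaction terms among the covariates. The FRL-by-ELL interaction below is a hypothetical choice; any added terms should be substantively justified and the balance diagnostics rerun.
# Illustrative respecification with a hypothetical FRL-by-ELL interaction
ps.mod.respec = glmer(
  YEAR2019 ~ ETH_ASIAN + ETH_HISP + ETH_OTH + ETH_WHITE +
    FRL + ELL + IEP + MALE + FRL:ELL +
    (1 | SCHOOL_NUMBER),
  data = psw.data.mm, family = "binomial"
)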
There are additional methods to scale the propensity score weights. Leite and colleagues (2015, p. 271) use three: (a) normalized weights, (b) cluster-normalized weights, and (c) effective sample size weights. The first method scales “weights to sum to the total sample size.” The second and third “scale weights so that the within cluster weights sum to the cluster size [or] the effective sample size,” respectively (p. 271). There is not strong evidence for preferring one scaling method to another, so all three are presented here.
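For reference, the effective sample size of cluster \(j\) is typically computed with Kish’s approximation,
\[ \text{ESS}_j = \frac{\left(\sum_{i \in j} w_{ij}\right)^2}{\sum_{i \in j} w_{ij}^2}, \]
so scaling the within-cluster weights to sum to \(\text{ESS}_j\) amounts to multiplying each weight by \(\sum_{i \in j} w_{ij} / \sum_{i \in j} w_{ij}^2\), which is the factor used in the loop below.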
# Scale weights: Normalized
n = nrow(psw.data.mm)
psw.normalized = psw.unscaled * (n / sum(psw.unscaled))
# Scale weights: Cluster-normalized and effective
psw.cnormalized = psw.effect = psw.unscaled
for(i in 1:length(schoolnum)) {
s = schoolnum[i]
j = which(psw.data.mm$SCHOOL_NUMBER == s)
nj = length(j)
sum.wij = sum(psw.unscaled[j])
sum.wij2 = sum(psw.unscaled[j]^2)
psw.cnormalized[j] = psw.cnormalized[j] * (nj / sum.wij)
psw.effect[j] = psw.effect[j] * (sum.wij / sum.wij2)
}
# Append weights to data frame
psw.data.mm = cbind(psw.data.mm, psw.unscaled, psw.normalized, psw.cnormalized, psw.effect)
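Because the scaling factors are deterministic, quick checks (added here for verification, not part of the original analysis) can confirm the weights behave as intended: the normalized weights should sum to the total sample size, and the cluster-normalized weights should sum to each school’s cluster size.
# Verify scaling behavior
sum(psw.data.mm$psw.normalized)   # should equal n
# Within-school sums of cluster-normalized weights should equal cluster sizes
head(tapply(psw.data.mm$psw.cnormalized, psw.data.mm$SCHOOL_NUMBER, sum))
head(table(psw.data.mm$SCHOOL_NUMBER))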
Next, we apply our weights to the analysis, regressing scale scores on the grouping variable. Leite and colleagues (2015) fit a multilevel pseudo-maximum likelihood model in MPLUS. In R, one can use the WeMix package (Bailey et al., 2021) to incorporate sampling weights into multilevel models. Note that lme4 does not currently accommodate sampling weights (its weights argument is instead for precision weights).
Numerous model specifications are available. Here, I present a model with a random intercept and a random slope for the grouping variable, both varying by school. I fit this model with each of the four weight types computed above (i.e., the unscaled weights and the three scaled versions). Note that the WeMix::mix function (used to fit these weighted multilevel models) requires a set of weights at both Levels 1 and 2. The Level 2 weights for schools are simply a unit vector, as the propensity score weights computed above are at the student level.
# Create Level 2 weights
# Simply a unit vector
psw.data.mm$wL2 = 1
# Random intercept/slope model, unscaled weights
mod.unscaled = mix(SCALE_SCORE ~ YEAR2019 + (1 + YEAR2019 | SCHOOL_NUMBER),
data = psw.data.mm, weights = c("psw.unscaled", "wL2"))
# Random intercept/slope model, normalized weights
mod.norm = mix(SCALE_SCORE ~ YEAR2019 + (1 + YEAR2019 | SCHOOL_NUMBER),
data = psw.data.mm, weights = c("psw.normalized", "wL2"))
# Random intercept/slope model, cluster-normalized weights
mod.cnorm = mix(SCALE_SCORE ~ YEAR2019 + (1 + YEAR2019 | SCHOOL_NUMBER),
data = psw.data.mm, weights = c("psw.cnormalized", "wL2"))
# Random intercept/slope model, effective weights
mod.effect = mix(SCALE_SCORE ~ YEAR2019 + (1 + YEAR2019 | SCHOOL_NUMBER),
data = psw.data.mm, weights = c("psw.effect", "wL2"))
We can then extract the estimated mean difference from the coefficient output.
# Extract estimated mean differences
cat("Unscaled: ", round(mod.unscaled$coef[[2]], 3),
"\nNormalized: ", round(mod.norm$coef[[2]], 3),
"\nCluster-normalized: ", round(mod.cnorm$coef[[2]], 3),
"\nEffective: ", round(mod.effect$coef[[2]], 3))
## Unscaled: 7.519
## Normalized: 7.519
## Cluster-normalized: 7.482
## Effective: 7.479
Let’s compare these (very similar) estimates to a multilevel model without weights.
# Fit unweighted model
psw.data.mm$wL1 = 1
mod.unweighted = mix(SCALE_SCORE ~ YEAR2019 + (1 + YEAR2019 | SCHOOL_NUMBER),
data = psw.data.mm, weights = c("wL1", "wL2"))
# Extract estimated mean difference
round(mod.unweighted$coef[[2]], 3)
## [1] 6.681
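Extracting and comparing standard error estimates across the weighted and unweighted models may also be of interest; for example, each fitted model’s summary prints standard errors alongside the fixed effects.
# Inspect standard errors for the normalized-weight and unweighted models
summary(mod.norm)
summary(mod.unweighted)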
One can also construct a “doubly robust” model (see Bang & Robins, 2005; Funk et al., 2011; Leite et al., 2015; Yang & Schafer, 2007) by including the covariates as fixed effects in the random intercept and slope models above. Below is an example using the normalized weights.
# Example of a doubly-robust model with normalized weights
mod.doublyrobust = mix(SCALE_SCORE ~ YEAR2019 + ETH_ASIAN + ETH_HISP + ETH_OTH +
ETH_WHITE + FRL + ELL + IEP + MALE +
(1 + YEAR2019 | SCHOOL_NUMBER),
data = psw.data.mm, weights = c("psw.normalized", "wL2"))
# Extract estimated mean difference
round(mod.doublyrobust$coef[[2]], 3)
## [1] 7.378
In their example, Leite and colleagues (2015) used a continuous covariate (prior assessment scores). They also included a covariate by treatment interaction, noting that “Schafer and Kang (2008) have shown that [the treatment effect model estimate] will only be equal to the ATT in models including covariates if the covariates are centered around the mean of the covariates for the treated and the treatment by covariate interaction is included in the model” (p. 280).
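To make that concrete, below is a minimal, hypothetical sketch of the centering-plus-interaction specification described by Schafer and Kang (2008), adapted to the current binary covariates. The objects covs, psw.data.c, att.formula, and mod.att are illustrative names, and this is a sketch rather than the exact model Leite and colleagues fit.
# Hypothetical sketch: center covariates at their 2019 (treated-group) means
# and include treatment-by-covariate interactions (Schafer & Kang, 2008)
covs = c("ETH_ASIAN", "ETH_HISP", "ETH_OTH", "ETH_WHITE", "FRL", "ELL", "IEP", "MALE")
psw.data.c = psw.data.mm
for (v in covs) {
  psw.data.c[[v]] = psw.data.c[[v]] - mean(psw.data.c[[v]][psw.data.c$YEAR2019 == 1])
}
# YEAR2019 * (...) expands to main effects plus all treatment-by-covariate interactions
att.formula = as.formula(paste(
  "SCALE_SCORE ~ YEAR2019 * (", paste(covs, collapse = " + "),
  ") + (1 + YEAR2019 | SCHOOL_NUMBER)"))
mod.att = mix(att.formula, data = psw.data.c, weights = c("psw.normalized", "wL2"))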
What happens if we ignore the multilevel data structure altogether? Here I replicate the analysis from the previous vignette, using the twang package (Cefalu et al., 2021) to estimate propensity scores with a gradient boosting model.
# Estimate propensity scores
ps.twang = ps(YEAR2019 ~ ETH_ASIAN + ETH_HISP + ETH_OTH + ETH_WHITE +
FRL + ELL + IEP + MALE, data = psw.data.mm,
n.trees = 15000, interaction.depth = 2,
shrinkage = 0.01, estimand = "ATT",
stop.method = c("es.mean"), verbose = FALSE)
# Extract weights
psw.twang = get.weights(ps.twang, stop.method = "es.mean")
# Simple linear regression to estimate mean difference
lm.mod = lm(SCALE_SCORE ~ YEAR2019, data = psw.data.mm, weights = psw.twang)
# Model summary
summary(lm.mod)
##
## Call:
## lm(formula = SCALE_SCORE ~ YEAR2019, data = psw.data.mm, weights = psw.twang)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -165.713 -23.134 -0.134 20.866 151.720
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 498.6647 0.3918 1272.64 <2e-16 ***
## YEAR2019 7.4688 0.5541 13.48 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35.01 on 15863 degrees of freedom
## Multiple R-squared: 0.01132, Adjusted R-squared: 0.01126
## F-statistic: 181.7 on 1 and 15863 DF, p-value: < 2.2e-16
The estimated mean difference in scale scores is 7.469, close in magnitude to the weighted multilevel estimates above. Keep in mind, however, that this single-level model ignores the clustering when estimating standard errors.
A further step after running the analysis is to conduct sensitivity analyses. However, Leite and colleagues (2015) note that sensitivity analysis for multilevel PSW remains an active research area. Additionally, although not presented in great detail here, it is important that researchers thoroughly investigate the method's assumptions (see Leite et al., 2015, pp. 266-267).