Introduction

In many data sets, analysts encounter multilevel data where observations are clustered in non-random ways. One of the most common examples of multilevel data is in education, where students are clustered within classrooms or schools. Furthermore, those schools may be clustered within districts or states. This clustering introduces a correlational structure into the data that, if ignored, can lead to biased standard errors (Holmes Finch et al., 2019; Leite et al., 2015; Li et al., 2013) and treatment effect estimates (Leite et al., 2015; Li et al., 2013). Additionally, as Holmes Finch and colleagues (2019) note, “another problem when ignoring the multilevel structure of data is that we may miss important relationships involving each level in the data” (p. 29).

In multilevel quasi-experimental and observational studies, researchers want to ensure that their groups of interest (e.g., a treatment and control group) are appropriately comparable. Propensity score weighting (PSW) is one method to better “balance the covariate distribution between groups” prior to some analysis (Li et al., 2013, p. 3). The multilevel data structure can be incorporated when estimating the propensity score weights, when applying the weights to the analysis (e.g., to estimate a treatment effect), or both. Recent work (e.g., Kim & Steiner, 2015; Leite et al., 2015; Li et al., 2013; Schuler et al., 2016) illustrates a variety of methods for applying PSW to multilevel data. By accounting for the data’s multilevel structure in the propensity score (PS) model, the propensity score weighting appropriately “accounts for all sources of selection bias” that may influence the treatment or grouping variable (Leite et al., 2015, p. 266).

This vignette provides a basic introduction to the multilevel PSW approach in R (R Core Team, 2021). The goal is to introduce the reader to these methods, rather than provide an exhaustive review of all possible approaches. A wide range of estimators, scaling mechanisms, and other analytic techniques are available. For simplicity, this vignette follows the methods proposed by Leite and colleagues (2015), who applied multilevel PSW to data from both simulations and the Early Childhood Longitudinal Study - Kindergarten Cohort. The approach presented here is primarily for “estimating individual-level treatment effects with multilevel data” (Leite et al., 2015, p. 266). Specifically, they demonstrated how to estimate the average treatment effect on the treated (ATT).

Example with SGPdata

This example uses simulated data available in the SGPdata package (Betebenner et al., 2021). I extend the analysis from the previous vignette, “Applying Propensity Score Weighting to Compare Cross-Sectional Assessment Data,” where I compared the mean mathematics scale scores between fifth graders in 2019 and fifth graders in 2021. This research aim was selected to match Dadey’s (2021) tutorial on the Fair Trend metric (Ho, 2021).

Data Set-Up

I begin by loading the sgpData_LONG_COVID data set from the SGPdata package and subsetting observations from grade 5 mathematics within the two years of interest.

# Load libraries
require(pacman)
pacman::p_load(data.table, dplyr, ggplot2, Hmisc, lme4, SGPdata, twang, WeMix)

# Set seed
set.seed(61121)

# Grade and content area for subsetting data
gradeval = 5
contentarea = "MATHEMATICS"

# Subset data
psw.data = sgpData_LONG_COVID[GRADE == gradeval & CONTENT_AREA == contentarea &
                              (YEAR == "2019" | YEAR == "2021"), ]

With these data, students are nested within schools, which are subsequently nested within districts. For demonstration purposes, I will consider a two-level model, with students at Level 1 and schools at Level 2.

# Vector of school numbers
schoolnum = unique(psw.data$SCHOOL_NUMBER)

# Summary of cluster sizes
nbyschool = psw.data[, .(.N), keyby = .(SCHOOL_NUMBER)]
summary(nbyschool$N)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.00   47.00   75.00   84.84  112.50  248.00

There are a total of 187 unique schools in the simulated data. Aggregating across both 2019 and 2021, the cluster size (where a cluster represents a school) ranges from 10 to 248.
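This count can be confirmed directly from the vector of school numbers created above:

# Number of unique schools
length(schoolnum)
## [1] 187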

Estimate Propensity Scores

Leite and colleagues (2015) discuss multiple methods to estimate and apply propensity scores while accounting for data clustering. First, one may either pool cluster-specific propensity score (PS) models or use a single PS model for the entire sample (Leite et al., 2015, p. 267):

  • Pool Cluster-Specific Effects: “Consists of using a PS method with potentially different PS models separately for each cluster and then combine the ATT estimates across clusters weighting by cluster size.”
  • Marginal Effects: “Consists of using a PS method with a single PS model for the entire data to obtain the ATT.”

The current analysis takes the second approach, which “is amenable to more general use with multilevel observational data with a variety of cluster sizes, including small clusters” (p. 267). Then, there are two proposed models for estimating the propensity scores:

  • Fixed-Cluster Effects Logistic Regression: Regresses the grouping variable on the selected covariates, including dummy-coded variables for each cluster (Equation 2).
  • Random Intercept and/or Slope Logistic Regression: Rather than dummy-coded cluster variables, includes random intercepts and potentially random slopes for the covariates of interest (Equation 1).

Leite and colleagues (2015) present extensive simulation results comparing these two models, with some evidence of negligible differences in the estimates of interest (p. 277). For illustration, the random intercept model will be used here to estimate the propensity scores.

To fit this model, I use the lme4::glmer function (Bates et al., 2015). The grouping variable is defined such that a 1 indicates an observation from 2019, and a 0 indicates an observation from 2021. Eight dummy-coded covariates are included, which are all measured at the individual level. Cluster-level covariates can also be included in the model if available. Before running the glmer model, I clean the data to create a dummy-coded data frame using model.matrix.

# Create model matrix
# [,-1] removes the first column, which is a unit vector for the intercept
psw.data.mm = data.frame(model.matrix(
  ~ YEAR + SCALE_SCORE + ETHNICITY + FREE_REDUCED_LUNCH_STATUS + ELL_STATUS +
    IEP_STATUS + GENDER + SCHOOL_NUMBER,
  data = psw.data
))[,-1]

# Relevel YEAR so 1 = 2019 and 0 = 2021
psw.data.mm$YEAR2019 = 1 - psw.data.mm$YEAR2021

# Rename variables
names(psw.data.mm) = c("YEAR2021", "SCALE_SCORE", "ETH_ASIAN", "ETH_HISP", "ETH_OTH",
                        "ETH_WHITE", "FRL", "ELL", "IEP", "MALE", "SCHOOL_NUMBER", "YEAR2019")

Both random intercepts and slopes can be included in the model for estimating propensity scores. Here, I only include a random intercept by school number, allowing the intercepts to vary based on school. Random slopes are not included due to the small number of levels for each covariate. This process follows the analysis proposed by Leite and colleagues (2015).

# Random intercept model
# Leite et al., 2015, Equation 1
ps.mod.randomint = glmer(
  
  # Grouping variable
  YEAR2019 ~ 
    
    # Fixed effects for individual covariates  
    ETH_ASIAN + ETH_HISP + ETH_OTH +  ETH_WHITE + FRL + ELL + IEP + MALE +
    
    # Random intercept by school
    (1 | SCHOOL_NUMBER),
  
  # Data
  data = psw.data.mm,
  
  # Link function
  family = "binomial"
  
)
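For reference, the fixed-cluster-effects alternative (Equation 2) could be fit with standard logistic regression by adding dummy-coded school indicators. A minimal sketch, not run here (with 187 schools, this estimates 186 additional intercept terms; the ps.mod.fixedeff name is illustrative):

# Fixed-cluster-effects alternative (Leite et al., 2015, Equation 2)
# factor(SCHOOL_NUMBER) expands into dummy-coded cluster indicators
ps.mod.fixedeff = glm(
  YEAR2019 ~ ETH_ASIAN + ETH_HISP + ETH_OTH + ETH_WHITE + FRL + ELL + IEP + MALE +
    factor(SCHOOL_NUMBER),
  data = psw.data.mm,
  family = "binomial"
)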

I next extract the predicted probabilities, which are the propensity scores, using the predict function.

# Extract predicted probability
ps.prob = predict(ps.mod.randomint, newdata = psw.data.mm, type = "response")

Leite and colleagues (2015) explain that one should next evaluate the common support of the propensity scores by “using histograms to compare the distribution of the treated and the untreated” (p. 278); overlapping density plots serve the same purpose here. In the plot below, a 1 indicates that the observation is from the 2019 sample.

# Overlapping density plots of propensity scores by group
data.frame(ps.prob, psw.data.mm$YEAR2019) %>%
  ggplot(aes(ps.prob, fill = factor(psw.data.mm$YEAR2019))) + 
  geom_density(alpha = 0.5) + 
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "Propensity Score", y = "Density", fill = "2019") 

The authors do not provide much guidance on how to determine whether the common support assumption is met. The figure above suggests a large degree of overlap in the densities between observations in 2019 (1) and 2021 (0).
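Beyond visual inspection, one simple numeric check is to compare the range of propensity scores between groups and the share of 2019 observations falling outside the 2021 range. A minimal sketch (output omitted):

# Range of propensity scores by group (1 = 2019, 0 = 2021)
tapply(ps.prob, psw.data.mm$YEAR2019, range)

# Proportion of 2019 observations outside the 2021 propensity score range
range.2021 = range(ps.prob[psw.data.mm$YEAR2019 == 0])
mean(ps.prob[psw.data.mm$YEAR2019 == 1] < range.2021[1] |
       ps.prob[psw.data.mm$YEAR2019 == 1] > range.2021[2])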

Compute Propensity Score Weights

I next estimate the (unscaled) PSW values as (Leite et al., 2015, Equation 5)

\[ w_i = G_i + (1 - G_i)\frac{P(G_i = 1 | X)}{1 - P(G_i = 1 | X)} \]

where \(G_i\) is the grouping variable value for observation \(i\) such that \(G_i = 1\) if observation \(i\) is in the 2019 sample (\(i = 1,\ldots,N\)) and \(P(G_i = 1 | X)\) is the probability of observation \(i\) being in the 2019 sample given covariates \(X\). This probability, the propensity score, is calculated using the random intercept model above.

# Estimate unscaled weights from random intercept model
psw.unscaled = psw.data.mm$YEAR2019 + (1 - psw.data.mm$YEAR2019)*(ps.prob / (1 - ps.prob))

One can examine covariate balance with the estimated weights, such as by comparing weighted standardized mean differences between the two samples (Leite et al., 2015, p. 279). To do so, I use the weighted.mean function in base R as well as the wtd.var function from the Hmisc package (Harrell Jr. et al., 2021).

# Compute weighted standardized mean differences for each covariate
dweight = numeric(8)
for(i in 1:8) {
  
  wm.2019 = weighted.mean(psw.data.mm[psw.data.mm$YEAR2019==1, i+2], 
                          psw.unscaled[psw.data.mm$YEAR2019==1])
  wm.2021 = weighted.mean(psw.data.mm[psw.data.mm$YEAR2019==0, i+2], 
                          psw.unscaled[psw.data.mm$YEAR2019==0])
  wsd = sqrt(wtd.var(psw.data.mm[, i+2], psw.unscaled))
  dweight[i] = (wm.2021 - wm.2019) / wsd
  
}
round(dweight, 3)
## [1]  0.023  0.004  0.012 -0.020  0.000 -0.003  0.004 -0.005

Leite and colleagues (2015, p. 279) suggest that the weighted standardized difference should be less than 0.01 in magnitude. Note here that three covariates exceed this threshold. One should consider respecifying the propensity score model to obtain better balance. For now, I will continue with these weights.
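For context, the same standardized mean differences can be computed without weights to gauge how much balance the weighting achieved. A quick sketch (output omitted):

# Unweighted standardized mean differences, for comparison
dunweighted = numeric(8)
for(i in 1:8) {
  m.2019 = mean(psw.data.mm[psw.data.mm$YEAR2019 == 1, i+2])
  m.2021 = mean(psw.data.mm[psw.data.mm$YEAR2019 == 0, i+2])
  dunweighted[i] = (m.2021 - m.2019) / sd(psw.data.mm[, i+2])
}
round(dunweighted, 3)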

There are additional methods to scale the propensity score weights. The three used in Leite and colleagues (2015) are (p. 271):

  • Normalized weights: Multiply weights by \(\lambda = n/\sum_{ij} w_{ij}\), where \(j\) indexes the cluster.
  • Cluster-normalized weights: Multiply weights by \(\lambda_{1j} = n_j/\sum_{i} w_{ij}\), summing within cluster \(j\).
  • Effective weights: Multiply weights by \(\lambda_{2j} = \sum_{i} w_{ij}/\sum_{i} w^{2}_{ij}\).

The first method scales “weights to sum to the total sample size.” The second and third “scale weights so that the within cluster weights sum to the cluster size [or] the effective sample size,” respectively (p. 271). There is not strong evidence for preferring one scaling method to another, so all three are presented here.

# Scale weights: Normalized
n = nrow(psw.data.mm)
psw.normalized = psw.unscaled * (n / sum(psw.unscaled))

# Scale weights: Cluster-normalized and effective
psw.cnormalized = psw.effect = psw.unscaled
for(i in 1:length(schoolnum)) {
  
  s = schoolnum[i]
  j = which(psw.data.mm$SCHOOL_NUMBER == s)
  nj = length(j)
  sum.wij = sum(psw.unscaled[j])
  sum.wij2 = sum(psw.unscaled[j]^2)
  psw.cnormalized[j] = psw.cnormalized[j] * (nj / sum.wij)
  psw.effect[j] = psw.effect[j] * (sum.wij / sum.wij2)

}

# Append weights to data frame
psw.data.mm = cbind(psw.data.mm, psw.unscaled, psw.normalized, psw.cnormalized, psw.effect)
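As a quick sanity check on the scaling, the normalized weights should sum to the total sample size, and each cluster’s cluster-normalized weights should sum to its cluster size; both differences below should be essentially zero.

# Check: normalized weights sum to the total sample size
sum(psw.normalized) - n

# Check: cluster-normalized weights sum to the cluster size (first school shown)
j1 = which(psw.data.mm$SCHOOL_NUMBER == schoolnum[1])
sum(psw.cnormalized[j1]) - length(j1)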

Integrating Weights into the Analysis

Next, we apply our weights to the analysis, regressing scale scores on the grouping variable. Leite and colleagues (2015) fit a multilevel pseudo-maximum likelihood model in Mplus. In R, one can use the WeMix package (Bailey et al., 2021) to incorporate sampling weights into multilevel models. Note that lme4 does not currently accommodate sampling weights (the weights argument is instead for precision weights).

Numerous model specifications are available. Here, I present a model with both a random intercept and slope for school number. I fit a model with each of the four weight types computed (i.e., unscaled weights and three scaling methods). Note that the WeMix::mix function (used to fit these weighted multilevel models) requires a set of weights at both Levels 1 and 2. The Level 2 weights for school are simply a unit vector, as the propensity score weights computed above are at the student level.

# Create Level 2 weights
# Simply a unit vector
psw.data.mm$wL2 = 1

# Random intercept/slope model, unscaled weights
mod.unscaled = mix(SCALE_SCORE ~ YEAR2019 + (1 + YEAR2019 | SCHOOL_NUMBER), 
                   data = psw.data.mm, weights = c("psw.unscaled", "wL2"))

# Random intercept/slope model, normalized weights
mod.norm = mix(SCALE_SCORE ~ YEAR2019 + (1 + YEAR2019 | SCHOOL_NUMBER), 
               data = psw.data.mm, weights = c("psw.normalized", "wL2"))

# Random intercept/slope model, cluster-normalized weights
mod.cnorm = mix(SCALE_SCORE ~ YEAR2019 + (1 + YEAR2019 | SCHOOL_NUMBER), 
                data = psw.data.mm, weights = c("psw.cnormalized", "wL2"))

# Random intercept/slope model, effective weights
mod.effect = mix(SCALE_SCORE ~ YEAR2019 + (1 + YEAR2019 | SCHOOL_NUMBER), 
                 data = psw.data.mm, weights = c("psw.effect", "wL2"))

We can then extract the estimated mean difference from the coefficient output.

# Extract estimated mean differences
cat("Unscaled: ", round(mod.unscaled$coef[[2]], 3),
    "\nNormalized: ", round(mod.norm$coef[[2]], 3),
    "\nCluster-normalized: ", round(mod.cnorm$coef[[2]], 3),
    "\nEffective: ", round(mod.effect$coef[[2]], 3))
## Unscaled:  7.519 
## Normalized:  7.519 
## Cluster-normalized:  7.482 
## Effective:  7.479

Let’s compare these (very similar) estimates to a multilevel model without weights.

# Fit unweighted model
psw.data.mm$wL1 = 1
mod.unweighted = mix(SCALE_SCORE ~ YEAR2019 + (1 + YEAR2019 | SCHOOL_NUMBER), 
                     data = psw.data.mm, weights = c("wL1", "wL2"))

# Model summary
round(mod.unweighted$coef[[2]], 3)
## [1] 6.681
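Extracting and comparing standard error estimates across models may also be of interest. A quick sketch (output omitted), assuming that summary() for a WeMix model returns a coefficient table with a “Std. Error” column, as in recent versions of the package:

# Compare standard errors of the YEAR2019 coefficient across the fitted models
# (assumes summary(m)$coef contains a "Std. Error" column)
mods = list(unweighted = mod.unweighted, unscaled = mod.unscaled,
            normalized = mod.norm, cnormalized = mod.cnorm, effective = mod.effect)
sapply(mods, function(m) summary(m)$coef["YEAR2019", "Std. Error"])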

One can also construct a “doubly robust” model (see Bang & Robins, 2005; Funk et al., 2011; Leite et al., 2015; Yang & Schafer, 2007) by including the covariates as fixed effects in the random intercept and slope models above. Below is an example using the normalized weights.

# Example of a doubly-robust model with normalized weights
mod.doublyrobust = mix(SCALE_SCORE ~ YEAR2019 + ETH_ASIAN + ETH_HISP + ETH_OTH +  
                                     ETH_WHITE + FRL + ELL + IEP + MALE +
                                     (1 + YEAR2019 | SCHOOL_NUMBER), 
                       data = psw.data.mm, weights = c("psw.normalized", "wL2"))

# Extract estimated mean difference
round(mod.doublyrobust$coef[[2]], 3)
## [1] 7.378

In their example, Leite and colleagues (2015) used a continuous covariate (prior assessment scores). They also included a covariate by treatment interaction, noting that “Schafer and Kang (2008) have shown that [the treatment effect model estimate] will only be equal to the ATT in models including covariates if the covariates are centered around the mean of the covariates for the treated and the treatment by covariate interaction is included in the model” (p. 280).
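Although not part of the analysis above, that recommendation can be sketched with the present data. The hypothetical example below (the FRL.c variable and mod.att object names are illustrative) centers a single covariate, FRL, around its mean among the treated (2019) observations and adds the treatment-by-covariate interaction; the same pattern would extend to the remaining covariates.

# Hypothetical sketch: center FRL at the 2019 (treated) group mean and
# include the treatment-by-covariate interaction (Schafer & Kang, 2008)
psw.data.mm$FRL.c = psw.data.mm$FRL - mean(psw.data.mm$FRL[psw.data.mm$YEAR2019 == 1])
mod.att = mix(SCALE_SCORE ~ YEAR2019*FRL.c + (1 + YEAR2019 | SCHOOL_NUMBER),
              data = psw.data.mm, weights = c("psw.normalized", "wL2"))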

Comparison to Non-Hierarchical Model

What happens if we ignore the multilevel data structure altogether? Here I replicate the analysis from the previous vignette, using the twang package (Cefalu et al., 2021) to estimate propensity scores with a gradient boosting model.

# Estimate propensity scores
ps.twang = ps(YEAR2019 ~ ETH_ASIAN + ETH_HISP + ETH_OTH + ETH_WHITE +
              FRL + ELL + IEP + MALE, data = psw.data.mm,
              n.trees = 15000, interaction.depth = 2,
              shrinkage = 0.01, estimand = "ATT", 
              stop.method = c("es.mean"), verbose = FALSE)

# Extract weights
psw.twang = get.weights(ps.twang, stop.method = "es.mean")

# Simple linear regression to estimate mean difference
lm.mod = lm(SCALE_SCORE ~ YEAR2019, data = psw.data.mm, weights = psw.twang)

# Model summary
summary(lm.mod)
## 
## Call:
## lm(formula = SCALE_SCORE ~ YEAR2019, data = psw.data.mm, weights = psw.twang)
## 
## Weighted Residuals:
##      Min       1Q   Median       3Q      Max 
## -165.713  -23.134   -0.134   20.866  151.720 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 498.6647     0.3918 1272.64   <2e-16 ***
## YEAR2019      7.4688     0.5541   13.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35.01 on 15863 degrees of freedom
## Multiple R-squared:  0.01132,    Adjusted R-squared:  0.01126 
## F-statistic: 181.7 on 1 and 15863 DF,  p-value: < 2.2e-16

The estimated mean difference in scale scores has a magnitude of 7.469.

Further Analyses

A further step after running the analysis is to conduct sensitivity analyses. However, Leite and colleagues (2015) note that sensitivity analysis for multilevel PSW is an active research area. Additionally, although not presented in great detail here, it is important that researchers thoroughly investigate the assumptions underlying these methods (see Leite et al., 2015, pp. 266-267).

References

  • Bailey, P., Kelley, C., Nguyen, T., & Huo, H. (2021). WeMix: Weighted mixed-effects models using multilevel pseudo maximum likelihood estimation. R package version 3.1.8. https://CRAN.R-project.org/package=WeMix
  • Bang, H., & Robins, J. M. (2005). Doubly robust estimation in missing data and causal inference models. Biometrics, 61, 962-972. https://doi.org/10.1111/j.1541-0420.2005.00377.x
  • Bates, D., Maechler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1-48. https://doi.org/10.18637/jss.v067.i01
  • Betebenner, D. W., Van Iwaarden, A. R., & Domingue, B. (2021). SGPdata: Exemplar data sets for student growth percentile (SGP) analyses. R package version 25.1-0.0. https://centerforassessment.github.io/SGPdata/
  • Cefalu, M., Ridgeway, G., McCaffrey, D., Morral, A., Griffin, B. A., & Burgette, L. (2021). twang: Toolkit for weighting and analysis of nonequivalent groups. R package version 2.1. https://CRAN.R-project.org/package=twang
  • Dadey, N. (2021, May 3). Exploring Ho’s (2021) Fair Trend metric. Retrieved from https://centerforassessment.github.io/cfaTools/articles/fair_trend.html
  • Funk, M. J., Westreich, D., Wiesen, C., Sturmer, T., Brookhart, M. A., & Davidian, M. (2011). Doubly robust estimation of causal effects. American Journal of Epidemiology, 173(7), 761-767. https://doi.org/10.1093/aje/kwq439
  • Harrell Jr., F. E., with contributions from Charles Dupont and others. (2021). Hmisc: Harrell miscellaneous. R package version 4.5-0. https://CRAN.R-project.org/package=Hmisc
  • Ho, A. (2021, February 26). Three test-score metrics that all states should report in the COVID-19-affected spring of 2021. Harvard Graduate School of Education. https://scholar.harvard.edu/files/andrewho/files/threemetrics.pdf
  • Holmes Finch, W., Bolin, J. E., & Kelley, K. (2019). Multilevel modeling using R. CRC Press, Taylor & Francis Group.
  • Kim, J.-S., & Steiner, P. M. (2015). Multilevel propensity score methods for estimating causal effects: A latent class modeling strategy. In L. A. van der Ark, D. M. Bolt, W.-C. Wang, J. A. Douglas, & S.-M. Chow (Eds.), Quantitative psychology research: Springer Proceedings in Mathematics and Statistics (Vol. 140). Springer. https://doi.org/10.1007/978-3-319-19977-1_21
  • Leite, W. L., Jimenez, F., Kaya, Y., Stapleton, L. M., MacInnes, J. W., & Sandbach, R. (2015). An evaluation of weighting methods based on propensity scores to reduce selection bias in multilevel observational studies. Multivariate Behavioral Research, 50(3), 265-284. https://doi.org/10.1080/00273171.2014.991018
  • Li, F., Zaslavsky, A. M., & Landrum, M. B. (2013). Propensity score weighting with multilevel data. Statistics in Medicine, 32(19), 3373-3387. https://doi.org/10.1002/sim.5786
  • R Core Team (2021). R: A language and environment for statistical computing. Vienna, Austria. Retrieved from https://www.R-project.org/
  • Schafer, J. L., & Kang, J. (2008). Average causal effects from nonrandomized studies: A practical guide and simulated example. Psychological Methods, 13, 279–313. https://doi.org/10.1037/a0014268
  • Schuler, M. S., Chu, W., & Coffman, D. (2016). Propensity score weighting for a continuous exposure with multilevel data. Health Services and Outcomes Research Methodology, 16, 271-292. https://doi.org/10.1007/s10742-016-0157-5
  • Yang, J. D. Y., & Schafer, J. L. (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, 22(4), 523-539. https://doi.org/10.1214/07-STS227