Simulation Overview

When analyzing educational assessment data, we may wish to compare scale scores across student cohorts. For example, by how much did scale scores change (if at all) between fifth graders in the current year and fifth graders in a previous year? Answers to such questions can provide some indication as to students’ academic trajectories, and whether school performance is improving.

In any year, but perhaps more dramatically in 2021, factors like enrollment changes can shift the sample compositions (e.g., proportions of students receiving free or reduced lunch). These shifts may mean that the cohorts used for score comparisons are not actually that comparable. Differences in samples on a set of important covariates can confound changes in scores over time. When discussing the evaluation of 2021 assessment data, Ho (2021) writes,

“One of the most obvious indicators that schools, districts, and communities need academic support will be substantial declines in their aggregate test scores from recent years. Unfortunately, declines may be caused not by declining academic performance but by large numbers of previously higher scoring students who have left the school system or no longer have comparable scores…Inversely, stable or rising test scores may be caused not by increases in academic performance but by large numbers of previously lower scoring students who have left the school system or no longer have comparable scores” (p. 4).

To ensure an “apples-to-apples” comparison, Ho (2021) proposed a Fair Trend metric that contrasts “all observed scores from [one] year with an appropriate and comparable baseline two years prior” (p. 4). Specifically, if we are comparing scores between students in 2021 and 2019, the Fair Trend metric uses a series of regression models to adjust the 2019 scale scores. These adjustments essentially create a cohort of “academic peers” that is more comparable to the group of students in 2021. For more information on the rationale behind, and the calculation of, the Fair Trend metric, see Ho (2021) and Dadey (2021).
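
To make the adjustment concrete, the sketch below illustrates the core regression step under simplified assumptions; the data frame and column names (historical, current, score_2017, score_2019, score_2021) are placeholders, and the full computation used in this simulation (following Dadey, 2021) appears in the simulation code later in this document.

# Sketch of the Fair Trend adjustment (placeholder data frames and column names)
# 1. In the historical cohort, regress grade 5 (2019) scores on grade 3 (2017) scores
hist_fit = lm(score_2019 ~ score_2017, data = historical)

# 2. Apply those coefficients to the current cohort's grade 3 (2019) scores to
#    create the adjusted "academic peers" baseline
current$fair_trend_2019 = predict(hist_fit,
                                  newdata = data.frame(score_2017 = current$score_2019))

# 3. Contrast the current cohort's grade 5 (2021) scores with the adjusted baseline
mean(current$score_2021) - mean(current$fair_trend_2019)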

An alternative approach to constructing comparable samples is propensity score weighting. Briefly, a propensity score is the probability of belonging to a certain group or sample given a set of covariates (Lee et al., 2010, p. 337; Rosenbaum & Rubin, 1983). Propensity scores can be applied as weights in subsequent analyses, reducing the confounding effects of those covariates and yielding a less biased estimate of the quantity of interest (e.g., the mean scale score difference) between samples (Li et al., 2013). A multitude of propensity score matching and weighting methods are available in the literature (e.g., Desai & Franklin, 2019; Lee et al., 2010; Leite et al., 2015; Weuve et al., 2012).
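
As a conceptual illustration only (the simulation below estimates propensity scores with gradient boosting via the twang package rather than logistic regression), propensity scores for a two-group comparison can be estimated and converted to inverse-probability weights roughly as follows; the data frame dat and the variables group, x1, x2, and y are hypothetical.

# Conceptual sketch: logistic-regression propensity scores and inverse-probability weights
# (hypothetical data frame `dat` with binary group indicator `group`, covariates x1 and x2,
#  and outcome y)
ps_fit = glm(group ~ x1 + x2, data = dat, family = binomial)
dat$pscore = fitted(ps_fit)  # estimated P(group = 1 | covariates)

# ATE-style weights: 1/pscore for the focal group, 1/(1 - pscore) otherwise
dat$psw = ifelse(dat$group == 1, 1/dat$pscore, 1/(1 - dat$pscore))

# Weighted comparison of group means
lm(y ~ group, data = dat, weights = psw)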

Both the Fair Trend metric and propensity score weighting share a similar goal: create “comparable” cross-sectional samples so that the influence of differing sample compositions can be disentangled from the true effect of interest. This simulation directly compares these two approaches when estimating the difference in mean scale scores between two cross-sectional cohorts. The research question is modeled after the example given by Ho (2021) and Dadey (2021), who compare aggregated scores between fifth graders in 2021 and fifth graders in 2019. In particular, we are interested in which method more accurately recovers the true scale score difference between the cohorts. Both methods are also compared to the scenario in which no data adjustment is made and the raw mean scale score difference is computed between the two cohorts.

The current simulation employs two propensity score weighting approaches. Both weighting approaches estimate the propensity scores using a gradient boosting method, where the model has a non-hierarchical functional form (i.e., not accounting for clustering of students within schools). Then, the weights are applied either to a simple ordinary least squares (OLS) regression model (again not accounting for data clustering), or to a two-level regression model with students nested within school. Although there is a growing literature on the use of multilevel models to estimate propensity scores (e.g., Leite et al., 2015), multilevel models were not used here due to model convergence issues.

Note: This is a working document! The simulation design and results may change as bugs are identified, the data-generating model is refined, etc.

Data-Generating Model

To ensure the generalizability of Monte Carlo simulation results, the simulation design should be based on realistic data scenarios encountered in the relevant field. I therefore base this simulation’s data-generating model on the structure of the sgpData_LONG_COVID data set. sgpData_LONG_COVID is a simulated data set available in the SGPdata package (Betebenner et al., 2021), and can be used to prepare and evaluate analyses for status comparisons, student growth percentiles, and other reporting tools within educational assessment.

In these data, we have two cohorts:

  • “Historical” cohort: Students who were in 3rd grade in 2017 and 5th grade in 2019
  • “Current” cohort: Students who were in 3rd grade in 2019 and 5th grade in 2021

The terms “historical” and “current” are used to mirror the terminology from the Fair Trend calculations (Dadey, 2021; Ho, 2021). Although this analysis focuses on the difference between scores when each student was in fifth grade (i.e., a cross-sectional comparison), the Fair Trend metric requires longitudinal data within cohorts to create the “adjusted” scale scores for the group of “academic peers” in the historical cohort (Ho, 2021). In the data-generating model, each student has two time points. Cohort is considered a time-invariant predictor, and students are nested within schools.

Random Intercept Model

The data-generating model is a three-level mixed-effects model, where level 1 models the within-student variation across the repeated measures, level 2 models the between-student variation, and level 3 models the between-school variation. The three levels thus mirror the structure of time nested within students nested within schools. I incorporate random intercepts for students and schools, with time-invariant predictors of cohort and selected student-level covariates. Specifically, the student-level covariates are those available in sgpData_LONG_COVID (e.g., ethnicity, free and reduced lunch [FRL] status).

First, the level 1 model is written as (Curran et al., 2012; DeBruine & Barr, 2021; Holmes Finch et al., 2019)

\[ Y_{ijt} = \pi_{0ij} + \pi_{1ij}T_{ijt} + \epsilon_{ijt}, \]

where \(Y_{ijt}\) is the scale score for student \(i\) in school \(j\) at time \(t\), and \(i = \{1,\ldots,N\}\), \(j = \{1,\ldots,J\}\), and \(t=\{1,2\}\). Moreover, \(\pi_{0ij}\) is the student-level intercept, \(\pi_{1ij}\) is the student-level slope coefficient, \(T_{ijt}\) indicates the time point, and \(\epsilon_{ijt}\) is the student-level error term at time \(t\). The variable \(T_{ijt}\) takes the value \(0\) at the first time point, and the value \(1\) at the second time point.

Both \(\pi_{0ij}\) and \(\pi_{1ij}\) are expanded with the level 2 model:

\[ \pi_{0ij} = \beta_{00j} + \beta_{01}C_i + \sum_{p=1}^{P}\beta_{0p}x_{pi} + r_{0ij}, \]

\[ \pi_{1ij} = \beta_{10j} + \beta_{11}C_i. \] Here, \(C_i\) indicates the cohort for student \(i\), \(x_{pi}\) is student \(i\)’s value on covariate \(p\), and \(r_{0ij}\) is the random error term on the intercept for student \(i\) in school \(j\). Matching the structure of sgpData_LONG_COVID, \(p=\{1,\ldots,8\}\). Notice that there is only a random error term on the intercept term (\(\pi_{0ij}\)), and not the slope (\(\pi_{1ij}\)). Additionally, \(C_i=0\) if student \(i\) is in the “historical” cohort and \(C_i=1\) if the student is in the “current” cohort.

Finally, a random intercept term for school-level variation is included within the level 3 model:

\[ \beta_{00j} = \gamma_{000} + u_{00j}, \]

\[ \beta_{10j} = \gamma_{100}. \] Combining the models across the three levels, we can express the scale score for student \(i\) in school \(j\) at time \(t\) as (Curran et al., 2012)

\[ Y_{ijt} = \left[ \gamma_{000} + \gamma_{100}T_{ijt} + \beta_{01}C_i + \beta_{11}C_iT_{ijt} + \sum_{p=1}^{P}\beta_{0p}x_{pi} \right] + \left[ u_{00j} + r_{0ij} + \epsilon_{ijt} \right]. \]

The first bracketed term collects the fixed effects, and the second bracketed term collects the random effects. For the random effects, assume that (Curran et al., 2012; DeBruine & Barr, 2021)

\[u_{00j}\sim N(0, \sigma^{2}_{u}),\] \[r_{0ij}\sim N(0, \sigma^{2}_{r}),\] \[\epsilon_{ijt}\sim N(0, \sigma^{2}_{\epsilon}).\]

Random Slope Model

One may also introduce a random effect at level 2 for the slope, allowing variation in the linear change in the scale scores among students (Curran et al., 2012). In this case, \(\pi_{1ij}\) becomes

\[ \pi_{1ij} = \beta_{10j} + \beta_{11}C_i + r_{1ij} \]

and all other terms are as defined in the previous subsection. With this random slope model, the full equation for \(Y_{ijt}\) becomes

\[ Y_{ijt} = \left[ \gamma_{000} + \gamma_{100}T_{ijt} + \beta_{01}C_i + \beta_{11}C_iT_{ijt} + \sum_{p=1}^{P}\beta_{0p}x_{pi} \right] + \left[ u_{00j} + r_{0ij} + r_{1ij}T_{ijt} + \epsilon_{ijt} \right]. \]

This more complex model is just-identified when fit to the data in sgpData_LONG_COVID, likely due to the small number of time points (\(T=2\)). However, when the random slope model was fit to data with three time points (e.g., using cohorts of students in 7th grade in either 2019 or 2021), the lmer optimizer failed to converge; the model did converge once the random slope was removed. The simulation design therefore proceeded with the random intercept model. This design decision should be kept in mind when drawing inferences from the simulation results. A sketch of the random slope specification is given below for reference.
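
For reference, a random slope version of the exploratory lmer model could be specified roughly as follows. This is only a sketch: it assumes the sgpcohort data frame constructed in the Simulation Code section below, adds a student-level random slope on YEAR, and keeps a random intercept for schools, matching the random slope model written out above.

# Sketch: random intercept for schools plus a student-level random intercept and slope on YEAR
# (assumes the `sgpcohort` data frame built in the Simulation Code section below)
randslopemod = lmer(SCALE_SCORE ~ YEAR*COHORT + ETH_1 + ETH_2 + ETH_3 + ETH_4 +
                                  FRL + ELL + IEP + MALE +
                                  (1 | SCHOOLID) + (1 + YEAR | ID),
                    data = sgpcohort)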

Simulation Code

In this section, I present the full simulation code. The simulation is relatively time-intensive (due to the computation of the propensity scores), so readers interested in replicating this code are encouraged to run the simulation in a separate R script.

General Set-Up

I begin by loading the appropriate R libraries and sgpData_LONG_COVID from SGPdata.

# Load libraries
pacman::p_load(bindata,    # Generate random binomial data
               data.table, # Data manipulation
               lme4,       # Fit mixed-effects models
               SGPdata,    # Load sgpData_LONG_COVID 
               twang,      # Propensity score weighting methods
               WeMix)      # Fit weighted mixed-effects models

# Load data for exploring parameter values
data("sgpData_LONG_COVID")

Next I will set some general parameters: (a) The number of simulation trials, (b) the number of students per cohort, and (c) the number of schools. The numbers of students and schools are roughly based on the frequencies from the two cohorts of interest in sgpData_LONG_COVID. It is important to note that this simulation uses balanced school samples, with the same number of students per school. This structure is arguably not realistic for educational assessment data, but is used here for simplicity.

# Number of simulation trials
ntrials = 100

# Number of observations per cohort
nobspercohort = 7000

# Number of schools 
# 100 schools --> 70 students per school per cohort
nschool = 100

To ensure reproducibility, set the seed for random number generation.

# Set seed
set.seed(61821)

Explore sgpData_LONG_COVID

To ensure a relatively realistic simulation design, I base the parameter values on a similar model fit to data from sgpData_LONG_COVID. Recall that we have two cohorts each with two time points. Again, the “historical cohort” includes students who were in 3rd grade in 2017 and in 5th grade in 2019. The “current cohort” includes students who were in 3rd grade in 2019 and in 5th grade in 2021.

I begin by formatting the data into the two cohorts, subsetting only the scores from the mathematics content area. This code is roughly based on Dadey (2021).

# Subset data for "historical cohort"
# 3rd graders in 2017, 5th graders in 2019
historicalcohort = rbind(
  
  sgpData_LONG_COVID[GRADE == 5 & YEAR == "2019" & CONTENT_AREA == "MATHEMATICS",],
  sgpData_LONG_COVID[GRADE == 3 & YEAR == "2017" & CONTENT_AREA == "MATHEMATICS",]
  
)

# Subset data for "current cohort"
# 3rd graders in 2019, 5th graders in 2021
currentcohort = rbind(
  
  sgpData_LONG_COVID[GRADE == 5 & YEAR == "2021" & CONTENT_AREA == "MATHEMATICS",],
  sgpData_LONG_COVID[GRADE == 3 & YEAR == "2019" & CONTENT_AREA == "MATHEMATICS",]
  
)

# Remove observations that aren't in both years
# This step helps to ensure model identification for `lmer`
historicalcohort = historicalcohort[, if (.N > 1) .SD, by = .(ID)]
currentcohort = currentcohort[, if (.N > 1) .SD, by = .(ID)]

# Change year to binary
historicalcohort[, YEARNUM := ifelse(YEAR == "2019", 1, 0)]
currentcohort[, YEARNUM := ifelse(YEAR == "2021", 1, 0)]

# Dummy-code covariate variables
historicalcohort = data.frame(
  
  model.matrix(~ YEARNUM + SCALE_SCORE + ETHNICITY + FREE_REDUCED_LUNCH_STATUS + 
                 ELL_STATUS + IEP_STATUS + GENDER, 
               data = historicalcohort)[,-1], 
  historicalcohort$ID, 
  historicalcohort$SCHOOL_NUMBER,
  "Cohort" = 0
)
currentcohort = data.frame(
  
  model.matrix(~ YEARNUM + SCALE_SCORE + ETHNICITY + FREE_REDUCED_LUNCH_STATUS +
                 ELL_STATUS + IEP_STATUS + GENDER, 
               data = currentcohort)[,-1], 
  currentcohort$ID, 
  currentcohort$SCHOOL_NUMBER,
  "Cohort" = 1
)

# Rename variables
names(historicalcohort) = 
  names(currentcohort) = 
      c("YEAR", "SCALE_SCORE", "ETH_1", "ETH_2", "ETH_3", "ETH_4", "FRL", 
        "ELL", "IEP", "MALE", "ID", "SCHOOLID", "COHORT")

# Combine cohort data
sgpcohort = rbind(historicalcohort, currentcohort)

Next, I fit a random intercept model with repeated observations nested within students, and students nested within schools. The binary indicator for cohort and the available student-level covariates are included as time-invariant predictors at level 2. In this model, I use the scale score variable that incorporates a simulated COVID-19 impact (there is a similar non-impact variable called SCALE_SCORE_without_COVID_IMPACT).

# Fit mixed-effects models to guide parameter values
longmod = lmer(SCALE_SCORE ~ YEAR*COHORT + ETH_1 + ETH_2 + ETH_3 + ETH_4 + 
                             FRL + ELL + IEP + MALE + 
                             (1 | SCHOOLID/ID), data = sgpcohort)

Setting Parameter Values

Here I set the parameter values for the simulation’s data-generating model (a three-level longitudinal model with random intercepts for students nested within schools).

# Number of time points per cohort
timepoints = 2

# Set parameter values for mixed-effects model
gamma_000 = 520
gamma_100 = -2.5  
beta_01 = 0.47  
beta_11 = -7.6
beta_p = c(-2.4, -1.1, -0.7, -0.6, -8.9, -15.5, -34.5, 0.3)
u_00j.sigma = 92
r_0ij.sigma = 537
e_tij.sigma = 363

In this model, \((\beta_{01} + \beta_{11})\) represents the mean difference in the scale scores between the cohorts at the second time point. This is the value that we are trying to estimate using either Fair Trend or propensity score weighting.
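
For clarity, the target parameter implied by these values can be computed directly:

# True mean scale score difference between cohorts at the second time point
truediff = beta_01 + beta_11   # 0.47 + (-7.6) = -7.13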

I next set the parameters for the covariate data generation. All of the included covariates are binary variables, assumed to be random realizations of (correlated) binomial distributions. I specify the marginal probabilities and the covariance matrix among the eight covariates below. The marginal probabilities are modified slightly from the estimated proportions in sgpData_LONG_COVID to ensure sufficient variation in the generated data. I also assume that the covariate values are time-invariant, so students’ values on each variable do not change between the two time points. Here, the prefix h. denotes the “historical” cohort and c. denotes the “current” cohort.

# Set parameter values, covariates
# Use first time point for each cohort, assuming no change in covariates over time
# Modified slightly to ensure variation
h.covprob = c(0.15, 0.20, 0.20, 0.45, 0.46, 0.13, 0.19, 0.53)
h.Cmat = round(cov(historicalcohort[historicalcohort$YEAR == 0, 3:10]), 3)
c.covprob = c(0.15, 0.25, 0.20, 0.4, 0.50, 0.12, 0.20, 0.51)
c.Cmat = round(cov(currentcohort[currentcohort$YEAR == 0, 3:10]), 3)

Below, I generate the student IDs. By assuming cross-sectional data, there is no overlap in IDs across the cohorts.

# Set student ID vectors
h.studentid = seq(1, nobspercohort, by = 1)
c.studentid = seq((nobspercohort + 1001), ((nobspercohort + 1000) + nobspercohort), by = 1)

Main Simulation Loop

This subsection presents the simulation code for data generation and for estimating the mean scale score difference with each of the following four methods:

  • Computing the raw difference between mean scale scores (i.e., not applying any data adjustment);
  • The Fair Trend metric;
  • Propensity score weighting when analyzing with a non-hierarchical, OLS regression; and
  • Propensity score weighting when analyzing with a two-level random intercept model.

The propensity scores are estimated using the twang R package (Cefalu et al., 2021), with the prior scale score included as a covariate in the propensity score estimation. The code for computing the Fair Trend metric is from Dadey (2021). A while-loop is used to ensure that, for each trial, there is variation within the covariate data (i.e., no generated covariate is constant across students).

# Matrix to save average scale score differences across methods
meandiff = matrix(NA, nrow = ntrials, ncol = 4)
colnames(meandiff) = c("None", "FairTrend", "PSW.DecisionTree", "PSW.MLM")

# Start counter
iTrial = 1

# Begin while-loop
while(iTrial <= ntrials) {
  
  # Generate covariate data
  h.covdata = rmvbin(nobspercohort, margprob = h.covprob, sigma = h.Cmat)
  c.covdata = rmvbin(nobspercohort, margprob = c.covprob, sigma = c.Cmat)
  
  # Generate year vector
  yearvec = 0:(timepoints-1)
  
  # Generate cohort vector
  cohortvec = 0:1
  
  # Generate random intercepts, school level
  u_00j = rep(rnorm(nschool, mean = 0, sd = sqrt(u_00j.sigma)), each = nobspercohort/nschool)
  
  # Generate random intercepts, individual level
  # One vector per cohort
  r_0i0j = rnorm(nobspercohort, mean = 0, sd = sqrt(r_0ij.sigma))
  r_0i1j = rnorm(nobspercohort, mean = 0, sd = sqrt(r_0ij.sigma))
  
  # Generate residuals
  # One vector per cohort per time point
  e_1i0j = rnorm(nobspercohort, mean = 0, sd = sqrt(e_tij.sigma))
  e_2i0j = rnorm(nobspercohort, mean = 0, sd = sqrt(e_tij.sigma))
  e_1i1j = rnorm(nobspercohort, mean = 0, sd = sqrt(e_tij.sigma))
  e_2i1j = rnorm(nobspercohort, mean = 0, sd = sqrt(e_tij.sigma))
  
  # Generate longitudinal scale score data (SS_ticj) 
  # 2017 to 2019 cohort (Cohort_i = 0)
  SS_1i0j = gamma_000 + gamma_100*yearvec[1] + 
            beta_01*cohortvec[1] + beta_11*(yearvec[1]*cohortvec[1]) + 
            (h.covdata %*% beta_p) + u_00j + r_0i0j + e_1i0j
  SS_2i0j = gamma_000 + gamma_100*yearvec[2] + 
            beta_01*cohortvec[1] + beta_11*(yearvec[2]*cohortvec[1]) + 
            (h.covdata %*% beta_p) + u_00j + r_0i0j + e_2i0j
  
  # Generate longitudinal scale score data (SS_ticj) 
  # 2019 to 2021 cohort (Cohort_i = 1)
  SS_1i1j = gamma_000 + gamma_100*yearvec[1] + beta_01*cohortvec[2] + 
            beta_11*(yearvec[1]*cohortvec[2]) + 
            (c.covdata %*% beta_p) + u_00j + r_0i1j + e_1i1j
  SS_2i1j = gamma_000 + gamma_100*yearvec[2] + beta_01*cohortvec[2] + 
            beta_11*(yearvec[2]*cohortvec[2]) + 
            (c.covdata %*% beta_p) + u_00j + r_0i1j + e_2i1j
  
  # Set school ID vector
  schoolid = rep(1:nschool, each = nobspercohort/nschool)
  
  # Combine into full data frame
  h.data.long = data.frame("ID" = rep(h.studentid, timepoints), 
                           "SchoolID" = c(schoolid, schoolid),
                           "SS" = c(SS_1i0j, SS_2i0j), 
                           "Year" = factor(rep(yearvec, each = nobspercohort), 
                                           labels = c("2017", "2019")), 
                           "Grade" = rep(c(3, 5), each = nobspercohort),
                           rbind(h.covdata, h.covdata))
  c.data.long = data.frame("ID" = rep(c.studentid, timepoints),
                           "SchoolID" = c(schoolid, schoolid),
                           "SS" = c(SS_1i1j, SS_2i1j), 
                           "Year" = factor(rep(yearvec, each = nobspercohort), 
                                           labels = c("2019", "2021")), 
                           "Grade" = rep(c(3, 5), each = nobspercohort),
                           rbind(c.covdata, c.covdata))
  data.long = rbind(h.data.long, c.data.long)
  
  # Organize data for computations
  # data.hist.wide and data.curr.wide used for the Fair Trend computation
  # data.ps used for propensity score computations
  # WL2 is a unity vector used for school-level weights in weighted mixed-effects regression
  data.hist.wide = reshape(h.data.long, idvar = "ID", direction = "wide", timevar = "Year")
  data.curr.wide = reshape(c.data.long, idvar = "ID", direction = "wide", timevar = "Year")
  data.ps = rbind(
    
    data.frame("ID" = data.hist.wide[, 1],
               "SchoolID" = data.hist.wide[, 2],
               "SSPrior" = data.hist.wide[, 3],
               "X1" = data.hist.wide[, 5],
               "X2" = data.hist.wide[, 6],
               "X3" = data.hist.wide[, 7],
               "X4" = data.hist.wide[, 8],
               "X5" = data.hist.wide[, 9],
               "X6" = data.hist.wide[, 10],
               "X7" = data.hist.wide[, 11],
               "X8" = data.hist.wide[, 12],
               "SS" = data.hist.wide[, 14],
               "Year" = 0,
               "WL2" = 1),
    data.frame("ID" = data.curr.wide[, 1],
               "SchoolID" = data.curr.wide[, 2],
               "SSPrior" = data.curr.wide[, 3],
               "X1" = data.curr.wide[, 5],
               "X2" = data.curr.wide[, 6],
               "X3" = data.curr.wide[, 7],
               "X4" = data.curr.wide[, 8],
               "X5" = data.curr.wide[, 9],
               "X6" = data.curr.wide[, 10],
               "X7" = data.curr.wide[, 11],
               "X8" = data.curr.wide[, 12],
               "SS" = data.curr.wide[, 14],
               "Year" = 1,
               "WL2" = 1)
  )
  
  # Check that we have variation on all covariates
  # If not, computations return an error
  covmean = apply(data.ps[, 4:11], 2, mean)  # columns 4:11 hold the binary covariates X1-X8
  covmeanzero = length(which(covmean == 0))
  if(covmeanzero == 0) {
    
      # Raw mean difference, no adjustments
      rawdiff = mean(data.curr.wide$SS.2021) - mean(data.hist.wide$SS.2019)
      
      # Fair Trend computation
      # Computation based on Dadey (2021)
      hist.fit = lm(SS.2019 ~ SS.2017, data = data.hist.wide)
      data.curr.wide$FT2019 = hist.fit$coefficients[[1]] + 
                              hist.fit$coefficients[[2]]*data.curr.wide$SS.2019
      ftdiff = mean(data.curr.wide$SS.2021) - mean(data.curr.wide$FT2019)
      
      # PSW 1: Compute with gradient boosting tree, 
      #        analyze with simple OLS
      ps1 = ps(Year ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + SSPrior, data = data.ps, 
                n.trees = 20000, interaction.depth = 2, shrinkage = 0.01, verbose = F,
                estimand = "ATE", stop.method = "es.mean")
      psw1 = get.weights(ps1, stop.method = "es.mean", estimand = "ATE")
      fit1 = lm(SS ~ Year, data = data.ps, weights = psw1)
      pswdiff1 = fit1$coefficients[[2]]
      
      # PSW 2: Compute with gradient boosting tree, 
      #         analyze with two-level random intercept model
      data.ps$psw1 = psw1
      fit2 = mix(SS ~ Year + (1 | SchoolID), data = data.ps, weights = c("psw1", "WL2"))
      pswdiff2 = fit2$coef[[2]]
      
      # Save estimates
      meandiff[iTrial, 1] = rawdiff
      meandiff[iTrial, 2] = ftdiff
      meandiff[iTrial, 3] = pswdiff1
      meandiff[iTrial, 4] = pswdiff2
      
      # Go to next trial
      cat("\nCompleted trial", iTrial, "of", ntrials)
      iTrial = iTrial + 1
      
  } # end if covmeanzero == 0
  
} # end while(iTrial <= ntrials)

Results

When generating data from a three-level, random-intercept model, the true mean scale score difference was set to be -7.13, where the negative sign indicates a lower mean scale score value among 5th graders in 2021 compared to 5th graders in 2019 (recall that this is simulated data, so no practical inferences should be drawn from this value). The table below presents the raw bias, absolute bias, and root mean-squared error (RMSE), averaged across the 100 simulation trials, among the four examined methods.

Average bias and RMSE results when generating longitudinal data from a random-intercept model

Method            Raw Bias   Absolute Bias   Root Mean-Squared Error
Raw Difference       0.748           0.764                     0.885
Fair Trend           0.558           0.582                     0.669
PSW with OLS         0.319           0.420                     0.504
PSW with MLM         0.287           0.403                     0.489

The results indicate that propensity score weighting (with the weights applied either to a simple OLS regression or to a two-level mixed-effects model) produced less biased estimates, on average, than the Fair Trend metric. Both Fair Trend and propensity score weighting improved upon the raw mean scale score difference in terms of bias. Here, raw bias was computed as the true difference minus the estimated difference. Given that the true difference is negative, a positive raw bias indicates that all four methods tended to produce estimates that were more negative than the true value; that is, they overestimated the size of the decline in aggregate scale scores.

Propensity score weighting also produced smaller RMSE values; the RMSE reflects both the variance and the squared bias of an estimator. Although the Fair Trend estimates tended to fall farther below the true mean scale score difference than those from the two propensity score weighting methods, it remains to be seen whether the differences in bias and RMSE between Fair Trend and propensity score weighting are practically significant. All three adjustment methods resulted in smaller RMSE values than when no data adjustment was made.
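
For reference, the bias and RMSE summaries reported above can be computed from the meandiff matrix roughly as follows (a sketch; truediff is the true mean difference defined earlier, i.e., beta_01 + beta_11 = -7.13):

# Sketch: summarize bias and RMSE across simulation trials
rawbias = apply(truediff - meandiff, 2, mean)            # raw bias (true minus estimated)
absbias = apply(abs(truediff - meandiff), 2, mean)       # average absolute bias
rmse    = sqrt(apply((truediff - meandiff)^2, 2, mean))  # root mean-squared error
round(cbind(rawbias, absbias, rmse), 3)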

Comparing the two propensity score weighting methods, applying the weights to a two-level random intercept model produced slightly lower bias and RMSE values than applying the weights to a simple OLS regression. However, the differences between these two methods were relatively small. It also remains to be seen how well propensity score weighting would perform when the propensity scores themselves are estimated using a multilevel model.

We can also evaluate the distributions of the estimated difference values (across the 100 simulation trials) for the four methods.

Descriptive statistics for the estimated difference values across the simulation trials

Method              Mean      SD   Minimum   Maximum
Raw Difference    -7.878   0.475    -8.894    -6.868
Fair Trend        -7.688   0.372    -8.462    -6.740
PSW with OLS      -7.449   0.393    -8.332    -6.525
PSW with MLM      -7.417   0.398    -8.322    -6.457
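
These descriptive statistics can be obtained directly from the meandiff matrix, for example:

# Descriptive statistics of the estimated differences across trials
round(apply(meandiff, 2, function(x)
  c(Mean = mean(x), SD = sd(x), Minimum = min(x), Maximum = max(x))), 3)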


These results suggest that the variability in estimated differences among the three adjustment methods was relatively similar. Indeed, the variability in the Fair Trend estimates was smaller than that of either propensity score weighting method. The distribution of estimated differences for the Fair Trend metric was generally shifted to the left of those for propensity score weighting, suggesting that the higher RMSE value for Fair Trend is largely driven by higher squared bias. Using no data adjustment resulted in greater variability than Fair Trend or propensity score weighting.

Limitations & Future Directions

The current simulation is admittedly quite simple, and numerous modifications may lead to different patterns of results. For example, this simulation assumed a data-generating model with only random intercepts, not random slopes. Additionally, other propensity score weighting methods may be employed; for example, one may estimate the propensity scores using a multilevel logistic regression model (e.g., Leite et al., 2015). Unfortunately, with these data, multilevel models for estimating the propensity scores failed to converge. The results from this simulation should therefore be interpreted in light of the assumptions made when generating and analyzing the data.
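
For context, a multilevel propensity score model of the kind referenced above could be specified with lme4::glmer, for example as sketched below using the simulated data.ps columns; this is the type of specification that failed to converge in this setting and is shown only for illustration.

# Sketch: multilevel (random intercept) logistic regression for the propensity scores
psmlm = glmer(Year ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + SSPrior + (1 | SchoolID),
              data = data.ps, family = binomial)
pscore.mlm = fitted(psmlm)  # predicted probabilities of being in the current cohort,
                            # conditional on the school random effects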

An important consideration is how such methods might be employed in practice. One suggestion is to implement a “flag” for samples that may not be appropriately comparable. For instance, researchers and practitioners could use diagnostic checks for covariate balance between samples (with and without propensity score weights). The twang package provides clear tutorials on how these checks may be conducted in R. Alternatively, Ho (2021) proposed a match rate to “display the percentage of students with comparable test scores” (p. 2). The practical application of these methods will ultimately depend upon the analyses of interest and specific data structures.
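
As an illustration of such a diagnostic check with twang (a sketch using the ps1 object from the simulation code above), covariate balance before and after weighting can be inspected as follows:

# Inspect covariate balance before and after propensity score weighting
bal.table(ps1)           # balance statistics for each covariate, unweighted vs. weighted
summary(ps1)             # overall balance summaries and effective sample sizes
plot(ps1, plots = "es")  # plot standardized effect sizes before and after weighting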

References

  • Betebenner, D. W., Van Iwaarden, A. R., & Domingue, B. (2021). SGPdata: Exemplar data sets for student growth percentile (SGP) analyses. R package version 25.1-0.0. https://centerforassessment.github.io/SGPdata/
  • Cefalu, M., Ridgeway, G., McCaffrey, D., Morral, A., Griffin, B. A., & Burgette, L. (2021). twang: Toolkit for weighting and analysis of nonequivalent groups. R package version 2.1. https://CRAN.R-project.org/package=twang
  • Curran, P. J., McGinley, J. S., Serrano, D., & Burfeind, C. (2012). A multivariate growth curve model for three-level data. In H. Cooper (Ed.), APA handbook of research methods in psychology: Vol. 3. Data analysis and research publication. American Psychological Association. https://doi.org/10.1037/13621-017
  • Dadey, N. (2021, May 3). Exploring Ho’s (2021) Fair Trend metric. Retrieved from https://centerforassessment.github.io/cfaTools/articles/fair_trend.html
  • DeBruine, L. M., & Barr, D. J. (2021). Understanding mixed-effects models through data simulation. Advances in Methods and Practices in Psychological Science, 4(1), 1-15. https://doi.org/10.1177/2515245920965119
  • Desai, R. J., & Franklin, J. M. (2019). Alternative approaches for confounding adjustment in observational studies using weighting based on the propensity score: A primer for practitioners. The British Medical Journal, 367, l5657. https://doi.org/10.1136/bmj.l5657
  • Ho, A. (2021, February 26). Three test-score metrics that all states should report in the COVID-19-affected spring of 2021. Harvard Graduate School of Education. https://scholar.harvard.edu/files/andrewho/files/threemetrics.pdf
  • Holmes Finch, W., Bolin, J. E., & Kelley, K. (2019). Multilevel modeling using R. CRC Press, Taylor & Francis Group.
  • Lee, B. K., Lessler, J., & Stuart, E. A. (2010). Improving propensity score weighting using machine learning. Statistics in Medicine, 29(3), 337-346. https://doi.org/10.1002/sim.3782
  • Leite, W. L., Jimenez, F., Kaya, Y., Stapleton, L. M., MacInnes, J. W., & Sandbach, R. (2015). An evaluation of weighting methods based on propensity scores to reduce selection bias in multilevel observational studies. Multivariate Behavioral Research, 50(3), 265-284. https://doi.org/10.1080/00273171.2014.991018
  • Li, F., Zaslavsky, A. M., & Landrum, M. B. (2013). Propensity score weighting with multilevel data. Statistics in Medicine, 32(19), 3373-3387. https://doi.org/10.1002/sim.5786
  • Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 41-55.
  • Weuve, J., Tchetgen, E. J. T., Glymour, M. M., Beck, T. L., Aggarwal, N. T., Wilson, R. S., Evans, D. A., & Mendes de Leon, C. F. (2012). Accounting for bias due to selective attrition: The example of smoking and cognitive decline. Epidemiology, 23(1), 119-128. https://doi.org/10.1097/EDE.0b013e318230e861