A propensity score (Rosenbaum & Rubin, 1983, 1984; as cited in Li et al., 2013) is defined as “the probability of receiving a treatment conditional on a set of observed covariates” (Lee et al., 2010, p. 337; Rosenbaum & Rubin, 1983). Consider a researcher who wants to estimate the mean difference between two groups on some outcome variable. Using randomization helps ensure that the two groups are similar in terms of the covariates of interest, such as demographic variables. Yet in quasi-experimental and observational studies, we do not (or cannot) form groups using random selection. As a result, the groups may substantially differ on one or more covariates (Burgette et al., 2016).
Propensity score weighting (PSW) is a mechanism to better “balance the covariate distribution between groups” prior to some analysis (Li et al., 2013, p. 3). PSW is often preferred over propensity score matching, because the latter may remove observations from the sample if a plausible match cannot be found (Desai & Franklin, 2019). Numerous PSW methods have been proposed in the literature, differing by the model used to estimate the propensity scores (e.g., parametric logistic regression or non-parametric decision tree methods; Burgette et al., 2016; Lee et al., 2010), the weighting approach (e.g., inverse probability treatment weighting [IPTW] or standardized mortality ratio weighting [SMRW]; see Desai & Franklin, 2019), and approaches for hierarchical data (see Li et al., 2013).
Moreover, PSW methods depend on the estimand used (e.g., Desai & Franklin, 2019). Two common estimands are the average treatment effect (ATE), which targets the effect of treatment across the entire population, and the average treatment effect on the treated (ATT), which targets the effect only among the units that actually received the treatment.
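To make the two estimands concrete, below is a toy sketch (not part of the analysis later in this post, with made-up propensity scores) showing how ATE and ATT weights are commonly constructed once the propensity scores have been estimated.

# Toy illustration of ATE and ATT weights (all values are hypothetical)
e = c(0.20, 0.55, 0.80)   # estimated propensity scores, Pr(treated | covariates)
trt = c(0, 1, 1)          # treatment indicator (1 = treated, 0 = comparison)

# ATE: inverse probability of treatment weighting (IPTW);
# every observation is weighted toward the full population
w.ate = ifelse(trt == 1, 1 / e, 1 / (1 - e))

# ATT: treated units keep a weight of 1; comparison units are re-weighted
# to resemble the treated group
w.att = ifelse(trt == 1, 1, e / (1 - e))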
Note that the term “treatment” is commonly used to define the grouping variable with PSW, based on applications to quasi-experimental studies using a treatment and control group. However, PSW can also be readily applied to observational studies that do not introduce a treatment mechanism, but rather have two (or more) mutually exclusive groups (Li et al., 2013). PSW can also be applied to longitudinal studies, where attrition is a concern (e.g., Weuve et al., 2012).
The above introduction is not intended to be an exhaustive literature review of PSW. The interested reader is referred to papers like Desai and Franklin (2019), Lee and colleagues (2010), Rosenbaum and Rubin (1983), and Vansteelandt and Daniel (2014) for more details on the methodology and applications of PSW.
Researchers and policymakers often want to compare mean scale scores and student growth percentiles (SGPs) across years. For example, we may want to estimate the difference in mean or median scale scores between fifth graders in 2019 and fifth graders in 2021. To draw valid inferences from such comparisons, we need to ensure that we have an “apples-to-apples” comparison between the two samples.
One potential roadblock to this comparison may be missing data, arising from factors like differential participation rates. Another roadblock may be that the student composition looks quite different across years, even within the same school. For example, students may leave and enter new schools between assessments. As a result, the distribution of important covariates (e.g., demographic variables) may become unbalanced when comparing cohorts across two years. Although this may be a concern in any cross-sectional analysis of assessment data, the possibility of unbalanced samples is even more pronounced given the COVID-19 pandemic.
Other researchers have already started to address the analytic challenges posed by changing sample compositions. Notably, Ho (2021) proposed a “Fair Trend” metric that uses a series of regression models to adjust 2019 assessment data, facilitating comparisons to scores in 2021. PSW may be an alternative method to adjust the aggregated scale score and SGP comparisons between 2019 and 2021. For instance, PSW has been used to compare mean score differences between college freshmen and seniors in a cross-sectional sample (Liu et al., 2016).
SGPdata
Using R statistical software (R Core Team, 2021), I illustrate how PSW may be applied to educational assessment data using publicly available data in the `SGPdata` package (Betebenner et al., 2021). The goal is to estimate the mean mathematics scale score difference between fifth graders in 2019 and fifth graders in 2021. This research question was chosen to match the vignette by Dadey (2021), who used these data to illustrate the computation of Ho’s (2021) Fair Trend metric.
In the current example, propensity score weights are estimated using the `twang` package (Cefalu et al., 2021). The processes explicated here largely follow the `twang` vignette.
First, I load the `sgpData_LONG_COVID` data set from the `SGPdata` package. This synthetic data set provides scale scores, SGPs, and demographic characteristics from 2016 to 2023. The 2020 data were removed to mirror missing assessment data due to the pandemic. I focus specifically on observations from grade 5 in either 2019 or 2021 (again, matching Dadey [2021]).
# Load packages
require(pacman)
pacman::p_load(data.table, ggplot2, kableExtra, SGPdata, survey, twang)
# Set seed for reproducibility
set.seed(6921)
# Load data
# Subset only mathematics in grade 5 and either 2019 or 2021
psw.data = sgpData_LONG_COVID[sgpData_LONG_COVID$GRADE == 5 &
sgpData_LONG_COVID$CONTENT_AREA == "MATHEMATICS" &
(sgpData_LONG_COVID$YEAR == 2019 | sgpData_LONG_COVID$YEAR == 2021), ]
# Sample sizes
n.2019 = nrow(psw.data[psw.data$YEAR == 2019, ])
n.2021 = nrow(psw.data[psw.data$YEAR == 2021, ])
There are 7985 fifth graders in the 2019 sample, and 7880 in the 2021 sample.
To estimate the propensity scores with `twang`, the covariates should be a numeric (or similar) class in R. One way to transform the variables (which are class `factor` in `sgpData_LONG_COVID`) is to use dummy coding with `model.matrix`. Note also that the data should be of class `data.frame` for use with `twang`.
# Convert data to a dummy-coded data frame using model.matrix
# The [,-1] removes the first column, which is a unit vector representing an intercept
psw.mm = data.frame(model.matrix(~ YEAR + ETHNICITY + FREE_REDUCED_LUNCH_STATUS +
ELL_STATUS + IEP_STATUS + GENDER + SCALE_SCORE,
data = psw.data)[,-1])
# Add variable names
names(psw.mm) = c("YEAR", "ETH_ASIAN", "ETH_HISP", "ETH_OTH", "ETH_WHITE",
"FRL", "ELL", "IEP", "MALE", "SCALE_SCORE")
Note that `model.matrix` automatically sets `psw.mm$YEAR = 1` if an observation is from 2021, and 0 otherwise. In this case, the weights for observations in 2021 will all be 1, and the 2019 mean scale scores will be adjusted to achieve covariate balance. To alter the analyses so that adjustments are made to the 2021 scores instead, I reverse the dichotomization of `psw.mm$YEAR`.
# Change psw.mm$YEAR so 1 = 2019, and 0 = 2021
psw.mm$YEAR_RELEVEL = NA
psw.mm$YEAR_RELEVEL[psw.mm$YEAR == 1] = 0
psw.mm$YEAR_RELEVEL[psw.mm$YEAR == 0] = 1
To compute the propensity scores, I use the `twang::ps` function. This function by default uses gradient boosting with the `gbm` package (Greenwell et al., 2020), which is an aggregated decision tree method. Features of the boosting model can be modified using arguments like `n.trees`, `interaction.depth`, and `shrinkage` (Ridgeway et al., 2021). Typically, the number of decision trees should be quite large and the shrinkage parameter should be small to ensure a slow learning rate (James et al., 2013). I use `estimand = "ATT"` here for demonstrative purposes. The Kolmogorov-Smirnov (KS) statistic and standardized effect size are used “for measuring and summarizing balance across pretreatment variables” (Ridgeway et al., 2021, p. 3).
Broadly, the propensity scores are estimated by predicting group membership (here denoted by `YEAR_RELEVEL`) from the covariates of interest. Therefore, the functional form for the model is `YEAR_RELEVEL ~ ETH_ASIAN + ETH_HISP + ETH_OTH + ETH_WHITE + FRL + ELL + IEP + MALE`, where the right-hand side indicates that all eight covariates (dummy-coded variables for ethnicity, FRL status, etc.) should be included. The gradient boosting method will automatically examine interaction and polynomial effects (Ridgeway et al., 2021).
# Estimate propensity scores
prop.scores = ps(
# Model form
YEAR_RELEVEL ~ ETH_ASIAN + ETH_HISP + ETH_OTH + ETH_WHITE +
FRL + ELL + IEP + MALE,
# Data
data = psw.mm,
# Number of decision trees
n.trees = 15000,
# Tree interaction depth
interaction.depth = 2,
# Tree shrinkage parameter (lambda)
shrinkage = 0.01,
# Estimand
estimand = "ATT",
# Stop method(s)
stop.method = c("es.mean", "ks.max"),
# Display printed output to the console as the function progresses?
verbose = FALSE
)
Note that in some cases (e.g., with certain data sets or random seeds), a user may receive the following warning message: “Optimal number of iterations is close to the specified n.trees. n.trees is likely set too small and better balance might be obtainable by setting n.trees to be larger.”
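If that warning appears, one simple response is to refit the model with a larger number of trees and re-check balance. Below is a minimal sketch that doubles `n.trees` (30,000 is an arbitrary choice for illustration) while keeping the other settings from the call above.

# Possible response to the warning: refit with a larger n.trees
prop.scores.refit = ps(
  # Same model form as before
  YEAR_RELEVEL ~ ETH_ASIAN + ETH_HISP + ETH_OTH + ETH_WHITE +
    FRL + ELL + IEP + MALE,
  data = psw.mm,
  # Arbitrary larger value; increase until the warning no longer appears
  n.trees = 30000,
  interaction.depth = 2,
  shrinkage = 0.01,
  estimand = "ATT",
  stop.method = c("es.mean", "ks.max"),
  verbose = FALSE
)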
We can next examine the propensity scores using a variety of diagnostic checks. These are explained in more depth in the `twang` vignette. First, the function `bal.table()` displays differences on the covariates between the two groups when the samples are (a) unweighted, (b) weighted using the mean standardized effect size (`es.mean`), and (c) weighted using the maximum KS statistic (`ks.max`).
# Balance table
bal.table(prop.scores)
## $unw
## tx.mn tx.sd ct.mn ct.sd std.eff.sz stat p ks ks.pval
## ETH_ASIAN 0.157 0.364 0.133 0.339 0.068 4.440 0.000 0.025 0.015
## ETH_HISP 0.203 0.402 0.249 0.433 -0.116 -7.026 0.000 0.047 0.000
## ETH_OTH 0.093 0.291 0.076 0.265 0.059 3.880 0.000 0.017 0.194
## ETH_WHITE 0.461 0.499 0.463 0.499 -0.003 -0.215 0.830 0.002 1.000
## FRL 0.488 0.500 0.514 0.500 -0.051 -3.218 0.001 0.026 0.011
## ELL 0.096 0.295 0.058 0.233 0.130 9.060 0.000 0.038 0.000
## IEP 0.110 0.313 0.108 0.310 0.009 0.549 0.583 0.003 1.000
## MALE 0.520 0.500 0.507 0.500 0.027 1.685 0.092 0.013 0.478
##
## $es.mean.ATT
## tx.mn tx.sd ct.mn ct.sd std.eff.sz stat p ks ks.pval
## ETH_ASIAN 0.157 0.364 0.158 0.364 -0.001 -0.030 0.976 0.000 1
## ETH_HISP 0.203 0.402 0.204 0.403 -0.003 -0.183 0.854 0.001 1
## ETH_OTH 0.093 0.291 0.094 0.292 -0.003 -0.178 0.859 0.001 1
## ETH_WHITE 0.461 0.499 0.459 0.498 0.004 0.254 0.800 0.002 1
## FRL 0.488 0.500 0.488 0.500 -0.001 -0.047 0.963 0.000 1
## ELL 0.096 0.295 0.096 0.294 0.001 0.038 0.970 0.000 1
## IEP 0.110 0.313 0.111 0.314 -0.001 -0.040 0.968 0.000 1
## MALE 0.520 0.500 0.523 0.500 -0.004 -0.263 0.793 0.002 1
##
## $ks.max.ATT
## tx.mn tx.sd ct.mn ct.sd std.eff.sz stat p ks ks.pval
## ETH_ASIAN 0.157 0.364 0.158 0.364 -0.001 -0.035 0.972 0.000 1
## ETH_HISP 0.203 0.402 0.204 0.403 -0.003 -0.189 0.850 0.001 1
## ETH_OTH 0.093 0.291 0.094 0.292 -0.003 -0.156 0.876 0.001 1
## ETH_WHITE 0.461 0.499 0.459 0.498 0.004 0.253 0.800 0.002 1
## FRL 0.488 0.500 0.488 0.500 -0.001 -0.047 0.962 0.000 1
## ELL 0.096 0.295 0.096 0.294 0.001 0.073 0.942 0.000 1
## IEP 0.110 0.313 0.111 0.314 -0.001 -0.049 0.961 0.000 1
## MALE 0.520 0.500 0.523 0.500 -0.004 -0.251 0.802 0.002 1
Looking at the results for the unweighted samples, we see that the two groups substantially differ on a handful of covariates. For example, the 2021 sample (here denoted as the “control” group) has a higher proportion of students receiving free or reduced lunch, as well as a higher proportion of self-identifying Hispanic students. Weighting with either the standardized effect size or the maximum KS statistic resulted in balanced samples based on the given covariates.
Next, a series of plots is available to show the distributions of the weights, as well as the differences on the covariates when the samples are weighted or unweighted. Below is a reproduction of Table 1 from Ridgeway and colleagues (2021, p. 9), showcasing the plot options with an object from the `ps` function.
| Descriptive Argument | Numeric Argument | Description |
|---|---|---|
| optimize | 1 | Balance measure as a function of GBM iterations |
| boxplot | 2 | Boxplot of treatment/control propensity scores |
| es | 3 | Standardized effect size of pretreatment variables |
| t | 4 | t-test p-values for weighted pretreatment variables |
| ks | 5 | Kolmogorov-Smirnov p-values for weighted pretreatment variables |
| histogram | 6 | Histogram of weights for treatment/control |
I present each plot for our propensity scores, `prop.scores`.
# Plot 1: "optimize"
plot(prop.scores, plots = 1)
# Plot 2: "boxplot"
plot(prop.scores, plots = 2)
# Plot 3: "es"
plot(prop.scores, plots = 3)
# Plot 4: "t"
plot(prop.scores, plots = 4)
# Plot 5: "ks"
plot(prop.scores, plots = 5)
# Plot 6: "histogram"
plot(prop.scores, plots = 6)
Summarizing the above plots, we see high overlap in the propensity scores across groups (plot 2), as well as reduced differences between groups on the given covariates after weighting (plots 3-5). For example, plot 4 shows that the p-values were all greater than 0.75 after weighting using either the standardized effect size or the maximum KS statistic. These diagnostics suggest that the propensity score weighting with gradient boosting was successful in balancing the two groups of interest.
To extract the weights for subsequent analysis, I use the `get.weights()` function. Here, I select the weights based on the standardized effect size (determined by the `stop.method` argument). One could alternatively set `stop.method = "ks.max"`.
# Extract weights
prop.weights = get.weights(prop.scores, stop.method = "es.mean")
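As a quick sanity check (not part of the `twang` vignette workflow), we can inspect the weights directly. Under ATT weighting, the 2019 (treated) observations should all receive a weight of 1, and the weighted covariate means for the 2021 group should line up with the balance table shown earlier.

# Weight distributions by group
summary(prop.weights[psw.mm$YEAR_RELEVEL == 1])  # 2019 (treated): all 1s under ATT
summary(prop.weights[psw.mm$YEAR_RELEVEL == 0])  # 2021 (comparison) group

# Weighted FRL proportion in the 2021 group; this should be close to the
# weighted value reported by bal.table() above
weighted.mean(psw.mm$FRL[psw.mm$YEAR_RELEVEL == 0],
              w = prop.weights[psw.mm$YEAR_RELEVEL == 0])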
The weights can then be integrated into the analysis. Recall that the goal here is to estimate the mean scale score difference on a mathematics assessment between fifth graders in 2019 and fifth graders in 2021. One way to do this is to run a simple linear regression, regressing the scale scores on the grouping variable (where, after the releveling above, a 1 indicates a 2019 observation and a 0 indicates a 2021 observation). With the `lm` function in R, we can apply the propensity score weights using the `weights` argument.
# Simple linear regression to estimate mean difference
lm.mod = lm(
# Model formula
SCALE_SCORE ~ YEAR_RELEVEL,
# Data
data = psw.mm,
# Add weights
weights = prop.weights
)
# Model summary
summary(lm.mod)
##
## Call:
## lm(formula = SCALE_SCORE ~ YEAR_RELEVEL, data = psw.mm, weights = prop.weights)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -165.713 -23.134 -0.134 20.866 151.720
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 498.6647 0.3918 1272.64 <2e-16 ***
## YEAR_RELEVEL 7.4688 0.5541 13.48 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35.01 on 15863 degrees of freedom
## Multiple R-squared: 0.01132, Adjusted R-squared: 0.01126
## F-statistic: 181.7 on 1 and 15863 DF, p-value: < 2.2e-16
Looking at the slope coefficient estimate from the above model, we find an estimated mean difference of 7.469. This suggests an average decrease in mathematics scale scores from 2019 to 2021 (although recall that these are simulated data, so don’t read too much into the results). Notice that if we simply took the mean difference in mathematics scale scores without balancing, we’d have a slightly different estimate.
# Simple linear regression with no weights
lm.mod.noweight = lm(
# Model formula
SCALE_SCORE ~ YEAR_RELEVEL,
# Data
data = psw.mm
)
# Model summary
summary(lm.mod.noweight)
##
## Call:
## lm(formula = SCALE_SCORE ~ YEAR_RELEVEL, data = psw.mm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -120.134 -23.603 -0.134 20.866 143.866
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 499.6032 0.3910 1277.85 <2e-16 ***
## YEAR_RELEVEL 6.5303 0.5511 11.85 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34.71 on 15863 degrees of freedom
## Multiple R-squared: 0.008774, Adjusted R-squared: 0.008712
## F-statistic: 140.4 on 1 and 15863 DF, p-value: < 2.2e-16
Alternatively, the `twang` vignette suggests using the `survey` package (Lumley, 2020) to estimate treatment effects. Here we follow the vignette’s steps to compute the estimated mean difference between the samples.
# Generate survey design with svydesign
# See ??svydesign
psw.design = svydesign(
# Don't cluster IDs
ids = ~1,
# Propensity score weights
weights = prop.weights,
# Data
data = psw.mm
)
# Run analysis
svy.mod = svyglm(
# Model formula
SCALE_SCORE ~ YEAR_RELEVEL,
# Object from svydesign
design = psw.design
)
# Model summary
summary(svy.mod)
##
## Call:
## svyglm(formula = SCALE_SCORE ~ YEAR_RELEVEL, design = psw.design)
##
## Survey design:
## svydesign(ids = ~1, weights = prop.weights, data = psw.mm)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 498.6647 0.4031 1237.12 <2e-16 ***
## YEAR_RELEVEL 7.4688 0.5691 13.12 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1217.564)
##
## Number of Fisher Scoring iterations: 2
Unsurprisingly, we obtain the same difference estimate as when using `lm` with `weights = prop.weights`.
Some authors (e.g., Bang & Robins, 2005; Funk et al., 2011; Yang & Schafer, 2007) have illustrated the use of doubly robust estimators, which may reduce estimation bias in cases of model misspecification. There are other R packages available to compute various doubly robust estimators. The `twang` vignette gives an example where we fit the linear regression model with both the PSW values and the covariates of interest (Ridgeway et al., 2021). McCaffrey and colleagues (2013) suggest that it is particularly helpful to include any covariates for which balance was not achieved using the PSW method.
# Run "doubly robust" estimation
lm.mod.dr = lm(
# Model formula
SCALE_SCORE ~ YEAR_RELEVEL + ETH_ASIAN + ETH_HISP + ETH_OTH + ETH_WHITE +
FRL + ELL + IEP + MALE,
# Data
data = psw.mm,
# Add weights
weights = prop.weights
)
# Model summary
summary(lm.mod.dr)
##
## Call:
## lm(formula = SCALE_SCORE ~ YEAR_RELEVEL + ETH_ASIAN + ETH_HISP +
## ETH_OTH + ETH_WHITE + FRL + ELL + IEP + MALE, data = psw.mm,
## weights = prop.weights)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -133.879 -19.766 -2.256 16.281 149.675
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 510.5114 0.9310 548.321 < 2e-16 ***
## YEAR_RELEVEL 7.4657 0.4857 15.372 < 2e-16 ***
## ETH_ASIAN 0.3490 1.0341 0.338 0.736
## ETH_HISP -0.5300 0.9923 -0.534 0.593
## ETH_OTH 0.7354 1.1503 0.639 0.523
## ETH_WHITE 0.3587 0.9071 0.395 0.692
## FRL -12.7284 0.4919 -25.876 < 2e-16 ***
## ELL -26.6052 0.8337 -31.911 < 2e-16 ***
## IEP -39.7808 0.7822 -50.859 < 2e-16 ***
## MALE 2.1769 0.4901 4.442 8.97e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.68 on 15855 degrees of freedom
## Multiple R-squared: 0.2408, Adjusted R-squared: 0.2403
## F-statistic: 558.6 on 9 and 15855 DF, p-value: < 2.2e-16
With this model, the estimated difference is 7.466.
The above analysis is admittedly quite simple. For example, we might consider a multilevel modeling approach to account for the clustering of students within different schools and districts. The `twang` vignette also walks through possible sensitivity analyses after using PSW methods. Note that numerous other R packages are available for PSW and other doubly robust estimators.
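As one illustration of the multilevel idea, below is a minimal sketch that fits a school-level random intercept model with the propensity score weights. It assumes the `lme4` package is installed and that a school identifier (assumed here to be `SCHOOL_NUMBER` in `sgpData_LONG_COVID`) can be carried over from `psw.data`; it is only a sketch of the outcome model, not a full multilevel propensity score analysis (see Li et al., 2013, for dedicated methods).

# Multilevel sketch: random intercept for schools with the PSW weights
# (assumes lme4 is installed and psw.data contains a SCHOOL_NUMBER column)
library(lme4)

# Carry the school identifier over to the analysis data
psw.mm$SCHOOL_NUMBER = psw.data$SCHOOL_NUMBER

# Weighted random-intercept model
mlm.mod = lmer(
  SCALE_SCORE ~ YEAR_RELEVEL + (1 | SCHOOL_NUMBER),
  data = psw.mm,
  weights = prop.weights
)
summary(mlm.mod)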
It would also be beneficial to compare PSW to Ho’s (2021) Fair Trend metric to see whether the two approaches provide similar results. Both methods attempt to create an “apples-to-apples” comparison for aggregated assessment data, but take slightly different analytic approaches to get there. Simulations and empirical analyses would help elucidate the inner workings of these two methods.
I do not presume to provide an exhaustive review of propensity score weighting; rather, many authors have provided extensive resources on this method. Please check out the following references for more information!