What is Propensity Score Weighting?

A propensity score (Rosenbaum & Rubin, 1983, 1984; as cited in Li et al., 2013) is defined as “the probability of receiving a treatment conditional on a set of observed covariates” (Lee et al., 2010, p. 337; Rosenbaum & Rubin, 1983). Consider a researcher who wants to estimate the mean difference between two groups on some outcome variable. Randomization helps ensure that the two groups are similar on the covariates of interest, such as demographic variables. Yet in quasi-experimental and observational studies, we do not (or cannot) form groups through random assignment. As a result, the groups may differ substantially on one or more covariates (Burgette et al., 2016).

Propensity score weighting (PSW) is a method to better “balance the covariate distribution between groups” prior to some analysis (Li et al., 2013, p. 3). PSW is often preferred over propensity score matching because the latter may remove observations from the sample if a plausible match cannot be found (Desai & Franklin, 2019). Numerous PSW methods have been proposed in the literature, differing by the model used to estimate the propensity scores (e.g., parametric logistic regression or non-parametric decision tree methods; Burgette et al., 2016; Lee et al., 2010), the weighting approach (e.g., inverse probability of treatment weighting [IPTW] or standardized mortality ratio weighting [SMRW]; see Desai & Franklin, 2019), and the handling of hierarchical data (see Li et al., 2013).

Moreover, the appropriate PSW method depends on the estimand of interest (e.g., Desai & Franklin, 2019). Two common estimands are the average treatment effect (ATE) and the average treatment effect on the treated (ATT), illustrated with a short weighting sketch after the definitions below:

  • ATE: “Mean difference in outcomes if all individuals in the population had received the treatment versus if all individuals had received the control” (Burgette et al., 2016, p. 4).
  • ATT: “The average treatment effect in a population with a distribution of risk factors similar to that of the treated group” (Lee et al., 2010, p. 340).
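
To make these estimands concrete, below is a minimal sketch (my own illustration, not code from the cited papers) of how estimated propensity scores are commonly converted into weights; the vectors e and treat are hypothetical placeholders for estimated scores and group indicators. ATE weighting corresponds to IPTW, while ATT weighting (weight 1 for treated units, e/(1 - e) for control units) corresponds to SMRW.

# Minimal sketch: converting estimated propensity scores into weights
# 'e' and 'treat' are hypothetical placeholder vectors
e     = c(0.20, 0.50, 0.80)   # estimated propensity scores, P(treated | covariates)
treat = c(1, 0, 1)            # group indicator: 1 = treated, 0 = control

# ATE (IPTW) weights: 1/e for treated units, 1/(1 - e) for control units
w.ate = ifelse(treat == 1, 1 / e, 1 / (1 - e))

# ATT (SMRW) weights: 1 for treated units, e/(1 - e) for control units
w.att = ifelse(treat == 1, 1, e / (1 - e))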

Note that the term “treatment” is commonly used to define the grouping variable with PSW, based on applications to quasi-experimental studies using a treatment and control group. However, PSW can also be readily applied to observational studies that do not introduce a treatment mechanism, but rather have two (or more) mutually exclusive groups (Li et al., 2013). PSW can also be applied to longitudinal studies, where attrition is a concern (e.g., Weuve et al., 2012).

The above introduction is not intended to be an exhaustive literature review of PSW. The interested reader is referred to papers like Desai and Franklin (2019), Lee and colleagues (2010), Rosenbaum and Rubin (1983), and Vansteelandt and Daniel (2014) for more details on the methodology and applications of PSW.

Applications to Education Data

Researchers and policymakers often want to compare mean scale scores and student growth percentiles (SGPs) across years. For example, we may want to estimate the difference in mean or median scale scores between fifth graders in 2019 and fifth graders in 2021. To draw valid inferences from such comparisons, we need to ensure that we have an “apples-to-apples” comparison between the two samples.

One potential roadblock to this comparison is missing data, arising from factors like differential participation rates. Another is that the student composition may look quite different across years, even within the same school. For example, students may leave and enter schools between assessments, leaving the distribution of important covariates (e.g., demographic variables) unbalanced when comparing cohorts across two years. Although this is a concern in any cross-sectional analysis of assessment data, the possibility of unbalanced samples is even more pronounced in light of the COVID-19 pandemic.

Other researchers have already started to address the analytic challenges posed by changing sample compositions. Notably, Ho (2021) proposed a “Fair Trend” metric that uses a series of regression models to adjust 2019 assessment data, facilitating comparisons to scores in 2021. PSW may be an alternative method to adjust the aggregated scale score and SGP comparisons between 2019 and 2021. For instance, PSW has been used to compare mean score differences between college freshmen and seniors in a cross-sectional sample (Liu et al., 2016).

Example with SGPdata

Using R statistical software (R Core Team, 2021), I illustrate how PSW may be applied to educational assessment data using publicly-available data in the SGPdata package (Betebenner et al., 2021). The goal is to estimate the mean mathematics scale score difference between fifth graders in 2019 and fifth graders in 2021. This research question matches the one in Dadey’s (2021) vignette, which uses these data to illustrate the computation of Ho’s (2021) Fair Trend metric.

In the current example, propensity score weights are estimated using the twang package (Cefalu et al., 2021). The processes explicated here largely follow the twang vignette.

Data Set-Up

First, I load the sgpData_LONG_COVID data set from the SGPdata package. This synthetic data set provides scale scores, SGPs, and demographic characteristics from 2016 to 2023. The 2020 data were removed to mirror missing assessment data due to the pandemic. I focus specifically on observations from grade 5 in either 2019 or 2021 (again, matching Dadey [2021]).

# Load packages
require(pacman)
pacman::p_load(data.table, ggplot2, kableExtra, SGPdata, survey, twang)

# Set seed for reproducibility
set.seed(6921)

# Load data
# Subset only mathematics in grade 5 and either 2019 or 2021
psw.data = sgpData_LONG_COVID[sgpData_LONG_COVID$GRADE == 5 &
                              sgpData_LONG_COVID$CONTENT_AREA == "MATHEMATICS" &
                              sgpData_LONG_COVID$YEAR %in% c(2019, 2021), ]

# Sample sizes
n.2019 = nrow(psw.data[psw.data$YEAR == 2019, ])
n.2021 = nrow(psw.data[psw.data$YEAR == 2021, ])

There are 7,985 fifth graders in the 2019 sample and 7,880 in the 2021 sample.

Propensity Score Estimation

To estimate the propensity scores with twang, the covariates should be a numeric (or similar) class in R. One way to transform the variables (which are class factor in sgpData_LONG_COVID) is to use dummy coding with model.matrix. Note also that the data should be of class data.frame for use with twang.

# Convert data to a dummy-coded data frame using model.matrix
# The [,-1] removes the first column, which is a unit vector representing an intercept
psw.mm = data.frame(model.matrix(~ YEAR + ETHNICITY + FREE_REDUCED_LUNCH_STATUS + 
                                   ELL_STATUS + IEP_STATUS + GENDER + SCALE_SCORE, 
                                 data = psw.data)[,-1])

# Add variable names
names(psw.mm) = c("YEAR", "ETH_ASIAN", "ETH_HISP", "ETH_OTH", "ETH_WHITE",
                  "FRL", "ELL", "IEP", "MALE", "SCALE_SCORE")

Note that model.matrix automatically sets psw.mm$YEAR = 1 if an observation is from 2021, and a 0 otherwise. In this case, the weights for observations in 2021 will all be 1, and the 2019 mean scale scores will be adjusted to achieve covariate balance. To alter the analyses so that adjustments are made to the 2021 scores instead, I reverse the dichotomization of psw.mm$YEAR.

# Change psw.mm$YEAR so 1 = 2019, and 0 = 2021
psw.mm$YEAR_RELEVEL = NA
psw.mm$YEAR_RELEVEL[psw.mm$YEAR == 1] = 0
psw.mm$YEAR_RELEVEL[psw.mm$YEAR == 0] = 1

To compute the propensity scores, I use the twang::ps function. This function by default uses gradient boosting with the gbm package (Greenwell et al., 2020), which is an aggregated decision tree method. Features of the boosting model can be modified using arguments like n.trees, interaction.depth, and shrinkage (Ridgeway et al., 2021). Typically, the number of decision trees should be quite large and the shrinkage parameter should be small to ensure a slow learning rate (James et al., 2013). I use estimand = "ATT" here for demonstrative purposes. The Kolmogorov-Smirnov (KS) statistic and standardized effect size are used “for measuring and summarizing balance across pretreatment variables” (Ridgeway et al., 2021, p. 3).

Broadly, the propensity scores are estimated by predicting group membership (here denoted by YEAR_RELEVEL) from the covariates of interest. Therefore, the functional form for the model is YEAR_RELEVEL ~ ETH_ASIAN + ETH_HISP + ETH_OTH + ETH_WHITE + FRL + ELL + IEP + MALE, where the right-hand side indicates that all eight covariates (dummy-coded variables for ethnicity, FRL status, etc.) should be included. The gradient boosting method automatically accommodates interaction and nonlinear effects among the covariates (Ridgeway et al., 2021).

# Estimate propensity scores
prop.scores = ps(
  
  # Model form
  YEAR_RELEVEL ~ ETH_ASIAN + ETH_HISP + ETH_OTH + ETH_WHITE +
                 FRL + ELL + IEP + MALE,                        
  
  # Data            
  data = psw.mm,
  
  # Number of decision trees
  n.trees = 15000,
  
  # Tree interaction depth
  interaction.depth = 2,
  
  # Tree shrinkage parameter (lambda)
  shrinkage = 0.01,
  
  # Estimand
  estimand = "ATT", 
  
  # Stop method(s)
  stop.method = c("es.mean", "ks.max"),
  
  # Display printed output to the console as the function progresses?
  verbose = FALSE
 
)

Note that in some cases (e.g., with certain data sets or random seeds), a user may receive the following warning message: “Optimal number of iterations is close to the specified n.trees. n.trees is likely set too small and better balance might be obtainable by setting n.trees to be larger.”
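
If this warning appears, one quick check (using twang’s summary method for ps objects) is to inspect where each stop method settled; the summary output reports, among other balance statistics, the iteration at which each stop method was optimized. If that iteration sits near n.trees, refitting with a larger value is a reasonable response.

# Summarize the ps object; check whether the optimal iteration
# for each stop method is close to the specified n.trees
summary(prop.scores)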

We can next examine the propensity scores using a variety of diagnostic checks. These are explained in more depth in the twang vignette. First, the function bal.table() displays differences on the covariates between the two groups when the samples are (a) unweighted, (b) weighted using the mean standardized effect size (es.mean), and (c) weighted using the maximum KS statistic (ks.max).

# Balance table
bal.table(prop.scores)
## $unw
##           tx.mn tx.sd ct.mn ct.sd std.eff.sz   stat     p    ks ks.pval
## ETH_ASIAN 0.157 0.364 0.133 0.339      0.068  4.440 0.000 0.025   0.015
## ETH_HISP  0.203 0.402 0.249 0.433     -0.116 -7.026 0.000 0.047   0.000
## ETH_OTH   0.093 0.291 0.076 0.265      0.059  3.880 0.000 0.017   0.194
## ETH_WHITE 0.461 0.499 0.463 0.499     -0.003 -0.215 0.830 0.002   1.000
## FRL       0.488 0.500 0.514 0.500     -0.051 -3.218 0.001 0.026   0.011
## ELL       0.096 0.295 0.058 0.233      0.130  9.060 0.000 0.038   0.000
## IEP       0.110 0.313 0.108 0.310      0.009  0.549 0.583 0.003   1.000
## MALE      0.520 0.500 0.507 0.500      0.027  1.685 0.092 0.013   0.478
## 
## $es.mean.ATT
##           tx.mn tx.sd ct.mn ct.sd std.eff.sz   stat     p    ks ks.pval
## ETH_ASIAN 0.157 0.364 0.158 0.364     -0.001 -0.030 0.976 0.000       1
## ETH_HISP  0.203 0.402 0.204 0.403     -0.003 -0.183 0.854 0.001       1
## ETH_OTH   0.093 0.291 0.094 0.292     -0.003 -0.178 0.859 0.001       1
## ETH_WHITE 0.461 0.499 0.459 0.498      0.004  0.254 0.800 0.002       1
## FRL       0.488 0.500 0.488 0.500     -0.001 -0.047 0.963 0.000       1
## ELL       0.096 0.295 0.096 0.294      0.001  0.038 0.970 0.000       1
## IEP       0.110 0.313 0.111 0.314     -0.001 -0.040 0.968 0.000       1
## MALE      0.520 0.500 0.523 0.500     -0.004 -0.263 0.793 0.002       1
## 
## $ks.max.ATT
##           tx.mn tx.sd ct.mn ct.sd std.eff.sz   stat     p    ks ks.pval
## ETH_ASIAN 0.157 0.364 0.158 0.364     -0.001 -0.035 0.972 0.000       1
## ETH_HISP  0.203 0.402 0.204 0.403     -0.003 -0.189 0.850 0.001       1
## ETH_OTH   0.093 0.291 0.094 0.292     -0.003 -0.156 0.876 0.001       1
## ETH_WHITE 0.461 0.499 0.459 0.498      0.004  0.253 0.800 0.002       1
## FRL       0.488 0.500 0.488 0.500     -0.001 -0.047 0.962 0.000       1
## ELL       0.096 0.295 0.096 0.294      0.001  0.073 0.942 0.000       1
## IEP       0.110 0.313 0.111 0.314     -0.001 -0.049 0.961 0.000       1
## MALE      0.520 0.500 0.523 0.500     -0.004 -0.251 0.802 0.002       1

Looking at the results for the unweighted samples, we see that the two groups substantially differ on a handful of covariates. For example, the 2021 sample (here denoted as the “control” group) has a higher proportion of students receiving free or reduced lunch, as well as a higher proportion of self-identifying Hispanic students. Weighting with either the standardized effect size or the maximum KS statistic resulted in balanced samples based on the given covariates.

Next, a series of plots are available to show the distributions of the weights, as well as the differences on the covariates when the samples are weighted or unweighted. Below is a reproduction of Table 1 from Ridgeway and colleagues (2021, p. 9), showcasing the plot options with an object from the ps function.

Descriptive Argument   Numeric Argument   Description
optimize               1                  Balance measure as a function of GBM iterations
boxplot                2                  Boxplot of treatment/control propensity scores
es                     3                  Standardized effect size of pretreatment variables
t                      4                  t-test p-values for weighted pretreatment variables
ks                     5                  Kolmogorov-Smirnov p-values for weighted pretreatment variables
histogram              6                  Histogram of weights for treatment/control

I present each plot for our propensity scores, prop.scores.

# Plot 1: "optimize"
plot(prop.scores, plots = 1)

# Plot 2: "boxplot"
plot(prop.scores, plots = 2)

# Plot 3: "es"
plot(prop.scores, plots = 3)

# Plot 4: "t"
plot(prop.scores, plots = 4)

# Plot 5: "ks"
plot(prop.scores, plots = 5)

# Plot 6: "histogram"
plot(prop.scores, plots = 6)

Summarizing the above plots, we see high overlap in the propensity scores across groups (plot 2), as well as reduced differences between groups on the given covariates after weighting (plots 3-5). For example, plot 4 shows that the p-values were all greater than 0.75 after weighting using either the standardized effect size or the maximum KS statistic. These diagnostics suggest that propensity score weighting with gradient boosting was successful in balancing the two groups of interest.

Applying Weights to the Analysis

To extract the weights for subsequent analysis, I use the get.weights() function. Here, I select the weights based on the standardized effect size (determined by the stop.method argument). One could alternatively set stop.method = "ks.max".

# Extract weights
prop.weights = get.weights(prop.scores, stop.method = "es.mean")

The weights can then be integrated into the analysis. Recall that the goal here is to estimate the mean scale score difference on a mathematics assessment between fifth graders in 2019 and fifth graders in 2021. One way to do this is to run a simple linear regression, regressing the scale scores on the grouping variable (where, after the releveling above, a 1 indicates a 2019 observation and a 0 indicates a 2021 observation). With the lm function in R, we can apply the propensity score weights using the weights argument.

# Simple linear regression to estimate mean difference
lm.mod = lm(
  
  # Model formula
  SCALE_SCORE ~ YEAR_RELEVEL, 
  
  # Data
  data = psw.mm,
  
  # Add weights
  weights = prop.weights

)

# Model summary
summary(lm.mod)
## 
## Call:
## lm(formula = SCALE_SCORE ~ YEAR_RELEVEL, data = psw.mm, weights = prop.weights)
## 
## Weighted Residuals:
##      Min       1Q   Median       3Q      Max 
## -165.713  -23.134   -0.134   20.866  151.720 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  498.6647     0.3918 1272.64   <2e-16 ***
## YEAR_RELEVEL   7.4688     0.5541   13.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35.01 on 15863 degrees of freedom
## Multiple R-squared:  0.01132,    Adjusted R-squared:  0.01126 
## F-statistic: 181.7 on 1 and 15863 DF,  p-value: < 2.2e-16

Looking at the slope coefficient estimate from the above model, we find an estimated mean difference of 7.469 points, suggesting an average decrease in mathematics scale scores from 2019 to 2021 (although recall that these are simulated data, so don’t read too much into the results). Notice that if we simply took the mean difference in mathematics scale scores without balancing, we’d obtain a slightly different estimate.

# Simple linear regression with no weights
lm.mod.noweight = lm(
  
  # Model formula
  SCALE_SCORE ~ YEAR_RELEVEL, 
  
  # Data
  data = psw.mm

)

# Model summary
summary(lm.mod.noweight)
## 
## Call:
## lm(formula = SCALE_SCORE ~ YEAR_RELEVEL, data = psw.mm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -120.134  -23.603   -0.134   20.866  143.866 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  499.6032     0.3910 1277.85   <2e-16 ***
## YEAR_RELEVEL   6.5303     0.5511   11.85   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34.71 on 15863 degrees of freedom
## Multiple R-squared:  0.008774,   Adjusted R-squared:  0.008712 
## F-statistic: 140.4 on 1 and 15863 DF,  p-value: < 2.2e-16

Alternatively, the twang vignette suggests using the survey package (Lumley, 2020) to estimate treatment effects. Here we follow the vignette’s steps to compute the estimated mean difference between the samples.

# Generate survey design with svydesign
# See ??svydesign
psw.design = svydesign(
  
  # No clustering (ids = ~1 treats observations as independent)
  ids = ~1, 
  
  # Propensity score weights
  weights = prop.weights, 
  
  # Data
  data = psw.mm

)

# Run analysis
svy.mod = svyglm(
  
  # Model formula
  SCALE_SCORE ~ YEAR_RELEVEL, 
  
  # Object from svydesign
  design = psw.design
  
)

# Model summary
summary(svy.mod)
## 
## Call:
## svyglm(formula = SCALE_SCORE ~ YEAR_RELEVEL, design = psw.design)
## 
## Survey design:
## svydesign(ids = ~1, weights = prop.weights, data = psw.mm)
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  498.6647     0.4031 1237.12   <2e-16 ***
## YEAR_RELEVEL   7.4688     0.5691   13.12   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1217.564)
## 
## Number of Fisher Scoring iterations: 2

Unsurprisingly, we obtain the same point estimate as when using lm with weights = prop.weights, although svyglm reports a slightly larger, design-based standard error.
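
Because the model contains a single dummy predictor, the weighted least squares slope is simply the difference in weighted group means. As a quick sanity check (my own addition, not a step from the twang vignette), we can verify this directly:

# Weighted group means; their difference should match the
# YEAR_RELEVEL coefficient (roughly 7.469)
m.2019 = weighted.mean(psw.mm$SCALE_SCORE[psw.mm$YEAR_RELEVEL == 1],
                       prop.weights[psw.mm$YEAR_RELEVEL == 1])
m.2021 = weighted.mean(psw.mm$SCALE_SCORE[psw.mm$YEAR_RELEVEL == 0],
                       prop.weights[psw.mm$YEAR_RELEVEL == 0])
m.2019 - m.2021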

Some authors (e.g., Bang & Robins, 2005; Funk et al., 2011; Kang & Schafer, 2007) have illustrated the use of doubly robust estimators, which may reduce estimation bias in cases of model misspecification. Several other R packages implement various doubly robust estimators. The twang vignette gives an example where we fit the linear regression model with both the PSW values and the covariates of interest (Ridgeway et al., 2021). McCaffrey and colleagues (2013) suggest that it is particularly helpful to include any covariates for which balance was not achieved by the PSW method.

# Run "doubly robust" estimation
lm.mod.dr = lm(
  
  # Model formula
  SCALE_SCORE ~ YEAR_RELEVEL + ETH_ASIAN + ETH_HISP + ETH_OTH + ETH_WHITE +
                 FRL + ELL + IEP + MALE, 
  
  # Data
  data = psw.mm,
  
  # Add weights
  weights = prop.weights

)

# Model summary
summary(lm.mod.dr)
## 
## Call:
## lm(formula = SCALE_SCORE ~ YEAR_RELEVEL + ETH_ASIAN + ETH_HISP + 
##     ETH_OTH + ETH_WHITE + FRL + ELL + IEP + MALE, data = psw.mm, 
##     weights = prop.weights)
## 
## Weighted Residuals:
##      Min       1Q   Median       3Q      Max 
## -133.879  -19.766   -2.256   16.281  149.675 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  510.5114     0.9310 548.321  < 2e-16 ***
## YEAR_RELEVEL   7.4657     0.4857  15.372  < 2e-16 ***
## ETH_ASIAN      0.3490     1.0341   0.338    0.736    
## ETH_HISP      -0.5300     0.9923  -0.534    0.593    
## ETH_OTH        0.7354     1.1503   0.639    0.523    
## ETH_WHITE      0.3587     0.9071   0.395    0.692    
## FRL          -12.7284     0.4919 -25.876  < 2e-16 ***
## ELL          -26.6052     0.8337 -31.911  < 2e-16 ***
## IEP          -39.7808     0.7822 -50.859  < 2e-16 ***
## MALE           2.1769     0.4901   4.442 8.97e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.68 on 15855 degrees of freedom
## Multiple R-squared:  0.2408, Adjusted R-squared:  0.2403 
## F-statistic: 558.6 on 9 and 15855 DF,  p-value: < 2.2e-16

With this model, the estimated mean difference is 7.466, nearly identical to the weighted-only estimate.

Additional Analyses

The above analysis is admittedly quite simple. For example, we might consider a multilevel modeling approach to account for the clustering of students within schools and districts; a design-based sketch of one such adjustment appears below. The twang vignette also walks through possible sensitivity analyses after using PSW methods. Note that numerous other R packages are available for PSW and other doubly robust estimators.
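
Short of a full multilevel model, the already-loaded survey package can at least account for school-level clustering in the standard errors. The sketch below is an illustration under two assumptions: that sgpData_LONG_COVID carries a SCHOOL_NUMBER field, and that the rows of psw.mm align with those of psw.data (which appears to hold here, since no observations were dropped when building psw.mm).

# Sketch: design-based adjustment for school clustering
# (assumes SCHOOL_NUMBER exists and rows of psw.mm align with psw.data)
psw.mm$SCHOOL = psw.data$SCHOOL_NUMBER

# Declare schools as the clustering units
psw.design.cl = svydesign(ids = ~SCHOOL, weights = prop.weights, data = psw.mm)

# Same mean-difference model, now with cluster-robust standard errors
summary(svyglm(SCALE_SCORE ~ YEAR_RELEVEL, design = psw.design.cl))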

It would also be beneficial to compare PSW to Ho’s (2021) Fair Trend metric to see whether the two approaches provide similar results. Both methods attempt to create an “apples-to-apples” comparison for aggregated assessment data, but take slightly different analytic approaches to get there. Simulations and empirical analyses would help elucidate the inner workings of these two methods.

References and Resources

I do not presume to provide an exhaustive review of propensity score weighting; rather, many authors have provided extensive resources on this method. Please check out the following references for more information!

  • Bang, H., & Robins, J. M. (2005). Doubly robust estimation in missing data and causal inference models. Biometrics, 61, 962-972. https://doi.org/10.1111/j.1541-0420.2005.00377.x
  • Betebenner, D. W., Van Iwaarden, A. R., & Domingue, B. (2021). SGPdata: Exemplar data sets for student growth percentile (SGP) analyses. R package version 25.1-0.0. https://centerforassessment.github.io/SGPdata/
  • Burgette, J. M., Preisser, J. S., & Rozier, R. G. (2016). Propensity score weighting: An application to an Early Head Start dental study. Journal of Public Health Dentistry, 76(1), 17-29. https://doi.org/10.1111/jphd.12106
  • Cefalu, M., Ridgeway, G., McCaffrey, D., Morral, A., Griffin, B. A., & Burgette, L. (2021). twang: Toolkit for weighting and analysis of nonequivalent groups. R package version 2.1. https://CRAN.R-project.org/package=twang
  • Dadey, N. (2021, May 3). Exploring Ho’s (2021) Fair Trend metric. Retrieved from https://centerforassessment.github.io/cfaTools/articles/fair_trend.html
  • Desai, R. J., & Franklin, J. M. (2019). Alternative approaches for confounding adjustment in observational studies using weighting based on the propensity score: A primer for practitioners. The BMJ, 367, l5657. https://doi.org/10.1136/bmj.l5657
  • Funk, M. J., Westreich, D., Wiesen, C., Sturmer, T., Brookhart, M. A., & Davidian, M. (2011). Doubly robust estimation of causal effects. American Journal of Epidemiology, 173(7), 761-767. https://doi.org/10.1093/aje/kwq439
  • Greenwell, B., Boehmke, B., Cunningham, J., & GBM Developers. (2020). gbm: Generalized boosted regression models. R package version 2.1.8. https://CRAN.R-project.org/package=gbm
  • Ho, A. (2021, February 26). Three test-score metrics that all states should report in the COVID-19-affected spring of 2021. Harvard Graduate School of Education. https://scholar.harvard.edu/files/andrewho/files/threemetrics.pdf
  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning. Springer.
  • Lee, B. K., Lessler, J., & Stuart, E. A. (2010). Improving propensity score weighting using machine learning. Statistics in Medicine, 29(3), 337-346. https://doi.org/10.1002/sim.3782
  • Li, F., Zaslavsky, A. M., & Landrum, M. B. (2013). Propensity score weighting with multilevel data. Statistics in Medicine, 32(19), 3373-3387. https://doi.org/10.1002/sim.5786
  • Liu, O. L., Liu, H., Roohr, K. C., & McCaffrey, D. F. (2016). Investigating college learning gain: Exploring a propensity score weighting approach. Journal of Educational Measurement, 53(3), 352-367. https://doi.org/10.1111/jedm.12112
  • Lumley, T. (2020). survey: Analysis of complex survey samples. R package version 4.0. https://CRAN.R-project.org/package=survey
  • McCaffrey, D. F., Griffin, B. A., Almirall, D., Slaughter, M. E., Ramchand, R., & Burgette, L. F. (2013). A tutorial on propensity score estimation for multiple treatments using generalized boosted models. Statistics in Medicine, 32(19), 3388-3414. https://doi.org/10.1002/sim.5753
  • R Core Team (2021). R: A language and environment for statistical computing. Vienna, Austria. Retrieved from https://www.R-project.org/
  • Ridgeway, G., McCaffrey, D., Morral, A., Cefalu, M., Burgette, L., Pane, J., & Griffin, B. A. (2021, June 5). Toolkit for weighting and analysis of nonequivalent groups: A guide to the twang package. Retrieved from https://cran.r-project.org/web/packages/twang/vignettes/twang.pdf
  • Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41-55. https://doi.org/10.1093/biomet/70.1.41
  • Rosenbaum, P. R., & Rubin, D. B. (1984). Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association, 79(387), 516-524. https://doi.org/10.1080/01621459.1984.10478078
  • Vansteelandt, S., & Daniel, R. M. (2014). On regression adjustment for the propensity score. Statistics in Medicine, 33, 4053-4072. https://doi.org/10.1002/sim.6207
  • Weuve, J., Tchetgen, E. J. T., Glymour, M. M., Beck, T. L., Aggarwal, N. T., Wilson, R. S., Evans, D. A., & Mendes de Leon, C. F. (2012). Accounting for bias due to selective attrition: The example of smoking and cognitive decline. Epidemiology, 23(1), 119-128. https://doi.org/10.1097/EDE.0b013e318230e861
  • Kang, J. D. Y., & Schafer, J. L. (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, 22(4), 523-539. https://doi.org/10.1214/07-STS227