Multiple Imputation with Simulated Missing Data

class: center, middle, my-title, title-slide

.title[
# Multiple Imputation with Simulated Missing Data
]
.subtitle[
## Data Visualization and Analysis in the Era of COVID-19
]
.author[
### Adam VanIwaarden, Damian Betebenner & Nathan Dadey
]
.institute[
### Center for Assessment
]
.date[
### April 20th, 2022
]

---

class: inverse, center, middle

<div>
<style type="text/css">.xaringan-extra-logo {
width: 55px;
height: 64px;
z-index: 0;
background-image: url(https://centerforassessment.github.io/assets/favicon.png);
background-size: contain;
background-repeat: no-repeat;
position: absolute;
top:0.5em;right:0.5em;
}
</style>
<script>(function () {
  let tries = 0
  function addLogo () {
    if (typeof slideshow === 'undefined') {
      tries += 1
      if (tries < 10) {
        setTimeout(addLogo, 100)
      }
    } else {
      document.querySelectorAll('.remark-slide-content:not(.title-slide):not(.inverse):not(.hide_logo)')
        .forEach(function (slide) {
          const logo = document.createElement('a')
          logo.classList = 'xaringan-extra-logo'
          logo.href = 'https://www.nciea.org'
          slide.appendChild(logo)
        })
    }
  }
  document.addEventListener('DOMContentLoaded', addLogo)
})()</script>
</div>

#  Pandemic Panic

---

##  Anticipation vs. Reality

- After a year of no data, we got a double dose of missing data.

- Anticipation of dramatically reduced test participation.

- Accountability was shelved, but a plan was needed for measuring impact and planning for recovery.

- In the end, anticipated participation drops were generally not as dire as anticipated.

- Presence clouds estimation and interpretation of academic impact.
  - Missing 2021 data will be an ongoing issue in monitoring recovery.

---

##  What's the fuss?

* Missing data introduces problems in the analysis, interpretation and reporting of data.

- biased parameter estimates

- unrepresentative samples

- complicates analytic methods and procedures

- reduces power in statistical tests

- Most statistical methods assume perfect and complete variables are used.
  - Even a few unobserved data points can be problematic.

* Miss the good-ol-days? Census testing heaven?

- Nonsense. Nostalgia. We've always had missing data. We just haven't dealt with it.

---

##  What's the fuss?

* Low participation does not prevent the calculation of student growth or pandemic impact.

* Low participation does impact comparisons of group level results with historical results.
  - How should one compare growth and status results for schools in 2019 and 2021 with very different rates of participation?
  - Are they fully representative due to observed patterns of non-participation?
  - Are populations of interest fundamentally different now?

* Low participation *will* impact our ability to
  - monitor student recovery
  - appropriate educational resources where most needed
  - make complete, accurate and fair accountability decisions for teachers, schools, districts, etc. going forward.

---

##  Retracing Our Steps

- This exercise revisits our approach to identifying methods to deal with missing data.

- What impact does missing data have on Student Growth Percentiles (SGP) analyses?

- Can we faithfully assess the pandemic's academic impact?

- Step 1. Simulate various potential missing data patterns and extents

- Step 2. Compare multiple imputation methods

- Step 3. Correctly apply post-imputation analyses

- Step 4. Try to make sense of it all

---

##  Data

- As in other exercises, we will use **`sgpData_LONG_COVID`** from the [`SGPData` package.](https://github.com/CenterForAssessment/SGPdata)

- To keep things quick and simple, we use a subset restricted to the 2021 7<sup>th</sup> grade ELA cohort.

- Take a complete subset of the data and create missing data patterns only in the student score variables.
  - No demographic, school/district identifiers, etc. will be made missing.
  - Students are known to be enrolled in a school, but were not tested.
  - Enrollment and demographics are essential to estimate school and subgroup impact.
  - Need to know who is missing!

---

##  Tools

- The `R` code is [available on Github](https://github.com/CenterForAssessment/NCME_2022_Training_Session/tree/main/Documents/vignettes/Missing_Data_Multiple_Imputation) and the following `R` packages will be required to re-create the results:

```r
###   Load required packages
##    Data and data management
	require(SGPdata)
	require(data.table)
##    Missing data simulation, imputation and visualization
	require(VIM)
	require(mice)
	require(miceadds)
	require(ggplot2)
##    Parametric imputed data analysis
	require(lme4)
	require(merTools)
##    Point estimate data analysis and visualization
	require(SGP)
	require(gghighlight)
```

---

# Why multiple imputation?

* Multiple plausible values are generated using the observed data, not just one.
  - variability and uncertainty

* Estimates of uncertainty due to the missing data (variability between imputed data sets)
  - more realistic confidence intervals

* Used extensively in other fields of research

* Shown to produce valid statistical inference

* Well developed set an analytic tools available on `R` and other software platforms.

---
class: inverse, center, middle

# Simulating Missing Data

---

# Simulating Missing Data

* Simulations can help us recognize and anticipate missing data patterns
  - prepare for problems and understand the utility and limitations of certain methods.

* Can compare "true" results to the missing/observed and imputed data.

We'll begin by taking a complete data set and generating realistic missing data patterns.

---

##  Retrieve the data

All code assumes that you have downloaded a copy of the code and your working directory
has been set to "Missing_Data_Multiple_Imputation"

```r
##    Define the cohort of interest
cohort.grades <- c("4", "5", "7") # All `classes` must match - GRADE and YEAR are characters.
content.areas <- rep("ELA", 3)
cohort.years <- c("2018", "2019", "2021")
current.year <- tail(cohort.years, 1)

###   Data retrieval
Demonstration_COVID_Data <- SGPdata::sgpData_LONG_COVID
source("Helper_Scripts/Data_Retrieval_Initial.R", local = TRUE)
##    Returns several data objects, but `wide_data` is the main one we'll use.

wide_data <- wide_data[!(is.na(Z_SCORE.2018) | is.na(Z_SCORE.2019) | is.na(Z_SCORE.2021)),]
```

---

##  The `ampute` function

- Developed to provide a systematic process for simulating missing data.

- User specified parameters including the proportion, patterns and factors (observed variables) that contribute to missing data.
  - Excellent vignette [online.](https://rianneschouten.github.io/mice_ampute/vignette/ampute.html#missingness_proportion_per_variable)

- Generate missing data based on prior achievement of students and schools
  - Demographic and other variables could be factored in as desired.
    - E.g., School size could be included as a proxy for low urban participation.

---

##  Search...

```r
##    The variable base (long) names that will be factored into missing data patterns
ampute.factors <- c("SCHOOL_NUMBER", "Z_SCORE")

##    Select the subset of relevant variables from `wide_data`
amp.vars <- grep(paste(ampute.factors, collapse="|"), names(wide_data))
ampute_subset <- wide_data[, ..amp.vars]

##    Summarize 2021 students' prior (2019) scores for their 2021 schools
ampute_subset[, MEAN_PRIOR_Z := mean(Z_SCORE.2019), by = "SCHOOL_NUMBER"]
ampute_subset[, grep("SCHOOL_NUMBER", names(ampute_subset)) := NULL]
```

##  and destroy!

```r
##    Run `ampute` out-of-the-box and explore results
test.result <- mice::ampute(ampute_subset)
```

---

###  Stalking ...

```r
my.patterns <- data.frame(t(matrix(c(
               # Some we want from the default
                  c(0, 1, 1, 1),  #  only missing 2018
                  c(1, 0, 1, 1),  #  only missing 2019
                  c(1, 1, 0, 1),  #  only missing 2021
               # Some we want to add to default
                  c(0, 0, 1, 1),  #  missing 2018 & 2019 (only 2021 score)
                  c(0, 1, 0, 1),  #  missing 2018 & 2021
                  c(1, 0, 0, 1)   #  missing 2019 & 2021
                ), nrow = 4)))
```

---

###  Spying ...

```r
my.freq <- c(0.1,   #  only missing 2018 - fairly common previously
             0.05,  #  only missing 2019 - less common
             0.50,  #  only missing 2021 - most likely
             0.1,   #  only 2021 scores  - students new to the system
             0.175, #  only 2019 scores  - fairly common
             0.075  #  only 2018 scores  - less common, but still occurs
            )   #   frequencies must sum to 1
```

---

###  Weighting ...

```r
my.weights <- test.result$weights

my.weights[1,] <- c(0, 0, 0, 0)     #  MCAR - missing completely at random
my.weights[2,] <- c(0, 0, 0, 0)     #  MCAR - missing completely at random
my.weights[3,] <- c(1, 2, 0.25, 1)  #  MAR/MNAR - missing at random and weakly related to the observed score itself
my.weights[4,] <- c(0, 0, 1, -0.5)  #  MAR - students show up in system at best schools
my.weights[5,] <- c(0, 1, 0.25, 1)  #  MAR/MNAR
my.weights[6,] <- c(1, 0, 0.25, 1)  #  MAR/MNAR

##    reverse weight - lower scoring kids more likely to be missing
my.weights <- my.weights * -1L
```

---

##  Target Confirmed

```r
test.result <- ampute(ampute_subset, prop = 0.325,
                      patterns = my.patterns,
                      freq = my.freq,
                      weights = my.weights)

amp_scores <- data.table(test.result$amp[, 1:3])
setnames(wide_data, names(amp_scores), gsub("Z_SCORE", "COMPLETE_Z", names(amp_scores)))
wide_data <- cbind(wide_data, amp_scores)

nrow(na.omit(wide_data))/nrow(wide_data); 6341/9377 #  +32% overall missing proportion
```

---
class: inverse, center, middle

# Multiple Imputation

---

# Multiple Imputation

* Simulation-based procedure

* Purpose is to achieve valid statistical inference in the presence of missing data
  - not re-creation of "true" values.

* Use information from the observed data to generate plausible values for missing observations.

* Repeated many times to account for sampling error that arises when generating these values

* Model parameters are pooled across the imputed data sets

---

# Multiple Imputation

* Numerous MI methods available, differentiated by the model used to generate values.

* Can be broadly grouped into those that use
  - Joint Modeling (JM)
  - Fully Conditional Specification (FCS).

* JM assumes a multivariate (typically normal) distribution
  - imputations are drawn from it at the same time.
  - better theoretical underpinnings
* FCS, also known as MI with chained equations (MICE), imputes data one variable at a time in an iterative fashion.
  - different imputation model and/or set of predictors possible for each variable.
  - flexible and can preserve unique features of the data

---

## The three steps of multiple imputation

Regardless of the chosen algorithm(s) or modeling approach, the typical MI workflow can be reduced
to three steps in which information from the observed data is used to generate parameter estimates:

- ***Imputation:*** A prediction model generates a set of plausible values for the missing observations, resulting in `M` imputed data sets.
  - ***Analysis:*** The analysis (e.g., baseline SGPs) is conducted on each of the `M` data sets.
  - ***Pooling:*** Parameter and (perhaps more importantly) uncertainty estimates are constructed by pooling across the `M` analyses.

---

## The three steps of multiple imputation

The figure below (Figure 5.1 in Van Buuren, 2018) provides a nice visual synopsis of the MI workflow.

<img src="MI_w_Missing_Data_files/figure-html/mi_wf_vB-1.png" width="75%" style="display: block; margin: auto;" />
---

## Multiple imputation with simulated missing data

- Continue where we left off with the simulation of missing data

- First select the factors we want to include in our imputation model(s).
  - feign ignorance of the underlying missing data generating process
  - use student (and school aggregates of) achievement and FRL status

- Only student level (standardized) scale scores are missing and will be imputed.
  - students known to be enrolled in a school, but not tested

---

## Multiple imputation with simulated missing data

```r
impute.factors <- c("SCHOOL_NUMBER", "Z_SCORE", "FRL") # paste(c("ID", "COMPLETE_Z", impute.factors), collapse="|")

###   Select subset of `wide_data` that will be used for imputation and/or subsequent analyses
impute.vars <- grep("^ID$|COMPLETE_Z|Z_SCORE|SCHOOL_NUMBER|FRL", names(wide_data))
impute_subset <- wide_data[, ..impute.vars]
setnames(impute_subset, gsub(paste0(".", current.year), "", names(impute_subset)))

###   Create institutional level averages of achievement and economic disadvantage
impute_subset[, MEAN_INST_PRIOR_SCORE := mean(Z_SCORE.2019, na.rm = TRUE), by = "SCHOOL_NUMBER"] #
impute_subset[, PERCENT_INST_FRL := mean(FRL, na.rm = TRUE), by = "SCHOOL_NUMBER"]
```

???

Note we are also keeping our complete ("true") score values in this subset. These will
NOT be used for imputation (cheating!), but we want to keep them in our data set for
subsequent analyses. This can be done for any number of variables of interest (e.g.,
demographics, indicators of mode of instruction, etc.), but they must be given special
attention to ensure they are/not used as imputation predictors as desired.

---

### Imputation with `mice::mice` - model parameters

- Define key parameters for the MI process.
  - number of imputed data sets we want (`M`)
  - number of iterations (`MAXIT`) we will allow for each imputation to converge
  - imputation algorithm (`impute.method`).

- Produce default objects for the `method` and `predictorMatrix` arguments.

```r
###   Specify imputation parameters for `mice::mice`
M <- 20
MAXIT <- 5
impute.method <- "pmm"

my.methods <- mice::make.method(data=impute_subset)
my.predMtrx <- mice::make.predictorMatrix(data=impute_subset)
```

---

### Imputation with `mice::mice` - parameter modification

- Modify `my.methods` and `my.predMtrx` to ensure that we are
- Prevent ID and the complete scores from being predicted or used in any predictions.

```r
my.methods[grep("^Z_SCORE", names(my.methods))] <- impute.method

my.predMtrx[, "ID"] <- 0
my.predMtrx[, "SCHOOL_NUMBER"] <- 0

We will also use schools as a (categorical) predictor of the missing student scores.

```r
impute_subset[, SCHOOL_NUMBER := as.character(SCHOOL_NUMBER)]
my.predMtrx["SCHOOL_NUMBER", grep("^Z_SCORE", names(my.methods))] <- 1
```

---

###  Step 1 - data imputation

We are now ready to create our first set of imputed data.

```r
imputed_results <- mice::mice(data = impute_subset,
                              method = my.methods,
                              predictorMatrix = my.predMtrx,
                              m = M, maxit = MAXIT, seed = 719589L, print = FALSE)
```

---

###  Step 1 - imputation diagnostic plots

Diagnostic plots of interest look at the convergence of the `M` imputations over the `MAXIT` iterations:

```r
plot(imputed_results)
```

and the distribution densities of the `M` imputations (in red) compared to the observed distribution (blue) for each of the imputed variables.

```r
densityplot(imputed_results)
```

---

###  Step 1 - imputation diagnostic plots

Here is a custom diagnostic plot - a set of scatter plots that allow us to compare
the bivariate distribution of the complete, imputed and observed data (top to bottom).
Plots in the left column are the prior years (2018 and 2019) and those on the right are
for 2019 and 2021.

---

###  Step 2/3 - imputation analysis and results pooling

- Collect the imputations into a `list` object and convert each back into longitudinal form. Do this for `wide_data` as well for complete and observed data.

- Conduct `M` sets of post-imputation analyses and one set of pooled results.

- Run identical analyses using both the complete and observed data.

```r
###   Collect all imputation results in a list length = M
res <- miceadds::mids2datlist(imputed_results)

###   Convert data (complete, missing, and imputed) to long format
source("Helper_Scripts/Data_Retrieval_for_Analysis.R")
##    Returns objects `long_complete`, `long_amputed` and `res_long`
```

---

### Linear mixed effects model

We run is a linear mixed effects model here because it is a fairly interesting, and popular, model for analyzing
longitudinal data for students nested within schools.

- Also provides parameters that have well documented and operationalized procedures for pooling.
  - The parameter of interest is a fixed effect estimate for COVID impact.
  - Get pooled estimate of this along with standard errors and t values from `merTools` package.

```r
frmla <- as.formula("Z_SCORE ~ YEAR + COVID_IMPACT*FRL + (COVID_IMPACT | SCHOOL_NUMBER/ID)")
lmer.ctl <- lmerControl(check.conv.grad = .makeCC("warning", 0.02, NULL))

true_mod <- lmer(formula = frmla, data = long_complete, control = lmer.ctl)
miss_mod <- lmer(formula = frmla, data = long_amputed, control = lmer.ctl)
imp_mods <- lmerModList(formula=frmla, data = res_long, control = lmer.ctl)

###   Collate and print model (fixed effects) estimate comparison
source("Helper_Scripts/MI_LME_Model_Comp.R")
```

---

### Baseline student growth percentiles

* Baseline SGPs are a different take on normative growth.
  - Referenced against a prior cohort of interest.
  - "Super-cohort" comprised of several years of students with a common course/grade progression
  - a single, particular, cohort
  - 2019 cohorts form the baselines for impact analyses.

* Baseline comparisons use to examine if the system/parts are improving (or declining) over time relative
to the established baseline.

* Major assumption required is that the scale scores are well anchored.
  - If not, then any deviation from "typical" growth may be an artifact of test scaling.

---

### Baseline student growth percentiles

- In the COVID impact analyses, we expected to (and did) see median SGP values well below 50 at state, district and school levels.

- Beyond 2021, we hope to see that systems are improving (recovering) over time.
  - median SGPs greater than 50
  - what *was* typical growth in the past is now lower growth.

---

###  Running Baseline SGPs

- Create baseline SGPs for the 2021 7<sup>th</sup> grade ELA cohort for the complete, missing/observed and 20 imputed data sets.

- Several objects of interest are returned:
  - `Complete_SGP` and `Missing_SGP`, which are SGP class objects
  -  a single data table - `Imputed_SGP_Data`, with results using imputation data
  - `SGP_Imputation_Summaries` object with estimates for school level values (mean scores, pooled errors, etc.)

```r
##   Run Baseline SGP analyses using complete, missing (observed), and imputed data
source("Helper_Scripts/SGP_Baseline_Analyses.R")
##    Returns `Complete_SGP`, `Missing_SGP`, `Imputed_SGP_Data` and `SGP_Imputation_Summaries`.
```

---

###  Pooling Baseline SGPs

We want to quantify COVID impact at various institutional and subgroup levels in
metrics that would
  1) be meaningful to both stakeholders and fellow researchers and
  2) communicate additional uncertainty due to missing data.

Some of the statistics we calculated included:

* Mean/median scale scores (2021 status)
  * Percent Proficient (2021 status)
  * Mean/median baseline SGPs (2021 growth)
  * Change scores in the above status and growth metrics (2019 to 2021)
  * Effect sizes for 2019 to 2021 changes

---

###  Pooling Baseline SGPs

* Caterpillar plots show 2021 estimates of school mean scale scores and baseline SGPs from each data type.
  - *red* points and lines correspond to values from the complete data analysis
  - observed values are *black*
  - *green* pertains to imputations.
  - Error bars are for the pooled imputations, color coded by significance levels (difference between imputed and observed values).

```r
source("Helper_Scripts/SGP_MI_Caterpillar_Plots.R")
##    Creates caterpillar plots for school-level mean achievement and growth.
##    These include bands of uncertainty associated with the imputations.
```

???

Plots similar to these were originally developed during the imputation method vetting process, and the
***F*** statistics corresponded to the difference between the imputations and the "true" complete values.
The difference in these represent how we used such error bars and statistics with states' observed/missing
data to flag schools where the MI analyses produced significantly different assessments of schools' status
or growth than what was indicated in the missing data.

---
##  Data imputation with `2l.pan`

- Users can change the imputation method and repeat the steps starting after `impute_subset` is set up.
  - iterative process of changing imputation methods, predictor factors or any other parameters.

- Use the `2l.pan` cross-sectional model (students nested within schools)
- Remove `FRL` and `PERCENT_INST_FRL` from the predictors (we know those factors were not use to generate missing data).

```r
###   Specify imputation parameters for `mice::mice`
impute.method <- "2l.pan"

##    Cluster identifiers need to be integers for most 2-level methods
impute_subset[, SCHOOL_NUMBER := as.integer(SCHOOL_NUMBER)]

my.methods <- mice::make.method(data=impute_subset)
my.methods[grep("^Z_SCORE", names(my.methods))] <- impute.method
```

---
##  Data imputation with `2l.pan`

```r
my.predMtrx <- mice::make.predictorMatrix(data=impute_subset)
my.predMtrx[, "ID"] <- 0
my.predMtrx[, "SCHOOL_NUMBER"] <- 0

my.predMtrx[grep("^Z_SCORE", names(my.methods)), "SCHOOL_NUMBER"] <- -2
my.predMtrx["SCHOOL_NUMBER", grep("^Z_SCORE", names(my.methods))] <- 2

my.predMtrx[, grep("COMPLETE_Z", rownames(my.predMtrx))] <- 0 # i.e. "Nothing used to predict it" -- redundant my.methods["COMPLETE_Z"] == ""
my.predMtrx[grep("COMPLETE_Z", rownames(my.predMtrx)), ] <- 0 # i.e. "It's used to predict nothing"
my.predMtrx["FRL", ] <- 0
my.predMtrx["PERCENT_INST_FRL", ] <- 0
my.predMtrx[, "FRL"] <- 0
my.predMtrx[, "PERCENT_INST_FRL"] <- 0
```

---

##  Step 1b - create new imputed results

We are now ready to create our SECOND set of imputed data.