1 Simulation Overview

Researchers and policymakers are starting to examine the impact of the Covid pandemic on student learning. One question of interest concerns the extent to which we can compare students’ current test scores to those from before the pandemic (e.g., in 2019). For example, due to differences in participation rates across the years, the composition of students at a school in 2019 and 2021 may be substantially different. The 2021 data are also likely to contain non-negligible amounts of missingness.

One statistical method to address this missingness is multiple imputation (MI). Broadly, MI uses information from the observed data to generate a set of plausible values for the missing observations in the data. This procedure is repeated many times, accounting for sampling error that arises when generating these values. Model parameter estimates are then pooled across the imputed data sets (Enders, 2010; Fox & Weisberg, 2018). As Fox and Weisberg (2018) write, MI “takes into account not only uncertainty due to residual variation - that is, the inability to predict missing values without error from the observed data (e.g., by sampling from the estimated error distribution for a continuous variable or sampling from the estimated conditional probability distribution of a factor) - but also uncertainty in the parameter estimates used to obtain the predictions (by sampling from the estimated distribution of the parameters of the imputation model)” (p. 3).
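The pooling step can be made concrete with a short sketch. It assumes the standard pooling formulas known as Rubin's rules, which the text above does not name explicitly; the function name and the numbers are hypothetical and are not taken from the simulation.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors with Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between
    return q_bar, total


# Hypothetical mean scale scores estimated from m = 5 imputed data sets
estimates = [500.0, 502.0, 498.0, 501.0, 499.0]
variances = [4.0, 4.0, 4.0, 4.0, 4.0]
q_bar, total = pool_rubin(estimates, variances)
```

The between-imputation term is what carries the extra uncertainty due to the missing values: the more the imputed data sets disagree, the larger the pooled variance.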

There are numerous MI methods available, largely differentiated by the model used to generate the imputed values. In the context of learning loss analyses, users may implement MI to estimate mean scale score or student growth percentile (SGP) values to draw comparisons between pre- and post-pandemic test results. Notably, such comparisons are not aimed at accountability initiatives; rather, they are meant to provide a better understanding of how students are progressing. Such information can facilitate supportive programs to foster students’ learning.

Using a preliminary simulation, we evaluate the efficacy of multiple imputation for creating aggregated, “adjusted” scale scores and SGPs when data are missing across testing years. Observations were amputed from a simulated data set (available in the SGPdata R package; Betebenner et al., 2021). The data include scale scores and SGPs, as well as school characteristics and student demographics. These data were amputed to reflect patterns of either missing completely at random (MCAR) or missing at random (MAR; see Enders [2010] or Fox and Weisberg [2018] for a review of missingness types). For the MAR data, observations were amputed based on school number, scale score, and either SGP (“Status with Growth”) or free/reduced lunch (FRL) and English language learner (ELL) status (“Status with Demographics”). Either 30%, 50%, or 70% of the observations were amputed to create the missing data files. Note that a Covid impact was not incorporated into the simulated data used for these analyses.
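To illustrate the difference between the two missingness mechanisms, the sketch below amputes a toy variable under MCAR and under MAR. It is a simplified stand-in, not the amputation procedure used for this report: the variable names, the logistic weighting, and the single covariate are all assumptions made for illustration.

```python
import math
import random
from statistics import mean, pstdev


def ampute_mcar(values, prop, rng):
    """Delete a fixed share of values uniformly at random (MCAR)."""
    n_missing = round(prop * len(values))
    drop = set(rng.sample(range(len(values)), n_missing))
    return [None if i in drop else v for i, v in enumerate(values)]


def ampute_mar(values, covariate, prop, rng, slope=2.0):
    """Delete a fixed share of values, favoring records with a low observed covariate (MAR)."""
    mu, sd = mean(covariate), pstdev(covariate)
    # Logistic weight: the lower the standardized covariate, the larger the weight.
    weights = [1 / (1 + math.exp(slope * (c - mu) / sd)) for c in covariate]
    # Weighted sampling without replacement: draw key u**(1/w), keep the largest keys.
    keys = [rng.random() ** (1 / w) for w in weights]
    n_missing = round(prop * len(values))
    order = sorted(range(len(values)), key=keys.__getitem__, reverse=True)
    drop = set(order[:n_missing])
    return [None if i in drop else v for i, v in enumerate(values)]


rng = random.Random(2021)
prior = [rng.gauss(500, 50) for _ in range(1000)]   # hypothetical fully observed prior score
current = [p + rng.gauss(10, 20) for p in prior]    # hypothetical current score to ampute

mcar = ampute_mcar(current, 0.30, rng)
mar = ampute_mar(current, prior, 0.30, rng)
```

Under MCAR the deleted records are a random subset, so the observed mean stays close to the complete-data mean. Under this MAR sketch, lower-scoring students are more likely to be missing, so the observed mean drifts upward unless the imputation model uses the covariate.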

Six imputation methods were compared, including:

  • Cross-sectional multi-level modeling with the pan package (L2PAN);
  • Cross-sectional multi-level modeling with the lmer function (L2LMER);
  • Longitudinal multi-level modeling with pan (L2PAN_LONG);
  • Longitudinal multi-level modeling with lmer (L2LMER_LONG);
  • Quantile regression (RQ); and
  • Predictive mean matching (PMM).

These methods were also compared to the condition where no imputation was implemented (i.e., “Observed”). All MI analyses were conducted using the mice package (van Buuren & Groothuis-Oudshoorn, 2011), with calls to corresponding R packages (e.g., pan [Zhao & Schafer, 2018] and lme4 [Bates et al., 2015]). Here, we focus on the ability of these MI methods to accurately impute either mean scale scores or SGPs. Specifically, if we consider the complete simulated data to be the (population-level) parameter values, then we are interested in the extent to which the imputed values align with the “true” values from the complete data set. Data are either aggregated at the grade and content area level (e.g., Grade 3 Math, Grade 3 ELA, etc.) within each school, or aggregated at the school level. Note that observations for which either the grade/content area size or the school size is less than 10 are removed from the summary analyses.
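The aggregation and minimum-size rule described above might look like the following sketch. The record layout and field names are hypothetical, and the sketch assumes that group size counts every student record in the group (observed or not); the report does not specify which count was used.

```python
from collections import defaultdict
from statistics import mean

MIN_GROUP_SIZE = 10  # groups smaller than this are dropped, as described in the text


def aggregate_means(records, keys, value, min_size=MIN_GROUP_SIZE):
    """Average `value` within groups defined by `keys`, skipping small groups."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[k] for k in keys)].append(rec[value])
    result = {}
    for group, vals in groups.items():
        observed = [v for v in vals if v is not None]
        # Assumption: size is the number of records, not the number of observed scores.
        if len(vals) >= min_size and observed:
            result[group] = mean(observed)
    return result


# Hypothetical records: school A has 12 grade 3 math students, school B has 4
records = [{"school": "A", "grade": 3, "content": "MATH", "ss": 400 + i} for i in range(12)]
records += [{"school": "B", "grade": 3, "content": "MATH", "ss": 450 + i} for i in range(4)]

by_grade_content = aggregate_means(records, ("school", "grade", "content"), "ss")
by_school = aggregate_means(records, ("school",), "ss")
```

Here school B is excluded from both summaries because it has fewer than 10 records.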

In this summary, we quantify the performance of the aforementioned MI methods using three indices:

  • Percent bias: The absolute value of the ratio of the raw bias (i.e., the average difference between the imputed and true values) to the average true value, multiplied by 100.
  • Simplified confidence interval (CI) coverage rate: The proportion of times that the simplified CI contains the average true score; the simplified CI was proposed by Vink and van Buuren (2014) for cases where the complete data set can be considered the population.
  • Simplified \(F_1\) statistic: Tests the null hypothesis that the true and imputed values are equivalent (van Buuren, 2018; Vink & van Buuren, 2014).
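The first two indices can be written out directly from their definitions above. This is a minimal sketch with hypothetical inputs; it takes the confidence limits as given rather than constructing the simplified CI, whose formula is not reproduced in this summary.

```python
from statistics import mean


def percent_bias(imputed, true):
    """|average(imputed - true)| / average(true) * 100, following the definition in the text."""
    raw_bias = mean(i - t for i, t in zip(imputed, true))
    return abs(raw_bias / mean(true)) * 100


def coverage_rate(lower, upper, true_value):
    """Share of intervals [lower, upper] that contain the true value."""
    hits = [lo <= true_value <= hi for lo, hi in zip(lower, upper)]
    return sum(hits) / len(hits)


# Hypothetical pooled mean scale scores from four replications; the true mean is 500
imputed = [505.0, 495.0, 510.0, 502.0]
true = [500.0] * 4
pb = percent_bias(imputed, true)

# Hypothetical confidence limits for the same four replications
lower = [498.0, 501.0, 495.0, 499.0]
upper = [504.0, 507.0, 503.0, 505.0]
cr = coverage_rate(lower, upper, 500.0)
```

In this toy case the raw bias is 3 scale score points, so the percent bias is 0.6%, and three of the four intervals contain the true mean.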

Recent research suggests that an MI method is performing relatively well when the percent bias is less than 5% (Miri et al., 2020; Qi et al., 2010) and the coverage rate is greater than 0.90 (Demirtas, 2004; Qi et al., 2010). Additionally, a p-value for the \(F_1\) statistic greater than \(\alpha\) indicates that we fail to reject the null hypothesis of equivalent true and imputed values.

2 Imputation Method Comparison

We first compare the six MI methods on average percent bias and simplified CI coverage rate (CR) as a function of a variety of factors, including grade, percentage missing, and missingness type. The data are either aggregated at the grade/content area level (GC), or at the school level.

2.1 Summary Tables

2.1.1 GC: MCAR

Table 2.1: Mean percent bias and confidence interval coverage rates for scale score (SS) and student growth percentiles (SGPs) with MCAR data, grade-content area level
Columns, left to right, repeat for each method in the order L2PAN, L2PAN_LONG, LMER, LMER_LONG, PMM, RQ, Observed: Percent Bias (SS, SGP), then CR (SS, SGP). Rows for grades 3 and 4 contain only the SS values (Percent Bias, CR) for each method.
Grade | per method: Percent Bias SS, Percent Bias SGP, CR SS, CR SGP
30% Missing
3 0.223 0.920 0.223 0.920 0.761 0.810 0.761 0.810 0.233 0.957 0.227 0.910 0.233 0.957
4 0.198 0.927 0.198 0.927 0.708 0.804 0.708 0.804 0.206 0.964 0.201 0.923 0.206 0.964
5 0.166 2.805 0.946 0.952 0.431 7.407 0.911 0.824 0.680 8.156 0.825 0.743 0.689 8.179 0.831 0.748 0.408 6.267 0.850 0.845 0.413 6.335 0.830 0.829 0.408 6.267 0.850 0.845
6 0.152 3.000 0.946 0.948 0.418 7.284 0.853 0.747 0.612 7.817 0.789 0.694 0.634 7.878 0.799 0.701 0.360 6.072 0.805 0.798 0.361 6.121 0.795 0.787 0.360 6.072 0.805 0.798
7 0.121 2.138 0.948 0.950 0.386 5.853 0.765 0.725 0.584 6.963 0.748 0.650 0.596 6.984 0.738 0.650 0.335 5.116 0.765 0.755 0.342 5.166 0.750 0.754 0.335 5.116 0.765 0.755
8 0.116 2.103 0.946 0.954 0.389 7.119 0.721 0.634 0.538 8.387 0.725 0.563 0.531 8.354 0.733 0.571 0.359 6.327 0.715 0.696 0.364 6.340 0.695 0.687 0.359 6.327 0.715 0.696
50% Missing
3 0.360 0.909 0.360 0.909 1.255 0.732 1.254 0.731 0.374 0.967 0.357 0.903 0.374 0.967
4 0.340 0.909 0.340 0.909 1.158 0.725 1.155 0.727 0.361 0.969 0.344 0.905 0.361 0.969
5 0.315 5.205 0.942 0.947 0.735 12.110 0.861 0.745 1.107 12.705 0.746 0.692 1.133 12.729 0.749 0.691 0.677 10.431 0.790 0.782 0.688 10.597 0.766 0.751 0.677 10.431 0.790 0.782
6 0.259 5.394 0.945 0.942 0.727 11.907 0.790 0.662 1.004 12.415 0.709 0.630 1.036 12.446 0.725 0.637 0.586 10.119 0.749 0.728 0.592 10.245 0.736 0.710 0.586 10.119 0.749 0.728
7 0.193 3.280 0.953 0.951 0.648 9.411 0.696 0.660 0.929 10.182 0.677 0.609 0.947 10.208 0.672 0.610 0.533 7.872 0.708 0.699 0.540 7.935 0.710 0.699 0.533 7.872 0.708 0.699
8 0.218 3.723 0.930 0.936 0.682 11.835 0.619 0.524 0.939 13.530 0.653 0.468 0.932 13.544 0.658 0.474 0.621 10.751 0.599 0.576 0.625 10.814 0.597 0.571 0.621 10.751 0.599 0.576
70% Missing
3 0.517 0.915 0.517 0.915 1.735 0.684 1.735 0.684 0.648 0.975 0.542 0.918 0.648 0.975
4 0.480 0.914 0.480 0.914 1.624 0.659 1.624 0.659 0.614 0.978 0.507 0.914 0.614 0.978
5 0.464 7.496 0.946 0.947 0.999 15.988 0.860 0.694 1.546 16.559 0.684 0.659 1.570 16.601 0.696 0.660 0.931 14.276 0.758 0.745 0.939 14.405 0.743 0.715 0.931 14.276 0.758 0.745
6 0.416 7.908 0.949 0.935 1.027 15.472 0.717 0.606 1.398 16.016 0.652 0.586 1.453 16.056 0.667 0.587 0.802 13.635 0.716 0.690 0.809 13.807 0.699 0.666 0.802 13.635 0.716 0.690
7 0.304 4.910 0.950 0.942 0.918 12.710 0.640 0.598 1.321 13.169 0.607 0.581 1.348 13.239 0.600 0.588 0.749 11.051 0.660 0.653 0.756 11.108 0.655 0.644 0.749 11.051 0.660 0.653
8 0.297 5.764 0.938 0.938 0.973 16.160 0.553 0.450 1.294 17.581 0.605 0.423 1.286 17.568 0.603 0.422 0.865 14.922 0.565 0.525 0.864 14.928 0.569 0.513 0.865 14.922 0.565 0.525

2.1.2 GC: Status with Demographics

Table 2.2: Mean percent bias and confidence interval coverage rates for scale score (SS) and student growth percentiles (SGPs) with MAR data (using status with demographics), grade-content area level
Columns, left to right, repeat for each method in the order L2PAN, L2PAN_LONG, LMER, LMER_LONG, PMM, RQ, Observed: Percent Bias (SS, SGP), then CR (SS, SGP). Rows for grades 3 and 4 contain only the SS values (Percent Bias, CR) for each method.
Grade | per method: Percent Bias SS, Percent Bias SGP, CR SS, CR SGP
30% Missing
3 0.619 0.862 0.619 0.862 1.648 0.567 1.648 0.567 0.656 0.915 0.592 0.847 0.656 0.915
4 0.558 0.849 0.558 0.849 1.418 0.557 1.418 0.557 0.554 0.921 0.549 0.843 0.554 0.921
5 0.262 4.605 0.946 0.913 0.454 7.885 0.917 0.758 1.448 9.951 0.544 0.613 1.253 9.506 0.625 0.651 0.407 6.629 0.858 0.853 0.420 6.858 0.826 0.816 0.407 6.629 0.858 0.853
6 0.210 4.107 0.933 0.909 0.508 7.594 0.823 0.714 1.261 8.685 0.522 0.621 1.130 8.472 0.606 0.660 0.343 6.132 0.827 0.822 0.353 6.362 0.798 0.795 0.343 6.132 0.827 0.822
7 0.246 4.209 0.926 0.902 0.528 6.928 0.688 0.709 1.534 8.611 0.331 0.577 1.253 8.291 0.453 0.607 0.393 6.308 0.761 0.740 0.393 6.410 0.741 0.726 0.393 6.308 0.761 0.740
8 0.215 3.485 0.917 0.909 0.553 8.093 0.662 0.586 1.392 10.283 0.341 0.463 1.053 9.801 0.483 0.493 0.434 7.267 0.691 0.686 0.455 7.465 0.661 0.652 0.434 7.267 0.691 0.686
50% Missing
3 1.082 0.815 1.082 0.815 2.536 0.435 2.536 0.435 1.146 0.900 1.098 0.792 1.146 0.900
4 1.022 0.792 1.022 0.792 2.350 0.409 2.350 0.409 1.052 0.913 1.038 0.783 1.052 0.913
5 0.410 7.068 0.946 0.924 0.729 12.045 0.874 0.716 2.283 13.473 0.422 0.641 1.700 12.864 0.600 0.680 0.643 10.396 0.814 0.810 0.668 11.033 0.775 0.750 0.643 10.396 0.814 0.810
6 0.361 6.905 0.931 0.908 0.839 11.990 0.743 0.636 2.028 12.343 0.417 0.602 1.518 12.078 0.577 0.630 0.566 9.981 0.765 0.748 0.595 10.368 0.733 0.706 0.566 9.981 0.765 0.748
7 0.385 5.860 0.920 0.905 0.863 10.274 0.596 0.653 2.434 11.184 0.232 0.578 1.615 10.861 0.454 0.605 0.613 9.673 0.688 0.677 0.613 9.442 0.685 0.661 0.613 9.673 0.688 0.677
8 0.369 5.865 0.901 0.887 0.875 12.565 0.572 0.498 2.256 14.802 0.228 0.420 1.416 14.414 0.473 0.441 0.700 11.679 0.603 0.586 0.702 11.850 0.599 0.556 0.700 11.679 0.603 0.586
70% Missing
3 1.714 0.794 1.714 0.794 3.594 0.314 3.594 0.314 1.836 0.873 1.782 0.764 1.836 0.873
4 1.640 0.769 1.640 0.769 3.444 0.295 3.444 0.295 1.724 0.893 1.671 0.755 1.724 0.893
5 0.642 10.629 0.951 0.930 1.026 16.033 0.867 0.683 3.217 16.539 0.312 0.649 1.933 16.406 0.630 0.664 0.902 14.304 0.788 0.777 0.937 15.242 0.759 0.715 0.902 14.304 0.788 0.777
6 0.556 10.041 0.930 0.910 1.162 15.789 0.685 0.582 2.901 15.572 0.325 0.590 1.751 15.749 0.598 0.594 0.790 13.415 0.723 0.700 0.847 14.188 0.686 0.649 0.790 13.415 0.723 0.700
7 0.588 8.366 0.921 0.906 1.222 13.242 0.509 0.599 3.470 13.632 0.160 0.579 1.772 13.758 0.514 0.570 0.830 12.891 0.659 0.643 0.834 12.713 0.648 0.615 0.830 12.891 0.659 0.643
8 0.593 9.505 0.897 0.871 1.196 16.893 0.495 0.435 3.202 18.518 0.165 0.390 1.561 18.269 0.515 0.414 0.940 15.870 0.567 0.535 0.944 16.025 0.567 0.519 0.940 15.870 0.567 0.535

2.1.3 GC: Status with Growth

Table 2.3: Mean percent bias and confidence interval coverage rates for scale score (SS) and student growth percentiles (SGPs) with MAR data (using status with growth), grade-content area level
Columns, left to right, repeat for each method in the order L2PAN, L2PAN_LONG, LMER, LMER_LONG, PMM, RQ, Observed: Percent Bias (SS, SGP), then CR (SS, SGP). Rows for grades 3 and 4 contain only the SS values (Percent Bias, CR) for each method.
Grade | per method: Percent Bias SS, Percent Bias SGP, CR SS, CR SGP
30% Missing
3 1.473 0.484 1.475 0.482 2.486 0.293 2.486 0.293 1.469 0.568 1.484 0.457 1.469 0.568
4 1.364 0.461 1.362 0.460 2.229 0.289 2.229 0.289 1.361 0.543 1.376 0.434 1.361 0.543
5 0.323 5.022 0.937 0.911 0.450 7.499 0.921 0.795 2.149 8.637 0.286 0.669 1.812 8.310 0.406 0.722 0.397 6.681 0.849 0.847 0.410 6.796 0.789 0.820 0.397 6.681 0.849 0.847
6 0.273 4.780 0.906 0.890 0.796 7.457 0.636 0.718 1.905 8.333 0.304 0.627 1.610 7.974 0.441 0.696 0.353 6.294 0.834 0.822 0.358 6.350 0.785 0.786 0.353 6.294 0.834 0.822
7 0.314 4.523 0.914 0.902 0.789 6.482 0.512 0.696 2.354 7.290 0.143 0.617 1.856 6.725 0.215 0.678 0.407 6.095 0.768 0.753 0.404 6.255 0.723 0.734 0.407 6.095 0.768 0.753
8 0.293 4.513 0.892 0.886 0.730 8.145 0.481 0.588 2.168 9.643 0.138 0.535 1.599 9.143 0.242 0.588 0.430 7.438 0.694 0.673 0.423 7.408 0.648 0.647 0.430 7.438 0.694 0.673
50% Missing
3 2.676 0.313 2.676 0.313 3.887 0.151 3.887 0.151 2.664 0.403 2.663 0.282 2.664 0.403
4 2.520 0.318 2.520 0.318 3.702 0.163 3.702 0.163 2.521 0.411 2.521 0.287 2.521 0.411
5 0.490 7.599 0.934 0.913 0.683 11.463 0.899 0.738 3.330 12.074 0.176 0.685 2.323 11.720 0.400 0.731 0.641 10.294 0.800 0.793 0.658 10.693 0.729 0.758 0.641 10.294 0.800 0.793
6 0.447 7.698 0.898 0.881 1.259 11.689 0.529 0.655 3.078 12.105 0.172 0.618 2.069 11.535 0.446 0.684 0.565 9.895 0.775 0.746 0.612 10.438 0.700 0.708 0.565 9.895 0.775 0.746
7 0.469 6.662 0.903 0.893 1.234 9.769 0.404 0.645 3.594 9.951 0.083 0.618 2.232 9.672 0.246 0.666 0.634 9.817 0.695 0.673 0.617 9.651 0.650 0.657 0.634 9.817 0.695 0.673
8 0.458 6.747 0.885 0.876 1.126 12.720 0.403 0.501 3.449 14.350 0.068 0.474 1.963 13.834 0.259 0.494 0.671 11.711 0.609 0.580 0.677 12.094 0.566 0.562 0.671 11.711 0.609 0.580
70% Missing
3 4.097 0.226 4.097 0.226 5.488 0.080 5.488 0.080 4.075 0.305 4.081 0.201 4.075 0.305
4 3.933 0.245 3.933 0.245 5.394 0.092 5.394 0.092 3.959 0.311 3.947 0.215 3.959 0.311
5 0.747 10.554 0.939 0.921 1.367 15.214 0.952 0.690 4.635 15.291 0.112 0.679 2.425 15.274 0.495 0.697 0.961 13.960 0.743 0.758 0.984 14.628 0.672 0.713 0.961 13.960 0.743 0.758
6 0.720 10.956 0.902 0.893 1.636 15.429 0.484 0.603 4.400 15.257 0.089 0.605 2.130 15.085 0.515 0.630 0.808 13.518 0.726 0.695 0.914 14.246 0.622 0.652 0.808 13.518 0.726 0.695
7 0.684 8.897 0.908 0.905 1.635 12.798 0.328 0.603 5.061 12.756 0.051 0.601 2.214 12.957 0.358 0.602 0.888 13.460 0.655 0.627 0.859 13.170 0.607 0.600 0.888 13.460 0.655 0.627
8 0.732 10.794 0.879 0.854 1.523 16.992 0.339 0.444 4.894 18.173 0.033 0.436 1.943 18.206 0.399 0.423 0.903 15.971 0.582 0.549 0.917 16.316 0.561 0.547 0.903 15.971 0.582 0.549

2.1.4 School Level

Table 2.4: Mean percent bias and confidence interval coverage rates for scale score (SS) and student growth percentiles (SGPs) at the school level
Columns, left to right, repeat for each method in the order L2PAN, L2PAN_LONG, LMER, LMER_LONG, PMM, RQ, Observed: Percent Bias (SS, SGP), then CR (SS, SGP).
Percent Missing | per method: Percent Bias SS, Percent Bias SGP, CR SS, CR SGP
MCAR
30% Missing 0.074 1.801 0.925 0.929 0.184 5.449 0.795 0.637 0.527 5.993 0.513 0.572 0.528 6.011 0.515 0.582 0.170 4.564 0.817 0.686 0.172 4.623 0.762 0.670 0.170 4.564 0.817 0.686
50% Missing 0.138 3.474 0.918 0.917 0.320 8.911 0.741 0.544 0.876 9.501 0.413 0.503 0.878 9.528 0.416 0.505 0.291 7.615 0.795 0.611 0.288 7.720 0.723 0.585 0.291 7.615 0.795 0.611
70% Missing 0.238 5.056 0.901 0.904 0.465 11.817 0.701 0.479 1.243 12.236 0.348 0.454 1.248 12.259 0.351 0.463 0.464 10.428 0.755 0.552 0.431 10.511 0.701 0.536 0.464 10.428 0.755 0.552
MAR (Status with Demographics)
30% Missing 0.305 2.904 0.726 0.874 0.356 5.932 0.638 0.600 1.415 7.374 0.174 0.480 1.287 6.981 0.210 0.518 0.341 4.734 0.697 0.709 0.323 5.095 0.609 0.660 0.341 4.734 0.697 0.709
50% Missing 0.521 4.977 0.647 0.855 0.607 9.157 0.550 0.524 2.263 10.257 0.103 0.473 1.869 9.725 0.164 0.511 0.598 7.709 0.612 0.637 0.580 8.230 0.502 0.580 0.598 7.709 0.612 0.637
70% Missing 0.830 7.918 0.598 0.843 0.948 12.033 0.473 0.481 3.255 12.494 0.053 0.471 2.380 12.419 0.148 0.467 0.934 10.435 0.536 0.593 0.927 11.268 0.430 0.527 0.934 10.435 0.536 0.593
MAR (Status with Growth)
30% Missing 0.665 3.299 0.385 0.856 0.669 5.570 0.345 0.632 2.181 6.478 0.031 0.513 1.968 6.142 0.044 0.571 0.674 4.839 0.347 0.693 0.694 4.955 0.297 0.666 0.674 4.839 0.347 0.693
50% Missing 1.237 5.242 0.306 0.846 1.221 8.776 0.245 0.562 3.492 9.265 0.012 0.517 2.860 8.884 0.029 0.560 1.249 7.631 0.250 0.633 1.269 7.947 0.204 0.589 1.249 7.631 0.250 0.633
70% Missing 1.961 7.527 0.258 0.847 2.077 11.580 0.160 0.503 4.988 11.564 0.003 0.494 3.603 11.510 0.028 0.515 1.999 10.335 0.174 0.587 2.036 10.865 0.138 0.535 1.999 10.335 0.174 0.587

2.2 Summary Figures: Grade/Content Area

The following figures provide more nuanced insight into the performance of the six MI methods as a function of important factors, including the percentage and type of missingness as well as the grade/content area size. Analyses are separated by the imputed value (i.e., either the scale score or the SGP).

2.2.1 Scale Scores

Figure 2.1: Scale score percent bias by imputation method, missingness percentage, and missingness type

Figure 2.2: Scale score coverage rate by imputation method, missingness percentage, and missingness type

Figure 2.3: Scatterplot of scale score percent bias as a function of grade/content area size

Figure 2.4: Scatterplot of scale score coverage rate as a function of grade/content area size

Figure 2.5: Proportion of times that the imputed SS was found to differ from the true value based on the simplified F1 statistic

2.2.2 SGPs

Figure 2.6: SGP percent bias by imputation method, missingness percentage, and missingness type

Figure 2.7: SGP coverage rate by imputation method, missingness percentage, and missingness type

Figure 2.8: Scatterplot of SGP percent bias as a function of grade/content area size

Figure 2.9: Scatterplot of SGP coverage rate as a function of grade/content area size

Figure 2.10: Proportion of times that the imputed SGP was found to differ from the true value based on the simplified F1 statistic

2.3 Summary Figures: School Level

Many of the above figures are replicated, here aggregating at the school level rather than by grade and content area within a school.

2.3.1 Scale Scores

Figure 2.11: Scale score percent bias by imputation method, missingness percentage, and missingness type

Figure 2.12: Scale score coverage rate by imputation method, missingness percentage, and missingness type

Figure 2.13: Scatterplot of scale score percent bias as a function of school size

Figure 2.14: Scatterplot of scale score coverage rate as a function of school size

Figure 2.15: Proportion of times that the imputed SS was found to differ from the true value based on the simplified F1 statistic

2.3.2 SGPs

Figure 2.16: SGP percent bias by imputation method, missingness percentage, and missingness type

Figure 2.17: SGP coverage rate by imputation method, missingness percentage, and missingness type

Figure 2.18: Scatterplot of SGP percent bias as a function of school size

Figure 2.19: Scatterplot of SGP coverage rate as a function of school size

Figure 2.20: Proportion of times that the imputed SGP was found to differ from the true value based on the simplified F1 statistic

2.4 Basic Regression Models

The following models are preliminary mechanisms for understanding which factors are related to relatively worse MI performance. To simplify these models, we examine either raw bias or absolute bias.

2.4.1 Grade/Content Area

We use the fixest package (Berge, 2018) to regress the bias variables on grade/content area size, percentage missing, missingness type, and imputation method; grade and content area are also included in the model as fixed effects. Currently, these are simple additive models. More complex models may be incorporated in future analyses (e.g., including interactions, a random effect for the school, etc.).
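Written out, the additive specification described above can be sketched as follows. The symbols are our own notation rather than fixest output, and the reference levels (30% missing, MCAR, and the “Observed” condition) are inferred from the coefficient labels in the tables that follow.

\[
y_i = \beta_N N_i
+ \sum_{p \in \{50\%,\, 70\%\}} \beta_p \, \mathbb{1}[\text{MISS\_PERC}_i = p]
+ \sum_{t \in \{\text{DEMOG},\, \text{GROWTH}\}} \gamma_t \, \mathbb{1}[\text{MISS\_TYPE}_i = t]
+ \sum_{m \neq \text{Observed}} \delta_m \, \mathbb{1}[\text{IMP\_METHOD}_i = m]
+ \alpha_{g(i)} + \varepsilon_i
\]

Here \(y_i\) is the raw or absolute bias for observation \(i\), \(N_i\) is the grade/content area size, and \(\alpha_{g(i)}\) is the fixed effect for the grade by content area combination to which observation \(i\) belongs. Each \(\delta_m\) is therefore the expected difference in bias between method \(m\) and the “Observed” condition, holding the other factors constant.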

Table 2.5: Linear fixed-effect regression models for raw bias at the grade/content area level
Scale Scores SGPs
N 0.0004 (0.0021) 0.0054* (0.0021)
MISS_PERC50%Missing 2.046*** (0.3475) -0.1588** (0.0377)
MISS_PERC70%Missing 4.432*** (0.7589) -0.3639** (0.0732)
MISS_TYPEDEMOG 5.709*** (0.6697) -0.0496 (0.0292)
MISS_TYPEGROWTH 10.36*** (1.535) 0.0746 (0.0493)
i(var=IMP_METHOD,ref=“Observed”)L2LMER_LONG -0.2410 (0.8601) -0.3768. (0.1911)
i(var=IMP_METHOD,ref=“Observed”)L2PAN_LONG -6.195*** (1.314) -0.4023. (0.2021)
i(var=IMP_METHOD,ref=“Observed”)L2LMER 1.698*** (0.2889) -0.3242 (0.1859)
i(var=IMP_METHOD,ref=“Observed”)L2PAN -5.253*** (1.052) -0.0308 (0.1878)
i(var=IMP_METHOD,ref=“Observed”)RQ -5.282*** (1.060) -0.2582 (0.2059)
i(var=IMP_METHOD,ref=“Observed”)PMM -5.290*** (1.072) -0.3420 (0.2087)
Fixed-Effects: —————– ——————
GRADE^CONTENT_AREA Yes Yes
________________________________________ _________________ __________________
S.E.: Clustered by: GRA.^CON. by: GRA.^CON.
Observations 96,075 51,849
R2 0.35696 0.00418
Within R2 0.32018 0.00356

Table 2.6: Linear fixed-effect regression models for absolute bias at the grade/content area level
Scale Scores SGPs
N -0.0100*** (0.0017) -0.0152*** (0.0012)
MISS_PERC50%Missing 2.763*** (0.2529) 1.574*** (0.0909)
MISS_PERC70%Missing 5.864*** (0.6021) 3.130*** (0.1531)
MISS_TYPEDEMOG 3.600*** (0.4151) 0.5515*** (0.0712)
MISS_TYPEGROWTH 8.268*** (1.261) 0.5947*** (0.0801)
i(var=IMP_METHOD,ref=“Observed”)L2LMER_LONG 1.117 (0.8447) 2.581*** (0.2756)
i(var=IMP_METHOD,ref=“Observed”)L2PAN_LONG -3.355*** (0.5251) 2.312*** (0.2846)
i(var=IMP_METHOD,ref=“Observed”)L2LMER 2.959*** (0.3078) 2.643*** (0.2569)
i(var=IMP_METHOD,ref=“Observed”)L2PAN -4.621*** (0.8489) 0.0985 (0.1049)
i(var=IMP_METHOD,ref=“Observed”)RQ -3.921*** (0.6587) 1.848*** (0.2552)
i(var=IMP_METHOD,ref=“Observed”)PMM -3.906*** (0.6933) 1.749*** (0.2606)
Fixed-Effects: ——————- ——————-
GRADE^CONTENT_AREA Yes Yes
________________________________________ ___________________ ___________________
S.E.: Clustered by: GRA.^CON. by: GRA.^CON.
Observations 96,075 51,849
R2 0.35972 0.18045
Within R2 0.33837 0.17236

We can also re-fit the scale score models using only observations from grades 5 through 8.

Table 2.7: Linear fixed-effect regression models for raw and absolute scale score bias when removing grades 3 and 4
Raw Bias (SS) Absolute Bias (SS)
N 0.0012 (0.0022) -0.0095*** (0.0017)
MISS_PERC50%Missing 1.062*** (0.0788) 2.058*** (0.0652)
MISS_PERC70%Missing 2.274*** (0.1773) 4.179*** (0.1362)
MISS_TYPEDEMOG 3.814*** (0.2469) 2.487*** (0.1313)
MISS_TYPEGROWTH 6.015*** (0.3615) 4.800*** (0.1986)
i(var=IMP_METHOD,ref=“Observed”)L2LMER_LONG -2.628*** (0.4167) -1.274** (0.3270)
i(var=IMP_METHOD,ref=“Observed”)L2PAN_LONG -9.960*** (0.7401) -4.664*** (0.4131)
i(var=IMP_METHOD,ref=“Observed”)L2LMER 0.9643*** (0.1011) 2.140*** (0.1114)
i(var=IMP_METHOD,ref=“Observed”)L2PAN -8.215*** (0.3348) -7.010*** (0.2442)
i(var=IMP_METHOD,ref=“Observed”)RQ -8.256*** (0.3002) -5.750*** (0.2844)
i(var=IMP_METHOD,ref=“Observed”)PMM -8.317*** (0.2633) -5.831*** (0.2838)
Fixed-Effects: —————— ——————-
GRADE^CONTENT_AREA Yes Yes
________________________________________ __________________ ___________________
S.E.: Clustered by: GRA.^CON. by: GRA.^CON.
Observations 51,849 51,849
R2 0.33165 0.34586
Within R2 0.33026 0.34365

2.4.2 School Level

In these models, the data are aggregated at the school level.

Table 2.8: Linear fixed-effect regression models for raw bias at the school level
Scale Scores SGPs
(Intercept) 1.936*** (0.1775) -0.3654* (0.1493)
N -0.0023*** (0.0002) 0.0010*** (0.0001)
MISS_PERC50%Missing 2.044*** (0.1227) -0.1587 (0.1032)
MISS_PERC70%Missing 4.494*** (0.1227) -0.3560*** (0.1032)
MISS_TYPEDEMOG 5.696*** (0.1227) -0.1420 (0.1032)
MISS_TYPEGROWTH 10.36*** (0.1227) -0.0388 (0.1032)
i(var=IMP_METHOD,ref=“Observed”)L2LMER_LONG -0.7282*** (0.1874) -0.3890* (0.1576)
i(var=IMP_METHOD,ref=“Observed”)L2PAN_LONG -6.651*** (0.1874) -0.4009* (0.1576)
i(var=IMP_METHOD,ref=“Observed”)L2LMER 1.259*** (0.1874) -0.3714* (0.1576)
i(var=IMP_METHOD,ref=“Observed”)L2PAN -5.680*** (0.1874) 0.0527 (0.1576)
i(var=IMP_METHOD,ref=“Observed”)RQ -5.703*** (0.1874) -0.2800. (0.1576)
i(var=IMP_METHOD,ref=“Observed”)PMM -5.694*** (0.1874) -0.2982. (0.1576)
________________________________________ ___________________ ___________________
S.E. type Standard Standard
Observations 14,616 14,616
R2 0.46063 0.00555
Adj. R2 0.46022 0.00480

Table 2.9: Linear fixed-effect regression models for absolute bias at the school level
Scale Scores SGPs
(Intercept) 2.955*** (0.1578) 1.437*** (0.0997)
N -0.0022*** (0.0002) -0.0025*** (9.82e-5)
MISS_PERC50%Missing 2.423*** (0.1091) 1.254*** (0.0689)
MISS_PERC70%Missing 5.280*** (0.1091) 2.481*** (0.0689)
MISS_TYPEDEMOG 3.929*** (0.1091) 0.4720*** (0.0689)
MISS_TYPEGROWTH 8.547*** (0.1091) 0.3616*** (0.0689)
i(var=IMP_METHOD,ref=“Observed”)L2LMER_LONG 0.4437** (0.1666) 2.295*** (0.1053)
i(var=IMP_METHOD,ref=“Observed”)L2PAN_LONG -5.029*** (0.1666) 2.089*** (0.1053)
i(var=IMP_METHOD,ref=“Observed”)L2LMER 2.482*** (0.1666) 2.368*** (0.1053)
i(var=IMP_METHOD,ref=“Observed”)L2PAN -5.526*** (0.1666) 0.1478 (0.1053)
i(var=IMP_METHOD,ref=“Observed”)RQ -5.108*** (0.1666) 1.668*** (0.1053)
i(var=IMP_METHOD,ref=“Observed”)PMM -5.108*** (0.1666) 1.519*** (0.1053)
________________________________________ ___________________ ____________________
S.E. type Standard Standard
Observations 14,616 14,616
R2 0.48342 0.17251
Adj. R2 0.48303 0.17188

2.5 Key Take-Aways

Before homing in on the particular differences among the MI methods, a handful of general trends merit comment. First, across missingness types and percentages, the percent bias for scale scores is notably lower than the percent bias for SGPs. Second, there is a small but noticeable positive relationship between missingness percentage and percent bias; this relationship is more pronounced for scale scores than for SGPs. Third, the scatterplots of percent bias and coverage rates as a function of grade/content area size (\(N\)) indicate that observations with smaller \(N\) are more likely to have higher percent bias on either scale scores or SGPs, as well as greater variation in the CI coverage rates. Finally, the bar plots indicate that there are higher proportions of significant differences between the true and imputed scale score values (based on the simplified \(F_1\) statistic) when data are MAR, particularly when missingness is based on status and growth. The preliminary regression models also suggest that across the imputation methods, bias tends to increase as percentage missingness increases, as well as when data are MAR compared to MCAR (particularly using status and growth). However, note that the \(R^2\) values are relatively low, so caution is warranted when interpreting the model results.

Overall, the cross-sectional L2PAN method demonstrates the best performance among the examined methods. Specifically, L2PAN was most often able to maintain average percent bias for scale scores and SGPs below 5%, while simultaneously constructing confidence intervals with higher coverage rates. Additionally, L2PAN most often failed to reject the null hypothesis that the imputed and true values are equivalent. In other words, the other MI methods more often found statistically significant differences (using the simplified \(F_1\) statistic) between the imputed and true average scale score and SGP values. Finally, the regression models indicate that L2PAN is negatively related to scale score bias, suggesting reduced scale score bias compared to the “observed” condition. Relatedly, L2PAN was often not positively related to SGP bias.

3 Evaluating Cross-Sectional L2PAN

The results provide growing evidence that L2PAN is a relatively effective MI method for the types of data and missingness examined thus far. Still, the above analyses indicate that in numerous cases, L2PAN is unable to generate relatively accurate pooled estimates. In this section, we take a deeper dive to identify the conditions wherein L2PAN performs well, as well as the conditions wherein L2PAN is unable to adequately recover the true mean scale scores or SGPs.

3.1 Scale Scores

3.1.1 Descriptive Statistics: Grade/Content Area

Figure 3.1: Average SS percent bias by grade/content area quantile, grade, and missingness characteristics

Figure 3.1 presents the average SS percent bias for the L2PAN method as a function of grade/content area size, missingness percentage and type, and grade. First, notice in this figure that the average percent bias never exceeds 5%, which is the threshold for poor MI efficacy posited by researchers like Qi and colleagues (2010) and Miri and colleagues (2020). There is a general trend wherein the SS percent bias decreases as the grade/content area size quantile increases, as well as a tendency for the percent bias to be greater when data are missing using status and growth compared to using status and demographics. The largest relative scale score percent bias with L2PAN occurs for grades three and four when 70% of the data are missing using status and growth.

Figure 3.2 replicates the above analysis, using average scale score coverage rate rather than percent bias.

Figure 3.2: Average SS coverage rate by grade/content area quantile, grade, and missingness characteristics

The above figure indicates that the average coverage rate is greater than 0.90 in a large proportion of conditions. However, when data are missing based on status and growth for grades 3 and 4, the average scale score coverage rates are particularly low. These results indicate that for observations in grades 3 and 4, simplified CIs for scale scores tend to be too liberal when data are MAR (particularly based on growth rather than demographics); this pattern is exacerbated as the missingness percentage and grade/content area size quantile increase.
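
As a rough illustration of how a coverage rate is obtained, the sketch below counts the share of confidence intervals that contain the true value. The interval bounds are hypothetical, and the code does not reproduce the simplified CI construction itself; it only assumes that each replication yields a lower and an upper bound.

```python
def coverage_rate(intervals, true_value):
    """Proportion of (lower, upper) intervals that contain the true value."""
    if not intervals:
        raise ValueError("at least one interval is required")
    hits = sum(1 for lower, upper in intervals if lower <= true_value <= upper)
    return hits / len(intervals)


# Hypothetical confidence intervals from four replications of one condition.
example_intervals = [
    (495.0, 512.0),
    (501.0, 515.0),
    (490.0, 505.0),  # misses the true value of 510
    (499.0, 511.0),
]

rate = coverage_rate(example_intervals, true_value=510.0)
below_target = rate < 0.90  # 0.90 is the cut-off used in this section
print(f"coverage rate = {rate:.2f}, below 0.90: {below_target}")
```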

We may also use the (simplified) \(F_1\) statistic to flag observations indicating significant differences between the imputed and true values. The figure below plots the proportion of \(F_1\) statistics that result in a rejection of the null hypothesis for the scale scores (using a threshold of \(\alpha=0.1\)), where the null hypothesis is that the imputed and true values are equivalent. We find that cases in grades 3 and 4 when data are MAR (based on status and growth) more often indicate significant differences between the imputed and true values. Again, this more often occurs among larger grade/content area size quantiles.

Figure 3.3: Proportion of cases where a significant difference between the imputed and true SS value was found using the F1 statistic


Across the 13725 observations in the L2PAN imputed data, 97.8% of observations have an average SS percent bias less than 5%. Moreover, 59.8% of observations have a mean SS coverage rate greater than 0.90. We are particularly interested in whether there are similar characteristics among the observations that do not meet these percent bias and coverage rate thresholds when imputing with L2PAN.

There are 298 observations with SS percent bias greater than 5% using L2PAN. Similarly, there are 5003 observations with SS coverage rates less than 0.90. Altogether, there are 293 observations that have an SS percent bias greater than 5%, a coverage rate less than 0.90, and an \(F_1\) statistic for which we reject the null hypothesis that the imputed and true values are equivalent. We’ll look at this last category of “flagged” observations in more detail.
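
The flagging rule combines all three criteria. A minimal sketch of that rule is given below; the `Condition` record, its field names, and the example values are hypothetical stand-ins for one row of the summarized simulation results.

```python
from dataclasses import dataclass


@dataclass
class Condition:
    """One summarized simulation condition (hypothetical field names)."""
    label: str
    percent_bias: float   # average percent bias, in percent
    coverage_rate: float  # average simplified CI coverage rate, 0 to 1
    f1_p_value: float     # p-value of the simplified F1 statistic


def is_flagged(c, bias_cut=5.0, coverage_cut=0.90, alpha=0.10):
    """Flag only when all three criteria indicate poor performance."""
    return (
        c.percent_bias > bias_cut
        and c.coverage_rate < coverage_cut
        and c.f1_p_value < alpha
    )


conditions = [
    Condition("grade 3, growth, 70%", 6.2, 0.41, 0.01),
    Condition("grade 5, demog, 30%", 1.1, 0.93, 0.55),
    Condition("grade 4, growth, 50%", 5.4, 0.95, 0.03),  # coverage is acceptable
]

flagged = [c.label for c in conditions if is_flagged(c)]
print(flagged)
```

Because the rule is a conjunction, a condition with high bias but adequate coverage (the third row here) is not flagged.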

Figure 3.4: Density plot of flagged observations based on the scale score as a function of grade/content area size


Figure 3.5: Scatter plot of flagged observations based on the scale score as a function of grade/content area size and percent bias


Figure 3.6: Bar plot of flagged observations based on the scale score as a function of grade and missingness


We see that the majority of “flagged” cases tend to have grade/content area sizes less than 60. Variation in SS percent bias is also greater among observations with smaller grade/content area sizes. Additionally, these observations are overwhelmingly in grades 3 and 4 when data are missing based on status and growth, particularly for conditions with 50% and 70% of data missing.

3.1.2 Descriptive Statistics: School Level

Figure 3.7: Average SS percent bias by school size quantile and missingness characteristics


Figure 3.7 shows the average SS percent bias for the L2PAN method at the school level, as a function of school size, missingness percentage, and missingness type. As with the grade/content area analyses, the maximum average SS percent bias is relatively low, here never exceeding 2.5%. We see higher average SS percent bias when data are MAR based on status and growth, and as the percentage missing increases. In some cases, percent bias also decreases slightly as the school size quantile increases.

Figure 3.8 below uses average SS coverage rate rather than percent bias. We see that the average coverage rates are concerningly low when data are MAR, particularly based on characteristics like status and growth. There is not a clear linear relationship between the school size quantile and the average CI coverage rate. For instance, when data are MAR based on status and growth, we find a U-shaped relationship between these two variables.

Figure 3.8: Average SS coverage rate by school size quantile and missingness characteristics


The figure below plots the proportion of \(F_1\) statistics that result in a rejection of the null hypothesis for the SS (using a threshold of \(\alpha=0.1\)). Again, we find relatively worse performance by L2PAN for data that are MAR based on status and growth, as well as for higher missingness percentages.

Figure 3.9: Proportion of cases where a significant difference between the imputed and true SS value was found using the F1 statistic


Across the 2088 observations in the L2PAN imputed data at the school level, 99.7% of observations have an average SS percent bias less than 5%. Moreover, 38.6% of observations have a mean SS coverage rate greater than 0.90.

There are 7 observations with SS percent bias greater than 5% using L2PAN. Similarly, there are 1203 observations with SS coverage rates less than 0.90. Putting these conditions together, there are 7 total observations with both SS percent bias greater than 5% and a coverage rate less than 0.90, as well as \(F_1\) statistics for which we reject the null hypothesis that the imputed and true values are equivalent.

3.1.3 Classification Models

In this section, we fit a series of relatively simple classification models. The goal is to identify data features that are related to poor L2PAN performance, providing guidance as to when imputing data with this MI method may warrant caution. The models use logistic regression with up to two-way interactions among the predictors. The baseline condition is an observation with a grade/content area or school size of zero and 30% of the data MCAR.

The models are fit using the fixest package (Berge, 2018). In the first model, the data are analyzed at the grade/content area level. Therefore, fixed effects for grade and content area are included in the model. The outcome is a binary variable where 1 indicates an SS percent bias greater than 5%, a coverage rate less than 0.90, and an \(F_1\) statistic p-value less than 0.10.

In the second model, the data are analyzed at the school level. Because there were so few “flagged” observations in these data (N = 7), the outcome variable is the simplified CI coverage rate.
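
Because the coefficients in Table 3.1 are on the logit scale, it can help to convert a linear predictor back to the outcome scale. The sketch below does this for two school-level conditions with size N = 0. It assumes that the reported intercept belongs to the school-level model (the only one without fixed effects) and that MCAR with 30% missingness is the reference condition, as described above; the constant and function names are hypothetical.

```python
import math


def inv_logit(eta):
    """Map a logit-scale linear predictor to the (0, 1) outcome scale."""
    return 1.0 / (1.0 + math.exp(-eta))


# School-level coefficients copied from Table 3.1 (logit scale).
INTERCEPT = 2.448
MISS_TYPE_GROWTH = -3.722
GROWTH_X_30_MISSING = 0.3456

# Assumed reference condition: N = 0, MCAR, 30% missing.
baseline_coverage = inv_logit(INTERCEPT)

# N = 0, MAR based on status and growth, 30% missing.
growth_coverage = inv_logit(INTERCEPT + MISS_TYPE_GROWTH + GROWTH_X_30_MISSING)

print(f"baseline: {baseline_coverage:.3f}, growth-based MAR: {growth_coverage:.3f}")
```

Under these assumptions, the fitted coverage rate drops sharply when missingness depends on status and growth, which is consistent with the descriptive results for Figure 3.8.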

Table 3.1: Logistic fixed-effect regression models for grade/content area flagged observations and school-level CI coverage rates
Term                                             Grade/Content Area: Flagged  School: CI Coverage Rates
N                                                -0.1289*** (0.0077)          0.0021*** (0.0005)
MISS_TYPEDEMOG                                   14.71*** (1.439)             -1.641*** (0.3311)
MISS_TYPEGROWTH                                  17.21*** (0.2905)            -3.722*** (0.3507)
MISS_PERC50%Missing                              1.092* (0.4278)              -0.2605 (0.3837)
MISS_PERC70%Missing                              0.6798 (0.5413)              -0.5595 (0.3716)
i(var=N,f=MISS_TYPE)MCAR                         0.1267*** (0.0150)           -0.0011 (0.0007)
i(var=N,f=MISS_TYPE)DEMOG                        0.0782* (0.0340)             -0.0016*** (0.0004)
i(var=N,f=MISS_TYPE)GROWTH                       0.1240*** (0.0075)
i(var=N,f=MISS_PERC)30%Missing                   0.0146 (0.0122)              -0.0008 (0.0005)
i(var=N,f=MISS_PERC)50%Missing                   -0.0096* (0.0045)            -0.0003 (0.0005)
i(var=MISS_TYPE,f=MISS_PERC)DEMOG x 30%Missing   -1.510. (0.8055)             0.2962 (0.3899)
i(var=MISS_TYPE,f=MISS_PERC)DEMOG x 50%Missing   -1.982* (0.7820)             0.0075 (0.3797)
i(var=MISS_TYPE,f=MISS_PERC)GROWTH x 30%Missing  -3.391*** (0.3375)           0.3456 (0.3944)
i(var=MISS_TYPE,f=MISS_PERC)GROWTH x 50%Missing  -1.344*** (0.1815)           0.0592 (0.3919)
(Intercept)                                                                   2.448*** (0.3337)
Fixed-Effects:
CONTENT_AREA^GRADE                               Yes                          No
S.E. type                                        by: CON.^GRA.                Standard
Convergence                                      FALSE                        TRUE
Observations                                     6,318                        2,088
Squared Cor.                                     0.19628                      0.44657
Pseudo R2                                        0.37202                      0.34479
BIC                                              1,647.0                      1,910.7

3.2 Student Growth Percentiles

3.2.1 Descriptive Statistics: Grade/Content Area

Figure 3.10: Average SGP percent bias by grade/content area quantile, grade, and missingness characteristics


Figure 3.10 shows the average SGP percent bias. As previously noted, average SGP percent bias is generally larger than that for scale scores. In many missingness percentage and type categorizations, the largest average SGP percent bias occurs for observations in eighth grade and in the first grade/content area size quantile. We again see a general trend wherein observations with smaller grade/content area size quantiles (i.e., in the first quantile) have larger average bias. The largest average SGP percent bias is approximately 29%, occurring when 70% of the data are missing using status and growth, and observations are in the eighth grade and first quantile of grade/content area size.

Figure 3.11: Average SGP coverage rate by grade/content area quantile, grade, and missingness characteristics


Whereas the average SS coverage rate was lower than 0.5 for some conditions (see Figure 3.2), Figure 3.11 indicates relatively high average coverage rates across the examined conditions when analyzing the SGPs. For instance, the average CI coverage rate for the SGPs does not fall below 0.76. Therefore, although L2PAN tends to produce imputed SGP estimates with relatively higher percent bias compared to the imputed SS estimates, the confidence intervals for the SGP estimates more often contain the true SGP values when analyzing at the grade/content area level.

As shown in the figure below, statistically significant differences between the imputed and true mean SGPs were more often found when data were MAR based on status and growth. In some cases, there was also a tendency for higher proportions of rejected null hypotheses among higher grades.

Figure 3.12: Proportion of cases where a significant difference between the imputed and true SGP value was found using the F1 statistic


Across the 13725 observations in the L2PAN imputed data, 32.3% of observations have an average SGP percent bias less than 5%. Moreover, 39.9% of observations have a mean SGP coverage rate greater than 0.90. Again, we’ll further examine the “flagged” observations, which have an SGP percent bias greater than 5%, a simplified CI coverage rate less than 0.90, and a statistically significant \(F_1\) test.

Here, there are 319 total observations that we may “flag” for which L2PAN performs relatively worse when imputing the mean SGP values at the grade/content area level.

Figure 3.13: Density plot of flagged observations based on the SGP as a function of grade/content area size


Figure 3.14: Scatter plot of flagged observations based on the SGPs as a function of grade/content area size and percent bias


Figure 3.15: Bar plot of flagged observations based on the SGP as a function of grade and missingness


As with the scale score imputations, L2PAN tends to generate more concerning SGP imputed values for smaller grade/content area sizes. The scatterplot further shows that observations with smaller grade/content area sizes tend to have greater variation in percent bias for the SGPs. The faceted bar plot above suggests that across the missingness type and percentage conditions, L2PAN tends to do worse in terms of the SGP imputation for grade six (and also often grade five). There is also some evidence that imputation efficacy decreases as missingness percentage increases, as well as for MAR compared to MCAR data.

3.2.2 Descriptive Statistics: School Level

Figure 3.16: Average SGP percent bias by school size quantile and missingness characteristics


In the above figure, we see that average SGP percent bias is higher among lower school size quantiles, higher missingness percentages, and when data are MAR. However, we find fewer clear trends when examining average SGP coverage rates (see below). Notably, the average coverage rates do not dip below 0.80 when analyzing the SGPs at the school level. Coverage rates tend to be lower for MAR compared to MCAR data.

Figure 3.17: Average SGP coverage rate by school size quantile and missingness characteristics


The figure below plots the proportion of \(F_1\) statistics that result in a rejection of the null hypothesis for the SGPs (again using a threshold of \(\alpha=0.1\)). Notice in this figure that statistically significant differences between the imputed and true mean SGPs were more often found when data were MAR.

Figure 3.18: Proportion of cases where a significant difference between the imputed and true SGP value was found using the F1 statistic


Across the 2088 observations in the L2PAN imputed data at the school level, 71.1% of observations have an average SGP percent bias less than 5%. Moreover, 64.4% of observations have a mean SGP coverage rate greater than 0.90.

Here, there are 176 total observations that we may “flag” for which L2PAN performs relatively worse when imputing the SGP values at the school level (again, based on a combination of the SGP percent bias, simplified CI coverage rate, and p-value for the \(F_1\) statistic).

Figure 3.19: Density plot of flagged observations based on the SGP as a function of school size


Figure 3.20: Scatter plot of flagged observations based on the SGPs as a function of school size and percent bias


Figure 3.21: Bar plot of flagged observations based on the SGP as a function of missingness


We find that “flagged” observations at the school level are more likely to have smaller school sizes and occur when data are MAR (particularly based on status and growth). There is also a clear positive relationship between the number of “flagged” observations and the percentage of missingness, holding other factors constant.

3.2.3 Classification Model

We fit another series of logistic regressions with up to two-way interactions among the predictors. Here, the outcome for both the grade/content area and school level is a binary variable where 1 now indicates an SGP percent bias greater than 5%, coverage rate less than 0.90, and \(F_1\) statistic p-value less than 0.10.

Table 3.2: Logistic fixed-effect regression models for grade/content area or school-level flagged observations
Term                                             Grade/Content Area           School
N                                                -0.0491*** (0.0036)          -0.0005 (0.0007)
MISS_TYPEDEMOG                                   0.8894* (0.3914)             0.4954 (0.5394)
MISS_TYPEGROWTH                                  1.200* (0.5243)              0.3132 (0.5286)
MISS_PERC50%Missing                              9.491*** (0.7164)            1.692 (1.113)
MISS_PERC70%Missing                              11.04*** (0.2832)            2.396* (1.078)
i(var=N,f=MISS_TYPE)MCAR                         -0.0079 (0.0082)             -0.0024 (0.0015)
i(var=N,f=MISS_TYPE)DEMOG                        9.73e-5 (0.0024)             -0.0009 (0.0009)
i(var=N,f=MISS_PERC)30%Missing                   0.0504*** (0.0028)           -0.0014 (0.0012)
i(var=N,f=MISS_PERC)50%Missing                   0.0507*** (0.0034)           -0.0001 (0.0009)
i(var=N,f=MISS_PERC)70%Missing                   0.0509*** (0.0034)
i(var=MISS_TYPE,f=MISS_PERC)DEMOG x 30%Missing   10.10*** (0.2562)            1.802 (1.102)
i(var=MISS_TYPE,f=MISS_PERC)DEMOG x 50%Missing   1.205* (0.5523)              0.5349 (0.5571)
i(var=MISS_TYPE,f=MISS_PERC)GROWTH x 30%Missing  10.66*** (0.2551)            2.207* (1.091)
i(var=MISS_TYPE,f=MISS_PERC)GROWTH x 50%Missing  1.591** (0.5072)             0.5959 (0.5525)
(Intercept)                                                                   -4.308*** (1.081)
Fixed-Effects:
CONTENT_AREA^GRADE                               Yes                          No
S.E. type                                        by: CON.^GRA.                Standard
Convergence                                      FALSE                        TRUE
Observations                                     7,407                        2,088
Squared Cor.                                     0.03162                      0.03581
Pseudo R2                                        0.10027                      0.07425
BIC                                              2,562.8                      1,224.8


4 Conclusions

The current simulation was designed to address two related questions. First, can we determine whether MI is an appropriate method for creating “adjusted” scale scores and SGPs when data are missing? Second, if MI is indeed appropriate, which method is most effective and in what data contexts? The above results can be broadly summarized as follows:

  • Within the current simulation, the cross-sectional L2PAN method generally outperforms L2LMER, L2LMER_LONG, L2PAN_LONG, PMM, and RQ in terms of percent bias, simplified CI coverage rates, and statistically significant differences between the imputed and true scale score and SGP values. Cross-sectional L2PAN also tends to perform better than the “observed” condition, where no imputation is conducted.
  • MI with L2PAN tends to perform worse among cases with smaller grade/content area or school sizes, and when data are MAR based on status and growth. MI efficacy is also negatively related to the percent of missingness in the data.
  • It appears particularly difficult to accurately impute average scale scores for grades 3 and 4, particularly when data are MAR based on status and growth.
  • Patterns of MI efficacy differ based on whether the scale scores or SGPs are being imputed, as well as whether MI is evaluated based on percent bias or CI coverage rates.

The simulation results clearly demonstrate that the efficacy of L2PAN multiple imputation depends upon a nuanced interaction of different data characteristics. It is also likely that other factors (not accounted for in the current analyses) contribute to MI’s performance in this context. For instance, recall that the \(R^2\) values for the exploratory linear regressions were relatively small. There may be additional interaction and polynomial effects on percent bias and CI coverage rates that warrant further exploration.

To address when MI is a plausible methodological option for generating adjusted 2021 scores, we make the following additional recommendations:

  • The school composition should not substantially differ across academic years. In other words, even when missingness rates are low, the demographic make-up of a school may change considerably over the years. Such differences may preclude appropriate comparisons across years (with or without adjusted scores).
  • Given the aforementioned trends regarding missingness types, it is important that researchers and policymakers examine their missingness patterns prior to imputation. Our results indicate that MI performed relatively better when data were MCAR or missing at random (MAR) based on a combination of status and demographics. However, it is extremely unlikely that data will be MCAR in this context. It is also difficult to differentiate between MAR and missing not at random (MNAR) for a given data set.
  • Given the relationships between MI efficacy and missingness percentage, MI should be used with caution when more than 50% of the data are missing. Note that these missingness rates may differ among schools (i.e., not all units of analysis will have the same missingness percentage). Such differential rates should be taken into consideration when determining if and when MI is appropriate.
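
As a minimal sketch of the last recommendation, the code below computes the missingness percentage separately for each school and lists the schools above the 50% mark. The record layout (a school identifier paired with a score, with `None` marking a missing score) and the example data are hypothetical.

```python
def missingness_by_school(records):
    """Return {school_id: percent of records with a missing score}.

    `records` is an iterable of (school_id, score) pairs, where a missing
    score is represented by None (an assumption of this sketch).
    """
    totals = {}
    missing = {}
    for school_id, score in records:
        totals[school_id] = totals.get(school_id, 0) + 1
        if score is None:
            missing[school_id] = missing.get(school_id, 0) + 1
    return {
        school_id: 100.0 * missing.get(school_id, 0) / count
        for school_id, count in totals.items()
    }


# Hypothetical student records for two schools.
student_records = [
    ("A", 510), ("A", None), ("A", 495), ("A", 502),
    ("B", None), ("B", None), ("B", 480), ("B", None),
]

rates = missingness_by_school(student_records)
needs_caution = sorted(s for s, pct in rates.items() if pct > 50.0)
print(rates, needs_caution)
```

Here the overall missingness is 50%, yet the two schools differ markedly, which is the situation the recommendation is meant to catch.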

In summary, L2PAN is a promising mechanism for dealing with relatively low levels of missingness in scale scores or SGPs. As with any simulation, the results presented here can only be generalized to the simulation conditions examined. Moreover, future work should include diagnostic checks to examine the MI performance with a particular set of data. In other words, even if L2PAN imputation is implemented and is expected to work well (based on the given simulations), supplementary diagnostic analyses would help ensure that the imputation method is working as intended. Stuart and colleagues (2009), as well as Nguyen and colleagues (2017), provide helpful overviews of these diagnostics.
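
One simple diagnostic of this kind compares the distribution of imputed values with that of the observed values. The sketch below reports the standardized difference in means and the ratio of variances. The thresholds (two standard deviations, and a variance ratio outside 0.5 to 2) are illustrative assumptions rather than values taken from this report, and the data are hypothetical. Under MAR, observed and imputed values can legitimately differ, so a flag is a prompt for closer inspection rather than proof of a problem.

```python
from statistics import mean, stdev, variance


def compare_observed_imputed(observed, imputed,
                             max_mean_gap_sd=2.0,
                             variance_ratio_bounds=(0.5, 2.0)):
    """Summarize how far imputed values sit from observed values."""
    mean_gap_sd = abs(mean(imputed) - mean(observed)) / stdev(observed)
    variance_ratio = variance(imputed) / variance(observed)
    low, high = variance_ratio_bounds
    flag = mean_gap_sd > max_mean_gap_sd or not (low <= variance_ratio <= high)
    return {
        "mean_gap_sd": mean_gap_sd,
        "variance_ratio": variance_ratio,
        "flag": flag,
    }


# Hypothetical observed and imputed scale scores for one school.
observed_scores = [480.0, 495.0, 502.0, 510.0, 523.0]
imputed_scores = [470.0, 488.0, 499.0, 505.0]

check = compare_observed_imputed(observed_scores, imputed_scores)
print(check)
```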

5 References

  • Bates, D., Maechler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1-48. https://doi.org/10.18637/jss.v067.i01.
  • Berge, L. (2018). Efficient estimation of maximum likelihood models with multiple fixed-effects: the R package FENmlm. CREA Discussion Papers.
  • Betebenner, D. W., Van Iwaarden, A. R., & Domingue, B. (2021). SGPdata: Exemplar data sets for student growth percentile (SGP) analyses. R package version 25.1-0.0. https://centerforassessment.github.io/SGPdata/
  • Demirtas, H. (2004). Simulation driven inferences for multiply imputed longitudinal datasets. Statistica Neerlandica, 58(4), 466-482. https://doi.org/10.1111/j.1467-9574.2004.00271.x
  • Enders, C. K. (2010). Applied missing data analysis. New York, NY: The Guilford Press.
  • Fox, J. & Weisberg, S. (2018). Multiple imputation of missing data. Appendix of An R companion to applied regression, third edition. Thousand Oaks, CA: Sage Publications, Inc.
  • Miri, H. H., Hassanzadeh, J., Khaniki, S. H., Akrami, R., & Sirjani, E. (2020). Accuracy of five multiple imputation methods in estimating prevalence of Type 2 diabetes based on STEPS surveys. Journal of Epidemiology and Global Health, 10(1), 36-41. https://doi.org/10.2991/jegh.k.191207.001
  • Nguyen, C. D., Carlin, J. B., & Lee, K. J. (2017). Model checking in multiple imputation: An overview and case study. Emerging Themes in Epidemiology, 14(8). https://doi.org/10.1186/s12982-017-0062-6
  • Qi, L., Wang, Y.-F., & He, Y. (2010). A comparison of multiple imputation and fully augmented weighted estimators for Cox regression with missing covariates. Statistics in Medicine, 29(25), 2592-2604. https://doi.org/10.1002/sim.4016
  • Stuart, E. A., Azur, M., Frangakis, C., & Leaf, P. (2009). Multiple imputation with large data sets: A case study of the Children’s Mental Health Initiative. American Journal of Epidemiology, 169(9), 1133–1139. https://doi.org/10.1093/aje/kwp026
  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x
  • van Buuren, S. (2018). Flexible imputation of missing data. CRC Press. https://stefvanbuuren.name/fimd/
  • van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 1-67. https://www.jstatsoft.org/v45/i03/
  • Vink, G., & van Buuren, S. (2014). Pooling multiple imputations when the sample happens to be the population. arXiv Pre-Print 1409.8542.
  • Zhao, J. H., & Schafer, J. L. (2018). pan: Multiple imputation for multivariate panel or clustered data. R package version 1.6.