1 Simulation Overview

Researchers and policymakers are starting to examine the impact of the Covid pandemic on student learning. One question of interest is the extent to which students’ current test scores can be compared to those from before the pandemic (e.g., in 2019). For example, because participation rates differ across years, the composition of students at a school may be substantially different in 2019 and 2021. The 2021 data are also likely to contain non-negligible amounts of missingness.

One statistical method to address this missingness is multiple imputation (MI). Broadly, MI uses information from the observed data to generate a set of plausible values for the missing observations in the data. This procedure is repeated many times, accounting for sampling error that arises when generating these values. Model parameter estimates are then pooled across the imputed data sets (Enders, 2010; Fox & Weisberg, 2018). As Fox and Weisberg (2018) write, MI “takes into account not only uncertainty due to residual variation - that is, the inability to predict missing values without error from the observed data (e.g., by sampling from the estimated error distribution for a continuous variable or sampling from the estimated conditional probability distribution of a factor) - but also uncertainty in the parameter estimates used to obtain the predictions (by sampling from the estimated distribution of the parameters of the imputation model)” (p. 3).
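As a rough illustration of the pooling step, the sketch below applies Rubin's rules, the standard way to combine estimates across imputed data sets: the pooled estimate is the mean of the per-imputation estimates, and the total variance adds the average within-imputation variance to an inflated between-imputation variance. The function name `pool_rubin` and the input values are hypothetical and are not taken from the analyses reported here.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors with Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations, one variance per estimate")
    q_bar = mean(estimates)          # pooled point estimate
    u_bar = mean(variances)          # average within-imputation variance
    b = variance(estimates)          # between-imputation variance (divisor m - 1)
    total = u_bar + (1 + 1 / m) * b  # total variance of the pooled estimate
    return q_bar, total


# Hypothetical mean scale scores from m = 5 imputed data sets,
# each with a squared standard error of 4.
estimates = [500.0, 502.0, 498.0, 501.0, 499.0]
variances = [4.0, 4.0, 4.0, 4.0, 4.0]
pooled, total_var = pool_rubin(estimates, variances)
print(pooled, total_var)
```

In this toy input the between-imputation spread pushes the total variance above the within-imputation variance of 4, which is the extra uncertainty that MI is designed to carry forward.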

There are numerous MI methods available, largely differentiated by the model used to generate the imputed values. In the context of learning loss analyses, users may implement MI to estimate mean scale score or student growth percentile (SGP) values and then compare pre- and post-pandemic test results. Notably, such comparisons are not intended for accountability purposes; rather, they aim to provide a better understanding of how students are progressing. Such information can inform supportive programs that foster students’ learning.

Using a preliminary simulation, we evaluate the efficacy of multiple imputation for creating aggregated, “adjusted” scale scores and SGPs when data are missing across testing years. Observations were amputed from a simulated data set (available in the SGPdata R package; Betebenner et al., 2021). The data include scale scores and SGPs, as well as school characteristics and student demographics. These data were amputed to reflect patterns of either missing completely at random (MCAR) or missing at random (MAR; see Enders [2010] or Fox and Weisberg [2018] for a review of missingness types). For the MAR data, observations were amputed based on school number, scale score, and either SGP (“Status with Growth”) or free/reduced lunch (FRL) and English language learner (ELL) status (“Status with Demographics”). Either 30%, 50%, or 70% of the observations were amputed to create the missing data files. Note that a Covid impact was not incorporated into the simulated data used for these analyses.
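To make the distinction between the two mechanisms concrete, the following minimal sketch amputes a hypothetical vector of scores in two ways. It is not the amputation procedure used for this simulation: the variable names, the weighting scheme, and the seed are all assumptions chosen for illustration. Under MCAR every record is equally likely to be removed; under MAR the chance of removal depends on a fully observed variable (here, a hypothetical prior-year score).

```python
import random
from statistics import mean


def ampute_mcar(n, prop, rng):
    """MCAR: every record has the same chance of being set to missing."""
    return set(rng.sample(range(n), round(prop * n)))


def ampute_mar(weights, prop, rng):
    """MAR: the chance of being set to missing depends on an observed variable.

    Weighted sampling without replacement: draw a key u ** (1 / w) for each
    record and keep the records with the largest keys (Efraimidis-Spirakis).
    """
    k = round(prop * len(weights))
    keys = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(weights)]
    return {i for _, i in sorted(keys, reverse=True)[:k]}


rng = random.Random(2021)
# Hypothetical fully observed prior-year scores for 1,000 students.
prior = [rng.gauss(500, 50) for _ in range(1000)]
# Assumed MAR mechanism: lower prior scores receive larger weights.
top = max(prior)
weights = [top - p + 1.0 for p in prior]

mcar_idx = ampute_mcar(len(prior), 0.30, rng)
mar_idx = ampute_mar(weights, 0.30, rng)

mar_missing = mean(prior[i] for i in mar_idx)
mar_observed = mean(prior[i] for i in range(len(prior)) if i not in mar_idx)
print(len(mcar_idx), len(mar_idx), round(mar_missing - mar_observed, 1))
```

Both functions remove exactly 30% of the records, but under the assumed MAR weights the removed students have lower prior scores on average, so the remaining sample is no longer representative.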

Six imputation methods were compared:

Cross-sectional multi-level modeling with the pan package (L2PAN);
Cross-sectional multi-level modeling with the lmer function (L2LMER);
Longitudinal multi-level modeling with pan (L2PAN_LONG);
Longitudinal multi-level modeling with lmer (L2LMER_LONG);
Quantile regression (RQ); and
Predictive mean matching (PMM).
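For readers unfamiliar with predictive mean matching, the sketch below shows the core idea in a deliberately simplified form: fit a regression on the observed cases, then fill each missing value with an observed value borrowed from one of the cases whose predicted value is closest. It is an illustration only and does not reproduce the mice implementation, which, as we understand it, also draws the regression parameters to reflect their uncertainty. The function names, the single predictor, and the toy data are assumptions.

```python
import random
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    return y_bar - b * x_bar, b


def pmm_impute(x, y, donors=5, rng=None):
    """Fill None entries of y with observed values borrowed from similar cases."""
    rng = rng or random.Random()
    obs = [i for i, yi in enumerate(y) if yi is not None]
    a, b = fit_line([x[i] for i in obs], [y[i] for i in obs])
    pred = [a + b * xi for xi in x]
    completed = list(y)
    for i, yi in enumerate(y):
        if yi is None:
            # Donor pool: observed cases whose predicted values are closest.
            pool = sorted(obs, key=lambda j: abs(pred[j] - pred[i]))[:donors]
            completed[i] = y[rng.choice(pool)]
    return completed


# Hypothetical prior score (x) and current score (y) with two missing values.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 4, None, 8, 10, None, 14, 16, 18, 20]
completed = pmm_impute(x, y, donors=2, rng=random.Random(1))
print(completed)
```

Because every imputed value is copied from an observed case, the imputations always fall within the range of values that actually occur in the data.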

These methods were also compared to the condition where no imputation was implemented (i.e., “Observed”). All MI analyses were conducted using the mice package (van Buuren & Groothuis-Oudshoorn, 2011), with calls to corresponding R packages (e.g., pan [Zhao & Schafer, 2018] and lme4 [Bates et al., 2015]). Here, we focus on the ability of these MI methods to accurately impute either mean scale scores or SGPs. Specifically, if we consider the complete simulated data to be the (population-level) parameter values, then we are interested in the extent to which the imputed values align with the “true” values from the complete data set. Data are either aggregated at the grade and content area level (e.g., Grade 3 Math, Grade 3 ELA, etc.) within each school, or aggregated at the school level. Note that observations for which either the grade/content area size or the school size is less than 10 are removed from the summary analyses.
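The two aggregation levels and the minimum-size rule can be sketched as follows. The field names (`SCHOOL`, `GRADE`, `CONTENT_AREA`, `SCALE_SCORE`) and the records are hypothetical; only the cutoff of 10 comes from the text above.

```python
from collections import defaultdict
from statistics import mean

MIN_SIZE = 10  # groups with fewer than 10 students are dropped, per the text


def aggregate(records, keys, value, min_size=MIN_SIZE):
    """Return {group: (size, mean)} for groups defined by `keys`, skipping small groups."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[k] for k in keys)].append(rec[value])
    return {g: (len(v), mean(v)) for g, v in groups.items() if len(v) >= min_size}


# Hypothetical student records: school A has 12 students, school B only 4.
records = [
    {"SCHOOL": "A", "GRADE": 3, "CONTENT_AREA": "MATH", "SCALE_SCORE": 400 + i}
    for i in range(12)
] + [
    {"SCHOOL": "B", "GRADE": 3, "CONTENT_AREA": "MATH", "SCALE_SCORE": 450 + i}
    for i in range(4)
]

by_gc = aggregate(records, ("SCHOOL", "GRADE", "CONTENT_AREA"), "SCALE_SCORE")
by_school = aggregate(records, ("SCHOOL",), "SCALE_SCORE")
print(by_gc, by_school)
```

School B is excluded at both levels because it has fewer than 10 students.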

In this summary, we quantify the performance of the aforementioned MI methods using three indices:

Percent bias: The absolute value of the ratio of the raw bias (i.e., the average difference between the imputed and true values) to the average true value, multiplied by 100.
Simplified confidence interval (CI) coverage rate: The proportion of times that the simplified CI contains the average true score; the simplified CI was proposed by Vink and van Buuren (2014) for cases where the complete data set can be considered the population.
Simplified \(F_1\) statistic: Tests the null hypothesis that the true and imputed values are equivalent (van Buuren, 2018; Vink & van Buuren, 2014).
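The first two indices can be sketched in a few lines. Percent bias follows the definition given above. For the simplified CI, the code assumes the form we understand Vink and van Buuren (2014) to propose when the complete data are treated as the population: the within-imputation variance is dropped, so the total variance is \((1 + 1/m)\) times the between-imputation variance, with \(m - 1\) degrees of freedom. That form, the function names, and all input values are assumptions made for illustration; the \(F_1\) statistic is not implemented here.

```python
from statistics import mean, variance


def percent_bias(imputed, true):
    """100 * |mean(imputed - true) / mean(true)|."""
    raw_bias = mean(i - t for i, t in zip(imputed, true))
    return 100 * abs(raw_bias / mean(true))


def simplified_ci(estimates, t_crit):
    """Interval built from between-imputation variance only (assumed form)."""
    m = len(estimates)
    q_bar = mean(estimates)
    half = t_crit * ((1 + 1 / m) * variance(estimates)) ** 0.5
    return q_bar - half, q_bar + half


def coverage_rate(intervals, true_values):
    """Share of intervals that contain their true value."""
    hits = [lo <= t <= hi for (lo, hi), t in zip(intervals, true_values)]
    return sum(hits) / len(hits)


# Hypothetical values: one group's true mean and its imputed means in four runs.
true = [500.0, 500.0, 500.0, 500.0]
imputed = [505.0, 495.0, 510.0, 500.0]
pb = percent_bias(imputed, true)

# Hypothetical estimates from m = 5 imputations; 2.776 is the 97.5th
# percentile of the t distribution with m - 1 = 4 degrees of freedom.
lo, hi = simplified_ci([500.0, 502.0, 498.0, 501.0, 499.0], t_crit=2.776)
cr = coverage_rate([(lo, hi), (490.0, 495.0)], [500.0, 500.0])
print(pb, (lo, hi), cr)
```

The critical value is passed in rather than computed because the Python standard library has no t distribution; in practice it would come from a statistics package.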

Recent research suggests that an MI method is performing relatively well when the percent bias is less than 5% (Miri et al., 2020; Qi et al., 2010) and the coverage rate is greater than 0.90 (Demirtas, 2004; Qi et al., 2010). Additionally, a p-value for the \(F_1\) statistic greater than \(\alpha\) indicates that we fail to reject the null hypothesis of equivalent true and imputed values.

2 Imputation Method Comparison

We first compare the six MI methods on average percent bias and simplified CI coverage rate (CR) as a function of a variety of factors, including grade, percentage missing, and missingness type. The data are either aggregated at the grade/content area level (GC), or at the school level.

2.1 Summary Tables

2.1.1 GC: MCAR

Table 2.1: Mean percent bias and confidence interval coverage rates for scale score (SS) and student growth percentiles (SGPs) with MCAR data, grade-content area level
	L2PAN				L2PAN_LONG				L2LMER				L2LMER_LONG				PMM				RQ				Observed
	Percent Bias		CR		Percent Bias		CR		Percent Bias		CR		Percent Bias		CR		Percent Bias		CR		Percent Bias		CR		Percent Bias		CR
Grade	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP
30% Missing
3	0.223		0.920		0.223		0.920		0.761		0.810		0.761		0.810		0.233		0.957		0.227		0.910		0.233		0.957
4	0.198		0.927		0.198		0.927		0.708		0.804		0.708		0.804		0.206		0.964		0.201		0.923		0.206		0.964
5	0.166	2.805	0.946	0.952	0.431	7.407	0.911	0.824	0.680	8.156	0.825	0.743	0.689	8.179	0.831	0.748	0.408	6.267	0.850	0.845	0.413	6.335	0.830	0.829	0.408	6.267	0.850	0.845
6	0.152	3.000	0.946	0.948	0.418	7.284	0.853	0.747	0.612	7.817	0.789	0.694	0.634	7.878	0.799	0.701	0.360	6.072	0.805	0.798	0.361	6.121	0.795	0.787	0.360	6.072	0.805	0.798
7	0.121	2.138	0.948	0.950	0.386	5.853	0.765	0.725	0.584	6.963	0.748	0.650	0.596	6.984	0.738	0.650	0.335	5.116	0.765	0.755	0.342	5.166	0.750	0.754	0.335	5.116	0.765	0.755
8	0.116	2.103	0.946	0.954	0.389	7.119	0.721	0.634	0.538	8.387	0.725	0.563	0.531	8.354	0.733	0.571	0.359	6.327	0.715	0.696	0.364	6.340	0.695	0.687	0.359	6.327	0.715	0.696
50% Missing
3	0.360		0.909		0.360		0.909		1.255		0.732		1.254		0.731		0.374		0.967		0.357		0.903		0.374		0.967
4	0.340		0.909		0.340		0.909		1.158		0.725		1.155		0.727		0.361		0.969		0.344		0.905		0.361		0.969
5	0.315	5.205	0.942	0.947	0.735	12.110	0.861	0.745	1.107	12.705	0.746	0.692	1.133	12.729	0.749	0.691	0.677	10.431	0.790	0.782	0.688	10.597	0.766	0.751	0.677	10.431	0.790	0.782
6	0.259	5.394	0.945	0.942	0.727	11.907	0.790	0.662	1.004	12.415	0.709	0.630	1.036	12.446	0.725	0.637	0.586	10.119	0.749	0.728	0.592	10.245	0.736	0.710	0.586	10.119	0.749	0.728
7	0.193	3.280	0.953	0.951	0.648	9.411	0.696	0.660	0.929	10.182	0.677	0.609	0.947	10.208	0.672	0.610	0.533	7.872	0.708	0.699	0.540	7.935	0.710	0.699	0.533	7.872	0.708	0.699
8	0.218	3.723	0.930	0.936	0.682	11.835	0.619	0.524	0.939	13.530	0.653	0.468	0.932	13.544	0.658	0.474	0.621	10.751	0.599	0.576	0.625	10.814	0.597	0.571	0.621	10.751	0.599	0.576
70% Missing
3	0.517		0.915		0.517		0.915		1.735		0.684		1.735		0.684		0.648		0.975		0.542		0.918		0.648		0.975
4	0.480		0.914		0.480		0.914		1.624		0.659		1.624		0.659		0.614		0.978		0.507		0.914		0.614		0.978
5	0.464	7.496	0.946	0.947	0.999	15.988	0.860	0.694	1.546	16.559	0.684	0.659	1.570	16.601	0.696	0.660	0.931	14.276	0.758	0.745	0.939	14.405	0.743	0.715	0.931	14.276	0.758	0.745
6	0.416	7.908	0.949	0.935	1.027	15.472	0.717	0.606	1.398	16.016	0.652	0.586	1.453	16.056	0.667	0.587	0.802	13.635	0.716	0.690	0.809	13.807	0.699	0.666	0.802	13.635	0.716	0.690
7	0.304	4.910	0.950	0.942	0.918	12.710	0.640	0.598	1.321	13.169	0.607	0.581	1.348	13.239	0.600	0.588	0.749	11.051	0.660	0.653	0.756	11.108	0.655	0.644	0.749	11.051	0.660	0.653
8	0.297	5.764	0.938	0.938	0.973	16.160	0.553	0.450	1.294	17.581	0.605	0.423	1.286	17.568	0.603	0.422	0.865	14.922	0.565	0.525	0.864	14.928	0.569	0.513	0.865	14.922	0.565	0.525

2.1.2 GC: Status with Demographics

Table 2.2: Mean percent bias and confidence interval coverage rates for scale score (SS) and student growth percentiles (SGPs) with MAR data (using status with demographics), grade-content area level
	L2PAN				L2PAN_LONG				L2LMER				L2LMER_LONG				PMM				RQ				Observed
	Percent Bias		CR		Percent Bias		CR		Percent Bias		CR		Percent Bias		CR		Percent Bias		CR		Percent Bias		CR		Percent Bias		CR
Grade	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP
30% Missing
3	0.619		0.862		0.619		0.862		1.648		0.567		1.648		0.567		0.656		0.915		0.592		0.847		0.656		0.915
4	0.558		0.849		0.558		0.849		1.418		0.557		1.418		0.557		0.554		0.921		0.549		0.843		0.554		0.921
5	0.262	4.605	0.946	0.913	0.454	7.885	0.917	0.758	1.448	9.951	0.544	0.613	1.253	9.506	0.625	0.651	0.407	6.629	0.858	0.853	0.420	6.858	0.826	0.816	0.407	6.629	0.858	0.853
6	0.210	4.107	0.933	0.909	0.508	7.594	0.823	0.714	1.261	8.685	0.522	0.621	1.130	8.472	0.606	0.660	0.343	6.132	0.827	0.822	0.353	6.362	0.798	0.795	0.343	6.132	0.827	0.822
7	0.246	4.209	0.926	0.902	0.528	6.928	0.688	0.709	1.534	8.611	0.331	0.577	1.253	8.291	0.453	0.607	0.393	6.308	0.761	0.740	0.393	6.410	0.741	0.726	0.393	6.308	0.761	0.740
8	0.215	3.485	0.917	0.909	0.553	8.093	0.662	0.586	1.392	10.283	0.341	0.463	1.053	9.801	0.483	0.493	0.434	7.267	0.691	0.686	0.455	7.465	0.661	0.652	0.434	7.267	0.691	0.686
50% Missing
3	1.082		0.815		1.082		0.815		2.536		0.435		2.536		0.435		1.146		0.900		1.098		0.792		1.146		0.900
4	1.022		0.792		1.022		0.792		2.350		0.409		2.350		0.409		1.052		0.913		1.038		0.783		1.052		0.913
5	0.410	7.068	0.946	0.924	0.729	12.045	0.874	0.716	2.283	13.473	0.422	0.641	1.700	12.864	0.600	0.680	0.643	10.396	0.814	0.810	0.668	11.033	0.775	0.750	0.643	10.396	0.814	0.810
6	0.361	6.905	0.931	0.908	0.839	11.990	0.743	0.636	2.028	12.343	0.417	0.602	1.518	12.078	0.577	0.630	0.566	9.981	0.765	0.748	0.595	10.368	0.733	0.706	0.566	9.981	0.765	0.748
7	0.385	5.860	0.920	0.905	0.863	10.274	0.596	0.653	2.434	11.184	0.232	0.578	1.615	10.861	0.454	0.605	0.613	9.673	0.688	0.677	0.613	9.442	0.685	0.661	0.613	9.673	0.688	0.677
8	0.369	5.865	0.901	0.887	0.875	12.565	0.572	0.498	2.256	14.802	0.228	0.420	1.416	14.414	0.473	0.441	0.700	11.679	0.603	0.586	0.702	11.850	0.599	0.556	0.700	11.679	0.603	0.586
70% Missing
3	1.714		0.794		1.714		0.794		3.594		0.314		3.594		0.314		1.836		0.873		1.782		0.764		1.836		0.873
4	1.640		0.769		1.640		0.769		3.444		0.295		3.444		0.295		1.724		0.893		1.671		0.755		1.724		0.893
5	0.642	10.629	0.951	0.930	1.026	16.033	0.867	0.683	3.217	16.539	0.312	0.649	1.933	16.406	0.630	0.664	0.902	14.304	0.788	0.777	0.937	15.242	0.759	0.715	0.902	14.304	0.788	0.777
6	0.556	10.041	0.930	0.910	1.162	15.789	0.685	0.582	2.901	15.572	0.325	0.590	1.751	15.749	0.598	0.594	0.790	13.415	0.723	0.700	0.847	14.188	0.686	0.649	0.790	13.415	0.723	0.700
7	0.588	8.366	0.921	0.906	1.222	13.242	0.509	0.599	3.470	13.632	0.160	0.579	1.772	13.758	0.514	0.570	0.830	12.891	0.659	0.643	0.834	12.713	0.648	0.615	0.830	12.891	0.659	0.643
8	0.593	9.505	0.897	0.871	1.196	16.893	0.495	0.435	3.202	18.518	0.165	0.390	1.561	18.269	0.515	0.414	0.940	15.870	0.567	0.535	0.944	16.025	0.567	0.519	0.940	15.870	0.567	0.535

2.1.3 GC: Status with Growth

Table 2.3: Mean percent bias and confidence interval coverage rates for scale score (SS) and student growth percentiles (SGPs) with MAR data (using status with growth), grade-content area level
	L2PAN				L2PAN_LONG				L2LMER				L2LMER_LONG				PMM				RQ				Observed
	Percent Bias		CR		Percent Bias		CR		Percent Bias		CR		Percent Bias		CR		Percent Bias		CR		Percent Bias		CR		Percent Bias		CR
Grade	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP
30% Missing
3	1.473		0.484		1.475		0.482		2.486		0.293		2.486		0.293		1.469		0.568		1.484		0.457		1.469		0.568
4	1.364		0.461		1.362		0.460		2.229		0.289		2.229		0.289		1.361		0.543		1.376		0.434		1.361		0.543
5	0.323	5.022	0.937	0.911	0.450	7.499	0.921	0.795	2.149	8.637	0.286	0.669	1.812	8.310	0.406	0.722	0.397	6.681	0.849	0.847	0.410	6.796	0.789	0.820	0.397	6.681	0.849	0.847
6	0.273	4.780	0.906	0.890	0.796	7.457	0.636	0.718	1.905	8.333	0.304	0.627	1.610	7.974	0.441	0.696	0.353	6.294	0.834	0.822	0.358	6.350	0.785	0.786	0.353	6.294	0.834	0.822
7	0.314	4.523	0.914	0.902	0.789	6.482	0.512	0.696	2.354	7.290	0.143	0.617	1.856	6.725	0.215	0.678	0.407	6.095	0.768	0.753	0.404	6.255	0.723	0.734	0.407	6.095	0.768	0.753
8	0.293	4.513	0.892	0.886	0.730	8.145	0.481	0.588	2.168	9.643	0.138	0.535	1.599	9.143	0.242	0.588	0.430	7.438	0.694	0.673	0.423	7.408	0.648	0.647	0.430	7.438	0.694	0.673
50% Missing
3	2.676		0.313		2.676		0.313		3.887		0.151		3.887		0.151		2.664		0.403		2.663		0.282		2.664		0.403
4	2.520		0.318		2.520		0.318		3.702		0.163		3.702		0.163		2.521		0.411		2.521		0.287		2.521		0.411
5	0.490	7.599	0.934	0.913	0.683	11.463	0.899	0.738	3.330	12.074	0.176	0.685	2.323	11.720	0.400	0.731	0.641	10.294	0.800	0.793	0.658	10.693	0.729	0.758	0.641	10.294	0.800	0.793
6	0.447	7.698	0.898	0.881	1.259	11.689	0.529	0.655	3.078	12.105	0.172	0.618	2.069	11.535	0.446	0.684	0.565	9.895	0.775	0.746	0.612	10.438	0.700	0.708	0.565	9.895	0.775	0.746
7	0.469	6.662	0.903	0.893	1.234	9.769	0.404	0.645	3.594	9.951	0.083	0.618	2.232	9.672	0.246	0.666	0.634	9.817	0.695	0.673	0.617	9.651	0.650	0.657	0.634	9.817	0.695	0.673
8	0.458	6.747	0.885	0.876	1.126	12.720	0.403	0.501	3.449	14.350	0.068	0.474	1.963	13.834	0.259	0.494	0.671	11.711	0.609	0.580	0.677	12.094	0.566	0.562	0.671	11.711	0.609	0.580
70% Missing
3	4.097		0.226		4.097		0.226		5.488		0.080		5.488		0.080		4.075		0.305		4.081		0.201		4.075		0.305
4	3.933		0.245		3.933		0.245		5.394		0.092		5.394		0.092		3.959		0.311		3.947		0.215		3.959		0.311
5	0.747	10.554	0.939	0.921	1.367	15.214	0.952	0.690	4.635	15.291	0.112	0.679	2.425	15.274	0.495	0.697	0.961	13.960	0.743	0.758	0.984	14.628	0.672	0.713	0.961	13.960	0.743	0.758
6	0.720	10.956	0.902	0.893	1.636	15.429	0.484	0.603	4.400	15.257	0.089	0.605	2.130	15.085	0.515	0.630	0.808	13.518	0.726	0.695	0.914	14.246	0.622	0.652	0.808	13.518	0.726	0.695
7	0.684	8.897	0.908	0.905	1.635	12.798	0.328	0.603	5.061	12.756	0.051	0.601	2.214	12.957	0.358	0.602	0.888	13.460	0.655	0.627	0.859	13.170	0.607	0.600	0.888	13.460	0.655	0.627
8	0.732	10.794	0.879	0.854	1.523	16.992	0.339	0.444	4.894	18.173	0.033	0.436	1.943	18.206	0.399	0.423	0.903	15.971	0.582	0.549	0.917	16.316	0.561	0.547	0.903	15.971	0.582	0.549

2.1.4 School Level

Table 2.4: Mean percent bias and confidence interval coverage rates for scale score (SS) and student growth percentiles (SGPs) at the school level
	L2PAN				L2PAN_LONG				L2LMER				L2LMER_LONG				PMM				RQ				Observed
	Percent Bias		CR		Percent Bias		CR		Percent Bias		CR		Percent Bias		CR		Percent Bias		CR		Percent Bias		CR		Percent Bias		CR
Percent Missing	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP	SS	SGP
MCAR
30% Missing	0.074	1.801	0.925	0.929	0.184	5.449	0.795	0.637	0.527	5.993	0.513	0.572	0.528	6.011	0.515	0.582	0.170	4.564	0.817	0.686	0.172	4.623	0.762	0.670	0.170	4.564	0.817	0.686
50% Missing	0.138	3.474	0.918	0.917	0.320	8.911	0.741	0.544	0.876	9.501	0.413	0.503	0.878	9.528	0.416	0.505	0.291	7.615	0.795	0.611	0.288	7.720	0.723	0.585	0.291	7.615	0.795	0.611
70% Missing	0.238	5.056	0.901	0.904	0.465	11.817	0.701	0.479	1.243	12.236	0.348	0.454	1.248	12.259	0.351	0.463	0.464	10.428	0.755	0.552	0.431	10.511	0.701	0.536	0.464	10.428	0.755	0.552
MAR (Status with Demographics)
30% Missing	0.305	2.904	0.726	0.874	0.356	5.932	0.638	0.600	1.415	7.374	0.174	0.480	1.287	6.981	0.210	0.518	0.341	4.734	0.697	0.709	0.323	5.095	0.609	0.660	0.341	4.734	0.697	0.709
50% Missing	0.521	4.977	0.647	0.855	0.607	9.157	0.550	0.524	2.263	10.257	0.103	0.473	1.869	9.725	0.164	0.511	0.598	7.709	0.612	0.637	0.580	8.230	0.502	0.580	0.598	7.709	0.612	0.637
70% Missing	0.830	7.918	0.598	0.843	0.948	12.033	0.473	0.481	3.255	12.494	0.053	0.471	2.380	12.419	0.148	0.467	0.934	10.435	0.536	0.593	0.927	11.268	0.430	0.527	0.934	10.435	0.536	0.593
MAR (Status with Growth)
30% Missing	0.665	3.299	0.385	0.856	0.669	5.570	0.345	0.632	2.181	6.478	0.031	0.513	1.968	6.142	0.044	0.571	0.674	4.839	0.347	0.693	0.694	4.955	0.297	0.666	0.674	4.839	0.347	0.693
50% Missing	1.237	5.242	0.306	0.846	1.221	8.776	0.245	0.562	3.492	9.265	0.012	0.517	2.860	8.884	0.029	0.560	1.249	7.631	0.250	0.633	1.269	7.947	0.204	0.589	1.249	7.631	0.250	0.633
70% Missing	1.961	7.527	0.258	0.847	2.077	11.580	0.160	0.503	4.988	11.564	0.003	0.494	3.603	11.510	0.028	0.515	1.999	10.335	0.174	0.587	2.036	10.865	0.138	0.535	1.999	10.335	0.174	0.587

2.2 Summary Figures: Grade/Content Area

The following figures provide more nuanced insight into the performance of the six MI methods as a function of important factors, including the percentage and type of missingness as well as the grade/content area size. Analyses are separated by the imputed value (i.e., either the scale score or the SGP).

2.2.1 Scale Scores

Figure 2.1: Scale score percent bias by imputation method, missingness percentage, and missingness type

Figure 2.2: Scale score coverage rate by imputation method, missingness percentage, and missingness type

Figure 2.3: Scatterplot of scale score percent bias as a function of grade/content area size

Figure 2.4: Scatterplot of scale score coverage rate as a function of grade/content area size

Figure 2.5: Proportion of times that the imputed SS was found to differ from the true value based on the simplified F1 statistic

2.2.2 SGPs

Figure 2.6: SGP percent bias by imputation method, missingness percentage, and missingness type

Figure 2.7: SGP coverage rate by imputation method, missingness percentage, and missingness type

Figure 2.8: Scatterplot of SGP percent bias as a function of grade/content area size

Figure 2.9: Scatterplot of SGP coverage rate as a function of grade/content area size

Figure 2.10: Proportion of times that the imputed SGP was found to differ from the true value based on the simplified F1 statistic

2.3 Summary Figures: School Level

Many of the above figures are replicated, here aggregating at the school level rather than by grade and content area within a school.

2.3.1 Scale Scores

Figure 2.11: Scale score percent bias by imputation method, missingness percentage, and missingness type

Figure 2.12: Scale score coverage rate by imputation method, missingness percentage, and missingness type

Figure 2.13: Scatterplot of scale score percent bias as a function of school size

Figure 2.14: Scatterplot of scale score coverage rate as a function of school size

Figure 2.15: Proportion of times that the imputed SS was found to differ from the true value based on the simplified F1 statistic

2.3.2 SGPs

Figure 2.16: SGP percent bias by imputation method, missingness percentage, and missingness type

Figure 2.17: SGP coverage rate by imputation method, missingness percentage, and missingness type

Figure 2.18: Scatterplot of SGP percent bias as a function of school size

Figure 2.19: Scatterplot of SGP coverage rate as a function of school size

Figure 2.20: Proportion of times that the imputed SGP was found to differ from the true value based on the simplified F1 statistic

2.4 Basic Regression Models

The following models are preliminary mechanisms for understanding which factors are related to relatively worse MI performance. To simplify these models, we examine either raw bias or absolute bias.

2.4.1 Grade/Content Area

We use the fixest package (Berge, 2018) to regress the bias variables on grade/content area size, percentage missing, missingness type, and imputation method; grade and content area are also included in the model as fixed effects. Currently, these are simple additive models. More complex models may be incorporated in future analyses (e.g., including interactions, a random effect for the school, etc.).
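One way to write out the additive specification described above is shown below. The equation is inferred from the coefficient rows in the tables rather than stated explicitly in the model output: the symbols are introduced here for illustration, and the reference categories (30% missing, MCAR, and the “Observed” condition) are assumed from the rows that are reported.

```latex
\[
\text{bias}_{i} = \beta_{N} N_{i}
  + \sum_{p \in \{50\%,\,70\%\}} \gamma_{p}\,\mathbf{1}[\text{MISS\_PERC}_{i} = p]
  + \sum_{t \in \{\text{DEMOG},\,\text{GROWTH}\}} \delta_{t}\,\mathbf{1}[\text{MISS\_TYPE}_{i} = t]
  + \sum_{k \neq \text{Observed}} \theta_{k}\,\mathbf{1}[\text{IMP\_METHOD}_{i} = k]
  + \alpha_{g(i)} + \varepsilon_{i}
\]
```

Here \(\alpha_{g(i)}\) is the fixed effect for the grade-by-content area cell containing observation \(i\), and the standard errors are clustered on the same cells.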

Table 2.5: Linear fixed-effect regression models for raw bias at the grade/content area level
	Scale Scores	SGPs

N	0.0004 (0.0021)	0.0054* (0.0021)
MISS_PERC50%Missing	2.046*** (0.3475)	-0.1588** (0.0377)
MISS_PERC70%Missing	4.432*** (0.7589)	-0.3639** (0.0732)
MISS_TYPEDEMOG	5.709*** (0.6697)	-0.0496 (0.0292)
MISS_TYPEGROWTH	10.36*** (1.535)	0.0746 (0.0493)
i(var=IMP_METHOD,ref=“Observed”)L2LMER_LONG	-0.2410 (0.8601)	-0.3768. (0.1911)
i(var=IMP_METHOD,ref=“Observed”)L2PAN_LONG	-6.195*** (1.314)	-0.4023. (0.2021)
i(var=IMP_METHOD,ref=“Observed”)L2LMER	1.698*** (0.2889)	-0.3242 (0.1859)
i(var=IMP_METHOD,ref=“Observed”)L2PAN	-5.253*** (1.052)	-0.0308 (0.1878)
i(var=IMP_METHOD,ref=“Observed”)RQ	-5.282*** (1.060)	-0.2582 (0.2059)
i(var=IMP_METHOD,ref=“Observed”)PMM	-5.290*** (1.072)	-0.3420 (0.2087)
Fixed-Effects:	—————–	——————
GRADE^CONTENT_AREA	Yes	Yes
________________________________________	_________________	__________________
S.E.: Clustered	by: GRA.^CON.	by: GRA.^CON.
Observations	96,075	51,849
R2	0.35696	0.00418
Within R2	0.32018	0.00356

Table 2.6: Linear fixed-effect regression models for absolute bias at the grade/content area level
	Scale Scores	SGPs

N	-0.0100*** (0.0017)	-0.0152*** (0.0012)
MISS_PERC50%Missing	2.763*** (0.2529)	1.574*** (0.0909)
MISS_PERC70%Missing	5.864*** (0.6021)	3.130*** (0.1531)
MISS_TYPEDEMOG	3.600*** (0.4151)	0.5515*** (0.0712)
MISS_TYPEGROWTH	8.268*** (1.261)	0.5947*** (0.0801)
i(var=IMP_METHOD,ref=“Observed”)L2LMER_LONG	1.117 (0.8447)	2.581*** (0.2756)
i(var=IMP_METHOD,ref=“Observed”)L2PAN_LONG	-3.355*** (0.5251)	2.312*** (0.2846)
i(var=IMP_METHOD,ref=“Observed”)L2LMER	2.959*** (0.3078)	2.643*** (0.2569)
i(var=IMP_METHOD,ref=“Observed”)L2PAN	-4.621*** (0.8489)	0.0985 (0.1049)
i(var=IMP_METHOD,ref=“Observed”)RQ	-3.921*** (0.6587)	1.848*** (0.2552)
i(var=IMP_METHOD,ref=“Observed”)PMM	-3.906*** (0.6933)	1.749*** (0.2606)
Fixed-Effects:	——————-	——————-
GRADE^CONTENT_AREA	Yes	Yes
________________________________________	___________________	___________________
S.E.: Clustered	by: GRA.^CON.	by: GRA.^CON.
Observations	96,075	51,849
R2	0.35972	0.18045
Within R2	0.33837	0.17236

We can also re-fit the scale score models using only observations from grades 5 through 8.

Table 2.7: Linear fixed-effect regression models for raw and absolute scale score bias when removing grades 3 and 4
	Raw Bias	Absolute Bias

N	0.0012 (0.0022)	-0.0095*** (0.0017)
MISS_PERC50%Missing	1.062*** (0.0788)	2.058*** (0.0652)
MISS_PERC70%Missing	2.274*** (0.1773)	4.179*** (0.1362)
MISS_TYPEDEMOG	3.814*** (0.2469)	2.487*** (0.1313)
MISS_TYPEGROWTH	6.015*** (0.3615)	4.800*** (0.1986)
i(var=IMP_METHOD,ref=“Observed”)L2LMER_LONG	-2.628*** (0.4167)	-1.274** (0.3270)
i(var=IMP_METHOD,ref=“Observed”)L2PAN_LONG	-9.960*** (0.7401)	-4.664*** (0.4131)
i(var=IMP_METHOD,ref=“Observed”)L2LMER	0.9643*** (0.1011)	2.140*** (0.1114)
i(var=IMP_METHOD,ref=“Observed”)L2PAN	-8.215*** (0.3348)	-7.010*** (0.2442)
i(var=IMP_METHOD,ref=“Observed”)RQ	-8.256*** (0.3002)	-5.750*** (0.2844)
i(var=IMP_METHOD,ref=“Observed”)PMM	-8.317*** (0.2633)	-5.831*** (0.2838)
Fixed-Effects:	——————	——————-
GRADE^CONTENT_AREA	Yes	Yes
________________________________________	__________________	___________________
S.E.: Clustered	by: GRA.^CON.	by: GRA.^CON.
Observations	51,849	51,849
R2	0.33165	0.34586
Within R2	0.33026	0.34365

2.4.2 School Level

In these models, the data are aggregated at the school level.

Table 2.8: Linear regression models for raw bias at the school level
	Scale Scores	SGPs

(Intercept)	1.936*** (0.1775)	-0.3654* (0.1493)
N	-0.0023*** (0.0002)	0.0010*** (0.0001)
MISS_PERC50%Missing	2.044*** (0.1227)	-0.1587 (0.1032)
MISS_PERC70%Missing	4.494*** (0.1227)	-0.3560*** (0.1032)
MISS_TYPEDEMOG	5.696*** (0.1227)	-0.1420 (0.1032)
MISS_TYPEGROWTH	10.36*** (0.1227)	-0.0388 (0.1032)
i(var=IMP_METHOD,ref=“Observed”)L2LMER_LONG	-0.7282*** (0.1874)	-0.3890* (0.1576)
i(var=IMP_METHOD,ref=“Observed”)L2PAN_LONG	-6.651*** (0.1874)	-0.4009* (0.1576)
i(var=IMP_METHOD,ref=“Observed”)L2LMER	1.259*** (0.1874)	-0.3714* (0.1576)
i(var=IMP_METHOD,ref=“Observed”)L2PAN	-5.680*** (0.1874)	0.0527 (0.1576)
i(var=IMP_METHOD,ref=“Observed”)RQ	-5.703*** (0.1874)	-0.2800. (0.1576)
i(var=IMP_METHOD,ref=“Observed”)PMM	-5.694*** (0.1874)	-0.2982. (0.1576)
________________________________________	___________________	___________________
S.E. type	Standard	Standard
Observations	14,616	14,616
R2	0.46063	0.00555
Adj. R2	0.46022	0.00480

Table 2.9: Linear regression models for absolute bias at the school level
	Scale Scores	SGPs

(Intercept)	2.955*** (0.1578)	1.437*** (0.0997)
N	-0.0022*** (0.0002)	-0.0025*** (9.82e-5)
MISS_PERC50%Missing	2.423*** (0.1091)	1.254*** (0.0689)
MISS_PERC70%Missing	5.280*** (0.1091)	2.481*** (0.0689)
MISS_TYPEDEMOG	3.929*** (0.1091)	0.4720*** (0.0689)
MISS_TYPEGROWTH	8.547*** (0.1091)	0.3616*** (0.0689)
i(var=IMP_METHOD,ref=“Observed”)L2LMER_LONG	0.4437** (0.1666)	2.295*** (0.1053)
i(var=IMP_METHOD,ref=“Observed”)L2PAN_LONG	-5.029*** (0.1666)	2.089*** (0.1053)
i(var=IMP_METHOD,ref=“Observed”)L2LMER	2.482*** (0.1666)	2.368*** (0.1053)
i(var=IMP_METHOD,ref=“Observed”)L2PAN	-5.526*** (0.1666)	0.1478 (0.1053)
i(var=IMP_METHOD,ref=“Observed”)RQ	-5.108*** (0.1666)	1.668*** (0.1053)
i(var=IMP_METHOD,ref=“Observed”)PMM	-5.108*** (0.1666)	1.519*** (0.1053)
________________________________________	___________________	____________________
S.E. type	Standard	Standard
Observations	14,616	14,616
R2	0.48342	0.17251
Adj. R2	0.48303	0.17188

2.5 Key Take-Aways

Before homing in on the particular differences among the MI methods, a handful of general trends merit comment. First, across missingness types and percentages, the percent bias for scale scores is notably lower than the percent bias for SGPs. Second, there is a small but noticeable positive relationship between missingness percentage and percent bias; this relationship is more pronounced for scale scores than for SGPs. Third, the scatterplots of percent bias and coverage rates as a function of grade/content area size (\(N\)) indicate that observations with smaller \(N\) are more likely to have higher percent bias on either scale scores or SGPs, as well as greater variation in the CI coverage rates. Finally, the bar plots indicate that the proportion of significant differences between the true and imputed scale score values (based on the simplified \(F_1\) statistic) is higher when data are MAR, particularly when missingness is based on status and growth. The preliminary regression models also suggest that, across the imputation methods, bias tends to increase as the percentage of missingness increases and when data are MAR rather than MCAR (again, particularly using status and growth). However, the \(R^2\) values are relatively low, so caution is warranted when interpreting the model results.

Overall, the cross-sectional L2PAN method demonstrates the best performance among the examined methods. Specifically, L2PAN was most often able to keep the average percent bias for scale scores and SGPs below 5% while simultaneously constructing confidence intervals with higher coverage rates. Additionally, L2PAN most often failed to reject the null hypothesis that the imputed and true values are equivalent. In other words, the other MI methods more often found statistically significant differences (using the simplified \(F_1\) statistic) between the imputed and true average scale score and SGP values. Finally, the regression models indicate that L2PAN is negatively related to scale score bias, suggesting reduced scale score bias compared to the “Observed” condition. Relatedly, L2PAN was often not positively related to SGP bias.

3 Evaluating Cross-Sectional L2PAN

The results provide growing evidence that L2PAN is a relatively effective MI method for the types of data and missingness examined thus far. Still, the above analyses indicate that in numerous cases, L2PAN is unable to generate relatively accurate pooled estimates. In this section, we take a deeper dive to identify the conditions wherein L2PAN performs well, as well as the conditions wherein L2PAN is unable to adequately recover the true mean scale scores or SGPs.

3.1 Scale Scores

3.1.1 Descriptive Statistics: Grade/Content Area

Figure 3.1: Average SS percent bias by grade/content area quantile, grade, and missingness characteristics

Figure 3.1 presents the average SS percent bias for the L2PAN method as a function of grade/content area size, missingness percentage and type, and grade. First, notice in this figure that the average percent bias never exceeds 5%, which is the threshold for poor MI efficacy posited by researchers like Qi and colleagues (2010) and Miri and colleagues (2020). There is a general trend wherein the SS percent bias decreases as the grade/content area size quantile increases, as well as a tendency for the percent bias to be greater when data are missing using status and growth compared to using status and demographics. The largest relative scale score percent bias with L2PAN occurs for grades three and four when 70% of the data are missing using status and growth.
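A summary of this kind, averaging percent bias within size quantiles, could be computed along the following lines. The number of bins is not stated in the text, so quartiles are assumed here, and the sizes and percent bias values are invented for illustration.

```python
from bisect import bisect_left
from collections import defaultdict
from statistics import mean, quantiles


def size_bins(sizes, n_bins=4):
    """Label each size with its quantile bin, 1 (smallest) to n_bins (largest)."""
    cuts = quantiles(sizes, n=n_bins)  # n_bins - 1 cut points
    return [bisect_left(cuts, s) + 1 for s in sizes]


def mean_by_bin(bins, values):
    """Average `values` within each bin label."""
    grouped = defaultdict(list)
    for b, v in zip(bins, values):
        grouped[b].append(v)
    return {b: mean(v) for b, v in sorted(grouped.items())}


# Hypothetical grade/content area sizes and percent bias values.
sizes = [10, 20, 30, 40, 50, 60, 70, 80]
pct_bias = [4.0, 3.0, 2.5, 1.5, 1.0, 1.0, 0.5, 0.5]
bins = size_bins(sizes)
summary = mean_by_bin(bins, pct_bias)
print(bins, summary)
```

In this toy input the average percent bias falls from the smallest to the largest size bin, mirroring the direction of the trend described above.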

Figure 3.2 replicates the above analysis, using average scale score coverage rate rather than percent bias.

Figure 3.2: Average SS coverage rate by grade/content area quantile, grade, and missingness characteristics

Figure 3.2 indicates that the average coverage rate is greater than 0.90 in a large proportion of conditions. The average scale score coverage rates are particularly low for grades 3 and 4 when data are missing based on status and growth. For observations in these grades, the simplified CIs for scale scores tend to be too liberal (i.e., they contain the true value less often than intended) when data are MAR, particularly when missingness is based on growth rather than demographics; this pattern is exacerbated as the missingness percentage and the grade/content area size quantile increase.

We may also use the (simplified) \(F_1\) statistic to flag observations indicating significant differences between the imputed and true values. The figure below plots the proportion of \(F_1\) statistics that result in a rejection of the null hypothesis for the scale scores (using a threshold of \(\alpha=0.1\)), where the null hypothesis is that the imputed and true values are equivalent. We find that cases in grades 3 and 4 when data are MAR (based on status and growth) more often indicate significant differences between the imputed and true values. Again, this more often occurs among larger grade/content area size quantiles.

Figure 3.3: Proportion of cases where a significant difference between the imputed and true SS value was found using the F1 statistic

Across the 13,725 observations in the L2PAN imputed data, 97.8% of observations have an average SS percent bias less than 5%. Moreover, 59.8% of observations have a mean SS coverage rate greater than 0.90. We are particularly interested in whether there are similar characteristics among the observations that do not meet these percent bias and coverage rate thresholds when imputing with L2PAN.

There are 298 observations with SS percent bias greater than 5% using L2PAN. Similarly, there are 5003 observations with SS coverage rates less than 0.90. Altogether, there are 293 observations that have an SS percent bias greater than 5%, a coverage rate less than 0.90, and an \(F_1\) statistic for which we reject the null hypothesis that the imputed and true values are equivalent. We’ll look at this last category of “flagged” observations in more detail.
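The flagging rule combines all three criteria, so an observation is flagged only when every condition holds at once. A minimal Python sketch of that rule follows; the cutoffs (5%, 0.90, and \(\alpha=0.1\)) come from the text, while the function name, the record layout, and the example values are hypothetical.

```python
def is_flagged(percent_bias, coverage_rate, f1_p_value,
               bias_cutoff=5.0, coverage_cutoff=0.90, alpha=0.10):
    """Return True only if all three criteria indicate poor imputation performance."""
    return (percent_bias > bias_cutoff
            and coverage_rate < coverage_cutoff
            and f1_p_value < alpha)

# Hypothetical observations (e.g., grade/content area cells within a condition)
observations = [
    {"id": "A", "percent_bias": 6.2, "coverage_rate": 0.71, "f1_p_value": 0.03},
    {"id": "B", "percent_bias": 6.2, "coverage_rate": 0.95, "f1_p_value": 0.03},
    {"id": "C", "percent_bias": 1.1, "coverage_rate": 0.62, "f1_p_value": 0.40},
]

flagged_ids = [
    obs["id"] for obs in observations
    if is_flagged(obs["percent_bias"], obs["coverage_rate"], obs["f1_p_value"])
]
```

Only observation A is flagged: B has adequate coverage, and C has low percent bias and a non-significant \(F_1\) test. Because the rule is a conjunction, the flagged count can never exceed the smallest of the three individual counts.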

Figure 3.4: Density plot of flagged observations based on the scale score as a function of grade/content area size

\(~\)

Figure 3.5: Scatter plot of flagged observations based on the scale score as a function of grade/content area size and percent bias

\(~\)

Figure 3.6: Bar plot of flagged observations based on the scale score as a function of grade and missingness

We see that the majority of “flagged” cases tend to have grade/content area sizes less than 60. Variation in SS percent bias is also greater among observations with smaller grade/content area sizes. Additionally, these observations are overwhelmingly in grades 3 and 4 when data are missing based on status and growth, particularly for conditions with 50% and 70% of data missing.

3.1.2 Descriptive Statistics: School Level

Figure 3.7: Average SS percent bias by school size quantile and missingness characteristics

\(~\)

Figure 3.7 shows the average SS percent bias for the L2PAN method at the school level, as a function of school size, missingness percentage, and missingness type. As with the grade/content area analyses, the maximum average SS percent bias is relatively low, here never exceeding 2.5%. We see higher average SS percent bias when data are MAR based on status and growth, and as the percentage missing increases. In some cases, percent bias also decreases slightly as the school size quantile increases.

Figure 3.8 below uses average SS coverage rate rather than percent bias. We see that the average coverage rates are concerningly low when data are MAR, particularly based on characteristics like status and growth. There is not a clear linear relationship between the school size quantile and the average CI coverage rate. For instance, when data are MAR based on status and growth, we find a U-shaped relationship between these two variables.

Figure 3.8: Average SS coverage rate by school size quantile and missingness characteristics

\(~\)

The figure below plots the proportion of \(F_1\) statistics that result in a rejection of the null hypothesis for the SS (using a threshold of \(\alpha=0.1\)). Again, we find this trend of relatively worse performance by L2PAN for data that are MAR based on status and growth, as well as higher missingness percentages.

Figure 3.9: Proportion of cases where a significant difference between the imputed and true SS value was found using the F1 statistic

\(~\)

Across the 2088 observations in the L2PAN imputed data at the school level, 99.7% of observations have an average SS percent bias less than 5%. Moreover, 38.6% of observations have a mean SS coverage rate greater than 0.90.

There are 7 observations with SS percent bias greater than 5% using L2PAN. Similarly, there are 1203 observations with SS coverage rates less than 0.90. Putting these conditions together, there are 7 total observations with both SS percent bias greater than 5% and a coverage rate less than 0.90, as well as \(F_1\) statistics for which we reject the null hypothesis that the imputed and true values are equivalent.

3.1.3 Classification Models

In this section, we fit a series of relatively simple classification models. The goal is to identify data features that are related to poor L2PAN performance, providing guidance as to when imputing data with this MI method may warrant caution. The models use logistic regression with up to two-way interactions among the predictors. The baseline condition is an observation with a grade/content area or school size of zero and 30% of the data MCAR.

The models are fit using the fixest package. In the first model, the data are analyzed at the grade/content area level. Therefore, fixed effects for grade and content area are included in the model. The outcome is a binary variable where 1 indicates an SS percent bias greater than 5%, a coverage rate less than 0.90, and an \(F_1\) statistic p-value less than 0.10.

In the second model, the data are analyzed at the school level. Because there were so few “flagged” observations in these data (N = 7), the outcome variable is the simplified CI coverage rate.
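As a reading aid, one way to write the general form of a logistic model with main effects and all two-way interactions among size, missingness type, and missingness percentage is sketched below. The notation is assumed here for illustration and is not taken from the report:

\[
\operatorname{logit}\,\Pr(Y_i = 1) = \alpha_{j(i)} + \beta_1 N_i + \boldsymbol{\beta}_2^{\top}\mathbf{T}_i + \boldsymbol{\beta}_3^{\top}\mathbf{P}_i + \boldsymbol{\gamma}_1^{\top}(N_i\,\mathbf{T}_i) + \boldsymbol{\gamma}_2^{\top}(N_i\,\mathbf{P}_i) + \boldsymbol{\gamma}_3^{\top}(\mathbf{T}_i \otimes \mathbf{P}_i)
\]

In this sketch, \(Y_i\) is the outcome for observation \(i\), \(N_i\) is the grade/content area or school size, \(\mathbf{T}_i\) and \(\mathbf{P}_i\) are indicator vectors for missingness type and missingness percentage, and \(\alpha_{j(i)}\) is the fixed effect for the grade by content area combination \(j\) to which observation \(i\) belongs. In the school-level model, a single intercept would take the place of \(\alpha_{j(i)}\).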

Table 3.1: Logistic fixed-effect regression models for grade/content area flagged observations and school-level CI coverage rates
	Grade/Content Area: Flagged	School: CI Coverage Rates

N	-0.1289*** (0.0077)	0.0021*** (0.0005)
MISS_TYPEDEMOG	14.71*** (1.439)	-1.641*** (0.3311)
MISS_TYPEGROWTH	17.21*** (0.2905)	-3.722*** (0.3507)
MISS_PERC50%Missing	1.092* (0.4278)	-0.2605 (0.3837)
MISS_PERC70%Missing	0.6798 (0.5413)	-0.5595 (0.3716)
i(var=N,f=MISS_TYPE)MCAR	0.1267*** (0.0150)	-0.0011 (0.0007)
i(var=N,f=MISS_TYPE)DEMOG	0.0782* (0.0340)	-0.0016*** (0.0004)
i(var=N,f=MISS_TYPE)GROWTH	0.1240*** (0.0075)
i(var=N,f=MISS_PERC)30%Missing	0.0146 (0.0122)	-0.0008 (0.0005)
i(var=N,f=MISS_PERC)50%Missing	-0.0096* (0.0045)	-0.0003 (0.0005)
i(var=MISS_TYPE,f=MISS_PERC)DEMOG x 30%Missing	-1.510. (0.8055)	0.2962 (0.3899)
i(var=MISS_TYPE,f=MISS_PERC)DEMOG x 50%Missing	-1.982* (0.7820)	0.0075 (0.3797)
i(var=MISS_TYPE,f=MISS_PERC)GROWTH x 30%Missing	-3.391*** (0.3375)	0.3456 (0.3944)
i(var=MISS_TYPE,f=MISS_PERC)GROWTH x 50%Missing	-1.344*** (0.1815)	0.0592 (0.3919)
(Intercept)		2.448*** (0.3337)
Fixed-Effects:	——————-	——————-
CONTENT_AREA^GRADE	Yes	No
________________________________________	___________________	___________________
S.E. type	by: CON.^GRA.	Standard
Convergence	FALSE	TRUE
Observations	6,318	2,088
Squared Cor.	0.19628	0.44657
Pseudo R2	0.37202	0.34479
BIC	1,647.0	1,910.7

3.2 Student Growth Percentiles

3.2.1 Descriptive Statistics: Grade/Content Area

Figure 3.10: Average SGP percent bias by grade/content area quantile, grade, and missingness characteristics

\(~\)

Figure 3.10 shows the average SGP percent bias. As previously noted, average SGP percent bias is generally larger than that for scale scores. In many missingness percentage and type categorizations, the largest average SGP percent bias occurs for observations in eighth grade and in the first grade/content area size quantile. We again see a general trend wherein observations with smaller grade/content area size quantiles (i.e., in the first quantile) have larger average bias. The largest average SGP percent bias is approximately 29%, occurring when 70% of the data are missing using status and growth, and observations are in the eighth grade and first quantile of grade/content area size.

Figure 3.11: Average SGP coverage rate by grade/content area quantile, grade, and missingness characteristics

\(~\)

Whereas the average SS coverage rate was lower than 0.5 for some conditions (see Figure 3.2), Figure 3.11 indicates relatively high average coverage rates across the examined conditions when analyzing the SGPs. For instance, the average CI coverage rate for the SGPs does not fall below 0.76. Therefore, although L2PAN tends to produce imputed SGP estimates with relatively higher percent bias compared to the imputed SS estimates, the confidence intervals for the SGP estimates more often contain the true SGP values when analyzing at the grade/content area level.

As shown in the figure below, statistically significant differences between the imputed and true mean SGPs were more often found when data were MAR based on status with growth. In some cases, there was also a tendency for higher proportions of rejected null hypotheses among higher grades.

Figure 3.12: Proportion of cases where a significant difference between the imputed and true SGP value was found using the F1 statistic

\(~\)

Across the 13725 observations in the L2PAN imputed data, 32.3% of observations have an average SGP percent bias less than 5%. Moreover, 39.9% of observations have a mean SGP coverage rate greater than 0.90. Again, we’ll further examine the “flagged” observations, which have an SGP percent bias greater than 5%, a simplified CI coverage rate less than 0.90, and a statistically significant \(F_1\) test.

Here, there are 319 total observations that we may “flag” for which L2PAN performs relatively worse when imputing the mean SGP values at the grade/content area level.

Figure 3.13: Density plot of flagged observations based on the SGP as a function of grade/content area size

\(~\)

Figure 3.14: Scatter plot of flagged observations based on the SGPs as a function of grade/content area size and percent bias

\(~\)

Figure 3.15: Bar plot of flagged observations based on the SGP as a function of grade and missingness

\(~\)

As with the scale score imputations, L2PAN tends to generate more concerning SGP imputed values for smaller grade/content area sizes. The scatterplot further shows that observations with smaller grade/content area sizes tend to have greater variation in percent bias for the SGPs. The faceted bar plot above suggests that across the missingness type and percentage conditions, L2PAN tends to do worse in terms of the SGP imputation for grade six (and also often grade five). There is also some evidence that imputation efficacy decreases as missingness percentage increases, as well as for MAR compared to MCAR data.

3.2.2 Descriptive Statistics: School Level

Figure 3.16: Average SGP percent bias by school size quantile and missingness characteristics

\(~\)

In the above figure, we see that average SGP percent bias is higher among lower school size quantiles, higher missingness percentages, and when data are MAR. However, we find fewer clear trends when examining average SGP coverage rates (see below). Notably, the average coverage rates do not dip below 0.80 when analyzing the SGPs at the school level. Coverage rates tend to be lower for MAR compared to MCAR data.

Figure 3.17: Average SGP coverage rate by school size quantile and missingness characteristics

\(~\)

The figure below plots the proportion of \(F_1\) statistics that result in a rejection of the null hypothesis for the SGPs (again using a threshold of \(\alpha=0.1\)). Notice in this figure that statistically significant differences between the imputed and true mean SGPs were more often found when data were MAR.

Figure 3.18: Proportion of cases where a significant difference between the imputed and true SGP value was found using the F1 statistic

\(~\)

Across the 2088 observations in the L2PAN imputed data at the school level, 71.1% of observations have an average SGP percent bias less than 5%. Moreover, 64.4% of observations have a mean SGP coverage rate greater than 0.90.

Here, there are 176 total observations that we may “flag” for which L2PAN performs relatively worse when imputing the SGP values at the school level (again, based on a combination of the SGP percent bias, simplified CI coverage rate, and p-value for the \(F_1\) statistic).

Figure 3.19: Density plot of flagged observations based on the SGP as a function of school size

\(~\)

Figure 3.20: Scatter plot of flagged observations based on the SGPs as a function of school size and percent bias

\(~\)

Figure 3.21: Bar plot of flagged observations based on the SGP as a function of missingness

\(~\)

We find that “flagged” observations at the school level are more likely to have smaller school sizes and occur when data are MAR (particularly based on status and growth). There is also a clear positive relationship between the number of “flagged” observations and the percentage of missingness, holding other factors constant.

3.2.3 Classification Model

We fit another series of logistic regressions with up to two-way interactions among the predictors. Here, the outcome for both the grade/content area and school level is a binary variable where 1 now indicates an SGP percent bias greater than 5%, coverage rate less than 0.90, and \(F_1\) statistic p-value less than 0.10.

Table 3.2: Logistic fixed-effect regression models for grade/content area or school-level flagged observations
	Grade/Content Area	School

N	-0.0491*** (0.0036)	-0.0005 (0.0007)
MISS_TYPEDEMOG	0.8894* (0.3914)	0.4954 (0.5394)
MISS_TYPEGROWTH	1.200* (0.5243)	0.3132 (0.5286)
MISS_PERC50%Missing	9.491*** (0.7164)	1.692 (1.113)
MISS_PERC70%Missing	11.04*** (0.2832)	2.396* (1.078)
i(var=N,f=MISS_TYPE)MCAR	-0.0079 (0.0082)	-0.0024 (0.0015)
i(var=N,f=MISS_TYPE)DEMOG	9.73e-5 (0.0024)	-0.0009 (0.0009)
i(var=N,f=MISS_PERC)30%Missing	0.0504*** (0.0028)	-0.0014 (0.0012)
i(var=N,f=MISS_PERC)50%Missing	0.0507*** (0.0034)	-0.0001 (0.0009)
i(var=N,f=MISS_PERC)70%Missing	0.0509*** (0.0034)
i(var=MISS_TYPE,f=MISS_PERC)DEMOG x 30%Missing	10.10*** (0.2562)	1.802 (1.102)
i(var=MISS_TYPE,f=MISS_PERC)DEMOG x 50%Missing	1.205* (0.5523)	0.5349 (0.5571)
i(var=MISS_TYPE,f=MISS_PERC)GROWTH x 30%Missing	10.66*** (0.2551)	2.207* (1.091)
i(var=MISS_TYPE,f=MISS_PERC)GROWTH x 50%Missing	1.591** (0.5072)	0.5959 (0.5525)
(Intercept)		-4.308*** (1.081)
Fixed-Effects:	——————-	—————–
CONTENT_AREA^GRADE	Yes	No
________________________________________	___________________	_________________
S.E. type	by: CON.^GRA.	Standard
Convergence	FALSE	TRUE
Observations	7,407	2,088
Squared Cor.	0.03162	0.03581
Pseudo R2	0.10027	0.07425
BIC	2,562.8	1,224.8

\(~\)

4 Conclusions

The current simulation was designed to address two related questions. First, can we determine whether MI is an appropriate method for creating “adjusted” scale scores and SGPs when data are missing? Second, if MI is indeed appropriate, which method is most effective and in what data contexts? The above results can be broadly summarized as follows:

Within the current simulation, the cross-sectional L2PAN method generally outperforms L2LMER, L2LMER_LONG, L2PAN_LONG, PMM, and RQ in terms of percent bias, simplified CI coverage rates, and statistically significant differences between the imputed and true scale score and SGP values. Cross-sectional L2PAN also tends to perform better than the “observed” condition, where no imputation is conducted.
MI with L2PAN tends to perform worse among cases with smaller grade/content area or school sizes, and when data are MAR based on status and growth. MI efficacy is also negatively related to the percent of missingness in the data.
It appears particularly difficult to accurately impute average scale scores for grades 3 and 4, especially when data are MAR based on status and growth.
Patterns of MI efficacy differ based on whether the scale scores or SGPs are being imputed, as well as whether MI is evaluated based on percent bias or CI coverage rates.

The simulation results clearly demonstrate that the efficacy of L2PAN multiple imputation depends upon a nuanced interaction of different data characteristics. It is also likely that other factors (not accounted for in the current analyses) contribute to MI’s performance in this context. For instance, recall that the \(R^2\) values for the exploratory linear regressions were relatively small. There may be additional interaction and polynomial effects on percent bias and CI coverage rates that warrant further exploration.

To address when MI is a plausible methodological option for generating adjusted 2021 scores, we make the following additional recommendations:

The school composition should not differ substantially across academic years. In other words, even when missingness rates are low, the demographic make-up of a school may change considerably over the years. Such differences may preclude appropriate comparisons across years (with or without adjusted scores).
Given the aforementioned trends regarding missingness types, it is important that researchers and policymakers examine their missingness patterns prior to imputation. Our results indicate that MI performed relatively better when data were MCAR or missing at random (MAR) based on a combination of status and demographics. However, it is extremely unlikely that data will be MCAR in this context. It is also difficult to differentiate between MAR and missing not at random (MNAR) for a given data set.
Given the relationships between MI efficacy and missingness percentage, MI should be used with caution when more than 50% of the data are missing. Note that these missingness rates may differ among schools (i.e., not all units of analysis will have the same missingness percentage). Such differential rates should be taken into consideration when determining if and when MI is appropriate.
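Because missingness rates can vary across units, it may help to compute them per school before deciding whether to impute. The following Python sketch shows one simple way to do so, assuming student records are available as (school ID, score) pairs with `None` marking a missing score. The data layout, function name, and example records are hypothetical; the 50% cutoff follows the recommendation above.

```python
from collections import defaultdict

def missingness_by_school(records):
    """Return {school_id: proportion of records with a missing (None) score}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for school_id, score in records:
        totals[school_id] += 1
        if score is None:
            missing[school_id] += 1
    return {school: missing[school] / totals[school] for school in totals}

# Hypothetical student records: (school ID, scale score or None if missing)
records = [
    ("S1", 510), ("S1", None), ("S1", 495), ("S1", 530),
    ("S2", None), ("S2", None), ("S2", 480), ("S2", None),
]

rates = missingness_by_school(records)

# Schools where more than 50% of scores are missing warrant extra caution
needs_caution = sorted(school for school, rate in rates.items() if rate > 0.50)
```

In this made-up example, school S1 has 25% of its scores missing and S2 has 75%, so only S2 would be singled out for caution even though the overall missingness rate is exactly 50%.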

In summary, L2PAN is a promising mechanism for dealing with relatively low levels of missingness in scale scores or SGPs. As with any simulation, the results presented here can only be generalized to the simulation conditions examined. Moreover, future work should include diagnostic checks to examine the MI performance with a particular set of data. In other words, even if L2PAN imputation is implemented and is expected to work well (based on the given simulations), supplementary diagnostic analyses would help ensure that the imputation method is working as intended. Stuart and colleagues (2009), as well as Nguyen and colleagues (2017), provide helpful overviews of these diagnostics.

5 References

Bates, D., Maechler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1-48. https://doi.org/10.18637/jss.v067.i01
Bergé, L. (2018). Efficient estimation of maximum likelihood models with multiple fixed-effects: the R package FENmlm. CREA Discussion Papers.
Betebenner, D. W., Van Iwaarden, A. R., & Domingue, B. (2021). SGPdata: Exemplar data sets for student growth percentile (SGP) analyses. R package version 25.1-0.0. https://centerforassessment.github.io/SGPdata/
Demirtas, H. (2004). Simulation driven inferences for multiply imputed longitudinal datasets. Statistica Neerlandica, 58(4), 466-482. https://doi.org/10.1111/j.1467-9574.2004.00271.x
Enders, C. K. (2010). Applied missing data analysis. New York, NY: The Guilford Press.
Fox, J. & Weisberg, S. (2018). Multiple imputation of missing data. Appendix of An R companion to applied regression, third edition. Thousand Oaks, CA: Sage Publications, Inc.
Miri, H. H., Hassanzadeh, J., Khaniki, S. H., Akrami, R., & Sirjani, E. (2020). Accuracy of five multiple imputation methods in estimating prevalence of Type 2 diabetes based on STEPS surveys. Journal of Epidemiology and Global Health, 10(1), 36-41. https://doi.org/10.2991/jegh.k.191207.001
Nguyen, C. D., Carlin, J. B., & Lee, K. J. (2017). Model checking in multiple imputation: An overview and case study. Emerging Themes in Epidemiology, 14(8). https://doi.org/10.1186/s12982-017-0062-6
Qi, L., Wang, Y.-F., & He, Y. (2010). A comparison of multiple imputation and fully augmented weighted estimators for Cox regression with missing covariates. Statistics in Medicine, 29(25), 2592-2604. https://doi.org/10.1002/sim.4016
Stuart, E. A., Azur, M., Frangakis, C., & Leaf, P. (2009). Multiple imputation with large data sets: A case study of the Children’s Mental Health Initiative. American Journal of Epidemiology, 169(9), 1133–1139. https://doi.org/10.1093/aje/kwp026
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x
van Buuren, S. (2018). Flexible imputation of missing data. CRC Press. https://stefvanbuuren.name/fimd/
van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 1-67. https://www.jstatsoft.org/v45/i03/
Vink, G., & van Buuren, S. (2014). Pooling multiple imputations when the sample happens to be the population. arXiv Pre-Print 1409.8542.
Zhao, J. H., & Schafer, J. L. (2018). pan: Multiple imputation for multivariate panel or clustered data. R package version 1.6.

Evaluating the Efficacy of Multiple Imputation Methods for Missing Educational Assessment and Growth Data

Allie Cooperman, Adam Van Iwaarden, and Damian Betebenner

June 16, 2021
