# Consortium for Educational Research and Evaluation–North Carolina: Comparing Value-Added Models for Estimating Individual Teacher Effects on a Statewide ...

For the consistency analysis, which required two sequential within-year estimates for each grade level, the amount of information available was limited for the models that required multi-year panels to estimate (the SFE, SFEIV, TFEIV, and DOLS). The SFE, SFEIV, and TFEIV each required two sequences of three years’ data. However, among all time-varying covariates, only test score data were available prior to 2007–08 (giving us only three years of complete data), so no time-varying covariates could be included in these models (differencing eliminates the time-invariant covariates). For the DOLS, the panel was estimated with only two years, allowing for the inclusion of time-varying covariates. There were 503,370 student records in 5th grade math (8,826 teachers) and 728,008 student records in 5th grade reading (9,402 teachers).

**Comparison Criteria**

Three criteria were used to compare the absolute performance of each VAM in estimating the true teacher effects in the simulated data. Two of these, plus a different third criterion, were used to assess the relative performance of the models on the actual NC data. First, the Spearman rank order correlation coefficient, a non-parametric measure of the association between the rankings of two variables, was estimated for each pairing of a VAM with the true effect (simulation only) and with each other VAM (simulation and actual). For the simulated data, the estimates across simulations needed to be combined into a single point estimate, which required a Fisher z transformation; the mean of the z-transformed correlations was calculated and then back-transformed using the hyperbolic tangent function. High-performing VAMs have relatively higher Spearman coefficients.
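The combining step described above can be sketched in a few lines. This is a minimal illustration, assuming per-simulation Spearman coefficients are already computed; the function name and example values are hypothetical, not from the report:

```python
import math

def combine_correlations(rs):
    """Combine per-simulation correlation coefficients into one point
    estimate: Fisher z-transform each r (atanh), average the z values,
    then back-transform with the hyperbolic tangent."""
    zs = [math.atanh(r) for r in rs]        # Fisher z transformation
    return math.tanh(sum(zs) / len(zs))     # back-transform the mean z

# Hypothetical Spearman coefficients from three simulation replications
combined = combine_correlations([0.95, 0.96, 0.94])
```

Averaging in z-space, rather than averaging the bounded correlations directly, reduces the bias that the (-1, 1) bounds would otherwise introduce.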

Consortium for Educational Research and Evaluation–North Carolina | Comparing Value-Added Models | August 2012

Second, we calculated the percent agreement on the lowest 5% of the teacher quality distribution.

The teachers in the bottom 5% of the distribution under each version of the teacher effect (the “true” effect in the simulation, or each VAM’s estimate in both the simulated and actual data) were identified. In the simulated data analysis, a teacher’s true and estimated scores agreed if both ranked the teacher above the fifth percentile or both ranked the teacher below it. The statistic was the proportion of all teachers with agreement. In the actual data, a teacher’s scores on any two methods agreed if both scores were observed and both fell above, or both below, the fifth percentile. High-performing VAMs have relatively higher levels of agreement. Because of the normal distributions used in the data generation processes for the simulations, the findings for teachers above the 95th percentile were nearly identical. We chose this approach to correspond with a likely policy use of VAMs: identifying the lowest performing teachers.
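The agreement statistic can be sketched as follows. This is a hedged illustration in which scores are simple lists, the bottom 5% is taken by within-list rank, and `percent_agreement` is an illustrative name, not the report's code:

```python
def percent_agreement(scores_a, scores_b, pct=0.05):
    """Proportion of teachers classified the same way by two score
    vectors relative to the bottom-5% cutoff: agreement means both
    flag a teacher as bottom 5%, or neither does."""
    def below_cutoff(scores):
        cutoff_rank = max(1, int(len(scores) * pct))   # size of bottom group
        ranked = sorted(range(len(scores)), key=lambda i: scores[i])
        flagged = set(ranked[:cutoff_rank])            # lowest-scoring teachers
        return [i in flagged for i in range(len(scores))]
    a_low = below_cutoff(scores_a)
    b_low = below_cutoff(scores_b)
    agree = sum(x == y for x, y in zip(a_low, b_low))
    return agree / len(scores_a)
```

With identical score vectors the statistic is 1.0; each teacher flagged by one method but not the other lowers it by 1/n.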

Third, we examined the false identification of ineffective teachers, in the simulated data only. This analysis focused on the case in which a teacher who is actually relatively effective is identified as ineffective based on the VAM score, given the significant consequences that teachers and states may face under high-stakes evaluation systems. We assumed a cutoff of -1.64 standard deviations below the mean teacher score, which under a normal distribution corresponds to identifying 5% of teachers as ineffective. First, we identified the teachers above the cutoff on the “true” measure. Then we identified the teachers below the cutoff on the estimated teacher effect. Teachers satisfying both conditions were considered false positives, falsely identified as ineffective. This approach combines the false positive/false negative methods used by Schochet and Chiang (2010). High-performing VAMs have relatively low proportions of false positives. Because of the normal distributions of the simulated data, we can assume that findings about falsely identifying a teacher as highly effective when he or she is not would be very similar. We also calculated the mean true score for the teachers falsely identified as ineffective, and the number of teachers in North Carolina who would be affected by these findings. Actual data estimates for this comparison were not possible.
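As a sketch of this criterion, assuming standardized true and estimated teacher effects (the function and variable names are illustrative, not from the report):

```python
def false_positive_stats(true_effects, est_effects, cutoff=-1.64):
    """Identify false positives: teachers whose true standardized effect
    is above the cutoff (not truly ineffective) but whose estimated
    effect falls below it. Returns the false positive rate and the mean
    true score of the falsely identified teachers (None if none)."""
    n = len(true_effects)
    fp = [i for i in range(n)
          if true_effects[i] > cutoff and est_effects[i] < cutoff]
    rate = len(fp) / n
    mean_true = sum(true_effects[i] for i in fp) / len(fp) if fp else None
    return rate, mean_true
```

A higher (less negative) mean true score among the false positives indicates that a model is sweeping relatively better teachers into the "ineffective" group, the wider-net pattern discussed in the results.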

Fourth, we examined the year-to-year reliability of the VAMs in the actual NC data. For this criterion, teacher estimates were obtained for each of two years individually. For the SFE, SFEIV, and TFEIV models, this required a substantial simplification of the models due to limitations in the actual NC data; for the DOLS, no reliability analysis was possible, given these same limitations. The teacher effect distribution for each of the eight remaining VAMs was divided into quintiles in each of the two years, and these quintile classifications were then cross-tabulated. If reliability were high, teachers would tend to fall along the diagonal, where the two years’ quintiles are equal, with some off-diagonal placements reflecting an allowable amount of error and the above-diagonal proportions slightly greater, allowing for genuine improvement. When teachers did not fall along the diagonal, we could not tell how much was due to estimate unreliability and how much to actual teacher improvement or change. We focused on three characteristics of the cross-tabulations: the proportion of teachers on the diagonal (those who were in the same quintile in each year) and the proportions of teachers in the two most extreme “switcher” groups (those in the lowest quintile one year and the highest the next, or the highest one year and the lowest the next). This method, or one similar to it, has been used by Sass (2008) and Goldhaber and Hansen (2008).
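The cross-tabulation summaries can be sketched as below. This is a simplified illustration that assigns quintiles by within-year rank; ties and the report's exact quintile rule may differ:

```python
def quintile_crosstab_stats(year1, year2):
    """Cross-tabulate two years' quintile classifications and return
    (proportion of teachers on the diagonal, proportion of extreme
    switchers: lowest quintile to highest, or highest to lowest)."""
    def quintiles(scores):
        order = sorted(range(len(scores)), key=lambda i: scores[i])
        q = [0] * len(scores)
        for rank, i in enumerate(order):
            q[i] = min(4, rank * 5 // len(scores))   # quintile 0 (lowest) .. 4
        return q
    q1 = quintiles(year1)
    q2 = quintiles(year2)
    n = len(year1)
    diagonal = sum(a == b for a, b in zip(q1, q2)) / n
    extreme = sum((a, b) in ((0, 4), (4, 0)) for a, b in zip(q1, q2)) / n
    return diagonal, extreme
```

A reliable model should show a high diagonal proportion and near-zero extreme switching; identical scores in both years give (1.0, 0.0).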

**Results**

We compared nine models’ performance on a set of criteria that together were used to answer the six questions regarding rank ordering and identification of ineffective teachers with and without violations of potential outcomes assumptions, consistency across VAMs, and year-to-year reliability of VAMs. In reporting the results, we focus on the criteria and summarize the results into answers to the questions in the discussion section that follows.

**Spearman Rank Order Correlations**

Assessing performance by rank order correlation with the “true” effect under no classroom-level variance (0% classroom variance), the best-performing VAM was the HLM3+, with three VAMs closely following in order: the URM, SFE, and HLM3 (Table 2). Increasing the classroom proportion of variance, which tests the influence of SUTVA violations and confoundedness, reduced the Spearman rank order correlations of all models with the true effect (Table 2). The violation of SUTVA implied by 4% of variance at the classroom level did not affect the relative ranking of the VAMs on this criterion. The HLM3+ was highest at .955 with 0% classroom variance and remained highest at 4% classroom variance (.864). The HLM2, TFE, and DOLS were nearly equal (.909 and .822 at 0% and 4% classroom variance, respectively), as were the SFEIV and TFEIV (.893 and .808, respectively). The 4% classroom variance simulated in this analysis should be considered reasonable, given the analysis of Schochet and Chiang (2010).

Table 2. Spearman Rank Order with True Effect, Simulated Data

When strong ignorability of assignment was violated (confounded assignment), the Spearman rank order correlations varied substantially (Table 2) under both moderate positive and moderate negative correlation between the student covariate and the classroom, teacher, and school covariates. Two random effects models were the top performers: the HLM3 (.796 and .746 under positive and negative correlation, respectively) and the HLM3+ (.771 and .755, respectively), followed by the HLM2 (.716 and .662), URM (.660 and .670), SFE (.648 and .628), and SFEIV (.562 and .526); the TFE, TFEIV, and DOLS were very low. Across optimal conditions, SUTVA violations, and confounded assignment, the HLM3 and HLM3+ were consistently the highest performing VAMs, while several models, including the four fixed effects VAMs and the DOLS, performed much worse than the others.

Table 3. Spearman Rank Order Matrices of Value-Added Models, Actual Data

On the actual NC data (see Table 3, containing two correlation matrices), the rank order correlations between the VAM estimates varied considerably in both subjects, ranging from .970 to .642 for mathematics and from .948 to .488 for reading. In both subjects, the URM was most highly correlated with the other models, averaging .850 and .774, respectively; the TFEIV was the least highly correlated, averaging .793 and .594. The two most highly correlated VAMs were the HLM3 and HLM3+, at .970 for mathematics and .948 for reading. The random effects models tended to be highly correlated with each other and with the URM and TFE models. The TFE model was highly correlated with the HLM2 (.944 for 5th grade math and .813 for 5th grade reading) and the DOLS (.904 and .861, respectively) but not with the other fixed effects models. The fixed effects models showed no overall tendency to be highly correlated with each other, or to be more highly correlated with each other than with the random effects models. Overall, it appears that the choice of one VAM over another can yield quite different rank orderings of the teacher effect estimates. It is important to note that higher correlations between VAM estimates from the actual data do not imply that those models recover the “true” teacher effects more consistently, because the models may simply be reproducing a similar bias.

**Agreement on Classification in Fifth Percentiles**

The agreement on classification indicates the extent to which the VAMs agree with the true effect, or with each other, in identifying the lowest performing 5% of teachers in the state. This criterion is quite important when teacher effect estimates are used for evaluations with consequences, since there are significant costs to falsely placing an average teacher in the lowest performing group or falsely placing a low-performing teacher in the “acceptable” range of performance. Nearly all of the VAMs performed very well in the absence of assumption violations, with between 96.3% and 97.7% agreement on the bottom 5% and top 95%, a difference of less than 1.5 percentage points (Table 4). In the test of the SUTVA violation with 4% of the variance at the classroom level, the VAMs exhibited lower agreement rates, about 95%–96%, and the differences between the models were much smaller, with a range of only 0.82 percentage points. The HLM3+ was the highest, with 97.71% agreement at zero classroom variance, and it remained the highest at 4% classroom variance (96.01%). Nevertheless, all of the agreement rates were very similar.

In the test of confounded assignment, the level of agreement was reasonably high, with all models at or above 90% agreement in both the positive assignment and the negative (compensatory) assignment scenarios. The HLM3 and HLM3+ showed the highest agreement (95.04% and 94.78% for positive assignment, respectively), followed by the HLM2, URM, SFE, and SFEIV (94.25%, 93.70%, 93.56%, and 92.98%, respectively). Three consistently lower performers were the TFE, TFEIV, and DOLS (90.93%, 90.74%, and 90.48% for positive assignment, respectively), with the negative assignment following the same pattern. The gap between the higher and lower ranking models was more sizeable than in the variance decomposition findings, and the direction of the correlation did not alter the pattern.

With the actual NC data, the agreement between the VAMs was quite high, with all models averaging 94%–95% agreement with each other for mathematics and reading (Table 5). The random effects VAMs tended to be in greater agreement with each other, and the fixed effects VAMs (including the DOLS) in greater agreement with each other, with lower agreement across type. This tendency was not as pronounced in math, where the percentage of agreement in each partition of the matrix was very similar, but was obvious in reading.

Table 5. Percent Agreement Across Models, Actual Data

**False Positives: Average Teacher Identified as Ineffective**

The third type of analysis assessed the extent of false positives; that is, how many teachers in the top 95% of the distribution would be falsely identified as bottom 5% performers. This criterion is relevant because several have proposed using VAM estimates of teacher effectiveness to identify “ineffective” teachers as a step toward dismissal. False positives were examined on the simulated data only (Table 6, following page). In the variance decomposition simulation, at low levels of classroom variance (an absence of assumption violations), the HLM3+ (1.2% false positives), URM (1.3%), HLM3 (1.4%), and SFE (1.4%) performed best; the other models were at 1.7% or higher. To make the differences in model performance more concrete, assuming 9,000 5th grade teachers (the approximate number statewide in North Carolina), between 108 and 170 teachers would be falsely identified as ineffective by the best and worst performing VAMs; in other words, the worst performing VAM would falsely identify 62 more 5th grade teachers as ineffective. The mean true z-score for these teachers was -1.43 for the HLM3+, the best performing VAM, and -1.30 for the worst performing VAMs, the SFEIV and TFEIV, indicating that the teachers falsely identified as ineffective by the worst performing VAMs were, on average, better performers; false identification of ineffectiveness casts a wider net in the worst performing models.

When the level of classroom variance was set at 4%, however, the performance of all models declined somewhat, with all of the models demonstrating higher proportions of false positives (2.0%–2.4% at 4% classroom variance). While these rates seem modest, the number of teachers affected in each grade level and subject can be large, with up to 210 teachers misclassified under the 4% variance scenario. The differences among the models, however, were modest, with a difference of at most 28 teachers.

With the heterogeneous fixed effect simulation, there was substantial variation between the models in the proportion of teachers misidentified as ineffective in the positive assignment scenario (Table 6, following page). The HLM2, HLM3, and HLM3+ were the best performers (less than 3% misidentified), followed by the URM, SFE, and SFEIV (misidentifying 3.1%, 3.2%, and 3.5%, respectively), with the TFE, TFEIV, and DOLS misidentifying more than 4%. The direction of the correlation altered this pattern only slightly, with only the HLM3 and HLM3+ misidentifying less than 3% of teachers, followed closely by the HLM2, URM, and SFE. The number of teachers affected nearly doubled from the best to the worst performing model on this criterion, ranging from 221 (HLM3) to 436 (DOLS). Finally, for the worst performing VAMs, the TFE, TFEIV, and DOLS, the point estimates for the mean true effect of the misidentified teachers were actually above zero, meaning that the misclassified teachers included above-average teachers.