Consortium for Educational Research and Evaluation–North Carolina
Comparing Value-Added Models for Estimating Individual Teacher Effects on a Statewide ...
Reliability

Using actual North Carolina data, quintiles of teacher performance were estimated for two sequential years, and the results were cross-tabulated at the teacher level. This cross-tabulation shows how teachers at each level of effectiveness in the first year were dispersed across levels in the second year. The cross-tabulations of the quintiles in 2007–08 and 2008–09 were summarized in Table 7 (following page), which reports, for both math and reading, the percentage of teachers on the diagonal of each cross-tabulation (those placed in the same quintile in both years) and the percentage of teachers who switched from the top quintile to the bottom, or from the bottom quintile to the top, over the same interval. For 5th grade teachers, the DOLS outperformed all others on diagonal consistency, with 44.5% for math and 39.2% for reading, followed by the URM with 33.2% for math and 28.3% for reading.
The lowest year-to-year quintile consistency came from the HLM3 and HLM3+, which were virtually tied at about 30% for math and 25% for reading.
There was also some variation among the models in the percentage of teachers switching from one extreme quintile to the other, and there was a clear difference between math and reading in the performance of these models, with about twice as many 5th grade teachers switching in reading as in math. The DOLS was the best performer, with 0.2% switching in math and 0.8% in reading. The other models were more similar to one another, but the HLM3 and HLM3+ had the highest percentages of switching in both reading and math.
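The two summary quantities described above, the percentage of teachers on the diagonal and the percentage switching between the extreme quintiles, can be sketched on synthetic data. This is a minimal illustration only: the year-to-year correlation and sample size below are arbitrary assumptions, not values from the report.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical teacher effect estimates for the same teachers in two years,
# correlated to mimic imperfect year-to-year stability (illustrative only).
n = 1000
year1 = rng.normal(size=n)
year2 = 0.5 * year1 + rng.normal(scale=np.sqrt(1 - 0.25), size=n)

# Assign quintiles (1 = lowest, 5 = highest) within each year.
q1 = pd.qcut(year1, 5, labels=False) + 1
q2 = pd.qcut(year2, 5, labels=False) + 1

# Cross-tabulate quintile membership across the two years, as percentages.
xtab = pd.crosstab(q1, q2, normalize="all") * 100

# Percent of teachers on the diagonal (same quintile in both years).
same_quintile = np.trace(xtab.values)

# Percent switching between the extreme quintiles (1 <-> 5).
extreme_switch = xtab.values[0, 4] + xtab.values[4, 0]

print(f"same quintile both years: {same_quintile:.1f}%")
print(f"extreme quintile switchers: {extreme_switch:.1f}%")
```

With a higher assumed correlation between years, the diagonal percentage rises and the extreme-switching percentage falls, which is the pattern separating the DOLS from the HLM3 and HLM3+ in Table 7.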
Discussion

Using two simulations of student test score data as well as actual data from North Carolina public schools, we compared nine value-added models on the basis of four criteria related to teachers’ effectiveness rankings: Spearman rank-order correlation with true effects; percentage of agreement on membership in the bottom 5th percentile; false positives, i.e., teachers who are not ineffective being misidentified as ineffective; and consistency of quintile rankings over two sequential years. Using these comparisons, we answer six questions pertinent to state policymakers and administrators who may be in positions to select a value-added model for estimating individual teachers’ effectiveness from student test score data, and to the teachers and principals who may be directly affected by those estimates.
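Three of the four criteria can be sketched on synthetic data as follows. This is a minimal illustration, not the study's simulation design: the "true" effects, the noise level, and the sample size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical "true" simulated teacher effects and one model's estimates.
n = 2000
true_eff = rng.normal(size=n)
estimated = 0.8 * true_eff + rng.normal(scale=0.6, size=n)

def ranks(x):
    # Rank values 0..n-1 (no ties expected with continuous data).
    order = np.argsort(x)
    r = np.empty_like(order)
    r[order] = np.arange(len(x))
    return r

# Criterion 1: Spearman rank-order correlation with the true effects
# (Pearson correlation of the ranks).
rho = np.corrcoef(ranks(true_eff), ranks(estimated))[0, 1]

# Criterion 2: agreement on membership in the bottom 5%.
flag_true = true_eff <= np.quantile(true_eff, 0.05)
flag_est = estimated <= np.quantile(estimated, 0.05)
agreement = np.mean(flag_true == flag_est)

# Criterion 3: false positives -- teachers flagged by the model whose
# true effect is not actually in the bottom 5%.
false_pos = int(np.sum(flag_est & ~flag_true))

print(f"Spearman rho: {rho:.3f}")
print(f"bottom-5% agreement: {agreement:.3f}")
print(f"false positives: {false_pos} of {int(flag_est.sum())} flagged")
```

The fourth criterion, year-to-year quintile consistency, requires two linked years of estimates per teacher and is the cross-tabulation summarized in the Reliability section.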
1. Are the rankings of teachers from VAMs highly correlated with the ranking of “true” teacher effects under ideal (i.e., absence of assumption violations) conditions?
While all nine VAMs performed reasonably well on this test, four models (the HLM3+, URM, SFE, and HLM3) performed better than the other five.
2. How accurately do the teacher effect estimates from VAMs categorize a teacher as ineffective, and what proportion of teachers would be misclassified?
While all models performed reasonably well on this test, four models (the HLM3+, URM, SFE, and HLM3) performed better than the other five.
3. How accurately do VAMs rank and categorize teachers when SUTVA is violated and classroom variance accounts for a proportion of the teacher effect?
For the accuracy of ranking when SUTVA is violated, the performance of all models was substantially reduced in comparison to the absence of assumption violations. In terms of relative performance, four models (the HLM3+, URM, SFE, and HLM3) performed better than the other five. For the accuracy of categorizing teachers in the lowest 5% when SUTVA is violated, all VAMs performed equivalently.
4. How accurately do VAMs rank and categorize teachers when ignorability is violated and student effects are correlated with classroom, teacher, and school effects?
For the accuracy of ranking when ignorability is violated, the performance of the VAMs was somewhat reduced in comparison to the absence of assumption violations. The relative performance of the VAMs varied substantially; two models (the HLM3+ and HLM3) performed better than the other seven in both the negative assignment and positive assignment scenarios. For the accuracy of categorizing teachers in the lowest 5% when ignorability is violated, all VAMs correctly classified more than 90% of the teachers, with six models (the HLM3+, HLM3, HLM2, URM, SFE, and SFEIV) outperforming the other four.
5. How similar are rankings of VAM estimates to each other?
For mathematics, the rankings produced by three VAMs (the URM, HLM2, and TFE) were more similar to the average of all VAMs than were those of the other six models. For reading, the rankings produced by five models (the URM, HLM2, HLM3+, HLM3, and TFE) were more similar to the average than were those of the other four.
6. How consistent are VAM estimates across years?
The most consistent year-to-year VAM estimates in terms of placing the highest percentage of teachers in the same performance quintile are the DOLS and URM. In terms of consistency in producing the fewest highest to lowest or lowest to highest switchers, the DOLS is the best performing VAM, followed by the TFE, URM, and HLM2.
Clearly, the overall ranking of model performance depends on how the criteria are weighted. If performance in the presence of violated ignorability is treated as the most heavily weighted criterion, three VAMs performed sufficiently poorly to appear to be risky choices for estimating individual teacher effectiveness: teacher fixed effects, teacher fixed effects with IV, and dynamic ordinary least squares. None of these three models performed well in either test of confounded assignment of students to teachers, and much research strongly suggests that confounded assignment is frequently the case in practice. In the simulations violating the SUTVA assumption, these models underperformed relative to the others in the ranking but not in the identification of the lowest 5% of performers; nor did they underperform in the examinations of year-to-year consistency. This conclusion may need to be tempered in the case of the DOLS because of the relatively high performance of that VAM in the simulations by Guarino, Reckase, and Wooldridge (2012), although their findings with respect to the teacher fixed effects VAM are consistent with ours.
More research should be done examining the performance of the DOLS before a strong affirmative recommendation could be offered. Bearing in mind that the findings of the present study and of Guarino, Reckase, and Wooldridge (2012) regarding the DOLS overlap only in examining rank correlations, we speculate that the DOLS may have been a higher performer in their study for several reasons. Differences in the data generation processes, combined with the authors’ choice not to examine a model with the raw score as the outcome and a shrinkage estimator for the teacher effect, may account for this apparent disagreement. In their study, teacher estimates shrunken by empirical Bayes were applied to the gain score, but the authors argued that, with invariant class sizes in their design, a shrinkage estimator on the raw score would produce rankings equivalent to those of the DOLS. As a consequence, the DOLS estimates in the Guarino, Reckase, and Wooldridge study are equivalent to a random effects variant that they did not test. This is consistent with the present study, as two of the simple nested random effects models (the HLM3 and HLM3+) were regularly among the highest performing models.
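The class-size argument above can be sketched numerically. The following is a minimal illustration of empirical Bayes shrinkage of classroom means; the variance components and class-size range are arbitrary assumptions, not values from either study.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate classroom mean scores for teachers with varying class sizes.
n_teachers = 300
sigma_teacher = 0.25   # assumed SD of true teacher effects
sigma_student = 1.0    # assumed within-class student SD
class_sizes = rng.integers(15, 35, size=n_teachers)
true_eff = rng.normal(scale=sigma_teacher, size=n_teachers)
class_means = true_eff + rng.normal(scale=sigma_student / np.sqrt(class_sizes))

# Empirical Bayes shrinkage: each classroom mean is pulled toward the
# grand mean (zero here) in proportion to its reliability, so noisier
# means from smaller classes are shrunk more.
reliability = sigma_teacher**2 / (sigma_teacher**2 + sigma_student**2 / class_sizes)
eb_estimates = reliability * class_means
```

When every class has the same size, `reliability` is a single constant, so the shrunken estimates are a monotone rescaling of the raw classroom means and yield identical teacher rankings. That is the equivalence Guarino, Reckase, and Wooldridge invoked for their invariant-class-size design.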
With the findings indicating that the TFE, TFEIV, and DOLS are risky, are there any that policymakers and administrators might wish to consider adequate? This question can only be answered with a definitive weighting scheme for the criteria, which should include an assessment of the costs and consequences of the particular purposes for which the estimates will be used. The list of acceptable models could be quite different for estimates of teacher effectiveness that are used to identify teachers who may need additional professional development (low stakes) and those used to identify teachers for high stakes sanctions such as denial of tenure, dismissal, or substantial bonuses, with identification for additional observations with feedback and coaching and other positive benefits falling somewhere between. We believe the evidence suggests that four VAMs performed sufficiently well across the board to deserve consideration for low stakes purposes: the three-level hierarchical linear model with one year of pretest scores, the three-level hierarchical linear model with two years of pretest scores, the EVAAS univariate response model, and the student fixed effects model. The performance of each of these models in recovering the true effects was quite good and quite similar. The performance of all four was degraded by a violation of SUTVA—more so for the ranking and less so for agreement on classification in the bottom 5% and the false identification of ineffective teachers—but not as much by confounding. Also quite relevant is the identification of the lowest fifth percentile, which could be used in low to medium stakes situations. In our opinion, these are relevant criteria for assessing the adequacy of a VAM for low stakes purposes.
We believe that the false positive analysis is particularly important when considering the adequacy of models for high stakes use. If SUTVA is substantially violated, about 350 5th grade teachers in a state the size of North Carolina could be identified for possible removal when their actual performance was not in the lowest 5% of teachers. The mean performance of these teachers is more than 0.6 of a standard deviation below the mean when the four higher performing models are used. When confounding occurs, the higher performing models would falsely identify approximately 220–290 5th grade teachers in a state about the size of North Carolina as in the lowest performing 5% of the distribution. For the four higher performing VAMs, these falsely identified teachers’ average performance is at least 0.9 standard deviations below the mean. For many, this would seem to suggest that the teacher effectiveness estimates should at most be considered a first step in identifying ineffective teachers, rather than the method for identification of teachers for high stakes personnel actions. Using any VAM, even the highest performing ones, to identify teachers for high stakes consequences seems risky in our opinion.
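The scale of the false positive counts above follows from simple arithmetic. The statewide teacher count used below is a hypothetical round number chosen for illustration, not a figure reported by the study, and the false positive rates are likewise assumptions.

```python
# Back-of-the-envelope sketch of how false positive counts scale with
# workforce size. All inputs here are illustrative assumptions.
n_teachers = 7000                   # assumed count of 5th grade teachers statewide
flagged = round(n_teachers * 0.05)  # teachers flagged as the bottom 5%

# The absolute number of misidentified teachers scales directly with the
# share of the flagged group that is actually a false positive.
for fp_rate in (0.25, 0.50, 0.75):  # hypothetical false positive rates
    print(f"fp rate {fp_rate:.0%}: {round(flagged * fp_rate)} teachers misidentified")
```

Even modest false positive rates translate into hundreds of individual teachers in a large state, which is why we regard VAM estimates as, at most, a first screen rather than the basis for high stakes personnel actions.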
Consistency also seems important when deciding whether any of the VAMs should be used for estimating individual teacher effectiveness. As earlier research points out, inconsistency in the estimates from year to year can undermine the credibility of the estimates, especially to those whose performance is being estimated (Amrein-Beardsley, 2008). The best performer in this regard, the DOLS, was a very low performer in the simulations; the EVAAS URM and student fixed effects performed somewhat better than the other two better performers, the HLM3 and HLM3+. However, all of these assessments are relative. It would be difficult to know whether the differences in VAM performance that we observed using the North Carolina data—3.2% switching from highest to lowest or vice versa rather than 1.7%—would affect credibility. The fact that these extreme switchers exist at all may be sufficient evidence to convince some policymakers and some teachers that no sufficiently consistent VAM exists.
Further research should be conducted to better understand the correlates of the extreme quintile switching, in particular investigating the number of novice teachers that switch or the number of extreme switchers that have changed assignments, such as moving from one school to another or one grade to another.
Limitations and Implications

Limitations

This study had several limitations. First, a significant portion of the analysis was based on simulated, stylized data. This was intended to address the absence of “true” measures of teacher effects in actual data. While these simplifications suggest that real conditions would probably degrade the absolute performance of each model, we have not argued that this degradation of performance would be equivalent across all models, and therefore it is possible that more realistic conditions might influence the comparisons we have made. For example, we did not simulate missing values, a problem typical of actual data that, by design, some of the models (e.g., the URM) may handle better than others. Second, there was some necessary subjectivity in the choice and specification of models, including the types of fixed effects models used and the covariates used in some models. Third, we were unable to estimate extensive simulations or actual data models for the EVAAS MRM, a controversial (Amrein-Beardsley, 2008) but widely published (Ballou, Sanders, & Wright, 2004; McCaffrey et al.,
2004) model. While McCaffrey et al. (2004) suggested that this model performed similarly to a fixed effects model in small samples, our experience with a smaller variance decomposition sample than the one used in that study (144 teachers, rather than 833) suggests that the MRM performed poorly. A single simulation with 833 teachers and zero classroom variance, however, indicates that the MRM performed very similarly to the URM. Nevertheless, we cannot recommend the MRM, as its computational demands place it beyond the reach of many state education agencies and scholars. Finally, the limited actual data, spanning only three years in which students were matched to their teachers, made some of the analyses difficult to undertake and required some modifications to the models when multiple estimates were needed for examining year-to-year consistency.
Despite these limitations, this study has multiple strengths. It is the first of its kind to use actual data alongside simulated variance decomposition and correlated fixed effect data specifically designed to test SUTVA and ignorability violations, respectively. It is also the first of its kind to examine multiple random effects and fixed effects models; it examined nine models, nearly twice as many as any other study.
Value-added models for teacher effectiveness are a key component of reform efforts aimed at improving teaching and have been examined by this study and others. However, an interdisciplinary consensus on the methods used to obtain value-added teacher estimates does not exist, and many different models spanning multiple disciplines including economics and sociology have been proposed, as noted above. Further, several different approaches have been used to examine and compare models, and as this study demonstrated with just a handful of approaches, the “best” VAM may be dependent on the comparison approach. Nevertheless, when multiple approaches were used, trends did emerge that pointed to a few models that were on average better performers, and a handful that were almost universally poor. We suggest that one implication of this study is that multiple approaches are needed to get a fuller picture of the relative merits of each model.