Consortium for Educational Research and Evaluation–North Carolina
Comparing Value-Added Models for
Estimating Individual Teacher Effects
on a Statewide Basis
Simulations and Empirical Analyses
Roderick A. Rose, Department of Public Policy & School of Social
Work, University of North Carolina at Chapel Hill
Gary T. Henry, Department of Public Policy & Carolina Institute
for Public Policy, University of North Carolina at Chapel Hill
Douglas L. Lauen, Department of Public Policy & Carolina Institute
for Public Policy, University of North Carolina at Chapel Hill

August 2012

Table of Contents

Executive Summary
The Potential Outcomes Model
Stable Unit Treatment Value Assumption (SUTVA)
Violations of Assumptions
Typical Value-Added Models
Nested Random Effects Models
Fixed Effects Models
Hybrid Fixed and Random Effects Models
Summary of Models
VAM Comparison Studies
Data Generation Process
Variance Decomposition Simulation
Heterogeneous Fixed Effects Simulation
Calibration of Inputs
Number of Simulations
Actual NC Data Analysis
Spearman Rank Order Correlations
Agreement on Classification in Fifth Percentiles
False Positives: Average Teacher Identified as Ineffective
Limitations and Implications
COMPARING VALUE-ADDED MODELS FOR ESTIMATING INDIVIDUAL
TEACHER EFFECTS ON A STATEWIDE BASIS:
SIMULATIONS AND EMPIRICAL ANALYSES

Executive Summary

Many states are currently adopting value-added models for use in formal evaluations of teachers.
We evaluated nine commonly used teacher value-added models on four criteria using both actual and simulated data. For the simulated data, we tested model performance under two violations of the potential outcomes model: settings in which the stable unit treatment value assumption was violated, and settings in which the ignorability of assignment to treatment assumption was violated. The performance of all models suffered when the assumptions were violated, suggesting that none of the models performed sufficiently well to be considered for high stakes purposes. Patterns of relative performance emerged, however, which we argue provide sufficient support for using four value-added models for low stakes purposes: the three-level hierarchical linear model with one year of pretest scores, the three-level hierarchical linear model with two years of pretest scores, the Educational Value-Added Assessment System (EVAAS) univariate response model, and the student fixed effects model.
Introduction

A wide body of research into the effects of schooling on student learning suggests that teachers are the most important inputs and, consequently, that improving the effectiveness of teachers is a legitimate and important policy target to increase student achievement (Rockoff, 2004; Nye, Konstantopoulos, & Hedges, 2004; Rowan, Correnti, & Miller, 2002). In order for education policymakers and administrators to use teacher effectiveness to achieve student performance goals, they must have accurate information about the effectiveness of individual teachers. A relatively recent but often recommended approach for obtaining teacher effectiveness estimates for use in large-scale teacher evaluation systems relies on value-added models (VAMs) to estimate the contribution of individual teachers to student learning; that is, to estimate the amount of gains to student achievement that each teacher contributes rather than focusing on levels of student achievement (Tekwe, Carter, Ma, Algina, Lucas, et al., 2004). These VAMs rely on relatively complex statistical methods to estimate teachers' incremental contributions to student achievement. Value-added models could be viewed as primarily descriptive measurement models or as putatively causal models that attribute a portion of student achievement growth to teachers (Rubin, Stuart, & Zanutto, 2004); we take the latter view in this study.
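The distinction between levels and gains can be made concrete with a minimal covariate-adjusted regression, the simplest form a VAM can take. Everything below is a hypothetical sketch: the data are synthetic, the effect sizes and sample sizes are invented, and the model is a generic lagged-score regression rather than any of the specific models compared in this report.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic roster: 300 students split across 3 teachers. All numbers
# (effects, noise, sample size) are invented for illustration.
n, J = 300, 3
teacher = rng.integers(0, J, size=n)
true_effect = np.array([-0.4, 0.0, 0.4])     # each teacher's contribution
pretest = rng.normal(0, 1, size=n)           # prior-year achievement
posttest = 0.7 * pretest + true_effect[teacher] + rng.normal(0, 0.5, size=n)

# A minimal lagged-score VAM: regress the current score on the pretest
# plus teacher indicators (no intercept, so each indicator coefficient
# is that teacher's mean outcome net of the pretest).
X = np.column_stack([pretest] + [(teacher == j).astype(float) for j in range(J)])
beta, *_ = np.linalg.lstsq(X, posttest, rcond=None)
vam_estimates = beta[1:] - beta[1:].mean()   # center, like the true effects
```

With the pretest held constant, the centered teacher coefficients recover the simulated teacher contributions rather than the teachers' raw classroom averages, which is the sense in which such a model credits gains rather than levels.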
Proponents maintain that VAM techniques evaluate teachers in a more objective manner than by observational criteria alone (Harris, 2009). By holding teachers to standards using outcomes, policymakers could move away from standards based on inputs in the form of educational and credentialing requirements and principals’ or others’ more subjective observations of teachers’ practices (Gordon, Kane, & Staiger, 2006; Harris, 2009). There are concerns that VAMs may not be fair appraisals of teachers’ effectiveness because they may attribute confounding factors, unrelated to instruction, to the teacher (Hill, 2009). Further, evidence suggests that teacher effectiveness scores may vary considerably from year to year (Sass, 2008; Koedel & Betts, 2011), despite teachers’ contentions that they do not vary their teaching style (Amrein-Beardsley, 2008), suggesting that the year-to-year variability is unrelated to teacher effectiveness. While the controversies about the accuracy and utility of VAMs continue to swirl, many states have agreed to incorporate measures of teacher effectiveness in raising student test scores into their teacher evaluations in order to receive federal Race to the Top (RttT) funds or achieve other policy objectives. The uses of teacher VAM estimates in the evaluation process vary from low stakes consequences, by which we mean an action such as developing a professional development plan;
to middle stakes, by which we mean actions such as identifying teachers for observation, coaching, and review; to high stakes, by which we mean denial of tenure or identifying highly effective teachers for substantial performance bonuses. In spite of the commitment by many states to use a VAM for estimating teachers’ effectiveness, there is no consensus within the research community on the approach or approaches that are most appropriate for use. Given these concerns and the widespread use of these models for teacher evaluation, evidence on the relative merits of VAMs is needed.
Several techniques for estimating VAMs have been compared using simulated or actual data (Guarino, Reckase, & Wooldridge, 2012; Schochet & Chiang, 2010; McCaffrey, Lockwood, Koretz, Louis, & Hamilton, 2004; Tekwe et al., 2004). Tekwe et al. (2004) used actual data, while McCaffrey et al. (2004) used both simulation data and actual data. Simulation studies have used either correlated fixed effects (Guarino, Reckase, & Wooldridge, 2012; McCaffrey et al., 2004) or variance decomposition frameworks for data generation (Schochet & Chiang, 2010). To date, no study has used both correlated fixed effects and variance decomposition simulated data as well as actual data. The present study aims to provide a more comprehensive assessment and uses all three types of data. Moreover, the present study compares nine common VAMs, more than in any other study published to date. Finally, we compare VAMs using the rankings of teachers in the true and estimated distributions of effectiveness using four distinct criteria that are relevant to policymakers, administrators, and teachers.
We compare these nine VAMs using simulated and actual data based on criteria that include their ability to recover true effects, consistency, and false positives in identifying particularly ineffective teachers (or, more broadly, ineffective teachers; the results are nearly identical). To determine which VAMs best handle the confounding influence of non-instructional factors and identify ineffective teachers, we generate simulation data with known teacher effectiveness scores to compare with teacher effectiveness estimates from each VAM. We use actual data from a statewide database of student and teacher administrative records to examine agreement between models and each VAM's relative performance in year-to-year consistency of teacher rankings.
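Two of the comparison criteria named above, rank-order agreement and false positives among teachers flagged as ineffective, can be sketched as follows. The data, noise level, and 5% cutoff below are illustrative assumptions only, not the study's actual simulation design.

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman rank-order correlation: Pearson correlation of ranks."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

rng = np.random.default_rng(1)
true = rng.normal(0, 1, size=200)             # hypothetical true effects
est = true + rng.normal(0, 0.5, size=200)     # noisy "VAM" estimates

rho = spearman_rho(true, est)

# False positives: teachers flagged in the estimated bottom 5% who are
# not in the true bottom 5%, as a share of all teachers.
flag_true = true <= np.quantile(true, 0.05)
flag_est = est <= np.quantile(est, 0.05)
false_positive_rate = np.mean(flag_est & ~flag_true)
```

Even with a high rank correlation, some teachers near the cutoff are misclassified, which is why the report treats classification agreement and false positives as separate criteria from rank-order agreement.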
In this study, we take the view that teacher effect estimates from VAMs are putatively causal, even though the conditions for identifying causal effects may not be present. Therefore, we first discuss the potential outcomes model (Reardon & Raudenbush, 2009; Rubin, Stuart, & Zanutto, 2004; Holland, 1986). Subsequently, we introduce seven common VAMs and then review existing studies comparing VAMs for teacher effect estimation in the context of the potentially unrealistic demands placed on these VAMs by the potential outcomes model. We then discuss the methods in the present study, including the data generation process for both simulations and the characteristics of the actual data, the form of the nine models compared, and the comparison techniques. We follow with the results of these comparisons. In the final section, we discuss the implications of these findings for additional research into VAMs for teacher effect estimation and implementation as a teacher evaluation tool.
The Potential Outcomes Model

Value-added models, in economic terms, measure the output resulting from combining inputs with technology (i.e., a process; Todd & Wolpin, 2003). If estimates from VAMs of student assessment data are to be inferred as, and labeled, teacher effect estimates, then they should be viewed as causal estimates of teachers' contributions to student learning. That is, the value-added estimands are not simply descriptions of students' average improvement in performance under a given teacher, but are effect estimates causally attributed to the teacher. This view coincides with the use of VAMs in education policies such as teacher evaluation. It is widely acknowledged that the process by which the teacher causes student learning does not have to be specified (see, for example, Todd & Wolpin, 2003; Hanushek, 1986). It is not as widely understood that the process by which students learn does not have to be fully specified in order to identify a causal teacher effect. The causality of the estimand from a VAM can instead be derived from assumptions that are independent of model specification (Rubin, Stuart, & Zanutto, 2004).
The assumptions of the potential outcomes model, if met, support the causal inference of teacher effect estimates from VAMs (Reardon & Raudenbush, 2009; Rubin, Stuart, & Zanutto, 2004;
Holland, 1986). The central feature of the potential outcomes model is the counterfactual—the definition of the causal estimand of a teacher's effect on a student depends on what the student experiences in the absence of the specified cause—that is, under any other teacher besides the one to which the student was assigned. This enables us to ignore inputs to cumulative student knowledge that are equalized over different treatment conditions and are not confounded with treatment assignment. A formal model for causality begins as follows. First, assume that the outcome for student i (with i = 1, …, N) under teacher j is Y_ij. Second, assume that each teacher j is a separate treatment condition from J possible treatments, and each student has one potential or latent outcome under each possible teacher (of which at most one can actually be realized). This is a many-valued treatment (Morgan & Winship, 2007) with the potential outcomes represented by a matrix of N students by J treatments (Reardon & Raudenbush, 2009).
Because only one such potential outcome can be realized for each student (the fundamental problem of causal inference; Holland, 1986), the treatment effect is defined as a function of the distributions of students assigned to teacher j and the students under any other teacher. Generally, this is implemented using linear models based on the average treatment effect for teacher j (Δ_j), comprising the students observed under assignment to teacher j compared to those under the other teachers, which we label .j (not j), e.g., a simple mean difference d_j = Δ_j − Δ_.j = E[Y_ij] − E[Y_i.j]. An obvious candidate for .j is the teacher at the average level of effectiveness.
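Under random assignment, the mean-difference estimand just described can be computed directly from simulated potential outcomes. The sketch below assumes a hypothetical N x J potential-outcomes matrix with additive teacher effects; all numerical values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical N x J matrix of potential outcomes: each student has one
# latent outcome under every teacher, only one of which is realized.
N, J = 1000, 5
teacher_effect = np.array([-0.5, -0.2, 0.0, 0.2, 0.5])   # invented values
ability = rng.normal(0, 1, size=N)
Y = ability[:, None] + teacher_effect[None, :]           # Y[i, j]

# Random assignment realizes one cell per row of Y.
assign = rng.integers(0, J, size=N)
realized = Y[np.arange(N), assign]

# Delta_j: mean realized outcome under teacher j.
delta = np.array([realized[assign == j].mean() for j in range(J)])
# d_j = Delta_j - Delta_{.j}: teacher j versus the mean of the others.
d = np.array([delta[j] - np.delete(delta, j).mean() for j in range(J)])
```

Because each d_j compares a teacher to the average of the other teachers, the J contrasts sum to zero by construction, consistent with choosing the average-effectiveness teacher as the reference.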
Reardon and Raudenbush (2009) identified six defining, estimating, and identifying assumptions of causal school effects that they suggested are also appropriate for teacher effects, two of which we make explicit here. Defining assumptions include (1) each student has a potential outcome under each teacher in the population (manipulability); and (2) the potential outcome under each teacher is independent of the assignment of other participants (the stable unit treatment value assumption, or SUTVA). Estimating assumptions include (3) students' test scores are on an interval scaled metric; and (4) causal effects are homogeneous. Identifying assumptions, when satisfied, make it possible to infer the treatment effect as causal despite the fundamental problem of causal inference that only one of J potential outcomes can be realized. These assumptions include (5) strongly ignorable or unconfounded assignment to teachers; and (6) each teacher is assigned a “common support” of students, which may be relaxed to assume that each teacher is assigned a representatively heterogeneous group of students, to estimate an effect that applies to all types of students. This last assumption may alternatively be met by extrapolation of any teacher's unrepresentative group of students to students that the teacher was not assigned if the functional form of the model (e.g., linear or minimal deviations from linearity) supports such extrapolation.
Building on the formal model of causality discussed above, this section presents a formal discussion of two of the six assumptions of the potential outcomes model that are relevant to the comparison between VAMs in the present study, drawing heavily on Reardon and Raudenbush (2009).
Stable Unit Treatment Value Assumption (SUTVA)

SUTVA implies that the treatment effect of any teacher on any student does not vary according to the composition of that teacher's classroom (Rubin et al., 2004):

Y_ij(A) = Y_ij(A′) for any two assignment matrices A and A′ that both assign student i to teacher j,

where A is an N x J matrix of elements a_ij recording the assignment of students to teachers, with a_ij = 1 if i is assigned to j and a_ij = 0 otherwise. The statement above makes it explicit that Y_ij is invariant to all permutations of a, the vector indicating each student's assignment to treatment. Ruled out by this assumption are effects based on composition of the classroom, including those attributable to peer interactions and those between peers and teachers. Therefore, a student assigned to a math classroom with higher achieving peers should have the same potential outcome under that teacher's treatment as they would if the classroom contained lower achieving peers. The effects that classroom composition may have on learning make this assumption challenging to support. For example, if teachers alter instruction based on the average achievement level of the class, these effects imply that the treatment effect for a single student is heterogeneous according to the assignment of peers (Rubin et al., 2004).
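A data-generating process with peer effects shows how SUTVA can fail. In the hypothetical sketch below, the outcome function and the peer_slope parameter are invented for illustration; SUTVA requires the same student to have the same potential outcome under the same teacher regardless of classmates, which the peer term breaks.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical outcome function with a peer-effects term; the 0.3
# peer_slope is an invented parameter, not an empirical estimate.
def outcome(own_ability, teacher_effect, peer_abilities, peer_slope=0.3):
    return own_ability + teacher_effect + peer_slope * np.mean(peer_abilities)

focal_ability = 0.0          # the same student in both scenarios
teacher_effect = 0.2         # the same teacher in both scenarios

high_peers = rng.normal(1.0, 0.5, size=24)    # high-achieving classmates
low_peers = rng.normal(-1.0, 0.5, size=24)    # low-achieving classmates

y_high = outcome(focal_ability, teacher_effect, high_peers)
y_low = outcome(focal_ability, teacher_effect, low_peers)
# y_high != y_low: the "potential outcome under teacher j" is not a
# single number, so SUTVA fails under this data-generating process.
```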
Ignorability

The second assumption, ignorability, implies that each student's assignment to treatment—that is, their assignment to a specific teacher (A)—is independent of their potential outcome under that teacher (Morgan & Winship, 2007):

Y_ij ⊥ A for all i and j.
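A small simulation illustrates what non-ignorable assignment does to a naive comparison. The sorting rule below (stronger students are more likely to be assigned to one teacher) is a hypothetical assumption; both teachers are constructed to be equally effective, so any estimated gap is pure confounding.

```python
import numpy as np

rng = np.random.default_rng(4)

N = 2000
ability = rng.normal(0, 1, size=N)

# Confounded assignment: stronger students are more likely to get
# teacher 1 (a tracking-style sorting rule, invented for illustration).
p_teacher1 = 1.0 / (1.0 + np.exp(-2.0 * ability))
assign = (rng.random(N) < p_teacher1).astype(int)

# Both "teachers" add nothing: the true teacher effect difference is 0.
score = ability + rng.normal(0, 0.5, size=N)

naive_gap = score[assign == 1].mean() - score[assign == 0].mean()

# Under random (ignorable) assignment, the same comparison is unbiased.
random_assign = (rng.random(N) < 0.5).astype(int)
random_gap = (score[random_assign == 1].mean()
              - score[random_assign == 0].mean())
```

The naive gap under sorted assignment is large even though the true teacher effect difference is zero, while the gap under random assignment hovers near zero, which is exactly the contrast the ignorability assumption is meant to rule out.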