Research Report
May 2012

Value-Added Models in the Evaluation of
Teacher Effectiveness: A Comparison of
Models and Outcomes

Hua Wei
Tracey Hembry
Daniel L. Murphy
Yuanyuan McBride

About Pearson
Pearson, the global leader in education and education technology, provides innovative print and
digital education materials for preK through college, student information systems and learning management systems, teacher licensure testing, teacher professional development, career certification programs, and testing and assessment products that set the standard for the industry. Pearson’s other primary businesses include the Financial Times Group and the Penguin Group. For more information about the Assessment & Information group of Pearson, visit http://www.pearsonassessments.com/.
About Pearson’s Research Reports

Pearson’s research report series provides preliminary dissemination of reports and articles prepared by TMRS staff, usually prior to formal publication. Pearson’s publications in .pdf format may be obtained at: http://www.pearsonassessments.com/research.
Keywords: value-added models, teacher effectiveness, hierarchical linear regression model, layered mixed effects model
Value-Added Models in the Evaluation of Teacher Effectiveness:
A Comparison of Models and Outcomes

“I am 110% behind our teachers. But all I’m asking in return—as a President, as a parent, and as a citizen—is some measure of accountability. So even as we applaud teachers for their hard work, we’ve got to make sure we’re seeing results in the classroom.”
President Barack Obama has clearly articulated his focus on teacher accountability not only in his speeches but also through his recent educational reforms. In 2009, he announced the launch of the Race to the Top funds, allowing for the distribution of over $4 billion to states through a competitive application process. To compete for these funds, states were required to develop teacher-effectiveness measures based on student test scores. The corresponding section of Race to the Top applications, entitled “Great Teachers and Leaders,” received 28% of the available points in the review process, which was the largest weight assigned to any component of the applications. Additionally, by September 2010 the federal government had awarded more than $440 million in Teacher Incentive Fund (TIF) dollars to districts and educational groups that proposed performance-based pay incentives for teachers and principals. To qualify for TIF grants, interested parties were required to describe how teacher effectiveness would be quantified and, more specifically, how student data would be incorporated into these measures. These TIF awards would then be distributed as rewards for those teachers receiving the highest effectiveness ratings.
Given the focus and stakes placed on teacher accountability, various types of models have been proposed and used for teacher evaluations in the past decade. A status model, such as Adequate Yearly Progress (AYP) under the No Child Left Behind (NCLB) Act, evaluates teachers based on students’ achievement scores at a single time point and rank-orders them or compares them against an established standard (The Council of Chief State School Officers Accountability Systems and Reporting Group, 2008). The disadvantage of a status model is that it tends to produce results that are highly correlated with students’ background characteristics, which are beyond the control of the teacher being evaluated. Unlike a status model, a growth model takes into account the cumulative nature of learning and tracks students’ achievement growth over time. The application of a growth model in the school/teacher accountability system is a value-added model, which isolates the contribution that each teacher makes to students’ academic progress in a given time period and compares it with the contribution measures of other teachers. With the emphasis on evaluating teachers based on students’ achievement gains, value-added methodology has emerged as the leading solution for assimilating student data into a teacher-effectiveness measure. It has been applied in a number of states and school districts, such as Tennessee (Sanders, Saxton, & Horn, 1997) and Dallas (Webster & Mendro, 1997), for estimating the contributions that specific teachers and schools make to the growth of student learning and for rewarding teachers for their efforts at improving student academic performance.
While public rhetoric commonly refers to value-added models as a single method, they actually encompass a wide range of techniques utilizing different amounts and types of data. Models such as simple gain score comparisons are easy to comprehend, while models such as the layered mixed effects model (Sanders & Horn, 1994) and hierarchical linear mixed models demand considerably more statistical sophistication. Depending on the intended use of the outcome measures, one model may seem more appealing than others. For example, if the goal is to provide measures that are fairly transparent to the public, the user would likely be drawn to a simpler value-added model. An additional challenge in selecting a value-added model comes in the results. While the original choice of a value-added model may be driven by the intended use of the resulting measures, it is imperative that users understand the impact of this model choice on the outcomes. Because of the stakes placed on teacher measures like performance-based pay, policy makers and other users of value-added models should fully understand the implications of the choices they make when developing measures of teacher effectiveness.
While several past studies (e.g., Tekwe, Carter, Ma, Algina, Lucas, Roth, Ariet, Fisher, & Resnick, 2004; McCaffrey, Lockwood, Koretz, Louis, & Hamilton, 2004) have compared and contrasted different value-added models, they tended to focus on the technically sophisticated models, such as the layered mixed effects model. To date few studies have incorporated a full spectrum of models including the simplest models. Since these simple, easily understood models may appear desirable to some stakeholders, it is important to evaluate those models side by side with more complex models. The purpose of this paper is to characterize the differences among five value-added models and examine the practical impact of the differences by comparing the resulting value-added outcomes. The five models cover a broad range of technical procedures, some very simplistic and others very sophisticated. The five value-added models were applied to a common data set to generate teacher-effectiveness measures based on the same student data.
For each model, all teachers were rank ordered on the basis of their value-added measures. These ranks were then compared, by teacher, across the five value-added models. Similar ranks across models would suggest that the choice of model has little practical effect on the resulting teacher evaluations, whereas divergent ranks would indicate that the choice of model matters.
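As one way to quantify this kind of cross-model rank comparison, the agreement between two models' teacher rankings can be summarized with a Spearman rank correlation. The sketch below is not from the study itself; the function name, the dictionary inputs, and the assumption of no tied measures are all illustrative.

```python
def rank_agreement(measures_a, measures_b):
    """Spearman rank correlation between two models' value-added
    measures for the same set of teachers (assumes no tied measures)."""
    def ranks(measures):
        # Rank teachers 1..n from highest to lowest value-added measure.
        ordered = sorted(measures, key=measures.get, reverse=True)
        return {teacher: i + 1 for i, teacher in enumerate(ordered)}

    ra, rb = ranks(measures_a), ranks(measures_b)
    n = len(ra)
    # Spearman's formula: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
    d2 = sum((ra[t] - rb[t]) ** 2 for t in ra)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

A correlation near 1 would indicate that two models rank teachers nearly identically; values near 0 or below would indicate that the model choice substantially reorders teachers.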
The data analyzed in this study were obtained from a large urban school district in Texas.
To enable the application of the five value-added models to comparable data, data matching and merging procedures were employed to produce the final data set for analysis. Student data from grades 3–5 for three separate cohorts were combined. The three cohorts were students who entered grade 3 in 2005, 2006, and 2007, as shown in Table 1.
The combined student data included students’ achievement scores on the Texas Assessment of Knowledge and Skills (TAKS) tests in mathematics and English language arts (ELA) for three years, and demographic variables such as gender, ethnicity, limited English proficiency status, special education status, and free or reduced lunch as a proxy for economically disadvantaged status. Teacher- and school-related information was also included, such as the unique identification number of the school in which the student was enrolled and the name of the teacher who taught the student at each grade.
The targets of analysis in this study were grade 5 teachers, for whom the combined data set provided three years’ worth of data. Analyses were carried out separately in the areas of mathematics and ELA. Teachers with too few students were deleted from the data set so that the derived value-added measures could be based on sufficient student data. In addition, because the five models incorporated different types of variables and required different amounts of data, the numbers of teachers and students retained in the analysis also varied across models.
In this study, five value-added models, from simple to complex, were applied to a common database to evaluate the impact of individual teachers on student learning. The five models differed in many features including technical complexity, conceptualization of student growth, and estimation procedures. A description of each model is provided below.
Model 1: Percent Passing Change Model. For each teacher being evaluated, this model subtracted the percentage of students passing the TAKS mathematics or ELA assessment in the previous year from the passing percentage in the current year. The change in the percentage of students who meet the passing standard represented the value-added measure for the teacher.
This model compared the passing percentages across two years but with two different student cohorts. This model has the advantage of being familiar to educators and policymakers since it has been used in the calculations of Adequate Yearly Progress (AYP). However, it does not model growth for the same cohort of students and is therefore not technically a value-added model. The across-year change in the passing percentage is for two groups of students who may differ significantly in some characteristics that are uncontrolled by the teacher.
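The calculation can be sketched in a few lines. The function name, the scale scores, and the passing standard of 2100 below are hypothetical, not values from the study.

```python
def percent_passing_change(prev_scores, curr_scores, passing_standard):
    """Model 1: change in the percentage of students meeting the passing
    standard, computed across two different student cohorts (last year's
    and this year's students in the teacher's classroom)."""
    def percent_passing(scores):
        return 100.0 * sum(s >= passing_standard for s in scores) / len(scores)

    return percent_passing(curr_scores) - percent_passing(prev_scores)

# Hypothetical classrooms: 60% passed last year, 80% pass this year.
change = percent_passing_change(
    [2000, 2050, 2150, 2200, 2250],   # last year's cohort
    [2090, 2150, 2200, 2250, 2300],   # this year's cohort
    passing_standard=2100)             # hypothetical cut score
# change is 20.0 percentage points
```

Note that the two input lists describe different students, which is precisely the weakness discussed above.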
Model 2: Average Score Change Model. This model calculated the difference between the TAKS mathematics or ELA scale scores from grade 4 to grade 5 for each student in the teacher’s classroom. The differences were averaged across students in the classroom to generate the value-added measure for the teacher. Unlike Model 1, this model followed one group of students, the group of grade-5 students each teacher taught in a certain year, and measured the progress they had made since the prior year. Advantages of this model are that the calculations are simple and the students’ score changes are direct and unbiased measures of student growth.
However, this model can only be implemented in subject areas and grade levels for which scores are reported on a developmental (i.e., vertical) scale. Another disadvantage is that it does not consider other factors that might affect score changes, such as whether a teacher had several students who were gifted and talented, had a learning disability, or were English language learners.
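A minimal sketch of Model 2, assuming scores on a vertical scale; the function name and data are illustrative, not from the study.

```python
from statistics import mean

def average_score_change(grade4_scores, grade5_scores):
    """Model 2: mean grade-4-to-grade-5 gain for the students in one
    teacher's classroom. The two lists are paired by student, so the
    same cohort is tracked across both years."""
    gains = [g5 - g4 for g4, g5 in zip(grade4_scores, grade5_scores)]
    return mean(gains)

# Hypothetical classroom of three students with gains of 40, 30, and 60.
measure = average_score_change([2100, 2150, 2200], [2140, 2180, 2260])
```

Because the gain is a simple difference, the measure is only meaningful when grade-4 and grade-5 scores lie on the same developmental scale, as noted above.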
Model 3: Multiple Regression Model. In this model, students’ grade-5 TAKS mathematics or ELA scores were predicted by the students’ grade-4 TAKS scores and additional student variables such as gender, ethnicity, economically disadvantaged status, special education status, and limited English proficiency status. The difference between the predicted score and the actual score was calculated for each student in the classroom, and the average of these differences was taken as the teacher measure. The Multiple Regression Model can be specified as:

y_ijs = β_0s + Σ_k β_ks S_kijs + ε_ijs

where y_ijs is the grade-5 score of the ith student taught by the jth teacher in the sth subject area, β_0s is the intercept for the sth subject area, β_ks is the regression coefficient associated with the kth predictor for the sth subject area, and S_kijs is the value on the kth predictor variable for the ith student taught by the jth teacher in the sth subject area. With this model, the extent of difference (ε_ijs) between a student’s predicted score and the score he or she actually earned served as the student-level indicator of the teacher’s contribution. An effective teacher would tend to have more students who scored above their predicted scores, whereas an ineffective teacher would have more students who scored below their predicted scores.
This model has several advantages over the previous two models. First, by using the grade-5 score, rather than the difference between the grade-5 and grade-4 scores, as the dependent variable, this model could be applied in grades and subjects for which the scores were not reported on the same scale. Second, this model estimated teacher effectiveness on a student’s current score while controlling for the student’s previous score and demographic characteristics, and it provided a more accurate teacher measure than the previous two models which did not take into account effects of the student’s prior performance and demographics. A disadvantage of this model was that it did not account for the grouping effect, which could potentially be explored in multilevel modeling analyses.
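As a simplified illustration of Model 3, the sketch below uses a single predictor (the grade-4 score) rather than the full set of demographic variables in the study; the function name and data are hypothetical.

```python
from statistics import mean

def regression_residual_measure(prior, current, teacher_of):
    """Simplified one-predictor version of Model 3: regress grade-5
    scores on grade-4 scores across all students, then average each
    teacher's residuals (actual minus predicted) as the teacher measure."""
    # Ordinary least squares for y = a + b * x.
    xbar, ybar = mean(prior), mean(current)
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(prior, current))
         / sum((x - xbar) ** 2 for x in prior))
    a = ybar - b * xbar

    # Group residuals by teacher and average within classroom.
    residuals = {}
    for x, y, t in zip(prior, current, teacher_of):
        residuals.setdefault(t, []).append(y - (a + b * x))
    return {t: mean(r) for t, r in residuals.items()}

# Hypothetical district: four students split between teachers A and B.
measures = regression_residual_measure(
    [2000, 2100, 2200, 2300],
    [2050, 2150, 2280, 2330],
    ["A", "A", "B", "B"])
```

A positive classroom mean residual indicates students who, on average, outperformed their predicted scores under that teacher.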
Model 4: Hierarchical Linear Regression Model. This model built on the Multiple Regression Model by accounting for the fact that students were grouped within schools, and such groupings would affect the student achievement. At the student level, a student’s TAKS grade-5 score in mathematics or ELA was regressed on the student’s grade-4 scores in both mathematics and ELA, and other student demographic variables such as gender, ethnicity, economically disadvantaged status, special education status, and limited English proficiency status. At the school level, the grouping effect was modeled as a random intercept. Although it is possible to add school-level explanatory variables, the model used in this study did not. The student-level
model is specified as:

y_ijms = β_0ms + Σ_k β_ks S_kijms + ε_ijms

and the school-level model as:

β_0ms = γ_00s + ξ_ms

where y_ijms is the grade-5 score of the ith student taught by the jth teacher in school m in the sth subject area, β_0ms is the intercept for school m, γ_00s is the grand mean of the random intercepts (i.e., school means) for the sth subject area, ξ_ms is the random intercept associated with school m in the sth subject area, ε_ijms is assumed to be distributed as N(0, σ²_εs), and ξ_ms is assumed to be distributed as N(0, σ²_ξs).
We chose to use schools, rather than teachers, as the units of analysis at the second level in the Hierarchical Linear Regression Model. This decision was made for ease of implementation, although it was possible to model teachers at the second level. Teacher measures were estimated by aggregating the level-one residuals and producing the classroom means. It should be noted that, in this study, teachers’ value-added estimates were not adjusted for their precision (i.e., different numbers of students taught by the teachers), although such an adjustment could be done within this type of model.
This model has the advantage of accounting for differences in student performance resulting from how students are grouped within schools. Furthermore, it does not require the scores in grade 5 to be on the same scale as scores in grade 4. A disadvantage of this model is that the hierarchical modeling technique is very complex and hard to explain to laypersons.
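The estimation machinery of a true mixed model is beyond a short sketch, but the two-stage approximation below conveys the idea: fit a pooled student-level regression, absorb a shrunken school-level intercept, and average the remaining level-one residuals by classroom. The fixed shrinkage factor is a crude stand-in for the likelihood-based shrinkage that dedicated mixed-model software would estimate, and all names and data are hypothetical.

```python
from statistics import mean

def hlm_style_teacher_measures(prior, current, school_of, teacher_of,
                               shrinkage=0.8):
    """Rough two-stage stand-in for Model 4.
    1. Fit a pooled student-level regression of grade-5 on grade-4 scores.
    2. Approximate each school's random intercept by shrinking its mean
       residual toward zero (a fixed factor here; real analyses estimate
       this via maximum likelihood).
    3. Average the remaining level-one residuals by classroom to get the
       teacher measures, as described in the text."""
    xbar, ybar = mean(prior), mean(current)
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(prior, current))
         / sum((x - xbar) ** 2 for x in prior))
    a = ybar - b * xbar
    raw = [y - (a + b * x) for x, y in zip(prior, current)]

    # Shrunken school intercepts (approximating the random effects).
    by_school = {}
    for r, s in zip(raw, school_of):
        by_school.setdefault(s, []).append(r)
    intercept = {s: shrinkage * mean(rs) for s, rs in by_school.items()}

    # Level-one residuals, aggregated to classroom means.
    by_teacher = {}
    for r, s, t in zip(raw, school_of, teacher_of):
        by_teacher.setdefault(t, []).append(r - intercept[s])
    return {t: mean(rs) for t, rs in by_teacher.items()}
```

Because part of each residual is attributed to the school intercept, the resulting teacher measures are smaller in magnitude than the Model 3 residual means computed from the same data, which is the qualitative effect of modeling the school grouping.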