# Does Student Sorting Invalidate Value-Added Models of Teacher Effectiveness? An Extended Analysis of the Rothstein Critique

Cory Koedel

University of Missouri

Julian R. Betts*

University of California, San Diego

National Bureau of Economic Research

April 2009

Value-added modeling continues to gain traction as a tool for measuring teacher performance. However, recent research (Rothstein, 2009, forthcoming) questions the validity of the value-added approach by showing that it does not mitigate student-teacher sorting bias (its presumed primary benefit). Our study explores this critique in more detail. Although we find that estimated teacher effects from some value-added models are severely biased, we also show that a sufficiently complex value-added model that evaluates teachers over multiple years reduces the sorting-bias problem to statistical insignificance. One implication of our findings is that data from the first year or two of classroom teaching for novice teachers may be insufficient to make reliable judgments about quality. Overall, our results suggest that in some cases value-added modeling will continue to provide useful information about the effectiveness of educational inputs.

* The authors thank Andrew Zau and many administrators at San Diego Unified School District (SDUSD), in particular Karen Bachofer and Peter Bell, for helpful conversations and assistance with data issues. We also thank Zack Miller, Shawn Ni and Mike Podgursky for useful comments and suggestions, and the National Center for Performance Incentives for research support. SDUSD does not have an achievement-based merit pay program, nor does it use value-added student achievement data to evaluate teacher effectiveness. The underlying project that provided the data for this study has been funded by a number of organizations including The William and Flora Hewlett Foundation, the Public Policy Institute of California, The Bill and Melinda Gates Foundation, the Atlantic Philanthropies and the Girard Foundation. None of these entities has funded the specific research described here, but we warmly acknowledge their contributions to the work needed to create the database underlying the research.

Economic theory states that in an efficient economy workers should be paid their value marginal product. Implementing this rule in the service sector is not simple, as it is often not obvious how to measure the output of a white-collar worker. Teachers provide an example of this problem: public school teachers' salaries are determined largely by academic degrees and credentials, and years of experience, none of which appears to be strongly related to teaching effectiveness.

Perhaps in recognition that teacher pay is not well aligned with teaching quality, President Obama has recently called for greater use of teacher merit pay as a tool to boost student achievement in America's public schools. And yet, in the United States, teacher merit pay is hardly a new idea. It has been used for at least a century, but most programs are short-lived, or survive either by giving almost all teachers bonuses or by giving trivial bonuses to a small number of teachers. Teachers have traditionally complained that principals cannot explain why they gave a bonus to one teacher but not another (Murnane et al., 1991, pp. 117-119).

Opponents of teacher merit pay would raise the question of whether we can reliably measure teachers' value marginal products such that informed merit-pay decisions can be made.

The advent of widescale student testing, partly in response to the requirements of the federal No Child Left Behind law, raises the possibility that it is now feasible to measure the effectiveness of individual teachers in the classroom. Indeed, recently developed panel datasets link students and teachers at the classroom level, allowing researchers to estimate measures of "outcome-based" teacher effectiveness.1 Because test scores are generally available for each student in each year, test scores lend themselves comfortably to a "value-added" approach where the effectiveness of teacher inputs can be measured by student test-score growth. The conjuncture of President Obama's recent calls for teacher merit pay and the development of panel datasets that provide information on student achievement growth raises the stakes considerably: can we use student testing to reliably infer teaching quality?

1 For recent examples see Aaronson, Barrow and Sander (2007), Hanushek, Kain, O'Brien and Rivkin (2005), Harris and Sass (2006), Koedel and Betts (2007), Nye, Konstantopoulos and Hedges (2004), and Rockoff (2004).

In most schools, students are not randomly assigned to teachers. This raises a major challenge to the idea of using value-added models to infer teacher effectiveness. If certain teachers perennially receive students with low test scores, they would lose out in the merit pay sweepstakes through no fault of their own. A presumption in value-added modeling is that by focusing on achievement growth rather than achievement levels, the problem of student-teacher sorting bias is resolved because each student's initial test-score level is used as a control in the model. The value-added approach is intuitively appealing, and increasing demand for performance-based measures by which teachers can be held accountable, at the federal, state and district levels, has only fueled the value-added fire.2 However, despite the popularity of the value-added approach among both researchers and policymakers, not everyone agrees that it is reliable. Couldn't it be the case that a given teacher either systematically or occasionally receives students whose gains in test scores are unusually low, for reasons outside the control of the teacher? Ability grouping would be one source of persistent differences in measured gains across classrooms; transitory shocks to achievement, accompanied by mean reversion, would be a source of fleeting differences that a value-added model might wrongly attribute to a given teacher.

2 No Child Left Behind legislation is one example of this demand at the federal level (e.g., adequate yearly progress), and states such as Florida, Minnesota and Texas have all introduced performance incentives for teachers that depend to some extent on value-added. For a further discussion of the performance-pay landscape, particularly as it relates to teachers, see Podgursky and Springer (2007).

Recent research by Rothstein (2009, forthcoming) shows that future teacher assignments have non-negligible predictive power over current student performance in value-added models, despite the fact that future teachers cannot possibly have causal effects on current student performance. This result suggests that student-teacher sorting bias is not mitigated by the value-added approach. Rothstein's critique of the value-added methodology comes as numerous studies have used and continue to use the technique. It raises serious doubts about the value-added methodology just as other work, such as Kane and Staiger (2008), Jacob and Lefgren (2007) and Harris and Sass (2007), appears to confirm that value-added is a meaningful measure of teacher performance.

We further explore the reliability of value-added modeling by extending Rothstein's analysis in two important ways. First, Rothstein estimates teacher effects using only a single year of data for each teacher. We consider the importance of using multiple years of data to identify teacher effects. If the sorting bias uncovered by Rothstein is transitory to some extent, using multiple cohorts of students to evaluate teachers will help mitigate the bias.3 For example, a principal may alternate across years in assigning the most troublesome students to the teachers at her school, or teachers may connect with their classrooms more in some years than in others. These types of single-year idiosyncrasies will be captured by single-year teacher effects, but will be smoothed out if estimates are based on multiple years of data.4 Second, we evaluate the Rothstein critique using a different dataset. Given that the degree of student-teacher sorting may differ across different educational environments, his results may or may not be replicated in other settings.

3 Rothstein notes this in his appendix, although he does not explore the practical implications in any of his models.

4 Additionally, some of what we observe to be sorting bias may be attributable to the random assignment of students to teachers across small samples (classrooms). In an omitted analysis, we perform a Monte Carlo exercise to test for this possibility. Although any given teacher may benefit (be harmed) in any given year from a random draw of high-performing (low-performing) students, we find no evidence to suggest that this would influence estimates of the distribution of teacher effects.
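The smoothing argument above can be illustrated with a small simulation. This is a hedged sketch with invented scale parameters (not the authors' data or model): each teacher's yearly classroom-level estimate equals a fixed true effect plus a transitory classroom shock, and averaging over years shrinks the shock variance.

```python
import numpy as np

rng = np.random.default_rng(3)
n_teachers, n_years = 500, 3

# Each yearly classroom-level estimate = the teacher's true effect plus a
# transitory classroom shock (a difficult cohort, unusual rapport, sampling
# noise). The scale parameters here are made up for illustration.
true_effect = rng.normal(scale=0.2, size=n_teachers)
shocks = rng.normal(scale=0.3, size=(n_teachers, n_years))
yearly_estimates = true_effect[:, None] + shocks

single_year = yearly_estimates[:, 0]          # one classroom observation
multi_year = yearly_estimates.mean(axis=1)    # average over n_years classrooms

# Averaging cuts the transitory variance by a factor of n_years, so the
# multi-year estimates track the true effects more closely.
err_single = np.std(single_year - true_effect)   # approx the shock sd, 0.3
err_multi = np.std(multi_year - true_effect)     # approx 0.3 / sqrt(n_years)
```

The single-year estimates confound the transitory shock with the true effect one-for-one; the multi-year average attenuates it at the usual 1/sqrt(n) rate, which is the sense in which multiple cohorts "smooth out" single-year idiosyncrasies.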

Our extension of Rothstein's analysis corroborates his primary finding: value-added models of student achievement that focus on single-year teacher effects will generally produce biased estimates of value-added. However, in our case, when we estimate a detailed value-added model and restrict our analysis to teachers who teach multiple classrooms of students, we find no evidence of sorting bias in the estimated teacher effects. Although this result depends on the degree of student-teacher sorting in our data, it suggests that, at least in our setting, sorting bias can be almost completely mitigated by using the value-added approach and looking across multiple years of classrooms for teachers.

Our results in this regard are encouraging, but less detailed value-added models that include teacher-effect estimates based on single classroom observations fare poorly in our analysis. That some value-added models will be reliable but not others, and that value-added modeling may only be reliable in some settings, are important limitations. They suggest that in contexts such as statewide teacher-accountability systems, large-scale value-added modeling may not be a viable solution. Because the success of the value-added approach will depend largely on data availability and the underlying degree of student-teacher sorting in the data (much of which may be unobserved), post-estimation falsification tests along the lines of those proposed by Rothstein will be useful in evaluating the reliability of value-added modeling in different contexts.

Although our analysis does not uncover a well-defined set of conditions under which value-added modeling will universally return causal teacher effects across different schooling environments (outside of random student-teacher assignments such conditions are unlikely to exist), we do identify conditions under which value-added estimation will perform better. The most important insight is that teacher evaluations that span multiple years will produce more reliable measures of teacher effectiveness than those based on single-year classroom observations. Often implicitly, the value-added discussion in research and policy revolves around single-year estimates of teacher effects. Our analysis strongly discourages such an approach.

The remainder of the paper is organized as follows. Section I briefly describes the Rothstein critique. Section II details our dataset from the San Diego Unified School District (SDUSD). Section III replicates a portion of Rothstein's analysis using the San Diego data.

Section IV details our extended analysis of value-added modeling and presents our results.

Section V uses these results to estimate the variance of teacher effectiveness in San Diego.

Section VI concludes.

Rothstein raises concerns about assigning a causal interpretation to value-added estimates of teacher effects. His primary argument is that teacher effects estimated from value-added models are biased by non-random student-teacher assignments, and that this bias is not removed by the general value-added approach, nor by standard panel-data techniques. Consider a simple value-added model of the general form:

$$Y_{it} = \lambda Y_{i,t-1} + X_{it}\beta + T_{it}\delta + (\mu_i + \varepsilon_{it}) \qquad (1)$$

In equation (1), $Y_{it}$ is a test score for student i in year t, $X_{it}$ is a vector of time-varying student and school characteristics (for the school attended by student i in year t), and $T_{it}$ is a vector of indicator variables indicating which teacher(s) taught student i in year t. This model could be re-formulated as a "gainscore" model by forcing the coefficient on the lagged test score, $\lambda$, to unity and moving the lagged score to the left-hand side of the equation. The error term is written as the sum of two components, one that is time-invariant ($\mu_i$) and another that varies over time ($\varepsilon_{it}$).
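As a minimal illustration of estimating a model of this form, the sketch below runs OLS with a full set of teacher indicators on simulated data. All variable names and parameter values are hypothetical, students are assigned to teachers at random, and nothing here comes from the SDUSD data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_students, n_teachers = 1000, 20

# Simulated cross-section for one year t: a lagged score, one time-varying
# covariate, and a random teacher assignment (all scales invented).
y_lag = rng.normal(size=n_students)
x = rng.normal(size=n_students)
teacher = rng.integers(0, n_teachers, size=n_students)
true_effects = rng.normal(scale=0.3, size=n_teachers)
y = 0.8 * y_lag + 0.5 * x + true_effects[teacher] \
    + rng.normal(scale=0.5, size=n_students)

# Design matrix: lagged score, covariate, and the teacher indicators T_it.
# With a full set of teacher dummies and no separate intercept, each dummy's
# coefficient is that teacher's effect (identified up to a common level).
T = np.eye(n_teachers)[teacher]
X = np.column_stack([y_lag, x, T])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
lam_hat, beta_hat, teacher_effects = coef[0], coef[1], coef[2:]
```

Under random assignment, as simulated here, the recovered teacher effects line up with the true ones; the sorting concerns discussed next arise precisely when the teacher assignment is correlated with the error components.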

Rothstein discusses sorting bias as coming from two different sources in this basic model.

First, students could be assigned to teachers based on "static" student characteristics. This type of sorting corresponds to the typical tracking story: some students are of higher ability than others, and these students are systematically assigned to the best teachers. Static tracking may arise in a variety of ways, including administrator preferences, parental preferences, or teacher preferences (assuming that the primary-school-aged children upon whom we focus here are not yet able to form their own preferences). Given panel data, the typical solution to the static-tracking problem is the inclusion of some form of a student fixed effect whereby the time-invariant component of the error term in equation (1) is controlled for (e.g., by first-differencing or demeaning). If student-teacher sorting is based only on static student characteristics, this approach will be sufficient.
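A small simulated example makes concrete why demeaning handles static tracking: within-student demeaning removes the time-invariant component mu_i exactly, so assignments that depend only on mu_i can no longer contaminate the transformed outcomes. The scales below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_students, n_years = 200, 4

# Observed gains contain a time-invariant student component mu_i (static
# "ability") plus year-specific noise; no teacher effects are simulated.
mu = rng.normal(scale=1.0, size=n_students)
eps = rng.normal(scale=0.5, size=(n_students, n_years))
gains = mu[:, None] + eps

# Within-student demeaning sweeps out mu_i exactly, because mu_i enters
# every year identically.
demeaned = gains - gains.mean(axis=1, keepdims=True)

corr_raw = np.corrcoef(mu, gains[:, 0])[0, 1]          # strong: mu_i present
corr_demeaned = np.corrcoef(mu, demeaned[:, 0])[0, 1]  # near zero: mu_i gone
```

The raw gains are strongly correlated with mu_i, so any assignment rule based on mu_i would bias a levels regression; the demeaned gains are not, which is exactly the protection the student fixed effect provides against static tracking (and only static tracking).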

However, the student-fixed-effects solution to the static tracking problem necessarily imposes a strict exogeneity assumption. That is, to uncover causal teacher effects from a model that controls for time-invariant student characteristics, it must be the case that teacher assignments in all periods are uncorrelated with the time-varying error components in all periods.

To see this, note that we could estimate equation (1) by first differencing to remove the time-invariant component of the error term:5

$$\Delta Y_{it} = \lambda \Delta Y_{i,t-1} + \Delta X_{it}\beta + \Delta T_{it}\delta + \Delta\varepsilon_{it} \qquad (2)$$

First, note that the first differencing induces a mechanical correlation between the lagged test-score gain and the first-differenced error term in equation (2). This correlation can be resolved by instrumenting for the lagged test-score gain with the second-lagged gain, or second-lagged level (following Anderson and Hsiao, 1981; for examples see Harris and Sass, 2006; Koedel, forthcoming; and Koedel and Betts, 2007). In addition, year-t teacher assignments may also be correlated with the first-differenced error term. Specifically, if students are sorted dynamically based on time-varying deviations (or shocks) to their test-score-growth trajectories, then lagged shocks to test-score growth, captured by $\varepsilon_{i,t-1}$, will be correlated with year-t teacher assignments, and the teacher effects from equation (2) cannot be given a causal interpretation.6

Rothstein's critique can be summarized as follows: If students are assigned to teachers based entirely on time-invariant factors, unbiased teacher effects can in principle be obtained from a well-constructed value-added model. However, if sorting is based on dynamic factors that are unobserved by the econometrician, value-added estimates of teacher effects cannot be given a causal interpretation.

5 In the case of first differencing, it is more accurate to describe the assumption as "local" strict exogeneity in the sense that the error terms across time must be uncorrelated with teacher assignments only in contiguous years.
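The mechanical correlation and the Anderson-Hsiao fix can be sketched with simulated data. This is an illustrative example with invented parameter values, and the teacher and covariate terms are omitted for clarity: OLS on the first-differenced equation is biased because the lagged gain contains $\varepsilon_{i,t-1}$, while the second-lagged level is a valid instrument for it.

```python
import numpy as np

rng = np.random.default_rng(2)
n, T = 5000, 6
lam_true = 0.8

# Simulate Y_it = lam*Y_i,t-1 + mu_i + eps_it (teacher/covariate terms omitted).
mu = rng.normal(size=n)
y = np.zeros((n, T))
y[:, 0] = mu + rng.normal(size=n)
for t in range(1, T):
    y[:, t] = lam_true * y[:, t - 1] + mu + rng.normal(scale=0.5, size=n)

# First differences: dY_t = lam*dY_{t-1} + d_eps_t; mu_i drops out.
dy = np.diff(y, axis=1)           # dy[:, k] is the gain ending in year k+1
dep = dy[:, 2:].ravel()           # current gain dY_t, for t = 3..T-1
lag_gain = dy[:, 1:-1].ravel()    # lagged gain dY_{t-1} (endogenous)
z = y[:, 1:-2].ravel()            # second-lagged level Y_{t-2}, the instrument

# OLS is inconsistent: lag_gain contains eps_{t-1}, which enters the
# differenced error d_eps_t = eps_t - eps_{t-1} with a minus sign, producing
# the mechanical (downward) correlation.
lam_ols = (lag_gain @ dep) / (lag_gain @ lag_gain)

# Anderson-Hsiao IV: Y_{t-2} predicts the lagged gain but is uncorrelated
# with d_eps_t, restoring consistency.
lam_iv = (z @ dep) / (z @ lag_gain)
```

In this simulation the OLS coefficient on the lagged gain is biased well below the true value, while the simple IV estimator recovers it; the second-lagged gain could be used as the instrument in the same way.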