# Does Student Sorting Invalidate Value-Added Models of Teacher Effectiveness? An Extended Analysis of the Rothstein Critique

Cory Koedel, University of ...

Rothstein proposes a falsification test to determine whether value-added models can provide causal information about teaching effectiveness. He suggests simply adding future teacher assignments to the model and testing whether these teacher assignments have non-zero "effects". Future teachers clearly cannot have causal effects on current test scores, which means that any observed "effects" must be the result of a correlation between teacher assignments and the error terms. Alternatively, if the coefficients on the future-teacher indicator variables are jointly insignificant, sorting bias is unlikely to be a major concern for any teacher effects in the model (as this finding would suggest that the controls in the model are capturing the sorting bias that would otherwise confound the teacher effects). In Rothstein's analysis (2009), his most provocative finding is that future teacher assignments have significant predictive power over current student achievement.
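The logic of the falsification test can be illustrated with a small simulation. The sketch below is not the paper's estimation code; it uses hypothetical data and a simple joint F-test of future-teacher indicators, where "sorting" means assigning grade-5 teachers by ranking students on an ability draw that also drives mean reversion in grade-4 gains:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, J = 3000, 30  # students and (future) grade-5 teachers, hypothetical sizes

# Hypothetical grade-4 gains that partly reflect a persistent ability component.
ability = rng.normal(0, 1, n)
gain4 = ability + rng.normal(0, 1, n)

# Scenario A: grade-5 teachers assigned randomly.
t_random = rng.integers(0, J, n)
# Scenario B: grade-5 teachers assigned by sorting students on ability.
t_sorted = np.argsort(np.argsort(ability)) * J // n

def future_teacher_F(gain, teacher):
    """F-test that all future-teacher dummies are jointly zero."""
    ssr_r = np.sum((gain - gain.mean()) ** 2)   # restricted: intercept only
    resid = gain.copy()
    for t in np.unique(teacher):
        m = teacher == t
        resid[m] -= gain[m].mean()              # unrestricted: teacher means
    ssr_u = np.sum(resid ** 2)
    df1, df2 = J - 1, n - J
    F = ((ssr_r - ssr_u) / df1) / (ssr_u / df2)
    return F, stats.f.sf(F, df1, df2)

F_a, p_a = future_teacher_F(gain4, t_random)  # random: no spurious "effects"
F_b, p_b = future_teacher_F(gain4, t_sorted)  # sorting alone creates "effects"
```

Under random assignment the future-teacher dummies are jointly insignificant, while under sorting they strongly "predict" current gains even though no causal channel exists, which is exactly the pattern the falsification test exploits.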

We use administrative data from four cohorts of fourth-grade students in the San Diego Unified School District who started the fourth grade in the school years 1998-1999 through 2001-2002. The standardized test that we use to measure student achievement, the Stanford 9, is

designed to be vertically scaled, such that a one-point gain in student performance at any point in the schooling process is meant to correspond to the same amount of learning.7 Students who have fourth-grade test scores and lagged test scores are included in our baseline dataset. We estimate a value-added model that assumes a common intercept across students, and a second model that incorporates student fixed effects. In this latter model we additionally require students to have second-lagged test scores. For each of our primary models, we estimate value-added for teachers who teach at least 20 students across the data panel and restrict our student sample to the set of fourth-grade students taught by these teachers.8 In the baseline dataset, we evaluate test-score records for 30,354 students taught by 595 fourth-grade teachers. Our sample size falls to 15,592 students taught by 389 teachers in the student-fixed-effects dataset. The large reduction in sample size is the result of (1) the requirement of three contiguous test-score records per student instead of just two, which in addition to removing more transient students also removes one year-cohort of students because we do not have test-score data prior to 1997-1998 (that is, students in the fourth grade in 1998-1999 can have lagged scores

For detailed information about the quantitative properties of the Stanford 9 exam, see Koedel and Betts (forthcoming).

This restriction is imposed because of concerns about sampling variation (see Kane and Staiger, 2002). Our results are not sensitive to reasonable adjustments to the 20-student threshold.

but not second-lagged scores) and (2) requiring that the remaining students be assigned to one of the 389 fourth-grade teachers who teach at least 20 students with three or more test-score records.9 We include students who repeat the fourth grade because it is unlikely that grade repeaters would be excluded from teacher evaluations in practice. In our original sample of 30,354 students with current and lagged test-score records, only 199 are grade repeaters.
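The sample restrictions above follow a mechanical recipe that can be sketched in pandas. The data frame, column names, and the relaxed 2-student threshold below are all hypothetical stand-ins for the district's administrative records, not the paper's actual data:

```python
import pandas as pd

# Hypothetical long-format score file: one row per student-year.
scores = pd.DataFrame({
    "student": [1, 1, 1, 2, 2, 3],
    "year":    [1998, 1999, 2000, 1999, 2000, 2000],
    "score":   [610.0, 632.0, 655.0, 598.0, 617.0, 640.0],
    "teacher": ["A", "B", "C", "B", "C", "C"],
})

# Build lagged and second-lagged scores within student (contiguity of the
# panel years is assumed here; real data would need an explicit check).
scores = scores.sort_values(["student", "year"])
scores["lag_score"] = scores.groupby("student")["score"].shift(1)
scores["lag2_score"] = scores.groupby("student")["score"].shift(2)

# Baseline sample: current and lagged scores; FE sample: second lag as well.
baseline = scores.dropna(subset=["lag_score"])
fe_sample = scores.dropna(subset=["lag2_score"])

# Restrict to teachers with enough students across the panel
# (the paper uses 20; lowered to 2 so this toy data keeps some rows).
counts = baseline.groupby("teacher")["student"].nunique()
keep = counts[counts >= 2].index
baseline = baseline[baseline["teacher"].isin(keep)]
```

The second-lag requirement discards both transient students and the earliest cohort, which is the mechanism behind the drop from 30,354 to 15,592 students described above.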

## III. Replication of Rothstein's Analysis

Based on details provided by Rothstein in his paper and corresponding data appendix, we first replicate a portion of his analysis using data from the 1999-2000 fourth-grade cohort in San Diego. This replication is meant to establish the extent to which Rothstein's underlying findings

are relevant in San Diego.10 We estimate the following basic value-added model:

$$\Delta Y_i^4 = S_i \phi + T_i^3 \beta^3 + T_i^4 \beta^4 + T_i^5 \beta^5 + \varepsilon_i \qquad (3)$$

Equation (3) is a gainscore model and corresponds to Rothstein's "VAM1" model with indicators for past, current, and future teacher assignments. $\Delta Y_i^4$ represents a student's test-score gain going from the third to the fourth grade, $S_i$ is a vector of school indicator variables, and $T_i^x$ is a vector of teacher indicator variables for student i in grade x. Correspondingly, $\beta^x$ is a vector of grade-x teacher effects. Rothstein's

argument is that if future teacher effects, for fifth-grade teachers in this case, are shown to be non-zero, then none of the teacher effects in the model can be given a causal interpretation.11 We replicate the data conditions in Rothstein as closely as possible when estimating this model. Two conditions seemed particularly important. First, in specifications that

Only students who repeated the 4th grade in the latter two years of our panel could possibly have had more than three test-score records. There are 32 students with four test-score records in our dataset.

The replication data sample is roughly a subsample of the student-fixed-effects dataset, but we use different teachers because Rothstein does not require teachers to teach 20 students for inclusion in the model.

Rothstein's analyses (2009, forthcoming) are quite thorough, and we refer the interested reader to his paper for more details.

include teacher identifiers across multiple grades, Rothstein excludes students who changed schools across those grades. Second, he focuses on a single cohort of students passing through the North Carolina public schools. Similarly, the dataset used to estimate equation (3) excludes all school switchers, and the model is estimated using just a single cohort of fourth-grade students in San Diego.

In our replication we focus on the effects of fourth- and fifth-grade teachers in equation (3). In accordance with the literature that measures the importance of teacher value-added, we report the adjusted and unadjusted variance of the teacher effects. We follow Rothstein's approach to reporting the teacher-effect variances, borrowed from Aaronson, Barrow and Sander (2007), where the unadjusted variance is just the raw variance of the teacher effects and the adjusted variance is equal to the raw variance minus the average of the squared robust standard errors. We follow the steps outlined in Rothstein's appendix to estimate the within-school variance of teacher effects in the absence of teachers switching schools. Our results are detailed in Table 1.
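The sampling-variance adjustment described above is a one-line computation once the teacher effects and their robust standard errors are in hand. The numbers below are made up for illustration; in the paper they would come from estimating equation (3):

```python
import numpy as np

# Hypothetical estimated teacher effects and their robust standard errors.
theta_hat = np.array([0.10, -0.05, 0.22, 0.03, -0.18])
robust_se = np.array([0.06, 0.05, 0.07, 0.06, 0.08])

raw_var = theta_hat.var()                    # unadjusted variance
adj_var = raw_var - np.mean(robust_se ** 2)  # subtract avg. squared SE
adj_sd = np.sqrt(max(adj_var, 0.0))          # adjusted standard deviation
```

The adjustment removes the portion of the dispersion in estimated effects that is attributable to sampling error; to match the reporting convention in Table 1, the resulting standard deviation would then be divided by the standard deviation of student test scores.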

The first two columns of Table 1 report the results of separate Wald tests of the hypotheses that all grade-4 teachers have identical effects and that all grade-5 teachers have identical effects. Confirming Rothstein's findings, the null hypothesis that grade-5 teachers have an equal effect on students' gains in grade 4 is rejected with a p-value below 0.01.

The next two columns show the raw standard deviations of teacher effects and the standard deviations after adjusting for sampling variance. (These are divided by the standard deviation of student test scores to scale them.) The adjusted standard deviations are 0.24 and 0.15 for grade-4 and grade-5 teachers, respectively. The results in Rothstein (2009) most

analogous to those in our Table 1, both in terms of the model and the data, are reported in his Table 5 (column 2) for the "unrestricted model". There, he shows adjusted standard deviations of the distributions of grade-4 and grade-5 teacher effects of 0.193 and 0.099 standard deviations of the test, respectively. These results are replicated virtually identically in his Table 2 (column 7), where he uses a larger student sample and excludes lagged-year teacher identifiers from the model. Our estimates, which show a larger overall variance of teacher effects, are consistent with past work using San Diego data (Koedel and Betts, 2007). The relevant result to compare with Rothstein (2009) is our estimate of the ratio of the standard deviations of the distributions of future and current teacher effects. Rothstein finds that the standard deviation of the distribution of future teacher "effects" is approximately 51 percent of the size of that of current teacher effects (i.e., 0.099/0.193), whereas in our analysis this number is slightly higher at roughly 63 percent (0.15/0.24). Our results confirm Rothstein's suspicion that future teachers explain a sizeable portion of current-grade achievement gains, and establish that his primary findings are not unique to North Carolina.

The results in Table 1, and the corresponding results detailed by Rothstein, suggest that student-teacher sorting bias is a significant complication to value-added modeling. Information about the degree of student-teacher sorting in our data will be useful for generalizing our results that follow to other settings. We document observable student-teacher sorting in our data by comparing the average realized within-teacher standard deviation of students' lagged test scores to analogous measures based on simulated student-teacher matches that are either randomly generated or perfectly sorted. This approach follows Aaronson, Barrow and Sander (2007).

Although sorting may occur along many dimensions, the extent of sorting based on lagged test scores is likely to provide some indication of sorting more generally. Table 2 details our results, which are presented as ratios of the standard deviation of interest to the within-grade standard deviation of the test (calculated based on our student sample). Note that while there does appear to be some student sorting based on lagged test-score performance in our dataset, this sorting is relatively mild.
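The two simulated benchmarks can be sketched as follows. Everything here is hypothetical (lagged scores drawn at random, equal class sizes); the point is only the mechanics of the comparison:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical lagged test scores and an observed classroom assignment.
n_teachers, class_size = 20, 25
lagged = rng.normal(0.0, 1.0, n_teachers * class_size)
observed = np.repeat(np.arange(n_teachers), class_size)  # stand-in assignment

def avg_within_teacher_sd(scores, teacher_ids):
    """Average across teachers of the SD of students' lagged scores."""
    return np.mean([scores[teacher_ids == t].std()
                    for t in np.unique(teacher_ids)])

# Benchmark 1: random matches -- shuffle students across classrooms.
random_sd = avg_within_teacher_sd(rng.permutation(lagged), observed)

# Benchmark 2: perfect sorting -- rank students by lagged score and fill
# one classroom at a time, so each teacher gets a narrow score band.
perfect_sd = avg_within_teacher_sd(np.sort(lagged), observed)

# Realized matches from the data (here just the simulated assignment).
realized_sd = avg_within_teacher_sd(lagged, observed)
# As in Table 2, each would be reported relative to the overall test SD.
overall_sd = lagged.std()
```

Under random matching the within-teacher spread is close to the overall spread, while perfect sorting collapses it toward zero; where the realized value falls between these benchmarks indicates how strongly students are sorted on lagged scores.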

## IV. Extensions to More Complex Value-Added Models

We extend the analysis by evaluating the "effects" of future teachers using three models that are more commonly used in the value-added literature. These models include a richer set of control measures. We use a general value-added specification in which current test scores are regressed on lagged test scores, but note that it is also somewhat common in the literature to use the gainscore model (used primarily by Rothstein), in which the coefficient on the lagged test score is forced to one and the lagged-score term is moved to the left side of the equation.

The first model that we consider, and the simplest, is a basic value-added model that

allows for the comparison of teacher effects across schools:

$$Y_{it} = \delta_t + \lambda Y_{i,t-1} + X_{it}\gamma + T_{it}\theta + \varepsilon_{it} \qquad (4)$$

In (4), $Y_{it}$ is the test score for student i in year t, $\delta_t$ is a year-specific intercept, $Y_{i,t-1}$ is the lagged test score, $X_{it}$ is a vector of time-invariant and time-varying student-specific characteristics (see Table 3), and $T_{it}$ is a vector of teacher indicator variables where the entry for the teacher who teaches student i in year t is set to one. The coefficients of interest are in the J×1 vector of teacher effects, $\theta$.
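Estimation of the basic model amounts to OLS with teacher dummies. The following minimal sketch simulates data under a stripped-down version of equation (4) (no covariates, a single year, hypothetical parameter values) and recovers the teacher effects:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: y = delta + lam * y_lag + theta_j + noise
n_students, n_teachers = 600, 6
teacher = rng.integers(0, n_teachers, n_students)
theta_true = rng.normal(0.0, 0.2, n_teachers)
y_lag = rng.normal(0.0, 1.0, n_students)
y = 0.5 + 0.8 * y_lag + theta_true[teacher] + rng.normal(0, 0.3, n_students)

# Design matrix: intercept, lagged score, teacher dummies
# (teacher 0 dropped as the reference category).
D = np.zeros((n_students, n_teachers - 1))
for j in range(1, n_teachers):
    D[:, j - 1] = (teacher == j).astype(float)
X = np.column_stack([np.ones(n_students), y_lag, D])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
lam_hat = beta[1]      # coefficient on the lagged score
theta_hat = beta[2:]   # teacher effects relative to teacher 0
```

Because one dummy must be dropped, the estimated effects are identified only relative to the omitted teacher; in practice they are typically re-centered (e.g., demeaned) before variances are reported.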

We refer to equation (4) as the basic model. The most obvious omission from the model is school-level information, whether in the form of school fixed effects or time-varying controls.

Researchers have generally incorporated this information because of concerns that students and teachers are sorting into schools non-randomly. This sorting, along with the direct effects of school-level inputs on student achievement (peers, for example), will generate omitted-variables bias in the value-added estimates of teacher effects in equation (4).

While the omitted-variables-bias concern is certainly relevant, any model that includes school-level information will not allow for a true comparison of teacher effectiveness across schools. For example, if school fixed effects are included in equation (4) then each teacher‟s comparison group will be restricted to the set of teachers who teach at the same school.

Furthermore, even in the absence of school fixed effects, the inclusion of school-level controls will restrict teachers‟ comparison groups to some extent because teachers may sort themselves based on school-level characteristics. If this is the case, controls meant to capture school quality will also partly capture school-level teacher quality, limiting inference from across-school comparisons of teachers.

For many researchers, concerns about omitted-variables bias overwhelm concerns about shrinking teacher comparison groups. This leads to the second model that we consider, the within-schools model, which is more commonly estimated in the literature and includes school-level covariates and school fixed effects:

$$Y_{it} = \delta_t + \lambda Y_{i,t-1} + X_{it}\gamma + S_{it}\phi + T_{it}\theta + \varepsilon_{it} \qquad (5)$$

In equation (5), $S_{it}$ includes school indicator variables and time-varying school-level information for the school attended by student i in year t. The controls in the vector $S_{it}$ are detailed in Table 3.

The benefit of including school-level information is a reduction in omitted-variables bias, including sorting bias generated by students and teachers selecting into specific schools.

However, the cost of moving from equation (4) to equation (5) is that it is no longer straightforward to compare teachers across schools.12 Although teacher effectiveness cannot be compared across schools straightforwardly using value-added estimates from equation (5), this may be acceptable from a policy perspective. For example, policymakers may wish to identify the best and worst teachers on a school-by-school basis regardless of any teacher sorting across schools.

Finally, in our third specification we incorporate student fixed effects. This approach is suggested by Harris and Sass (2006), Koedel (forthcoming) and Koedel and Betts (2007), among

many others:

$$Y_{it} = \alpha_i + \delta_t + \lambda Y_{i,t-1} + X_{it}\gamma + S_{it}\phi + T_{it}\theta + \varepsilon_{it} \qquad (6)$$

In going from equation (5) to equation (6) we add student fixed effects, $\alpha_i$.13 The inclusion of the student fixed effects allows us to drop the time-invariant student characteristics from the vector $X_{it}$, leaving only time-varying student characteristics. The benefit of the within-students approach is that teacher effects will not be biased by within-school student sorting across teachers based on time-invariant student characteristics (such as ability, parental involvement, etc.). However, again there are tradeoffs. As noted in Section I, the student-fixed-effects model necessarily imposes some form of the strict exogeneity assumption. Equation (6) also narrows teachers' comparison groups to those with whom they share students, meaning that identification comes from comparing test-score gains for individual students when they were in the third and fourth grades. In addition, the incorporation of the student fixed effects makes the model considerably noisier.14 Finally, the size of the student sample that can be used is restricted in equation (6) because a student record must contain at least three contiguous test scores, instead of just two, to be included in the analysis (as described in Section II).

Despite these concerns, econometric theory suggests that the inclusion of student fixed effects will be an effective way to remove within-school sorting bias in teacher effects as long as students and teachers are sorted based on time-invariant characteristics. We estimate the within-students model by first-differencing equation (6) and instrumenting for students' lagged test-score gains with their second-lagged test-score levels. This general approach was developed by Anderson and Hsiao (1981) and has recently been used by Harris and Sass (2006), Koedel (forthcoming), and Koedel and Betts (2007) to estimate teacher value-added.15 Note that to completely first-difference equation (6) we must incorporate students' lagged teacher assignments, which appear in the period-(t-1) version of equation (6).16 That is, the model compares the effectiveness of students' current and previous-year teachers.

Rothstein (2009) takes an entirely different approach, based on Chamberlain's correlated random effects model, when testing for student fixed effects in his analysis.

In fact, a test for the statistical significance of the student fixed effects in equation (6) fails to reject the null hypothesis of joint insignificance. However, the test is of low power given the large-N, small-T panel dataset structure (typical of most value-added analyses), limiting inference.
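The Anderson-Hsiao logic can be illustrated on a toy panel. The sketch below strips equation (6) down to a fixed effect and a lagged score (no covariates or teacher dummies, hypothetical parameter values); it shows why the second-lagged level is a valid instrument for the endogenous lagged gain after first-differencing:

```python
import numpy as np

rng = np.random.default_rng(2)
n, lam = 5000, 0.6  # hypothetical sample size and persistence parameter

# Three scores per student (t-2, t-1, t) with a student fixed effect.
alpha = rng.normal(0, 0.5, n)
y0 = alpha + rng.normal(0, 1, n)
y1 = alpha + lam * y0 + rng.normal(0, 1, n)
y2 = alpha + lam * y1 + rng.normal(0, 1, n)

# First-differencing removes alpha_i, but the lagged gain (y1 - y0)
# contains e_{t-1}, which also enters the differenced error -- so OLS
# on the differenced equation is biased.
dy = y2 - y1       # current gain
dy_lag = y1 - y0   # lagged gain (endogenous regressor)
z = y0             # Anderson-Hsiao instrument: second-lagged level

# Just-identified IV estimate of lambda: cov(z, dy) / cov(z, dy_lag)
lam_iv = np.cov(z, dy)[0, 1] / np.cov(z, dy_lag)[0, 1]

# For contrast, the biased OLS estimate on the differenced data.
lam_ols = np.cov(dy_lag, dy)[0, 1] / np.var(dy_lag)
```

The second-lagged level is correlated with the lagged gain but uncorrelated with the differenced error, so the IV estimate recovers the persistence parameter while the differenced OLS estimate is pulled sharply downward.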