# Does Student Sorting Invalidate Value-Added Models of Teacher Effectiveness? An Extended Analysis of the Rothstein Critique

Cory Koedel, University of ...

We start with our baseline student samples – the 30,354-student/595-teacher sample for our basic and within-schools models and the 15,592-student/389-teacher sample for our within-students model. For the students in these samples, we then move forward one year and identify fifth-grade teacher assignments. In each sample, approximately 85 percent of the students appear in the dataset in year (t+1) with future teacher assignments. We include teacher indicator variables for students' fifth-grade teacher assignments, and test the null hypothesis that these future teacher assignments do not differentially predict grade-4 test-score growth.^17 So, adjusting model (4) by adding controls for grade-5 teachers, we test

$$H_0: \eta_1 = \eta_2 = \cdots = \eta_J,$$

where η_j is the coefficient on the indicator for future (grade-5) teacher j. A rejection of this null hypothesis for future teacher “effects” suggests that sorting bias is contaminating the teacher effects in the model. This is the falsification test proposed by Rothstein (2009).
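The mechanics of the test can be sketched in a short simulation (hypothetical data and sample sizes; numpy only — this is not the authors' code): regress grade-4 growth on indicators for grade-5 teacher assignments and Wald-test joint equality of the indicator coefficients (here, equality with the omitted reference group).

```python
import numpy as np

rng = np.random.default_rng(0)
n, J = 2000, 8                      # hypothetical: n students, J grade-5 teachers

future_teacher = rng.integers(0, J, size=n)
# Under random assignment, the grade-5 teacher should not predict grade-4 growth.
growth = rng.normal(size=n)

# Design matrix: intercept plus J-1 future-teacher indicators (teacher 0 = reference).
X = np.column_stack([np.ones(n)] +
                    [(future_teacher == j).astype(float) for j in range(1, J)])

beta, *_ = np.linalg.lstsq(X, growth, rcond=None)
resid = growth - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
V = sigma2 * np.linalg.inv(X.T @ X)             # OLS covariance matrix of beta

# Wald test of H0: all future-teacher coefficients equal (to the reference group).
R = np.hstack([np.zeros((J - 1, 1)), np.eye(J - 1)])   # select the J-1 dummy coefs
r = R @ beta
W = float(r @ np.linalg.solve(R @ V @ R.T, r))  # ~ chi2(J-1) under H0
print(f"Wald statistic (df={J - 1}): {W:.2f}")
```

A large Wald statistic relative to the chi-squared critical value would signal that future assignments "predict" current growth, i.e., sorting bias.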

Although all three of these studies use the same basic methodology, Harris and Sass (2006) estimate their model using GMM while Koedel (2007) and Koedel and Betts (2007) use 2SLS. We use 2SLS here.

We include lagged-teacher assignments for all lagged teachers who teach at least five students in our sample in the prior year.

By not requiring all students to have future teacher assignments, we are able to use a larger student sample and therefore a larger teacher sample. The reference group here is the student population for which no grade-5 teacher is observed.

Adding future teacher effects to the within-students model is less straightforward than in the simpler models because of the first-differencing procedure. Specifically, a student's future teacher in the lagged-score model is the same as that student's current teacher in the current-score model. For example, a student's fourth-grade teacher enters the model for third-grade value-added as a future teacher and the model for fourth-grade value-added as a current teacher.
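Schematically, keeping only the fourth-grade teacher terms (illustrative notation; controls and the other teacher vectors are elided):

```latex
\begin{aligned}
\text{current-score model: } & Y_{it} \;=\; \cdots + T^{4}_{it}\,\delta_4 + \cdots \\
\text{lagged-score model: }  & Y_{i(t-1)} \;=\; \cdots + T^{4}_{it}\,\eta_4 + \cdots \\
\text{first difference: }    & Y_{it} - Y_{i(t-1)} \;=\; \cdots + T^{4}_{it}\,(\delta_4 - \eta_4) + \cdots
\end{aligned}
```

Because both roles load on the same indicator vector, only the combination (δ₄ − η₄) is identified.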

We allow fourth-grade teachers to have one “effect” in the lagged-score model and a separate effect in the current-score model by not differencing out the teacher indicator variables. This approach is taken because the current-score teacher effect may be partially causal, while the lagged-score effect cannot be. The current-score and lagged-score effects are not separately identifiable, but are captured by a single coefficient for each fourth-grade teacher. Equation (7) details the first-differenced version of the within-students model that incorporates future teacher assignments. Year t corresponds to the fourth grade for the students in our sample.

$$\begin{aligned}
\Delta Y_{it} ={}& \lambda_1 \Delta Y_{i(t-1)} + \Delta X_{it}\beta_2 + \Delta S_{it}\beta_3 \\
& + \left(T^{4}_{it}\delta_4 + T^{5}_{i(t+1)}\eta_5\right) - \left(T^{3}_{i(t-1)}\delta_3 + T^{4}_{it}\eta_4\right) + \Delta u_{it}
\end{aligned} \qquad (7)$$

In (7) we instrument for the lagged test-score gain with the second-lagged test-score level. The second row of the equation contains the vectors of teacher effects after first-differencing. The positive entries are from the current-score model and the negative entries are from the lagged-score model. The superscripts on the teacher-indicator vectors indicate the grade level taught by the teachers, as do the corresponding subscripts on the coefficient vectors.^18 The teacher coefficients denoted by δ may contain some causal component, while the coefficients denoted by η cannot possibly contain causal information. Grouping terms, the vector of current-teacher coefficients in this model estimates $(\delta_4 - \eta_4)$.
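A stylized simulation (hypothetical parameter values, not the San Diego data) illustrates why the second-lagged score level works as an instrument: when score levels follow an AR(1) with a student effect, the lagged gain is correlated with the differenced error, so OLS on the lagged gain is biased while IV recovers the persistence parameter.

```python
import numpy as np

rng = np.random.default_rng(1)
n, gamma = 50_000, 0.5     # hypothetical persistence parameter

# Hypothetical DGP: Y_t = gamma * Y_{t-1} + mu_i + eps_t.  The differenced
# gain equation dY_t = gamma * dY_{t-1} + (eps_t - eps_{t-1}) then has an
# endogenous regressor, because dY_{t-1} contains eps_{t-1}.
mu = rng.normal(size=n)
Y = np.zeros((n, 5))
Y[:, 0] = rng.normal(size=n)
for t in range(1, 5):
    Y[:, t] = gamma * Y[:, t - 1] + mu + rng.normal(size=n)

dY = np.diff(Y, axis=1)
y, x, z = dY[:, 3], dY[:, 2], Y[:, 2]   # current gain, lagged gain, second-lagged level

ols = np.cov(x, y)[0, 1] / np.var(x)
# With a single instrument, 2SLS collapses to the IV ratio cov(z, y) / cov(z, x).
iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]
print(f"OLS: {ols:.3f}   2SLS: {iv:.3f}   true: {gamma}")
```

The second-lagged level Y_{t-2} is uncorrelated with eps_t and eps_{t-1} but correlated with the lagged gain, which is exactly the exogeneity/relevance combination the IV step needs.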

In each model, we estimate future-teacher “effects” for all teachers who teach at least 20 students from our original student sample one year in advance. We perform Wald tests of the null hypothesis that the teacher effects are jointly equal to each other, for both current and future teachers, although our primary interest is in the tests for the future teachers. We also estimate the unadjusted and adjusted variances of the distributions of the current and future teacher effects, again following Aaronson, Barrow and Sander (2007).^19 Our results are detailed in Table 4.
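The variance adjustment in the spirit of Aaronson, Barrow and Sander (2007) subtracts the average sampling variance of the teacher-effect estimates from their cross-teacher variance, leaving an estimate of the variance of true quality. A minimal sketch with simulated numbers (all values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
J = 400
true_sd = 0.20                        # hypothetical true teacher-quality sd

# Each estimated effect = true effect + estimation noise with a known SE.
true_effects = rng.normal(0.0, true_sd, size=J)
se = rng.uniform(0.05, 0.15, size=J)  # sampling SE of each teacher's estimate
estimates = true_effects + rng.normal(0.0, se)

unadjusted = np.var(estimates)
# Subtract the average sampling variance to strip estimation noise.
adjusted = unadjusted - np.mean(se**2)

print(f"unadjusted sd: {np.sqrt(unadjusted):.3f}   adjusted sd: {np.sqrt(adjusted):.3f}")
```

The unadjusted dispersion overstates true quality differences because it mixes in sampling error; the adjusted figure is the one compared across models in Table 4.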

Whereas the evidence presented in Table 1 indicates the degree to which environmental and/or structural differences between North Carolina and San Diego influence the validity of value-added estimation holding the value-added methodology constant (although these differences appear marginal at best), Table 4 shows the effects of richer value-added models that evaluate teachers over multiple years.^20 The ratio of the adjusted standard deviation of the future-teacher-effects distribution to the adjusted standard deviation of the current-teacher-effects distribution falls below one half in each of the models in Table 4 (down from approximately 0.6 in Table 1).^21 The student-fixed-effects model appears to mitigate sorting bias more than the other two models, as evidenced by the smaller future-teacher-effect variance detailed in the table. This suggests that static sorting is contributing to the bias in the teacher-effect estimates in the basic and within-schools models, and that a within-students model may be preferred (Harris and Sass (2006) also recommend a within-students approach).

Our approach requires that we treat teacher effects separately by grade for fourth-grade teachers who also teach students in the third grade. If teacher quality is constant across grades, these by-grade effects are expected to difference out for a student who has the same teacher in the third and fourth grades. However, this creates some additional noise relative to a standard first-differenced model because more than one parameter must be estimated for the 49 fourth-grade teachers who also teach in the third grade in our panel.

We diagonalize the variance matrices to compute the Wald statistics. Substituting the full variance-covariance matrices for the diagonal variance matrices has little effect on the reported Wald statistics and, mechanically, does not affect the teacher-effect variance estimates at all.

We find that the control variables added to the value-added specifications marginally reduce the sorting bias. This result is consistent with Rothstein (2009). Although Rothstein does not report analogous results from models that incorporate student- or school-level control variables, he notes that his results do not qualitatively change if they are included.

One potentially important aspect of the results in Table 4 is that some of the future-teacher “effects” are estimated using multiple cohorts of students. If the sorting captured by Rothstein's estimates (and our analogous estimates) is transitory to some extent, then using multiple cohorts of students to evaluate teacher effects will help mitigate the bias. For example, a principal may alternate across years in assigning the most troublesome students to the fourth-grade teachers at her school. Or, more generally, teachers may connect with their classrooms more in some years than in others. In either case, observing multiple years of classroom assignments for teachers will help to smooth out the bias. To investigate this possibility, Table 5 replicates the analysis in Table 4 but evaluates only future teachers who teach students in every possible year of the data panel. For the basic and within-schools models, this means that future teachers teach students in four consecutive years. For the within-students model, future teachers teach students in three consecutive years (recall that we use only three year-cohorts of students in the within-students model).
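The smoothing argument can be made concrete with a toy simulation (hypothetical magnitudes, not estimates from the paper): if each year's estimated “effect” equals true quality plus an independent transitory sorting bias, averaging over C cohorts shrinks the bias variance by a factor of C.

```python
import numpy as np

rng = np.random.default_rng(4)
n_teachers, n_years = 300, 4

# Hypothetical decomposition: estimated effect = true quality + transitory
# sorting bias that is independent across years.
quality = rng.normal(0.0, 0.20, size=n_teachers)
transitory_bias = rng.normal(0.0, 0.15, size=(n_teachers, n_years))

single_year = quality + transitory_bias[:, 0]
multi_year = quality + transitory_bias.mean(axis=1)   # pooled over cohorts

# Averaging over C independent cohorts shrinks the bias variance by 1/C.
print(f"single-year sd of estimates: {np.std(single_year):.3f}")
print(f"{n_years}-year sd of estimates:    {np.std(multi_year):.3f}")
```

Any bias component that persists across years (static tracking) would not shrink this way, which is why the multi-year argument applies only to the transitory part of the sorting.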

Future-teacher “effects” are smaller in Table 5, where we focus on future teachers who teach multiple cohorts of students. In fact, in the student-fixed-effects model, when we focus on future teachers who teach at least three classrooms of students, the adjusted variance of grade-5 teacher effects goes to zero. Table 5 suggests that at least some of the sorting bias uncovered by Rothstein is transitory.^22 This finding highlights perhaps the most policy-relevant implication of our study: evaluating teachers over multiple years will improve the performance of value-added models and, depending on the sorting environment, may be sufficient to mitigate sorting bias if static tracking is adequately controlled for.^23

One possible concern with the results in Table 5 is that the grade-5 (future) teachers who taught students in all years of the data panel are simply different from other teachers, and that this explains the insignificant variation in their “effects” in the last model in Table 5. If that were the case, and it were not an issue of transitory bias, then removing a cohort of student data and re-running the model on the new subsample should leave the adjusted variance of grade-5 teacher effects near zero, and the Wald test should continue to fail to reject the null that all grade-5 teachers have an identical “effect” on grade-4 achievement. Table 6 shows the results when we re-estimate the within-students model after removing one cohort of grade-4 students at a time. The adjusted variance of grade-5 teacher effects now rises markedly. Also, in two of three cases the adjusted variance for the grade-4 teachers increases, as would be expected.

Note that the meaning of this ratio is less clear in the student-fixed-effects model because the current-teacher effects from this model estimate the joint parameter in Equation (7). Ultimately, however, the important result from the student-fixed-effects model is that the future-teacher “effects” have less predictive power over current test-score growth.

So, why does the fixed-effect model using future teachers who teach in all years, shown in Table 5, appear to salvage hope for the use of value-added models? We conclude that there is nothing unusual about this sample of grade-5 teachers. Rather, we succeed in reducing the future-teacher effects to zero mostly because Table 5 includes only grade-5 teachers who teach in all years of the data. The use of multiple years of data significantly reduces transitory sorting bias.

Our transitory-sorting-bias finding is consistent with other work finding that multi-year teacher effects are more stable (McCaffrey et al., 2009) and more predictable (Goldhaber and Hansen, 2009). However, reduced sampling variance will also be a determinant of these other results.

We make one cautionary note about Table 5. The results in the table cannot be interpreted as in Tables 2 or 4 in the sense that we cannot compare the ratios of the standard deviations of the distributions of future and current teacher effects. This is because the future teachers in the models from Table 5 are selected based on having multiple classroom assignments whereas the current teachers are not. That is, the current-teacher “effects” in Table 5 presumably contain some sorting bias due to transitory sorting that will be partially (or fully) mitigated in the future-teacher “effects”.

The results from the previous section suggest that we can estimate the variance of causal teacher effects in San Diego using a within-students value-added model that focuses on teachers who teach in all three years of our data panel. For this analysis we return to the within-students model in equation (6) from Section IV, and estimate teacher effects for fourth-grade teachers.

Unlike in the previous analysis, we do not include future teachers in the model, and estimate a typical first-differenced specification (as opposed to the non-standard specification in equation (7)).

Across all of the fourth-grade teachers in our within-students sample, the adjusted variance of the teacher effects from the model in equation (6) is estimated to be 0.22 – similar in magnitude to the results above.^24 To estimate the magnitude of the variance of actual teacher quality, free from sorting bias, we split the teacher sample into two groups.

Group (A) consists of fourth-grade teachers who taught in all three years of our within-students data panel and group (B) consists of teachers who did not. Approximately 45 percent of the fourth-grade teachers belong to group (A) and 55 percent to group (B).^25 Consistent with the transitory-sorting-bias result in Table 5, the adjusted variance of the teacher effects from group (A) is approximately 24 percent smaller than the adjusted variance of the teacher effects from group (B). Correspondingly, the standard deviations of the adjusted teacher-effect distributions, measured in standard deviations of the test, are 0.20 for group (A) and 0.23 for group (B). The standard deviation of the adjusted difference-in-variance between the two groups is 0.11. Table 7 documents these results.

Recall that the within-student teacher “effect” estimates in Tables 5 and 6 are from the non-standard first-differenced model in equation (7).

Note that in Table 5, roughly 57 percent of fifth-grade teachers taught in all three years of the data panel. The difference in stability between our fourth- and fifth-grade teacher samples may be explained by the different selection criteria. Our initial sample of fifth-grade teachers in Table 4 teach at least 20 students for whom we observe teacher assignments in four consecutive years, while our sample of fourth-grade teachers teach at least 20 students for whom we observe teacher assignments in just three consecutive years. Also, the fifth-grade teacher sample is identified conditional on students being taught by one of the teachers in the fourth-grade teacher sample.
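The excerpt does not spell out how the standard deviation of the difference-in-variance is computed; one way to attach a standard error to such a difference in adjusted variances is a nonparametric bootstrap over teachers within each group, sketched here with hypothetical inputs.

```python
import numpy as np

rng = np.random.default_rng(5)

def adjusted_var(estimates, se):
    """Cross-teacher variance of estimates minus the mean sampling variance."""
    return np.var(estimates) - np.mean(se**2)

# Hypothetical teacher-effect estimates and standard errors for two groups.
est_a, se_a = rng.normal(0, 0.25, 150), np.full(150, 0.10)
est_b, se_b = rng.normal(0, 0.30, 180), np.full(180, 0.10)

point = adjusted_var(est_b, se_b) - adjusted_var(est_a, se_a)

# Nonparametric bootstrap: resample teachers with replacement within groups.
draws = []
for _ in range(2000):
    ia = rng.integers(0, len(est_a), len(est_a))
    ib = rng.integers(0, len(est_b), len(est_b))
    draws.append(adjusted_var(est_b[ib], se_b[ib]) -
                 adjusted_var(est_a[ia], se_a[ia]))

print(f"difference in adjusted variances: {point:.4f}   bootstrap se: {np.std(draws):.4f}")
```

Resampling teachers (rather than students) treats each teacher's adjusted estimate as the unit of analysis, which matches how the group variances themselves are computed.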

Although the analysis in the previous section suggests that the observed variance gap between the teachers in groups (A) and (B) will be driven, at least in part, by differences in transitory sorting bias, two other explanations merit discussion. First, it is possible that group (A) is a more homogeneous group of teachers than group (B). As shown in Table 8, there are some observable differences in experience and education that suggest this might be a concern. Specifically, teachers in group (A) are more likely to be experienced and to have a master's degree. We investigate the extent to which differences across groups along these dimensions might explain the observed variance difference by estimating the within-group variance of teacher quality for more and less experienced teachers, and then for teachers with and without master's degrees. The within-group variance of teacher effects among teachers with master's degrees is higher than among those without, which works counter to the observed variance gap. For experience, there is more variation among teachers with 10 or more years of experience and among novice teachers (with 5 or fewer years of experience) than among teachers with 5-10 years of experience. Ultimately, the variance decompositions based on grouping teachers by observable qualifications do not suggest a clear variance-gap effect.^26 However, we also note that the grouping criterion here is somewhat arbitrary in the sense that ...

The differences in variances across the teacher samples split by observable qualifications are small, in the neighborhood of 0.01 to 0.02 standard deviations. Although we cannot disentangle the effects of transitory sorting bias from the observable differences across teachers in the two samples of interest (groups A and B), there is a large literature showing that teachers differ only mildly in effectiveness based on observable qualifications (Hanushek, 1996; exceptions in the literature include Clotfelter, Ladd and Vigdor, 2007). Perhaps most relevant to the present study, Betts, Zau and Rice (2003) estimate value-added models in the San Diego Unified School District using student fixed effects, with separate models for elementary, middle and high school students. Although they find some evidence that teacher qualifications matter at the high school level, they find very little evidence of this in elementary schools.