# Comparing Value-Added Models for Estimating Individual Teacher Effects on a Statewide Basis

Consortium for Educational Research and Evaluation–North Carolina

The hybrid approach uses both random effects, to estimate the teacher effect as an empirical Bayes residual or shrinkage estimate (Wright, White, Sanders, & Rivers, 2010), and fixed effects (either unit-specific dummy variables or demeaning, though usually the former), to control for confounders of the teacher effect. Covariates are generally not entered into these models (Ballou, Sanders, & Wright, 2004).

The multivariate response model (MRM) is also a type of multiple membership, multiple classification (MMMC) random effects model (Browne, Goldstein, & Rasbash, 2001). MMMC or cross-classified models are multilevel or random effects models that acknowledge the complex nature of student-teacher assignment over multiple years of schooling or multiple subjects. That is, rather than being nested within a single teacher, students are nested within multiple teachers over time, and the set of students with which they share this nesting changes. The MRM can be represented in a very simplified form for a student in three sequential grade levels (3rd through 5th) as follows (Ballou, Sanders, & Wright, 2004):

(6.1) $y_3 = \mu_3 + \theta_3 + \epsilon_3$

(6.2) $y_4 = \mu_4 + \theta_3 + \theta_4 + \epsilon_4$

(6.3) $y_5 = \mu_5 + \theta_3 + \theta_4 + \theta_5 + \epsilon_5$

Here, 3rd grade cumulative learning for this student ($y_3$) is viewed as the baseline year, with an average test score of $\mu_3$ within (for example) the state, a 3rd grade teacher contribution ($\theta_3$), and a 3rd grade random error ($\epsilon_3$). The teacher contribution and error are therefore deviations from the state average. Fourth grade cumulative learning is decomposed into a state average ($\mu_4$) and deviations from this average in the form of both 3rd and 4th grade teacher contributions ($\theta_3$ and $\theta_4$) and the 4th grade random error. Fifth grade cumulative learning is similarly decomposed into the state average ($\mu_5$), teacher effects in grades 3, 4, and 5, and an error in grade 5 ($\epsilon_5$). Each teacher's effect is therefore assumed to persist without decay into the future, and the MRM is frequently referred to as a layered model because teacher effects in later periods appear to be layered on top of those from earlier periods. More advanced versions of the MRM (e.g., Wright, White, Sanders, & Rivers, 2010) incorporate partial teacher effects for settings where a student had more than one teacher for a subject. In addition, they may include fixed effects dummy variables for subject, year, and grade level. The $\theta$ terms, the teacher effects, are estimated as empirical Bayes residuals. These residuals satisfy the ignorability assumption if they are not associated with unmeasured correlates of the outcome. The layering in the MRM requires very strong assumptions about teacher effect persistence: teacher effects cannot be attenuated by later teachers or decay of their own accord, and they cannot increase over time.
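The layering in (6.1)–(6.3) can be made concrete with a small simulation; the grade means, variance components, and one-teacher-per-student assignment below are illustrative assumptions, not EVAAS parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
n_students = 1000

mu = {3: 400.0, 4: 420.0, 5: 440.0}   # assumed state average scores by grade
sd_theta, sd_eps = 5.0, 10.0          # assumed teacher-effect and error SDs

# one hypothetical teacher contribution and error per student per grade
theta = {g: rng.normal(0.0, sd_theta, n_students) for g in (3, 4, 5)}
eps = {g: rng.normal(0.0, sd_eps, n_students) for g in (3, 4, 5)}

# layered structure: every earlier teacher effect persists undiminished
y3 = mu[3] + theta[3] + eps[3]
y4 = mu[4] + theta[3] + theta[4] + eps[4]
y5 = mu[5] + theta[3] + theta[4] + theta[5] + eps[5]
```

The grade-3 teacher's contribution enters $y_3$, $y_4$, and $y_5$ with the same weight, which is exactly the no-decay persistence assumption discussed above.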

The MRM has limited practical application at the statewide level because of the high computational resources it requires (McCaffrey et al., 2004); in practice, it has typically been run on a districtwide basis. For statewide models, an alternative, the univariate response model (URM), has been used instead. The URM is a version of the EVAAS hybrid model that accommodates measuring student growth using tests that are not developmentally scaled, such as high school end-of-course tests (Wright, White, Sanders, & Rivers, 2010). Like the MRM, it includes no covariates other than prior test scores in multiple subjects. Fixed effects dummy variables are not incorporated into the model; instead, fixed effects are incorporated via de-meaning of lagged scores or pretests using a multi-step process. First, the difference between each student score at time w, w-1, and w-2 and the grand mean of school means at each time point is calculated. The exams from which these scores are obtained do not need to be on the same developmental scale as the outcome, or as each other. Second, the covariance matrix, here shown partitioned into current test score ($y$) and lagged test score ($\mathbf{x}$) sections, is estimated:

(7.1) $\mathbf{C} = \begin{pmatrix} c_{yy} & \mathbf{c}_{xy}' \\ \mathbf{c}_{xy} & \mathbf{C}_{xx} \end{pmatrix}$

The expectation-maximization algorithm is used to estimate $\mathbf{C}$ in the presence of conditionally random missing values. Third, the coefficients of a projection equation are estimated as follows:

(7.2) $\mathbf{b} = \mathbf{C}_{xx}^{-1}\mathbf{c}_{xy}$

Fourth, the following projection equation, with the elements of $\mathbf{b}$ as its coefficients, is estimated. It predicts a composite of students' previous test scores, spanning two years and two subjects, that have been recalibrated using de-meaning or mean-centering as pooled-within-teacher values ($m$ = math and $r$ = reading):

(7.3) $\hat{y} = \hat{\mu}_y + b_{m1}(x_{m1} - \hat{\mu}_{m1}) + b_{r1}(x_{r1} - \hat{\mu}_{r1}) + b_{m2}(x_{m2} - \hat{\mu}_{m2}) + b_{r2}(x_{r2} - \hat{\mu}_{r2})$

The $x$ terms (with $m$ for math and $r$ for reading, and 1 for a one-period lag and 2 for a two-period lag) are each student's test scores in the specified periods and subjects. The $\hat{\mu}$ terms are means of the teacher means of these test scores, with $\hat{\mu}_y$ as the mean in the current period, and the $b$ terms are the elements of $\mathbf{b}$ from (7.2). Finally, substitute the composite into the following two-level model (students nested in teachers) and, just as in the previous multilevel models, estimate the teacher effect using the empirical Bayes residual:

(7.4) $y_{ij} = \mu + \beta\hat{y}_{ij} + \theta_j + \epsilon_{ij}$

The nesting in this final model is of students within teachers in a single school year, with no accounting for the nesting within schools. In addition, the teacher effect estimation uses only one subject, despite the use of two subjects' data to estimate the composite. To satisfy the ignorability requirement of the potential outcomes model, $\theta_j$ must not be associated with unmeasured factors that are also associated with the outcome. This means that the composite $\hat{y}_{ij}$ must subsume all such factors. This is a strong and potentially impractical requirement. As far as we know, the URM, which is the VAM currently used by EVAAS on a statewide basis, has not been evaluated in any peer-reviewed study.
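The URM's centering, covariance, and projection steps (7.1)-(7.3) can be sketched with synthetic data. In this sketch, simple grand-mean centering and a complete-data covariance stand in for the mean-of-school-means centering and EM estimation described in the text, and all data are fabricated for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# fabricated scores: a current-year score y plus four correlated lagged
# scores (math and reading, lags 1 and 2) as the columns of X
L = rng.normal(size=(5, 5))
scores = rng.normal(size=(n, 5)) @ L.T
y, X = scores[:, 0], scores[:, 1:]

# Step 1: de-meaning (grand-mean centering as a stand-in)
yc, Xc = y - y.mean(), X - X.mean(axis=0)

# Step 2: partitioned covariance matrix (7.1); EM for missing data omitted
C = np.cov(np.column_stack([yc, Xc]), rowvar=False)
c_xy, C_xx = C[1:, 0], C[1:, 1:]

# Step 3: projection coefficients (7.2): b = C_xx^{-1} c_xy
b = np.linalg.solve(C_xx, c_xy)

# Step 4: composite prediction of the current score from lagged scores (7.3)
y_hat = y.mean() + Xc @ b
```

The omitted final step (7.4) would enter `y_hat` as the sole predictor in a two-level model and take the empirical Bayes residual as the teacher estimate.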

**Summary of Models**

All of the models except two (the TFE and DOLS) obtain the teacher effect indirectly via a type of residual: either a parameter variance (in the random coefficients models) or an error variance (in the fixed effects models). The random effects models, as well as the teacher fixed effect model, must control for student unobserved heterogeneity through covariates: available or measured inputs to learning, or proxies that may be highly correlated with the actual inputs. The student fixed effects model, which controls for heterogeneity by demeaning so that time-invariant confounding factors are eliminated, does not address confounding time-varying factors. The IVE models add a further control for confounding time-dependent inputs by instrumenting with a twice-lagged test score. None of the models explicitly addresses SUTVA, though some may make accommodations for violations of this assumption. For example, the random effects models could include an additional parameter variance for the classroom level, which would distinguish between effects due to the teacher and effects due to the teacher's interaction with each group of pupils assigned in each year, thereby directly addressing the SUTVA assumption. The teacher fixed effects model could also include interactions between each teacher indicator and student background characteristics, although the model would become even more unwieldy than it already is with J teacher effects being estimated.
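The empirical Bayes shrinkage residual that most of these models rely on can be illustrated in the simplest random-intercept case; the variance components and balanced class sizes below are assumed for illustration, and the reliability formula is the standard one:

```python
import numpy as np

rng = np.random.default_rng(2)
tau2, sigma2 = 25.0, 100.0      # assumed teacher and student variance components
n_teachers, n_per = 200, 20     # assumed balanced design

theta = rng.normal(0.0, np.sqrt(tau2), n_teachers)   # true teacher effects
scores = theta[:, None] + rng.normal(0.0, np.sqrt(sigma2), (n_teachers, n_per))

# shrinkage factor (reliability): how much of a class mean is signal vs. noise
lam = tau2 / (tau2 + sigma2 / n_per)

# empirical Bayes estimate: the class-mean residual, shrunk toward zero
theta_eb = lam * (scores.mean(axis=1) - scores.mean())
```

Noisier class means (smaller `n_per`) are pulled harder toward zero, which is why random effects estimates are less variable than raw class-mean residuals.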

Many of these models have been subjected to rigorous comparison for estimating teacher effects, while others have not. Ideally, these comparisons would be made against an absolute standard: the true teacher effect. Because the true rankings are not typically known, scholars have relied largely on simulation studies in which true effects are generated by a known process and then compared to the estimated effects obtained from VAMs fit to these data. Actual data, by contrast, have been used to assess relative performance, including consistency across years in the ranking of each teacher, and to assess the plausibility of the ignorability assumption. We now review these studies.

**VAM Comparison Studies**

Several studies have compared the performance of different methods (random and fixed effects methods, for example) for estimating VAMs. Some of these studies have used actual data, while others have used simulated data. The simulation studies are based on stylized and simplified data but allow the models to be compared to a known true effect. They also allow the sensitivity of the findings from different models to deviations from optimal assumptions to be examined and compared. The actual data studies permit relative statements and also enable the examination of year-to-year consistency in estimates. We discuss the findings in the context of the potential outcomes model and the demands it places on teacher effect estimates. This context greatly simplifies the discussion, as a number of the studies cited provide highly detailed accounts of assumptions concerning parameters that are unobserved correlates of student learning and may or may not be relevant to teacher effect estimation. The potential outcomes framework provides some clarity regarding whether and how these may affect estimation of teacher effectiveness. As the studies are a wide mix of data and model types, we review them in chronological order. Some were conducted for the purpose of school accountability rather than teacher accountability, but the inferences can reasonably be extended to teacher effects, despite some differences in how confounding may emerge for teacher and school effects.

Tekwe et al. (2004) examined VAMs for school accountability (rather than teacher performance) and compared several models: a school fixed effects model with a change score regressed on school dummy variables; an unconditional hierarchical linear model (HLM) of a change score regressed on school random effects; a conditional HLM of the change score regressed on student covariates for minority status and poverty, school covariates for mean pretest and percentage in poverty, and a school random effect; and a layered mixed effects model of the type developed for EVAAS. The authors found that the layered and fixed effects models were highly correlated; that the conditional HLM differed from the fixed effects and layered models, owing largely to the inclusion of student covariates; and that the unconditional HLM was very similar to the fixed effects and layered models. This study makes important claims about the relationships among various fixed and random effects specifications, but because the true model is not known, it cannot establish which one is better.

McCaffrey et al. (2004) compared the performance of multiple random effects or mixed effects models on simulated data with correlated covariates that reproduced the complex nesting of students in multiple teachers, with different cross-nestings of students over time. The models included a standard cross-sectional hierarchical linear model (HLM) predicting achievement from covariates including prior scores; a serial cross-section design using gains as the dependent variable; a cross-classified linear growth HLM, where the cross-classification takes account of the nesting in multiple teachers over multiple years; a layered (EVAAS/MRM) model; and a "general" model that incorporates covariate data and estimable teacher effect persistence parameters into the layered model framework. The other models were thus characterized as variously restricted versions of the general model. The authors simulated a small sample of 200 students in 10 classes of 20 for four grade levels and two schools using the general model and then compared the estimates from each of the other models. Three scenarios were tested based on ignorability (no violation; differential assignment of fast and slow learners to classes; the same differential assignment over schools), but teachers were randomly assigned to classes. The cross-classified and layered models performed better under all three scenarios, having higher correlations with the true effects. In addition, the authors demonstrated these models on actual data from five elementary schools in a large suburban school district, using the general model and the layered model with a single covariate (subsidized lunch eligibility), finding that their correlations ranged from .69 to .83.
In both the simulated and actual findings, the comparisons were expressed solely as correlations, and it is not clear how many teachers would have been misclassified as ineffective under the various models and scenarios examined.

Guarino, Reckase, and Wooldridge (2012) conducted the most diverse simulation study to date, testing the sensitivity of a set of six VAMs to different student and teacher assignment scenarios.

For the assignment of students to classrooms, two types of mechanisms were considered: dynamic (based on the previous year's test score) and static (based on potential or actual learning at the time of enrollment). Teacher assignment to the classroom included a random scenario and scenarios of systematic assignment, with high-performing students assigned to high-performing teachers or, alternatively, to low-performing teachers. The six models included a pooled ordinary least squares model with achievement in time t as the outcome (DOLS); the Arellano and Bond (1991) IV model, using twice-lagged achievement, assumed to be uncorrelated with current and lagged errors, as an instrument; a pooled OLS gain score model, which is similar to the DOLS except that lagged achievement is forced to have a coefficient of 1; an "average residual" model that is similar to the teacher fixed effect model with covariates but excludes the teacher dummy variables (instead, the teacher estimates are obtained from a second-stage averaging of the student residuals to the teacher level); a student fixed effect model using the gain score as the outcome; and a variation on the gain score model with random effects for teachers. Using Spearman rank order correlations, Guarino, Reckase, and Wooldridge found that the DOLS was the best performing model. The DOLS was the only model that did not incorporate differencing on the left-hand side, and it controlled directly for ignorability by incorporating previous achievement with a freely estimable parameter. A random effects variation on this model was not tested. Non-random groupings of students had minor effects on the results, but non-random teacher assignment had a deleterious effect on all of the estimates, particularly for heterogeneous student groupings with negative assignment to the teacher.
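The DOLS specification amounts to an OLS regression of current achievement on the lagged score (with a freely estimated coefficient) plus teacher indicators; a minimal sketch under random assignment, with all parameter values assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n_teachers, n_per = 30, 25
n = n_teachers * n_per

teacher = np.repeat(np.arange(n_teachers), n_per)
theta = rng.normal(0.0, 3.0, n_teachers)        # true teacher effects
y_lag = rng.normal(50.0, 10.0, n)               # random-assignment scenario
y = 10.0 + 0.8 * y_lag + theta[teacher] + rng.normal(0.0, 5.0, n)

# design matrix: lagged score (free coefficient) plus teacher dummies
D = np.zeros((n, n_teachers))
D[np.arange(n), teacher] = 1.0
Xmat = np.column_stack([y_lag, D])
coef, *_ = np.linalg.lstsq(Xmat, y, rcond=None)
theta_hat = coef[1:]                            # teacher effects (plus intercept)

def spearman(a, b):
    # rank-order correlation, as used to compare estimated and true rankings
    ra, rb = a.argsort().argsort(), b.argsort().argsort()
    return np.corrcoef(ra, rb)[0, 1]
```

Because assignment is random here, the recovered rankings track the truth closely; the study's point is that this agreement degrades under non-random teacher assignment.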

Finally, Schochet and Chiang (2010) used a simulation based on the decomposition of variance into student, classroom, teacher, and school components to compare OLS and random effects estimates of teacher effects using error rate formulas. These formulas estimate the probability that a teacher in the middle part of the distribution of teachers will be found highly ineffective, and that a teacher who is truly ineffective will be considered no worse than average.

Variance estimates for the decomposition were based on reported results in the literature. The authors demonstrated that under typical sample sizes, error rates were 10% for the OLS and 20% for the random effects models.
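The error-rate logic can be mimicked with a small simulation; the variance components, cutoffs, and definition of an "average" teacher below are illustrative assumptions rather than the values used by Schochet and Chiang:

```python
import numpy as np

rng = np.random.default_rng(4)
n_sim = 100_000

tau2 = 1.0   # assumed variance of true teacher effects
se2 = 1.0    # assumed sampling variance of each teacher's estimate

theta = rng.normal(0.0, np.sqrt(tau2), n_sim)        # true effects
est = theta + rng.normal(0.0, np.sqrt(se2), n_sim)   # noisy estimates

cut = np.quantile(est, 0.10)                         # flag the bottom 10% of estimates
flagged = est < cut

# error 1: a teacher in the middle half of the true distribution is flagged
q25, q75 = np.quantile(theta, [0.25, 0.75])
average = (theta > q25) & (theta < q75)
false_flag_rate = flagged[average].mean()

# error 2: a truly bottom-decile teacher looks no worse than the median estimate
bottom = theta < np.quantile(theta, 0.10)
miss_rate = (est[bottom] >= np.median(est)).mean()
```

With estimate noise as large as the true effect variance, both error rates in this toy setup are nontrivial even though no confounding is present.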

As yet, no study has attempted to compare both fixed and random effects models of multiple types; Guarino, Reckase, and Wooldridge (2012) limited their examination to one random effects model, and it used a change score (imposing a coefficient of unity on the lag or pretest) rather than the achievement score itself. In this study, we compare a set of nine fixed and random effects models, using rank order correlations and practical criteria such as misclassification, to answer the following questions: