Consortium for Educational Research and Evaluation–North Carolina
Comparing Value-Added Models for Estimating Individual Teacher Effects on a Statewide ...
1. Are the rankings of teachers from VAMs highly correlated with the ranking of “true” teacher effects under ideal (i.e., minimally confounded) conditions?
2. How accurately do the teacher effect estimates from VAMs categorize a teacher as ineffective, and what proportion of teachers would be misclassified?
3. How accurately do VAMs rank and categorize teachers when SUTVA is violated and classroom variance accounts for a proportion of the teacher effect?
4. How accurately do VAMs rank and categorize teachers when ignorability is violated and student effects are correlated with classroom, teacher, and school effects?
5. How similar are rankings of VAM estimates to each other?
6. How consistent are VAM estimates across years?
For each research question, we examined the relative performance of each VAM. To answer questions 1–5, we used simulation data, which gave us the advantage of having a known “true” effect. To answer questions 5 and 6, we used actual data collected from North Carolina schools.
Methods

We subjected nine models developed from the seven models described previously to examination and comparison using two types of simulation data and actual data collected in North Carolina.
The nine models are summarized in Table 1. Three variations on the nested random effects model were estimated. The HLM2 was a two-level model in which only one pretest was used and no school effect was estimated. The HLM3 was a three-level model in which only one pretest was used. The HLM3+ was a three-level model in which all pretests were used and a school effect was estimated. The five types of fixed effects models—student, teacher, student IVE, teacher IVE, and the DOLS—were all included exactly as described above. Finally, while an effort was made to examine the most widely known EVAAS model, the multivariate response model (MRM), it was ultimately not incorporated into this study due to its high computational demands given the size of the data sets used. Instead, we examined the URM, which is currently being implemented on a statewide basis in several states, including North Carolina. We briefly revisit this limitation in the discussion. Three years of matched student and teacher data were available for estimation in both the simulation and the actual data, with the actual data also having two additional years of student end-of-grade performance matched to the students but not to the teachers from those years. We first discuss the data generation process for the simulations, then the actual North Carolina data used. Finally, we discuss the methods used to compare the nine approaches.
Table 1. Summary of Value-Added Models
Data Generation Process

Analyses were conducted based on simulations of “typical” student, teacher, and school data, which enabled us to control the data generation process, thereby providing knowledge of “true” estimates of each simulated teacher’s effectiveness against which to compare the results of each VAM. We assumed that the purpose of the VAM is to inform comparisons among teachers in a statewide evaluation system. We made several simplifications to make the data generation process and estimation more tractable. First, school and LEA membership of both teachers and students were fixed, and consequently neither students nor teachers could change school or LEA;
second, we estimated models for the typical elementary school organization, in which students and teachers are assigned to a classroom where instruction in several subjects occurs; third, we used only 5th grade to assess teacher effectiveness; fourth, we created datasets with complete data (no missing data); fifth, only one subject was used as an outcome, though in some cases two subjects were used as control variables; and sixth, the simulated data consisted of multiple districts but were much smaller than the population of districts in a state such as North Carolina, the relevant reference since we employed actual data from there. However, enough teacher records were generated to ensure that the teacher effect estimates were not subject to problems commonly found in small samples (see below for sample sizes).
Two different simulations were used, each answering a different question regarding violations of assumptions. In the first, the data generation process was developed via variance decomposition of each student’s score into each level of schooling (student, classroom, teacher, and school). In this simulation, each teacher effect was homogeneous across all students that teacher taught. This simulation focused on the effect that unaccounted-for classroom variance had on the ranking of the teacher effects from the nine models. A student covariate was included, but its correlation with the teacher effect was modest and had almost no effect on the estimates. Consequently, the data generation process of the second simulation created correlated student, classroom, teacher, and school covariates. This process yielded a heterogeneous teacher effect, with teacher effects varying across students who shared the same teacher, while maintaining the desired level of correlation with the student, classroom, and school covariates. To infer teacher-level effectiveness, the mean of the teacher effect was used as the “true” estimate. Both simulations were multilevel, with random errors simulated to vary only at the appropriate level (e.g., classroom errors did not vary within classrooms).
Variance Decomposition Simulation
We devised this simulation to answer questions 1, 2, and 3. In this simulation, each level of the data generation process (student, classroom, teacher, school, and LEA) was associated with a pre-specified variance component representing that level’s proportion of overall variance in the outcome. To make the data as realistic as possible, the nesting structure for the simulation was a multiple membership, multiple classification (MMMC) design. In this design, multiple cohorts of students were observed across three grades (3rd to 5th) over a period of three years for each cohort. The data generated with this process were uniquely designed to answer the question of the effect of violations of SUTVA (question 3), which may occur regardless of student-teacher assignment (ignorability) and thus could be examined without regard to correlations between student background and teacher effectiveness. Consequently, during these three grades, simulated students were randomly sorted each year and assigned to teachers in these different randomly ordered groupings. The variance components for each level or type of input were then converted to standard deviations and multiplied by simulated standard normal random variables to identify each input’s contribution to student learning.
The statewide mean for each of six standard normal (mean 0, variance 1) test scores, covering three grade levels and two subjects, was specified and then added to the subject-area-specific but time-invariant student (u_i), classroom (u_c), teacher (u_j), school (u_s), and LEA (u_d) effects created via the variance decomposition, as well as a random or measurement error component (e), to arrive at the total score for each student in each grade level and subject, as follows (with i = student, c = classroom, j = teacher, s = school, and k = subject, all defined as above, adding d for district and g for grade):

(8) Y_icjsdkg = μ_gk + u_ik + u_ck + u_jk + u_sk + u_dk + e_icjsdkg

The true teacher effect was the subject-area-specific teacher input to student learning entered into model 8 (u_jk). Each teacher was assumed to teach one group of 17–23 students (randomly determined) in any given year and to teach a common group of students in two subjects (math and reading). Therefore, for any cohort and subject area, the peer effect and teacher effect could not be distinguished in VAM estimation, and multiple cohorts were thus needed to separately estimate a classroom and teacher effect at this schooling level. We generated two cohorts of students. In each simulated cohort, 99,252 records were generated, consisting of 16,542 students taking three end-of-grade exams in each of two subjects. A total of 833 teachers were simulated across 184 schools in 14 districts. The amount of variance attributed to the classroom was varied, taking a value of either 0% or 4%.
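The variance-decomposition generator can be sketched in a few lines. This is a minimal illustration for a single cohort, grade, and subject, with hypothetical variance shares (not the study's exact inputs); the school and LEA levels are folded into the error term for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical variance shares for each level; they must sum to 1.
# School and LEA shares are lumped into "error" in this sketch.
shares = {"student": 0.80, "classroom": 0.04, "teacher": 0.11, "error": 0.05}
sd = {k: np.sqrt(v) for k, v in shares.items()}

n_students, class_size = 1020, 20
n_classes = n_students // class_size  # one teacher per classroom here

# One standard normal effect per unit, scaled by its level's standard deviation.
student_eff = sd["student"] * rng.standard_normal(n_students)
class_eff = sd["classroom"] * rng.standard_normal(n_classes)
teacher_eff = sd["teacher"] * rng.standard_normal(n_classes)

# Random assignment of students to classrooms, so ignorability holds by design.
assignment = rng.permutation(n_students) % n_classes

score = (student_eff
         + class_eff[assignment]
         + teacher_eff[assignment]
         + sd["error"] * rng.standard_normal(n_students))
```

Because every component is drawn standard normal and scaled by the square root of its variance share, the total score has variance close to 1 by construction, and each level's contribution is exactly its pre-specified share in expectation.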
Heterogeneous Fixed Effects Simulation
The second simulation, designed to answer questions 1, 2, and 4, made use of a correlation matrix decomposition procedure that allowed us to feed into the simulation the desired level of correlation between two student covariates, one classroom and one teacher covariate in each of three grade levels, and a school covariate (Vale & Maurelli, 1983). The resulting “effects” at each level were heterogeneous because they varied within their respective units; e.g., the classroom effect varied within the classroom. However, the procedure provided a high degree of control over the level of correlation between covariates at each level of schooling, which was necessary to generate non-randomness in the assignment of students to teachers and therefore to answer questions related to ignorability (question 4). The correlated covariates included one time-invariant student background effect for each of two subjects (X_ik); one classroom effect for each of three grade levels for a specific subject (X_cwk, with w = 1, 2, 3); one teacher effect for each grade level for a specific subject (X_jwk); and one grade- and subject-invariant school effect (X_s; the school effect subsumed the district-level effect). The correlations between classroom effects, or between classroom and teacher effects, were set at .20; between teacher effects across the three grades at .50; and between classroom or teacher and school effects at .20. The correlation between student effects and all others was varied, being either -.20 or .20.

Similar to the variance decomposition simulation, a set of random effects representing residual variance at each level (r_i for students, r_c for classrooms, r_j for teachers, and r_s for schools) was then simulated at its respective level (e.g., classroom residuals did not vary within classroom) and multiplied by standard deviations derived from pre-specified variances. A subject- and grade-specific state grand mean (μ_wk) and residual (e_icjskw) were also specified. All of these fixed and random effects were added together to produce the total achievement score for each student, as follows:

(9) Y_icjskw = μ_wk + X_ik + X_cwk + X_jwk + X_s + r_i + r_c + r_j + r_s + e_icjskw

The teacher effect, for the purposes of identifying a true value and estimating teacher effects from each VAM, was the teacher-level mean of the heterogeneous fixed teacher effect. This design had a much more parsimonious structure than the variance decomposition design, with peer groups advancing to the next grade level together rather than being re-sorted within cohort from year to year. We simulated 40,000 students in 2,000 classrooms, with two classrooms per teacher (representing two cohorts of students) and 1,000 teachers.
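With normal marginals, the core of this correlation-decomposition step reduces to multiplying standard normal draws by a Cholesky factor of the target correlation matrix (Vale and Maurelli's procedure extends the same idea to non-normal marginals). A simplified three-variable sketch using illustrative correlations from the text:

```python
import numpy as np

rng = np.random.default_rng(1)

# Target correlation matrix for [student, teacher_grade1, teacher_grade2]:
# student-teacher = .20, teacher-teacher across grades = .50 (illustrative
# subset of the full covariate set described above).
R = np.array([[1.0, 0.2, 0.2],
              [0.2, 1.0, 0.5],
              [0.2, 0.5, 1.0]])

L = np.linalg.cholesky(R)            # R = L @ L.T
Z = rng.standard_normal((100_000, 3))  # independent standard normal draws
X = Z @ L.T                           # rows now carry correlation matrix ~ R

print(np.round(np.corrcoef(X, rowvar=False), 2))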
Calibration of Inputs
The inputs to the simulation that needed to be representative were the proportions of variance at each level, and we used two sources of information to justify them. First, we examined actual NC data for the grade levels in question. In elementary school, the math decomposition showed that little more than 10% of the variance was between teachers, with about 80% between students and the remainder between schools; reading was similar, with 9% of the variance between teachers and 81% between students. We confirmed these inputs, to the extent possible, using the Nye, Konstantopoulos, and Hedges (2004) variance decomposition study (see the authors’ findings as well as their Table 1), which suggested that teacher variance of around 11% is consistent with the norm, though the grade levels they examined (1st through 3rd) were lower than the grade level used in the current study (5th).
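A decomposition of this kind can be approximated from nested data by comparing the variance of teacher-level means with the total variance, after correcting the teacher means for within-teacher sampling noise. The sketch below applies this to simulated data with a known 10% between-teacher share; the column names and sample sizes are hypothetical, not the NC data layout.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n_teachers, per_teacher = 200, 25

# Simulate long-format scores with a 10% between-teacher variance share.
df = pd.DataFrame({"teacher": np.repeat(np.arange(n_teachers), per_teacher)})
teacher_eff = np.sqrt(0.10) * rng.standard_normal(n_teachers)
df["score"] = teacher_eff[df["teacher"]] + np.sqrt(0.90) * rng.standard_normal(len(df))

# Variance of teacher means overstates the between-teacher component by
# within-variance / class size, so subtract that term before dividing.
grand_var = df["score"].var(ddof=1)
within = df.groupby("teacher")["score"].var(ddof=1).mean()
between = df.groupby("teacher")["score"].mean().var(ddof=1) - within / per_teacher

print(round(between / grand_var, 2))  # expected near the 0.10 input share
```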
Number of Simulations
We ran 100 simulations of each of the designs specified above. This number is low relative to what is generally recommended for simulation studies, which is normally in the thousands.
However, some concessions were required to keep the project manageable (each iteration of the 100 simulations required several hours to a full day to process). Several design components facilitated the use of a smaller number of simulations.
First, we were not conducting hypothesis tests but comparing the estimands in each model with the “true” effect using a number of criteria (discussed in the next section). Second, a larger number of simulations is generally used to smooth out the variability between simulations imposed by measurement or random error; alternatively, this variability can be reduced simply by minimizing the proportion of variance attributed to measurement error. Therefore, in these simulations, the amount of measurement or random error was constrained to a fixed proportion of the variance (1%). A sensitivity test was conducted to determine whether 100 simulations were sufficient, comparing the results to a version with 1,000 simulations. When measurement or random error was sufficiently low, the findings for 100 simulations were nearly identical to the findings for 1,000 simulations.
Actual NC Data Analysis

The second analysis was conducted on actual North Carolina data collected between 2007–08 and 2009–10, with some test score data also available from 2005–06 and 2006–07. This analysis was used to answer questions 5 and 6. Both math and reading end-of-grade standardized exam scores were used. While no “true” effect is known in this analysis, the data are the true North Carolina student performance data, not simplified as in the simulated data.

In addition to the lagged scores or pretests specified in each model, all relevant and commonly available student, peer, and school characteristics were incorporated into the analysis. These included student race/ethnicity, subsidized lunch, limited English proficiency, disability and academic giftedness, within- and between-year movement, under-age and over-age indicators, previous years’ peers’ average standardized exam score, and an indicator of the year. Race/ethnicity and year were not included in the student fixed effects model or the two IVE models. In selected models (excluding the TFE and TFEIV), classroom covariates (class rank relative to the 75th percentile in limited English proficiency, disability or giftedness, free lunch eligibility, and overage) and school covariates (proportion by race/ethnicity, total per-pupil expenditures, percent subsidized-lunch eligible, violent acts per 1,000 students and suspension rates in the previous year, and enrollment) were included. No covariates at any level were entered into the URM. No teacher characteristics were included because such characteristics could explain the very teacher effect we wanted to estimate.

The data used in this study consisted of all student records in 5th grade in North Carolina public schools matched via rosters to their teachers during three focal years.
If a student had multiple teachers, records were weighted according to the proportion of the school year shared with each teacher.
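Such weighting can be computed directly from roster data. A minimal sketch, assuming a hypothetical roster table with a months-shared field (the actual NC roster layout may differ):

```python
import pandas as pd

# Hypothetical roster: one row per student-teacher link, with the number of
# months of the school year the student spent with that teacher.
roster = pd.DataFrame({
    "student": [1, 1, 2],
    "teacher": ["A", "B", "A"],
    "months_shared": [6, 3, 9],
})

# Weight each record by its share of the student's total instructional time.
total = roster.groupby("student")["months_shared"].transform("sum")
roster["weight"] = roster["months_shared"] / total
print(roster)
```

Each student's weights sum to 1, so a student split between two teachers contributes fractionally to each teacher's estimate rather than being counted twice.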