Exploring Explanations for the “Weak” Relationship Between Value Added and
Observation-Based Measures of Teacher Performance
Mark Chin and Dan Goldhaber
Since 2009, 49 states and the District of Columbia have changed their teacher evaluation
systems in response to federal incentives, such as flexibility waivers to No Child Left Behind and
Race to the Top grants.[1] In many cases, teacher evaluation reforms have included the use of
student growth, or “value-added,” measures of teacher performance. These measures of teachers’ contributions to student performance on standardized tests represent a relatively new way to assess practicing teachers, though value-added models have been employed as an analytic tool by researchers for decades (e.g., Hanushek, 1971; Murnane, 1981). Value-added measures are also controversial (Baker et al., 2010; Darling-Hammond, Amrein-Beardsley, Haertel, & Rothstein, 2012) and can only be used to assess teachers in tested grades and subjects, who represent less than 33 percent of the teacher workforce (Papay, 2012). Not surprisingly, given that classroom observation has a long history as an evaluation tool and can be used to assess all teachers, virtually all states also include observations of teachers’ classroom practice as a component of a summative evaluation (Doherty & Jacobs, 2013).
It is unclear what the relationship ought to be between value-added and observational measures, but the relationship is often characterized as being “modest” or “weak” (e.g., Harris, 2012). Moreover, some judge the relationship between these measures (described more extensively below) to be problematic for policymakers who might wish to use value added and observations together to identify effective or ineffective teachers. Audrey Amrein-Beardsley (2014), for instance, notes that “value-added scores do not align well with observational scores, as they should if both measures were to be appropriate[ly] capturing the ‘teacher effectiveness’ construct”. Notwithstanding the characterization of the relationship between value-added and observational measures, several scenarios exist that result in a weak correlation; not all of them suggest that the two measures capture different teacher effectiveness constructs. Variation in the multidimensionality, validity,[2] and reliability of value added and observations distinguishes these scenarios from one another.

[1] See Minnici, 2014.

2/26/2015. Please do not cite or distribute without consent of the authors.
Few studies have investigated the scenarios that might explain attenuated correlations between value-added and observational measures, or have suggested which scenarios are unlikely given the correlations observed in prior research. Our paper explicitly illustrates these different scenarios and uses simulated data to formally investigate the extent to which one or another explanation is likely to account for weak correlations between the measures. We explore the levels of correlation between value-added and observation scores after varying two broad factors. First, we adjust the correlation between each teacher’s score on an underlying dimension of “teacher quality” and its two proxy measures: error-free value added and error-free observational measures of teacher practice. This adjustment allows us to investigate the effect of changes in the validity of these measures. Second, we add error to these measures to create simulated outcomes (i.e., “student test performance” or “lesson performance”) and vary the number of outcomes used to estimate measure scores. This adjustment allows us to investigate the effect of changes to measure reliability. With the results from our simulations, we attempt to answer the following research question: What is the magnitude of the correlation between value-added and observation scores, given different levels of validity and reliability for each measure of teacher quality?
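The simulation design just described lends itself to a compact sketch in code. The snippet below is a minimal illustration under assumptions of our own (a standard-normal latent quality dimension, independent normal sampling error, and hypothetical function names and parameter values), not the authors' actual procedure: a latent teacher-quality dimension generates two error-free measures at chosen validity levels, and averaging a chosen number of noisy outcomes per teacher sets each measure's reliability.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_corr(n_teachers=10_000, validity_va=0.9, validity_obs=0.9,
                  n_tests=25, n_lessons=4, noise_sd=1.0):
    """Correlation between averaged test outcomes and averaged lesson scores
    when both are noisy proxies for one latent teacher-quality dimension."""
    tq = rng.standard_normal(n_teachers)  # latent "teacher quality"
    # Error-free measures, correlated with TQ at the chosen validity levels
    true_va = validity_va * tq + np.sqrt(1 - validity_va**2) * rng.standard_normal(n_teachers)
    true_obs = validity_obs * tq + np.sqrt(1 - validity_obs**2) * rng.standard_normal(n_teachers)
    # Simulated outcomes add sampling error; averaging more outcomes per
    # teacher (more tested students, more observed lessons) raises reliability
    va_scores = (true_va[:, None] + noise_sd * rng.standard_normal((n_teachers, n_tests))).mean(axis=1)
    obs_scores = (true_obs[:, None] + noise_sd * rng.standard_normal((n_teachers, n_lessons))).mean(axis=1)
    return np.corrcoef(va_scores, obs_scores)[0, 1]

print(simulate_corr())                               # high validity, many outcomes
print(simulate_corr(validity_obs=0.5, n_lessons=2))  # lower validity, fewer lessons
```

Classical measurement theory predicts the pattern such a sketch produces: the observed correlation is approximately the correlation between the error-free measures multiplied by the square root of the product of the two reliabilities, so lowering either validity or the number of outcomes attenuates it.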
In what follows, we recount the historic use of value-added and observational measures in teacher evaluation systems, the research on their relationship, and the factors that impact this relationship, as well as the simulation parameters we vary to reflect these key factors. After describing our process for creating the simulated data and our method of analysis, we discuss the simulations’ results (Section 4). Finally (in Section 5), we discuss the implications for researchers and practitioners and offer some concluding thoughts.

[2] We discuss two types of measure validity in our paper. The first refers to the extent to which value added and observations serve as good proxies for some desirable underlying dimension or dimensions of teacher quality. The second refers to the extent to which the performance of a teacher’s students on tests, or the performance of a teacher during observed lessons, reflects his or her true value-added or observation scores, respectively (also referred to as “systematic error”; see McCaffrey, Lockwood, Koretz, Louis, & Hamilton, 2004). We use the term “validity” to represent the first type, unless otherwise specified.
2. Value-Added and Observational Measures of Teacher Quality and Their Relationship

Value-added methods have long been used as a means of assessing both educational productivity and the effects of specific schooling inputs (e.g., Hanushek, 1971; Murnane, 1981).
They have also been used to assess the implications of differences amongst individual teachers and the extent to which individual teachers explain the variation in student test performance (e.g., Goldhaber, Brewer, & Anderson, 1999; Hanushek, 1992; Nye, Konstantopoulos, & Hedges, 2004). Though a few states and districts began using value-added and other related test-based measures of teacher quality in the late 1990s (Sanders & Horn, 1998), it is only in recent years that the use of value-added measures has proliferated across the nation. This proliferation has engendered debates amongst researchers and policymakers about whether value added is a fair measure of teachers’ contributions in the classroom, and, relatedly, how its use will affect teachers and students.
Value added has been linked to long-term student outcomes (Chetty, Friedman, & Rockoff, 2014b) and been shown to be unbiased in some experimental and quasi-experimental settings (Bacher-Hicks, Chin, Kane, & Staiger, in preparation; Chetty, Friedman, & Rockoff, 2014a; Kane & Staiger, 2008; Kane, McCaffrey, Miller, & Staiger, 2013). Yet questions remain about the extent to which value added may be used to obtain unbiased estimates of teacher performance (Rothstein, 2008, 2014), and, even if the measures are unbiased, whether they are stable enough from year to year to use,[3] or would have negative ramifications for teacher behavior (Baker et al., 2010; Darling-Hammond et al., 2012).
Scholars have similarly investigated the quality of teachers through their practices in the classroom for decades (Brophy & Good, 1986). Compared to value-added measures, classroom observations of teaching have played a role in evaluation systems for longer, yet have not faced the same level of academic scrutiny as value added (Corcoran & Goldhaber, 2013).[4] Recent work, however, has found that the traditional observation systems used in some states and districts failed to meaningfully differentiate teachers (Weisberg, Sexton, Mulhern, & Keeling, 2009). Revisions to preexisting observation systems have led some locales to adopt observation protocols developed by the academic community, such as the Danielson Group’s Framework for Teaching (Herlihy et al., 2014). These protocols, which are also widely used in research projects, identify key classroom practices that, in theory, should be important for student learning, and also standardize how teachers are evaluated on these practices.
The relationship between value added and observations

A number of the recently implemented educator evaluation reforms include the use of multiple measures of teacher quality, and many states and districts use both value-added and observational measures when assessing teachers’ performance (Herlihy et al., 2014). Not surprisingly, there is a growing research base that explores the extent to which these measures are related to one another. For example, the Measures of Effective Teaching (MET) project, a large-scale study of teacher quality, explored the relationship between teacher value added and observations and found correlations between the two measures ranging from 0.12 to 0.34, depending on the observation protocol (Kane & Staiger, 2012). With some exceptions (e.g., Schachter & Thum, 2004), most other recent studies have replicated this pattern of a weak or moderately weak relationship when analyzing similar observation protocols (e.g., Bell et al., 2012; Grossman, Loeb, Cohen, & Wyckoff, 2013; Hill, Kapitula, & Umland, 2011; Kane, Taylor, Tyler, & Wooten, 2011). These findings contradict what many scholars and practitioners might expect. Theory and intuition suggest that strong instructional practices should lead to improvements in student test performance; in this paradigm, value-added and observation scores should be highly correlated.

[3] See Goldhaber and Hansen (2013) and McCaffrey, Sass, Lockwood, and Mihaly (2009) for estimates of the stability of value added.

[4] See Cohen and Goldhaber (2015) for a review of this role and a comparison of what we know about the properties of observations and value added.
Furthermore, states and districts have practical reasons to be concerned about the weak relationships observed in the extant literature. A weak relationship may indicate that one or both measures are not valid measures of some dimension of teacher quality. It also sends conflicting signals to practitioners about their strengths and weaknesses, which in turn may inhibit the improvement of teachers’ practice (Polikoff, 2014). Finally, it could undermine trust in teacher evaluation systems, making it more politically difficult to use evaluations to inform key personnel decisions such as compensation or tenure (Herlihy et al., 2014).
Explanations for the weak relationship between value added and observations

There exist at least three scenarios that result in weak correlations between value added and observations. The first is that one or both measures could provide unreliable estimates of one or more dimensions of teacher quality, due to sampling error. The second is that teacher quality may be multidimensional, and the measures provide reliable estimates of different dimensions of teacher quality. And the third is that one or more of the measures may be invalid, in the sense that the measure does not provide a reliable estimate of any dimension of teacher quality. We provide simple illustrations of these scenarios in Figure 1.
In Panels A and B of the figure, we depict underlying dimensions of teacher quality (TQ) with the bullseyes in the targets. In practice, we use value added and observations to serve as proxy measures for each teacher’s quality, which we never observe. We also never observe each teacher’s true, error-free value-added or observation score. Instead, we estimate value-added and observation scores from two different observed outcomes, represented in the figure: student test performance (v) and performance on lessons (o), respectively. The clouds around each set of outcomes show the distribution of the data points used to estimate each measure, with a darker color representing estimates based on the aggregation of more information (e.g., from multiple student test results or multiple observed lessons). The dashed, two-headed arrow represents the distance, or correlation, between the two different measures of teacher quality; a shorter arrow indicates that the two measures align more closely. Moving from the left target to the right in either Panel A or B of Figure 1, the amount of information for each measure of teacher quality increases (e.g., through having more students’ test results or observing teachers’ lessons more often), increasing the reliability of each measure.
The leftmost illustration in Panel A depicts the first scenario for weak correlations, in which both measures serve as valid proxies for the same dimension of teacher quality but are estimated unreliably. Value added and observations could be estimated unreliably due to factors such as observing a teacher on a particularly good or bad day, or analyzing the test results of students who by chance perform well or poorly on a test; either would add sampling error to scores. To counteract sampling error in value added, many research projects estimate teachers’ value added using Empirical Bayes estimators, which shrink scores that are estimated less reliably (e.g., estimated from the test performance of fewer students) toward the mean (e.g., Kane & Staiger, 2008; Sanders & Horn, 1994).[5] Another way to counteract sampling error is to estimate value added and observations with as many data points as possible. For example, the stability of value-added measures, which is moderate when estimated from a single year of student test performance data (McCaffrey et al., 2009), improves when using multiple years of data (Goldhaber & Hansen, 2013). Though states and districts need to consider the financial and temporal burdens associated with reducing sampling error in value added and observations by increasing data points, improving measure reliability would disattenuate the relationship between the two.
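The Empirical Bayes shrinkage described above can be sketched directly. The snippet below is a simplified illustration (the function name and variance values are hypothetical, and real value-added models estimate the variance components rather than taking them as given): each raw score is pulled toward the grand mean in proportion to its unreliability, so scores based on fewer students shrink the most.

```python
import numpy as np

def eb_shrink(raw_scores, n_students, var_true, var_error):
    """Shrink raw value-added estimates toward the grand mean, with more
    shrinkage for scores estimated from fewer students."""
    raw_scores = np.asarray(raw_scores, dtype=float)
    n_students = np.asarray(n_students, dtype=float)
    # Reliability of each estimate: signal variance over total variance
    reliability = var_true / (var_true + var_error / n_students)
    grand_mean = raw_scores.mean()
    return grand_mean + reliability * (raw_scores - grand_mean)

# Two teachers with identical raw scores: the one observed through only
# 5 students is pulled toward the mean far more than the one with 100.
print(eb_shrink([0.5, 0.5, -0.5], [5, 100, 20], var_true=0.04, var_error=0.25))
```

The same mechanics would apply to observation scores, with lessons in place of students, though as the paper notes this adjustment is rarely made in practice.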
In research and practice, teacher value added and performance on observations are often treated as measures of the same underlying construct—the scenario depicted in Panel A.
However, there are reasons to believe that they are not, and that Panel B of Figure 1 depicts a more accurate representation of reality. Panel B of the figure illustrates a case where there are two dimensions of teacher quality (TQ1 and TQ2) and each measure of quality is a reliable estimate of only one of the dimensions. For example, one dimension might capture the degree to which teachers contribute to student knowledge, and a second dimension might be the extent to which teachers contribute to students’ ability to interact productively with one another. These dimensions of teacher quality may or may not be closely related, and the correlation between the measures of teacher quality may or may not increase as the reliability of each measure increases.
In the example depicted by Panel B, the correlation between the measures decreases (i.e., the arrows become longer) as each measure of teacher quality becomes more reliable, moving from the left target to the right. This illustrates the second explanation for a low correlation between the measures: that each measure provides a reliable estimate of a different dimension of teacher quality.

[5] In theory, the same adjustment for reliability can apply to observations as well. In practice, however, little research appears to use Empirical Bayes estimators to adjust scores for differences in the number of lessons observed. Moreover, it is not clear that such estimates provide the best indicator of teacher effectiveness (see, for instance, Mehta, 2015).
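A minimal statistical sketch of this multidimensional scenario (assuming, as a simplification of our own, two standard-normal dimensions and independent sampling error) shows why reliability alone cannot rescue the correlation: as outcomes per teacher accumulate, the observed correlation can rise no higher than the correlation between the two dimensions themselves.

```python
import numpy as np

rng = np.random.default_rng(1)

def two_dimension_corr(rho12=0.3, n_outcomes=4, n_teachers=20_000, noise_sd=1.0):
    """Observed correlation when value added tracks latent dimension TQ1 and
    observations track TQ2, with the dimensions correlated at rho12."""
    tq1 = rng.standard_normal(n_teachers)
    tq2 = rho12 * tq1 + np.sqrt(1 - rho12**2) * rng.standard_normal(n_teachers)
    va = (tq1[:, None] + noise_sd * rng.standard_normal((n_teachers, n_outcomes))).mean(axis=1)
    obs = (tq2[:, None] + noise_sd * rng.standard_normal((n_teachers, n_outcomes))).mean(axis=1)
    return np.corrcoef(va, obs)[0, 1]

# With more outcomes per teacher the correlation improves, but only toward
# rho12 = 0.3, the ceiling set by the dimensions themselves.
for n in (1, 4, 50):
    print(n, round(two_dimension_corr(n_outcomes=n), 2))
```

Distinguishing this ceiling from mere unreliability is precisely what the simulations in this paper are designed to do.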
Some empirical evidence substantiates this second explanation for weak correlations. For instance, prior research suggests that measures of teacher contributions to the performance of students on different tests may themselves capture divergent dimensions of teacher quality. The most obvious example of this divergence emerges when comparing teachers’ value added in different subjects; for example, one might not expect a teacher’s contributions to performance on a mathematics exam to be measuring the same type of quality as his or her contributions to performance on a reading exam (Fox, forthcoming; Gershenson, forthcoming; Goldhaber, Cowan, & Walch, 2013; Rockoff, 2004).