
Consortium for Educational Research and Evaluation–North Carolina: Comparing Value-Added Models for Estimating Individual Teacher Effects on a Statewide ...


Consortium for Educational Research and Evaluation–North Carolina, Comparing Value-Added Models, August 2012

Reliability

Using actual NC data, quintiles of teacher performance were estimated for two sequential years, and the results were then cross-tabulated at the teacher level. Each cross-tabulation shows how teachers at each level of effectiveness in the first year were distributed across quintiles in the second year. The cross-tabulations of the quintiles in 2007–08 and 2008–09 were condensed into two summary tables (Table 7). Table 7 reports two figures for both math and reading: the percentage of teachers on the diagonal of each cross-tabulation, that is, those in the same quintile in both years; and the percentage of teachers who switched from the top quintile to the bottom or from the bottom quintile to the top over the same interval. For 5th grade teachers, the DOLS had the highest same-quintile percentage, with 44.5% for math and 39.2% for reading, followed by the URM with 33.2% for math and 28.3% for reading.

The HLM3 and HLM3+ had the lowest year-to-year quintile consistency, virtually tying at 30% for math and 25% for reading.

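[Table 7: year-to-year quintile consistency and extreme-quintile switching, by model, for math and reading]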

The models also varied somewhat in the percentage of teachers switching from one extreme quintile to the other. There was a clear difference between math and reading on this measure, with about twice as many 5th grade teachers switching in reading as in math. The DOLS was the best performer, with 0.2% switching in math and 0.8% in reading. The other models were more similar to one another, but the HLM3 and HLM3+ had the highest percentages of switching in both reading and math.
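The two consistency figures described in this section (share of teachers on the diagonal of the quintile cross-tabulation, and share switching between the extreme quintiles) can be sketched as follows. This is a minimal illustration, not the report's code: the function names and the sample estimates are hypothetical, quintiles are assigned by simple rank, and ties are broken by list position.

```python
from collections import Counter

def quintile_labels(scores):
    """Assign each score a quintile from 1 (lowest) to 5 (highest) by rank.

    Assumption: ties are broken by position in the list.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    labels = [0] * n
    for rank, i in enumerate(order):
        labels[i] = rank * 5 // n + 1
    return labels

def quintile_stability(year1, year2):
    """Cross-tabulate quintiles for the same teachers in two years.

    Returns (crosstab, pct_same_quintile, pct_extreme_switchers), where
    crosstab maps (year-1 quintile, year-2 quintile) to a teacher count.
    """
    q1, q2 = quintile_labels(year1), quintile_labels(year2)
    crosstab = Counter(zip(q1, q2))
    n = len(q1)
    same = sum(crosstab[(q, q)] for q in range(1, 6))
    extreme = crosstab[(1, 5)] + crosstab[(5, 1)]
    return crosstab, 100.0 * same / n, 100.0 * extreme / n

# Hypothetical value-added estimates for ten teachers in two years.
est_year1 = [-0.9, -0.5, -0.3, -0.1, 0.0, 0.1, 0.2, 0.4, 0.6, 1.0]
est_year2 = [-0.7, 0.9, -0.2, -0.4, 0.1, 0.0, 0.3, 0.2, 0.5, -0.8]
_, pct_same, pct_extreme = quintile_stability(est_year1, est_year2)
print(f"same quintile: {pct_same:.1f}%, extreme switchers: {pct_extreme:.1f}%")
```

A model with perfectly stable estimates would put every teacher on the diagonal; the percentages reported for Table 7 correspond to the second and third return values computed per model and subject.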

Discussion

Using two simulations of student test score data as well as actual data from North Carolina public schools, we compared nine value-added models on four criteria related to teachers’ effectiveness rankings: Spearman rank-order correlation; percentage agreement on classification in the lowest 5%; false positives, that is, teachers who are not ineffective being misidentified as ineffective; and consistency of quintile rankings over two sequential years. Using these comparisons, we answer six questions pertinent to state policymakers and administrators who may be in a position to select a value-added model for estimating individual teachers’ effectiveness from student test score data, and to the teachers and principals who may be directly affected by those estimates.

1. Are the rankings of teachers from VAMs highly correlated with the ranking of “true” teacher effects under ideal (i.e., absence of assumption violations) conditions?

All nine VAMs performed reasonably well on this test, but four (the HLM3+, URM, SFE, and HLM3) outperformed the other five.

2. How accurately do the teacher effect estimates from VAMs categorize a teacher as ineffective, and what proportion of teachers would be misclassified?

All models performed reasonably well on this test, but four (the HLM3+, URM, SFE, and HLM3) outperformed the other five.

3. How accurately do VAMs rank and categorize teachers when SUTVA is violated and classroom variance accounts for a proportion of the teacher effect?

When SUTVA is violated, the ranking accuracy of all models was substantially lower than in the scenario without assumption violations. In relative terms, four models (the HLM3+, URM, SFE, and HLM3) outperformed the other five. In categorizing teachers in the lowest 5% when SUTVA is violated, all VAMs performed equivalently.

4. How accurately do VAMs rank and categorize teachers when ignorability is violated and student effects are correlated with classroom, teacher, and school effects?

When ignorability is violated, the ranking accuracy of the VAMs was somewhat lower than in the scenario without assumption violations. The relative performance of the VAMs varied substantially; two models (the HLM3+ and HLM3) outperformed the other seven in both the negative assignment and positive assignment scenarios. In categorizing teachers in the lowest 5% when ignorability is violated, all VAMs correctly classified more than 90% of the teachers, with six models (the HLM3+, HLM3, HLM2, URM, SFE, and SFEIV) outperforming the other three.


5. How similar are rankings of VAM estimates to each other?

For mathematics, the rankings produced by three VAMs (the URM, HLM2, and TFE) are more similar to the average of all VAMs than are the rankings of the other six models. For reading, the rankings produced by five models (the URM, HLM2, HLM3+, HLM3, and TFE) are more similar to the average of all VAMs than are those of the other four.

6. How consistent are VAM estimates across years?

The VAMs that most consistently place teachers in the same performance quintile from year to year are the DOLS and URM. In producing the fewest highest-to-lowest or lowest-to-highest switchers, the DOLS is the best performing VAM, followed by the TFE, URM, and HLM2.
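The first three criteria named at the start of this Discussion can be sketched for a simulation in which the "true" teacher effects are known. This is an illustrative reading, not the report's implementation: the function names are hypothetical, "agreement" is taken to mean the share of teachers whose bottom-5% status matches between true effects and estimates, and a false positive is a teacher flagged in the bottom 5% of estimates who is not in the bottom 5% of true effects.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

def bottom_set(values, share=0.05):
    """Indices of the lowest `share` of teachers (at least one teacher)."""
    k = max(1, round(len(values) * share))
    return set(sorted(range(len(values)), key=lambda i: values[i])[:k])

def classification_summary(true_effects, estimates, share=0.05):
    """Return (percent agreement on bottom-share status, false positive indices)."""
    true_low = bottom_set(true_effects, share)
    flagged = bottom_set(estimates, share)
    n = len(true_effects)
    agree = sum((i in true_low) == (i in flagged) for i in range(n))
    return 100.0 * agree / n, sorted(flagged - true_low)
```

Applying `spearman` and `classification_summary` to each model's estimates against the same simulated true effects yields one row of comparison figures per model; averaging the true effects of the false positive indices gives the kind of "mean performance of falsely identified teachers" figure discussed later in this section.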





Clearly, the overall ranking of model performance depends on how the criteria are weighted. If performance under violated ignorability is treated as the most heavily weighted criterion, three VAMs performed poorly enough to appear to be risky choices for estimating individual teacher effectiveness: teacher fixed effects, teacher fixed effects with IV, and dynamic ordinary least squares. None of these three models performed well in either test for confounded assignment of students and teachers, and much research strongly suggests that confounded assignment is frequently the case now. In the simulations violating SUTVA, these models appear to underperform the others in ranking but not in identifying the lowest-performing 5% of teachers. Nor did they underperform in the examinations of year-to-year consistency. This conclusion may need to be tempered in the case of the DOLS because of the relatively high performance of that VAM in the simulations by Guarino, Reckase, and Wooldridge (2012), but their findings with respect to the teacher fixed effects VAM are consistent with ours.

More research on the performance of the DOLS is needed before a strong affirmative recommendation can be offered. Bearing in mind that the findings of the present study and those of Guarino, Reckase, and Wooldridge (2012) regarding the DOLS overlap only in examining rank correlations, we speculate that the DOLS may be a higher performer in the Guarino, Reckase, and Wooldridge study for a number of reasons. Differences in the data generation processes, combined with the authors’ choice not to examine a model with raw score as the outcome and a shrinkage estimator for the teacher effect, may explain this seeming disagreement. Teacher estimates shrunken by empirical Bayes were applied to the gain score, but the authors argued that, with invariant class sizes in their design, the shrinkage estimator on the raw score would produce rankings equivalent to those for the DOLS. As a consequence, the DOLS estimates in the Guarino, Reckase, and Wooldridge study are equivalent to a random effects variant that they did not test. This is certainly consistent with the present study, as two of the simple nested random effects models (the HLM3 and HLM3+) were regularly among the highest performing models.
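The class-size point above rests on a general property of shrinkage: when every class has the same size, each estimate is pulled toward the mean by the same factor, so the ordering of teachers cannot change. The sketch below illustrates only that property. It does not reproduce the DOLS or either study's estimators; the function names, the variance components `tau2` and `sigma2`, and the sample class means are hypothetical, and the grand mean is a simple unweighted average.

```python
def shrink(class_means, class_sizes, tau2, sigma2):
    """Shrink class means toward the grand mean.

    Shrinkage factor for class j: tau2 / (tau2 + sigma2 / n_j), where
    tau2 is the between-teacher variance and sigma2 the student-level
    variance (both assumed known here; in practice they are estimated).
    """
    grand = sum(class_means) / len(class_means)
    shrunken = []
    for mean, n in zip(class_means, class_sizes):
        factor = tau2 / (tau2 + sigma2 / n)
        shrunken.append(grand + factor * (mean - grand))
    return shrunken

def rank_order(values):
    """Teacher indices sorted from lowest to highest value."""
    return sorted(range(len(values)), key=lambda i: values[i])

# Hypothetical unshrunken class means for four teachers.
means = [0.0, 0.3, 0.5, -0.8]

# Equal class sizes: one common shrinkage factor, so the ordering is unchanged.
equal = shrink(means, [20, 20, 20, 20], tau2=0.04, sigma2=1.0)
print(rank_order(means) == rank_order(equal))

# Unequal class sizes: the teacher with only 2 students is shrunk much
# harder than the teacher with 50, and their order can reverse.
unequal = shrink(means, [20, 50, 2, 20], tau2=0.04, sigma2=1.0)
print(rank_order(means), rank_order(unequal))
```

With real data, class sizes differ, which is one reason shrunken and unshrunken estimates can rank teachers differently outside a design with invariant class sizes.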

With the findings indicating that the TFE, TFEIV, and DOLS are risky, are there any models that policymakers and administrators might wish to consider adequate? This question can be answered only with a definitive weighting scheme for the criteria, which should include an assessment of the costs and consequences of the particular purposes for which the estimates will be used. The list of acceptable models could be quite different for estimates of teacher effectiveness used to identify teachers who may need additional professional development (low stakes) and those used to identify teachers for high stakes consequences such as denial of tenure, dismissal, or substantial bonuses, with identification for additional observations with feedback and coaching and other positive benefits falling somewhere in between.

We believe the evidence suggests that four VAMs performed sufficiently well across the board to deserve consideration for low stakes purposes: the three-level hierarchical linear model with one year of pretest scores, the three-level hierarchical linear model with two years of pretest scores, the EVAAS univariate response model, and the student fixed effects model. The performance of these models in recovering the true effects was quite good and quite similar. The performance of all four was degraded by a violation of SUTVA (more so for ranking, less so for agreement on classification in the bottom 5% and for false identification of ineffective teachers) but not as much by confounding. Also quite relevant is the identification of the lowest 5% of teachers, which could be used in low to medium stakes situations. In our opinion, these are relevant criteria for assessing the adequacy of a VAM for low stakes purposes.

We believe that the false positive analysis is particularly important when considering the adequacy of models for high stakes use. If SUTVA is substantially violated, about 350 5th grade teachers in a state the size of North Carolina could be identified for possible removal when their actual performance was not in the lowest 5% of teachers. The mean performance of these teachers is more than 0.6 of a standard deviation below the mean when the four higher performing models are used. When confounding occurs, the higher performing models would falsely identify approximately 220–290 5th grade teachers in a state about the size of North Carolina as in the lowest performing 5% of the distribution. For the four higher performing VAMs, these falsely identified teachers’ average performance is at least 0.9 standard deviations below the mean. For many, this would seem to suggest that the teacher effectiveness estimates should at most be considered a first step in identifying ineffective teachers, rather than the method for identification of teachers for high stakes personnel actions. Using any VAM, even the highest performing ones, to identify teachers for high stakes consequences seems risky in our opinion.

Consistency should also be considered when deciding whether any of the VAMs should be used for estimating individual teacher effectiveness. As earlier research points out, inconsistency in the estimates from year to year can undermine the credibility of the estimates, especially among those whose performance is being estimated (Amrein-Beardsley, 2008). The best performer in this regard, the DOLS, was a very low performer in the simulations; the EVAAS URM and student fixed effects models performed somewhat better than the other two better performers, the HLM3 and HLM3+. However, all of these assessments are relative. It is difficult to know whether the differences in VAM performance that we observed using the North Carolina data (3.2% switching from highest to lowest or vice versa rather than 1.7%) would affect credibility. The fact that these extreme switchers exist at all may be sufficient evidence to convince some policymakers and some teachers that no sufficiently consistent VAM exists.

Further research should be conducted to better understand the correlates of the extreme quintile switching, in particular investigating the number of novice teachers that switch or the number of extreme switchers that have changed assignments, such as moving from one school to another or one grade to another.

Limitations and Implications

Limitations

This study had several limitations. First, a significant portion of the analysis was based on simulated, stylized data. This was intended to address the absence of “true” measures of teacher effects in actual data. While the simplifications in these data suggest that real conditions would probably degrade the absolute performance of each model, we have not argued that this degradation of performance would be equivalent across all models, and it is therefore possible that more realistic conditions would alter the comparisons we have made. For example, we did not simulate missing values, a problem typical of actual data that some of the models (e.g., the URM) may by design handle better than others. Second, there was some necessary subjectivity in the choice and specification of models, including the types of fixed effects models and the covariates used in some models. Third, we were unable to estimate extensive simulations or actual data models for the EVAAS MRM, a controversial (Amrein-Beardsley, 2008) but widely published (Ballou, Sanders, & Wright, 2004; McCaffrey et al., 2004) model. While McCaffrey et al. (2004) suggested that this model performed similarly to a fixed effects model using small samples, our experience with a smaller variance decomposition sample than the one used in that study (144 teachers, rather than 833) suggests that the MRM performed poorly. A single simulation of 833 teachers with zero classroom variance, however, indicates that the MRM had very similar performance to the URM. Nevertheless, we cannot recommend the MRM, as its computational demands place it out of the reach of many state education agencies and scholars. Finally, the limited actual data, covering only three years in which students were matched to their teachers, made some of the analyses difficult to undertake and required some modifications to the models when multiple estimates were needed for examining year-to-year consistency.

Despite these limitations, this study has multiple strengths. It is the first of its kind to use simulated variance decomposition data and correlated fixed effects data, designed to test SUTVA violations and ignorability violations respectively, alongside actual data. It is also the first of its kind to examine multiple random effects and fixed effects models, and it examined nine models, nearly twice as many as any other study.

Implications

Value-added models for teacher effectiveness are a key component of reform efforts aimed at improving teaching and have been examined by this study and others. However, an interdisciplinary consensus on the methods used to obtain value-added teacher estimates does not exist, and many different models spanning multiple disciplines including economics and sociology have been proposed, as noted above. Further, several different approaches have been used to examine and compare models, and as this study demonstrated with just a handful of approaches, the “best” VAM may be dependent on the comparison approach. Nevertheless, when multiple approaches were used, trends did emerge that pointed to a few models that were on average better performers, and a handful that were almost universally poor. We suggest that one implication of this study is that multiple approaches are needed to get a fuller picture of the relative merits of each model.


