The total K–2 sample includes 8,829 children and 28,935 observations. We excluded a small number of students who had repeated a grade or were tested mistakenly with materials other than those appropriate for their grade level. This adjustment resulted in the loss of 230 children and reduced the analytic sample to 8,599 children and 27,839 observations. In addition, we excluded a small number of individual test scores with unusually large standard errors of measurement that raised questions about the reliability of these individual test administrations. This adjustment reduced the analytic sample to the final total of 8,576 children and 27,427 observations. The sample counts presented in Table 1 are based on this final analytic sample.

Teachers. Not including the literacy coaches, 287 teachers taught in kindergarten through second-grade classrooms in the 17 study schools at some point during the study’s 4 years. Of these 287 teachers, 111 teachers were present for all 4 years (38.7% of the teacher sample), 39 teachers (13.6%) were present for 3 years, 40 teachers (13.9%) were present for 2 years, and 97 teachers (33.8%) were present for only 1 year. Within this group, 259 kindergarten through second-grade teachers participated in one-on-one coaching at least once during the 3 years of LC implementation.

Schools. As Table 2 demonstrates, schools varied widely in their student composition. More than 90% of students were white in several schools, while in other schools, 30% or more of the students were African American or Latino. Similarly, the schools ranged in their socioeconomic composition, with the percentage of students receiving free or reduced-price lunches ranging from a low of 19% in one school to a high of 86% in another.

Table 2. Percentages of Key Demographics of the Student Sample in the Base Year

Note. Racial ethnic group percentages may not total 100% due to rounding.

Measures

We used a mix of reading assessments in order to broadly assess students’ literacy learning within the primary grades studied. Depending on a child’s cohort and the length of enrollment at a study school, this resulted in children being tested a maximum of six times (see Table 1).

Dynamic Indicators of Basic Early Literacy Skills (DIBELS). Participating students took a variety of subtests from DIBELS beginning in the fall of kindergarten through the spring of second grade. These subtests tap a range of early literacy skills, including letter recognition, phonological awareness, decoding, and oral reading fluency. The choice of subtests to administer at each grade level and semester (fall and spring) was based primarily on publisher recommendations (Good & Kaminski, 2002). However, in some instances we chose to include an additional, more difficult subtest. For example, we added the oral reading fluency subtest in the fall of first grade, in order to improve our assessments’ capacity to discriminate effectively among students with higher levels of literacy learning.

Table 3 provides a schedule of the specific subtests administered each semester in each grade and the reported reliability and validity of these subtests (Good, Wallin, Simmons, Kame’enui, & Kaminski, 2002). Concurrent validity statistics for our study sample are consistent with published validity information (see Table 3).

Terra Nova. Each spring, participating first- and second-grade students took the reading comprehension subtest from the Terra Nova Multiple Assessments of Reading, a group-administered, standardized, norm-referenced reading test. (See McGraw-Hill [2001] for information on the reliability and validity of this test.)

Rasch scaling. The DIBELS and Terra Nova results were scaled together using Rasch modeling (Wright & Masters, 1982). The resultant vertical scale allowed us to fully exploit the longitudinal character of our student literacy learning data and resolve several difficulties with the use of DIBELS assessments in program effects studies. The final vertical scale also more closely approximates the principle of a single ruler or metric where a one-unit difference on the scale at any level of ability implies an equal difference on the trait measured (reading in this case), which is an assumption for parametric growth curve analyses (Raudenbush & Bryk, 2002).
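The "single ruler" property can be illustrated with a minimal sketch of the simplest (dichotomous) Rasch model, in which the log-odds of answering an item correctly equal the difference between a child's ability and the item's difficulty, both in logits. This is an illustration under assumptions, not the scaling procedure used in the study: the function names and the item difficulties below are hypothetical, and the actual scaling (which also covers fluency subtests) may rely on a different model from the Rasch family.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model.

    Both arguments are in logits; the log-odds of success are
    (ability - difficulty).
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def odds(probability: float) -> float:
    """Convert a probability into odds."""
    return probability / (1.0 - probability)


def odds_ratio_for_gain(low: float, high: float, difficulty: float) -> float:
    """Ratio of the odds of success at ability `high` versus ability `low`."""
    return odds(rasch_probability(high, difficulty)) / odds(
        rasch_probability(low, difficulty)
    )


if __name__ == "__main__":
    # Hypothetical item difficulties (in logits), chosen only for illustration.
    for difficulty in (0.0, 2.0, 4.0):
        ratio_low = odds_ratio_for_gain(1.0, 2.0, difficulty)
        ratio_high = odds_ratio_for_gain(3.0, 4.0, difficulty)
        print(f"difficulty={difficulty:.1f}: "
              f"1->2 logits x{ratio_low:.3f}, 3->4 logits x{ratio_high:.3f}")
```

Under this model, a one-logit gain multiplies the odds of success on any item by the same factor (e, about 2.72), whether the gain occurs low or high on the scale. That constancy is what makes equal logit differences comparable across the ability range.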

The full details of the Rasch analysis are reported elsewhere (Luppescu, Biancarosa, Kerbow, & Bryk, 2008).4 Fewer than 2% of all items in the resulting scale exhibited signs of misfit, and the average infit (information-weighted mean square fit) was 1.00, which is the expected fit in a good scale. Moreover, student scale measures correlated well with raw scores on all constituent subtests of the DIBELS and Terra Nova, ranging from a low of .52 with DIBELS initial sound fluency to highs of .77 and .85 for the Terra Nova and DIBELS oral reading fluency subtest, respectively.

Table 3. DIBELS and Terra Nova Testing Schedule by Grade and Time of Year, Alternate Form Reliability, and Concurrent and Predictive Validity

Note. CTOPP = Comprehensive Test of Phonological Processing; ITBS = Iowa Test of Basic Skills reading comprehension; PSSA = Pennsylvania State System of Assessment; TORF = Test of Oral Reading Fluency; WJ = Woodcock Johnson Psycho-Educational Battery.

a Good, Wallin, Simmons, Kame’enui, and Kaminski (2002).

b Hintze, Ryan, and Stoner (2003).

c Schilling, Carlisle, Scott, and Zeng (2007).

d Brown and Coughlin (2007).

e Current sample.

Table 4. Mean Scores and Standard Deviations in Logits for Analytic Sample of Kindergarten (K) through Second-Grade Students by Grade, Semester, and Year (n = 8,576)

As is customary in Rasch scaling, the final measures are reported in a logit metric. Since logits are not intrinsically meaningful, we illustrate here the differences in literacy status one would likely find among students scoring at different values on our scale. For example, a child scoring at 1.0 logit (approximately an average child in the fall of kindergarten) typically can name about 30 letters in a minute, thus indicating good letter-name knowledge. That same child most likely discerns a few initial phonemes, but not many, and has very little chance of being able to segment words into phonemes. In contrast, a child scoring at 2.0 logits (approximately an average child in the spring of kindergarten or fall of first grade) is both accurate and fluent in letter-name knowledge and has almost mastered initial sound identification, but is still largely unable to segment words phonemically. The child can read a handful of words in a minute when given a passage of continuous text, but has little success at reading nonsense words (an indicator of decoding skill out of context). A child scoring at 3.0 logits (approximately an average child in the spring of first grade or fall of second grade) has mastered letter names and initial sounds, can read 50–60 words per minute accurately, and may answer correctly about a third of the first-grade Terra Nova comprehension questions. This child also does well on all but the hardest phonemic segmentation and nonsense word-reading tasks, but may not be very fast at these tasks overall and is generally better at the former than the latter. Finally, a child scoring at about 4.0 logits (approximately an average child in the spring of second grade) has mastered component reading skills (e.g., letter name knowledge, phonemic segmentation, decoding), reads about 90 words correctly per minute, and does well on two-thirds of the first-grade Terra Nova comprehension questions and on about a third of the second-grade questions.

Analyses

We began our analyses by visually examining the observed mean outcomes separately for each cohort. Findings from this examination guided our approach to analyzing these data through hierarchical, crossed-level, value-added-effects modeling.

Empirical Student Literacy Learning Trajectories

We describe in this section the basic growth patterns found in the observed data. Table 4 reports the mean Rasch literacy development scores for K–2 students in the final analytic sample by grade, semester (i.e., fall or spring), and study year; Figure 2 depicts this same information and identifies the longitudinal data for each separate cohort by a distinct symbol. We first discuss the baseline trends that informed the model-building process before turning to implementation trends.

Figure 2. Means by cohort and year of Literacy Collaborative implementation.

Baseline trend. Data collected in the fall and spring of the first year and fall of the second year (i.e., prior to the initiation of LC school-based PD) constitute the baseline trend for assessing subsequent program effects. Data from three different student cohorts constitute this baseline (see Fig. 2): Cohort 3 (represented by open circles) began the study in the fall of kindergarten, Cohort 2 (represented by open diamonds) began in the fall of first grade, and Cohort 1 (represented by asterisks) began in the fall of second grade.

Under an accelerated longitudinal cohort design, the results from the different baseline cohorts should connect smoothly as one overall growth trajectory. The resultant longitudinal trajectory is the baseline against which subsequent program effects are evaluated. Note that, as expected under an accelerated longitudinal cohort design, we found a near perfect overlap in mean achievement at the fall of first grade where Cohorts 2 and 3 join. However, a small gap of about .25 logits was found where Cohorts 1 and 2 join in the fall of second grade. This indicates that prior to program implementation, these two cohorts differed somewhat in their average literacy ability, at least at this one point in time. As a result, we have introduced a set of statistical adjustments for possible cohort differences in the hierarchical, crossed-random-effects models estimated below.

Implementation years. After examining the baseline trends, we plotted the data from the LC implementation years on top of the baseline trend to provide a first look at possible program effects. As noted above, Figure 2 illustrates the mean student outcomes at each testing occasion during the 3 years of LC implementation.

Again, the longitudinal data for each separate cohort are identified by a distinct symbol. For example, the trajectory for Cohort 3 is identified by open circles and includes data from the baseline year and 2 years of implementation. The trajectory begins as a solid line in the fall of kindergarten and continues to the fall of first grade when implementation began. The trajectory continues, therefore, as a dashed line through the fall of second grade, which represents first-year LC implementation effects. Following the same cohort through the spring of second grade incorporates second-year LC implementation effects on this group, which is represented by a dashed-and-dotted line.

Of primary interest here is a comparison of the slopes representing student learning during the academic years and the change in these slopes over the course of the study. Specifically, the increasing steepness of the slopes from fall to spring within each grade (from solid line, to dashed line, to dashed-and-dotted line) suggests positive overall value-added effects associated with the LC program.

These value-added effects are most apparent in kindergarten when students’ fall entry status is almost identical for Cohorts 3, 4, 5, and 6, but there is increasing separation in achievement among these four groups by the following spring.

Key observations for value-added modeling. In addition to the possible cohort effects in the baseline results noted earlier, several other distinct features in these longitudinal data have important implications for subsequent value-added modeling. First, growth during academic years (from fall to spring) is markedly steeper than growth during the summer break periods (from spring to fall). This means that we must separately parameterize the rates of student learning in these two periods. Second, the academic learning rates (slopes) appear to vary across grade levels, with larger gains observed in kindergarten and first grade than in second grade. Thus, we also need to introduce a set of fixed effects in the model to capture these departures from strict linearity.

Finally, there is some evidence in Figure 2 that program effects may vary by year of implementation. Thus, in the analyses that follow, which assess teacher- and school-level value-added effects on student literacy learning, we estimate separate effects for each year.

Hierarchical, Crossed-Level, Value-Added-Effects Modeling

The accelerated longitudinal cohort design used in the current study lends itself naturally to value-added modeling because our data consist of repeated measures on students who cross teachers within school over time. The hierarchical, crossed-level, value-added-effects model that we applied can be conceptualized as the joining of two separate hierarchical models, the first of which is a two-level model for individual growth in achievement over time, and the second of which is a two-level model of the value added that each teacher and school contributes to student learning in each particular year. In essence, the core evidence for LC effects consists of comparing learning gains in each teacher’s classroom during each year of program implementation to the gains in that same teacher’s classroom during the baseline year. The observed gains in each classroom, however, are now adjusted for any differences over time in the latent growth trends for students being educated in each classroom. In contrast to the simple descriptive statistics presented in Figure 2, a hierarchical, crossed-level, value-added-effects model allows us to take full advantage of the longitudinal character of the data on each student and adjusts for any outcome differences associated with individual latent growth trends in estimated teacher-classroom and school value-added effects.

Basic individual growth model. We began building our model by specifying a level 1 model for the literacy score at time i for student j. We specified five level 1 predictors to capture key characteristics noted in the empirical growth trajectories in Figure 2. Specifically, we created three indicators for academic year learning: a base learning rate during kindergarten and two grade-specific deviation terms for first and second grade, respectively. Since summer learning rates appeared to vary between kindergarten and first grade versus between first and second grade, we created two additional indicators of growth: a K–grade 1 summer indicator and a grade 1–grade 2 summer indicator. A system was developed for coding the indicators such that the intercept represented the latent score for student j at entry into the data set, regardless of the specific time occasion when this may occur.5 Since we assume that each child has a unique latent growth trajectory, the intercept and base academic year learning rate were specified as randomly varying among individual students. These parameters capture the variation between children in their initial literacy status and their latent growth rate in literacy learning.
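One way such an entry-anchored coding could work is sketched below. This is an assumption made for illustration, not the coding system actually used in the study (whose details are given in the note cited above): each of the five indicators counts how many periods of the corresponding type have elapsed between a student's entry occasion and the current testing occasion. All names (`OCCASIONS`, `PERIOD_INDICATORS`, `code_growth_indicators`) are hypothetical.

```python
# Six possible testing occasions, in chronological order within a student's K-2 career.
OCCASIONS = ["K_fall", "K_spring", "G1_fall", "G1_spring", "G2_fall", "G2_spring"]

# The five level 1 growth predictors described in the text (names are hypothetical).
INDICATOR_NAMES = [
    "acad_base",     # base academic-year learning rate
    "acad_g1_dev",   # first-grade deviation from the base rate
    "acad_g2_dev",   # second-grade deviation from the base rate
    "summer_k_g1",   # summer between kindergarten and first grade
    "summer_g1_g2",  # summer between first and second grade
]

# PERIOD_INDICATORS[k] lists the indicators incremented once the period
# between OCCASIONS[k] and OCCASIONS[k + 1] has elapsed.
PERIOD_INDICATORS = [
    ("acad_base",),                # kindergarten academic year
    ("summer_k_g1",),              # K-grade 1 summer
    ("acad_base", "acad_g1_dev"),  # first-grade academic year
    ("summer_g1_g2",),             # grade 1-grade 2 summer
    ("acad_base", "acad_g2_dev"),  # second-grade academic year
]


def code_growth_indicators(entry: str, occasion: str) -> dict:
    """Code the five growth indicators for one observation.

    Every indicator is zero at the student's entry occasion, so the model
    intercept represents the student's status at entry into the data set.
    """
    start = OCCASIONS.index(entry)
    end = OCCASIONS.index(occasion)
    if end < start:
        raise ValueError("occasion cannot precede the entry occasion")
    coded = dict.fromkeys(INDICATOR_NAMES, 0)
    for period in range(start, end):
        for name in PERIOD_INDICATORS[period]:
            coded[name] += 1
    return coded


if __name__ == "__main__":
    # A student who enters in the fall of first grade, observed in the fall of second grade.
    print(code_growth_indicators("G1_fall", "G2_fall"))
```

In this sketch, a student entering in the fall of first grade and observed in the fall of second grade has one elapsed academic year, one first-grade deviation, and one grade 1–grade 2 summer, with the remaining indicators at zero.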

In preliminary analyses we also considered modeling summer period effects as randomly varying. However, we were unable to reliably differentiate among children in this regard once random intercepts and random academic year learning effects were included in the model. Therefore, the summer period effects were treated as fixed at the individual child level.
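The individual growth model described above can be written compactly as follows. This is a sketch in notation we assume for illustration (the symbols $a_{1ij}$, $a_{2ij}$, $a_{3ij}$, $s_{1ij}$, and $s_{2ij}$ for the three academic-year and two summer indicators are not taken from the study), and it omits the cohort adjustments and the teacher and school value-added components of the full crossed-level model.

```latex
% Level 1: literacy score at time i for student j
Y_{ij} = \pi_{0j} + \pi_{1j}\,a_{1ij} + \pi_{2j}\,a_{2ij} + \pi_{3j}\,a_{3ij}
         + \pi_{4j}\,s_{1ij} + \pi_{5j}\,s_{2ij} + e_{ij}

% Student level: random intercept and random base academic-year learning rate
\pi_{0j} = \beta_{00} + r_{0j}, \qquad
\pi_{1j} = \beta_{10} + r_{1j}

% Grade-specific deviations and summer effects fixed across students
\pi_{pj} = \beta_{p0}, \qquad p = 2, \dots, 5
```

Here $\pi_{0j}$ is student $j$'s status at entry, $\pi_{1j}$ is the base academic-year learning rate, and $r_{0j}$ and $r_{1j}$ are the student-specific random effects that capture variation in initial literacy status and latent growth rate.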

