This guest post is contributed by Kathryn McDermott and Lisa Keller. McDermott is Associate Professor of Education and Public Policy and Keller is Assistant Professor in the Research and Evaluation Methods Program, both at the University of Massachusetts, Amherst.
On March 29, the U.S. Department of Education announced that Delaware and Tennessee were the first two states to win funding in the “Race to the Top” grant competition. A key part of the reason why these two states won was their experience with “growth modeling” of student progress measured by standardized test scores, and their plans for incorporating the growth data into evaluation of teachers. The Department of Education has $3.4 billion remaining in the Race to the Top fund, and other states are now scrutinizing reviewer feedback on their applications and trying to learn from Delaware’s and Tennessee’s successful applications as they strive to win funds in the next round.
One of the Department’s priorities is to link teachers’ pay to their students’ performance; indeed, states with laws that forbid using student test scores in this way lost points in the Race to the Top competition. A few months ago, James pointed out some of the general flaws in the pay-for-performance logic; here, our goal is to raise general awareness of some statistical issues that are specific to using test scores to evaluate teachers’ performance.
Using students’ test scores to evaluate their teachers’ performance is a core component of both Delaware’s and Tennessee’s Race to the Top applications. The logic seems unassailable: everybody knows that some teachers are more effective than others, and there should be some way of rewarding this effectiveness. Because students take many more state-mandated tests now than they used to, it seems logical that there should be some way of using those test scores to make the kind of effectiveness judgments that currently get made informally, on less scientific grounds.
The problem is that even if you accept the assumption that standardized tests convey useful information about what students have learned (which we both do, in general), measuring the performance gains (or losses) of students in a particular classroom is far more complicated than subtracting the students’ September test scores from their June test scores and averaging out the gains. We’re concentrating on the statistical issues here; there are other obvious challenges in test-based evaluation, such as what to do for teachers who teach grade levels where students do not take tests and/or subjects without standardized tests.
The first problem has to do with class sizes. In order to compare the score gains among Ms. Jones’s students with the score gains of Mr. Smith’s students, we need to do statistical analyses.* One thing you learn fairly early in a statistics course is that small samples are tricky, since just a few measurements at the extremes have a much greater effect on the mean or other summary statistics than they would in a larger group. Experts on educational testing think that a group of about 30 students is necessary for analyses of the average performance of a teacher’s students to be meaningful. In other words, each teacher needs to have 30 or more students. Even in these days of fiscal austerity, U.S. elementary-school class sizes are generally not this large.
In middle school and high school, where students have specialized teachers for each subject, teachers work with far more than 30 students in any given term or school year; the norm is probably closer to 100 or more (not, of course, at the same time; each teacher is teaching multiple courses or multiple sections of the same course, and seeing different students during each class period). Under these circumstances, the conditions for meaningful teacher-level performance analysis are likelier to be met.
Even here, though, there could still be issues. If all of a middle-school teacher’s 100 students are taking the same general seventh-grade math or English course, it makes sense to aggregate all 100 students’ performance data. However, if that same teacher has three sections of “General Math 8” and one of “Honors Algebra,” it’s harder to know what the aggregate across all four class sections means. It’s even harder to know how to interpret an aggregate of changes in students’ test scores if a high-school English teacher’s work day includes AP English Literature, remedial English for 10th graders, and two sections of Creative Writing. The aggregate of the scores for the students in these classes who took the state English test that year might be a blunt measure of something about the teacher’s effectiveness, but it’s hard to say exactly what. If we wanted to know what kind of job this particular teacher does with the AP or remedial class, we’d be back to the class-size issue, since 30 is a big class even at the secondary level.
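To make the class-size point concrete, here is a minimal simulation sketch in Python. It assumes every teacher is equally effective and that each student's gain is drawn from the same normal distribution; the average gain of 8 points and the student-level spread of 10 points are made-up illustrative values, not figures from any state's data. The only thing that varies is how many students each teacher has.

```python
import random
import statistics


def simulate_class_mean_gains(class_size, n_classes=2000, true_gain=8.0,
                              student_sd=10.0, seed=0):
    """Simulate the average score gain for many classes of the same size.

    All teachers are assumed to be equally effective: every student's gain is
    drawn from the same normal distribution (true_gain and student_sd are
    hypothetical values chosen only for illustration).
    """
    rng = random.Random(seed)
    class_means = []
    for _ in range(n_classes):
        gains = [rng.gauss(true_gain, student_sd) for _ in range(class_size)]
        class_means.append(statistics.fmean(gains))
    return class_means


def spread_of_class_means(class_size, **kwargs):
    """Standard deviation of the class-average gain across simulated classes."""
    return statistics.stdev(simulate_class_mean_gains(class_size, **kwargs))


if __name__ == "__main__":
    for size in (20, 30, 100):
        print(f"class size {size:3d}: spread of class averages = "
              f"{spread_of_class_means(size):.2f} points")
```

Under these assumptions the class averages bounce around much more for a class of 20 than for a teaching load of 100, even though no teacher is actually better than any other. That is the sense in which small groups are "tricky": chance alone can produce gaps between teachers that look meaningful.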
The second general problem has to do with how students end up with particular classmates and particular teachers. Knowing that Ms. Jones’s students gained an average of 6 points on the test and Mr. Smith’s gained 10 might tell us that Mr. Smith is the better teacher, or just that Mr. Smith had students who were able for some other reason to make greater gains that year. We can only attribute the difference to something the two teachers did if we have reason to believe that the two groups of students were not systematically different from each other. More formally, we have to be able to assume that the students in the two classrooms are randomly equivalent. This means that the distribution of general academic ability, before Smith and Jones start teaching on the first day of school, is the same in the two rooms (which is not the same thing as all of the students having equal ability). Further, it would imply that the students in the two classrooms would, on average, have the same access to information, have equally involved parents, come from similar socio-economic backgrounds, etc. Although there are statistical models** that can control for other, specified variables, such as initial ability or socio-economic status, it is impossible to control for all possible variables. We can safely assume these conditions hold only if students are randomly assigned to classrooms.
Anybody with children in school can attest that this assumption does not withstand contact with reality. Even though strict “tracking” has fallen into disfavor among educators, some schools still divide students into fast, medium, and slow classes. Some school districts still have “gifted” programs that draw the strongest students from across town into particular classes in one school.
Even in schools that do not track their students, many other considerations generate classes that are anything but random groups of students. At the elementary level, where students tend to spend nearly all day with the same teacher, pushy parents are famous for manipulating their kids into the “best” teachers’ classes. Because pushy parents’ kids often do better in school, those “best” teachers are likely to end up with groups of students whose ability to gain in a year might exceed that of the students assigned to their less sought-after colleagues. Conversely, a particularly thick-skinned principal might engineer class assignments in the other direction, so that the children with the greatest need get the most effective teachers, which would produce the opposite tendency in score gains. Other factors that have to do with the school’s organization and resources, or with teachers’ various strengths and weaknesses, may also lead to non-random sorting of students. For example, a particular teacher may be especially good with rowdy boys, or be fluent in Spanish and able to help students transitioning out of bilingual education classes.
At the secondary level, separating students into AP, college-prep, and “regular” versions of core academic subjects is common. Even in un-tracked secondary schools (like many middle schools), scheduling considerations can produce collections of more, or less, academically able students with more, or less, advantaged parents in particular classes, depending on the patterns in the students’ course selections. These might cancel each other out, for example if the teachers who get the most able collections of students also get the least able ones, but it is not safe to assume that systematic differences in the characteristics of all teachers’ collections of students will always net to zero.
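The sorting problem described above can also be illustrated with a small simulation. The sketch below is in Python and rests on invented assumptions: Ms. Jones and Mr. Smith contribute exactly the same 8 points to every student's gain, and each student also carries a hypothetical "propensity" term standing in for unmeasured factors such as home support. The function name and all numeric values are ours, chosen only for illustration.

```python
import random
import statistics


def simulate_gap(sorted_assignment, n_students=60, n_trials=1000, seed=1):
    """Average (Smith minus Jones) gap in mean gain over many simulated years.

    Both teachers are assumed to be equally effective. If sorted_assignment
    is True, the students with the highest propensity to gain all land in
    Mr. Smith's class; otherwise students are assigned at random.
    """
    rng = random.Random(seed)
    half = n_students // 2
    gaps = []
    for _ in range(n_trials):
        # Hypothetical unmeasured student-level factor (e.g. home support).
        propensity = [rng.gauss(0.0, 5.0) for _ in range(n_students)]
        if sorted_assignment:
            propensity.sort()        # lower half -> Jones, upper half -> Smith
        else:
            rng.shuffle(propensity)  # random assignment to the two rooms
        jones, smith = propensity[:half], propensity[half:]
        # Identical teacher contribution (8 points) plus test-day noise.
        jones_gains = [8.0 + p + rng.gauss(0.0, 5.0) for p in jones]
        smith_gains = [8.0 + p + rng.gauss(0.0, 5.0) for p in smith]
        gaps.append(statistics.fmean(smith_gains) - statistics.fmean(jones_gains))
    return statistics.fmean(gaps)


if __name__ == "__main__":
    print("random assignment, average gap:", round(simulate_gap(False), 2))
    print("sorted assignment, average gap:", round(simulate_gap(True), 2))
```

With random assignment the average gap between the two teachers hovers near zero, as it should when they are equally good. With sorted assignment Mr. Smith appears to be several points "better" every year, purely because of who is sitting in his classroom.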
In both Delaware and Tennessee, students’ test-score growth will be combined with other kinds of information to make judgments about teacher performance. Considering data from multiple sources will help overcome some of the issues we’ve raised here. However, looking at “what the numbers say” appeals to policy makers who crave simple indicators of complex phenomena. Legislators and governors don’t have to pass a statistics exam before taking office, and they haven’t had an especially good record of listening to educational testing experts before they mandate new uses for test results. For example, the relevant professional associations have jointly endorsed a set of principles on appropriate uses of tests which, among other things, caution against using a particular test for purposes other than those for which its validity has been studied and confirmed.
Despite this caution, policy makers tend to pile extra uses onto tests once they’ve required that students take them—for example, once they’ve got a test that’s been validated as a measurement of individual students’ mastery of grade-level curriculum, they tend to aggregate the results of that test and use them to rate a whole school’s overall performance, as in No Child Left Behind. The tendency has also been for quantitative performance indicators, even if of somewhat dubious quality, to dominate over other forms of evaluation. We worry that something similar will happen with the use of student performance in determining teachers’ pay, promotion, or retention. “The numbers” look objective to people outside schools, while other measures like analysis of lesson plans or documentation of classroom observations seem by comparison to be imprecise means by which the “education establishment” can continue to protect the incompetent.
Educators have welcomed the Obama administration’s willingness to eliminate some of the less logical components of No Child Left Behind, such as the “adequate yearly progress” benchmarks based on unfounded assumptions about how schools improve and on definitions of “proficiency” driven more by political expediency than by an objective definition of what students need to learn in order to succeed in further education and careers. However, even though we’re now “racing to the top” rather than trying to ensure “no child left behind,” we still risk basing reasonable-sounding policies on unreasonable assumptions and racing (with apologies to Talking Heads) on a road to nowhere.
* Gain scores can be compared through several different statistical techniques. At the most basic level, a t-test can compare the gain scores of two teachers, and an analysis of variance (ANOVA) can compare the gain scores of multiple teachers. More sophisticated methods include ANCOVA, in which covariates such as the initial ability of the examinees and other factors can be used as control variables.
** Models such as ANCOVA or regression can help control for other variables.
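For readers who want to see what the simplest comparison in the first footnote involves, here is a minimal Python sketch of a two-sample t statistic on gain scores. It uses Welch's version of the t-test, which does not assume the two classes have equal variance; that choice, the function name, and the example gain scores are ours and purely illustrative.

```python
import math
import statistics


def welch_t(gains_a, gains_b):
    """Return (t statistic, approximate degrees of freedom) for two sets of gains.

    Uses Welch's unequal-variance t-test with the Welch-Satterthwaite
    approximation for the degrees of freedom.
    """
    n_a, n_b = len(gains_a), len(gains_b)
    # Per-class sampling variance of the mean gain.
    v_a = statistics.variance(gains_a) / n_a
    v_b = statistics.variance(gains_b) / n_b
    t = (statistics.fmean(gains_a) - statistics.fmean(gains_b)) / math.sqrt(v_a + v_b)
    df = (v_a + v_b) ** 2 / (v_a ** 2 / (n_a - 1) + v_b ** 2 / (n_b - 1))
    return t, df


if __name__ == "__main__":
    # Made-up gain scores for two very small classes.
    jones = [1, 2, 3, 4, 5]
    smith = [2, 4, 6, 8, 10]
    t, df = welch_t(jones, smith)
    print(f"t = {t:.3f}, df = {df:.2f}")
```

A t statistic like this only says whether the two averages differ by more than chance would explain. It says nothing about why they differ, which is where the assignment problem discussed in the post comes in.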