Chicago Teachers’ Strike, Performance Evaluation, and School Reform (Jack Schneider and Ethan Hutt)

For the fifth consecutive day in Chicago, nearly 30,000 teachers are out on strike. At issue are many of the contractual details that typically complicate collective bargaining—pay, benefits, and the length of the work day. But the heart of the dispute roiling the Windy City is something relatively new in contract talks: a complicated statistical algorithm.

District leaders in Chicago, following the lead of reformers in cities nationwide, are pushing for a “value-added” evaluation system. Unlike traditional forms of evaluation, which rely primarily on classroom observations, policymakers in Chicago propose to quantify teacher quality through the analysis of student achievement data. Using cutting-edge statistical methodologies to analyze standardized test scores, the district would determine the value “added” by each teacher and use that information as a basis for making personnel decisions.
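To make the mechanics concrete, here is a toy sketch in Python of the core logic behind a value-added estimate, using entirely invented numbers: predict each student’s current score from a prior score, then credit each teacher with the average amount by which his or her students beat the prediction. Real systems layer on many more statistical controls, but this is the basic move:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data: prior- and current-year test scores for students,
# each assigned to one of five teachers.
n_teachers, n_per_class = 5, 30
teacher = np.repeat(np.arange(n_teachers), n_per_class)
prior = rng.normal(50, 10, size=teacher.size)
true_effect = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])  # unknowable in practice
current = (5 + 0.9 * prior
           + true_effect[teacher]
           + rng.normal(0, 8, size=teacher.size))  # classroom-level noise

# Step 1: predict current scores from prior scores (simple least squares).
slope, intercept = np.polyfit(prior, current, 1)
predicted = intercept + slope * prior

# Step 2: a teacher's "value added" is the average amount by which
# his or her students beat (or miss) their predicted scores.
residual = current - predicted
value_added = np.array([residual[teacher == t].mean()
                        for t in range(n_teachers)])
print(np.round(value_added, 2))
```

The noise term in the invented data is the crux of the dispute: the district observes only the noisy estimates, never the true effects.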

Teachers are opposed to this approach for a number of reasons. But educational researchers are generally opposed to it, too, and their reasoning is far less varied: value-added evaluation is unreliable.

As researchers have shown, value-added methodologies are still very much works in progress. Scholars like Heather Hill have found that value-added scores correlate not only with the quality of a teacher’s instruction, but also with the population of students that teacher serves. Researchers examining schools in Palm Beach, Florida, discovered that more than 40 percent of teachers scoring in the bottom decile one year, according to value-added measurements, somehow scored in the top two deciles the following year. And according to a recent Mathematica study, the error rate for comparing teacher performance was 35 percent. Such figures could inspire confidence only among those working hard to suspend disbelief.
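The Palm Beach finding is less mysterious than it sounds. A toy simulation (invented numbers, not the Florida data) shows how random noise, layered on perfectly stable teachers, produces dramatic year-to-year decile jumps all by itself:

```python
import numpy as np

rng = np.random.default_rng(42)

# Invented setup: 1,000 teachers whose true quality is perfectly stable,
# but whose yearly value-added score is true quality plus sampling noise
# (small classes make the noise large relative to real differences).
n = 1000
true_quality = rng.normal(0, 1, n)
noise_sd = 1.5  # assumed noise level, chosen only for illustration
year1 = true_quality + rng.normal(0, noise_sd, n)
year2 = true_quality + rng.normal(0, noise_sd, n)

def decile(scores):
    ranks = scores.argsort().argsort()  # 0 .. n-1, lowest score first
    return ranks * 10 // len(scores)    # 0 = bottom decile, 9 = top

d1, d2 = decile(year1), decile(year2)
bottom = d1 == 0
jumped = np.mean(d2[bottom] >= 8)  # bottom decile -> top two deciles
print(f"{jumped:.0%} of bottom-decile teachers reach the top two deciles next year")
```

Even though no teacher in the simulation changes at all, a nontrivial share of the bottom decile lands near the top the following year on noise alone.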

And yet suspending disbelief is exactly what reformers are doing. Instead of slowing down the push for value-added, they’re plowing full steam ahead. Why?

The promise of a mechanized quality-control process, it turns out, has long captivated education reformers. And while the statistical algorithm in question right now in Chicago happens to be quite new, reformer obsession with ostensibly standardized, objective, and efficient means of gauging value is, in fact, quite old. Unfortunately, as the past reveals, plunging headlong into a cutting-edge measurement technology is also quite problematic.

Example 1:
Nearly a century ago, school leaders saw a breakthrough in testing technology as a new way of measuring teacher quality. By using newly designed IQ tests to assess “native ability,” school administrators could translate student scores on standardized tests into measures of teacher effectiveness. Of course, not everyone was on board with this effort. As one school superintendent noted, some educators were concerned “that the means for making quantitative and qualitative measures of the school product” were “too limited to provide an adequate basis for judgment.” But the promise of the future was too tempting and, as he argued, though it was “impossible” to measure teacher quality rigorously, “a good beginning” had been made. Reformers plowed ahead.

The IQ movement was deeply flawed. The instruments were faulty and culturally biased. The methodology was inconsistent and poorly applied. And the interpretations were horrifying. “If both parents are feeble-minded all the children will be feeble-minded,” wrote H.H. Goddard in 1914. “Such matings,” he reasoned, “should not be allowed.” Others drew equally shocking conclusions. E.G. Boring, a distinguished psychologist of the period, wrote in 1920 that “the average man of many nations is a moron.” The average Italian-American adult, he calculated, had a mental age of 11.01 years. African-Americans were at the bottom of his list, with an average mental age of 10.41.

Value-added proponents like to argue that “some data is better than no data.” Yet in the case of the mental testing movement, that was patently false. For the hundreds of thousands of students tracked into dead-end curricula, to say nothing of the forced sterilization campaigns that took place outside of schools, reform was imprudent and irresponsible.

But one need not go back so far into the educational past for examples of half-baked quality-control reforms peddled by zealous policymakers.

Example 2:
In the 1970s, 37 states hurriedly adopted “minimum competency testing” legislation and implemented “exit examinations,” ignoring the concerns of experts in the field. As one panel of scholars observed, the plan was “basically unworkable” and exceeded “the present measurement arts of the teaching profession.” Reformers, however, were not easily dissuaded.

The result of the minimum competency movement was the development of a high-stakes accountability regime and years of litigation. Reformers claimed that the information revealed by such tests would provide the sunlight and shame that schools needed to improve. Yet while they awaited that outcome, thousands of students suffered the indignity of being labeled “functionally illiterate,” were forced into remedial classes, and had their diplomas withheld despite having enough units to graduate—all on the basis of a test that leading scholars described as “an indefensible technology.”

Contrary to what reformers claimed, the information provided by such deeply flawed tests did little to improve students’ learning or the quality of schools.

Today’s policymakers, like those of the past, want to adopt new tools as swiftly as possible. Even flawed value-added measures, they argue, are better than nothing. Yet the risks of early adoption, as the past reveals, can far outweigh the rewards. Simply put, acting rashly on incomplete information makes mistakes more costly than necessary.

Today’s value-added boosters believe themselves to be somehow different—acting on better incomplete information. Yet the idea that their incomplete information is somehow good enough to act on strains credulity.

Good technologies do tend to improve over time. And if advocates of value-added models are confident that they can work out the kinks, they should continue to experiment with them judiciously. In the meantime, however, such models should be kept out of all high-stakes personnel decisions. Until we can make them work sufficiently well, they shouldn’t count.


Jack Schneider is an Assistant Professor of Education at the College of the Holy Cross and author of Excellence For All: How a New Breed of Reformers Is Transforming America’s Public Schools. Ethan Hutt is a doctoral candidate at the Stanford University School of Education and has been named “one of the world’s best emerging social entrepreneurs” by the Echoing Green Foundation.


Filed under how teachers teach, school reform policies, testing

19 responses to “Chicago Teachers’ Strike, Performance Evaluation, and School Reform (Jack Schneider and Ethan Hutt)”


  3. A very clear, pertinent explanation of value-added measures. Thank you.

  4. valenteluis

    I am delighted to read your writing in the face of the constant pseudo-reforms in education made by those who know nothing about teaching or learning in schools.
    I’m a Portuguese primary schoolteacher and a university professor. Although unfamiliar with the U.S. education system, I recognize in Larry’s posts much of the reality of my own country. Thank you for helping me to reflect on education, technologies, and teaching.

  5. sondracuban

    Dad, this sounds just like your book!!! It’ll come out at a good time. This really is deeply disturbing! Love, S

    • larrycuban

      Jack and Ethan laid out well the thorny, unresolved issues accompanying value-added teacher evaluation. In pointing out past reforms that were similar in their confidence that testing technology would transform teaching and learning, they provide a context too often missing from the current debate. Thanks for commenting, Sondra.

  6. Bob Calder

    A discussion of value added “technology” that doesn’t include serious consideration of Baker’s criticisms is missing a crucial piece.

  7. Michael B. Calyn

    Reblogged this on Ye Olde Soapbox.

  8. larrycuban

    Thanks for posting this for your readers.

  9. Mike Steinberg

    The IQ movement gets a lot of flak, but psychometric testing was (and is) very useful. Steve Hsu, who is involved with the BGI Cognitive Genomics Project, has blogged several times about Louis Terman’s study of gifted children (IQ of 135 or over on the Stanford-Binet), which began in the 1920s. As he notes here, the tests Terman used seemed to accurately predict ability.

    “Compare the bottom right IQ graph with SMPY results which show the impact of ability (SAT-M measured before age 13) on publication and patent rates. Ability in the SMPY graph varies between 99th and 99.99th percentile in quartiles Q1-Q4. The variation in IQ between the bottom and top deciles of the Terman study covers a similar range. The Terman super-smarties (i.e., +4 SD) only earned slightly more (say, 15-20% over a lifetime) than the ordinary smarties (i.e., +2.5 SD), but the probability of earning a patent (SMPY) went up by about 4x over the corresponding ability range.”


    “What does this mean? Enrichment is again seen as unlikely to drastically alter cognitive ability. An 1150 SAT kid is not going to become a 1460 or 1600 kid as a result of their college education. Yes, those funny little tests are measuring something real and relatively stable.

    “These results have been known for many years, thanks to the Terman study of 1,538 gifted individuals (see here and here). Note that overall the “Termites” tended to be very successful in life; we focus on the most and least successful outliers (Groups A and C) to test the effect of enrichment on cognitive ability.”

  10. Hi Larry and Jack,

    “The promise of a mechanized quality-control process, it turns out, has long captivated education reformers. And while the statistical algorithm in question right now in Chicago happens to be quite new, reformer obsession with ostensibly standardized, objective, and efficient means of gauging value is, in fact, quite old.”

    Indeed. We really are still dealing with the factory model of education, though, right? Learner-centered progressive reformers such as John Dewey, Maria Montessori, James Comer, and many others understood the limitations and dangers of such top-down mechanisms of measurement and control. They don’t help children or teachers to grow. Carol Dweck explains this nicely with her idea of a fixed versus growth orientation to learning.

