Reshaping Teaching through Managerial Use of Student Test Scores

The path of educational progress more closely resembles the flight of a butterfly than the flight of a bullet.  Philip Jackson, 1968

Top governmental policymakers and private insurance companies, deeply concerned over ever-rising health care costs and unwilling to rely upon doctors to restrain expenditures, have built structures over the past quarter-century to hold physicians accountable for their actions in diagnosing and treating patients. These structures leaned heavily upon a research base built up over decades from clinical trials on screening procedures to the effects of drugs on an array of diseases. Combining evidence-based medicine with incentives and sanctions, public and private insurers have measured, reported, and rewarded doctors’ performance in hospitals, clinics, and office practices.
In copying outcome-driven corporations, these medical policymakers and insurers relied upon performance-based metrics. They assumed that creating economic incentives for individuals and organizations would increase innovation, lower costs, and improve patient care. They identified numerous measures, confirmed in large part by results from randomized clinical trials embedded in evidence-based medicine, and implemented those measures in hospitals, clinics, and doctors’ offices. Physician “report cards” and pay-for-performance plans, however, have yet to yield the promised innovations, high-quality care, and reduced costs.

Educational policymakers have made a similar set of assumptions in constructing accountability structures and using metrics for managing how teachers are to be evaluated and paid. In doing so, however, these decision-makers lack the knowledge base in educational research that physicians have had available in evidence-based medicine.

While social scientists and educational researchers have used randomized control group studies to uncover what caused phenomena in schools and classrooms, such studies have been the exception, not the rule. Ethical considerations, cost, and the complexity of schools, teaching, and learning limit the use of experimental-control research designs.

Qualitative research studies using surveys, interviews, case studies, and ethnographies are not designed to draw causal inferences; moreover, they cannot, given the questions asked, the samples drawn, and methodologies used. Qualitative studies ask different questions and provide rich data for exploring other issues that are missing from experimental-control designs.

As a consequence, unlike physicians, who can draw from a literature of randomized control trials and use results for diagnosis and treatment of common and uncommon illnesses (e.g., Cochrane Collaboration), policymakers and practitioners can tap only a small and emerging body of knowledge drawn from randomized clinical trials about teaching, learning, and effective schools (e.g., U.S. Department of Education, “What Works Clearinghouse,” and Campbell Collaboration).[i]

That slim database, however, has not lessened the current passion among educational policymakers and politicians for using test scores to evaluate teacher performance (and pay higher salaries). The current “science” of value-added measures (VAM) leans heavily upon the work of William Sanders. Smart researchers and officials are determined to re-engineer teaching to make it closer to the “flight of a bullet” than the “flight of a butterfly.” In seeking the Holy Grail, they have ignored the long march that researchers and policymakers have slogged through in the past century to make teaching scientific.[ii]

Not many contemporary reformers can recall Franklin Bobbitt in the 1920s, Ralph Tyler and Benjamin Bloom in the 1950s, Nathaniel Gage in the 1970s and 1980s, and many other researchers who worked hard to create a science of curriculum and instruction. These scholars rejected the notion that teaching can be unpredictable and uncertain–“the flight of a butterfly.” They believed that teaching could be rational and predictable through scientifically engineering classrooms.

In How To Make a Curriculum (1924), Franklin Bobbitt listed 160 “educational objectives” that teachers should pursue in teaching children, such as “the ability to use language …required for proper and effective participation in community life.” Colleagues in math listed 300 objectives for teachers in grades 1-6, and those in social studies listed nearly 900. This scientific movement to graft “educational objectives” onto daily classroom lessons collapsed under its own weight by the 1940s, and was largely ignored by teachers.[iii]

By the early 1960s, another generation of social scientists had advanced the idea that teachers should use “behavioral objectives” to guide lessons. Ralph Tyler, Benjamin Bloom and others created taxonomies that provided teachers with “prescriptions for the formulation of educational objectives.” Teachers generally ignored these scientific prescriptions in their daily lessons.[iv]

In the 1970s and 1980s, Nathaniel Gage and others sought to establish “a scientific basis for the art of teaching.” They focused on teaching behaviors (how teachers asked questions, which students were called upon, etc.), that is, the process of teaching that leads to the product of effective teaching: student scores on standardized tests. This line of research, called “process-product,” continued the behavioral tradition of an earlier generation committed to a science of teaching. Using experimental methods to identify teaching behaviors that were correlated with student gains on standardized tests, Gage and others came up with “teacher should” statements that were associated with improved student achievement.[v]

The limitations of establishing a set of scientifically prescribed teaching behaviors soon became apparent as critics pointed out how many other factors (e.g., teacher knowledge and beliefs, the content of the lesson, students themselves, the classroom environment, the school) come into play when teachers teach students. Again, teachers generally ignored the results from “process-product” studies.[vi]

And here in 2013, re-engineering teaching through science again seeks “the flight of the bullet.” Evaluating and paying teachers on the basis of student test scores through value-added measures dominates policy talk and action.

In establishing new accountability structures that used squishy metrics and attached high-stakes rewards (e.g., cash bonuses for individual teachers) and sanctions (e.g., no diploma for failing high school students; teachers fired for being ineffective), educational policymakers have plunged into a highly contested arena where the search for teacher effectiveness, “the flight of the bullet,” has generated anger, fear, and lowered morale among those who work daily in classrooms. At the same time, it has generated political gains for elected policymakers.

Recall that under President George W. Bush, the Teacher Incentive Fund made grants to districts for overhauling their teacher evaluation systems. After Barack Obama became President in 2009, the U.S. Department of Education launched Race to the Top, a multi-billion dollar competition among states during a recession when school budgets were cut. To win, states had to meet certain conditions to collect federal dollars. One of those conditions was that states had to create new systems of teacher evaluation that included student test scores. Furthermore, in another federal initiative to turn around failing schools, the U.S. Secretary of Education dispensed School Improvement Grants to districts to overhaul schools with persistent low academic achievement. One of the strategies to turn around such schools included using student test scores to evaluate teachers.[vii]

Philanthropists have pursued similar policies. The Bill and Melinda Gates Foundation awarded grants to six districts to create and establish “fair and reliable measures of effective teaching,” including the use of student test scores. Yet even with all this federal and private money being spent, the question remains whether these structures and metrics have reshaped classroom practices.[viii]

[i] Michael Feuer, Lisa Towne, and Richard Shavelson, “Scientific Culture and Educational Research,” Educational Researcher, 2002, 31(8), pp. 4-14; for a direct comparison between EBM and EBE see: John Willinsky, “Extending the prospects of evidence-based education,” Insight, Vol. 1, No. 1, pp. 23-41. For the Cochrane Collaborative, see ; for Campbell Collaborative, see: ; for What Works Clearinghouse, see: .

[ii] Quote comes from: Philip Jackson, Life in Classrooms (New York: Holt, Rinehart, and Winston, 1968), pp. 166-167. William Sanders and Sandra Horn, “The Tennessee Value-Added Assessment System (TVAAS): Mixed Model Methodology in Educational Assessment,” Journal of Personnel Evaluation in Education, 1994, 8(3), pp. 299-311; Daniel McCaffrey, et al., “Models for Value-Added Modeling of Teacher Effects,” Journal of Educational and Behavioral Statistics, 2004, 29(1), pp. 67-101.

[iii] Elliot Eisner, “Educational Objectives: Help or Hindrance?” The School Review, 1967, 75, pp. 250-260.

[iv] Ibid.

[v] N.L. Gage, The Scientific Basis for the Art of Teaching (New York: Teachers College Press, 1978).

[vi] Walter Doyle, “Paradigms for Research on Teacher Effectiveness,” Review of Research in Education, 1977, 5, pp. 163-198; N.L. Gage and Margaret Needels, “Process-Product Research on Teaching: A Review of Criticisms,” The Elementary School Journal, 1989, 89(3), pp. 253-300.

[vii] Sarah Garland, “Federal teacher evaluation requirement has wide impact,” .

[viii] Bill and Melinda Gates Foundation, “Working with Teachers to Develop Fair and Reliable Measures of Effective Teaching: The MET Project,” 2010. The MetLife Survey of The American Teacher: Teachers, Parents, and the Economy, 2011 (Report published, March 2012), pp. 6-7; see Scholastic, Inc. and Bill & Melinda Gates Foundation, Primary Sources: 2012: America’s Teachers on the Teaching Profession, pp. 27-29.

  3. Bob Calder

    The analogy to medicine is apt in this case.
    Doctors’ decisions about individual care don’t necessarily cause healthcare expenses to go up, so measuring their performance will not cast light on much, if anything. Anecdotes consisting of observed cases of malpractice don’t illuminate our healthcare concerns, however much we want them to.

    Healthcare is a turbulent system. Decisions that make loops (friction, in other words) cause overall expense to rise in an invisible way. Successful interventions that are not patient-centered but patient-specific appear to work, particularly if they are decisions that increase patient happiness. Minimal intervention using science-based decision-making appears to work.

    How this can apply to education is problematic since we aren’t prepared to look at education this way.

    • larrycuban

      The connection I see, Bob, between health care and schools is that they are complex systems, not complicated ones subject to mechanical and rational procedures that produce predictable end results. Grafting processes (pay for performance, for example, used in private sector companies and NASA) onto schools will result in adaptation after adaptation because of unpredictable and unexpected events occurring. What you commented on summarizes the distinction very well.

  4. Thank you for a great article. I appreciated the distinction between complex and complicated.

