Category Archives: testing

Cheating Scandals Reaffirm, Not Diminish, Testing

A verdict on former Superintendent Beverly Hall’s guilt or innocence in what is called the Atlanta cheating scandal will not be rendered until the trials (or plea bargains) are over. Hall’s indictment follows on the heels of El Paso Superintendent Lorenzo Garcia being found guilty last fall. He is now serving three and a half years in jail (see here and here).

Even before a judge or jury decides on her guilt or innocence, anti-testing groups, feeding on Atlanta, El Paso, and the investigation of tampering with test scores under Washington, D.C. school chief Michelle Rhee, have seized on the case to further their cause. Moreover, over the years, journalists have uncovered oddities in other districts across the nation, with test scores jumping sky-high in a single year.

Foes of standardized tests feel the rush of adrenaline in saying that these examples of dishonest adults raising student test scores to receive applause and cash awards are pervasive. Defenders of standardized testing and accountability, however, see the cheating as exceptions, as a few rotten apples in a barrel full of worm-free ones. Most educators, advocates of test-driven accountability say, are decent, hard-working professionals who play by the rules and can be trusted to do the right thing.

In this volleying back-and-forth between advocates and foes of standardized testing, school scandals have been compared to cheating in baseball, bicycle racing, and other sports.

From Mark McGwire’s stained home run record to Tour de France winner Lance Armstrong’s admission that he doped while racing, these and other sports have come under a dark cloud of suspicion–an outcome damaging to top athletes, companies dependent upon income derived from professional sports, fans turning into cynics, and disappointed youth who only want to play the game by the rules.

Cheating in both sports and schools can be traced to unleashed, fierce competition to perform better and better in order to gain ever-larger rewards. Professional sports are money machines, and being a top performer is rewarded handsomely; scores on international tests, rankings of schools within a state and district based on performance, a broader array of school choices, and federal regulations in No Child Left Behind and Race to the Top have ratcheted up the pressure to beat state tests.

Also common to school cheating and drug-drenched sports is betraying the public trust to gain personal advantage. When adults erase student answers and professional athletes take illegal drugs to enhance performance, such acts erode the faith that adults and youth have in social institutions being fair.

Another common feature is the unshaken confidence that current authorities have in written and computerized tests assessing student learning and drug tests determining whether athletes are cheating. When cheating is uncovered, few decision-makers question the tests. Tighter security and better tests are the solutions.

*Few decision-makers question whether there might be something wrong in professional athletics (i.e., expansion of baseball, football, hockey, and basketball leagues and over-the-top competition for more money).

*Few decision-makers question whether most toddlers and young children from low-income families should be tested, especially since they bring to school very different strengths and weaknesses than children from middle- and upper-income homes. Nor do they question whether such early testing of young children squeezes inequities into judgments of what they can and cannot do in preschool and elementary school classrooms.

*Few decision-makers question the national obsession with student test scores as the correct metric to judge schools, teachers, and students.

This deep reluctance to question powerful interests invested in socioeconomic structures and cultures in which cheating occurs is why I believe that standardized tests in schools, like drug testing in sports, will be reaffirmed rather than overturned. There will be continuing challenges–as there should be–but standardized testing will remain rock-solid. Why?

First, note that most of the cheating incidents have occurred in districts where high percentages of poor and minority students attend school. Sure, there are exceptions, but when you look closely at where dishonesty is found, those charters and regular public schools enroll large numbers of children from low-income families. I have yet to find any district school boards, investigators, charter school leaders, or policymakers who recommended examining the tests to see if they do what they are supposed to do or who, after conducting such an examination, found the tests unworthy and got rid of them. Yes, there have been protests by educators, students, and middle- and upper-middle-class families against too much standardized testing (see here and here). These protests have led to occasional boycotts, but none have occurred, to my knowledge, in poor neighborhoods. If anything, there is a reaffirmation of tests, calls for greater security, and plaudits for any whistle-blowers.

The point is that these tests sort students and schools by scores that reinforce rather than erase existing gaps in achievement. And sorting is necessary to determine who, beginning at the age of four, shall climb each rung of the ladder that reaches college. The system of private and public schooling requires such tests to distinguish high achievers from others. If the tests were really that accurate in distinguishing which children and youth are smart on paper, with people, and in life, now and later, then perhaps we would need such tests. But that is not the case now… by a long shot.

Second, to underscore the above point, consider the experience of cheating on the SAT. After a scandal revealed that high-scoring individuals with fake IDs were paid to take the SAT test, Educational Testing Service tightened security at test sites. No challenges of the test itself occurred. SAT scores remain crucial for college admission and no school boards, teachers, or parent groups called for the end of the test.

Count on cheaters getting more clever and investigators still hunting them down. Amid increasing numbers of cheating incidents, standardized tests will be challenged, maybe the numbers even reduced, but nonetheless, they will reign for the immediate future.


Filed under leadership, school reform policies, testing

From Quill Pens to Computer Adaptive Testing: Old and New Technological Devices

There are many definitions of instructional technology. One concentrates on devices teachers use in classrooms. Another definition focuses on the different ways that teachers have used such devices as tools to advance learning in lessons. Still other definitions frame technology as processes, ways of organizing classrooms, schools, and districts.

I examine the second definition in this post: the connection between the writing tools students used and the perpetual need, over the past two millennia, of teachers in every culture to find out what students have learned. Here I consider the quill and steel-tipped pen, pencil, ball-point pen, and yes, the computer.

I begin with the quill pen.


Here is how Robert Travers (1983, pp. 97-98) described quill pens.

The quill pen was first mentioned in the writings of Saint Isidore of Seville in the seventh century…. The quill seems to have been by far the best writing instrument invented in its time for it displaces all other forms. It became the main instrument used in schools, apart from the slate…. Even in the late 1800s, the quill pen was still the most widely used instrument for writing. Quills [came] from the wings of geese, but swan quills were also highly valued. For fine work, the quills of crows were sometimes used.

In 1809, an inventor, Joseph Bramah, developed a machine for cutting quills into lengths, and the short lengths were then inserted into a wooden holder….The separation of the point and the holder led to many inventions, and one of these was the [metal-tipped pen]….The first factory for the mass production of the steel pen was established in New Jersey in 1870….

When the steel pen entered education, a revolution in school practice [occurred]. Writing with the quill had been a slow, unhurried art…. [T]he writer had to stop frequently in order to reshape and sharpen the quill. Since writing was a slow art, pride was taken in it….The steel pen changed that. The steel pen made it possible to write continuously over long periods. There was ever increasing pressure on the pupil to produce written material in quantity. The new medium for written work then became used for examinations, which became substitutes for the form of oral examination provided by the recitation [where students would be quizzed in public for their knowledge]….

By 1890, students had become so used to the steel pen that examinations were commonly administered using this writing instrument as a tool to produce rapidly written answers.

[Image: school desks with inkwell holes]

As an elementary school student in 1940 at the now-demolished Minersville elementary school in the Pittsburgh (PA) public schools, I sat at one of the above desks, with its hole for the then-defunct inkwell.

What about pencils? Like metal-tipped pens, mass-produced pencils did not appear in most classrooms until the early decades of the 20th century. And with pencils, teachers assessed what students learned through hand-written homework, quizzes, essays, and multiple-choice tests (introduced in the U.S. during World War I). These inexpensive devices–mass-produced ball-point pens arrived in schools in the 1940s–made assessing students’ knowledge inexpensive (after all, no one pays students to take tests or for lost learning time) and efficient in judging promotion, retention, graduation, and other high-stakes outcomes.

Now arrives computer adaptive testing (CAT). Used a great deal in the private sector for employment and other purposes, computerized testing has entered schools over the past few decades. In Measures of Academic Progress (MAP), for example, students sit in front of computer screens and take tests that are tailored to their ability. When a student answers an item (usually multiple-choice) correctly, the student is given a harder item to answer. If the student gives a wrong answer, the screen shows an easier question. This goes on until the computer's item bank runs out of items to administer to the student or the computer has sufficient information to give the student a score. Whichever happens first ends the test.
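The harder-after-correct, easier-after-wrong loop described above can be sketched in a few lines of Python. This is only a simplified "staircase" illustration, not how MAP or any vendor actually works: real adaptive tests generally rely on statistical models of item difficulty. The function name, the numeric difficulty scale, and the `max_items` cap (a stand-in for "the computer has sufficient information") are all assumptions made for the example.

```python
def run_adaptive_test(item_difficulties, answers_correctly, max_items=10):
    """Administer items one at a time, stepping harder after a correct
    answer and easier after a wrong one (a simplified staircase).

    item_difficulties: numeric difficulty for each item in the bank.
    answers_correctly: hypothetical callable, difficulty -> True/False.
    max_items: assumed stand-in for "sufficient information to score".
    """
    remaining = sorted(item_difficulties)
    history = []
    if not remaining:
        return history

    idx = len(remaining) // 2  # start with a middle-difficulty item
    while len(history) < max_items:
        difficulty = remaining.pop(idx)
        correct = answers_correctly(difficulty)
        history.append((difficulty, correct))

        if not remaining:  # the item bank has run out
            break
        if correct:
            # After pop(), position idx holds the next harder item,
            # unless we were already at the hard end of the bank.
            idx = min(idx, len(remaining) - 1)
        else:
            # Step down to the next easier item, if one is left.
            idx = max(idx - 1, 0)
    return history


# Usage: a made-up student who can answer items up to difficulty 7.
bank = list(range(1, 11))
result = run_adaptive_test(bank, lambda d: d <= 7, max_items=5)
for difficulty, correct in result:
    print(difficulty, "correct" if correct else "wrong")
```

Either stopping condition ends the loop: the item cap is reached or the bank is empty, whichever comes first.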

Highly touted by promoters and vendors–see McGraw-Hill YouTube segment for an example of hype–CAT is part of the package that new national tests accompanying Common Core standards will include by 2014. There are, as with any new technological device, clear advantages and disadvantages of this form of assessment (see Computer Adaptive Testing).

Like quill and steel-tipped pens dipped in ink, pencils, and ballpoint pens, here is another technological device that is being bent toward finding out what students know. Ideally, of course, there would be no need for CAT or the mountain-high summative tests currently in vogue across the country were the nation’s teachers sufficiently trusted to use the many ways they assess daily what their students know and can do, and were districts to build and increase teacher knowledge and skills in assessment. That kind of time investment in teacher knowledge and skills, and the accompanying trust in teachers and schools to assess and report the results, are, sad to say, missing in action.

So watch computer adaptive testing become the new steel-tipped pen of the late-19th century.


Filed under how teachers teach, technology use, testing

Chicago Teachers’ Strike, Performance Evaluation, and School Reform (Jack Schneider and Ethan Hutt)

For the fifth consecutive day in Chicago, nearly 30,000 teachers are out on strike. At issue are many of the contractual details that typically complicate collective bargaining—pay, benefits, and the length of the work day. But the heart of the dispute roiling the Windy City is something relatively new in contract talks: a complicated statistical algorithm.

District leaders in Chicago, following the lead of reformers in cities nationwide, are pushing for a “value-added” evaluation system. Unlike traditional forms of evaluation, which rely primarily on classroom observations, policymakers in Chicago propose to quantify teacher quality through the analysis of student achievement data. Using cutting-edge statistical methodologies to analyze standardized test scores, the district would determine the value “added” by each teacher and use that information as a basis for making personnel decisions.

Teachers are opposed to this approach for a number of reasons. But educational researchers are generally opposed to it, too, and their reasoning is far less varied: value-added evaluation is unreliable.

As researchers have shown, value-added methodologies are still very much works-in-progress. Scholars like Heather Hill have found that teachers’ value-added scores correlate not only with the quality of their instruction, but also with the population of students they teach. Researchers examining schools in Palm Beach, Florida discovered that more than 40 percent of teachers scoring in the bottom decile one year, according to value-added measurements, somehow scored in the top two deciles the following year. And according to a recent Mathematica study, the error rate for comparing teacher performance was 35 percent. Such figures could only inspire confidence among those working to suspend disbelief.

And yet suspending disbelief is exactly what reformers are doing. Instead of slowing down the push for value-added, they’re plowing full steam ahead. Why?

The promise of a mechanized quality-control process, it turns out, has long captivated education reformers. And while the statistical algorithm in question right now in Chicago happens to be quite new, reformer obsession with ostensibly standardized, objective, and efficient means of gauging value is, in fact, quite old. Unfortunately, as the past reveals, plunging headlong into a cutting-edge measurement technology is also quite problematic.

Example 1:
Nearly a century ago, school leaders saw a breakthrough in measurement technology as a way of measuring teacher quality. By using newly-designed IQ tests to assess “native ability,” school administrators could translate student scores on standardized tests into measures of teacher effectiveness. Of course, not everyone was on board with this effort. As one school superintendent noted, some educators were concerned “that the means for making quantitative and qualitative measures of the school product” were “too limited to provide an adequate basis for judgment.” But the promise of the future was too tempting and, as he argued, though it was “impossible” to measure teacher quality rigorously, “a good beginning” had been made. Reformers plowed ahead.

The IQ movement was deeply flawed. The instruments were faulty and culturally-biased. The methodology was inconsistent and poorly applied. And the interpretations were horrifying. “If both parents are feeble-minded all the children will be feeble-minded,” wrote H.H. Goddard in 1914. “Such matings,” he reasoned, “should not be allowed.” Others drew equally shocking conclusions. E.G. Boring, a distinguished psychologist of the period, wrote in 1920 that “the average man of many nations is a moron.” The average Italian-American adult, he calculated, had a mental age of 11.01 years. African-Americans were at the bottom of his list, with an average mental age of 10.41.

Value-added proponents like to make the argument that “some data is better than no data.” Yet in the case of the mental testing movement, that was patently false. For the hundreds of thousands of students tracked into dead-end curricula, to say nothing of the forced sterilization campaigns that took place outside of schools, reform was imprudent and irresponsible.

But one need not go back so far into the educational past for examples of half-baked quality-control reforms peddled by zealous policymakers.

Example 2:
In the 1970s, 37 states hurriedly adopted “minimum competency testing” legislation and implemented “exit examinations,” ignoring the concerns of experts in the field. As one panel of scholars observed, the plan was “basically unworkable” and exceeded “the present measurement arts of the teaching profession.” Reformers, however, were not easily dissuaded.

The result of the minimum competency movement was the development of a high stakes accountability regime and years of litigation. Reformers claimed that the information revealed by such tests would provide the sunlight and shame that schools needed to improve. Yet while they awaited that outcome, thousands of students suffered the indignity of being labeled “functionally illiterate,” were forced into remedial classes, and had their diplomas withheld despite having enough units to graduate—all on the basis of a test that leading scholars described as “an indefensible technology.”

Contrary to what reformers claimed, the information provided by such deeply flawed tests did little to improve students’ learning or the quality of schools.

Today’s policymakers, like those of the past, want to adopt new tools as swiftly as possible. Even flawed value-added measures, they argue, are better than nothing. Yet the risks of early adoption, as the past reveals, can far outweigh the rewards. Simply put, acting rashly on incomplete information makes mistakes more costly than necessary.

Today’s value-added boosters believe themselves to be somehow different—acting on better incomplete information. Yet the idea that incomplete information can be good strains credulity.

Good technologies do tend to improve over time. And if advocates of value-added models are confident that they can work out the kinks, they should continue to judiciously experiment with them. In the meantime, however, such models should be kept out of all high-stakes personnel decisions. Until we can make them work sufficiently well, they shouldn’t count.

________________________________

Jack Schneider is an Assistant Professor of Education at the College of the Holy Cross and author of Excellence For All: How a New Breed of Reformers Is Transforming America’s Public Schools. Ethan Hutt is a doctoral candidate at the Stanford University School of Education and has been named “one of the world’s best emerging social entrepreneurs” by the Echoing Green Foundation.


Filed under school reform policies, how teachers teach, testing

Testing, Testing, and Testing: More Cartoons

The U.S. has tests galore. Driving, alcohol, steroids, DNA, citizenship, blood, pregnancy–and on and on. Most serve a specific purpose and carry personal consequences if one passes or fails. School tests, however, used to pass a course, to be promoted to another grade, to graduate, and to judge whether a school is satisfactory or on probation, have proliferated dramatically in the past three decades. Opinions are split among Americans about these tests.

Surveys report that most teachers (but by no means all) believe that there is too much standardized testing. Some parents have mobilized to boycott annual tests. Most respondents to opinion polls, however, support curriculum standards, accountability, and, yes, state tests.

Of the many cartoons on testing that I have located, most reflect the opinion that there is too much testing and too much is made of the results. I have found very few–none that I can recall or that I have posted–endorsing standardized tests. Here is a sampling of those cartoons.

For those readers who wish to see previous monthly posts of cartoons, see: “Digital Kids in School,” “Testing,” “Blaming Is So American,” “Accountability in Action,” “Charter Schools,” “Age-graded Schools,” “Students and Teachers,” “Parent-Teacher Conferences,” “Digital Teachers,” and “Addiction to Electronic Devices.”


Filed under testing

Three Important Distinctions In How We Talk About Test Scores (Matt DiCarlo)

“Matthew Di Carlo is a senior fellow at the non-profit Albert Shanker Institute in Washington, D.C. His current research focuses mostly on education policy, but he is also interested in social stratification, work and occupations, and political attitudes/behavior.” The post appeared May 25, 2012.

In education discussions and articles, people (myself included) often say “achievement” when referring to test scores, or “student learning” when talking about changes in those scores. These words reflect implicit judgments to some degree (e.g., that the test scores actually measure learning or achievement). Every once in a while, it’s useful to remind ourselves that scores from even the best student assessments are imperfect measures of learning. But this is so widely understood – certainly in the education policy world, and I would say among the public as well – that the euphemisms are generally tolerated.

And then there are a few common terms or phrases that, in my personal opinion, are not so harmless. I’d like to quickly discuss three of them (all of which I’ve talked about before). All three appear many times every day in newspapers, blogs, and regular discussions. To criticize their use may seem like semantic nitpicking to some people, but I would argue that these distinctions are substantively important and may not be so widely-acknowledged, especially among people who aren’t heavily engaged in education policy (e.g., average newspaper readers).

So, here they are, in no particular order.

In virtually all public testing data, trends in performance are not “gains” or “progress.” When you tell the public that a school or district’s students made “gains” or “progress,” you’re clearly implying that there was improvement. But you can’t measure improvement unless you have at least two data points for the same students – i.e., test scores in one year are compared with those in previous years. If you’re tracking the average height of your tomato plants, and the shortest one dies overnight, you wouldn’t say that there had been “progress” or “gains,” just because the average height of your plants suddenly increased.

Similarly, almost all testing trend data that are available to the public don’t actually follow the same set of students over time (i.e., they are cross-sectional). In some cases, such as NAEP, you’re comparing a sample of fourth and eighth graders in one year with a different cohort of fourth and eighth graders two years earlier. In other cases, such as the results of state tests across an entire school, there’s more overlap – many students remain in the sample between years – but there’s also a lot of churn. In addition to student mobility within and across districts, which is often high and certainly non-random, students at the highest tested grade leave the schools (unless they’re held back), while whole new cohorts of students enter the samples at the lowest tested grade (in middle schools serving grades seven and eight, this means that half the sample turns over every year).

So, whether it’s NAEP or state tests, you’re comparing two different groups of students over time. Often, those differences cannot be captured by standard education variables (e.g., lunch program eligibility), but are large enough to affect the results, especially in smaller schools (smaller samples are more prone to sampling error). Calling the differences between years “gains/progress” or “losses” therefore gives a false impression; at least in part, they are neither – reflecting nothing more than variations between the cohorts being compared.
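The tomato-plant example can be made concrete with a little arithmetic. The heights below are invented for illustration; the sketch simply shows an average rising when the composition of the group changes, even though no individual plant grew.

```python
from statistics import mean

# Hypothetical plant heights in centimeters on day one.
heights_day_one = [10, 30, 40, 50]

# Overnight the shortest plant dies; the survivors have not grown at all.
heights_day_two = [30, 40, 50]

average_day_one = mean(heights_day_one)  # 32.5
average_day_two = mean(heights_day_two)  # 40

# Growth can only be measured on plants present at both time points.
growth_of_survivors = [after - before
                       for before, after in zip(heights_day_one[1:], heights_day_two)]

print("Average on day one:", average_day_one)
print("Average on day two:", average_day_two)
print("Growth of each surviving plant:", growth_of_survivors)
```

The average rises by 7.5 centimeters while every surviving plant's growth is zero, which is the same trap as reading a change between two different student cohorts as a "gain."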

Proficiency rates are not “scores.” Proficiency or other cutpoint-based rates (e.g., percent advanced) are one huge step removed from test scores. They indicate how many students scored above a certain line. The choice of this line can be somewhat arbitrary, reflecting value judgments and, often, political considerations as to the definition of “proficient” or “advanced.” Without question, the rates are an accessible way to summarize the actual scale scores, which aren’t very meaningful to most people. But they are interpretations of scores, and severely limited ones at that.*

Rates can vary widely, using the exact same set of scores, depending on where the bar is set. In addition, all these rates tell you is whether students were above or below the designated line – not how far above it or below it they might be. Thus, the actual test scores of two groups of students might be very different even though they have the same proficiency rate, and scores and rates can move in opposite directions between years.
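A small sketch, using made-up scores and cut points, illustrates both claims: the same scores produce different rates under different cut points, and the rate can rise between years while the average score falls. The function name and all numbers are assumptions for the example, not real testing data.

```python
from statistics import mean


def proficiency_rate(scores, cut_point):
    """Share of students scoring at or above the cut point."""
    return sum(score >= cut_point for score in scores) / len(scores)


# Hypothetical scale scores for the same school in two years.
year_one = [48, 49, 90, 95]
year_two = [50, 51, 60, 61]

# With a cut point of 50, the rate doubles between years...
rate_one = proficiency_rate(year_one, 50)  # 0.5
rate_two = proficiency_rate(year_two, 50)  # 1.0

# ...while the average score drops.
mean_one = mean(year_one)  # 70.5
mean_two = mean(year_two)  # 55.5

# Moving the bar to 60 changes the rate for the very same year-two scores.
rate_two_higher_bar = proficiency_rate(year_two, 60)  # 0.5

print(rate_one, rate_two, mean_one, mean_two, rate_two_higher_bar)
```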

To mitigate the risk of misinterpretation, comparisons of proficiency rates (whether between schools/districts or over time) should be accompanied by comparisons of average scale scores whenever possible. At the very least, the two should not be conflated.**

Schools with high average test scores are not necessarily “high-performing,” while schools with lower scores are not necessarily “low-performing.” As we all know, tests don’t measure the performance of schools. They measure (however imperfectly) the performance of students. One can of course use student performance to assess that of schools, but not with simple average scores.

Roughly speaking, you might define a high-performing school as one that provides high-quality instruction. Raw average test scores by themselves can’t tell you about that, since the scores also reflect starting points over which schools have no control, and you can’t separate the progress (school effect) from the starting points. For example, even the most effective school, providing the best instruction and generating large gains, might still have relatively low scores due to nothing more than the fact that the students it serves have low scores upon entry, and they only attend the school for a few years at most. Conversely, a school with very high scores might provide poor instruction, simply maintaining (or even decreasing) the already stellar performance levels of the students it serves.

We very clearly recognize this reality in how we evaluate teachers. We would never judge teachers’ performance based on how highly their students score at the end of the year, because some teachers’ students were higher-scoring than others’ at the beginning of the year.

Instead, to the degree that school (and teacher) effectiveness can be assessed using testing data, doing so requires growth measures, as these gauge (albeit imprecisely) whether students are making progress, independent of where they started out and other confounding factors. There’s a big difference between a high-performing school and a school that serves high-performing students; it’s important not to confuse them.
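The difference between a level and a growth measure can be shown with invented numbers. The sketch below uses a simple gain score (exit score minus entry score for the same students); actual growth models are more elaborate and adjust for other factors, so treat this only as an illustration. School names and scores are hypothetical.

```python
from statistics import mean

# Hypothetical (entry score, exit score) pairs for the same students.
school_a = [(20, 35), (25, 40), (30, 45)]  # low starting points, large gains
school_b = [(80, 80), (85, 84), (90, 88)]  # high starting points, flat or falling


def average_exit_score(students):
    """Level measure: the average of final scores only."""
    return mean(exit_score for _, exit_score in students)


def average_gain(students):
    """Simple growth measure: average change for the same students."""
    return mean(exit_score - entry_score for entry_score, exit_score in students)


for name, students in [("School A", school_a), ("School B", school_b)]:
    print(name, "average exit score:", average_exit_score(students),
          "average gain:", average_gain(students))
```

On average scores alone, School B looks far stronger (84 versus 40); on gains, School A's students advanced 15 points while School B's slipped by 1.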

_______________________________________

* Although this doesn’t affect the point about the distinction between scores and rates, it’s fair to argue that scale scores also reflect value judgments and interpretations, as the process by which they are calculated is laden with assumptions – e.g., about the comparability of content on different tests.

** Average scores, of course, also have their strengths and weaknesses. Like all summary statistics, they hide a lot of the variation. And, unlike rates, they don’t provide much indication as to whether the score is “high” or “low” by some absolute standard (thus making them very difficult to interpret), and they are usually not comparable between grades. But they are a better measure of the performance of the “typical student,” and as such are critical for a more complete portrayal of testing results, especially viewed over time.


Filed under testing